This document describes in detail how to use the applications of Hyper Estraier. If you have not yet read the introduction document, please read it first.
Hyper Estraier is a full-text search system that uses an index database. Before searching, you must prepare an index in which the target documents are registered. Hyper Estraier provides the administration command `estcmd' and the CGI script `estseek.cgi' for search. The former administers the index from the command line. The latter searches the index for documents from a web browser.
estcmd can handle various file formats and offers various operations for administering an index. How to use it is described in this document.
Hyper Estraier supports various search methods, such as combining several search phrases and searching by document attributes. Moreover, the presentation can be customized through the configuration of estseek.cgi. How to do so is described in this document.
Besides the body text, attributes such as the title and the modification date can be added to documents handled by Hyper Estraier. Attributes serve various purposes, such as attribute search and determining which documents need updating.
Every attribute has a name. Although names can be chosen arbitrarily, some names are reserved for system attributes. Names of system attributes begin with "@". There are the following system attributes.
Attributes other than system attributes are called user-defined attributes. They can be defined in a document draft, which is described later. Meta attributes in HTML and MIME headers are also treated as user-defined attributes. No attribute name should begin with "%".
There are two data types for attributes: string and number. Data of the string type are arbitrary strings, and support such operations as full matching, forward matching, backward matching, and partial matching. Data of the number type are numbers or date information. A string treated as the number type is converted into a number according to the following formats. If the format is for a date, the value is computed relative to the UNIX epoch (1 Jan 1970).
The data type is not determined at registration. It is determined at search time. The length of an attribute value is not limited.
Attributes and the body text of a document should be expressed in UTF-8. Text in another encoding should be converted into UTF-8. Note that estcmd detects the encoding automatically if it is not explicitly specified.
estcmd defines a URI attribute beginning with "file://" for each document. However, if a document defines its own URI, that URI takes precedence. The URI on the local file system is defined as an attribute named "_lpath". The absolute path on the local file system is defined as an attribute named "_lreal". The file name, normalized to UTF-8, is defined as an attribute named "_lfile". The value of each attribute is normalized to UTF-8. Attributes whose names begin with "_" are hidden in the detail display of estseek.cgi.
estcmd handles four file formats. This section describes how the four are processed.
A plain-text document consists of strings with no structure. By default, files whose names end with ".txt", ".text", or ".asc" are treated as plain-text.
An HTML document is hypertext used on the Web. By default, files whose names end with ".html", ".htm", ".xhtml", or ".xht" are treated as HTML.
MIME is used for e-mail communication based on RFC822 and related standards. By default, files whose names end with ".eml", ".mime", ".mht", or ".mhtml" are treated as MIME.
If the content type of a part of a multipart message is "text/plain", "text/html", or "message/rfc822", that content is treated as part of the body text, so that web archives are supported.
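The default suffix rules for the four formats can be summarized in a small lookup table. The following is only an illustrative sketch, not how estcmd is implemented: the function name `guess_format` is hypothetical, the ".est" entry refers to the document draft format described next, and the exact, case-sensitive suffix comparison is a simplifying assumption.

```python
from pathlib import Path

# Default suffix-to-format mapping as listed in this section.
DEFAULT_FORMATS = {
    ".txt": "plain-text", ".text": "plain-text", ".asc": "plain-text",
    ".html": "HTML", ".htm": "HTML", ".xhtml": "HTML", ".xht": "HTML",
    ".eml": "MIME", ".mime": "MIME", ".mht": "MIME", ".mhtml": "MIME",
    ".est": "document draft",
}


def guess_format(filename):
    """Return the format name for a file name, or None if the suffix is not listed."""
    # Assumption: the suffix is compared exactly as written (case-sensitive).
    return DEFAULT_FORMATS.get(Path(filename).suffix)
```

For example, `guess_format("archive.mht")` yields "MIME", while a name such as "report.pdf" yields None because it needs an outer filter.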
Document draft is an original format of Hyper Estraier. Using document draft as an intermediate format makes it possible to handle various formats in a unified way. By default, files whose names end with ".est" are treated as document draft.
Though the format of document draft is similar to RFC822, the details differ. The delimiter for headers is not ":" but "=". Moreover, no space character is needed after "=". The following is example data for handling a MIDI document.
@uri=http://www.music-estraier.com/mididb/t/tw/twinkle.kar
@title=Twinkle Twinkle Little Star
@author=Jane Taylor
@cdate=2004-11-01T23:11:18+09:00
@mdate=2005-03-21T08:07:45+09:00
category=chorus,dance

Twinkle, twinkle, little star,
How I wonder what you are.
Up above the world so high,
Like a diamond in the sky.
Twinkle, twinkle, little star,
How I wonder what you are!
	Twinkle Twinkle Little Star
	Jane Taylor
The following specifications are required for document draft.
In the attribute section, lines which begin with "%" are regarded as control commands and are ignored.
Hidden text is the same as normal text except that it is not displayed in the snippet of the result. It is useful for making some attributes searchable by full-text search.
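To make the layout of a document draft concrete, the following sketch assembles one from attributes, body lines, and hidden lines. It is an illustration under stated assumptions, not part of Hyper Estraier: the function name `build_draft` is hypothetical, and the code assumes that a blank line ends the attribute section and that hidden text lines are marked with a leading tab.

```python
def build_draft(attrs, body_lines, hidden_lines=()):
    """Assemble document draft text from attributes and text lines."""
    lines = []
    for name, value in attrs.items():
        # Lines beginning with "%" are control commands, so reject such names.
        if name.startswith("%"):
            raise ValueError("attribute name must not begin with '%': " + name)
        # The delimiter is "=", with no space needed after it.
        lines.append(f"{name}={value}")
    lines.append("")  # assumption: a blank line ends the attribute section
    lines.extend(body_lines)
    # Assumption: hidden text lines begin with a tab character.
    lines.extend("\t" + text for text in hidden_lines)
    return "\n".join(lines) + "\n"
```

Passing the title and author again as hidden lines makes them searchable by full-text search without showing them in the snippet.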
Two kinds of search conditions are supported: one for full-text search and the other for attribute search. If both are specified at the same time, documents satisfying both are searched for. Moreover, the full-text search condition supports both a usual format and a simplified format.
The purpose of a full-text search expression is to search for documents that include the specified words. For example, to search for documents including the word "computer", specify "computer" in the search phrase as it is.
You can specify two or more words. For example, if you specify "United Nations", documents including "united" followed by "nations" are searched for. In the simplified form, specify the following.
"united nations"
The intersection operation is supported by the "AND" operator. For example, if you specify "internet AND security", documents including both "internet" and "security" are searched for. In the simplified form, specify the following.
internet security
The difference operation is supported by the "ANDNOT" operator. For example, if you specify "hacker ANDNOT cracker", documents including "hacker" but not including "cracker" are searched for. In the simplified form, specify the following.
hacker ! cracker
The union operation is supported by the "OR" operator. For example, if you specify "proxy OR firewall", documents including one or both of "proxy" and "firewall" are searched for. In the simplified form, specify the following.
proxy | firewall
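The correspondence between the simplified form and the usual form shown so far can be sketched as a small converter. This is only an illustration of the rules stated above (a quoted string is a phrase, adjacent terms imply "AND", "!" is "ANDNOT", "|" is "OR"); the function name `simplified_to_usual` is hypothetical and this is not the parser used by Hyper Estraier.

```python
import re


def simplified_to_usual(phrase):
    """Convert a simplified-form search phrase into the usual form."""
    # A token is either a double-quoted phrase or a run of non-space characters.
    tokens = re.findall(r'"[^"]*"|\S+', phrase)
    parts = []
    previous_was_term = False
    for token in tokens:
        if token == "|":
            parts.append("OR")
            previous_was_term = False
        elif token == "!":
            parts.append("ANDNOT")
            previous_was_term = False
        else:
            # Two adjacent terms are joined by an implicit AND.
            if previous_was_term:
                parts.append("AND")
            parts.append(token.strip('"'))
            previous_was_term = True
    return " ".join(parts)
```

With this sketch, `simplified_to_usual('hacker ! cracker')` gives "hacker ANDNOT cracker", matching the example above.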
Note that the priority of "OR" is higher than that of "AND" and "ANDNOT". For example, if you specify "F1 OR F-1 OR Formula One AND Champion OR Victory", documents including at least one of "f1", "f-1", and "formula one", and also including one or both of "champion" and "victory", are searched for. In the simplified form, specify the following.
F1 | F-1 | "Formula One" Champion | Victory
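Because "OR" binds more tightly than "AND" and "ANDNOT", a usual-form phrase can be read as a sequence of groups of alternatives. The sketch below makes that grouping explicit. It is an illustration only: the function name `group_phrase` is hypothetical, the first group is labeled "AND" by convention, and wild card markers are not handled.

```python
def group_phrase(phrase):
    """Split a usual-form phrase into (operator, alternatives) groups."""
    groups = [["AND", [[]]]]  # the first group is treated like an AND term
    for token in phrase.split():
        if token in ("AND", "ANDNOT"):
            # A lower-priority operator starts a new group.
            groups.append([token, [[]]])
        elif token == "OR":
            # OR adds another alternative inside the current group.
            groups[-1][1].append([])
        else:
            # Search words are case-insensitive; operators are matched exactly.
            groups[-1][1][-1].append(token.lower())
    return [(op, [" ".join(words) for words in alts]) for op, alts in groups]
```

For the example above, the result is one group holding "f1", "f-1", and "formula one", followed by an AND group holding "champion" and "victory".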
Search words are case-insensitive. However, operators are case-sensitive. If you want to search for documents including "AND", specify "and" instead.
Wild cards are also supported. They can be used for forward-match and backward-match search. For example, "[BW] euro" matches words that begin with "euro", and "[EW] sphere" matches words that end with "sphere". In the simplified form, "euro*" and "*sphere" are used instead.
The purpose of an attribute search expression is to search for documents whose attributes satisfy the specified expression. An attribute search expression is composed of an attribute name, an operator, and a value, separated by space characters. For example, if you specify "@title STRINC IMPORTANT", documents whose title includes "IMPORTANT" are searched for. The following operators for attribute search are supported.
If an operator is preceded by "!", its meaning is inverted. If an operator is preceded by "I", the case of the value is ignored.
You can specify the order of the result with an expression. An ordering expression is composed of an attribute name and an operator. For example, if you specify "@size NUMA", documents in the result are in ascending order of size. The following operators for ordering are supported.
By default, the result is in descending order of score. The score is calculated from the number of specified words in each document.
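The attribute condition and ordering expressions described above can be illustrated with a small in-memory model. This sketch is not the Hyper Estraier implementation: the function names are hypothetical, only the operators that appear in the examples ("STRINC" and "NUMA") are handled, the prefix order "!" then "I" is an assumption, and a missing attribute is assumed to behave as an empty string or as 0.

```python
def match_condition(attrs, expr):
    """Evaluate an expression such as '@title STRINC IMPORTANT' against attributes."""
    name, op, value = expr.split(None, 2)
    negate = op.startswith("!")  # a leading "!" inverts the meaning
    if negate:
        op = op[1:]
    ignore_case = op.startswith("I")  # a leading "I" ignores case
    if ignore_case:
        op = op[1:]
    if op != "STRINC":
        raise ValueError("operator not covered by this sketch: " + op)
    actual = attrs.get(name, "")  # assumption: missing attribute is empty
    if ignore_case:
        actual, value = actual.lower(), value.lower()
    return (value in actual) != negate


def order_result(docs, expr):
    """Order documents by an expression such as '@size NUMA' (numeric ascending)."""
    name, op = expr.split()
    if op != "NUMA":
        raise ValueError("operator not covered by this sketch: " + op)
    # The value is stored as a string and converted to a number at search time.
    return sorted(docs, key=lambda doc: float(doc.get(name, 0)))
```

Note that sizes "300", "20", and "1000" are ordered as 20, 300, 1000, which differs from the order a plain string comparison would give.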
This section describes the specification of estcmd. estcmd can perform not only indexing but also search.
estcmd is an aggregation of sub commands. The name of a sub command is specified by the first argument. The other arguments are parsed according to each sub command. The argument db specifies the path of an index.
All sub commands return 0 if the operation succeeds, and 1 otherwise. The sub commands put, out, gather, purge, randput, wicked, and regression close the database and finish when they catch the signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), 13 (SIGPIPE), or 15 (SIGTERM).
The encoding name specified by the -ic option should be a name registered with the IETF, such as UTF-8 or ISO-8859-1. The language name specified by the -il option should be one of "en" (English), "ja" (Japanese), "zh" (Chinese), or "ko" (Korean).
The outer command specified by the -fx option of gather receives the path of the target document as the first argument and the path for output as the second argument. The original path of the target document is given as the value of the environment variable `ESTORIGFILE'.
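The calling convention for an outer command can be sketched as a small filter. This is a hypothetical example, not one of the filters shipped with Hyper Estraier: the function names are illustrative, and it assumes the filter is registered so that its output is interpreted as HTML, as the "H@" prefix in the gather examples of this document suggests.

```python
import html
import os


def convert(src_path, dst_path):
    """Wrap the text of src_path in minimal HTML and write it to dst_path."""
    with open(src_path, encoding="utf-8", errors="replace") as src:
        text = src.read()
    # ESTORIGFILE holds the original path of the target document.
    original = os.environ.get("ESTORIGFILE", src_path)
    page = (
        "<html><head><title>{}</title></head>"
        "<body><pre>{}</pre></body></html>\n"
    ).format(html.escape(original), html.escape(text))
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(page)


def main(argv):
    """argv[1] is the path of the target document, argv[2] the path for output."""
    if len(argv) != 3:
        return 1
    convert(argv[1], argv[2])
    return 0
```

A real filter script would finish by calling `sys.exit(main(sys.argv))` so that the exit status reflects success or failure.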
Note that similarity search is very slow by default. To improve its performance, running "estcmd extkeys" beforehand is strongly recommended.
The following registers mail files in MH format.
find /home/mikio/Mail -type f | egrep 'inbox/(business|friends)/[0-9]+$' | estcmd gather -cl -fm -cm casket -
The following registers MS-Office files. estfxmsotohtml requires wvWare and xlhtml.
PATH=$PATH:/usr/local/share/hyperestraier/filter ; export PATH
estcmd gather -cl -fx ".doc,.xls,.ppt" "H@estfxmsotohtml" -fz -sd -cm casket .
The following registers PDF files. estfxpdftohtml requires pdftotext.
PATH=$PATH:/usr/local/share/hyperestraier/filter ; export PATH
estcmd gather -cl -fx ".pdf" "H@estfxpdftohtml" -fz -sd -cm casket .
The following registers cache files of WWWOFFLE, a proxy server. estwolefind requires WWWOFFLE.
estwolefind /var/spool/wwwoffle | estcmd gather -cl -fm -bc -px @uri -px _lfile -sd -cm casket -
The following outputs the search result as XML.
estcmd search -vx -max 8 casket 'socket AND shutdown'
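When the search command is driven from a program, the exit status rule described earlier (0 on success, 1 otherwise) can be checked explicitly. The following is a minimal sketch with hypothetical function names. It uses only the options shown in the example above and assumes that estcmd is on the PATH and writes the XML result to standard output.

```python
import subprocess


def build_search_command(db, phrase, max_hits=8):
    """Build the argument list for 'estcmd search' with XML output."""
    return ["estcmd", "search", "-vx", "-max", str(max_hits), db, phrase]


def run_search(db, phrase, max_hits=8):
    """Run the search and return its output, raising on a non-zero status."""
    proc = subprocess.run(
        build_search_command(db, phrase, max_hits),
        capture_output=True,
        text=True,
    )
    # Every sub command returns 0 on success and 1 otherwise.
    if proc.returncode != 0:
        raise RuntimeError("estcmd search failed: " + proc.stderr.strip())
    return proc.stdout  # assumption: the XML is written to standard output
```

Passing the phrase as a single list element avoids shell quoting issues with operators such as "AND".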
This section describes the specification of estseek.cgi. The main subject is how to write the configuration files.
estseek.cgi needs three configuration files: the prime configuration file, the template file, and the top page file. Their default names are `estseek.conf', `estseek.tmpl', and `estseek.top'.
The name of the prime configuration file is determined by changing the suffix of the CGI script to ".conf". If you change the name of `estseek.cgi' to `estsearch.cgi', `estsearch.conf' is read. The names of the template file and the top page file are specified in the prime configuration file. So, you can install several sets of search scripts in one directory.
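The naming rule for the prime configuration file can be expressed in a few lines. This is only a sketch of the rule stated above; the function name `conf_name` is hypothetical.

```python
import os.path


def conf_name(script_name):
    """Return the prime configuration file name for a CGI script name."""
    # Replace the suffix of the script (for example ".cgi") with ".conf".
    root, _suffix = os.path.splitext(script_name)
    return root + ".conf"
```

For instance, a script renamed to "estsearch.cgi" reads "estsearch.conf", so each renamed copy in the same directory can have its own configuration.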
Because estseek.cgi is installed as `/usr/local/libexec/estseek.cgi', copy it to a directory for CGI scripts. Likewise, because sample configuration files are installed in `/usr/local/share/hyperestraier/', copy and modify them.
The prime configuration file is composed of lines, each containing the name of a variable and its value separated by ":". By default, the configuration is as follows.
indexname: casket
tmplfile: estseek.tmpl
topfile: estseek.top
logfile:
lprefix: file:///home/mikio/public_html/
gprefix: http://localhost/
gsuffix:
dirindex: index.html
replace: //localhost/{{!}}//127.0.0.1/
replace: //127.0.0.1:80/{{!}}//127.0.0.1/
showlreal: false
perpage: 10,20,30,40,50,100
attrselect: false
showscore: false
extattr: author|Author
extattr: from|From
extattr: to|To
extattr: cc|Cc
extattr: date|Date
snipwwidth: 480
sniphwidth: 96
snipawidth: 96
condgstep: 2
dotfidf: true
scancheck: false
smplphrase: true
candetail: true
smlrvnum: 0
spcache:
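The line format of the prime configuration file is simple enough to read with a few lines of code. The following is a minimal sketch, not the parser used by estseek.cgi: the function name `parse_conf` is hypothetical, and keeping every value in a list is a design choice made here so that variables given more than once (such as "replace" and "extattr") are preserved.

```python
def parse_conf(text):
    """Parse 'name: value' lines into a dict mapping each name to a list of values."""
    conf = {}
    for line in text.splitlines():
        # Split at the first ":" only, because values such as URIs contain ":".
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a delimiter
        conf.setdefault(name.strip(), []).append(value.strip())
    return conf
```

With the default configuration, `parse_conf(text)["extattr"]` would hold five entries, and an empty variable such as "logfile" maps to a single empty string.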
The meaning of each variable is as follows.
replace: the two parts of the value are separated by "{{!}}". This can be specified more than once.
Variables that take a boolean value: the value is "true" or "false".
extattr: the two parts of the value are separated by "|". This can be specified more than once.
condgstep: "1" checks every N-gram key, "2" checks every second key, "3" every third, and "4" every fourth.
The template file determines the appearance of the page. It is written in HTML, and its content is shown as it is. However, "<!--ESTFORM-->" is replaced by the form for entering search conditions, "<!--ESTRESULT-->" is replaced by the search result, and "<!--ESTINFO-->" is replaced by information about the index.
When a user accesses the CGI script for the first time, or if no search condition is input, the content of the top page file is displayed instead of the search result. By default, the usage of the CGI script is described there.
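How the template file and the top page file fit together can be sketched as plain string substitution. This is only an illustration of the behavior described above, not the code of estseek.cgi; the function name `render_page` and its parameters are hypothetical.

```python
def render_page(template, form, info, result=None, top_page=""):
    """Fill the three placeholders of a template with the given HTML fragments."""
    # When there is no search result to show, the top page content takes its place.
    shown = top_page if result is None else result
    return (
        template
        .replace("<!--ESTFORM-->", form)
        .replace("<!--ESTRESULT-->", shown)
        .replace("<!--ESTINFO-->", info)
    )
```

Calling it without a result models the first access, where the top page is displayed in place of the search result.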
If you want to put the search form in another page, write the following HTML.
<form method="get" action="estseek.cgi">
<div>
<input type="text" name="phrase" value="" size="32" />
<input type="submit" value="Search" />
<input type="hidden" name="enc" value="UTF-8" />
</div>
</form>
Change "estseek.cgi" to the URI of estseek.cgi. Change "UTF-8" to the encoding name of the page.