NAME
swish-e - web indexing and retrieval system
SYNOPSIS
swish-e -w words [-m maxresults] [-t tags] [-d delimiter] [-p properties] [-f file file ...] [-c config] [-M] [-l] [-D] [-V]
DESCRIPTION
swish-e is the Simple Web Indexing System for Humans - Enhanced. swish-e searches for words in a previously created index of HTML pages. It returns a ranked list of the file names whose contents match the words.
Please note that this documentation is not complete. The definitive source can only be found on the web: http://sunsite.berkeley.edu/SWISH-E/.
OPTIONS
- -w word word ...
- This performs a case-insensitive search using a number of keywords. If no index file to search is specified, swish-e will try to search a file called index.swish-e in the current directory. See below for the syntax and semantics of keywords.
- -t HBthec
- The -t option allows you to search for words that exist only in specific HTML tags. Each character in the string you specify in the argument to this option represents a different tag to search for the word in. H means all HEAD tags, B stands for BODY tags, t is all TITLE tags, h is H1 to H6 (header) tags, e is emphasized tags (this may be B, I, EM, or STRONG), and c is HTML comment tags.
- -f indexfile1 indexfile2 ...
- If you are indexing, this specifies the file to save the generated index in, and you can only specify one file. If you are searching, this specifies the index files (one or more) to search from. The default index file is index.swish-e in the current directory.
- -p property1 property2 ...
- NOTE: it is necessary to have indexed with the proper PropertyNames directive in the user config file in order to use this option.
- -d character
- The delimiter string can be anything you like, although the special string ``dq'' will be interpreted to mean a single double quote character. To parse a line of the output in Perl (using the ``dq'' option) use:
($rank, $filename, $title, $filesize) = split(/\"/, $_);
- -m number
- While searching, this specifies the maximum number of results to return. The default is 40. If no numerical value is given, the default is assumed. If the value is 0 or the string all, there will be no limit to the number of results. The configuration file value overrides this value.
- -c config-file
- Start indexing using the parameters from config-file.
- -S [ fs | http ]
- Specify which indexing system to use: fs to index the file system, http to index web sites using a web crawler.
- -l
- Follow symbolic links when indexing.
- -M file file ...
- Merge index files.
- -D index
- Decode an index file.
- -V
- Print the current version.
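The Perl line given for the -d option splits one line of output into rank, file name, title, and file size. A rough Python equivalent is sketched below, assuming the same four-field layout and the ``dq'' delimiter. The function name, the sample line, and the assumption that rank and file size are integers are illustrative only and are not taken from swish-e itself.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    rank: int
    filename: str
    title: str
    filesize: int


def parse_result_line(line, delimiter='"'):
    """Split one result line into its four fields.

    Assumes the field order used by the Perl example:
    rank, filename, title, filesize.
    """
    # maxsplit=3 yields at most four fields; a line with fewer fields
    # raises ValueError when unpacking.
    rank, filename, title, filesize = line.rstrip("\n").split(delimiter, 3)
    return Hit(int(rank), filename, title, int(filesize))


# Hypothetical output line, as it might look with -d dq.
sample = '1000"/docs/a.html"Sample Title"2048'
hit = parse_result_line(sample)
print(hit.filename, hit.rank)
```

A title that itself contains the delimiter would break this split, which is one reason to pick a delimiter that cannot occur in your titles.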
SEARCHING
Boolean Operators
You can use the boolean operators and, or, and not in searching. Without these operators, swish-e will assume you're anding the words together. The operators are case sensitive -- use lowercase ONLY.
Evaluation takes place from left to right only, although you can use parentheses to force the order of evaluation.
% swish-e -w "smilla or snow" -f myIndex
retrieves files containing either the words ``smilla'' or ``snow''.
% swish-e -w "smilla and snow not sense" -f myIndex
retrieves first the files that contain both the words ``smilla'' and ``snow''; then among those the ones that do not contain the word ``sense''.
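The left-to-right reading of a query without parentheses can be illustrated with a toy model. This is only a sketch of the semantics described above (implicit and, lowercase operators, not treated as "and not" of the following word). It is not swish-e's parser, and the function name and the set-of-words representation of a file are assumptions made for illustration.

```python
def matches(query, words):
    """Evaluate a flat query (no parentheses) left to right against
    the set of words found in one file."""
    result = None      # running result; None until the first word is seen
    op = "and"         # words are anded together unless "or" is given
    negate = False     # set when the previous token was "not"
    for token in query.split():
        if token in ("and", "or"):   # operators are lowercase only
            op = token
        elif token == "not":
            negate = True
        else:
            hit = token in words
            if negate:
                hit = not hit
                negate = False
            if result is None:
                result = hit
            elif op == "or":
                result = result or hit
            else:
                result = result and hit
            op = "and"               # fall back to the implicit and
    return bool(result)


print(matches("smilla and snow not sense", {"smilla", "snow"}))
```

Because operators are matched in lowercase only, an uppercase AND in this model is treated as an ordinary search word, which mirrors the warning above.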
Truncation
The only wildcard available at this time is (*); however, it can only be used at the end of a word. Usage at the beginning or in the middle of the word will yield no results.
% swish-e -w "librarian" -f myIndex
this query only retrieves files which contain the given word.
On the other hand:
% swish-e -w "librarian*" -f myIndex
retrieves ``librarians'', ``librarianship'', etc. along with ``librarian''.
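The truncation rule can be modelled as a simple prefix test. The sketch below is an illustration of the behaviour described above, not swish-e's own code. The function name is hypothetical, and lowercasing both sides is an assumption based on -w being case-insensitive.

```python
def term_matches(term, word):
    """Return True if an indexed word satisfies a search term that may
    end in the (*) wildcard."""
    term, word = term.lower(), word.lower()
    # A wildcard anywhere but the end (including a lone "*") yields no results.
    if term == "*" or "*" in term[:-1]:
        return False
    if term.endswith("*"):
        return word.startswith(term[:-1])
    return term == word


for candidate in ("librarian", "librarians", "librarianship", "library"):
    print(candidate, term_matches("librarian*", candidate))
```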
Meta Tags
The equal sign indicates the presence of a metaName and the search results in all the files where the META tag with NAME=``metaName'' has CONTENT=``word'' (or where ``word'' is contained in the area marked by the <!--META START... --> and <!--META END... --> tags).
It is not necessary to have spaces on either side of the '='; consequently the following are equivalent:
% swish-e -w "metaName = word" -f
% swish-e -w "metaName=word" -f
% swish-e -w "metaName= word" -f
To search on a word that contains a '=', have a '/' precede the '=':
% swish-e -w "test/=3 = x/=4 or y/=5" -f <index.file>
This query returns the files where the word ``x=4'' is associated with the metaName ``test=3'' or that contain the word ``y=5'' not associated with any metaName.
Queries can also be constructed using any of the usual search features; moreover, metaName and plain search can be mixed in a single query.
% swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)" -f yyy
This query will retrieve all the files in which ``metaName1'' is associated either with ``a1'' or ``a4'' and that do not contain the words ``a3'' and ``a7'', where ``a3'' and ``a7'' are not associated with any meta name.
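When queries are assembled by a script, the '/=' escaping is easy to forget. The helper below is a minimal sketch of building a metaName query string following the rule above. Both function names are hypothetical, and the code only builds the text passed to -w; it does not call swish-e.

```python
def escape_equals(text):
    """Precede every '=' with '/' so it is read as part of the word."""
    return text.replace("=", "/=")


def meta_query(meta_name, word):
    """Build a 'metaName = word' query fragment with escaped parts."""
    return f"{escape_equals(meta_name)} = {escape_equals(word)}"


query = meta_query("test=3", "x=4") + " or " + escape_equals("y=5")
print(query)
```

The printed string is the same query used in the example above.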
Order of Evaluation
Expressions are always evaluated left to right:
% swish-e -w "juliet not ophelia and pac" -f myIndex
retrieves files which contain ``juliet'' and ``pac'' but not ``ophelia''.
However, it is always possible to force the order of evaluation by using parentheses. For example:
% swish-e -w "juliet not (ophelia and pac)" -f myIndex
retrieves files with ``juliet'' and containing neither ``ophelia'' nor ``pac''.
Context
At times you might not want to search for a word in every part of your files since you know that the word(s) are present in a particular tag. The ability to search according to context greatly increases the chances that your hits will be relevant, and swish-e provides a mechanism to do just that.
The -t option in the search command line allows you to search for words that exist only in specific HTML tags. Each character in the string you specify in the argument to this option represents a different tag in which the word is searched; that is, you can use any combination of the following characters:
- H
- means all HEAD tags
- B
- stands for BODY tags
- t
- is all TITLE tags
- h
- is H1 to H6 (header) tags
- e
- is emphasized tags (this may be B, I, EM, or STRONG)
- c
- is HTML comment tags (<!-- ... -->)
Examples
swish-e -w "apples oranges" -t t -f myIndex
This search will look for files with these two words in their titles only.
swish-e -w "keywords draft release" -t c -f myIndex
This search will look for files with these words in comments only.
swish-e -w "world wide web" -t the -f myIndex
This search will look for words in titles, headers, and emphasized tags.
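A script that builds the search command may find the one-letter codes hard to read. The sketch below maps descriptive names to the -t codes listed above and assembles an argument list from the -w, -t, and -f options. The descriptive names and both function names are assumptions for illustration, and the command is only constructed, not run.

```python
# Descriptive names (hypothetical) mapped to the documented -t codes.
TAG_CODES = {
    "head": "H",
    "body": "B",
    "title": "t",
    "header": "h",
    "emphasized": "e",
    "comment": "c",
}


def tag_string(contexts):
    """Join the codes for the requested contexts into a -t argument."""
    return "".join(TAG_CODES[name] for name in contexts)


def search_argv(words, contexts, index_file):
    """Build the argument list for a context-restricted search."""
    return ["swish-e", "-w", words, "-t", tag_string(contexts), "-f", index_file]


argv = search_argv("world wide web", ["title", "header", "emphasized"], "myIndex")
print(argv)
```

Passing the list to something like subprocess.run avoids shell quoting problems with multi-word queries.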
CONFIG FILE
Some Basic Variables in the User Configuration File. If not otherwise specified, all directives are used by both the FILESYSTEM and HTTP methods.
- IndexDir directory
- The IndexDir variable tells swish-e what directories and files to index. Each specified directory will be indexed recursively. You can use more than one of these directives - here are some examples:
IndexDir /usr/local/www /src/code.html
IndexDir /users/tony/public_html/home.html /web
For the HTTP method, specify the URLs from which the spidering needs to start.
- IndexFile indexfile
- The IndexFile variable tells swish-e what to save the indexed results as. Indexes generated by swish-e should have a suffix of .swish-e.
- IndexOnly .suffix1 .suffix2 .suffix3 ...
- Only files with these suffixes will be indexed. If you omit this variable, swish-e will index every file it comes across. Suffix checking is not case sensitive. This directive is only available for the FILESYSTEM method.
- PropertyNames author
- List of names that can be retrieved with the -p option. Index size increases according to the formula in the manual. Comment out the directive if there are no PropertyNames.
- UseStemming no
- Set this directive to yes if you would like stemming.
- IndexReport 3
- This variable can have the values 0 to 3. If you specify 3, swish-e will tell you what's going on while it's indexing, printing out directory and file names, number of words indexed, and so on, as well as give information about other operations. The value 0 will make swish-e completely silent.
- FollowSymLinks yes|no
- Normally swish-e ignores symbolic links to files when indexing. If you want it to follow such links, define this value as yes, else define it as no.
- NoContents .suffix1 .suffix2 .suffix3 ...
- This variable lets you control what files will have their contents indexed. If a file with a suffix in this list is indexed, only its file name (and not any words in the file) will be indexed. This is useful because normally swish-e will try to index the contents of every file, even files without words (such as images or movies). Suffix checking is case-insensitive.
- IgnoreWords word1 word2 ...
- Here you can specify words to ignore when searching. Usually these words (called stopwords) are words that occur too many times in your data to make indexing them worthwhile. If you specify a word as SwishDefault, it will be replaced with swish-e's default list - a few hundred very common English words.
- IgnoreLimit number1 number2
- After indexing, swish-e can automatically tell which words are the most common and omit them from the index according to these parameters. Here are some examples:
IgnoreLimit 50 50
Swish will ignore all words that occur in over 50% of the files and that also occur in over 50 different files.
IgnoreLimit 80 256
Swish will ignore all words that occur in over 80% of the files and that also occur in over 256 different files.
Using IgnoreLimit and IgnoreWords can help trim the size of your index files considerably - experiment with parameters to see what works best at your site. You can also use IgnoreLimit to limit the CPU resources that searches take.
- IndexName value
- IndexDescription value
- IndexPointer value
- IndexAdmin value
- These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a more full description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be contained in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.
- MetaNames name1 name2
- These variables specify the meta names used in the .html files. Do not comment out or erase this line. MetaNames need to be one word with no quotes.
- WordCharacters abcdefghijklmnopqrstuvwxyz\&#;0123456789.@|,-'"[](~!@$%^{}_+?
- WordCharacters is a string of characters which swish-e permits to be in words. Any strings which do not include these characters will not be indexed. You can choose from any character in the following string:
abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
Note that if you omit 0123456789&#; you will not be able to index HTML entities. DO NOT use the asterisk (*), less-than and greater-than signs (<), (>), or colon (:). Including any of these four characters may cause funny things to happen. NOTE: Do not escape \ nor ", and they cannot be the first letter in the string. Commenting out the line will give the defaults. If not set, it defaults to the value in config.h.
- BeginCharacters string
- Of the characters that you decide can go into words, this is a list of characters that words can begin with. It should be a subset of (or equal to) WordCharacters. Same rule of syntax as for WordCharacters. If not set, it defaults to the value in config.h.
- EndCharacters string
- Of the characters that you decide can go into words, this is a list of characters that words can end with. It should be a subset of (or equal to) WordCharacters. Same rule of syntax as for WordCharacters. If not set, it defaults to the value in config.h.
- IgnoreLastChar string
- Array that contains the characters that, if considered valid in the middle of a word, need to be disregarded when at the end. It is important to also set the given characters in the ENDCHARS array, otherwise the word will not be indexed because it is considered invalid. Commenting out the line will give the defaults. NOTE: if " is the first character in the string it needs to be escaped with \. Do not escape otherwise.
- IgnoreFirstChar string
- Array that contains the characters that, if considered valid in the middle of a word, need to be disregarded when at the beginning. This was to solve the problem of parentheses when there is no space between ( and the beginning of the word. Remember to add the characters to the BEGINCHARS list also. Commenting out the line will give the defaults. NOTE: if a double quote is the first character in the string it needs to be escaped with \. Do not escape otherwise.
- MaxDepth number
- (default 5) This defines how many links the spider should follow before stopping. A value of 0 configures the spider to traverse all links. This directive is only available for the HTTP method.
- Delay seconds
- The number of seconds (default 60) to wait between issuing requests to a server. This directive is only available for the HTTP method.
- TmpDir dir
- The location (default /var/tmp) of a writeable temp directory on your system. The HTTP access method tells the Perl helper to place its files there. This directive is only available for the HTTP method.
- SpiderDirectory dir
- The location (default ./) of the Perl helper script. Remember, if you use a relative directory, it is relative to your directory when you run swish-e, not to the directory that swish-e is in. This directive is only available for the HTTP method.
- EquivalentServer hostname hostname ...
- (default nothing) This allows you to deal with servers that respond to multiple DNS names. Each line should have a list of all the method/names that should be considered equivalent. If you have multiple directives, each one defines its own set of equivalent servers. This directive is only available for the HTTP method.
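The IgnoreLimit rule from the list above combines two thresholds: a word is dropped only when it exceeds both the percentage and the file count. The following sketch shows that test in isolation. The function and parameter names are hypothetical, and how swish-e counts files internally is not covered here.

```python
def is_auto_stopword(files_with_word, total_files, percent, min_files):
    """Return True if a word would be ignored under
    'IgnoreLimit percent min_files'."""
    if total_files <= 0:
        return False
    share = 100.0 * files_with_word / total_files
    # Both conditions must hold: over the percentage AND over the file count.
    return share > percent and files_with_word > min_files


# With "IgnoreLimit 80 256": a word found in 300 of 350 files.
print(is_auto_stopword(300, 350, 80, 256))
```

The second threshold keeps small collections from losing words that are frequent only because there are few files.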
EXAMPLE CONFIG FILE
# DIRECTIVES COMMON to HTTP and FILESYSTEM METHODS
###################################################
# WINDOWS USERS NOTE: Specify ALL files and directory paths in
# the config file using the forward slash, as in
# /thisdirectory.
#
###################################################
IndexDir http://www.lib.berkeley.edu/~ghill/spider.html
# For the FileSystem Method: This is a space-separated list of
# files and directories you want indexed. You can specify more
# than one of these directives.
#
# For the HTTP Method: Use the URLs from which you want the
# spidering to begin. NOTE: use html files rather than
# directories for this method.
IndexFile /home/ghill/swishRon/dir1/myindex1
# This is what the generated index file will be.
IndexName "Improvement index"
IndexDescription "This is an index to test bug fixes in swish."
IndexPointer "http://sunsite/~ghill/swish/index.html"
IndexAdmin "Giulia Hill, (ghill@library.berkeley.edu)"
# Extra information you can include in the index file.
MetaNames first author
# List of all the meta names used in the file to index, must be on
# one line. If there are no metanames DO NOT delete the line.
IndexReport 3
# This is how detailed you want reporting. You can specify
# numbers 0 to 3 - 0 is totally silent, 3 is the most verbose.
FollowSymLinks yes
# Put "yes" to follow symbolic links in indexing, else "no".
#UseStemming no
# Put yes to apply word stemming algorithm during indexing, else
# no. See the manual for info about stemming. Default is no.
#PropertyNames author
# List of meta tags names that can be retrieved with the -p
# option. Index size increases according to the formula in the
# manual. Comment out if no PropertyNames. Case insensitive.
IgnoreTotalWordCountWhenRanking yes
# Put yes to ignore the total number of words in the file
# when calculating ranking. Often better with merges and
# small files. Default is no.
ReplaceRules remove "ghill/"
ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"
# ReplaceRules replace "/ghill" "moreghillmore"
# ReplaceRules allow you to make changes to file pathnames before
# they're indexed. This directive uses C library regex.h regular
# expressions.
# NOTE: do not use replace <string> "" to remove a string, use
# remove <string> instead - you might get a core dump otherwise.
#MinWordLimit 5
# Set the minimum length of an indexable word. Every shorter word
# will not be indexed. Commenting out the line will give the
# defaults.
#MaxWordLimit 5
# Set the maximum length of an indexable word. Every longer word
# will not be indexed. Commenting out the line will give the
# defaults.
#WordCharacters abcdefghijklmnopqrstuvwxyz\&#;0123456789.@|,-'"[](~!@$%^{}_+?
# WORDCHARS is a string of characters which SWISH permits to be in
# words. Any strings which do not include these characters will
# not be indexed. You can choose from any character in the
# following string:
#
# abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
#
# Note that if you omit "0123456789&#;" you will not be able to
# index HTML entities. DO NOT use the asterisk (*), less-than
# and greater-than signs (<), (>), or colon (:).
#
# Including any of these four characters may cause funny things to
# happen. NOTE: Do not escape \ nor " and they cannot be the
# first letter in the string. Commenting out the line will give the
# defaults.
#BeginCharacters m"
# Of the characters that you decide can go into words, this is a
# list of characters that words can begin with. It should be a
# subset of (or equal to) WordCharacters. Same rule of syntax as
# for WordCharacters.
#EndCharacters \"\
# Of the characters that you decide can go into words, this is a
# list of characters that words can end with. It should be a
# subset of (or equal to) WordCharacters. Same rule of syntax as
# for WordCharacters.
IgnoreLastChar
# Array that contains the characters that, if considered valid in the
# middle of a word, need to be disregarded when at the end. It is
# important to also set the given characters in the ENDCHARS array,
# otherwise the word will not be indexed because it is considered
# invalid. Commenting out the line will give the defaults. NOTE:
# if " is the first character in the string it needs to be escaped
# with \. Do not escape otherwise.
IgnoreFirstChar
# Array that contains the characters that, if considered valid in the
# middle of a word, need to be disregarded when at the beginning.
# This was to solve the problem of parentheses when there is no
# space between ( and the beginning of the word. Remember to add
# the characters to the BEGINCHARS list also. Commenting out the line
# will give the defaults. NOTE: if " is the first character in the
# string it needs to be escaped with \. Do not escape otherwise.
IgnoreLimit 50 1000
# This automatically omits words that appear too often in the
# files (these words are called stopwords). Specify a whole
# percentage and a number, such as "80 256". This omits words
# that occur in over 80% of the files and appear in over 256
# files. Comment out to turn off auto-stopwording.
#IgnoreWords SwishDefault
# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated
# by spaces and may span multiple directives.
IndexComments 0
# This option allows the user to decide whether to index the comments
# in the files. Default is 1. Set to 0 if comment indexing is not
# required.
##################################
# DIRECTIVES for FILESYSTEMS ONLY
# Comment out if using HTTP
###################################
#IndexOnly .html .q
# Only files with these suffixes will be indexed.
#NoContents .gif .xbm .au .mov .mpg .pdf .ps
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.
#FileRules pathname contains .*dir1
#FileRules filename contains # % ~ .bak .orig .old old.
#FileRules title contains construction example pointers
#FileRules directory contains .htaccess
#FileRules filename is index
# Files matching the above criteria will *not* be indexed.
# The pattern matching uses the C library regex.h
################################
# DIRECTIVES for HTTP METHOD ONLY
# Comment out if using FILESYSTEM
##################################
MaxDepth 5
# (default 5) This defines how many links the spider should follow
# before stopping. A value of 0 configures the spider to traverse
# all links.
Delay 60
# (default 60) The number of seconds to wait between issuing
# requests to a server.
TmpDir /home/ghill/swishRon/
# (default /var/tmp) The location of a writeable temp directory on
# your system. The HTTP access method tells the Perl helper to
# place its files there.
SpiderDirectory /home/ghill/swishRon/src/
# (default ./) The location of the Perl helper script. Remember,
# if you use a relative directory, it is relative to your
# directory when you run SWISH-E, not to the directory that
# SWISH-E is in.
EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
# (default nothing) This allows you to deal with servers that
# respond to multiple DNS names. Each line should have a list of
# all the method/names that should be considered equivalent. If
# you have multiple directives, each one defines its own set of
# equivalent servers.
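The interplay of WordCharacters, BeginCharacters, and EndCharacters described in the comments above can be pictured as three checks on a candidate word. The sketch below is one simplified reading of those rules, not swish-e's tokenizer. The function name, the lowercasing step, and the sample character sets are assumptions chosen for illustration.

```python
import string


def is_indexable(word, word_chars, begin_chars, end_chars):
    """Check a candidate word against the three character lists."""
    word = word.lower()  # assumption: words are compared in lowercase
    if not word:
        return False
    # Every character must be a permitted word character.
    if any(ch not in word_chars for ch in word):
        return False
    # The first and last characters have their own, narrower lists.
    return word[0] in begin_chars and word[-1] in end_chars


# Hypothetical settings: letters, digits, and hyphen inside words,
# but only letters and digits at either end.
ALNUM = string.ascii_lowercase + string.digits
WORD_CHARS = ALNUM + "-"

for candidate in ("e-mail", "-mail", "mail-", "a*b"):
    print(candidate, is_indexable(candidate, WORD_CHARS, ALNUM, ALNUM))
```

Under these sample settings a trailing hyphen makes a word invalid, which is the situation IgnoreLastChar is described as addressing.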
AUTHOR
SWISH was created by Kevin Hughes. In Fall 1996, The Library of UC Berkeley received permission from Kevin Hughes to implement bug fixes and enhancements to the original binary. The result is SWISH-Enhanced or swish-e, brought to you by the swish-e Development Team.
SEE ALSO
The definitive (and more complete) documentation can be found here:
http://sunsite.berkeley.edu/SWISH-E/