Intranet Search Engine
Basic Requirements (Musts)
- Document Types - must search at least HTML, plain text and PDF documents
- Indexing directories more important than crawling - must be able to index a directory tree of files rather than using an HTTP crawler that just follows links
- Source Code - access to the source code, so we can modify the engine ourselves
- Speed - must be as fast as possible
- Filtering - ability to omit specified files from indexing, ideally via regular expressions
- Character set - must handle ISO-Latin-1, umlauts and spaces
- Operators - at least AND, OR
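
To make the indexing, filtering, character-set and operator requirements above more concrete, here is a minimal Python sketch. It is not taken from any of the products evaluated below; the function names (`iter_files`, `build_index`, `search_and`, `search_or`) and the exclusion patterns are hypothetical. It assumes plain-text input only, so HTML and PDF files would need a text-extraction step first.

```python
import os
import re
from collections import defaultdict

# Hypothetical exclusion patterns, matched against the path relative to the root.
EXCLUDE = [re.compile(r"\.bak$"), re.compile(r"(^|/)CVS/")]

# \w is Unicode-aware for str patterns, so umlauts count as word characters.
WORD = re.compile(r"\w+")

def iter_files(root, exclude=EXCLUDE):
    """Yield every file under root unless its relative path matches an exclusion regex."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Use forward slashes so the patterns behave the same on every OS.
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            if any(rx.search(rel) for rx in exclude):
                continue
            yield path

def build_index(root, exclude=EXCLUDE):
    """Map each lowercased word to the set of files that contain it."""
    index = defaultdict(set)
    for path in iter_files(root, exclude):
        # ISO-8859-1 maps every byte to a character, so decoding cannot fail.
        with open(path, encoding="iso-8859-1") as fh:
            for word in WORD.findall(fh.read().lower()):
                index[word].add(path)
    return index

def search_and(index, *terms):
    """Files that contain all of the terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Files that contain at least one of the terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))
```

With an index built by `build_index("/path/to/docs")`, a call such as `search_and(index, "3com", "adapter")` returns the set of files containing both words. Because the files are walked directly with `os.walk`, file names with spaces need no special handling.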
Nice to have
- Search Option - partial-word search
- Fuzzy searching - finds matches even when the query words are misspelled
- Output - shows excerpts of the documents found
- Licence - Open Source / GNU General Public License
- Score - relevance score; the formula should be modifiable
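
As a rough illustration of the nice-to-have features above, the following Python sketch shows one possible approach to partial-word search, fuzzy matching, excerpts and a pluggable score. All function names are hypothetical, and the default score (term count divided by document length) is only a placeholder assumption, not the formula of any product listed below.

```python
import difflib

def partial_matches(vocabulary, fragment):
    """Return the indexed words that contain the fragment (partial-word search)."""
    fragment = fragment.lower()
    return sorted(w for w in vocabulary if fragment in w)

def fuzzy_matches(vocabulary, word, cutoff=0.8):
    """Return indexed words similar to a possibly misspelled query word."""
    return difflib.get_close_matches(word.lower(), list(vocabulary), n=5, cutoff=cutoff)

def excerpt(text, term, width=30):
    """Return a window of up to `width` characters on each side of the first hit."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    return text[start:end]

def score(text, terms, weight=lambda count, length: count / length):
    """Relevance score; pass a different `weight` function to change the formula."""
    words = text.lower().split()
    if not words:
        return 0.0
    count = sum(words.count(t.lower()) for t in terms)
    return weight(count, len(words))
```

Here `fuzzy_matches` relies on the standard library's `difflib.get_close_matches`; a lower `cutoff` tolerates more spelling errors but also returns more false matches. Passing the formula in as the `weight` parameter is one simple way to keep the scoring modifiable without touching the rest of the code.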
Already known products
Product: Vestris
URL: www.vestris.com
Status: ---
Significant feature(s):
- [ - ] Indexes only via URL spidering; cannot index file trees
- [ - ] Commercial; source code not available

Product: Swish-e
URL: sunsite.berkeley.edu/SWISH-E
Status: Installed
Significant feature(s):
- [ + ] Handles file names with spaces correctly (the Perl script had to be modified, but the engine itself can do it)
- [ + ] Better matches (compare e.g. «3COM Adapter»)
- [ + ] Well documented, GNU General Public License

Product: UDMSearch
URL: mysearch.udm.net
Status: Installed
Significant feature(s):
- [ + ] Well documented, GNU General Public License
- [ + ] Database-driven

Product: ht://Dig
URL: www.htdig.org
Status: Installed
Significant feature(s):
- [ + ] Synonym base; expands known English words to their grammatical forms, e.g. contact to contacted, contacting, contacts
- [ - ] Sent a question asking whether there is any way to tell ht://Dig to index a directory tree of files rather than using the crawler to just follow links

Product: Glimpse / Webglimpse
URL: webglimpse.org
Status: Installed
Significant feature(s):
- [ - ] Confusingly organized documentation; costs $200 - $500