Updated 2007-06-06 06:08:27 by EMJ

KeyWord In Context Indexing

If you're looking for the keyword search page, which was "mini.net/tcl/kw", it's been moved to: http://wiki.tcl.tk/2

CmCc: Google does this, now you can too ... here's a start in pure Tcl [1].

My intention was to replace glimpse [2], a really good early KWIC indexer with very low overhead. Glimpse is now proprietary and I don't like the alternatives, so I wrote kwic.

The code has a simple command wrapper, but it's really a Snit class with some interfaces and a filesystem-persistent representation. I intend to write a tclhttpd interface to it, so we can provide search functionality.

Dependencies: Snit, Tcl 8.4 or later (uses dicts)
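
To make the shape of that concrete, here is a minimal sketch of the command-wrapper-over-Snit-class pattern; the type, instance, and method names are hypothetical, not the real kwic interface:

 package require snit

 # Hypothetical names throughout -- this shows the pattern, not kwic's API.
 snit::type kwicIndex {
     variable index    ;# dict: word -> list of file numbers

     constructor {} {
         set index [dict create]
     }

     # Record that each word in $words occurs in file number $fileno.
     method add {fileno words} {
         foreach w $words {
             dict lappend index $w $fileno
         }
     }

     # Return the file numbers in which $word occurs, or an empty list.
     method find {word} {
         if {[dict exists $index $word]} {
             return [dict get $index $word]
         }
         return {}
     }
 }

 # The simple command wrapper: one instance behind a plain proc.
 kwicIndex create theIndex
 proc kwic {method args} {
     eval [linsert $args 0 theIndex $method]    ;# Tcl 8.4-safe expansion
 }

 # e.g.:  kwic add 1 {pack your own index} ; kwic find index   => 1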

I'll write some docco RSN, promise.

AK I remember glimpse. I stopped using it when it started to consistently crash when applied to my mailspool. It got too big for it, I guess.

schlenk Nice. I wrote a grossly slow indexing package [3] for Tcl once, to use with tclhttpd. (Someone needed a full-text search system and I had an afternoon to code it...) A good working glimpse replacement (especially one that handles databases larger than 2 GB) would be cool.

AK Some semi-documentation about kwic file-format, etc., taken from the Tcler's chat:
 [11:02] Colin	I can't even remember writing it, and doubt I could again ... I wonder if I've had a stroke
 [11:02] Colin	Anyway ... since I copied glimpse's design, all good.  Let me refresh on details
 [11:04] Colin	Ok.  Packed indices are just like glimpse's.  So, word + file#.  Minimal overhead assuming the set of words is small relative to source file sizes.  Like, typical English is 10K words.
 [11:05] Colin	that's on disk.
 [11:05] Colin	It unpacks the disk files into arrays, again keys are words, contents are lists of file numbers.
 [11:06] Colin	Each file is given a unique number, see, its ordinal number in the ordered set of all files being indexed.
 [11:06] Colin	I think I stipulated a maximum of 64K files, hence 16 bits per occurrence.
 [11:07] Colin	so file overhead is strlen(word)+1 + 2*occurrence_count
 [11:07] Colin	Search, once the index is read, is constant time (hashed)
 [11:08] Colin	cost to unpack it is mostly linear I guess, traverse the index and turn it into a tcl array.
 [11:08] Colin	cost to build the index is proportional to the size indexed, I guess.  Uses hash/array to build index.
 [11:09] Colin	fundamentally, I think/hope:  linear in textbase size.
 [11:09] Colin	the performance of the thing should be linear and proportional to the size of the stuff searched.
 [11:10] Colin	the size should, for expected text sets (natural language or nearly) be a fraction of the stuff indexed.
 [11:10] Colin	Just like glimpse.
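
A rough sketch of packing and unpacking such an index with Tcl's binary command. The record layout here (1-byte word length, the word's bytes, a 2-byte occurrence count, then one 16-bit file number per occurrence) is an assumption chosen for easy parsing; Colin's overhead figure of strlen(word)+1 + 2*occurrence_count suggests the real format is slightly tighter: a 5-letter word occurring in 3 files costs 5+1+2+6 = 14 bytes here, versus 12 by his formula.

 # Assumed record layout (kwic's exact format may differ): 1-byte word
 # length, word bytes (ASCII assumed), 2-byte occurrence count, then
 # one 16-bit file number per occurrence.  Needs Tcl 8.5 for "u" flags.
 proc pack-index {indexVar} {
     upvar 1 $indexVar idx              ;# array: word -> file numbers
     set out ""
     foreach word [lsort [array names idx]] {
         set files $idx($word)
         set n [llength $files]
         append out [binary format ca*ss${n} \
             [string length $word] $word $n $files]
     }
     return $out
 }

 proc unpack-index {data indexVar} {
     upvar 1 $indexVar idx
     set pos 0
     set end [string length $data]
     while {$pos < $end} {
         binary scan $data @${pos}cu len         ;# word length
         incr pos 1
         binary scan $data @${pos}a${len} word   ;# the word itself
         incr pos $len
         binary scan $data @${pos}su n           ;# occurrence count
         incr pos 2
         binary scan $data @${pos}su${n} files   ;# the file numbers
         incr pos [expr {2 * $n}]
         set idx($word) $files
     }
 }

Unpacking is one linear pass over the data, and once the array is filled each lookup is a constant-time hash, matching the costs described above.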

 [11:12] Colin  The *very* clever thing about glimpse is that it realises you can load a file and search for a word very quickly, in RAM.
 [11:12] Colin  And so all they need to do is construct the set of all unique words in a set of files, and record in which file each word occurs.
 [11:13] Colin  They give each file a unique number, and so the recording of occurrence is just association of a set of file numbers with a word.
 [11:13] Colin  IIRC, they sort their word->files association, but we don't have to because I load it into an array.
 [11:14] Colin  So they have a second index, being word->word#, then word#->file#+ then file#->filename
 [11:15] aku   so the index is word->file, not word->exact location, and exact locations are found later by regular search. smaller index, possibly slower in search, but not too much, as you said.
 [11:15] Colin  Exactly!
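
In code, that two-stage search looks roughly like this; idx (word -> file numbers) and fname (file number -> path) are hypothetical names for structures the indexer would have built.

 # Stage 1: the index narrows a word to candidate file numbers.
 # Stage 2: an ordinary scan of each candidate finds the exact lines.
 proc search {word} {
     global idx fname    ;# idx(word) -> file numbers, fname(n) -> path
     set hits {}
     if {![info exists idx($word)]} {
         return $hits                        ;# word occurs nowhere
     }
     foreach n $idx($word) {
         set f [open $fname($n) r]
         set lineno 0
         while {[gets $f line] >= 0} {
             incr lineno
             if {[string match -nocase "*${word}*" $line]} {
                 lappend hits [list $fname($n) $lineno $line]
             }
         }
         close $f
     }
     return $hits
 }

The index never stores positions, so it stays small; the price is re-scanning the few files that actually contain the word, which is exactly the trade-off aku notes.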

Category Application