A little spellchecker

Another weekend fun project by Richard Suchenwirth, 2001-02-16 - This is a toy or demo thing that is not really fit for real-life use, but still I had some fun with it - it's amazing how little code it takes!. Give it a text widget and possibly a tag (my "hilite" is -bg orange, which stands out pretty clearly; curly red underline seems not to possible in Tk), and it will march through the text contents and highlight all those words that don't match its expectations (i.e. not in dictionary or not resolvable by rules):

 proc text:spell {w {tag hilite}} {
    set lineno 1
    $w tag remove $tag 1.0 end
    foreach line [split [$w get 1.0 end-1c] \n] {
        foreach {from to} [string:wordindexes $line] {
            set word [string range $line $from [expr $to-1]]
            if {![spell:ok $word]} {
                $w tag add $tag $lineno.$from $lineno.$to
                update idletasks
            }
        }
        incr lineno
    }
 }

Known bug: embedded images count as one character, but are not seen by the $text get command, so they shift the highlighting to the right.

The following helper produces a list of starting and ending indices of words (as defined by Tcl) in a string:

 proc string:wordindexes s {
    set i 0
    set res {}
    foreach c [split $s ""] {
        ##DKF## Use {$c ne " " && $i eq [string wordstart $s $i]}
        ##DKF## as test from Tcl 8.4 onwards!  It's faster and less buggy
        if {$c!=" " && $i==[string wordstart $s $i]} {
            lappend res $i [string wordend $s $i]
        }
        incr i
    }
    set res
 }

Here comes the word checker, returning 1 or 0 depending on whether it accepts one word (replace by your own if you have a better one - I will sometime in the future experiment with a graph parser):

 proc spell:ok s {
    global word ;# Faster to create local alias
    if {[string length $s]<2}    {return 1}
    if {![regexp {[A-Za-z]} $s]} {return 1}
    set s [string tolower $s]
    if {[info exists word($s)]}  {return 1}
    foreach sfx {s ing ed es d} {
       if {
          [regexp ^(.+)$sfx$ $s -> stem] &&
          [info exists word($stem)] &&
          [lsearch $word($stem) $sfx] >= 0
       } then {
          return 1
       } 
    }
    return 0
 }

The following two are for data preparation, they take a string with possible linebreaks (may be a whole text file), extract the words only, resp. do a frequency count:

 proc string:words s {
    set res {}
    foreach line [split $s \n] {
       for {set i 0} {$i<[string length $line]} {incr i} {
          if {$i==[string wordstart $line $i]} {
             set w [string range $line $i [expr {[string wordend $line $i]-1}]]
             if {$w!=" "} {lappend res $w}
             incr i [expr {[string length $w]-1}];# always loop incr
          }
       }
    }
    set res
 }
 proc words:count s {
    foreach i [string tolower [string:words $s]] {
       if {[string length $i]>1} {
          if {[info exists a($i)]} {
             incr a($i)
          } else {
             set a($i) 1
          }
       }
    }
    set t {}
    foreach {i n} [array get a] {lappend t [list $i $n]}
    ##DKF## Efficient in 8.4, not crippling before
    return [lsort -integer -decreasing -index 1 $t]
 }

And here finally comes the "dictionary" (pretty poor yet, fits on less than a page). It does a crude subcategorization based on possible endings (the value of the array entries), so more words are matched:

 ########## load dictionary, distinguish suffix distributions #####
 foreach i {
    about above after all already also always am an another any and are as at 
    be been before below between body both but by child children could
    data different does doesn during each either empty
    found for from fully given got happy happily
    has have high his how however if in including into is isn it 
    just later legal low may maybe more must never next no none not
    of on onto only or over perhaps same should since slow so some such
    tcl than that the their them then there these they 
    this those three to too two under unless us using 
    was we were what whatever when where whether which while who whom whose 
    why with within would you zero automatic automatically
 } {set ::word($i) ""}
 foreach i {
    add accept allow append approach argument 
    back book brief buffer button 
    call check clear click color command consist contain convert count counter 
    destroy display down end except exist export 
    fill follow form import intend key
    last link list load look mark need number open order overview
    pair perform pick point position print reason represent return
    screen script second select shift show spell start style support
    test treat unit view want word work
 } {set ::word($i) "s ing ed"}
 foreach i {
    bind break do field find mean read see will window
 } {set ::word($i) "s ing"}
 foreach i {
    access focus index match search
 } {set ::word($i) "es ing ed"}
 foreach i {
    actual additional complete current definite direct exact frequent
    general immediate normal occasional optional previous proper quick
    recent silent symbolical total
 } {set ::word($i) "ly"}
 foreach i {
    action application area bar bottom can case center come context
    character computer content control current database effect element
    error even event example first font forget format friend
    get give global handler height her image information input it item 
    left let make menu mouse new nothing one operation option other
    output package pattern procedure program real red refer region reset
    resolution right selection set simple single space special standard
    step stop string system table tag take text top up variable white
    widget width write your
 } {set ::word($i) "s"}
 foreach i {
    abbreviate associate change code coordinate create date declare
    define delete describe determine double execute file force generate
    ignore include indicate line like name note outline page remove rule 
    size state terminate time type use value
 } {set ::word($i) "s d"}

DKF: Modified to run faster. :^)

LV: Any of you familar enough with the Wikit code to figure how to add this code so that after one edits a page, there could be a button for spell-checking the page, with the possible misspelled words highlighted in some manner?

RS: Before building this into the Wiki, remember I said this is a toy project. The problem is the dictionary, which has to be very much more comprehensive than the one above - otherwise you'll get so many false positives that it doesn't help much. So we need

data (10,000s of frequent English words)
an efficient access method (the one above will get slow with much data, because of the many regexps)

Highlighting via http will also be very different, though not difficult: retransmit the received form contents, with dubious words braced in <b>..</b>. (Does HTML allow markup in a form?)

DKF: No, you can't provide HTML markup in a form. Delivering this sort of functionality would require an applet of some form (either Java or Tcl.)

Also, the ispell english dictionary is 300kB long (after I extract it from its storage format) with over 33 thousand words, omits many common prefixes and suffixes, and I still use a lot of words which it doesn't know about. I tempted to say that instead of writing our own spelling checker, we should just wrap up ispell instead... :^)

FYI, the shell command I used to extract it was:

  strings /usr/common/lib/ispell/britishmed+.hash | sed '/[^A-Z]/d' | tr A-Z a-z | sort

RS: Could not find ispell in our Solaris or Linux boxes, but the old spell with a flat ASCII wordlist of 25143. Not bad. I'd only prefer a pure-Tcl solution, since my W95 box at home misses so many goodies...

Arjen Markus On our Solaris system we have a program "spell" - seems quite similar :-)

AK: See http://freshmeat.net/appindex/console/text%20utilities.html

for several spellcheckers, especially pspell [1], the portable spell checker interface library. Contains an ispell module, appears to handle UTF-8.

LV: the aspell/pspell project has several word lists, as does the fsf.org people. So coming up with a word list isn't the problem. However, perhaps embedding such large word lists into a Wikit would be counter-productive...

NEM: Couldn't we write some Tcl scripts to trawl on-line dictionaries writing data to a Metakit or other Tcl database? Some intelligent language parsing could pick out endings etc, and create a nice database. Could take a while to work tho - all those HTTP requests....

Instead of embedding the dictionary in the Wikit, we could create a metakit database in a Tequila server...

More possibilities, these requiring IP connectivity: use of Google's spell-corrector (CL reports that programmability is as easy as

    package require SOAP
    SOAP::create doSpellingSuggestion \
        -proxy http://api.google.com/search/beta2 \
        -uri urn:GoogleSearch \
        -action urn:GoogleSearchAction \
        -params {key string phrase string}
    puts [doSpellingSuggestion $key "mangeld word"]

); or 'Net connections to several on-line dictionaries (...)

Once concern about using the Google web service is the fact I seem to recall that one needs to obtain a login / password for google and the use that login and password as a part of the interaction. Or is this something difference?

MG April 21st, 2004 - I looked into the Google web service recently; you do indeed need to obtain a login/password from them, and there's a limit to how much it can be used in (I believe) any 24-hour period. As a pure-tcl alternative, though, I found the code above (with a few small modifications, mainly for handling words with apostrophes and such) and a large wordlist (the one I have is 1.3 megabytes) works brilliantly; the code is in Potato MUSH (MUD) Client, but I'll extract the procs I changed later and add them here, just in case anyone wants them. (The word list is available at [2], incidently.)

Joachim Kock <25/04/2004> : I don't think it is worth to try to collect words from web services or compile databases in fancy formats. You can probably get much better results by using some serious external spell checker like aspell, which is very easy to control through a pipe, or otherwise just use a binary search through a well-prepared word list --- these can be found on the internet (for example on the aspell site [3] or the Excalibur site [4]) and they have been fine-tuned over many years by clever spell-checking freaks.

The spellchecker aspell (http://aspell.sourceforge.net

) is very fast and has many features. It is easy to call it from a Tcl programme via a pipe. Alpha(Tk) (http://alphatcl.sourceforge.net

) uses aspell as spellchecker and this is all implemented in Tcl. See the file spellcheck.tcl in the AlphaTcl library [5]. There is also a check-as-you-type spell checker for Alpha(Tk) where misspelled words are underlined while a list of suggestions appear in a small auxiliary window. This goes as fast as you can type, and ctrl-k 4 for accepting suggestion number 4... The Tcl code is here [6].

Alternatively, and in particular if you are not interested in suggestions for corrections, but only want a boolean, a very convenient data format is a plain text file with one word per line, alphabetically sorted. There are very good such wordlists available, and doing a binary search for a word is faster than you can type it. Here is a code snippet stolen from another Alpha(Tk) package, 'autoAccents' (this package automatically sets accents when you type in French (or in other heavily accented languages, depending on the supplied wordlist). The following is rather minimal:

  proc checkWord { word } {
      set word [string tolower $word]

      # Assuming that there is a sorted wordlist here:
      set wordList /Data/dics/wordlists/BritishDictionary2.2
      # ftp://ftp.eg.bucknell.edu/pub/mac/Excalibur-dictionaries/
 
      set f [open $wordList r]
      set lowerlimit 0
      seek $f 0 end
      set upperlimit [tell $f]
      # ------------------
      # Rough binary search, to narrow the interval:
      while { [expr $upperlimit - $lowerlimit >= 20] } {
          set mid [expr ($upperlimit + $lowerlimit) / 2] 
          seek $f $mid
          gets $f linje ; #first chunk is junk
          gets $f linje
          if { [string compare $word $linje] == 1 } {
              set lowerlimit $mid
          } else {
              set upperlimit $mid
          }
      }
      # ------------------
      # Now the goal is within the narrow interval.
      # (In very unlucky cases the goal may actually be a litte after the 
      # interval, but this doesn't matter because we):
      # Go back a little further and read forward linearly:
      if { $lowerlimit > 20 } {
          seek $f [expr $lowerlimit - 20]
          gets $f linje ; #first chunk is junk
      } else {
          seek $f 0
      }
      gets $f linje 
      while { [string compare $word [string trim $linje]] == 1 } {
          if { [gets $f linje] == -1 } {
              break
          }
      }
      # ------------------
      # Found the first non-smaller word.
      close $f
      if { [string equal $word [string trim $linje]] } {
          return 1 
      } else { 
          return 0
      }
  }

Here is GJW's alternative version. Conceptually it's similar to Richard Suchenwirth's implementation. It consists of one part which downloads a word list from an HTTP server, and another part which scans the text from a fancy text field. (We use an in-house package that wraps the standard Tk fields. Some minor adjustment may be required to use it with standard text fields.)

Since I'm using nginx as my web server, I had to add one line of configuration to it: "if_modified_since before;". (Otherwise, it fails to respond with 304 when the client's mtime is newer than the server's mtime.)

package require http

proc eb_UpdateSpellingWords {} {
    global tcl_platform ebase
    upvar #0 spelling_words words

    # Local copy of spelling word list

    switch -- $tcl_platform(platform) {
        windows { set ebase(wordfile) C:/Ebase5/words }
        default { set ebase(wordfile) ~/.ebase-words }
    }
    set mtime 0
    catch {set mtime [file mtime $ebase(wordfile)]}

    # Remote copy of spelling word list

    if { ! [info exists ebase(host)] } {
        set ebase(host) [eb_Call set ebase(host)]
    }
    set url http://$ebase(host)/words
    set web_mtime [clock format $mtime \
        -format {%a, %d %b %Y %H:%M:%S GMT} -gmt 1]
    set token [::http::geturl $url -validate 1 \
        -headers [list If-Modified-Since $web_mtime]]
    upvar #0 $token state
    switch -glob -- $state(http) *200* {
        # Remote copy has been modified since local copy was created, so
        # actually download it.
        set fd [open $ebase(wordfile)_tmp w]
        fconfigure $fd -translation binary
        ::http::geturl $url -channel $fd
        close $fd
        file rename -force -- $ebase(wordfile)_tmp $ebase(wordfile)
    }

    # Read word list into memory.

    set fd [open $ebase(wordfile)]
    fconfigure $fd -translation binary
    while { [gets $fd word] >= 0 } {
        set words($word) 1
    }
    close $fd
}

proc eb_SpellCheck {w} {
    upvar #0 spelling_words words
                        
    $w tag configure highlight -background pink
    $w tag remove highlight 1.0 end
                        
    set content [$w getval]
    set i 1             
    foreach line [split $content \n] {
        set c 0; set len [string length $line]
        while { $c < $len } {
            set char [string range $line $c $c]
            if { [string match {[A-Za-z]} $char] } {
                set end [string wordend $line $c]
                set word [string range $line $c [expr {$end - 1}]]
                if { ! [info exists words($word)] } {
                    # Try again, lower-cased (Sentence beginnings, plus
                    # they like ALL CAPS pseudo-headers)
                    if { ! [info exists words([string tolower $word])] } {
                        $w tag add highlight $i.$c $i.$end
                    }
                }       
                set c $end
            } else { 
                incr c
            }
        }    
        incr i
    }              
    update idletasks
}

bind Ff_TextField <FocusOut> {+eb_SpellCheck %W}

Category Algorithm

Category Spell Checking