April 12, 2007

A Simplified Version of Google's Spell Checker

Peter Norvig (from Google) explains in a detailed article how to write in 20 lines of Python code a spell checker almost as good as the one used by Google to show the famous "did you mean" corrections. Well, at least for one-word corrections.
We will read a text file, holmes.txt (that I happened to have on my laptop) which is a collection of Sherlock Holmes stories (from Project Gutenberg) consisting of about 100,000 words. We then extract the individual words from the file (using the function words, which converts everything to lowercase, so that "the" and "The" will be the same). Next we train a probability model, which is a fancy way of saying we count how many times each word occurs. (...)

Now let's look at the problem of enumerating the possible corrections c of a given word w. It is common to talk of the edit distance between two words: the number of edits it would take to turn one into the other. An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). (...)

The literature on spelling correction claims that 80 to 95% of spelling errors are an edit distance of 1 from the target.

A simple way to define the error model was to say that "all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0". From all the candidates for the correction, you can choose the most frequent word.

In Peter Norvig's tests, this simple algorithm returned correct answers in more than 80% of the cases. Of course that Google has more data than the holmes.txt file (it crawls the web, right?) and has access to a huge list of queries and refinements that could improve the algorithm, but this is an example of a simple yet powerful program.

7 comments:

  1. I would like to see Google donate their spell check algorithm to OpenOffice.org, their spell check is terrible.

    ReplyDelete
  2. Pfft, I can write this in two lines of code by screenscraping Google results, and it'll be not 80-90% as good as Google's corrections, but 100% as good ;)

    ReplyDelete
    Replies
    1. plz, can you enplane for me more about implementing "did you mean" ? for my search engine project.

      Delete
  3. Nice. The original article has changed, now it is 21 lines.

    Could someone compile this into an .exe w/ the big.txt and upload it, so we can try it? If not I'll do it.

    ReplyDelete
  4. I thought google stored all times someone queries and changes the query to something very similar without clicking on one of the results. This is also the way how people googlebomb the 'did you mean'.

    ReplyDelete
  5. I need to spell check a list of 50.000 names and this code does not take Literal expression "Angela Merkel" for exemple.

    And I repeat the demand of Zachary Liu : can anybody compile this with a GUI ?
    I did not try this dev :
    http://www.peterbe.com/plog/spellcorrector
    Gspell (for mac) which should work does not Literal.

    thanks

    ReplyDelete

Note: Only a member of this blog may post a comment.