May 9, 2008

Using Google's N-Gram Corpus

Two years ago, Google released a collection of n-grams extracted from web pages and made it available through the Linguistic Data Consortium's website. Google's announcement said: "We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times." Here are some examples of 3-grams, each followed by its frequency:

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
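
Lines in this shape are easy to load into a lookup table. The following is a minimal Python sketch that assumes each line holds the words of the n-gram followed by a count, as in the examples above; the function name parse_ngram_line is made up for illustration, and splitting on the last run of whitespace is an assumption that works whether the count is separated by a space or a tab.

```python
def parse_ngram_line(line):
    """Split 'w1 w2 ... wn count' into (tuple_of_words, count).

    Hypothetical helper: the count is taken to be the last
    whitespace-separated field on the line.
    """
    text, count = line.rsplit(None, 1)
    return tuple(text.split()), int(count)


# The three sample 3-grams quoted above.
sample = """ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52"""

# Map each n-gram (as a tuple of words) to its frequency.
counts = dict(parse_ngram_line(line) for line in sample.splitlines())

for ngram, count in counts.items():
    print(" ".join(ngram), count)
```

For the full corpus a dictionary held in memory would likely be too large, so a real tool would probably stream the files or use an on-disk index instead.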

While this huge corpus is useful for building language models, there are other ways to use it. Chris Harrison created some visualizations of bigrams and trigrams that start with pronouns. "These visual comparisons allow us to see differences in how the two subjects are used - both where they are similar and diverge. For example, among the top 120 trigrams, 'He' and 'She' have many common second words. However, they differ on some interesting ones, for example, only 'he' connects to 'argues', while only 'she' connects to 'love'."

Chris DiBona from Google works on IsolWrite, a word processing program that will include a text prediction option. "I gotta get my greasy hands on an open version of our published n-gram data (which is ranked) and incorporate that, if it makes sense."
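
To make the idea of n-gram-based text prediction concrete, here is a small Python sketch. It is not how IsolWrite works (the post does not say); it only illustrates one simple approach, under the assumption that n-gram counts are already loaded as a dictionary keyed by word tuples. The names build_index and predict are hypothetical.

```python
from collections import defaultdict


def build_index(counts):
    """Group n-grams by their leading words (the context).

    Each context maps to a list of (next_word, count) pairs,
    sorted so the most frequent continuation comes first.
    """
    index = defaultdict(list)
    for ngram, count in counts.items():
        index[ngram[:-1]].append((ngram[-1], count))
    for context in index:
        index[context].sort(key=lambda pair: -pair[1])
    return dict(index)


def predict(index, context, k=3):
    """Return up to k likely next words for the given context."""
    return [word for word, _ in index.get(tuple(context), [])[:k]]


# Toy data: the three sample 3-grams quoted earlier in this post.
counts = {
    ("ceramics", "collectables", "collectibles"): 55,
    ("ceramics", "collectables", "fine"): 130,
    ("ceramics", "collected", "by"): 52,
}

index = build_index(counts)
suggestions = predict(index, ["ceramics", "collectables"])
print(suggestions)  # ['fine', 'collectibles']
```

With real data, a predictor would probably also back off to shorter contexts when the full context has never been seen, which this sketch does not attempt.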

