An unofficial blog that watches Google's attempts to move your operating system online since 2005. Not affiliated with Google.


May 9, 2008

Using Google's N-Gram Corpus

Two years ago, Google released a collection of n-grams extracted from web pages and made it available through the Linguistic Data Consortium's website. "We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times." Here are some examples of 3-grams, each followed by its frequency:

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
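
For readers who want to experiment with lines like these, here is a minimal Python sketch that turns them into a lookup table of counts. It assumes each line is an n-gram followed by whitespace and an integer count, as printed above; the files in the actual LDC distribution may be laid out differently. The names `parse_ngram_line` and `load_counts` are made up for this illustration.

```python
from collections import Counter

def parse_ngram_line(line):
    """Split 'w1 w2 ... wn count' into (tuple_of_words, count).

    Assumption: the count is the last whitespace-separated field.
    """
    text, count = line.rsplit(None, 1)
    return tuple(text.split()), int(count)

def load_counts(lines):
    """Build a Counter mapping each n-gram tuple to its frequency."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ngram, count = parse_ngram_line(line)
        counts[ngram] += count
    return counts

# The three example lines quoted in this post.
SAMPLE = """\
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
"""

counts = load_counts(SAMPLE.splitlines())
print(counts[("ceramics", "collectables", "fine")])
```

Keying the table by a tuple of words keeps lookups simple, but the full corpus is far too large to hold in memory this way, so treat this as a toy for small samples.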

While this huge corpus is useful for building linguistic models, there are other ways to use it. Chris Harrison created some visualizations for bigrams and trigrams that start with pronouns. "These visual comparisons allow us to see differences in how the two subjects are used - both where they are similar and diverge. For example, among the top 120 trigrams, 'He' and 'She' have many common second words. However, they differ on some interesting ones, for example, only 'he' connects to 'argues', while only 'she' connects to 'love'."


Chris DiBona from Google works on IsolWrite, a word processing program that will include a text prediction option. "I gotta get my greasy hands on an open version of our published n-gram data (which is ranked) and incorporate that, if it makes sense."
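
To give a rough idea of how ranked n-gram counts could drive text prediction, here is a small Python sketch. It is not how IsolWrite works (the post does not say); it is just one simple approach: index trigram counts by their first two words and suggest the most frequent third words. The function names `build_index` and `predict` are hypothetical, and the data is limited to the three trigrams quoted earlier in this post.

```python
from collections import defaultdict

def build_index(trigram_counts):
    """Map each (w1, w2) prefix to its followers, most frequent first."""
    index = defaultdict(list)
    for (w1, w2, w3), count in trigram_counts.items():
        index[(w1, w2)].append((w3, count))
    for followers in index.values():
        # Sort by descending count, breaking ties alphabetically.
        followers.sort(key=lambda pair: (-pair[1], pair[0]))
    return index

def predict(index, w1, w2, k=3):
    """Return up to k suggested next words after the words w1 w2."""
    return [word for word, _ in index.get((w1, w2), [])[:k]]

# Trigram counts taken from the examples in this post.
TRIGRAMS = {
    ("ceramics", "collectables", "collectibles"): 55,
    ("ceramics", "collectables", "fine"): 130,
    ("ceramics", "collected", "by"): 52,
}

index = build_index(TRIGRAMS)
suggestions = predict(index, "ceramics", "collectables")
print(suggestions)  # ['fine', 'collectibles']
```

A real word processor would need a much more compact index and some fallback to bigrams or single-word frequencies when a prefix has never been seen.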

{ via information aesthetics }

7 comments:

  1. Ummm...yeah. Okay. LOL! Don't you hate those days where there's nothing to blog? I do.

  2. Wow... this is really great! Google always does something different, I like it.

  3. unfortunately the ngram data set is not really open at all...

  4. It is $150 from LDC, but more to the point it is about 1 TB. Not something you can slip into a WP package.

  5. Microsoft/Bing has the data available also! We are talking a huge amount of data! http://web-ngram.research.microsoft.com/info/

  6. Well, sounds interesting, but just search "he loves" and you will find a bunch of links, so your analysis is not correct.

  7. @ Anonymous I had the same thought, but this is said in the context of the popularity of the top 120 trigrams, meaning it does exist, but is not popular. Clearly he loves and she argues, just not as much as she loves and he argues.

    More interesting to me is that the phrase "ceramics collectables collectibles" appears 55 times. This is not actually a phrase in English, but looks like some Google adwords mashup. This is what you are basing linguistic analysis on? Uh oh...

