Using Google's N-Gram Corpus

May 9, 2008

Two years ago, Google released a collection of n-grams extracted from web pages and made it available through the Linguistic Data Consortium's website. "We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times." Here are some examples of 3-grams, each followed by its frequency:

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
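
To make the format concrete, here is a minimal Python sketch that reads lines like the three above into a dictionary of counts. It assumes each line is simply the words of the n-gram followed by a count, separated by whitespace, exactly as shown in this post; the files distributed by the LDC may be laid out differently, and the name `parse_ngram_line` is made up for this example.

```python
def parse_ngram_line(line):
    """Split a line such as 'ceramics collected by 52' into (tokens, count).

    Assumes the last whitespace-separated field is the count and
    everything before it is the n-gram itself.
    """
    *tokens, count = line.split()
    return tuple(tokens), int(count)


# The three sample 3-grams quoted in the post.
sample = """\
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
"""

# Map each n-gram (as a tuple of words) to its frequency.
counts = dict(
    parse_ngram_line(line) for line in sample.splitlines() if line.strip()
)

# Pick the most frequent n-gram among the samples.
most_frequent = max(counts, key=counts.get)
print(most_frequent, counts[most_frequent])
```

With a real file you would stream the lines instead of loading them all at once, since the full data set is far too large to fit in a dictionary in memory.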

While this huge corpus is useful for building language models, there are other ways to use it. Chris Harrison created some visualizations for bigrams and trigrams that start with pronouns. "These visual comparisons allow us to see differences in how the two subjects are used - both where they are similar and diverge. For example, among the top 120 trigrams, 'He' and 'She' have many common second words. However, they differ on some interesting ones, for example, only 'he' connects to 'argues', while only 'she' connects to 'love'."
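
The comparison described in the quote boils down to set operations on the second words of the top-ranked trigrams. The sketch below shows the idea in Python. The counts in `toy_counts` are invented purely for illustration and are not taken from the corpus, and `second_words` is a hypothetical helper, not something from Chris Harrison's code.

```python
def second_words(trigram_counts, first_word, top_n):
    """Return the set of second words among the top_n most frequent
    trigrams that start with first_word."""
    ranked = sorted(
        (item for item in trigram_counts.items() if item[0][0] == first_word),
        key=lambda item: item[1],
        reverse=True,
    )
    return {tokens[1] for tokens, _ in ranked[:top_n]}


# Invented toy counts for illustration only; NOT real corpus data.
toy_counts = {
    ("he", "argues", "that"): 9,
    ("he", "said", "that"): 12,
    ("he", "was", "a"): 15,
    ("she", "love", "to"): 7,
    ("she", "said", "that"): 11,
    ("she", "was", "a"): 14,
}

he = second_words(toy_counts, "he", top_n=120)
she = second_words(toy_counts, "she", top_n=120)

print("shared:", sorted(he & she))
print("only he:", sorted(he - she))
print("only she:", sorted(she - he))
```

Note that a word missing from one side only means it did not make the top 120 for that pronoun, not that the phrase never occurs.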

Chris DiBona from Google works on IsolWrite, a word processing program that will include a text prediction option. "I gotta get my greasy hands on an open version of our published n-gram data (which is ranked) and incorporate that, if it makes sense."
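
For a rough idea of how ranked n-gram counts could drive text prediction, here is a small Python sketch that suggests the next word from the two words already typed. This is only an illustration under simple assumptions (trigram counts held in memory, ranking by raw frequency); it says nothing about how IsolWrite actually works, and `build_index` and `suggest` are names invented for this example.

```python
from collections import defaultdict


def build_index(trigram_counts):
    """Group trigrams by their first two words, with the candidate
    third words sorted from most to least frequent."""
    index = defaultdict(list)
    for (w1, w2, w3), count in trigram_counts.items():
        index[(w1, w2)].append((w3, count))
    for candidates in index.values():
        candidates.sort(key=lambda pair: pair[1], reverse=True)
    return index


def suggest(index, w1, w2, limit=3):
    """Return up to `limit` likely next words after the pair (w1, w2)."""
    return [word for word, _ in index.get((w1, w2), [])[:limit]]


# The three sample 3-grams quoted earlier in the post.
trigram_counts = {
    ("ceramics", "collectables", "collectibles"): 55,
    ("ceramics", "collectables", "fine"): 130,
    ("ceramics", "collected", "by"): 52,
}

index = build_index(trigram_counts)
print(suggest(index, "ceramics", "collectables"))
print(suggest(index, "ceramics", "collected"))
```

A real implementation would also need a compact on-disk structure and some fallback to shorter n-grams when the typed pair has never been seen.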

{ via information aesthetics }

7 comments:

Unknown, May 10, 2008 at 5:59 PM
Ummm...yeah. Okay. LOL! Don't you hate those days where there's nothing to blog? I do.

Unknown, May 19, 2009 at 12:14 AM
Wow... this is really great! Google always does something different, I like it.

paintball barrels, April 14, 2010 at 8:48 PM
unfortunately the ngram data set is not really open at all...

Rich Farmbrough, May 5, 2010 at 10:39 AM
It is $150 from LDC, but more to the point it is about 1 TB. Not something you can slip into a WP package.

Thomas, October 21, 2010 at 1:32 PM
Microsoft/Bing has the data available also! We are talking a huge amount of data! http://web-ngram.research.microsoft.com/info/

Anonymous, October 21, 2010 at 5:10 PM
Well, it sounds interesting, but just search for "he loves" and you will find a bunch of links, so your analysis is not correct.

eslchill, May 2, 2011 at 11:41 AM
@ Anonymous I had the same thought, but this is said in the context of the popularity of the top 120 trigrams, meaning it does exist, but is not popular. Clearly he loves and she argues, just not as much as she loves and he argues.

More interesting to me is that the phrase "ceramics collectables collectibles" appears 55 times. This is not actually a phrase in English, but looks like some Google adwords mashup. This is what you are basing linguistic analysis on? Uh oh...

Google Operating System

Unofficial news and tips about Google
