Data Mining Using Google

March 17, 2010

Today's xkcd comic is about quantitative Google queries. Randall Munroe counted the search results for queries like "My IQ is X", where X is a variable, and plotted a graph for each query. While the results aren't reliable (Google shows only an estimate of the number of search results), it's an interesting way to mine Google's index of the web.

If you are familiar with Google Spreadsheets, try to create a sheet that lets you enter a query like "My IQ is X", a variable name and the values for that variable. The result should be a graph that shows the number of Google search results for each instance of your query. Use importXML and an XPath expression to find the number of Google search results: "//p[@id='resultStats']/b[3]". Here's an example.
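For readers who want to see the moving parts outside a spreadsheet, here is a minimal Python sketch of the same idea. It expands a query template for each value, builds the formula text a sheet cell would contain, and applies the post's XPath expression to a sample line of markup. The search URL shape, the helper names, and the sample markup (modeled on the "Results 1 - 10 of ..." line Google displayed at the time) are all assumptions for illustration, not Google's actual page source. As a later comment points out, the XPath stopped matching after Google updated its UI.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# XPath expression quoted in the post (valid for the 2010 results page only).
XPATH = "//p[@id='resultStats']/b[3]"

# Hypothetical markup shaped like the 2010 result-stats line; not real page source.
SAMPLE = (
    "<html><body><p id='resultStats'>Results <b>1</b> - <b>10</b> of about "
    "<b>25,800</b> for <b>\"My IQ is 80\"</b>.</p></body></html>"
)


def expand_queries(template, variable, values):
    """Substitute each value for the variable and wrap the phrase in quotes."""
    return ['"' + template.replace(variable, str(v)) + '"' for v in values]


def search_url(query):
    """Build a search URL for one query (the URL shape is an assumption)."""
    return "https://www.google.com/search?q=" + quote_plus(query)


def importxml_formula(query):
    """Return the formula text a spreadsheet cell would hold for this query."""
    return '=IMPORTXML("{}", "{}")'.format(search_url(query), XPATH)


def result_count(markup):
    """Apply the XPath to well-formed markup and parse the third <b> as an int."""
    root = ET.fromstring(markup)
    node = root.find("." + XPATH)  # ElementTree needs a relative path
    return int(node.text.replace(",", ""))


if __name__ == "__main__":
    for q in expand_queries("My IQ is X", "X", [80, 100, 120]):
        print(q, importxml_formula(q))
    print(result_count(SAMPLE))
```

The third `<b>` element is the one that holds the total, because the first two hold the range of results shown on the page. The number extracted this way is still Google's estimate, so it inherits the unreliability mentioned above.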

{ Image licensed as Creative Commons Attribution-NonCommercial. }

10 comments:

Waldir, March 17, 2010 at 9:41 AM
Thanks for the example, it was very instructive. Unfortunately I tried to adapt it to parse a Facebook event's data and create a nice chart, but that's behind a login-wall... :(

David Scrimshaw, March 17, 2010 at 4:36 PM
You say "While the results aren't reliable..."

But I'm finding huge discrepancies. For example:

I just Googled "My IQ is 80" and got:

Results 1 - 15 of 15 for "My IQ is 80". (0.46 seconds)

Meanwhile, the spreadsheet says: 25800 in the "number of results" column.

15 hits when I do the Google search rather than the 25,800 in the spreadsheet.

Alex Chitu, March 17, 2010 at 4:43 PM
That's probably because your search results page shows more than 10 results. If you use the default setting (10 results per page), you'll see that inaccurate estimate.

David Scrimshaw, March 17, 2010 at 5:54 PM
My default is 50 results per page. Google was only giving me 15 results. 22 when I told it to also give me duplicates.

iPad Dock, March 18, 2010 at 4:04 AM
For the Googles, Amazons and EBays, "data" was too cumbersome for the old database paradigm. Vast quantities of data, lots of concurrent users, minimal consistency issues. They used and developed new technologies.

Anonymous, March 20, 2010 at 4:53 PM
All this says is that there are tons of liars on the Net.

52,000 people boast that they have a penis which is 14 inches long. The largest medically verified penis was only 13.5 inches.

Anonymous, March 22, 2010 at 12:34 PM
Yeah, estimated hits for a given query are highly inflated (not just in Google). Even with the default setting, if you page through to the end of the results, Google shows the right number of hits. Here are some interesting experiments with Google Search hits in the xkcd flavor.

http://buzzintechnology.com/2010/03/bugs-in-google-search-results-count/

web design ireland, March 27, 2010 at 8:01 AM
It could be useful for gaining broad insights, but the inaccuracy makes it a bit unreliable for most real-world data mining that could be used in research.

Ricker, July 8, 2010 at 1:10 PM
The XPath has changed since Google updated its UI. Does anyone have the updated formula?

psykick, November 13, 2010 at 5:49 AM
Yeah I'm seconding Andrew's question... Does anyone have the updated formula? Or better, can anyone upload a link to a fixed version of the example spreadsheet? That would be really, really helpful. Thanks.

Google Operating System

Unofficial news and tips about Google
