March 17, 2010

Data Mining Using Google

Today's xkcd comic is about quantitative Google queries. Randall Munroe found the number of search results for queries like "My IQ is X", where X is a variable, and plotted a graph for each query. While the results aren't reliable (Google only shows an estimate of the number of search results), it's an interesting way to mine Google's index of the web.



If you are familiar with Google Spreadsheets, try to create a sheet that lets you enter a query like "My IQ is X", a variable name, and the values for that variable. The result should be a graph that shows the number of Google search results for each instance of your query. Use importXML and an XPath expression to find the number of Google search results: "//p[@id='resultStats']/b[3]" (the third bold element inside the paragraph whose id is resultStats). Here's an example.
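Outside of a spreadsheet, the same idea can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet itself: the helper names (expand_queries, search_url, parse_result_count) and the sample markup are made up for this example, the markup is a simplified guess at what the XPath above expects, and nothing is fetched from Google. Google's real page structure may differ and has changed over time.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus


def expand_queries(template, variable, values):
    """Substitute each value for the variable and wrap the phrase in quotes."""
    return ['"{}"'.format(template.replace(variable, str(v))) for v in values]


def search_url(query):
    """Build a Google search URL for one query string."""
    return "https://www.google.com/search?q=" + quote_plus(query)


def parse_result_count(page):
    """Read the estimated result count with the XPath from the post.

    ElementTree only accepts relative paths, so the leading "//" is
    written as ".//" here.
    """
    root = ET.fromstring(page)
    node = root.find(".//p[@id='resultStats']/b[3]")
    if node is None or node.text is None:
        return None
    return int(node.text.replace(",", ""))


# Hypothetical, simplified markup (an assumption, not a captured page).
# The third <b> element holds the estimated total.
SAMPLE_PAGE = (
    '<html><body><p id="resultStats">Results <b>1</b> - <b>10</b> of about '
    '<b>25,800</b> for <b>My IQ is 80</b>.</p></body></html>'
)

for query in expand_queries("My IQ is X", "X", [80, 100, 120]):
    print(query, search_url(query))

print(parse_result_count(SAMPLE_PAGE))
```

In the spreadsheet, importXML plays the role of both fetching the page and applying the XPath; here the two steps are kept separate so the parsing can be tried on a fixed sample.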

{ Image licensed as Creative Commons Attribution-NonCommercial. }

10 comments:

  1. Thanks for the example, it was very instructive. Unfortunately I tried to adapt it to parse a Facebook event's data and create a nice chart, but that's behind a login-wall... :(

  2. You say "While the results aren't reliable..."

    But I'm finding huge discrepancies. For example:

    I just Googled "My IQ is 80" and got:

    Results 1 - 15 of 15 for "My IQ is 80". (0.46 seconds)

    Meanwhile, the spreadsheet says: 25800 in the "number of results" column.

    15 hits when I do the Google search rather than the 25,800 in the spreadsheet.

  3. That's probably because your search results page shows more than 10 results. If you use the default setting (10 results per page), you'll see the inaccurate estimate.

  4. My default is 50 results per page. Google was only giving me 15 results. 22 when I told it to also give me duplicates.

  5. For the Googles, Amazons and EBays, "data" was too cumbersome for the old database paradigm. Vast quantities of data, lots of concurrent users, minimal consistency issues. They used and developed new technologies.

  6. All this says is that there are tons of liars on the Net.

    52,000 people boast that they have a penis which is 14 inches long. The largest medically verified penis was only 13.5 inches.

  7. Yeah, estimated hits for a given query are highly inflated (not just in Google). Even with the default setting, if you page through to the end of the results, you will get the right number of hits in Google. Here are some interesting experiments with Google Search hits in the xkcd flavor.

    http://buzzintechnology.com/2010/03/bugs-in-google-search-results-count/

  8. It could be useful for gaining broad insights, but the inaccuracy makes it a bit unreliable for most real-world data mining that could be used in research.

  9. The XPath has changed since Google updated its UI. Does anyone have the updated formula?

  10. Yeah I'm seconding Andrew's question... Does anyone have the updated formula? Or better, can anyone upload a link to a fixed version of the example spreadsheet? That would be really, really helpful. Thanks.
