An unofficial blog that watches Google's attempts to move your operating system online since 2005. Not affiliated with Google.

Send your tips to gostips@gmail.com.

September 23, 2006

Google Has the Largest Number of Dead and Old Pages

Ziv Bar-Yossef, from Google, wrote a paper about sampling random pages from a search engine's index using queries. He explains some of the technical details in this video, including why sampling random pages is useful: it lets you compare search engines and estimate the share of spam, of fresh results and so on.

He applied the results from his paper and compared Google, Yahoo and MSN Search. Here are three charts that compare the index size, the number of dead pages in each search engine and the freshness of the results. The charts are only estimates, with a bias of around 10%. As you can see, Google doesn't do very well.

To find out more, watch the video, which is fairly long (1 hour), or skip to the results. There's also the paper "Random Sampling from a Search Engine's Index" (PDF), which won the best paper award at WWW 2006.
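
To see why a random sample is useful, here is a minimal sketch of how a property such as the share of dead pages could be estimated once a uniform random sample of pages is available. This is not the method from the paper, which deals with the harder problem of getting such a sample through queries. The function name `estimate_fraction`, the made-up index and the 5% dead rate are assumptions for illustration only.

```python
import math
import random


def estimate_fraction(flags, z=1.96):
    """Estimate a proportion from a uniform random sample of pages.

    flags: list of booleans, True if the sampled page has the property
    (for example, the page is dead).
    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    n = len(flags)
    if n == 0:
        raise ValueError("sample must not be empty")
    p = sum(flags) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


if __name__ == "__main__":
    # Synthetic data only: a hypothetical index where about 5% of pages are dead.
    rng = random.Random(0)
    index = [rng.random() < 0.05 for _ in range(100_000)]

    # Assumption: we can draw pages uniformly at random from the index.
    sample = rng.sample(index, 2_000)

    p, margin = estimate_fraction(sample)
    print(f"estimated dead fraction: {p:.3f} +/- {margin:.3f}")
```

The margin shrinks with the square root of the sample size, so a few thousand pages can already give a rough estimate, as long as the sample really is close to uniform.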

11 comments:

  1. "Google doesn't do very well".
    I'm not sure I agree. Using the cached feature on a dead page gets you the information you're searching for... if it IS information you're after, and not a download.
    Maybe Google should change the main link of a dead page to the cached version...

    I don't know... Just a thought.

  2. I agree, but that also means Google is slow at recrawling pages, because when it recrawls a page that returns a 404, it removes the page from the index, at least temporarily.

    From some of my tests, I could see that Yahoo crawls pages before Google most of the time, and that's consistent with the result that Yahoo has more fresh pages in the index than Google.

  3. True, but Google is incredibly large, so managing all that is going to take a bit longer.

    And I think that's a pretty good idea about the dead pages becoming the cached version.
    A lot of people don't know about the Google Cache feature, which is a shame because it's a really useful feature!
    I think I may suggest this to them. You should as well, everyone should!

  4. Sometimes it's hard to read, isn't it? It's very difficult to read English words, but it's so easy to write: crap, disgust, marketing ploy.

    The post starts with "Ziv Bar-Yossef, from Google, wrote a paper about...". Maybe I should repeat it one more time: "from Google". Feel free to repeat it several times until you get it.

  5. Automated querying goes against the terms of service on all of the top search engines. I understand that they might get permission to do it for their own data - but for the other engines? I wonder how many queries go to Y+M from the internal G network :-).

  6. Google, Yahoo and MSN have search APIs.

    Google -> 1,000 queries / day (or more if you ask)
    Yahoo -> 5,000 queries / IP / day
    MSN -> 10,000 queries / IP / day

  7. I should have read it all before I posted :-)
    "In our exploration experiments, conducted in April-May 2006, we submitted 395,000 queries to Google, 448,000 queries to MSN Search, and 370,000 queries to Yahoo!. Due to legal restrictions on automatic queries, we used the Google, MSN, and Yahoo! Web Search APIs, which are, reportedly, served from older and smaller corpora than the corpora used to serve human users. These APIs are limited to submitting only a few thousands of queries a day, which limited the scale of the experiments we could perform."
    Seeing how they used 7 queries per dataset, that would mean that they checked about 64,000 pages. Is that enough to extrapolate to the rest of the web? (maybe I should read the rest before posting more embarrassing posts :-))

  8. I agree 100%. Google is stale because 90% of it is supplemental, which is only crawled once in a blue moon.

  9. Google results aren't what they used to be anymore. I'm not sure if this is because of the old pages, but their search is more of a history book than a search of what's new on the web.

  10. Same old story about life and business. Take a baseball player for example. Once they make their money and hit it big with the values and work ethic that got them there... they go commercial and focus on making more money, not on the end user. Ever see a baseball player perform worse after the big check?

  11. I agree with the last person, and some people who left comments are nerds too!

