March 27, 2009

Hosted by Google, but Not Open to Search Engines

Like many other sites, Google uses robots.txt files to prevent search engines from indexing some of the content from google.com. In most cases, Google excludes search results pages and other automatically generated pages, which would pollute indexes.
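To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might decide whether a page is off limits, using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `ExampleBot` user agent and the example.com URLs are made up for illustration; they are not copied from Google's actual robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Google's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file contents as a list of lines, so no network
    # request is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/search?q=test",  # matches "Disallow: /search"
        "http://www.example.com/about.html",     # falls through to "Allow: /"
    ):
        print(url, is_crawlable(SAMPLE_ROBOTS_TXT, url))
```

A real crawler would download the file from the site's `/robots.txt` address (for example with `set_url()` and `read()`) instead of parsing a hard-coded string.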


But sometimes Google excludes useful content, either directly using robots.txt files or using addresses that are difficult to index:

* published documents, spreadsheets and presentations from Google Docs - I suspect that the main reason why search engines aren't allowed to index Google Docs pages is that many documents would become public if search engines indexed invitation URLs.

* public pages for Google Reader's shared items - most of the content on these pages is copied from other pages, but public Google Notebooks can be indexed by search engines.

* the albums and the photos hosted by Picasa Web Albums (the photos are indexed by Google Image Search, while the albums are included in Google's main search results). Picasa Web's front-end uses AJAX and URLs like http://picasaweb.google.com/guedin/AdriChezLesKiwisToutesLesPhotos12#5312778271091234418 can't be indexed by search engines, which usually remove fragments.

* the questions and answers from Google Moderator, another AJAX app that uses addresses like http://moderator.appspot.com/#15/e=cc&t=6. The application powers a new section of the White House website called "Open for Questions", which also can't be indexed by search engines.

* the LIFE photo archive, which is only available in Google Image Search. "It's disappointing that Google gets exclusive access to index these images and every other search engine is out of luck. Exclusivity like this doesn't seem in line with Google's philosophy," says Andy Baio.

* the books scanned by Google that are available in Google Book Search (they're included in Google's main search results, as part of Universal Search)

* the patents from the United States Patent and Trademark Office that are available in Google Patent Search

* the charts generated using Google Chart API

* the captions from videos hosted by YouTube and Google Video (they're indexed by YouTube and Google Video)
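The fragment problem mentioned for Picasa Web Albums and Google Moderator is easy to see in code. The sketch below assumes a crawler that simply drops everything after the `#` before requesting or storing a URL; the helper name `crawl_url` is hypothetical, and real crawlers may normalize addresses in other ways too. It uses `urldefrag` from Python's standard library.

```python
from urllib.parse import urldefrag


def crawl_url(url: str) -> str:
    """Return the URL without its fragment (the part after '#').

    Assumption: the crawler discards fragments, since they are handled by
    the browser and never sent to the server.
    """
    return urldefrag(url).url


ALBUM = "http://picasaweb.google.com/guedin/AdriChezLesKiwisToutesLesPhotos12"
PHOTO = ALBUM + "#5312778271091234418"
MODERATOR = "http://moderator.appspot.com/#15/e=cc&t=6"

if __name__ == "__main__":
    # The photo address collapses to the album address.
    print(crawl_url(PHOTO))
    # Every Moderator view collapses to the application's front page.
    print(crawl_url(MODERATOR))
```

Under this assumption, all the photos from an album share a single crawlable address, so the individual photo pages never get their own entries in an index.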

8 comments:

  1. If you monitor the robots.txt, sometimes you will notice that Google puts a new service in the file. I use Website Watcher for this or good old Copernic Tracker. As a tech writer, I've found it handy at least three times.

  2. Which of these things do you think should be unblocked? For example, I tend to think that if you want to put a doc on the web, I would put it on Sites rather than Docs, which I tend to think of for intranet-type sharing.

  3. Seriously...my notes in Notebook are indexed on search engines?

  4. @TB:
    Only if they're public.

    @Matt:
    Public Google Docs, the photos hosted by Picasa Web Albums and the LIFE photo archive should be indexed by any search engine. I find the book summaries from Google Book Search and the captions from Google Video/YouTube videos very useful and I don't see why Google would restrict the access to this data.

  5. @Alex:
    Whew! Thanks. I didn't notice the share capabilities -- yet another reason why Google should not have killed the Notebooks program.

  6. I would like Google to provide a clear way to manage Bookmarks !

  7. @Bruno if you use chrome, "bookmark sync" can be enabled, which will keep your bookmarks the same across all chrome browsers that you enable this for.

  8. Does anyone know how to prevent a photo from showing up in Google Images?

