April 13, 2008

Google Starts to Index the Invisible Web


Google Webmaster Central Blog has recently announced that Google has started to index web pages hidden behind web forms. "In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page." For now, only a small number of websites will be affected by this change, and Google will only fill in forms that use GET to submit data and don't require personal information.
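To make this concrete, here is a minimal sketch of the general idea, not Google's actual crawler: pick a handful of candidate values for each input of a GET form and turn every combination into a crawlable URL. The form action, input names, and candidate values below are invented for illustration.

```python
import itertools
from urllib.parse import urlencode

# Hypothetical GET form: the action URL, input names and candidate values
# are invented for illustration. Values for select menus, check boxes and
# radio buttons would come from the HTML itself; values for a text box
# would be words chosen from the site hosting the form.
form_action = "http://example.com/search"
candidates = {
    "category": ["books", "music", "movies"],  # e.g. options of a <select>
    "q": ["astronomy", "telescope"],           # e.g. words found on the site
}

def generate_urls(action, inputs):
    # Yield one crawlable URL per combination of input values,
    # mimicking a possible query a user may have made.
    names = sorted(inputs)
    for combo in itertools.product(*(inputs[name] for name in names)):
        yield action + "?" + urlencode(dict(zip(names, combo)))

for url in generate_urls(form_action, candidates):
    print(url)  # each URL would then be fetched and checked for new content
```

In the actual system, only a small number of such queries would be tried per form, and a resulting page would be kept only if it is valid and adds content not already in the index.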

Many web pages are difficult to find because they're not indexed by search engines: they're only available if you know where to search and what to use as a query. Together, these pages make up the Invisible Web, which was estimated to include 550 billion documents in 2001. "Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines can not see or retrieve content in the deep Web -- those pages do not exist until they are created dynamically as the result of a specific search."

Anand Rajaraman points out that the new feature is related to a low-profile Google acquisition from 2005:
Between 1995 and 2005, Web search had become the dominant mechanism for finding information. Search engines, however, had a blind spot: the data behind HTML forms. (...) The key problems in indexing the Invisible Web are:

1. Determining which web forms are worth penetrating.
2. If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it? In the case of fields with checkboxes, radio buttons, and drop-down menus, the solution is fairly straightforward. In the case of free-text inputs, the problem is quite challenging - we need to understand the semantics of the input box to guess possible valid inputs.

Transformic's technology addressed both problems (1) and (2). It was always clear to us that Google would be a great home for Transformic, and in 2005 Google acquired Transformic. (...) The Transformic team has been working hard for the past two years perfecting the technology and integrating it into the Google crawler.
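The free-text case in problem (2) is the harder one. Purely as an illustration of the general idea (not Transformic's or Google's actual technique), candidate values for a text box could be guessed by picking the most frequent informative words from the site's own pages:

```python
import re
from collections import Counter

# Invented stopword list and sample text; a real system would need far more care.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are",
             "was", "have", "has", "our", "your", "every", "new"}

def candidate_terms(page_text, k=5):
    # Keep words of four letters or more, drop common stopwords, and
    # return the k most frequent ones as candidate values for a text box.
    words = re.findall(r"[a-z]{4,}", page_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

sample_page = """
Search our catalog of astronomy books, telescope reviews and star charts.
New astronomy titles arrive every month; telescope guides are updated weekly.
"""
print(candidate_terms(sample_page))  # 'astronomy' and 'telescope' rank first
```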

It's not clear which high-quality sites Google uses for the new feature, but this list includes some good options. Along with Google Book Search, Google Scholar, and Google News Archive, this is yet another way to bring valuable information to light.

12 comments:

  1. You might want to remove the misleading and meaningless picture on top.

  2. @Zeta: I thought it was an example of the types of sites Google is now trying to index?

  3. Seems like a win all around. The end user wins by getting what they're looking for, the publishers win by having their content more readily indexed, and Google wins by getting advertising revenues and increasing their search market share. +3

  4. One (partial?) solution to the invisible web and the problem of how to populate forms is to "ask" the web site: define a standard way of exposing the typical or possible results of a form. Add an extra attribute to the FORM tag, like "PRESULTS=/results.xml", which would be a URL to a static or dynamic list of URLs that might result from a search. For example, if the form has an author search box, results.xml could contain a dynamically generated list of result URLs for all the authors in your database.

    I think there is some possibility of people abusing this for black hat SEO, but it'd be a great tool for white hat SEO folks.

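The PRESULTS attribute proposed in the comment above is hypothetical (HTML has no such attribute), but the results.xml file it describes is similar in spirit to an XML Sitemap and would be easy to generate. A minimal sketch, with an invented URL pattern and author list:

```python
from urllib.parse import quote_plus

# Invented example: produce the kind of results.xml the comment describes,
# listing one search-result URL per author in a site's database.
authors = ["Isaac Asimov", "Ursula K. Le Guin", "Stanislaw Lem"]
base_url = "http://example.com/search?author="

lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<urls>"]
for name in authors:
    lines.append("  <url>{}{}</url>".format(base_url, quote_plus(name)))
lines.append("</urls>")

print("\n".join(lines))
```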
  5. So, how long before Google starts indexing its own searches and falls into an endless vortex of recursion?

  6. I'd call this evil. Let's assume the form on your website is used to collect information and store it in your database for future use. If the technique works as described, you will have garbage data in your database. Even if your form is not about collecting the data, but just searching, that search data is now less meaningful because it includes the queries google "generates."

    There has to be a better way.

  7. Well, if you collect information from your visitors, you should use POST in your forms, not GET.

    From w3.org:

    << The "get" method should be used when the form is idempotent (i.e., causes no side-effects). Many database searches have no visible side-effects and make ideal applications for the "get" method.

    If the service associated with the processing of a form causes side effects (for example, if the form modifies a database or subscription to a service), the "post" method should be used. >>

    Google only submits the forms that use GET and don't include personal information.

  8. I think that using link popularity as a way to determine what is seen or can be found on the internet is flawed. I am always annoyed at having to sort through the chaff served by Google and others before I find something relevant. Sure, it's great for ad serving, but it wastes huge amounts of time that might be better spent elsewhere. What if we asked page creators to embed some sort of Dewey Decimal or Library of Congress classification in their headers? It might improve sorting and searching. We use DNS to find pages on the internet; why not have some sort of information naming system to help with building a useful index? You could still do your ad serving, but to a less frustrated and potentially more receptive audience.

  9. I would say that sounds pretty insane. Isn't it difficult enough indexing billions of web pages anyway?

  10. Hitesh, are you mad? That is not working... I am tired of these complaints. I have been fighting since 2008 and it is now July 2010. I am thinking of joining Facebook. Google, if you are alive, then listen carefully: to hell with your m.Orkut. Goodbye.

  11. Those who put up dummy websites containing nothing but links to a target site should beware. Google is cracking down on such practices and will penalize violators.

  12. This is a good feature for Google to help track dummy sites that aim to defraud people. When is this going to be implemented?

