Peter Norvig gave a talk at UC Berkeley on September 25. Among other things, he talked about some Google projects that use artificial intelligence. He also said that a large corpus of data can be much more valuable than an efficient algorithm. One example of a project where Google uses a lot of data is Google Q&A, which extracts facts from web pages and delivers them as answers to common questions like "what is the population of Japan?". Google doesn't use predefined patterns; they learn the patterns from examples, as this approach is more scalable. They extract data by matching the patterns against the top results for a query.
The presentation is available online (MMS stream).
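To make the idea of learning patterns from examples and matching them against the top results a bit more concrete, here is a minimal sketch in Python. It is not Google's implementation, which the talk summary does not describe in detail. It assumes that a pattern is simply the text surrounding a known subject and its known value, that answers are numbers, and that the most frequently extracted value wins. The function names, the fictional countries and all the figures are made up for illustration.

```python
import re
from collections import Counter

# Assumption: the answers we look for are numbers such as 3,400,000 or 11.5
VALUE = r"(?P<value>\d+(?:[.,]\d+)*)"


def learn_contexts(seeds, sentences):
    """Collect the text around each known (subject, value) pair.

    Each context is a (prefix, middle) tuple: the text before the subject
    and the text between the subject and the value.
    """
    contexts = Counter()
    for subject, value in seeds:
        for sentence in sentences:
            start = sentence.find(subject)
            if start == -1:
                continue
            value_at = sentence.find(value, start + len(subject))
            if value_at == -1:
                continue
            prefix = sentence[:start]
            middle = sentence[start + len(subject):value_at]
            contexts[(prefix, middle)] += 1
    return contexts


def answer(subject, contexts, top_results):
    """Match every learned context against the result texts and vote."""
    votes = Counter()
    for (prefix, middle), weight in contexts.items():
        pattern = re.compile(
            re.escape(prefix) + re.escape(subject) + re.escape(middle) + VALUE
        )
        for text in top_results:
            for match in pattern.finditer(text):
                votes[match.group("value")] += weight
    return votes.most_common(1)[0][0] if votes else None


# Toy data: fictional countries and invented figures.
seeds = [("Freedonia", "1,200,000"), ("Sylvania", "850,000")]
training_sentences = [
    "The population of Freedonia is 1,200,000.",
    "Sylvania has a population of 850,000 people.",
    "Freedonia is a small country.",
]
top_results = [
    "The population of Ruritania is 3,400,000. Its capital is Strelsau.",
    "Ruritania has a population of 3,400,000 according to the last census.",
    "Some say Ruritania has a population of 3,500,000.",
]

contexts = learn_contexts(seeds, training_sentences)
print(answer("Ruritania", contexts, top_results))  # 3,400,000
```

No pattern is written by hand here: both come from the seed examples, so adding more examples adds more patterns, which is presumably why this kind of approach scales better than predefined patterns.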
They extract data by matching the patterns against the top results for a query.
Yep.
And then they ban the site from which they take the data....
Search for: % of African American in Los Angeles. At the top of the page is a Google Q&A answer - 11% - based on data retrieved from this page.
But the site itself has been removed from the main index. No pages show up for the site: command. Nor for the allinurl: command. Nor for the info: command.
Crawling has been done pretty recently. Here is the September 2006 cache.
Thus - a site that has been deemed good enough to serve as an Answer reference for the Google Q&A feature – yet has been banned from the main index.
Sneaky? Evil?
Yep.
I think that's more "stupid" than "evil". After all, if one of their algorithms considers the source good and another considers it bad, then that's not intentional, it's a glitch. Other than that, the Q&A feature *does* link to Idcide.com, so fair attribution is given.
What Google does may be a lot of things – but it ain’t stupid.
It is not that the algorithm controlling the SERPs considers the site to be a bad source of data. It is actually the other way around – it considers it a good source of data and ranks it high. So high as to trigger a “how come this site ranks that high for these keywords?” flag.
As Matt “noticed”, the site has commercial content and new ownership.
Those two things have nothing to do with the quality of the SERPs and everything to do with Google’s business interest.
Attribution is not the problem here – intentional interference in commerce is.