Last year I wrote a post about content separation that suggested a way to separate the main content of a page from the less interesting content around it. Most elements of a template (navigation, footer, etc.) could confuse search engines into thinking a page is about something other than what it actually covers. As a result, a page could end up ranking well for unrelated queries and not so well for the right ones.
As a solution to this problem, Yahoo introduces a 'robots-nocontent' class that can be added to any HTML tag.
"This tag is really about our crawler focusing on the main content of your page and targeting the right pages on your site for specific search queries. Since a particular source is limited to the number of times it appears in the top ten, it's important that the proper matching and targeting occur in order to increase both the traffic as well as the conversion on your site. It also improves the abstracts for your pages in results by omitting unrelated text from search result summaries.
To do this, webmasters can now mark parts of a page with a 'robots-nocontent' tag which will indicate to our crawler what parts of a page are unrelated to the main content and are only useful for visitors. We won't use the terms contained in these special tagged sections as information for finding the page or for the abstract in the search results."
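In practice it's just an ordinary class attribute; something like this (the markup below is only an illustration) would tell Yahoo's crawler to skip a navigation block:

    <div class="robots-nocontent">
      <a href="/">Home</a> | <a href="/archive">Archive</a> | <a href="/about">About</a>
    </div>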
While this could be useful to reduce the importance of unrelated parts of your site (like AdSense's section targeting), I can't help wondering whether this isn't really the search engine's job. For example, Google can detect the navigation links on a page (you can notice this if you use the mobile version), but I don't think it minimizes the importance of the keywords used in that area.
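For comparison, AdSense's section targeting uses HTML comments; if I recall the syntax correctly, de-emphasizing a block looks something like this:

    <!-- google_ad_section_start(weight=ignore) -->
    <div id="sidebar">blogroll, archives, other boilerplate</div>
    <!-- google_ad_section_end -->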
This is work for microformats. They even implemented it in a way similar to how microformats work (by attaching meaning to a class name). The idea is good.
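For instance, existing microformats like hCard rely on nothing more than class names (the values below are just an example):

    <div class="vcard">
      <span class="fn">Jane Doe</span>, <span class="org">Example Inc.</span>
    </div>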
However, there is already a draft for that which they didn't follow.
If this has to be done at all, it ought to be done in a robots.txt file using CSS/XPath selectors.
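Nothing like this exists in the current robots.txt standard, but hypothetically it could look like:

    # hypothetical directives, not supported by any crawler
    User-agent: *
    Nocontent: css(#sidebar, .footer)
    Nocontent: xpath(//div[@id='navigation'])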
Template detection in web pages is an "old" and active research topic first addressed by Broader.
See for instance this recent paper.
http://portal.acm.org/citation.cfm?id=1141534
This seems to be an interesting idea from Yahoo. Regarding your question about this "being a search engine's job", it's just like Sitemaps, introduced by Google and recently adopted by the whole industry.
It's a "search engine's job" that could benefit from a little help from webmasters ;)
I agree with Sergio. Webmasters have to help the search engine in order to optimize searches. The webmaster benefits from it (because they receive a lot more visitors) and the search engine also benefits, because its search capability is improved.
Good article.
Maybe that's a “search engine's job”, but it's a hard one. I think everyone agrees that giving webmasters the possibility to do this job themselves is much faster and requires less effort.
I thought that making this a microformat was a good idea; however, now that kjwa mentioned using robots.txt, I think he's right.
But unfortunately, the current robots.txt standard doesn't allow using XPath. And by the way, what about plain HTML pages? XPath shouldn't be applied to SGML, only to XML. Also, what about other content (not XML/SGML)?
Now I think both methods should be used.
Search engines are going to have to identify templates anyway. Not everyone is going to use the new class correctly or even use it at all.
I think that this new Yahoo idea just adds unnecessary complexity.