December 9, 2006

Using Anchor Text to Translate Search Queries

Google was granted a new patent titled "Systems and methods for using anchor text as parallel corpora for cross-language information retrieval". Because queries are usually short and their terms can have multiple meanings, Google tries to pick the best translation of a query by looking at anchor text, so that the translated query returns relevant results.

The method includes receiving a search query that includes terms in a first language; determining possible translations of the terms of the search query into a second language; locating documents in the first language that match the terms of the search query; identifying documents in the second language that contain references to the first language documents; and disambiguating among the possible translations of the terms of the search query using the second language documents to identify one of the possible translations as a likely translation of the search query.

Here's an example:
Assume that a user provides a search query to the server in Spanish, but desires documents to be returned in English. Further, assume that the user desires documents relating to "banks interest." In this case, the query provided by the user may include the terms "bancos" and "interes." To facilitate English-language document retrieval, the server may translate the Spanish query to English.

The query translation engine may perform an initial translation of the terms of the query using, for example, the dictionary. In this case, the query translation engine finds that each of the terms of the query has more than one possible translation. For example, the Spanish word "bancos" could be translated as "banks" or "benches" (among other possibilities) in English. The Spanish word "interes" could be translated as "interest" or "concern" (among other possibilities) in English. The query translation engine disambiguates among the possible translations using documents identified by the search engine.
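The initial dictionary step can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the `SPANISH_TO_ENGLISH` dictionary and the `candidate_translations` function are hypothetical names, and the entries are limited to the words from the example above.

```python
from itertools import product

# Hypothetical toy dictionary; the entries come from the example in the text.
SPANISH_TO_ENGLISH = {
    "bancos": ["banks", "benches"],
    "interes": ["interest", "concern"],
}


def candidate_translations(query, dictionary):
    """Return every combination of per-term translations for a query."""
    terms = query.lower().split()
    # Assumption: a term missing from the dictionary is kept untranslated.
    options = [dictionary.get(term, [term]) for term in terms]
    return [tuple(combo) for combo in product(*options)]


candidates = candidate_translations("bancos interes", SPANISH_TO_ENGLISH)
for combo in candidates:
    print(" ".join(combo))
```

Two terms with two translations each already give four candidate queries, which is why a disambiguation step is needed.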

The search engine performs a search using the original Spanish query (i.e., "bancos interes") to identify Spanish-language documents that include anchors that contain all of the query terms and point to English-language documents. The search engine provides the English-language documents that are pointed to by the anchors to the query translation engine.
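A rough sketch of that anchor lookup, assuming the anchors have already been extracted and language-tagged. The `Anchor` record, its fields, the `targets_for_query` function and the example.com URLs are all made up for illustration; a real index would look very different.

```python
from dataclasses import dataclass


@dataclass
class Anchor:
    text: str         # visible link text
    source_lang: str  # language of the page that contains the link
    target_url: str   # page the link points to
    target_lang: str  # language of the linked page


def targets_for_query(query, anchors, source_lang="es", target_lang="en"):
    """Collect target pages whose anchor text contains all query terms."""
    terms = set(query.lower().split())
    urls = []
    for anchor in anchors:
        # Keep only links from the query language to the desired language.
        if anchor.source_lang != source_lang or anchor.target_lang != target_lang:
            continue
        anchor_words = set(anchor.text.lower().split())
        if terms <= anchor_words and anchor.target_url not in urls:
            urls.append(anchor.target_url)
    return urls


# Hypothetical sample data.
anchors = [
    Anchor("tasas de interes de bancos", "es", "http://example.com/rates", "en"),
    Anchor("bancos del parque", "es", "http://example.com/benches", "en"),
    Anchor("bancos interes", "es", "http://example.org/tasas", "es"),
]
english_pages = targets_for_query("bancos interes", anchors)
print(english_pages)
```

Only the first anchor qualifies: the second lacks one of the query terms, and the third points to a Spanish page.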

The query translation engine analyzes the text of the English-language documents to, for example, compute the frequency of co-occurrence of the various translation possibilities. Specifically, the query translation engine determines how often the word "banks" occurs with "interest," "banks" occurs with "concern," "benches" occurs with "interest," and "benches" occurs with "concern." Presumably, the query translation engine would determine that "banks" and "interest" are the most frequent combination and use these terms as the correct translation for the Spanish query "bancos interes."
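The co-occurrence count could look something like the sketch below. It assumes the simplest possible measure (the number of documents in which all words of a candidate appear together); the patent does not commit to this exact formula, and the `pick_translation` function and the sample documents are invented for the example.

```python
from collections import Counter
from itertools import product


def pick_translation(options_per_term, documents):
    """Return the candidate whose words co-occur in the most documents."""
    counts = Counter()
    for text in documents:
        words = set(text.lower().split())
        for combo in product(*options_per_term):
            # Count a document once if it contains every word of the candidate.
            if all(word in words for word in combo):
                counts[combo] += 1
    if not counts:
        return None  # no candidate co-occurs in any document
    return counts.most_common(1)[0][0]


# Hypothetical English documents reached through the Spanish anchors.
documents = [
    "Banks raise interest rates on savings accounts",
    "Local banks cut interest on loans",
    "Park benches attract interest from designers",
    "Banks voice concern over regulation",
]
options = [["banks", "benches"], ["interest", "concern"]]
best = pick_translation(options, documents)
print(best)
```

With this sample data, "banks" and "interest" appear together in two documents, while every other pairing appears in at most one, so that pair is chosen as the translation.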

Google hasn't implemented this method in its search engine yet. If documents could also be translated into the first language (your native language), you would need only one language to search the web.

3 comments:

  1. Weird, it says "December 5, 2006" and below "filed: August 28, 2001". I can't seem to open the attached drawings (the images link) by the way...

  2. Philipp, you should know that patents are filed long before they are approved.

    The images from patents are a mystery. Firefox says it needs a plugin (QuickTime) to display them. Of course I have QuickTime installed.

  3. Patents are so often criticised. But this seems like a legitimately un-intuitive idea.

