
February 9, 2007

Powerset, Natural Language Search Engine

"I think there's a ton of challenges, because in my view, search is in its infancy, and we're just getting started. I think the most pressing, immediate need as far as the search interface is to break paradigm of the expectation of "You give us a keyword, and we give you 10 URL's". I think we need to get into richer, more diverse ways you're able to express their query, be it though natural language, or voice, or even contextually. I'm always intrigued by what the Google desktop sidebar is doing, by looking at your context, or what Gmail does, where by looking at your context, it actually produces relevant webpages, ads and things like that. So essentially, a context based search."
(Marissa Mayer, VP at Google)

The New York Times reports that Powerset, a search start-up, has licensed natural language technology from the famous Palo Alto Research Center (PARC). Its purpose: "build a search engine that could some day rival Google".

Unlike keyword-based search engines like Google, Powerset wants to let users type questions in natural language, by developing a system that recognizes and represents the implicit and the explicit meaning in a text.

The problem is that even if Powerset has great algorithms for understanding the meaning of a query (and there are no foolproof algorithms for that), building a search engine requires huge infrastructure and processing power. Fernando Pereira, a natural language expert at the University of Pennsylvania, even questions whether PARC's NLP technology is a good approach for search: "The question of whether this technology is adequate to any application, whether search or anything else, is an empirical question that has to be tested".

Besides, Google's own approaches for delivering answers show that it's hard to give a single relevant answer for most queries, which are by default ambiguous. Google is rather inclined to use its huge corpus and apply statistical algorithms instead of using grammar rules. Peter Norvig, director of research at Google, says: "I have always believed (well, at least for the past 15 years) that the way to get better understanding of text is through statistics rather than through hand-crafted grammars and lexicons. The statistical approach is cheaper, faster, more robust, easier to internationalize, and so far more effective." Google uses statistics for machine translation, question answering, spell checking and more.
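To make the statistical approach concrete, here is a minimal sketch of a corpus-driven spell corrector in the spirit Norvig describes: candidate corrections are generated mechanically and ranked purely by how often they occur in a text corpus, with no hand-crafted lexicon or grammar rules. This is not Google's code; the corpus file corpus.txt and the single-edit candidate generation are illustrative assumptions.

```python
# Minimal corpus-driven spell corrector (illustrative sketch, not Google's code).
# Assumes "corpus.txt" is a plain-text file of representative English text.
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

# Word frequencies from the corpus act as a crude statistical language model.
WORD_COUNTS = Counter(words(open("corpus.txt").read()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    return {w for w in candidates if w in WORD_COUNTS}

def correct(word):
    """Pick the most frequent known candidate; no grammar or lexicon rules."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=WORD_COUNTS.get)

print(correct("speling"))  # likely "spelling", if the corpus contains it
```

The same idea scales up: with enough data, counting what people actually write replaces the rules a linguist would otherwise have to hand-code.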

People tend to be lazy and type queries that average only 2-3 words - that doesn't give a natural language search engine much to work with, so it would have to ask more in-depth questions to refine your query. For a lot of queries (e.g. navigational queries, like "snap"), you'd spend more time refining the ambiguous query than you would with a plain keyword search. Google instead tries to balance the top results among the likely interpretations, with the most important pages first.

Powerset might be launched at the end of the year. Hakia, another search engine that uses NLP, is already available, but its results don't look promising.

4 comments:

  1. "search is in its infancy" - well, look at google. what have they done after the page rank thing? there are many bugs in it - cross-site-linking, google bombs... dah! google only deals with additional services, search is in the 2nd place.

  2. I am really interested in natural language (NL) computing, but am not really convinced of the relevance of an NL-based search engine.

    In so far as people have been used to searching the Net with keywords, it may be difficult to talk them into searching with NL. If they cannot find any relevant information with an NL query, they will turn back to keyword-based queries.
    We have adapted our minds to think with keywords, and we are using more and more tags to sort and access information; I am not convinced the NL->keyword preprocessing will be efficient, at least not more efficient than keyword input.

    NL is technically beautiful, but as long as we keep inputting data with keyboards, we unconsciously sort data and assign keywords to what we have in mind prior to giving it to the computer system. However, if we imagine a search engine with voice recognition input, then NL preprocessing cannot be bypassed.

    Maybe keywords are rather adapted to written input and NL to oral input?

  3. I think NL may have some application in the search box, but limited. It's always going to be the specific keywords that fundamentally determine which pages are returned.

    NL could be helpful in determining the general grammatical form of the search query (e.g. posed as a question), though Google already sort of does this anyway.

    Even where comprehending the grammar might disambiguate an otherwise ambiguous associated search term, statistical/AI methods will likely account for such disambiguations anyway via word order and proximity.

    Also, look where Google is going with machine translation. By using statistical models rather than handcoded grammars, they essentially ARE building an accurate grammar of the materials used in training the system. It IS NL processing, of a sort - just not handbuilt. And Google's approach has already beaten the older, handbuilt translation engines out there.

    Where NL will really find an application in search, I think, is when Google gets to automatically translating foreign-language sites such that they are both searched and displayed in the querier's chosen language. Here, differences in meaning implied by grammatical structures will need to be considered both over the submitted search query, and within the index itself. Though again, a comprehensive statistical approach is still going to win over a handbuilt approach.

  4. You should take a look at www.linguisticagents.com. It's a start-up company that has developed a natural language understanding technology that will be used in many applications in addition to search. This technology uses a deep parsing algorithm that is based on nano-syntax technology.

