December 16, 2007

Google Is All About Large Amounts of Data


In a very interesting interview from October, Google VP Marissa Mayer acknowledged that having access to large amounts of data is in many instances more important than creating great algorithms.
Right now Google is really good with keywords, and that's a limitation we think the search engine should be able to overcome with time. People should be able to ask questions, and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions -- not about what words will appear on the page but more like "what is this about?" A lot of people will turn to things like the semantic Web as a possible answer to that. But what we're seeing actually is that with a lot of data, you ultimately see things that seem intelligent even though they're done through brute force.

When you type in "GM" into Google, we know it's "General Motors." If you type in "GM foods" we answer with "genetically modified foods." Because we're processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart like it achieved that semantic understanding, but it hasn't really. It has to do with brute force. That said, I think the best algorithm for search is a mix of both brute-force computation and sheer comprehensiveness and also the qualitative human component.
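
The acronym example can be made concrete with a minimal sketch: pick the expansion whose words co-occur most often with the rest of the query. Everything below (the COOCCUR counts, the EXPANSIONS table) is invented for illustration; a real system would derive such statistics from a huge document corpus, not a hand-written table.

```python
from collections import Counter
from itertools import product

# Hypothetical co-occurrence counts: how often two words appear together
# in some large corpus. Keys are unordered word pairs.
COOCCUR = Counter({
    frozenset(["modified", "foods"]): 80,
    frozenset(["genetically", "foods"]): 60,
    frozenset(["general", "foods"]): 5,
    frozenset(["motors", "foods"]): 2,
    frozenset(["general", "recall"]): 30,
    frozenset(["motors", "recall"]): 40,
})

# Hypothetical expansion candidates for each acronym.
EXPANSIONS = {"gm": ["general motors", "genetically modified"]}

def expand_acronym(acronym, context_words):
    """Pick the expansion whose words co-occur most with the query context."""
    def score(expansion):
        return sum(COOCCUR[frozenset([e, c])]
                   for e, c in product(expansion.split(), context_words))
    candidates = EXPANSIONS.get(acronym.lower(), [])
    return max(candidates, key=score, default=acronym)

print(expand_acronym("GM", ["foods"]))   # -> genetically modified
print(expand_acronym("GM", ["recall"]))  # -> general motors
```

No linguistic knowledge is involved: with enough counted data, the "semantic" disambiguation falls out of brute-force statistics, which is exactly Mayer's point.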

Marissa Mayer also admitted that the main reason Google launched its free 411 service was to gather the data needed to train speech recognition algorithms.
You may have heard about our [directory assistance] 1-800-GOOG-411 service. Whether or not free-411 is a profitable business unto itself is yet to be seen. I myself am somewhat skeptical. The reason we really did it is because we need to build a great speech-to-text model ... that we can use for all kinds of different things, including video search.

The speech recognition experts that we have say: If you want us to build a really robust speech model, we need a lot of phonemes, which is a syllable as spoken by a particular voice with a particular intonation. So we need a lot of people talking, saying things so that we can ultimately train off of that. ... So 1-800-GOOG-411 is about that: Getting a bunch of different speech samples so that when you call up or we're trying to get the voice out of video, we can do it with high accuracy.
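
To see why the number of distinct voices matters, here is a toy sketch (the call format and phoneme labels are invented, and this is in no way Google's pipeline): one simple way to measure the diversity of a speech corpus is to count how many different speakers have produced each phoneme.

```python
from collections import defaultdict

def phoneme_coverage(calls):
    """Count how many distinct speakers have produced each phoneme.

    calls: iterable of (speaker_id, list_of_phoneme_labels) pairs.
    """
    speakers = defaultdict(set)
    for speaker_id, phonemes in calls:
        for p in phonemes:
            speakers[p].add(speaker_id)
    return {p: len(ids) for p, ids in speakers.items()}

# Invented sample data: three callers saying "Google" and "maps".
calls = [
    ("caller-1", ["g", "uh", "g", "ax", "l"]),
    ("caller-2", ["g", "uw", "g", "ax", "l"]),
    ("caller-3", ["m", "ae", "p", "s"]),
]
print(phoneme_coverage(calls))  # e.g. {'g': 2, 'uh': 1, ...}
```

The more callers a free service like GOOG-411 attracts, the better this coverage gets, and the more robust the resulting acoustic model.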

Peter Norvig, director of research at Google, seems to agree. "I have always believed (well, at least for the past 15 years) that the way to get better understanding of text is through statistics rather than through hand-crafted grammars and lexicons. The statistical approach is cheaper, faster, more robust, easier to internationalize, and so far more effective." Google uses statistics for machine translation, question answering, spell checking and more, as you can see in this video. The same video explains that the more data you have, the better your AI algorithm will perform, even if the algorithm itself isn't the most sophisticated.
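
Spell checking is a good illustration of the statistical approach, and Norvig has published a famous toy spelling corrector along these lines. The condensed sketch below is in that spirit; the inline text is only a stand-in corpus so the sketch runs on its own, whereas a real corrector would count words from a very large text collection.

```python
import re
from collections import Counter

# Stand-in corpus; in practice WORDS would be counted from gigabytes of text.
CORPUS = "the spelling of a word is corrected by picking the most frequent word"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit of the input."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correct("speling"))  # -> spelling
```

Note that nothing here knows anything about English: the quality of the corrections depends almost entirely on how much text the word counts were trained on.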

Peter Norvig says that Google developed its own speech recognition technology. "We wanted speech technology that could serve as an interface for phones and also index audio text. After looking at the existing technology, we decided to build our own. We thought that, having the data and computational resources that we do, we could help advance the field. Currently, we are up to state-of-the-art with what we built on our own, and we have the computational infrastructure to improve further. As we get more data from more interaction with users and from uploaded videos, our systems will improve because the data trains the algorithms over time."

Google is in the privileged position of having access to large amounts of data that can be used to improve its other services.

3 comments:

  1. Hmmm, I think there may be more to this.

    Searching the entire Internet used to be a challenge. Some might say that Google was the first to do a really good job of it. But now Microsoft, Yahoo, and others are becoming competitive (not VERY competitive, but they are at least playing in the same league).

    As more companies create vast self-healing server farms, the number of companies in that club might grow. I think the enormous power of these vast armies of servers is Google's strength: what would be called, in other industries, a "barrier to entry".

    The more stuff there is to search now, the bigger that barrier will get. Books, photos, blogs, documents, maps and so on. Google is basically saying "we can handle all of this stuff and more, and do it so cheaply that we pay for it with advertising (and make a profit to boot)".

    Suppose that other big companies like Microsoft and Yahoo not only fail to catch Google in search or advertising, but also fail to make their ever growing server farms run at any sort of profit margin at all.

    Who is left holding all the marbles?

  2. I heard of such things around 10 years ago; this approach was used for voice recognition, OCR, and other AI applications. Statistical analysis of large amounts of data has been considered the most accurate, practical, and effective way of applying AI to commercial applications. Of course, from some academic point of view it is not elegant, and very much like "brute-force computation".

    Google has lifted the application and research of such analysis to a new frontier. More algorithms and models for statistical analysis are still emerging, and Google has even provided interfaces for researchers around the world to access the data. Good work, Google.

  3. The GOOG-411 service is AMAZING! I was out clubbing, and on my way back home at 4:00 am my car broke down and I had to call a towing company. Guess what I used: GOOG-411! It gives you so many options to choose from. It is just great!
