October 31, 2008

Google Uses OCR to Index Scanned PDF Files

Google has started to index the full text of "scanned" PDF files using a technique called OCR (optical character recognition). "Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text themselves. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world," says Evin Levey.

The great thing about the new feature is that you won't notice it unless you look for it, but it improves the quality of Google's search results. Google doesn't mention how many of the 300 million indexed PDF files were converted into text, but you can see some examples if you search for [repairing aluminium wiring] or [Steady success in a volatile world] and click on "View as HTML".


Google sponsors an open-source OCR system called OCRopus, and it's likely that Google used it to index PDF files from the web. "OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. (...) It's initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications."

11 comments:

  1. It's interesting that Google would sponsor a project that directly impacts their main form of spam control, Captchas.

  2. Google, I think, actually has some interest in spam remaining, though not overtly. In one sense, the more pages people view, the more money they make. In fact, I'd imagine that spam is more likely to have context ads that people are more willing to click on than most legitimate emails/communication, or spam wouldn't be profitable. Also, if you've ever been on the Google Groups website, you'll know they've really got a lot of work to do to improve spam control on their own groups (the Android group, for example, has lots of spam!).

  3. @brad dunbar: A decent Captcha shouldn't be as easy to machine-decipher as the text targeted by OCR, anyway.

  4. Wow.. every day you are making us happy with your engine innovations. BIG THANKS!

  5. The whole idea of Captchas is to fool OCR technology. In most captchas, the text is twisted or has other elements added to it, which makes it difficult for OCR to read. However, there are some OCR-based programs that can read captchas with a certain level of accuracy.

  6. sounds familiar

    http://scribd.com/

    *cough* *cough*

  7. Yeah! This news makes me feel good! :)

  8. Google never fails to impress me with its innovative ways to provide accurate and detailed search results that overcome black hat sites. Thanks, Google, for all the free services that impact my web experience every day. It won't be long until 71% market share is 100%. Go Google.

  9. This is a great article, but it would be cooler if there was a tab for certain types of files...

  10. Best PDF OCR Tools to Convert Scanned Images to Text / Word Documents
    http://www.globinch.com/2010/06/08/best-pdf-ocr-tools-to-convert-scanned-images-to-text-word-documents/

  11. Is Google able to search in PDF files?

