Google Uses OCR to Index Scanned PDF Files

October 31, 2008

Google has started to index the full text of "scanned" PDF files using a technique called OCR (optical character recognition). "Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text themselves. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world," says Evin Levey.

The great thing about the new feature is that you won't notice it unless you look for it, yet it improves the quality of Google's search results. Google doesn't say how many of the 300 million indexed PDF files were converted into text, but you can see some examples if you search for [repairing aluminium wiring] or [Steady success in a volatile world] and click on "View as HTML".

Google sponsors an open-source OCR system called OCRopus, and it's likely that Google used it to index PDF files from the web. "OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. (...) It's initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications."

11 comments:

brad dunbar, October 31, 2008 at 6:30 AM
It's interesting that Google would sponsor a project that directly impacts their main form of spam control, Captchas.
Anonymous, October 31, 2008 at 9:18 AM
Google, I think, actually has some interest in spam remaining, though not overtly. In one sense, the more pages people view, the more money they make. In fact, I'd imagine that spam is more likely to have context ads that people are more willing to click on than most legitimate emails/communication, or spam wouldn't be profitable. Also, if you've ever been on the Google Groups website, you'll know they've really got a lot of work to do to improve spam control on their own groups (the Android group, for example, has lots of spam!).
Mysterius, October 31, 2008 at 12:58 PM
@brad dunbar: A decent Captcha shouldn't be as easy to machine-decipher as the text targeted by OCR, anyway.
Anonymous, November 2, 2008 at 1:37 PM
Wow.. every day you are making us happy with your engine innovations. BIG THANKS!
Haden, November 2, 2008 at 11:35 PM
The whole idea of Captchas is to fool OCR technology. If you look at most captchas, the text is normally twisted or has some other elements added along with it. This makes it difficult for OCR to read. However, there are some OCR-based programs available that are capable of reading captchas with a certain level of accuracy.
comment gravity well, November 15, 2008 at 3:54 PM
sounds familiar

http://scribd.com/

*cough* *cough*
lerm, February 1, 2010 at 3:26 PM
Yeah! This news makes me feel good! :)
Anonymous, February 11, 2010 at 7:00 AM
Google never fails to impress me with its innovative ways to provide accurate and detailed search results that overcome black hat sites. Thanks Google for all the free services that impact my web experience every day. It won't be long until 71% market share is 100%. Go Google.
Coupon John, April 19, 2010 at 4:56 AM
This is a great article, but it would be cooler if there was a tab for certain types of files...
Globinch, June 12, 2010 at 11:24 PM
Best PDF OCR Tools to Convert Scanned images to Text / Word Documents
http://www.globinch.com/2010/06/08/best-pdf-ocr-tools-to-convert-scanned-images-to-text-word-documents/
Cheap, May 2, 2012 at 12:37 AM
Google is able to search in PDF files?

Google Operating System

Unofficial news and tips about Google
