An unofficial blog that watches Google's attempts to move your operating system online since 2005. Not affiliated with Google.

Send your tips to gostips@gmail.com.

October 31, 2008

Google Uses OCR to Index Scanned PDF Files

Google has started to index the full text of "scanned" PDF files using a technique called OCR (optical character recognition). "Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text themselves. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world," says Evin Levey.

The great thing about the new feature is that you won't notice it unless you look for it, but it improves the quality of Google's search results. Google doesn't mention how many of the 300 million indexed PDF files were converted into text, but you can see some examples if you search for: [repairing aluminium wiring], [Steady success in a volatile world] and click on "View as HTML".


Google sponsors an open-source OCR system called OCRopus, and it's likely that Google used it to index PDF files from the web. "OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. (...) It's initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications."
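To get a feel for what "pluggable" means in that description, here is a toy sketch in Python. It is not the OCRopus API and says nothing about how Google actually runs its indexing: every name in it (`OcrPipeline`, `skip_blank_rows`, `toy_recognizer`, `vocabulary_model`) is made up for illustration, the "page image" is just a list of strings, and a tiny word list stands in for real statistical language modeling.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List

# Hypothetical stand-in for a scanned page: one string per row of "glyphs".
Page = List[str]


@dataclass
class OcrPipeline:
    """Three swappable stages, loosely mirroring the quoted feature list."""
    analyze_layout: Callable[[Page], List[str]]   # page -> text lines
    recognize: Callable[[str], List[str]]         # line -> candidate readings
    pick_best: Callable[[List[str]], str]         # candidates -> one reading

    def run(self, page: Page) -> str:
        lines = self.analyze_layout(page)
        return "\n".join(self.pick_best(self.recognize(line)) for line in lines)


def skip_blank_rows(page: Page) -> List[str]:
    # Toy layout analysis: keep non-empty rows and trim the margins.
    return [row.strip() for row in page if row.strip()]


# Glyphs a recognizer might confuse (assumed for this example only).
AMBIGUOUS = {"0": "0o", "1": "1l"}


def toy_recognizer(line: str) -> List[str]:
    # Toy character recognition: emit every reading of the ambiguous glyphs.
    options = [AMBIGUOUS.get(ch, ch) for ch in line]
    return ["".join(chars) for chars in product(*options)]


VOCAB = {"repairing", "aluminium", "wiring", "volatile", "world"}


def vocabulary_model(candidates: List[str]) -> str:
    # Toy "language model": prefer the reading with the most known words.
    def score(text: str) -> int:
        return sum(word in VOCAB for word in text.lower().split())
    return max(candidates, key=score)


pipeline = OcrPipeline(skip_blank_rows, toy_recognizer, vocabulary_model)
scanned_page = ["  repairing a1uminium wiring ", "", "v0lati1e w0r1d"]
result = pipeline.run(scanned_page)
print(result)
```

Because each stage is just a function passed into `OcrPipeline`, you could swap in a different recognizer or language model without touching the other two, which is the general idea behind a pluggable design.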
