August 30, 2006

Google Relaunches OCR Software

Tesseract is an OCR engine developed at HP Labs between 1985 and 1995. HP then decided to abandon OCR research, and the software's development was frozen for ten years. Last year, HP released Tesseract as open source (under the Apache License), and Google, together with a research institute, has continued its development. Now Google announces that the new version is pretty stable and that it's the best open source OCR engine.

"A few things to know about Tesseract OCR: for now it only supports the English language, and does not include a page layout analysis module (yet), so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages out there. Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other Open Source OCR package out there."

OCR is useful for Google Book Search, and it could also be useful for Picasa or Image Search, alongside an object recognition engine. And if Google improves the software, it could become a successful alternative to commercial applications. Currently, the software has no UI, and it runs on Linux and Windows.


