An unofficial blog that watches Google's attempts to move your operating system online since 2005. Not affiliated with Google.

Send your tips to gostips@gmail.com.

April 10, 2007

Open-Source OCR Software, Sponsored by Google

Google is sponsoring the development of an open-source OCR system at the IUPR research group. "OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities."
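
To make the "pluggable" idea concrete, here is a minimal sketch of how such a pipeline could be wired together. It is not taken from OCRopus: every name in it (`Pipeline`, `split_lines`, `toy_recognizer`, the word counts) is a hypothetical stand-in, and the "recognizer" only simulates one classic OCR confusion ("rn" read in place of "m") on plain text instead of working on real images.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names here are hypothetical illustrations, not the OCRopus API.
Region = str             # stand-in for an image region (one text line)
Candidates = List[str]   # alternative readings of a single word


@dataclass
class Pipeline:
    layout: Callable[[str], List[Region]]              # pluggable layout analysis
    recognizer: Callable[[Region], List[Candidates]]   # pluggable character recognition
    word_counts: Dict[str, int]                        # toy statistical language model

    def pick(self, candidates: Candidates) -> str:
        # Choose the reading the language model has seen most often.
        # On a tie, max() keeps the first candidate (the raw reading).
        return max(candidates, key=lambda w: self.word_counts.get(w, 0))

    def run(self, page: str) -> str:
        lines = []
        for region in self.layout(page):
            words = [self.pick(c) for c in self.recognizer(region)]
            lines.append(" ".join(words))
        return "\n".join(lines)


def split_lines(page: str) -> List[Region]:
    # Trivial layout analysis: every non-empty line is one region.
    return [line for line in page.splitlines() if line.strip()]


def toy_recognizer(region: Region) -> List[Candidates]:
    # Simulated recognizer: offers "m" as an alternative wherever it read "rn".
    result = []
    for token in region.split():
        alternatives = [token]
        if "rn" in token:
            alternatives.append(token.replace("rn", "m"))
        result.append(alternatives)
    return result


pipeline = Pipeline(split_lines, toy_recognizer, {"model": 50, "document": 40})
result = pipeline.run("rnodel docurnent\n\nanalysis")
print(result)
```

Because the layout step and the recognizer are passed in as plain functions, either one can be swapped without touching the rest of the pipeline, which is the property the project description emphasizes.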

"The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use," explains Thomas Breuel, who leads the project.

The software is partly based on Tesseract, currently the best open-source OCR engine available. The project is expected to be released at the end of next year and will be used for Google's book scanning project, but the team also has some interesting applications in mind:

* a web service interface
* PDF, camera, and screen OCR
* integration with desktop search tools: Beagle, Spotlight, Google Desktop

The most popular OCR programs are ABBYY FineReader, OmniPage, Readiris and Presto OCR, but they're pretty expensive (starting at $100). A decent way to perform OCR on a document is Microsoft Office Document Imaging, included in Microsoft Office XP/2007. Microsoft Office OneNote 2007 also lets you OCR imported images. A free online alternative is scanR, a site that lets you digitize documents by emailing a photo taken with a mobile phone.

31 comments:

  1. Dear Google,

    Please release an OSX version ASAP!! This sounds great!

    -- ge

  2. It's nice to see investment in basic OCR technologies. In the past, open source OCR really hasn't come close to the performance level of commercial packages (scanR has 2 OCR vendors).

    scanR will extract the text and create a searchable PDF file from any picture of a document sent to doc@scanR.com. You can search your documents on www.scanR.com or forward the PDF in email. For business cards, scanR will tag each word with its context and create a vCard. Just send pictures of business cards to bc@scanR.com. We'll even sync them with your Plaxo account.

    Thanks for the post on scanR.

  3. Fantastic. Well done. As a visually impaired person, I think this could have a great impact on the assistive technology market, specifically on the price of other applications in this area, such as Kurzweil 1000.

  4. With Google's statistics on words/phrases and images, the OCR will be able to work more intelligently while improving accuracy.

  5. One day, we can use ocr.google.com.

    I upload a few images of scanned documents, then Google will return text or PDF.

    I think Google will probably charge a reasonable fee for such a service. Google will encourage users to store the text of their PDF files online, so it can easily analyze the corrections users make and use them to improve its word recognition models.

  6. If the people from scanr.com can make the service (partially) free, I'm sure Google's online OCR service will be free. With some limitations.

    They could also add OCR options when you import PDFs in Google Docs, for Gmail attachments, in Google Desktop etc.

  7. This is just the right thing to invest money into; even without charging for the end product.
    I have this sort of Google-hostile human-friendly site ( grafted.interface ) and it seems that it is scheduled for an involuntary SE optimization in the near future.
    For some feedback, check out my short treatise ( structural formulae searchability ) on how SEs have neglected a kind of information that is, IMO, very important in many fields of science.

  8. There's bound to be an ad-funded business model for this - users will be gasping for it!

  9. I'd love to have something like this. I've just tested a bunch of OCR converters and they really didn't do the job -- they fail when it comes to converting pages that have both text and images on them. I'm trying to convert PDFs and save them as Microsoft Word (or PowerPoint) documents, and nothing quite seems to do what I want, which makes me reluctant to purchase something without trying it.

  10. Hope you can work with the OpenOffice and SANE guys to create an integrated system (i.e. scan a document and it automatically gets OCRed and appears in OpenOffice).

  11. What about using your digital camera for OCR? You should check out TopOCR. It is freeware and has more features and better accuracy than some of the expensive commercial products.

  12. Hello everyone,

    Can someone tell me how we can reach Google to ask it to sponsor our graduation project?

    Who do we contact?

    Thanks in Advance :)

  13. There is also OCR Terminal - a web service that allows you to upload scanned documents/ images in a variety of formats and download OCRed text in doc, rtf and so on.

  14. You also have mobitra.mobi/image to get the recognition and/or translation of the text contained in an image, specially designed for mobile phones.

  15. This is an absolutely fantastic project. Once it is complete, I am sure the way we deal with documents will undergo a sea change and more and more software products will be built around it. Google, all the best!

  16. Just trying to unzip the download file is a pain.

  17. It is a pain to download and get working and there is also a lot of work to be done on it, but the idea of having an open source OCR program being actively developed is exciting!

  18. If there is a god there will be a release for OS X.

  19. In that case I'm god; I'm gonna port it to OS X this weekend :D

  20. Hey I'm doing an OCR software project which should extract the text from the camera captured image file. Is there anybody who can help me out? I'm pretty short of time.

  21. This project will be really useful. There are some other free OCR tools available: Best PDF OCR Tools to Convert Scanned images to Text / Word Documents

  22. Has anyone tried WatchOCR from http://www.watchocr.com ? It is simple, open source, and seems to work very well for making searchable pdfs.

    Replies
    1. WatchOCR is a really nice OS with an OCR system, and it works fine! I started to script a GED web interface (PHP/MySQL) that uses pdftotxt to parse content and inject it into a database.

  23. This is great. I am sure the OCR will become more intelligent and keep improving.

  24. This is great. I'm visually impaired, and the info you have provided is invaluable!

  25. PrimeOCR (primerecognition.com ) is a commercial package.

  26. I use ABBYY FineReader, and I can say that it is wonderful software! I can't help but agree that FineReader is pretty expensive, but it is still worth the money. Besides, it supports Arabic OCR!

  27. Where can I get a complete version of this software?

  28. I am trying to make it for my own language.
