Google Operating System: Open-Source OCR Software, Sponsored by Google

April 10, 2007

Google is sponsoring the development of open-source OCR software at the IUPR research group. "OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities."

"The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use," explains Thomas Breuel, who leads the project.

The software is partly based on Tesseract, currently the best open-source OCR engine available. The project is expected to be released at the end of next year and will be used for Google's book scanning project, but the team also has some interesting applications in mind:

* a web service interface
* PDF, camera, and screen OCR
* integration with desktop search tools: Beagle, Spotlight, Google Desktop

The most popular OCR programs are ABBYY FineReader, OmniPage, Readiris and Presto OCR, but they're pretty expensive (starting at $100). A decent option for performing OCR on a document is Microsoft Office Document Imaging, included in Microsoft Office XP/2007. Microsoft Office OneNote 2007 also lets you OCR imported images. A free online alternative is scanR, a site that lets you digitize documents by emailing a photo taken with a mobile phone.

33 comments:

George Entenman, April 10, 2007 at 5:32 PM
Dear Google,

Please release an OSX version ASAP!! This sounds great!

-- ge
Unknown, April 11, 2007 at 2:39 AM
It's nice to see investment in basic OCR technologies. In the past, open source OCR really hasn't come close to the performance level of commercial packages (scanR has 2 OCR vendors).

scanR will extract the text and create a searchable PDF file from any picture of a document sent to doc@scanR.com. You can search your documents on www.scanR.com or forward the PDF in email. For business cards, scanR will tag each word with its context and create a vCard. Just send pictures of business cards to bc@scanR.com. We'll even sync them with your Plaxo account.

Thanks for the post on scanR.
Anonymous, April 11, 2007 at 2:42 AM
Fantastic. Well done. As a visually impaired person, I think this could have a great impact on the assistive technology market, specifically on the price of other applications in this area, such as Kurzweil 1000.
Zijian A, April 11, 2007 at 8:05 PM
With Google's statistics on words, phrases, and images, the OCR will be able to do more intelligent work while improving accuracy.
Zijian A, April 11, 2007 at 8:16 PM
One day, we will be able to use ocr.google.com.

I would upload a few images of scanned documents, and Google would return text or a PDF.

I think Google will probably charge a reasonable fee for such a service. Google will encourage users to store the text of their PDF files online, so it can easily analyze the corrections users make and use them to improve its word-recognition models.
Alex Chitu, April 12, 2007 at 1:04 AM
If the people from scanr.com can make the service (partially) free, I'm sure Google's online OCR service will be free. With some limitations.

They could also add OCR options when you import PDFs into Google Docs, for Gmail attachments, in Google Desktop, etc.
p, April 13, 2007 at 12:06 AM
This is just the right thing to invest money in, even without charging for the end product.
I have a sort of Google-hostile, human-friendly site (grafted.interface), and it seems that it is scheduled for an involuntary SE optimization in the near future.
For some feedback, check out my short treatise (structural formulae searchability) on how search engines have neglected a kind of information that is very important in many fields of science, IMO.
Simon Redding, April 16, 2007 at 2:39 PM
There's bound to be an ad-funded business model for this - users will be gasping for it!
Anonymous, April 30, 2007 at 3:12 PM
I'd love to have something like this. I've just tested a bunch of OCR converters and they really didn't do the job: they fail when it comes to converting pages that have both text and images on them. I'm trying to convert PDFs and save them as Microsoft Word (or PowerPoint) documents, and nothing quite seems to do what I want, which makes me reluctant to purchase something without trying it.
Anonymous, June 20, 2007 at 2:04 AM
I hope you can work with the OpenOffice and SANE guys to create an integrated system (i.e. scan a document and it automatically gets OCR'd and appears in OpenOffice).
Anonymous, September 5, 2007 at 4:07 PM
What about using your digital camera for OCR? You should check out TopOCR. It is freeware and has more features and better accuracy than some of the expensive commercial products.
Anonymous, September 12, 2007 at 12:46 PM
Hello everyone,

Can someone tell me how we can reach Google to ask it to sponsor our graduation project?

Who do we contact?

Thanks in advance :)
Anonymous, November 10, 2008 at 8:36 PM
There is also OCR Terminal, a web service that allows you to upload scanned documents and images in a variety of formats and download the OCRed text as doc, rtf and so on.
Anonymous, November 26, 2008 at 1:16 PM
There is also mobitra.mobi/image, which recognizes and/or translates the text contained in an image and is specially designed for mobile phones.
Anonymous, November 26, 2008 at 1:17 PM
the link of the explanation
Anonymous, December 26, 2008 at 5:49 PM
This is an absolutely fantastic project. Once it is complete, I am sure the way we deal with documents will undergo a sea change, and more and more software products will be built around it. Google, all the best!
Anonymous, March 1, 2009 at 6:56 PM
Just trying to unzip the download file is a pain.
Dummey, March 10, 2009 at 7:50 PM
It is a pain to download and get working and there is also a lot of work to be done on it, but the idea of having an open source OCR program being actively developed is exciting!
Unknown, July 24, 2009 at 2:07 PM
If there is a god there will be a release for OS X.
Haxorflex, October 2, 2009 at 8:32 AM
In that case I'm god; I'm gonna port it to OS X this weekend :D
Unknown, January 12, 2010 at 9:28 PM
Hey, I'm working on an OCR software project that should extract text from camera-captured image files. Is there anybody who can help me out? I'm pretty short of time.
Globinch, June 11, 2010 at 12:46 AM
This project will be really useful. There are some other free OCR tools available: Best PDF OCR Tools to Convert Scanned Images to Text / Word Documents
Anonymous, July 1, 2010 at 8:14 AM
Has anyone tried WatchOCR from http://www.watchocr.com ? It is simple, open source, and seems to work very well for making searchable PDFs.
ankit.nagpal, December 21, 2010 at 11:55 AM
One more OCR program. It works well for me.
Koowie, April 30, 2011 at 9:47 AM
This is great. I am sure the OCR will become more intelligent and keep improving.
john, June 25, 2011 at 12:16 AM
This is great! I'm visually impaired, and the info you have provided is invaluable!
Anonymous, January 24, 2012 at 2:20 PM
PrimeOCR (primerecognition.com) is a commercial package.
Anonymous, June 5, 2012 at 6:28 AM
I use ABBYY FineReader, and I can say that it is wonderful software! I can't help but agree that FineReader is pretty expensive, but it is still worth the money. Besides, it supports Arabic OCR!
Hank Hendricks, June 11, 2012 at 11:40 PM
Where can I get a complete version of this software?
Cheat Code Typer, August 24, 2012 at 7:19 AM
I am trying to make it work for my own language.
Yunmai OCR SDK, July 5, 2016 at 8:27 PM
When choosing OCR software, I always think about recognition accuracy and recognition speed. As far as I know, Yunmai Technology is also very professional in OCR technology. Yunmai Document Recognition works really well for me. The average recognition time for a document is less than 6 seconds, and the recognition accuracy can reach 99%. It can convert documents into PDF, Word, and text files.
Anonymous, April 29, 2017 at 1:45 PM
password using that,,,