Google Pushes Open Source OCR
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
Build instructions are outdated (Score:2, Informative)
svn co http://ocropus.googlecode.com/svn/trunk/ ocropus
Re:The beginning of the end? (Score:5, Informative)
Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.
A good captcha uses a nonsense string of characters in mixed case, all skewed and distorted, with extra geometric elements obscuring the characters. This takes away roughly half of the clues that OCR relies on.
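To make the point about language clues concrete, here is a minimal sketch of dictionary-assisted correction. It is not how OCRopus or any particular engine works; the tiny word list, the `correct` helper, and the 0.8 cutoff are all assumptions chosen for illustration, and the matching is done with Python's standard `difflib`.

```python
import difflib

# Hypothetical mini-dictionary; a real system would use a far larger word list.
WORDS = ["character", "recognition", "document", "library", "google"]

def correct(token, words=WORDS, cutoff=0.8):
    """Return the closest dictionary word, or None if nothing is close enough."""
    matches = difflib.get_close_matches(token.lower(), words, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A real word with one misread character can be repaired from context.
print(correct("rec0gnition"))
# A captcha-style nonsense string has no dictionary neighbour to fall back on.
print(correct("xK7qPz"))
```

The first call recovers a word because the dictionary supplies the missing clue; the second has nothing to lean on, which is exactly the property a captcha is after.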
Re:So much for captcha (Score:4, Informative)
Re:The goal of the project (Score:3, Informative)
I'm wrong, eh?
Re:Wonderful! (Score:1, Informative)
Join Distributed Proofreaders [pgdp.net]
and do any or all of:
1) do some proofreading or formatting of a PG text
or
2) Smooth read a near-finished text looking for overlooked oddities
or
3) help improve DP's processing software. Lots of extra features wanted...
or
4) Get copyright clearance, scan a book and upload to DP's OCR pool
or
5) Run your Windows OCR under WINE like I do...
More details here [pgdp.net]
Re:captchas (Score:4, Informative)
Um, the whole point of CAPTCHAs is that they are human-readable (albeit often not easily so) but not machine-readable.
Re:Finally... (Score:2, Informative)
http://www.vividata.com/index.html [vividata.com]
It's not free software, but it works extremely well.
Well... (Score:5, Informative)
required_hits 5
score SARE_GIF_ATTACH 5
I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.
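For anyone wondering why those two lines are enough, here is a simplified model of threshold scoring. It is only a sketch of the idea, not SpamAssassin's actual implementation: the `is_spam` helper and the second rule with its score are hypothetical, and only `required_hits 5` and the `SARE_GIF_ATTACH` score of 5 come from the config above.

```python
# Threshold and first rule score are taken from the config above.
# HYPOTHETICAL_OTHER_RULE and its score are made up for illustration.
REQUIRED_HITS = 5.0
SCORES = {"SARE_GIF_ATTACH": 5.0, "HYPOTHETICAL_OTHER_RULE": 1.2}

def is_spam(rules_hit, scores=SCORES, required=REQUIRED_HITS):
    """Sum the scores of the rules that fired and compare with the threshold."""
    total = sum(scores.get(rule, 0.0) for rule in rules_hit)
    return total >= required

# One hit on the GIF rule alone reaches the threshold.
print(is_spam(["SARE_GIF_ATTACH"]))
# A low-scoring rule on its own does not.
print(is_spam(["HYPOTHETICAL_OTHER_RULE"]))
```

Because the rule's score equals the threshold, a single match is enough to flag the message, regardless of what else is in it.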
Already done - 1GB and counting! (Score:2, Informative)
Re:Well... (Score:2, Informative)
Brilliant. You just automatically blocked messages from companies whose PHBs insist on attaching a
Re:The beginning of the end? (Score:3, Informative)
As computing power continues to grow, that kind of assumption is less and less important. Ten years ago I worked for a speech recognition company that developed tools similar to what Google is now using for their 800-GOOG-411 search line. Back then the state of the art was to carefully guide what a caller was likely to say, and to rely on massive dictionaries to help with the recognition. Now, 10 years later, with more research and more powerful computers, it's much easier to develop more free-form speech recognition systems that can accurately recognize arbitrary strings of numbers and letters (account numbers, phone numbers, etc.). Given that the capabilities of speech recognition systems have grown so much, I'd be willing to bet that OCR capabilities have grown in similar ways.
More build info; Ubuntu Feisty (Score:5, Informative)
Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.
To build tesseract-ocr you must install autoconf.
If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.
I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.
To build ocropus I ended up installing jam, libaspell-dev and libtiff-dev.
Re:Finally... NOT so final... (Score:2, Informative)
1. Download and save the image.
2. If it's a gif, convert it to a jpg.
gif2jpg -a tmp.gif
3. Reduce the colors to 2 (black & white).
djpeg -colors 2 -greyscale -dither none tmp.jpg tmp.pnm
4. If there is a border, crop it off.
pnmcut a b c d tmp.pnm > OCR.pnm
(The dimensions a, b, c, d can be determined by any tool that returns useful info about an image. In general, remove 1 or 2 pixels from the edges to get rid of borders.)
5. OCR it.
gocr -n 1 OCR.pnm >> OCR.txt
Of course, this is all automated within the screen scraper, I just broke it out here to explain the steps.
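The steps above can be sketched as a small Python driver. This is not the poster's scraper, just one possible way to chain the same commands: the `build_steps` and `run_steps` helpers are hypothetical, the command names and flags are copied verbatim from the post and not verified here, and the crop values are whatever you measured for your own image.

```python
import subprocess

def build_steps(gif, crop):
    """Return (argv, stdout_file, mode) tuples mirroring steps 2-5 above.

    crop is the (a, b, c, d) tuple handed to pnmcut unchanged.
    """
    return [
        # Step 2: convert the gif to a jpg.
        (["gif2jpg", "-a", gif], None, None),
        # Step 3: reduce the colors to 2.
        (["djpeg", "-colors", "2", "-greyscale", "-dither", "none",
          "tmp.jpg", "tmp.pnm"], None, None),
        # Step 4: crop the border; the shell '>' becomes a file opened for writing.
        (["pnmcut", *map(str, crop), "tmp.pnm"], "OCR.pnm", "wb"),
        # Step 5: OCR; the shell '>>' becomes a file opened for appending.
        (["gocr", "-n", "1", "OCR.pnm"], "OCR.txt", "ab"),
    ]

def run_steps(steps):
    """Run each command in order, raising on the first failure."""
    for argv, out_path, mode in steps:
        if out_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(out_path, mode) as out:
                subprocess.run(argv, check=True, stdout=out)

# Example plan with made-up crop values; pass it to run_steps() to execute.
plan = build_steps("tmp.gif", (1, 1, 98, 28))
```

Keeping the plan separate from its execution makes it easy to print or inspect the commands before running them against a real image.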
For CAPTCHAs, you have to demorph the severely distorted images after step 4, before you OCR them. I'm still working on the demorpher, but it's about 50% accurate now. Basically, it unstretches long strings of pixels to the average of other strings of pixels in the x and y axis. It works even better if you determine the angle of the pixel string and shrink along that, with some rotation to the nearest x or y axis.
Re:More build info; Ubuntu Feisty (Score:1, Informative)