Google Pushes Open Source OCR


Posted by Zonk on Tuesday April 10, 2007 @02:02PM from the google-has-taken-all-knowledge-to-be-its-province dept.

SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"




Build instructions are outdated (Score:2, Informative)

by What the Frag ( 951841 ) * writes: on Tuesday April 10, 2007 @02:06PM (#18679073) Journal

Use this line to check out ocropus:

svn co http://ocropus.googlecode.com/svn/trunk/ ocropus

Re:The beginning of the end? (Score:5, Informative)

by lawpoop ( 604919 ) writes: on Tuesday April 10, 2007 @02:38PM (#18679619) Homepage Journal

I doubt it.

Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.

A good captcha has a nonsense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.
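To make the "common words" clue concrete, here is a minimal Python sketch of dictionary-based post-correction. The vocabulary, the function name `correct_word`, and the 0.6 cutoff are all assumptions made for illustration; this is not how OCRopus or any particular OCR engine is known to work internally.

```python
import difflib

# Tiny illustrative vocabulary (an assumption); a real system would use a large dictionary.
VOCABULARY = ["optical", "character", "recognition", "document", "library"]

def correct_word(raw, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest vocabulary word, or the raw token if nothing is close enough."""
    matches = difflib.get_close_matches(raw.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else raw

# A misread real word is pulled back toward the dictionary...
print(correct_word("recogniti0n"))  # recognition
# ...but a captcha-style random string offers no such clue and is left untouched.
print(correct_word("xK7qZp"))       # xK7qZp
```

With ordinary text the dictionary repairs the misread "0"; with a random string there is nothing to snap to, which is the clue a captcha takes away.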

Re:So much for captcha (Score:4, Informative)

by cyphercell ( 843398 ) writes: on Tuesday April 10, 2007 @02:42PM (#18679673) Homepage Journal

Captcha (warped text) will probably remain for a long time. This OCR has more practical uses when applied to text that is meant to be legible.

Re:The goal of the project (Score:3, Informative)

by UbuntuDupe ( 970646 ) * writes: on Tuesday April 10, 2007 @02:53PM (#18679851) Journal

Isn't that the same principle behind PGP? Correct me if I'm wrong (and I freely admit encryption is not my area of expertise), but to crack (in reasonable time) PGP-encrypted data, you have to solve a problem no one in the world has been able to solve yet (quick solution for a certain class of problems). Similarly, if captchas get to the point where you need a major theoretical advance to beat them, thanks to wide use of OCR-type programs, that would either foil all spammers, or cause them to solve a mathematically/AI significant problem.

I'm wrong, eh?

Re:Wonderful! (Score:1, Informative)

by Anonymous Coward writes: on Tuesday April 10, 2007 @02:59PM (#18679971)

This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.

Join the Distributed Proofreaders [pgdp.net]
and do any or all of:
1) do some proofreading or formatting of a PG text
or
2) Smooth read a near-finished text looking for overlooked oddities
or
3) help improve DP's processing software. Lots of extra features wanted...
or
4) Get copyright clearance, scan a book and upload to DP's OCR pool
or
5) Run your Windows OCR under WINE like I do...

More details here [pgdp.net]

Re:captchas (Score:4, Informative)

by arrrrg ( 902404 ) writes: on Tuesday April 10, 2007 @02:59PM (#18679977)

Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

Um, the whole point of CAPTCHAS is that they are human-readable (albeit often not easily so) but not machine-readable.

Re:Finally... (Score:2, Informative)

by stilbon ( 69689 ) writes: on Tuesday April 10, 2007 @03:04PM (#18680029)

Vividata OCR Shop XTR

http://www.vividata.com/index.html [vividata.com]

It's not free software, but it works extremely well.

Well... (Score:5, Informative)

by Shawn is an Asshole ( 845769 ) writes: on Tuesday April 10, 2007 @03:40PM (#18680543)

If you're sick of image spam, you can do what I did. Add the OpenProtect [openprotect.com] channel to SpamAssassin and then add these lines to your SpamAssassin config:

required_hits 5
score SARE_GIF_ATTACH 5

I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.

Already done - 1GB and counting! (Score:2, Informative)

by Anonymous Coward writes: on Tuesday April 10, 2007 @03:41PM (#18680571)

Where have you been lately? Picasaweb.google.com has already increased from a mere 250MB to 1GB+ and counting!

Re:Well... (Score:2, Informative)

by Auntie Virus ( 772950 ) writes: on Tuesday April 10, 2007 @04:07PM (#18681011)

"required_hits 5 score SARE_GIF_ATTACH 5 I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day."

Brilliant. You just automatically blocked messages from companies whose PHBs insist on attaching a .gif of the company logo. SARE_GIF_ATTACH is ok with a lower score, adding to other scoring parameters. What you REALLY want for image spam is the FuzzyOCR plugin.
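The point about lower scores follows from how SpamAssassin adds up the scores of every rule that fires and compares the total with the required threshold. A minimal Python sketch of that arithmetic is below; the values 1.5 and 4.0 and the rule name HYPOTHETICAL_OTHER_RULE are assumptions for illustration, not recommended settings.

```python
def is_spam(hit_rules, scores, required=5.0):
    """Sum the scores of the rules that fired and compare the total to the threshold."""
    total = sum(scores.get(rule, 0.0) for rule in hit_rules)
    return total >= required, total

# GIF rule scored at the full threshold: a lone logo attachment is enough to block the mail.
aggressive = {"SARE_GIF_ATTACH": 5.0}

# Lower score (assumed value): the GIF rule only matters alongside other evidence.
# HYPOTHETICAL_OTHER_RULE is a made-up rule name standing in for any other spam test.
moderate = {"SARE_GIF_ATTACH": 1.5, "HYPOTHETICAL_OTHER_RULE": 4.0}

print(is_spam(["SARE_GIF_ATTACH"], aggressive))
print(is_spam(["SARE_GIF_ATTACH"], moderate))
print(is_spam(["SARE_GIF_ATTACH", "HYPOTHETICAL_OTHER_RULE"], moderate))
```

With the lower score, a message that only trips the GIF rule stays under the threshold, while one that also trips another rule still gets caught.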

Re:The beginning of the end? (Score:3, Informative)

by Iphtashu Fitz ( 263795 ) writes: on Tuesday April 10, 2007 @04:19PM (#18681191)

Part of what makes OCR work is that it assumes that the text was written to communicate meaning.

As computing power continues to grow, that kind of assumption is less and less important. Ten years ago I worked for a speech recognition company that developed tools similar to what Google is now using for their 800-GOOG-411 search line. Back then the state of the art was to carefully guide what a caller was likely to say, and to rely on massive dictionaries to help with the recognition. Now, 10 years later, with more research and more powerful computers, it's much easier to develop more free-form speech recognition systems that can accurately recognize arbitrary strings of numbers and letters (account numbers, phone numbers, etc.). Given that the capabilities of speech recognition systems have grown so much, I'd be willing to bet that OCR capabilities have grown in similar ways.

More build info; Ubuntu Feisty (Score:5, Informative)

by drinkypoo ( 153816 ) writes: <drink@hyperlogos.org> on Tuesday April 10, 2007 @04:25PM (#18681295) Homepage Journal

Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.

To build tesseract-ocr you must install autoconf.

If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.

I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.

To build ocropus, I ended up installing jam, libaspell-dev and libtiff-dev.

Re:Finally... NOT so final... (Score:2, Informative)

by WrongDecision ( 803195 ) writes: on Wednesday April 11, 2007 @12:03AM (#18685289)

Actually, GOCR works very well (100%) on the image-based text that some sites use to prevent screen scraping.
1. Download and save the image.
2. If it's a gif, convert it to a jpg.
gif2jpg -a tmp.gif
3. Reduce the colors to 2 (black & white).
djpeg -colors 2 -greyscale -dither none tmp.jpg tmp.pnm
4. If there is a border, crop it off.
pnmcut a b c d tmp.pnm > OCR.pnm
(The dimensions a,b,c,d can be determined by any tool that returns useful info about an image, in general remove 1 or 2 pixels from the edges to get rid of borders.)
5. OCR it.
gocr -n 1 OCR.pnm >> OCR.txt
Of course, this is all automated within the screen scraper, I just broke it out here to explain the steps.
For CAPTCHAs, you have to demorph the severely distorted images after step 4, before you OCR it. I'm still working on the demorpher, but it's about 50% accurate now. Basically, it unstretches long strings of pixels to the average of other strings of pixels in the x and y axis. Works even better if you determine the angle of the pixel string and shrink on that, along with some rotation to the nearest x or y axis.
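Steps 3 and 4 above (reduce to two colors, then crop the border) can be sketched in pure Python on a grid of grayscale values. This is only an illustration of the idea, not the poster's scraper, which uses djpeg and pnmcut; the function names, the threshold of 128, and the one-pixel margin are assumptions.

```python
def binarize(pixels, threshold=128):
    """Map each grayscale value (0-255) to pure black (0) or pure white (255)."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]

def crop_border(pixels, margin=1):
    """Drop `margin` pixels from every edge to get rid of a border."""
    height = len(pixels)
    return [row[margin:len(row) - margin] for row in pixels[margin:height - margin]]

# A 4x4 image: a dark one-pixel border around a 2x2 interior of mixed grays.
image = [
    [0,   0,   0, 0],
    [0, 200,  40, 0],
    [0,  90, 130, 0],
    [0,   0,   0, 0],
]

cleaned = crop_border(binarize(image), margin=1)
print(cleaned)  # [[255, 0], [0, 255]]
```

The cleaned two-color grid is what would then be handed to the OCR step; the demorphing described for CAPTCHAs is a separate and much harder stage that this sketch does not attempt.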

Re:More build info; Ubuntu Feisty (Score:1, Informative)

by Anonymous Coward writes: on Wednesday April 11, 2007 @06:56AM (#18686925)

You can svn co -N to grab trunk, then you will need to individually check out each of those 12 under trunk.
Better IMO is to "svn switch" the directories you don't want to an empty directory in the ocropus tree (assuming you can find one) then you can carry on working as if you had a complete tree.
