Google Pushes Open Source OCR

Posted by Zonk
from the google-has-taken-all-knowledge-to-be-its-province dept.
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
  • by What the Frag (951841) * on Tuesday April 10, 2007 @01:06PM (#18679073) Journal
    Use this line to check out ocropus:

    svn co http://ocropus.googlecode.com/svn/trunk/ ocropus
  • by lawpoop (604919) on Tuesday April 10, 2007 @01:38PM (#18679619) Homepage Journal
    I doubt it.

    Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.

    A good captcha has a nonsense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.
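
    A rough way to picture the "common words" clue is a dictionary lookup that snaps a misread word to its nearest known neighbour. The sketch below is only an illustration under stated assumptions: the word list and the `correct_word` helper are hypothetical, and it is not how OCRopus or any particular OCR engine does its post-processing.

```python
import difflib

# Hypothetical mini-dictionary; a real engine would use a full word list.
WORDS = ["optical", "character", "recognition", "document"]


def correct_word(raw, words=WORDS, cutoff=0.8):
    """Return the closest dictionary word, or the raw text if nothing is close."""
    matches = difflib.get_close_matches(raw, words, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


# A misread of an ordinary word can be repaired from the dictionary...
print(correct_word("recogmtion"))
# ...but a nonsense captcha-style string has no neighbour to fall back on.
print(correct_word("xK7qPz"))
```

    The second call comes back unchanged, which is the point being made above: with a nonsense string, the dictionary clue contributes nothing.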
  • by cyphercell (843398) on Tuesday April 10, 2007 @01:42PM (#18679673) Homepage Journal
    Captcha (warped text) will probably remain for a long time. This OCR has more practical uses when applied to text that is meant to be legible.
  • by UbuntuDupe (970646) * on Tuesday April 10, 2007 @01:53PM (#18679851) Journal
    Isn't that the same principle behind PGP? Correct me if I'm wrong (and I freely admit encryption is not my area of expertise), but to crack (in reasonable time) PGP-encrypted data, you have to solve a problem no one in the world has been able to solve yet (quick solution for a certain class of problems). Similarly, if captchas get to the point where you need a major theoretical advance to beat them, thanks to wide use of OCR-type programs, that would either foil all spammers, or cause them to solve a mathematically/AI significant problem.

    I'm wrong, eh?
  • Re:Wonderful! (Score:1, Informative)

    by Anonymous Coward on Tuesday April 10, 2007 @01:59PM (#18679971)
    This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.

    Join the Distributed Proofreaders [pgdp.net]
    and do any or all of:
    1) do some proofreading or formatting of a PG text
    or
    2) Smooth read a near-finished text looking for overlooked oddities
    or
    3) help improve DP's processing software. Lots of extra features wanted...
    or
    4) Get copyright clearance, scan a book and upload to DP's OCR pool
    or
    5) Run your Windows OCR under WINE like I do...

    More details here [pgdp.net]

  • Re:captchas (Score:4, Informative)

    by arrrrg (902404) on Tuesday April 10, 2007 @01:59PM (#18679977)
    Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

    Um, the whole point of CAPTCHAS is that they are human-readable (albeit often not easily so) but not machine-readable.
  • Re:Finally... (Score:2, Informative)

    by stilbon (69689) on Tuesday April 10, 2007 @02:04PM (#18680029)
    Vividata OCR Shop XTR

    http://www.vividata.com/index.html [vividata.com]

    It's not free software, but it works extremely well.
  • Well... (Score:5, Informative)

    by Shawn is an Asshole (845769) on Tuesday April 10, 2007 @02:40PM (#18680543)
    If you're sick of image spam, you can do what I did. Add the OpenProtect [openprotect.com] channel to SpamAssassin and then add these lines to your SpamAssassin config:

    required_hits 5
    score SARE_GIF_ATTACH 5


    I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.
  • by Anonymous Coward on Tuesday April 10, 2007 @02:41PM (#18680571)
    Where have you been lately? Picasaweb.google.com has already increased from a mere 250MB to 1GB+ and counting!
  • Re:Well... (Score:2, Informative)

    by Auntie Virus (772950) on Tuesday April 10, 2007 @03:07PM (#18681011)
    "required_hits 5 score SARE_GIF_ATTACH 5 I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day."

    Brilliant. You just automatically blocked messages from companies whose PHBs insist on attaching a .gif of the company logo. SARE_GIF_ATTACH is ok with a lower score, adding to other scoring parameters. What you REALLY want for image spam is the FuzzyOCR plugin.
  • by Iphtashu Fitz (263795) on Tuesday April 10, 2007 @03:19PM (#18681191)
    Part of what makes OCR work is that it assumes that the text was written to communicate meaning.

    As computing power continues to grow that kind of assumption is less and less important. Ten years ago I worked for a speech recognition company that developed tools similar to what Google is now using for their 800-GOOG-411 search line. Back then the state of the art was to carefully guide what a caller was likely to say, and to rely on massive dictionaries to help with the recognition. Now, 10 years later, with more research and more powerful computers, it's much easier to develop more free-form speech recognition systems that can accurately recognize arbitrary strings of numbers and letters (account numbers, phone numbers, etc.). Given that the capabilities of speech recognition systems have grown so much, I'd be willing to bet that OCR capabilities have grown in similar ways.
  • by drinkypoo (153816) <martin.espinoza@gmail.com> on Tuesday April 10, 2007 @03:25PM (#18681295) Homepage Journal

    Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.

    To build tesseract-ocr you must install autoconf.

    If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.

    I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.

    To build ocropus I ended up installing jam, libaspell-dev and libtiff-dev.

  • by WrongDecision (803195) on Tuesday April 10, 2007 @11:03PM (#18685289)
    Actually, GOCR works very well (100%) on the image-based text that some sites use to prevent screen scraping.
    1. Download and save the image.
    2. If it's a gif, convert it to a jpg.
    gif2jpg -a tmp.gif
    3. Reduce the colors to 2 (black & white).
    djpeg -colors 2 -greyscale -dither none tmp.jpg > tmp.pnm
    4. If there is a border, crop it off.
    pnmcut a b c d tmp.pnm > OCR.pnm
    (The dimensions a,b,c,d can be determined by any tool that returns useful info about an image, in general remove 1 or 2 pixels from the edges to get rid of borders.)
    5. OCR it.
    gocr -n 1 OCR.pnm >> OCR.txt
    Of course, this is all automated within the screen scraper, I just broke it out here to explain the steps.
    For CAPTCHAs, you have to demorph the severely distorted images after step 4, before you OCR them. I'm still working on the demorpher, but it's about 50% accurate now. Basically, it unstretches long strings of pixels to the average of other strings of pixels in the x and y axis. Works even better if you determine the angle of the pixel string and shrink on that, along with some rotation to the nearest x or y axis.
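
    Steps 3 and 4 above (reduce to two levels, then trim the border) can be sketched in plain Python on a small grayscale grid. This is a minimal stand-in, not what djpeg or pnmcut do internally: the fixed threshold of 128, the function names, and the sample image are all assumptions made for illustration, and the crop arguments are taken as left, top, width, height.

```python
def to_two_levels(pixels, threshold=128):
    """Map each grayscale value (0-255) to 0 (black) or 255 (white)."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


def crop(pixels, left, top, width, height):
    """Keep a width x height window whose top-left corner is (left, top)."""
    return [row[left:left + width] for row in pixels[top:top + height]]


def strip_border(pixels, margin=1):
    """Drop `margin` pixels from every edge to get rid of a border."""
    height = len(pixels)
    width = len(pixels[0])
    return crop(pixels, margin, margin, width - 2 * margin, height - 2 * margin)


# Hypothetical 5x4 grayscale image: a dark 1-pixel frame around the content.
image = [
    [10, 10, 10, 10, 10],
    [10, 200, 40, 220, 10],
    [10, 90, 180, 130, 10],
    [10, 10, 10, 10, 10],
]

cleaned = strip_border(to_two_levels(image), margin=1)
print(cleaned)
```

    The result is what would be handed to the OCR step; the demorphing described for CAPTCHAs is not covered by this sketch.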
  • by Anonymous Coward on Wednesday April 11, 2007 @05:56AM (#18686925)

    You can svn co -N to grab trunk, then you will need to individually check out each of those 12 under trunk.
    Better IMO is to "svn switch" the directories you don't want to an empty directory in the ocropus tree (assuming you can find one) then you can carry on working as if you had a complete tree.
