Google Pushes Open Source OCR

Posted by Zonk
from the google-has-taken-all-knowledge-to-be-its-province dept.
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"

  • by Anonymous Coward on Tuesday April 10, 2007 @02:05PM (#18679067)
    Now that they will be able to recognize and tag our images, I wonder if Picasa will finally get increased storage. Google will be able to deliver targeted ads based on our pictures.
    • by Anonymous Coward
      Where have you been lately? Picasaweb.google.com has already increased from a mere 250MB to 1GB+ and counting!
    • by Instine (963303) on Tuesday April 10, 2007 @03:48PM (#18680709)
      What about a free service to upload scanned images to and receive HTML in return?... Please....
      • Wait.
      • Re: (Score:2, Insightful)

        by drinkypoo (153816)
        Now wait a second... you would rather upload a scanned image, which should be at a pretty decent resolution if you want good results, than run the OCR software locally? What, are you using a system with a 33MHz CPU or something?
      • I have no doubt at all that this is coming in the future. Why? Because Google wants to see all data, analyze that data, and catalog it. That's exactly what would happen if you uploaded your scanned document to Google: Sure they would OCR it and do a good job, but they would also save the OCR'ed copy for later data mining.

        One awesome application of this: I teach university courses that require term papers. If I could scan and upload the term papers I receive and Google could OCR them and tell me whether th

  • Use this line to check out ocropus:

    svn co http://ocropus.googlecode.com/svn/trunk/ ocropus
    • by drinkypoo (153816) <martin.espinoza@gmail.com> on Tuesday April 10, 2007 @04:25PM (#18681295) Homepage Journal

      Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.

      To build tesseract-ocr you must install autoconf.

      If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.

      I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.

      To build ocropus, I ended up installing jam, libaspell-dev and libtiff-dev.

  • by user24 (854467) on Tuesday April 10, 2007 @02:07PM (#18679089) Homepage
    The goal of the project is to stop the damn email image spammers.

    among other things, sure, but it's got to be a high priority for google.
    • by sammy baby (14909) on Tuesday April 10, 2007 @02:15PM (#18679253) Journal
      And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)
      • And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)

        True, but CAPTCHAs always seemed like a bit of an inelegant hack anyway. First, they're horrible from a disabled-access standpoint, and second they're really not all that effective against a concerted enemy when there's a lot of money on the line. Spammers can just pay a few kids in some Third World country to sit there all day and solve CAPTCHAs if they want to.

        Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email. The solution to message-board spam is to either institute a moderator-delay (for small blogs and boards), or simply make enough admins with IP-ban powers so that the second someone starts spamming, they get banned and the spam gets deleted. Lameness filters working on the same principles as email spam-filters are probably helpful, too.
        • captcha's (Score:2, Insightful)

          Captchas are not restricted to images of letters. For example, you could ask people to answer a regular text question (this would also fix accessibility issues).

        • by Pxtl (151020) on Tuesday April 10, 2007 @03:39PM (#18680535) Homepage
          You've obviously never fought off a bb spammer. They don't use one or two accounts to spam one or two messages - they inundate the board from a long list of IPs. Even without spamming messages, they create hordes of accounts just for the pagerank provided by the links within their personal account pages. Plus, admin-approval-delays degrade quality for the user. It creates a huge headache all around to handle maintaining banlists and cleaning out garbage.

          Captchas are by far the better solution.

          The problem is that, long term, they will eventually be cracked. I'm imagining the ultimate solution will only be to allow users with email addresses from "approved" major ISPs.
        • Well... (Score:5, Informative)

          by Shawn is an Asshole (845769) on Tuesday April 10, 2007 @03:40PM (#18680543)
          If you're sick of image spam, you can do what I did. Add the OpenProtect [openprotect.com] channel to SpamAssassin and then add these lines to your SpamAssassin config:

          required_hits 5
          score SARE_GIF_ATTACH 5


          I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.
          • Re: (Score:3, Interesting)

            by drinkypoo (153816)
            All I want is a plugin for Thunderbird that will detect when a message is written in a language other than English and mark it as spam. No one ever sends me an email in anything other than English except for spam. I have no fucking idea why this has not yet been implemented. I get absolute shitloads of Russian spam.
            • That and the ability to import and export from your message store. I love Thunderbird and have been using it exclusively since about version 0.4, but simply cannot believe some of the functionality it lacks.

          • Re: (Score:2, Informative)

            by Auntie Virus (772950)
            "required_hits 5
            score SARE_GIF_ATTACH 5

            I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day."

            Brilliant. You just automatically blocked messages from companies whose PHBs insist on attaching a .gif of the company logo. SARE_GIF_ATTACH is OK with a lower score, where it adds to the other scoring rules instead of deciding the verdict on its own. What you REALLY want for image spam is the FuzzyOCR plugin.
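A rough sketch of why the score matters, assuming a simplified additive model in which a message is flagged once the summed rule scores reach the threshold. Only `SARE_GIF_ATTACH` and the value 5 come from the thread; the other rule names and scores are hypothetical, and this is not SpamAssassin's actual implementation.

```python
# Toy model of additive spam scoring. Rule names prefixed HYPOTHETICAL_ and
# every score except the quoted "5" are invented for illustration only.

REQUIRED_HITS = 5.0  # threshold quoted in the thread


def total_score(fired_rules, scores):
    """Sum the scores of every rule that fired on a message."""
    return sum(scores[rule] for rule in fired_rules)


def is_spam(fired_rules, scores, threshold=REQUIRED_HITS):
    """Flag the message once its total reaches the threshold."""
    return total_score(fired_rules, scores) >= threshold


# Configuration from the parent comment: one rule alone reaches the threshold.
aggressive = {"SARE_GIF_ATTACH": 5.0,
              "HYPOTHETICAL_STOCK_WORDS": 2.5,
              "HYPOTHETICAL_FORGED_HEADER": 1.5}
# Gentler variant: the GIF rule only contributes to the total.
gentle = dict(aggressive, SARE_GIF_ATTACH=2.0)

logo_mail = ["SARE_GIF_ATTACH"]  # legitimate mail with a logo attached
image_spam = ["SARE_GIF_ATTACH",
              "HYPOTHETICAL_STOCK_WORDS",
              "HYPOTHETICAL_FORGED_HEADER"]

print(is_spam(logo_mail, aggressive))  # True: the logo mail is blocked
print(is_spam(logo_mail, gentle))      # False: the logo mail gets through
print(is_spam(image_spam, gentle))     # True: several rules add up
```

With the lower score, a GIF attachment alone no longer condemns a message, but it still tips the balance when other rules fire as well.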
          • I simply OCR every image attachment. Works flawlessly.
        • by mypalmike (454265)
          they're really not all that effective against a concerted enemy when there's a lot of money on the line... Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email.

          Little fiefdom bulletin boards don't generally have a lot of money on the line, which is why captcha works so well. The cost of paying human captcha solvers is high enough that it's fairly rare to see spam on a ca
        • The CAPTCHA solution (Score:4, Interesting)

          by dj245 (732906) on Tuesday April 10, 2007 @05:01PM (#18681855) Homepage
          Look, any illiterate kid in a third world country can play type-in-the CAPTCHA all day long. I think the solution is to put up an array of 9 or so pictures, and ask the reader to click on the kitten. The other 8 being something other than a kitten, and all the files having random names which rotate with every view. You could also change the item being asked for to defeat simple image recognition, and have several pictures of kittens/what-have-yous.

          To defeat *this*, you would need someone with a greater command of the English language than simple recognition of characters, or very advanced image recognition software. I wouldn't worry about the software anytime soon if you chose images carefully.
          • by drinkypoo (153816)

            Look, any illiterate kid in a third world country can play type-in-the CAPTCHA all day long. I think the solution is to put up an array of 9 or so pictures, and ask the reader to click on the kitten. The other 8 being something other than a kitten, and all the files having random names which rotate with every view.

            The problem with your idea is that we've seen multiple examples of image search tools lately (well, two) that are capable of doing that kind of analysis. That idea is better than nothing but will

          • Re: (Score:2, Insightful)

            by Espinas217 (677297)
            This just slows down the spammers but can't stop them. If you have a small number of choices it's just a matter of how much the spammer must try to get through. You present 10 images with 1 correct, the spammer has a 10% chance and that's enough to make his business work. I'm not really in favor of captchas but multiple choices won't work for long.
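The arithmetic behind that objection can be sketched as follows, assuming a bot that guesses uniformly at random and gets an independent fresh challenge on every attempt (both are simplifying assumptions, not claims about any real site).

```python
def success_probability(choices, attempts):
    """Chance that at least one of `attempts` uniform random guesses is right."""
    p_single = 1.0 / choices
    return 1.0 - (1.0 - p_single) ** attempts


def expected_attempts(choices):
    """Mean number of independent guesses needed for the first success."""
    return float(choices)


# Ten pictures, one correct: how quickly does blind guessing get through?
for attempts in (1, 10, 50):
    print(attempts, round(success_probability(10, attempts), 3))
print("average guesses per success:", expected_attempts(10))
```

Since each extra guess costs a bot almost nothing, a one-in-ten success rate only multiplies the number of requests by about ten.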
        • by Flwyd (607088)
          Even good OCR will have trouble with captchas. Heck, even I have trouble with captchas and I beat good OCR most of the time.

          FWIW, LiveJournal, which is essentially several million easy-to-find blogs, has remarkably little comment spam without captchas. And most of the comment spams I've received are devoid of things like websites I could click on. I don't know what their technique is, though.
      • Re: (Score:3, Informative)

        by UbuntuDupe (970646) *
        Isn't that the same principle behind PGP? Correct me if I'm wrong (and I freely admit encryption is not my area of expertise), but to crack (in reasonable time) PGP-encrypted data, you have to solve a problem no one in the world has been able to solve yet (quick solution for a certain class of problems). Similarly, if captchas get to the point where you need a major theoretical advance to beat them, thanks to wide use of OCR-type programs, that would either foil all spammers, or cause them to solve a mathe
    • by ajs (35943) <ajs@ a j s . com> on Tuesday April 10, 2007 @03:25PM (#18680307) Homepage Journal

      The goal of the project is to stop the damn email image spammers.

      among other things, sure, but it's got to be a high priority for google.
      I don't buy either one. I think the goal of the project is to get sued.

      Google knows darned well that there are tons of patents around OCR, so they're not going to roll their own internally. Instead, they'll open source the project and make as much noise about enhancing the state of the art through collaboration as possible. Then, when they get sued (and they will), they can bring this case front-and-center in the debate surrounding patent reform, citing this as the textbook example of how the promotion of the sciences and useful arts (as specified by the Constitution) is hobbled by current patent law surrounding software.

      I could be wrong, but they'd be stupid to think that high-profile, open source OCR software won't be challenged by those who hold the patents....
      • by slashbob22 (918040) on Tuesday April 10, 2007 @03:50PM (#18680745)
        Ok, I'll bite and play DA for a bit.

        Why Google wouldn't want this:
        1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
        2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.

        IANIGHQ (In Google's HQ) but I don't see the value of getting sued at this point in time. Besides, if Google is doing this under appropriate conditions there shouldn't be concern of suits - but I suppose their Chinese plagiarism case doesn't support this point.

        // End DA
        • by ajs (35943) <ajs@ a j s . com> on Tuesday April 10, 2007 @04:36PM (#18681451) Homepage Journal

          Why Google wouldn't want this:
          1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
          Google takes the same stand on patent reform as IBM, as far as I know: the current law hurts innovation. They're not looking to have all of their patents stripped, just to reform the system so that innovation is encouraged. At the very least, IBM has (and I think Google too) lobbied for open source exemption. Keep in mind that IBM and Google hold tons of patents, but they mostly use them as a "warchest" to dissuade others from filing patent-related suits.

          2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.
          I don't buy that one. Patent and copyright law are radically different, and in the copyright case Google is just trying to argue for existing interpretation of the law, not a change.

          Google is doing this under appropriate conditions there shouldn't be concern of suits
          That's not how patent law works. If someone holds a patent on looking at the pixel to the left of the one you're evaluating, and Google's software does that, then the holder could sue. What's more, there are many dozens of such simple patents surrounding OCR. It's probably the second-most over-patented area of CS next to color-space management.[1] [google.com]
    • I'm already using gocr to ocr every image in my email. Works very nicely.
  • Oh great. I, for one, do not welcome the increase in message board spamming.
  • by Iphtashu Fitz (263795) on Tuesday April 10, 2007 @02:08PM (#18679113)
    ... for Captchas [wikipedia.org]? If Google is pushing OCR I could see it eventually becoming good enough to parse at least some types of captchas.
    • by X0563511 (793323) *
      When the computer can parse a Captcha better than a human can... it means that we need to move on to something else. What that else is (do NOT mention kitten-captcha) I don't know.
      • When we can make a computer that can tell the difference between a kitten and an adult cat (or hell even another furred mammal) with any kind of accuracy, I think the LEAST of your problems at that point is coming up with captchas. You should be more worried about how you're going to escape from Skynet.
      • by walt-sjc (145127)
        Then we need to move to simple logic questions such as "what is the sum of 5 and 4?" or "how many inches in a foot", etc.
        • Well, your first question fooled it (probably due to the unusual phrasing), but Google can already answer [google.com] your second one.

        • by user24 (854467) on Tuesday April 10, 2007 @02:58PM (#18679955) Homepage
          Please, please, please, everybody, stop claiming that "what is 2+2?" is a hard AI question. I could code something in an hour to defeat most questions of this sort, and give me a week and a budget and I'll write something to get past 95% of these types of questions.

          If the text is parsable, it takes nothing to google it.
          I mean, those two examples you give; just slap it into google and screenscrape it. So you're going to need harder questions than that.

          So the next generation of crapchas will ask "what color is the sky".
          Go and take a glance at ultraHal or another relatively advanced NLP AI; a large knowledgebase is not hard to construct. When it doesn't know, it guesses. If it gets it right, then the knowledgebase increases by one fact.

          So then, what, you have to ask "Given that all bleeps are blue, and blank is a bleep, is blank blue?"
          Not only is that also easily computationally solved, but also a lot of people aren't going to be able to answer (smartass questions about stopping spam and idiots aside)

          So *then* I suppose you have to ask "In the first mathematical antinomy, does Kant conclusively prove both that there can have been no beginning to time and that there must have been a beginning to time?"
          and give the user a 255 character textarea to put their answer in.

          So... please, text question based captchas are DOOMED TO FAIL. stop thinking that they could work. They can't.
          • by asninn (1071320)

            So... please, text question based captchas are DOOMED TO FAIL. stop thinking that they could work. They can't.

            While I agree with you in principle, I think your definition of "work" with regard to captchas is flawed. Captchas don't need to be 100% undefeatable; they just need to work well enough so that the time/energy/computing power/manpower/money needed to solve them en gros makes sure doing so isn't worth it to the spammer.

            Your claim that they're useless because they don't work perfectly makes as

            • by user24 (854467)
              that is true. even if you just have three submit buttons, and only one submits to the right place*, you'll still cut your spam by 66%.

              But that is only true today.

              If everyone did that, spammers would soon figure out the system and bypass it.

              Captcha authors** are trying to avoid an arms race. sure, you can upgrade your simple crapcha every 6 months to keep up to date with spammers, or you can put a good one in place once. Much better the latter, methinks.

              It's not about what spammers are doing now, it's
            • by werfele (611119)

              Still, in reality, I hardly ever get postal spam; the rate is probably less than 1 unsolicited letter per month.

              I'd like to know how you manage that. It's a good day that I have only 1 unsolicited letter. We get something like 10 to 12 per day, and we have sacks of the stuff to bring out for recycling. On the up side, I'll never have to buy address labels again (nonprofits tend to include them as incentive to actually open the envelope). On the down side, I hardly ever send mail anymore, so I don't h

          • Okay, then how would you defeat a text captcha like:

            "alright now I want you to tell me basically, that number three, which ever number comes after it, wait, make that before it, what is that again?"

            Would google or a knowledge base beat it? And you can arbitrarily increase the complexity like with pictures.
            • by user24 (854467)
              if it's generated by an algorithm, it can be deconstructed using an algorithm.

              find a representation of a number, find the last word relating to precedence that is not prefaced by a "not" word, do the math, enter the answer.

              next super-hard conundrum?

              notice, if I'd said
              "find a representation of a number, find the last word relating to precedence, do the math, enter the answer."
              you could reply "ahh, but what if I write this:"
              "tell me basically, that number three, which ever number comes after it, wait, not bef
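The recipe described in the comment above (find a number, find the last precedence word that is not prefaced by a "not" word, do the math) can be sketched in a few lines. This is only a toy under stated assumptions: the word lists are hypothetical and tiny, negation is checked one word back only, and the second test question is made up for illustration.

```python
import re

# Hypothetical, deliberately small vocabularies for the sketch.
NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
                "ten": 10}
PRECEDENCE = {"after": 1, "before": -1}
NEGATIONS = {"not", "never"}


def solve(question):
    """Return the number the question asks for, or None if nothing parses."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())

    # Step 1: find the first representation of a number.
    number = None
    for tok in tokens:
        if tok.isdigit():
            number = int(tok)
            break
        if tok in NUMBER_WORDS:
            number = NUMBER_WORDS[tok]
            break
    if number is None:
        return None

    # Step 2: keep the last precedence word not directly preceded by a negation.
    offset = None
    for i, tok in enumerate(tokens):
        if tok in PRECEDENCE and (i == 0 or tokens[i - 1] not in NEGATIONS):
            offset = PRECEDENCE[tok]
    if offset is None:
        return None

    # Step 3: do the math.
    return number + offset


quoted = ("alright now I want you to tell me basically, that number three, "
          "which ever number comes after it, wait, make that before it, "
          "what is that again?")
print(solve(quoted))  # 2
print(solve("tell me the number after seven, wait, not before it"))  # 8
```

Rephrasings outside these word lists would defeat the sketch, which is the arms race the thread is arguing about: each new template needs a new rule, but each rule is cheap to write.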
    • by lawpoop (604919) on Tuesday April 10, 2007 @02:38PM (#18679619) Homepage Journal
      I doubt it.

      Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.

      A good captcha has a non-sense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.
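One way to picture the "clues" an OCR engine gets from real text is dictionary-based correction. The sketch below uses Python's standard `difflib.get_close_matches`; the five-word vocabulary and the 0.75 cutoff are assumptions chosen for illustration, and real OCR systems use far richer language models.

```python
import difflib

# Hypothetical tiny vocabulary standing in for a real dictionary.
WORDS = ["optical", "character", "recognition", "document", "library"]


def correct(token, vocabulary=WORDS, cutoff=0.75):
    """Snap a noisy OCR token to the closest dictionary word, if one is close."""
    matches = difflib.get_close_matches(token.lower(), vocabulary,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct("recogniti0n"))  # recognition
print(correct("d0cument"))     # document
print(correct("xK7qZ2"))       # xK7qZ2 (nothing in the dictionary is close)
```

A misread character in ordinary prose can be repaired from context, while a random captcha string offers no dictionary word to fall back on, so every character has to be recognized correctly on its own.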
      • Re: (Score:3, Funny)

        A good captcha has a non-sense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.
        Hell, if we obscure it enough it can be practically buried under geometric noise; and once we do that, we've solved the AC problem on /.!
      • Re: (Score:3, Informative)

        by Iphtashu Fitz (263795)
        Part of what makes OCR work is that it assumes that the text was written to communicate meaning.

        As computing power continues to grow that kind of assumption is less and less important. Ten years ago I worked for a speech recognition company that developed tools similar to what Google is now using for their 800-GOOG-411 search line. Back then the state of the art was to carefully guide what a caller was likely to say, and to rely on massive dictionaries to help with the recognition. Now, 10 years later, w
    • If Google is pushing OCR I could see it eventually becoming good enough to parse at least some types of captchas.
      I hope so. I'm looking forward to a Firefox extension that'll let me decode a captcha so I don't have to figure it out. Some of the captchas I've seen lately are so confusing, with warped text, noise, and fonts that make zero and oh look identical, that I have to go through two or three of them before I can get an entry correct.
    • I could see it eventually becoming good enough to parse at least some types of captchas.

      That'll be great. Then the spammers can crack the weak captchas to get free e-mail addresses and can flood everyone's inbox with captchaesque text that's strong enough to fool the OCR. This seems like a brilliantly thought-out plan.

  • by User 956 (568564) on Tuesday April 10, 2007 @02:10PM (#18679157) Homepage
    The goal of the project is to ... deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis

    So, will it work on documents written in crayon? It would be a tragic loss for Dubya's presidential documents to get lost in the sands of time. On the scale of the library of Alexandria. No, seriously.
    • Re: (Score:2, Funny)

      by adickerson0 (884626)
      No need, Dubya turns his work in to the Secretary of Education so she can put a gold star on each page. While this may seem like a childish system it is really the only sort of oversight he would agree to. The original plan was to scan everything and place an RFID Gold Star on each page for tracking, that way the Executive Branch's work could be preserved, however this led to a few problems. Apparently the Sec of Ed got too busy and turned the work over to an intern. The intern decided to not only put a Gold
  • Finally... (Score:3, Interesting)

    by Searinox (833879) on Tuesday April 10, 2007 @02:10PM (#18679159) Homepage
    An OCR system that runs on Linux. I've been waiting for quite some time for something like this.
    • Re: (Score:2, Informative)

      by stilbon (69689)
      Vividata OCR Shop XTR

      http://www.vividata.com/index.html [vividata.com]

      It's not free software, but it works extremely well.
      • it actually has many issues, and it is lagging behind the Windows version that Nuance produces. My company owns several licenses.

        it is, however, the best OCR on Linux right now. I'm looking forward to having an alternative.
  • So will something like this eventually render captchas used as a security/anti-spam measure obsolete?

    Not like something wasn't bound to eventually come out to counter that idea, anyway.

  • Very cool. (Score:5, Insightful)

    by Kadin2048 (468275) <slashdot DOT kadin AT xoxy DOT net> on Tuesday April 10, 2007 @02:17PM (#18679305) Homepage Journal
    I've been hoping that someone with deep pockets (Google, IBM, Sun) would take on this area for a while.

    There is a major need for an OSS OCR package, and right now the field is pretty bare. There's GOCR [sourceforge.net], and a commercial offering called OCRShop, and, at least from what I've run across, that's about it. Nothing really on par with Omnipage, or other commercial packages for other platforms.

    I think there are some really neat applications for OCR that have never really been investigated, because it's so expensive to build that capability into other products. A free OCR engine that really worked could lead to some very neat book-scanning applications, just for starters. I don't think that there are really any integrated packages around for helping people scan books and manuscripts. (Right now you have to photograph the pages, keep them organized, then OCR them and proofread the text against the images. Bit of a nightmare.) I'd love to see a free application for libraries that let a user batch scan (via a digital camera -- let's not get into what I think of SANE and scanners generally) a book, and then provided a nice interface for proofreading the OCRed text against the original image.

    Something like that could have a huge social impact. There are a lot of libraries where I'm sure they'd love to scan some of their out-of-copyright assets and provide them to patrons in a digital form, but it's just too technically complicated. An easy-to-use program that let the proofreading be done by nontechnical users (maybe remotely, as long as we're dreaming) could vastly increase the volume of digital materials available.
    • by CastrTroy (595695)
      You aren't going to get a good shot of the document with a digital camera for a lot of reasons. First of all, the lighting is uneven. Then there's the problem with the lens distorting things. Then there's problems with getting it to focus properly. I'm sure lots of people would love to point out other problems with using a digital camera to capture documents. It may work fine for a human looking at the picture, but it's going to make the job of the OCR program a lot harder. Even things like dust can throw
      • Actually lots of people do book "scanning" with digital cameras. In fact, you can sometimes get much better results off of a book using a digital camera than you can by pressing it down against the bed of a flatbed scanner (because if the page wasn't typeset with a wide gutter, you'll start to distort some of the letters as you get close to the binding). Plus, it's a lot easier on the books, which is important when you're talking about books that are all going to be 75 years old and some much, much older.

        Th
    • I'd love to have a reasonably accurate OCR to read LCD screens. It would make for a vastly cheaper automated test equipment market when I can use a cheap webcam and some cheap digital power supplies and handheld voltmeters to do mass measurements of power conversion efficiency and characterization. Right now, that's all done via GPIB or ethernet, either of which options adds about $600 onto instruments that already cost a minimum of $600. I have played with using GOCR, with my multimeter face-down on the
  • Orcopus? (Score:4, Funny)

    by voice_of_all_reason (926702) on Tuesday April 10, 2007 @02:23PM (#18679377)
    Orcopus:

    Level: 15
    Race: Fell Marine
    HP: 290/290
    EP: 200/200
    Water elemental
    Drops: Tentacle
  • Wonderful! (Score:4, Insightful)

    by jshriverWVU (810740) on Tuesday April 10, 2007 @02:27PM (#18679461)
    This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.
  • Okay, so one thing will lead to another and soon Google will be creating technology to recognize non-symbol shapes... How long before I can log in to my G-Accounts by smiling at my computer?
  • captchas (Score:5, Insightful)

    by gEvil (beta) (945888) on Tuesday April 10, 2007 @02:40PM (#18679651)
    All you people who are worried about this breaking captchas seem to be missing something--there have been a number of fairly decent OCR packages out there for a long time. The goal of this Google project is to create an open-sourced one that does a good job deciphering HUMAN-READABLE TEXT. Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.
    • Re:captchas (Score:4, Informative)

      by arrrrg (902404) on Tuesday April 10, 2007 @02:59PM (#18679977)
      Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

      Um, the whole point of CAPTCHAS is that they are human-readable (albeit often not easily so) but not machine-readable.
    • by MoriaOrc (822758)

      Captchas are far from human-readable (the good ones at least)
      While I've run into not a few captchas that are not human-readable, I would argue that they are not, in fact, the good ones. Good Captchas are human readable, but extremely difficult to solve using automation (this, other OCR software, what have you).
    • Captchas are far from human-readable (the good ones at least)...

      Yeah, that's why they suck.

      Some forums, I have to try *four* times to get past the captcha, just to post a message about how libsomething won't compile.

      If they really wanted good captchas, they need to start using problems that are very easy for humans to solve, but very hard for computers to solve. For example, picking the one photo of a puppy out of a matrix of photos of full-grown dogs.

      Computers are currently really bad at recognizing images in photos, but they do a decent job of recognizing text with c

      • If they really wanted good captchas, they need to start using problems that are very easy for humans to solve, but very hard for computers to solve. For example, picking the one photo of a puppy out of a matrix of photos of full-grown dogs.

        Image identification raises its own set of problems. If you're working with photos, what is your source of photos? You're going to need a lot; if you've only got a thousand or so images, spammers will scrape them from your site and flag them by hand (possibly outsour

        • by drinkypoo (153816)

          If you use a shared resource so you get mind-boggling numbers of photographs ("Bob's Puppy Captcha Service") you just create more incentive for attackers to index all of the service's images.

          this problem is simply solved. create images with the subjects to be recognized in the center. now use image processing utilities to cut a semirandom rectangle out of the image (producing images of varying sizes) and to apply some effects to the image which will change all values in the image significantly without

          • by user24 (854467)
            that solution is simply circumvented; crop the resultant images to a 5x5 pixel rectangle in the center, and use the md5 hash of that. i'm sure you'd end up with a workable lookup table after a while.

            "but my transform function is changing the RGB values of each pixel"

            yes, but with a 5x5 chunk you can account for +1/-1/0 deviations easily, and even higher transformations won't be too hard. I'm sure each image would represent a unique range of possible deviations. Even if you end up needing a 300Gb database, t
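A minimal sketch of the lookup-table idea from the comment above, using plain nested lists as grayscale images so it needs nothing beyond the standard library. Two things are assumptions rather than anything stated in the thread: small per-pixel deviations are absorbed by coarse quantization (one possible approach, which can still fail when a value sits on a bucket edge), and the center of the image is assumed to survive the transformation, which a random crop may not guarantee.

```python
import hashlib


def center_patch(image, size=5):
    """Return the size x size block of pixels at the center of the image."""
    rows, cols = len(image), len(image[0])
    top, left = (rows - size) // 2, (cols - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def patch_key(image, size=5, bucket=16):
    """MD5 of the coarsely quantized center patch (values assumed 0..255)."""
    patch = center_patch(image, size)
    coarse = bytes(value // bucket for row in patch for value in row)
    return hashlib.md5(coarse).hexdigest()


def make_image(shift):
    """Hypothetical 9x9 test image whose values sit mid-bucket."""
    return [[((r * 9 + c + shift) % 16) * 16 + 8 for c in range(9)]
            for r in range(9)]


kitten = make_image(0)
not_kitten = make_image(5)

# The attacker labels each image once by hand and stores the key.
lookup = {patch_key(kitten): "kitten"}

# The site later serves the same picture with every pixel nudged by +1.
nudged = [[value + 1 for value in row] for row in kitten]

print(lookup.get(patch_key(nudged)))      # kitten
print(lookup.get(patch_key(not_kitten)))  # None
```

Storing several keys per image (one per plausible shift or bucket offset) would be the obvious way to cover the cases this sketch misses, at the cost of a larger table.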
          • this problem is simply solved. create images with the subjects to be recognized in the center. now use image processing utilities to cut a semirandom rectangle out of the image (producing images of varying sizes) and to apply some effects to the image which will change all values in the image significantly without making it unrecognizable.

            As I said, it's not good enough. Detecting a subset of an image is such a well-understood problem that a $20 optical mouse does it 30 or so times per second. Ultimately

            • by drinkypoo (153816)

              One possible improvement that leaps to mind is procedurally generated images; that is, rendering "3D" images from models, with random (but constrained) angles, backgrounds, positions, colors, lighting, and poses. This way your image set can be extremely large.

              I've actually seen a prototype of this approach used. It's a nifty idea. But it's horribly computationally expensive compared to basically any other proposed alternative; a major site would spend far more using this approach than spammers would pay pe

  • searchable pdfs (Score:5, Interesting)

    by radarsat1 (786772) on Tuesday April 10, 2007 @02:44PM (#18679705) Homepage
    Anyone know of an open source utility that can convert scanned image-based PDF files into searchable PDFs ?
    (Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)

    Perhaps this library could be used to build such an application if none exists...
    • by Nasarius (593729)
      Not even the best commercial software (ABBYY, OmniPage) can do more than a half-assed job of that. If you want accurate, well-formatted results, expect to do a lot of manual work.
      • Although Acrobat's OCR engine leaves a bit to be desired, the approach there works pretty well. You can have it create the OCR'd text page layout that uses the original image as an overlay. So, in essence you get a page that looks like the original scanned image, but that lets you highlight/select the text from the background text layer. I'm sure other programs out there can do this, too. None that are OS (to my knowledge), as per the GP's requirements, though.
        • by Nasarius (593729)
          Yeah. Though IMO, Acrobat's OCR feature is only a toy at the moment, since the tools for manual intervention are awful. So you get things like images being interpreted as text, frequent mistakes not marked as "suspect", and total butchery of anything with umlauts, accents, etc. It does a nice job of deskewing scanned pages of text, though.
  • Language? (Score:5, Interesting)

    by ceeam (39911) on Tuesday April 10, 2007 @02:45PM (#18679731)
    English only I suppose?
    • by fireboy1919 (257783) <.rustyp. .at. .freeshell.org.> on Tuesday April 10, 2007 @03:07PM (#18680083) Homepage Journal
      Since the official language of the Googleplex is Googlese, and the original project was developed by the US Census bureau - notorious for their use of no languages except Esperanto, it goes without saying (though I'm saying it anyway), that it will read only Klingon.

      Remember kids, there are no stupid questions.
      Only people who don't RTFA who ask questions.
    • by xlv (125699)
      It looks like the current OCR engine they use, Tesseract OCR, only supports English, as its roadmap includes "support for languages other than English", but from a quick look at the various links, they are developing other engines as well.

      Besides, with the research group being based in Germany, you'd assume that German and other Latin-script languages will be supported pretty soon...
  • Ok, I got excited too early. Actually, ballot scanning is a specialized task and general purpose OCR probably doesn't play much of a part in that, but if any part of it does apply, then this is still awesome.
    • by drinkypoo (153816)

      Ok, I got excited too early. Actually, ballot scanning is a specialized task and general purpose OCR probably doesn't play much of a part in that, but if any part of it does apply, then this is still awesome.

      The only thing you need to scan a ballot is Scan-Tron technology. It's a reflect/no-reflect tech like the optical write detect hardware in your floppy disk drive and very, very simple (as there is a sync signal at the side of the page.) Very little processing power is involved.

  • Comics (Score:3, Interesting)

    by rbanffy (584143) on Tuesday April 10, 2007 @03:26PM (#18680339) Homepage Journal
    Will I be able to search my comic strips (downloaded since forever) by keyword?

    I would love that!
  • All the OCR packages available through APT on my Ubuntu 6.10 (Edgy) are worthless (< 50% correct characters). I tried them on real scans (usually faxes) that are perfectly clear to my eye:

    clara - Free OCR program for Unix Systems
    gocr - A command line OCR
    ocrad - Optical Character Recognition program
    unpaper - post-processing tool for scanned pages

    Will this Google OCR really work, and can I install it with APT?

    Meanwhile, why is it all Optical Character Recognition, when the accuracy we expect is really Optical Word Rec
    • by drinkypoo (153816)

      Will this Google OCR really work, and can I install it with APT?

      Yes and no. (I've tested it, but you have to install from subversion.)

      Meanwhile, why is it all Optical Character Recognition, when the accuracy we expect is really Optical Word Recognition? How come spelling, grammar and phrase frequency (including typos etc) isn't used to error correct at a symbolic level higher than pixels?

      Teas Willis [luc.edu], and the sticky tours
      Did gym and Gibbs in the wake.
      All mimes were the borrowers,
      And the moderate Belg
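The word-level correction asked about above, and the way it goes wrong on nonsense verse, can be sketched like this. It is a minimal illustration and not how any OCR engine mentioned in this thread works: `correct_words` and the tiny vocabulary are hypothetical, and the matching comes from Python's standard `difflib.get_close_matches`.

```python
import difflib

def correct_words(text, vocabulary, cutoff=0.6):
    """Replace each word not found in `vocabulary` with its closest entry.

    Words with no entry at or above `cutoff` similarity are left alone.
    Splitting on whitespace keeps punctuation attached to words, which is
    good enough for a sketch but not for real text.
    """
    corrected = []
    for word in text.split():
        lowered = word.lower()
        if lowered in vocabulary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

# A deliberately tiny, made-up vocabulary.
vocabulary = ["the", "wabe", "and", "toys"]

# A genuine OCR slip gets repaired...
fixed = correct_words("thg wabe", vocabulary)

# ...but a correctly read nonsense word gets "repaired" too.
mangled = correct_words("toves", vocabulary)
```

With an ordinary English dictionary as the vocabulary, anything Carroll invented is fair game for this kind of "correction".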

      • by Doc Ruby (173196)
        I'd prefer to get mistakes turning nonsense into sense than the ones I get the other way around that don't even preserve meaningful nonsense.

        Do you have a result from scanning Jabberwocky (or other verse in a similar vein) with Google's OCR?
        • by drinkypoo (153816) <martin.espinoza@gmail.com> on Tuesday April 10, 2007 @06:06PM (#18682753) Homepage Journal

          Do you have a result from scanning Jabberwocky (or other verse in a similar vein) with Google's OCR?

          Just for you, I made one, because I'm that fucking cool.

          1. Visited http://www.jabberwocky.com/carroll/jabber/jabberwocky.html [jabberwocky.com].
          2. Printed page 1 (all but one link at the bottom of the page) with default settings on a HP LaserJet 2300.
          3. Scanned on an Epson 3170 as a 300 dpi grayscale PNG with otherwise default settings. (God DAMN this scanner is fast. But then my scanner at home is a shitty Mustek 1200UB since I broke my Canon LiDe.) 2528x3281 pixels.
          4. mespinoza@sec2lpt7-linux:~/ocropus/ocropus-cmd$ ./ocropus ocr ~/Desktop/out.png | tee /home/mespinoza/Desktop/jabberwocky.html (lots of output)

          Prepare to be unimpressed, because Results follow:

          JABBERWOCKY Lewis Carroll

          (from Through the Looking-Glass and What Alice Found There, 1872) `Twas bri11ig,_ andjghe 4s1it_hy toyes Digl gyre amid gimblejn thg wabe: All xiiimsy wei^e thg borogovgs, And theamome raths outigrabe. ''ggwqre thg Jalgbervvpck,_my sqn! The jaw; that bijtel the clayksathat catch! Bgyvaiie the Jubjub bird, anti shun The frumidus Bandersnatch!' I-Ie took his yorpal sword in hand: Long timg tlgewmangome foe he sought So rgSted he by the Tu_mtum tree, And stood awhile in thought. And, as in uffish thought he stood, The Jalgbgjwoclg, with eyes of flame, Cqmgwhjfflixgg through fhe tulgey wood, And burbled as it came! Qne, two! One, two! And through and thIi`Ollgh The jrorpgal b]ade went; snicaker-snack! I-Ie left iifdead, and with its head He went galumphing back. ''And, has thou slain thejabbexfwpck? Cpmg to my a_rxps!_my ljgaxjgishboyl Ojralqjousi dwgy! Qalladhl Callayl' He chortled in his joy. S

          \ A S

          X A ?`^s :

          , ' Was ga. ka%#* mm. -- M 1 1 Q at ) a iv 2. `Ail A it 3*,* `i 2 (V H ;. ````( * 4 ^Nq@ Eu..*s..%im X M is ? lgh ~ ``A? S [ A Fax I /),2*gE it ^`* 4 ~ *: ' X A mg x ix, ,t~;;;..: v' it ix '~ t ~ ^ ,4~ ---= =-^ A A i gv ; * XX, x> . . N S A ft 1 A-`A 3; `> ' ''YY \Jh ^***`(?i* , ~~ x `* at -;v- *<~ ' H ~~~-=.- ; `Twas bri11ig,_ and_the 4s1it_hy toyes Dig gyre arid gimblejn the wabe; All Qiixjnsy wei^e thq borogovgs, And thdmome raths outvgrabe.

          dshaw@iabbenNockv.com

          Return to Glorious Nonsense Return to Lewis Carroll

          Results End.

          Beautiful, eh? I also tried a 100 dpi grayscale scan, which came out even more like hash (one big paragraph) and a 300 dpi bitmap (1bpp) which was about the same as the 100 dpi gray scan in quality, though a bit better.

          Looks like ocropus has a while to go before it can slay the Jabberwock instead of thejabbexfwpck.
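For anyone who wants a number instead of eyeballing results like these, one common measure is the character error rate: the edit distance between the known text and the OCR output, divided by the length of the known text. The sketch below is plain Python with hypothetical function names. It assumes you supply the reference text yourself, and it is not part of ocropus.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Two lines from the scan: one that came through clean, one that did not.
clean = character_error_rate("And stood awhile in thought.",
                             "And stood awhile in thought.")
garbled = character_error_rate("The frumious Bandersnatch!",
                               "The frumidus Bandersnatch!")
```

Running the same function over the whole page at 100 dpi and 300 dpi would put a figure on "even more like hash".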

  • It's fascinating that Google has chosen the Apache license for the release of this product. Given that Eben Moglen has explicitly stated that the Apache License is incompatible with GPLv3, what does this mean for mixing this code into other projects?

    Even though v3 no longer has the anti-Google Affero provisions, Google still chooses Apache instead of GPLv3 or even v2 with a rider to upgrade to v3. You gotta believe the Google lawyers were thinking about this issue before release...

  • Ocropus? (Score:3, Funny)

    by 6Yankee (597075) on Tuesday April 10, 2007 @06:10PM (#18682791)
    Is that a Chinese mispronunciation? ;)
