Google Docs' OCR Quality Tested

Posted by timothy
from the weighed-in-the-balance-and-found-wanting dept.
orenh writes "Google has released a Google Docs application for Android, which includes the ability to create documents by OCR-ing photos. I tested the application's OCR quality and found that it's mediocre under the best conditions and poor under real-world conditions. However, I believe that this poor performance is caused in part by an intentional decision by Google."
This discussion has been archived. No new comments can be posted.

  • /b/ (Score:1, Interesting)

    by stonewallred (1465497)
    Since the standard practice on 4chan is to use the word niggers for any word in a recaptcha that has a punctuation mark, I question just how good the OCR is.
    • by Snaller (147050)

      Trump? Is that you?

    • by zill (1690130)
      You realize that recaptcha knows exactly which site the captchas come from, right? It would only take a single line of code to filter out all the noise from 4chan.
  • I've played around with Google's OCR framework (tesseract) and it is far from perfect. So, this isn't really a surprise.
    • by icebike (68054)

      It's also far from new. Didn't they get that from some long dead Open Source project?

      • I've played around with Google's OCR framework (tesseract) and it is far from perfect. So, this isn't really a surprise.

        It's also far from new. Didn't they get that from some long dead Open Source project?

        Answered in the order you mentioned each:

        Yes, far from new (project started 26 years ago).

        No, not long dead. Just "recently" (roughly 6 years ago, give or take) open sourced and ported/compiled for Linux, OS/2 (and other platforms I am sure).

        Yes (open source project), and I think it was called... Tesseract. Kinda like the poster you responded to mentioned. ;-)

        To save you the work, it was an HP/UNLV project, started in 1985, that was open-sourced in 2005. It is still available on SourceForge [sourceforge.net].

    • by owlstead (636356)

      Yeah, I was looking for an android OCR library, and that one was the only one that came up. Although there are a few other Linux options, none of those seemed to be right on the money either. This article is strengthening the already published reports on open source OCR software: basically, it's not performing all that well. I wish it was.

      • This article is strengthening the already published reports on open source OCR software: basically, it's not performing all that well. I wish it was.

        Now that it's getting some exposure, I'd say it'll be performing a lot better soon.

        Nothing like being in the public eye for attracting clever people's attention.

    • by camperslo (704715)

      I guess it'll be a little while before we'll see an app I'd wondered about. I thought it would be useful to be able to take snapshots of things like news reports (streamed on the web, El Gato Eye-TV domestic or satellite t.v., YouTube etc.) and do OCR on them, AND get an English translation of it. With the events so far this year, support for Japanese and Arabic languages would have been a good start.

      • Definitely. Weirdly, I was wondering this afternoon if Goggles can already do OCR and translation on full pages of text. I have a French book that I'd love to read, but I have basically no French!

        • by owlstead (636356)

          With the top-notch translators that are around today, you may be able to get the gist of the book. But the chance that the translation of the book will be a joy to read is about zero, zip, nada, nothing. You'd better buy a good translation or, if that's not available, try and learn French (with the book itself as source material maybe).

          • by retchdog (1319261)

            there was a paper about combining a (crappy) machine translation with low-skilled workers, who natively understand the target language, to patch up the glaring flaws. the idea is that _most_ of the errors made by the machine don't require understanding of the source language to detect. of course you lose out on anything 'deep' or artistic in the source language, and i would be hesitant to trust it for scientific papers or legal documents, but it's an interesting idea.

            • by ggeens (53767)

              there was a paper about combining a (crappy) machine translation with low-skilled workers, who natively understand the target language, to patch up the glaring flaws.

              I'm working on a project where the translations are handled like that. We send all texts to an external company, and a few hours later, they send back the translation. This seems to work relatively well.

              The next phase involves immediate translation without human intervention. I'm curious as to how that will work out.

            • there was a paper about combining a (crappy) machine translation with low-skilled workers

              Even better, Distributed Proofreaders [wikipedia.org] is Project Gutenberg's version of just that. They've probably passed 20,000 books OCR'd and proofed by now.

          • There are no translations available or I'd buy them. It's a book about Parkour by David Belle. I'm just interested in basic history and his opinions rather than flowery language or whatever. If there is much discussion of technique it might be really hard to understand though - I auto-translated a French tutorial on rolling before, and it would just read as gibberish to someone who didn't already have a good idea of the technique.

            • by lxs (131946)

              Or you could learn French if much of the literature in your field of interest is in that language. It isn't that hard if you're not interested in fluency. You will also have gained a valuable skill.

              Hey, it's more useful than either Elvish or Klingon.

  • by icebike (68054) on Thursday April 28, 2011 @06:34PM (#35969858)

    There are a number of scanner apps in the market that do a much better job in the first step of this process, which is taking the picture. They then concentrate their efforts on producing a clean usable PDF of the document. I tested one of these and found that the PDF rendered by it was much better than the PDF produced by Google. [android.com]
    Everything is crisp and readable.

    If the first fails, it's no wonder the second OCR step fails.

    • by X0563511 (793323)

      And how do you plan on searching, indexing, or otherwise having a computer operate on the contents of that document?

      • by icebike (68054)

        Just how many of such documents do you expect to have to index taken with a cellphone? Seriously, this is a toy. Don't go all corporate archives on me here.

        • Even then, I have yet to work for a company that has a searchable PDF archive. Even when I worked for Fairfax (media company here in AU that publishes national & local newspapers), the PDF archive that came straight out of the publishing app wasn't searchable. Hell, it only had 3 months of the paper on servers, the rest were on archive DVDs.

          The whole idea of searchable PDFs died a long time ago, which is why businesses use purpose-built products.

          Also, the OP stated that it was the original PDF that was gen

          • by afidel (530433)
            We OCR everything that's scanned into our document management system, search would be basically impossible without it since relying on users to accurately enter metadata is suicidal if you want useful data.
            • My job entails working with our office's document management system to manually enter metadata.
              In part, I essentially end up parsing the data which users entered in various formats.
              However, since the original form is entered electronically to begin with, I figure this could be a lot more automated. (The people in my office definitely have a clue; however, fat chance moving this up through the bureaucracy.)

          • by pruss (246395)
            Searchable pdfs are not dead. For instance, jstor.org's large repository of scholarly journals is searchable pdfs. jstor is very heavily used in my field. Not perfect, but pretty good.
        • by X0563511 (793323)

          Just how many of such documents do you expect to have to index taken with a cellphone? Seriously, this is a toy. Don't go all corporate archives on me here.

          Well, that's the whole point to OCR. If you're just scanning, then you're just scanning. OCR'ing lets you do all kinds of text processing, analysis, format shifting etc. A scan is... just a picture of a document. Makes me think of microfiche.

          • by icebike (68054)

            True, but again, this is a cell phone app. You don't expect document management system level capabilities, especially not in release 1.0.
            If you want that level of quality you bring something more than a cell phone to the task. Maybe a flatbed or something.

            My point here is this: I've had much better luck going direct to PDF on the phone than via Google Docs.

            Try this test if you have a Google Docs account, (even a free one):

            Upload some PDF, even one created using something on your phone like CamScanner [appbrain.com].
            Th

            • by X0563511 (793323)

              I don't touch Google anything, except for email. I'd much rather use -real- solutions, with my nice flatbed etc :)

              It is odd that your phone does that better than Google...

            • by AmiMoJo (196126)

              This is a Google product. They like to release early and do public betas lasting years, so expect rapid improvements.

              There seems little point in reviewing a new Google product until it has matured somewhat because the first version is always half done sort-of-works quality code. The first version of Android typed everything entered into the phone into a hidden root shell for crying out loud. About the only area they seem to hold off in is the front page of their search engine.

  • by CajunArson (465943) on Thursday April 28, 2011 @06:34PM (#35969864) Journal

    The end of the article is pretty telling. Basically any professional OCR software from the mid-1990s and normal consumer-grade commercial software from today is light-years ahead of open source solutions. Which is kind of sad, but the problem is that there really isn't a huge market for OCR in the way that there is for web browsers and other more successful projects, coupled with the inherent difficulty in doing good OCR.

    • CAPTCHA Breakers (Score:4, Interesting)

      by MoonBuggy (611105) on Thursday April 28, 2011 @06:55PM (#35970038) Journal

      If the increasing absurdity of the CAPTCHAs I tend to see is anything to go by, there are programs out there that'll read normal printed text from even the crappiest photo without missing a beat. The question is, are the spammers using standard commercial solutions, or have they got some useful tech of their own that we might be able to get our hands on (seize it as part of a settlement and make it public domain, for instance).

      • Re: (Score:3, Insightful)

        by jewelises (739285)

        I don't think that spammers have any amazing tech, they just have different requirements. They can still send spam with a 1% success rate whereas with OCR you'd want a 99% success rate.

        • by perpenso (1613749) on Thursday April 28, 2011 @07:32PM (#35970312)

          I don't think that spammers have any amazing tech, they just have different requirements. They can still send spam with a 1% success rate whereas with OCR you'd want a 99% success rate.

          I once worked on an OCR project. The client specified a 99% success rate and we strained to restrain our grins. 99% is about one error every one or two lines of text. We got 99.6% in our first implementation before we even began to work on accuracy. Admittedly we had excellent image quality. This was a custom solution that had its own optics.
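A rough check of the "one error every one or two lines" figure above, as a minimal sketch. The 80-characters-per-line width is an assumption (the comment does not give one), and errors are treated as evenly spread across the text.

```python
def lines_per_error(char_accuracy: float, chars_per_line: int = 80) -> float:
    """Average number of text lines between character errors.

    chars_per_line is an assumed line width, not a figure from the thread.
    """
    errors_per_line = (1.0 - char_accuracy) * chars_per_line
    return 1.0 / errors_per_line


# 99% per-character accuracy versus the 99.6% mentioned in the comment.
print(lines_per_error(0.99))
print(lines_per_error(0.996))
```

Under that assumption, 99% works out to a little over one line between errors and 99.6% to roughly three lines, which matches the commenter's point that 99% is a low bar for per-character recognition.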

          • by martin-boundary (547041) on Thursday April 28, 2011 @09:08PM (#35970850)
            Heh, it's always fun to reinterpret requirements to make them easier to implement :)

            A 99% success rate could also mean 99 pages with zero errors out of a 100 pages attempted. With 250 words per page that would represent a mandated success rate of 99.995%
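The page-level reading above can be turned into a per-word requirement with a short calculation. This is a sketch that assumes word errors are independent, which real OCR errors are not (they cluster on bad scans); the 250 words per page comes from the comment.

```python
def per_word_accuracy_needed(page_success_rate: float, words_per_page: int = 250) -> float:
    """Per-word accuracy p such that p ** words_per_page == page_success_rate.

    Assumes each word is recognized independently of the others.
    """
    return page_success_rate ** (1.0 / words_per_page)


needed = per_word_accuracy_needed(0.99)
print(f"{needed:.6%}")
```

Under the independence assumption the result is about 99.996% per word, in line with the roughly 99.995% quoted in the comment.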

            • by thegarbz (1787294)

              QUICK A LAWYER, LET'S GET HIM!

              As an aside. Stupid slashdot filter is telling me using caps is like yelling. Well I AM yelling.

            • by perpenso (1613749)

              Heh, it's always fun to reinterpret requirements to make them easier to implement :)

              A 99% success rate could also mean 99 pages with zero errors out of a 100 pages attempted. With 250 words per page that would represent a mandated success rate of 99.995%

              Thankfully the client specified 99% with respect to character recognition not correct pages. If they were specifying pages we would have been straining to suppress pissing our pants rather than suppressing grins. :-)

            • by tompaulco (629533)
              Heh, it's always fun to reinterpret requirements to make them easier to implement

              That's what our customers do to us. We promised 95% accuracy rate on OCR per CHARACTER, but they generate their numbers off of how many fields of data had a wrong character in them.
              Of course, we also specified that based on clean images scanned at 300 DPI, and they give us crap images scanned at 200 DPI with fold lines, highlighter and pen scribble and apparently their mailing machine sprays some kind of serial number on ev
            • obligatory XKCD [xkcd.com] (alt text is relevant)

          • by AmiMoJo (196126)

            Google's approach to accuracy appears to be somewhat novel. Most OCR software uses spelling correction and grammar rules to improve accuracy but Google use data derived from the contents of pages they index. They use it for translation too which, when it works, gives their output a more natural quality compared to previous efforts. I find that Chinese to English works particularly well.

            Doubling OCR accuracy is exponentially harder. Unlike a human that can easily pick up on what type of document it is (lette

            • by perpenso (1613749)

              ... Google use data derived from the contents of pages they index ...

              Interesting. I guess that adapts for common usage deviating from proper spelling and grammar.

              ... pick up on what type of document it is (letter, technical manual, novel, newspaper article) and make informed mental corrections ...

              Machines will do this to a degree, for example favoring lowercase L when the surrounding characters are alphabetic and favoring one when the surrounding characters are numeric. But yeah, context rules, the preceding works well enough in prose but often fails in source code.
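The neighbor-based rule described above (prefer lowercase L next to letters, the digit one next to digits) might look something like this. It is only an illustrative sketch: `resolve_l_vs_one` is a hypothetical helper, not how any particular OCR engine implements context correction.

```python
AMBIGUOUS = {"l", "1"}


def resolve_l_vs_one(text: str) -> str:
    """Re-decide each 'l'/'1' from its nearest unambiguous neighbors."""
    chars = list(text)
    for i, ch in enumerate(text):
        if ch not in AMBIGUOUS:
            continue
        neighbors = []
        for step in (-1, 1):
            j = i + step
            # Skip over adjacent ambiguous glyphs; they carry no evidence.
            while 0 <= j < len(text) and text[j] in AMBIGUOUS:
                j += step
            if 0 <= j < len(text):
                neighbors.append(text[j])
        letters = sum(c.isalpha() for c in neighbors)
        digits = sum(c.isdigit() for c in neighbors)
        if letters > digits:
            chars[i] = "l"
        elif digits > letters:
            chars[i] = "1"
        # On a tie (or no evidence) the original guess is kept.
    return "".join(chars)


print(resolve_l_vs_one("he11o"))    # prose-like input
print(resolve_l_vs_one("20l1"))     # numeric input
print(resolve_l_vs_one("x1 = l1"))  # source-code-like input
```

The first two inputs come out as `hello` and `2011`, while the identifier `x1` is wrongly turned into `xl`, which is the kind of failure on source code that the comment mentions.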

        • Don't tell him this. It's funnier to let him keep PH3AR1NG TEH 3L33T HAXORZ.

    • People expect OCR to be magic so are always disappointed when they first run the stuff. They do not understand that one or two uncertainties per page is a pretty spectacularly good result until you've been able to train the thing with identically laid out documents on the same paper etc for a long time. Feeding in stuff printed on a dot matrix makes the secretaries cry ten minutes after they greeted the arrival of the OCR software with joy. Of course it works a bit better on later pages after tweaking or
      • by tompaulco (629533)
        Well, that is where the commercial software has open source beat. They have already trained their OCR on millions of characters. But then, there is no retraining most of them, other than upgrading to the next version when it comes out. Tesseract you can train, but it starts out pretty crappy. Whether Tesseract is of any use to you depends on what your needs are. If you are going to be OCRing something that has a fairly narrow range of image quality and font, then you can train Tesseract to pick it up very s
  • According to the article, it doesn't have a flash, which is completely incorrect. I thought maybe the Docs application doesn't use the flash when taking pictures, but again... this is incorrect.
    • by icebike (68054)

      Google Docs will use the flash or not, based on user settings, so, yeah, he just missed that.

      But in my tests with a Nexus One (not a Nexus S), using the flash at the range needed to see the picture just puts a white blob in the center of the shot and is actually worse than using bright room lights.

    • by Idbar (1034346)
      What article? The link seems to be pointing to a 403 Error page. At least to me.
  • I'm in the market for a good way of recognizing OCR-B based characters on an android device (mostly uppercase characters and digits). I know the location (on a flat 2D plane in a 3D space) of the characters, but they do not form sentences or even words. Does anyone have a good algorithm to do this kind of low-level character recognition? A library would be even better of course, especially if it is open source. I'm personally thinking of comparing bitmaps or vectors.

    As a hint to other devs, many commercial

  • Um... (Score:5, Insightful)

    by Shadow Wrought (586631) * <shadow.wroughtNO@SPAMgmail.com> on Thursday April 28, 2011 @06:46PM (#35969952) Homepage Journal
    He uploaded the 120 dpi image instead of the 300 dpi image and is surprised the OCR sucks. Really? Lossy isn't the concern when you're OCR'ing black text on a white background. Seriously. Think about what the image is actually going to be used for, then make your decision.

    And, seriously, how effective of OCR'ing are you really imagining you're going to get off of a camera phone pic, anyway?
    • It seems TFA is giving 403 errors, but Google's 300 DPI PDFs that you can download for public domain books often have incredibly poor quality, much poorer than you get with 300 DPI on a cheap home scanner. While they might be marginally acceptable for novels, for the old math books I'm interested in, the Google PDFs are mostly useless. Often you can't disambiguate small blurry subscripts by eye, never mind OCR. On the other hand, I have never had a problem reading 300DPI subscripts on scans I make at hom
    • by sootman (158191)

      > And, seriously, how effective of OCR'ing are you really imagining
      > you're going to get off of a camera phone pic, anyway?

      Camera phones are getting quite good. An iPhone 4 takes 5MP images and there are many others out now that are as good or better.

      Specifically, the images are 2592x1936 pixels which equates to 225 dpi at 8.5" x 11". That's plenty to OCR a typical page--say, 8.5x11 with clean 12-point type. I've carefully taken photos of documents with my phone and printed them and they're indistingu
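The resolution figure above can be checked with a quick calculation. This sketch assumes the page exactly fills the frame with the long image edge along the 11" side, which a hand-held photo will rarely achieve.

```python
def effective_dpi(px_long: int, px_short: int,
                  in_long: float = 11.0, in_short: float = 8.5) -> float:
    """Dots per inch when a page fills the whole frame.

    The tighter of the two axes limits the usable resolution.
    """
    return min(px_long / in_long, px_short / in_short)


print(round(effective_dpi(2592, 1936), 1))
```

The short axis is the limiting one here, and the result lands close to the 225 dpi quoted in the comment. Any margin around the page in the photo lowers the effective figure further.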

  • I suppose this retard thinks he's clever.

    Bad Kitty!

    Verily, you may not link directly to images. Link to their containing web page instead.

    You tried to access: /blog/

    From: http://hurvitz.org/ [hurvitz.org]

    I have spoken!

  • They Why (Score:3, Informative)

    by RileyCR (672169) on Thursday April 28, 2011 @06:58PM (#35970054) Homepage
    Google took the Tesseract OCR engine, one of the first engines, and wrapped document analysis and some high level improvements on it. In the current OCR market landscape there are only 4 commercial engines, and two that make up 98% of the market. Compared to those two OCROpus is not even close because of the legacy engine. So the real reason is it's old technology, very old. Unless Google licenses ABBYY or Nuance they will not get any better. The reality is OCR takes 50 man-years to develop to compete with these top two engines, and it's just not practical for even Google to go out and start from scratch.
    • by camperslo (704715)

      Does that mean it couldn't be a viable candidate for some Summer of Code work then?

      • by perpenso (1613749) on Thursday April 28, 2011 @07:39PM (#35970370)

        Does that mean it couldn't be a viable candidate for some Summer of Code work then?

        More like a bunch of master's/PhD theses to get started.

        OCR is an area of AI research under the topic of Computer Vision. It is yet another area that seems simple in concept but turns out to be incredibly difficult in practice.

        • by Lehk228 (705449)
          seems to me that OCR would be an area that would be easy to build a framework for genetic algorithms, using a huge collection of solved OCR pages to evaluate. with each generation being tested on a random subset of pages so they do not learn to cheat instead of learn to solve.

          only problem is sometimes GA make a solution that makes no sense and should not work but somehow does http://www.damninteresting.com/on-the-origin-of-circuits [damninteresting.com]
          • by perpenso (1613749)

            seems to me that OCR would be an area that would be easy to build a framework for genetic algorithms, using a huge collection of solved OCR pages to evaluate. with each generation being tested on a random subset of pages so they do not learn to cheat instead of learn to solve.

            Sounds like a great thesis project. :-)

          • by koxkoxkox (879667)

            Genetic algorithms are an optimisation algorithm, but what do you want to optimise exactly? What are your individuals here?

            The idea of using a large collection of solved problems to check and improve the accuracy of the method looks more like a neural network to me. Indeed, this seems to be a common method for OCR. For example: http://www.codeproject.com/KB/dotnet/simple_ocr.aspx [codeproject.com]

            • by Tacvek (948259)

              While neural networks are a good solution, genetic algorithms can still be used in conjunction with them.

              One possible training method for neural networks happens to be genetic algorithms. The genes being the link strengths, and the fitness function being say the percentage of correct results. (If you reach a sufficiently high level, you might want to change to minimizing uncertainty, with a fitness dropping exponentially if the correct percentage drops too low.)

              In the alternative genetic algorithms can be u

    • Until the day you can hold up a document in front of your iBhone camera and have it snap and convert that document with 99%+ accuracy and have spell and grammatical checking solve the other 1% accurately to 99% also, meaning 99.99% conversion is done properly in any language, the technology won't be tolerated by end users. That will take more as you say than Tesseract, as you so well pointed out. Google should stop whoring themselves as OpenSource focused and just do the right thing by purchasing outright
      • You haven't done the math. 1% is not nearly enough. This reply is 300 chars long and getting 3 wrong is annoying enough. Error rate should go down to .000001% for OCR to be a commodity, and that's with good 600dpi originals. Factor in crappy scans, poor resolution/contrast and you are in for a pretty tough ride.
      • Just put the OCR'd text in a side channel with the image, as PDF does. Then you get a searchable, copyable document, and preserve the original formatting and avoid the need for extremely low error rate.
    • by afidel (530433)
      Hmm, of the four engines we use you mentioned two. Abbyy has by far the worst recognition rate (but is most flexible for scan setup so we use it for arbitrary documents rather than the forms based stuff going into our document management system). We also use Nuance through Adlib. The other two we use are Kofax AIP, and DokuStar.
      • by tompaulco (629533)
        My shop uses Nuance through two different products, and we are looking into directly interfacing with Abbyy. The results we have seen from Abbyy have been much better than what we have seen through Nuance. I guess mileage varies.
  • by Nick Ives (317)

    Did anyone else mirror this? I'm just getting a 403.

  • Forbidden

    You don't have permission to access /blog/2011/04/ocr-quality-of-google-docs on this server.
    Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

    Nice link, asshole.

  • by Anonymous Coward

    Self-promote to /. and host on a box that can't handle the limited traffic of a 25-comment popularity story?

    GOOD WORK SON

  • You can get better results by using CamScanner [appbrain.com] to capture the image, then upload the JPG to Google Docs. I found that uploading the JPG works better than uploading the PDF.
  • Google prides itself on having supposedly the best quality apps and features, which is why they take years to leave Beta. Why would they intentionally release a crippled version of their app? That will be the worst thing since Google Books with the missing pages.

  • Wikipedia says:

    Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artif
  • "You're holding it wrong."
  • I think the quality is tolerable. I photographed a document lying on my desk, without doing anything special to make it smooth or adjust lighting. This is a good simulation of a real-world situation where you can photograph a piece of text. There were errors in the transcription but it was readable, and with a very little editing would have been perfect. What surprised me was that apparently the whole image was uploaded from my phone to Google Docs, and then downloaded again, which is a little bit inefficient; I think that the OCR process runs server side.

    I see this as very useful. This afternoon I'm going in to the local planning office to look at some planning applications; I won't be able to take them away, and I doubt I'll be allowed to use a photocopier, but I will have my phone. That's a real world application. I can think of hundreds more.

  • I've tried a couple of the free applications that Google has made available, and they've been really inferior products. It's no surprise that they've put out yet another amateurish effort.
