Google Technology

Google Adds OCR To PDF and Images

Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"
This discussion has been archived. No new comments can be posted.

  • F1r5t p0st? (Score:5, Funny)

    by Chrisq ( 894406 ) on Tuesday June 22, 2010 @09:00AM (#32651916)
    F1r5t p0st? (OCR's by Google)
    • Captcha correction? (Score:4, Interesting)

      by 0100010001010011 ( 652467 ) on Tuesday June 22, 2010 @09:05AM (#32651968)

      Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCaptcha does, except with something a bit newer.

  • lolwut? (Score:3, Insightful)

    by Pojut ( 1027544 ) on Tuesday June 22, 2010 @09:00AM (#32651924) Homepage

    I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?

    • Maybe the font they were using was ShittyLowRezScan-Serifs.

    • Re:lolwut? (Score:4, Interesting)

      by erikdalen ( 99500 ) <erik.dalen@mensa.se> on Tuesday June 22, 2010 @09:19AM (#32652114) Homepage

      Didn't fail at all on a PDF with typed text for me. Did you actually try it?

      I bet they don't actually use OCR on a PDF with typed text, as they can just extract the text from the PDF; they probably use OCR only on images inside PDFs, though.

      • If you uploaded a PDF with typed text, it probably didn't even do OCR on it. It'd be pointless. You have to convert the pages to images for that to be necessary and I'm guessing you didn't.

        Open in Acrobat Reader and use the snapshot tool to capture an entire page. Paste into Word as an image, then re-export to PDF. Upload that and then see how the OCR fares. Of course you'll also get an excellent quality in the snapshot since it's a pure digital copy and it won't have the blemishes that you'd get by printin

      • I bet they don't actually use OCR on a PDF with typed text as they can just extract it from the PDF, they probably use that on images inside PDFs though. Have you tried it on a PDF that was an image of text, such as a scanned or photographed text document? That's the real test.

    • by AHuxley ( 892839 )
      Someone at Google made a mistake with the DPI setting? Between Tesseract and reCAPTCHA something should work.
    • by mlk ( 18543 )

      It is likely that the PDF tried above was scanned pages.

      • Re: (Score:3, Informative)

        by mlk ( 18543 )

        I've just tried with the extract [37signals.com].

        The text extraction seems to have worked well. Unsurprisingly the formatting has been lost and it has got confused with the REwork type bits. PDFs are not designed with extraction to an editable format in mind, so getting any of the formatting is impressive in my book.

        • PDFs are not designed with extraction to an editable format in mind

          Spoken like someone who has never read the PDF spec. PDFs are, in fact, specifically designed to allow editing. Everything in a PDF is stored as an object inside the document, indexed via an object table. Text runs are single objects containing a stream of commands sent to a PostScript-like VM to control their positioning. You can relatively easily map these to rich text in some other format, and you can trivially replace any object in a PDF by adding a new version and appending a new object table with

          • by mlk ( 18543 )

            Nope - I'll rephrase it to "most tools do not output a format that is suited to extraction to a common editable format (such as Word)" if you like.

            The spec may allow for easy editing, but converting a PDF (a PDF containing text with formatting, not an image stored in a PDF) is hard. Chunks of unrelated text get bunched together into a single object, while other chunks of text that are related get thrown into some unrelated chunk, so extracting it all logically becomes a royal pain. Sure this is the "fault" of the creation to

    • OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.

      Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because the
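
One of the layout steps listed above, grouping characters into lines, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Google Docs or any particular OCR engine does it: the `CharBox` type, the `group_into_lines` function, and the `tolerance` parameter are all hypothetical names, and the input is assumed to be character boxes that some earlier stage has already found.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """Hypothetical output of a character detector (not a real library type)."""
    char: str
    x: float       # left edge of the character
    y: float       # vertical centre; larger y means further down the page
    height: float  # character height, used to scale the tolerance


def group_into_lines(boxes, tolerance=0.5):
    """Greedily group character boxes into text lines.

    Boxes are visited top to bottom. A box joins the most recent line when
    its vertical centre is within tolerance * height of that line's mean
    centre; otherwise it starts a new line. Each line is then read left to
    right. Multi-column pages, floats, and captions are NOT handled.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b.y):
        if lines:
            current = lines[-1]
            mean_y = sum(b.y for b in current) / len(current)
            if abs(box.y - mean_y) <= tolerance * box.height:
                current.append(box)
                continue
        lines.append([box])
    return ["".join(b.char for b in sorted(line, key=lambda b: b.x))
            for line in lines]


# Usage: two short lines whose characters arrive in arbitrary order.
page = [
    CharBox("k", 12, 30.2, 10), CharBox("H", 0, 10.0, 10),
    CharBox("o", 0, 30.0, 10), CharBox("i", 12, 10.5, 10),
]
print(group_into_lines(page))
```

Even this toy version shows why the step is fragile: a slightly skewed scan or a two-column layout breaks the "same height means same line" assumption, and everything downstream inherits the mistake.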

  • by AHuxley ( 892839 ) on Tuesday June 22, 2010 @09:08AM (#32652004) Journal
    With all the words deciphered, no bump in the OCR backend?
    • ReCAPTCHA was meant to fix bad scans in specific works; I didn't think it was ever designed to further OCR, but I see how it could possibly be useful.

      • by AHuxley ( 892839 )
        http://en.wikipedia.org/wiki/ReCAPTCHA [wikipedia.org] seems to be in use for some form of OCR?
        "The reCAPTCHA software itself is not open source" could be the issue?
        • I think the point Loconut was making is that ReCaptcha does not 'further' machine OCR (i.e. it doesn't improve the recognition algorithms used by the OCR software), instead using humans to 'OCR' words that otherwise aren't legible.

          • by AHuxley ( 892839 )
            Pity they did not improve the recognition algorithms with all the data flowing in.
            Cost vs a tiny % in better recognition vs a free network of humans.
            Thanks for the info, I was thinking that a quality private OCR system was getting the ReCAPTCHA inputs and it was learning.
  • by OzPeter ( 195038 ) on Tuesday June 22, 2010 @09:19AM (#32652104)
    How long before you see an automated system to upload and process Captcha images on google?
    • by mrops ( 927562 )

      A little offtopic.

      I have always wondered about how much processing Google does, more so than any other corporation in recent times: this OCR, searches, building heuristics for searches, and so on. Combined, these are no small tasks. Is there a number for how much processing power Google has? Does Google's computing grid qualify as a supercomputing grid? How does it stand compared to all those other supercomputers?

    • Re: (Score:2, Interesting)

      One of the easier ways to restrict how your words and ideas are searched and indexed on the net is to hide them in plain sight. A jpg image of text is very difficult for a search engine to use, yet you and I can read and understand the data quite easily. This ability to scan online has been around, but not mainstream to my knowledge. I'm guessing Google has been checking jpgs for text as a trial for some time. Once this is gone maybe ASCII art text will work for a while. Hiding/protecting data by stegano
    • I can't read captchas 60% of the time, and am not always in an area where I can listen to the audio hint. An OCR tool would be nice. On Boing Boing, I usually mistype the captchas about 3-4 times before finally stumbling on one I can actually read.

      • I had to use a captcha for work once, and the captcha itself was incorrect. I have no idea what key combination would have worked, but what the captcha said certainly didn't. It had an audio option, so I tried it, but the audio was so garbled I couldn't pick out a single word, let alone the three necessary to complete the captcha.

        I like captcha as a basic form of protection from bots, but when it keeps me from accessing a website it is beyond worthless.

  • For now it sucks, but we know that if Google wants to, it can beat the best on the market.
    Just wondering if this will get good enough to make mass captcha cracking cheap.

  • by AdmiralXyz ( 1378985 ) on Tuesday June 22, 2010 @09:28AM (#32652216)
    I know that with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too: the quality of Google Voice's voicemail-to-text transcriptions started off horrible and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.
    • Google Translate's results between the Western languages I've tried are pretty good, but the Asian languages need a lot of work, IMO.

    • by steveg ( 55825 )

      Awesome might be pushing it a bit, but I'll agree it's gotten better. It's never quite right, but I can usually get the gist of the message even before I have a chance to listen directly.

  • OCR efficiency (Score:1, Interesting)

    by Anonymous Coward

    > the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

    Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)

  • by clone53421 ( 1310749 ) on Tuesday June 22, 2010 @09:36AM (#32652298) Journal

    They really should hide the text underneath the actual scanned image, though, so that what you're looking at is the real page, but searchable. That takes care of the layout issue. And since you aren't trying to read the garbled text, a 10% error rate, although still rather high, won't matter as much: you'll only notice it if you copy-and-paste, or if a search misses a few hits because they were incorrectly OCR'd. Not a huge deal.

    • The funny thing is that their OCR seems to be pretty good for Google Books. Yes, it's photographed pictures, but you can search the text, which means some type of OCR must be going on. So, unless they are using a completely different technology, then this should really only have issues with hand-written text.

      • Photographed text. Blah. Should have proofread before I hit submit.

      • Copy-and-paste some text from it and see how good the OCR was. You’ll be able to see the mistakes that were previously hidden.

        I’m guessing it’s exactly the same engine, but done exactly as I said it should be, correctly.

  • First: My suggestion is that Google should put its efforts into making Google Docs at least as usable as Zoho Office first.

    How can a small company like Zoho beat Google on usability?

    Second: GMail still sucks [at search experience], big time in my opinion. Here's why: I had this happen to me recently...

    I knew an email existed but could not remember much about it at all! Yes sometimes, you need a memory trigger for lack of a better word.

    My search term was "details" and Gmail returned 311 messages. I also knew t

    • By the way, I searched for the string "when" in an in-box that had 142,211 emails and I received 11,317 emails back!

      You can hardly expect Google to make up for your lack of search skills or memory. I'm not saying you don't have other valid points, but searching for such basic terms as 'when' and 'details' instead of something that is unique to the message is bound to return tons of results. It's not Google's fault that you have 11,317 emails with the word 'when' in it.

      • You still do not get it, I am afraid! And that's the very reason that companies like Apple and Microsoft at one point in the past made life incredibly easy for computer users. This is why they excelled, of course making users "dumb" in the process.

        You can hardly expect Google to make up for your lack of search skills or memory.

        This is the very mistake you make...How come Google now categorizes results of search terms at google.com? Tell me why. I just searched for "House Skills" and had categories of videos, discussions, books, news, blogs, updates returned. So according to you, categor

        • Possibly because computers and networks can't be expected to infer meaning? Remember, in your search field you've entered no context, no kinds of specifying statements, so you're expecting the computer to be able to read your mind?
          • ...so you're expecting the computer to be able to read your mind?

            No sir! I expect the computer to categorize, and I know it is possible because I have seen it elsewhere...even in applications by the same vendor.

    • Search:
      details has:attachment .tiff

      I see your point about it being nice if they'd automatically label this stuff, but you can search for attachments. This might turn up something that has a different kind of attachment and merely mentions ".tiff" in the email, but what you're looking for should turn up.

      I have found Gmail search to be vastly superior to Yahoo! and Outlook since I've switched. They have some great tips [google.com] on how to search.
    • The thing I've had trouble with in gmail search is that it lacks any sort of lemmatisation. This would be fine if it would match sub-strings within words, but it seems to only match full words that are morphologically identical.
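
The difference between the three matching behaviours mentioned above can be shown with a small Python sketch. It says nothing about how Gmail search is actually implemented; the function names are made up for illustration, and `crude_stem` is a deliberately naive suffix stripper standing in for real lemmatisation, which needs a dictionary and part-of-speech information.

```python
import re


def tokens(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_whole_word(query, text):
    """Match only words identical to the query (the behaviour complained about)."""
    return query.lower() in tokens(text)


def matches_substring(query, text):
    """Match the query anywhere, including inside longer words."""
    return query.lower() in text.lower()


def crude_stem(word):
    """Strip one common English suffix. A naive stand-in for a lemmatiser."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches_stem(query, text):
    """Match when the query and some word in the text share a crude stem."""
    target = crude_stem(query.lower())
    return any(crude_stem(token) == target for token in tokens(text))


# Usage: one message body, two queries that differ only in word form.
body = "Scanned invoices attached"
for query in ("invoice", "scanning"):
    print(query,
          matches_whole_word(query, body),
          matches_substring(query, body),
          matches_stem(query, body))
```

With this body, "invoice" is missed by whole-word matching but found by the other two, while "scanning" is found only by the stem comparison, since it is not a substring of "scanned".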
    • The problem is that you weren't using Google's search engine properly. You failed to give it all the relevant information you DID remember. Next time include "tiff" in your search as well as clicking the box "Has attachment" in search options.

      In fact, go do that now, then come back and tell me how many results you get.

  • It'd be cool if it was GPL'd :).
  • OCR Reality (Score:3, Informative)

    by Shadow Wrought ( 586631 ) * <shadow@wrought.gmail@com> on Tuesday June 22, 2010 @12:19PM (#32654498) Homepage Journal
    About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

    What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, a 10% error rate is still very common with scanned docs, and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!

    Maybe you should actually know something about the particular field before you judge?
    • Yeah, I was thinking the same thing. This sounds like someone who hasn't actually done OCR prior to these fancy Google docs.

      OCR has always been somewhat inaccurate. It's just the nature of the beast.
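
A figure like "10% of the text incorrectly converted" is usually made precise as a character error rate: the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The Python sketch below shows that calculation. The function names are illustrative, and nothing in the story says how the 10% in the summary was actually measured.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1,      # delete from reference
                               current[j - 1] + 1,   # insert into reference
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Usage: one wrong character in a ten-character reference is a 10% error rate.
print(character_error_rate("abcdefghij", "abcdefghiX"))
```

Note that a 10% character error rate is worse than it sounds: if errors are spread evenly, most words of average length contain at least one wrong character.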

  • Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.

  • What are the supported character sets? Is it only roman characters or what?
  • This Google OCR thing just gives you the text. What about the formatting? Why not use something like WatchOCR from http://www.watchocr.com./ [www.watchocr.com]? It creates text-searchable PDFs from image-only PDFs, and it's all free and open source. You just drop them into a watched folder and the server spits them out as text-searchable. It runs as a LiveCD so you don't even have to install anything to try it.
