Google Technology

Google Adds OCR To PDF and Images

Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"
This discussion has been archived. No new comments can be posted.

  • Captcha correction? (Score:4, Interesting)

    by 0100010001010011 ( 652467 ) on Tuesday June 22, 2010 @09:05AM (#32651968)

    Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCaptcha does, except with something a bit newer.

  • by AHuxley ( 892839 ) on Tuesday June 22, 2010 @09:08AM (#32652004) Journal
    With all the words deciphered, no bump in the OCR backend?
  • by OzPeter ( 195038 ) on Tuesday June 22, 2010 @09:19AM (#32652104)
    How long before you see an automated system to upload and process Captcha images on Google?
  • Re:lolwut? (Score:4, Interesting)

    by erikdalen ( 99500 ) <erik.dalen@mensa.se> on Tuesday June 22, 2010 @09:19AM (#32652114) Homepage

    Didn't fail at all on a PDF with typed text for me. Did you actually try it?

    I bet they don't actually use OCR on a PDF with typed text, since they can just extract the text from the PDF. They probably use OCR on images inside PDFs, though.

  • OCR efficiency (Score:1, Interesting)

    by Anonymous Coward on Tuesday June 22, 2010 @09:34AM (#32652276)

    > the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

    Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)

  • by BrightSpark ( 1578977 ) on Tuesday June 22, 2010 @11:15AM (#32653640)
    One of the easier ways to restrict how your words and ideas are searched and indexed on the net is to hide them in plain sight. A jpg image of text is very difficult for a search engine to use, yet you and I can read and understand the data quite easily. This ability to scan online has been around, but it has not been mainstream to my knowledge. I'm guessing Google has been checking jpgs for text as a trial for some time. Once this is gone, maybe ASCII art text will work for a while. Hiding or protecting data by steganography is now detectable by scanning, e.g. http://www.outguess.org/detection.php [outguess.org], so the battle continues. Of course, one can work offline and send letters to each other and be protected by law :-) I wonder if one day sending stuff by mail will seem shady?
  • by Anonymous Coward on Tuesday June 22, 2010 @01:29PM (#32655408)
    Check into how the current reCaptcha works. The user is presented with two words: one is known to be correct, the other is suspect, and the user is not told which is which. The user types both words, and the backend verifies that the known word was typed correctly, then logs the value typed for the suspect word. It shows the suspect word image to a few more users, and if they all respond with the same text along with the correct known word, the system can assume the suspect image contains that text. http://stackoverflow.com/questions/1435696/how-does-recaptcha-work [stackoverflow.com]
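The two-word scheme described in the comment above can be sketched roughly as follows. This is only an illustration of the idea, not reCaptcha's actual implementation: the class name `SuspectWord`, the `votes_needed` threshold, and the case-insensitive matching are all assumptions made for the example.

```python
from collections import Counter
from typing import Optional


class SuspectWord:
    """Collects user transcriptions for one word the OCR could not read.

    Hypothetical sketch: the names and the threshold are assumptions.
    """

    def __init__(self, votes_needed: int = 3):
        self.votes_needed = votes_needed
        self.votes = Counter()  # transcriptions logged from users who passed

    def submit(self, known_answer: str, known_typed: str, suspect_typed: str) -> bool:
        """Check the control word; log the suspect guess only if it matches."""
        if known_typed.strip().lower() != known_answer.strip().lower():
            return False  # user failed the known word, so ignore the guess
        self.votes[suspect_typed.strip().lower()] += 1
        return True

    def accepted_text(self) -> Optional[str]:
        """Return the agreed text once enough users gave the same answer."""
        if not self.votes:
            return None
        text, count = self.votes.most_common(1)[0]
        unanimous = count == sum(self.votes.values())
        if unanimous and count >= self.votes_needed:
            return text
        return None
```

A guess from a user who mistypes the known word is discarded, so bots and careless typists do not pollute the transcription of the suspect word.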
  • by mike.mondy ( 524326 ) on Tuesday June 22, 2010 @02:01PM (#32655906)

    Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.

  • by tmbdev ( 1320455 ) on Tuesday June 22, 2010 @09:55PM (#32660852)

    OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.

    Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.

    And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error rate and looks quite bad; people are much more sensitive to OCR errors than to pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you get very little CPU time per character.

    We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj [bit.ly] for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.

    Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.
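To make error figures like the 1% and 10% mentioned in this thread concrete, here is a minimal sketch of one common way to score OCR output: the character error rate, taken here as the Levenshtein edit distance between the reference text and the OCR output, divided by the reference length. The function names are made up for this example, and nothing here is taken from OCRopus or Google's systems.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r from ref
                cur[j - 1] + 1,          # insert h into ref
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# A classic OCR confusion: "m" read as "rn" costs two edits.
print(char_error_rate("modern", "rnodern"))
```

As a rough guide to why 1% looks bad: if character errors were independent (an assumption), a six-letter word would come out fully correct with probability 0.99^6, about 0.94, so roughly one word in seventeen would contain a mistake.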