Google Adds OCR To PDF and Images

Posted by CmdrTaco on Tuesday June 22, 2010 @08:58AM from the typing-is-for-suckers dept.

Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"

This discussion has been archived. No new comments can be posted.

F1r5t p0st? (Score:5, Funny)

by Chrisq ( 894406 ) writes: on Tuesday June 22, 2010 @09:00AM (#32651916)

F1r5t p0st? (OCR's by Google)

- Captcha correction? (Score:4, Interesting)
  
  by 0100010001010011 ( 652467 ) writes: on Tuesday June 22, 2010 @09:05AM (#32651968)
  
  Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCaptcha does, except with something a bit newer.
  
  - - Re: (Score:1, Offtopic)
      
      by Ardeaem ( 625311 ) writes:
      
      I save my mod points exclusively for people like you, and you just missed being "off-topic" by about 4 hours because that's when they expire.
      Yeah, well, I save my mod points for people who post responses to offtopic posts, too. People like you suck. I would NEVER do such a thing.
    - Re: (Score:2)
      
      by 0100010001010011 ( 652467 ) writes:
      
      Well, a captcha service would have corrected "F1r5t p0st". Seems relevant to me.
      - Re: (Score:2)
        
        by somersault ( 912633 ) writes:
        
        I fail to see how it works as a captcha if the "correct" interpretation is unknown..
        
        Re: (Score:1, Interesting)
        
        by Anonymous Coward writes:
        
        Check into how the current reCaptcha works. The user is presented with two words. One is known to be correct. The other is suspect. User is unaware of which is known and which is suspect. User types both words, and backend system verifies the known word was typed correctly. Logs suspect word value typed by user. Returns the suspect word image to a few more users, and if they all respond with same text along with correct known word, the system can assume the suspect image contains the text returned. http:// [stackoverflow.com]
        
        Re: (Score:2, Informative)
        
        by pcgc1xn ( 922943 ) writes:
        
        I am pretty sure that with reCAPTCHA only one of the two words you type in is unknown.
        So say I have some text that looks like 'first (known) p0sh bi4ches'.
        Captcha user one will get "first p0sh".
        If they correctly identify "first", then I will accept their reading of "p0sh", say "post".
        User 2 gets "p0sh bi4ches".
        If they correctly identify "p0sh" as "post", then I will accept their reading of "bi4ches".
        Obviously the guys at reCAPTCHA have done a better job than my simplified & poor explanation. You need "some" know
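
The two-word scheme described in the replies above can be sketched in a few lines of Python. This is only a toy model: the class name, the image id, and the agreement threshold of three matching answers are assumptions made for illustration, not details of the real reCAPTCHA backend.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers settle a suspect word.
AGREEMENT_NEEDED = 3


class TwoWordCaptcha:
    """Toy model of the known-word / suspect-word scheme (not the real service)."""

    def __init__(self):
        # suspect image id -> tally of readings submitted by verified users
        self.votes = defaultdict(Counter)
        # suspect image id -> transcription accepted once enough users agree
        self.resolved = {}

    def submit(self, known_answer, typed_known, suspect_id, typed_suspect):
        """Return True if the user passes; record their reading of the suspect word."""
        if typed_known.strip().lower() != known_answer.lower():
            return False  # failed the control word, so the other reading is ignored
        reading = typed_suspect.strip().lower()
        self.votes[suspect_id][reading] += 1
        if self.votes[suspect_id][reading] >= AGREEMENT_NEEDED:
            self.resolved[suspect_id] = reading
        return True


captcha = TwoWordCaptcha()
for guess in ["post", "post", "posh", "post"]:
    captcha.submit("first", "first", "img-p0sh", guess)

# A user who gets the control word wrong does not get a vote.
rejected = captcha.submit("first", "frist", "img-p0sh", "p0sh")
print(captcha.resolved, rejected)
```

The user never learns which word was the control, so the only reliable way to pass is to type both words honestly.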
lolwut? (Score:3, Insightful)

by Pojut ( 1027544 ) writes: on Tuesday June 22, 2010 @09:00AM (#32651924) Homepage

I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?

- Re: (Score:2)
  
  by Sir_Lewk ( 967686 ) writes:
  
  Maybe the font they were using was ShittyLowRezScan-Serifs.
- Re:lolwut? (Score:4, Interesting)
  
  by erikdalen ( 99500 ) writes: <erik.gustav.dalen@gmail.com> on Tuesday June 22, 2010 @09:19AM (#32652114)
  
  Didn't fail at all on a PDF with typed text for me. Did you actually try it?
  I bet they don't actually use OCR on a PDF with typed text as they can just extract it from the PDF, they probably use that on images inside PDFs though.
  
  - Re: (Score:2)
    
    by clone53421 ( 1310749 ) writes:
    
    If you uploaded a PDF with typed text, it probably didn't even do OCR on it. It'd be pointless. You have to convert the pages to images for that to be necessary and I'm guessing you didn't.
    Open in Acrobat Reader and use the snapshot tool to capture an entire page. Paste into Word as an image, then re-export to PDF. Upload that and then see how the OCR fares. Of course you'll also get an excellent quality in the snapshot since it's a pure digital copy and it won't have the blemishes that you'd get by printin
  - Changing ridiculously stupid subject line (Score:2)
    
    by mjwx ( 966435 ) writes:
    
    I bet they don't actually use OCR on a PDF with typed text as they can just extract it from the PDF, they probably use that on images inside PDFs though. Have you tried it on a PDF that was an image of text, such as a scanned or photographed text document? That's the real test.
- Re: (Score:2)
  
  by AHuxley ( 892839 ) writes:
  
  Someone at Google made a mistake with the dpi setting? Between Tesseract and reCAPTCHA something should work.
- Re: (Score:1)
  
  by mlk ( 18543 ) writes:
  
  It is likely that the PDF tried above was scanned pages.
  - Re: (Score:3, Informative)
    
    by mlk ( 18543 ) writes:
    
    I've just tried with the extract [37signals.com].
    The text extraction seems to have worked well. Unsurprisingly the formatting has been lost and it has got confused with the REwork type bits. PDFs are not designed with extraction to an editable format in mind, so getting any of the formatting is impressive in my book.
    - Re: (Score:2)
      
      by TheRaven64 ( 641858 ) writes:
      
      PDFs are not designed with extraction to an editable format in mind
      Spoken like someone who has never read the PDF spec. PDFs are, in fact, specifically designed to allow editing. Everything in a PDF is stored as an object inside the document, indexed via an object table. Text runs are single objects containing a stream of commands sent to a PostScript-like VM to control their positioning. You can relatively easily map these to rich text in some other format, and you can trivially replace any object in a PDF by adding a new version and appending a new object table with
      - Re: (Score:1)
        
        by mlk ( 18543 ) writes:
        
        Nope - I'll rephrase it to "most tools do not output PDFs that are easy to extract to a common editable format (such as Word)" if you like.
        The spec may allow for easy editing, but converting a PDF (a PDF containing text with formatting, not an image stored in a PDF) is hard. Chunks of unrelated text get bunched together into a single object, while other chunks of text that are related get thrown into some unrelated chunk, so extracting it all logically becomes a royal pain. Sure this is the "fault" of the creation to
- lots of tough problems in OCR (Score:2, Interesting)
  
  by tmbdev ( 1320455 ) writes:
  
  OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.
  Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because the
- Re: (Score:1)
  
  by morgan_greywolf ( 835522 ) writes:
  
  Sadly, I had no issues reading this: "This is going to make document scanning a real time saver from now on!"
  Obviously, I've spent way too much time correcting bad OCR.
Where did all the ReCAPTCHA go? (Score:3, Interesting)

by AHuxley ( 892839 ) writes: on Tuesday June 22, 2010 @09:08AM (#32652004) Journal

With all the words deciphered, no bump in the OCR backend?

- Re: (Score:1)
  
  by Loconut1389 ( 455297 ) writes:
  
  ReCAPTCHA was to fix bad scans in specific works- I didn't think it was ever designed to further OCR, but I see how it could possibly be useful.
  - Re: (Score:2)
    
    by AHuxley ( 892839 ) writes:
    
    http://en.wikipedia.org/wiki/ReCAPTCHA [wikipedia.org] seems to be in use for some form of OCR?
    "The reCAPTCHA software itself is not open source" could be the issue?
    - Re: (Score:1)
      
      by slaingod ( 1076625 ) writes:
      
      I think the point Loconut was making is that ReCaptcha does not 'further' machine OCR (i.e. it doesn't improve the recognition algorithms used by the OCR software), instead using humans to 'OCR' words that otherwise aren't legible.
      - Re: (Score:2)
        
        by AHuxley ( 892839 ) writes:
        
        Pity they did not improve the recognition algorithms with all the data flowing in.
        Cost vs a tiny % in better recognition vs a free network of humans.
        Thanks for the info, I was thinking that a quality private OCR system was getting the ReCAPTCHA inputs and it was learning.
Google Captcha processor here I come!!!! (Score:3, Interesting)

by OzPeter ( 195038 ) writes: on Tuesday June 22, 2010 @09:19AM (#32652104)

How long before you see an automated system to upload and process Captcha images on google?

- Re: (Score:2)
  
  by mrops ( 927562 ) writes:
  
  A little offtopic.
  I have always wondered about this: Google does a whole lot of processing, more so than any other corporation in recent times. Stuff like this OCR, searches, building heuristics for searches etc etc etc. Combined, these are no small tasks. Is there a number on what kind of processing power Google has? Does Google's computing grid qualify as a supercomputing grid? How does it stand compared to all those other supercomputers?
  - - Re: (Score:1)
      
      by Kral_Blbec ( 1201285 ) writes:
      
      What you didn't know is that Google is the botnets...
  - Re: (Score:2)
    
    by rah1420 ( 234198 ) writes:
    
    A super computing grid?
    Oolcay itay. [wikipedia.org]
- Re: (Score:2, Interesting)
  
  by BrightSpark ( 1578977 ) writes:
  
  One of the easier ways to restrict how your words and ideas are searched and indexed on the net is to hide them in plain sight. A jpg image of text is very difficult for a search engine to use, yet you and I can read and understand the data quite easily. This ability to scan online has been around, but not mainstream to my knowledge. I'm guessing Google has been checking jpgs for text as a trial for some time. Once this is gone maybe ASCII art text will work for a while. Hiding/protecting data by stegano
- Re: (Score:2)
  
  by gravis777 ( 123605 ) writes:
  
  I can't read captchas 60% of the time, and am not always in an area where I can listen to the audio hint. An OCR would be nice. On Boing Boing, I usually mistype the captchas about 3-4 times before finally stumbling on one I can actually read.
  - Re: (Score:2)
    
    by Bigjeff5 ( 1143585 ) writes:
    
    I had to use a captcha for work once, and the captcha itself was incorrect. I have no idea what key combination would have worked, but what the captcha said certainly didn't. It had an audio option, so I tried it, but the audio was so garbled I couldn't pick out a single word, let alone the three necessary to complete the captcha.
    I like captcha as a basic form of protection from bots, but when it keeps me from accessing a website it is beyond worthless.
captcha cracking (Score:1)

by aiwarrior ( 1030802 ) writes:

For now it sucks, but we know that if Google wants to, it can beat the best on the market.
Just wondering if this gets so good as to make mass captcha cracking cheap.
Is there a "this translation is bad" option? (Score:5, Informative)

by AdmiralXyz ( 1378985 ) writes: on Tuesday June 22, 2010 @09:28AM (#32652216)

I know with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too: the quality of Google Voice's voicemail-to-text transcriptions started off horrible and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.

- Re: (Score:2)
  
  by rolfwind ( 528248 ) writes:
  
  Google Translate's results between the Western languages I've tried are pretty good, but they need a lot of work on the Asian languages, IMO.
- Re: (Score:2)
  
  by steveg ( 55825 ) writes:
  
  Awesome might be pushing it a bit, but I'll agree it's gotten better. It's never quite right, but I can usually get the gist of the message even before I have a chance to listen directly.
OCR efficiency (Score:1, Interesting)

by Anonymous Coward writes:

> the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)
- - Re: (Score:2)
    
    by selven ( 1556643 ) writes:
    
    OCR (Oxford, Cambridge and RSA Examinations) is an examination board that sets examinations and awards qualifications (including GCSEs and A-levels). It is one of England, Wales and Northern Ireland's five main examination boards.
    Organization of Communist Revolutionaries (Marxist-Leninist) was an Iranian Maoist organization. It was formed in opposition to the Shah regime in Iran and was active in the Iranian student movement in exile.
    To perform OCR (optical character recognition); Oxford, Cambridge & RSA (examinations (board)); Optical Character Recognition; Office for Civil Rights (US); Office of the Chief Rabbi
    People who can't figure things out from context would have a much harder time than you think.
    - Re: (Score:1)
      
      by SpeZek ( 970136 ) writes:
      
      Um...no. The summary is clearly talking about text. It even says "a new option that tells Google to convert the text from PDF and image files to Google Docs documents". There's no way that, from context, you could think OCR meant "Organization of Communist Revolutionaries".
      Don't argue for argument's sake.
      - Re: (Score:1, Offtopic)
        
        by clone53421 ( 1310749 ) writes:
        
        Don't argue for argument's sake.
        But those are the best kind... it doesn’t even really matter who’s wrong.
        Never let a day go by when you can’t say to yourself as you’re falling asleep, “Well, I was wrong on the internet today, but damn, I had fun.” That’s what I say...
      - Re: (Score:2)
        
        by clone53421 ( 1310749 ) writes:
        
        Snarkiness aside, it explicitly says that it’s converting the text from images into a searchable document... if someone can’t tell from that context that OCR means converting the text from images into a document, they probably have about the IQ of a cinder-block and wouldn’t “get it” from the Google results either.
        Hell, if we are really lowering ourselves to that lowest denominator of intelligence, the person would probably still be confused if we called it Optical Character Re
Doesn't have to be perfect (Score:5, Insightful)

by clone53421 ( 1310749 ) writes: on Tuesday June 22, 2010 @09:36AM (#32652298) Journal

They really should hide the text underneath the actual scanned image, though, so that what you're actually looking at is the real page, but searchable. That takes care of the issue with layout, and since you aren't actually trying to read the garbled text, although 10% is still a rather high error rate it won't matter as much because you'll only notice it if you're trying to copy-and-paste or you might search for something and miss a few of the hits because it was incorrectly OCR'd. Not a huge deal.

- Re: (Score:2)
  
  by gravis777 ( 123605 ) writes:
  
  The funny thing is that their OCR seems to be pretty good for Google Books. Yes, it's photographed pictures, but you can search the text, which means some type of OCR must be going on. So, unless they are using a completely different technology, then this should really only have issues with hand-written text.
  - Re: (Score:2)
    
    by gravis777 ( 123605 ) writes:
    
    Photographed text. Blah. Should have proofread before I hit submit.
  - Re: (Score:2)
    
    by clone53421 ( 1310749 ) writes:
    
    Copy-and-paste some text from it and see how good the OCR was. You’ll be able to see the mistakes that were previously hidden.
    I’m guessing it’s exactly the same engine, but done exactly as I said it should be, correctly.
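
The "hidden text under the scanned image" idea in this thread can be sketched as a list of recognized words, each tied to a box on the page image. The reader sees only the image; search runs against the invisible words and highlights the matching boxes. The names `OcrWord` and `find_hits`, and the sample page, are invented for illustration and say nothing about how Google Docs or Google Books actually store this.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str   # what the OCR engine thinks the word is
    box: tuple  # (x, y, width, height) on the page image, in pixels


def find_hits(layer, query):
    """Return the boxes to highlight on the page image for a search query."""
    q = query.lower()
    return [w.box for w in layer if w.text.lower().strip(".,;:!?") == q]


# Invented sample page: the second "document" was misread by the OCR engine.
page_layer = [
    OcrWord("Scan", (10, 10, 40, 12)),
    OcrWord("the", (55, 10, 25, 12)),
    OcrWord("document", (85, 10, 70, 12)),
    OcrWord("twice.", (160, 10, 45, 12)),
    OcrWord("The", (10, 30, 28, 12)),
    OcrWord("docurnent", (43, 30, 72, 12)),
]

hits = find_hits(page_layer, "document")
print(hits)
```

Only the first occurrence is found. The misread "docurnent" is never shown to the reader, so the page still looks right; the only cost of the OCR error is a missed search hit, which is the trade-off described above.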
Google should concentrate elsewhere (Score:2)

by bogaboga ( 793279 ) writes:

First: My suggestion is that Google should put its efforts in making Google Docs at least as usable as Zoho Office first.
How can a small company like Zoho beat Google on usability?
Second: GMail still sucks [at search experience], big time in my opinion. Here's why: I had this happen to me recently...
I knew an email existed but could not remember much about it at all! Yes sometimes, you need a memory trigger for lack of a better word.
My search term was "details" and Gmail returned 311 messages. I also knew t
- Re: (Score:2)
  
  by MozeeToby ( 1163751 ) writes:
  
  By the way, I searched for the string "when" in an in-box that had 142,211 emails and I received 11,317 emails back!
  You can hardly expect Google to make up for your lack of search skills or memory. I'm not saying you don't have other valid points, but searching for such basic terms as 'when' and 'details' instead of something that is unique to the message is bound to return tons of results. It's not Google's fault that you have 11,317 emails with the word 'when' in it.
  - Re: (Score:2)
    
    by bogaboga ( 793279 ) writes:
    
    You still do not get it, I am afraid! And that's the very reason that companies like Apple and Microsoft at one point in the past made life incredibly easy for computer users. This is why they excelled, of course making users "dumb" in the process.
    You can hardly expect Google to make up for your lack of search skills or memory.
    
    This is the very mistake you make...How come Google now categorizes results of search terms at google.com? Tell me why. I just searched for "House Skills" and had categories of videos, discussions, books, news, blogs, updates returned. So according to you, categor
    - Re: (Score:1)
      
      by KeNickety ( 1416855 ) writes:
      
      Possibly because computers and networks can't be expected to infer meaning? Remember, in your search field you've entered no context, no kinds of specifying statements, so you're expecting the computer to be able to read your mind?
      - Re: (Score:2)
        
        by bogaboga ( 793279 ) writes:
        
        ...so you're expecting the computer to be able to read your mind?
        No sir! I expect the computer to categorize, and I know it is possible because I have seen it elsewhere...even in applications by the same vendor.
- Re: (Score:1)
  
  by HamburglerJones ( 1539661 ) writes:
  
  Search:
  details has:attachment .tiff
  
  I see your point about it being nice if they'd automatically label this stuff, but you can search for attachments. This might turn up something that has a different kind of attachment and merely mentions ".tiff" in the email, but what you're looking for should turn up.
  
  I have found Gmail search to be vastly superior to Yahoo! and Outlook since I've switched. They have some great tips [google.com] on how to search.
- Re: (Score:2)
  
  by quickOnTheUptake ( 1450889 ) writes:
  
  The thing I've had trouble with in gmail search is that it lacks any sort of lemmatisation. This would be fine if it would match sub-strings within words, but it seems to only match full words that are morphologically identical.
  - Re: (Score:2)
    
    by bogaboga ( 793279 ) writes:
    
    I see your point. This is another thing Yahoo Mail does well.
- Re: (Score:2)
  
  by Tropaios ( 244000 ) writes:
  
  The problem is that you weren't using Google's search engine properly. You failed to give it all the relevant information you DID remember. Next time include "tiff" in your search as well as clicking the box "Has attachment" in search options.
  In fact, go do that now, then come back and tell me how many results you get.
Anyone know what they're using for the OCR? (Score:2)

by rsilvergun ( 571051 ) writes:

It'd be cool if it was GPL'd :).
- Re:Anyone know what they're using for the OCR? (Score:4, Informative)
  
  by quickOnTheUptake ( 1450889 ) writes: on Tuesday June 22, 2010 @10:47AM (#32653210)
  
  I don't know for sure what's running behind this, but Google's OCRopus [google.com] is Apache, as is the actual OCR engine behind it, tesseract [google.com].
  
  - Re: (Score:1)
    
    by tmbdev ( 1320455 ) writes:
    
    FWIW, I believe a lot of OCRopus hasn't been incorporated at Google yet because OCRopus itself is still under heavy development.
- Re: (Score:1)
  
  by danhs7 ( 970647 ) writes:
  
  Why? Apache is a more liberal license.
OCR Reality (Score:3, Informative)

by Shadow Wrought ( 586631 ) * writes: <shadow.wrought@NOSPam.gmail.com> on Tuesday June 22, 2010 @12:19PM (#32654498) Homepage Journal

About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, 10% error rate is still very common with scanned docs and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!

Maybe you should actually know something about the particular field before you judge?

- Re: (Score:2)
  
  by binary paladin ( 684759 ) writes:
  
  Yeah, I was thinking the same thing. This sounds like someone who hasn't actually done OCR prior to these fancy Google docs.
  OCR has always been somewhat inaccurate. It's just the nature of the beast.
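
Nobody in the thread says how the "10%" figure was measured. One common way to put a number on OCR quality is the character error rate: the edit distance between the OCR output and a trusted reference, divided by the length of the reference. The sketch below assumes that definition; the function names and sample strings are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


reference = "first post"
ocr_output = "f1rst p0st"
rate = char_error_rate(reference, ocr_output)
print(f"{rate:.0%}")
```

Two wrong characters out of ten give 20% here. A word-level rate computed on the same text would look far worse, since both words contain an error, so it matters which measure a quoted percentage refers to.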
Google search had OCR before Google Docs (Score:2, Interesting)

by mike.mondy ( 524326 ) writes:

Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.
What character sets? (Score:2)

by HalAtWork ( 926717 ) writes:

What are the supported character sets? Is it only Roman characters or what?
Free OCR server. (Score:1)

by rynolangner ( 1847532 ) writes:

This Google OCR thing just gives you the text. What about the formatting? Why not use something like WatchOCR from http://www.watchocr.com./ [www.watchocr.com] It creates text searchable pdfs from image only pdfs and it's all free and open source. You just drop them into a watched folder and the server spits them out as text searchable. It runs as a LiveCD so you don't even have to install anything to try it.
