Google Adds OCR To PDF and Images
Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"
F1r5t p0st? (Score:5, Funny)
Captcha correction? (Score:4, Interesting)
Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCAPTCHA does, except with something a bit newer.
Re: (Score:1, Offtopic)
I save my mod points exclusively for people like you, and you just missed being "off-topic" by about 4 hours because that's when they expire.
Yeah, well, I save my mod points for people who post responses to offtopic posts, too. People like you suck. I would NEVER do such a thing.
Re: (Score:2)
Well, a captcha service would have corrected "F1r5t p0st". Seems relevant to me.
Re: (Score:2)
I fail to see how it works as a captcha if the "correct" interpretation is unknown.
Re: (Score:1, Interesting)
Re: (Score:2, Informative)
I am pretty sure that with reCAPTCHA only one of the two words you type in is unknown.
So say I have some text that looks like 'first (known) p0sh bi4ches'.
Captcha user one will get "first p0sh".
If they correctly identify "first", then I will accept their reading of "p0sh", say "post".
User two gets "p0sh bi4ches".
If they correctly identify "p0sh" as "post", then I will accept their reading of "bi4ches".
Obviously the guys at reCAPTCHA have done a better job than my simplified & poor explanation. You need "some" know
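The scheme described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how reCAPTCHA is actually implemented: the class name `WordPairCaptcha`, the image ids, and the `votes_needed` threshold are all made up for illustration.

```python
from collections import Counter, defaultdict


class WordPairCaptcha:
    """Toy two-word captcha: one control word (known) plus one unknown word."""

    def __init__(self, votes_needed=2):
        self.known = {}                    # image id -> accepted transcription
        self.votes = defaultdict(Counter)  # image id -> counts of candidate readings
        self.votes_needed = votes_needed   # hypothetical agreement threshold

    def add_known(self, image_id, text):
        self.known[image_id] = text.strip().lower()

    def submit(self, control_id, control_answer, unknown_id, unknown_answer):
        """Return True if the user passes; if so, record their reading of the unknown word."""
        if self.known.get(control_id) != control_answer.strip().lower():
            return False  # control word failed, so the other reading is not trusted
        reading = unknown_answer.strip().lower()
        self.votes[unknown_id][reading] += 1
        # Promote the unknown word once enough users agree on the same reading.
        if self.votes[unknown_id][reading] >= self.votes_needed:
            self.known[unknown_id] = reading
        return True
```

Once "p0sh" has been promoted to a known word, it can serve as the control word for the next unknown image, which is the chaining the example above walks through.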
lolwut? (Score:3, Insightful)
I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?
Re: (Score:2)
Maybe the font they were using was ShittyLowRezScan-Serifs.
Re:lolwut? (Score:4, Interesting)
Didn't fail at all on a PDF with typed text for me. Did you actually try it?
I bet they don't actually use OCR on a PDF with typed text, as they can just extract it from the PDF; they probably use it on images inside PDFs, though.
Re: (Score:2)
If you uploaded a PDF with typed text, it probably didn't even do OCR on it. It'd be pointless. You have to convert the pages to images for that to be necessary and I'm guessing you didn't.
Open in Acrobat Reader and use the snapshot tool to capture an entire page. Paste into Word as an image, then re-export to PDF. Upload that and then see how the OCR fares. Of course you'll also get an excellent quality in the snapshot since it's a pure digital copy and it won't have the blemishes that you'd get by printin
Changing ridiculously stupid subject line (Score:2)
Re: (Score:2)
Re: (Score:1)
It is likely that the PDF tried above was scanned pages.
Re: (Score:3, Informative)
I've just tried with the extract [37signals.com].
The text extraction seems to have worked well. Unsurprisingly the formatting has been lost and it has got confused with the REwork type bits. PDFs are not designed with extraction to an editable format in mind, so getting any of the formatting is impressive in my book.
Re: (Score:2)
PDFs are not designed with extraction to an editable format in mind
Spoken like someone who has never read the PDF spec. PDFs are, in fact, specifically designed to allow editing. Everything in a PDF is stored as an object inside the document, indexed via an object table. Text runs are single objects containing a stream of commands sent to a PostScript-like VM to control their positioning. You can relatively easily map these to rich text in some other format, and you can trivially replace any object in a PDF by adding a new version and appending a new object table with
Re: (Score:1)
Nope - I'll rephrase it to "most tools do not output PDFs that are easy to extract to a common editable format (such as Word)" if you like.
The spec may allow for easy editing, but converting a PDF (a PDF containing text with formatting, not an image stored in a PDF) is hard. Chunks of unrelated text get bunched together into a single object, while other chunks of text that are related get thrown into some unrelated chunk, so extracting it all logically becomes a royal pain. Sure this is the "fault" of the creation to
lots of tough problems in OCR (Score:2, Interesting)
OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.
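To make one of those steps concrete, here is a minimal sketch of grouping already-recognized characters into lines. It assumes each character comes with a bounding box; the `Glyph` type, its fields, and the `tolerance` parameter are hypothetical names for illustration, not taken from any particular OCR engine.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    height: float  # box height


def group_into_lines(glyphs, tolerance=0.5):
    """Cluster glyphs into lines by vertical center, then read each line left to right."""
    lines = []  # each entry is the list of glyphs on one line
    for g in sorted(glyphs, key=lambda g: g.y):
        center = g.y + g.height / 2
        if lines:
            last = lines[-1]
            last_center = sum(p.y + p.height / 2 for p in last) / len(last)
            # Same line if the centers are within a fraction of the glyph height.
            if abs(center - last_center) <= tolerance * g.height:
                last.append(g)
                continue
        lines.append([g])
    return ["".join(p.char for p in sorted(line, key=lambda p: p.x)) for line in lines]
```

Note that this naive version knows nothing about columns, captions, or floats: a two-column page would come out with both columns interleaved on each line, which is exactly the kind of layout problem described above.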
Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because the
Re: (Score:1)
Sadly, I had no issues reading this: "This is going to make document scanning a real time saver from now on!"
Obviously, I've spent way too much time correcting bad OCR.
Where did all the ReCAPTCHA go? (Score:3, Interesting)
Re: (Score:1)
ReCAPTCHA was to fix bad scans in specific works- I didn't think it was ever designed to further OCR, but I see how it could possibly be useful.
Re: (Score:2)
"The reCAPTCHA software itself is not open source" could be the issue?
Re: (Score:1)
I think the point Loconut was making is that ReCaptcha does not 'further' machine OCR (i.e. it doesn't improve the recognition algorithms used by the OCR software), instead using humans to 'OCR' words that otherwise aren't legible.
Re: (Score:2)
Cost vs a tiny % in better recognition vs a free network of humans.
Thanks for the info, I was thinking that a quality private OCR system was getting the ReCAPTCHA inputs and it was learning.
Google Captcha processor here I come!!!! (Score:3, Interesting)
Re: (Score:2)
A little offtopic.
I have always wondered about this: Google does a whole lot of processing, more so than any other corporation in recent times. Stuff like this OCR, searches, building heuristics for searches, etc. Combined, these are no small tasks. Is there a number on what kind of processing power Google has? Does Google's computing grid qualify to be categorized as a super computing grid? What is its standing when compared to all those other supercomputers?
Re: (Score:1)
Re: (Score:2)
A super computing grid?
Oolcay itay. [wikipedia.org]
Re: (Score:2, Interesting)
Re: (Score:2)
I can't read captchas 60% of the time, and am not always in an area where I can listen to the audio hint. An OCR would be nice. On Boing Boing, I usually mistype the captchas about 3-4 times before finally stumbling on one I can actually read.
Re: (Score:2)
I had to use a captcha for work once, and the captcha itself was incorrect. I have no idea what key combination would have worked, but what the captcha said certainly didn't. It had an audio option, so I tried it, but the audio was so garbled I couldn't pick out a single word, let alone the three necessary to complete the captcha.
I like captcha as a basic form of protection from bots, but when it keeps me from accessing a website it is beyond worthless.
captcha cracking (Score:1)
For now it sucks, but we know that if Google wants to, it can put out the best on the market.
Just wondering if this gets so good as to make mass captcha cracking cheap.
Is there a "this translation is bad" option? (Score:5, Informative)
Re: (Score:2)
Google Translate's results between the Western languages I've encountered are pretty good, but they need a lot of work on the Asian languages, imo.
Re: (Score:2)
Awesome might be pushing it a bit, but I'll agree it's gotten better. It's never quite right, but I can usually get the gist of the message even before I have a chance to listen directly.
OCR efficiency (Score:1, Interesting)
> the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)
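For anyone who wants to put a number on "about 10%", one common way to score OCR output is the character error rate: the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The summary does not say how its 10% figure was measured, so treat this as one possible yardstick; the function names below are just for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

By this measure, reading "first post" as "f1r5t p0st" is three substitutions in ten characters, so a 30% error rate.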
Re: (Score:2)
OCR (Oxford, Cambridge and RSA Examinations) is an examination board that sets examinations and awards qualifications (including GCSEs and A-levels). It is one of England, Wales and Northern Ireland's five main examination boards.
Organization of Communist Revolutionaries (Marxist-Leninist) was an Iranian Maoist organization. It was formed in opposition to the Shah regime in Iran and was active in the Iranian student movement in exile.
To perform OCR (optical character recognition); Oxford, Cambridge & RSA (examinations (board)); Optical Character Recognition; Office for Civil Rights (US); Office of the Chief Rabbi
People who can't figure things out from context would have a much harder time than you think.
Re: (Score:1)
Um...no. The summary is clearly talking about text. It even says "a new option that tells Google to convert the text from PDF and image files to Google Docs documents". There's no way that, from context, you could think OCR meant "Organization of Communist Revolutionaries".
Don't argue for argument's sake.
Re: (Score:1, Offtopic)
Don't argue for argument's sake.
But those are the best kind... it doesn’t even really matter who’s wrong.
Never let a day go by when you can’t say to yourself as you’re falling asleep, “Well, I was wrong on the internet today, but damn, I had fun.” That’s what I say...
Re: (Score:2)
Snarkiness aside, it explicitly says that it’s converting the text from images into a searchable document... if someone can’t tell from that context that OCR means converting the text from images into a document, they probably have about the IQ of a cinder-block and wouldn’t “get it” from the Google results either.
Hell, if we are really lowering ourselves to that lowest denominator of intelligence, the person would probably still be confused if we called it Optical Character Re
Doesn't have to be perfect (Score:5, Insightful)
They really should hide the text underneath the actual scanned image, though, so that what you're actually looking at is the real page, but searchable. That takes care of the issue with layout, and since you aren't actually trying to read the garbled text, although 10% is still a rather high error rate it won't matter as much because you'll only notice it if you're trying to copy-and-paste or you might search for something and miss a few of the hits because it was incorrectly OCR'd. Not a huge deal.
Re: (Score:2)
The funny thing is that their OCR seems to be pretty good for Google Books. Yes, it's photographed pictures, but you can search the text, which means some type of OCR must be going on. So, unless they are using a completely different technology, then this should really only have issues with hand-written text.
Re: (Score:2)
Photographed text. Blah. Should have proofread before I hit submit.
Re: (Score:2)
Copy-and-paste some text from it and see how good the OCR was. You’ll be able to see the mistakes that were previously hidden.
I’m guessing it’s exactly the same engine, but done exactly as I said it should be, correctly.
Google should concentrate elsewhere (Score:2)
First: My suggestion is that Google should put its efforts in making Google Docs at least as usable as Zoho Office first.
How can a small company like Zoho beat Google on usability?
Second: GMail still sucks [at search experience], big time in my opinion. Here's why: I had this happen to me recently...
I knew an email existed but could not remember much about it at all! Yes sometimes, you need a memory trigger for lack of a better word.
My search term was "details" and Gmail returned 311 messages. I also knew t
Re: (Score:2)
By the way, I searched for the string "when" in an in-box that had 142,211 emails and I received 11,317 emails back!
You can hardly expect Google to make up for your lack of search skills or memory. I'm not saying you don't have other valid points, but searching for such basic terms as 'when' and 'details' instead of something that is unique to the message is bound to return tons of results. It's not Google's fault that you have 11,317 emails with the word 'when' in it.
Re: (Score:2)
You still do not get it, I am afraid! And that's the very reason that companies like Apple and Microsoft at one point in the past made life incredibly easy for computer users. This is why they excelled, of course making users "dumb" in the process.
You can hardly expect Google to make up for your lack of search skills or memory.
This is the very mistake you make...How come Google now categorizes results of search terms at google.com? Tell me why. I just searched for "House Skills" and had categories of videos, discussions, books, news, blogs, updates returned. So according to you, categor
Re: (Score:1)
Re: (Score:2)
...so you're expecting the computer to be able to read your mind?
No sir! I expect the computer to categorize, and I know it is possible because I have seen it elsewhere...even in applications by the same vendor.
Re: (Score:1)
details has:attachment
I see your point about it being nice if they'd automatically label this stuff, but you can search for attachments. This might turn up something that has a different kind of attachment and merely mentions ".tiff" in the email, but what you're looking for should turn up.
I have found Gmail search to be vastly superior to Yahoo! and Outlook since I've switched. They have some great tips [google.com] on how to search.
Re: (Score:2)
Re: (Score:2)
I see your point. This is another thing Yahoo Mail does well.
Re: (Score:2)
The problem is that you weren't using Google's search engine properly. You failed to give it all the relevant information you DID remember. Next time include "tiff" in your search as well as clicking the box "Has attachment" in search options.
In fact, go do that now, then come back and tell me how many results you get.
Anyone know what they're using for the OCR? (Score:2)
Re:Anyone know what they're using for the OCR? (Score:4, Informative)
Re: (Score:1)
FWIW, I believe a lot of OCRopus hasn't been incorporated at Google yet because OCRopus itself is still under heavy development.
Re: (Score:1)
Why? Apache is a more liberal license.
OCR Reality (Score:3, Informative)
What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, a 10% error rate is still very common with scanned docs, and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!
Maybe you should actually know something about the particular field before you judge?
Re: (Score:2)
Yeah, I was thinking the same thing. This sounds like someone who hasn't actually done OCR prior to these fancy Google docs.
OCR has always been somewhat inaccurate. It's just the nature of the beast.
Google search had OCR before Google Docs (Score:2, Interesting)
Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.
What character sets? (Score:2)
Free OCR server. (Score:1)