Optical Character Recognition Still Struggling With Handwriting

Posted by Soulskill on Sun Oct 05, 2008 11:37 AM
from the i-can't-read-my-handwriting-either dept.
Ian Lamont recently asked Google if they planned to extend their transcription of books and other printed media to include public records, many of which were handwritten before word processors became ubiquitous. Google wouldn't talk about any potential plans, but Lamont found out a bit more about the limits of optical character recognition in the process: "Even though some CAPTCHA schemes have been cracked in the past year, a far more difficult challenge lies in using software to recognize handwritten text. Optical character recognition has been used for years to convert printed documents into text data, but the enormous variation in handwriting styles has thwarted large-scale OCR imports of handwritten public documents and historical records. Ancestry.com took a surprising approach to digitizing and converting all publicly released US census records from 1790 to 1930: It contracted the job to Chinese firms whose staff manually transcribed the names and other information. The Chinese staff are specially trained to read the cursive and other handwriting styles from digitized paper records and microfilm. The task is ongoing with other handwritten records, at a cost of approximately $10 million per year, the company's CEO says."

Related Stories

[+] Windows Live Hotmail CAPTCHA Cracked, Exploited 362 comments
eldavojohn passes along what may be the last nail in the coffin for CAPTCHA technology. Coming on the heels of credible accounts of the downfall of first Yahoo's and then Gmail's CAPTCHA, Ars Technica is reporting on Websense Security Labs' deconstruction of the cracking and tuning / exploitation of the Live Hotmail CAPTCHA. Ars calculates that a single zombie computer can sign up over 1400 Live Hotmail accounts in a day, and alternate account creation with spamming. Time to dust off Kitten Auth?
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by Joe The Dragon (967727) on Sunday October 05 2008, @11:39AM (#25264827)

    Beat up Martin = Eat up Martha

      • by Anonymous Coward on Sunday October 05 2008, @06:29PM (#25267987)

        i've outsourced all of my computer applications and software needs to India.

        instead of using PowerPoint at meetings, i just have two Indian women in bikinis hold up large displays with my bullet points written on them--they even do slide transitions.

        instead of an e-mail client, i use an Indian courier. it takes a while for me to communicate with international clients, but i receive practically no spam.

        and rather than a word processor i have a guy with a notepad that i dictate to. he also offers me helpful tips when he notices that i'm trying to write a letter.

        then there's the 17-year-old i have doing my taxes. i don't even think he's out of high school yet, but he beats Turbo Tax any day.

        but you should really see the guy i have simulating Windows Vista for me. he wears this really slick suit, moves really slow, and every once in a while he comes up to me and kicks me in the balls.

  • Better approach? (Score:3, Insightful)

    by mandelbr0t (1015855) on Sunday October 05 2008, @11:40AM (#25264835) Journal
    It seems to me that it would be better to OCR everything and contract the proof-reading to the Chinese firm. The wide variation of writing styles and letter forms may make 100% accuracy of OCR impossible for this task, but starting from OCR should reduce the task, shouldn't it?
    • I suppose it depends on how you go about it; correcting specific errors may require more therbligs than typing the entire words.

    • And if the OCR has a mistake, you gotta look for the original? Sometimes a wrong starting point leads you in entirely the wrong direction, and you don't know it's wrong. Since these would be people for whom English is a second (at best) or foreign (at worst) language, wouldn't that make them especially vulnerable?

    • by mrsteveman1 (1010381) on Sunday October 05 2008, @12:25PM (#25265257) Homepage

      Chinese proof-reading? Only if you want your documents in Engrish.

    • It seems to me that it would be better to OCR everything and contract the proof-reading to the Chinese firm. The wide variation of writing styles and letter forms may make 100% accuracy of OCR impossible for this task, but starting from OCR should reduce the task, shouldn't it?

      It would probably be more costly to OCR it and then proofread it, especially if the error rate is higher than a certain amount, say 50%. There are written texts that I have a hard enough time recognising, and only context allows me to wor

    • Re:Better approach? (Score:5, Informative)

      by PeeAitchPee (712652) on Sunday October 05 2008, @12:47PM (#25265499)

      No.

      I own a microfilm digitization / OCR shop. We work with tons of old records such as the ones referenced in this story, as well as old HR docs, check stubs, time cards, architectural drawings, you name it. If you OCR cursive, you don't get back 80%, or 70%, or even 30% accuracy . . . you get back a bunch of pseudo-random (to our eyes) characters which are in NO WAY related to what the actual text is. About the only handwriting recognizable using today's tech is block-print, like you find on engineering diagrams. The technique in this article is pretty standard operating procedure, and has been for some time -- much easier to put a few hundred people on the project and grind through it (and cheaper too compared to data entry rates here in the US -- about 1/3 the price). That usually includes double-keying to check everything and a 99.99999% accuracy guarantee.

      Just FYI, there are only a few OCR engines out there. Probably the most commonly used is the ABBYY engine, which is both OEMed and sold directly as desktop- and server-based products by ABBYY. There are a few others as well, and despite their differences, most have pretty much the same capabilities and accuracy. But OCR of cursive, especially of the docs cited in the article where you don't have someone sit down and "train" the machine first with handwriting samples, is still one of the great "unsolved" computing problems. I expect we'll have the capability in the next decade or so as processor core density, memory, and storage continue to increase at their current rate -- eventually, the machine will be able to "brute-force" through the docs just like the Chinese data entry folks in this article.
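The double-keying check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the shop's actual process: the function name `double_key_diff` and the sample fields are hypothetical, and it assumes both operators key the same fields in the same order.

```python
def double_key_diff(first_pass, second_pass):
    """Return (index, a, b) for every field where the two typists disagree."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same fields")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(first_pass, second_pass))
        # Ignore case and stray whitespace; anything else counts as a conflict.
        if a.strip().lower() != b.strip().lower()
    ]


# Hypothetical record fields keyed independently by two operators.
pass_a = ["Mary Jones", "1874", "Ohio"]
pass_b = ["Mary Jones", "1871", "Ohio"]
mismatches = double_key_diff(pass_a, pass_b)
print(mismatches)
```

Only the fields that come back in `mismatches` would need a third look against the original image, which is presumably where most of the accuracy comes from.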

      • As an industry expert I imagine you know a whole lot more about this than I would - and I am sure you are completely correct.

        Perhaps the cursive issue has to do with the effective resolution you can get from the old paper scans? I know using the tablet edition of Windows Vista I can get much higher than 90% recognition of cursive input on the tablet. However that is probably due to the fact that no scanning is needed: Windows has a basically perfectly resolved snapshot of what my scribble looks like withou
        • Re: (Score:3, Informative)

          Couple of points...

          1) MS Research probably has some of the best work being done on handwriting recognition, including imaged documents. However, it is nowhere near the needed levels. Google would be better off working with Microsoft on stuff like this than following the motto of screw anything MS is doing and we will recreate it ourselves.

          2) On your Vista Tablet PC, the reason you can get 90-99% levels of recognition is that TabletPCs and Vista/Windows use a concept called 'ink' (that goes back to early work at Mic

      • by mi (197448) <mi+slashdot@aldan.algebra.com> on Sunday October 05 2008, @02:20PM (#25266261) Homepage

        I expect we'll have the capability in the next decade or so as processor core density, memory, and storage continues to increase at their current rate -- eventually, the machine will be able to "brute-force" through the docs just like the Chinese data entry folks in this article.

        In the next decade or so we will have increased our processing power about 1000 times over. This work is scalable "sideways" — two pages can be processed by two computers independently. Which means, a thousand of today's computers could've done the work @home-style.

        The problem is not with the processing power — it is the lack of algorithms. You and I reassemble the hand-written characters quite differently from how today's computers do it. The software will need to be created — and it is not the lack of CPU/memory/storage power, that's holding it.

        One thing for sure is that the new algorithms will need to use the spell-checking engine(s) to better guess what the next letter might be. On top of that, they would need to be equipped with grammar-checkers too, to be able to guess the next word, however illegible. Human speech (and thus writing) is often quite redundant, even if a misplaced comma can reverse the meaning on occasion.

        Our brain certainly uses its knowledge of both the general rules of the language and that of the domain of what's written — this is why another doctor can decipher another doctor's handwriting, for example, that's infamously illegible to mere mortals. The software will have to do the same — and it can start doing it already.
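A rough sketch of the spell-checker idea above: snap a noisy recognizer output to the closest word in a domain vocabulary. It uses Python's standard `difflib`; the function name `correct_word`, the sample vocabulary, and the 0.6 cutoff are assumptions for illustration, and a real system would presumably weigh character-level confidences as well.

```python
import difflib


def correct_word(raw, vocabulary, cutoff=0.6):
    """Return the closest vocabulary word, or the raw text if nothing is close."""
    matches = difflib.get_close_matches(raw.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


# Hypothetical domain vocabulary, e.g. occupations seen in census records.
vocabulary = ["carpenter", "farmer", "labourer", "servant"]

guess = correct_word("carpcnter", vocabulary)   # one misread letter
unknown = correct_word("xqzv", vocabulary)      # nothing plausible, left as-is
print(guess, unknown)
```

The domain matters as much as the algorithm here: swap in a vocabulary of drug names and the same function is the "another doctor" from the comment.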

      • The US Post Office has, for years, had fairly reliable automated reading of handwritten digits, which is used to auto-sort and -route mail by zipcode. It can handle some pretty terrible handwriting, crazy arrangement on the envelope, and unlikely variations, so only a relatively small percentage of letters are spit out to be read by human eyes.

        Its task is made easier by the fact that they're locating and segmenting fixed-length sequences that are usually at least somewhat separated: they're looking for eith

    • Re: (Score:3, Insightful)

      Teaching someone English at that level would be more difficult than teaching them to recognize characters. In ancient Rome the people who engraved dies for coins weren't always literate, but they managed for the most part to get the inscriptions right. Barbarians who made copies had more trouble, but then perhaps they thought the inscription part was purely decorative allowing for artistic interpretation. Or perhaps they weren't flogged for making mistakes. Point is, you can copy without having the high

      • I would imagine that the proofreader would have the computerized text and an image of the original text side by side for comparison.
      • Re: (Score:3, Interesting)

        Doesn't this suggest an obvious solution to CAPTCHA? Just use cursive text rather than try to obscure the text with funky backgrounds. If the spammers do manage to crack the CAPTCHA, then incorporate their technology into mainstream OCR programs.

        • You'd be trading false negatives for false positives. Based on TV programmes where they trace people's ancestry, it's hard to tell what language most cursive writing is supposed to be in, let alone read it.

          Then again, my handwriting is so bad I've seen people turn it the other way up.

  • Half the time.. (Score:5, Insightful)

    by Miststlkr (593325) on Sunday October 05 2008, @11:43AM (#25264855)
    I can't even read people's handwriting, I hardly expect a computer to.
    • Re:Half the time.. (Score:5, Insightful)

      by glwtta (532858) on Sunday October 05 2008, @12:07PM (#25265085) Homepage
      Hell, I can't even read my own handwriting. Yeah, this is probably not going to happen.
    • Re: (Score:3, Interesting)

      I've been using a computer since I was a kid, 25 odd years now. I can't write. I don't believe I ever really learned it.

      I can print if I have to, though I usually ask my wife to do it because my hand gets sore after filling out a one page form. (In contrast I can easily type for 14+ hours at a stretch.)

      I guess I get the point of handwriting recognition, for historical documents, but do we really need it for future devices?

    • Re: (Score:2, Interesting)

      by Anonymous Coward

      I've been researching John Steinbeck's personal correspondence recently. Even with familiarity, his writing can be quite difficult to read. While reading a letter or trying to figure out the names he wrote on a photo, I feel sorry for his wife (Carol, at least) who did a great deal of transcription for him. Even though Steinbeck's typing is horrible, it is a huge relief to deal with his typed documents after a session with his handwriting. His handwriting is very neat and consistent, and even so, is mon

  • 1. Use the handwritten words as CAPTCHAs
    2. Wait for the bad guys to come up with programs to break them.
    3. ...
    4. Profit!

    • by aslvrstn (1047588) on Sunday October 05 2008, @11:51AM (#25264939)
      Joking or not, that's kind of the idea behind reCAPTCHA. It takes words that OCR failed on and uses them as CAPTCHAs. The same idea could work for handwriting. http://en.wikipedia.org/wiki/ReCAPTCHA [wikipedia.org]
    • CAPTCHA requires you to know what it says in the first place. You typed in Mary Jones, but that's not what my Chinese transcriber/OCR think it says. You could keep a database of failed CAPTCHAs and accept them as more people repeat, but then the bad guys will use the same bad entry over and over.
        • The other solution for pairs (which I think was also suggested by recaptcha) is to use two words. You are told that you have to answer both right, but in fact at least one of them can be somewhat uncertain, and the system will accept your input if it matches for the already fixed image. The already certain set can start out as rather small and simple, but it will grow quickly, even for a small site. You can still require 4 or so identical answers (with no conflicting ones) to the same image for it to be
      • But then you submit the same words to several different people and use statistics to pick the most likely answer - and forward entries with no likely answer to someone hired to do it.

        It's very likely that the manual entry being done now is being done redundantly and then compared to find errors (and choose the best data entry operators).
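The "several people plus statistics" approach from this sub-thread might look something like the sketch below. The function name `consensus` and the thresholds (at least 3 votes, at least 75% agreement) are assumptions for illustration; the thread itself only suggests requiring several identical answers with no conflicts.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_share=0.75):
    """Return the agreed transcription, or None to escalate to a hired reader."""
    if not answers:
        return None
    # Normalize so that case and stray spaces do not split the vote.
    normalized = [a.strip().lower() for a in answers]
    text, votes = Counter(normalized).most_common(1)[0]
    if votes >= min_votes and votes / len(normalized) >= min_share:
        return text
    return None


agreed = consensus(["Mary Jones", "mary jones", "Mary Jones ", "Mary Jones"])
disputed = consensus(["Mary Jones", "Marg Jonas", "Mary Janes"])
print(agreed, disputed)
```

A `None` result is the "no likely answer" case: that image goes to someone hired to do it. The same tallies could also be used to score data entry operators, since an operator who often lands outside the consensus is probably making more errors.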

  • by Anonymous Coward on Sunday October 05 2008, @11:46AM (#25264883)

    There is a simple reason that general OCR is much harder than cracking a CAPTCHA. General OCR has to recognize text *reliably*. CAPTCHA breakers are thrilled with a 10% success rate, because they use distributed systems created by worms to do the hard work a million times over. If you got 10% of the words right when scanning historical records you might as well not bother.
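The arithmetic behind that comment, as a small sketch. It assumes each attempt and each word is independent, which is a simplification, and the five-word entry is a made-up example.

```python
def expected_attempts(success_rate):
    """Average number of tries before one success (geometric distribution)."""
    return 1 / success_rate


def all_words_correct(word_accuracy, word_count):
    """Chance that every word in an entry is right, assuming independent errors."""
    return word_accuracy ** word_count


# A CAPTCHA breaker at 10% just retries: about 10 attempts per success.
attempts = expected_attempts(0.10)

# A transcription at 10% per word gets a 5-word entry fully right
# about once in 100,000 tries.
p_entry = all_words_correct(0.10, 5)
print(attempts, p_entry)
```

Retries are nearly free for a botnet, but a census record only gets scanned once, so the same 10% means very different things in the two settings.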

  • by Coopjust (872796) on Sunday October 05 2008, @11:49AM (#25264915)
    An OCR program can include a bank of fonts, and even when there is some sort of spill/ink blot/whatever on the paper, it has a solid reference. Handwriting isn't so easy, because humans don't always write their "Q"s with the line in the exact same spot and other fluctuations. Even if you gave a computer a point of reference (neatly drawn letters corresponding with their actual alphabetical values), a computer probably couldn't get it for a lot of people with inconsistent handwriting.

    Now, with context and improved technology, I don't think that handwriting recognition is impossible. I have a feeling that it will be a technology like speech recognition: never perfect, and it will require training.
      • by Kickersny.com (913902) <kickers&gmail,com> on Sunday October 05 2008, @12:04PM (#25265045) Homepage

        While handheld technology is indeed getting better, it's not directly applicable to the problem at hand. Real-time handwriting analysis uses stroke analysis as well as shape analysis to determine the letter(s). That is, the order in which you construct your letters matters very much. For example, if you crossed your T before drawing the vertical bar, the engine may have a difficult time figuring out what you intended.

        When OCRing documents, all of that 'meta-information' is lost.
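A toy illustration of the lost 'meta-information': two different stroke orders for the same letter collapse to one identical bitmap once scanned. The grid coordinates and the `rasterize` helper are hypothetical; real tablet ink also carries timing and pressure, which this sketch ignores.

```python
def rasterize(strokes):
    """Collapse ordered pen strokes into an unordered set of inked pixels."""
    return frozenset(point for stroke in strokes for point in stroke)


# A capital "T" on a tiny grid, as (x, y) points.
bar = [(0, 0), (1, 0), (2, 0)]
stem = [(1, 0), (1, 1), (1, 2)]

stem_first = [stem, bar]   # what a tablet sees from one writer
bar_first = [bar, stem]    # what it sees from another

print(stem_first != bar_first)                        # the ink data differs
print(rasterize(stem_first) == rasterize(bar_first))  # the scans do not
```

A real-time recognizer can use the difference between `stem_first` and `bar_first`; a scanner only ever gets the output of `rasterize`.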

        • You are right, but on the other hand a really good scan might be able to indicate the order. Paper that was already wet will react slightly different to the next ink stroke, etc.
  • For a moment there, I was picturing some new technology that could distinguish between C, Perl and Java written on scratch paper.

  • by bigattichouse (527527) on Sunday October 05 2008, @12:06PM (#25265075) Homepage
    Now you take the human translated recognition, and use it to train your genetic algo or neural net against the original images.
  • I hope they didn't give them the Presidential Book of Secrets, we could all be in trouble then!

  • by saigon_from_europe (741782) on Sunday October 05 2008, @12:32PM (#25265333)

    There is an on-line archive of all people that have passed through Ellis Island (http://www.ellisisland.org/search/passSearch.asp). It consists of retyped (OCR-ed?) ship manifests. Manifests are lists of passengers, with names, places of birth and similar information. In the original, they are written by hand, in cursive scripts (as expected for the late 19th and early 20th century).

    Problem is not with the script, but with appropriate context. Someone who retyped this, did not know what to expect in these forms.

    My great-grandfather's place of origin was written as "Lipovqani, Slovenia". The pair "lj" was recognized as "q". For a native English speaker, "l" and "j" next to each other do not make much sense. But for anyone of Slavic origin, "q" does not make sense (it appears only in foreign words), and "lj" does make sense, since it is a way to write the "soft l" sound, as in "Richelieu".

    Ok, maybe that was not an easy part to guess. But "Slovenia" was a serious error. At that time, Slovenia did not exist. It was part of Austro-Hungary, and it did not exist as a single entity inside it. What was really written was actually "Slavonia". That's an area in Eastern Croatia, and it *was* an entity inside Austro-Hungary.

    Should I mention that I was not able to track my great-grandmother and my other great-grandfather?
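The context problem described above could, in principle, be caught by checking each transcribed place name against a list of names that were valid for the period. The sketch below is only that: the `REGIONS_BY_PERIOD` table is a tiny hypothetical gazetteer, `best_region` is an illustrative name, and the 0.6 cutoff is an arbitrary choice.

```python
import difflib

# Hypothetical gazetteer: regions that existed in a given period.
REGIONS_BY_PERIOD = {
    "1867-1918": ["Slavonia", "Dalmatia", "Bohemia", "Galicia"],
}


def best_region(transcribed, period):
    """Suggest the closest period-valid region name, or None if none is close."""
    candidates = REGIONS_BY_PERIOD.get(period, [])
    matches = difflib.get_close_matches(transcribed, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None


# "Slovenia" is not in the period list, so the closest valid name is suggested.
suggestion = best_region("Slovenia", "1867-1918")
print(suggestion)
```

A transcriber who sees the suggestion next to the scan can at least take a second look, which is the context that the person retyping the manifests did not have.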

    • Are images of the original hand-written documents available on the Web?

      • Yes, but you can see them only once you find something in the search. And they used to have some funny system trying to prevent people from printing original scans.

    • Many of the immigrants were barely literate in their own language, let alone English, and so names and places might be recorded how the official thought it should be spelled. Maybe they were busy or annoyed and couldn't be bothered to check. They were government employees, after all...

      On top of that you have people who don't wish to stand out or suffer discrimination and intentionally anglicise their names.

      I'm with you 100% about context. And good luck with the searching.

  • by British (51765) <british1500@gmail.com> on Sunday October 05 2008, @12:45PM (#25265477) Homepage Journal

    Can OCR properly trace the lines at least to replicate it? Meaning, it could make a vector replica of the handwriting? Would be neat if it could do that, then try to straighten out the lines, perhaps to simulate the possible path the original writer took to write it. Of course, the software will have to figure out intersections. Maybe a path of logic would be to know what turns a handwriter would NOT take, and then determine individual letters from that.

    Combine that with other logic, like finding "dots" would indicate an i or a j, and maybe it will improve.

    • Can OCR properly trace the lines at least to replicate it? Meaning, it could make a vector replica of the handwriting?

      That is easy enough. Edge detection [wikipedia.org] and morphological thinning [wikipedia.org] can do the job.

      Maybe a path of logic would be to know what turns a handwriter would NOT take...

      And that is a real problem. Topological approaches have limited usefulness - similar turns could make different letters. Statistical approaches like Bayesian networks or ANNs can help here, but even the human brain often has problems with f

  • Get the guys writing the code that breaks captcha.

    Simple, honestly. Make it economically worthwhile to write the code to do such. Writing code to break handwriting isn't as lucrative as say, writing virii or malware code.

    Take a look at the results...

    disclaimer: I doubt they will EVER break my doc's handwriting.

    --Toll_Free

    • Re:New strategy (Score:5, Interesting)

      by Belial6 (794905) on Sunday October 05 2008, @01:07PM (#25265679) Homepage
      You joke, but there really is very little reason to teach children handwriting/script/cursive (whichever you want to call it). The point of cursive was to speed up writing. It was never any good for readability. In today's world, if you need to write a lot of stuff, you are generally going to type it on a computer. Since just about anything that we would want to write by hand will be short, the speed gain would be minimal. Thus spending time and resources to teach every kid to write a useless, illegible font is pretty pointless.
      • Re:New strategy (Score:4, Interesting)

        by Jorophose (1062218) on Sunday October 05 2008, @03:18PM (#25266721)

        Except for most of us it's faster to write with your hands.

        Writing by hand, you can jump letters and make abbrevs, you can draw diagrams right in there, and not to mention it feels a lot better. I don't know why but sitting and typing on my computer, and same when I used to paint minis, feels painful and stuffy. With the option of either typing or writing I'd definitely take writing. Sure, with typing on a computer you can erase stuff quickly, but text editors have always been shitty for me (stuff like AbiWord often having graphical glitches or being plain slow, text editors too or just lame feeling) and hitting a bunch of blocks to make words does not feel as good as actually writing down the words.

        I never mastered cursive properly. I write "script", but write while skipping letters in my notes and using small symbols (batman symbol, drawn as a W in a circle, for example, is distress; three points is "donc", ds dans, etc and it changes depending on context). I write fairly fast, and imho much faster than when I type, if only because when I type I often hit the wrong keys; often being once a paragraph, and it's often because I can't get my mind straight on the keymap, or my fingers hit in the wrong order.

    • Sorry, the use of "code" instead of "character" was my error. I corrected it in TFA, after being notified by a /. editor.
    • I think it has to do with success rates. A 20% success rate is all that a spammer cracking a captcha needs to be profitable.

      The USPS might see a return on investment if their OCR equipment works on 75% of text, routing the hard-to-read 25% to humans. That's a huge reduction in workload, because otherwise every letter would need to be scanned by a human

      Legal and government text, however, needs to be 99.9% or more accurate, because one flipped character in a page of text can cause severe problems; tha