Optical Character Recognition Still Struggling With Handwriting
Ian Lamont recently asked Google if they planned to extend their transcription of books and other printed media to include public records, many of which were handwritten before word processors became ubiquitous. Google wouldn't talk about any potential plans, but Lamont found out a bit more about the limits of optical character recognition in the process:
"Even though some CAPTCHA schemes have been cracked in the past year, a far more difficult challenge lies in using software to recognize handwritten text. Optical character recognition has been used for years to convert printed documents into text data, but the enormous variation in handwriting styles has thwarted large-scale OCR imports of handwritten public documents and historical records. Ancestry.com took a surprising approach to digitizing and converting all publicly released US census records from 1790 to 1930: It contracted the job to Chinese firms whose staff manually transcribed the names and other information. The Chinese staff are specially trained to read the cursive and other handwriting styles from digitized paper records and microfilm. The task is ongoing with other handwritten records, at a cost of approximately $10 million per year, the company's CEO says."
Re:Use them as CAPTCHA... (Score:4, Interesting)
It's a more complex problem than people imagine (Score:5, Interesting)
There is an online archive of everyone who passed through Ellis Island (http://www.ellisisland.org/search/passSearch.asp). It consists of retyped (OCR-ed?) ship manifests. Manifests are lists of passengers, with names, places of birth and similar information. In the original, they are written by hand, in cursive scripts (as expected for the late 19th and early 20th century).
The problem is not with the script, but with the lack of appropriate context. Whoever retyped this did not know what to expect in these forms.
My great-grandfather's place of origin was written as "Lipovqani, Slovenia". The pair "lj" was recognized as "q". For a native English speaker, "lj" side by side does not make much sense. But for anyone of Slavic origin, "q" does not make sense (it appears only in foreign words), while "lj" does, since it is the way to write a "soft l" sound like the one in "Richelieu".
OK, maybe that was not an easy part to guess. But "Slovenia" was a serious error. At that time, Slovenia did not exist. It was part of Austria-Hungary, and it did not exist as a single entity inside it. What was actually written was "Slavonia". That's an area in eastern Croatia, and it *was* an entity inside Austria-Hungary.
Should I mention that I was not able to track down my great-grandmother and my other great-grandfather?
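The point about context can be sketched in a few lines: snap each transcribed word to the closest entry in a list of names that were plausible for the period. This is only an illustration, not how the Ellis Island archive was built. The two word lists and the `correct` helper are made up for the example, and the matching uses Python's standard `difflib`.

```python
import difflib

# Hypothetical, hand-made lists for illustration only (not a real gazetteer):
# regions that were entities inside Austria-Hungary around 1900, and a few towns.
REGIONS_1900 = ["Slavonia", "Dalmatia", "Bohemia", "Galicia", "Carniola"]
PLACES = ["Lipovljani", "Osijek", "Zagreb"]

def correct(word, lexicon, cutoff=0.6):
    """Return the closest lexicon entry, or the word unchanged if nothing is close."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("Lipovqani", PLACES))
print(correct("Slovenia", REGIONS_1900))
```

With these lists, "Lipovqani" snaps to "Lipovljani" and "Slovenia" to "Slavonia". The hard part in practice is the lexicon itself: it has to match the time and place the record comes from, which is exactly the context the transcriber lacked.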
Pretend it's a string? (Score:4, Interesting)
Can OCR properly trace the lines at least to replicate it? Meaning, it could make a vector replica of the handwriting? Would be neat if it could do that, then try to straighten out the lines, perhaps to simulate the possible path the original writer took to write it. Of course, the software will have to figure out intersections. Maybe a path of logic would be to know what turns a handwriter would NOT take, and then determine individual letters from that.
Combine that with other logic, like finding "dots" would indicate an i or a j, and maybe it will improve.
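The "find the dots" idea can be tried on a toy example. The sketch below assumes the scan has already been reduced to a grid of ink ('#') and paper ('.') cells; it groups touching ink cells into blobs and flags the very small ones as possible dots. The grid, the function names and the `max_area` threshold are all invented for illustration, and real handwriting would need far more than this (stroke tracing, slant, touching letters).

```python
def components(grid):
    """Group 4-connected ink cells ('#') into blobs; returns a list of cell lists."""
    seen, blobs = set(), []
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch != "#" or (r, c) in seen:
                continue
            stack, blob = [(r, c)], []
            seen.add((r, c))
            while stack:  # iterative flood fill from the first unseen ink cell
                y, x = stack.pop()
                blob.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                            and grid[ny][nx] == "#" and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            blobs.append(blob)
    return blobs

def dot_candidates(grid, max_area=2):
    """Blobs of at most max_area cells are treated as possible i/j dots."""
    return [b for b in components(grid) if len(b) <= max_area]

# Toy image: a dotted stroke on the left, an undotted stroke on the right.
SAMPLE = [
    "..#....",
    ".......",
    "..#..#.",
    "..#..#.",
    "..#..##",
]

print(len(components(SAMPLE)), dot_candidates(SAMPLE))
```

Here the single cell at the top is the only blob small enough to count as a dot, which hints that the stroke below it is an "i" or a "j" rather than an "l".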
This is simple to fix (Score:2, Interesting)
Get the guys writing the code that breaks captcha.
Simple, honestly. Make it economically worthwhile to write the code to do it. Writing code to break handwriting isn't as lucrative as, say, writing viruses or malware.
Take a look at the results...
disclaimer: I doubt they will EVER break my doc's handwriting.
--Toll_Free
Re:Half the time.. (Score:3, Interesting)
I've been using a computer since I was a kid, 25-odd years now. I can't write. I don't believe I ever really learned it.
I can print if I have to, though I usually ask my wife to do it because my hand gets sore after filling out a one page form. (In contrast I can easily type for 14+ hours at a stretch.)
I guess I get the point of handwriting recognition, for historical documents, but do we really need it for future devices?
Re:Better approach? (Score:3, Interesting)
Doesn't this suggest an obvious solution to CAPTCHA? Just use cursive text rather than try to obscure the text with funky backgrounds. If the spammers do manage to crack the CAPTCHA, then incorporate their technology into mainstream OCR programs.
Re:New strategy (Score:5, Interesting)
Re:Half the time.. (Score:2, Interesting)
I've been researching John Steinbeck's personal correspondence recently. Even with familiarity, his writing can be quite difficult to read. While reading a letter or trying to figure out the names he wrote on a photo, I feel sorry for his wife (Carol, at least) who did a great deal of transcription for him. Even though Steinbeck's typing is horrible, it is a huge relief to deal with his typed documents after a session with his handwriting. His handwriting is very neat and consistent, and even so, is monumentally difficult to read. It's difficult enough to justify getting the original documents (e.g., going to Stanford for the Special Collections instead of dealing with scans). I cannot imagine OCR managing it.
Re:New strategy (Score:2, Interesting)
(and in some circumstances the keyboard clicking is loud enough to be considered disruptive - true, there are loud pens & pencils, but I run into far more loud laptops than scratching handwriting implements).
Re:New strategy (Score:4, Interesting)
Except for most of us it's faster to write with your hands.
Writing by hand, you can jump letters and make abbrevs, you can draw diagrams right in there, and not to mention it feels a lot better. I don't know why but sitting and typing on my computer, and same when I used to paint minis, feels painful and stuffy. With the option of either typing or writing I'd definitely take writing. Sure, with typing on a computer you can erase stuff quickly, but text editors have always been shitty for me (stuff like AbiWord often having graphical glitches or being plain slow, text editors too or just lame feeling) and hitting a bunch of blocks to make words does not feel as good as actually writing down the words.
I never mastered cursive properly. I write "script", but write while skipping letters in my notes and using small symbols (batman symbol, drawn as a W in a circle, for example, is distress; three points is "donc", ds dans, etc and it changes depending on context). I write fairly fast, and imho much faster than when I type, if only because when I type I often hit the wrong keys; often being once a paragraph, and it's often because I can't get my mind straight on the keymap, or my fingers hit in the wrong order.
parts of the problem are solved (Score:3, Interesting)
The US Post Office has, for years, had fairly reliable automated reading of handwritten digits, which is used to auto-sort and -route mail by zipcode. It can handle some pretty terrible handwriting, crazy arrangement on the envelope, and unlikely variations, so only a relatively small percentage of letters are spit out to be read by human eyes.
Its task is made easier by the fact that they're locating and segmenting fixed-length sequences that are usually at least somewhat separated: they're looking for either a 5-digit zip code or a 5-dash-4-digit zip+4, and handwritten digits usually don't connect in the way that cursive letters do. That and you have only 10 digits to deal with, instead of 36 alphanumeric characters plus punctuation, but that particular difference is just a matter of computing power and memory to scale up to ~4x the charset.
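To show why the fixed format helps, here is a small sketch that filters candidate readings through the two shapes mentioned above (5 digits, or 5 digits, a dash and 4 digits) and keeps the best-scoring one that fits. This is not a description of the Post Office's actual system; the function names, the candidate list and the scores are made up for illustration.

```python
import re

# The two formats named above: 5-digit ZIP or ZIP+4 (5 digits, dash, 4 digits).
ZIP_PATTERN = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")

def is_plausible_zip(text):
    """True if the whole string has one of the two expected shapes."""
    return ZIP_PATTERN.fullmatch(text) is not None

def best_reading(candidates):
    """candidates: list of (text, score) pairs from a hypothetical recognizer.
    Returns the highest-scoring text that looks like a ZIP code, or None."""
    valid = [(score, text) for text, score in candidates if is_plausible_zip(text)]
    return max(valid)[1] if valid else None

# The recognizer's top guess contains a letter O, so the format check rejects it.
readings = [("9021O", 0.90), ("90210-12", 0.85), ("90210", 0.80)]
print(best_reading(readings))

# Size of the jump from digits to alphanumerics: 36 / 10 = 3.6, i.e. roughly 4x.
print(36 / 10)
```

Free-form cursive text has no such fixed shape to check against, which is part of why the digit case is so much more tractable.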