
 



Software

Optical Character Recognition Still Struggling With Handwriting

Ian Lamont recently asked Google if they planned to extend their transcription of books and other printed media to include public records, many of which were handwritten before word processors became ubiquitous. Google wouldn't talk about any potential plans, but Lamont found out a bit more about the limits of optical character recognition in the process: "Even though some CAPTCHA schemes have been cracked in the past year, a far more difficult challenge lies in using software to recognize handwritten text. Optical character recognition has been used for years to convert printed documents into text data, but the enormous variation in handwriting styles has thwarted large-scale OCR imports of handwritten public documents and historical records. Ancestry.com took a surprising approach to digitizing and converting all publicly released US census records from 1790 to 1930: It contracted the job to Chinese firms whose staff manually transcribed the names and other information. The Chinese staff are specially trained to read the cursive and other handwriting styles from digitized paper records and microfilm. The task is ongoing with other handwritten records, at a cost of approximately $10 million per year, the company's CEO says."
This discussion has been archived. No new comments can be posted.


  • by Coopjust ( 872796 ) on Sunday October 05, 2008 @12:49PM (#25264915)
    An OCR program can include a bank of fonts, and even when there is some sort of spill/ink blot/whatever on the paper, it has a solid reference. Handwriting isn't so easy, because humans don't always write their "Q"s with the line in the exact same spot, among other fluctuations. Even if you gave a computer a point of reference (neatly drawn letters corresponding with their actual alphabetical values), a computer probably couldn't get it for a lot of people with inconsistent handwriting.

    Now, with context and improved technology, I don't think that handwriting recognition is impossible. I have a feeling that it will be a technology like speech recognition: never perfect, and it will require training.
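
To illustrate the "bank of fonts" point above, here is a minimal sketch of matching a scanned glyph against fixed reference shapes. The tiny 5x3 bitmaps, the `FONT_BANK` table, and the pixel-difference score are all assumptions made up for illustration; real OCR engines are far more elaborate. The point is only that a printed glyph with a small blot still lands closest to its reference.

```python
# Hypothetical 5-row by 3-column bitmaps standing in for a font bank.
FONT_BANK = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}

def pixel_difference(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))

def recognize(glyph):
    """Return the reference character whose bitmap differs least from the glyph."""
    return min(FONT_BANK, key=lambda ch: pixel_difference(FONT_BANK[ch], glyph))

# A printed "T" with a one-pixel ink blot in the third row.
blotted_t = ("111", "010", "011", "010", "010")
result = recognize(blotted_t)
```

Handwriting breaks this scheme because there is no single reference bitmap per letter to compare against.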
  • Re:Better approach? (Score:5, Informative)

    by PeeAitchPee ( 712652 ) on Sunday October 05, 2008 @01:47PM (#25265499)

    No.

    I own a microfilm digitization / OCR shop. We work with tons of old records such as the ones referenced in this story, as well as old HR docs, check stubs, time cards, architectural drawings, you name it. If you OCR cursive, you don't get back 80%, or 70%, or even 30% accuracy . . . you get back a bunch of pseudo-random (to our eyes) characters which are in NO WAY related to what the actual text is. About the only handwriting recognizable using today's tech is block-print, like you find on engineering diagrams. The technique in this article is pretty standard operating procedure, and has been for some time -- much easier to put a few hundred people on the project and grind through it (and cheaper too compared to data entry rates here in the US -- about 1/3 the price). That usually includes double-keying to check everything and a 99.99999% accuracy guarantee.

    Just FYI, there are only a few OCR engines out there. Probably the most commonly used is the ABBYY engine, which is both OEMed and sold directly as desktop- and server-based products by ABBYY. There are a few others as well, and despite their differences, most have pretty much the same capabilities and accuracy. But OCR of cursive, especially of the docs cited in the article where you don't have someone sit down and "train" the machine first with handwriting samples, is still one of the great "unsolved" computing problems. I expect we'll have the capability in the next decade or so as processor core density, memory, and storage continue to increase at their current rate -- eventually, the machine will be able to "brute-force" through the docs just like the Chinese data entry folks in this article.
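
The double-keying check mentioned above can be sketched roughly as follows: two operators transcribe the same records independently, and any record where they disagree is flagged for a second look. The function name and the whitespace trimming are assumptions for illustration, not a description of any particular shop's workflow.

```python
def double_key_mismatches(first_pass, second_pass):
    """Return the indices of records where two independent keyings disagree."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same records")
    return [
        i
        for i, (a, b) in enumerate(zip(first_pass, second_pass))
        if a.strip() != b.strip()  # assumption: stray whitespace is not an error
    ]

# Hypothetical census-style name fields keyed by two different operators.
operator_a = ["Smith, John", "Oleary, Mary", "Baker, Wm"]
operator_b = ["Smith, John", "O'Leary, Mary", "Baker, Wm "]
flagged = double_key_mismatches(operator_a, operator_b)
```

Only the flagged records go back for review, so the cost of the second pass buys a check on every field without anyone proofreading the whole batch.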

  • Re:Better approach? (Score:2, Informative)

    by Anonymous Coward on Sunday October 05, 2008 @04:42PM (#25266927)

    It's not resolution; your tablet has less resolution than a scanner. Tablets can do a lot better because they can keep track of which lines you made first. The problem is much easier when you have an ordered and timed series of strokes to examine compared to when you only have a finished picture to look at.
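
A small sketch of why the finished picture carries less information than the stroke sequence: once strokes are flattened into inked pixels, the order in which they were drawn is gone. The pixel coordinates and the `rasterize` helper here are hypothetical, chosen only to make the point concrete.

```python
def rasterize(strokes):
    """Collapse an ordered list of strokes into an unordered set of inked pixels."""
    return frozenset(point for stroke in strokes for point in stroke)

# Two strokes of a plus-like shape, as lists of (x, y) pixels in drawing order.
horizontal = [(0, 1), (1, 1), (2, 1)]
vertical = [(1, 0), (1, 1), (1, 2)]

# What a tablet sees: the same shape written in two different stroke orders.
horizontal_first = [horizontal, vertical]
vertical_first = [vertical, horizontal]

# What a scanner sees: identical images, with the ordering lost.
same_image = rasterize(horizontal_first) == rasterize(vertical_first)
same_strokes = horizontal_first == vertical_first
```

Here `same_image` is true while `same_strokes` is false, which is the extra signal a tablet has and a scanner does not.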

  • Re:Better approach? (Score:3, Informative)

    by TheNetAvenger ( 624455 ) on Monday October 06, 2008 @01:46AM (#25270121)

    Couple of points...

    1) MS Research probably has some of the best work being done on handwriting recognition, including imaged documents. However, it is nowhere near the needed levels. Google would be better off working with Microsoft on stuff like this than following the motto of "screw anything MS is doing, we will recreate it ourselves."

    2) On your Vista Tablet PC, the reason you can get 90-99% levels of recognition is that TabletPCs and Vista/Windows use a concept called 'ink' (that goes back to early work at Microsoft from the late 80s).

    Ink not only stores a 'picture' (Bitmap/Vector) of what you wrote, but also the stroke pressure, speed, direction, and order of each movement. So even if it doesn't look like a 'T', because you used the strokes that normally would make a T, Vista can figure this out.

    Because this works differently from reading the 'image' as OCR has to, you can use cursive, printed, combinations, or whatever, and Vista can figure it out from the motion and stroke more than from what it looks like.

    Ink is also why Microsoft holds a lot of respect in industries that use handwriting devices and TabletPCs, like the medical industry, as it can even read doctors' handwriting. Ink also holds more data that can be further looked at later on for more advanced processing of intent or even how the person writes. This is also why Vista (go look up YouTube demonstrations) is far ahead of handwriting technology in other OSes, like OS X.

    Ink is also a 'crucial' data type that is not an image or a word, and one reason MS has been fighting for OOXML formats, because they retain this data, like in OneNote and Winword. Without the pen stroke information, ink becomes worthless, as it is no longer information about what the person wrote, and just a crappy image.
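
As a rough sketch of what an ink-style record might hold, the snippet below stores position, pressure, and time for each sample and derives speed and direction between consecutive samples. The `InkPoint` type and `segment_features` helper are hypothetical and are not Microsoft's actual ink format or API; they only illustrate the kind of per-stroke data described above.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class InkPoint:
    """One pen sample; the field names are assumptions for illustration."""
    x: float
    y: float
    pressure: float  # normalized 0.0 to 1.0
    t: float         # seconds since the stroke began

def segment_features(p, q):
    """Return (speed, direction in degrees) for the pen movement from p to q."""
    dx, dy = q.x - p.x, q.y - p.y
    dt = q.t - p.t
    speed = math.hypot(dx, dy) / dt if dt > 0 else 0.0
    direction = math.degrees(math.atan2(dy, dx))
    return speed, direction

# The list order is the stroke order; a bitmap keeps none of these fields.
stroke = [InkPoint(0.0, 0.0, 0.4, 0.0), InkPoint(3.0, 4.0, 0.6, 0.5)]
speed, direction = segment_features(stroke[0], stroke[1])
```

Rendering `stroke` to an image would keep only the x and y values, which is the "crappy image" case the comment describes.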
