Slashdot
reCAPTCHA Hard At Work, Rescuing Fading Texts
Posted by
timothy
on Thursday August 14, @09:05PM
from the strange-confluence dept.
sciencehabit writes "Computer scientists have developed a program, called reCAPTCHA, which is being used in lieu of CAPTCHA by several sites, to help digitize old books and newspapers. The reCAPTCHA takes entries from old and faded texts that optical scanners and digital-text readers have trouble with. So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times." The Science Now story links to the longer and more informative article at Ars Technica. (We last mentioned this program last year — and now it's good to get some sense of how well it's working.)
Related Stories
IT: Carnegie Mellon CAPTCHA Digitization Project Now Underway 119 comments
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but it was prior to it getting off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie, and one of which isn't. If the user correctly deciphers the known word, then the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
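The two-word scheme described in that summary can be sketched roughly as follows. This is only an illustration of the logic as summarized (known word gates the submission, two different users must agree on the unknown word); the class name `WordPairChecker`, its method and parameter names, and the case-insensitive comparison are assumptions for the example, not the reCAPTCHA project's actual code.

```python
from collections import defaultdict


class WordPairChecker:
    """Hypothetical sketch of the known-word / unknown-word check."""

    def __init__(self, required_agreement=2):
        # Number of distinct users who must give the same answer (assumed: 2,
        # per the summary above).
        self.required_agreement = required_agreement
        # unknown word id -> normalized answer -> set of user ids who gave it
        self._votes = defaultdict(lambda: defaultdict(set))
        # unknown word id -> accepted transcription
        self.resolved = {}

    def submit(self, user_id, known_answer, known_truth, unknown_id, unknown_answer):
        """Return True if the user passes the CAPTCHA, False otherwise."""
        # The known word is the real test: a wrong answer fails the CAPTCHA
        # and the guess for the unknown word is discarded.
        if known_answer.strip().lower() != known_truth.strip().lower():
            return False
        # Already taken off the list: nothing more to record.
        if unknown_id in self.resolved:
            return True
        answer = unknown_answer.strip().lower()
        voters = self._votes[unknown_id][answer]
        voters.add(user_id)  # a set, so repeat votes by one user count once
        if len(voters) >= self.required_agreement:
            self.resolved[unknown_id] = answer
            del self._votes[unknown_id]
        return True
```

With this sketch, a single user repeating the same answer never resolves a word, and a user who fails the known word contributes nothing to the unknown one.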
Next-Generation CAPTCHA Exploits the Semantic Gap 327 comments
captcha_fun writes "Researchers at Penn State have developed a patent-pending image-based CAPTCHA technology for next-generation computer authentication. A user is asked to pass two tests: (1) click the geometric center of an image within a composite image, and (2) annotate an image using a word selected from a list. These images shown to the users have fake colors, textures, and edges, based on a sequence of randomly-generated parameters. Computer vision and recognition algorithms, such as alipr, rely on original colors, textures, and shapes in order to interpret the semantic content of an image. Because of the endowed power of imagination, even without the correct color, texture, and shape information, humans can still pass the tests with ease. Until computers can 'imagine' what is missing from an image, robotic programs will be unable to pass these tests. The system is called IMAGINATION and you can try it out." This sounds promising given how broken current CAPTCHA technology is.
IT: Understanding How CAPTCHA Is Broken 148 comments
An anonymous reader writes "Websense Security Labs explains the spammer Anti-CAPTCHA operations and mass-mailing strategies. Apparently spammers are using a combination of different tactics (proper email accounts, visual social engineering, and fast-flux) representing a strategy, explains their resident CAPTCHA expert. It is evident that spammers are working towards defeating anti-spam filters with their tactics."
IT: Fallout From the Fall of CAPTCHAs 413 comments
An anonymous reader recommends Computerworld's look at the rise and fall of CAPTCHAs, and at some of the ways bad guys are leveraging broken CAPTCHAs to ply their evil trade. "CAPTCHA used to be an easy and useful way for Web administrators to authenticate users. Now it's an easy and useful way for malware authors and spammers to do their dirty work. By January 2008, Yahoo Mail's CAPTCHA had been cracked. Gmail was ripped open soon thereafter. Hotmail's top got popped in April. And then things got bad. There are now programs available online (no, we will not tell you where) that automate CAPTCHA attacks. You don't need to have any cracking skills. All you need is a desire to spread spam, make anonymous online attacks against your enemies, propagate malware or, in general, be an online jerk. And it's not just free e-mail sites that can be made to suffer..."
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Not new (Score:4, Informative)
Re: (Score:3, Informative)
So is the US Patent and Trademark Office, as part of the process of using PAIR [uspto.gov], the Patent Application Information Retrieval system, which lets the public look at information about patent applications that have been published.
Re: (Score:3, Funny)
Facebook uses reCAPTCHA. I guess you can make something useful out of the millions of useless teenagers wasting their time on Facebook.
Re:Not new (Score:5, Funny)
Facebook uses reCAPTCHA. I guess you can make something useful out of the millions of useless teenagers wasting their time on Facebook.
That's not fair.
Plenty of useless adults waste their time on Facebook.
Re:Not new (Score:4, Informative)
Re:Not new (Score:4, Funny)
But you...
*sigh* ...Never mind. It's Friday. Go have a beer or something.
Re:Not new (Score:4, Informative)
Quoting from the NPR story [npr.org] which aired earlier today:
Validate your data, guys! (Score:3, Funny)
I can usually tell which of the two words is from a real old text. With high probability (>90%) I can correctly answer the real CAPTCHA and replace someone's OCR'd word with "penis".
I've only ever done this maybe ten or twenty times, but it could easily become an automatic part of using the system.
Re:Validate your data, guys! (Score:4, Interesting)
Since they use entries from several users to validate correct translations for OCR'ed text, this probably won't cause them major problems. OTOH, I wonder if they can track the accuracy of each user's inputs and, if it becomes evident that a user is either incompetent or attempting to screw with the system, take appropriate measures.
When someone's karma starts dropping into the negative range, they should let us know how well this worked out. If anyone can see their posts, that is.
Cool possible uses (Score:5, Interesting)
Man, I would love to see the results if this technique was used for an ontological [google.com] purpose.
Please type in the word from the choices below that most closely relates to this word: OLD
HISTORIC
LIFESPAN
Interesting shit indeed.
Re:Cool possible uses (Score:5, Funny)
Or perhaps SLASHDOT-READER:
OVERWEIGHT
GEEK
SPENDS-TOO-MUCH-TIME-USING-COMPUTERS
ALL-OF-THE-ABOVE
I fit into the category ALL-OF-THE-ABOVE. The only generalisation that is missing about slashdotters is the one about girlfriends.
Huh? 1908 New York Times? (Score:3, Funny)
The New York Times is already online from 1851 onwards. The concept is cool, truly, but why not CAPTCHA something not already accomplished? Oh, I know. That was, like, a metaphor, right?
Re: (Score:3, Insightful)
DMCA Violation (Score:5, Funny)
Prior art (Score:5, Funny)
Re: (Score:3, Funny)
gmail captchas (Score:3)
a little OT I know but is anyone else having a bad time with gmail's captchas? I've tried signing up several of our customers for gmail recently and it's becoming really hard to get them right. The "audio" playback used to be the saving grace, but the last two I did it sounded like ten people were talking to me all at once with no discernible key voice. (and last I succeeded, the string to be entered was spoken in three groups, by three different voices)
Image Captchas (Score:4, Informative)
Re:Image Captchas (Score:5, Funny)
Just use an alt tag.
Recaptcha doesn't recapture context (Score:5, Interesting)
For example, in some fonts "cost" and "cast" might be indistinguishable in the image shown. But given the context of the sentence it's trivial for a human to tell the difference.
Suppose that they found these words on which people disagreed and had another captcha system which showed the full sentence. I'd guess they could improve their accuracy significantly in this case. Since they could prescreen for ambiguous words using the current captcha system, even if fewer people were willing to solve the "large" captcha, they would still get all the solutions they needed.
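The prescreening step suggested here, finding the words on which people disagreed, might look something like the sketch below. The function name `find_ambiguous`, the input shape, and the thresholds (`min_answers`, `min_share`) are all made up for illustration; nothing in the thread says how reCAPTCHA actually measures disagreement.

```python
from collections import Counter


def find_ambiguous(answers_by_word, min_answers=3, min_share=0.75):
    """Return ids of words whose most common answer lacks a clear majority.

    answers_by_word: dict mapping a word id to the list of answers typed
    by users (hypothetical data shape).
    """
    ambiguous = []
    for word_id, answers in answers_by_word.items():
        # Too few answers to judge agreement either way.
        if len(answers) < min_answers:
            continue
        counts = Counter(a.strip().lower() for a in answers)
        top_count = counts.most_common(1)[0][1]
        # Flag the word if the leading answer's share is below the threshold.
        if top_count / len(answers) < min_share:
            ambiguous.append(word_id)
    return ambiguous
```

Words flagged this way (a "cost"/"cast" split, say) would be the candidates for the larger full-sentence captcha, while words with clear agreement would never need it.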
Use to hide your own email addy (Score:5, Informative)
You can also use reCaptcha for your own email address, and be more willing to provide it "publicly" since they'd have to answer the reCaptcha to get to the mailto... reCaptcha mailhide [recaptcha.net]
Re:reCAPTCHA and Open Source (Score:4, Informative)
There are multiple libraries for reCAPTCHA already published, all under the MIT License. Just see http://code.google.com/p/recaptcha/ [google.com] for a list of them.
Re:AC for the plain old CAPTCHA (Score:5, Funny)
Re:One Problem (Score:5, Funny)
One FUNDAMENTAL problem with this
... is that you didn't RTFA.
Re:Problems With ReCaptcha (Score:4, Informative)
I've seen one ReCAPTCHA string that was just a distorted entirely illegible blob of ink.
Just do what I did: click the "refresh" button to the right for a new word pair and enter that one.