reCAPTCHA Hard At Work, Rescuing Fading Texts
sciencehabit writes "Computer scientists have developed a program, called reCAPTCHA, which is being used in lieu of CAPTCHA by several sites, to help digitize old books and newspapers. The reCAPTCHA takes entries from old and faded texts that optical scanners and digital-text readers have trouble with. So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times." The Science Now story links to the longer and more informative article at Ars Technica. (We last mentioned this program last year — and now it's good to get some sense of how well it's working.)
Not new (Score:4, Informative)
Re: (Score:3, Informative)
So is the US Patent and Trademark Office, as part of the process of using PAIR [uspto.gov], the Patent Application Information Retrieval system, which lets the public look at information about patent applications that have been published.
Re: (Score:3, Funny)
Facebook uses reCAPTCHA. I guess you can make something useful out of the millions of useless teenagers wasting their time on Facebook.
Re:Not new (Score:5, Funny)
Facebook uses reCAPTCHA. I guess you can make something useful out of the millions of useless teenagers wasting their time on Facebook.
That's not fair.
Plenty of useless adults waste their time on Facebook.
Re: (Score:1)
Re: (Score:1)
*duck and covers with flameproof suit under the steel table* *locks and loads AK-47*
Re: (Score:3, Informative)
Do they really? From what I was able to tell, it's not specified as reCAPTCHA anywhere in the window; having looked at the reCAPTCHA site from a development side I could swear that I read that you needed to give credit if developing a custom style for it. Either I'm remembering wrong, they've got a deal, or FB is undergoing one of the stupidest TOS violations ever.
Re:Not new (Score:4, Informative)
Re: (Score:3, Informative)
Do they really? From what I was able to tell, it's not specified as reCAPTCHA anywhere in the window; having looked at the reCAPTCHA site from a development side I could swear that I read that you needed to give credit if developing a custom style for it. Either I'm remembering wrong, they've got a deal, or FB is undergoing one of the stupidest TOS violations ever.
They do give attribution to reCAPTCHA. You have to click on "What's this?"
This is a standard security test that we use to prevent spammers from creating fake accounts and spamming users. Our captchas are provided by ReCaptcha
Re: (Score:3, Insightful)
Re: (Score:1)
Here, found one!
Ok, you can go back to Facebook now Tassach =)
Re: (Score:2)
It would only be useful if teenagrs knew how to spell.
Re:Not new (Score:4, Funny)
But you...
*sigh* ...Nevermind. It's Friday. Go have a beer or something.
Re: (Score:2)
Go have a beer or something.
Waaaaaay ahead of ya.
Re: (Score:3, Insightful)
I would imagine that they use multiple logins to verify one word - it's not like people don't mistype captchas in the first place.
Re:Not new (Score:4, Informative)
Quoting from the NPR story [npr.org] which aired earlier today:
Re: (Score:3, Interesting)
Quoting from the NPR story [npr.org] which aired earlier today:
That's scary. The way reCAPTCHA works allows the reCAPTCHA server to collect the IPs of reCAPTCHA users (along with the reCAPTCHA-enabled website they are using). If many websites are using reCAPTCHA, it becomes possible to track users as they move through the web, from one reCAPTCHA-enabled website to the next.
The idea is cute, but the implementation is fundamentally broken and a huge breach of privacy.
Re: (Score:2, Informative)
That's scary. The way reCAPTCHA works allows the reCAPTCHA server to collect the IPs of reCAPTCHA users (along with the reCAPTCHA-enabled website they are using). If many websites are using reCAPTCHA, it becomes possible to track users as they move through the web, from one reCAPTCHA-enabled website to the next.
Only if you actually use the JavaScript API. If you want to protect the privacy of your site's users, you are free to use the server side API of your choice. This gives them (at most) a count of how many recaptchas your users have solved. By the way, the recaptcha site provides - amongst others - ready-made server side bindings for PHP, Java, Ruby, Python and Perl.
Re: (Score:2)
Re: (Score:2)
If you wanted to blow the extra bandwidth, you could get around that, too. Grab the image onto your server, and let the user get it from there.
Most sites won't do this, because I think it falls way into the tinfoil hat department. :P
Re: (Score:2)
Tinfoil hat much?
Every bigger ad agency (google) can do the same thing.
Re: (Score:2)
Re: (Score:2)
Smart sites are doing something to check and see if you're blocking ads. I notice that I have to disable AdBlock on imeem.com or it will only play the first song in a playlist.
I'm sure there is a way around it, but I haven't hacked at it enough. Not really bugging me that much. Every way I can think of to implement such an "ad checker" can be defeated.
Re: (Score:1)
They most likely require several matching readings from different people before they consider it deciphered.
Validate your data, guys! (Score:3, Funny)
I can usually tell which of the two words is from a real old text. With high probability (>90%) I can correctly answer the real CAPTCHA and replace someone's OCR'd word with "penis".
I've only ever done this maybe ten or twenty times, but it could easily become an automatic part of using the system.
Re: (Score:2)
I'm sure they send the same unknown word out to multiple people, and wait for a consensus on it.
Now, if we ALL started entering "penis" for the obvious unknown words.. :)
Re: (Score:2)
Could be DEVISTATING to the poor fool who blindly follows details from a patent that describes a machine built with a random penis stuck in... that's a machine I don't even wanna think about *shudder*
Re: (Score:2)
...and this morning my spelling happens to also be devastating... grr
Re: (Score:1)
So, what's that first word here, anyways?
http://dl.kaetemi.be/kaetemi/recaptcha.png [kaetemi.be]
Re: (Score:2)
I'm going to guess "formal".
Re: (Score:2)
The thing is, they're often actually both from old texts. It's just that one of them has already been verified.
And TFA states that they do pass every word by multiple people so as to get more accuracy in what they say. I have little doubt that they're well acquainted with people who try spoofing them.
Re:Validate your data, guys! (Score:4, Interesting)
Since they use entries from several users to validate correct translations for OCR'ed text, this probably won't cause them major problems. OTOH, I wonder if they can track the accuracy of each user's inputs and, if it becomes evident that a user is either incompetent or attempting to screw with the system, take appropriate measures.
When someone's karma starts dropping into the negative range, they should let us know how well this worked out. If anyone can see their posts, that is.
Re: (Score:2)
Re: (Score:2)
They most likely give the same word to multiple users and choose the word most often entered.
Exactly. But it's possible to adapt a technique used in some AI knowledge acquisition systems wherein the outcome of such scoring is 'back propagated' to rank the relative validity of various data sources, rules, etc. If one source (a user, in this case) consistently ranks low, they get a lower weight in future solutions, until eventually they get dropped off the bottom of the list (like bad karma on /.).
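To make the weighting idea concrete, here is a minimal sketch in Python. It is not how reCAPTCHA actually does it (nothing in TFA says so), and it is not real neural-network backpropagation either, just a simple multiplicative reweighting. The function names, the reward/penalty factors, and the default weight are all made up for illustration.

```python
from collections import defaultdict


def weighted_consensus(votes, weights, default_weight=1.0):
    """Pick the word with the highest total weight.

    votes:   dict mapping user id -> word that user typed
    weights: dict mapping user id -> current trust weight (hypothetical)
    """
    totals = defaultdict(float)
    for user, word in votes.items():
        totals[word] += weights.get(user, default_weight)
    return max(totals, key=totals.get)


def update_weights(votes, winner, weights,
                   reward=1.1, penalty=0.5, default_weight=1.0):
    """Feed the outcome back: agree with the winner and your weight
    grows a little, disagree and it shrinks a lot."""
    for user, word in votes.items():
        current = weights.get(user, default_weight)
        weights[user] = current * (reward if word == winner else penalty)
    return weights


# Round 1: everyone starts with the same weight.
weights = {}
round1 = {"alice": "formal", "bob": "formal", "troll": "penis"}
winner1 = weighted_consensus(round1, weights)
update_weights(round1, winner1, weights)

# Round 2: the troll's vote now counts for less than alice's.
round2 = {"alice": "cost", "troll": "cast"}
winner2 = weighted_consensus(round2, weights)
```

A user who keeps losing these votes sees their weight halve each time, so after a few rounds their input is effectively ignored without anyone having to ban them.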
Re: (Score:2, Funny)
I can see future generations sitting down for a good read:
MOBY COCK
Chapturd One
Call me LOLOLFAG...
Re: (Score:1)
Hopefully they are only accepting a piece of text when a lot of the people give the same thing.
Re: (Score:2)
Way to make the world a better place. I'm certain your parents are very proud of your accomplishments. Perhaps you can now go find someone else's sandbox to defecate in, I suggest your own, because I certainly would rather you not be here.
I have no doubt that the reCAPTCHA folks understand that there are going to be people who find such childish behavior irresistible or entertaining, and either start discounting such answers (based on IP address) or build in filtering to discount particular words.
But, real
Re: (Score:2)
Re: (Score:1, Insightful)
Both words are from 'real old text'. You won't have any effect on the data output by putting 'penis' because more people will type the correct word.
Cool possible uses (Score:5, Interesting)
Man, I would love to see the results if this technique was used for an ontological [google.com] purpose.
Please type in the word from the choices below that most closely relates to this word: OLD
HISTORIC
LIFESPAN
Interesting shit indeed.
Re:Cool possible uses (Score:5, Funny)
Or perhaps SLASHDOT-READER:
OVERWEIGHT
GEEK
SPENDS-TOO-MUCH-TIME-USING-COMPUTERS
ALL-OF-THE-ABOVE
I fit into the category ALL-OF-THE-ABOVE. The only generalisation that is missing about slashdotters is the one about girlfriends.
Re: (Score:2)
Here's your explanation: http://www.funnyhumor.com/jokes/575.php [funnyhumor.com]
Layne
Re: (Score:1)
Haha, loser, you don't have a girlfriend like all the other
Everybody knows that all
Re: (Score:1)
Haha, loser, you don't have a girlfriend like all the other /.ers!
LOL, I can feel quite content in the fact that you are a fellow loser who doesn't either.
your sig (Score:2)
Re: (Score:2)
That's kinda the point moron.
Let me introduce you to the concept of context.
Re: (Score:2)
The linked page is self purporting. That's its purpose.
If you are smart enough to see through it, you are smart enough to discredit it. In turn that makes it an example, not a message.
The problem with trying to communicate a message of the sort I link to is that the goal is to get you to scream "BULLSHIT!"
Parts apply and others don't, but they do provoke thought. Thought allows you to discard its catalyst for new ideas, but doesn't require it.
If you take the link as truth, you miss the point.
Re: (Score:1)
What context could possibly rescue those writings from being full of hyperbole, dogma, propaganda, and meaningless blatherings?
Re: (Score:1)
Or even google:
old+historic: 66,500,000
old+lifespan: 3,480,000
pwned!
Re: (Score:3, Informative)
The point is to see what the populace thinks the relation is.
If you think google is the end all be all of absolute information then you already fail.
Re: (Score:2)
I bet it's not far from the truth! What's google but the indexing of the [online] expressions of the populace of which you speak?
Re: (Score:2)
Huh? 1908 New York Times? (Score:3, Funny)
The New York Times is already online from 1851 onwards. The concept is cool, truly, but why not CAPTCHA something not already accomplished? Oh, I know. That was, like, a metaphor, right?
Re: (Score:3, Insightful)
Re: (Score:2)
Yeah I was kinda wondering about that too, but from a different perspective... I mean: "So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times."
What the hell is the problem with people? All text is apparently on a single page from the NY Times in 1908... I mean fuck, stop the press, 'cause it's obviously all redundant shit anyways, just keep redistributing that one page across the world!
Re: (Score:2)
"Oh, I know. That was, like, a metaphor, right?"
If it was like a metaphor, does that make it a simile? No wait, this means you're using metaphors as a simile? Hmm, this could get confusing... perhaps we could make a reCAPTCHA-like technology but with old metaphors instead of letters and create a big database of abstraction...
(ps: I am my own brother who wrote that above, so I am definitely confused, can someone help please?)
DMCA Violation (Score:5, Funny)
Prior art (Score:5, Funny)
Re: (Score:3, Funny)
Re: (Score:2)
gmail captchas (Score:3)
A little OT, I know, but is anyone else having a bad time with gmail's captchas? I've tried signing up several of our customers for gmail recently and it's becoming really hard to get them right. The "audio" playback used to be the saving grace, but the last two times it sounded like ten people were talking to me all at once with no discernible key voice. (And the last time I succeeded, the string to be entered was spoken in three groups, by three different voices.)
Re: (Score:2)
Re: (Score:2)
Wow, people are really goin' to town with the offtopic mods *sigh* conversation nazis "you will follow STRICTLY the rules of conversation or your karma will be no more!!!". Such a waste; there are posts out there that need modding up more than these slightly-offtopic posts need modding down!
And yes I have started coming across more captchas that do seem just impossible to read, they certainly know how to make you feel stupid 'n illiterate. Apparently it's a new system in place, like the one in the article, but for
Image Captchas (Score:4, Informative)
Re: (Score:2)
I find a bit of simple javascript works well, and is out of sight of genuine users. If you wanna account for people who block javascript (rather than a note saying "please turn on javascript for a sec, think of the children") you can have a captcha in a span or div etc, then use javascript to remove it and replace it with a hidden field with a name<-->value pair that can be compared server side when they post the form and have the values checked. Yes, someone could look at the page source and see what
Re:Image Captchas (Score:5, Funny)
Just use an alt tag.
Re: (Score:2)
But that is multiple choice, so it is easier to make a program that can guess the result.
Re: (Score:2)
Finally logged in (Score:2, Funny)
Took me a bit to get past the new security measures, But I got a coupon 5 cents off my next shoe purchase.
reCAPTCHA and Open Source (Score:2)
Re: (Score:2)
From Why reCAPTCHA [recaptcha.net]
Word
Re:reCAPTCHA and Open Source (Score:4, Informative)
There are multiple libraries for reCAPTCHA already published, all under the MIT License. Just see http://code.google.com/p/recaptcha/ [google.com] for a list of them.
Problems With ReCaptcha (Score:1, Interesting)
Re:Problems With ReCaptcha (Score:4, Informative)
I've seen one ReCAPTCHA string that was just a distorted entirely illegible blob of ink.
Just do what I did: click the "refresh" button to the right for a new word pair and enter that one.
Known unknown? (Score:2)
Re: (Score:2)
RTFA.
You get two captchas. One is your standard, let's find out if you're human captcha, where the program knows the answer. The other is the scanned text. It also presents the same scanned text to many people, and then uses the results to figure out which one is the most likely correct result.
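For anyone who'd rather see it than RTFA, here's a rough Python sketch of that two-word scheme. It's an illustration only: the function names and the thresholds (`min_votes`, `min_share`) are my own assumptions, not numbers from the article or from reCAPTCHA itself.

```python
from collections import Counter


def passes_control(known_answer, typed):
    """The control word has a known answer; getting it right is what
    identifies the user as (probably) human."""
    return typed.strip().lower() == known_answer.strip().lower()


def consensus(readings, min_votes=3, min_share=0.6):
    """Decide what the unknown word says from many users' readings.

    Returns the agreed word, or None if there are too few readings or
    no reading is common enough yet. Thresholds are illustrative.
    """
    if len(readings) < min_votes:
        return None
    counts = Counter(r.strip().lower() for r in readings)
    word, votes = counts.most_common(1)[0]
    if votes / len(readings) >= min_share:
        return word
    return None


# Only readings from users who passed the control word get recorded.
submissions = [
    ("overlooks", "formal"),   # (typed control word, typed unknown word)
    ("overlooks", "Formal "),
    ("overlooks", "formal"),
    ("overlooks", "penis"),    # human, but not helping
    ("ovrelooks", "forrnal"),  # failed the control word: ignored
]
readings = [unknown for control, unknown in submissions
            if passes_control("overlooks", control)]
result = consensus(readings)
```

The one joker gets outvoted three to one, and the submission that failed the control word never makes it into the tally at all.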
It turns out... (Score:3, Informative)
That slashdot's Goatse troll server guy proves useful.
Note: This is not a troll. One of the guys that offers open web services to slashdot trolls is also responsible for considerable development of CAPTCHA breakage and is an eminent Debian developer. This is why I've said that we should respect his efforts despite the unpleasant side effects. The truly brilliant we should grant exceptions from social behavior because they discover things more proper folk would not.
Re: (Score:1)
No. Just no. Being brilliant is no excuse for being an asshole.
Re: (Score:3, Interesting)
How is being responsible for CAPTCHA breakage useful?
Look, just because the guy who more or less invented both trolling and automated trolling is an eminent UNIX guru and textbook author that doesn't mean his trolling on net.suicide was any less disgusting. I was appalled at the people who laughed along with Pike when he revealed that he was behind Bimmler and Shaney. This kind of thing is just not acceptable no matter who you are.
Recaptcha doesn't recapture context (Score:5, Interesting)
For example, in some fonts "cost" and "cast" might be indistinguishable in the image shown. But given the context of the sentence it's trivial for a human to tell the difference.
Suppose that they found these words on which people disagreed and had another captcha system which showed the full sentence. I'd guess they could improve their accuracy significantly in this case. Since they could prescreen for ambiguous words using the current captcha system, even if fewer people were willing to solve the "large" captcha, they would still get all the solutions they needed.
If most captchas are already cracked... (Score:2)
How much worse is this than trusting users to correctly identify the text? I ask because I honestly don't know the success rate of the automated system.
Re:RTFA (Score:3, Informative)
The authors also tested software designed to crack CAPTCHAs against images created using reCAPTCHA, and found that they failed completely. The authors ascribe this to the fact that the letters in scanned images contain distortions that are not the result of a clean mathematical transformation. User response times were also measured, but there were no significant differences between the time it took users to handle traditional systems and that required to use reCAPTCHA.
Use to hide your own email addy (Score:5, Informative)
You can also use reCaptcha for your own email address, and be more willing to provide it "publicly" since they'd have to answer the reCaptcha to get to the mailto... reCaptcha mailhide [recaptcha.net]
Interesting field (Score:2, Interesting)
My company is working on digitizing a large volume of old text (19th century government documents). There are a number of problems unique to old text:
- OCR breaks down due to archaic letter shapes, smudging, letter damage and paper deterioration.
- we evaluated OCR versus having the entire text retyped by Indians, and ended up going with the Indians. The only way to get sufficient accuracy (>99%) was to have everything done twice and do a comparison.
- Even then, the typed text has to be checked using both
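The "done twice and compared" step can be sketched in a few lines of Python using the standard library's difflib. This is only an illustration of the general double-keying idea, not our actual tooling; the function name and the word-level comparison are assumptions.

```python
import difflib


def disagreements(first_pass, second_pass):
    """Compare two independent transcriptions word by word and return
    the spans where they differ, as (first, second) pairs. Anything in
    this list needs a human to look at the original scan."""
    a = first_pass.split()
    b = second_pass.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


typist_one = "the cost of the land"
typist_two = "the cast of the land"
flagged = disagreements(typist_one, typist_two)
```

The point of keying everything twice is that two typists rarely make the same mistake in the same place, so only the flagged spans have to be checked against the page image.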
Yeah... (Score:1)
Re: (Score:1)
Mechanical Turk (Score:1)
Just to try it out I set up a mechanical turk [spy-hill.net] using reCAPTCHA. So if you like the idea you can keep at it, instead of just solving one of them once. It can be a bit addicting.
Re: (Score:1)
But, Where is the data??? (Score:1)
Re: (Score:3, Funny)
The following security test allows us to validate you are a human and not an automated script.
please type the following two words in the text box below
you moron
____________ _____________
Re: (Score:2)
The software presents one optically unreadable word and one "control" CAPTCHA word. Getting the control word right identifies the user as a human, and the program records his or her response to the unreadable word and adds it to a database.
So, there is the real CAPTCHA, and another reCAPTCHA.
Re:One Problem (Score:5, Funny)
One FUNDAMENTAL problem with this
... is that you didn't RTFA.
Re:AC for the plain old CAPTCHA (Score:5, Funny)