Google Pushes Open Source OCR
Posted by Zonk
from the google-has-taken-all-knowledge-to-be-its-provice dept.
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
Sign of times to come? (Score:3, Interesting)
Finally... (Score:3, Interesting)
searchable pdfs (Score:5, Interesting)
(Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)
Perhaps this library could be used to build such an application if none exists...
Language? (Score:5, Interesting)
Re:The goal of the project (Score:4, Interesting)
among other things, sure, but it's got to be a high priority for google.
Google knows darned well that there are tons of patents around OCR, so they're not going to roll their own internally. Instead, they'll open source the project and make as much noise about enhancing the state of the art through collaboration as possible. Then, when they get sued (and they will), they can bring this case front-and-center in the debate surrounding patent reform, citing this as the textbook example of how the promotion of the sciences and useful arts (as specified by the Constitution) is hobbled by current patent law surrounding software.
I could be wrong, but they'd be stupid to think that high-profile, open source OCR software won't be challenged by those who hold the patents....
Comics (Score:3, Interesting)
I would love that!
Re:Small price if it helps email spam. (Score:5, Interesting)
Captchas are by far the better solution.
The problem is that, long term, they will be cracked. I imagine the ultimate solution will be to allow only users with email addresses from "approved" major ISPs.
Re:Well... (Score:3, Interesting)
The CAPTCHA solution (Score:4, Interesting)
To defeat *this*, you would need someone with a greater command of the English language than simple recognition of characters, or very advanced image recognition software. I wouldn't worry about the software anytime soon if you chose images carefully.
Do this for term papers to detect plagiarists? (Score:3, Interesting)
One awesome application of this: I teach university courses that require term papers. If I could scan and upload the term papers I receive and Google could OCR them and tell me whether they're plagiarized (and of course Google would know; they know all!), I'd be prepared to pay them a bit of money for this. Or, more accurately, my university would be prepared to pay them a decent sum of money on my behalf. Then, they could keep the data from the term papers for the future, to make sure that nobody turns in that same paper in a later semester. Google not only gets money for this, but a whole lot of data to crawl through. Who knows what they would learn if a curious goog starts cleverly mining that data? If they do this, I would really love to work for them and use my 20% "downtime" to code a sentence structure analyzer that could predict a grade based just on syntactic features of the writing. In order to get more data, Google might even offer the OCR + plagiarism detection for free if the instructor agrees to use a Google grading and feedback system, so that Google could correlate each essay with a grade and an explanation of the grade. After tens of thousands of examples, Google might learn how to assign fairly accurate grades on its own (machine agrees with human to almost the same degree that humans agree with each other about what grade is deserved), and after that, who knows, Google might learn how to write B- term papers without any human input!
BTW, I am aware of plagiarism.org and their plagiarism-detection service which works like the thing that I want Google to do. Of course, if Google enters this market, they will crush all competition immediately, and plausibly, they'll do a better job because their database is just bigger. Also, Google could charge less, because a part of the payment will be access to the data itself. In fact, Google is already looking like it will accept information as payment for many of its services! And why not?
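For what it's worth, the matching step (compare an OCR'd paper against an archive of earlier ones) can be sketched in a few lines. This is only a toy illustration under my own assumptions, not how Google or plagiarism.org actually do it: the function names (shingles, jaccard, most_similar), the choice of word 3-grams, and the dict-based archive are all hypothetical, and it assumes the OCR text is already clean.

```python
import re

def shingles(text, n=3):
    """Return the set of lowercase word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(submission, archive, n=3):
    """Find the archived paper that shares the most n-grams with submission.

    archive is a hypothetical dict mapping a paper name to its text.
    Returns (name, score), or (None, 0.0) if nothing overlaps at all.
    """
    sub = shingles(submission, n)
    best = (None, 0.0)
    for name, text in archive.items():
        score = jaccard(sub, shingles(text, n))
        if score > best[1]:
            best = (name, score)
    return best
```

You would call most_similar(ocr_text, archive) and flag anything above some threshold for a human to look at. A real service would need an index instead of a linear scan, plus some tolerance for OCR errors and paraphrasing, which this sketch does not attempt.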