
Google Pushes Open Source OCR 212

Posted by Zonk
from the google-has-taken-all-knowledge-to-be-its-province dept.
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Tuesday April 10, 2007 @02:05PM (#18679067)
    Now that they will be able to recognize and tag our images, I wonder if Picasa will finally get increased storage. Google will be able to deliver targeted ads based on our pictures.
  • Finally... (Score:3, Interesting)

    by Searinox (833879) on Tuesday April 10, 2007 @02:10PM (#18679159) Homepage
    An OCR system that runs on Linux. I've been waiting for quite some time for something like this.
  • searchable pdfs (Score:5, Interesting)

    by radarsat1 (786772) on Tuesday April 10, 2007 @02:44PM (#18679705) Homepage
    Does anyone know of an open source utility that can convert scanned, image-based PDF files into searchable PDFs?
    (Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)

    Perhaps this library could be used to build such an application if none exists...
  • Language? (Score:5, Interesting)

    by ceeam (39911) on Tuesday April 10, 2007 @02:45PM (#18679731)
    English only I suppose?
  • by ajs (35943) <ajs@@@ajs...com> on Tuesday April 10, 2007 @03:25PM (#18680307) Homepage Journal

    The goal of the project is to stop the damn email image spammers.

    Among other things, sure, but it's got to be a high priority for Google.
    I don't buy either one. I think the goal of the project is to get sued.

    Google knows darned well that there are tons of patents around OCR, so they're not going to roll their own internally. Instead, they'll open source the project and make as much noise about enhancing the state of the art through collaboration as possible. Then, when they get sued (and they will), they can bring this case front-and-center in the debate surrounding patent reform, citing this as the textbook example of how the promotion of the sciences and useful arts (as specified by the Constitution) is hobbled by current patent law surrounding software.

    I could be wrong, but they'd be stupid to think that high-profile, open source OCR software won't be challenged by those who hold the patents....
  • Comics (Score:3, Interesting)

    by rbanffy (584143) on Tuesday April 10, 2007 @03:26PM (#18680339) Homepage Journal
    Will I be able to search my comic strips (downloaded since forever) by keyword?

    I would love that!
  • by Pxtl (151020) on Tuesday April 10, 2007 @03:39PM (#18680535) Homepage
    You've obviously never fought off a bb spammer. They don't use one or two accounts to spam one or two messages - they inundate the board from a long list of IPs. Even without spamming messages, they create hordes of accounts just for the pagerank provided by the links within their personal account pages. Plus, admin-approval-delays degrade quality for the user. It creates a huge headache all around to handle maintaining banlists and cleaning out garbage.

    Captchas are by far the better solution.

    The problem is that, long term, they will eventually be cracked. I'm imagining the ultimate solution will only be to allow users with email addresses from "approved" major ISPs.
  • Re:Well... (Score:3, Interesting)

    by drinkypoo (153816) <martin.espinoza@gmail.com> on Tuesday April 10, 2007 @04:03PM (#18680949) Homepage Journal
    All I want is a plugin for Thunderbird that will detect when a message is written in a language other than English and mark it as spam. No one ever sends me an email in anything other than English except for spam. I have no fucking idea why this has not yet been implemented. I get absolute shitloads of Russian spam.
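
    A rough sketch of the kind of check such a filter could run, in Python. It is not a Thunderbird plugin and uses no Thunderbird API; the function names and the 0.5 threshold are assumptions made up for illustration. It only looks at which script the letters belong to, so it would catch Cyrillic spam but not spam written in another Latin-script language such as Spanish or German.

```python
import unicodedata


def non_latin_fraction(text):
    """Return the fraction of alphabetic characters that are not Latin letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names start with the script, e.g. "LATIN SMALL LETTER A"
    # or "CYRILLIC SMALL LETTER PE".
    non_latin = sum(
        1 for ch in letters
        if not unicodedata.name(ch, "").startswith("LATIN")
    )
    return non_latin / len(letters)


def looks_foreign(text, threshold=0.5):
    """Flag a message whose letters are mostly non-Latin (threshold is a guess)."""
    return non_latin_fraction(text) > threshold
```

    A real filter would probably want to look at the subject line and the body separately, and to whitelist known senders before flagging anything.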
  • The CAPTCHA solution (Score:4, Interesting)

    by dj245 (732906) on Tuesday April 10, 2007 @05:01PM (#18681855) Homepage
    Look, any illiterate kid in a third world country can play type-in-the-CAPTCHA all day long. I think the solution is to put up an array of 9 or so pictures and ask the reader to click on the kitten. The other 8 would be something other than a kitten, and all the files would have random names that rotate with every view. You could also change the item being asked for to defeat simple image recognition, and have several pictures of kittens/what-have-yous.

    To defeat *this*, you would need someone with a greater command of the English language than simple recognition of characters, or very advanced image recognition software. I wouldn't worry about the software anytime soon if you chose images carefully.
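
    The scheme above could be sketched roughly like this in Python. Everything here is hypothetical: `build_challenge`, `check_click`, and the shape of `images_by_label` are names invented for the example, and it assumes the server keeps the token-to-file mapping private and serves each image under its one-time token instead of its real file name.

```python
import hmac
import random
import secrets


def build_challenge(images_by_label, grid_size=9, rng=None):
    """Build one picture-grid challenge.

    images_by_label maps a label ("kitten") to a list of image files.
    Returns (public, tiles, answer): `public` is safe to send to the browser,
    `tiles` (token -> image file) and `answer` stay on the server.
    """
    rng = rng or random.SystemRandom()
    # Rotate the item being asked for, not just the pictures.
    target_label = rng.choice(sorted(images_by_label))
    target_image = rng.choice(images_by_label[target_label])
    decoys = [
        img
        for label, imgs in images_by_label.items()
        if label != target_label
        for img in imgs
    ]
    # Raises ValueError if there are fewer decoys than grid_size - 1.
    chosen = [(target_image, True)]
    chosen += [(img, False) for img in rng.sample(decoys, grid_size - 1)]
    rng.shuffle(chosen)

    tiles = {}
    answer = None
    for img, is_target in chosen:
        token = secrets.token_hex(8)  # fresh random name on every view
        tiles[token] = img
        if is_target:
            answer = token
    public = {"prompt": f"Click on the {target_label}", "tokens": list(tiles)}
    return public, tiles, answer


def check_click(clicked_token, answer):
    """Compare the clicked token with the stored answer in constant time."""
    return hmac.compare_digest(clicked_token, answer)
```

    With one correct tile out of nine, blind guessing still succeeds about one time in nine, so a real deployment would presumably need rate limiting or several rounds on top of this.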
  • by Dr. Spork (142693) on Tuesday April 10, 2007 @08:15PM (#18683977)
    I have no doubt at all that this is coming in the future. Why? Because Google wants to see all data, analyze that data, and catalog it. That's exactly what would happen if you uploaded your scanned document to Google: Sure they would OCR it and do a good job, but they would also save the OCR'ed copy for later data mining.

    One awesome application of this: I teach university courses that require term papers. If I could scan and upload the term papers I receive and Google could OCR them and tell me whether they're plagiarized (and of course Google would know; they know all!), I'd be prepared to pay them a bit of money for this. Or, more accurately, my university would be prepared to pay them a decent sum of money on my behalf. Then, they could keep the data from the term papers for the future, to make sure that nobody turns in that same paper in a later semester. Google not only gets money for this, but a whole lot of data to crawl through. Who knows what they would learn if a curious goog starts cleverly mining that data? If they do this, I would really love to work for them and use my 20% "downtime" to code a sentence structure analyzer that could predict a grade based just on syntactic features of the writing. In order to get more data, Google might even offer the OCR + plagiarism detection for free if the instructor agrees to use a Google grading and feedback system, so that Google could correlate each essay with a grade and an explanation of the grade. After tens of thousands of examples, Google might learn how to assign fairly accurate grades on its own (machine agrees with human to almost the same degree that humans agree with each other about what grade is deserved), and after that, who knows, Google might learn how to write B- term papers without any human input!

    BTW, I am aware of plagiarism.org and their plagiarism-detection service which works like the thing that I want Google to do. Of course, if Google enters this market, they will crush all competition immediately, and plausibly, they'll do a better job because their database is just bigger. Also, Google could charge less, because a part of the payment will be access to the data itself. In fact, Google is already looking like it will accept information as payment for many of its services! And why not?
