An anonymous reader writes to mention the Washington Post is reporting that the Open Content Alliance is taking the latest shot at Google's book scanning program. Complaining that having all of the books under the "control" of one corporation wouldn't be open enough, the New York-based foundation is planning on announcing a $1 million grant to the Internet Archive to achieve the same end. From the article: "A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity, even if it's a company like Google that has embraced 'Don't Be Evil' as its creed. 'You are talking about the fruits of our civilization and culture. You want to keep it open and certainly don't want any company to enclose it,' said Doron Weber, program director of public understanding of science and technology for the Alfred P. Sloan Foundation."
Ideally we could set up a few hundred digital libraries that would all hold some percentage of the catalog, so that any 5 would be able to duplicate the entire catalog. That way, in the event of a catastrophe or some kind of weird global event, it would be more likely that an uncorrupted copy could be found.
I'd definitely like to see some not-for-profits get involved.
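One way to picture the "any 5 libraries can rebuild the catalog" idea is as a placement rule. This is only a sketch under stated assumptions: the function names, the rotation scheme, and the toy catalog are all made up for illustration, and it uses plain replication rather than anything cleverer. If every book is missing from at most 4 libraries, then any group of 5 must include at least one library that holds it.

```python
from itertools import combinations


def place_books(book_ids, num_libraries, quorum=5):
    """Assign each book to all but (quorum - 1) libraries.

    A book missing from at most quorum - 1 libraries is guaranteed to be
    present in every group of `quorum` libraries (pigeonhole argument).
    """
    if num_libraries < quorum:
        raise ValueError("need at least `quorum` libraries")
    skip = quorum - 1
    holdings = {lib: set() for lib in range(num_libraries)}
    for i, book in enumerate(book_ids):
        # Rotate which libraries are skipped so the load is spread evenly.
        skipped = {(i + j) % num_libraries for j in range(skip)}
        for lib in range(num_libraries):
            if lib not in skipped:
                holdings[lib].add(book)
    return holdings


def any_quorum_covers(holdings, catalog, quorum=5):
    """Brute-force check that every group of `quorum` libraries holds everything."""
    full = set(catalog)
    return all(
        set().union(*(holdings[lib] for lib in group)) == full
        for group in combinations(holdings, quorum)
    )


# Hypothetical toy catalog: 40 books spread over 8 libraries.
catalog = [f"book-{n}" for n in range(40)]
holdings = place_books(catalog, num_libraries=8)
print(any_quorum_covers(holdings, catalog))            # any 5 libraries suffice
print(any_quorum_covers(holdings, catalog, quorum=4))  # 4 are not always enough
```

The catch with plain replication is storage: with a few hundred libraries, each one ends up holding nearly the whole catalog. A k-of-n erasure code such as Reed-Solomon would let each site store roughly a fifth of the data while keeping the "any 5" property, at the cost of a decoding step.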
That way, in the event of a catastrophe or some kind of weird global event, it would be more likely that an uncorrupted copy could be found.
How do you plan to read it once you find it?
10 year disruption -- content formats have moved on; readers are scarce
100 year disruption -- hard drives, DVDs decay to unreadability
1000 year disruption -- even paper decays, unless specifically preserved
>1000 year disruption -- even if it's chiseled into a stone tablet, the language might be extinct
Yeah, that's why we don't have any idea what anybody [mit.edu] might have said [georgetown.edu] (or meant [lone-star.net]) more than a thousand [ancienttexts.org] years ago.
10 year disruption -- content formats have moved on; readers are scarce
I've been using computers for well more than 10 years, and ASCII is still just as readable as ever.
Mark-up languages like HTML, XML, or RTF may die off eventually (several hundred years at least), but you can always strip the markup (either with code, or mentally by ignoring it). Plus, with the formats being so simple, and book layout being so obvious, it should take 5 minutes to write a new parser for any of them.
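To give a feel for the "strip the markup" claim, here is a minimal sketch using Python's standard `html.parser` module. The class name, the list of block-level tags, and the sample text are assumptions picked for the example; a real book file would need more care (tables, footnotes, scripts, and so on).

```python
from html.parser import HTMLParser

# Tags after which a break is inserted so adjacent blocks do not run together.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "div", "li", "br"}


class TextOnly(HTMLParser):
    """Collect the text content of a document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")


def strip_markup(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())


sample = "<html><body><h1>Chapter 1</h1><p>Call me <i>Ishmael</i>.</p></body></html>"
print(strip_markup(sample))
```

It is not quite five minutes of work, but it is close, and the same shape of code would work for any tag-based format.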
100 year disruption -- hard drives, DVDs decay to unreadability
Both of the above would be unreadable by the standard pick-up mechanism, but reading them manually, bit by bit, with something like an electron microscope should be possible for many, many more years after that. Just as technology has made it possible to read previously erased text on paper, so too will it become easier, in the future, to read physically decaying digital media.
>1000 year disruption -- even if it's chiseled into a stone tablet, the language might be extinct
It takes many thousands of years for even uncommon languages to disappear. And if they were even remotely similar to our own, they can be deciphered without any advanced knowledge. So, I'd be worried about the long-term chances of a complex language like Chinese to be preserved, but anything with Latin roots, that uses a small alphabet should do fine.
Besides that, you can ensure the language survives by having translations in multiple languages, side by side. If any one of them is understood in the distant future, it can be used to learn all the rest. See: the Rosetta Stone.
Not to mention that the whole "decaying medium" argument is ridiculous. If a hard drive fails, replace it. If you get something better than hard drives, copy it. It's not like big servers only keep the information in one specific place. There are usually copies.
It takes many thousands of years for even uncommon languages to disappear. And if they were even remotely similar to our own, they can be deciphered without any advanced knowledge. So, I'd be worried about the long-term chances of a complex language like Chinese to be preserved, but anything with Latin roots, that uses a small alphabet should do fine.
A thousand years for a language to disappear? All it takes is a generation who doesn't speak it and it might as well be considered gone. A language is often
100 year disruption -- hard drives, DVDs decay to unreadability
That is, if we imagine a digital archive to function like its plain-paper counterpart: with huge underground stores with shelves full of discs.
But if we're a little bit realistic we should realise that, in the current age of internet and digital information, the data doesn't have to remain fixed on a specific medium. The ability to make perfect copies is basically inherent to the nature of digital data.
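The "perfect copies" point is easy to demonstrate: copy the file to the new medium, then compare checksums. A rough sketch follows, assuming SHA-256 as the checksum and using a temporary directory to stand in for the old and new media. The function names and file names are invented for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source, destination):
    """Copy a file to new media and confirm the copy is bit-identical."""
    shutil.copyfile(source, destination)
    before, after = sha256_of(source), sha256_of(destination)
    if before != after:
        raise IOError(f"copy of {source} does not match the original")
    return after


# Demo with a throwaway directory standing in for old and new media.
with tempfile.TemporaryDirectory() as workdir:
    old_copy = Path(workdir) / "old_disk.txt"
    new_copy = Path(workdir) / "new_disk.txt"
    old_copy.write_bytes(b"Call me Ishmael.")
    demo_digest = migrate(old_copy, new_copy)
    print(demo_digest)
```

Storing the digest alongside the archive also means a later generation can tell whether a copy has silently rotted before relying on it.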
Google's big mistake was to try to do both PD and copyrighted books. Regardless of the legal merits (which are complicated), it was just a stupid business decision to waste effort on doing copyrighted books in general, on an opt-out basis. The controversy about the copyrighted books has dragged the PD books down with it. Part of the fallout from the lawsuit has been that Google has done everything it could to hide from users the fact that the service even exists. The whole thing is actually an abject failure, so it doesn't make me worry that Google will somehow get too powerful. Anyway, AFAIK Google doesn't claim any IP rights on their scans of PD books, so they actually don't have any control at all -- other people can take the scans and do whatever they want with them. Google is in the advertising business, not the publishing business.
Part of the fallout from the lawsuit has been that Google has done everything it could to hide from users the fact that the service even exists.
It's on the short list "More" link on the Google search page, and results from it are brought up without special request for certain searches on the main web search engine (apparently, any with the word "book" that get hits, though I'm not certain of that).
That's hardly Google doing "everything it could to hide from users the fact that the service even exists".
It's on the short list "More" link on the Google search page...
When the service first came online, you would just do a normal Google search, and results from books would pop up, by default. When the lawsuit happened, that stopped happening, and you had to go to books.google.com to get separate results on books. They had an easy way to let millions of people use the service, just by encountering it naturally in their search results, but they got rid of that. The result is that ordinary people have no idea
Regular Google search: Search terms 'book math' [google.com]. Book results come up as a special heading after sponsored links and before regular results.
As far as I can tell, the results returned by books.google.com and google.com are disjoint sets.
They clearly aren't disjoint, but are instead overlapping (particularly, the book results returned by the main search engine are a proper subset of those that would be returned using the main book search page); I think this is typical of the way Google presents "OneBox" resul
I get three Google Books hits from the original search without using books.google.com, just off the main google engine. Now, Google search results aren't particularly consistent (refreshing the search will sometimes change the results, and frequently cause sponsored links and OneBox results to disappear, as will, IIRC, doing multiple different searches in rapid succession).
There was a period when the results from scanned books were always mixed in with web results, and then it abruptly changed.
Since they are a different kind of result, the use of OneBox is consistent with the rest of the Google interface—if you use the web search, you get web results in the main, but if there are particularly appropriate results by some more limited algorithm in one of the other databases, you also may get a handful of those in the OneBox area immediately after the sponsored links, and
Anyone else find the irony here funny? Google is on the side of keeping this a closed-circuit project, and MS is part of the alliance trying to make it open.
It's kind of sad to think that people are already worried about one corporation controlling ALL of the world's books. Let's still think about the reality of it. Google came to a handful (like 5 or so) of libraries (major ones at that), with a plan to digitize out-of-copyright books and put their content on the internet. They've got the search technology, they're trying to innovate. Now, if there were only five libraries in the entire world, yes, we could have a problem here. But in reality, there's A LOT mo
I think you're vastly overestimating the added benefit from scanning books from more libraries after the first few:
Most libraries' collections are very similar to most other libraries' collections, and the greatest overlap occurs with the books that are the most important.
This is all about PD stuff, since OCA isn't proposing to do anything still in copyright. Less ephemeral works (the kind typically preserved in library collections a century later) generally all had their copyrights renewed in the U.S., so that means we're only talking about pre-1923 materials. Since congress keeps on extending copyright terms, nothing is probably ever going to enter the public domain from 1923 on. That means we're talking about the publishing world of 1922, which was vastly smaller than today's publishing world. Amazon.com has on the order of 10^6 books. To get a feel for the size of the publishing industry in past decades, try browsing through the catalog of renewals [upenn.edu]; the number of books published was extremely small in the early 19th century.
There are many books that won't be in any library's collection, simply because they weren't considered very valuable. You could digitize a thousand libraries, and never find them. Handwriting manuals from 1893. Trashy novels. Etc. In fact, there are a lot of books from the 1930's-1950's that are now PD, because they never had their copyrights renewed, but you're not going to find them in libraries' collections, and in fact it's very unlikely that anyone will ever be interested in them.
Oxford University is one of the UK copyright libraries - it has a copy of every book published in the UK and Ireland since the 1600s - it gets them by default.
Google came to a handful (like 5 or so) of libraries (major ones at that), with a plan to digitize out-of-copyright books and put their content on the internet.
If that was all that happened, nobody would be complaining. The problem is that it wasn't only out-of-copyright books, but every book in their collection, including those clearly in copyright. What's more, they require publishers who have issues with this copyright violation to opt out, and blanket opt-outs are not accepted - the publisher has to p
One thing that bugs the heck out of me with Google is their "Oh, we will do this because we have the rights" attitude, yet if you want to use their stuff you need EXPLICIT permission. http://www.google.com/permissions/index.html [google.com]
" All of Google's trademarks, logos, web pages, screen shots, or other distinctive features ("Google Brand Features") are protected by applicable trademark, copyright, and other intellectual property laws. If you would like to use any of Google Brand Features on your website, in an advertis
Already facing a legal challenge for alleged copyright infringement, Google Inc.'s crusade to build a digital library has triggered a philosophical debate with an alternative project promising better online access to the world's books, art and historical documents.
Scanning a book is easy; it simply involves taking pictures. You can slice the spine off and take pictures of each page, or use one of the panoply of non-destructive machines to correct the page-warping effects of an open book. This is not particularly hard or expensive.
The latest tensions revolve around Google's insistence on chaining the digital content to its Internet-leading search engine and the nine major libraries that have aligned themselves with the Mountain View-based company.
Damn straight. The OCR process is the hardest part, so of course they wouldn't allow others access to the highly valuable text. They might have a million books "scanned" this year, but each page has to be OCRed. Most people don't decouple those operations and assume that after scanning the hard part is over. Say each book has 300 pages, so we're talking about running 300 million pages of text through OCR. Now you've got a real problem. How does one know if a page of a book is OCRed correctly? You can pay a human or even a large team of humans to QA the text, but even then you can only spot check here and there. A 99.99% correct OCR program will mess up on the equivalent of 150,000 pages of text a year (spread out more or less uniformly across the 300 million). Also, not all pages of books are scannable (pictures, weird fonts, weird page layouts), and then there are headaches with keeping track of related and multiple editions of a book, displaying pictures in the reader that you don't have copyright to (which I think always gets glossed over in these sorts of articles), 10-digit to 13-digit ISBNs, etc. So yes, they aren't going to allow others access to the text, because it's hard and expensive to produce: you can only automate so much if you want to ensure the accuracy of the text itself (I think Google does). If they opened the text up, what would stop competitors from simply adding the data to their search engines after the difficult part is over? Google does no evil, but they aren't stupid.
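The error estimate is simple arithmetic and easy to check. The sketch below assumes the accuracy figure is per page (the comment does not say whether it means pages or characters) and uses integer math to avoid rounding noise; the function name is made up. With the comment's inputs (1 million books, 300 pages each, 99.99%), the count comes out to 30,000 pages, and 150,000 would match an accuracy of 99.95%.

```python
def expected_bad_pages(total_pages, errors_per_10k_pages):
    """Expected number of bad pages for a given error rate per 10,000 pages."""
    return total_pages * errors_per_10k_pages // 10_000


books_per_year = 1_000_000
pages_per_book = 300
total_pages = books_per_year * pages_per_book  # 300 million pages

print(expected_bad_pages(total_pages, 1))  # 99.99% accurate: 1 error per 10,000
print(expected_bad_pages(total_pages, 5))  # 99.95% accurate: 5 errors per 10,000
```

Either way the point stands: even a very good OCR engine leaves far more bad pages than a human team could proofread by spot checks.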
Scanning a book is easy; it simply involves taking pictures. You can slice the spine off and take pictures of each page, or use one of the panoply of non-destructive machines to correct the page-warping effects of an open book. This is not particularly hard or expensive.
Only if the book is expendable. In the case of many pre-1920 books (i.e. out of copyright) any sane library wouldn't even let you push it flat against the glass of a flatbed scanner. Ideally you need a scanner that keeps the book from openin
As I understand it, Google just uses the raw OCR. It's usually good enough for searching, which is what they are interested in, and requires a lot less manpower than corrected OCR. If you want corrected OCR, you need to look at places like Project Gutenberg (and Distributed Proofreaders).
I'm kind of baffled why people are talking about starting up new projects or Open Sourcing (tm) Google's project (whatever that means...).
Project Gutenberg [gutenberg.org] is open and non-proprietary (ASCII text) and has been for quite a while.
After scanning, they use a distributed proofreading system where volunteers compare a scanned page image to the OCR text for errors. If you've got some free time, consider helping out.
You folks do realise that Google returns the books after they scan them so they'll still be in the libraries afterwards right? So how does this reduce their availability?
Most of these people focus on English-language books printed in the 19th and early 20th centuries, because (1) it's usually easy to determine copyright status, and (2) if you go earlier you get the tall "s" (ſ in UTF-8), which no OCR program today seems able to handle, so the scanning cost is increased.
Scanning with a flat-bed scanner basically wrecks the binding. So the books probably need to be rebound afterwards, or can be discarded.
There are photography setups (e.g. Phase One has one) but the resolution is too low, even with a 40 megapixel medium-format camera (yes, they are used for this). A little high-school mathematics (e.g. Nyquist) and the back of an envelope, combined with some measurements, will show that if you scan engravings at under 1200dpi, you will lose a lot of detail, and indeed, compare for example the Alice in Wonderland pictures [fromoldbooks.org] on my own site with the Project Gutenberg ones. You can read the engraver's signature on most of the ones I have. Yes, the bandwidth needed to host higher resolution images is greater (which is why I have ads, sorry). But it's worth it.
Some of these books will never be scanned again. Even for OCR, 400dpi grayscale seems a minimum for footnotes and other small text even in English.
I'd also like to see more interfaces like the Project Gutenberg Distributed Proofreaders' site, where people can submit corrections. Maybe use a wiki for the transcription??
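The resolution argument can be put on the back of an envelope. The sketch below assumes one camera frame covers one page, a 4:3 sensor, and a 6 x 9 inch page; the comment gives none of those numbers, so treat them as placeholders, and the function names are invented.

```python
import math


def camera_dpi(megapixels, page_w_in, page_h_in, aspect=(4, 3)):
    """Best-case dots per inch when a single frame covers the whole page."""
    long_ratio, short_ratio = max(aspect), min(aspect)
    total_pixels = megapixels * 1_000_000
    short_px = math.sqrt(total_pixels * short_ratio / long_ratio)
    long_px = total_pixels / short_px
    page_long = max(page_w_in, page_h_in)
    page_short = min(page_w_in, page_h_in)
    # The tighter of the two dimensions limits the usable resolution.
    return min(long_px / page_long, short_px / page_short)


def megapixels_needed(dpi, page_w_in, page_h_in):
    """Pixels required to cover the page at a target dpi, in megapixels."""
    return (dpi * page_w_in) * (dpi * page_h_in) / 1_000_000


print(round(camera_dpi(40, 6, 9)))     # effective dpi of a 40 MP frame
print(megapixels_needed(1200, 6, 9))   # megapixels needed for 1200 dpi
print(megapixels_needed(400, 6, 9))    # megapixels needed for 400 dpi
```

Under these assumptions a 40 megapixel frame lands at roughly 800 dpi on a 6 x 9 inch page, which is plenty for the 400 dpi OCR floor but short of the 1200 dpi asked for engravings. That target would take nearly 78 megapixels per page.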
by Anonymous Coward
on Wednesday December 20 2006, @07:55PM (#17320366)
A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity...
Did someone break their legs?
See that big building downtown with all the books in it?
Oh wait, get up from your desk, go outside (yes I know, it burns...), get on the bus and go downtown.
OK, now see the big building with the strange letters "LIBRARY" on the front? OK, that's the one, go inside... see all the books?
Now go up to the attendant at the desk and tell them your name and address and show a piece of photo ID. The nice person will give you a card that you can use to borrow books.
What's a book? OK, it's many pages of paper bound together, usually with glue and string. On each of these "pages" you will find ink (a dye) in the pattern of letters that form words and sentences and paragraphs.
Usually, these "books" tell a story or provide organised information.
Now go ahead, pick one out - they'll even let you take it home for a week or two so you can read it. For free!
You can browse the stacks (a colloquialism for those big shelves with books on them) which are organised according to a system known as the Dewey Decimal System. You can use a revolutionary piece of technology known as a "card catalog" to indicate the position of the title you seek on the stacks (though many libraries have this same catalog searchable from computer terminals).
It's revolutionary, I know. But there you have it, free information and entertainment, enough to last a lifetime, with a "less restrictive approach".
But unfortunately, not all of the world has access to such wonderful libraries, and specialized research is somewhat difficult, even if your city is one that is blessed with a nice public library. Boy, I loved it when I discovered sites like this [umich.edu], and this [cornell.edu], and this [uni-goettingen.de], collections to truly warm the heart of a math geek like me. Good luck finding even a tenth of the books and journals in those three collections in your local public library.
Surely we can speed up this process by simply asking the publishers to make available the original digital LaTeX or SGML files for all books printed since the late 70s, right?
Why invest hundreds of hours on scan/ocr/qa for texts which already exist in a digital format?
Just how does Google scanning a book prevent anyone else from doing the same? Does Google own the only copy? I doubt that. This seems like much ado about nothing, or an outright grab to force Google to share what they put the effort into creating in the first place. And I'll bet the sharing is expected to be Free.
1) Google may also have contractual obligations with copyright holders that prevent putting the content in an open format.
2) If point 1 can be overcome & Google could see a competitive advantage over MS's book scanning effort in opening the content then perhaps they'd try it after all...
Exactly right. All these comments about "must show ads over it" pretty much miss the point. Google's project allows you to SEARCH all the books it's scanning, and even so, it's drawn the ire of copyright holders. Imagine if they said... "Oh, yes... we're OPEN SOURCING all of our scanning results for unfettered public consumption." No judge in the world... nuff said. Open sourcing the actual methodology would not serve much purpose, although it's worthy of note that they have open sourced some OCR softwa
There is nothing sexy or secret about the methods of scanning, but they must have put an imperial frickton of money into the process...To give the fruit of that much money away would be irresponsible to their shareholders...At least until they've made their money back with it.
To give the fruit of that much money away would be irresponsible to their shareholders...At least until they've made their money back with it.
Only if you don't expect to reap the benefits of it afterwards and that giving it away might actually be required in order to reap those benefits. You know, kinda like how google gives away search engine results and email accounts.
You know, kinda like how google gives away search engine results and email accounts
Google does not give those things away for free. It exchanges them in return for subjecting you to advertising, which they in turn sell to folks who want to show you advertising.
Thus giving it away is actually required in order to reap those benefits.
Quite the opposite. If they give it away, then I can set up ePhil House o' Classic Literature and reap the benefits of that advertising in place of Google. I can show less advertising because I don't have that nasty overhead of scanning the books. Google's need is to make it available to consumers in exchange for "eyeballs" but keep it away from me. Hammer away on Google's servers and they will cut you off, I ran operations in a comp
Hear, hear! Books want to be open! I find that when books can be open, as they should, they become much more accessible to people than if they were kept closed.
Google 'Do No Evil'... is about the biggest lie perp'ed on mankind. Google is the last company aside from the obvious M$ that I would want to control anything. They are about inflated stock, and making you see ads online. Are we all that stupid that we believe Google-ganda?
Oh, do calm down... They never claimed "we do absolutely no evil whatsoever", it's more like - the founders happen to think that "evil should not be done". What's a lie about that? Also, how does inflated stock make them evil?
And how, pray, are they supposed to survive without the adverts? Never mind the fact that Google didn't actually come up with online advertising but were pretty much the first ones to run targeted, non-offensive (as in, no flashing banners, pop-ups, etc.) ads.
I'm no Google fanboy, although I happily use many of their services. But I don't think there's anything inherently wrong with them, and I find it somewhat sad to see this paranoid drivel modded up to +3 Insightful.
Oh damn! You really nailed Google there. They're all about making you see ads. Oh man, they're never going to live that tongue-lashing down. I bet their PR people are going nuts trying to figure out how to clean this mess up.
Are you angry because Google suspended the SOAP API? Or are you just a grumpy troll?
No, I don't think that they'll hold to it forever. I suspect that once the founders are gone, things will erode until that motto will go the way of the dinosaur except for its PR function. That said, based on what they're *doing* (and not what they're merely saying), they're at least making a reasonable effort to live up to an ideal, and that's a hell of a lot more than I can say for any other corp.
In other words, I'll retain some loyalty to Google so long as it shows some loyalty to us. Like I said, they'
Good! (Score:5, Insightful)
Re: (Score:3, Funny)
Preferably the technology should be RoR.
Re: (Score:3, Interesting)
Re: (Score:3, Insightful)
I can read data from ten years ago on my home computer with no problems.
If we have a 100 year disruption, well then we are probably throwing rocks at one another and rebuilding civilization.
Re: (Score:2, Informative)
Re:Good! (Score:4, Insightful)
Re: (Score:3, Informative)
Re: (Score:2)
I've been using computers for well more than 10 years, and ASCII is still just as readable as ever.
But EBCDIC is slightly harder. Besides, ASCII is only usable for a subset of human text - basically only for English. It's not really a solution.
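Both halves of that objection can be shown in a few lines. This sketch assumes the EBCDIC variant is code page 037, which Python ships as the `cp037` codec (there are many other EBCDIC code pages), and the helper names and sample strings are invented for the example.

```python
def decode_legacy(raw, codec="cp037"):
    """Decode bytes from an old encoding, given that you know which one it is."""
    return raw.decode(codec)


def fits_in_ascii(text):
    """True if every character of the text can be stored as 7-bit ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# An EBCDIC file read as if it were ASCII is gibberish...
ebcdic_bytes = "PROJECT GUTENBERG".encode("cp037")
print(ebcdic_bytes.decode("ascii", errors="replace"))
# ...but it is fully recoverable once the code page is known.
print(decode_legacy(ebcdic_bytes))

# ASCII cannot hold much beyond unaccented English; UTF-8 round-trips it all.
print(fits_in_ascii("Moby Dick"))
print(fits_in_ascii("Les Misérables"))
print("Война и мир".encode("utf-8").decode("utf-8"))
```

So EBCDIC is "slightly harder" only in the sense that somebody has to remember the mapping table. The real preservation risk is losing the note that says which encoding a file uses.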
Re: (Score:2)
Preservation of languages (Score:2)
Re: (Score:2)
As to the article I completely agree. If public libraries were undertaking this project they would have a lot more fair use wiggle room.
Decoupling of content and medium (Score:2)
That is, if we imagine a digital archive to function like its plain-paper counterpart: with huge underground stores with shelves full of discs.
But if we're a little bit realistic we should realise that, in the current age of internet and digital information, the data doesn't have to remain fixed on a specific medium. The ability to make perfect copies is basically inherent to the nature of digital data.
The problem of preservation isn't anymore
Google's goof (Score:4, Insightful)
Re:Google's goof (Score:4, Interesting)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
http://www.google.com/search?hl=en&q=origin+of+sp
Is Google also denying the existence of its Froogle service since it's listed below the 'Books' search option in 'more>>'?
Re: (Score:2)
Re: (Score:2)
funny. (Score:2, Interesting)
It's funny. Laugh.
Google's got a long way to go . . . (Score:2, Interesting)
Re:Google's got a long way to go . . . (Score:5, Informative)
Re: (Score:2, Interesting)
Re: (Score:2)
Google says one thing does another (Score:3, Interesting)
" All of Google's trademarks, logos, web pages, screen shots, or other distinctive features ("Google Brand Features") are protected by applicable trademark, copyright, and other intellectual property laws. If you would like to use any of Google Brand Features on your website, in an advertis
Scanning a book is easy... (Score:5, Insightful)
Re: (Score:2)
Only if the book is expendable. In the case of many pre-1920 books (i.e. out of copyright) any sane library wouldn't even let you push it flat against the glass of a flatbed scanner. Ideally you need a scanner that keeps the book from openin
Project Gutenberg (Score:5, Interesting)
I'm kind of baffled why people are talking about starting up new projects or Open Sourcing (tm) Google's project (whatever that means...).
Project Gutenberg [gutenberg.org] is open and non-proprietary (ASCII text) and has been for quite a while.
After scanning, they use a distributed proofreading system where volunteers compare a scanned page image to the OCR text for errors. If you've got some free time, consider helping out.
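To give a feel for what a proofreading pass produces, here is a minimal sketch that compares raw OCR output with a volunteer's corrected transcription and lists the words that differ. This is an illustration only, not Project Gutenberg's or Distributed Proofreaders' actual tooling; the function name and sample text are hypothetical.

```python
import difflib


def flag_differences(ocr_text, proofread_text):
    """Return (ocr_words, corrected_words) pairs where the two versions disagree."""
    ocr_words = ocr_text.split()
    fixed_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=fixed_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            differences.append(
                (" ".join(ocr_words[i1:i2]), " ".join(fixed_words[j1:j2]))
            )
    return differences


if __name__ == "__main__":
    # Hypothetical sample: the OCR engine misread one letter.
    ocr = "Alice was beginning to get very tired of sitting by her sistcr on the bank"
    fixed = "Alice was beginning to get very tired of sitting by her sister on the bank"
    for before, after in flag_differences(ocr, fixed):
        print(f"OCR: {before!r} -> proofread: {after!r}")
```

The same comparison could be run between two independent volunteers' versions of a page, so that only the disagreements need a third look.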
Re: (Score:2)
They focus solely on public-domain works, as opposed to fair-use of current, copyrighted works, as Google does.
the books aren't going anywhere... (Score:5, Insightful)
You folks do realise that Google returns the books after they scan them so they'll still be in the libraries afterwards right? So how does this reduce their availability?
Please do a better job, not just a bigger job (Score:3, Informative)
Scanning with a flat-bed scanner basically wrecks the binding. So the books probably need to be rebound afterwards, or can be discarded.
There are photography setups (e.g. Phase One has one) but the resolution is too low, even with a 40 megapixel medium-format camera (yes, they are used for this). A little high-school mathematics (e.g. Nyquist) and the back of an envelope, combined with some measurements, will show that if you scan engravings at under 1200dpi, you will lose a lot of detail. Compare, for example, the Alice in Wonderland pictures [fromoldbooks.org] on my own site with the Project Gutenberg ones: you can read the engraver's signature on most of the ones I have. Yes, the bandwidth needed to host higher resolution images is greater (which is why I have ads, sorry). But it's worth it.
Some of these books will never be scanned again. Even for OCR, 400dpi grayscale seems a minimum for footnotes and other small text even in English.
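The back-of-the-envelope calculation mentioned above can be sketched roughly as follows. The sensor dimensions (7300 x 5500 pixels, about 40 megapixels) and the 9 x 6 inch page are assumptions picked for illustration, not figures for any particular camera or book, and the sketch ignores lens quality and sensor colour filtering, which only make things worse.

```python
def effective_dpi(sensor_px, page_in):
    """Samples per inch when one frame covers one page.

    sensor_px: (width, height) of the sensor in pixels (hypothetical values).
    page_in:   (width, height) of the page in inches (hypothetical values).
    The sensor's long side is assumed to be lined up with the page's long side.
    """
    px_long, px_short = sorted(sensor_px, reverse=True)
    in_long, in_short = sorted(page_in, reverse=True)
    # The tighter of the two directions limits the usable resolution.
    return min(px_long / in_long, px_short / in_short)


def max_line_pairs_per_inch(dpi):
    """Nyquist limit: each line pair needs at least two samples."""
    return dpi / 2


SENSOR_PX = (7300, 5500)  # assumed ~40 megapixel sensor
PAGE_IN = (9.0, 6.0)      # assumed page size in inches
TARGET_DPI = 1200         # the figure suggested for engravings

if __name__ == "__main__":
    dpi = effective_dpi(SENSOR_PX, PAGE_IN)
    print(f"camera: about {dpi:.0f} dpi, "
          f"up to {max_line_pairs_per_inch(dpi):.0f} line pairs per inch")
    print(f"target: {TARGET_DPI} dpi, "
          f"up to {max_line_pairs_per_inch(TARGET_DPI):.0f} line pairs per inch")
```

Under these assumptions a single frame yields roughly 811 dpi on the page, well short of 1200, and shooting an open two-page spread in one frame lowers it further.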
I'd also like to see more interfaces like the Project Gutenberg Distributed Proofreaders' site where people can submit corrections. Maybe use a wiki for the transcription?
Liam
Did someone break their legs? (Score:3, Insightful)
Did someone break their legs?
See that big building downtown with all the books in it?
Oh wait, get up from your desk, go outside (yes I know, it burns...), get on the bus and go downtown.
OK, now see the big building with the strange letters "LIBRARY" on the front? OK, that's the one, go inside... see all the books?
Now go up to the attendant at the desk and tell them your name and address and show a piece of photo ID. The nice person will give you a card that you can use to borrow books.
What's a book? OK, it's many pages of paper bound together, usually with glue and string. On each of these "pages" you will find ink (a dye) in the pattern of letters that form words and sentences and paragraphs.
Usually, these "books" tell a story or provide organised information.
Now go ahead, pick one out - they'll even let you take it home for a week or two so you can read it. For free!
You can browse the stacks (a colloquialism for those big shelves with books on them), which are organised according to a system known as the Dewey Decimal System. You can use a revolutionary piece of technology known as a "card catalog" to locate the title you seek in the stacks (though many libraries have this same catalog searchable from computer terminals).
It's revolutionary, I know. But there you have it, free information and entertainment, enough to last a lifetime, with a "less restrictive approach".
Enjoy.
Re: (Score:2)
But unfortunately, not all of the world has access to such wonderful libraries, and specialized research is somewhat difficult, even if your city is one that is blessed with a nice public library. Boy, I loved it when I discovered sites like this [umich.edu], and this [cornell.edu], and this [uni-goettingen.de], collections to truly warm the heart of a math geek like me. Good luck finding even a tenth of the books and journals in those three collections in your local public library.
Enclose what? (Score:3, Insightful)
more credible (Score:2)
Digital originals available from publishers (Score:2)
Why invest hundreds of hours on scan/ocr/qa for texts which already exist in a digital format?
You could read this as... (Score:2)
Just How Does...? (Score:2)
Re: (Score:3, Insightful)
Well, the source of the code running the project wouldn't be that helpful, it's the content we're after.
And presuming you meant Google opening the content.... well I doubt it... they want to sell ads on the content after all!
Don't forget, Google, nice though they are, haven't given out code/content/etc. for any of their "crown jewels".
Re: (Score:2)
1) Google may also have contractual obligations with copyright holders that prevent putting the content in an open format.
2) If point 1 can be overcome & Google could see a competitive advantage over MS's book scanning effort in opening the content then perhaps they'd try it after all...
Re:Just Open Source It? (Score:4, Interesting)
There is nothing sexy or secret about the methods of scanning, but they must have put an imperial frickton of money into the process... To give the fruit of that much money away would be irresponsible to their shareholders... At least until they've made their money back with it.
Re: (Score:2, Interesting)
Only if you don't expect to reap the benefits of it afterwards, and giving it away might actually be required in order to reap those benefits. You know, kinda like how Google gives away search engine results and email accounts.
Re: (Score:2)
Google does not give those things away for free. It exchanges them in return for subjecting you to advertising, which they in turn sell to folks who want to show you advertising.
There's no such thing as a free lunch.
Re: (Score:3, Interesting)
Quite the opposite. If they give it away, then I can set up ePhil House o' Classic Literature and reap the benefits of that advertising in place of Google. I can show less advertising because I don't have that nasty overhead of scanning the books. Google's need is to make it available to consumers in exchange for "eyeballs" but keep it away from me. Hammer away on Google's servers and they will cut you off, I ran operations in a comp
Re: Google 'Do No Evil' ... (Score:4, Insightful)
Oh, do calm down... They never claimed "we do absolutely no evil whatsoever", it's more like - the founders happen to think that "evil should not be done". What's a lie about that? Also, how does inflated stock make them evil?
And how, pray, are they supposed to survive without the adverts? Never mind the fact that Google didn't actually come up with online advertising but were pretty much the first ones to run targeted, non-offensive (as in, no flashing banners, pop-ups, etc.) ads.
I'm no Google fanboy, although I happily use many of their services. But I don't think there's anything inherently wrong with them, and I find it somewhat sad to see this paranoid drivel modded up to +3 Insightful.
Re: (Score:2)
Are you angry because Google suspended the SOAP API? Or are you just a grumpy troll?
Maybe, but not yet. (Score:2)
That said, based on what they're *doing* (and not what they're merely saying), they're at least making a reasonable effort to live up to an ideal, and that's a hell of a lot more than I can say for any other corp.
In other words, I'll retain some loyalty to Google so long as it shows some loyalty to us. Like I said, they'