Yahoo Competes with Google in Book Scanning
UltimaGuy writes "A consortium backed by Yahoo has launched an ambitious effort to digitize classic books and technical papers and make them freely available on the Web. The company is partnering with the newly formed Open Content Alliance, which aims to offer PDF documents of books to the public at no charge. Consumers will be able to search the contents of the Open Content Alliance's database and download the entire content of any work, such as a scanned copy of a book."
RIAA Problems Solved (Score:5, Funny)
Re:RIAA Problems Solved (Score:2)
Rus
Re:RIAA Problems Solved (Score:2)
Yahoo searches for Creative Commons (Score:2)
Re:RIAA Problems Solved (Score:2)
In PDF format, no less!
Re:RIAA Problems Solved (Score:2)
More expensive books? (Score:3, Interesting)
Will Yahoo scan it like they have yahoo.com? (Score:5, Funny)
Re:Will Yahoo scan it like they have yahoo.com? (Score:2)
no mention of project gutenberg (Score:4, Insightful)
Right you are! See TEI. (Score:3, Interesting)
Luckily, someone's decided to do something about it. See PGTEI [gutenberg.org], a very verbose and flexible method for marking up literary works. The full TEI spec [wikipedia.org] is gargantuan, so PGTEI is actually a dialect of a subset called TEI Lite. It's an XML markup scheme which has output filters (it uses XSLT, it seems) for plain
What a concept. (Score:5, Informative)
Re:Different than Gutenberg (Score:2)
Back in the early 90's, that was called the World Wide Web (and search engines). Which puts Yahoo... well... where they began.
Gutenberg is more than book-scanning. (Score:2)
What do these guys know... (Score:5, Interesting)
It seems to me that they're throwing money at an unnecessary application. Does Yahoo know something that we don't? I'd venture that they're starting with PD books to shake the bugs out of their platform so the app works well in round 2.
Round 2 (current commercial books) won't occur without a massive copyright law change or the support of the Authors Guild.
Hmm.
Re:What do these guys know... (Score:2)
Well, they know it's all about content. As advertisement-driven sites, they have to offer content and experiences that will attract people to their portal, i.e. search engine, e-mail, clubs, blogs, etc.
Re:What do these guys know... (Score:2)
Project Gutenberg (Score:5, Informative)
http://www.gutenberg.org/ [gutenberg.org]
Re:Project Gutenberg (Score:5, Interesting)
Re:Project Gutenberg (Score:5, Informative)
http://www.gutenberg.org/etext/16713 [gutenberg.org] The zipped HTML version does carry all the original drawings.
Awesome, indeed! (Score:3, Interesting)
But they don't just have HTML; see various [gutenberg.org] examples of files released with filetype "TEI", including PDF (through LaTeX), TXT (in a variety of encodings, i.e. Latin-1, US-ASCII and UTF-8) and HTML.
Re:Project Gutenberg (Score:2)
Format   Encoding     Compression   Size
HTML     iso-8859-1   none          1.27 MB
HTML     iso-8859-1   zip           5.95 MB
Re:Project Gutenberg (Score:2, Funny)
best format? (Score:3, Interesting)
Different scope. (Score:2)
Re:Project Gutenberg (Score:2, Troll)
Project Gutenberg is great and all, but there's something to be said for some effort made at presentation. Sometimes italics are a good thing.
Re:Project Gutenberg (Score:4, Interesting)
It's not a great solution, but emphasis _is_ preserved in the etexts, just like that. Or occasionally like THIS
Also, the fact that they are plain text, with no markup, formatting, binary code, whatever in them means that they'll always be accessible to anyone, regardless of software or platform. And that's a good thing, too!
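The underscore convention mentioned here can be turned back into real markup mechanically. A minimal sketch in Python, assuming emphasis is always written _like this_ and never spans a line break; the function name is made up for illustration and is not part of any Project Gutenberg tool:

```python
import re

# Assumption: emphasis is marked _like this_ on a single line.
# The lookarounds keep underscores inside words (snake_case) untouched.
EMPHASIS = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)")


def underscores_to_html(text: str) -> str:
    """Wrap _emphasised_ spans in <em> tags (hypothetical helper)."""
    return EMPHASIS.sub(r"<em>\1</em>", text)


print(underscores_to_html("emphasis _is_ preserved in the etexts"))
```

Etexts that mark emphasis in CAPITALS instead would need a separate, much less reliable rule, since capitals are also used for headings and acronyms.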
File format issues (Score:2)
I know about the problems that old file formats can cause. However, I doubt that formats like PDF or JPG will ever get "lost". There's just too much information stored in them, and various free libraries available with source code which read and write them.
And if I'm wrong I won't live to see it.
Re:File format issues (Score:2)
My point was that since the emphasis is included in the file, you could always convert it to a nicely formatted PDF if you wanted to. In fact, I used to do almost exactly that a while back - I wrote a Perl script to convert etexts to RTF and Peanut Markup Language, and it worked pretty nicely. Keeping things at the lowest common denominator level isn't always a bad thing...
Perso
Re:Project Gutenberg (Score:2)
HTML would accomplish the same thing. It's a public standard, implementable by anyone on any platform, and convertible to plain text by a simple regex substitution. You're no more likely to find someone who can't read an HTML file than someone who can't read an ASCII text file.
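For what it's worth, the "simple regex substitution" looks roughly like the sketch below. It is a naive illustration, not a general HTML parser: it assumes simple, well-behaved markup like a typical etext page, and it will mangle script/style blocks, comments, or attribute values that contain ">". The function name is hypothetical.

```python
import html
import re

# Naive assumption: every tag is a "<...>" run with no ">" inside it.
TAG = re.compile(r"<[^>]+>")


def html_to_text(markup: str) -> str:
    """Drop tags, then decode entities such as &amp; (hypothetical helper)."""
    return html.unescape(TAG.sub("", markup))


print(html_to_text("<p>Sometimes <i>italics</i> are a good thing.</p>"))
```

Note that this throws the emphasis away entirely, which is the trade-off being argued about in this thread.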
Re:Project Gutenberg (Score:2)
I agree, personally. However, you could also argue that _this_ sort of emphasis is convertible to html with a simple regex substitution - my point was simply that the texts haven't lost any information. Ultimately, it doesn't really
Re:Project Gutenberg (Michael Hart essay) (Score:3, Informative)
the founder of Project Gutenberg, and inventor of eBooks.
-- Greg
Yet another consortium of multi-billion dollar institutions
has thrown its hat into the eBook/eLibrary ring today, just
9 months before the 35th Anniversary of Project Gutenberg's
placement on the Internet of the first eLibrary element, on
July 4th, 1971.
Last December 14th Google used a multi-million dollar blitz
of television, radio and print media to announce the Google
Print
Whew! (Score:5, Interesting)
The opt-in rather than opt-out strategy is really what Google probably should have done, but it'll be interesting to see who comes out as a winner, Yahoo or Google, in all of this.
But will they digitize PD works from after 1922? (Score:5, Informative)
Re:But will they digitize PD works from after 1922 (Score:2)
Re:But will they digitize PD works from after 1922 (Score:2)
There are some exceptions to this. Perhaps the best known is Peter Pan, for which the UK has granted a perpetual copyright in favour of the Great Ormond Street Hospital.
Re:But will they digitize PD works from after 1922 (Score:3, Informative)
When U.S. works pass into the Public Domain [unc.edu] is a good summary of the U.S. issues.
Me, I just want 14+14 back.
You're in luck! (Score:2)
Actually, if no one renewed the copyright (renewal became automatic for works published in 1964 or later), it may be public domain. Read the new and improved Rule 6 HOWTO [pglaf.org] that the fine folks at Project Gutenberg have put together. You can put together a reasonable case that copyright was not renewed, and heck, maybe you could get PG to pick up the book.
Or you could move to Canada and wait until January 1, 2013, when the author's work will enter the pub
Not really an up-stage (Score:4, Informative)
Actually this won't "upstage" Google in any way.
FTA:
all the content will be made available so it can be indexed by all the other major search engines, including Google's
Yahoo is just going to scan, scan and scan. We all already prefer Google's indexing and searching and cleaner interfaces, so the only thing Yahoo! will accomplish by this is to help Google Print along, shielding it from all (other) copyright lawsuits. Once the stuff is online, we all know that Google-bots will be all over it "like a fly on a pile of very seductive manure (Zapp)"
Excellent.
I just hope publishers realise that in this case neither Google nor Yahoo is trying to be their best friend.
Re:Not really an up-stage (Score:2)
Internet powerhouse Yahoo Inc. is setting out to build a vast online library of copyrighted books that pleases publishers -- something that rival Google Inc. hasn't been able to achieve.
The Open Content Alliance, a project that Yahoo is backing with several other partners
Re:Not really an up-stage (Score:2)
Not necessarily...you are going to see a Google ad related to your search before you see a Yahoo one related to your search. If you didn't care about ads the first time (at Google), why would you when Yahoo hits you with them again? I think that probably Google benefits from someone finding something through them, and Yahoo's benefit is much reduced.
Re:Not really an up-stage (Score:2)
Re:Not really an up-stage (Score:2)
What about China? (Score:4, Interesting)
Re:What about China? (Score:2)
The difference between Google and Yahoo's effort (Score:5, Insightful)
The OCA likely won't be sued by the Authors Guild the way Google was. For searching material, however, Google will likely be better, since its search will likely include a massive plethora of copyrighted material, legal or not. Also, it seems that Google themselves will be allowed to use all the material from the OCA in their project as well.
Companies should Get Original (Score:2, Insightful)
NOT competing (Score:5, Informative)
Apples and Oranges! This is not Google Print! (Score:5, Informative)
For example, searching "Zoroastrianism" would return a list of book titles on the subject, and links to purchase the books in question. You CANNOT download the content of the book!
The OCA (The group Yahoo just joined) is an opt-in, full content hosting project.
Searching "Zoroastrianism" would return a (much smaller) list of books, with the *full* content of the book available for download with the explicit consent of the publisher/author!
Re:Apples and Oranges! This is not Google Print! (Score:2)
What will library books in Google look like?
If you are in the United States and you search for Books and Culture by Hamilton Wright Mabie, for instance, you'll be able to page through as much of it as you like, because its 1896 copyright means it's now in the public domain in the United States. These public domain books look very similar to publisher-submitted books exce
Sad thing about Yahoo though (Score:3, Interesting)
Re:Sad thing about Yahoo though (Score:2)
Annoying (Score:2)
Re:Annoying (Score:3, Insightful)
Should we turn to you to tell us which provider of each major online activity is the one we should all use? Even if the differences are incremental and subtle, I'm glad when I get to choose between Yahoo's and Google's take on a particular app/service. I'm also glad that Audi and Toyota and GM and Honda all have different ideas on cars... even though someone else built one once already. Come on - not every service offered is
Re:Annoying (Score:2)
As an example of my point, two image search engines require double the effort of one, but only provide incremental benefit to the user. Instead of copying AltaVista's image search (which I still think is better), Google could have implemented something entirely new.
Re:Annoying (Score:4, Informative)
Have you seen Google Earth?
How about the disaster wiki that went together in about 20 minutes, where people were posting status reports of New Orleans properties?
I think you're damning with faint praise. Google, at least, consistently builds superb offerings, and the price is right. Not quite sure what you're grousing about...
Yikes, How long ... (Score:2)
Re:Yikes, How long ... (Score:2)
http://www.gutenberg.org/etext/2600 [gutenberg.org]
University of Calif: Yahoo OK, Gutenberg banned (Score:5, Interesting)
I hate to see a University pander to commercial interests, while at the same time, welcome commercial interests such as Yahoo. Money talks, and I'm sure UC is being paid a lot, but libraries are supposed to be public resources too, not exclusive profit-centers :-(.
Re:University of Calif: Yahoo OK, Gutenberg banne (Score:2)
Erosion of Public Domain--not just Disney and RIAA (Score:3, Informative)
The library can require a legal agreement to view or scan the book, and that is where a lawsuit can occur. Of course, the legal agreement doesn't apply to 3rd parties that haven't signed. It's another example of the erosion of the public domain--it's not just Disney an
Re:Erosion of Public Domain--not just Disney and R (Score:2)
Dumb question... (Score:2)
Physical owner of PD book controls its use (Score:2)
Re:University of Calif: Yahoo OK, Gutenberg banne (Score:3, Interesting)
do you have a source for this? do you mean that a UC library tried to stop someone from checking out books and scanning them? or do you mean that they didn't allow the gutenberg folks to setup a scanning shop inside a library? there's a huge difference between those two.
i work at a UC library, and i've certainly never heard of any policies about project
University of California locks away public domain (Score:3, Interesting)
would not surprise me to learn that a campus counsel or some such wouldn't let a library give away rights to content that UC held the rights to (like a library's special collections
Re:University of California locks away public doma (Score:2)
the UCSD policy you cite says:
which sounds to me like a non-commercial project like gutenberg would probably not have to pay the access fees. the other UCSD policy mostly talks about limiting duplication because it stres
Re:University of Calif: Yahoo OK, Gutenberg banne (Score:2)
University library != public library.
Reading Between the Lines (Score:2, Redundant)
Reading between the lines of this proposal, we seem to have another print.google.com, except it will not index the huge number of works whose copyright holders do not "opt in" to the program. The advantage to this is that it may make some copyright holders feel better about the whole thing and, hopefully, submit entire works to be viewed by the public. It is also possible that Yahoo is worried about the legal issues and wants to wait and see how Google weathers any legal challenges.
From a purely technical pe
PDF?! yuck (Score:2, Insightful)
This goes along with the concept that for an electronic format, I do NOT need a sentence (or even worse, hyphenated word) broken up by two inches of top and bottom margin filled with page numbers, miscellaneous watermarks, repetitive titles, etc.
PS. This being flamebait does not make it false.
Re:PDF?! yuck (Score:5, Informative)
No. I just set it to Continuous. See those four icons in the lower right corner? (assuming you've got a recent version) Play with those. You want the second button from the left
"This goes along with the concept that for an electronic format, I do NOT need a sentence (or even worse, hyphenated word) broken up by two inches of top and bottom margin filled with page numbers, miscellaneous watermarks, repetitive titles, etc."
Well, the whole purpose of PDF is to "preserve the look and integrity of your original documents
Bookripper on its way? (Score:5, Interesting)
Soon after Google Mail was introduced, somebody created a SourceForge project that lets you use Google Mail as a database. How long until somebody releases a "Bookripper" app that assembles a whole book from search extracts? As I understand it Google displays two pages at a time (or wait, that's Amazon, but I bet they're similar). All you would need to know is a quote from a book's first page as a seed, and you should be able to grab the whole book by doing a series of searches using text from the second page returned by each search. The trick would be to knit the pieces together and eliminate the overlapping text. Seems almost trivial. Another possibility would be to search for random words and look for overlaps between the results, assembling them like a linear jigsaw puzzle until there are no gaps.
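The "knit the pieces together and eliminate the overlapping text" step is the only real algorithm in there, and on plain strings it amounts to suffix/prefix matching. A rough sketch, assuming the excerpts are already in reading order and each one shares at least a few characters with the one before it; the function names and the `min_len` threshold are invented for illustration, and nothing here talks to any search service:

```python
def overlap(left: str, right: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def stitch(fragments: list[str], min_len: int = 10) -> str:
    """Join fragments that are already in reading order, dropping repeated text."""
    if not fragments:
        return ""
    text = fragments[0]
    for frag in fragments[1:]:
        # If no overlap of at least min_len is found, the fragment is appended
        # as-is, which silently leaves a gap in the result.
        text += frag[overlap(text, frag, min_len):]
    return text


pieces = ["the quick brown fox", "brown fox jumps over", "jumps over the lazy dog"]
print(stitch(pieces, min_len=5))
```

The random-words variant (the "linear jigsaw puzzle") is harder: the fragments arrive unordered, so every pair has to be compared, and short or repeated phrases produce false matches.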
Re:Bookripper on its way? (Score:3, Informative)
Re:Bookripper on its way? (Score:2)
Re:Bookripper on its way? (Score:3, Informative)
I'm already logged in. Why are you telling me the page is unavailable?
As part of our efforts to protect a book's copyright, a set of pages in every in-copyright book will be unavailable to all users.
http://print.google.com/googleprint/help.html#pag e limit [google.com]
Dan East
"Do no Evil" done right (Score:5, Insightful)
Now this is a right step towards making book contents searchable online. I would hate to see one company like Google copying and caching all books in its massive cluster of servers. I know that the Google kool-aid that "we are about general good" is running deep in the veins of slashdot types.
Since when was scanning books from libraries and making them available to the public for a profit considered "fair use"? This kind of stuff is done by pirates. Go to the major cities in China and India and you will see piles of copied books in the streets, all sold for 1/10th the original price without giving anything back to the authors. The pirates can say that they are doing a favor to the authors by driving them out of obscurity.
The message the alliance is sending out to the authors is
Re:"Do no Evil" done right (Score:2, Insightful)
Compare this to what Google is telling the authors
* we will show excerpts of your book, so if a researcher is researching on a topic he can find what you have written about a topic without ever having to buy your book, too bad, heh heh, write a fiction book dude
Except that Google only shows 2-3 sentences of books that are under copyright. I've never found a researcher that can write on a topic by only reading 2 sentences. It's only posters on /. that can claim expertise on a topic without actually
Re:"Do no Evil" done right (Score:3, Informative)
It's not. You are mischaracterizing Google's system. The problem with your claim is that Google's system doesn't make the book available to users to download, it is only a search method that points to the relevant books and provides short excerpts like their search engine does. Google won't provide the book or even whole page without the copyright owner's permission. My impressi
Re:"Do no Evil" done right (Score:2)
1) Making money is not inherently evil. Note that Google's scheme will also make money for authors. Google's scheme takes nothing from authors at all.
2) The click on a link also only brings 2-3 sentences (not pages, Sparky...) of text.
3) The virtue of libraries is not that they pay for books, it is that they make as much information as possible available to as many people as possible.
4) See 2.
When the copyright holders start remembering that the purp
Re:"Do no Evil" done right (Score:2)
It'd basically be a faster interlibrary loan system.
Re:"Do no Evil" done right (Score:2)
So do book stores.
2. The sale of a book brings author money. The click on a link without sale only brings Google money.
But what you first complained was:
we will take sale commissions from amazon, buy.com, bn.com, etc. without sharing anything with you
Sales commissions are different from link clicks.
4. 2-3 pages are sometimes enough to get an idea. A researcher looks at an index of a book and then reads the pages based on keyword. G
Re:"Do no Evil" done right (Score:3, Insightful)
How disingenuous. Google Print shows only a snippet of the text and tells you how to buy the book if it seems like what you need. Not pages, not paragraphs - a couple of sentences. In fact, Google Print instantly returns pretty much what you'd get if you hired a researcher to go find X number of books with such and such text and the researcher prepared a paper with a short quote from eac
Re:"Do no Evil" done right (Score:3, Informative)
Since when is Google doing this? As others have pointed out, Google provides a portion of the work to give the search context - 3 pages. In another post [slashdot.org], you claim that 3 pages is enough information to invalidate the sale of a book. If this is the case, I would have to seriously question the value of your work. Either that - or take a serious look at public libraries, private loaning
Re:"Do no Evil" done right (Score:2)
Interesting. Except for the cut-rate pricing, this is how the recording industry has been operating for a century.
This is huge. IA beat Google and Yahoo to this... (Score:4, Insightful)
I've read through the first few posts, and people really don't have a clue about what this is all about. "Open Content Alliance"... It means what it says. Open f'ing content. Let there be content available to the masses... Is it more important that I can get a snippet from some copyrighted text, or that millions of children can read Alice in Wonderland with all its wonderful illustrations?
This is beyond PDF or anything like that. Some people want PDF, so Adobe will make them. Some people want decent OCR versions, perhaps to go into Distributed Proofreaders or into someone's text-only PDA. It's ALL possible. This is NOT an exclusive club, it's an INCLUSIVE community that is dedicated to Open f'ing Content.
Why don't you people get it? By allowing people to have full texts of some of humanity's greatest works we are doing more than a few snippets of the latest Ken Follett novel... a lot more.
It's bigger than Yahoo or Google. Yahoo is NOT an also-ran.... The Internet Archive has been scanning books and hosting Million Book Project texts as well as Project Gutenberg texts for a long time... long before Yahoo or even Google were in the picture. Ignorant comments made here suggest somehow Yahoo is following.
I say Yahoo is leading by embracing a project that by definition is bigger than themselves. Good for them.
New and Radical (Score:4, Funny)
More like... (Score:2)
A DRM-free e-Ink e-book reader on the horizon (Score:2)
Although I don't think it's on sale, it is the Holy EBook Reader Grail we've been seeking for ten years.
If we're gonna download ebooks, we should have a reader to read them with, no?
Re:Why PDF? (Score:5, Informative)
The fact that it's an open, documented [adobe.com] format?
Adobe has made their money the old-fashioned way, by making tools that work well, rather than by locking people into a format. GhostScript, among others, will read those PDF's with or without Adobe.
Re:Why PDF? (Score:2)
Re:Why PDF? (Score:2)
You're right, sorta. The djvu [freshmeat.net] format is better than PDF for scanned books in most respects. Looks better, compresses better (and compresses by default), decompresses + renders faster while using less memory, more easily transformed to/from other formats due to availability of high-quality open source and free tools, etc. The Internet Archive's books collection has several books archived in djvu format.
The downside is that most users do not have a djvu reader installed on their computers, and even thoug
Re:Dupe (Score:4, Funny)
If Google rose to the competition (Score:2)
If there was ever anything we need competition in, it is search engines. Whether Project Gutenberg needed any competition is another question.
I don't see a lot of similarity between this project and the one Google is doing. Open versus proprietary. Free (free as in speech) information versus non-free information.
In the case of other search engines Google has put out of business (Altavista, although the web site still exists, no longer exists as the more-advanced search engine it was using the facilitie
Oh, it's not quite the same as PG. (Score:2)
As a fan of Project Gutenberg, I look forward to more page images being made available, since it means more high-qu
Re:its to see... (Score:4, Insightful)
I just wonder how Yahoo! will make $$$ off this very small market of public domain works, or, if they DO get repro rights to other books, what the price model is to download them. Or will you just see advertisements in your e-books? The authors are not going to give up their $$$, nor is Yahoo, so somebody is going to have to pay for this content.
PDF Isn't Proprietary (Score:2)
-everphilski-
Re:PDF Isn't Proprietary (Score:3)
Ever hear of a printer? (Score:2)
But wait, there's more! There are even ones that will print on both sides of the paper, and will automatically print two pages onto one side! So you can get 4 pages onto one A4 sheet, thus having text about the same size as a paperback! Put a couple of binding clips on one side and you have an instant book.
More seriously though... Besides the fact that it is both cheaper and nearly insta
Re:i have heard of these "printer" inventions, yes (Score:3, Interesting)
However, what I described does not require any folding and binding takes all of about 10 seconds. I've done this more than a few times and it does work out well.
I have a Brother laser printer that cost about US$300. I bought this printer for other reasons, but it is a great book printer too. (Has a duplexer, supports both PCL6 and PS3, built-in standard 10/100 LAN port. Basically it wi
Condescending? (Score:2)
I'm not sure where you are getting the $3 books from unless they are used, "stripped", review copies, or unauthorized print runs. In any of those cases the author is not getting a cut. "Lonesome Dove" is $7.99 on Amazon + shipping.
I support authors as well -- I certainly buy (more than) my share of books. I print some too though, mostly because of the need to get the information immediately. If the book is particularly good and I