How Anthropic Built Claude: Buy Books, Slice Spines, Scan Pages, Recycle the Remains (msn.com)
Court documents unsealed last week in a copyright lawsuit against Anthropic reveal that the AI company ran an operation called "Project Panama" to buy millions of physical books, slice off their spines, scan the pages to train its Claude chatbot, and then send the remains to recycling companies.
The company spent tens of millions of dollars on the effort and hired Tom Turvey, a Google executive who had worked on the legally contested Google Books project two decades earlier. Anthropic bought books in batches of tens of thousands from retailers including Better World Books and World of Books. A vendor document noted the company was seeking to scan between 500,000 and two million books.
Before Project Panama, Anthropic co-founder Ben Mann downloaded books from LibGen, a shadow library of pirated material, over 11 days in June 2021. He later shared a link to the Pirate Library Mirror site with colleagues, writing "this is awesome!!!" Meta employees similarly downloaded books from torrent platforms after approval from Mark Zuckerberg, court filings allege, though one engineer wrote that "torrenting from a corporate laptop doesn't feel right." Anthropic settled for $1.5 billion in August without admitting wrongdoing.
When they recycle books, they recycle people (Score:5, Insightful)
Re:When they recycle books, they recycle people (Score:5, Insightful)
While not wrong: The machine cannot deliver what they promised and people are getting fed up with it.
What the horrible people want is not exactly a perfect overlap with what is realistically possible.
We will find a new equilibrium.
Re: (Score:3)
While not wrong: The machine cannot deliver what they promised and people are getting fed up with it.
It was obvious from the beginning that it never could. And it was obvious to the people hyping it, too.
What the horrible people want is not exactly a perfect overlap with what is realistically possible.
What the horrible people want is almost never what they say they want. What they generally want is to sell a lot of stock, then bail out before the pyramid collapses and the bubble pops. They generally don't care if what they say they want is possible.
We will find a new equilibrium.
Indeed, but that new equilibrium will be the next bubble, lather, rinse, repeat.
Re:When they recycle books, they recycle people (Score:5, Informative)
I mean yes and no... While it can't replace a human in most cases, it does make the humans that use it significantly more productive.
You say that (significantly more productive) like it's a known fact. Most of the studies and survey results I've seen indicate that it's nearly a wash. Maybe it'll save a programmer 10hr/week up front, but require 8hr/week to fix issues it introduces, and that's not getting into long term maintainability. See for yourself: https://gprivate.com/6js36 [gprivate.com]
I won't go so far as saying it's a net-negative, but it's not yet delivering what we're being sold.
Re: (Score:2)
Not everyone using AI is using it for code.
See all the pictures or movies ...
Those whom I know fix the problems right away. No one lets them accumulate, with AI or without.
Only Scrum/Agile haters pile up bugs, defects, and debts in an issue tracker, and then blame Scrum/Agile for it.
Re: (Score:2)
Not everyone using AI is using it for code.
I never said that was the case. The example I provided was for coding, but the link is to google search results with a wide variety of studies and surveys across industries.
Do you have evidence that using a current LLM makes humans "significantly more productive"?
See all the pictures or movies ...
Yeah, and those seem to indicate just how much time is being completely wasted :-P
Those who I know, fix the problems right away. No one lets them accumulate, neither with AI nor without.
And those fixes take time. Studies (see previous link) have shown that those fixes take nearly as much time as the AI use had saved.
Re: (Score:3)
You nailed it. These toxic big-data corps want to "own the future", so they must own the past. Can't remember who observed that.
Re:When they recycle books, they recycle people (Score:4, Informative)
I believe you are remembering a quote from George Orwell's 1984: "He who controls the past controls the future. He who controls the present controls the past."
Re:When they recycle books, they recycle people (Score:4, Interesting)
They made a huge mistake here. Destroying the book (there are literally scanners for scanning books without destroying them) means that they no longer have the book they claim to have the right to use.
Re: When they recycle books, they recycle people (Score:2)
It's a lot faster and a bit cheaper to do it this way, and you save still more money by not storing the paper. As long as they can afford to bribe their way into their use being considered fair use, or at least successfully forestall a court decision on it until they are done fleecing investors, mission accomplished.
Re: (Score:3)
There's no "backup exemption" [copyright.gov] for physical books like there is for software. They had the first-sale right to use the physical book they bought, and the limited fair use rights to quote excerpts, and that's it.
Making even 1 copy is a flagrant copyright violation. They also probably copied that initial digital copy to thousands of nodes for training purposes, meaning they made thousands of illicit copies of each book. Not to mention all the derivative works contained in the trained model. Now multiply that b
Re: (Score:2)
They made a huge mistake here. Destroying the book (there are literally scanners for scanning books without destroying them) means that they no longer have the book they claim to have the right to use.
That's irrelevant.
If training the AI on the book is making a copy of the book, then it doesn't matter if they kept the source copy or destroyed it, they created an infringing copy.
If training the AI on the book is not making a copy of the book, then it doesn't matter if they kept the source copy or destroyed it, they did not create an infringing copy.
This question really boils down to "Is a stored copy of an AI's weights a copy of the source material fixed in a tangible medium"? When a human reads a b
Re: (Score:2)
Regardless of training the AI on it, they obviously made a copy of the book when they digitized it
Re: (Score:2)
Regardless of training the AI on it, they obviously made a copy of the book when they digitized it
Only if they kept the copy around. Ephemeral copies are allowed for lots of purposes.
After I posted my comment it occurred to me that there might be a benefit in destroying the paper original. That way they can argue it's just format shifting, for which there is a lot of legal precedent.
Re: (Score:2)
and here i though they were one of the good ones (Score:3, Insightful)
Re:and here i though they were one of the good one (Score:5, Insightful)
How exactly is training from physical books, copyright infringement? This is exactly the kind of use that is allowed under fair use. Once you purchase a physical book, you are allowed to do what you want with it, even scan it to your computer and do data analysis on it or index it, as long as you don't republish it. AI companies are certainly not republishing books they use for training.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Actually, under copyright law you're allowed the same freedoms to use audio CD content under fair use. You can't republish the audio, but you certainly are allowed to index it, analyze it, and even use it for research including AI training.
Re: (Score:2)
Re: (Score:3)
No they do not. Some examples:
- Radio and TV stations routinely use copyrighted music as bumpers, without having to pay royalties.
- Used bookstores can sell books for profit, without permission from the publisher or author.
- Books can be repurposed for art or novelty products for sale, such as book safes, without permission.
- Book reviewer websites and podcasts, do not need to get permission to review books, even if they make a profit on these reviews.
The profit motive does not negate fair use. At all.
Re: (Score:3)
Copying the whole damn thing is not fair use, it's just commercial piracy. I don't care for our copyright regime, but as long as it applies to me it damn well should apply to startups run by techbro shitheads.
- Radio and TV stations routinely use copyrighted music as bumpers, without having to pay royalties.
Small excerpts only, not the whole song.
- Used bookstores can sell books for profit, without permission from the publisher or author.
Yes, but they can't scan the book and sell copies of it.
- Books can be repurposed for art or novelty products for sale, such as book safes, without permission.
That applies to the book as a physical collection of paper with words on it. You aren't allowed to copy the information that the book is a physical embodiment of. Again, you have first-sale rights to the
Re: (Score:2)
Copying isn't a problem, at all. You can scan a whole book to PDF for your own enjoyment. It's redistributing it that's the problem.
Correct, radio and TV stations only play excerpts. AI also only provides excerpts.
Correct, you can't scan a book and sell the copies. AI companies don't do this. You *could* scan a book and sell bumper stickers with a quote from the book, especially if the bumper stickers were satirical or making a political point. You could also scan a book and use excerpts in your own work. T
Re:and here i though they were one of the good one (Score:5, Interesting)
Re: (Score:3)
Yeah, here's an article about that Harry Potter reproduction. https://arstechnica.com/featur... [arstechnica.com]
The researchers could reproduce the work 50 tokens at a time. That's hardly reproduction. That's citing excerpts, an activity that is definitely covered under fair use.
Yes, it's true that fair use was developed for human use. But until the law catches up, electronic fair use is not prohibited.
Re: (Score:2)
Fair use was developed for human use, not use by an entity that is retaining vast amounts of what it processes verbatim.
Fair use was developed to balance creators' rights while allowing the creation of new value for society. Giving too much control to copyright owners stifles innovation (all innovation builds on the work of others). Current use of copyrighted materials in LLM training is clearly fair use, which has been held up every time it has seen a court room so far (pirating content for commercial gain is different than the act of training the LLM with copyrighted work).
ChatGPT can reproduce about 40% of the Harry Potter series books verbatim.
I doubt you could make a convincing argument that
Re: (Score:2)
Plagiarism has nothing to do with copyright law. Plagiarism is a moral violation, not a legal one.
That is nonsense. You cannot plagiarize without committing copyright infringement.
After all, you copy the work of the original author.
And in some countries it is a crime. E.g., acquiring a PhD via plagiarism is not a simple civil matter, it is fraud, which is a hefty crime.
Re: (Score:2)
That is nonsense. You cannot plagiarize without committing copyright infringement.
After all, you copy the work of the original author.
Just google: copyright alliance differences between copyright infringement and plagiarism. They give a good explanation of why plagiarism and copyright infringement are completely different things (the main one being that one is a crime and the other isn't), but that they can overlap. That should hopefully clear up why very few plagiarism cases involve copyright infringement. Generally if something is actually copyright infringement, no one even bothers to call it plagiarism (since something much worse
Re: (Score:2)
This is a horseshit argument. Fair use was developed for human use, not use by an entity that is retaining vast amounts of what it processes verbatim.
What about a human with an eidetic memory who can retain vast amounts of what they read verbatim? What about a human who invests a few months or years in memorizing a whole book (or book series)? Does that constitute copyright infringement? If not, why not, and why is it different if the entity is a machine that can do the same thing faster? Copyright defines a copy as "fixed in a tangible medium". Aren't neurons a tangible medium, just as much as bits on an SSD?
The only logically-consistent answer I
Re: (Score:2)
Very true, but then what if a "black box" (LLM) can reproduce the exact work...? Note that while you can do whatever you want with the paper and ink of the book, the same is not true of the particular composition of the ink molecules on the paper (i.e., the "text", the work). *Copy*right is about making copies of the essence of the work.
I do not know how the copyright applies to a human that can do the same, ie reproduce exact work out of memory.
Re: (Score:2)
So far, LLMs have only been able to reproduce excerpts, when tricked by researchers. https://arstechnica.com/featur... [arstechnica.com] These excerpts were up to "50 tokens long" in this study. This is definitely covered under fair use.
Re: (Score:2)
So far, LLMs have only been able to reproduce excerpts, when tricked by researchers. https://arstechnica.com/featur... [arstechnica.com] These excerpts were up to "50 tokens long" in this study. This is definitely covered under fair use.
But it means the model must, by definition, contain the entire book inside it. That means every host that runs the LLM has an unlicensed copy of the book on it.
If I put my entire mp3 collection on a server that only lets you play 30s clips at a time, but put no limit on how often you could request those clips, you bet your ass the RIAA would sue me and win.
Re: (Score:2)
Your illustration falls apart because of one thing: Others can play your MP3s in their entirety. If your site instead is a game that plays 2 seconds of one of your MP3s and invites them to guess which song it is, that would be fair use. AI only republishes small excerpts, and even if the data is all there (just like it's in Google's web search index), that doesn't mean the use is illegal under copyright law.
Licensing, which you brought up, is a completely separate matter. However, most books are not license
Re: (Score:2)
Exactly, books aren't licensed. You own the physical copy of the book which you can do pretty much whatever you want with as a physical object, but your rights to reproduce the words depicted in the book in the order they appear in the book are circumscribed by copyright law.
Absent a license, you have no right to copy the whole damn thing and make a profit off of processing and transforming that copy.
Re: (Score:2)
I completely agree with your post.
Re: (Score:2)
Others can play your MP3s in their entirety.
If I create the correct series of prompts, I can reproduce the book in its entirety, 50 words at a time. How is that different, practically, than cycling through a bunch of REST requests to reproduce an MP3 30s at a time?
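The cycling-through-requests idea can be sketched in a few lines. Everything here is hypothetical: `fetch_clip` stands in for whatever per-clip endpoint such a server would expose, and the "track" is just simulated bytes standing in for audio.

```python
# Sketch: reconstructing a whole recording from fixed-size clips.
# fetch_clip() is a stand-in for a hypothetical REST endpoint that
# returns at most 30 "seconds" of audio starting at a given offset.

FULL_TRACK = b"x" * 200  # pretend 200-second track, 1 byte per second

def fetch_clip(offset, clip_len=30):
    """Return up to clip_len 'seconds' starting at offset, like a 30s clip API."""
    return FULL_TRACK[offset:offset + clip_len]

def reconstruct(total_len, clip_len=30):
    # Cycle through requests, advancing one clip at a time,
    # and stitch the pieces back into the whole work.
    pieces = [fetch_clip(off, clip_len) for off in range(0, total_len, clip_len)]
    return b"".join(pieces)

assert reconstruct(len(FULL_TRACK)) == FULL_TRACK  # the clips add up to the whole
```

The point of the sketch is that a per-request size cap is no cap at all on the total: as long as offsets are unrestricted, the caller recovers the complete work.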
Re: (Score:2)
What matters is the licensed use by the source that holds and distributes the copy. Even if you were streaming 30s clips from an illegal source, it would be illegal. Your use is temporary, i.e., to consume, so you are legal; the source is a different story.
Re: (Score:2)
In the book case, it is *you* who are reconstructing the book, not the AI.
In the MP3 case, you are downloading an entire song at once, REST or otherwise.
Re: (Score:2)
Yeah, the interesting thing is that there is no exact place one can point to that is the actual copy of the work. Just like in a human brain. Yet LLMs can reproduce it. Just like some humans.
Re: (Score:2)
The problem is when someone coaxes the model to output a significant portion of a work verbatim.
Yes they can read it in and process as much as they see fit, but if some prompt demonstrates an ability for it to reconstitute the original work, or something that from a human would be called an infringing knock off, what then?
Re: (Score:2)
So far, researchers have only been able to "coax" excerpts up to 50 tokens long. https://arstechnica.com/featur... [arstechnica.com]
That is clearly within the bounds of fair use.
If you then use this ability to "reconstitute" and republish the work, that's on you, not on AI. YOU are not allowed to reconstitute the work in this manner, or by whatever other manner you choose.
Re: (Score:3)
Oh cool, so if I posted the latest movie to youtube in one minute increments, I'm fine because it's the *viewer* choosing to watch them in the order of a film?
Re: (Score:2)
The law often comes down to intent. If your intent is to review the work, and as part of that review, you include clips, that's OK. If your intent is to let viewers reconstruct the movie, that's not OK.
Re: (Score:2)
Re: (Score:2)
Oh cool, so if I posted the latest movie to youtube in one minute increments, I'm fine because it's the *viewer* choosing to watch them in the order of a film?
In court this would come down to intent and whether someone would reasonably choose to view the movie on your platform rather than buying the movie. If you string the one minute increments together making it easy for someone to view the whole movie with little effort, that probably will be copyright infringement. If someone has to spend their whole weekend stitching together the movie, finding the next one minute segment, remembering where they left off, etc. then it probably won't be considered copyright i
Re: (Score:2)
The problem is when someone coaxes the model to output a significant portion of a work verbatim.
Yes they can read it in and process as much as they see fit, but if some prompt demonstrates an ability for it to reconstitute the original work, or something that from a human would be called an infringing knock off, what then?
If it really did easily spit out an entire novel just by being asked, that probably would be copyright infringement. If you have to painstakingly pull the content out of the LLM in small chunks and verify the output for accuracy, that isn't going to be copyright infringement. No potential buyer would ever take that route to read a book.
Re: (Score:2)
They might if someone wrote a 'book downloader' to automate the process of trying to extract the book and people say that the result covers the substance of the work.
But probably more in their worries are things like video/image generators that can rip off their characters.
Re: and here i though they were one of the good on (Score:2)
Because they are maintaining the digital copy of the training corpus, and they are required to destroy all copies when they destroy the original.
Re: (Score:2)
No, there is no requirement to destroy copies when you destroy the original. That's not a thing.
If you buy a book, you can scan it to PDF without violating the law. You can even make personal copies of your PDF, without violating the law. You just can't share those PDFs with your friends or customers, that would violate the law. And if you destroy the original book, you still can keep your PDFs, you came by them honestly by paying for the book.
Re: (Score:2)
No you can't. You can't do any of those things. People do them, and there is rarely any prosecution if it is done for non-commercial personal use, but it's all in violation of copyright law.
Re: (Score:2)
Here's a guide from the RIAA, the trade group that enforces music copyrights, on the legality of ripping MP3s to a CD. https://www.riaa.com/wp-conten... [riaa.com]. They confirm that when you do this, you do not need to destroy the CD.
Also, in RIAA v. Diamond (1999), the courts ruled that
noncommercial copying of recordings from a PC's hard disk to the Rio is a fair use under Sony v. Universal
There was no need to destroy the original.
https://en.wikipedia.org/wiki/... [wikipedia.org]
Copyright law treats books the same way. You can make copies for your own use without violating the law or ethics. You also do not need to destroy the original when y
Re: and here i though they were one of the good o (Score:2)
Yeah I'm going to take legal advice from the RIAA any day now
Re: (Score:2)
Even when the RIAA, quite ironically, says you CAN make personal copies and keep the original? If anybody would be likely to say no, it would be the RIAA.
In any case, I also cited a court case that said the same.
Re: (Score:2)
I read your citation. Either it doesn't say what you think it says or you don't understand what I said.
You don't have to destroy media when format shifting. You have to relinquish all copies when you trade the original. The original also is the proof of license. Guess what you've got if you destroy it? Not that.
Re: (Score:2)
Yes, agreed. If you trade or sell your purchased copy of a book, you aren't allowed to keep any electronic copies.
Audio is generally also licensed, with more restrictive terms than copyright itself. You're right, if you trade or give away your original, again, you can't keep the copies. And if you trade or sell a copy, you can't keep the original.
But in all these cases, if it's all for your own use, copyright does not apply, and though license may apply, nobody will come after you for making personal backup
Re: (Score:2)
Yes, agreed. If you trade or sell your purchased copy of a book, you aren't allowed to keep any electronic copies.
They are destroying their purchased copy of a book, and they still aren't allowed to keep any copies, which is what I actually said and you disagreed with. Now you're agreeing? But not acknowledging that's what I said in the first place?
Re: (Score:2)
Yes it's true, if you trade or sell your purchased copy, you aren't allowed to keep electronic copies. But that's not what you said. Here's the quote:
Because they are maintaining the digital copy of the training corpus, and they are required to destroy all copies when they destroy the original.
The difference is, AI models do *not* keep a digital copy of the training materials. There is no way to say to AI, "download a copy of Harry Potter", with or without security restrictions. The researchers didn't do that. Instead, they were able to get it to regurgitate parts of the book in "50-word excerpts."
Courts have also agreed that AI training is fair use
Re: (Score:2)
The difference is, AI models do *not* keep a digital copy of the training materials.
The problem is, I specifically said this was about their training corpus.
Either you can't read, in which case learn to read, or you didn't bother to read before replying, in which case fuck you.
Re: (Score:2)
I'm not sure what hair you are splitting. The courts have ruled that a training database is *not* (legally speaking) a copy of a work. "Training corpus" is not a standard term of art, so I'm a little unclear about what you mean, as distinct from the training database.
Re: (Score:2)
That ruling applied narrowly to digital copies of AUDIO recordings that were format shifted from an AUDIO medium to a DATA medium. I.e. from a CD to a hard drive. This is why data CD-Rs were cheaper than "audio" CD-Rs. Even though the difference was mainly in marketing, the audio formats had to pay a "piracy" tax to the RIAA. The case was mainly about whether Diamond had to pay that sin tax to the recording industry not about whether you as a consumer had the right to copy CDs.
The tax is recognition tha
Re: (Score:2)
although it is per se illegal
It isn't illegal in any way, but is explicitly *legal* under fair use.
Re: (Score:2)
In Europe you definitely can.
Copyright law is mostly a law about distributing copies.
And not about how many copies you hold privately.
Re: (Score:2)
There are laws both about making and distributing copies. They are both illegal. There are fines which can be applied even when you are making copies for personal use. You are generally considered to be allowed to format shift, but only if you keep the originals. Otherwise you could just be buying and returning (or reselling) media after making copies.
Re: (Score:2)
AI companies are certainly not republishing books they use for training.
Is this really certain? They probably do not intend to publish the full books. However, it is apparently possible to extract (almost) the full text of some books from LLMs.
Re: (Score:2)
Intent is a key part of the law, it matters.
As for the reports of people being able to extract most of the text of books from LLMs, these are pretty over-hyped. For example, researchers claimed to be able to extract 40% of a Harry Potter book, but when you read the details, you see that they could only get as many as 50 tokens at a time. https://arstechnica.com/featur... [arstechnica.com] That is certainly not wholesale extraction, and is more like excerpts. Excerpts are certainly covered under fair use.
Re: and here i though they were one of the good on (Score:2)
You don't understand what fair use is, please stfu. There is a difference between a private individual making a copy of a book for their own use, and a corporation making a copy available for each of their employees, let alone training models.
Re: (Score:2)
"Making a copy for each employee" is indeed a blatant violation of copyright law.
Using as input for training models, has not fully been tested in the courts, but so far, the courts seem to be siding with the AI model makers, as long as they acquired the originals legally.
https://www.nge.com/news-insig... [nge.com].
Re: (Score:2)
AI companies are certainly not republishing books they use for training.
I don't think that situation is quite as clear as you think it is.
Re: (Score:2)
Well, why don't you enlighten me then!
Gallica.fr (Score:5, Informative)
One day, a production assistant came to me and said "I don't think we should guillotine this one, what do you think?". I looked at it and...flaming hell, it was the French National Academy of Science's original copy of Principia Mathematica [gallica.bnf.fr] by Isaac Newton. Had we gone ahead and sliced/shredded...Douglas Adams' predictions would have come true, and we'd have been lynched by a rampaging mob of respectable physicists.
Tech - we used a combination of Mac Plus, 486SX, and 486DX2 machines with super-incredible-powerful-specialised graphics cards containing a whole 1Mb of VRAM, and a Netware server so vast it could only be named one thing: Behemoth. I mean, what other name could we possibly have contemplated giving to a machine which had a whole 1Gb available to it...
Re: (Score:3)
Cool story! I got my first 386SX in 1992, with a whopping 2MB RAM, so 1Gb would have blown my mind... "Imagine how well Prince of Persia will run" was all my brain could think at the time xD
Re:Gallica.fr (Score:4, Interesting)
Safe to say that wasn't the official reason we gave to people, and settled for "a restart seemed to fix it all". Oops.
Re: (Score:2)
Cool story! I got my first 386SX in 1992, with a whopping 2MB RAM, so 1Gb would have blown my mind... "Imagine how well Prince of Persia will run" was all my brain could think at the time xD
Child.
Let me tell you about my first computer, a TRS-80 in 1978 with a whopping 4 KB of RAM, so 1 GB would have just confused me, and the only games available were ones I got from computer magazines and had to type in by hand. :-D
Cue someone to tell us about "their" first computer, an IBM mainframe in the 1960s or something...
Re: (Score:2)
Haha, I know I'm not remotely the oldest around here... 4KB is *so* incredibly little, not remotely enough even by Gates's standards! Another fun memory is learning/memorizing the powers of two, up to 65536, slowly but surely over the years, by seeing the BIOS RAM counter after every RAM upgrade.
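For the curious, the doublings a RAM counter walks through, up to that memorized 65536, are just:

```python
# The powers of two a BIOS RAM counter walks through, up to 65536 (2^16).
powers = [2 ** n for n in range(17)]
print(powers[-5:])  # the last few doublings: [4096, 8192, 16384, 32768, 65536]
```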
Re: (Score:2)
Haha, I know I'm not remotely the oldest around here... 4KB is *so* incredibly little, not remotely enough even by Gates's standards! Another fun memory is learning/memorizing the powers of two, up to 65536, slowly but surely over the years, by seeing the BIOS RAM counter after every RAM upgrade.
TBH, the TRS-80 was my friend's computer. My first computer was a Timex Sinclair 1000, with 1 KB of RAM. Though I did eventually get a 64 KB RAM upgrade attachment. So much RAM! But unfortunately it didn't attach very securely so sometimes after waiting for 10 minutes to load a game from the cassette drive you'd bump the thing and it would momentarily disconnect the RAM and crash the system.
I learned my powers of two by emulating Ender Wiggin and doing "doublings" in my head... no computer required.
Re:Gallica.fr (Score:3)
How is this legal? (Score:3)
How come blatant large-scale downloading of online and offline copyrighted works, and their reuse in derivative works, is OK when "AI" corporations do it, while downloading of similar works by private persons for reading or viewing is classified as piracy? (OK, besides the "AI" hype that it's somehow not a derivative work, "corporate" may be a hint...)
Re: (Score:2)
Re: (Score:2)
And you aren't even profiting from it...they're using it to (hopefully) generate endless profits as well.
They didn't settle because they downloaded (Score:4, Informative)
Alsup ruled that the scanning was format shifting and thus 'fair use'. Alsup ruled that their usage of downloaded works for training was 'fair use'. But Anthropic kept copies of downloaded works 'as a library' - including works they didn't use for the model training. Alsup ruled that that was not fair use. Alsup also said he would not delay the trial while Anthropic appealed (which is something that usually happens), hence why Anthropic settled.
Re: (Score:2)
"Alsup ruled that the scanning was format shifting and thus 'fair use'. Alsup ruled that their usage of downloaded works for training was 'fair use'."
And the first is clearly fair use regardless of prevailing attitudes here. The second certainly could be depending on the terms of the downloads.
"But Anthropic kept copies of downloaded works 'as a library' - including works they didn't use for the model training. Alsup ruled that that was not fair use."
And it is not, but it also isn't inherently illegal. Ca
Settled? (Score:2)
Anthropic settled for $1.5 billion in August
Settled with who? Who got the $1.5 billion?
Re: (Score:2)
Re: (Score:3)
The authors of the books: https://apnews.com/article/ant... [apnews.com]
Re: (Score:3)
Whom! Settled with whom!
Hard to argue with this approach (Score:5, Insightful)
There is no more clear case of fair use than this: buying physical books. Once a person or business buys a book, they don't get to continue to control what happens to that book.
But...what about AI regurgitating all that copyrighted data? Yes, it does regurgitate the data, but only in ways that are compatible with fair use. AI will never reproduce the entire work, or even large sections of it. At most, a paragraph or two. And this is exactly what humans do under fair use. We are allowed to quote small sections of the text, as long as we don't reproduce the work in bulk.
Personally, I think e-books ought to follow the same principle as physical books. You buy it, you can do what you want with it, as long as you don't republish it, and use it only for your own purposes. It should be *yours*.
Re: (Score:2)
At most, a paragraph or two.
Researchers have gotten the major models to regurgitate the vast majority of books by starting with a paragraph of the book. The notion that training only tweaks probabilities is complete and utter nonsense. Alsup was 95% wrong in his decision. He was blatantly lied to by Anthropic, and he bought the crap hook, line, and sinker; lock, stock, and barrel.
Re: (Score:2)
What if a human could regurgitate a book (some can)? Would this also be a copyright violation?
Re: (Score:2)
The research has found that AI models can reproduce short excerpts of books. In one study, AI was able to produce 50-token excerpts at a time. https://arstechnica.com/featur... [arstechnica.com]
That's not really the same as reproducing the work.
Re: (Score:2)
Is the book the paper and ink, or something more? I am unclear what to think about it. There is no way to point to any particular place in the LLM that "holds the copy", but if interacted with correctly it can actually reproduce the work. But so can some humans.
Re: (Score:2)
No, the book is not just the paper and ink. It's illegal, for example, to distribute a PDF copy of a scanned book. It's the contents, the text, that is copyrighted.
Reproducing excerpts, which is what AI does, is definitely covered under fair use.
Re: (Score:2)
I was not clear. Yeah, that was a rhetorical question.
I was not sure how copyright law would treat humans that can reproduce (copy?) content, just like LLMs.
Re: (Score:2)
If a human reproduced a work from memory and then sold or gave it away, that would indeed be a violation of copyright law.
If a human reproduced only small sections from memory, for use in critiques, reviews, or other uses allowed under fair use, this would not be a violation of copyright law.
Rainbows End? (Score:2)
It's one of Vernor Vinge's lesser-known books (or at least I've heard less about it), and doing almost *exactly* this was one of the big story arcs.
I really don't understand why this is illegal (Score:2)
Humans can read books; most of them can't memorize them word for word. If a computer reads a book and doesn't create a copy of it, isn't that the same thing? Not if the LLMs can spit out the work verbatim. But if you give them rails to say, "hey, you read this book, don't violate copyright", who cares if they 'read' the book? (Granted, Anthropic probably has digital copies of all these books to train their models...)
Re: (Score:2)
James Cameron was famously (and successfully) sued for stealing ideas from old TV shows.
With an LLM you can--at least in theory--determine exactly which books it took words from by following the data through the model to the output. You can't do that with a human, unless, like Cameron, they publicly admit what they did.
Ephemeral Copies (Score:2)
Anthropic was, at the least, making ephemeral copies of copyrighted works (i.e., the copy that existed in computer memory, possibly storage, temporarily).
I'm not sure (IANAL) that this concept has been applied to written works. For audio, however, the US mandates royalties in lieu of requiring a license - as getting a license would be impracticable.
I'm guessing that Anthropic is going get nailed for this aspect, as they likely would be required to acquire this type of license if they did what they did legit
LLMs cannot forge original works (Score:2)
Re: (Score:2)
He later shared a link to the Pirate Library Mirror site with colleagues, writing "this is awesome!!!"