

Meta Beats Copyright Suit From Authors Over AI Training on Books (bloomberglaw.com)
An anonymous reader shares a report: Meta escaped a first-of-its-kind copyright lawsuit from a group of authors who alleged the tech giant hoovered up millions of copyrighted books without permission to train its generative AI model called Llama.
San Francisco federal Judge Vince Chhabria ruled Wednesday that Meta's decision to use the books for training is protected under copyright law's fair use defense, but he cautioned that his opinion is more a reflection on the authors' failure to litigate the case effectively. "This ruling does not stand for the proposition that Meta's use of copyrighted materials to train its language models is lawful," Chhabria said.
Re:well good (Score:5, Interesting)
Re: (Score:2)
I think everyone agrees that a legitimate copy of the work must be used for training.
This is about authors trying to say that the training itself represents a violation of their copyright.
Which, of course, it does. However, it's also broadly agreed to fall under fair use, which is what the judge in this instance found.
Re: well good (Score:2)
Re: (Score:2)
Did you read the summary? The Judge was surprised the plaintiffs didn't bring up the piracy, which is very much not legal.
From the plaintiff's perspective, it makes sense.
They were going for a ruling against LLM training being a transformative fair use, so that they could prevent LLMs from being trained on their work. They didn't give a shit about the $50k a book they'd get for the piracy.
Re: well good (Score:2)
Re: (Score:2)
It's good to know my kids have free rein to download the textbooks they train on. That will save them money.
There are 2 acts here. The downloading, and the training.
The Judge very much did not say that the downloading was legal. The Judge in the Anthropic case said the same thing. Piracy is piracy. That you used the pirated work in a way that falls under fair use does not make the act of piracy fair use.
As for AI training versus a person training, we agree! When you train with a copyrighted work, you transformatively copy it into your brain.
Wouldn't it be insane if everything you produced thereafter
Re: well good (Score:2)
Re: (Score:2)
I didn't know people were actually saying the training itself should be illegal.
Aye. They are.
That part is very simple; if it quotes direct words out of a book as they were written then hopefully that is still illegal.
Well, here's where it gets tricky.
You can get an LLM to reproduce text from its corpus.
There are 2 basic causes of this.
1) is a failure in training- the model memorized some text because it was overtrained on it. This isn't desirable for the LLM providers- and they avoid it as much as possible, simply because a model that has memorized is not generalizing its semantic connections- it's not a good LLM. But it does happen.
2) is part of their nature.
They are trained to produce the next token.
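A rough sketch of what "produce the next token" means in practice, assuming the Hugging Face transformers library and the small public gpt2 checkpoint as stand-ins (any causal language model works the same way): the model only ever scores candidates for the next token given everything before it, and generation is just repeating that step.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("A long time ago in a galaxy", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits          # scores for every possible next token
    next_id = logits[0, -1].argmax()        # greedy: pick the single most likely one
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```

If the training corpus contained one overwhelmingly common continuation of that prefix, greedy decoding will tend to reproduce it, which is exactly the memorization-versus-generalization tension described above.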
Re: (Score:2)
Re: (Score:2)
You give me nothing but the first word.
I then correctly say the other 9.
You'd think I was fucking psychic.
Now imagine you give me the first 9 words, and I predict the last word.
Far less impressive, isn't it?
LLMs tend to be somewhere in the middle.
If anything, I think it's an indictment of how statistically predictable writing styles are.
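If you wanted to test that, a minimal memorization probe looks something like this (a sketch, again assuming the transformers library and gpt2 as placeholders): hand the model longer and longer prefixes of a known passage and count how many tokens of the true continuation come back verbatim under greedy decoding.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

passage = ("It is a truth universally acknowledged, that a single man in "
           "possession of a good fortune, must be in want of a wife.")
ids = tok(passage, return_tensors="pt").input_ids[0]

# Give the model longer and longer prefixes and count how much of the
# real continuation comes back verbatim under greedy decoding.
for prefix_len in (2, 5, 10, 15):
    prefix = ids[:prefix_len].unsqueeze(0)
    target = ids[prefix_len:]
    out = model.generate(prefix, max_new_tokens=len(target), do_sample=False)
    generated = out[0, prefix_len:]
    n = min(len(generated), len(target))
    matches = int((generated[:n] == target[:n]).sum())
    print(f"prefix={prefix_len:2d} tokens  verbatim: {matches}/{len(target)} tokens")
```

The more of the prefix you have to supply before the rest comes back, the weaker the "it stored a copy" claim looks, which is the point being made here.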
Re: well good (Score:2)
Re: (Score:2)
Maybe a regime like the DMCA? Copyright holders can notify us that the LLM produces some kind of infringing text too perfectly, or with too little prompting, and we add to an output filter?
Doesn't seem like a bad idea.
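A toy version of that filter, just to make the idea concrete (everything here is made up for illustration, not any real product): rights holders register passages, and the service refuses to emit an output that shares a long enough verbatim word n-gram with any of them.

```python
class InfringementFilter:
    """Toy notice-and-takedown style output filter: blocks generations that
    share a long enough word n-gram with any registered passage."""

    def __init__(self, ngram_len: int = 8):
        self.ngram_len = ngram_len
        self.blocked_ngrams: set[tuple[str, ...]] = set()

    def register_passage(self, text: str) -> None:
        # A rights holder "notifies" us; index the passage's n-grams.
        words = text.lower().split()
        for i in range(len(words) - self.ngram_len + 1):
            self.blocked_ngrams.add(tuple(words[i:i + self.ngram_len]))

    def is_allowed(self, generated: str) -> bool:
        words = generated.lower().split()
        for i in range(len(words) - self.ngram_len + 1):
            if tuple(words[i:i + self.ngram_len]) in self.blocked_ngrams:
                return False        # too-long verbatim overlap: suppress output
        return True


f = InfringementFilter(ngram_len=8)
f.register_passage("some long registered passage from a copyrighted book here")
print(f.is_allowed("an unrelated model answer"))                                   # True
print(f.is_allowed("it repeats some long registered passage from a copyrighted book verbatim"))  # False
```

The hard parts in real life would be paraphrase detection and scale, but the basic notify-then-filter mechanism is no more exotic than this.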
Re: (Score:2)
Re: (Score:2)
Oh, interesting and very legitimate way to look at this - not thought of it that way.
Re: (Score:1)
The knowledge of these books is so thoroughly distilled into these LLMs during training that it is ridiculous to call them derivative works.
Excellent! Now that's an awesome defense for the next time I pirate anything: the knowledge of the things I pirated is so thoroughly distilled in me that it's ridiculous to consider my act illegal.
Re: well good (Score:2)
failed to litigate (Score:5, Interesting)
When the judge said the plaintiffs failed to litigate effectively, what he meant was this:
The defendant said that the LLM is not capable of reproducing the training materials. Instead, the information derived from them lets it summarize relationships, identify what the books contain, maybe even mimic the style of a book in new writing. In other words, it can do the things a normal human being can do after reading a book.
To prevail, the plaintiffs would have had to offer evidence that it was likely that the LLM stored sufficient data about the books to reproduce them exactly. If it can exactly reproduce the original works then it's not transformative, it's derivative.
The plaintiffs failed to offer any such evidence.
Re: (Score:1)
I haven't read the opinion (Score:3, Insightful)
Not that any of that matters given how much money is on the table for AI
Re: (Score:3)
Section 230 is part of Title 47 (Telecommunications) of the U.S. Code. It has nothing to do with copyright, which is covered by Title 17.
If you want to consider copyright cases in the context of the Internet we could look at Metallica v. Napster Inc, which ... led to no changes to the law after it became clear that Napster would have to settle or lose.
Or we could consider Viacom International Inc. v. YouTube, Inc., which ... led to no changes in the law after YouTube lost an appeal and settled the case rat
Re: (Score:2)
"led to no changes to the law"
I thought the DMCA was the result of that whole debacle.
Re: (Score:2)
Both of those cases came after the DMCA was passed in 1998, and the second was a major case applying the new rules in the DMCA. The US passed the DMCA to enact two 1996 treaties.
In contrast, Section 230 was passed as part of the Communications Decency Act in 1996, in response to two earlier lawsuits that were not about copyright -- as suggested by the name of the law.
Re:I haven't read the opinion (Score:5, Interesting)
You know what a search engine is, right? It takes all the words in a document and stores them in a database along with a link saying that this word was found in that document. The search engine has stored every word in the document, but it has done it in a way that it's not possible for the search engine to reproduce the document. The legal precedent is crystal clear that this activity does not violate the document's copyright.
Now you have a baseline for storing every word of a copyrighted book without violating its copyright.
When the LLM "trains" on a copyrighted book, how does it store the data? Has it saved the original data, in order, where it can spit it back out on command? Or like the search engine, has it stored relationships learned from the data which allow it to reason about the work but not reproduce it verbatim?
That's the correct question to ask when determining whether an LLM violates the copyrights of its training data. The plaintiffs failed to offer a credible answer to that question.
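For anyone who hasn't built one, here is roughly what the parent is describing, as a toy inverted index (a sketch, not any real search engine's design): every word points at the documents that contain it, but word order is thrown away, so the index cannot give you the document back.

```python
from collections import defaultdict

# Toy inverted index: word -> set of document ids containing it.
index: dict[str, set[str]] = defaultdict(set)

def add_document(doc_id: str, text: str) -> None:
    for word in set(text.lower().split()):
        index[word].add(doc_id)

def search(query: str) -> set[str]:
    words = query.lower().split()
    if not words:
        return set()
    results = index[words[0]].copy()
    for word in words[1:]:
        results &= index[word]          # documents containing every query word
    return results

add_document("moby-dick", "call me ishmael some years ago never mind how long")
print(search("ishmael years"))          # {'moby-dick'}
# The index records which words appear in which documents, but the order is
# gone, so the original text cannot be reconstructed from the index alone.
```

Whether an LLM's weights are more like this index or more like a compressed copy of the books is exactly the factual question the plaintiffs needed to answer with evidence.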
Re: (Score:1)
But even if we grant that incredible stretch it would just mean that Google has conducted enormous amounts of copyright violations.
Which is fine so long as nobody sues.
Never mind the fact that we have clearly published works being consumed here.
Just because the software can't reproduce the work (though incidentally LLMs can easily reproduce half the body of a work) doesn't mean they aren't making a copy. That's the prob
Re: (Score:3)
Oh. You don't know how a search engine works. No wonder you don't know how copyright and computer software interacts.
Re: (Score:2)
Wait so you think Google has the entire internet sitting on their servers?
No he ... explicitly said that's not how it works, and that's not how LLMs work either.
Please read what you're replying to and try commenting again. Right now all you've done is demonstrate you've got no capability of parsing basic English, which of course means now any comment you make on the opinion in a legal case is in question.
That's the problem. Not the ability to reproduce the work but the fact that they made the copy in the first place without the author's authorization.
No. They aren't making a copy any more than you are making a copy by having this comment in your RAM. What is stored is not a copy of the work and is transformative in every way. The
Re: (Score:2)
Making a copy is 100% legal.
No. This is patently false.
Copyright gives the holder control over each and every copy made by anyone else.
The set of exceptions to that right to control every copy ever made is what is called "fair use".
Re: (Score:3)
Wait so you think Google has the entire internet sitting on their servers?
No, they have the text of all of the pages they have ever indexed sitting on their servers. It's not verbatim copies of pages, but it is literally copies of all of the text. Search engines absolutely contain an actual copy of the meaningful content of web pages. There is a not-very-good argument to be made that LLMs effectively do the same thing, except the difference is that Google absolutely has the capacity to reproduce all of the relevant content from the original works. We know this because they used t
Re: (Score:3)
The legal precedent for search engines also depends on the search engine linking back to the original. In fact, what a search engine does is _not_ legal without permission, but when offered the option to get delisted, litigation was abandoned by the litigating parties.
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
The summary is not correct. Like any group of fanatics or mindless fanbois, the AI morons do not see reality, only what they imagine reality should be.
The judge effectively said that what Meta does may well be illegal, but that you need to litigate it in a different way.
Re: (Score:2)
The question is whether or not those copies are covered under fair use. If they are transformative, and not shown to have a negative market impact on the copyright holder, then they are.
The Judge is not wrong, they know the law better than you, and the dumbshits who positively moderated you.
Also, the CDA has literally nothing to do with copyright.
Copyright law is Title 17
Re: (Score:2)
The point is that there is no "store into the model" step.
The training gets an input and an expected output and then computes how large the error between expected and actual output is. Afterward it updates the weights using the derivatives of the error, not the input or expected output texts.
The only way to "store" something in a model is overfitting, and the pigeonhole principle tells you that you can only overfit on a small amount of the training material.
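For the curious, the loop being described fits in a few lines. Here is a toy sketch with PyTorch and a made-up two-layer "language model" (nothing here is the real Llama training code): the loss compares the actual output to the expected output, and only the gradient of that loss ever touches the weights.

```python
import torch
import torch.nn as nn

# Tiny toy language model: embedding -> linear head over a small vocabulary.
vocab_size, dim = 100, 16
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One training step: input token ids and the expected next-token ids.
inputs = torch.tensor([1, 5, 7])        # stand-ins for tokenized training text
targets = torch.tensor([5, 7, 9])       # the "expected output" (next tokens)

logits = model(inputs)                  # actual output
loss = loss_fn(logits, targets)         # how large the error is
loss.backward()                         # derivatives of the error w.r.t. weights
opt.step()                              # weights nudged slightly; the text itself
opt.zero_grad()                         #   is never written into the model
```

What persists after the step is a set of slightly adjusted floating-point weights, not the input or expected output text.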
Re: (Score:2)
If your summary is correct, the judge is just plain wrong. If you store the details of a book in your model, congratulations, you have made a copy of the book under our copyright law.
The details of a book aren't stored in the model. If they were, you could reproduce the book wholesale. You can't do that even with external tools which analyze the model, let alone the model running as designed; QED the details of the book aren't stored in the model. Only a statistical analysis is stored. This is also not a derivative work, because derivative works are characterized by recognizably reproduced elements — actual copies, which further have not been manipulated to the point of unrecogniz
Re: (Score:2)
That is a gross misstatement and reflects your desires more than the facts of the case. If somebody, say, transcribes Harry Potter then there would be no exact copy, but it would still not be fair use.
Re: (Score:2)
This is the second judge to rule that weights trained by backpropagation are transformative, not derivative.
You're going to have to accept that you are, as per the usual, wrong.
Re: (Score:1)
Re: (Score:2)
In other words, it can do the things a normal human being can do after reading a book.
Humans and machines are not treated the same under copyright law. They are different, and the law treats them differently.
Fix copyright law! (Score:2)
Most of the problem comes from the copyright period being so insanely long. Make copyrights expire after ~30 years or so, and much of this problem evaporates (as well as many other problems).
Re: (Score:2)
7+7 is how the Founders understood the Constitutional bargain.
Mark Twain planned to add a new chapter to each of his books every seven years.
"You do want to know what happened next, don't you?"
Life+75 is just abuse and Corporate Welfare (but I repeat myself).
Re: (Score:2)
Why the split and why these numbers? I especially don't get why individuals should get 100 years.
Re: (Score:2)
How about if they expire after 1 year for business use, and 100 years for individuals?
Ha, businesses would just assign them to individuals with revocable contracts, and SCOTUS would side with the businesses. Besides, corporations are people, my friend. Honestly, anything up to about 30 years is fine; much longer will give the same problems we have now.
Re: (Score:2)
If an individual sells a copyrighted work, guess what, they have just formed a business.
Re: (Score:2)
The end of copyright (Score:1)
I used to advocate for piracy in my 20s, when I could barely afford anything, but then changed my mind. For more than a decade I advocated for respecting copyrights, because even if the copyright system is truly skewed towards American financial interests, there are ways to profit from it even for poorer countries. I convinced a lot of people that copyright is a good thing, even with its flaws, even when hiding academic content and knowledge behind paywalls and high markups.
Now, it's impossible for me to ar
Re: (Score:2)
Interesting because the moral argument is about reproduction.
I own a notepad and a pencil. But if I start writing "A long time ago in a Galaxy far, far away" and continue as such, 'you' will threaten to beat me up and put me in a cage for years (if I persist despite the threats).
That just obliterates real property rights in favor of imaginary property rights. And real property rights are the basis of a peaceful civilization.
On the other hand, the AI bros have a moral obligation to not suddenly destroy society's
Re: (Score:2)
Pirating (Score:2)
Re: (Score:2)
They would have needed to make this point as a matter of law. I don't think the plaintiffs did so. In the Anthropic case (also ruled on this week) the judge found the same conclusion on the copyright for training side of the case, but allowed the case to continue on the piracy grounds.
Re: (Score:2)
The Privacy Rapists have now become... (Score:2)
the Content Rapists.
Re: (Score:3)
Indeed. But they have overdone it. Search engines are only legal because they link to the content, and nobody that litigated against Google, for example, really wanted to get delisted. So they dropped the complaints. (This is simplified.) But with LLMs, no linking back takes place. In fact, I had several cases where ChatGPT was unable to cite sources on request. That is one of the reasons I rarely use it.
But without the linking back, the content providers have zero motivation to allow the use of their content. In fa
I can't believe Slashdot has come to this ... (Score:2)
Is this place now packed with a bunch of Zoomers who have been slurping up "muh IP law" propaganda from the bigcorps their whole lives?
Or are most of you completely ignorant about the purpose of copyright in the first place?
I for one am happy to see fair use get a fair shake in the courts.
Re: (Score:2)
Is this place now packed with a bunch of Zoomers who have been slurping up "muh IP law" propaganda from the bigcorps their whole lives?
My view is the opposite. The demographic has gotten old enough that the average commenter is more likely to have their name on some piece of work they've created, and so are more likely to have a personal stake in how the law protects their rights to that work.
Re: (Score:2)
Re: (Score:2)
I'm a published music artist with royalty checks to prove it, and my teenaged daughter already has an art design credit in a Simon & Schuster published book. Neither of us is very perturbed about AI art or music.
Re: (Score:2)
As much as I dislike Meta, I'm with them on this case. Training AI on copyrighted works is the very definition of fair use. You can train a human on copyrighted materials (as long as you have legally bought or borrowed the materials in the first place). Why shouldn't it be OK to train an AI on them, with the same stipulations?
Re: (Score:2)
Re: (Score:2)
Thanks for the correction, yes indeed.
Meta still faces torrent-infringement trial (Score:2)
Meta may have dodged the bullet on its AI-training fair-use defense (for now and only as to these 13 authors) but the battle isn’t over. On July 11, Judge Chhabria will take up the authors’ separate claim that Meta’s torrent-style downloading and distribution of their books infringed copyright. https://www.courthousenews.com... [courthousenews.com] That distribution suit survived the ruling and could still affect how tech giants acquire training data.
Fair use (Score:2)
Re: (Score:2)
First principles - what is copyright FOR? (Score:2)
1) To protect the authors of works so that they can gain an income from their creativity. To the extent that AI doesn't allow complete downloads of an author's book, it won't reduce the tendency of people to buy those books. 'The US district judge Vince Chhabria, in San Francisco, said in his decision on the Meta case that the authors had not presented enough evidence that the technology company’s AI would dilute the market for their work to show that its conduct was illegal under US copyright law.'
2)
Indeed (Score:2)
The failure to litigate was expected.
I wonder if there are any lawyers who bother to learn about the technology first, because when you read the lawsuits, it often seems as if you wouldn't need a lawyer at all. You could refute the claims simply by presenting the neural network's diagram, which shows the technical impossibility of the assertions. Of course, you'd still need a lawyer to translate the technical details into language a judge can comprehend.
Have you read the court documents from the Andersen ca
authors sue (Score:2)
Authors sue to prevent reading their books.
A win for Meta? Hardly. (Score:2)
This was a summary judgment, not a verdict in a trial. The plaintiffs failed to move their case forward to a trial because they argued their main point, market dilution, poorly. This case wasn't about whether training on copyrighted books is fair use in general. It was deciding whether the plaintiffs brought enough specific evidence on their market dilution claim to justify sending the case to a jury. Spoiler: they didn’t. The plaintiffs lost this round because they failed to properly plead and su