Meta Beats Copyright Suit From Authors Over AI Training on Books (bloomberglaw.com)

An anonymous reader shares a report: Meta escaped a first-of-its-kind copyright lawsuit from a group of authors who alleged the tech giant hoovered up millions of copyrighted books without permission to train its generative AI model called Llama.

San Francisco federal Judge Vince Chhabria ruled Wednesday that Meta's decision to use the books for training is protected under copyright law's fair use defense, but he cautioned that his opinion is more a reflection on the authors' failure to litigate the case effectively. "This ruling does not stand for the proposition that Meta's use of copyrighted materials to train its language models is lawful," Chhabria said.

Comments Filter:
  • failed to litigate (Score:5, Interesting)

    by Spazmania ( 174582 ) on Wednesday June 25, 2025 @10:09PM (#65476798) Homepage

    When the judge said the plaintiffs failed to litigate effectively, what he meant was this:

    The defendant said that the LLM is not capable of reproducing the training materials. Instead, the information derived from them lets the model summarize relationships, identify contained information, and maybe even mimic the style of the book in new writing. In other words, it can do the things a normal human being can do after reading a book.

    To prevail, the plaintiffs would have had to offer evidence that it was likely that the LLM stored sufficient data about the books to reproduce them exactly. If it can exactly reproduce the original works then it's not transformative, it's derivative.

    The plaintiffs failed to offer any such evidence.

    • I am not a lawyer, but I read the lawsuit, and it sounded like the plaintiffs weren't arguing copyright infringement but wanted case law that would invalidate fair use as a concept.
    • But if your summary is correct, the judge is just plain wrong. If you store the details of a book in your model, congratulations, you have made a copy of the book as per our copyright law. We could change that; we made significant changes to copyright law in the wake of the internet, for example Section 230 of the CDA. But as the law is written right now, AI is absolutely violating copyright.

      Not that any of that matters, given how much money is on the table for AI.
      • by Entrope ( 68843 )

        Section 230 is part of Title 47 (Telecommunications) of the U.S. Code. It has nothing to do with copyright, which is covered by Title 17.

        If you want to consider copyright cases in the context of the Internet, we could look at Metallica v. Napster, Inc., which ... led to no changes to the law after it became clear that Napster would have to settle or lose.

        Or we could consider Viacom International Inc. v. YouTube, Inc., which ... led to no changes in the law after YouTube lost an appeal and settled the case rather than keep litigating.

        • by Khyber ( 864651 )

          "led to no changes to the law"

          I thought the DMCA was the result of that whole debacle.

          • by Entrope ( 68843 )

            Both of those cases came after the DMCA was passed in 1998, and the second was a major case applying the new rules in the DMCA. The US passed the DMCA to enact two 1996 treaties.

            In contrast, Section 230 was passed as part of the Communications Decency Act in 1996, in response to two earlier lawsuits that were not about copyright -- as suggested by the name of the law.

      • by Spazmania ( 174582 ) on Thursday June 26, 2025 @12:33AM (#65476936) Homepage

        You know what a search engine is, right? It takes all the words in a document and stores them in a database along with a link saying that this word was found in that document. The search engine has stored every word in the document, but it has done so in a way that makes it impossible for the search engine to reproduce the document. The legal precedent is crystal clear that this activity does not violate the document's copyright.
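        To make that concrete, here is a minimal sketch of such an index (purely illustrative; the document IDs, texts, and function names are made up, not any real search engine's code). Each word maps to the set of documents containing it, which supports lookup but cannot reconstruct any document's text:

            # Hypothetical bag-of-words inverted index: word -> set of document IDs.
            # It records that a word appeared in a document, not where or in what
            # order, so the original text cannot be rebuilt from the index alone.
            import re
            from collections import defaultdict

            index = defaultdict(set)

            def add_document(doc_id, text):
                for word in re.findall(r"[a-z0-9]+", text.lower()):
                    index[word].add(doc_id)

            def search(word):
                return index.get(word.lower(), set())

            add_document("moby_dick", "Call me Ishmael. Some years ago...")
            add_document("dracula", "3 May. Bistritz. Left Munich at 8:35 P.M.")
            print(search("ishmael"))  # {'moby_dick'} -- which document, not its text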

        Now you have a baseline for storing every word of a copyrighted book without violating its copyright.

        When the LLM "trains" on a copyrighted book, how does it store the data? Has it saved the original data, in order, where it can spit it back out on command? Or like the search engine, has it stored relationships learned from the data which allow it to reason about the work but not reproduce it verbatim?

        That's the correct question to ask when determining whether an LLM violates the copyrights of its training data. The plaintiffs failed to offer a credible answer to that question.

        • Wait so you think Google has the entire internet sitting on their servers?

          But even if we grant that incredible stretch, it would just mean that Google has committed enormous amounts of copyright violations.

          Which is fine so long as nobody sues.

          Never mind the fact that we have clearly published works being consumed here.

          Just because the software can't reproduce the work (incidentally, LLMs can easily reproduce half the body of a work) doesn't mean they aren't making a copy. That's the problem. Not the ability to reproduce the work but the fact that they made the copy in the first place without the author's authorization.
          • Oh. You don't know how a search engine works. No wonder you don't know how copyright and computer software interact.

          • Wait so you think Google has the entire internet sitting on their servers?

            No he ... explicitly said that's not how it works, and that's not how LLMs work either.

            Please read what you're replying to and try commenting again. Right now all you've done is demonstrate that you have no capability of parsing basic English, which of course means any comment you make on the opinion in a legal case is now in question.

            That's the problem. Not the ability to reproduce the work but the fact that they made the copy in the first place without the author's authorization.

            No. They aren't making a copy any more than you are making a copy by having this comment in your RAM. What is stored is not a copy of the work and is transformative in every way.

          • Wait so you think Google has the entire internet sitting on their servers?

            No, they have the text of all of the pages they have ever indexed sitting on their servers. It's not verbatim copies of pages, but it is literally copies of all of the text. Search engines absolutely contain an actual copy of the meaningful content of web pages. There is a not-very-good argument to be made that LLMs effectively do the same thing, except the difference is that Google absolutely has the capacity to reproduce all of the relevant content from the original works. We know this because they used to serve cached copies of pages right from the search results.

        • by gweihir ( 88907 )

          The legal precedent for search engines also depends on the search engine linking back to the original. In fact, what a search engine does is _not_ legal without permission, but when offered the option to get delisted, the litigating parties abandoned their complaints.

        • It's not just "all the words" but the exact sequence, etc.; the text is reproduced in their database verbatim, exactly as it appears on the webpage itself. You can see this when results are returned: it shows a few sentences around the words that match. Storing just the words but not the arrangement would not be helpful. Google has to reproduce the content on their own in order to reproduce it to the user in search results. The only thing missing is the artistic aspect of the page; the literal content is reproduced.
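          For illustration only (hypothetical code, not Google's actual implementation), this is roughly what that implies: if the index keeps the raw page text and each word's offsets, it can hand back a verbatim snippet around any match:

              # Hypothetical positional index: keeps the raw page text plus the
              # character offsets of each word, which is what lets a result page
              # show a verbatim snippet around the matching word.
              import re

              pages = {}      # doc_id -> original text, stored verbatim
              positions = {}  # doc_id -> {word: [character offsets]}

              def add_page(doc_id, text):
                  pages[doc_id] = text
                  positions[doc_id] = {}
                  for m in re.finditer(r"\w+", text.lower()):
                      positions[doc_id].setdefault(m.group(), []).append(m.start())

              def snippet(doc_id, word, radius=25):
                  offsets = positions[doc_id].get(word.lower())
                  if not offsets:
                      return None
                  start = max(0, offsets[0] - radius)
                  return pages[doc_id][start:offsets[0] + len(word) + radius]

              add_page("page1", "The quick brown fox jumps over the lazy dog by the river.")
              print(snippet("page1", "lazy"))  # prints a verbatim chunk of the stored page around 'lazy'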
        • The key difference is that a search engine downloads a copy of a web page that is specifically intended by the author to be downloaded and viewed by others. If the LLM training is being done only on books that have been downloaded from sites where the books have been made freely available by their owners, then the fair use argument would be much stronger.
      • by gweihir ( 88907 )

        The summary is not correct. Like any group of fanatics or mindless fanbois, the AI morons do not see reality, only what they hallucinate reality should be.

        The judge effectively said that what Meta does may well be illegal, but you need to litigate it in a different way.

      • It's not a question as to whether or not a copy was made. Absolutely a copy was made. The act of training requires copies be made.
        The question is whether or not those copies are covered under fair use. If they are transformative, and not proven to have a negative market impact on the copyright holder- then they are.

        The judge is not wrong; they know the law better than you and the dumbshits who positively moderated you.

        Also, the CDA has literally nothing to do with copyright.
        Copyright law is Title 17
      • by allo ( 1728082 )

        The point is that there is no "store into the model" step.

        The training gets an input and an expected output and then computes how large the error between the expected and actual output is. Afterward, it updates the weights using the derivatives of the error, not the input or expected output texts.
        The only way to "store" something in a model is overfitting, and the pigeonhole principle tells you that you can only overfit on a small amount of training material.
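        A toy sketch of that loop (hypothetical code; a real LLM uses a deep network, a tokenizer, and cross-entropy loss, none of which appear here): the text enters only as numbers used to compute an error, and what gets written back into the model is a weight adjustment, not the text.

            # Toy illustration of a training step. The "model" is just a weight
            # matrix; each step nudges the weights along the gradient of the
            # squared error. The training string is never stored in the model.
            import numpy as np

            rng = np.random.default_rng(0)
            text = "the cat sat on the mat "
            vocab = sorted(set(text))
            idx = {ch: i for i, ch in enumerate(vocab)}
            V = len(vocab)

            W = rng.normal(0, 0.1, (V, V))  # the entire "model": one weight matrix

            def one_hot(ch):
                v = np.zeros(V)
                v[idx[ch]] = 1.0
                return v

            for epoch in range(200):
                for cur, nxt in zip(text, text[1:]):
                    x, target = one_hot(cur), one_hot(nxt)  # input and expected output
                    pred = W @ x                             # actual output
                    error = pred - target                    # how far off we are
                    W -= 0.1 * np.outer(error, x)            # derivative of squared error

            # After training, W holds statistics about which character tends to
            # follow which, not a stored copy of the training string.
            print(vocab[int(np.argmax(W @ one_hot("a")))])  # prints 't'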

      • if your summary is correct, the judge is just plain wrong. If you store the details of a book in your model, congratulations, you have made a copy of the book as per our copyright law

        The details of a book aren't stored in the model. If they were, you could reproduce the book wholesale. You can't do that even with external tools which analyze the model, let alone the model running as designed; QED the details of the book aren't stored in the model. Only a statistical analysis is stored. This is also not a derivative work, because derivative works are characterized by recognizably reproduced elements — actual copies, which further have not been manipulated to the point of unrecognizability.

    • by gweihir ( 88907 )

      That is a gross misstatement and reflects your desires more than the facts of the case. If somebody, say, transcribes Harry Potter, then there would be no exact copy, but it would still not be fair use.

      • Transcription is not transformative under current jurisprudence; it's derivative.
        This is the second judge to rule that weights trained by backpropagation are transformative, not derivative.

        You're going to have to accept that you are, as per the usual, wrong.
    • by KAdamM ( 2996395 )
      How about if I create an "AI" that is trained on a single book (which could be the latest bestseller by pure coincidence), and then "recreate" the book and sell it as an AI-generated book? Would you consider that fair? If not, where is the limit in the number of books you need to use in training in order for your "AI" to be considered fair?
    • In other words, it can do the things a normal human being can do after reading a book.

      Humans and machines are not treated the same under copyright law. They are different, and the law treats them differently.

  • Most of the problem comes from the copyright period being so insanely long. Make copyrights expire after ~30 years or so, and much of this problem evaporates (as well as many other problems).

    • Yeah, but then Stan Lee wouldn't have made any money from the Avengers movies.
  • I used to advocate for piracy in my 20s, when I could barely afford anything, but then changed my mind. For more than a decade I advocated for respecting copyrights, because even if the copyright system is truly skewed towards American financial interests, there are ways to profit from it even for poorer countries. I convinced a lot of people that copyright is a good thing, even with its flaws, even when hiding academic content and knowledge behind paywalls and high markups.

    Now, it's impossible for me to argue for it.

    • Interesting because the moral argument is about reproduction.

      I own a notepad and a pencil. But if I start writing "A long time ago in a galaxy far, far away" and continue as such, 'you' will threaten to beat me up and put me in a cage for years (if I persist despite threats).

      That just obliterates real property rights in favor of imaginary property rights. And real property rights are the basis of a peaceful civilization.

      On the other hand, the AI bros have a moral obligation to not suddenly destroy society's

    • You should really not base your opinion on a case where the judge explicitly said that the verdict is not a precedent of any kind.
  • Wasn't the story that Facebook also pirated these books? Seems like they could also be sued for copyright infringement.
    • They would have needed to make this point as a matter of law. I don't think the plaintiffs did so. In the Anthropic case (also ruled on this week) the judge reached the same conclusion on the copyright-for-training side of the case, but allowed the case to continue on the piracy grounds.

      • Ya, plaintiffs seemed to be making a point of attacking the fair use defense. They weren't really after the money for the piracy.
    • by gweihir ( 88907 )

      Indeed. But they have overdone it. Search engines are only legal because they link to the content, and nobody who litigated against Google, for example, really wanted to get delisted. So they dropped the complaints. (This is simplified.) But with LLMs, no linking back takes place. In fact, I had several cases where ChatGPT was unable to cite sources on request. It's one of the reasons I rarely use it.

      But without the linking back, the content providers have zero motivation to allow the use of their content.

  • .. or, rather, that Slashdotters have come to this: are y'all arguing for the current fucked-up copyright system, and indeed more of it?

    Is this place now packed with a bunch of Zoomers who have been slurping up "muh IP law" propaganda from the bigcorps their whole lives?

    Or are most of you completely ignorant about the purpose of copyright in the first place?

    I for one am happy to see fair use get a fair shake in the courts.
    • Is this place now packed with a bunch of Zoomers who have been slurping up "muh IP law" propaganda from the bigcorps their whole lives?

      My view is the opposite. The demographic has gotten old enough that the average commenter is more likely to have their name on some piece of work they've created, and so are more likely to have a personal stake in how the law protects their rights to that work.

    • As much as I dislike Meta, I'm with them on this case. Training AI on copyrighted works is the very definition of fair use. You can train a human on copyrighted materials (as long as you have legally bought or borrowed the materials in the first place). Why shouldn't it be OK to train an AI on them, with the same stipulations?

  • Meta may have dodged the bullet on its AI-training fair-use defense (for now and only as to these 13 authors) but the battle isn’t over. On July 11, Judge Chhabria will take up the authors’ separate claim that Meta’s torrent-style downloading and distribution of their books infringed copyright. https://www.courthousenews.com... [courthousenews.com] That distribution suit survived the ruling and could still affect how tech giants acquire training data.

  • I find it interesting that any use of a book for training is now fair use. I guess I'll tell my kids to download their college textbooks. As long as you don't take away anything copyrightable, right?
  • 1) To protect the authors of works so that they can gain an income from their creativity. To the extent that AI doesn't allow complete downloads of an author's book, it won't reduce the tendency of people to buy those books. 'The US district judge Vince Chhabria, in San Francisco, said in his decision on the Meta case that the authors had not presented enough evidence that the technology company’s AI would dilute the market for their work to show that its conduct was illegal under US copyright law.'

    2)

  • The failure to litigate was expected.

    I wonder if there are any lawyers who bother to learn about the technology first, because when you read the lawsuits, it often seems as if you wouldn't need a lawyer at all. You could refute the claims simply by presenting the neural network's diagram, which shows the technical impossibility of the assertions. Of course, you'd still need a lawyer to translate the technical details into language a judge can comprehend.

    Have you read the court documents from the Andersen case?

  • Authors sue to prevent reading their books.

  • This was a summary judgment, not a verdict in a trial. The plaintiffs failed to move their case forward to a trial because they argued their main point, market dilution, poorly. This case wasn't about whether training on copyrighted books is fair use in general. It was deciding whether the plaintiffs brought enough specific evidence on their market dilution claim to justify sending the case to a jury. Spoiler: they didn't. The plaintiffs lost this round because they failed to properly plead and support that claim.

"There is such a fine line between genius and stupidity." - David St. Hubbins, "Spinal Tap"

Working...