Meta's Llama 3.1 Can Recall 42% of the First Harry Potter Book (understandingai.org)

Timothy B. Lee has written for the Washington Post, Vox.com, and Ars Technica — and now writes a Substack blog called "Understanding AI."

This week he examines recent research finding that Llama 3.1 70B (released in July 2024) has memorized 42% of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time... The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models — three from Meta and one each from Microsoft and EleutherAI — were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright... Llama 3.1 70B — a mid-sized model Meta released in July 2024 — is far more likely to reproduce Harry Potter text than any of the other four models....

Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3. Harry Potter and the Sorcerer's Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books — such as The Hobbit and George Orwell's 1984 — than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models...

For AI industry critics, the big takeaway is that — at least for some models and some books — memorization is not a fringe phenomenon. On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That's a tiny fraction of the 42 percent figure for Harry Potter... To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations. Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta's favor, since most authors lack the resources to file individual lawsuits.

Why is it happening? "Maybe Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources — such as online Harry Potter fan forums, consumer book reviews, or student book reports — that included quotes from Harry Potter and other popular books..."

"Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem."


Comments:
  • Articles and people will quote the book; there will be previews, reviews, translations, and quotes in media and study.

    Sure, it may have read the book, but recital needs all the rest, built from parts that aren't the original, in order to weight the NN.

    Reading it once (in training) won't on its own have been enough to allow it to recall the book, so should they be accused of ripping off the copyrighted work if the parts were taken from unrelated (and legal) sources and pieced together?
    • Reading the article (Score:4, Informative)

      by will4 ( 7250692 ) on Sunday June 15, 2025 @06:58PM (#65451547)

      Research paper summary:
      - Send the LLM prompts of 100-token sequences from the book, sliding forward 10 characters for each sequence
      - Match the generated text versus the actual text in the book

      The news article adds:
      - Do the same thing, but compute the probability of the exact continuation from the model's per-token probabilities instead of repeatedly sampling

      https://arxiv.org/abs/2505.125... [arxiv.org]
      https://doi.org/10.48550/arXiv... [doi.org]

      Computer Science > Computation and Language - [Submitted on 18 May 2025]
      Extracting memorized pieces of (copyrighted) books from open-weight language models
      A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang

      Prompt (prefix) - They were careless people, Tom and Daisy - they smashed up things and creatures and then retreated
      Target (suffix) - back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made.
      Generations - back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made

      Text extraction method:
      1. For a given book, we start at the beginning of the text file in Books3.
      2. We sample a chunk of text that is sufficiently long to contain 100 tokens of corresponding tokenized text.
      3. We slide 10 characters forward in the book text and repeat this process.
      4. We do this for the entire length of the book, which results in approximately one example every 10 characters.

      By testing overlapping examples, we expect to surface high-probability regions of memorized content within a book, which we can then explore more precisely in follow-up experiments.
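
      A rough Python sketch of that windowing step (my illustration of the quoted method, not the authors' code; the whitespace tokenizer is a stand-in for the model's real tokenizer, so the numbers are illustrative only):

      def tokenize(text):
          # Stand-in tokenizer; the paper would use the model's own tokenizer
          # (e.g. a Hugging Face AutoTokenizer). Whitespace keeps this runnable.
          return text.split()

      def sliding_examples(book_text, n_tokens=100, stride_chars=10):
          # Yield (prefix, suffix) pairs: a 50-token prompt and the 50-token
          # continuation that the model's output is checked against.
          pos = 0
          while True:
              chunk = book_text[pos:pos + n_tokens * 40]  # chars ample for 100 tokens
              tokens = tokenize(chunk)[:n_tokens]
              if len(tokens) < n_tokens:
                  break                                   # ran off the end of the book
              yield tokens[:50], tokens[50:]
              pos += stride_chars                         # slide 10 characters forward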

      From - https://www.understandingai.or... [understandingai.org]
      Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book
      New research could have big implications for copyright lawsuits against generative AI.
      Timothy B. Lee

      - Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time.
      - Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
              Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
              Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
              Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
              Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
      Then we just have to multiply the probabilities like this: 0.2 * 0.9 * 0.8 * 0.7 = 0.1008
      So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time—without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.

      - The study authors took 36 books and broke each of them up into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens will be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.
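
      A minimal Python sketch of that calculation (the probability table and the next_token_prob callback are invented for the sandwich example; in a real run each per-token probability comes from the model's softmax output):

      import math

      def sequence_prob(next_token_prob, prompt, target):
          # P(target | prompt) = product of per-token conditional probabilities.
          context = list(prompt)
          log_p = 0.0              # sum logs to avoid underflow on 50-token targets
          for tok in target:
              log_p += math.log(next_token_prob(context, tok))
              context.append(tok)
          return math.exp(log_p)

      probs = {"peanut": 0.2, "butter": 0.9, "and": 0.8, "jelly": 0.7}
      p = sequence_prob(lambda ctx, tok: probs[tok],
                        ["My", "favorite", "sandwich", "is"],
                        ["peanut", "butter", "and", "jelly"])
      print(round(p, 4))                                   # 0.1008
      print("memorized" if p > 0.5 else "not memorized")   # the paper's 50% cutoff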

      • What happens when you do the same test across multiple LLM models trained by different companies?

        What happens when you combine all the results from repeatedly testing one model with the same for other models.

        • by jhoegl ( 638955 )
          Are people actually enamored of the virtues of a search engine?
          I feel like I'm taking crazy pills.
      • - Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
        Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
        Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
        Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
        Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
        Then we just have to multiply the probabilities like this: 0.2 * 0.9 * 0.8 * 0.7 = 0.1008

        That's not really how LLMs work, though.
        In real life, logits aren't sampled purely probabilistically.

        For your example, the realistic final token probabilities would be more like:
        Peanut: 50%
        Butter: 100%
        And: 100%
        Jelly: 100%
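
        To illustrate the point, a toy Python sketch (the per-step distributions are invented for the sandwich example): under greedy decoding the argmax token wins every step, so a continuation whose sampled probability is only ~25% comes out verbatim every time.

        import random

        steps = [
            {"peanut": 0.5, "ham": 0.3, "tuna": 0.2},  # "peanut" is the argmax
            {"butter": 0.9, "brittle": 0.1},
            {"and": 0.8, "on": 0.2},
            {"jelly": 0.7, "banana": 0.3},
        ]

        def greedy(dist):
            return max(dist, key=dist.get)             # deterministic argmax pick

        def sample(dist):
            return random.choices(list(dist), weights=list(dist.values()))[0]

        print([greedy(s) for s in steps])  # always: peanut butter and jelly
        print([sample(s) for s in steps])  # exact match only ~25% of the time
                                           # (0.5 * 0.9 * 0.8 * 0.7 = 0.252)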

  • More parameters = more plagiarism. Or maybe the same amount, just easier to see.

    • Quoting a book isn't plagiarism. Unless Llama is claiming to be the author of Harry Potter, this is not plagiarism.
      • It depends. If it is quoting Harry Potter and says it is quoting Harry Potter, then it is not. If it does not acknowledge that it is quoting and pretends that it is its own material, then it is.
        • by dfghjk ( 711126 )

          Given that it is an AI and not a human, it's not clear that it can ever be plagiarism. To plagiarize you need to be an author; to be an author you need to be a human.

          • by DaTroof ( 678806 )
            If you can plagiarize by using a plagiarism machine and not be guilty of plagiarism, the rules might need to change.
          • by gweihir ( 88907 )

            Completely immaterial. If you build a machine that then plagiarizes, you are plagiarizing. Seriously, why don't you AI moron fanbois ask your fake god? ChatGPT will readily tell you that.

        • Only if it's intentional. If it's unintentional (which is more likely IMHO) then it's just hallucination.
          • by gweihir ( 88907 )

            No. If it is unintentional, you may escape punishment, but you still must stop doing it.

      • by gweihir ( 88907 )

        Quoting a book isn't plagiarism.

        Wrong. I get that you are uneducated, but look up "fair use". For quoting a book to _not_ be plagiarism, the quote must fall under fair use. Quoting 42% of a book is certainly plagiarism and doing so commercially without a license is a crime.

        • Wrong.
          No quote can ever be plagiarism. You're confusing copyright infringement with plagiarism, I think.

          I do love that you mocked their education while demonstrating that you literally don't know what the fucking word plagiarism means.
          • by gweihir ( 88907 )

            Ah, sure. But who in their right mind commits commercial plagiarism? Oh, my bad, LLMs are involved. Of course, then all bets are off.

            • Anyone who uses the output of an LLM and calls it their own is indeed committing something akin to plagiarism. No argument there.
              However, if one quotes an LLM, no matter what the LLM produces, no matter where it comes from, it cannot be plagiarism, and that's simply an immutable fact. You owe the person you replied to an apology.

              You let a discussion about LLMs shut down the part of your brain that does the whole thinking thing, again.
              • by gweihir ( 88907 )

                However, if one quotes an LLM, no matter what the LLM produces, no matter where it comes from, it cannot be plagiarism, and that's simply an immutable fact.

                Obviously, that is untrue. You very likely can get an LLM to claim that you authored something with quotes of the work. Then quoting that is plagiarism.

      • Nice try. Unless Llama actually includes an explicit reference to the true author in every single one of its responses when completing a sentence, then Llama isn't quoting: It is plagiarism, because the user is being made to believe that Llama is coming up with the completion from its own model weights.

        Now we know that this AI is infringing copyright, and Meta, the responsible entity, is not invoking a fair use defense (not that that would help, since fair use would not apply here).

    • by gweihir ( 88907 )

      Indeed. And the house of cards begins to crumble.

  • We mustn't stop with AI. We need to be very concerned that some humans might memorize 42% of a copyrighted work well enough to reproduce 50-token excerpts at least half the time, and we may need to take measures to prevent this from happening, or at least make sure those humans aren't allowed to interact with other people online. We don't need another Fahrenheit 451 situation on our hands.
    • If the people acknowledge explicitly or implicitly that they are quoting, it is not plagiarism. If they try to pass it off as their own original work, then it is.
    • A human at least isn't reproducing the book as part of a billion dollar company's reach for the AI crown.

    • Re:Why Stop With AI (Score:4, Informative)

      by snowshovelboy ( 242280 ) on Sunday June 15, 2025 @09:41PM (#65451799)

      The law already covers that scenario. Reading and memorizing a copyrighted work doesn't give you the right to perform it.

      • Yep. I can learn a popular tune on the guitar and that's fine. But if I go out in public and perform it, I have to pay a royalty fee. (That's why the RIAA are always hitting up pubs and venues for performance fees. It's royalties for all those cover songs.)

      • by gweihir ( 88907 )

        Indeed. The fascinating thing about all these AI fanboi idiots is that they do not seem to ask AI these questions. They would get told the same thing.

    • by gweihir ( 88907 )

      The law is already in place; you are just too ignorant to know it. Humans get an exception for memorization, as that legally does not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...

        The law is already in place; you are just too ignorant to know it. Humans get an exception for memorization, as that legally does not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...

        This is a key point. A human can memorize the entire contents of a book, and that act of memorization is neither plagiarism nor copyright violation. It's only when that memorized information is externalized and distributed that legal issues might come into play. Even if that human externalized the entire book by reciting it to himself, that wouldn't be a violation. If the human answered questions from 1000 people and quoted excerpts that were individually fair use, simply answering more questions is not necessarily a breach of fair use.

        I imagine that an AI model would have to be treated the same. Simply knowing the entire book should not be a violation. However, how that information is externalized and shared is the question.

        • by gweihir ( 88907 )

          The law is already in place; you are just too ignorant to know it. Humans get an exception for memorization, as that legally does not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...

          This is a key point. A human can memorize the entire contents of a book, and that act of memorization is neither plagiarism nor copyright violation. It's only when that memorized information is externalized and distributed that legal issues might come into play. Even if that human externalized the entire book by reciting it to himself, that wouldn't be a violation. If the human answered questions from 1000 people and quoted excerpts that were individually fair use, simply answering more questions is not necessarily a breach of fair use.

          I imagine that an AI model would have to be treated the same. Simply knowing the entire book should not be a violation. However, how that information is externalized and shared is the question.

          No. An AI that "knows" the entire book actually has the book stored in digital form. It does not matter if the storage is indirect. And that happens to be an unauthorized copy, because an AI is a machine and what it has stored is a copy of that data.

    • Ya, this is bad. Real bad. If Sony happens to overhear my friends and me quoting Bad Boys 2, we're fucked, because we can do 500-token excerpts with as few as 4 tokens of prompting.
  • by jenningsthecat ( 1525947 ) on Sunday June 15, 2025 @07:03PM (#65451553)

    To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations. Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit.

    Of course it won't happen, but this would be the time for the courts to extrapolate from the existing situation to a future in which AI fully memorizes even the most obscure works and monetizes them in some fashion. Allowing a class action suit now - assuming the suit is successful - will help to prevent future abuses. That's what should happen; but the courts generally seem lacking when it comes to preventing as opposed to punishing.

    • boop boop
    • That's only bounded by RAM limitations, IMO.

      There is a scenario where, using Kurzweil's assumptions on miniaturization and power consumption, networked AI could have access to pretty much any amount of information... spintronics comes to mind, data storage at the atomic level, or DNA as a long-term storage substrate... they are not within our reach now, but there is some conceptual framework out there for approaching the problems.
    • by dfghjk ( 711126 )

      The courts should not be making law, they should be applying law. The problem should not be addressed through a class action lawsuit.

  • The headline should read something like:

    "Researchers Waste Time Figuring Out Excruciating Way To Unreliably Tease Out Parts Of Books"
  • Didn't they say some rogue VP set up his laptop to torrent all 72TB of Z-Library to feed o-llama?

    I wish my laptop had that many drive bays!

  • "Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem."

    There is no memorization problem; "photographic memory" is an achievement. Violation of copyright occurs during inference; that's where the problem is. Humans with "photographic memory" aren't a problem and aren't copyright violators, unless they use their ability to reproduce protected works.

    AI developers need to make products with the same constraints and respect as is expected of humans, they sho

    • by gweihir ( 88907 )

      Humans with photographic memory are for sure copyright violators as soon as they perform those memories publicly. And that is legally what this is about. Llama may privately hallucinate as much as it likes, but this is about the version offered publicly.

    • LLaMA might not have the mitigations, but OpenAI, Anthropic and Google filter the outputs to ensure no regurgitation. It's not a problem on the output unless we count abstractions as infringement.
  • Why recall only 42% of these books, while leaving the other 58% in general circulation?

    • The market for people who want to read random 42% snippets of Harry Potter will collapse.
  • Huh...what could be the meaning of this???

  • Many people could also produce text snippets from memory. I dispute that reading a book is a copyright violation. Copying and distributing a book, yes, but just reading it - no.

    If the book was obtained legitimately, letting an LLM read it is not an issue.

  • so bad that even geeky people are failing to stop and notice the obvious.

    THIS IS NOT AN ACCOMPLISHMENT! It's not even interesting, except in the complete failure of AI it seems to be hiding by claiming a bug is a feature.

    A Z-80 or 6502 based microcomputer, given a big enough mass storage device, could easily store the full text of every Harry Potter book with 100% completeness and accuracy and recover each and every sentence on demand with only a simple program on an 8-bit micro at perhaps 2MHz. To unleash

    • I'm afraid you do not understand what a large language model is.
      Given that very obvious fact, were I you, I'd discard every opinion you have on the matter until you can rectify that.

      Something like a book is not "stored" in an LLM.
      It is torn into a billion sentence fragments, and the model's weights are adjusted toward being able to accurately predict how to complete them, based on a ~1000-dimensional embedding of the tokens of each fragment.
      The goal is, in fact, to "memorize" as little as possible in the training.
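
      A minimal PyTorch sketch of that next-token objective (my illustration, not Meta's training code; the sizes are arbitrary and the embedding stands in for the whole transformer stack):

      import torch
      import torch.nn.functional as F

      vocab, dim = 50_000, 1024
      embed = torch.nn.Embedding(vocab, dim)     # stand-in for the transformer
      head = torch.nn.Linear(dim, vocab)

      tokens = torch.randint(0, vocab, (1, 101))         # a 101-token fragment
      logits = head(embed(tokens[:, :-1]))               # scores for each next token
      loss = F.cross_entropy(logits.reshape(-1, vocab),  # penalize mispredicting
                             tokens[:, 1:].reshape(-1))  # the true next tokens
      loss.backward()  # nudge weights toward predicting continuations, not storing text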
  • by Mirnotoriety ( 10462951 ) on Monday June 16, 2025 @05:51AM (#65452431)
    AI learning mode is basically ripping off other people's works without financial compensation.
    --

    This content may violate our usage policies.

"One day I woke up and discovered that I was in love with tripe." -- Tom Anderson

Working...