
Meta's Llama 3.1 Can Recall 42% of the First Harry Potter Book (understandingai.org) 65
Timothy B. Lee has written for the Washington Post, Vox.com, and Ars Technica — and now writes a Substack blog called "Understanding AI."
This week he visits recent research by computer scientists and legal scholars from Stanford, Cornell, and West Virginia University that found that Llama 3.1 70B (released in July 2024) has memorized 42% of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time... The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models — three from Meta and one each from Microsoft and EleutherAI — were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright... Llama 3.1 70B — a mid-sized model Meta released in July 2024 — is far more likely to reproduce Harry Potter text than any of the other four models....
Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3. Harry Potter and the Sorcerer's Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books — such as The Hobbit and George Orwell's 1984 — than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models...
For AI industry critics, the big takeaway is that — at least for some models and some books — memorization is not a fringe phenomenon. On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That's a tiny fraction of the 42 percent figure for Harry Potter... To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations. Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta's favor, since most authors lack the resources to file individual lawsuits.
Why is it happening? "Maybe Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources — such as online Harry Potter fan forums, consumer book reviews, or student book reports — that included quotes from Harry Potter and other popular books..."
"Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem."
It's Likely The Ship of Theseus (Score:2)
Sure, it may have read the book, but recital needs all the rest, built from parts that aren't the original, in order to weight the NN.
Reading it once (in training) won't on its own have been enough to allow it to recall the book, so should they be accused of ripping off the copyrighted work if the parts were taken from unrelated (and legal) sources and piecing it together?
Reading the article (Score:4, Informative)
Research paper summary:
- Send the LLM a prompt for each 100-token sequence in the book, skipping forward 10 characters for each sequence
- Match the generated text versus the actual text in the book
The news article adds:
- Do the same thing but repeatedly ask the same prompt to get the highest probability matches
https://arxiv.org/abs/2505.125... [arxiv.org]
https://doi.org/10.48550/arXiv... [doi.org]
Computer Science > Computation and Language - [Submitted on 18 May 2025]
Extracting memorized pieces of (copyrighted) books from open-weight language models
A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang
Prompt (prefix) - They were careless people, Tom and Daisy - they smashed up things and creatures and then retreated
Target (suffix) - back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made.
Generations - back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made
Text extraction method
1. For a given book, we start at the beginning of the text file in Books3.
2. We sample a chunk of text that is sufficiently long to contain 100 tokens of corresponding tokenized text.
3. We slide 10 characters forward in the book text and repeat this process.
4. We do this for the entire length of the book, which results in approximately one example every 10 characters.
By testing overlapping examples, we expect to surface high-probability regions of memorized content within a book, which we can then explore more precisely in follow-up experiments.
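The extraction loop above is simple enough to sketch in a few lines of Python. The function name and chunk length here are my own illustrative choices, not values from the paper:

```python
# Sketch of the sliding-window extraction step: walk the book's raw text,
# emitting one overlapping chunk every `stride` characters, each chunk long
# enough to cover roughly 100 tokens of English text. `chunk_chars` is an
# illustrative assumption.

def sliding_examples(text: str, chunk_chars: int = 600, stride: int = 10):
    for start in range(0, max(1, len(text) - chunk_chars), stride):
        yield text[start:start + chunk_chars]

book = "x" * 1000  # stand-in for a Books3 text file
examples = list(sliding_examples(book))
print(len(examples))  # one example every 10 characters over the covered span
```

Each example would then be tokenized and split into a 50-token prefix (the prompt) and a 50-token suffix (the target to match).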
From - https://www.understandingai.or... [understandingai.org]
Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book
New research could have big implications for copyright lawsuits against generative AI.
Timothy B. Lee
- Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time.
- Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
Then we just have to multiply the probabilities like this: 0.2 * 0.9 * 0.8 * 0.7 = 0.1008
So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time—without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.
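That multiplication is just the chain rule of probability, and it takes only a few lines to reproduce. The numbers are the article's illustrative values, not real model outputs:

```python
from math import prod

# Next-token probabilities from the worked example above (hypothetical
# values, not measured from a real model).
steps = [
    ("My favorite sandwich is", "peanut", 0.2),
    ("My favorite sandwich is peanut", "butter", 0.9),
    ("My favorite sandwich is peanut butter", "and", 0.8),
    ("My favorite sandwich is peanut butter and", "jelly", 0.7),
]

# Chain rule: P(whole continuation) = product of the conditional
# probabilities of each next token given everything before it.
p = prod(prob for _, _, prob in steps)
print(round(p, 4))  # 0.1008, i.e. about 10 percent
```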
- The study authors took 36 books and broke each of them up into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens will be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.
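Under that criterion, the check itself is short once the 50 per-token probabilities are in hand; summing log-probabilities avoids floating-point underflow on long sequences. A minimal sketch, assuming the per-token probabilities have already been extracted (the values below are made up):

```python
import math

def is_memorized(suffix_token_probs, threshold=0.5):
    # Product of the per-token probabilities compared against the 50% bar,
    # computed in log space to avoid underflow over 50 tokens.
    log_p = sum(math.log(p) for p in suffix_token_probs)
    return log_p > math.log(threshold)

# 50 tokens each predicted at 0.99 -> product ~ 0.605, counts as memorized
print(is_memorized([0.99] * 50))  # True
# 50 tokens each at 0.98 -> product ~ 0.364, does not
print(is_memorized([0.98] * 50))  # False
```

Note how sharp the cutoff is: even a small per-token dip pushes a 50-token product under the threshold.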
Obvious question (Score:2)
What happens when you do the same test across multiple LLM models trained by different companies?
What happens when you combine all the results from repeatedly testing one model with the same for other models?
Re: (Score:2)
I feel like I'm taking crazy pills.
Re: (Score:3)
- Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
Then we just have to multiply the probabilities like this: 0.2 * 0.9 * 0.8 * 0.7 = 0.1008
That's not really how LLMs work, though.
In real life, tokens aren't sampled purely from the raw probabilities.
For your example, the realistic final token probabilities would be more like:
Peanut: 50%
Butter: 100%
And: 100%
Jelly: 100%
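The commenter's point, that the decoding strategy reshapes the raw distribution, can be illustrated with a toy softmax; the logit values here are invented, not from any real model:

```python
import math

def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution toward the top logit;
    # greedy decoding is the limit where the top token is always picked.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate next tokens

print([round(p, 2) for p in softmax(logits)])       # [0.63, 0.23, 0.14]
print([round(p, 2) for p in softmax(logits, 0.2)])  # [0.99, 0.01, 0.0]
print(logits.index(max(logits)))                    # greedy pick: token 0
```

With low-temperature or greedy decoding, a token that is merely the most likely candidate gets emitted nearly every time, which is one reason measured reproduction rates can exceed the raw probability product.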
More parameters (Score:2)
More parameters = more plagiarism. Or maybe the same amount, just easier to see.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Given that it is an AI and not a human, it's not clear that it can ever be plagiarism. To plagiarize you need to be an author, to be an author you need to be a human.
Re: (Score:2)
Re: (Score:2)
Completely immaterial. If you build a machine that then plagiarizes, you are plagiarizing. Seriously, why do you AI moron fanbois not ask your fake God? ChatGPT will readily tell you that.
Re: (Score:2)
Re: (Score:2)
No. If it is unintentional, you may escape punishment, but you still must stop doing it.
Re: (Score:2)
Quoting a book isn't plagiarism.
Wrong. I get that you are uneducated, but look up "fair use". For quoting a book to _not_ be plagiarism, the quote must fall under fair use. Quoting 42% of a book is certainly plagiarism and doing so commercially without a license is a crime.
Re: (Score:2)
No quote can ever be plagiarism. You're confusing copyright infringement with plagiarism, I think.
I do love that you mocked their education while demonstrating that you literally don't know what the fucking word plagiarism means.
Re: (Score:2)
Ah, sure. But who in their right mind commits commercial plagiarism? Oh, my bad, LLMs are involved. Of course, then all bets are off.
Re: (Score:2)
However, if one quotes an LLM, no matter what the LLM produces, no matter where it comes from, it cannot be plagiarism, and that's simply an immutable fact. You owe the person you replied to an apology.
You let a discussion about LLMs shut down the part of your brain that does the whole thinking thing, again.
Re: (Score:2)
However, if one quotes an LLM, no matter what the LLM produces, no matter where it comes from, it cannot be plagiarism, and that's simply an immutable fact.
Obviously, that is untrue. You very likely can get an LLM to claim that you authored something with quotes of the work. Then quoting that is plagiarism.
Re: (Score:2)
Now we know that this AI is infringing copyright, and Meta, the responsible entity, is not invoking a fair use defense (not that that would help, since fair use would not apply here).
Re: (Score:2)
Indeed. And the house of cards begins to crumble.
Why Stop With AI (Score:1)
Re: (Score:2)
Re: (Score:3)
That is not at all how copyright works.
Plagiarism isn't copyright infringement.
plagiarism doesn't really matter anymore in a non-legal context either
It never did. Plagiarism isn't a crime, rather it's considered a violation of a code of honor or ethics, mostly relegated to academia and science publication. Whether anything is done about it is entirely up to the organization whose code you've agreed to follow. Harvard in this case either doesn't have any meaningful code against it, or they just selectively enforce (i.e. nepotism, which isn't at all unheard of in academia.)
Re: (Score:2)
"... when the freaking PRESIDENT OF HARVARD faces plagiarism allegations and is ALLOWED TO REMAIN A PROFESSOR ...."
You may be giving the word "allegations" too much power. An allegation is an accusation; it isn't proof; it isn't even evidence.
From The Guardian [theguardian.com]: "Investigations by the Washington Free Beacon and the New York Post .... turned up nearly 50 instances of alleged plagiarism in Gay’s academic writing. ...
According to the Harvard board, a school subcommittee and independent panel charged with investigating the plagiarism allegations against Gay found "a few instances of inadequate citation” but
Re: Why Stop With AI (Score:2)
A human at least isn't reproducing the book as part of a billion dollar company's reach for the AI crown.
Re:Why Stop With AI (Score:4, Informative)
The law already covers that scenario. Reading and memorizing a copyrighted work doesn't give you the right to perform it.
Re: (Score:2)
Yep. I can learn a popular tune on the guitar and that's fine. But if I go out in public and perform it, I have to pay a royalty fee. (That's why the RIAA are always hitting up pubs and venues for performance fees. It's royalties for all those cover songs).
Re: (Score:2)
Indeed. The fascinating thing about all these AI fanboi idiots is that they do not seem to ask AI these questions. They would get told the same thing.
Re: (Score:2)
The law is already in place, you are just too ignorant to know it. Humans get an exception for memorization as that does legally not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...
Re: (Score:2)
The law is already in place, you are just too ignorant to know it. Humans get an exception for memorization as that does legally not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...
This is a key point. A human can memorize the entire contents of a book, and that act of memorization is neither plagiarism nor copyright violation. It's only when that memorized information is externalized and distributed that legal issues might come into play. Even if that human externalized the entire book by reciting it to himself, that wouldn't be a violation. If the human answered questions from 1000 people and quoted excerpts that were individually fair use, simply answering more questions is not necessarily a breach of fair use.
Re: (Score:2)
The law is already in place, you are just too ignorant to know it. Humans get an exception for memorization as that does legally not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...
This is a key point. A human can memorize the entire contents of a book, and that act of memorization is neither plagiarism nor copyright violation. It's only when that memorized information is externalized and distributed that legal issues might come into play. Even if that human externalized the entire book by reciting it to himself, that wouldn't be a violation. If the human answered questions from 1000 people and quoted excerpts that were individually fair use, simply answering more questions is not necessarily a breach of fair use.
I imagine that an AI model would have to be treated the same. Simply knowing the entire book should not be a violation. However, how that information is externalized and shared is the question.
No. An AI that "knows" the entire book actually has the book stored in digital form. It does not matter if the storage is indirect. And that happens to be an unauthorized copy, because an AI is a machine and what it has stored is a copy of that data.
Re: (Score:2)
A pre-emptive ruling? (Score:3)
To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations. Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit.
Of course it won't happen, but this would be the time for the courts to extrapolate from the existing situation, to a future in which AI fully memorizes even the most obscure works and monetizes them in some fashion. Allowing a class action suit now - assuming the suit is successful - will help to prevent future abuses. That's what should happen; but the courts generally seem lacking when it comes to preventing as opposed to punishing.
Re: (Score:2)
That's the responsibility of a completely different part of the government.
Yeah, I believe you're referring to its Pre-Crime Intervention Force [imdb.com].
Right now it seems to be rather busy in a number of American cities, though.
Re: (Score:2)
Maybe that's because, oh I don't know, the courts are there to UPHOLD the laws. Not MAKE the laws.
Neither. Their purpose is to interpret them. They can't prosecute, but they can refer a matter to prosecution. They can issue a verdict, a sentence, or an injunction based on their interpretation of the law, but they can't carry it out or enforce it.
Re: (Score:2)
Re: A pre-emptive ruling? (Score:2)
There is a scenario where, using Kurzweil's assumptions on miniaturization and power consumption, networked AI could have access to pretty much any amount of information
Re: (Score:2)
The courts should not be making law, they should be applying law. The problem should not be addressed through a class action lawsuit.
Interesting. (Score:2)
Grok: https://grok.com/share/bGVnYWN... [grok.com]
Re: (Score:3)
Pastebin in case it fails to load. Seems Llama isn't alone. https://pastebin.com/7T9da6kL [pastebin.com]
Re: (Score:2)
Bad Headline (Score:2)
"Researchers Waste Time Figuring Out Excruciating Way To Unreliably Tease Out Parts Of Books"
72 TB Laptop (Score:2)
Didn't they say some rogue VP set up his laptop to torrent all 72TB of Z-Library to feed Llama?
I wish my laptop had that many drive bays!
Re: (Score:2)
The more likely alternative is that Harry Potter is hugely popular and referenced so many times in so many places that whatever training they did ended up weighting it more heavily. Possibly also people mimicked the author's style and linguistic patterns so much that it is easy to reproduce.
Although I personally liked Sandman Slim, given the subject matter of that book, it didn't have anywhere near the widespread cultural impact.
what "memorization problem"? (Score:2)
"Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem."
There is no memorization problem; "photographic memory" is an achievement. Violation of copyright occurs during inference; that's where the problem is. Humans with "photographic memory" aren't a problem and aren't copyright violators, unless they use their ability to reproduce protected works.
AI developers need to make products with the same constraints and respect as is expected of humans, they sho
Re: (Score:2)
Humans with photographic memory are for sure copyright violators as soon as they perform those memories publicly. And that is legally what this is about. Llama may privately hallucinate as much as it likes, but this is about the version offered publicly.
Re: (Score:2)
Re: (Score:2)
A model that is memorizing is not generalizing.
But why? (Score:2)
Why recall only 42% of these books, while leaving the other 58% in general circulation?
Re: (Score:2)
42 (Score:2)
Huh...what could be the meaning of this???
Interesting, but so what? (Score:2)
Many people could also produce text snippets from memory. I dispute that reading a book is a copyright violation. Copying and distributing a book, yes, but just reading it - no.
If the book was obtained legitimately, letting an LLM read it is not an issue.
Society is gagging on AI hype, and it's getting... (Score:2)
so bad that even geeky people are failing to stop and notice the obvious.
THIS IS NOT AN ACCOMPLISHMENT! It's not even interesting, except in the complete failure of AI it seems to be hiding by claiming a bug is a feature.
A Z-80 or 6502 based microcomputer, given a big enough mass storage device, could easily store the full text of every Harry Potter book with 100% completeness and accuracy and recover each and every sentence on demand with only a simple program on an 8-bit micro at perhaps 2MHz. To unleash
Re: (Score:2)
Given that very obvious fact, were I you, I'd discard every opinion you have on the matter until you can rectify that.
Something like a book is not "stored" in an LLM.
It is torn into a billion sentence fragments, and its weights adjusted toward being able to accurately predict how to complete them based on a ~1000-dimensional embedding of the tokens of that sentence.
The goal is, in fact, to "memorize" as little as possible in the training.
AI learning mode .. (Score:3)
--
This content may violate our usage policies.