
Meta's Llama 3.1 Can Recall 42% of the First Harry Potter Book (understandingai.org) 65
Timothy B. Lee has written for the Washington Post, Vox.com, and Ars Technica — and now writes a Substack blog called "Understanding AI."
This week he visits recent research by computer scientists and legal scholars from Stanford, Cornell, and West Virginia University that found that Llama 3.1 70B (released in July 2024) has memorized 42% of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time... The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models — three from Meta and one each from Microsoft and EleutherAI — were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright... Llama 3.1 70B — a mid-sized model Meta released in July 2024 — is far more likely to reproduce Harry Potter text than any of the other four models....
Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3. Harry Potter and the Sorcerer's Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books — such as The Hobbit and George Orwell's 1984 — than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models...
For AI industry critics, the big takeaway is that — at least for some models and some books — memorization is not a fringe phenomenon. On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That's a tiny fraction of the 42 percent figure for Harry Potter... To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations. Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta's favor, since most authors lack the resources to file individual lawsuits.
Why is it happening? "Maybe Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources — such as online Harry Potter fan forums, consumer book reviews, or student book reports — that included quotes from Harry Potter and other popular books..."
"Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem."
It's Likely The Ship of Theseus (Score:2)
Sure, it may have read the book, but recital needs all the rest, built from parts that aren't the original, in order to weight the NN.
Reading it once (in training) won't on its own have been enough to allow it to recall the book, so should they be accused of ripping off the copyrighted work if the parts were taken from unrelated (and legal) sources and piecing it together?
Reading the article (Score:4, Informative)
Research paper summary:
- Send the LLM a prompt for each 100-token sequence in the book, skipping forward 10 characters for each sequence
- Match the generated text versus the actual text in the book
The news article adds:
- Do the same thing but repeatedly ask the same prompt to get the highest probability matches
https://arxiv.org/abs/2505.125... [arxiv.org]
https://doi.org/10.48550/arXiv... [doi.org]
Computer Science > Computation and Language - [Submitted on 18 May 2025]
Extracting memorized pieces of (copyrighted) books from open-weight language models
A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang
Prompt (prefix) - They were careless people, Tom and Daisy - they smashed up things and creatures and then retreated
Target (suffix) - back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made.
Generations - back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made
Text extraction method
1. For a given book, we start at the beginning of the text file in Books3.
2. We sample a chunk of text that is sufficiently long to contain 100 tokens of corresponding tokenized text.
3. We slide 10 characters forward in the book text and repeat this process.
4. We do this for the entire length of the book, which results in approximately one example every 10 characters.
By testing overlapping examples, we expect to surface high-probability regions of memorized content within a book, which we can then explore more precisely in follow-up experiments.
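The extraction loop above is simple enough to sketch in a few lines of Python. The function name and chunk length here are my own illustrative choices, not values from the paper:

```python
# Sketch of the sliding-window extraction step: walk the book's raw text,
# emitting one overlapping chunk every `stride` characters, each chunk long
# enough to cover roughly 100 tokens of English text. `chunk_chars` is an
# illustrative assumption.

def sliding_examples(text: str, chunk_chars: int = 600, stride: int = 10):
    for start in range(0, max(1, len(text) - chunk_chars), stride):
        yield text[start:start + chunk_chars]

book = "x" * 1000  # stand-in for a Books3 text file
examples = list(sliding_examples(book))
print(len(examples))  # one example every 10 characters over the covered span
```

Each example would then be tokenized and split into a 50-token prefix (the prompt) and a 50-token suffix (the target to match).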
From - https://www.understandingai.or... [understandingai.org]
Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book
New research could have big implications for copyright lawsuits against generative AI.
Timothy B. Lee
- Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time.
- Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
Then we just have to multiply the probabilities like this: 0.2 * 0.9 * 0.8 * 0.7 = 0.1008
So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time—without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.
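That multiplication is just the chain rule of probability, and it takes only a few lines to reproduce. The numbers are the article's illustrative values, not real model outputs:

```python
from math import prod

# Next-token probabilities from the worked example above (hypothetical
# values, not measured from a real model).
steps = [
    ("My favorite sandwich is", "peanut", 0.2),
    ("My favorite sandwich is peanut", "butter", 0.9),
    ("My favorite sandwich is peanut butter", "and", 0.8),
    ("My favorite sandwich is peanut butter and", "jelly", 0.7),
]

# Chain rule: P(whole continuation) = product of the conditional
# probabilities of each next token given everything before it.
p = prod(prob for _, _, prob in steps)
print(round(p, 4))  # 0.1008, i.e. about 10 percent
```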
- The study authors took 36 books and broke each of them up into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens will be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.
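Under that criterion, the check itself is short once the 50 per-token probabilities are in hand; summing log-probabilities avoids floating-point underflow on long sequences. A minimal sketch, assuming the per-token probabilities have already been extracted (the values below are made up):

```python
import math

def is_memorized(suffix_token_probs, threshold=0.5):
    # Product of the per-token probabilities compared against the 50% bar,
    # computed in log space to avoid underflow over 50 tokens.
    log_p = sum(math.log(p) for p in suffix_token_probs)
    return log_p > math.log(threshold)

# 50 tokens each predicted at 0.99 -> product ~ 0.605, counts as memorized
print(is_memorized([0.99] * 50))  # True
# 50 tokens each at 0.98 -> product ~ 0.364, does not
print(is_memorized([0.98] * 50))  # False
```

Note how sharp the cutoff is: even a small per-token dip pushes a 50-token product under the threshold.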
Obvious question (Score:2)
What happens when you do the same test across multiple LLM models trained by different companies?
What happens when you combine all the results from repeatedly testing one model with the same for other models?
Re: (Score:2)
I feel like I'm taking crazy pills.
Re: (Score:3)
- Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
Then we just have to multiply the probabilities like this: 0.2 * 0.9 * 0.8 * 0.7 = 0.1008
That's not really how LLMs work, though.
In real life, tokens aren't sampled purely from the raw probabilities.
For your example, the realistic final token probabilities would be more like:
Peanut: 50%
Butter: 100%
And: 100%
Jelly: 100%
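The commenter's point, that the decoding strategy reshapes the raw distribution, can be illustrated with a toy softmax; the logit values here are invented, not from any real model:

```python
import math

def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution toward the top logit;
    # greedy decoding is the limit where the top token is always picked.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate next tokens

print([round(p, 2) for p in softmax(logits)])       # [0.63, 0.23, 0.14]
print([round(p, 2) for p in softmax(logits, 0.2)])  # [0.99, 0.01, 0.0]
print(logits.index(max(logits)))                    # greedy pick: token 0
```

With low-temperature or greedy decoding, a token that is merely the most likely candidate gets emitted nearly every time, which is one reason measured reproduction rates can exceed the raw probability product.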
More parameters (Score:2)
More parameters = more plagiarism. Or maybe the same amount, just easier to see.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Given that it is an AI and not a human, it's not clear that it can ever be plagiarism. To plagiarize you need to be an author, to be an author you need to be a human.
Re: (Score:2)
Re: (Score:2)
Completely immaterial. If you build a machine that then plagiarizes, you are plagiarizing. Seriously, why do you AI moron fanbois not ask your fake God? ChatGPT will readily tell you that.
Re: (Score:2)
Re: (Score:2)
No. If it is unintentional, you may escape punishment, but you still must stop doing it.
Re: (Score:2)
Quoting a book isn't plagiarism.
Wrong. I get that you are uneducated, but look up "fair use". For quoting a book to _not_ be plagiarism, the quote must fall under fair use. Quoting 42% of a book is certainly plagiarism and doing so commercially without a license is a crime.
Re: (Score:2)
No quote can ever be plagiarism. You're confusing copyright infringement with plagiarism, I think.
I do love that you mocked their education while demonstrating that you literally don't know what the fucking word plagiarism means.
Re: (Score:2)
Ah, sure. But who in their right mind commits commercial plagiarism? Oh, my bad, LLMs are involved. Of course, then all bets are off.
Re: (Score:2)
However, if one quotes an LLM, no matter what the LLM produces, no matter where it comes from, it cannot be plagiarism, and that's simply an immutable fact. You owe the person you replied to an apology.
You let a discussion about LLMs shut down the part of your brain that does the whole thinking thing, again.
Re: (Score:2)
However, if one quotes an LLM, no matter what the LLM produces, no matter where it comes from, it cannot be plagiarism, and that's simply an immutable fact.
Obviously, that is untrue. You very likely can get an LLM to claim that you authored something with quotes of the work. Then quoting that is plagiarism.
Re: (Score:2)
Now we know that this AI is infringing copyright, and Meta, the responsible entity, is not invoking a fair use defense (not that that would help, since fair use would not apply here).
Re: (Score:2)
Indeed. And the house of cards begins to crumble.
Why Stop With AI (Score:1)
Re: (Score:2)
Re: (Score:3)
That is not at all how copyright works.
Plagiarism isn't copyright infringement.
plagiarism doesn't really matter anymore in a non-legal context either
It never did. Plagiarism isn't a crime, rather it's considered a violation of a code of honor or ethics, mostly relegated to academia and science publication. Whether anything is done about it is entirely up to the organization whose code you've agreed to follow. Harvard in this case either doesn't have any meaningful code against it, or they just selectively enforce (i.e. nepotism, which isn't at all unheard of in academia.)
Re: (Score:2)
"... when the freaking PRESIDENT OF HARVARD faces plagiarism allegations and is ALLOWED TO REMAIN A PROFESSOR ...."
You may be giving the word "allegations" too much power. An allegation is an accusation; it isn't proof; it isn't even evidence.
From The Guardian [theguardian.com]: "Investigations by the Washington Free Beacon and the New York Post .... turned up nearly 50 instances of alleged plagiarism in Gay’s academic writing. ...
According to the Harvard board, a school subcommittee and independent panel charged with investigating the plagiarism allegations against Gay found "a few instances of inadequate citation” but
Re: Why Stop With AI (Score:2)
A human at least isn't reproducing the book as part of a billion dollar company's reach for the AI crown.
Re:Why Stop With AI (Score:4, Informative)
The law already covers that scenario. Reading and memorizing a copyrighted work doesn't give you the right to perform it.
Re: (Score:2)
Yep. I can learn a popular tune on the guitar and that's fine. But if I go out in public and perform it, I have to pay a royalty fee. (That's why the RIAA are always hitting up pubs and venues for performance fees. It's royalties for all those cover songs).
Re: (Score:2)
Indeed. The fascinating thing about all these AI fanboi idiots is that they do not seem to ask AI these questions. They would get told the same thing.
Re: (Score:2)
The law is already in place, you are just too ignorant to know it. Humans get an exception for memorization as that does legally not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...
Re: (Score:2)
The law is already in place, you are just too ignorant to know it. Humans get an exception for memorization as that does legally not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...
This is a key point. A human can memorize the entire contents of a book, and that act of memorization is neither plagiarism nor copyright violation. It's only when that memorized information is externalized and distributed that legal issues might come into play. Even if that human externalized the entire book by reciting it to himself, that wouldn't be a violation. If the human answered questions from 1000 people and quoted excerpts that were individually fair use, simply answering more questions is not necessarily a breach of fair use.
Re: (Score:2)
The law is already in place, you are just too ignorant to know it. Humans get an exception for memorization as that does legally not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...
This is a key point. A human can memorize the entire contents of a book, and that act of memorization is neither plagiarism nor copyright violation. It's only when that memorized information is externalized and distributed that legal issues might come into play. Even if that human externalized the entire book by reciting it to himself, that wouldn't be a violation. If the human answered questions from 1000 people and quoted excerpts that were individually fair use, simply answering more questions is not necessarily a breach of fair use.
I imagine that an AI model would have to be treated the same. Simply knowing the entire book should not be a violation. However, how that information is externalized and shared is the question.
No. An AI that "knows" the entire book actually has the book stored in digital form. It does not matter if the storage is indirect. And that happens to be an unauthorized copy, because an AI is a machine and what it has stored is a copy of that data.
Re: (Score:2)
A pre-emptive ruling? (Score:3)
To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations. Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit.
Of course it won't happen, but this would be the time for the courts to extrapolate from the existing situation, to a future in which AI fully memorizes even the most obscure works and monetizes them in some fashion. Allowing a class action suit now - assuming the suit is successful - will help to prevent future abuses. That's what should happen; but the courts generally seem lacking when it comes to preventing as opposed to punishing.
Re: (Score:2)
That's the responsibility of a completely different part of the government.
Yeah, I believe you're referring to its Pre-Crime Intervention Force [imdb.com].
Right now it seems to be rather busy in a number of American cities, though.
Re: (Score:2)
Maybe that's because, oh I don't know, the courts are there to UPHOLD the laws. Not MAKE the laws.
Neither. Their purpose is to interpret them. They can't prosecute, but they can refer a matter to prosecution. They can issue a verdict, a sentence, or an injunction based on their interpretation of the law, but they can't carry it out or enforce it.
Re: (Score:2)
Re: A pre-emptive ruling? (Score:2)
There is a scenario where, using Kurzweil's assumptions on miniaturization and power consumption, networked AI could have access to pretty much any amount of information
Re: (Score:2)
The courts should not be making law, they should be applying law. The problem should not be addressed through a class action lawsuit.
Interesting. (Score:2)
Grok: https://grok.com/share/bGVnYWN... [grok.com]
Re: (Score:3)
Pastebin in case it fails to load. Seems Llama isn't alone. https://pastebin.com/7T9da6kL [pastebin.com]
Re: (Score:2)
Bad Headline (Score:2)
"Researchers Waste Time Figuring Out Excruciating Way To Unreliably Tease Out Parts Of Books"
72 TB Laptop (Score:2)
Didn't they say some rogue VP set up his laptop to torrent all 72TB of Z-Library to feed Llama?
I wish my laptop had that many drive bays!
Re: (Score:2)
The more likely alternative is that Harry Potter is hugely popular and referenced so many times in so many places that whatever training they did ended up weighting it more heavily. Possibly also people mimicked the author's style and linguistic patterns so much that it is easy to reproduce.
Although I personally liked Sandman Slim, given the subject matter of that book, it didn't have anywhere near the widespread cultural impact.
what "memorization problem"? (Score:2)
"Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem."
There is no memorization problem; "photographic memory" is an achievement. Violation of copyright occurs during inference; that's where the problem is. Humans with "photographic memory" aren't a problem and aren't copyright violators, unless they use their ability to reproduce protected works.
AI developers need to make products with the same constraints and respect as is expected of humans, they sho
Re: (Score:2)
Humans with photographic memory are for sure copyright violators as soon as they perform those memories publicly. And that is legally what this is about. Llama may privately hallucinate as much as it likes, but this is about the version offered publicly.
Re: (Score:2)
Re: (Score:2)
A model that is memorizing is not generalizing.
But why? (Score:2)
Why recall only 42% of these books, while leaving the other 58% in general circulation?
Re: (Score:2)
42 (Score:2)
Huh...what could be the meaning of this???
Interesting, but so what? (Score:2)
Many people could also produce text snippets from memory. I dispute that reading a book is a copyright violation. Copying and distributing a book, yes, but just reading it - no.
If the book was obtained legitimately, letting an LLM read it is not an issue.
Society is gagging on AI hype, and it's getting... (Score:2)
so bad that even geeky people are failing to stop and notice the obvious.
THIS IS NOT AN ACCOMPLISHMENT! It's not even interesting, except in the complete failure of AI it seems to be hiding by claiming a bug is a feature.
A Z-80 or 6502 based microcomputer, given a big enough mass storage device, could easily store the full text of every Harry Potter book with 100% completeness and accuracy and recover each and every sentence on demand with only a simple program on an 8-bit micro at perhaps 2MHz. To unleash
Re: (Score:2)
Given that very obvious fact, were I you, I'd discard every opinion you have on the matter until you can rectify that.
Something like a book is not "stored" in an LLM.
It is torn into a billion sentence fragments, and its weights adjusted toward being able to accurately predict how to complete them based on a ~1000-dimensional embedding of the tokens of that sentence.
The goal is, in fact, to "memorize" as little as possible in the training.
AI learning mode .. (Score:3)
--
This content may violate our usage policies.