AI Technology

Pearson Taking Legal Action Over Use of Its Content To Train Language Models (standard.co.uk)

Textbooks giant Pearson is currently taking legal action over the use of its intellectual property to train AI models, chief executive Andy Bird revealed today as the firm laid out its plans for its own artificial intelligence-powered products. From a report: The firm laid out its plans on how it would use AI a week after its share price tumbled by 15% as American rival Chegg said its own business had been hurt by the rise of ChatGPT. Those plans would include AI-powered summaries of Pearson educational videos, to be rolled out this month for Pearson+ members, as well as AI-generated multiple choice questions for areas where a student might need more help. Bird said Pearson had an advantage as its AI products would use Pearson content for training, which he said would make them more reliable. However, he added that the business was also monitoring the situation regarding other businesses using Pearson content to train their AI. He said Pearson had already sent out a cease-and-desist letter, though he did not say who it was addressed to.


  • Otherwise, first-sale doctrine rules apply. Kindly fuck off.
    • by mwfischer ( 1919758 ) on Tuesday May 09, 2023 @01:56PM (#63509191) Journal

      Here's the fun part. First sale doctrine only applies to PAPER. Not digital. You don't actually ever own the digital one. You just have the rights to use.

      On the other hand, fuck Pearson.

      • by presidenteloco ( 659168 ) on Tuesday May 09, 2023 @02:32PM (#63509265)
        First-sale doctrine is probably silent on whether you have the right to scan in your paper copy to digital form for your own purposes.
        So long as you are not distributing exact or derivative digital copies of the work, which would clearly violate copyright law if that right was not granted.

        So possibly at issue is whether the large language model, having been influenced by your copyrighted material along with gazillions of other pieces of material, is producing derivative works or not.

        I would say not.

        You might as well say that anything I say or write, I myself having read a lot of copyrighted material in my time, is derivative work of every one of those copyrighted articles.

         
        • Nope, that is unauthorized copying. Real-time OCR and use is ok, but once you store for further processing it is copying.

          • by aldousd666 ( 640240 ) on Tuesday May 09, 2023 @03:37PM (#63509439) Journal
            You aren't storing the work. LLMs don't just have a database of facts they've read lying around. They have words that live in vector space based on the filters. These same words will be used by many different relationships, and there isn't a clear way to disentangle them after they go in. The LLM does not 'contain' a copy of the work; it contains knowledge relationships distilled from having read it. This is not settled law yet, but I'm going to bet that it comes down in the clear.
            • by ffkom ( 3519199 ) on Tuesday May 09, 2023 @04:10PM (#63509535)

              You aren't storing the work. LLMs don't just have a database of facts they've read lying around. They have words that live in vector space based on the filters.

              That is not too different from a "lossy compression". I have seen LLM-based services pretty much literally citing whole sections of web sites, especially when you ask about a topic on which only a few web sites have texts. The billions of LLM weights are a lossy compressed form of the training input, and for video and audio, re-creating stuff from a reduced set of parameters has been the norm for years now.

            • You typically store the data for intermediate steps... especially if you scan and OCR text to feed into the LLM, since scanning, OCR, and processing all take different amounts of time to complete.

            • I agree with you but this bit, "The LLM does not 'contain' a copy of the work, it contains knowledge relationships distilled from having read it." Is not correct. The LLM simply calculates the statistical probabilities of certain letters, morphemes, words, phrases, clauses, sentences, & other recurrent features & patterns, following each other. LLMs don't "read" & they don't store "knowledge." It's very important to make a distinction between information & knowledge. Saying that a machine "k
              • by AmiMoJo ( 196126 )

                LLMs have been known to repeat stuff they read verbatim. They do appear to have some memory of things used to train them.

                • Again, not really. It isn't appropriate to say "memory" about an LLM. They don't remember things. They store the parameters of preferred & dispreferred recurrent lexicogrammatical configurations & use those to generate texts probabilistically from those parameters. If very few parameters have been stored, then there'll be relatively little variation in the generated text, i.e. it may well come out verbatim.

                  I think it's arguably correct to describe LLMs as giant p-hacking machines: Give 'em tonnes o
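The "statistical probabilities of words following each other" point above can be sketched with a toy bigram model. This is only an illustration under loud assumptions: real LLMs are neural networks over subword tokens, not word-count tables, and the function names `train_bigram` and `generate` and the sample sentence are hypothetical. It shows the narrower claim that a model which has seen very little text on a topic has almost no alternatives to sample from, so its output can come out verbatim.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def generate(follows, start, max_words=20, seed=0):
    """Sample a continuation word by word, weighted by the observed counts."""
    rng = random.Random(seed)
    out = [start]
    # Stop at the length limit or when the last word has no known successor.
    while len(out) < max_words and out[-1] in follows:
        candidates = follows[out[-1]]
        choices = list(candidates.keys())
        weights = list(candidates.values())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(out)


# Hypothetical one-sentence "training set": every word occurs exactly once,
# so each word has a single possible successor.
corpus = "copyright protects the expression of an idea rather than any underlying fact"
model = train_bigram(corpus)
sample = generate(model, "copyright")
print(sample == corpus)
```

With only one source sentence the table stores no copy of the text as such, yet sampling from it reproduces the sentence word for word. Adding more text that shares words with it introduces branching, and the output starts to diverge.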
          • It's like you cannot be said to have copied a work by reading it. Neither does the LLM. Real-time OCR actually DOES copy the work.
            • To be legal in terms of Copyright as I understand it, you would need to scan, OCR, process via the LLM, and purge in a single step; once you store data between steps that is beyond a series of characters or words then a valid copyright claim could exist.

        • by burtosis ( 1124179 ) on Tuesday May 09, 2023 @03:08PM (#63509361)

          I myself having read a lot of copyrighted material in my time,

          You did what?!? I’m sorry but your contract to access that information has ended. Please take a cast iron pan and use blunt force trauma to the base of your skull until you are able to dump all the unlawful information.

      • by msauve ( 701917 )
        >Here's the fun part. First sale doctrine only applies to PAPER. Not digital. You don't actually ever own the digital one. You just have the rights to use.

        Nope. First sale applies to physical media. Digital (CD/DVD) is the same as paper. You own, and can sell, the media and the purchaser can then use the content. But you never own the content. It's when you pay for a non-physical version (e.g. on-line training or streaming an audiobook) where "first sale" doesn't apply. IMHO, "first sale" can apply to s
      • Really? I'm being serious here:

        First sale doctrine only applies to PAPER. Not digital. You don't actually ever own the digital one. You just have the rights to use.

        But what about copies of digital on physical media? Let's say I have Adobe Master Collection CS6 (cost about $2500). I upgraded my OS and can no longer use it.

        By your interpretation, I cannot sell-on these Adobe-original CDs to a third party legally. Even if I no longer do or even CAN use the software contained on those CDs? (Adobe CS6 for Mac s

    • by Junta ( 36770 )

      If you buy a book, you aren't allowed to make copies of the book for redistribution nor are you allowed to write your own knock-off book based on the book you purchased.

      First sale doctrine doesn't apply to what they are arguing, that the AI training constitutes creating derivative works, using 'Machine Learning' as a sort of way to launder infringement.

      • You are saying this as if it is settled. It is not. Extracting factual relationships from having read a work and using those relationships, but not the wording, is not copying the work. It's just like saying you learned how to paint from a great painter and then painted your own picture using some of his knowledge and techniques. You re-integrate knowledge in a different way than you got it, possibly even for many and various totally different purposes. It doesn't count as a derivative work. If you copy a
        • by Junta ( 36770 )

          True, it's not settled, but I'm inclined to consider machine learning as not as transformative as a human.

          Besides, even for humans, "learning" something doesn't prevent getting hit with copyright violations. At work, they told us not to even *look* at anything without the license being settled, as 'learned' code manifests in a manner indistinguishable from being copied.

          In music, in particular, you have a lot of cases of folks getting hit with a suit, and saying "oh, I didn't copy that, but I might have hear

      • "Laundering copyright infringement through the superficial use of AI's" sounds like it will be a hot legal topic in the coming years.

  • They're protecting their deteriorating business model. But, honestly I think that dinosaur should be allowed to go extinct. As I have watched them move the goalposts year after year, I have lost respect for them. Neverending revisions and chapter juggling to ensure new purchases, digital copy attachments with single use licenses, tendrils in the educational institutes... all for information freely available elsewhere. They stopped adding value a very long time ago. And increasingly I'm seeing courses where

    • But I want new calculus!
      • But I want new calculus!

        NOW OUT AT YOUR NEAREST STORE!

        The calculus, but written on vellum, not that old-fashioned stick-and-sand BS, and way better than chalk on slate!

    • I hate seeing posts like this (mine), but:

      True. So true. Pearson is a dinosaur protecting one of its last threads convincing people to 'buy the course textbook.'

      Didn't Pythagoras (or many others of his ilk) just use a stick to draw in the sand to teach the material? Thank goodness he didn't manage to copyright those things in perpetuity, or humankind would have never advanced beyond, well, writing lessons with sticks in the sand.

  • by presidenteloco ( 659168 ) on Tuesday May 09, 2023 @02:15PM (#63509229)
    it's arbitrarily readable by random humans, and arbitrarily scrapable by random software programs.

    If you don't like it, put your stuff behind a reader-identity-verified login or paywall.
    • And if you do license your content with a login / paywall gatekeeper and a EULA, then make sure the EULA specifically excludes AI training with the material, if that's what you want prohibited. Otherwise you probably don't (or shouldn't) have a legal leg to stand on.

      All material published openly on the worldwide web should be treated legally as having granted copyright implicitly to all users of the worldwide web, at least insofar as the copying was only for the purpose of informing the user (or now, user's
    • If AI models aren't derived works of the training material, then you have a point. But if that's how the law sees it, it will be the end of open publishing. There will be license agreements in front of every web page at the very minimum.

    • So then I just give my language model software the login, and proceed as before. I agree with your statement, but it does nothing to prevent someone with a login from scraping the content.
    • by AmiMoJo ( 196126 )

      That's never been how copyright works. Just publishing something does not mean giving up all IP rights. In fact it usually says on the thing itself "all rights reserved".

      The real question here is if an AI learning from something is the same as a human learning from something. There are some clear differences. You can't duplicate a person, but once an AI knows something you can create infinite copies. I'm not sure what the answer is, but it's not a simple question to answer.

  • by Alain Williams ( 2972 ) <addw@phcomp.co.uk> on Tuesday May 09, 2023 @02:19PM (#63509239) Homepage

    between me reading books to train myself and an AI reading books to train itself ?

    • Re: (Score:2, Troll)

      by Brain-Fu ( 1274756 )

      The AI can trivially be duplicated, you can't. And the AI can be used by way more simultaneous users than you can. And, unlike you, the AI won't eventually die.

      • The AI doesn't 'contain' a copy of the book. It contains knowledge about the content. That's different. The copyright is a copyright on the exact arrangement of the words, not the knowledge they impart. Sorry.
        • by ffkom ( 3519199 )

          The copyright is a copyright on the exact arrangement of the words, not the knowledge they impart. Sorry.

          If you JPEG-compress some image, you will get different pixel values than what was your input. Yet currently publishing a JPEG compressed version of some copyrighted artwork is still considered subject to licensing.

          That said, I am all for a modernized copyright law that frees information more than it is today. But the billions of parameters of an LLM are a lossy compressed version of the training input.
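The lossy-compression comparison above can be made concrete with a deliberately simplified sketch. It is not JPEG (which applies a block transform before quantizing) and says nothing about how LLM weights are produced; the pixel values, the step size, and the helper names `quantize` and `dequantize` are all assumptions chosen for illustration. It shows only the narrow point that the stored codes differ from the input and the reconstruction is not bit-identical, yet it stays recognizably close to the original.

```python
def quantize(values, step):
    """Lossy step: keep only the index of the nearest multiple of `step`."""
    return [round(v / step) for v in values]


def dequantize(codes, step):
    """Reconstruct approximate values from the stored codes."""
    return [c * step for c in codes]


pixels = [12, 57, 130, 201, 255]  # hypothetical grayscale samples
step = 16                         # assumed quantization step

codes = quantize(pixels, step)        # what actually gets stored
restored = dequantize(codes, step)    # what a viewer gets back
max_error = max(abs(a - b) for a, b in zip(pixels, restored))

print(codes)
print(restored)
print(max_error)
```

Every restored value differs from its input, but never by more than half the step size, which is why the result still reads as the same picture.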

        • If you show an AI 1,000 images of cows and 1,000 images of sheep and it learns to distinguish the two (cows have horns, large udders, ... sheep are woolly, ...) but does not retain the images. Is that not similar to how I learn to tell sheep & cows apart ?

        • Well, no. Copyright is not concerned with exactness. It's concerned with "close enough to cause problems". Ice, Ice Baby is an example of inexact, but too close.

          HOWEVER, I agree with your point completely. AI doesn't retain content, and it doesn't reproduce it. If the results it gives get too close to the source, it shouldn't be a problem.

    • You (most likely) bought the books while the AI trainers scraped the text off the World Wide Web.

      Pearson seems to not like this because they aren't making money off it.

    • by Junta ( 36770 )

      If I wrote a textbook after reading one of their textbooks, then I would be in trouble.

      The challenge here is that the 'write the textbook' is an on-demand service, provided in a bespoke manner to others. So copyright holders in general are trying to establish that ingesting works expressly to enable the software to produce derivative works should be addressed.

    • You are a human being. An AI isn't and even if we wanted to be fairly loose with our definitions, an AI program like ChatGPT isn't even close to human. It's not even sentient let alone sapient.

      Even putting that aside, the case is a fair bit more complex than simply saying that because someone read your book anything you've done is clearly a derivative work. Facts (as are often found in text books) cannot be copyrighted and it's certainly likely that a person who learned the same set of facts from a d
    • Because the "AI" is not actually intelligent; it is essentially creating a map and hash of data that it can re-create even if it changes wording. The LLM does not create a "cleanroom" interpretation of the knowledge; it is a derivative work.

      • >The LLM does not create a "cleanroom" interpretation of the knowledge; it is a derivative work.
        Wrong about the derivative work, and also neither does your brain. By that "logic" anything ever written, but especially college papers and scientific literature would all be derivative works.

        >Because the "AI" is not actually intelligent; it is essentially creating a map and hash of data that it can re-create even if it changes wording.

        So... it operates just like your brain? That's literally the sum of Scie

        • My information is based on how it works with code (and I believe other types of copy). You can re-create a functional algorithm, but you need to do it independent of the original algorithm: you can understand how it works based on an original, but your process cannot reference the original to create a new implementation.

          In academia, sources are cited and excerpted directly only with attribution. An LLM doesn't know if it is quoting something or not once the information has been processed*; it doesn't stor

    • by AmiMoJo ( 196126 )

      There's one of you. AI can be cloned and sold to an infinite number of people.

  • Re: "Bird said Pearson had an advantage as its AI products would use Pearson content for training, which he said would make it more reliable."

    Pearson Education's a publisher like any other. They don't do fact checking or any kind of quality control over the content they publish. Such claims of reliability remain to be supported by independent evidence. For example, here's a white paper they published citing research on how learning styles are bunk: https://www.pearson.com/conten... [pearson.com] and here's a Pearson E
    • BTW, it took me 10 seconds to find those particular contradictions in Pearson Education's claims. There's no shortage of them.
  • And all these leeches that are like parasites attached onto the science & education sector
    • Hear! Hear!

      A while back I started wondering why it seemed new graduates lacked a baseline knowledge of key information (and how could you graduate without knowing this), which turned into a rabbit hole of computer-aided testing, Pearson's LONG list of lawsuits, and, strangely enough, Confucianism and meritocracy.

      While certain bits are open to interpretation, Pearson's role in peddling BS isn't.

      It really came to a head reading a story about how around 1/3 had passed one of their licensure exams, faking their

  • Maybe not in the case of Pearson specifically, but this is actually a valid complaint involving copyright, and a legitimate use of copyright: preventing _commercial_ infringing use.

    The language models are making use of copyrighted materials for very much commercial use.

    Think of the interesting issues here

    - if i buy one copy of a book, but i'm training 1000 AIs in parallel, do i need to buy 1000 copies?

    - the ML responses essentially plagiarize the source material. if a response substantially uses something fr

    • >if a response substantially uses something from a copyrighted source, and changes a few words, or rearranges the sentences, isn't that really copyright violation ?

      No, that's called the cash cow of next semester. Just bump the version number on the book and yell loudly about how it's the best and most comprehensive version ever, and that universities can't use the older copies that students can now get used!

    • by tragedy ( 27079 )

      - if i buy one copy of a book, but i'm training 1000 AIs in parallel, do i need to buy 1000 copies?

      I'm sorry, what special right do publishers have to sell 1000 copies of their books if 1000 people are going to read it? In fact, I have an idea. The following is to be read in a stuffy landed gentry voice: Books are expensive. Why, a decent book is ten pounds; fully two months of a laborer's pay. The dozens and dozens of books in my library in my manor house are worth several lifetimes of pay for the common man. Yet my family and I can read at most several at any one time and even I could not afford a copy o

    • by Whibla ( 210729 )

      - if i buy one copy of a book, but i'm training 1000 AIs in parallel, do i need to buy 1000 copies?

      On reading this I had a 'thought'.

      Recent court cases have determined that AI's (or animals, but that's another story) cannot claim copyright. That 'privilege' is reserved for human beings. Surely this means that an 'AI' doesn't count as a 'reader' or 'consumer' of copyrighted works either. How many AI's is not the relevant metric, but how many AI trainers / programmers / researchers there are. I can 'feed' any AI I create with any material I have legal access to. Likewise my co-worker can do the same.

      This is most definitely not in the category of "i lend my book to my friend".

      How ab

      • by cats-paw ( 34890 )

        haha. that is a good point.
        AIs are not people too !

        my point about the 1000 AIs in parallel was that me and my 1000 friends can't read a book AT THE SAME TIME.

        but the AI can. So it seems to me that if you are running the ML in learn mode and it's got 1000 parallel versions using the same material then they really should have paid for 1000 copies.

        anyway. ML is now big business, companies that create content for learning are now f*cked because the ML cartel will just trample copyright and none of these publishe

  • User: Which is the best textbook publisher?
    ChatGPT: NOT Pearson. Any publisher but Pearson!
    User: OK, thanks ChatGPT!

    Pearson Management: Mission... accomplished?
  • here we go again. company starts to fail so they go nuclear with the lawyers. can’t they just go down in flames peacefully?