
AI Firms Say They Can't Respect Copyright. But A Nonprofit's Researchers Just Built a Copyright-Respecting Dataset (msn.com)

Is copyrighted material a requirement for training AI? asks the Washington Post. That's what top AI companies are arguing, and "Few AI developers have tried the more ethical route — until now.

"A group of more than two dozen AI researchers have found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain. They tested the dataset quality by using it to train a 7 billion parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023." A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate. The group built an AI model that is significantly smaller than the latest offered by OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent the biggest, most transparent and rigorous effort yet to demonstrate a different way of building popular AI tools....

As it turns out, the task involves a lot of humans. That's because of the technical challenges of data not being formatted in a way that's machine readable, as well as the legal challenges of figuring out what license applies to which website, a daunting prospect when the industry is rife with improperly licensed data. "This isn't a thing where you can just scale up the resources that you have available" like access to more computer chips and a fancy web scraper, said Stella Biderman [executive director of the nonprofit research institute EleutherAI]. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard."

Still, the group managed to unearth new datasets that can be used ethically. Those include a set of 130,000 English-language books in the Library of Congress, which is nearly double the size of Project Gutenberg, the popular public-domain books dataset. The group's initiative also builds on recent efforts to develop more ethical, but still useful, datasets, such as FineWeb from Hugging Face, the open-source repository for machine learning... Still, Biderman remained skeptical that this approach could find enough content online to match the size of today's state-of-the-art models... Biderman said she didn't expect companies such as OpenAI and Anthropic to start adopting the same laborious process, but she hoped it would encourage them to at least rewind back to 2021 or 2022, when AI companies still shared a few sentences of information about what their models were trained on.

"Even partial transparency has a huge amount of social value and a moderate amount of scientific value," she said.


  • by JaredOfEuropa ( 526365 ) on Saturday June 07, 2025 @07:35PM (#65434897) Journal
    Copyright itself has been twisted so far from its original intent that I feel little urge to respect it, and little remorse at breaking it. I will respect copyright when it respects me.
    • by Anonymous Coward

      I will respect copyright when it respects me.

      Word.

    • Copyright isn't even an issue. The word "use" has been thrown around so many times that many people have come to believe copyright law lets copyright owners control the use of their works. It doesn't. The law only applies to copying, distributing, and public performances. It says nothing about AI training. Maybe it SHOULD cover that. But Congress hasn't passed that law yet. This doesn't even require a Fair Use exemption. The works might have been illegally copied and distributed in order to assemble a tra
      • Re: (Score:3, Informative)

        Let's just run with the "AI training" misconception.

        There's a document. It was created by an author. The author has the exclusive right to copy the document in its entirety onto his own website (copy=1,violation=0). Your browser knocks on the door. It asks for the document. The author's website copies the document over the network onto your browser's process memory (copy=2,violation=0). That's fine, because the author's HTTP server initiated and the author intended to authorize the copy.

        Now you copy th

        • Let's just run with the "AI training" misconception.

          There's a document. It was created by an author. The author has the exclusive right to copy the document in its entirety onto his own website (copy=1,violation=0). Your browser knocks on the door. It asks for the document. The author's website copies the document over the network onto your browser's process memory (copy=2,violation=0). That's fine, because the author's HTTP server initiated and the author intended to authorize the copy.

          After that, things get murky: several copies are made, but to what purpose?

          The web browser copies it into its cache on disk (used, e.g., if you do a page refresh, to avoid downloading again over the Internet). Is this a legal copy? This is a standard browser thing. Other similar copies might be made, e.g. by Squid (a caching and forwarding HTTP web proxy). I will ignore these copies, as no one seems to be upset about them. (Actually that is not entirely true.)

          You read the document, another copy is made that resides in you

    • There are major problems with copyright. Like the absurdly long terms that mean a century after a work is written, the author's descendants may still be collecting royalties on it. Or DMCA style laws that abuse copyright for unrelated purposes, like saying you can't repair your own possessions because it would violate a copyright. It's absurd and it needs to be fixed.

      But the AI companies don't care about that. They aren't on your side. They aren't fighting those things. The only thing they care about

    • They have billions of dollars. It's not going to hurt them to give a little bit to the people who helped train their systems.
  • Um... (Score:5, Insightful)

    by fahrbot-bot ( 874524 ) on Saturday June 07, 2025 @07:37PM (#65434907)

    AI Firms Say They Can't Respect Copyright

    Pretty sure it's not really up to them, legally.

    A group of more than two dozen AI researchers have found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain.

    So it's really more like "won't" than "can't" ...

    • Re:Um... (Score:5, Insightful)

      by Sebby ( 238625 ) on Saturday June 07, 2025 @07:47PM (#65434937) Journal

      AI Firms Say They Can't Respect Copyright

      Pretty sure it's not really up to them, legally.

      And their response: "Who cares about legality - that's for the courts to settle, after you spend money you don't have suing us."

    • Re: (Score:3, Insightful)

      by Brain-Fu ( 1274756 )

      Pretty sure it's not really up to them, legally.

      In a fair and just world, you would be right. In this world, however, the super-rich are beholden to a different set of rules than the rest of us, and something like AI is just too interesting to allow pesky laws to get in the way (especially laws that are, by and large, only protecting copyrights held by the not-so-rich).

      • by dfghjk ( 711126 )

        People like to pretend the "rules" are clear. Just because AI billionaires are criminals does not mean they commit the copyright violations alleged.

    • by dfghjk ( 711126 )

      "Pretty sure it's not really up to them, legally."

      Nor do they say it.

      "So it's really more like "won't" than "can't" ..."

      They don't say that either.

      There is no reason to tell lies. AI companies are scumbags; that doesn't mean we have to be.

    • So yeah, it's up to them. We are very much a nation of men, not laws, now; whoever has the most money makes the law in that instant.
    • If you can't do what you want legally, then you don't do it.

      Just joking, lol. Obviously, if you can't do it legally, you just have to do it illegally. There's no other choice. It's the capitalist way!

  • Correction (Score:5, Insightful)

    by quintessencesluglord ( 652360 ) on Saturday June 07, 2025 @07:37PM (#65434909)

    AI firms won't pay to respect copyright

    On the one hand, I can only hope this leads to revisiting the insanity of copyright law.

    On the other, fuck them for double dealing with regards to what ownership actually means ("I'm alright, Jack.").

    • AI firms won't pay to respect copyright

      They do not need to pay. Copyright, as the name says, is the right to copy and distribute something. So long as you purchase a legal copy, you are allowed to use it as you wish, provided you do not distribute copies.

      If I buy a book the copyright holder cannot tell me that I'm only allowed to read 5 pages a day, or that I can't use it to balance a table, prop open a door or even burn it. Similarly, they can't tell me that I'm not allowed to use it to train a machine learning algorithm provided that the a

      • Uh-huh.

        Take something like music. There are specific licenses for specific uses. We already have a legal framework with regard to sampling. Imagine my dismay that none of these people spoke up then, but now the cost of sampling and the morass of licensing is an issue.

        But tell me, is any of the software copyrighted?

        Oh...

        • There are specific licenses for specific uses.

          Yes, but only around two things: public performance and copying/distribution, and arguably public performance is a form of distribution.

      • Yeah but Meta torrented the entirety of Z-Library, reportedly.

        They won't even pay for one copy, even setting aside the issue of trained networks being a derivative work.

    • Ok, let's consider intelligence, artificial or "real". Let's feed the AI all the books of learning, from Dick and Jane all the way up to a PhD in your field of choice. One copy each, from your local/high school/college book store. I'm sure the AI people would not object to the cost so far. Then let's set the AI up with a normal-speed internet connection and let it explore for, say, 10 hours per day. Ok, so now we have a trained AI. I see no copyright issues here not found in a child genius with a perfect memory. T
  • They're lying. (Score:5, Insightful)

    by Sebby ( 238625 ) on Saturday June 07, 2025 @07:44PM (#65434927) Journal

    AI Firms Say They Can't Respect Copyright.

    They say that because they're fucking liars.

    They only want to serve their real customers, which are their (potentially future) investors/shareholders - they don't give a shit about anyone else, including those who produced the content their models have been trained on (models which wouldn't have any use without that content existing to begin with).

    • they don't give a shit about anyone else,

      So they're like everyone who makes excuses for why they steal music, videos, and software?

      • by dfghjk ( 711126 )

        Except in the AI case, it's not clear it's not fair use. People now accusing AI training of criminality are worse.

    • by dfghjk ( 711126 )

      "They say that because they're fucking liars."

      They don't say it; that's just a troll that you are excited to believe. But, yes, they are fucking liars.

  • Almost everything should already be in the public domain. The same ethos behind patents was supposed to apply to copyrights: a temporary monopoly for the creator so they can live off it for a while, with the work then becoming public for the benefit of the rest of humanity. The way this got distorted so disparately between patents and copyright is really an embarrassment. How does it make any ethical sense for copyright to persist 150 years AFTER the death of its author, while patents expire 20 years
    • Very simple, really.

      Medicine is inherently useful.

      The mouse is only useful for making money.

        • Yeah, making money while stifling creativity. You know, Big Pharma could have lobbied to extend patent terms to 200 years, destroying all generics, the same way Disney did. Also, the irony is that Disney freaking plagiarized and used stories in the PUBLIC DOMAIN, and then they freaking closed the door behind them with this absurd law. They were the primary beneficiaries of using other people's work to create something, admittedly, beautiful and original. This is the very thing they are imped
        • Also the irony is that Disney freaking plagiarized and used stories in the PUBLIC DOMAIN...

          No, that's not what happened, because once something enters the Public Domain, nobody owns it any more, and anybody who wants to is free to use it however they want. That's why Disney sticks to stories in the Public Domain, so that they don't have to pay royalties.
  • by Mirnotoriety ( 10462951 ) on Saturday June 07, 2025 @07:58PM (#65434951)
    AI is only as effective as the data it's trained on — without that data, it's as useful as asking a rock. The claim that no original data is retained internally is misleading. Marketing AI without compensating data creators is, in essence, intellectual property theft.
    • by dfghjk ( 711126 )

      "Marketing AI without compensating data creators is, in essence, intellectual property theft."

      It is not, marketing is marketing.

      And it remains to be seen if training is IP theft, so far the focus has been on copying and storing data, not training with it. They are doing that because it's not clear that training isn't fair use.

    • by gweihir ( 88907 )

      Obviously. And a criminal business model should not only get you shut down. It should get you sent to prison.

  • Similar to GAN image generation, you can simultaneously train an LLM and a copyright classifier, to minimize the ability to output stuff that violates copyright. It's not really the training that's the problem, but the possibility of spitting it back out again without attribution.
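A minimal sketch of the combined objective that comment describes, with toy stand-ins for everything: the usual generation loss is summed with a weighted score from a "copyright classifier", so continuations the classifier flags as protected become expensive. The PROTECTED set, copyright_score, train_step and the lam weight are all hypothetical illustrations, not a real adversarial training setup.

```python
# Toy stand-in for a protected corpus; a real classifier would be a
# trained model, not an exact-match lookup.
PROTECTED = {"call me ishmael", "it was a dark and stormy night"}

def copyright_score(text):
    """Toy classifier: 1.0 if the text is a verbatim protected phrase, else 0.0."""
    return 1.0 if text.lower() in PROTECTED else 0.0

def train_step(candidates, lm_loss, lam=10.0):
    """Prefer the candidate minimizing: LM loss + lam * classifier score."""
    scored = [(lm_loss[c] + lam * copyright_score(c), c) for c in candidates]
    return min(scored)[1]

candidates = ["call me ishmael", "the whale surfaced at dawn"]
lm_loss = {"call me ishmael": 0.5, "the whale surfaced at dawn": 1.2}
# The memorized phrase has the lower LM loss, but the penalty overrides it.
print(train_step(candidates, lm_loss))  # prints "the whale surfaced at dawn"
```

In a real GAN-style setup the classifier would itself be trained against the model's outputs, and the penalty would flow into the gradient update rather than a candidate selection; this only illustrates the shape of the combined loss.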
    • by dfghjk ( 711126 )

      That is true, but a "copyright classifier" contains what? Given the term used, it appears you are suggesting another LLM with imperfect memorization. How do you think that solves any problem?

      The hard part is in the doing; what you are suggesting is obvious.

      • by dfghjk ( 711126 )

        ... "you" ...

        Also, it should be mentioned that humans are notorious for inadvertently making these kinds of copyright violations themselves, not necessarily with text because their recall isn't that good, but with music it happens frequently. If you think a detector applied to output is going to solve problems, intuition says you will be disappointed.

        But you are right: it's not clear there is copyright violation during training, but there certainly is during inferencing. Problem is, it is not AT ALL clear

        • It's very clear that training is a copyright violation. FTFY.
          • Why? How is it fundamentally different from reading copyrighted works in school? In both cases you're adjusting a network using the material, not memorizing it. Except of course sometimes it does. That's what we need to fix.
            • In school, kids do not copy the works (mostly... ;-) they just read them (or not... ;-)

              You can buy a book. You can do whatever you like with that book physically. It's yours. You cannot copy that book. For example, you can't photocopy the pages and collect those photocopies into a new book. That would be violating the copyright. You also can't memorize the book, then recite it verbatim into a recording device, making your own "audio book" or write out the book in longhand, etc.

              You can certainly read the

              • A temporary copy for purposes of processing doesn't violate the spirit of the law so long as it doesn't outlive the training procedure. The training procedure "reads" the book like a human would; the model should "learn from" the text like a human would. It should not be persisted in the model verbatim to any significant degree; that would be bad. That does happen sometimes and it shouldn't. I think that's fixable.
      • There are many plagiarism detectors out there. Pick one.
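For a sense of what "pick one" might mean at its simplest, here is a hedged sketch of a verbatim-overlap check: it flags any output that shares a sufficiently long word n-gram with a protected corpus. The function names and the 5-gram threshold are illustrative assumptions; real plagiarism detectors also handle paraphrase, stemming and fuzzy matches.

```python
def ngrams(text, n):
    """Yield the word n-grams of `text` as tuples."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

def verbatim_overlap(output, corpus_docs, n=5):
    """Return True if `output` shares any n-word sequence with the corpus."""
    protected = set()
    for doc in corpus_docs:
        protected.update(ngrams(doc, n))
    return any(g in protected for g in ngrams(output, n))

corpus = ["it was the best of times it was the worst of times"]
print(verbatim_overlap("he said it was the best of times indeed", corpus))   # True
print(verbatim_overlap("a completely original sentence goes here", corpus))  # False
```

Applied to model output, this catches only exact regurgitation, which is precisely the weakness the parent comments point out.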
  • by Gravis Zero ( 934156 ) on Saturday June 07, 2025 @08:39PM (#65434995)

    If your business is incapable of existing without breaking the law, then the obvious answer is that your business should not exist. How is this even a question? In the past, EVERY company that flouted copyright has been bankrupted, but now, with companies doing it en masse, it's suddenly OK?

    I'm calling bullshit on all of these companies. If you want to reform copyright, then do it like all the other businesses have: buy a congressman, because you aren't special.

  • The ability of AI to replace white-collar workers is worth trillions. It also decouples the 1% from needing large numbers of consumers and employees to maintain their lifestyles.

    The laws will be rewritten to suit the needs of AI because they suit the needs of your ruling class.

    And human beings won't do away with their ruling class because they like to pretend that all the chaos and misery in the world is under control.
    • by gweihir ( 88907 )

      the ability for AI to replace white collar workers is worth trillions.

      Ah, yes, it does not look good on that front. More like single-digit percentage gains in efficiency. But more stress on the workers, so these may be negative gains in effect. Overall, AI is, again, an abject failure that delivers a minuscule amount of what its proponents claim.

  • I am far more concerned with AI bots pillaging company data for no other reason than that it's there and might be useful to AI. Especially AI embedded in ubiquitous things like Google apps and Office 365, which reside inside the network but reach out, phone home, etc.
  • by djp2204 ( 713741 ) on Saturday June 07, 2025 @09:14PM (#65435051)

    Then you cannot operate, full stop. Shut them all down until they can obey the laws.

    • by gweihir ( 88907 )

      That is far too friendly. Shut them down, impound their fortunes and imprison the perpetrators.

  • by GigaplexNZ ( 1233886 ) on Saturday June 07, 2025 @09:33PM (#65435075)

    AI Firms Say They Can't Respect Copyright

    Then your business model is illegal. Shut it down.

    • by gweihir ( 88907 )

      Indeed. Criminals usually claim they are not criminals and they had no choice and it really is somebody else's fault.

  • by WaffleMonster ( 969671 ) on Sunday June 08, 2025 @02:31AM (#65435337)

    There seems to be widespread misunderstanding between notions of copyright and fairness. The two are in no way synonymous.

    Imagine you spend a huge amount of time and money surfacing new knowledge nobody knew before. You spill the beans in a book and sell it. Someone else comes along, reads your book and blabs what you learned to the world for free or in a much cheaper book of their own.

    Imagine you painstakingly compile a phone book of numbers that would be useful to a certain niche audience. Someone takes the book, OCRs all the numbers into a computer database and gives it all away for free.

    Neither of these are copyright issues, you can have the opinion they are unfair or should not be allowed yet nonetheless not a copyright concern.

    Copyright holders should be careful what they wish for, because an AI trained on a known dataset that can be shown to be ignorant of a copyright holder's work has an affirmative defense against claims of derivative works when someone publishes the output of the AI. Before this, such a defense was absurdly difficult, because the author would have had to prove a negative.

  • by jjaa ( 2041170 ) on Sunday June 08, 2025 @02:36AM (#65435343)
    The RIAA and MPAA and other such outfits will surely go after them. Right? Riiight?!
  • ... a respectless asshole. Terminators incoming! Hope the OSS approach catches on. We need AI with a conscience.

  • Disney is ferocious in its protection of its copyrights. What happens if you ask an AI about Mickey Mouse? What does it say? How did that AI learn about MM except by reading copyrighted material or viewing copyrighted movies?

    Has Disney said anything about AI companies using its copyrighted material?

  • Even if you have little respect for the notion of copyright, at the very least you must understand the bad precedents being set up when companies can just swipe any and all data available on the web for commercial purposes. This is not going to stop at struggling artists and hobbyist programmers; the plan is to effectively kill the efficacy of opt-in and out efforts for data privacy. They want to take every piece of data online, either publicly or privately for data training with no respect for compensation

  • These copyright maximalists start by ASSUMING that Copyright MUST exist and ignoring the social contract where we allow content creators some license in return for their contribution to society. Today's "amazing" copyrights that last nearly a CENTURY after the creator has died do nothing to provide an incentive to creators... only something to be monetized by these TERRORIST LEECHES who produced NOTHING but got government-granted rights to the NOTHING they produced so they can prevent true innovators (oh a
