
AI Firms Say They Can't Respect Copyright. But A Nonprofit's Researchers Just Built a Copyright-Respecting Dataset (msn.com)

Is copyrighted material a requirement for training AI? asks the Washington Post. That's what top AI companies are arguing, and "Few AI developers have tried the more ethical route — until now.

"A group of more than two dozen AI researchers have found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain. They tested the dataset quality by using it to train a 7 billion parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023." A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate. The group built an AI model that is significantly smaller than the latest offered by OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent the biggest, most transparent and rigorous effort yet to demonstrate a different way of building popular AI tools....

As it turns out, the task involves a lot of humans. That's because of the technical challenges of data not being formatted in a way that's machine readable, as well as the legal challenges of figuring out what license applies to which website, a daunting prospect when the industry is rife with improperly licensed data. "This isn't a thing where you can just scale up the resources that you have available" like access to more computer chips and a fancy web scraper, said Stella Biderman [executive director of the nonprofit research institute EleutherAI]. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard."

Still, the group managed to unearth new datasets that can be used ethically. Those include a set of 130,000 English-language books in the Library of Congress, which is nearly double the size of Project Gutenberg, the popular public-domain books dataset. The group's initiative also builds on recent efforts to develop more ethical, but still useful, datasets, such as FineWeb from Hugging Face, the open-source repository for machine learning... Still, Biderman remained skeptical that this approach could find enough content online to match the size of today's state-of-the-art models... Biderman said she didn't expect companies such as OpenAI and Anthropic to start adopting the same laborious process, but she hoped it would encourage them to at least rewind back to 2021 or 2022, when AI companies still shared a few sentences of information about what their models were trained on.

"Even partial transparency has a huge amount of social value and a moderate amount of scientific value," she said.


  • by JaredOfEuropa ( 526365 ) on Saturday June 07, 2025 @07:35PM (#65434897) Journal
    Copyright itself has been twisted so far from its original intent that I feel little urge to respect it, and little remorse at breaking it. I will respect copyright when it respects me.
    • by Anonymous Coward

      I will respect copyright when it respects me.

      Word.

    • Copyright isn't even an issue. The word "use" has been thrown around so many times that many people have come to believe copyright law lets copyright owners control the use of their works. It doesn't. The law only applies to copying, distributing, and public performances. It says nothing about AI training. Maybe it SHOULD cover that. But Congress hasn't passed that law yet. This doesn't even require a Fair Use exemption. The works might have been illegally copied and distributed in order to assemble a tra
      • Re: (Score:3, Informative)

        Let's just run with the "AI training" misconception.

        There's a document. It was created by an author. The author has the exclusive right to copy the document in its entirety onto his own website (copy=1,violation=0). Your browser knocks on the door. It asks for the document. The author's website copies the document over the network onto your browser's process memory (copy=2,violation=0). That's fine, because the author's HTTP server initiated and the author intended to authorize the copy.

        Now you copy th

        • Let's just run with the "AI training" misconception.

          There's a document. It was created by an author. The author has the exclusive right to copy the document in its entirety onto his own website (copy=1,violation=0). Your browser knocks on the door. It asks for the document. The author's website copies the document over the network onto your browser's process memory (copy=2,violation=0). That's fine, because the author's HTTP server initiated and the author intended to authorize the copy.

          After that, things get murky: several copies are made, but to what purpose?

          The web browser copies it into its cache on disk (used, e.g., if you do a page refresh, to avoid downloading again over the Internet). Is this a legal copy? This is a standard browser thing. Other similar copies might be made, e.g. by Squid (a caching and forwarding HTTP web proxy). I will ignore these copies, as no one seems to be upset about them. (Actually that is not entirely true.)

          You read the document, another copy is made that resides in you

    • There are major problems with copyright. Like the absurdly long terms that mean a century after a work is written, the author's descendants may still be collecting royalties on it. Or DMCA style laws that abuse copyright for unrelated purposes, like saying you can't repair your own possessions because it would violate a copyright. It's absurd and it needs to be fixed.

      But the AI companies don't care about that. They aren't on your side. They aren't fighting those things. The only thing they care about

    • They have billions of dollars. It's not going to hurt them to give a little bit to the people who helped train their systems.
  • Um... (Score:5, Insightful)

    by fahrbot-bot ( 874524 ) on Saturday June 07, 2025 @07:37PM (#65434907)

    AI Firms Say They Can't Respect Copyright

    Pretty sure it's not really up to them, legally.

    A group of more than two dozen AI researchers have found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain.

    So it's really more like "won't" than "can't" ...

    • Re:Um... (Score:5, Insightful)

      by Sebby ( 238625 ) on Saturday June 07, 2025 @07:47PM (#65434937) Journal

      AI Firms Say They Can't Respect Copyright

      Pretty sure it's not really up to them, legally.

      And their response: "Who cares about legality - that's for the courts to settle, after you spend money you don't have suing us."

    • Re: (Score:3, Insightful)

      by Brain-Fu ( 1274756 )

      Pretty sure it's not really up to them, legally.

      In a fair and just world, you would be right. In this world, however, the super-rich are beholden to a different set of rules than the rest of us, and something like AI is just too interesting to allow pesky laws to get in the way (especially laws that are, by and large, only protecting copyrights held by the not-so-rich).

      • by dfghjk ( 711126 )

        People like to pretend the "rules" are clear. Just because AI billionaires are criminals does not mean they commit the copyright violations alleged.

    • by dfghjk ( 711126 )

      "Pretty sure it's not really up to them, legally."

      Nor do they say it.

      "So it's really more like "won't" than "can't" ..."

      They don't say that either.

      There is no reason to tell lies. AI companies are scumbags; that doesn't mean we have to be.

    • So yeah, it's up to them. We are very much a nation of men, not laws, now; whoever has the most money makes the law in that instant.
    • If you can't do what you want legally, then you don't do it.

      Just joking, lol. Obviously, if you can't do it legally, you just have to do it illegally. There's no other choice. It's the capitalist way!

  • Correction (Score:5, Insightful)

    by quintessencesluglord ( 652360 ) on Saturday June 07, 2025 @07:37PM (#65434909)

    AI firms won't pay to respect copyright

    On the one hand, I can only hope this leads to revisiting the insanity of copyright law.

    On the other, fuck them for double dealing with regards to what ownership actually means ("I'm alright, Jack.").

    • AI firms won't pay to respect copyright

      They do not need to pay. Copyright, as the name says, is the right to copy and distribute something. So long as you purchase a legal copy, you are allowed to use it as you wish, provided you do not distribute copies.

      If I buy a book the copyright holder cannot tell me that I'm only allowed to read 5 pages a day, or that I can't use it to balance a table, prop open a door or even burn it. Similarly, they can't tell me that I'm not allowed to use it to train a machine learning algorithm provided that the a

      • Uh-huh.

        Take something like music. There are specific licenses for specific uses. We already have a legal framework with regard to sampling. Imagine my dismay that none of these people spoke up then, but now the cost of sampling and the morass of licensing is an issue.

        But tell me, is any of the software copyrighted?

        Oh...

        • There are specific licenses for specific uses.

          Yes, but only around two things: public performance and copying/distribution, and arguably public performance is a form of distribution.

      • Yeah but Meta torrented the entirety of Z-Library, reportedly.

        They won't even pay for one copy, even setting aside the issue of trained networks being a derivative work.

    • Ok, let's consider intelligence, artificial or "real". Let's feed the AI all the books of learning, from Dick and Jane all the way up to a PhD in your field of choice. One copy each, from your local/high school/college book store. I'm sure the AI people would not object to the cost so far. Then let's set the AI up with a normal-speed internet connection and let it explore for, say, 10 hours per day. Ok, so now we have a trained AI. I see no copyright issues here not found in a child genius with a perfect memory. T
  • They're lying. (Score:5, Insightful)

    by Sebby ( 238625 ) on Saturday June 07, 2025 @07:44PM (#65434927) Journal

    AI Firms Say They Can't Respect Copyright.

    They say that because they're fucking liars.

    They only want to serve their real customers, which are their (potentially future) investors/shareholders - they don't give a shit about anyone else, including those who produced the content their models have been trained on (models which wouldn't have any use without that content existing to begin with).

    • they don't give a shit about anyone else,

      So they're like everyone who makes excuses for why they steal music, videos, and software?

      • by dfghjk ( 711126 )

        Except in the AI case, it's not clear it's not fair use. People now accusing AI training of criminality are worse.

    • by dfghjk ( 711126 )

      "They say that because they're fucking liars."

      They don't say it; that's just a troll that you are excited to believe. But, yes, they are fucking liars.

  • Almost everything should already be in the public domain. The same ethos behind patents was supposed to apply to copyrights: a temporary monopoly for the creator so they can live off it for a while, with the work then becoming public for the benefit of the rest of humanity. The way this got distorted so disparately between patents and copyright is really an embarrassment. How does it make any ethical sense for copyright to persist 150 years AFTER the death of its author, while patents expire 20 years
    • Very simple, really.

      Medicine is inherently useful.

      The mouse is only useful for making money.

        • Yeah, making money while stifling creativity. You know, Big Pharma could have lobbied to extend patent terms to 200 years, destroying all generics, the same way Disney did. Also, the irony is that Disney freaking plagiarized and used stories in the PUBLIC DOMAIN, and then they freaking closed the door behind them with this absurd law. They were the primary beneficiaries of using other people's work to create something, admittedly, beautiful and original. This is the very thing they are imped
        • Also the irony is that Disney freaking plagiarized and used stories in the PUBLIC DOMAIN...

          No, that's not what happened, because once something enters the Public Domain, nobody owns it any more, and anybody who wants to is free to use it however they want. That's why Disney sticks to stories in the Public Domain, so that they don't have to pay royalties.
  • by Mirnotoriety ( 10462951 ) on Saturday June 07, 2025 @07:58PM (#65434951)
    AI is only as effective as the data it's trained on — without that data, it's as useful as asking a rock. The claim that no original data is retained internally is misleading. Marketing AI without compensating data creators is, in essence, intellectual property theft.
    • by dfghjk ( 711126 )

      "Marketing AI without compensating data creators is, in essence, intellectual property theft."

      It is not, marketing is marketing.

      And it remains to be seen if training is IP theft, so far the focus has been on copying and storing data, not training with it. They are doing that because it's not clear that training isn't fair use.

    • by gweihir ( 88907 )

      Obviously. And a criminal business model should not only get you shut down. It should get you sent to prison.

  • Similar to GAN image generation, you can simultaneously train an LLM and a copyright classifier, to minimize the ability to output stuff that violates copyright. It's not really the training that's the problem, but the possibility of spitting it back out again without attribution.
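A minimal sketch of the combined objective that comment describes, with toy stand-ins for everything: the usual generation loss is summed with a weighted score from a "copyright classifier", so continuations the classifier flags as protected become expensive. The PROTECTED set, copyright_score, train_step and the lam weight are all hypothetical illustrations, not a real adversarial training setup.

```python
# Toy stand-in for a protected corpus; a real classifier would be a
# trained model, not an exact-match lookup.
PROTECTED = {"call me ishmael", "it was a dark and stormy night"}

def copyright_score(text):
    """Toy classifier: 1.0 if the text is a verbatim protected phrase, else 0.0."""
    return 1.0 if text.lower() in PROTECTED else 0.0

def train_step(candidates, lm_loss, lam=10.0):
    """Prefer the candidate minimizing: LM loss + lam * classifier score."""
    scored = [(lm_loss[c] + lam * copyright_score(c), c) for c in candidates]
    return min(scored)[1]

candidates = ["call me ishmael", "the whale surfaced at dawn"]
lm_loss = {"call me ishmael": 0.5, "the whale surfaced at dawn": 1.2}
# The memorized phrase has the lower LM loss, but the penalty overrides it.
print(train_step(candidates, lm_loss))  # prints "the whale surfaced at dawn"
```

In a real GAN-style setup the classifier would itself be trained against the model's outputs, and the penalty would flow into the gradient update rather than a candidate selection; this only illustrates the shape of the combined loss.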
    • by dfghjk ( 711126 )

      That is true, but a "copyright classifier" contains what? Given the term used, it appears you are suggesting another LLM with imperfect memorization. How do you think that solves any problem?

      The hard part is in the doing; what you are suggesting is obvious.

      • by dfghjk ( 711126 )

        ... "you" ...

        Also, it should be mentioned that humans are notorious for inadvertently making these kinds of copyright violations themselves, not necessarily with text because their recall isn't that good, but with music it happens frequently. If you think a detector applied to output is going to solve problems, intuition says you will be disappointed.

        But you are right: it's not clear there is copyright violation during training, but there certainly is during inferencing. Problem is, it is not AT ALL clear

        • It's very clear that training is a copyright violation. FTFY.
          • Why? How is it fundamentally different from reading copyrighted works in school? In both cases you're adjusting a network using the material, not memorizing it. Except of course sometimes it does. That's what we need to fix.
            • In school, kids do not copy the works (mostly... ;-) they just read them (or not... ;-)

              You can buy a book. You can do whatever you like with that book physically. It's yours. You cannot copy that book. For example, you can't photocopy the pages and collect those photocopies into a new book. That would be violating the copyright. You also can't memorize the book, then recite it verbatim into a recording device, making your own "audio book" or write out the book in longhand, etc.

              You can certainly read the

              • A temporary copy for purposes of processing doesn't violate the spirit of the law so long as it doesn't outlive the training procedure. The training procedure "reads" the book like a human would; the model should "learn from" the text like a human would. It should not be persisted in the model verbatim to any significant degree; that would be bad. That does happen sometimes and it shouldn't. I think that's fixable.
      • There are many plagiarism detectors out there. Pick one.
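For a sense of what "pick one" might mean at its simplest, here is a hedged sketch of a verbatim-overlap check: it flags any output that shares a sufficiently long word n-gram with a protected corpus. The function names and the 5-gram threshold are illustrative assumptions; real plagiarism detectors also handle paraphrase, stemming and fuzzy matches.

```python
def ngrams(text, n):
    """Yield the word n-grams of `text` as tuples."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

def verbatim_overlap(output, corpus_docs, n=5):
    """Return True if `output` shares any n-word sequence with the corpus."""
    protected = set()
    for doc in corpus_docs:
        protected.update(ngrams(doc, n))
    return any(g in protected for g in ngrams(output, n))

corpus = ["it was the best of times it was the worst of times"]
print(verbatim_overlap("he said it was the best of times indeed", corpus))   # True
print(verbatim_overlap("a completely original sentence goes here", corpus))  # False
```

Applied to model output, this catches only exact regurgitation, which is precisely the weakness the parent comments point out.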
  • by Gravis Zero ( 934156 ) on Saturday June 07, 2025 @08:39PM (#65434995)

    If your business is incapable of existing without breaking the law, then the obvious answer is that your business should not exist. How is this even a question? In the past, EVERY company that flouted copyright has been bankrupted, but now, with companies doing it en masse, it's suddenly OK?

    I'm calling bullshit on all of these companies. If you want to reform copyright, then do it like all the other businesses have: buy a congressman, because you aren't special.

  • The ability of AI to replace white-collar workers is worth trillions. It also decouples the 1% from needing large numbers of consumers and employees to maintain their lifestyles.

    The laws will be rewritten to suit the needs of AI because they suit the needs of your ruling class.

    And human beings won't do away with their ruling class because they like to pretend that all the chaos and misery in the world is under control.
    • by gweihir ( 88907 )

      the ability for AI to replace white collar workers is worth trillions.

      Ah, yes, it does not look good on that front. More like single-digit percentage gains in efficiency. But more stress on the workers, so these may be negative gains in effect. Overall, AI is, again, an abject failure that delivers a minuscule amount of what its proponents claim.

  • I am far more concerned with AI bots pillaging company data for no other reason than that it's there and might be useful to AI. Especially AI embedded in ubiquitous things like Google apps and Office 365, which reside inside the network but reach out, phone home, etc.
  • by djp2204 ( 713741 ) on Saturday June 07, 2025 @09:14PM (#65435051)

    Then you cannot operate, full stop. Shut them all down until they can obey the laws.

    • by gweihir ( 88907 )

      That is far too friendly. Shut them down, impound their fortunes and imprison the perpetrators.

  • by GigaplexNZ ( 1233886 ) on Saturday June 07, 2025 @09:33PM (#65435075)

    AI Firms Say They Can't Respect Copyright

    Then your business model is illegal. Shut it down.

    • by gweihir ( 88907 )

      Indeed. Criminals usually claim they are not criminals and they had no choice and it really is somebody else's fault.

  • by WaffleMonster ( 969671 ) on Sunday June 08, 2025 @02:31AM (#65435337)

    There seems to be widespread misunderstanding between notions of copyright and fairness. The two are in no way synonymous.

    Imagine you spend a huge amount of time and money surfacing new knowledge nobody knew before. You spill the beans in a book and sell it. Someone else comes along, reads your book and blabs what you learned to the world for free or in a much cheaper book of their own.

    Imagine you painstakingly compile a phone book of numbers that would be useful to a certain niche audience. Someone takes the book, OCRs all the numbers into a computer database and gives it all away for free.

    Neither of these are copyright issues, you can have the opinion they are unfair or should not be allowed yet nonetheless not a copyright concern.

    Copyright holders should be careful what they wish for, because an AI trained on a known dataset that can be shown to be ignorant of a copyright holder's work has an affirmative defense against claims of derivative works when someone publishes the output of the AI. Before this, such a defense was absurdly difficult, because the author would have had to prove a negative.

  • by jjaa ( 2041170 ) on Sunday June 08, 2025 @02:36AM (#65435343)
    The RIAA and MPAA and other such outfits will surely go after them. Right? Riiight?!
  • ... a respectless asshole. Terminators incoming! Hope the OSS approach catches on. We need AI with a conscience.

  • Disney is ferocious in its protection of its copyrights. What happens if you ask an AI about Mickey Mouse? What does it say? How did that AI learn about MM except by reading copyrighted material or viewing copyrighted movies?

    Has Disney said anything about AI companies using its copyrighted material?

  • Even if you have little respect for the notion of copyright, at the very least you must understand the bad precedents being set up when companies can just swipe any and all data available on the web for commercial purposes. This is not going to stop at struggling artists and hobbyist programmers; the plan is to effectively kill the efficacy of opt-in and out efforts for data privacy. They want to take every piece of data online, either publicly or privately for data training with no respect for compensation

  • These copyright maximalists start by ASSUMING that Copyright MUST exist and ignoring the social contract where we allow content creators some license in return for their contribution to society. Today's "amazing" copyrights that last nearly a CENTURY after the creator has died do nothing to provide an incentive to creators... only something to be monetized by these TERRORIST LEECHES who produced NOTHING but got government-granted rights to the NOTHING they produced so they can prevent true innovators (oh a
