
Did OpenAI, Google and Meta 'Cut Corners' to Harvest AI Training Data?

What happened when OpenAI ran out of English-language training data in 2021?

They just created a speech recognition tool that could transcribe the audio from YouTube videos, reports The New York Times, as part of an investigation arguing that tech companies "including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law" in their search for AI training data.

Some OpenAI employees discussed how such a move might go against YouTube's rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are "independent" of the video platform. Ultimately, an OpenAI team transcribed more than 1 million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI's president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4...

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by the Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company's practices said. That potentially violated the copyrights to the videos, which belong to their creators. Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company's privacy team and an internal message viewed by the Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products...

Some Google employees were aware that OpenAI had harvested YouTube videos for data, two people with knowledge of the companies said. But they didn't stop OpenAI because Google had also used transcripts of YouTube videos to train its AI models, the people said. That practice may have violated the copyrights of YouTube creators. So if Google made a fuss about OpenAI, there might be a public outcry against its own methods, the people said.

The article adds that some tech companies are now even developing "synthetic" information to train AI.

"This is not organic data created by humans, but text, images and code that AI models produce — in other words, the systems learn from what they themselves generate."

  • by memory_register ( 6248354 ) on Saturday May 11, 2024 @12:40PM (#64465079)
    • by dfghjk ( 711126 )

      But what a conventional way to lie about it. Try something new.

    • Re: (Score:3, Insightful)

      by quonset ( 4839537 )


      How is it theft? Nothing was taken? The original was still there, untouched.

    • by Anonymous Coward


      LOL. Kind of hard for the narcissist to claim an invasion of privacy when they’re quite busy on YouTube being an attention whore.

      Theft would imply consumers are owed something on a platform they pay nothing to use. There’s a reason most are called consumers instead of customers.

    • by guruevi ( 827432 ) on Saturday May 11, 2024 @01:36PM (#64465189)

      It is not theft. It is copyright infringement at best but even that is stretching the concept as long as you cannot trick the system into verbatim regurgitating its contents. You published your content publicly, this was the intention of Google all along and started with Project Gutenberg, Google News etc. OpenAI is just a spinoff of the idea and very much came out of that same cadre of people.

    • by gweihir ( 88907 )

      Indeed. And organized theft by obviously criminal enterprises at that.

  • Not theft (Score:1, Insightful)

    That's because it's not copyright infringement to use copyrighted content to train AI. Nowhere will a YouTube video be recreated 1:1, and they could easily pay a team to recreate the content (audio) without it being copyright infringement. So if I can go out and make a new video with near identical words said, why would it matter if AI did it?

    This whole cry-baby "theft" rhetoric is nonsense. We will never be able to compete with China if we require license deals with every single piece of content that AI c
    • Well, that's just not true.
    • To add to your post, LLMs train on billions of documents. That means the effect of any one of the training examples is that much smaller. The more we scale the models, the less any training example matters, and the less its contribution. Like a drop of water in the sea.

      In such a situation how can we attribute a text to any one training example? Is it still infringement if the influence was 0.001%? Isn't the model doing something creative to combine the gradients from so many sources into something else?
  • Shocked (Score:4, Interesting)

    by david.emery ( 127135 ) on Saturday May 11, 2024 @12:59PM (#64465115)

    I'm Shocked, Shocked to hear that tech companies bent the rules to get ahead of their competition (and then hid it.)

      But just wait until LLMs start training each other. Garbage In, Amplified Garbage Out.

    • LLM already train each other. It's called GAN and one of the primary ways these things make sure they stay within line.

      • LLM already train each other. It's called GAN and one of the primary ways these things make sure they stay within line.

        Nothing matters because the press widely reported the existence of a ridiculous paper titled "the curse of recursion" demonstrating generation loss by training a useless toy model with a hundred million parameters in the most ridiculous manner possible with obvious predictable consequences.

        Ever since this paper's release everyone who wants to shit on all things AI has waved it around as proof of some wild evidence free fantasy of the surety of AI eating itself to death while in the real world when competent

        • They are thinking that AI is just training on self-generated outputs. But it's not that way; those self-generated outputs (the garbage in) are filtered, tested, validated and refined before being used as training material, so there won't be garbage out. Microsoft has a 3B model (Phi) that is trained on synthetic text and it works pretty well. It's even more efficient than models trained on human text at its size.

          The idea is AI -> goes to the environment and do things -> observe effects -> select
    • > Garbage In, Amplified Garbage Out.

      That's wrong. LLMs can and do learn from other sources than internet scrape, they can for example learn from code execution, simulations, games, robotics, or even from the human prompter (human in the loop). LLMs generate 1 trillion or more tokens in chat rooms, those are peppered with human responses that act as implicit feedback.

      The idea is that AI can learn from the world. Like AlphaZero did learn from the self-play tournaments, so it was basically learning fr
  • Explains a lot. (Score:3, Interesting)

    by geekmux ( 1040042 ) on Saturday May 11, 2024 @01:12PM (#64465145)

    They just created a speech recognition tool that could transcribe the audio from YouTube videos

    Given how hilariously bad closed captioning can be when enabled on that platform, perhaps this tends to explain the “intelligence” of artificial today.

  • Everybody did. All possible databases with private data, conversations, messages, email, chatrooms - all this was used, private or not.
  • Of course they did, it's what these companies are built on.

    One of the rare exceptions to Betteridge's law of headlines!

  • How come Google can use that for training? And no, most users who uploaded content didn't know Google would use their stuff to train an AI. If something is publicly available it must be OK for AI to train off it. It's as ridiculous as saying if you read a book on how to repair cars it's illegal for you to be a mechanic without paying royalties to the book's author.

  • by dsgrntlxmply ( 610492 ) on Saturday May 11, 2024 @01:49PM (#64465203)
    The stochastic parrots are being trained to frequently interject "hey what's up you guys" and "subscribe to my" into their texts.
  • ...of headlines... is broken by this one, I think. As they say, "It's easier to ask forgiveness than permission." I expect whatever consequences befall them are worth it to get ahead in the game of pushing out AI bots ASAP.
  • That explains why the AI of today is big on Artificial and in need of Intelligence.
  • If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked (infringement of the models and output are separate issues and the least interesting one, mostly only interesting for people looking to avoid addressing the primary issue).

    • If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked

      Training doesn't necessarily require "copying" as a copy is required to be "fixed" to be considered a "copy" for the purpose of copyright law.

      For example a training algorithm fed by a list of URLs produces copies of data stored in various routing equipment and system buffers on their way to training algorithm yet given the fleeting temporary nature would not constitute a fixed copy under copyright law and therefore would not be subject to copyright restrictions.

      (infringement of the models and output are separate issues and the least interesting one, mostly only interesting for people looking to avoid addressing the primary issue).

      This approach is doomed to failure in the cour

      • by Bongo ( 13261 )

        I expect clever lawyers will be able to make rational cases on both sides and it will come down to decisions.

        If an LLM can OUTPUT facts and information which it can only produce obviously because that information was input somewhere, then that looks like storage and copying (as an example of one argument).


        Here's a famous quote from Shakespeare:

        "To be, or not to be, that is the question."
        - **Hamlet**, Act 3, Scene 1

      • No one streams in terabytes of content for training, repeatedly for each epoch on top. It's copied, permanently stored and then copied some more.

        • If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked ...
          No one streams in terabytes of content for training, repeatedly for each epoch on top. It's copied, permanently stored and then copied some more.

          We are talking about different things. My comments are in regard to what is possible, what can be done to train models.

          What companies are actually doing I have zero clue. None of them are disclosing their workflows so unless people have relevant inside information I don't even know what the basis would be for making statements about what they are or are not doing.

      • Training doesn't necessarily require "copying" as a copy is required to be "fixed" to be considered a "copy" for the purpose of copyright law.

        Copying to RAM is still considered a copy and may be infringing. You may not like it, and I'm not particularly a fan of it either, but that's the legal precedent in the US.

        • Copying to RAM is still considered a copy and may be infringing. You may not like it, and I'm not particularly a fan of it either, but that's the legal precedent in the US.

          I disagree, the word fixed becomes meaningless if interpreted in a way that renders it a functional nullity. There was a well known case about the execution of a computer program which is quite a bit different than temporary buffers.

          I would say it is a legal precedent akin to fighting words... One insane ruling followed by a persistent dwindling down to comical irrelevance.

          • You can disagree with the interpretation, and like I said, I don't at all think that's unreasonable, but the fact is that it is the current legal precedent.
  • These systems, due to the simplicity and feebleness of their algorithms, require essentially all the text in the entire world to produce useful chatbots, and only corporations with billions of dollars to dump into the projects, and such vast legal teams that they do not bother to consider the legality of their actions, can play here.

    Humans require only a tiny fraction of the training data that these behemoth projects consume to become truly intelligent.

    • by Kiliani ( 816330 )

      I have been thinking similarly. Considering how much knowledge (some of which undoubtedly should be called "knowledge") has been fed to those systems, the outcome is disappointing. "Learning" and "reasoning" have not been tackled (let alone conquered), so it all seems mostly brute force with a lot of fluff. There are promising developments in specialized fields, but overall it is mostly amazingly wasteful, for what we get out at the moment.

      I chuckle at the notion that we already need AI systems to create da

  • Captain Obvious strikes again.

  • by nospam007 ( 722110 ) * on Saturday May 11, 2024 @05:10PM (#64465481)

    To learn, like we all.

  • Reddit has tons of low grade bots, so it's not surprising they also used it to train up their models

    "Reddit's array of chats also have been a free teaching aid for companies like Google, OpenAI and Microsoft. Those companies are using Reddit's conversations in the development of giant artificial intelligence systems that many in Silicon Valley think are on their way to becoming the tech industry's next big thing.

    Now Reddit wants to be paid for it."

  • This is so sus to see a report by NYT about OpenAI when NYT is suing OpenAI. It's just so fun to watch old media lose its s*it once again. It's like they ran around the office screaming "we have someone to sue - yeah haw!"
  • With no meaningful penalties at all, a huge potential profit, a track record..., nay, an entire business built upon invasion of users' privacy, of course these companies used whatever data they could get their hands on to train their AI.

    On top of that, for corporations, it is always easier to seek forgiveness than to seek permission. Do first then apologize later (only if caught!) is their SOP.

  • debated bending the law

    I'm willing to bet there was no debate, and that they just went ahead and did it. I think it's possible that they debated breaking the law before they (probably) did so.

  • Just use an LLM to reword any text. Keep the ideas, discard the protected expression. Ideas are free; nobody can hoard ideas themselves unless they patent them.
    • Use of synthetic data causes "Model collapse".

      The model ends up not able to understand anything it is trained on.

      The best you can hope for is to use one model to train another, as long as the one being trained hasn't seen the material before.

      Even that just mitigates it.

  • With the success of ChatGPT and other LLMs, an estimated 100M users are exchanging over 1 trillion tokens with AI per month. So many people are just bringing their data and feedback to the LLM on their own. AI developers need just to save chat logs, that's why they are providing LLM services for free. They don't need to scrape copyrighted content so much after accumulating their own data. It comes to them.

