Did OpenAI, Google and Meta 'Cut Corners' to Harvest AI Training Data? (indiatimes.com) 58

Posted by EditorDavid on Saturday May 11, 2024 @12:34PM from the devouring-data dept.

What happened when OpenAI ran out of English-language training data in 2021?

They just created a speech recognition tool that could transcribe the audio from YouTube videos, reports The New York Times, as part of an investigation arguing that tech companies "including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law" in their search for AI training data. [Alternate URL here.] Some OpenAI employees discussed how such a move might go against YouTube's rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are "independent" of the video platform. Ultimately, an OpenAI team transcribed more than 1 million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI's president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4...

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by the Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company's practices said. That potentially violated the copyrights to the videos, which belong to their creators. Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company's privacy team and an internal message viewed by the Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products...

Some Google employees were aware that OpenAI had harvested YouTube videos for data, two people with knowledge of the companies said. But they didn't stop OpenAI because Google had also used transcripts of YouTube videos to train its AI models, the people said. That practice may have violated the copyrights of YouTube creators. So if Google made a fuss about OpenAI, there might be a public outcry against its own methods, the people said.
The article adds that some tech companies are now even developing "synthetic" information to train AI.

"This is not organic data created by humans, but text, images and code that AI models produce — in other words, the systems learn from what they themselves generate."

This discussion has been archived. No new comments can be posted.

What a weird way to pronounce (Score:5, Insightful)

by memory_register ( 6248354 ) writes: on Saturday May 11, 2024 @12:40PM (#64465079)

Theft.

- Re: (Score:3)
  
  by dfghjk ( 711126 ) writes:
  
  But what a conventional way to lie about it. Try something new.
- Re: (Score:3, Insightful)
  
  by quonset ( 4839537 ) writes:
  
  Theft.
  How is it theft? Nothing was taken? The original was still there, untouched.
  - Re:What a weird way to pronounce (Score:5, Funny)
    
    by EvilSS ( 557649 ) writes: on Saturday May 11, 2024 @01:01PM (#64465119)
    
    Piracy is theft.... when it's them and not us doing it.
    
    - Re: (Score:2)
      
      by africanrhino ( 2643359 ) writes:
      
      Oddly the videos come with a download button
      - Re: (Score:1)
        
        by hey! ( 33014 ) writes:
        
        Which in itself says nothing about whether you are or are not violating the creators' rights.
        You as the non-owner of the IP have certain fair use rights that depend, not on the mechanism by which you obtain a copy of the data, but on the effect of what you are doing with the data upon the copyright holder's proprietary interests. A download button does *not* indicate content is fair game for commercial use.
  - Re: (Score:2)
    
    by evanh ( 627108 ) writes:
    
    Tell that to music and film studios.
- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  Theft.
  LOL. Kind of hard for the narcissist to claim an invasion of privacy when they’re quite busy on YouTube being an attention whore.
  Theft would imply consumers are owed something on a platform they pay nothing to use. There’s a reason most are called consumers instead of customers.
- Comment removed (Score:5, Insightful)
  
  by account_deleted ( 4530225 ) writes: on Saturday May 11, 2024 @01:36PM (#64465189)
  
  Comment removed based on user account deletion
  
- Re: (Score:1)
  
  by gweihir ( 88907 ) writes:
  
  Indeed. And organized theft by obviously criminal enterprises at that.
Not theft (Score:1, Insightful)

by destined2fail1990 ( 10502474 ) writes:

That's because it's not copyright infringement to use copyrighted content to train AI. No where will a Youtube video be recreated 1:1, and they could easily pay a team to recreate the content (audio) and it would not be copyright infringement. So if I can go out and make a new video with near identical words said, why would it matter if AI did it?

This whole cry-baby "theft" rhetoric is nonsense. We will never be able to compete with China if we require license deals with every single piece of content that AI c
- Re: Not theft (Score:2)
  
  by zkiwi34 ( 974563 ) writes:
  
  Well, that's just not true.
    - Re: (Score:2)
      
      by Dragonslicer ( 991472 ) writes:
      
      No where will a Youtube video be recreated 1:1
      That's the part that isn't true. The YouTube video was "recreated" when it was copied from YouTube to the computer(s) being used to train the AI. You might not think that should be considered making an infringing copy, and it's reasonable to think that way, but legal precedent says that it is an infringing copy.
      - Re: Not theft (Score:2)
        
        by bool2 ( 1782642 ) writes:
        
        How is that different from the temporary copy made from YouTube to my computer when I watch the video?
        
        Re: (Score:2)
        
        by Dragonslicer ( 991472 ) writes:
        
        It isn't, other than the fact that YouTube has granted you permission to make that copy.
        
        Yes, there are a lot of problems with that precedent, and since then courts have made fair use exceptions for cases where copying something to RAM is a necessary part of an otherwise legal use.
      - Re: (Score:2)
        
        by Visarga ( 1071662 ) writes:
        
        It is only infringing if it regurgitates the original verbatim or is a derivative work, meaning it adds nothing transformative to the original. If what you said were true, then any paraphrase of something I said would infringe on my copyright. Do you want to be in a world where ideas are owned by whoever coined them first? I am talking about ideas, not their specific expression.
        
        Re: (Score:2)
        
        by Dragonslicer ( 991472 ) writes:
        
        It is only infringing if it regurgitates the original verbatim or is a derivative work
        No, that's the part you're misunderstanding. As soon as the AI trainer downloads the video from YouTube to its own computer, it has made a copy. This is before the video is used as part of training the AI.
- Re: (Score:2)
  
  by Visarga ( 1071662 ) writes:
  
  To add to your post, LLMs train on billions of documents. That means the effect of any one of the training examples is that much smaller. The more we scale the models, the less any training example matters, and the less its contribution. Like a drop of water in the sea.
  
  In such a situation how can we attribute a text to any one training example? Is it still infringement if the influence was 0.001%? Isn't the model doing something creative to combine the gradients from so many sources into something else?
Shocked (Score:4, Interesting)

by david.emery ( 127135 ) writes: on Saturday May 11, 2024 @12:59PM (#64465115)

I'm Shocked, Shocked to hear that tech companies bent the rules to get ahead of their competition (and then hid it.)
But just wait until LLMs start training each other. Garbage In, Amplified Garbage Out.

- Re: (Score:1)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: (Score:3)
    
    by WaffleMonster ( 969671 ) writes:
    
    LLM already train each other. It's called GAN and one of the primary ways these things make sure they stay within line.
    Nothing matters because the press widely reported the existence of a ridiculous paper titled "the curse of recursion" demonstrating generation loss by training a useless toy model with a hundred million parameters in the most ridiculous manner possible with obvious predictable consequences.
    Ever since this paper's release everyone who wants to shit on all things AI has waved it around as proof of some wild evidence free fantasy of the surety of AI eating itself to death while in the real world when competent
    - Re: (Score:2)
      
      by Visarga ( 1071662 ) writes:
      
      They are thinking that AI is just training on self generated outputs. But it's not that way, those self generated outputs (the garbage in) are filtered, tested, validated and refined before being used as training material, so there won't be garbage out. Microsoft has a 3B model (Phi) that is trained on synthetic text and it works pretty well. It's even more efficient than models trained on human text at its size.
      
      The idea is AI -> goes to the environment and do things -> observe effects -> select
- Re: (Score:2)
  
  by Visarga ( 1071662 ) writes:
  
  > Garbage In, Amplified Garbage Out.
  
  That's wrong. LLMs can and do learn from other sources than internet scrape, they can for example learn from code execution, simulations, games, robotics, or even from the human prompter (human in the loop). LLMs generate 1 trillion or more tokens in chat rooms, those are peppered with human responses that act as implicit feedback.
  
  The idea is that AI can learn from the world. Like AlphaZero did learn from the self-play tournaments, so it was basically learning fr
Explains a lot. (Score:3, Interesting)

by geekmux ( 1040042 ) writes: on Saturday May 11, 2024 @01:12PM (#64465145)

They just created a speech recognition tool that could transcribe the audio from YouTube videos
Given how hilariously bad closed captioning can be when enabled on that platform, perhaps this tends to explain the “intelligence” of artificial today.

Everything was used (Score:1)

by sirv ( 4898197 ) writes:

Everybody did. All possible databases with private data, conversations, messages, email, chatrooms - all this was used, private or not.
Yes (Score:2)

by TwistedGreen ( 80055 ) writes:

Of course they did, it's what these companies are built on.
One of the rare exceptions to Betteridge's law of headlines!
Youtube is the thief (Score:2)

by backslashdot ( 95548 ) writes:

How come Google can use that for training? And no, most users who uploaded content didn't know Google would use their stuff to train an AI. If something is publicly available it must be OK for AI to train off it. It's as ridiculous as saying if you read a book on how to repair cars it's illegal for you to be a mechanic without paying royalties to the book's author.
YouTube selection bias (Score:4, Funny)

by dsgrntlxmply ( 610492 ) writes: on Saturday May 11, 2024 @01:49PM (#64465203)

The stochastic parrots are being trained to frequently interject "hey what's up you guys" and "subscribe to my" into their texts.

Re: (Score:2)

by account_deleted ( 4530225 ) writes:

Comment removed based on user account deletion
- Re: (Score:2)
  
  by Pinky's Brain ( 1158667 ) writes:
  
  There's not a lot of wiggle room for statutory infringement of registered works, do it enough times and only bankruptcy can save you.
- Re: (Score:2)
  
  by hey! ( 33014 ) writes:
  
  "If we only did the right thing then we couldn't do what we want to do."
  - Re: (Score:2)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
"the audio from YouTube videos" (Score:2, Insightful)

by oldgraybeard ( 2939809 ) writes:

That explains why the AI of today is big on Artificial and in need of Intelligence.
Fair use or bankruptcy (Score:2)

by Pinky's Brain ( 1158667 ) writes:

If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked (infringement of the models and output are separate issues and the least interesting one, mostly only interesting for people looking to avoid addressing the primary issue).
- Re: (Score:3)
  
  by WaffleMonster ( 969671 ) writes:
  
  If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked
  Training doesn't necessarily require "copying" as a copy is required to be "fixed" to be considered a "copy" for the purpose of copyright law.
  For example, a training algorithm fed by a list of URLs produces copies of data stored in various routing equipment and system buffers on their way to the training algorithm, yet given their fleeting, temporary nature these would not constitute fixed copies under copyright law and therefore would not be subject to copyright restrictions.
  (infringement of the models and output are separate issues and the least interesting one, mostly only interesting for people looking to avoid addressing the primary issue).
  This approach is doomed to failure in the cour
  - Re: (Score:2)
    
    by Bongo ( 13261 ) writes:
    
    I expect clever lawyers will be able to make rational cases on both sides and it will come down to decisions.
    If an LLM can OUTPUT facts and information which it can only produce obviously because that information was input somewhere, then that looks like storage and copying (as an example of one argument).
    ChatGPT:
    Here's a famous quote from Shakespeare:
    "To be, or not to be, that is the question."
    - **Hamlet**, Act 3, Scene 1
  - Re: (Score:3)
    
    by Pinky's Brain ( 1158667 ) writes:
    
    No one streams in terabytes of content for training, repeatedly for each epoch on top. It's copied, permanently stored and then copied some more.
    - Re: (Score:2)
      
      by WaffleMonster ( 969671 ) writes:
      
      If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked ...
      No one streams in terabytes of content for training, repeatedly for each epoch on top. It's copied, permanently stored and then copied some more.
      We are talking about different things. My comments are in regard to what is possible, what can be done to train models.
      What companies are actually doing I have zero clue. None of them are disclosing their workflows so unless people have relevant inside information I don't even know what the basis would be for making statements about what they are or are not doing.
  - Re: (Score:2)
    
    by Dragonslicer ( 991472 ) writes:
    
    Training doesn't necessarily require "copying" as a copy is required to be "fixed" to be considered a "copy" for the purpose of copyright law.
    Copying to RAM is still considered a copy and may be infringing. You may not like it, and I'm not particularly a fan of it either, but that's the legal precedent in the US.
    - Re: (Score:2)
      
      by WaffleMonster ( 969671 ) writes:
      
      Copying to RAM is still considered a copy and may be infringing. You may not like it, and I'm not particularly a fan of it either, but that's the legal precedent in the US.
      I disagree, the word fixed becomes meaningless if interpreted in a way that renders it a functional nullity. There was a well known case about the execution of a computer program which is quite a bit different than temporary buffers.
      I would say it is a legal precedent akin to fighting words... One insane ruling followed by a persistent dwindling down to comical irrelevance.
      - Re: (Score:2)
        
        by Dragonslicer ( 991472 ) writes:
        
        You can disagree with the interpretation, and like I said, I don't at all think that's unreasonable, but the fact is that it is the current legal precedent.
A Game Only Giant Corporations Can Play (Score:2)

by crunchygranola ( 1954152 ) writes:

These systems, due to the simplicity and feebleness of their algorithms, require essentially all the text in the entire world to produce useful chatbots, and only corporations with billions of dollars to dump into the projects, and such vast legal teams that they do not bother to consider the legality of their actions, can play here.
Humans require only a tiny fraction of the training data that these behemoth projects consume to become truly intelligent.
- Re: (Score:2)
  
  by Kiliani ( 816330 ) writes:
  
  I have been thinking similarly. Considering how much knowledge (some of which undoubtedly should be called "knowledge") has been fed to those systems, the outcome is disappointing. "Learning" and "reasoning" have not been tackled (let alone conquered), so it all seems mostly brute force with a lot of fluff. There are promising developments in specialized fields, but overall it is mostly amazingly wasteful, for what we get out at the moment.
  I chuckle at the notion that we already need AI systems to create da
Of course they did. (Score:2)

by Qbertino ( 265505 ) writes:

Captain Obvious strikes again.
The AI browsed the internet (Score:3)

by nospam007 ( 722110 ) * writes: on Saturday May 11, 2024 @05:10PM (#64465481)

To learn, like we all do.

Not to mention reddit (Score:2)

by Rujiel ( 1632063 ) writes:

Reddit has tons of low grade bots, so it's not surprising they also used it to train up their models
https://www.nytimes.com/2023/0... [nytimes.com]
"Reddit's array of chats also have been a free teaching aid for companies like Google, OpenAI and Microsoft. Those companies are using Reddit's conversations in the development of giant artificial intelligence systems that many in Silicon Valley think are on their way to becoming the tech industry's next big thing.
Now Reddit wants to be paid for it."
Nothing to see here move along break it up (Score:1)

by TheWho79 ( 10289219 ) writes:

This is so sus to see a report by NYT about OpenAI when NYT is suing OpenAI. It's just so fun to watch old media lose its s*it once again. It's like they ran around the office screaming "we have someone to sue - yeah haw!"
Of course they did (Score:2)

by khchung ( 462899 ) writes:

With no meaningful penalties at all, a huge potential profit, a track record..., nay, an entire business built upon invasion of users' privacy, of course these companies used whatever data they could get their hands on to train their AI.
On top of that, for corporations, it is always easier to seek forgiveness than to seek permission. Do first then apologize later (only if caught!) is their SOP.
Seems pretty unlikely (Score:2)

by jenningsthecat ( 1525947 ) writes:

debated bending the law
I'm willing to bet there was no debate, and that they just went ahead and did it. I think it's possible that they debated breaking the law before they (probably) did so.
Go synthetic (Score:2)

by Visarga ( 1071662 ) writes:

Just use an LLM to reword any text. Keep the ideas, discard the protected expression. Ideas are free; nobody can hoard ideas themselves unless they patent them.
- Re: (Score:2)
  
  by forgotten_my_nick ( 802929 ) writes:
  
  Use of synthetic data causes "Model collapse".
  The model ends up not able to understand anything it is trained on.
  The best you can hope for is to use one model to train another, as long as the one being trained hasn't seen the material before.
  Even that just mitigates it.
Users bring their data to AI on their own! (Score:2)

by Visarga ( 1071662 ) writes:

With the success of ChatGPT and other LLMs, an estimated 100M users are exchanging over 1 trillion tokens with AI per month. So many people are just bringing their data and feedback to the LLM on their own. AI developers need just to save chat logs, that's why they are providing LLM services for free. They don't need to scrape copyrighted content so much after accumulating their own data. It comes to them.
Big Data steals data (Score:2)

by Rosco P. Coltrane ( 209368 ) writes:

Shocker...
