Did OpenAI, Google and Meta 'Cut Corners' to Harvest AI Training Data? (indiatimes.com)
What happened when OpenAI ran out of English-language training data in 2021?
They just created a speech recognition tool that could transcribe the audio from YouTube videos, reports The New York Times, as part of an investigation arguing that tech companies "including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law" in their search for AI training data. [Alternate URL here.] Some OpenAI employees discussed how such a move might go against YouTube's rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are "independent" of the video platform. Ultimately, an OpenAI team transcribed more than 1 million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI's president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4...
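The tool in question was reportedly Whisper, OpenAI's speech recognition model. As a hedged sketch of what such a transcription step might look like, here is the open-source whisper package applied to a single audio file (the model size and file name are illustrative assumptions, not OpenAI's actual pipeline):

```python
import whisper  # pip install openai-whisper; requires ffmpeg on PATH

# Load a pretrained speech-recognition model and transcribe one audio
# file extracted from a video. Model size and file name are illustrative.
model = whisper.load_model("base")
result = model.transcribe("video_audio.mp3")
print(result["text"])  # plain text, ready to join a training corpus
```

Scaled across more than a million hours of video, a loop like this yields a text corpus far larger than any licensed book collection.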
At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by the Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.
Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company's practices said. That potentially violated the copyrights to the videos, which belong to their creators. Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company's privacy team and an internal message viewed by the Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products...
Some Google employees were aware that OpenAI had harvested YouTube videos for data, two people with knowledge of the companies said. But they didn't stop OpenAI because Google had also used transcripts of YouTube videos to train its AI models, the people said. That practice may have violated the copyrights of YouTube creators. So if Google made a fuss about OpenAI, there might be a public outcry against its own methods, the people said.
The article adds that some tech companies are now even developing "synthetic" information to train AI.
"This is not organic data created by humans, but text, images and code that AI models produce — in other words, the systems learn from what they themselves generate."
What a weird way to pronounce (Score:5, Insightful)
Re: (Score:3)
But what a conventional way to lie about it. Try something new.
Re: (Score:3, Insightful)
Theft.
How is it theft? Nothing was taken; the original was still there, untouched.
Re:What a weird way to pronounce (Score:5, Funny)
Re: (Score:2)
Re: (Score:1)
Which in itself says nothing about whether you are or are not violating the creators' rights.
You, as the non-owner of the IP, have certain fair use rights that depend not on the mechanism by which you obtain a copy of the data, but on the effect of what you do with the data upon the copyright holder's proprietary interests. A download button does *not* indicate content is fair game for commercial use.
Re: (Score:2)
Tell that to music and film studios.
Re: (Score:1)
Theft.
LOL. Kind of hard for the narcissist to claim an invasion of privacy when they’re quite busy on YouTube being an attention whore.
Theft would imply consumers are owed something on a platform they pay nothing to use. There’s a reason most are called consumers instead of customers.
Re: What a weird way to pronounce (Score:5, Insightful)
It is not theft. It is copyright infringement at most, and even that is stretching the concept as long as you cannot trick the system into regurgitating its contents verbatim. You published your content publicly; this was Google's intention all along, starting with Google Books, Google News, etc. OpenAI is just a spinoff of the idea and very much came out of that same cadre of people.
Re: (Score:1)
Indeed. And organized theft by obviously criminal enterprises at that.
Not theft (Score:1, Insightful)
This whole cry-baby "theft" rhetoric is nonsense. We will never be able to compete with China if we require license deals for every single piece of content that AI c
Re: Not theft (Score:2)
Re: (Score:2)
Nowhere will a YouTube video be recreated 1:1
That's the part that isn't true. The YouTube video was "recreated" when it was copied from YouTube to the computer(s) being used to train the AI. You might not think that should be considered making an infringing copy, and it's reasonable to think that way, but legal precedent says that it is an infringing copy.
Re: Not theft (Score:2)
Re: (Score:2)
Yes, there are a lot of problems with that precedent, and since then courts have made fair use exceptions for cases where copying something to RAM is a necessary part of an otherwise legal use.
Re: (Score:2)
Re: (Score:2)
It is only infringing if it regurgitates the original verbatim or is a derivative work
No, that's the part you're misunderstanding. As soon as the AI trainer downloads the video from YouTube to its own computer, it has made a copy. This is before the video is used as part of training the AI.
Re: (Score:2)
In such a situation, how can we attribute a text to any one training example? Is it still infringement if the influence was 0.001%? Isn't the model doing something creative by combining the gradients from so many sources into something else?
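For a sense of the scale in that question, a toy sketch (PyTorch, all numbers invented) comparing one example's gradient against the gradient of a large stand-in "corpus" for a single update step:

```python
import torch

# Toy linear model: how much does one example contribute to a single
# full-batch gradient update when the batch stands in for a huge corpus?
torch.manual_seed(0)
n, d = 100_000, 8
w = torch.zeros(d, requires_grad=True)
xs = torch.randn(n, d)
ys = xs @ torch.ones(d) + 0.1 * torch.randn(n)

loss = ((xs @ w - ys) ** 2).mean()      # loss over all n examples
loss.backward()
total = w.grad.clone()

w.grad.zero_()
one = ((xs[0] @ w - ys[0]) ** 2) / n    # the same loss term, one example
one.backward()
print((w.grad.norm() / total.norm()).item())  # a vanishingly small ratio
```

Attribution methods such as influence functions try to answer this more rigorously, but the per-example share is typically minuscule.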
Shocked (Score:4, Interesting)
I'm Shocked, Shocked to hear that tech companies bent the rules to get ahead of their competition (and then hid it.)
But just wait until LLMs start training each other. Garbage In, Amplified Garbage Out.
Re: Shocked (Score:1)
LLMs already train each other. It's called a GAN, and it's one of the primary ways these things make sure they stay in line.
Re: (Score:3)
LLMs already train each other. It's called a GAN, and it's one of the primary ways these things make sure they stay in line.
Nothing matters, because the press widely reported the existence of a ridiculous paper titled "The Curse of Recursion" demonstrating generation loss by training a useless toy model with a hundred million parameters in the most ridiculous manner possible, with obvious, predictable consequences.
Ever since this paper's release, everyone who wants to shit on all things AI has waved it around as proof of some wild, evidence-free fantasy of the surety of AI eating itself to death, while in the real world, when competent
Re: (Score:2)
The idea is: AI -> goes into the environment and does things -> observes effects -> select
Re: (Score:2)
That's wrong. LLMs can and do learn from sources other than internet scrapes; they can, for example, learn from code execution, simulations, games, robotics, or even from the human prompter (human in the loop). LLMs generate a trillion or more tokens in chat rooms, and those are peppered with human responses that act as implicit feedback.
The idea is that AI can learn from the world, just as AlphaZero learned from its self-play tournaments, so it was basically learning fr
Explains a lot. (Score:3, Interesting)
They just created a speech recognition tool that could transcribe the audio from YouTube videos
Given how hilariously bad closed captioning can be when enabled on that platform, perhaps this explains the “intelligence” in artificial intelligence today.
Everything was used (Score:1)
Yes (Score:2)
Of course they did, it's what these companies are built on.
One of the rare exceptions to Betteridge's law of headlines!
YouTube is the thief (Score:2)
How come Google can use that for training? And no, most users who uploaded content didn't know Google would use their stuff to train an AI. If something is publicly available, it must be OK for AI to train off it. Saying otherwise is as ridiculous as saying that if you read a book on how to repair cars, it's illegal for you to be a mechanic without paying royalties to the book's author.
YouTube selection bias (Score:4, Funny)
Betteridge's law... (Score:2)
Re: (Score:2)
There's not a lot of wiggle room for statutory infringement of registered works; do it enough times and only bankruptcy can save you.
Re: (Score:2)
"If we only did the right thing then we couldn't do what we want to do."
Re: (Score:2)
"the audio from YouTube videos" (Score:2, Insightful)
Fair use or bankruptcy (Score:2)
If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked (infringement of the models and output are separate issues and the least interesting one, mostly only interesting for people looking to avoid addressing the primary issue).
Re: (Score:3)
If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked
Training doesn't necessarily require "copying" as a copy is required to be "fixed" to be considered a "copy" for the purpose of copyright law.
For example, a training algorithm fed by a list of URLs produces copies of the data in various routing equipment and system buffers on their way to the training algorithm, yet given their fleeting, temporary nature, those would not constitute fixed copies under copyright law and therefore would not be subject to copyright restrictions.
(infringement of the models and output are separate issues and the least interesting one, mostly only interesting for people looking to avoid addressing the primary issue).
This approach is doomed to failure in the courts.
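For what it's worth, the "fed by a list of URLs" scenario above is technically real: streaming data loaders consume examples over the network without ever materializing a complete local copy. A hedged sketch using Hugging Face datasets (the dataset choice is an assumption):

```python
from datasets import load_dataset  # pip install datasets

# Streaming mode fetches examples over the network on demand; the
# corpus is never downloaded up front or stored as a complete copy.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(ds):
    text = example["text"]  # passes through buffers, is used, then dropped
    # ...feed `text` into a training step here...
    if i >= 2:
        break
```

Whether such transient copies count as "fixed" is of course the legal question, not the technical one.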
Re: (Score:2)
I expect clever lawyers will be able to make rational cases on both sides and it will come down to decisions.
If an LLM can OUTPUT facts and information that it can only produce because that information was input somewhere, then that looks like storage and copying (as one example of an argument).
ChatGPT:
Here's a famous quote from Shakespeare:
"To be, or not to be, that is the question."
- **Hamlet**, Act 3, Scene 1
Re: (Score:3)
No one streams in terabytes of content for training, repeatedly for each epoch on top. It's copied, permanently stored and then copied some more.
Re: (Score:2)
If the Supreme Court doesn't say copying for training data is fair use they're all proper fucked ...
No one streams in terabytes of content for training, repeatedly for each epoch on top. It's copied, permanently stored and then copied some more.
We are talking about different things. My comments are in regard to what is possible, what can be done to train models.
What companies are actually doing, I have zero clue. None of them are disclosing their workflows, so unless people have relevant inside information, I don't even know what the basis would be for making statements about what they are or are not doing.
Re: (Score:2)
Training doesn't necessarily require "copying" as a copy is required to be "fixed" to be considered a "copy" for the purpose of copyright law.
Copying to RAM is still considered a copy and may be infringing. You may not like it, and I'm not particularly a fan of it either, but that's the legal precedent in the US.
Re: (Score:2)
Copying to RAM is still considered a copy and may be infringing. You may not like it, and I'm not particularly a fan of it either, but that's the legal precedent in the US.
I disagree; the word "fixed" becomes meaningless if interpreted in a way that renders it a functional nullity. There was a well-known case about the execution of a computer program, which is quite a bit different from temporary buffers.
I would say it is a legal precedent akin to "fighting words": one insane ruling followed by a persistent dwindling down to comical irrelevance.
Re: (Score:2)
A Game Only Giant Corporations Can Play (Score:2)
These systems, due to the simplicity and feebleness of their algorithms, require essentially all the text in the entire world to produce useful chatbots. Only corporations with billions of dollars to dump into the projects, and legal teams so vast that they do not bother to consider the legality of their actions, can play this game.
Humans require only a tiny fraction of the training data that these behemoth projects consume to become truly intelligent.
Re: (Score:2)
I have been thinking similarly. Considering how much knowledge (some of which undoubtedly should be called "knowledge") has been fed into those systems, the outcome is disappointing. "Learning" and "reasoning" have not been tackled (let alone conquered), so it all seems mostly brute force with a lot of fluff. There are promising developments in specialized fields, but overall it is amazingly wasteful for what we get out at the moment.
I chuckle at the notion that we already need AI systems to create data
Of course they did. (Score:2)
Captain Obvious strikes again.
The AI browsed the internet (Score:3)
To learn, like we all do.
Not to mention reddit (Score:2)
Reddit has tons of low-grade bots, so it's not surprising they also used it to train up their models
https://www.nytimes.com/2023/0... [nytimes.com]
"Reddit's array of chats also have been a free teaching aid for companies like Google, OpenAI and Microsoft. Those companies are using Redditâ(TM)s conversations in the development of giant artificial intelligence systems that many in Silicon Valley think are on their way to becoming the tech industryâ(TM)s next big thing.
Now Reddit wants to be paid for it."
Nothing to see here move along break it up (Score:1)
Of course they did (Score:2)
With no meaningful penalties at all, a huge potential profit, and a track record... nay, an entire business... built upon invasion of users' privacy, of course these companies used whatever data they could get their hands on to train their AI.
On top of that, for corporations, it is always easier to seek forgiveness than to seek permission. Do first then apologize later (only if caught!) is their SOP.
Seems pretty unlikely (Score:2)
debated bending the law
I'm willing to bet there was no debate, and that they just went ahead and did it. I think it's possible that they debated breaking the law before they (probably) did so.
Go synthetic (Score:2)
Re: (Score:2)
Use of synthetic data causes "model collapse".
The model ends up unable to understand anything it is trained on.
The best you can hope for is to use one model to train another, as long as the one being trained hasn't seen the material before.
Even that only mitigates the problem.
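A minimal sketch of that collapse dynamic, using a toy Gaussian "model" rather than an LLM (all parameters invented for illustration): each generation is fit only to samples from the previous generation's fit, so tail information is lost and the estimated spread tends to decay:

```python
import numpy as np

# Generation 0 is "real" data; every later generation trains only on
# samples from the previous generation's fitted model. The sample std
# is biased low and accumulates noise, so sigma tends to shrink over
# generations -- a toy version of "model collapse".
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
for gen in range(1, 31):
    samples = rng.normal(mu, sigma, size=50)   # the synthetic training set
    mu, sigma = samples.mean(), samples.std()  # refit the "model"
    if gen % 5 == 0:
        print(f"generation {gen:2d}: sigma = {sigma:.3f}")
```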
Users bring their data to AI on their own! (Score:2)
Big Data steals data (Score:2)
Shocker...