Why Synthetic Data is Being Used To Train AI Models (ft.com)
Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch. From a report: Microsoft, OpenAI and Cohere are among the groups testing the use of so-called synthetic data -- computer-generated information -- to train their AI systems, known as large language models (LLMs), as they reach the limits of human-made data that can further improve the cutting-edge technology. The launch of Microsoft-backed OpenAI's ChatGPT last November has led to a flood of products rolled out publicly this year by companies including Google and Anthropic, which can produce plausible text, images or code in response to simple prompts.
The technology, known as generative AI, has driven a surge of investor and consumer interest, with the world's biggest technology companies including Google, Microsoft and Meta racing to dominate the space. Currently, LLMs that power chatbots such as OpenAI's ChatGPT and Google's Bard are trained primarily by scraping the internet. Data used to train these systems includes digitised books, news articles, blogs, search queries, Twitter and Reddit posts, YouTube videos and Flickr images, among other content. Humans then provide feedback and fill gaps in the information in a process known as reinforcement learning from human feedback (RLHF). But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible and high-quality data to train on. Meanwhile, they are under fire from regulators, artists and media organisations around the world over the volume and provenance of personal data consumed by the technology.
Yeah, why? (Score:2)
Title suggests there's an answer to why, but of course there is not. And of course, a /. editor doesn't know any better.
Also, "so-called synthetic data"? Does the editor understand what that means? Synthetic data has a specific meaning, precisely the meaning used here.
The source of data is not what matters, the quality of the data is what matters.
Re:Yeah, why? (Score:5, Interesting)
The fact they are even considering this indicates GenAI has peaked and they know it.
Re: (Score:2)
That was my first thought. The implosion should be spectacular. The next big thing will be just setting giant piles of money on fire.
Re: (Score:1)
That was my first thought. The implosion should be spectacular. The next big thing will be just setting giant piles of money on fire.
Maybe not. The data we create on the internet is analog and limitless as long as we continue to use it. I think the corporate model needs to label and own everything, so resigning themselves to building language models from a common, unownable medium is unacceptable to them.
Re: (Score:2)
Data is expensive. Really, really expensive, and slow too. The impressive thing about ChatGPT is how much stuff it "knows" about. It's a bit like when Google StreetView came out and you realized Google ponied up the money to send cars with cameras effectively *everywhere*. It's shocking.
On the other hand, making shit up is dirt cheap and fast. But it's also less generally useful.
I am not an AI nerd, but my understanding of the uses of synthetic data is for system testing -- which you sure want to do before deploying anything.
Re: (Score:2)
Sounds like they're just working on their legal defense, all the while secretly training on every scrap of written word they can find.
Along with the specter of "WE CAN'T LET CHINA WIN THE AI WAR" (what do they win????) to corral Congress into granting some broad "fair use" copyright exception for AI models.
The irony is that if someone's AI was generating Mickey Mouse-like cartoons, you bet your sweet ass Disney would put a stop to it, but they want to use AI to replace real actors.
Re: (Score:2)
The irony is that if someone's AI was generating Mickey Mouse-like cartoons, you bet your sweet ass Disney
Well, they'd try to - hopefully they'd be told to go after the infringer misusing it instead, but with IP laws as fucky-wucky as they are today, who knows if that'd happen...
Paywalled. Larger summary via Reddit (via AI?): (Score:5, Informative)
Possibly related ArXiv paper (Score:3)
How Does This Work? (Score:3)
Re: (Score:1)
I think we need to conceptually separate "synthetic data" from the "model collapse" mentioned above.
Synthetic data does not add more information: it helps solve the problem of overfitting the model to the training data, where the model performs poorly on a separate test data set.
With image classification, it is clear from research that "data augmentation" by zooming, adding noise, stretching, rotating, shearing and changing the color space can increase the effective size of your annotated image dataset and thus reduce overfitting.
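For example, a minimal augmentation pipeline might look like this (just a sketch assuming torch/torchvision; the specific transforms and parameters are illustrative, not from any particular paper):

```python
# Illustrative augmentation sketch (assumes torch/torchvision; parameters
# are arbitrary): each pass over the same labeled image yields a slightly
# different view, inflating the effective size of the annotated dataset.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                           # rotate
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1), shear=10),  # zoom / shear
    transforms.ColorJitter(brightness=0.2, hue=0.05),                # color space
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.01 * torch.randn_like(t)).clamp(0, 1)),  # noise
])
# Applying `augment` to the same PIL image repeatedly yields distinct
# tensors, each still carrying the original label.
```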
garbage in garbage out (Score:2)
The Best Source of Data (Score:2)
Synthetic data for synthetic use cases. (Score:2)
I can't imagine synthetic data is going to produce anything better suited to the real-world use cases they are trying to develop these monsters for. Synthetic data will be absolutely GREAT for synthetic use cases. Like testing. Or having the AIs develop a self-sustaining loop of infinite silliness that will mean absolutely nothing to humans.
Granted, if the AI proponents get their way, we'll all change ourselves to fit the AIs. So maybe there won't be any more need for reality in the glowing AI future.
I'm confused (Score:2)
So they have a system smart enough to generate meaningful training data, but at the same time they have systems dumb enough that they can learn from the data?
Slashdot from the 80s: (Score:3)
This is NOT a terrible idea (Score:2)
Intelligence is a form of compression. Compression of knowledge results in less information needing to be available to arrive at the same quantity and quality of conclusions. Compression of knowledge, up to a point, is almost always a good thing, which is why generalization and specialization are such important strategies to human learning.
A trend has recently emerged of using micro-models, model-derived models, negative prompting, and techniques such as LoRA (Low-Rank Adaptation) to improve the quality of model output without full retraining.
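For the unfamiliar, the core LoRA trick can be sketched in a few lines (a hypothetical minimal example, not any particular library's implementation): freeze the pretrained weight and train only a small low-rank update alongside it.

```python
# Minimal LoRA sketch (illustrative, assumes PyTorch): the frozen base weight
# W is augmented with a trainable low-rank product B @ A, so the effective
# weight is W + (alpha / r) * B @ A. Only A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)                      # freeze pretrained weight
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # small random init
        self.B = nn.Parameter(torch.zeros(out_features, r))         # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and training touches only a tiny fraction of the parameters.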
Re: (Score:2)
But repeatedly applying lossy compression IS a terrible idea. It's AI models playing a game of Telephone.
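You can watch the Telephone effect in a toy setting (purely illustrative, assuming numpy): repeatedly refit a trivial "model" to samples drawn from the previous generation's fit, and the fit drifts away from the original distribution.

```python
# Toy "Telephone" sketch (illustrative): each generation trains only on
# samples from the previous generation's model. Sampling noise and the
# small downward bias of the variance estimate compound over generations.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                              # the "real" data distribution
for gen in range(10):
    data = rng.normal(mu, sigma, size=200)        # synthetic training set
    mu, sigma = data.mean(), data.std()           # refit the "model"
    print(f"gen {gen}: mu={mu:+.3f} sigma={sigma:.3f}")
```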
Re: (Score:3)
Human beings have the same problem but they still perform useful work. I don't know if anyone should be treating the AI like a knowledge base anyway. Teach it how to fish and then give it access to good quality reference material separate from its training. Both people and AI bots hallucinate facts when memory is fuzzy. Humans are smart enough to assign a confidence level (some more accurately than others) and use that to direct their next step - whether that's research or just more thought.
Re: (Score:2)
But AI only works in one direction. We're able to look at the output of Telephone and realize that it's completely bogus. AI has no concept of that. It has no concept of concepts.
The standard AI researcher will just reply with "we need more data", when the answer is "we can't model that yet".
Re: (Score:2)
Synthetic data contains at best as much INFORMATION as is contained in the program that synthesizes it (Kolmogorov complexity). It stands to reason that such a (man-made!) program is less complex than the collective output of millions of people. This is probably data laundering, as someone mentioned above, or desperation to justify the investments, or both.
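Loosely formalized (a sketch of the argument, with P the generator program and s its random seed):

```latex
% If dataset D is produced by running program P on seed s, then describing
% (P, s) suffices to reconstruct D, so
\[
  K(D_{\mathrm{syn}}) \;\le\; K(P) + K(s) + O(1).
\]
% The synthetic data carries no more information than the generator plus
% whatever randomness it consumed.
```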
Re: (Score:2)
When this approach proves viable, you'll see more research papers appear, describing the active ingredients of good solutions.
I don't believe this is true if non-determinism is involved.
Also, when it comes to AI algorithms, good data can be buried in bad relationships.
More efficient use of data needed (Score:2)
If an AI studied and absorbed all of Wikipedia, including the incidentals of correct language usage, it would outperform current LLMs on most things.
Current LLMs rely on massive redundancy to fuel their statistical models. Improve that, and you have plenty of data available.
Re: (Score:2)
Yeah, a different type of AI would perform much better at being a knowledge base. But LLMs are not that. They are predictive text. They can generate a convincing-looking Wikipedia article or regurgitate subsets, with or without hallucination. It's just that LLMs are what we have right now.
synthetic data (Score:2)
That's Brent Spiner, isn't it?
Synthetic, or direct observation? (Score:2)
If it's just random synthetic data it seems like GIGO, but what about direct observation of the physical world? That's a tall order. Biologists, for example, have been going out and observing creatures as long as there have been biologists. Some of that data is public domain, but you need a legal team to figure out which, and getting the AI fresh data from the physical world could have value. Human biologists have categorized all the critters, and prior to knowledge of DNA those categories were based on observation.
Useful for hyper-specific tasks (Score:2)