AI Technology

Why Synthetic Data is Being Used To Train AI Models (ft.com)

Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch. From a report: Microsoft, OpenAI and Cohere are among the groups testing the use of so-called synthetic data -- computer-generated information -- to train the AI systems known as large language models (LLMs), as they reach the limits of human-made data that can further improve the cutting-edge technology. The launch of Microsoft-backed OpenAI's ChatGPT last November has led to a flood of products rolled out publicly this year by companies including Google and Anthropic that can produce plausible text, images or code in response to simple prompts.

The technology, known as generative AI, has driven a surge of investor and consumer interest, with the world's biggest technology companies including Google, Microsoft and Meta racing to dominate the space. Currently, LLMs that power chatbots such as OpenAI's ChatGPT and Google's Bard are trained primarily by scraping the internet. Data used to train these systems includes digitised books, news articles, blogs, search queries, Twitter and Reddit posts, YouTube videos and Flickr images, among other content. Humans are then used to provide feedback and fill gaps in the information in a process known as reinforcement learning from human feedback (RLHF). But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible and high-quality data to train on. Meanwhile, they are under fire from regulators, artists and media organisations around the world over the volume and provenance of personal data consumed by the technology.

This discussion has been archived. No new comments can be posted.

Comments Filter:
  • Title suggests there's an answer to why, but of course there is not. And of course, a /. editor doesn't know any better.

    Also, "so-called synthetic data"? Does the editor understand what that means? Synthetic data has a specific meaning, precisely the meaning used here.

    The source of data is not what matters, the quality of the data is what matters.

    • Re:Yeah, why? (Score:5, Interesting)

      by iMadeGhostzilla ( 1851560 ) on Wednesday July 19, 2023 @10:52AM (#63699112)

      The fact they are even considering this indicates GenAI has peaked and they know it.

      • That was my first thought. The implosion should be spectacular. The next big thing will be just setting giant piles of money on fire.

        • That was my first thought. The implosion should be spectacular. The next big thing will be just setting giant piles of money on fire.

Maybe not. The data we create on the internet is organic and effectively limitless, as long as we keep using the internet. But the corporate model demands labeling and owning everything, so resigning themselves to building language models from a common, unownable medium is unacceptable to them.

• Seems like this will just be used to launder the data: train one LLM on the output of another and pretend that doesn't mean they effectively used the same training data, whose copyright is therefore still relevant.
    • by hey! ( 33014 )

Data is expensive. Really, really expensive, and slow too. The impressive thing about ChatGPT is how much stuff it "knows" about. It's a bit like when Google StreetView came out and you realized Google ponied up the money to send cars with cameras effectively *everywhere*. It's shocking.

      On the other hand making shit up is dirt cheap and fast. But it's also less generally useful.

I am not an AI nerd, but my understanding of the uses of synthetic data is for system testing -- which you sure want to do before

  • by Khopesh ( 112447 ) on Wednesday July 19, 2023 @10:52AM (#63699116) Homepage Journal
    Quoting ShotgunProxy on r/ChatGPT [reddit.com] in what is probably an AI-generated summary:

    Microsoft and OpenAI test synthetic data to train LLMs, as web data is "no longer good enough"

    AI models need increasingly unique and sophisticated data sets to improve their performance, but the developers behind major LLMs are finding that web data is "no longer good enough" and getting "extremely expensive," a report from the Financial Times [ft.com] (note: paywalled) reveals.

    So OpenAI, Microsoft, and Cohere are all actively exploring the use of synthetic data to save on costs and generate clean, high-quality data.

    Why this matters:

    • Major LLM creators believe they have reached the limits of human-made data improving performance. The next dramatic leap in performance may not come from just feeding models more web-scraped data.
    • Custom human-created data is extremely expensive and not a scalable solution. Getting experts in various fields to create additional finely detailed content is unviable at the quantity of data needed to train AI.
• Web data is increasingly under lock and key, as sites like Reddit, Twitter, and others are charging hefty fees for access to their data.

The approach is to have AI generate its own training data going forward:

• Cohere is having two AI models act as tutor and student to generate synthetic data. All of it is reviewed by a human at this point. (A sketch of this setup appears after this list.)
• Microsoft's research team has shown that certain synthetic data can be used to train smaller models effectively -- but improving GPT-4's performance with synthetic data is still not viable.
    • Startups like Scale.ai [scale.ai] and Gretel.ai [gretel.ai] are already offering synthetic data-as-a-service, showing there's market appetite for this.
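
For a rough idea of what that tutor/student setup could look like (a hypothetical sketch, not Cohere's actual pipeline; complete() is a stand-in for any LLM completion API, and the prompts are invented):

```python
# Hypothetical tutor/student loop for generating synthetic training data.
# complete() is a stub standing in for any hosted LLM completion API.

def complete(prompt: str) -> str:
    # Placeholder: in practice this would call a real model endpoint.
    return f"[model output for: {prompt[:40]}...]"

def generate_synthetic_pairs(topic: str, n: int) -> list[dict]:
    pairs = []
    for _ in range(n):
        # The "tutor" model invents a question on the topic...
        question = complete(f"Write one challenging question about {topic}.")
        # ...and produces a worked answer for the "student" to learn from.
        answer = complete(f"Answer step by step:\n{question}")
        pairs.append({"prompt": question, "completion": answer})
    return pairs

# Per the summary above, each pair would still be reviewed by a human
# before joining the student model's fine-tuning set.
dataset = generate_synthetic_pairs("linear algebra", n=100)
```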

    What are AI leaders saying? They're determined to explore this future.

    • Sam Altman explained in May that he was “pretty confident that soon all data will be synthetic data,” which could help OpenAI sidestep privacy concerns in the EU. The pathway to superintelligence, he posited, is through models teaching themselves.
    • Aidan Gomez, CEO of LLM startup Cohere, believes web data is not great: "the web is so noisy and messy that it’s not really representative of the data that you want. The web just doesn’t do everything we need."

Some AI researchers are urging caution, however: researchers from Oxford and Cambridge recently found that training AI models on their own raw outputs risked creating "irreversible defects" in these models that could corrupt and degrade their performance over time.

The main takeaway: Human-made content was used to develop the first generations of LLMs. But we're now entering a fascinating world where, over the next decade, human-created content could become truly rare, with the bulk of the world's data and content created by AI.

    • From a few months ago, Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [arxiv.org] by Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, Xia Hu. Abstract:

Recent advancements in large language models (LLMs) have led to the development of highly potent models like OpenAI's ChatGPT. These models have exhibited exceptional performance in a variety of tasks, such as question answering, essay composition, and code generation. However, their effectiveness in the healthcare sector remains uncertain. In this study

• Can someone explain how they can generate language data that can be used to train an AI to make it better than the thing generating the training language? Also, given that no "AI" system to date can produce reliably factual information, exactly how are they going to ensure that the training data is factually correct? Although if they are training the current models on random internet data, perhaps they have already given up on facts.
• I think we need to conceptually separate "synthetic data" from the "model collapse" mentioned above.

Synthetic data does not add more information: it helps solve the problem of overfitting, where the model fits the training data so closely that it performs poorly on a separate test data set.

With image classification, it is clear from research that "data augmentation" by zooming, adding noise, stretching, rotating, shearing and changing the color space can increase the effective size of your annotated image dataset and thus reduce overfitting.
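
To make that concrete (a minimal sketch using torchvision, which is my choice of library, not the poster's):

```python
# Minimal image data-augmentation pipeline with torchvision.
# Each random transform produces a new variant of the same labelled image,
# multiplying the effective size of the annotated dataset.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),        # zoom
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1),
                            shear=10),                          # rotate/stretch/shear
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),            # color space
    transforms.ToTensor(),
    transforms.Lambda(lambda t:                                 # additive noise
        (t + 0.05 * torch.randn_like(t)).clamp(0.0, 1.0)),
])

# Usage: hand the pipeline to a dataset so every epoch sees fresh variants,
# e.g. datasets.ImageFolder("train/", transform=augment)
```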

  • at least they will have garbage bias
  • I can't imagine synthetic data is going to produce anything better suited to the real-world use cases they are trying to develop these monsters for. Synthetic data will be absolutely GREAT for synthetic use cases. Like testing. Or having the AIs develop a self-sustaining loop of infinite silliness that will mean absolutely nothing to humans.

    Granted, if the AI proponents get their way, we'll all change ourselves to fit the AIs. So maybe there won't be any more need for reality in the glowing AI future. Hell,

  • So they have a system smart enough to generate meaningful training data, but at the same time they have systems dumb enough that they can learn from the data?

  • by kaatochacha ( 651922 ) on Wednesday July 19, 2023 @11:46AM (#63699282)
    "Cassette tapes can be copied infinitely with no loss of audio quality!"
  • Intelligence is a form of compression. Compression of knowledge results in less information needing to be available to arrive at the same quantity and quality of conclusions. Compression of knowledge, up to a point, is almost always a good thing, which is why generalization and specialization are such important strategies to human learning.

    A trend has recently begun for using micro-models, model-derived models, negative prompting, and techniques such as LoRA (Low-Rank Adaptation) to improve the quality of
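
For context on the LoRA technique just mentioned, its core idea fits in a few lines (a minimal PyTorch sketch of the standard low-rank formulation, with made-up dimensions):

```python
# Minimal LoRA (Low-Rank Adaptation) layer: the frozen pretrained weight W
# is augmented with a trainable low-rank update (scale * B @ A), so
# fine-tuning touches only r*(d_in + d_out) parameters instead of d_in*d_out.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)   # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))   # drop-in replacement for the base layer
```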

    • But repeatedly applying lossy compression IS a terrible idea. It's AI models playing a game of Telephone.
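
That analogy has a well-known digital counterpart, generation loss: feed a lossy codec its own output and the artifacts compound. A quick sketch with Pillow (an assumed library here; the file path is a placeholder):

```python
# Generation loss in miniature: re-saving a JPEG through its own lossy codec
# degrades it a little more every round -- the failure mode feared for models
# trained on other models' outputs.
from PIL import Image

img = Image.open("photo.jpg")          # placeholder input path
for generation in range(50):
    img.save("copy.jpg", quality=75)   # lossy encode
    img = Image.open("copy.jpg")
    img.load()                         # force full decode before the next overwrite
# After enough rounds, block artifacts dominate the original signal.
```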

      • Human beings have the same problem but they still perform useful work. I don't know if anyone should be treating the AI like a knowledge base anyway. Teach it how to fish and then give it access to good quality reference material separate from its training. Both people and AI bots hallucinate facts when memory is fuzzy. Humans are smart enough to assign a confidence level (some more accurately than others) and use that to direct their next step - whether that's research or just more thought. Certainly

        • But AI only works in one direction. We're able to look at the output of Telephone and realize that it's completely bogus. AI has no concept of that. It has no concept of concepts.

          The standard AI researcher will just reply with "we need more data", when the answer is "we can't model that yet".

• No system has to be considered a single node. Why would an AI not be able to have a Netflix account, for example, while at the same time being able to monitor what everyone else is watching across the entire enterprise?
• Synthetic data contains at best as much INFORMATION as is contained in the program that synthesizes it (Kolmogorov complexity). It stands to reason that such a (man-made!) program is less complex than the collective output of millions of people. This is probably data laundering as someone mentioned above, or desperation to justify the investments, or both.
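
In symbols, that first claim is a standard Kolmogorov-complexity bound (my formalization, not from the comment):

```latex
% For a dataset D emitted by a generator program P running with random seed s:
K(D) \;\le\; |P| + |s| + O(1)
```

The |s| term is where any non-determinism in the generator enters the bound.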

      • This is probably data laundering as someone mentioned above, or desperation to justify the investments, or both.

        When this approach proves viable, you'll see more research papers appear, describing the active ingredients of good solutions.

        Synthetic data contains at best as much INFORMATION as is contained in the program that synthesizes it (Kolmogorov complexity).

        I don't believe this is true, if non-determinism is involved.

        Also, when it comes to AI algorithms, good data can be buried in bad relationships. I

  • If an AI studied and absorbed all of Wikipedia, including the incidentals of correct language usage, it would outperform current LLMs on most things.

    Current LLMs rely on massive redundancy to fuel their statistical models. Improve that, and you have plenty of data available.

    • Yeah, a different type of AI would perform much better on being a knowledge base. But LLMs are not that. They are predictive text. They can generate a convincing looking Wikipedia article or regurgitate subsets with or without hallucination. It's just that LLMs are what we have right now.

• That's Brent Spiner, isn't it?

• If it's just random synthetic data it seems like GIGO, but what about direct observation of the physical world? That's a tall order. Biologists, for example, have been going out and observing creatures as long as there have been biologists. Some of that data is public domain, but you need a legal team to figure out which, and getting the AI fresh data from the physical world could have value. Human biologists have categorized all the critters, and prior to knowledge of DNA those categories were based on

• Academia and inventors have been using synthetic data for half a decade, especially for image processing and object detection -- basically the soap opera effect on your TV, applied to image-based training sets to remove noise. Real data is noisy. Imagine teaching a child how to speak over a telephone: pronunciations will be all over the place.
