Facebook AI

Meta's Not Telling Where It Got Its AI Training Data (slashdot.org) 26

An anonymous reader shares a report: Today Meta unleashed its ChatGPT competitor, Meta AI, across its apps and as a standalone. The company boasts that it is running on its latest, greatest AI model, Llama 3, which was trained on "data of the highest quality"! A dataset seven times larger than Llama 2's! And one that includes four times more code! What is that training data? There, the company is less loquacious.

Meta said the 15 trillion tokens on which Llama 3 was trained came from "publicly available sources." Which sources? Meta told The Verge that it didn't include Meta user data, but didn't give much more in the way of specifics. It did mention that it includes AI-generated data, or synthetic data: "we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3." There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it's liable to spit out a more concentrated version of any garbage it is ingesting.
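In practice, "using Llama 2 to generate the training data for text-quality classifiers" generally means having the big model label a sample of documents as high or low quality, then training a cheap classifier on those labels so it can score a web-scale corpus without calling the LLM on every document. The sketch below is a toy illustration of that general idea, not Meta's actual pipeline; the llm_quality_label() stub and the mock corpus are invented stand-ins for a real Llama 2 call and real web data.

# Toy sketch: an LLM labels a few documents, a cheap classifier learns from
# those labels, and the classifier then filters candidate training documents.
# Not Meta's pipeline; llm_quality_label() is a hypothetical stand-in for a
# real Llama 2 call, and the corpus is mock data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def llm_quality_label(doc: str) -> int:
    # A real pipeline would prompt an LLM ("Is this document high quality?")
    # and parse its answer; a trivial heuristic stands in so the sketch runs.
    return 1 if len(doc.split()) > 5 and "click here" not in doc.lower() else 0

corpus = [  # mock documents; a real corpus would be web-scale
    "An in-depth explanation of how attention works in transformers.",
    "click here click here FREE prizes!!!",
    "A peer-reviewed study on protein folding, with full methodology.",
    "buy now",
]

labels = [llm_quality_label(doc) for doc in corpus]  # "LLM"-generated labels

# Train a lightweight classifier on the LLM's labels; it is cheap enough to
# score billions of documents, unlike the LLM that produced its labels.
vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(corpus), labels)

# Use the classifier to decide which candidate documents to keep for training.
candidates = ["A tutorial on numerical stability in floating point.",
              "WIN WIN WIN click here"]
keep = [d for d in candidates if clf.predict(vectorizer.transform([d]))[0] == 1]
print(keep)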


Comments Filter:
  • There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it's liable to spit out a more concentrated version of any garbage it is ingesting.

    This just reminds me so much of bottom-feeders getting eaten by bigger and bigger fish.

  • by backslashdot ( 95548 ) on Friday April 19, 2024 @01:29PM (#64408264)

    is we don't wanna say.

  • Terabytes of "high quality" comments on videos about grinding resin on a lathe.

    • by Hadlock ( 143607 )

      People arguing about grades of steel on "one weird trick to make the PERFECT knife out of a cast iron skillet, EDC" video 7,363,983

  • Something like the title, maybe?
  • The author has no clue what they're talking about:

    Meta said the 15 trillion tokens on which Llama 3 was trained came from "publicly available sources." Which sources? Meta told The Verge that it didn't include Meta user data, but didn't give much more in the way of specifics. It did mention that it includes AI-generated data, or synthetic data: "we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3." There are plenty of known issues with synthetic or AI-created data...

  • Our data comes from our AI's fathers, fathers, cousins nephew's children's AI model
  • Everyone who publishes that data is just providing free evidence. Why would they do that? Who knows, but this is smarter, and only slightly less scrupulous.

  • When talking to various AI products like Bard/Gemini, Bing, and ChatGPT it's obvious they are getting their facts from Wikipedia. 20 years ago Wikipedia was a big no-no, but now a new generation of people tend to take it as an authoritative source despite the large amount of vandals (I used to vandalize it, so I know from experience), long-term abusers, businesses writing their own articles and the infamous dysfunctional administration system. AI also blindly shits out walls of text from academic journals and...
    • It may have learned based on wikipedia and other sources .. BUT .. unless you're an idiot, it definitely is more useful and faster than googling and reading its sources. For one thing, it can write letters and entire copy-pastable code with detailed instructions for you. A lot of my "coding" activity is just reviewing code that AI spits out for me -- which is usually flawless. In fact sometimes I give it my code to review (it isn't great yet, but it does say stuff like "you're missing a try-catch" etc.)

      • This bad-data-aggregation problem will get worse, IMO. A recent example for me was an Apple product: the weight of the item was listed on Apple's site (but sounded wrong to me). I found the same info on Amazon and another indie review site. I bought the item (weighed it myself) and the weight wasn't even close.

        An experienced person would sense that the info couldn't be right, but not an AI whatever. I ended up keeping the iPad case, but every time I pick it up and feel the weight of it I get mad - a physical reminder
      • I don't get it. It's just not that hard to write code if you've been doing it for a while. It takes me much longer to organize and debug code than write it. Writing a lot of code very fast always comes back to haunt me eventually, even if there are no errors. Perhaps whatever you're working on is much more amenable to AI coding than my stuff.

    • by Rei ( 128717 )

      I've probably done tens of thousands of legit, constructive edits, but even I couldn't resist the temptation to prank it at one point. The article was on the sugar apple (Annona squamosa), and at the time, there was a big long list of the name of the fruit in different languages. I wrote that in Icelandic, the fruit was called "Hva[TH]er[TH]etta" (eth and thorn don't work on Slashdot), which means "What's that?", as in, "I've never seen that fruit before in my life" ;) Though the list disappeared from Wikipedia...

  • Myspace.
  • Sharing information openly.

    Honestly, if you don't want your information read (whether by people or bots should make no difference), don't publish it on the f**king world wide web.

    That's my basic view on it. Why should an AI not have the "right" to learn from the stuff people publish on the open web?
  • Whether it speaks ill or well, the performance will speak for itself.

  • Central Park is "publicly available," but that doesn't mean you have the right to cut down the trees in it and sell them for lumber. The AI companies are selling lumber and when you ask them where they got it, they shrug and say "publicly available sources."

Real Programmers don't write in PL/I. PL/I is for programmers who can't decide whether to write in COBOL or FORTRAN.
