Facebook AI

Meta's Not Telling Where It Got Its AI Training Data (slashdot.org) 26

An anonymous reader shares a report: Today Meta unleashed its ChatGPT competitor, Meta AI, across its apps and as a standalone. The company boasts that it is running on its latest, greatest AI model, Llama 3, which was trained on "data of the highest quality"! A dataset seven times larger than Llama 2's! And one that includes four times more code! What is that training data? There, the company is less loquacious.

Meta said the 15 trillion tokens on which Llama 3 was trained came from "publicly available sources." Which sources? Meta told The Verge that it didn't include Meta user data, but didn't give much more in the way of specifics. It did mention that it includes AI-generated data, or synthetic data: "we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3." There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it's liable to spit out a more concentrated version of any garbage it is ingesting.
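In practice, "using Llama 2 to generate the training data for text-quality classifiers" generally means having the big model label a sample of documents as high or low quality, then training a cheap classifier on those labels so it can score a web-scale corpus without calling the LLM on every document. The sketch below is a toy illustration of that general idea, not Meta's actual pipeline; the llm_quality_label() stub and the mock corpus are invented stand-ins for a real Llama 2 call and real web data.

# Toy sketch: an LLM labels a few documents, a cheap classifier learns from
# those labels, and the classifier then filters candidate training documents.
# Not Meta's pipeline; llm_quality_label() is a hypothetical stand-in for a
# real Llama 2 call, and the corpus is mock data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def llm_quality_label(doc: str) -> int:
    # A real pipeline would prompt an LLM ("Is this document high quality?")
    # and parse its answer; a trivial heuristic stands in so the sketch runs.
    return 1 if len(doc.split()) > 5 and "click here" not in doc.lower() else 0

corpus = [  # mock documents; a real corpus would be web-scale
    "An in-depth explanation of how attention works in transformers.",
    "click here click here FREE prizes!!!",
    "A peer-reviewed study on protein folding, with full methodology.",
    "buy now",
]

labels = [llm_quality_label(doc) for doc in corpus]  # "LLM"-generated labels

# Train a lightweight classifier on the LLM's labels; it is cheap enough to
# score billions of documents, unlike the LLM that produced its labels.
vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(corpus), labels)

# Use the classifier to decide which candidate documents to keep for training.
candidates = ["A tutorial on numerical stability in floating point.",
              "WIN WIN WIN click here"]
keep = [d for d in candidates if clf.predict(vectorizer.transform([d]))[0] == 1]
print(keep)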


Comments Filter:
  • There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it's liable to spit out a more concentrated version of any garbage it is ingesting.

    This just reminds me so much of bottom-feeders getting eaten by bigger and bigger fish.

  • by backslashdot ( 95548 ) on Friday April 19, 2024 @01:29PM (#64408264)

    is we don't wanna say.

  • Terabytes of "high quality" comments on videos about grinding resin on a lathe.

    • by Hadlock ( 143607 )

      People arguing about grades of steel on "one weird trick to make the PERFECT knife out of a cast iron skillet, EDC" video 7,363,983

  • Something like the title, maybe?
  • The author has no clue what they're talking about:

    Meta said the 15 trillion tokens on which Llama 3 was trained came from "publicly available sources." Which sources? Meta told The Verge that it didn't include Meta user data, but didn't give much more in the way of specifics. It did mention that it includes AI-generated data, or synthetic data: "we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3." There are plenty of known issues with synthetic or AI-created data...

  • Our data comes from our AI's fathers, fathers, cousins nephew's children's AI model
  • Everyone who publishes that data is just providing free evidence. Why would they do that? Who knows, but this is smarter, and only slightly less scrupulous.

  • When talking to various AI products like Bard/Gemini, Bing, and ChatGPT it's obvious they are getting their facts from Wikipedia. 20 years ago Wikipedia was a big no-no, but now a new generation of people tend to take it as an authoritative source despite the large amount of vandals (I used to vandalize it, so I know from experience), long-term abusers, businesses writing their own articles and the infamous dysfunctional administration system. AI also blindly shits out walls of text from academic journals and...
    • It may have learned based on wikipedia and other sources .. BUT .. unless you're an idiot, it definitely is more useful and faster than googling and reading its sources. For one thing, it can write letters and entire copy-pastable code with detailed instructions for you. A lot of my "coding" activity is just reviewing code that AI spits out for me -- which is usually flawless. In fact sometimes I give it my code to review (it isn't great yet, but it does say stuff like "you're missing a try-catch" etc.)

      • This bad-data-aggregation problem will get worse, IMO. A recent example for me was an Apple product: the weight of the item was listed on Apple's site (but sounded wrong to me). I found the same info on Amazon and another indie review site. I bought the item (weighed it myself) and the weight wasn't even close.

        An experienced person would sense that the info couldn't be right, but not an AI whatever. I ended up keeping the iPad case, but every time I pick it up and feel the weight of it I get mad - a physical reminder
      • I don't get it. It's just not that hard to write code if you've been doing it for a while. It takes me much longer to organize and debug code than write it. Writing a lot of code very fast always comes back to haunt me eventually, even if there are no errors. Perhaps whatever you're working on is much more amenable to AI coding than my stuff.

    • by Rei ( 128717 )

      I've probably done tens of thousands of legit, constructive edits, but even I couldn't resist the temptation to prank it at one point. The article was on the sugar apple (Annona squamosa), and at the time, there was a big long list of the name of the fruit in different languages. I wrote that in Icelandic, the fruit was called "Hva[TH]er[TH]etta" (eth and thorn don't work on Slashdot), which means "What's that?", as in, "I've never seen that fruit before in my life" ;) Though the list disappeared from Wikipedia...

  • Myspace.
  • Sharing information openly.

    Honestly, if you don't want your information read (whether by people or bots should make no difference), don't publish it on the f**king world wide web.

    That's my basic view on it. Why should an AI not have the "right" to learn from the stuff people publish on the open web?
  • Whether it speaks ill or well, the performance will speak for itself.

  • Central Park is "publicly available," but that doesn't mean you have the right to cut down the trees in it and sell them for lumber. The AI companies are selling lumber and when you ask them where they got it, they shrug and say "publicly available sources."

Real Programmers don't write in PL/I. PL/I is for programmers who can't decide whether to write in COBOL or FORTRAN.
