Meta's Not Telling Where It Got Its AI Training Data (slashdot.org)
An anonymous reader shares a report: Today Meta unleashed its ChatGPT competitor, Meta AI, across its apps and as a standalone. The company boasts that it is running on its latest, greatest AI model, Llama 3, which was trained on "data of the highest quality"! A dataset seven times larger than Llama 2! And includes 4 times more code! What is that training data? There the company is less loquacious.
Meta said the 15 trillion tokens on which it's trained came from "publicly available sources." Which sources? Meta told The Verge that it didn't include Meta user data, but didn't give much more in the way of specifics. It did mention that it includes AI-generated data, or synthetic data: "we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3." There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it's liable to spit out a more concentrated version of any garbage it is ingesting.
Mercury poisoning (Score:2)
There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it's liable to spit out a more concentrated version of any garbage it is ingesting.
This just reminds me so much of bottom-feeders getting eaten by bigger and bigger fish.
Soylent green (Score:3)
is we don't wanna say.
Quite obvious (Score:2)
Terabytes of "high quality" comments on videos about grinding resin on a lathe.
Re: (Score:2)
People arguing about grades of steel on "one weird trick to make the PERFECT knife out of a cast iron skillet, EDC" video 7,363,983
cat /dev/internet > llm.data ; cat llm.data | llm (Score:3)
Re: (Score:2)
Re: (Score:2)
Warning: cat memes, neonazis, incel suicide bombers, conspiracy theories, and ear-licking ASMR encountered.
Results may not be suitable for human consumption or valid byte sequence with this locale.
Nonsense (Score:2)
The author has no clue what they're talking about:
Copyright infringement obfuscation (Score:2)
this is smart (Score:2)
Everyone who publishes that data is just providing free evidence. Why would they do that? Who knows, but this is smarter, and only slightly less scrupulous.
AI is just Wikipedia (Score:2)
Re: (Score:2)
It may have learned based on wikipedia and other sources .. BUT .. unless you're an idiot, it definitely is more useful and faster than googling and reading its sources. For one thing, it can write letters and entire copy-pastable code with detailed instructions for you. A lot of my "coding" activity is just reviewing code that AI spits out for me -- which is usually flawless. In fact sometimes I give it my code to review (it isn't great yet, but it does say stuff like "you're missing a try-catch" etc.) If
Re: (Score:2)
An experienced person would sense that the info couldn't be right, but not an AI whatever. I ended up keeping the iPad case, but every time I pick it up and feel the weight of it I get mad - a physical reminder
Re: (Score:1)
I don't get it. It's just not that hard to write code if you've been doing it for a while. It takes me much longer to organize and debug code than to write it. Writing a lot of code very fast always comes back to haunt me eventually, even if there are no errors. Perhaps whatever you're working on is much more amenable to AI coding than my stuff.
Re: (Score:2)
I've probably done tens of thousands of legit, constructive edits, but even I couldn't resist the temptation to prank it at one point. The article was on the sugar apple (Annona squamosa), and at the time, there was a big long list of the name of the fruit in different languages. I wrote that in Icelandic, the fruit was called "Hva[TH]er[TH]etta" (eth and thorn don't work on Slashdot), which means "What's that?", as in, "I've never seen that fruit before in my life" ;) Though the list disappeared from Wik
Myspace (Score:2)
I remember the original premise of the web (Score:2)
Honestly if you don't want your information read, whether by people or bots should make no difference, don't publish it on the f**king world wide web.
That's my basic view on it. Why should an AI not have the "right" to learn from the stuff people publish on the open web?
Re: (Score:2)
"Similarly, when someone's art/music/poem/news article is redistributed by the AI, it's fine. But if we do it, it could be copyright infringement."
That is the trade though, the output of AI isn't merely public domain, it is ineligible for copyright.
Re: (Score:2)
Probably also worth remembering that copyright/IP isn't a real thing. You don't really have any claim on ideas, they aren't even really novel, that is an illusion, others have the same ideas and if you are hit by a bus your brilliant unshared insight won't be lost forever.
We created the artificial concept of copyright to encourage people to produce and share, but at some point that may outlive its usefulness, and we'll get more public benefit by having everything simply be public domain and freely shared.
Re: (Score:2)
This is little different to what a well educated person does when they answer a question or write an essay on some topic that they have learned about from many sources.
If that sort of thing (re-arranged synthesis from statistical abstractions of many sources) is copyright infringeme
Does it matter? (Score:2)
Whether it speaks ill or well, the performance will speak for itself.
Many privately owned things are publicly available (Score:2)
Central Park is "publicly available," but that doesn't mean you have the right to cut down the trees in it and sell them for lumber. The AI companies are selling lumber and when you ask them where they got it, they shrug and say "publicly available sources."