For Data-Guzzling AI Companies, the Internet Is Too Small (wsj.com)
Companies racing to develop more powerful artificial intelligence are rapidly nearing a new problem: The internet might be too small for their plans (non-paywalled link). From a report: Ever more powerful systems developed by OpenAI, Google and others require larger oceans of information to learn from. That demand is straining the available pool of quality public data online at the same time that some data owners are blocking access to AI companies. Some executives and researchers say the industry's need for high-quality text data could outstrip supply within two years, potentially slowing AI's development.
AI companies are hunting for untapped information sources, and rethinking how they train these systems. OpenAI, the maker of ChatGPT, has discussed training its next model, GPT-5, on transcriptions of public YouTube videos, people familiar with the matter said. Companies also are experimenting with using AI-generated, or synthetic, data as training material -- an approach many researchers say could actually cause crippling malfunctions. These efforts are often secret, because executives think solutions could be a competitive advantage.
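For readers wondering what "training on transcriptions" involves mechanically, the first step is just speech-to-text. A minimal sketch using the open-source openai-whisper package (pip install openai-whisper, plus ffmpeg); the file name is hypothetical, and nothing in the report says OpenAI's internal pipeline looks like this:

    import whisper

    model = whisper.load_model("base")             # small model, runs on CPU
    result = model.transcribe("some_video.mp4")    # hypothetical local file
    print(result["text"])                          # plain text, ready to tokenize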
Data is among several essential AI resources in short supply. The chips needed to run what are called large-language models behind ChatGPT, Google's Gemini and other AI bots also are scarce. And industry leaders worry about a dearth of data centers and the electricity needed to power them. AI language models are built using text vacuumed up from the internet, including scientific research, news articles and Wikipedia entries. That material is broken into tokens -- words and parts of words that the models use to learn how to formulate humanlike expressions.
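To make "tokens" concrete, here is a quick sketch with OpenAI's open-source tiktoken library (just one tokenizer among many; other models use different vocabularies, but the idea is the same):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Tokenization splits text into words and parts of words."
    ids = enc.encode(text)
    print(ids)                                 # a short list of integer IDs
    print([enc.decode([i]) for i in ids])      # the word or fragment each ID maps to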
Paperclip Maximizer (Score:3, Interesting)
Well, awesome. We've hit the Paperclip Maximizer [lesswrong.com].
Re:Paperclip Maximizer (Score:5, Insightful)
At the very least, this is a leading indicator that we're at or near the peak of inflated expectations. The scope and extent of LLM capabilities is showing its limits: the current approaches are fundamentally flawed, limited, and (frankly) useless anywhere the result is important; at best they work as a fuzzing tool in very limited cases.
Re: (Score:2)
Re: (Score:2)
We have been there for a long time now. It's just that the paperclips get hyped as new and different every time, and countless morons fall for it, convinced that this time must be different, throwing their money at empty promises and at things that are not real or that massively underperform.
Too little too late (Score:3)
Re:Too little too late (Score:5, Insightful)
AI is already pillaging any and all data it can access
Reading information and learning from it is not "pillaging".
If done by a human, reading and learning are not considered "copying" either.
What makes an AI different? While machine learning systems may keep temporary copies of data, the courts have ruled that caching is not copying for copyright enforcement purposes.
the experience of getting AI to be good (think Chess and Go)
The best AI chess and Go players did not learn from humans. They learned by playing against themselves, via self-play reinforcement learning.
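For anyone unfamiliar with self-play, the idea fits in a few lines. A toy sketch on Nim (21 stones, take 1-3 per turn, whoever takes the last stone wins) using tabular Monte Carlo value estimates; AlphaZero-class systems add deep networks and tree search, but the core loop (play yourself, score the game, update, repeat) has this shape:

    import random
    from collections import defaultdict

    value = defaultdict(float)   # stones left -> est. win prob for player to move
    counts = defaultdict(int)

    def choose(stones, eps=0.2):
        moves = [m for m in (1, 2, 3) if m <= stones]
        if random.random() < eps:                           # explore occasionally
            return random.choice(moves)
        return min(moves, key=lambda m: value[stones - m])  # leave opponent the worst spot

    def play_episode():
        stones, player, history = 21, 0, []
        while stones > 0:
            history.append((player, stones))
            stones -= choose(stones)
            player ^= 1
        winner = player ^ 1                  # whoever just took the last stone
        for p, s in history:                 # Monte Carlo running-mean update
            counts[s] += 1
            value[s] += ((1.0 if p == winner else 0.0) - value[s]) / counts[s]

    for _ in range(20000):
        play_episode()

    # Positions where stones % 4 == 0 should now look near-hopeless (~0.0)
    print({s: round(value[s], 2) for s in range(1, 9)})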
the learning material had to be very very domain constrained.
Domain constraints are much less important than you think. The best current LLMs are generalists. Imposing domain constraints assumes humans are better than machine learning systems at deciding what is important. They aren't.
Re:Too little too late (Score:4, Informative)
If you are playing chess or Go, the domain constraints are the game rules.
If you are making a generalist AI, the domain constraints are ultimately physics. That's why these AIs cannot stop hallucinating: there is no way for them to check for correctness, while a chess rule checker is ridiculously simple.
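To underline how cheap the chess-side check is, a two-call sketch with the python-chess library (pip install chess); there is no analogous oracle for whether a factual claim in free text is correct:

    import chess

    board = chess.Board()
    print(board.is_legal(chess.Move.from_uci("e2e4")))   # True: the rules are a perfect, machine-checkable oracle
    print(board.is_legal(chess.Move.from_uci("e2e5")))   # False: illegal output is caught instantly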
Re: (Score:1)
Re: (Score:2)
AI is already pillaging any and all data it can access
Reading information and learning from it is not "pillaging".
As there is no "learning" done here, only parameter calibration, it is pillaging. More precisely, it is massive criminal commercial copyright infringement. And the ones doing it store the data they stole without permission in order to be able to retrain their models.
Re: (Score:2)
"What makes an AI different?"
The scale.
Re: (Score:2)
If done by a human, reading and learning are not considered "copying" either.
If it's done with IP-law protected materials you don't own, then yes it is considered copying, and you can and will be prosecuted for it. (Assuming you're a big enough fish for the IP holder to give a shit about you.)
I don't like the law, and personally, I'd see it repealed. But it is still the law, and a machine doing the copying for you doesn't absolve you from the consequences of breaking it.
What makes an AI different?
Nothing. It's a machine working at the direction of a human. Until that changes, i.e. the machine can be prove
Re: (Score:2)
If done by a human, reading and learning are not considered "copying" either.
If it's done with IP-law protected materials you don't own, then yes it is considered copying, and you can and will be prosecuted for it.
What?? No. Copyright limits distribution and derivative works, not reading. Do you think publishers like you going to the library and borrowing books? They have no legal grounds to stop it. And if you read all day in a bookshop, they can ask you to leave (and you will be trespassing after that), but there is no permission to read copyrighted works that could be revoked.
I may be wrong about subtle details, or have simplified, but I had to correct such a strongly worded misconception. Copyright does not prohibit reading.
Re: (Score:2)
Which is fine. It means all the AI companies have pillaged equally, so ultimately they should all have equally garbage LLMs as they start eating each other.
I keep saying this, but "we will not have general AI" any time soon, definitely not in 5 years.
These AI companies have pillaged all the "free" information they can get. Now if they want any more, they're going to have to source physical textbooks and encyclopedias from before the internet, and carefully curate them so the AI doesn't ingest old or debunked information
Will more data help anything? (Score:2)
mostly garbage (Score:5, Funny)
Whenever you ask an oligarch how much is enough, the answer is always "more"... sound familiar?
Re: mostly garbage (Score:3)
Re: (Score:2)
Take AI by its word (Score:2)
Seems to me that in order to make intelligent use of the data we have, so as to create actual, real, new, and valuable knowledge, AI would have to be ... intelligent? What a concept!
Re: (Score:2)
It isn't intelligence that's missing; it's purpose, wisdom, and honesty. The current approach is fundamentally flawed in that regard: you can't have meaningful responses without those things being the progenitor of discovery.
How much is "enough"? (Score:2)
The internet is the single biggest conglomeration of data that mankind has. What else are these AI creators wanting?
Or is this code, and what they mean to say is "royalty-free sources are too small"?
Where is the next great reservoir of free shit that can be cheaply harvested using someone else's tools and time, and turned into regurgitated AI copy pasta?
Re: (Score:2)
> What else are these AI creators wanting?
Differentiation. They want a product that has different/better training.
Honestly, there is a market for a bunch of smart English-major-type college kids to start a quality data source business.
Re: (Score:2)
But these companies are carrying on like it's merely a matter of getting a bigger data set. The improvements you're describing require improvement in everything else *but* the data set. They have what they need content-wise; the models can't put it together.
Re: (Score:2)
They are just trying to keep the hype going a bit longer to line their coffers. The end of the bubble is pretty near, and it will be just as pathetic a failure as the past AI bubbles. Incidentally, more training data, if it were available, would do very little. But since it is not, they can claim that the failure is not theirs for pushing and selling severely limited, broken tech; it is obviously humanity's fault. They are just planning their exit with the mountains of money they made on false promises
Re: (Score:2)
That's what I'm thinking, this isn't a "data" problem.
Remember when blockchain was going to revolutionize things and save us all? I wonder what the next magic bullet / venture capital money pit will be tomorrow.
Re: (Score:2)
That's what I'm thinking, this isn't a "data" problem.
Indeed. Basically all competent and honest AI researchers now state that we are missing some fundamental insight or idea, and until we find it, the promises made by the lying and hallucinating parts of the AI research field will not come to pass. Funny how the honest researchers are often found in fields that are "AI" but go by a different name, like "automation", "cognitive systems", "pattern recognition", etc.
Remember when blockchain was going to revolutionize things and save us all?
Yes. That this would not happen was pretty clear to me once I t
Re: (Score:2)
That demand is straining the available pool of quality public data online
The key words, I think, are quality and public. Sure, there is a large amount of data, but most of it is utter rubbish or scams. The irony is that while the internet holds large amounts of data, it has become harder to get hold of quality data without paying a subscription for something, and even then there is no guarantee. Sure, it may be out there somewhere for free, but unless you are an expert in the field, how do you wade through all of the nonsense? For example, before, I would go to a recipe book which first listed the ingredients and the
Well then ... (Score:2)
Where is the Electricity Coming From (Score:1)
Re: (Score:2)
It is essentially a get-rich-quick scheme and masses of natural morons are falling for it. The current hype is just a small incremental step and the training data for it was mostly obtained by massive criminal commercial copyright infringement. Not a sustainable basis for anything.
Just fine for real intelligence (Score:2)
The internet is just fine (in fact, way too large) for an actual human intelligence. Humans were able to develop and maintain intelligence with vastly less information.
So, if an AI can consume as much information as the internet can provide and still not be satisfied, arguably it is not "I" at all, and they are doing it wrong.
Re: (Score:2)
Quite true. But humans, even many dumb (i.e. "average") humans, can do some limited fact checking and place higher value on more credible sources, numerous counter-examples notwithstanding (flat-earthers, anti-vaxxers, covid-deniers, Trump voters, the deeply religious, etc.). Even if you only have the usual estimate of 10-15% of people capable of fact checking ("independent thinkers") plus some additional people who can be convinced by rational argument (for a total of around 20%), that is a lot. On the AI
Re:Just fine for real intelligence (Score:4, Informative)
I'm not disagreeing with you, but the source of the hallucinations is a little more nuanced.
When training a neural model, one of the goals is to avoid "overfitting" (or "memorization"). You can think of it as somebody believing cats can only be black because all the cats they've ever seen were black, so in their mind it's impossible for cats to be any other colour. To avoid this scenario, models would need a shit-tonne of data (as many pictures as possible of all the world's cats), or.... "noise" can be added to a "higher layer" of the model, where "colour" is understood. This "noise" could change the colour to blue, green, pink, etc., at random. The upside is that overfitting becomes less likely; the downside is you'll end up with a model that thinks purple and green cats are just as "real" or "common" as black cats. So when somebody queries this model with "What colours can cats be?", it could respond with colours like purple and green.
This is a generalization, but it illustrates the source of "hallucinations".
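The "noise at a higher layer" idea has a standard concrete form: dropout, which randomly zeroes activations during training. A minimal PyTorch sketch (assuming torch is installed; this is one noise-injection technique among several, not a claim about how any particular production model is trained):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(64, 128),
        nn.ReLU(),
        nn.Dropout(p=0.3),   # randomly zero 30% of activations: the injected "noise"
        nn.Linear(128, 10),
    )

    x = torch.randn(8, 64)
    model.train()            # dropout active: two forward passes differ
    print(model(x).sum().item(), model(x).sum().item())
    model.eval()             # dropout off at inference: output is deterministic
    print(model(x).sum().item())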
Sidenote: This can also help with understanding why LSD causes hallucinations, and the mechanism by which it can improve neural disorders such as PTSD and addiction. PTSD and addiction can be the result of strong neural connections/associations the brain creates during extremely emotional and traumatic events. This "over-weighting" is how the brain "overfits". LSD may chemically weaken these overfit neural connections and allow other signals (noise) to propagate where they previously couldn't, leading to hallucinations.
Back on topic.
Hallucinations by themselves are not *all* bad. They're just uncontrolled imagination. The question is, how much imagination is appropriate/desired/acceptable in a given context? eg. Image generation, information recall, image comprehension, legal document paraphrasing, grammar-checking, programming code analysis/debugging, etc. These would all have different levels of imagination that we would deem as acceptable.
Re: (Score:2)
I am well aware. One consequence is that models that can be overfitted are no good in any context where correctness matters. Hence hallucinations are an absolute killer in those areas. Back when I studied this, there still were some (faint) hopes to get logic into statistical models. That hope was never rational, because logic requires zero error probability or it does not work. At the same time, deductive models get bogged down in state-space explosion very soon.
On the other hand, the current crop of train
Re: (Score:2)
Yeah I see where you're coming from. The problem(?)/non-problem(?) with logic is that it relies on starting assumptions. Newtonian physics works wonders until somebody imagines what it's like to travel at the speed of light. The universe revolves around the earth until somebody imagines what would happen if everything revolved around the sun. Roots of negative numbers are ignored, until somebody imagines what it'd be like to simply leave it in an equation.
I don't see hallucinations as a bad thing. What I se
So no more Wikipedia donation requests? (Score:2)
And there are probably similar non-profits struggling to stay online.
Re: (Score:2)
The AI assholes are not in it for the long haul. They _know_ this hype will collapse soon; they know their tech cannot deliver much longer and will regress. Also, they are not in it for any positive contribution to the human endeavor either. They just want to get rich quick and, unfortunately, many of them have already reached that goal.
Re: (Score:2)
Until we can completely cure deafness, there will always be demand for transcriptions generated by automatic voice recognition. Yes, it would be nice if they were more accurate, but they're better than nothing at all. ... and before you go there, no, they're not worse than nothing at all. It's not up to you or me to decide what content deaf people are allowed to have access to. Limiting disabled people's free access to information based on your own values is a complete violation of their rights.
Re: (Score:2)
The problem with them is they rant on other people's behalf too much, except you double-replied on behalf of everyone else, and also specifically on behalf of the people you think will be confused by some results of AI training. It is the same thing you're complaining about.
Further, doing things on behalf of others is what makes non-breeders useful. It is literally our whole reason for existence. Nature only keeps us around for that. Complaining about it is insensible.
Johnny Five (Score:4, Funny)
Need Input!
"Slowing"? Hahahah, no. Much, much worse... (Score:5, Insightful)
What will actually happen is that Artificial Incompetence will regress. As more and more low-quality AI "information" is added to the Internet, Model Collapse becomes a thing. Hence in the near future you will not be able to train an AI model from the Internet at all. The only reason ChatGPT and its ilk managed to (mostly) get simple things right is a massive criminal commercial copyright-infringement campaign by their makers.
And the _other_ fundamental problem is that more data will not make much difference unless it is orders of magnitude more data without a drop in quality. And even then, the effects would be quite modest. That amount of data is not available, and creating it is quite infeasible.
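For a rough sense of how modest "more data" gets, a Chinchilla-style loss fit L(N, D) = E + A/N^alpha + B/D^beta can be plugged in directly. The constants below are the approximate fitted values Hoffmann et al. (2022) reported; treat the numbers as illustrative only:

    # N = model parameters, D = training tokens
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        return E + A / N**alpha + B / D**beta

    N = 70e9                               # a 70B-parameter model
    for D in (1.4e12, 14e12, 140e12):      # 1x, 10x, 100x the tokens
        print(f"{D:.1e} tokens -> predicted loss {loss(N, D):.3f}")

Each factor of ten in data buys a smaller and smaller sliver of improvement, which is the diminishing-returns point being made above.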
Hence what is going to happen is not that things will slow down. What is going to happen is that things will hit a hard wall (in some respects they already have), and then the wall will move in the wrong direction, making training even less effective and much, much harder, until training anything but specialized and restricted AI models becomes quite infeasible. That time is probably much less than 10 years in the future. To make matters worse, older models will age badly as they get more and more out of touch, an effect that is already visible.
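The collapse dynamic is easy to demonstrate in miniature: fit a distribution, sample from the fit, refit, repeat. A toy numpy sketch (a Gaussian standing in for a language model, so purely an analogy, not an LLM simulation):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=100)    # generation 0: "human" data

    for gen in range(1, 201):
        mu, sigma = data.mean(), data.std()  # "train" on the current data
        data = rng.normal(mu, sigma, 100)    # next generation sees only samples
        if gen % 50 == 0:
            print(f"generation {gen}: sigma = {data.std():.3f}")

The spread decays generation over generation; the tails, i.e. the rare and informative stuff, are the first thing to go.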
Main effect: AI affinity for Cats (Score:1)
Re: (Score:2)
And it will love lichen subscriptions and smashing bells.
Pay people (Score:2)
Johnny 5 (Score:1)
Johnny 5 needs more input...
Misdirected Efforts (Score:2)
I think AI companies are misguided in their efforts for two reasons.
First, the internet being "too small" is a consequence of attempting to create a "jack of all trades" model. It would be the equivalent of trying to train a single human to be knowledgeable and experienced in every academic subject known to man. Now, "so what", right? They can burn all the money trying, for all we care. My point here is that if these companies want to produce anything useful that can be sold, and they care about efficiency, th
Could I perhaps recommend.. (Score:2)
Ask the Vatican (Score:1)
So are stomachs (Score:1)
Completely worked!
QQ for those who actually know ... (Score:1)
Point is, at some point, should we expect to be able to ask ChatGPT for someone's SSN, DOB, favorite porn, last sex toy purchase, etc.? I can easily see a hacker deliberately dropping some of this info onto a site they know o
Then it's over. (Score:2)
This model is over, at least.
This is like the "energy crisis". The Internet is the fossil-fuel reserve, and AI has already burned through it and wants more. But there is no more. The reservoir will slowly replenish, several orders of magnitude too slowly to be of any help; the models will have to change to better use what has already been scraped, or there will be nothing left to feed them.