Google Trained a Trillion-Parameter AI Language Model (venturebeat.com) 110
An anonymous reader quotes a report from VentureBeat: Google researchers developed and benchmarked techniques they claim enabled them to train a language model containing more than a trillion parameters. They say their 1.6-trillion-parameter model, which appears to be the largest of its kind to date, achieved up to a 4 times speedup over the previously largest Google-developed language model (T5-XXL). As the researchers note in a paper detailing their work, large-scale training is an effective path toward powerful models. Simple architectures, backed by large datasets and parameter counts, surpass far more complicated algorithms. But effective, large-scale training is extremely computationally intensive. That's why the researchers pursued what they call the Switch Transformer, a "sparsely activated" technique that uses only a subset of a model's weights, or the parameters that transform input data within the model.
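For readers wondering what "sparsely activated" means in practice, here is a minimal sketch of top-1 expert routing in the spirit of the Switch Transformer's switch routing; the NumPy implementation, array shapes, and toy sizes are illustrative assumptions, not the paper's code.

import numpy as np

def switch_layer(tokens, router_w, expert_ws):
    """Route each token to its single highest-scoring expert (top-1 routing).

    tokens:    (n_tokens, d_model) activations
    router_w:  (d_model, n_experts) router weights
    expert_ws: list of per-expert (d_model, d_model) weight matrices
    Only the chosen expert's weights touch a given token, so most of the
    model's parameters sit idle on any single input -- the "sparse
    activation" described above.
    """
    logits = tokens @ router_w                          # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)           # softmax over experts
    choice = probs.argmax(axis=1)                       # top-1 expert per token

    out = np.empty_like(tokens)
    for e, w in enumerate(expert_ws):
        mask = choice == e
        # scale by the router probability so the router stays trainable
        out[mask] = (tokens[mask] @ w) * probs[mask, e:e+1]
    return out

# Toy sizes: 8 tokens, model width 16, 4 experts (Switch-C uses 2,048).
rng = np.random.default_rng(0)
toks = rng.normal(size=(8, 16))
router = rng.normal(size=(16, 4))
experts = [rng.normal(size=(16, 16)) for _ in range(4)]
print(switch_layer(toks, router, experts).shape)        # (8, 16)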
In an experiment, the researchers pretrained several different Switch Transformer models using 32 TPU cores on the Colossal Clean Crawled Corpus, a 750GB dataset of text scraped from Reddit, Wikipedia, and other web sources. They tasked the models with predicting missing words in passages where 15% of the words had been masked out, as well as other challenges, like retrieving text to answer a list of increasingly difficult questions. The researchers claim their 1.6-trillion-parameter model with 2,048 experts (Switch-C) exhibited "no training instability at all," in contrast to a smaller model (Switch-XXL) containing 395 billion parameters and 64 experts. However, on one benchmark -- the Stanford Question Answering Dataset (SQuAD) -- Switch-C scored lower (87.7) versus Switch-XXL (89.6), which the researchers attribute to the opaque relationship between fine-tuning quality, computational requirements, and the number of parameters.
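The "predict the missing words" objective is easy to show concretely. Below is a BERT-style token-masking sketch; the actual T5/Switch objective uses span corruption with sentinel tokens, but the idea is the same, and the mask token, 15% rate, and whitespace tokenizer here are illustrative assumptions.

import random

def mask_tokens(tokens, rate=0.15, mask_token="[MASK]", seed=1):
    """Hide roughly `rate` of the tokens; the model must predict them back."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            corrupted.append(mask_token)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

inp, answers = mask_tokens("the cat sat on the mat and purred loudly".split())
print(inp)      # some words replaced by [MASK]
print(answers)  # position -> original word the model is trained to recover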
Even so, the Switch Transformer led to gains in a number of downstream tasks. For example, it enabled an over 7 times pretraining speedup while using the same amount of computational resources, according to the researchers, who demonstrated that the large sparse models could be used to create smaller, dense models fine-tuned on tasks with 30% of the quality gains of the larger model. In one test where a Switch Transformer model was trained to translate between over 100 different languages, the researchers observed "a universal improvement" across 101 languages, with 91% of the languages benefiting from an over 4 times speedup compared with a baseline model. "Though this work has focused on extremely large models, we also find that models with as few as two experts improve performance while easily fitting within memory constraints of commonly available GPUs or TPUs," the researchers wrote in the paper. "We cannot fully preserve the model quality, but compression rates of 10 to 100 times are achievable by distilling our sparse models into dense models while achieving ~30% of the quality gain of the expert model."
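The distillation result quoted above (compressing the sparse "expert" model into a small dense one while keeping ~30% of its quality gain) relies on the standard teacher-student recipe. A minimal sketch of that loss follows; the temperature, toy logits, and NumPy formulation are assumptions for illustration, not the paper's training setup.

import numpy as np

def softmax(x, temperature=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / temperature)
    return z / z.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Average KL(teacher || student) on temperature-softened distributions.
    Minimizing this trains a small dense "student" to mimic the output
    distribution of the large sparse "teacher"."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.mean(np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1)))

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])   # sparse teacher outputs
student = np.array([[3.0, 1.5, 0.5], [0.5, 2.5, 0.3]])   # dense student outputs
print(distillation_loss(student, teacher))               # lower is better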
Is this why Google translate sucks? (Score:2)
It was much better 4 or 5 years ago, and now it is nearly unusable, even for the widely used languages in places where google does business.
Re: (Score:2)
Yeah. There was a noticeable decline in quality when they switched from conventional to NN-based translation.
Re: (Score:1)
Yeah, it appears more of a brute-force attack than anything else. Which is nice - it means the "self-teaching AI that can do anything" has hit the wall in a hard, hard way.
Re: (Score:2)
Current NN technology is really really good at interpolating, but shockingly bad at extrapolating.
Re: (Score:3)
That's always the case when your models depend on parameter fits.
You should see particle physics theories' predictions divergence when you extrapolate outside of the ranges where data is available.
Re: (Score:2)
That is fascinating.
Re: (Score:3)
Yes, it is. At least the part where all the funding goes into overfitting "AI models" instead of meaningful research.
Re: (Score:2)
Current NN technology is really really good at interpolating, but shockingly bad at extrapolating.
The solution is more training data in the areas needed.
According to TFA, the NN was trained on text from Wikipedia. So it may be bad at text prediction in, say, a medical research paper. The solution is to add medical journal articles to the training set.
Re: (Score:3)
No. Humans don't need to read the entire corpus of Wikipedia and JSTOR in order to form a coherent sentence. The solution is to improve our algorithms.
Re: (Score:3)
Interesting point.
On the other hand, humans can read thousands of Slashdot posts, yet still be unable to form a coherent sentence of their own. So indeed "reading a lot" isn't the key for humans. You'd think it would help, but it's not the key.
Re: (Score:3)
Re: Is this why Google translate sucks? (Score:2)
We go through an entire youth though.
And use much *much* more neurons.
You are still right, of course... and to specify: We need algorithms, that are not based on simplifying down to perfectly spherical horses on sinusoidal trajectories. :)
And here, with this sparse method, it seems they went even deeper down that very hole instead.
I wonder when they get to the point that their model will need so much processing power that they notice that simplification is not the right deity to suck off and they might aswe
Re: (Score:3)
And use much *much* more neurons.
many, many more.
/ sorry, pet peeve.
Re: (Score:2)
And yet, when people enter a new field, they often have to devote years to learning before they can extrapolate, and weeks before they understand the common specialized vocabulary.
Re: (Score:2)
It's amazing how you missed the point there. Are you a bot?
Re: (Score:2)
This is something the head of AI ethics whom Google recently forced out was saying: massive corpuses are not the way forward; developing something that mimics the way the human brain is pre-wired to understand language is.
Aside from anything else, training on these massive corpuses is expensive and produces massive amounts of CO2 emissions, all for something that will never really understand the meaning of the words, just be able to transcribe them with decent accuracy.
Re: (Score:2)
"... developing something that mimics the way the human brain is pre-wired to understand language is..."
Ummm. That assumes the brain is "pre-wired" to understand language. Certain centers of the brain tend to be associated with speech, for example, but in a certain number of people those locations can swap from hemisphere to hemisphere, and they don't always live in exactly the same place. In fact, "language" can involve many parts of the brain.
Re: (Score:2)
According to TFA, the NN was trained on text from Wikipedia. So it may be bad at text prediction in, say, a medical research paper. The solution is to add medical journal articles to the training set.
They already include other sources:
a 750GB-sized dataset of text scraped from Reddit, Wikipedia, and other web sources
Though including Reddit probably was counterproductive. The last thing we need is a language AI model that knows every flash-in-the-pan meme and has no sense of what constitutes proper grammar.
Re: (Score:2)
> The Pile: An 800GB Dataset of Diverse Text for Language Modeling
https://arxiv.org/abs/2101.000... [arxiv.org]
This dataset includes all of arXiv, PubMed, and GitHub (open source projects).
Re: (Score:2)
Shouldn't computer scientists know better than to rely on more and more brute force? Humans don't consume anywhere near 800 gigabytes of text in a lifetime. That's equal to more than 170,000 readings through Victor Hugo's unabridged Les Miserables, or 200,000 readings through Ayn Rand's Atlas Shrugged.
Re: (Score:2)
Thanks for the fantastic visual... I am now imagining a large gang of men all reading Les Miserables repeatedly for their entire lives, while singing,
Look down, look down
Eight hundred gigabytes
Look down, look down
You'll read until you die
Re: (Score:2)
Re: (Score:2)
Evolution, like this kind of "AI", sucks balls - there's a reason why the people who developed the COVID vaccines, for example, are using their knowledge and not throwing the kitchen sink at a population of victims, hoping to get lucky through evolution.
It is always better to solve the problem directly, and "AI" sucks at problem solving, and is not going to get much better anytime soon.
Re: (Score:2)
Maybe if you translated only in some small, very specific domain.
On the whole, NN-based translation smokes the old latent Dirichlet allocation thing they used. I can probably train better models than that on my home computer now.
Re: (Score:2)
On the whole, NN-based translation smokes the old latent Dirichlet allocation thing they used.
Based on what metric? My experience is the new NN translation does a better job of producing grammatically correct sentences, but a worse job of matching the original text.
Re: (Score:2)
Based on any metric you can come up with, from BLEU to costly manual rating of translations by professional translators. Just look up the papers, though you have to go many years back, because it's a long time since they dropped pre-NMT models from the comparisons.
BLEU itself is biased toward the old style of machine translation because, understandably enough, the best way they could come up with to automatically rate translations at the time looked a lot like the best way they could come up with to do machine translation.
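For anyone curious what BLEU actually measures: it is essentially clipped n-gram overlap with a brevity penalty, which is part of why it tends to reward literal, word-for-word output. A simplified single-reference sketch (real BLEU uses up to 4-grams, smoothing, and corpus-level statistics):

from collections import Counter
import math

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: fraction of candidate n-grams found in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(c, ref[g]) for g, c in cand.items()) / total

def simple_bleu(candidate, reference, max_n=2):
    """Toy BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the spirit is willing but the flesh is weak".split()
print(simple_bleu("the spirit is willing but the meat is spoiled".split(), ref))  # ~0.70
print(simple_bleu(ref, ref))                                                      # 1.0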
Re: (Score:2)
Yeah, well the old method didn't translate "onna" from Japanese to "man" in English, so I call BS on whatever metric you're using.
Re: (Score:2)
So you "call bullshit" on a single anecdote which isn't even true [google.com]?
Re: (Score:2)
Yeap. It's inconsistent. For a while I was taking screenshots of all the mistakes it randomly made, but eventually I gave up.
Re: Is this why Google translate sucks? (Score:2)
Re: (Score:2)
Re: (Score:2)
I've noticed Google Translate improving over the last few years. In particular it seems to better understand context now.
It used to translate headlines as if it was a person describing something that happened to them, e.g. "I found the bank vault door open and the money was gone", rather than "bank vault found open, money missing".
Now the most common error I see is getting gender wrong. It tends to assume that everyone is male and doesn't notice when someone's name suggests otherwise.
Re: (Score:2)
I've noticed Google Translate improving over the last few years.
You've missed the big drop in quality that happened 4-5 years ago then.
It could have been "improving" after that, but it has much ground to cover to just get back to where it was before.
Currently, it "excels" at literal word-for-word translations, totally ignoring idioms or rare words, except that it often mixes up the type of a word (so a noun in the original becomes a verb and so on) and the sentence role (so the object becomes something else), which makes the "translation" nearly incomprehensible.
GIGO squared (Score:2)
Machine learning means that a computer compiles the least coherent responses from the raving herd, and then repeats them in random splice order. It reflects the insanity of the human mind at the end of civilization, but it does it with grammatical perfection and digital precision!
Re: (Score:2)
Come on, those are Google wizards you're talking about; they've already optimized it down to nearly log GIGO.
Re: (Score:2)
They have perfected next-level bullshit!
Re: GIGO squared (Score:2)
Grammatical perfection?
You ain't never seen no Reddit posts, have ya?
Re: (Score:2)
I was told I needed to own a fedora before I could join Reddit, so I've held back. Apparently the users are all barista metrosexuals who think they are geniuses and enjoy fleshlights and pegging?
Compute magic (Score:5, Informative)
A trillion parameters is kind of astonishing.
GPT-3 has about 175 billion if I remember right, and that thing is quite mind blowing. GPT-4 is projected to have about 20 trillion, but there's a *huge* catch: training the damn thing at the current cost of GPU compute would come in at 8-9 billion dollars (GPT-3 cost about 4.6 million).
Google clearly is not spending a quarter of a billion on GPUs; that would be *extremely* hard to justify to investors. So they've figured out some pretty dark magic to get this to work without spending Google's entire R&D budget on GPU compute for a single project.
Re: (Score:3)
Google clearly is not spending a quarter of a billion on GPUs; that would be *extremely* hard to justify to investors. So they've figured out some pretty dark magic to get this to work without spending Google's entire R&D budget on GPU compute for a single project.
In the first place they build their own TPU hardware. In the second place, Google is really good at shaping their Neural Network trees to be more efficient. They also have tricks like "use 8 bit integers instead of floats." So they have some efficiency experts on their team.
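The "8-bit integers instead of floats" trick mentioned above usually means something like the following: store each weight matrix as int8 plus a single float scale, trading a little precision for a large cut in memory and bandwidth. This is a generic sketch of symmetric quantization, not Google's actual TPU kernels.

import numpy as np

def quantize_int8(w):
    """Symmetric quantization: weights become int8 plus one float32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=(3, 3)).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())   # small error, quarter the storage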
Re: (Score:2)
In the first place they build their own TPU hardware.
Indeed. For a typical NN, using TPUs is a factor of six more energy efficient than using GPUs.
Tesla also designs its own custom tensor-processing silicon.
Does anyone else?
Re: (Score:3)
Tons of them [ai-startups.org]; it's a hot topic for startups right now. I watched a speech given by the founders of one of them; it's worth a watch [youtube.com].
Re: Compute magic (Score:1)
It's a word trick.
An image processing function that processes 32Kx32K pictures at 24 bits per pixel could be said to process "a trillion parameters".
Or, if counting single bits feels like cheating, how about one that inserts an effect into equally high-resolution video with one second (24 frames, for film) of spread, e.g. a one-second temporal blur.
So "one second of retina display VR goggles movie"... That's my attempt of today, at the news media system of units. :)
Re: (Score:2)
Re: (Score:2)
Re: Compute magic (Score:1)
Re: Compute magic (Score:1)
Re: (Score:2)
Re: (Score:2)
I'm sure they do have some special sauce, but even if they didn't, $8bn might not be unreasonable for the potential gains. Language processing is extremely valuable now.
Of course it's English only; presumably, if they had a similar-sized model for, say, Hungarian, it wouldn't be worth putting in nearly as much effort.
Re: (Score:2)
Google clearly is not spending a quarter of a billion on GPUs; that would be *extremely* hard to justify to investors. So they've figured out some pretty dark magic to get this to work without spending Google's entire R&D budget on GPU compute for a single project.
They are probably not using GPU.
Google has absolute arseloads of servers, somewhere around a literal million of them. The unused capacity on these servers has made possible all of Google's side projects. They just give them the lowest priority so they don't step on indexing and correlating web data.
Re: (Score:2)
Re: (Score:2)
A trillion parameters is kind of astonishing.
It's classic over-fitting.
Re: (Score:2)
Give me 4 parameters and I can fit an elephant. Give me 5 and it can wiggle its trunk.
Give me a trillion and uh... an elephant that wiggles all its appendages?
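The elephant quip is really about overfitting, and it is easy to demonstrate in a few lines: give a model as many parameters as data points and it will "fit" anything, including pure noise, while saying nothing useful outside the data. A toy sketch (the polynomial degree and point count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)
y = rng.normal(size=5)                    # pure noise, no signal at all

coeffs = np.polyfit(x, y, deg=4)          # 5 parameters for 5 points
print(np.abs(np.polyval(coeffs, x) - y).max())   # ~0: a "perfect" fit
print(np.polyval(coeffs, 1.5))            # and nothing sensible outside [0, 1]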
Fantastic! (Score:2)
Looking forward to trolls getting overfed by AIs that can identify misinformation/disinformation and their posts automatically hidden for being factually inaccurate. The internet of idiots spewing bullshit vastly outnumbers the number of experts on the internet.
Re: (Score:1)
There goes CNN.
Re: (Score:2)
Re: (Score:3)
But actually what you will see is high quality translations of trolls in multiple languages.
They're doing it for the CLIKZ.
Re: (Score:3)
Looking forward to trolls getting overfed by AIs that can identify misinformation/disinformation ...
Fat chance. They'd be trained by datasets where humans had made such ratings - with all their biases and the currently observed conflation between ideology and "correctness".
They might become as good as humans at identifying truth-or-agreement-with-my-in-group's-ideology. But that just turns them into automatic ideologues, not oracles of truth but of "truthiness".
Re: (Score:3)
Except that the AI will use what is trending on social media to determine Truth
Re: (Score:2)
Except that the AI will use what is trending on social media to determine Truth
According to who? You literally just made that shit up and created a prime example of a comment that should be shut down by said AI because even the summary explained that they used reputable sources of information, not what was trending on social media.
Re: (Score:2)
What do the "reputable sources" on your stream report, except that which is trending on social media?
Re: (Score:2)
Read a newspaper and find out.
Re: (Score:3)
Looking forward to trolls getting overfed by AIs that can identify misinformation/disinformation and their posts automatically hidden
That won't work. The trolls can easily defeat the filter by writing insightful, interesting, and factually accurate posts.
https://xkcd.com/810/ [xkcd.com]
Re: Fantastic! (Score:3)
Fun fact: Your post counts as one of those. :)
Also fun fact: If you want to see what an "AI" fed by an internet forum looks like, look up "Bucket (chatbot)" on Encyclopedia Dramatica from when it still existed.
My favorite quote after it had a fun day with 4chan:
Bucket: "Bucket is cancer is cancer is Bucket."
Says it all. :)
Comment removed (Score:5, Interesting)
Re: (Score:3)
Is it odd that they train a 1.6 trillion parameter model with only 0.75 trillion bytes of data?
Perhaps that is all the data they could find.
The text of the English edition of Wikipedia is only about 20 GB.
But using so many parameters has a very big risk of overfitting.
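Putting a number on the ratio the parent is pointing at, using only the figures from the summary (and ignoring tokenization and repeated epochs):

params = 1.6e12          # parameters in Switch-C
data_bytes = 0.75e12     # ~750GB Colossal Clean Crawled Corpus
print(data_bytes / params)   # ~0.47 bytes of raw text per parameter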
Re: 1.6-trillion-parameter model (Score:1)
I *knew* they were counting single bits as "parameters"!
Re: 1.6-trillion-parameter model (Score:2)
Re: (Score:2)
Compression maybe? Being text it probably compresses extremely well.
It would make sense to tokenize it for faster processing, aside from anything else. No point doing trillions of string matches over and over again.
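A toy illustration of the tokenization point: mapping words to small integer IDs once, up front, replaces repeated string matching with cheap integer lookups. Real systems use subword tokenizers such as SentencePiece rather than this word-level sketch.

def build_vocab(texts):
    """Assign each distinct word a small integer ID (toy word-level tokenizer)."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    return [vocab[w] for w in text.split()]

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocab = build_vocab(corpus)
print(encode("the cat sat on the log", vocab))   # integers instead of strings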
Re: (Score:2)
use that horsepower with an efficiently written algorithm and you could spit out the entire Encyclopedia Britannica.
Hardly. Knowledge production is not an algorithm.
Re: Trillion parameters? (Score:2)
Oh yes it is!
The automatic researcher is indeed a thing! :)
No, abduction (the logical concept related to induction and deduction) is not a human exclusive.
Disclaimer: I can automate *researchers* away now! ... Fear me! ;)
Re: (Score:2)
Disclaimer: I can automate *researchers* away now! ... Fear me! ;)
You think you're a big deal? Researcher production was automated a long time ago. I myself have manufactured a whole bunch of self-teaching researchers, and I'm so successful that their average h-index is now higher than mine.
In other words... (Score:3)
..enabled them to train a language model containing more than a trillion parameters.
In other words, Google finally has what is effectively an infinite number of monkeys devoted to language.
Re: (Score:2)
Re: In other words... (Score:1)
Re: (Score:2)
Therefore, we could say that natural selection is an incredibly powerful form of AI.
Well, we could say that natural selection is an effective way to generate AI, at least. And it's efficient in the sense that it can do a lot with a little, by finding a structure that works well. It's not very efficient in terms of the amount of time and energy it takes, though.
Re: (Score:1)
And it still can't accurately translate the works of Shakespeare.
What? (Score:1)
Did someone just pick buzzwords at random?
Re: (Score:1)
It's actually a very specialized algorithm using focus groups, caffeine, and copious amounts of LSD to write the article.
And a Trillion input parameters.
Okay, they watched 1000 old movies stored in AVI files.
Imagine a Beowulf cluster of those... (Score:2)
n/t
Re: Imagine a Beowulf cluster of those... (Score:2)
Sup dawg. I herd you like clusters ... !
'clean'? (Score:2)
I hope this does not imply unnatural distortion by Despicable Catholiban Linguistic Cleansing (DC/LC).
Re: (Score:2)
Yes, it wastes an enormous amount of power and produces new strings from old strings.
42! (Score:2)
...but what's the question?
Re: (Score:2)
...but what's the question?
How do I post a very redundant, not-at-all-funny old joke?
Re: (Score:1)
I think he was making a valid comment.
The headline tells us nothing about what this system DOES.
A useful headline would do that.
Next week, from the Slashdot "editors":
"Poptopdrop develops a new AI system with a Trillion and one parameterization factor assemblies."
They can't really use them all (Score:2)
Re: (Score:2)
Correct, it's a sparse model.
And we are still missing a clue (Score:1)
This is still the same kind of rote training on an existing corpus that we've been doing for decades. As you increase the size of the corpus and the number of parameters, you get ever-diminishing returns. Twice as big is not nearly twice as good.
There is still no *understanding* involved. It's just rote training. If you talk to a chatbot trained like this (as most are), it's just like the old Eliza: it reflects your thoughts back to you, maybe mixing in some stuff from the training corpus. For answering que
Re: (Score:2)
This thread is full of n-zi spammers, and then there are posts like yours, which are almost as useless.
The big deal with huge language models is precisely that the returns don't diminish as much as people thought they would. Bigger keeps doing better.
Sure. Just like there's no understanding involved in your post. Your notion of understanding is vague and doesn't stand up to scrutiny.
Re: (Score:2)
Re: (Score:2)
I don't know what model the various chatbots on customer support webpages use, but they all uniformly suck. Ask a pointed question and get a boilerplate answer, always completely unrelated to the question.
To be fair, that's the same with a human call centre too.
"AI" allows companies to be unhelpful for less money.
Energy use by the model vs. by humans? (Score:1)
I keep wondering (Score:1)
So how does this matter?
And yet a two parameter google search. (Score:3)
my favorite translations: (Score:2)
"out of sight, out of mind" -> blind and insane
"the spirit is willing but the flesh is weak" -> the wine is good but the meat is spoiled
Re: 5mod 3own (Score:2)
Written by the Google "AI", I presume?