Google Trained a Trillion-Parameter AI Language Model (venturebeat.com)

An anonymous reader quotes a report from VentureBeat: Google researchers developed and benchmarked techniques they claim enabled them to train a language model containing more than a trillion parameters. They say their 1.6-trillion-parameter model, which appears to be the largest of its kind to date, achieved up to a 4 times speedup over the previously largest Google-developed language model (T5-XXL). As the researchers note in a paper detailing their work, large-scale training is an effective path toward powerful models. Simple architectures, backed by large datasets and parameter counts, surpass far more complicated algorithms. But effective, large-scale training is extremely computationally intensive. That's why the researchers pursued what they call the Switch Transformer, a "sparsely activated" technique that uses only a subset of a model's weights, or the parameters that transform input data within the model.
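
To make "sparsely activated" concrete: a Switch-style layer contains many expert feed-forward blocks plus a small router that sends each token to just one of them, so only a sliver of the total weights is used per input. The toy sizes and plain-NumPy implementation below are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of "switch"-style top-1 expert routing (illustrative only;
# dimensions, names, and the plain-NumPy implementation are assumptions,
# not Google's actual code).
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, num_experts = 64, 256, 4   # toy sizes

# Each expert is an independent feed-forward block with its own weights.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(num_experts)
]
router_w = rng.standard_normal((d_model, num_experts)) * 0.02

def switch_layer(tokens):
    """tokens: (n_tokens, d_model). Each token visits exactly one expert."""
    logits = tokens @ router_w                      # (n_tokens, num_experts)
    choice = logits.argmax(axis=-1)                 # top-1 expert per token
    gate = np.exp(logits - logits.max(-1, keepdims=True))
    gate = gate / gate.sum(-1, keepdims=True)       # softmax router probabilities
    out = np.zeros_like(tokens)
    for e, (w_in, w_out) in enumerate(experts):
        mask = choice == e
        if mask.any():
            h = np.maximum(tokens[mask] @ w_in, 0)  # ReLU feed-forward
            # Scale by the router probability so the router itself can be trained.
            out[mask] = (h @ w_out) * gate[mask, e][:, None]
    return out

x = rng.standard_normal((10, d_model))
print(switch_layer(x).shape)   # (10, 64): only 1 of 4 experts is touched per token
```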

In an experiment, the researchers pretrained several different Switch Transformer models using 32 TPU cores on the Colossal Clean Crawled Corpus, a 750GB dataset of text scraped from Reddit, Wikipedia, and other web sources. They tasked the models with predicting missing words in passages where 15% of the words had been masked out, as well as other challenges, like retrieving text to answer a list of increasingly difficult questions. The researchers claim their 1.6-trillion-parameter model with 2,048 experts (Switch-C) exhibited "no training instability at all," in contrast to a smaller model (Switch-XXL) containing 395 billion parameters and 64 experts. However, on one benchmark -- the Stanford Question Answering Dataset (SQuAD) -- Switch-C scored lower (87.7) versus Switch-XXL (89.6), which the researchers attribute to the opaque relationship between fine-tuning quality, computational requirements, and the number of parameters.
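
The "predicting missing words" setup is a fill-in-the-blank pretraining objective. The paper uses T5-style span corruption; the simplified per-token masking below is only a sketch of the idea.

```python
# Toy illustration of the fill-in-the-blank pretraining objective: hide
# roughly 15% of the tokens and train the model to recover them. (The paper
# uses T5-style span corruption; this per-token masking is a simplification.)
import random

MASK = "<extra_id_0>"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    targets = [tokens[i] for i in sorted(hidden)]
    return inputs, targets

sentence = "the researchers pretrained several different switch transformer models".split()
masked, answers = mask_tokens(sentence)
print(masked)    # the sentence with ~15% of tokens replaced by the mask sentinel
print(answers)   # the hidden token(s) the model is trained to predict
```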

Even so, the Switch Transformer led to gains in a number of downstream tasks. For example, it enabled an over 7 times pretraining speedup while using the same amount of computational resources, according to the researchers, who demonstrated that the large sparse models could be used to create smaller, dense models fine-tuned on tasks with 30% of the quality gains of the larger model. In one test where a Switch Transformer model was trained to translate between over 100 different languages, the researchers observed "a universal improvement" across 101 languages, with 91% of the languages benefitting from an over 4 times speedup compared with a baseline model. "Though this work has focused on extremely large models, we also find that models with as few as two experts improve performance while easily fitting within memory constraints of commonly available GPUs or TPUs," the researchers wrote in the paper. "We cannot fully preserve the model quality, but compression rates of 10 to 100 times are achievable by distilling our sparse models into dense models while achieving ~30% of the quality gain of the expert model."
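
The "distilling our sparse models into dense models" step is standard knowledge distillation: a smaller dense student is trained to match the large sparse teacher's output distribution alongside the hard labels. A generic sketch of such a loss (the temperature and mixing weight are illustrative assumptions, not the paper's settings):

```python
# Generic knowledge-distillation loss: the student mimics the teacher's
# softened output distribution and also fits the hard labels.
# Hyperparameters (T, alpha) are illustrative assumptions.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    p_teacher = softmax(teacher_logits, T)          # soft targets from the sparse teacher
    p_student = softmax(student_logits, T)
    soft = -(p_teacher * np.log(p_student + 1e-9)).sum(-1).mean()
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-9).mean()
    return alpha * soft * T * T + (1 - alpha) * hard

rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 100))   # teacher logits over a toy vocabulary
student = rng.standard_normal((8, 100))   # dense student logits
labels = rng.integers(0, 100, size=8)
print(distill_loss(student, teacher, labels))
```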

  • It was much better 4 or 5 years ago, and now it is nearly unusable, even for the widely used languages in places where google does business.

    • Yeah. There was a noticeable decline in quality when they switched from conventional to NN-based translation.

      • Yeah, it appears more of a brute-force attack than anything else. Which is nice - it means the "self-teaching AI that can do anything" has hit the wall in a hard, hard way.

        • Current NN technology is really really good at interpolating, but shockingly bad at extrapolating.

          • That's always the case when your models depend on parameter fits.

            You should see particle physics theories' predictions divergence when you extrapolate outside of the ranges where data is available.

          • Current NN technology is really really good at interpolating, but shockingly bad at extrapolating.

            The solution is more training data in the areas needed.

            According to TFA, the NN was trained on text from Wikipedia. So it may be bad at text prediction in, say, a medical research paper. The solution is to add medical journal articles to the training set.

            • No. Humans don't need to read the entire corpus of Wikipedia and JSTOR in order to form a coherent sentence. The solution is to improve our algorithms.

              • Interesting point.

                On the other hand, humans can read thousands of Slashdot posts, yet still be unable to form a coherent sentence of their own. So indeed "reading a lot" isn't the key for humans. You'd think it would help, but it's not the key.

              • We go through an entire youth though.
                And use many, *many* more neurons.

                You are still right, of course... and to be specific: we need algorithms that are not based on simplifying down to perfectly spherical horses on sinusoidal trajectories. :)

                And here, with this sparse method, it seems they went even deeper down that very hole instead.
                I wonder when they get to the point that their model will need so much processing power that they notice that simplification is not the right deity to suck off and they might aswe

              • And yet, when people enter a new field, they often have to devote years to learning before they can extrapolate, and weeks before they understand the common specialized vocabulary.

              • by AmiMoJo ( 196126 )

                This is something that the head of AI ethics that Google recently forced out was saying. Massive corpuses are not the way forward, developing something that mimics the way the human brain is pre-wired to understand language is.

                Aside from anything else training on these massive corpuses is expensive and produces massive amounts of CO2 emissions, all for something that will never really understand the meaning of the words, just be able to transcribe them with decent accuracy.

                • by shmlco ( 594907 )

                  "... developing something that mimics the way the human brain is pre-wired to understand language is..."

                  Ummm. That assumes the brain is "pre-wired" to understand language. Certain centers of the brain tend to be associated with speech, for example, but in a certain number of people those locations can be swapped from hemisphere to hemisphere, and they don't always live in exactly the same place. In fact, "language" can involve many parts of the brain.

            • by skids ( 119237 )

              According to TFA, the NN was trained on text from Wikipedia. So it may be bad at text prediction in, say, a medical research paper. The solution is to add medical journal articles to the training set.

              They already include other sources:

              a 750GB-sized dataset of text scraped from Reddit, Wikipedia, and other web sources

              Though including Reddit was probably counterproductive. The last thing we need is a language AI model that knows every flash-in-the-pan meme and has no sense of what constitutes proper grammar.

            • There's a grassroots effort to build a better corpus, called The Pile.

              > The Pile: An 800GB Dataset of Diverse Text for Language Modeling
              https://arxiv.org/abs/2101.000... [arxiv.org]

              This dataset includes all of arXiv, PubMed, and GitHub (open source projects).
              • by Entrope ( 68843 )

                Shouldn't computer scientists know better than to rely on more and more brute force? Humans don't consume anywhere near 800 gigabytes of text in a lifetime. That's equal to more than 170,000 readings through Victor Hugo's unabridged Les Miserables, or 200,000 readings through Ayn Rand's Atlas Shrugged.

                • by Zak3056 ( 69287 )

                  Thanks for the fantastic visual... I am now imagining a large gang of men all reading Les Miserables repeatedly for their entire lives, while singing,

                  Look down, look down
                  Eight hundred gigabytes
                  Look down, look down
                  You'll read until you die

        • You have no idea what you're talking about. AI is starting to deliver but it's still too expensive to deploy. Brute forcing is what evolution does, as well.
          • Evolution, like this kind of "AI", sucks balls - there's a reason why the people who developed the covid vaccine, for example, used their knowledge rather than throwing the kitchen sink at a population of victims and hoping to get lucky through evolution.

            It is always better to solve the problem directly, and "AI" sucks at problem solving, and is not going to get much better anytime soon.

      • Maybe if you translated only in some small, very specific domain.

        On the whole, NN-based translation smokes the old latent Dirichlet allocation thing they used. I can probably train better models than that on my home computer now.

        • On the whole, NN-based translation smokes the old latent Dirichlet allocation thing they used.

          Based on what metric? My experience is the new NN translation does a better job of producing grammatically correct sentences, but a worse job of matching the original text.

          • Based on any metric you can come up with, from BLEU to costly manual rating of translations by professional translators. Just look up the papers, though you have to go many years back, because it's a long time since they dropped pre-NMT models from the comparisons.

            BLEU itself is biased toward the old style of machine translation because, understandably enough, the best way they could come up with to automatically rate translations at the time looked a lot like the best way they could come up with to do machine t
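
            For context, BLEU essentially measures clipped n-gram overlap with reference translations (plus a brevity penalty), which is part of why it can reward literal, phrase-level output. A stripped-down sketch of the overlap part, with no brevity penalty, multiple references, or smoothing:

```python
# Stripped-down illustration of what BLEU measures: clipped n-gram overlap
# with a reference translation (real BLEU adds a brevity penalty, multiple
# references, and smoothing; this toy version is for intuition only).
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[g]) for g, count in c.items())
        total = max(sum(c.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

print(toy_bleu("the cat sat on the mat", "the cat sat on the mat"))   # 1.0
print(toy_bleu("on the mat sat the cat", "the cat sat on the mat"))   # much lower
```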

      • What algos were they using before nn?
    • No, it's unrelated. This model is too large to deploy to the public. The public model is cheap and efficient, but not as good.
    • by AmiMoJo ( 196126 )

      I've noticed Google Translate improving over the last few years. In particular it seems to better understand context now.

      It used to translate headlines as if it was a person describing something that happened to them, e.g. "I found the bank vault door open and the money was gone", rather than "bank vault found open, money missing".

      Now the most common error I see is getting gender wrong. It tends to assume that everyone is male and doesn't notice when someone's name suggests otherwise.

      • I've noticed Google Translate improving over the last few years.

        You've missed the big drop in quality that happened 4-5 years ago then.

        It could have been "improving" after that, but it has much ground to cover to just get back to where it was before.

        Currently, it "excels" at literal word-for-word translations, totally ignoring idioms or rare words, except that it often mixes up the type of a word (so a noun in the original becomes a verb and so on) and the sentence role (so the object becomes something else), which makes the "translation" nearly incomprehens

  • Machine learning means that a computer compiles the least coherent responses from the raving herd, and then repeats them in random splice order. It reflects the insanity of the human mind at the end of civilization, but it does it with grammatical perfection and digital precision!

  • Compute magic (Score:5, Informative)

    by sg_oneill ( 159032 ) on Wednesday January 13, 2021 @10:56PM (#60941504)

    A trillion parameters is kind of astonishing.

    GPT-3 has about 175 billion if I remember right, and that thing is quite mind-blowing. GPT-4 is projected to have about 20 trillion, but there's a *huge* catch: training the damn thing at the current cost of GPU compute would come in at 8-9 billion dollars (GPT-3 cost about 4.6 million)

    Google is clearly not spending half a quarter of a billion on GPU; that would be *extremely* hard to justify to investors. So they've figured out some pretty dark magic to get this to work without spending Google's entire R&D budget on GPU compute for a single project.

    • Google is clearly not spending half a quarter of a billion on GPU; that would be *extremely* hard to justify to investors. So they've figured out some pretty dark magic to get this to work without spending Google's entire R&D budget on GPU compute for a single project.

      In the first place they build their own TPU hardware. In the second place, Google is really good at shaping their Neural Network trees to be more efficient. They also have tricks like "use 8 bit integers instead of floats." So they have some efficiency experts on their team.
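
      The "8 bit integers instead of floats" trick is weight quantization; a bare-bones symmetric version of the general idea (not Google's TPU implementation) looks like this:

```python
# Bare-bones symmetric int8 quantization of a weight matrix (an illustration
# of the general trick, not Google's TPU implementation).
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes, "->", q.nbytes, "bytes")     # 4x smaller storage
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```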

    • It's a word trick.

      An image processing function that processes 32Kx32K pictures at 24 bits per pixel could be said to process "a trillion parameters".
      Or, if counting single bits feels like cheating, how about one that inserts an effect into equally high-resolution video with one second (24 frames for film) of spread, e.g. a one-second-long temporal blur.

      So "one second of retina-display VR goggles movie"... That's my attempt for today at the news-media system of units. :)

      • Are you counting input data size as parameter count? Parameters are shared across all input examples; they don't change with each example.
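
        To make the distinction concrete: parameters are the weights baked into the model, and their count is fixed by the architecture rather than by how much data flows through it. A tiny illustration with made-up layer sizes:

```python
# A layer's parameter count is fixed by its shape; the same weights are
# reused for every input example, no matter how many you feed it.
# (Toy sizes chosen for illustration.)
import numpy as np

d_in, d_out = 512, 512
W = np.zeros((d_in, d_out))      # the parameters (weights)
b = np.zeros(d_out)              # the parameters (bias)
print(W.size + b.size)           # 262,656 parameters

for batch in (1, 128, 10_000):   # inputs of very different sizes...
    x = np.zeros((batch, d_in))
    y = x @ W + b                # ...all go through the same parameters
print(W.size + b.size)           # still 262,656
```
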
        • He is confusing the inputs with the weights, because he is an ignorant doofus who wants to pretend that he knows something but proves once again that he doesn't.
    • TFA reports that the model was completing just under nth-7 cloze deletion tests, which is comparable to humans. In theory at least, that means that it can not only process texts from the 'bottom-up' (i.e. processing syntax & lexis probabilistically = Bayesian inference which is unremarkable), but also from the 'top-down,' i.e. infer missing words from pragmatic meaning. In other words, it's at least simulating strong AI. Impressive!
      • Either that or current language proficiency testing theory needs to be revised in the light of this. Specifically for cloze deletion tests, this may mean revisiting the 'reduced redundancy principle.'
        • Nah, cloze deletions work well. The fact that this model can be good at this task yet inferior to humans in other tasks means it lacks something else. What is missing from it is the ability to do anything of its own will, because it doesn't have a body and all it can do is read the text we feed into it. Like a life-long paralyzed person, blind, with no smell, taste and touch. Just a pair of eyes reading plain text fed page by page.
    • by AmiMoJo ( 196126 )

      I'm sure they do have some special sauce, but even if they didn't, $8bn might not be unreasonable for the potential gains. Language processing is extremely valuable now.

      Of course it's English only; presumably if they had a similar-size model for, say, Hungarian, it wouldn't be worth putting in nearly as much effort.

    • Google is clearly not spending half a quarter of a billion on GPU; that would be *extremely* hard to justify to investors. So they've figured out some pretty dark magic to get this to work without spending Google's entire R&D budget on GPU compute for a single project.

      They are probably not using GPU.

      Google has absolute arseloads of servers, somewhere around a literal million of them. The unused capacity on these servers has made possible all of Google's side projects. They just give them the lowest priority so they don't step on indexing and correlating web data.

    • Read the announcement again - it says they invented a trick to make it easier to train. The trick is to split the model into 2000 "experts" that are only called upon in a sparse way (like just 1% active at a time). Reminds me of the old trope that the brain is only using 10% of its power. By this trick they can push the scaling, but I don't think this model surpasses GPT-3 in precision, just in speed.
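
      Some rough arithmetic on what that sparsity buys per token; the per-layer sizes below are invented for illustration and are not the paper's actual configuration:

```python
# Rough arithmetic: with top-1 routing, each token only touches one expert's
# feed-forward weights, so most of the parameters sit idle per token.
# All sizes below are illustrative assumptions, not the paper's configuration.
num_experts = 2048
expert_params_per_layer = 100_000_000        # pretend each expert FFN has 100M params
shared_params_per_layer = 10_000_000         # attention etc., used by every token

total = num_experts * expert_params_per_layer + shared_params_per_layer
active = 1 * expert_params_per_layer + shared_params_per_layer   # top-1 routing

print(f"total per layer:  {total:,}")
print(f"active per token: {active:,}  ({active / total:.3%})")   # well under 1%
```
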
    • A trillion parameters is kind of astonishing.

      It's classic over-fitting.

    • Give me 4 parameters and I can fit an elephant. Give me 5 and it can wiggle its trunk.
      Give me a trillion and uh ...an elephant that wiggles all its appendages?

  • Looking forward to trolls getting overfed by AIs that can identify misinformation/disinformation and their posts automatically hidden for being factually inaccurate. The internet of idiots spewing bullshit vastly outnumbers the number of experts on the internet.

    • by Anonymous Coward

      There goes CNN.

      • Not just that CNN. Convolutional neural networks (CNNs) are also being replaced by this kind of neural network - transformers. It's the hottest fad in computer vision.
    • But actually what you will see is high quality translations of trolls in multiple languages.

      They're doing it for the CLIKZ.

    • Looking forward to trolls getting overfed by AIs that can identify misinformation/disinformation ...

      Fat chance. They'd be trained by datasets where humans had made such ratings - with all their biases and the currently observed conflation between ideology and "correctness".

      They might become as good as humans at identifying truth-or-agreement-with-my-in-group's-ideology. But that just turns them into automatic ideologues, not oracles of truth but of "truthiness".

    • Except that the AI will use what is trending on social media to determine Truth

      • Except that the AI will use what is trending on social media to determine Truth

        According to who? You literally just made that shit up and created a prime example of a comment that should be shut down by said AI because even the summary explained that they used reputable sources of information, not what was trending on social media.

    • Looking forward to trolls getting overfed by AIs that can identify misinformation/disinformation and their posts automatically hidden

      That won't work. The trolls can easily defeat the filter by writing insightful, interesting, and factually accurate posts.

      https://xkcd.com/810/ [xkcd.com]

    • Fun fact: Your post counts as one of those. :)

      Also fun fact: If you want to see what an "AI" fed by an internet forum looks like, look up "Bucket (chatbot)" on Encyclopedia Dramatica from when it still existed.

      My favorite quote after it had a fun day with 4chan:

      Bucket: "Bucket is cancer is cancer is Bucket."

      Says it all. :)

  • Comment removed (Score:5, Interesting)

    by account_deleted ( 4530225 ) on Wednesday January 13, 2021 @11:00PM (#60941516)
    Comment removed based on user account deletion
    • Is it odd that they train a 1.6 trillion parameter model with only 0.75 trillion bytes of data?

      Perhaps that is all the data they could find.

      The text of the English edition of Wikipedia is only about 20 GB.

      But using so many parameters has a very big risk of overfitting.
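
      Rough numbers behind that concern; the bytes-per-token figure is a common rule of thumb, not something reported in the paper:

```python
# Back-of-the-envelope: how many training tokens does 750GB of text yield
# versus 1.6T parameters? (~4 bytes of English text per subword token is a
# rough rule of thumb, not a figure from the paper.)
corpus_bytes = 750e9
bytes_per_token = 4            # rough assumption for subword tokenization
params = 1.6e12

tokens = corpus_bytes / bytes_per_token
print(f"~{tokens / 1e9:.0f} billion tokens for {params / 1e12:.1f} trillion parameters")
print(f"~{params / tokens:.1f} parameters per training token")
```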

    • I *knew* they were counting single bits as "parameters"!

    • Not if the dataset is complex, which language is - highly complex, e.g. see 'dual structure.' A linguist can write a whole book describing the characteristics of a short paragraph of text without repeating him/herself. Wikipedia is more than enough to keep an AI busy for a few lifetimes. Its limitation is that it's only one particular mode & genre of text, i.e. written expository, & therefore not particularly generalisable to other genres, which have different lexicogrammatical configurations, e.g.
    • by AmiMoJo ( 196126 )

      Compression maybe? Being text it probably compresses extremely well.

      It would make sense to tokenize it for faster processing, aside from anything else. No point doing trillions of string matches over and over again.
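
      Tokenizing just means mapping recurring strings to integer ids once, so the model manipulates numbers instead of repeatedly matching strings. A minimal word-level sketch (real systems use subword vocabularies such as SentencePiece):

```python
# Minimal word-level tokenizer: build a vocabulary once, then the corpus is
# stored and processed as integer ids instead of raw strings. (Real systems
# use subword vocabularies; this toy version is for illustration.)
def build_vocab(corpus):
    vocab = {"<unk>": 0}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

corpus = "the cat sat on the mat"
vocab = build_vocab(corpus)
print(vocab)                          # {'<unk>': 0, 'the': 1, 'cat': 2, ...}
print(encode("the dog sat", vocab))   # [1, 0, 3] -- unseen words map to <unk>
```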

  • by SuperKendall ( 25149 ) on Wednesday January 13, 2021 @11:21PM (#60941582)

    ..enabled them to train a language model containing more than a trillion parameters.

    In other words, Google finally has what is effectively an infinite number of monkeys devoted to language.

    • There are around 100 to 1000 trillion synapses in the human brain, so a trillion parameters is perhaps not quite effectively infinite. Google doing this certainly doesn't prove in itself that it's necessary, but, you never know until you try. 15 years ago people thought the idea of "Just throw a bunch of neurons at it" was a stupid approach, but it ended up working better than anybody thought it would.
      • It's interesting that the more they copy the processing characteristics of human brains, the more efficient & effective the models become. Therefore, we could say that natural selection is an incredibly powerful form of AI.
        • Therefore, we could say that natural selection is an incredibly powerful form of AI.

          Well, we could say that natural selection is an effective way to generate AI, at least. And it's efficient in the sense that it can do a lot with a little, by finding a structure that works well. It's not very efficient in terms of the amount of time and energy it takes, though.

    • In other words, Google finally has what is effectively an infinite number of monkeys devoted to language.

      And it still can't accurately translate the works of Shakespeare.

  • Did someone just pick buzzwords at random?

  • I hope this does not imply unnatural distortion by Despicable Catholiban Linguistic Cleansing (DC/LC).

  • by Arethan ( 223197 )

    ...but what's the question?

    • ...but what's the question?

      How do I post a very redundant and not funny at all old joke?

      • I think he was making a valid comment.
        The headline tells us nothing about what this system DOES.
        A useful headline would do that.

        Next week, from the Slashdot "editors":

        "Poptopdrop develops a new AI system with a Trillion and one parameterization factor assemblies."

  • Even after they are properly trained, their model can't possibly use all these input data every time it is run. So where are they cheating? Are most of these parameters set to zero, so that it is a *sparse* matrix with 1.6 trillion entries but just a handful of nonzeros?
  • This is still the same kind of rote training on an existing corpus that we've been doing for decades. As you increase the size of the corpus and the number of parameters, you get ever-diminishing returns. Twice as big is not nearly twice as good.

    There is still no *understanding* involved. It's just rote training. If you talk to a chatbot trained like this (as most are), it's just like the old Eliza: it reflects your thoughts back to you, maybe mixing in some stuff from the training corpus. For answering que

    • This thread is full of n-zi spammers, and then there are posts like yours, which are almost as useless.

      The big deal with huge language models is precisely that the returns don't diminish as much as people thought they would. Bigger keeps doing better.

      There is still no *understanding* involved.

      Sure. Just like there's no understanding involved in your post. Your notion of understanding is vague and doesn't stand up to scrutiny.

      If you talk to a chatbot trained like this (as most are), it's just like the old El

      • by dargaud ( 518470 )
        I don't know what model the various chatbots on customer support webpages use, but they all uniformly suck. Ask a pointed question and get a boilerplate answer, always completely unrelated to the question. So yes, maybe you can use those models to generate amusing short stories from a prompt. But they always seem 'off' at the very best. And trying to do anything serious like answering precise customer questions is a complete failure. But like I said, maybe those bots are 4 generations beyond the current sta
        • by nagora ( 177841 )

          I don't know what model the various chatbots on customer support webpages use, but they all uniformly suck. Ask a pointed question and get a boilerplate answer, always completely unrelated to the question.

          To be fair, that's the same with a human call centre too.

          "AI" allows companies to be unhelpful for less money.

  • Regarding the energy requirements of training the model, has anyone compared them to the energy it takes for a group of humans (to grow up and) to learn and do similar tasks?
  • I keep wondering: does this neural network even know what it's saying or doing? There aren't even a trillion words in the English language and the number of distinct phrases is probably lower as well.

    So how does this matter?
  • by Fly Swatter ( 30498 ) on Thursday January 14, 2021 @11:07AM (#60943454) Homepage
    Gets worse every month. Progress!
  • "out of sight, out of mind" -> blind and insane

    "the spirit is willing but the flesh is weak" -> the wine is good but the meat is spoiled
