How Google Finally Leapfrogged Rivals With New Gemini Rollout (msn.com) 38

An anonymous reader shares a report: With the release of its third version last week, Google's Gemini large language model surged past ChatGPT and other competitors to become the most capable AI chatbot, as determined by consensus industry-benchmark tests. [...] Aaron Levie, chief executive of the cloud content management company Box, got early access to Gemini 3 several days ahead of the launch. The company ran its own evaluations of the model over the weekend to see how well it could analyze large sets of complex documents. "At first we kind of had to squint and be like, 'OK, did we do something wrong in our eval?' because the jump was so big," he said. "But every time we tested it, it came out double-digit points ahead."

[...] Google has been scrambling to get an edge in the AI race since the launch of ChatGPT three years ago, which stoked fears among investors that the company's iconic search engine would lose significant traffic to chatbots. The company struggled for months to get traction. Chief Executive Sundar Pichai and other executives have since worked to overhaul the company's AI development strategy by breaking down internal silos, streamlining leadership and consolidating work on its models, employees say. Sergey Brin, one of Google's co-founders, resumed a day-to-day role at the company helping to oversee its AI-development efforts.

Comments Filter:
  • by FictionPimp ( 712802 ) on Monday November 24, 2025 @09:04AM (#65814973) Homepage

    Eventually the models will be tuned to just perform well on the tests and perform like crap outside of the tests.

    • by Junta ( 36770 ) on Monday November 24, 2025 @09:13AM (#65814981)

      Eventually? We're kind of already there. I recall one of these benchmark questions going viral, which attracted a lot of actual humans writing up why they felt the AIs struggled with it, correct answers included. Those writeups then made their way into the RAG inputs of LLMs and into training material. The AIs suddenly got better at that question, what a surprise...

      Most specific examples of LLM screwups get self-corrected in short order the same way: the mocking ironically shapes the RAG inputs to steer the model around that specific behavior. Suddenly the LLMs got really good at counting the number of 'r's in strawberry, not because they could actually count letters, but because the internet now said how many r's were in strawberry just a whole bunch of times...
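
      For anyone unfamiliar with the mechanism, here's a minimal sketch of that retrieval step, with the embedding model stubbed out as a hypothetical embed() function (a real system would plug in an actual embedding model). Whatever documents score highest, including viral "why AIs fail at X" writeups once they get indexed, are pasted straight into the prompt:

      ```
      import numpy as np

      def embed(text: str) -> np.ndarray:
          """Hypothetical stand-in for an embedding model; stubbed here."""
          raise NotImplementedError

      def retrieve(query, docs, doc_vecs, k=3):
          """Return the k docs whose embeddings are most cosine-similar to the query."""
          q = embed(query)
          sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
          return [docs[i] for i in np.argsort(sims)[::-1][:k]]

      def build_prompt(query, docs, doc_vecs):
          """Stuff the retrieved snippets into the prompt, verbatim."""
          context = "\n---\n".join(retrieve(query, docs, doc_vecs))
          return f"Context:\n{context}\n\nQuestion: {query}"
      ```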

      • by Bobartig ( 61456 ) on Monday November 24, 2025 @11:15AM (#65815229)

        What you're describing isn't an implementation of RAG, but supervised post-training like Direct Preference Optimization. In DPO, researchers compile sets of answers from LLMs to the same question, with a human reviewer selecting which one is better. This data is used to fine-tune the model, steering it towards the good answers and away from the bad ones, which is why it's such an effective means of shifting a model's behavior towards a certain style of answer.

        It gets pushed way too far, which is why you can find tiny open-weights models that crush benchmarks, and tell you confidently that "there are 3 'r's' in strrawberrrry," because, as you point out, they can't count letters in token sequences.
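
        For the curious, the DPO objective itself fits in a few lines. A rough sketch of the textbook form of the loss (not any lab's actual training code), assuming you already have summed per-sequence log-probs from the policy being tuned and from a frozen reference model:

        ```
        import torch
        import torch.nn.functional as F

        def dpo_loss(pi_chosen_logp, pi_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
            """Direct Preference Optimization: push the policy to prefer
            the human-chosen answer over the rejected one, with the frozen
            reference model anchoring how far it can drift."""
            chosen = beta * (pi_chosen_logp - ref_chosen_logp)
            rejected = beta * (pi_rejected_logp - ref_rejected_logp)
            return -F.logsigmoid(chosen - rejected).mean()
        ```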

        • and tell you confidently that "there are 3 'r's' in strrawberrrry," because, as you point out, they can't count letters in token sequences.

          While there's certainly no guarantee it will be, there's no reason such a capability can't be generalized.

          gpt-oss-120b:
          Me:

          How many times does the letter "v" appear in the following sentence:
          ```
          Verily I say, it's very nice to invoke a vehement interaction with your visage
          ```

          GPT:
          Counting them:

          1. V (in *Verily*)
          2. v (in *very*)
          3. v (in *invoke*)
          4. v (in *vehement*)
          5. v (in *visage*)

          So the letter “v” appears **5 times** in the sentence (including the initial capital V).

          *(If you count only lowercase “v”, there are 4 occurrences.)*

      • LLMs can count letters in unseen phrases.
        The fact that a model could not does not mean that LLMs cannot.

        gpt-oss-120b:
        Me:

        How many times does the letter "z" appear in the following word: "zlkjxcvlzkjxlcvzzZZZzlskdfjasdzz"

        GPT:

        So:

        - **Lowercase ‘z’: 7 occurrences**
        - **Uppercase ‘Z’: 3 occurrences**

        **Total (case-insensitive) = 7 + 3 = 10**.
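
        Both transcripts are easy to check deterministically, for anyone who wants to verify the counts:

        ```
        sentence = ("Verily I say, it's very nice to invoke a vehement "
                    "interaction with your visage")
        word = "zlkjxcvlzkjxlcvzzZZZzlskdfjasdzz"

        print(sentence.lower().count("v"))        # 5 (4 lowercase plus the capital V)
        print(word.count("z"), word.count("Z"))   # 7 3
        print(word.lower().count("z"))            # 10
        ```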

    • by Viol8 ( 599362 ) on Monday November 24, 2025 @09:35AM (#65815029) Homepage

      Just saying...

    • Seems to me that it already is. ChatGPT is significantly better than the others at reading circuit diagrams (although it struggles with them too). Still, in this regard (and in electronics generally) ChatGPT performs better than both Claude and Gemini.

    • Statistical models have the concepts of generalization and over-fitting.

      What you describe is over-fitting.
      The goal is generalization.
      Models have become better and better at generalization, not more and more over-fitted.
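
      A toy illustration of the distinction, assuming noisy samples from a smooth function: the high-degree fit nails the training points (memorization) but typically does worse on held-out points than the low-degree one.

      ```
      import numpy as np

      rng = np.random.default_rng(0)
      x_train = np.linspace(0, 1, 10)
      y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
      x_test = np.linspace(0, 1, 100)
      y_test = np.sin(2 * np.pi * x_test)

      for degree in (3, 9):
          coeffs = np.polyfit(x_train, y_train, degree)
          train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
          test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
          # degree 9 interpolates all 10 training points (train_mse ~ 0),
          # yet typically generalizes worse than degree 3 on the held-out grid
          print(degree, train_mse, test_mse)
      ```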
  • New version of Rectum now produces shit at 3 times closer to shit consistency benchmarks than previous versions of competitors Anus and Colon. Given that we've already decided that you just need more shit constantly and forever and will shove it into every aspect of your life, this must make you very happy.

  • Good job (Score:5, Interesting)

    by coofercat ( 719737 ) on Monday November 24, 2025 @09:35AM (#65815027) Homepage Journal

    I saw this news on LI a few days ago, so I headed over to try my 'stock test'. I have a programming question which isn't so easily found in online examples and whatnot, and on the face of it looks easy, but actually it's a bit involved. It requires a two-step solution (a sort of parse and then parse again type thing). I'd say it's maybe 100 lines of Python to solve (at most). I didn't give it any examples of input/output, and my prompt was maybe one or two lines long at most - not a long essay spoon-feeding it the implementation.

    I asked ChatGPT and got the usual one-pass not-quite-there-yet answer. It's the sort of answer you'd expect a human junior to give you before you talked it over and explained the deficiencies of it.

    I asked Gemini, and it gave me a working program, with a decent example input (with a proper two pass solution). The code had comments which actually explained what was going on, and the code was pretty nice (descriptive variable names and so on). I went on to ask it to make a relatively simple change, and again, same sort of response. I'd say it was close to "commit ready", if it were a task in a ticket I was working on or something. I'd probably do some more testing with a few more inputs (maybe get it to write some unit tests?) and assuming all was well, I'd commit it.

    I realise this is just one test of billions of possible ones, but it's one that every AI I've tried it on has failed to answer properly. Since it's the first that actually did answer it, and honestly, answered it really well, I'd say they do really seem to have 'cracked it' somewhat, at least for Python programs. It probably doesn't solve every problem, and it's still prone to making stuff up, but it's definitely got something about it that's good. I am tempted to try and connect it up to my IDE to see what I can do with it, but haven't taken the plunge yet - it's the first AI I've felt is worth using.

    What it knows about Lindsay Lohan I couldn't say though ;-)
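
    (The question itself stays secret, obviously, or it'd end up in the training data. For anyone wondering what a 'parse and then parse again' problem looks like, here's a hypothetical example of the shape, not the actual test: resolving forward references, where one pass has to collect definitions before a second pass can substitute them.)

    ```
    import re

    def resolve_refs(lines):
        """Two-pass example: 'label: text' defines a label, '@label'
        refers to it, possibly before it's defined, which is why a
        single pass over the input can't work."""
        defs = {}
        # Pass 1: collect every definition.
        for line in lines:
            if m := re.match(r"(\w+):\s*(.*)", line):
                defs[m.group(1)] = m.group(2)
        # Pass 2: substitute references now that all labels are known.
        return [re.sub(r"@(\w+)", lambda m: defs.get(m.group(1), m.group(0)), line)
                for line in lines]

    print(resolve_refs(["use @greet here", "greet: hello"]))
    # ['use hello here', 'greet: hello']
    ```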

    • by Viol8 ( 599362 )

      So how long before it can start to rewrite its own code to "improve" itself?

      I'm only half joking.

      • So how long before it can start to rewrite its own code to "improve" itself?

        I'm only half joking.

        Well, the LLMs don't really consist of "code" per se, but I think the AI labs are already using them to work on improving their own design. How far are they from being able to do this without human oversight and supervision? I have no idea.

        • by Viol8 ( 599362 )

          "Well, the LLMs don't really consist of "code" per se"

          Oh they do, a LOT of it, all the way from the high-level Python libraries such as TensorFlow, via sigmoid or ReLU activation functions, down to the low-level CUDA GPU libs. The only part that is pure data is the neuron weights; the neurons themselves are code.
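
          You can see both views in a few lines. In this sketch (a single dense layer, purely for illustration) the activation function and the forward pass are code; only W and b are data. In a real LLM there are billions of such weights, learned rather than written:

          ```
          import numpy as np

          def relu(x):
              return np.maximum(0, x)      # activation function: code

          def layer(x, W, b):
              return relu(W @ x + b)       # forward pass: code

          # W and b are the "pure data" part, produced by training.
          W = np.array([[0.5, -1.0], [2.0, 0.1]])
          b = np.array([0.1, -0.2])
          print(layer(np.array([1.0, 2.0]), W, b))   # [0. 2.]
          ```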

          • All of the stuff other than the weights is basically just scaffolding, though. Necessary, but the stuff that makes it "intelligent" is all in the weights.

            • A database is useless without data too. Most computer programs are.

              • An untrained transformer is like a CPU. A trained transformer is like a CPU plus a program (the weights) that run on it.

                You could say that the functioning of any computer program is mostly down to the CPU, but while somewhat true, that's a useless statement and fails to understand that it's the program where the logic is.

                In the exact same way, saying that the functioning of a trained transformer (LLM) is down to the transformer is somewhat true, but fails to capture what is really going on in exactly the same way.

        • by allo ( 1728082 )

          The interesting part for AI improving itself is coming up with better architectures and better optimizers. The training part is just something that needs time and compute, but the next step in AI will probably involve an architecture that solves some of the current drawbacks, such as attention scaling badly with long context.
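
          To make "scaling badly" concrete: in standard scaled dot-product attention every token attends to every other token, so the score matrix is n x n, and doubling the context quadruples the memory and compute. A minimal single-head sketch:

          ```
          import numpy as np

          def attention(Q, K, V):
              """Scaled dot-product attention for one head."""
              d = Q.shape[-1]
              scores = Q @ K.T / np.sqrt(d)    # shape (n, n): the quadratic part
              scores -= scores.max(axis=-1, keepdims=True)
              weights = np.exp(scores)
              weights /= weights.sum(axis=-1, keepdims=True)
              return weights @ V

          n, d = 1024, 64                      # n tokens of context
          x = np.random.randn(n, d)
          out = attention(x, x, x)             # O(n^2) memory and time in n
          ```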

    • I just tried my standard "draw me an ASCII middle finger" and it flat-out refused to generate it for me! Grok still sucks at it but at least it tried. And it gave me the finger through emoji afterward. Gemini needs more work, I think.

      • I just tried my standard "draw me an ASCII middle finger" and it flat-out refused to generate it for me!

        Yep, the guardrails are also improving.

      • by Bobartig ( 61456 )

        At some point when I was bored and playing around with LLMs, I had ChatGPT keep making ASCII art over and over. I'd ask for something like a duck holding an umbrella. No matter what blobby garbage it produced (and it was all blobby garbage), I just kept encouraging it and telling it to make more. Add details, refine, make it even duckier, etc. And it just kept taking the input blob and adding more blobs to it, like an infinite ASCII blob spiral.

        • AI is still so bad at it that it makes for a great test. It's interesting too, because you can see that it sort of gets close, but it's still WAY off!

    • by RobinH ( 124750 )

      I just asked it a fairly simple (in my opinion) question: "What are the top 3 tier one parts suppliers in the North American automotive market, by revenue?"

      It very confidently gave me 3 tier 1 suppliers for the 2022 fiscal year. The top, not surprisingly, was Magna, which is probably true. But it said the revenue was ~$18.9 billion. That doesn't seem to line up with any facts I can find online about Magna. Typical revenue is more like $10 billion per quarter, or $40 billion per year. I can't figure out where it got that number.

      • If you're trying to get facts out of AI, you're doing it wrong. It's not a knowledge base, it's a natural language search and summarization tool. Its facts are only right if its sources are right, and it may not be able to rank sources by quality yet. (That will come.) And if you're asking an obscure question that its sources don't answer, it will make something up. Don't go to AI for facts. Go to it for its natural language abilities... which include programming languages.
        • by RobinH ( 124750 )
          True, when I asked it to generate a spam email campaign, or a deepfake video of a local politician, it did great. I'm glad our society now has access to this wondrous new technology. I can't wait to see what amazing impact it will have on our lives. Too bad it can't, you know, go find me some facts and all, or at least tell me when it can't find any. Actually AI doesn't even go looking for facts. It generates text that looks statistically like text it has seen. So in no way does it do anything related to finding facts.
          • That's only sort of true. Here's an example. If you ask it "how do I use feature X" and give it some code and documentation as context, it will find relevant references in the documentation, locate examples of using that feature, adapt the examples to your code (variable names etc), and optionally update your code if you let it. That's the sort of thing it does well. What it doesn't do well is STOP when it goes off the rails. You have to perform that function.
          • Too bad it can't, you know, go find me some facts and all

            Tool-enabled LLMs can. This is old tech.

            or at least tell me when it can't find any.

            Can do that too.

            It generates text that looks statistically like text it has seen.

            This bullshit again.
            It's not remotely true.

            An LLM is trained to give responses that match existing text, but this does not mean it "generates text that looks statistically like text it has seen."
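
            For reference, the pretraining objective is next-token cross-entropy: at each position, maximize the log-probability of the token that actually comes next. A minimal sketch with toy shapes (whether optimizing this produces mere statistical mimicry is exactly what's under dispute):

            ```
            import torch
            import torch.nn.functional as F

            def next_token_loss(logits, tokens):
                """logits: (batch, seq, vocab); tokens: (batch, seq).
                Score each position's prediction against the next real token."""
                return F.cross_entropy(
                    logits[:, :-1].reshape(-1, logits.size(-1)),
                    tokens[:, 1:].reshape(-1))

            logits = torch.randn(2, 8, 100)           # toy model output
            tokens = torch.randint(0, 100, (2, 8))    # toy token ids
            print(next_token_loss(logits, tokens))
            ```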

    • by Anonymous Coward

      Buy an ad.

    • Similar results here, Gemini 3 does an outstanding job at coding. Best I have seen.

    • I asked it one of my favorites... how to do something very unusual in a specific programming language. A traditional Google search easily surfaces on the 1st page the 1 major forum thread demonstrating how it is indeed possible with a full sample and lengthy explanation, 10y old now, my 2 public GitHub projects expanding on the subject, 1 and 2y old, and articles mentioning them.
      Gemini failed like all the others, falsely claiming it's impossible. I'd say it failed worse than any AI yet, as it offered a mostl
  • "Open"AI's last release was said to be way better on tests the AI companies use, but real users didn't rank it much better than the previous iteration and some even said they would stick with that.

    So how much better is Google's new product for real users ?

    And are we seeing signs that LLMs are asymptotically approaching a maximum, a hard limit of the technology that throwing more bytes at them will not solve ?
     
    • by allo ( 1728082 )

      There's no sign of an asymptote yet. The improvements seem roughly linear while model size/cost drops exponentially.
      There is clearly a ceiling on how much knowledge you can fit into a model of a certain size (so even when a 4B model beats GPT-3 at problem solving, it won't beat it at knowledge questions), but if you instead go for "cleverness" there may still be room for improvement before hitting a ceiling.

  • by WaffleMonster ( 969671 ) on Monday November 24, 2025 @10:35AM (#65815135)

    Benchmarks are more marketing tools than accurate reflections of model capability. The only thing that matters is what users think.

    • by evanh ( 627108 )

      Yeah, this. I prompted just a couple of days ago for example code bit-bashing 4-bit SD mode and got garbage. Told it to use the specs from the SD Association too.

      If these LLMs can't see something to regurgitate then they're useless. There's no intelligence. They're just the new search engine is all.

      • For something as simple and well known as that, I'm inclined to think your prompting was deficient.

        That's not an insult- interacting with these things is a skill in itself.
        • by evanh ( 627108 )

          You've missed a pretty big detail. 4-bit SD mode is not done as bit-bashing at all. Only 1-bit SPI mode has lots of examples of bit-bashing in the open.

          All public native SD mode solutions revolve around a hardware controller handling all the data transfers, accessed through a provided API wrapper. This closed situation is why I'm using it as a test of intelligence.

          Since the LLM can't find any code base to reference, it therefore can't regurgitate any solution ... so responds accordingly. I suppose that's better

  • Dead company walking.

Making things up. I've searched for products and it has made up entire products, complete with SKUs, out of thin air.

    • The rest of us are talking about a model that was released 3 days ago. You're probably thinking of a different one.
