Google AI

Google Unveils Gemini 2.5 Pro, Its Latest AI Reasoning Model With Significant Benchmark Gains (blog.google) 7

Google DeepMind has launched Gemini 2.5, a new family of AI models designed to "think" before responding to queries. The initial release, Gemini 2.5 Pro Experimental, tops the LMArena leaderboard by what Google claims is a "significant margin" and demonstrates enhanced reasoning capabilities across technical tasks. The model achieved 18.8% on Humanity's Last Exam without tools, outperforming most competing flagship models. In mathematics, it scored 86.7% on AIME 2025 and 92.0% on AIME 2024 in single attempts, while reaching 84.0% on GPQA's diamond benchmark for scientific reasoning.

For developers, Gemini 2.5 Pro demonstrates improved coding abilities with 63.8% on SWE-Bench Verified using a custom agent setup, though this falls short of Anthropic's Claude 3.7 Sonnet score of 70.3%. On Aider Polyglot for code editing, it scores 68.6%, which Google claims surpasses competing models. The reasoning approach builds on Google's previous experiments with reinforcement learning and chain-of-thought prompting. These techniques allow the model to analyze information, incorporate context, and draw conclusions before delivering responses. Gemini 2.5 Pro ships with a 1 million token context window (approximately 750,000 words). The model is available immediately in Google AI Studio and for Gemini Advanced subscribers, with Vertex AI integration planned in the coming weeks.

  • Whenever a peddler of LLM-crap stresses their artificial moron is doing better on benchmarks, that just means they have given up and are cheating now.

    • Are you still doing this? Move on, man. The world has.

      Your contentless ranting is just noise that pollutes Slashdot.

    • You never have benchmarks in your life? When putting together your new system, you don't look at how well the various components perform? When hiring for a position, you don't look at their credentials or what they've done? When judging which car to buy, you don't look at its 0-60 times, its fuel mileage, its reliability?

      Explain how one is to gauge the good or bad of something without a consistent benchmark to compare against.

    • I'm not sure they're cheating, but I think the significance of the benchmarks is pretty overstated.
      I suspect that even though the benchmarks are supposed to test something other than information retrieval and interpolation, in practice they end up being amenable to being "solved" by information retrieval and interpolation.

      But a lot of what makes us go is our ability to switch out of that mode and into other modes, like pure logic, or other modes that are typically disparaged like emotional states or interac

  • "...designed to "think" before responding to queries..."

    Literally every piece of software EVER was "designed to think before responding to queries". It is impossible to do otherwise.

    I am so sick of this anthropomorphizing of AI. It is computer software.

    "...demonstrates enhanced reasoning capabilities across technical tasks."

    Does better than some other things at some tasks.

    "For developers, Gemini 2.5 Pro demonstrates improved coding abilities ..."

    Not to be confused with "coding abilities" of developers.

    "Th

    • In this case, 'reasoning' names a distinct technique used to improve LLMs (https://en.wikipedia.org/wiki/Reasoning_language_model). You may disagree with the name, but it isn't just marketing hype. It is what the technique is called in the industry.

      • So it sounds like they are separating internal reasoning/logic from the communication functions. Are these also the AI that we hear about becoming "deceptive"?
