AI Biotech

OpenAI Announces Benchmarks for AI Life Sciences Research. Its Best Model Failed 63.9% of the Test (nerds.xyz) 19

This week OpenAI announced a 750-task test to measure "whether AI systems can support realistic life science research tasks, not just answer biology questions."

But while OpenAI's top-performing GPT-Rosalind model led the rankings, Slashdot reader BrianFagioli notes that "it achieved a pass rate of just 36.1 percent, failing nearly two-thirds of benchmark tasks." Nerds.xyz points out that means "the best-performing model failed nearly two-thirds of the benchmark's tasks." The benchmark also revealed a familiar weakness. AI systems generally perform better when everything is presented as text. Once they are forced to work with supporting documents, figures, or complex datasets, performance drops noticeably. GPT-Rosalind's pass rate fell from 45.1 percent on text-only tasks to 28.1 percent on tasks involving artifacts or URLs.

To be fair, the benchmark is not intended to suggest AI is useless in research. Quite the opposite. OpenAI found that models are becoming increasingly capable of scientific communication, evidence synthesis, and translating research findings into practical explanations. Those are valuable skills, particularly for researchers drowning in information. But LifeSciBench serves as a useful reminder that today's AI systems are still far from autonomous scientists. They can help. They can assist. They can sometimes provide surprisingly useful insights. What they cannot reliably do, however, is replace the expertise, judgment, and skepticism that real scientific research requires.

  • Interestingly, the benchmark questions are not available to the public for "safety and licensing" reasons. I don't see how benchmarking questions about answering science questions could be a public risk unless the questions are asking how to build atomic weapons or biological warfare agents, and the scoring matrix contains useful/secret information…

  • by subreality ( 157447 ) on Saturday June 20, 2026 @06:42PM (#66202084)

    36.1% pass would be worrying if this was a qualification test of things it needs to be able to do. It's not. This is a benchmark, and it SHOULD have a low pass rate. That's how you know if you're making improvements.

    We could quite easily create a different benchmark where it passes 99.9%. That wouldn't mean the device being tested is good. It would just mean we have a useless benchmark.

    I have no opinion on whether AI is good or bad for this use case. I just hate when statistics are used to mislead people.

    • by SeaFox ( 739806 )

      I don't understand why they worded the headline like that. Who refers to a score on an assessment by the percentage you got wrong? If their point is to say the models weren't very good, surely saying they scored 36.1% would have a better audience impact than using the opposite figure.

    • Unfortunately (for them) the inverse is also true,

      If we try to build a system like this to be definitive... Our system will ultimately be inadequate for the breadth of operations researchers will attempt to push through it.

      Doing last mile research integration is a fairly regular part of my day job... Most researchers don't really care to conform with an application framework, and those who do care to will know the frameworks they're working with well enough to make complex decisions like a developer would w

  • by SubmergedInTech ( 7710960 ) on Saturday June 20, 2026 @07:42PM (#66202104)

    For example, a new grad with a BS in Biology? Or a mid-career researcher?

    And with what time limits? Is the amount of work in this benchmark something that would take the human a day? A week? A month?

    I'd also like to know how quickly a new grad or mid-career researcher can identify which things the AI got right. For example, say it's asked to do a week's worth of work and gets 36% right = 14 hours. If it takes the human 10 hours to figure that out, it's a win. If it takes the human 20 hours to figure it out, it's not.

    And how well could the human figure out ahead of time which things it thought the AI would get right? If the human only asks that subset, then the payoff is better. Say the human only asks the AI to do 20% of the tasks (8 hours of work), but now it takes 20% of the time to grade (so instead of 20 hours, it takes 4 hours). Now it's a win again.

    Without knowing these things, it's like saying, "AI sucks at playing golf!" Without saying whether it's having trouble with 400-yard drives or just getting the ball into the windmill before the ramp goes up.
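    The arithmetic in this comment can be sketched as a small payoff model. This is only a rough reading of the scenarios above, not anything from the benchmark itself: it assumes a week is 40 hours, that a correct answer saves exactly the hours a human would have spent on it, and that a wrong answer saves nothing. The function names are made up for illustration.

```python
# Back-of-the-envelope payoff model for the scenarios in the comment.
# Assumptions (not from the benchmark): a week is 40 hours, a correct
# answer saves exactly the human hours it replaces, a wrong answer
# saves nothing, and review time is a fixed cost.

WEEK_HOURS = 40.0


def net_hours_saved(work_hours, pass_rate, review_hours):
    """Hours of work the AI got right, minus hours spent checking it."""
    return work_hours * pass_rate - review_hours


def break_even_pass_rate(work_hours, review_hours):
    """Pass rate at which checking costs exactly as much as it saves."""
    return review_hours / work_hours


if __name__ == "__main__":
    # Whole week handed to the AI, 36% right, 10 vs. 20 hours of review.
    for review in (10.0, 20.0):
        net = net_hours_saved(WEEK_HOURS, 0.36, review)
        print(f"review {review:.0f}h -> net {net:+.1f}h")

    # Hand-picked 20% of the tasks (8 hours), 4 hours of review.
    needed = break_even_pass_rate(8.0, 4.0)
    print(f"subset break-even pass rate: {needed:.0%}")
```

    Under these assumptions the hand-picked subset only pays off if the AI gets more than half of it right, so the "win" in the last scenario depends on the human picking tasks well, not just on asking for fewer of them.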

    • ...or if it thinks playing golf involves creative use of a golf cart, and correctly infers that just driving up and dropping the ball in the hole will incur a penalty... so it spends all your expensive premium tokens thinking up new ways to avoid detection...

      etc.

      • ...or if it thinks playing golf involves creative use of a golf cart, and correctly infers that just driving up and dropping the ball in the hole will incur a penalty... so it spends all your expensive premium tokens thinking up new ways to avoid detection...

        etc.

        So, funny story. I just implemented Cloudflare Turnstile on my website, because I'm tired of all the AI bots scraping it.

        I decided to see what Github Copilot would propose for an implementation. Its first pass wasn't a good integration with my site, but it did show me what pieces of the puzzle I'd need. Took me a few hours after that to reimplement it cleanly. I figure that first pass probably saved me half a day of futzing around myself, though.

        But then it wasn't working on third-party redirects (like

        • I have this image in my head right now of a 3rd party security bot getting mad at you because it, while wearing another hat, introduced XSS vulns.

          This seems like a great customer experience.

          • I was mostly amused. I worked in firmware and secure boot for decades, so it wasn't really going to get away with it on my website.

            But it has convinced me I will never upload any personal information to anything that looks vibe-coded. Because those people have no idea what their AI-generated code is doing. And there's a wide spectrum of bugs in between "really working" and "so preposterously f'd up that a newbie would notice."

            • yeah, for sure.

              The risk is to people who don't have the experience, and especially to those who don't know there's even anything to be learned to begin with... both at the developer and customer level. Probably at a few more levels, come to think of it.

    • The purpose of this benchmark is to track and steer AI improvement. They want to start with a low success rate to have room for improvement; how well humans score on it doesn't matter.

      • The purpose of this benchmark is to track and steer AI improvement. They want to start with a low success rate to have room for improvement; how well humans score on it doesn't matter.

        It matters, for two big reasons:

        1) It tells you whether you're benchmarking something that's actually useful. If you're benchmarking basketball by how many times a player bounces the ball on the court, that might indirectly correlate to how well they score (since it's an indicator of time with the ball), but it leads you to optimize the wrong thing.

        2) It tells you when the model is good enough to be useful for real-world tasks. "Our previous smoke detector was only rated 1.3 flugelhoffers; the new one is

  • It feels like software engineering with AI could be described with similar numbers.

    • There's a bit of a magic trick with a lot of it.

      These things are *really* good at drudge work. Shitty ReactJS sites. WordPress themes. The kind of stuff you give to the fresh recruit from university, who's still got enough vigor in him to tolerate working on mind-numbing web shit all day. But you give it a hard problem, race conditions in multithreaded code, cache chaos, or worse mashing it all together and debugging memory leaks in a cached multithreaded system across multiple language translation layers in

  • People may say "63% fail rate is bad!" but new AI benchmarks are designed so current models fail, because when all models are in the range 85-90% of old benchmarks, the signal/noise ratio of the benchmark is low. So when this happens you introduce a benchmark that challenges current models to observe future models becoming better.

  • by LordHighExecutioner ( 4245243 ) on Sunday June 21, 2026 @01:26PM (#66202916)

    ...pose your Life Science question to OpenAI, then reverse the answer. It will be correct 63.9% of the time.

  • Anyone who has ever asked AI an advanced math question knows it's not really thinking, it's just making things up that sound right and stringing together unrelated things that it sort of heard about on the internet. Gee, I wonder why that wouldn't translate over to a scenario that's even more complex than mathematics and equally rigid?
