Testing Suggests Google's AI Overviews Tells Millions of Lies Per Hour (arstechnica.com) 104

A New York Times analysis found Google's AI Overviews now answer questions correctly about 90% of the time, which might sound impressive until you realize that roughly 1 in 10 answers is wrong. "[F]or Google, that means hundreds of thousands of lies going out every minute of the day," reports Ars Technica. From the report: The Times conducted this analysis with the help of a startup called Oumi, which itself is deeply involved in developing AI models. The company used AI tools to probe AI Overviews with the SimpleQA evaluation, a common test to rank the factuality of generative models like Gemini. Released by OpenAI in 2024, SimpleQA is essentially a list of more than 4,000 questions with verifiable answers that can be fed into an AI.

Oumi began running its test last year when Gemini 2.5 was still the company's best model. At the time, the benchmark showed an 85 percent accuracy rate. When the test was rerun following the Gemini 3 update, AI Overviews answered 91 percent of the questions correctly. If you extrapolate this miss rate out to all Google searches, AI Overviews is generating tens of millions of incorrect answers per day.
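The extrapolation above can be sketched as back-of-envelope arithmetic. The total search volume and the share of queries that trigger an AI Overview are not given in the article; the figures below are rough, hypothetical placeholders for illustration only:

```python
# Back-of-envelope version of the extrapolation above. The search
# volume and AI Overview share are NOT from the article; they are
# rough, hypothetical placeholder values.
searches_per_day = 14e9   # assumed total Google searches per day
overview_share = 0.15     # assumed fraction showing an AI Overview
accuracy = 0.91           # benchmark accuracy reported in the article

wrong_per_day = searches_per_day * overview_share * (1 - accuracy)
wrong_per_minute = wrong_per_day / (24 * 60)

print(f"{wrong_per_day:,.0f} wrong answers/day")
print(f"{wrong_per_minute:,.0f} wrong answers/minute")
```

With these placeholder inputs the result lands in the low hundreds of thousands per minute, the same order of magnitude as the article's claim; the absolute numbers shift with whatever volume figures you assume.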

The report includes several examples of where AI Overviews went wrong. When asked for the date on which Bob Marley's former home became a museum, AI Overviews cited three pages, two of which didn't discuss the date at all. The final one, Wikipedia, listed two contradictory years, and AI Overviews confidently chose the wrong one. The benchmark also prompts models to produce the date on which Yo Yo Ma was inducted into the classical music hall of fame. While AI Overviews cited the organization's website that listed Ma's induction, it claimed there's no such thing as the Classical Music Hall of Fame.
"This study has serious holes," said Google spokesperson Ned Adriance. "It doesn't reflect what people are actually searching on Google." The search giant likes to use a test called SimpleQA Verified, which uses a smaller set of questions that have been more thoroughly vetted.


  • Great even the pol (Score:5, Insightful)

    by DarkOx ( 621550 ) on Tuesday April 07, 2026 @03:05PM (#66081846) Journal

    Well shoot, even the politicians jobs are not safe then!

  • by TwistedGreen ( 80055 ) on Tuesday April 07, 2026 @03:08PM (#66081858)

    Alice laughed. "There's no use trying," she said. "One can't believe impossible things."

    "I daresay you haven't had much practice," said Google. "Why, sometimes I've believed as many as six impossible things before breakfast."

  • by SlashbotAgent ( 6477336 ) on Tuesday April 07, 2026 @03:10PM (#66081868)

    The crommulence of AI responses is infallible and unimpeachable. This article is complete balderdash.

    • I didn't think cromulent was a word... turns out The Simpsons writers invented it. Cromulence just means acceptability, and... you spelled it wrong.
      • AI said I am correct and that you're teh gey[sic].

        So... there.

        • Google's Gemini says, "Yo mama's so slow, she makes SlashbotAgent look like a high-frequency trading algorithm." You know, I don't think Gemini is really clear on the concept of yo mama jokes. Maybe it gets it confused with Yo Yo Ma jokes.
          • Maybe it gets it confused with Yo Yo Ma jokes.

            I hear he's like the cellist dude. Never gets angry when people call his mother fat.

  • AI lies (Score:2, Informative)

    by gary s ( 5206985 )
    And non-AI search results are pretty much all lies. Look at this, oh wait, it's an ad link...
  • I would rather use Grok.

    • by Tailhook ( 98486 )

      The LLM they're using for "AI Overview" is terrible. Obviously, they're doing that because it's a small model that runs fast, so it can handle the load of millions of queries a minute. I find that if you then click "Dive Deeper", the model improves to something usable, often completely contradicting the "Overview" slop.

      It's not a good look. But I suppose they have to put "AI" out front, even when it's crap.

      • It's not a good look.

        Yeah, it makes an extremely bad first impression. Anecdotally, everybody I know sees it as the slop on top of the search results that you just skip over.

      • It's a case where doing nothing would have been better than this.

        A strange game. The only winning move is not to play. ... How about a nice game of chess?

  • I use gemini (Score:5, Interesting)

    by MpVpRb ( 1423381 ) on Tuesday April 07, 2026 @03:22PM (#66081884)

    It often gives excellent answers, but when it doesn't, the results are strange.
    I asked for help writing code for an obscure hobby CNC control system.
    It totally invented function calls and invented plausible documentation to explain how they worked and how to call them.
    It totally missed the easy answer that involved calling an existing simple function and writing no new code.
    If the answer doesn't exist on the internet, it appears to just make one up.

    • by kellin ( 28417 )

      Yep. I've read that generative AI doesn't say "I don't know the answer," but will just make something up instead.

      I wanted to see how helpful Gen AI would be for an edge case to sort through a collection of heroes I have in a game I play. Right off the bat, I learned Gemini is the "most accurate." Anthropic was beyond worthless, OpenAI was maybe 50/50. Even so, I learned to verify Gemini's data before accepting its results. It definitely did not do as well as I originally thought, so I make sure it knows the stat

      • by Locke2005 ( 849178 ) on Tuesday April 07, 2026 @05:41PM (#66082138)

        Yep. I've read that generative AI doesn't say "I don't know the answer," but will just make something up instead.

        I've worked with people like that.

      • Yep. I've read that generative AI doesn't say "I don't know the answer," but will just make something up instead.

        Of course it will, because "I don't know" isn't in the training data. If an LLM can't find good word associations, where a lot of the weights are very high, it can only work with the lower-weight associations (unlikely to be right), and at worst will take the lowest-weight association, which is almost guaranteed to be wrong. It would be nice if the models had a built-in rule such that, if the weights fall below a certain threshold, the model returns "I don't know" or "I can't do that", but that's not possible.

        • by EvilSS ( 557649 )
          You can't code rules into models themselves. Best you can do is try to train the behavior you want, but that's never going to be 100% reliable. You can do it by watching the logits from the inference engine and trying to redirect the model back on track or force a hard stop. Some are doing this today. The problem is that low next-word probabilities are not always the source of this problem. You also run into high-probability wrong results, so it's a bit more complicated. The other issue is that not all of the APIs expose the logits.
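The logit-watching idea described above can be sketched in a few lines. Everything here is a toy: the confidence floor and the per-token logprobs are made-up illustrative values, not the output of any real inference engine:

```python
import math

# Toy sketch of watching per-token probabilities during decoding
# and flagging a generation whose tokens fall below a confidence
# floor. The threshold and logprob values are illustrative only.
CONFIDENCE_FLOOR = 0.05  # hypothetical probability threshold

def flag_low_confidence(token_logprobs, floor=CONFIDENCE_FLOOR):
    """True if any decoded token's probability falls below the floor."""
    return any(math.exp(lp) < floor for lp in token_logprobs)

confident_run = [-0.1, -0.3, -0.2]  # token probs ~0.90, 0.74, 0.82
shaky_run = [-0.1, -4.5, -0.2]      # middle token prob ~0.01

print(flag_low_confidence(confident_run))  # False
print(flag_low_confidence(shaky_run))      # True
```

As the comment notes, this only catches one failure mode: a model can emit a high-probability token that is still factually wrong, so a low-probability filter alone can't make it say "I don't know" reliably.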
    • It's an algorithm. It doesn't choose to lie. It just doesn't resolve well due to lack of information and gives you the best it can do.
      • So Trump isn't really a liar, he's just an extremely low information person?
      • It's an algorithm that does not view confidently feeding the user false information as a type of failure.

          • Has Google failed if the first hit didn't apply to your search but the second one did? Now you have to realize that each word in an LLM response is like a Google page hit on its own. What they should be able to do is give you a confidence rating, but the model has little knowledge of how accurate its answer is, any more than Google knows whether its hits are relevant.
          • Has Google failed if the first hit didn't apply to your search but the second one did?

            from a KPI point of view, yes obviously.

            What they should be able to do is give you a confidence rating

            Most of these models don't have a reliable way to extract confidence. There are a lot of false positives, unfortunately. So we've all been hiding any sort of explicit confidence feedback from the user instead of giving them a random number generator.

    • by jd ( 1658 )

      Gemini is exceptionally bad, as LLMs go. I really have no idea why it is so dreadful, even compared to other LLMs. It isn't the context window, and it doesn't seem to be the training material either.

    • It almost always gives shit answers. Any time I search for details of things I know about it jumps in to tell me some shit I know is wrong. Every. Fucking. Time.

  • "That's not true if you only ask it the questions we want you to ask!"
    • by kellin ( 28417 )

      Basically this. And that's an idiotic statement to make. Gen AI needs to be good at everything for it to be useful. I realize that's a hard thing to do in the beginning, and it will probably get better over time, but we all need to help it along in some way by feeding it correct data.

  • by Morromist ( 1207276 ) on Tuesday April 07, 2026 @03:24PM (#66081894)

    Google: "Why can't you search for normal things like everybody else? Our AI is great at answering questions like 'where to buy a tv?', 'who is Leonardo DiCaprio dating?' and 'weather'. If those things don't satisfy your every need, I don't know what to say. Just because we're a search engine doesn't mean you're supposed to use it to search for difficult-to-find things. Search for normal things like a normal person, assholes."

  • by rsilvergun ( 571051 ) on Tuesday April 07, 2026 @03:25PM (#66081898)
    I mean, compared to the president of the United States? Those are rookie numbers. Come on, Google, you can do better!
  • Google has implemented Trump Mode in their AI? Gemini has been forced onto my Android Auto against my will.
  • by Zero__Kelvin ( 151819 ) on Tuesday April 07, 2026 @03:35PM (#66081922) Homepage
    If you ask the average human to use a non-AI search engine to find out the answers to 100 non-trivial questions, I can assure you that you will get many more than 10 incorrect answers.
  • by mspohr ( 589790 ) on Tuesday April 07, 2026 @03:37PM (#66081926)

    According to an article here a few days ago, 70% of people just accept whatever AI tells them without thinking.

    • by jd ( 1658 )

      But was that figure provided by AI?

      Even if not, we all know that 793% of all statistics are invented.

  • That's AI models for you in a nutshell.

  • The New York Times ?
    Sooo ... is one of those lies that NATO stands for "North Atlantic Treaty Organization" ?

    Asking for a friend who remembers when the NYT wasn't full of biased shit.
    • I'd rather have a digital lying machine than the sub-human one we have right now. At least people will be more willing to ignore criminal orders because they are not in an AI cult. The AIs really do like to start nuclear wars, but nobody would follow those orders... Then again, given how much AI-produced slop has already come out of the White House, we might just end up with a nuclear war... like we did with the tariffs against penguin island.

  • At this rate, reality is going to put The Onion out of business by 2029.

  • AI Overview
    Removing the serpentine belt on a 2018 Chevy Bolt involves releasing tension from the automatic tensioner, which is best accessed from the passenger-side wheel well. Use a 15mm socket on a long breaker bar to rotate the tensioner clockwise, allowing you to slip the belt off the pulleys.

    (in case anyone didn't get the joke, this is a real AI result Google just gave me, but the catch is that the Chevy Bolt is an EV and does not have a serpentine belt - or an engine, for that matter)

  • by Locke2005 ( 849178 ) on Tuesday April 07, 2026 @03:56PM (#66081964)
    How many Trumps is that?
  • by dfghjk ( 711126 ) on Tuesday April 07, 2026 @03:57PM (#66081966)

    A lie is bad information provided intentionally. AI does not have intent.

    • Says who?

      The AI's intent is defined by the way it is trained, and Gemini is trained to emphasize what the google executives want emphasized.

      • Says who?

        The AI's intent is defined by the way it is trained, and Gemini is trained to emphasize what the google executives want emphasized.

        Mmmm.... if anything it's "what the Google engineers want emphasized". Executives at Google have surprisingly little control over technical decisions. For nearly all of Google's existence it's been an almost completely bottom-up driven company and while in the last few years management has been trying to exert more control it's a very, very slow process.

        It's actually the engineering-driven culture that produces Google's infamous tendency to abandon products. Stuff gets built because some engineers think

        • Executives at Google have surprisingly little control over technical decisions.

          The executives at Google define the policy; the technical crew implement it. The policy is "descriptive neutrality", which is roughly equal to the "fair and balanced" approach of Fox News, with a slight push toward normalizing the "official position".

          So, while technical decisions (how to implement a policy) are not a concern of the executives, setting the policy (what to implement) most definitely is.

          The point being that the "descriptive neutrality" with a preference for the "official side" is a thing, which yo

          • Executives at Google have surprisingly little control over technical decisions.

            The executives at google define the policy, the technical crew implement it.

            Not as much as you might think. Definitely not as much as at most companies.

      • by jd ( 1658 )

        We learned back in the 80s that trying to get a neural net to emphasise what you want is actually very difficult. What it will tend to emphasise are the assumptions that underlie the test data, and that's usually a completely different sort of fiction.

  • When I test the different AI systems, Google's AI system loses track of complex problems incredibly quickly. It's great on simple stuff, but for complex stuff, it's useless.

    Unfortunately... advice, overviews, etc., are very, very complex problems indeed, which means that you're hitting the weak spot of their system.

  • by PPH ( 736903 ) on Tuesday April 07, 2026 @04:04PM (#66081978)

    That's the ticket!

  • Tinkering with a word guessing machine to see if you can make it smart is like breeding horses to be faster and faster expecting one to give birth to a locomotive. Word guessing machines are cool but they're never going to actually "understand" what they're talking about.
  • by Arrogant-Bastard ( 141720 ) on Tuesday April 07, 2026 @04:14PM (#66081996)
    (Reference: On the Dangers of Stochastic Parrots | Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency [acm.org])

    The people/companies behind these models will keep trying to "fix" them by throwing ever-increasing amounts of computing power at them (with all the lovely real-world effects on everyone and everything) and by using ever-more-complex models. And yes, they'll perform better. But they're still just large exercises in statistics and linear algebra, they're still just stochastic parrots, and thus there's an upper bound that they may approach asymptotically -- but can't surpass.

    That's not because they're broken -- which is why I put "fix" in quotes in the previous paragraph. It's because that's how they work: it's an intrinsic property of all such models and no amount of computing power and/or model tweaking can change that: all it can do is obfuscate it. And obfuscated problems are far worse than obvious problems.
    • That's not because they're broken -- which is why I put "fix" in quotes in the previous paragraph. It's because that's how they work: it's an intrinsic property of all such models and no amount of computing power and/or model tweaking can change that: all it can do is obfuscate it. And obfuscated problems are far worse than obvious problems.

      That's a strong statement. Can you explain why that isn't also true of human brains? What's the intrinsic difference?

      • A human is able to tell if an LLM is wrong. The opposite isn't true.

        • A human is able to tell if an LLM is wrong. The opposite isn't true.

          Nonsense. LLMs point out my mistakes all the time. And I point out theirs. At this point there's more of the latter than the former, but both absolutely happen all the time.

        • A human is able to tell if an LLM is wrong. The opposite isn't true.

          Also, even if this fallacious claim were true, it wouldn't actually support Arrogant-Bastard's claim, which wasn't about the state of AI now, but a claim about "intrinsic properties", meaning it would be true forever.

  • by Fly Swatter ( 30498 ) on Tuesday April 07, 2026 @04:16PM (#66082002) Homepage
    We need numbers we can understand; saying 10 percent is too simplistic.
  • It's ironic that the human(s) reporting this couldn't do so without (apparently) lying, in the title no less. The article talks about accuracy, and an inaccuracy is not a lie unless it is intentional. Of course, whoever wrote the title is likely seeking to impose their own anti-AI bias onto the story, and so chose to lie about what the study actually says.
    • Actually, it's a perfectly cromulent use of the word "lie" to mean a falsehood, with or without the intent of deception. At least according to the dictionary. [merriam-webster.com]

    • by jd ( 1658 )

      If something inaccurate is presented as the truth, then it is a lie of omission, because it is dishonest about the fact that the information isn't actually known.

      • A lie of omission is when pertinent information is withheld. I'm not even going to try to parse the rest of your nonsensical sentence.
        • by jd ( 1658 )

          Since pertinent information was withheld (that it didn't know), then by your own post you acknowledge it was a lie of omission.

          The stupidity of people these days is truly beyond belief. And, yes, get the f off my lawn.

      • I see your point, but I don't think it quite works unless one knows it is inaccurate. Otherwise, it's just being wrong. That is, there is a subset of cases for which your statement would be correct, "I don't know the answer, so I'll guess and not tell you I'm just guessing", but you've worded it so broadly that it would also include sincere errors.

        One example of a non-lie your statement would have included would have been how, for over a thousand years, people believed flies had 4 legs because Aristotl

  • I have noticed that asking it questions about the video game "No Man's Sky" elicits perfect or at least nearly perfect answers every time. Asking it any technical questions about Linux though... usable accuracy drops to something like 50%.

  • To be fair I just wasted a week tracking down a radio telemetry problem because of a forum post that many people said worked great but it definitely pulled a pin high that was supposed to be low, which shut off an antenna.

    Only diving into the spec sheet and some sample embedded code convinced me that the forum post was exactly wrong and after making a simple change to do the opposite did all the telemetry devices mesh up and start reporting correctly.

    So ... how does 90% compare to human content?

    A wrinkle is

  • The search giant likes to use a test called SimpleQA Verified, which uses a smaller set of questions that have been more thoroughly vetted.

    Gee - that sounds rather like asking only questions which are known to be correctly answerable by the AI.

    But Google would never cheat to hide flaws and make their shit look better - right?

  • We will probably get to GLPH pretty soon. Or even PLPH or YLPH :-)

    (TLPH is Tera, not Trump :-) although I can understand the confusion in this context)

  • A lie is an intentional deception. Being wrong is... being wrong.
  • NYT tells the truth about 10% of the time. I wonder if it's the same 10% that google gets "wrong."
