Testing Suggests Google's AI Overviews Tells Millions of Lies Per Hour (arstechnica.com)
A New York Times analysis found Google's AI Overviews now answer questions correctly about 90% of the time, which might sound impressive until you realize that roughly 1 in 10 answers is wrong. "[F]or Google, that means hundreds of thousands of lies going out every minute of the day," reports Ars Technica. From the report: The Times conducted this analysis with the help of a startup called Oumi, which itself is deeply involved in developing AI models. The company used AI tools to probe AI Overviews with the SimpleQA evaluation, a common test to rank the factuality of generative models like Gemini. Released by OpenAI in 2024, SimpleQA is essentially a list of more than 4,000 questions with verifiable answers that can be fed into an AI.
Oumi began running its test last year when Gemini 2.5 was still the company's best model. At the time, the benchmark showed an 85 percent accuracy rate. When the test was rerun following the Gemini 3 update, AI Overviews answered 91 percent of the questions correctly. If you extrapolate this miss rate out to all Google searches, AI Overviews is generating tens of millions of incorrect answers per day.
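For anyone who wants to sanity-check the scale of these claims, here's a back-of-envelope sketch. The total search volume and the share of searches that show an AI Overview are rough assumptions for illustration, not figures from the Times study; only the ~9 percent miss rate comes from the article:

```python
# Back-of-envelope extrapolation of the article's "wrong answers" claim.
# All inputs except miss_rate are assumptions for illustration:
searches_per_day = 8.5e9    # assumed total Google searches per day
overview_fraction = 0.15    # assumed share of searches showing an AI Overview
miss_rate = 0.09            # ~91% SimpleQA accuracy per the article

wrong_per_day = searches_per_day * overview_fraction * miss_rate
wrong_per_hour = wrong_per_day / 24
wrong_per_minute = wrong_per_hour / 60

print(f"{wrong_per_day:,.0f} wrong answers/day")     # ~115 million
print(f"{wrong_per_hour:,.0f} per hour")             # ~4.8 million
print(f"{wrong_per_minute:,.0f} per minute")         # ~80 thousand
```

Under those assumptions you land on roughly 115 million wrong answers a day, i.e. millions per hour and high tens of thousands per minute, which is the same order of magnitude as the headline and the quoted figures.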
The report includes several examples of where AI Overviews went wrong. When asked for the date on which Bob Marley's former home became a museum, AI Overviews cited three pages, two of which didn't discuss the date at all. The third, Wikipedia, listed two contradictory years, and AI Overviews confidently chose the wrong one. The benchmark also prompts models to produce the date on which Yo-Yo Ma was inducted into the Classical Music Hall of Fame. While AI Overviews cited the organization's website, which listed Ma's induction, it claimed there's no such thing as the Classical Music Hall of Fame. "This study has serious holes," said Google spokesperson Ned Adriance. "It doesn't reflect what people are actually searching on Google." The search giant likes to use a test called SimpleQA Verified, which uses a smaller set of questions that have been more thoroughly vetted.
Great even the pol (Score:5, Insightful)
Well shoot, even the politicians jobs are not safe then!
Re: (Score:3)
Actually, I think someone did try to run for elected office on the premise that they'd let an AI make all their decisions for them. I don't think it worked out for them.
Re: Great even the pol (Score:2)
Too many old people (my age) still vote for that sort of campaign to work.
I don't believe it (Score:5, Funny)
Alice laughed. "There's no use trying," she said. "One can't believe impossible things."
"I daresay you haven't had much practice," said Google. "Why, sometimes I've believed as many as six impossible things before breakfast."
Re: (Score:2)
Balderdash (Score:3)
The cromulence of AI responses is infallible and unimpeachable. This article is complete balderdash.
Re: (Score:2)
Re: (Score:3)
AI said I am correct and that you're teh gey[sic].
So... there.
Re: (Score:2)
Re: (Score:2)
Maybe it gets it confused with Yo Yo Ma jokes.
I hear he's like the cellist dude. Never gets angry when people call his mother fat.
AI lies (Score:2, Informative)
Google's AI is so bad... (Score:1)
I would rather use Grok.
Re: (Score:2)
The LLM they're using for "AI Overview" is terrible. Obviously, they're doing that because it's a small model that runs fast, so it can handle the load of millions of queries a minute. I find that if you then click "Dive Deeper", the model improves to something usable, often completely contradicting the "Overview" slop.
It's not a good look. But I suppose they have to put "AI" out front, even when it's crap.
Re: (Score:2)
It's not a good look.
Yeah, it makes an extremely bad first impression. Anecdotally, everybody I know sees it as the slop on top of the search results that you just skip over.
Re: Google's AI is so bad... (Score:2)
It's a case where doing nothing would have been better than this.
A strange game. The only winning move is not to play. ... How about a nice game of chess?
I use gemini (Score:5, Interesting)
It often gives excellent answers, but when it doesn't, the results are strange.
I asked for help writing code for an obscure hobby CNC control system.
It totally invented function calls and invented plausible documentation to explain how they worked and how to call them.
It totally missed the easy answer that involved calling an existing simple function and writing no new code.
If the answer doesn't exist on the internet, it appears to just make one up.
Re: (Score:2)
Yep. I've read that generative AI doesn't say "I don't know the answer," but will just make something up instead.
I wanted to see how helpful Gen AI would be for an edge case: sorting through a collection of heroes I have in a game I play. Right off the bat, I learned Gemini is the "most accurate." Anthropic was beyond worthless, OpenAI was maybe 50/50. Even so, I learned to verify Gemini's data before accepting its results. It definitely did not do as well as I originally thought, so I make sure it knows the stat
Re:I use gemini (Score:4)
Yep. I've read that generative AI doesn't say "I don't know the answer," but will just make something up instead.
I've worked with people like that.
Re: (Score:2)
Yep. I've read that generative AI doesn't say "I don't know the answer," but will just make something up instead.
Of course it will, because "I don't know" isn't in the training data. If an LLM can't find good word associations, where a lot of the weights are very high, it can only work with the lower-weight associations (unlikely to be right), and at worst will take the lowest-weight association, which is probably guaranteed to be wrong. It would be nice if the models had a built-in rule such that, if the weights fall below a certain threshold, the model would return "I don't know" or "I can't do that", but that's n
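The abstention rule the parent comment imagines can be sketched in a few lines. This is a toy illustration of the idea (softmax some candidate-answer scores, abstain when the top probability is below a cutoff), not how any production model actually works; real LLMs score whole token sequences, and their raw probabilities are often poorly calibrated:

```python
import math

def answer_or_abstain(logits: dict[str, float], threshold: float = 0.5) -> str:
    """Toy abstention rule: softmax the candidate-answer scores and
    return "I don't know" when the best answer's probability falls
    below the threshold."""
    m = max(logits.values())
    # Subtract the max before exponentiating for numerical stability.
    exps = {a: math.exp(v - m) for a, v in logits.items()}
    total = sum(exps.values())
    best, e = max(exps.items(), key=lambda kv: kv[1])
    return best if e / total >= threshold else "I don't know"

# A peaked distribution yields an answer; a flat one abstains.
print(answer_or_abstain({"1966": 3.0, "1987": 0.1, "2001": 0.0}))  # 1966
print(answer_or_abstain({"1966": 0.2, "1987": 0.1, "2001": 0.0}))  # I don't know
```

The catch, of course, is picking a threshold that abstains on the hallucinations without also abstaining on perfectly good answers.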
Re: (Score:2)
Re: I use gemini (Score:2)
Re: (Score:2)
Re: (Score:2)
He's both.
He is an extremely low-information person who has turned disengagement into an art form. And he still chooses to provably lie every chance he gets.
Re: (Score:2)
Re: (Score:2)
it's been a very bad algorithm (Score:2)
It's an algorithm that does not view confidently feeding the user false information as a type of failure.
Re: (Score:2)
Re: (Score:2)
Has Google failed if the first hit didn't apply to your search but the second one did?
from a KPI point of view, yes obviously.
What they should be able to do is give you a confidence rating
Most of these models don't have a reliable way to extract confidence. There are a lot of false positives, unfortunately. So we've all been hiding any sort of explicit confidence feedback from the user instead of giving them a random number generator.
Re: (Score:2)
Gemini is exceptionally bad, as LLMs go. I really have no idea why it is so dreadful, even compared to other LLMs. It isn't the context window, and it doesn't seem to be the training material either.
Re: I use gemini (Score:2)
It almost always gives shit answers. Any time I search for details of things I know about it jumps in to tell me some shit I know is wrong. Every. Fucking. Time.
Re: (Score:2)
You used to be able to turn off the AI crap by swearing in your search query. It looks like they fixed that one.
Re: (Score:2)
I hid it with the "Hide Google AI Overviews" extension, but I still see their crap at work.
Google's Response (Score:1)
Re: (Score:2)
Basically this. And that's an idiotic statement to make. Gen AI needs to be good at everything for it to be useful. I realize that's a hard thing to do in the beginning, and it will probably get better over time, but we all need to help it along in some way by feeding it correct data.
Google: "you are all freaks" (Score:4, Informative)
Google: "Why can't you search for normal things like everybody else? Our AI is great at answering questions like 'where to buy a TV?' and 'who is Leonardo DiCaprio dating?' and 'weather'. If those things don't satisfy your every need I don't know what to say. Just because we're a search engine doesn't mean you're supposed to use it to search for difficult-to-find things. Search for normal things like a normal person, assholes."
Re:Google: "you are all freaks" (Score:4, Funny)
I don't know about that (Score:5, Funny)
Re: (Score:3)
At least Gemini is *trying* to tell the truth!
Isn't it???
So what you're saying is... (Score:2)
Re:So what you're saying is... (Score:5, Informative)
Google has implemented Trump Mode in their AI?
No, they said Google tells the truth 90% of the time, not 10%.
Re: (Score:2)
Google has implemented Trump Mode in their AI?
Well, it hasn't bombed Iran and developed a craving for McDonald's hamberders yet, so Google's still got some work to do.
Better than humans none the less (Score:3)
Re:Better than humans none the less (Score:4, Interesting)
It would be interesting to compare the AI summary accuracy to
1) Hitting "I feel lucky"
2) A selection of average humans given no-AI Google search
3) A selection of average humans given AI+Google search
4) A selection of average humans
And people believe AI... (Score:5, Informative)
According to an article here a few days ago, 70% of people just accept whatever AI tells them without thinking.
Re: (Score:3)
But was that figure provided by AI?
Even if not, we all know that 793% of all statistics are invented.
Lies, bigger lies and statistics. (Score:1)
That's AI models for you in a nutshell.
The New York Times, you say ? (Score:1, Redundant)
Sooo
Asking for a friend who remembers when the NYT wasn't full of biased shit.
Re: So when can it replace Trump? (Score:1)
I'd rather have a digital lying machine than the sub-human one we have right now. At least people will be more willing to ignore criminal orders because they are not in an AI cult. The AIs really do like to start nuclear wars, but nobody would follow those orders... But given how much AI-produced slop has already come out of the White House, we might just end up with a nuclear war... like we did with tariffs against penguin island.
What a headline! (Score:1)
At this rate, reality is going to put The Onion out of business by 2029.
It gives great car repair advice, too (Score:1)
AI Overview
Removing the serpentine belt on a 2018 Chevy Bolt involves releasing tension from the automatic tensioner, which is best accessed from the passenger-side wheel well. Use a 15mm socket on a long breaker bar to rotate the tensioner clockwise, allowing you to slip the belt off the pulleys.
(in case anyone didn't get the joke, this is a real AI result Google just gave me, but the catch is that the Chevy Bolt is an EV and does not have a serpentine belt - or an engine, for that matter)
Unit conversion? (Score:3)
Re: Unit conversion? (Score:4, Interesting)
Here, from the horse's mouth:
Summary
Assuming Google AI search gives 1 lie in 10 answers (10%), it is roughly one-eighth of a Trump (0.125).
In other words, you would need 8 AI lies to equal the concentration of misinformation found in a single normalized Trump output.
Would you like to apply this "Trump" unit to other historical figures or tech benchmarks to see how they stack up?
Re: Unit conversion? (Score:2)
Here's the summary for some names you may know:
Elon Musk ~1.25 Trumps
Vladimir Putin ~1.20 Trumps
Muammar Gaddafi ~1.10 Trumps
Peter Thiel ~0.4 to 0.5 Trumps
Ursula von der Leyen ~0.15 Trumps
Google AI (Hypothetical) ~0.125 Trumps
Re: (Score:2)
AI doesn't lie. (Score:3)
A lie is bad information provided intentionally. AI does not have intent.
Re: AI doesn't lie. (Score:3)
Says who?
The AI's intent is defined by the way it is trained, and Gemini is trained to emphasize what the google executives want emphasized.
Re: (Score:3)
Says who?
The AI's intent is defined by the way it is trained, and Gemini is trained to emphasize what the google executives want emphasized.
Mmmm.... if anything it's "what the Google engineers want emphasized". Executives at Google have surprisingly little control over technical decisions. For nearly all of Google's existence it's been an almost completely bottom-up driven company and while in the last few years management has been trying to exert more control it's a very, very slow process.
It's actually the engineering-driven culture that produces Google's infamous tendency to abandon products. Stuff gets built because some engineers think
Re: (Score:2)
Executives at Google have surprisingly little control over technical decisions.
The executives at google define the policy, the technical crew implement it. The policy is "descriptive neutrality", which is roughly equal to the "fair and balanced" approach of Fox News, with a slight push for normalizing the "official position".
So, while technical decision (how to implement a policy) are not a concern of the executives, setting the policy (what to implement) most definitely is.
The point being that the "descriptive neutrality" with a preference for the "official side" is a thing, which yo
Re: (Score:2)
Executives at Google have surprisingly little control over technical decisions.
The executives at google define the policy, the technical crew implement it.
Not as much as you might think. Definitely not as much as at most companies.
Re: (Score:2)
We learned back in the 80s that trying to get a neural net to emphasise what you want is actually very difficult. What it will tend to emphasise are the assumptions that underlie the test data, and that's usually a completely different sort of fiction.
Google's AI does not impress. (Score:2)
When I test the different AI systems, Google's AI system loses track of complex problems incredibly quickly. It's great on simple stuff, but for complex stuff, it's useless.
Unfortunately... advice, overviews, etc., are very, very complex problems indeed, which means that you're hitting the weak spot of their system.
Yeah! (Score:3)
That's the ticket!
Doctorow is right (Score:2)
This is what stochastic parrots do (Score:4, Informative)
The people/companies behind these models will keep trying to "fix" them by throwing ever-increasing amounts of computing power at them (with all the lovely real-world effects on everyone and everything) and by using ever-more-complex models. And yes, they'll perform better. But they're still just large exercises in statistics and linear algebra, they're still just stochastic parrots, and thus there's an upper bound that they may approach asymptotically -- but can't surpass.
That's not because they're broken -- which is why I put "fix" in quotes in the previous paragraph. It's because that's how they work: it's an intrinsic property of all such models and no amount of computing power and/or model tweaking can change that: all it can do is obfuscate it. And obfuscated problems are far worse than obvious problems.
Re: (Score:2)
That's not because they're broken -- which is why I put "fix" in quotes in the previous paragraph. It's because that's how they work: it's an intrinsic property of all such models and no amount of computing power and/or model tweaking can change that: all it can do is obfuscate it. And obfuscated problems are far worse than obvious problems.
That's a strong statement. Can you explain why that isn't also true of human brains? What's the intrinsic difference?
Re: This is what stochastic parrots do (Score:2)
A human is able to tell if an LLM is wrong. The opposite isn't true.
Re: (Score:2)
A human is able to tell if an LLM is wrong. The opposite isn't true.
Nonsense. LLMs point out my mistakes all the time. And I point out theirs. At this point there's more of the latter than the former, but both absolutely happen all the time.
Re: (Score:2)
A human is able to tell if an LLM is wrong. The opposite isn't true.
Also, even if this fallacious claim were true, it wouldn't actually support Arrogant-Bastard's claim, which wasn't about the state of AI now, but a claim about "intrinsic properties", meaning it would be true forever.
Give it to us in Lies per GigaWatt! (Score:3)
Ironically, this Slashdot summary title is a lie (Score:3)
Re: Ironically, this Slashdot summary title is a l (Score:2)
Re: (Score:2)
Actually, it's a perfectly cromulent use of the word "lie" to mean a falsehood with or without the intent of deception. At least according to the dictionary. [merriam-webster.com]
Re: Ironically, this Slashdot summary title is a l (Score:2)
Re: (Score:2)
i.e. - "an untrue or inaccurate statement that may or may not be believed true by the speaker or writer"
"the lies we tell ourselves to feel better" - explicit intent.
Re: (Score:2)
If something is inaccurately presented as being the truth, then it is a lie of omission because it is dishonest about the fact that the information isn't actually known.
Re: (Score:2)
Re: (Score:2)
Since pertinent information was withheld (that it didn't know), then by your own post you acknowledge it was a lie of omission.
The stupidity of people these days is truly beyond belief. And, yes, get the f off my lawn.
Re: (Score:2)
Re: (Score:2)
Which it was.
Re: (Score:2)
One example of a non-lie your statement would have included would have been how, for over a thousand years, people believed flies had 4 legs because Aristotl
Depends heavily on the subject matter. (Score:1)
I have noticed that asking it questions about the video game "No Man's Sky" elicits perfect or at least nearly perfect answers every time. Asking it any technical questions about Linux though... usable accuracy drops to something like 50%.
Compared to? (Score:2)
To be fair, I just wasted a week tracking down a radio telemetry problem because of a forum post that many people said worked great, but it definitely pulled a pin high that was supposed to be low, which shut off an antenna.
Only diving into the spec sheet and some sample embedded code convinced me that the forum post was exactly wrong, and only after making a simple change to do the opposite did all the telemetry devices mesh up and start reporting correctly.
So ... how does 90% compare to human content?
A wrinkle is
Really? (Score:2)
The search giant likes to use a test called SimpleQA Verified, which uses a smaller set of questions that have been more thoroughly vetted.
Gee - that sounds rather like asking only questions which are known to be correctly answerable by the AI.
But Google would never cheat to hide flaws and make their shit look better - right?
MLPH? Mega Lies Per Hour? (Score:2)
We will probably get to GLPH pretty soon. Or even PLPH or YLPH :-)
(TLPH is Tera, not Trump :-) although I can understand the confusion in this context)
Wrong != lying (Score:2)
pot and kettle (Score:1)
NYT tells the truth about 10% of the time. I wonder if it's the same 10% that google gets "wrong."