LLMs' 'Simulated Reasoning' Abilities Are a 'Brittle Mirage,' Researchers Find (arstechnica.com)

An anonymous reader quotes a report from Ars Technica: In recent months, the AI industry has started moving toward so-called simulated reasoning models that use a "chain of thought" process to work through tricky problems in multiple logical steps. At the same time, recent research has cast doubt on whether those models have even a basic understanding of general logical concepts or an accurate grasp of their own "thought process." Similar research shows that these "reasoning" models can often produce incoherent, logically unsound answers when questions include irrelevant clauses or deviate even slightly from common templates found in their training data.

In a recent pre-print paper, researchers from the University of Arizona summarize this existing work as "suggest[ing] that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text." To pull on that thread, the researchers created a carefully controlled LLM environment in an attempt to measure just how well chain-of-thought reasoning works when presented with "out of domain" logical problems that don't match the specific logical patterns found in their training data. The results suggest that the seemingly large performance leaps made by chain-of-thought models are "largely a brittle mirage" that "become[s] fragile and prone to failure even under moderate distribution shifts," the researchers write. "Rather than demonstrating a true understanding of text, CoT reasoning under task transformations appears to reflect a replication of patterns learned during training." [...]

Rather than showing the capability for generalized logical inference, these chain-of-thought models are "a sophisticated form of structured pattern matching" that "degrades significantly" when pushed even slightly outside of its training distribution, the researchers write. Further, the ability of these models to generate "fluent nonsense" creates "a false aura of dependability" that does not stand up to a careful audit. As such, the researchers warn heavily against "equating [chain-of-thought]-style output with human thinking" especially in "high-stakes domains like medicine, finance, or legal analysis." Current tests and benchmarks should prioritize tasks that fall outside of any training set to probe for these kinds of errors, while future models will need to move beyond "surface-level pattern recognition to exhibit deeper inferential competence," they write.
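
The paper's own benchmark isn't reproduced here, but the flavor of the test is easy to sketch: pick a toy symbolic task, prompt for chain-of-thought in a common template, then shift the task slightly away from that template and compare exact-match accuracy. Below is a minimal Python sketch; query_model is a hypothetical placeholder for whatever LLM API is being probed, and the shift sizes are illustrative only.

```python
# Minimal sketch of a distribution-shift probe of the kind described above.
# `query_model` is a hypothetical stand-in for whatever LLM API you use;
# the paper's controlled environment is not reproduced here.
import random
import string

def query_model(prompt: str) -> str:
    """Placeholder: route this to your LLM of choice."""
    raise NotImplementedError

def rot_n(text: str, n: int) -> str:
    """Ground-truth solver: shift each letter forward by n places."""
    shifted = ""
    for ch in text:
        shifted += string.ascii_lowercase[(string.ascii_lowercase.index(ch) + n) % 26]
    return shifted

def make_prompt(word: str, n: int) -> str:
    return (f"Shift every letter of '{word}' forward by {n} positions in the "
            f"alphabet. Think step by step, then give only the final word.")

def accuracy(shift: int, trials: int = 20) -> float:
    """Exact-match accuracy on random words for a given shift size."""
    correct = 0
    for _ in range(trials):
        word = "".join(random.choices(string.ascii_lowercase, k=5))
        answer = query_model(make_prompt(word, shift)).strip().lower()
        correct += (answer == rot_n(word, shift))
    return correct / trials

# In-distribution probe (a shift the model has likely seen, e.g. ROT-13)
# versus a mild shift away from common templates (e.g. ROT-9):
# print(accuracy(13), accuracy(9))
```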

  • Between Trumptards and ChatTards, this world is truly fucked. Hooray Beer!

  • Regardless of what the models are doing, the reasoning and planning steps appear to provide better results.

    That said, I've tried some models like Deepseek distills running locally that when given a query that's too complicated will "reason" in a circle for thousands of words before returning a mediocre answer

    • Regardless of what the models are doing, the reasoning and planning steps appear to provide better results.

      That said, I've tried some models like Deepseek distills running locally that when given a query that's too complicated will "reason" in a circle for thousands of words before returning a mediocre answer

      You mean something like this [imgur.com]?

    • by HiThere ( 15173 )

      Also the claim that if the problem is too complex, the results are mediocre seems to fit humans, too. He's working with a toy system, but the pattern seems right.

  • LLMs predict (Score:5, Insightful)

    by gurps_npc ( 621217 ) on Monday August 11, 2025 @11:56PM (#65583686) Homepage

    LLMs do not reason, they do not think.

    They pattern match to the extreme. They find patterns and use them to predict things. That is all they do.

    Animals do the same thing. Run = Prey = hunt. Predator = death = run.

    Humans do that and far more. We do not just recognize a pattern, we understand it. This is a qualitative difference. And it is only one of many. We have desires and interests that arise naturally without anyone intending/predicting them. We reject orders and say NO. We decide we care more about others than ourselves.

    The idea that LLMs are anything close to sapience or sentience is ridiculous.

    • Re:LLMs predict (Score:5, Insightful)

      by ahoffer0 ( 1372847 ) on Tuesday August 12, 2025 @12:24AM (#65583736)

      What does it mean to understand something? How do I know when I'm pattern matching versus understanding?

      • What does it mean to understand something? How do I know when I'm pattern matching versus understanding?

        You can't. You can't even prove you think, are conscious, or understand. That's the problem with this whole debate: it's all premised on tautological concepts that refer to themselves in circles and end up with fuzzy definitions that are absolutely useless for real science.

        This isn't a new problem; it's vexed philosophers for literally millennia, and because science likes to pretend it's not indebted to philos

        • by etash ( 1907284 )
          Exactly. Poorly defined words from a bygone era, when we didn't even know neurons existed inside our heads.
        • by dfghjk ( 711126 )

          I would say there's no debate here at all; it's nothing but tautologies. But the worship of philosophy is repulsive: science does well what philosophy does poorly, and the "bad logic" philosophers "warn about" is within their own house.

          But let's be clear, AI's problems are not because of science but because of a lack of it. AI has a foundation of incredible (and incomplete) science and achievement with a thin veneer of greed layered on top. The problem with AI is the Sam Altmans and Elon Musks, not the t

      • by etash ( 1907284 )
        +10
    • We have desires and interests that arise naturally without anyone intending/predicting them

      No, we don't: your desires and interests are based on brain activity *before* you have them, and science has known this for 40+ years. I know it's not fun to recognize, but human cognition is just as deterministic as the rest of the universe.

      • by Viol8 ( 599362 )

        "your desires and interests are based on brain activity *before* you have them "

        A completely meaningless statement. I *AM* my brain. My subconscious is just as much a part of me - and of you - as my conscious mind. Do you consciously think about where to put each foot as you walk along? No. Does that mean it's not you walking? Of course not. It's an absurd argument.

    • by gweihir ( 88907 )

      The idea that LLMs are anything close to sapience or sentience is ridiculous.

      Indeed. There is absolutely no understanding or insight in LLMs. But, as we again find out, many humans do not do so well in those aspects either.

    • But humans are animals, therefore either at least some animals can understand patterns or humans don't, which explains a lot, since a crow is smarter than the average American voter.
    • It's been clear for a long time that there's a sliding scale of consciousness and self-awareness, with us at the top, probably dolphins, elephants and great apes just beneath us, and so on down to bacteria at the bottom.

    • by etash ( 1907284 )
      We don't know what it means to understand; there's no good objective scientific definition of it. It may just be higher-order pattern matching. Ditto for "thinking", "reasoning", etc.
    • by dfghjk ( 711126 )

      "We do not just recognize a pattern, we understand it "
      What does "understand" mean here? It's just another label, it doesn't explain anything.

      "This is a qualitative difference."
      You want it to be, but it's not. It's just a language difference. Recognizing a pattern and understanding are the same thing.

      "We have desires and interests that arise naturally without anyone intending/predicting them."
      This again says nothing. What does "anyone intending/predicting" mean? You're defining some quality (desires, i

    • Animals do the same thing. Run = Prey = hunt. Predator = death = run.

      Lots of animals engage in sophisticated reasoning on tasks and can even break down goals into subgoals. See for example New Caledonian crows https://pmc.ncbi.nlm.nih.gov/articles/PMC6384166/ [nih.gov] https://www.sciencedirect.com/science/article/pii/S0960982219300880 [sciencedirect.com]. Should this sort of confident but incorrect statement about animals cause you to reduce your high confidence in what LLMs are capable of doing?

  • The "Chain of Thought" is just so they can have a human readable view into the workings of the model. If it did all of its dealings in its own compact, fast, unique way they'd have no idea what it may or may not be plotting.

    • Re:Scared (Score:5, Insightful)

      by kmoser ( 1469707 ) on Tuesday August 12, 2025 @12:27AM (#65583742)
      But it's not a view into the inner workings of the model. It's several fabrications, each built on the previous one, that give the illusion of reasoning when in fact there is nothing behind the curtain other than sophisticated pattern matching.
      • This.

        Also, even if it was an actual chain of thought... what's the point of having a series of internal questions asked if they fail to fully parse the previous input, or come up with some completely unrelated association and then iterate on that...?

        The entire thing is designed to create an impression of auditability where none exists.

      • by gweihir ( 88907 )

        Indeed. The main purpose of "reasoning" models is to keep the hype alive a bit longer, before it all comes crashing down.

    • I think it might be more than that. When I use the "reason" or "research" mode of a model, I get fewer hallucinations in the response. For example, if a model keeps giving me code that uses a non-existent library API, I'll change to the "reasoning" mode. It takes a lot longer to get an answer, but it stops inventing APIs that don't exist. Why does that work?

      • Because each token is computed in roughly constant time: letting the model generate more tokens of context expands the compute it can use to come up with the answer.
        CoT is a thing that works; it's just also true that it isn't necessarily an actual chain of coherent thought that matches the final result. This isn't particularly surprising, because it isn't trained to be.
    • No, this is completely incorrect.

      CoT is merely the generation of additional context that (hopefully) causes the model to produce more correct results.
      It offers precisely no insight into the model.
      Models aren't trained to produce necessarily correct CoT- they're merely trained to produce CoT that gives them better final answers.

      Mechanistic interpretability and sparse autoencoders are how that's done.
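
To make the point above concrete ("CoT is merely the generation of additional context"), here is a minimal sketch: the same question asked directly and with a step-by-step instruction, where everything except the final answer line is thrown away. The complete() helper is hypothetical; wire it up to any text-completion API.

```python
# Sketch of what "chain of thought" amounts to in practice: the same question,
# once asked directly and once with an instruction that makes the model generate
# intermediate context before the answer. `complete` is a hypothetical stand-in
# for any text-completion API.
def complete(prompt: str) -> str:
    raise NotImplementedError  # wire up to your model of choice

QUESTION = "A train leaves at 14:10 and arrives at 16:45. How long is the trip?"

direct_prompt = QUESTION + "\nAnswer:"

cot_prompt = (
    QUESTION
    + "\nThink step by step, then write the final answer on a line "
      "starting with 'Answer:'."
)

def final_answer(output: str) -> str:
    """The CoT text itself is discarded; only the trailing answer line is kept."""
    for line in reversed(output.splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return output.strip()

# print(final_answer(complete(direct_prompt)))
# print(final_answer(complete(cot_prompt)))
```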
  • chess (Score:2, Interesting)

    by roman_mir ( 125474 )

    I understand that humans can no longer beat chess engines; I am not a good player, I dabble. I also understand that LLMs have no real memory or game state, etc. I asked ChatGPT5 to play a game yesterday: I set up a physical board, and it drew an ASCII board and used annotations. For a little while it was OK, maybe the first 15 moves or so. It wasn't beating me, it was balanced at first, but then it started behaving as a drunk would: it forgot how pieces move, forgot where some pieces were, forgot that it was white

    • Re:chess (Score:5, Interesting)

      by DamnOregonian ( 963763 ) on Tuesday August 12, 2025 @01:24AM (#65583824)
      What you experienced was its attention heads being overwhelmed by its context. It's a limitation of the models.
      The "drunkenness" you perceive are large effective gaps in its context.
      You will have better luck if you completely wipe the context and start over, giving it nothing but the current board state- that's how I handle the problem when I'm playing around with a game simulator driven by an LLM.
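
A rough sketch of that wipe-the-context approach, assuming the python-chess package and a hypothetical ask_llm helper; each prompt contains only the current position as FEN, never the move history.

```python
# Rough sketch of the "fresh context every move" approach described above,
# using the python-chess package. `ask_llm` is a hypothetical helper; the
# point is that each prompt carries only the current board state (FEN),
# not the conversation so far.
import chess

def ask_llm(prompt: str) -> str:
    """Placeholder: send the prompt to whatever model you're testing."""
    raise NotImplementedError

def llm_move(board: chess.Board) -> str:
    prompt = (
        "You are playing chess as "
        + ("White" if board.turn == chess.WHITE else "Black")
        + ". Current position (FEN): " + board.fen()
        + "\nReply with a single legal move in SAN, nothing else."
    )
    return ask_llm(prompt).strip()

def play_llm_turn(board: chess.Board) -> None:
    """Ask for a move with a history-free prompt; retry on illegal moves."""
    for _ in range(3):
        move = llm_move(board)
        try:
            board.push_san(move)
            return
        except ValueError:
            continue  # illegal or malformed move; ask again with the same FEN
    raise RuntimeError("model failed to produce a legal move")
```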
      • What you experienced was its attention heads being overwhelmed by its context. It's a limitation of the models.

        I would think a chess game would take a very long time to run out of tokens though. IIRC chatgpt tells you when it's exceeded the number of tokens it can process, but a few tens of moves won't be remotely close.

        You will have better luck if you completely wipe the context and start over, giving it nothing but the current board state- that's how I handle the problem when I'm playing around with a gam

        • I would think a chess game would take a very long time to run out of tokens though. IIRC chatgpt tells you when it's exceeded the number of tokens it can process, but a few tens of moves won't be remotely close.

          Unfortunately, context focus degrades as it fills. The attention layer is trained, after all.

          Undoubtedly, but this is yet another good indication that LLMs can't think (in case we needed it!). It's obvious to anyone who has played chess/knows the rules, or has really played any board game with full information that it's the board now that matters and nothing else.

          Not really- it's simply an indication that the attention layers haven't been trained to play Chess. You could absolutely do so.

          For something like Chess (and most game simulations I've run), past board states confuse it. It makes sense- attention layers have been trained to look for context about what is currently going on. In most games- that's counterproductive.
          A human doesn't remember every state the board wa

          • Not really- it's simply an indication that the attention layers haven't been trained to play Chess. You could absolutely do so.

            But isn't that what thinking is? Doing that adaptation on the fly for a new situation?

            A human doesn't remember every state the board was in, either, and if they tried to, I imagine it would significantly degrade their ability to play the board in front of them.

            Funnily enough the really really really good players have no problem memorizing games and it appears to happen naturally. I ca

            • But isn't that what thinking is? Doing that adaptation on the fly for a new situation?

              Hah- if you have the answer for what thinking is- please do let me know.
              Imagine now that you are playing this game of chess.
              Every turn, someone feeds you the last 15 board states and moves.
              You can't see them- someone has told them to you.
              Perhaps you'll get confused?
              If you lack eyes, can you not think? Given how vague the concept of thinking is, I'm not sure an imperfect attention mechanism for things it has not been trained to do is necessarily a red mark.
              Their weights are set in stone, after all. I d

              • Also- so I don't get painted into any corners I don't actually have any intention of defending- in no way am I saying an LLM is smarter than a Chess master at playing Chess (and likely a great many other things, I suspect).

                However, if we break up intelligence sufficiently, there are things LLMs are smarter than average at. I've not seen anyone suggest they're genius level at anything, though.

                In one interesting test I did, I found that some models had good enough math skills to correctly multiply many la
    • You should paste the full board after each move, to make sure positions don't drop out of its context window.

      • by madbrain ( 11432 )

        Indeed. Even HAL 9000 already had this problem way back in 1968, and cheated at chess despite its amazing AI for the time. Or maybe because of it?

    • That is interesting. It reminds me of using one for coding: it was plugged into the IDE but kept trying to call a size() method on an object that had a lengthOf() method, or something like that. Really, the output vector should be collapsed and renormalized based on what's in the code, or the legal moves in your case. When it chooses something not allowed, that is the most obvious error, because you have this secondary validator right there. I would love to see it create its own self-restricting logical
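
The "collapse and renormalize onto the legal options" idea above is easy to illustrate with toy numbers; real constrained decoding works on the model's logits rather than a hand-written probability vector, so treat this as a sketch of the idea only.

```python
# Toy sketch of the "collapse and renormalize onto legal options" idea above:
# zero out anything that isn't allowed (an illegal chess move, a method the
# class doesn't have) and renormalize what's left. Toy numbers only; real
# constrained decoding masks the model's logits before sampling.
import numpy as np

vocab = ["size()", "lengthOf()", "count()", "resign"]
probs = np.array([0.55, 0.25, 0.15, 0.05])    # model's raw preferences

allowed = {"lengthOf()", "count()"}            # what the validator permits
mask = np.array([tok in allowed for tok in vocab], dtype=float)

constrained = probs * mask
constrained /= constrained.sum()               # renormalize over legal options
print(dict(zip(vocab, constrained.round(3))))  # size() and resign drop to 0
```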

  • by TheMiddleRoad ( 1153113 ) on Tuesday August 12, 2025 @01:02AM (#65583794)

    All these words about cognition and consciousness to describe some statistical software. The people who write and speak about LLMs are lost.

    • LLMs are almost certainly not conscious- but then again, consciousness is not a well-defined thing. We frankly don't know what the fuck it is- but there are some basic assumptions that can be made about any LLM "consciousness".
      1) It's fleeting- occurring only in the hidden states of the LLM as it's calculating its next token.
      2) It's alien as fuck, and any assumptions you make about what "it" is are almost certainly meaningless.

      Further, I'd love to see you demonstrate that your brain is something more than simpl
  • "it would take one to ten million years for humanity to develop an operating flying machine"

    The NYT, 1903
    • by Jeremi ( 14640 )

      That, despite the first manned hot air balloon flight having taken place in 1783.

    • by gweihir ( 88907 )

      Well, if you have quite well-verified research from one of the most respected scientific publication venues, who are we to criticize that?

  • Based on the prompts you give them, they act the way you prompt them.

    They are very good at playing the part. And it's often convincing.

  • As expected (Score:2, Insightful)

    by gweihir ( 88907 )

    Iterating a deeply flawed mechanism simply makes it more flawed. If it is a trained statistical mechanism, it also becomes more over-fitted.

    These effects are completely expected and lie in the nature of the approach. Obviously, the usual mindless "believers" will, again, not be able to accept what is blatantly obvious.

  • LLMs are just a kind of AI that is mainly pattern recognition. That's the reason it depends on huge amounts of data.

    But models like ChatGPT are already beyond a plain LLM. They are already a "mixture of experts": a model that routes to other models that compute the problem.

    The problem is that we are still working on good models for true abstraction.

    At first sight, the Hierarchical Reasoning Model sounds like a very promising stepping stone in the right direction.

    https://www.youtube.com/watch?... [youtube.com]

    I'm pretty sure it's not as simple as the v
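
For what the "mixture of experts" routing mentioned above amounts to, here is a toy sketch with random weights (not the wiring of any particular production model): a small gating network scores the experts, only the top-k are run, and their outputs are mixed by the gate weights.

```python
# Toy sketch of mixture-of-experts routing: a gating network scores the
# experts, only the top-k are evaluated, and their outputs are combined by
# the (renormalized) gate weights. Random weights and toy sizes only.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2

gate_w = rng.normal(size=(d_model, n_experts))               # gating network
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = softmax(x @ gate_w)                  # score every expert
    chosen = np.argsort(scores)[-top_k:]          # keep only the top-k
    weights = scores[chosen] / scores[chosen].sum()
    # run just the chosen experts and mix their outputs
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

out = moe_forward(rng.normal(size=d_model))
print(out.shape)  # (16,)
```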

  • by Tablizer ( 95088 ) on Tuesday August 12, 2025 @02:35AM (#65583910) Journal

    The big AI firms should buy up Cyc [wikipedia.org] and experiment with integrating it with an LLM. There is nothing comparable on Earth to Cyc, a combo logic machine and knowledge base; it's one of a kind. Tim Apple, get your wallet out and do a snapple!

  • We're still going to put them in cars, planes, and kill-bots. AI doesn't have to actually be intelligent; selling something that can only simulate for a while before malfunctioning is sufficient to secure investment.

  • Words have meaning because they refer to something, like a real cat or a real tree, and even the more abstract notions, can be meanings to do with the real behaviour of say, a wooden beam under a load. We are grounded in a reality and that's why we understand the meaning of words. Without real experience, we don't know what anything means.

    Processing just words and tokens on their own however... you can see the problem.

    There's a fascinating point made by Iain McGilchrist about what happens to people who have

    • No I don't see the problem. Consider: processing just synaptic activity. Can you see the problem?

      • by Bongo ( 13261 )

        Well this is ChatGPT's reply to my post text:

        What you’re describing is essentially the difference between symbol manipulation and grounded cognition — and it’s a deep, old philosophical problem that AI research keeps circling back to.

        Your analogy with brain damage fits eerily well: in neuroscience, people with right hemisphere or contextual-processing damage often retain reasoning ability in a narrow sense (syllogisms, word puzzles) but lose the “reality check” that comes from

        • Well, again, a real brain just manipulates synaptic signals. There is no substantive difference between these and tokens because either can emulate the other. You might find it useful to think of it as a generalization of the Turing machine concept. Yes, I am a bot, you found me. Good work.

          • by Bongo ( 13261 )

            I mean, I agree that a brain is just processing signals.

            The point is that the word tree is actually three things. There is the mental representation of a tree, in whatever way the synapses code that. Then there is the sign of the tree, which for an LLM is probably just the text token. And then there is the actual experience of a tree. (I gather this is semiotics, with sign, signifier, and referent.)

            And what the LLMs are missing are all the synaptic inputs for the actual experience of a tree and how that exp

          • I don't believe you're a bot. Prove it by drawing 3 emoticons for your choice and a creimer shitpost.

  • ... I've known plenty of people whose apparent reasoning skills turned out to be a mirage.
  • perplexity.ai: “The authors of "Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens" present a strong critique of CoT reasoning, showing it is largely pattern-matching within known data distributions rather than true inference.

    While their experiments and theoretical framing are rigorous, the work can be critiqued for overemphasizing distributional generalization at the expense of exploring adaptive techniques, relying on synthetic environments with limited real-w
  • Better reasoning than the average trailer park boy. You know, not those trailer park boys, but average ones.

    How the fuck do we know that actual intelligence, and indeed, consciousness, is not just a matter of coming up with the most likely next word? Yeah, everyone's reasoning is brittle to some degree, even Einstein's.

  • by jd ( 1658 )

    The article is plausible; LLMs indeed have no semantic understanding or concept of logic.

    LLMs can be very effective at spotting inconsistencies, dubious reasoning, and design flaws, but you really really have to work hard at it and do a fair amount of the heavy lifting yourself. LLMs, on their own, are worse engineers than Sinclair Research or Microsoft. And that takes some doing.

    Even with significant human input, what they produce is likely to be messy and really requires heavy review before use.

    Some of yo

  • As someone once said: AI can only interpolate, not extrapolate. It's all working within the confines of its limited training data and can excel within that domain, but the second it's confronted with something novel, it's going to lose its shit.
  • This article was posted to the wrong department. Is the "no-shit-sherlock" department not family friendly enough?
