Study Done By Apple AI Scientists Proves LLMs Have No Ability to Reason (appleinsider.com) 233
Slashdot reader Rick Schumann shared this report from the blog AppleInsider:
A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.
The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models. The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen...
The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded... "We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMs "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."
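The summary's description of GSM-Symbolic (templated questions whose surface details vary while the underlying arithmetic stays fixed) can be sketched roughly like this. The names, ranges, and wording below are illustrative placeholders, not taken from the actual benchmark:

```python
import random

# Hypothetical sketch of a GSM-Symbolic-style template: names and numbers
# are variables, but the underlying one-step arithmetic never changes.
NAMES = ["Sophie", "Liam", "Ava"]

def make_variant(rng):
    name = rng.choice(NAMES)
    blocks = rng.randint(20, 40)
    animals = rng.randint(5, 15)
    # The leftover count is the answer; it is chosen independently.
    total = blocks + animals + rng.randint(10, 30)
    question = (f"{name} has {blocks} blocks and {animals} stuffed animals. "
                f"With a new tube of balls, {name} has {total} toys. "
                f"How many balls are in the tube?")
    answer = total - blocks - animals
    return question, answer

rng = random.Random(0)
q, a = make_variant(rng)
# A model that truly reasons should get every variant right; the paper
# reports accuracy drops when only these surface details are changed.
```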
Duh (Score:5, Insightful)
Re:Duh (Score:5, Insightful)
You beat me to it. LLMs cannot reason and will never be able to reason. The very approach does not allow it. Obviously, it can regurgitate reasoning steps it has seen in its training data with not very high reliability, but that is not reasoning. That is faking it. Since most people only have very limited or no reasoning ability themselves, they are then deeply impressed by the fake.
Re:Duh (Score:5, Insightful)
You beat me to it. LLMs cannot reason and will never be able to reason. The very approach does not allow it. Obviously, it can regurgitate reasoning steps it has seen in its training data with not very high reliability, but that is not reasoning. That is faking it. Since most people only have very limited or no reasoning ability themselves, they are then deeply impressed by the fake.
Fundamentally I agree, but devil's advocate here: you haven't said what this "reasoning" thing is. How do you know it isn't just a feature of a larger model?
What they have done here is to assert the existence of a class of problems which they say require reasoning to solve efficiently. They can create a more or less infinite set of such examples and manipulate them so that systems with "reasoning" (say, graduate students, as in comments below) can solve them easily and quickly while LLMs cannot.
That's a reasonably useful definition of what reasoning is, which is a good thing.
Re: (Score:2)
Re:Duh (Score:4, Informative)
Yeah this is where my philosophy grad brain kicks in to protest a little bit.
There's a whole series of words used in AI, and by the public in general, that are "you know what I mean" fuzzy words. But these words are *terrible* for the practice of science.
Take everyone's favorite: "consciousness". You know intuitively what it is, you're doing it right now, but try to define it? Not so easy without ending up in tautological loops. The best we can really come up with is something like "paying attention to something" (Heidegger had a variation that claimed we are always conscious OF something, and never just conscious in itself). This leads to problems with trying to systemise it into science. We might point to a reading on an EEG or whatever and say "look! that there is what consciousness looks like". Fine, but that doesn't tell us what consciousness *is*; it only tells us what it looks like on an instrument.
The truth is, what we are talking about is phenomenological. We *experience* consciousness. It's something that we perceive; we observe ourselves being conscious. But trying to exfiltrate that perception out of the mind into something that is concretely and repeatably definable in terms of the levers and dials of the brain is a VERY shaky thing; at some point we're still reduced to saying "you know what I mean".
The same goes for terms like "intelligence", "reasoning", "morality", "values", "cognition", etc. We instinctively know what these mean as human beings who experience these things, but nailing down fixed definitions that everyone can agree on, that can be objectively measured, with a theory of physical action coupled to them: that's a much more complicated matter, and it's not one that I think we have the language to properly interrogate right now.
Re: (Score:3)
The truth is, what we are talking about is phenomenological. We *experience* consciousness.
Obviously. There is no known way a physical mechanism can create consciousness. Hence attempts to define it relative to the physical world must fail. Whether we eventually get an extension of known Physics or whether consciousness will remain "magic" remains to be seen. Note that Science does not rule out "magic". It just requires extraordinary evidence for its existence. Consciousness has that going for it.
The funny thing is that consciousness can influence physical reality (we can talk about it). Hence kno
Re:Duh (Score:4)
I agree with your description of "reasoning", but IMHO it implies that the definition of intelligence is just "reasoning not done by humans", rather than having some other innate, qualitatively different property - otherwise, you'd be able to define that property. It's the AI Effect [wikipedia.org], basically. Whatever property you choose - and people have chosen many over the years - people will immediately back off from as soon as a machine gets good at it, and try to find some other property elsewhere. It's the God of the Gaps, except that gap search just gets ever harder over time (not that that will stop people from trying).
IMHO, the rest of what you wrote is beyond weak. Like, just to pick an example: "The funny thing is that consciousness can influence physical reality (we can talk about it). Hence known Physics is known to be grossly incomplete in this regard." Literally everything physics does is about things that can influence physical reality. So how exactly is this any different from literally any other physics?
Re: (Score:2)
The second factor is the serial action bottleneck. Even though o
Re: (Score:2)
Reasoning involves, at the very least, some fact-checking ability and some deductive capabilities. LLMs have neither and cannot have either. They just do not have any base-mechanisms for them. It is not a question of training-data either.
Re: (Score:2)
Re: (Score:2)
That is just nonsense. Brains in isolation _die_.
Re: (Score:3)
You can literally make up and randomize logic problems using entirely fictional terms, with distraction sentences, and LLMs will still solve them. Not perfectly, especially with growing complexity of the task, but they very much can perform deduction. This is well established.
Heck, even individual neurons can perform a kind of deduction. NNs at their most basic level are fuzzy logic engines. Every neuron subdivides its inputs with a fuzzy hyperplane, in effect asking a superposition of questions and yie
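The "fuzzy hyperplane" picture above can be made concrete with a single sigmoid neuron. The weights below are hand-picked for illustration, not learned:

```python
import math

def neuron(x1, x2, w1=10.0, w2=10.0, b=-15.0):
    """A single sigmoid neuron: a soft threshold over a hyperplane.

    With these hand-picked weights it approximates logical AND on
    inputs in [0, 1]: the hyperplane x1 + x2 = 1.5 separates the
    (1, 1) corner from the other three.
    """
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))

# Near-crisp behaviour at the corners of the unit square:
# neuron(1, 1) ≈ 0.993, neuron(1, 0) ≈ 0.007, neuron(0, 0) ≈ 3e-7
```

Steeper weights sharpen the decision; shallower ones make the "AND" fuzzier, which is the superposition-of-questions behaviour the parent describes.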
Re: (Score:2)
- Generating synonyms: humans can come up with fewer synonyms than LLMs; imagination fails, we get writer's block. For example, in a project I worked on, I had to generate synonyms with LLMs and filter them with humans for best results.
- Generally you would have better chances with a LLM at recalling facts, they
Re: (Score:2)
Re: (Score:2)
1) let the human invent an LLM.
2) let the human ask this LLM for the answer to the problem.
3) let the human report the answer to the test adjudicator.
This is a proof that the human can always solve any problem that LLMs excel at. In other words, there are NO problems where LLMs excel but humans fail. (And obviously TFA already exhibits problems that the LLMs fail at w
Re: (Score:2)
Indeed. All the "LLMs can think" proponents are demonstrating is that _they_ cannot do so successfully.
Re: (Score:3)
For the record, this study has been widely criticized.
First off, and a common sin: there's no human controls. They start out with (in some cases) human tests, and it does well on them, but then they modify them to be harder and it does worse on them - yet they just assume that human performance would be the same. The simple fact is that the sort of changes that they do - changing the meanings of words, inserting distraction statements, etc - also increase the rate of humans making errors. It's Karen's Rul
Re:Duh (Score:5, Interesting)
It's even worse than that.
As more people use LLMs, more content will be LLM generated - and LLMs can't tell the difference between generated content (which is inappropriate to train on), and "real" content that may actually include some percentage of actual reasoning steps that could be used to fake some % of reason.
AI will ultimately destroy itself by devouring its own tail.
Re: (Score:2)
how do you know that generated content is "inappropriate to train on"? AI is in its infancy and every aspect is under rapid development. Some say that generated content may be problematic, but so called experts demonstrate a lot of basic ignorance constantly. For all we know, that could soon not be a problem at all, it's certainly not a problem in this context.
Re: (Score:3)
how do you know that generated content is "inappropriate to train on"?
This is a well-understood phenomenon. It's also intuitively obvious. I even described it in an off-hand way here months before the paper that coined the term "model collapse".
but so called experts demonstrate a lot of basic ignorance constantly
So... we should listen to people without any relevant knowledge or experience? If that's what you're after, LessWrong has no shortage of uneducated crackpots. You'll fit right in.
Re: (Score:3)
Re: (Score:2)
Re:Duh (Score:4, Informative)
1. AI is not "in its infancy" at all. It is a research area that is at least 70 years old with intensive efforts along the way. There are no "low hanging fruits" left.
2. Model collapse is a proven thing.
At least get the _basics_ right. Yes, I get that you "want to believe", but that does not make for valid arguments.
Re:Duh (Score:5, Informative)
Actually this is pretty widely studied at this point. It is one of the reasons that AI companies have pushed for marking AI-generated content. At the most basic level, LLMs are probability models that predict the next most probable word given what has come before. If you think of it like a Gaussian peak, they are trying to select near the peak. If you train an LLM on the output of an LLM, you are training it on just the peaks; you get a kind of focusing effect after only a couple of generations and the model collapses.
It turns out you need the low probability information to keep the model stable and it can't generate that information.
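The "training on the peaks" dynamic described above can be simulated in a few lines. This toy Gaussian refit loop is only an illustration of the collapse mechanism, not the actual LLM training procedure:

```python
import random
import statistics

def refit(mean, sd, n, rng):
    """Sample n points from N(mean, sd), then fit a new Gaussian to
    them: the 'train the next model on the last model's output' step."""
    xs = [rng.gauss(mean, sd) for _ in range(n)]
    return statistics.fmean(xs), statistics.pstdev(xs)

rng = random.Random(42)
mean, sd = 0.0, 1.0  # the "real data" distribution
for generation in range(200):
    mean, sd = refit(mean, sd, 20, rng)

# After many generations the fitted spread has shrunk drastically:
# each refit keeps the peak and loses a little of the tails, and the
# lost tail mass can never be regenerated from the samples.
```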
Re: (Score:3)
how do you know that generated content is "inappropriate to train on"? AI is in its infancy and every aspect is under rapid development. Some say that generated content may be problematic, but so called experts demonstrate a lot of basic ignorance constantly. For all we know, that could soon not be a problem at all, it's certainly not a problem in this context.
It's not just an issue of AIs being trained on crappy AI generated content. Humans also generate a lot of content that is garbage for AI training purposes. The point here is that well trained and educated humans can filter out garbage content (regardless of whether it is human or AI generated) but current AIs can't do that very well. This is a problem since you are going to need a whole lot of high quality content to train a high quality AI, unless you are happy with your AI functioning on the garbage in, g
Re: (Score:2)
Obviously. I have been saying that for quite some time. At some point, training an LLM on public data will become infeasible due to model collapse. At the same time, generating enough non-public training data is already infeasible and will remain so. Hence what will be left is some old models that get more and more outdated and will probably hallucinate more and more, given that the questions are current and the distance from the general context the model was trained in gets larger.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
There are no tools for automated reasoning in existence that can reach any real depth. Well, there are, but they cannot perform due to state-space explosion. This is a _fundamental_ limit and it cannot be circumvented.
Hence your statement is bullshit.
Re: (Score:3)
Water is wet, ducks report.
I know the mundane, non-tech-literate people need to be told this, but every single LLM "AI" thing out there is not intelligent. It's a trained parrot. The parrot does not understand English; it does not "speak", it "mimics".
Re: (Score:2)
Parrots can very well understand and speak English, with a vocabulary in the range of 800 to 1,000 words.
There is plenty of research about that.
Re: Duh (Score:3, Funny)
There is nothing intelligent about today's so called AI. It's nothing more than massive brute forced machine learning. Sufficiently advanced tech will appear to be intelligent or even magical. When in fact it's just doing what it was programmed to do. Nothing more nothing less. Non-living beings will never be able to reason or have emotions. Machines, regardless of the amount of data and programming we toss at them or program them to consume, will never become sentient. They may appear that way in some sup
Re: (Score:2)
Re: (Score:2)
Parrots and corvids can reason and solve novel problems.
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
"Reasoning" was never part of the fundamental LLM model. But if you brute force it enough it'll do something kinda cool, which is enough to get money, which is enough to get thousands upon thousands of people brute forcing it, fundamentals be damned.
I think the only reason LLMs are so popular over other ML approaches is that LLMs self-learn patterns, while many other (stronger) ML approaches require thousands or millions of labeled training samples. This also means that LLMs try to find patterns but have nothing to tell them when they are right.
Re: (Score:2)
but "reasoning" is a part of intelligence, and therefore AI would exhibit reasoning if it weren't a fraud. And there is no such thing as "the fundamental LLM". Also, you don't say "LLM model", that is redundant.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Any good scam promises the maximum it can without becoming obvious to most of the marks. It helps that the marks are not very smart. As the "AI" field has run now quite a number of these scams before, they have significant experience on how to do it. But like all "AI hypes" before, this one will die. If people were aware of history, the current AI hype would not have happened.
Re: (Score:2)
Re: (Score:2)
I am entirely unsurprised by this result. I *WISH* I was surprised that some people (especially people in the field) were surprised by this, but I can't say I am.
As you said, the model simply wasn't designed for reasoning. I'll go one further and state that we have no clear idea how to go about designing an AI neural net that does reason.
Re: (Score:2)
Reasoning (Score:5, Interesting)
Future models can be improved for formalized reasoning, but what's beyond obvious at this point is that next-token text prediction is far more powerful than anyone ever imagined. Our current models outperform graduate students and can be massive helps for professionals. It's still up to you to figure out how to integrate them into your workflow. For professional software engineering I've found them to be hugely useful, like a rubber duck that also has instant PhD-level knowledge of specific tasks that I'm often still learning or only passingly familiar with. It's a productivity booster and a much better search engine, most of the time.
Re:Reasoning (Score:5, Funny)
Re: (Score:3)
That understanding actually comes from an understanding of data structures and algorithms and knowing that is in fact a very modest advancement on the same level as search. It also comes from having lived through a number of technological revolutions, each of which featured all manner of con-artists and hucksters promising things that could not be delivered.
That's part of what makes computer science a science. We make predictions then we test those predictions. Were those who said "That's not intelligence
Re: (Score:2)
>data structures and algorithms
Tell me you don't do professional engineering without telling me you don't do professional engineering.
Re: (Score:3)
google search in 2001 gave you real results, no LLM is as good as that, much less "slightly better".
If you're going to slag /. posters, don't prove yourself among the dumbest ones in the same sentence.
Re:Reasoning (Score:4, Interesting)
>google search in 2001 gave you real results
The Web of 2024 is nothing like the web of 2001. Now we need knowledge engines, and Google fails there because it can't deal with SEO. No one's making a bunch of webring-indexed personal homepages.
Re: (Score:2, Insightful)
The only 'threat' any of
Re: (Score:2)
Yep, too stupid to tell the difference, too arrogant to understand, ethically-challenged.
Re: (Score:2)
Re: (Score:2)
Why would we need a new generation of researchers for that? That's baseless
Re: (Score:2)
The problem with symbolic reasoning is that it runs into state-space explosion _very_ fast. There is no known solution, despite something like 70 years of intense efforts, which have never stopped. A "new generation" of researchers will not do anything here. The evidence strongly indicates the problem is not solvable or requires some _fundamental_ breakthrough. These are really hard to come by and cannot be forced.
Re: (Score:2)
So a ban on evaporative cooling, as well as on using aquifer water, should not upset anyone then, I should hope. Outside of those things I largely agree; a closed-loop, water-to-air cooling system is quite efficient these days.
Re: Reasoning (Score:2)
Yes and no.
Water used in water cooling is often released back into the environment, usually into the stream the water was pumped from. Though it seems you do need to be careful HOW you release it: the water comes out quite hot, and if you dump it into the environment hot, it actually damages the ecosystem of the river. So often the water sits in cooling coils before being dumped back out at a more reasonable delta.
Few installations seem to operate purely on a closed water loop. I am guessing it is because it is more
Re: (Score:2)
Re: (Score:2)
Those resources could be used to serve PEOPLE instead of this shit.
In our current epoch, LLMs are feasible and therefore worth investigating. Who knows what benefits they may bring, to serve the people we both care about?
There's a story (perhaps apocryphal) of then-British-PM Benjamin Disraeli visiting the laboratory of Michael Faraday, and asking Faraday "what use is electricity?" Faraday replied, "Prime Minister, what use is a newborn baby?"
Re: (Score:2)
I do agree there. A few years ago, ChatGPT would be considered an "intern" level. Good enough to fetch stuff, but makes mistakes. Now with the newer LLMs, it is able to do more things, such as generating SCAD models from text. However, it still has a way to go, as the models it does generate may not make sense, or need cleanup.
We have come far with the newer models, but we have come far with a lot of advances, and eventually diminishing returns hit until we go find another technology, perhaps some other
Re:Reasoning (Score:4, Insightful)
Hugely useful for software development?
In specific tasks of boilerplate code generation, maybe.
The training data for these code models is open-source code repositories. At best you can hope for "average" code.
The model doesn't know which bits of code do what, or whether they do it without bugs.
It doesn't know how "good" the code is.
I'm sure it's really good at reciting common examples for common questions.
Re: (Score:2)
If you're not able to articulate specific tasks, the exact inputs and outputs that are provided and what you desire, and give it enough context to understand the interoperability with your existing system, then you're not a good engineer anyway. Without that all you can expect is boilerplate or leetcode copy-paste, which is obviously not useful to a competent engineer.
Re: (Score:2)
If you are a good engineer, you do the work yourself. You don't rely on software, designed to discard information and recall what's left imprecisely, to do your engineering work for you. Now, we all know that you use these "hallucinating" tools because you brag about it. On the other hand, no one says you're a good engineer but you.
Re: (Score:2)
>If you are a good engineer, you do the work yourself.
Wrong these days. If you're a good engineer you use the tools available to you to get the task done well, and fast.
> You don't rely on software, designed to discard information and recall what's left imprecisely, to do your engineering work for you.
Do what you want and get left behind, then be left confused as to why it happened.
>Now, we all know that you use these "hallucinating" tools because you brag about it.
A moment ago you said it can only
Re: (Score:2)
On the other hand, no one says you're a good engineer but you.
Indeed. This person is probably a Dunning-Kruger "far left side" case. I have yet to find any demonstrably good software engineer who praises LLM assistance. But I have found a lot who said "meh" and have mostly stopped using them except as "better search".
Re: (Score:2)
So now you're just writing your code in LLM-inputs?
May as well write in the target language.
You're still going to have to write all the test cases
And go through the generated code to make sure all the code paths are tested, and make sure the result is what you expect.
Re: Reasoning (Score:2)
I find it useful for the kind of thing you'd be comfortable pushing to an intern with Stack Overflow. And that's not bad.
In no way have these LLMs revolutionized the way I write code. But for many simple tasks, they work reasonably well. And once you only try to leverage them in these situations, you can gain decent productivity!
For me, I find it particularly helpful for solving simple tasks in tech I don't know well. I am an HPC scientist, so often I set up benchmarks on weird code bases, written in a variety o
Re: (Score:2)
The code assistant I use routinely suggests completions that are accurate and that are compatible with surrounding code. Hit tab and you saved a lot of typing. I can compose a paragraph of comments describing what I want to do next and it does a fair job of just writing out the whole thing.
Re: (Score:3)
The code assistant I use routinely suggests completions that are accurate and that are compatible with surrounding code. Hit tab and you saved a lot of typing. I can compose a paragraph of comments describing what I want to do next and it does a fair job of just writing out the whole thing.
Ah, but which is faster: Writing out the paragraph of comments with enough precision for the LLM to do the right thing (or something close enough that you can massage into the right thing) or using simpler word completion/function name completion and writing the code yourself? In my experience, the latter is faster. Your mileage may vary.
Re: (Score:2)
I find that if I write out what I'm wanting to do in comments beforehand, it clarifies my thinking. Then if the assistant creates a reasonable facsimile I can work from it.
Re: (Score:2)
It helps me get a start with some boilerplate that is 70% functional, contains 20% made-up methods, and is 100% inappropriate for the given problem set, but I've still been more productive than without it. I use it to brainstorm, and it never lets me down 33% of the time, if you know what I mean.
It's frustrating, infuriating even, but I can't deny I'm better off with it. It's great to have another source of unreliable nonsense mixed in with genius, other than just Google and my friend Dave
Re: (Score:2)
The evidence suggests that it provides better searching capabilities, but that generating actual code is something it can only do better than a low-skill coder. There are plenty of those around, so they may be dazzled. But the problem is that these low-skill people cannot check or cleanup LLM results competently. Numerous high-skill coders have since reported that cleaning up LLM-generated code is often _more_ work than writing things yourself.
There are also other problems: LLMs could well put subtle securi
Re: (Score:2)
Our current models out-perform graduate students.
That's more a statement that your tests are broken than a statement that the models are working.
Re: (Score:2)
>T-t-the tests are broken!
When you don't like the results just claim the tests are wrong. Like any social "scientist" does.
https://arxiv.org/abs/2311.120... [arxiv.org]
Re: (Score:2)
They aren't good at deeper reasoning but they're good at memorizing and doing simple applications. The problem there, as a smart search engine, is that they memorize insufficiently well and don't know when they're making mistakes.
Re: (Score:2)
Our current models can out perform graduate students... at regurgitating facts.
Re: (Score:2)
Indeed. Which is only useful if the "facts" are dependable. For LLMs, they are not.
Re: (Score:2)
Future models can be improved for formalized reasoning,
Nope. Get some basics right.
Re: (Score:2)
With anger like that you're probably right to fear being replaced by younger, better programmers who are more able to use the tools available to them.
ELIZA? (Score:5, Insightful)
Sometimes I wonder that even with all the nodes and capacity that modern LLMs have, we are not that far away from good old ELIZA back in the 70s. We have gone far with CPU, disk, RAM and such, but we may need to go a different route completely for AGI/ASI.
Re: (Score:3)
Re: (Score:2)
Since you asked...
ELIZA is/was a program that rather simple-mindedly tried to pass the Turing test by asking general questions and mimicking back certain phrases input by the human interlocutor. It was called ELIZA because the human was instructed to treat the program as though it was a therapist, thus leading to the exchange of phrases with no expected deep-sharing of personal info from the program. It was modestly impressive, in a low-bar 1970s kind of way.
No offense meant to therapists here.
Tell me more...
Oh wait, are
Re: ELIZA? (Score:2)
How do you feel about whoosh?
Cope (Score:2)
O1-mini easily identified that the information was irrelevant and produced the correct answers. It is also the first model with basic advanced reasoning.
Insane resources are being deployed. It is a race to when they stop getting smarter. Not there yet.
I have also found them quite sensitive to wording (Score:4, Interesting)
I haven't done the test recently, so the results may have changed. If you ask the LLM to produce code to solve knapsack, it will produce the standard dynamic programming.
If you describe a problem as a set of objects with weights and values, and you try to select a subset of objects whose summed weight fits within a capacity while maximizing the sum of values, it produces a dynamic programming solution.
If you present it as a problem where you have a set of zorglubs with foo and bar properties, and you want to select a subset of the zorglubs so that their summed foo is smaller than the grandduck while maximizing the sum of bar, it gives you a brute-force algorithm.
So clearly it takes its cues not from the structure of the problem but from the names that get used. Which, fair enough, doesn't mean it is not useful. But virtually all the students in my algorithms class would go: "isn't that just knapsack?"
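For reference, the "standard dynamic programming" answer in question looks like this textbook sketch (not any particular model's actual output). Renaming the variables to zorglub/foo/bar would leave the structure completely untouched, which is the point of the experiment:

```python
def knapsack(weights, values, capacity):
    """0/1 knapsack via dynamic programming: O(n * capacity) time,
    versus brute force's O(2^n) over all subsets."""
    best = [0] * (capacity + 1)  # best[c] = max value within capacity c
    for w, v in zip(weights, values):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

# Items with weights [2, 3, 4] and values [3, 4, 5], capacity 5:
# taking the first two items gives weight 5 and value 7.
print(knapsack([2, 3, 4], [3, 4, 5], 5))  # 7
```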
Re: I have also found them quite sensitive to word (Score:2)
Playing devil's advocate, isn't your class just matching patterns with a much bigger training set than your LLM was?
Re: I have also found them quite sensitive to wor (Score:2)
Actually, I don't think that's what it is. I think they'd recognize the mathematical structure of a knapsack problem and ignore the wording.
Because they know that when looking at these kinds of problems, you have to extract input, constraints, decisions, and objective. And then solve from there.
And once you are in that space, it should be obvious it is knapsack
Turing (Score:2)
Re: (Score:2)
It turns out, humans aren't hard to trick.
Re: (Score:2)
Indeed. Something like 80% by some reliable estimates. At least that is the percentage of people _not_ accessible to rational argument (i.e. where you do not even need to come up with it yourself, just check it).
Re: (Score:2)
How could it pass a Turing test without reasoning? It seems that an important component of a test for human thinking would focus on reasoning and logic.
The Turing test doesn't really test that. Not directly anyway. The test requires a human's judgement of whether an interlocutor is machine or human. All that needs to happen is that the human judge cannot tell which it is. It's not a formal test of reasoning.
Re: (Score:3)
The Turing test is routinely vastly overestimated by amateurs. It does not test what you think it tests.
LLM brute-force methods have hard limits (Score:2)
It takes a freaking nuclear power plant to brute-force trillions of floating-point matrix operations. I'd say that's a pretty hard limit. However AI is actually going to be done, it can't be done like that. I experimented with a very simple AI rules engine decades ago, and ran right into the computational complexity wall after just a few dozen levels of reasoning. When I showed the results to the customer, he told me he wasn't looking for a machine to replace human reasoning, he wanted a machine to just run
Separate Issues (Score:2)
"Can I get an LLM to lay out a reasoning strategy?" and "can I trip it up with words that would not trip up most postdocs?" are separate questions.
Some of the LLM's can do the first part now beyond simple next-token generation.
A buddy of mine asked one how many calories are in a typical male hippo and it described its plan to calculate it, did the math right, and advised against adding hippo to the diet.
pattern recognition engine (Score:2)
And that the answers it gives are from a best match pattern search?
imho the AI won't ever achieve the ability to reason. (Not saying AI doesn't have good uses)
Incompetent use of the word "prove" (Score:2)
So, a study takes a look at two current implementations of LLMs and proclaims that all other existing LLMs don't work and furthermore no other LLMs that will ever be created in the future will work. Why? Because all LLMs are the same, always have been, and always will be in the future. It's not like the hundreds of billions of dollars in hardware are being used to perform research into different LLM architectures, models, and processes.
Yes, the word "prove" is incompetently used by the article writer and
Re: (Score:2)
I'm sorry that reality ruined your silly AI fantasy. I'm amazed that it took this long.
Where is the human control? (Score:2)
Now apply the same test to humans, and I am confident the average person will be found to have "No Ability to Reason".
When Sophie watches her nephew, she gets out a variety of toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?
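For what it's worth, the quoted problem's math is a single subtraction once the distractors (colors, "recently bought") are ignored:

```python
# Totals stated in the problem; everything else is surface decoration.
blocks, stuffed_animals, rings = 31, 8, 9
total_after_purchase = 62

bouncy_balls = total_after_purchase - (blocks + stuffed_animals + rings)
print(bouncy_balls)  # 14
```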
People living in Silicon valley may be remote from the abstract reasoning abilities of the median human.
AI is the Asbestos of engineering (Score:3)
Now every tech company wants to put it everywhere, but a time will come in which we will be removing it from almost every product.
Re: Thanks science (Score:2)
Yeah, right now AI developers are getting paid stupid amounts. But my guess is that most will pivot or be fired in the next two years.