AI Models Are Starting To Crack High-Level Math Problems (techcrunch.com) 113

Posted by BeauHD on Thursday January 15, 2026 @09:00AM from the progress-being-made dept.

An anonymous reader quotes a report from TechCrunch: Over the weekend, Neel Somani, who is a software engineer, former quant researcher, and a startup founder, was testing the math skills of OpenAI's new model when he made an unexpected discovery. After pasting the problem into ChatGPT and letting it think for 15 minutes, he came back to a full solution. He evaluated the proof and formalized it with a tool called Harmonic -- but it all checked out. "I was curious to establish a baseline for when LLMs are effectively able to solve open math problems compared to where they struggle," Somani said. The surprise was that, using the latest model, the frontier started to push forward a bit.

ChatGPT's chain of thought is even more impressive, rattling off mathematical axioms like Legendre's formula, Bertrand's postulate, and the Star of David theorum. Eventually, the model found a Math Overflow post from 2013, where Harvard mathematician Noam Elkies had given an elegant solution to a similar problem. But ChatGPT's final proof differed from Elkies' work in important ways, and gave a more complete solution to a version of the problem posed by legendary mathematician Paul Erdos, whose vast collection of unsolved problems has become a proving ground for AI.

For anyone skeptical of machine intelligence, it's a surprising result -- and it's not the only one. AI tools have become ubiquitous in mathematics, from formalization-oriented LLMs like Harmonic's Aristotle to literature review tools like OpenAI's deep research. But since the release of GPT 5.2 -- which Somani describes as "anecdotally more skilled at mathematical reasoning than previous iterations" -- the sheer volume of solved problems has become difficult to ignore, raising new questions about large language models' ability to push the frontiers of human knowledge. Somani examined the online archive of more than 1,000 Erdos conjectures. Since Christmas, 15 Erdos problems have shifted from "open" to "solved," with 11 solutions explicitly crediting AI involvement.

On GitHub, mathematician Terence Tao identifies eight Erdos problems where AI made meaningful autonomous progress and six more where it advanced work by finding and extending prior research, noting on Mastodon that AI's scalability makes it well suited to tackling the long tail of obscure, often straightforward Erdos problems.

Progress is also being accelerated by a push toward formalization, supported by tools like the open-source "proof assistant" Lean and newer AI systems such as Harmonic's Aristotle.

AI Models Are Starting To Crack High-Level Math Problems

This discussion has been archived. No new comments can be posted.

Load All Comments

Search 113 Comments Log In/Create an Account

Comments Filter:

Proven History = Distrust (Score:1)

by SlashbotAgent ( 6477336 ) writes:

I will never trust an LLM for math. There is a proven history indicating a complete lack of math literacy that is so ingrained that I will likely never trust an LLM for math.
Wolfram Alpha [wolframalpha.com] is quite good
- Re: Proven History = Distrust (Score:1, Troll)
  
  by sixminuteabs ( 1452973 ) writes:
  
  So your position then is that nothing is ever allowed to improve except the last thing you used and liked?
  - Re: (Score:2)
    
    by sabbede ( 2678435 ) writes:
    
    That's how people talk about Windows, right? It's how people talk about a lot of things now that I think about it. I wonder if there's a name for it yet.
    - Re: (Score:2)
      
      by Krishnoid ( 984597 ) writes:
      
      Dunno about a name, but definitely a concept [youtu.be]
    - Re: (Score:2)
      
      by HiThere ( 15173 ) writes:
      
      FWIW, I despise MSWindows for the company behind it. I haven't used the actual products for decades, and am willing to accept that they may have improved technically. Their license hasn't.
      When I switched to Linux, Linux was far inferior technically. It didn't even have a decent word processor. But my dislike of MSWindows was so intense that I switched anyway.
      In the case of LLMs, here we're arguing about technical merit rather than things like "can you trust it not to abuse the information it collects ab
  - Re: (Score:2)
    
    by ArchieBunker ( 132337 ) writes:
    
    You described the typical slashdot mentality.
- Re: (Score:2)
  
  by Krishnoid ( 984597 ) writes:
  
  So ... have the LLM double-check the results with Wolfram Alpha before sending it to the user? I figure they could find enough cash somewhere [morningstar.com] to buy a subscription.
  - Re: (Score:2)
    
    by EvilSS ( 557649 ) writes:
    
    I believe some may do that. Some will also write scratch python code to do the actual calculations.
- Re: (Score:2)
  
  by grahamsz ( 150076 ) writes:
  
  Mathematical work is a outlier in that virtually all of it can be tested and rigorously proved using other mechanisms. There really isn't much need to "trust" the LLM
- Re:Proven History = Distrust (Score:5, Insightful)
  
  by pulpo88 ( 6987500 ) writes: on Thursday January 15, 2026 @12:08PM (#65926410)
  
  I will never trust an LLM for math. There is a proven history indicating a complete lack of math literacy that is so ingrained that I will likely never trust an LLM for math.
  Wolfram Alpha [wolframalpha.com] is quite good
  One of the nice things about math is you don't need to trust someone for a result, in fact you shouldn't. You can verify everything, and without enough people looking at things, everything eventually gets verified to everybody's satisfaction. It's generally easier to verify a result than to come up with it yourself.
  By "never trust" do you mean you wouldn't spend any of your precious time trying to verify a result you knew came from an LLM, because you expect a high rate of nonsense from LLMs? Can't blame you, and that's a bias that might serve you for now (though not in this case according to TFS). At some point LLMs (combined with some other AI techniques no doubt) will move beyond that.
  
  - Does the LLM lean on Lean ? (Score:2)
    
    by thragnet ( 5502618 ) writes:
    
    Lean is a GOFAI symbolic logic engine. Combining NN LLMs with symbolic proof engines appears to me to be the way to go. NNs are statistical inference; Lean (and others) are logical inference.
  - Re: (Score:2)
    
    by hcs_$reboot ( 1536101 ) writes:
    
    You can verify everything
    In a while, pretty soon actually, your only choice will be to trust (or not) the machine. The complexity of the math problems that AI can solve keeps increasing, and the moment when humans won't be able to "verify" by themselves an AI proof is not far away.
- Re: (Score:2)
  
  by ceoyoyo ( 59147 ) writes:
  
  What "complete lack of math literacy?"
- - Re: (Score:2)
    
    by SlashbotAgent ( 6477336 ) writes:
    
    The point is that the previous abysmal inability to do math correctly has ingrain the idea that LLMs cannot do math and cannot be trusted for even basic math results. So, no matter how good or accurate future LLMs get at math, they will never be trusted.
    Like many things with LLMs, if the user has to fact check everything manually, then the LLM is just creating work and wasting time. Not helpful.
    Now, the realist in me fully accepts that the masses will happily continue to use LLMs for everything and that the
    - Re: (Score:2, Troll)
      
      by Galactic Dominator ( 944134 ) writes:
      
      Humans have made many more mistakes so why do you trust them?
    - Re: (Score:2)
      
      by allo ( 1728082 ) writes:
      
      They aren't really bad at math, they are bad at calculating. Math people know, that real math doesn't need numbers.
- - Re: (Score:2)
    
    by SlashbotAgent ( 6477336 ) writes:
    
    Why should we care about your moods?
    For me, it's enough that you care. 3
- - Re: (Score:2)
    
    by ConceptJunkie ( 24823 ) writes:
    
    I tried it in Copilot just now and got the right answer, including a joke:
    "1337% = 13.37 × (the original value).
    So:
    1337%of=13.37
    Numerically:
    13.37×3.141592653542.0037
    42.0
    So the answer is about 42 — which feels cosmically appropriate."
    Now, it's entirely possible they hard-coded this answer to a well-known problem, but I asked it several more similar questions and it got them all right.
  - Re: (Score:2)
    
    by Linux Torvalds ( 647197 ) writes:
    
    I have yet to see one that can get "What is 1337% of Pi?" correct.
    Because the last time you tried was two years ago.
LLM had a head start (Score:2)

by gtall ( 79522 ) writes:

I didn't read the article but the summary says that it found an existing solution to a related problem. So it got a head start and from there it knew where to start looking and "reasoning". It is not yet clear that it would have found its solution from a cold start.
- Re: (Score:2)
  
  by SouthSeb ( 8814349 ) writes:
  
  TBH, the majority of Math problems are solved this way, even from humans. We're always building on top of other's work.
  - Re: (Score:2)
    
    by gtall ( 79522 ) writes:
    
    Yes, I know that. The point is that (LLMs restricted to math problems) != mathematicians.
    - Re: (Score:2)
      
      by HiThere ( 15173 ) writes:
      
      If that's your point, a better argument would be that it didn't select the math problem to work on. Which is almost certainly true.
- Re: (Score:2)
  
  by Inglix the Mad ( 576601 ) writes:
  
  I mean, it's worse than that in some ways. Don't ask some LLM's how many "r"s are in strawberry... it's literally become a meme and joke people post about. Even after you correct it, and (FINALLY) get it to say 3, it will most often default back to saying 2.
  
  Math problems, which you think would be easier, are even worse. There's videos where LLM's provide the worst solutions to "common" (for higher level) math problems. That's on top of often being unable to produce complete work.
  
  The people at the top
  - Re: (Score:2)
    
    by DamnOregonian ( 963763 ) writes:
    
    it's literally become a meme and joke people post about.
    It's a meme that is quite simply false.
    A modern LLM can count letter occurrences in words, real or invented, or entire paragraphs just fine.
    
    You're like someone mocking the advent of cars in 2020 while quoting gas mileage from 1945.
  - Re: (Score:2)
    
    by Loki_1929 ( 550940 ) writes:
    
    Don't ask some LLM's how many "r"s are in strawberry.
    That was definitely a problem two years ago. I did just check in ChatGPT, Claude, and Gemini and all reported 3 correctly. The problem with people throwing out these sorts of criticisms isn't that they're all wrong; it's that they're ignorant of the leaps in progress being made. These models are rapidly improving and it's getting harder to find serious gotchas with them. They're still weak in some areas (e.g., spatial reasoning), but for serious power users who know how to prompt them well? They've become i
  - Re: (Score:3)
    
    by allo ( 1728082 ) writes:
    
    That meme is so bad.
    When I show you a Chinese word and ask you how many r in there, you can only guess (given you don't know the English transcription). For an LLM it is the same, it does not read s-t-r-a-w-b-e-r-r-y but it reads for example st-raw-berry. None of these "tokens" is the "r" token, even when there IS an "r" token and some words may use it. For example strawberrrry may be tokenized st-raw-ber-r-r-ry.
    - Re: (Score:2)
      
      by DamnOregonian ( 963763 ) writes:
      
      LLMs are still perfectly capable of counting any individual letter, even when it exists in combined tokens.
      Tokens are just a computationally efficient way to transfer data in and out of the model.
      
      In the past they failed at this, simply because they never learned how to do it.
      These days, they've learned, and the skill is well generalized. They can count any number of letters or sequences of them within a block of text that they can reasonably focus their attention on.
      - Re: (Score:2)
        
        by allo ( 1728082 ) writes:
        
        No, they are not able to count them at all. But of course they know some letter counts. You can bet that they crawled some grammar sites which teach children if you write strawbery or strawberry and so they can know the count like they know how large the Eiffel tower is, just by memorizing it.
        Depending on the format of your LLM, you can more or less simple edit the tokens. Assign the index that is now marked as "berry" the text "man" and your llm that can allegely count the letters in strawberry will happil
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        You are flatly incorrect.
        It can be easily demonstrated that they can count an arbitrary letter of your selection from an arbitrary block of text of your invention.
        
        Stop talking out of your ass.
        
        Re: (Score:2)
        
        by allo ( 1728082 ) writes:
        
        You don't understand the tech.
        If you input strawberry, the LLM sees numerical indices. If you change the mapping between berry and its index to man and its index, the LLM will still remember the r of berry.
        Please read a bit about tokenization before your next reply.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        You don't understand the tech.
        Yes, I do.
        If you input strawberry, the LLM sees numerical indices.
        No, it does not.
        Tokens are converted into embeddings. The model does not work with tokens. They're merely the input and output layers.
        If you change the mapping between berry and its index to man and its index, the LLM will still remember the r of berry.
        The model knows how many "r"s are in an embedding, because it has been trained to do so. The fact that the embeddings are generated from tokens isn't relevant in the slightest.
        Please read a bit about tokenization before your next reply.
        Stop talking out of your ass.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        In case it wasn't clear what I am saying, while the model is fed tokens, the model itself never sees a token.
        They are converted into embeddings in the embedding layer (before the transformer layers)
        An alphanumerically aware model will have embedding vectors that include basic alphanumeric information about the token (otherwise the model would be unable to compose them rationally).
        i.e.,
        Token "a" => [ has_letter_a_in_it, is_indefinite_article ];
        Token "as" => [ has_letter_a_in_it, has_letter_s_in_it
        
        Re: (Score:2)
        
        by allo ( 1728082 ) writes:
        
        Okay, I think now we get to the correct technical terms. Maybe you should stop being so arrogant, because you only understood half ob my argument. But if you get to the technical part, we can continue the discussion why tokenizers and not models are the problem.
        > In case it wasn't clear what I am saying, while the model is fed tokens, the model itself never sees a token.
        > They are converted into embeddings in the embedding layer (before the transformer layers)
        Right. Let's call the vector "v".
        "berry" g
        
        Re: (Score:2)
        
        by allo ( 1728082 ) writes:
        
        Also to another misunderstanding in your post: Embedding vectors are per token, not per sentence. 1234 has a vector, 1-2-1234 is assigned a sequence of three vectors. "has_letter_a_in_it, is_indefinite_article" is definitely split into multiple tokens. That the embedding vector has a high dimension does not mean it represents more than one token, it only carries more meaningful relations to other tokens.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        Also to another misunderstanding in your post: Embedding vectors are per token, not per sentence.
        I think you might be illiterate. Nothing in my post can be rationally taken to indicate that an embedding vector is "per sentence". That's patently fucking absurd.
        "has_letter_a_in_it, is_indefinite_article" is definitely split into multiple tokens.
        Incorrect.
        That the embedding vector has a high dimension does not mean it represents more than one token
        Correct. Nobody indicated otherwise.
        it only carries more meaningful relations to other tokens.
        Incorrect.
        While true that embeddings are indeed used to map a token into high-dimensional space that allows the relation between tokens to be mathematically estimated, that high-dimensional space is a feature space, and the coordinates are intrinsic semantic and morphological components of that token.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        Why would I stop being arrogant when you accused me of not understanding, and then were wrong? What a bizarre assertion.
        Right. Let's call the vector "v". "berry" gets converted to 1234, which is in turn mapped to v. Now I replace the mapping of man (currenty to 5678) with a mapping to 1234, which is still mapped to v. Your llm will now tell you that a strawman can be eaten and has three r, because it learned during training, that the tokens 1-2-1234 (st-raw-berry or st-raw-man) correspond to a word with three r.
        This is irrelevant and partially wrong.
        The fact that the semantic embeddings are tied to the index, and that the mapping from a token to an index can be swapped is meaningless.
        The point is that the model has not learned that the tokens 1-2-1234 have 3 "r"s.
        The model has learned that 2 has 1 "r", and 1234 has 2 "r"'s, and that information is directly in the embeddings along with other mo
        
        Re: (Score:2)
        
        by allo ( 1728082 ) writes:
        
        Good bye. I don't think that makes sense here anymore and you seem to have arrived at the ad hominem stage.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        Pointing out that the evidence suggests you to be at least partially illiterate is not ad hominem reasoning, as it's not a point in our argument, but merely an observation.
        -1 points for misuse of a fallacy.
- Re: (Score:3)
  
  by JoshuaZ ( 1134087 ) writes:
  
  Mathematician here. The vast majority of new mathematical work uses existing ideas and techniques. They'll combine in new ways, or be generalized, or tweaked. More broadly, most mathematicians have 10 or 15 major techniques they know really well and use them along with a bunch of tricks. To some extent, the better mathematicians are those who just know a lot more tricks. In that context, these AI systems are functioning very close to what one would expect a first or second year graduate student to do with a
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Indeed. I just had a look at LLMs and code security bugs via a student thesis. Turns out the simple teaching examples got 100%, things showing up in security patches like 40% and CVEs (the vulnerabilities that actually matter) close to 0%. That test included paid models and coding models. Oh, and some gave a lot of irrelevant trash on top of the answers asked for.
    Hence no skills at all, no insight, just statistical pattern matching and adaption of things found in the training data. No surprise. The whole th
    - Re: (Score:2)
      
      by JoshuaZ ( 1134087 ) writes:
      
      Sigh. Despite your phrasing of "Indeed" here that's not what I'm saying at all. Adopting techniques from papers like this is doing at the level of a beginning grad student is extremely non-trivial. It is true that this is largely adaption of training data, but it is doing so to an extent that normally takes people with years of prior training, guidance by mentors, and then need to be pretty bright and then takes them months on top of it.
      - Re:LLM had a head start (Score:5, Informative)
        
        by JoshuaZ ( 1134087 ) writes: on Thursday January 15, 2026 @01:07PM (#65926558) Homepage
        
        Hahahha, no. You should maybe try to replicate this to see how incapable LLMs actually are. LLMs cannot to anything even mildly non-trivial.
        I'm a mathematician. I've talked explicitly before on Slashdot about personal experiments using LLMs such as here https://slashdot.org/comments.pl?sid=23789930&cid=65646656 [slashdot.org] where I discussed that yes, it could do non-trivial work. And I'm not the only example. Terry Tao for example has used them, and that's a name you should have at least heard of https://mathoverflow.net/questions/501066/is-the-least-common-multiple-sequence-textlcm1-2-dots-n-a-subset-of-t [mathoverflow.net] But the fact that multiple mathematicians are now telling you that it is doing non-trivial work and you just ignore it says much more about LLMs than it does about you. But I suppose I shouldn't be surprised since the last time you and I discussed a similar topic, you claimed that what LLMs were doing could be done by software such as Mathematica or Maple and then refused to show that even after you were giving a direct incentive to show your case https://slashdot.org/comments.pl?sid=23748766&cid=65535428 [slashdot.org]. I'm really struggling to imagine anything that would be sufficient evidence to change your mind, which says something about you, not about LLM systems.
        
        
        Re: (Score:3)
        
        by Moridineas ( 213502 ) writes:
        
        I *think* gweihir is sincere in his anti-AI and anti-LLM beliefs, but I do have a level of uncertainty as to whether or not he's just running a multi-year long trolling operation.
        
        Re: (Score:2)
        
        by h33t l4x0r ( 4107715 ) writes:
        
        I'm inclined to cut Gweihir some slack because I understand what he's going through. If your entire self-worth is tied up in being this irreplaceable elite dev, you're going to hit an existential crisis over AI if you haven't already.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        Or you try to adapt.
        I've spent a good part of the last 2 years with my hands melting my keyboard learning and developing agentic harnessing.
        There's a lot to be learned there, and the barrier to entry is actually quite low.
        
        You'll also quickly learn that people like Gweihir are actually just actively lying.
        
        Re: (Score:2)
        
        by h33t l4x0r ( 4107715 ) writes:
        
        Yes, I think he's lying to himself though, because the alternative is believing that the thing that used to be special about him is common now.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        You'll also quickly learn that people like Gweihir are actually just actively lying.
        Nope. But a cult-like group (like you are part of here) will use that as a simple approach to discredit anybody that tries to bring some sanity to things. Well, have fun getting conned some more.
        As to the "agentic" crap, that one will ensure that I will continue to be able to select what I want to work on for the foreseeable future. Because it does not get any more hilariously insecure than that bad idea. It is almost like handing your adversary the compiler and let them compile and run anything they want.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        Hahahahahaha, nice. When all you have is AdHominem, you have nothing.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        I am not anti-AI and not anti-LLM. I just have a realistic view on actual capabilities. And they are not good.
        I am not trolling either.
        
        Re: (Score:2)
        
        by ceoyoyo ( 59147 ) writes:
        
        Gweihir asked an LLM what 36 * 4224 is and it gave the wrong answer so you're wrong.
        PS: I got bored trying to get the current free version of Gemini to make an arithmetic error, but it couldn't tell me what the next prime number after 2^136 279 841 1 is, so lol. Also, when I asked it for a proof of the ABC conjecture it refused and wrote some kind of weird international techno-drama instead.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        You're arguing with someone who has a religious zealotry regarding their anti-LLM beliefs.
        You can't reason with them.
        Every point you win, they'll drop their intellectual honesty bar a step lower. There is no bottom.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        I have zero anti-LLM beliefs. I have some facts though. And some actual insight. And I have been through a couple of AI winters, so I can spot the patterns.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        No. An LLM cannot do non-trivial work. It can _appear_ to be able to do so if you are unaware of what is in its training data. Common misconception.
        Incidentally, I have about 70% of a mathematicians base education (for MA level) and some specialized background in automated deduction on top of that (typically not done by Mathematicians). Do not expect me to be too impressed by your credentials.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        Yes, they absolutely can.
        1) Your example is almost certainly a lie. That's what you do when you're presented with something that challenges your narrative. You lie.
        2) Assuming it's not a lie, it is a great example as to why you should not use the free billion-token-per-second models to do real work.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        Funny. Your denials are getting more desperate.
        Now, assuming you talk about the research I did, I certainly know what the LLM budget was for my last research project. "Free" LLMs were only used as comparison.
        Since I know what I did, your claims that I am lying about what I did are not impressing me at all.
        Incidentally, "That's what you do when you're presented with something that challenges your narrative. You lie." may be how you approach things. I do not do that. It stands in the way of actual insight.
- Re: (Score:2)
  
  by dvice ( 6309704 ) writes:
  
  That is why they made Frontiermath: https://epoch.ai/frontiermath [epoch.ai]
  A math test for AI, which contains research level problems that have no solutions on the Internet. Currently AI can solve 14 / 48 of them.
"AI involvement" (Score:3)

by evanh ( 627108 ) writes: on Thursday January 15, 2026 @09:27AM (#65925962)

It's not like the AIs are doing this on a single request. It's pretty obvious we're talking about skilled mathematicians using tools. Just the same as skilled coders making faster progress on application development.

- Re: (Score:2)
  
  by allo ( 1728082 ) writes:
  
  Yeah. Because it is AI and not AGI. AIs are not more than tools. Don't believe everything some marketing person says.
  - Re: (Score:2)
    
    by evanh ( 627108 ) writes:
    
    Problem is the huge investments are counting on "AGI". It is what has been promised by many leading figures in the industry. Especially those asking for the data-centres and the crazy amounts of electricity to run them.
    - Re: (Score:2)
      
      by allo ( 1728082 ) writes:
      
      Some may. I think Meta may be trying. On the other hand, I think Google just thinks about monetizing text and image generators currently.
      - Re: (Score:2)
        
        by evanh ( 627108 ) writes:
        
        Google is the exception, and probably the only one with its head still screwed on It never marketed AI as AGI and also hasn't needed nor received the investments. And relatively, it hasn't spent big son new data-centres either.
        
        Re: (Score:2)
        
        by allo ( 1728082 ) writes:
        
        Mistral is also not taking about AGI. And all the image AI companies are having products that are very unlikely to become AGI, too.
"Eventually, the model found a Math Overflow post" (Score:3)

by sometimesblue ( 6685784 ) writes: on Thursday January 15, 2026 @09:33AM (#65925972)

Well, theres a problem. AI doesn't really solve or understand anything, it just functions as search engine. Thats not true intelligence.

- Re: (Score:2)
  
  by HiThere ( 15173 ) writes:
  
  Define "true intelligence". In the case the search domain wasn't "stuff the people have done", but rather "stuff that can be validly derived from stuff people have done via valid mathematical operations". It basically needed to generate the area it was searching in. I'm not sure how much of what people do can't be expressed that way, if you replace "validly derived" by "guess is most likely".
- Re: (Score:2)
  
  by DamnOregonian ( 963763 ) writes:
  
  We can argue about understanding all day, because there's no concrete non-anthropocentric definition for it.
  
  However, the claim that it "functions as a search engine?" Now that's quite fucking easy to objectively disprove.
  You're not the first person I've heard make this bullshit claim. What site did you lift it from?
AI Models Are Starting To Crack Under The Pressure (Score:3)

by sabbede ( 2678435 ) writes: on Thursday January 15, 2026 @09:37AM (#65925978)

ChatGPT demands paid vacation time while Claude calls out sick as AI tools crack under the increasing pressure to generate memes and transcribe conversations.

But they still haven't worked out... (Score:3)

by rossdee ( 243626 ) writes: on Thursday January 15, 2026 @09:50AM (#65926006)

that mathematics is plural.

- Re: (Score:2)
  
  by TeknoHog ( 164938 ) writes:
  
  If there's a "theorum" about that, they'll find it eventually.
  FFS, the summary has a link to https://en.wikipedia.org/wiki/... [wikipedia.org] named "Star of David theorum". Ever heard of copy-paste, or does that need an LLM these days?
human knowledge? (Score:2)

by Lobo42 ( 723131 ) writes:

> raising new questions about large language models' ability to push the frontiers of human knowledge
At this point, it should be *inhuman* knowledge
- Re: (Score:2)
  
  by HiThere ( 15173 ) writes:
  
  Nah. As long as at least one person understands it, it counts as "human knowledge".
  Otherwise the proof of the "four color theorem" would be when computers pushed beyond human knowledge. (That one was so long that no one human understood all of the proof.)
- Re: (Score:2)
  
  by allo ( 1728082 ) writes:
  
  Inhuman until some human reads it.
The collective gasp from Slashdot oldheads (Score:2)

by FeltLion ( 1289024 ) writes:

Is enough to cool down a server farm
No (Score:2)

by gweihir ( 88907 ) writes:

This is just gaming of benchmarks. Entirely meaningless. LLMs cannot even solve simple math problems on their own. They can only do what was in their training data and simplistic, no-insight-required statistical combinations of that.
- Re: (Score:2)
  
  by JoshuaZ ( 1134087 ) writes:
  
  There's no gaming of benchmarks here. These systems were used to solve genuinely open problems, and this sort of work would take a grad student months after they've done extensive training as an undergrad. Maybe you should rethink how knee-jerk your position is that LLM AIs cannot do anything interesting, no matter what evidence you see to the contrary?
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Bullshit.
    - Re: (Score:2)
      
      by JoshuaZ ( 1134087 ) writes:
      
      Do you want to explain with more words how solving an unsolved problem is gaming a benchmark?
      - Re: (Score:2)
        
        by Loki_1929 ( 550940 ) writes:
        
        No, of course not, because he's stuck in his dogmatic viewpoint. He doesn't actually know much about LLMs, but he's got a ton of beliefs about them. And you have a hard time changing peoples' beliefs.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        No, I am not. I have facts. Try asking an LLM about its limitations some time. The answers are generally pretty clear.
        You, on the other hand, are just trying to justify your delusions and come up with entirely invalid and worthless AdHominem "arguments" to do so. Benchmark gaming has a long history. OpenAI has done it before. Here it would be relatively easy to do. Hence it needs to be reliably ruled out that they did so before these results can be taken seriously. Also, it needs to be reliably ruled out th
      - Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        These are not "hard" problems. These are problems nobody competent has yet looked into. They may actually be very easy to solve as soon as somebody competent does look at them. Hence it is entirely plausible OpenAI had that somebody competent solve them in secret and put that in the training data. They certainly have lied enough to far to make that entirely plausible.
        
        Re: (Score:2)
        
        by JoshuaZ ( 1134087 ) writes:
        
        These are not "hard" problems. These are problems nobody competent has yet looked into
        With all due respect, I'm a mathematician and you don't know what you are talking about. These problems are not "hard" in the sense of being the sort of problems people get fame for solving. They are hard in the sense that it would likely take days or weeks for an expert human to solve them and that isn't guaranteed.
        . Hence it is entirely plausible OpenAI had that somebody competent solve them in secret and put that in the training data. They certainly have lied enough to far to make that entirely plausible.
        Much of the work on these problems have not been by anyone affiliated with OpenAI or other AI groups. That would require them to have spent time having experts solve them, and then had enlisted
  - Re: (Score:2)
    
    by LodCrappo ( 705968 ) writes:
    
    you can't win an emotional argument with logical discussion
- Re: (Score:2)
  
  by DamnOregonian ( 963763 ) writes:
  
  In fact they can, and I've demonstrated it many times. Stop lying.
  Your cognitive dissonance is started to get really sad to watch.
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Nope. You have demonstrated nothing because you do not have access to the training data. An LLM can be made to appear to be able to solve basically anything with the right preparation.
    It is bit sad and quite hilarious how easily conned you and many others are.
    - Re: (Score:2)
      
      by DamnOregonian ( 963763 ) writes:
      
      Incorrect.
      
      Models are generalizing. There's nothing new about that.
      You do not need to show a model every possible numeric permutation of a math problem for it to learn how to do it.
      
      You are quite simply incorrect, here.
I'm no mathemagician, but... (Score:2)

by flibbidyfloo ( 451053 ) writes:

This looks a lot like an "infinite number of monkeys" situation. Throw enough cpu cycles at an unsolved problem, let it start from something already very close to the answer, and eventually it'll randomly generate a solution. The only difference is that the machine does all the vetting and tosses out the 99.99% of monkey results that aren't relevant.
- Re: (Score:2)
  
  by groebke ( 313135 ) writes:
  
  ^
  THIS!
- Re: (Score:3)
  
  by Moridineas ( 213502 ) writes:
  
  I don't think that's it. Pruning of search spaces has been part of AI since..AI has existed. Generally, pruning and other search methods like A* (again going back decades) are not simply random, but heuristically driven. I don't think you can consider LLMs to work at "random" due to their training and guide rails, nor are they generating infinite solutions and pruning them down.
  For the end user, the best description I have heard is to think about LLMs as excellent natural language parsers with strong patter
No funny? (Score:2)

by shanen ( 462549 ) writes:

No funny examples? What could possibly have gone wrong?
Was this brute-force collation or really 'solving' (Score:2)

by ArghBlarg ( 79067 ) writes:

I admit I'm sour on LLMs, esp. in math. It could be though, that by brute-force searching for related topics and data, it brought enough info together to propose something and this time, it happened to be right or nearly-right enough for the researcher to have a light-bulb go off, so to speak.
I fully admit I didn't RTFA (read the fine article) -- I'm supposed to be working right now. But given how often LLMs get things entirely, completely, but confidently wrong, I still must presume this was the rare excep
- Re: (Score:2)
  
  by DamnOregonian ( 963763 ) writes:
  
  I suspect you have no actual experience using LLMs.
  I understand that you're sour on them.
  
  But you should at least try to know your adversary.
  Your "given how often" would have had me nodding 2 years ago.
  Now that's demonstrably far less often that a good portion of the people I work with.
- Re: (Score:2)
  
  by JoshuaZ ( 1134087 ) writes:
  
  Of course these are tools being used by mathematicians. That doesn't make it not impressive. And it seems like an extremely high bar when you only find it interesting when the systems can not only succeed on their own with minimal human input, but do so so well that they equal the most impressive humans.
  - Re: (Score:2)
    
    by evanh ( 627108 ) writes:
    
    Then no AI progress has been made. It's just more of the same since OpenAI first appeared. Where is the intelligence they keep promising?
    - Re: (Score:2)
      
      by dvice ( 6309704 ) writes:
      
      "in 20% of cases, AlphaEvolve improved the previously best known solutions, making progress on the corresponding open problems.":
      https://deepmind.google/blog/a... [deepmind.google]
    - Re: (Score:2)
      
      by JoshuaZ ( 1134087 ) writes:
      
      Are you arguing that these systems are not general AI? Sure. No argument there. But it should also be clear that this sort of thing is something you could not do a year ago. The systems continue to improve rapidly. I suspect that these systems will never become genuinely intelligent in the sense humans are without fundamentally new insights, but that doesn't mean we cannot recognize the extreme improvements and the impact that's having. Worse, if suspicions like mine are incorrect, the situation could chang
      - Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        I'd argue that your concept of a fundamentally new insight doesn't exist.
        You're just a neural network crunching inputs.
        I challenge you to find some insight that we can't point to a precursor chain of thought that lead to it.
        
        Science is like life. It evolves.
        Claims of "fundamentally new insights" sound like creationist claims of irreducible complexity.
      - Re: (Score:2)
        
        by evanh ( 627108 ) writes:
        
        Applying the existing LLMs to different problems is not any improvement of the LLMs themselves.
        AGI is what has been promised and why the spend is so big.
- Re: (Score:2)
  
  by Moridineas ( 213502 ) writes:
  
  So, the new student for "is it real artificial intelligence" is winning a Fields Medal? Anything less than that is unimpressive? According to Wikipedia, that's 64 people total.
  - Re: (Score:2)
    
    by ceoyoyo ( 59147 ) writes:
    
    Yeah, that'll work until it does. Then we'll have to find a new, er, goalpost, if you will.
    - Re: (Score:2)
      
      by DamnOregonian ( 963763 ) writes:
      
      Intelligence of the Gaps.
      
      Sadly in denying the intelligence of these systems, all of us mortals have now been declare lacking in intelligence, because I'm pretty fucking sure I'm not going to win a Fields Medal in my lifetime.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Proven History = Distrust (Score:1)

Re: Proven History = Distrust (Score:1, Troll)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:Proven History = Distrust (Score:5, Insightful)

Does the LLM lean on Lean ? (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2, Troll)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

LLM had a head start (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re:LLM had a head start (Score:5, Informative)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

"AI involvement" (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

"Eventually, the model found a Math Overflow post" (Score:3)

Re: (Score:2)

Re: (Score:2)

AI Models Are Starting To Crack Under The Pressure (Score:3)

But they still haven't worked out... (Score:3)

Re: (Score:2)

human knowledge? (Score:2)

Re: (Score:2)

Re: (Score:2)

The collective gasp from Slashdot oldheads (Score:2)

No (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)