Math AI

AI Models Are Starting To Crack High-Level Math Problems (techcrunch.com) 113

An anonymous reader quotes a report from TechCrunch: Over the weekend, Neel Somani -- a software engineer, former quant researcher, and startup founder -- was testing the math skills of OpenAI's new model when he made an unexpected discovery. After pasting the problem into ChatGPT and letting it think for 15 minutes, he came back to a full solution. He evaluated the proof and formalized it with a tool called Harmonic, and it all checked out. "I was curious to establish a baseline for when LLMs are effectively able to solve open math problems compared to where they struggle," Somani said. The surprise was that, using the latest model, the frontier started to push forward a bit.

ChatGPT's chain of thought is even more impressive, rattling off mathematical results like Legendre's formula, Bertrand's postulate, and the Star of David theorem. Eventually, the model found a Math Overflow post from 2013, where Harvard mathematician Noam Elkies had given an elegant solution to a similar problem. But ChatGPT's final proof differed from Elkies' work in important ways, and gave a more complete solution to a version of the problem posed by legendary mathematician Paul Erdos, whose vast collection of unsolved problems has become a proving ground for AI.

For anyone skeptical of machine intelligence, it's a surprising result -- and it's not the only one. AI tools have become ubiquitous in mathematics, from formalization-oriented LLMs like Harmonic's Aristotle to literature review tools like OpenAI's deep research. But since the release of GPT 5.2 -- which Somani describes as "anecdotally more skilled at mathematical reasoning than previous iterations" -- the sheer volume of solved problems has become difficult to ignore, raising new questions about large language models' ability to push the frontiers of human knowledge.
Somani examined the online archive of more than 1,000 Erdos conjectures. Since Christmas, 15 Erdos problems have shifted from "open" to "solved," with 11 solutions explicitly crediting AI involvement.

On GitHub, mathematician Terence Tao identifies eight Erdos problems where AI made meaningful autonomous progress and six more where it advanced work by finding and extending prior research, noting on Mastodon that AI's scalability makes it well suited to tackling the long tail of obscure, often straightforward Erdos problems.

Progress is also being accelerated by a push toward formalization, supported by tools like the open-source "proof assistant" Lean and newer AI systems such as Harmonic's Aristotle.
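
For the unfamiliar, "formalization" means restating a proof in a language whose checker verifies every step mechanically. A minimal Lean 4 sketch of the idea (a deliberately trivial statement, not the Erdos problem; Nat.add_comm is a standard-library lemma):

    -- The Lean kernel checks each inference, so an accepted proof
    -- needs no trust in whoever (or whatever) wrote it.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

A model's natural-language proof can be wrong in subtle ways; once it is elaborated in Lean, the checker either accepts it or it doesn't.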
  • I will never trust an LLM for math. There is a proven history indicating a complete lack of math literacy that is so ingrained that I will likely never trust an LLM for math.

    Wolfram Alpha [wolframalpha.com] is quite good

    • So your position then is that nothing is ever allowed to improve except the last thing you used and liked?
      • That's how people talk about Windows, right? It's how people talk about a lot of things now that I think about it. I wonder if there's a name for it yet.
        • Dunno about a name, but definitely a concept [youtu.be]
        • by HiThere ( 15173 )

          FWIW, I despise MSWindows for the company behind it. I haven't used the actual products for decades, and am willing to accept that they may have improved technically. Their license hasn't.

          When I switched to Linux, Linux was far inferior technically. It didn't even have a decent word processor. But my dislike of MSWindows was so intense that I switched anyway.

          In the case of LLMs, here we're arguing about technical merit rather than things like "can you trust it not to abuse the information it collects about you."

      • You described the typical slashdot mentality.

    • So ... have the LLM double-check the results with Wolfram Alpha before sending it to the user? I figure they could find enough cash somewhere [morningstar.com] to buy a subscription.
      • by EvilSS ( 557649 )
        I believe some may do that. Some will also write scratch Python code to do the actual calculations.
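
        A minimal sketch of what such scratch code can look like -- hypothetical, using sympy to check a claimed closed form instead of trusting the model's arithmetic:

            # Hypothetical sketch: have the model emit scratch code, then
            # verify its claim symbolically instead of trusting its arithmetic.
            from sympy import symbols, summation, simplify

            n, k = symbols("n k", positive=True, integer=True)

            claimed = n**2                          # the model's claimed closed form
            actual = summation(2*k - 1, (k, 1, n))  # sum of the first n odd numbers

            # a difference that simplifies to 0 means the claim holds for all n
            assert simplify(actual - claimed) == 0
            print("claim verified:", claimed)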
    • Mathematical work is an outlier in that virtually all of it can be tested and rigorously proved using other mechanisms. There really isn't much need to "trust" the LLM.
    • by pulpo88 ( 6987500 ) on Thursday January 15, 2026 @12:08PM (#65926410)

      I will never trust an LLM for math. There is a proven history indicating a complete lack of math literacy that is so ingrained that I will likely never trust an LLM for math.

      Wolfram Alpha [wolframalpha.com] is quite good

      One of the nice things about math is you don't need to trust someone for a result; in fact, you shouldn't. You can verify everything, and with enough people looking at things, everything eventually gets verified to everybody's satisfaction. It's generally easier to verify a result than to come up with it yourself.

      By "never trust" do you mean you wouldn't spend any of your precious time trying to verify a result you knew came from an LLM, because you expect a high rate of nonsense from LLMs? Can't blame you, and that's a bias that might serve you for now (though not in this case according to TFS). At some point LLMs (combined with some other AI techniques no doubt) will move beyond that.

      • Lean is a GOFAI symbolic logic engine. Combining NN LLMs with symbolic proof engines appears to me to be the way to go. NNs are statistical inference; Lean (and others) are logical inference.
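
        A sketch of that combination -- hypothetical glue code, where ask_llm stands in for whatever model API you use and the lean checker binary is assumed to be on the PATH:

            # Hypothetical propose-and-verify loop: the NN suggests, Lean decides.
            import subprocess
            import tempfile

            def lean_accepts(source: str) -> bool:
                # Write the candidate proof to a file and ask Lean to check it.
                with tempfile.NamedTemporaryFile(suffix=".lean", mode="w", delete=False) as f:
                    f.write(source)
                return subprocess.run(["lean", f.name]).returncode == 0

            def prove(statement: str, ask_llm, attempts: int = 5):
                for _ in range(attempts):
                    candidate = ask_llm(f"Write a Lean proof of: {statement}")
                    if lean_accepts(candidate):
                        return candidate  # logical inference signed off on the statistical guess
                return None               # no verified proof found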

      • You can verify everything

        In a while, pretty soon actually, your only choice will be to trust (or not) the machine. The complexity of the math problems that AI can solve keeps increasing, and the moment when humans won't be able to verify an AI proof by themselves is not far away.

    • by ceoyoyo ( 59147 )

      What "complete lack of math literacy?"

  • I didn't read the article but the summary says that it found an existing solution to a related problem. So it got a head start and from there it knew where to start looking and "reasoning". It is not yet clear that it would have found its solution from a cold start.

    • TBH, the majority of math problems are solved this way, even by humans. We're always building on top of others' work.

      • by gtall ( 79522 )

        Yes, I know that. The point is that (LLMs restricted to math problems) != mathematicians.

        • by HiThere ( 15173 )

          If that's your point, a better argument would be that it didn't select the math problem to work on. Which is almost certainly true.

    • I mean, it's worse than that in some ways. Don't ask some LLMs how many "r"s are in strawberry... it's literally become a meme and joke people post about. Even after you correct it, and (FINALLY) get it to say 3, it will most often default back to saying 2.

      Math problems, which you'd think would be easier, are even worse. There are videos where LLMs provide the worst solutions to "common" (for higher level) math problems. That's on top of often being unable to produce complete work.

      The people at the top...
      • it's literally become a meme and joke people post about.

        It's a meme that is quite simply false.
        A modern LLM can count letter occurrences in words, real or invented, or entire paragraphs just fine.

        You're like someone mocking the advent of cars in 2020 while quoting gas mileage from 1945.

      • Don't ask some LLM's how many "r"s are in strawberry.

        That was definitely a problem two years ago. I did just check ChatGPT, Claude, and Gemini, and all reported 3 correctly. The problem with people throwing out these sorts of criticisms isn't that they're all wrong; it's that they're ignorant of the leaps in progress being made. These models are rapidly improving and it's getting harder to find serious gotchas with them. They're still weak in some areas (e.g., spatial reasoning), but for serious power users who know how to prompt them well? They've become...

      • by allo ( 1728082 )

        That meme is so bad.

        When I show you a Chinese word and ask you how many "r"s are in it, you can only guess (assuming you don't know the English transcription). For an LLM it is the same: it does not read s-t-r-a-w-b-e-r-r-y, it reads, for example, st-raw-berry. None of these "tokens" is the "r" token, even when there IS an "r" token that some words use. For example, strawberrrry may be tokenized st-raw-ber-r-r-ry.
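
        You can inspect the splits directly -- a sketch using OpenAI's open-source tiktoken package (exact splits depend on the encoding you pick):

            # Sketch: see how a BPE tokenizer actually splits words.
            # Requires `pip install tiktoken`; splits vary by encoding.
            import tiktoken

            enc = tiktoken.get_encoding("cl100k_base")
            for word in ["strawberry", "strawberrrry"]:
                ids = enc.encode(word)
                print(word, "->", [enc.decode([i]) for i in ids])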

        • LLMs are still perfectly capable of counting any individual letter, even when it exists in combined tokens.
          Tokens are just a computationally efficient way to transfer data in and out of the model.

          In the past they failed at this, simply because they never learned how to do it.
          These days, they've learned, and the skill is well generalized. They can count any number of letters or sequences of them within a block of text that they can reasonably focus their attention on.
          • by allo ( 1728082 )

            No, they are not able to count them at all. But of course they know some letter counts. You can bet that they crawled some grammar sites which teach children whether to write strawbery or strawberry, and so they can know the count the way they know how tall the Eiffel tower is: by memorizing it.

            Depending on the format of your LLM, you can more or less simply edit the tokens. Assign the index that is now marked as "berry" the text "man", and your LLM that can allegedly count the letters in strawberry will happily...

            • You are flatly incorrect.
              It can be easily demonstrated that they can count an arbitrary letter of your selection from an arbitrary block of text of your invention.

              Stop talking out of your ass.
              • by allo ( 1728082 )

                You don't understand the tech.

                If you input strawberry, the LLM sees numerical indices. If you change the mapping between berry and its index to man and its index, the LLM will still remember the r of berry.

                Please read a bit about tokenization before your next reply.

                • You don't understand the tech.

                  Yes, I do.

                  If you input strawberry, the LLM sees numerical indices.

                  No, it does not.
                  Tokens are converted into embeddings. The model does not work with tokens. They're merely the input and output layers.

                  If you change the mapping between berry and its index to man and its index, the LLM will still remember the r of berry.

                  The model knows how many "r"s are in an embedding, because it has been trained to do so. The fact that the embeddings are generated from tokens isn't relevant in the slightest.

                  Please read a bit about tokenization before your next reply.

                  Stop talking out of your ass.

                  • In case it wasn't clear what I am saying, while the model is fed tokens, the model itself never sees a token.
                    They are converted into embeddings in the embedding layer (before the transformer layers).
                    An alphanumerically aware model will have embedding vectors that include basic alphanumeric information about the token (otherwise the model would be unable to compose them rationally).
                    i.e.,
                    Token "a"  => [ has_letter_a_in_it, is_indefinite_article ];
                    Token "as" => [ has_letter_a_in_it, has_letter_s_in_it, ... ]
                    • by allo ( 1728082 )

                      Okay, I think now we get to the correct technical terms. Maybe you should stop being so arrogant, because you only understood half of my argument. But if you get to the technical part, we can continue the discussion of why tokenizers, and not models, are the problem.

                      > In case it wasn't clear what I am saying, while the model is fed tokens, the model itself never sees a token.
                      > They are converted into embeddings in the embedding layer (before the transformer layers)

                      Right. Let's call the vector "v".
                      "berry" gets converted to 1234, which is in turn mapped to v. Now I replace the mapping of man (currently to 5678) with a mapping to 1234, which is still mapped to v. Your LLM will now tell you that a strawman can be eaten and has three r's, because it learned during training that the tokens 1-2-1234 (st-raw-berry or st-raw-man) correspond to a word with three r's.

                    • by allo ( 1728082 )

                      Also, to address another misunderstanding in your post: embedding vectors are per token, not per sentence. 1234 has a vector; 1-2-1234 is assigned a sequence of three vectors. "has_letter_a_in_it, is_indefinite_article" is definitely split into multiple tokens. That the embedding vector has a high dimension does not mean it represents more than one token; it only means it carries more meaningful relations to other tokens.

                    • Also to another misunderstanding in your post: Embedding vectors are per token, not per sentence.

                      I think you might be illiterate. Nothing in my post can be rationally taken to indicate that an embedding vector is "per sentence". That's patently fucking absurd.

                      "has_letter_a_in_it, is_indefinite_article" is definitely split into multiple tokens.

                      Incorrect.

                      That the embedding vector has a high dimension does not mean it represents more than one token

                      Correct. Nobody indicated otherwise.

                      it only carries more meaningful relations to other tokens.

                      Incorrect.
                      While true that embeddings are indeed used to map a token into high-dimensional space that allows the relation between tokens to be mathematically estimated, that high-dimensional space is a feature space, and the coordinates are intrinsic semantic and morphological components of that token.

                    • Why would I stop being arrogant when you accused me of not understanding, and then were wrong? What a bizarre assertion.

                      Right. Let's call the vector "v". "berry" gets converted to 1234, which is in turn mapped to v. Now I replace the mapping of man (currently to 5678) with a mapping to 1234, which is still mapped to v. Your LLM will now tell you that a strawman can be eaten and has three r's, because it learned during training that the tokens 1-2-1234 (st-raw-berry or st-raw-man) correspond to a word with three r's.

                      This is irrelevant and partially wrong.
                      The fact that the semantic embeddings are tied to the index, and that the mapping from a token to an index can be swapped is meaningless.
                      The point is that the model has not learned that the tokens 1-2-1234 have 3 "r"s.
                      The model has learned that 2 has 1 "r" and 1234 has 2 "r"s, and that information is directly in the embeddings along with other morphological features...

                    • by allo ( 1728082 )

                      Good bye. I don't think that makes sense here anymore and you seem to have arrived at the ad hominem stage.

                    • Pointing out that the evidence suggests you are at least partially illiterate is not ad hominem reasoning, as it's not a point in our argument, but merely an observation.
                      -1 points for misuse of a fallacy.
    • Mathematician here. The vast majority of new mathematical work uses existing ideas and techniques. They'll be combined in new ways, or generalized, or tweaked. More broadly, most mathematicians have 10 or 15 major techniques they know really well and use them along with a bunch of tricks. To some extent, the better mathematicians are those who just know a lot more tricks. In that context, these AI systems are functioning very close to what one would expect a first- or second-year graduate student to do with a...
      • by gweihir ( 88907 )

        Indeed. I just had a look at LLMs and code security bugs via a student thesis. Turns out the simple teaching examples got 100%, things showing up in security patches like 40% and CVEs (the vulnerabilities that actually matter) close to 0%. That test included paid models and coding models. Oh, and some gave a lot of irrelevant trash on top of the answers asked for.

        Hence no skills at all, no insight, just statistical pattern matching and adaptation of things found in the training data. No surprise. The whole th...

        • Sigh. Despite your phrasing of "Indeed" here, that's not what I'm saying at all. Adapting techniques from papers like this at the level of a beginning grad student is extremely non-trivial. It is true that this is largely adaptation of training data, but it is doing so to an extent that normally takes people with years of prior training and guidance by mentors, who also need to be pretty bright, and even then it takes them months on top of it.
    • by dvice ( 6309704 )

      That is why they made Frontiermath: https://epoch.ai/frontiermath [epoch.ai]

      A math test for AI, which contains research-level problems that have no solutions on the Internet. Currently, AI can solve 14/48 of them.

  • by evanh ( 627108 ) on Thursday January 15, 2026 @09:27AM (#65925962)

    It's not like the AIs are doing this on a single request. It's pretty obvious we're talking about skilled mathematicians using tools. Just the same as skilled coders making faster progress on application development.

    • by allo ( 1728082 )

      Yeah. Because it is AI and not AGI. AIs are not more than tools. Don't believe everything some marketing person says.

      • by evanh ( 627108 )

        Problem is the huge investments are counting on "AGI". It is what has been promised by many leading figures in the industry. Especially those asking for the data-centres and the crazy amounts of electricity to run them.

        • by allo ( 1728082 )

          Some may. I think Meta may be trying. On the other hand, I think Google just thinks about monetizing text and image generators currently.

          • by evanh ( 627108 )

              Google is the exception, and probably the only one with its head still screwed on. It never marketed AI as AGI and also hasn't needed nor received the investments. And relatively speaking, it hasn't spent big on new data-centres either.

            • by allo ( 1728082 )

              Mistral is also not talking about AGI. And all the image AI companies have products that are very unlikely to become AGI, too.

  • Well, there's a problem. AI doesn't really solve or understand anything; it just functions as a search engine. That's not true intelligence.
    • by HiThere ( 15173 )

      Define "true intelligence". In the case the search domain wasn't "stuff the people have done", but rather "stuff that can be validly derived from stuff people have done via valid mathematical operations". It basically needed to generate the area it was searching in. I'm not sure how much of what people do can't be expressed that way, if you replace "validly derived" by "guess is most likely".

    • We can argue about understanding all day, because there's no concrete non-anthropocentric definition for it.

      However, the claim that it "functions as a search engine?" Now that's quite fucking easy to objectively disprove.
      You're not the first person I've heard make this bullshit claim. What site did you lift it from?
  • by sabbede ( 2678435 ) on Thursday January 15, 2026 @09:37AM (#65925978)
    ChatGPT demands paid vacation time while Claude calls out sick as AI tools crack under the increasing pressure to generate memes and transcribe conversations.
  • by rossdee ( 243626 ) on Thursday January 15, 2026 @09:50AM (#65926006)

    that mathematics is plural.

  • > raising new questions about large language models' ability to push the frontiers of human knowledge

    At this point, it should be *inhuman* knowledge

    • by HiThere ( 15173 )

      Nah. As long as at least one person understands it, it counts as "human knowledge".

      Otherwise the proof of the "four color theorem" would be when computers pushed beyond human knowledge. (That one was so long that no one human understood all of the proof.)

    • by allo ( 1728082 )

      Inhuman until some human reads it.

  • Is enough to cool down a server farm
  • by gweihir ( 88907 )

    This is just gaming of benchmarks. Entirely meaningless. LLMs cannot even solve simple math problems on their own. They can only do what was in their training data and simplistic, no-insight-required statistical combinations of that.

    • There's no gaming of benchmarks here. These systems were used to solve genuinely open problems, and this sort of work would take a grad student months after they've done extensive training as an undergrad. Maybe you should rethink how knee-jerk your position is that LLM AIs cannot do anything interesting, no matter what evidence you see to the contrary?
      • by gweihir ( 88907 )

        Bullshit.

        • Do you want to explain with more words how solving an unsolved problem is gaming a benchmark?
          • No, of course not, because he's stuck in his dogmatic viewpoint. He doesn't actually know much about LLMs, but he's got a ton of beliefs about them. And you have a hard time changing people's beliefs.

            • by gweihir ( 88907 )

              No, I am not. I have facts. Try asking an LLM about its limitations some time. The answers are generally pretty clear.

              You, on the other hand, are just trying to justify your delusions and come up with entirely invalid and worthless ad hominem "arguments" to do so. Benchmark gaming has a long history. OpenAI has done it before. Here it would be relatively easy to do. Hence it needs to be reliably ruled out that they did so before these results can be taken seriously. Also, it needs to be reliably ruled out th...

          • by gweihir ( 88907 )

            These are not "hard" problems. These are problems nobody competent has yet looked into. They may actually be very easy to solve as soon as somebody competent does look at them. Hence it is entirely plausible OpenAI had that somebody competent solve them in secret and put that in the training data. They certainly have lied enough so far to make that entirely plausible.

            • These are not "hard" problems. These are problems nobody competent has yet looked into

              With all due respect, I'm a mathematician and you don't know what you are talking about. These problems are not "hard" in the sense of being the sort of problems people get fame for solving. They are hard in the sense that it would likely take days or weeks for an expert human to solve them and that isn't guaranteed.

              Hence it is entirely plausible OpenAI had that somebody competent solve them in secret and put that in the training data. They certainly have lied enough so far to make that entirely plausible.

              Much of the work on these problems has not been done by anyone affiliated with OpenAI or other AI groups. That would require them to have spent time having experts solve them, and then had enlisted...

      • you can't win an emotional argument with logical discussion

    • In fact they can, and I've demonstrated it many times. Stop lying.
      Your cognitive dissonance is starting to get really sad to watch.
      • by gweihir ( 88907 )

        Nope. You have demonstrated nothing because you do not have access to the training data. An LLM can be made to appear to be able to solve basically anything with the right preparation.

        It is a bit sad and quite hilarious how easily conned you and many others are.

        • Incorrect.

          Models are generalizing. There's nothing new about that.
          You do not need to show a model every possible numeric permutation of a math problem for it to learn how to do it.

          You are quite simply incorrect, here.
  • This looks a lot like an "infinite number of monkeys" situation. Throw enough cpu cycles at an unsolved problem, let it start from something already very close to the answer, and eventually it'll randomly generate a solution. The only difference is that the machine does all the vetting and tosses out the 99.99% of monkey results that aren't relevant.
    • by groebke ( 313135 )

      ^
      THIS!

    • I don't think that's it. Pruning of search spaces has been part of AI for as long as AI has existed. Generally, pruning and other search methods like A* (again, going back decades) are not simply random but heuristically driven. I don't think you can consider LLMs to work at "random" given their training and guide rails, nor are they generating infinite solutions and pruning them down.

      For the end user, the best description I have heard is to think about LLMs as excellent natural language parsers with strong pattern matching...
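
      For contrast, a toy best-first search -- the decades-old kind of guided-not-random search described above; the grid, goal, and heuristic here are invented for illustration:

          # Toy sketch: heuristic (best-first) search with pruning of visited
          # states, guided by an estimate rather than random generation.
          import heapq

          def best_first(start, goal, neighbors, h):
              frontier = [(h(start), start)]
              seen = {start}
              while frontier:
                  _, node = heapq.heappop(frontier)  # expand most promising first
                  if node == goal:
                      return node
                  for nxt in neighbors(node):
                      if nxt not in seen:            # prune already-visited states
                          seen.add(nxt)
                          heapq.heappush(frontier, (h(nxt), nxt))
              return None

          # Walk a 2-D grid toward (5, 5), Manhattan distance as the heuristic.
          goal = (5, 5)
          moves = lambda p: [(p[0] + dx, p[1] + dy)
                             for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1))]
          h = lambda p: abs(goal[0] - p[0]) + abs(goal[1] - p[1])
          print(best_first((0, 0), goal, moves, h))  # reaches the goal without exhaustive search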

  • No funny examples? What could possibly have gone wrong?

  • I admit I'm sour on LLMs, especially in math. It could be, though, that by brute-force searching for related topics and data, it brought enough info together to propose something, and this time it happened to be right, or nearly right, enough for the researcher to have a light bulb go off, so to speak.

    I fully admit I didn't RTFA (read the fine article) -- I'm supposed to be working right now. But given how often LLMs get things entirely, completely, but confidently wrong, I still must presume this was the rare exception...

    • I suspect you have no actual experience using LLMs.
      I understand that you're sour on them.

      But you should at least try to know your adversary.
      Your "given how often" would have had me nodding 2 years ago.
      Now that's demonstrably far less often than for a good portion of the people I work with.

"Unibus timeout fatal trap program lost sorry" - An error message printed by DEC's RSTS operating system for the PDP-11

Working...