Microsoft Copilot Joins ChatGPT At the Feet of the Mighty Atari 2600 Video Chess (theregister.com) 39

Robert Caruso once again pitted an AI chatbot against Atari 2600 Video Chess -- this time using Microsoft's Copilot instead of ChatGPT. Despite confident claims of chess mastery, Copilot fell just as hard. The Register reports: By now, anybody with experience of today's generative AI systems will know what happened. Copilot's hubris was misplaced. Its moves were... interesting, and it managed to lose two pawns, a knight, and a bishop while the mighty Atari 2600 Video Chess was only down a single pawn. Eventually, Caruso asked Copilot to compare what it thought the board looked like with the last screenshot he'd pasted, and the chatbot admitted they were different. "ChatGPT deja vu."

There was no way Microsoft's chatbot could win with this handicap. Still, it was gracious in defeat: "Atari's earned the win this round. I'll tip my digital king with dignity and honor [to] the vintage silicon mastermind that bested me fair and square." Caruso's experiment is amusing but also highlights the absolute confidence with which an AI can spout nonsense. Copilot (like ChatGPT) had likely been trained on the fundamentals of chess, but could not create strategies. The problem was compounded by the fact that its understanding of the positions on the chessboard appeared to differ markedly from reality.

The story's moral has to be: Beware of the confidence of chatbots. LLMs are apparently good at some things. A 45-year-old chess game is clearly not one of them.

Comments Filter:
  • I tried to play chess with ChatGDP. It constantly said it was not designed to do this. I prodded it and got about 7 moves out of it. It is a chatbot and not a chess player. I know this and it knows this. It did play a great game, after I repeatedly asked, "if I played this, what would you do?" It is not a chess player.
    • Re: (Score:2, Insightful)

      by Anonymous Coward

      Imagine your AGI gets owned so hard at chess that instead of just giving it superintelligence to beat chess with, you write an init prompt telling it to tell the customer that it's not a chess player.

      Now do something NP complete.

    • I was going to complain about your typo, but just realized in the context of the US tax spending bill passing the house today it was a clever reference to economic (like health) policy from misused LLMMs.

      Bravo. I applaud your subtlety

  • It is just ludicrously slow. But as a chess player it's surprisingly good, if you're willing to be very, very patient.
    • eh, no. I played that for a few rounds and got bored by beating it. I am on Chess.com if you are willing. I am rated around 1750. Not great but I play exciting games... with sacrifices and dramatic checkmates.
      • by muntjac ( 805565 )
        1750 on chess.com is like a god compared to most average players, isn't it? I don't think that's a good representation of how most people would fare against the Atari.
        • humbly, yes, I think 1750 is like a god. I was a child prodigy, and was a chess champion. I have many fond memories.
      • by test321 ( 8891681 ) on Thursday July 03, 2025 @08:20PM (#65495416)

        People elsewhere estimate the Atari 2600 Video Chess to be ~1300 ELO https://www.reddit.com/r/chess... [reddit.com]

        • back in that day, the programs did not recognize about pawns becoming queens when they hit the eighth square, or did not recognize simple sacrifices to go for a checkmate. I got bored with them quickly.
          • It's expected. At their core, all chess algorithms search a min-max tree, but instead of doing an exhaustive width- or depth-first search, they use heuristics to prioritize some branches of the tree (A-star).

            On the modest hardware of the older machine there isn't that much you can explore before the player gets bored waiting.
            So obviously, you're going to make much more stringent rules: "Never take a branch where you lose a piece" prunes entire swaths of the tree, rather than "see if sacrificing piece XXX gives us a
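The pruning the parent describes can be sketched as minimax with alpha-beta cutoffs. The game tree below is a made-up nested-list toy, not a real chess position; leaves stand in for heuristic evaluations:

```python
# Minimal sketch of minimax with alpha-beta pruning over an abstract
# game tree. Leaves are plain numbers (heuristic scores); interior
# nodes are lists of children. This is a toy, not a chess engine.

def alphabeta(node, depth, alpha, beta, maximizing):
    if depth == 0 or isinstance(node, (int, float)):
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:  # "never take this branch": prune the rest
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

# Two-ply toy tree: the maximizer picks the branch whose worst
# (minimizer's) outcome is best: min of each branch is 3, 6, 1 -> 6.
tree = [[3, 5], [6, 9], [1, 2]]
best = alphabeta(tree, 2, float("-inf"), float("inf"), True)
```

On the third branch the cutoff fires after seeing the leaf 1, since no minimizer result there can beat the 6 already banked; that is exactly the "prune entire swaths" effect.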

    • by AmiMoJo ( 196126 )

      Its main advantage seems to be that it knows where the pieces are on the board.

      I've had ChatGPT forget the current state of things with other stuff too. I asked it to do some web code, and it kept forgetting what state the files were in. I hear that some are better like Claude with access to a repo, but with ChatGPT even if you give it the current file as an attachment it often just ignores it and carries on blindly.

      In fact one bug it created was due to it forgetting what it named a variable, and trying to

      • I've had ChatGPT forget the current state of things with other stuff too. I asked it to do some web code, and it kept forgetting what state the files were in. I hear that some are better like Claude with access to a repo, but with ChatGPT even if you give it the current file as an attachment it often just ignores it and carries on blindly.

        Yup, they currently have very limited context windows.

        And it's also a case of the wrong tool for the job. Keeping track of very large code bases is well within the range of much simpler software (e.g. the thing that powers the "autosuggest" function of your IDE, which is fed from a database of all function/variable/etc. names in the entire codebase).
        For code, you would need such an exhaustive tool to give the list of possible suggestions and then the language model to only predict which from the pre-fi
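The "autosuggest database" idea above can be sketched as a toy symbol index built with deliberately simplistic regexes (they would miss attributes, catch comparisons, etc.); a model could then be constrained to names from this index instead of inventing them:

```python
import re

# Toy symbol index: scan Python source text for function definitions
# and top-level assignments, the kind of table an IDE's autosuggest
# keeps. The regexes are intentionally naive, for illustration only.

DEF_RE = re.compile(r"^\s*def\s+(\w+)", re.MULTILINE)
ASSIGN_RE = re.compile(r"^(\w+)\s*=", re.MULTILINE)

def symbol_index(source):
    """Return a sorted list of names defined in the source text."""
    return sorted(set(DEF_RE.findall(source)) | set(ASSIGN_RE.findall(source)))

code = """
retry_count = 3
def fetch_page(url):
    return url

def parse_page(html):
    return html
"""
```

Here `symbol_index(code)` yields `fetch_page`, `parse_page`, and `retry_count`; a generator restricted to that list could not "forget what it named a variable" the way the grandparent describes.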

  • by cathector ( 972646 ) on Thursday July 03, 2025 @08:29PM (#65495428)

    this is just click bait.
    everyone knows these models are not good at actual gameplay, nor is it news that they will confidently mis-state stuff. It wasn't news on the first round and it's still not news, and it misses the point that there is a ton of stuff that humans currently do which the models will do cheaper.

    • To pull at a thread, I did play ChatGPT for 6 or 7 moves. It did do well. I know it scanned and consumed like.. all of the great chess games ever played. It can only predict the next word, or move. That seems like the nature of LLMs. If I ever can coax ChatGPT to play a whole chess game.. I will let you know the results.
      • I know it scanned and consumed like.. all of the great chess games ever played. It can only predict the next word, or move.

        ...and this was demonstrated eons ago using hidden Markov models.
        (I can't manage to find the website with the exact example I had in mind, but it's by the same guy who had fun feeding both Alice in Wonderland and the Bible into a Markov model and using it to predict/generate funny walls of text).

        That seems like the nature of LLMs. If I ever can coax ChatGPT to play a whole chess game.. I will let you know the results.

        The only limitation of both old models like HMMs and the current chatbots is that they don't have a concept of the state of the chess board.

        Back in that example, the dev used a simple chess software to keep

    • I think it's kind of delicious to see chess used as a benchmark of intelligence again. Of course the chatbot could be augmented with a chess engine that it knows how to invoke to easily beat any human. But using an LLM as a chess engine itself is a nice challenge. Maybe there's a way to do it, or maybe AI as we know it needs more visualization capability. Or maybe if the AI can write a good chess engine given the rules.

      In any case a single guy trying to prove a negative by failing to do something (set

      • I would prefer that an AI Company just goes for General Intelligence. Maybe something self-aware. I sometimes think that I am smarter than and better than other people because I can beat them at chess, but then I hear my mom yelling at me, and possibly feeling her slapping me on my butt, telling me that because I am good at one thing, does not mean that I am "better than", other people.
      • by allo ( 1728082 )

        You could maybe get an average engine if you really optimized your prompts with things like "Always output a representation of the current board at the end of your answer", but by just using a dialog and hoping that all moves make sense, you're really benchmarking whether the LLM can reconstruct the state of the board from a long dialog, with some weak move as a bonus. If you want to make this competition fair, first think real hard about how to optimize an LLM to play chess as well as possible, instead of using an arbitr

    • it misses the point that there is a ton of stuff that humans currently do which the models will do cheaper.

      A WORKING AI can do a lot of things cheaper than MANY jobs. However, nothing I have seen can reliably generate Java that compiles... it makes obvious, basic syntax mistakes, like missing braces or randomly placed semi-colons. Java may not be your jam, but if it can't do that... what can it do? That's an easy use case and perfect for AI.

      I use Claude 4.0 daily at work with a mandate by my employer...their quote "those who don't embrace AI will be replaced by programmers who do." OK, kewl....I want to be more pr

  • HVAC Repair... (Score:5, Interesting)

    by NobleNobbler ( 9626406 ) on Thursday July 03, 2025 @08:41PM (#65495460)

    Was diagnosing an HVAC low delta problem on 3 hours of sleep and tried some LLMs as an experiment. They all sounded 20-alarm fires, said the compressor was going to explode, and absolutely went all in on catastrophe.

    Then I noticed the liquid line had a lower pressure than the suction line.

    I reversed the probes.

    The things that "AI" misses are outrageous. The language it uses is definitive and it draws on complex topics.

    And it misses literally classroom 101 common sense sanity checks.

    • I work with insanely high voltages, and with software too. I do research and development. We have to have a firm grasp of what AI is now and what it is not. It does not have common sense, it simply predicts the next word in a sentence. I find that amazing when I am writing software.
      • by leptons ( 891340 )
        I find the LLM is wrong 90% of the time when it tries to write software for me.
        • me2. I like that it does the simple stuff though. It is great at syntax and making loops. It goes nuts when I tell it that stuff does not work, it keeps apologizing and telling me that it does work.
  • Everyone here is so righteous and says this isn't AI, but this thing can make moves and use its limited abilities to predict the next best move. Given its limited skill set, the fact that it can even play is already miraculous.
    • by evanh ( 627108 )

      The Atari 2600 manages with maybe a 0.1 MIPS single core, a few kBytes of RAM, not a lot more ROM, and one or two Watts of electricity.

    • the fact that it can even play is already miraculous.

      LLMs invent invalid moves. It is somewhat miraculous that they can sometimes generate valid ones, but the real miracle is that people think there is intelligence there. It's a miracle for the people selling "AI" anyway.

  • A chess AI like the Atari's has an internal representation of the current state (exact!), can keep a history of previous states (I don't know if it does), has a list of legal moves and no option to attempt another one, and a planning algorithm that can go many moves ahead before deciding on one.

    They benchmarked this against a LLM, that needs to parse the history of the game to (hopefully) get some kinda meaningful representation in its latent space and then can try to make up a move. Even with reasoning it
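The "list of legal moves and no option to attempt another one" property the parent credits the Atari engine with can be shown in a few lines. The move list here is a hypothetical stub, not a real move generator:

```python
# Sketch of the structural guarantee a classical engine has: it only
# ever selects from an enumerated legal-move list, so an illegal move
# is impossible by construction. Moves use coordinate notation here,
# and the legal-move list is hard-coded for illustration.

def choose_move(legal_moves, preference):
    """Pick the preferred move if legal, else fall back to a legal one."""
    if preference in legal_moves:
        return preference
    return legal_moves[0]  # the engine can never step outside this list

legal = ["e2e4", "d2d4", "g1f3"]
```

An LLM emitting moves as free text has no such constraint, which is why it can "invent invalid moves" as noted upthread.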

    • I wonder if you wouldn't win if you just told ChatGPT to write a chess AI and then used that chess AI to beat the Atari. Writing code is something text models are good for. Playing chess is not.

      The devil is in the detail.
      All chess algorithms are A-star: they search a min-max tree, but use heuristics to prioritize some branches instead of doing width- or depth-first search.
      Generating a template of a standard chess algorithm would probably be easy for a chatbot (these are prominently featured in tons of "introduction to machine learning" courses that the LLM's training could have ingested), but writing the heuristic function to guide the A-star search is more of an art form and is probably where the chat bot is going
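The simplest version of the handcrafted evaluation heuristic mentioned above is a plain material count. Piece values are the conventional ones; the board representation (a dict of squares to piece letters, uppercase white, lowercase black) is a made-up convenience, not any engine's actual format:

```python
# Hedged sketch of a material-count evaluation heuristic, the baseline
# that a real engine's hand-tuned evaluation function starts from.

PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

def material_score(board):
    """Sum piece values: positive favors white, negative favors black."""
    score = 0
    for piece in board.values():
        value = PIECE_VALUES[piece.lower()]
        score += value if piece.isupper() else -value
    return score

# White has a knight (+3), black a pawn (-1): net +2 for white.
board = {"e4": "N", "d5": "p", "e1": "K", "e8": "k"}
```

Real evaluation functions layer king safety, mobility, pawn structure and so on over this; tuning those weights is the "art form" part.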

    • You could probably have an 'AI' that was not a language model but purely a predictive model for chess moves. There are a great many chess games, especially at high level, that consist entirely of moves someone has played before. You could feed it a shitton of games, then feed it moves in the same format as the training data. I imagine you'd be able to confuse that 'AI' by playing crazy nonstandard moves, but so long as you stuck to standard lines it would probably give you a pretty good game.

      That's one of the
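The purely predictive "AI" described above can be sketched as a frequency table over move prefixes, essentially an opening book. The three games in the corpus are toy examples, not real game data:

```python
from collections import Counter, defaultdict

# Sketch of a prediction-only chess "AI": for every prefix of moves
# seen in the training games, record which move followed it, then
# always replay the most common continuation.

def build_book(games):
    book = defaultdict(Counter)
    for game in games:
        for i in range(len(game)):
            book[tuple(game[:i])][game[i]] += 1
    return book

def predict(book, moves_so_far):
    """Most common continuation of this prefix, or None if unseen."""
    counts = book.get(tuple(moves_so_far))
    return counts.most_common(1)[0][0] if counts else None

games = [
    ["e4", "e5", "Nf3", "Nc6"],
    ["e4", "e5", "Nf3", "Nf6"],
    ["e4", "c5", "Nf3", "d6"],
]
book = build_book(games)
```

As the comment predicts, this plays sensibly on standard lines (after `e4` it answers `e5`, the majority reply) but returns nothing at all for a prefix it never saw, which is exactly how nonstandard moves would confuse it.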

  • ...except media gullibility. Look, I get it. Watching a generative AI flail against an Atari 2600 is funny. It plays well on social media. It makes people feel good about “real” computing. But let’s be clear: LLMs getting curb-stomped by 8-bit silicon in a chess match isn’t just apples to oranges — it’s apples to architecture diagrams.

    ChatGPT and Copilot are language models. They don’t play chess the way AlphaZero or even Stockfish does. They generate plausible descriptions of chess moves based on training data. They aren’t tracking game state in structured memory. They don’t use a search tree or evaluation function. They’re basically cosplaying a chess engine — like a high schooler pretending to be a lawyer after binge-watching Suits.

    And you can flip that analogy around and still make it work: expecting an LLM to beat a dedicated chess algorithm is like asking Tom Cruise to fly a combat mission over Iran just because he looked convincing doing it on screen.

    Meanwhile, even the humble Atari 2600 version of Video Chess was running a purpose-built minmax search algorithm with a handcrafted evaluation function — all in silicon, not tokens. It doesn't have to guess what the board looks like. It knows. And it doesn't hallucinate, get distracted, or lose track of a bishop because the move history got flattened in the working token space.

    So what does this little stunt prove? That LLMs aren't optimized for real-time spatial state tracking? Shocking. That trying to bolt a complex turn-based system onto a model that lacks persistent memory and visual context is a bad idea? Groundbreaking. That prompt-driven hubris doesn’t equal capability? You don't say.

    This isn’t a fair fight. It's a stunt for attracting eyeballs and mouse clicks. And it's about as informative as asking an Atari to write a sonnet or explain Gödel’s incompleteness theorems — both of which LLMs can do, and often better than most poets or mathematicians could manage on the fly. Wake me when someone wires up a transformer-based architecture with structured spatial memory and an embedded rules engine — something capable of reproducing the cognitive contours in Hilbert space that mirror what biological chess engines like Boris Spassky or Bobby Fischer did in their wetware.

    Until then, all this proves is that language models are terrible chess engines — which is like saying your microwave is bad at making omelets. We knew that already.

    • Until then, all this proves is that language models are terrible chess engines — which is like saying your microwave is bad at making omelets. We knew that already.

      OK, so what is it good for? ChatGPT was released 3 years ago. Think of how much the internet advanced 3 years after Netscape Navigator 1.0 was released. What is something commercially valuable these LLMs can actually do that we can objectively see? (no, Mark Zuckerberg promising his developers are so much more productive with them is pure bullshit...as he has presented no evidence and based on subjective "vibes").

      The wealthiest companies in history have poured trillions into and hired the best minds an
