Maxim Saplin

Posted on Nov 21, 2024 • Edited on Jan 29

Can LLMs Play Chess? I've Tested 13 Models (GPT-4o, Claude 3.5, Gemini 1.5 etc.)

#ai #machinelearning #genai #llm

UPD January 25, 2025: Deepseek R1 is another model that broke the ceiling of zero wins showing meaningful chess performance -> https://dev.to/maximsaplin/deepseek-r1-vs-openai-o1-1ijm

UPD January 24, 2025: I have increased the number of simulations and a few of the non-reasoning models scored occasional wins (Claude 3.5, GPT-4o), yet it doesn't seem they actually play chess, maintaining negative material diff (except Grok-2, it had positive diff yet struggled with instruction following).

UPD January 16, 2025: o1-mini and o1-preview are definitely special models, they were the first ones to demonstrate meaningful chess games scoring a reasonable number of wins -> https://maxim-saplin.github.io/llm_chess/

No, they cannot play chess, if by playing you mean winning rather than merely moving pieces on the board. I ran multiple simulations pitting LLMs against a random player, and not a single LLM scored a win.

Here's the full LLM Chess Leaderboard. A shorter version follows...

Player	Draws ▼	Wins
gpt-4-turbo-2024-04-09	93.33%	0.00%
gpt-4o-2024-08-06	90.00%	0.00%
gpt-4o-2024-05-13	83.33%	0.00%
anthropic.claude-v3-5-sonnet	73.33%	0.00%
gpt-4o-mini-2024-07-18	60.00%	0.00%
llama-3-70b-instruct-awq	50.00%	0.00%
gemini-1.5-pro-preview-0409	36.67%	0.00%
gemini-1.5-flash-001	33.33%	0.00%
gemma-2-27b-it@q6_k_l	26.67%	0.00%
gemma-2-9b-it-8bit	13.33%	0.00%
anthropic.claude-v3-haiku	0.00%	0.00%
gpt-35-turbo-0125	0.00%	0.00%
llama-3.1-8b-instruct-fp16	0.00%	0.00%
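
All percentages in the table are multiples of 1/30, which is consistent with 30 games per model (an inference from the numbers, not a figure stated here). A minimal sketch of how such a row could be computed from raw game outcomes; the function name and the outcome labels are hypothetical:

```python
from collections import Counter

def summarize(outcomes):
    """Return (draw %, win %) for a list of per-game outcomes.

    `outcomes` holds one label per game from the LLM's point of view:
    "win", "draw" or "loss". The labels are illustrative assumptions.
    """
    if not outcomes:
        raise ValueError("need at least one game")
    counts = Counter(outcomes)
    total = len(outcomes)
    draws = round(100 * counts["draw"] / total, 2)
    wins = round(100 * counts["win"] / total, 2)
    return draws, wins

# Hypothetical run: 30 games, 28 draws, 2 losses, no wins.
games = ["draw"] * 28 + ["loss"] * 2
print(summarize(games))  # (93.33, 0.0)
```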

While I was doing the practical exercises from DeepLearning.AI's "AI Agentic Design Patterns with AutoGen" course, Lesson #4 (Tool Use) caught my attention. Two AI agents played chess, using tools to interact with a chess board and move pieces on it.

An idea popped into my head: it would be great to host a chess competition between different models, run multiple games, calculate win rates and Elo scores, and create a leaderboard saying that "Model A ranks #1 with Elo 3560, Model B is #2..."

After bumping up the turn limit to 200 (the exercises had it set to 3) I was surprised that the game kept going without either player winning. Trying to put one side at a disadvantage (adding "You are a beginner chess player..." to the system prompt) didn't help much... The stronger player was still not able to deliver checkmate.

That's when I decided to switch to testing an LLM against what seemed to be an even weaker opponent: a Random Player. It is a bot that interacts with the chess board in the same fashion as an LLM agent, yet it puts no effort into calculating the best move. It merely asks the board for the list of legal moves in the current position and picks one at random.

I ended up creating the LLM Chess project and simulating hundreds of games between the Random Player and LLMs. I evaluated the chess and instruction-following proficiency of 13 chat models. The results are presented in the aforementioned Leaderboard.
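
The Random Player's whole "strategy" fits in a few lines. Here is a minimal sketch, assuming the board hands over legal moves as a list of UCI strings; `random_player_move` is a hypothetical name and not the project's actual code:

```python
import random

def random_player_move(legal_moves, rng=random):
    """Pick one move uniformly at random from the list the board reports."""
    if not legal_moves:
        raise ValueError("no legal moves: the game is already over")
    return rng.choice(legal_moves)

# Legal moves as UCI strings, as a board might report them.
legal = ["a7a6", "e7e5", "g8f6", "b8c6"]
# A seeded generator keeps the example reproducible.
move = random_player_move(legal, random.Random(42))
print(move)
```

No evaluation, no look-ahead: any opponent that actually plays chess should beat this.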

Gen AI Skepticism

The results were surprising to me. Yet they come at a time when LLM skepticism is at its highest since ChatGPT took off in 2022.

People call out the limited value of chatbots, companies struggle to find practical applications, and the impact on the global economy has been inconsequential. LLMs keep hallucinating, and reliability is a key concern for wider adoption across use cases. Imagine a world where a database engine returned correct results for only 7 out of 10 SELECT queries...

Ilya Sutskever, formerly of OpenAI, recently said that LLM scaling has plateaued. Convergence of LLM performance can be seen as a trend of 2024.

LLMs Memorize, They Do Not Reason

The creator of the ARC Challenge, a benchmark that is "easy for humans, but hard for AI", argues that LLMs can memorize answers but cannot truly reason.

A widely discussed recent study by Apple researchers (GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models) supports the above claim.

They showed that even the smallest changes to the original problems in GSM8K (a popular benchmark used to evaluate the math capabilities of LLMs) make the scores drop significantly.

Imagine that GSM8K had a problem like this:

Alice had 5 apples and Bob had 3. What was the total number of apples they had?

🍏🍏🍏🍏🍏 + 🍏🍏🍏 = ?

When the authors changed the numbers (e.g. Alice had 4 apples and Bob had 2) OR when some irrelevant info was added to the problems (e.g. Alice is Bob's sister) they saw a significant drop in score.
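
The kind of perturbation described above can be sketched as a template whose numbers and distractor clause vary while the required reasoning stays the same. This illustrates the idea only; it is not the paper's code or data, and the template and function name are made up for this example:

```python
import random

TEMPLATE = (
    "{a} had {x} apples and {b} had {y}. "
    "{extra}What was the total number of apples they had?"
)

def make_variant(rng, with_distractor=False):
    """Build one question/answer pair with freshly drawn numbers."""
    x, y = rng.randint(2, 9), rng.randint(2, 9)
    # The distractor adds information that is irrelevant to the sum.
    extra = "Alice is Bob's sister. " if with_distractor else ""
    question = TEMPLATE.format(a="Alice", x=x, b="Bob", y=y, extra=extra)
    return question, x + y

rng = random.Random(0)
for distractor in (False, True):
    question, answer = make_variant(rng, distractor)
    print(question, "->", answer)
```

A system that reasons should score the same on every variant, since the arithmetic is identical.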

Yann LeCun has compared LLMs to giant look-up tables: the tech works as long as there's a huge corpus of data covering all the cases (which is not feasible):

The problem is that there is a long tail, this is an issue that a lot of people have realized in social networks and stuff like that, which is there’s a very, very long tail of things that people will ask and you can fine tune the system for the 80% or whatever of the things that most people will ask. And then this long tail is so large that you’re not going to be able to fine tune the system for all the conditions. And in the end, the system ends up being a giant lookup table essentially, which is not really what you want, you want systems that can reason, certainly that can plan.

To wrap up this rant about LLMs not being capable of true reasoning, here's a study from 2023 - The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" - the one that made headlines with the example of ChatGPT knowing the name of Tom Cruise's mother yet not being able to name her famous son.

Thoughts

While working on the project and running those multiple simulations I think I have built my own intuition of LLMs being "just text generators". I have reconsidered the notion of "in-distribution", i.e. if the model has seen certain data - it might surprise you, if not - it might fail miserably. That was an exercise that allowed me to sense the limits of the technology.

In the logs I saw how a model could get into a failure mode where it was not able to pick the right move out of a given list of legal moves (e.g. 10 comma-separated codes like 'e2e4'), even in the middle of a game that had successfully lasted until move #74. It had (likely) met some unique combination of tokens and could not resist picking a familiar, in-distribution code that resembled one of the legal moves, while skipping the move code that was likely a very rare token not met in pre-training.
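
One way to tell this failure mode apart from ordinary formatting errors is to classify each reply against the legal-move list. The sketch below is an assumption about how such a check could look, not the project's actual code; `classify_move` and its three labels are hypothetical:

```python
import re

# UCI move: from-square, to-square, optional promotion piece (e.g. e7e8q).
UCI_PATTERN = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")

def classify_move(reply, legal_moves):
    """Label a model's move as 'legal', 'illegal' or 'malformed'."""
    move = reply.strip().lower()
    if move in legal_moves:
        return "legal"
    if UCI_PATTERN.match(move):
        # Looks like a real move code, but the board did not offer it.
        return "illegal"
    return "malformed"

legal = ["g8f6", "h7h6", "c7c5"]
print(classify_move("g8f6", legal))  # legal
print(classify_move("e7e5", legal))  # illegal
print(classify_move("Nf6", legal))   # malformed
```

The "illegal" bucket is the interesting one here: a well-formed, familiar-looking code that simply was not on the list.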

When I tried better prompts, e.g. asking LLMs to reflect before making a move, the results did not improve, yet now LLMs covered their decision process with some fancy verbosity.

--------------------------------------------------------------------------------
Proxy (to Player_Black):

Before deciding on the next move you can reflect on your current situation, write down notes, and evaluate.
    Here are a few recommendations that you can follow to make a better move decision:
    - Shortlist the most valuable next moves
    - Consider how they affect the situation
    - What could be the next moves from your opponent in each case
    - Is there any strategy fitting the situation and your choice of moves
    - Rerank the shortlisted moves based on the previous steps


--------------------------------------------------------------------------------
Player_Black (to Proxy):

**Reflection:**

1. **Shortlist Valuable Moves:**
   - **g8f6**: Develops the knight, controls the center, and targets e4.
   - **h7h6**: Prepares to fianchetto the bishop or prevents a knight pin on g5.
   - **c7c5**: Challenges the center and prepares to expand on the queenside.

2. **Immediate Effects:**
   - **g8f6**: Develops a piece, brings the knight nearer to the center.
   - **h7h6**: Solidifies kingside but is more of a precautionary move.
   - **c7c5**: Direct action against the center, preparing for future breaks.

3. **Opponent's Responses:**
   - **g8f6**: White might continue developing with Nc3 or push e4.
   - **h7h6**: White might ignore it and proceed with normal development.
   - **c7c5**: White could opt for dxc5 or develop with Nc3.

4. **Strategy Considerations:**
   - Aim to control the center and set up a solid defensive position.
   - Balance between development and maintaining structural integrity.

5. **Rerank Shortlisted Moves:**
   - **1. g8f6**: Best for development and central control.
   - **2. c7c5**: Strong challenge to the center.
   - **3. h7h6**: Defensive but adds limited value immediately.

With that in mind, I’ll proceed with the development strategy.

Action: `make_move g8f6`

It is just like leaving a chatbot with a sense of emptiness, as so often happens: you have read a lot of BS and have not got concrete answers to your questions.

Somehow the analogy from last year, trying to characterize an LLM chatbot in simple terms, still holds.

Prompting LLM to Make a Move

Proxy (a bot controlling the chess board) used hard-coded prompts and arranged a short dialog (max 10 turns) with an LLM asking it to make a move.

Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move ' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.

Player_Black (to Proxy):
get_current_board

Proxy (to Player_Black):
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖


Player_Black (to Proxy):
get_legal_moves

Proxy (to Player_Black):
a7a6, b7b6, c7c6, d7d6, e7e6, f7f6, g7g6, h7h6, a7a5, b7b5, c7c5, d7d5, e7e5, f7f5, g7g5, h7h5

Player_Black (to Proxy):
make_move e7e5

Proxy (to Player_Black):
Move made, switching player
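
The exchange above boils down to a small dispatch loop. Below is a minimal sketch of it under stated assumptions: `run_turn`, the callables it takes, and the error messages are hypothetical and not taken from the LLM Chess code. Only the three action names, the "Move made, switching player" reply, and the 10-turn cap come from the dialog described here.

```python
def run_turn(ask_player, board_text, legal_moves, max_turns=10):
    """Run one dialog; return the move made, or None if the cap is hit.

    ask_player(message) -> str is a stand-in for the LLM call.
    """
    message = "Respond with the action."
    for _ in range(max_turns):
        reply = ask_player(message).strip()
        if reply == "get_current_board":
            message = board_text
        elif reply == "get_legal_moves":
            message = ", ".join(legal_moves)
        elif reply.startswith("make_move "):
            move = reply.split(maxsplit=1)[1]
            if move in legal_moves:
                # The proxy would answer "Move made, switching player".
                return move
            message = f"Illegal move: {move}"
        else:
            message = "Invalid action."
    return None

# A scripted stand-in for the model, replaying the dialog above.
script = iter(["get_current_board", "get_legal_moves", "make_move e7e5"])
move = run_turn(lambda _msg: next(script), "<board>", ["e7e5", "d7d5"])
print(move)  # e7e5
```

Returning `None` when the cap is reached gives the harness a clean signal that the model fell out of the game loop, as opposed to losing on the board.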

Can LLM Chess Be Yet Another "Hard for LLMs" Benchmark?

I am curious to see whether any of the chat LLMs can score wins against a random player while preserving good performance across different tasks. I suspect that might be a challenge given the enormous number of possible moves and positions. If an LLM is just a look-up table, as noted by LeCun, saturating LLM Chess might be a challenge.

Top comments (9)

Nicolai Kilian • Nov 23 '24 • Edited

They can, though.
gpt-3.5-turbo-INSTRUCT plays at 1750 Elo out of the box and others can play pretty well too if you prompt them correctly:

dynomight.net/more-chess/

Maxim Saplin • Nov 23 '24

Curious reading... Though it seems the article had a different definition of a win (seems like you used some material difference metric) and hence a conclusion that LLMs can play chess since they can maintain positive material difference. Did I interpret the article right?

Yafuso Bera • Nov 25 '24

Interesting read. : )
My chess levels are at 900-1000

Paul Prescod • Jan 16

Why didn't you

a) Google the tremendous amount of work that has been done on LLMs and chess to ensure you were adding to the research instead of duplicating it

b) try o1-preview, the most advanced model available at the time of writing. (now it would be o1 and soon o3)

Maxim Saplin • Jan 16 • Edited

a) I did. Could you please share what I could have missed and eventually duplicated?
b) Tried recently. Yet you probably didn't know (maybe didn't care to Google) that o1 models were (and still are) not as commonly available through APIs (e.g. Azure OpenAI) as the rest of the models...

Paul Prescod • Jan 16 • Edited

If you Googled it, why didn't you know about the "gpt-3.5-turbo-instruct paradox"?

I'm sorry that I was rude, but this is a topic of great interest to me and it's frustrating that your original post is nearly misinformation because it leaves out so much important information.

Having a post about LLM Chess that doesn't mention gpt-3.5-turbo-instruct eliminates one of the most important aspects of the whole discussion. And then missing on o1, which is specifically designed to do better on these kinds of "thinking through" tasks.

Maxim Saplin • Jan 16

The gpt-3.5-turbo-instruct phenomenon was mentioned in one of the very first comments - you can scroll up and find the link. Indeed would be great to learn the causes.

Although it could be investigated further, gpt-3.5-turbo-instruct is a completion model, not a chat model - it can't be plugged into my test harness and evaluated; it is also an obsolete model, I suspect nobody is using it at the moment for any production tasks. Besides, to be honest, I don't completely understand the idea of the evaluation metrics used in the article (centipawns, win/loss, no draw etc.)

And, the project I am working on, the LLM Chess Leaderboard, it is not what it might seem to be upon the first look. It is a scalable and reproducible evaluation harness - there're over 30 models in the list and more are coming. It uses random player, not Stockfish or any other tailor made chess engine, where every game, well - is random. And while crunching the number of simulations you are supposed to get all the goods of the Law of Large Numbers while mitigating the pattern memorisation capabilities of LLMs.

While chess simulations are used, I am not interested in whether a certain LLM can be a good or a bad chess player. Chess is merely a tool. I am interested in overall performance of LLMs, their reasoning capabilities, steering and instruction following, applicability for agentic flows.

MT-Bench and AlpacaEval can be good proxies for human preferences: they are cheap to run and provide a single-number score that correlates with Chatbot Arena scores (acquired via human votes). In the same way, I suspect that chess simulations can provide good proxy metrics for quite a few LLM capabilities. At the moment I see the Wins/Draws as a proxy for reasoning abilities, and the # of Mistakes (or the number of games before an LLM goes mad and fails to adhere to the instructions) as a proxy for instruction-following abilities. I also control for the number of tokens, and the leaderboard shows the verbosity of models.

At the moment I am using the eval to research how model quantisation affects performance, whether enabling or disabling Flash Attention degrades it, whether Llama 3.1 served via Groq differs from the one served via Cerebras...

If you look closer at the leaderboard, there are plenty of interesting facts:

- While o1 models did score wins, they also lost way more than non-reasoning models, as if in some games they created situations upping the chances of the random player for a win
- Some reasoning models (QwQ, Gemini Thinking, Sky T1) do not show that much reasoning after putting them through LLM Chess
- There're plenty of models that are hard to steer. That was my impression from multiple "vibe" checks of many OS models; now I can see in the data that many of those models became verbose and barely "programmable" - they struggle to follow even basic instructions and tend to fall out of the game loop a lot
- Overall it seems there're 3 clusters - top-notch models at the top made by Anthropic and OpenAI, a mediocre middle and a fat tail of BS generators at the bottom

Ryan • Jun 6 • Edited

I've had some interesting experience trying to get qwen3 (4b running on a home PC) thinking model to play chess. Had it play against a bot on Chess.com (I moved the pieces it told me to and responded simply with what black moved each step). It was slow, but for the first 8 moves of the game it seemed to make well thought-out and reasoned moves. After move 9 it somehow got the order of moves mixed up in its memory and suggested an impossible move to protect a piece that was not where it said it should be. I am guessing I bumped into a context window limitation, but with a big enough context window (and a model bigger than 4b), I wonder if it could have thought through the whole game?

This is the prompt I used:
You are a chess master, let's pretend you are a chess coach and I am just learning. I am playing a game against a good player and I need my coach to tell me what move I should make each step of the game. Please keep your responses breif, using chess notation and a short description why you suggest that move. With each step you should keep track of where all the pieces are on the board, assume I made the move you suggested every time. I will respond with the move that the other player made. We are playing white, what is the opening move you recommend?

This is an example of move 5:
You
Black moves f6

qwen3:4b
--think--
Okay, the user is playing as white against a strong opponent. They've been going through the moves step by step, and now Black has played f6. Let me think about the current position.

After 1.e4 d6 2.Nf3 Nc6 3.Bc4 a5 4.Bb5 f6, the position is pretty typical for the Italian Game. Black's f6 is a classic move to control the center and prepare for development. The user's last move was Bb5, which is a solid move, but now Black's f6 is a key move.

I need to suggest a response for white. The next move is likely a development. The main options for white are Nc3 or Bc5. Nc3 is a common move to develop the knight. It's a natural square, supports the center, and prepares for e4 or d5. Alternatively, Bc5 could be an option if white wants to challenge the f6 pawn, but that might be less urgent.

But wait, the user is learning, so I should stick to the most common and solid moves. Nc3 is definitely the main choice here. It's a good move that gets the knight into the center, supports the pawn on e4, and prepares for future moves like Qe2 or Bc5.

Also, considering the position, after f6, the pawn on e4 is still the main target. Nc3 will help in that. The user might be looking for a way to get the knight active. So, Nc3 makes sense.

I need to make sure the notation is correct. The move is Nc3. The reason is to develop a knight to a central square, support the center, and prepare for further central pressure.

Yes, that's the correct move. The user will then respond to Black's f6, and the next move would be Nc3. That's the recommendation.
--/think--

1.e4 d6 2.Nf3 Nc6 3.Bc4 a5 4.Bb5 f6 5.Nc3
Reason: Develops the knight to a central square (c3), supports the center, and prepares for future central pressure (e.g., Qe2 or Bc5).

Maxim Saplin • Jun 15

Qwen 3 32B was also bad in LLM Chess, where the move history is not shared and there's no problem with context overflow. The results were poor as well, with the model breaking down at later moves.