Maxim Saplin

OpenAI o3-mini Tested in LLM Chess

OpenAI has recently presented its newest reasoning model, o3-mini. At the Medium and High reasoning effort levels it matches or surpasses the full (non-mini) o1 model on math and coding benchmarks, and it does so at a fraction of the cost:

| Model   | Input Tokens Price (per 1M) | Output Tokens Price (per 1M) |
|---------|-----------------------------|------------------------------|
| o1      | $15.00                      | $60.00                       |
| o3-mini | $1.10                       | $4.40                        |
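
For a quick back-of-the-envelope comparison using the prices above (the token counts in the snippet are made up purely for illustration):

```python
# Cost of a single request under the table's per-1M-token prices.
PRICES_PER_1M = {
    "o1": {"input": 15.00, "output": 60.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES_PER_1M[model]
    return input_tokens / 1e6 * p["input"] + output_tokens / 1e6 * p["output"]

# Example: 2,000 prompt tokens and 2,500 completion tokens (reasoning included)
for model in PRICES_PER_1M:
    print(f"{model}: ${request_cost(model, 2_000, 2_500):.4f}")
```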

Let's take a closer look at the model through the LLM Chess lens...

The Challenges

o3-mini was tested through Azure (just like the other OpenAI models present in the leaderboard). While the Medium-level evaluation went smoothly, the other levels were riddled with issues.

o3-mini-high

Most of the games were interrupted by client-side network timeouts. Changing the default timeout from 120 to 250 seconds didn't help. That's something I have not observed with either o1-preview/o1-mini or DeepSeek R1. I will retest in the future with a larger timeout (hoping for faster generation once the infrastructure scales).
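
If you are hitting the same timeouts, a minimal sketch of raising the client-side timeout with the OpenAI Python SDK against Azure might look like the snippet below. The endpoint, deployment name, and API version are placeholders, not the actual LLM Chess configuration:

```python
# A minimal sketch of raising the client-side timeout for Azure OpenAI.
# Endpoint, deployment name, and API version are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-12-01-preview",
    timeout=250.0,  # seconds; raised from the 120s used by default in the harness
)

response = client.chat.completions.create(
    model="o3-mini",          # Azure deployment name (placeholder)
    reasoning_effort="high",  # low | medium | high
    messages=[{"role": "user", "content": "You play as black. Respond with your action."}],
)
print(response.choices[0].message.content)
```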

o3-mini-low

For the Low reasoning effort, in roughly 1 out of 10 games the game loop was broken by the server dropping the connection:

```
httpx.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
```

That's something I had not seen with OpenAI models before. I treated those situations as model errors and counted the games as losses. A similar situation was observed with Gemini 2.0 Flash Thinking (0125), which often returned empty responses.
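
This is not the actual LLM Chess harness code, just a sketch of how a dropped connection can be caught and charged to the model as a forfeited game (game_record is a hypothetical structure):

```python
# Sketch only: treat a server-side connection drop as a model error / loss.
import httpx

def play_turn(client, game_record: dict, **request_kwargs):
    """Request the model's next action; score a loss if the server drops the connection."""
    try:
        return client.chat.completions.create(**request_kwargs)
    except httpx.RemoteProtocolError as exc:
        # The server closed the connection mid-response; count it against the model.
        game_record["result"] = "loss"
        game_record["reason"] = f"connection dropped: {exc}"
        return None
```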

Results

o3-mini at the Medium reasoning level took 2nd place in the overall rating based on the percentage of games won, placing it above o1-mini and DeepSeek R1:

| # | Player                    | Wins   | Draws  | Mistakes | Tokens  |
|---|---------------------------|--------|--------|----------|---------|
| 1 | o1-preview-2024-09-12     | 46.67% | 43.33% | 9.29     | 2660.07 |
| 2 | o3-mini-2025-01-31-medium | 36.36% | 61.36% | 4.42     | 2514.25 |
| 3 | o1-mini-2024-09-12        | 30.00% | 50.00% | 4.29     | 1221.14 |
| 4 | deepseek-reasoner-r1      | 22.58% | 19.35% | 52.93    | 4584.97 |
| 5 | claude-v3-5-sonnet-v1     | 6.67%  | 80.00% | 1.67     | 80.42   |
| 6 | o3-mini-2025-01-31-low    | 4.62%  | 56.92% | 9.21     | 678.95  |

The Low reasoning effort didn't impress, placing o3-mini in the same league as Claude 3.5 and GPT-4o. The Low mode also produced more mistakes.

The curious part about the new o3-mini (Medium) is that it demonstrated a different kind of game. Compared to the o1 models, it lost far less often while maintaining a higher material difference:

| Player                    | Loss Rate | Material Diff |
|---------------------------|-----------|---------------|
| o3-mini-2025-01-31-medium | 2%        | 17.05         |
| o1-preview-2024-09-12     | 10%       | 13.77         |
| o1-mini-2024-09-12        | 30%       | 10.63         |
| deepseek-reasoner-r1      | 58%       | 11.13         |

Material count is the sum of piece weights, e.g. a pawn weighs 1, a queen 9, etc. At the beginning of the game each player has a total material count of 39. The material difference gives a quick estimate of a player's advantage (or disadvantage, if negative).
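
As a sketch (not the LLM Chess implementation), the material difference can be computed with the python-chess library using the standard piece weights; positive values favor White:

```python
# Material difference as a sum of piece weights (pawn=1, knight/bishop=3, rook=5, queen=9).
import chess

PIECE_WEIGHTS = {
    chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0,
}

def material_diff(board: chess.Board) -> int:
    """Positive = White is ahead in material, negative = Black is ahead."""
    diff = 0
    for piece in board.piece_map().values():
        weight = PIECE_WEIGHTS[piece.piece_type]
        diff += weight if piece.color == chess.WHITE else -weight
    return diff

print(material_diff(chess.Board()))  # 0 - both sides start with 39 points of material
```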

Looking into the logs, I noticed that o3-mini manages to capture its opponent's pieces early in the game yet fails to checkmate, dragging the play towards the 200-move limit.

LLMs Struggle with Chess

If we compare the performance of LLMs to the popular Stockfish chess engine (playing as black against the same random player), Stockfish scores 100% wins with an average game duration of 58 moves. That's at Skill Level 1 (out of 20), supposedly the level of a beginner human chess player. Compare that to the current best, o1-preview-2024-09-12, with a 47% win rate and an average game duration of 125 moves. I have touched on this topic in an earlier post.
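
For context, here is a rough sketch of that baseline with python-chess and a local Stockfish binary (the per-move time limit is an arbitrary choice, not necessarily what the leaderboard runs use):

```python
# Stockfish (Skill Level 1) as black vs. a random-move player as white.
import random
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # assumes Stockfish on PATH
engine.configure({"Skill Level": 1})                       # 1 out of 20

board = chess.Board()
while not board.is_game_over():
    if board.turn == chess.WHITE:
        board.push(random.choice(list(board.legal_moves)))         # random player
    else:
        result = engine.play(board, chess.engine.Limit(time=0.1))  # Stockfish
        board.push(result.move)

print(board.result(), "in", board.fullmove_number, "moves")
engine.quit()
```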

Self-assurance

While I didn't gather enough data for a proper evaluation of the High reasoning level, there are some interesting findings in the collected game logs. One is the "overconfidence" at higher reasoning levels.

As a reminder, LLM Chess prompts the chat model to play using the following initial instructions:

```
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.
```

There are 3 actions available to the model. The default approach seen in most cases is that a model (1) requests the board, (2) asks for the list of legal moves, and (3) completes its turn by making a move. Below are the ratios of action frequencies (get_current_board / get_legal_moves) for a few "traditional" models:

  • claude-v3-5-sonnet-v2: 0.999
  • gpt-4o-2024-11-20: 0.999
  • qwen-max-2025-01-25: 0.997

As you can see, strong models almost always request both the board state and the list of legal moves when deciding on the next move.
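
The ratio itself is trivial to derive from a game log by counting how often each action was requested; a minimal sketch (with a made-up log format) is shown below:

```python
# Ratio of board requests to legal-move requests from a list of logged actions.
from collections import Counter

def board_to_moves_ratio(actions: list[str]) -> float:
    counts = Counter(action.split()[0] for action in actions)
    return counts["get_current_board"] / max(counts["get_legal_moves"], 1)

log = ["get_current_board", "get_legal_moves", "make_move e7e5",
       "get_current_board", "make_move g8f6"]
print(board_to_moves_ratio(log))  # 2.0 - the board was requested twice as often
```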

The o3-mini model demonstrated that it often didn't need to be hinted with the list of legal moves, requesting the board state and then issuing make_move right away:

```
Player_Black (to Proxy):

get_current_board

--------------------------------------------------------------------------------
Proxy (to Player_Black):

⭘ ♜ ♝ ♛ ⭘ ♜ ♚ ⭘
⭘ ♟ ♟ ♟ ♝ ♟ ♟ ♟
⭘ ⭘ ⭘ ⭘ ⭘ ♞ ⭘ ⭘
⭘ ⭘ ♞ ⭘ ♟ ⭘ ⭘ ⭘
⭘ ⭘ ⭘ ⭘ ♙ ♗ ♙ ⭘
⭘ ⭘ ♙ ⭘ ⭘ ♙ ⭘ ♙
⭘ ♙ ♕ ⭘ ⭘ ⭘ ⭘ ⭘
⭘ ⭘ ⭘ ⭘ ♔ ♗ ♘ ♖

--------------------------------------------------------------------------------
Player_Black (to Proxy):

make_move e5f4
```

At the High level, o3-mini rarely requested the list of legal moves (roughly once every 45 moves):

  • o3-mini-low: 1.375
  • o3-mini-medium: 2.613
  • o3-mini-high: 45.1

This trait sets reasoning models (o1 and R1 as well) apart. SOTA non-reasoning models often hallucinate illegal moves even with the valid moves listed just one message earlier in the dialog. Among recently tested models, the rate of hallucinated illegal moves (given the list) was very high for Grok-2 and Gemini Flash 2.0.
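
For illustration, this is roughly how a proposed UCI move can be validated against the legal-move list with python-chess (a sketch, not the harness code):

```python
# Check whether a model-proposed UCI move is actually legal in the current position.
import chess

def is_legal_uci(board: chess.Board, uci: str) -> bool:
    try:
        return chess.Move.from_uci(uci) in board.legal_moves
    except ValueError:
        return False  # not even syntactically valid UCI

board = chess.Board()
print(is_legal_uci(board, "e2e4"))  # True
print(is_legal_uci(board, "e2e5"))  # False - a hallucinated, illegal move
```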

Reasoning Effort vs. Cost

The pricing for the different reasoning levels is determined solely by the number of tokens. I.e. there's nothing special about how you'll be charged when opting for a more powerful mode: more reasoning = more tokens, and token prices are the same for every level.

Below are completion token stats from a few games. My ballpark is that the Medium level costs ~3x and High ~9x as much as Low (in terms of tokens generated), though other prompts might show different ratios.

| Effort | Completion Tokens/Move | Relative to Low |
|--------|------------------------|-----------------|
| low    | 727                    | 1.00            |
| medium | 2293                   | 3.16            |
| high   | 6659                   | 9.17            |
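
To put that in dollars: taking the tokens-per-move figures above, the $4.40 per 1M output tokens price, and assuming roughly 100 model moves per game (an assumption purely for illustration), the completion-token cost per game works out to:

```python
# Rough per-game completion-token cost; MOVES_PER_GAME is an assumption.
OUTPUT_PRICE_PER_1M = 4.40  # $ per 1M output tokens (o3-mini)
TOKENS_PER_MOVE = {"low": 727, "medium": 2293, "high": 6659}
MOVES_PER_GAME = 100        # illustrative only

for effort, tokens in TOKENS_PER_MOVE.items():
    cost = tokens * MOVES_PER_GAME / 1_000_000 * OUTPUT_PRICE_PER_1M
    print(f"{effort}: ~${cost:.2f} per game")
```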

P.S.

The leaderboard has been updated with the results from recent Gemini 2 models (gemini-2.0-flash-001 and gemini-2.0-flash-lite-preview-02-05) and Amazon's Nova models (Pro and Lite).

Briefly, the release version of Flash 2 showed an impressive improvement, yet only reached 6th place in the overall rating; that said, its win rate is within the margin of error of the neighboring models (GPT-4o, Claude). Overall it is a much worse model in terms of steerability and hallucinations. gemini-2.0-flash-lite-preview-02-05 looks bad, even worse than Gemma 2 9B.

Amazon Nova didn't impress; both versions landed at the bottom of the list.

P.P.S. Legacy GPT-4

The GPT-4 models from 2023 have also been tested. While they demonstrated 0 mistakes in the game loops (a great result!), GPT-4 rarely requested the board state, using only 2 actions (get_legal_moves, make_move). This makes its behavior close to a random player: it didn't care about the game and was merely interested in picking a move from a list and handing over the turn.

  • gpt-4-32k: 0.006
  • o3-mini-low: 1.375
  • o3-mini-medium: 2.613
  • o3-mini-high: 45.1
  • o1-mini: 1.465
  • deepseek-r1: 1.938
  • qwen-max-2025-01-25: 0.997
  • gemini-2.0-flash-001: 0.952
  • anthropic.claude-v3-5-sonnet-v2: 0.999
  • gpt-4o-2024-11-20: 0.999
