David Dominguez

AI Coliseum: Building a Python Chess Engine for LLM Benchmarking


What happens when you pit two different pieces of decision-making logic against each other under a strict set of rules? To answer that, I developed a Python project that acts as an execution environment (the Engine) in which different AI scripts can compete.

In this post, I’ll explain how I designed the board, the rules, and the experiment of pitting white.py against black.py.

1. The Referee: A Pure Rules Engine

Unlike in other projects, the "engine" here is not a player. Its sole mission is to be the board's operating system. It is programmed to:

  • Validate Legal Moves: Ensuring no AI attempts to bypass the rules.
  • Manage State: Controlling castling, en passant captures, and promotions.
  • Detect Game End: Applying rules for insufficient material, threefold repetition, and the 50-move rule.
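
To make that last point concrete, here is a minimal sketch of how the draw conditions could be tracked. The GameState class, its field names, and the position_key idea are my own illustrative assumptions, not the repository's actual code.

```python
from collections import Counter

class GameState:
    """Hypothetical, simplified state; the real engine tracks much more."""

    def __init__(self):
        self.halfmove_clock = 0           # plies since the last capture or pawn move
        self.position_counts = Counter()  # occurrences of each position seen so far

    def record_position(self, position_key: str) -> None:
        # position_key would encode piece placement, side to move,
        # castling rights, and the en passant square
        self.position_counts[position_key] += 1

    def is_draw(self) -> bool:
        # 50-move rule: 100 half-moves without a capture or pawn move
        if self.halfmove_clock >= 100:
            return True
        # Threefold repetition: the same position has occurred three times
        return any(count >= 3 for count in self.position_counts.values())
```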

2. The Experiment: white.py vs. black.py

The real magic happens in external files. The engine acts as a "Coliseum" that delivers the current board state to each script and waits for a response (a sketch of this handshake follows the list below).

This allows for AI Benchmarking experiments:

  • Search Logic: Which calculates better, a Minimax-based script or one built on experimental heuristics?
  • Optimization: How thinking time (measured with time.time()) affects the quality of the returned move.
  • Evaluation Heuristics: Comparing how different algorithms value piece positioning.
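
Here is a rough sketch of what that handshake might look like from the benchmarking side. The get_move and benchmark_move names are hypothetical; the post only guarantees that the engine hands each script the current state and waits for a move back.

```python
import time

def benchmark_move(player_module, state):
    """Ask a player script for a move and time how long it takes.

    Assumes each player module exposes get_move(state); these names
    are illustrative, not the repository's actual interface.
    """
    start = time.time()
    move = player_module.get_move(state)
    elapsed = time.time() - start
    return move, elapsed

# Usage idea, assuming white.py defines get_move(state):
#   import white
#   move, seconds = benchmark_move(white, engine.get_state())
#   print(f"White played {move} in {seconds:.3f}s")
```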

3. Anatomy of the Engine (The Backbone)

The code is designed to be modular and transparent. Key implementation points include:

  • Board Cloning: The engine allows AIs to clone the state to simulate future moves without affecting the live game (sketched below).
  • PGN Builder: Automatic generation of standard notation so games can be analyzed in external viewers.
  • Console Visualization: An ASCII rendering system to follow the "fight" in real-time from the terminal.
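
A minimal sketch of the board-cloning idea, assuming the board is a plain nested Python structure and that some apply_move function mutates it in place; both names are stand-ins rather than the engine's real API.

```python
import copy

def simulate(board, move, apply_move):
    """Explore a move on a throwaway copy of the board."""
    candidate = copy.deepcopy(board)  # deep copy, so nested rows are duplicated too
    apply_move(candidate, move)       # mutate only the clone
    return candidate                  # the live game's board stays untouched
```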

4. Why Separate the Board from the Logic?

Separating the engine from the players (white.py and black.py) offers critical software development advantages:

  1. Modularity: You can change an AI's logic without touching a single line of chess rule code.
  2. Impartiality: Both scripts receive the exact same state information (get_state()).
  3. Scalability: It's easy to add new "players" and pit them against each other in a round-robin tournament (see the pairing sketch below).
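
As a taste of that scalability point, a round-robin schedule is just the set of ordered pairings of player scripts. A minimal sketch follows; the function name and the use of itertools are my own, not taken from the repository.

```python
from itertools import permutations

def round_robin(players):
    """Every player meets every other player, once as White and once as Black."""
    return list(permutations(players, 2))  # ordered (white, black) pairings

# round_robin(["white.py", "black.py", "claude.py"]) yields 6 pairings
```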

The Experiment: A Chronicle of Algorithmic Slaughter

Developing this Coliseum wasn't linear; it was an evolution of "survival of the fittest." In early tests, GPT—using an optimized initial prompt—completely swept basic versions of Gemini and clearly outperformed DeepSeek. The throne seemed set.

However, the landscape shifted when Claude entered the scene. With much deeper calculation logic and superior decision-tree management, Claude dethroned GPT and mercilessly "destroyed" the other competitors.

The most interesting part came next: by applying the optimization principles and code structure Claude used to the other AIs, the playing field leveled out. While still playing with less precision, these improved versions began forcing draws by repetition. They went from committing tactical suicide to mounting a defense solid enough to hold the line.


From Code to Board: Generating PGN

To ensure these games don't just live as terminal logs, the software implements a PGN Builder. PGN (Portable Game Notation) is the universal standard for recording chess games.

How the Data Flows:

  1. Move Capture: Every time white.py or black.py makes a move, the engine logs the origin, destination, captures, and promotions.
  2. Algebraic Translation: The engine converts internal array coordinates (e.g., [7, 4]) into standard notation (e.g., e1); a sketch of this conversion follows the list.
  3. Special State Detection: The software checks if the move resulted in check (+), checkmate (#), or castling (O-O).
  4. Export: At the end of the game, these strings are concatenated: 1. e4 e5 2. Nf3 Nc6....
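
Steps 2 and 4 are easy to illustrate. The sketch below assumes row 0 of the internal array is the 8th rank, so [7, 4] maps to e1 as in the example above, and shows how SAN moves could be concatenated into numbered movetext; the function names are hypothetical.

```python
def square_name(row: int, col: int) -> str:
    """Convert internal array coordinates to algebraic notation (row 0 = 8th rank)."""
    return "abcdefgh"[col] + str(8 - row)

def build_movetext(san_moves):
    """Concatenate SAN moves into numbered PGN movetext."""
    parts = []
    for i, move in enumerate(san_moves):
        if i % 2 == 0:                 # White's move starts a new move number
            parts.append(f"{i // 2 + 1}.")
        parts.append(move)
    return " ".join(parts)

print(square_name(7, 4))                           # e1
print(build_movetext(["e4", "e5", "Nf3", "Nc6"]))  # 1. e4 e5 2. Nf3 Nc6
```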

Visualizing on Chess.com or Lichess

The final result is a block of text you can paste into tools like the Chess.com Analysis board. These platforms read the PGN, reconstruct the game visually, and let you use engines like Stockfish to see exactly where Claude gained the upper hand.


Open Source Challenge

I have uploaded this competition environment to my repository. If you are a developer and want to test your own chess logic, just create your script and pit it against the existing players.

Project Repository: https://github.com/davdomin/chess_motor

What strategy would you use to win: an aggressive material-based attack or a solid positional defense? Let me know in the comments!
