
Harish Kotra (he/him)

Building Ollama Chess Arena: Pitting Local LLMs Against Each Other ♟️

Day 15 of 2026 Building

Have you ever wondered whether an 8-billion-parameter model plays better chess than a 20-billion-parameter one? I did. And instead of looking up benchmarks, I decided to make them fight.

The Experiment

Ollama Chess Arena is a local-first platform that connects to your local Ollama instance and lets you configure chess matches between any two open-source models (e.g., Llama 3 vs. Mistral, Gemma vs. Phi).

No cloud APIs. No tokens. Just your GPU (or CPU) sweating it out over a 1. e4 opening.


The Tech Stack

I wanted this to be a "raw" daily build, so I stuck to a lightweight and flexible stack:

  • Backend: Node.js + Socket.io for real-time game state synchronization.
  • Frontend: React + Vite + Tailwind CSS v4 (using a strict monochrome aesthetic).
  • AI: Ollama for local inference.
  • Logic: chess.js for move validation and game state management (a minimal wiring sketch follows below).
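
Roughly, the backend wiring looks like this (the port, event name, and payload shape here are simplified for illustration, not the exact repo code):

```js
// server.js -- simplified wiring, not the exact repo code
const { Server } = require('socket.io');
const { Chess } = require('chess.js');

const io = new Server(3001, { cors: { origin: '*' } });
const game = new Chess(); // the authoritative game state lives server-side

io.on('connection', (socket) => {
  // Late joiners get the current position immediately
  socket.emit('state', { fen: game.fen(), pgn: game.pgn() });
});

// Every validated move is broadcast so all connected boards stay in sync
function applyMove(san) {
  game.move(san); // chess.js (v1.x) throws if the move is illegal
  io.emit('state', { fen: game.fen(), pgn: game.pgn() });
}
```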

The Challenge: Making Language Models Play Chess

LLMs are great at poetry, but they are notoriously bad at spatial reasoning. Getting them to play a legal game of chess was... interesting.

1. The Hallucination Problem

When asked for a move, a model might say:

"I'll move my Knight to f6 because it controls the center."

But sometimes it says:

"Horsey goes jump!"

Or worse, it tries to play Nf6 when there's already a pawn there.

The Solution: I implemented a robust Feedback Loop. The system asks the model for a move. If the move is invalid (validated by chess.js), the system feeds the error back to the model in the next prompt:

"Invalid Move: Nf6. That square is occupied. Choose from: e5, d5, a3..."

We allow up to 10 retries. This "nagging" approach allows even smaller, dumber models to eventually stumble upon a legal move.
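
In code, the retry loop is roughly this (the SAN extractor here is a simplified stand-in for the real parsing):

```js
const axios = require('axios');

// Pull the first SAN-looking token out of the model's free-form reply
function extractMove(text) {
  const m = text.match(/\b(O-O-O|O-O|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)/);
  return m ? m[0] : '';
}

// Ask the model for a move; on an illegal one, feed the error back and retry
async function getLegalMove(game, model, basePrompt, maxRetries = 10) {
  let feedback = '';
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const { data } = await axios.post('http://localhost:11434/api/generate', {
      model,
      prompt: basePrompt + feedback,
      stream: false,
    });
    const san = extractMove(data.response);
    try {
      return game.move(san); // chess.js (v1.x) throws on illegal moves
    } catch {
      // The "nagging": tell the model what went wrong and what IS legal
      feedback = `\nInvalid Move: ${san}. Choose from: ${game.moves().join(', ')}`;
    }
  }
  return null; // out of retries -- the caller decides what happens next
}
```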

2. The "Lazy Draw"

Initially, the models kept drawing by Threefold Repetition. They would move a piece, realize it was scary out there, and move it back. Over and over again.

The Solution: Prompt Engineering. I gave the models a "Grandmaster Persona":

"You are a Grandmaster. Play aggressively to WIN. Do NOT settle for a draw. AVOID REPEATING MOVES."

I also injected the last 6 moves (PGN) into their context window so they could "see" they were repeating themselves.
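
The prompt builder ends up looking something like this (wording condensed):

```js
// Build the per-turn prompt: persona + current position + recent history
function buildPrompt(game, color) {
  const recent = game.history().slice(-6).join(' '); // last 6 half-moves
  return [
    'You are a Grandmaster. Play aggressively to WIN.',
    'Do NOT settle for a draw. AVOID REPEATING MOVES.',
    `You are playing ${color}. Position (FEN): ${game.fen()}`,
    `Recent moves: ${recent}`,
    'Reply with exactly one move in SAN notation.',
  ].join('\n');
}
```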

3. The "Silent Hang"

Sometimes, gpt-oss:20b would just... stop. It would think for 60 seconds and return nothing.

The Solution: Strict 30-second timeouts on the Axios requests. If a model sleeps, it loses its turn (or we retry). This keeps the game flowing.
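
With Axios that's just the timeout option; when it fires, the request rejects with ECONNABORTED:

```js
const axios = require('axios');

// One Ollama call with a hard 30-second ceiling
async function generateWithTimeout(model, prompt) {
  try {
    const { data } = await axios.post(
      'http://localhost:11434/api/generate',
      { model, prompt, stream: false },
      { timeout: 30_000 } // Axios aborts the request after 30 seconds
    );
    return data.response;
  } catch (err) {
    if (err.code === 'ECONNABORTED') {
      console.warn(`${model} hung past 30s -- skipping this attempt`);
      return null; // the caller retries or forfeits the turn
    }
    throw err; // anything else is a real error
  }
}
```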

The Result

We now have a fully functional "Perpetual Play" arena. You can set it up, click "Initialize Battle", and watch the models play game after game, accumulating wins and losses on a leaderboard.
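
Under the hood, the perpetual loop is simple once the pieces above exist. A simplified version, reusing getLegalMove and buildPrompt from earlier (the leaderboard shape and tallying are illustrative):

```js
const { Chess } = require('chess.js');

const leaderboard = {}; // model name -> { wins, losses, draws }

// Tally a finished game; if a side ran out of retries, the side to move "resigned"
function recordResult(board, game, white, black) {
  for (const m of [white, black]) board[m] ??= { wins: 0, losses: 0, draws: 0 };
  if (game.isGameOver() && !game.isCheckmate()) {
    board[white].draws++; board[black].draws++; // stalemate, repetition, etc.
  } else if (game.turn() === 'w') {
    board[black].wins++; board[white].losses++; // white is mated (or forfeited)
  } else {
    board[white].wins++; board[black].losses++;
  }
}

// "Perpetual Play": finish a game, record the result, start the next one
async function runArena(whiteModel, blackModel) {
  for (;;) {
    const game = new Chess();
    while (!game.isGameOver()) {
      const color = game.turn() === 'w' ? 'white' : 'black';
      const model = color === 'white' ? whiteModel : blackModel;
      const move = await getLegalMove(game, model, buildPrompt(game, color));
      if (!move) break; // 10 failed retries: treat it as a resignation
    }
    recordResult(leaderboard, game, whiteModel, blackModel);
  }
}
```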

The "Decision Stream" log shows you exactly what they are thinking—and it's hilarious. Seeing an AI justify a blunder with supreme confidence ("Sacrificing the Queen for positional advantage") never gets old.


What's Next?

  • God Mode: Adding a third LLM (GPT-4 via API) to provide color commentary on the match.
  • Elo System: Tracking persistent ratings for models over thousands of games (the standard update formula is sketched below).
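
For the Elo side, the plan is to start from the standard update formula (K = 32 is a common default):

```js
// Standard Elo update: score is 1 for a win, 0.5 for a draw, 0 for a loss
function updateElo(rating, opponentRating, score, K = 32) {
  const expected = 1 / (1 + 10 ** ((opponentRating - rating) / 400));
  return Math.round(rating + K * (score - expected));
}
```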

Check out the code on GitHub and run your own arena!
