Day 15 of 2026 Building
Have you ever wondered whether an 8-billion-parameter model plays better chess than a 20-billion-parameter one? I did. And instead of looking up benchmarks, I decided to make them fight.
The Experiment
Ollama Chess Arena is a local-first platform that connects to your local Ollama instance and lets you configure chess matches between any two open-source models (e.g., Llama 3 vs. Mistral, Gemma vs. Phi).
No cloud APIs. No tokens. Just your GPU (or CPU) sweating it out over a 1. e4 opening.
The Tech Stack
I wanted this to be a "raw" daily build, so I stuck to a lightweight and flexible stack:
- Backend: Node.js + Socket.io for real-time game state synchronization.
- Frontend: React + Vite + Tailwind CSS v4 (using a strict monochrome aesthetic).
- AI: Ollama for local inference.
- Logic: chess.js for move validation and game state management.
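To make the "local inference" part concrete, here's a minimal sketch of how the backend might talk to Ollama's /api/generate endpoint (Node 18+, global fetch). The OLLAMA_URL constant and function names are my own illustration, not the project's exact code:

```javascript
// Default Ollama port; assumed to be running locally.
const OLLAMA_URL = "http://localhost:11434";

// Build the request body for Ollama's /api/generate endpoint.
function buildRequest(model, prompt) {
  return { model, prompt, stream: false }; // one complete reply, no streaming
}

// Ask a local model for a completion and return its raw text reply.
async function generate(model, prompt) {
  const res = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildRequest(model, prompt)),
  });
  const data = await res.json();
  return data.response; // Ollama puts the model's text in `response`
}
```

With stream set to false, each turn is a single request/response pair, which keeps the game loop simple.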
The Challenge: Making language models play chess
LLMs are great at poetry, but they are notoriously bad at spatial reasoning. Getting them to play a legal game of chess was... interesting.
1. The Hallucination Problem
When asked for a move, a model might say:
"I'll move my Knight to f6 because it controls the center."
But sometimes it says:
"Horsey goes jump!"
Or worse, it tries to play Nf6 when there's already a pawn there.
The Solution: I implemented a robust Feedback Loop. The system asks the model for a move. If the move is invalid (validated by chess.js), the system feeds the error back to the model in the next prompt:
"Invalid Move: Nf6. That square is occupied. Choose from: e5, d5, a3..."
We allow up to 10 retries. This "nagging" approach allows even smaller, dumber models to eventually stumble upon a legal move.
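The loop itself is simple. Here's a sketch of the retry logic, assuming an async askModel(prompt) that returns a SAN string and a legalMoves list like the one chess.js's .moves() produces (both names are illustrative):

```javascript
// Keep asking until the model produces a legal move or we run out of retries.
async function getLegalMove(askModel, legalMoves, maxRetries = 10) {
  let prompt = `Your move. Legal moves: ${legalMoves.join(", ")}`;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const reply = (await askModel(prompt)).trim();
    if (legalMoves.includes(reply)) return reply;
    // Feed the error back so the model can correct itself next try.
    prompt = `Invalid Move: ${reply}. Choose from: ${legalMoves.join(", ")}`;
  }
  return null; // caller can forfeit the turn or fall back to a random legal move
}
```

Because the error message lists the legal moves, even a model that can't reason about the board can pattern-match its way to a valid answer.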
2. The "Lazy Draw"
Initially, the models kept drawing by Threefold Repetition. They would move a piece, realize it was scary out there, and move it back. Over and over again.
The Solution: Prompt Engineering. I gave the models a "Grandmaster Persona":
"You are a Grandmaster. Play aggressively to WIN. Do NOT settle for a draw. AVOID REPEATING MOVES."
I also injected the last 6 moves (PGN) into their context window so they could "see" they were repeating themselves.
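Putting the persona and the recent-move context together, the prompt builder might look like this. The history array is assumed to hold SAN moves (as chess.js's .history() returns); everything else is a sketch:

```javascript
// Persona injected at the top of every prompt to discourage lazy draws.
const PERSONA =
  "You are a Grandmaster. Play aggressively to WIN. " +
  "Do NOT settle for a draw. AVOID REPEATING MOVES.";

function buildPrompt(fen, history, legalMoves) {
  const recent = history.slice(-6).join(" "); // last 6 moves, so repetition is visible
  return [
    PERSONA,
    `Position (FEN): ${fen}`,
    `Recent moves: ${recent || "(game start)"}`,
    `Legal moves: ${legalMoves.join(", ")}`,
    "Reply with a single move in SAN.",
  ].join("\n");
}
```

Six moves is three full turns per side, which is enough context to catch a shuffle-back-and-forth pattern without bloating the context window.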
3. The "Silent Hang"
Sometimes, gpt-oss:20b would just... stop. It would think for 60 seconds and return nothing.
The Solution: Strict 30-second timeouts on the Axios requests. If a model sleeps, it loses its turn (or we retry). This keeps the game flowing.
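Axios supports a timeout option directly, but the same guard can be written as a generic wrapper around any promise, which is handy when you also want to time-limit non-HTTP work. A sketch (not the project's exact code):

```javascript
// Race a promise against a timer; reject if the work takes longer than `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so the process doesn't hang on a pending setTimeout.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Usage would be something like withTimeout(generate(model, prompt), 30_000), with the catch handler forfeiting the turn or retrying.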
The Result
We now have a fully functional "Perpetual Play" arena. You can set it up, click "Initialize Battle", and watch the models play game after game, accumulating wins and losses in a leaderboard style.
The "Decision Stream" log shows you exactly what they are thinking—and it's hilarious. Seeing an AI justify a blunder with supreme confidence ("Sacrificing the Queen for positional advantage") never gets old.
What's Next?
- God Mode: Adding a third LLM (GPT-4 via API) to provide color commentary on the match.
- ELO System: Tracking persistent ratings for models over thousands of games.
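For the rating system, the standard Elo update is simple enough to sketch now. This is a preview of the planned feature, not shipped code; K = 32 is a common choice, and score is 1 for a win, 0.5 for a draw, 0 for a loss (from the first player's side):

```javascript
// Standard Elo update for a single game between two rated players.
function updateElo(white, black, score, k = 32) {
  const expected = 1 / (1 + 10 ** ((black - white) / 400)); // white's expected score
  const delta = k * (score - expected);
  return [Math.round(white + delta), Math.round(black - delta)];
}
```

Over thousands of games, the deltas shrink toward zero as the ratings converge on each model's true strength.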
Check out the code on GitHub and run your own arena!