Most AlphaZero repositories fall into one of two traps: they're either so heavily optimised that the algorithm is buried under infrastructure, or they're toy demos that don't actually produce a strong player. I wanted something in the middle — clean enough to read, strong enough to beat you at Gomoku.
The result is alphazero-board-games: a lightweight AlphaZero implementation covering Gomoku (9×9 and 15×15) and Connect4, with pretrained checkpoints you can play against immediately.
In this post I'm going to pull apart every major component and explain exactly what's happening and why.
The Big Picture: What AlphaZero Actually Does
Before we touch any code, let's pin down the algorithm at a conceptual level, because plenty of blog posts conflate AlphaGo, AlphaGo Zero, and AlphaZero.
AlphaZero (2017) learns entirely from self-play — no human games, no handcrafted features. The training loop has three interlocked components:
- A residual neural network with two heads: a policy head (probability distribution over moves) and a value head (expected game outcome from –1 to +1).
- Monte Carlo Tree Search (MCTS) guided by the network — the network's policy priors bias which branches MCTS explores; the network's value replaces random rollouts at leaf nodes.
- A self-play RL loop where the network plays against itself, generates game records, and trains on them. A stronger network generates better data, which trains an even stronger network.
The feedback loop is self-bootstrapping. Starting from random weights, given enough iterations the agent discovers strategies that took humans centuries to develop.
Repository Layout
alphazero/ ← shared core (game API, MCTS, network, RL loop)
gomoku_9_9/ ← 9×9 Gomoku rules + trainer + terminal player
gomoku_15_15/ ← 15×15 Gomoku rules + trainer + terminal player
connect4/ ← Connect4 rules + trainer + terminal player
scripts/ ← utility shell scripts
The clean separation between the core engine (alphazero/) and the game presets is the most important architectural decision in the repo. Adding a new game means implementing one interface — you never touch the MCTS or training code.
Component 1: The Abstract Game API
Every game in this project implements the same abstract interface. Roughly, it exposes:
- `get_current_player()` — whose turn it is (encoded as +1 / –1 for the two players)
- `get_valid_moves()` — a flat boolean mask over the action space
- `apply_move(action)` — mutate the state with a move
- `get_game_result()` — returns `None` (ongoing), `+1` (current player won), `–1` (lost), or `0` (draw)
- `encode_state()` — converts the board into a tensor suitable for the neural network
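As a rough sketch in Python (method names follow the list above; the repo's actual signatures may differ):

```python
from abc import ABC, abstractmethod

import numpy as np


class Game(ABC):
    """Abstract game interface; each game preset implements these methods."""

    @abstractmethod
    def get_current_player(self) -> int:
        """Return +1 or -1 for the player to move."""

    @abstractmethod
    def get_valid_moves(self) -> np.ndarray:
        """Return a flat boolean mask over the action space."""

    @abstractmethod
    def apply_move(self, action: int) -> None:
        """Mutate the state by playing `action`."""

    @abstractmethod
    def get_game_result(self):
        """Return None (ongoing), +1 (current player won), -1 (lost), or 0 (draw)."""

    @abstractmethod
    def encode_state(self) -> np.ndarray:
        """Return a (planes, H, W) tensor for the network."""
```

Any class satisfying this contract can be plugged into the shared MCTS and training loop without touching either.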
That last method deserves its own paragraph.
Board Encoding: Planes Over Pixels
Rather than feeding a raw 2D board matrix into the network, the state is encoded as a stack of binary planes. For a two-player game at step t, a typical encoding looks like:
| Plane | Content |
|---|---|
| 0 | Current player's stones |
| 1 | Opponent's stones |
| 2 | (optional) Whose turn it is, broadcast across the board |
This multi-plane representation gives the convolutional network the same spatial locality information a human sees: where my stones are, where theirs are, and who moves next — all without any numeric encoding tricks that might confuse early conv layers.
For Gomoku the action space is simply every empty intersection: board_rows × board_cols possible moves. For Connect4 it's just the 7 columns. The network always outputs a flat vector of size equal to the action space, and the valid-move mask is applied after the softmax to zero out illegal moves before re-normalising.
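A minimal numpy sketch of both ideas, assuming a board array where +1/–1 mark the two players' stones and 0 is empty (function names are illustrative, not the repo's):

```python
import numpy as np

def encode_state(board: np.ndarray, player: int) -> np.ndarray:
    """Stack binary planes: my stones, opponent's stones, whose-turn indicator."""
    return np.stack([
        (board == player).astype(np.float32),            # plane 0: my stones
        (board == -player).astype(np.float32),           # plane 1: opponent's stones
        np.full(board.shape, player, dtype=np.float32),  # plane 2: turn, broadcast
    ])

def masked_policy(logits: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Softmax over the full action space, then mask and re-normalise."""
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = exp / exp.sum()
    probs = probs * valid                # zero out illegal moves
    return probs / probs.sum()           # re-normalise over legal moves
```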
Component 2: The Residual Policy/Value Network
The network is the brain. Its architecture mirrors the one in the original DeepMind paper, scaled down to be trainable on consumer hardware.
Architecture Summary
Input: (batch, planes, H, W)
│
▼
Conv Block (conv → BN → ReLU)
│
▼
Residual Blocks × N ─── (conv → BN → ReLU → conv → BN → skip-add → ReLU)
│
├──▶ Policy Head → FC → softmax → π (action probabilities)
│
└──▶ Value Head → FC → FC → tanh → v ∈ (–1, +1)
Why residual blocks? Vanilla deep CNNs suffer from vanishing gradients. Residual connections (skip connections that add the input to the output of a two-conv stack) let gradients flow directly backwards through the identity path, enabling much deeper networks to train reliably.
Why two heads sharing a backbone? Policy and value are tightly correlated: a position that's good for one player tends to have a narrower set of good moves, not just a higher value. Sharing the convolutional feature extractor forces the network to learn spatial representations that are useful for both tasks simultaneously. The two heads then specialise on top of shared features.
Why tanh on the value head? The game result is always in {–1, 0, +1} (loss / draw / win). tanh naturally squashes the value head's output into (–1, +1), which matches that training signal without needing any normalisation.
What the Network Actually Learns
After sufficient self-play training, the policy head learns to assign high probability to:
- Moves that extend winning threats
- Moves that block the opponent's winning threats
- Moves that create multiple simultaneous threats (forks)
The value head learns the game-theoretic value of a position — essentially "if both players play perfectly from here, who wins?" Early in training it's wild guesses. After training it's accurate enough that MCTS rarely needs to search very deep to get a good evaluation.
Component 3: Monte Carlo Tree Search (MCTS)
This is where the magic really happens. MCTS is the search algorithm that uses the neural network to play the game. Each MCTS search starts at the current board position (the root) and runs N simulations.
One Simulation: Four Steps
Step 1 — Selection. From the root, descend the tree by repeatedly picking the child with the highest PUCT score:
PUCT(s, a) = Q(s, a) + c_puct × P(s, a) × √(ΣN(s, b)) / (1 + N(s, a))
Where:

- `Q(s, a)` — the running average value of taking action `a` from state `s` (exploitation)
- `P(s, a)` — the prior probability from the neural network's policy head (exploration bias)
- `N(s, a)` — the visit count for this edge
- `c_puct` — a constant controlling the exploration–exploitation tradeoff
The formula has a beautiful property: a move with a high prior P gets explored early (when visit counts are low, the second term is large). But as it gets visited and its Q value is refined, the exploration bonus shrinks. Moves that consistently return good results rise to the top; moves that looked promising but turned out weak get deprioritised.
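Here is the formula as a small, self-contained Python function (the `c_puct` default is an illustrative value, not necessarily the repo's):

```python
import math

def puct_score(q: float, prior: float, parent_visits: int,
               edge_visits: int, c_puct: float = 1.5) -> float:
    """PUCT(s, a) = Q(s, a) + c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + edge_visits)
    return q + exploration
```

The shrinking exploration bonus is easy to see numerically: with a fixed prior, the score of an edge falls as its visit count grows while its Q stays flat.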
Step 2 — Expansion. When we reach a node that has never been visited (a leaf), we query the neural network. The network returns (π, v) — a policy vector and a value scalar. We:

- Store π as the prior probabilities for all children of this node
- Use v as the value estimate (instead of playing a random rollout to the end of the game)
Step 3 — Backup. Propagate the value v back up the path to the root, updating Q(s, a) and N(s, a) for every edge traversed. Critically, values are flipped at each ply because the game alternates between players: a value of +0.8 for the player-to-move is –0.8 from their opponent's perspective.
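A sketch of the backup step, assuming each edge on the root-to-leaf path keeps a visit count `N` and total value `W` (field names and the sign convention at the leaf are illustrative; the repo's bookkeeping may differ):

```python
def backup(path, leaf_value: float) -> None:
    """Propagate the leaf value to the root, flipping sign at each ply.

    `path` is the list of edges from root to leaf; Q = W / N.
    """
    value = leaf_value
    for edge in reversed(path):
        edge["N"] += 1
        edge["W"] += value
        edge["Q"] = edge["W"] / edge["N"]
        value = -value  # +0.8 for one player is -0.8 for the other
```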
Step 4 — Move Selection. After all N simulations, the final move is chosen proportional to visit counts at the root:
π_mcts[a] ∝ N(root, a) ^ (1 / temperature)
At temperature = 1.0 (early training) the selection is stochastic — exploration is maximised. At temperature → 0 (competitive play) the move with the most visits is chosen deterministically.
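That selection rule in a few lines of numpy (illustrative, not the repo's exact code):

```python
import numpy as np

def select_move(visit_counts: np.ndarray, temperature: float) -> int:
    """Pick a root move from MCTS visit counts, softened by temperature."""
    if temperature < 1e-3:
        return int(np.argmax(visit_counts))      # deterministic, competitive play
    pi = visit_counts ** (1.0 / temperature)     # sharpen or flatten the counts
    pi = pi / pi.sum()
    return int(np.random.choice(len(pi), p=pi))  # stochastic, early training
```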
Why MCTS Visit Counts Beat Raw Policy Priors
Here's an important subtlety: we don't just play the move with the highest policy prior. We play the move with the most MCTS visits. Why?
Because MCTS is doing one-step lookahead (and more, recursively). A move with a mediocre prior might get visited if its children look promising. Over N simulations, the visit count represents a much stronger signal than the raw policy prior — it has been refined by actually exploring the consequences. This is why AlphaZero is stronger than a greedy policy network alone: MCTS is doing iterative, guided search on top of the network's intuition.
Component 4: The Self-Play RL Loop
Training data isn't downloaded — it's generated by the model playing against itself.
The Data Generation Process
For each game of self-play:

- Start with the current board state.
- Run MCTS for `simulation_num` steps → get `π_mcts` (a probability distribution over moves).
- Sample a move from `π_mcts` (with temperature) and apply it.
- Record the tuple `(encoded_board, π_mcts)` for this step.
- Repeat until the game ends with result `r ∈ {–1, 0, +1}`.
- Assign the value label `z` to each step: +1 for steps made by the winner, –1 for steps made by the loser, 0 for a draw.
The training dataset is a collection of (board_state, π_mcts, z) triples.
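A sketch of the labelling step, assuming each recorded step also stores which player (+1/–1) moved, and that the final result is expressed from player +1's perspective (both are assumptions about the bookkeeping, not the repo's exact code):

```python
def label_game(steps, result: int):
    """Turn recorded (state, pi, player) steps into (state, pi, z) triples.

    `result` is +1/-1/0 from player +1's perspective; each step's z is the
    final result from the perspective of the player who moved at that step.
    """
    return [(state, pi, result * player) for state, pi, player in steps]
```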
The Training Objective
The network is trained to minimise a combined loss:
L = MSE(v, z) + CrossEntropy(π_network, π_mcts) + λ‖θ‖²
- The value loss MSE(v, z) trains the value head to correctly predict game outcomes.
- The policy loss CrossEntropy(π_network, π_mcts) trains the policy head to match MCTS's refined move distribution (not just game outcomes — this is the crucial difference from plain policy gradient RL).
- The L2 regularisation λ‖θ‖² prevents overfitting.
The brilliant insight: the MCTS-refined distribution π_mcts is a better target for the policy than the raw game result. It captures not just "this player won" but which moves MCTS found most promising. The network bootstraps off its own search.
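The loss in numpy form, as a sketch (the weight-decay coefficient is an illustrative value):

```python
import numpy as np

def alphazero_loss(v, z, pi_net, pi_mcts, params, weight_decay=1e-4):
    """L = MSE(v, z) + CrossEntropy(pi_net, pi_mcts) + lambda * ||theta||^2."""
    value_loss = np.mean((v - z) ** 2)
    # cross-entropy of the network policy against the MCTS visit distribution
    policy_loss = -np.mean(np.sum(pi_mcts * np.log(pi_net + 1e-8), axis=1))
    l2 = weight_decay * sum(np.sum(p ** 2) for p in params)
    return value_loss + policy_loss + l2
```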
Data Augmentation via Symmetry
Gomoku and Connect4 have geometric symmetries. A position and its mirror image are strategically identical. The training pipeline exploits this: each recorded game position is augmented with rotations and reflections, multiplying the effective dataset size for free.
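For a square board, rotations and reflections give 8 variants per position; a numpy sketch (note that the recorded policy vector must be transformed with the same symmetry, and that a gravity game like Connect4 only admits the left-right mirror):

```python
import numpy as np

def symmetries(board: np.ndarray):
    """Yield the 8 dihedral symmetries of a square board (4 rotations x flip)."""
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated
        yield np.fliplr(rotated)
```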
The Training Loop in Practice
while True:
    # Phase 1: generate data by self-play with the current model
    games = self_play(model, num_games=N, simulation_num=S)
    replay_buffer.add(games)

    # Phase 2: train on sampled batches
    for batch in replay_buffer.sample(batch_size):
        loss = compute_loss(model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Phase 3 (optional): evaluate the new model vs. the old one
    # in head-to-head play and keep the winner
The train_interval parameter in the trainer controls how many self-play games are collected before each training phase — a key hyperparameter for balancing data freshness vs. compute cost.
Playing Immediately (No Training Required)
The repo includes pretrained checkpoints in each game's data/ directory, so you can skip all of the above and just play:
# Install with uv (Python 3.12+)
uv sync
# Play Gomoku 15×15 in your terminal
uv run python -m gomoku_15_15.stdio_play --human-color W --simulation-num 400
# Play Connect4
uv run python -m connect4.stdio_play --human-color B
Move format is intuitive: E5 or E 5 for Gomoku (column letter + row number), column number for Connect4.
The --simulation-num flag directly controls AI strength. At 400 simulations per move the AI is strong but quick. Push it to 1200+ if you want a serious challenge (and don't mind waiting a few seconds per move).
Training Your Own Models
# Start training from scratch (uses default config)
uv run python -m gomoku_9_9.trainer
# Override hyperparameters from the CLI
uv run python -m gomoku_15_15.trainer -simulation_num 1200 -train_interval 20
Key hyperparameters to tune:
| Parameter | Effect |
|---|---|
| `simulation_num` | MCTS simulations per move. Higher → stronger AI, slower self-play |
| `train_interval` | Games of self-play between training steps |
| `learning_rate` | Standard NN learning rate |
| `batch_size` | Training batch size |
| `num_residual_blocks` | Depth of the network |
| `num_filters` | Width of conv layers (capacity) |
On a modern laptop with a CPU (no GPU required), the 9×9 Gomoku model starts showing real strategy within a few hours of training. The 15×15 model needs more compute but the pretrained checkpoint is already strong.
How to Extend It: Adding a New Game
The cleanest feature of this architecture is that adding a new game is a well-defined, isolated task. You need to:
- Create a new directory, e.g. `tictactoe/`.
- Implement the `Game` abstract interface with your rules: valid moves, move application, win/draw detection, and board encoding.
- Create a `trainer.py` that instantiates your game class and calls the shared training loop.
- Create a `stdio_play.py` that instantiates your game and the MCTS player for terminal play.
The MCTS, the residual network, the RL loop, the replay buffer, the loss function — you inherit all of that for free. The only game-specific code is rules + board encoding. This is the right abstraction boundary.
What Makes This Implementation Different
There are many AlphaZero repos out there. Here's what I was optimising for with this one:
Readability over performance. The MCTS implementation is single-threaded and synchronous. A production system would batch neural network evaluations across parallel tree simulations. That's faster, but harder to read. This codebase is designed to be understood, not to break speed records.
Batteries included. The pretrained checkpoints mean you get a working demo in 30 seconds. Most repos make you train for hours before you see anything interesting.
Modern Python tooling. The project uses uv for dependency management and pyproject.toml for configuration. No requirements.txt version conflicts. No conda environment hell. Just uv sync and you're running.
Multi-game from day one. The abstract game API and shared core were designed upfront, not retrofitted. Gomoku 9×9, Gomoku 15×15, and Connect4 all live in the same repo and share every line of non-game-specific code.
The Key Insight Worth Internalising
If you read nothing else from this post, read this:
AlphaZero doesn't learn moves. It learns to evaluate positions, and then uses search (MCTS) to turn those evaluations into moves.
The network is trained to predict two things: the probability distribution over good moves (policy), and whether the current position is winning (value). MCTS uses these predictions to efficiently explore the game tree. The training data comes from MCTS itself — the network learns to be a better evaluator, which makes MCTS search better, which generates better training data, and the cycle continues.
It's a beautiful closed loop. And it works from nothing — no human games, no domain knowledge, just the rules of the game and enough compute.
Try It, Fork It, Break It
The project is on GitHub at zhixiangli/alphazero-board-games under the Apache-2.0 license.
A few things I'd love to see people build on top of it:
- A new game: Othello, TicTacToe, or even something like Hex
- A stronger training pipeline: async self-play, batched MCTS evaluation
- An evaluation harness: automatically pit new checkpoints against old ones
- A web UI: replace the terminal `stdio_play` with a browser interface
If you find a bug, have a question about the MCTS implementation, or want to discuss a design decision, open an issue or drop a comment below. And if you find the project useful — a ⭐ on GitHub goes a long way!
Happy coding. ♟️