Tic-tac-toe is a solved game. Any competent adult can force a draw every time. But can an agent figure that out with zero human knowledge? Give two agents a blank board, a few simple rules about wins and losses, and nothing else. No opening theory, no strategy guides, no human games to study. After 100,000 games of fumbling against each other, they discover forks, blocking, and centre-first openings entirely on their own.
This is Q-learning applied to games. In our previous Q-learning post, the agent navigated a frozen lake alone, learning from its own mistakes. Here, we add an opponent. The agent can't just learn the environment; it must learn to outsmart another learner who's improving at the same time.
By the end of this post, you'll build two Q-learning agents that teach each other tic-tac-toe through self-play, and you'll understand why this simple setup discovers remarkably strong strategy.
The Problem: Tic-Tac-Toe as an RL Environment
Tic-tac-toe is the simplest non-trivial two-player game. The board has 9 cells, two players alternate placing X and O, and the first to complete a row, column, or diagonal wins. If all cells are filled with no winner, it's a draw.
As an RL problem:
- State: the current board (which cells have X, O, or are empty)
- Actions: place your marker on any empty cell
- Reward: +1 for winning, -1 for losing, 0 for a draw or an ongoing game
- Transition: deterministic (unlike the slippery FrozenLake), but the opponent's move is stochastic from your perspective
The state space is manageable: there are at most $3^9 = 19{,}683$ possible board configurations (fewer in practice, since many are unreachable). This makes tabular Q-learning a perfect fit, with no need for neural network function approximation.
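Before any training code, it helps to see what a state looks like to the agent. Here is a hypothetical mid-game board in the same 0/1/-1 encoding the implementation below uses (the specific position is my own illustration):

```python
import numpy as np

# X holds the centre and top-left corner, O holds the top-right corner.
board = np.array([ 1,  0, -1,
                   0,  1,  0,
                   0,  0,  0])

state_key = tuple(board)                   # hashable key for a Q-table dictionary
legal_actions = np.where(board == 0)[0]    # the empty cells: [1, 3, 5, 6, 7, 8]
print(3 ** 9)                              # 19,683: upper bound on distinct boards
```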
Quick Win: Self-Play in Action
Let's see two Q-learning agents teach each other from scratch. Click the badge to run this yourself:
Watch how the agents' play evolves from random moves (early training) to strategic play (late training):
Here's the complete implementation. We need three pieces: an environment, an agent, and a self-play training loop.
```python
import numpy as np
import random


class TicTacToe:
    """Tic-tac-toe environment. Board is a flat array of 9 cells.
    Values: 0=empty, 1=X, -1=O."""

    def __init__(self):
        self.state = np.zeros(9, dtype=int)

    def reset(self):
        self.state = np.zeros(9, dtype=int)
        return self.state.copy()

    def available_actions(self):
        return np.where(self.state == 0)[0]

    def step(self, action, marker):
        self.state[action] = marker
        if self._is_winner():
            return self.state.copy(), 1, True, 'win'
        elif len(self.available_actions()) == 0:
            return self.state.copy(), 0, True, 'draw'
        return self.state.copy(), 0, False, 'ongoing'

    def _is_winner(self):
        b = self.state.reshape(3, 3)
        for i in range(3):
            if abs(b[i].sum()) == 3: return True
            if abs(b[:, i].sum()) == 3: return True
        if abs(np.diag(b).sum()) == 3: return True
        if abs(np.diag(np.fliplr(b)).sum()) == 3: return True
        return False
```
The agent is a standard Q-learner with one key adaptation: Q-values for occupied cells are set to NaN so the agent never tries to play in a taken position.
```python
class QLearningAgent:
    def __init__(self, marker, epsilon=1.0, lr=1.0,
                 gamma=0.95, final_epsilon=0.05):
        self.marker = marker              # 1 for X, -1 for O
        self.epsilon = epsilon
        self.lr = lr
        self.gamma = gamma
        self.final_epsilon = final_epsilon
        self.q_table = {}                 # {tuple(state): np.array(9)}

    def _get_q(self, state):
        key = tuple(state)
        if key not in self.q_table:
            q = np.full(9, np.nan)
            q[state == 0] = 0.0           # only empty cells get Q-values
            self.q_table[key] = q
        return self.q_table[key]

    def pick_action(self, state):
        available = np.where(state == 0)[0]
        if np.random.rand() < self.epsilon:
            return np.random.choice(available)
        q = self._get_q(state)
        available_q = [(a, q[a]) for a in available]
        max_q = max(v for _, v in available_q)
        best = [a for a, v in available_q if v == max_q]
        return random.choice(best)

    def update(self, state, action, reward, next_state, done):
        q = self._get_q(state)
        if done:
            target = reward
        else:
            next_q = self._get_q(next_state)
            target = reward + self.gamma * np.nanmax(next_q)
        q[action] += self.lr * (target - q[action])
```
Now the self-play training loop. Both agents learn simultaneously, with the loser receiving a -1 reward when the other wins:
```python
env = TicTacToe()
agent_x = QLearningAgent(marker=1, epsilon=1.0, lr=1.0, gamma=0.95)
agent_o = QLearningAgent(marker=-1, epsilon=1.0, lr=1.0, gamma=0.95)
eps_decay = 2.5e-5

for ep in range(100_000):
    state = env.reset()
    agents = [agent_x, agent_o]
    if random.random() < 0.5:
        agents = [agent_o, agent_x]       # randomise who goes first
    turn = 0
    history = []
    done = False

    while not done:
        agent = agents[turn % 2]
        s = state.copy()
        action = agent.pick_action(s)
        next_state, reward, done, info = env.step(action, agent.marker)
        history.append((agent, s, action, reward, next_state, done))

        if done:
            # winner learns from the final move
            agent.update(s, action, reward, next_state, done)
            # loser learns too: propagate -reward to their last move
            if info == 'win' and len(history) >= 2:
                other = agents[(turn + 1) % 2]
                prev = history[-2]
                other.update(prev[1], prev[2], -reward, next_state, True)
        else:
            agent.update(s, action, reward, next_state, done)

        state = next_state
        turn += 1

    # decay epsilon for both agents
    for a in [agent_x, agent_o]:
        if a.epsilon > a.final_epsilon:
            a.epsilon -= eps_decay
```
After training, both agents win around 85% of their games against a random opponent (85% when playing as X, 84% when playing as O).
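To reproduce a figure like this yourself, freeze the trained agent (epsilon set to 0 so it plays greedily) and pit it against an opponent that picks uniformly at random. The evaluation loop below is a sketch built on the classes above, not part of the original training code:

```python
def evaluate_vs_random(agent, env, episodes=1000, agent_first=True):
    """Play the trained agent (greedy, no exploration) against a random opponent."""
    old_epsilon, agent.epsilon = agent.epsilon, 0.0   # switch off exploration
    wins = draws = 0
    for _ in range(episodes):
        state = env.reset()
        agents_turn = agent_first
        done = False
        while not done:
            if agents_turn:
                action = agent.pick_action(state)
                marker = agent.marker
            else:
                action = np.random.choice(env.available_actions())
                marker = -agent.marker
            state, reward, done, info = env.step(action, marker)
            if done and info == 'win' and agents_turn:
                wins += 1
            elif done and info == 'draw':
                draws += 1
            agents_turn = not agents_turn
    agent.epsilon = old_epsilon                       # restore training epsilon
    return wins / episodes, draws / episodes

win_rate, draw_rate = evaluate_vs_random(agent_x, TicTacToe())
print(f"win rate vs random: {win_rate:.0%}, draw rate: {draw_rate:.0%}")
```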
You just trained two agents to play tic-tac-toe without teaching them a single strategy. Let's understand how.
What Just Happened?
The Board as State, Cells as Actions
The environment represents the board as a flat array of 9 integers: 1 for X, -1 for O, 0 for empty. This encoding is compact and makes win detection elegant. A row, column, or diagonal sums to +3 (X wins) or -3 (O wins).
```python
# Check rows, columns, diagonals (excerpt from _is_winner)
b = state.reshape(3, 3)
for i in range(3):
    if abs(b[i].sum()) == 3: return True      # row i
    if abs(b[:, i].sum()) == 3: return True   # column i
```
The action space is the set of empty cells. pick_action only ever chooses among them, and marking occupied positions with NaN in the Q-table means they never receive a value and np.nanmax simply ignores them when computing the bootstrap target:
```python
q = np.full(9, np.nan)
q[state == 0] = 0.0  # only legal moves get Q-values
```
Self-Play: The Opponent is the Curriculum
The key insight of self-play is that both agents improve together. In early training, epsilon (the probability of choosing a random action instead of the greedy one) starts at 1.0, so both play nearly randomly and wins and losses are noisy. As epsilon decays linearly towards 0.05, they exploit what they've learned, and the opponent becomes a tougher challenge.
This creates an arms race. Watch the training curve:
Three things happen as training progresses:
- Draw rate rises from ~10% to ~42%. Both agents get better at defending, so fewer games end in a clear win.
- Win rates equalise. X starts with a slight advantage (going first), but by the end, both hover around 30%.
- The transition is sharp. Around episode 30,000, epsilon has decayed enough that agents exploit their Q-values more than they explore. The draw rate shoots up.
Reward Propagation in Adversarial Games
In single-agent Q-learning (like FrozenLake), the agent updates after every step. In a two-player game, we need an extra mechanism: when one agent wins, the loser must also learn from its last move.
```python
if info == 'win' and len(history) >= 2:
    other = agents[(turn + 1) % 2]
    prev = history[-2]
    other.update(prev[1], prev[2], -reward, next_state, True)
```
The winner gets reward +1. The loser's last move gets -1. This is how the agent learns defensive play: "the move I made two turns ago led to my opponent winning, so that was a bad move."
Reading the Q-Values
The Q-table is where the agent's strategy lives. Each entry says: "from this board state, how good is it to play in each empty cell?" Let's look at three critical situations the agent learned to handle:
Left panel (Set Up a Fork): X has the centre and top-left corner. The agent assigns Q = +0.85 to the bottom-right corner (position 8), which creates a fork: two ways to win that the opponent can't both block. Every other empty cell gets Q = 0.
Centre panel (Block or Lose): O has positions 0 and 3, threatening to complete the left column. The Q-values here are all negative except position 6 (Q = 0.00), the blocking move. The agent learned that not blocking leads to certain defeat. Notice the agent didn't just learn that position 6 is good; it learned that every other option is bad.
Right panel (Take the Win): X has positions 0 and 1, one move away from completing the top row. Position 2 gets Q = +0.81. The agent learned to finish the game when the opportunity is there, rather than play elsewhere.
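You can query a trained agent's Q-table for any board you care about. The snippet below is a sketch of the "take the win" situation: the post only specifies X's marks (cells 0 and 1), so the O marks on cells 3 and 4 are my own assumption, and the exact values you see will depend on your training run:

```python
# Hypothetical "take the win" board: X needs only cell 2 to complete the top row.
board = np.array([ 1,  1,  0,
                  -1, -1,  0,
                   0,  0,  0])

q = agent_x.q_table.get(tuple(board))
if q is None:
    print("this exact position was never visited during training")
else:
    for cell, value in enumerate(q):
        if not np.isnan(value):              # NaN marks occupied cells
            print(f"cell {cell}: Q = {value:+.2f}")
```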
Going Deeper
Q-Learning in Games vs Single-Agent Environments
In a single-agent setting like FrozenLake or Value Iteration on a grid world, the environment is stationary. The transition probabilities don't change. In a game with self-play, the "environment" includes the opponent, and the opponent is changing constantly.
This means Q-learning in games violates a core assumption: stationarity. The Markov property still holds (the board state contains all relevant information), but the transition dynamics shift as the opponent improves. In practice, this works because both agents improve gradually, and the learning rate is high enough to track the changing opponent.
The Learning Rate = 1 Choice
You might have noticed lr=1.0, which seems aggressive. With $\alpha = 1$, each Q-update completely replaces the old value:

$$Q(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$$
This works for tic-tac-toe because the transition the update actually sees is deterministic: next_state is the board immediately after your own move, before the opponent responds, so taking a specific action from a given board always produces the same next state. With $\alpha = 1$, the agent always uses the most recent outcome, which adapts quickly to the opponent's evolving strategy.
For stochastic environments, $\alpha = 1$ would be catastrophic, as it would forget everything from past experience. But for deterministic transitions in a game, it's ideal.
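A tiny numeric example (made-up values, not taken from the trained agent) shows the difference:

```python
q_old, target = 0.3, 1.0   # hypothetical current estimate and TD target

# alpha = 1.0: the update is simply "copy the target"; the old estimate is discarded
q_new = q_old + 1.0 * (target - q_old)        # 1.0

# alpha = 0.1: the estimate only moves a tenth of the way towards the target
q_blended = q_old + 0.1 * (target - q_old)    # 0.37
```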
The Self-Play Arms Race
Self-play training has a characteristic signature: the draw rate is a proxy for skill. When two beginners play, most games end in wins (because both make exploitable mistakes). When two experts play, most games end in draws (because neither makes a mistake worth exploiting).
Tic-tac-toe with perfect play from both sides is provably a draw. Our agents' ~42% draw rate suggests they're strong but not perfect: the residual epsilon of 0.05 still forces the occasional random move, and the remaining gaps in their Q-tables are mistakes the opponent can exploit.
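If you want to log this curve for your own run, one lightweight option (a sketch, not part of the training code above) is to keep a sliding window of recent outcomes and compute rates from it every few thousand episodes:

```python
from collections import deque

recent = deque(maxlen=1000)   # outcomes of the last 1,000 games: 'x', 'o', or 'draw'

def outcome_rates(recent):
    """Fraction of recent games won by X, won by O, or drawn."""
    n = max(len(recent), 1)
    return {k: sum(outcome == k for outcome in recent) / n for k in ('x', 'o', 'draw')}

# Inside the training loop, after a game ends, record the result:
#   recent.append('draw' if info == 'draw' else ('x' if agent.marker == 1 else 'o'))
# and print(ep, outcome_rates(recent)) every few thousand episodes to build the curve.
```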
Hyperparameter Sensitivity
The original code uses these values, all from the source implementation:
| Parameter | Value | Why |
|---|---|---|
| `gamma` | 0.95 | Games are short (5-9 moves), so moderate discounting works. Higher values (0.99) also work. |
| `lr` | 1.0 | Deterministic transitions; always use the latest outcome. |
| `epsilon` | 1.0 to 0.05 | Start fully random, end mostly greedy. |
| `eps_decay` | 2.5e-5 | Linear decay over ~38,000 episodes to reach `final_epsilon` (see the calculation below). |
| episodes | 100,000 | Enough for the Q-table to converge on the roughly 6,600 entries it accumulates. |
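The ~38,000-episode figure follows directly from the linear schedule: epsilon falls from 1.0 to 0.05 in steps of 2.5e-5 per episode, so

$$\frac{1.0 - 0.05}{2.5 \times 10^{-5}} = 38{,}000 \text{ episodes.}$$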
The Q-table ends up with roughly 6,600 entries (out of the theoretical 19,683 board configurations). Many configurations are unreachable in valid play (e.g., a board where X has played 5 times but O has played once).
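If you want to verify the reachability claim, you can enumerate every position that legal play can actually produce with a short breadth-first search from the empty board. This is a standalone sketch, independent of the training code; it should come out to 5,478 distinct positions, well under the $3^9$ upper bound:

```python
def count_reachable_positions():
    """Count every board reachable in legal play (X moves first)."""
    def has_winner(board):
        lines = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]
        return any(abs(board[a] + board[b] + board[c]) == 3 for a, b, c in lines)

    start = tuple([0] * 9)
    seen = {start}
    frontier = [start]
    while frontier:
        nxt = []
        for board in frontier:
            if has_winner(board) or 0 not in board:
                continue                                      # game over: stop expanding
            marker = 1 if board.count(0) % 2 == 1 else -1     # X moves when empties are odd
            for i, cell in enumerate(board):
                if cell == 0:
                    child = board[:i] + (marker,) + board[i + 1:]
                    if child not in seen:
                        seen.add(child)
                        nxt.append(child)
        frontier = nxt
    return len(seen)

print(count_reachable_positions())
```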
When NOT to Use Tabular Q-Learning for Games
Tabular Q-learning works beautifully for tic-tac-toe because the state space is tiny. It fails for:
- Chess ($\sim 10^{44}$ legal positions): the Q-table would be impossibly large
- Go ($\sim 10^{170}$ legal positions): even worse
- Games with continuous state spaces: no table can hold them
For these, you need function approximation: deep Q-networks replace the table with a neural network, or policy gradient methods learn a policy directly. The ideas from this post (self-play, reward propagation, exploration) carry forward directly.
Comparison: Self-Play vs Teacher
Our implementation uses self-play: both agents learn simultaneously. An alternative approach (also in the original code) trains against a teacher, a heuristic opponent that plays well but not perfectly. Self-play has the advantage of being curriculum-free: you don't need to design a teacher, and the difficulty automatically scales with the learner's ability. The downside is that training can be unstable early on, as the quality of the training signal depends on having a reasonable opponent.
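To make "teacher" concrete, here is a minimal sketch of such a heuristic opponent (my own illustration, not the original post's teacher): take an immediate win if one exists, block the opponent's immediate win otherwise, and play a random empty cell the rest of the time.

```python
def teacher_move(state, marker):
    """Heuristic opponent: win if possible, block if necessary, otherwise play randomly."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    # First look for our own winning move, then for a block of the opponent's.
    for target in (marker, -marker):
        for a, b, c in lines:
            cells = [state[a], state[b], state[c]]
            if cells.count(target) == 2 and cells.count(0) == 1:
                return (a, b, c)[cells.index(0)]
    return np.random.choice(np.where(state == 0)[0])
```

Training against a teacher like this would simply replace one of the two Q-learning agents in the loop above; the learner still updates exactly as before.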
Where This Comes From
The Roots: Watkins and Temporal Difference Learning
Q-learning was introduced by Chris Watkins in his 1989 PhD thesis, "Learning from Delayed Rewards." The core idea is that an agent can learn the value of actions without knowing the environment's dynamics, purely from the reward signal and the temporal difference between consecutive estimates.
The update rule we used is exactly Watkins' formulation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
The term in brackets is the TD error: the difference between what we expected ($Q(s_t, a_t)$) and what we actually observed ($r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$). Learning adjusts Q towards the observed value.
Watkins and Dayan (1992) later proved that Q-learning converges to optimal Q-values under certain conditions: every state-action pair must be visited infinitely often, and the learning rate must satisfy the Robbins-Monro conditions ($\sum \alpha = \infty$, $\sum \alpha^2 < \infty$). Our $\alpha = 1$ technically violates these conditions, but the deterministic nature of tic-tac-toe means the algorithm still converges in practice.
Game-Playing AI: A Brief History
Games have been the proving ground for AI since the field's inception. Sutton and Barto open Chapter 1 of Reinforcement Learning: An Introduction with exactly this problem: a temporal-difference learner playing tic-tac-toe. They use it to introduce the core RL concepts before any formal machinery.
The lineage of game-playing RL runs deep:
- Samuel (1959): Arthur Samuel's checkers program was one of the first learning programs, using a form of temporal difference learning decades before the name existed. It beat its creator.
- Tesauro (1995): Gerald Tesauro's TD-Gammon used temporal difference learning with a neural network to play backgammon at world-champion level. It discovered novel strategies that human experts later adopted.
- Silver et al. (2016): AlphaGo combined deep neural networks with Monte Carlo tree search and self-play to defeat the world Go champion. The self-play idea is the same as ours; only the scale is different.
"The game of tic-tac-toe is a simple example, but it illustrates the fundamental principles of reinforcement learning: learning from interaction, temporal difference methods, and the trade-off between exploration and exploitation."
-- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 1
Connection to Minimax
For a two-player, zero-sum game like tic-tac-toe, optimal play follows the minimax principle: each player assumes the opponent plays optimally and chooses the action that maximises the minimum possible outcome.
Q-learning with self-play implicitly converges towards minimax values. When both agents are learning optimally, the Q-values for X represent $\max$ (X wants to maximise its reward) and the Q-values for O represent $\min$ (O wants to minimise X's reward, which is equivalent to maximising O's own). The self-play training process, where both agents simultaneously improve, pushes the Q-values towards this minimax equilibrium.
This is why our agents discover strong strategy without being told about minimax: the competitive pressure of self-play naturally drives them there.
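If you want a ground truth to compare the learned Q-values against, a plain minimax (negamax) solver for tic-tac-toe is small enough to write directly. This is an independent sketch, not part of the post's code; it returns +1, 0, or -1 from the perspective of the player to move:

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

@lru_cache(maxsize=None)
def minimax(board, player):
    """Value of `board` (a tuple of 0/1/-1) for `player` (1 or -1) under optimal play."""
    for a, b, c in LINES:
        s = board[a] + board[b] + board[c]
        if abs(s) == 3:                       # a line is already complete
            return 1 if s == 3 * player else -1
    if 0 not in board:
        return 0                              # full board, no winner: draw
    values = []
    for i, cell in enumerate(board):
        if cell == 0:
            child = board[:i] + (player,) + board[i + 1:]
            values.append(-minimax(child, -player))   # opponent's best reply, negated
    return max(values)

# Perfect play from the empty board is a draw:
print(minimax(tuple([0] * 9), 1))   # 0
```

In the limit of perfect learning, the sign of each Q-value should agree with this solver's verdict for the corresponding move.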
Further Reading
- The original thesis: Watkins (1989) "Learning from Delayed Rewards", Sections 3-4 for the Q-learning algorithm
- The convergence proof: Watkins & Dayan (1992) in Machine Learning
- The RL textbook: Sutton & Barto (2018), Chapter 1 (tic-tac-toe example) and Chapter 6 (TD learning)
- Self-play at scale: Silver et al. (2017) "Mastering Chess and Shogi by Self-Play", the AlphaZero paper
- Next step: Our DQN post shows how to replace the Q-table with a neural network for environments too large for tables
Interactive Tools
- Q-Learning Visualiser — Watch Q-learning train step-by-step on grid worlds in the browser
Related Posts
- Q-Learning from Scratch: Navigating the Frozen Lake (tabular Q-learning fundamentals)
- Value Iteration vs Q-Learning: Dynamic Programming Meets RL (comparing model-based and model-free approaches)
- Deep Q-Networks: When Tables Aren't Enough (scaling Q-learning with neural networks)
- Policy Gradients and REINFORCE from Scratch (an alternative to Q-learning that learns a policy directly)
Frequently Asked Questions
What is Q-learning with self-play?
Q-learning is a reinforcement learning algorithm that learns the value of each state-action pair by interacting with an environment. Self-play means both players are Q-learning agents training against each other. As each agent improves, it forces the other to improve too, driving both towards optimal play without needing a hand-crafted opponent.
Why use self-play instead of training against a fixed opponent?
A fixed opponent (random or rule-based) has a ceiling: once your agent exploits its weaknesses, it stops improving. Self-play creates an ever-improving curriculum because the opponent adapts alongside the learner. This naturally pushes both agents towards minimax-optimal strategies.
How does epsilon affect self-play training?
Epsilon controls how often the agent takes a random action instead of its current best. Too low and the agents settle into a narrow set of positions, missing better strategies. Too high and learning is slow because actions are mostly random. Decaying epsilon over time (high early, low late) gives broad exploration first, then refined exploitation.
Does Q-learning with self-play always converge to optimal play in tic-tac-toe?
In practice, yes, given enough training episodes and appropriate hyperparameters. Tic-tac-toe has a small enough state space (under 6,000 reachable positions) that tabular Q-learning can visit every state-action pair many times. The Q-values approach the minimax equilibrium, where both agents play perfectly and every game ends in a draw.
Can this approach scale to more complex games like chess or Go?
Not with a Q-table. Chess has roughly $10^{44}$ legal positions, making tabular Q-learning impossible. For complex games, you replace the table with a neural network (Deep Q-Networks) or use policy gradient methods. AlphaGo and AlphaZero used self-play with deep neural networks and Monte Carlo tree search to master Go, chess, and shogi.
What is the difference between Q-learning and minimax for game playing?
Minimax requires a complete model of the game (all possible states and transitions) and searches the full game tree. Q-learning is model-free: it learns from experience without needing the game rules explicitly. For small games like tic-tac-toe both reach the same optimal strategy, but Q-learning generalises to environments where you cannot enumerate the full game tree.