Imagine you're standing on a frozen lake. Your goal is on the far side, but there are holes in the ice — fall in and it's game over. Worse, the ice is slippery: when you try to go right, you might slide up or down instead. You have no map, no instructions. All you can do is try, fail, and gradually learn which moves lead to safety.
This is exactly what Q-learning solves. The agent learns a value for every state-action pair — "how good is it to take action A from state S?" — purely from trial and error. No model of the environment needed, no supervision, just rewards.
By the end of this post, you'll implement Q-learning from scratch, train an agent to navigate OpenAI's FrozenLake environment, and understand the Bellman equation that makes it all work. You'll also see why exploration matters — and what happens when an agent gets greedy too early.
The algorithm was introduced by Watkins (1989) in his PhD thesis and its convergence was proven by Watkins & Dayan (1992).
The Problem: FrozenLake
FrozenLake (originally from OpenAI Gym, now maintained in Gymnasium) is a 4×4 grid world:
```
SFFF    S = Start
FHFH    F = Frozen (safe)
FFFH    H = Hole (game over)
HFFG    G = Goal (reward = 1)
```
The agent can move Left, Down, Right, Up. But on slippery ice, each action has only a 1/3 chance of going in the intended direction — there's a 1/3 chance of sliding in each perpendicular direction. The only reward is +1 for reaching the Goal. Fall in a Hole or run out of steps and you get 0.
This is a model-free problem: the agent doesn't know the transition probabilities or the reward structure. It must learn entirely from experience.
Quick Win: Run the Algorithm
Let's see Q-learning in action. In the accompanying interactive notebook, you can watch the agent's learned policy evolve from random arrows (no idea what to do) to a coherent strategy that avoids holes and reaches the goal. The colour intensity shows $V(s) = \max_a Q(s, a)$ — how valuable each state is.
```python
import numpy as np
import gymnasium as gym

def q_learning(env, n_episodes=20000, alpha=0.1, gamma=0.99,
               epsilon_start=1.0, epsilon_end=0.01, decay_rate=1e-3):
    """Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rewards = []
    epsilon = epsilon_start

    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        for step in range(100):
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-learning update (Bellman equation)
            Q[state, action] += alpha * (
                reward + gamma * np.max(Q[next_state]) - Q[state, action]
            )

            total_reward += reward
            state = next_state
            if done:
                break

        rewards.append(total_reward)
        epsilon = epsilon_end + (epsilon_start - epsilon_end) * np.exp(
            -decay_rate * episode
        )

    return Q, rewards

# Train the agent
env = gym.make('FrozenLake-v1', is_slippery=True)
Q, rewards = q_learning(env)

# Show the learned policy
actions = ['←', '↓', '→', '↑']
policy = np.array([actions[a] for a in np.argmax(Q, axis=1)]).reshape(4, 4)
print("Learned policy:")
print(policy)
print(f"\nSuccess rate (last 1000): {np.mean(rewards[-1000:]):.1%}")
```
The result: After 20,000 episodes, the agent learns a policy that reaches the goal ~67% of the time. Even with a good policy, the slippery ice inevitably pushes you into holes — the theoretical maximum for this environment is about 74%.
Visualise the Learning
```python
import matplotlib.pyplot as plt

# 100-episode rolling average of the per-episode reward (success indicator)
rolling = np.convolve(rewards, np.ones(100) / 100, mode='valid')

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(rolling, 'b-', linewidth=0.8)
ax.set_xlabel('Episode')
ax.set_ylabel('Success Rate (100-episode rolling avg)')
ax.set_title('Q-Learning on FrozenLake')
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)
plt.show()
```
The reward curve climbs from 0% (random) to ~67% (near-optimal). The initial flat period is pure exploration — the agent is collecting experience before it starts exploiting what it's learned.
What Just Happened?
Q-learning maintains a Q-table — a lookup table mapping every (state, action) pair to a value. For FrozenLake, that's 16 states × 4 actions = 64 entries, all starting at zero.
The Bellman Equation
The core update rule is one line of code:
```python
Q[state, action] += alpha * (
    reward + gamma * np.max(Q[next_state]) - Q[state, action]
)
```

This implements the temporal difference (TD) update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$

Let's break this apart:
- $Q(s, a)$ — current estimate of "how good is action $a$ in state $s$?"
- $r$ — immediate reward received
- $\gamma \max_{a'} Q(s', a')$ — discounted value of the best action in the next state
- $r + \gamma \max_{a'} Q(s', a')$ — the TD target (what we think the true value is)
- $\text{target} - Q(s, a)$ — the TD error (how wrong we were)
- $\alpha$ — learning rate (how much to correct)
Think of it like updating a GPS estimate: you have a current estimate of the distance ($Q(s,a)$), you observe a new measurement (the target), and you adjust by a fraction ($\alpha$) of the difference.
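Here is the update worked through once with invented numbers, to make the arithmetic concrete:

```python
import numpy as np

# Hypothetical situation: next state 14 already has a best action worth 0.5.
Q = np.zeros((16, 4))
Q[14] = [0.0, 0.0, 0.5, 0.0]
alpha, gamma = 0.1, 0.99

# The agent moves from state 10 with action 1, gets reward 0, lands in state 14.
state, action, reward, next_state = 10, 1, 0.0, 14

td_target = reward + gamma * np.max(Q[next_state])  # 0 + 0.99 * 0.5 = 0.495
td_error = td_target - Q[state, action]             # 0.495 - 0.0
Q[state, action] += alpha * td_error                # 0.1 * 0.495 = 0.0495
```

Even though state 10 yielded no reward, its value rises because it leads somewhere valuable — this is how the goal reward propagates backwards through the table, one step per visit.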
Epsilon-Greedy Exploration
```python
if np.random.random() < epsilon:
    action = env.action_space.sample()  # explore: random action
else:
    action = np.argmax(Q[state])        # exploit: best known action
```
With probability $\epsilon$, the agent takes a random action (exploration). Otherwise, it takes the best action according to its Q-table (exploitation).
Early in training, $\epsilon = 1.0$ — pure exploration. Over time, it decays exponentially:

$$\epsilon_t = \epsilon_{\text{end}} + (\epsilon_{\text{start}} - \epsilon_{\text{end}})\, e^{-\lambda t}$$

where $t$ is the episode number and $\lambda$ is the decay rate.
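The schedule from the training loop can be inspected in isolation — a small sketch using the same default hyperparameters:

```python
import numpy as np

def epsilon_at(episode, eps_start=1.0, eps_end=0.01, decay_rate=1e-3):
    """Exponential decay schedule used in the training loop above."""
    return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * episode)

for ep in (0, 1000, 5000, 20000):
    print(ep, round(float(epsilon_at(ep)), 3))
```

By episode 5,000 the agent is almost fully greedy, which is why the reward curve flattens out well before training ends.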
This mirrors the exploration-exploitation trade-off in MCMC — the Metropolis-Hastings algorithm also balances proposing random moves with accepting good ones.
Why "Off-Policy"?
Q-learning is off-policy: it learns about the greedy policy ($\max_{a'} Q(s', a')$) while following an exploratory policy (epsilon-greedy). This separation is powerful — the agent can explore freely without corrupting what it's learning about the optimal policy.
The alternative, SARSA (on-policy), updates using the action the agent actually took next, not the best possible action. SARSA is more conservative — it learns a safer policy that accounts for its own exploration.
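The difference is one term in the target. A side-by-side sketch on a single hypothetical transition (all numbers invented for illustration):

```python
import numpy as np

Q = np.zeros((16, 4))
Q[14] = [0.2, 0.0, 0.5, 0.0]  # next state: greedy action worth 0.5
alpha, gamma = 0.1, 0.99

# The agent, following epsilon-greedy, actually picked next_action = 0
# (an exploratory move worth only 0.2, not the greedy argmax worth 0.5).
state, action, reward, next_state, next_action = 10, 1, 0.0, 14, 0

# Q-learning bootstraps from the best next action (off-policy):
q_target = reward + gamma * np.max(Q[next_state])        # 0.99 * 0.5 = 0.495
# SARSA bootstraps from the action actually taken (on-policy):
sarsa_target = reward + gamma * Q[next_state, next_action]  # 0.99 * 0.2 = 0.198

# The SARSA update uses its own, more conservative target:
Q[state, action] += alpha * (sarsa_target - Q[state, action])
```

Because SARSA's target reflects exploratory mistakes, it tends to steer away from states where a random slip is costly — on cliff-like grid worlds it famously learns the safer, longer path.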
Going Deeper
Hyperparameters
| Parameter | What it controls | FrozenLake value | Guidance |
|---|---|---|---|
| `alpha` | Learning rate | 0.1 | Low for stochastic envs, higher for deterministic |
| `gamma` | Discount factor | 0.99 | Close to 1 for long-horizon tasks |
| `epsilon_start` | Initial exploration | 1.0 | Always start fully random |
| `epsilon_end` | Final exploration | 0.01 | Small but non-zero |
| `decay_rate` | Exploration decay | 0.001 | Slower = more exploration |
| `n_episodes` | Training budget | 20,000 | Until convergence |
`gamma` is critical. It controls how much the agent values future rewards vs immediate ones:
- $\gamma = 0$ — only cares about immediate reward (myopic)
- $\gamma = 1$ — values all future rewards equally (farsighted)
- $\gamma = 0.95$ — a reward 10 steps away is worth $0.95^{10} \approx 0.60$ of its face value
For FrozenLake, the goal is ~6 steps away but stochastic transitions make paths longer. With $\gamma = 0.99$, a reward 10 steps away is worth $0.99^{10} \approx 0.90$ — the agent strongly values the distant goal.
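To build intuition for how sharply each setting discounts the future, you can tabulate $\gamma^k$ directly:

```python
# Discount applied to a reward k steps in the future, for several gammas.
for gamma in (0.5, 0.8, 0.95, 0.99):
    discounts = {k: round(gamma ** k, 3) for k in (1, 5, 10, 20)}
    print(f"gamma={gamma}: {discounts}")
```

At $\gamma = 0.5$ a reward 10 steps away is worth about 0.001 of its face value — effectively invisible — which is why low discount factors fail on FrozenLake.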
Slippery vs Non-Slippery
```python
# Deterministic: agent goes exactly where intended
env_easy = gym.make('FrozenLake-v1', is_slippery=False)
Q_easy, rewards_easy = q_learning(env_easy)
print(f"Non-slippery success rate: {np.mean(rewards_easy[-1000:]):.1%}")  # ~100%

# Stochastic: 1/3 chance of sliding sideways
env_hard = gym.make('FrozenLake-v1', is_slippery=True)
Q_hard, rewards_hard = q_learning(env_hard)
print(f"Slippery success rate: {np.mean(rewards_hard[-1000:]):.1%}")  # ~67%
```
Without slippery ice, the agent achieves ~100% success — the problem becomes trivial. The stochastic transitions are what make FrozenLake interesting: the agent must learn a policy that's robust to noise.
When NOT to Use Tabular Q-Learning
- Large state spaces — A 100×100 grid has 10,000 states × 4 actions = 40,000 Q-values. An Atari game has ~10¹⁸ possible states. Tables don't scale — use neural network function approximation instead (Deep Q-Networks)
- Continuous states — Temperature, position, velocity can't be tabulated. Use function approximation or discretise
- Continuous actions — Q-learning requires $\max_a Q(s,a)$, which is trivial for discrete actions but requires optimisation for continuous ones. Use policy gradient methods instead
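For low-dimensional continuous states, discretisation can be sketched with `np.digitize` — the 1-D position and bucket count here are made up for illustration:

```python
import numpy as np

# Map a continuous position in [0, 1] onto 10 buckets, so a Q-table
# indexed by bucket can stand in for the continuous state.
bins = np.linspace(0.0, 1.0, num=11)  # 11 edges -> 10 buckets

def discretise(position):
    """Return the bucket index (0..9) for a position in [0, 1]."""
    return int(np.digitize(position, bins[1:-1]))

print(discretise(0.03), discretise(0.47), discretise(0.99))
```

This works for one or two dimensions, but the number of buckets grows exponentially with dimension — the curse of dimensionality that pushes practitioners towards function approximation.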
Alternatives:
- Deep Q-Networks (DQN) — Replace the Q-table with a neural network
- SARSA — On-policy variant, learns a safer policy
- Policy gradient methods — Directly optimise the policy, handle continuous actions
- Monte Carlo methods — Wait until the end of the episode to update (no bootstrapping)
Deep Dive: The Papers
Watkins (1989)
Christopher Watkins introduced Q-learning in his PhD thesis Learning from Delayed Rewards at Cambridge. The key problem he addressed: how can an agent learn optimal behaviour when rewards are delayed — you don't know if a move was good until many steps later?
His solution was to learn the action-value function $Q(s, a)$ directly, rather than the state-value function $V(s)$. The crucial insight: by learning $Q$ values, the agent can select optimal actions without needing a model of the environment — it just picks $\arg\max_a Q(s, a)$.
Convergence Guarantee
Watkins & Dayan proved in their 1992 paper that Q-learning converges to the optimal Q-function $Q^*$ under two conditions:
- Every state-action pair is visited infinitely often
- The learning rate decays appropriately (but not too fast)
Formally, given the update with a time-dependent learning rate $\alpha_t$:

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t \left[\, r + \gamma \max_{a'} Q_t(s', a') - Q_t(s, a) \,\right]$$

Convergence requires:

$$\sum_{t} \alpha_t = \infty \quad \text{and} \quad \sum_{t} \alpha_t^2 < \infty$$

The first condition ensures sufficient learning; the second ensures the updates eventually settle. In practice, a constant learning rate works well for tabular problems.
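One common schedule satisfying both conditions is a harmonic decay per state-action visit — a minimal sketch (the `visits` counter and `alpha_for` helper are illustrative, not part of the post's implementation):

```python
import numpy as np

# Per-(state, action) harmonic decay: alpha_t = 1 / N(s, a).
# sum(1/n) diverges and sum(1/n^2) converges, so both conditions hold.
visits = np.zeros((16, 4), dtype=int)

def alpha_for(state, action):
    """Return the learning rate for this visit and bump the counter."""
    visits[state, action] += 1
    return 1.0 / visits[state, action]

print([alpha_for(0, 2) for _ in range(4)])  # successive rates 1, 1/2, 1/3, 1/4
```

To use it, replace the constant `alpha` in the update with `alpha_for(state, action)`; frequently visited pairs then settle while rarely visited ones stay plastic.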
The Bellman Optimality Equation
Q-learning is grounded in Bellman's Dynamic Programming (1957). The optimal Q-function satisfies the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \,\right]$$
This is a fixed-point equation: the optimal Q-values are self-consistent. Q-learning iteratively approximates this fixed point using sampled transitions instead of computing the full expectation.
Our Implementation vs the Theory
| Theory | Our code |
|---|---|
| State space $\mathcal{S}$ | 16 grid cells (0-15) |
| Action space $\mathcal{A}$ | 4 directions (L, D, R, U) |
| Q-function $Q(s,a)$ | `Q` numpy array (16×4) |
| Policy $\pi(s) = \arg\max_a Q(s,a)$ | `np.argmax(Q[state])` |
| Bellman update | `Q[s,a] += alpha * (r + gamma * max(Q[s']) - Q[s,a])` |
| $\epsilon$-greedy exploration | `if random() < epsilon: sample()` |
Historical Context
- Bellman (1957) — Dynamic programming and the Bellman equation
- Samuel (1959) — Checkers-playing program, early TD learning
- Sutton (1988) — TD(λ) learning framework
- Watkins (1989) — Q-learning: model-free off-policy control
- Watkins & Dayan (1992) — Convergence proof for Q-learning
- Mnih et al. (2013) — Deep Q-Networks (DQN) — Q-learning with neural nets, playing Atari
- Sutton & Barto (2018) — Reinforcement Learning: An Introduction — the definitive textbook
Further Reading
- Watkins (1989) — The original Q-learning thesis
- Watkins & Dayan (1992) — Convergence proof
- Sutton & Barto (2018) — Chapter 6 (TD Learning), especially Section 6.5 (Q-learning) — freely available online
- Mnih et al. (2015) — Human-level control through deep RL — DQN paper in Nature
Interactive Tools
- Q-Learning Visualiser — Watch Q-learning train step-by-step on grid worlds in the browser
Related Posts
- Genetic Algorithms — evolution-based optimisation (no gradient, no model — like Q-learning)
- Backpropagation Demystified — the gradient engine that powers Deep Q-Networks
- MCMC Island Hopping — another algorithm balancing exploration and exploitation
Try It Yourself
The interactive notebook includes exercises:
- Slippery vs non-slippery — Compare convergence curves for `is_slippery=True` and `False`. How does the optimal success rate differ?
- Gamma sweep — Try $\gamma \in \{0.5, 0.8, 0.95, 0.99\}$. How does the discount factor affect learning speed and final performance?
- SARSA comparison — Modify the update to use the actual next action instead of $\max_{a'}$. How does the learned policy differ?
- 8×8 FrozenLake — Switch to `FrozenLake8x8-v1` (64 states). Does Q-learning still converge? How many more episodes does it need?
- Visualise exploration — Track how many times each state is visited during training. Which states are under-explored?
Understanding Q-learning opens the door to Deep Q-Networks, policy gradients, and modern RL. The core insight — learning action values from delayed rewards through bootstrapping — is the foundation of all value-based reinforcement learning.
Frequently Asked Questions
What is the difference between Q-learning and SARSA?
Q-learning is off-policy: it updates using the best possible next action regardless of what the agent actually does. SARSA is on-policy: it updates using the action the agent actually took next, which means exploration noise feeds into the value estimates. In practice, SARSA learns a more conservative policy that accounts for the agent's own exploratory behaviour.
Why does the Q-table start with all zeros?
Initialising all Q-values to zero is a simple and common choice for environments where rewards are non-negative. Since the only reward in FrozenLake is +1 at the goal, any discovered path to the goal will produce Q-values above zero, naturally encouraging the agent to pursue it. Optimistic initialisation (starting with high values) is an alternative that can speed up exploration.
Can Q-learning handle continuous state spaces?
Not in its tabular form. With continuous states like position or velocity, you cannot maintain a lookup table for every possible state. The standard solution is to replace the Q-table with a function approximator such as a neural network, which is exactly what Deep Q-Networks (DQN) do. Alternatively, you can discretise the state space, though this scales poorly to high dimensions.
Why does the agent only reach ~67-74% success on slippery FrozenLake?
The slippery ice introduces genuine stochasticity that no policy can overcome. Each action has only a 1/3 chance of going in the intended direction, so even with a perfect policy, the agent will sometimes slide into holes. The theoretical maximum success rate for this environment is approximately 74%, meaning the agent's learned policy is near-optimal.
How does the discount factor gamma affect learning?
Gamma controls how much the agent values future rewards relative to immediate ones. A low gamma (e.g. 0.5) makes the agent short-sighted, focusing only on nearby rewards. A high gamma (e.g. 0.99) makes the agent plan further ahead, which is essential in FrozenLake where the goal is several steps away. Setting gamma too close to 1.0 can slow convergence because the agent gives nearly equal weight to very distant outcomes.
Does Q-learning guarantee finding the optimal policy?
Yes, under theoretical conditions: every state-action pair must be visited infinitely often, and the learning rate must decay appropriately. In practice, with a fixed learning rate and epsilon-greedy exploration, Q-learning converges reliably on small tabular problems like FrozenLake. For larger or continuous problems, convergence guarantees no longer hold, and techniques like experience replay and target networks become necessary.

