Imagine you're standing on a frozen lake. Your goal is on the far side, but there are holes in the ice — fall in and it's game over. Worse, the ice is slippery: when you try to go right, you might slide up or down instead. You have no map, no instructions. All you can do is try, fail, and gradually learn which moves lead to safety.
This is exactly what Q-learning solves. The agent learns a value for every state-action pair — "how good is it to take action A from state S?" — purely from trial and error. No model of the environment needed, no supervision, just rewards.
By the end of this post, you'll implement Q-learning from scratch, train an agent to navigate OpenAI's FrozenLake environment, and understand the Bellman equation that makes it all work. You'll also see why exploration matters — and what happens when an agent gets greedy too early.
The algorithm was introduced by Watkins (1989) in his PhD thesis and its convergence was proven by Watkins & Dayan (1992).
The Problem: FrozenLake
FrozenLake (originally from OpenAI Gym, now maintained in Gymnasium) is a 4×4 grid world:
```
SFFF    S = Start
FHFH    F = Frozen (safe)
FFFH    H = Hole (game over)
HFFG    G = Goal (reward = 1)
```
The agent can move Left, Down, Right, Up. But on slippery ice, each action has only a 1/3 chance of going in the intended direction — there's a 1/3 chance of sliding in each perpendicular direction. The only reward is +1 for reaching the Goal. Fall in a Hole or run out of steps and you get 0.
This is a model-free problem: the agent doesn't know the transition probabilities or the reward structure. It must learn entirely from experience.
Quick Win: Run the Algorithm
Let's see Q-learning in action. In the accompanying interactive notebook, you can watch the agent's learned policy evolve from random arrows (no idea what to do) to a coherent strategy that avoids holes and reaches the goal. The colour intensity shows $V(s) = \max_a Q(s, a)$ — how valuable each state is.
```python
import numpy as np
import gymnasium as gym

def q_learning(env, n_episodes=20000, alpha=0.1, gamma=0.99,
               epsilon_start=1.0, epsilon_end=0.01, decay_rate=1e-3):
    """Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rewards = []
    epsilon = epsilon_start

    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        for step in range(100):
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-learning update (Bellman equation)
            Q[state, action] += alpha * (
                reward + gamma * np.max(Q[next_state]) - Q[state, action]
            )

            total_reward += reward
            state = next_state
            if done:
                break

        rewards.append(total_reward)
        epsilon = epsilon_end + (epsilon_start - epsilon_end) * np.exp(
            -decay_rate * episode
        )

    return Q, rewards

# Train the agent
env = gym.make('FrozenLake-v1', is_slippery=True)
Q, rewards = q_learning(env)

# Show the learned policy
actions = ['←', '↓', '→', '↑']
policy = np.array([actions[a] for a in np.argmax(Q, axis=1)]).reshape(4, 4)
print("Learned policy:")
print(policy)
print(f"\nSuccess rate (last 1000): {np.mean(rewards[-1000:]):.1%}")
```
The result: After 20,000 episodes, the agent learns a policy that reaches the goal ~67% of the time. Even with a good policy, the slippery ice inevitably pushes you into holes — the theoretical maximum for this environment is about 74%.
Visualise the Learning
```python
import matplotlib.pyplot as plt

# 100-episode rolling average of the per-episode reward (success indicator)
rolling = np.convolve(rewards, np.ones(100) / 100, mode='valid')

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(rolling, 'b-', linewidth=0.8)
ax.set_xlabel('Episode')
ax.set_ylabel('Success Rate (100-episode rolling avg)')
ax.set_title('Q-Learning on FrozenLake')
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)
plt.show()
```
The reward curve climbs from 0% (random) to ~67% (near-optimal). The initial flat period is pure exploration — the agent is collecting experience before it starts exploiting what it's learned.
What Just Happened?
Q-learning maintains a Q-table — a lookup table mapping every (state, action) pair to a value. For FrozenLake, that's 16 states × 4 actions = 64 entries, all starting at zero.
The Bellman Equation
The core update rule is one line of code:
```python
Q[state, action] += alpha * (
    reward + gamma * np.max(Q[next_state]) - Q[state, action]
)
```

This implements the temporal difference (TD) update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$

Let's break this apart:
- $Q(s, a)$ — current estimate of "how good is action $a$ in state $s$?"
- $r$ — immediate reward received
- $\gamma \max_{a'} Q(s', a')$ — discounted value of the best action in the next state
- $r + \gamma \max_{a'} Q(s', a')$ — the TD target (what we think the true value is)
- $\text{target} - Q(s, a)$ — the TD error (how wrong we were)
- $\alpha$ — learning rate (how much to correct)
Think of it like updating a GPS estimate: you have a current estimate of the distance ($Q(s,a)$), you observe a new measurement (the target), and you adjust by a fraction ($\alpha$) of the difference.
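Here is the update worked through once with invented numbers, to make the arithmetic concrete:

```python
import numpy as np

# Hypothetical situation: next state 14 already has a best action worth 0.5.
Q = np.zeros((16, 4))
Q[14] = [0.0, 0.0, 0.5, 0.0]
alpha, gamma = 0.1, 0.99

# The agent moves from state 10 with action 1, gets reward 0, lands in state 14.
state, action, reward, next_state = 10, 1, 0.0, 14

td_target = reward + gamma * np.max(Q[next_state])  # 0 + 0.99 * 0.5 = 0.495
td_error = td_target - Q[state, action]             # 0.495 - 0.0
Q[state, action] += alpha * td_error                # 0.1 * 0.495 = 0.0495
```

Even though state 10 yielded no reward, its value rises because it leads somewhere valuable — this is how the goal reward propagates backwards through the table, one step per visit.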
Epsilon-Greedy Exploration
```python
if np.random.random() < epsilon:
    action = env.action_space.sample()  # explore: random action
else:
    action = np.argmax(Q[state])        # exploit: best known action
```
With probability $\epsilon$, the agent takes a random action (exploration). Otherwise, it takes the best action according to its Q-table (exploitation).
Early in training, $\epsilon = 1.0$ — pure exploration. Over time, it decays exponentially:

$$\epsilon_t = \epsilon_{\text{end}} + (\epsilon_{\text{start}} - \epsilon_{\text{end}})\, e^{-\lambda t}$$

where $t$ is the episode number and $\lambda$ is the decay rate.
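The schedule from the training loop can be inspected in isolation — a small sketch using the same default hyperparameters:

```python
import numpy as np

def epsilon_at(episode, eps_start=1.0, eps_end=0.01, decay_rate=1e-3):
    """Exponential decay schedule used in the training loop above."""
    return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * episode)

for ep in (0, 1000, 5000, 20000):
    print(ep, round(float(epsilon_at(ep)), 3))
```

By episode 5,000 the agent is almost fully greedy, which is why the reward curve flattens out well before training ends.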
This mirrors the exploration-exploitation trade-off in MCMC — the Metropolis-Hastings algorithm also balances proposing random moves with accepting good ones.
Why "Off-Policy"?
Q-learning is off-policy: it learns about the greedy policy ($\max_{a'} Q(s', a')$) while following an exploratory policy (epsilon-greedy). This separation is powerful — the agent can explore freely without corrupting what it's learning about the optimal policy.
The alternative, SARSA (on-policy), updates using the action the agent actually took next, not the best possible action. SARSA is more conservative — it learns a safer policy that accounts for its own exploration.
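The difference is one term in the target. A side-by-side sketch on a single hypothetical transition (all numbers invented for illustration):

```python
import numpy as np

Q = np.zeros((16, 4))
Q[14] = [0.2, 0.0, 0.5, 0.0]  # next state: greedy action worth 0.5
alpha, gamma = 0.1, 0.99

# The agent, following epsilon-greedy, actually picked next_action = 0
# (an exploratory move worth only 0.2, not the greedy argmax worth 0.5).
state, action, reward, next_state, next_action = 10, 1, 0.0, 14, 0

# Q-learning bootstraps from the best next action (off-policy):
q_target = reward + gamma * np.max(Q[next_state])        # 0.99 * 0.5 = 0.495
# SARSA bootstraps from the action actually taken (on-policy):
sarsa_target = reward + gamma * Q[next_state, next_action]  # 0.99 * 0.2 = 0.198

# The SARSA update uses its own, more conservative target:
Q[state, action] += alpha * (sarsa_target - Q[state, action])
```

Because SARSA's target reflects exploratory mistakes, it tends to steer away from states where a random slip is costly — on cliff-like grid worlds it famously learns the safer, longer path.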
Going Deeper
Hyperparameters
| Parameter | What it controls | FrozenLake value | Guidance |
|---|---|---|---|
| `alpha` | Learning rate | 0.1 | Low for stochastic envs, higher for deterministic |
| `gamma` | Discount factor | 0.99 | Close to 1 for long-horizon tasks |
| `epsilon_start` | Initial exploration | 1.0 | Always start fully random |
| `epsilon_end` | Final exploration | 0.01 | Small but non-zero |
| `decay_rate` | Exploration decay | 0.001 | Slower = more exploration |
| `n_episodes` | Training budget | 20,000 | Until convergence |
`gamma` is critical. It controls how much the agent values future rewards vs immediate ones:
- $\gamma = 0$ — only cares about immediate reward (myopic)
- $\gamma = 1$ — values all future rewards equally (farsighted)
- $\gamma = 0.95$ — a reward 10 steps away is worth $0.95^{10} \approx 0.60$ of its face value
For FrozenLake, the goal is ~6 steps away but stochastic transitions make paths longer. With $\gamma = 0.99$, a reward 10 steps away is worth $0.99^{10} \approx 0.90$ — the agent strongly values the distant goal.
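To build intuition for how sharply each setting discounts the future, you can tabulate $\gamma^k$ directly:

```python
# Discount applied to a reward k steps in the future, for several gammas.
for gamma in (0.5, 0.8, 0.95, 0.99):
    discounts = {k: round(gamma ** k, 3) for k in (1, 5, 10, 20)}
    print(f"gamma={gamma}: {discounts}")
```

At $\gamma = 0.5$ a reward 10 steps away is worth about 0.001 of its face value — effectively invisible — which is why low discount factors fail on FrozenLake.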
Slippery vs Non-Slippery
```python
# Deterministic: agent goes exactly where intended
env_easy = gym.make('FrozenLake-v1', is_slippery=False)
Q_easy, rewards_easy = q_learning(env_easy)
print(f"Non-slippery success rate: {np.mean(rewards_easy[-1000:]):.1%}")  # ~100%

# Stochastic: 1/3 chance of sliding sideways
env_hard = gym.make('FrozenLake-v1', is_slippery=True)
Q_hard, rewards_hard = q_learning(env_hard)
print(f"Slippery success rate: {np.mean(rewards_hard[-1000:]):.1%}")  # ~67%
```
Without slippery ice, the agent achieves ~100% success — the problem becomes trivial. The stochastic transitions are what make FrozenLake interesting: the agent must learn a policy that's robust to noise.
When NOT to Use Tabular Q-Learning
- Large state spaces — A 100×100 grid has 10,000 states × 4 actions = 40,000 Q-values. An Atari game has ~10¹⁸ possible states. Tables don't scale — use neural network function approximation instead (Deep Q-Networks)
- Continuous states — Temperature, position, velocity can't be tabulated. Use function approximation or discretise
- Continuous actions — Q-learning requires $\max_a Q(s,a)$, which is trivial for discrete actions but requires optimisation for continuous ones. Use policy gradient methods instead
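For low-dimensional continuous states, discretisation can be sketched with `np.digitize` — the 1-D position and bucket count here are made up for illustration:

```python
import numpy as np

# Map a continuous position in [0, 1] onto 10 buckets, so a Q-table
# indexed by bucket can stand in for the continuous state.
bins = np.linspace(0.0, 1.0, num=11)  # 11 edges -> 10 buckets

def discretise(position):
    """Return the bucket index (0..9) for a position in [0, 1]."""
    return int(np.digitize(position, bins[1:-1]))

print(discretise(0.03), discretise(0.47), discretise(0.99))
```

This works for one or two dimensions, but the number of buckets grows exponentially with dimension — the curse of dimensionality that pushes practitioners towards function approximation.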
Alternatives:
- Deep Q-Networks (DQN) — Replace the Q-table with a neural network
- SARSA — On-policy variant, learns a safer policy
- Policy gradient methods — Directly optimise the policy, handle continuous actions
- Monte Carlo methods — Wait until the end of the episode to update (no bootstrapping)
Deep Dive: The Papers
Watkins (1989)
Christopher Watkins introduced Q-learning in his PhD thesis Learning from Delayed Rewards at Cambridge. The key problem he addressed: how can an agent learn optimal behaviour when rewards are delayed — you don't know if a move was good until many steps later?
His solution was to learn the action-value function $Q(s, a)$ directly, rather than the state-value function $V(s)$. The crucial insight: by learning $Q$ values, the agent can select optimal actions without needing a model of the environment — it just picks $\arg\max_a Q(s, a)$.
Convergence Guarantee
Watkins & Dayan proved in their 1992 paper that Q-learning converges to the optimal Q-function $Q^*$ under two conditions:
- Every state-action pair is visited infinitely often
- The learning rate decays appropriately (but not too fast)
Formally, given the update with a time-dependent learning rate $\alpha_t$:

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t \left[\, r + \gamma \max_{a'} Q_t(s', a') - Q_t(s, a) \,\right]$$

Convergence requires:

$$\sum_{t} \alpha_t = \infty \quad \text{and} \quad \sum_{t} \alpha_t^2 < \infty$$

The first condition ensures sufficient learning; the second ensures the updates eventually settle. In practice, a constant learning rate works well for tabular problems.
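One common schedule satisfying both conditions is a harmonic decay per state-action visit — a minimal sketch (the `visits` counter and `alpha_for` helper are illustrative, not part of the post's implementation):

```python
import numpy as np

# Per-(state, action) harmonic decay: alpha_t = 1 / N(s, a).
# sum(1/n) diverges and sum(1/n^2) converges, so both conditions hold.
visits = np.zeros((16, 4), dtype=int)

def alpha_for(state, action):
    """Return the learning rate for this visit and bump the counter."""
    visits[state, action] += 1
    return 1.0 / visits[state, action]

print([alpha_for(0, 2) for _ in range(4)])  # successive rates 1, 1/2, 1/3, 1/4
```

To use it, replace the constant `alpha` in the update with `alpha_for(state, action)`; frequently visited pairs then settle while rarely visited ones stay plastic.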
The Bellman Optimality Equation
Q-learning is grounded in Bellman's Dynamic Programming (1957). The optimal Q-function satisfies the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \,\right]$$
This is a fixed-point equation: the optimal Q-values are self-consistent. Q-learning iteratively approximates this fixed point using sampled transitions instead of computing the full expectation.
Our Implementation vs the Theory
| Theory | Our code |
|---|---|
| State space $\mathcal{S}$ | 16 grid cells (0-15) |
| Action space $\mathcal{A}$ | 4 directions (L, D, R, U) |
| Q-function $Q(s,a)$ | `Q` numpy array (16×4) |
| Policy $\pi(s) = \arg\max_a Q(s,a)$ | `np.argmax(Q[state])` |
| Bellman update | `Q[s,a] += alpha * (r + gamma * max(Q[s']) - Q[s,a])` |
| $\epsilon$-greedy exploration | `if random() < epsilon: sample()` |
Historical Context
- Bellman (1957) — Dynamic programming and the Bellman equation
- Samuel (1959) — Checkers-playing program, early TD learning
- Sutton (1988) — TD(λ) learning framework
- Watkins (1989) — Q-learning: model-free off-policy control
- Watkins & Dayan (1992) — Convergence proof for Q-learning
- Mnih et al. (2013) — Deep Q-Networks (DQN) — Q-learning with neural nets, playing Atari
- Sutton & Barto (2018) — Reinforcement Learning: An Introduction — the definitive textbook
Further Reading
- Watkins (1989) — The original Q-learning thesis
- Watkins & Dayan (1992) — Convergence proof
- Sutton & Barto (2018) — Chapter 6 (TD Learning), especially Section 6.5 (Q-learning) — freely available online
- Mnih et al. (2015) — Human-level control through deep RL — DQN paper in Nature
Interactive Tools
- Q-Learning Visualiser — Watch Q-learning train step-by-step on grid worlds in the browser
Related Posts
- Genetic Algorithms — evolution-based optimisation (no gradient, no model — like Q-learning)
- Backpropagation Demystified — the gradient engine that powers Deep Q-Networks
- MCMC Island Hopping — another algorithm balancing exploration and exploitation
Try It Yourself
The interactive notebook includes exercises:
- Slippery vs non-slippery — Compare convergence curves for `is_slippery=True` and `False`. How does the optimal success rate differ?
- Gamma sweep — Try $\gamma \in \{0.5, 0.8, 0.95, 0.99\}$. How does the discount factor affect learning speed and final performance?
- SARSA comparison — Modify the update to use the actual next action instead of $\max_{a'}$. How does the learned policy differ?
- 8×8 FrozenLake — Switch to `FrozenLake8x8-v1` (64 states). Does Q-learning still converge? How many more episodes does it need?
- Visualise exploration — Track how many times each state is visited during training. Which states are under-explored?
Understanding Q-learning opens the door to Deep Q-Networks, policy gradients, and modern RL. The core insight — learning action values from delayed rewards through bootstrapping — is the foundation of all value-based reinforcement learning.
Frequently Asked Questions
What is the difference between Q-learning and SARSA?
Q-learning is off-policy: it updates using the best possible next action regardless of what the agent actually does. SARSA is on-policy: it updates using the action the agent actually took next, which means exploration noise feeds into the value estimates. In practice, SARSA learns a more conservative policy that accounts for the agent's own exploratory behaviour.
Why does the Q-table start with all zeros?
Initialising all Q-values to zero is a simple and common choice for environments where rewards are non-negative. Since the only reward in FrozenLake is +1 at the goal, any discovered path to the goal will produce Q-values above zero, naturally encouraging the agent to pursue it. Optimistic initialisation (starting with high values) is an alternative that can speed up exploration.
Can Q-learning handle continuous state spaces?
Not in its tabular form. With continuous states like position or velocity, you cannot maintain a lookup table for every possible state. The standard solution is to replace the Q-table with a function approximator such as a neural network, which is exactly what Deep Q-Networks (DQN) do. Alternatively, you can discretise the state space, though this scales poorly to high dimensions.
Why does the agent only reach ~67-74% success on slippery FrozenLake?
The slippery ice introduces genuine stochasticity that no policy can overcome. Each action has only a 1/3 chance of going in the intended direction, so even with a perfect policy, the agent will sometimes slide into holes. The theoretical maximum success rate for this environment is approximately 74%, meaning the agent's learned policy is near-optimal.
How does the discount factor gamma affect learning?
Gamma controls how much the agent values future rewards relative to immediate ones. A low gamma (e.g. 0.5) makes the agent short-sighted, focusing only on nearby rewards. A high gamma (e.g. 0.99) makes the agent plan further ahead, which is essential in FrozenLake where the goal is several steps away. Setting gamma too close to 1.0 can slow convergence because the agent gives nearly equal weight to very distant outcomes.
Does Q-learning guarantee finding the optimal policy?
Yes, under theoretical conditions: every state-action pair must be visited infinitely often, and the learning rate must decay appropriately. In practice, with a fixed learning rate and epsilon-greedy exploration, Q-learning converges reliably on small tabular problems like FrozenLake. For larger or continuous problems, convergence guarantees no longer hold, and techniques like experience replay and target networks become necessary.

