Berkan Sesen

Posted on • Originally published at sesen.ai

Q-Learning from Scratch: Navigating the Frozen Lake

Imagine you're standing on a frozen lake. Your goal is on the far side, but there are holes in the ice — fall in and it's game over. Worse, the ice is slippery: when you try to go right, you might slide up or down instead. You have no map, no instructions. All you can do is try, fail, and gradually learn which moves lead to safety.

This is exactly what Q-learning solves. The agent learns a value for every state-action pair — "how good is it to take action A from state S?" — purely from trial and error. No model of the environment needed, no supervision, just rewards.

By the end of this post, you'll implement Q-learning from scratch, train an agent to navigate OpenAI's FrozenLake environment, and understand the Bellman equation that makes it all work. You'll also see why exploration matters — and what happens when an agent gets greedy too early.

The algorithm was introduced by Watkins (1989) in his PhD thesis and its convergence was proven by Watkins & Dayan (1992).

The Problem: FrozenLake

OpenAI's FrozenLake is a 4×4 grid world:

SFFF    S = Start
FHFH    F = Frozen (safe)
FFFH    H = Hole (game over)
HFFG    G = Goal (reward = 1)

The agent can move Left, Down, Right, Up. But on slippery ice, each action has only a 1/3 chance of going in the intended direction — there's a 1/3 chance of sliding in each perpendicular direction. The only reward is +1 for reaching the Goal. Fall in a Hole or run out of steps and you get 0.

This is a model-free problem: the agent doesn't know the transition probabilities or the reward structure. It must learn entirely from experience.

Quick Win: Run the Algorithm

Let's see Q-learning in action. Click the badge to open the interactive notebook:

Open In Colab

Q-learning policy evolution — watch the agent learn optimal actions on the FrozenLake grid over 20,000 episodes

The GIF shows the agent's learned policy evolving from random arrows (no idea what to do) to a coherent strategy that avoids holes and reaches the goal. The colour intensity shows $V(s) = \max_a Q(s, a)$ — how valuable each state is.

import numpy as np
import gymnasium as gym

def q_learning(env, n_episodes=20000, alpha=0.1, gamma=0.99,
               epsilon_start=1.0, epsilon_end=0.01, decay_rate=1e-3):
    """Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rewards = []

    epsilon = epsilon_start
    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0

        for step in range(100):
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-learning update (Bellman equation)
            Q[state, action] += alpha * (
                reward + gamma * np.max(Q[next_state]) - Q[state, action]
            )

            total_reward += reward
            state = next_state
            if done:
                break

        rewards.append(total_reward)
        epsilon = epsilon_end + (epsilon_start - epsilon_end) * np.exp(
            -decay_rate * episode
        )

    return Q, rewards

# Train the agent
env = gym.make('FrozenLake-v1', is_slippery=True)
Q, rewards = q_learning(env)

# Show the learned policy
actions = ['←', '↓', '→', '↑']  # Left, Down, Right, Up
policy = np.array([actions[a] for a in np.argmax(Q, axis=1)]).reshape(4, 4)
print("Learned policy:")
print(policy)
print(f"\nSuccess rate (last 1000): {np.mean(rewards[-1000:]):.1%}")

The result: After 20,000 episodes, the agent learns a policy that reaches the goal ~67% of the time. Even with a good policy, the slippery ice inevitably pushes you into holes — the theoretical maximum for this environment is about 74%.

Visualise the Learning

import matplotlib.pyplot as plt

rolling = np.convolve(rewards, np.ones(100)/100, mode='valid')

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(rolling, 'b-', linewidth=0.8)
ax.set_xlabel('Episode')
ax.set_ylabel('Success Rate (100-episode rolling avg)')
ax.set_title('Q-Learning on FrozenLake')
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)
plt.show()

Q-learning reward curve — rolling average success rate climbing from 0% to ~67% over 20,000 episodes

The reward curve climbs from 0% (random) to ~67% (near-optimal). The initial flat period is pure exploration — the agent is collecting experience before it starts exploiting what it's learned.

What Just Happened?

Q-learning maintains a Q-table — a lookup table mapping every (state, action) pair to a value. For FrozenLake, that's 16 states × 4 actions = 64 entries, all starting at zero.

The Bellman Equation

The core update rule is one line of code:

Q[state, action] += alpha * (
    reward + gamma * np.max(Q[next_state]) - Q[state, action]
)

This implements the temporal difference (TD) update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Let's break this apart:

  • $Q(s, a)$ — current estimate of "how good is action a in state s?"
  • $r$ — immediate reward received
  • $\gamma \max_{a'} Q(s', a')$ — discounted value of the best action in the next state
  • $r + \gamma \max_{a'} Q(s', a')$ — the TD target (what we think the true value is)
  • $\text{target} - Q(s, a)$ — the TD error (how wrong we were)
  • $\alpha$ — learning rate (how much to correct)

Think of it like updating a GPS estimate: you have a current estimate of the distance ($Q(s,a)$), you observe a new measurement (the target), and you adjust by a fraction ($\alpha$) of the difference.
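Tracing two updates by hand shows how value propagates backwards from the goal. The state numbers below are illustrative grid indices, using the same `alpha` and `gamma` as the training code:

```python
import numpy as np

alpha, gamma = 0.1, 0.99
Q = np.zeros((16, 4))

# Hypothetical transition: from state 14, action Right (2) reaches the
# goal (state 15) with reward 1. Q[15] is all zeros (terminal state).
target = 1.0 + gamma * np.max(Q[15])      # TD target = 1.0
Q[14, 2] += alpha * (target - Q[14, 2])   # Q[14, 2]: 0.0 -> 0.1

# A later visit to state 13 bootstraps off that estimate: the reward is 0,
# but the best action from state 14 now has value 0.1.
target = 0.0 + gamma * np.max(Q[14])      # 0.99 * 0.1 = 0.099
Q[13, 2] += alpha * (target - Q[13, 2])   # Q[13, 2]: 0.0 -> 0.0099

print(Q[14, 2], Q[13, 2])
```

This is bootstrapping: states near the goal gain value first, and each update pulls value one step further back along the path.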

Epsilon-Greedy Exploration

if np.random.random() < epsilon:
    action = env.action_space.sample()   # explore: random action
else:
    action = np.argmax(Q[state])         # exploit: best known action

With probability $\epsilon$, the agent takes a random action (exploration). Otherwise, it takes the best action according to its Q-table (exploitation).

Early in training, $\epsilon = 1.0$ — pure exploration. Over time, it decays exponentially:

$$\epsilon = \epsilon_{\text{end}} + (\epsilon_{\text{start}} - \epsilon_{\text{end}}) \, e^{-\lambda \cdot \text{episode}}$$

where $\lambda$ is the decay rate.

This mirrors the exploration-exploitation trade-off in MCMC — the Metropolis-Hastings algorithm also balances proposing random moves with accepting good ones.
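Plugging the training loop's values into the decay schedule shows how quickly exploration fades:

```python
import numpy as np

eps_start, eps_end, decay_rate = 1.0, 0.01, 1e-3

def epsilon_at(episode):
    # Exponential decay from eps_start toward eps_end, never reaching zero.
    return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * episode)

for ep in (0, 1000, 5000, 20000):
    print(f"episode {ep:>5}: epsilon = {epsilon_at(ep):.3f}")
```

By episode 5,000 the agent is almost fully greedy, but the floor of 0.01 keeps a trickle of exploration running for the rest of training.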

Why "Off-Policy"?

Q-learning is off-policy: it learns about the greedy policy ($\max_{a'} Q(s', a')$) while following an exploratory policy (epsilon-greedy). This separation is powerful — the agent can explore freely without corrupting what it's learning about the optimal policy.

The alternative, SARSA (on-policy), updates using the action the agent actually took next, not the best possible action. SARSA is more conservative — it learns a safer policy that accounts for its own exploration.
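The difference between the two is a single term in the TD target — a sketch with made-up Q-values for the next state:

```python
import numpy as np

gamma = 0.99
Q = np.zeros((16, 4))
Q[5] = [0.0, 0.2, 0.8, 0.1]   # made-up values for the next state

reward, next_state = 0.0, 5
next_action = 1               # the (exploratory) action actually taken next

# Q-learning bootstraps from the best next action, regardless of behaviour:
q_learning_target = reward + gamma * np.max(Q[next_state])        # uses 0.8

# SARSA bootstraps from the action the agent actually took next:
sarsa_target = reward + gamma * Q[next_state, next_action]        # uses 0.2

print(q_learning_target, sarsa_target)
```

When exploration picks a bad action, SARSA's target drops with it — which is exactly why SARSA learns a policy that stays clear of risky states while it is still exploring.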

Going Deeper

Hyperparameters

| Parameter | What it controls | FrozenLake value | Guidance |
|---|---|---|---|
| `alpha` | Learning rate | 0.1 | Low for stochastic envs, higher for deterministic |
| `gamma` | Discount factor | 0.99 | Close to 1 for long-horizon tasks |
| `epsilon_start` | Initial exploration | 1.0 | Always start fully random |
| `epsilon_end` | Final exploration | 0.01 | Small but non-zero |
| `decay_rate` | Exploration decay | 0.001 | Slower = more exploration |
| `n_episodes` | Training budget | 20,000 | Until convergence |

gamma is critical. It controls how much the agent values future rewards vs immediate ones:

  • $\gamma = 0$ — only cares about immediate reward (myopic)
  • $\gamma = 1$ — values all future rewards equally (farsighted)
  • $\gamma = 0.95$ — reward 10 steps away is worth $0.95^{10} \approx 0.60$ of its face value

For FrozenLake, the goal is ~6 steps away but stochastic transitions make paths longer. With $\gamma = 0.99$, a reward 10 steps away is worth $0.99^{10} \approx 0.90$ — the agent strongly values the distant goal.
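The discounting arithmetic in one loop:

```python
# How much a reward 10 steps in the future is worth today, per gamma.
for gamma in (0.5, 0.8, 0.95, 0.99):
    print(f"gamma={gamma}: reward 10 steps away is worth {gamma**10:.2f}")
```

At $\gamma = 0.5$ the distant goal is worth essentially nothing, which is why very low discount factors fail on FrozenLake.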

Slippery vs Non-Slippery

# Deterministic: agent goes exactly where intended
env_easy = gym.make('FrozenLake-v1', is_slippery=False)
Q_easy, rewards_easy = q_learning(env_easy)
print(f"Non-slippery success rate: {np.mean(rewards_easy[-1000:]):.1%}")  # ~100%

# Stochastic: 1/3 chance of sliding sideways
env_hard = gym.make('FrozenLake-v1', is_slippery=True)
Q_hard, rewards_hard = q_learning(env_hard)
print(f"Slippery success rate: {np.mean(rewards_hard[-1000:]):.1%}")  # ~67%

Without slippery ice, the agent achieves ~100% success — the problem becomes trivial. The stochastic transitions are what make FrozenLake interesting: the agent must learn a policy that's robust to noise.

When NOT to Use Tabular Q-Learning

  1. Large state spaces — A 100×100 grid has 10,000 states × 4 actions = 40,000 Q-values. An Atari game has ~10¹⁸ possible states. Tables don't scale — use neural network function approximation instead (Deep Q-Networks)
  2. Continuous states — Temperature, position, velocity can't be tabulated. Use function approximation or discretise
  3. Continuous actions — Q-learning requires $\max_a Q(s,a)$, which is trivial for discrete actions but requires optimisation for continuous ones. Use policy gradient methods instead

Alternatives:

  • Deep Q-Networks (DQN) — Replace the Q-table with a neural network
  • SARSA — On-policy variant, learns a safer policy
  • Policy gradient methods — Directly optimise the policy, handle continuous actions
  • Monte Carlo methods — Wait until the end of the episode to update (no bootstrapping)

Deep Dive: The Papers

Watkins (1989)

Christopher Watkins introduced Q-learning in his PhD thesis Learning from Delayed Rewards at Cambridge. The key problem he addressed: how can an agent learn optimal behaviour when rewards are delayed — you don't know if a move was good until many steps later?

His solution was to learn the action-value function $Q(s, a)$ directly, rather than the state-value function $V(s)$. The crucial insight: by learning $Q$ values, the agent can select optimal actions without needing a model of the environment — it just picks $\arg\max_a Q(s, a)$.

Convergence Guarantee

Watkins & Dayan proved in their 1992 paper that Q-learning converges to the optimal Q-function $Q^*$ under two conditions:

  1. Every state-action pair is visited infinitely often
  2. The learning rate decays appropriately (but not too fast)

Formally, given the update:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \left[ r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right]$$

Convergence requires:

$$\sum_{t} \alpha_t = \infty, \qquad \sum_{t} \alpha_t^2 < \infty$$

The first condition ensures sufficient learning; the second ensures the updates eventually settle. In practice, a constant learning rate works well for tabular problems.
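One schedule that satisfies both conditions is $\alpha_t = 1/n(s,a)$, where $n(s,a)$ counts visits to that state-action pair: the harmonic series diverges (first condition) while $\sum 1/n^2$ converges (second). A sketch:

```python
import numpy as np

# Per-pair visit counts; each pair gets its own decaying learning rate.
visits = np.zeros((16, 4), dtype=int)

def decaying_alpha(state, action):
    visits[state, action] += 1
    return 1.0 / visits[state, action]

# Repeated visits to the same pair: alphas shrink as 1, 1/2, 1/3, 1/4, ...
print([decaying_alpha(0, 2) for _ in range(4)])
```

In practice this decays too aggressively for stochastic environments like slippery FrozenLake, which is one reason a small constant `alpha` is the usual choice.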

The Bellman Optimality Equation

Q-learning is grounded in Bellman's Dynamic Programming (1957). The optimal Q-function satisfies the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]$$

This is a fixed-point equation: the optimal Q-values are self-consistent. Q-learning iteratively approximates this fixed point using sampled transitions instead of computing the full expectation.

Our Implementation vs the Theory

| Theory | Our code |
|---|---|
| State space $\mathcal{S}$ | 16 grid cells (0-15) |
| Action space $\mathcal{A}$ | 4 directions (L, D, R, U) |
| Q-function $Q(s,a)$ | `Q` numpy array (16×4) |
| Policy $\pi(s) = \arg\max_a Q(s,a)$ | `np.argmax(Q[state])` |
| Bellman update | `Q[s,a] += alpha * (r + gamma * max(Q[s']) - Q[s,a])` |
| Exploration | $\epsilon$-greedy: `if random() < epsilon: sample()` |

Historical Context

  • Bellman (1957) — Dynamic programming and the Bellman equation
  • Samuel (1959) — Checkers-playing program, early TD learning
  • Sutton (1988) — TD(λ) learning framework
  • Watkins (1989) — Q-learning: model-free off-policy control
  • Watkins & Dayan (1992) — Convergence proof for Q-learning
  • Mnih et al. (2013) — Deep Q-Networks (DQN) — Q-learning with neural nets, playing Atari
  • Sutton & Barto (2018) — *Reinforcement Learning: An Introduction*, the definitive textbook

Try It Yourself

The interactive notebook includes exercises:

  1. Slippery vs non-slippery — Compare convergence curves for is_slippery=True and False. How does the optimal success rate differ?
  2. Gamma sweep — Try $\gamma \in \{0.5, 0.8, 0.95, 0.99\}$. How does the discount factor affect learning speed and final performance?
  3. SARSA comparison — Modify the update to use the actual next action instead of $\max_{a'}$. How does the learned policy differ?
  4. 8×8 FrozenLake — Switch to FrozenLake8x8-v1 (64 states). Does Q-learning still converge? How many more episodes does it need?
  5. Visualise exploration — Track how many times each state is visited during training. Which states are under-explored?

Understanding Q-learning opens the door to Deep Q-Networks, policy gradients, and modern RL. The core insight — learning action values from delayed rewards through bootstrapping — is the foundation of all value-based reinforcement learning.

Frequently Asked Questions

What is the difference between Q-learning and SARSA?

Q-learning is off-policy: it updates using the best possible next action regardless of what the agent actually does. SARSA is on-policy: it updates using the action the agent actually took next, which means exploration noise feeds into the value estimates. In practice, SARSA learns a more conservative policy that accounts for the agent's own exploratory behaviour.

Why does the Q-table start with all zeros?

Initialising all Q-values to zero is a simple and common choice for environments where rewards are non-negative. Since the only reward in FrozenLake is +1 at the goal, any discovered path to the goal will produce Q-values above zero, naturally encouraging the agent to pursue it. Optimistic initialisation (starting with high values) is an alternative that can speed up exploration.
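A sketch of why optimistic initialisation drives exploration (the value 1.0 works for FrozenLake because no return can exceed it):

```python
import numpy as np

# Start every Q-value at the best achievable return. Tried actions get
# updated toward their true (lower) values, so untried actions keep looking
# better -- even a purely greedy policy keeps sampling new actions.
Q = np.full((16, 4), 1.0)

state = 0
Q[state, 2] = 0.3            # pretend action 2 has been tried and revised down
greedy = np.argmax(Q[state])
print(greedy)                # an untried action (value still 1.0) wins
```

Once every action's estimate has been pulled down to realistic values, the optimism is spent and the policy settles.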

Can Q-learning handle continuous state spaces?

Not in its tabular form. With continuous states like position or velocity, you cannot maintain a lookup table for every possible state. The standard solution is to replace the Q-table with a function approximator such as a neural network, which is exactly what Deep Q-Networks (DQN) do. Alternatively, you can discretise the state space, though this scales poorly to high dimensions.
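A minimal discretisation sketch — the variable, its range, and the bin count are all made-up choices for illustration:

```python
import numpy as np

# Discretise one continuous variable (a position in [-2.4, 2.4]) into
# 10 bins so it can index a row of a tabular Q-table.
bin_edges = np.linspace(-2.4, 2.4, 11)      # 11 edges -> 10 bins

def to_bin(x):
    # np.digitize returns the index of the first edge above x;
    # clip keeps out-of-range values in the outermost bins.
    return int(np.clip(np.digitize(x, bin_edges) - 1, 0, 9))

print(to_bin(0.37))    # 5
print(to_bin(-2.0))    # 0
```

With several continuous variables the bin counts multiply (10 bins over 4 variables is already 10,000 states), which is the poor scaling mentioned above.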

Why does the agent only reach ~67-74% success on slippery FrozenLake?

The slippery ice introduces genuine stochasticity that no policy can overcome. Each action has only a 1/3 chance of going in the intended direction, so even with a perfect policy, the agent will sometimes slide into holes. The theoretical maximum success rate for this environment is approximately 74%, meaning the agent's learned policy is near-optimal.

How does the discount factor gamma affect learning?

Gamma controls how much the agent values future rewards relative to immediate ones. A low gamma (e.g. 0.5) makes the agent short-sighted, focusing only on nearby rewards. A high gamma (e.g. 0.99) makes the agent plan further ahead, which is essential in FrozenLake where the goal is several steps away. Setting gamma too close to 1.0 can slow convergence because the agent gives nearly equal weight to very distant outcomes.

Does Q-learning guarantee finding the optimal policy?

Yes, under theoretical conditions: every state-action pair must be visited infinitely often, and the learning rate must decay appropriately. In practice, with a fixed learning rate and epsilon-greedy exploration, Q-learning converges reliably on small tabular problems like FrozenLake. For larger or continuous problems, convergence guarantees no longer hold, and techniques like experience replay and target networks become necessary.
