You have a map of the frozen lake. Every crack in the ice, every slippery patch, every hole is marked. You can sit at your desk and plan the perfect route before setting foot on the ice. That is value iteration.
Now imagine you have no map. You lace up your boots and start walking. You slip, you fall into holes, you backtrack. But each time you learn a little more about which moves pay off and which ones do not. That is Q-learning.
Both approaches solve the same problem (finding the best policy in a Markov Decision Process), but they start from radically different assumptions about what you know. In our earlier Q-learning post, we focused purely on the model-free approach. This post puts the two side by side on the same FrozenLake environment, so you can see exactly what a model buys you, and what you give up when you do not have one.
By the end of this post, you will have implemented both value iteration and Q-learning from scratch, compared their convergence and policies head-to-head, and understood the Bellman equation that underpins them both.
Quick Win: Run Both Algorithms
Let's see both algorithms in action. Click the badge to open the interactive notebook:
Watch value iteration discover optimal state values in just a few sweeps, with "heat" radiating outward from the goal:
Here is the complete implementation for both methods:
```python
import numpy as np
import gymnasium as gym


# ── Value Iteration (model-based) ──────────────────────────
def value_iteration(env, gamma=0.95, theta=1e-8):
    """Compute optimal V* using the Bellman optimality equation."""
    nS = env.observation_space.n
    nA = env.action_space.n
    V = np.zeros(nS)
    while True:
        delta = 0
        for s in range(nS):
            action_values = np.zeros(nA)
            for a in range(nA):
                for prob, next_s, reward, done in env.unwrapped.P[s][a]:
                    action_values[a] += prob * (reward + gamma * V[next_s])
            best_value = np.max(action_values)
            delta = max(delta, abs(best_value - V[s]))
            V[s] = best_value
        if delta < theta:
            break

    # Extract greedy policy from V*
    policy = np.zeros(nS, dtype=int)
    for s in range(nS):
        action_values = np.zeros(nA)
        for a in range(nA):
            for prob, next_s, reward, done in env.unwrapped.P[s][a]:
                action_values[a] += prob * (reward + gamma * V[next_s])
        policy[s] = np.argmax(action_values)
    return V, policy


# ── Q-Learning (model-free) ────────────────────────────────
def q_learning(env, n_episodes=10000, gamma=0.95, lr=0.8,
               epsilon_start=1.0, epsilon_end=0.01, decay_rate=7e-3):
    """Tabular Q-learning with epsilon-greedy exploration."""
    nS = env.observation_space.n
    nA = env.action_space.n
    Q = np.zeros((nS, nA))
    rewards = []
    epsilon = epsilon_start
    for ep in range(n_episodes):
        state, _ = env.reset()
        for t in range(100):
            if np.random.uniform() <= epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state, :])
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Q-learning update
            Q[state, action] += lr * (
                reward + gamma * np.max(Q[next_state, :]) - Q[state, action]
            )
            state = next_state
            if terminated or truncated:
                rewards.append(reward)
                epsilon = epsilon_end + (epsilon_start - epsilon_end) * np.exp(-decay_rate * ep)
                break
    policy = np.argmax(Q, axis=1)
    return Q, policy, rewards


# ── Run both on FrozenLake ─────────────────────────────────
env = gym.make('FrozenLake-v1', is_slippery=True)
V_star, vi_policy = value_iteration(env, gamma=0.95)
Q_star, ql_policy, ql_rewards = q_learning(env, n_episodes=10000, gamma=0.95)
```
The result: Value iteration converges in 184 sweeps and produces a policy that succeeds ~73% of the time. Q-learning, after 10,000 episodes of trial and error, learns a policy that also achieves ~73% success, and agrees with the VI policy on 14 out of 16 states. Both methods find near-identical strategies, but through very different paths.
What Just Happened?
Both algorithms answer the same question: "What is the best action in every state?" But they go about it in fundamentally different ways.
Value Iteration: Planning with a Blueprint
Value iteration has access to the environment's full transition model $P(s' \mid s, a)$. This is the complete blueprint: for every state and action, you know exactly which states you might land in and with what probability.
The algorithm sweeps through every state, computing the value of the best action using the Bellman optimality equation:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V_k(s')\bigr]$$
Each sweep propagates value information one step further from the goal. In the GIF above, you can see this: after sweep 0, only the state next to the goal has any value (0.333). By sweep 5, the values have spread across the grid. By sweep 100, they have stabilised.
The key line in the code is this inner loop:
```python
for prob, next_s, reward, done in env.unwrapped.P[s][a]:
    action_values[a] += prob * (reward + gamma * V[next_s])
```
This sums over all possible outcomes of taking action $a$ from state $s$, weighting each by its transition probability. No randomness, no sampling; it is a deterministic computation over the full model.
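If you have never looked inside the model, it is worth printing a slice of it. The snippet below is a small inspection sketch (exact tuples may differ slightly between gymnasium versions) showing what `env.unwrapped.P` holds for one state-action pair, the same structure the inner loop above iterates over:

```python
import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=True)

# Each entry is a tuple (probability, next_state, reward, terminated).
# On the slippery 4x4 map, every action has three equally likely outcomes (p = 1/3).
for prob, next_s, reward, done in env.unwrapped.P[0][0]:
    print(f"p={prob:.3f}  next_state={next_s}  reward={reward}  done={done}")
```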
Q-Learning: Learning by Doing
Q-learning has no access to the transition model. It learns by interacting with the environment, collecting $(s, a, r, s')$ tuples, and updating its Q-table one experience at a time:
```python
Q[state, action] += lr * (
    reward + gamma * np.max(Q[next_state, :]) - Q[state, action]
)
```
This is the temporal difference (TD) update. The term $r + \gamma \max_{a'} Q(s', a')$ is the TD target: what the agent thinks the return should be based on the immediate reward plus the estimated future value. The difference between this target and the current estimate $Q(s, a)$ is the TD error, which drives learning.
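For concreteness, here is a single update worked through with made-up numbers (illustrative values only, not taken from an actual training run):

```python
# Suppose Q[s, a] = 0.5, the step returned reward r = 1.0, the best next-state
# estimate is max_a' Q[s', a'] = 0.0, with lr = 0.8 and gamma = 0.95.
q_sa, reward, max_next_q, lr, gamma = 0.5, 1.0, 0.0, 0.8, 0.95

td_target = reward + gamma * max_next_q   # 1.0
td_error = td_target - q_sa               # 0.5
q_sa += lr * td_error                     # 0.5 + 0.8 * 0.5 = 0.9
print(q_sa)
```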
Because Q-learning relies on sampled experience rather than exhaustive computation, it needs many more interactions (10,000 episodes vs 184 sweeps). It also needs an exploration strategy (epsilon-greedy) to ensure it visits enough state-action pairs to build an accurate Q-table. If you have already read our Q-learning tutorial, these mechanics will be familiar.
Why Both Reach the Same Answer
This is not a coincidence. Both algorithms are solving the same Bellman optimality equation. Value iteration solves it through repeated full sweeps over the state space. Q-learning solves it through stochastic approximation: each sampled experience nudges the Q-values toward the true solution, one step at a time.
Given enough sweeps, value iteration converges exactly. Given enough episodes, Q-learning converges asymptotically (with probability 1, under mild conditions on the learning rate and exploration). On FrozenLake, both methods produce policies that agree on 14 of 16 states and achieve the same ~73% success rate.
Going Deeper
Why Not 100%? The Stochasticity Tax
Even the optimal policy only succeeds about 73% of the time on slippery FrozenLake. This is not a bug in the algorithm. The environment is genuinely stochastic: each action has only a 1/3 chance of going in the intended direction, with 1/3 probability of sliding in each perpendicular direction. Some starting positions are simply doomed to fail because all paths to the goal pass near holes, and the ice will occasionally slide you in.
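You can verify the ~73% figure yourself. Below is a minimal evaluation sketch, assuming the `env` and `vi_policy` objects from the code above (the helper name is ours, not from the original post):

```python
def evaluate(env, policy, n_episodes=1000):
    """Roll out the greedy policy and report the empirical success rate."""
    successes = 0.0
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            state, reward, terminated, truncated, _ = env.step(int(policy[state]))
            done = terminated or truncated
        successes += reward  # FrozenLake pays reward 1 only on reaching the goal
    return successes / n_episodes

print(evaluate(env, vi_policy))  # roughly 0.73, varying from run to run
```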
Convergence: 184 Sweeps vs 10,000 Episodes
Value iteration converges to the exact solution in 184 sweeps:
The Bellman error (maximum change in any state value) decreases geometrically. This is because value iteration is a contraction mapping: each sweep multiplies the distance between V and the true V* by at most $\gamma$. With $\gamma = 0.95$, the error shrinks by at least 5% per sweep, guaranteeing convergence.
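In symbols, this is the standard fixed-point argument (stated here for reference):

$$\lVert \mathcal{T}V - \mathcal{T}V' \rVert_\infty \le \gamma \lVert V - V' \rVert_\infty \quad\Longrightarrow\quad \lVert V_k - V^* \rVert_\infty \le \gamma^k \lVert V_0 - V^* \rVert_\infty$$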
Q-learning, by contrast, follows a noisier path:
The rolling average hovers around 40-60% during training because the agent is still exploring (epsilon > 0). But the extracted greedy policy, evaluated after training with epsilon = 0, achieves the same 73% as value iteration.
The Model-Based vs Model-Free Tradeoff
This comparison crystallises one of the deepest tradeoffs in reinforcement learning:
| Property | Value Iteration | Q-Learning |
|---|---|---|
| Needs transition model? | Yes (env.P) | No |
| Steps to converge | 184 sweeps | ~10,000 episodes |
| Optimality guarantee | Exact | Asymptotic |
| Works for unknown environments? | No | Yes |
| Memory | O(\|S\|) | O(\|S\| × \|A\|) |
Value iteration is faster and guarantees exact optimality, but it requires something that is rarely available in practice: the full transition model $P(s' \mid s, a)$. In robotics, game-playing, or any complex real-world task, you almost never have this. That is why model-free methods like Q-learning (and its deep successor, DQN) dominate modern RL.
Hyperparameter Sensitivity
The original code uses a high learning rate ($\alpha = 0.8$) and fast epsilon decay ($\text{decay\_rate} = 7 \times 10^{-3}$). This means Q-learning explores aggressively early on and then commits to exploitation within about 1,000 episodes. The high learning rate works here because FrozenLake has a small, discrete state space. For larger problems, you would need to lower $\alpha$ considerably (our DQN post uses 0.001 with a neural network).
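You can see how quickly exploration tapers off by evaluating the decay schedule directly (a quick sanity check using the same formula as the training loop):

```python
import numpy as np

eps_start, eps_end, decay = 1.0, 0.01, 7e-3
for ep in [0, 100, 500, 1000, 2000]:
    eps = eps_end + (eps_start - eps_end) * np.exp(-decay * ep)
    print(ep, round(eps, 3))
# 0 -> 1.0, 100 -> ~0.5, 500 -> ~0.04, 1000 -> ~0.011, 2000 -> ~0.01
```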
Value iteration, by contrast, has no learning rate. The discount factor $\gamma = 0.95$ is the only tunable parameter, and it has a clear interpretation: how much to value future rewards relative to immediate ones. Higher gamma means the agent plans further ahead but converges more slowly.
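A common rule of thumb (an approximation, not from the original post) is that the effective planning horizon is about $1/(1-\gamma)$ steps:

$$H_{\text{eff}} \approx \frac{1}{1 - \gamma} = \frac{1}{1 - 0.95} = 20 \text{ steps}$$

which is comfortably longer than the shortest path across the 4x4 grid.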
When to Use Which
Use value iteration when:
- You have a complete model of the environment (transition probabilities and rewards)
- The state space is small enough to sweep over exhaustively
- You need guaranteed optimality
Use Q-learning when:
- You can only interact with the environment through trial and error
- The model is unknown or too complex to specify
- You are willing to trade computation for generality
In practice, most interesting problems fall into the Q-learning camp, which is why model-free methods get so much attention. Not all model-free approaches use value functions, though. Methods like the cross-entropy method and simulated annealing search policy space directly without ever estimating state values. But understanding value iteration is essential because it reveals the Bellman equation that underlies all value-based RL. As we saw in our policy gradient post, even gradient-based methods ultimately try to maximise the same value function.
Deep Dive: The Papers
Bellman's Foundation
Value iteration traces directly to Richard Bellman's 1957 monograph Dynamic Programming. Bellman introduced the principle of optimality:
"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."
This recursive insight leads to the Bellman optimality equation. For the state-value function:

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V^*(s')\bigr]$$
And for the action-value function (the one Q-learning estimates):

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\Bigl[r(s, a, s') + \gamma \max_{a'} Q^*(s', a')\Bigr]$$
Value iteration simply applies the first equation as an update rule, sweeping over all states until convergence. The convergence is guaranteed because the Bellman operator is a contraction in the sup-norm with coefficient $\gamma < 1$ (proven by Bellman himself and later formalised by Denardo, 1967).
Watkins' Q-Learning
Q-learning was introduced by Christopher Watkins in his 1989 PhD thesis at Cambridge, with the convergence proof published in Watkins & Dayan (1992). The key insight was that you can learn $Q^*$ directly from experience, without ever knowing the transition model:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\,\bigl[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr]$$
Watkins & Dayan proved that Q-learning converges to $Q^*$ with probability 1, provided:
- All state-action pairs are visited infinitely often
- The learning rate $\alpha$ satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
The first condition is why we need epsilon-greedy exploration. The second is a standard stochastic approximation requirement (Robbins-Monro conditions). In practice, we use a fixed or slowly decaying learning rate and rely on the algorithm converging "well enough" rather than proving formal convergence.
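One schedule that does satisfy both conditions is a per-state-action learning rate of $1/N(s,a)$, where $N$ counts visits. A minimal sketch (names are illustrative, sized for the 4x4 FrozenLake Q-table):

```python
import numpy as np

visit_counts = np.zeros((16, 4))

def robbins_monro_lr(state, action):
    """alpha_t = 1 / N(s, a): the sum of alphas diverges, the sum of squares converges."""
    visit_counts[state, action] += 1
    return 1.0 / visit_counts[state, action]

# Inside the Q-learning loop this would replace the fixed `lr`:
# Q[s, a] += robbins_monro_lr(s, a) * (reward + gamma * Q[s_next].max() - Q[s, a])
```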
The DP-RL Connection
Sutton & Barto's Reinforcement Learning: An Introduction (2nd ed., 2018) makes the connection explicit in Chapters 4 and 6. Value iteration is presented as a dynamic programming method (Chapter 4), while Q-learning is a temporal difference method (Chapter 6). The book shows that TD methods can be viewed as sampling-based approximations to DP: where DP backs up values using the full distribution over successors, TD methods back up using a single sampled successor.
This connection runs deep. Every model-free RL algorithm, from Q-learning to DQN to policy gradients, is implicitly solving a Bellman equation. The difference is in how they approximate the expectation: through tabular sweeps (DP), sampled transitions (TD), or complete episode returns (Monte Carlo).
Further Reading
- Bellman, R. (1957). Dynamic Programming. Princeton University Press. The foundational text.
- Watkins, C.J.C.H. & Dayan, P. (1992). "Q-learning". Machine Learning, 8, 279-292. The convergence proof.
- Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction. 2nd edition. Free online. Chapters 4 (DP) and 6 (TD) are directly relevant.
- Puterman, M.L. (2014). Markov Decision Processes. The definitive theoretical reference.
Try It Yourself
The interactive notebook includes exercises:
- **Non-slippery mode**: Set `is_slippery=False` and compare. Both methods should now achieve ~100% success. How does this change the convergence speed?
- **8x8 grid**: Try `FrozenLake8x8-v1`. Value iteration still works perfectly. How does Q-learning cope with the larger state space?
- **Learned transition model**: The original code includes a `learn_trans_matrix()` function that estimates $P(s' \mid s, a)$ from random play, then runs VI on the learned model. Try this hybrid approach (one possible sketch follows this list). How many random episodes do you need before the learned model matches the true one?
- **Discount factor sensitivity**: Vary $\gamma$ from 0.5 to 0.99 and plot the success rate for both methods. When does a low gamma hurt?
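The original post's `learn_trans_matrix()` is not reproduced here, so the sketch below is just one possible way to tackle the learned-model exercise: estimate transition probabilities and expected rewards from random play, then feed the estimates into value iteration in place of `env.unwrapped.P` (all names are illustrative):

```python
import numpy as np
import gymnasium as gym

def estimate_model(env, n_episodes=5000):
    """Estimate P(s' | s, a) and expected rewards by counting random transitions."""
    nS, nA = env.observation_space.n, env.action_space.n
    counts = np.zeros((nS, nA, nS))
    reward_sum = np.zeros((nS, nA, nS))
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()
            next_state, reward, terminated, truncated, _ = env.step(action)
            counts[state, action, next_state] += 1
            reward_sum[state, action, next_state] += reward
            done = terminated or truncated
            state = next_state
    visits = counts.sum(axis=2, keepdims=True)
    P_hat = np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
    R_hat = np.divide(reward_sum, counts, out=np.zeros_like(counts), where=counts > 0)
    return P_hat, R_hat
```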
Understanding value iteration gives you the theoretical bedrock of RL. Understanding Q-learning gives you the practical tool that works when models are not available. Together, they frame the central tradeoff that drives all of modern reinforcement learning.
Interactive Tools
- Q-Learning Visualiser — Watch value iteration and Q-learning converge on grid worlds in the browser
Related Posts
- Q-Learning on Frozen Lake from Scratch — Deep dive into tabular Q-learning on the same environment
- Deep Q-Networks: Experience Replay and Target Networks — Scaling Q-learning with neural networks
- Policy Gradients: REINFORCE from Scratch — The policy-based alternative to value methods
- Cross-Entropy Method: Evolution-Style RL — A gradient-free approach to the same control problems
Frequently Asked Questions
What is the difference between value iteration and Q-learning?
Value iteration is a dynamic programming method that requires a complete model of the environment (transition probabilities and rewards) and sweeps through all states systematically. Q-learning is model-free: it learns from experience without knowing the environment dynamics. Both converge to optimal values, but value iteration is faster when a model is available, while Q-learning works when it is not.
What is the Bellman equation?
The Bellman equation expresses the value of a state as the immediate reward plus the discounted value of the next state. It is the foundation of both value iteration and Q-learning. Value iteration solves it by iterating the equation across all states until convergence. Q-learning solves it incrementally by updating one state-action pair at a time from experience.
When should I use dynamic programming instead of Q-learning?
Use dynamic programming (value iteration, policy iteration) when you have a complete and accurate model of the environment. This is common in games with known rules, inventory management, and operations research. When the model is unknown, too complex, or too large to enumerate, use model-free methods like Q-learning.
What is the difference between value iteration and policy iteration?
Value iteration updates the value function using the Bellman optimality equation until convergence, then extracts the policy. Policy iteration alternates between evaluating the current policy exactly and improving it greedily. Policy iteration often converges in fewer iterations but each iteration is more expensive. For small state spaces, both work well.
Does value iteration always converge?
Yes, for finite MDPs with a discount factor less than 1. The Bellman operator is a contraction mapping, guaranteeing convergence at a geometric rate. The number of iterations needed depends on the discount factor (higher gamma means slower convergence) and the desired precision. In practice, convergence is usually fast for small to moderate state spaces.