In the Q-learning post, we trained an agent to navigate a 4×4 frozen lake using a simple lookup table — 16 states × 4 actions = 64 numbers. But what happens when the state space isn't a grid?
CartPole has four continuous state variables: cart position, cart velocity, pole angle, and pole angular velocity. Even if you discretised each into 100 bins, you'd need 100⁴ = 100 million Q-values. An Atari game frame is 210×160 pixels with 128 colours — an astronomically large number of possible frames. Tables don't work here.
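The blowup is easy to verify with back-of-envelope arithmetic (using the illustrative 100-bin discretisation from above):

```python
# Discretising CartPole's 4 continuous state variables into 100 bins each
# already needs a hundred-million-row Q-table — before even counting actions.
bins = 100
state_dims = 4
n_actions = 2

n_states = bins ** state_dims
print(n_states)              # 100,000,000 discretised states
print(n_states * n_actions)  # 200,000,000 Q-values, vs 64 for FrozenLake
```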
The solution: replace the Q-table with a neural network. Feed in the state, get out Q-values for every action. But naively combining neural networks with Q-learning is unstable — the network chases a moving target while training on correlated sequential data. DeepMind solved both problems with two elegant tricks: experience replay and a target network.
By the end of this post, you'll implement a Deep Q-Network from scratch in PyTorch, train it to balance a pole, and understand why these two tricks were the key insight behind the DQN papers (Mnih et al., 2013 and 2015) — the work that launched modern deep reinforcement learning.
The Problem: CartPole
OpenAI's CartPole-v1 is a classic control problem: balance a pole on a moving cart by pushing left or right.
|
| ← pole (keep upright!)
|
┌─────┴─────┐
│ cart │
└───────────┘
◄── push left push right ──►
─────────────────────────────────
State (4 continuous values):

- Cart position ($x$)
- Cart velocity ($\dot{x}$)
- Pole angle ($\theta$)
- Pole angular velocity ($\dot{\theta}$)
Actions: Push left (0) or push right (1).
Reward: +1 for every timestep the pole stays upright. The episode ends when the pole tilts beyond ±12° or the cart moves more than ±2.4 units from the centre. Maximum score is 500 (the episode is truncated there).
Unlike FrozenLake's 16 discrete states, CartPole's state space is continuous and 4-dimensional. A Q-table is useless here — we need function approximation.
Quick Win: Run the Algorithm
Let's see DQN in action. Click the badge to open the interactive notebook:
The GIF shows the agent's progress: in early episodes the pole topples immediately, but after training with experience replay and a target network, it learns to balance for the full 500 steps.
import numpy as np
import random
from collections import deque
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
# --- Neural network: maps state → Q-values for each action ---
class QNetwork(nn.Module):
def __init__(self, state_dim, action_dim, hidden_dim=128):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, action_dim),
)
def forward(self, x):
return self.net(x)
# --- Experience replay buffer ---
class ReplayBuffer:
def __init__(self, capacity):
self.buffer = deque(maxlen=capacity)
def push(self, state, action, reward, next_state, done):
self.buffer.append((state, action, reward, next_state, done))
def sample(self, batch_size):
batch = random.sample(self.buffer, batch_size)
states, actions, rewards, next_states, dones = zip(*batch)
return (np.array(states), np.array(actions), np.array(rewards),
np.array(next_states), np.array(dones))
def __len__(self):
return len(self.buffer)
# --- DQN Agent ---
class DQNAgent:
def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3,
gamma=0.99, buffer_size=10000, batch_size=64,
target_update_freq=5, epsilon_start=1.0,
epsilon_end=0.05, epsilon_step=0.0005, min_buffer=500):
self.action_dim = action_dim
self.gamma = gamma
self.batch_size = batch_size
self.target_update_freq = target_update_freq
self.min_buffer = min_buffer
# Linear epsilon decay — only during training, not during buffer collection.
# Rate from the original CartPole code: epsilon -= 1/(n_episodes/0.05)
self.epsilon = epsilon_start
self.epsilon_end = epsilon_end
self.epsilon_step = epsilon_step
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.q_network = QNetwork(state_dim, action_dim, hidden_dim).to(self.device)
self.target_network = QNetwork(state_dim, action_dim, hidden_dim).to(self.device)
self.target_network.load_state_dict(self.q_network.state_dict())
self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
self.buffer = ReplayBuffer(buffer_size)
def select_action(self, state):
if random.random() < self.epsilon:
return random.randrange(self.action_dim)
with torch.no_grad():
state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
return self.q_network(state_t).argmax(dim=1).item()
def train_step(self):
if len(self.buffer) < max(self.batch_size, self.min_buffer):
return None
states, actions, rewards, next_states, dones = self.buffer.sample(
self.batch_size
)
states_t = torch.FloatTensor(states).to(self.device)
actions_t = torch.LongTensor(actions).to(self.device)
rewards_t = torch.FloatTensor(rewards).to(self.device)
next_states_t = torch.FloatTensor(next_states).to(self.device)
dones_t = torch.FloatTensor(dones).to(self.device)
# Current Q-values: Q(s, a) from the online network
q_values = self.q_network(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)
# Target Q-values: r + γ max_a' Q_target(s', a') from the frozen network
with torch.no_grad():
next_q_values = self.target_network(next_states_t).max(dim=1).values
targets = rewards_t + self.gamma * next_q_values * (1 - dones_t)
loss = nn.MSELoss()(q_values, targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Linear epsilon decay — only during training
self.epsilon = max(self.epsilon_end, self.epsilon - self.epsilon_step)
return loss.item()
def update_target_network(self):
self.target_network.load_state_dict(self.q_network.state_dict())
# --- Training loop with early stopping ---
import copy
env = gym.make("CartPole-v1")
agent = DQNAgent(
state_dim=env.observation_space.shape[0],
action_dim=env.action_space.n,
)
n_episodes = 500
early_stop_reward = 400 # stop when rolling avg hits this
early_stop_window = 50 # rolling window size
rewards_history = []
best_avg_reward = -float('inf')
best_weights = None
for episode in range(n_episodes):
state, _ = env.reset()
total_reward = 0
for step in range(500):
action = agent.select_action(state)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
agent.buffer.push(state, action, reward, next_state, done)
agent.train_step()
state = next_state
total_reward += reward
if done:
break
if episode % agent.target_update_freq == 0:
agent.update_target_network()
rewards_history.append(total_reward)
# Track best weights and check for early stopping
if len(rewards_history) >= early_stop_window:
avg = np.mean(rewards_history[-early_stop_window:])
if avg > best_avg_reward:
best_avg_reward = avg
best_weights = copy.deepcopy(agent.q_network.state_dict())
if avg >= early_stop_reward:
print(f"Early stopping at episode {episode} (avg reward: {avg:.1f})")
break
if episode % 50 == 0:
avg = np.mean(rewards_history[-50:])
print(f"Episode {episode:3d} | Avg reward: {avg:6.1f} | Epsilon: {agent.epsilon:.3f}")
# Restore best weights
if best_weights is not None:
agent.q_network.load_state_dict(best_weights)
env.close()
print(f"\nFinal avg reward (last 50): {np.mean(rewards_history[-50:]):.1f}")
The result: The agent learns to balance the pole and early stopping kicks in once the 50-episode rolling average hits 400. We also save the best weights, because DQN can suffer from catastrophic forgetting — performance collapses if you keep training past the sweet spot. Compare this to tabular Q-learning: here we handle a continuous state space with just a small neural network instead of an impossibly large table.
Visualise the Learning
import matplotlib.pyplot as plt
rolling = np.convolve(rewards_history, np.ones(20)/20, mode='valid')
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(rolling, 'b-', linewidth=0.8)
ax.axhline(y=500, color='g', linestyle='--', alpha=0.5, label='Max score (500)')
ax.set_xlabel('Episode')
ax.set_ylabel('Reward (20-episode rolling avg)')
ax.set_title('DQN on CartPole-v1')
ax.set_ylim(0, 550)
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
The reward curve shows two phases: the agent plateaus around 200 while the replay buffer fills with diverse experiences, then rapidly improves once it starts exploiting what it's learned. Early stopping halts training once the rolling average hits 400 — without it, continued training often leads to catastrophic forgetting where performance suddenly collapses.
What Just Happened?
We replaced the Q-table with a neural network, but two critical tricks made it work. Without them, training is unstable or fails entirely.
Trick 1: Experience Replay
In tabular Q-learning, we updated $Q(s, a)$ immediately after each transition. With a neural network, this is catastrophic. Consecutive experiences are highly correlated — if the cart drifts left for 10 steps, the network sees 10 similar "going left" transitions in a row and overfits to that pattern.
Experience replay breaks this correlation:
class ReplayBuffer:
def __init__(self, capacity):
self.buffer = deque(maxlen=capacity)
def push(self, state, action, reward, next_state, done):
self.buffer.append((state, action, reward, next_state, done))
def sample(self, batch_size):
batch = random.sample(self.buffer, batch_size)
# ...
Every transition $(s, a, r, s', \text{done})$ is stored in a fixed-size buffer. During training, we sample a random minibatch from the buffer — not the latest transitions, but a mix of old and new experiences from different parts of the state space.
Think of it like studying for an exam: instead of re-reading the last chapter (correlated, recent data), you shuffle all your flash cards and quiz yourself on a random mix (uncorrelated, diverse data).
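The flash-card intuition is easy to demonstrate with the standard library alone — here toy integers stand in for full $(s, a, r, s', \text{done})$ tuples:

```python
import random
from collections import deque

random.seed(0)
buffer = deque(maxlen=100)
for t in range(100):            # t stands in for one sequential transition
    buffer.append(t)

recent = list(buffer)[-8:]               # what naive online updates train on
batch = random.sample(list(buffer), 8)   # what experience replay trains on

print(recent)         # [92, 93, 94, 95, 96, 97, 98, 99] — consecutive, correlated
print(sorted(batch))  # spread across the whole buffer — decorrelated
```

The last eight transitions are always consecutive; the uniform sample mixes old and new experience from across the buffer.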
The original code used exactly this pattern — a deque with a fixed capacity:
self.experience_pool = deque([], pool_size) # from the original dqn.py
Mnih et al. used a buffer of 1 million transitions for Atari. Our CartPole agent uses 10,000 — more than enough for this simpler problem.
Trick 2: Target Network (Frozen Q-hat)
The second instability comes from the training target itself. In Q-learning, we update toward:

$$y = r + \gamma \max_{a'} Q(s', a')$$
But the Q-values on the right come from the same network we're updating. It's like a dog chasing its tail — every gradient step shifts the target, causing oscillations or divergence.
The fix: maintain a separate target network $\hat{Q}$ (called frozen_model in the original code, target_network in ours). This network is a periodic copy of the online network:
# Online network: updated every step
q_values = self.q_network(states_t).gather(1, actions_t.unsqueeze(1))
# Target network: frozen copy, updated every N episodes
with torch.no_grad():
next_q_values = self.target_network(next_states_t).max(dim=1).values
targets = rewards_t + self.gamma * next_q_values * (1 - dones_t)
The target network is frozen between updates. Every target_update_freq episodes, we copy the online network's weights into it:
if episode % agent.target_update_freq == 0:
agent.update_target_network()
This gives the online network a stable target to train against. The original code explored different freeze intervals:
freeze_everys = [1, 5, 10] # from the original cartpole_q_net.py
Updating every episode (freeze_every=1) essentially disables the trick. Freezing for longer (5 or 10 episodes) gives more stable training but slightly slower adaptation.
The Training Step
Putting it together, each training step:

- Sample a random minibatch from the replay buffer
- Compute current Q-values from the online network: $Q(s, a)$
- Compute target Q-values from the frozen network: $r + \gamma \max_{a'} \hat{Q}(s', a')$
- Minimise the MSE loss between current and target: $\mathcal{L} = (Q(s,a) - y)^2$
- Backpropagate and update only the online network's weights
loss = nn.MSELoss()(q_values, targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
This is supervised learning with self-generated labels. The "label" for $Q(s, a)$ is the bootstrapped estimate $r + \gamma \max_{a'} \hat{Q}(s', a')$ — not a ground truth, but a better estimate than what we had.
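A tiny numeric sketch (with made-up Q-values, not from a trained network) shows how these bootstrapped labels are formed, mirroring the `targets` line in `train_step`:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 1.0, 1.0])        # CartPole pays +1 per step
next_q_max = np.array([50.0, 20.0, 0.0])   # pretend target-network maxima
dones = np.array([0.0, 0.0, 1.0])          # last transition is terminal

# Terminal states get no bootstrap: the target is just the reward.
targets = rewards + gamma * next_q_max * (1 - dones)
print(targets)  # values: 50.5, 20.8, 1.0
```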
The Full Picture
┌──────────────┐ action ┌─────────────┐
│ Environment ├───────────►│ Agent │
│ (CartPole) │◄───────────┤ ε-greedy │
└──────┬───────┘ s,r,s',d └──────┬──────┘
│ │
│ (s,a,r,s',done) │
▼ │
┌──────────────┐ │
│ Replay │ random batch │
│ Buffer ├───────────────────┘
│ (10,000) │ │
└──────────────┘ ▼
┌──────────────┐ copy weights
│ Online Net ├────────────────►┌──────────────┐
│ Q(s,a) │ every N eps │ Target Net │
└──────┬───────┘ │ Q̂(s',a') │
│ └──────┬───────┘
│ MSE loss │
└─────────── vs ──────────────────┘
Q(s,a) ≈ r + γ max Q̂(s',a')
Going Deeper
Why Naive Q-Learning with Neural Nets Fails
There are three reasons combining neural networks with Q-learning is unstable — closely related to what Sutton & Barto call the deadly triad (function approximation, bootstrapping, and off-policy learning):
- Correlated samples — Sequential experience creates biased gradient estimates. A batch of "cart drifting left" transitions teaches the network to handle left-drifts but destroys its knowledge of right-drifts. Experience replay fixes this.
- Non-stationary targets — The target $r + \gamma \max_{a'} Q(s', a')$ changes with every gradient step because $Q$ is the network we're updating. The target network fixes this by providing a frozen target that only changes periodically.
- Function approximation — A neural network generalises: updating Q-values for one state affects nearby states. This can create feedback loops where overestimation in one region cascades. This is partially addressed by the target network and further improved by Double DQN, which decouples action selection from value estimation.
Practical Tip: Early Stopping
Experience replay and the target network reduce instability, but they don't eliminate it. In practice, DQN agents often exhibit catastrophic forgetting — performance peaks, then collapses as continued training destabilises the network. This is visible in the ablation study: even the full DQN's reward curve can crash if training runs too long.
The fix is simple and borrowed from supervised learning: monitor a rolling average of the reward and stop training once it plateaus. Save the best weights along the way:
best_avg_reward = -float('inf')
best_weights = None
for episode in range(n_episodes):
# ... training loop ...
if len(rewards_history) >= 50:
avg = np.mean(rewards_history[-50:])
if avg > best_avg_reward:
best_avg_reward = avg
best_weights = copy.deepcopy(agent.q_network.state_dict())
if avg >= 400: # stop when converged
break
# Restore best weights
agent.q_network.load_state_dict(best_weights)
This pattern applies broadly: any RL agent trained with function approximation can overfit or destabilise. Always track the best checkpoint.
Hyperparameters
| Parameter | What it controls | Our value | DeepMind Atari |
|---|---|---|---|
| `hidden_dim` | Network capacity | 128 | ~3 conv layers + FC |
| `lr` | Learning rate | 1e-3 | 2.5e-4 (RMSprop) |
| `gamma` | Discount factor | 0.99 | 0.99 |
| `buffer_size` | Replay buffer capacity | 10,000 | 1,000,000 |
| `min_buffer` | Fill before training | 500 | 50,000 |
| `batch_size` | Minibatch size | 64 | 32 |
| `target_update_freq` | Target network sync | Every 5 episodes | Every 10,000 steps |
| `epsilon_start` | Initial exploration | 1.0 | 1.0 |
| `epsilon_end` | Final exploration | 0.05 | 0.1 |
| `epsilon_step` | Exploration decay rate | 0.0005 per training step | Linear over 1M frames |
Buffer size matters. The original code experimented with pool sizes of 500 vs 1,500 and found the larger buffer consistently improved learning. DeepMind used 1 million transitions — enough to hold many hours of gameplay.
Batch size matters. The original code tested batch sizes of 16, 32, 64, and 128, finding that "increasing from 16 → 32 → 64 → 128 consistently boosts agent's training speed." Larger batches give lower-variance gradient estimates, but eventually hit diminishing returns.
Gamma requires care. The original code's author noted: "Initially I was using $\gamma = 0.75$ for the CartPole problem which is grossly small." With $\gamma = 0.75$, a reward 10 steps away is worth $0.75^{10} \approx 0.06$ — the agent becomes too myopic to learn long-horizon balancing. With $\gamma = 0.99$, that same reward is worth $0.99^{10} \approx 0.90$.
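The arithmetic is easy to reproduce — a reward $k$ steps away is worth $\gamma^k$ today:

```python
# How much a future reward is worth under each discount factor.
for gamma in (0.75, 0.99):
    horizon = {k: round(gamma ** k, 3) for k in (1, 10, 100)}
    print(gamma, horizon)
# gamma=0.75: a reward 10 steps out is worth ~0.056 — effectively invisible
# gamma=0.99: the same reward is worth ~0.904, and even 100 steps out ~0.366
```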
Epsilon Annealing
The original code contained a hard-won insight about epsilon decay:
# The rate with which you anneal epsilon is very crucial in the training quality.
# We found that we were annealing too early, not allowing the agent to explore
# sufficiently.
Epsilon controls the explore-exploit trade-off — just like in tabular Q-learning. But with DQN, the stakes are higher: the network needs diverse training data to generalise well. Anneal too fast and the agent gets stuck in a local optimum; too slow and training wastes time on random actions.
Two critical details from the original code:
- Fill the buffer first — epsilon stays at 1.0 (pure random) until `min_buffer` transitions are collected. This seeds the replay buffer with diverse experiences before any training begins.
- Linear decay during training only — epsilon decrements by a fixed step (0.0005) each training step, decaying from 1.0 to 0.05 over ~1,900 training steps. DeepMind used a similar linear decay from 1.0 to 0.1 over 1 million frames.
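The resulting schedule can be sketched as a pure function of training steps taken (parameter names mirror the agent above):

```python
def epsilon_at(train_steps, start=1.0, end=0.05, step=0.0005):
    """Linear decay with a floor: fixed decrement per training step."""
    return max(end, start - step * train_steps)

print(epsilon_at(0))                # 1.0 — pure exploration at the start
print(round(epsilon_at(1000), 4))   # 0.5 — halfway through the decay
print(round(epsilon_at(1900), 4))   # 0.05 — fully annealed after ~1,900 steps
print(epsilon_at(10_000))           # 0.05 — stays at the floor thereafter
```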
From CartPole to Atari
Our CartPole DQN takes a 4-dimensional state vector as input. DeepMind's Atari DQN takes raw pixels: 84×84 grayscale frames, stacked 4 deep (to capture motion). The key architectural difference is convolutional layers for spatial feature extraction:
Atari DQN Architecture:
Input: 84×84×4 (4 stacked frames)
→ Conv2D(32, 8×8, stride 4) → ReLU
→ Conv2D(64, 4×4, stride 2) → ReLU
→ Conv2D(64, 3×3, stride 1) → ReLU
→ Flatten → Dense(512) → ReLU
→ Dense(num_actions) # one output per action
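As a sanity check on the architecture above, the feature-map sizes follow the standard no-padding convolution formula:

```python
def conv_out(size, kernel, stride):
    # out = floor((in - kernel) / stride) + 1, assuming no padding
    return (size - kernel) // stride + 1

s = conv_out(84, 8, 4)   # 20
s = conv_out(s, 4, 2)    # 9
s = conv_out(s, 3, 1)    # 7
print(s, 64 * s * s)     # 7×7×64 = 3136 flattened features feeding Dense(512)
```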
The same two tricks — experience replay and target network — are what made this work. The network architecture is straightforward; the training stability innovations are the real contribution.
When NOT to Use DQN
- Continuous action spaces — DQN requires $\arg\max_a Q(s, a)$ over discrete actions. For continuous control (robotic arms, throttle), use actor-critic methods like DDPG or SAC
- Very long episodes — DQN's bootstrapping can propagate errors over long horizons. For episodic tasks with clear terminal rewards, Monte Carlo methods can be more stable
- Simple environments — If the state space is small and discrete, tabular Q-learning is simpler, interpretable, and guaranteed to converge. Don't use a neural network when a table will do
- Multi-agent settings — DQN assumes a stationary environment. Other agents make the environment non-stationary, breaking core assumptions
Ablation: What Happens Without the Tricks?
To see why experience replay matters, we train a variant without it — using a tiny 64-transition buffer that can't break temporal correlations:
# Full DQN with experience replay (early stopped)
agent_full = DQNAgent(state_dim=4, action_dim=2)
# No experience replay: tiny buffer, small batch
agent_no_replay = DQNAgent(state_dim=4, action_dim=2, buffer_size=64, batch_size=8, min_buffer=8)
The contrast is stark:
- Full DQN (blue) — Converges to ~400+ reward. Early stopping captures the peak before catastrophic forgetting can set in
- No experience replay (red) — Stuck below 150 for 500 episodes. Without random sampling from a large buffer, the network trains on correlated sequential data and never generalises
This matches what DeepMind found: experience replay was the critical innovation. The correlated, non-i.i.d. nature of sequential RL data is fundamentally incompatible with stable neural network training — random minibatches from a large buffer fix this.
Deep Dive: The Papers
Mnih et al. (2013) — The NIPS Workshop Paper
The story begins at DeepMind in 2013. Volodymyr Mnih and colleagues published Playing Atari with Deep Reinforcement Learning at the NIPS Deep Learning Workshop. The goal was ambitious: learn to play Atari 2600 games directly from pixels, using a single architecture and hyperparameter set across all games.
The key quote from the paper:
"We demonstrate that a convolutional neural network can learn successful control policies from raw video data [...] using only the reward signal. [...] The network was not provided with any game-specific information or hand-designed visual features."
Previous attempts at combining neural networks with Q-learning had failed. The paper identified two reasons:
"Firstly, the sequence of observations in RL is correlated, unlike the independent and identically distributed (i.i.d.) assumption of most deep learning methods. Secondly, the data distribution changes as the agent learns new behaviours."
Their solution was experience replay, borrowed from Lin (1992):
"We store the agent's experiences at each time step,
$e_t = (s_t, a_t, r_t, s_{t+1})$, in a data set$\mathcal{D} = e_1, \ldots, e_N$, pooled over many episodes into a replay memory."
Mnih et al. (2015) — The Nature Paper
Two years later, the expanded version appeared in Nature as Human-level control through deep reinforcement learning. The key addition was the target network:
"The second modification [...] is to use a separate network for generating the targets
$y_j$in the Q-learning update. More precisely, every$C$updates we clone the network$Q$to obtain a target network$\hat{Q}$and use$\hat{Q}$for generating the Q-learning targets for the following$C$updates."
The loss function, from the paper:

$$\mathcal{L}_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(\mathcal{D})} \left[ \left( r + \gamma \max_{a'} \hat{Q}(s', a'; \theta_i^{-}) - Q(s, a; \theta_i) \right)^2 \right]$$
Where:

- $\theta_i$ — weights of the online network at iteration $i$
- $\theta_i^{-}$ — weights of the target network (frozen, periodically copied from $\theta$)
- $U(\mathcal{D})$ — uniform random sampling from the replay buffer
The Nature paper tested on 49 Atari games. DQN achieved human-level performance or better on 29 of them — a remarkable result for a single, general-purpose algorithm.
The DQN Algorithm (Paper Pseudocode)
From Algorithm 1 of the Nature paper:
Initialize replay memory D with capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
For episode = 1, M do:
Initialise state s₁
For t = 1, T do:
With probability ε select random action aₜ
Otherwise select aₜ = argmax_a Q(sₜ, a; θ)
Execute action aₜ, observe reward rₜ and next state sₜ₊₁
Store transition (sₜ, aₜ, rₜ, sₜ₊₁) in D
Sample random minibatch of transitions (sⱼ, aⱼ, rⱼ, sⱼ₊₁) from D
Set yⱼ = rⱼ if episode terminates at step j+1
yⱼ = rⱼ + γ max_a' Q̂(sⱼ₊₁, a'; θ⁻) otherwise
Perform gradient descent on (yⱼ - Q(sⱼ, aⱼ; θ))² w.r.t. θ
Every C steps: Q̂ ← Q (reset target network)
Our Implementation vs the Paper
| Paper (DQN, 2015) | Our code |
|---|---|
| Input: 84×84×4 pixel frames | Input: 4-dim state vector |
| 3 conv layers + 2 FC layers | 2 hidden FC layers (128 units) |
| Replay buffer $\mathcal{D}$, capacity 1M | `ReplayBuffer`, capacity 10,000 |
| Uniform random sample $U(\mathcal{D})$ | `random.sample(self.buffer, batch_size)` |
| Target network $\hat{Q}(\theta^-)$ | `self.target_network` |
| Clone every $C = 10{,}000$ steps | `update_target_network()` every 5 episodes |
| RMSprop, $lr = 0.00025$ | Adam, $lr = 0.001$ |
| Loss: $(y - Q(s,a;\theta))^2$ | `nn.MSELoss()(q_values, targets)` |
| 49 Atari games | CartPole-v1 |
The architecture differs, but the algorithm is identical. The same two tricks — experience replay and target network — stabilise training whether the input is raw pixels or a 4-dimensional state vector.
What Came After
DQN spawned a wave of improvements:
- Double DQN (van Hasselt et al., 2016) — Fixes Q-value overestimation by using the online network to select actions but the target network to evaluate them
- Prioritised Experience Replay (Schaul et al., 2016) — Sample transitions with high TD error more often (learn from surprises)
- Dueling DQN (Wang et al., 2016) — Separate streams for state value $V(s)$ and advantage $A(s, a)$
- Rainbow (Hessel et al., 2018) — Combines all improvements into one agent
- Distributional RL (Bellemare et al., 2017) — Learn the full distribution of returns, not just the mean
Historical Context
- Watkins (1989) — Q-learning: model-free, off-policy control
- Lin (1992) — Experience replay for RL (first proposed)
- Riedmiller (2005) — Neural Fitted Q-Iteration (NFQ) — early neural network Q-learning
- Mnih et al. (2013) — DQN: experience replay + deep learning on Atari
- Mnih et al. (2015) — DQN + target network in Nature, human-level on 29/49 games
- Silver et al. (2016) — AlphaGo: deep RL defeats world Go champion
Further Reading
- Mnih et al. (2013) — Playing Atari with Deep Reinforcement Learning — the original DQN paper
- Mnih et al. (2015) — Human-level control through deep reinforcement learning — Nature version with target network
- Sutton & Barto (2018) — Chapter 16.5 (Deep Reinforcement Learning) — freely available online
- DeepMind's DQN code — Original Torch7 implementation
- Next in the series: Policy Gradients: REINFORCE from Scratch — directly optimise the policy instead of learning Q-values
Interactive Tools
- Q-Learning Visualiser — Watch tabular Q-learning train step-by-step on grid worlds before scaling up to DQN
Related Posts
- Q-Learning from Scratch — Tabular Q-learning, the foundation DQN builds on
- Backpropagation Demystified — The gradient engine powering DQN's neural network
- Genetic Algorithms — Evolution-based optimisation, an alternative to gradient-based RL
Try It Yourself
The interactive notebook includes exercises:
- Ablation study — Train with `target_update_freq=1` (no target network) and `buffer_size=1` (no replay). How do the reward curves compare to full DQN?
- Gamma sweep — Try $\gamma \in \{0.5, 0.9, 0.99, 0.999\}$. How does the discount factor affect convergence speed?
- Network size — Try `hidden_dim` of 32, 64, 128, and 256. Is bigger always better for CartPole?
- Soft target updates — Replace the hard copy with Polyak averaging: $\theta^- \leftarrow \tau \theta + (1-\tau) \theta^-$ with $\tau = 0.005$. Does this improve stability?
- Double DQN — Modify the target to use the online network for action selection: $a^* = \arg\max_{a'} Q(s', a'; \theta)$, then evaluate with the target network: $\hat{Q}(s', a^*; \theta^-)$. Does this reduce overestimation?
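For the soft-update exercise, a minimal stdlib sketch of Polyak averaging on plain lists of numbers (a stand-in for blending the real `state_dict` tensors) shows how the target drifts smoothly toward the online weights instead of jumping:

```python
def soft_update(target, online, tau=0.005):
    """Blend each target parameter a small step toward its online counterpart."""
    return [tau * o + (1 - tau) * t for t, o in zip(target, online)]

target = [0.0, 0.0]   # pretend target-network parameters
online = [1.0, 1.0]   # pretend online-network parameters (held fixed here)
for _ in range(1000):
    target = soft_update(target, online)
print([round(v, 3) for v in target])  # [0.993, 0.993] — nearly caught up
```

After $n$ updates the target sits at $1 - (1-\tau)^n$ of the way to the online value, so the tracking is exponential rather than stepwise.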
DQN showed that deep learning and reinforcement learning could be combined — but only with the right tricks. Experience replay breaks temporal correlations, the target network provides stable training signals, and together they turned an unstable combination into a system that matched human performance on Atari. Next up: policy gradient methods, which skip Q-values entirely and directly optimise the policy.
Frequently Asked Questions
What is a Deep Q-Network and how does it differ from tabular Q-learning?
A DQN replaces the Q-table with a neural network that takes a state as input and outputs Q-values for all actions. This allows it to handle environments with large or continuous state spaces where a table would be impossibly large. The core Q-learning update remains the same, but the function approximation introduces new stability challenges.
Why is experience replay important for DQN?
Without experience replay, the network trains on consecutive, highly correlated transitions, which violates the i.i.d. assumption of gradient descent and causes unstable learning. Experience replay stores transitions in a buffer and samples random mini-batches for training, breaking correlations and allowing each experience to be reused multiple times.
What is a target network and why is it needed?
In Q-learning, the same network both selects actions and computes target values, creating a moving target problem that destabilises training. A target network is a separate, slowly updated copy used only to compute targets. This decouples the target from the current network's rapidly changing weights, stabilising learning significantly.
How do I choose the replay buffer size?
A buffer that is too small loses useful past experiences and reduces diversity. A buffer that is too large wastes memory and may contain outdated transitions from a much weaker policy. For most environments, 100,000 to 1,000,000 transitions works well. Start with 100,000 and increase if training is unstable.
What is the difference between DQN and Double DQN?
Standard DQN tends to overestimate Q-values because it uses the max operator to both select and evaluate actions. Double DQN fixes this by using the online network to select the best action but the target network to evaluate it. This simple change often improves performance significantly with no computational overhead.
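A small numpy sketch (random stand-in Q-values, not a trained network) makes the decoupling concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
q_online = rng.normal(size=(5, 2))   # pretend online-network Q(s', ·) for 5 states
q_target = rng.normal(size=(5, 2))   # pretend target-network values for the same states

# Standard DQN: the target net both selects and evaluates (prone to overestimation)
dqn_bootstrap = q_target.max(axis=1)

# Double DQN: the online net selects the action, the target net evaluates it
a_star = q_online.argmax(axis=1)
ddqn_bootstrap = q_target[np.arange(5), a_star]

# The Double DQN bootstrap can never exceed the standard one
print(np.all(ddqn_bootstrap <= dqn_bootstrap))  # True
```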