In the DQN post, we trained a neural network to estimate Q-values and then picked the best action with argmax. That works when the action space is discrete — push left or push right. But what if you need to control a robotic arm with continuous joint angles, or steer a car with a continuous throttle? You can't argmax over infinity.
Policy gradient methods flip the approach: instead of learning a value function and deriving a policy, we directly parameterise the policy and optimise it via gradient ascent. The network outputs action probabilities, we sample from them, and we nudge the parameters toward actions that led to high rewards. No Q-values, no argmax, no experience replay — just a policy, a gradient, and a reward signal.
By the end of this post, you'll implement the REINFORCE algorithm entirely from scratch in NumPy — including the forward pass, backpropagation, and RMSProp optimiser — and train it to balance CartPole. The entire implementation is about 100 lines. No PyTorch, no TensorFlow, just NumPy and the chain rule.
Quick Win: Run the Algorithm
Let's see REINFORCE in action. Click the badge to open the interactive notebook:
The animation shows the agent converging to the maximum score of 500 within about 3,000 episodes, using nothing but NumPy. No deep learning framework, no replay buffer — just a policy, a gradient, and RMSProp.
```python
import numpy as np
import gymnasium as gym

# --- Hyperparameters ---
# Original code (CartPole-v0, max 200 steps): H=100, lr=1e-4, gamma=0.95, batch_size=5
# Adapted for CartPole-v1 (max 500 steps): higher gamma and learning rate
H = 100               # hidden layer neurons
batch_size = 5        # episodes per parameter update
learning_rate = 1e-3  # RMSProp learning rate (original: 1e-4, raised for longer episodes)
gamma = 0.99          # discount factor (original: 0.95, raised for longer horizon)
decay_rate = 0.99     # RMSProp decay
epsilon = 1e-5        # RMSProp epsilon

# --- Network functions ---
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, model):
    h = np.dot(model['W1'], x)
    h[h < 0] = 0  # ReLU
    logp = np.dot(model['W2'], h)
    p = sigmoid(logp)
    return p, h

def backward(eph, epdlogp, epx, model):
    dW2 = np.dot(eph.T, epdlogp).ravel()
    dh = np.outer(epdlogp, model['W2'])
    dh[eph <= 0] = 0  # backprop through ReLU
    dW1 = np.dot(dh.T, epx)
    return {'W1': dW1, 'W2': dW2}

def discount_rewards(r, gamma):
    """Standard full-horizon discounting."""
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

# --- Initialisation ---
env = gym.make('CartPole-v1')
observation, _ = env.reset()
D = len(observation)  # input dimension (4 for CartPole)
model = {
    'W1': np.random.randn(H, D) / np.sqrt(D),  # Xavier init
    'W2': np.random.randn(H) / np.sqrt(H),
}
grad_buffer = {k: np.zeros_like(v) for k, v in model.items()}
rmsprop_cache = {k: np.zeros_like(v) for k, v in model.items()}

# --- Training loop ---
xs, hs, dlogps, drs, episode_durations = [], [], [], [], []
episode_number = 0
t = 0
while episode_number < 5000:
    x = observation
    aprob, h = forward(x, model)
    action = 1 if np.random.uniform() < aprob else 0

    xs.append(x)
    hs.append(h)
    dlogps.append(action - aprob)  # policy gradient

    observation, reward, terminated, truncated, _ = env.step(action)
    drs.append(reward)
    t += 1

    if terminated or truncated:
        episode_number += 1
        episode_durations.append(t)
        t = 0

        epx = np.vstack(xs)
        eph = np.vstack(hs)
        epdlogp = np.vstack(dlogps)
        epr = np.vstack(drs)
        xs, hs, dlogps, drs = [], [], [], []

        # Discount and standardise rewards
        discounted_epr = discount_rewards(epr, gamma)
        discounted_epr -= np.mean(discounted_epr)
        std = np.std(discounted_epr)
        if std > 0:
            discounted_epr /= std

        # The PG magic: weight gradients by advantage
        epdlogp *= discounted_epr
        grad = backward(eph, epdlogp, epx, model)
        for k in model:
            grad_buffer[k] += grad[k]

        # RMSProp update every batch_size episodes
        if episode_number % batch_size == 0:
            for k, v in model.items():
                g = grad_buffer[k]
                rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
                model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + epsilon)
                grad_buffer[k] = np.zeros_like(v)

        if episode_number % 500 == 0:
            avg = np.mean(episode_durations[-100:])
            print(f'Episode {episode_number}, 100-ep avg: {avg:.1f}')

        observation, _ = env.reset()

env.close()
print(f'Final 100-episode average: {np.mean(episode_durations[-100:]):.1f}')
```
The result: The agent converges to the 500-step maximum, using nothing but NumPy. No deep learning framework — we compute every gradient by hand.
Visualise the Learning
```python
import matplotlib.pyplot as plt

rolling = np.convolve(episode_durations, np.ones(100) / 100, mode='valid')

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(rolling, 'b-', linewidth=0.8)
ax.axhline(y=500, color='g', linestyle='--', alpha=0.5, label='Max score (500)')
ax.set_xlabel('Episode')
ax.set_ylabel('Duration (100-episode rolling avg)')
ax.set_title('REINFORCE on CartPole-v1')
ax.set_ylim(0, 550)
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
What Just Happened?
We built a complete RL agent with no frameworks — just a policy network, a gradient, and a reward signal. Let's walk through each piece.
The Policy Network (4 → 100 → 1)
The network is a two-layer perceptron that maps a 4-dimensional CartPole state to a single probability:
```
State [x, ẋ, θ, θ̇] → Hidden (100 ReLU) → Output (sigmoid) → P(push right)
        4                  100                   1
```
```python
def forward(x, model):
    h = np.dot(model['W1'], x)     # (100, 4) × (4,) → (100,)
    h[h < 0] = 0                   # ReLU activation
    logp = np.dot(model['W2'], h)  # (100,) · (100,) → scalar
    p = sigmoid(logp)              # squash to [0, 1]
    return p, h
```
The output p is the probability of pushing right. We sample from this Bernoulli distribution:
```python
action = 1 if np.random.uniform() < aprob else 0
```
This is fundamentally different from DQN's approach. DQN outputs Q-values for every action and picks the highest (deterministic, epsilon-greedy). Here, the network outputs a probability and we sample — the policy is stochastic by design. This built-in exploration means we don't need an epsilon schedule.
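A quick standalone sanity check (not part of the training loop, with a made-up `aprob`) shows the difference: sampling visits each action in proportion to its probability, while argmax picks the same action every time.

```python
import numpy as np

rng = np.random.default_rng(0)
aprob = 0.7  # hypothetical P(push right) from the policy network

# Stochastic policy: sample 10,000 actions
actions = (rng.uniform(size=10_000) < aprob).astype(int)
print(actions.mean())        # close to 0.7: both actions get explored

# Value-based selection: argmax is deterministic
q_values = np.array([0.3, 0.7])
print(np.argmax(q_values))   # always 1: no exploration without epsilon
```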
The Policy Gradient Signal
After taking an action, we record the "gradient that encourages the action that was taken":
```python
dlogps.append(action - aprob)
```
If we pushed right (action=1) and aprob=0.7, then dlogp = 1 - 0.7 = 0.3 — a positive gradient that nudges the network to make "push right" even more likely. If we pushed left (action=0) and aprob=0.7, then dlogp = 0 - 0.7 = -0.7 — a negative gradient that decreases the probability of pushing right (making left more likely).
But at this point, we don't know if the action was good. That's what the reward tells us.
The PG Magic Line
This is where policy gradients really happen:
```python
epdlogp *= discounted_epr  # modulate gradient with advantage
```
Every action's gradient gets multiplied by its discounted reward. Good actions (high reward) get their gradients amplified — "do more of this". Bad actions (low or negative reward) get their gradients flipped — "do less of that". This single line is the heart of REINFORCE.
Think of it like a coach watching game film: every play gets a grade. Plays that led to scoring get reinforced; plays that led to turnovers get discouraged. The magnitude of the grade determines how strongly the feedback applies.
Reward Discounting
CartPole gives +1 reward at every timestep the pole stays up. We discount future rewards so earlier actions, which contributed to a longer run, receive more credit:
```python
def discount_rewards(r, gamma):
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
```
With gamma=0.99, an action taken 100 steps before the end still receives $0.99^{100} \approx 0.37$ of the terminal reward. This gives the network a smooth gradient across the episode: early actions in long episodes receive higher discounted returns than the same actions in short episodes, so the network learns to favour strategies that keep the pole up for longer.
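To make the credit assignment concrete, here is `discount_rewards` run on a toy three-step episode (a standalone snippet with the function copied in):

```python
import numpy as np

def discount_rewards(r, gamma):
    discounted_r = np.zeros_like(r)
    running_add = 0.0
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

# Three alive steps, +1 reward each, gamma = 0.99:
returns = discount_rewards(np.array([1.0, 1.0, 1.0]), 0.99)
print(returns)  # [2.9701, 1.99, 1.0]: earlier actions accumulate credit for later survival
```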
Reward Standardisation (Variance Reduction)
After discounting, we standardise the rewards to have zero mean and unit variance:
```python
discounted_epr -= np.mean(discounted_epr)
std = np.std(discounted_epr)
if std > 0:
    discounted_epr /= std
```
This is critical for stable training. Without standardisation, the magnitude of the gradient depends on the absolute reward scale. With it, roughly half the actions get reinforced (above-average) and half get discouraged (below-average). It's a simple form of a baseline — one of the most important variance reduction techniques in policy gradients.
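A tiny standalone demo makes the effect visible: even though every raw return is positive, mean subtraction turns the below-average half negative, so those actions get discouraged.

```python
import numpy as np

# Discounted returns from a toy episode: all positive, as in CartPole
returns = np.array([4.0, 3.0, 2.0, 1.0])

standardised = returns - returns.mean()  # the mean (2.5) acts as the baseline
standardised /= standardised.std()

print(standardised)
# First half positive (reinforced), second half negative (discouraged)
```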
Manual Backpropagation
We compute gradients by hand, just as in the backpropagation post:
```python
def backward(eph, epdlogp, epx, model):
    dW2 = np.dot(eph.T, epdlogp).ravel()  # gradient for output weights
    dh = np.outer(epdlogp, model['W2'])   # backprop to hidden layer
    dh[eph <= 0] = 0                      # backprop through ReLU
    dW1 = np.dot(dh.T, epx)               # gradient for input weights
    return {'W1': dW1, 'W2': dW2}
```
The chain rule flows backwards: output layer gradient → hidden layer gradient (masked by ReLU) → input layer gradient. This is identical to standard backprop, except the "loss" is the policy gradient signal (epdlogp already weighted by advantage).
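You can verify the hand-written gradients numerically. This standalone sketch (random toy data, small layer sizes) compares `backward` against a finite-difference estimate of the objective it differentiates: the sum of logits weighted by `epdlogp`.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 8, 4, 6  # small sizes for a quick check

model = {'W1': rng.standard_normal((H, D)), 'W2': rng.standard_normal(H)}
epx = rng.standard_normal((T, D))      # episode states
epdlogp = rng.standard_normal((T, 1))  # advantage-weighted policy gradient signal

def hidden(x, model):
    h = np.dot(model['W1'], x)
    h[h < 0] = 0
    return h

def objective(model):
    # backward() computes the gradient of this scalar w.r.t. the weights
    return sum(epdlogp[t, 0] * np.dot(model['W2'], hidden(epx[t], model))
               for t in range(T))

def backward(eph, epdlogp, epx, model):
    dW2 = np.dot(eph.T, epdlogp).ravel()
    dh = np.outer(epdlogp, model['W2'])
    dh[eph <= 0] = 0
    dW1 = np.dot(dh.T, epx)
    return {'W1': dW1, 'W2': dW2}

eph = np.vstack([hidden(x, model) for x in epx])
grad = backward(eph, epdlogp, epx, model)

# Numerical gradient for one W1 entry via forward difference
eps = 1e-6
perturbed = {'W1': model['W1'].copy(), 'W2': model['W2'].copy()}
perturbed['W1'][0, 0] += eps
numeric = (objective(perturbed) - objective(model)) / eps
print(abs(numeric - grad['W1'][0, 0]))  # a tiny number: analytic matches numeric
```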
Batch Gradient Accumulation
Rather than updating after every episode, we accumulate gradients over batch_size=5 episodes:
```python
for k in model:
    grad_buffer[k] += grad[k]

if episode_number % batch_size == 0:
    # RMSProp update using accumulated gradients
    ...
```
This reduces gradient variance — each update reflects 5 episodes worth of experience rather than just one. It's the same idea as minibatch gradient descent in supervised learning.
RMSProp from Scratch
The optimiser is RMSProp, implemented manually:
```python
rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + epsilon)
```
RMSProp maintains a running average of squared gradients and divides by their root. This gives adaptive per-parameter learning rates: parameters with consistently large gradients get smaller effective learning rates, and vice versa. The decay_rate=0.99 means the running average looks back roughly 100 updates.
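A toy standalone demo of the adaptive scaling: feed two parameters constant gradients of wildly different magnitude, and after the running average warms up, the effective step sizes converge.

```python
import numpy as np

decay_rate, learning_rate, epsilon = 0.99, 1e-3, 1e-5
g = np.array([100.0, 0.01])  # two parameters, gradients four orders of magnitude apart
cache = np.zeros_like(g)

for _ in range(1000):  # let the running average of g**2 warm up
    cache = decay_rate * cache + (1 - decay_rate) * g**2

step = learning_rate * g / (np.sqrt(cache) + epsilon)
print(step)  # both steps approach learning_rate: RMSProp normalises away the scale
```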
Going Deeper
The Policy Gradient Theorem
The goal of policy gradients is to maximise expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

Where $\tau$ is a trajectory (sequence of states and actions) sampled under policy $\pi_\theta$. The policy gradient theorem (Sutton et al., 1999) tells us how to compute the gradient of this objective:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

Where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from timestep $t$.
The intuition: $\nabla_\theta \log \pi_\theta(a_t | s_t)$ points in the direction that makes action $a_t$ more likely. Multiplying by $G_t$ scales this — if the return was high, push hard in that direction; if low, push the other way.
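You can check this empirically on a hypothetical one-step bandit (a toy problem, not the CartPole code), where $J(\theta)$ has a closed form: with $P(a{=}1) = \sigma(\theta)$ and reward 1 for arm 1 and 0 for arm 0, $J(\theta) = p$ and $\nabla_\theta J = p(1-p)$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))  # P(a = 1) = sigmoid(theta)

actions = (rng.uniform(size=200_000) < p).astype(float)
rewards = actions                    # r(1) = 1, r(0) = 0, so J(theta) = p
score = actions - p                  # grad of log pi w.r.t. theta

estimate = np.mean(rewards * score)  # Monte Carlo REINFORCE estimator
analytic = p * (1 - p)               # exact policy gradient
print(estimate, analytic)            # the two agree closely
```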
The Log-Likelihood Trick
How do we differentiate through sampling? We can't backpropagate through np.random.uniform(). The score function estimator (also called the log-likelihood trick or REINFORCE trick) sidesteps this.
For a Bernoulli policy with probability $p$, the log-likelihood of an action $a \in \{0, 1\}$ is:

$$\log \pi(a) = a \log p + (1 - a) \log(1 - p)$$

But in practice, we use a simpler form. Because $p = \sigma(z)$ for the pre-sigmoid logit $z$, differentiating the log-likelihood with respect to $z$ collapses to $a - p$:

```python
dlogps.append(action - aprob)  # this IS the score function
```

This is `action - aprob` — exactly what the code computes. When `action=1` and `aprob=0.7`, the gradient is 0.3: "make right more likely." This gradient gets backpropagated through the network to update all weights.
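A quick standalone check via central differences confirms that $a - p$ really is the derivative of the Bernoulli log-likelihood with respect to the logit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(z, a):
    # log-likelihood of a Bernoulli action under p = sigmoid(z)
    p = sigmoid(z)
    return a * np.log(p) + (1 - a) * np.log(1 - p)

z, eps = 0.8473, 1e-6  # arbitrary logit value
for a in (0, 1):
    numeric = (log_prob(z + eps, a) - log_prob(z - eps, a)) / (2 * eps)
    analytic = a - sigmoid(z)
    assert abs(numeric - analytic) < 1e-6
print("d(log pi)/d(logit) == action - aprob")
```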
Why Reward Standardisation Reduces Variance
REINFORCE is an unbiased estimator of the policy gradient, but it has high variance. A single trajectory might get lucky or unlucky, leading to noisy gradient estimates.
Subtracting a baseline $b$ from the return doesn't change the expected gradient (it's still unbiased) but can dramatically reduce variance:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, (G_t - b)\right]$$

Our code uses the mean return as the baseline:

```python
discounted_epr -= np.mean(discounted_epr)
```
With this baseline, actions that performed better than average get reinforced, and below-average actions get discouraged. Without it, if all rewards are positive (as in CartPole, where every alive step gives +1), the gradient would reinforce all actions — just some more than others. The baseline makes the signal much cleaner.
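A standalone simulation of a hypothetical one-step setting (all rewards positive, as in CartPole) makes this visible: both estimators have the same expected value, but the baselined one is far less noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.6                                 # P(a = 1) under the current policy
actions = (rng.uniform(size=100_000) < p).astype(float)
rewards = 10.0 + actions                # all positive: 10 for left, 11 for right
score = actions - p                     # grad of log pi w.r.t. the logit

raw = rewards * score                   # no baseline
baselined = (rewards - rewards.mean()) * score

# Same expected gradient, much lower variance
print(raw.mean(), baselined.mean())     # both near the true gradient p*(1-p) = 0.24
print(raw.var(), baselined.var())       # variance drops by orders of magnitude
```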
Hyperparameters
| Parameter | What it controls | Value | Original (v0) | Why changed |
|---|---|---|---|---|
| `H` | Hidden neurons | 100 | 100 | Unchanged — enough for CartPole |
| `learning_rate` | RMSProp step size | 1e-3 | 1e-4 | Longer episodes → more samples per update → can use larger steps |
| `gamma` | Discount factor | 0.99 | 0.95 | Longer episodes (500 vs 200) need longer-horizon credit assignment |
| `batch_size` | Episodes per update | 5 | 5 | Unchanged |
| `decay_rate` | RMSProp memory | 0.99 | 0.99 | Unchanged |
| `epsilon` | RMSProp stability | 1e-5 | 1e-5 | Unchanged |
The original code targeted CartPole-v0 (max 200 steps) and used a custom "blame window" that penalised only the last N=10 actions. This worked well for short episodes but created a performance ceiling on CartPole-v1 (max 500 steps) — as episodes get longer, the blame signal becomes proportionally smaller and gets washed out after standardisation.
The fix is textbook REINFORCE: standard full-horizon discounting with gamma=0.99. The higher gamma ensures actions 100+ steps before the terminal state still receive meaningful credit. The higher learning rate (1e-3) compensates for the longer episodes providing more gradient samples per update.
Learning rate is critical. Too high and the policy oscillates wildly. Too low and learning crawls. With standard discounting and gamma=0.99, lr=1e-3 converges in ~3,000 episodes.
Value-Based vs Policy-Based: When to Use Each
| | DQN (Value-Based) | REINFORCE (Policy-Based) |
|---|---|---|
| Outputs | Q-values for each action | Action probabilities |
| Action selection | Argmax (deterministic) | Sample (stochastic) |
| Exploration | Epsilon-greedy (bolted on) | Built-in (stochastic policy) |
| Action spaces | Discrete only | Discrete and continuous |
| Sample efficiency | Higher (replay buffer) | Lower (on-policy, no replay) |
| Stability | Needs replay + target net | Needs variance reduction |
| Convergence | To optimal Q → optimal policy | Directly to (locally) optimal policy |
Use policy gradients when:
- The action space is continuous (robotics, continuous control)
- You want a stochastic policy (exploration, multi-modal strategies)
- The policy is simpler than the value function
Use DQN when:
- The action space is small and discrete
- Sample efficiency matters (you can't run millions of episodes)
- You have access to a simulator for experience replay
When NOT to Use Vanilla REINFORCE
- High-dimensional action spaces — Variance becomes unmanageable. Use actor-critic methods (A2C, PPO) that learn a value function baseline
- Sample-scarce settings — REINFORCE is on-policy: every trajectory is used once then discarded. Off-policy methods like DQN are far more sample-efficient
- Long episodes — The credit assignment problem worsens. Which of the 10,000 actions caused success? Actor-critic methods handle this better with per-step value estimates
- When stability matters — Vanilla REINFORCE can have large policy swings. PPO clips the gradient to prevent catastrophic updates
Deep Dive: The Papers
Williams (1992) — The REINFORCE Paper
The REINFORCE algorithm was introduced by Ronald J. Williams in his 1992 paper Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. This is one of the foundational papers in policy gradient methods.
Williams framed the problem precisely: given a stochastic network (he called it a "connectionist network") that produces actions probabilistically, how do we adjust the weights to maximise expected reward?
His key result:
"For each such algorithm, convergence to at least a local maximum in expected reinforcement is assured by a simple condition on the sequence of values used to scale the gradient estimate."
The REINFORCE update rule from the paper:

$$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij})\,\frac{\partial \ln g_i}{\partial w_{ij}}$$

Where:
- $w_{ij}$ — weight from unit $j$ to unit $i$
- $\alpha_{ij}$ — learning rate
- $r$ — the reinforcement (reward)
- $b_{ij}$ — a reinforcement baseline (reduces variance without introducing bias)
- $g_i$ — the probability of the output of unit $i$
In our code, this maps directly:
- $\frac{\partial \ln g_i}{\partial w_{ij}}$ → `dlogps` (the score function, computed via backprop)
- $r - b_{ij}$ → `discounted_epr` (standardised, which implicitly subtracts a mean baseline)
- The product `epdlogp *= discounted_epr` is the REINFORCE update
Williams proved that this estimator is unbiased: in expectation, it equals the true policy gradient regardless of the baseline $b$. The baseline only affects variance — choosing it well (e.g., as the mean reward) can dramatically speed convergence.
Sutton, McAllester, Singh & Mansour (1999) — The Policy Gradient Theorem
The policy gradient theorem, proved in Policy Gradient Methods for Reinforcement Learning with Function Approximation, generalised Williams' result to arbitrary function approximators:

$$\nabla_\theta J(\theta) = \sum_s d^{\pi}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)$$

Where $d^{\pi}(s)$ is the stationary state distribution under policy $\pi$. The theorem shows that the policy gradient doesn't depend on the gradient of the state distribution — a surprising and essential result that makes policy gradient methods practical.
Karpathy (2016) — Deep RL: Pong from Pixels
The original code was inspired by Andrej Karpathy's influential blog post Deep Reinforcement Learning: Pong from Pixels. Karpathy demonstrated that a simple two-layer network trained with REINFORCE could learn to play Atari Pong directly from raw pixels — all in ~130 lines of Python.
The key architectural choices we inherit:
- Two-layer network with ReLU hidden layer and sigmoid output
- Manual NumPy backprop — no framework dependency
- RMSProp as the optimiser
- Reward discounting with standardisation
Karpathy's version used the raw Pong pixel difference as input (preprocessed frames). Our CartPole version uses the 4-dimensional state vector directly, but the algorithm is identical.
The REINFORCE Algorithm (Pseudocode)
From Sutton & Barto (2018), Section 13.3:
```
Initialize policy parameters θ arbitrarily
For each episode:
    Generate trajectory τ = (s₀, a₀, r₁, s₁, a₁, ..., s_T) following π_θ
    For each step t = 0, 1, ..., T-1:
        Gₜ ← Σ_{k=t+1}^{T} γ^{k-t-1} rₖ       (return from step t)
        θ ← θ + α γᵗ Gₜ ∇_θ ln π_θ(aₜ|sₜ)     (policy gradient update)
```
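Stripped down to a hypothetical one-step, two-armed bandit (a toy, not the CartPole setup), the pseudocode becomes just a few lines: episodes have length one, so the return is the immediate reward and the update is exactly `theta += alpha * G * grad_log_pi`.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta, alpha = 0.0, 0.5           # single policy parameter, learning rate
for episode in range(2000):
    p = sigmoid(theta)
    a = int(rng.uniform() < p)    # sample action from pi_theta
    G = float(a)                  # return: arm 1 pays 1, arm 0 pays 0
    theta += alpha * G * (a - p)  # theta += alpha * G * grad log pi

print(sigmoid(theta))  # the policy learns to pull arm 1 almost always
```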
Our Implementation vs the Theory
| Theory (Williams, 1992) | Our code |
|---|---|
| $\frac{\partial \ln g_i}{\partial w_{ij}}$ (score function) | `action - aprob` backpropagated through network |
| $r - b_{ij}$ (reward minus baseline) | `discounted_epr` after mean subtraction |
| $\Delta w = \alpha (r-b) \nabla \ln \pi$ | `epdlogp *= discounted_epr` then `backward()` |
| Single-sample update | Batch of 5 episodes |
| Generic optimiser | RMSProp (adaptive learning rates) |
What Came After
REINFORCE spawned the entire family of modern policy gradient methods:
- Actor-Critic (Konda & Tsitsiklis, 2000) — Learn a value function baseline alongside the policy
- A3C (Mnih et al., 2016) — Asynchronous actor-critic with parallel workers
- PPO (Schulman et al., 2017) — Clipped surrogate objective for stable updates; the default algorithm behind ChatGPT's RLHF
- DDPG (Lillicrap et al., 2016) — Continuous-action policy gradients with a deterministic policy
- SAC (Haarnoja et al., 2018) — Entropy-regularised policy gradients for robust exploration
Historical Context
- Williams (1992) — REINFORCE: the first general-purpose policy gradient algorithm
- Sutton et al. (1999) — Policy gradient theorem: proves the gradient formula for function approximators
- Kakade (2001) — Natural policy gradients: use Fisher information for better updates
- Schulman et al. (2015) — TRPO: trust regions for policy optimisation
- Schulman et al. (2017) — PPO: simplified TRPO with clipped objective; now the industry standard
- OpenAI (2017) — RLHF: policy gradients for aligning language models
Further Reading
- Williams (1992) — Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning — the REINFORCE paper
- Sutton et al. (1999) — Policy Gradient Methods for Reinforcement Learning with Function Approximation — the policy gradient theorem
- Karpathy (2016) — Deep Reinforcement Learning: Pong from Pixels — the blog post that inspired the original code
- Sutton & Barto (2018) — Chapter 13 (Policy Gradient Methods) — freely available online
Interactive Tools
- Q-Learning Visualiser — Compare value-based RL with the policy gradient approach covered in this post
Related Posts
- Deep Q-Networks: Experience Replay and Target Networks — Value-based RL with neural networks, the approach we move beyond here
- Q-Learning from Scratch — Tabular value-based RL, the foundation for understanding the value→policy shift
- Backpropagation Demystified — We implement backprop manually here too; this post covers the fundamentals
- Genetic Algorithms — Gradient-free optimisation, an alternative to policy gradients for non-differentiable objectives
Try It Yourself
The interactive notebook includes exercises:
- Restore the blame window — The original code penalised only the last N=10 actions instead of using full-horizon discounting. Reinstate that scheme in place of `discount_rewards`. How does training speed change?
- Vary the blame window — Try $N \in \{5, 10, 20, 50\}$. How does the learning curve respond?
- Remove standardisation — Comment out the mean/std normalisation of rewards. Does the agent still learn?
- Gamma sweep — Try $\gamma \in \{0.8, 0.95, 0.99, 0.999\}$. How does the discount factor affect learning?
- Compare with DQN — Plot the REINFORCE learning curve alongside DQN's on the same axes. Which learns faster? Which is more stable?
REINFORCE showed that you can optimise a policy directly — no Q-values needed. The gradient signal is noisy, but with variance reduction (baselines, reward standardisation, batch updates) it works remarkably well. The same core idea — gradient ascent on expected reward — underpins every modern policy gradient algorithm, from PPO to the RLHF that aligns large language models. The fundamentals haven't changed since Williams (1992); only the variance reduction has gotten better.
Frequently Asked Questions
What is the difference between policy gradient methods and value-based methods like DQN?
Value-based methods learn a value function (such as Q-values) and derive a policy indirectly by choosing the action with the highest value. Policy gradient methods parameterise the policy directly and optimise it via gradient ascent on expected reward. The key advantage of policy gradients is that they naturally handle continuous action spaces and produce stochastic policies with built-in exploration.
Why is reward standardisation important in REINFORCE?
Without standardisation, all rewards in CartPole are positive (+1 per timestep), so the gradient reinforces every action, just some more than others. Subtracting the mean makes roughly half the actions receive positive reinforcement (above average) and half negative (below average). This acts as a simple baseline that dramatically reduces gradient variance and stabilises training.
What does the discount factor gamma control?
Gamma determines how much weight future rewards receive relative to immediate rewards. A value of 0.99 means an action taken 100 steps before the end still receives about 37% of the terminal reward. Higher gamma values encourage long-term planning but increase variance, while lower values make the agent more short-sighted but produce more stable gradients.
Why was RMSProp chosen as the optimiser instead of plain gradient ascent?
RMSProp maintains a running average of squared gradients and adapts the learning rate for each parameter individually. Parameters with consistently large gradients get smaller effective learning rates, preventing them from dominating the update. This adaptive behaviour is especially important in reinforcement learning, where gradient magnitudes can vary wildly across episodes and parameters.
When should I use REINFORCE versus more advanced algorithms like PPO?
Vanilla REINFORCE is best suited for simple environments, educational purposes, and situations where you want full control over the implementation. For complex environments, long episodes, or production systems, PPO and other actor-critic methods are preferred because they use a learned value function baseline that reduces variance and clip the gradient to prevent catastrophic policy updates.
Why does REINFORCE accumulate gradients over multiple episodes before updating?
Accumulating gradients over a batch of episodes (5 in this implementation) reduces the variance of the gradient estimate. A single episode can be noisy due to the stochastic nature of both the policy and the environment. Averaging over multiple episodes produces a smoother, more reliable gradient signal, similar to minibatch gradient descent in supervised learning.