DEV Community

Cover image for Trust Region Methods: From REINFORCE to TRPO to PPO
Berkan Sesen
Berkan Sesen

Posted on • Originally published at sesen.ai

Trust Region Methods: From REINFORCE to TRPO to PPO

In the REINFORCE post, we built a policy gradient agent from scratch in NumPy and watched it learn CartPole. It worked — eventually. But the reward curve looked like a seismograph. One batch of unlucky episodes and the agent seemed to forget everything it had learned. The fix was always the same: run it again and hope for a smoother trajectory.

The problem is not randomness. It is the update rule. Vanilla policy gradients compute a direction to improve the policy, then take a step of whatever size the learning rate dictates. If the step is too large, the policy lands in a region where it collects terrible data, which produces an even worse update, and the whole thing collapses. There are no guardrails.

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) add those guardrails. TRPO constrains each update so the new policy stays close to the old one, measured by KL divergence. PPO achieves a similar effect with a simpler clipped objective that requires no second-order optimization. The result is the same: stable, monotonic improvement instead of the lottery that is vanilla REINFORCE.

By the end of this post, you will implement all three methods from scratch in PyTorch and see the difference on a single chart.

Let's Build It

Click the badge to open the interactive notebook:

Open In Colab

Here is the payoff — all three methods training on CartPole-v1, animated side by side:

Animation showing REINFORCE, TRPO, and PPO reward curves during training on CartPole-v1 — PPO converges fastest and most stably

REINFORCE (red) takes until iteration 79 to reach 400, with visible oscillation along the way. TRPO (blue) gets there by iteration 18 but shows periodic dips — the trust region prevents catastrophic collapse but allows some instability. PPO (green) reaches 400 by iteration 15 and holds steady, the most stable of the three.

All three methods share the same architecture and hyperparameters (from John Schulman's original modular_rl codebase). The only difference is the update rule.

Shared Components

Both the policy and value networks use the same architecture from the original code: two hidden layers of 64 units each with tanh activation.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import gymnasium as gym

class PolicyNetwork(nn.Module):
    """2-layer MLP [64,64] tanh -> softmax."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, act_dim),
            nn.Softmax(dim=-1),
        )
        # Small init on the output Linear layer (before Softmax)
        self.net[-2].weight.data *= 0.1
        self.net[-2].bias.data *= 0.1

    def forward(self, x):
        return self.net(x)

class ValueNetwork(nn.Module):
    """2-layer MLP [64,64] tanh -> linear."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)
Enter fullscreen mode Exit fullscreen mode

Rollout Collection and GAE

Every method collects a batch of experience, then computes advantages using Generalised Advantage Estimation:

def collect_rollouts(env, policy, timesteps_per_batch=10000):
    """Collect rollouts until we have >= timesteps_per_batch timesteps."""
    observations, actions, rewards, dones, log_probs = [], [], [], [], []
    episode_rewards = []
    timesteps = 0

    while timesteps < timesteps_per_batch:
        obs, _ = env.reset()
        episode_reward = 0
        done = False

        while not done:
            obs_t = torch.FloatTensor(obs)
            with torch.no_grad():
                probs = policy(obs_t)
            dist = Categorical(probs)
            action = dist.sample()

            observations.append(obs)
            actions.append(action.item())
            log_probs.append(dist.log_prob(action).item())

            obs, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            rewards.append(reward)
            dones.append(done)
            episode_reward += reward
            timesteps += 1

        episode_rewards.append(episode_reward)

    return {
        'observations': np.array(observations),
        'actions': np.array(actions),
        'rewards': np.array(rewards),
        'dones': np.array(dones),
        'log_probs': np.array(log_probs),
        'episode_rewards': episode_rewards,
    }

def compute_gae(rewards, dones, values, gamma=0.99, lam=1.0):
    """Generalised Advantage Estimation."""
    advantages = np.zeros_like(rewards, dtype=np.float32)
    last_advantage = 0
    last_value = 0

    for t in reversed(range(len(rewards))):
        if dones[t]:
            last_advantage = 0
            last_value = 0
        delta = rewards[t] + gamma * last_value - values[t]
        advantages[t] = last_advantage = delta + gamma * lam * last_advantage
        last_value = values[t]

    returns = advantages + values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
Enter fullscreen mode Exit fullscreen mode

REINFORCE with Baseline

The simplest method. Compute the policy gradient and take a step:

def train_reinforce(n_iters=200, timesteps_per_batch=10000, gamma=0.99,
                    lam=1.0, lr=1e-3):
    env = gym.make('CartPole-v1')
    policy = PolicyNetwork(4, 2)
    value_fn = ValueNetwork(4)
    policy_opt = optim.Adam(policy.parameters(), lr=lr)
    value_opt = optim.Adam(value_fn.parameters(), lr=1e-3)

    all_rewards = []

    for iteration in range(n_iters):
        rollout = collect_rollouts(env, policy, timesteps_per_batch)
        obs_t = torch.FloatTensor(rollout['observations'])
        acts_t = torch.LongTensor(rollout['actions'])

        with torch.no_grad():
            values = value_fn(obs_t).numpy()

        advantages, returns = compute_gae(
            rollout['rewards'], rollout['dones'], values, gamma, lam
        )
        adv_t = torch.FloatTensor(advantages)
        ret_t = torch.FloatTensor(returns)

        # Policy update: vanilla policy gradient
        probs = policy(obs_t)
        dist = Categorical(probs)
        log_probs = dist.log_prob(acts_t)
        policy_loss = -(log_probs * adv_t).mean()

        policy_opt.zero_grad()
        policy_loss.backward()
        policy_opt.step()

        # Value function update
        for _ in range(10):
            v_pred = value_fn(obs_t)
            v_loss = ((v_pred - ret_t) ** 2).mean()
            value_opt.zero_grad()
            v_loss.backward()
            value_opt.step()

        all_rewards.append(np.mean(rollout['episode_rewards']))
    return all_rewards
Enter fullscreen mode Exit fullscreen mode

TRPO

Same objective, but constrained. The conjugate gradient method solves for the natural gradient direction without forming the full Fisher matrix:

def conjugate_gradient(f_Ax, b, cg_iters=10, residual_tol=1e-10):
    """Solve Hx = b using conjugate gradient."""
    p = b.clone()
    r = b.clone()
    x = torch.zeros_like(b)
    rdotr = r.dot(r)

    for _ in range(cg_iters):
        z = f_Ax(p)
        v = rdotr / (p.dot(z) + 1e-8)
        x += v * p
        r -= v * z
        newrdotr = r.dot(r)
        mu = newrdotr / (rdotr + 1e-8)
        p = r + mu * p
        rdotr = newrdotr
        if rdotr < residual_tol:
            break
    return x

def linesearch(f, x, fullstep, expected_improve_rate,
               max_backtracks=10, accept_ratio=0.1):
    """Backtracking line search."""
    fval = f(x)
    for stepfrac in 0.5 ** np.arange(max_backtracks):
        xnew = x + stepfrac * fullstep
        newfval = f(xnew)
        actual_improve = fval - newfval
        expected_improve = expected_improve_rate * stepfrac
        ratio = actual_improve / (expected_improve + 1e-8)
        if ratio > accept_ratio and actual_improve > 0:
            return True, xnew
    return False, x

def flat_params(model):
    return torch.cat([p.data.view(-1) for p in model.parameters()])

def set_flat_params(model, flat):
    offset = 0
    for p in model.parameters():
        size = p.numel()
        p.data.copy_(flat[offset:offset+size].view(p.shape))
        offset += size
Enter fullscreen mode Exit fullscreen mode

The TRPO update loop:

def train_trpo(n_iters=200, timesteps_per_batch=10000, gamma=0.99, lam=1.0,
               max_kl=0.01, cg_damping=1e-3, cg_iters=10):
    env = gym.make('CartPole-v1')
    policy = PolicyNetwork(4, 2)
    value_fn = ValueNetwork(4)
    value_opt = optim.Adam(value_fn.parameters(), lr=1e-3)

    all_rewards = []

    for iteration in range(n_iters):
        rollout = collect_rollouts(env, policy, timesteps_per_batch)
        obs_t = torch.FloatTensor(rollout['observations'])
        acts_t = torch.LongTensor(rollout['actions'])

        with torch.no_grad():
            values = value_fn(obs_t).numpy()

        advantages, returns = compute_gae(
            rollout['rewards'], rollout['dones'], values, gamma, lam
        )
        adv_t = torch.FloatTensor(advantages)
        ret_t = torch.FloatTensor(returns)

        old_params = flat_params(policy).clone()
        with torch.no_grad():
            old_probs = policy(obs_t).detach()

        # Surrogate loss and gradient
        probs = policy(obs_t)
        dist = Categorical(probs)
        log_probs = dist.log_prob(acts_t)
        old_dist = Categorical(old_probs)
        old_log_probs = old_dist.log_prob(acts_t)

        ratio = torch.exp(log_probs - old_log_probs)
        surr_loss = -(ratio * adv_t).mean()

        grads = torch.autograd.grad(surr_loss, policy.parameters())
        pg = torch.cat([g.view(-1) for g in grads]).detach()

        # Fisher-vector product (without forming the full matrix)
        def fvp(v):
            probs_ = policy(obs_t)
            kl = (old_probs * (torch.log(old_probs + 1e-8)
                  - torch.log(probs_ + 1e-8))).sum(dim=-1).mean()
            grads_ = torch.autograd.grad(kl, policy.parameters(),
                                         create_graph=True)
            flat_grad = torch.cat([g.view(-1) for g in grads_])
            kl_v = (flat_grad * v).sum()
            grads2 = torch.autograd.grad(kl_v, policy.parameters())
            return torch.cat([g.view(-1) for g in grads2]).detach() \
                   + cg_damping * v

        # Conjugate gradient to find the natural gradient direction
        stepdir = conjugate_gradient(fvp, -pg, cg_iters=cg_iters)

        # Compute step size from the trust region constraint
        shs = 0.5 * stepdir.dot(fvp(stepdir))
        lm = torch.sqrt(shs / max_kl)
        fullstep = stepdir / lm

        # Line search to ensure actual improvement
        neggdotstepdir = -pg.dot(stepdir)
        def surrogate_loss(params):
            set_flat_params(policy, params)
            with torch.no_grad():
                p = policy(obs_t)
                d = Categorical(p)
                lp = d.log_prob(acts_t)
                r = torch.exp(lp - old_log_probs.detach())
                return -(r * adv_t).mean().item()

        success, new_params = linesearch(
            surrogate_loss, old_params, fullstep, neggdotstepdir / lm
        )
        set_flat_params(policy, new_params)

        # Value function update
        for _ in range(10):
            v_pred = value_fn(obs_t)
            v_loss = ((v_pred - ret_t) ** 2).mean()
            value_opt.zero_grad()
            v_loss.backward()
            value_opt.step()

        all_rewards.append(np.mean(rollout['episode_rewards']))
    return all_rewards
Enter fullscreen mode Exit fullscreen mode

PPO-Clip

The simplest of the three to implement. Replace the constrained optimization with a clipped surrogate:

def train_ppo(n_iters=200, timesteps_per_batch=10000, gamma=0.99, lam=1.0,
              clip_epsilon=0.2, epochs=10, lr=1e-3):
    env = gym.make('CartPole-v1')
    policy = PolicyNetwork(4, 2)
    value_fn = ValueNetwork(4)
    policy_opt = optim.Adam(policy.parameters(), lr=lr)
    value_opt = optim.Adam(value_fn.parameters(), lr=1e-3)

    all_rewards = []

    for iteration in range(n_iters):
        rollout = collect_rollouts(env, policy, timesteps_per_batch)
        obs_t = torch.FloatTensor(rollout['observations'])
        acts_t = torch.LongTensor(rollout['actions'])

        with torch.no_grad():
            values = value_fn(obs_t).numpy()
            old_probs = policy(obs_t).detach()
            old_dist = Categorical(old_probs)
            old_log_probs = old_dist.log_prob(acts_t).detach()

        advantages, returns = compute_gae(
            rollout['rewards'], rollout['dones'], values, gamma, lam
        )
        adv_t = torch.FloatTensor(advantages)
        ret_t = torch.FloatTensor(returns)

        # Multiple epochs of minibatch updates
        batch_size = 128
        n_samples = len(obs_t)

        for epoch in range(epochs):
            indices = np.random.permutation(n_samples)
            for start in range(0, n_samples, batch_size):
                end = start + batch_size
                idx = indices[start:end]

                mb_obs = obs_t[idx]
                mb_acts = acts_t[idx]
                mb_adv = adv_t[idx]
                mb_old_lp = old_log_probs[idx]
                mb_ret = ret_t[idx]

                # Clipped surrogate objective
                probs = policy(mb_obs)
                dist = Categorical(probs)
                log_probs = dist.log_prob(mb_acts)

                ratio = torch.exp(log_probs - mb_old_lp)
                surr1 = ratio * mb_adv
                surr2 = torch.clamp(ratio, 1 - clip_epsilon,
                                    1 + clip_epsilon) * mb_adv
                policy_loss = -torch.min(surr1, surr2).mean()

                policy_opt.zero_grad()
                policy_loss.backward()
                policy_opt.step()

                # Value loss
                v_pred = value_fn(mb_obs)
                v_loss = ((v_pred - mb_ret) ** 2).mean()
                value_opt.zero_grad()
                v_loss.backward()
                value_opt.step()

        all_rewards.append(np.mean(rollout['episode_rewards']))
    return all_rewards
Enter fullscreen mode Exit fullscreen mode

Run All Three

reinforce_rewards = train_reinforce(n_iters=200)
trpo_rewards = train_trpo(n_iters=200)
ppo_rewards = train_ppo(n_iters=200)
Enter fullscreen mode Exit fullscreen mode

The notebook extends each function to also return per-iteration KL divergence for the diagnostic plots below.

Reward curves for REINFORCE, TRPO, and PPO on CartPole-v1 — REINFORCE oscillates, TRPO climbs steadily with some dips, PPO converges fastest and holds

PPO reaches 400 by iteration 15, TRPO by iteration 18, and REINFORCE takes until iteration 79. More importantly, look at the variance: REINFORCE oscillates throughout training while PPO holds near 500 once it gets there.

What Just Happened?

The Policy Gradient

All three methods optimise the same objective: find parameters $\theta$ that maximise expected cumulative reward. The policy gradient theorem gives us the direction:

equation

where $\hat{A}_t$ is the advantage — how much better action $a$ was compared to the average. REINFORCE follows this gradient directly with a fixed learning rate. The problem is that this gradient says nothing about how far to step.

Why Vanilla Steps Are Dangerous

A small change in parameter space can cause a large change in the policy's behaviour. If one gradient step happens to land in a region where the policy collects bad data, the next gradient will push even harder in a bad direction. This positive feedback loop is why REINFORCE curves oscillate: the agent is one unlucky batch away from forgetting everything.

GAE: Reducing Variance

Before fixing the update rule, we fix the advantage estimate. Generalised Advantage Estimation (GAE) computes:

equation

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. With $\lambda = 1$ (the setting in the original code), GAE reduces to the full Monte Carlo return minus the baseline — the highest-variance, lowest-bias option. With smaller $\lambda$, it biases towards the value function estimate but reduces variance. The original modular_rl code defaults to $\lambda = 1.0$.

TRPO: Constraining the Update

TRPO keeps the same objective but adds a constraint:

equation

The ratio $r_t(\theta) = \pi_\theta(a|s) / \pi_{\theta_\text{old}}(a|s)$ is the importance sampling correction. The constraint $D_\text{KL} \leq \delta$ (with $\delta = 0.01$ in the original code) ensures the new policy does not stray too far from the old one.

Solving this constrained problem requires the natural gradient, which accounts for the curvature of the policy distribution. The conjugate gradient algorithm solves the system $Hx = g$ (where $H$ is the Fisher information matrix and $g$ is the policy gradient) without ever forming $H$ explicitly. Instead, it only needs the Fisher-vector product $Hv$, which can be computed cheaply with two backward passes.

The fvp function in our code does exactly this:

  1. Compute the KL divergence between old and new policies
  2. Take the gradient of KL w.r.t. parameters → this is one row of the Fisher matrix
  3. Dot with vector $v$ and differentiate again → this gives $Hv$

After CG finds the direction, a line search with backtracking (halving the step up to 10 times) confirms the step actually improves the surrogate.

PPO-Clip: The Shortcut

PPO replaces the constrained optimization with a clipped surrogate:

equation

The clip with $\epsilon = 0.2$ means the ratio is clamped to $[0.8, 1.2]$. If the ratio tries to go beyond that range, the gradient is zero — the update stops. This is a first-order method: no Fisher matrix, no conjugate gradient, no line search. Just Adam with minibatches.

The PPO clipped objective showing how clipping prevents the ratio from deviating too far, for both positive and negative advantages

The left panel shows positive advantage ($A > 0$): the policy should increase this action's probability, but clipping caps the benefit at $r = 1.2$. The right panel shows negative advantage ($A < 0$): the policy should decrease the action's probability, but clipping prevents removing it entirely past $r = 0.8$. In both cases, the gradient becomes zero outside the clipped region, preventing overly aggressive updates.

Going Deeper

The Instability Problem, Visualised

The key insight is in the KL divergence per update. Here is what each method actually does to the policy:

KL divergence per update for all three methods — REINFORCE is unbounded, TRPO hovers near 0.01, PPO stays below 0.02

REINFORCE (left) has unbounded KL divergence — each update can change the policy by an arbitrary amount, though in practice the KL stays small because the learning rate is small. TRPO (centre) explicitly constrains KL to near the target of 0.01 via the trust region. PPO (right) implicitly constrains KL through the clipping mechanism, typically staying below 0.02.

The Surrogate Objective Landscape

To understand why these constraints matter, consider the surrogate loss as a function of the policy ratio:

Surrogate loss landscape showing REINFORCE's unbounded linear objective, TRPO's constrained trust region, and PPO's clipped flat regions

REINFORCE follows the linear surrogate without bound — the objective keeps improving as the ratio increases, tempting the optimiser to take huge steps. TRPO restricts optimisation to a small trust region (blue zone) around the current policy. PPO's clipped objective flattens outside $[0.8, 1.2]$, achieving a similar effect without explicit constraint enforcement.

Why CG + Line Search Is Expensive

TRPO's main drawback is computational cost. Each update requires:

  1. Computing the policy gradient — one backward pass
  2. 10 CG iterations — each requires a Fisher-vector product (two backward passes)
  3. Line search — up to 10 forward passes to verify improvement

That is roughly 30 backward passes per update, compared to PPO's single backward pass per minibatch. On CartPole this difference is seconds, but on Atari or robotics tasks with large networks, it becomes prohibitive.

PPO's Two Variants

The original PPO paper describes two approaches:

PPO-Clip (what we implement): Clips the ratio directly. Simple, stable, requires no extra hyperparameters beyond $\epsilon$.

PPO-Penalty (from the original modular_rl code): Adds a KL penalty to the loss with an adaptive coefficient. If the KL exceeds 1.3× the target, the penalty coefficient increases by 50%. If it drops below 0.7× the target, the coefficient decreases. This is what ppo.py in the original code implements. In practice, PPO-Clip performs equally well and is simpler, which is why it became the standard.

Hyperparameter Sensitivity

The hyperparameters that matter most, all from the original modular_rl codebase:

Parameter Value Source Sensitivity
gamma 0.99 core.py Low — standard for most environments
GAE lambda 1.0 core.py Medium — lower values reduce variance but add bias
Hidden layers [64, 64] agentzoo.py Low — sufficient for CartPole, scale up for harder tasks
Activation tanh agentzoo.py Low — ReLU works similarly
TRPO max_kl 0.01 trpo.py High — too large causes instability, too small slows learning
TRPO cg_damping 1e-3 trpo.py Medium — prevents CG from diverging
PPO clip epsilon 0.2 Schulman (2017) Medium — 0.1-0.3 all work, 0.2 is standard
PPO epochs 10 ppo.py Medium — more epochs = more data reuse but risk of overfitting
Learning rate 1e-3 ppo.py High — too large causes instability across all methods
Timesteps per batch 10000 core.py Medium — larger batches reduce variance but slow wall-clock time

When PPO Fails

PPO is remarkably robust, but it is not a silver bullet:

  • Very sparse rewards: If the agent almost never receives reward, all three methods struggle. PPO's advantage over REINFORCE shrinks when there is little signal to exploit.
  • High-dimensional continuous actions: The clipping mechanism was designed for the ratio $r_t(\theta)$, which can behave differently in high-dimensional action spaces. Trust region violations can occur along dimensions the clip does not monitor.
  • Non-stationary environments: PPO assumes the environment is stationary between batches. If the environment changes faster than the batch collection, the trust region is protecting against the wrong baseline.

Method Comparison

REINFORCE TRPO PPO
Update rule Gradient descent Natural gradient + line search Clipped SGD
KL constraint None Explicit (hard) Implicit (soft)
Second-order info No Yes (Fisher-vector product) No
Minibatch updates No (full batch) No (full batch) Yes
Implementation complexity Low High Low
Iterations to 400 (CartPole) ~80 ~18 ~15
Stability Low High Highest
Wall-clock per iteration Fast Slow (CG + line search) Fast
Used in practice Rarely Rarely Widely (RLHF, robotics, games)

Where This Comes From

The Lineage

The trust region idea has a clear intellectual lineage:

Williams (1992) introduced REINFORCE — the first general-purpose policy gradient algorithm. It proved that you could optimise a stochastic policy by sampling, without needing a model of the environment. The price was high variance.

Kakade (2001) introduced the natural policy gradient, showing that accounting for the Fisher information geometry of the policy space (rather than the Euclidean geometry of parameter space) gave more stable updates. This is the theoretical ancestor of TRPO.

Schulman et al. (2015), Trust Region Policy Optimization, formalised this into a practical algorithm. The key contribution was proving a monotonic improvement guarantee: if you constrain the KL divergence, the new policy is guaranteed to be at least as good as the old one (in expectation). The conjugate gradient + line search machinery made this computationally feasible.

Schulman et al. (2016), High-Dimensional Continuous Control Using Generalized Advantage Estimation, introduced GAE — the variance reduction technique all three methods in this post use. The $\lambda$ parameter interpolates between high-variance Monte Carlo returns and low-variance (but biased) TD estimates.

Schulman et al. (2017), Proximal Policy Optimization Algorithms, replaced the constrained optimization with a clipped objective. The paper showed that this simpler approach matched or exceeded TRPO's performance across Atari, MuJoCo, and other benchmarks, with none of the second-order computation overhead.

PPO in the Wild

PPO's simplicity and stability made it the default choice for modern RL applications:

  • RLHF in large language models: ChatGPT and similar systems use PPO (or close variants) to fine-tune language models from human feedback. The clipped objective prevents the model from deviating too far from the base model — exactly the same trust region intuition applied to language.
  • Robotics: OpenAI's Dactyl used PPO to train a robotic hand to manipulate objects. The stability of PPO was critical — a single catastrophic policy update could damage physical hardware.
  • Game AI: Dota 2's OpenAI Five and many game-playing agents use PPO variants because they need to train for billions of timesteps without diverging.

The entire journey from REINFORCE to PPO spans 25 years, and each step addressed a concrete failure mode of the previous method. REINFORCE worked but oscillated. The natural gradient was stable but expensive. TRPO was practical but complex. PPO kept the stability and dropped the complexity. That is why it won.

Try It Yourself

  1. Run the notebook: Open the Colab notebook and train all three methods. Watch the reward curves and compare convergence.
  2. Try LunarLander-v3: Replace CartPole-v1 with LunarLander-v3 (8-dimensional observation, 4 discrete actions). PPO should still converge reliably; REINFORCE will struggle more.
  3. Tune the clip epsilon: Try $\epsilon = 0.1$ and $\epsilon = 0.3$. Smaller values make PPO more conservative (slower but more stable); larger values allow bigger updates.
  4. Implement the adaptive KL penalty: The original ppo.py uses a KL penalty instead of clipping. Add a kl_coeff that increases by 50% when $D_\text{KL} > 1.3 \times \text{target}$ and decreases by 33% when $D_\text{KL} < 0.7 \times \text{target}$.
  5. Compare wall-clock time: Time each method per iteration. TRPO should be noticeably slower due to the CG + line search overhead.

Related Posts

Frequently Asked Questions

Why is vanilla REINFORCE unstable compared to TRPO and PPO?

REINFORCE computes a gradient direction but has no mechanism to control how far each update moves the policy. A single batch of unlucky episodes can produce a large gradient that overshoots into a bad region, which then generates even worse data, creating a destructive feedback loop. TRPO and PPO add guardrails: TRPO explicitly constrains the KL divergence between old and new policies, while PPO clips the probability ratio to prevent large changes.

What is the difference between TRPO and PPO?

Both methods prevent the policy from changing too much in a single update, but they achieve this differently. TRPO uses a hard KL divergence constraint enforced through second-order optimisation (conjugate gradient + line search), making it computationally expensive. PPO uses a clipped surrogate objective that zeroes out the gradient when the policy ratio moves outside a fixed range, achieving a similar effect with simple first-order gradient descent.

Why did PPO become the standard algorithm over TRPO?

PPO matches or exceeds TRPO's performance while being dramatically simpler to implement and much cheaper to run. TRPO requires roughly 30 backward passes per update (for conjugate gradient and line search), while PPO needs just one backward pass per minibatch. PPO also supports minibatch updates, making it more sample-efficient. This combination of simplicity, stability, and performance made it the default choice for applications like RLHF, robotics, and game AI.

What is Generalised Advantage Estimation (GAE)?

GAE is a variance reduction technique that computes a weighted average of multi-step TD errors. The lambda parameter controls the bias-variance tradeoff: lambda = 1.0 gives the full Monte Carlo return minus the baseline (high variance, no bias), while smaller values lean more heavily on the value function estimate (lower variance, some bias). All three methods in this post use GAE to compute the advantage estimates that drive their policy updates.

What does the clip epsilon parameter control in PPO?

The clip epsilon (typically 0.2) defines how far the new policy can deviate from the old one. The probability ratio between new and old policies is clamped to the range [1 - epsilon, 1 + epsilon]. Outside this range, the gradient becomes zero, preventing the optimiser from making overly aggressive updates. Values of 0.1 to 0.3 all work well in practice, with 0.2 being the standard default.

Can PPO handle continuous action spaces?

Yes. PPO works with both discrete and continuous action spaces. For continuous actions, the policy network outputs the mean and standard deviation of a Gaussian distribution instead of categorical probabilities. The clipped objective and trust region intuition remain the same. PPO has been successfully applied to continuous control tasks in robotics and physics simulation environments like MuJoCo.

Top comments (0)