Fixing an Off-By-One Bug in PufferLib's PPO Implementation
The Problem
I was looking through PufferLib issues and found #363, which describes a weird off-by-one bug in the PPO training loop. Turns out the code stores 64 states per segment but only computes 63 complete transitions, so the last sample in every segment ends up with an advantage of zero.
Doesn't sound like a big deal until you realize that's 1 out of every 64 samples getting garbage gradients. For those samples the policy-gradient term contributes nothing, the entropy loss still pushes their actions toward more randomness, and the value function stays conservative instead of actually learning anything.
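To see that in isolation, here's a tiny standalone check (plain PyTorch, not PufferLib code): a sample whose advantage is exactly zero contributes nothing to the clipped surrogate's gradient, so only the entropy and value terms are left tugging on it.

    import torch

    # Per-sample PPO clipped surrogate with advantage = 0: no matter what the
    # probability ratio is, the policy gets no learning signal from this sample.
    advantage = torch.tensor(0.0)
    ratio = torch.tensor(1.3, requires_grad=True)  # pi_new(a|s) / pi_old(a|s)

    clipped = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)
    surrogate = torch.min(ratio * advantage, clipped * advantage)
    surrogate.backward()

    print(ratio.grad)  # tensor(0.) -- this sample is wasted for the policy update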
Digging Into the Code
The bug lives in pufferlib/pufferl.py. Here's what's happening:
All the rollout data (observations, rewards, actions, etc.) gets stored in buffers shaped like (segments, horizon, ...). With the default horizon=64, you'd think you're storing 64 complete transitions.
But you're not.
The buffers store 64 states at indices 0-63. But transitions need a state AND the next state. So:
    rewards[0] = reward from transition 0→1
    rewards[1] = reward from transition 1→2
    ...
    rewards[63] = reward from transition 63→??? (there is no state 64)
That last reward never gets used. And when computing advantages, advantages[:, -1] ends up being 0 because there's no next value to bootstrap from.
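Here's a minimal GAE sketch over buffers shaped (segments, horizon) that shows where that zero comes from. The function and variable names are mine, not the actual pufferl.py code, but the shape problem is the same: there is no values[:, horizon] to bootstrap from, so the loop stops one step short.

    import torch

    def gae_from_stored_states(rewards, values, dones, gamma=0.99, lam=0.95):
        # rewards, values, dones: (segments, horizon) -- only `horizon` states stored
        segments, horizon = rewards.shape
        advantages = torch.zeros_like(rewards)
        last_adv = torch.zeros(segments)
        for t in reversed(range(horizon - 1)):  # can only reach t = horizon - 2
            not_done = 1.0 - dones[:, t]
            delta = rewards[:, t] + gamma * values[:, t + 1] * not_done - values[:, t]
            last_adv = delta + gamma * lam * not_done * last_adv
            advantages[:, t] = last_adv
        # rewards[:, -1] was never read, and advantages[:, -1] is still all zeros.
        return advantages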
The Fix
Two changes needed:
- Buffer sizes (lines 95-108)

    # Before
    self.observations = torch.zeros(segments, horizon, *obs_space.shape, ...)
    self.rewards = torch.zeros(segments, horizon, device=device)
    # ... all other buffers

    # After
    self.observations = torch.zeros(segments, horizon + 1, *obs_space.shape, ...)
    self.rewards = torch.zeros(segments, horizon + 1, device=device)
    # ... all other buffers
Now we store 65 states (0-64), which gives us 64 complete transitions.
- Loop condition (line 300)

    # Before
    if l+1 >= config['bptt_horizon']:

    # After
    if l+1 > config['bptt_horizon']:
Changed >= to > so the loop actually collects that 64th transition instead of stopping early.
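Putting the two changes together, a quick shape check (hypothetical names again, not the real training loop) shows why horizon + 1 stored states gives exactly horizon fully bootstrapped transitions:

    import torch

    segments, horizon = 4, 64
    values = torch.randn(segments, horizon + 1)   # states 0..64
    rewards = torch.randn(segments, horizon)      # transitions 0..63
    dones = torch.zeros(segments, horizon)

    advantages = torch.zeros(segments, horizon)
    last_adv = torch.zeros(segments)
    for t in reversed(range(horizon)):            # every t now has a values[:, t + 1]
        not_done = 1.0 - dones[:, t]
        delta = rewards[:, t] + 0.99 * values[:, t + 1] * not_done - values[:, t]
        last_adv = delta + 0.99 * 0.95 * not_done * last_adv
        advantages[:, t] = last_adv

    print((advantages[:, -1] != 0).all())  # the last slot is a real estimate now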
What I Learned
This bug was subtle. The code mostly worked—63 out of 64 samples trained correctly. But that last sample in every segment was basically wasted compute.
It's a good reminder that RL bugs don't always crash your training. Sometimes they just quietly make things a bit worse, and you never notice unless you're really paying attention to the details.
Links
PR #445: https://github.com/PufferAI/PufferLib/pull/445
Original Issue #363: https://github.com/PufferAI/PufferLib/issues/363
My GitHub: https://github.com/jacobarrio