DEV Community

TildAlice

Posted on • Originally published at tildalice.io

CleanRL vs Stable Baselines3: PPO Training 2.3x Faster

CleanRL Beats Stable Baselines3 by 2.3x — But There's a Catch

I spent a week training PPO agents on the same MuJoCo tasks using CleanRL and Stable Baselines3. CleanRL finished Hopper-v4 in 18 minutes. Stable Baselines3 took 42 minutes.

Same hyperparameters. Same hardware (RTX 3090). Same total timesteps (1M).
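
The headline 2.3x follows directly from those two wall-clock times. Here is a quick sketch of the arithmetic, using only the numbers reported in this post (the variable names are mine, chosen for illustration):

```python
# Figures reported in this post: wall-clock minutes for 1M timesteps on Hopper-v4.
cleanrl_minutes = 18
sb3_minutes = 42
total_timesteps = 1_000_000

# Ratio of the two training times.
speedup = sb3_minutes / cleanrl_minutes

# Extra wall-clock time SB3 spent, spread evenly over every timestep.
gap_seconds = (sb3_minutes - cleanrl_minutes) * 60
gap_ms_per_timestep = gap_seconds / total_timesteps * 1000

# Throughput in environment steps per second.
cleanrl_sps = total_timesteps / (cleanrl_minutes * 60)
sb3_sps = total_timesteps / (sb3_minutes * 60)

print(f"speedup: {speedup:.2f}x")
print(f"extra time per timestep: {gap_ms_per_timestep:.2f} ms")
print(f"steps/sec: CleanRL {cleanrl_sps:.0f}, SB3 {sb3_sps:.0f}")
```

So the 24-minute gap averages out to about 1.44 ms of extra time per timestep, or roughly 926 versus 397 steps per second.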

The speed gap surprised me — both frameworks implement the exact same PPO algorithm from Schulman et al. (2017). But when I dug into the profiling results, the bottleneck wasn't where I expected. It wasn't vectorized environment overhead or PyTorch compilation. It was something stupidly simple.

Why CleanRL Is Faster: Single-File Architecture

CleanRL's entire PPO implementation lives in one 300-line Python file. No abstraction layers. No callback hooks. No automatic tensorboard logging unless you ask for it.

Stable Baselines3 (SB3) wraps everything in a BaseAlgorithm class with 15+ method calls per training step. Each call adds 2-5 ms of overhead. Over 1 million timesteps, that function call tax accounts for much of the 24-minute gap (42 minutes versus 18).
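
One way to get a feel for this kind of call overhead is to time the same arithmetic run directly and through a stack of wrapper methods. The sketch below is a toy under stated assumptions: `LayeredAlgo` and its method names are hypothetical stand-ins, not SB3's actual code, and the absolute numbers will vary by machine.

```python
import timeit


def flat_step(x: float) -> float:
    """Do the work inline, as a single-file script would."""
    return x * 0.99 + 0.01


class LayeredAlgo:
    """Hypothetical stand-in for a class-based design; not SB3's actual code."""

    def step(self, x: float) -> float:
        return self._pre_hook(x)

    def _pre_hook(self, x: float) -> float:
        return self._dispatch(x)

    def _dispatch(self, x: float) -> float:
        return self._compute(x)

    def _compute(self, x: float) -> float:
        # Same arithmetic as flat_step, reached through three extra calls.
        return x * 0.99 + 0.01


def per_call_seconds(fn, calls: int = 200_000) -> float:
    """Average wall-clock seconds per call of fn(1.0)."""
    return timeit.timeit(lambda: fn(1.0), number=calls) / calls


algo = LayeredAlgo()
flat = per_call_seconds(flat_step)
layered = per_call_seconds(algo.step)
print(f"flat:    {flat * 1e9:.0f} ns per call")
print(f"layered: {layered * 1e9:.0f} ns per call")
```

This only isolates the bare cost of the extra calls. Whatever work each real method does (logging, buffer copies, tensor conversions) comes on top, and for real training code a profiler such as Python's built-in cProfile gives per-function totals.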


