CleanRL Beats Stable Baselines3 by 2.3x — But There's a Catch
I spent a week training PPO agents on the same MuJoCo tasks using CleanRL and Stable Baselines3. CleanRL finished Hopper-v4 in 18 minutes. Stable Baselines3 took 42 minutes.
Same hyperparameters. Same hardware (RTX 3090). Same total timesteps (1M).
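To make the headline number concrete, here is a small sketch that turns the wall-clock times above into a speedup ratio and a throughput figure. The function names are made up for illustration; only the 18-minute, 42-minute, and 1M-timestep values come from the runs described here.

```python
def speedup(slow_minutes: float, fast_minutes: float) -> float:
    """Ratio of the slower wall-clock time to the faster one."""
    return slow_minutes / fast_minutes


def steps_per_second(timesteps: int, minutes: float) -> float:
    """Average environment timesteps processed per second of wall-clock time."""
    return timesteps / (minutes * 60)


if __name__ == "__main__":
    total_timesteps = 1_000_000
    cleanrl_minutes = 18   # Hopper-v4 run reported above
    sb3_minutes = 42       # Hopper-v4 run reported above

    print(f"speedup: {speedup(sb3_minutes, cleanrl_minutes):.2f}x")                  # speedup: 2.33x
    print(f"CleanRL: {steps_per_second(total_timesteps, cleanrl_minutes):.0f} steps/s")  # CleanRL: 926 steps/s
    print(f"SB3:     {steps_per_second(total_timesteps, sb3_minutes):.0f} steps/s")      # SB3:     397 steps/s
```

Throughput in steps per second is often the easier number to compare across machines, since it doesn't depend on the total timestep budget.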
The speed gap surprised me — both frameworks implement the exact same PPO algorithm from Schulman et al. (2017). But when I dug into the profiling results, the bottleneck wasn't where I expected. It wasn't vectorized environment overhead or PyTorch compilation. It was something stupidly simple.
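If you want to run the same kind of investigation yourself, Python's built-in `cProfile` is a reasonable starting point. The sketch below profiles a toy training loop, not either library: `env_step`, `policy_update`, and `train` are hypothetical stand-ins, and this is not necessarily the profiling setup used for the results in this post.

```python
import cProfile
import io
import pstats


def env_step(state: float) -> float:
    """Hypothetical stand-in for one environment step."""
    return state * 0.5 + 1.0


def policy_update(state: float) -> float:
    """Hypothetical stand-in for the per-step learner work."""
    return sum(state * i for i in range(100))


def train(num_steps: int) -> float:
    """Toy training loop that alternates env steps and updates."""
    state, total = 0.0, 0.0
    for _ in range(num_steps):
        state = env_step(state)
        total += policy_update(state)
    return total


def profile_training(num_steps: int = 1_000) -> str:
    """Profile train() and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    train(num_steps)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)  # show at most 10 entries
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_training())
```

Sorting by cumulative time puts the functions that dominate the loop at the top, which is usually where a surprising bottleneck shows up first.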
Why CleanRL Is Faster: Single-File Architecture
CleanRL's entire PPO implementation lives in one 300-line Python file. No abstraction layers. No callback hooks. No automatic TensorBoard logging unless you ask for it.
Stable Baselines3 (SB3) wraps everything in a BaseAlgorithm class with 15+ method calls per training step. Every one of those calls costs time. Spread over 1 million timesteps, the 24-minute gap between the two runs works out to about 1.4 ms of extra overhead per timestep. That's the function call tax adding up.
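Per-call overhead is easy to measure on your own machine. The sketch below is a toy micro-benchmark built on `timeit`, not code from either library: `flat_step`, `LayeredAgent`, and its hook methods are hypothetical stand-ins for "inline code" versus "a class with hooks around every step". Real numbers depend on your hardware and Python version, so no expected output is shown.

```python
import timeit


def flat_step(x: float) -> float:
    """Single-file style: the update is written inline."""
    return x * 0.99 + 0.01


class LayeredAgent:
    """Hypothetical class with hook methods around each step (not SB3 code)."""

    def _on_step_start(self) -> None:
        pass  # placeholder hook

    def _on_step_end(self) -> None:
        pass  # placeholder hook

    def _update(self, x: float) -> float:
        return x * 0.99 + 0.01

    def step(self, x: float) -> float:
        # Same arithmetic as flat_step, reached through three extra method calls.
        self._on_step_start()
        x = self._update(x)
        self._on_step_end()
        return x


def seconds_per_call(fn, n: int = 100_000) -> float:
    """Average wall-clock seconds for one call of fn(1.0)."""
    return timeit.timeit(lambda: fn(1.0), number=n) / n


if __name__ == "__main__":
    agent = LayeredAgent()
    flat = seconds_per_call(flat_step)
    layered = seconds_per_call(agent.step)
    print(f"flat:    {flat * 1e6:.3f} us per call")
    print(f"layered: {layered * 1e6:.3f} us per call")
    print(f"extra:   {(layered - flat) * 1e6:.3f} us per call")
```

Multiplying the measured extra time per call by the number of calls per step and by the total timesteps gives a rough estimate of how much of a wall-clock gap the call layers alone could explain.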
