Why Your First RL Algorithm Choice Costs You 10x Compute
Pick the wrong algorithm for continuous control and you'll burn through cloud credits before seeing a working policy. I've watched DQN (with a discretized action space, since it can't handle continuous actions natively) struggle on HalfCheetah for 48 hours while SAC converged in 4. The advice online is generic: "PPO is stable, SAC is sample-efficient, DQN is simple." But what does that actually mean when you're staring at a flat reward curve at 3am?
This benchmark measures wall-clock training time and sample efficiency across three MuJoCo continuous control tasks. Same hardware (M1 MacBook Pro, 16GB RAM), same total timesteps (1M), same network architecture where applicable. The goal: find out which algorithm gets you to a working policy fastest.
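Concretely, the measurement loop is just a wall-clock timer around each training run. A minimal sketch (the `dummy_train` stub and its names are hypothetical placeholders — a real run would call a library trainer, e.g. Stable-Baselines3's `model.learn()`):

```python
import time

TOTAL_TIMESTEPS = 1_000_000  # identical budget for every algorithm

def benchmark(train_fn, env_id, total_timesteps=TOTAL_TIMESTEPS):
    """Return wall-clock seconds for one training run."""
    start = time.perf_counter()
    train_fn(env_id, total_timesteps)
    return time.perf_counter() - start

def dummy_train(env_id, total_timesteps):
    # Stand-in trainer; a real run would build the env and train here,
    # e.g. SAC("MlpPolicy", env).learn(total_timesteps=total_timesteps)
    pass

elapsed = benchmark(dummy_train, "HalfCheetah-v4")
```

Holding the timestep budget, hardware, and network architecture fixed means the timer measures the algorithm, not the setup.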
The Setup: Leveling the Playing Field
I used Gymnasium 0.29.1 with MuJoCo 2.3.7 on three environments:
- HalfCheetah-v4: Run forward as fast as possible (6-dim action space)
- Hopper-v4: One-legged robot staying upright (3-dim action space)
- Ant-v4: Four-legged walker (8-dim action space)
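As a sanity check, a few lines can verify those action-space dimensions against the live environments (this assumes `gymnasium` with the MuJoCo extras installed; the sketch skips the live check gracefully when it isn't):

```python
# Expected action-space dimensions, per the Gymnasium MuJoCo env docs
EXPECTED_ACTION_DIMS = {"HalfCheetah-v4": 6, "Hopper-v4": 3, "Ant-v4": 8}

try:
    import gymnasium as gym
    for env_id, dim in EXPECTED_ACTION_DIMS.items():
        env = gym.make(env_id)
        # action_space is a Box; its shape is a 1-tuple of the action dim
        assert env.action_space.shape == (dim,), env_id
        env.close()
except ImportError:
    pass  # gymnasium/MuJoCo not installed; skip the live check
```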