TildAlice

Posted on • Originally published at tildalice.io

DQN vs PPO vs SAC: MuJoCo Training Speed Benchmarks

Why Your First RL Algorithm Choice Costs You 10x Compute

Pick the wrong algorithm for continuous control and you'll burn through cloud credits before seeing a working policy. I've watched DQN (which only handles discrete actions, so it needs a discretized action space on these tasks) struggle on HalfCheetah for 48 hours while SAC converged in 4. The advice online is generic: "PPO is stable, SAC is sample-efficient, DQN is simple." But what does that actually mean when you're staring at a flat reward curve at 3am?

This benchmark measures wall-clock training time and sample efficiency across three MuJoCo continuous control tasks. Same hardware (M1 MacBook Pro, 16GB RAM), same total timesteps (1M), same network architecture where applicable. The goal: find out which algorithm gets you to a working policy fastest.
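For wall-clock comparisons like this, the measurement itself is simple: wrap each training run in a timer. Here's a minimal sketch of such a harness; the `train_fn` callable and the label names are illustrative (e.g. a Stable-Baselines3 `model.learn(...)` wrapped in a lambda), not the article's actual benchmark code.

```python
import time

def time_training(label, train_fn):
    """Run one training job and report wall-clock seconds.

    `train_fn` is any zero-argument callable that executes the full
    training loop -- for example, lambda: model.learn(total_timesteps=1_000_000)
    with a Stable-Baselines3 model. This is an illustrative sketch,
    not the benchmark script from the article.
    """
    start = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed / 3600:.2f} h wall-clock")
    return elapsed
```

Running each algorithm/environment pair through the same wrapper on the same machine keeps the timing methodology identical across runs.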


The Setup: Leveling the Playing Field

I used Gymnasium 0.29.1 with MuJoCo 2.3.7 on three environments:

  • HalfCheetah-v4: Run forward as fast as possible (6-dim action space)
  • Hopper-v4: One-legged robot staying upright (3-dim action space)
  • Ant-v4: Four-legged walker (8-dim action space)
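A quick sanity check against the action-space shapes is worth doing before burning compute, since a mismatch usually means a version or install problem. Here's a hedged sketch; the dimensions come from the Gymnasium MuJoCo docs, and actually running `check_action_dims()` assumes `gymnasium` with the `mujoco` extra installed.

```python
# Expected action-space dimensionality per task (per the Gymnasium MuJoCo docs)
ACTION_DIMS = {
    "HalfCheetah-v4": 6,
    "Hopper-v4": 3,
    "Ant-v4": 8,
}

def check_action_dims():
    """Verify each environment's action space matches expectations.

    Requires `pip install "gymnasium[mujoco]"`; illustrative sketch,
    not the article's setup script.
    """
    import gymnasium as gym
    for env_id, expected in ACTION_DIMS.items():
        env = gym.make(env_id)
        assert env.action_space.shape == (expected,), env_id
        env.close()
```

All three are `Box` (continuous) action spaces, which is exactly why vanilla DQN is an awkward fit here.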
