PPO vs SAC vs TD3: MuJoCo Humanoid Training in 5M Steps

#ppo #sac #td3 #mujoco

SAC Took 11 Hours to Stand. TD3 Did It in 4.

I trained three state-of-the-art RL algorithms on the Humanoid-v4 environment until they could walk, or at least stumble convincingly. The performance gap was larger than I expected, and the reasons behind it tell you more about these algorithms than any theoretical explanation.

This is the MuJoCo benchmark everyone references but few people run themselves. The Humanoid task is notorious: a 376-dimensional observation space (joint positions and velocities, center-of-mass-based body inertias and velocities, actuator and contact forces), a 17-dimensional continuous action space (torque commands), and a reward function that pays for forward velocity and for staying upright, with the episode ending as soon as the humanoid falls over. It's a demanding stress test for continuous control algorithms because it requires both stability and progress.
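To make that reward structure concrete, here is a minimal pure-Python sketch of a Humanoid-style per-step reward. It is a simplification, not the environment's actual code: the names `HumanoidRewardConfig` and `humanoid_reward` are hypothetical, the weights are assumptions modeled on the defaults in Gymnasium's Humanoid documentation, and the contact-cost term is left out entirely.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class HumanoidRewardConfig:
    # Assumed values, modeled on Gymnasium's documented Humanoid defaults.
    forward_reward_weight: float = 1.25
    ctrl_cost_weight: float = 0.1
    healthy_reward: float = 5.0
    healthy_z_min: float = 1.0  # torso height below this counts as fallen
    healthy_z_max: float = 2.0  # torso height above this counts as unhealthy


def humanoid_reward(
    x_velocity: float,
    torso_z: float,
    action: Sequence[float],
    cfg: HumanoidRewardConfig = HumanoidRewardConfig(),
) -> Tuple[float, bool]:
    """Return (reward, terminated) for one step of a simplified Humanoid task."""
    healthy = cfg.healthy_z_min < torso_z < cfg.healthy_z_max

    # Progress term: reward forward velocity of the center of mass.
    forward = cfg.forward_reward_weight * x_velocity
    # Stability term: constant bonus for each step spent upright
    # (simplification: no bonus at all on an unhealthy step).
    alive = cfg.healthy_reward if healthy else 0.0
    # Effort term: quadratic penalty on the torque commands.
    ctrl_cost = cfg.ctrl_cost_weight * sum(a * a for a in action)

    return forward + alive - ctrl_cost, not healthy


# Upright, moving forward at 1 m/s, zero torque on all 17 actuators.
reward, terminated = humanoid_reward(1.0, 1.3, [0.0] * 17)
print(reward, terminated)
```

Under these assumed weights, an agent that only stands still with zero torque collects the upright bonus on every step, which is why "learning to stand" shows up as a distinct plateau before any forward progress appears.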

I ran PPO, SAC, and TD3 for 5 million timesteps each on an RTX 3090, logged every metric, and watched them fail in different ways.