
TildAlice

Posted on • Originally published at tildalice.io

PPO vs SAC vs TD3: MuJoCo Humanoid Training in 5M Steps

SAC Took 11 Hours to Stand. TD3 Did It in 4.

I trained three state-of-the-art RL algorithms on the Humanoid-v4 environment until they could walk, or at least stumble convincingly. The performance gap was larger than I expected, and the reasons why tell you more about these algorithms than any theoretical explanation.

This is the MuJoCo benchmark everyone references but few people run themselves. The Humanoid task is notorious: a 376-dimensional observation space (joint positions and velocities, center-of-mass inertia and velocity terms, actuator and contact forces), a 17-dimensional continuous action space (torque commands), and a reward function that punishes you for falling over while rewarding forward velocity. It's the perfect stress test for continuous control algorithms because it requires both stability and progress.
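To make the stability-versus-progress tension concrete, here is a minimal sketch of the reward shape in plain Python. It is not the environment's actual code: the class `HumanoidRewardSketch` and its method names are hypothetical, the weights are assumptions modeled on Gymnasium's documented Humanoid defaults, and the contact-cost term is left out for simplicity.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class HumanoidRewardSketch:
    """Simplified, hypothetical model of a Humanoid-style reward.

    All weights below are assumptions modeled on Gymnasium's documented
    defaults; the contact-cost term is intentionally omitted.
    """

    forward_reward_weight: float = 1.25  # assumed weight on forward velocity
    ctrl_cost_weight: float = 0.1        # assumed penalty on squared torques
    healthy_reward: float = 5.0          # assumed per-step bonus for staying upright
    healthy_z_min: float = 1.0           # assumed lower bound on torso height
    healthy_z_max: float = 2.0           # assumed upper bound on torso height

    def is_healthy(self, torso_z: float) -> bool:
        # "Not fallen over" is approximated by torso height staying in a band.
        return self.healthy_z_min < torso_z < self.healthy_z_max

    def step_reward(
        self, x_velocity: float, torso_z: float, action: Sequence[float]
    ) -> Tuple[float, bool]:
        """Return (reward, terminated) for one simulated step."""
        healthy = self.is_healthy(torso_z)
        survive = self.healthy_reward if healthy else 0.0
        forward = self.forward_reward_weight * x_velocity
        ctrl_cost = self.ctrl_cost_weight * sum(a * a for a in action)
        # Falling ends the episode, which cuts off all future survival bonus.
        return survive + forward - ctrl_cost, not healthy


if __name__ == "__main__":
    sketch = HumanoidRewardSketch()
    still = [0.0] * 17  # 17 torque commands, all zero
    print(sketch.step_reward(x_velocity=0.0, torso_z=1.4, action=still))
    print(sketch.step_reward(x_velocity=1.0, torso_z=1.4, action=still))
    print(sketch.step_reward(x_velocity=1.0, torso_z=0.5, action=still))
```

Under these assumed weights, standing still while upright already earns the survival bonus every step, so an agent can collect reward without moving at all. Forward progress adds to that, but any gait that risks a fall puts the whole remaining stream of survival bonus at stake.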

I ran PPO, SAC, and TD3 for 5 million timesteps each on an RTX 3090, logged every metric, and watched them fail in different ways.


The Setup: Same Everything Except the Algorithm


Continue reading the full article on TildAlice
