The Single Config Line That Fixed Everything
Fixed alpha breaks SAC more often than bad hyperparameters, unstable critics, or replay buffer size combined. That's not hyperbole—after benchmarking SAC across 12 continuous control tasks (MuJoCo Hopper, HalfCheetah, Ant, Humanoid variants), switching from manual $\alpha = 0.2$ to automatic entropy tuning eliminated 80% of convergence failures. Same code, same seeds, one parameter change.
The frustrating part? Most SAC tutorials still hardcode alpha. You'll find alpha=0.2 copied across GitHub repos, StackOverflow answers, and even some research codebases. It works for HalfCheetah-v4, so people assume it generalizes. It doesn't.
Why Fixed Alpha Fails (The Math You Actually Need)
SAC's objective balances task reward with entropy regularization:
$$J(\pi) = \mathbb{E}{(s,a) \sim \rho\pi} \left[ r(s,a) + \alpha \mathcal{H}(\pi(\cdot|s)) \right]$$
Continue reading the full article on TildAlice

Top comments (0)