The Problem: Your DQN Agent Plateaus at 60% Optimal
You've trained a DQN agent for 2 million steps. The loss curve looks stable. The epsilon has decayed to 0.01. But your agent stubbornly hovers around 60% of the optimal policy's performance, refusing to improve further.
This isn't a hyperparameter issue. It's overestimation bias — and it's baked into the core Q-learning update rule.
The vanilla DQN update uses $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$ to bootstrap future value estimates. That $\max$ operator is the culprit. When Q-values contain estimation noise (and they always do early in training), taking the max systematically picks overestimated values. Your agent starts believing certain actions are better than they actually are, gets stuck exploiting them, and never explores the truly optimal path.
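You can see this bias without training anything. The sketch below (a toy simulation I'm using for illustration, not code from any DQN library) sets all true action values to zero, adds Gaussian estimation noise, and shows that the max over noisy estimates is systematically positive even though the true max is zero:

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 5
true_q = np.zeros(n_actions)  # every action is equally good: true max is 0

# Simulate many independent noisy Q-estimates, as if from an undertrained net.
n_trials = 10_000
noisy_q = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# What vanilla DQN bootstraps on: max over the noisy estimates.
max_of_estimates = noisy_q.max(axis=1).mean()
true_max = true_q.max()

print(f"E[max_a Q_hat(s', a)] ~ {max_of_estimates:.3f}")  # systematically > 0
print(f"max_a Q(s', a)        = {true_max:.3f}")
```

The average of the max is well above zero: with noise of standard deviation 1 over five actions, the bias is roughly a full unit of value, and it compounds through bootstrapping because each overestimated target becomes the regression label for the next update.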
I'll show you three Double-Q variants that fix this, compare them on CartPole and LunarLander, and explain when each one breaks down.
Why max Causes Overestimation: The Math