
TildAlice

Posted on • Originally published at tildalice.io

DQN Overestimation Bias: 3 Double-Q Fixes That Work

The Problem: Your DQN Agent Plateaus at 60% Optimal

You've trained a DQN agent for 2 million steps. The loss curve looks stable. The epsilon has decayed to 0.01. But your agent stubbornly hovers around 60% of the optimal policy's performance, refusing to improve further.

This isn't a hyperparameter issue. It's overestimation bias — and it's baked into the core Q-learning update rule.

The vanilla DQN update uses $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$ to bootstrap future value estimates. That $\max$ operator is the culprit. When Q-values contain estimation noise (and they always do early in training), taking the max systematically picks overestimated values. Your agent starts believing certain actions are better than they actually are, gets stuck exploiting them, and never explores the truly optimal path.
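You can see this effect with a quick simulation. Below is a minimal NumPy sketch (the setup is my own illustration, not from the article): four actions whose *true* Q-values are all exactly 1.0, plus zero-mean Gaussian estimation noise. The true max is 1.0, but averaging $\max_{a'} Q(s', a')$ over many noisy estimates yields a value well above it — the noise never cancels out because the max preferentially selects upward errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# True Q-values for 4 actions at s' -- all equal, so the true max is 1.0.
true_q = np.array([1.0, 1.0, 1.0, 1.0])

# Noisy Q-estimates: true value + zero-mean Gaussian noise, as you'd
# see from an undertrained network. Then take the max over actions,
# exactly as the vanilla DQN bootstrap target does.
noise_std = 0.5
n_trials = 100_000
noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, 4))

avg_max = noisy_q.max(axis=1).mean()
print(f"true max Q:          {true_q.max():.3f}")
print(f"avg of max(noisy Q): {avg_max:.3f}")  # noticeably above 1.0
```

Note the bias grows with both the noise magnitude and the number of actions — which is why it bites hardest early in training and in large action spaces.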

I'll show you three Double-Q variants that fix this, compare them on CartPole and LunarLander, and explain when each one breaks down.
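All three variants build on the same core idea from Double DQN (van Hasselt et al.): decouple action *selection* from action *evaluation* so the two estimates don't share the same noise. Here's a minimal sketch of the standard Double DQN target — the function name and array shapes are my own, not the article's code:

```python
import numpy as np

def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN bootstrap target.

    The online network SELECTS the greedy action at s'; the target
    network EVALUATES it. Because the two networks' errors are largely
    independent, the max no longer systematically picks inflated values.

    reward, done:                shape (batch,)
    q_online_next, q_target_next: shape (batch, n_actions), Q-values at s'
    """
    best_actions = q_online_next.argmax(axis=1)          # selection (online net)
    batch_idx = np.arange(len(best_actions))
    evaluated = q_target_next[batch_idx, best_actions]   # evaluation (target net)
    return reward + gamma * (1.0 - done) * evaluated
```

Contrast this with the vanilla target, which both selects and evaluates with the same (target) network: `reward + gamma * (1 - done) * q_target_next.max(axis=1)`. Same noise in both roles is exactly what produces the overestimation.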

[Image: abstract 3D render visualizing a neural network. Photo by Google DeepMind on Pexels]

Why max Causes Overestimation: The Math


Continue reading the full article on TildAlice
