
Paperium

Originally published at paperium.net

Decoupled Weight Decay Regularization

Decoupled weight decay — why Adam can finally catch up to SGD

Many models train faster with Adam, yet they often didn't generalize as well on real tasks as models trained with SGD.
The reason was not magic; it was how weight decay was mixed into training.
Implementations usually folded the decay into the loss gradient as L2 regularization. For plain SGD the two are equivalent, but for adaptive methods like Adam they are not: the adaptive rescaling distorts the decay term.
A simple fix is to keep the decay step separate (decoupled) so it doesn't get tangled up with the gradient-based update.
This small change lets models trained with Adam generalize much better and close the gap with SGD on image tasks.
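
To make the coupling concrete, here is a minimal sketch of a single update step under both schemes, using plain tensors; the learning rate, decay factor, and toy gradient are illustrative assumptions, not values from the paper:

```python
import torch

lr, wd = 1e-3, 1e-2     # illustrative learning rate and decay factor
w = torch.randn(10)     # toy parameter vector
grad = torch.randn(10)  # stand-in for the loss gradient dL/dw

# Coupled (classic L2 regularization): the decay term is folded into the
# gradient, so an adaptive method like Adam would rescale it along with
# everything else before taking the step.
w_coupled = w - lr * (grad + wd * w)

# Decoupled weight decay: step on the loss gradient alone, then shrink
# the weights directly, untouched by any adaptive rescaling.
w_decoupled = w - lr * grad - lr * wd * w
```

For plain SGD the two lines compute the same thing; the difference only shows up once the gradient is rescaled per coordinate, as Adam does.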
Many teams added it to popular frameworks, where it ships as the AdamW optimizer, and saw quick wins; some were surprised by how big the difference could be (a short PyTorch example follows below).
If you try it, you may get cleaner results with less fuss.
The idea is simple, easy to add, and helps models behave the way people expect, so if you use Adam, give it a try.
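
As an illustration, here is how the decoupled variant is typically used in PyTorch, where it is available as `torch.optim.AdamW`; the tiny model and hyperparameter values below are placeholders, not recommendations:

```python
import torch

# Placeholder model and hyperparameters, chosen only for illustration.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# One standard training step: AdamW applies the weight decay directly
# to the parameters inside optimizer.step(), separate from the
# gradient-based Adam update.
inputs = torch.randn(32, 128)
loss = model(inputs).sum()  # stand-in for a real loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```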

Read the comprehensive review of the article at Paperium.net:
Decoupled Weight Decay Regularization

🤖 This analysis and review were primarily generated and structured by an AI. The content is provided for informational and quick-review purposes.
