Prashant Lakhera
📌 Day 16: 21 Days of Building a Small Language Model: Choosing the Right Optimizer for Your LLM 📌

For years, AdamW has been the default optimizer for training large language models. It’s reliable, well-understood, and works out of the box for almost everything.

But as models scale, optimizer choice starts to matter a lot more, especially for memory and compute.

✅ That’s why Muon is getting attention.

✅ Instead of storing second-moment statistics like AdamW, Muon uses momentum plus orthogonalized updates (via a Newton–Schulz iteration), which makes it:

~50% lighter on optimizer memory, since it keeps a single momentum buffer per weight matrix instead of AdamW's two moment buffers (see the sketch below)
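
Here's a rough, self-contained PyTorch sketch of that idea. It is not the official Muon implementation: the function names (`newton_schulz_orthogonalize`, `muon_style_step`), the quintic iteration coefficients, and the hyperparameters are illustrative assumptions based on how the approach is usually described.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately replace G with the nearest (semi-)orthogonal matrix using a
    # quintic Newton–Schulz iteration. The coefficients below are commonly cited
    # values, treated here as an assumption.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_style_step(W: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
                    lr: float = 0.02, momentum: float = 0.95) -> None:
    # One Muon-style update for a 2D weight matrix: only a single momentum
    # buffer `buf` is stored per matrix, versus AdamW's two moment buffers.
    buf.mul_(momentum).add_(grad)                  # momentum accumulation
    update = newton_schulz_orthogonalize(buf)      # orthogonalize the update
    W.add_(update, alpha=-lr)                      # apply the scaled step

# Toy usage: in real training, `grad` comes from backprop on your model.
W = torch.randn(256, 512)
buf = torch.zeros_like(W)
grad = torch.randn_like(W)
muon_style_step(W, grad, buf)
```

Because the only persistent optimizer state here is `buf`, the optimizer memory is roughly half of AdamW's, which keeps both a first- and a second-moment buffer for every parameter.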

🔗 Blog link: https://www.linkedin.com/pulse/day-16-21-days-building-small-language-model-choosing-lakhera-lj3jc

I’ve covered all the concepts here at a high level to keep things simple. For a deeper exploration of these topics, feel free to check out my book "Building A Small Language Model from Scratch: A Practical Guide."

✅ Gumroad: https://plakhera.gumroad.com/l/BuildingASmallLanguageModelfromScratch

✅ Amazon: https://www.amazon.com/dp/B0G64SQ4F8/

✅ Leanpub: https://leanpub.com/buildingasmalllanguagemodelfromscratch/
