Understanding the Design of Optimizers with me

Tri Vo — Mon, 03 Nov 2025 04:16:30 +0000

Ok, it's midnight during Halloween, and I'm talking about Optimizers. Such a thrill lol. And the goal today is to teach you how AdamW is calculated mathematically, and what the intention is behind its design.

So, what is an optimizer in the context of LLMs? First of all, this name is rather deceptive to me. It's actually less of an object and more of a verb / methodology / concept.

An optimizer is a WAY to update the parameters in our LLMs during training. Specifically, it's the rule for updating each weight value in, for example, our attention matrices or the linear layers of our MLP blocks, using the gradients computed by backprop.

In the simplest form, using gradients obtained with the "chain rule", we have this classic optimizer called Stochastic Gradient Descent (SGD):

θ_{t} = θ_{t - 1} - η \nabla_{θ} L (θ_{t - 1})

, where $η$ is the learning rate and $\nabla_{θ} L (θ_{t - 1})$ is the gradient (derivative) of our loss function with respect to $θ$, evaluated at the current weights $θ_{t - 1}$.

One question I have here is "Why Stochastic?". None of the calculations in the above formula gives a different result each time we evaluate it. True. The answer doesn't lie in the formula, but is cleverly hidden in how LLM training is implemented. Because training an LLM requires tons of inputs, calculating the gradient over all of them (so-called batch gradient descent) would be terribly expensive. Therefore, we only sample a few inputs to calculate our loss, which is so-called mini-batch gradient descent. And this sampling is what introduces stochasticity into our calculation.
(FYI: the full-batch update is also sometimes called Deterministic Gradient Descent (DGD), but this name is less popular imo.)
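
To make the "stochastic" part concrete, here is a minimal sketch in plain Python. The toy problem (fitting y ≈ θx), the function name `minibatch_sgd`, and all hyperparameter values are assumptions made up for illustration, not taken from any real training setup. The only randomness is in which inputs land in the mini-batch.

```python
import random

def minibatch_sgd(data, theta=0.0, lr=0.05, batch_size=4, steps=200, seed=0):
    """Fit y ~ theta * x by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    for _ in range(steps):
        # Sampling a few inputs is the only source of stochasticity.
        batch = rng.sample(data, batch_size)
        # dL/dtheta for L = mean((theta * x - y)^2) over the mini-batch
        grad = sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)
        # theta_t = theta_{t-1} - lr * grad
        theta = theta - lr * grad
    return theta

# Toy data generated from y = 3x, so the best theta is 3.
data = [(x / 10, 3 * x / 10) for x in range(1, 21)]
theta_hat = minibatch_sgd(data)
```

Changing `seed` changes the path the weight takes, even though the update formula itself stays exactly the same.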

Remember, our goal is to learn AdamW, and this design is very simple and still far from AdamW's. So the question is, how do we improve the updates? Would always greedily (and blindly) following the best direction at the immediate location be our best option? Here is where the fun part begins.

Considering the above image from a global perspective, it would make more sense if the blue point headed straight for the local minimum, but it actually oscillates a lot before converging. You can think of this like building a road downhill. It's easier for a car to follow longer zigzag curves than to drive straight down the steepest slope, even though straight down looks obviously faster. In the context of a loss landscape, the locally steepest reduction doesn't guarantee the best direction toward the lower contours, as the landscape can have small spikes that throw the path off those "easier options". Therefore, new "routing methods" were introduced to tackle this purely local exploration behavior.

What if we keep a history of previous steps, and choose the next direction based on the tendency of past decisions? In physics, this concept is called momentum. This "internal force" will carry you past your intended next location based on your past movements - just like the "momentum" when you jump off a moving bus. With this extra movement, you can EXPLORE a better place in your contour map - it's like escaping from your destined death! Or just taking a longer route. Here, we record the effect of past gradients in $v_{t}$ as follows:

v_{t} = β v_{t - 1} + (1 - β) \nabla_{θ} L (θ)

θ_{t} = θ_{t - 1} - η v_{t}
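
Here is a small sketch of those two lines in plain Python. The function name `momentum_step`, the hyperparameter values, and the sign-flipping gradients are assumptions chosen only to illustrate the idea. When the gradient keeps flipping direction, the averaged $v_{t}$ stays much smaller than any single gradient, so the zigzag largely cancels out.

```python
def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * grad; theta_t = theta_{t-1} - lr * v_t."""
    v = beta * v + (1 - beta) * grad
    theta = theta - lr * v
    return theta, v

# Feed gradients that flip sign every step (an oscillating direction).
theta, v = 0.0, 0.0
velocities = []
for t in range(6):
    grad = 1.0 if t % 2 == 0 else -1.0
    theta, v = momentum_step(theta, v, grad)
    velocities.append(v)
```

Every raw gradient here has magnitude 1, while every entry of `velocities` stays at or below roughly 0.1 in magnitude.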

It's great that we add more spices to our previous vanilla SGD optimizer. What other spices can we use for this vanilla recipe? With the goal to EXPLORE other directions, we can weigh each direction by how often (and how strongly) it has shown up in past gradients. This is the idea behind AdaGrad:

v_{t} = v_{t - 1} + (\nabla_{θ} L (θ))^{2}

θ_{t} = θ_{t - 1} - \frac{η}{\sqrt{v_{t}} + ϵ} \nabla_{θ} L (θ)

AdaGrad accumulates the squared gradients over time and divides the current gradient by the square root of that running sum (an L2-norm-like term computed per parameter). In simple words, parameters whose gradients have been large get their steps scaled down harder, and vice versa. And $ϵ$ is a very small number, around $10^{- 8}$, here to make sure we do not divide by 0.

This strategy is helpful for sparse data where some specific signals show up only rarely, as we can give them more presence. Likewise, it reduces the impact of dominant directions. In the end, this is very helpful for smoothing the gradient signal.
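
A minimal sketch of the sparse-signal effect, in plain Python. The function name `adagrad_step` and the toy gradients are assumptions for illustration only. Parameter 0 receives a gradient at every step, while parameter 1 receives one only at the very end, and the rare parameter ends up taking the larger step.

```python
import math

def adagrad_step(theta, v, grad, lr=0.1, eps=1e-8):
    """Per-parameter AdaGrad update on plain Python lists."""
    # Accumulate squared gradients, one running sum per parameter.
    v = [vi + g * g for vi, g in zip(v, grad)]
    # Divide each gradient by the square root of its own running sum.
    theta = [p - lr * g / (math.sqrt(vi) + eps) for p, g, vi in zip(theta, grad, v)]
    return theta, v

theta, v = [0.0, 0.0], [0.0, 0.0]
# Parameter 0 gets a gradient every step; parameter 1 only on the last step.
grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
steps = []
for grad in grads:
    new_theta, v = adagrad_step(theta, v, grad)
    steps.append([abs(a - b) for a, b in zip(new_theta, theta)])
    theta = new_theta
```

On the last step both parameters see the same gradient of 1.0, but the frequent one moves by about 0.05 and the rare one by about 0.1.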

However, summation is a simple yet naive approach for this normalization. As we accumulate the squared gradients over time, especially over very long training runs, the sum only ever grows. If we don't introduce any "restriction" on it, the denominator eventually becomes so large that our updates collapse to zero. Root Mean Square Propagation (RMSProp) was invented for this problem. It is the same as AdaGrad, but it introduces some "weightings" for the accumulated term.

v_{t} = β v_{t - 1} + (1 - β) (\nabla_{θ} L (θ))^{2}

θ_{t} = θ_{t - 1} - \frac{η}{\sqrt{v_{t}} + ϵ} \nabla_{θ} L (θ)

, where $β$ is between 0 and 1. In this accumulation, older gradients have less impact on the current normalization term because:

v_{t} = (1 - β) (\nabla_{θ_{t}} L (θ_{t}))^{2} + β v_{t - 1}

= (1 - β) (\nabla_{θ_{t}} L (θ_{t}))^{2} + β [(1 - β) (\nabla_{θ_{t - 1}} L (θ_{t - 1}))^{2} + β v_{t - 2}]

= (1 - β) (\nabla_{θ_{t}} L (θ_{t}))^{2} + β (1 - β) (\nabla_{θ_{t - 1}} L (θ_{t - 1}))^{2} + β^{2} v_{t - 2}

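After each step, every older gradient gets multiplied by $β$ once more, so its contribution shrinks exponentially because $β < 1$. Tuning this hyperparameter keeps $v_{t}$ from growing without bound, so the effective step size neither vanishes over a long run (as in AdaGrad) nor stays distorted forever by a few unusually large gradients!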
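
To see the difference between the two accumulators, here is a tiny sketch in plain Python. The helper names and the constant gradient of 2.0 are assumptions picked for illustration. The AdaGrad-style sum keeps growing with the number of steps, while the RMSProp-style weighted average levels off near the squared gradient.

```python
def adagrad_accumulator(grads):
    """Plain running sum of squared gradients."""
    v = 0.0
    for g in grads:
        v = v + g * g
    return v

def rmsprop_accumulator(grads, beta=0.9):
    """Exponentially weighted average of squared gradients."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g * g
    return v

grads = [2.0] * 1000
v_sum = adagrad_accumulator(grads)  # grows linearly with the number of steps
v_ema = rmsprop_accumulator(grads)  # levels off near g ** 2
```

After 1000 steps the sum is 4000 and still climbing, while the weighted average sits at about 4.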

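Ok, we have explored 2 approaches to adaptively steer the guiding wheel - Momentum and gradient smoothing (RMSProp). What if we combine both ideas? This is where we get the Adam optimizer. In other words, we replace the plain $\nabla_{θ} L (θ)$ updates in RMSProp with the Momentum-inspired updates. Writing $g_{t} = \nabla_{θ} L (θ_{t - 1})$, the Adam formulas from the original paper are:

m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}

v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2}

\hat{m}_{t} = \frac{m_{t}}{1 - β_{1}^{t}}

\hat{v}_{t} = \frac{v_{t}}{1 - β_{2}^{t}}

θ_{t} = θ_{t - 1} - \frac{η}{\sqrt{\hat{v}_{t}} + ϵ} \hat{m}_{t}

, where $\hat{v}_{t}$ is from RMSProp and $\hat{m}_{t}$ is from Momentum.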

Okay, finally we have reached the focus point of this discussion.

Now, the first two lines make sense to us since they come straight out of the previous approaches. However, where do the next two lines come from?

First, let's expand the $m_{t}$ formula to analyze how the values change over time.
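
As a sketch, writing $g_{t}$ for the gradient at step $t$ (a shorthand assumed here, following the Adam paper), unrolling the recursion a couple of times gives:

```latex
\begin{aligned}
m_{t} &= β_{1} m_{t-1} + (1 - β_{1}) g_{t} \\
      &= β_{1}^{2} m_{t-2} + (1 - β_{1}) (g_{t} + β_{1} g_{t-1}) \\
      &= β_{1}^{3} m_{t-3} + (1 - β_{1}) (g_{t} + β_{1} g_{t-1} + β_{1}^{2} g_{t-2})
\end{aligned}
```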

We can reach this general formula if we keep expanding. And since $m_{0}$ is usually initialized as 0, we can shorten the final formula.
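
Written out with $g_{i}$ as the gradient at step $i$ (again an assumed shorthand), the general formula and its shortened version with $m_{0} = 0$ would look like this:

```latex
m_{t} = β_{1}^{t} m_{0} + (1 - β_{1}) \sum_{i=1}^{t} β_{1}^{t-i} g_{i}
      = (1 - β_{1}) \sum_{i=1}^{t} β_{1}^{t-i} g_{i}
```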

Then take the expectation of both sides. Since the expectation of a sum equals the sum of the expectations, we can mathematically deduce the following equation.
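
With $g_{i}$ as the gradient at step $i$ and $m_{0} = 0$ (the same assumptions as in the Adam paper's derivation), that step can be sketched as:

```latex
\mathbb{E}[m_{t}] = \mathbb{E}\left[ (1 - β_{1}) \sum_{i=1}^{t} β_{1}^{t-i} g_{i} \right]
                  = (1 - β_{1}) \sum_{i=1}^{t} β_{1}^{t-i} \, \mathbb{E}[g_{i}]
```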

The gradient distribution here could be either stationary or non-stationary, which tells us whether the expected gradient stays put or drifts across training steps. And since the weights in the summation follow a geometric series, we have our final formula as:
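
A sketch of that last step, mirroring the form used in the Adam paper (where the same argument is written out for the second moment), with $ζ$ collecting whatever is left over when $\mathbb{E}[g_{i}]$ differs from $\mathbb{E}[g_{t}]$:

```latex
\begin{aligned}
\mathbb{E}[m_{t}] &= \mathbb{E}[g_{t}] \, (1 - β_{1}) \sum_{i=1}^{t} β_{1}^{t-i} + ζ \\
                  &= \mathbb{E}[g_{t}] \, (1 - β_{1}^{t}) + ζ
\end{aligned}
```

The second line uses the geometric series $\sum_{i=1}^{t} β_{1}^{t-i} = \frac{1 - β_{1}^{t}}{1 - β_{1}}$.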

If the distribution is stationary, the expected value of the gradient is approximately constant, termed $μ$, across all training steps, and $ζ$ is 0.
If it is non-stationary, $ζ$ is not 0, to account for the difference from the stationary case. However, $ζ$ can still remain small because the decay rate $β_{1}$ is chosen such that the contributions of gradients far in the past are exponentially downweighted to insignificance. In the original Adam paper, the default $β_{1}$ is fairly large (0.9); hence, each new gradient enters with weight (1 - 0.9) = 0.1, and a gradient from k steps back is further scaled by $0.9^{k}$. Over time, the effect of non-stationarity, or the value of $ζ$, is too small to speak of. And this gives us the final estimate of the expected $m_{t}$: a scaled mean of the gradients, $(1 - β_{1}^{t}) μ$.

This analysis shows one huge issue with the raw moving average in Adam's update. When t is small (t = 1), $μ$ is scaled by $1 - 0.9^{1} = 0.1$. When t gets bigger (t = 20), $μ$ is scaled by $1 - 0.9^{20} ≈ 0.88$. In other words, the magnitude of $m_{t}$ heavily depends on the time step, and it is biased toward zero early in training. This is why the authors of Adam correct $m_{t}$ with what we see in the third line of the Adam formulas - a division by $1 - β_{1}^{t}$:

\hat{m}_{t} = \frac{m_{t}}{1 - β_{1}^{t}}
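
The scaling numbers above are easy to check with a few lines of plain Python. This is only a sketch for the stationary case: the function name `ema_with_bias_correction` and the constant gradient `mu = 2.0` are assumptions chosen for illustration.

```python
def ema_with_bias_correction(grads, beta1=0.9):
    """Return (raw m_t, corrected m_t / (1 - beta1 ** t)) for every step t."""
    m = 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        history.append((m, m / (1 - beta1 ** t)))
    return history

mu = 2.0  # constant gradient, i.e. the stationary case
history = ema_with_bias_correction([mu] * 20)
raw_first, corrected_first = history[0]   # step t = 1
raw_last, corrected_last = history[-1]    # step t = 20
```

The raw average starts at about 0.1 × μ and only reaches about 0.88 × μ by step 20, while the corrected value stays at μ from the very first step.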

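We are done with the difficult part. Now, moving from Adam to AdamW is very easy to digest. The "W" in AdamW actually means "Weight Decay". The idea here is to shrink every weight slightly toward zero at each step so that the weights do not grow too large, which acts as a regularizer and makes our updates more stable. We implement that idea by adding a small multiple of the current weights to the update, kept separate ("decoupled") from the gradient-based Adam term:

θ_{t} = θ_{t - 1} - η (\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + ϵ} + λ θ_{t - 1})

, where $λ$ is the weight decay coefficient. Reusing the current weights like this is a very favoured pattern in different fields. In form, it loosely resembles the residual path in convolutional networks or the transformer architecture, even though its purpose here is regularization. And as we move to more advanced optimizers, this "trick" starts to appear everywhere, such as in the Lion and the Sophia optimizers.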
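
Putting all the pieces together, here is a minimal sketch of one AdamW step for a single scalar weight in plain Python. The function name `adamw_step` and the default hyperparameter values are assumptions chosen for illustration; real libraries apply the same idea to whole tensors and may differ in details.

```python
import math

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # Momentum-style average
    v = beta2 * v + (1 - beta2) * grad * grad   # RMSProp-style average
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    adam_term = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay: added next to the Adam term, not into the gradient.
    theta = theta - lr * (adam_term + weight_decay * theta)
    return theta, m, v

# With a zero gradient, only the weight decay acts and pulls theta toward 0.
theta_decay_only, _, _ = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1)

# With a nonzero gradient at t = 1, the Adam term is close to sign(grad).
theta_after, _, _ = adamw_step(1.0, 0.0, 0.0, grad=0.5, t=1)
```

In the first call the weight only shrinks by `lr * weight_decay * theta`. In the second call the bias-corrected Adam term is roughly 1, so the step is about `lr` plus that same small decay.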

It's the end of the blog, and I'll leave you guys to go now. What follows is more personal and less educational in every sense.

Cheers if you've stayed this far. Here, I'll express myself a bit and explain the purpose of this blog. I'm in a research community with some of the very top and sharpest minds from all over the world. I'm in a team of 4 led by an MS/PhD from NUS, and our topic is to compare the AdamW and Muon optimizers in larger-scale post-training stages.

Benchmarking these optimizers is easy in the sense that the optimizers are the deepest component of the transformer architecture. Naturally, you have more control over the influencing factors and less noise in the results when benchmarking. Therefore, the results would be more reliable, and we could reason about them without major assumptions.
However, benchmarking optimizers is also very difficult due to the depth of the required knowledge. These are the first building blocks of our architecture. A small adjustment of the foundation can cause the whole building to collapse. Therefore, I feel like I need to deeply understand the mathematics behind these optimizers to be able to explain, for example, how a specific behavior arises from adding one more variable to the formula. Moreover, due to its nature, we would need to measure both the model performance and the hardware performance to ensure a fair comparison. Working at such a low level causes me some issues with the engineering parts.
Therefore, I want to learn and share about this topic. Both to prove that it's not too terrible to dive deep into maths, and to help myself keep up with the team's workload. I really want to contribute as much as I can and be a firm foundation for my future self.

Hope I'll get myself to learn, write, and share about more interesting topics in the near future. But these are what we have gone through today:

SGD
Momentum
AdaGrad
RMSProp
Adam / AdamW

References:
[ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION] https://arxiv.org/pdf/1412.6980
[Understanding Deep Learning Optimizers: Momentum, AdaGrad, RMSProp & Adam] https://towardsdatascience.com/understanding-deep-learning-optimizers-momentum-adagrad-rmsprop-adam-e311e377e9c2/
[ML@Purdue] https://www.instagram.com/mlpurdue/

Peer pressure - a lost college dude trying to do LLM research amid battlefield.

Tri Vo — Tue, 23 Sep 2025 05:13:01 +0000

I just finished my 5-hour straight research session, and it's currently 12:51AM when I'm typing these words.

So for background info, I'm a junior-year Data Science major who loves to learn more and have thoughtful conversations about LLMs. The great world of ideas and innovations in the LLM research community has also fascinated me. And wading through the new NeurIPS 2025 submissions has intrigued me as ever - and I wish this awe could last as long as possible... But reality hits. And it makes sense that reality hit me like that.

I'm just an unemployed college dude, burdened by a huge expectation to get myself an internship next year, and with no great accomplishments (or anchors) in hand that I'm proud of showing people. My close friend just got a plane ticket to Chicago for the next interview round of his internship for next year. A person I don't like shared tons of stories showing off their friends earning internship offers from big companies. A dude from my talent bootcamp posted his remake of the homework on LinkedIn with tons of interaction - and I know that dude really did it and improved the infra to host GPT-oss. Another dude on LinkedIn - he is from another team of my Algoverse mentor's, and currently a sophomore in Purdue ECE, I think - announced that their paper has just been accepted to a NeurIPS workshop; this announcement was reposted by Kevin Zhu. So firstly, he is younger than me, I don't like my mentor, and my team is just shit; I'm sorry, but I just feel inferior. And now, I've buried myself in this "dream" or maybe "nightmare" because my research proposal was rejected previously, and I was asked to "work better" on another one. The last 5 hours have both intrigued me and given me frustration, or even sadness.

Then just to add fuel to the fire, my application status seems to have stopped short. I want to achieve what gives me the inspiration, but that "inspiration" only accepts MS or PhD students, and people who have great publications and are properly taught.

Always. It's always like this. After spending effort, I still think I've tried hard enough ... devoured 5-6 papers? why not 10. Idk. My style is that I want to know more, and keep having more in my reading list. But this is going nowhere. I'm planning to go ask CS577's professor for any ideas - but I think I should have an idea first (to exchange for his ideas, I usually think). And the funny part is that, I'm not even in his class kkk.

If my life is an RL environment and I'm the Agent, the current states are: 3 days til the exam on Real-time Analysis I haven't studied for, 4 hours of sleep, 5 days til the deadline of the RL Assignment for the NTI Talent bootcamp, and 6 days til the deadline for the next research proposal (Mon).

Wish me luck, and I hope to survive the next 7 days.
T.V.

Well, hope it’s my new beginning

Tri Vo — Sat, 20 Sep 2025 20:14:22 +0000

Hi, my full name in the mother tongue is Vo Quang Tri. But from now, I’ll address myself as Tri Vo.

I’ve been holding this thought for a while. But today, I decided to write blogs about interesting things in ML/AI/DL. There are many times I get goosebumps reading some articles or LinkedIn posts, and I really want to share the feeling with someone, with a community that shares the same interests — The people who would give me a “STFUU” when hearing that news and dive deep into the source code with me.

I kept waiting for the right circle but couldn’t find any.

So now I’m here. Just randomly passing my findings and learnings on the Internet, looking for them to reach the correct people. Someone who needs it.

Now, the beginning of a new me starts.

Best regards, Tri Vo
