Shrijith Venkatramana

Posted on Jul 4

Adam: The Optimization Algorithm That Made LLMs Practical

#ai #algorithms #llm #machinelearning

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star us to help developers discover the project, give it a try, and share your feedback so we can improve the product.

Without Adam, there's a good chance ChatGPT, Claude, Gemini, Llama, and many of today's large language models would have taken much longer to become reality.

When people talk about the breakthroughs behind modern AI, they usually mention the Transformer, attention, GPUs, or massive datasets.

Rarely does anyone mention the optimizer.

Yet every single gradient update during the training of a modern LLM depends on an optimization algorithm deciding how much every parameter should change. With billions of parameters and trillions of training tokens, that decision becomes one of the most important engineering problems in machine learning.

One optimizer, proposed in 2014, ended up becoming the default choice across much of deep learning:

Adam (Adaptive Moment Estimation).

Let's explore why.

Before Adam: Why Training Neural Networks Was So Difficult

Imagine you're hiking down a mountain in dense fog.

You can only see the slope directly beneath your feet.

The obvious strategy is simple:

Take one step downhill.

This is essentially what gradient descent does.

For a neural network, the gradient tells us which direction reduces the loss.

The update rule is simply:

[
\theta = \theta - \eta \nabla L
]

where

θ = model parameters
η = learning rate
∇L = gradient

Simple.
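To make the update rule concrete, here is a minimal sketch in NumPy. The toy loss, the starting point, and the learning rate of 0.1 are assumptions chosen for illustration, not values from any real training run.

```python
import numpy as np

def gradient_descent_step(theta, grad, lr):
    """One plain gradient descent update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Hypothetical toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
def toy_grad(theta):
    return theta

theta = np.array([4.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, toy_grad(theta), lr=0.1)

# After 100 small downhill steps the loss is very close to its minimum of 0.
final_loss = 0.5 * float(np.sum(theta ** 2))
print(final_loss)
```

On a smooth bowl like this one, every step shrinks the parameters by the same factor, which is why the plain rule works so well here.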

Unfortunately, real neural networks don't resemble smooth mountains.

Instead, they're more like:

steep cliffs
long narrow valleys
flat plateaus
noisy terrain
millions—or billions—of dimensions

The gradient also changes every mini-batch because we only compute it on a small sample of data.

This leads to stochastic gradient descent (SGD).

Instead of walking smoothly downhill, it's like trying to descend while someone randomly shakes the mountain beneath you.

Momentum: Giving Optimization Some Inertia

Researchers realized that humans don't instantly stop and change direction every step.

Instead, we build momentum.

Optimization borrowed the same idea.

Instead of only following today's gradient, momentum also remembers previous gradients.

Imagine pushing a heavy shopping cart.

If you push in roughly the same direction repeatedly, it builds speed.

Small bumps don't immediately change its path.

Mathematically,

[
v_t=\beta v_{t-1}+(1-\beta)g_t
]

where

(g_t) is today's gradient
(v_t) is the accumulated velocity

Now updates become

[
\theta=\theta-\eta v_t
]

This smooths noisy updates and helps escape shallow local irregularities.
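The smoothing effect can be sketched in a few lines of NumPy. The noisy gradient stream below (a steady signal of 1.0 plus random noise) is a made-up stand-in for mini-batch gradients, and β = 0.9 is just a commonly used value.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """Blend the new gradient into the running velocity, then step along it."""
    velocity = beta * velocity + (1.0 - beta) * grad
    theta = theta - lr * velocity
    return theta, velocity

# Hypothetical noisy gradients: a steady signal of 1.0 buried in random noise.
rng = np.random.default_rng(0)
noisy_grads = 1.0 + rng.normal(scale=2.0, size=2000)

theta, velocity = 0.0, 0.0
velocities = []
for g in noisy_grads:
    theta, velocity = momentum_step(theta, velocity, g)
    velocities.append(velocity)

# The velocity fluctuates far less than the raw gradients it is built from.
print(np.std(noisy_grads), np.std(velocities[100:]))
```

The velocity still points in the direction of the underlying signal, but most of the step-to-step jitter has been averaged away.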

Momentum was already a huge improvement.

But another problem remained.

Different Parameters Learn at Different Speeds

Suppose you're training a neural network with 500 million parameters.

Some parameters receive gradients as small as

0.00002

Others receive gradients that are orders of magnitude larger.

Using one global learning rate becomes problematic.

If the learning rate is large enough for tiny gradients...

...the large gradients explode.

If it's safe for the large gradients...

...the tiny gradients barely move.
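A small experiment shows both failure modes. This is only a sketch: the two-parameter loss and its curvature values are hypothetical, picked so that one coordinate sees gradients a million times larger than the other.

```python
import numpy as np

# Hypothetical loss 0.5 * sum(curvature * theta**2): one steep and one nearly flat direction.
curvature = np.array([1000.0, 0.001])

def run_sgd(lr, steps):
    """Plain gradient descent with a single global learning rate."""
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = curvature * theta
        theta = theta - lr * grad
    return theta

# Safe for the steep direction, but the flat direction barely moves.
safe = run_sgd(lr=0.0005, steps=100)

# Large enough to move the flat direction, but the steep one blows up.
aggressive = run_sgd(lr=1.0, steps=5)

print(safe, aggressive)
```

With the small learning rate the second parameter is still almost exactly where it started after 100 steps, while the large learning rate sends the first parameter to astronomically large values within 5 steps.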

It's like paying every employee in a company exactly the same bonus regardless of performance, seniority, or role.

Some people are overpaid.

Others barely notice the reward.

Optimization needs to adapt individually.

This insight led to algorithms like AdaGrad and RMSProp.

Adam combined the best ideas from both.

Adam: Combining Momentum and Adaptive Learning Rates

In 2014, Diederik P. Kingma and Jimmy Ba introduced Adam in the paper:

Adam: A Method for Stochastic Optimization

The idea is beautifully elegant.

Adam maintains two running statistics for every parameter.

First moment

An exponentially decaying average of past gradients.

Think of this as momentum.

[
m_t
]

Second moment

An exponentially decaying average of past squared gradients.

Think of this as measuring how "volatile" or "uncertain" this parameter's updates have been.

[
v_t
]
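For reference, the Adam paper defines both statistics as exponential moving averages of the gradient g_t, with decay rates β1 and β2 (the paper suggests 0.9 and 0.999 as defaults). The square is taken element by element:

[
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
]

[
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
]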

The update becomes approximately

[
\theta = \theta - \eta \frac{m_t}{\sqrt{v_t}+\epsilon}
]

This produces a fascinating behavior.

If a parameter consistently receives huge gradients,

its denominator becomes larger,

shrinking its updates relative to the raw gradient.

If another parameter consistently receives tiny gradients,

its denominator stays small,

allowing relatively larger updates.

Every parameter effectively receives its own personalized learning rate.

A Back-of-the-Envelope Example

Suppose two parameters have identical momentum:

Parameter A

Average gradient = 2
Average squared gradient = 100

Parameter B

Average gradient = 2
Average squared gradient = 4

Ignoring ε and before scaling by the learning rate η,

Parameter A updates by

2 / √100 = 0.2

Parameter B updates by

2 / √4 = 1

Even though both parameters have the same average gradient,

Adam takes a step five times larger for Parameter B because its history of squared gradients is much smaller.

This automatic scaling is one reason Adam trains deep networks so effectively.
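The same arithmetic in Python, as a quick check. The helper name adam_scaled_step is made up for this sketch, and the result still has to be multiplied by the learning rate to get the actual update.

```python
import math

def adam_scaled_step(avg_grad, avg_sq_grad, eps=0.0):
    """Ratio of the first moment to the square root of the second moment."""
    return avg_grad / (math.sqrt(avg_sq_grad) + eps)

step_a = adam_scaled_step(2.0, 100.0)  # Parameter A
step_b = adam_scaled_step(2.0, 4.0)    # Parameter B

print(step_a, step_b, step_b / step_a)  # prints 0.2 1.0 5.0
```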

Why Bias Correction Exists

There's one subtle issue.

At the beginning of training,

both moving averages start at zero.

That means early estimates are biased toward zero.

Kingma and Ba introduced bias correction:

[
\hat m_t = \frac{m_t}{1-\beta_1^t}
]

and

[
\hat v_t = \frac{v_t}{1-\beta_2^t}
]

These corrections rapidly remove the startup bias.

It's a small mathematical trick that has a surprisingly large practical impact during the first optimization steps.
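Putting the pieces together, a single Adam step can be sketched in NumPy as below. The default hyperparameters follow the values suggested in the Adam paper. The toy problem, the learning rate of 0.01, and the function names are assumptions for illustration. This is not how any particular framework implements it.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment (squared gradients)
    m_hat = m / (1.0 - beta1 ** t)              # undo the startup bias toward zero
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical toy loss 0.5 * sum(scales * theta**2): gradient scales differ a million-fold.
scales = np.array([1000.0, 0.001])

def toy_grad(theta):
    return scales * theta

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
first_step = None
for t in range(1, 51):
    new_theta, m, v = adam_step(theta, m, v, toy_grad(theta), t, lr=0.01)
    if t == 1:
        first_step = theta - new_theta  # size of the very first update
    theta = new_theta

print(first_step, theta)
```

Two things are worth noticing. Thanks to bias correction, the very first update is already about one full learning rate in size (0.01) instead of being shrunk toward zero. And both parameters move almost exactly the same distance, even though their raw gradients differ by a factor of a million.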

Why Adam Became So Important for Deep Learning

Consider training GPT-style models.

Modern LLMs easily contain

7 billion parameters
70 billion parameters
over a trillion parameters in some research systems

Every optimization step updates every trainable parameter.

Even a modest training run might execute hundreds of thousands of optimization steps.

That means Adam performs on the order of

billions of parameters × hundreds of thousands of updates

resulting in hundreds of trillions to quadrillions of individual parameter updates over the course of training.

Without stable optimization,

training would often diverge.

Learning would become painfully slow.

GPU time—costing thousands or even millions of dollars—would be wasted.

Optimization isn't merely a mathematical curiosity.

It's an operations and economics problem.

A 10% improvement in convergence speed on a multi-million-dollar training run can translate into hundreds of thousands of dollars in savings, shorter experimentation cycles, and faster scientific progress. Faster convergence also means researchers can iterate on model architectures more quickly, reducing the opportunity cost of long training jobs.

Adam became popular because it usually works well with relatively little hyperparameter tuning.

Researchers could spend less time adjusting learning rates and more time exploring new model architectures.

That practicality accelerated progress across computer vision, speech recognition, recommendation systems, and eventually large language models.

Adam Isn't Perfect

As influential as Adam has been, researchers have also identified limitations.

Some studies found that plain stochastic gradient descent, typically with momentum, can produce models that generalize better on certain vision tasks.

Others observed convergence issues under specific theoretical settings.

As LLMs grew larger, practitioners developed variants such as:

AdamW (decoupled weight decay)
Adafactor (reduced memory footprint)
Lion (sign-based optimization)

In fact, many modern Transformer implementations train with AdamW, which separates weight decay from Adam's adaptive updates and often improves regularization.
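The difference is easiest to see side by side. The sketch below contrasts Adam with an L2 penalty folded into the gradient against an AdamW-style decoupled decay. The function names, the weights, and the hyperparameter values are illustrative assumptions, and the loss gradient is set to zero so that only the decay term acts.

```python
import numpy as np

def _moments(m, v, grad, t, beta1=0.9, beta2=0.999):
    """Update both moving averages and return them with their bias-corrected versions."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    return m, v, m / (1.0 - beta1 ** t), v / (1.0 - beta2 ** t)

def adam_l2_step(theta, m, v, grad, t, lr, wd, eps=1e-8):
    """Adam with L2 regularization: the decay term passes through the adaptive scaling."""
    m, v, m_hat, v_hat = _moments(m, v, grad + wd * theta, t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, m, v, grad, t, lr, wd, eps=1e-8):
    """AdamW-style update: the decay is applied to the weights directly, outside the scaling."""
    m, v, m_hat, v_hat = _moments(m, v, grad, t)
    return theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta), m, v

theta0 = np.array([10.0, 0.1])   # one large weight, one small weight
zeros = np.zeros(2)
no_grad = np.zeros(2)            # zero loss gradient isolates the decay behavior

l2_theta, _, _ = adam_l2_step(theta0, zeros, zeros, no_grad, t=1, lr=0.1, wd=0.1)
w_theta, _, _ = adamw_step(theta0, zeros, zeros, no_grad, t=1, lr=0.1, wd=0.1)

print(l2_theta, w_theta)
```

In the L2 version, the adaptive denominator normalizes the decay away, so both weights shrink by roughly the same absolute amount and the small weight is pushed almost to zero. In the decoupled version, each weight shrinks by the same fraction (1% here), which is the behavior weight decay is usually meant to have.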

Engineering rarely ends with one perfect algorithm.

Instead, progress comes through continual refinement.

The Bigger Lesson

When the Transformer paper appeared in 2017, attention deservedly captured the headlines.

But Transformers alone weren't enough.

Modern deep learning stands on layers of innovations:

better hardware
larger datasets
improved initialization
normalization methods
residual connections
efficient optimizers

Adam is one of those foundational technologies.

It's rarely discussed outside machine learning circles, yet it quietly powers the optimization of billions of parameters every day.

Sometimes the biggest breakthroughs aren't new model architectures.

Sometimes they're simply better ways of taking the next step downhill.

Did Adam fundamentally change deep learning, or was it simply the optimizer that happened to arrive at the right time?

I'd love to hear your thoughts—and if you've trained neural networks yourself, have you ever switched away from Adam and seen better results?

*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.
