<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harshil Rami</title>
    <description>The latest articles on DEV Community by Harshil Rami (@harshil_rami_8533a7388ef7).</description>
    <link>https://dev.to/harshil_rami_8533a7388ef7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3892935%2Fa5861571-b59d-4821-9c9a-87be902476c2.png</url>
      <title>DEV Community: Harshil Rami</title>
      <link>https://dev.to/harshil_rami_8533a7388ef7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harshil_rami_8533a7388ef7"/>
    <language>en</language>
    <item>
      <title>Blog 2: Momentum-Based Optimizers</title>
      <dc:creator>Harshil Rami</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:03:58 +0000</pubDate>
      <link>https://dev.to/harshil_rami_8533a7388ef7/blog-2-momentum-based-optimizers-2h98</link>
      <guid>https://dev.to/harshil_rami_8533a7388ef7/blog-2-momentum-based-optimizers-2h98</guid>
      <description>&lt;h3&gt;
  
  
  &lt;em&gt;Giving the optimizer a memory — and teaching it to look before it leaps&lt;/em&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;SGD knows where it is. Momentum knows where it's been. Nesterov knows where it's going.&lt;br&gt;
That single sentence is the entire story of this post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The ravine problem, visualized
&lt;/h2&gt;

&lt;p&gt;Let's be concrete about what SGD's zig-zagging actually looks like.&lt;/p&gt;

&lt;p&gt;Suppose your loss surface is an elongated valley — steep walls on the left and right, a gentle slope running toward the minimum far ahead. This is the classic &lt;strong&gt;ravine geometry&lt;/strong&gt;, and it's not an academic toy. It shows up naturally when your features have very different scales, when layers have different learning dynamics, or when you're in the early phases of training a deep network.&lt;/p&gt;

&lt;p&gt;SGD's update on this surface behaves as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Along the &lt;strong&gt;steep axis&lt;/strong&gt; (across the ravine): the gradient is large. SGD takes a big step, overshoots, corrects back, overshoots again. The updates oscillate violently.&lt;/li&gt;
&lt;li&gt;Along the &lt;strong&gt;shallow axis&lt;/strong&gt; (down the ravine, toward the minimum): the gradient is small. SGD takes tiny, tentative steps. Progress is glacial.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a path that looks like a snake moving sideways more than forward. You can shrink the learning rate to tame the oscillations on the steep axis, but that makes the shallow axis even slower. There's no single learning rate that handles both directions well.&lt;/p&gt;

&lt;p&gt;This is the fundamental limitation of plain SGD, and it's what momentum is designed to fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Momentum: giving the optimizer velocity
&lt;/h2&gt;

&lt;p&gt;The core idea behind momentum is borrowed directly from physics. Instead of updating parameters based on the current gradient alone, we maintain a &lt;strong&gt;velocity vector&lt;/strong&gt; &lt;em&gt;v&lt;/em&gt; that accumulates an exponentially decaying sum of past gradients.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vₜ = β · vₜ₋₁ + η · ∇L(θₜ)
θₜ₊₁ = θₜ − vₜ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;β&lt;/em&gt; (beta) is the momentum coefficient — typically 0.9. Some formulations absorb the learning rate differently; the semantics are equivalent.&lt;/p&gt;

&lt;p&gt;Let's unpack what this actually does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the oscillating (steep) axis:&lt;/strong&gt;&lt;br&gt;
Gradients alternate sign — positive, negative, positive, negative. The velocity accumulates these with the decay factor &lt;em&gt;β&lt;/em&gt;. Because they cancel each other out over time, the velocity along this axis stays small. Oscillations are damped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the consistent (shallow) axis:&lt;/strong&gt;&lt;br&gt;
Gradients consistently point in the same direction — always slightly downhill. The velocity accumulates these constructively. Each step adds to the previous. The effective step size grows, and the optimizer accelerates.&lt;/p&gt;

&lt;p&gt;This is the momentum effect: &lt;strong&gt;dampening in oscillating directions, acceleration in consistent ones.&lt;/strong&gt; The optimizer builds up speed where the surface is consistently sloped and brakes naturally where the surface is ambiguous.&lt;/p&gt;
&lt;h3&gt;
  
  
  Effective learning rate under momentum
&lt;/h3&gt;

&lt;p&gt;With &lt;em&gt;β&lt;/em&gt; = 0.9, a gradient that persists in the same direction for many steps produces a velocity roughly &lt;em&gt;1/(1−β) = 10×&lt;/em&gt; the size of a single step &lt;em&gt;η · g&lt;/em&gt;. This is why momentum often requires a lower learning rate than vanilla SGD — the effective step size is larger.&lt;/p&gt;

&lt;p&gt;More precisely, if the gradient is constant at &lt;em&gt;g&lt;/em&gt;, the velocity converges to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;v* = η · g / (1 − β)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So momentum scales up the effective learning rate by &lt;em&gt;1/(1−β)&lt;/em&gt;. Set &lt;em&gt;β = 0.9&lt;/em&gt; → 10× amplification. Set &lt;em&gt;β = 0.99&lt;/em&gt; → 100×. This amplification is the source of both momentum's power and its instability if misconfigured.&lt;/p&gt;

&lt;h3&gt;
  
  
  The ball analogy
&lt;/h3&gt;

&lt;p&gt;Momentum is often described as a ball rolling down a hill. The ball doesn't instantly respond to every local slope — it carries inertia. A small bump doesn't stop it; it takes a sustained uphill slope to decelerate it meaningfully.&lt;/p&gt;

&lt;p&gt;This analogy is accurate and useful, and it correctly predicts a real failure mode: the optimizer can overshoot and roll up the other side of a valley. If &lt;em&gt;β&lt;/em&gt; is too large, the optimizer can oscillate around minima rather than settling into them, or sail through a narrow good basin entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nesterov Accelerated Gradient: look before you leap
&lt;/h2&gt;

&lt;p&gt;Momentum is good. Nesterov Accelerated Gradient (NAG), proposed by Yurii Nesterov in 1983, makes one surgical improvement that turns out to matter significantly in practice.&lt;/p&gt;

&lt;p&gt;The problem with standard momentum: &lt;strong&gt;the gradient is evaluated at the current position, before applying the velocity.&lt;/strong&gt; By the time you apply the update, you're no longer at that position — you've already moved. You're using stale directional information.&lt;/p&gt;

&lt;p&gt;NAG fixes this with a simple conceptual shift: &lt;strong&gt;evaluate the gradient at the position you're about to arrive at&lt;/strong&gt;, then correct from there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ_lookahead = θₜ − β · vₜ₋₁          # project forward
vₜ = β · vₜ₋₁ + η · ∇L(θ_lookahead)  # gradient at projected position
θₜ₊₁ = θₜ − vₜ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The momentum step projects you forward to where you'll be &lt;em&gt;before&lt;/em&gt; the gradient correction. Then you evaluate the gradient there. This means the correction accounts for the momentum-driven position, not the pre-momentum position.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this helps
&lt;/h3&gt;

&lt;p&gt;Think of it this way. Standard momentum is like running toward a wall and only noticing the wall &lt;em&gt;after&lt;/em&gt; you've taken your full step. Nesterov is like looking ahead as you run and starting to slow down &lt;em&gt;before&lt;/em&gt; you hit the wall.&lt;/p&gt;

&lt;p&gt;In regions where the momentum is carrying you toward a steep uphill, NAG detects that uphill slope earlier and applies a corrective force sooner. The update is more anticipatory than reactive.&lt;/p&gt;

&lt;p&gt;In practice, NAG converges faster than standard momentum on convex problems — Nesterov's original theoretical analysis showed an &lt;em&gt;O(1/k²)&lt;/em&gt; convergence rate versus gradient descent's &lt;em&gt;O(1/k)&lt;/em&gt;, a meaningful gap. For non-convex deep learning loss surfaces, the improvement is empirical rather than provably guaranteed, but it's consistently observed.&lt;/p&gt;

&lt;h3&gt;
  
  
  NAG in the equivalent update form
&lt;/h3&gt;

&lt;p&gt;The two-equation NAG formulation above has an equivalent single-equation form that's more commonly implemented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vₜ = β · vₜ₋₁ + ∇L(θₜ − β · vₜ₋₁)
θₜ₊₁ = θₜ − η · vₜ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two forms are equivalent up to how the learning rate is factored into the velocity; the second makes it clearer that the essential change from standard momentum is &lt;em&gt;where the gradient is evaluated&lt;/em&gt;.&lt;/p&gt;
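
&lt;p&gt;A quick numerical check on the 1-D quadratic &lt;em&gt;L(θ) = θ²/2&lt;/em&gt;, so that &lt;em&gt;∇L(θ) = θ&lt;/em&gt;: once the learning rate is folded into the second form's lookahead (that placement is my bookkeeping choice for the comparison), the two updates trace identical trajectories:&lt;/p&gt;

```python
# NAG on L(theta) = theta**2 / 2, so grad(theta) = theta.
import math

eta, beta, steps = 0.1, 0.9, 50

theta1, v1 = 5.0, 0.0  # form 1: learning rate inside the velocity
theta2, v2 = 5.0, 0.0  # form 2: velocity is form 1's divided by eta,
                       # so the lookahead distance eta*beta*v2 equals beta*v1

for _ in range(steps):
    v1 = beta * v1 + eta * (theta1 - beta * v1)   # grad at lookahead position
    theta1 = theta1 - v1
    v2 = beta * v2 + (theta2 - eta * beta * v2)   # grad at lookahead position
    theta2 = theta2 - eta * v2

print(theta1, theta2)  # identical trajectories
assert math.isclose(theta1, theta2, abs_tol=1e-9)
```
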

&lt;h2&gt;
  
  
  Side-by-side comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Gradient Descent&lt;/th&gt;
&lt;th&gt;SGD&lt;/th&gt;
&lt;th&gt;Momentum&lt;/th&gt;
&lt;th&gt;NAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gradient source&lt;/td&gt;
&lt;td&gt;Full dataset&lt;/td&gt;
&lt;td&gt;Single sample&lt;/td&gt;
&lt;td&gt;Mini-batch&lt;/td&gt;
&lt;td&gt;Mini-batch (at projected pos.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Velocity &lt;em&gt;vₜ&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Velocity &lt;em&gt;vₜ&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oscillation handling&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Dampens via averaging&lt;/td&gt;
&lt;td&gt;Dampens + anticipates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convergence rate (convex)&lt;/td&gt;
&lt;td&gt;&lt;em&gt;O(1/k)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;O(1/√k)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;O(1/k)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;O(1/k²)&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical &lt;em&gt;β&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;0.9–0.99&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The convergence rate column is worth reading carefully. SGD's &lt;em&gt;O(1/√k)&lt;/em&gt; is actually &lt;em&gt;worse&lt;/em&gt; than GD's &lt;em&gt;O(1/k)&lt;/em&gt; — the variance of stochastic gradients costs you a square root. Momentum restores GD-level rates. Nesterov goes further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation and practical notes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Standard Momentum
&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataloader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

&lt;span class="c1"&gt;# Nesterov Accelerated Gradient
&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataloader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Evaluate gradient at lookahead position
&lt;/span&gt;    &lt;span class="n"&gt;theta_lookahead&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
    &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theta_lookahead&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In PyTorch, both are available via &lt;code&gt;torch.optim.SGD&lt;/code&gt; with &lt;code&gt;momentum&lt;/code&gt; and &lt;code&gt;nesterov&lt;/code&gt; flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Momentum
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# NAG
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nesterov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical hyperparameter guidance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;β = 0.9&lt;/em&gt; is the standard default. It works well across a wide range of architectures.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;β = 0.99&lt;/em&gt; gives stronger smoothing but risks slower response to genuine direction changes and can overshoot narrow minima.&lt;/li&gt;
&lt;li&gt;When switching from SGD to momentum, reduce the learning rate by roughly a factor of &lt;em&gt;1/(1−β)&lt;/em&gt; — so if &lt;em&gt;η = 0.1&lt;/em&gt; for SGD, try &lt;em&gt;η = 0.01&lt;/em&gt; with &lt;em&gt;β = 0.9&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;NAG is almost always preferable to plain momentum for the same computational cost. Default to it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where momentum still falls short
&lt;/h2&gt;

&lt;p&gt;Momentum is a major step forward. But it inherits one fundamental limitation from SGD: &lt;strong&gt;a single global learning rate for all parameters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your model's parameters live in very different regimes. Any given row of a language model's embedding matrix is updated only when its token appears in the batch — its effective gradient is sparse and noisy. The final linear layer sees a dense, consistent gradient every step. Both are updated with the same &lt;em&gt;η&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is deeply suboptimal. Sparse parameters should move faster when they do receive signal. Dense parameters can afford more conservative updates to avoid oscillation.&lt;/p&gt;

&lt;p&gt;Momentum has no mechanism to learn this. It smooths over time but not over parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's exactly the problem Blog 3 solves.&lt;/strong&gt; AdaGrad will introduce per-parameter learning rates, scaling each update by the history of that parameter's gradient magnitude. RMSProp will fix AdaGrad's long-term decay problem. And the combination of per-parameter scaling with momentum will eventually give us Adam.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things to hold onto
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Velocity accumulation&lt;/td&gt;
&lt;td&gt;Past gradients persist via &lt;em&gt;β&lt;/em&gt; decay&lt;/td&gt;
&lt;td&gt;Accelerates in consistent directions, dampens oscillations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lookahead gradient (NAG)&lt;/td&gt;
&lt;td&gt;Gradient evaluated at projected position&lt;/td&gt;
&lt;td&gt;Earlier correction, better convergence rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective LR scaling&lt;/td&gt;
&lt;td&gt;Velocity → &lt;em&gt;η/(1−β)&lt;/em&gt; effective step&lt;/td&gt;
&lt;td&gt;Must tune LR down when adding momentum&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Blog 3&lt;/strong&gt;, we shift from &lt;em&gt;when&lt;/em&gt; the optimizer has seen a gradient to &lt;em&gt;which parameters&lt;/em&gt; have seen large gradients. AdaGrad introduces a per-parameter accumulator — parameters that receive frequent, large gradients get smaller effective learning rates; sparse parameters get larger ones. RMSProp then fixes AdaGrad's fatal flaw: the accumulator grows without bound, eventually shrinking all learning rates to zero.&lt;/p&gt;

&lt;p&gt;If momentum gave the optimizer a memory across time, adaptive methods give it a memory across parameters. Both are necessary. Neither is sufficient alone.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Blog 2 of an 8-part series on optimization algorithms for deep learning.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>gradientdescent</category>
    </item>
    <item>
      <title>Blog 1: Foundations of Gradient Descent</title>
      <dc:creator>Harshil Rami</dc:creator>
      <pubDate>Wed, 22 Apr 2026 17:54:43 +0000</pubDate>
      <link>https://dev.to/harshil_rami_8533a7388ef7/blog-1-foundations-of-gradient-descent-p6n</link>
      <guid>https://dev.to/harshil_rami_8533a7388ef7/blog-1-foundations-of-gradient-descent-p6n</guid>
      <description>&lt;h3&gt;
  
  
  &lt;em&gt;How neural networks learn — and why the obvious approach breaks immediately&lt;/em&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Every optimizer you'll ever use — Adam, AdamW, Lion, LAMB — is an answer to a problem that gradient descent creates. To understand why those answers exist, you need to feel the problem first.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The loss surface is a landscape you can't see
&lt;/h2&gt;

&lt;p&gt;Imagine you're blindfolded, standing somewhere on a hilly terrain. Your only tool is a stick: you can poke the ground around you and measure the slope. Your goal is to reach the lowest valley.&lt;/p&gt;

&lt;p&gt;That's optimization.&lt;/p&gt;

&lt;p&gt;The "terrain" is your loss surface — a high-dimensional function &lt;em&gt;L(θ)&lt;/em&gt; mapping your model's parameters &lt;em&gt;θ&lt;/em&gt; to a scalar loss. You can't see the whole surface. You can only evaluate the gradient at your current position and take a step.&lt;/p&gt;

&lt;p&gt;The question every optimizer tries to answer: &lt;strong&gt;which direction, and how far?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gradient Descent: the right idea, the wrong scale
&lt;/h2&gt;

&lt;p&gt;Gradient Descent (GD) is the foundational answer. Given a loss function &lt;em&gt;L(θ)&lt;/em&gt;, we compute the gradient over the entire dataset and update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ ← θ − η · ∇L(θ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;η&lt;/em&gt; (eta) is the learning rate — a scalar controlling step size.&lt;/p&gt;

&lt;p&gt;The update rule is clean. The gradient points in the direction of steepest ascent, so we move opposite to it. Mathematically, this is the direction of maximum local decrease in loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The intuition is correct. The implementation is catastrophically expensive.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To compute &lt;code&gt;∇L(θ)&lt;/code&gt; exactly, you need to pass your entire dataset through the model. For ImageNet-scale data that's over a million examples per update; for modern LLM corpora, billions. You'd compute one parameter update per epoch. On a 100M parameter model. That's not slow — it's dead on arrival.&lt;/p&gt;

&lt;p&gt;GD also has a subtle failure mode people underappreciate: when your dataset has redundant structure (and it almost always does), successive gradients are nearly identical. You're paying full-dataset cost for almost zero additional information after the first few passes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stochastic Gradient Descent: embrace the noise
&lt;/h2&gt;

&lt;p&gt;The fix seems almost too simple: instead of computing the gradient over all &lt;em&gt;N&lt;/em&gt; samples, pick &lt;strong&gt;one sample at random&lt;/strong&gt; and update on that alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ ← θ − η · ∇Lᵢ(θ)    for a randomly sampled i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is Stochastic Gradient Descent (SGD). The gradient estimate is now noisy — it's a single-sample approximation of the true gradient. But that noise turns out to be a feature, not a bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why noisy updates help:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Escaping shallow local minima.&lt;/strong&gt; A noisy gradient doesn't always point exactly downhill. This stochasticity gives the optimizer a kind of thermal energy — it can jitter out of shallow basins that would trap a deterministic update.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better generalization (empirically).&lt;/strong&gt; The noise in SGD acts as implicit regularization. Models trained with SGD often generalize better than those trained with exact gradient methods, particularly in overparameterized regimes. There's a growing body of theory around this — the "flat minima" hypothesis suggests noisy SGD preferentially finds wider, flatter basins that transfer better to test data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed per effective update.&lt;/strong&gt; One SGD step is O(1) in data cost. You can make &lt;em&gt;N&lt;/em&gt; updates in the time GD makes one, seeing every sample along the way.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The cost of noise:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SGD's gradient estimate has high variance. Updates zig-zag erratically, especially in directions where the loss surface has high curvature along one axis and low curvature along another (the classic "ravine" geometry). The path to the minimum looks like a drunk person's walk rather than a confident descent.&lt;/p&gt;

&lt;p&gt;You can reduce the learning rate to smooth this out, but then you lose the speed advantage. You're always trading variance against convergence rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mini-Batch Gradient Descent: the practical compromise
&lt;/h2&gt;

&lt;p&gt;The resolution in practice is obvious in retrospect: &lt;strong&gt;use a small batch of &lt;em&gt;B&lt;/em&gt; samples&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ ← θ − η · (1/B) · Σᵢ∈Bₜ ∇Lᵢ(θ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;Bₜ&lt;/em&gt; is a randomly sampled mini-batch of size &lt;em&gt;B&lt;/em&gt; at step &lt;em&gt;t&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is Mini-Batch Gradient Descent (MBGD) — and when practitioners say "SGD" today, this is almost always what they mean. Typical batch sizes range from 32 to 512, though the right choice depends on your model, hardware, and regularization goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What mini-batching buys you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variance reduction.&lt;/strong&gt; Averaging over &lt;em&gt;B&lt;/em&gt; samples reduces gradient variance by a factor of &lt;em&gt;B&lt;/em&gt; compared to single-sample SGD, without &lt;em&gt;B&lt;/em&gt;× the wall-clock cost (thanks to parallelism on GPU/TPU).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware efficiency.&lt;/strong&gt; GPUs are throughput machines — they saturate at batch sizes that fully utilize memory bandwidth. A single-sample forward pass wastes most of your compute budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enough noise to generalize.&lt;/strong&gt; Mini-batch gradients are still noisy enough to provide the regularization benefits of SGD, unlike full-batch gradients.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The batch size isn't free:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Larger batches reduce noise, which sounds good, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They can converge to &lt;strong&gt;sharper minima&lt;/strong&gt; with worse generalization (the "large-batch training problem" — Keskar et al., 2017).&lt;/li&gt;
&lt;li&gt;They require &lt;strong&gt;proportionally larger learning rates&lt;/strong&gt; to maintain the same effective update magnitude, but scaling LR linearly with batch size breaks down at large B.&lt;/li&gt;
&lt;li&gt;Beyond a critical batch size, you're paying compute cost without improving convergence speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last point becomes the central tension in Blog 6, when we look at LARS and LAMB — optimizers specifically designed to handle very large batches in distributed LLM training.&lt;/p&gt;

&lt;h2&gt;
  
  
  The update rule in full
&lt;/h2&gt;

&lt;p&gt;Here's where we stand after mini-batch SGD. The complete training loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;get_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;θ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;θ&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. It's what trained ResNets and early language models, and it still forms the backbone of large-scale training in certain regimes (SGD with momentum remains competitive with Adam on image classification tasks).&lt;/p&gt;

&lt;p&gt;But watch what happens on a ravine-shaped loss surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The gradient along the short axis (high curvature) is large → big oscillating steps&lt;/li&gt;
&lt;li&gt;The gradient along the long axis (low curvature, toward the minimum) is small → slow progress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The optimizer zig-zags across the ravine instead of marching down it. You need a very small learning rate to prevent divergence on the steep axis, which makes the shallow axis painfully slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is exactly the problem Blog 2 solves.&lt;/strong&gt; Momentum will give the optimizer memory — a velocity vector that accumulates in persistent directions and dampens oscillations. Nesterov will take that one step further, looking ahead before committing to the update.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things to hold onto
&lt;/h2&gt;

&lt;p&gt;Before moving to momentum, here are the three tensions that the rest of this series resolves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;What causes it&lt;/th&gt;
&lt;th&gt;Solved by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Slow, expensive updates&lt;/td&gt;
&lt;td&gt;Full-dataset gradient&lt;/td&gt;
&lt;td&gt;SGD / Mini-batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-variance, zig-zagging path&lt;/td&gt;
&lt;td&gt;Single/small-batch noise&lt;/td&gt;
&lt;td&gt;Momentum (Blog 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uniform learning rate for all params&lt;/td&gt;
&lt;td&gt;LR is a global scalar&lt;/td&gt;
&lt;td&gt;AdaGrad, RMSProp (Blog 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every optimizer from here on is a targeted intervention on one of these failure modes — or a combination of several at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key equations, plain-English summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Update rule&lt;/th&gt;
&lt;th&gt;One sentence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GD&lt;/td&gt;
&lt;td&gt;&lt;code&gt;θ ← θ − η · ∇L(θ)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exact gradient, entire dataset, one step per epoch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;&lt;code&gt;θ ← θ − η · ∇Lᵢ(θ)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Noisy gradient, one sample, fast but erratic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MBGD&lt;/td&gt;
&lt;td&gt;&lt;code&gt;θ ← θ − η · (1/B)·Σ∇Lᵢ(θ)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Averaged gradient, batch of B, the practical default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Blog 2&lt;/strong&gt;, we add velocity. Momentum accumulates past gradients into a running average, smoothing the zig-zagging path and accelerating convergence in consistent directions. Nesterov takes the lookahead step — evaluating the gradient at a projected future position rather than the current one.&lt;/p&gt;

&lt;p&gt;If SGD is someone walking blindfolded downhill, momentum is that same person carrying a ball that's already rolling. It takes more to change direction. That turns out to be exactly what you want.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Blog 1 of an 8-part series on optimization algorithms for deep learning. Each post covers one family of optimizers, following a problem → limitation → next solution arc.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>gradientdescent</category>
    </item>
  </channel>
</rss>
