Introduction
In the previous post, we built the nn.Module, which gave us:
- A clean way to define layers
- Automatic parameter tracking
- Training and evaluation modes
At this point, we can:
- Build models
- Compute losses
- Compute gradients via backpropagation
But there’s one critical piece missing. We still don't know how to update the parameters.
Without parameter updates, our neural network is just a very expensive random number generator.
In this post, we’ll build:
- A base Optimizer class
- SGD (Stochastic Gradient Descent)
- Adam
Want to skip the series and read the full book now?
- Read it for free online: https://zekcrates.quarto.pub/deep-learning-library/
How Do Gradients Update Weights?
Let's look at a simple training loop for MNIST.
logits = model(x_batch)
loss = softmax_loss(logits, y_one_hot)

# clear old gradients
for p in model.parameters():
    p.grad = None

# compute gradients
loss.backward()

# update parameters
for p in model.parameters():
    p.data = p.data - lr * p.grad
What are we doing here?
- loss.backward() computes gradients.
- p.grad points in the direction that increases the loss.
- We move in the opposite direction to reduce loss.
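To see the update rule on its own, here is a minimal sketch on a one-parameter problem. The function f(w) = (w - 3)^2 and the names grad_f and gradient_descent are made up for illustration; they are not part of the library.

```python
# Hypothetical toy problem: minimize f(w) = (w - 3)^2 with plain gradient descent.
def grad_f(w):
    # Analytic derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=50):
    for _ in range(steps):
        # Same rule as p.data = p.data - lr * p.grad in the loop above
        w = w - lr * grad_f(w)
    return w

w_final = gradient_descent(w=0.0)
print(w_final)  # ends up very close to the minimum at w = 3
```

Each step moves w against the gradient, so the distance to the minimum shrinks by a constant factor per step for this particular function and learning rate.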
Every model we train needs a loop like this, so we would end up rewriting the same parameter-update code again and again.
What if we want to use a fancier technique for the weight updates? Do we really need to rework the whole training loop for a single change?
Optimizer Base Class
What should every optimizer do?
- Hold model parameters
- Update parameters using gradients
- Clear previous gradients at each step.
So we always do:
optimizer.zero_grad()
loss.backward()
optimizer.step()
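class Optimizer:
    def __init__(self, params):
        # Materialize into a list so params can be iterated on every step,
        # even if a generator was passed in.
        self.params = list(params)

    def zero_grad(self):
        for p in self.params:
            p.grad = None

    def step(self):
        raise NotImplementedError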
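One detail worth checking when an optimizer holds on to parameters: whether model.parameters() in this library returns a list or a generator is not shown here. If it is a generator, it can only be iterated once, which the sketch below demonstrates with a hypothetical parameters() function.

```python
def parameters():
    # Hypothetical stand-in for model.parameters(), written as a generator.
    yield "weight"
    yield "bias"

params = parameters()
first_pass = [p for p in params]   # consumes the generator
second_pass = [p for p in params]  # nothing left to iterate over

stored = list(parameters())        # a list can be iterated repeatedly
third_pass = [p for p in stored]
fourth_pass = [p for p in stored]

print(first_pass, second_pass)
print(third_pass == fourth_pass)
```

Storing list(params) in __init__ sidesteps this either way, since zero_grad() and step() both need to walk the parameters on every iteration.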
Stochastic Gradient Descent (SGD)
SGD is the simplest weight update rule, and it is most often used with simpler models.
The rule itself fits on one line:
param = param - lr * grad
Where:
- lr is the learning rate
class SGD(Optimizer):
    def __init__(self, params, lr=0.01):
        super().__init__(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            p.data -= self.lr * p.grad
Example:
optimizer = SGD(model.parameters(), lr=0.01)
optimizer.zero_grad()
loss.backward()
optimizer.step()
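For a version that runs on its own, here is a rough sketch with a stand-in Param class. Param is hypothetical and only mimics the .data and .grad attributes used in this post, and the gradient is set by hand in place of loss.backward().

```python
class Param:
    """Hypothetical stand-in for the library's tensor: just .data and .grad."""
    def __init__(self, data):
        self.data = data
        self.grad = None

class Optimizer:
    def __init__(self, params):
        self.params = list(params)

    def zero_grad(self):
        for p in self.params:
            p.grad = None

    def step(self):
        raise NotImplementedError

class SGD(Optimizer):
    def __init__(self, params, lr=0.01):
        super().__init__(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:
                continue  # skip parameters that received no gradient
            p.data -= self.lr * p.grad

w = Param(0.0)
frozen = Param(5.0)  # never receives a gradient, so SGD leaves it alone
optimizer = SGD([w, frozen], lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    # Stands in for loss.backward() with loss = (w - 3)^2
    w.grad = 2.0 * (w.data - 3.0)
    optimizer.step()

print(w.data, frozen.data)
```

The training loop never touches the update rule directly, so swapping in another optimizer would only change the line that constructs it.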
SGD works, but it has limitations.
Why We Need Better Optimizers
SGD has no memory: it does not remember how the weights were updated in the past. It also applies the same learning rate to every parameter, even though not all parameters behave the same.
Some need:
- Big steps
- Small updates
- Momentum from past gradients
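A small sketch of why one learning rate rarely fits all parameters. The two-parameter loss 50*a^2 + 0.5*b^2 and the helper run_sgd are invented for illustration: the loss is steep in a and shallow in b.

```python
def run_sgd(lr, steps=100):
    # Hypothetical loss: 50 * a**2 + 0.5 * b**2, minimized at a = b = 0
    a, b = 1.0, 1.0
    for _ in range(steps):
        grad_a = 100.0 * a  # steep direction: large gradients
        grad_b = 1.0 * b    # shallow direction: small gradients
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# A learning rate small enough for a leaves b still far from 0 after 100 steps.
a_safe, b_safe = run_sgd(lr=0.01)

# A larger learning rate speeds up b, but a overshoots and blows up.
a_big, b_big = run_sgd(lr=0.03)

print(a_safe, b_safe)
print(a_big, b_big)
```

With a single shared learning rate, the steep parameter dictates how large a step is safe, and the shallow one has to live with it.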
Adam
Adam tracks the gradient history by keeping running averages of two things:
- The gradients themselves, for direction
- The squared gradients, for magnitude
Adam uses this information to adapt each parameter’s learning rate, making weight updates smarter.
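Written out, the standard Adam update looks roughly like this. The symbols are a notational assumption: g_t is the gradient at step t, m_t and v_t are the running averages of the gradient and the squared gradient (both starting at 0), theta is the parameter, and lr, beta_1, beta_2, and epsilon match the constructor arguments of the class in this section.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \mathrm{lr} \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

The two hatted terms are bias corrections: because m and v start at zero, the raw averages are too small in the first few steps, and dividing by 1 - beta^t compensates for that.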
class Adam(Optimizer):
    """
    Implements the Adam optimization algorithm.
    """
    def __init__(
        self,
        params,
        lr=0.001,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
    ):
        super().__init__(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.t = 0
        self.m = {}
        self.v = {}

    def step(self):
        self.t += 1
        for p in self.params:
            if p.grad is None:
                continue
            grad = p.grad
            # First moment
            m = self.m.get(p, 0) * self.beta1 + (1 - self.beta1) * grad
            self.m[p] = m
            # Second moment
            v = self.v.get(p, 0) * self.beta2 + (1 - self.beta2) * (grad ** 2)
            self.v[p] = v
            m_hat = m / (1 - self.beta1 ** self.t)
            v_hat = v / (1 - self.beta2 ** self.t)
            p.data -= self.lr * m_hat / (v_hat ** 0.5 + self.eps)
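One consequence of the bias correction can be checked by hand. On the very first step, m_hat equals the gradient and v_hat equals its square, so the update is roughly lr times the sign of the gradient, whatever the gradient's scale. A scalar sketch of that first step (adam_first_step is a hypothetical helper, not part of the library):

```python
def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moments start at 0, so after one update only the new-gradient term remains.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    # Bias correction with t = 1
    m_hat = m / (1 - beta1 ** 1)
    v_hat = v / (1 - beta2 ** 1)
    # Amount subtracted from the parameter on this step
    return lr * m_hat / (v_hat ** 0.5 + eps)

huge = adam_first_step(1000.0)   # very large gradient
tiny = adam_first_step(0.001)    # very small gradient
negative = adam_first_step(-1000.0)

print(huge, tiny, negative)
```

Plain SGD with the same lr would move a million times further for the first gradient than for the second. Here both first steps come out at about lr in size, which is the per-parameter scaling described above.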
Conclusion
In this post, we implemented:
- Optimizer base class
- SGD, the simplest optimizer
- Adam, a powerful optimizer