Build a Deep Learning Library from Scratch Using NumPy (Part 5: Optimizers)

Introduction

In the previous post, we built the nn.Module class, which gave us:

  • A clean way to define layers
  • Automatic parameter tracking
  • Training and evaluation modes

At this point, we can:

  • Build models
  • Compute losses
  • Compute gradients via backpropagation

But there’s one critical piece missing. We still don't know how to update the parameters.

Without parameter updates, our neural network is just a very expensive random number generator.

In this post, we'll build:

  • A base Optimizer class
  • SGD (Stochastic Gradient Descent)
  • Adam

Want to skip the series and read the full book now?

How Do Gradients Update Weights?

Let's look at a simple training loop for MNIST.

logits = model(x_batch)
loss = softmax_loss(logits, y_one_hot)

# clear old gradients
for p in model.parameters():
    p.grad = None

# compute gradients
loss.backward()

# update parameters
for p in model.parameters():
    p.data = p.data - lr * p.grad


What are we doing here?

  • loss.backward() computes gradients.
  • p.grad tells us which direction increases error.
  • We move in the opposite direction to reduce loss.
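
For example, one update step on a made-up scalar weight:

w, grad, lr = 0.5, 2.0, 0.1  # made-up values for illustration
w = w - lr * grad            # 0.5 - 0.1 * 2.0 = 0.3
# grad was positive (loss rises as w grows), so w moved down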

We need this update loop every time we train a model, and writing it by hand each time is tedious and error-prone.
What if we want to use a smarter update rule? Should we have to rewrite the whole training loop for a single change?

Optimizer Base Class

What should every optimizer do?

  • Hold model parameters
  • Update parameters using gradients
  • Clear previous gradients at each step

So we always do:

optimizer.zero_grad()
loss.backward()
optimizer.step()

class Optimizer:
    def __init__(self, params):
        self.params = params

    def zero_grad(self):
        for p in self.params:
            p.grad = None

    def step(self):
        raise NotImplementedError

Stochastic Gradient Descent (SGD)

SGD is the simplest optimizer and still the most common choice for simpler models.
Its weight update rule is equally simple:

param = param - lr * grad

Where:

  • lr is the learning rate

class SGD(Optimizer):
    def __init__(self, params, lr=0.01):
        super().__init__(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            p.data -= self.lr * p.grad

Example:

optimizer = SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
loss.backward()
optimizer.step()


SGD works, but it also has limitations.

Why We Need Better Optimizers

SGD has no memory: it does not remember how the weights were updated in the past. And not all parameters behave the same.

Some parameters need:

  • Bigger steps
  • Smaller steps
  • Momentum from past gradients

Adam

Adam tracks the gradient history with two running averages:

  • An average of the gradients, for direction
  • An average of the squared gradients, for magnitude

Adam uses this information to adapt each parameter's learning rate, making weight updates smarter.
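
In update-rule form, one Adam step looks like this (the same math the class below implements; t counts the steps taken so far):

m = beta1 * m + (1 - beta1) * grad       # running average of gradients
v = beta2 * v + (1 - beta2) * grad ** 2  # running average of squared gradients
m_hat = m / (1 - beta1 ** t)             # bias correction: m and v start at 0
v_hat = v / (1 - beta2 ** t)
param = param - lr * m_hat / (v_hat ** 0.5 + eps)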

class Adam(Optimizer):
    """
    Implements the Adam optimization algorithm.
    """
    def __init__(
        self,
        params,
        lr=0.001,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
    ):
        super().__init__(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps

        self.t = 0
        self.m = {}
        self.v = {}

    def step(self):
        self.t += 1

        for p in self.params:
            if p.grad is None:
                continue

            grad = p.grad

            # First moment
            m = self.m.get(p, 0) * self.beta1 + (1 - self.beta1) * grad
            self.m[p] = m

            # Second moment
            v = self.v.get(p, 0) * self.beta2 + (1 - self.beta2) * (grad ** 2)
            self.v[p] = v

            # Bias correction: m and v start at zero, so early
            # estimates are biased toward zero
            m_hat = m / (1 - self.beta1 ** self.t)
            v_hat = v / (1 - self.beta2 ** self.t)

            # Per-parameter step, scaled down where gradients have been large
            p.data -= self.lr * m_hat / (v_hat ** 0.5 + self.eps)

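Using Adam looks exactly like using SGD; only the constructor changes:

optimizer = Adam(model.parameters(), lr=0.001)

optimizer.zero_grad()
loss.backward()
optimizer.step()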

Conclusion

In this post, we implemented:

  • Optimizer base class
  • SGD, the simplest optimizer
  • Adam, a powerful optimizer
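
Putting it all together, the hand-written MNIST loop from the introduction now becomes a sketch like this (assuming the model, softmax_loss, and a batch iterator from earlier parts of the series):

optimizer = Adam(model.parameters(), lr=0.001)

for x_batch, y_one_hot in batches:  # hypothetical batch iterator
    logits = model(x_batch)
    loss = softmax_loss(logits, y_one_hot)

    optimizer.zero_grad()  # clear old gradients
    loss.backward()        # compute gradients
    optimizer.step()       # update parameters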

Want to skip the series and read the full book now?
