Build a Deep Learning Library from Scratch Using NumPy (Part 5: Optimizers)

Introduction

In the previous post, we built the nn.Module class, which gave us:

  • A clean way to define layers
  • Automatic parameter tracking
  • Training and evaluation modes

At this point, we can:

  • Build models
  • Compute losses
  • Compute gradients via backpropagation

But there’s one critical piece missing. We still don't know how to update the parameters.

Without parameter updates, our neural network is just a very expensive random number generator.

In this post, we'll build:

  • A base Optimizer class
  • SGD (Stochastic Gradient Descent)
  • Adam

Want to skip the series and read the full book now?

How Do Gradients Update Weights?

Let's look at a simple training loop for MNIST.

logits = model(x_batch)
loss = softmax_loss(logits, y_one_hot)

# clear old gradients
for p in model.parameters():
    p.grad = None

# compute gradients
loss.backward()

# update parameters
for p in model.parameters():
    p.data = p.data - lr * p.grad


What are we doing here?

  • loss.backward() computes gradients.
  • p.grad tells us which direction increases error.
  • We move in the opposite direction to reduce loss.
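
For example, one update step on a made-up scalar weight:

w, grad, lr = 0.5, 2.0, 0.1  # made-up values for illustration
w = w - lr * grad            # 0.5 - 0.1 * 2.0 = 0.3
# grad was positive (loss rises as w grows), so w moved down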

We need this update loop every time we train a model, and writing it by hand each time is tedious and error-prone.
What if we want to use a smarter update rule? Should we have to rewrite the whole training loop for a single change?

Optimizer Base Class

What should every optimizer do?

  • Hold model parameters
  • Update parameters using gradients
  • Clear previous gradients at each step

So we always do:

optimizer.zero_grad()
loss.backward()
optimizer.step()

class Optimizer:
    def __init__(self, params):
        self.params = params

    def zero_grad(self):
        for p in self.params:
            p.grad = None

    def step(self):
        raise NotImplementedError

Stochastic Gradient Descent (SGD)

SGD is the simplest optimizer and still the most common choice for simpler models.
Its weight update rule is equally simple:

param = param - lr * grad

Where:

  • lr is the learning rate

class SGD(Optimizer):
    def __init__(self, params, lr=0.01):
        super().__init__(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            p.data -= self.lr * p.grad

Example:

optimizer = SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
loss.backward()
optimizer.step()


SGD works, but it also has limitations.

Why We Need Better Optimizers

SGD has no memory: it does not remember how the weights were updated in the past. And not all parameters behave the same.

Some parameters need:

  • Bigger steps
  • Smaller steps
  • Momentum from past gradients

Adam

Adam tracks the gradient history with two running averages:

  • An average of the gradients, for direction
  • An average of the squared gradients, for magnitude

Adam uses this information to adapt each parameter's learning rate, making weight updates smarter.
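
In update-rule form, one Adam step looks like this (the same math the class below implements; t counts the steps taken so far):

m = beta1 * m + (1 - beta1) * grad       # running average of gradients
v = beta2 * v + (1 - beta2) * grad ** 2  # running average of squared gradients
m_hat = m / (1 - beta1 ** t)             # bias correction: m and v start at 0
v_hat = v / (1 - beta2 ** t)
param = param - lr * m_hat / (v_hat ** 0.5 + eps)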

class Adam(Optimizer):
    """
    Implements the Adam optimization algorithm.
    """
    def __init__(
        self,
        params,
        lr=0.001,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
    ):
        super().__init__(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps

        self.t = 0
        self.m = {}
        self.v = {}

    def step(self):
        self.t += 1

        for p in self.params:
            if p.grad is None:
                continue

            grad = p.grad

            # First moment
            m = self.m.get(p, 0) * self.beta1 + (1 - self.beta1) * grad
            self.m[p] = m

            # Second moment
            v = self.v.get(p, 0) * self.beta2 + (1 - self.beta2) * (grad ** 2)
            self.v[p] = v

            # Bias correction: m and v start at zero, so early
            # estimates are biased toward zero
            m_hat = m / (1 - self.beta1 ** self.t)
            v_hat = v / (1 - self.beta2 ** self.t)

            # Per-parameter step, scaled down where gradients have been large
            p.data -= self.lr * m_hat / (v_hat ** 0.5 + self.eps)

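Using Adam looks exactly like using SGD; only the constructor changes:

optimizer = Adam(model.parameters(), lr=0.001)

optimizer.zero_grad()
loss.backward()
optimizer.step()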

Conclusion

In this post, we implemented:

  • Optimizer base class
  • SGD, the simplest optimizer
  • Adam, a powerful optimizer
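
Putting it all together, the hand-written MNIST loop from the introduction now becomes a sketch like this (assuming the model, softmax_loss, and a batch iterator from earlier parts of the series):

optimizer = Adam(model.parameters(), lr=0.001)

for x_batch, y_one_hot in batches:  # hypothetical batch iterator
    logits = model(x_batch)
    loss = softmax_loss(logits, y_one_hot)

    optimizer.zero_grad()  # clear old gradients
    loss.backward()        # compute gradients
    optimizer.step()       # update parameters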

Want to skip the series and read the full book now?
