Let’s Build a Deep Learning Library from Scratch Using NumPy (Part 3: Training MNIST)

Introduction

In Part 1, we built the Tensor class and a computation graph.
In Part 2, we implemented automatic differentiation from scratch.

In this part, we will:

  • Use our custom autograd engine (babygrad)
  • Build a small neural network
  • Train it on the MNIST handwritten digits dataset

Missed Part 1?

Read it here: https://dev.to/zekcrates/lets-build-a-deep-learning-library-from-scratch-using-numpy-part-1-32p9

Want to skip the series and read the full book now?

Loading MNIST data

You can easily download the MNIST data; the downloaded files look like this:

# data/ 
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz


Parsing the images file should give an array of shape (num_images, 784), and parsing the labels file should give an array of shape (num_images,), with labels in the range 0–9.

import struct
import gzip
import numpy as np

def parse_mnist(image_filename, label_filename):
    with gzip.open(image_filename, 'rb') as f:
        # IDX image header: magic number, image count, rows, cols (big-endian uint32s)
        magic, num_images, rows, cols = struct.unpack('>IIII', f.read(16))
        image_data = np.frombuffer(f.read(), dtype=np.uint8)
        # Flatten each 28x28 image into a row of 784 pixels
        images = image_data.reshape(num_images, rows * cols)

    with gzip.open(label_filename, "rb") as f:
        # IDX label header: magic number, label count (big-endian uint32s)
        magic, num_labels = struct.unpack('>II', f.read(8))
        labels = np.frombuffer(f.read(), dtype=np.uint8)

    # Scale pixel values from [0, 255] to [0, 1]
    images = images.astype(np.float32) / 255.0
    return images, labels

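For example, assuming the files sit in a data/ directory laid out as above, loading both splits looks like this (a quick usage sketch, not part of the library itself):

X_train, y_train = parse_mnist(
    "data/train-images-idx3-ubyte.gz",
    "data/train-labels-idx1-ubyte.gz",
)
X_test, y_test = parse_mnist(
    "data/t10k-images-idx3-ubyte.gz",
    "data/t10k-labels-idx1-ubyte.gz",
)

print(X_train.shape, y_train.shape)  # (60000, 784) (60000,)
print(X_test.shape, y_test.shape)    # (10000, 784) (10000,)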

Now that we have our data, let's create a simple model to train on it.

It will have only two weight matrices, W1 and W2:

W1 = (784, 100)
W2 = (100, 10)

from babygrad import Tensor, ops

class SimpleNN:
    def __init__(self, input_size, hidden_size, num_classes):
        # Two weight matrices, scaled down at init so the starting activations stay small
        self.W1 = Tensor(
            np.random.randn(input_size, hidden_size).astype(np.float32)
            / np.sqrt(hidden_size),
            requires_grad=True
        )
        self.W2 = Tensor(
            np.random.randn(hidden_size, num_classes).astype(np.float32)
            / np.sqrt(num_classes),
            requires_grad=True
        )

    def forward(self, x):
        z1 = x @ self.W1        # (batch, 784) @ (784, 100) -> (batch, 100)
        a1 = ops.relu(z1)       # non-linearity between the two layers
        logits = a1 @ self.W2   # (batch, 100) @ (100, 10) -> (batch, 10)
        return logits

    def parameters(self):
        return [self.W1, self.W2]

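As a quick sanity check (a throwaway snippet, not part of the library), we can push a random batch of 5 fake "images" through the model and confirm the output shape:

model = SimpleNN(input_size=784, hidden_size=100, num_classes=10)

# A fake batch of 5 flattened 28x28 images, just to check shapes
x = Tensor(np.random.randn(5, 784).astype(np.float32))
logits = model.forward(x)

print(logits.shape)  # (5, 10)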

The model takes a batch of images, each flattened to shape (784,). With a batch of 5 images, the input first goes through W1:

(5,784) @ (784,100) -> (5,100)
x @ self.W1

Note: The @ operator uses our custom matmul op implemented in Part 2.

After passing through W1 and the ReLU, the activations have shape (5,100).

We only have 10 classes (digits 0–9), so the model needs to output one score per class.

To get there, we send the result through W2:

(5,100) @ (100,10) = (5,10)
logits = a1 @ self.W2

The output of shape (5, 10) contains raw class scores (logits) for each digit.

But logits alone aren't enough; we need a loss function.

This loss is the value we will decrease by updating W1 and W2 using their gradients.

Loss function

def softmax_loss(logits: Tensor, y_true: Tensor) -> Tensor:
    batch_size = logits.shape[0]
    # log(sum_j exp(z_j)) for each example in the batch
    log_sum_exp = ops.log(ops.exp(logits).sum(axes=1))
    # The logit of the true class (y_true is one-hot)
    z_y = (logits * y_true).sum(axes=1)
    # Per-example softmax cross-entropy, averaged over the batch
    loss = log_sum_exp - z_y
    return loss.sum() / batch_size

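In math terms, the per-example quantity this computes is the standard softmax cross-entropy loss for a logits vector z and true class y (a numerically stabilized version would subtract max(z) inside the exponentials; I skip that here to keep the code simple):

$$
\ell(z, y) = \log \sum_{j=0}^{9} e^{z_j} - z_y
$$

The function then averages this value over the batch.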

This gives us a single loss value.

We now have everything we need:

  • Data
  • Model
  • Loss function

The only thing left is to train the model on the data. We can do this with a training loop.

Training loop

def train_epoch(model, X_train, y_train, lr, batch_size):
    for batch_idx, i in enumerate(range(0, X_train.shape[0], batch_size)):
        x_batch = Tensor(X_train[i:i+batch_size])
        y_batch_np = y_train[i:i+batch_size]

        logits = model.forward(x_batch)

        # One-hot encode the labels so they line up with the (batch, 10) logits
        num_classes = logits.shape[1]
        y_one_hot = np.zeros((y_batch_np.shape[0], num_classes),
                             dtype=np.float32)
        y_one_hot[np.arange(y_batch_np.shape[0]), y_batch_np] = 1
        y_one_hot = Tensor(y_one_hot)

        loss = softmax_loss(logits, y_one_hot)

        # Zero gradients from the previous batch
        for p in model.parameters():
            p.grad = None

        # Backprop: gradients are calculated for W1 and W2
        loss.backward()

        # Parameters (W1, W2) updated using their gradients
        for p in model.parameters():
            p.data -= lr * p.grad

        # Batch accuracy: the predicted class is the highest-scoring logit
        preds = logits.data.argmax(axis=1)
        acc = np.mean(preds == y_batch_np)

        print(
            f"Batch {batch_idx:3d}: Loss = {loss.data:.4f}, "
            f"Accuracy = {acc*100:.2f}%"
        )

This loop:

  • Builds the computation graph.
  • Calls backward().
  • Updates parameters using gradients.
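
To run it end to end, a minimal driver could look like the sketch below. The learning rate, batch size, and epoch count are illustrative defaults I picked, not values the library prescribes:

# Assumes X_train and y_train were loaded with parse_mnist as shown earlier
model = SimpleNN(input_size=784, hidden_size=100, num_classes=10)

for epoch in range(3):
    print(f"Epoch {epoch + 1}")
    train_epoch(model, X_train, y_train, lr=0.1, batch_size=128)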

After training for a little while, the output looks like this:

  Batch  13: Loss = 0.2163, Accuracy = 96.09%
  Batch  14: Loss = 0.1742, Accuracy = 96.09%
  Batch  15: Loss = 0.1630, Accuracy = 96.88%
  Batch  16: Loss = 0.1862, Accuracy = 95.31%
  Batch  17: Loss = 0.1637, Accuracy = 96.09%
  Batch  18: Loss = 0.1812, Accuracy = 95.31%
  Batch  19: Loss = 0.2156, Accuracy = 94.53%
  Batch  20: Loss = 0.1259, Accuracy = 99.22%

Conclusion

At this point, we’ve successfully trained a neural network on MNIST using an autograd engine built entirely from scratch.

This is the core of every modern deep learning library.
Everything that comes next (optimizers, deeper networks, CNNs) will be built on top of this same foundation.

Want to skip the series and read the full book now?
