Introduction
Welcome back! In Part 1, we built the foundation of our deep learning library babygrad:
- Defined the Tensor class.
- Tracked operations and inputs.
Now, it’s time to implement automatic differentiation, the core of deep learning. This is what allows libraries like PyTorch to compute gradients
automatically so we can train models.
Missed Part 1?
Read it here: https://dev.to/zekcrates/lets-build-a-deep-learning-library-from-scratch-using-numpy-part-1-32p9
Want to skip the series and read the full book now?
- Read it for free online: https://zekcrates.quarto.pub/deep-learning-library/
What is a Gradient?
A gradient measures how much the final output (for example, the loss) changes when we make a small change to a specific input.
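For intuition, here is a tiny standalone example (plain Python, not babygrad) that estimates a gradient numerically with a finite difference; the function f and the step size eps are just illustrative choices:
# f(x) = x**2 has derivative 2x, so the gradient at x = 3.0 should be about 6
def f(x):
    return x ** 2

x, eps = 3.0, 1e-6
numeric_grad = (f(x + eps) - f(x)) / eps
print(numeric_grad)  # ~6.0: nudging x up changes f(x) about 6 times as fast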
What is Automatic Differentiation?
Automatic differentiation works by:
- Building a computation graph as we apply operations to Tensors.
- Breaking complex functions into simpler operations.
- Using the chain rule to propagate gradients from outputs back to inputs.
a = Tensor([2.0], requires_grad=True)
b = Tensor([3.0], requires_grad=True)
c = a * b
If we call c.backward(), it will use the chain rule to calculate the gradients of c with respect to its parents, a and b.
# derivative of a*b with respect to a is b
# derivative of a*b with respect to b is a
dc_da = b
dc_db = a
Just like multiplication, a computation graph can contain many different operations, each responsible for computing its own gradients using the chain rule.
Surprisingly, with just a handful of simple operations like this, and their backward passes, we can build and train full models.
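To make the chain rule concrete, here is a small worked example with plain numbers (illustration only, reusing a and b from above plus one extra node d):
# e = (a * b) + d  with a = 2, b = 3, d = 4, so c = a * b = 6 and e = 10
# de/dc = 1       (local gradient of Add)
# dc/da = b = 3   (local gradient of Mul)
# chain rule: de/da = de/dc * dc/da = 1 * 3 = 3
# similarly:  de/db = 1 * a = 2, and de/dd = 1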
Functions: Forward and Backward
Each operation in babygrad is a class (like Add or Mul) that implements:
- forward: calculates the result from inputs
- backward: calculates the gradients of the inputs
Since every operation will have the same two methods (forward and backward), let's define a common base class called Function.
class Function:
    def __call__(self, *inputs):
        requires_grad = any(t.requires_grad for t in inputs)
        # "inputs" are Tensors; ".data" gives us the underlying NumPy arrays
        inputs_data = [t.data for t in inputs]
        # Call the "forward" method that each subclass implements on its own
        output_data = self.forward(*inputs_data)
        # Wrap the result in a Tensor
        output_tensor = Tensor(output_data, requires_grad=requires_grad)
        if requires_grad:
            # Remember who created this Tensor and who its parents are
            output_tensor._op = self
            output_tensor._inputs = inputs
        return output_tensor

    def forward(self, *args):
        raise NotImplementedError()

    def backward(self, out_grad, node):
        # node is the output Tensor created in __call__; it carries
        # references to its input Tensors in node._inputs
        # out_grad is the upstream gradient
        raise NotImplementedError()
Any operation like Add or Mul will be a subclass of this Function. When we call an instance, __call__ runs the subclass's forward() method under the hood and wraps the result in a new Tensor.
Let's look at an example of this Function class in action to understand it better.
class Add(Function):
    def forward(self, a, b):
        return a + b

    def backward(self, out_grad, node):
        # The gradient of a+b with respect to a is 1,
        # and with respect to b it is also 1.
        # We pass the upstream gradient (out_grad) on to our parents
        # so they can compute their own gradients too.
        return out_grad * 1, out_grad * 1

def add(a, b):
    return Add()(a, b)
Each backward method receives out_grad, the upstream gradient. The gradients returned by a backward method are called local gradients.
These local gradients are multiplied by the upstream gradient (out_grad) to apply the chain rule.
Each operation in the computation graph only has to compute its own local derivatives, and that's it.
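For contrast with Add above, here is a sketch of how a Mul Function could look under the same interface; the book's version may differ slightly. It reads the saved input Tensors from node._inputs, and multiplying out_grad by a Tensor assumes Tensor overrides * the same way + is overridden just below.
class Mul(Function):
    def forward(self, a, b):
        # a and b are NumPy arrays here (Function.__call__ unwrapped the Tensors)
        return a * b

    def backward(self, out_grad, node):
        # the original input Tensors were saved on the output node
        a, b = node._inputs
        # local gradients: d(a*b)/da = b and d(a*b)/db = a,
        # each multiplied by the upstream gradient (chain rule)
        return out_grad * b, out_grad * a

def mul(a, b):
    return Mul()(a, b)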
With just a little change inside the Tensor class, we can override Python's + operator so it uses our add(a, b) function instead of the default behaviour.
class Tensor:
    # overrides "+" with our add function
    def __add__(self, other):
        if not isinstance(other, Tensor):
            other = Tensor(other)
        return add(self, other)
a = Tensor([2.0], requires_grad=True)
b = Tensor([3.0], requires_grad=True)
c = a + b        # uses our add function
print(c.data)    # [5.]
print(c._op)     # <Add object>
print(c._inputs) # (a, b)
You should implement all the other Functions covered in the book:
- Sub
- Div
- Pow
- Transpose
- Reshape
- BroadcastTo
- Summation
- MatMul
- Negate
- Log
- Exp
- Relu
- Sigmoid
- Tanh
- Sqrt
- abs
Most of them have simple derivatives. For the forward pass, use the corresponding NumPy function whenever one exists.
The most rewarding Functions are Reshape, BroadcastTo, MatMul, and Summation, because they end up doing most of the heavy lifting.
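As a taste of the trickier ones, here is a minimal sketch of MatMul for the plain 2-D case, with no batching or broadcasting (the version in the book may handle more). It uses the standard identities for C = A @ B: dL/dA = dL/dC @ B.T and dL/dB = A.T @ dL/dC.
class MatMul(Function):
    def forward(self, a, b):
        # a and b arrive as NumPy arrays, so @ is plain matrix multiplication
        return a @ b

    def backward(self, out_grad, node):
        a, b = node._inputs
        # for C = A @ B: dL/dA = dL/dC @ B.T and dL/dB = A.T @ dL/dC
        grad_a = Tensor(out_grad.data @ b.data.T)
        grad_b = Tensor(a.data.T @ out_grad.data)
        return grad_a, grad_b

def matmul(a, b):
    return MatMul()(a, b)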
Now that we have written our Functions and wired them into the Tensor class through operator overloading, one final method remains: the backward() method on Tensor itself.
This is the method that runs when we call loss.backward().
Backward Pass
The backward pass computes gradients from outputs back to inputs using the chain rule.
- Each backward method knows its local derivative.
- Gradients are accumulated from all child nodes.
When we call loss.backward() (or c.backward() from the example above), what should the gradient of loss or c itself be? Each Function.backward() takes out_grad and node.
So what should the very first out_grad be?
Simple! It should just be 1, because changing the output changes the output by exactly that amount. This is our initial upstream gradient.
We already mentioned that the backward pass uses the computation graph. By the time we call loss.backward(), the graph has already been constructed during the forward pass. We just need to walk from the output back to the inputs and call backward on each node.
So backward() needs to do two things:
- Traverse the computation graph.
- Call node._op.backward for every node.
Traverse
We will use a depth-first search (DFS) to traverse the computation graph and collect its nodes in topological order.
class Tensor:
    def backward(self, grad=None):
        if not self.requires_grad:
            raise RuntimeError(
                "Cannot call backward on a tensor that does not require gradients."
            )
        # Store the nodes of the graph in topological order
        topo_order = []
        visited = set()

        def build_topo(node):
            if id(node) not in visited:
                visited.add(id(node))
                for parent in node._inputs:
                    build_topo(parent)
                topo_order.append(node)

        build_topo(self)
a = babygrad.Tensor([1, 2, 3], requires_grad=True)
b = babygrad.Tensor([2, 3, 4], requires_grad=True)
c = a + b
d = babygrad.Tensor([3, 4, 5], requires_grad=True)
e = c + d
e.backward()
# backward() first traverses the computation graph:
# topo_order = [a, b, c, d, e]
For the code above we get topo_order = [a, b, c, d, e].
Every node participating in the graph appears in topo_order, and because build_topo appends a node only after all of its parents, each node comes after its parents in the list.
Now we just need to use this topo_order to calculate the gradients for all of them.
Calculating gradients
class Tensor:
    def backward(self, grad=None):
        if not self.requires_grad:
            raise RuntimeError(
                "Cannot call backward on a tensor that does not require gradients."
            )
        # Build the "family tree" in order (topological sort)
        topo_order = []
        visited = set()

        def build_topo(node):
            # same as shown above
            if id(node) not in visited:
                visited.add(id(node))
                for parent in node._inputs:
                    build_topo(parent)
                topo_order.append(node)

        build_topo(self)

        # Initialize the ledger of gradients
        grads = {}
        if grad is None:
            # The "output" gradient: dL/dL = 1
            grads[id(self)] = Tensor(np.ones_like(self.data))
        else:
            grads[id(self)] = _ensure_tensor(grad)

        # Walk the graph backwards
        for node in reversed(topo_order):
            out_grad = grads.get(id(node))
            if out_grad is None:
                continue

            # Store the final result in the .grad attribute
            if node.grad is None:
                node.grad = np.array(out_grad.data, copy=True)
            else:
                node.grad += out_grad.data

            # Propagate to parents
            if node._op:
                # finally calling node._op.backward()
                input_grads = node._op.backward(out_grad, node)
                if not isinstance(input_grads, tuple):
                    input_grads = (input_grads,)
                for i, parent in enumerate(node._inputs):
                    if parent.requires_grad:
                        parent_id = id(parent)
                        if parent_id not in grads:
                            # First time seeing this parent
                            grads[parent_id] = input_grads[i]
                        else:
                            # Sum the gradients!
                            grads[parent_id] = grads[parent_id] + input_grads[i]
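With Add, the overridden + operator, and backward() in place, a quick end-to-end check could look like this (assuming Tensor also overrides *, which Add.backward's out_grad * 1 relies on; exact printed formatting depends on NumPy):
a = Tensor([2.0], requires_grad=True)
b = Tensor([3.0], requires_grad=True)
c = a + b
c.backward()       # seeds out_grad with ones, then walks the graph in reverse
print(a.grad)      # [1.] because d(a+b)/da = 1
print(b.grad)      # [1.] because d(a+b)/db = 1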
A few questions
- Why traverse the nodes in reverse?
In the backward pass we go from the output to the inputs, that is, from children to parents. We must finish computing a child's gradient before its parents' gradients, and since topo_order lists parents before children, reversing it gives exactly that order.
- Why use grads = {} ?
Inside the loop we already have out_grad, the gradient of the current node, and we call node._op.backward(out_grad, node) to get the gradients of that node's parents.
That's nice, we have calculated the gradients for the parents, but we haven't stored them in parent.grad yet.
They need to live somewhere until the loop reaches each parent, and that is what the grads ledger is for. When a parent eventually becomes the current node, the line node.grad = np.array(out_grad.data, copy=True) is what finally stores parent.grad.
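The "Sum the gradients!" branch matters whenever the same tensor is used more than once in the graph. A tiny sketch of that case (same assumptions as the check above):
a = Tensor([2.0], requires_grad=True)
b = a + a          # 'a' appears twice in the graph
b.backward()
print(a.grad)      # [2.] because the two upstream gradients of 1 are summed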
At this point, we've built a complete autograd engine from scratch.
Everything that comes next (optimizers, neural networks, training loops, and datasets) will be built on top of this exact system.
What's next?
- In the next part we will use the code we have written to train MNIST.
Want to skip the series and read the full book now?
- Read it for free online: https://zekcrates.quarto.pub/deep-learning-library/