## DEV Community

ashwins-code

Posted on • Updated on

# Deep Learning Library From Scratch 5: Automatic Differentiation Continued

Hi Guys! Welcome to part 5 of this series of building a deep learning library from scratch. This post will cover the code of the automatic differentiation part of the library. Automatic Differentiation was discussed in the previous post, so do check it out if you don't know what Autodiff is.

The github repo for this series is....

# Zen - Deep Learning Library

A deep learning library written in Python.

Contains the code for my blog series where we build this library from scratch

• mnist.py contains an example for a MNIST digit recogniser using the library
• rnn.py contains an example of a Recurrent Neural Network which learns to fit to the graph sin(x) * cos(x)

## Approach

Automatic Differentiation relies on a computation graph to calculate derivatives.

Ultimately, it just boils down to nodes with some connections and we traverse these nodes in some way to calculate derivatives.

For our library, we will build our computation graph on the fly, meaning any calculations performed would be recorded onto the computation graph.

Once we have the graph, we need to find a way to use it to calculate the derivatives of all the variables in that graph.

For example, say we produced the following graph...

This represents...

$c = a + b \newline e = c * d \newline$

Now, using the graph, our aim is to find the derivative of e with respect all the variables in that graph (a,b,c,d,e)

For my implementation, I found it easiest to traverse the graph in a depth-first manner to calculate derivatives.

So firstly, we start at e and find $\frac{de}{de}$ with respect to e (which is just 1).

Then, we look at node c, meaning we now need to calculate $\frac{de}{dc}$ . We can see that $e$ is the result of a multiplication between $c$ and $d$ , meaning $\frac{de}{dc} = d$ (since we treat everything apart from the variable we are on as a constant).

Remembering we are traversing depth-first, the next node we move onto is node a, meaning we calculate $\frac{de}{da}$ . This is a bit more tricky, since a does not have a direct connection to e. However, using the chain rule, we know that $\frac{de}{da} = \frac{de}{dc}\frac{dc}{da}$ . We just calculated $\frac{de}{dc}$ , so all we need to calculate now is $\frac{dc}{da}$ . We can see that c is an addition of a and b, so $\frac{de}{da} = \frac{de}{dc}\frac{dc}{da} = d$

Hopefully, you can now see how we would use a graph to find the derivatives of all the variables in that graph.

## Tensor Class

Firstly, we need to create our tensor class, which would act as the variable nodes on our graph.

import numpy as np
import string
import random

def id_generator(size=10, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))

np.seterr(invalid='ignore')

def is_matrix(o):
return type(o) == np.ndarray

def same_shape(s1, s2):
for a, b in zip(s1, s2):
if a != b:
return False

return True

class Tensor:
__array_priority__ = 1000
def __init__(self, value, trainable=True):
self.value = value
self.dependencies = []
self.shape = 0
self.matmul_product = False
self.trainable = trainable
self.id = id_generator()

if is_matrix(value):
self.shape = value.shape


What's going on here?

def id_generator(size=10, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))


Function that generates a unique id using random characters

def is_matrix(o):
return type(o) == np.ndarray


Simple function which checks if a value is a numpy array or not.

self.value = value


This line should be self-explanatory, it just holds the value that is given to the tensor.

self.dependencies = []


If the tensor was the result of any operation e.g add or divide, this property would hold the list of tensors that were involved in the operation, producing this tensor (this is how the computation graph is built). If the tensor is not a result of any operation, then this is empty.

self.grads = []


This property would hold the list of derivative of each of the tensor's dependencies with respect to the tensor.

self.shape = 0
...
if is_matrix(value):
self.shape = value.shape


self.shape holds the shape of the tensor's value. Only numpy arrays can have a shape, which is why it is 0 by default.

self.matmul_product = False


Specifies whether the tensor was a result of a matrix multiplicaton or not (this will help later since the chain rule works differently for matrix multiplacation).

self.gradient = np.ones_like(self.value)


After we use the computation graph to calculate gradients, this property would hold the gradient calculated for the tensor. It is initially set to a matrix of 1s, the same shape as its value.

self.trainable = trainable


Some nodes on the graph do not need their derivatives to be calculated, so this property specifies whether this is the case or not for this tensor.

self.id = id_generator()


Tensors will need to have something to uniquely identify themselves. We will see this coming in use when we reimplement our optimisers to use this automatic differentiation module in a later post.

## Operations with Tensors

class Tensor:
__array_priority__ = 1000
def __init__(self, value, trainable=True):
self.value = value
self.dependencies = []
self.shape = 0
self.matmul_product = False
self.trainable = trainable
self.id = id_generator()

if is_matrix(value):
self.shape = value.shape

def depends_on(self, target):
if self == target:
return True

dependencies = self.dependencies

for dependency in dependencies:
if dependency == target:
return True
elif dependency.depends_on(target):
return True

return False

def __mul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value * other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

def __rmul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value * other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value + other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value + other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

def __sub__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other)

var = Tensor(self.value - other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

def __rsub__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(other.value - self.value)
var.dependencies.append(other)
var.dependencies.append(self)
return var

def __pow__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value ** other.value)
var.dependencies.append(self)
var.dependencies.append(other)

grad_wrt_self = other.value * self.value ** (other.value - 1)

grad_wrt_other = (self.value ** other.value) * np.log(self.value)

return var

def __rpow__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(other.value ** self.value)
var.dependencies.append(other)
var.dependencies.append(self)

grad_wrt_other = self.value * other.value ** (self.value - 1)

grad_wrt_self = (other.value ** self.value) * np.log(other.value)

return var

def __truediv__(self, other):
return self * (other ** -1)

def __rtruediv__(self, other):
return other * (self ** -1)

def __matmul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value @ other.value)
var.dependencies.append(self)
var.dependencies.append(other)

var.matmul_product = True
return var

def __rmatmul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(other.value @ self.value)
var.dependencies.append(other)
var.dependencies.append(self)

var.matmul_product = True

return var


To understand what is happening here, let's look at one of the methods

def __mul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value * other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var


Firstly, __mul__ is an operator overloader. This just means that whenever we want to multiply a tensor with something else, this method would be called. You also see __rmul__, which is the same thing, but is called when the Tensor object is on the right hand side of the operation.

For example...

t = Tensor(10)
t * 5 #__mul__ is called
5 * t #__rmul__ is called

if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)


other represents the thing that this tensor is being multiplied with. If other is not a tensor, we want to convert it to a tensor, holding the value of other. Since other was not already a tensor, it means that it is a constant, so we do not need to calculate its derivative, since we only want to calculate the derivative of something if we need to change its value, usually when training a model. This is why trainable=False.

var = Tensor(self.value * other.value)
var.dependencies.append(self)
var.dependencies.append(other)


var holds the resulting tensor of this operation. The second and third lines add the two tensors used in this operator to var's dependencies (look back up if you forgot what the dependencies property is used for)

var.grads.append(other.value) # dvar/dother
return var


This now adds the derivates of the two operands with respect to var to var's grads. These lines would obviously be different in the other class methods, since the derivatives depend on the operation being applied (in this case it's multiplication). Note how the order of how they're added corresponds with the order of the tensors in the dependencies property. We don't store them as tensors, since it will be much quicker to calculate derivatives using raw values, instead of our tensor class.

## Calculating gradients using the graph!

def get_gradients(self, grad = None):

if dependency.trainable:

if self.matmul_product:
if dependency == self.dependencies[0]:
else:
else:

for i, dim in enumerate(dependency.shape):
if dim == 1:



This method recursively implements the depth-first derivative calculating that was outlined earlier. This can be called for any tensor in the graph, not just the top one. It would result in all the derivatives of all the tensors, that were used to produce this tensor, being calculated.

for dependency, _grad in zip(self.dependencies, self.grads):
if dependency.trainable:

if self.matmul_product:
if dependency == self.dependencies[0]:
else:
else:

for i, dim in enumerate(dependency.shape):
if dim == 1:



This part of the code goes through each of the tensor's dependencies along with the derivative the tensor with respect to that dependency (held in _grad). If the dependency is trainable, it then checks if it was a result of a matrix multiplication or not.

The chain rule is then accordingly applied, using the previously calculated gradient grad and the gradient of the current dependency _grad. local_grad stores the result after the chain rule is applied.

def same_shape(s1, s2):
for a, b in zip(s1, s2):
if a != b:
return False

return True

if dependency.shape != 0 and not same_shape(grad.shape, local_grad.shape):

for i, dim in enumerate(dependency.shape):
if dim == 1:


If we focus on this part of the code, this handles any case where local_grad and grad aren't the same shape (they need to be in order for chain rule to be applied). This shape mismatch arises if there was any broadcasting performed in any of the calculations. Broadcasting is term used to describe how numpy would perform operations involving arrays of different shapes. You can read more about it on the numpy docs. All this part of the code does is sum across the broadcasted axis of grad, in order to reduce its shape to match the shape of local_grad.

dependency.gradient += local_grad


The gradient is then recorded to the gradient property of the dependency. It then continues the depth-first traversal by calling the get_gradients method on the dependency, passing through the gradient that was just calculated.

Overall, our code for autodiff should look like this...

import numpy as np
import string
import random

def id_generator(size=10, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))

np.seterr(invalid='ignore')

def is_matrix(o):
return type(o) == np.ndarray

def same_shape(s1, s2):
for a, b in zip(s1, s2):
if a != b:
return False

return True

class Tensor:
__array_priority__ = 1000
def __init__(self, value, trainable=True):
self.value = value
self.dependencies = []
self.shape = 0
self.matmul_product = False
self.trainable = trainable
self.id = id_generator()

if is_matrix(value):
self.shape = value.shape

def depends_on(self, target):
if self == target:
return True

dependencies = self.dependencies

for dependency in dependencies:
if dependency == target:
return True
elif dependency.depends_on(target):
return True

return False

def __mul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value * other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

def __rmul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value * other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value + other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value + other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

def __sub__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other)

var = Tensor(self.value - other.value)
var.dependencies.append(self)
var.dependencies.append(other)
return var

def __rsub__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(other.value - self.value)
var.dependencies.append(other)
var.dependencies.append(self)
return var

def __pow__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value ** other.value)
var.dependencies.append(self)
var.dependencies.append(other)

grad_wrt_self = other.value * self.value ** (other.value - 1)

grad_wrt_other = (self.value ** other.value) * np.log(self.value)

return var

def __rpow__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(other.value ** self.value)
var.dependencies.append(other)
var.dependencies.append(self)

grad_wrt_other = self.value * other.value ** (self.value - 1)

grad_wrt_self = (other.value ** self.value) * np.log(other.value)

return var

def __truediv__(self, other):
return self * (other ** -1)

def __rtruediv__(self, other):
return other * (self ** -1)

def __matmul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(self.value @ other.value)
var.dependencies.append(self)
var.dependencies.append(other)

var.matmul_product = True
return var

def __rmatmul__(self, other):
if not (isinstance(other, Tensor)):
other = Tensor(other, trainable=False)

var = Tensor(other.value @ self.value)
var.dependencies.append(other)
var.dependencies.append(self)

var.matmul_product = True

return var

if not self.depends_on(target):
return 0

if self == target:

if self.matmul_product:
if dependency == self.dependencies[0]:
else:
else:

if dim == 1:

if dependency.trainable:

if self.matmul_product:
if dependency == self.dependencies[0]:
else:
else:

for i, dim in enumerate(dependency.shape):
if dim == 1:

def __repr__(self):
return f"Tensor ({self.value})"


and it can be used like so...

a = Tensor(10)
b = Tensor(5)
c = 2
d = (a*b)**c


OUTPUT:
500.0 1000.0


## Thank you

Thank you for reading through all of this post! The code of this post can be seen in the Github repo linked at the start in autodiff.py.