Ganesh Kumar

Posted on Jun 13

Why Do Neural Networks Need the Chain Rule? How do we apply it?

Hello, I'm Ganesh. I'm building git-lrc, an AI code reviewer that runs on every commit. It is free, unlimited, and source-available on GitHub. Star git-lrc on GitHub to help more developers discover the project. Do give it a try and share your feedback to help improve the product.

In the previous article, we introduced backpropagation and learned that neural networks improve by reducing prediction errors.

We also saw that backpropagation relies on two fundamental ideas:

The Chain Rule
Gradient Descent

But we haven't yet answered an important question:

How do we calculate the weights and biases that decrease the error?

To answer that, let's look at a very small neural network.

A Simple Neural Network

Imagine a neural network, similar to the one in the previous example, with:

One input neuron
Two hidden neurons
One output neuron

Calculating the Bias in the Last Layer

Let's assume we already have the weights and biases of the hidden layer, and we only want to find the last bias, b3.

From gradient descent, we can update the last bias b3 using the partial derivative of the loss with respect to b3.

The error is measured with residuals.
Residual = Observed - Predicted

SSR = Sum of (Observed - Predicted)^2

So, we take 3 samples for training: the starting, middle, and ending values.

Finally, we calculate the SSR over these samples.
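As a minimal sketch of the residual and SSR formulas above, the snippet below uses made-up numbers. The `observed` and `predicted` values and the helper names `residuals` and `ssr` are assumptions for illustration, not the training data used in this article.

```python
# Hypothetical observed values for three samples (start, middle, end).
observed = [0.0, 1.0, 0.0]
# Hypothetical network predictions for the same three samples.
predicted = [-2.60, -1.65, -2.60]


def residuals(observed, predicted):
    """Residual = Observed - Predicted, one value per sample."""
    return [o - p for o, p in zip(observed, predicted)]


def ssr(observed, predicted):
    """Sum of squared residuals over all samples."""
    return sum(r ** 2 for r in residuals(observed, predicted))


print(residuals(observed, predicted))
print(ssr(observed, predicted))
```

A smaller SSR means the predictions sit closer to the observed values, which is what training tries to achieve.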

Use of Chain Rule

So far, we have only said that gradient descent updates b3 using a derivative. To compute that derivative, we use the chain rule.

The predicted value is generated from the weights and biases of the previous layers:

Predicted = Top Layer + Bottom Layer + Bias (b3)

Using the chain rule, we can write the derivative of SSR with respect to b3 as:

dssr/db3 = dssr/dpredicted * dpredicted/db3

For a single sample, SSR = (Observed - Predicted)^2

Predicted is not a constant here; it is the variable we are differentiating with respect to, so we apply the chain rule once more.

dssr/dpredicted = 2*(Observed - Predicted) * d(Observed - Predicted)/dpredicted

dssr/dpredicted = 2*(Observed - Predicted)*(-1)
dssr/dpredicted = -2(Observed - Predicted)

For dpredicted/db3:

Predicted = Top Layer + Bottom Layer + Bias (b3)
Top Layer and Bottom Layer do not depend on b3, so both are constants for this calculation.
dpredicted/db3 = 1

Finally, dssr/db3 = -2*(Observed - Predicted) * 1
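One way to sanity-check this result is to compare it with a numerical derivative. The sketch below does that for a single sample; the values of `observed`, `top`, `bottom`, and `b3`, and the function names, are hypothetical and chosen only for illustration.

```python
def ssr_single(observed, top, bottom, b3):
    """Squared residual for one sample, with Predicted = top + bottom + b3."""
    predicted = top + bottom + b3
    return (observed - predicted) ** 2


def dssr_db3_single(observed, top, bottom, b3):
    """Chain-rule result: dssr/db3 = -2 * (Observed - Predicted) * 1."""
    predicted = top + bottom + b3
    return -2.0 * (observed - predicted) * 1.0


# Hypothetical values for one sample (not from the article's data).
observed, top, bottom, b3 = 1.0, -0.9, -0.75, 0.0

analytic = dssr_db3_single(observed, top, bottom, b3)

# Central finite difference: nudge b3 slightly in both directions
# and measure how much SSR changes.
eps = 1e-6
numeric = (
    ssr_single(observed, top, bottom, b3 + eps)
    - ssr_single(observed, top, bottom, b3 - eps)
) / (2 * eps)

print(analytic, numeric)
```

If the two printed numbers agree closely, the chain-rule derivation matches what the SSR function actually does when b3 changes.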

Slope Calculation and Learning

Now we have 3 predicted values, one for each of the 3 samples, so we sum the derivative over all of them:

dssr/db3 = Σ(-2*(Observed-Predicted))

dssr/db3 = -2 * [(Observed1 - Predicted1) * 1 + (Observed2 - Predicted2) * 1 + (Observed3 - Predicted3) * 1]

dssr/db3 = -2 * [(Residual1) + (Residual2) + (Residual3)]

dssr/db3 = -2 * (ResidualSum)

For our training data, starting from b3 = 0, I got slope = -15.7
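The slope formula can be sketched in a few lines of Python. The data here is hypothetical: `hidden_sum` stands for Top Layer + Bottom Layer for each sample, and the numbers were picked only so that the slope at b3 = 0 comes out at about -15.7. They are not the article's actual training data.

```python
def slope_b3(observed, hidden_sum, b3):
    """dssr/db3 = -2 * sum(Observed - Predicted) over all samples.

    Predicted = hidden_sum + b3, where hidden_sum is the combined
    output of the top and bottom hidden neurons for that sample.
    """
    residual_sum = sum(o - (h + b3) for o, h in zip(observed, hidden_sum))
    return -2.0 * residual_sum


# Hypothetical data, chosen so the residuals at b3 = 0 sum to 7.85.
observed = [0.0, 1.0, 0.0]
hidden_sum = [-2.60, -1.65, -2.60]

print(slope_b3(observed, hidden_sum, b3=0.0))
```

A negative slope means SSR goes down as b3 increases, so the next update should raise b3.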

step size = slope x learning rate

step size = -15.7 x 0.1 = -1.57

new b3 = old b3 - step size

new b3 = 0 - (-1.57) = 1.57

Then, recalculating the slope with the new b3, we get:

slope = -6.26

step size = -6.26 x 0.1 = -0.626

new b3 = 1.57 - (-0.626) = 2.196

Similarly, we repeat this calculation multiple times until the step size is close to 0.
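The repeated update can be sketched as a small loop. Everything here is an assumption for illustration: the data is made up (picked so the first slope is about -15.7), and the names `fit_b3`, `tol`, and `max_steps` are hypothetical. Gradient descent moves b3 against the slope, so the step size is subtracted from the old value.

```python
def slope_b3(observed, hidden_sum, b3):
    """dssr/db3 = -2 * sum(Observed - Predicted), Predicted = hidden_sum + b3."""
    return -2.0 * sum(o - (h + b3) for o, h in zip(observed, hidden_sum))


def fit_b3(observed, hidden_sum, b3=0.0, learning_rate=0.1,
           tol=1e-4, max_steps=1000):
    """Repeat the update until the step size is close to 0.

    Returns the final b3 and the list of b3 values after each update.
    """
    history = []
    for _ in range(max_steps):
        step_size = slope_b3(observed, hidden_sum, b3) * learning_rate
        if abs(step_size) < tol:
            break  # step size is close to 0, so we stop
        b3 = b3 - step_size  # move against the slope
        history.append(b3)
    return b3, history


# Hypothetical data: hidden_sum is Top Layer + Bottom Layer per sample.
observed = [0.0, 1.0, 0.0]
hidden_sum = [-2.60, -1.65, -2.60]

final_b3, history = fit_b3(observed, hidden_sum)
print(history[:2])
print(final_b3)
```

With these made-up numbers, the first update lands near 1.57 and each later step is smaller than the one before, because the slope shrinks as b3 approaches the value that minimizes SSR.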

Final Result
We found the optimal value:
b3 ≈ 2.61

Conclusion

We were able to apply the chain rule, gradient descent, and backpropagation to a very small neural network.

In the next article, we will discuss how to calculate the remaining weights and biases in the same neural network.

Any feedback or contributors are welcome! It’s online, source-available, and ready for anyone to use.
