Based on Paper The Matrix Calculus You Need For Deep Learning by Terence Parr and Jeremy Howard. Thanks for this paper.
The paper is beginner-friendly, but I wanted to write this blog to note down points which would make it easier to understand the paper much better. As we learn some topics which are slightly difficult, we find it to explain to a beginner, in a way we learnt, who may not know anything in that field, so this blog is for beginner.
Deep Learning is all about linear algebra and calculus. If you try to read any deep learning paper, matrics calculus is a needed component to understanding the concept. May be word need may not be the right word to use, since Jeremy's courses show how to become a world-class deep learning practitioner with only a minimal level of calculus,\ check fast.ai for courses.
I have written my understanding of paper in form of three blogs. This is part1 and check this website for two more parts.
Deep learning is the basically use of neurons with many layers. what does each neuron do??
Each neuron applies a function on input and gives an output. The activation of a single computation unit in a neural network is typically calculated using the dot product of an edge weight vector w with an input vector x plus a scalar bias (threshold):
z(x) = w · x + b
letters written bold are vectors. w is a vector
Function z(x) is called the unit's affine function and is followed by a rectified linear unit, which clips negative values to zero: max(0, z(x)). This computation takes place in neurons. Neural networks consist of many of these units, organized into multiple collections of neurons called layers. The activation of one layer's units becomes the input to the next layer's units. Math becomes simple when inputs, weights, and functions are treated as vectors, and the flow of values can be treated as matrix operations.
The most important math used here is differentiation, calculating the rate of change and optimizing the loss function to decrease error is the main purpose. Training phase is all about choosing weights w and bias b so that we get the desired output for all N inputs x. To do that, we minimize a loss function. To minimize the loss, we use SGD. Measuring how the output changes with respect to a change in weight is the same as calculating the (partial) derivative of the output w.r.t weight w. All of those require the partial derivative (the gradient) of activation(x) with respect to the model parameters w and b. Our goal is to gradually tweak w and b so that the overall loss function keeps getting smaller across all x inputs.
Wait!! is this it???
For complete blog look for in my website here