<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rafay Khan</title>
    <description>The latest articles on DEV Community by Rafay Khan (@rafayak).</description>
    <link>https://dev.to/rafayak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F195770%2F10de8761-2d3c-4e87-9535-bba3ac3e7863.jpg</url>
      <title>DEV Community: Rafay Khan</title>
      <link>https://dev.to/rafayak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rafayak"/>
    <language>en</language>
    <item>
      <title>Nothing but NumPy: Understanding &amp; Creating Neural Networks with Computational Graphs from Scratch</title>
      <dc:creator>Rafay Khan</dc:creator>
      <pubDate>Wed, 17 Jul 2019 13:53:05 +0000</pubDate>
      <link>https://dev.to/rafayak/nothing-but-numpy-understanding-creating-neural-networks-with-computational-graphs-from-scratch-5983</link>
      <guid>https://dev.to/rafayak/nothing-but-numpy-understanding-creating-neural-networks-with-computational-graphs-from-scratch-5983</guid>
      <description>&lt;p&gt;Understanding new concepts can be hard, especially these days when there is an avalanche of resources with only cursory explanations for complex concepts. This blog is the result of a dearth of detailed walkthroughs on how to create neural networks in the form of computational graphs.&lt;/p&gt;

&lt;p&gt;In this, and some following, blog posts, I will consolidate all that I have learned as a way to give back to the community and help new entrants. I will be creating common forms of neural networks all with the help of nothing but &lt;a href="//www.numpy.org"&gt;NumPy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This blog post is divided into two parts: the first covers the basics of a neural network, and the second comprises the code for implementing everything learned in the first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part Ⅰ: Understanding a Neural Network
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Let's dig in🍽️
&lt;/h3&gt;

&lt;p&gt;Neural networks are a model inspired by how the brain works. Similar to neurons in the brain, our 'mathematical neurons' are also, intuitively, connected; they take inputs (dendrites), do some simple computation on them and produce outputs (axons).&lt;/p&gt;

&lt;p&gt;The best way to learn something is to build it. Let's start with a simple neural network and hand-solve it. This will give us an idea of how the computations flow through a neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdg7wckx2vkpsqkpcmztu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdg7wckx2vkpsqkpcmztu.png" alt="Fig 1. Simple input-output only neural network"&gt;&lt;/a&gt;Fig 1. Simple input-output only neural network &lt;/p&gt;

&lt;p&gt;Most of the time you will see a neural network depicted as in the figure above. But this succinct and simple-looking picture hides a bit of the complexity. Let's expand it out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftrcf17nn78z21xmtzffb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftrcf17nn78z21xmtzffb.png"&gt;&lt;/a&gt;Fig 2. Expanded neural network&lt;/p&gt;

&lt;p&gt;Now, let's go over each node in our graph and see what it represents.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxc990gikalygp2g11tu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxc990gikalygp2g11tu0.png"&gt;&lt;/a&gt;Fig 3. Inputs nodes &lt;em&gt;x₁&lt;/em&gt; and &lt;em&gt;x₂&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These nodes represent the inputs for our first and second features, &lt;strong&gt;&lt;em&gt;x₁&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;x₂&lt;/em&gt;&lt;/strong&gt;, which define a single example we feed to the neural network. Together they are called the &lt;strong&gt;&lt;em&gt;“Input Layer”&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuez5ds02tolxjm23c5jj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuez5ds02tolxjm23c5jj.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 4. Weights



&lt;p&gt;&lt;strong&gt;&lt;em&gt;w₁&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;w₂&lt;/em&gt;&lt;/strong&gt; represent our weights (in some neural network literature they are denoted with the theta symbol, &lt;strong&gt;&lt;em&gt;θ&lt;/em&gt;&lt;/strong&gt;). Intuitively, these dictate how much influence each of the input features should have in computing the next node. If you are new to this, think of them as playing a similar role to the ‘slope’ or ‘gradient’ constant in a linear equation.&lt;/p&gt;

&lt;p&gt;Weights are the main values our neural network has to “learn”. So initially, we will set them to &lt;strong&gt;&lt;em&gt;random values&lt;/em&gt;&lt;/strong&gt; and let the “learning algorithm” of our neural network decide the best weights that result in the correct outputs.&lt;/p&gt;

&lt;p&gt;Why random initialization? More on this later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd3oh7wlxb4nae0dvs9dr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd3oh7wlxb4nae0dvs9dr.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 5. Linear operation



&lt;p&gt;This node represents a linear function. Simply put, it takes all the inputs coming into it and forms a linear combination of them, here z = w₁x₁ + w₂x₂. (By convention, a linear combination of weights and inputs is understood to be part of every node except the input nodes in the input layer, so this node is often omitted in figures, as in Fig. 1. In this example, I’ll leave it in.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6wze2pnuj3fpcp94rjqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6wze2pnuj3fpcp94rjqk.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 6. Sigmoid node



&lt;p&gt;This &lt;strong&gt;&lt;em&gt;σ&lt;/em&gt;&lt;/strong&gt; node takes the input and passes it through the following function, called the &lt;em&gt;sigmoid function&lt;/em&gt; (because of its S-shaped curve), also known as the &lt;em&gt;logistic function&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F038gcuo063bf1zzdvhnq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F038gcuo063bf1zzdvhnq.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 7. Sigmoid(Logistic) function



&lt;p&gt;Sigmoid is one of the many “activation functions” used in neural networks. The job of an activation function is to map its input to a different range. For example, σ(2) ≈ 0.88 and σ(-2) ≈ 0.12, and as z moves further from zero, σ(z) gets ever closer to 1 (for large positive z) or to 0 (for large negative z). So, the sigmoid function squashes the output range to (0, 1). (The ‘()’ notation implies exclusive boundaries: the function never outputs exactly 0 or 1, because it only approaches those values asymptotically.)&lt;/p&gt;
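&lt;p&gt;To get a feel for the squashing behaviour, here is a minimal NumPy sketch of the sigmoid. The function name &lt;code&gt;sigmoid&lt;/code&gt; and the sample points are my own choices for illustration, not part of the figures above.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Evaluate at a few hand-picked points to see the squashing effect.
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

&lt;p&gt;σ(0) is exactly 0.5, and the outputs creep toward 0 and 1 as z moves away from zero in either direction, without ever reaching them.&lt;/p&gt;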

&lt;p&gt;&lt;strong&gt;Since the σ node is the last node in our neural network, it performs the function of output&lt;/strong&gt;. The predicted output is denoted by &lt;strong&gt;&lt;em&gt;ŷ&lt;/em&gt;&lt;/strong&gt;. (Note: in some neural network literature this is denoted by &lt;strong&gt;&lt;em&gt;‘h(θ)’&lt;/em&gt;&lt;/strong&gt;, where ‘h’ is called the hypothesis, i.e. this is the hypothesis of the neural network, a.k.a. the output prediction, given parameters θ, where θ are the weights of the neural network.)&lt;/p&gt;




&lt;p&gt;Now that we know what each node represents, let’s flex our muscles by computing each node by hand on some dummy data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzrgio4rdgugo3wejmodl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzrgio4rdgugo3wejmodl.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 8. OR gate



&lt;p&gt;The data above represents an &lt;strong&gt;OR&lt;/strong&gt; gate (output 1 if any input is 1). Each row of the table represents an ‘example’ we want our neural network to learn from. After learning from the given examples, we want our neural network to perform the function of an OR gate: given the input features, &lt;strong&gt;&lt;em&gt;x₁&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;x₂&lt;/em&gt;&lt;/strong&gt;, it should output the corresponding &lt;strong&gt;&lt;em&gt;y&lt;/em&gt;&lt;/strong&gt; (also called the ‘label’). I have also plotted the points on a 2-D plane so that they are easy to visualize (green crosses represent points where the output (&lt;strong&gt;y&lt;/strong&gt;) is &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt; and the red dot represents the point where the output is &lt;strong&gt;&lt;em&gt;0&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;This OR-gate data is particularly interesting, as it is &lt;strong&gt;&lt;em&gt;linearly separable&lt;/em&gt;&lt;/strong&gt;, i.e. we can draw a straight line to separate the green crosses from the red dot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhpcbca884v4wahor0m2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhpcbca884v4wahor0m2e.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 9. Showing that the OR gate data is linearly separable
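&lt;p&gt;Linear separability is easy to check in code. The sketch below stores the OR table as NumPy arrays and tests one candidate line, x₁ + x₂ = 0.5. That particular line, the row order and the variable names are assumptions of mine; any line passing between the red dot and the green crosses would work just as well.&lt;/p&gt;

```python
import numpy as np

# OR-gate examples: each row is one example [x1, x2]; the row order is an assumption.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

# Hypothetical separating line x1 + x2 = 0.5.
side = np.sign(X[:, 0] + X[:, 1] - 0.5)   # -1 on one side of the line, +1 on the other
predicted = ((side + 1) / 2).astype(int)  # map -1 to 0 and +1 to 1

print(predicted)                          # [0 1 1 1]
print(np.array_equal(predicted, y))       # True
```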



&lt;p&gt;We’ll shortly see how our simple neural network performs this task.&lt;/p&gt;

&lt;p&gt;Data flows from left to right in our neural network. In technical terms, this process is called &lt;strong&gt;‘forward propagation’&lt;/strong&gt;; the result computed at each node is forwarded to the next node it is connected to.&lt;/p&gt;

&lt;p&gt;Let’s go through all the computations our neural network will perform on the first example, &lt;strong&gt;&lt;em&gt;x₁=0&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;x₂=0&lt;/em&gt;&lt;/strong&gt;. Also, we’ll initialize the weights to &lt;strong&gt;&lt;em&gt;w₁=0.1&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;w₂=0.6&lt;/em&gt;&lt;/strong&gt; (recall, these weights have been randomly selected).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffklrrwsvg0alxem7zqds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffklrrwsvg0alxem7zqds.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 10. Forward propagation of the first example from OR table data



&lt;p&gt;With our current weights, &lt;strong&gt;&lt;em&gt;w₁= 0.1&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;w₂ = 0.6&lt;/em&gt;&lt;/strong&gt;, our network’s output is a bit far from where we’d like it to be. The predicted output, &lt;strong&gt;&lt;em&gt;ŷ&lt;/em&gt;&lt;/strong&gt;, should be &lt;strong&gt;&lt;em&gt;ŷ≈0&lt;/em&gt;&lt;/strong&gt; for &lt;strong&gt;&lt;em&gt;x₁=0&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;x₂=0&lt;/em&gt;&lt;/strong&gt;, but right now it is &lt;strong&gt;&lt;em&gt;ŷ=0.5&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
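&lt;p&gt;The same forward pass takes only a few lines of NumPy. This is a sketch under the assumption that the linear node computes z = w₁x₁ + w₂x₂ (no bias yet); the variable names are mine.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 0.0])   # first example: x1 = 0, x2 = 0
w = np.array([0.1, 0.6])   # initial weights w1, w2 from the text

z = np.dot(w, x)           # linear node: z = w1*x1 + w2*x2
y_hat = sigmoid(z)         # sigmoid node gives the prediction
print(z, y_hat)            # 0.0 0.5
```

&lt;p&gt;Because both inputs are zero, z is zero regardless of the weights, and σ(0) = 0.5.&lt;/p&gt;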

&lt;p&gt;So, how does one tell a neural network how far it is from our desired output? In comes the &lt;strong&gt;&lt;em&gt;Loss Function&lt;/em&gt;&lt;/strong&gt; to the rescue.&lt;/p&gt;




&lt;h3&gt;
  
  
  Loss Function
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;em&gt;Loss Function&lt;/em&gt;&lt;/strong&gt; is a simple equation that tells us how far our neural network’s predicted output(&lt;strong&gt;&lt;em&gt;ŷ&lt;/em&gt;&lt;/strong&gt;) is from our desired output(&lt;strong&gt;&lt;em&gt;y&lt;/em&gt;&lt;/strong&gt;), &lt;strong&gt;&lt;em&gt;for ONE example, only&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;em&gt;derivative&lt;/em&gt;&lt;/strong&gt; of the loss function dictates whether to increase or decrease the weights. A positive derivative means the weights should be decreased, and a negative derivative means they should be increased. &lt;strong&gt;&lt;em&gt;The steeper the slope, the more incorrect the prediction was.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3jkzbbd7jyqr74lo4und.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3jkzbbd7jyqr74lo4und.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 11. Loss function visualized



&lt;p&gt;&lt;em&gt;The Loss function curve depicted in Figure 11 is an idealized version. In real-world cases, the Loss function may not be so smooth, with some bumps and saddle points along the way to the minimum.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are many different kinds of loss functions &lt;strong&gt;&lt;em&gt;each essentially calculating the error between predicted output and desired output&lt;/em&gt;&lt;/strong&gt;. Here we’ll use one of the simplest loss functions, the &lt;strong&gt;&lt;em&gt;Squared-Error Loss function&lt;/em&gt;&lt;/strong&gt;. Defined as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F84djv30dhr6w7vpwjxsf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F84djv30dhr6w7vpwjxsf.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 12. Loss Function. Calculating error for a single example



&lt;p&gt;Taking the &lt;strong&gt;square keeps everything nice and positive and the fraction (1/2) is there so that it cancels out when taking the derivative of the squared term&lt;/strong&gt; &lt;em&gt;(it is common among some machine learning practitioners to leave the fraction out)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Intuitively, the Squared-Error Loss function helps us in minimizing the vertical distance between our predictor line (blue line) and the actual data (green dot). Behind the scenes, this predictor line is our &lt;strong&gt;&lt;em&gt;z&lt;/em&gt;&lt;/strong&gt; (linear function) node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Faxy337v232iae90ojm7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Faxy337v232iae90ojm7q.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 13. Visualization of the effect of the Loss function







&lt;p&gt;Now that we know the purpose of a Loss function, let’s calculate the error in our current prediction &lt;strong&gt;&lt;em&gt;ŷ=0.5&lt;/em&gt;&lt;/strong&gt;, given &lt;strong&gt;&lt;em&gt;y=0&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsd6z3v6ljg4ne9165dg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsd6z3v6ljg4ne9165dg9.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 14. Loss calculated for 1ˢᵗ example
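&lt;p&gt;As a quick check of the arithmetic, here is the Squared-Error Loss as a small Python function. The name &lt;code&gt;squared_error_loss&lt;/code&gt; is hypothetical, chosen only for this sketch.&lt;/p&gt;

```python
def squared_error_loss(y_hat, y):
    """Squared-error loss for a single example, including the 1/2 factor."""
    return 0.5 * (y_hat - y) ** 2

loss = squared_error_loss(y_hat=0.5, y=0.0)
print(loss)  # 0.125
```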



&lt;p&gt;As we can see, the Loss is 0.125. Given this, &lt;em&gt;we can now use the derivative of the Loss function to check whether we need to increase or decrease our weights.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This process is called &lt;strong&gt;&lt;em&gt;backpropagation&lt;/em&gt;&lt;/strong&gt;, as we’ll be doing the opposite of the forward phase. Instead of going from input to output we’ll track backward from output to input. Simply, backpropagation allows us to figure out how much of the Loss each part of the neural network was responsible for.&lt;/p&gt;




&lt;p&gt;To perform backpropagation we’ll employ the following technique: &lt;em&gt;at each node, we compute only the local gradient (the partial derivatives of that node’s output with respect to its inputs). Then, during backpropagation, we take the numerical gradient values arriving from upstream, multiply them by the local gradients, and pass the results on to the respective connected nodes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flg1qm4usalpqidlo20wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flg1qm4usalpqidlo20wv.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 15. Gradient Flow



&lt;p&gt;This is a generalization of the &lt;strong&gt;&lt;em&gt;chain rule&lt;/em&gt;&lt;/strong&gt; from calculus.&lt;/p&gt;
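&lt;p&gt;Written out for our small network, and assuming the linear node computes z = w₁x₁ + w₂x₂ and the output is ŷ = σ(z), the chain rule gives the following products of local gradients:&lt;/p&gt;

```latex
\frac{\partial L}{\partial w_1}
  = \frac{\partial L}{\partial \hat{y}}
    \cdot \frac{\partial \hat{y}}{\partial z}
    \cdot \frac{\partial z}{\partial w_1},
\qquad
\frac{\partial L}{\partial w_2}
  = \frac{\partial L}{\partial \hat{y}}
    \cdot \frac{\partial \hat{y}}{\partial z}
    \cdot \frac{\partial z}{\partial w_2}
```

&lt;p&gt;Each factor is the local gradient of one node, so the backward pass only ever needs to multiply the value coming from upstream by the next local gradient.&lt;/p&gt;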




&lt;p&gt;Since &lt;strong&gt;&lt;em&gt;ŷ&lt;/em&gt;&lt;/strong&gt;(predicted label) dictates our &lt;strong&gt;&lt;em&gt;Loss&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;y&lt;/em&gt;&lt;/strong&gt;(actual label) is constant, for a single example, &lt;em&gt;we will take the partial derivative of Loss with respect to &lt;strong&gt;ŷ&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1ayk7soj9f0gwy1u36ec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1ayk7soj9f0gwy1u36ec.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 16. The partial derivative of Loss w.r.t ŷ



&lt;p&gt;Since the backpropagation steps can seem a bit complicated I’ll go over them step by step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmth4xvo1pd192vtntp2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmth4xvo1pd192vtntp2g.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 17.a. Backpropagation






&lt;p&gt;For the next calculation, we’ll need the derivative of the sigmoid function, since it forms the local gradient of the red node. Let’s derive that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8olcek7z3blol39qc4qm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8olcek7z3blol39qc4qm.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv4j7zwks2mdzdv2nrem4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv4j7zwks2mdzdv2nrem4.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F10fwknidljvf4ft3f369.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F10fwknidljvf4ft3f369.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 18. The derivative of the Sigmoid function.
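&lt;p&gt;The closed form of the sigmoid derivative, σ′(z) = σ(z)(1 − σ(z)), can be sanity-checked numerically. The sketch below compares it with a central finite difference; the function names, the step size &lt;code&gt;h&lt;/code&gt; and the test points are my own choices.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Closed form: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare the closed form with a central finite difference at a few points.
h = 1e-5
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    print(z, sigmoid_derivative(z), numeric)
```

&lt;p&gt;At z = 0 the derivative is 0.25, the steepest point of the curve.&lt;/p&gt;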



&lt;p&gt;Let’s use this in the next backward calculation&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fo133o9x62icgeo6285ya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fo133o9x62icgeo6285ya.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 17.b. Backpropagation



&lt;p&gt;The backward computations should not propagate all the way to inputs as we don’t want to change our input data(i.e. red arrows should not go to green nodes). We only want to change the weights associated with inputs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftaomfov3or7hvnqzzetr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftaomfov3or7hvnqzzetr.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 17.c. Backpropagation



&lt;p&gt;Notice something weird? &lt;em&gt;The derivatives of the Loss with respect to the weights, w₁ and w₂, are ZERO!&lt;/em&gt; We can’t increase or decrease the weights if their derivatives are zero. So then, how do we get our desired output in this instance if we can’t figure out how to adjust the weights? &lt;em&gt;The key thing to note here is that the local gradients (&lt;strong&gt;&lt;em&gt;∂z/∂w₁&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;∂z/∂w₂&lt;/em&gt;&lt;/strong&gt;) are &lt;strong&gt;&lt;em&gt;x₁&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;x₂&lt;/em&gt;&lt;/strong&gt;&lt;/em&gt;, both of which, in this example, happen to be zero (i.e. they provide no information).&lt;/p&gt;
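&lt;p&gt;The backward pass above can be reproduced in NumPy to confirm the zero gradients. This is a sketch assuming the loss ½(ŷ − y)² and the bias-free linear node z = w₁x₁ + w₂x₂; the &lt;code&gt;dL_*&lt;/code&gt; variable names are mine.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 0.0])   # first OR example
w = np.array([0.1, 0.6])   # initial weights from the text
y = 0.0                    # desired output

# Forward pass
z = np.dot(w, x)
y_hat = sigmoid(z)

# Backward pass: upstream gradient times local gradient at every node
dL_dyhat = y_hat - y                      # derivative of 1/2 * (y_hat - y)^2
dL_dz = dL_dyhat * y_hat * (1.0 - y_hat)  # through the sigmoid node
dL_dw = dL_dz * x                         # local gradient of z w.r.t. w is x

print(dL_dyhat, dL_dz, dL_dw)             # 0.5 0.125 [0. 0.]
```

&lt;p&gt;The gradient reaching the linear node is 0.125, but multiplying it by x₁ = x₂ = 0 wipes it out before it reaches the weights.&lt;/p&gt;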

&lt;p&gt;This brings us to the concept of &lt;strong&gt;&lt;em&gt;bias&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Bias
&lt;/h3&gt;

&lt;p&gt;Recall the equation of a line from your high school days.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxqdeukildqe895m8kc4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxqdeukildqe895m8kc4p.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 19. Equation of a Line



&lt;p&gt;Here &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; is the bias term. Intuitively, the bias tells us that all outputs computed with &lt;strong&gt;&lt;em&gt;x&lt;/em&gt;&lt;/strong&gt; (the &lt;em&gt;independent variable&lt;/em&gt;) should have an additive bias of &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;. So, when &lt;strong&gt;&lt;em&gt;x=0&lt;/em&gt;&lt;/strong&gt; (no information coming from the &lt;em&gt;independent variable&lt;/em&gt;), the output should be biased to just &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note that without the bias term a line can only pass through the origin(0, 0) and the only differentiating factor between lines would then be the gradient &lt;strong&gt;&lt;em&gt;m&lt;/em&gt;&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd4nnz0oc1dh6y99ve7y0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd4nnz0oc1dh6y99ve7y0.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 20. Lines from origin





&lt;p&gt;So, using this new information, let’s add another node to our neural network: the bias node. &lt;em&gt;(In neural network literature, every layer, except the input layer, is assumed to have a bias node, just like the linear node, so this node is also often omitted in figures.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fb6plufpnsckqddkbdigk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fb6plufpnsckqddkbdigk.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 21. Expanded neural network with a bias node



&lt;p&gt;Now let’s do a forward propagation with the same example, &lt;strong&gt;&lt;em&gt;x₁=0&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;x₂=0&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;y=0&lt;/em&gt;&lt;/strong&gt;, set the bias to &lt;strong&gt;&lt;em&gt;b=0&lt;/em&gt;&lt;/strong&gt; (&lt;em&gt;the initial bias is commonly set to zero, rather than to a random number&lt;/em&gt;), and let the backpropagation of the Loss figure out the bias.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9qo6hllwukkd0ltnao5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9qo6hllwukkd0ltnao5i.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 22. Forward propagation of the first example from OR table data with a bias unit



&lt;p&gt;Well, the forward propagation with a bias of &lt;strong&gt;&lt;em&gt;“b=0”&lt;/em&gt;&lt;/strong&gt; didn’t change our output at all, but let’s do the backward propagation before we make our final judgment.&lt;/p&gt;

&lt;p&gt;As before let’s go through backpropagation in a step by step manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhbb22sxihz72ww7oary6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhbb22sxihz72ww7oary6.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 23.a. Backpropagation with bias



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fg2ai773u9lcypjdcb9zh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fg2ai773u9lcypjdcb9zh.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 23.b. Backpropagation with bias



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frwax3w4cbiubhe3n7blt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frwax3w4cbiubhe3n7blt.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 23.c. Backpropagation with bias
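&lt;p&gt;The three backward steps with the bias node can be sketched in NumPy as follows. I am assuming the linear node now computes z = w₁x₁ + w₂x₂ + b, so the local gradient of z with respect to b is 1; the variable names are mine.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 0.0])   # first OR example
w = np.array([0.1, 0.6])   # initial weights from the text
b = 0.0                    # initial bias
y = 0.0                    # desired output

# Forward pass with the bias added to the linear node
z = np.dot(w, x) + b
y_hat = sigmoid(z)

# Backward pass
dL_dyhat = y_hat - y
dL_dz = dL_dyhat * y_hat * (1.0 - y_hat)
dL_dw = dL_dz * x          # still zero, because the inputs are zero
dL_db = dL_dz * 1.0        # local gradient of z w.r.t. b is 1

print(dL_dw, dL_db)        # [0. 0.] 0.125
```

&lt;p&gt;Unlike the weights, the bias receives the upstream gradient unchanged, so it can still be adjusted even when every input is zero.&lt;/p&gt;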



&lt;p&gt;Hurrah! We just figured out how much to adjust the bias. Since the derivative of the Loss with respect to the bias (&lt;strong&gt;&lt;em&gt;∂L/∂b&lt;/em&gt;&lt;/strong&gt;) is positive 0.125, we will need to adjust the bias by moving in the negative direction of the gradient (recall the curve of the Loss function from before). This is technically called &lt;strong&gt;&lt;em&gt;gradient descent&lt;/em&gt;&lt;/strong&gt;, as we are “descending” away from the sloping region to a flat region using the direction of the gradient. Let’s do that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frl0e9orqxecso5kuqhb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frl0e9orqxecso5kuqhb8.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 24. Calculated new bias using gradient descent



&lt;p&gt;Now that we’ve slightly adjusted the bias to &lt;strong&gt;&lt;em&gt;b=-0.125&lt;/em&gt;&lt;/strong&gt;, let’s test if we’ve done the right thing by doing a &lt;strong&gt;&lt;em&gt;forward propagation&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;checking the new Loss&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9domolr0n2gjthk0tbt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9domolr0n2gjthk0tbt8.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 25. Forward propagation with newly calculated bias



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftflu4ss167ncbsh9j6db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftflu4ss167ncbsh9j6db.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 26. Loss after newly calculated bias



&lt;p&gt;Now our predicted output is &lt;strong&gt;&lt;em&gt;ŷ≈0.469&lt;/em&gt;&lt;/strong&gt; (&lt;em&gt;rounded to 3 decimal places&lt;/em&gt;). That’s a slight improvement from the previous 0.5, and the Loss is down from 0.125 to around &lt;strong&gt;&lt;em&gt;0.109&lt;/em&gt;&lt;/strong&gt;. This slight correction is something that the neural network has ‘learned’ just by comparing its predicted output with the desired output, &lt;strong&gt;&lt;em&gt;y&lt;/em&gt;&lt;/strong&gt;, and then moving in the direction opposite to the gradient. Pretty cool, right?&lt;/p&gt;
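&lt;p&gt;The whole cycle above (forward pass, backward pass, one gradient descent step, forward pass again) can be sketched in a few lines of plain Python. This is only a sketch under assumptions: it assumes a sigmoid activation and a Loss of ½(y−ŷ)², which reproduce the 0.5, 0.125 and 0.469 quoted above, and the starting weights are placeholder values that have no effect on this example because both inputs are 0.&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error_loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

# The single training example used so far: x1 = 0, x2 = 0, desired output y = 0.
x1, x2, y = 0.0, 0.0, 0.0
w1, w2, b = 0.1, 0.6, 0.0  # assumed starting values; the weights are irrelevant when both inputs are 0

# Forward propagation
y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
old_loss = squared_error_loss(y, y_hat)

# Backpropagation down to the bias (chain rule)
dL_dyhat = y_hat - y                  # derivative of 0.5 * (y - y_hat)^2 w.r.t. y_hat
dyhat_dz = y_hat * (1.0 - y_hat)      # derivative of the sigmoid
dL_db = dL_dyhat * dyhat_dz * 1.0     # dz/db = 1

# Gradient descent: move one full step against the gradient
new_b = b - dL_db

# Forward propagation with the new bias
new_y_hat = sigmoid(w1 * x1 + w2 * x2 + new_b)
new_loss = squared_error_loss(y, new_y_hat)

print(f"dL/db = {dL_db}, new b = {new_b}")
print(f"y_hat: {y_hat:.3f} to {new_y_hat:.3f}, Loss: {old_loss:.3f} to {new_loss:.3f}")
```

&lt;p&gt;Under these assumptions the prediction moves from 0.5 to about 0.469 and the Loss drops from 0.125 to roughly 0.11.&lt;/p&gt;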

&lt;p&gt;Now you may be wondering: this is only a small improvement over the previous result, so how do we get to the minimum Loss? Two things come into play: &lt;strong&gt;&lt;em&gt;a) how many iterations of ‘training’ we perform&lt;/em&gt;&lt;/strong&gt; (each training cycle is forward propagation followed by backward propagation and an update of the weights and bias through gradient descent), and &lt;strong&gt;&lt;em&gt;b) the learning rate&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Learning rate??? What’s that? Let’s talk about it.&lt;/p&gt;


&lt;h3&gt;
  
  
  Learning Rate
&lt;/h3&gt;

&lt;p&gt;Recall how we calculated the new bias above by moving in the direction opposite to the gradient (i.e. &lt;strong&gt;&lt;em&gt;gradient descent&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fazz60l4q4lxaa26e96ud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fazz60l4q4lxaa26e96ud.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 27. The equation for updating bias



&lt;p&gt;Notice that when we updated the bias we moved &lt;strong&gt;&lt;em&gt;1 step in the opposite direction of the gradient.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fifw5hnx1llzukzd120yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fifw5hnx1llzukzd120yn.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 28. The equation for updating bias showing "step"



&lt;p&gt;We could have moved 0.5, 0.9, 2, 3 or whatever multiple of a step we desired in the opposite direction of the gradient. This &lt;em&gt;‘number of steps’&lt;/em&gt;, the factor that scales the gradient, is what we define as the &lt;strong&gt;&lt;em&gt;learning rate&lt;/em&gt;&lt;/strong&gt;, often denoted by &lt;strong&gt;&lt;em&gt;α&lt;/em&gt;&lt;/strong&gt; (alpha).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F45w97es9xnz9mw725hcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F45w97es9xnz9mw725hcm.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 29. The general equation for gradient descent
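&lt;p&gt;In code, the general update rule is a one-liner. The helper name &lt;code&gt;gradient_descent_step&lt;/code&gt; is made up for this illustration; the bias and gradient values are the ones from the worked example above.&lt;/p&gt;

```python
def gradient_descent_step(param, grad, alpha):
    """Move `param` against its gradient, scaled by the learning rate alpha."""
    return param - alpha * grad

b, dL_db = 0.0, 0.125  # bias and its gradient from the worked example

for alpha in (0.5, 1.0, 2.0):
    print(f"alpha = {alpha}: new b = {gradient_descent_step(b, dL_db, alpha)}")
```

&lt;p&gt;With α=1 this gives the same &lt;strong&gt;&lt;em&gt;b=-0.125&lt;/em&gt;&lt;/strong&gt; as before; other values of α simply scale how far we move.&lt;/p&gt;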



&lt;p&gt;Learning rate defines how quickly we reach the minimum loss. Let’s visualize below what the learning rate is doing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fn2maila4cpxvx3w57j5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fn2maila4cpxvx3w57j5h.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 30. Visualizing the effect of learning rate.



&lt;p&gt;As you can see, with a lower learning rate (α=0.5) our descent along the curve is slower and we take many steps to reach the minimum point. On the other hand, with a higher learning rate (α=5) we take much bigger steps and reach the minimum point much faster.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The keen-eyed may have noticed that the gradient descent steps (green arrows) keep getting smaller as we get closer and closer to the minimum. Why is that? Recall that the learning rate is multiplied by the gradient at that point along the curve; as we descend from the sloping regions to the flatter regions of the u-shaped curve near the minimum point, the gradient keeps getting smaller and smaller, and thus the steps also get smaller. Therefore, changing the learning rate during training is not strictly necessary (some variations of gradient descent start with a high learning rate to descend quickly down the slope and then reduce it gradually; this is called “annealing the learning rate”).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So what’s the takeaway? Just set the learning rate as high as possible and reach the optimum loss quickly? NO. The learning rate can be a double-edged sword. With too high a learning rate, the parameters (weights/biases) don’t reach the optimum and instead start to diverge away from it. With too small a learning rate, the parameters take too long to converge to the optimum.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fiol0nbqlpr42ms0im189.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fiol0nbqlpr42ms0im189.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 31. Visualizing the effect of very low vs. very high learning rate.



&lt;p&gt;That a small learning rate (α=5*10⁻¹⁰) results in numerous steps to reach the minimum point is self-explanatory: multiplying the gradient by a small number (α) results in a proportionally small step.&lt;/p&gt;

&lt;p&gt;That a large learning rate (α=50) causes gradient descent to diverge may be confounding, but the answer is quite simple. Note that at each step gradient descent approximates its path downward by moving in straight lines (green arrows in the figures); in short, it estimates its path downwards. When the learning rate is too high, we force gradient descent to take larger steps. Larger steps tend to overestimate the path downwards and shoot past the minimum point; then, to correct the bad estimate, gradient descent tries to move back towards the minimum point but again overshoots it due to the large learning rate. This cycle of continuous overestimates eventually causes the results to diverge (the Loss after each training cycle increases instead of decreasing).&lt;/p&gt;
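&lt;p&gt;The three regimes (too small, reasonable, too large) are easy to reproduce on a toy curve. The sketch below does not use the curve or the α values from the figures; it assumes a made-up u-shaped loss, L(b) = b², whose gradient is 2b, and learning rates picked to suit that toy curve.&lt;/p&gt;

```python
def descend(alpha, start=4.0, steps=20):
    """Run gradient descent on the toy loss L(b) = b**2 and return every visited b."""
    b = start
    path = [b]
    for _ in range(steps):
        grad = 2.0 * b          # dL/db for L(b) = b**2
        b = b - alpha * grad    # gradient descent update
        path.append(b)
    return path

for alpha in (0.01, 0.4, 1.1):
    final_b = descend(alpha)[-1]
    print(f"alpha = {alpha}: b after 20 steps = {final_b:.4f}, loss = {final_b ** 2:.4f}")
```

&lt;p&gt;On this toy curve α=0.01 is still far from the minimum at b=0 after 20 steps, α=0.4 is essentially there, and α=1.1 overshoots further on every step, so the loss grows instead of shrinking.&lt;/p&gt;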

&lt;p&gt;The learning rate is what’s called a &lt;strong&gt;&lt;em&gt;hyper-parameter&lt;/em&gt;&lt;/strong&gt;. Hyper-parameters are parameters that the neural network essentially can’t learn through backpropagation of gradients; they have to be hand-tuned by the creator of the neural network model, according to the problem and its dataset. &lt;em&gt;(The choice of the Loss function, above, is also a hyper-parameter.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In short, the goal is not to find the “perfect learning rate” but instead a learning rate large enough that the neural network trains successfully and efficiently without diverging.&lt;/p&gt;



&lt;p&gt;So far we’ve only used one example (&lt;strong&gt;&lt;em&gt;x₁=0&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;x₂=0&lt;/em&gt;&lt;/strong&gt;) to adjust our weights and bias (&lt;em&gt;actually, only our bias up till now&lt;/em&gt;🙃), and that reduced the loss on one example from our entire dataset (the OR gate table). But we have more than one example to learn from, and we want to reduce our loss across all of them. &lt;strong&gt;&lt;em&gt;Ideally, in one training iteration, we would like to reduce our loss across all the training examples.&lt;/em&gt;&lt;/strong&gt; This is called &lt;strong&gt;&lt;em&gt;Batch Gradient Descent&lt;/em&gt;&lt;/strong&gt; (or full batch gradient descent), as we use the entire batch of training examples per training iteration to improve our weights and biases. &lt;em&gt;(Other forms are &lt;strong&gt;&lt;em&gt;mini-batch gradient descent&lt;/em&gt;&lt;/strong&gt;, where we use a subset of the data set in each iteration, and &lt;strong&gt;&lt;em&gt;stochastic gradient descent&lt;/em&gt;&lt;/strong&gt;, where we only use one example per training iteration, as we’ve done so far.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A training iteration in which the neural network goes through all the training examples is called an &lt;strong&gt;&lt;em&gt;Epoch&lt;/em&gt;&lt;/strong&gt;. If using mini-batches, then an epoch is complete after the neural network goes through all the mini-batches; similarly for stochastic gradient descent, where a batch is just one example.&lt;/em&gt;&lt;/p&gt;
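&lt;p&gt;To make the difference between the three flavours concrete, here is a small sketch that only counts how many parameter updates one epoch contains for each batch size on the four-row OR gate table. The helper &lt;code&gt;iterate_batches&lt;/code&gt; is a hypothetical name for this illustration, and real mini-batch training would usually also shuffle the examples, which is left out here.&lt;/p&gt;

```python
def iterate_batches(examples, batch_size):
    """Yield consecutive batches of `batch_size` examples; the last one may be smaller."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

# OR gate table as ((x1, x2), y) pairs
or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

for name, batch_size in (("batch", 4), ("mini-batch", 2), ("stochastic", 1)):
    updates_per_epoch = sum(1 for _ in iterate_batches(or_gate, batch_size))
    print(f"{name} gradient descent: {updates_per_epoch} parameter update(s) per epoch")
```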

&lt;p&gt;Before we proceed further we need to define something called a &lt;strong&gt;&lt;em&gt;Cost Function&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Cost Function
&lt;/h3&gt;

&lt;p&gt;When we perform &lt;em&gt;“batch gradient descent”&lt;/em&gt; we need to slightly change our Loss function to accommodate not just one example but all the examples in the batch. This adjusted Loss function is called the &lt;strong&gt;&lt;em&gt;Cost Function&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Also, note that the curve of the Cost Function is similar to the curve of the Loss function(same U-Shape).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Instead of calculating the Loss on one example, the Cost function calculates the average Loss across ALL the examples.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl3xxpazqx73u2fvqttwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl3xxpazqx73u2fvqttwn.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 32. Cost function



&lt;p&gt;Intuitively, the Cost function expands the capability of the Loss function. Recall how the Loss function was helping to minimize the vertical distance between a &lt;em&gt;single&lt;/em&gt; data point and the predictor line (&lt;strong&gt;&lt;em&gt;z&lt;/em&gt;&lt;/strong&gt;). &lt;strong&gt;The Cost function helps to minimize the vertical distance (Squared Error Loss) between multiple data points and the predictor line, concurrently.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fknqchtiau983lgpshu6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fknqchtiau983lgpshu6z.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 33. Visualization of the effect of the Cost function



&lt;p&gt;&lt;strong&gt;&lt;em&gt;During batch gradient descent we’ll use the derivative of the Cost function, instead of the Loss function&lt;/em&gt;&lt;/strong&gt;, to guide our path to minimum cost across all examples. &lt;em&gt;(In some neural network literature, the Cost Function is at times also represented with the letter &lt;strong&gt;&lt;em&gt;‘J’&lt;/em&gt;&lt;/strong&gt;.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s take a look at how the derivative equation of the Cost function differs from the plain derivative of the Loss function.&lt;/p&gt;
&lt;h4&gt;
  
  
  The derivative of Cost Function
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fezjopewoucgicbu7hoco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fezjopewoucgicbu7hoco.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 34. Cost function showing it takes input vectors



&lt;p&gt;Taking the derivative of this Cost function, which takes vectors as inputs and sums them, can be a bit dicey. So, let’s start out on a simple example before we generalize the derivative.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6wpn2i6ncdn6ejva7dnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6wpn2i6ncdn6ejva7dnk.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 35. Calculation of Cost on a simple vectorized example



&lt;p&gt;Nothing new here in the calculation of the Cost. Just as expected the Cost, in the end, is the average of the Loss, but the implementation is now vectorized &lt;em&gt;(we performed vectorized subtraction followed by element-wise exponentiation, called Hadamard exponentiation)&lt;/em&gt;. Let’s derive the partial derivatives.&lt;/p&gt;
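&lt;p&gt;A minimal NumPy sketch of the same idea follows. The three target/prediction pairs are made-up values for illustration (not the ones in the figure), and the Cost is assumed to be the average of the ½(y−ŷ)² Loss, i.e. the sum of squared errors divided by 2m.&lt;/p&gt;

```python
import numpy as np

def cost(Y, Y_hat):
    """Average squared-error Loss over all m examples: sum((y - y_hat)^2) / (2 * m)."""
    m = Y.size
    squared_errors = (Y - Y_hat) ** 2   # vectorized subtraction, then element-wise square
    return float(np.sum(squared_errors) / (2 * m))

# Made-up targets and predictions for three examples
Y = np.array([0.0, 1.0, 1.0])
Y_hat = np.array([0.5, 0.5, 1.0])

per_example_loss = 0.5 * (Y - Y_hat) ** 2   # the plain Loss, one value per example

print("Cost:", cost(Y, Y_hat))
print("Mean of the per-example Losses:", float(per_example_loss.mean()))
```

&lt;p&gt;Both printed numbers agree, which is all the Cost is: the Loss averaged over the examples.&lt;/p&gt;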

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Foj03ytj8r3h3srnmwnf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Foj03ytj8r3h3srnmwnf0.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 36. Calculation of Jacobian on the simple example



&lt;p&gt;From this, we can generalize the partial derivative equation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz9a6bjcisydp9j99625p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz9a6bjcisydp9j99625p.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 37. Generalized partial derivative equation



&lt;p&gt;Right now we should take a moment to note how the derivative of the Loss is different from the derivative of the Cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd8cx33jnf2n759bdskja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd8cx33jnf2n759bdskja.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 38. Comparison between the partial derivative of Loss and Cost with respect to(w.r.t) &lt;b&gt;ŷ⁽ⁱ⁾&lt;/b&gt;



&lt;p&gt;We’ll later see how this small change manifests itself in the calculation of the gradient.&lt;/p&gt;
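&lt;p&gt;If you’d like to convince yourself of the extra 1/m factor numerically, a finite-difference check is a handy trick. The sketch below assumes the Cost is the sum of squared errors divided by 2m, so that ∂Cost/∂ŷ⁽ⁱ⁾ = (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)/m; the prediction values are arbitrary illustration data.&lt;/p&gt;

```python
import numpy as np

def cost(Y, Y_hat):
    return float(np.sum((Y - Y_hat) ** 2) / (2 * Y.size))

def d_cost_d_yhat(Y, Y_hat):
    """Analytic partial derivatives of the Cost w.r.t. each prediction."""
    return (Y_hat - Y) / Y.size

Y = np.array([0.0, 1.0, 1.0, 1.0])
Y_hat = np.array([0.5, 0.646, 0.525, 0.668])   # arbitrary predictions for illustration

analytic = d_cost_d_yhat(Y, Y_hat)

# Central finite differences: nudge one prediction at a time and watch the Cost
eps = 1e-6
numeric = np.zeros_like(Y_hat)
for i in range(Y_hat.size):
    up, down = Y_hat.copy(), Y_hat.copy()
    up[i] += eps
    down[i] -= eps
    numeric[i] = (cost(Y, up) - cost(Y, down)) / (2 * eps)

print("analytic:", np.round(analytic, 4))
print("numeric: ", np.round(numeric, 4))
```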



&lt;p&gt;Back to batch gradient descent. There are two ways to perform it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each training iteration, create separate temporary variables (capital deltas, Δ) that accumulate the gradients (small deltas, δ) for the weights and biases from each of the &lt;strong&gt;&lt;em&gt;“m”&lt;/em&gt;&lt;/strong&gt; examples in our training set; then, at the end of the iteration, update the weights and biases using the average of the accumulated gradients. This is a slow method. &lt;em&gt;(For those familiar with time complexity analysis: the nested loops, an outer one over training iterations and an inner one over the m examples, do work proportional to iterations × m, and the explicit inner loop makes every iteration slower as the training set grows.)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fw1qdgm3e5krbeq6pl5oa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fw1qdgm3e5krbeq6pl5oa.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 39. Batch Gradient Descent slow method
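&lt;p&gt;Here is a rough sketch of one iteration of the slow method. It assumes a single sigmoid neuron with the ½(y−ŷ)² Loss and starting values of W=[0.1, 0.6] and b=0; these are assumptions, chosen because they are consistent with the numbers quoted in this post, and the names &lt;code&gt;Delta_W&lt;/code&gt; and &lt;code&gt;Delta_b&lt;/code&gt; are simply stand-ins for the capital deltas.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# OR gate data, one example per row
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([0.0, 1.0, 1.0, 1.0])
W = np.array([0.1, 0.6])   # assumed starting weights
b = 0.0                    # assumed starting bias
m = len(Y)

# Accumulators (capital deltas), reset at the start of every training iteration
Delta_W = np.zeros_like(W)
Delta_b = 0.0

for x, y in zip(X, Y):                               # inner loop over the m examples
    y_hat = sigmoid(np.dot(W, x) + b)
    delta_z = (y_hat - y) * y_hat * (1.0 - y_hat)    # dLoss/dz for this one example
    Delta_W += delta_z * x                           # dLoss/dW for this one example
    Delta_b += delta_z                               # dLoss/db for this one example

# Average of the accumulated gradients, used for the update
grad_W = Delta_W / m
grad_b = Delta_b / m
print(np.round(grad_W, 3), round(float(grad_b), 3))
```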



&lt;ol start="2"&gt;
&lt;li&gt;The quicker method is similar to the one above but uses vectorized computations to calculate the gradients for all the training examples in one go, so the inner loop is removed. Vectorized computations run much quicker on computers. This is the method employed by all the popular neural network frameworks and the one we’ll follow for the rest of this blog.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For vectorized computations, we’ll make an adjustment to the “Z” node of the neural network computation graph and use the Cost function instead of the Loss function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8lzmdjhjjwxi239qh5yg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8lzmdjhjjwxi239qh5yg.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 40. Vectorized implementation of Z node



&lt;p&gt;Note that in the figure above we take the &lt;strong&gt;&lt;em&gt;dot product&lt;/em&gt;&lt;/strong&gt; of &lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt;, each of which can be either a matrix or a vector of appropriate size. The bias, &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;, is still a single number (&lt;em&gt;a scalar quantity&lt;/em&gt;) here and is added to every element of the dot product’s output. The predicted output is no longer just a number but a vector, &lt;strong&gt;&lt;em&gt;Ŷ&lt;/em&gt;&lt;/strong&gt;, where each element is the predicted output for its respective example.&lt;/p&gt;
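&lt;p&gt;In NumPy the vectorized “Z” node could look like the sketch below. The layout is an assumption: one column of &lt;code&gt;X&lt;/code&gt; per example and one row per input feature, with &lt;code&gt;W&lt;/code&gt; as a 1×2 row vector holding assumed starting weights. The scalar bias is added to every element through NumPy broadcasting.&lt;/p&gt;

```python
import numpy as np

# Assumed layout: each column is one OR gate example, each row one input feature
X = np.array([[0.0, 0.0, 1.0, 1.0],    # x1 for the four examples
              [0.0, 1.0, 0.0, 1.0]])   # x2 for the four examples
W = np.array([[0.1, 0.6]])             # assumed weights, shape (1, 2)
b = 0.0                                # scalar bias

# (1, 2) dot (2, 4) gives (1, 4); the scalar b is broadcast over all four entries
Z = np.dot(W, X) + b

print(Z.shape)   # (1, 4)
print(Z)
```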

&lt;p&gt;Let’s set up our data (&lt;strong&gt;&lt;em&gt;X, W, b &amp;amp; Y&lt;/em&gt;&lt;/strong&gt;) before doing forward and backward propagation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fn2xvemilggldoc7xt6ok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fn2xvemilggldoc7xt6ok.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 41. Setup data for vectorized computations.



&lt;p&gt;We are now finally ready to perform forward and backward propagation using &lt;strong&gt;&lt;em&gt;Xₜᵣₐᵢₙ&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;Yₜᵣₐᵢₙ&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(NOTE: all the results below are rounded to 3 decimal places, just for brevity)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbmug1olcjmo1aa5jp3bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbmug1olcjmo1aa5jp3bn.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7815coqj7o3gxw4wfqee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7815coqj7o3gxw4wfqee.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 42. Vectorized Forward Propagation on OR gate dataset



&lt;p&gt;How cool is that? We calculated all the forward propagation steps for all the examples in our data set in one go, just by vectorizing our computations.&lt;/p&gt;

&lt;p&gt;We can now calculate the &lt;strong&gt;&lt;em&gt;Cost&lt;/em&gt;&lt;/strong&gt; on these output predictions. &lt;em&gt;(We’ll go over the calculation in detail, to make sure there is no confusion)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fai89pw3gflutjr0p0lfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fai89pw3gflutjr0p0lfq.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 43. Calculation of Cost on the OR gate data



&lt;p&gt;Our &lt;strong&gt;Cost&lt;/strong&gt; with our current weights, &lt;strong&gt;W&lt;/strong&gt;, turns out to be &lt;strong&gt;0.089&lt;/strong&gt;. Our goal now is to reduce this Cost using backpropagation and gradient descent. As before, we’ll go through backpropagation step by step.&lt;/p&gt;
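&lt;p&gt;The vectorized forward pass and Cost calculation fit in a handful of NumPy lines. As before, this is a sketch under assumptions: a sigmoid neuron, a Cost of sum((Y−Ŷ)²)/2m, and starting values of W=[0.1, 0.6] and b=0, which are not restated in the text here but reproduce the 0.089 above.&lt;/p&gt;

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # one column per OR gate example
Y = np.array([[0.0, 1.0, 1.0, 1.0]])
W = np.array([[0.1, 0.6]])             # assumed starting weights
b = 0.0                                # assumed starting bias
m = X.shape[1]

# Forward propagation for all four examples at once
Z = np.dot(W, X) + b
Y_hat = sigmoid(Z)                     # roughly 0.5, 0.646, 0.525, 0.668

# Cost: average of the per-example squared-error Loss
cost = float(np.sum((Y - Y_hat) ** 2) / (2 * m))
print(round(cost, 3))                  # 0.089
```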

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0u62cfib25a4cwtj46a0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0u62cfib25a4cwtj46a0.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffgaouo6412gxl2xc2n5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffgaouo6412gxl2xc2n5v.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 44.a. Vectorized Backward on OR gate data



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcj7hg0tj9fc6d3po6t0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcj7hg0tj9fc6d3po6t0c.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fojrtthit2k2ut653uwok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fojrtthit2k2ut653uwok.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 44.b. Vectorized Backward on OR gate data



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxbfwwcdvnnffqbtjjdcy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxbfwwcdvnnffqbtjjdcy.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flpo1q8mmo5t3gy7lv8by.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flpo1q8mmo5t3gy7lv8by.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 44.c. Vectorized Backward on OR gate data



&lt;p&gt;Voila, we used a vectorized implementation of batch gradient descent to calculate all the gradients in one go.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Those with a keen eye may be wondering how the local gradients and the final gradients are being calculated in this last step. Don’t worry, I’ll explain the derivation of the gradients in this last step shortly. For now, suffice it to say that the gradients defined in this last step are an optimization over the naive way of calculating ∂Cost/∂W and ∂Cost/∂b.)&lt;/em&gt;&lt;/p&gt;
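&lt;p&gt;For reference, the three backward steps could be written in NumPy roughly as follows. The sketch again assumes a sigmoid neuron, the averaged squared-error Cost, and starting values of W=[0.1, 0.6] and b=0; the variable names &lt;code&gt;dY_hat&lt;/code&gt;, &lt;code&gt;dZ&lt;/code&gt;, &lt;code&gt;dW&lt;/code&gt; and &lt;code&gt;db&lt;/code&gt; are my own shorthand for the gradients of the Cost with respect to each quantity.&lt;/p&gt;

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # one column per OR gate example
Y = np.array([[0.0, 1.0, 1.0, 1.0]])
W = np.array([[0.1, 0.6]])             # assumed starting weights
b = 0.0                                # assumed starting bias
m = X.shape[1]

Y_hat = sigmoid(np.dot(W, X) + b)      # forward propagation

dY_hat = (Y_hat - Y) / m               # dCost/dY_hat, note the 1/m from the Cost
dZ = dY_hat * Y_hat * (1.0 - Y_hat)    # through the sigmoid, element-wise
dW = np.dot(dZ, X.T)                   # (1, 4) dot (4, 2) gives dCost/dW, shape (1, 2)
db = float(np.sum(dZ))                 # the bias feeds every example, so its gradients add up

print(np.round(dW, 3), round(db, 3))
```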

&lt;p&gt;Let’s update the weights and bias, keeping the learning rate the same as in the non-vectorized implementation from before, i.e. &lt;strong&gt;&lt;em&gt;α=1&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpytf2zdsdpicrhkxa4qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpytf2zdsdpicrhkxa4qw.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 45. Calculated new Weights and Bias



&lt;p&gt;Now that we have updated the weights and bias, let’s do a &lt;strong&gt;forward propagation&lt;/strong&gt; and &lt;strong&gt;calculate the new Cost&lt;/strong&gt; to check if we’ve done the right thing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flbxnno5sqlcs9gtpkjdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flbxnno5sqlcs9gtpkjdi.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fm59aza9klkwp9buolzi3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fm59aza9klkwp9buolzi3.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 46. Vectorized Forward Propagation with updated weights and bias



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6u3rjqcqtxqr9ajjcu9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6u3rjqcqtxqr9ajjcu9f.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 47. New Cost after updated parameters



&lt;p&gt;So, we &lt;em&gt;reduced our Cost(Average Loss across all examples)&lt;/em&gt; from an initial Cost of around &lt;strong&gt;&lt;em&gt;0.089&lt;/em&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;em&gt;0.084&lt;/em&gt;&lt;/strong&gt;. We will need to do multiple training iterations before we can converge to a low Cost.&lt;/p&gt;

&lt;p&gt;At this point, I would recommend that you perform the next backpropagation step yourself, using the updated weights and bias. The result should be (rounded to 3 decimal places): &lt;strong&gt;&lt;em&gt;∂Cost/∂W = [-0.044, -0.035]&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;∂Cost/∂b = [-0.031].&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
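&lt;p&gt;Once you’ve tried it by hand, a short script can double-check the numbers. The sketch below performs one update with α=1 and then recomputes the Cost and the gradients; it assumes a sigmoid neuron, the averaged squared-error Cost, and starting values of W=[0.1, 0.6] and b=0, and &lt;code&gt;forward_backward&lt;/code&gt; is a helper name invented for this illustration.&lt;/p&gt;

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_backward(X, Y, W, b):
    """Return (cost, dW, db) for a single sigmoid neuron with the averaged squared-error Cost."""
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(W, X) + b)
    cost = float(np.sum((Y - Y_hat) ** 2) / (2 * m))
    dZ = (Y_hat - Y) / m * Y_hat * (1.0 - Y_hat)
    return cost, np.dot(dZ, X.T), float(np.sum(dZ))

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # one column per OR gate example
Y = np.array([[0.0, 1.0, 1.0, 1.0]])
W = np.array([[0.1, 0.6]])             # assumed starting weights
b = 0.0                                # assumed starting bias
alpha = 1.0

cost_before, dW, db = forward_backward(X, Y, W, b)

# One gradient descent update
W = W - alpha * dW
b = b - alpha * db

cost_after, dW_next, db_next = forward_backward(X, Y, W, b)

print(round(cost_before, 3), round(cost_after, 3))   # 0.089 0.084
print(np.round(dW_next, 3), round(db_next, 3))
```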

&lt;p&gt;Recall how, before we trained the neural network, we predicted that it can separate the two classes in Figure 9. Well, after about 5000 epochs (full batch training iterations) the Cost steadily decreases to about &lt;strong&gt;&lt;em&gt;0.0005&lt;/em&gt;&lt;/strong&gt; and we get the following decision boundary:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnri2nqzowvqe7x6e9que.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnri2nqzowvqe7x6e9que.png"&gt;&lt;/a&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3vn3odi9qeyjbf06j1pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3vn3odi9qeyjbf06j1pl.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 48. Cost curve and Decision boundary after 5000 epochs



&lt;p&gt;The &lt;strong&gt;Cost curve&lt;/strong&gt; is basically the value of the Cost plotted after a certain number of iterations (epochs). Notice that the Cost curve flattens after about 3000 epochs; this means that the weights and bias of the neural network have converged, so further training will only slightly improve them. Why? Recall the u-shaped Loss curve: as we descend closer and closer to the minimum point (flat region), the gradients become smaller and smaller, and thus the steps gradient descent takes are very small.&lt;/p&gt;
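&lt;p&gt;A bare-bones training loop that records the Cost at every epoch might look like the sketch below. It is not the implementation used to produce the figures; it assumes a sigmoid neuron, the averaged squared-error Cost, α=1, and starting values of W=[0.1, 0.6] and b=0. Plotting the &lt;code&gt;costs&lt;/code&gt; list against the epoch number gives a Cost curve.&lt;/p&gt;

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # one column per OR gate example
Y = np.array([[0.0, 1.0, 1.0, 1.0]])
W = np.array([[0.1, 0.6]])             # assumed starting weights
b = 0.0                                # assumed starting bias
alpha = 1.0
m = X.shape[1]

costs = []
for epoch in range(5000):
    # Forward propagation and Cost
    Y_hat = sigmoid(np.dot(W, X) + b)
    costs.append(float(np.sum((Y - Y_hat) ** 2) / (2 * m)))

    # Backpropagation
    dZ = (Y_hat - Y) / m * Y_hat * (1.0 - Y_hat)
    dW = np.dot(dZ, X.T)
    db = float(np.sum(dZ))

    # Gradient descent update
    W = W - alpha * dW
    b = b - alpha * db

print(f"Cost at the first epoch: {costs[0]:.3f}, at the last epoch: {costs[-1]:.4f}")
print("Rounded predictions:", np.rint(sigmoid(np.dot(W, X) + b)))
```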

&lt;p&gt;The &lt;strong&gt;Decision Boundary&lt;/strong&gt; shows the line along which the decision of the neural network changes from one output to the other. We can better visualize this by coloring the area below and above the decision boundary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9johty29zmet36qav1dd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9johty29zmet36qav1dd.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 49. Decision boundary visualized after 5000 epochs



&lt;p&gt;This makes it much clearer. The red shaded area is the area below the decision boundary and everything below the decision boundary has an output(&lt;strong&gt;&lt;em&gt;ŷ&lt;/em&gt;&lt;/strong&gt;) of &lt;strong&gt;&lt;em&gt;0&lt;/em&gt;&lt;/strong&gt;. Similarly, everything above the decision boundary, shaded green, has an output of &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt;. In conclusion, our simple neural network has learned a decision boundary by looking at the training data and figuring out how to separate its two output classes(&lt;strong&gt;&lt;em&gt;y=1&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;y=0&lt;/em&gt;&lt;/strong&gt;)🙌. Now the output neuron fires up🔥(produces 1) whenever &lt;strong&gt;&lt;em&gt;x₁&lt;/em&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;em&gt;x₂&lt;/em&gt;&lt;/strong&gt; or both are 1.&lt;/p&gt;
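&lt;p&gt;To see how a decision boundary turns into 0/1 outputs, here is a small sketch. The weights and bias below are hypothetical stand-ins for trained values (they are not the numbers the network above ends up with), and the 0.5 threshold on the sigmoid output is an assumption about how the prediction is turned into a class.&lt;/p&gt;

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def predict(X, W, b):
    """Return 1 where the neuron output reaches 0.5 (on or above the boundary), else 0."""
    Y_hat = sigmoid(np.dot(W, X) + b)
    return np.greater_equal(Y_hat, 0.5).astype(int)

# Hypothetical parameters standing in for trained values
W = np.array([[6.0, 6.0]])
b = -2.5

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # one column per OR gate example

print(predict(X, W, b))   # [[0 1 1 1]]

# The boundary itself is the line where w1*x1 + w2*x2 + b = 0
x2_intercept = -b / W[0, 1]
print(f"Boundary crosses the x2 axis at {x2_intercept:.3f}")
```

&lt;p&gt;Because the sigmoid outputs exactly 0.5 when its input is 0, the boundary is simply the straight line where the weighted sum plus the bias equals zero.&lt;/p&gt;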

&lt;p&gt;Now would be a good time to see how the &lt;strong&gt;&lt;em&gt;“1/m”&lt;/em&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;em&gt;“m”&lt;/em&gt;&lt;/strong&gt; is the total number of examples in the training dataset) in the Cost function manifested in the final calculation of the gradients.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fkkikpzlhyz94vd343oua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fkkikpzlhyz94vd343oua.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 50. Comparing the effect of derivative w.r.t Cost and Loss on parameters of the neural network



&lt;p&gt;&lt;strong&gt;&lt;em&gt;From this, the most important point to know is that the gradient used to update our weights, using the Cost function, is the average of all the gradients calculated during a training iteration; same applies to bias&lt;/em&gt;&lt;/strong&gt;. You may want to confirm this yourself by checking the vectorized calculations.&lt;/p&gt;

&lt;p&gt;Taking the average of all the gradients has some benefits. First, it gives us a less noisy estimate of the gradient. Second, the resulting learning curve is smooth, which makes it easy to tell whether the neural network is learning or not. Both of these features come in very handy when training neural networks on much trickier datasets, such as those with wrongly labeled examples.&lt;/p&gt;
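&lt;p&gt;If you would rather check the averaging claim numerically, the following is a rough sketch on the OR gate data. It assumes a single sigmoid neuron, a squared-error Loss of the form &lt;em&gt;0.5·(a − y)²&lt;/em&gt;, and made-up initial values for &lt;strong&gt;W&lt;/strong&gt; and &lt;strong&gt;b&lt;/strong&gt;; swap in your own setup as needed.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# OR gate data: each column is one example.
X_train = np.array([[0.0, 0.0, 1.0, 1.0],
                    [0.0, 1.0, 0.0, 1.0]])
Y_train = np.array([[0.0, 1.0, 1.0, 1.0]])
m = X_train.shape[1]

W = np.array([[0.1, 0.6]])   # assumed initial weights
b = np.array([[0.0]])        # assumed initial bias

# Gradient of the Loss, one example at a time.
per_example_dW = []
for i in range(m):
    x = X_train[:, i:i + 1]                 # (2, 1) column vector
    y = Y_train[:, i:i + 1]
    a = sigmoid(W @ x + b)
    dZ = (a - y) * a * (1.0 - a)            # assumes Loss = 0.5 * (a - y)**2
    per_example_dW.append(dZ @ x.T)         # (1, 2)
average_dW = sum(per_example_dW) / m

# Gradient of the Cost, all examples at once (the 1/m comes from the Cost).
A = sigmoid(W @ X_train + b)
dZ_cost = (A - Y_train) * A * (1.0 - A) / m
vectorized_dW = dZ_cost @ X_train.T

print(np.allclose(average_dW, vectorized_dW))
```

&lt;p&gt;The two results agree, so the vectorized Cost gradient is just the per-example Loss gradients averaged over the &lt;em&gt;m&lt;/em&gt; examples.&lt;/p&gt;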


&lt;h3&gt;
  
  
  This is great and all but how did you calculate the gradients ∂Cost/∂W and ∂Cost/∂b?🤔
&lt;/h3&gt;

&lt;p&gt;The neural network guides and blog posts I learned from often omitted complex details or gave very vague explanations for them. Not in this blog; we’ll go over everything, leaving no stone unturned.&lt;/p&gt;
&lt;h4&gt;
  
  
  First, we’ll tackle ∂Cost/∂b. Why did we sum the gradients?
&lt;/h4&gt;

&lt;p&gt;To explain this I employ our computational graph technique on three very simple equations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4g3zdnbrdbxubu4atf9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4g3zdnbrdbxubu4atf9i.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 51. Computational graph of simple equations



&lt;p&gt;I am particularly interested in the &lt;strong&gt;b&lt;/strong&gt; node, so let’s do backpropagation on this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9zrosvtd12heblinx6j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9zrosvtd12heblinx6j5.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 52. Backpropagation on the computational graph of simple equations



&lt;p&gt;Note that the &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; node is receiving gradients from &lt;strong&gt;two&lt;/strong&gt; other nodes. So the total of the gradients flowing into node &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; is the sum of the two gradients flowing in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7f9e3vsv1o4la3c1tppq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7f9e3vsv1o4la3c1tppq.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 53. Sum of gradients flowing into node &lt;b&gt;b&lt;/b&gt;



&lt;p&gt;From this example, we can generalize the following rule: &lt;strong&gt;&lt;em&gt;Sum all the incoming gradients to a node, from all the possible paths&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
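&lt;p&gt;Here is a small sketch of this rule in code. The three equations below are hypothetical stand-ins (they are not the ones from Fig 51), chosen so that &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; feeds two different nodes; a finite-difference estimate serves as a sanity check.&lt;/p&gt;

```python
# Hypothetical equations (not the ones in Fig 51): u = a * b, v = b + c, out = u * v
def forward(a, b, c):
    u = a * b
    v = b + c
    return u, v, u * v

a, b, c = 2.0, 3.0, 4.0
u, v, out = forward(a, b, c)

# Backpropagation: b feeds both u and v, so it receives two gradients.
grad_via_u = v * a      # d(out)/du * du/db
grad_via_v = u * 1.0    # d(out)/dv * dv/db
grad_b = grad_via_u + grad_via_v

# Numerical check with a central finite difference.
eps = 1e-6
numeric = (forward(a, b + eps, c)[2] - forward(a, b - eps, c)[2]) / (2 * eps)
print(grad_b, numeric)
```

&lt;p&gt;Dropping either path would leave the analytical gradient short of the numerical one, which is why the incoming gradients must be summed.&lt;/p&gt;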

&lt;p&gt;Let’s visualize how this rule is used in the calculation of the gradient for the &lt;strong&gt;&lt;em&gt;bias&lt;/em&gt;&lt;/strong&gt;. During a training iteration, our neural network can be seen as doing &lt;strong&gt;&lt;em&gt;independent&lt;/em&gt;&lt;/strong&gt; calculations for each of our examples while using shared parameters for the weights and bias. Below, the bias (&lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;) is visualized as a shared parameter for all the individual calculations our neural network performs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6qqk6qfj6ewovmzrxejh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6qqk6qfj6ewovmzrxejh.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 54. Visualizing bias parameter being shared across a training epoch.



&lt;p&gt;Following the general rule defined above, we will sum all the incoming gradients from all the possible paths to the bias node, &lt;strong&gt;b&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0u1ya8fnz6t6wlfxh1li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0u1ya8fnz6t6wlfxh1li.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 55. Visualizing all possible backpropagation paths to shared bias parameter



&lt;p&gt;Since the ∂Z/∂b (local gradient at the Z node) is equal to &lt;strong&gt;1&lt;/strong&gt;, the total gradient at &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; is the sum of gradients from each example with respect to the Cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvwz30dj8oxrflcae1xhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvwz30dj8oxrflcae1xhy.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 56. Proof that ∂Cost/∂b is the sum of upstream gradients
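&lt;p&gt;In NumPy terms, the result above boils down to a single sum over the examples axis. A minimal sketch, assuming some made-up upstream gradient values for four examples:&lt;/p&gt;

```python
import numpy as np

# Assumed upstream gradient flowing into Z for 4 examples, shape (1, 4).
dZ = np.array([[0.05, -0.02, -0.03, -0.01]])

# Per-example view: dZ/db = 1, so each path contributes its upstream gradient.
db_paths = sum(dZ[0, i] * 1.0 for i in range(dZ.shape[1]))

# Vectorized view: sum across the examples axis, keeping a (1, 1) shape like b.
db = np.sum(dZ, axis=1, keepdims=True)

print(db_paths, db)
```

&lt;p&gt;Using &lt;code&gt;keepdims=True&lt;/code&gt; is a design choice that keeps the gradient the same shape as &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;, so the update can be applied directly.&lt;/p&gt;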



&lt;p&gt;Now that we’ve got the derivative with respect to the bias figured out, let’s move on to the derivative with respect to the weights and, more importantly, the local gradient with respect to the weights.&lt;/p&gt;
&lt;h4&gt;
  
  
  How is the local gradient(∂Z/∂W) equal to transpose of the input training data(X_train)?
&lt;/h4&gt;

&lt;p&gt;This can be answered in a similar way to the above calculation for the bias, but the main complication here is calculating the derivative of the dot product between the weight matrix (&lt;strong&gt;W&lt;/strong&gt;) and the data matrix (&lt;strong&gt;Xₜᵣₐᵢₙ&lt;/strong&gt;), which forms our local gradient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fij1oeugf50snn84fc6ia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fij1oeugf50snn84fc6ia.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 57.a. Figuring out the derivative of the dot product.



&lt;p&gt;This derivative of the dot product is a bit complicated as &lt;strong&gt;we are no longer working with scalar quantities&lt;/strong&gt;; instead, both &lt;strong&gt;W&lt;/strong&gt; and &lt;strong&gt;X&lt;/strong&gt; are matrices and the result of &lt;strong&gt;W⋅X&lt;/strong&gt; is also a matrix. Let’s dive a bit deeper using a simple example first and then generalize from it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6oy72r4rmund6367n753.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6oy72r4rmund6367n753.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 57.b. Figuring out the derivative of the dot product.



&lt;p&gt;Let’s calculate the derivative of &lt;strong&gt;A&lt;/strong&gt; with respect to &lt;strong&gt;W&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuqte14ax6ijqmzqzrbvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuqte14ax6ijqmzqzrbvk.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 57.c. Figuring out the derivative of the dot product.
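&lt;p&gt;One way to convince yourself of this result is a quick numerical check. The sketch below uses an assumed (1 × 2) weight matrix and an assumed (2 × 1) input (the values are placeholders, not the ones in the figures), nudges each weight a little, and measures how the dot product responds.&lt;/p&gt;

```python
import numpy as np

W = np.array([[0.1, 0.6]])       # assumed (1, 2) weight matrix
x = np.array([[3.0], [5.0]])     # assumed (2, 1) input example

def z_of(W_matrix):
    return (W_matrix @ x).item() # scalar output of the dot product

# Nudge each weight in turn and measure how much z changes.
eps = 1e-6
numeric_grad = np.zeros_like(W)
for j in range(W.shape[1]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, j] += eps
    W_minus[0, j] -= eps
    numeric_grad[0, j] = (z_of(W_plus) - z_of(W_minus)) / (2 * eps)

print(numeric_grad)  # compare with x.T
```

&lt;p&gt;The numerical gradient matches the input laid out as a row, i.e. the transpose of the input column vector.&lt;/p&gt;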



&lt;p&gt;Let us visualize this in case of a training iteration where multiple examples are being processed at the same time. &lt;em&gt;(Note that input examples are column vectors.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvr8meel3pxclla57ei73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvr8meel3pxclla57ei73.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 58. Visualizing weights being shared across a training epoch



&lt;p&gt;Just as the bias (&lt;strong&gt;b&lt;/strong&gt;) was being shared across each calculation in a training iteration, the weights (&lt;strong&gt;W&lt;/strong&gt;) are also being shared. We can also visualize the gradient flowing back to the weights, as follows &lt;em&gt;(note that the local derivative of each example w.r.t. &lt;strong&gt;W&lt;/strong&gt; results in a row vector of the input example, i.e. the transpose of the input)&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe9e97f7xfnc7yfgupoex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe9e97f7xfnc7yfgupoex.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 59. Visualizing all possible backpropagation paths to shared weights parameter



&lt;p&gt;Again, following the general rule defined above, we will sum all the incoming gradients from all the possible paths to the weights node, &lt;strong&gt;W&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5uh87cryy1ptdomegw77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5uh87cryy1ptdomegw77.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 60. Derivation of ∂Cost/∂W after visualization.



&lt;p&gt;What we’ve done so far to calculate &lt;strong&gt;&lt;em&gt;∂Cost/∂W&lt;/em&gt;&lt;/strong&gt; is correct and serves as a good explanation; however, it is not an optimized calculation. We can vectorize this calculation, too. Let’s do that next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fx4mgz2oq8yfdvuz22083.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fx4mgz2oq8yfdvuz22083.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 61. Proof that ∂Cost/∂W is the dot product between the Upstream gradient and the transpose of &lt;b&gt;Xₜᵣₐᵢₙ&lt;/b&gt; 
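&lt;p&gt;The same equivalence can be sketched in NumPy. The upstream gradient values below are made up for illustration; the point is only that summing the per-example contributions and taking one dot product with the transpose of &lt;strong&gt;Xₜᵣₐᵢₙ&lt;/strong&gt; give the same answer.&lt;/p&gt;

```python
import numpy as np

X_train = np.array([[0.0, 0.0, 1.0, 1.0],
                    [0.0, 1.0, 0.0, 1.0]])      # (2, 4), one example per column
dZ = np.array([[0.05, -0.02, -0.03, -0.01]])    # assumed upstream gradient, (1, 4)

# Path-by-path: each example contributes (its upstream gradient) * (its input as a row).
dW_loop = np.zeros((1, 2))
for i in range(X_train.shape[1]):
    dW_loop += dZ[:, i:i + 1] @ X_train[:, i:i + 1].T

# Vectorized: one dot product with the transpose of X_train does the same sum.
dW = dZ @ X_train.T

print(dW_loop, dW)
```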


&lt;h4&gt;
  
  
  Is there an easier way of figuring this out, without the math?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Yes! Use &lt;em&gt;dimension analysis&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our OR gate example we know that the gradient flowing into node &lt;strong&gt;&lt;em&gt;Z&lt;/em&gt;&lt;/strong&gt; is a (1 × 4) matrix, &lt;strong&gt;Xₜᵣₐᵢₙ&lt;/strong&gt; is a (2 × 4) matrix, and the derivative of the Cost with respect to &lt;strong&gt;W&lt;/strong&gt; needs to be of the same size as &lt;strong&gt;W&lt;/strong&gt;, which is (1 × 2). So, the only way to generate a (1 × 2) matrix is to take the dot product between the gradient flowing into Z and the transpose of &lt;strong&gt;Xₜᵣₐᵢₙ&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqmjgqzvq9wdcqhyfys1y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqmjgqzvq9wdcqhyfys1y.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, knowing that bias, &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;, is a simple (1 × 1) matrix and the gradient flowing into node Z is (1 × 4), using dimension analysis we can be sure that the gradient of Cost w.r.t &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;, also needs to be a (1 × 1) matrix. The only way we can achieve this, given the local gradient(&lt;strong&gt;&lt;em&gt;∂Z/∂b&lt;/em&gt;&lt;/strong&gt;) is just equal to &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt;, is by summing up the upstream gradient.&lt;/p&gt;
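&lt;p&gt;Dimension analysis is easy to try out in code, since NumPy refuses to multiply matrices whose shapes do not line up. A quick sketch with placeholder values (only the shapes matter here):&lt;/p&gt;

```python
import numpy as np

dZ = np.ones((1, 4))        # gradient flowing into Z (values are placeholders)
X_train = np.ones((2, 4))   # data matrix (values are placeholders)

try:
    dZ @ X_train            # (1, 4) with (2, 4): inner dimensions do not match
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

dW = dZ @ X_train.T                       # (1, 4) with (4, 2) gives (1, 2), same as W
db = np.sum(dZ, axis=1, keepdims=True)    # (1, 4) summed to (1, 1), same as b

print(shapes_compatible, dW.shape, db.shape)
```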

&lt;p&gt;&lt;em&gt;On a final note, when deriving derivative expressions, work on small examples and then generalize from there. For example, while calculating the derivative of the dot product w.r.t. &lt;strong&gt;W&lt;/strong&gt;, we used a single column vector as a test case and generalized from there; had we used the entire data matrix, the derivative would have resulted in a (4 × 1 × 2) tensor (multidimensional matrix), and calculations on it can get a bit hairy.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;Before concluding this section, let’s go over a slightly more complicated example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fo9dv23fqtyanhvi1tuvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fo9dv23fqtyanhvi1tuvx.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 62. XOR gate data



&lt;p&gt;Figure 62, above, shows the XOR gate data. Note that the label, &lt;strong&gt;y&lt;/strong&gt;, is equal to &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt; only when exactly one of the values &lt;strong&gt;&lt;em&gt;x₁&lt;/em&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;em&gt;x₂&lt;/em&gt;&lt;/strong&gt; is equal to &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt;, not both. This makes it a particularly challenging dataset as the data is not linearly separable, i.e. there is no single straight-line decision boundary that can successfully separate the two classes (&lt;strong&gt;&lt;em&gt;y=1 and y=0&lt;/em&gt;&lt;/strong&gt;) in the data. XOR used to be the bane of earlier forms of artificial neural networks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F69ghath4q1qr2bcl8jh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F69ghath4q1qr2bcl8jh1.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 63. Some linear decision boundaries that are wrong



&lt;p&gt;Recall that our current neural network was successful only because it could figure out the straight line decision boundary that could successfully separate the two classes of the OR gate dataset. A straight line won’t cut it here. So, how do we get a neural network to figure this one out?&lt;/p&gt;
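&lt;p&gt;As a rough illustration (not a proof), we can brute-force a grid of straight lines and count how many XOR examples the best one classifies correctly. The grid range and resolution below are arbitrary assumptions.&lt;/p&gt;

```python
import itertools
import numpy as np

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
y_xor = np.array([0, 1, 1, 0])

# Try many straight lines w1*x1 + w2*x2 + b = 0 (grid range is an arbitrary assumption).
grid = np.linspace(-2.0, 2.0, 21)
best_correct = 0
for w1, w2, b in itertools.product(grid, grid, grid):
    predictions = (w1 * X[0] + w2 * X[1] + b > 0).astype(int)
    best_correct = max(best_correct, int(np.sum(predictions == y_xor)))

print(best_correct)  # number of examples the best line gets right, out of 4
```

&lt;p&gt;No line on the grid gets all four examples right; the best ones stop at three out of four.&lt;/p&gt;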

&lt;p&gt;Well, we can do two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Amend the data itself, so that in addition to features &lt;strong&gt;&lt;em&gt;x₁&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;x₂&lt;/em&gt;&lt;/strong&gt; a third feature provides some additional information to help the neural network decide on a good decision boundary. This process is called &lt;strong&gt;&lt;em&gt;feature engineering&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change the architecture of the neural network, making it deeper.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's go over both and see which one is better.&lt;/p&gt;
&lt;h4&gt;
  
  
  Feature Engineering
&lt;/h4&gt;

&lt;p&gt;Let’s look at a dataset that looks similar to the XOR data and will help us make an important realization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjcee7vqvetvxpcpkyjto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjcee7vqvetvxpcpkyjto.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 64. XOR-like data in different quadrants



&lt;p&gt;The data in Figure 64 is exactly like the XOR data except each data point is spread out in different quadrants. Notice that in the &lt;strong&gt;1ˢᵗ and 3ʳᵈ quadrant all the values are positive&lt;/strong&gt; and in the &lt;strong&gt;2ⁿᵈ and 4ᵗʰ all the values are negative.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz5ko6wyk9ud59dw6sm5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz5ko6wyk9ud59dw6sm5z.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 65. Positive and negative quadrants



&lt;p&gt;Why is that? In the &lt;strong&gt;1ˢᵗ&lt;/strong&gt; and &lt;strong&gt;3ʳᵈ&lt;/strong&gt; quadrants &lt;strong&gt;both values have the same sign, so the sign is effectively squared and their product is positive&lt;/strong&gt;, while in the &lt;strong&gt;2ⁿᵈ&lt;/strong&gt; and &lt;strong&gt;4ᵗʰ&lt;/strong&gt; quadrants &lt;strong&gt;the product is between a negative and a positive number, resulting in a negative number.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc74z42vcutr7lzbini1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc74z42vcutr7lzbini1r.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 66. Result of the product of features



&lt;p&gt;So this gives us a pattern to work with using the product of two features. We can even see a similar pattern in the XOR data, where each quadrant can be identified in a similar way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzvxsthsxr5exp5xu6njn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzvxsthsxr5exp5xu6njn.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 67. Quadrant-pattern in XOR data plot



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Therefore, a good third feature, x₃, would be the product of features x₁ and x₂(i.e. x₁*x₂).&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A product of features is called a &lt;strong&gt;&lt;em&gt;feature cross&lt;/em&gt;&lt;/strong&gt; and results in a new &lt;strong&gt;&lt;em&gt;synthetic feature&lt;/em&gt;&lt;/strong&gt;. A feature cross can be a feature multiplied by itself (e.g. &lt;strong&gt;&lt;em&gt;x₁²&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;x₁³&lt;/em&gt;&lt;/strong&gt;, …), a product of two or more features (e.g. &lt;strong&gt;&lt;em&gt;x₁*x₂&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;x₁*x₂*x₃&lt;/em&gt;&lt;/strong&gt;, …) or even a combination of both (e.g. &lt;strong&gt;&lt;em&gt;x₁²*x₂&lt;/em&gt;&lt;/strong&gt;). For example, in a housing dataset where the input features are the width and length of houses in yards and the label is the location of the house on the map, a better predictor for this location could be the feature cross between the width and length of houses, giving us a new feature of “size of house in square yards”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s add the new synthetic feature to our training data, Xₜᵣₐᵢₙ.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ff2q3ybhq5r0moo9p0qfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ff2q3ybhq5r0moo9p0qfe.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 68. New training data
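&lt;p&gt;In NumPy, building this synthetic feature could look like the following sketch, assuming the XOR inputs are stored with one example per column as in the rest of this post:&lt;/p&gt;

```python
import numpy as np

X_train = np.array([[0, 0, 1, 1],
                    [0, 1, 0, 1]])      # XOR inputs, one example per column

x3 = X_train[0] * X_train[1]            # feature cross x1 * x2
X_train_crossed = np.vstack([X_train, x3])

print(X_train_crossed)
```

&lt;p&gt;The new third row is nonzero only for the example where both inputs are 1.&lt;/p&gt;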



&lt;p&gt;Using this feature cross we can now successfully learn a decision boundary without changing the architecture of the neural network significantly. We only need to add an input node for &lt;strong&gt;&lt;em&gt;x₃&lt;/em&gt;&lt;/strong&gt; and a corresponding weight(randomly set to 0.2) to the input layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F610pm2y4ylq2vuoe9y74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F610pm2y4ylq2vuoe9y74.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 69. Neural Network with feature cross(x₃) as input



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpffm0zwgiralojmrcfhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpffm0zwgiralojmrcfhn.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 70. Expanded neural network with feature cross(x₃) as input



&lt;p&gt;Given below is the first training iteration of the neural network. You may go through the computations yourself and confirm them, as they make for a good exercise. Since we are already familiar with this neural network architecture, I will not go through all the computations in a step-by-step manner, as before.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(All calculations below are rounded to 3 decimal places)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqxki2xg7uhwxn7bw7r46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqxki2xg7uhwxn7bw7r46.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 71. Forward Propagation in the first training iteration



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcd2hqd5ja0fx7q5n1nwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcd2hqd5ja0fx7q5n1nwl.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 72. Backpropagation in the first training iteration




&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ff5bwj5qoxdckxtxfkpyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ff5bwj5qoxdckxtxfkpyd.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 73. Gradient descent update for new weights and bias, in the first training iteration



&lt;p&gt;After 5000 epochs, the learning curve and the decision boundary look as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fk7htutjd23xswvchv678.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fk7htutjd23xswvchv678.png"&gt;&lt;/a&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvh67nhk1xwm6grt7fhtv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvh67nhk1xwm6grt7fhtv.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 74. Learning Curve and Decision Boundary of the neural net with a feature cross



&lt;p&gt;As before, to visualize better we can shade the regions where the decision of the neural network changes from one to the other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fm1xw4ldpp7xmjpeqzxpm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fm1xw4ldpp7xmjpeqzxpm.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 75. Shaded Decision Boundary for better visualization



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note that feature engineering allowed us to create a decision boundary that is nonlinear&lt;/em&gt;&lt;/strong&gt;. How did it do that? We just need to take a look at what function the Z node is computing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fm8zshj3sf6162l3k2n5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fm8zshj3sf6162l3k2n5z.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 76. Node &lt;b&gt;Z&lt;/b&gt; is computing a polynomial after adding a feature cross



&lt;p&gt;Thus, the feature cross helped us create a complex nonlinear decision boundary.&lt;/p&gt;
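&lt;p&gt;For the curious, here is a compact training sketch of a single sigmoid neuron on the XOR data with the feature cross. It is not the exact setup from the figures: the zero initialization, the learning rate of 1.0, and the binary cross-entropy Cost are all assumptions made to keep the example short.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
X = np.vstack([X, X[0] * X[1]])        # append the feature cross x3 = x1 * x2
Y = np.array([[0.0, 1.0, 1.0, 0.0]])   # XOR labels
m = X.shape[1]

W = np.zeros((1, 3))    # assumed initialization
b = np.zeros((1, 1))
learning_rate = 1.0     # assumed hyper-parameter

for epoch in range(5000):
    A = sigmoid(W @ X + b)
    dZ = (A - Y) / m                   # assumes a binary cross-entropy Cost
    W -= learning_rate * (dZ @ X.T)
    b -= learning_rate * np.sum(dZ, axis=1, keepdims=True)

# Z = w1*x1 + w2*x2 + w3*(x1*x2) + b is a polynomial in x1 and x2.
predictions = (sigmoid(W @ X + b) > 0.5).astype(int)
print(W, b, predictions)
```

&lt;p&gt;Even though the neuron is still linear in its three inputs, the learned boundary is curved when drawn in the original &lt;strong&gt;&lt;em&gt;x₁&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;x₂&lt;/em&gt;&lt;/strong&gt; plane because of the &lt;strong&gt;&lt;em&gt;x₁*x₂&lt;/em&gt;&lt;/strong&gt; term.&lt;/p&gt;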

&lt;p&gt;&lt;strong&gt;&lt;em&gt;This is a very powerful idea!&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Changing Neural Network Architecture
&lt;/h4&gt;

&lt;p&gt;This is the more interesting approach as it allows us to bypass the feature engineering ourselves and &lt;strong&gt;&lt;em&gt;lets the neural network figure out the feature crosses itself!&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s take a look at the following neural network:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3sj3x1m81j7tb37c27q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3sj3x1m81j7tb37c27q6.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 77. Neural network with one hidden layer.



&lt;p&gt;So we’ve added a bunch of new nodes in the middle of our neural network architecture from the OR gate example, keeping the input layer and the output layer the same. This column of new nodes in the middle is called a &lt;strong&gt;&lt;em&gt;hidden&lt;/em&gt;&lt;/strong&gt; layer. &lt;em&gt;Why hidden layer? Because after defining it we don’t have any direct control over how the neurons in the hidden layers learn, unlike the input and output layers, which we can change by changing the data; also, since the hidden layers constitute neither the output nor the input of the neural network, they are in essence hidden from the user.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;We can have an arbitrary number of hidden layers with an arbitrary number of neurons in each layer&lt;/em&gt;&lt;/strong&gt;. This structure needs to be defined by the creator of the neural network. Thus, &lt;strong&gt;&lt;em&gt;the number of hidden layers and the number of neurons in each of the layers are also hyper-parameters. The more hidden layers we add the deeper our neural network architecture becomes and the more neurons we add in the hidden layers the wider the network architecture becomes. The depth of a neural net model is where the term “Deep learning” comes from.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The architecture in Figure 77, with one hidden layer of three sigmoid neurons, was selected after some experimentation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since this is a new architecture I’ll go over the computations step-by-step.&lt;/p&gt;

&lt;p&gt;First, let’s expand out the neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc801iua9u2ve3wdpfgo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc801iua9u2ve3wdpfgo5.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 78. Expanded neural network with one hidden layer



&lt;p&gt;Now let’s perform a &lt;strong&gt;&lt;em&gt;forward propagation&lt;/em&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhr2dmof6tk10uc663irw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhr2dmof6tk10uc663irw.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmj8entthrt990jfq8b9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmj8entthrt990jfq8b9o.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 79.a. Forward propagation on the neural net with a hidden layer



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7yl0py90n09zkcjvrs7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7yl0py90n09zkcjvrs7z.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F075awbquv4wq1agsbczg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F075awbquv4wq1agsbczg.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 79.b. Forward propagation on the neural net with a hidden layer
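&lt;p&gt;For readers who prefer code to figures, the forward pass above can be sketched in a few lines of NumPy. This is only a sketch: the random weight values and the seed are assumptions made for illustration (they are not the numbers used in the figures), and the names &lt;code&gt;W1&lt;/code&gt;, &lt;code&gt;b1&lt;/code&gt;, &lt;code&gt;W2&lt;/code&gt; and &lt;code&gt;b2&lt;/code&gt; are my own shorthand for the hidden and output layer parameters.&lt;/p&gt;

```python
import numpy as np


def sigmoid(Z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-Z))


# XOR inputs, one training example per column: shape (2, 4)
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])

# Illustrative parameters (assumed values, not the ones in the figures)
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))  # hidden layer: 3 neurons, 2 inputs each
b1 = np.zeros((3, 1))
W2 = rng.standard_normal((1, 3))  # output layer: 1 neuron, 3 inputs
b2 = np.zeros((1, 1))

# Forward propagation: linear step followed by sigmoid, layer by layer
Z1 = W1 @ X + b1   # shape (3, 4)
A1 = sigmoid(Z1)   # hidden activations
Z2 = W2 @ A1 + b2  # shape (1, 4)
A2 = sigmoid(Z2)   # network output, one value per example

print(A1.shape, A2.shape)
```

&lt;p&gt;Keeping one example per column means the whole batch flows through the network with two matrix multiplications.&lt;/p&gt;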



&lt;p&gt;We can now calculate the Cost:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffo1esaqqki6pq738xzjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffo1esaqqki6pq738xzjj.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 80. Cost of the neural net with one hidden layer after &lt;b&gt;first&lt;/b&gt; forward propagation



&lt;p&gt;After the calculation of Cost, we can now do our backpropagation and improve the weights and biases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2ih9ds5f0qin7sfjr4gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2ih9ds5f0qin7sfjr4gi.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fty23cskpoj4uitjcw4z0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fty23cskpoj4uitjcw4z0.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 81.a. Backpropagation on the neural net with a hidden layer



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F08bwow5gaom42d4ayted.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F08bwow5gaom42d4ayted.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4xw5oji7teytfmdq5a98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4xw5oji7teytfmdq5a98.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 81.b. Backpropagation on the neural net with a hidden layer



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6qfvcrs711fup36kmyz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6qfvcrs711fup36kmyz8.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmyks2lqeqcz3oppg8sfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmyks2lqeqcz3oppg8sfs.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 81.c. Backpropagation on the neural net with a hidden layer



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft5v582lt4ixkn6kdi5tu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft5v582lt4ixkn6kdi5tu.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F810p7xveqoqzno78u6oi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F810p7xveqoqzno78u6oi.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 81.d. Backpropagation on the neural net with a hidden layer



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftpm8dglq3ukbp2lmc7dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftpm8dglq3ukbp2lmc7dh.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9s3qqtf036pcrn96xx6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9s3qqtf036pcrn96xx6s.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 81.e. Backpropagation on the neural net with a hidden layer
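&lt;p&gt;A handy way to convince yourself that the chain rule steps above are right is to compare them against a numerical (finite-difference) derivative. The sketch below does that for a single weight. It assumes the Cost is the squared error averaged over the examples and divided by two, and it uses made-up random weights; the finite-difference check is an extra verification trick, not part of the training procedure itself.&lt;/p&gt;

```python
import numpy as np


def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))


def forward(X, W1, b1, W2, b2):
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A1, A2


def cost_fn(Y, A2):
    # Assumed convention: squared error divided by 2m
    return np.sum(np.square(Y - A2)) / (2 * Y.shape[1])


X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])
m = X.shape[1]

rng = np.random.default_rng(1)  # illustrative values only
W1, b1 = rng.standard_normal((3, 2)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((1, 3)), np.zeros((1, 1))

# Backpropagation: walk the graph from the Cost back to the inputs
A1, A2 = forward(X, W1, b1, W2, b2)
dA2 = (A2 - Y) / m                        # dCost/dA2
dZ2 = dA2 * A2 * (1 - A2)                 # through the output sigmoid
dW2 = dZ2 @ A1.T
db2 = np.sum(dZ2, axis=1, keepdims=True)
dA1 = W2.T @ dZ2                          # gradient flowing into the hidden layer
dZ1 = dA1 * A1 * (1 - A1)                 # through the hidden sigmoids
dW1 = dZ1 @ X.T
db1 = np.sum(dZ1, axis=1, keepdims=True)

# Central finite difference on one weight, W1[0, 0]
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
cost_plus = cost_fn(Y, forward(X, W1_plus, b1, W2, b2)[1])
cost_minus = cost_fn(Y, forward(X, W1_minus, b1, W2, b2)[1])
numeric = (cost_plus - cost_minus) / (2 * eps)

print("analytic:", dW1[0, 0], "numeric:", numeric)
```

&lt;p&gt;If the two printed numbers agree to several decimal places, the hand-derived gradient for that weight is very likely correct.&lt;/p&gt;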



&lt;p&gt;Whew 😅! That was a lot, but it did a great deal to improve our understanding. Let’s perform the gradient descent update:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0gexchdyz9ubd95bfu0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0gexchdyz9ubd95bfu0g.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 82. Gradient descent update for the neural net with a hidden layer



&lt;p&gt;At this point, I would encourage all readers to perform one training iteration themselves. The resultant gradients should be approximately (rounded to 3 decimal places):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F98tjcf89jbvnm8qosn2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F98tjcf89jbvnm8qosn2u.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 83. Derivatives computed during 2ⁿᵈ training iteration



&lt;p&gt;After 5000 epochs the Cost steadily decreases to about &lt;strong&gt;&lt;em&gt;0.0009&lt;/em&gt;&lt;/strong&gt; and we get the following Learning Curve and Decision Boundary:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmzh6jfpy4dymea5ioeur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmzh6jfpy4dymea5ioeur.png"&gt;&lt;/a&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fkg0ulphl97xuha0xgo9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fkg0ulphl97xuha0xgo9v.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 84. Learning Curve and Decision boundary of the neural net with one hidden layer



&lt;p&gt;Let’s also visualize where the decision of the neural network changes from 0 (red) to 1 (green):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqa2yemvcfqxvplk3wgr9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqa2yemvcfqxvplk3wgr9.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 85. Shaded decision boundary of the neural net with one hidden layer



&lt;p&gt;This shows that the neural network has in fact learned where to fire up (output 1) and where to lie dormant (output 0).&lt;/p&gt;

&lt;p&gt;If we add another hidden layer with maybe 2 or 3 sigmoid neurons we can get an even more complex decision boundary that may fit our data even more tightly, but let’s leave that for the coding section.&lt;/p&gt;

&lt;p&gt;Before we conclude this section I want to answer some remaining questions:&lt;/p&gt;
&lt;h4&gt;
  
  
  1- So, which one is better: Feature Engineering or a Deep Neural Network?
&lt;/h4&gt;

&lt;p&gt;Well, the answer depends on many factors. Generally, if we have a lot of training data we can just use a deep neural net to achieve acceptable accuracy, but if data is limited we may need to perform some feature engineering to extract more performance out of our neural network. As you saw in the feature engineering example above, to make good feature crosses one needs to have intimate knowledge of the dataset they are working with.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Feature engineering along with a deep neural network is a powerful combination.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  2- How to count the number of layers in a Neural Network?
&lt;/h4&gt;

&lt;p&gt;By convention, we don’t count layers without tunable weights and bias. Therefore, though the input layer is a separate “layer” we don’t count it when specifying the depth of a neural network.&lt;/p&gt;

&lt;p&gt;So, our last example was a &lt;em&gt;“2 layer neural network”&lt;/em&gt; (one hidden + output layer) and all the examples before it were just &lt;em&gt;“1 layer neural networks”&lt;/em&gt; (output layer only).&lt;/p&gt;
&lt;h4&gt;
  
  
  3- Why use Activation Functions?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Activation functions are nonlinear functions and add nonlinearity to the neurons. The feature crosses are a result of stacking the activation functions in hidden layers.&lt;/em&gt;&lt;/strong&gt; The combination of a bunch of activation functions thus results in a complex nonlinear decision boundary. In this blog, we used the sigmoid/logistic activation function, but there are many other types of activation functions (ReLU being a popular choice for hidden layers), each providing a certain benefit. &lt;strong&gt;&lt;em&gt;The choice of the activation function is also a hyper-parameter when creating neural networks.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Without activation functions to add nonlinearity, no matter how many linear functions we stack up, the result will still be linear.&lt;/em&gt;&lt;/strong&gt; Consider the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz83i6y1dfctujip3fdvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz83i6y1dfctujip3fdvg.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 86. Showing that stacking linear layers/functions results in a linear layer/function
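&lt;p&gt;The same argument can be checked numerically. In the sketch below, two linear layers with no activation in between are collapsed into a single equivalent linear layer; all values are random and the names &lt;code&gt;W_eq&lt;/code&gt; and &lt;code&gt;b_eq&lt;/code&gt; are just labels I picked for the collapsed parameters.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))   # 5 examples with 2 features each

W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal((1, 1))

# Two linear layers stacked without any activation function
stacked = W2 @ (W1 @ X + b1) + b2

# One linear layer with combined weight and bias
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
single = W_eq @ X + b_eq

print(np.allclose(stacked, single))
```

&lt;p&gt;However many linear layers are chained this way, the product of the weight matrices is still one matrix, so the network can only draw a straight decision boundary.&lt;/p&gt;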



&lt;p&gt;In principle, any nonlinear function can serve as an activation function, as long as it has a usable derivative for backpropagation. Some researchers have even used &lt;strong&gt;&lt;em&gt;cos&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;sin&lt;/em&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;tan&lt;/em&gt;&lt;/strong&gt; functions.&lt;/p&gt;
&lt;h4&gt;
  
  
  4- Why Random Initialization of Weights?
&lt;/h4&gt;

&lt;p&gt;This question is much easier to answer now. Note that if we had set all the weights in a layer to the same value, then the gradient that passes through each node would be the same. In short, all the nodes in the layer would learn the same feature about the data. Setting the weights to &lt;strong&gt;&lt;em&gt;random values helps in breaking the symmetry of weights&lt;/em&gt;&lt;/strong&gt; so that each node in a layer has the opportunity to learn a unique aspect of the training data.&lt;/p&gt;
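&lt;p&gt;Here is a small sketch of that symmetry problem on the XOR data. Every weight is set to the same constant (0.5 is an arbitrary choice of mine), and the gradient for each hidden neuron comes out identical, so gradient descent would keep the three neurons as exact copies of each other. The Cost is assumed to be the squared error divided by 2m.&lt;/p&gt;

```python
import numpy as np


def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))


X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])
m = X.shape[1]

# Every weight in a layer gets the same value (no symmetry breaking)
W1, b1 = np.full((3, 2), 0.5), np.zeros((3, 1))
W2, b2 = np.full((1, 3), 0.5), np.zeros((1, 1))

# Forward pass
A1 = sigmoid(W1 @ X + b1)
A2 = sigmoid(W2 @ A1 + b2)

# Backward pass down to the hidden layer weights
dZ2 = (A2 - Y) / m * A2 * (1 - A2)
dA1 = W2.T @ dZ2
dZ1 = dA1 * A1 * (1 - A1)
dW1 = dZ1 @ X.T   # one row of gradients per hidden neuron

print(dW1)
rows_identical = np.allclose(dW1, dW1[0])
print("All hidden neurons get the same gradient:", rows_identical)
```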

&lt;p&gt;There are many ways to set weights randomly in neural networks. For small neural networks, it is OK to set the weights to small random values. For larger networks, we tend to use the "Xavier" or "He" initialization methods (&lt;em&gt;these will appear in the coding section&lt;/em&gt;). Both of these methods still set weights to random values but control their variance. &lt;em&gt;For now, it suffices to say: use these methods when the network does not seem to converge and the Cost becomes static or reduces very slowly with the "plain" method of setting weights to small random values&lt;/em&gt;. Weight initialization is an active research area and will be a topic for a future "Nothing but NumPy" blog.&lt;/p&gt;

&lt;p&gt;Biases can be randomly initialized, too. But in practice, this does not seem to have much of an effect on the performance of a neural network. Perhaps this is because the number of bias terms in a neural network is much smaller than the number of weights.&lt;/p&gt;
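&lt;p&gt;To get a feel for how lopsided the counts are, the sketch below tallies weights and biases for a fully connected network. The helper name &lt;code&gt;count_parameters&lt;/code&gt; and the 784-128-10 layer sizes are hypothetical, chosen only to illustrate a larger network next to our 2-3-1 XOR network.&lt;/p&gt;

```python
def count_parameters(layer_sizes):
    """Return (number of weights, number of biases) for a fully connected net.

    layer_sizes lists the number of nodes per layer, input layer first.
    """
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    weights = sum(n_in * n_out for n_in, n_out in pairs)
    biases = sum(layer_sizes[1:])   # one bias per non-input neuron
    return weights, biases


print(count_parameters([2, 3, 1]))       # our XOR network: (9, 4)
print(count_parameters([784, 128, 10]))  # hypothetical larger net: (101632, 138)
```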

&lt;p&gt;The type of neural network we created here is called a &lt;strong&gt;&lt;em&gt;"fully-connected feedforward network"&lt;/em&gt;&lt;/strong&gt; or simply a &lt;strong&gt;&lt;em&gt;"feedforward network"&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This concludes Part Ⅰ.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part Ⅱ: Coding a Modular Neural Network
&lt;/h2&gt;

&lt;p&gt;The implementation in this part follows OOP principles.&lt;/p&gt;

&lt;p&gt;Let's first see the &lt;strong&gt;Linear Layer&lt;/strong&gt; class. The constructor takes as arguments: the shape of the data coming in (&lt;code&gt;input_shape&lt;/code&gt;), the number of neurons the layer outputs (&lt;code&gt;n_out&lt;/code&gt;), and what type of random weight initialization needs to be performed (&lt;code&gt;ini_type="plain"&lt;/code&gt;; the default, "plain", is just small random Gaussian numbers).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;initialize_parameters&lt;/code&gt; is a helper function used to define weights and bias. We'll look at it separately, later.&lt;/p&gt;

&lt;p&gt;Linear Layer implements the following functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;forward(A_prev)&lt;/code&gt;: This function allows the linear layer to take in activations from the previous layer (the input data can be seen as activations from the input layer) and perform the linear operation on them.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backward(upstream_grad)&lt;/code&gt;: This function computes the derivative of the Cost w.r.t. the weights, bias, and activations from the previous layer (&lt;code&gt;dW&lt;/code&gt;, &lt;code&gt;db&lt;/code&gt; &amp;amp; &lt;code&gt;dA_prev&lt;/code&gt;, respectively).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;update_params(learning_rate=0.1)&lt;/code&gt;: This function performs the gradient descent update on the weights and bias using the derivatives computed in the &lt;code&gt;backward&lt;/code&gt; function. The default learning rate (α) is 0.1.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
Fig 87. Linear Layer Class
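&lt;p&gt;As a rough sketch of the interface just described (not necessarily the exact code in the repository), a linear layer could look something like this. To keep the example self-contained, the weights are created inline instead of through &lt;code&gt;initialize_parameters&lt;/code&gt;, only the "plain" scheme is handled, and the 0.01 scaling factor is my assumption for "small random Gaussian numbers".&lt;/p&gt;

```python
import numpy as np


class LinearLayer:
    """Sketch of a linear layer computing Z = W . A_prev + b."""

    def __init__(self, input_shape, n_out, ini_type="plain"):
        n_in = input_shape[0]   # rows are features, columns are examples
        if ini_type != "plain":
            raise ValueError("only 'plain' initialization is sketched here")
        self.W = np.random.randn(n_out, n_in) * 0.01  # assumed scale
        self.b = np.zeros((n_out, 1))
        self.Z = np.zeros((n_out, input_shape[1]))

    def forward(self, A_prev):
        self.A_prev = A_prev    # cached for the backward pass
        self.Z = self.W @ A_prev + self.b

    def backward(self, upstream_grad):
        # upstream_grad is dCost/dZ, with the same shape as Z
        self.dW = upstream_grad @ self.A_prev.T
        self.db = np.sum(upstream_grad, axis=1, keepdims=True)
        self.dA_prev = self.W.T @ upstream_grad

    def update_params(self, learning_rate=0.1):
        self.W = self.W - learning_rate * self.dW
        self.b = self.b - learning_rate * self.db


# Quick shape check on the XOR inputs
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
layer = LinearLayer(input_shape=X.shape, n_out=3)
layer.forward(X)
layer.backward(np.ones((3, 4)))   # dummy upstream gradient
layer.update_params(learning_rate=0.1)
print(layer.Z.shape, layer.dW.shape, layer.db.shape, layer.dA_prev.shape)
```

&lt;p&gt;Caching &lt;code&gt;A_prev&lt;/code&gt; in the forward pass is what lets &lt;code&gt;backward&lt;/code&gt; compute &lt;code&gt;dW&lt;/code&gt; without being handed the inputs again.&lt;/p&gt;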




&lt;p&gt;Now let's see the &lt;strong&gt;Sigmoid Layer&lt;/strong&gt; class. Its constructor takes as an argument the shape of the data coming in (&lt;code&gt;input_shape&lt;/code&gt;) from the Linear Layer preceding it.&lt;/p&gt;

&lt;p&gt;Sigmoid Layer implements the following functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;forward(Z)&lt;/code&gt;: This function allows the sigmoid layer to take in the linear computations (&lt;code&gt;Z&lt;/code&gt;) from the previous layer and perform the sigmoid activation on them.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backward(upstream_grad)&lt;/code&gt;: This function computes the derivative of the Cost w.r.t. Z (&lt;code&gt;dZ&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
Fig 88. Sigmoid Activation Layer class
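&lt;p&gt;A sigmoid activation layer with this interface might be sketched as below. Again, this is an illustration of the idea rather than a copy of the repository code; it relies on the fact that the derivative of the sigmoid can be written in terms of its own output, A(1 − A).&lt;/p&gt;

```python
import numpy as np


class SigmoidLayer:
    """Sketch of a sigmoid activation layer."""

    def __init__(self, input_shape):
        self.A = np.zeros(input_shape)   # placeholder for the activations

    def forward(self, Z):
        self.A = 1.0 / (1.0 + np.exp(-Z))

    def backward(self, upstream_grad):
        # Local gradient of the sigmoid is A * (1 - A); chain it with upstream
        self.dZ = upstream_grad * self.A * (1 - self.A)


# Small check: sigmoid(0) is 0.5 and its slope there is 0.25
Z = np.array([[0.0, 2.0]])
act = SigmoidLayer(Z.shape)
act.forward(Z)
act.backward(np.ones_like(Z))
print(act.A[0, 0], act.dZ[0, 0])   # 0.5 0.25
```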




&lt;p&gt;The &lt;code&gt;initialize_parameters&lt;/code&gt; function is used only in the Linear Layer to set weights and biases. Using the size of the input (&lt;code&gt;n_in&lt;/code&gt;) and output (&lt;code&gt;n_out&lt;/code&gt;), it defines the shape the weight matrix and bias vector need to be in. This helper function then returns both the weight (W) and bias (b) in a Python dictionary to the respective Linear Layer.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
Fig 89. Helper function to set weights and bias
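&lt;p&gt;A helper along these lines could be sketched as follows. The scaling factors are assumptions on my part: 0.01 for "plain", the 1/√n_in variant of Xavier (other variants also use n_out), and √(2/n_in) for He. Starting the biases at zero is likewise an assumption.&lt;/p&gt;

```python
import numpy as np


def initialize_parameters(n_in, n_out, ini_type="plain"):
    """Return a dict with weight matrix W (n_out, n_in) and bias b (n_out, 1)."""
    if ini_type == "plain":
        # small random Gaussian numbers (assumed scale)
        W = np.random.randn(n_out, n_in) * 0.01
    elif ini_type == "xavier":
        # variance of roughly 1 / n_in
        W = np.random.randn(n_out, n_in) / np.sqrt(n_in)
    elif ini_type == "he":
        # variance of roughly 2 / n_in
        W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
    else:
        raise ValueError(f"unknown ini_type: {ini_type}")

    b = np.zeros((n_out, 1))   # assumed: biases start at zero
    return {"W": W, "b": b}


params = initialize_parameters(n_in=2, n_out=3, ini_type="he")
print(params["W"].shape, params["b"].shape)
```

&lt;p&gt;All three schemes draw from the same Gaussian; the only difference is how the spread of the weights is scaled with the number of inputs.&lt;/p&gt;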




&lt;p&gt;Finally, the Cost function &lt;code&gt;compute_cost(Y, Y_hat)&lt;/code&gt; takes as arguments the activations from the last layer (&lt;code&gt;Y_hat&lt;/code&gt;) and the true labels (&lt;code&gt;Y&lt;/code&gt;), and computes and returns the Squared Error Cost (&lt;code&gt;cost&lt;/code&gt;) and its derivative (&lt;code&gt;dY_hat&lt;/code&gt;).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
Fig 90. Function to compute Squared Error Cost and derivative
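&lt;p&gt;One way to write such a function is sketched below. It assumes the Squared Error Cost is summed over the examples and divided by 2m, so that the derivative with respect to &lt;code&gt;Y_hat&lt;/code&gt; comes out as −(Y − Y_hat)/m.&lt;/p&gt;

```python
import numpy as np


def compute_cost(Y, Y_hat):
    """Return the squared error cost and its derivative w.r.t. Y_hat."""
    m = Y.shape[1]   # number of examples (one per column)
    cost = np.sum(np.square(Y - Y_hat)) / (2 * m)
    dY_hat = -(Y - Y_hat) / m
    return cost, dY_hat


Y = np.array([[0.0, 1.0, 1.0, 0.0]])
Y_hat = np.full((1, 4), 0.5)   # an undecided network outputs 0.5 everywhere
cost, dY_hat = compute_cost(Y, Y_hat)
print(cost)     # 0.125
print(dY_hat)   # [[ 0.125 -0.125 -0.125  0.125]]
```

&lt;p&gt;The sign of each entry in &lt;code&gt;dY_hat&lt;/code&gt; tells us which way the corresponding prediction has to move: positive entries mean the output is too high, negative entries mean it is too low.&lt;/p&gt;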




&lt;p&gt;&lt;em&gt;At this point, you should open the &lt;a href="https://github.com/RafayAK/NothingButNumPy/blob/master/2_layer_toy_network_XOR.ipynb" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;2_layer_toy_network_XOR&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; Jupyter notebook from this &lt;a href="https://github.com/RafayAK/NothingButNumPy" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;repository&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; in a separate window and go over this blog and the notebook side-by-side.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now we are ready to create our neural network. Let's use the architecture defined in Figure 77 for XOR data.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
Fig 91. Defining the layers and training parameters




&lt;p&gt;Now we can start the main training loop:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
Fig 92. The training loop
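&lt;p&gt;Putting the pieces together, a complete training run on the XOR data might be sketched as below. The classes are compact stand-ins for the ones described above. The learning rate of 1.0, the seed, and the standard normal weight initialization are assumptions of mine, so the printed Cost values will not match the notebook's numbers.&lt;/p&gt;

```python
import numpy as np


class LinearLayer:
    def __init__(self, input_shape, n_out, rng):
        self.W = rng.standard_normal((n_out, input_shape[0]))  # assumed init
        self.b = np.zeros((n_out, 1))
        self.Z = np.zeros((n_out, input_shape[1]))

    def forward(self, A_prev):
        self.A_prev = A_prev
        self.Z = self.W @ A_prev + self.b

    def backward(self, upstream_grad):
        self.dW = upstream_grad @ self.A_prev.T
        self.db = np.sum(upstream_grad, axis=1, keepdims=True)
        self.dA_prev = self.W.T @ upstream_grad

    def update_params(self, learning_rate=0.1):
        self.W -= learning_rate * self.dW
        self.b -= learning_rate * self.db


class SigmoidLayer:
    def __init__(self, input_shape):
        self.A = np.zeros(input_shape)

    def forward(self, Z):
        self.A = 1.0 / (1.0 + np.exp(-Z))

    def backward(self, upstream_grad):
        self.dZ = upstream_grad * self.A * (1 - self.A)


def compute_cost(Y, Y_hat):
    m = Y.shape[1]
    return np.sum(np.square(Y - Y_hat)) / (2 * m), -(Y - Y_hat) / m


X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])

rng = np.random.default_rng(48)     # illustrative seed
Z1 = LinearLayer(X.shape, 3, rng)   # hidden layer with 3 neurons
A1 = SigmoidLayer(Z1.Z.shape)
Z2 = LinearLayer(A1.A.shape, 1, rng)  # output layer with 1 neuron
A2 = SigmoidLayer(Z2.Z.shape)

learning_rate = 1.0                 # assumed value
number_of_epochs = 5000
costs = []

for epoch in range(number_of_epochs):
    # ---- forward propagation ----
    Z1.forward(X)
    A1.forward(Z1.Z)
    Z2.forward(A1.A)
    A2.forward(Z2.Z)

    # ---- cost and its derivative ----
    cost, dA2 = compute_cost(Y, A2.A)
    costs.append(cost)
    if epoch % 1000 == 0:
        print(f"Cost at epoch#{epoch}: {cost}")

    # ---- backpropagation, last layer first ----
    A2.backward(dA2)
    Z2.backward(A2.dZ)
    A1.backward(Z2.dA_prev)
    Z1.backward(A1.dZ)

    # ---- gradient descent update ----
    Z2.update_params(learning_rate)
    Z1.update_params(learning_rate)
```

&lt;p&gt;Notice that the backward calls mirror the forward calls in reverse order, and that the parameters are only updated after all gradients for the current epoch have been computed.&lt;/p&gt;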




&lt;p&gt;Running the loop in the notebook, we see that the Cost decreases to about 0.0009 after 4900 epochs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
Cost at epoch#4600: 0.001018305488651183
Cost at epoch#4700: 0.000983783942124411
Cost at epoch#4800: 0.0009514180100050973
Cost at epoch#4900: 0.0009210166430616655
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Learning curve and Decision Boundaries look as follows:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7qs4g44i6zite76boarr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7qs4g44i6zite76boarr.png"&gt;&lt;/a&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9v8fexlcoc8ibw7uc9mo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9v8fexlcoc8ibw7uc9mo.png"&gt;&lt;/a&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpz849e0s3p4d5lcc9e12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpz849e0s3p4d5lcc9e12.png"&gt;&lt;/a&gt;&lt;/p&gt;
Fig 93. The Learning Curve, Decision Boundary, and Shaded Decision Boundary.



&lt;p&gt;The predictions our trained neural network produces are accurate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The predicted outputs:
 [[ 0.  1.  1.  0.]]
The accuracy of the model is: 100.0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
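&lt;p&gt;Turning the network's sigmoid outputs into hard 0/1 predictions is just a matter of thresholding. The sketch below uses made-up output values (they are hypothetical, not taken from the notebook) and a threshold of 0.5, which is my assumption for how the predictions are produced.&lt;/p&gt;

```python
import numpy as np

Y = np.array([[0.0, 1.0, 1.0, 0.0]])

# Hypothetical sigmoid outputs of a trained network (made-up values)
Y_hat = np.array([[0.03, 0.96, 0.97, 0.04]])

# Anything above the 0.5 threshold counts as class 1
predictions = (Y_hat > 0.5).astype(float)
accuracy = np.mean(predictions == Y) * 100

print("The predicted outputs:\n", predictions)
print(f"The accuracy of the model is: {accuracy}%")
```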



&lt;p&gt;Make sure to check out the other notebooks in the &lt;a href="https://github.com/RafayAK/NothingButNumPy" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;repository&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. We'll be building upon the things we learned in this blog in future Nothing but NumPy blogs. It would therefore behoove you to create the layer classes from memory as an exercise and to try recreating the OR gate example from &lt;strong&gt;&lt;em&gt;Part Ⅰ&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This concludes the blog🙌🎉. I hope you enjoyed it.&lt;br&gt;
For any questions, feel free to reach out to me on &lt;a href="https://twitter.com/RafayAK" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; &lt;a href="https://twitter.com/RafayAK" rel="noopener noreferrer"&gt;@RafayAK&lt;/a&gt;.&lt;/p&gt;







&lt;h4&gt;
  
  
  This blog would not have been possible without the following resources and people:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Andrej Karpathy's(&lt;a href="https://twitter.com/@karpathy" rel="noopener noreferrer"&gt;@karpathy&lt;/a&gt;) Stanford &lt;a href="http://cs231n.stanford.edu" rel="noopener noreferrer"&gt;course&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Christopher Olah's(&lt;a href="https://twitter.com/ch402" rel="noopener noreferrer"&gt;@ch402&lt;/a&gt;) &lt;a href="https://colah.github.io/" rel="noopener noreferrer"&gt;blogs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Andrew Trask's(&lt;a href="https://twitter.com/iamtrask" rel="noopener noreferrer"&gt;@iamtrask&lt;/a&gt;) &lt;a href="https://iamtrask.github.io/" rel="noopener noreferrer"&gt;blogs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Andrew Ng(&lt;a href="http://twitter.com/AndrewYNg" rel="noopener noreferrer"&gt;@AndrewYNg&lt;/a&gt;) and his Coursera courses on &lt;a href="https://www.coursera.org/specializations/deep-learning" rel="noopener noreferrer"&gt;deep learning&lt;/a&gt; and &lt;a href="https://www.coursera.org/learn/machine-learning" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Terence Parr(&lt;a href="https://twitter.com/the_antlr_guy" rel="noopener noreferrer"&gt;@the_antlr_guy&lt;/a&gt;) and Jeremy Howard (&lt;a href="https://twitter.com/jeremyphoward" rel="noopener noreferrer"&gt;@jeremyphoward&lt;/a&gt;)(&lt;a href="https://explained.ai/matrix-calculus/index.html" rel="noopener noreferrer"&gt;https://explained.ai/matrix-calculus/index.html&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Ian Goodfellow(&lt;a href="https://twitter.com/goodfellow_ian" rel="noopener noreferrer"&gt;@goodfellow_ian&lt;/a&gt;) and his amazing &lt;a href="https://www.deeplearningbook.org/" rel="noopener noreferrer"&gt;book&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Finally, Hassan-uz-Zaman(&lt;a href="https://twitter.com/OKidAmnesiac" rel="noopener noreferrer"&gt;@OKidAmnesiac&lt;/a&gt;) and Hassan Tauqeer for invaluable feedback.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>numpy</category>
    </item>
  </channel>
</rss>
