
Build a flexible Neural Network with Backpropagation in Python

Samay Shamdasani ・11 min read

What is a Neural Network?

Before we get started with the how of building a Neural Network, we need to understand the what first.

Neural networks can be intimidating, especially for people new to machine learning. However, this tutorial will break down how exactly a neural network works and you will have a working flexible neural network by the end. Let's get started!

Understanding the process

With approximately 100 billion neurons, the human brain processes data at speeds as fast as 268 mph! In essence, a neural network is a collection of neurons connected by synapses. This collection is organized into three main layers: the input layer, the hidden layer, and the output layer. You can have many hidden layers, which is where the term deep learning comes into play. In an artificial neural network like the one we'll build, several inputs, which are called features, go in, and a single output, which is called a label, comes out.


via Kabir Shah

The circles represent neurons while the lines represent synapses. The role of a synapse is to multiply the inputs by the weights. You can think of weights as the "strength" of the connection between neurons. Weights primarily define the output of a neural network. However, they are highly flexible. Afterwards, an activation function is applied to return an output.

Here's a brief overview of how a simple feedforward neural network works:

  1. Takes inputs as a matrix (2D array of numbers)

  2. Multiplies the input by a set of weights (performs a dot product aka matrix multiplication)

  3. Applies an activation function

  4. Returns an output

  5. Error is calculated by taking the difference between the desired output from the data and the predicted output. This error gives us the gradient we follow in gradient descent to alter the weights

  6. The weights are then altered slightly according to the error.

  7. To train, this process is repeated 1,000+ times. The more rounds of training the network gets, the closer its outputs generally come to the desired outputs.

At their core, neural networks are simple. They just perform a dot product with the input and weights and apply an activation function. When the weights are adjusted via the gradient of the loss function, the network adapts to the changes to produce more accurate outputs.

Our neural network will have two inputs, a single hidden layer with three neurons, and one output. In the network, we will be predicting the score of our exam based on the inputs of how many hours we studied and how many hours we slept the day before. Our test score is the output. Here's our sample data of what we'll be training our Neural Network on:

Hours studied    Hours slept    Test score
2                9              92
1                5              86
3                6              89
4                8              ?

As you may have noticed, the ? in this case represents what we want our neural network to predict. In this case, we are predicting the test score of someone who studied for four hours and slept for eight hours based on their prior performance.

Forward Propagation

Let's start coding this bad boy! Open up a new Python file. You'll want to import numpy as it will help us with certain calculations.

First, let's import our data as numpy arrays using np.array. We'll also want to normalize our units as our inputs are in hours, but our output is a test score from 0-100. Therefore, we need to scale our data by dividing by the maximum value for each variable.

import numpy as np

# X = (hours studying, hours sleeping), y = score on test
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)

# scale units
X = X/np.amax(X, axis=0) # divide each column of X by that column's maximum
y = y/100 # max test score is 100
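
To see what the scaling step actually produces, here is a small sketch that prints the per-column maxima and the scaled arrays. It uses the same data as above; the names col_max, X_scaled and y_scaled are introduced here only for illustration.

```python
import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)

# axis=0 takes the maximum down each column,
# so we get one maximum per input variable
col_max = np.amax(X, axis=0)

# broadcasting divides every row of X by the column maxima,
# so the largest value in each column becomes 1.0
X_scaled = X / col_max
y_scaled = y / 100

print(col_max)   # [3. 9.]
print(X_scaled)
print(y_scaled)
```

After this step every input and every target sits between 0 and 1, which matches the range the sigmoid output can reach.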

Next, let's define a Python class and write an __init__ function where we'll specify our parameters such as the sizes of the input, hidden, and output layers.

class Neural_Network(object):
  def __init__(self):
    #parameters
    self.inputSize = 2
    self.outputSize = 1
    self.hiddenSize = 3

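It is time for our first calculation. Remember that our synapses perform a dot product, or matrix multiplication, of the input and weight. Note that the weights start out random. The worked example below uses values between 0 and 1, while the code later draws them with np.random.randn(), which samples a standard normal distribution and so can also give negative values or values above 1.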

The calculations behind our network

In the data set, our input data, X, is a 3x2 matrix. Our output data, y, is a 3x1 matrix. Each element in matrix X needs to be multiplied by a corresponding weight and then added together with all the other results for each neuron in the hidden layer. Here's how the first input data element (2 hours studying and 9 hours sleeping) would calculate an output in the network:

This image breaks down what our neural network actually does to produce an output. First, the products of the randomly generated weights (.2, .6, .1, .8, .3, .7) on each synapse and the corresponding inputs are summed to arrive at the first values of the hidden layer. These sums are in a smaller font as they are not the final values for the hidden layer.

(2 * .2) + (9 * .8) = 7.6 
(2 * .6) + (9 * .3) = 3.9 
(2 * .1) + (9 * .7) = 6.5

To get the final value for the hidden layer, we need to apply the activation function. The role of an activation function is to introduce nonlinearity. An advantage of this is that the output is mapped to a range between 0 and 1, making it easier to alter weights in the future.

There are many activation functions out there. In this case, we'll stick to one of the more popular ones - the sigmoid function.

S(7.6) = 0.999499799
S(3.9) = 0.980160
S(6.5) = 0.998498818

Now, we need to use matrix multiplication again, with another set of random weights, to calculate our output layer value.

(.9994 * .4) + (.9802 * .5) + (.9984 * .9) = 1.78842

Lastly, to normalize the output, we just apply the activation function again.

S(1.78842) = .8567

And, there you go! Theoretically, with those weights, our neural network will calculate about .86 as our test score! However, our target was .92. Our result wasn't poor, it just isn't the best it can be. We just got a little lucky when I chose the random weights for this example.
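
If you'd like to check the hand calculation yourself, the sketch below recomputes the example with numpy. The weight layout (one row per input, one column per hidden neuron) is my reading of the sums above, and the variable names are chosen just for this illustration.

```python
import numpy as np

def sigmoid(s):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-s))

x = np.array([2.0, 9.0])  # unscaled inputs, as in the worked example

# row 0 holds the weights leaving the first input,
# row 1 the weights leaving the second input
W1 = np.array([[0.2, 0.6, 0.1],
               [0.8, 0.3, 0.7]])
W2 = np.array([0.4, 0.5, 0.9])  # hidden -> output weights

hidden_sum = np.dot(x, W1)       # three weighted sums, one per hidden neuron
hidden = sigmoid(hidden_sum)     # final hidden layer values
output_sum = np.dot(hidden, W2)  # weighted sum feeding the output neuron
output = sigmoid(output_sum)     # final prediction

print(np.round(hidden_sum, 1))
print(np.round(hidden, 4))
print(np.round(output, 4))
```

The prediction lands at roughly 0.86. Small differences from a hand calculation come from rounding the hidden values before the second multiplication.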

How do we train our model to learn? Well, we'll find out very soon. For now, let's continue coding our network.

If you are still confused, I highly recommend you check out this informative video which explains the structure of a neural network with the same example.

Implementing the calculations

Now, let's generate our weights randomly using np.random.randn(). Remember, we'll need two sets of weights. One to go from the input to the hidden layer, and the other to go from the hidden to output layer.

#weights
self.W1 = np.random.randn(self.inputSize, self.hiddenSize) # (2x3) weight matrix from input to hidden layer
self.W2 = np.random.randn(self.hiddenSize, self.outputSize) # (3x1) weight matrix from hidden to output layer
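
It can help to look at the shapes involved, since np.random.randn(a, b) returns an array of shape (a, b). The sketch below is only an illustration: it uses a placeholder input of zeros with three examples and two features, and checks that the matrix products line up. Passing the whole data matrix at once works because each row is treated as one example.

```python
import numpy as np

inputSize, hiddenSize, outputSize = 2, 3, 1

# placeholder data: 3 examples (rows), 2 features (columns)
X = np.zeros((3, inputSize))

W1 = np.random.randn(inputSize, hiddenSize)   # input -> hidden
W2 = np.random.randn(hiddenSize, outputSize)  # hidden -> output

hidden = np.dot(X, W1)    # (3x2) . (2x3) -> (3x3): 3 hidden values per example
out = np.dot(hidden, W2)  # (3x3) . (3x1) -> (3x1): 1 output per example

print(W1.shape, W2.shape)       # (2, 3) (3, 1)
print(hidden.shape, out.shape)  # (3, 3) (3, 1)
```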

Once we have all the variables set up, we are ready to write our forward propagation function. Let's pass in our input, X, and in this example, we can use the variable z to simulate the activity between the input and output layers. As explained, we need to take a dot product of the inputs and weights, apply an activation function, take another dot product of the hidden layer and second set of weights, and lastly apply a final activation function to receive our output:

def forward(self, X):
    #forward propagation through our network
    self.z = np.dot(X, self.W1) # dot product of X (input) and first set of 2x3 weights
    self.z2 = self.sigmoid(self.z) # activation function
    self.z3 = np.dot(self.z2, self.W2) # dot product of hidden layer (z2) and second set of 3x1 weights
    o = self.sigmoid(self.z3) # final activation function
    return o

Lastly, we need to define our sigmoid function:

def sigmoid(self, s):
    # activation function 
    return 1/(1+np.exp(-s))

And, there we have it! An (untrained) neural network capable of producing an output.

import numpy as np

# X = (hours studying, hours sleeping), y = score on test
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)

# scale units
X = X/np.amax(X, axis=0) # divide each column of X by that column's maximum
y = y/100 # max test score is 100

class Neural_Network(object):
  def __init__(self):
    #parameters
    self.inputSize = 2
    self.outputSize = 1
    self.hiddenSize = 3

    #weights
    self.W1 = np.random.randn(self.inputSize, self.hiddenSize) # (2x3) weight matrix from input to hidden layer
    self.W2 = np.random.randn(self.hiddenSize, self.outputSize) # (3x1) weight matrix from hidden to output layer

  def forward(self, X):
    #forward propagation through our network
    self.z = np.dot(X, self.W1) # dot product of X (input) and first set of 2x3 weights
    self.z2 = self.sigmoid(self.z) # activation function
    self.z3 = np.dot(self.z2, self.W2) # dot product of hidden layer (z2) and second set of 3x1 weights
    o = self.sigmoid(self.z3) # final activation function
    return o

  def sigmoid(self, s):
    # activation function
    return 1/(1+np.exp(-s))

NN = Neural_Network()

#defining our output
o = NN.forward(X)

print("Predicted Output: \n" + str(o))
print("Actual Output: \n" + str(y))

As you may have noticed, we need to train our network to calculate more accurate results.

Backpropagation

The "learning" of our network

Since we have a random set of weights, we need to alter them so that the network's outputs for our inputs match the corresponding outputs from our data set. This is done through a method called backpropagation.

Backpropagation works by using a loss function to calculate how far the network was from the target output.

Calculating error

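One way of representing the loss function is by using the mean sum squared loss function. In the form our code uses later, it is the mean of the squared differences between the actual and predicted outputs:

Loss = mean((y - o)^2)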

In this function, o is our predicted output, and y is our actual output. Now that we have the loss function, our goal is to get it as close as we can to 0. That means we will need to have close to no loss at all. As we are training our network, all we are doing is minimizing the loss.
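
As a quick numeric illustration, the sketch below computes this loss with numpy in the same way the training loop later in this tutorial does, using np.mean(np.square(y - o)). The predictions in o are made-up values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

y = np.array(([0.92], [0.86], [0.89]))  # actual (scaled) test scores
o = np.array(([0.90], [0.90], [0.90]))  # hypothetical predictions

# squared differences: 0.0004, 0.0016, 0.0001
squared_errors = np.square(y - o)

# their mean is the loss; smaller means closer predictions
loss = np.mean(squared_errors)
print(loss)
```

A perfect prediction would give a loss of exactly 0, and the loss grows quickly as predictions move away from the targets because the differences are squared.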

To figure out which direction to alter our weights, we need to find the rate of change of our loss with respect to our weights. In other words, we need to use the derivative of the loss function to understand how the weights affect the loss.

Because the loss depends on many weights at once, we will be using partial derivatives, which measure how the loss changes with respect to one weight while the others are held fixed.


via Kabir Shah

This method is known as gradient descent. Knowing which way to alter our weights lets us nudge the outputs closer to the targets with each step.

Here's how we will calculate the incremental change to our weights:

1) Find the margin of error of the output layer (o) by taking the difference of the predicted output and the actual output (y)

2) Apply the derivative of our sigmoid activation function to the output layer error. We call this result the delta output sum.

3) Use the delta output sum of the output layer error to figure out how much our z2 (hidden) layer contributed to the output error by performing a dot product with our second weight matrix. We can call this the z2 error.

4) Calculate the delta output sum for the z2 layer by applying the derivative of our sigmoid activation function (just like step 2).

5) Adjust the weights for the first layer by performing a dot product of the input layer with the hidden (z2) delta output sum. For the second weight, perform a dot product of the hidden(z2) layer and the output (o) delta output sum.

Calculating the delta output sum and then applying the derivative of the sigmoid function are very important to backpropagation. The derivative of the sigmoid, also known as sigmoid prime, will give us the rate of change, or slope, of the activation function at output sum.
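
The five steps can be sketched as plain numpy operations. This is only an illustration under a few assumptions: the error is taken as y - o, the weights are drawn with a fixed seed so the run is repeatable, and names such as sigmoid_prime, W1_update and W2_update are introduced just for this sketch. The comments track the array shapes at each step.

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def sigmoid_prime(s):
    # assumes s is already a sigmoid output
    return s * (1 - s)

np.random.seed(0)  # fixed seed, only so the run is repeatable
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
X = X / np.amax(X, axis=0)
y = np.array(([92], [86], [89]), dtype=float) / 100
W1 = np.random.randn(2, 3)
W2 = np.random.randn(3, 1)

# forward pass, needed before any error can be measured
z2 = sigmoid(np.dot(X, W1))   # hidden layer, shape (3, 3)
o = sigmoid(np.dot(z2, W2))   # output layer, shape (3, 1)

o_error = y - o                          # step 1: output error, (3, 1)
o_delta = o_error * sigmoid_prime(o)     # step 2: delta output sum, (3, 1)
z2_error = o_delta.dot(W2.T)             # step 3: (3, 1) . (1, 3) -> (3, 3)
z2_delta = z2_error * sigmoid_prime(z2)  # step 4: hidden delta, (3, 3)
W1_update = X.T.dot(z2_delta)            # step 5: (2, 3) . (3, 3) -> (2, 3)
W2_update = z2.T.dot(o_delta)            # step 5: (3, 3) . (3, 1) -> (3, 1)

print(W1_update.shape, W2_update.shape)  # (2, 3) (3, 1)
```

Each update has the same shape as the weight matrix it adjusts, which is a handy sanity check when the transposes get confusing.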

Let's continue to code our Neural_Network class by adding a sigmoidPrime (derivative of sigmoid) function:

def sigmoidPrime(self, s):
    #derivative of sigmoid, where s is already the output of the sigmoid function
    return s * (1 - s)
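
Note that s * (1 - s) is the derivative only when s is a value that has already been through the sigmoid, since the derivative of sigmoid(z) is sigmoid(z) * (1 - sigmoid(z)). The sketch below checks this numerically with a central finite difference. The helper name sigmoid_prime_from_output and the test points are chosen just for this illustration.

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def sigmoid_prime_from_output(a):
    # a must already be a sigmoid output, i.e. a = sigmoid(z)
    return a * (1 - a)

z = np.linspace(-5, 5, 11)  # a few raw (pre-activation) values
a = sigmoid(z)              # the activations a layer would store

# numerical slope of sigmoid at each z
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

# slope computed from the stored activations
analytic = sigmoid_prime_from_output(a)

print(np.allclose(numeric, analytic, atol=1e-8))
```

Because the forward pass already stores the activations, reusing them avoids evaluating the sigmoid a second time.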

Then, we'll want to create our backward propagation function that does everything specified in the five steps above:

def backward(self, X, y, o):
    # backward propagate through the network
    self.o_error = y - o # error in output
    self.o_delta = self.o_error*self.sigmoidPrime(o) # applying derivative of sigmoid to error

    self.z2_error = self.o_delta.dot(self.W2.T) # z2 error: how much our hidden layer weights contributed to output error
    self.z2_delta = self.z2_error*self.sigmoidPrime(self.z2) # applying derivative of sigmoid to z2 error

    self.W1 += X.T.dot(self.z2_delta) # adjusting first set (input --> hidden) weights
    self.W2 += self.z2.T.dot(self.o_delta) # adjusting second set (hidden --> output) weights

We can now define our output by initiating forward propagation, and initiate the backward function by calling it in the train function:

def train(self, X, y):
    o = self.forward(X)
    self.backward(X, y, o)

To run the network, all we have to do is to run the train function. Of course, we'll want to do this many, maybe thousands of, times. So, we'll use a for loop.

NN = Neural_Network()
for i in range(1000): # trains the NN 1,000 times
  print("Input: \n" + str(X))
  print("Actual Output: \n" + str(y))
  print("Predicted Output: \n" + str(NN.forward(X)))
  print("Loss: \n" + str(np.mean(np.square(y - NN.forward(X))))) # mean sum squared loss
  print("\n")
  NN.train(X, y)

Here's the full 60 lines of awesomeness:

import numpy as np

# X = (hours studying, hours sleeping), y = score on test
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)

# scale units
X = X/np.amax(X, axis=0) # divide each column of X by that column's maximum
y = y/100 # max test score is 100

class Neural_Network(object):
  def __init__(self):
    #parameters
    self.inputSize = 2
    self.outputSize = 1
    self.hiddenSize = 3

    #weights
    self.W1 = np.random.randn(self.inputSize, self.hiddenSize) # (2x3) weight matrix from input to hidden layer
    self.W2 = np.random.randn(self.hiddenSize, self.outputSize) # (3x1) weight matrix from hidden to output layer

  def forward(self, X):
    #forward propagation through our network
    self.z = np.dot(X, self.W1) # dot product of X (input) and first set of 2x3 weights
    self.z2 = self.sigmoid(self.z) # activation function
    self.z3 = np.dot(self.z2, self.W2) # dot product of hidden layer (z2) and second set of 3x1 weights
    o = self.sigmoid(self.z3) # final activation function
    return o

  def sigmoid(self, s):
    # activation function
    return 1/(1+np.exp(-s))

  def sigmoidPrime(self, s):
    #derivative of sigmoid, where s is already the output of the sigmoid function
    return s * (1 - s)

  def backward(self, X, y, o):
    # backward propagate through the network
    self.o_error = y - o # error in output
    self.o_delta = self.o_error*self.sigmoidPrime(o) # applying derivative of sigmoid to error

    self.z2_error = self.o_delta.dot(self.W2.T) # z2 error: how much our hidden layer weights contributed to output error
    self.z2_delta = self.z2_error*self.sigmoidPrime(self.z2) # applying derivative of sigmoid to z2 error

    self.W1 += X.T.dot(self.z2_delta) # adjusting first set (input --> hidden) weights
    self.W2 += self.z2.T.dot(self.o_delta) # adjusting second set (hidden --> output) weights

  def train(self, X, y):
    o = self.forward(X)
    self.backward(X, y, o)

NN = Neural_Network()
for i in range(1000): # trains the NN 1,000 times
  print("Input: \n" + str(X))
  print("Actual Output: \n" + str(y))
  print("Predicted Output: \n" + str(NN.forward(X)))
  print("Loss: \n" + str(np.mean(np.square(y - NN.forward(X))))) # mean sum squared loss
  print("\n")
  NN.train(X, y)
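
The listing trains the network but never asks it about the student who studied four hours and slept eight. Below is a compact, hedged sketch of how that prediction might be made. It restates the same forward and backward math as standalone functions rather than reusing the class, fixes the random seed so the run is repeatable, and, as an assumption on my part, scales the new input by the same column maxima used for the training data (names like x_max and new_input are mine).

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

X_raw = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float) / 100
x_max = np.amax(X_raw, axis=0)  # column maxima, kept for scaling new inputs
X = X_raw / x_max

np.random.seed(0)  # fixed seed, only so the run is repeatable
W1 = np.random.randn(2, 3)
W2 = np.random.randn(3, 1)

def forward(inputs):
    # returns the hidden activations and the network output
    z2 = sigmoid(np.dot(inputs, W1))
    return z2, sigmoid(np.dot(z2, W2))

initial_loss = np.mean(np.square(y - forward(X)[1]))

for _ in range(1000):
    z2, o = forward(X)
    o_delta = (y - o) * o * (1 - o)               # output error times sigmoid slope
    z2_delta = o_delta.dot(W2.T) * z2 * (1 - z2)  # error pushed back to hidden layer
    W1 += X.T.dot(z2_delta)
    W2 += z2.T.dot(o_delta)

final_loss = np.mean(np.square(y - forward(X)[1]))

# scale the new input exactly like the training inputs
new_input = np.array([4, 8], dtype=float) / x_max
prediction = forward(new_input)[1]

print("Loss before/after training:", initial_loss, final_loss)
print("Predicted score for (4, 8):", prediction * 100)
```

Keep in mind that four hours of studying lies outside the range seen in training (the maximum was three), so the network is extrapolating and the number should be read with caution.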

There you have it! A full-fledged neural network that can learn from inputs and outputs. While we thought of our inputs as hours studying and sleeping, and our outputs as test scores, feel free to change these to whatever you like and observe how the network adapts! After all, all the network sees are the numbers. The calculations we made, as complex as they seemed to be, all played a big role in our learning model. If you think about it, it's super impressive that your computer, an object, managed to learn by itself!

Stay tuned for more machine learning tutorials on other models like Linear Regression and Classification!

This tutorial was originally posted on Enlight, a site to learn by building projects. For more content like this, be sure to check out Enlight!

References

Steven Miller

Welch Labs

Special thanks to Kabir Shah for his contributions to the development of this tutorial

Discussion

max frenkel

Nice, but never seems to converge on array([[ 0.92, 0.86, 0.89]]). What's a good learning rate for the W update step? It should probably get smaller as error diminishes.

Actually, there is a bug in sigmoidPrime(), your derivative is wrong. It should return self.sigmoid(s) * (1 - self.sigmoid(s))

Samay Shamdasani Author

Hey Max,

I looked into this and with some help from my friend, I understood what was happening.

Your derivative is indeed correct. However, see how we return o in the forward propagation function (with the sigmoid function already applied to it). Then, in the backward propagation function we pass o into the sigmoidPrime() function, which if you look back, is equal to self.sigmoid(self.z3). So, the code is correct.

Samay Shamdasani Author

Hey! I'm not very well-versed in calculus, but are you sure that would be the derivative? As I understand, self.sigmoid(s) * (1 - self.sigmoid(s)) takes the input s, runs it through the sigmoid function, gets the output and then uses that output as the input in the derivative. I tested it out and it works, but if I run the code the way it is right now (using the derivative in the article), I get a super low loss and it's more or less accurate after training ~100k times.

I'd really love to know what's really wrong. Could you explain why the derivative is wrong, perhaps from the Calculus perspective?

Justin Chang

The derivation for the sigmoid prime function can be found here.

Haytam Zanid

There is nothing wrong with your derivative. max is talking about the actual derivative definition but he's forgetting that you actually calculated sigmoid(s) and stored it in the layers, so there is no need to calculate it again when using the derivative.

RochaOwng

Awesome tutorial, many thanks.
But I have one doubt, can you help me?

self.z2_error = self.o_delta.dot(self.W2.T) # z2 error: how much our hidden layer weights contributed to output error
self.z2_delta = self.z2_error*self.sigmoidPrime(self.z2) # applying derivative of sigmoid to z2 error

self.W1 += X.T.dot(self.z2_delta) # adjusting first set (input --> hidden) weights
self.W2 += self.z2.T.dot(self.o_delta) # adjusting second set (hidden --> output) weights

what means those T's? self.w2.T, self.z2.T etc...

Tamilarasu U

T is to transpose matrix in numpy.
docs.scipy.org/doc/numpy-1.14.0/re...

प्रदिप

Thanks for the great tutorial but how exactly can we use it to predict the result for next input? I tried adding 4,8 in the input and it would cause error as:
input:
[[0.5 1. ]
[0.25 0.55555556]
[0.75 0.66666667]
[1. 0.88888889]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.17124108]
[0.17259949]
[0.20243644]
[0.20958544]]

Traceback (most recent call last):
File "D:/try.py", line 58, in
print ("Loss: \n" + str(np.mean(np.square(y - NN.forward(X))))) # mean sum squared loss
ValueError: operands could not be broadcast together with shapes (3,1) (4,1)

Process finished with exit code 1

tanaydin sirin

after training done, you can make it like

Q = np.array(([4, 8]), dtype=float)
print "Input: \n" + str(Q)
print "Predicted Output: \n" + str(NN.forward(Q))

bartekspitza

Nice guide. I have one question:

Shouldn't the input to the NN be a vector? Right now the NN is receiving the whole training matrix as its input. The network has two input neurons so I can't see why we wouldn't pass it some vector of the training data.

Tried googling this but couldnt find anything useful so would really appreciate your response!

ayeo

I am not a python expert but it is probably usage of famous vectorized operations ;)

mecatronicope

Ok, I believe i miss something. Where are the new inputs (4,8) for hours studied and slept? And the predicted value for the output "Score"?

eternal_learner

Samay, this has been great to read.

Assume I wanted to add another layer to the NN.

Would I update the backprop to something like:

def backward(self, X, y, o):
    # backward propgate through the network
    self.o_error = y - o
    self.o_delta = self.o_error*self.sigmoidPrime(o)

    self.z3_error = self.o_delta.dot(self.W3.T)
    self.z3_delta = self.z3_error*self.sigmoidPrime(self.z3)

    self.z2_error = self.o_delta.dot(self.W2.T)
    self.z2_delta = self.z2_error*self.sigmoidPrime(self.z2)

    self.W1 += X.T.dot(self.z2_delta)
    self.W2 += self.z2.T.dot(self.z3_delta)
    self.W3 += self.z3.T.dot(self.o_delta)

Josh Cockcroft

Hi, this is a fantastic tutorial, thank you. I'm currently trying to build on this to take four inputs rather than two, but am struggling to get it to work. Do you have any guidance on scaling this up from two inputs?

DanielAgustian

Hello, i'm a noob on Machine Learning, so i wanna ask, is there any requirement for how many hidden layer do you need in a neural network? The hidden layer on this project is 3, is it because of input layer + output layer? Or it is completely random?

P̾̔̅̊͂̏̚aͬͪ̄v̋̒lo͛̎

what is the 'input later' ?

David Roth

Pretty sure the author meant 'input layer'.

Great article!

Samay Shamdasani Author

Yep! Just fixed it :)

Nishan Rayamajhee

Great Tutorial!

I translated this tutorial to rust with my own matrix operation implementation, which is terribly inefficient compared to numpy, but still produces similar result to this tutorial. Here's the docs: docs.rs/artha/0.1.0/artha/ and the code: gitlab.com/nrayamajhee/artha

Tamilarasu U

Excellent article for a beginner, but I just noticed Bias is missing your neural network. Isn't it required for simple neural networks?

And also you haven't applied any Learning rate. Will not it make the Gradient descent to miss the minimum?

ayeo

Great introduction! I have used it to implement this:

github.com/ayeo/letter_recognizer

Antonio

(2 * .6) + (9 * .3) = 7.5 wrong.
It is 3.9

Samay Shamdasani Author

Good catch! That is definitely my mistake. If one replaces it with 3.9, the final score would only be changed by one hundredth (.857 --> .858)!

Breener96

Great tutorial, explained everything so clearly!!

Xingdu Qiao

Great article for beginners like me! Thank you very much!

Muhammad

Great article actually helped me understand how neural network works

MohamedNedal

Hi, in this line:
for i in xrange(1000):
it told me that 'xrange' is not defined. Could you please explain how to fix it?

ayeo

With newer python version function is renamed to "range"

swati008

Hi,

It might sound silly but i am trying to do the same thing which has been discussed but i am not able to move forward.
Thanks for help and again i know it is basic but i am not able to figure it out.

MohamedNedal

Hi, Could you tell how to use this code to make predictions on a new data?