For the purpose of this blog, we will be building a neural network from scratch using Python.
The goal is for the network to learn the XOR gate.
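For reference, here is the XOR truth table we want the network to learn, written as a tiny training set in Python (the variable name is just my own choice for illustration):

```python
# XOR truth table: the output is 1 only when the two inputs differ.
xor_data = [
    # (input_1, input_2) -> expected output
    ((0, 0), 0),
    ((0, 1), 1),
    ((1, 0), 1),
    ((1, 1), 0),
]
```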
Step 1: Create the neural network
The neural network will have 3 layers: an input layer, a hidden layer, and an output layer.
The image below showcases the neural network.
Step 2: Randomly Assign Weights and Biases to the Network
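If you want to follow along in code, a minimal sketch of this step could look like the following. I'm assuming the layer sizes from the diagram (2 inputs, 2 hidden neurons, 2 output neurons); the function and variable names are my own:

```python
import random

INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE = 2, 2, 2  # sizes assumed from the diagram

def random_layer(n_inputs, n_neurons):
    # One weight per incoming connection and one bias per neuron,
    # all drawn from a small random range.
    weights = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_neurons)]
    biases = [random.uniform(-1, 1) for _ in range(n_neurons)]
    return weights, biases

hidden_weights, hidden_biases = random_layer(INPUT_SIZE, HIDDEN_SIZE)
output_weights, output_biases = random_layer(HIDDEN_SIZE, OUTPUT_SIZE)
```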
Step 3: Forward Pass
Some background information you need to know before the forward pass:
1) Activation functions (We will be using sigmoid)
2) Linear algebra (Honestly if you do not know this maybe you should stop reading?!)
Sigmoid Activation Function:
You do not have to understand the formula for now; just know that this is the formula and its derivative.
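As a quick sketch, here is the sigmoid and its derivative in Python. Note that the derivative is written in terms of the sigmoid's output, which is the form we will use during backpropagation later:

```python
import math

def sigmoid(x):
    # 1 / (1 + e^(-x))
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(output):
    # If `output` is already sigmoid(x), the derivative simplifies to output * (1 - output).
    return output * (1 - output)
```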
During the forward pass, the outputs of each neuron are calculated.
Forward Pass Algorithm
The algorithm is straightforward:
- Take the weighted sum of the incoming inputs and their weights
- Add the bias of the current neuron to the weighted sum
- Apply the activation function, and voilà! You have the output of one neuron.
Let's do it for one neuron together
1) Weighted Sum = 0.46 * 1 + 0.03 * 1 = 0.49
2) Adding the bias to the weighted sum = 0.49 + 0.12 = 0.61
3) Applying the activation function (sigmoid) = sigmoid(0.61) = 0.6479 =~ 0.65
(I simply substituted x with 0.61 in 1/(1 + e^(-x)))
Now you just need to do this 7B more times if you want to train a large network, or in our case, 4 more times.
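Here is that calculation as a small Python sketch, reusing the sigmoid from earlier and the example numbers above (the function name is my own):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    # Weighted sum of inputs and weights, plus the bias, then the activation.
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return sigmoid(weighted_sum + bias)

# First hidden neuron from the worked example: inputs (1, 1), weights 0.46 and 0.03, bias 0.12.
print(round(neuron_output([1, 1], [0.46, 0.03], 0.12), 4))  # 0.6479
```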
Completed Forward Pass
Step 4: The Dreaded Backpropagation
There are 2 parts to backpropagation:
1) Calculating the error and how much each neuron contributed to that error
2) Updating the weights to reduce the error
Calculating error
For the last layer (the output layer), think about it intuitively.
To improve, we need two things:
1) By how much are we wrong
2) Should our outputted value increase or decrease to match the expected output
Achieving the first part is straightforward; we can calculate
(output - expected_output).
For the second part, you need a little knowledge about calculus (i.e. derivatives).
Don't be scared; if you don't, I'll help you understand.
A derivative can be defined as:
The slope of the tangent to a curve at a given point is known as the derivative of the function with respect to x at that point.
The goal is to get the error to a minimum, and the slope tells us which way to move.
If the slope is positive we must reduce our value, and vice versa for a negative slope.
Error Calculation (Final Layer)
= (output - expected_output) * derivative(output)
The derivative at the output value can easily be found by plugging the output into this formula:
(output) * (1 - output) (see Figure 2: Sigmoid and its derivative)
Let's solve it together for the first neuron in the last layer:
= (0.77 - 1) * derivative(0.77)
= -0.23 * derivative(0.77)
= -0.23 * [(0.77) * (1-0.77)]
= -0.040733
=~ -0.04
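In code, the same output-layer error calculation might look like this (same numbers as above; the function name is my own):

```python
def output_error(output, expected):
    # (output - expected_output) * sigmoid derivative, written in terms of the output.
    return (output - expected) * output * (1 - output)

print(round(output_error(0.77, 1), 4))  # -0.0407, i.e. roughly -0.04
```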
After doing it for both the neurons in the output layer, the network will look something like this:
For neurons in the hidden layers
For the final layer it's straightforward: you know the expected output and the output you got.
For the hidden layers you have to determine how much the neuron contributed to the error of the next layer.
Key word: how much?
Every neuron might play a part in the error, so you want to change each neuron's weights in proportion to the impact it has on the next layer.
Therefore the error can be calculated as the sum of
(each outgoing weight from the neuron * the error of the corresponding next-layer neuron), and of course, let's not forget to multiply by the derivative.
Let's take the neuron with the output 0.65 (the first neuron from the top in the hidden layer).
(I had to bump up the precision, since the error deltas are too small.)
= (0.37 * -0.040) + (0.13 * 0.146)
= -0.0148 + 0.01898
= 0.00418
To get the error for this neuron:
= weighted sum * derivative(output)
= 0.00418 * [(0.65) * (1 - 0.65)]
= 0.00095
=~ 0.001
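And the hidden-layer version as a small sketch, plugging in the same numbers (0.37 and 0.13 are the outgoing weights, -0.040 and 0.146 are the errors of the two output neurons):

```python
def hidden_error(output, outgoing_weights, next_layer_errors):
    # Weighted sum of the outgoing weights and the corresponding next-layer errors,
    # multiplied by the sigmoid derivative of this neuron's output.
    weighted_sum = sum(w * e for w, e in zip(outgoing_weights, next_layer_errors))
    return weighted_sum * output * (1 - output)

print(round(hidden_error(0.65, [0.37, 0.13], [-0.040, 0.146]), 5))  # 0.00095
```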
Completed backpropagation
Step 10000 (Maybe?): Finally, the last step, updating the weights
Gradient descent is a topic I consider out of the scope of this blog, but to give you a quick summary:
You don't want to update weights too quickly; making large changes to your weights will swing your network's output by a large amount, preventing you from getting a model with high accuracy.
The parameter that controls how much a weight is tweaked is called the learning rate.
However, too small a learning rate will lead to spending hours on training; it's a tradeoff that you have to play around with.
Let's assume a learning rate of 0.1.
learning_rate = 0.1
Tuning weights and biases
weight = weight - (error * learning_rate * input)
bias = bias - (error * learning_rate)
Updating weight 1 and the bias for the first neuron in the hidden layer:
weight = 0.46 - (0.001 * 0.1 * 1)
weight = 0.4599 =~ 0.46
bias = 0.12 - (0.001 * 0.1)
bias = 0.1199 =~ 0.12
(Note: since we are rounding off the numbers, it looks like the weight is not updating, but these micro-adjustments can add up to a large change in the network.)
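As a sketch, the update rule above translates almost one-to-one into Python (same example numbers; the function names are my own):

```python
learning_rate = 0.1

def update_weight(weight, error, neuron_input, learning_rate):
    return weight - (error * learning_rate * neuron_input)

def update_bias(bias, error, learning_rate):
    return bias - (error * learning_rate)

# First hidden neuron: weight 0.46, bias 0.12, error ~0.001, input 1.
print(update_weight(0.46, 0.001, 1, learning_rate))  # ~0.4599
print(update_bias(0.12, 0.001, learning_rate))       # ~0.1199
```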
Weights and biases updated.
Although we rounded off the values, some weights and biases did change (marked in green).
PHEW!! We are finally done with one pass
The same process takes place hundreds of times, depending on the number of epochs, the batch size, and the input data set.
But lucky for us, we have computers to do that job for us.
Here is my implementation of a neural network from scratch.
https://github.com/haydencordeiro/NNFromScratch
Although it might not be a 100% replica, it’s good enough.
Here is a video of the training of this network (the same example)
https://www.youtube.com/watch?v=PkKaKe0_xCA
Congrats, you now understand backpropagation (hopefully!)
Disclaimer:
The purpose of this blog is to help you understand how a neural network learns, not to build a production-ready model. For simplicity, we use Mean Squared Error (MSE) as the loss function.
While MSE is mathematically easier to work with and helps explain the backpropagation process, it isn't ideal for classification problems like XOR. In real-world scenarios, especially for binary classification, Cross-Entropy Loss is generally preferred because it’s better at guiding the model when outputs are close to 0 or 1.