Hey, what's up! In my previous article I described how to build a neural network from scratch with only JavaScript. Today, at the request of several people, I'll try to explain the mathematical principles behind neural networks. Bro, you'll finally understand what's under the hood of that monster!
And first, I'm gonna tell you another secret: there's no magic, just math.
This article is based on my previous one. If you haven't read it yet, it's time to do that! I'll use the same formulas and try to explain them. Let's go!
Preparation
I'm gonna solve XOR again. It's not a joke, bro! Many data science books start with solving it. One more time, here's the XOR truth table.
| Inputs | Output |
| ------ | ------ |
| 0 0    | 0      |
| 0 1    | 1      |
| 1 0    | 1      |
| 1 1    | 0      |
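If you want to follow along in code, here's how that table could look as plain JavaScript data. This is just my own sketch; the names `trainingData`, `inputs`, and `target` are not from the previous article.

```javascript
// The XOR truth table as plain data: each sample is a pair of inputs and a target.
const trainingData = [
  { inputs: [0, 0], target: 0 },
  { inputs: [0, 1], target: 1 },
  { inputs: [1, 0], target: 1 },
  { inputs: [1, 1], target: 0 },
];
```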
To demonstrate it, let's use the following neural network structure.
Here we have 2 neurons in the input layer, 4 in the hidden layer, and 1 in the output layer.
Weight initialization
The main goal of neural network training is adjusting the weights to minimize the output error. In most cases, the weights are initialized randomly, and during training they are adjusted by the backpropagation algorithm.
So, let's initialize the weights randomly from the `[0, 1]` range.
Graphically, it looks like this.
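Just to make it concrete, here's one possible way to do that initialization in JavaScript. It's a rough sketch, not the exact code from my previous article; `Math.random()` returns values in `[0, 1)`, which is close enough to the range above.

```javascript
// A minimal sketch of random initialization: 2 inputs -> 4 hidden neurons -> 1 output.
const inputToHidden = Array.from({ length: 4 }, () =>
  Array.from({ length: 2 }, () => Math.random())
);
const hiddenToOutput = Array.from({ length: 4 }, () => Math.random());

console.log(inputToHidden, hiddenToOutput);
```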
Forward propagation
Ok, let's compute the neuron inputs. I'll use only one input case to save time: `0` and `1`, so the expected output is `1`.
So, for the first neuron in the hidden layer:
net1_h = 0 * 0.2 + 1 * 0.6 = 0.6
/**
 * 1..n, n = 2 (2 neurons in the input layer)
 * 0   - the value of the first input element
 * 1   - the value of the second input element
 * 0.2 - the weight from the first input neuron to the first hidden neuron
 * 0.6 - the weight from the second input neuron to the first hidden neuron
 * Understand, bro?
 */
For the second one and the others:
net2_h = 0 * 0.5 + 1 * 0.7 = 0.7
net3_h = 0 * 0.4 + 1 * 0.9 = 0.9
net4_h = 0 * 0.8 + 1 * 0.3 = 0.3
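If you prefer code over arithmetic, here are the same weighted sums in JavaScript (the array names are just my own shorthand for the weights shown above):

```javascript
// Weighted sum (net) of each hidden neuron, using the weights from the example.
const inputs = [0, 1];
const inputToHidden = [
  [0.2, 0.6], // weights into the 1st hidden neuron
  [0.5, 0.7], // weights into the 2nd hidden neuron
  [0.4, 0.9], // weights into the 3rd hidden neuron
  [0.8, 0.3], // weights into the 4th hidden neuron
];

const hiddenNets = inputToHidden.map((weights) =>
  weights.reduce((sum, w, i) => sum + w * inputs[i], 0)
);
console.log(hiddenNets); // [0.6, 0.7, 0.9, 0.3]
```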
Now, we need one more thing: we need to choose an activation function. I'll use the sigmoid.
f(x) = 1 / (1 + exp(-x))
deriv(x) = f(x) * (1 - f(x))
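In code, these two formulas are just two tiny functions:

```javascript
// The same two formulas as tiny JavaScript functions.
const sigmoid = (x) => 1 / (1 + Math.exp(-x));
const sigmoidDeriv = (x) => sigmoid(x) * (1 - sigmoid(x));

console.log(sigmoid(0.6)); // ≈ 0.6457, which the text below rounds to 0.64
```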
So, now we apply our activation function to each computed net:
output1_h = f(net1_h) = f(0.6) = 0.64
output2_h = f(net2_h) = f(0.7) = 0.66
output3_h = f(net3_h) = f(0.9) = 0.71
output4_h = f(net4_h) = f(0.3) = 0.57
We've got the output values for each neuron in the hidden layer. Graphically, it looks like this:
And now that we've got the output values for the hidden layer neurons, we can calculate the output value of the output layer.
net_o = 0.64 * 0.6 + 0.66 * 0.7 + 0.71 * 0.3 + 0.57 * 0.4 = 1.28
output_o = f(net_o) = f(1.28) = 0.78
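Here's the whole forward step in JavaScript, plugging in the rounded values from the text so any small differences in the result come only from rounding:

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

const hiddenOutputs = [0.64, 0.66, 0.71, 0.57]; // sigmoid of each hidden net
const hiddenToOutput = [0.6, 0.7, 0.3, 0.4];    // weights to the output neuron

const outputNet = hiddenOutputs.reduce(
  (sum, h, i) => sum + h * hiddenToOutput[i],
  0
); // ≈ 1.29 (the text truncates it to 1.28)

const output = sigmoid(outputNet);
console.log(output.toFixed(2)); // "0.78"
```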
And here we go.
Back propagation
Bro, look at the output value. What do you see? `0.78`, right? If you remember the XOR table, you know that we should have got `1` for the case `0 1`, but we've got `0.78`. That difference is called the error. Let's calculate it.
Output error and delta
The formula:
target = 1
error = target - output_o = 1 - 0.78 = 0.22
Now, we need to calculate the delta error. In general, that's the value by which you adjust the weights.
The formula:
You can use this site for sigmoid derivative calculation.
delta_error = deriv(output_o) * error = deriv(0.78) * 0.22 = 0.21 * 0.22 = 0.04
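The same two lines in JavaScript. Note that, just like the calculation above, the derivative is fed the output value directly:

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));
const sigmoidDeriv = (x) => sigmoid(x) * (1 - sigmoid(x));

const target = 1;
const output = 0.78;

const error = target - output;                   // 0.22
const deltaError = sigmoidDeriv(output) * error; // ≈ 0.047, rounded to 0.04 in the text
```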
Hidden error and delta
Let's do the same for each neuron in the hidden layer. The formula is a little bit different.
We calculate the error for each neuron by multiplying the output delta by the weight that connects that neuron to the output. Remember that, bro. Let's get started!
error1_h = delta_error * 0.6 = 0.04 * 0.6 = 0.024
error2_h = delta_error * 0.7 = 0.04 * 0.7 = 0.028
error3_h = delta_error * 0.3 = 0.04 * 0.3 = 0.012
error4_h = delta_error * 0.4 = 0.04 * 0.4 = 0.016
And again the delta!
delta_error1_h = deriv(output1_h) * error1_h = deriv(0.64) * 0.024 = 0.22 * 0.024 = 0.005
delta_error2_h = deriv(output2_h) * error2_h = deriv(0.66) * 0.028 = 0.224 * 0.028 = 0.006
delta_error3_h = deriv(output3_h) * error3_h = deriv(0.71) * 0.012 = 0.220 * 0.012 = 0.002
delta_error4_h = deriv(output4_h) * error4_h = deriv(0.57) * 0.016 = 0.23 * 0.016 = 0.003
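And in code, doing all four hidden neurons at once (again, the derivative gets the output values, exactly as in the calculations above):

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));
const sigmoidDeriv = (x) => sigmoid(x) * (1 - sigmoid(x));

const deltaError = 0.04;                        // the output delta from above
const hiddenToOutput = [0.6, 0.7, 0.3, 0.4];    // hidden-to-output weights
const hiddenOutputs = [0.64, 0.66, 0.71, 0.57]; // hidden outputs from the forward pass

// error_j = delta_error * weight from hidden neuron j to the output
const hiddenErrors = hiddenToOutput.map((w) => deltaError * w);

// delta_j = deriv(output_j) * error_j
const hiddenDeltas = hiddenErrors.map(
  (err, j) => sigmoidDeriv(hiddenOutputs[j]) * err
);
// hiddenErrors ≈ [0.024, 0.028, 0.012, 0.016]
// hiddenDeltas ≈ [0.005, 0.006, 0.003, 0.004] (the text truncates the last two)
```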
The time has come!
Now we have all the variables we need to update the weights. The formulas look like this.
Let's start with the weights from the hidden layer to the output.
learning_rate = 0.001
hidden_to_output_1 = old_weight + output1_h * delta_error * learning_rate = 0.6 + 0.64 * 0.04 * 0.001 = 0.6000256
hidden_to_output_2 = old_weight + output2_h * delta_error * learning_rate = 0.7 + 0.66 * 0.04 * 0.001 = 0.7000264
hidden_to_output_3 = old_weight + output3_h * delta_error * learning_rate = 0.3 + 0.71 * 0.04 * 0.001 = 0.3000284
hidden_to_output_4 = old_weight + output4_h * delta_error * learning_rate = 0.4 + 0.57 * 0.04 * 0.001 = 0.4000228
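As a sketch in JavaScript, the whole update is a single `map`:

```javascript
// Hidden-to-output update: new = old + source_output * delta_error * learning_rate
const learningRate = 0.001;
const deltaError = 0.04;
const hiddenOutputs = [0.64, 0.66, 0.71, 0.57];
const oldWeights = [0.6, 0.7, 0.3, 0.4];

const newWeights = oldWeights.map(
  (w, j) => w + hiddenOutputs[j] * deltaError * learningRate
);
console.log(newWeights); // ≈ [0.6000256, 0.7000264, 0.3000284, 0.4000228]
```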
The values we've got are very close to the old weights. That's because we chose a very small learning rate. It's a very important hyperparameter. If it's too small, your network will train for years. If it's too large, your network will train faster, but its accuracy may be poor on new data. So you have to choose it carefully. The optimal value is usually in the range between `1e-3` and `2e-5`.
Ok, let's do the same for the input-to-hidden synapses.
//for the first hidden neuron
input_to_hidden_1 = old_weight + input_0 * delta_error1_h * learning_rate = 0.2 + 0 * 0.005 * 0.001 = 0.2
input_to_hidden_2 = old_weight + input_1 * delta_error1_h * learning_rate = 0.6 + 1 * 0.005 * 0.001 = 0.600005
//for the second one
input_to_hidden_3 = old_weight + input_0 * delta_error2_h * learning_rate = 0.5 + 0 * 0.006 * 0.001 = 0.5
input_to_hidden_4 = old_weight + input_1 * delta_error2_h * learning_rate = 0.7 + 1 * 0.006 * 0.001 = 0.700006
//for the third one
input_to_hidden_5 = old_weight + input_0 * delta_error3_h * learning_rate = 0.4 + 0 * 0.002 * 0.001 = 0.4
input_to_hidden_6 = old_weight + input_1 * delta_error3_h * learning_rate = 0.9 + 1 * 0.002 * 0.001 = 0.900002
//for the fourth one
input_to_hidden_7 = old_weight + input_0 * delta_error4_h * learning_rate = 0.8 + 0 * 0.003 * 0.001 = 0.8
input_to_hidden_8 = old_weight + input_1 * delta_error4_h * learning_rate = 0.3 + 1 * 0.003 * 0.001 = 0.300003
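And the same update for the input-to-hidden weights, sketched in JavaScript with one row of weights per hidden neuron:

```javascript
// Input-to-hidden update: new = old + input_value * hidden_delta * learning_rate
const learningRate = 0.001;
const inputs = [0, 1];
const hiddenDeltas = [0.005, 0.006, 0.002, 0.003];
const oldWeights = [
  [0.2, 0.6],
  [0.5, 0.7],
  [0.4, 0.9],
  [0.8, 0.3],
];

const newWeights = oldWeights.map((row, j) =>
  row.map((w, i) => w + inputs[i] * hiddenDeltas[j] * learningRate)
);
console.log(newWeights);
// ≈ [[0.2, 0.600005], [0.5, 0.700006], [0.4, 0.900002], [0.8, 0.300003]]
```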
That's it! Finally!
Conclusions
Oh, we finally did all the math stuff! But we only did that for one training sample: `0` and `1`. For the problem we're solving (XOR) we have 4 training samples (see the table above). That means you have to do the same calculations we just did for each of them! Brrr, that's terrible. Too much math.
So, in machine learning, when you do one forward propagation step (from the input layer to the output) and one backward step (from the output layer to the input) for one training sample, it's called an iteration. Another important term is epoch. The epoch counter increments when all the training samples have passed through the neural network. In our case, we have 4 training samples. One iteration means one sample passed through the network; when all samples have passed through, that's one epoch. So: 4 iterations equal 1 epoch. Understand, bro? In general, more epochs give higher accuracy, fewer epochs give lower accuracy.
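Here's a tiny sketch of what that loop looks like. `forwardPropagate` and `backPropagate` are hypothetical placeholders for the two steps we walked through above, not real functions from any library:

```javascript
// Iterations vs. epochs in a training loop.
const trainingData = [
  { inputs: [0, 0], target: 0 },
  { inputs: [0, 1], target: 1 },
  { inputs: [1, 0], target: 1 },
  { inputs: [1, 1], target: 0 },
];

const epochs = 10000; // just an example value, tune it for your own network

for (let epoch = 0; epoch < epochs; epoch++) {
  for (const sample of trainingData) {
    // forwardPropagate(sample.inputs);             // one forward pass
    // backPropagate(sample.inputs, sample.target); // one backward pass
    // together, these make one iteration
  }
  // when the inner loop has processed all 4 samples, one epoch is done
}
```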
That's it. No magic, only math. Hope you've understood it, bro. See ya! Happy coding!