Aniruddha Karajgi

Posted on Nov 27, 2021 • Originally published at towardsdatascience.com on Nov 4, 2020

How Neural Networks Solve the XOR Problem

#perceptron #datascience #algorithms #deeplearning

And why hidden layers are so important

The perceptron is a classification algorithm. Specifically, it works as a linear binary classifier. It was invented in the late 1950s by Frank Rosenblatt.

The perceptron basically works as a threshold function — non-negative outputs are put into one class while negative ones are put into the other class.

Though there’s a lot to talk about when it comes to neural networks and their variants, we’ll be discussing a specific problem that highlights the major differences between a single layer perceptron and one that has a few more layers.

**The Perceptron**
   Structure and Properties
   Evalutation
   Training algorithm

**2d Xor problem**   
   The XOR function

**Attempt #1: The Single Layer Perceptron**   
Implementing the Perceptron algorithm
   Results
   The need for non-linearity

**Attempt #2: Multiple Decision Boundaries**  
   Intuition
   Implementing the OR and NAND parts

**The Multi-layered Perceptron**  
   Structure and Properties
   Training algorithm

**Attempt #3: The Multi-layered Perceptron**  
   Implementing the MLP
   Results

Structure and Properties

A perceptron has the following components:

Input nodes
Output node
An activation function
Weights and biases
Error function

Input Nodes

These nodes contain the input to the network. In any iteration — whether testing or training — these nodes are passed the input from our data.

Weights and Biases

These parameters are what we update when we talk about “training” a model. They are initialized to some random value or set to 0 and updated as the training progresses. The bias is analogous to a weight independent of any input node. Basically, it makes the model more flexible, since you can “move” the activation function around.

Evaluation

The output calculation is straightforward.

Compute the dot product of the input and weight vector
Add the bias
Apply the activation function.

This can be expressed like so:

This is often simplified and written as a dot- product of the weight and input vectors plus the bias.

Activation Function

This function allows us to fit the output in a way that makes more sense. For example, in the case of a simple classifier, an output of say -2.5 or 8 doesn’t make much sense with regards to classification. If we use something called a sigmoidal activation function, we can fit that within a range of 0 to 1, which can be interpreted directly as a probability of a datapoint belonging to a particular class.

Though there are many kinds of activation functions, we’ll be using a simple linear activation function for our perceptron. The linear activation function has no effect on its input and outputs it as is.

Classification

How does a perceptron assign a class to a datapoint?

We know that a datapoint’s evaluation is expressed by the relation wX + b . We define a threshold ( θ ) which classifies our data. Generally, this threshold is set to 0 for a perceptron.

So points for which wX + b is greater than or equal to 0 will belong to one class while the rest (wX + b is negative) are classified as belonging to the other class. We can express this as:

Training algorithm

To train our perceptron, we must ensure that we correctly classify all of our train data. Note that this is different from how you would train a neural network, where you wouldn’t try and correctly classify your entire training data. That would lead to something called overfitting in most cases.

We start the training algorithm by calculating the gradient , or Δw. Its the product of:

the value of the input node corresponding to that weight
The difference between the actual value and the computed value.

We get our new weights by simply incrementing our original weights with the computed gradients multiplied by the learning rate.

A simple intuition for how this works: if our perceptron correctly classifies an input data point, actual_value — computed_value would be 0 , and there wouldn’t be any change in our weights since the gradient is now 0.

The 2D XOR problem

In the XOR problem, we are trying to train a model to mimic a 2D XOR function.

The XOR function

The function is defined like so:

If we plot it, we get the following chart. This is what we’re trying to classify. The ⊕ (“o-plus”) symbol you see in the legend is conventionally used to represent the XOR boolean operator.

The XOR output plot — Image by Author using draw.io

Our algorithm —regardless of how it works — must correctly output the XOR value for each of the 4 points. We’ll be modelling this as a classification problem, so Class 1 would represent an XOR value of 1, while Class 0 would represent a value of 0.

Attempt #1: The Single Layer Perceptron

Let's model the problem using a single layer perceptron.

Input data

The data we’ll train our model on is the table we saw for the XOR function.

**Data Target  
[0, 0] 0  
[0, 1] 1  
[1, 0] 1  
[1, 1] 0**

Implementation

Imports

Apart from the usual visualization ( matplotliband seaborn) and numerical libraries (numpy), we’ll use cycle from itertools . This is done since our algorithm cycles through our data indefinitely until it manages to correctly classify the entire training data without any mistakes in the middle.

The data

We next create our training data. This data is the same for each kind of logic gate, since they all take in two boolean variables as input.

The training function

Here, we cycle through the data indefinitely, keeping track of how many consecutive datapoints we correctly classified. If we manage to classify everything in one stretch, we terminate our algorithm.

If not, we reset our counter, update our weights and continue the algorithm.

To visualize how our model performs, we create a mesh of datapoints, or a grid, and evaluate our model at each point in that grid. Finally, we colour each point based on how our model classifies it. So the Class 0 region would be filled with the colour assigned to points belonging to that class.

The Perceptron class

To bring everything together, we create a simple Perceptron class with the functions we just discussed. We have some instance variables like the training data, the target, the number of input nodes and the learning rate.

Results

Let’s create a perceptron object and train it on the XOR data.

You’ll notice that the training loop never terminates, since a perceptron can only converge on linearly separable data. Linearly separable data basically means that you can separate data with a point in 1D, a line in 2D, a plane in 3D and so on.

A perceptron can only converge on linearly separable data. Therefore, it isn’t capable of imitating the XOR function.

Remember that a perceptron must correctly classify the entire training data in one go. If we keep track of how many points it correctly classified consecutively, we get something like this.

The value of correct_counter over 100 cycles of training — Image by Author

The algorithm only terminates when correct_counter hits 4 — which is the size of the training set — so this will go on indefinitely.

The Need for Non-Linearity

It is clear that a single perceptron will not serve our purpose: the classes aren’t linearly separable. This boils down to the fact that a single linear decision boundary isn’t going to work.

Non-linearity allows for more complex decision boundaries. One potential decision boundary for our XOR data could look like this.

A potential non-linear decision boundary for our XOR model — Image by Author using draw.io

The 2d XOR problem — Attempt #2

We know that the imitating the XOR function would require a non-linear decision boundary.

But why do we have to stick with a single decision boundary?

The Intuition

Let’s first break down the XOR function into its AND and OR counterparts.

The XOR function on two boolean variables A and B is defined as:

Let’s add A.~A and B.~B to the equation. Since they both equate to 0, the equation remains valid.

Let’s rearrange the terms so that we can pull out A from the first part and B from the second.

Simplifying it further, we get:

Using DeMorgan’s laws for boolean algebra:~A + ~B = ~(AB) , we can replace the second term in the above equation like so:

Let’s replace A and B with x_1 and x_2 respectively since that’s the convention we’re using in our data.

The XOR function can be condensed into two parts: a NAND and an OR. If we can calculate these separately, we can just combine the results, using an AND gate.

Let’s call the OR section of the formula part I, and the NAND section as part II.

Modelling the OR part

We’ll use the same Perceptron class as before, only that we’ll train it on OR training data.

The OR truth table — Image by Author using draw.io

This converges, since the data for the OR function is linearly separable. If we plot the number of correctly classified consecutive datapoints as we did in our first attempt, we get this plot. It’s clear that around iteration 50, it hits the value 4, meaning that it classified the entire dataset correctly.

correct_counter measures the number of consecutive datapoints correctly classified by our Perceptron

The correct_counter plot for our OR perceptron — Image by Author

The decision boundary plot looks like this:

The Output plot of our OR perceptron — Image by Author

Modelling the NAND part

Let’s move on to the second part. We need to model a NAND gate. Just like the OR part, we’ll use the same code, but train the model on the NAND data. So our input data would be:

The NAND Truth table — Image by Author using draw.io

After training, the following plots show that our model converged on the NAND data and mimics the NAND gate perfectly.

Decision boundary and correct_counter plots for the NAND perceptron — Image by Author

Bringing everything together

Two things are clear from this:

we are performing a logical AND on the outputs of two logic gates (where the first one is an OR and the second one a NAND)
and that both functions are being passed the same input (x1 and x2).

Let’s model this into our network. First, let’s consider our two perceptrons as black boxes.

The plan for our model — Image by Author using draw.io

After adding our input nodes x_1 and x_2, we can finally implement this through a simple function.

Adding input nodes — Image by Author using draw.io

Finally, we need an AND gate, which we’ll train just we have been.

The correct_count and output plots of our AND perceptron. — Image by Author

What we now have is a model that mimics the XOR function.

If we were to implement our XOR model, it would look something like this:

If we plot the decision boundaries from our model, — which is basically an AND of our OR and NAND models — we get something like this:

The Output plot of our 2nd Attempt, showing a correct classification on our XOR data— Image by Author using draw.io

Out of all the 2 input logic gates, the XOR and XNOR gates are the only ones that are not linearly-separable.

Though our model works, it doesn’t seem like a viable solution to most non-linear classification or regression tasks. It’s really specific to this case, and most problems can’t be split into just simple intermediate problems that can be individually solved and then combined. For something like this:

A binary classification problem in two dimensions — Image by Author using draw.io

A potential decision boundary could be something like this:

A potential decision boundary that fits our example — Image by Author using draw.io

We need to look for a more general model, which would allow for non-linear decision boundaries, like a curve, as is the case above. Let’s see how an MLP solves this issue.

The Multi-layered Perceptron

The overall components of an MLP like input and output nodes, activation function and weights and biases are the same as those we just discussed in a perceptron.

The biggest difference? An MLP can have hidden layers.

Hidden layers

Hidden layers are those layers with nodes other than the input and output nodes.

An MLP is generally restricted to having a single hidden layer.

The hidden layer allows for non-linearity. A node in the hidden layer isn’t too different to an output node: nodes in the previous layers connect to it with their own weights and biases, and an output is computed, generally with an activation function.

The general structure of a multi-layered perceptron — Image by Author using draw.io