Tiny Two‑Layer Neural Network that Learns XOR
Meta description: Learn how a simple two‑layer NumPy neural network solves the XOR problem with back‑propagation, step‑by‑step code and explanations.
Tags: xor, neuralnetwork, twolayer, numpy, backpropagation, machinelearning, deeplearning, python, gradientdescent, sigmoid
Introduction
The exclusive‑or (XOR) problem is a classic benchmark for neural networks. It is easy to describe, but a single linear neuron cannot solve it. In this post we walk through a compact NumPy implementation of a two‑layer (one hidden layer) network that learns the XOR truth table from scratch. You will see how the data are prepared, how the parameters are initialized, how the forward and backward passes are performed, and why the hidden layer is essential.
What the Program Does – In a Nutshell
- Builds a tiny feed‑forward network with one hidden layer of four sigmoid units.
- Trains it on the four possible binary inputs of XOR using gradient descent.
- After 10 000 epochs the network’s predictions are close to the target values (≈ 0 for False, ≈ 1 for True).
The printed output after training looks like:
```
[[0.02]
 [0.97]
 [0.96]
 [0.03]]
```
These correspond to the XOR results for the inputs (0,0), (0,1), (1,0) and (1,1).
Data Preparation – The XOR Truth Table
| x1 | x2 | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
In the code the inputs are stored in a 4 × 2 NumPy array X and the targets in a 4 × 1 array y.
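In code (this is the same data block as in the complete listing at the end of the post):

```python
import numpy as np

# One row per training example, one column per input feature
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])  # shape (4, 2)

# Target column vector: the XOR of the two inputs in each row
y = np.array([[0], [1], [1], [0]])  # shape (4, 1)
```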
Parameter Initialization – Weights and Biases
```python
W1 = np.random.randn(2, 4)  # input → hidden (2 inputs, 4 hidden units)
B1 = np.zeros((1, 4))       # bias for each hidden unit
W2 = np.random.randn(4, 1)  # hidden → output (4 hidden, 1 output)
B2 = np.zeros((1, 1))       # bias for the output unit
```
Random weights break symmetry; zero biases are a simple, common choice.
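A quick illustration of the symmetry problem (a sketch, not part of the original program): if every weight starts at zero, each hidden unit sees exactly the same gradient on every step, so the units remain identical copies of one another no matter how long we train:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# all-zero initialization: no symmetry breaking
W1, B1 = np.zeros((2, 4)), np.zeros((1, 4))
W2, B2 = np.zeros((4, 1)), np.zeros((1, 1))

for _ in range(1000):
    A1 = sigmoid(X @ W1 + B1)
    A2 = sigmoid(A1 @ W2 + B2)
    DZ2 = A2 - y
    DZ1 = (DZ2 @ W2.T) * A1 * (1 - A1)
    W2 -= 0.1 * (A1.T @ DZ2) / 4
    B2 -= 0.1 * DZ2.sum(axis=0, keepdims=True) / 4
    W1 -= 0.1 * (X.T @ DZ1) / 4
    B1 -= 0.1 * DZ1.sum(axis=0, keepdims=True) / 4

# every hidden unit's incoming weights are still identical to the first one's
print(np.allclose(W1, W1[:, :1]))  # True
```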
Hyper‑parameters – Epochs and Learning Rate
```python
epochs = 10000  # number of full passes over the training set
lr = 0.1        # step size for gradient descent
```
More epochs give the network time to converge; the learning rate controls how large each update is.
Sigmoid Activation and Its Derivative
```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # a = sigmoid(z); derivative = a * (1 - a)
    return a * (1 - a)
```
The sigmoid maps any real number to the interval (0, 1). Its derivative can be expressed directly in terms of the activation itself, which keeps the back‑propagation code concise.
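A quick numerical sanity check (not in the original post) that `a * (1 - a)` really is the derivative of the sigmoid, using a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    return a * (1 - a)

z = np.linspace(-4, 4, 9)
eps = 1e-6
# central finite difference vs. the closed-form derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_derivative(sigmoid(z))
print(np.max(np.abs(numeric - analytic)))  # very small, e.g. < 1e-9
```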
Training Loop – Forward Pass, Loss, Back‑Propagation, and Gradient Descent
```python
def train_network(X, y, W1, W2, B1, B2, epochs, lr):
    n = X.shape[0]  # number of training examples (4)
    for _ in range(epochs):
        # ---- forward pass -------------------------------------------------
        Z1 = np.dot(X, W1) + B1
        A1 = sigmoid(Z1)
        Z2 = np.dot(A1, W2) + B2
        A2 = sigmoid(Z2)  # predictions
        # ---- error at output ----------------------------------------------
        DZ2 = A2 - y  # output error; the sigmoid-derivative factor is omitted (see the derivation below)
        # ---- gradients for output layer -----------------------------------
        DW2 = (1 / n) * np.dot(A1.T, DZ2)
        DB2 = (1 / n) * np.sum(DZ2, axis=0, keepdims=True)
        # ---- back-propagation to hidden layer -----------------------------
        DA1 = np.dot(DZ2, W2.T)
        DZ1 = DA1 * sigmoid_derivative(A1)
        DW1 = (1 / n) * np.dot(X.T, DZ1)
        DB1 = (1 / n) * np.sum(DZ1, axis=0, keepdims=True)
        # ---- gradient-descent update --------------------------------------
        W2 -= lr * DW2
        B2 -= lr * DB2
        W1 -= lr * DW1
        B1 -= lr * DB1
    return W1, B1, W2, B2
```
What the Loop Does
- Forward pass – computes the hidden activations `A1` and the final output `A2`.
- Loss – mean‑squared‑error; the output error term `DZ2` is simply `A2 - y` (the derivation section below explains the omitted sigmoid‑derivative factor).
- Back‑propagation – uses the chain rule to obtain gradients for every parameter.
- Gradient descent – moves each weight and bias opposite to its gradient, scaled by `lr`.
All operations are vectorised, so the training runs without explicit Python loops over the four examples.
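One way to convince yourself the shapes line up: every gradient array has exactly the same shape as the parameter it updates, so each update line is a single vectorised array operation covering all four examples at once. A standalone check, mirroring the loop above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
W1, B1 = np.random.randn(2, 4), np.zeros((1, 4))
W2, B2 = np.random.randn(4, 1), np.zeros((1, 1))

n = X.shape[0]
A1 = sigmoid(X @ W1 + B1)   # (4, 4): all examples at once
A2 = sigmoid(A1 @ W2 + B2)  # (4, 1)
DZ2 = A2 - y
DW2 = (A1.T @ DZ2) / n
DB2 = DZ2.sum(axis=0, keepdims=True) / n
DZ1 = (DZ2 @ W2.T) * A1 * (1 - A1)
DW1 = (X.T @ DZ1) / n
DB1 = DZ1.sum(axis=0, keepdims=True) / n

# each gradient has the same shape as the parameter it updates
for grad, param in [(DW1, W1), (DB1, B1), (DW2, W2), (DB2, B2)]:
    assert grad.shape == param.shape
print("all gradient shapes match their parameters")
```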
Mathematical Derivation of the Gradients
For a batch of $n$ examples the network equations are

$$Z_1 = X W_1 + B_1, \qquad A_1 = \sigma(Z_1), \qquad Z_2 = A_1 W_2 + B_2, \qquad A_2 = \sigma(Z_2)$$

The loss (mean‑squared‑error) is

$$L = \frac{1}{2n} \sum_{i=1}^{n} \left(A_2^{(i)} - y^{(i)}\right)^2$$

Derivative w.r.t. the output activation:

$$\frac{\partial L}{\partial A_2} = A_2 - y, \qquad \frac{\partial L}{\partial Z_2} = (A_2 - y)\,\sigma'(Z_2)$$

In the code the factor $\sigma'(Z_2) = A_2(1 - A_2)$ is omitted, and `DZ2 = A2 - y` is used directly as the output error. This is exactly the gradient one would obtain from the binary cross‑entropy loss, and in practice the network converges to XOR either way.

Back‑propagating to the hidden layer:

$$\frac{\partial L}{\partial A_1} = DZ_2\, W_2^{\top}, \qquad DZ_1 = \frac{\partial L}{\partial A_1} \odot A_1 \odot (1 - A_1)$$

Averaged over the batch, the gradients for the weight matrices and biases are

$$DW_2 = \frac{1}{n} A_1^{\top} DZ_2, \quad DB_2 = \frac{1}{n} \sum_i DZ_2, \qquad DW_1 = \frac{1}{n} X^{\top} DZ_1, \quad DB_1 = \frac{1}{n} \sum_i DZ_1$$
These formulas correspond exactly to the NumPy statements in the training loop.
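As a sanity check (not part of the original post), we can compare one backprop gradient with a finite‑difference estimate. Because the code drops the σ′ factor at the output, `A2 - y` is the exact gradient of the binary cross‑entropy loss, so that is the loss we differentiate numerically:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
rng = np.random.default_rng(0)
W1, B1 = rng.standard_normal((2, 4)), np.zeros((1, 4))
W2, B2 = rng.standard_normal((4, 1)), np.zeros((1, 1))

def loss(W1, B1, W2, B2):
    A1 = sigmoid(X @ W1 + B1)
    A2 = sigmoid(A1 @ W2 + B2)
    # binary cross-entropy: its gradient w.r.t. Z2 is exactly (A2 - y) / n
    return -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))

# backprop gradient for W1, exactly as in the training loop
n = X.shape[0]
A1 = sigmoid(X @ W1 + B1)
A2 = sigmoid(A1 @ W2 + B2)
DZ2 = A2 - y
DW1 = (X.T @ ((DZ2 @ W2.T) * A1 * (1 - A1))) / n

# finite-difference check on one entry of W1
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (loss(W1p, B1, W2, B2) - loss(W1m, B1, W2, B2)) / (2 * eps)
print(abs(numeric - DW1[0, 0]))  # tiny: the formulas agree
```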
Training the Network
```python
W1, B1, W2, B2 = train_network(X, y, W1, W2, B1, B2, epochs, lr)
```
After 10 000 passes the parameters have been adjusted so that the network produces a high confidence (~1) for the true XOR cases and a low confidence (~0) for the false cases.
Final Forward Pass – Inspecting the Predictions
```python
Z1 = np.dot(X, W1) + B1
A1 = sigmoid(Z1)
Z2 = np.dot(A1, W2) + B2
A2 = sigmoid(Z2)
print(A2)
```
Typical output:
```
[[0.02]
 [0.97]
 [0.96]
 [0.03]]
```
Interpretation
| Input | Network output | XOR truth |
|---|---|---|
| (0,0) | 0.02 → False | 0 |
| (0,1) | 0.97 → True | 1 |
| (1,0) | 0.96 → True | 1 |
| (1,1) | 0.03 → False | 0 |
The network has successfully learned the XOR mapping.
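To turn the soft sigmoid outputs into hard True/False answers, we can simply threshold at 0.5 (the `A2` values below are the ones from the run above):

```python
import numpy as np

A2 = np.array([[0.02], [0.97], [0.96], [0.03]])  # network outputs
predictions = (A2 > 0.5).astype(int)             # threshold at 0.5
print(predictions.ravel())  # [0 1 1 0] – the XOR truth table
```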
Core Machine‑Learning Concepts Explained
- Neuron (linear part) – computes a weighted sum of inputs plus a bias.
- Activation function – adds non‑linearity; the sigmoid maps to (0, 1).
- Loss function – measures prediction error; we use mean‑squared‑error.
- Gradient – direction of steepest increase of the loss; we move opposite to it.
- Back‑propagation – systematic use of the chain rule to compute all gradients efficiently.
- Gradient descent – updates parameters by a small step proportional to the negative gradient.
- Epoch – one full sweep over the training set; multiple epochs let the model converge.
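Gradient descent in isolation, stripped of the network entirely: on the toy loss $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$, repeated small steps against the gradient walk $w$ to the minimum at 3:

```python
w = 0.0    # start far from the minimum
lr = 0.1   # same role as the network's learning rate

for _ in range(100):
    grad = 2 * (w - 3)  # dL/dw for L(w) = (w - 3)**2
    w -= lr * grad      # step opposite to the gradient

print(round(w, 4))  # 3.0 – converged to the minimum
```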
Why a Hidden Layer Is Necessary for XOR
A single linear neuron computes

$$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)$$

Its decision boundary is a straight line in the $(x_1, x_2)$ plane. XOR's positive points $(0,1)$ and $(1,0)$ are diagonally opposite; no straight line can separate them from the negative points $(0,0)$ and $(1,1)$.

Adding a hidden layer with sigmoid units creates intermediate features – for example, soft versions of "$x_1$ OR $x_2$" and "$x_1$ AND $x_2$". After training, some hidden neurons fire only for the mixed inputs, enabling the final linear combination to separate the two classes. Thus a single hidden layer gives the network a non‑linear decision surface that can represent XOR, demonstrating the power of depth.
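We can also demonstrate the claim empirically (a sketch, not part of the original post): train a single sigmoid neuron – no hidden layer – on the same data. Its mean squared error never drops below 0.25, which is exactly what it gets by predicting 0.5 for every input:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

rng = np.random.default_rng(42)
w = rng.standard_normal((2, 1))  # just one layer of weights
b = np.zeros((1, 1))

for _ in range(10000):
    a = sigmoid(X @ w + b)
    dz = a - y
    w -= 0.1 * (X.T @ dz) / 4
    b -= 0.1 * dz.sum(axis=0, keepdims=True) / 4

mse = np.mean((sigmoid(X @ w + b) - y) ** 2)
print(mse)  # stuck near 0.25: a single neuron cannot represent XOR
```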
Complete Code Implementation
```python
import numpy as np

# ---- data -------------------------------------------------
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])  # shape (4, 2)
y = np.array([[0], [1], [1], [0]])  # shape (4, 1)

# ---- parameter initialization -----------------------------
W1 = np.random.randn(2, 4)
B1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1)
B2 = np.zeros((1, 1))

# ---- hyper-parameters -------------------------------------
epochs = 10000
lr = 0.1

# ---- helper functions -------------------------------------
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    return a * (1 - a)

# ---- training function ------------------------------------
def train_network(X, y, W1, W2, B1, B2, epochs, lr):
    n = X.shape[0]
    for _ in range(epochs):
        # forward
        Z1 = np.dot(X, W1) + B1
        A1 = sigmoid(Z1)
        Z2 = np.dot(A1, W2) + B2
        A2 = sigmoid(Z2)
        # backward
        DZ2 = A2 - y
        DW2 = (1 / n) * np.dot(A1.T, DZ2)
        DB2 = (1 / n) * np.sum(DZ2, axis=0, keepdims=True)
        DZ1 = np.dot(DZ2, W2.T) * sigmoid_derivative(A1)
        DW1 = (1 / n) * np.dot(X.T, DZ1)
        DB1 = (1 / n) * np.sum(DZ1, axis=0, keepdims=True)
        # gradient-descent update
        W2 -= lr * DW2
        B2 -= lr * DB2
        W1 -= lr * DW1
        B1 -= lr * DB1
    return W1, B1, W2, B2

# ---- train and inspect ------------------------------------
W1, B1, W2, B2 = train_network(X, y, W1, W2, B1, B2, epochs, lr)
A2 = sigmoid(np.dot(sigmoid(np.dot(X, W1) + B1), W2) + B2)
print(A2)
```