DEV Community

Viswa M

Understanding a Tiny Two‑Layer Neural Network that Learns XOR


Meta description: Learn how a simple two‑layer NumPy neural network solves the XOR problem with back‑propagation, step‑by‑step code and explanations.

Tags: xor, neuralnetwork, twolayer, numpy, backpropagation, machinelearning, deeplearning, python, gradientdescent, sigmoid


Introduction

The exclusive‑or (XOR) problem is a classic benchmark for neural networks. It is easy to describe, but a single linear neuron cannot solve it. In this post we walk through a compact NumPy implementation of a two‑layer (one hidden layer) network that learns the XOR truth table from scratch. You will see how the data are prepared, how the parameters are initialized, how the forward and backward passes are performed, and why the hidden layer is essential.


What the Program Does – In a Nutshell

  • Builds a tiny feed‑forward network with one hidden layer of four sigmoid units.
  • Trains it on the four possible binary inputs of XOR using gradient descent.
  • After 10 000 epochs the network’s predictions are close to the target values (≈ 0 for False, ≈ 1 for True).

The printed output after training looks like:

[[0.02]
 [0.97]
 [0.96]
 [0.03]]

These correspond to the XOR results for the inputs (0,0), (0,1), (1,0) and (1,1).


Data Preparation – The XOR Truth Table

x1  x2  XOR
0   0   0
0   1   1
1   0   1
1   1   0

In the code the inputs are stored in a 4 × 2 NumPy array X and the targets in a 4 × 1 array y.
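Concretely, the arrays look like this (taken from the full listing at the end of the post):

```python
import numpy as np

# XOR inputs: one row per example, shape (4, 2)
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

# XOR targets as a column vector, shape (4, 1)
y = np.array([[0], [1], [1], [0]])
```

Each row of y is the XOR of the corresponding row of X.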


Parameter Initialization – Weights and Biases

W1 = np.random.randn(2, 4)   # input → hidden (2 inputs, 4 hidden units)
B1 = np.zeros((1, 4))        # bias for each hidden unit

W2 = np.random.randn(4, 1)   # hidden → output (4 hidden, 1 output)
B2 = np.zeros((1, 1))        # bias for the output unit

Random weights break symmetry; zero biases are a simple, common choice.
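To see why symmetry breaking matters, here is an illustrative sketch (not part of the original post) that trains the same architecture from all-zero weights. The symmetric start leaves every hidden unit identical; in fact, with XOR's perfectly balanced targets the gradients vanish entirely and the network never moves off 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# All-zero initialization: symmetry is never broken
W1 = np.zeros((2, 4)); B1 = np.zeros((1, 4))
W2 = np.zeros((4, 1)); B2 = np.zeros((1, 1))
n, lr = X.shape[0], 0.1

for _ in range(1000):
    A1 = sigmoid(X @ W1 + B1)
    A2 = sigmoid(A1 @ W2 + B2)
    DZ2 = A2 - y
    DW2 = A1.T @ DZ2 / n
    DB2 = DZ2.sum(axis=0, keepdims=True) / n
    DZ1 = (DZ2 @ W2.T) * A1 * (1 - A1)
    DW1 = X.T @ DZ1 / n
    DB1 = DZ1.sum(axis=0, keepdims=True) / n
    W2 -= lr * DW2; B2 -= lr * DB2
    W1 -= lr * DW1; B1 -= lr * DB1

# All hidden units still share identical (here zero) weights,
# and the output is stuck at 0.5 for every input.
print(np.allclose(W1, W1[:, :1]))  # True
```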


Hyper‑parameters – Epochs and Learning Rate

epochs = 10000   # number of full passes over the training set
lr      = 0.1    # step size for gradient descent

More epochs give the network time to converge; the learning rate controls how large each update is.


Sigmoid Activation and Its Derivative

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # a = sigmoid(z); derivative = a * (1 - a)
    return a * (1 - a)

The sigmoid maps any real number to the interval (0, 1). Its derivative can be expressed directly in terms of the activation, which keeps the back‑propagation code concise.
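A quick finite-difference check (illustrative, not from the original post) confirms the identity σ'(z) = σ(z)(1 − σ(z)):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    return a * (1 - a)

z = np.linspace(-5, 5, 11)
eps = 1e-6

# Numerical derivative via central differences
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
# Analytic derivative expressed through the activation itself
analytic = sigmoid_derivative(sigmoid(z))

print(np.max(np.abs(numeric - analytic)))  # very small: the two agree
```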


Training Loop – Forward Pass, Loss, Back‑Propagation, and Gradient Descent

def train_network(X, y, W1, W2, B1, B2, epochs, lr):
    n = X.shape[0]                     # number of training examples (4)

    for _ in range(epochs):
        # ---- forward pass -------------------------------------------------
        Z1 = np.dot(X, W1) + B1
        A1 = sigmoid(Z1)

        Z2 = np.dot(A1, W2) + B2
        A2 = sigmoid(Z2)               # predictions

        # ---- error at output ----------------------------------------------
        DZ2 = A2 - y                    # ∂L/∂Z2 (cross‑entropy form; see the derivation below)

        # ---- gradients for output layer ------------------------------------
        DW2 = (1 / n) * np.dot(A1.T, DZ2)
        DB2 = (1 / n) * np.sum(DZ2, axis=0, keepdims=True)

        # ---- back‑propagation to hidden layer -------------------------------
        DA1 = np.dot(DZ2, W2.T)
        DZ1 = DA1 * sigmoid_derivative(A1)

        DW1 = (1 / n) * np.dot(X.T, DZ1)
        DB1 = (1 / n) * np.sum(DZ1, axis=0, keepdims=True)

        # ---- gradient‑descent update ----------------------------------------
        W2 -= lr * DW2
        B2 -= lr * DB2
        W1 -= lr * DW1
        B1 -= lr * DB1

    return W1, B1, W2, B2

What the Loop Does

  1. Forward pass – computes hidden activations A1 and final output A2.
  2. Loss – the output error; using A2 − y directly as the gradient w.r.t. Z2 corresponds to a binary cross‑entropy loss with a sigmoid output (for pure MSE an extra factor σ′(Z2) would appear).
  3. Back‑propagation – uses the chain rule to obtain gradients for every parameter.
  4. Gradient descent – moves each weight and bias opposite to its gradient, scaled by lr.

All operations are vectorised, so the training runs without explicit Python loops over the four examples.
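To see the vectorisation at work, you can trace the array shapes through one forward and backward pass (the shapes follow directly from the dimensions chosen above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
W1, B1 = np.random.randn(2, 4), np.zeros((1, 4))
W2, B2 = np.random.randn(4, 1), np.zeros((1, 1))

Z1 = X @ W1 + B1          # (4, 2) @ (2, 4) + (1, 4) -> (4, 4)
A1 = sigmoid(Z1)          # (4, 4): one row per example, one column per hidden unit
Z2 = A1 @ W2 + B2         # (4, 4) @ (4, 1) + (1, 1) -> (4, 1)
A2 = sigmoid(Z2)          # (4, 1): one prediction per example

DZ2 = A2 - y              # (4, 1)
DW2 = A1.T @ DZ2 / 4      # (4, 4) @ (4, 1) -> (4, 1), same shape as W2
DZ1 = (DZ2 @ W2.T) * A1 * (1 - A1)   # (4, 4), same shape as A1
DW1 = X.T @ DZ1 / 4       # (2, 4) @ (4, 4) ... -> (2, 4), same shape as W1

print(A2.shape, DW2.shape, DW1.shape)  # (4, 1) (4, 1) (2, 4)
```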


Mathematical Derivation of the Gradients

For a single example the network equations are

$$
Z^{(1)} = x\,W^{(1)} + b^{(1)}, \qquad A^{(1)} = \sigma(Z^{(1)}),
$$
$$
Z^{(2)} = A^{(1)} W^{(2)} + b^{(2)}, \qquad A^{(2)} = \sigma(Z^{(2)}).
$$

The loss (mean‑squared‑error) is

$$
L = \tfrac{1}{2}\left(A^{(2)} - y\right)^2.
$$

Derivative w.r.t. the output activation:

$$
\frac{\partial L}{\partial A^{(2)}} = A^{(2)} - y.
$$

Because \sigma'(z)=\sigma(z)(1-\sigma(z)),

$$
\frac{\partial L}{\partial Z^{(2)}} = (A^{(2)} - y)\,\sigma'(Z^{(2)}) = (A^{(2)} - y)\,A^{(2)}\,(1 - A^{(2)}).
$$

In the code the factor \sigma'(Z^{(2)}) is omitted from DZ2. Strictly speaking this changes the loss being minimised: DZ2 = A2 − y is exactly the gradient of the binary cross‑entropy loss with a sigmoid output, where the \sigma' factor cancels. The network therefore trains on cross‑entropy rather than MSE — a common and well‑behaved choice — and still converges to the XOR mapping.

Back‑propagating to the hidden layer:

$$
\frac{\partial L}{\partial A^{(1)}} = \frac{\partial L}{\partial Z^{(2)}}\,(W^{(2)})^{\top}, \qquad
\frac{\partial L}{\partial Z^{(1)}} = \frac{\partial L}{\partial A^{(1)}} \odot A^{(1)} \odot (1 - A^{(1)}).
$$

Averaged over the batch of n examples, the gradients for the weight matrices and biases are

$$
\frac{\partial L}{\partial W^{(2)}} = \frac{1}{n}\,(A^{(1)})^{\top}\,\delta^{(2)}, \qquad
\frac{\partial L}{\partial b^{(2)}} = \frac{1}{n}\sum_i \delta_i^{(2)},
$$
$$
\frac{\partial L}{\partial W^{(1)}} = \frac{1}{n}\,X^{\top}\,\delta^{(1)}, \qquad
\frac{\partial L}{\partial b^{(1)}} = \frac{1}{n}\sum_i \delta_i^{(1)},
$$

where \delta^{(2)} and \delta^{(1)} denote the error terms DZ2 and DZ1 computed in the code.

These formulas correspond exactly to the NumPy statements in the training loop.
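A finite-difference gradient check is a good way to confirm that correspondence. The sketch below (illustrative, not part of the original post) perturbs one entry of W2 and compares the resulting loss change with the analytic gradient; since DZ2 = A2 − y is the cross‑entropy gradient, we check against the binary cross‑entropy loss:

```python
import numpy as np

np.random.seed(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0.], [1.], [1.], [0.]])
W1, B1 = np.random.randn(2, 4), np.zeros((1, 4))
W2, B2 = np.random.randn(4, 1), np.zeros((1, 1))
n = X.shape[0]

def bce_loss(W2):
    # Mean binary cross-entropy of the network's predictions
    A1 = sigmoid(X @ W1 + B1)
    A2 = sigmoid(A1 @ W2 + B2)
    return -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))

# Analytic gradient, exactly as in the training loop
A1 = sigmoid(X @ W1 + B1)
A2 = sigmoid(A1 @ W2 + B2)
DW2 = A1.T @ (A2 - y) / n

# Numerical gradient for the single entry W2[0, 0]
eps = 1e-6
W2p, W2m = W2.copy(), W2.copy()
W2p[0, 0] += eps
W2m[0, 0] -= eps
numeric = (bce_loss(W2p) - bce_loss(W2m)) / (2 * eps)

print(abs(numeric - DW2[0, 0]))  # very small: analytic and numeric agree
```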


Training the Network

W1, B1, W2, B2 = train_network(X, y, W1, W2, B1, B2, epochs, lr)

After 10 000 passes the parameters have been adjusted so that the network produces a high confidence (~1) for the true XOR cases and a low confidence (~0) for the false cases.


Final Forward Pass – Inspecting the Predictions

Z1 = np.dot(X, W1) + B1
A1 = sigmoid(Z1)

Z2 = np.dot(A1, W2) + B2
A2 = sigmoid(Z2)

print(A2)

Typical output:

[[0.02]
 [0.97]
 [0.96]
 [0.03]]

Interpretation

Input   Network output   XOR truth
(0,0)   0.02 → False     0
(0,1)   0.97 → True      1
(1,0)   0.96 → True      1
(1,1)   0.03 → False     0

The network has successfully learned the XOR mapping.
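To turn the sigmoid probabilities into hard class labels, threshold at 0.5 (a standard choice, not shown in the original code):

```python
import numpy as np

# Outputs from the trained network (the typical values shown above)
A2 = np.array([[0.02], [0.97], [0.96], [0.03]])

labels = (A2 > 0.5).astype(int)
print(labels.ravel())  # [0 1 1 0]
```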


Core Machine‑Learning Concepts Explained

  • Neuron (linear part) – computes a weighted sum of inputs plus a bias.
  • Activation function – adds non‑linearity; sigmoid maps to (0, 1).
  • Loss function – measures prediction error; we use mean‑squared‑error.
  • Gradient – direction of steepest increase of the loss; we move opposite to it.
  • Back‑propagation – systematic use of the chain rule to compute all gradients efficiently.
  • Gradient descent – updates parameters by a small step proportional to the negative gradient.
  • Epoch – one full sweep over the training set; multiple epochs let the model converge.

Why a Hidden Layer Is Necessary for XOR

A single linear neuron computes

$$
\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b).
$$

Its decision boundary is a straight line in the (x_1,x_2) plane. XOR’s positive points (0,1) and (1,0) are diagonally opposite; no straight line can separate them from the negative points (0,0) and (1,1).

Adding a hidden layer with sigmoid units creates intermediate features such as “x_1 \neq x_2”. After training, some hidden neurons fire only for the mixed inputs, enabling the final linear combination to separate the two classes. Thus a single hidden layer gives the network a non‑linear decision surface that can represent XOR, demonstrating the power of depth.
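To see the limitation concretely, here is a sketch (illustrative, not from the post) that trains a single sigmoid neuron the same way on XOR. With no hidden layer, its linear boundary can classify at most three of the four points correctly:

```python
import numpy as np

np.random.seed(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# A single neuron: one weight per input plus a bias
W = np.random.randn(2, 1)
b = np.zeros((1, 1))
n, lr = X.shape[0], 0.1

for _ in range(10000):
    A = sigmoid(X @ W + b)
    DZ = A - y
    W -= lr * (X.T @ DZ) / n
    b -= lr * DZ.sum(axis=0, keepdims=True) / n

preds = (sigmoid(X @ W + b) > 0.5).astype(int)
accuracy = (preds == y).mean()
print(accuracy)  # at most 0.75: no linear boundary solves XOR
```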


Complete Code Implementation


import numpy as np

# ---- data -------------------------------------------------
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])                     # shape (4, 2)

y = np.array([[0], [1], [1], [0]])   # shape (4, 1)

# ---- parameter initialization -----------------------------
W1 = np.random.randn(2, 4)
B1 = np.zeros((1, 4))

W2 = np.random.randn(4, 1)
B2 = np.zeros((1, 1))

# ---- hyper‑parameters --------------------------------------
epochs = 10000
lr = 0.1

# ---- helper functions --------------------------------------
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    return a * (1 - a)

# ---- training function -------------------------------------
def train_network(X, y, W1, W2, B1, B2, epochs, lr):
    n = X.shape[0]

    for _ in range(epochs):
        # forward pass
        Z1 = np.dot(X, W1) + B1
        A1 = sigmoid(Z1)

        Z2 = np.dot(A1, W2) + B2
        A2 = sigmoid(Z2)

        # output-layer error and gradients
        DZ2 = A2 - y
        DW2 = (1 / n) * np.dot(A1.T, DZ2)
        DB2 = (1 / n) * np.sum(DZ2, axis=0, keepdims=True)

        # back-propagate to the hidden layer
        DA1 = np.dot(DZ2, W2.T)
        DZ1 = DA1 * sigmoid_derivative(A1)
        DW1 = (1 / n) * np.dot(X.T, DZ1)
        DB1 = (1 / n) * np.sum(DZ1, axis=0, keepdims=True)

        # gradient-descent update
        W2 -= lr * DW2
        B2 -= lr * DB2
        W1 -= lr * DW1
        B1 -= lr * DB1

    return W1, B1, W2, B2

# ---- train and inspect the predictions ---------------------
W1, B1, W2, B2 = train_network(X, y, W1, W2, B1, B2, epochs, lr)

A2 = sigmoid(np.dot(sigmoid(np.dot(X, W1) + B1), W2) + B2)
print(A2)

