Nilavukkarasan R

Multi-Layer Perceptron: Where One Line Becomes Two

"The perceptron has many limitations... the most serious is its inability to learn even the simplest nonlinear functions."
Marvin Minsky

XOR Needs More Than One Line

The perceptron solved AND, OR, and NAND. The natural next question: what can't it do?

XOR. Output 1 when inputs differ, 0 when they match.

[0, 0] → 0    [0, 1] → 1
[1, 0] → 1    [1, 1] → 0

The class 1 points sit diagonally opposite each other. Unlike AND or OR, where one straight line cleanly separates the classes, XOR needs at least two lines to carve out the right regions.

The obvious fix? Add more neurons. Stack another layer. Surely more layers means more power.

It doesn't.

Why Stacking Layers Alone Changes Nothing

A perceptron computes w·x + b and draws a line. Stack two layers:

Layer 1:  z₁ = w₁·x + b₁
Layer 2:  z₂ = w₂·z₁ + b₂

Expand it: z₂ = w₂·(w₁·x + b₁) + b₂ = (w₂·w₁)·x + (w₂·b₁ + b₂)

That's just W·x + B. A single line with different numbers. Two layers collapsed into one. Stack ten layers, or a hundred; the math always simplifies to one straight line. More layers feel like more power. But without something to break the linearity between them, depth is an illusion.
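You can check this numerically. Below is a minimal NumPy sketch (the weight values are arbitrary, chosen only for illustration): the stacked computation and the collapsed single layer give identical outputs.

```python
import numpy as np

# Two stacked linear layers, no activation in between (arbitrary example weights)
w1 = np.array([[1.0, -2.0],
               [0.5,  3.0]])
b1 = np.array([0.1, -0.4])
w2 = np.array([[2.0, 1.0]])
b2 = np.array([0.3])

x = np.array([0.7, -1.2])

# Layer-by-layer
z1 = w1 @ x + b1
z2 = w2 @ z1 + b2

# Collapsed into one layer: W = w2·w1, B = w2·b1 + b2
W = w2 @ w1
B = w2 @ b1 + b2

print(np.allclose(z2, W @ x + B))  # True: two linear layers are just one line
```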

The Carry

When I was a kid, single digit addition was simple. 3 + 5 = 8. One step, done.

Then came 27 + 15. I kept getting it wrong. I'd add 2 + 1 = 3, then 7 + 5 = 12, and write 312. Two separate problems stacked together. I was missing something invisible.

The breakthrough: 7 + 5 doesn't just equal 12. It creates a 1 that carries over to the next column. That carry doesn't stay where it was computed. It transforms into a 1 in a different column, changing what comes next.

Without the carry, stacking columns is useless. Each column is independent, and you get 312. With the carry, the columns interact, and you get 42.

Sigmoid: The Carry Between Layers

Perceptron:   output = w·x + b
MLP neuron:   output = sigmoid(w·x + b)

sigmoid(z) = 1 / (1 + e^(-z))

Sigmoid takes any number and squashes it between 0 and 1. Feed it −5, you get 0.007. Feed it 0, you get 0.5. Feed it +5, you get 0.993. It takes one layer's output and transforms it into a new range before the next layer sees it.

Layer 1:  h = sigmoid(w₁·x + b₁)
Layer 2:  y = sigmoid(w₂·h + b₂)

Try to simplify this into a single W·x + B. You can't. The sigmoid in the middle prevents the layers from collapsing. The hidden layer matters not because it adds more neurons, but because the activation function between layers stops them from collapsing into one.
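Here's the same numerical check with sigmoid between the layers (same arbitrary example weights as before): the collapse that worked for two linear layers no longer reproduces the output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1 = np.array([[1.0, -2.0],
               [0.5,  3.0]])
b1 = np.array([0.1, -0.4])
w2 = np.array([[2.0, 1.0]])
b2 = np.array([0.3])

x = np.array([0.7, -1.2])

# With sigmoid between the layers
h = sigmoid(w1 @ x + b1)
y = sigmoid(w2 @ h + b2)

# Attempt the same collapse as before: it no longer reproduces y
W, B = w2 @ w1, w2 @ b1 + b2
print(np.allclose(y, sigmoid(W @ x + B)))  # False: the layers no longer collapse
```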

Beyond Sigmoid

Sigmoid isn't the only activation function. There are others, each with a different shape:

sigmoid(z) = 1 / (1 + e^(-z))       → squashes to (0, 1)
tanh(z)    = (e^z - e^-z)/(e^z+e^-z) → squashes to (-1, 1)
ReLU(z)    = max(0, z)                → passes positives, zeros out negatives

Think of them as different volume knobs. Sigmoid only turns between 0 and 1. Tanh turns between -1 and 1, which is useful when you need the output centered around zero. ReLU is the simplest: if the signal is positive, pass it through unchanged. If negative, silence it.
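A quick side-by-side, assuming NumPy (the sample inputs are arbitrary), to see the three ranges:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))  # ~[0.007, 0.269, 0.500, 0.731, 0.993]  -> squashed into (0, 1)
print(np.tanh(z))  # ~[-1.000, -0.762, 0.000, 0.762, 1.000] -> squashed into (-1, 1)
print(relu(z))     # [0. 0. 0. 1. 5.]                       -> negatives silenced
```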

ReLU is the default for hidden layers in modern networks. It's fast to compute and avoids a problem called vanishing gradients, where sigmoid and tanh squash large values so flat that the gradient nearly disappears, making learning extremely slow in deep networks. We'll see this problem firsthand at a later stage.

For output layers, the choice depends on the task: sigmoid for binary yes/no, softmax (a generalization of sigmoid) for picking one class out of many.

How It Solves XOR

A 2-2-1 network (2 inputs, 2 hidden neurons with sigmoid, 1 output) solves XOR. Each hidden neuron draws its own line. These two parallel lines create a band, and the region between them is where exactly one input is 1.

The output neuron combines them: neuron 1's signal (OR) minus neuron 2's signal (AND). What's left is OR but NOT AND, which is XOR.

  [0,0] → both neurons low  → output low  → class 0 ✓
  [0,1] → neuron 1 high, neuron 2 low → output high → class 1 ✓
  [1,0] → neuron 1 high, neuron 2 low → output high → class 1 ✓
  [1,1] → both neurons high → they cancel → output low → class 0 ✓
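Here's that 2-2-1 network as a short sketch with hand-picked weights (these values are just one choice of many that work; roughly "OR", "AND", and "OR but NOT AND"):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer: neuron 1 acts like OR, neuron 2 acts like AND (hand-picked weights)
W1 = np.array([[20.0, 20.0],    # neuron 1: fires if either input is 1
               [20.0, 20.0]])   # neuron 2: fires only if both inputs are 1
b1 = np.array([-10.0, -30.0])

# Output neuron: OR minus AND, i.e. "OR but NOT AND"
W2 = np.array([20.0, -20.0])
b2 = -10.0

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = sigmoid(W1 @ np.array(x, dtype=float) + b1)
    y = sigmoid(W2 @ h + b2)
    print(x, round(float(y)))   # 0, 1, 1, 0 -- XOR
```

Count the numbers being hand-picked here: four weights and two biases in the hidden layer, two weights and one bias at the output. Nine in total.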

And this scales beyond a single hidden layer. Stack another layer on top, and the second layer doesn't see the original inputs. It sees the transformed outputs of the first layer. So the first layer draws boundaries, the second layer combines those boundaries into shapes, a third could combine shapes into patterns. Each layer builds on the previous one's transformation. That's why they're called deep neural networks.

See It

Open the playground. Perceptron on the left, stuck with one line, failing. MLP on the right, two hidden neurons creating a band that captures the XOR region.

Perceptron vs MLP on XOR

What's Next

Two hidden neurons needed 9 hand-crafted weights to solve XOR. A network that recognizes handwritten digits needs thousands of neurons and hundreds of thousands of weights. One that understands language needs billions. The architecture scales, but hand-picking weights doesn't.

There has to be a way for the network to find its own weights. That's the next post.


References:
Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry.

Series: From Perceptrons to Transformers | Code: GitHub
