Activation Functions: Why a 100-Layer Network Without Them Is Still One Line

#deeplearning #machinelearning #beginners #javascript

A neuron computes w·x + b — a straight line. The little function after it, the activation, is what makes deep learning work. Day 2 of my DeepLearningFromZero series.

The problem: linear ∘ linear = linear

Stack two linear layers and the math collapses:

layer2(layer1(x)) = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)   ← still one linear layer

So a 100-layer network of pure linear neurons can only ever draw a straight boundary. Useless for images, language, curves.
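To make the collapse concrete, here is a minimal sketch that stacks two small linear layers and compares them with the single merged layer. The weights and the helper names (`matVec`, `matMul`, `addVec`) are made up for illustration, and the biases are folded in too: the merged layer uses W = W₂W₁ and b = W₂b₁ + b₂.

```javascript
// Multiply a matrix (array of rows) by a vector.
const matVec = (W, x) => W.map(row => row.reduce((sum, w, i) => sum + w * x[i], 0));

// Multiply two matrices: (A B)[i][j] = sum over k of A[i][k] * B[k][j].
const matMul = (A, B) =>
  A.map(row => B[0].map((_, j) => row.reduce((sum, a, k) => sum + a * B[k][j], 0)));

// Element-wise vector addition.
const addVec = (u, v) => u.map((ui, i) => ui + v[i]);

// Illustrative weights and biases, not taken from a trained model.
const W1 = [[1, 2], [3, 4]], b1 = [1, -1];
const W2 = [[0, 1], [-1, 2]], b2 = [2, 0];

// Two stacked linear layers with no activation in between.
const twoLayers = x => addVec(matVec(W2, addVec(matVec(W1, x), b1)), b2);

// The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
const W = matMul(W2, W1);
const b = addVec(matVec(W2, b1), b2);
const oneLayer = x => addVec(matVec(W, x), b);

const x = [5, -3];
console.log(twoLayers(x), oneLayer(x)); // both are [4, 4]
```

Whatever input you try, the two functions agree, so the second layer added no expressive power.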

The fix: a nonlinear bend

const a = relu(dot(w, x) + b);   // relu = the bend

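Now each layer warps space a little, and stacking them composes complex shapes. The nonlinearity is what lets a neural net with enough hidden units approximate any continuous function on a bounded input range as closely as you like (the universal approximation theorem).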
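A classic illustration is XOR. No single linear layer can compute it: any function of the form w·x + b that returns 0 at (0,0) and 1 at both (1,0) and (0,1) is forced to return 2 at (1,1). Two ReLU neurons are enough, though. The sketch below uses hand-picked weights (an assumption for illustration, not the result of training), and `xorNet` is a hypothetical name.

```javascript
const relu = z => Math.max(0, z);

// Hidden layer: two ReLU neurons, both with weights [1, 1], biases 0 and -1.
// Output layer: linear, weights [1, -2], bias 0.
const xorNet = (x1, x2) => {
  const h1 = relu(x1 + x2);      // counts how many inputs are on
  const h2 = relu(x1 + x2 - 1);  // fires only when both inputs are on
  return h1 - 2 * h2;            // subtract the "both on" case back out
};

console.log(xorNet(0, 0)); // 0
console.log(xorNet(0, 1)); // 1
console.log(xorNet(1, 0)); // 1
console.log(xorNet(1, 1)); // 0
```

Remove the two `relu` calls and the network reduces to `-(x1 + x2) + 2`, a plain linear function that gets XOR wrong.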

The functions

const relu    = z => Math.max(0, z);          // default for hidden layers
const sigmoid = z => 1 / (1 + Math.exp(-z));   // (0,1) — output probabilities
const tanh    = z => Math.tanh(z);             // (−1,1) — zero-centred
const leaky   = z => z > 0 ? z : 0.01 * z;     // no "dead" neurons

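ReLU is the modern default: cheap, and its gradient is exactly 1 for positive inputs, so it doesn't saturate there. Sigmoid's gradient never exceeds 0.25 and shrinks toward 0 as |z| grows, so multiplying it across many layers starves the early layers of learning signal. That's a big part of why very deep networks became trainable.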
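A rough way to see the difference is to multiply the activation's derivative across layers, the way backpropagation does. The sketch below makes a simplifying assumption: it ignores the weight matrices and uses the same pre-activation at every layer, so it only shows the activation's contribution. `dSigmoid`, `dRelu`, and `chain` are hypothetical helper names.

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));

// Derivatives of the two activations.
const dSigmoid = z => sigmoid(z) * (1 - sigmoid(z)); // peaks at 0.25 when z = 0
const dRelu = z => (z > 0 ? 1 : 0);

// Product of `depth` identical derivative factors (weights ignored on purpose).
const chain = (dAct, z, depth) => Math.pow(dAct(z), depth);

console.log(chain(dSigmoid, 0, 10)); // 0.25^10, roughly 9.5e-7, even in sigmoid's best case
console.log(chain(dRelu, 1, 10));    // 1
```

Ten layers of sigmoid shrink the signal about a million-fold at best, while an active ReLU path passes it through unchanged.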

Leaky ReLU fixes the "dying ReLU" problem: a neuron whose input is negative for every training example outputs 0 with zero gradient, so its weights stop updating and it stays stuck. A tiny negative slope keeps a trickle of gradient flowing.
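Here is a minimal sketch of one gradient-descent step on a single one-input neuron whose pre-activation is negative. The starting values, learning rate, and upstream gradient are all made up for illustration, and `step` is a hypothetical helper.

```javascript
const dRelu = z => (z > 0 ? 1 : 0);
const dLeaky = z => (z > 0 ? 1 : 0.01);

// One gradient-descent update for a neuron a = act(w * x + b).
// `upstream` is dLoss/da, handed down from the layers above.
const step = (dAct, w, b, x, upstream, lr) => {
  const z = w * x + b;
  const g = upstream * dAct(z); // dLoss/dz
  return { w: w - lr * g * x, b: b - lr * g };
};

// z = -1 * 1.5 - 2 = -3.5, so the neuron is inactive.
const w = -1, b = -2, x = 1.5, upstream = -1, lr = 0.1;

const afterRelu = step(dRelu, w, b, x, upstream, lr);
const afterLeaky = step(dLeaky, w, b, x, upstream, lr);

console.log(afterRelu);  // w and b unchanged: the update is multiplied by 0
console.log(afterLeaky); // w and b nudged slightly upward, toward activating
```

With plain ReLU the update is zero no matter how large the upstream gradient is; with the leaky version each step moves the neuron a little closer to positive territory.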