There's a proof worth knowing: if you stack linear transformations without any non-linearity between them, the entire network is equivalent to a single linear transformation. Ten layers, a hundred layers, a thousand: they all collapse to one matrix multiply. Activation functions are what prevent this collapse.
The linearity collapse, demonstrated
import numpy as np
W1 = np.random.randn(4, 4)
W2 = np.random.randn(4, 4)
W3 = np.random.randn(4, 4)
W_collapsed = W3 @ W2 @ W1
x = np.random.randn(4)
out_deep = W3 @ W2 @ W1 @ x
out_shallow = W_collapsed @ x
print(np.allclose(out_deep, out_shallow)) # True — three layers = one layer
The three layers have zero additional expressive power over one. Adding a non-linear function between each layer breaks this.
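To see a non-linearity break the collapse, here is a minimal sketch with two hand-picked 2×2 matrices (illustrative values, not taken from any trained model). Any linear map satisfies f(-x) = -f(x). Putting ReLU between the layers violates that, so no single matrix can reproduce the network.

```python
import numpy as np

# Hand-picked weights, chosen only so the arithmetic is easy to follow.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0],
               [0.0, 1.0]])

def relu(z):
    return np.maximum(0, z)

def linear_net(x):
    # Two stacked linear layers: equivalent to the single matrix W2 @ W1.
    return W2 @ (W1 @ x)

def relu_net(x):
    # Same weights, with ReLU applied between the layers.
    return W2 @ relu(W1 @ x)

x = np.array([1.0, 0.0])

# Linear maps are odd functions: f(-x) == -f(x).
linear_is_odd = np.allclose(linear_net(-x), -linear_net(x))
# With ReLU in between, that property fails, so the network is not linear.
relu_is_odd = np.allclose(relu_net(-x), -relu_net(x))

print(linear_is_odd)  # True
print(relu_is_odd)    # False
```

Here `relu_net(x)` is `[1, 0]` while `relu_net(-x)` is `[1, 1]`, which no matrix applied to `x` and `-x` could produce.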
Sigmoid: the original, and its problems
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x_val in [-5, -2, 0, 2, 5]:
    g = sigmoid_grad(x_val)
    print(f"x={x_val:3d} gradient={g:.6f}")
x= -5 gradient=0.006648
x= -2 gradient=0.104994
x=  0 gradient=0.250000
x=  2 gradient=0.104994
x=  5 gradient=0.006648
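At x=±5, the gradient is about 38× smaller than at x=0 (0.25 / 0.006648 ≈ 37.6). Even at its peak the sigmoid derivative is only 0.25, and backpropagation multiplies one such factor per layer. In a 10-layer network the activations alone scale the gradient by at most 0.25¹⁰ ≈ 10⁻⁶. This is the vanishing gradient problem.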
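The compounding can be checked with a little arithmetic. The sketch below assumes the best case, where every layer sits at x=0 and contributes the maximum derivative, and it ignores the weight matrices entirely. `best_case_gradient` is a hypothetical helper name used only for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def best_case_gradient(depth):
    # Upper bound on the product of sigmoid derivatives across `depth` layers,
    # assuming each layer contributes the maximum derivative (0.25 at x=0).
    return sigmoid_grad(0.0) ** depth

ratio = sigmoid_grad(0.0) / sigmoid_grad(5.0)
print(f"gradient at 0 vs at 5: {ratio:.1f}x")                      # 37.6x
print(f"best case after 10 layers: {best_case_gradient(10):.2e}")  # 9.54e-07
```

Real networks land below this bound, because most units do not sit exactly at x=0.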
ReLU: the surprisingly effective fix
def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)
The gradient for positive inputs is exactly 1. Gradients don't shrink as they pass through ReLU on the positive side. Deep networks could finally be trained.
The cost: a neuron whose pre-activation is negative for every input receives zero gradient and stops updating. This is the "dying ReLU" problem. In practice it matters less than you'd think.
# Leaky ReLU: small gradient for negatives
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
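A minimal sketch of what "dying" means for a single neuron, assuming a bias negative enough that every pre-activation is below zero. The inputs, weights, and the `leaky_relu_grad` helper are illustrative and not part of the code above.

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    # Derivative of leaky ReLU: 1 for positive inputs, alpha otherwise.
    return np.where(z > 0, 1.0, alpha)

# One neuron, four inputs; the bias pushes every pre-activation below zero.
X = np.array([[0.5, 1.0],
              [1.0, 0.2],
              [0.3, 0.3],
              [0.8, 0.9]])
w = np.array([1.0, 1.0])
b = -5.0
z = X @ w + b  # all entries negative

# Gradient of sum(activation(z)) with respect to w is X.T @ activation'(z).
grad_w_relu = X.T @ relu_grad(z)
grad_w_leaky = X.T @ leaky_relu_grad(z)

print(grad_w_relu)   # [0. 0.]
print(grad_w_leaky)  # small but non-zero: alpha times the column sums of X
```

With plain ReLU the weights get no update signal at all, so the neuron cannot recover on its own. The leaky variant keeps a small gradient flowing.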
GELU: what GPT uses
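GELU is a smooth relative of ReLU: GELU(x) = x × Φ(x), where Φ is the standard normal CDF. It is usually computed with a tanh approximation: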
GELU(x) ≈ 0.5 × x × (1 + tanh(√(2/π) × (x + 0.044715 × x³)))
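def gelu(x):
    # tanh approximation of GELU, as in the formula above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

for x_val in [-0.5, -0.2, 0.0, 0.2, 0.5]:
    r = max(0, x_val)
    g = gelu(np.array([x_val]))[0]
    print(f"x={x_val:4.1f} ReLU={r:.4f} GELU={g:.4f}")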
x=-0.5 ReLU=0.0000 GELU=-0.1543
x=-0.2 ReLU=0.0000 GELU=-0.0841
x= 0.0 ReLU=0.0000 GELU=0.0000
x= 0.2 ReLU=0.2000 GELU=0.1159
x= 0.5 ReLU=0.5000 GELU=0.3457
GELU is differentiable everywhere and lets small negative values through, which is often credited with making optimization slightly easier. GPT-2 and BERT both use GELU.
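For reference, the tanh formula is itself an approximation of the exact definition x × Φ(x). The sketch below compares the two using only the standard library. The names `gelu_exact` and `gelu_tanh` are chosen here for illustration.

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation quoted above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest gap between the two on a grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-5, 5]: {max_gap:.1e}")
```

The gap stays small across this range, which is why the two forms are generally treated as interchangeable.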
SwiGLU: what modern models use
SwiGLU is the gated feed-forward activation used in LLaMA, Mistral, and many current large models:
import torch
import torch.nn as nn
import torch.nn.functional as F
class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        gate = F.silu(self.W(x))  # SiLU = x * sigmoid(x)
        content = self.V(x)
        return self.out(gate * content)
One linear projection acts as a gate on the other, element by element. That is more expressive than applying a fixed non-linearity to a single projection.
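To make the gating explicit, here is the same forward pass restated in plain NumPy. This is a sketch under assumed shapes (`d_model=8`, `d_ff=16`) with random weights, and `swiglu_forward` is a hypothetical helper name. Zeroing the gate projection shows the content path being shut off completely.

```python
import numpy as np

def silu(z):
    # SiLU(z) = z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_forward(x, W, V, W_out):
    gate = silu(x @ W)       # how much of each hidden unit to let through
    content = x @ V          # the values being gated
    return (gate * content) @ W_out

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16        # illustrative sizes
x = rng.standard_normal((4, d_model))
W = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_model, d_ff))
W_out = rng.standard_normal((d_ff, d_model))

y = swiglu_forward(x, W, V, W_out)
print(y.shape)  # (4, 8)

# With the gate projection zeroed, SiLU(0) = 0 closes every gate,
# so nothing from the content path reaches the output.
y_closed = swiglu_forward(x, np.zeros_like(W), V, W_out)
print(np.allclose(y_closed, 0.0))  # True
```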
Gradient flow comparison
def test_gradient_flow(activation_fn, depth=20, seed=0):
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        layers.extend([nn.Linear(64, 64), activation_fn()])
    model = nn.Sequential(*layers)
    x = torch.randn(16, 64, requires_grad=True)
    out = model(x).sum()
    out.backward()
    return x.grad.abs().mean().item()

activations = {"ReLU": nn.ReLU, "Sigmoid": nn.Sigmoid, "GELU": nn.GELU, "SiLU": nn.SiLU}
for name, act in activations.items():
    grad = test_gradient_flow(act, depth=20)
    print(f"{name:<10}: input gradient magnitude = {grad:.6f}")
ReLU : input gradient magnitude = 0.003241
Sigmoid : input gradient magnitude = 0.000001
GELU : input gradient magnitude = 0.004817
SiLU : input gradient magnitude = 0.004923
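In this run, sigmoid's input gradient is thousands of times smaller than the others. ReLU, GELU, and SiLU are all in the same ballpark, and the gap between them matters far less than the gap from sigmoid. Exact magnitudes will vary with the seed and with the default nn.Linear initialization.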
Summary
| Function | Where used | Key property |
|---|---|---|
| Sigmoid | Old networks | Saturates; vanishing gradients |
| ReLU | CNNs, MLPs | Simple; gradient=1 for positives |
| GELU | GPT-2, BERT | Smooth; slight negative outputs |
| SiLU/Swish | Modern models | Smooth; x × sigmoid(x); the gate inside SwiGLU |
| SwiGLU | LLaMA, Mistral | Expressive gating mechanism |
The progression follows one thread: keep gradients alive through many layers, give the network enough expressive power, don't overcomplicate what works.
This is part of an ongoing series on AI internals. Full article with more context at machina.chat/blog.