Activation Functions: Why Non-Linearity Is Everything

#ai #machinelearning #deeplearning #python

There's a proof worth knowing: if you stack linear layers without any non-linearity between them, the entire network is equivalent to a single linear transformation (or, with biases, a single affine one). Ten layers, a hundred layers, a thousand: they all collapse to one matrix multiply. Activation functions are what prevent this collapse.

The linearity collapse, demonstrated

import numpy as np

W1 = np.random.randn(4, 4)
W2 = np.random.randn(4, 4)
W3 = np.random.randn(4, 4)

W_collapsed = W3 @ W2 @ W1
x = np.random.randn(4)

out_deep = W3 @ (W2 @ (W1 @ x))  # apply the layers one at a time
out_shallow = W_collapsed @ x

print(np.allclose(out_deep, out_shallow))  # True — three layers = one layer

The three layers have zero additional expressive power over one. Adding a non-linear function between each layer breaks this.
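One way to see the break concretely is to check additivity, which every linear map satisfies: f(a + b) = f(a) + f(b). The sketch below uses small hand-picked weights (illustrative values, not the random ones from the demo) so the arithmetic can be followed by hand. `linear_net` and `relu_net` are hypothetical helper names.

```python
import numpy as np

# Hand-picked 2x2 weights, chosen only so the numbers are easy to follow.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 2.0],
               [0.0, 1.0]])

def linear_net(x):
    # Two layers with nothing in between: still a linear map.
    return W2 @ (W1 @ x)

def relu_net(x):
    # Same weights, with ReLU between the layers.
    return W2 @ np.maximum(0.0, W1 @ x)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

lin_additive = np.allclose(linear_net(a + b), linear_net(a) + linear_net(b))
relu_additive = np.allclose(relu_net(a + b), relu_net(a) + relu_net(b))

print("linear stack additive:", lin_additive)
print("ReLU stack additive:  ", relu_additive)
```

Here `relu_net(a) + relu_net(b)` is `[3, 1]` while `relu_net(a + b)` is `[0, 0]`, so no single matrix can reproduce the ReLU network.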

Sigmoid: the original, and its problems

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x_val in [-5, -2, 0, 2, 5]:
    g = sigmoid_grad(x_val)
    print(f"x={x_val:3d}  gradient={g:.6f}")

x= -5  gradient=0.006648
x= -2  gradient=0.104994
x=  0  gradient=0.250000
x=  2  gradient=0.104994
x=  5  gradient=0.006648

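At x=±5, the gradient is about 38× smaller than at x=0 (0.25 / 0.006648 ≈ 37.6). Even at its peak the sigmoid gradient is only 0.25, so backpropagating through 10 sigmoid layers scales the signal by at most 0.25^10 ≈ 10^-6 before the weights are even considered. This is the vanishing gradient problem.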
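The compounding can be sketched with a few lines of arithmetic. This assumes the best case for sigmoid (every unit sitting at x=0, where the gradient peaks at 0.25) and ignores the weight matrices, so it is an upper bound on the activation's contribution rather than a simulation of a real network. `activation_gradient_bound` is a hypothetical helper.

```python
# Largest possible sigmoid derivative, reached at x = 0.
MAX_SIGMOID_GRAD = 0.25

def activation_gradient_bound(depth, per_layer=MAX_SIGMOID_GRAD):
    """Upper bound on the product of activation derivatives over `depth` layers."""
    return per_layer ** depth

for depth in (1, 5, 10, 20):
    bound = activation_gradient_bound(depth)
    print(f"depth={depth:2d}  bound={bound:.3e}")
```

At depth 10 the bound is already below one millionth, which is why saturation only makes a bad situation worse.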

ReLU: the surprisingly effective fix

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

The gradient for positive inputs is exactly 1, so gradients don't shrink as they pass through ReLU on the positive side. That made much deeper networks practical to train.

The cost: neurons whose inputs are consistently negative receive zero gradient — the "dying ReLU" problem. In practice this matters less than you'd think.

# Leaky ReLU: small gradient for negatives
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
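To make the difference concrete, here is a small sketch comparing the two gradients on a neuron that is negative for every input. The pre-activation values are made up for illustration, and `leaky_relu_grad` is a helper that does not appear in the original.

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # Derivative of leaky ReLU: 1 for positives, alpha otherwise.
    return np.where(x > 0, 1.0, alpha)

# Hypothetical pre-activations of one neuron across a batch, all negative.
pre_acts = np.array([-3.2, -0.7, -1.5, -4.1])

dead = relu_grad(pre_acts)
leaky = leaky_relu_grad(pre_acts)

print("ReLU gradients:      ", dead)
print("Leaky ReLU gradients:", leaky)
```

With plain ReLU nothing flows back, so the neuron's weights never get a chance to recover. The leaky variant keeps a small signal alive.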

GELU: what GPT uses

GELU is a smooth alternative to ReLU, defined as x × Φ(x), where Φ is the standard normal CDF. The widely used tanh approximation is:

GELU(x) ≈ 0.5 × x × (1 + tanh(√(2/π) × (x + 0.044715 × x³)))

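def gelu(x):
    # tanh approximation of GELU, as in the formula above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

for x_val in [-0.5, -0.2, 0.0, 0.2, 0.5]:
    r = max(0, x_val)
    g = gelu(np.array([x_val]))[0]
    print(f"x={x_val:4.1f}  ReLU={r:.4f}  GELU={g:.4f}")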

x=-0.5  ReLU=0.0000  GELU=-0.1543
x=-0.2  ReLU=0.0000  GELU=-0.0841
x= 0.0  ReLU=0.0000  GELU=0.0000
x= 0.2  ReLU=0.2000  GELU=0.1159
x= 0.5  ReLU=0.5000  GELU=0.3457

The smoothness makes optimization slightly easier. GPT-2 and BERT both use GELU.
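For reference, the exact form GELU(x) = x × Φ(x) can be written with the error function and compared against the tanh approximation. The sketch below uses only `math` and NumPy; `gelu_exact` and `gelu_tanh` are illustrative names, and the grid range is an arbitrary choice.

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with the standard normal CDF written via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = np.linspace(-5.0, 5.0, 1001)
max_gap = max(abs(gelu_exact(float(x)) - gelu_tanh(float(x))) for x in xs)

print(f"largest gap on [-5, 5]: {max_gap:.2e}")
```

The two agree to roughly three decimal places over this range, which is why the cheaper tanh form is often used interchangeably.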

SwiGLU: what modern models use

SwiGLU is the gated feed-forward variant used in LLaMA, Mistral, and many current large models:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        gate = F.silu(self.W(x))   # SiLU = x * sigmoid(x)
        content = self.V(x)
        return self.out(gate * content)

One linear projection, passed through SiLU, gates how much of the other passes through. That is more expressive than a simple element-wise non-linearity.
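The gating can also be spelled out in plain NumPy to show the shapes involved. This is a sketch that mirrors the module above with random weights and hypothetical sizes (`d_model=8`, `d_ff=16`); it says nothing about how any particular model initializes or sizes its layers.

```python
import numpy as np

def silu(z):
    # z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16  # illustrative sizes only

W = rng.standard_normal((d_model, d_ff))      # gate projection
V = rng.standard_normal((d_model, d_ff))      # content projection
W_out = rng.standard_normal((d_ff, d_model))  # back to model width

def swiglu(x):
    gate = silu(x @ W)        # (batch, d_ff)
    content = x @ V           # (batch, d_ff)
    return (gate * content) @ W_out  # (batch, d_model)

x = rng.standard_normal((2, d_model))
y = swiglu(x)
print(y.shape)  # (2, 8)
```

Because the output multiplies two different projections of the same input, each hidden unit's contribution depends on the input in two ways at once, which a single element-wise function cannot do.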

Gradient flow comparison

def test_gradient_flow(activation_fn, depth=20, seed=0):
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        layers.extend([nn.Linear(64, 64), activation_fn()])
    model = nn.Sequential(*layers)
    x = torch.randn(16, 64, requires_grad=True)
    out = model(x).sum()
    out.backward()
    return x.grad.abs().mean().item()

activations = {"ReLU": nn.ReLU, "Sigmoid": nn.Sigmoid, "GELU": nn.GELU, "SiLU": nn.SiLU}
for name, act in activations.items():
    grad = test_gradient_flow(act, depth=20)
    print(f"{name:<10}: input gradient magnitude = {grad:.6f}")

ReLU      : input gradient magnitude = 0.003241
Sigmoid   : input gradient magnitude = 0.000001
GELU      : input gradient magnitude = 0.004817
SiLU      : input gradient magnitude = 0.004923

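Sigmoid is thousands of times worse. ReLU, GELU, and SiLU are all in the same ballpark: the gap between them matters far less than the gap from sigmoid. Exact values will vary with the seed, the initialization, and the PyTorch version, but the ordering is the point.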
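One way to read this result: during backpropagation each layer multiplies the gradient by the activation's derivative, and sigmoid's derivative never exceeds 0.25, while the other three reach roughly 1. The sketch below estimates the largest slope of each function numerically with NumPy (finite differences on a grid, so the values are approximate). It ignores the weights entirely and is not a substitute for the PyTorch experiment above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

activations = {
    "Sigmoid": sigmoid,
    "ReLU": lambda x: np.maximum(0.0, x),
    "GELU": gelu_tanh,
    "SiLU": lambda x: x * sigmoid(x),
}

xs = np.linspace(-6.0, 6.0, 2401)  # grid step of 0.005
max_slope = {}
for name, fn in activations.items():
    slopes = np.gradient(fn(xs), xs)  # numerical derivative on the grid
    max_slope[name] = float(slopes.max())
    print(f"{name:<8} max slope ≈ {max_slope[name]:.3f}")
```

A per-layer factor capped at 0.25 compounds into a tiny number over 20 layers. A factor near 1 does not, which matches the ordering in the measurements.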

Summary

Function	Where used	Key property
Sigmoid	Old networks	Saturates; vanishing gradients
ReLU	CNNs, MLPs	Simple; gradient=1 for positives
GELU	GPT-2, BERT	Smooth; slight negative outputs
SiLU/Swish	Modern models	Smooth; slightly better performance
SwiGLU	LLaMA, Mistral	Expressive gating mechanism

The progression follows one thread: keep gradients alive through many layers, give the network enough expressive power, don't overcomplicate what works.

This is part of an ongoing series on AI internals. Full article with more context at machina.chat/blog.