Nilavukkarasan R
Regularization: Fighting Overfitting

"Learning without thought is labor lost" --Confucius


When Your Network Becomes a Memorizer

In my last post, I showed you how to train a network on MNIST. Adam optimizer, mini-batches, 100 epochs. Training accuracy climbed over 99%.

It felt like we'd solved it.

But here's what I didn't show you: what happens if you keep training.

Running the network for 200 epochs drives training accuracy over 99%. But test accuracy tells a different story: it peaks around 97% near epoch 50, then slowly drops as training continues.

Why? Imagine studying for an exam by memorizing that "Question 5 is always B" instead of understanding why B is correct. You'd ace the practice test but fail when questions are reordered or rephrased. Neural networks do the same thing. They memorize training data so well they hit 99% accuracy, yet struggle with new examples because they never learned the underlying patterns.

The network isn't learning anymore. It's memorizing. That's overfitting, and it's one of the core challenges in training neural networks.

The Generalization Gap

Here's what happens when you train a network without any safeguards:

Epoch 1:   Train Acc: 87.3%  Test Acc: 86.9%  Gap: 0.4%
Epoch 10:  Train Acc: 97.2%  Test Acc: 96.8%  Gap: 0.4%
Epoch 50:  Train Acc: 99.1%  Test Acc: 97.2%  Gap: 1.9%
Epoch 100: Train Acc: 99.7%  Test Acc: 96.8%  Gap: 2.9%
Epoch 200: Train Acc: 99.95% Test Acc: 96.1%  Gap: 3.85%

See that gap? It starts small while the network is learning generalizable patterns. But as training continues, the gap widens: the network keeps improving on training data while test accuracy stalls and drops.

This gap is the generalization gap: the difference between how the network performs on data it has seen and data it hasn't.

Why Does This Happen?

A network has three things working against it:

1. Capacity: Your network has 100,000 weights. Your training set has 60,000 examples. Mathematically, the network has enough capacity to memorize every single example.

2. Time: Every epoch, the network sees the same training examples again. It gets more chances to memorize. After 200 epochs, it's seen each example 200 times. Memorization becomes easier than learning.

3. No penalty for complexity: The network doesn't care whether it uses simple patterns or complex ones; both reduce training loss equally. So it drifts toward complexity, and complexity means overfitting.

The solution? Force the network to generalize instead of memorize.

The Regularization Toolkit: A Big Picture View

Before we dive into specific techniques, let's zoom out and see the full landscape of solutions to overfitting. There are actually many ways to fix it:

1. Reduce Model Capacity

  • Use smaller networks (fewer neurons, fewer layers)
  • Prune weights after training
  • Use simpler architectures

The idea: if your network is smaller, it simply can't memorize as much. But this is a blunt instrument; you might lose the ability to learn complex patterns.

2. Increase Training Data

  • Collect more real data
  • Use data augmentation (rotations, crops, noise for images; paraphrasing for text)
  • Use synthetic data generation

The idea: with more diverse examples, memorization becomes harder. The network has to learn generalizable patterns to cover all the data.
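To make this concrete, here is a minimal sketch of augmentation for MNIST-style 28×28 arrays: random pixel shifts plus a little Gaussian noise. The function name `augment` and its parameters are my own illustration, not from any particular library.

```python
import numpy as np

def augment(image, rng=np.random.default_rng(0)):
    """Return a randomly shifted, lightly noised copy of a 28x28 image in [0, 1]."""
    # Shift by up to 2 pixels in each direction
    dy, dx = rng.integers(-2, 3, size=2)
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    # Add small Gaussian noise, then clip back to the valid range
    noised = shifted + rng.normal(0.0, 0.05, size=image.shape)
    return np.clip(noised, 0.0, 1.0)

image = np.zeros((28, 28))
image[10:18, 10:18] = 1.0   # a fake "digit"
augmented = augment(image)
print(augmented.shape)      # (28, 28)
```

Each call yields a slightly different example, so the network never sees the exact same input twice.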

3. Stop Training Early

  • Monitor test accuracy during training
  • Stop when test accuracy starts declining
  • This is called "early stopping"

The idea: overfitting gets worse over time. Stop before it happens.
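A minimal sketch of early stopping with a patience counter (the function names `train_step` and `evaluate` are placeholders for your own training and evaluation code):

```python
def train_with_early_stopping(train_step, evaluate, max_epochs=200, patience=10):
    """Stop when held-out accuracy hasn't improved for `patience` epochs."""
    best_acc, best_epoch, epochs_without_improvement = 0.0, 0, 0
    for epoch in range(max_epochs):
        train_step()            # one pass over the training set
        acc = evaluate()        # accuracy on held-out data
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            epochs_without_improvement = 0
            # In a real setup you'd also checkpoint the weights here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break           # no improvement for `patience` epochs: stop
    return best_acc, best_epoch
```

The patience parameter matters: test accuracy is noisy, so stopping at the first dip throws away good models.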

4. Ensemble Methods

  • Train multiple networks and average their predictions
  • Use techniques like boosting or bagging

The idea: multiple imperfect models often generalize better than one perfect model.

5. Architectural Innovations

  • Skip connections (ResNets) allow training deeper networks that generalize better
  • Attention mechanisms focus on relevant parts of the input
  • Inductive biases (like convolutions for images) reduce the effective capacity

The idea: design the architecture to match the problem structure.

Our Focus: Regularization Techniques

In this post, we're going to deep-dive into two regularization techniques: dropout and weight decay.

Why these two? Because they represent two different philosophies:

  • Dropout prevents co-adaptation (neurons learning to work together in ways that only make sense for training data)
  • Weight decay encourages simplicity (smaller weights = simpler decision boundaries)

Together, they form a powerful one-two punch against overfitting. And understanding them deeply will help you understand other regularization techniques.

Dropout: An Ensemble of Smaller Networks

Here's an idea: what if we randomly disabled neurons during training?

Not permanently. Just during each forward pass.

This sounds like sabotage. Why would we intentionally break our network?

Because it forces the network to learn redundant representations.

Think of it like this: imagine you're building a team to solve problems. If you always have the same 10 people, they'll specialize and depend on each other. Person A always handles data, Person B always handles logic. If Person A gets sick, the team fails.

But if you randomly remove people from the team each day, they can't specialize. Everyone learns to do everything. The team becomes robust.

That's dropout.

When we randomly disable neurons during training, the network can't rely on specific neurons to make a prediction. Instead, it must learn multiple pathways to the same answer. This redundancy prevents co-adaptation, i.e. neurons relying on each other in ways that only work for the training data.

If a layer has n neurons, there are 2ⁿ possible subnetworks, depending on which neurons are active or dropped. During training, each mini-batch randomly samples one of these subnetworks.

Imagine a hidden layer with 256 neurons and 50% dropout.

Each mini-batch activates a different random subset of neurons:

  • Mini-batch 1 trains with neurons {1, 3, 5, 7, ..., 255}
  • Mini-batch 2 trains with neurons {2, 4, 6, 8, ..., 256}
  • Mini-batch 3 trains with neurons {1, 2, 4, 7, ..., 254}

Each subset forms a slightly different network. Over training, the model samples from an enormous space of subnetworks and learns weights that perform well across many of them.

Modern implementations use inverted dropout.

During training, we randomly drop neurons and scale the activations so that their expected value stays the same. At test time, we simply run the full network without any dropout.

# Inverted dropout: randomly disable neurons during training only
import numpy as np

if training:
    # Each mask entry is 1 with probability (1 - dropout_rate), else 0
    mask = np.random.binomial(1, 1 - dropout_rate, X.shape)
    X_dropped = X * mask / (1 - dropout_rate)  # Scale to maintain expected value
else:
    X_dropped = X  # No dropout at test time

The scaling factor 1 / (1 - dropout_rate) is crucial.

Without it, the magnitude of activations during training would be smaller than during inference, causing inconsistent predictions.

By scaling during training, the expected activation remains the same whether dropout is active or not.
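You can verify this expectation argument numerically. The sketch below applies inverted dropout to a large constant activation vector and checks that the mean barely moves (the seed and sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
dropout_rate = 0.5
X = np.ones((1, 10_000))   # constant activations make the average easy to read

# Inverted dropout: drop with probability p, scale survivors by 1/(1-p)
mask = rng.binomial(1, 1 - dropout_rate, X.shape)
X_dropped = X * mask / (1 - dropout_rate)

print(X.mean())            # 1.0
print(X_dropped.mean())    # close to 1.0: the expected value is preserved
```

Roughly half the entries become 0 and the other half become 2, so the average stays near 1 even though individual activations change drastically.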

Dropout forces the network to learn robust representations. No neuron can assume another neuron will always be present, so useful features must be distributed across the network.

The result:

  • Less memorization
  • Better generalization
  • A model that performs well on unseen data, not just the training set.

Weight Decay: Occam's Razor for Neural Networks

Dropout prevents co-adaptation. But there's another approach: what if we penalize large weights?

This idea is called weight decay, also known as L2 regularization.

The idea is simple: add a penalty to the loss function proportional to the magnitude of weights.

Total Loss = Cross-Entropy Loss + λ * (sum of squared weights)

The λ (lambda) parameter controls how much we penalize large weights. Higher λ means stronger penalty.

Why does this work? Large weights tend to make a network very sensitive to small changes in the input, producing sharp decision boundaries that can fit noise in the training data. Smaller weights produce smoother functions that change more gradually.

Consider two networks that both achieve 95% training accuracy:

Network A: Has weights like [0.1, 0.2, -0.15, 0.08, ...]. Small adjustments to inputs.

Network B: Has weights like [5.2, -8.7, 12.3, -6.1, ...]. Large adjustments to inputs.

Both fit the training data. But Network B's large weights create sharper, more extreme responses to inputs, which increases the risk of overfitting. Weight decay prefers Network A because its weights have smaller magnitude.

Without weight decay, the optimizer only cares about minimizing training loss.

With weight decay, the optimizer faces a trade-off:

  • Reduce training loss
  • Keep weights small

During backpropagation, the regularization term adds an extra component to the gradient (the penalty is conventionally written as (λ/2) · sum of squared weights, so the factor of 2 from differentiation cancels):

gradient = original_gradient + λ * w

This gently pulls weights toward zero during training. The result is not that the network learns less, but that it learns more restrained solutions that tend to generalize better.

Weight decay doesn't restrict learning.
It simply nudges the model toward simpler explanations.
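The trade-off is visible in a single SGD update. Below is a minimal sketch (the function name and the `lam` parameter are my own); with the data gradient zeroed out, you can watch the decay term alone shrink the weights geometrically:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=0.01):
    """One SGD update with the L2 penalty folded into the gradient."""
    return w - lr * (grad + lam * w)   # the lam * w term pulls weights toward zero

# With a zero data gradient, each step multiplies w by (1 - lr * lam) = 0.999
w = np.array([5.0, -8.0])
for _ in range(100):
    w = sgd_step_with_weight_decay(w, grad=np.zeros_like(w))
print(w)   # smaller in magnitude, same signs
```

In practice the data gradient pushes back, so weights settle where the pull toward zero balances the pull toward fitting the data.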

The Tuning Process:

  1. Train without regularization. Measure the train-to-test gap.
  2. If gap < 1%, you're good. No regularization needed.
  3. If gap is 1-3%, add dropout=0.2. Retrain and measure.
  4. If gap is still > 2%, add weight_decay=0.0001. Retrain and measure.
  5. If gap is still > 2%, increase dropout to 0.3 or 0.4.
  6. If gap is > 3%, you might need more data or a smaller network.

The key is experimentation. Every dataset is different. What works for MNIST might not work for ImageNet. Start conservative, measure, adjust.

Interactive Exploration

This is where the playground comes in. I've built a Streamlit app that lets you experiment with these techniques in real time. It covers two parts to explore: overfitting and weight distributions.

What Clicked for Me

Regularization is a trade-off. You're not trying to achieve 100% training accuracy. You're trying to maximize test accuracy. I used to think "higher training accuracy = better network." Now I know better.

Dropout is elegant. It's not a hack. It's a principled way to train an ensemble of networks simultaneously.

Each breakthrough solved a problem the previous one created:

  • Perceptrons couldn't learn complex patterns → Multi-layer networks
  • Multi-layer networks couldn't learn efficiently → Backpropagation
  • Backpropagation was slow on large datasets → Optimization (mini-batches, Adam)
  • Optimization worked but overfitted → Regularization (dropout, weight decay)

We're building a complete system. Each piece is necessary.

What's Next

We can now train networks that actually work in the real world. They learn patterns, not memorize data. They generalize to new examples.

For now, we're still limited to fully connected networks on small images. MNIST images are 28×28. Real images are 1000×1000 or larger, and fully connected networks don't scale: a 1000×1000 image alone would require 1 million input neurons.

We need a different architecture. One designed specifically for images.

Enter convolutional networks.

The jump from fully connected to convolutional is as big as the jump from perceptrons to multi-layer networks. It's a completely different way of thinking about neural networks.

And it's the next breakthrough in our journey.




Tags: #MachineLearning #AI #DeepLearning #Regularization #Dropout #WeightDecay #Overfitting #MNIST #NeuralNetworks

Code: GitHub Repository
