Nilavukkarasan R
Regularization: Fighting Overfitting

"Learning without thought is labor lost" --Confucius


When Your Network Becomes a Memorizer

In my last post, I showed you how to train a network on MNIST. Adam optimizer, mini-batches, 100 epochs. Training accuracy climbed over 99%.

It felt like we'd solved it.

But here's what I didn't show you: what happens if you keep training.

Running the network for 200 epochs drives training accuracy over 99%. But test accuracy tells a different story: it peaks around 97% near epoch 50, then slowly drops as training continues.

Why? Imagine studying for an exam by memorizing that "Question 5 is always B" instead of understanding why B is correct. You'd ace the practice test but fail when questions are reordered or rephrased. Neural networks do the same thing. They memorize training data so well they hit 99% accuracy, yet struggle with new examples because they never learned the underlying patterns.

The network isn't learning anymore. It's memorizing. That's overfitting, and it's one of the core challenges in training neural networks.

The Generalization Gap

Here's what happens when you train a network without any safeguards:

Epoch 1:   Train Acc: 87.3%  Test Acc: 86.9%  Gap: 0.4%
Epoch 10:  Train Acc: 97.2%  Test Acc: 96.8%  Gap: 0.4%
Epoch 50:  Train Acc: 99.1%  Test Acc: 97.2%  Gap: 1.9%
Epoch 100: Train Acc: 99.7%  Test Acc: 96.8%  Gap: 2.9%
Epoch 200: Train Acc: 99.95% Test Acc: 96.1%  Gap: 3.85%

See that gap? It starts small while the network is learning generalizable patterns. But as training continues, the gap widens: the network keeps improving on training data while test accuracy stalls and drops.

This gap is the generalization gap: the difference between how the network performs on data it has seen and data it hasn't.

Why Does This Happen?

A network has three things working against it:

1. Capacity: Your network has 100,000 weights. Your training set has 60,000 examples. Mathematically, the network has enough capacity to memorize every single example.

2. Time: Every epoch, the network sees the same training examples again. It gets more chances to memorize. After 200 epochs, it's seen each example 200 times. Memorization becomes easier than learning.

3. No penalty for complexity: The network doesn't care whether it uses simple patterns or complex ones; both reduce training loss equally. So it drifts toward complexity, and complexity means overfitting.

The solution? Force the network to generalize instead of memorize.

The Regularization Toolkit: A Big Picture View

Before we dive into specific techniques, let's zoom out and see the full landscape of solutions to overfitting. There are actually many ways to fix it:

1. Reduce Model Capacity

  • Use smaller networks (fewer neurons, fewer layers)
  • Prune weights after training
  • Use simpler architectures

The idea: if your network is smaller, it simply can't memorize as much. But this is a blunt instrument; you might lose the ability to learn complex patterns.

2. Increase Training Data

  • Collect more real data
  • Use data augmentation (rotations, crops, noise for images; paraphrasing for text)
  • Use synthetic data generation

The idea: with more diverse examples, memorization becomes harder. The network has to learn generalizable patterns to cover all the data.
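To make this concrete, here is a minimal sketch of augmentation for MNIST-style 28×28 arrays: random pixel shifts plus a little Gaussian noise. The function name `augment` and its parameters are my own illustration, not from any particular library.

```python
import numpy as np

def augment(image, rng=np.random.default_rng(0)):
    """Return a randomly shifted, lightly noised copy of a 28x28 image in [0, 1]."""
    # Shift by up to 2 pixels in each direction
    dy, dx = rng.integers(-2, 3, size=2)
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    # Add small Gaussian noise, then clip back to the valid range
    noised = shifted + rng.normal(0.0, 0.05, size=image.shape)
    return np.clip(noised, 0.0, 1.0)

image = np.zeros((28, 28))
image[10:18, 10:18] = 1.0   # a fake "digit"
augmented = augment(image)
print(augmented.shape)      # (28, 28)
```

Each call yields a slightly different example, so the network never sees the exact same input twice.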

3. Stop Training Early

  • Monitor test accuracy during training
  • Stop when test accuracy starts declining
  • This is called "early stopping"

The idea: overfitting gets worse over time. Stop before it happens.
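A minimal sketch of early stopping with a patience counter (the function names `train_step` and `evaluate` are placeholders for your own training and evaluation code):

```python
def train_with_early_stopping(train_step, evaluate, max_epochs=200, patience=10):
    """Stop when held-out accuracy hasn't improved for `patience` epochs."""
    best_acc, best_epoch, epochs_without_improvement = 0.0, 0, 0
    for epoch in range(max_epochs):
        train_step()            # one pass over the training set
        acc = evaluate()        # accuracy on held-out data
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            epochs_without_improvement = 0
            # In a real setup you'd also checkpoint the weights here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break           # no improvement for `patience` epochs: stop
    return best_acc, best_epoch
```

The patience parameter matters: test accuracy is noisy, so stopping at the first dip throws away good models.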

4. Ensemble Methods

  • Train multiple networks and average their predictions
  • Use techniques like boosting or bagging

The idea: multiple imperfect models often generalize better than one perfect model.

5. Architectural Innovations

  • Skip connections (ResNets) allow training deeper networks that generalize better
  • Attention mechanisms focus on relevant parts of the input
  • Inductive biases (like convolutions for images) reduce the effective capacity

The idea: design the architecture to match the problem structure.

Our Focus: Regularization Techniques

In this post, we're going to deep-dive into two regularization techniques: dropout and weight decay.

Why these two? Because they represent two different philosophies:

  • Dropout prevents co-adaptation (neurons learning to work together in ways that only make sense for training data)
  • Weight decay encourages simplicity (smaller weights = simpler decision boundaries)

Together, they form a powerful one-two punch against overfitting. And understanding them deeply will help you understand other regularization techniques.

Dropout: An Ensemble of Smaller Networks

Here's an idea: what if we randomly disabled neurons during training?

Not permanently. Just during each forward pass.

This sounds like sabotage. Why would we intentionally break our network?

Because it forces the network to learn redundant representations.

Think of it like this: imagine you're building a team to solve problems. If you always have the same 10 people, they'll specialize and depend on each other. Person A always handles data, Person B always handles logic. If Person A gets sick, the team fails.

But if you randomly remove people from the team each day, they can't specialize. Everyone learns to do everything. The team becomes robust.

That's dropout.

When we randomly disable neurons during training, the network can't rely on specific neurons to make a prediction. Instead, it must learn multiple pathways to the same answer. This redundancy prevents co-adaptation, i.e. neurons relying on each other in ways that only work for the training data.

If a layer has n neurons, there are 2ⁿ possible subnetworks, depending on which neurons are active or dropped. During training, each mini-batch randomly samples one of these subnetworks.

Imagine a hidden layer with 256 neurons and 50% dropout.

Each mini-batch activates a different random subset of neurons:

  • Mini-batch 1 trains with neurons {1, 3, 5, 7, ..., 255}
  • Mini-batch 2 trains with neurons {2, 4, 6, 8, ..., 256}
  • Mini-batch 3 trains with neurons {1, 2, 4, 7, ..., 254}

Each subset forms a slightly different network. Over training, the model samples from an enormous space of subnetworks and learns weights that perform well across many of them.

Modern implementations use inverted dropout.

During training, we randomly drop neurons and scale the activations so that their expected value stays the same. At test time, we simply run the full network without any dropout.

# Inverted dropout: randomly disable neurons during training only
import numpy as np

if training:
    # Each mask entry is 1 with probability (1 - dropout_rate), else 0
    mask = np.random.binomial(1, 1 - dropout_rate, X.shape)
    X_dropped = X * mask / (1 - dropout_rate)  # Scale to maintain expected value
else:
    X_dropped = X  # No dropout at test time

The scaling factor 1 / (1 - dropout_rate) is crucial.

Without it, the magnitude of activations during training would be smaller than during inference, causing inconsistent predictions.

By scaling during training, the expected activation remains the same whether dropout is active or not.
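You can verify this expectation argument numerically. The sketch below applies inverted dropout to a large constant activation vector and checks that the mean barely moves (the seed and sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
dropout_rate = 0.5
X = np.ones((1, 10_000))   # constant activations make the average easy to read

# Inverted dropout: drop with probability p, scale survivors by 1/(1-p)
mask = rng.binomial(1, 1 - dropout_rate, X.shape)
X_dropped = X * mask / (1 - dropout_rate)

print(X.mean())            # 1.0
print(X_dropped.mean())    # close to 1.0: the expected value is preserved
```

Roughly half the entries become 0 and the other half become 2, so the average stays near 1 even though individual activations change drastically.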

Dropout forces the network to learn robust representations. No neuron can assume another neuron will always be present, so useful features must be distributed across the network.

The result:

  • Less memorization
  • Better generalization
  • A model that performs well on unseen data, not just the training set.

Weight Decay: Occam's Razor for Neural Networks

Dropout prevents co-adaptation. But there's another approach: what if we penalize large weights?

This idea is called weight decay, also known as L2 regularization.

The idea is simple: add a penalty to the loss function proportional to the magnitude of weights.

Total Loss = Cross-Entropy Loss + λ * (sum of squared weights)

The λ (lambda) parameter controls how much we penalize large weights. Higher λ means stronger penalty.

Why does this work? Large weights tend to make a network very sensitive to small changes in the input, producing sharp decision boundaries that can fit noise in the training data. Smaller weights produce smoother functions that change more gradually.

Consider two networks that both achieve 95% training accuracy:

Network A: Has weights like [0.1, 0.2, -0.15, 0.08, ...]. Small adjustments to inputs.

Network B: Has weights like [5.2, -8.7, 12.3, -6.1, ...]. Large adjustments to inputs.

Both fit the training data. But Network B's large weights create sharper, more extreme responses to inputs, which increases the risk of overfitting. Weight decay prefers Network A because its weights have smaller magnitude.

Without weight decay, the optimizer only cares about minimizing training loss.

With weight decay, the optimizer faces a trade-off:

  • Reduce training loss
  • Keep weights small

During backpropagation, the regularization term adds an extra component to the gradient (the penalty is conventionally written as (λ/2) · sum of squared weights, so the factor of 2 from differentiation cancels):

gradient = original_gradient + λ * w

This gently pulls weights toward zero during training. The result is not that the network learns less, but that it learns more restrained solutions that tend to generalize better.

Weight decay doesn't restrict learning.
It simply nudges the model toward simpler explanations.
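The trade-off is visible in a single SGD update. Below is a minimal sketch (the function name and the `lam` parameter are my own); with the data gradient zeroed out, you can watch the decay term alone shrink the weights geometrically:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=0.01):
    """One SGD update with the L2 penalty folded into the gradient."""
    return w - lr * (grad + lam * w)   # the lam * w term pulls weights toward zero

# With a zero data gradient, each step multiplies w by (1 - lr * lam) = 0.999
w = np.array([5.0, -8.0])
for _ in range(100):
    w = sgd_step_with_weight_decay(w, grad=np.zeros_like(w))
print(w)   # smaller in magnitude, same signs
```

In practice the data gradient pushes back, so weights settle where the pull toward zero balances the pull toward fitting the data.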

The Tuning Process:

  1. Train without regularization. Measure the train-to-test gap.
  2. If gap < 1%, you're good. No regularization needed.
  3. If gap is 1-3%, add dropout=0.2. Retrain and measure.
  4. If gap is still > 2%, add weight_decay=0.0001. Retrain and measure.
  5. If gap is still > 2%, increase dropout to 0.3 or 0.4.
  6. If gap is > 3%, you might need more data or a smaller network.

The key is experimentation. Every dataset is different. What works for MNIST might not work for ImageNet. Start conservative, measure, adjust.

Interactive Exploration

This is where the playground comes in. I've built a Streamlit app that lets you experiment with these techniques in real time. It covers two parts to explore: overfitting and weight distributions.

What Clicked for Me

Regularization is a trade-off. You're not trying to achieve 100% training accuracy. You're trying to maximize test accuracy. I used to think "higher training accuracy = better network." Now I know better.

Dropout is elegant. It's not a hack. It's a principled way to train an ensemble of networks simultaneously.

Each breakthrough solved a problem the previous one created:

  • Perceptrons couldn't learn complex patterns → Multi-layer networks
  • Multi-layer networks couldn't learn efficiently → Backpropagation
  • Backpropagation was slow on large datasets → Optimization (mini-batches, Adam)
  • Optimization worked but overfitted → Regularization (dropout, weight decay)

We're building a complete system. Each piece is necessary.

What's Next

We can now train networks that actually work in the real world. They learn patterns, not memorize data. They generalize to new examples.

For now, we're still limited to fully connected networks on small images. MNIST images are 28×28. Real images are 1000×1000 or larger, and fully connected networks don't scale: a 1000×1000 image alone would require 1 million input neurons.

We need a different architecture. One designed specifically for images.

Enter convolutional networks.

The jump from fully connected to convolutional is as big as the jump from perceptrons to multi-layer networks. It's a completely different way of thinking about neural networks.

And it's the next breakthrough in our journey.




Tags: #MachineLearning #AI #DeepLearning #Regularization #Dropout #WeightDecay #Overfitting #MNIST #NeuralNetworks

Code: GitHub Repository
