Maria Boryslawska

Deep neural networks: How does regularisation prevent overfitting?

What really is overfitting? Overfitting is a very dangerous game many of us have played. It means that your model achieves a low error on the training set but a high error on the test set.

Now, if you suspect that you might have a high variance problem, you can add more training data. However, that only works if you can actually get more reliable data, and collecting it can be expensive.

Another way is to apply regularisation. It can prevent overfitting and reduce errors in your neural network. So what types of regularisation do we have?

First and foremost, we can use L2 regularisation, which penalises the squared Euclidean norm of the weights (also known as Ridge regression or weight decay). It's the most common type of regularisation. It penalises large coefficients, so they shrink towards zero, but never become exactly equal to zero.
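As a minimal sketch (assuming PyTorch, with a made-up toy network and a made-up lambda value, not code from this article), L2 regularisation can be applied either by adding the squared norm of the weights to the loss yourself, or via the optimiser's `weight_decay` argument:

```python
import torch
import torch.nn as nn

# Toy setup: random data and a small fully connected network.
torch.manual_seed(0)
X, y = torch.randn(64, 10), torch.randn(64, 1)
model = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 1))

lam = 1e-2                 # regularisation strength (lambda), chosen arbitrarily here
criterion = nn.MSELoss()

# Option 1: add the L2 penalty (squared norm of all parameters, biases
# included for brevity) to the loss before backpropagating.
pred = model(X)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = criterion(pred, y) + lam * l2_penalty
loss.backward()

# Option 2: a very similar effect via the optimiser's weight_decay argument,
# which adds lam * w to each weight's gradient at every update step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam)
```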

Secondly, we can use L1 regularisation, also called Lasso regression, which penalises the absolute values of the weights. Under this penalty, small weights eventually fade away and become exactly 0. This means that L1 regularisation helps us select the features that are important and turns the rest of them into zeros, so it is very useful for feature selection when we have a huge number of features.
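To see that feature-selection effect in action, here is a small sketch using scikit-learn's Lasso on made-up data where only three of twenty features actually matter (the data and the alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: 100 samples, 20 features, but only features 0, 5 and 12 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 5] + 0.5 * X[:, 12] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1)   # alpha is the L1 regularisation strength
lasso.fit(X, y)

# Most coefficients are driven exactly to zero; the informative ones survive.
print(np.nonzero(lasso.coef_)[0])   # likely just the indices of the useful features
```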

So how do we apply these to neural networks? Let's take a very simple NN as an example. If regularisation pushes a lot of the weights close to zero, it is almost as if we are zeroing out the impact of many of the hidden units, leaving a much smaller and simpler network. In practice the hidden units are not completely zeroed out; every unit is still used, just with a much smaller effect. So the intuition of completely switching off a bunch of hidden units isn't quite right, but the network does end up effectively simpler.

If your regularisation parameter is large, then your weights will be pushed towards relatively small values, because they are being penalised.
With small weights, the input to each activation function stays close to zero, and for an activation like tanh that is exactly the region where the function is roughly linear. So it's as if every layer is doing little more than linear regression. And if every layer behaves roughly linearly, then your whole network behaves like one linear network, because even a very deep stack of linear layers can only ever compute a linear function.

Consequently, your whole neural network ends up computing something close to a big linear function, which can only represent much simpler mappings than a complex non-linear function. Therefore, it is also much less able to overfit.
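Here is a quick toy sketch of that intuition (the weight scales are made up, and tanh is assumed as the activation, as in the argument above): with small weights the pre-activations stay near zero, where tanh is almost the identity, so the layer behaves roughly linearly.

```python
import torch

# With small weights, the pre-activation z = W @ x stays close to zero,
# where tanh(z) is approximately z, i.e. the layer is effectively linear.
torch.manual_seed(0)
x = torch.randn(10)

small_W = 0.01 * torch.randn(5, 10)   # heavily regularised (small) weights
large_W = 3.0 * torch.randn(5, 10)    # unregularised (large) weights

for name, W in [("small weights", small_W), ("large weights", large_W)]:
    z = W @ x
    linear_gap = (torch.tanh(z) - z).abs().max()
    print(f"{name}: max |tanh(z) - z| = {linear_gap.item():.4f}")

# Expected pattern (exact values depend on the seed):
#   small weights -> tiny gap, the layer acts almost linearly
#   large weights -> large gap, the layer is genuinely non-linear
```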
