Too Good to Be True
Overfitting is when you train a model too closely to an existing data set. Every data set contains signal and noise. When predicting something, you want your model to learn the signal, the part of the data that actually affects your desired result. For example, in a data set on house pricing, you want the model to learn that location or number of bedrooms is a signal that increases the price. Noise is just the randomness in the collected data. You want all of your predictive models learning from the signal in the data, not the noise. Overfitting happens when you don't account for the fact that data contains noise and fit your model to the noise as well. Although an overfitted model will correctly classify the data it trained on, it will fail to correctly classify new data. Picture the classic illustration: a scatter of blue and red dots with two possible decision boundaries drawn through them.
The black line represents a good model: it identifies a dividing line that accurately classifies the blue and red dots with a few exceptions, which can be attributed to the natural noise in the data. The overfitted line separates the two colors with 100% accuracy; however, it looks very unnatural and, in simple terms, "tries too hard". This overfitted line accounts for the noise in the data instead of filtering it out.
How to Identify Overfitting
The easiest way to identify overfitting is to split your data set into a training set and a test set before training your model (I like to use an 80%/20% split, respectively). You then train your model on the training set and see how accurately it predicts the test set. If it scores well on the training set, say 92%, but only manages 65% on the test set, it is overfit.
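Here is a minimal sketch of that check using scikit-learn. The synthetic data set, the decision tree, and the exact numbers are just stand-ins for whatever model and data you are actually working with:

```python
# A minimal sketch of spotting overfitting with a train/test split.
# The data here is synthetic; swap in your own features (X) and labels (y).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# Hold out 20% of the data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A deep, unconstrained decision tree is a classic candidate for overfitting.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
# A train score far above the test score (like the 92% vs 65% above)
# is the signature of overfitting.
```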
How to Prevent Overfitting
The most common practice I use to prevent overfitting is cross validation.
Cross validation is when you split your data into multiple train/test splits. If you're cross validating with 5 folds, you get 5 different 80/20 splits, and each split holds out a different fold as its test set, so every data point is used for testing exactly once. This way you train and evaluate your model on multiple parts of the data set, which reduces the risk of overfitting to any one portion of the data.
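As a rough sketch, here is what 5-fold cross validation looks like with scikit-learn's cross_val_score; again, the synthetic data and the decision tree are placeholders, not specific to any particular project:

```python
# A minimal sketch of 5-fold cross validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
model = DecisionTreeClassifier(random_state=42)

# cv=5 partitions the data into 5 folds; each fold serves as the
# test set exactly once, giving five 80/20 splits.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

Averaging the five fold scores gives a more honest estimate of how the model will perform on new data than any single split can.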