# Cross Validation for Beginners

When solving an ML problem, we usually do a train/test split. If this split is done purely at random, some kinds of samples (for example, a whole class) may end up entirely in the test set and be absent from the training set, or vice versa. The model then never learns from those samples, or is never evaluated on them, which hurts its accuracy and makes the measured accuracy unreliable. This is where cross-validation comes into the picture.
Cross-validation is a step in building a machine learning model that helps us check that the model fits the data accurately and is not overfitting. It divides the training data into a few parts: we train the model on some of these parts, test on the remaining part, and then rotate which part is held out.
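
As a quick illustration of the idea, here is a minimal sketch using scikit-learn's `cross_val_score`. The synthetic dataset and the choice of `LogisticRegression` are assumptions made only for this example; they are not part of the original walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Any estimator would do; logistic regression is just a simple choice
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into 5 parts; each part is used once for testing
scores = cross_val_score(model, X, y, cv=5)

print(scores)         # one accuracy value per part
print(scores.mean())  # average accuracy across the 5 parts
```

Each value in `scores` comes from a model that never saw the part it was tested on, which is what makes the average a fairer estimate than a single split.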

## Types of Cross Validation

i. Leave One Out CV :

• Split a dataset into a training set and a testing set, using all but one observation as part of the training set.
• Note that we only leave one observation “out” from the training set. This is where the method gets the name “leave-one-out” cross-validation.
• Use the single left-out observation as the test set.
• In the second experiment, leave out a different observation and take the rest of the data as training input.
• Repeat the process until every observation has been used as the test set once.

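Cons : Computationally expensive, because the model is trained once per observation. The error estimate has low bias but can have high variance.

Low bias, high variance : Each model is trained on almost all of the data, so the error estimate is close to what a model trained on the full dataset would achieve (low bias). The downside is that the estimate can change a lot from one dataset to another (high variance), so results that look good in cross-validation may not hold up when we try the model on new data.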
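
The leave-one-out steps above can be sketched with scikit-learn's `LeaveOneOut` splitter. The five-row toy array is an assumption chosen so that every split can be printed and read easily.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy data with 5 observations (assumption, for illustration only)
X = np.arange(10).reshape(5, 2)

loo = LeaveOneOut()
splits = list(loo.split(X))

# Each split holds out exactly one observation as the test set
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)

print("number of experiments:", len(splits))
```

With n observations there are n experiments, which is why this method becomes expensive on large datasets.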

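ii. K-Fold CV : We have some data and a value k. For example, with 1000 samples and k = 5, each fold holds 1000/5 = 200 samples. In the first experiment, the first 200 samples are the test data and the remaining 800 are the training data. In the second experiment, the next 200 samples are the test data. The process is repeated 5 times, so every sample is used for testing exactly once.
The 5 iterations give us 5 accuracies. We usually report their mean (and how much they vary) as the model's performance rather than picking the best of the 5, since the best single fold gives an over-optimistic estimate.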
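
To make the arithmetic concrete, here is a small sketch that prints which rows land in each test fold for 1000 samples and k = 5. The array of zeros is a placeholder (an assumption); only its length matters here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 1000 samples, values are irrelevant for the split
X = np.zeros((1000, 1))

# Without shuffling, KFold cuts the data into contiguous blocks
kf = KFold(n_splits=5)

fold_ranges = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    first, last = int(test_idx[0]), int(test_idx[-1])
    fold_ranges.append((first, last, len(test_idx)))
    print(f"fold {fold}: test rows {first}-{last}, train size {len(train_idx)}")
```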

```python
import pandas as pd
from sklearn import model_selection

if __name__ == "__main__":
    # Training data is in a CSV file called train.csv
    df = pd.read_csv("train.csv")
    # we create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # the next step is to randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # initiate the kfold class from model_selection module
    kf = model_selection.KFold(n_splits=5)
    # fill the new kfold column with the fold each row is validated in
    for fold, (trn_, val_) in enumerate(kf.split(X=df)):
        df.loc[val_, "kfold"] = fold
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)
```
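
Once every row carries a fold number, one way to use it is to loop over the folds, train on the other rows, and validate on the rows of the current fold. The sketch below builds a small synthetic frame in memory instead of reading `train_folds.csv`; the column names `f0`..`f4` and `target`, and the choice of `LogisticRegression`, are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["target"] = y

# Assign a fold number to every row, as in the snippet above
df["kfold"] = -1
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, val_idx) in enumerate(kf.split(X=df)):
    df.loc[val_idx, "kfold"] = fold

features = [c for c in df.columns if c not in ("target", "kfold")]
accuracies = []
for fold in range(5):
    train = df[df.kfold != fold]  # rows from the other 4 folds
    valid = df[df.kfold == fold]  # rows held out for this fold
    model = LogisticRegression(max_iter=1000)
    model.fit(train[features], train["target"])
    accuracies.append(model.score(valid[features], valid["target"]))

print(accuracies)
print(sum(accuracies) / len(accuracies))  # mean accuracy over the folds
```

Storing the fold number in a column keeps the split fixed, so different models can be compared on exactly the same folds.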

iii. Stratified CV : If we have a skewed dataset for classification with 90% positive samples and only 10% negative samples, we should not use plain random k-fold cross-validation. With a dataset like this, simple k-fold can produce folds whose class ratio differs sharply from the full dataset, in the extreme case folds that contain only one class. In these cases, we prefer stratified k-fold cross-validation.
Stratified k-fold cross-validation keeps the ratio of labels in each fold constant. So, in each fold, we will have the same 90% positive and 10% negative samples.

```python
import pandas as pd
from sklearn import model_selection

if __name__ == "__main__":
    # Training data is in a CSV file called train.csv
    df = pd.read_csv("train.csv")
    # we create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # the next step is to randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # fetch the labels; the column name "target" is an assumption,
    # replace it with the label column of your own dataset
    y = df["target"].values
    # initiate the stratified kfold class from model_selection module
    kf = model_selection.StratifiedKFold(n_splits=5)
    # fill the new kfold column; passing y keeps the label ratio per fold
    for fold, (trn_, val_) in enumerate(kf.split(X=df, y=y)):
        df.loc[val_, "kfold"] = fold
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)
```
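
To see the effect of stratification on the 90/10 example, here is a small sketch that counts positives and negatives in each validation fold when using `StratifiedKFold`. The labels are built by hand (90 ones and 10 zeros), which is an assumption made only to mirror the ratio described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 100 labels: 90 positive (1) and 10 negative (0), mirroring the 90/10 example
y = np.array([1] * 90 + [0] * 10)
X = np.zeros((100, 1))  # placeholder features, irrelevant for the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold, (_, val_idx) in enumerate(skf.split(X, y)):
    positives = int(y[val_idx].sum())
    negatives = len(val_idx) - positives
    fold_counts.append((positives, negatives))
    print(f"fold {fold}: {positives} positive, {negatives} negative")
```

Every fold of 20 samples keeps the 90/10 ratio, which plain `KFold` does not guarantee.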

iv. Time Series CV : Time-series models can be cross-validated on a rolling basis, because shuffling the rows would let the model train on the future and predict the past. Start with a small subset of the data for training, forecast the data points that come right after it, and check the accuracy of those forecasts. The same data points are then included in the next training set, and the subsequent data points are forecasted.
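
The rolling scheme can be sketched with scikit-learn's `TimeSeriesSplit`, which is one possible way to implement it rather than the only one. The 12-point series below is an assumption chosen to keep the printed splits short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (assumption, for illustration only)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

# The training window grows each round and always precedes the test window
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In every split the test indices come strictly after the training indices, so the model is never evaluated on data older than what it was trained on.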

That's all folks.

If you have any doubts, ask me in the comments section and I'll try to answer as soon as possible.