Usually, the train/test split is one of the Machine Learning tasks taken for granted. In fact, data scientists tend to focus on Data Preprocessing or Feature Engineering, delegating the process of dividing the dataset to a single line of code.
In this short article, I describe three train/test splitting techniques, using three different Python libraries:
In this tutorial, I assume that the whole dataset is available as a CSV file, which is loaded as a Pandas DataFrame. I consider the heart.csv dataset, which has 303 rows and 14 columns:
import pandas as pd

df = pd.read_csv('source/heart.csv')
The output column corresponds to the target column and all the remaining ones correspond to the input features:
Y_col = 'output'
X_cols = df.loc[:, df.columns != Y_col].columns
Scikit-learn provides a function, named train_test_split(), which automatically splits a dataset into training and test sets. Either lists or Pandas DataFrames can be passed as input to the function.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[X_cols], df[Y_col], test_size=0.2, random_state=42
)
Other input parameters include:
test_size: the proportion of the dataset to be included in the test set.
random_state: the seed number to be passed to the shuffle operation, thus making the experiment reproducible.
The original dataset contains 303 records; with test_size=0.20, the train_test_split() function assigns 242 records to the training set and 61 to the test set.
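The split proportions above can be checked with a minimal sketch. Since heart.csv may not be available, a synthetic 303-row DataFrame stands in for it here:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv: 303 rows, one feature, one target
df = pd.DataFrame({'feature': range(303), 'output': [i % 2 for i in range(303)]})

X_cols = [c for c in df.columns if c != 'output']

# test_size=0.20 reserves 20% of the rows (rounded up) for the test set;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df[X_cols], df['output'], test_size=0.20, random_state=42
)

print(len(X_train), len(X_test))  # 242 61
```

Note that scikit-learn rounds the test set size up, which is why 20% of 303 yields 61 test records rather than 60.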
Pandas provides a DataFrame method, named sample(), which can be used to split a DataFrame into train and test sets. The function receives as input the frac parameter, which corresponds to the proportion of the dataset to be included in the result. Like scikit-learn's train_test_split(), the sample() function also provides the random_state input parameter.