Model Evaluation in Scikit-learn


Scikit-learn is one of the most popular Python libraries for Machine Learning. It provides models, datasets, and other useful functions. In this article, I will describe the most popular techniques provided by scikit-learn for Model Evaluation.

Model Evaluation lets us measure the performance of a model and compare different models, so that we can choose the best one to send into production. The right technique depends on the specific task we want to solve. In this article, we focus on the following tasks:
Regression
Classification
For each task, I will describe how to calculate the most popular metrics, through a practical example.

1 Loading the Dataset

As an example dataset, I use the Wine Quality Data Set, provided by the UCI Machine Learning Repository. To use this dataset, you should cite the source properly, as follows:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: the University of California, School of Information and Computer Science.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009.

I download the data folder, which contains two datasets: one for red wine and one for white wine. I build a single dataset by concatenating the two.

I load both datasets as pandas DataFrames and then concatenate them:

import pandas as pd

targets = ['red', 'white']
df_list = []
for target in targets:
    # Each file is semicolon-separated
    df_temp = pd.read_csv(f"../Datasets/winequality-{target}.csv", sep=';')
    # Record which dataset each row came from
    df_temp['target'] = target
    df_list.append(df_temp)
    print(df_temp.shape)

# Stack the red and white rows into a single DataFrame
df = pd.concat(df_list)

I have added a new column, target, which contains the original dataset name (red or white).
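To see what this concatenation does without downloading the files, here is a minimal sketch on two tiny, made-up DataFrames. The names toy_frames, labelled, and toy_df, as well as the values, are hypothetical and not taken from the Wine Quality data. Passing ignore_index=True is an optional choice I make here so that the resulting index has no duplicates.

```python
import pandas as pd

# Hypothetical toy frames standing in for the red and white wine files
toy_frames = {
    "red": pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]}),
    "white": pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]}),
}

labelled = []
for name, frame in toy_frames.items():
    frame = frame.copy()
    frame["target"] = name  # remember which dataset each row came from
    labelled.append(frame)

# Stack the frames vertically; ignore_index=True rebuilds a clean 0..n-1 index
toy_df = pd.concat(labelled, ignore_index=True)

print(toy_df.shape)
print(toy_df["target"].value_counts().to_dict())
```

The row count of the result is the sum of the row counts of the inputs, and the column count grows by one because of the added target column.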

The combined dataset contains 6497 rows and 13 columns.

Now, I define a function that encodes all the categorical columns as integers, modifying the DataFrame in place:

from sklearn.preprocessing import LabelEncoder

def transform_categorical(data):
    # Select the columns stored as strings (object dtype)
    categories = (data.dtypes == "object")
    cat_cols = list(categories[categories].index)
    label_encoder = LabelEncoder()
    # Replace each categorical column with integer codes, in place
    for col in cat_cols:
        data[col] = label_encoder.fit_transform(data[col])
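As a quick illustration of what the encoder does to a column such as target, the sketch below applies LabelEncoder to a small, hypothetical DataFrame (the names toy, encoder, and mapping are mine, not part of the original code). LabelEncoder sorts the distinct labels and assigns them the codes 0, 1, 2, and so on, so I would expect red to come before white.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, mirroring the 'target' column built earlier
toy = pd.DataFrame({"target": ["red", "white", "white", "red"]})

encoder = LabelEncoder()
toy["target_encoded"] = encoder.fit_transform(toy["target"])

# classes_ holds the original labels, in the order of their integer codes
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(toy["target_encoded"].tolist())
```

Keeping the fitted encoder around is useful if you later need to map the codes back to the labels with encoder.inverse_transform.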

I also define another function, which scales the numerical columns to the [0, 1] range:

from sklearn.preprocessing import MinMaxScaler

def scale_numerical(data):
    scaler = MinMaxScaler()
    # Rescale every column to [0, 1], in place; all columns must already be numeric
    data[data.columns] = scaler.fit_transform(data[data.columns])
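To make the scaling concrete, here is a minimal sketch on a hypothetical two-column DataFrame (the names toy, scaled, and manual, and the values, are made up for illustration). With its default settings, MinMaxScaler maps each column to the [0, 1] range using (x - min) / (max - min), and the sketch computes the same thing by hand for comparison.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two numeric columns with different ranges
toy = pd.DataFrame({"alcohol": [8.0, 10.0, 12.0], "quality": [3.0, 5.0, 7.0]})

scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Same rescaling written by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
print(manual)
```

Both DataFrames should agree: the smallest value in each column becomes 0, the largest becomes 1, and everything else falls in between.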