Tacio Nery

Posted on Aug 13, 2019 • Edited on Aug 18, 2019

A short brief about Classification

#python #machinelearning #classification #datascience

What is Classification?

This was the first question I asked when I heard the term Classification. The definition says it is, fundamentally, a model that predicts labels. That raised a new question: what are labels? In a dataset for a classification model we find features and labels, where a feature is a column used as input data and the label is the value we want to predict.
So, when the value we want to predict with a Machine Learning model is a known label (a category, such as yes or no), we have a Classification Problem.

Testing Classification Models

Let's use an NBA Log dataset. The goal is to predict whether a player will last longer than 5 years in the league. The data contains a target column, TARGET_5Yrs, which can be 0 (< 5 years) or 1 (>= 5 years). Since we know our target (label) and it takes one of two classes, we can say for sure this is a Classification problem.

This dataset can be found here.

Requirements

Here are the libraries we will use in this example.

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# define the dataset path
DATASET_PATH = os.path.join("datasets")
SEED = 7

Loading the dataset

The first thing to do is load the dataset. Let's use Pandas to do it and check what the data looks like.

# create a function to load the dataset
def load_nba_data(dataset_path=DATASET_PATH):
    csv_path = os.path.join(dataset_path, "nba_logreg.csv")
    return pd.read_csv(csv_path)

# load the dataset
nba_data = load_nba_data()
# replace NaN fields with 0
nba_data.fillna(0, inplace=True)
# show 10 first rows
nba_data.head(10)

	Name	GP	MIN	PTS	FGM	FGA	FG%	3P Made	3PA	3P%	...	FTA	FT%	OREB	DREB	REB	AST	STL	BLK	TOV	TARGET_5Yrs
0	Brandon Ingram	36	27.4	7.4	2.6	7.6	34.7	0.5	2.1	25.0	...	2.3	69.9	0.7	3.4	4.1	1.9	0.4	0.4	1.3	0.0
1	Andrew Harrison	35	26.9	7.2	2.0	6.7	29.6	0.7	2.8	23.5	...	3.4	76.5	0.5	2.0	2.4	3.7	1.1	0.5	1.6	0.0
2	JaKarr Sampson	74	15.3	5.2	2.0	4.7	42.2	0.4	1.7	24.4	...	1.3	67.0	0.5	1.7	2.2	1.0	0.5	0.3	1.0	0.0
3	Malik Sealy	58	11.6	5.7	2.3	5.5	42.6	0.1	0.5	22.6	...	1.3	68.9	1.0	0.9	1.9	0.8	0.6	0.1	1.0	1.0
4	Matt Geiger	48	11.5	4.5	1.6	3.0	52.4	0.0	0.1	0.0	...	1.9	67.4	1.0	1.5	2.5	0.3	0.3	0.4	0.8	1.0
5	Tony Bennett	75	11.4	3.7	1.5	3.5	42.3	0.3	1.1	32.5	...	0.5	73.2	0.2	0.7	0.8	1.8	0.4	0.0	0.7	0.0
6	Don MacLean	62	10.9	6.6	2.5	5.8	43.5	0.0	0.1	50.0	...	1.8	81.1	0.5	1.4	2.0	0.6	0.2	0.1	0.7	1.0
7	Tracy Murray	48	10.3	5.7	2.3	5.4	41.5	0.4	1.5	30.0	...	0.8	87.5	0.8	0.9	1.7	0.2	0.2	0.1	0.7	1.0
8	Duane Cooper	65	9.9	2.4	1.0	2.4	39.2	0.1	0.5	23.3	...	0.5	71.4	0.2	0.6	0.8	2.3	0.3	0.0	1.1	0.0
9	Dave Johnson	42	8.5	3.7	1.4	3.5	38.3	0.1	0.3	21.4	...	1.4	67.8	0.4	0.7	1.1	0.3	0.2	0.0	0.7	0.0

10 rows × 21 columns

Here we have a small sample of our dataset. Setting aside the Name and TARGET_5Yrs columns, all the other columns are the features of each player; these are what the model will use to predict whether the player will last longer than 5 years in the league. The TARGET_5Yrs column holds the answer for every combination of features.
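As a small illustration of this feature/label separation, the sketch below builds a tiny DataFrame from three of the sample rows (only the GP, PTS and TARGET_5Yrs values shown in the table) and separates the inputs from the value we want to predict. The variable names sample, features and label are made up for this example.

```python
import pandas as pd

# three rows copied from the sample above, limited to a few columns
sample = pd.DataFrame({
    "Name": ["Brandon Ingram", "Malik Sealy", "Don MacLean"],
    "GP": [36, 58, 62],
    "PTS": [7.4, 5.7, 6.6],
    "TARGET_5Yrs": [0.0, 1.0, 1.0],
})

# features: every column except the identifier and the target
features = sample.drop(columns=["Name", "TARGET_5Yrs"])
# label: the column the model should learn to predict
label = sample["TARGET_5Yrs"]

print(list(features.columns))  # ['GP', 'PTS']
print(label.tolist())          # [0.0, 1.0, 1.0]
```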

Let's check a quick description of our dataset with the info() function.

# first let's remove the unneeded Name column, because it's not relevant for this experiment
nba_data = nba_data.drop(columns='Name')
nba_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 20 columns):
GP             1340 non-null int64
MIN            1340 non-null float64
PTS            1340 non-null float64
FGM            1340 non-null float64
FGA            1340 non-null float64
FG%            1340 non-null float64
3P Made        1340 non-null float64
3PA            1340 non-null float64
3P%            1340 non-null float64
FTM            1340 non-null float64
FTA            1340 non-null float64
FT%            1340 non-null float64
OREB           1340 non-null float64
DREB           1340 non-null float64
REB            1340 non-null float64
AST            1340 non-null float64
STL            1340 non-null float64
BLK            1340 non-null float64
TOV            1340 non-null float64
TARGET_5Yrs    1340 non-null float64
dtypes: float64(19), int64(1)
memory usage: 209.5 KB

We can also check some summary statistics of our dataset with the describe() function.

nba_data.describe()

	GP	MIN	PTS	FGM	FGA	FG%	3P Made	3PA	3P%	FTM	FTA	FT%	OREB	DREB	REB	AST	STL	BLK	TOV	TARGET_5Yrs
count	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000
mean	60.414179	17.624627	6.801493	2.629104	5.885299	44.169403	0.247612	0.779179	19.149627	1.297687	1.821940	70.300299	1.009403	2.025746	3.034478	1.550522	0.618507	0.368582	1.193582	0.620149
std	17.433992	8.307964	4.357545	1.683555	3.593488	6.137679	0.383688	1.061847	16.051861	0.987246	1.322984	10.578479	0.777119	1.360008	2.057774	1.471169	0.409759	0.429049	0.722541	0.485531
min	11.000000	3.100000	0.700000	0.300000	0.800000	23.800000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.200000	0.300000	0.000000	0.000000	0.000000	0.100000	0.000000
25%	47.000000	10.875000	3.700000	1.400000	3.300000	40.200000	0.000000	0.000000	0.000000	0.600000	0.900000	64.700000	0.400000	1.000000	1.500000	0.600000	0.300000	0.100000	0.700000	0.000000
50%	63.000000	16.100000	5.550000	2.100000	4.800000	44.100000	0.100000	0.300000	22.200000	1.000000	1.500000	71.250000	0.800000	1.700000	2.500000	1.100000	0.500000	0.200000	1.000000	1.000000
75%	77.000000	22.900000	8.800000	3.400000	7.500000	47.900000	0.400000	1.200000	32.500000	1.600000	2.300000	77.600000	1.400000	2.600000	4.000000	2.000000	0.800000	0.500000	1.500000	1.000000
max	82.000000	40.900000	28.200000	10.200000	19.800000	73.700000	2.300000	6.500000	100.000000	7.700000	10.200000	100.000000	5.300000	9.600000	13.900000	10.600000	2.500000	3.900000	4.400000	1.000000

Let's take a look at how our target is distributed over the dataset.

nba_data.groupby('TARGET_5Yrs').size()

TARGET_5Yrs
0.0    509
1.0    831
dtype: int64

In summary, we have 1340 players in our dataset, of which 509 did not last longer than 5 years in the league and the other 831 did.
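One thing these class counts let us work out is a baseline: the accuracy of a trivial model that always predicts the majority class. The sketch below uses only the two counts printed above; baseline_accuracy is a name introduced here for illustration. A useful classifier would normally be expected to beat this number.

```python
# class counts taken from the groupby output above
counts = {0.0: 509, 1.0: 831}
total = sum(counts.values())

# share of each class in the dataset
proportions = {label: n / total for label, n in counts.items()}

# accuracy of a trivial model that always predicts the most frequent class
baseline_accuracy = max(counts.values()) / total

print(total)                        # 1340
print(round(baseline_accuracy, 4))  # 0.6201
```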

Machine Learning Models Evaluation

As we saw at the beginning of this post, this is a Classification Problem. We will create some models with different ML algorithms and check their accuracy.

Splitting Data

Let's split our dataset into two new datasets. We will use 80% of the dataset to train our classification models and 20% of it to perform the validation.

data = nba_data.values
# data = np.array(data)

# now let's separate the features columns from the target column
X = data[:, 0:19]
Y = data[:, 19]

# as said before, we will use 20% of the dataset for validation
validation_size = 0.20

# split the data into training and testing
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=SEED)
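To see what an 80/20 split produces, here is a minimal sketch on placeholder data with the same shape as ours (1340 rows, 19 features, 831 positive labels). The arrays X_demo and y_demo are random stand-ins, not the NBA data. The sketch also passes the optional stratify argument of train_test_split, which the code above does not use; it keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data with the same shape and class counts as the NBA dataset
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(1340, 19))
y_demo = np.array([0.0] * 509 + [1.0] * 831)

# 80% for training, 20% for validation, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=7, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (1072, 19) (268, 19)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```

Whether stratifying helps on the real data is something to check rather than assume; it mainly matters when the classes are very unbalanced or the dataset is small.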

Now that we have our training and testing sets, we are going to create a list with the models we want to evaluate. We will use each model with mostly default settings.

models = []
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

To evaluate the models we will use K-Fold cross-validation and measure the accuracy of each model. This technique splits the training set into K distinct subsets (folds), then trains and evaluates the model K times, holding out a different fold for every evaluation and training on the remaining K-1. The result is an array with the K evaluation scores. For this example we will run the cross-validation with StratifiedKFold from Scikit-Learn, which keeps the ratio of the two classes roughly the same in every fold. We will use the mean accuracy of each model to determine which one has the best results.
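To make the idea of folds concrete before applying it to the NBA data, here is a small sketch on toy labels (6 negatives and 9 positives, invented for this example). It only prints how many samples land in the training and held-out parts of each fold and how many held-out samples are positive; X_toy, y_toy and fold_summaries are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels: 6 negatives and 9 positives (a 40/60 class ratio)
y_toy = np.array([0] * 6 + [1] * 9)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # one dummy feature

skf = StratifiedKFold(n_splits=3)
fold_summaries = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_toy, y_toy), start=1):
    # count the positives that ended up in the held-out fold
    positives = int(y_toy[test_idx].sum())
    fold_summaries.append((len(train_idx), len(test_idx), positives))
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} positives in test={positives}")
```

Every held-out fold gets 5 samples, 3 of them positive, so each fold mirrors the 40/60 ratio of the full toy set.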

models_results = []
# stratified 10-fold split; without shuffling the folds are deterministic, so no random_state is needed
skfolds = model_selection.StratifiedKFold(n_splits=10)
for name, model in models:
    results = []
    for train_index, test_index in skfolds.split(X_train, Y_train):
        X_train_folds = X_train[train_index]
        Y_train_folds = Y_train[train_index]
        X_test_folds = X_train[test_index]
        Y_test_folds = Y_train[test_index]

        # train on K-1 folds and measure accuracy on the held-out fold
        model.fit(X_train_folds, Y_train_folds)
        pred = model.predict(X_test_folds)
        correct = sum(pred == Y_test_folds)
        results.append(correct / len(pred))
    models_results.append((name, results))


names = []
scores = []
# the snippet below calculates the mean and standard deviation of the accuracies
for name, results in models_results:
    mean = np.array(results).mean()
    std = np.array(results).std()
    print("Model: %s, Accuracy Mean: %f (%f)" % (name, mean, std))
    names.append(name)
    scores.append(results)

Model: LR, Accuracy Mean: 0.705244 (0.026186)
Model: LDA, Accuracy Mean: 0.706205 (0.027503)
Model: KNN, Accuracy Mean: 0.674429 (0.026029)
Model: CART, Accuracy Mean: 0.634372 (0.047236)
Model: NB, Accuracy Mean: 0.632433 (0.040794)
Model: SVM, Accuracy Mean: 0.619384 (0.021099)

The results above show us that Linear Discriminant Analysis has the best mean accuracy among the models we tested, although Logistic Regression is a very close second. The boxplot below shows how the accuracy scores are spread across the folds.

fig = plt.figure()
fig.suptitle('Models Comparison')
ax = fig.add_subplot(111)
plt.boxplot(scores)
ax.set_xticklabels(names)
plt.show()

Making Predictions

Now we will check the accuracy of the LDA model by making predictions on the validation set we prepared earlier. To do so, we create an instance of the model, fit it on the full training set, and use its predict method.

model = LinearDiscriminantAnalysis()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
print("Accuracy: {}".format(accuracy_score(Y_test, predictions)))

Accuracy: 0.6902985074626866

We can also check the Confusion Matrix for this model.

print(confusion_matrix(Y_test, predictions))

[[ 52  45]
 [ 38 133]]

Each row in a confusion matrix represents an actual class and each column represents a predicted class. The first row contains the true negatives and the false positives: of the 97 players who did not last 5 years, 52 were correctly classified and 45 were wrongly classified as lasting. The second row contains the false negatives and the true positives: of the 171 players who did last, 38 were wrongly classified and 133 were classified correctly.
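The four cells can be pulled out by name, which makes the reading above easy to check. This is a minimal sketch that re-enters the matrix printed above by hand instead of recomputing it; the names tn, fp, fn and tp are just conventional abbreviations.

```python
import numpy as np

# confusion matrix printed above: rows are actual classes, columns are predicted classes
cm = np.array([[52, 45],
               [38, 133]])

# for a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

# accuracy is the share of samples on the diagonal
accuracy = (tn + tp) / cm.sum()

print(tn, fp, fn, tp)      # 52 45 38 133
print(round(accuracy, 4))  # 0.6903
```

The accuracy recovered this way agrees with the value reported by accuracy_score.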

The confusion matrix provides a lot of information, but if you want more concise metrics you can use the classification_report function of Scikit-Learn. It provides the precision, recall and f1-score metrics.

print(classification_report(Y_test, predictions))

              precision    recall  f1-score   support

         0.0       0.58      0.54      0.56        97
         1.0       0.75      0.78      0.76       171

    accuracy                           0.69       268
   macro avg       0.66      0.66      0.66       268
weighted avg       0.69      0.69      0.69       268

The accuracy of the positive predictions is called precision. It's defined by the formula TP / (TP + FP), where TP is the number of True Positives and FP is the number of False Positives. This metric is typically used along with recall, which is the true positive rate: the ratio of positive instances that are correctly detected by the model. Its equation is TP / (TP + FN), where FN is the number of False Negatives.
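As a quick check of these formulas, the sketch below plugs in the counts for the positive class (1.0) from the confusion matrix shown earlier. It also computes the f1-score, assuming the usual definition as the harmonic mean of precision and recall.

```python
# counts for the positive class (1.0), read from the confusion matrix
tp, fp, fn = 133, 45, 38

precision = tp / (tp + fp)   # how many predicted positives were right
recall = tp / (tp + fn)      # how many actual positives were found
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.75 0.78 0.76
```

These match the 1.0 row of the classification report.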

Conclusion

This is a short brief about Classification with Python and Scikit-Learn. There is a lot more to cover: we could try to improve our models' results by normalizing the data, for example, and there are other metrics to explore. But the first steps into the Machine Learning world can be taken with this tutorial. Hope you enjoy it!!
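As a starting point for the normalization idea, here is a hedged sketch of how scaling can be combined with a model in a Scikit-Learn pipeline. It runs on synthetic data from make_classification, not on the NBA dataset, so the scores it prints say nothing about the results above; X_syn, y_syn and scaled_knn are names made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: 500 samples, 19 features, two classes
X_syn, y_syn = make_classification(n_samples=500, n_features=19, random_state=7)

# the scaler is fitted inside each training fold, so nothing leaks from the held-out fold
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(scaled_knn, X_syn, y_syn, cv=10, scoring="accuracy")

print(scores.mean(), scores.std())
```

Distance-based models such as KNN and SVM are usually the most sensitive to feature scale, so they are natural candidates to retest with a pipeline like this one.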

You can access the notebook for this example here.

References

Your First Machine Learning Project in Python Step-By-Step - https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: Concepts, tools, and techniques to build intelligent systems.
