<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abzal Seitkaziyev</title>
    <description>The latest articles on DEV Community by Abzal Seitkaziyev (@xsabzal).</description>
    <link>https://dev.to/xsabzal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F344308%2Fda2e55ff-e75e-4b6c-a585-501ef13afea2.png</url>
      <title>DEV Community: Abzal Seitkaziyev</title>
      <link>https://dev.to/xsabzal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xsabzal"/>
    <language>en</language>
    <item>
      <title>Classifiers' Evaluation Metrics</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Sat, 20 Mar 2021 03:36:36 +0000</pubDate>
      <link>https://dev.to/xsabzal/classifiers-evaluation-metrics-16oo</link>
      <guid>https://dev.to/xsabzal/classifiers-evaluation-metrics-16oo</guid>
      <description>&lt;p&gt;&lt;strong&gt;Confusion matrix&lt;/strong&gt;&lt;br&gt;
A confusion matrix is a table that holds the counts of True and False Positives ('TP' and 'FP'), as well as True and False Negatives ('TN' and 'FN').&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Seyd26V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jru9bzpbuwsyu5lo7yt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Seyd26V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jru9bzpbuwsyu5lo7yt1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/@m.virk1/classification-metrics-65b79bfdd776"&gt;Image&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is important for the project&lt;/strong&gt;&lt;br&gt;
For example, suppose we have an image classifier that identifies whether a rock is a precious stone (e.g., a diamond), and we use it for automated mining. &lt;br&gt;
In this context, we may want to capture as many precious stones as possible ('TP'), even if some non-precious stones are identified as diamonds ('FP'), because those can be sorted out by an expert at a later stage.&lt;br&gt;
Now let's imagine that we are buying these stones based on our image classifier's output. We do not want to buy non-precious stones ('FP'), so our model should be very careful about False Positive predictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Evaluation Metrics&lt;/strong&gt;&lt;br&gt;
To evaluate and quantify the performance of a classification model, we can use common evaluation metrics: accuracy, balanced accuracy, precision, recall (a.k.a. sensitivity or True Positive Rate), specificity (= 1 - False Positive Rate), ROC (TPR vs. FPR), and the F1 score. &lt;br&gt;
As we can see, there are many evaluation metrics to choose from. However, all of them can be calculated from the confusion matrix values (TP, FP, TN, and FN). So the main idea is to know which metrics are most important for the project, and how balanced the target we are trying to predict (classify) is.&lt;br&gt;
The most general approach is to choose a few metrics to optimize (e.g., accuracy, recall, precision, F1 score, ROC-AUC). &lt;/p&gt;
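&lt;p&gt;As a quick sketch (with purely illustrative counts, not taken from any real model), all of these metrics can be computed directly from the four confusion matrix values:&lt;/p&gt;

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                   # fraction of predicted positives that are correct
recall = tp / (tp + fn)                      # fraction of actual positives that were found (TPR)
specificity = tn / (tn + fp)                 # fraction of actual negatives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, specificity, f1)
```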

</description>
      <category>datascience</category>
    </item>
    <item>
      <title>Coefficient of Determination R squared</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 15 Mar 2021 03:52:24 +0000</pubDate>
      <link>https://dev.to/xsabzal/coefficient-of-determination-r-squared-1cf3</link>
      <guid>https://dev.to/xsabzal/coefficient-of-determination-r-squared-1cf3</guid>
<description>&lt;p&gt;To measure the 'goodness of fit' of the line in a linear regression analysis, we can calculate the Coefficient of Determination (R squared). R squared measures how well our model explains the variance in the target: here we measure the percentage of variance explained by the linear model versus a baseline model (in this case, simply the mean value of the target). &lt;/p&gt;

&lt;p&gt;We can visualize this with a simple example. Suppose we have some target, e.g. the number of sales of an item over 5 days, and we fit a line. Now we want to check how good our fit is, i.e. how well we perform compared to a naive prediction: the mean value of the sales.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# given target
&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# base line
&lt;/span&gt;&lt;span class="n"&gt;y_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;

&lt;span class="c1"&gt;# fit a line
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intercept_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;model_intercept = 2&lt;br&gt;
model_coef = 3.4&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# regression line
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;3.4&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# calculate R squared using formula
&lt;/span&gt;&lt;span class="n"&gt;var_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y_mean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;
&lt;span class="n"&gt;var_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;
&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var_mean&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;var_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.9730639730639731&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# calculate R squared using scikit learn
&lt;/span&gt;&lt;span class="n"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.9730639730639731&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# calculate using pearson correlation
&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corrcoef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.9730639730639731&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yVETwp3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diuetp61qm5x1b2yqigw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yVETwp3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diuetp61qm5x1b2yqigw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, the regression line explains about 97% of the variance in the target, compared to the baseline of simply predicting the mean value. &lt;/p&gt;

</description>
      <category>datascience</category>
    </item>
    <item>
      <title>K-Means Clustering</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 08 Mar 2021 04:44:42 +0000</pubDate>
      <link>https://dev.to/xsabzal/k-means-clustering-394b</link>
      <guid>https://dev.to/xsabzal/k-means-clustering-394b</guid>
<description>&lt;p&gt;K-Means clustering is an unsupervised algorithm that is very intuitive and can be visualized geometrically. &lt;br&gt;
Basically, we try to split the data into k groups, or clusters, where each cluster has a center defined as the geometric centroid of the points in that cluster.&lt;/p&gt;

&lt;p&gt;Steps of the K-means clustering algorithm:&lt;/p&gt;

&lt;p&gt;1) Set k initial centers randomly&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gVe_r2T5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hz39xv5gl4vj3se7304s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gVe_r2T5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hz39xv5gl4vj3se7304s.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
2) Calculate 'distances' (e.g., Euclidean distances in 2-D space) from each data point to these centers and group the data points by their closest center&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TR-VDI1y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wn6ix8zpqta2bbufqtmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TR-VDI1y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wn6ix8zpqta2bbufqtmm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
3) Recalculate the positions of the k centers (as the mean of the data in each cluster) &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bdUmR-oK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffi1nx7fboqr0zrot784.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bdUmR-oK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffi1nx7fboqr0zrot784.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
4) Repeat steps 2 and 3 until the assignments no longer change.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g_OPIeWt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4n4iuxu04q06zbvcixz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g_OPIeWt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4n4iuxu04q06zbvcixz5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RfEVYGL---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g5qozv6majc0v86u8jyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RfEVYGL---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g5qozv6majc0v86u8jyw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the &lt;a href="http://tech.nitoyon.com/en/blog/2013/11/07/k-means/"&gt;link&lt;/a&gt; I used to play and visualize clustering.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Introduction to Support Vector Machines</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 01 Mar 2021 04:40:19 +0000</pubDate>
      <link>https://dev.to/xsabzal/introduction-to-support-vector-machines-1aba</link>
      <guid>https://dev.to/xsabzal/introduction-to-support-vector-machines-1aba</guid>
<description>&lt;p&gt;Support Vector Machines (SVMs) are supervised models, and they can be very effective for classification, numerical prediction, and outlier detection problems.&lt;/p&gt;

&lt;p&gt;The main idea is to separate different classes effectively: getting accurate results (e.g., a higher accuracy score) while balancing overfitting and underfitting at the same time (SVM introduces a slack term to account for this). &lt;/p&gt;

&lt;p&gt;SVM divides classes using a line, plane, or hyperplane. In the simple example with a line, we can divide using a maximum margin or a soft margin. The soft margin is more flexible: it allows some misclassification to account for outliers, which helps the model neither overfit nor underfit.&lt;/p&gt;

&lt;p&gt;Another thing that sets SVM apart is the use of so-called kernel functions. In &lt;a href="https://scikit-learn.org/stable/modules/svm.html#kernel-functions"&gt;scikit-learn&lt;/a&gt; we can use Linear, Polynomial, Radial Basis Function (RBF), or Sigmoid kernels. These functions make it possible to separate classes with hyperplanes in implicit higher-dimensional spaces. &lt;/p&gt;
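&lt;p&gt;For example, on data that is not linearly separable, an RBF kernel can do what a linear kernel cannot (a minimal scikit-learn sketch on synthetic data):&lt;/p&gt;

```python
from sklearn import svm
from sklearn.datasets import make_circles

# Concentric circles: two classes that no straight line can separate (synthetic data)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = svm.SVC(kernel="linear").fit(X, y)
rbf_clf = svm.SVC(kernel="rbf").fit(X, y)  # the kernel trick handles the circular boundary

print("linear:", linear_clf.score(X, y))
print("rbf:", rbf_clf.score(X, y))
```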

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P1vdn_MX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dggb3fyd2xchusd8yfui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P1vdn_MX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dggb3fyd2xchusd8yfui.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.jeremyjordan.me/support-vector-machines/"&gt;image source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though the picture above shows a transformation of the data from 2d to 3d, SVM does not actually transform the data into the higher dimension; rather, it uses dot products to measure the relationship of each point with the others as if they lived in that space (the Kernel Trick).&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Cross-Validation for Time Series</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 15 Feb 2021 04:26:25 +0000</pubDate>
      <link>https://dev.to/xsabzal/cross-validation-for-time-series-19ho</link>
      <guid>https://dev.to/xsabzal/cross-validation-for-time-series-19ho</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@markuswinkler?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Markus Winkler&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/analytics?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;To estimate the performance of a machine learning model, we may consider using cross-validation (CV), which uses multiple (e.g., n) train-test splits and trains/tests n models respectively. &lt;/p&gt;

&lt;p&gt;There is a k-fold CV in scikit-learn, which splits the data into k train-test groups and assumes that observations are independent. However, in time series there is a dependency between observations, which can lead to target leakage in the estimate when k-fold CV is used.&lt;br&gt;
For time series data I explored the following cross-validation techniques:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Scikit-learn's Time Series Split.&lt;/strong&gt;&lt;br&gt;
Here we use an expanding window for the train set and a fixed-size window for the test data. &lt;/p&gt;

&lt;p&gt;Example of Indices Split:&lt;br&gt;
TRAIN: [0 1 2 3 4 5] TEST: [6 7]&lt;br&gt;
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]&lt;br&gt;
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]&lt;br&gt;
...&lt;/p&gt;
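&lt;p&gt;The split above can be reproduced with scikit-learn's TimeSeriesSplit on 12 dummy observations (the test_size parameter is available in newer scikit-learn versions):&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12)  # 12 dummy time-ordered observations
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
for train_idx, test_idx in tscv.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```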

&lt;p&gt;&lt;strong&gt;2) Blocking Time Series Split.&lt;/strong&gt;&lt;br&gt;
Here we train and test on disjoint, non-overlapping blocks of data.&lt;br&gt;
Example of the split:&lt;/p&gt;

&lt;p&gt;TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]&lt;br&gt;
TRAIN: [12 13 14 15 16 17 18 19 20 21] TEST: [22 23]&lt;br&gt;
TRAIN: [24 25 26 27 28 29 30 31 32 33] TEST: [34 35]&lt;br&gt;
...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3) Walk-Forward Validation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;a) We use a fixed-size (sliding) window for the train data and one observation ahead for the test.&lt;/p&gt;

&lt;p&gt;Example of Indices Split:&lt;br&gt;
TRAIN: [0 1 2 3 4 5] TEST: [6]&lt;br&gt;
TRAIN: [1 2 3 4 5 6] TEST: [7]&lt;br&gt;
TRAIN: [2 3 4 5 6 7] TEST: [8]&lt;br&gt;
....&lt;/p&gt;

&lt;p&gt;b) We use an expanding window for the train data and one observation ahead for the test, which is a variation of scikit-learn's Time Series Split.&lt;/p&gt;

&lt;p&gt;Example of Indices Split:&lt;br&gt;
TRAIN: [0 1 2 3 4 5] TEST: [6]&lt;br&gt;
TRAIN: [0 1 2 3 4 5 6] TEST: [7]&lt;br&gt;
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8]&lt;br&gt;
....&lt;/p&gt;
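&lt;p&gt;Both walk-forward variants can be sketched with a small generator (walk_forward is a hypothetical helper written for this post, not a scikit-learn function):&lt;/p&gt;

```python
def walk_forward(n_obs, window, expanding=False):
    """Yield (train_indices, test_indices), moving one observation ahead per step."""
    for end in range(window, n_obs):
        start = 0 if expanding else end - window  # expanding vs. sliding window
        yield list(range(start, end)), [end]

# variant a: fixed-size (sliding) window
for train, test in walk_forward(9, 6):
    print("TRAIN:", train, "TEST:", test)

# variant b: expanding window
for train, test in walk_forward(9, 6, expanding=True):
    print("TRAIN:", train, "TEST:", test)
```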

</description>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gas Field Production Project. Part 1</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 08 Feb 2021 04:36:35 +0000</pubDate>
      <link>https://dev.to/xsabzal/gas-field-production-project-part-1-5h33</link>
      <guid>https://dev.to/xsabzal/gas-field-production-project-part-1-5h33</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@kobuagency?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;KOBU Agency&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/data?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Here I will briefly describe the data collection and exploratory data analysis (EDA) process for the gas field production project. After researching a few oil and gas fields, I selected the Lakeshore Gas Field (NY) for my project, as it is a gas field with around 4000 production wells. Using dynamic web scraping, I collected field data and yearly water, gas, and oil production data for each well. As we might expect, the data does not include many physical or extraction properties, e.g. pressure, well fracking, or other stimulation and maintenance activities. There are some options to generate pseudo-pressure values using petroleum engineering models (e.g., by making assumptions about the initial reservoir pressure and using production volume and mass data). At this stage, the data we have is enough for the EDA. &lt;/p&gt;

&lt;p&gt;Here is some elevation data of the wells.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kI0Juhca--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/22m95p9bjumvr0rswkee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kI0Juhca--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/22m95p9bjumvr0rswkee.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I plotted below the data related to the gas and water produced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v36t7l7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qaqywbalqg14d8y3etq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v36t7l7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qaqywbalqg14d8y3etq1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c78MuspO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mwf5h5lx7d2j1tbaiqbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c78MuspO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mwf5h5lx7d2j1tbaiqbd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uzTcg_JP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z4ze5hltfhzuw3gvbzb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uzTcg_JP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z4ze5hltfhzuw3gvbzb4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we could expect, gas production drops over time (as reservoir pressure drops), even with an increased number of active wells and stimulation activities.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gradient Boosting Regressor Example</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 01 Feb 2021 04:53:23 +0000</pubDate>
      <link>https://dev.to/xsabzal/gradient-boosting-regressor-example-2ghi</link>
      <guid>https://dev.to/xsabzal/gradient-boosting-regressor-example-2ghi</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@lazycreekimages?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Michael Dziedzic&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/algorithms?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In the previous &lt;a href="https://dev.to/xsabzal/gradient-boosting-classifier-1de5"&gt;post&lt;/a&gt;, I briefly explained Gradient Boosting using a classification problem. Here I will give a step-by-step explanation of how the Gradient Boosting Regressor works using sklearn and Python, to complement the theory given &lt;a href="https://dev.to/xsabzal/gradient-boost-for-regression-1e42"&gt;here&lt;/a&gt;. I did this exercise mainly to build an intuition for the processes inside gradient boosted trees, and by doing so to avoid using them as some sort of 'black box' algorithm.&lt;/p&gt;

&lt;p&gt;I used a dataset with car prices (&lt;a href="https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes"&gt;source&lt;/a&gt;) for this purpose. To make the processes inside the gradient boosted trees easy to follow, I used a small portion of the data with a minimal number of trees (m=2) and a shallow tree depth (max_depth=2).&lt;/p&gt;

&lt;p&gt;1) First, we initialize the model by getting the initial prediction Pred_0, calculated as the mean of the prices in the train dataset. Then we calculate the initial residuals: Res_0 = train['price'] - Pred_0. See below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--68b4nfR2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xm5oq7acvps6wa9up2gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--68b4nfR2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xm5oq7acvps6wa9up2gt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Here we fit all data points (each row's features and Res_0) into the first tree. This tree is built using 'MSE' as the splitting criterion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kEfxkK_D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/szzytgmzy28skrlzd6xa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kEfxkK_D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/szzytgmzy28skrlzd6xa.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each value in a leaf is calculated as the mean of the residuals in that leaf. Then the prediction is updated:&lt;br&gt;
Pred_1 = Pred_0 + learning_rate*output_value_1&lt;/p&gt;

&lt;p&gt;Then we calculate the residuals:&lt;br&gt;
Res_1 = train['price'] - Pred_1&lt;/p&gt;

&lt;p&gt;Node #2, 3, 5, and 6 Predictions and Residuals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jJ3JwOnx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1k76kvd6hbpyv8y89o5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jJ3JwOnx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1k76kvd6hbpyv8y89o5f.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Here we fit all data points (each row's features and Res_1) into the second tree. This tree is also built using 'MSE' as the criterion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m6pec_0H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gjt190lqqim4r26fxnkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m6pec_0H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gjt190lqqim4r26fxnkk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Node #5 predictions are shown below.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YTbBvO1f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g8msl0ar3o2bajk6eo73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YTbBvO1f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g8msl0ar3o2bajk6eo73.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) We continue this iterative training process. Here I used only two trees for simplicity.&lt;/p&gt;
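&lt;p&gt;The steps above can be sketched in a few lines of Python. The toy features, the 'price' target, and the learning rate below are illustrative assumptions, not the article's exact dataset:&lt;/p&gt;

```python
# Sketch of the walk-through above: initialize with the mean price, then
# fit each tree to the residuals.  The toy features and 'price' target
# are illustrative, not the article's dataset.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                   # toy features
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 100)   # toy train['price']

learning_rate = 0.1

# 1) Pred_0 = mean price, Res_0 = price - Pred_0
pred = np.full_like(y, y.mean())
res = y - pred

# 2)-3) fit each tree to the current residuals (the default criterion is
# squared error, i.e. 'MSE'), then update predictions and residuals
for _ in range(2):                                      # two trees, as above
    tree = DecisionTreeRegressor(max_depth=2).fit(X, res)
    pred = pred + learning_rate * tree.predict(X)       # Pred_m
    res = y - pred                                      # Res_m
```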

&lt;p&gt;You can find the detailed code and step-by-step walkthrough in chapter 5 &lt;a href="https://github.com/xs-abzal/Blogs_stat_examples/blob/master/Decision%20Tree%20Splitting%20Criterions.ipynb"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Gradient Boosting Classifier</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 25 Jan 2021 04:11:36 +0000</pubDate>
      <link>https://dev.to/xsabzal/gradient-boosting-classifier-1de5</link>
      <guid>https://dev.to/xsabzal/gradient-boosting-classifier-1de5</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@rodlong?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Rod Long&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/ai?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In this post, I would like to show briefly how theoretical components given in the &lt;a href="https://dev.to/xsabzal/gradient-boost-for-classification-2f15"&gt;gradient boost for classification&lt;/a&gt; are implemented in sklearn.&lt;/p&gt;

&lt;p&gt;Gradient Boosting Classifier uses &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.decision_function"&gt;regressor trees&lt;/a&gt;: even though the task is classification, each tree is a regression tree that uses mean squared error as its splitting criterion and fits the gradients of a differentiable loss function (Log-Loss by default). Regression trees are used because the residuals being fitted are continuous values, not class labels.&lt;/p&gt;

&lt;p&gt;Let us see the first and last tree in the model (&lt;a href="https://github.com/xs-abzal/Blogs_stat_examples/blob/master/Decision%20Tree%20Splitting%20Criterions.ipynb"&gt;repo reference&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A2SJRDMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/i50konalzuoljmodzd9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A2SJRDMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/i50konalzuoljmodzd9d.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7KT5sVap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9jbz5odkcmcwuyg9eadu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7KT5sVap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9jbz5odkcmcwuyg9eadu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, by iteratively building and fitting trees to the negative gradients of the Log-Loss function (-G = Residuals = True_Probability - Predicted_Probability), each leaf gives an output value.&lt;br&gt;
Each output value is calculated as&lt;br&gt;
predicted_leaf_output = (sum of residuals in the leaf) / [sum of Predicted_Probability*(1-Predicted_Probability)].&lt;/p&gt;

&lt;p&gt;Then, using the log of odds, we can convert it to the new probability:&lt;br&gt;
log(new_odds) = log(previous_odds) + alpha * output_value&lt;br&gt;
New_probability = e^log(new_odds)/[1+e^log(new_odds)].&lt;/p&gt;

&lt;p&gt;Then we just loop, iteratively improving our prediction until we reach the maximum number of estimators or satisfy the other stopping hyperparameters of the classifier.&lt;/p&gt;

&lt;p&gt;When we need to predict for new data (known features with an unknown label), the data goes through the trained model (n fitted trees); the probability is calculated using the formula above, and Class 0 or 1 is assigned accordingly for binary classification.&lt;/p&gt;
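&lt;p&gt;A minimal sketch of this workflow in sklearn; the toy dataset and hyperparameter values below are illustrative assumptions:&lt;/p&gt;

```python
# Sketch of the sklearn workflow described in the post; the toy dataset
# and hyperparameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Each boosting stage fits a regression tree to the negative gradient
# (the residuals) of the log-loss.
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                   max_depth=3, random_state=42)
model.fit(X, y)

# The individual weak learners are regression trees, as noted above.
first_tree = model.estimators_[0, 0]
last_tree = model.estimators_[-1, 0]
print(type(first_tree).__name__)   # DecisionTreeRegressor

# predict_proba converts the accumulated log-odds into probabilities.
proba = model.predict_proba(X[:3])
```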

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>XG Boost for Classification</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 18 Jan 2021 03:23:16 +0000</pubDate>
      <link>https://dev.to/xsabzal/xg-boost-for-classification-2og5</link>
      <guid>https://dev.to/xsabzal/xg-boost-for-classification-2og5</guid>
      <description>&lt;p&gt;The logic of the XGBoost Classification algorithm is similar to &lt;a href="https://dev.to/xsabzal/xg-boost-for-regression-5c90"&gt;XGBoost Regression&lt;/a&gt;, with a few minor differences, like using the Log-Likelihood Loss function instead of Least Squares, and using Probability and Log of Odds in the calculations. &lt;/p&gt;

&lt;p&gt;1) Define initial values and hyperparameters.&lt;/p&gt;

&lt;p&gt;1a) Define a differentiable Loss Function, e.g. 'Negative Log-Likelihood': &lt;br&gt;
L(yi,pi) = -[yi*ln(pi) + (1-yi)*ln(1-pi)], where &lt;br&gt;
yi - true labels (1 or 0), pi - predicted probabilities.&lt;br&gt;
Here we will convert pi to odds and use the log of odds when optimizing the objective function.&lt;/p&gt;

&lt;p&gt;1b) Assign a value to the initial predicted probabilities (p), by default, it is the same number for all observations, e.g. 0.5.&lt;/p&gt;

&lt;p&gt;1c) Assign values to parameters:&lt;br&gt;
learning rate (eta), max_depth, max_leaves, number of boosted rounds etc.&lt;br&gt;
and regularization hyperparameters: lambda, gamma. &lt;br&gt;
Default values in the XG Boost &lt;a href="https://xgboost.readthedocs.io/en/latest/parameter.html"&gt;documentation&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;2) Build 1 to N number of trees iteratively.&lt;br&gt;
This step is the same as in the &lt;a href="https://dev.to/xsabzal/xg-boost-for-regression-5c90"&gt;XGBoost Regression&lt;/a&gt;, where we fit each tree to the residuals.&lt;/p&gt;

&lt;p&gt;The differences here: &lt;br&gt;
a) in the formula of the output calculation, where &lt;br&gt;
H = Previous_Probability*[1-Previous_Probability].&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XltcV9eE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/49asc6jv2a994jc7qqmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XltcV9eE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/49asc6jv2a994jc7qqmx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b) we compute new prediction values as &lt;br&gt;
log(new_odds) = log(previous_odds) + eta * output_value&lt;br&gt;
new_probability = e^log(new_odds)/[1+e^log(new_odds)]&lt;/p&gt;
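&lt;p&gt;The output-value and probability-update formulas above can be checked numerically; the labels, prior probabilities, eta, and lambda values below are illustrative assumptions:&lt;/p&gt;

```python
# Numeric check of the leaf output and probability update described above.
# The labels, prior probabilities, eta, and lambda are illustrative.
import math

y_true = [1, 0, 1]                 # labels of the rows in one leaf
p_prev = [0.5, 0.5, 0.5]           # previous predicted probabilities
eta, lam = 0.3, 1.0                # learning rate and L2 regularization

residuals = [y - p for y, p in zip(y_true, p_prev)]   # negative gradients
H = sum(p * (1 - p) for p in p_prev)                  # sum of p*(1-p)

# a) leaf output = sum(residuals) / (H + lambda)
output_value = sum(residuals) / (H + lam)

# b) update the log of odds and convert back to a probability
prev_log_odds = math.log(0.5 / (1 - 0.5))             # 0.0 when p = 0.5
new_log_odds = prev_log_odds + eta * output_value
new_probability = math.exp(new_log_odds) / (1 + math.exp(new_log_odds))
print(round(new_probability, 4))
```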

&lt;p&gt;3) Last step: get the final predictions.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>XG Boost for Regression</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 11 Jan 2021 04:40:59 +0000</pubDate>
      <link>https://dev.to/xsabzal/xg-boost-for-regression-5c90</link>
      <guid>https://dev.to/xsabzal/xg-boost-for-regression-5c90</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@markusspiske?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Markus Spiske&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/algorithm?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In the previous posts, I described how the Gradient Boosted Trees algorithm works for &lt;a href="https://dev.to/xsabzal/gradient-boost-for-regression-1e42"&gt;regression&lt;/a&gt; and &lt;a href="https://dev.to/xsabzal/gradient-boost-for-classification-2f15"&gt;classification&lt;/a&gt; problems. However, for bigger datasets, training can be a slow process. Several advanced versions of boosted trees address this issue, such as Extreme Gradient Boost (XG Boost), Light GBM, and CatBoost. Here I will give an overview of how XG Boost works for regression.&lt;/p&gt;

&lt;p&gt;1) Define initial values and hyperparameters.&lt;/p&gt;

&lt;p&gt;1a) Define a differentiable Loss Function, e.g. 'Least Squares': &lt;br&gt;
L(yi,pi) = 1/2*(yi - pi)^2, where &lt;br&gt;
yi - true values, pi - predictions &lt;/p&gt;

&lt;p&gt;1b) Assign a value to the initial predictions (p), by default, it is the same number for all observations, e.g. 0.5.&lt;/p&gt;

&lt;p&gt;1c) Assign values to parameters:&lt;br&gt;
learning rate (eta), max_depth, max_leaves, number of boosted rounds etc.&lt;br&gt;
and regularization hyperparameters: lambda, gamma. &lt;br&gt;
Default values in the XG Boost &lt;a href="https://xgboost.readthedocs.io/en/latest/parameter.html"&gt;documentation&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;2) Build 1 to N trees iteratively.&lt;br&gt;
2a) Get the residuals (yi - pi) that the tree will be fitted to. &lt;br&gt;
Note: &lt;br&gt;
- Similar to ordinary Gradient Boosted Trees, we fit the trees iteratively to the residuals, not to the predictions. &lt;/p&gt;

&lt;p&gt;- Building trees in XG Boost is a bit different compared to ordinary decision trees, where we could use criteria like Gini impurity or entropy to measure the gain. In XG Boost we use a formula derived from optimizing the objective function (the objective function is the sum of the loss function and the regularization terms). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VC0rLzq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hr8tiv5cgh0ivkq96lyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VC0rLzq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hr8tiv5cgh0ivkq96lyd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HzO1xJ1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/y2iwos5bmj7h3kpbp635.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HzO1xJ1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/y2iwos5bmj7h3kpbp635.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;G - represents the sum of gradients (the first derivative of the loss function with respect to the prediction p); in our case it is the negative sum of the residuals in the leaf (L - left leaf, R - right leaf).&lt;/p&gt;

&lt;p&gt;H - the sum of second derivatives of the Loss Function with respect to the prediction p, which here equals the number of residuals in the leaf.&lt;/p&gt;

&lt;p&gt;- XG Boost allows using a greedy algorithm or an approximate greedy algorithm (for bigger datasets) when building the trees and calculating gains.&lt;/p&gt;

&lt;p&gt;2c) Once we choose the best split by the Gain calculated in the previous step, we build the full tree (the size of the tree is limited either by the gain values, whose formula also includes gamma for pruning, or by the parameters we specified initially).&lt;br&gt;
Now compute the output value for each leaf in the tree.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ABT4BYrq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h0tufkca8dtxteobj8ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ABT4BYrq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h0tufkca8dtxteobj8ca.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2d) Compute new prediction values as &lt;br&gt;
new_p = previous_p + eta * output_value&lt;/p&gt;
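&lt;p&gt;A minimal sketch of the similarity, gain, and output-value calculations above; the residual values and the lambda/gamma settings are illustrative assumptions:&lt;/p&gt;

```python
# Minimal sketch of the similarity score, gain, and output value above.
# The residuals and the lambda/gamma settings are illustrative assumptions.
def similarity(residuals, lam):
    # Similarity = (sum of residuals)^2 / (count + lambda), squared-error loss
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam, gamma):
    # Gain = Sim_left + Sim_right - Sim_root - gamma (gamma prunes weak splits)
    return (similarity(left, lam) + similarity(right, lam)
            - similarity(left + right, lam) - gamma)

def output_value(residuals, lam):
    # Output = sum of residuals / (count + lambda), i.e. -G / (H + lambda)
    return sum(residuals) / (len(residuals) + lam)

left, right = [-10.5, -7.5], [6.5, 7.5]
print(gain(left, right, lam=1.0, gamma=0.0))    # positive: a useful split
print(output_value(left, lam=1.0))              # -6.0
```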

&lt;p&gt;3) Get the final predictions.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gradient Boost for Classification</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 04 Jan 2021 04:41:55 +0000</pubDate>
      <link>https://dev.to/xsabzal/gradient-boost-for-classification-2f15</link>
      <guid>https://dev.to/xsabzal/gradient-boost-for-classification-2f15</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@spacex?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;SpaceX&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/rocket?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Here I would like to go through the steps of the Gradient Boost for Classification. Gradient Boost Classification is very similar to the Gradient Boost Regression algorithm with a few differences: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Target values in binary classification are 0s and 1s&lt;/li&gt;
&lt;li&gt;Log Loss function (similar to logistic regression problems). &lt;/li&gt;
&lt;li&gt;Using the log of odds and probabilities based on the Log Loss function.
So we can apply the same mathematical algorithm we used in the previous &lt;a href="https://dev.to/xsabzal/gradient-boost-for-regression-1e42"&gt;post&lt;/a&gt;, taking into account the above differences. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are the steps of the Gradient Boost Classification algorithm when used with the Logistic Loss Function L(y,F(x)).&lt;br&gt;
I have also tried to avoid mathematical notation and to simplify the steps.&lt;/p&gt;

&lt;p&gt;1) Get the initial prediction: the log of odds and the probability of class = 1. Basically, we count the numbers of class 1s and class 0s, then calculate P(class=1) and log(odds) = log(P/(1-P)). &lt;br&gt;
Example: if we have balanced data, we have equal (or similar) numbers of 0s and 1s, so initially:&lt;br&gt;
Predicted_Probability(class_1) = 0.5&lt;br&gt;
log(odds) = 0.&lt;/p&gt;
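&lt;p&gt;Step 1 can be sketched directly; the label vector below is an illustrative assumption:&lt;/p&gt;

```python
# Sketch of step 1: initial probability and log of odds from class counts.
# The label vector is an illustrative assumption.
import math

y = [1, 0, 1, 0, 1, 0]             # balanced toy labels

p = sum(y) / len(y)                # P(class = 1)
log_odds = math.log(p / (1 - p))   # initial predicted log of odds

print(p, log_odds)                 # balanced data gives 0.5 and 0.0
```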

&lt;p&gt;2) m is the number of weak learners. So we do the below steps for each decision tree (e.g. m=1 to m=100, when n_estimators=100):&lt;/p&gt;

&lt;p&gt;a. Compute the residuals (True - Predicted_Probability) for each tree iteratively (the residuals from the previous step are used as the target for the next decision tree).&lt;/p&gt;

&lt;p&gt;Example: for the first tree, Residual = True - 0.5 (predicted in the previous step), where True = 0 or 1 (the target class)&lt;/p&gt;

&lt;p&gt;b. Fit decision tree to the residuals&lt;/p&gt;

&lt;p&gt;c. Compute the output value for each leaf in the tree. We cannot simply take the average of the values in the leaf as we did in regression. Instead, we use the following formula:&lt;br&gt;
predicted_leaf_output = (sum of residuals in the leaf) / [sum of Predicted_Probability*(1-Predicted_Probability)]&lt;/p&gt;

&lt;p&gt;d. First, update the predicted log of odds for each row of data:&lt;br&gt;
log(odds) = Previously_predicted_log(odds) + learning_rate * predicted_leaf_output&lt;/p&gt;

&lt;p&gt;e. Then calculate the probability for each row of data using log(odds):&lt;br&gt;
P = odds/(1+odds) or P = exp(log(odds))/[1+exp(log(odds))].&lt;/p&gt;
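&lt;p&gt;Steps c-e can be checked numerically for a single leaf; the residuals, prior probabilities, and learning rate below are illustrative assumptions:&lt;/p&gt;

```python
# Numeric check of steps c-e for a single leaf.  The residuals, prior
# probabilities, and learning rate are illustrative assumptions.
import math

residuals = [0.5, 0.5, -0.5]       # True - Predicted_Probability per row
p_prev = [0.5, 0.5, 0.5]           # previous predicted probabilities
learning_rate = 0.1

# c. predicted_leaf_output = sum(residuals) / sum(p * (1 - p))
leaf_output = sum(residuals) / sum(p * (1 - p) for p in p_prev)

# d. update the log of odds for a row that lands in this leaf
log_odds = math.log(0.5 / (1 - 0.5)) + learning_rate * leaf_output

# e. convert log(odds) back to a probability
P = math.exp(log_odds) / (1 + math.exp(log_odds))
print(round(P, 4))
```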

&lt;p&gt;3) Compute the final prediction F(x).&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Gradient Boost for Regression</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 28 Dec 2020 04:15:51 +0000</pubDate>
      <link>https://dev.to/xsabzal/gradient-boost-for-regression-1e42</link>
      <guid>https://dev.to/xsabzal/gradient-boost-for-regression-1e42</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@billjelen?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Bill Jelen&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/rocket?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient Tree Boosting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gradient Tree Boosting is an ensemble algorithm that can be applied to both classification and regression problems. Here I will describe how gradient boosting works for regression. Gradient Boost uses Decision Trees as weak learners; each Decision Tree predicts pseudo-residual values, and all Decision Trees have the same 'weight' in the final prediction (set by the learning rate). &lt;/p&gt;

&lt;p&gt;There are a few key components in the Gradient Boosting algorithm:&lt;/p&gt;

&lt;p&gt;a) Loss function - the natural choice for regression is 'Least Squares' &lt;br&gt;
(Note: similar to linear regression, but commonly used with the coefficient 1/2, i.e. 1/2*(True-Predicted)^2, so that the negative gradient equals the actual residual (True-Predicted) rather than a scaled pseudo-residual.)&lt;/p&gt;

&lt;p&gt;b) Hyperparameters:&lt;br&gt;
learning rate (used to scale each weak learner prediction), and parameters related to the weak learners themselves (e.g. number of weak learners, maximum depth of each tree, etc.) &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Algorithm steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's dive into the details of the algorithm itself. This is the mathematical description of Gradient Boosted Trees.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g-XgVeTW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mp00a2bfvqxk7s258849.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g-XgVeTW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mp00a2bfvqxk7s258849.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Source is &lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Below is the simplified explanation of the above steps when using 'Least Squares' Loss Function L(y,F(x)):&lt;/p&gt;

&lt;p&gt;1) Get the initial prediction, it is equal to the mean of the target column. &lt;br&gt;
2) m is the number of weak learners. So we do the below steps for each decision tree (e.g. m=1 to m=100, when n_estimators=100):&lt;/p&gt;

&lt;p&gt;a. Compute the residuals (True - Predicted) for each tree iteratively (the residuals from the previous step are used as the target for the next decision tree).&lt;br&gt;
note: for the first tree, Residual = True - Target_Mean (predicted in the previous step)&lt;br&gt;
    b. Fit a decision tree to the residuals&lt;br&gt;
    c. Compute the output value for each leaf in the tree (in this case, the mean of the residuals in that leaf)&lt;br&gt;
    d. Update the predicted values: new_prediction = previous_prediction + learning_rate * output_value.&lt;br&gt;
    e. Repeat steps 2a to 2d until all weak learners are constructed.&lt;/p&gt;

&lt;p&gt;3) Compute the final prediction F(x)&lt;/p&gt;
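&lt;p&gt;Steps 1-3 can be sketched with sklearn decision trees as the weak learners; the toy data, number of learners, and learning rate below are illustrative assumptions:&lt;/p&gt;

```python
# Sketch of steps 1-3 using sklearn decision trees as the weak learners.
# The toy data, number of learners, and learning rate are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

learning_rate, n_estimators = 0.1, 100
prediction = np.full_like(y, y.mean())     # 1) initial prediction = target mean
trees = []

for _ in range(n_estimators):              # 2) build the weak learners
    residuals = y - prediction             # 2a) residuals as the new target
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # 2b)
    prediction += learning_rate * tree.predict(X)  # 2c-2d) leaf means, update
    trees.append(tree)

# 3) final prediction F(x) for new data: initial mean + scaled tree outputs
def predict(X_new):
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
```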

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the short description above, we can see that there are some similarities with AdaBoost (like iterativeness - the next tree depends on the previous predictions) and differences (each tree has the same learning rate vs. different weights in AdaBoost). &lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3CC4N4z3GJc"&gt;Video material&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting"&gt;scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting"&gt;Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
