Timothy Cummins
CatBoost: What's the Hype?

Introduction

Lately, while browsing the internet for new Data Science tools (the field is always changing), I keep coming across a boosting algorithm that has supposedly taken over as the new king: CatBoost. So I thought I would put it to the test by classifying some credit card fraud data I found on Kaggle, to see what all the hype is about. If you would like to follow along and are downloading the data, you may get a warning saying the website is a possible fraud site, which is very ironic, though I did not have any issues after bypassing the warning and downloading the data.

Updating Python to include CatBoost

The first thing we need to do is update our toolbox with CatBoost. To install it through conda, we first add conda-forge to our channels with conda config --add channels conda-forge, and then install by simply typing conda install catboost into our terminal.
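If you use pip instead of conda, the package is also on PyPI:

pip install catboost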

CatBoost

So, like I said above, CatBoost is another boosting algorithm, and if you have no clue what that is, I would recommend checking out my previous blog where I talk about what boosting is and how it works. For those of you who are familiar with boosting, here are some of the reasons that stood out to me for why CatBoost is supposed to be superior to other boosting algorithms:

  • Implements Symmetric Trees: Apparently this reduces the model's prediction time; the default max depth of the trees is 6.
  • Random Permutations: The algorithm automatically splits the dataset into 4 permutations, which we have seen in some other boosting algorithms; this matters because it reduces overfitting.
  • Automatic Categorical Feature Combinations: CatBoost finds additional connections within your features without you having to create them manually, producing better scores.
  • Automatic One-Hot Encoding of Categorical Features: CatBoost will automatically one-hot encode features with 2 categories, with the option of encoding more diverse features as well, saving you the time and energy of doing it yourself before plugging your data into the algorithm (see the sketch after this list). There are also a ton of parameters you can tune for special situations such as data changing over time, weighted datasets, and small/large datasets.
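To make the categorical handling concrete, here is a minimal sketch; the toy dataframe and column names are made up for illustration. cat_features tells CatBoost which columns to treat as categorical, and one_hot_max_size controls which of those get one-hot encoded:

# Hypothetical toy data; the Kaggle features used later are all numeric.
import pandas as pd
from catboost import CatBoostClassifier

toy = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue', 'red'],  # categorical
    'size': [1.0, 2.5, 3.0, 0.5, 1.5, 2.0],                   # numeric
    'label': [0, 1, 0, 1, 1, 0],
})

model = CatBoostClassifier(
    depth=6,             # the default symmetric-tree depth mentioned above
    one_hot_max_size=2,  # one-hot encode categoricals with <= 2 categories
    iterations=10,       # kept tiny since this is just a toy example
    verbose=False,
)
model.fit(toy[['color', 'size']], toy['label'], cat_features=['color'])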

The Comparison

So to try out this new algorithm myself, I decided to compare it with a plain Random Forest Classifier and a Gradient Boosting Classifier to see if it lived up to its hype. Since everything I had read said it works very well without any hyperparameter tuning, I left all parameters at their defaults. Based on my previous experience, I hoped for the best with the Gradient Boosting Classifier, since I have needed to do a ton of tuning with it in the past.

Importing Tools

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

Data

df = pd.read_csv('../../Downloads/creditcard.csv')
class_names = {0:'Not Fraud', 1:'Fraud'}
print(df.Class.value_counts().rename(index = class_names))
df.head()

[Output: class counts, showing far more 'Not Fraud' than 'Fraud' transactions, followed by the first rows of the dataframe]

We can see that our dataset has quite a large imbalance, but since this is a first-time run-through and we are not tuning our models, I am going to leave the data as it is.
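If I did want to address the imbalance without resampling, both libraries expose class-weighting options. A quick sketch of what that could look like (not used in this post, and parameter availability may vary by version):

# Optional: weight classes to counter the imbalance (not used here).
rf_weighted = RandomForestClassifier(class_weight='balanced')
cb_weighted = CatBoostClassifier(auto_class_weights='Balanced', verbose=False)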

Train Test Split

X = df.drop('Class', axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print("Length of X_train is: " + str(len(X_train)))
print("Length of X_test is: "+str(len(X_test)))
print("Length of y_train is: "+str(len(y_train)))
print("Length of y_test is: "+str(len(y_test)))

[Output: lengths of the train and test splits]
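One tweak worth knowing about for data this skewed (I left it out here to keep everything at defaults): train_test_split accepts a stratify argument that preserves the fraud ratio in both splits.

# Optional: a stratified split keeps the class ratio equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)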

Fitting our Models

%%time
modelrf = RandomForestClassifier()
modelrf.fit(X_train,y_train)

2 minutes 34 seconds

%%time
modelgb = GradientBoostingClassifier()
modelgb.fit(X_train,y_train)

3 minutes 58 seconds

%%time
modelcb = CatBoostClassifier(verbose=False)
modelcb.fit(X_train,y_train)

26 seconds

Even with my computer badly needing a restart and running very slowly, the CatBoost classifier still fit the data more than two minutes faster than the Random Forest and over three minutes faster than the Gradient Boosting Classifier.
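As an aside, %%time is a Jupyter/IPython cell magic; if you are following along in a plain Python script, you could time the fits with the standard library instead, roughly like this:

import time

start = time.perf_counter()
modelcb.fit(X_train, y_train)
print(f"CatBoost fit took {time.perf_counter() - start:.1f} s")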

Predictions

predrf = modelrf.predict(X_test)
print("Mean Accuracy Random Forest:",modelrf.score(X_test,y_test))
print("F1 Random Forest:",metrics.f1_score(y_test, predrf))

[Output: Random Forest mean accuracy and F1 score]

predgb = modelgb.predict(X_test)
print("Mean Accuracy Gradient:",modelgb.score(X_test,y_test))
print("F1 Gradient:",metrics.f1_score(y_test, predgb))

[Output: Gradient Boosting mean accuracy and F1 score]

predcb = modelcb.predict(X_test)
print("Mean Accuracy Cat:",modelcb.score(X_test,y_test))
print("F1 Cat:",metrics.f1_score(y_test, predcb))

[Output: CatBoost mean accuracy and F1 score]
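With data this imbalanced, accuracy alone can be misleading (a model that predicted 'Not Fraud' for everything would already score above 99%), which is why the F1 scores are the fairer comparison here. For a fuller picture you could also print a confusion matrix and per-class report, for example:

# How many frauds does each model actually catch or miss?
print(metrics.confusion_matrix(y_test, predcb))
print(metrics.classification_report(y_test, predcb, target_names=['Not Fraud', 'Fraud']))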

Conclusion

For base models with no tuning, the results were quite stunning! The CatBoost algorithm was not only significantly faster at fitting the data, it also performed better on the test data in both accuracy and F1 score. As for the plain Random Forest Classifier, setting aside how long it took to fit, I was surprised that it found a good signal in the data and performed quite well for being a basic technique. The biggest disappointment, though I was ready for it since I decided not to tune any hyperparameters, was the Gradient Boosting Classifier, which did not come out ready to play. After this trial I am excited to use the CatBoost algorithm in future projects, and I agree with all of the Data Scientists out there who are so excited about it!

If anyone reading this has had any experience using this algorithm and would like to share your projects, I would love to see them, and I am excited to see how the technology progresses!
