Taras Baranyuk

CatBoost and Water Pumps

Introduction

This article is based on a data-science competition about predicting the condition of water pumps. In Tanzania, the largest country in East Africa with about 60 million people, half of the population does not have access to clean water. Billions of dollars in foreign aid are provided to tackle the freshwater problem, yet the government has not been able to solve it. A significant share of the water pumps is entirely out of order or barely functions, and many others require repair.

Data

The data has many characteristics associated with water pumps. The water supply points were divided into functional, non-functional and functional but in need of repair. The goal of the competition is to build a model that predicts the functionality of water supply points.
The modelling data has 59,400 rows and 40 columns; the labels come in a separate file.
The metric used for this competition is the classification rate, which calculates the percentage of rows where the predicted class in the submission matches the actual class in the test set. The maximum is 1, and the minimum is 0. The goal is to maximize the classification rate.

EDA (Exploratory Data Analysis)

A detailed description of each feature in the dataset can be found on the competition page.

First of all, let’s look at the target — the classes don’t have an even distribution.
Labels
There are very few labels for water pumps in need of repair. We will not try to fix this imbalance directly; instead, we will rely on an appropriate metric and the library's built-in capabilities when creating the model.
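As a quick check, the class balance can be inspected directly from the labels file. A minimal sketch (the file and column names follow the competition data, but treat them as assumptions):

import pandas as pd

# Labels are shipped separately from the features (hypothetical file name)
labels = pd.read_csv('training_set_labels.csv')

# Absolute counts and shares of each class
print(labels['status_group'].value_counts())
print(labels['status_group'].value_counts(normalize=True).round(3))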

Let’s see how the water pumps are distributed across the territory of the country.
Map

It is known that some features contain empty values; let's see them on the chart.
Empty values
We can see that there are very few rows with missing values, with scheme_name having the largest number.
The following heatmap represents the presence/absence relationships between variables. It is worth paying attention to the correlation between permit, installer and funder.
Heatmap
Let’s see the general picture of the relationships on the dendrogram.
Dendrogram
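The three charts above (missing values, nullity heatmap, and dendrogram) are the standard views from the missingno library. A hedged sketch, assuming the raw features are loaded into a DataFrame df:

import matplotlib.pyplot as plt
import missingno as msno

msno.matrix(df)       # per-row presence/absence of every feature
msno.heatmap(df)      # nullity correlation between features (permit, installer, funder)
msno.dendrogram(df)   # hierarchical clustering of features by missingness pattern
plt.show()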
Among the water pump characteristics, there is one that shows the amount of water (quantity_group). We can check how the water amount relates to the pumps' condition.
The water amount
There are many wells with sufficient water that are not functioning. From the standpoint of investment efficiency, it is logical to focus on repairing this group first. It is also observed that most dry pumps are not working; if a way is found to refill these wells with water, they could probably become functional again.
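A simple way to reproduce this kind of breakdown is a normalized crosstab of the water amount against the pump status. A sketch, assuming df holds the features joined with the labels (column names follow the data dictionary):

import pandas as pd

# Share of each status within every quantity_group value
ct = pd.crosstab(df['quantity_group'], df['status_group'], normalize='index')
print(ct.round(2))

# Stacked bars give the same picture as the chart above
ct.plot(kind='barh', stacked=True, figsize=(8, 4))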

Does water quality affect the condition of the water pumps? We can see the data grouped by quality_group.
The water quality
Unfortunately, this graph is not very informative, since sources with good water dominate. Let's group only the sources with lower-quality water.
The water quality w/o 'good'
Most pumps with an unknown quality_group are non-functional.

There is another interesting characteristic of waterpoints — their type (waterpoint_type_group).
waterpoint_type_group
Analysis of the data by waterpoint type shows that the other group contains many inoperative pumps. Are they outdated? We can check how the year the pump was constructed affects its condition.
The year of construction
The older the waterpoint, the higher the probability that it is not functioning, especially for those built before the 1980s.
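To look at the age effect numerically, the construction year can be binned into decades (in this dataset, 0 is used as a placeholder for an unknown year); a sketch under those assumptions:

import pandas as pd

# Drop the 0 placeholders and group the remaining years by decade
known = df[df['construction_year'] > 0].copy()
known['decade'] = (known['construction_year'] // 10) * 10
print(pd.crosstab(known['decade'], known['status_group'], normalize='index').round(2))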

Now we will try to get insights from the information about the funding organizations. The condition of the wells should be correlated with funding. Consider only organizations that fund more than 500 waterpoints.
The funding organizations
Danida stands out: although it funds many working water points, its percentage of broken ones is very high. The situation is similar for RWSSP (the Rural Water Supply and Sanitation Program), Dhv, and a few others. It should be noted that most of the wells financed by the German Republic and by private individuals are in working order. In contrast, a large share of the wells financed by the state is not functioning. Most of the water points established by the central government and district councils are also not working.
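The filtering by funder size takes only a couple of lines; a sketch, again assuming df combines the features and the labels:

import pandas as pd

# Keep only funders responsible for more than 500 waterpoints
counts = df['funder'].value_counts()
big_funders = counts[counts > 500].index

subset = df[df['funder'].isin(big_funders)]
by_funder = pd.crosstab(subset['funder'], subset['status_group'], normalize='index')
print(by_funder.sort_values('non functional', ascending=False).round(2))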

Let us consider the hypothesis that the water’s purity and the water basin to which the well belongs can influence the functioning. First of all, let’s look at the water basins.
The water basins
Two basins stand out strongly — Ruvuma / Southern Coast and Lake Rukwa. Broken water points are in the majority there.

It is known that some of the wells are not free. We can assume that payments can positively affect keeping the pumps in working order.
Payments
The hypothesis is fully confirmed — payment for water helps to keep the source in a working state.

Let's build a simple DecisionTreeClassifier with a depth of 4 and see what this tree looks like.

from sklearn import tree
from sklearn.utils.class_weight import compute_sample_weight
from dtreeviz.trees import dtreeviz

# A shallow tree is enough to get a first feel for the splits
clf = tree.DecisionTreeClassifier(max_depth=4, random_state=42)

y_train = df['labels']
X_train = df.drop('labels', axis=1)

# Compensate for the class imbalance with per-sample weights
sample_weight = compute_sample_weight(
    class_weight='balanced',
    y=y_train)

clf.fit(X_train, y_train, sample_weight=sample_weight)

# Visualize the fitted tree
dtreeviz(
    clf, x_data=X_train, y_data=y_train, target_name='labels',
    feature_names=X_train.columns.tolist(),
    class_names=["functional", "non functional",
        "functional needs repair"],
    title="Decision Tree")

DecisionTreeClassifier
Here you can look at the full-size image.

In addition to the categorical parameters, the data contains numerical features that we can inspect and perhaps find something interesting in.
Numerical features
Part of the data is filled with 0 values instead of real measurements. We can also see that amount_tsh is higher for working water points (label = 0), and the outliers in the amount_tsh feature deserve attention. One can also note the spread in elevation and the fact that a significant part of the population lives more than 500 meters above mean sea level.
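A quick numeric summary per class makes both points (the zero placeholders and the amount_tsh outliers) easy to verify; a sketch using a few of the dataset's numeric columns:

numeric_cols = ['amount_tsh', 'gps_height', 'population']

# Basic statistics of the numeric features for each class
print(df.groupby('status_group')[numeric_cols].describe().T)

# Share of suspicious zero values in each numeric feature
print((df[numeric_cols] == 0).mean().round(2))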

Data Cleaning

Before starting to create a model, we need to clean and prepare the data.

  • The installer feature contains many repetitions with different cases, spelling errors and abbreviations. Let’s put everything in lowercase first. Then, using simple rules, we reduce the number of mistakes and do the grouping.
  • After cleaning, we replace any value that occurs fewer than 71 times (the 0.95 quantile) with 'other'.
  • We repeat by analogy with the funder feature. The cut-off threshold is 98.
  • The data contains features with very similar categories. Let's choose only one of them. Since there is not much data in the dataset, we keep the feature with the fewest categories. Delete scheme_management, quantity_group, water_quality, payment_type, extraction_type, waterpoint_type_group, region_code.
  • Replace the latitude and longitude values of outliers with the corresponding region_code median values.
  • A similar technique for replacing missing values is applicable for subvillage and scheme_name.
  • Missing values in public_meeting and permit are replaced with median values.
  • For subvillage, public_meeting, scheme_name, permit, we can create different binary features that show missing values.
  • The features scheme_management, quantity_group, water_quality, region_code, payment_type, extraction_type, waterpoint_type_group, date_recorded, and recorded_by can be deleted since they either duplicate other information or are useless (a sketch of a couple of these cleaning steps follows below).
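A hedged sketch of two of the steps above — the installer normalization and the coordinate fix. The thresholds for invalid coordinates are illustrative; the 71 cut-off and the column names come from the list:

import numpy as np

# Normalize installer spelling and collapse rare values into 'other'
df['installer'] = df['installer'].fillna('unknown').str.lower().str.strip()
counts = df['installer'].value_counts()
rare = counts[counts < 71].index                      # the 0.95-quantile cut-off from the text
df.loc[df['installer'].isin(rare), 'installer'] = 'other'

# Replace clearly invalid coordinates (near-zero placeholders) with the
# median of the same region_code
df.loc[df['longitude'] < 25, 'longitude'] = np.nan    # Tanzania lies roughly between 29 and 41 E
df.loc[df['latitude'] > -0.1, 'latitude'] = np.nan
for col in ['longitude', 'latitude']:
    df[col] = df[col].fillna(df.groupby('region_code')[col].transform('median'))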

Modelling

The data contains a large number of categorical features. The most suitable tool for obtaining a baseline model, in my opinion, is CatBoost — a high-performance, open-source library for gradient boosting on decision trees.

We will not select the optimal parameters; let it be homework. Let’s write a function to initialize and train the model.

from catboost import CatBoostClassifier

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        max_ctr_complexity=5,
        task_type='CPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',   # overfitting detector: stop if the eval metric
        od_wait=500,      # has not improved for 500 iterations
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True)

For evaluation, AUC was chosen because the data is highly unbalanced, and this metric handles such cases well.

For the target metric, we can write our function.

import numpy as np

def classification_rate(y, y_pred):
    # Share of predictions that exactly match the true class
    return np.sum(y == y_pred) / len(y)

Since there is little data, simply splitting the dataset into train and validation parts is not ideal. In this case, it is better to use OOF (Out-of-Fold) predictions. We will not use third-party libraries; let's write a simple function ourselves. Please note that the split into folds must be stratified.

from catboost import Pool
from sklearn.model_selection import StratifiedKFold

def get_oof(n_folds, x_train, y, x_test, cat_features, seeds):
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]

    # The last dimension holds the probabilities of the three classes
    oof_train = np.zeros((len(seeds), ntrain, 3))
    oof_test = np.zeros((ntest, 3))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 3))
    test_pool = Pool(data=x_test, cat_features=cat_features)
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)
        for i, (train_index, test_index) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train.iloc[train_index, :]
            y_tr = y[train_index]
            x_te = x_train.iloc[test_index, :]
            y_te = y[test_index]
            train_pool = Pool(
                data=x_tr, label=y_tr, cat_features=cat_features)
            valid_pool = Pool(
                data=x_te, label=y_te, cat_features=cat_features)
            model = fit_model(
                train_pool, valid_pool,
                loss_function='MultiClass',
                random_seed=seed
            )
            # Out-of-fold predictions for the hold-out part of this fold
            oof_train[iseed, test_index, :] = model.predict_proba(x_te)
            # Test-set predictions from this fold's model
            oof_test_skf[iseed, i, :, :] = model.predict_proba(x_test)
            models[(seed, i)] = model
    # Average test predictions over folds and seeds,
    # and train predictions over seeds
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models

To reduce the dependence on splitting randomness, we will set several different seeds to calculate predictions.
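A sketch of how the function above might be called with several seeds and how the averaged OOF probabilities turn into the final metrics (X_train, X_test, y_train, and cat_features are assumed to be prepared already; the seed values are illustrative):

from sklearn.metrics import balanced_accuracy_score

seeds = [0, 42, 888]

oof_train, oof_test, models = get_oof(
    n_folds=5,
    x_train=X_train,
    y=y_train.values,
    x_test=X_test,
    cat_features=cat_features,
    seeds=seeds)

# Averaged class probabilities -> hard class labels
y_oof = oof_train.argmax(axis=1)
print('balanced accuracy:', balanced_accuracy_score(y_train, y_oof))
print('classification rate:', classification_rate(y_train.values, y_oof))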
The learning curve of one of the folds
The learning curves look very optimistic, and the model should perform well.

Having looked at the importance of the model’s features, we can make sure that there is no obvious leak.
The feature importance in one of the models
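The importances themselves can be pulled from any of the trained fold models; a short sketch, assuming the dictionary returned by get_oof above:

import pandas as pd

# Feature importances for one of the fold models (keys are (seed, fold) pairs)
model = models[(seeds[0], 0)]
importance = pd.Series(
    model.get_feature_importance(),
    index=X_train.columns).sort_values(ascending=False)
print(importance.head(15))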

After averaging the predictions:

balanced accuracy: 0.6703822994494413
classification rate: 0.8198316498316498

This result was obtained when uploading predictions on the competition website.
Results

Considering that the top-5 result was only about 0.005 better at the time of writing, we can say that the baseline model is good.
Top5

To make sure that all the work on the analysis and data cleaning was not done in vain, we will build a model based solely on the data provided to us. The only thing we'll do is fill in the missing values with zeros.

balanced accuracy: 0.6549535670689709
classification rate: 0.8108249158249158

Simple

The result is noticeably worse.

Summary

In this post, we:

  • got acquainted with the data and looked for insights that can lead to ideas for feature generation;
  • cleaned up and prepared the provided data to create the model;
  • decided to use CatBoost, since the bulk of the features are categorical;
  • wrote a function for OOF-prediction;
  • got an excellent result for the baseline model.

The right approach to data preparation and the right choice of tools for creating a model can give great results even without engineering additional features.

As a homework assignment, I suggest adding new features, choosing the model’s optimal parameters, using other libraries for gradient boosting, and building ensembles from the resulting models.

The code from the article can be viewed here.
