DEV Community

ddey117


Using Machine Learning Classification Model for Predictive Maintenance of Tanzanian Water Pumps


Author: Dylan Dey

This project is available on GitHub here: project repository.

Overview

The website DrivenData partners with organizations to aggregate data and make it available through open competitions, in which data scientists from around the globe come together to solve large social issues. For this project, DrivenData partnered with Taarifa and the Tanzanian Ministry of Water to predict the functionality of water pumps in Tanzania. Machine learning classifiers from the scikit-learn library were used to create a model for predictive maintenance of the water pumps.

Business Problem

Global Water Scarcity

The following statistics were taken from the UN website:

  • 2.3 billion people live in water-stressed countries
  • 3.2 billion people live in agricultural areas with high or very high water shortages or scarcity
  • 700 million people could be displaced by severe water scarcity by 2030

Readily accessible water is necessary to support a healthy population. Poor infrastructure, management of resources, climate change, and a number of other factors can have dramatic effects on the ability for populations to get the water they need.

Tanzania Water Scarcity

Although Tanzania has relatively abundant surface water, with a number of lakes and rivers, a large portion of the country is arid and must depend on groundwater from communal boreholes in order to survive.

After Tanzania gained independence in 1961, aggressive policies of villagisation were put in place in an attempt to improve infrastructure by bringing communities together around common goals, including a goal of getting water within 400 meters of every household within 20 years. However, as urban areas continued to expand, the program failed to deliver what was promised in terms of water availability, especially in urban environments.

Keeping potable groundwater easily accessible to large populations is no easy task. In a country where so much of the population depends on access to groundwater for survival, it is crucial to be able to systematically maintain the infrastructure that is functional and urgently address the infrastructure that is not.

Numerous water points have been established in Tanzania that currently supply the population with water, although the system is quite inefficient. Poor infrastructure and spotty maintenance plague the system with broken taps, broken pipes, and damaged supply sources. In addition, water pumps and plumbing make attractive targets for thieves. These factors and others block access to clean water from the established system.

The most common form of maintenance for these points is to repair them once they are no longer functional. This is not very efficient and is somewhat expensive, but it is orders of magnitude cheaper than establishing a new water point through drilling and installing large equipment.

A better approach would be to prevent repairs from becoming necessary in the first place. Predictive maintenance constantly monitors the status of pumps in order to maintain them more efficiently, and that is the goal of this project. Timely verification of the status of water pumps in the region can prevent issues from compounding and can reduce maintenance costs significantly. Most importantly, it can bring fresh water to those who need it more efficiently than before.

For the past two decades, the Ministry of Water has been implementing sector reforms aimed at improving resource management and water supply in both rural and urban environments. This project is an attempt to determine whether predictive maintenance could help the Ministry of Water in the overall success of its mission.

The goal is to predict whether an unknown water pump is functional or non-functional before ever sending anyone to physically check the pump and test its functionality. While it could prove valuable to treat this as a three-class problem, I decided to simplify things for this project and predict two labels instead of three. Future work will focus on creating a more sophisticated model with proper class-imbalance techniques. Focusing for now on a strong predictive model for a binary classification problem can be very beneficial for the Ministry of Water.

A team will obviously need different equipment, training, and other resources in order to do simple maintenance on a functional pump versus fully repairing a non-functional pump. It is critical to improve the efficiency of pump maintenance and repair to ensure more people can get what they need to survive as quickly as possible.

Without any model to make predictions, the Ministry of Water would have to physically check all of the pumps. The teams sent could bring just what they need to maintain functional pumps, which would cover only a little over half of the pumps. They would then record which pumps were maintained and which still need repair, go back to base, gather the proper supplies, and deploy again with the proper resources.

The ‘date recorded’ column reveals that in 2011 nearly 28,000 water pumps were physically checked for their status. This seems to be the current capacity for just the exploratory part of maintenance, before ever sending the proper equipment or supplies to fix what needs fixing. A little speculation shows how resource-intensive this is. If one team could check 3 pumps a day, worked five days a week with no holidays, and never took any additional time off, it would still take around 40 teams even under these unrealistic, intense conditions. This doesn't even account for the few months a year of heavy rains in Tanzania that could slow progress even further. These teams would also need the proper equipment to navigate long distances. Tanzania has some of the most diverse geography in the world: some of the world's tallest mountains and volcanoes, dense jungle, arid grassland with no shade from the intensity of the African sun, treacherous valleys, and seasonal flooding are just some of the features that can impede this first round of physical exploration, before the second round in which resources can finally be directed to where they need to be. This burden significantly reduces whatever money and resources would be left to buy and send the necessary equipment after funding all of this exploration.
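The back-of-the-envelope estimate above can be reproduced in a few lines of Python. All of the figures here (28,000 checks, 3 pumps per team per day, 5 days a week, year-round work) are the speculative assumptions stated in the text, not measured values:

```python
# Rough capacity estimate for physically checking every pump,
# using the speculative assumptions stated above.
pumps_to_check = 28_000      # pumps recorded as checked in 2011
pumps_per_day = 3            # per team
days_per_week = 5
weeks_per_year = 52          # no holidays, no time off

checks_per_team_per_year = pumps_per_day * days_per_week * weeks_per_year  # 780
teams_needed = -(-pumps_to_check // checks_per_team_per_year)  # ceiling division

print(checks_per_team_per_year)  # 780
print(teams_needed)              # 36, i.e. around 40 teams
```

Even this lower bound of roughly 36 teams ignores travel time between pumps, terrain, and weather, so the real requirement would be higher.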

Thus, being able to reliably predict the status of a pump without physically checking the water pump could be extremely important. Reducing the amount of resources necessary for this crucial step can free up resources to return functionality to important water points to the communities that desperately need them, increasing the amount of people with access to fresh water. Furthermore, these communities will get access to fresh water on a much shorter timeline.

Data Understanding

Below is an image generated using QGIS software that shows the geological location of the pumps and their current status as either functional or non-functional. It also shows Tanzanian rivers that were added through a shapefile available from USGS.gov.
[Image: QGIS map of Tanzanian water pump locations and status, with rivers overlaid]

A description of all of the data available directly from the DrivenData competition:

amount_tsh - Total static head (amount water available to waterpoint)
date_recorded - The date the row was entered
funder - Who funded the well
gps_height - Altitude of the well
installer - Organization that installed the well
longitude - GPS coordinate
latitude - GPS coordinate
wpt_name - Name of the waterpoint if there is one
num_private -
basin - Geographic water basin
subvillage - Geographic location
region - Geographic location
region_code - Geographic location (coded)
district_code - Geographic location (coded)
lga - Geographic location
ward - Geographic location
population - Population around the well
public_meeting - True/False
recorded_by - Group entering this row of data
scheme_management - Who operates the waterpoint
scheme_name - Who operates the waterpoint
permit - If the waterpoint is permitted
construction_year - Year the waterpoint was constructed
extraction_type - The kind of extraction the waterpoint uses
extraction_type_group - The kind of extraction the waterpoint uses
extraction_type_class - The kind of extraction the waterpoint uses
management - How the waterpoint is managed
management_group - How the waterpoint is managed
payment - What the water costs
payment_type - What the water costs
water_quality - The quality of the water
quality_group - The quality of the water
quantity - The quantity of water
quantity_group - The quantity of water
source - The source of the water
source_type - The source of the water
source_class - The source of the water
waterpoint_type - The kind of waterpoint
waterpoint_type_group - The kind of waterpoint

There was a lot of overlap between these predictors, so many redundant columns were dropped before modeling. There were also issues with incomplete data and high cardinality in the categorical features that needed to be dealt with. I have included the functions used to clean the data, to give an idea of the kind of cleaning that took place before modeling.

import numpy as np
import pandas as pd
from collections import Counter

def drop_cols(water_pump_df):
    #'population' is dropped later, after the ward population merge
    to_drop_final = ['id', 'recorded_by', 'num_private',
          'waterpoint_type_group', 'source',
          'source_class', 'extraction_type',
          'extraction_type_group', 'payment_type',
          'management_group', 'scheme_name',
          'water_quality', 'quantity_group',
          'scheme_management', 'longitude',
          'latitude', 'date_recorded',
          'amount_tsh', 'gps_height',
          'region_code', 'district_code']

    return water_pump_df.drop(columns=to_drop_final)

#helper function to bin construction year
def construction_wrangler(row):
    if row['construction_year'] >= 1960 and row['construction_year'] < 1970:
        return '60s'
    elif row['construction_year'] >= 1970 and row['construction_year'] < 1980:
        return '70s'
    elif row['construction_year'] >= 1980 and row['construction_year'] < 1990:
        return '80s'
    elif row['construction_year'] >= 1990 and row['construction_year'] < 2000:
        return '90s'
    elif row['construction_year'] >= 2000 and row['construction_year'] < 2010:
        return '00s'
    elif row['construction_year'] >= 2010:
        return '10s'
    else:
        return 'unknown'

def bin_construction_year(water_pump_df):
    water_pump_df['construction_year'] = water_pump_df.apply(lambda row: construction_wrangler(row), axis=1)
    return water_pump_df


#takes zero placeholders and NAN values and converts them into 'unknown'
def fill_unknowns(water_pump_df):    
    installer_index_0 = water_pump_df['installer'] =='0'
    funder_index_0 = water_pump_df['funder'] =='0'
    water_pump_df.loc[installer_index_0, 'installer'] = 'unknown'
    water_pump_df.loc[funder_index_0, 'funder'] = 'unknown'
    water_pump_df.fillna({'installer':'unknown', 
                   'funder':'unknown', 
                   'subvillage': 'unknown'}, inplace=True)
    return water_pump_df

#returns back boolean features without NANs while maintaining same ratio of True to False as with NANs    
def fill_col_normal_data(water_pump_df):
    filt = water_pump_df['permit'].isna()
    probs = water_pump_df['permit'].value_counts(normalize=True)
    water_pump_df.loc[filt, 'permit'] = np.random.choice([True, False], 
                       size=int(filt.sum()),
                       p = [probs[True], probs[False]])
    filt = water_pump_df['public_meeting'].isna()
    probs = water_pump_df['public_meeting'].value_counts(normalize=True)
    water_pump_df.loc[filt, 'public_meeting'] = np.random.choice([True, False], 
                       size=int(filt.sum()),
                       p = [probs[True], probs[False]])
    return water_pump_df



def apply_cardinality_reduct(water_pump_df, reduct_dict):
    for col, categories_list in reduct_dict.items():
        water_pump_df[col] = water_pump_df[col].apply(lambda x: x if x in categories_list else 'Other')
    return water_pump_df



#one-hot encode categorical data
def one_hot(water_pump_df):
    final_cat = ['funder', 'installer', 'wpt_name', 'basin', 'subvillage', 'region',
       'lga', 'ward', 'public_meeting', 'permit', 'construction_year',
       'extraction_type_class', 'management', 'payment', 'quality_group',
       'quantity', 'source_type', 'waterpoint_type']

    water_pump_df = pd.get_dummies(water_pump_df[final_cat], drop_first=True)

    return water_pump_df



#master function for cleaning dataFrame
def clean_dataFrame(water_pump_df, reduct_dict):
    water_pump_df = drop_cols(water_pump_df)
    water_pump_df = bin_construction_year(water_pump_df)
    water_pump_df = fill_unknowns(water_pump_df)
    water_pump_df = fill_col_normal_data(water_pump_df)
    water_pump_df = apply_cardinality_reduct(water_pump_df, reduct_dict)
    water_pump_df = one_hot(water_pump_df)

    return water_pump_df

###############################################
# The rest of the functions in this section
# reduce cardinality by mapping infrequent
# values to 'Other'. The dictionary derived
# from these functions is used by the master
# clean_dataFrame function above.

#helper function for reducing cardinality
def cardinality_threshold(column, threshold=0.65):
    #calculate the threshold value using
    #the frequency of instances in column
    threshold_value = int(threshold * len(column))
    #initialize a new list for lower cardinality column
    categories_list = []
    #initialize a variable to calculate sum of frequencies
    s = 0
    #Create a dictionary (unique_category: frequency)
    counts = Counter(column)

    #Iterate through category names and corresponding frequencies
    #in descending order of frequency
    for category, frequency in counts.most_common():
        #Add the frequency to the running total
        s += frequency
        #append the category name to the categories list
        categories_list.append(category)
        #Stop once the running total reaches the threshold value
        if s >= threshold_value:
            break

    #append the new 'Other' category once, after the loop;
    #all instances not kept above get lumped into it later
    categories_list.append('Other')

    return categories_list

#reduces the cardinality of appropriate categories
def get_col_val_mapping(water_pump_df):
    col_threshold_list = [
        ('funder',0.65), 
        ('installer', 0.65),
        ('wpt_name', 0.15),
        ('subvillage', 0.07),
        ('lga', 0.6),
        ('ward', 0.05)
    ]

    reduct_dict = {}

    for col, thresh in col_threshold_list:
        reduct_dict[col] = cardinality_threshold(water_pump_df[col],
                                                   threshold= thresh)

    return reduct_dict

# reduct_dict is a key value mapper that will
# be used for both training and testing sets
# in order to reduce cardinality of the data

I decided to bring in some outside data to see if it would help improve my models. After finding a suitable shapefile for the rivers of Tanzania, I was able to create a new feature using QGIS's spatial analysis tools. This feature describes the distance from each pump to the nearest river. I decided to simplify the feature into a boolean describing whether each pump is within 8 km of a river. Future work can explore this feature further. For more information on how this was done, you can visit my other blog here, in which I created a similar feature relating Boston housing to bodies of water.

Another feature I decided to add was the population of the surrounding area for each pump. To do this, I imported government census data from the same time frame in which the pump data was initially collected. I chose ward-level population data, as wards are a fine-grained administrative boundary with thousands of divisions. It was a simple way to get some understanding of the population surrounding each water pump.


Here is my block of code in which I imported and manipulated this outside data to prepare for my modeling.

#import data from DrivenData
train_labels = pd.read_csv('files/0bf8bc6e-30d0-4c50-956a-603fc693d966.csv')
train_features = pd.read_csv('files/4910797b-ee55-40a7-8668-10efd5c1b960.csv')
df = train_features.merge(train_labels, on='id').copy()
#import QGIS derived data and prepare for model
river_df = pd.read_csv('data/river_dist2.csv')
#removing outliers
index_riv = river_df[river_df['HubDist'] >66].index
river_median = river_df['HubDist'].median()
river_df.loc[index_riv, 'HubDist'] = river_median

#create boolean for pump being within 8 km of river
river_s = river_df['HubDist'].copy()
river_s.rename('near_river', inplace=True) 
near_river = river_s[river_s < 8].apply(lambda x: 1 if not pd.isnull(x) else np.nan)
df = df.join(near_river)
df.near_river.fillna(0, inplace=True)

#import population data from 2012 government census
df_pop = pd.read_excel('data/tza-pop-popn-nbs-baselinedata-xlsx-1.xlsx')

#create a dictionary of values with format {Ward : Total Population}

pop_index = df_pop.groupby('Ward_Name')['total_both'].sum().index
pop_values = df_pop.groupby('Ward_Name')['total_both'].sum().values
pop_dict = dict(zip(pop_index, pop_values))

#create pandas Dataframe for merging
pop_dataframe = pd.DataFrame.from_dict(pop_dict, orient='index')
#rename column for clarity
pop_dataframe.rename(columns={0: 'ward_pop'}, inplace=True)

#merge dataframes
df_pop_merge = df.merge(pop_dataframe,
                              how='left',
                              left_on='ward',
                              right_index=True)

#replace null values of ward population with
#median ward population (fill only the ward_pop
#column so other columns are left untouched)

ward_pop_median = df_pop_merge['ward_pop'].median()
df_pop_merge.fillna({'ward_pop': ward_pop_median}, inplace=True)
# merge back into df and drop pop column
ward_pop_s = df_pop_merge['ward_pop'].copy()
df = df.join(ward_pop_s)
df.drop(columns=['population'], axis=1, inplace=True)


need_repair_index = df['status_group'] == 'functional needs repair'
df_binary = df.copy()
df_binary.loc[need_repair_index, 'status_group'] = 'non functional'

Classification Metric Understanding

Below is a confusion matrix that would be produced from a model performing predictive maintenance on behalf of the Ministry of Water. There are four possible outcomes to be considered. The confusion matrix below is a visual aid to help in understanding what classification metrics to consider when building the model.

[Image: confusion matrix for predictive maintenance of water pumps]

A true positive in the current context would be when the model correctly identifies a functional pump as functional. A true negative would be when the model correctly identifies a non-functional pump as non-functional. Both are important and both can be described by the overall accuracy of the model.

True negatives are really at the heart of the model, as this is the situation in which the Ministry of Water would have a call to action. An appropriately outfitted team would be sent to every pump the model identifies as non-functional. This is the situation in which the correct resources are being directed to the correct water pumps as quickly as possible. High accuracy means that more resources go to the correct locations from the get-go.

True positives are also important. This is where the model will really be saving time, resources, and money for the Ministry of Water. Any pumps identified as functional would no longer need to be physically checked and the Ministry of Water can withhold additional resources from going to pumps that do not actually need them.

Notice the emphasis on any and all pumps in my description of true negatives and true positives above. The true cost/resource analysis is really the consideration of this fact: no model I create will ever correctly identify every single pump appropriately. This is the cost of predictive maintenance and a proper understanding of false positives and false negatives is extremely important in production of classification models in the given context.

False positives in the current context are the worst-case scenario for modeling. This is the scenario in which the model incorrectly identifies a non-functional pump as functional. Resources would be withheld and no team would be sent to physically check these pumps, as the Ministry of Water would have to assume they are indeed functional if it wants to use the model appropriately. False positives therefore describe the number of non-functional pumps that will go unvisited and unfixed until they can be resolved by other means. Reducing false positives as much as possible is very important.

Well, why build the model at all if these false positives cannot be completely avoided? Cost/resource management, of course! After all, it is about making sure as many people get clean water as quickly as possible. The reality is that resources are finite; without the model, the Ministry of Water likely would not have the resources to physically check all the pumps and then fix all of them on any reasonable timeline, and fewer communities would have access to fresh water compared to using the model for predictive maintenance.

False negatives are also important to consider. While false positives can be considered more harmful overall, false negatives should also be reduced as much as possible. In the given context, a false negative is the situation in which the model incorrectly identifies a functional pump as non-functional. Because the Ministry of Water will deploy fully equipped teams to visit all pumps that the model predicts to be non-functional, these are the pumps on which resources will be wasted: supplies and teams will be sent to locations where they aren't needed. Thus, reduction of false negatives is essential to improving the efficiency of resource management through predictive maintenance.

In summary, overall accuracy of the model and a reduction of both false negatives and false positives are the most important metrics to consider when developing a model in this context. More specifically, models will be tuned to maximize accuracy and f1-score. Maximizing accuracy will increase the number of true positives and true negatives, and maximizing for f1-score will reduce the number of false negatives and false positives. Thus, these are the ideal metrics to consider when tuning the models.
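To make the relationships concrete, all of these metrics can be derived from the same confusion matrix in scikit-learn. The labels below are hypothetical toy values, not the project's actual predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical toy labels: 1 = functional, 0 = non-functional
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1]

# With labels=[0, 1], the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)                  # 3 1 1 5

# Accuracy rewards true positives and true negatives equally
print(accuracy_score(y_true, y_pred))  # (TP + TN) / total = 0.8

# F1 is the harmonic mean of precision and recall, so it
# penalizes both false positives and false negatives
print(f1_score(y_true, y_pred))
```

Tuning for both scores together pushes the model toward more correct calls overall (accuracy) while keeping both kinds of misclassification down (f1).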

Modeling

I decided to use Logistic Regression as my baseline model. After tuning the logistic regression model for high accuracy and f1 score, I tried to do the same for random forest classification models and XGBoost classification models.
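A minimal sketch of what such a baseline looks like is below. It uses synthetic stand-in data from `make_classification`; in the actual project, `X` would be the one-hot-encoded output of `clean_dataFrame` and `y` the binary functional / non functional labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, encoded pump features
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Hold out a stratified test set so class proportions are preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

preds = baseline.predict(X_test)
print(accuracy_score(y_test, preds))
print(f1_score(y_test, preds))
```

The baseline's accuracy and f1 then serve as the floor that the random forest and XGBoost models have to beat.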

Using a grid search to tune the hyperparameters of a random forest classifier, I created my best model for the business problem at hand. Below are the scores for the relevant metrics for my random forest classifier.
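A hedged sketch of that kind of search is below. The parameter grid is illustrative only (the actual search space is in the project repository), and the data is again a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the cleaned, encoded pump features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hypothetical grid -- not the grid actually used in the project
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_leaf': [1, 5],
}

# Score each candidate by f1 with 5-fold cross-validation,
# matching the metric argued for in the previous section
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring='f1', cv=5, n_jobs=-1)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

`grid.best_estimator_` is then refit on the full training set and evaluated once on the held-out test data to produce scores like those below.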

Testing Precision: 0.8000386025863733

Testing Accuracy: 0.8032547699214366

Testing F1-Score: 0.825450562580902


Evaluation

My random forest models outperformed my best logistic regression and XGBoost models with regard to the metrics that matter most for the business problem at hand.

The best random forest model has the lowest false positive rate, a low false negative rate, and the highest accuracy.

About 11.63% of pumps would be misclassified as functional using my best model. This means that 11.63% of the pumps would go untreated if the model were deployed to conduct predictive maintenance. However, the model correctly identifies a high number of functional pumps, which saves valuable resources, time, and money, and it also identifies a large number of non-functional pumps correctly. Only 8.05% of functional pumps would be incorrectly identified as non-functional; this is the resource/time/money sink of my model.

Conclusions

I believe that my best classification model provides powerful enough predictive ability to prove very valuable to the Ministry of Water. The amount of resources saved, the relatively low number of misclassified functional pumps, and the elimination of the need to physically sweep all pumps for functionality can bring access to potable drinking water to a larger number of communities than was possible without predictive maintenance.

Please visit my github repository for a complete understanding of my modeling and data exploration process. Thank you for reading.

This project is available on GitHub here: project repository.

I can be reached for questions and comments by email: ddey2985@gmail.com
