Kamau Gilbert Mungai
Financial Inclusion in Africa: A Zindi Project

Introduction

In this article, I will guide you through my approach to tackling the Financial Inclusion in Africa project by Zindi. While the primary goal of this project is to predict which individuals are most likely to have or use a bank account, my main focus was to learn how to deploy a machine learning model on Google Cloud Platform (GCP).

You can use my methodology as a reference to implement the project on your own, adapting it as needed. During my attempt, I fine-tuned the model once and achieved a Mean Absolute Error (MAE) score of 0.122942692.

Now, you might be wondering: "Isn't this a classification problem? Why use MAE as a metric?" Fellow Zindians provided a clever explanation:

"You're on the right track! In some cases, classification problems use Mean Absolute Error (MAE) when treating classes as numerical values. If the model predicts 0 instead of 1, the error is |0-1| = 1. MAE measures how far off predictions are from the true classes on average, unlike accuracy, which just counts correct predictions. If there's an ordinal relationship between classes, this method can be useful."

In this article, I'll outline the steps I followed, leaving room for you to build upon and improve with your unique ideas.

The project is structured into four key phases:

  1. Data Cleaning and Exploratory Data Analysis (EDA)
  2. Feature Engineering
  3. Model Creation and Evaluation
  4. Model Deployment on GCP

Please note that I trained the model locally; however, GCP offers the capability to train models directly in the cloud using its powerful instances. This process includes building a Docker image, a topic I plan to cover in detail on another occasion (honestly I'm not a big Docker fan but I like a nice operating system).

And yes, training the model locally does consume a lot of computer resources. The image below shows how much of my resources were used during model fine-tuning:

(Screenshot: local CPU and memory usage during model fine-tuning)

You can access the repository here. This project was carried out using Python version 3.13.

PS C:\Users\Administrator\Documents\financial_inclusion_in_africa_full\financial_inclusion_in_africa> python --version
Python 3.13.1

Let's get started.


1. Data Cleaning and Exploratory Data Analysis (EDA)

EDA, or Exploratory Data Analysis, is all about understanding your dataset—spotting patterns, gaining insights, and identifying any issues that might need fixing. From my experience, EDA often goes hand in hand with data cleaning. While Data Analysts usually lead this process, Data Scientists also get involved, especially when it comes to more technical or statistical aspects.

If you're looking for a practical example, feel free to check out this article I wrote about a previous project. It covers some basics of data cleaning and EDA. You can also dive into the project’s repository for the code.


What Makes EDA Important?

There’s no single "right" way to do EDA—it really depends on the data, your goals, and even personal or team preferences. For example:

  • Your background matters. If you're technically inclined or have domain expertise, you might notice patterns others don’t.
  • Tools matter. While I use Python libraries like pandas and seaborn, tools like Power BI are fantastic for creating quick, intuitive visualizations.

My Approach to EDA

When working with tabular data, I typically start by answering basic questions like the ones below (a short pandas sketch follows the list):

  1. For numerical features:

    • What’s the average, maximum, and minimum?
    • Are there outliers or strange patterns?
  2. For categorical features:

    • How many unique categories exist?
    • Are the categories evenly distributed, or do we have imbalances?
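Here's a minimal pandas sketch of those first checks. The file name and the education_level column are placeholders; swap in the actual names from the competition files:

import pandas as pd

df = pd.read_csv("Train.csv")  # training file name assumed

# Numerical features: mean, min, max and a first look for outliers
print(df.describe())

# Categorical features: how many categories, and how balanced they are
print(df["education_level"].nunique())
print(df["education_level"].value_counts(normalize=True))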

From there, I dive deeper depending on the dataset. For example:

  • Chi-square tests can help identify relationships between categorical variables (a quick sketch follows below).
  • Benford's Law is great for spotting unusual patterns in numerical data, like financial figures.

These are just starting points—your EDA will evolve as you explore more of the data.
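As an example of the first bullet, here's a hedged sketch of a chi-square test of independence with scipy, reusing the df from the sketch above (a small p-value suggests the two categorical variables are related):

from scipy.stats import chi2_contingency
import pandas as pd

# Cross-tabulate two categorical columns (column names are illustrative)
contingency = pd.crosstab(df["location_type"], df["bank_account"])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")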


Common EDA libraries

Python has some great libraries for EDA:

  • pandas and numpy: Perfect for manipulating and exploring data.
  • matplotlib.pyplot and seaborn: Go-to options for creating clear and impactful visualizations.
  • pyforest: A handy library that imports commonly used libraries for you, saving time. You can check out pyforest's repository here to see what libraries come with it.

Why EDA Matters

EDA is a crucial step in any data project because it sets the foundation for everything else. It helps you:

  • Understand your data: Gain insights that lead to better decisions during model training.
  • Solve problems early: Spot issues like class imbalances or missing data before they derail your project.
  • Guide your next steps: For instance, it might inspire new feature engineering ideas or show you which actions (like applying class weights) are necessary during model training.

See It in Action

Curious about how I approached EDA for the Financial Inclusion in Africa project? Check out this notebook. It’s all there, and you’re welcome to build on it or adapt it for this or your own projects.

EDA isn’t just a box to tick—it’s your opportunity to really connect with your data and get creative. So, don’t rush it! Data cleaning and EDA are often said to account for 70% to 80% of the total time spent on developing a machine learning model.


2. Feature Engineering

Feature engineering is all about turning raw data into meaningful features that can help your machine learning model perform better.

In my case, I combined values from two columns in the dataset—such as country and location_type—to create a new feature called geographical_location. This resulted in values like Kenya_Urban and Uganda_Rural. It’s a straightforward example, but even small steps like this can have a meaningful impact on your model’s performance.
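In pandas, that combination is a one-liner. Here's a minimal sketch, assuming the training DataFrame is called df:

# Combine country and location_type into a single categorical feature,
# producing values like "Kenya_Urban" or "Uganda_Rural"
df["geographical_location"] = df["country"] + "_" + df["location_type"]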

Of course, feature engineering doesn’t stop there. Depending on your dataset and goals, you can get really creative. You might extract time-based patterns, combine multiple columns in unique ways, or apply advanced transformations to uncover deeper insights. It’s all about finding features that help your model make better predictions.


3. Model Creation and Evaluation

This is where the magic happens—the phase where data scientists roll up their sleeves, train models, and keep a close eye on how the model is learning. Admit it, we’ve all had those moments staring endlessly at our screens as the training process runs. 😄

After completing the Exploratory Data Analysis (EDA), you should have a clear understanding of the type of problem you're tackling. In most cases, it's either a regression problem or a classification problem. For this project, it was a classification problem, so I explored commonly used algorithms like Logistic Regression, Random Forest Classifier, and XGBoost.

Choosing Evaluation Metrics

Selecting the right evaluation metrics is just as important as selecting the right algorithm. For classification problems, the following metrics are commonly used (a small worked example follows the list):

  • Recall: measures the ability to find all positive instances, TP / (TP + FN).
  • Precision: measures the accuracy of positive predictions, TP / (TP + FP).
  • F1 Score: the harmonic mean of precision and recall, 2 / ((1 / Precision) + (1 / Recall)).
  • Accuracy: the ratio of correct predictions to all predictions.
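As a quick worked example with made-up counts (not results from this project): suppose a model produces 80 true positives, 20 false positives, and 10 false negatives.

TP, FP, FN = 80, 20, 10  # hypothetical counts

precision = TP / (TP + FP)                  # 80 / 100 = 0.80
recall = TP / (TP + FN)                     # 80 / 90  ≈ 0.889
f1 = 2 / ((1 / precision) + (1 / recall))   # ≈ 0.842

print(precision, recall, f1)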

In deep learning, for instance, you might use metrics like validation loss or mean average precision (mAP) at different thresholds (e.g., 50%, 75%).


My Approach

Below is the code I used to train my model. It involves key steps like oversampling the minority class using SMOTE, fitting models, and evaluating them using various metrics.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, accuracy_score, precision_score,
    recall_score, f1_score, roc_curve, auc, confusion_matrix, ConfusionMatrixDisplay
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from imblearn.over_sampling import SMOTE

# Step 1: Oversampling the Minority Class
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Step 2: Define a Model Evaluation Function
def evaluate_model(model, X_train, y_train, X_test, y_test, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Performance Metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    roc_auc = auc(fpr, tpr)

    # Plot ROC Curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve for {model_name}')
    plt.legend()
    plt.show()

    # Confusion Matrix
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap=plt.cm.Blues)
    plt.title(f'Confusion Matrix for {model_name}')
    plt.show()

    # Classification Report
    print(f"Classification Report for {model_name}:\n{classification_report(y_test, y_pred)}")

    return accuracy, precision, recall, f1, roc_auc

# Step 3: Train and Evaluate Models
log_reg = LogisticRegression(class_weight='balanced', random_state=42)
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
xgb_model = xgb.XGBClassifier(scale_pos_weight=(y_train.value_counts()[0] / y_train.value_counts()[1]), random_state=42)

log_reg_results = evaluate_model(log_reg, X_res, y_res, X_test, y_test, 'Logistic Regression')
rf_results = evaluate_model(rf, X_res, y_res, X_test, y_test, 'Random Forest')
xgb_results = evaluate_model(xgb_model, X_res, y_res, X_test, y_test, 'XGBoost')

# Step 4: Summarize Results
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Accuracy': [log_reg_results[0], rf_results[0], xgb_results[0]],
    'Precision': [log_reg_results[1], rf_results[1], xgb_results[1]],
    'Recall': [log_reg_results[2], rf_results[2], xgb_results[2]],
    'F1 Score': [log_reg_results[3], rf_results[3], xgb_results[3]],
    'ROC AUC': [log_reg_results[4], rf_results[4], xgb_results[4]],
})
print(results)

Key Takeaways:

  1. Oversampling with SMOTE: Helps balance the dataset by synthetically generating new samples for the minority class. You could also check out SMOTETomek, which combines SMOTE oversampling with Tomek-link cleaning of borderline samples. Read about it here (a short sketch follows this list).
  2. Model Evaluation: Always analyze metrics beyond accuracy—precision, recall, and F1 Score give a more detailed understanding of model performance.
  3. Visualization: Tools like ROC curves and confusion matrices provide a deeper look at how well your model is performing.
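If you want to try SMOTETomek, the swap is small: it lives in imblearn.combine and exposes the same fit_resample interface. A sketch, assuming the same X_train and y_train as in the code above:

from imblearn.combine import SMOTETomek

# SMOTE oversampling followed by Tomek-link cleaning of borderline samples
smote_tomek = SMOTETomek(random_state=42)
X_res, y_res = smote_tomek.fit_resample(X_train, y_train)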

You can adapt this workflow or expand on it based on your dataset and goals!

After the initial training phase, you might choose to fine-tune the best-performing model to optimize its performance. Alternatively, you can go a step further and train an ensemble model, which combines multiple models to achieve better results. I prefer the latter approach—it often provides more robust and accurate predictions.

Here's how I trained and fine-tuned an ensemble model:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# Assuming X_train, X_test, y_train, y_test are already defined

# Step 1: Apply SMOTE to oversample the minority class in the training data
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Step 2: Define base models (Logistic Regression and Random Forest)
log_reg = LogisticRegression(class_weight='balanced', random_state=42)
rf = RandomForestClassifier(class_weight='balanced', random_state=42)

# Step 3: Create a Voting Classifier ensemble
ensemble_model = VotingClassifier(estimators=[
    ('log_reg', log_reg),
    ('rf', rf)
], voting='soft')  # 'soft' voting uses predicted probabilities

# Step 4: Hyperparameter tuning for the ensemble model
# We can tune hyperparameters of both base models
param_grid = {
    'log_reg__C': [0.1, 1, 10],  # Regularization strength for Logistic Regression
    'rf__n_estimators': [50, 100, 200],  # Number of trees in Random Forest
    'rf__max_depth': [None, 10, 20],  # Maximum depth of the trees
    'rf__min_samples_split': [2, 5],  # Minimum samples required to split an internal node
    'rf__min_samples_leaf': [1, 2]  # Minimum samples required to be at a leaf node
}

# Using GridSearchCV to search for the best hyperparameters
grid_search = GridSearchCV(ensemble_model, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_res, y_res)

# Step 5: Evaluate the best model from the grid search
best_model = grid_search.best_estimator_

# Step 6: Predictions and evaluation on the test set
y_pred = best_model.predict(X_test)

# Performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# ROC curve
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Ensemble Model')
plt.legend(loc='lower right')
plt.show()

# Plot Confusion Matrix
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=np.unique(y_test))
cm_display.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix for Ensemble Model')
plt.show()

# Print Classification Report
print("Classification Report for Ensemble Model:\n")
print(classification_report(y_test, y_pred))

# Summary of results
results = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1,
    'ROC AUC': roc_auc
}

print("\nPerformance Metrics for the Best Ensemble Model:")
for metric, value in results.items():
    print(f"{metric}: {value:.4f}")

The hyperparameters that were tuned here are:

param_grid = {
    'log_reg__C': [0.1, 1, 10],  # Regularization strength for Logistic Regression
    'rf__n_estimators': [50, 100, 200],  # Number of trees in Random Forest
    'rf__max_depth': [None, 10, 20],  # Maximum depth of the trees
    'rf__min_samples_split': [2, 5],  # Minimum samples required to split an internal node
    'rf__min_samples_leaf': [1, 2]  # Minimum samples required to be at a leaf node
}

You can play around with these hyperparameters and see how your model performs. For the real test, use the test dataset to confirm your MAE. Use the following code to save your predictions:

submission = pd.DataFrame({
    "uniqueid": test["uniqueid"] + " x " + test["country"],  # Concatenate uniqueid and country
    "bank_account": predictions  # Add predictions to the DataFrame
})

# Save the submission to a CSV file
submission.to_csv('file_name.csv', index=False)

4. Model Deployment on GCP

As a new programmer, you might be wondering, "What exactly does deployment mean?" In technical terms, deployment is the process of transferring code from a development environment (where you build and test it) to a live or production environment, where it can be accessed and used by others.

Think of it this way: Imagine you’re a fantastic swimmer, but no one knows about your skills because you’ve only practiced in your backyard pool. To showcase your talent, you decide to participate in a swimming competition at your school or another venue. Moving from your backyard pool to the competition arena is like deployment—you’re taking your skills (or code) to a public stage where others can see and benefit from them.

Running the App Locally

To run the app locally on your computer, where it will use your machine’s resources, follow these steps:

  1. Navigate to the app folder where the code resides. Use the following command in your terminal:
   cd app
  2. Once you’re inside the folder, run the application using this command:
   python main.py

For example:

   (env) PS C:\Users\Administrator\Documents\financial_inclusion_in_africa_full\financial_inclusion_in_africa\app> python .\main.py

Note: Don’t be alarmed by the (env) you see in the command line—it simply indicates that you’re working within a virtual environment. Virtual environments are a best practice when working on projects, as they help isolate and consolidate all the dependencies required to run your app.
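If you haven't set one up before, creating and activating a virtual environment in PowerShell looks roughly like this (the name env is just a convention):

python -m venv env                  # create the virtual environment
.\env\Scripts\activate              # activate it; the (env) prefix appears in your prompt
pip install -r requirements.txt     # install the project's dependencies inside it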

Earlier, I mentioned not being too fond of Docker, but I must admit: Docker images are fantastic for ensuring your app runs smoothly across different environments. With Docker, the app would work just as seamlessly on my computer as it would on yours.

  3. After running the command, you should see an output similar to this:
   * Serving Flask app 'main'
   * Debug mode: on
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
   * Running on http://127.0.0.1:5000
   Press CTRL+C to quit
   * Restarting with stat
   * Debugger is active!
   * Debugger PIN: 112-088-710
  4. Copy the URL http://127.0.0.1:5000 and paste it into your web browser of choice. This will open the app, where you can interact with it.

Since the model is located in the app folder, you’ll be able to make inferences using it.

Below is a screenshot of the app’s interface. Note that the design is basic as it uses simple HTML. However, you can enhance it further by incorporating CSS for a more polished look.

(Screenshot: the app's interface rendered from the basic HTML template)

A Few Things to Note Before Using the App:

  1. Python Version: Ensure you’re using Python 3.13.1 for compatibility.
  2. Preprocessing Inputs: The input data must be preprocessed to match the way the model was trained. For instance, our feature engineering involved combining certain features from the inputs—you’ll need to replicate this step for the app to function correctly (see the sketch after this list).
  3. Requirements: You will need to have all the libraries that are defined in the requirements.txt file. You can run the following command:
pip install -r requirements.txt
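To illustrate point 2, here's a hedged sketch of what the prediction route in main.py could look like. The form field names are placeholders, and it assumes the saved ensemble_model2.joblib is a full pipeline that handles its own encoding; adapt it to however the model was actually trained:

import joblib
import pandas as pd
from flask import Flask, request, render_template

app = Flask(__name__)
model = joblib.load("ensemble_model2.joblib")  # pre-trained model file from the app folder

@app.route("/predict", methods=["POST"])
def predict():
    # Form field names below are illustrative placeholders
    country = request.form["country"]
    location_type = request.form["location_type"]

    # Replicate the training-time feature engineering, e.g. the combined
    # geographical_location feature, before calling the model
    features = pd.DataFrame([{
        "geographical_location": f"{country}_{location_type}",
        # ...plus any other features the trained pipeline expects...
    }])
    prediction = model.predict(features)[0]
    return render_template("index.html", prediction=int(prediction))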

With that out of the way, you’ve successfully used the app to make inferences! Notice the address the app is running on? http://127.0.0.1:5000 is the loopback address, which means the app is only reachable from your own machine.

Next Steps: Deploying the App on GCP

Let’s take your app to the next level by deploying it to Google Cloud Platform (GCP), making it accessible to others online.

Step 1: Create a GCP Account

Visit the GCP Console to create an account. Upon signing up, you’ll receive $300 in free credits for 90 days—don’t forget to activate your account!

Step 2: Prepare Your Application Files

Ensure your app folder is ready with the following structure:

  1. main.py: This file contains your app’s main logic. GCP will look for this file during deployment.
  2. app.yaml: Specifies the runtime environment. For example, if you’re using Python 3.13.1, set the runtime as python313 in this file (drop the dots and the patch version). A minimal example follows this list.
  3. ensemble_model2.joblib: Your pre-trained model file.
  4. templates folder: Contains your index.html file for the app’s frontend.
  5. requirements.txt: Lists the Python libraries your app needs to function.
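For reference, a minimal app.yaml for the App Engine standard environment can be a single line (assuming the python313 runtime is available; check the supported runtimes linked in the final notes):

runtime: python313  # App Engine's default entrypoint serves the Flask app object in main.py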

Step 3: Set Up a Project in GCP

  1. Log in to the GCP Console.
  2. Navigate to IAM & Admin > Manage Resources.
  3. Create a new project—this will serve as the container for your app and its resources.

Step 4: Install the Google Cloud SDK

Download and install the Google Cloud SDK. This command-line tool lets you deploy and manage your GCP resources directly from your terminal or IDE (e.g., VSCode).

Step 5: Initialize the Google Cloud CLI

Navigate to the directory containing your app files. For example:

(env) PS C:\Users\Administrator\Documents\financial_inclusion_in_africa_full\financial_inclusion_in_africa\app>

Run the following command:

gcloud init

The CLI will guide you through the following steps:

  1. Select or create a configuration: You can reinitialize an existing configuration or create a new one (e.g., new2).
  2. Authenticate: Log in with the same Google account you used for GCP.
  3. Set the project: Choose the project you created earlier in your GCP Console (e.g., financialinclusioninafrica).

Step 6: Deploy Your App

With everything in place, deploy your app using:

gcloud app deploy app.yaml --project [YOUR_PROJECT_NAME]

Replace [YOUR_PROJECT_NAME] with the actual name of your project (e.g., financialinclusioninafrica).

During deployment:

  1. Choose the App Engine region closest to your users.
  2. Confirm the deployment by typing Y.
  3. Wait a few minutes for the process to complete.

Once deployed, you’ll receive a URL for your app. Share this link, and anyone can access and interact with your application.

Step 7: Disable the App to Avoid Costs

To minimize charges, disable the app when it’s no longer in use:

  1. Go to App Engine > Settings in the GCP Console.
  2. Select your project.
  3. Disable the app or delete unused resources.

Final Notes

  • Before deploying, check the supported runtimes for GCP to ensure compatibility.
  • If GCP doesn’t suit your needs, you can explore alternatives like Render.com. Alternatively, tools like ngrok can expose a locally running app through a public URL without a full deployment.
  • Docker is another option for ensuring your app runs smoothly across environments, though it's optional here.

That’s it! Your app is now live and ready to use. Cheers, and happy coding!
