<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kamau Gilbert Mungai</title>
    <description>The latest articles on DEV Community by Kamau Gilbert Mungai (@kamaugilbert).</description>
    <link>https://dev.to/kamaugilbert</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1861417%2F7fa1ee6e-5517-4a41-8fa0-8517206988eb.jpg</url>
      <title>DEV Community: Kamau Gilbert Mungai</title>
      <link>https://dev.to/kamaugilbert</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kamaugilbert"/>
    <language>en</language>
    <item>
      <title>Financial Inclusion in Africa: A Zindi Project</title>
      <dc:creator>Kamau Gilbert Mungai</dc:creator>
      <pubDate>Sat, 25 Jan 2025 08:50:45 +0000</pubDate>
      <link>https://dev.to/kamaugilbert/financial-inclusion-in-africa-a-zindi-project-c4a</link>
      <guid>https://dev.to/kamaugilbert/financial-inclusion-in-africa-a-zindi-project-c4a</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this article, I will guide you through my approach to tackling the &lt;a href="https://zindi.africa/competitions/financial-inclusion-in-africa" rel="noopener noreferrer"&gt;Financial Inclusion in Africa&lt;/a&gt; project by Zindi. While the primary goal of this project is to predict which individuals are most likely to have or use a bank account, my main focus was to learn how to deploy a machine learning model on &lt;a href="https://console.cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud Platform (GCP)&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;You can use my methodology as a reference to implement the project on your own, adapting it as needed. During my attempt, I fine-tuned the model once and achieved a Mean Absolute Error (MAE) score of &lt;strong&gt;0.122942692&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;Now, you might be wondering: &lt;em&gt;"Isn't this a classification problem? Why use MAE as a metric?"&lt;/em&gt; Fellow Zindians provided a clever explanation:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You're on the right track! In some cases, classification problems use Mean Absolute Error (MAE) when treating classes as numerical values. If the model predicts 0 instead of 1, the error is |0-1| = 1. MAE measures how far off predictions are from the true classes on average, unlike accuracy, which just counts correct predictions. If there's an ordinal relationship between classes, this method can be useful."  &lt;/p&gt;
&lt;/blockquote&gt;
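
&lt;p&gt;To make that concrete, here is a minimal sketch (my own illustration, not competition code) showing that MAE on 0/1 labels is simply the fraction of misclassified samples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import mean_absolute_error

# Hypothetical binary labels: 1 = has a bank account, 0 = does not
y_true = [1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0]

# Each error contributes |true - pred| = 1, so MAE = misclassified / total
print(mean_absolute_error(y_true, y_pred))  # 2 wrong out of 5 -&gt; 0.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;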

&lt;p&gt;In this article, I'll outline the steps I followed, leaving room for you to build upon and improve with your unique ideas.  &lt;/p&gt;

&lt;p&gt;The project is structured into four key phases:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Cleaning and Exploratory Data Analysis (EDA)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Engineering&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Creation and Evaluation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Deployment on GCP&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Please note that I trained the model locally; however, GCP also offers the capability to train models directly in the cloud on its powerful instances. That process involves building a Docker image, a topic I plan to cover in detail on another occasion (honestly, I'm not a big Docker fan, but I do like a nice operating system).&lt;/p&gt;

&lt;p&gt;And yes, training the model locally does consume a lot of computer resources. The image below shows how much of my machine's resources were used during model fine-tuning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn99wveqbin9r13kd4hg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn99wveqbin9r13kd4hg.png" alt="Image description" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can access the repository &lt;a href="https://github.com/KamauGilbert/financial_inclusion_in_africa" rel="noopener noreferrer"&gt;here&lt;/a&gt;. This project was carried out using Python 3.13.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PS&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;\&lt;span class="n"&gt;Users&lt;/span&gt;\&lt;span class="n"&gt;Administrator&lt;/span&gt;\&lt;span class="n"&gt;Documents&lt;/span&gt;\&lt;span class="n"&gt;financial_inclusion_in_africa_full&lt;/span&gt;\&lt;span class="n"&gt;financial_inclusion_in_africa&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.13&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's get started.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Data Cleaning and Exploratory Data Analysis (EDA)
&lt;/h3&gt;

&lt;p&gt;EDA, or Exploratory Data Analysis, is all about understanding your dataset—spotting patterns, gaining insights, and identifying any issues that might need fixing. From my experience, EDA often goes hand in hand with data cleaning. While Data Analysts usually lead this process, Data Scientists also get involved, especially when it comes to more technical or statistical aspects.&lt;/p&gt;

&lt;p&gt;If you're looking for a practical example, feel free to check out &lt;a href="https://dev.to/kamaugilbert/nairobi-county-property-price-prediction-model-technical-walkthrough-on-model-creation-pkf"&gt;this article&lt;/a&gt; I wrote about a previous project. It covers some basics of data cleaning and EDA. You can also dive into the project’s &lt;a href="https://github.com/KamauGilbert/nairobi_house_price_prediction_model" rel="noopener noreferrer"&gt;repository&lt;/a&gt; for the code.&lt;/p&gt;




&lt;h4&gt;
  
  
  What Makes EDA Important?
&lt;/h4&gt;

&lt;p&gt;There’s no single "right" way to do EDA—it really depends on the data, your goals, and even personal or team preferences. For example:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your background matters.&lt;/strong&gt; If you're technically inclined or have domain expertise, you might notice patterns others don’t.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools matter.&lt;/strong&gt; While I use Python libraries like &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;seaborn&lt;/code&gt;, tools like Power BI are fantastic for creating quick, intuitive visualizations.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  My Approach to EDA
&lt;/h4&gt;

&lt;p&gt;When working with tabular data, I typically start by answering basic questions like the following (a quick &lt;code&gt;pandas&lt;/code&gt; sketch follows the list):  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;For numerical features:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What’s the average, maximum, and minimum?
&lt;/li&gt;
&lt;li&gt;Are there outliers or strange patterns?
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For categorical features:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many unique categories exist?
&lt;/li&gt;
&lt;li&gt;Are the categories evenly distributed, or do we have imbalances?
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
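
&lt;p&gt;Here is a minimal &lt;code&gt;pandas&lt;/code&gt; sketch of those first-pass checks, assuming the data has been loaded into a DataFrame called &lt;code&gt;df&lt;/code&gt; (the file name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('train.csv')  # hypothetical file name

# Numerical features: average, max, min (quartiles help spot outliers)
print(df.describe())

# Categorical features: how many unique categories, and are they balanced?
print(df['country'].nunique())
print(df['country'].value_counts(normalize=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;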

&lt;p&gt;From there, I dive deeper depending on the dataset. For example:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.geeksforgeeks.org/ml-chi-square-test-for-feature-selection/" rel="noopener noreferrer"&gt;Chi-square tests&lt;/a&gt;&lt;/strong&gt; can help identify relationships between categorical variables.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.geeksforgeeks.org/benfords-law/" rel="noopener noreferrer"&gt;Benford's Law&lt;/a&gt;&lt;/strong&gt; is great for spotting unusual patterns in numerical data, like financial figures.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are just starting points—your EDA will evolve as you explore more of the data.  &lt;/p&gt;
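
&lt;p&gt;For instance, here is a hedged sketch of a chi-square independence test between a categorical feature and the target, using &lt;code&gt;scipy&lt;/code&gt; (the &lt;code&gt;df&lt;/code&gt; DataFrame and column names are assumptions on my part):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical check: is location_type independent of the bank_account target?
table = pd.crosstab(df['location_type'], df['bank_account'])
chi2, p_value, dof, expected = chi2_contingency(table)

# A small p-value suggests the two variables are related
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;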




&lt;h4&gt;
  
  
  Common EDA libraries
&lt;/h4&gt;

&lt;p&gt;Python has some great libraries for EDA:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt;&lt;/strong&gt;: Perfect for manipulating and exploring data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;matplotlib.pyplot&lt;/code&gt; and &lt;code&gt;seaborn&lt;/code&gt;&lt;/strong&gt;: Go-to options for creating clear and impactful visualizations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pyforest&lt;/code&gt;&lt;/strong&gt;: A handy library that imports commonly used libraries for you, saving time. You can check out pyforest's repository &lt;a href="https://github.com/8080labs/pyforest?tab=readme-ov-file" rel="noopener noreferrer"&gt;here&lt;/a&gt; to see what libraries come with it. A short usage sketch follows this list.&lt;/li&gt;
&lt;/ul&gt;
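
&lt;p&gt;As a quick illustration of the idea behind &lt;code&gt;pyforest&lt;/code&gt;, here is a minimal sketch based on its documented usage (imports are triggered lazily, only when first used):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyforest import *

# pd, sns and plt are lazy placeholders until first used
df = pd.DataFrame({'age': [23, 31, 45]})
sns.histplot(df['age'])
plt.show()

# Lists the imports that were actually triggered
print(active_imports())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;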




&lt;h4&gt;
  
  
  Why EDA Matters
&lt;/h4&gt;

&lt;p&gt;EDA is a crucial step in any data project because it sets the foundation for everything else. It helps you:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand your data:&lt;/strong&gt; Gain insights that lead to better decisions during model training.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solve problems early:&lt;/strong&gt; Spot issues like class imbalances or missing data before they derail your project.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guide your next steps:&lt;/strong&gt; For instance, it might inspire new feature engineering ideas or show you which actions (like applying class weights) are necessary during model training.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  See It in Action
&lt;/h4&gt;

&lt;p&gt;Curious about how I approached EDA for the &lt;em&gt;Financial Inclusion in Africa&lt;/em&gt; project? Check out &lt;a href="https://github.com/KamauGilbert/financial_inclusion_in_africa/blob/main/financial_inclusion.ipynb" rel="noopener noreferrer"&gt;this notebook&lt;/a&gt;. It’s all there, and you’re welcome to build on it or adapt it for this or your own projects.&lt;/p&gt;

&lt;p&gt;EDA isn’t just a box to tick—it’s your opportunity to really connect with your data and get creative. So, don’t rush it! Cleaning and EDA are often said to account for 70% to 80% of the total time spent developing a machine learning model.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Feature Engineering
&lt;/h3&gt;

&lt;p&gt;Feature engineering is all about turning raw data into meaningful features that can help your machine learning model perform better. &lt;/p&gt;

&lt;p&gt;In my case, I combined values from two columns in the dataset—such as &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;location_type&lt;/code&gt;—to create a new feature called &lt;code&gt;geographical_location&lt;/code&gt;. This resulted in values like &lt;code&gt;Kenya_Urban&lt;/code&gt; and &lt;code&gt;Uganda_Rural&lt;/code&gt;. It’s a straightforward example, but even small steps like this can have a meaningful impact on your model’s performance.&lt;/p&gt;
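
&lt;p&gt;In &lt;code&gt;pandas&lt;/code&gt;, that combination is a one-liner; here is a minimal sketch, assuming the dataset lives in a DataFrame called &lt;code&gt;df&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Combine country and location_type into one categorical feature,
# yielding values like 'Kenya_Urban' and 'Uganda_Rural'
df['geographical_location'] = df['country'] + '_' + df['location_type']
print(df['geographical_location'].value_counts().head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;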

&lt;p&gt;Of course, feature engineering doesn’t stop there. Depending on your dataset and goals, you can get really creative. You might extract time-based patterns, combine multiple columns in unique ways, or apply advanced transformations to uncover deeper insights. It’s all about finding features that help your model make better predictions.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Model Creation and Evaluation
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens—the phase where data scientists roll up their sleeves, train models, and keep a close eye on how the model is learning. Admit it, we’ve all had those moments staring endlessly at our screens as the training process runs. 😄&lt;/p&gt;

&lt;p&gt;After completing the Exploratory Data Analysis (EDA), you should have a clear understanding of the type of problem you're tackling. In most cases, it's either a &lt;a href="https://www.geeksforgeeks.org/regression-in-machine-learning/" rel="noopener noreferrer"&gt;regression problem&lt;/a&gt; or a &lt;a href="https://www.geeksforgeeks.org/getting-started-with-classification/" rel="noopener noreferrer"&gt;classification problem&lt;/a&gt;. For this project, it was a classification problem, so I explored commonly used algorithms like &lt;strong&gt;Logistic Regression&lt;/strong&gt;, &lt;strong&gt;Random Forest Classifier&lt;/strong&gt;, and &lt;strong&gt;XGBoost&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Choosing Evaluation Metrics
&lt;/h4&gt;

&lt;p&gt;Selecting the right evaluation metrics is just as important as selecting the right algorithm. For classification problems, commonly used metrics include (a small worked example follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt;: Measures the ability to find all positive instances, TP / (TP + FN).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt;: Measures the accuracy of positive predictions, TP / (TP + FP).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 Score&lt;/strong&gt;: Balances precision and recall, 2 / ((1 / Precision) + (1 / Recall)).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: Simply measures the ratio of correct predictions to all predictions.&lt;/li&gt;
&lt;/ul&gt;
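
&lt;p&gt;As a small worked example (with made-up predictions), here is how those formulas line up with scikit-learn's implementations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# TP = 2, FP = 1, FN = 1
# Precision = TP / (TP + FP) = 2/3; Recall = TP / (TP + FN) = 2/3
print(precision_score(y_true, y_pred))  # 0.666...
print(recall_score(y_true, y_pred))     # 0.666...
print(f1_score(y_true, y_pred))         # harmonic mean, also 0.666...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;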

&lt;p&gt;In deep learning, for instance, you might use metrics like validation loss or mean average precision (mAP) at different thresholds (e.g., 50%, 75%).&lt;/p&gt;




&lt;h4&gt;
  
  
  My Approach
&lt;/h4&gt;

&lt;p&gt;Below is the code I used to train my model. It involves key steps like oversampling the minority class using &lt;strong&gt;SMOTE&lt;/strong&gt;, fitting models, and evaluating them using various metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roc_curve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConfusionMatrixDisplay&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;imblearn.over_sampling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SMOTE&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Oversampling the Minority Class
&lt;/span&gt;&lt;span class="n"&gt;smote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SMOTE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampling_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smote&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Define a Model Evaluation Function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Performance Metrics
&lt;/span&gt;    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;roc_auc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;auc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Plot ROC Curve
&lt;/span&gt;    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC curve (AUC = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;roc_auc&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;False Positive Rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;True Positive Rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC Curve for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Confusion Matrix
&lt;/span&gt;    &lt;span class="n"&gt;ConfusionMatrixDisplay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_predictions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Blues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Confusion Matrix for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Classification Report
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classification Report for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roc_auc&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Train and Evaluate Models
&lt;/span&gt;&lt;span class="n"&gt;log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xgb_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale_pos_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;log_reg_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_reg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Logistic Regression&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random Forest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xgb_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xgb_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;XGBoost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Summarize Results
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Logistic Regression&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random Forest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;XGBoost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Recall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F1 Score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC AUC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  Key Takeaways:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Oversampling with SMOTE&lt;/strong&gt;: Helps balance the dataset by synthetically generating new samples for the minority class. You could also check out SMOTETomek, which combines SMOTE with Tomek-link cleaning (see the sketch after this list). Read about it &lt;a href="https://towardsdatascience.com/imbalanced-classification-in-python-smote-tomek-links-method-6e48dfe69bbc" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Evaluation&lt;/strong&gt;: Always analyze metrics beyond accuracy—precision, recall, and F1 Score give a more detailed understanding of model performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: Tools like ROC curves and confusion matrices provide a deeper look at how well your model is performing.&lt;/li&gt;
&lt;/ol&gt;
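
&lt;p&gt;For the SMOTETomek variant mentioned in the first takeaway, here is a minimal sketch, assuming the same &lt;code&gt;X_train&lt;/code&gt; and &lt;code&gt;y_train&lt;/code&gt; as before:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from imblearn.combine import SMOTETomek

# SMOTE oversamples the minority class, then Tomek links are removed
# to clean up noisy samples near the class boundary
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)
print(y_res.value_counts())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;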

&lt;p&gt;You can adapt this workflow or expand on it based on your dataset and goals!&lt;/p&gt;

&lt;p&gt;After the initial training phase, you might choose to fine-tune the best-performing model to optimize its performance. Alternatively, you can go a step further and train an &lt;a href="https://builtin.com/machine-learning/ensemble-model" rel="noopener noreferrer"&gt;ensemble model&lt;/a&gt;, which combines multiple models to achieve better results. I prefer the latter approach—it often provides more robust and accurate predictions.&lt;/p&gt;

&lt;p&gt;Here's how I trained and fine-tuned an ensemble model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VotingClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# Assuming X_train, X_test, y_train, y_test are already defined
&lt;/span&gt;
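&lt;span class="c1"&gt;# (numpy, matplotlib, SMOTE, RandomForestClassifier and the metric functions
# imported in the previous code block are assumed to still be in scope)
&lt;/span&gt;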
&lt;span class="c1"&gt;# Step 1: Apply SMOTE to oversample the minority class in the training data
&lt;/span&gt;&lt;span class="n"&gt;smote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SMOTE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampling_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smote&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Define base models (Logistic Regression and Random Forest)
&lt;/span&gt;&lt;span class="n"&gt;log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Create a Voting Classifier ensemble
&lt;/span&gt;&lt;span class="n"&gt;ensemble_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VotingClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_reg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_reg&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;voting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;soft&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 'soft' voting uses predicted probabilities
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 4: Hyperparameter tuning for the ensemble model
# We can tune hyperparameters of both base models
&lt;/span&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_reg__C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Regularization strength for Logistic Regression
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Number of trees in Random Forest
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Maximum depth of the trees
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__min_samples_split&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum samples required to split an internal node
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum samples required to be at a leaf node
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Using GridSearchCV to search for the best hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;grid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ensemble_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Evaluate the best model from the grid search
&lt;/span&gt;&lt;span class="n"&gt;best_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;

&lt;span class="c1"&gt;# Step 6: Predictions and evaluation on the test set
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Performance metrics
&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ROC curve
&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;roc_auc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;auc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Confusion Matrix
&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plot ROC curve
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;darkorange&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC curve (area = %0.2f)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;roc_auc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;navy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;False Positive Rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;True Positive Rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC Curve for Ensemble Model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lower right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Plot Confusion Matrix
&lt;/span&gt;&lt;span class="n"&gt;cm_display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConfusionMatrixDisplay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cm_display&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Blues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Confusion Matrix for Ensemble Model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Print Classification Report
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classification Report for Ensemble Model:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Summary of results
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Recall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F1 Score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC AUC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;roc_auc&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Performance Metrics for the Best Ensemble Model:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hyperparameters that were tuned here are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_reg__C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Regularization strength for Logistic Regression
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Number of trees in Random Forest
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Maximum depth of the trees
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__min_samples_split&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum samples required to split an internal node
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum samples required to be at a leaf node
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
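
&lt;p&gt;For context, here is a minimal sketch of how a grid like this could be plugged into a search. It assumes a soft-voting ensemble whose estimators are named &lt;code&gt;log_reg&lt;/code&gt; and &lt;code&gt;rf&lt;/code&gt;, matching the prefixes in &lt;code&gt;param_grid&lt;/code&gt;; your actual setup may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Estimator names must match the 'log_reg__' and 'rf__' prefixes in param_grid
ensemble = VotingClassifier(
    estimators=[
        ('log_reg', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(random_state=42)),
    ],
    voting='soft',
)

# 5-fold cross-validated search over param_grid
grid_search = GridSearchCV(ensemble, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;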



&lt;p&gt;You can play around with these hyperparameters and see how your model performs. For the real test, use the &lt;a href="https://zindi.africa/competitions/financial-inclusion-in-africa/data" rel="noopener noreferrer"&gt;Test&lt;/a&gt; dataset to confirm your MAE. Use the following code to save your predictions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;submission&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uniqueid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uniqueid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; x &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Concatenate uniqueid and country
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bank_account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;  &lt;span class="c1"&gt;# Add predictions to the DataFrame
&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Save the submission to a CSV file
&lt;/span&gt;&lt;span class="n"&gt;submission&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_name.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Model Deployment
&lt;/h3&gt;

&lt;p&gt;As a new programmer, you might be wondering, &lt;em&gt;"What exactly does deployment mean?"&lt;/em&gt; In technical terms, deployment is the process of transferring code from a development environment (where you build and test it) to a live or production environment, where it can be accessed and used by others. &lt;/p&gt;

&lt;p&gt;Think of it this way: Imagine you’re a fantastic swimmer, but no one knows about your skills because you’ve only practiced in your backyard pool. To showcase your talent, you decide to participate in a swimming competition at your school or another venue. Moving from your backyard pool to the competition arena is like deployment—you’re taking your skills (or code) to a public stage where others can see and benefit from them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running the App Locally
&lt;/h3&gt;

&lt;p&gt;To run the app locally on your computer, where it will use your machine’s resources, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to the &lt;code&gt;app&lt;/code&gt; folder where the code resides. Use the following command in your terminal:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;cd &lt;/span&gt;app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Once you’re inside the folder, run the application using this command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; PS C:&lt;span class="se"&gt;\U&lt;/span&gt;sers&lt;span class="se"&gt;\A&lt;/span&gt;dministrator&lt;span class="se"&gt;\D&lt;/span&gt;ocuments&lt;span class="se"&gt;\f&lt;/span&gt;inancial_inclusion_in_africa_full&lt;span class="se"&gt;\f&lt;/span&gt;inancial_inclusion_in_africa&lt;span class="se"&gt;\a&lt;/span&gt;pp&amp;gt; python .&lt;span class="se"&gt;\m&lt;/span&gt;ain.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Don’t be alarmed by the &lt;code&gt;(env)&lt;/code&gt; you see in the command line—it simply indicates that you’re working within a virtual environment. Virtual environments are a best practice when working on projects, as they help isolate and consolidate all the dependencies required to run your app.  &lt;/p&gt;
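
&lt;p&gt;If you haven’t set one up before, creating and activating a virtual environment on Windows looks roughly like this (the environment name &lt;code&gt;env&lt;/code&gt; is just a convention):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python -m venv env    # create the environment
env\Scripts\activate  # activate it (on Linux/macOS: source env/bin/activate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;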

&lt;p&gt;Earlier, I mentioned not being too fond of Docker, but I must admit: Docker images are fantastic for ensuring your app runs smoothly across different environments. With Docker, the app would work just as seamlessly on my computer as it would on yours.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;After running the command, you should see an output similar to this:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   * Serving Flask app 'main'
   * Debug mode: on
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
   * Running on http://127.0.0.1:5000
   Press CTRL+C to quit
   * Restarting with stat
   * Debugger is active!
   * Debugger PIN: 112-088-710
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Copy the URL &lt;code&gt;http://127.0.0.1:5000&lt;/code&gt; and paste it into your web browser of choice. This will open the app, where you can interact with it. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since the model is located in the &lt;code&gt;app&lt;/code&gt; folder, you’ll be able to make inferences using it. &lt;/p&gt;
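
&lt;p&gt;To make this concrete, here is a minimal sketch of what a &lt;code&gt;main.py&lt;/code&gt; like this might contain. It assumes a Flask app that loads the saved model with joblib and serves an &lt;code&gt;index.html&lt;/code&gt; template; the form handling is illustrative, not the exact code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from flask import Flask, render_template, request
import joblib

app = Flask(__name__)
model = joblib.load('ensemble_model2.joblib')  # pre-trained ensemble

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Illustrative only: real inputs must be preprocessed exactly
    # as they were during training (see the notes below)
    features = [[float(x) for x in request.form.values()]]
    prediction = model.predict(features)[0]
    return render_template('index.html', prediction=prediction)

if __name__ == '__main__':
    app.run(debug=True)  # serves on http://127.0.0.1:5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;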

&lt;p&gt;Below is a screenshot of the app’s interface. Note that the design is basic as it uses simple HTML. However, you can enhance it further by incorporating CSS for a more polished look. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d0unb26law65psli0z5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d0unb26law65psli0z5.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  A Few Things to Note Before Using the App:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Python Version&lt;/strong&gt;: Ensure you’re using Python 3.13.1 for compatibility. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing Inputs&lt;/strong&gt;: The input data must be preprocessed to match the way the model was trained. For instance, our feature engineering involved combining certain features from the inputs—you’ll need to replicate this step for the app to function correctly (see the illustrative sketch below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirements&lt;/strong&gt;: Install all the libraries defined in the &lt;code&gt;requirements.txt&lt;/code&gt; file by running:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
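
&lt;p&gt;On the preprocessing point above: the exact transformation depends on how the model was trained, but as a purely illustrative sketch (the column and feature names here are hypothetical), replicating an engineered feature at inference time might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Hypothetical example: rebuild a combined feature exactly the way
# it was built during training
user_input = pd.DataFrame([{'age_of_respondent': 30, 'household_size': 4}])
user_input['age_per_household'] = (
    user_input['age_of_respondent'] / user_input['household_size']
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;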



&lt;p&gt;With that out of the way, you’ve successfully used the app to make inferences! Notice the address the app is running on? It does look a bit strange: &lt;code&gt;http://127.0.0.1:5000&lt;/code&gt;. That’s the loopback address: the app is running locally and is only reachable from your own machine.&lt;/p&gt;

&lt;h4&gt;
  
  
  Next Steps: Deploying the App on GCP
&lt;/h4&gt;

&lt;p&gt;Let’s take your app to the next level by deploying it to Google Cloud Platform (GCP), making it accessible to others online.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Create a GCP Account
&lt;/h4&gt;

&lt;p&gt;Visit the &lt;a href="https://console.cloud.google.com/welcome/new?pli=1&amp;amp;inv=1&amp;amp;invt=AbnxWg" rel="noopener noreferrer"&gt;GCP Console&lt;/a&gt; to create an account. Upon signing up, you’ll receive &lt;strong&gt;$300 in free credits&lt;/strong&gt; for 90 days—don’t forget to activate your account!&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Prepare Your Application Files
&lt;/h4&gt;

&lt;p&gt;Ensure your &lt;code&gt;app&lt;/code&gt; folder is ready with the following structure:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/strong&gt;: This file contains your app’s main logic. GCP will look for this file during deployment.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;app.yaml&lt;/code&gt;&lt;/strong&gt;: Specifies the runtime environment. For example, if you’re using Python 3.13.1, set it as &lt;code&gt;python313&lt;/code&gt; in this file (drop the dots and the patch version; see the minimal example after this list).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ensemble_model2.joblib&lt;/code&gt;&lt;/strong&gt;: Your pre-trained model file.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;templates&lt;/code&gt; folder&lt;/strong&gt;: Contains your &lt;code&gt;index.html&lt;/code&gt; file for the app’s frontend.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;requirements.txt&lt;/code&gt;&lt;/strong&gt;: Lists the Python libraries your app needs to function.&lt;/li&gt;
&lt;/ol&gt;
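
&lt;p&gt;As a reference for point 2, the simplest possible &lt;code&gt;app.yaml&lt;/code&gt; needs little more than the runtime line (assuming the Python 3.13 runtime mentioned above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Minimal app.yaml: App Engine reads the runtime from here
runtime: python313
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;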

&lt;h4&gt;
  
  
  Step 3: Set Up a Project in GCP
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the &lt;a href="https://console.cloud.google.com/" rel="noopener noreferrer"&gt;GCP Console&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;IAM &amp;amp; Admin &amp;gt; Manage Resources&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Create a new project—this will serve as the container for your app and its resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Step 4: Install the Google Cloud SDK
&lt;/h4&gt;

&lt;p&gt;Download and install the &lt;a href="https://cloud.google.com/sdk/docs/install" rel="noopener noreferrer"&gt;Google Cloud SDK&lt;/a&gt;. This command-line tool lets you deploy and manage your GCP resources directly from your terminal or IDE (e.g., VSCode).  &lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5: Initialize the Google Cloud CLI
&lt;/h4&gt;

&lt;p&gt;Navigate to the directory containing your app files. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; PS C:&lt;span class="se"&gt;\U&lt;/span&gt;sers&lt;span class="se"&gt;\A&lt;/span&gt;dministrator&lt;span class="se"&gt;\D&lt;/span&gt;ocuments&lt;span class="se"&gt;\f&lt;/span&gt;inancial_inclusion_in_africa_full&lt;span class="se"&gt;\f&lt;/span&gt;inancial_inclusion_in_africa&lt;span class="se"&gt;\a&lt;/span&gt;pp&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI will guide you through the following steps:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Select or create a configuration&lt;/strong&gt;: You can reinitialize an existing configuration or create a new one (e.g., &lt;code&gt;new2&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authenticate&lt;/strong&gt;: Log in with the same Google account you used for GCP.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set the project&lt;/strong&gt;: Choose the project you created earlier in your GCP Console (e.g., &lt;code&gt;financialinclusioninafrica&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Step 6: Deploy Your App
&lt;/h4&gt;

&lt;p&gt;With everything in place, deploy your app using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud app deploy app.yaml &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;YOUR_PROJECT_NAME]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;[YOUR_PROJECT_NAME]&lt;/code&gt; with the actual name of your project (e.g., &lt;code&gt;financialinclusioninafrica&lt;/code&gt;).  &lt;/p&gt;

&lt;p&gt;During deployment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose the App Engine region closest to your users.
&lt;/li&gt;
&lt;li&gt;Confirm the deployment by typing &lt;code&gt;Y&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Wait a few minutes for the process to complete.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once deployed, you’ll receive a URL for your app. Share this link, and anyone can access and interact with your application.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 7: Disable the App to Avoid Costs
&lt;/h4&gt;

&lt;p&gt;To minimize charges, disable the app when it’s no longer in use:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;App Engine &amp;gt; Settings&lt;/strong&gt; in the GCP Console.
&lt;/li&gt;
&lt;li&gt;Select your project.
&lt;/li&gt;
&lt;li&gt;Disable the app or delete unused resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Final Notes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Before deploying, check the &lt;a href="https://cloud.google.com/functions/docs/runtime-support" rel="noopener noreferrer"&gt;supported runtimes&lt;/a&gt; for GCP to ensure compatibility.
&lt;/li&gt;
&lt;li&gt;If GCP doesn’t suit your needs, you can explore alternatives like &lt;a href="https://render.com" rel="noopener noreferrer"&gt;Render.com&lt;/a&gt;. You could also use tools like ngrok, which expose your locally running app through a public URL.&lt;/li&gt;
&lt;li&gt;Docker is another option for ensuring your app runs smoothly across environments, though it's optional here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it! Your app is now live and ready to use. Cheers, and happy coding!&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Nairobi County Property Price Prediction Model: Technical Walkthrough On Model Creation.</title>
      <dc:creator>Kamau Gilbert Mungai</dc:creator>
      <pubDate>Fri, 30 Aug 2024 10:15:38 +0000</pubDate>
      <link>https://dev.to/kamaugilbert/nairobi-county-property-price-prediction-model-technical-walkthrough-on-model-creation-pkf</link>
      <guid>https://dev.to/kamaugilbert/nairobi-county-property-price-prediction-model-technical-walkthrough-on-model-creation-pkf</guid>
      <description>&lt;p&gt;In this article, we'll walk through the creation of a real-time property price prediction model focusing on Nairobi County. You can explore my model repository &lt;a href="https://github.com/KamauGilbert/nairobi_house_price_prediction_model" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;The model is divided into the following 6 major parts, plus one optional component:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Web Scraping&lt;/strong&gt;: Extract house data from relevant websites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Cleaning&lt;/strong&gt;: Clean and preprocess the gathered data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory Data Analysis (EDA)&lt;/strong&gt;: Analyze and visualize the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modeling&lt;/strong&gt;: Build and train the predictive models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: Deploy the model using a web framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chatbot Creation&lt;/strong&gt;: Develop a chatbot using OpenAI APIs to provide housing information in Kenya.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation with Airflow (Optional)&lt;/strong&gt;: Automate processes using Apache Airflow.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Web Scraping
&lt;/h2&gt;

&lt;p&gt;Web scraping involves using a bot or web crawler to extract data from third-party websites. It plays a crucial role in today’s digital landscape, enabling web developers to build impactful applications and data scientists to gather relevant data for modeling.&lt;/p&gt;

&lt;p&gt;There are several methods for web scraping. One straightforward approach is to use the official APIs that some websites provide, such as the Twitter API. However, API access can be costly, as many keys are not free. Alternatively, Python libraries offer powerful tools for scraping, including BeautifulSoup, Selenium, and Scrapy. Here’s a brief overview of each:&lt;/p&gt;

&lt;p&gt;a. &lt;strong&gt;BeautifulSoup&lt;/strong&gt;: An HTML and XML parser ideal for extracting data from static web pages. It's a great starting point for beginners. For a detailed tutorial, check out &lt;a href="https://www.youtube.com/watch?v=XVv6mJpFOb0" rel="noopener noreferrer"&gt;this video&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;b. &lt;strong&gt;Selenium&lt;/strong&gt;: Best for handling user interactions and JavaScript-heavy websites, making it suitable for dynamic content. For more information, see &lt;a href="https://www.youtube.com/watch?v=j7VZsCCnptM" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;c. &lt;strong&gt;Scrapy&lt;/strong&gt;: Designed for large-scale, concurrent data extraction with built-in features for requests, parsing, crawling, and organizing data. Learn more from &lt;a href="https://www.youtube.com/watch?v=s4jtkzHhLzY" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this project, we used BeautifulSoup to extract data from two websites: buyrentkenya.com and propertypro.co.ke. Here’s how BeautifulSoup works:&lt;/p&gt;

&lt;p&gt;i. &lt;strong&gt;Import the Requests Library&lt;/strong&gt;:&lt;br&gt;
   The &lt;code&gt;requests&lt;/code&gt; library allows us to send HTTP requests to websites. Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

   &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.buyrentkenya.com/houses-for-sale&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
   &lt;span class="n"&gt;html_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output will be the HTML content of the webpage. A successful HTTP request (status code 200) returns the HTML; if the request fails, you will get an HTTP error status instead. For more information on HTTP status codes, refer to &lt;a href="https://www.restapitutorial.com/httpstatuscodes.html" rel="noopener noreferrer"&gt;this article&lt;/a&gt;.&lt;/p&gt;
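
&lt;p&gt;A small, defensive variation is to check the status code before parsing; a sketch using the same &lt;code&gt;requests&lt;/code&gt; call as above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

response = requests.get('https://www.buyrentkenya.com/houses-for-sale')
if response.status_code == 200:
    html_text = response.text
else:
    print(f'Request failed with status code {response.status_code}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;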

&lt;p&gt;ii. &lt;strong&gt;Import BeautifulSoup&lt;/strong&gt;:&lt;br&gt;
   BeautifulSoup is a Python library that includes various parsers such as &lt;code&gt;html.parser&lt;/code&gt;, &lt;code&gt;lxml&lt;/code&gt;, and &lt;code&gt;html5lib&lt;/code&gt;. A parser reads and analyzes text to understand its structure and meaning, often converting it into a more usable format. In simple terms, a parser is like an interpreter who can bridge language barriers, allowing Python to understand HTML.&lt;/p&gt;

&lt;p&gt;Once you use a parser, you store the result in a variable called &lt;code&gt;soup&lt;/code&gt; (a common convention) and use this variable to find or extract specific text from the HTML elements. To identify the text you’re interested in, right-click the content you want in the browser (for example, the property price), select "Inspect," and find the relevant class field.&lt;/p&gt;

&lt;p&gt;Here’s a code snippet for extracting prices from buyrentkenya.com:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

   &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.buyrentkenya.com/houses-for-sale&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
   &lt;span class="n"&gt;html_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
   &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relative w-full overflow-hidden rounded-2xl bg-white&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;price_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text-xl font-bold leading-7 text-grey-900&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;no-underline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;price_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;price_tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;price_tag&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
       &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code retrieves prices from the first page of the website. To handle multiple pages, you will need to include pagination logic.&lt;/p&gt;
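
&lt;p&gt;Pagination logic is site-specific. As a hypothetical sketch (many listing sites expose pages through a query parameter; the parameter name and page count here are assumptions, so check the site’s actual URL pattern):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;for page in range(1, 6):  # first five pages, for illustration
    page_url = f'https://www.buyrentkenya.com/houses-for-sale?page={page}'
    html_text = requests.get(page_url).text
    soup = BeautifulSoup(html_text, 'html.parser')
    # ...extract prices from this page as shown above...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;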

&lt;p&gt;iii. &lt;strong&gt;Save the Data to a CSV File&lt;/strong&gt;:&lt;br&gt;
   Once the scraped values are collected (for example, in a pandas DataFrame), you can write them to a CSV file. Here’s a standard way to do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the repository, under the &lt;code&gt;data_collection&lt;/code&gt; subfolder, you will find a &lt;code&gt;scraping_code&lt;/code&gt; folder containing the code used to extract data from the mentioned sites. There is also a practice script for experimenting with other sites.&lt;/p&gt;

&lt;p&gt;My primary focus during the scraping process was on properties, including houses, apartments, and bedsitters, for both rental and sale listings.&lt;/p&gt;




&lt;h2&gt;
  
  
  2 &amp;amp; 3. Data Cleaning and Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;Data cleaning and exploratory data analysis (EDA) are crucial stages in the data modeling process. After extracting data from the relevant websites, the first step is to consolidate it into a single Excel sheet. Preliminary cleaning, such as removing irrelevant rows, can be performed using Excel.&lt;/p&gt;

&lt;p&gt;Once the data is consolidated and cleaned, the next step is to prepare it for modeling through Exploratory Data Analysis (EDA). Introduced by American mathematician John Tukey in the 1970s, EDA is a fundamental process for understanding and preparing data. There is no standardized approach to EDA; it varies depending on the analyst's preferences and the specific context of the data.&lt;/p&gt;

&lt;p&gt;EDA is essential for preparing data for modeling, as it involves various tasks such as statistical analysis, data visualization, and feature engineering. To excel in this stage, you need strong skills in mathematics and statistics, data visualization, domain or market knowledge, and a curious mindset. Asking critical questions about the data is key to uncovering valuable insights.&lt;/p&gt;

&lt;p&gt;Domain or market knowledge is particularly important for generating new features—a process known as feature engineering. Introducing new features or refining existing ones helps the model better understand the data, improving its performance. Features are essentially the columns in your dataset, such as location or number of bedrooms.&lt;/p&gt;

&lt;p&gt;Data cleaning and EDA are often the most time-consuming parts of the modeling process. They require a deep understanding of both the general and statistical aspects of the data. This stage can take days or even weeks to thoroughly analyze and interpret. Its importance cannot be overstated; as the saying goes, "Garbage in, garbage out." Providing the model with poor-quality input will result in poor-quality predictions.&lt;/p&gt;

&lt;p&gt;For a practical example, refer to the code in the &lt;code&gt;nairobi_house_price_prediction&lt;/code&gt; notebook of the &lt;code&gt;cleaning_eda_modeling&lt;/code&gt; subfolder to see how EDA was conducted in this project.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Modeling
&lt;/h2&gt;

&lt;p&gt;Modeling is the core of our data science process. Once the data has been thoroughly explored and features have been enhanced, the final dataset is handed over to the data scientist for advanced statistical exploration and modeling.&lt;/p&gt;

&lt;p&gt;A data scientist typically possesses advanced statistical knowledge compared to the initial analyst. They use this expertise to extract deeper insights from the data and prepare it for modeling.&lt;/p&gt;

&lt;p&gt;The next step is to determine the type of problem at hand. Modeling problems generally fall into two categories:&lt;br&gt;
a. &lt;strong&gt;Classification Problems&lt;/strong&gt;: In these problems, the goal is to predict a discrete class label. The output is a categorical label. For example, if a model is trained with images of boys and girls, it will assign probability scores to the "boy" and "girl" labels for a new image and classify it based on the label with the highest probability.&lt;/p&gt;

&lt;p&gt;b. &lt;strong&gt;Regression Problems&lt;/strong&gt;: These problems aim to predict a continuous (numerical) output variable based on one or more input features. For instance, predicting house prices based on various features like location and size is a regression problem.&lt;/p&gt;

&lt;p&gt;Different algorithms are used for classification and regression problems, though some algorithms can be applied to both types. It is the data scientist's role to select the appropriate algorithms and train the model accordingly. &lt;/p&gt;

&lt;p&gt;A special type of model known as an &lt;strong&gt;ensemble model&lt;/strong&gt; combines two or more models or algorithms. Ensemble models often outperform individual models for many tasks. More information on ensemble models can be found &lt;a href="https://scikit-learn.org/stable/modules/ensemble.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Before modeling, it is essential to &lt;strong&gt;pre-process&lt;/strong&gt; the data. Most algorithms require numerical input, so pre-processing transforms the data (for example, encoding categorical features and scaling numerical ones) into a format the algorithms can train on. The fitted pre-processor can be saved as a pickle (.pkl) file and reused during inference to prepare user input for prediction. Next, divide the data into training and testing sets (and sometimes validation sets) using the &lt;code&gt;train_test_split&lt;/code&gt; function from scikit-learn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="c1"&gt;# Dividing into 70% train set and 30% test set
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
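
&lt;p&gt;As a minimal illustration of the pre-processing step described above (fit it on the training split only; the column names here are hypothetical), a scikit-learn preprocessor could be built and saved like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import joblib

# Scale numeric columns, one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['bedrooms', 'bathrooms']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['location']),
])

X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)

# Save the fitted preprocessor for reuse at inference time
joblib.dump(preprocessor, 'preprocessor.pkl')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;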



&lt;p&gt;With the data divided, proceed with modeling. In this project, three main algorithms were used: &lt;code&gt;LinearRegression&lt;/code&gt;, &lt;code&gt;RandomForestRegressor&lt;/code&gt;, and &lt;code&gt;GradientBoostingRegressor&lt;/code&gt;. Selecting appropriate evaluation metrics is crucial for assessing model performance. For this task, metrics such as Mean Squared Error (MSE), R-squared (R²), Cross-Validation Mean Score (CV-Mean), and Cross-Validation Standard Deviation (CV-Std Dev) were used, as accuracy, precision, and recall were less relevant.&lt;/p&gt;

&lt;p&gt;Hyperparameter tuning is another critical aspect for improving model performance. Grid Search was employed to tune the Random Forest and Gradient Boosting models, as Linear Regression has fewer hyperparameters to adjust. The ensemble of the two models yielded better results. After training, the models were saved as pickle files (.pkl) containing the trained weights. The model weights can be found in the &lt;code&gt;model_preprocessor_weights&lt;/code&gt; subfolder, which also includes the preprocessor. As noted, the ensemble model provided the most accurate results and should be used for inference.&lt;/p&gt;
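
&lt;p&gt;One way to combine the two tuned models is a voting ensemble; the sketch below assumes scikit-learn’s &lt;code&gt;VotingRegressor&lt;/code&gt; and the prepared splits from earlier, though the actual ensembling method in the notebook may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Averages the predictions of the two tree-based models
ensemble = VotingRegressor([
    ('rf', RandomForestRegressor(n_estimators=200, random_state=42)),
    ('gb', GradientBoostingRegressor(random_state=42)),
])
ensemble.fit(X_train_prepared, y_train)

preds = ensemble.predict(X_test_prepared)
print('MSE:', mean_squared_error(y_test, preds))
print('R-squared:', r2_score(y_test, preds))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;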




&lt;h2&gt;
  
  
  5. Deployment
&lt;/h2&gt;

&lt;p&gt;Deploy your model using frameworks like &lt;strong&gt;FastAPI&lt;/strong&gt;, &lt;strong&gt;Flask&lt;/strong&gt;, or &lt;strong&gt;Streamlit&lt;/strong&gt;. You can dockerize your application to enhance compatibility across different environments.&lt;/p&gt;

&lt;p&gt;Check out the &lt;code&gt;inferencing_and_deployment&lt;/code&gt; subfolder in the repo for details on how I did my deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Chatbot Creation Using OpenAI API
&lt;/h2&gt;

&lt;p&gt;Incorporating a chatbot gives our model a contemporary edge. While traditional machine learning methods are widely accepted and used in the industry, the advent of modern Large Language Models (LLMs), beginning with GPT-1 in 2018, has significantly disrupted the data field. &lt;/p&gt;

&lt;p&gt;Chatbots today often utilize Retrieval Augmented Generation (RAG) applications, which combine vector databases with LLMs like gpt-4-turbo to provide sophisticated responses to user queries. For more information on RAG applications, you can explore this &lt;a href="https://github.com/langchain/langchain" rel="noopener noreferrer"&gt;LangChain repository&lt;/a&gt; and watch this &lt;a href="https://www.youtube.com/watch?v=somevideo" rel="noopener noreferrer"&gt;video&lt;/a&gt; on the topic. &lt;/p&gt;

&lt;p&gt;OpenAI offers billable API keys to access their models, which you can find &lt;a href="https://platform.openai.com/overview" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;
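
&lt;p&gt;As a rough sketch of a single chat completion call with the official Python SDK (this assumes your API key is set in the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable, and the prompts are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model='gpt-4-turbo',
    messages=[
        {'role': 'system', 'content': 'You answer questions about housing in Kenya.'},
        {'role': 'user', 'content': 'What should I consider when renting in Nairobi?'},
    ],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;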

&lt;p&gt;For implementation details, refer to the &lt;code&gt;chatbot&lt;/code&gt; subfolder in the &lt;code&gt;nairobi_house_price_prediction_model&lt;/code&gt; repository.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Automation with Airflow (Optional)
&lt;/h2&gt;

&lt;p&gt;Automate your data processes using &lt;strong&gt;Apache Airflow&lt;/strong&gt;. Other tools like &lt;strong&gt;Apache Kafka&lt;/strong&gt; or &lt;strong&gt;Redpanda&lt;/strong&gt; can also be considered for data streaming. This component is still in progress.&lt;/p&gt;




&lt;p&gt;For a comprehensive view, visit the &lt;a href="https://github.com/KamauGilbert/nairobi_house_price_prediction_model" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For inquiries, connect with me on &lt;a href="https://linkedin.com/in/gilbert-kamau-mungai" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or email me at &lt;a href="mailto:kamaugilbert9@gmail.com"&gt;kamaugilbert9@gmail.com&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>data</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
