<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kamau Gilbert Mungai</title>
    <description>The latest articles on DEV Community by Kamau Gilbert Mungai (@kamaugilbert).</description>
    <link>https://dev.to/kamaugilbert</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1861417%2F7fa1ee6e-5517-4a41-8fa0-8517206988eb.jpg</url>
      <title>DEV Community: Kamau Gilbert Mungai</title>
      <link>https://dev.to/kamaugilbert</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kamaugilbert"/>
    <language>en</language>
    <item>
      <title>Financial Inclusion in Africa: A Zindi Project</title>
      <dc:creator>Kamau Gilbert Mungai</dc:creator>
      <pubDate>Sat, 25 Jan 2025 08:50:45 +0000</pubDate>
      <link>https://dev.to/kamaugilbert/financial-inclusion-in-africa-a-zindi-project-c4a</link>
      <guid>https://dev.to/kamaugilbert/financial-inclusion-in-africa-a-zindi-project-c4a</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this article, I will guide you through my approach to tackling the &lt;a href="https://zindi.africa/competitions/financial-inclusion-in-africa" rel="noopener noreferrer"&gt;Financial Inclusion in Africa&lt;/a&gt; project by Zindi. While the primary goal of this project is to predict which individuals are most likely to have or use a bank account, my main focus was to learn how to deploy a machine learning model on &lt;a href="https://console.cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud Platform (GCP)&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;You can use my methodology as a reference to implement the project on your own, adapting it as needed. During my attempt, I fine-tuned the model once and achieved a Mean Absolute Error (MAE) score of &lt;strong&gt;0.122942692&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;Now, you might be wondering: &lt;em&gt;"Isn't this a classification problem? Why use MAE as a metric?"&lt;/em&gt; Fellow Zindians provided a clever explanation:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You're on the right track! In some cases, classification problems use Mean Absolute Error (MAE) when treating classes as numerical values. If the model predicts 0 instead of 1, the error is |0-1| = 1. MAE measures how far off predictions are from the true classes on average, unlike accuracy, which just counts correct predictions. If there's an ordinal relationship between classes, this method can be useful."  &lt;/p&gt;
&lt;/blockquote&gt;
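
&lt;p&gt;To make that concrete, here is a minimal sketch (my own illustration, not competition code) showing that MAE on 0/1 labels is simply the fraction of misclassified samples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import mean_absolute_error

# Hypothetical binary labels: 1 = has a bank account, 0 = does not
y_true = [1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0]

# Each error contributes |true - pred| = 1, so MAE = misclassified / total
print(mean_absolute_error(y_true, y_pred))  # 2 wrong out of 5 -&gt; 0.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;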

&lt;p&gt;In this article, I'll outline the steps I followed, leaving room for you to build upon and improve with your unique ideas.  &lt;/p&gt;

&lt;p&gt;The project is structured into four key phases:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Cleaning and Exploratory Data Analysis (EDA)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Engineering&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Creation and Evaluation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Deployment on GCP&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Please note that I trained the model locally; however, GCP also offers the capability to train models directly in the cloud on its powerful instances. That process involves building a Docker image, a topic I plan to cover in detail on another occasion (honestly, I'm not a big Docker fan, but I do like a nice operating system).&lt;/p&gt;

&lt;p&gt;And yes, training the model locally does consume a lot of computer resources. The image below shows how much of my machine's resources were used during model fine-tuning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn99wveqbin9r13kd4hg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn99wveqbin9r13kd4hg.png" alt="Image description" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can access the repository &lt;a href="https://github.com/KamauGilbert/financial_inclusion_in_africa" rel="noopener noreferrer"&gt;here&lt;/a&gt;. This project was carried out using Python 3.13.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PS&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;\&lt;span class="n"&gt;Users&lt;/span&gt;\&lt;span class="n"&gt;Administrator&lt;/span&gt;\&lt;span class="n"&gt;Documents&lt;/span&gt;\&lt;span class="n"&gt;financial_inclusion_in_africa_full&lt;/span&gt;\&lt;span class="n"&gt;financial_inclusion_in_africa&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.13&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's get started.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Data Cleaning and Exploratory Data Analysis (EDA)
&lt;/h3&gt;

&lt;p&gt;EDA, or Exploratory Data Analysis, is all about understanding your dataset—spotting patterns, gaining insights, and identifying any issues that might need fixing. From my experience, EDA often goes hand in hand with data cleaning. While Data Analysts usually lead this process, Data Scientists also get involved, especially when it comes to more technical or statistical aspects.&lt;/p&gt;

&lt;p&gt;If you're looking for a practical example, feel free to check out &lt;a href="https://dev.to/kamaugilbert/nairobi-county-property-price-prediction-model-technical-walkthrough-on-model-creation-pkf"&gt;this article&lt;/a&gt; I wrote about a previous project. It covers some basics of data cleaning and EDA. You can also dive into the project’s &lt;a href="https://github.com/KamauGilbert/nairobi_house_price_prediction_model" rel="noopener noreferrer"&gt;repository&lt;/a&gt; for the code.&lt;/p&gt;




&lt;h4&gt;
  
  
  What Makes EDA Important?
&lt;/h4&gt;

&lt;p&gt;There’s no single "right" way to do EDA—it really depends on the data, your goals, and even personal or team preferences. For example:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your background matters.&lt;/strong&gt; If you're technically inclined or have domain expertise, you might notice patterns others don’t.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools matter.&lt;/strong&gt; While I use Python libraries like &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;seaborn&lt;/code&gt;, tools like Power BI are fantastic for creating quick, intuitive visualizations.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  My Approach to EDA
&lt;/h4&gt;

&lt;p&gt;When working with tabular data, I typically start by answering basic questions like the following (a quick &lt;code&gt;pandas&lt;/code&gt; sketch follows the list):  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;For numerical features:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What’s the average, maximum, and minimum?
&lt;/li&gt;
&lt;li&gt;Are there outliers or strange patterns?
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For categorical features:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many unique categories exist?
&lt;/li&gt;
&lt;li&gt;Are the categories evenly distributed, or do we have imbalances?
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
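
&lt;p&gt;Here is a minimal &lt;code&gt;pandas&lt;/code&gt; sketch of those first-pass checks, assuming the data has been loaded into a DataFrame called &lt;code&gt;df&lt;/code&gt; (the file name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('train.csv')  # hypothetical file name

# Numerical features: average, max, min (quartiles help spot outliers)
print(df.describe())

# Categorical features: how many unique categories, and are they balanced?
print(df['country'].nunique())
print(df['country'].value_counts(normalize=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;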

&lt;p&gt;From there, I dive deeper depending on the dataset. For example:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.geeksforgeeks.org/ml-chi-square-test-for-feature-selection/" rel="noopener noreferrer"&gt;Chi-square tests&lt;/a&gt;&lt;/strong&gt; can help identify relationships between categorical variables.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.geeksforgeeks.org/benfords-law/" rel="noopener noreferrer"&gt;Benford's Law&lt;/a&gt;&lt;/strong&gt; is great for spotting unusual patterns in numerical data, like financial figures.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are just starting points—your EDA will evolve as you explore more of the data.  &lt;/p&gt;
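
&lt;p&gt;For instance, here is a hedged sketch of a chi-square independence test between a categorical feature and the target, using &lt;code&gt;scipy&lt;/code&gt; (the &lt;code&gt;df&lt;/code&gt; DataFrame and column names are assumptions on my part):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical check: is location_type independent of the bank_account target?
table = pd.crosstab(df['location_type'], df['bank_account'])
chi2, p_value, dof, expected = chi2_contingency(table)

# A small p-value suggests the two variables are related
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;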




&lt;h4&gt;
  
  
  Common EDA libraries
&lt;/h4&gt;

&lt;p&gt;Python has some great libraries for EDA:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt;&lt;/strong&gt;: Perfect for manipulating and exploring data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;matplotlib.pyplot&lt;/code&gt; and &lt;code&gt;seaborn&lt;/code&gt;&lt;/strong&gt;: Go-to options for creating clear and impactful visualizations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pyforest&lt;/code&gt;&lt;/strong&gt;: A handy library that imports commonly used libraries for you, saving time. You can check out pyforest's repository &lt;a href="https://github.com/8080labs/pyforest?tab=readme-ov-file" rel="noopener noreferrer"&gt;here&lt;/a&gt; to see what libraries come with it. A short usage sketch follows this list.&lt;/li&gt;
&lt;/ul&gt;
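
&lt;p&gt;As a quick illustration of the idea behind &lt;code&gt;pyforest&lt;/code&gt;, here is a minimal sketch based on its documented usage (imports are triggered lazily, only when first used):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyforest import *

# pd, sns and plt are lazy placeholders until first used
df = pd.DataFrame({'age': [23, 31, 45]})
sns.histplot(df['age'])
plt.show()

# Lists the imports that were actually triggered
print(active_imports())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;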




&lt;h4&gt;
  
  
  Why EDA Matters
&lt;/h4&gt;

&lt;p&gt;EDA is a crucial step in any data project because it sets the foundation for everything else. It helps you:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand your data:&lt;/strong&gt; Gain insights that lead to better decisions during model training.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solve problems early:&lt;/strong&gt; Spot issues like class imbalances or missing data before they derail your project.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guide your next steps:&lt;/strong&gt; For instance, it might inspire new feature engineering ideas or show you which actions (like applying class weights) are necessary during model training.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  See It in Action
&lt;/h4&gt;

&lt;p&gt;Curious about how I approached EDA for the &lt;em&gt;Financial Inclusion in Africa&lt;/em&gt; project? Check out &lt;a href="https://github.com/KamauGilbert/financial_inclusion_in_africa/blob/main/financial_inclusion.ipynb" rel="noopener noreferrer"&gt;this notebook&lt;/a&gt;. It’s all there, and you’re welcome to build on it or adapt it for this or your own projects.&lt;/p&gt;

&lt;p&gt;EDA isn’t just a box to tick—it’s your opportunity to really connect with your data and get creative. So, don’t rush it! Cleaning and EDA are often said to account for 70% to 80% of the total time spent developing a machine learning model.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Feature Engineering
&lt;/h3&gt;

&lt;p&gt;Feature engineering is all about turning raw data into meaningful features that can help your machine learning model perform better. &lt;/p&gt;

&lt;p&gt;In my case, I combined values from two columns in the dataset—such as &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;location_type&lt;/code&gt;—to create a new feature called &lt;code&gt;geographical_location&lt;/code&gt;. This resulted in values like &lt;code&gt;Kenya_Urban&lt;/code&gt; and &lt;code&gt;Uganda_Rural&lt;/code&gt;. It’s a straightforward example, but even small steps like this can have a meaningful impact on your model’s performance.&lt;/p&gt;
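
&lt;p&gt;In &lt;code&gt;pandas&lt;/code&gt;, that combination is a one-liner; here is a minimal sketch, assuming the dataset lives in a DataFrame called &lt;code&gt;df&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Combine country and location_type into one categorical feature,
# yielding values like 'Kenya_Urban' and 'Uganda_Rural'
df['geographical_location'] = df['country'] + '_' + df['location_type']
print(df['geographical_location'].value_counts().head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;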

&lt;p&gt;Of course, feature engineering doesn’t stop there. Depending on your dataset and goals, you can get really creative. You might extract time-based patterns, combine multiple columns in unique ways, or apply advanced transformations to uncover deeper insights. It’s all about finding features that help your model make better predictions.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Model Creation and Evaluation
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens—the phase where data scientists roll up their sleeves, train models, and keep a close eye on how the model is learning. Admit it, we’ve all had those moments staring endlessly at our screens as the training process runs. 😄&lt;/p&gt;

&lt;p&gt;After completing the Exploratory Data Analysis (EDA), you should have a clear understanding of the type of problem you're tackling. In most cases, it's either a &lt;a href="https://www.geeksforgeeks.org/regression-in-machine-learning/" rel="noopener noreferrer"&gt;regression problem&lt;/a&gt; or a &lt;a href="https://www.geeksforgeeks.org/getting-started-with-classification/" rel="noopener noreferrer"&gt;classification problem&lt;/a&gt;. For this project, it was a classification problem, so I explored commonly used algorithms like &lt;strong&gt;Logistic Regression&lt;/strong&gt;, &lt;strong&gt;Random Forest Classifier&lt;/strong&gt;, and &lt;strong&gt;XGBoost&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Choosing Evaluation Metrics
&lt;/h4&gt;

&lt;p&gt;Selecting the right evaluation metrics is just as important as selecting the right algorithm. For classification problems, commonly used metrics include (a small worked example follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt;: Measures the ability to find all positive instances, TP / (TP + FN).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt;: Measures the accuracy of positive predictions, TP / (TP + FP).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 Score&lt;/strong&gt;: Balances precision and recall, 2 / ((1 / Precision) + (1 / Recall)).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: Simply measures the ratio of correct predictions to all predictions.&lt;/li&gt;
&lt;/ul&gt;
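
&lt;p&gt;As a small worked example (with made-up predictions), here is how those formulas line up with scikit-learn's implementations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# TP = 2, FP = 1, FN = 1
# Precision = TP / (TP + FP) = 2/3; Recall = TP / (TP + FN) = 2/3
print(precision_score(y_true, y_pred))  # 0.666...
print(recall_score(y_true, y_pred))     # 0.666...
print(f1_score(y_true, y_pred))         # harmonic mean, also 0.666...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;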

&lt;p&gt;In deep learning, for instance, you might use metrics like validation loss or mean average precision (mAP) at different thresholds (e.g., 50%, 75%).&lt;/p&gt;




&lt;h4&gt;
  
  
  My Approach
&lt;/h4&gt;

&lt;p&gt;Below is the code I used to train my model. It involves key steps like oversampling the minority class using &lt;strong&gt;SMOTE&lt;/strong&gt;, fitting models, and evaluating them using various metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roc_curve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConfusionMatrixDisplay&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;imblearn.over_sampling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SMOTE&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Oversampling the Minority Class
&lt;/span&gt;&lt;span class="n"&gt;smote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SMOTE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampling_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smote&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Define a Model Evaluation Function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Performance Metrics
&lt;/span&gt;    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;roc_auc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;auc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Plot ROC Curve
&lt;/span&gt;    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC curve (AUC = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;roc_auc&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;False Positive Rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;True Positive Rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC Curve for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Confusion Matrix
&lt;/span&gt;    &lt;span class="n"&gt;ConfusionMatrixDisplay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_predictions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Blues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Confusion Matrix for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Classification Report
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classification Report for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roc_auc&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Train and Evaluate Models
&lt;/span&gt;&lt;span class="n"&gt;log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xgb_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale_pos_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;log_reg_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_reg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Logistic Regression&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random Forest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xgb_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xgb_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;XGBoost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Summarize Results
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Logistic Regression&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random Forest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;XGBoost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Recall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F1 Score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC AUC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_reg_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rf_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;xgb_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  Key Takeaways:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Oversampling with SMOTE&lt;/strong&gt;: Helps balance the dataset by synthetically generating new samples for the minority class. You could also check out SMOTETomek, which combines SMOTE with Tomek-link cleaning (see the sketch after this list). Read about it &lt;a href="https://towardsdatascience.com/imbalanced-classification-in-python-smote-tomek-links-method-6e48dfe69bbc" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Evaluation&lt;/strong&gt;: Always analyze metrics beyond accuracy—precision, recall, and F1 Score give a more detailed understanding of model performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: Tools like ROC curves and confusion matrices provide a deeper look at how well your model is performing.&lt;/li&gt;
&lt;/ol&gt;
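
&lt;p&gt;For the SMOTETomek variant mentioned in the first takeaway, here is a minimal sketch, assuming the same &lt;code&gt;X_train&lt;/code&gt; and &lt;code&gt;y_train&lt;/code&gt; as before:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from imblearn.combine import SMOTETomek

# SMOTE oversamples the minority class, then Tomek links are removed
# to clean up noisy samples near the class boundary
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)
print(y_res.value_counts())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;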

&lt;p&gt;You can adapt this workflow or expand on it based on your dataset and goals!&lt;/p&gt;

&lt;p&gt;After the initial training phase, you might choose to fine-tune the best-performing model to optimize its performance. Alternatively, you can go a step further and train an &lt;a href="https://builtin.com/machine-learning/ensemble-model" rel="noopener noreferrer"&gt;ensemble model&lt;/a&gt;, which combines multiple models to achieve better results. I prefer the latter approach—it often provides more robust and accurate predictions.&lt;/p&gt;

&lt;p&gt;Here's how I trained and fine-tuned an ensemble model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VotingClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# Assuming X_train, X_test, y_train, y_test are already defined
&lt;/span&gt;
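&lt;span class="c1"&gt;# (numpy, matplotlib, SMOTE, RandomForestClassifier and the metric functions
# imported in the previous code block are assumed to still be in scope)
&lt;/span&gt;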
&lt;span class="c1"&gt;# Step 1: Apply SMOTE to oversample the minority class in the training data
&lt;/span&gt;&lt;span class="n"&gt;smote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SMOTE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampling_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smote&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Define base models (Logistic Regression and Random Forest)
&lt;/span&gt;&lt;span class="n"&gt;log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Create a Voting Classifier ensemble
&lt;/span&gt;&lt;span class="n"&gt;ensemble_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VotingClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_reg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_reg&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;voting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;soft&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 'soft' voting uses predicted probabilities
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 4: Hyperparameter tuning for the ensemble model
# We can tune hyperparameters of both base models
&lt;/span&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_reg__C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Regularization strength for Logistic Regression
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Number of trees in Random Forest
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Maximum depth of the trees
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__min_samples_split&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum samples required to split an internal node
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum samples required to be at a leaf node
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Using GridSearchCV to search for the best hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;grid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ensemble_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Evaluate the best model from the grid search
&lt;/span&gt;&lt;span class="n"&gt;best_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;

&lt;span class="c1"&gt;# Step 6: Predictions and evaluation on the test set
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Performance metrics
&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ROC curve
&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;roc_auc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;auc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Confusion Matrix
&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plot ROC curve
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;darkorange&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC curve (area = %0.2f)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;roc_auc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;navy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;False Positive Rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;True Positive Rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC Curve for Ensemble Model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lower right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Plot Confusion Matrix
&lt;/span&gt;&lt;span class="n"&gt;cm_display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConfusionMatrixDisplay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cm_display&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Blues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Confusion Matrix for Ensemble Model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Print Classification Report
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classification Report for Ensemble Model:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Summary of results
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Recall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F1 Score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROC AUC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;roc_auc&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Performance Metrics for the Best Ensemble Model:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hyperparameters that were tuned here are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_reg__C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Regularization strength for Logistic Regression
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Number of trees in Random Forest
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Maximum depth of the trees
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__min_samples_split&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum samples required to split an internal node
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf__min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum samples required to be at a leaf node
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
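
&lt;p&gt;For context, here is a minimal sketch of how a grid like this could be plugged into a search. It assumes a soft-voting ensemble whose estimators are named &lt;code&gt;log_reg&lt;/code&gt; and &lt;code&gt;rf&lt;/code&gt;, matching the prefixes in &lt;code&gt;param_grid&lt;/code&gt;; your actual setup may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Estimator names must match the 'log_reg__' and 'rf__' prefixes in param_grid
ensemble = VotingClassifier(
    estimators=[
        ('log_reg', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(random_state=42)),
    ],
    voting='soft',
)

# 5-fold cross-validated search over param_grid
grid_search = GridSearchCV(ensemble, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;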



&lt;p&gt;You can play around with these hyperparameters and see how your model performs. For the real test, use the &lt;a href="https://zindi.africa/competitions/financial-inclusion-in-africa/data" rel="noopener noreferrer"&gt;Test&lt;/a&gt; dataset to confirm your MAE. Use the following code to save your predictions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;submission&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uniqueid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uniqueid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; x &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Concatenate uniqueid and country
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bank_account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;  &lt;span class="c1"&gt;# Add predictions to the DataFrame
&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Save the submission to a CSV file
&lt;/span&gt;&lt;span class="n"&gt;submission&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_name.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Model Deployment
&lt;/h3&gt;

&lt;p&gt;As a new programmer, you might be wondering, &lt;em&gt;"What exactly does deployment mean?"&lt;/em&gt; In technical terms, deployment is the process of transferring code from a development environment (where you build and test it) to a live or production environment, where it can be accessed and used by others. &lt;/p&gt;

&lt;p&gt;Think of it this way: Imagine you’re a fantastic swimmer, but no one knows about your skills because you’ve only practiced in your backyard pool. To showcase your talent, you decide to participate in a swimming competition at your school or another venue. Moving from your backyard pool to the competition arena is like deployment—you’re taking your skills (or code) to a public stage where others can see and benefit from them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running the App Locally
&lt;/h3&gt;

&lt;p&gt;To run the app locally on your computer, where it will use your machine’s resources, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to the &lt;code&gt;app&lt;/code&gt; folder where the code resides. Use the following command in your terminal:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;cd &lt;/span&gt;app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Once you’re inside the folder, run the application using this command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; PS C:&lt;span class="se"&gt;\U&lt;/span&gt;sers&lt;span class="se"&gt;\A&lt;/span&gt;dministrator&lt;span class="se"&gt;\D&lt;/span&gt;ocuments&lt;span class="se"&gt;\f&lt;/span&gt;inancial_inclusion_in_africa_full&lt;span class="se"&gt;\f&lt;/span&gt;inancial_inclusion_in_africa&lt;span class="se"&gt;\a&lt;/span&gt;pp&amp;gt; python .&lt;span class="se"&gt;\m&lt;/span&gt;ain.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Don’t be alarmed by the &lt;code&gt;(env)&lt;/code&gt; you see in the command line—it simply indicates that you’re working within a virtual environment. Virtual environments are a best practice when working on projects, as they help isolate and consolidate all the dependencies required to run your app.  &lt;/p&gt;
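
&lt;p&gt;If you haven’t set one up before, creating and activating a virtual environment on Windows looks roughly like this (the environment name &lt;code&gt;env&lt;/code&gt; is just a convention):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python -m venv env    # create the environment
env\Scripts\activate  # activate it (on Linux/macOS: source env/bin/activate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;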

&lt;p&gt;Earlier, I mentioned not being too fond of Docker, but I must admit: Docker images are fantastic for ensuring your app runs smoothly across different environments. With Docker, the app would work just as seamlessly on my computer as it would on yours.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;After running the command, you should see an output similar to this:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   * Serving Flask app 'main'
   * Debug mode: on
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
   * Running on http://127.0.0.1:5000
   Press CTRL+C to quit
   * Restarting with stat
   * Debugger is active!
   * Debugger PIN: 112-088-710
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Copy the URL &lt;code&gt;http://127.0.0.1:5000&lt;/code&gt; and paste it into your web browser of choice. This will open the app, where you can interact with it. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since the model is located in the &lt;code&gt;app&lt;/code&gt; folder, you’ll be able to make inferences using it. &lt;/p&gt;
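
&lt;p&gt;To make this concrete, here is a minimal sketch of what a &lt;code&gt;main.py&lt;/code&gt; like this might contain. It assumes a Flask app that loads the saved model with joblib and serves an &lt;code&gt;index.html&lt;/code&gt; template; the form handling is illustrative, not the exact code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from flask import Flask, render_template, request
import joblib

app = Flask(__name__)
model = joblib.load('ensemble_model2.joblib')  # pre-trained ensemble

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Illustrative only: real inputs must be preprocessed exactly
    # as they were during training (see the notes below)
    features = [[float(x) for x in request.form.values()]]
    prediction = model.predict(features)[0]
    return render_template('index.html', prediction=prediction)

if __name__ == '__main__':
    app.run(debug=True)  # serves on http://127.0.0.1:5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;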

&lt;p&gt;Below is a screenshot of the app’s interface. Note that the design is basic as it uses simple HTML. However, you can enhance it further by incorporating CSS for a more polished look. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d0unb26law65psli0z5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d0unb26law65psli0z5.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  A Few Things to Note Before Using the App:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Python Version&lt;/strong&gt;: Ensure you’re using Python 3.13.1 for compatibility. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing Inputs&lt;/strong&gt;: The input data must be preprocessed to match the way the model was trained. For instance, our feature engineering involved combining certain features from the inputs—you’ll need to replicate this step for the app to function correctly (see the illustrative sketch below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirements&lt;/strong&gt;: Install all the libraries defined in the &lt;code&gt;requirements.txt&lt;/code&gt; file by running:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
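
&lt;p&gt;On the preprocessing point above: the exact transformation depends on how the model was trained, but as a purely illustrative sketch (the column and feature names here are hypothetical), replicating an engineered feature at inference time might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Hypothetical example: rebuild a combined feature exactly the way
# it was built during training
user_input = pd.DataFrame([{'age_of_respondent': 30, 'household_size': 4}])
user_input['age_per_household'] = (
    user_input['age_of_respondent'] / user_input['household_size']
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;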



&lt;p&gt;With that out of the way, you’ve successfully used the app to make inferences! Notice the address the app is running on? It does look a bit strange: &lt;code&gt;http://127.0.0.1:5000&lt;/code&gt;. That’s the loopback address: the app is running locally and is only reachable from your own machine.&lt;/p&gt;

&lt;h4&gt;
  
  
  Next Steps: Deploying the App on GCP
&lt;/h4&gt;

&lt;p&gt;Let’s take your app to the next level by deploying it to Google Cloud Platform (GCP), making it accessible to others online.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Create a GCP Account
&lt;/h4&gt;

&lt;p&gt;Visit the &lt;a href="https://console.cloud.google.com/welcome/new?pli=1&amp;amp;inv=1&amp;amp;invt=AbnxWg" rel="noopener noreferrer"&gt;GCP Console&lt;/a&gt; to create an account. Upon signing up, you’ll receive &lt;strong&gt;$300 in free credits&lt;/strong&gt; for 90 days—don’t forget to activate your account!&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Prepare Your Application Files
&lt;/h4&gt;

&lt;p&gt;Ensure your &lt;code&gt;app&lt;/code&gt; folder is ready with the following structure:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/strong&gt;: This file contains your app’s main logic. GCP will look for this file during deployment.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;app.yaml&lt;/code&gt;&lt;/strong&gt;: Specifies the runtime environment. For example, if you’re using Python 3.13.1, set it as &lt;code&gt;python313&lt;/code&gt; in this file (drop the dots and the patch version; see the minimal example after this list).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ensemble_model2.joblib&lt;/code&gt;&lt;/strong&gt;: Your pre-trained model file.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;templates&lt;/code&gt; folder&lt;/strong&gt;: Contains your &lt;code&gt;index.html&lt;/code&gt; file for the app’s frontend.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;requirements.txt&lt;/code&gt;&lt;/strong&gt;: Lists the Python libraries your app needs to function.&lt;/li&gt;
&lt;/ol&gt;
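
&lt;p&gt;As a reference for point 2, the simplest possible &lt;code&gt;app.yaml&lt;/code&gt; needs little more than the runtime line (assuming the Python 3.13 runtime mentioned above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Minimal app.yaml: App Engine reads the runtime from here
runtime: python313
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;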

&lt;h4&gt;
  
  
  Step 3: Set Up a Project in GCP
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the &lt;a href="https://console.cloud.google.com/" rel="noopener noreferrer"&gt;GCP Console&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;IAM &amp;amp; Admin &amp;gt; Manage Resources&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Create a new project—this will serve as the container for your app and its resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Step 4: Install the Google Cloud SDK
&lt;/h4&gt;

&lt;p&gt;Download and install the &lt;a href="https://cloud.google.com/sdk/docs/install" rel="noopener noreferrer"&gt;Google Cloud SDK&lt;/a&gt;. This command-line tool lets you deploy and manage your GCP resources directly from your terminal or IDE (e.g., VSCode).  &lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5: Initialize the Google Cloud CLI
&lt;/h4&gt;

&lt;p&gt;Navigate to the directory containing your app files. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; PS C:&lt;span class="se"&gt;\U&lt;/span&gt;sers&lt;span class="se"&gt;\A&lt;/span&gt;dministrator&lt;span class="se"&gt;\D&lt;/span&gt;ocuments&lt;span class="se"&gt;\f&lt;/span&gt;inancial_inclusion_in_africa_full&lt;span class="se"&gt;\f&lt;/span&gt;inancial_inclusion_in_africa&lt;span class="se"&gt;\a&lt;/span&gt;pp&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI will guide you through the following steps:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Select or create a configuration&lt;/strong&gt;: You can reinitialize an existing configuration or create a new one (e.g., &lt;code&gt;new2&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authenticate&lt;/strong&gt;: Log in with the same Google account you used for GCP.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set the project&lt;/strong&gt;: Choose the project you created earlier in your GCP Console (e.g., &lt;code&gt;financialinclusioninafrica&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Step 6: Deploy Your App
&lt;/h4&gt;

&lt;p&gt;With everything in place, deploy your app using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud app deploy app.yaml &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;YOUR_PROJECT_NAME]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;[YOUR_PROJECT_NAME]&lt;/code&gt; with the actual name of your project (e.g., &lt;code&gt;financialinclusioninafrica&lt;/code&gt;).  &lt;/p&gt;

&lt;p&gt;During deployment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose the App Engine region closest to your users.
&lt;/li&gt;
&lt;li&gt;Confirm the deployment by typing &lt;code&gt;Y&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Wait a few minutes for the process to complete.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once deployed, you’ll receive a URL for your app. Share this link, and anyone can access and interact with your application.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 7: Disable the App to Avoid Costs
&lt;/h4&gt;

&lt;p&gt;To minimize charges, disable the app when it’s no longer in use:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;App Engine &amp;gt; Settings&lt;/strong&gt; in the GCP Console.
&lt;/li&gt;
&lt;li&gt;Select your project.
&lt;/li&gt;
&lt;li&gt;Disable the app or delete unused resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Final Notes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Before deploying, check the &lt;a href="https://cloud.google.com/functions/docs/runtime-support" rel="noopener noreferrer"&gt;supported runtimes&lt;/a&gt; for GCP to ensure compatibility.
&lt;/li&gt;
&lt;li&gt;If GCP doesn’t suit your needs, you can explore alternatives like &lt;a href="https://render.com" rel="noopener noreferrer"&gt;Render.com&lt;/a&gt;. You could also use tools like ngrok, which expose your locally running app through a public URL.&lt;/li&gt;
&lt;li&gt;Docker is another option for ensuring your app runs smoothly across environments, though it's optional here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it! Your app is now live and ready to use. Cheers, and happy coding!&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Nairobi County Property Price Prediction Model: Technical Walkthrough On Model Creation.</title>
      <dc:creator>Kamau Gilbert Mungai</dc:creator>
      <pubDate>Fri, 30 Aug 2024 10:15:38 +0000</pubDate>
      <link>https://dev.to/kamaugilbert/nairobi-county-property-price-prediction-model-technical-walkthrough-on-model-creation-pkf</link>
      <guid>https://dev.to/kamaugilbert/nairobi-county-property-price-prediction-model-technical-walkthrough-on-model-creation-pkf</guid>
      <description>&lt;p&gt;In this article, we'll walk through the creation of a real-time property price prediction model focusing on Nairobi County. You can explore my model repository &lt;a href="https://github.com/KamauGilbert/nairobi_house_price_prediction_model" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;The model is divided into the following 6 major parts, plus one optional component:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Web Scraping&lt;/strong&gt;: Extract house data from relevant websites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Cleaning&lt;/strong&gt;: Clean and preprocess the gathered data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory Data Analysis (EDA)&lt;/strong&gt;: Analyze and visualize the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modeling&lt;/strong&gt;: Build and train the predictive models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: Deploy the model using a web framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chatbot Creation&lt;/strong&gt;: Develop a chatbot using OpenAI APIs to provide housing information in Kenya.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation with Airflow (Optional)&lt;/strong&gt;: Automate processes using Apache Airflow.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Web Scraping
&lt;/h2&gt;

&lt;p&gt;Web scraping involves using a bot or web crawler to extract data from third-party websites. It plays a crucial role in today’s digital landscape, enabling web developers to build impactful applications and data scientists to gather relevant data for modeling.&lt;/p&gt;

&lt;p&gt;There are several methods for web scraping. One straightforward approach is to use the official APIs that some websites provide, such as the Twitter API. However, API access can be costly, as many keys are not free. Alternatively, Python libraries offer powerful tools for scraping, including BeautifulSoup, Selenium, and Scrapy. Here’s a brief overview of each:&lt;/p&gt;

&lt;p&gt;a. &lt;strong&gt;BeautifulSoup&lt;/strong&gt;: An HTML and XML parser ideal for extracting data from static web pages. It's a great starting point for beginners. For a detailed tutorial, check out &lt;a href="https://www.youtube.com/watch?v=XVv6mJpFOb0" rel="noopener noreferrer"&gt;this video&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;b. &lt;strong&gt;Selenium&lt;/strong&gt;: Best for handling user interactions and JavaScript-heavy websites, making it suitable for dynamic content. For more information, see &lt;a href="https://www.youtube.com/watch?v=j7VZsCCnptM" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;c. &lt;strong&gt;Scrapy&lt;/strong&gt;: Designed for large-scale, concurrent data extraction with built-in features for requests, parsing, crawling, and organizing data. Learn more from &lt;a href="https://www.youtube.com/watch?v=s4jtkzHhLzY" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this project, we used BeautifulSoup to extract data from two websites: buyrentkenya.com and propertypro.co.ke. Here’s how BeautifulSoup works:&lt;/p&gt;

&lt;p&gt;i. &lt;strong&gt;Import the Requests Library&lt;/strong&gt;:&lt;br&gt;
   The &lt;code&gt;requests&lt;/code&gt; library allows us to send HTTP requests to websites. Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

   &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.buyrentkenya.com/houses-for-sale&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
   &lt;span class="n"&gt;html_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output will be the HTML content of the webpage. A successful HTTP request (status code 200) returns the HTML; if the request fails, you will get an HTTP error status instead. For more information on HTTP status codes, refer to &lt;a href="https://www.restapitutorial.com/httpstatuscodes.html" rel="noopener noreferrer"&gt;this article&lt;/a&gt;.&lt;/p&gt;
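
&lt;p&gt;A small, defensive variation is to check the status code before parsing; a sketch using the same &lt;code&gt;requests&lt;/code&gt; call as above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

response = requests.get('https://www.buyrentkenya.com/houses-for-sale')
if response.status_code == 200:
    html_text = response.text
else:
    print(f'Request failed with status code {response.status_code}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;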

&lt;p&gt;ii. &lt;strong&gt;Import BeautifulSoup&lt;/strong&gt;:&lt;br&gt;
   BeautifulSoup is a Python library that includes various parsers such as &lt;code&gt;html.parser&lt;/code&gt;, &lt;code&gt;lxml&lt;/code&gt;, and &lt;code&gt;html5lib&lt;/code&gt;. A parser reads and analyzes text to understand its structure and meaning, often converting it into a more usable format. In simple terms, a parser is like an interpreter who can bridge language barriers, allowing Python to understand HTML.&lt;/p&gt;

&lt;p&gt;Once you use a parser, you store the result in a variable called &lt;code&gt;soup&lt;/code&gt; (a common convention) and use this variable to find or extract specific text from the HTML elements. To identify the text you’re interested in, right-click the content you want in the browser (for example, the property price), select "Inspect," and find the relevant class field.&lt;/p&gt;

&lt;p&gt;Here’s a code snippet for extracting prices from buyrentkenya.com:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

   &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.buyrentkenya.com/houses-for-sale&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
   &lt;span class="n"&gt;html_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
   &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relative w-full overflow-hidden rounded-2xl bg-white&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;price_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text-xl font-bold leading-7 text-grey-900&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;no-underline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;price_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;price_tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;price_tag&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
       &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code retrieves prices from the first page of the website. To handle multiple pages, you will need to include pagination logic.&lt;/p&gt;
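
&lt;p&gt;Pagination logic is site-specific. As a hypothetical sketch (many listing sites expose pages through a query parameter; the parameter name and page count here are assumptions, so check the site’s actual URL pattern):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;for page in range(1, 6):  # first five pages, for illustration
    page_url = f'https://www.buyrentkenya.com/houses-for-sale?page={page}'
    html_text = requests.get(page_url).text
    soup = BeautifulSoup(html_text, 'html.parser')
    # ...extract prices from this page as shown above...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;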

&lt;p&gt;iii. &lt;strong&gt;Save the Data to a CSV File&lt;/strong&gt;:&lt;br&gt;
   Once the scraped values are collected (for example, in a pandas DataFrame), you can write them to a CSV file. Here’s a standard way to do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the repository, under the &lt;code&gt;data_collection&lt;/code&gt; subfolder, you will find a &lt;code&gt;scraping_code&lt;/code&gt; folder containing the code used to extract data from the mentioned sites. There is also a practice script for experimenting with other sites.&lt;/p&gt;

&lt;p&gt;My primary focus during the scraping process was on properties, including houses, apartments, and bedsitters, for both rental and sale listings.&lt;/p&gt;




&lt;h2&gt;
  
  
  2 &amp;amp; 3. Data Cleaning and Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;Data cleaning and exploratory data analysis (EDA) are crucial stages in the data modeling process. After extracting data from the relevant websites, the first step is to consolidate it into a single Excel sheet. Preliminary cleaning, such as removing irrelevant rows, can be performed using Excel.&lt;/p&gt;

&lt;p&gt;Once the data is consolidated and cleaned, the next step is to prepare it for modeling through Exploratory Data Analysis (EDA). Introduced by American mathematician John Tukey in the 1970s, EDA is a fundamental process for understanding and preparing data. There is no standardized approach to EDA; it varies depending on the analyst's preferences and the specific context of the data.&lt;/p&gt;

&lt;p&gt;EDA is essential for preparing data for modeling, as it involves various tasks such as statistical analysis, data visualization, and feature engineering. To excel in this stage, you need strong skills in mathematics and statistics, data visualization, domain or market knowledge, and a curious mindset. Asking critical questions about the data is key to uncovering valuable insights.&lt;/p&gt;

&lt;p&gt;Domain or market knowledge is particularly important for generating new features—a process known as feature engineering. Introducing new features or refining existing ones helps the model better understand the data, improving its performance. Features are essentially the columns in your dataset, such as location or number of bedrooms.&lt;/p&gt;

&lt;p&gt;Data cleaning and EDA are often the most time-consuming parts of the modeling process. They require a deep understanding of both the general and statistical aspects of the data. This stage can take days or even weeks to thoroughly analyze and interpret. Its importance cannot be overstated; as the saying goes, "Garbage in, garbage out." Providing the model with poor-quality input will result in poor-quality predictions.&lt;/p&gt;

&lt;p&gt;For a practical example, refer to the code in the &lt;code&gt;nairobi_house_price_prediction&lt;/code&gt; notebook of the &lt;code&gt;cleaning_eda_modeling&lt;/code&gt; subfolder to see how EDA was conducted in this project.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Modeling
&lt;/h2&gt;

&lt;p&gt;Modeling is the core of our data science process. Once the data has been thoroughly explored and features have been enhanced, the final dataset is handed over to the data scientist for advanced statistical exploration and modeling.&lt;/p&gt;

&lt;p&gt;A data scientist typically possesses advanced statistical knowledge compared to the initial analyst. They use this expertise to extract deeper insights from the data and prepare it for modeling.&lt;/p&gt;

&lt;p&gt;The next step is to determine the type of problem at hand. Modeling problems generally fall into two categories:&lt;br&gt;
a. &lt;strong&gt;Classification Problems&lt;/strong&gt;: In these problems, the goal is to predict a discrete class label. The output is a categorical label. For example, if a model is trained with images of boys and girls, it will assign probability scores to the "boy" and "girl" labels for a new image and classify it based on the label with the highest probability.&lt;/p&gt;

&lt;p&gt;b. &lt;strong&gt;Regression Problems&lt;/strong&gt;: These problems aim to predict a continuous (numerical) output variable based on one or more input features. For instance, predicting house prices based on various features like location and size is a regression problem.&lt;/p&gt;

&lt;p&gt;Different algorithms are used for classification and regression problems, though some algorithms can be applied to both types. It is the data scientist's role to select the appropriate algorithms and train the model accordingly. &lt;/p&gt;

&lt;p&gt;A special type of model known as an &lt;strong&gt;ensemble model&lt;/strong&gt; combines two or more models or algorithms. Ensemble models often outperform individual models for many tasks. More information on ensemble models can be found &lt;a href="https://scikit-learn.org/stable/modules/ensemble.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Before modeling, it is essential to &lt;strong&gt;pre-process&lt;/strong&gt; the data. Most algorithms require numerical input, so pre-processing transforms the data (for example, encoding categorical features and scaling numerical ones) into a format the algorithms can train on. The fitted pre-processor can be saved as a pickle (.pkl) file and reused during inference to prepare user input for prediction. Next, divide the data into training and testing sets (and sometimes validation sets) using the &lt;code&gt;train_test_split&lt;/code&gt; function from scikit-learn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="c1"&gt;# Dividing into 70% train set and 30% test set
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
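
&lt;p&gt;As a minimal illustration of the pre-processing step described above (fit it on the training split only; the column names here are hypothetical), a scikit-learn preprocessor could be built and saved like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import joblib

# Scale numeric columns, one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['bedrooms', 'bathrooms']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['location']),
])

X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)

# Save the fitted preprocessor for reuse at inference time
joblib.dump(preprocessor, 'preprocessor.pkl')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;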



&lt;p&gt;With the data divided, proceed with modeling. In this project, three main algorithms were used: &lt;code&gt;LinearRegression&lt;/code&gt;, &lt;code&gt;RandomForestRegressor&lt;/code&gt;, and &lt;code&gt;GradientBoostingRegressor&lt;/code&gt;. Selecting appropriate evaluation metrics is crucial for assessing model performance. For this task, metrics such as Mean Squared Error (MSE), R-squared (R²), Cross-Validation Mean Score (CV-Mean), and Cross-Validation Standard Deviation (CV-Std Dev) were used, as accuracy, precision, and recall were less relevant.&lt;/p&gt;

&lt;p&gt;Hyperparameter tuning is another critical aspect for improving model performance. Grid Search was employed to tune the Random Forest and Gradient Boosting models, as Linear Regression has fewer hyperparameters to adjust. The ensemble of the two models yielded better results. After training, the models were saved as pickle files (.pkl) containing the trained weights. The model weights can be found in the &lt;code&gt;model_preprocessor_weights&lt;/code&gt; subfolder, which also includes the preprocessor. As noted, the ensemble model provided the most accurate results and should be used for inference.&lt;/p&gt;
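
&lt;p&gt;One way to combine the two tuned models is a voting ensemble; the sketch below assumes scikit-learn’s &lt;code&gt;VotingRegressor&lt;/code&gt; and the prepared splits from earlier, though the actual ensembling method in the notebook may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Averages the predictions of the two tree-based models
ensemble = VotingRegressor([
    ('rf', RandomForestRegressor(n_estimators=200, random_state=42)),
    ('gb', GradientBoostingRegressor(random_state=42)),
])
ensemble.fit(X_train_prepared, y_train)

preds = ensemble.predict(X_test_prepared)
print('MSE:', mean_squared_error(y_test, preds))
print('R-squared:', r2_score(y_test, preds))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;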




&lt;h2&gt;
  
  
  5. Deployment
&lt;/h2&gt;

&lt;p&gt;Deploy your model using frameworks like &lt;strong&gt;FastAPI&lt;/strong&gt;, &lt;strong&gt;Flask&lt;/strong&gt;, or &lt;strong&gt;Streamlit&lt;/strong&gt;. You can dockerize your application to enhance compatibility across different environments.&lt;/p&gt;

&lt;p&gt;Check out the &lt;code&gt;inferencing_and_deployment&lt;/code&gt; subfolder in the repo for details on how I did my deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Chatbot Creation Using OpenAI API
&lt;/h2&gt;

&lt;p&gt;Incorporating a chatbot gives our model a contemporary edge. While traditional machine learning methods are widely accepted and used in the industry, the advent of modern Large Language Models (LLMs), beginning with GPT-1 in 2018, has significantly disrupted the data field. &lt;/p&gt;

&lt;p&gt;Chatbots today often utilize Retrieval Augmented Generation (RAG) applications, which combine vector databases with LLMs like gpt-4-turbo to provide sophisticated responses to user queries. For more information on RAG applications, you can explore this &lt;a href="https://github.com/langchain/langchain" rel="noopener noreferrer"&gt;LangChain repository&lt;/a&gt; and watch this &lt;a href="https://www.youtube.com/watch?v=somevideo" rel="noopener noreferrer"&gt;video&lt;/a&gt; on the topic. &lt;/p&gt;

&lt;p&gt;OpenAI offers billable API keys to access their models, which you can find &lt;a href="https://platform.openai.com/overview" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;
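
&lt;p&gt;As a rough sketch of a single chat completion call with the official Python SDK (this assumes your API key is set in the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable, and the prompts are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model='gpt-4-turbo',
    messages=[
        {'role': 'system', 'content': 'You answer questions about housing in Kenya.'},
        {'role': 'user', 'content': 'What should I consider when renting in Nairobi?'},
    ],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;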

&lt;p&gt;For implementation details, refer to the &lt;code&gt;chatbot&lt;/code&gt; subfolder in the &lt;code&gt;nairobi_house_price_prediction_model&lt;/code&gt; repository.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Automation with Airflow (Optional)
&lt;/h2&gt;

&lt;p&gt;Automate your data processes using &lt;strong&gt;Apache Airflow&lt;/strong&gt;. Other tools like &lt;strong&gt;Apache Kafka&lt;/strong&gt; or &lt;strong&gt;Redpanda&lt;/strong&gt; can also be considered for data streaming. This component is still in progress.&lt;/p&gt;




&lt;p&gt;For a comprehensive view, visit the &lt;a href="https://github.com/KamauGilbert/nairobi_house_price_prediction_model" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For inquiries, connect with me on &lt;a href="https://linkedin.com/in/gilbert-kamau-mungai" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or email me at &lt;a href="mailto:kamaugilbert9@gmail.com"&gt;kamaugilbert9@gmail.com&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>data</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
