Brent Ochieng
How I Built a Credit Risk Prediction App Using Python, XGBoost, and Streamlit

Financial institutions lose billions every year due to loan defaults. One of the biggest challenges in banking and fintech is accurately identifying high-risk borrowers before loans are approved.

As part of my machine learning and data science portfolio, I built a complete Credit Risk Prediction system capable of predicting whether a customer is likely to default on a loan using financial and demographic data.

This project evolved from:

  • Exploratory Data Analysis in Jupyter Notebook
  • Machine Learning model development
  • Model optimization using XGBoost
  • Building a production-ready Streamlit application

In this article, I will walk through the complete process step by step.

Project link:

Project Objective

The main goal of this project was to develop a machine learning system capable of:

  • Predicting loan default probability
  • Assisting financial institutions in risk assessment
  • Automating borrower screening
  • Reducing financial losses from bad loans

The final solution allows users to enter customer financial information and instantly receive a prediction on whether the customer is likely to default.

Dataset Overview

The dataset used contained over 250,000 customer records with both numerical and categorical variables.

Some of the major features included:

| Feature | Description |
| --- | --- |
| Age | Customer age |
| Income | Annual income |
| LoanAmount | Requested loan amount |
| CreditScore | Borrower credit score |
| InterestRate | Applied loan interest rate |
| DTIRatio | Debt-to-income ratio |
| Education | Educational qualification |
| EmploymentType | Employment status |
| HasMortgage | Whether the customer has a mortgage |
| HasDependents | Whether the customer has dependents |
| LoanPurpose | Purpose of the loan |
| Default | Target variable |

The target variable was `Default`, where:

  • 1 = Customer defaults
  • 0 = Customer repays successfully

Step 1 — Data Cleaning & Preprocessing
Before model development, the dataset required preprocessing.

Removing Unnecessary Columns
The LoanID column had no predictive value, so it was removed.

```python
df = df.drop('LoanID', axis=1)
```

Binary Feature Transformation

Several categorical columns had Yes/No values.

These were converted into numerical representations.

```python
binary_cols = ['HasMortgage', 'HasDependents', 'HasCoSigner']
for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})
```

Handling Categorical Variables
Categorical features such as:

  • Education
  • EmploymentType
  • MaritalStatus
  • LoanPurpose

were transformed using encoding techniques to make them machine-readable.
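As an illustration, one-hot encoding with `pd.get_dummies` (the encoder the deployed app also relies on; the sample category values below are assumptions, not the real dataset):

```python
import pandas as pd

# Toy frame with the categorical columns named above (values are illustrative)
df = pd.DataFrame({
    "Education": ["Bachelor's", "High School", "Master's"],
    "EmploymentType": ["Full-time", "Unemployed", "Part-time"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
    "LoanPurpose": ["Auto", "Home", "Business"],
})

cat_cols = ["Education", "EmploymentType", "MaritalStatus", "LoanPurpose"]

# Each category becomes its own 0/1 indicator column
df_encoded = pd.get_dummies(df, columns=cat_cols)

print(df_encoded.columns.tolist())
```

Each original column is replaced by one indicator column per category, which is what makes later column alignment at prediction time necessary.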

Step 2 — Exploratory Data Analysis (EDA)
One of the most important stages of the project was understanding the data before modeling.

Using:

  • Matplotlib
  • Seaborn
  • Correlation analysis

I explored:

  • Default distributions
  • Credit score relationships
  • Income patterns
  • Interest rate trends
  • Loan amount impacts

Some important findings emerged.
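A minimal sketch of this kind of exploration. The synthetic stand-in data below is an assumption for illustration only (the real dataset has 250,000+ records); in the notebook the same questions were answered with Seaborn countplots and a correlation heatmap:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; column names come from the feature
# table, but the distributions here are made up for illustration
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "CreditScore": rng.integers(300, 850, n),
    "InterestRate": rng.uniform(2.0, 25.0, n),
    "Income": rng.uniform(20_000, 150_000, n),
    "DTIRatio": rng.uniform(0.05, 0.9, n),
    "Default": rng.integers(0, 2, n),
})

# Default distribution (a countplot in the notebook)
print(df["Default"].value_counts())

# Correlation of each numeric feature with the target (a heatmap in the notebook)
corr = df.corr(numeric_only=True)["Default"].drop("Default")
print(corr.sort_values())
```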

Key Business Insights
1. Credit Score Strongly Influences Default Risk
Customers with lower credit scores had significantly higher probabilities of default.

This aligned with real-world financial behavior.

2. High Interest Rates Increase Risk
Borrowers with higher interest rates tended to default more frequently.

This suggests that lenders often charge higher interest rates to already risky borrowers.

3. Employment Stability Matters
Unemployed or unstable-income borrowers showed elevated risk patterns.

4. Debt-to-Income Ratio Was Highly Informative
Customers with high DTI ratios struggled more with repayment obligations.

Step 3 — Machine Learning Model Development
I experimented with multiple machine learning algorithms.

Models Tested

  1. Logistic Regression
    Used as a baseline classification model.

  2. Random Forest
    Implemented to capture non-linear feature relationships.

  3. XGBoost
    Ultimately selected due to:

  • Higher predictive performance
  • Better handling of imbalanced data
  • Strong generalization ability

Why I Chose XGBoost
XGBoost outperformed the other models in:

  • Recall
  • ROC-AUC
  • Classification robustness

The dataset had class imbalance issues, meaning defaulters were fewer than non-defaulters.

To address this, I implemented the `scale_pos_weight` parameter.

This helped the model pay more attention to the minority class.
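A common way to set this weight is the ratio of negative to positive examples. A sketch (the 90/10 split and the commented `XGBClassifier` call are illustrative assumptions, not the real training code):

```python
import numpy as np

# y stands in for the training labels; defaults (1) are the minority class
y = np.array([0] * 90 + [1] * 10)

neg, pos = np.bincount(y)
scale_pos_weight = neg / pos  # majority-to-minority ratio

print(scale_pos_weight)  # 9.0

# The weight is then passed to the booster, roughly:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(scale_pos_weight=scale_pos_weight)
#   model.fit(X_train, y_train)
```

With this setting, each misclassified defaulter contributes proportionally more to the loss, pushing the model toward the minority class.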

Model Evaluation
Instead of focusing only on accuracy, I prioritized metrics that matter in real-world financial systems.

Key Metrics Used

  1. Recall
    Critical for detecting high-risk borrowers.

  2. Precision
    Important for reducing false alarms.

  3. ROC-AUC
    Measured the model’s ability to distinguish between risky and safe borrowers.
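The three metrics above can be computed with scikit-learn. A quick sketch on illustrative held-out labels (the numbers are made up for demonstration):

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Illustrative ground truth and model probabilities for a small holdout set
y_true  = [0, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.2, 0.8, 0.6, 0.3, 0.7, 0.9]

# Hard labels at a 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print("Recall:   ", recall_score(y_true, y_pred))    # share of defaulters caught
print("Precision:", precision_score(y_true, y_pred)) # share of flagged that default
print("ROC-AUC:  ", roc_auc_score(y_true, y_score))  # ranking quality, threshold-free
```

Note that ROC-AUC is computed from the probabilities, not the thresholded labels, so it measures ranking ability independently of any cutoff.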

Step 4 — Saving the Trained Model
After training the model, I serialized it using joblib.

This allows the model to be reused without retraining every time.

```python
import joblib

joblib.dump(model, 'credit_risk_model.pkl')
joblib.dump(model_columns, 'model_columns.pkl')
```

These .pkl files became the backbone of the deployment pipeline.

Step 5 — Building the Streamlit Application

Once the machine learning pipeline was complete, I transformed the notebook into a real interactive AI application using Streamlit.

The goal was to create a system where users could:

  • Enter customer details
  • Click a prediction button
  • Receive instant risk analysis

Streamlit Application Features
The application includes:

Interactive Customer Input Forms
Users can provide:

  • Income
  • Loan amount
  • Credit score
  • Interest rate
  • Employment status
  • Loan purpose
  • Mortgage information

Real-Time Predictions
The model instantly predicts:

  • High Risk
  • Low Risk

alongside the default probability.
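A minimal sketch of the decision step behind those two labels (the 0.5 threshold and the widget names in the comments are assumptions, not the exact app code):

```python
# In app.py this sits behind Streamlit widgets, roughly:
#   income = st.number_input("Annual income", min_value=0.0)
#   score  = st.number_input("Credit score", 300, 850)
#   if st.button("Predict"): ...

def risk_label(default_probability: float, threshold: float = 0.5) -> str:
    """Map the model's default probability to the app's two risk classes."""
    return "High Risk" if default_probability >= threshold else "Low Risk"

print(risk_label(0.73))  # High Risk
print(risk_label(0.12))  # Low Risk
```

Keeping the threshold as a parameter makes it easy to trade recall for precision later without retraining.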

Intelligent Feature Alignment
One challenge during deployment was ensuring the app inputs aligned perfectly with the training features.

To solve this, I applied `pd.get_dummies()` followed by column alignment using:

```python
final_features = pd.DataFrame(0.0, index=[0], columns=model_columns)
```

This guaranteed that all prediction inputs matched the original training structure.
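Putting the two steps together, a sketch of the alignment logic (the column names below are an assumed subset of what `model_columns.pkl` would contain, and the user input is illustrative):

```python
import pandas as pd

# Columns saved at training time (loaded from model_columns.pkl in the app);
# this short list is an assumed subset for illustration
model_columns = [
    "Income", "LoanAmount", "CreditScore",
    "Education_Bachelor's", "Education_Master's",
    "EmploymentType_Full-time", "EmploymentType_Unemployed",
]

# Raw user input collected from the form
user_input = pd.DataFrame([{
    "Income": 55_000, "LoanAmount": 12_000, "CreditScore": 640,
    "Education": "Master's", "EmploymentType": "Full-time",
}])

encoded = pd.get_dummies(user_input)

# Start from an all-zero row with the training columns, then copy over
# whatever the encoded input provides; categories unseen in this input stay 0
final_features = pd.DataFrame(0.0, index=[0], columns=model_columns)
shared = encoded.columns.intersection(final_features.columns)
final_features[shared] = encoded[shared].astype(float)

print(final_features)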

Streamlit Application Logic
The application workflow looks like this:

Step 1 — Load Model

```python
model = joblib.load('credit_risk_model.pkl')
```

Step 2 — Collect User Inputs
The user fills financial and demographic details.

Step 3 — Encode Features
Categorical variables are transformed using one-hot encoding.

Step 4 — Align Features
Missing columns are initialized to zero.

Step 5 — Generate Prediction
The model predicts:

  • Loan default classification
  • Default probability score

Challenges I Faced
Like many real-world machine learning projects, deployment introduced several challenges.

1. Feature Mismatch Errors
The biggest issue occurred when prediction inputs did not match the training dataset columns.

This caused:

  • shape mismatch errors
  • model prediction failures

I solved this using:

  • stored training columns
  • dynamic column alignment
  • default zero initialization

2. Data Type Conflicts
Some encoded columns returned mixed types.

The solution was forcing all features to float:

```python
final_features = final_features.astype(float)
```

3. Model Serialization

Ensuring the trained model and preprocessing pipeline loaded correctly required careful file management using .pkl files.

Final Project Structure
```
ML PROJECTS/
├── credit_risk_prediction.ipynb
├── Loan_default.csv
├── app.py
├── train_model.py
├── credit_risk_model.pkl
├── model_columns.pkl
├── requirements.txt
└── README.md
```

Deployment
The application can now be deployed using:

  • Streamlit Community Cloud
  • Render
  • Railway
  • AWS
  • Azure

The deployment process only requires:

  • a GitHub repository
  • requirements.txt
  • the Streamlit app file
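For this stack, a minimal requirements.txt might look like the following (unpinned for brevity; in practice you would pin the versions you trained with):

```
streamlit
pandas
scikit-learn
xgboost
joblib
```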

Business Value of the Project
This solution demonstrates how machine learning can support real business decision-making.

Financial institutions can use systems like this to:

  • Automate loan screening
  • Reduce manual workload
  • Detect risky borrowers earlier
  • Improve lending accuracy
  • Reduce credit losses

Lessons Learned
This project taught me several critical data science skills:

Technical Skills

  • End-to-end ML workflow
  • Feature engineering
  • Model optimization
  • Streamlit deployment
  • Model serialization
  • Production debugging

Business Skills

  • Translating business problems into ML solutions
  • Understanding risk analytics
  • Communicating technical insights

Final Thoughts
Building machine learning models is only one part of the data science lifecycle.

The real value comes from transforming those models into usable systems that solve real-world problems.

This project allowed me to bridge:

  • Data analysis
  • Machine learning
  • Software deployment
  • Business intelligence

into one end-to-end AI solution.

As I continue growing as a Data Analyst and Data Scientist, projects like this help me strengthen both my technical and problem-solving abilities while building solutions with practical business impact.
