Brent Ochieng
How I Built a Credit Risk Prediction App Using Python, XGBoost, and Streamlit

Financial institutions lose billions every year due to loan defaults. One of the biggest challenges in banking and fintech is accurately identifying high-risk borrowers before loans are approved.

As part of my machine learning and data science portfolio, I built a complete Credit Risk Prediction system capable of predicting whether a customer is likely to default on a loan using financial and demographic data.

This project evolved from:

  • Exploratory Data Analysis in Jupyter Notebook
  • Machine Learning model development
  • Model optimization using XGBoost
  • Building a production-ready Streamlit application

In this article, I will walk through the complete process step by step.

Project link:

Project Objective

The main goal of this project was to develop a machine learning system capable of:

  • Predicting loan default probability
  • Assisting financial institutions in risk assessment
  • Automating borrower screening
  • Reducing financial losses from bad loans

The final solution allows users to enter customer financial information and instantly receive a prediction on whether the customer is likely to default.

Dataset Overview

The dataset used contained over 250,000 customer records with both numerical and categorical variables.

Some of the major features included:

| Feature | Description |
| --- | --- |
| Age | Customer age |
| Income | Annual income |
| LoanAmount | Requested loan amount |
| CreditScore | Borrower credit score |
| InterestRate | Applied loan interest rate |
| DTIRatio | Debt-to-income ratio |
| Education | Educational qualification |
| EmploymentType | Employment status |
| HasMortgage | Whether the customer has a mortgage |
| HasDependents | Whether the customer has dependents |
| LoanPurpose | Purpose of the loan |
| Default | Target variable |

The target variable was `Default`, where:

  • 1 = Customer defaults
  • 0 = Customer repays successfully

Step 1 — Data Cleaning & Preprocessing
Before model development, the dataset required preprocessing.

Removing Unnecessary Columns
The LoanID column had no predictive value, so it was removed.

```python
df = df.drop('LoanID', axis=1)
```

Binary Feature Transformation

Several categorical columns had Yes/No values.

These were converted into numerical representations.

```python
binary_cols = ['HasMortgage', 'HasDependents', 'HasCoSigner']
for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})
```

Handling Categorical Variables
Categorical features such as:

  • Education
  • EmploymentType
  • MaritalStatus
  • LoanPurpose

were transformed using encoding techniques to make them machine-readable.
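As an illustration, one-hot encoding with `pd.get_dummies` (the encoder the deployed app also relies on; the sample category values below are assumptions, not the real dataset):

```python
import pandas as pd

# Toy frame with the categorical columns named above (values are illustrative)
df = pd.DataFrame({
    "Education": ["Bachelor's", "High School", "Master's"],
    "EmploymentType": ["Full-time", "Unemployed", "Part-time"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
    "LoanPurpose": ["Auto", "Home", "Business"],
})

cat_cols = ["Education", "EmploymentType", "MaritalStatus", "LoanPurpose"]

# Each category becomes its own 0/1 indicator column
df_encoded = pd.get_dummies(df, columns=cat_cols)

print(df_encoded.columns.tolist())
```

Each original column is replaced by one indicator column per category, which is what makes later column alignment at prediction time necessary.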

Step 2 — Exploratory Data Analysis (EDA)
One of the most important stages of the project was understanding the data before modeling.

Using:

  • Matplotlib
  • Seaborn
  • Correlation analysis

I explored:

  • Default distributions
  • Credit score relationships
  • Income patterns
  • Interest rate trends
  • Loan amount impacts

Some important findings emerged.
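A minimal sketch of this kind of exploration. The synthetic stand-in data below is an assumption for illustration only (the real dataset has 250,000+ records); in the notebook the same questions were answered with Seaborn countplots and a correlation heatmap:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; column names come from the feature
# table, but the distributions here are made up for illustration
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "CreditScore": rng.integers(300, 850, n),
    "InterestRate": rng.uniform(2.0, 25.0, n),
    "Income": rng.uniform(20_000, 150_000, n),
    "DTIRatio": rng.uniform(0.05, 0.9, n),
    "Default": rng.integers(0, 2, n),
})

# Default distribution (a countplot in the notebook)
print(df["Default"].value_counts())

# Correlation of each numeric feature with the target (a heatmap in the notebook)
corr = df.corr(numeric_only=True)["Default"].drop("Default")
print(corr.sort_values())
```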

Key Business Insights
1. Credit Score Strongly Influences Default Risk
Customers with lower credit scores had significantly higher probabilities of default.

This aligned with real-world financial behavior.

2. High Interest Rates Increase Risk
Borrowers with higher interest rates tended to default more frequently.

This suggests that lenders often charge higher interest rates to already risky borrowers.

3. Employment Stability Matters
Unemployed or unstable-income borrowers showed elevated risk patterns.

4. Debt-to-Income Ratio Was Highly Informative
Customers with high DTI ratios struggled more with repayment obligations.

Step 3 — Machine Learning Model Development
I experimented with multiple machine learning algorithms.

Models Tested

  1. Logistic Regression
    Used as a baseline classification model.

  2. Random Forest
    Implemented to capture non-linear feature relationships.

  3. XGBoost
    Ultimately selected due to:

  • Higher predictive performance
  • Better handling of imbalanced data
  • Strong generalization ability

Why I Chose XGBoost
XGBoost outperformed the other models in:

  • Recall
  • ROC-AUC
  • Classification robustness

The dataset had class imbalance issues, meaning defaulters were fewer than non-defaulters.

To address this, I implemented the `scale_pos_weight` parameter.

This helped the model pay more attention to the minority class.
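A common way to set this weight is the ratio of negative to positive examples. A sketch (the 90/10 split and the commented `XGBClassifier` call are illustrative assumptions, not the real training code):

```python
import numpy as np

# y stands in for the training labels; defaults (1) are the minority class
y = np.array([0] * 90 + [1] * 10)

neg, pos = np.bincount(y)
scale_pos_weight = neg / pos  # majority-to-minority ratio

print(scale_pos_weight)  # 9.0

# The weight is then passed to the booster, roughly:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(scale_pos_weight=scale_pos_weight)
#   model.fit(X_train, y_train)
```

With this setting, each misclassified defaulter contributes proportionally more to the loss, pushing the model toward the minority class.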

Model Evaluation
Instead of focusing only on accuracy, I prioritized metrics that matter in real-world financial systems.

Key Metrics Used

  1. Recall
    Critical for detecting high-risk borrowers.

  2. Precision
    Important for reducing false alarms.

  3. ROC-AUC
    Measured the model’s ability to distinguish between risky and safe borrowers.
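The three metrics above can be computed with scikit-learn. A quick sketch on illustrative held-out labels (the numbers are made up for demonstration):

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Illustrative ground truth and model probabilities for a small holdout set
y_true  = [0, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.2, 0.8, 0.6, 0.3, 0.7, 0.9]

# Hard labels at a 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print("Recall:   ", recall_score(y_true, y_pred))    # share of defaulters caught
print("Precision:", precision_score(y_true, y_pred)) # share of flagged that default
print("ROC-AUC:  ", roc_auc_score(y_true, y_score))  # ranking quality, threshold-free
```

Note that ROC-AUC is computed from the probabilities, not the thresholded labels, so it measures ranking ability independently of any cutoff.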

Step 4 — Saving the Trained Model
After training the model, I serialized it using joblib.

This allows the model to be reused without retraining every time.

```python
import joblib

joblib.dump(model, 'credit_risk_model.pkl')
joblib.dump(model_columns, 'model_columns.pkl')
```

These .pkl files became the backbone of the deployment pipeline.

Step 5 — Building the Streamlit Application

Once the machine learning pipeline was complete, I transformed the notebook into a real interactive AI application using Streamlit.

The goal was to create a system where users could:

  • Enter customer details
  • Click a prediction button
  • Receive instant risk analysis

Streamlit Application Features
The application includes:

Interactive Customer Input Forms
Users can provide:

  • Income
  • Loan amount
  • Credit score
  • Interest rate
  • Employment status
  • Loan purpose
  • Mortgage information

Real-Time Predictions
The model instantly predicts:

  • High Risk
  • Low Risk

alongside the default probability.
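A minimal sketch of the decision step behind those two labels (the 0.5 threshold and the widget names in the comments are assumptions, not the exact app code):

```python
# In app.py this sits behind Streamlit widgets, roughly:
#   income = st.number_input("Annual income", min_value=0.0)
#   score  = st.number_input("Credit score", 300, 850)
#   if st.button("Predict"): ...

def risk_label(default_probability: float, threshold: float = 0.5) -> str:
    """Map the model's default probability to the app's two risk classes."""
    return "High Risk" if default_probability >= threshold else "Low Risk"

print(risk_label(0.73))  # High Risk
print(risk_label(0.12))  # Low Risk
```

Keeping the threshold as a parameter makes it easy to trade recall for precision later without retraining.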

Intelligent Feature Alignment
One challenge during deployment was ensuring the app inputs aligned perfectly with the training features.

To solve this, I applied `pd.get_dummies()` followed by column alignment using:

```python
final_features = pd.DataFrame(0.0, index=[0], columns=model_columns)
```

This guaranteed that all prediction inputs matched the original training structure.
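Putting the two steps together, a sketch of the alignment logic (the column names below are an assumed subset of what `model_columns.pkl` would contain, and the user input is illustrative):

```python
import pandas as pd

# Columns saved at training time (loaded from model_columns.pkl in the app);
# this short list is an assumed subset for illustration
model_columns = [
    "Income", "LoanAmount", "CreditScore",
    "Education_Bachelor's", "Education_Master's",
    "EmploymentType_Full-time", "EmploymentType_Unemployed",
]

# Raw user input collected from the form
user_input = pd.DataFrame([{
    "Income": 55_000, "LoanAmount": 12_000, "CreditScore": 640,
    "Education": "Master's", "EmploymentType": "Full-time",
}])

encoded = pd.get_dummies(user_input)

# Start from an all-zero row with the training columns, then copy over
# whatever the encoded input provides; categories unseen in this input stay 0
final_features = pd.DataFrame(0.0, index=[0], columns=model_columns)
shared = encoded.columns.intersection(final_features.columns)
final_features[shared] = encoded[shared].astype(float)

print(final_features)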

Streamlit Application Logic
The application workflow looks like this:

Step 1 — Load Model

```python
model = joblib.load('credit_risk_model.pkl')
```

Step 2 — Collect User Inputs
The user fills financial and demographic details.

Step 3 — Encode Features
Categorical variables are transformed using one-hot encoding.

Step 4 — Align Features
Missing columns are initialized to zero.

Step 5 — Generate Prediction
The model predicts:

  • Loan default classification
  • Default probability score

Challenges I Faced
Like many real-world machine learning projects, deployment introduced several challenges.

1. Feature Mismatch Errors
The biggest issue occurred when prediction inputs did not match the training dataset columns.

This caused:

  • shape mismatch errors
  • model prediction failures

I solved this using:

  • stored training columns
  • dynamic column alignment
  • default zero initialization

2. Data Type Conflicts
Some encoded columns returned mixed types.

The solution was forcing all features to float:

```python
final_features = final_features.astype(float)
```

3. Model Serialization

Ensuring the trained model and preprocessing pipeline loaded correctly required careful file management using .pkl files.

Final Project Structure
```
ML PROJECTS/
├── credit_risk_prediction.ipynb
├── Loan_default.csv
├── app.py
├── train_model.py
├── credit_risk_model.pkl
├── model_columns.pkl
├── requirements.txt
└── README.md
```

Deployment
The application can now be deployed using:

  • Streamlit Community Cloud
  • Render
  • Railway
  • AWS
  • Azure

The deployment process only requires:

  • a GitHub repository
  • requirements.txt
  • the Streamlit app file
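For this stack, a minimal requirements.txt might look like the following (unpinned for brevity; in practice you would pin the versions you trained with):

```
streamlit
pandas
scikit-learn
xgboost
joblib
```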

Business Value of the Project
This solution demonstrates how machine learning can support real business decision-making.

Financial institutions can use systems like this to:

  • Automate loan screening
  • Reduce manual workload
  • Detect risky borrowers earlier
  • Improve lending accuracy
  • Reduce credit losses

Lessons Learned
This project taught me several critical data science skills:

Technical Skills

  • End-to-end ML workflow
  • Feature engineering
  • Model optimization
  • Streamlit deployment
  • Model serialization
  • Production debugging

Business Skills

  • Translating business problems into ML solutions
  • Understanding risk analytics
  • Communicating technical insights

Final Thoughts
Building machine learning models is only one part of the data science lifecycle.

The real value comes from transforming those models into usable systems that solve real-world problems.

This project allowed me to bridge:

  • Data analysis
  • Machine learning
  • Software deployment
  • Business intelligence

into one end-to-end AI solution.

As I continue growing as a Data Analyst and Data Scientist, projects like this help me strengthen both my technical and problem-solving abilities while building solutions with practical business impact.
