Have you ever wondered what the “system” does to approve you instantly when you apply for a credit limit increase on your credit card? Does it make an API call to ChatGPT with your credit report, or do banks have their own AI/ML systems? Banks today likely use a complex solution for this problem, which I have tried to explore by building my very own ML analysis for predicting credit card defaults.
The average American holds 3.9 credit cards, and in the States alone there are hundreds of millions of active credit cards. For each of these, banks, credit unions, and other lenders do one common but critical thing: complex calculations to determine which credit card clients may fail to make their next payment. The approach needs to be precise: too liberal and defaults pile up, hurting the bank’s business; too conservative and profits shrink, with good customers potentially lost along the way.
Problem Summary
The project’s question is straightforward: predict whether a credit card client will default on their payment next month. This is a binary classification problem where my ML model needs to flag risky accounts while also minimizing false alarms. Banks use a more complex version of these predictions to adjust credit limits, put accounts under financial review, or hold accounts before losses occur.
Exploratory Data Analysis
I was provided a dataset of 30,000 Taiwanese credit card clients from 2005. It contained 24 features covering demographics, credit limits, payment history, billing information, and more. The variables I identified as most important for this project were the payment status indicators across six months (PAY_0 through PAY_6), which record whether a client paid on time, late, or early.
The dataset also contained a significant class imbalance, shown in Figure 1 below. About 77.7% of clients didn’t default while 22.3% did. This roughly 3.5:1 ratio steered me away from accuracy as a metric: a dummy model could achieve 78% accuracy by always predicting “no default.” Figure 2 shows feature distributions. Here we see huge scale differences, like credit limits ranging from 10,000 to over 1 million, while payment status values span a much smaller range, from -2 to 8. This prompted the need for feature scaling.
Figure 1: Distribution of target variable showing 77.7% non-default vs 22.3% default
Figure 2: Feature distributions comparing default and non-default groups across key variables
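As a minimal sketch of what that scaling step can look like with scikit-learn (the feature matrix `X` here is a hypothetical stand-in for the dataset's feature columns):

```python
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix `X` holding columns like LIMIT_BAL and PAY_0.
# StandardScaler puts credit limits (tens of thousands to 1M+) and payment
# statuses (-2 to 8) on a comparable zero-mean, unit-variance scale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # in practice, fit on the training split only
```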
How I approached the problem
I started by engineering new features to capture patterns in the raw data. These included a credit utilization ratio (just like the one on our credit reports), average payment status across all months, average bill amount, and average payment amount.
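A rough sketch of that feature engineering, assuming a DataFrame `df` with the standard UCI column names; the exact formulas here (especially utilization as most recent bill over credit limit) are my reconstruction:

```python
import pandas as pd

# Assumed UCI column names; note the dataset labels its payment status
# columns PAY_0, PAY_2, ..., PAY_6 (there is no PAY_1)
pay_cols = ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]
bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
pay_amt_cols = [f"PAY_AMT{i}" for i in range(1, 7)]

# Most recent bill relative to the credit limit
df["CREDIT_UTILIZATION"] = df["BILL_AMT1"] / df["LIMIT_BAL"]
df["AVG_PAY_STATUS"] = df[pay_cols].mean(axis=1)
df["AVG_BILL_AMT"] = df[bill_cols].mean(axis=1)
df["AVG_PAY_AMT"] = df[pay_amt_cols].mean(axis=1)
```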
After dropping the ID column and the SEX column (to avoid gender bias), I tested four model types, sketched in code after the list:
Logistic Regression, a baseline linear classifier;
Random Forest, an ensemble of decision trees;
Gradient Boosting, sequential tree building where each tree corrects the previous ones’ errors;
and K-Nearest Neighbors, instance-based learning using similar examples.
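A minimal sketch of that lineup with scikit-learn defaults (the actual hyperparameters came from tuning, described next):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Default configurations as a starting point; tuning comes next
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}
```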
I applied hyperparameter tuning to each model using 5-fold cross-validation to find the optimal configuration for each. I used ROC-AUC (Area Under the Receiver Operating Characteristic Curve) as my primary metric because it evaluates model performance across all classification thresholds, making it well suited to imbalanced datasets.
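Here is a sketch of that tuning loop for one of the models, assuming `X_train` and `y_train` are the prepared training split; the grid shown is illustrative, not my exact search space:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid, not the exact search space I used
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # threshold-independent, so robust to class imbalance
    cv=5,               # 5-fold cross-validation
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```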
Findings
My tuned Gradient Boosting model was the clear winner, with a test ROC-AUC of 0.7821, an F1 score of 0.4784, and an accuracy of 82.12%. To put this in perspective, the baseline model that always predicted “no default” achieved 77.68% accuracy (as expected, for the reasons mentioned earlier) but had a ROC-AUC of exactly 0.5, no better than a coin toss.
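That baseline comparison is easy to reproduce with scikit-learn’s DummyClassifier (again assuming the hypothetical train/test split variables from earlier):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Always predicts the majority class, i.e. "no default"
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

print(accuracy_score(y_test, baseline.predict(X_test)))             # ~0.78
print(roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))  # exactly 0.5
```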
Figure 3 below shows the feature importance analysis. PAY_0, the most recent payment status, was by far the most critical feature, accounting for 53% of the model’s decision-making. My engineered feature, AVG_PAY_STATUS, was the second most important at 20%. Together these two drove over 72% of predictions, with a steep drop-off after. Credit utilization, average bill amount, and average payment amount also contributed smaller but meaningful signals at around 3% each.
Figure 3: Feature importance showing PAY_0 (53%) and AVG_PAY_STATUS (20%) as dominant predictors
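Pulling those numbers out of the tuned model is a one-liner, assuming `search.best_estimator_` from the tuning sketch above and that `X` still carries its column names:

```python
import pandas as pd

# Gradient Boosting exposes per-feature importances after fitting
importances = pd.Series(search.best_estimator_.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())  # PAY_0 and AVG_PAY_STATUS on top
```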
Looking at individual predictions, one client correctly flagged as high-risk had PAY_0 = 2 (meaning the payment was delayed by two months), AVG_PAY_STATUS = 2.0 (indicating consistent delays), and credit utilization of 91.7%. On the other hand, a client correctly classified as low-risk showed PAY_0 = -1 (meaning paid early), a negative average payment status (-0.33), and a relatively moderate utilization of 72.5%.
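This kind of per-client inspection is just a predict_proba call (hypothetical variable names again, with `X_test` assumed to still be a DataFrame):

```python
# Default probability for a single held-out client
client = X_test.iloc[[0]]  # double brackets keep the 2-D shape for a one-row prediction
prob_default = search.best_estimator_.predict_proba(client)[0, 1]
print(f"Predicted default probability: {prob_default:.1%}")
```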
Limitations
These results may sound great, but some factors can make the model less reliable in production. I will list the two that stand out most to me (not necessarily the most important):
1. Data’s age and locality: The dataset is nearly two decades old. Credit card behavior, economic conditions, and lending practices have evolved significantly since then. What worked for Taiwanese consumers in 2005 may not apply today or in other regions. For example, the US has significantly higher credit card spending.
2. Personal bias in engineered features: My engineered features like AVG_PAY_STATUS and credit utilization bake in my assumption that averaging payment behavior across months is meaningful. However, this is not necessarily true today. There are literal Reddit posts that teach how to get a massive credit card limit by being a good consumer for the first six months, then maxing the card out after being approved for a big credit limit increase. An average smooths over exactly that pattern: it won’t reflect a sudden spike in utilization or recent delayed payments.
Wrapping Up
Now, we could improve the dataset by collecting more features, for example, economic indicators like regional unemployment rates, or seasonal effects in a client’s life that may influence default patterns. Crucial life-changing events like job loss, medical emergencies, or relationship issues can also drive defaults. Banks usually offer insurance for such events, so a model trained purely on the historical patterns in a client’s credit report may perform well in stable conditions but miss these sudden shocks.


