Handling Class Imbalance in Fraud Detection with scikit-learn
Every fraud detection tutorial I've seen makes the same mistake. They train a model, print the accuracy score — 99.8% — and declare success.
That model is useless.
In a dataset where 0.17% of transactions are fraudulent, a model that predicts "legitimate" for every single transaction achieves 99.83% accuracy. It has never detected a single fraud case in its life.
This is the class imbalance problem, and it's the single most important thing to understand before building any fraud detection system.
In this tutorial I'll show you exactly how to handle it correctly using scikit-learn. By the end you'll have a working fraud detection pipeline that actually catches fraud.
Prerequisites
Python 3.8+
Basic understanding of classification
pip installed
The Dataset
We'll use the Credit Card Fraud Detection dataset from Kaggle. It contains 284,807 transactions with only 492 fraud cases — a fraud rate of 0.17%. This is a real-world class imbalance problem.
Download it from Kaggle and save it as creditcard.csv.
Step 1 — Explore the Data First
Never start modeling without understanding your data.
import pandas as pd
import numpy as np

df = pd.read_csv("creditcard.csv")

# Always check this first
print(f"Dataset shape: {df.shape}")
print("\nClass distribution:")
print(df["Class"].value_counts())
print(f"\nFraud rate: {df['Class'].mean():.4%}")
print(f"\nMissing values: {df.isnull().sum().sum()}")
Output:
Dataset shape: (284807, 31)

Class distribution:
0    284315
1       492

Fraud rate: 0.1727%

Missing values: 0
This tells us everything we need to know. 492 fraud cases against 284,315 legitimate transactions. This is severe class imbalance.
Step 2 — Why Accuracy Is the Wrong Metric
Before we build anything, let's prove why accuracy is meaningless here.
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df.drop("Class", axis=1)
y = df["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A model that predicts the majority class every time
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)

print(f"Dummy model accuracy: {accuracy_score(y_test, y_pred):.4%}")
Output:
Dummy model accuracy: 99.8274%
A model that has learned absolutely nothing achieves 99.83% accuracy. This is why you must never use accuracy as your primary metric for imbalanced classification.
Step 3 — Use the Right Metrics
Accuracy is out, so let's set up an evaluation helper around the metrics that actually work for fraud detection:
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score
)

def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    print(f"\n{'='*50}")
    print(f"Model: {model_name}")
    print(f"{'='*50}")
    print(f"\nAUC-ROC:   {roc_auc_score(y_test, y_prob):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred,
                                target_names=["Legitimate", "Fraud"]))
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

Here is what each metric means in a fraud context:
AUC-ROC — measures how well the model separates fraud from legitimate transactions across all thresholds. 1.0 is perfect, 0.5 is random guessing. This is your primary metric.
Recall — of all actual fraud cases, how many did we catch? Missing real fraud is the most costly mistake. Prioritize this.
Precision — of all predicted fraud cases, how many were real? Low precision means too many false alarms blocking legitimate customers.
F1 Score — harmonic mean of precision and recall. Good overall measure when you need to balance both.
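To make these concrete, here is a tiny worked example with invented numbers (not from our dataset): suppose 100 of 1,000 transactions are actually fraud, and the model flags 120 transactions, 80 of them correctly.

from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

# Hypothetical numbers: 100 actual frauds among 1,000 transactions;
# the model flags 120 transactions, 80 of them correctly
y_true = np.array([1] * 100 + [0] * 900)
y_pred = np.concatenate([
    np.ones(80), np.zeros(20),    # 80 frauds caught, 20 missed
    np.ones(40), np.zeros(860),   # 40 false alarms, 860 correct passes
]).astype(int)

print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 80/120 = 0.667
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # 80/100 = 0.800
print(f"F1 Score:  {f1_score(y_true, y_pred):.3f}")         # 0.727

Notice that accuracy on this toy example would be 94%, even though a fifth of the fraud slipped through. That's the gap these metrics expose.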
Step 4 — Preprocess the Data
from sklearn.preprocessing import StandardScaler

X = df.drop("Class", axis=1)
y = df["Class"]

# Stratify ensures both splits maintain the same fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Critical for imbalanced data
)

# Scale features
# Fit only on training data — never on test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set fraud rate: {y_train.mean():.4%}")
print(f"Test set fraud rate: {y_test.mean():.4%}")
Output:
Training set fraud rate: 0.1727%
Test set fraud rate: 0.1727%
Stratify ensures both splits have the same fraud rate. Without it you might accidentally create a test set with no fraud cases at all.
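If you want to see the effect yourself, here is a quick sketch that prints the test fraud rate with and without stratification for a few seeds; the unstratified rate typically drifts from split to split:

from sklearn.model_selection import train_test_split

# Test-set fraud rate with and without stratification, across a few seeds
for seed in (0, 1, 2):
    _, _, _, y_test_plain = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    _, _, _, y_test_strat = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    print(f"seed={seed}  plain: {y_test_plain.mean():.4%}  "
          f"stratified: {y_test_strat.mean():.4%}")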
Step 5 — Approach 1: Class Weights
The simplest approach. Tell the model to penalize misclassifying fraud cases more heavily.
from sklearn.linear_model import LogisticRegression

# Without class weights — baseline
lr_baseline = LogisticRegression(
    random_state=42,
    max_iter=1000
)
lr_baseline.fit(X_train_scaled, y_train)
evaluate_model(lr_baseline, X_test_scaled,
               y_test, "Logistic Regression (No Weights)")

# With class weights — handles imbalance
lr_weighted = LogisticRegression(
    class_weight="balanced",  # This is the key change
    random_state=42,
    max_iter=1000
)
lr_weighted.fit(X_train_scaled, y_train)
evaluate_model(lr_weighted, X_test_scaled,
               y_test, "Logistic Regression (Balanced)")
class_weight="balanced" automatically calculates weights inversely proportional to class frequencies. Fraud cases get much higher weight so misclassifying them costs more.
Step 6 — Approach 2: Random Forest with Class Weights
Tree-based models often handle imbalance better than linear models, and they support class weighting too.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)
rf.fit(X_train_scaled, y_train)
evaluate_model(rf, X_test_scaled,
               y_test, "Random Forest (Balanced)")
Random Forest typically outperforms Logistic Regression on fraud detection because fraud patterns are highly nonlinear.
Step 7 — Approach 3: SMOTE Oversampling
SMOTE (Synthetic Minority Oversampling Technique) creates synthetic fraud samples to balance the dataset.
# Requires: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(
    X_train_scaled, y_train
)

print(f"Before SMOTE: {y_train.value_counts().to_dict()}")
print(f"After SMOTE: {pd.Series(y_train_resampled).value_counts().to_dict()}")

rf_smote = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf_smote.fit(X_train_resampled, y_train_resampled)
evaluate_model(rf_smote, X_test_scaled,
               y_test, "Random Forest + SMOTE")
Important — apply SMOTE only to the training data, never to the test data. You want to evaluate on the real distribution, not on synthetic data.
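If you later add cross-validation, the easiest way to honor that rule is imbalanced-learn's own Pipeline, which applies SMOTE only to the training portion of each fold. A minimal sketch:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The sampler runs inside each fold, on that fold's training data only
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
])

scores = cross_val_score(pipeline, X_train_scaled, y_train,
                         cv=5, scoring="roc_auc", n_jobs=-1)
print(f"Cross-validated AUC-ROC: {scores.mean():.4f} (+/- {scores.std():.4f})")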
Step 8 — Tune the Classification Threshold
By default scikit-learn uses 0.5 as the fraud threshold. This is almost never optimal for imbalanced problems.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_prob = rf.predict_proba(X_test_scaled)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(
    y_test, y_prob
)

# Find the threshold that maximizes F1.
# precision_recall_curve returns one more precision/recall point
# than thresholds, so drop the last point before indexing
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores[:-1])]

print("Default threshold (0.5) results:")
y_pred_default = (y_prob >= 0.5).astype(int)
print(f"Recall: {recall_score(y_test, y_pred_default):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_default):.4f}")

print(f"\nOptimal threshold ({best_threshold:.3f}) results:")
y_pred_optimal = (y_prob >= best_threshold).astype(int)
print(f"Recall: {recall_score(y_test, y_pred_optimal):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_optimal):.4f}")
In fraud detection you usually want to lower the threshold to catch more fraud at the cost of more false alarms. The right threshold depends on the business cost of each error type.
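If you can attach a price to each error type, you can go one step further and pick the threshold that minimizes expected cost instead of maximizing F1. The cost figures below are invented for illustration, and in a real project you would tune the threshold on a validation split rather than the test set:

import numpy as np

# Invented costs for illustration; replace with your real business numbers
COST_FN = 500  # a missed fraud case (average loss)
COST_FP = 5    # a false alarm (review cost, customer friction)

yt = np.asarray(y_test)
candidate_thresholds = np.linspace(0.01, 0.99, 99)
costs = [
    ((yt == 1) & (y_prob < t)).sum() * COST_FN
    + ((yt == 0) & (y_prob >= t)).sum() * COST_FP
    for t in candidate_thresholds
]
best_cost_threshold = candidate_thresholds[int(np.argmin(costs))]
print(f"Cost-minimizing threshold: {best_cost_threshold:.2f}")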
Step 9 — Feature Importance
Understanding which features drive fraud predictions helps you build better models and explain decisions to stakeholders.
import pandas as pd

feature_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": rf.feature_importances_
}).sort_values("importance", ascending=False)

print("Top 10 most important features:")
print(feature_importance.head(10))
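A horizontal bar chart makes the ranking easier to scan. A minimal matplotlib sketch:

import matplotlib.pyplot as plt

top10 = feature_importance.head(10).iloc[::-1]  # reverse so the biggest bar sits on top
plt.figure(figsize=(8, 5))
plt.barh(top10["feature"], top10["importance"])
plt.xlabel("Importance")
plt.title("Top 10 features by Random Forest importance")
plt.tight_layout()
plt.show()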
Step 10 — Save the Model for Production
import joblib

# Save model, scaler and threshold
joblib.dump(rf, "fraud_model.pkl")
joblib.dump(scaler, "scaler.pkl")
joblib.dump(best_threshold, "threshold.pkl")

print("Model, scaler and threshold saved")
Save the threshold too — you'll need it when serving predictions in production to apply the same optimal cutoff.
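At serving time, load all three artifacts and apply the same scaling and cutoff. A minimal sketch, where score_transaction is my own helper name rather than anything from scikit-learn:

import joblib
import numpy as np

model = joblib.load("fraud_model.pkl")
scaler = joblib.load("scaler.pkl")
threshold = joblib.load("threshold.pkl")

def score_transaction(features):
    """Return (fraud_probability, is_fraud) for one raw transaction row."""
    scaled = scaler.transform(np.asarray(features).reshape(1, -1))
    prob = model.predict_proba(scaled)[0, 1]
    return prob, bool(prob >= threshold)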
Summary — What To Always Do
Here's your checklist for any imbalanced classification problem:
Never use accuracy alone — use AUC-ROC, Recall, F1.
Always stratify your splits — use stratify=y in train_test_split.
Always handle class imbalance — at minimum use class_weight="balanced".
Always tune your threshold — 0.5 is almost never optimal.
Always save preprocessing artifacts — scaler, encoder, threshold together with the model (or bundle them into a single Pipeline, as sketched below).
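One way to make that last point hard to get wrong is to bundle the scaler and model into a single scikit-learn Pipeline, so there is only one artifact to version and load. A sketch:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import joblib

# One artifact: scaling happens automatically inside fit and predict_proba
bundle = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                  random_state=42, n_jobs=-1)),
])
bundle.fit(X_train, y_train)  # raw, unscaled features
joblib.dump(bundle, "fraud_pipeline.pkl")

The threshold still needs to be stored separately, since a Pipeline has no concept of a decision cutoff.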
Conclusion
Class imbalance is not a data problem — it is a modeling problem. The solution is not to collect more data. The solution is to choose the right metrics, handle the imbalance explicitly, and tune your decision threshold for your specific business context.
A fraud detection model is not measured by how often it is right. It is measured by how much fraud it catches and how many legitimate customers it wrongly blocks. Keep that in mind every time you evaluate a model.
The complete code for this tutorial is available on my GitHub at github.com/josephtobimayokun
Joseph Tobi Mayokun is a full-stack developer and ML engineer, founder of Microlink — an AI-focused tech startup building intelligent software for African markets.
