Fraud detection is one of those problems that looks simple on the surface: classify transactions as "fraud" or "not fraud". But once you look at real data, it becomes a completely different challenge.
In this project, I built FraudShield, an end-to-end machine learning system to detect fraudulent credit card transactions using both supervised and unsupervised approaches, along with a live dashboard.
## The Problem
The dataset I used contains over 284,000 transactions, but only:

**0.17% are fraud**

This creates a highly imbalanced dataset, where a model can achieve over 99% accuracy just by predicting everything as "not fraud".
So the real question becomes:
How do we detect fraud when it's so rare?
## Dataset Overview
The dataset contains real-world credit card transactions made by European cardholders, anonymised using PCA transformation to protect sensitive information. It includes 284,807 transactions, of which only 492 are fraudulent (~0.17%), making it a highly imbalanced classification problem.
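As a quick sanity check, the fraud rate follows directly from those counts:

```python
# Class counts from the dataset description
total_transactions = 284_807
fraud_transactions = 492

fraud_rate = fraud_transactions / total_transactions * 100
print(f"Fraud rate: {fraud_rate:.2f}%")  # prints "Fraud rate: 0.17%"
```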
### What are V1–V28?
These are PCA-transformed features.
In simple terms:
- The original features are hidden
- Data is transformed into mathematical components
- We can't interpret them directly

This makes the problem harder: models must learn patterns without human-readable features.
## Exploratory Data Analysis (EDA)
Some key observations:
- The dataset is extremely imbalanced
- Most transactions are low value
- Fraud doesn't follow obvious patterns
- Features are weakly correlated due to PCA transformation
One important realization early on:
Accuracy is NOT a useful metric here
## Why Accuracy is Misleading
If a model predicts:

```text
All transactions = Normal
```

It gets:

**99.8% accuracy**

...but detects zero fraud.
So instead, I focused on:
- Precision
- Recall
- F1 Score
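To make the accuracy trap concrete, here is a small sketch using scikit-learn's metric functions (toy data mirroring the imbalance, not the project's actual evaluation code):

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels: 2 fraud cases out of 1,000 transactions
y_true = [0] * 998 + [1] * 2
y_pred = [0] * 1000  # a "model" that labels everything as normal

print(accuracy_score(y_true, y_pred))             # 0.998 -- looks great
print(recall_score(y_true, y_pred))               # 0.0   -- catches no fraud
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0
```

Recall and F1 immediately expose what accuracy hides: the model never catches a single fraud case.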
## Model 1: XGBoost (Supervised Learning)
I trained an XGBoost classifier, which is well-suited for tabular data and imbalanced problems.
Key setup:
- scale_pos_weight to handle imbalance
- Stratified train/test split
- Feature scaling
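A minimal sketch of that setup on synthetic stand-in data (the shapes and variable names here are illustrative, not the project's actual code); the computed weight is what would be passed to `XGBClassifier(scale_pos_weight=...)`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 10,000 transactions, 30 features, ~0.2% fraud
X = rng.normal(size=(10_000, 30))
y = (rng.random(10_000) < 0.002).astype(int)

# scale_pos_weight: ratio of negative to positive examples, used by
# XGBoost to upweight the rare fraud class during training
scale_pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)

# Stratified split keeps the tiny fraud fraction the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Feature scaling: fit on the training set only to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f"scale_pos_weight = {scale_pos_weight:.0f}")
```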
Results:
- Precision: 0.71
- Recall: 0.87
- F1 Score: 0.78
Insight:
The model successfully detects 87% of fraud cases, which is critical in real-world systems.
## Model 2: Isolation Forest (Unsupervised)
To compare approaches, I also used Isolation Forest, an anomaly detection model.
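A minimal sketch of how Isolation Forest flags anomalies, on synthetic data rather than the real transactions (the shapes and `contamination` value are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly "normal" points, plus 5 clearly shifted outliers at the end
X_normal = rng.normal(0, 1, size=(1_000, 5))
X_outliers = rng.normal(6, 1, size=(5, 5))
X = np.vstack([X_normal, X_outliers])

# contamination = the fraction of points the model should flag as
# anomalous; it plays a role similar to an expected fraud rate
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)  # +1 = normal, -1 = anomaly

print((pred == -1).sum(), "points flagged as anomalies")
```

The model never sees labels; it only isolates points that look unusual, which is why subtle fraud that resembles normal behaviour slips through.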
Results:
- Precision: 0.29
- Recall: 0.30
- F1 Score: 0.30
Insight:
Unsupervised models struggle to detect subtle fraud patterns without labelled data.
## Model Comparison
| Model | Precision | Recall | F1 |
|---|---|---|---|
| XGBoost | 0.71 | 0.87 | 0.78 |
| Isolation Forest | 0.29 | 0.30 | 0.30 |
Key takeaway:
Supervised learning significantly outperforms unsupervised anomaly detection when labelled data is available.
## Explainability with SHAP
To understand how the model makes decisions, I used SHAP (SHapley Additive exPlanations).
This helps answer:
- Which features influence predictions?
- Why was a transaction classified as fraud?
This adds transparency and trust to the system.
## Deployment: Streamlit Dashboard
To make the system usable, I built a Streamlit dashboard.
Features:
- Input transaction data
- Predict fraud probability
- Display risk level
- Show model metrics
## Live Demo & Code
- GitHub: https://github.com/mahira-code/fraudshield-ml
- Live Demo: https://fraudshield-ml-mahira.streamlit.app/
## What I Learned
This project taught me a lot about real-world machine learning:
- Handling imbalanced datasets
- Choosing the right evaluation metrics
- Comparing supervised vs unsupervised models
- Using SHAP for explainability
- Building and deploying end-to-end ML systems
## What's Next
- Hyperparameter tuning
- Model monitoring (drift detection)
- API deployment (FastAPI)
- MLOps integration
## About Me
I'm Mahira Banu, a Data Scientist and AI Engineer focused on building practical, real-world AI systems.
- Portfolio: https://mahirabanu.website
- GitHub: https://github.com/mahira-code
- LinkedIn: https://www.linkedin.com/in/mahira-banu
## Final Thoughts
Fraud detection isn't just about building a model; it's about understanding the data, handling imbalance, and making reliable decisions in high-risk scenarios.
If you're working on similar problems, I'd love to hear your thoughts.