Ali Ahmad

I Built a Fraud Detection System That Catches 99.76% of Fraud — Here's Everything I Learned

There is a number that haunts every fraud detection engineer: 0.13%.

That is the fraud rate in the PaySim dataset: 8,213 fraudulent transactions buried among 6,362,620 transactions in total. It sounds small. It is not. At that ratio, a model that predicts "legitimate" for every single transaction achieves 99.87% accuracy and catches exactly zero fraud.

This is the problem I set out to solve with TrustGuard AI, a course project that turned into one of the most technically demanding things I have built. By the end of it, our deployed XGBoost model achieves AUC-ROC of 0.9995 and Recall of 0.9976 — meaning it catches 99.76% of all fraud on a 6.3 million row test set. It also explains every single prediction using SHAP, and grounds each fraud alert in real State Bank of Pakistan regulatory documents through a RAG pipeline.

This article is the full story — what worked, what broke, and why accuracy is the wrong metric for fraud detection.


The Problem With Accuracy

Before writing a single line of code, I want to be clear about why standard accuracy is useless here.

The dataset has 6,362,620 transactions. Of those, 8,213 are fraud. If I build a model that always predicts "legitimate," here is its scorecard:

Accuracy  = 99.87%
Precision = 0
Recall    = 0
F1        = 0

A perfect-looking accuracy score on a model that is completely blind to fraud. This is why TrustGuard optimises for Recall (the share of fraud caught), Average Precision / AUPRC (area under the precision-recall curve), and AUC-ROC rather than accuracy. On data this imbalanced, accuracy mostly measures how common the majority class is.
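
The scorecard can be reproduced from the two dataset counts alone. This is a minimal sketch in plain Python; the variable names are mine, and precision is technically 0/0 for a model that never flags anything, so I follow the usual convention of reporting it as 0.

```python
# Confusion-matrix counts for a model that labels every transaction "legitimate".
TOTAL = 6_362_620   # all transactions in PaySim (from the article)
FRAUD = 8_213       # fraudulent transactions (from the article)

tp = 0              # fraud correctly flagged: none, the model never flags
fp = 0              # legitimate transactions flagged as fraud: none
fn = FRAUD          # every fraud case is missed
tn = TOTAL - FRAUD  # every legitimate transaction is "correct"

accuracy = (tp + tn) / TOTAL
recall = tp / (tp + fn)
# Precision is 0/0 here; by convention it is reported as 0.
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2%} precision={precision} recall={recall} f1={f1}")
```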


The Dataset

TrustGuard uses the PaySim synthetic dataset — a mobile money transaction log generated by a multi-agent simulation calibrated against real financial data. It spans 30 simulated days at hourly granularity.

| Property | Value |
| --- | --- |
| Total Transactions | 6,362,620 |
| Fraud Cases | 8,213 |
| Fraud Rate | 0.13% |
| Transaction Types | CASH_OUT, TRANSFER, PAYMENT, DEBIT, CASH_IN |

One of the first insights from EDA: fraud is not spread across all transaction types. It is confined exclusively to CASH_OUT and TRANSFER. This makes structural sense — fraud follows the account-drain pattern: transfer funds to a mule account, then cash out. PAYMENT, DEBIT, and CASH_IN are clean.

This single observation shaped the entire feature engineering approach.
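
A check like this takes only a few lines. The sketch below uses a handful of made-up rows rather than the real dataset, and assumes the label column is called `isFraud` as in PaySim; on the real data the same count would run over all 6.3 million rows.

```python
from collections import Counter

# Hypothetical toy rows in (type, isFraud) form; not real PaySim data.
rows = [
    ("CASH_OUT", 1), ("TRANSFER", 1), ("PAYMENT", 0),
    ("CASH_IN", 0), ("DEBIT", 0), ("TRANSFER", 0), ("CASH_OUT", 0),
]

# Count fraud cases per transaction type, ignoring legitimate rows.
fraud_by_type = Counter(tx_type for tx_type, is_fraud in rows if is_fraud)
types_with_fraud = sorted(fraud_by_type)

print(types_with_fraud)
```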


The Class Imbalance Problem — and Why SMOTE Alone Fails

At 0.13% fraud rate, SMOTE alone is not enough. Here is why.

With 5-fold cross-validation, some training folds can contain fewer than 10 actual fraud samples. SMOTE generates synthetic minority samples by interpolating between existing ones — but if there are only a handful of real fraud cases in a fold, SMOTE degenerates. The synthetic samples cluster too tightly and the model learns nothing useful.
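
To make the interpolation concrete, here is a stripped-down sketch of the core SMOTE step. It is not the imbalanced-learn implementation: real SMOTE picks the neighbour from the k nearest minority samples, whereas this version takes the neighbour as given. The two fraud points and their feature values are hypothetical.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the segment between two real minority points."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(sample, neighbor)]

rng = random.Random(0)

# Two hypothetical fraud samples as (amount_ratio, balanceDiff) pairs.
fraud_a = [0.98, 0.0]
fraud_b = [1.00, 50.0]

synthetic = [interpolate(fraud_a, fraud_b, rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Every synthetic point lands on the line between the two originals. With only a few real fraud cases in a fold, all the generated samples sit on a few short segments like this one, which is the clustering problem described above.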

TrustGuard uses a two-stage imbalance strategy:

Stage 1 — Fraud Simulation Engine

Before any train-test split, I apply a deterministic fraud injection step:

  1. Sample 5% of all legitimate TRANSFER and CASH_OUT transactions
  2. Set amount = oldbalanceOrg (full account drain)
  3. Set newbalanceOrig = 0
  4. Recompute balanceDiff and amount_ratio
  5. Label as fraud and append to the dataset

Result: Fraud rate rises from 0.13% → 1.26%.
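
The five steps can be sketched as follows. This is a plain-Python illustration on a list of dicts, not the project's actual code (which works on a Pandas DataFrame); the function name `inject_fraud`, the seed, and the `isFraud` label column are my assumptions.

```python
import random

def inject_fraud(transactions, fraction=0.05, seed=42):
    """Append simulated account-drain frauds built from legitimate rows."""
    rng = random.Random(seed)  # fixed seed keeps the sampling repeatable
    candidates = [
        t for t in transactions
        if t["isFraud"] == 0 and t["type"] in ("TRANSFER", "CASH_OUT")
    ]
    n = int(len(candidates) * fraction)          # step 1: sample 5%
    injected = []
    for t in rng.sample(candidates, n):
        row = dict(t)                            # copy; the original row is kept
        row["amount"] = row["oldbalanceOrg"]     # step 2: full account drain
        row["newbalanceOrig"] = 0.0              # step 3
        # step 4: recompute the engineered features
        row["balanceDiff"] = row["oldbalanceOrg"] - row["newbalanceOrig"] - row["amount"]
        row["amount_ratio"] = row["amount"] / (row["oldbalanceOrg"] + 1)
        row["isFraud"] = 1                       # step 5: label as fraud
        injected.append(row)
    return transactions + injected               # step 5: append to the dataset

# Hypothetical input: 100 identical legitimate transfers.
legit = [
    {"type": "TRANSFER", "amount": 100.0, "oldbalanceOrg": 1000.0,
     "newbalanceOrig": 900.0, "isFraud": 0}
    for _ in range(100)
]
augmented = inject_fraud(legit)
print(len(augmented), sum(t["isFraud"] for t in augmented))
```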

The ablation study confirmed this was the single most important component in the pipeline. Removing it dropped CV F1 from 0.947 to 0.671 — a 29% relative reduction.

Stage 2 — SMOTE Inside ImbPipeline

After the 80/20 stratified train-test split, SMOTE (sampling_strategy=0.3) is applied inside an ImbPipeline per cross-validation fold. This is critical — SMOTE is fitted only on the training portion of each fold. The validation fold never sees synthetic samples. This prevents data leakage.

The final training distribution: 23.07% fraud.

| Stage | Fraud Rate |
| --- | --- |
| Original Dataset | 0.13% |
| After Fraud Simulation | 1.26% |
| After SMOTE (training folds) | 23.07% |
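
The last row follows from how imbalanced-learn defines `sampling_strategy`: a float is the target ratio of minority to majority samples after resampling, not the minority share of the whole fold. A small sketch of the conversion (the helper name `fraud_share` is mine):

```python
def fraud_share(sampling_strategy):
    """Fraud share of a training fold after SMOTE, given the minority/majority ratio."""
    return sampling_strategy / (1 + sampling_strategy)

print(f"{fraud_share(0.3):.2%}")
```

A ratio of 0.3 works out to roughly 23% fraud, in line with the figure reported above; any small difference in the last digit presumably comes from whole-number sample counts.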

Feature Engineering

After cleaning, 12 features go into the model. The two most important are engineered:

balanceDiff = oldbalanceOrg − newbalanceOrig − amount

This detects balance inconsistencies. If the origin balance drops by exactly the transaction amount, the value is zero; a non-zero value means the recorded balances and the amount disagree, which is the anomaly this feature exposes.

amount_ratio = amount / (oldbalanceOrg + 1)

This approaches 1.0 in full account-drain attacks. For a routine transfer out of a well-funded account it stays near zero.

The ablation confirmed their necessity: removing both dropped Test F1 from 0.5533 to 0.1538 and Test AP from 0.7317 to 0.6061. Without them, the model is nearly blind.
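
Both features are one-liners. The sketch below writes them as plain functions and evaluates them on two hypothetical transactions, a routine transfer and a full drain; the function names and example amounts are mine.

```python
def balance_diff(old_balance, new_balance, amount):
    """oldbalanceOrg - newbalanceOrig - amount; zero when the books balance."""
    return old_balance - new_balance - amount

def amount_ratio(amount, old_balance):
    """Share of the origin balance being moved; +1 avoids division by zero."""
    return amount / (old_balance + 1)

# Hypothetical routine transfer: 100 out of a 10,000 balance.
routine = (balance_diff(10_000, 9_900, 100), amount_ratio(100, 10_000))

# Hypothetical full drain: the whole 10,000 balance leaves the account.
drain = (balance_diff(10_000, 0, 10_000), amount_ratio(10_000, 10_000))

print(routine, drain)
```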


Training Four Models

All four models were trained identically inside an ImbPipeline(SMOTE → StandardScaler → Classifier) with 5-fold stratified cross-validation.

Cross-Validation Results:

| Model | CV F1 | CV AUC-ROC |
| --- | --- | --- |
| XGBoost | 0.949 ± 0.020 | 1.000 ± 0.000 |
| Neural Network | 0.793 ± 0.061 | 0.999 ± 0.000 |
| Random Forest | 0.711 ± 0.007 | 0.999 ± 0.000 |
| Logistic Regression | 0.249 ± 0.003 | 0.977 ± 0.001 |

Test Set Results:

| Model | Test Recall | Test AUC | Test Avg Precision |
| --- | --- | --- | --- |
| XGBoost | 0.9976 | 0.9995 | 0.9358 |
| Random Forest | 0.9976 | 0.9995 | 0.8870 |
| Neural Network | 0.9732 | 0.9983 | 0.7081 |
| Logistic Regression | 0.9860 | 0.9946 | 0.5567 |

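XGBoost leads or ties on every metric. It matches Random Forest on Test Recall and Test AUC and beats it on Average Precision (0.9358 vs 0.8870). Against Logistic Regression, its CV F1 is about 3.8× higher (0.949 vs 0.249) and its Test Average Precision about 1.7× higher (0.9358 vs 0.5567) at comparable recall. XGBoost was selected for deployment.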


Why XGBoost? The Hyperparameter Story

| n_estimators | CV F1 |
| --- | --- |
| 100 | 0.921 |
| 200 | 0.938 |
| 300 | 0.949 |

More trees combined with a lower learning rate (0.05) generalised better. I chose a max depth of 6 over 8 to avoid overfitting on fold-specific patterns.
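
As a small sketch, the selection amounts to picking the best row of the table above and fixing the two other values mentioned. Only these three hyperparameters are named in this article; anything else is assumed to stay at the library defaults.

```python
# CV F1 by n_estimators, copied from the table above.
cv_f1 = {100: 0.921, 200: 0.938, 300: 0.949}
best_n = max(cv_f1, key=cv_f1.get)

# The three hyperparameters named in the article; the rest are assumed defaults.
xgb_params = {"n_estimators": best_n, "learning_rate": 0.05, "max_depth": 6}

print(xgb_params)
```

A dict like this can be unpacked straight into `xgboost.XGBClassifier(**xgb_params)`.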


Explainability with SHAP

A fraud detection system that says "this transaction is fraud" without explaining why is not useful to an analyst — and not acceptable to a regulator.

TrustGuard implements SHAP TreeExplainer, which computes exact Shapley values for each prediction. For every flagged transaction, a waterfall plot shows exactly which features pushed the prediction toward fraud and by how much.

For a sample transaction flagged at 94% fraud probability:

  • amount_ratio ≈ 1.0 → largest push toward fraud (full drain detected)
  • type_TRANSFER → second largest push
  • balanceDiff → third largest push

This tells the analyst: this transaction looks like fraud because it drained an account completely via a transfer operation. That is auditable and defensible.
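
What makes the waterfall readable is that SHAP values are additive: the base value plus the per-feature contributions equals the model output, which for an XGBoost classifier is by default expressed in log-odds. The numbers below are invented to illustrate the arithmetic and are not taken from the real model; they are chosen so the three named features rank in the order above and the result lands near 94%.

```python
import math

# Hypothetical numbers for illustration only; not from the real model.
base_value = -4.0  # assumed average model output, in log-odds
contributions = {
    "amount_ratio": 3.5,
    "type_TRANSFER": 2.0,
    "balanceDiff": 1.25,
}

# Additivity: base value + sum of contributions = model output (log-odds).
log_odds = base_value + sum(contributions.values())
probability = 1 / (1 + math.exp(-log_odds))  # sigmoid turns log-odds into probability

# Rank features by the size of their push toward fraud.
ranked = sorted(contributions, key=contributions.get, reverse=True)

print(ranked, round(probability, 2))
```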


The RAG Pipeline — Grounding Alerts in Regulation

This is the part most fraud detection tutorials skip entirely.

A model flagging a transaction at 97% is useful. A model that also cites the specific SBP regulatory provision being violated is operationally deployable.

Pipeline architecture:

  1. Document Ingestion: Five SBP regulatory PDFs chunked into ~100 passages, stored in ChromaDB
  2. Embedding: all-MiniLM-L6-v2 (384-dimensional dense retriever)
  3. Hybrid Retrieval: BM25 (lexical) + dense vector retrieval combined
  4. Reranking: CrossEncoder (ms-marco-MiniLM-L-6-v2)
  5. Generation: GPT-4o-mini receives fraud probability + retrieved passages → structured risk report with SBP citations
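
Step 3 needs some rule for merging the BM25 and dense rankings. This article does not say which one TrustGuard uses, so the sketch below swaps in reciprocal rank fusion, a common choice, purely as an illustration. The passage IDs are hypothetical, and `k=60` is the constant conventionally used with this method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            # A passage earns more credit the higher it ranks in each list.
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings for one query.
bm25_ranking = ["p3", "p1", "p2"]   # lexical retriever
dense_ranking = ["p1", "p2", "p3"]  # embedding retriever

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

The fused list would then go to the CrossEncoder in step 4, which rescores each query-passage pair before generation.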

Results: Average Precision@5 = 0.855 across 10 retrieval queries. Zero hallucinations across all four high-risk transaction evaluations.


The Full Ablation Study

| Condition | CV F1 | Test F1 | Test AP |
| --- | --- | --- | --- |
| No Fraud Simulation (SMOTE only) | 0.671 | 0.6247 | 0.9363 |
| Full Pipeline (baseline) | 0.947 | 0.5533 | 0.7317 |
| No SMOTE | 0.947 | 0.9132 | 0.9639 |
| SMOTE ratio = 0.3 (selected) | 0.947 | 0.5557 | 0.7322 |
| SMOTE ratio = 0.5 | 0.947 | 0.5280 | 0.7180 |
| No Engineered Features | 0.636 | 0.1538 | 0.6061 |
| With Engineered Features (full) | 0.947 | 0.5533 | 0.7317 |

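Three takeaways: removing the Fraud Simulation Engine causes the largest drop in CV F1 (0.947 to 0.671), removing the engineered features collapses both CV and test scores, and training without SMOTE gives higher test F1 and AP (0.9132 and 0.9639) but worse robustness.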


Final Numbers

| Metric | Value |
| --- | --- |
| AUC-ROC | 0.9995 |
| Recall | 0.9976 |
| Average Precision (AUPRC) | 0.9358 |
| Fraud cases caught (of 8,213) | 8,190 |
| Fraud cases missed | 23 |
| RAG hallucinations | 0 |
| Retrieval Precision@5 | 0.855 |

Stack

Python · XGBoost · Scikit-learn · imbalanced-learn · SHAP · ChromaDB · sentence-transformers · rank-bm25 · CrossEncoder · GPT-4o-mini · Streamlit · Pandas · NumPy


Live demo: trustguard-ai-fraud-detection-c7um3xntqvxthahgld5ucm.streamlit.app

Source code: github.com/whozahm3d/trustguard-ai-fraud-detection

If you found this useful, the repo is public — feedback, issues, and stars are all welcome.
