There is a number that haunts every fraud detection engineer: 0.13%.
That is the fraud rate in the PaySim dataset — 8,213 fraudulent transactions buried inside 6,362,620 legitimate ones. It sounds small. It is not. At that ratio, a model that predicts "legitimate" for every single transaction achieves 99.87% accuracy — and catches exactly zero fraud.
This is the problem I set out to solve with TrustGuard AI, a course project that turned into one of the most technically demanding things I have built. By the end of it, our deployed XGBoost model achieves AUC-ROC of 0.9995 and Recall of 0.9976 — meaning it catches 99.76% of all fraud on a 6.3 million row test set. It also explains every single prediction using SHAP, and grounds each fraud alert in real State Bank of Pakistan regulatory documents through a RAG pipeline.
This article is the full story — what worked, what broke, and why accuracy is the wrong metric for fraud detection.
The Problem With Accuracy
Before writing a single line of code, I want to be clear about why standard accuracy is useless here.
The dataset has 6,362,620 transactions. Of those, 8,213 are fraud. If I build a model that always predicts "legitimate," here is its scorecard:
Accuracy = 99.87%
Precision = 0
Recall = 0
F1 = 0
A perfect-looking accuracy score on a model that is completely blind to fraud. This is why TrustGuard optimises for Recall (catching fraud), Average Precision / AUPRC (area under the precision-recall curve), and AUC-ROC — not accuracy. Accuracy is literally a deceptive metric on imbalanced data.
The Dataset
TrustGuard uses the PaySim synthetic dataset — a mobile money transaction log generated by a multi-agent simulation calibrated against real financial data. It spans 30 simulated days at hourly granularity.
| Property | Value |
|---|---|
| Total Transactions | 6,362,620 |
| Fraud Cases | 8,213 |
| Fraud Rate | 0.13% |
| Transaction Types | CASH_OUT, TRANSFER, PAYMENT, DEBIT, CASH_IN |
One of the first insights from EDA: fraud is not spread across all transaction types. It is confined exclusively to CASH_OUT and TRANSFER. This makes structural sense — fraud follows the account-drain pattern: transfer funds to a mule account, then cash out. PAYMENT, DEBIT, and CASH_IN are clean.
This single observation shaped the entire feature engineering approach.
The Class Imbalance Problem — and Why SMOTE Alone Fails
At 0.13% fraud rate, SMOTE alone is not enough. Here is why.
With 5-fold cross-validation, some training folds can contain fewer than 10 actual fraud samples. SMOTE generates synthetic minority samples by interpolating between existing ones — but if there are only a handful of real fraud cases in a fold, SMOTE degenerates. The synthetic samples cluster too tightly and the model learns nothing useful.
TrustGuard uses a two-stage imbalance strategy:
Stage 1 — Fraud Simulation Engine
Before any train-test split, I apply a deterministic fraud injection step:
- Sample 5% of all legitimate
TRANSFERandCASH_OUTtransactions - Set
amount = oldbalanceOrg(full account drain) - Set
newbalanceOrig = 0 - Recompute
balanceDiffandamount_ratio - Label as fraud and append to the dataset
Result: Fraud rate rises from 0.13% → 1.26%.
The ablation study confirmed this was the single most important component in the pipeline. Removing it dropped CV F1 from 0.947 to 0.671 — a 29% relative reduction.
Stage 2 — SMOTE Inside ImbPipeline
After the 80/20 stratified train-test split, SMOTE (sampling_strategy=0.3) is applied inside an ImbPipeline per cross-validation fold. This is critical — SMOTE is fitted only on the training portion of each fold. The validation fold never sees synthetic samples. This prevents data leakage.
The final training distribution: 23.07% fraud.
| Stage | Fraud Rate |
|---|---|
| Original Dataset | 0.13% |
| After Fraud Simulation | 1.26% |
| After SMOTE (training folds) | 23.07% |
Feature Engineering
After cleaning, 12 features go into the model. The two most important are engineered:
balanceDiff = oldbalanceOrg − newbalanceOrig − amount
This detects balance inconsistencies. In a legitimate transaction, money flows normally. In an account-drain fraud, this value becomes anomalous.
amount_ratio = amount / (oldbalanceOrg + 1)
This approaches 1.0 in full account-drain attacks. For routine transfers it stays near zero.
The ablation confirmed their necessity: removing both dropped Test F1 from 0.5533 to 0.1538 and Test AP from 0.7317 to 0.6061. Without them, the model is nearly blind.
Training Four Models
All four models were trained identically inside an ImbPipeline(SMOTE → StandardScaler → Classifier) with 5-fold stratified cross-validation.
Cross-Validation Results:
| Model | CV F1 | CV AUC-ROC |
|---|---|---|
| XGBoost | 0.949 ± 0.020 | 1.000 ± 0.000 |
| Neural Network | 0.793 ± 0.061 | 0.999 ± 0.000 |
| Random Forest | 0.711 ± 0.007 | 0.999 ± 0.000 |
| Logistic Regression | 0.249 ± 0.003 | 0.977 ± 0.001 |
Test Set Results:
| Model | Test Recall | Test AUC | Test Avg Precision |
|---|---|---|---|
| XGBoost | 0.9976 | 0.9995 | 0.9358 |
| Random Forest | 0.9976 | 0.9995 | 0.8870 |
| Neural Network | 0.9732 | 0.9983 | 0.7081 |
| Logistic Regression | 0.9860 | 0.9946 | 0.5567 |
XGBoost dominates across every metric. Its Test Average Precision (0.9358) is 3.8× higher than Logistic Regression at comparable recall. XGBoost was selected for deployment.
Why XGBoost? The Hyperparameter Story
| n_estimators | CV F1 |
|---|---|
| 100 | 0.921 |
| 200 | 0.938 |
| 300 | 0.949 |
More trees, lower learning rate (0.05), better generalisation. Max depth of 6 over 8 to avoid overfitting on fold-specific patterns.
Explainability with SHAP
A fraud detection system that says "this transaction is fraud" without explaining why is not useful to an analyst — and not acceptable to a regulator.
TrustGuard implements SHAP TreeExplainer, which computes exact Shapley values for each prediction. For every flagged transaction, a waterfall plot shows exactly which features pushed the prediction toward fraud and by how much.
For a sample transaction flagged at 94% fraud probability:
-
amount_ratio ≈ 1.0→ largest push toward fraud (full drain detected) -
type_TRANSFER→ second largest push -
balanceDiff→ third largest push
This tells the analyst: this transaction looks like fraud because it drained an account completely via a transfer operation. That is auditable and defensible.
The RAG Pipeline — Grounding Alerts in Regulation
This is the part most fraud detection tutorials skip entirely.
A model flagging a transaction at 97% is useful. A model that also cites the specific SBP regulatory provision being violated is operationally deployable.
Pipeline architecture:
- Document Ingestion: Five SBP regulatory PDFs chunked into ~100 passages, stored in ChromaDB
-
Embedding:
all-MiniLM-L6-v2(384-dimensional dense retriever) - Hybrid Retrieval: BM25 (lexical) + dense vector retrieval combined
-
Reranking: CrossEncoder (
ms-marco-MiniLM-L-6-v2) - Generation: GPT-4o-mini receives fraud probability + retrieved passages → structured risk report with SBP citations
Results: Average Precision@5 = 0.855 across 10 retrieval queries. Zero hallucinations across all four high-risk transaction evaluations.
The Full Ablation Study
| Condition | CV F1 | Test F1 | Test AP |
|---|---|---|---|
| No Fraud Simulation (SMOTE only) | 0.671 | 0.6247 | 0.9363 |
| Full Pipeline (baseline) | 0.947 | 0.5533 | 0.7317 |
| No SMOTE | 0.947 | 0.9132 | 0.9639 |
| SMOTE ratio = 0.3 (selected) | 0.947 | 0.5557 | 0.7322 |
| SMOTE ratio = 0.5 | 0.947 | 0.5280 | 0.7180 |
| No Engineered Features | 0.636 | 0.1538 | 0.6061 |
| With Engineered Features (full) | 0.947 | 0.5533 | 0.7317 |
Three takeaways: the Fraud Simulation Engine is irreplaceable, engineered features are critical, and no SMOTE gives better precision but worse robustness.
Final Numbers
| Metric | Value |
|---|---|
| AUC-ROC | 0.9995 |
| Recall | 0.9976 |
| Average Precision (AUPRC) | 0.9358 |
| Fraud cases caught (of 8,213) | 8,190 |
| Fraud cases missed | 23 |
| RAG hallucinations | 0 |
| Retrieval Precision@5 | 0.855 |
Stack
Python · XGBoost · Scikit-learn · imbalanced-learn · SHAP · ChromaDB · sentence-transformers · rank-bm25 · CrossEncoder · GPT-4o-mini · Streamlit · Pandas · NumPy
Live demo: trustguard-ai-fraud-detection-c7um3xntqvxthahgld5ucm.streamlit.app
Source code: github.com/whozahm3d/trustguard-ai-fraud-detection
If you found this useful, the repo is public — feedback, issues, and stars are all welcome.
Top comments (0)