Every millisecond counts when it comes to fraud. A fraudulent transaction approved in 200ms costs real money. A legitimate transaction declined in 200ms costs a customer. Getting this balance right — at scale — is one of the hardest engineering problems in financial services.
This is a deep dive into the architectural decisions, trade-offs, and hard lessons from building a production-grade credit card fraud detection system. No toy datasets. No Jupyter notebooks. Real architecture, real constraints.
The Problem Is Not What You Think
Most tutorials frame fraud detection as a machine learning problem. Pick the right model, tune your F1 score, ship it.
In production, it's an engineering and systems problem with ML embedded inside it.
The real challenges are:
- Latency: you have ~150–300ms to make a decision before the payment network times out
- Scale: millions of transactions per day, with spikes you cannot always predict
- Imbalance: fraudulent transactions can be as low as 0.1% of total volume — your system must be hypersensitive to a signal buried in 99.9% noise
- Drift: fraud patterns change constantly; yesterday's model is today's liability
- Explainability: regulators and customers will ask why a transaction was declined
None of these are model problems. All of them are architecture problems.
System Architecture: The Big Picture
Here's the high-level design we settled on after several iterations:
Transaction Request
│
▼
┌─────────────────────┐
│ API Gateway │ ← Rate limiting, auth, routing
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Feature Service │ ← Real-time feature assembly (<20ms)
│ (Redis + Flink) │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Scoring Engine │ ← ML model inference (<50ms)
│ (Rule layer + │
│ ML model layer) │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Decision Engine │ ← Threshold logic, risk bands, actions
└────────┬────────────┘
│
┌────┴─────┐
▼ ▼
Approve Decline / Step-up (3DS)
The entire synchronous path — from transaction in to decision out — must complete in under 300ms. Everything else (model retraining, alerting, feedback loops) is asynchronous.
Layer 1: Feature Engineering is Everything
The biggest performance gains we saw did not come from switching models — they came from better features.
Real-Time Features (assembled per transaction)
These are computed on-the-fly and pulled from Redis:
- Velocity features: number of transactions in last 1m / 5m / 1h / 24h per card
- Amount deviation: how far this transaction deviates from the cardholder's average spend
- Merchant risk score: pre-computed score for the merchant category / MCC code
- Time-of-day signal: is this transaction outside the cardholder's normal hours?
- Geographic anomaly: is the country/city inconsistent with recent usage?
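To make the velocity features concrete, here's a minimal sketch of how they could be read from and written to Redis, assuming each card maps to a sorted set of transaction timestamps (the `velocity:{card_id}` key schema and the window names are illustrative, not the production ones):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Trailing windows in seconds, keyed by feature name (illustrative names).
WINDOWS = {"tx_count_1m": 60, "tx_count_5m": 300,
           "tx_count_1h": 3600, "tx_count_24h": 86400}

def velocity_features(card_id: str, now=None) -> dict:
    """Count transactions per card inside each trailing window."""
    now = now or time.time()
    key = f"velocity:{card_id}"
    pipe = r.pipeline()
    for window in WINDOWS.values():
        pipe.zcount(key, now - window, now)  # members scored by timestamp
    counts = pipe.execute()
    return dict(zip(WINDOWS.keys(), counts))

def record_transaction(card_id: str, tx_id: str, now=None) -> None:
    """Append a transaction timestamp and trim anything older than 24h."""
    now = now or time.time()
    key = f"velocity:{card_id}"
    pipe = r.pipeline()
    pipe.zadd(key, {tx_id: now})
    pipe.zremrangebyscore(key, 0, now - 86400)
    pipe.execute()
```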
Behavioural Baseline Features (batch, updated hourly)
Pulled from a feature store (we used Feast, but Redis + Spark works too):
- 30-day average transaction amount
- Typical merchant categories
- Device fingerprint history
- Known trusted locations
The Hard Part: Consistency
The training data must use exactly the same feature definitions as inference. Training-serving skew is one of the most common sources of degraded model performance in production, and it's invisible until something breaks.
We solved this by centralising feature computation in a shared library used by both the batch training pipeline and the real-time feature service. Same code. No exceptions.
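A minimal sketch of what that shared library looks like in spirit: a pure feature definition, imported unchanged by both the training pipeline and the serving path (field names here are illustrative, not the production schema):

```python
def amount_deviation(amount: float, avg_amount_30d: float,
                     std_amount_30d: float) -> float:
    """How many standard deviations this transaction sits from the
    cardholder's 30-day average spend."""
    if std_amount_30d <= 0:
        return 0.0
    return (amount - avg_amount_30d) / std_amount_30d

def build_features(tx: dict, baseline: dict) -> dict:
    """Single feature-assembly entry point.
    Training side: applied row by row to the historical dataset.
    Serving side: applied to one transaction plus its cached baseline."""
    return {
        "amount_deviation": amount_deviation(
            tx["amount"], baseline["avg_amount_30d"], baseline["std_amount_30d"]
        ),
        "is_night": 1 if tx["hour_of_day"] < 6 else 0,
    }
```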
Layer 2: The Scoring Engine — Rules + ML Together
We ran two scoring layers in parallel, not in sequence.
Rules Engine (first line of defence)
Simple, fast, interpretable. Handles obvious cases:
IF transaction_country NOT IN cardholder_known_countries
AND amount > 500
AND hour_of_day BETWEEN 1 AND 5
THEN score += 80 (high risk)
Rules are cheap to update, easy to explain to regulators, and handle known fraud patterns with high precision. They block roughly 30–40% of fraud before the ML model is even invoked.
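One way to keep rules cheap to update is to express them as data rather than code. The sketch below mirrors the pseudocode rule above; the rule registry, field names, and score weight are illustrative assumptions, not the production rule set:

```python
# Each rule is a predicate plus a score contribution, so rules can be
# added or retired without touching the scoring path itself.
RULES = [
    {
        "name": "foreign_high_value_night",
        "when": lambda tx: (
            tx["country"] not in tx["known_countries"]
            and tx["amount"] > 500
            and 1 <= tx["hour_of_day"] <= 5
        ),
        "score": 80,  # high risk
    },
]

def rule_score(tx: dict):
    """Return the summed rule score and the names of rules that fired,
    which doubles as the explanation shown to analysts."""
    total, fired = 0, []
    for rule in RULES:
        if rule["when"](tx):
            total += rule["score"]
            fired.append(rule["name"])
    return total, fired
```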
ML Model Layer
We landed on a gradient boosted tree (XGBoost) as the primary model, for three reasons:
- Speed: inference is sub-millisecond even with 200+ features
- Performance: consistently outperforms neural networks on tabular transaction data
- Explainability: SHAP values give you per-prediction feature importance, which is gold for compliance teams
We also ran a secondary autoencoder-based anomaly detector in parallel for catching novel fraud patterns the supervised model had never seen. Its score was blended in with a lower weight.
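A rough sketch of how the two scores could be blended, assuming the XGBoost booster outputs a fraud probability and the autoencoder exposes a `predict` method that returns the reconstructed feature vector (the 0.8/0.2 weighting is illustrative, not the tuned value):

```python
import numpy as np
import xgboost as xgb

def blended_score(booster: xgb.Booster, autoencoder, features: np.ndarray,
                  anomaly_weight: float = 0.2) -> float:
    # Supervised fraud probability from the gradient-boosted model.
    supervised = float(booster.predict(xgb.DMatrix(features.reshape(1, -1)))[0])

    # Anomaly score from reconstruction error, squashed into [0, 1].
    reconstruction = autoencoder.predict(features.reshape(1, -1))
    error = float(np.mean((features.reshape(1, -1) - reconstruction) ** 2))
    anomaly = 1.0 - float(np.exp(-error))

    # Blend with a lower weight on the unsupervised signal.
    return (1.0 - anomaly_weight) * supervised + anomaly_weight * anomaly
```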
Class Imbalance: What Actually Works
The standard advice is to use SMOTE or random oversampling. In practice, we found:
- SMOTE helped during initial model development but added noise at high volumes
- Cost-sensitive learning (setting `scale_pos_weight` in XGBoost) was more robust in production
- Threshold tuning post-training gave us far more control over the precision-recall trade-off than any resampling technique
Do not optimise for accuracy. Optimise for precision-recall AUC and then tune your operating threshold based on business risk appetite.
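Here's a minimal sketch of that approach: cost-sensitive training via `scale_pos_weight`, followed by picking an operating threshold against a precision floor on a validation set (the 0.90 precision target is an illustrative stand-in for your own risk appetite):

```python
from sklearn.metrics import auc, precision_recall_curve
from xgboost import XGBClassifier

def train_and_pick_threshold(X_train, y_train, X_val, y_val, min_precision=0.90):
    # Weight the rare fraud class instead of resampling the data.
    pos_weight = float((y_train == 0).sum()) / max(float((y_train == 1).sum()), 1.0)
    model = XGBClassifier(scale_pos_weight=pos_weight, eval_metric="aucpr")
    model.fit(X_train, y_train)

    scores = model.predict_proba(X_val)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_val, scores)
    pr_auc = auc(recall, precision)

    # Maximise recall subject to a business-driven precision floor:
    # take the lowest threshold whose precision still clears the floor.
    meets_floor = precision[:-1] >= min_precision
    threshold = float(thresholds[meets_floor].min()) if meets_floor.any() else 0.5
    return model, threshold, pr_auc
```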
Layer 3: The Decision Engine
The model outputs a score between 0 and 1. The decision engine translates that into an action:
| Score Band | Action |
|---|---|
| 0.0 – 0.3 | Approve |
| 0.3 – 0.6 | Approve with soft alert |
| 0.6 – 0.8 | Step-up authentication (3DS) |
| 0.8 – 1.0 | Decline |
These thresholds are not fixed. They shift based on:
- Merchant risk profile: a high-risk merchant (e.g., crypto exchange, gift cards) shifts the threshold down
- Time of day / fraud rate spikes: during known high-fraud periods, thresholds tighten
- Card holder profile: a known high-value customer might get a softer threshold to protect against false positives
This dynamic thresholding layer is where a lot of the business logic lives, and it's deliberately kept separate from the ML model so it can be tuned without retraining.
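As a sketch, the decision layer can be as simple as band lookups with a per-merchant adjustment applied before the lookup. The band edges match the table above; the adjustment logic and values are illustrative:

```python
def decide(score: float, merchant_risk: float = 0.0) -> str:
    """A positive merchant_risk shifts every band down, so the same
    model score is treated more severely at a high-risk merchant."""
    adjusted = min(score + merchant_risk, 1.0)
    if adjusted < 0.3:
        return "approve"
    if adjusted < 0.6:
        return "approve_with_alert"
    if adjusted < 0.8:
        return "step_up_3ds"
    return "decline"

# Example: a 0.55 score at a high-risk merchant (adjustment 0.1) crosses
# into the step-up band instead of a soft-alert approval.
assert decide(0.55) == "approve_with_alert"
assert decide(0.55, merchant_risk=0.1) == "step_up_3ds"
```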
Model Retraining: The Feedback Loop Nobody Talks About
A fraud model trained on 6-month-old data is already becoming stale. Fraud rings adapt quickly.
Our retraining pipeline:
- Daily batch: new confirmed fraud labels (from manual review + chargeback signals) are fed into the training dataset
- Weekly retrain: model is retrained on a rolling 90-day window
- Shadow scoring: new model runs in shadow mode alongside production for 48 hours before promotion
- Canary release: new model serves 5% of traffic before full rollout, with automatic rollback if key metrics degrade
The key metric we watched during shadow mode was rank order — does the new model score known fraudulent transactions higher than known legitimate ones? A lift chart drift of more than 5% triggered a human review.
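One simple way to approximate that rank-order check is to score the same labelled shadow traffic with both models and compare a ranking metric such as ROC AUC, flagging the candidate when the relative drop exceeds a tolerance. The sketch below uses a 5% tolerance to mirror the lift-chart trigger described above; treat it as an approximation, not the exact check we ran:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shadow_check(y_true: np.ndarray, prod_scores: np.ndarray,
                 shadow_scores: np.ndarray, tolerance: float = 0.05) -> bool:
    """Return True if the shadow model's rank ordering is acceptable."""
    prod_auc = roc_auc_score(y_true, prod_scores)
    shadow_auc = roc_auc_score(y_true, shadow_scores)
    relative_drop = (prod_auc - shadow_auc) / prod_auc
    return relative_drop <= tolerance
```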
Infrastructure Choices and Why
| Component | Choice | Why |
|---|---|---|
| Feature cache | Redis Cluster | Sub-millisecond read latency |
| Stream processing | Apache Flink | Stateful windowed aggregations at scale |
| Model serving | Custom FastAPI service | Full control over batching and concurrency |
| Feature store | Feast | Consistency between training and serving |
| Monitoring | Prometheus + Grafana | Real-time score distribution and drift alerts |
| Experiment tracking | MLflow | Model versioning and reproducibility |
We deliberately avoided "full-stack" ML platforms that bundle everything together. The lock-in risk and latency overhead were not worth the developer experience gains at our scale.
What We Got Wrong (and Fixed)
1. We underestimated feature latency
Early on, our feature service was querying a relational database. At peak load, this added 80–120ms to the decision path — completely unacceptable. Migrating feature reads to Redis brought this down to 3–5ms.
2. We over-rotated to ML and abandoned rules
For a period, the ML model was doing all the heavy lifting, which meant that when it degraded due to data drift there was no backstop. Bringing the rules layer back as a first gate significantly improved resilience.
3. We optimised the wrong metric
In early iterations, we chased high recall (catch as much fraud as possible). This led to a false positive rate that was frustrating legitimate cardholders. The right trade-off is business-specific — understand your chargeback cost vs. your customer attrition cost before setting thresholds.
4. We didn't monitor score distributions
A model can look healthy on offline metrics while its live score distribution is silently shifting. We now track percentile distribution of scores daily. Any unexpected shift triggers an investigation.
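A minimal sketch of that daily check: compare today's live score percentiles against a reference window and flag any unexpected shift (the percentile grid and tolerance are illustrative values):

```python
import numpy as np

PERCENTILES = [50, 75, 90, 95, 99]

def score_distribution_shift(reference_scores: np.ndarray,
                             todays_scores: np.ndarray,
                             max_abs_shift: float = 0.05):
    """Return (alert, per-percentile absolute shift) for today's live scores."""
    reference = np.percentile(reference_scores, PERCENTILES)
    today = np.percentile(todays_scores, PERCENTILES)
    shifts = np.abs(today - reference)
    alert = bool((shifts > max_abs_shift).any())
    return alert, dict(zip(PERCENTILES, shifts.tolist()))
```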
Key Takeaways
- Latency is a hard constraint, not a nice-to-have. Design your feature assembly and inference path first, then fit your model choices around it.
- Rules + ML is better than ML alone — rules handle known patterns, ML catches the novel ones.
- Feature engineering and consistency between training and serving will give you more performance gains than model selection.
- Monitor score distributions in production, not just model metrics on a test set.
- Separate your decision logic from your model — thresholds and business rules change faster than models, and that's okay.
Fraud detection is a live arms race. The system that catches fraud today will need to evolve to catch fraud tomorrow. Build for adaptability, not just accuracy.
Have you built fraud detection systems? I'd love to hear what architectural decisions you made differently — drop it in the comments.