Every millisecond counts when it comes to fraud. A fraudulent transaction approved in 200ms costs real money. A legitimate transaction declined in 200ms costs a customer. Getting this balance right — at scale — is one of the hardest engineering problems in financial services.
This is a deep dive into the architectural decisions, trade-offs, and hard lessons from building a production-grade credit card fraud detection system. No toy datasets. No Jupyter notebooks. Real architecture, real constraints.
The Problem Is Not What You Think
Most tutorials frame fraud detection as a machine learning problem. Pick the right model, tune your F1 score, ship it.
In production, it's an engineering and systems problem with ML embedded inside it.
The real challenges are:
- Latency: you have ~150–300ms to make a decision before the payment network times out
- Scale: millions of transactions per day, with spikes you cannot always predict
- Imbalance: fraudulent transactions can be as low as 0.1% of total volume — your system must be hypersensitive to a signal buried in 99.9% noise
- Drift: fraud patterns change constantly; yesterday's model is today's liability
- Explainability: regulators and customers will ask why a transaction was declined
None of these are model problems. All of them are architecture problems.
System Architecture: The Big Picture
Here's the high-level design we settled on after several iterations:
Transaction Request
│
▼
┌─────────────────────┐
│ API Gateway │ ← Rate limiting, auth, routing
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Feature Service │ ← Real-time feature assembly (<20ms)
│ (Redis + Flink) │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Scoring Engine │ ← ML model inference (<50ms)
│ (Rule layer + │
│ ML model layer) │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Decision Engine │ ← Threshold logic, risk bands, actions
└────────┬────────────┘
│
┌────┴─────┐
▼ ▼
Approve Decline / Step-up (3DS)
The entire synchronous path — from transaction in to decision out — must complete in under 300ms. Everything else (model retraining, alerting, feedback loops) is asynchronous.
Layer 1: Feature Engineering is Everything
The biggest performance gains we saw did not come from switching models — they came from better features.
Real-Time Features (assembled per transaction)
These are computed on-the-fly and pulled from Redis:
- Velocity features: number of transactions in last 1m / 5m / 1h / 24h per card
- Amount deviation: how far this transaction deviates from the cardholder's average spend
- Merchant risk score: pre-computed score for the merchant category / MCC code
- Time-of-day signal: is this transaction outside the cardholder's normal hours?
- Geographic anomaly: is the country/city inconsistent with recent usage?
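To make the velocity features concrete, here's a minimal sketch of how they could be read from and written to Redis, assuming each card maps to a sorted set of transaction timestamps (the `velocity:{card_id}` key schema and the window names are illustrative, not the production ones):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Trailing windows in seconds, keyed by feature name (illustrative names).
WINDOWS = {"tx_count_1m": 60, "tx_count_5m": 300,
           "tx_count_1h": 3600, "tx_count_24h": 86400}

def velocity_features(card_id: str, now=None) -> dict:
    """Count transactions per card inside each trailing window."""
    now = now or time.time()
    key = f"velocity:{card_id}"
    pipe = r.pipeline()
    for window in WINDOWS.values():
        pipe.zcount(key, now - window, now)  # members scored by timestamp
    counts = pipe.execute()
    return dict(zip(WINDOWS.keys(), counts))

def record_transaction(card_id: str, tx_id: str, now=None) -> None:
    """Append a transaction timestamp and trim anything older than 24h."""
    now = now or time.time()
    key = f"velocity:{card_id}"
    pipe = r.pipeline()
    pipe.zadd(key, {tx_id: now})
    pipe.zremrangebyscore(key, 0, now - 86400)
    pipe.execute()
```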
Behavioural Baseline Features (batch, updated hourly)
Pulled from a feature store (we used Feast, but Redis + Spark works too):
- 30-day average transaction amount
- Typical merchant categories
- Device fingerprint history
- Known trusted locations
The Hard Part: Consistency
The training data must use exactly the same feature definitions as inference. Training-serving skew is one of the most common sources of degraded model performance in production, and it's invisible until something breaks.
We solved this by centralising feature computation in a shared library used by both the batch training pipeline and the real-time feature service. Same code. No exceptions.
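A minimal sketch of what that shared library looks like in spirit: a pure feature definition, imported unchanged by both the training pipeline and the serving path (field names here are illustrative, not the production schema):

```python
def amount_deviation(amount: float, avg_amount_30d: float,
                     std_amount_30d: float) -> float:
    """How many standard deviations this transaction sits from the
    cardholder's 30-day average spend."""
    if std_amount_30d <= 0:
        return 0.0
    return (amount - avg_amount_30d) / std_amount_30d

def build_features(tx: dict, baseline: dict) -> dict:
    """Single feature-assembly entry point.
    Training side: applied row by row to the historical dataset.
    Serving side: applied to one transaction plus its cached baseline."""
    return {
        "amount_deviation": amount_deviation(
            tx["amount"], baseline["avg_amount_30d"], baseline["std_amount_30d"]
        ),
        "is_night": 1 if tx["hour_of_day"] < 6 else 0,
    }
```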
Layer 2: The Scoring Engine — Rules + ML Together
We ran two scoring layers in parallel, not in sequence.
Rules Engine (first line of defence)
Simple, fast, interpretable. Handles obvious cases:
IF transaction_country NOT IN cardholder_known_countries
AND amount > 500
AND hour_of_day BETWEEN 1 AND 5
THEN score += 80 (high risk)
Rules are cheap to update, easy to explain to regulators, and handle known fraud patterns with high precision. They block roughly 30–40% of fraud before the ML model is even invoked.
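One way to keep rules cheap to update is to express them as data rather than code. The sketch below mirrors the pseudocode rule above; the rule registry, field names, and score weight are illustrative assumptions, not the production rule set:

```python
# Each rule is a predicate plus a score contribution, so rules can be
# added or retired without touching the scoring path itself.
RULES = [
    {
        "name": "foreign_high_value_night",
        "when": lambda tx: (
            tx["country"] not in tx["known_countries"]
            and tx["amount"] > 500
            and 1 <= tx["hour_of_day"] <= 5
        ),
        "score": 80,  # high risk
    },
]

def rule_score(tx: dict):
    """Return the summed rule score and the names of rules that fired,
    which doubles as the explanation shown to analysts."""
    total, fired = 0, []
    for rule in RULES:
        if rule["when"](tx):
            total += rule["score"]
            fired.append(rule["name"])
    return total, fired
```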
ML Model Layer
We landed on a gradient boosted tree (XGBoost) as the primary model, for three reasons:
- Speed: inference is sub-millisecond even with 200+ features
- Performance: consistently outperforms neural networks on tabular transaction data
- Explainability: SHAP values give you per-prediction feature importance, which is gold for compliance teams
We also ran a secondary autoencoder-based anomaly detector in parallel for catching novel fraud patterns the supervised model had never seen. Its score was blended in with a lower weight.
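A rough sketch of how the two scores could be blended, assuming the XGBoost booster outputs a fraud probability and the autoencoder exposes a `predict` method that returns the reconstructed feature vector (the 0.8/0.2 weighting is illustrative, not the tuned value):

```python
import numpy as np
import xgboost as xgb

def blended_score(booster: xgb.Booster, autoencoder, features: np.ndarray,
                  anomaly_weight: float = 0.2) -> float:
    # Supervised fraud probability from the gradient-boosted model.
    supervised = float(booster.predict(xgb.DMatrix(features.reshape(1, -1)))[0])

    # Anomaly score from reconstruction error, squashed into [0, 1].
    reconstruction = autoencoder.predict(features.reshape(1, -1))
    error = float(np.mean((features.reshape(1, -1) - reconstruction) ** 2))
    anomaly = 1.0 - float(np.exp(-error))

    # Blend with a lower weight on the unsupervised signal.
    return (1.0 - anomaly_weight) * supervised + anomaly_weight * anomaly
```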
Class Imbalance: What Actually Works
The standard advice is to use SMOTE or random oversampling. In practice, we found:
- SMOTE helped during initial model development but added noise at high volumes
- Cost-sensitive learning (setting `scale_pos_weight` in XGBoost) was more robust in production
- Threshold tuning post-training gave us far more control over the precision-recall trade-off than any resampling technique
Do not optimise for accuracy. Optimise for precision-recall AUC and then tune your operating threshold based on business risk appetite.
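Here's a minimal sketch of that approach: cost-sensitive training via `scale_pos_weight`, followed by picking an operating threshold against a precision floor on a validation set (the 0.90 precision target is an illustrative stand-in for your own risk appetite):

```python
from sklearn.metrics import auc, precision_recall_curve
from xgboost import XGBClassifier

def train_and_pick_threshold(X_train, y_train, X_val, y_val, min_precision=0.90):
    # Weight the rare fraud class instead of resampling the data.
    pos_weight = float((y_train == 0).sum()) / max(float((y_train == 1).sum()), 1.0)
    model = XGBClassifier(scale_pos_weight=pos_weight, eval_metric="aucpr")
    model.fit(X_train, y_train)

    scores = model.predict_proba(X_val)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_val, scores)
    pr_auc = auc(recall, precision)

    # Maximise recall subject to a business-driven precision floor:
    # take the lowest threshold whose precision still clears the floor.
    meets_floor = precision[:-1] >= min_precision
    threshold = float(thresholds[meets_floor].min()) if meets_floor.any() else 0.5
    return model, threshold, pr_auc
```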
Layer 3: The Decision Engine
The model outputs a score between 0 and 1. The decision engine translates that into an action:
| Score Band | Action |
|---|---|
| 0.0 – 0.3 | Approve |
| 0.3 – 0.6 | Approve with soft alert |
| 0.6 – 0.8 | Step-up authentication (3DS) |
| 0.8 – 1.0 | Decline |
These thresholds are not fixed. They shift based on:
- Merchant risk profile: a high-risk merchant (e.g., crypto exchange, gift cards) shifts the threshold down
- Time of day / fraud rate spikes: during known high-fraud periods, thresholds tighten
- Card holder profile: a known high-value customer might get a softer threshold to protect against false positives
This dynamic thresholding layer is where a lot of the business logic lives, and it's deliberately kept separate from the ML model so it can be tuned without retraining.
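As a sketch, the decision layer can be as simple as band lookups with a per-merchant adjustment applied before the lookup. The band edges match the table above; the adjustment logic and values are illustrative:

```python
def decide(score: float, merchant_risk: float = 0.0) -> str:
    """A positive merchant_risk shifts every band down, so the same
    model score is treated more severely at a high-risk merchant."""
    adjusted = min(score + merchant_risk, 1.0)
    if adjusted < 0.3:
        return "approve"
    if adjusted < 0.6:
        return "approve_with_alert"
    if adjusted < 0.8:
        return "step_up_3ds"
    return "decline"

# Example: a 0.55 score at a high-risk merchant (adjustment 0.1) crosses
# into the step-up band instead of a soft-alert approval.
assert decide(0.55) == "approve_with_alert"
assert decide(0.55, merchant_risk=0.1) == "step_up_3ds"
```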
Model Retraining: The Feedback Loop Nobody Talks About
A fraud model trained on 6-month-old data is already becoming stale. Fraud rings adapt quickly.
Our retraining pipeline:
- Daily batch: new confirmed fraud labels (from manual review + chargeback signals) are fed into the training dataset
- Weekly retrain: model is retrained on a rolling 90-day window
- Shadow scoring: new model runs in shadow mode alongside production for 48 hours before promotion
- Canary release: new model serves 5% of traffic before full rollout, with automatic rollback if key metrics degrade
The key metric we watched during shadow mode was rank order — does the new model score known fraudulent transactions higher than known legitimate ones? A lift chart drift of more than 5% triggered a human review.
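One simple way to approximate that rank-order check is to score the same labelled shadow traffic with both models and compare a ranking metric such as ROC AUC, flagging the candidate when the relative drop exceeds a tolerance. The sketch below uses a 5% tolerance to mirror the lift-chart trigger described above; treat it as an approximation, not the exact check we ran:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shadow_check(y_true: np.ndarray, prod_scores: np.ndarray,
                 shadow_scores: np.ndarray, tolerance: float = 0.05) -> bool:
    """Return True if the shadow model's rank ordering is acceptable."""
    prod_auc = roc_auc_score(y_true, prod_scores)
    shadow_auc = roc_auc_score(y_true, shadow_scores)
    relative_drop = (prod_auc - shadow_auc) / prod_auc
    return relative_drop <= tolerance
```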
Infrastructure Choices and Why
| Component | Choice | Why |
|---|---|---|
| Feature cache | Redis Cluster | Sub-millisecond read latency |
| Stream processing | Apache Flink | Stateful windowed aggregations at scale |
| Model serving | Custom FastAPI service | Full control over batching and concurrency |
| Feature store | Feast | Consistency between training and serving |
| Monitoring | Prometheus + Grafana | Real-time score distribution and drift alerts |
| Experiment tracking | MLflow | Model versioning and reproducibility |
We deliberately avoided "full-stack" ML platforms that bundle everything together. The lock-in risk and latency overhead were not worth the developer experience gains at our scale.
What We Got Wrong (and Fixed)
1. We underestimated feature latency
Early on, our feature service was querying a relational database. At peak load, this added 80–120ms to the decision path — completely unacceptable. Migrating feature reads to Redis brought this down to 3–5ms.
2. We over-rotated to ML and abandoned rules
For a period, the ML model was doing all the heavy lifting, which meant that when it degraded due to data drift there was no backstop. Bringing the rules layer back as a first gate significantly improved resilience.
3. We optimised the wrong metric
In early iterations, we chased high recall (catch as much fraud as possible). This led to a false positive rate that was frustrating legitimate cardholders. The right trade-off is business-specific — understand your chargeback cost vs. your customer attrition cost before setting thresholds.
4. We didn't monitor score distributions
A model can look healthy on offline metrics while its live score distribution is silently shifting. We now track percentile distribution of scores daily. Any unexpected shift triggers an investigation.
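A minimal sketch of that daily check: compare today's live score percentiles against a reference window and flag any unexpected shift (the percentile grid and tolerance are illustrative values):

```python
import numpy as np

PERCENTILES = [50, 75, 90, 95, 99]

def score_distribution_shift(reference_scores: np.ndarray,
                             todays_scores: np.ndarray,
                             max_abs_shift: float = 0.05):
    """Return (alert, per-percentile absolute shift) for today's live scores."""
    reference = np.percentile(reference_scores, PERCENTILES)
    today = np.percentile(todays_scores, PERCENTILES)
    shifts = np.abs(today - reference)
    alert = bool((shifts > max_abs_shift).any())
    return alert, dict(zip(PERCENTILES, shifts.tolist()))
```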
Key Takeaways
- Latency is a hard constraint, not a nice-to-have. Design your feature assembly and inference path first, then fit your model choices around it.
- Rules + ML is better than ML alone — rules handle known patterns, ML catches the novel ones.
- Feature engineering and consistency between training and serving will give you more performance gains than model selection.
- Monitor score distributions in production, not just model metrics on a test set.
- Separate your decision logic from your model — thresholds and business rules change faster than models, and that's okay.
Fraud detection is a live arms race. The system that catches fraud today will need to evolve to catch fraud tomorrow. Build for adaptability, not just accuracy.
Have you built fraud detection systems? I'd love to hear what architectural decisions you made differently — drop it in the comments.