DEV Community

Sameer Ahmed
Sameer Ahmed

Posted on

How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms

How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms

A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go.


The Problem

Fraud detection is a classic hard problem in systems design. You need to:

  • Classify transactions in real-time — users can't wait 100ms for a payment to go through
  • Handle massive throughput — payment systems process thousands of requests per second
  • Maintain high accuracy — false positives block legitimate transactions, false negatives let fraud through
  • Deploy without downtime — you can't take a payment system offline to update your model

I built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me.


The Architecture

Transaction Request
        │
        ▼
   Go HTTP Server
        │
        ▼
   LRU Cache ──── Cache Hit? ──► Return Result (< 1ms)
        │
     Cache Miss
        │
        ▼
   ONNX Runtime
        │
        ▼
  XGBoost Model
        │
        ▼
  Fraud Score + Decision
        │
        ▼
  Prometheus Metrics
Enter fullscreen mode Exit fullscreen mode

The key insight: serve ML inference from Go, not Python.


Step 1: Training the Model

I trained XGBoost on the Kaggle Credit Card Fraud Dataset — 284,807 transactions, heavily imbalanced (only 0.17% fraud).

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc

# Handle class imbalance
scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    eval_metric='aucpr',
    use_label_encoder=False
)

model.fit(X_train, y_train,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=20,
    verbose=False
)
Enter fullscreen mode Exit fullscreen mode

Results:

  • PR-AUC: 0.87
  • Recall: 86%
  • Precision: 74%

Why PR-AUC over ROC-AUC? Because with imbalanced datasets, ROC-AUC is misleading. PR-AUC punishes you for missing fraud cases.


Step 2: Exporting to ONNX

Here's where it gets interesting. Python inference is slow. I needed Go-level performance.

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Export XGBoost → ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open("fraud_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
Enter fullscreen mode Exit fullscreen mode

ONNX is a universal model format. Once exported, I can serve it from any language — in this case, Go using the onnxruntime-go binding.


Step 3: Go HTTP Server with ONNX Inference

type InferenceEngine struct {
    session  *onnxruntime.Session
    mu       sync.RWMutex
}

func (e *InferenceEngine) Predict(features []float32) (float64, error) {
    e.mu.RLock()
    defer e.mu.RUnlock()

    input := onnxruntime.NewTensor(features)
    outputs, err := e.session.Run(input)
    if err != nil {
        return 0, err
    }

    score := outputs[0].GetData().([]float32)[0]
    return float64(score), nil
}
Enter fullscreen mode Exit fullscreen mode

The Go HTTP handler:

func (h *Handler) PredictHandler(w http.ResponseWriter, r *http.Request) {
    var req TransactionRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", 400)
        return
    }

    // Check LRU cache first
    cacheKey := req.Hash()
    if cached, ok := h.cache.Get(cacheKey); ok {
        json.NewEncoder(w).Encode(cached)
        return
    }

    // Run ONNX inference
    features := req.ToFeatures()
    score, err := h.engine.Predict(features)
    if err != nil {
        http.Error(w, "inference error", 500)
        return
    }

    result := &PredictionResult{
        Score:     score,
        IsFraud:   score > 0.5,
        Timestamp: time.Now().UnixNano(),
    }

    h.cache.Set(cacheKey, result)
    json.NewEncoder(w).Encode(result)
}
Enter fullscreen mode Exit fullscreen mode

Step 4: LRU Cache — The Secret Weapon

The LRU cache was the single biggest performance win.

type LRUCache struct {
    capacity int
    mu       sync.RWMutex
    items    map[string]*list.Element
    list     *list.List
}

func (c *LRUCache) Get(key string) (interface{}, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()

    if elem, ok := c.items[key]; ok {
        c.list.MoveToFront(elem)
        return elem.Value.(*entry).value, true
    }
    return nil, false
}
Enter fullscreen mode Exit fullscreen mode

Result: 99% cache hit rate, saving ~200μs per decision.

In a high-throughput system, 200μs × 71,000 = 14.2 seconds saved per second. That's the compounding power of caching.


Step 5: Zero-Downtime Model Updates

The hardest part. How do you update an ML model without taking the server down?

type ModelManager struct {
    engine atomic.Pointer[InferenceEngine]
}

func (m *ModelManager) HotSwap(newModel []byte) error {
    newEngine, err := NewInferenceEngine(newModel)
    if err != nil {
        return err
    }

    // Atomic swap — zero downtime
    m.engine.Store(newEngine)
    return nil
}

func (m *ModelManager) GetEngine() *InferenceEngine {
    return m.engine.Load()
}
Enter fullscreen mode Exit fullscreen mode

atomic.Pointer from Go 1.19 makes this trivial. No locks. No downtime. The old engine gets garbage collected after the swap.


Step 6: A/B Traffic Splitting

Once you can hot-swap models, A/B testing becomes easy:

func (h *Handler) routeRequest(r *http.Request) *InferenceEngine {
    // Hash user ID for consistent routing
    hash := fnv32(r.Header.Get("X-User-ID"))
    if hash%100 < h.config.ModelBPercentage {
        return h.modelB.Load()
    }
    return h.modelA.Load()
}
Enter fullscreen mode Exit fullscreen mode

This lets me gradually roll out new models — 5% → 20% → 50% → 100% — while monitoring Prometheus metrics for drift.


Step 7: Drift Detection

Model drift is when real-world data shifts away from your training distribution. I implemented lightweight drift detection:

func (d *DriftDetector) Check(features []float32) bool {
    var drift float64
    for i, f := range features {
        deviation := math.Abs(float64(f) - d.baseline[i].Mean)
        normalized := deviation / (d.baseline[i].Std + 1e-8)
        drift += normalized
    }
    drift /= float64(len(features))

    // Alert if drift exceeds threshold
    return drift > d.threshold // threshold: 5e-7
}
Enter fullscreen mode Exit fullscreen mode

Benchmark Results

Load tested with k6 — 200 concurrent VUs, 60 second duration:

scenarios: (100.00%) 1 scenario, 200 max VUs
  default: 200 looping VUs for 60s

✓ http_req_duration.............: avg=4.2ms  p(95)=5.8ms
✓ http_req_failed...............: 0.00%
✓ iterations....................: 4,276,440
✓ vus...........................: 200

Throughput: 71,274 req/s
Enter fullscreen mode Exit fullscreen mode

71,274 requests per second. p95 at 5.8ms. Zero errors across 7.8M requests.


Key Lessons

1. Language choice matters for inference.
Python is great for training. Go is great for serving. ONNX bridges the gap — you get the best of both worlds.

2. Cache aggressively.
99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model.

3. Atomic operations > locks for hot paths.
atomic.Pointer for model swapping means zero contention on the critical path.

4. Design for deployability from day one.
Zero-downtime deploys aren't an afterthought — they're a core architectural requirement.

5. Monitor everything.
Prometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it.


What's Next

  • Implement online learning to update the model with new fraud patterns in real-time
  • Add feature store integration for richer transaction context
  • Experiment with transformer-based models for sequence modeling

Source Code

github.com/sameer-sde/sentinel


If you found this useful, drop a ❤️ and follow for more systems engineering content. I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.

Connect with me: LinkedIn · GitHub · Portfolio

Top comments (1)

Collapse
 
pranav_gore_297555a5b7dc2 profile image
Pranav Gore

Hi, I hope you are doing well. We are a software development team. We hunt for US jobs using Us job profile. So we are looking for a senior developer who can work with us.
Your role is to take part in the job interviews and pass the interviews. If your English is fluent, we can work together. If you are interested, please kindly send me message. I will explain more detail. Thank you!
Whatsapp: +1 (351) 234-6532
Telegram: @lionking06230810