How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms
A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go.
The Problem
Fraud detection is a classic hard problem in systems design. You need to:
- Classify transactions in real-time — users can't wait 100ms for a payment to go through
- Handle massive throughput — payment systems process thousands of requests per second
- Maintain high accuracy — false positives block legitimate transactions, false negatives let fraud through
- Deploy without downtime — you can't take a payment system offline to update your model
I built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me.
The Architecture
Transaction Request
│
▼
Go HTTP Server
│
▼
LRU Cache ──── Cache Hit? ──► Return Result (< 1ms)
│
Cache Miss
│
▼
ONNX Runtime
│
▼
XGBoost Model
│
▼
Fraud Score + Decision
│
▼
Prometheus Metrics
The key insight: serve ML inference from Go, not Python.
Step 1: Training the Model
I trained XGBoost on the Kaggle Credit Card Fraud Dataset — 284,807 transactions, heavily imbalanced (only 0.17% fraud).
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc
# Handle class imbalance
scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()
model = xgb.XGBClassifier(
n_estimators=300,
max_depth=6,
learning_rate=0.05,
scale_pos_weight=scale_pos_weight,
eval_metric='aucpr',
use_label_encoder=False
)
model.fit(X_train, y_train,
eval_set=[(X_val, y_val)],
early_stopping_rounds=20,
verbose=False
)
Results:
- PR-AUC: 0.87
- Recall: 86%
- Precision: 74%
Why PR-AUC over ROC-AUC? Because with imbalanced datasets, ROC-AUC is misleading. PR-AUC punishes you for missing fraud cases.
Step 2: Exporting to ONNX
Here's where it gets interesting. Python inference is slow. I needed Go-level performance.
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
# Export XGBoost → ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open("fraud_model.onnx", "wb") as f:
f.write(onnx_model.SerializeToString())
ONNX is a universal model format. Once exported, I can serve it from any language — in this case, Go using the onnxruntime-go binding.
Step 3: Go HTTP Server with ONNX Inference
type InferenceEngine struct {
session *onnxruntime.Session
mu sync.RWMutex
}
func (e *InferenceEngine) Predict(features []float32) (float64, error) {
e.mu.RLock()
defer e.mu.RUnlock()
input := onnxruntime.NewTensor(features)
outputs, err := e.session.Run(input)
if err != nil {
return 0, err
}
score := outputs[0].GetData().([]float32)[0]
return float64(score), nil
}
The Go HTTP handler:
func (h *Handler) PredictHandler(w http.ResponseWriter, r *http.Request) {
var req TransactionRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "bad request", 400)
return
}
// Check LRU cache first
cacheKey := req.Hash()
if cached, ok := h.cache.Get(cacheKey); ok {
json.NewEncoder(w).Encode(cached)
return
}
// Run ONNX inference
features := req.ToFeatures()
score, err := h.engine.Predict(features)
if err != nil {
http.Error(w, "inference error", 500)
return
}
result := &PredictionResult{
Score: score,
IsFraud: score > 0.5,
Timestamp: time.Now().UnixNano(),
}
h.cache.Set(cacheKey, result)
json.NewEncoder(w).Encode(result)
}
Step 4: LRU Cache — The Secret Weapon
The LRU cache was the single biggest performance win.
type LRUCache struct {
capacity int
mu sync.RWMutex
items map[string]*list.Element
list *list.List
}
func (c *LRUCache) Get(key string) (interface{}, bool) {
c.mu.RLock()
defer c.mu.RUnlock()
if elem, ok := c.items[key]; ok {
c.list.MoveToFront(elem)
return elem.Value.(*entry).value, true
}
return nil, false
}
Result: 99% cache hit rate, saving ~200μs per decision.
In a high-throughput system, 200μs × 71,000 = 14.2 seconds saved per second. That's the compounding power of caching.
Step 5: Zero-Downtime Model Updates
The hardest part. How do you update an ML model without taking the server down?
type ModelManager struct {
engine atomic.Pointer[InferenceEngine]
}
func (m *ModelManager) HotSwap(newModel []byte) error {
newEngine, err := NewInferenceEngine(newModel)
if err != nil {
return err
}
// Atomic swap — zero downtime
m.engine.Store(newEngine)
return nil
}
func (m *ModelManager) GetEngine() *InferenceEngine {
return m.engine.Load()
}
atomic.Pointer from Go 1.19 makes this trivial. No locks. No downtime. The old engine gets garbage collected after the swap.
Step 6: A/B Traffic Splitting
Once you can hot-swap models, A/B testing becomes easy:
func (h *Handler) routeRequest(r *http.Request) *InferenceEngine {
// Hash user ID for consistent routing
hash := fnv32(r.Header.Get("X-User-ID"))
if hash%100 < h.config.ModelBPercentage {
return h.modelB.Load()
}
return h.modelA.Load()
}
This lets me gradually roll out new models — 5% → 20% → 50% → 100% — while monitoring Prometheus metrics for drift.
Step 7: Drift Detection
Model drift is when real-world data shifts away from your training distribution. I implemented lightweight drift detection:
func (d *DriftDetector) Check(features []float32) bool {
var drift float64
for i, f := range features {
deviation := math.Abs(float64(f) - d.baseline[i].Mean)
normalized := deviation / (d.baseline[i].Std + 1e-8)
drift += normalized
}
drift /= float64(len(features))
// Alert if drift exceeds threshold
return drift > d.threshold // threshold: 5e-7
}
Benchmark Results
Load tested with k6 — 200 concurrent VUs, 60 second duration:
scenarios: (100.00%) 1 scenario, 200 max VUs
default: 200 looping VUs for 60s
✓ http_req_duration.............: avg=4.2ms p(95)=5.8ms
✓ http_req_failed...............: 0.00%
✓ iterations....................: 4,276,440
✓ vus...........................: 200
Throughput: 71,274 req/s
71,274 requests per second. p95 at 5.8ms. Zero errors across 7.8M requests.
Key Lessons
1. Language choice matters for inference.
Python is great for training. Go is great for serving. ONNX bridges the gap — you get the best of both worlds.
2. Cache aggressively.
99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model.
3. Atomic operations > locks for hot paths.
atomic.Pointer for model swapping means zero contention on the critical path.
4. Design for deployability from day one.
Zero-downtime deploys aren't an afterthought — they're a core architectural requirement.
5. Monitor everything.
Prometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it.
What's Next
- Implement online learning to update the model with new fraud patterns in real-time
- Add feature store integration for richer transaction context
- Experiment with transformer-based models for sequence modeling
Source Code
github.com/sameer-sde/sentinel
If you found this useful, drop a ❤️ and follow for more systems engineering content. I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.
Top comments (1)
Hi, I hope you are doing well. We are a software development team. We hunt for US jobs using Us job profile. So we are looking for a senior developer who can work with us.
Your role is to take part in the job interviews and pass the interviews. If your English is fluent, we can work together. If you are interested, please kindly send me message. I will explain more detail. Thank you!
Whatsapp: +1 (351) 234-6532
Telegram: @lionking06230810