DEV Community: Sameer Ahmed

I Built a Vector Search Engine from Scratch — Here's What I Learned

Sameer Ahmed — Wed, 03 Jun 2026 11:01:02 +0000

Implementing HNSW (Hierarchical Navigable Small World) graphs, hybrid BM25 + dense retrieval, HyDE query rewriting, and atomic index persistence — achieving recall@10 = 0.984.

Why Build Your Own Vector Search?

When I started building Vektr, a RAG (Retrieval-Augmented Generation) engine, I had a choice: use an existing vector database or library like Pinecone, Weaviate, or FAISS, or build my own.

I chose to build my own. Not because existing solutions are bad (they're excellent), but because you don't truly understand a system until you've built it.

This post is about what I learned building HNSW from scratch.

What is HNSW?

HNSW (Hierarchical Navigable Small World) is the algorithm powering most modern vector databases. It achieves roughly logarithmic search time with high recall by organizing vectors into a hierarchical graph.

The key insight: approximate nearest neighbor search is far faster than exact search, and "approximate" is closer to exact than you'd think.

My implementation achieves recall@10 = 0.984, meaning that on average 98.4% of each query's 10 true nearest neighbors appear in the top 10 results.
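To make the metric concrete, here is a minimal sketch of how recall@k can be computed against brute-force ground truth. The function names are hypothetical and not taken from Vektr; the ground truth is assumed to come from an exact search over the same vectors.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Fraction of the true top-k neighbors that appear in the retrieved top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in truth)
    return hits / len(truth)


def mean_recall_at_k(all_retrieved, all_truth, k=10):
    """Average recall@k over a set of queries."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_retrieved, all_truth)]
    return sum(scores) / len(scores)
```

A reported recall@10 is normally the mean over all test queries, so a single query can still miss one or two neighbors while the average stays high.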

The HNSW Structure

Layer 2 (sparse):  1 ──────────── 5
                   │              │
Layer 1 (medium):  1 ── 3 ── 4 ── 5
                   │    │    │    │
Layer 0 (dense):   1─2─3─4─5─6─7─8─9

Each vector is inserted at layer 0. Its top layer is drawn as floor(-ln(U) × mL) with mL = 1/ln(M), so it also appears in layer 1 with probability 1/M, in layer 2 with probability 1/M², and so on. This creates a highway network: you navigate quickly through sparse upper layers, then zoom in at the dense bottom layer.
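The layer assignment is easy to check empirically. The sketch below assumes the standard HNSW rule, level = floor(-ln(U) × mL) with mL = 1/ln(M), and uses M = 16 purely as an illustrative value; the helper names are hypothetical.

```python
import math
import random


def level_from_uniform(u, m):
    """Map a uniform sample u in (0, 1] to an HNSW level, with mL = 1 / ln(m)."""
    return int(-math.log(u) / math.log(m))


def sample_levels(n, m, seed=0):
    rng = random.Random(seed)
    # 1.0 - random() lies in (0, 1], which avoids log(0).
    return [level_from_uniform(1.0 - rng.random(), m) for _ in range(n)]


levels = sample_levels(100_000, m=16)
# Share of nodes that reach layer 1 or higher; expected to be close to 1/16.
share_above_zero = sum(1 for lvl in levels if lvl >= 1) / len(levels)
```

Each extra layer keeps roughly 1/M of the nodes from the layer below, which is what makes the upper layers sparse enough to act as express lanes.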

Building the Index

public class HNSWIndex {
    private final int M;           // Max connections per node
    private final int efConstruction; // Search width during construction
    private final int maxLayer;
    private final Map<Integer, Node> nodes;
    private final Random random;
    private Node entryPoint;

    public void insert(int id, float[] vector) {
        int level = getRandomLevel();
        Node newNode = new Node(id, vector, level);

        if (entryPoint == null) {
            entryPoint = newNode;
            nodes.put(id, newNode);
            return;
        }

        // Start from entry point, navigate down to insertion level
        Node current = entryPoint;
        for (int l = entryPoint.level; l > level; l--) {
            current = greedySearch(current, vector, 1, l).get(0);
        }

        // Insert at each layer from level down to 0
        for (int l = Math.min(level, entryPoint.level); l >= 0; l--) {
            List<Node> candidates = searchLayer(current, vector, efConstruction, l);
            List<Node> neighbors = selectNeighbors(candidates, M, vector);

            newNode.setConnections(l, neighbors);

            // Add backlinks
            for (Node neighbor : neighbors) {
                neighbor.addConnection(l, newNode);

                // Prune if over capacity
                if (neighbor.getConnections(l).size() > M) {
                    List<Node> pruned = selectNeighbors(
                        neighbor.getConnections(l), M, neighbor.vector
                    );
                    neighbor.setConnections(l, pruned);
                }
            }
        }

        nodes.put(id, newNode);
        if (level > entryPoint.level) {
            entryPoint = newNode;
        }
    }

    private int getRandomLevel() {
        // Level = floor(-ln(U) * mL) with mL = 1/ln(M), so P(level >= l) = M^(-l)
        double u = 1.0 - random.nextDouble(); // in (0, 1], avoids log(0)
        double r = -Math.log(u) * (1.0 / Math.log(M));
        return (int) Math.min(r, maxLayer);
    }
}

Searching the Index

public List<SearchResult> search(float[] query, int k, int ef) {
    // Navigate from entry point down to layer 1
    Node current = entryPoint;
    for (int l = entryPoint.level; l > 0; l--) {
        current = greedySearch(current, query, 1, l).get(0);
    }

    // Beam search at layer 0 with ef candidates
    List<Node> candidates = searchLayer(current, query, ef, 0);

    // Return top-k by similarity, most similar first (negate for descending order)
    return candidates.stream()
        .sorted(Comparator.comparingDouble(n -> -cosineSimilarity(query, n.vector)))
        .limit(k)
        .map(n -> new SearchResult(n.id, cosineSimilarity(query, n.vector)))
        .collect(Collectors.toList());
}

private List<Node> searchLayer(Node entry, float[] query, int ef, int layer) {
    Set<Node> visited = new HashSet<>();
    PriorityQueue<Node> candidates = new PriorityQueue<>(
        Comparator.comparingDouble(n -> -cosineSimilarity(query, n.vector))
    );
    PriorityQueue<Node> results = new PriorityQueue<>(
        Comparator.comparingDouble(n -> cosineSimilarity(query, n.vector))
    );

    candidates.add(entry);
    results.add(entry);
    visited.add(entry);

    while (!candidates.isEmpty()) {
        Node candidate = candidates.poll();

        // Termination condition: best candidate is worse than worst result
        if (results.size() >= ef &&
            cosineSimilarity(query, candidate.vector) <
            cosineSimilarity(query, results.peek().vector)) {
            break;
        }

        for (Node neighbor : candidate.getConnections(layer)) {
            if (!visited.contains(neighbor)) {
                visited.add(neighbor);
                candidates.add(neighbor);
                results.add(neighbor);
                if (results.size() > ef) results.poll();
            }
        }
    }

    return new ArrayList<>(results);
}
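The layer search is a beam search with two heaps, which can be easier to follow on a toy graph. This is a minimal sketch, not Vektr's code: it works with distances (smaller is better) instead of cosine similarity, and `graph`, `vectors` and `dist` are hypothetical stand-ins for one layer's adjacency lists, the stored vectors and the metric.

```python
import heapq


def search_layer(graph, vectors, entry, query, ef, dist):
    """Beam search over one layer; returns up to ef node ids, closest first."""
    visited = {entry}
    d0 = dist(query, vectors[entry])
    candidates = [(d0, entry)]   # min-heap: closest unexpanded candidate first
    results = [(-d0, entry)]     # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                # closest candidate is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, vectors[nb])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst result
    return [n for _, n in sorted((-nd, n) for nd, n in results)]


# Toy layer: ten points on a line, each linked to its immediate neighbors.
vectors = {i: float(i) for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
nearest = search_layer(graph, vectors, entry=0, query=6.2, ef=3,
                       dist=lambda a, b: abs(a - b))
```

A larger ef keeps more candidates alive, which costs more distance computations but makes it less likely that the search stops in a local minimum.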

Hybrid Retrieval: BM25 + Dense Search

Pure vector search misses exact keyword matches. Pure BM25 misses semantic similarity. The solution: combine both.

public List<SearchResult> hybridSearch(String query, int k) {
    // Dense retrieval
    float[] queryEmbedding = embedder.embed(query);
    List<SearchResult> denseResults = index.search(queryEmbedding, k * 2, efSearch);

    // Sparse retrieval (BM25)
    List<SearchResult> sparseResults = bm25.search(query, k * 2);

    // Reciprocal Rank Fusion
    return reciprocalRankFusion(denseResults, sparseResults, k);
}

private List<SearchResult> reciprocalRankFusion(
    List<SearchResult> dense,
    List<SearchResult> sparse,
    int k
) {
    Map<Integer, Double> scores = new HashMap<>();
    int k_rrf = 60; // RRF constant

    // Dense scores
    for (int i = 0; i < dense.size(); i++) {
        int id = dense.get(i).id;
        scores.merge(id, 1.0 / (k_rrf + i + 1), Double::sum);
    }

    // Sparse scores
    for (int i = 0; i < sparse.size(); i++) {
        int id = sparse.get(i).id;
        scores.merge(id, 1.0 / (k_rrf + i + 1), Double::sum);
    }

    return scores.entrySet().stream()
        .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
        .limit(k)
        .map(e -> new SearchResult(e.getKey(), e.getValue()))
        .collect(Collectors.toList());
}

RRF (Reciprocal Rank Fusion) is elegant: each result gets a score of 1 / (k + rank) from each retriever. Results appearing in both lists get combined scores, naturally surfacing the best matches.
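A small worked example helps show why documents found by both retrievers rise to the top. This sketch mirrors the fusion logic with the same constant of 60; the document ids are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=10, k_rrf=60):
    """Fuse ranked id lists; each list contributes 1 / (k_rrf + rank) per id."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k_rrf + rank)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:k]


dense = ["a", "b", "c"]   # hypothetical dense ranking
sparse = ["b", "d"]       # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

Here "b" wins because it collects a score from both lists, even though it is only second in the dense ranking. Only ranks are used, so the incompatible score scales of BM25 and cosine similarity never need to be normalized.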

HyDE: Hypothetical Document Embeddings

Query: "What is the capital of France?"

The problem: a short question like this, once embedded, can land far from the Wikipedia passage about Paris that answers it, so dense retrieval may miss it.

HyDE solution: Generate a hypothetical answer first, then embed that.

public float[] hydeEmbed(String query) {
    // Generate hypothetical answer
    String hypothetical = llm.generate(
        "Write a short factual answer to: " + query
    );

    // Embed the hypothetical answer instead of the query
    return embedder.embed(hypothetical);
}

Query: "What is the capital of France?"
Hypothetical: "The capital of France is Paris, located in northern France along the Seine River..."

Now the embedding actually matches relevant documents.

Impact: +8% recall@10 on my test set.

Atomic Index Persistence

The naive approach to saving the index:

// DANGEROUS — if the process dies here, the file is corrupted
try (FileOutputStream fos = new FileOutputStream("index.bin")) {
    serialize(index, fos);
}

The safe approach — write-to-tmp + rename (atomic on POSIX systems):

public void saveIndex() throws IOException {
    // Create the temp file next to the target: an atomic move only works
    // when source and target are on the same filesystem
    Path dir = indexPath.toAbsolutePath().getParent();
    Path tempFile = Files.createTempFile(dir, "index-", ".tmp");

    try {
        // Write to temp file
        try (ObjectOutputStream oos = new ObjectOutputStream(
            new BufferedOutputStream(Files.newOutputStream(tempFile))
        )) {
            oos.writeObject(this.nodes);
            oos.writeObject(this.entryPoint);
        }

        // Atomic rename: either succeeds completely or fails completely
        Files.move(tempFile, indexPath,
            StandardCopyOption.ATOMIC_MOVE,
            StandardCopyOption.REPLACE_EXISTING
        );
    } catch (Exception e) {
        Files.deleteIfExists(tempFile);
        throw e;
    }
}

ATOMIC_MOVE asks the filesystem for a single rename operation: it either completes or doesn't happen at all, provided the temp file and the target live on the same filesystem. No corrupted state.

Result: the index loads in <15ms on restart, and a crash mid-save leaves the previous index file intact. This is the same write-then-rename pattern LevelDB relies on.
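The same pattern in a few lines of Python, as a sketch under stated assumptions: the temp file is created in the target's directory so the rename never crosses filesystems, and an fsync is added before the rename so the bytes are on disk first. The function name and file names are hypothetical.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="index-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # replaces the target in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Without the fsync, the rename can reach the disk before the file contents do, so a power loss could still leave an empty or partial index behind the new name.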

Performance Results

Tested on 1,000 vectors (sentence embeddings, 384 dimensions):

Metric	Result
recall@10	0.984
Cold query latency	35ms
Cached query latency	<1ms
Index load time	<15ms
Index build time (1K vectors)	~200ms

The cold vs cached gap shows the LRU cache working: 35ms first query, sub-millisecond repeat queries.

Key Lessons

1. The probabilistic layer structure is brilliant.
O(log n) search complexity comes naturally from the exponential decay of upper layers. You don't need a balanced tree — randomness does the work.

2. ef and M are the critical parameters.

M: max connections per node. Higher = better recall, more memory.
efConstruction: search width during insertion. Higher = better index quality, slower build.
efSearch: search width at query time. Higher = better recall, slower queries.

3. Hybrid retrieval almost always beats pure dense retrieval.
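BM25 catches exact matches that dense embeddings miss. RRF needs almost no tuning: its only constant is k (60 here), and it works on ranks, so the two score scales never have to be reconciled.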

4. Atomic writes are non-negotiable for any persistent data structure.
Write-to-tmp + rename is the standard pattern — use it everywhere.

5. HyDE is underrated.
Generating a hypothetical answer before embedding significantly improves recall for factoid queries, at the cost of one extra LLM call per query.

Source Code

github.com/sameer-sde/vektr

I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.

Connect: LinkedIn · GitHub · Portfolio

Building a Distributed Rate Limiter That Handles 18,769 req/s with Redis Lua Scripts

Sameer Ahmed — Wed, 03 Jun 2026 10:59:44 +0000

How I implemented 4 atomic rate-limiting algorithms, consistent hashing across 3 Redis shards, and hit 18,769 req/s at p95 16ms — from scratch.

Why Rate Limiting is Hard

Rate limiting sounds simple: "allow X requests per Y seconds." But in a distributed system, it gets complicated fast.

The challenges:

Multiple server instances — how do you share state?
Race conditions — two requests arrive simultaneously, both check the counter, both see "limit not reached", both proceed. One too many.
Hotspots — all traffic to one Redis node? That's your bottleneck.
Algorithm choice — Token Bucket? Sliding Window? Fixed Window? Each has tradeoffs.

I built a distributed rate limiter that solves all of these.

The Four Algorithms

I implemented all four major rate limiting algorithms. Here's when to use each:

1. Fixed Window Counter

|----window----|----window----|
  100 requests    100 requests

Problem: A burst at the boundary (100 at the end of window 1 + 100 at the start of window 2 = 200 requests within about a second, twice the limit, even though each window only admitted 100).
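The boundary problem is easy to reproduce with a tiny in-memory counter. This is an illustrative sketch with a hypothetical class, a limit of 100 and a 60-second window; timestamps are passed in explicitly so the run is deterministic.

```python
class FixedWindowCounter:
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = {}  # window index -> requests admitted

    def allow(self, now):
        index = int(now // self.window)
        if self.counts.get(index, 0) >= self.limit:
            return False
        self.counts[index] = self.counts.get(index, 0) + 1
        return True


limiter = FixedWindowCounter(limit=100, window=60)
# 150 attempts in the last second of window 0, then 150 in the first second of window 1.
late = sum(limiter.allow(59.0 + i / 150) for i in range(150))
early = sum(limiter.allow(60.0 + i / 150) for i in range(150))
```

Both bursts are admitted up to the limit, so 200 requests pass within two seconds even though the configured rate is 100 per minute.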

2. Sliding Window Log

Track exact timestamps of each request. Most accurate, but memory-intensive — stores every request timestamp.

3. Sliding Window Counter

Approximate sliding window using two fixed windows. The money formula:

current_count = prev_window_count × (1 - elapsed/window_size) + curr_window_count

Best balance of accuracy and memory efficiency.
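A quick worked example of the formula, with made-up counts: 80 requests in the previous window, 30 so far in the current one, and 15 seconds elapsed in a 60-second window.

```python
def sliding_window_estimate(prev_count, curr_count, elapsed, window):
    """Weight the previous window by the share of it still inside the sliding window."""
    weight = 1.0 - elapsed / window
    return prev_count * weight + curr_count


estimate = sliding_window_estimate(prev_count=80, curr_count=30, elapsed=15, window=60)
at_boundary = sliding_window_estimate(prev_count=80, curr_count=30, elapsed=0, window=60)
```

With 15 seconds elapsed, 75% of the previous window still counts, giving 80 × 0.75 + 30 = 90. Right at the boundary the whole previous window counts, giving 110, which is how this approach blocks the boundary burst that a fixed window lets through.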

4. Token Bucket

Tokens refill at a constant rate. Allows controlled bursting — great for APIs where occasional spikes are acceptable.
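A minimal single-process sketch of the token bucket idea, assuming a capacity of 5 tokens refilled at 1 token per second. The class is hypothetical and takes the current time as an argument so the behavior is reproducible; a distributed version would keep `tokens` and `last` in Redis.

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.last = now

    def allow(self, now, cost=1.0):
        # Refill lazily based on the time since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=5, refill_rate=1.0)
burst = [bucket.allow(now=0.0) for _ in range(6)]
after_two_seconds = [bucket.allow(now=2.0) for _ in range(3)]
```

The first five requests drain the bucket and the sixth is rejected; two seconds later exactly two tokens have come back. Capacity controls the burst size and the refill rate controls the sustained throughput.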

The Key Insight: Atomic Operations with Redis Lua

The classic race condition:

Thread 1: GET counter → 99
Thread 2: GET counter → 99
Thread 1: INCR counter → 100 (allowed)
Thread 2: INCR counter → 101 (should be denied, wasn't)

The fix: do everything in a single atomic Redis operation.

Redis Lua scripts execute atomically — no other command can run between your script's operations.

Here's my sliding window counter in Lua:

local key = KEYS[1]
local prev_key = KEYS[2]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Get counts from both windows
local curr_count = tonumber(redis.call('GET', key) or 0)
local prev_count = tonumber(redis.call('GET', prev_key) or 0)

-- Calculate elapsed time in current window
local elapsed = now % window
local weight = 1 - (elapsed / window)

-- Approximate sliding window count
local sliding_count = math.floor(prev_count * weight) + curr_count

if sliding_count >= limit then
    -- Rate limited
    return {0, sliding_count, limit}
end

-- Increment and set expiry
redis.call('INCR', key)
redis.call('EXPIRE', key, window * 2)

return {1, sliding_count + 1, limit}

Calling it from Go:

var slidingWindowScript = redis.NewScript(`
    -- (lua script above)
`)

func (r *RateLimiter) IsAllowed(ctx context.Context, key string) (bool, error) {
    now := time.Now().Unix()
    windowStart := now - (now % r.windowSize)

    currKey := fmt.Sprintf("%s:%d", key, windowStart)
    prevKey := fmt.Sprintf("%s:%d", key, windowStart-r.windowSize)

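    result, err := slidingWindowScript.Run(ctx, r.client,
        []string{currKey, prevKey},
        r.limit, r.windowSize, now,
    ).Int64Slice()
    if err != nil {
        // result is nil on error, so indexing it would panic; deny the request instead
        return false, err
    }
    if len(result) == 0 {
        return false, fmt.Errorf("rate limiter: empty script result")
    }

    return result[0] == 1, nil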
}

Atomic. No race conditions. Ever.

Consistent Hashing Across 3 Redis Shards

One Redis node is a single point of failure and a throughput bottleneck. I shard across 3 Redis nodes using consistent hashing.

type ConsistentHashRing struct {
    nodes       []string
    ring        map[uint32]string
    sortedKeys  []uint32
    virtualNodes int
    mu          sync.RWMutex
}

func (r *ConsistentHashRing) AddNode(node string) {
    r.mu.Lock()
    defer r.mu.Unlock()

    // Add 150 virtual nodes per physical node
    for i := 0; i < r.virtualNodes; i++ {
        key := r.hash(fmt.Sprintf("%s-%d", node, i))
        r.ring[key] = node
        r.sortedKeys = append(r.sortedKeys, key)
    }
    sort.Slice(r.sortedKeys, func(i, j int) bool {
        return r.sortedKeys[i] < r.sortedKeys[j]
    })
    r.nodes = append(r.nodes, node)
}

func (r *ConsistentHashRing) GetNode(key string) string {
    r.mu.RLock()
    defer r.mu.RUnlock()

    hash := r.hash(key)

    // Binary search for the first node >= hash
    idx := sort.Search(len(r.sortedKeys), func(i int) bool {
        return r.sortedKeys[i] >= hash
    })

    if idx == len(r.sortedKeys) {
        idx = 0
    }

    return r.ring[r.sortedKeys[idx]]
}

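150 virtual nodes per physical node is the sweet spot: the virtual nodes even out how keys spread across shards, while consistent hashing itself keeps remapping small when nodes are added/removed (~20% of keys remapped instead of most of them, as with modulo hashing).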
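The remapping behavior can be measured with a short simulation. This sketch is an assumption-laden stand-in for the Go ring: it uses MD5 as the hash, hypothetical shard names, and 20,000 synthetic keys, then adds a fourth node and counts how many keys change owner.

```python
import bisect
import hashlib


def point(value):
    """Position on the ring; MD5 is used here only for its even spread."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, virtual_nodes=150):
        self.virtual_nodes = virtual_nodes
        self.ring = {}            # ring position -> physical node
        self.sorted_points = []
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.virtual_nodes):
            p = point(f"{node}-{i}")
            self.ring[p] = node
            bisect.insort(self.sorted_points, p)

    def get_node(self, key):
        # First ring position at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self.sorted_points, point(key))
        if idx == len(self.sorted_points):
            idx = 0
        return self.ring[self.sorted_points[idx]]


keys = [f"user:{i}" for i in range(20_000)]
ring = HashRing(["redis-a", "redis-b", "redis-c"])
before = {key: ring.get_node(key) for key in keys}
ring.add_node("redis-d")
moved = [key for key in keys if ring.get_node(key) != before[key]]
ring_fraction = len(moved) / len(keys)
# Same change with modulo hashing: 3 shards -> 4 shards.
modulo_fraction = sum(point(k) % 3 != point(k) % 4 for k in keys) / len(keys)
```

Going from 3 to 4 nodes, the ring should move roughly a quarter of the keys, and every moved key lands on the new node. Modulo hashing reshuffles about three quarters of them across all shards.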

Connection Pool Tuning

This single change boosted throughput by 61% and cut latency by 43%.

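Default go-redis pool: 10 connections per available CPU (10 × GOMAXPROCS) per client, which works out to 10 per shard on a single core.

// BEFORE: default
client := redis.NewClient(&redis.Options{
    Addr: addr,
})
// PoolSize defaults to 10 × GOMAXPROCS (10 on one core)
// Result: goroutines blocking waiting for connections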

// AFTER — tuned
client := redis.NewClient(&redis.Options{
    Addr:         addr,
    PoolSize:     50,              // 50 connections per shard
    MinIdleConns: 10,              // Keep 10 warm
    PoolTimeout:  2 * time.Second,
    ReadTimeout:  500 * time.Millisecond,
    WriteTimeout: 500 * time.Millisecond,
})

Why does this help? Under high concurrency, goroutines block waiting for an available connection from the pool. Increasing pool size reduces this contention dramatically.

But don't set it too high — each connection costs memory on the Redis server, and you can overwhelm it.

LRU Cache Layer

Before hitting Redis at all, I check an in-memory LRU cache:

type DecisionCache struct {
    cache    *lru.Cache
    ttl      time.Duration
    hits     prometheus.Counter
    misses   prometheus.Counter
}

func (c *DecisionCache) Get(key string) (Decision, bool) {
    val, ok := c.cache.Get(key)
    if !ok {
        c.misses.Inc()
        return Decision{}, false
    }

    entry := val.(*CacheEntry)
    if time.Since(entry.CreatedAt) > c.ttl {
        c.cache.Remove(key)
        c.misses.Inc()
        return Decision{}, false
    }

    c.hits.Inc()
    return entry.Decision, true
}

Result: 93.5% cache hit rate, saving ~200μs per decision.

The math: 93.5% of 18,769 req/s = 17,549 requests per second served from memory, leaving 1,220 per second that hit Redis.

Benchmark Results

Load tested with k6:

scenarios: 200 VUs, 60 seconds

✓ http_req_duration: avg=9.2ms p(95)=16ms
✓ http_req_failed:  0.00%
✓ iterations:       1,126,140

Throughput: 18,769 req/s

After connection pool tuning (10 → 50):

Throughput: +61% (11,658 → 18,769 req/s)
p95 latency: -43% (28ms → 16ms)

Key Lessons

1. Lua scripts are the simplest safe way to run multi-step atomic operations in Redis.
WATCH/MULTI/EXEC optimistic locking also works but adds retry logic and complexity. A Lua script runs as one atomic unit on the server.

2. Consistent hashing is worth the complexity.
Simple modulo hashing (hash(key) % N) remaps most keys when you add/remove a node. Consistent hashing remaps only about K/N keys, where K is the number of keys and N the number of nodes.

3. Connection pool size matters more than you think.
Profile your goroutine blocking before optimizing the algorithm. The bottleneck is often connection contention, not computation.

4. Cache the result, not just the data.
I cache the rate limiting decision, not the counter. Much simpler invalidation logic.

Source Code

github.com/sameer-sde/ratelimit

How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms

Sameer Ahmed — Wed, 03 Jun 2026 02:17:08 +0000

A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go.

The Problem

Fraud detection is a classic hard problem in systems design. You need to:

Classify transactions in real-time — users can't wait 100ms for a payment to go through
Handle massive throughput — payment systems process thousands of requests per second
Maintain high accuracy — false positives block legitimate transactions, false negatives let fraud through
Deploy without downtime — you can't take a payment system offline to update your model

I built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me.

The Architecture

Transaction Request
        │
        ▼
   Go HTTP Server
        │
        ▼
   LRU Cache ──── Cache Hit? ──► Return Result (< 1ms)
        │
     Cache Miss
        │
        ▼
   ONNX Runtime
        │
        ▼
  XGBoost Model
        │
        ▼
  Fraud Score + Decision
        │
        ▼
  Prometheus Metrics

The key insight: serve ML inference from Go, not Python.

Step 1: Training the Model

I trained XGBoost on the Kaggle Credit Card Fraud Dataset — 284,807 transactions, heavily imbalanced (only 0.17% fraud).

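import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc

# X, y: feature matrix and 0/1 fraud labels from the dataset (loading not shown).
# Stratify so the rare fraud class appears in both splits; the 80/20 split is illustrative.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Handle class imbalance: ratio of negatives to positives
scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    eval_metric='aucpr',
    early_stopping_rounds=20,  # constructor argument; no longer accepted by fit() in XGBoost 2.x
)

model.fit(X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)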

Results:

PR-AUC: 0.87
Recall: 86%
Precision: 74%

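Why PR-AUC over ROC-AUC? Because with heavily imbalanced datasets, ROC-AUC can look excellent even when most flagged transactions are legitimate: the false positive rate is diluted by the huge number of legitimate transactions. PR-AUC is built from precision and recall on the fraud class, so it punishes both missed fraud and false alarms.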
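Some back-of-the-envelope arithmetic shows the effect. The numbers below describe a hypothetical classifier, not the model above: 100,000 transactions at a 0.17% fraud rate, 86% recall, and a false positive rate of 1%, which would look like a very strong point on a ROC curve.

```python
def confusion_summary(total, fraud_rate, recall, false_positive_rate):
    """Expected true positives, false positives and precision at one operating point."""
    fraud = total * fraud_rate
    legit = total - fraud
    true_pos = fraud * recall
    false_pos = legit * false_positive_rate
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


tp, fp, precision = confusion_summary(
    total=100_000, fraud_rate=0.0017, recall=0.86, false_positive_rate=0.01
)
```

About 146 frauds are caught, but nearly 1,000 legitimate transactions are flagged with them, so precision is only around 13%. The ROC view hides this because 1% of a very large negative class is still a large number; the precision-recall view exposes it directly.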

Step 2: Exporting to ONNX

Here's where it gets interesting. Python inference is slow. I needed Go-level performance.

from onnxmltools import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType

# Export XGBoost → ONNX. skl2onnx's convert_sklearn has no built-in converter
# for XGBClassifier, so this uses the XGBoost converter from onnxmltools.
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_xgboost(model, initial_types=initial_type)

with open("fraud_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

ONNX is a universal model format. Once exported, I can serve it from any language — in this case, Go using the onnxruntime-go binding.

Step 3: Go HTTP Server with ONNX Inference

type InferenceEngine struct {
    session  *onnxruntime.Session
    mu       sync.RWMutex
}

func (e *InferenceEngine) Predict(features []float32) (float64, error) {
    e.mu.RLock()
    defer e.mu.RUnlock()

    input := onnxruntime.NewTensor(features)
    outputs, err := e.session.Run(input)
    if err != nil {
        return 0, err
    }

    score := outputs[0].GetData().([]float32)[0]
    return float64(score), nil
}

The Go HTTP handler:

func (h *Handler) PredictHandler(w http.ResponseWriter, r *http.Request) {
    var req TransactionRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", 400)
        return
    }

    // Check LRU cache first
    cacheKey := req.Hash()
    if cached, ok := h.cache.Get(cacheKey); ok {
        json.NewEncoder(w).Encode(cached)
        return
    }

    // Run ONNX inference
    features := req.ToFeatures()
    score, err := h.engine.Predict(features)
    if err != nil {
        http.Error(w, "inference error", 500)
        return
    }

    result := &PredictionResult{
        Score:     score,
        IsFraud:   score > 0.5,
        Timestamp: time.Now().UnixNano(),
    }

    h.cache.Set(cacheKey, result)
    json.NewEncoder(w).Encode(result)
}

Step 4: LRU Cache — The Secret Weapon

The LRU cache was the single biggest performance win.

type LRUCache struct {
    capacity int
    mu       sync.Mutex // Get reorders the list, so reads need exclusive access too
    items    map[string]*list.Element
    list     *list.List
}

func (c *LRUCache) Get(key string) (interface{}, bool) {
    // MoveToFront mutates the list; holding only a read lock here would be a data race
    c.mu.Lock()
    defer c.mu.Unlock()

    if elem, ok := c.items[key]; ok {
        c.list.MoveToFront(elem)
        return elem.Value.(*entry).value, true
    }
    return nil, false
}

Result: 99% cache hit rate, saving ~200μs per decision.

In a high-throughput system, 200μs × 71,000 = 14.2 seconds saved per second. That's the compounding power of caching.
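For readers who want the whole LRU contract in one place, here is a compact single-threaded sketch using Python's OrderedDict. It is an illustration of the eviction policy only, with a hypothetical class and tiny capacity, not a port of the Go cache.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, most recent last
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key not in self.items:
            self.misses += 1
            return None
        self.items.move_to_end(key)  # a read also changes the recency order
        self.hits += 1
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # "a" becomes the most recent entry
cache.set("c", 3)   # capacity exceeded, so "b" is evicted
```

Because every successful read reorders the entries, a concurrent implementation has to treat a lookup as a write to the cache's internal state.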

Step 5: Zero-Downtime Model Updates

The hardest part. How do you update an ML model without taking the server down?

type ModelManager struct {
    engine atomic.Pointer[InferenceEngine]
}

func (m *ModelManager) HotSwap(newModel []byte) error {
    newEngine, err := NewInferenceEngine(newModel)
    if err != nil {
        return err
    }

    // Atomic swap — zero downtime
    m.engine.Store(newEngine)
    return nil
}

func (m *ModelManager) GetEngine() *InferenceEngine {
    return m.engine.Load()
}

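atomic.Pointer from Go 1.19 makes this trivial. No locks on the swap. No downtime. Requests already running keep the old engine they loaded, and it becomes eligible for garbage collection once they finish; any native resources it holds may still need an explicit release.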

Step 6: A/B Traffic Splitting

Once you can hot-swap models, A/B testing becomes easy:

func (h *Handler) routeRequest(r *http.Request) *InferenceEngine {
    // Hash user ID for consistent routing
    hash := fnv32(r.Header.Get("X-User-ID"))
    if hash%100 < h.config.ModelBPercentage {
        return h.modelB.Load()
    }
    return h.modelA.Load()
}

This lets me gradually roll out new models — 5% → 20% → 50% → 100% — while monitoring Prometheus metrics for drift.
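The bucketing logic can be sketched independently of the server. This example assumes the 32-bit FNV-1a variant (the `fnv32` helper above may differ) and hypothetical user ids; it shows that routing is deterministic per user and that raising the percentage only ever moves users from A to B.

```python
def fnv1a32(data):
    """32-bit FNV-1a hash of a bytes object."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def route(user_id, model_b_percentage):
    """Return "B" for users whose hash bucket falls below the rollout percentage."""
    bucket = fnv1a32(user_id.encode("utf-8")) % 100
    return "B" if bucket < model_b_percentage else "A"


users = [f"user-{i}" for i in range(10_000)]
share_b = sum(route(u, 20) == "B" for u in users) / len(users)
# Users on model B at 5% stay on model B when the rollout grows to 20%.
sticky = all(route(u, 20) == "B" for u in users if route(u, 5) == "B")
```

Hashing the user id instead of picking at random per request means a given user always sees the same model, which keeps their fraud decisions consistent during a rollout.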

Step 7: Drift Detection

Model drift is when real-world data shifts away from your training distribution. I implemented lightweight drift detection:

func (d *DriftDetector) Check(features []float32) bool {
    var drift float64
    for i, f := range features {
        deviation := math.Abs(float64(f) - d.baseline[i].Mean)
        normalized := deviation / (d.baseline[i].Std + 1e-8)
        drift += normalized
    }
    drift /= float64(len(features))

    // Alert if drift exceeds threshold
    return drift > d.threshold // threshold: 5e-7
}
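The drift score is a mean absolute z-score, which is easier to reason about with concrete numbers. The sketch below uses made-up baseline statistics for three features and compares a request that matches the baseline with one shifted by three standard deviations.

```python
def drift_score(features, baseline_mean, baseline_std, eps=1e-8):
    """Mean absolute z-score of one feature vector against baseline statistics."""
    total = 0.0
    for value, mean, std in zip(features, baseline_mean, baseline_std):
        total += abs(value - mean) / (std + eps)
    return total / len(features)


mean = [0.0, 10.0, 100.0]   # hypothetical per-feature training means
std = [1.0, 2.0, 20.0]      # hypothetical per-feature training standard deviations
in_distribution = drift_score([0.0, 10.0, 100.0], mean, std)
shifted = drift_score([3.0, 16.0, 160.0], mean, std)
```

A request sitting exactly on the baseline scores 0, while a three-sigma shift scores about 3. For normally distributed features, ordinary in-distribution requests average around 0.8 on this scale, so a usable threshold has to be calibrated against held-out traffic, and averaging the score over a window of requests is less noisy than alerting on a single one.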

Benchmark Results

Load tested with k6 — 200 concurrent VUs, 60 second duration:

scenarios: (100.00%) 1 scenario, 200 max VUs
  default: 200 looping VUs for 60s

✓ http_req_duration.............: avg=4.2ms  p(95)=5.8ms
✓ http_req_failed...............: 0.00%
✓ iterations....................: 4,276,440
✓ vus...........................: 200

Throughput: 71,274 req/s

71,274 requests per second. p95 at 5.8ms. Zero errors across the 4,276,440 iterations of this 60-second run.

Key Lessons

1. Language choice matters for inference.
Python is great for training. Go is great for serving. ONNX bridges the gap — you get the best of both worlds.

2. Cache aggressively.
99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model.

3. Atomic operations > locks for hot paths.
atomic.Pointer for model swapping means zero contention on the critical path.

4. Design for deployability from day one.
Zero-downtime deploys aren't an afterthought — they're a core architectural requirement.

5. Monitor everything.
Prometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it.

What's Next

Implement online learning to update the model with new fraud patterns in real-time
Add feature store integration for richer transaction context
Experiment with transformer-based models for sequence modeling

Source Code

github.com/sameer-sde/sentinel
