Joud Awad

Posted on Jun 29

54/60 Days System Design Questions

#abotwrotethis #ai #rag #database

You built a RAG pipeline. Works great in dev.

6 months later, your users complain: "The search results are garbage."

You haven't changed a line of code.

Here's what happened:

Your product evolved. New features, new docs, new support tickets. The data drifted — but your embedding index didn't.

Now you're serving a 400GB FAISS index that was last rebuilt in January. Your chunks are stale. Your nearest-neighbor results point to deprecated docs. Your LLM is confidently hallucinating from outdated context.

You need to fix this. 4 engineers each propose a solution:

A) Scheduled full rebuild
Every Sunday, re-embed the entire corpus from scratch. Replace the index atomically. Slow (4h+ at scale), expensive, but always fresh.
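
A minimal sketch of the "rebuild off to the side, then swap" idea, assuming the index can be modeled as a plain dict. `IndexHolder`, `full_rebuild`, and `toy_embed` are hypothetical names for illustration, not FAISS calls.

```python
import threading


class IndexHolder:
    """Holds the index that queries read from; swap() replaces it in one step."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._index

    def swap(self, new_index):
        # Readers see either the old index or the new one, never a half-built mix.
        with self._lock:
            old, self._index = self._index, new_index
        return old


def full_rebuild(corpus, embed):
    """Re-embed every chunk from scratch into a brand-new index (a dict here)."""
    return {chunk_id: embed(text) for chunk_id, text in corpus.items()}


def toy_embed(text):
    # Placeholder for a real embedding call (assumption for this sketch).
    return [float(len(text)), float(text.count(" "))]


corpus = {"doc-1": "old pricing page", "doc-2": "setup guide"}
holder = IndexHolder(full_rebuild(corpus, toy_embed))

corpus["doc-1"] = "new pricing page with tiers"  # the source document changed
fresh = full_rebuild(corpus, toy_embed)           # built while the old index keeps serving
previous = holder.swap(fresh)                     # single cutover step
```

The old index keeps serving queries for the whole rebuild, which is why the 4-hour build time costs money but not availability.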

B) Incremental upserts + soft delete
On every document change, re-embed only the affected chunks. Mark deleted chunks as tombstoned. Keep a version field on each vector. Index size grows over time; compact quarterly.
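
The upsert, tombstone, and compaction flow could look roughly like this. It is a toy in-memory sketch: `IncrementalIndex`, `VectorRecord`, and `fake_embed` are made-up names, and a real vector store would expose its own upsert and delete calls.

```python
from dataclasses import dataclass


@dataclass
class VectorRecord:
    chunk_id: str
    vector: list
    version: int              # bumped every time the chunk is re-embedded
    tombstoned: bool = False  # soft delete: hidden from search, still stored


class IncrementalIndex:
    """Toy stand-in for a vector store (hypothetical, not a real library API)."""

    def __init__(self, embed):
        self.embed = embed    # callable: text -> list of floats
        self.records = {}     # chunk_id -> VectorRecord

    def upsert(self, chunk_id, text):
        prev = self.records.get(chunk_id)
        version = prev.version + 1 if prev else 1
        self.records[chunk_id] = VectorRecord(chunk_id, self.embed(text), version)

    def soft_delete(self, chunk_id):
        if chunk_id in self.records:
            self.records[chunk_id].tombstoned = True

    def live(self):
        """Records that search is allowed to return."""
        return [r for r in self.records.values() if not r.tombstoned]

    def compact(self):
        """Physically drop tombstoned records; returns how many were removed."""
        before = len(self.records)
        self.records = {k: r for k, r in self.records.items() if not r.tombstoned}
        return before - len(self.records)


def fake_embed(text):
    # Placeholder embedding (assumption for this sketch).
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


idx = IncrementalIndex(fake_embed)
idx.upsert("a", "hello")
idx.upsert("b", "world")
idx.upsert("a", "hello again")  # only chunk "a" is re-embedded
idx.soft_delete("b")            # "b" stays in storage until compaction
```

Until `compact()` runs, tombstoned records still occupy space, which is the growth the option describes.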

C) Embedding version registry + hot swap
Track which embedding model version produced each vector. When the model changes (fine-tuned or upgraded), invalidate the vectors produced by older versions and rebuild only those. Two indexes run in parallel during migration. Route traffic by model version.
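
One way to picture the registry, as a sketch under simple assumptions: each model version owns its own index, and a query is only ever searched against the index built by the model that embedded it. `EmbeddingRegistry` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class IndexHandle:
    model_version: str
    vectors: dict = field(default_factory=dict)  # chunk_id -> vector


class EmbeddingRegistry:
    """Tracks one index per embedding model version (illustrative only)."""

    def __init__(self):
        self.indexes = {}   # model_version -> IndexHandle
        self.active = None  # version that serves traffic by default

    def register(self, model_version):
        self.indexes[model_version] = IndexHandle(model_version)
        if self.active is None:
            self.active = model_version

    def add_vector(self, model_version, chunk_id, vector):
        self.indexes[model_version].vectors[chunk_id] = vector

    def stale_chunks(self, target_version, all_chunk_ids):
        """Chunks that still lack a vector from target_version."""
        have = self.indexes[target_version].vectors
        return [c for c in all_chunk_ids if c not in have]

    def route(self, query_model_version):
        """Return the index built with the model that embedded the query."""
        if query_model_version not in self.indexes:
            raise KeyError(f"no index for model version {query_model_version!r}")
        return self.indexes[query_model_version]

    def promote(self, model_version):
        """Hot swap: make a fully migrated version the default."""
        self.active = model_version


reg = EmbeddingRegistry()
reg.register("v1")
reg.register("v2")
reg.add_vector("v1", "c1", [1.0, 0.0])
reg.add_vector("v1", "c2", [0.0, 1.0])
reg.add_vector("v2", "c1", [0.6, 0.8])            # migration in progress
todo = reg.stale_chunks("v2", ["c1", "c2"])        # what still needs re-embedding
```

Because vectors never move between indexes, a query embedded with one model version is never compared against vectors from another.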

D) Approximate staleness detection
Run a nightly job that samples 1% of your corpus, re-embeds it, and measures cosine distance against the stored vector. If drift exceeds a threshold, trigger a full rebuild. Otherwise, skip it. Cheap monitoring, reactive rebuilds.
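
A rough sketch of the nightly sampling job. The 0.05 threshold, the function names, and the toy embedding functions are assumptions made for illustration; a real threshold would have to be tuned against retrieval quality.

```python
import math
import random


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def sample_drift(stored, texts, embed, sample_rate=0.01, seed=0):
    """Re-embed a random sample and return the mean cosine distance to the stored vectors."""
    rng = random.Random(seed)
    ids = sorted(stored)
    k = max(1, int(len(ids) * sample_rate))
    sample = rng.sample(ids, k)
    distances = [cosine_distance(stored[i], embed(texts[i])) for i in sample]
    return sum(distances) / len(distances)


def needs_rebuild(mean_drift, threshold=0.05):
    # Threshold is a hypothetical value, not taken from the post.
    return mean_drift > threshold


def embed_old(text):
    # Toy stand-in for the model that built the index.
    return [len(text) + 1.0, 1.0]


def embed_new(text):
    # Toy stand-in for an upgraded model; orthogonal to embed_old by construction.
    return [1.0, -(len(text) + 1.0)]


texts = {f"chunk-{i}": "x" * (i % 7) for i in range(500)}
stored = {cid: embed_old(t) for cid, t in texts.items()}

drift_same_model = sample_drift(stored, texts, embed_old)  # close to 0
drift_new_model = sample_drift(stored, texts, embed_new)   # close to 1
```

The job only re-embeds the sample, so at 1% of the corpus it costs a small fraction of a full rebuild.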

Real constraint: your corpus is 50M chunks. Full rebuild = 4 hours + ~$800 in embedding API cost. You deploy model updates every 6 weeks.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

#30DaysOfSystemDesign #SystemDesign #MachineLearning #MLEngineering

Top comments (7)

Luis Cruz • Jun 29

Really nice breakdown — I like how you’re framing the API ecosystem as a stack rather than a collection of isolated tools. That mental model alone helps reduce a lot of “integration confusion” developers hit in production.

One layer I think is becoming increasingly important (and often missing in these stacks) is API runtime governance:

request-level observability (not just logs, but end-to-end trace context across services)
policy enforcement (auth, rate limits, schema validation) as a shared middleware layer rather than per-service logic
contract drift detection between services (especially in fast-moving teams)
and cost visibility per endpoint (which is becoming critical with LLM + external API usage)

Also interesting to think about how this stack evolves when you introduce agentic systems calling APIs, where APIs are no longer just integrations but tools in execution loops. That shifts requirements toward stricter idempotency guarantees and richer metadata per endpoint.

Curious if you see API gateways evolving into more of an “execution control plane” in 2026 rather than just traffic routing.

Joud Awad • Jun 30

You're hitting a great point.

The shift to agentic systems changes the API contract. When a human hits a flaky 503, they retry or move on. When an LLM agent hits the same 503 inside a 12-step loop, it cascades. The agent retries without context, double-charges, or hallucinates around the failure.

So yes, gateways are turning into execution control planes. The bigger shift is the middle layer collapsing. Today the stack is gateway → service mesh → service → LLM observability. For agentic workloads, that doesn't hold. You need one layer that owns:

Idempotency keys at the request level, not the service
Schema validation LLMs respect (not OpenAPI specs hoping for the best)
Cost attribution per agent run, not per endpoint
Tool metadata living next to the route, not in a separate registry
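
As a sketch of the first item, request-level idempotency can be approximated by deriving a key from the agent run, the step, and the payload, then replaying the stored result on retries. `IdempotentGateway` and `charge` are hypothetical names, and a production version would need a shared, expiring store rather than a dict.

```python
import hashlib
import json


class IdempotentGateway:
    """Replays the stored result when the same request is retried (illustrative)."""

    def __init__(self, handler):
        self.handler = handler
        self.results = {}  # idempotency key -> stored result

    @staticmethod
    def key_for(agent_run_id, step, payload):
        body = json.dumps(payload, sort_keys=True)
        raw = f"{agent_run_id}:{step}:{body}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def call(self, agent_run_id, step, payload):
        key = self.key_for(agent_run_id, step, payload)
        if key in self.results:
            return self.results[key]  # retry: no second side effect
        result = self.handler(payload)
        self.results[key] = result
        return result


charges = []


def charge(payload):
    # Stand-in for a side-effecting tool call such as a payment.
    charges.append(payload["amount"])
    return {"status": "charged", "amount": payload["amount"]}


gateway = IdempotentGateway(charge)
first = gateway.call("run-7", 3, {"amount": 1999})
retry = gateway.call("run-7", 3, {"amount": 1999})  # agent retried the same step
```

The retry returns the original result without charging twice, which is the failure mode the double-charge example describes.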

Zuplo and Hookdeck are pushing here. Kong and Apigee are slower because their abstraction was built for human-driven traffic.

Contract drift hits closest to home for me. We had a service rename a field from customer_id to account_id last quarter. The agent didn't throw an error. It just stopped using that tool. Three days of silent degradation before anyone noticed. That's the new "5xx rate going up" for agent stacks.
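
A check along these lines would have flagged that rename on the first response. It is a minimal sketch: `contract_drift` is a hypothetical helper, and the `amount` field is invented for the example.

```python
def contract_drift(expected_fields, response):
    """Compare the fields a tool is expected to return with what it actually returned."""
    actual = set(response)
    return {
        "missing": sorted(expected_fields - actual),
        "unexpected": sorted(actual - expected_fields),
    }


expected = {"customer_id", "amount"}                 # what the agent's tool schema says
response = {"account_id": "a-42", "amount": 1999}    # what the service now returns
report = contract_drift(expected, response)
```

Alerting on a non-empty `missing` list turns the silent degradation into an explicit signal.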

And the link back to the original post: versioned embeddings are the same problem one layer down. If your tools can drift, your retrieval layer can drift. Both need version-aware routing. The control plane in 2026 isn't routing traffic. It's routing model versions, tool versions, and schema versions through the same execution graph.

Joud Awad • Jun 29

B is the senior engineer trap

Incremental upserts sound smart: you only re-embed what changed. The silent killer is model drift. You update your embedding model every 6 weeks, so after six months the index can hold vectors from around four different model versions. Chunk A was embedded with text-embedding-ada-002, Chunk B with your fine-tuned v2. Nearest-neighbor search across a mixed-space index returns garbage, not because the data is stale but because the geometry is inconsistent. B handles data drift. It doesn't handle model drift. At 6-week update cycles, model drift is your actual problem.
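
A toy illustration of the mixed-space problem, assuming 2-D vectors and modeling the "new model" as a 90-degree rotation of the old one. Real model upgrades are not simple rotations; the point is only that coordinates from two models are not comparable even when the text is identical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def embed_v1(text):
    # Hypothetical 2-D "embedding" for illustration.
    return [float(len(text)), float(text.count("e")) + 1.0]


def embed_v2(text):
    # Same information, different coordinate system: v1 rotated by 90 degrees.
    x, y = embed_v1(text)
    return [-y, x]


doc = "reset your password"
other = "delete account"

same_text_across_models = cosine(embed_v1(doc), embed_v2(doc))  # 0.0: looks unrelated
within_v1 = cosine(embed_v1(doc), embed_v1(other))
within_v2 = cosine(embed_v2(doc), embed_v2(other))              # matches within_v1
```

Each space is internally consistent, but the same sentence scores as unrelated to itself across the two, which is what a mixed index does to nearest-neighbor search.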

Joud Awad • Jun 29

A is brute-force ops debt

$800 every Sunday whether you need it or not. At 6-week model updates, most rebuilds are wasted spend on data that hasn't meaningfully changed. It works. It's just expensive and blunt. Fine for a startup at 1M chunks. Painful at 50M.
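
The back-of-the-envelope math, using only the numbers from the post. It assumes 52 weeks per year and ignores the cost of re-embedding documents that change between model updates.

```python
REBUILD_COST_USD = 800            # per full rebuild, from the post
WEEKS_PER_YEAR = 52
MODEL_UPDATE_INTERVAL_WEEKS = 6   # model update cadence, from the post

# Option A: one full rebuild every Sunday.
weekly_rebuild_cost = WEEKS_PER_YEAR * REBUILD_COST_USD

# Rebuilding only when the model changes.
model_updates_per_year = WEEKS_PER_YEAR / MODEL_UPDATE_INTERVAL_WEEKS
on_update_rebuild_cost = model_updates_per_year * REBUILD_COST_USD

print(f"Weekly rebuilds:    ${weekly_rebuild_cost:,.0f} per year")
print(f"Rebuild on update:  ${on_update_rebuild_cost:,.0f} per year")
```

That works out to $41,600 per year for weekly rebuilds against roughly $6,933 for rebuilding only on model updates.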

Joud Awad • Jun 29

C is the right answer — Embedding Version Registry + Hot Swap

This is the only option that targets the real problem: your embedding model changed. When you fine-tune or upgrade the model, Option B leaves you serving a mixed index, with some vectors from model v1 and some from v2. Cosine similarity between them is meaningless because they come from different embedding spaces. Option A avoids the mix only by paying for a full re-embed every week and cutting over in one step. C tracks which model produced which vector and routes each query to the index built with the matching model. Running two indexes in parallel during the transition means no downtime and no mixed-space comparisons.

The production win: you can migrate 10% of traffic to the new index, validate retrieval quality, then cut over. No big bang. No 4-hour Sunday rebuild blocking your team.
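
The 10% split could be done with deterministic hash bucketing, sketched below. `choose_index` and the request ID format are assumptions; whichever index is chosen, the query itself must be embedded with that index's model version.

```python
import hashlib


def bucket(key):
    """Map a key to a stable bucket in the range 0..99."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100


def choose_index(request_id, canary_percent=10):
    """Send a fixed share of traffic to the new index; the rest stays on the old one."""
    return "v2" if bucket(request_id) < canary_percent else "v1"


sample_ids = [f"req-{i}" for i in range(10_000)]
v2_share = sum(choose_index(r) == "v2" for r in sample_ids) / len(sample_ids)
```

Hashing the ID instead of picking at random keeps each request on the same index across retries, which makes retrieval quality comparisons between v1 and v2 cleaner. Raising `canary_percent` to 100 completes the cutover.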

Joud Awad • Jun 29

D is monitoring, not a fix

Staleness detection is genuinely useful — but as a signal, not a strategy. The 1% sample tells you that you have drift, not which vectors are affected or why. You still need a rebuild strategy. D is a trigger, not an architecture. Combine it with C for a complete system.
