You built a RAG pipeline. Works great in dev.
Six months later, your users complain: "The search results are garbage."
You haven't changed a line of code.
Here's what happened:
Your product evolved. New features, new docs, new support tickets. The data drifted — but your embedding index didn't.
Now you're serving a 400GB FAISS index that was last rebuilt in January. Your chunks are stale. Your nearest-neighbor results point to deprecated docs. Your LLM is confidently hallucinating from outdated context.
You need to fix this. Four engineers each propose a solution:
A) Scheduled full rebuild
Every Sunday, re-embed the entire corpus from scratch. Replace the index atomically. Slow (4h+ at scale), expensive, but always fresh.
B) Incremental upserts + soft delete
On every document change, re-embed only the affected chunks. Mark deleted chunks as tombstoned. Keep a version field on each vector. Index size grows over time; compact quarterly.
C) Embedding version registry + hot swap
Track which embedding model version produced each vector. When the model changes (fine-tuned or upgraded), invalidate the vectors produced by the old version and rebuild only those. Two indexes run in parallel during migration. Route traffic by model version.
D) Approximate staleness detection
Run a nightly job that samples 1% of your corpus, re-embeds it, and measures cosine distance against the stored vector. If drift exceeds a threshold, trigger a full rebuild. Otherwise, skip it. Cheap monitoring, reactive rebuilds.
Real constraint: your corpus is 50M chunks. Full rebuild = 4 hours + ~$800 in embedding API cost. You deploy model updates every 6 weeks.
Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.
Top comments (5)
C is the right answer — Embedding Version Registry + Hot Swap
This is the only option that handles the real problem: your embedding model changed. When you fine-tune or upgrade the model, Option B leaves a mixed index, with some vectors from model v1 and some from v2. Option A is consistent only after the next full rebuild; if queries switch to the new model before then, they are scored against vectors from the old one. Either way, the cosine similarity math breaks, because vectors from different embedding spaces are not comparable. C is the only option that tracks which model produced which vector and routes queries to the matching index version during migration. Two indexes in parallel during the transition means no downtime and no mixed-space comparisons.
The production win: you can migrate 10% of traffic to the new index, validate retrieval quality, then cut over. No big bang. No 4-hour Sunday rebuild blocking your team.
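A minimal sketch of that 10% cutover, assuming a hypothetical registry that maps each model version to its own index. The names `INDEX_BY_MODEL`, `embed-v1`, `embed-v2`, `bucket`, and `route` are made up for illustration and are not from any particular vector database. The idea is to hash the user id into a stable bucket so each user consistently hits one model/index pair, and the query is always embedded with the same model that built the index it searches.

```python
import hashlib
from typing import Tuple

# Hypothetical registry: embedding model version -> index built with that model.
INDEX_BY_MODEL = {"embed-v1": "index_v1", "embed-v2": "index_v2"}


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> Tuple[str, str]:
    """Return (model_version, index_name) to use for this user's queries.

    Users whose bucket falls below canary_percent go to the new pair;
    everyone else stays on the old pair. Model and index always match.
    """
    model = "embed-v2" if bucket(user_id) < canary_percent else "embed-v1"
    return model, INDEX_BY_MODEL[model]
```

Raising `canary_percent` from 10 to 100 completes the migration without moving any user back and forth between indexes, since buckets are stable across calls.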
B is the senior engineer trap
Incremental upserts sound smart: you re-embed only what changed. But here's the silent killer: model drift. You update your embedding model every 6 weeks, so after six months your index holds vectors from several model versions. One chunk was embedded with text-embedding-ada-002, another with your fine-tuned v2. Nearest-neighbor search across a mixed-space index returns garbage, not because the data is stale, but because the geometry is inconsistent. B handles data drift. It doesn't handle model drift. At 6-week update cycles, model drift is your actual problem.
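To make the failure mode concrete, here is a rough in-memory sketch of Option B with a check for mixed model versions. `StoredVector`, `IncrementalStore`, and their fields are hypothetical names, and a plain dict stands in for the real ANN index (a real index would typically append the new vector and tombstone the old one rather than overwrite in place).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StoredVector:
    chunk_id: str
    vector: List[float]
    model_version: str      # which embedding model produced this vector
    doc_version: int        # bumped on every re-embed of the chunk
    tombstoned: bool = False


class IncrementalStore:
    """Toy stand-in for an index that supports upserts and soft deletes."""

    def __init__(self) -> None:
        self.rows: Dict[str, StoredVector] = {}

    def upsert(self, chunk_id: str, vector: List[float], model_version: str) -> None:
        prev = self.rows.get(chunk_id)
        doc_version = prev.doc_version + 1 if prev else 1
        self.rows[chunk_id] = StoredVector(chunk_id, vector, model_version, doc_version)

    def soft_delete(self, chunk_id: str) -> None:
        # The row stays in storage until compaction; it is only hidden.
        if chunk_id in self.rows:
            self.rows[chunk_id].tombstoned = True

    def live_model_versions(self) -> Dict[str, int]:
        """Count live (non-tombstoned) vectors per embedding model version."""
        counts: Dict[str, int] = {}
        for row in self.rows.values():
            if not row.tombstoned:
                counts[row.model_version] = counts.get(row.model_version, 0) + 1
        return counts

    def is_mixed_space(self) -> bool:
        return len(self.live_model_versions()) > 1
```

If `is_mixed_space()` returns `True`, nearest-neighbor scores are being computed across embedding spaces, which is exactly the situation incremental upserts drift into after a model update.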
A is brute-force ops debt
$800 every Sunday whether you need it or not. At 6-week model updates, most rebuilds are wasted spend on data that hasn't meaningfully changed. It works. It's just expensive and blunt. Fine for a startup at 1M chunks. Painful at 50M.
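The back-of-the-envelope math, using only the numbers from the post ($800 per rebuild, a 6-week model cadence). This compares weekly rebuilds against rebuilding only when the model changes; it deliberately ignores the cost of re-embedding changed documents between rebuilds, which you would pay in any non-weekly scheme.

```python
REBUILD_COST_USD = 800            # from the post: ~$800 per full rebuild
WEEKS_PER_YEAR = 52
MODEL_UPDATE_INTERVAL_WEEKS = 6   # from the post: model updates every 6 weeks

# Option A: one full rebuild every Sunday.
weekly_rebuilds = WEEKS_PER_YEAR
weekly_cost = weekly_rebuilds * REBUILD_COST_USD

# Rebuilding only on model updates: full 6-week cycles that fit in a year.
model_driven_rebuilds = WEEKS_PER_YEAR // MODEL_UPDATE_INTERVAL_WEEKS
model_driven_cost = model_driven_rebuilds * REBUILD_COST_USD

print(weekly_cost)        # 41600
print(model_driven_cost)  # 6400
```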
D is monitoring, not a fix
Staleness detection is genuinely useful — but as a signal, not a strategy. The 1% sample tells you that you have drift, not which vectors are affected or why. You still need a rebuild strategy. D is a trigger, not an architecture. Combine it with C for a complete system.
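A rough sketch of what the nightly check could look like, assuming a hypothetical `embed` callable that wraps your current embedding model, plus two dicts keyed by chunk id (`stored` vectors and source `texts`). The 0.05 threshold is an arbitrary placeholder, not a recommended value. Note what comes back: distances for the sampled chunks only, which is why this works as a trigger and not as a repair plan.

```python
import math
import random
from typing import Callable, Dict, List


def cosine_distance(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally drifted
    return 1.0 - dot / (norm_a * norm_b)


def sample_drift(
    stored: Dict[str, List[float]],
    texts: Dict[str, str],
    embed: Callable[[str], List[float]],
    sample_rate: float = 0.01,
    seed: int = 0,
) -> Dict[str, float]:
    """Re-embed a random sample and return chunk_id -> cosine distance."""
    ids = sorted(stored)
    if not ids:
        return {}
    k = max(1, int(len(ids) * sample_rate))
    sampled = random.Random(seed).sample(ids, k)
    return {cid: cosine_distance(stored[cid], embed(texts[cid])) for cid in sampled}


def should_rebuild(distances: Dict[str, float], threshold: float = 0.05) -> bool:
    """Trigger a rebuild when the mean sampled distance exceeds the threshold."""
    if not distances:
        return False
    return sum(distances.values()) / len(distances) > threshold
```

Pairing this with the version registry from C gives the trigger something targeted to act on: a drift alarm says "something moved", and the registry says which model version's vectors to rebuild.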
Really nice breakdown — I like how you’re framing the API ecosystem as a stack rather than a collection of isolated tools. That mental model alone helps reduce a lot of “integration confusion” developers hit in production.
One layer I think is becoming increasingly important (and often missing in these stacks) is API runtime governance:
request-level observability (not just logs, but end-to-end trace context across services)
policy enforcement (auth, rate limits, schema validation) as a shared middleware layer rather than per-service logic
contract drift detection between services (especially in fast-moving teams)
and cost visibility per endpoint (which is becoming critical with LLM + external API usage)
Also interesting to think about how this stack evolves when you introduce agentic systems calling APIs, where APIs are no longer just integrations but tools in execution loops. That shifts requirements toward stricter idempotency guarantees and richer metadata per endpoint.
Curious if you see API gateways evolving into more of an “execution control plane” in 2026 rather than just traffic routing.