- Me: xgabriel.com | GitHub
A better embedding model dropped. Your corpus is 40M chunks. The re-index quote is six figures and two weeks of downtime. So you don't upgrade. Your recall stays where it was.
This post is the runbook a team I talked to last month used to ship text-embedding-3-large over a text-embedding-ada-002 index without a stop-the-world job. Same pattern works for BGE-large to BGE-M3, or Cohere v3 to v4. It's a database migration. The vector store is the database. The schema change happens to be 1536→3072 dimensions and cosine→dot product.
Why a full re-index is the wrong default
Re-indexing 40M chunks costs money in three places. First, the embedding API bill: 40M chunks averaging 300 tokens is 12 billion tokens, which comes to roughly $10K for inference alone at a rate a little under $1 per million tokens. Second, the compute to drive the API in parallel without tripping rate limits. Third, the vector-store ingestion cost, which is often the worst of the three because most managed stores (Pinecone, Weaviate Cloud, Qdrant Cloud) bill per write op and your IOPS budget is not infinite.
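The inference line is easy to sanity-check before you ask anyone for a quote. A minimal sketch of the arithmetic; the function name is made up for illustration, and the price argument is a placeholder, not any provider's actual rate.

```python
def embedding_cost_usd(num_chunks: int, avg_tokens_per_chunk: int,
                       usd_per_million_tokens: float) -> float:
    """Inference cost only: no retries, no compute, no vector-store writes."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 40M chunks at ~300 tokens each is 12 billion tokens.
total_tokens = 40_000_000 * 300

# Placeholder rate of $1.00 per million tokens; substitute your provider's price.
cost_at_one_dollar = embedding_cost_usd(40_000_000, 300, 1.00)
```

Every $1 per million tokens of list price adds $12K at this corpus size, so the inference bill scales linearly with whatever rate you plug in.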
The hidden cost is recall regression you don't catch until users complain. The new model is "better" on MTEB. Your corpus isn't MTEB. You moved one of your top-3 customers to a worse retrieval pipeline because the leaderboard told you to.
A team running a legal-document RAG saw nDCG@10 drop by 11% on their domain when they swapped to a model that beat the old one on MTEB by 4 points. They rolled back over a weekend. Cost of the failed migration wasn't the embedding bill. It was the trust hit with the customer who noticed.
The pattern below treats the upgrade like a zero-downtime DB migration. Dual-write. Shadow. A/B. Slice cutover. If recall regresses on any slice, you flip that slice back without touching the rest.
The four-step pattern in one diagram
Step 1: DUAL-WRITE   Step 2: SHADOW        Step 3: A/B                 Step 4: CUTOVER
ingestion            backfill old corpus   route % of queries          per-slice flip
                     into shadow index     against shadow

doc -> [old]         doc(old) -> [old]     q -> [old] -> served        slice A: 100% new
    -> [new]                  -> [new]       -> [new] -> compared      slice B: 100% new
                                                                       slice C: 0% (rolled back)
Step 1 means every new write goes to both indexes. Step 2 means the historical corpus gets backfilled into the new index, throttled. Step 3 sends a fraction of read traffic to the new index and records per-query recall against the old. Step 4 cuts production reads over slice by slice. If a slice misbehaves, you flip the routing flag for that slice and keep going on the rest.
The total elapsed time is dominated by step 2. But step 2 runs in the background at the throttle rate your wallet can afford. Production isn't gated on it.
Step 1: Dual-write new embeddings on ingestion
Every new chunk gets embedded twice. The old model writes to the live index. The new model writes to the shadow index. Same chunk ID on both sides.
import asyncio
import logging
from dataclasses import dataclass

log = logging.getLogger(__name__)


@dataclass
class Chunk:
    chunk_id: str
    text: str
    tenant_id: str
    doc_type: str  # contract, ticket, email...


class DualWriter:
    def __init__(self, old_index, new_index, old_embed, new_embed):
        self.old_index = old_index
        self.new_index = new_index
        self.old_embed = old_embed
        self.new_embed = new_embed

    async def write(self, chunk: Chunk) -> None:
        metadata = {"tenant": chunk.tenant_id, "doc_type": chunk.doc_type}
        # embed in parallel. the new model is usually slower.
        # return_exceptions=True keeps a new-model failure from failing the write.
        old_vec, new_vec = await asyncio.gather(
            self.old_embed(chunk.text),
            self.new_embed(chunk.text),
            return_exceptions=True,
        )
        if isinstance(old_vec, BaseException):
            raise old_vec  # the old path is production: fail loudly
        # old index write is the source of truth
        await self.old_index.upsert(
            id=chunk.chunk_id, vector=old_vec, metadata=metadata,
        )
        # new index write is best-effort during dual-write phase
        try:
            if isinstance(new_vec, BaseException):
                raise new_vec
            await self.new_index.upsert(
                id=chunk.chunk_id, vector=new_vec, metadata=metadata,
            )
        except Exception as e:
            # log and move on. the backfill job will catch this id later.
            self._log_shadow_miss(chunk.chunk_id, e)

    def _log_shadow_miss(self, chunk_id: str, err: Exception) -> None:
        log.warning("shadow write missed for %s: %r", chunk_id, err)
Two things matter here. The old index write is awaited and must succeed. That's still production. The new index write is wrapped because you don't want a shadow-index outage to take ingestion down. Missed writes are reconciled by the backfill in step 2, which already has to be idempotent.
Your embedding API budget doubles during dual-write. Plan for it. The window is typically two to four weeks for a multi-tenant SaaS.
Step 2: Backfill into a shadow index, throttled, idempotent, resumable
The backfill walks the existing corpus and writes everything to the shadow index. Three properties matter: throttled (so you can cap spend), idempotent (so a crash is recoverable), and resumable (so you can stop and start without re-doing work).
from typing import Iterator

BATCH_SIZE = 64


class Backfill:
    def __init__(self, source_db, new_index, new_embed,
                 cursor_store, rps_limit=20):
        self.source_db = source_db
        self.new_index = new_index
        self.new_embed = new_embed
        self.cursor_store = cursor_store  # k/v: cursor -> last_chunk_id
        self.rps_limit = rps_limit

    def _iter_unprocessed(self) -> Iterator[Chunk]:
        last_id = self.cursor_store.get("backfill_cursor") or ""
        # ORDER BY chunk_id is critical for resumability
        return self.source_db.iter_chunks(after_id=last_id)

    async def run(self) -> None:
        # TokenBucket is assumed: any async rate limiter exposing acquire(n)
        bucket = TokenBucket(self.rps_limit)
        batch: list[Chunk] = []
        for chunk in self._iter_unprocessed():
            batch.append(chunk)
            if len(batch) < BATCH_SIZE:
                continue
            await self._flush(bucket, batch)
            batch = []
        if batch:
            # the final partial batch gets the same throttle and cursor update
            await self._flush(bucket, batch)

    async def _flush(self, bucket, batch: list[Chunk]) -> None:
        await bucket.acquire(len(batch))
        await self._process_batch(batch)
        # persist cursor AFTER successful write
        self.cursor_store.set("backfill_cursor", batch[-1].chunk_id)

    async def _process_batch(self, batch: list[Chunk]) -> None:
        texts = [c.text for c in batch]
        vectors = await self.new_embed.batch(texts)
        # upsert by id (idempotent on retry)
        await self.new_index.upsert_many([
            {"id": c.chunk_id, "vector": v,
             "metadata": {"tenant": c.tenant_id, "doc_type": c.doc_type}}
            for c, v in zip(batch, vectors)
        ])
The cursor persists after a successful write, not before. That's what makes it crash-safe. If the process dies mid-batch, the next run re-does that batch. Idempotent upserts make the re-do harmless.
The throttle matters more than you think. A 40M-chunk backfill at 200 chunks per second runs for about 56 hours. The binding constraint is usually not the embedding bill but IOPS pressure on the live index, which shares hardware with the shadow index in some managed offerings (Pinecone serverless splits cleanly; Weaviate single-cluster does not). Test the throttle against your live p99 query latency before turning it up.
One subtle bug: the cursor only moves forward, so if chunk_id isn't a stable, strictly increasing sort key (random UUIDs, for example), a chunk inserted while the backfill runs can sort before the cursor and be skipped. Either make chunk_id a strictly ordered, indexed key or turn on dual-write (step 1) before you start the backfill, which guarantees new chunks land in the shadow index regardless.
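The backfill code calls a TokenBucket that the post never defines. Here is one minimal way to write it, as a sketch rather than the team's actual limiter: the class body, the capacity parameter, and the choice to let the balance go negative (so a 64-chunk batch can be acquired against a smaller bucket) are all assumptions.

```python
import asyncio
import time
from typing import Optional


class TokenBucket:
    """Paces callers to `rate` tokens per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity if capacity is not None else rate
        self.tokens = self.capacity
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, n: int = 1) -> None:
        async with self._lock:
            now = time.monotonic()
            # refill for the time elapsed since the last call, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            # spend the tokens; a negative balance is debt paid off by sleeping
            self.tokens -= n
            if self.tokens < 0:
                await asyncio.sleep(-self.tokens / self.rate)
```

Holding the lock while sleeping serializes concurrent callers, which is what you want for a single backfill worker. If you run several workers, the limit has to live somewhere shared (the vector store's own quota, or a central limiter), not in process memory.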
Step 3: A/B route queries with eval-as-gating
Now both indexes have the corpus. Production reads still hit the old one. You start sending a fraction of queries to the new one in parallel and record the result.
import asyncio
import random


class ABQueryRouter:
    def __init__(self, old_index, new_index, old_embed, new_embed,
                 eval_logger, shadow_fraction=0.10):
        self.old_index = old_index
        self.new_index = new_index
        self.old_embed = old_embed
        self.new_embed = new_embed
        self.eval_logger = eval_logger
        self.shadow_fraction = shadow_fraction
        self._shadow_tasks: set = set()

    async def query(self, q_text: str, tenant_id: str, k: int = 10):
        old_vec = await self.old_embed(q_text)
        old_hits = await self.old_index.search(
            vector=old_vec, k=k, filter={"tenant": tenant_id},
        )
        # always return old hits (production keeps the SLA it had)
        if random.random() < self.shadow_fraction:
            task = asyncio.create_task(
                self._shadow_compare(q_text, tenant_id, k, old_hits)
            )
            # hold a reference so the task isn't garbage-collected mid-flight
            self._shadow_tasks.add(task)
            task.add_done_callback(self._shadow_tasks.discard)
        return old_hits

    async def _shadow_compare(self, q_text, tenant_id, k, old_hits):
        new_vec = await self.new_embed(q_text)
        new_hits = await self.new_index.search(
            vector=new_vec, k=k, filter={"tenant": tenant_id},
        )
        self.eval_logger.record(
            query=q_text,
            tenant=tenant_id,
            old_ids=[h.id for h in old_hits],
            new_ids=[h.id for h in new_hits],
            old_scores=[h.score for h in old_hits],
            new_scores=[h.score for h in new_hits],
        )
Production keeps returning what it always returned. The shadow call runs as a fire-and-forget task and feeds an eval log. You're not betting the SLA on the new model yet.
Two gates have to pass before you proceed to step 4. The offline gate: a labeled eval set scored against the new index, broken down by tenant and doc type. The online gate: the per-slice recall detector below, which compares the new index against the old over the previous 24 hours of real queries.
The labeled set is the hard part. If you don't have one yet, build one before you upgrade. Sample 200-500 queries from logs, weight by tenant size and query type, get humans (or a strong model with a careful rubric) to mark relevant chunks. This set is the only thing standing between you and the leaderboard-but-not-our-corpus failure mode.
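The offline gate can be sketched as a per-slice comparison over the labeled set. This is an illustration under assumptions, not the team's harness: the example record shape, the choice of recall@k as the metric, and the max_drop tolerance of 0.02 are all made up for the sketch.

```python
from collections import defaultdict


def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the labeled-relevant chunks that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for cid in retrieved_ids[:k] if cid in relevant)
    return hits / len(relevant)


def gate_by_slice(examples, k=10, max_drop=0.02):
    """examples: dicts with tenant, doc_type, relevant, old_ids, new_ids."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # old recall, new recall, count
    for ex in examples:
        s = sums[(ex["tenant"], ex["doc_type"])]
        s[0] += recall_at_k(ex["old_ids"], ex["relevant"], k)
        s[1] += recall_at_k(ex["new_ids"], ex["relevant"], k)
        s[2] += 1
    report = {}
    for key, (old_sum, new_sum, n) in sums.items():
        old_r, new_r = old_sum / n, new_sum / n
        # a slice passes if the new index is no worse than old minus max_drop
        report[key] = {"old": old_r, "new": new_r,
                       "passed": new_r >= old_r - max_drop}
    return report


examples = [
    {"tenant": "t1", "doc_type": "contract", "relevant": ["a", "b"],
     "old_ids": ["a", "x"], "new_ids": ["a", "b"]},
    {"tenant": "t2", "doc_type": "ticket", "relevant": ["c", "d"],
     "old_ids": ["c", "d"], "new_ids": ["c", "y"]},
]
report = gate_by_slice(examples)
```

The per-slice breakdown is the point. An aggregate score over both examples would show no change, while the report shows one slice improving and the other failing its gate.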
Step 4: Gradual cutover by slice
The cutover isn't a single flag. It's a routing table.
class SlicedRouter:
    def __init__(self, slice_config, old_router, new_router):
        # slice_config: dict[str, float]
        # key: "tenant:42" or "doc_type:contract" or "tenant:42:contract"
        # value: fraction of traffic to send to new (0.0 to 1.0)
        self.slice_config = slice_config
        self.old_router = old_router
        self.new_router = new_router

    def _new_fraction(self, tenant_id: str, doc_type: str) -> float:
        # most-specific-wins
        for key in [
            f"tenant:{tenant_id}:{doc_type}",
            f"tenant:{tenant_id}",
            f"doc_type:{doc_type}",
            "default",
        ]:
            if key in self.slice_config:
                return self.slice_config[key]
        return 0.0

    async def query(self, q_text, tenant_id, doc_type, k=10):
        frac = self._new_fraction(tenant_id, doc_type)
        if random.random() < frac:
            return await self.new_router.query(q_text, tenant_id, k)
        return await self.old_router.query(q_text, tenant_id, k)
You start with {"default": 0.0} plus a few tenant:internal-test entries at 1.0. The internal test tenant catches gross failures. Then a single low-stakes external slice. Then doc-type slices that the eval gate scored well on. Slices that fail their gate get pinned at 0.0 and ignored until you debug them.
The unlock is per-slice rollback. A slice that regresses doesn't take the whole migration down. You set its entry to 0.0, leave the others alone, and ship.
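To make the rollback concrete, here is a small sketch of what flipping one slice looks like against the routing table. The lookup mirrors the most-specific-wins order from the router; resolve_fraction and roll_back are hypothetical helpers, and the tenant IDs are invented.

```python
def resolve_fraction(slice_config, tenant_id, doc_type):
    """Most-specific-wins lookup, same key order as the router."""
    for key in (
        f"tenant:{tenant_id}:{doc_type}",
        f"tenant:{tenant_id}",
        f"doc_type:{doc_type}",
        "default",
    ):
        if key in slice_config:
            return slice_config[key]
    return 0.0


def roll_back(slice_config, tenant_id, doc_type):
    """Return a copy with one tenant/doc_type slice pinned to the old index."""
    updated = dict(slice_config)
    updated[f"tenant:{tenant_id}:{doc_type}"] = 0.0
    return updated


config = {"default": 0.0, "tenant:internal-test": 1.0, "doc_type:contract": 1.0}

# Tenant 42's contracts regressed: pin only that slice back to the old index.
after = roll_back(config, "42", "contract")
```

Because the most specific key wins, the new entry overrides doc_type:contract for tenant 42 only. Every other tenant's contracts keep reading from the new index.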
The recall-regression detector that lets you sleep
The labeled eval set tells you about average quality on a fixed snapshot. It doesn't tell you when the new model regresses on a query pattern that wasn't in the snapshot. For that you need an online detector.
The trick: use the old model's top-k as a proxy ground truth, computed continuously, and watch for drops on the new model's overlap. It's not perfect. If the old model was wrong, this won't catch it. But it catches the worst case where the new model returns garbage that the old model didn't.
from collections import deque
import statistics


class OnlineRecallDetector:
    def __init__(self, window_size=1000, k=10,
                 alert_threshold=0.65):
        # per-slice ring buffers of overlap@k
        self.windows: dict[str, deque] = {}
        self.window_size = window_size
        self.k = k
        self.alert_threshold = alert_threshold

    def _slice_key(self, tenant_id: str, doc_type: str) -> str:
        return f"{tenant_id}:{doc_type}"

    def record(self, tenant_id, doc_type,
               old_top_ids, new_top_ids):
        old_set = set(old_top_ids[: self.k])
        new_set = set(new_top_ids[: self.k])
        if not old_set:
            return
        overlap = len(old_set & new_set) / len(old_set)
        key = self._slice_key(tenant_id, doc_type)
        if key not in self.windows:
            self.windows[key] = deque(maxlen=self.window_size)
        self.windows[key].append(overlap)

    def check_alerts(self) -> list[dict]:
        alerts = []
        for key, window in self.windows.items():
            if len(window) < 100:
                continue  # not enough signal yet
            mean = statistics.mean(window)
            if mean < self.alert_threshold:
                alerts.append({
                    "slice": key,
                    "mean_overlap": round(mean, 3),
                    "samples": len(window),
                })
        return alerts
Wire record() into the shadow comparison from step 3, and call check_alerts() from a scheduled job every five minutes. The threshold is per-corpus: start at 0.65 and tune. The point isn't a fixed number but a discontinuity. If a slice was averaging 0.78 yesterday and drops to 0.55 today, that's the signal, not the absolute value.
Two failure modes to know. The detector is blind to "both models return the same wrong answer". That's where the labeled set still matters. And query distribution shift (a tenant onboarded a new doc type) can look like regression. Pin the alert to slices with at least seven days of stable history when possible.
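The discontinuity check can sit next to the absolute threshold. A minimal sketch, assuming you keep a baseline window of overlap values per slice (yesterday's, say) alongside the current one; the overlap_drop function and its max_drop default of 0.15 are illustrative, not tuned values.

```python
import statistics


def overlap_drop(baseline, current, min_samples=100, max_drop=0.15):
    """Return an alert dict if mean overlap fell by more than max_drop, else None."""
    if len(baseline) < min_samples or len(current) < min_samples:
        return None  # not enough signal on one side of the comparison
    base_mean = statistics.mean(baseline)
    cur_mean = statistics.mean(current)
    drop = base_mean - cur_mean
    if drop > max_drop:
        return {
            "baseline": round(base_mean, 3),
            "current": round(cur_mean, 3),
            "drop": round(drop, 3),
        }
    return None


# A slice that averaged 0.78 yesterday and 0.55 today trips the check.
alert = overlap_drop([0.78] * 200, [0.55] * 200)
```

A slice that has always sat at 0.60 never fires here, and a slice that slides from 0.90 to 0.70 does, even though 0.70 clears the absolute threshold.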
What changes if dimensions or distance metric differ
The 1536→3072 jump is the common one. The dimension change means the new index is a separate physical index. You can't write 3072-dim vectors into a 1536-dim collection. That's a feature for this migration: it forces clean separation.
Two practical consequences. At the same precision, the new index holds about twice the vector bytes of the old one, so total vector storage roughly triples while both indexes exist. Plan for the disk and the cost line. And query latency on the new index is higher (more work per comparison), often 1.3–1.6× the old p99. Measure before you cut over, not after.
The distance metric change is sneakier. If the old index was cosine, the new model emits normalized vectors, and you switch to dot product, the math gives the same ranking, as long as you're sure both sides are unit-normalized. If you forgot to normalize on either side, dot product silently weights longer vectors higher and your rankings shift. Check normalization explicitly:
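A rough sizing of the raw vectors makes the storage line concrete. This sketch assumes float32 values (4 bytes each) and counts only the vectors themselves; index structures and metadata add overhead that varies by store.

```python
def vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload only; excludes index graph and metadata overhead."""
    return num_vectors * dims * bytes_per_value


old_bytes = vector_bytes(40_000_000, 1536)   # existing 1536-dim index
new_bytes = vector_bytes(40_000_000, 3072)   # shadow 3072-dim index

old_gb = old_bytes / 1e9
new_gb = new_bytes / 1e9
both_gb = (old_bytes + new_bytes) / 1e9
```

For a 40M-chunk corpus that is about 246 GB for the old index and about 492 GB for the new one, so roughly 737 GB of raw vectors while both are live.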
import numpy as np


def assert_unit_norm(vec, tol=1e-3):
    n = float(np.linalg.norm(vec))
    if abs(n - 1.0) > tol:
        raise ValueError(f"vector not unit-normalized: norm={n}")
Run this in the dual-writer for a sampled fraction (1 in 1000) of writes during the first week. The check is cheap. The bug it catches is expensive.
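To see why the check matters, here is a toy example with three hand-picked 2-d vectors (not real embeddings). It compares the ranking from cosine similarity, from dot product on unit-normalized vectors, and from dot product on the raw vectors.

```python
import numpy as np

docs = np.array([
    [1.0, 0.0],   # doc 0: same direction as the query, length 1
    [3.0, 3.0],   # doc 1: 45 degrees off the query, but a long vector
    [0.0, 2.0],   # doc 2: orthogonal to the query
])
query = np.array([1.0, 0.0])


def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
dot_unit = normalize(docs) @ normalize(query)
dot_raw = docs @ query

# argsort on negated scores gives doc indexes from best to worst
rank_cosine = np.argsort(-cosine).tolist()
rank_dot_unit = np.argsort(-dot_unit).tolist()
rank_dot_raw = np.argsort(-dot_raw).tolist()
```

Cosine and normalized dot product both rank doc 0 first. Raw dot product promotes doc 1 purely because it is longer, which is the silent reordering the assertion guards against.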
If you're upgrading to a model served with binary or int8 quantization, or with Matryoshka-style truncated dimensions (Cohere v4 and similar), the shadow comparison gets weirder because score scales can differ by orders of magnitude. Compare ranks, not scores. The overlap@k detector above does the right thing here.
Common mistakes
Three patterns that bite teams running this migration.
Reranker training data goes stale. If you've trained a cross-encoder on (query, chunk) pairs scored by the old retriever, that training set is now skewed toward chunks the old model retrieved. The new model surfaces chunks the cross-encoder has never seen. Rebuild the reranker's hard-negative set against the new retriever after cutover. Don't roll both forward at once.
Normalization drift between write time and query time. Some teams normalize at write time but forget at query time, or vice versa. The bug is invisible on the old index (cosine divides the vector lengths back out) and obvious on the new one (dot product doesn't). The assertion above is your friend.
Eval set drift. The labeled set you built six months ago doesn't reflect today's tenants. Refresh 10% of it before every major upgrade. Stratify by tenant tier and doc type so a single big tenant doesn't dominate the metric.
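The stratification in the eval-set refresh can be sketched as a capped sample per stratum. Assumptions here: each logged query carries a tier and a doc_type field (both names are hypothetical), and an equal cap per stratum is one reasonable choice among several.

```python
import random
from collections import defaultdict


def stratified_sample(queries, per_stratum, seed=0):
    """Sample up to per_stratum queries from each (tier, doc_type) group."""
    rng = random.Random(seed)  # seeded so the refresh is reproducible
    strata = defaultdict(list)
    for q in queries:
        strata[(q["tier"], q["doc_type"])].append(q)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# One large tenant tier with 100 logged queries, one small tier with 5.
logged = (
    [{"tier": "enterprise", "doc_type": "contract", "q": f"e{i}"} for i in range(100)]
    + [{"tier": "free", "doc_type": "ticket", "q": f"f{i}"} for i in range(5)]
)
picked = stratified_sample(logged, per_stratum=10)
```

The cap keeps the 100-query stratum from swamping the 5-query one: the sample holds 10 of the former and all 5 of the latter.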
The permission this pattern gives you is the real win. Embedding upgrades stop being a once-a-year heroic event. You run dual-write whenever a promising model ships, backfill on the schedule your budget likes, and the routing table lets you ship to the slices that benefit while skipping the ones that don't.
What's the biggest embedding migration you've shipped, and which step caught you out: the backfill cost, the recall regression, or the reranker going stale? Drop your war story below.
If this was useful
The four-step shape (dual-write, shadow, A/B, slice cutover) is one of the production patterns from the RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production. The chapter on operating retrieval at scale digs into the eval-set construction, hard-negative refresh after retriever swap, and the rerank/retrieve interplay that this post hand-waved past. Same shape, more depth.

