When the embedding API itself collapses, the moments-based detector I wrote about earlier catches it. This post is about the other failure mode: the API is fine, your data has moved, and your dashboard still says green.
You ship a product launch on Monday. New SKUs, new docs, three thousand fresh chunks indexed alongside the existing eighty thousand. Nothing in the RAG pipeline changed. The embedding model is pinned. The chunker is pinned. The reranker config is byte-for-byte the same as last week.
By Wednesday, the support team is forwarding screenshots. Customers asking about the old product line are getting answers that quote the new one. Top-1 hit rate on your eval set is fine. Top-5 recall on real traffic looks acceptable. But the per-query-class numbers you stopped looking at after launch tell a different story: imagine legacy-product recall comes back at 0.54 against last week's 0.81, while the new-product class sits at 0.92 because every retrieval is biased toward the freshest, densest cluster of vectors. Illustrative numbers, but the shape of the failure is real.
Your embeddings did not regress. Your corpus did. And the monitor that would have caught it on Tuesday is about fifty lines of Python.
What corpus-side drift actually looks like
Three causes show up over and over in production RAG:
- Re-indexed corpus. A bulk import, a doc-site scrape, a new content team. The shape of what you are embedding changed. Density shifted. Some query classes now compete with ten times more neighbors.
- Model swap. Somebody swapped to a different model ID, or you moved to a self-hosted model to cut cost. The geometry of the space rotated. Old neighbors are no longer neighbors.
- Fine-tune. You fine-tuned the embedding model on domain data. Recall on the in-domain class went up. Recall on everything else quietly went down, because fine-tunes warp the regions of the space they trained on and squeeze the rest.
In all three, the global retrieval metric you watch on a dashboard looks fine. The per-class metric does not. Drift is almost always a per-class story.
The signal: similarity moments per query class
Pick five to ten query classes that map to your business. For an e-commerce assistant: returns, shipping, product specs, account, billing. A docs bot might split along install, config, troubleshooting, API reference, and conceptual. Each class gets a small set of probe queries (twenty to fifty is plenty) and one known-good chunk per query.
For each class, on a daily cadence, you compute three numbers:
- Mean similarity between each probe query and its known-good chunk.
- Standard deviation of those similarities across the probe set.
- Max similarity of each probe query against a fixed sample of unrelated corpus chunks. This is the noise floor for that class.
The signal you care about is the gap between the known-good mean and the noise-floor max. When the gap shrinks, retrieval for that class is degrading. Three sigma below the rolling baseline is your alert. The technique is generic: z-score against a rolling baseline is the standard distribution-shift pattern.
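To make the three-sigma check concrete, here is the arithmetic on made-up numbers. A minimal sketch: the fourteen-day window matches the monitor below, but every value is an illustrative assumption, not a measurement from a real system.

import numpy as np

# Illustrative rolling baseline: fourteen daily "gap" values for one class.
baseline_gap = np.array([0.36, 0.37, 0.35, 0.38, 0.36, 0.37, 0.36,
                         0.35, 0.37, 0.36, 0.38, 0.36, 0.37, 0.36])

today_mean, today_noise_max = 0.61, 0.41   # hypothetical snapshot values
today_gap = today_mean - today_noise_max   # 0.20, down from a ~0.36 baseline

z = (today_gap - baseline_gap.mean()) / (baseline_gap.std() + 1e-9)
if abs(z) > 3:                             # three sigma against the rolling baseline
    print(f"DRIFT gap z={z:.1f}")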
The monitor
import json, time
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

MODEL = SentenceTransformer("all-MiniLM-L6-v2")
HISTORY = Path("drift_history.jsonl")
WINDOW = 14  # days of history required before alerting

CORPUS: dict[str, str] = {}  # wire to your store: {chunk_id: text}

# Probe queries per class, each paired with its known-good chunk id.
PROBES = {
    "billing": [("update my card", "chunk_4821"),
                ("cancel subscription", "chunk_2210")],
    "shipping": [("where is my order", "chunk_9013"),
                 ("change delivery address", "chunk_7741")],
}

# Noise sample is pinned at startup, so noise_max tracks the embedding
# space rather than the sampling lottery.
NOISE_IDS = sorted(CORPUS.keys())[:200]
NOISE_VEC = None


def embed(texts):
    v = MODEL.encode(texts, normalize_embeddings=True)
    return np.asarray(v, dtype=np.float32)


def snapshot(probes, noise_vec):
    qv = embed([q for q, _ in probes])          # probe queries
    tv = embed([CORPUS[c] for _, c in probes])  # their known-good chunks
    sims = np.sum(qv * tv, axis=1)              # cosine sims (vectors are normalized)
    nm = (qv @ noise_vec.T).max(axis=1).mean()  # noise floor for this class
    return {"mean": float(sims.mean()), "std": float(sims.std()),
            "noise_max": float(nm), "gap": float(sims.mean() - nm)}


def run():
    global NOISE_VEC
    if NOISE_VEC is None:
        NOISE_VEC = embed([CORPUS[i] for i in NOISE_IDS])

    # Read the baseline before appending today's snapshot, so today's
    # value never sits inside its own mean.
    prior = [json.loads(line) for line in HISTORY.read_text().splitlines()] \
        if HISTORY.exists() else []

    snap = {"ts": time.time(),
            "classes": {n: snapshot(p, NOISE_VEC) for n, p in PROBES.items()}}
    with HISTORY.open("a") as f:
        f.write(json.dumps(snap) + "\n")

    # Warmup gate: no alerts until a full window of history exists.
    if len(prior) < WINDOW:
        return []

    rows, alerts = prior[-WINDOW:], []
    for cls, vals in snap["classes"].items():
        for k in ("mean", "std", "noise_max", "gap"):
            hist = np.array([r["classes"][cls][k] for r in rows])
            z = (vals[k] - hist.mean()) / (hist.std() + 1e-9)
            if abs(z) > 3:  # three sigma against the rolling baseline
                alerts.append((cls, k, round(z, 2), vals[k]))
    return alerts


if __name__ == "__main__":
    for a in run():
        print("DRIFT", a)
That is the whole thing. It depends on sentence-transformers and numpy. The probe sets and the corpus loader (the empty CORPUS dict you wire to your store) are the two pieces of work you do once, by hand. Everything else is mechanical.
A few details that matter. The noise sample is pinned at startup, not resampled per call, so changes in noise_max reflect the embedding space, not the lottery. The rolling baseline is read before today's snapshot is appended, so today's value is not inside its own mean. And the warmup gate suppresses alerts until there are at least 14 days of history, which prevents day-one false alarms when the rolling standard deviation is meaningless.
Reading the alerts
The four numbers each tell you a different story when they move:
- mean drops: the model no longer maps queries close to their answers. Often a fine-tune side effect. The class you fine-tuned gets a higher mean; classes you did not get a lower one.
- std rises: some probes still match well, others have fallen off a cliff. Suggests partial corpus drift, where a subset of the class is now dominated by new neighbors.
- noise_max rises: unrelated chunks are scoring closer to your queries than they used to. The corpus got denser around the query region, or the embedding space compressed. Re-indexed corpus is the usual cause.
- gap shrinks: the load-bearing one. Even when mean and noise_max both move in the same direction, what matters is the distance between them. A shrinking gap means top-k is mixing right and wrong answers.
A single class drifting on a single metric is rarely interesting. Two metrics on the same class, or the same metric across multiple classes on the same day, is when you wake somebody up.
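As an invented illustration, in the same tuple format the script prints (class names and numbers are made up), a day worth acting on looks like this:

DRIFT ('legacy_product', 'noise_max', 3.6, 0.47)
DRIFT ('legacy_product', 'gap', -4.1, 0.19)

Two metrics moving on the same class, the noise floor up and the gap down, point at the corpus getting denser around that class rather than at a query-side change.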
Once you have it, what you change
The monitor does not fix anything. It tells you which class moved on which day. The fix usually falls into one of three buckets:
- Re-embed. If you swapped the model, or fine-tuned, or upgraded a model ID, you re-embed the whole corpus and rebuild the index. Mixing old and new vectors in the same store is worse than either alone.
- Rebalance. If the alert is "new-product class dominates", you give the affected classes a retrieval-time boost (a minimal sketch follows this list), or you split them into separate indices and route queries by class.
- Rechunk. If a content style change caused the drift (longer chunks, different headings), you re-chunk the affected source and re-embed those chunks.
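For the rebalance bucket, here is a minimal sketch of a retrieval-time boost. CLASS_BOOST and the (chunk_id, score, chunk_class) hit shape are assumptions about how your retriever tags results, not part of the monitor above.

# Hypothetical per-class boost applied to retrieval scores before reranking.
CLASS_BOOST = {"legacy_product": 1.15}  # classes the monitor flagged as degraded

def rebalance(hits):
    # hits: list of (chunk_id, score, chunk_class) tuples from your retriever
    boosted = [(cid, score * CLASS_BOOST.get(cls, 1.0)) for cid, score, cls in hits]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)

Splitting into per-class indices does the same job structurally instead of numerically; either way, the monitor tells you which classes need it.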
The point of the monitor is not to automate any of those decisions. It is to surface, on day one, that one of those decisions is now needed. RAG systems decay slowly. The customers notice before the dashboard does. The fifty lines above flip that order back.
If this was useful
Drift detection is one slice of the RAG operations problem. Chunking strategy, embedding model choice, reranker tuning, and the per-class evaluation rigs that make any of this measurable in the first place are what the RAG Pocket Guide walks through end to end. Short, opinionated, runnable.
