- Book: RAG Pocket Guide
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The provider's status page is green. Your error rate is flat. Latency is fine. And yet your RAG retrieval has gone soft. Top-1 hit rate roughly halved overnight, and nobody touched the index.
You start digging and notice something strange. You pick a query and a known-good answer chunk and compute cosine similarity. It comes back at 0.987. Good, right? Then you pick a query and an obviously unrelated chunk. That one is 0.984. Then you pick two random chunks from your corpus. Also 0.98. Every pair you sample is sitting in the same narrow band near 1.0.
The embedding model is still answering. It's still returning 1536-dim vectors. It's still passing every health check you have. But it has stopped distinguishing anything. The similarity distribution has collapsed.
This post is about catching that class of regression before your retrieval quality does.
The failure mode the API never reports
Embedding APIs have one obvious way to break (5xx errors, timeouts) and one quiet way (the vectors keep coming back, but they no longer encode meaning the way they used to). The quiet failure is the one that wrecks your retrieval and never trips a single alert.
Reasons the distribution can shift without an outage:
- The provider rotated the model artifact behind the same model ID. Same name, different weights, different geometry of the embedding space.
- A precision change. A switch from fp32 to fp16 or int8 quantisation can compress the spread of distances.
- A pre- or post-processing change. A new tokenizer, a different normalisation step, a CLS-pooling change.
- Your own preprocessing changed. Somebody upstream stripped punctuation, lowercased, or truncated, and now every input looks more similar.
Pin the cause later. First you need to know it happened.
What a healthy similarity distribution looks like
Take a fixed sample of 1000 random pairs from your corpus and plot the distribution of cosine similarity between them. For most production embedding models on a diverse corpus, you get something close to a bell shape with:
- Mean around 0.3 to 0.5.
- Standard deviation around 0.1 to 0.15.
- The 99th percentile somewhere around 0.7 to 0.8.
Those numbers are domain-dependent. A corpus of legal contracts will sit higher than a corpus of news articles. The point is not the absolute number. The point is that whatever shape your corpus has today, it should be the same shape tomorrow.
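If you want to see that shape for your own corpus, here is a minimal sketch. It assumes you already have your chunk embeddings in a NumPy array called corpus_vecs (one row per chunk); the function name is a placeholder.

import numpy as np

def similarity_profile(corpus_vecs: np.ndarray, n_pairs: int = 1000, seed: int = 0) -> dict:
    """Mean, std, and p99 of cosine similarity over random chunk pairs."""
    rng = np.random.default_rng(seed)
    # Normalise rows so a dot product of two rows is their cosine similarity.
    vecs = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    i = rng.integers(0, len(vecs), size=n_pairs)
    j = rng.integers(0, len(vecs), size=n_pairs)
    keep = i != j  # drop the rare self-pair, which would score exactly 1.0
    sims = np.sum(vecs[i[keep]] * vecs[j[keep]], axis=1)
    return {
        "mean": float(sims.mean()),
        "std": float(sims.std()),
        "p99": float(np.percentile(sims, 99)),
    }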
When the distribution collapses, three things move together:
- The mean drifts up toward 1.0.
- The standard deviation collapses toward 0.
- The gap between random pairs and known-good pairs disappears.
Any of those moving by more than three standard deviations is your alert.
The 50-line detector
Pin a fixed probe set at the start. The probe set has three parts:
- Random pairs. A few hundred pairs sampled once from your corpus. The "noise floor" of your similarity distribution.
- Known-similar pairs. A handful of pairs you've manually labelled as semantically close (a question and the chunk that answers it). The "signal" that should sit well above the noise.
- Known-different pairs. A handful of pairs you've labelled as semantically far apart. Sanity check.
Embed the probe set on a fixed cadence. Compute four numbers: the mean, standard deviation, and max of random-pair similarity, plus the average gap between known-similar and random pairs. Compare against a rolling baseline.
import time

import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"

# Fixed probe set: sample it once and never change it, or the baseline loses meaning.
PROBES = {
    "random": [
        ("the cat sat on the mat", "quarterly revenue grew 12%"),
        ("how do i reset my password", "kubernetes pod eviction"),
        # ... ~200 random pairs from your corpus
    ],
    "similar": [
        ("how do i cancel", "where do i stop billing"),
        ("reset my password", "i forgot my login"),
        # ... ~20 manually labelled close pairs
    ],
}

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts and L2-normalise, so a dot product is a cosine similarity."""
    r = client.embeddings.create(model=EMBED_MODEL, input=texts)
    v = np.array([d.embedding for d in r.data], dtype=np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def pair_sims(pairs: list[tuple[str, str]]) -> np.ndarray:
    """Cosine similarity for each (left, right) pair, embedded in a single batch."""
    flat = [t for pair in pairs for t in pair]
    vecs = embed(flat)
    a = vecs[0::2]  # left element of each pair
    b = vecs[1::2]  # right element of each pair
    return np.sum(a * b, axis=1)

def snapshot() -> dict:
    """One observation of the similarity distribution."""
    rand = pair_sims(PROBES["random"])
    simi = pair_sims(PROBES["similar"])
    return {
        "ts": time.time(),
        "rand_mean": float(rand.mean()),
        "rand_std": float(rand.std()),
        "rand_max": float(rand.max()),
        "sim_mean": float(simi.mean()),
        "gap": float(simi.mean() - rand.mean()),
    }

def check(now: dict, baseline: list[dict]) -> list[str]:
    """Alert on any tracked number more than three standard deviations from the rolling baseline."""
    if len(baseline) < 24:
        return []  # not enough history yet
    keys = ["rand_mean", "rand_std", "rand_max", "gap"]
    alerts = []
    for k in keys:
        hist = np.array([b[k] for b in baseline])
        mu, sd = hist.mean(), hist.std() + 1e-9
        z = (now[k] - mu) / sd
        if abs(z) > 3:
            alerts.append(f"{k} z={z:.2f} now={now[k]:.4f}")
    return alerts

if __name__ == "__main__":
    s = snapshot()
    print(s)
The detector is doing one job: tracking four moments of the similarity distribution and alerting when any of them moves more than three standard deviations from a 24-point rolling baseline. That's it.
The four moments matter for different reasons:
- rand_mean catches a uniform shift up. Every pair got more similar; the model collapsed toward a single point.
- rand_std catches a spread collapse. The mean might look fine, but the variance disappeared and your retrieval is now coin-flipping.
- rand_max catches the long-tail edge case. Two unrelated chunks suddenly score 0.99.
- gap is the load-bearing one. Even if the noise floor moves, what kills retrieval is the distance between signal and noise. When the gap shrinks, your top-k results stop being meaningful.
Run it on a 5-minute cron. Append each snapshot to a metrics backend (Prometheus, Datadog, a flat file, doesn't matter). Alert on the z-score crossings.
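One way to wire that up, as a minimal sketch: it reuses the snapshot() and check() functions above and assumes a flat JSONL file as the baseline store (the file name is a placeholder).

import json
from pathlib import Path

BASELINE_FILE = Path("embedding_baseline.jsonl")  # hypothetical path

def run_once() -> None:
    """One cron tick: snapshot, compare against the rolling baseline, persist."""
    baseline = []
    if BASELINE_FILE.exists():
        baseline = [json.loads(line) for line in BASELINE_FILE.read_text().splitlines() if line]
    now = snapshot()
    for alert in check(now, baseline[-24:]):  # last 24 snapshots as the rolling window
        print(f"ALERT: {alert}")  # swap for your pager or metrics backend
    with BASELINE_FILE.open("a") as f:
        f.write(json.dumps(now) + "\n")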
Why moments and not retrieval quality
You could write a different detector that runs your full retrieval pipeline against a labelled eval set and tracks recall@k. That detector is also worth having. It catches more failure modes (chunking changes, reranker changes, prompt changes).
But it's slower, costlier, and the signal arrives later. A retrieval eval needs a labelled set big enough to be statistically meaningful. Moments-based monitoring runs on 200 pairs, costs a few cents per snapshot, and surfaces the embedding regression before it has time to propagate into your retrieval logs.
The two detectors are complementary. Moments-based catches embedding-API drift in minutes. End-to-end retrieval eval catches everything else, on a slower cadence (daily or per-deploy).
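For contrast, the slower detector is roughly this shape. A sketch, assuming a labelled eval set of (query, relevant chunk ID) pairs and a retrieve(query, k) function that wraps your own pipeline; both names are placeholders.

from typing import Callable

def recall_at_k(
    eval_set: list[tuple[str, str]],  # (query, id of the chunk that answers it)
    retrieve: Callable[[str, int], list[str]],  # your pipeline: query, k -> ranked chunk ids
    k: int = 5,
) -> float:
    """Fraction of labelled queries whose answer chunk shows up in the top-k results."""
    hits = sum(1 for query, chunk_id in eval_set if chunk_id in retrieve(query, k))
    return hits / len(eval_set)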
What to do when it fires
The detector tells you something changed. It does not tell you what. The runbook:
- Re-run the snapshot manually. Confirm the alert isn't a transient.
- Compare model IDs. If you're calling a versioned ID like text-embedding-3-small, the response payload does not always carry a separate artifact-version stamp. Check the provider's status and changelog pages.
- Diff your own preprocessing. Did anybody change the normalisation, tokenisation, or input truncation in the last 24 hours?
- Re-embed a known-good chunk. Compare today's vector to a vector you stored a week ago. The cosine of those two should be very close to 1.0. If it's drifted, the model artifact moved.
- Decide. Either roll back to a pinned, self-hosted embedding model, or accept the new geometry and re-embed your whole corpus. Mixing old and new embeddings in the same index gives you the worst of both.
The "re-embed a known-good chunk" trick is the cheap version of having pinned the model. You don't need to pin the model to detect when it moves; you need a stable reference vector and an alarm when today's vector for the same input no longer matches.
A note on similarity floors
A real number to anchor the abstract: with text-embedding-3-small, common production semantic-cache thresholds I've seen sit somewhere in the 0.9-ish range. If your random-pair noise floor drifts up that high, your cache is now hitting on every query. Same answer for everyone.
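To make that concrete, here is roughly what the cache check looks like: a sketch with a hypothetical 0.9 threshold, assuming normalised vectors.

CACHE_THRESHOLD = 0.90  # placeholder; tune against your own similarity distribution

def cache_lookup(query_vec: np.ndarray, cache: list[tuple[np.ndarray, str]]) -> str | None:
    """Return the cached answer whose key vector is closest to the query, if close enough."""
    best_sim, best_answer = -1.0, None
    for key_vec, answer in cache:
        sim = float(np.dot(query_vec, key_vec))
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    # If the random-pair noise floor drifts above the threshold, this returns a hit for every query.
    return best_answer if best_sim >= CACHE_THRESHOLD else None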
That's the failure mode that costs you trust faster than a 5xx ever will. A 5xx surfaces. A silent embedding collapse just makes your product feel stupid in a way users can't articulate.
The detector above is fifty lines plus probe data. The probe data takes an afternoon to assemble. The cron costs nothing. There is no reason to be running embeddings in production without it.
If this was useful
Embedding similarity is the load-bearing signal in every RAG system. The full pipeline (chunking, embedding choice, similarity thresholds, reranker tuning, evaluation rigs) and the failure modes that hide between the steps are what the RAG Pocket Guide walks through end to end. Short, opinionated, runnable.
