Your RAG is getting worse and nothing in your code changed

#hermesagent #ai #rag #python

A RAG system I built was scoring 92% on our eval set in week one. Three weeks in, same eval set, same code, it was at 78%. Nothing had been deployed. Nothing had been re-indexed. The embeddings file was byte-identical.

So what changed?

This is a part of RAG that rarely gets discussed. Quality decays even when you do nothing, because production inputs are not the inputs you measured at launch. They are whatever your users are typing right now.

I built driftvane to find which signal was moving. It tracks five dimensions of drift in one report.

The five dimensions

After a few false starts (I started with just embedding drift, which is not enough), I landed on these:

Data drift: are the documents you index changing under you? (Includes silent index rebuilds.)
Embedding drift: is the distribution of your embeddings moving? Different distribution, different neighbors.
Query drift: are users asking different kinds of questions than they did at baseline?
Response drift: are answers getting longer, shorter, more uncertain, more confident?
Confidence drift: is the model's self-reported confidence changing on similar questions?

A single number tells you "something is off". The five tell you "your users started asking longer questions and the model is now hedging more".
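To make the response dimension concrete, here is a rough sketch of how length and hedging could be measured with the standard library. The phrase list, `response_stats`, and `relative_change` are hypothetical names I made up for illustration, not driftvane's internals.

```python
import re

# Hypothetical phrase list for illustration; tune it for your own domain.
HEDGES = ("might", "may", "possibly", "it depends", "not sure", "i don't know")
_HEDGE_RE = re.compile(r"\b(?:" + "|".join(re.escape(h) for h in HEDGES) + r")\b")

def response_stats(responses):
    """Mean word count and share of responses containing a hedging phrase."""
    lengths = [len(r.split()) for r in responses]
    hedged = [bool(_HEDGE_RE.search(r.lower())) for r in responses]
    return {
        "mean_words": sum(lengths) / len(lengths),
        "hedge_rate": sum(hedged) / len(hedged),
    }

def relative_change(baseline, current):
    """Fractional change from baseline, e.g. 0.38 means +38%."""
    return (current - baseline) / baseline
```

Run `response_stats` on the baseline window and on the recent window, then compare the two with `relative_change`. If the baseline hedge rate is zero, compare the absolute rates instead, since the relative change is undefined.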

What the API looks like

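from driftvane import DriftDetector

# Load the baseline fingerprint written earlier by build_snapshot().
detector = DriftDetector.load("baseline.snapshot")

# Compare a recent window of production traffic against that baseline.
report = detector.check(
    queries=recent_queries,
    retrieved_docs=recent_retrieved,
    responses=recent_responses,
    confidences=recent_confidences,
)

print(report.summary())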

Output looks like this:

DriftReport
  data:       OK     (psi=0.04)
  embedding:  OK     (psi=0.07)
  query:      DRIFT  (psi=0.31)   <-- here
  response:   DRIFT  (length +38%, hedging +22%)
  confidence: OK     (mean shift 0.02)

PSI is the Population Stability Index, a measure of how far a binned distribution has moved from its baseline. I treat anything above 0.25 as "you should look at this" and anything above 0.5 as a fire.
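If you have not met PSI before, the standard formula is short: bin both samples, then sum (current - baseline) * ln(current / baseline) over the bins. Below is a minimal standard-library sketch. The function name `psi`, the quantile binning, and the `eps` floor are my choices for illustration, and driftvane may compute it differently.

```python
import math
from bisect import bisect_right
from statistics import quantiles

def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two 1-D numeric samples."""
    # Interior cut points from baseline quantiles: bins - 1 edges give `bins` buckets.
    edges = quantiles(baseline, n=bins)

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # bisect_right returns the bucket index in the range 0..bins-1.
            counts[bisect_right(edges, v)] += 1
        # Floor at eps so an empty bucket does not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    base, curr = fractions(baseline), fractions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))
```

For query drift you could feed it something like query word counts from the baseline window and from the recent window. Every term in the sum is non-negative, so identical distributions score 0 and the value grows as the two diverge.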

What it actually found

For my 92 -> 78 case, the report said:

Query drift HIGH. Mean query length up 60%.
Response drift HIGH. Hedging phrases up 45%.

That was the whole answer. Users had started asking longer, multi-part questions ("compare A and B and also tell me about C"). My retriever was tuned for short queries. It was pulling the top 3 chunks for "A" and missing B and C entirely.

The fix had nothing to do with the model. I split long queries into sub-queries before retrieval. Accuracy went back to 91%.
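As a sketch of the idea only: the version below splits on a few surface patterns and merges the retrieved chunks. The regex, `split_query`, `retrieve_multi`, and the `retrieve(sub_query, k)` callable are all hypothetical stand-ins, and a real decomposition step may need something smarter, such as an LLM call.

```python
import re

# Hypothetical heuristic: split where separate asks tend to be joined.
_SPLIT = re.compile(r"(?:,?\s+and also\s+|\s*;\s*|\?\s+)", re.IGNORECASE)

def split_query(query, min_words=3):
    """Break a multi-part query into sub-queries; fall back to the original."""
    parts = [p.strip(" ?.") for p in _SPLIT.split(query)]
    parts = [p for p in parts if len(p.split()) >= min_words]
    return parts or [query]

def retrieve_multi(query, retrieve, k=3):
    """Call retrieve(sub_query, k) per sub-query and merge, keeping first-seen order."""
    seen, merged = set(), []
    for sub in split_query(query):
        for chunk in retrieve(sub, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

With this heuristic, "compare A and B and also tell me about C" becomes two sub-queries, "compare A and B" and "tell me about C", so each part gets its own top-k instead of competing for the same three slots.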

The detector you actually want

The naive way to do this is to run k-means on embeddings and watch the cluster sizes. That only sees shifts in embedding space, so it misses most of what actually goes wrong. The five-dimensional view is more useful because each dimension is independently diagnostic.

If response length moves but queries do not, the model side is the culprit. Look for a model upgrade or a prompt edit.

If queries move but responses do not, your model is being asked things it cannot answer well and is generating plausible-sounding-but-wrong content. Worst case.

If embeddings move but queries do not, your embedding provider silently changed the model. (Yes, this has happened.)

If data moves and you did not deploy, somebody else is writing to your index.

If confidence drifts but nothing else does, the model is having a bad week. Watch it.
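The rules above are simple enough to write down as a lookup. This is a sketch under my own assumptions: the dimension names are plain strings I chose, and `diagnose` is a hypothetical helper, not part of driftvane.

```python
def diagnose(drifted):
    """Map the set of drifted dimension names to the readings described above."""
    drifted = set(drifted)
    hints = []
    if "response" in drifted and "query" not in drifted:
        hints.append("model side changed: check for an upgrade or a prompt edit")
    if "query" in drifted and "response" not in drifted:
        hints.append("new kinds of questions, same-looking answers: check correctness")
    if "embedding" in drifted and "query" not in drifted:
        hints.append("embedding model may have changed upstream")
    if "data" in drifted:
        hints.append("index contents changed: check who writes to it")
    if drifted == {"confidence"}:
        hints.append("confidence moved alone: keep watching")
    return hints
```

An empty list means no single rule matched, for example when queries and responses moved together. In that case read the report itself.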

Snapshots are the trick

The whole thing is built around snapshots. A snapshot is a fingerprint of your baseline: histograms, length distributions, a sample of embeddings, summary stats. It is small (a few MB for most setups).

from driftvane import build_snapshot

build_snapshot(
    queries=baseline_queries,
    retrieved_docs=baseline_docs,
    responses=baseline_responses,
    out="baseline.snapshot",
)

You run this once when your eval looks good. Then you can compare any future window against it. The detector does not need the raw baseline data again, just the snapshot.
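To give a feel for why a snapshot can stay small, here is a toy fingerprint that stores only bucketed counts and summary stats, never the raw text. The field names and `make_fingerprint` are hypothetical. This is not driftvane's file format, and it leaves out the embedding sample.

```python
import json
from collections import Counter
from statistics import mean, pstdev

def make_fingerprint(queries, responses, bucket=5):
    """Summarize a baseline window as word-count histograms plus summary stats."""
    def length_hist(texts):
        # Bucket word counts (0-4, 5-9, ...) so only counts are stored.
        counts = Counter((len(t.split()) // bucket) * bucket for t in texts)
        return {str(k): counts[k] for k in sorted(counts)}

    q_len = [len(q.split()) for q in queries]
    return {
        "n": len(queries),
        "query_len_hist": length_hist(queries),
        "response_len_hist": length_hist(responses),
        "query_len_mean": mean(q_len),
        "query_len_std": pstdev(q_len),
    }

def save_fingerprint(fingerprint, path):
    """Write the fingerprint as JSON so later checks do not need the raw data."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(fingerprint, fh)
```

Because only aggregates are kept, the file size depends on the number of buckets, not on how many queries went into the baseline.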

What it does not do

It will not tell you the answers are wrong. It tells you the inputs and outputs are different. The "is it correct" question still needs your eval set.

Also: small windows are noisy. I would not trust a drift report on fewer than 200 queries. The library prints a warning if you try with fewer.

One library

PyPI only for now: pip install driftvane. The Rust port is on my list but Python is where my RAG code lives.

Repo: https://github.com/MukundaKatta/driftvane

If your RAG quality is mysteriously decaying, the cause is often not the model itself. It is usually one of these five signals, and five histograms is a cheap way to find which one.