DEV Community

Mukunda Rao Katta

My RAG system slowly got worse for three months and nobody noticed.

We shipped a RAG system in February. The first week it answered 89% of internal questions correctly per the eval set. The next month it was 84%. The month after that 79%. Nobody on the team noticed because nobody was running the eval set on a schedule.

By the time someone ran a spot check, the system had drifted in four different ways at once. The embedder had been swapped (a vendor SDK update changed the default model). The corpus had grown by 30%, which shifted retrieval distributions. The query distribution had shifted because a new product launched and people were asking different things. And one of the underlying source documents had been quietly edited.

Each of these was a separate kind of drift. We had no detector for any of them.

driftvane is the Python library I wrote so the next system never gets to month three without somebody seeing the slope. It is on PyPI as driftvane.

Five dimensions, not one

Production RAG rots in at least five distinct ways. A single "is my RAG getting worse" detector is the wrong shape because the signal has to be wired to the right cause.

| Detector | What it watches | Fires when |
| --- | --- | --- |
| EmbeddingDrift | The distribution of query embeddings | The geometric center of recent queries moves away from the baseline distribution. New topic emerged. |
| RetrievalDrift | The top-k document IDs retrieved | The intersection between today's top-k and the baseline top-k drops below a threshold. Index or retriever changed. |
| ResponseDrift | The semantic similarity of responses to the baseline | The judge score (or cosine similarity) between today's answer and last month's drops. Model or prompt changed. |
| LatencyDrift | p50 and p95 retrieval + generation latency | Either percentile rises beyond a threshold. Infrastructure degraded. |
| FreshnessDrift | The age of documents that win retrieval | The mean age of cited docs climbs. New content is not getting indexed. |

Each detector is a function over a window of production traces. None of them require knowing about the others.
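To make "a function over a window" concrete, here is a minimal sketch of how the first two rows of the table could be computed. It is not driftvane's implementation: the names centroid_shift and mean_topk_overlap, the cosine-distance and Jaccard formulas, the pairing of baseline and window results by the same probe query, and the cutoffs are all assumptions for illustration.

```python
import math
from typing import Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def centroid_shift(baseline: Sequence[Sequence[float]],
                   window: Sequence[Sequence[float]]) -> float:
    """Cosine distance between the baseline centroid and the window centroid."""
    a, b = centroid(baseline), centroid(window)
    dot = sum(x * y for x, y in zip(a, b))
    # Assumes neither centroid is the zero vector.
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def mean_topk_overlap(baseline: Sequence[Sequence[str]],
                      window: Sequence[Sequence[str]]) -> float:
    """Mean Jaccard overlap of top-k doc IDs.

    Assumption: the i-th entry of each argument is the top-k for the same
    probe query, run once at baseline time and once in the current window.
    """
    scores = []
    for base_ids, win_ids in zip(baseline, window):
        base_set, win_set = set(base_ids), set(win_ids)
        union = base_set | win_set
        scores.append(len(base_set & win_set) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Toy data: made-up 2-d embeddings and doc IDs.
shift = centroid_shift(
    baseline=[[1.0, 0.0], [1.0, 0.0]],
    window=[[1.0, 0.0], [0.0, 1.0]],  # half the recent queries moved topic
)
overlap = mean_topk_overlap(
    baseline=[["a", "b", "c"], ["d", "e"]],
    window=[["a", "b", "x"], ["d", "e"]],
)

embedding_fired = shift > 0.15   # illustrative cutoff
retrieval_fired = overlap < 0.6  # illustrative cutoff
```

Neither function knows the other exists, which is the property that lets each signal be attributed to its own cause.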

The shape of the fix

from driftvane import EmbeddingDrift, RetrievalDrift, ResponseDrift, LatencyDrift, FreshnessDrift
from driftvane.report import DriftReport

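# load_last_24_hours, load_baseline, my_embedder, my_judge, and alert are
# placeholders for your own code; they are not imported from driftvane.
window = load_last_24_hours()
baseline = load_baseline()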

detectors = [
    EmbeddingDrift(embedder=my_embedder, threshold=0.15),
    RetrievalDrift(threshold_overlap=0.6),
    ResponseDrift(judge=my_judge, threshold=0.75),
    LatencyDrift(p50_threshold_ms=400, p95_threshold_ms=1200),
    FreshnessDrift(max_mean_age_days=180),
]

report = DriftReport.from_detectors(detectors, window=window, baseline=baseline)

if report.has_any():
    alert(report.summary_markdown())

DriftReport rolls up which dimensions fired, by how much, with the evidence (a few example queries, the documents that drove the change, the latency percentile shift). One report, five separately-attributable signals.
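For a feel of what such a rollup involves, here is a small sketch. The Finding and Report classes and their fields are hypothetical stand-ins written for this post, not driftvane's DriftReport, and the numbers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One detector's verdict (hypothetical shape, not driftvane's own type)."""
    dimension: str
    value: float
    threshold: float
    fired: bool
    evidence: list[str] = field(default_factory=list)


@dataclass
class Report:
    findings: list[Finding]

    def has_any(self) -> bool:
        return any(f.fired for f in self.findings)

    def summary_markdown(self) -> str:
        # One table row per dimension, then evidence for the ones that fired.
        lines = ["| Dimension | Value | Threshold | Fired |",
                 "| --- | --- | --- | --- |"]
        for f in self.findings:
            fired = "yes" if f.fired else "no"
            lines.append(f"| {f.dimension} | {f.value:.2f} | {f.threshold:.2f} | {fired} |")
        for f in self.findings:
            if f.fired and f.evidence:
                lines.append("")
                lines.append(f"{f.dimension} evidence:")
                lines.extend(f"- {item}" for item in f.evidence)
        return "\n".join(lines)


# Made-up findings, only to show the rendering.
report = Report([
    Finding("retrieval", value=0.42, threshold=0.60, fired=True,
            evidence=["query 'reset SSO token' lost 3 of 5 baseline docs"]),
    Finding("latency_p95_ms", value=900.0, threshold=1200.0, fired=False),
])
summary = report.summary_markdown()
```

Keeping the evidence next to each finding is what makes the alert actionable: the reader sees which dimension moved and a concrete example of why.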

What it does NOT do

  • It does not fix drift. It detects it. The fix is a model swap, a retriever change, a prompt change, an infrastructure restart, a corpus refresh. Different drifts get different fixes.
  • It does not own a time-series store. You wire it into whatever you already use (Postgres, Clickhouse, Honeycomb, your laptop). The detectors take a window and a baseline as plain Python objects.
  • It does not pick an embedder for you. The default detector configuration takes embedder as a parameter. You decide.

Inside the lib: one design choice worth showing

A composable detector library has to decide how it represents "the window of recent traces." The easy answer is a list of dicts. The hard part is that every detector wants a different slice.

The library's answer is a Window protocol from which each detector reads only the slice it cares about.

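from typing import Iterable, Protocol

import numpy as np


class Window(Protocol):
    # Each method exposes one slice of the traces; a detector calls only the ones it needs.
    def queries(self) -> Iterable[str]: ...
    def query_embeddings(self) -> Iterable[np.ndarray]: ...
    def retrieved_docs(self) -> Iterable[list[str]]: ...
    def responses(self) -> Iterable[str]: ...
    def latencies_ms(self) -> Iterable[float]: ...
    def doc_ages_days(self) -> Iterable[float]: ...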

A detector that only cares about latency only iterates latencies_ms(). A detector that only cares about freshness only iterates doc_ages_days(). The Window can be backed by a SQL query, a JSON file on disk, an Arrow table, or an in-memory mock for tests. Each is a thin adapter.
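A thin adapter might look something like the sketch below. ListWindow, the trace dict keys, and latency_percentiles are hypothetical names for illustration; only the latencies_ms() and doc_ages_days() method names come from the protocol above. The adapter implements just two of the six methods to keep the example short.

```python
import statistics
from typing import Iterable


class ListWindow:
    """In-memory adapter over a list of trace dicts (hypothetical trace schema)."""

    def __init__(self, traces: list[dict]):
        self._traces = traces

    def latencies_ms(self) -> Iterable[float]:
        return (t["latency_ms"] for t in self._traces)

    def doc_ages_days(self) -> Iterable[float]:
        # One trace can cite several documents, so flatten.
        return (age for t in self._traces for age in t["doc_ages_days"])


def latency_percentiles(window) -> tuple[float, float]:
    """Return (p50, p95), reading only the latency slice of the window."""
    values = list(window.latencies_ms())
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[49], cuts[94]


# Made-up traces for the example.
window = ListWindow([
    {"latency_ms": 100.0, "doc_ages_days": [12.0, 40.0]},
    {"latency_ms": 200.0, "doc_ages_days": [3.0]},
    {"latency_ms": 300.0, "doc_ages_days": [90.0, 7.0]},
    {"latency_ms": 400.0, "doc_ages_days": [15.0]},
    {"latency_ms": 500.0, "doc_ages_days": [60.0]},
])

p50, p95 = latency_percentiles(window)
latency_fired = p50 > 400 or p95 > 1200  # illustrative cutoffs
```

Swapping the list for a SQL cursor or an Arrow table would change only the adapter; latency_percentiles would not need to know.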

The integrated demo ships a SyntheticWindow that simulates ageing the index from 0 to 30 days, so judges can see all five dimensions react to one knob without setting up a production pipeline.
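The idea of a single ageing knob can be shown with a toy model. This is not the shipped SyntheticWindow: synthetic_doc_ages and its 0-to-90-day age range are assumptions invented for the sketch, and it only exercises the freshness dimension.

```python
import random
import statistics


def synthetic_doc_ages(index_age_days: int, n_docs: int = 200, seed: int = 0) -> list[float]:
    """Ages of cited docs if the index was last refreshed `index_age_days` ago.

    Toy assumption: at refresh time the cited docs are 0 to 90 days old, and
    every day without re-indexing adds one day to each of them.
    """
    rng = random.Random(seed)  # fixed seed so every knob setting sees the same docs
    return [rng.uniform(0, 90) + index_age_days for _ in range(n_docs)]


baseline_mean = statistics.fmean(synthetic_doc_ages(0))

for knob in (0, 10, 20, 30):
    mean_age = statistics.fmean(synthetic_doc_ages(knob))
    print(f"index age {knob:>2}d  mean cited-doc age {mean_age:6.1f}d  "
          f"climb {mean_age - baseline_mean:+.1f}d")
```

Because the seed is fixed, the climb in mean age tracks the knob one-for-one, which is the slope a freshness detector is meant to catch.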

When this is useful

  • You run a RAG or agent system in production and want a daily report on whether anything degraded.
  • You are about to change an embedder, a retriever, or a prompt and want a before-and-after on all five dimensions.
  • You are writing a regression test that asserts a prompt change did not silently shift retrievals.

When this is NOT what you want

  • For one-shot evaluation. driftvane compares a window against a baseline. If you do not have a baseline yet, run an eval set first.
  • For real-time alerting on every request. The detectors are window-based, not per-request. Set the window short if you want low latency.
  • For LLM-only systems with no retrieval. You will only get one or two of the five signals. Use cachebench or agenttrace instead.
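On the window-length point above: shortening the window is mostly a matter of how you slice traces before handing them to a detector. A minimal sketch, assuming each trace is a dict with a timezone-aware "ts" timestamp (a hypothetical schema, as is the trailing_window helper):

```python
from datetime import datetime, timedelta, timezone


def trailing_window(traces: list[dict], now: datetime, length: timedelta) -> list[dict]:
    """Keep traces whose timestamp falls in (now - length, now]."""
    start = now - length
    return [t for t in traces if start < t["ts"] <= now]


# Made-up traces around an arbitrary reference time.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
traces = [
    {"ts": now - timedelta(hours=30), "latency_ms": 310.0},
    {"ts": now - timedelta(hours=3), "latency_ms": 420.0},
    {"ts": now - timedelta(minutes=4), "latency_ms": 1500.0},
]

daily = trailing_window(traces, now, timedelta(hours=24))
short = trailing_window(traces, now, timedelta(minutes=5))
```

The trade-off is sample size: a short window reacts faster but holds fewer traces, so percentiles and means computed over it are noisier.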

Install

pip install driftvane

Repo: https://github.com/MukundaKatta/driftvane

Sibling libraries

| Lib | Boundary | Repo |
| --- | --- | --- |
| driftvane | RAG drift across five dimensions | this repo |
| cachebench | Prompt-cache observability | https://github.com/MukundaKatta/cachebench |
| agenttrace | Cost + latency rollup per agent run | https://github.com/MukundaKatta/agenttrace |
| ragvitals | Same family, narrower API surface | https://github.com/MukundaKatta/ragvitals |

What's next

First, a short-window mode (5-minute baselines) so the same detectors can power real-time alerting, not just daily reports. Then wiring with the broader agent-stack family, so a drift report can route directly into an agent-decision-log outcome if the change came from an upstream prompt edit.
