We shipped a RAG system in February. The first week it answered 89% of internal questions correctly per the eval set. The next month it was 84%. The month after that 79%. Nobody on the team noticed because nobody was running the eval set on a schedule.
By the time someone ran a spot check, the index had drifted in four different ways at once. The embedder had been swapped (a vendor SDK update changed the default model). The corpus had grown by 30%, which shifted retrieval distributions. The query distribution had shifted because a new product launched and people were asking different things. And one of the underlying source documents had been quietly edited.
Each of these was a separate kind of drift. We had no detector for any of them.
driftvane is the Python library I wrote so the next system never gets to month three without somebody seeing the slope. It is on PyPI as driftvane.
Five dimensions, not one
Production RAG rots in at least five distinct ways. A single "is my RAG getting worse" detector is the wrong shape because the signal has to be wired to the right cause.
| Detector | What it watches | Fires when |
|---|---|---|
| EmbeddingDrift | The distribution of query embeddings | The geometric center of recent queries moves away from the baseline distribution. New topic emerged. |
| RetrievalDrift | The top-k document IDs retrieved | The intersection between today's top-k and the baseline top-k drops below a threshold. Index or retriever changed. |
| ResponseDrift | The semantic similarity of responses to the baseline | The judge score (or cosine similarity) between today's answer and last month's drops. Model or prompt changed. |
| LatencyDrift | p50 and p95 retrieval + generation latency | Either percentile rises beyond a threshold. Infrastructure degraded. |
| FreshnessDrift | The age of documents that win retrieval | The mean age of cited docs climbs. New content is not getting indexed. |
Each detector is a function over a window of production traces. None of them require knowing about the others.
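To make "a function over a window" concrete, here is a minimal standard-library sketch of the arithmetic behind four of the five rows in the table. This is not driftvane's internal code. Every function name is hypothetical, and the specific choices (cosine distance between centroids, Jaccard overlap, nearest-rank percentiles) are my assumptions about one reasonable way to compute each signal.

```python
import math
from statistics import mean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def embedding_shift(recent, baseline):
    """How far the center of recent query embeddings moved from the baseline."""
    return cosine_distance(centroid(recent), centroid(baseline))


def topk_overlap(recent_ids, baseline_ids):
    """Jaccard overlap between two top-k document ID lists (assumed metric)."""
    a, b = set(recent_ids), set(baseline_ids)
    return len(a & b) / len(a | b) if a | b else 1.0


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def mean_age_days(ages):
    """Mean age of the documents that won retrieval."""
    return mean(ages)
```

Each function takes only the slice it needs, which is why no detector has to know about the others. A threshold comparison on the returned number is all that separates a measurement from an alert.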
The shape of the fix

    from driftvane import EmbeddingDrift, RetrievalDrift, ResponseDrift, LatencyDrift, FreshnessDrift
    from driftvane.report import DriftReport

    # load_last_24_hours, load_baseline, my_embedder, my_judge, and alert
    # are placeholders for your own code.
    window = load_last_24_hours()
    baseline = load_baseline()

    # One detector per drift dimension, each with its own threshold.
    detectors = [
        EmbeddingDrift(embedder=my_embedder, threshold=0.15),
        RetrievalDrift(threshold_overlap=0.6),
        ResponseDrift(judge=my_judge, threshold=0.75),
        LatencyDrift(p50_threshold_ms=400, p95_threshold_ms=1200),
        FreshnessDrift(max_mean_age_days=180),
    ]

    report = DriftReport.from_detectors(detectors, window=window, baseline=baseline)

    # Alert only when at least one dimension fired.
    if report.has_any():
        alert(report.summary_markdown())

DriftReport rolls up which dimensions fired, by how much, with the evidence (a few example queries, the documents that drove the change, the latency percentile shift). One report, five separately-attributable signals.
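For a feel of what such a rollup involves, here is a rough sketch of the idea. It is not DriftReport's actual implementation or output format. The Finding dataclass, its fields, and the markdown layout are all hypothetical, invented here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical per-dimension result: what was measured and whether it fired."""
    dimension: str
    value: float
    threshold: float
    fired: bool
    evidence: list[str] = field(default_factory=list)


def summary_markdown(findings):
    """Render only the dimensions that fired, each with its evidence."""
    fired = [f for f in findings if f.fired]
    if not fired:
        return "No drift detected."
    lines = [f"{len(fired)} of {len(findings)} dimensions drifted:"]
    for f in fired:
        lines.append(f"- {f.dimension}: {f.value:.2f} (threshold {f.threshold:.2f})")
        lines.extend(f"  - evidence: {e}" for e in f.evidence)
    return "\n".join(lines)


findings = [
    Finding("retrieval", 0.42, 0.6, True, ["query: 'reset password'"]),
    Finding("latency", 380.0, 400.0, False),
]
print(summary_markdown(findings))
```

Keeping one record per dimension is what makes the signals separately attributable: the alert names the dimension that moved instead of reporting a single blended score.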
What it does NOT do
- It does not fix drift. It detects it. The fix is a model swap, a retriever change, a prompt change, an infrastructure restart, a corpus refresh. Different drifts get different fixes.
- It does not own a time-series store. You wire it into whatever you already use (Postgres, Clickhouse, Honeycomb, your laptop). The detectors take a window and a baseline as plain Python objects.
- It does not pick an embedder for you. The default detector configuration takes embedder as a parameter. You decide.
Inside the lib: one design choice worth showing
A composable detector library has to decide how it represents "the window of recent traces." The easy answer is a list of dicts. The hard part is that every detector wants a different slice.
The library's answer is a Window protocol that lets each detector read only the slice it cares about.

    from typing import Iterable

    import numpy as np

    class Window:
        # One accessor per slice; a detector calls only the ones it needs.
        def queries(self) -> Iterable[str]: ...
        def query_embeddings(self) -> Iterable[np.ndarray]: ...
        def retrieved_docs(self) -> Iterable[list[str]]: ...
        def responses(self) -> Iterable[str]: ...
        def latencies_ms(self) -> Iterable[float]: ...
        def doc_ages_days(self) -> Iterable[float]: ...

A detector that only cares about latency only iterates latencies_ms(). A detector that only cares about freshness only iterates doc_ages_days(). The Window can be backed by a SQL query, a JSON file on disk, an Arrow table, or an in-memory mock for tests. Each is a thin adapter.
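As an illustration of how thin such an adapter can be, here is a sketch of an in-memory mock for tests. InMemoryWindow is a hypothetical name, not something the library ships, and the trace dict keys (query, doc_ids, latency_ms, doc_ages_days) are assumptions made for this example. It implements only the accessors that do not need an embedder.

```python
from typing import Iterable


class InMemoryWindow:
    """Hypothetical test adapter: serves slices from a list of trace dicts."""

    def __init__(self, traces):
        self._traces = traces

    def queries(self) -> Iterable[str]:
        return (t["query"] for t in self._traces)

    def retrieved_docs(self) -> Iterable[list[str]]:
        return (t["doc_ids"] for t in self._traces)

    def latencies_ms(self) -> Iterable[float]:
        return (t["latency_ms"] for t in self._traces)

    def doc_ages_days(self) -> Iterable[float]:
        # Flatten: one age per retrieved document across all traces.
        return (age for t in self._traces for age in t["doc_ages_days"])


def mean_doc_age(window) -> float:
    """A freshness-style check that touches only doc_ages_days()."""
    ages = list(window.doc_ages_days())
    return sum(ages) / len(ages)


traces = [
    {"query": "a", "doc_ids": ["d1", "d2"], "latency_ms": 120.0, "doc_ages_days": [10.0, 30.0]},
    {"query": "b", "doc_ids": ["d3"], "latency_ms": 300.0, "doc_ages_days": [20.0]},
]
window = InMemoryWindow(traces)
print(mean_doc_age(window))
```

A SQL-backed or Arrow-backed adapter would expose the same method names and differ only in where the rows come from.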
The integrated demo ships a SyntheticWindow that simulates ageing the index from 0 to 30 days, so judges can see all five dimensions react to one knob without setting up a production pipeline.
When this is useful
- You run a RAG or agent system in production and want a daily report on whether anything degraded.
- You are about to change an embedder, a retriever, or a prompt and want a before-and-after on all five dimensions.
- You are writing a regression test that asserts a prompt change did not silently shift retrievals.
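The regression-test case can be sketched without the library at all. The helper below is hypothetical, the document IDs are stand-in data, and the 0.6 floor simply mirrors the threshold_overlap value from the earlier example. In a real test the two lists would come from running your retriever on a fixed query set before and after the change.

```python
def mean_topk_overlap(before, after):
    """Average, per query, of the fraction of 'before' doc IDs still in 'after'."""
    scores = []
    for old_ids, new_ids in zip(before, after):
        old = set(old_ids)
        scores.append(len(old & set(new_ids)) / len(old))
    return sum(scores) / len(scores)


def test_prompt_change_keeps_retrievals():
    # Stand-in retrieval results for two fixed queries.
    before = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
    after = [["d1", "d2", "d3"], ["d4", "d5", "d9"]]
    # Fail the build if retrievals moved more than the chosen floor allows.
    assert mean_topk_overlap(before, after) >= 0.6


test_prompt_change_keeps_retrievals()
```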
When this is NOT what you want
- For one-shot evaluation. driftvane compares a window against a baseline. If you do not have a baseline yet, run an eval set first.
- For real-time alerting on every request. The detectors are window-based, not per-request. Set the window short if you want low latency.
- For LLM-only systems with no retrieval. You will only get one or two of the five signals. Use cachebench or agenttrace instead.
Install
pip install driftvane
Repo: https://github.com/MukundaKatta/driftvane
Sibling libraries
| Lib | Boundary | Repo |
|---|---|---|
| driftvane | RAG drift across five dimensions | this repo |
| cachebench | Prompt-cache observability | https://github.com/MukundaKatta/cachebench |
| agenttrace | Cost + latency rollup per agent run | https://github.com/MukundaKatta/agenttrace |
| ragvitals | Same family, narrower API surface | https://github.com/MukundaKatta/ragvitals |
What's next
A short-window mode (5-minute baselines) so the same detectors can power real-time alerting, not just daily reports. Also planned: wiring with the broader agent-stack family, so a drift report can route directly into an agent-decision-log outcome if the change came from an upstream prompt edit.