- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You have a RAG pipeline in production. Retrieval p95 looks fine. Token cost is stable. A user pings support: "it keeps answering about the wrong product." You pull the trace. Top-5 chunks look reasonable. The answer does not.
This post is about that gap.
Kuldeep Paul's 2026 dev.to piece Ten Failure Modes of RAG Nobody Talks About covers the taxonomy end-to-end and is worth reading before this one. What follows is five different angles — failure modes that (a) don't overlap much with his list, and (b) I keep seeing chewed up by aggregate dashboards. For each one: a concrete scenario, the signal that actually detects it, and the code or config to catch it.
## 1. Embedding drift from under your index
The scenario. Your pipeline uses text-embedding-3-small. In Q2 the provider ships a minor update to that model. Your index — the vectors you computed last November — was built against the prior artifact. Every new user query is encoded with the new weights. Retrieval still "works." Scores are plausible. Relevance has quietly fallen a few points across the board, concentrated on queries about rare entities.
You will not see this in retrieval-accuracy charts because there is no step-change. It is a slow slide.
The signal. Cosine-distance distribution between a fixed held-out query set and its known-correct chunks, tracked over time. The shape of the distribution changes when the embedding space changes under you. A Kolmogorov-Smirnov test on the daily distribution vs a 14-day baseline catches this in about three days.
The catch.
```python
from scipy.stats import ks_2samp

# Run this nightly against a frozen eval set of
# ~200 (query, correct_chunk_id) pairs.
def detect_embedding_drift(today_scores, baseline_scores):
    stat, p = ks_2samp(today_scores, baseline_scores)
    return {
        "ks_stat": stat,
        "p_value": p,
        "drift_detected": p < 0.01 and stat > 0.15,
    }
```
Pair that with a weekly re-embed of a 1% sample of the corpus and a cosine-similarity check between old and new vectors. If the mean drops below 0.98, reindex.
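The re-embed check is a one-liner over normalized vectors. A minimal sketch, assuming you keep the sample's stored vectors and their fresh re-embeddings as parallel arrays:

```python
import numpy as np

def reembed_similarity(old_vectors, new_vectors):
    """Mean cosine similarity between stored vectors and fresh
    re-embeddings of the same 1% corpus sample."""
    old = np.asarray(old_vectors, dtype=float)
    new = np.asarray(new_vectors, dtype=float)
    # Normalize row-wise, then take the row-wise dot product.
    old /= np.linalg.norm(old, axis=1, keepdims=True)
    new /= np.linalg.norm(new, axis=1, keepdims=True)
    return float((old * new).sum(axis=1).mean())

# reindex_needed = reembed_similarity(stored, fresh) < 0.98
```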
## 2. Chunking misalignment
The scenario. Your splitter chunks at 512 tokens with a 50-token overlap. The document is a compliance policy. One sentence reads: "Refunds are permitted within 14 days, except for digital goods delivered after download." The splitter cuts between "14 days" and "except". The user asks whether digital goods are refundable. The retriever returns the first half of that sentence, confidently. The model answers yes.
The chunk scored high. The answer was wrong. Retrieval looks fine because it is fine at the chunk level. The failure is upstream: your chunking strategy and the user's question live at different granularities.
The signal. Sentence-boundary break rate per chunk, joined with answer-correctness on a sample of traces. If chunks that split sentences correlate with incorrect answers at a rate meaningfully above the base rate, that is your signal.
The catch. Two cheap instruments.
First, at index time, log whether each chunk ends mid-sentence:
```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

def chunk_end_quality(chunk_text: str) -> str:
    if SENTENCE_END.search(chunk_text.strip()):
        return "clean"
    if chunk_text.strip().endswith((",", ";", ":", "-")):
        return "mid_clause"
    return "mid_sentence"
```
Emit chunk.end_quality as a metadata field on every chunk. Then at retrieval time, emit a span attribute on your retrieval span — gen_ai.retrieval.mid_sentence_chunks — counting how many of the top-k ended badly.
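The retrieval-time half is small enough to sketch. The chunk shape here is an assumption (dicts carrying the `end_quality` metadata written at index time), and the tracer call is whichever OTel-compatible tracer you already run:

```python
# Count badly-terminated chunks among the retrieved top-k.
# Assumes each retrieved chunk is a dict carrying the `end_quality`
# metadata field written at index time.
def mid_sentence_count(chunks):
    return sum(1 for c in chunks if c.get("end_quality") == "mid_sentence")

# with tracer.start_as_current_span("gen_ai.retrieval") as span:
#     chunks = retriever.search(query, k=5)
#     span.set_attribute(
#         "gen_ai.retrieval.mid_sentence_chunks",
#         mid_sentence_count(chunks),
#     )
```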
Second, run a weekly offline eval where your judge grades the answer against the full document (not the retrieved chunks). If grounded accuracy against the full doc is materially higher than against the chunks, your chunking strategy is the bottleneck.
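One way to run that weekly comparison, assuming a hypothetical `judge_answer(question, answer, context)` helper that returns 1.0 for a correct, grounded answer and 0.0 otherwise:

```python
# Sketch of the chunks-vs-full-document comparison. `judge_answer`
# is a stand-in for whatever LLM-judge call you already use.
def chunking_bottleneck_gap(samples, judge_answer):
    """samples: list of (question, answer, retrieved_chunks, full_doc)."""
    vs_chunks = [judge_answer(q, a, "\n".join(chunks))
                 for q, a, chunks, _ in samples]
    vs_doc = [judge_answer(q, a, doc) for q, a, _, doc in samples]
    # A large positive gap means the chunks, not the corpus,
    # are what the model is missing.
    return sum(vs_doc) / len(samples) - sum(vs_chunks) / len(samples)
```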
Switch to a semantic splitter (LangChain's SemanticChunker, LlamaIndex's SemanticSplitterNodeParser) or increase overlap. Measure again.
## 3. Retrieval poisoning by one weird document
The scenario. A teammate ingests a slide deck export. One slide's notes field contains a motivational paragraph full of product terminology: "At Acme, we believe pricing, refunds, integrations, and reliability are all one promise." The chunk's embedding lands in a neighborhood that is close-ish to a disturbing number of real user queries. It scores in the top-5 for refund questions, pricing questions, integration questions. It has nothing useful to say about any of them. Every answer that touches those topics is now contaminated.
This is not an adversarial attack. It is a single well-embedded but content-poor document raising its hand on every question.
The signal. Per-chunk retrieval fan-out: how many distinct queries pull this chunk into top-k over a rolling window. Plot the distribution. A healthy corpus has a long tail; almost every chunk is retrieved for a narrow cluster of related queries. Poisoned chunks are outliers that appear across semantically unrelated clusters.
The catch.
```python
from sklearn.cluster import KMeans

# Over the last 7 days of production traces:
# chunk_hits[chunk_id] = list of query_embeddings that retrieved it
#
# Fit one KMeans over ALL production query embeddings first, so the
# clusters reflect the real topic structure of your traffic. (Fitting
# a fresh KMeans per chunk would always use every cluster and score
# every chunk near 1.0.)
# global_km = KMeans(n_clusters=8, n_init=10).fit(all_query_embeddings)

def poison_score(query_embeddings, global_km):
    if len(query_embeddings) == 0:
        return 0.0
    labels = global_km.predict(query_embeddings)
    # If this chunk's hits land in many different topic clusters,
    # it is being retrieved by semantically unrelated queries.
    return len(set(labels)) / global_km.n_clusters

# Rank chunks by (retrieval_count * poison_score).
# The top 20 are what you review manually.
```
Review flagged chunks weekly. Most are garbage ingests (boilerplate, footers, all-topic marketing copy). Delete or down-weight them. You do not need ML to catch this — you need to look.
## 4. Low-grounding hallucination with invented citations
The scenario. The retriever returns five chunks. None of them are actually relevant — the user asked something the corpus does not answer. The model does not say "I don't know." It generates a confident paragraph and attaches a citation to chunk 3. Chunk 3 is about something adjacent but different. A reader who trusts the citation does not click through. The answer is wrong; the UI says it is sourced.
Aggregate metrics miss this because retrieval returned results (✓), the model produced a response (✓), and a citation object was emitted (✓). The content of the citation relative to the claim is where the failure lives.
The signal. Two separate scores, computed per response:
- Retrieval confidence — mean top-k similarity score, normalized against a rolling baseline for that query type.
- Citation faithfulness — an LLM-judge (from a different provider than the generator, to avoid self-preference bias) asked: "Does chunk N support the specific claim it is attached to? Yes / Partial / No."
The danger zone is low retrieval confidence and high response confidence and non-empty citations. That combination is the model covering for a miss.
The catch. Emit both as span attributes on the generation span and alert on the combination.
```python
# Pseudocode — the exact signature depends on your
# tracer (OTel, Langfuse, Phoenix).
with tracer.start_as_current_span("gen_ai.generate") as span:
    response = llm.generate(prompt, context=chunks)
    retrieval_conf = mean([c.score for c in chunks])
    faithfulness = judge_citations(
        response.text, response.citations, chunks,
        judge_model="other-provider-model",
    )
    span.set_attribute("gen_ai.retrieval.confidence", retrieval_conf)
    span.set_attribute("gen_ai.response.faithfulness", faithfulness)
    span.set_attribute(
        "gen_ai.response.hallucination_risk",
        retrieval_conf < 0.55 and faithfulness < 0.7,
    )
```
Alert on hallucination_risk=true exceeding 2% of traffic over a 30-minute window. When it fires, the product fix is often simple: raise the retrieval-confidence threshold at which you let the model answer vs falling back to an "I don't have enough information" response. Users prefer a refusal over a fabricated citation. Every time.
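In production that alert belongs in your metrics backend, but the rolling-window logic it needs is small enough to sketch. This is illustrative only, not a real alerting implementation:

```python
from collections import deque
import time

class RateAlert:
    """Fires when flagged traces exceed `threshold` share of traffic
    over a rolling `window_s` seconds."""

    def __init__(self, threshold=0.02, window_s=1800):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()  # (timestamp, flagged) pairs

    def record(self, flagged, now=None):
        now = time.time() if now is None else now
        self.events.append((now, flagged))
        # Evict events that have aged out of the window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        flagged_n = sum(1 for _, f in self.events if f)
        return flagged_n / len(self.events) > self.threshold
```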
## 5. The evaluation blind spot
The scenario. Your eval set has 200 golden questions. You added it in January. Retrieval recall@5 on the set is 0.91. The eval runs nightly and is green. A new customer segment onboards in March. They ask questions phrased differently — more technical, more acronym-heavy, more multi-part. Your eval never covered that distribution. You will find out the day the churn dashboard shifts.
This one is meta: the failure mode is that your detector doesn't cover the failure. Every other instrument in this post assumes your eval distribution matches your production distribution. When they diverge, your dashboards lie.
The signal. Distributional distance between your eval set and production queries, recomputed weekly. Cluster both in the same embedding space and compare cluster membership. Any cluster that carries a meaningful share of production traffic (say, over 2%) but is nearly absent from the eval set is a blind spot.
The catch. This is a reporting job, not an alert.
```python
from sklearn.cluster import KMeans

def eval_coverage_report(prod_embeddings, eval_embeddings, n=20):
    km = KMeans(n_clusters=n, n_init=10).fit(prod_embeddings)
    prod_labels = km.labels_
    eval_labels = km.predict(eval_embeddings)
    report = []
    for cluster in range(n):
        prod_share = (prod_labels == cluster).mean()
        eval_share = (eval_labels == cluster).mean()
        gap = prod_share - eval_share
        if prod_share > 0.02 and eval_share < 0.005:
            report.append({
                "cluster": cluster,
                "prod_share": round(prod_share, 3),
                "eval_share": round(eval_share, 3),
                "gap": round(gap, 3),
            })
    return sorted(report, key=lambda r: -r["gap"])
```
Feed the top-gap clusters to a human every Monday. For each cluster, pull 10 representative production queries, write ground-truth answers, add them to the eval set. An eval set that does not grow is already stale.
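Pulling those representative queries can come straight out of the fitted KMeans. A sketch, assuming parallel lists of production embeddings and raw query texts plus the fitted labels and cluster centers:

```python
import numpy as np

def representative_queries(prod_embeddings, queries, labels, centers,
                           cluster, n=10):
    """Pick the n production queries closest to a blind-spot cluster's
    centroid — the ones a human should write ground truth for."""
    emb = np.asarray(prod_embeddings, dtype=float)
    idx = np.where(np.asarray(labels) == cluster)[0]
    dists = np.linalg.norm(emb[idx] - centers[cluster], axis=1)
    return [queries[i] for i in idx[np.argsort(dists)[:n]]]
```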
## The common shape
All five share a pattern. Aggregate dashboards (retrieval p95, recall@k, token cost) return green because the transport layer succeeded. The content quality lives a layer deeper — in distributions, per-chunk behavior, per-response judgments, and coverage of your detection itself. You see the gap only when you instrument it directly:
- Distribution drift on embeddings, not just average score.
- Per-chunk end-quality metadata at index time, carried through to retrieval spans.
- Per-chunk fan-out across unrelated query clusters at review time.
- Retrieval confidence × response confidence × citation faithfulness, joined per trace.
- Eval-vs-production cluster coverage, reported weekly.
Four of the five can be built on OpenTelemetry GenAI semantic conventions (gen_ai.retrieval.*, gen_ai.response.*) plus whatever tracing backend you already use — Langfuse, Phoenix, LangSmith, Braintrust, or Datadog LLM Observability. None of the five require a bespoke ML platform.
The fifth — the eval blind spot — is the one that catches people out because it feels like a process problem, not an observability problem. It is both. A green dashboard you built in January is a dashboard that tested a distribution that no longer exists.
## If this was useful
These five failure modes are the kind that Observability for LLM Applications teaches you to instrument end-to-end. The book walks through OTel GenAI semconv for the transport layer (Ch 4), online evals and LLM-as-judge for the content layer (Ch 8–11), and the operational patterns — threshold setting, alerting, incident response — that turn the signals above into something your on-call rotation can actually use (Ch 15–18).
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.