Siddharth Pandey

Posted on Jun 6

Add a PASS/WARN/FAIL Quality Gate to Your RAG Pipeline in 30 Seconds

#rag #opensource #llm #devtools

You deployed a RAG chatbot. The answers are vague. You bump the LLM from GPT-3.5 to GPT-4. The answers are still vague. You double the chunk size. Still vague. You spend three hours tuning prompts. Still. Vague.

The real problem isn't the model. It's that your pipeline is retrieving 10 chunks and the LLM is only seeing 3 of them — and nothing in your logs tells you that.

What's Actually Breaking (and Why You Can't See It)

A RAG pipeline has at least two moving parts between a user query and an answer: a retrieval step that fetches relevant chunks from a vector store, and an LLM call that uses those chunks to generate a response.

The failure mode that kills most RAG quality work is invisible: chunks are retrieved, then silently discarded before they reach the LLM prompt.

This happens because of TOP_K. You set TOP_K=10 thinking more context is better. But your LLM has a token budget. The orchestration layer (LangChain, LlamaIndex, your custom code) fills the prompt until it hits the limit — and quietly drops whatever didn't fit. The LLM never saw chunks 4 through 10. Your logs show a successful retrieval. Your logs show a successful LLM call. Nothing reports that 70% of your retrieved context was thrown away.
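The silent drop can be made visible in the prompt-assembly step itself. The sketch below is a minimal illustration, not how LangChain or LlamaIndex implement truncation: `packChunks`, `estimateTokens`, the `Chunk` shape, and the 4-characters-per-token estimate are all assumptions made for the example.

```typescript
// Hypothetical chunk shape; real orchestration layers use their own types.
interface Chunk {
  id: string;
  text: string;
}

// Rough token estimate (assumption: ~4 characters per token).
// Swap in a real tokenizer for your model when accuracy matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Fill the prompt in rank order until the first chunk that does not fit,
// and return the dropped chunks instead of discarding them silently.
function packChunks(chunks: Chunk[], tokenBudget: number) {
  const kept: Chunk[] = [];
  const dropped: Chunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.text);
    // Once one chunk has been dropped, everything ranked below it is dropped too.
    if (dropped.length === 0 && used + cost <= tokenBudget) {
      kept.push(chunk);
      used += cost;
    } else {
      dropped.push(chunk);
    }
  }
  return { kept, dropped, used };
}

// Ten chunks of 400 characters each (~100 tokens), with a 300-token budget.
const retrieved: Chunk[] = Array.from({ length: 10 }, (_, i) => ({
  id: `chunk-${i + 1}`,
  text: "x".repeat(400),
}));
const packed = packChunks(retrieved, 300);
console.log(
  `kept ${packed.kept.length}/${retrieved.length}, dropped ${packed.dropped.length}`
);
// prints "kept 3/10, dropped 7"
```

Logging the `dropped` list next to the retrieval log is enough to turn the invisible 70% loss into a line you can grep for.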

There are three failure patterns that account for most bad RAG answers:

TOP_K too high. You retrieve 10 chunks, the LLM uses 3. The 7 you paid to embed, store, and retrieve contribute nothing. Worse: if the 3 that fit aren't the 3 most relevant, your answer quality is determined by which chunks happened to survive token truncation rather than which ones actually matched the query.

Near-duplicate chunks. Sliding-window chunking creates overlapping segments. If chunk 3 ends with "...chlorophyll to capture light and convert it into chemical energy" and chunk 4 starts with the same phrase, you've burned 30% of your context window repeating one sentence. The model sees it twice and may over-weight it.

Missing similarity scores. Some vector stores (notably Chroma with L2 distance) return raw distance values, not normalized [0, 1] similarity scores. Your retrieval logs show scores like 1.42 and 0.93 with no indication which is better. Without normalized scores you can't tune thresholds or understand ranking.
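For the third pattern, one common way to turn an unbounded distance into a bounded similarity is 1 / (1 + d). The sketch below uses that mapping as an assumption; it is not taken from Chroma or RAGScope, and `distanceToSimilarity` is a hypothetical helper name.

```typescript
// Map a non-negative distance (smaller is better) to a similarity in (0, 1]
// (larger is better). 1 / (1 + d) is one common choice, assumed here.
function distanceToSimilarity(distance: number): number {
  if (!Number.isFinite(distance) || distance < 0) {
    throw new RangeError(`expected a non-negative finite distance, got ${distance}`);
  }
  return 1 / (1 + distance);
}

// The raw values from the example above: after mapping, the ordering is explicit.
const rawDistances = [1.42, 0.93];
const similarities = rawDistances.map(distanceToSimilarity);
console.log(similarities.map((s) => s.toFixed(2)));
// prints [ '0.41', '0.52' ]
```

With the mapping applied, 0.93 is clearly the better match, and a threshold such as "keep chunks above 0.5" has a stable meaning.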

These are all measurable. You just need something to measure them.

One Command to Add a Quality Gate

npx ragscope start

That starts a local OTLP receiver on port 4321. Then point your pipeline at it with one environment variable:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4321

If you're using Traceloop or OpenLLMetry for instrumentation, that's all you need — they auto-instrument LangChain, LlamaIndex, OpenAI, Qdrant, and Cohere out of the box:

import * as traceloop from '@traceloop/node-server-sdk';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Export spans to the local RAGScope receiver started by `npx ragscope start`.
traceloop.initialize({
  exporter: new OTLPTraceExporter({ url: 'http://localhost:4321/v1/traces' }),
});

For Vercel AI SDK or custom pipelines, add two span attributes: one on the retrieval span listing your chunks, and one on the LLM span with the full prompt text. That's the minimum RAGScope needs to score a trace.
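For a custom pipeline, the two attributes might be attached roughly as follows. This is a sketch under stated assumptions: the attribute names `rag.retrieval.chunks` and `llm.prompt` are hypothetical placeholders (check the RAGScope README for the names it actually reads), and `SpanLike` is a minimal stand-in for the span type from `@opentelemetry/api`, which exposes the same `setAttribute(key, value)` method.

```typescript
// Minimal structural stand-in for an OpenTelemetry span.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
}

interface RetrievedChunk {
  id: string;
  text: string;
  score?: number;
}

// HYPOTHETICAL attribute names, chosen for illustration only.
const CHUNKS_ATTR = "rag.retrieval.chunks";
const PROMPT_ATTR = "llm.prompt";

// Attach the retrieved chunks to the retrieval span as a JSON string.
function recordRetrieval(span: SpanLike, chunks: RetrievedChunk[]): void {
  span.setAttribute(CHUNKS_ATTR, JSON.stringify(chunks));
}

// Attach the exact prompt text sent to the model to the LLM span.
function recordPrompt(span: SpanLike, promptText: string): void {
  span.setAttribute(PROMPT_ATTR, promptText);
}

// Usage with an in-memory fake span, so the sketch runs without any SDK.
const recorded = new Map<string, string | number | boolean>();
const fakeSpan: SpanLike = {
  setAttribute: (key, value) => recorded.set(key, value),
};
recordRetrieval(fakeSpan, [
  { id: "c1", text: "RAG combines retrieval and generation.", score: 0.82 },
]);
recordPrompt(
  fakeSpan,
  "Context: RAG combines retrieval and generation.\n\nQuestion: What is RAG?"
);
console.log(Array.from(recorded.keys()));
```

In a real pipeline you would pass the active retrieval and LLM spans instead of `fakeSpan`.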

Run your app, fire a few queries, and your terminal shows this:

  PASS  90/100  █████████░  my-rag-app
  │  "What is RAG?"
  │
  │  ✓  precision    90  █████████░  9/10 chunks used
  │  ✓  efficiency   80  ████████░░  20% tokens wasted
  │  ✓  uniqueness  100  ██████████  chunks are distinct
  │  ✓  coverage    100  ██████████  all chunks scored
  │

  WARN  54/100  █████░░░░░  my-rag-app
  │  "What is dense passage retrieval?"
  │
  │  ✗  precision    40  ████░░░░░░  4/10 chunks used
  │  ~  efficiency   50  █████░░░░░  50% tokens wasted
  │  ~  uniqueness   65  ███████░░░  2 near-duplicate pairs
  │  ✓  coverage    100  ██████████  all chunks scored
  │
  │  → Reduce TOP_K 10→4 (only 4 chunks reached LLM)
  │  → 50% of retrieved tokens never reached the LLM
  │  → 2 near-duplicate chunks — deduplicate at ingest time
  │

  ──────────────────────────────────────────────────
  Session  2 queries  ·  avg 72/100  ↓

Every trace is scored the moment it arrives. No dashboard to open, no query to run — just a PASS/WARN/FAIL with specific numbers sitting next to the query that produced them.

Reading the Score — What Each Number Means

Four sub-scores combine into a single 0–100 overall score. The weights reflect actual impact on answer quality.

Precision (40%): the fraction of retrieved chunks that appeared in the LLM prompt. This is weighted highest because a chunk that doesn't reach the LLM has zero value to the answer. It cost retrieval latency and vector-store bandwidth, and then got dropped. In the WARN example above, a score of 40 comes with the recommendation Reduce TOP_K 10→4 (only 4 chunks reached LLM): RAGScope tells you the exact number to set it to.
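A minimal sketch of a chunk-level precision check, assuming chunks are pasted into the prompt unmodified so that a substring test is enough. RAGScope's actual matching logic may differ, and `chunkPrecision` is a hypothetical name.

```typescript
// Fraction of retrieved chunks whose text appears verbatim in the final prompt.
// Assumption: chunks are inserted unmodified, so a substring check suffices.
function chunkPrecision(chunkTexts: string[], promptText: string) {
  const used = chunkTexts.filter((text) => promptText.includes(text)).length;
  const total = chunkTexts.length;
  const score = total === 0 ? 0 : Math.round((used / total) * 100);
  return { used, total, score };
}

// Ten retrieved chunks, of which only the first four made it into the prompt.
const chunkTexts = Array.from(
  { length: 10 },
  (_, i) => `[passage ${i + 1}] sample text`
);
const finalPrompt = [
  ...chunkTexts.slice(0, 4),
  "Question: What is dense passage retrieval?",
].join("\n");

const precisionResult = chunkPrecision(chunkTexts, finalPrompt);
console.log(
  `precision ${precisionResult.score} (${precisionResult.used}/${precisionResult.total} chunks used)`
);
// prints "precision 40 (4/10 chunks used)"
```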

Efficiency (30%): the fraction of retrieved tokens that reached the LLM. Low precision and low efficiency usually appear together, but they can diverge: if you retrieve one large chunk and two small ones and only the two small ones fit, precision is a decent 67% while most of the retrieved tokens never reached the LLM, so efficiency is low.
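To see how the two numbers can diverge, here is a sketch of a token-level efficiency calculation. The `ScoredChunk` shape, the `reachedPrompt` flag, and the 4-characters-per-token estimate are assumptions for the example, and the sketch treats each chunk as either fully included or fully dropped.

```typescript
interface ScoredChunk {
  text: string;
  reachedPrompt: boolean; // whether this chunk's text made it into the prompt
}

// Rough token estimate (assumption: ~4 characters per token).
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Share of retrieved tokens that reached the LLM, as a 0-100 score.
function tokenEfficiency(chunks: ScoredChunk[]): number {
  const total = chunks.reduce((sum, c) => sum + approxTokens(c.text), 0);
  const used = chunks
    .filter((c) => c.reachedPrompt)
    .reduce((sum, c) => sum + approxTokens(c.text), 0);
  return total === 0 ? 0 : Math.round((used / total) * 100);
}

// Two small chunks fit; one large chunk was dropped.
const mixed: ScoredChunk[] = [
  { text: "a".repeat(200), reachedPrompt: true },
  { text: "b".repeat(200), reachedPrompt: true },
  { text: "c".repeat(2000), reachedPrompt: false },
];
const chunkShare = Math.round(
  (mixed.filter((c) => c.reachedPrompt).length / mixed.length) * 100
);
console.log(`precision ${chunkShare}, efficiency ${tokenEfficiency(mixed)}`);
// prints "precision 67, efficiency 17"
```

Two of three chunks survived, but they carried only 100 of the 600 retrieved tokens, which is why the chunk count alone can look healthier than the token count.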

Uniqueness (20%) — how distinct your chunks are from each other. Computed from exact text overlap between adjacent chunks (sorted by retrieval rank). Score of 100 means all chunks are fully distinct. A score of 65 with 2 near-duplicate pairs means your chunking strategy is creating redundant segments — deduplicate at ingest time or increase your chunk step size.
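Adjacent-chunk overlap can be approximated by looking for the longest suffix of one chunk that is also a prefix of the next. The sketch below only counts near-duplicate pairs; the 0.3 threshold and the function names are assumptions, and it does not reproduce how RAGScope turns the pair count into a score.

```typescript
// Length of the longest suffix of `a` that is also a prefix of `b`.
function overlapLength(a: string, b: string): number {
  const max = Math.min(a.length, b.length);
  for (let len = max; len > 0; len--) {
    if (a.endsWith(b.slice(0, len))) return len;
  }
  return 0;
}

// Count adjacent pairs (in retrieval-rank order) whose overlap covers at least
// `threshold` of the shorter chunk. The 0.3 default is an arbitrary assumption.
function nearDuplicatePairs(chunks: string[], threshold = 0.3): number {
  let pairs = 0;
  for (let i = 0; i + 1 < chunks.length; i++) {
    const shorter = Math.min(chunks[i].length, chunks[i + 1].length);
    if (shorter > 0 && overlapLength(chunks[i], chunks[i + 1]) / shorter >= threshold) {
      pairs++;
    }
  }
  return pairs;
}

const shared = "chlorophyll to capture light and convert it into chemical energy";
const ranked = [
  `Plants use ${shared}`,
  `${shared}, which is stored as glucose.`,
  "Dense passage retrieval encodes queries and passages separately.",
];
console.log(nearDuplicatePairs(ranked));
// prints 1
```

Running the same check at ingest time, before chunks are embedded, is where a result like this is cheapest to act on.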

Coverage (10%) — whether your chunks carry non-zero similarity scores. This is a signal flag: if it fires, your vector store is returning raw values that couldn't be normalized, which means you also can't tune retrieval thresholds. RAGScope normalizes Chroma distances automatically, so this usually only fires when scores are genuinely missing from the trace.
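A coverage check can be as small as counting chunks that carry a usable score. This sketch assumes a hypothetical `RankedChunk` shape and treats a missing, null, zero, or non-finite score as "not scored"; the exact rule RAGScope applies may differ.

```typescript
interface RankedChunk {
  id: string;
  score?: number | null;
}

// Percentage of chunks that carry a finite, non-zero similarity score.
function scoreCoverage(chunks: RankedChunk[]): number {
  if (chunks.length === 0) return 0;
  const scored = chunks.filter(
    (c) => typeof c.score === "number" && Number.isFinite(c.score) && c.score !== 0
  ).length;
  return Math.round((scored / chunks.length) * 100);
}

const traceChunks: RankedChunk[] = [
  { id: "c1", score: 0.91 },
  { id: "c2", score: 0.84 },
  { id: "c3", score: 0.77 },
  { id: "c4" }, // score missing from the trace
];
console.log(scoreCoverage(traceChunks));
// prints 75
```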

The overall label maps to:

Score	Label	Meaning
≥ 75	PASS	Retrieval is healthy for this query
50–74	WARN	Issues present — review the recommendations
< 50	FAIL	Significant retrieval problems

Each WARN or FAIL comes with a concrete recommendation. Not "consider reducing TOP_K" — Reduce TOP_K 10→4 (only 4 chunks reached LLM). The actual number, derived from your actual trace.
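The weights and thresholds in this section are enough to reproduce the overall numbers from the terminal output by hand. The sketch below does that arithmetic; the function names and the rounding step are assumptions, while the weights (40/30/20/10) and the label cutoffs (75 and 50) come from the text above.

```typescript
interface SubScores {
  precision: number;
  efficiency: number;
  uniqueness: number;
  coverage: number;
}

// Weights in percent, as listed above: 40 / 30 / 20 / 10.
const WEIGHTS: SubScores = { precision: 40, efficiency: 30, uniqueness: 20, coverage: 10 };

// Weighted average of the four sub-scores, rounded to an integer (assumed rounding).
function overallScore(s: SubScores): number {
  const weighted =
    s.precision * WEIGHTS.precision +
    s.efficiency * WEIGHTS.efficiency +
    s.uniqueness * WEIGHTS.uniqueness +
    s.coverage * WEIGHTS.coverage;
  return Math.round(weighted / 100);
}

// Map a 0-100 score to the gate label from the table above.
function gateLabel(score: number): "PASS" | "WARN" | "FAIL" {
  if (score >= 75) return "PASS";
  if (score >= 50) return "WARN";
  return "FAIL";
}

// Sub-scores copied from the two example traces.
const healthy: SubScores = { precision: 90, efficiency: 80, uniqueness: 100, coverage: 100 };
const degraded: SubScores = { precision: 40, efficiency: 50, uniqueness: 65, coverage: 100 };

for (const s of [healthy, degraded]) {
  const score = overallScore(s);
  console.log(`${gateLabel(score)} ${score}/100`);
}
// prints "PASS 90/100" then "WARN 54/100"
```

For the WARN trace the sum is 16 + 15 + 13 + 10 = 54, which shows how much of the score is lost to precision alone.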

Conclusion

Adding a quality gate to your RAG pipeline takes one command and one environment variable. From that point, every query you run during development is scored — precision, efficiency, uniqueness, coverage — with specific recommendations when something is wrong.

You stop guessing whether a model upgrade or a prompt rewrite will fix the vague answers. You see whether the retrieval pipeline is the problem, and exactly where it's breaking.

RAGScope runs entirely locally. No accounts, no configuration files, no data leaving your machine. Trace data lives in memory for the session lifetime. It's the same category of tool as a linter: runs while you build, catches problems before users see them.

Try it on your next RAG session: GitHub · npm

Key Takeaways

Most RAG quality problems are retrieval mechanics problems, not model problems — and they're invisible without tracing
TOP_K too high is the most common culprit: chunks are retrieved, then silently dropped before the LLM prompt is assembled
npx ragscope start + one env var adds a live PASS/WARN/FAIL score to every query during development
Precision (40% weight) measures chunk utilization; efficiency (30%) measures token utilization — both usually fix with TOP_K reduction
Near-duplicate chunks from sliding-window chunking waste context window space and can bias model outputs

Top comments (12)

Max Quimby • Jun 10

"Your logs show a successful retrieval, your logs show a successful LLM call, nothing reports that 70% of your context was thrown away" — that sentence belongs on a poster in every RAG team's room. The silent truncation between TOP_K and what actually lands in the prompt is responsible for more "the model is dumb" bug reports than the model ever is.

One thing I'd push on: your precision metric counts chunks that reached the LLM, which is already a huge step up — but there's a sneakier gap between "chunk made it into the prompt" and "the model actually used it." We've seen prompts where all 10 chunks fit, yet the answer leans entirely on chunk #1 because the fact in chunk #7 was buried mid-context (lost-in-the-middle). Is there a path toward attribution — tying the generated answer back to which chunks it actually drew from? That's the metric I keep wanting and never quite have.

Also strongly agree on deduping at ingest rather than retrieval — fixing sliding-window overlap downstream is whack-a-mole.

Siddharth Pandey • Jun 13

Ha — I might print that poster. Glad it landed.

You've named the exact line RAGScope doesn't cross yet. Today precision asks "did the chunk land in the prompt?", which is deterministic, with no model in the loop. You're asking the next layer up: "did the chunk influence the answer?" Different question. Lost-in-the-middle nails it: chunk #7 clears every bar I measure and still contributes nothing. Real attribution needs a signal I can't get from spans alone: either token logprobs/attention (almost nobody emits these over OTel) or an LLM-judge pass (doable, but that turns a fast, model-free mechanics tool into a noisy eval one). The cheap proxy I can do today is position-aware scoring. I already know each chunk's rank and where it lands in the prompt, so I can flag "N chunks buried past the lost-in-the-middle zone". It's not attribution, but it surfaces the same failure with zero extra calls. I'm leaning toward shipping that and keeping true attribution as an opt-in eval layer. Would a "buried context" warning scratch the itch, or is it specifically the answer→chunk linkage you want?
And 100% on dedup at ingest — retrieval-time dedup is whack-a-mole. Fix the chunker, not the symptom.

Tae Kim • Jun 6

The silent-drop pattern bit me on a Graph RAG over trade news. Retriever returned 12 chunks, the LLM only ever saw 4 — I caught it by logging the actual character count of context that reached the prompt vs what came out of retrieval, and the gap was huge. Normalizing Chroma's L2 distances into a [0,1] band was the other thing that finally made my reranker thresholds tunable instead of guesswork.

Siddharth Pandey • Jun 8 • Edited

The silent drop is one of those bugs that makes you question your entire setup. 12 chunks come out of retrieval, 4 reach the LLM, and the model just casually acts like nothing happened. The character count delta approach is a genuinely sharp debugging instinct, though. Most people chase cosine scores for weeks and never think to just measure the bytes going in versus what actually landed in the prompt. The L2 normalization thing is real too. Raw Chroma distances floating in unbounded space are basically useless for threshold tuning, and I spent way too long wondering why my reranker thresholds felt like astrology before I normalized them. RAGScope tracks both of these under the hood, so if you ever run it on the trade news pipeline I would genuinely love to see what the coverage and precision numbers look like.

Alex Shev • Jun 9

PASS/WARN/FAIL is a good interface for RAG quality. Teams do not just need a score; they need a decision point that tells the product whether to answer, caveat, or stop.

Siddharth Pandey • Jun 13

Exactly — and that's the line I keep in mind: RAGScope's PASS/WARN/FAIL is a dev-time gate (fix it before you ship), but you're describing a runtime decision point (answer / caveat / stop on the live query). Same interface, different clock. The dev gate is deterministic retrieval mechanics; the runtime one has to fold in answer confidence too. Both belong in a mature stack — would love to see the threshold logic shared between them.

Alex Shev • Jun 13

Yes, and the shared threshold logic is the part that gets tricky. A dev-time gate can be strict because it blocks a build; a runtime gate has to preserve user experience while still being honest about uncertainty. I like the idea of keeping the policy vocabulary the same, but tuning the action separately: fail the pipeline in CI, caveat or refuse in production.

Siddharth Pandey • Jun 14

You've drawn the boundary cleaner than I did — the shared part is the vocabulary, not the thresholds. But I don't want RAGScope to grow the runtime gate at all. The mental model I keep landing on is Lighthouse: it scores and audits your page at build time, gives you a verdict, and nobody mistakes it for production RUM. RAGScope is that for RAG retrieval — a build-time quality gate. The runtime side is already well-served (Langfuse, LangChain tracing, OTLP), and RAGScope reads from that layer rather than competing with it.

It can absolutely get deeper — more metrics, regression budgets in CI, scoring trends over time — but that's growth within the dev tool, not a jump to the production side. And the two genuinely measure different things: dev-time is pure retrieval mechanics, no model in the loop; runtime has to fold in answer confidence. So they can legitimately disagree on the same trace — which is exactly why you tune the action, not the policy: fail the build in CI, caveat or refuse in prod. Same vocab, different clock, different tool.

Alex Shev • Jun 14

That Lighthouse analogy makes the product boundary much clearer. A build-time retrieval audit should stay opinionated and reproducible; the live answer gate has too many runtime variables to pretend it is the same system. I would still want the CI side to export enough history that teams can see drift over time, because retrieval quality often decays slowly before anyone notices it in production.

Siddharth Pandey • Jun 15

I would build the historical comparison on a temporal graph, but that's for the future, once I have some traction and real users have started using it. The problem is discoverability and adoption for this kind of CLI tool. I have 13+ years of experience and have released packages in the past, but I only recently started writing about my work; I had never written about what I'm doing for OSS. 😃

Alex Shev • Jun 15

That makes sense. Discoverability is usually the hardest part for small CLI tools because the user has to feel the pain before they search for the fix.

I would keep writing around concrete failure cases: "a RAG answer shipped without a retrieval quality gate" is easier to remember than "a CLI for evals." The temporal graph angle is interesting, but traction probably comes first from a narrow before/after workflow.

Alex Shev • Jun 14

That distinction makes sense: dev-time gates catch retrieval quality before users see it, while runtime confidence has to decide how the answer should behave in front of a user. I like keeping both. A WARN in CI should improve the pipeline; a WARN at runtime should change the UX, not pretend certainty.
