
Siddharth Pandey


Add a PASS/WARN/FAIL Quality Gate to Your RAG Pipeline in 30 Seconds

You deployed a RAG chatbot. The answers are vague. You bump the LLM from GPT-3.5 to GPT-4. The answers are still vague. You double the chunk size. Still vague. You spend three hours tuning prompts. Still. Vague.

The real problem isn't the model. It's that your pipeline is retrieving 10 chunks and the LLM is only seeing 3 of them — and nothing in your logs tells you that.


What's Actually Breaking (and Why You Can't See It)

A RAG pipeline has at least two moving parts between a user query and an answer: a retrieval step that fetches relevant chunks from a vector store, and an LLM call that uses those chunks to generate a response.

The failure mode that kills most RAG quality work is invisible: chunks are retrieved, then silently discarded before they reach the LLM prompt.

This happens because of TOP_K. You set TOP_K=10 thinking more context is better. But your LLM has a token budget. The orchestration layer (LangChain, LlamaIndex, your custom code) fills the prompt until it hits the limit — and quietly drops whatever didn't fit. The LLM never saw chunks 4 through 10. Your logs show a successful retrieval. Your logs show a successful LLM call. Nothing reports that 70% of your retrieved context was thrown away.
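To make the failure mode concrete, here is a minimal sketch of the kind of greedy prompt packing described above, written so that it reports what it drops. The names `Chunk`, `packChunks`, and `estimateTokens` are made up for this example, and the one-token-per-word estimate is a rough assumption; real frameworks use their own tokenizers and truncation rules.

```typescript
// Illustrative only: a greedy prompt packer that reports what it drops.
// The types and function names are assumptions for this sketch, not the
// API of LangChain, LlamaIndex, or any other framework.
interface Chunk {
  id: string;
  text: string;
}

interface PackResult {
  included: Chunk[];
  dropped: Chunk[];
}

// Rough token estimate: one token per whitespace-separated word.
// A real pipeline would use the model's tokenizer instead.
function estimateTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Add chunks in rank order until the next one would exceed the budget;
// that chunk and everything ranked after it is dropped.
function packChunks(chunks: Chunk[], tokenBudget: number): PackResult {
  let used = 0;
  let cut = 0;
  while (cut < chunks.length) {
    const cost = estimateTokens(chunks[cut].text);
    if (used + cost > tokenBudget) break;
    used += cost;
    cut++;
  }
  return { included: chunks.slice(0, cut), dropped: chunks.slice(cut) };
}

// Ten retrieved chunks of 100 words each against a 300-token budget.
const retrieved: Chunk[] = Array.from({ length: 10 }, (_, n) => ({
  id: `chunk-${n + 1}`,
  text: Array(100).fill("word").join(" "),
}));

const { included, dropped } = packChunks(retrieved, 300);
console.log(`${included.length} chunks reached the prompt, ${dropped.length} dropped`);
```

With these inputs the last line prints `3 chunks reached the prompt, 7 dropped`. The point is the `dropped` array: a packer that returns only `included` gives your logs nothing to report.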

There are three failure patterns that account for most bad RAG answers:

TOP_K too high. You retrieve 10 chunks, the LLM uses 3. The 7 you paid to embed, store, and retrieve contribute nothing. Worse: if the 3 that fit aren't the 3 most relevant, your answer quality is determined by which chunks happened to survive token truncation rather than which ones actually matched the query.

Near-duplicate chunks. Sliding-window chunking creates overlapping segments. If chunk 3 ends with "...chlorophyll to capture light and convert it into chemical energy" and chunk 4 starts with the same phrase, you pay for that sentence twice in your context window. Repeat that across every adjacent pair and a sizable share of the window goes to duplicated text. The model also sees the sentence twice and may over-weight it.

Missing similarity scores. Some vector stores (notably Chroma with L2 distance) return raw distance values, not normalized [0, 1] similarity scores. Your retrieval logs show scores like 1.42 and 0.93 with no indication which is better. (For a distance, the lower value is the closer match, the opposite of a similarity score.) Without normalized scores you can't tune thresholds or understand ranking.
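For the distance problem, one common convention is to map a non-negative distance d to 1 / (1 + d), which lands in (0, 1] and flips the ordering so that higher means closer. The sketch below assumes that mapping purely for illustration; it is not taken from Chroma or RAGScope, and the function name is made up.

```typescript
// Map a non-negative distance (lower = closer) to a similarity in (0, 1]
// (higher = closer). The 1 / (1 + d) mapping is an assumption chosen for
// illustration: it preserves ranking but is not a calibrated probability.
function distanceToSimilarity(distance: number): number {
  if (!Number.isFinite(distance) || distance < 0) {
    throw new RangeError(`expected a non-negative finite distance, got ${distance}`);
  }
  return 1 / (1 + distance);
}

// The two raw values from the log example above.
const rawDistances = [1.42, 0.93];
const similarities = rawDistances.map(distanceToSimilarity);

// similarities[1] > similarities[0]: the 0.93 distance is the closer match.
console.log(similarities.map((s) => s.toFixed(2)));
```

After the mapping, the 0.93 distance scores higher than the 1.42 distance, which is the ordering a threshold or a ranking check expects.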

These are all measurable. You just need something to measure them.
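As one example of how measurable, the near-duplicate pattern can be checked by comparing the end of each chunk with the start of the next. This is a minimal sketch under stated assumptions: the function names and the 20-character threshold are invented for illustration, and a real tool may define "near-duplicate" differently.

```typescript
// Length of the longest suffix of `a` that is also a prefix of `b`.
// Brute force, which is fine for chunk-sized strings.
function overlapLength(a: string, b: string): number {
  const max = Math.min(a.length, b.length);
  for (let len = max; len > 0; len--) {
    if (a.endsWith(b.slice(0, len))) return len;
  }
  return 0;
}

// Returns each index i where chunk i+1 starts by repeating at least
// `minChars` characters from the end of chunk i. The default threshold
// is an arbitrary value picked for this sketch.
function findOverlappingPairs(chunks: string[], minChars = 20): number[] {
  const hits: number[] = [];
  for (let i = 0; i + 1 < chunks.length; i++) {
    if (overlapLength(chunks[i], chunks[i + 1]) >= minChars) hits.push(i);
  }
  return hits;
}

const shared = "chlorophyll to capture light and convert it into chemical energy";
const chunks = [
  "Photosynthesis happens in the chloroplasts.",
  `Plants use ${shared}`,
  `${shared}. This energy is stored as glucose.`,
];

// Only the second and third chunks overlap, so the result is [1].
const overlappingPairs = findOverlappingPairs(chunks);
console.log(overlappingPairs);
```

Running a check like this at ingest time, before chunks are embedded, is cheaper than discovering the duplication in the prompt.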


One Command to Add a Quality Gate

npx ragscope start

That starts a local OTLP receiver on port 4321. Then point your pipeline at it with one environment variable:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4321

If you're using Traceloop or OpenLLMetry for instrumentation, that's all you need — they auto-instrument LangChain, LlamaIndex, OpenAI, Qdrant, and Cohere out of the box:

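import * as traceloop from '@traceloop/node-server-sdk';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// The Node SDK exposes initialize(); Traceloop.init() is the Python SDK's entry point.
// Passing an exporter sends spans to the local RAGScope receiver.
traceloop.initialize({
  exporter: new OTLPTraceExporter({ url: 'http://localhost:4321/v1/traces' }),
});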

For Vercel AI SDK or custom pipelines, add two span attributes: one on the retrieval span listing your chunks, and one on the LLM span with the full prompt text. That's the minimum RAGScope needs to score a trace.
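The shape of that manual instrumentation might look like the sketch below. The attribute keys are placeholders invented for this example, since the names RAGScope actually reads are not given here; check its documentation and substitute the real ones. Serializing the chunk list as JSON is also an assumption. `AttributeSink` is a minimal structural type that an OpenTelemetry `Span` satisfies through its `setAttribute(key, value)` method, so the helpers run without any SDK installed.

```typescript
// Minimal structural type: an OpenTelemetry Span satisfies it because
// Span exposes setAttribute(key, value).
interface AttributeSink {
  setAttribute(key: string, value: string | number | boolean): unknown;
}

// PLACEHOLDER keys, invented for this sketch. Replace them with the
// attribute names that RAGScope's documentation specifies.
const RETRIEVAL_CHUNKS_KEY = "example.retrieval.chunks";
const LLM_PROMPT_KEY = "example.llm.prompt";

interface RetrievedChunk {
  text: string;
  score?: number;
}

// Attach the retrieved chunks to the retrieval span. The list is
// serialized to a JSON string here (an assumption about the format).
function annotateRetrievalSpan(span: AttributeSink, chunks: RetrievedChunk[]): void {
  span.setAttribute(RETRIEVAL_CHUNKS_KEY, JSON.stringify(chunks));
}

// Attach the full prompt text to the LLM span.
function annotateLlmSpan(span: AttributeSink, prompt: string): void {
  span.setAttribute(LLM_PROMPT_KEY, prompt);
}

// Stand-in span that records attributes in memory, used only for this demo.
const recorded = new Map<string, string | number | boolean>();
const demoSpan: AttributeSink = { setAttribute: (key, value) => recorded.set(key, value) };

annotateRetrievalSpan(demoSpan, [{ text: "RAG combines retrieval with generation.", score: 0.82 }]);
annotateLlmSpan(demoSpan, "Context: RAG combines retrieval with generation.\n\nQuestion: What is RAG?");
```

In a real pipeline you would pass the active retrieval and LLM spans from your tracer in place of `demoSpan`.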

Run your app, fire a few queries, and your terminal shows this:

  PASS  90/100  █████████░  my-rag-app
  │  "What is RAG?"
  │
  │  ✓  precision    90  █████████░  9/10 chunks used
  │  ✓  efficiency   80  ████████░░  20% tokens wasted
  │  ✓  uniqueness  100  ██████████  chunks are distinct
  │  ✓  coverage    100  ██████████  all chunks scored
  │

  WARN  54/100  █████░░░░░  my-rag-app
  │  "What is dense passage retrieval?"
  │
  │  ✗  precision    40  ████░░░░░░  4/10 chunks used
  │  ~  efficiency   50  █████░░░░░  50% tokens wasted
  │  ~  uniqueness   65  ███████░░░  2 near-duplicate pairs
  │  ✓  coverage    100  ██████████  all chunks scored
  │
  │  → Reduce TOP_K 10→4 (only 4 chunks reached LLM)
  │  → 50% of retrieved tokens never reached the LLM
  │  → 2 near-duplicate chunks — deduplicate at ingest time
  │

  ──────────────────────────────────────────────────
  Session  2 queries  ·  avg 72/100  ↓

Every trace is scored the moment it arrives. No dashboard to open, no query to run — just a PASS/WARN/FAIL with specific numbers sitting next to the query that produced them.


Reading the Score — What Each Number Means

Four sub-scores combine into a single 0–100 overall score. The weights reflect actual impact on answer quality.

Precision (40%): what fraction of retrieved chunks appeared in the LLM prompt. This is weighted highest because a chunk that doesn't reach the LLM has zero value to the answer. It cost retrieval latency and vector store bandwidth, and then got dropped. A score of 40 comes with the recommendation "Reduce TOP_K 10→4 (only 4 chunks reached LLM)", so RAGScope tells you the exact number to set it to.

Efficiency (30%): what fraction of retrieved tokens the LLM actually consumed. Low precision and low efficiency usually appear together, but they can diverge when chunk sizes are uneven: if you retrieve two short chunks and one long one and only the short ones fit, precision is a decent 2 out of 3 while efficiency is much lower, because most of the retrieved tokens were in the chunk that got dropped.

Uniqueness (20%): how distinct your chunks are from each other. It is computed from exact text overlap between adjacent chunks (sorted by retrieval rank). A score of 100 means all chunks are fully distinct. A score of 65 with 2 near-duplicate pairs means your chunking strategy is creating redundant segments; deduplicate at ingest time or increase your chunk step size.

Coverage (10%): whether your chunks carry non-zero similarity scores. Treat it as a warning flag: a low score means your vector store is returning raw values that couldn't be normalized, or no scores at all, so you also can't tune retrieval thresholds. RAGScope normalizes Chroma distances automatically, so this usually only drops when scores are genuinely missing from the trace.
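The first two definitions translate almost directly into code. The sketch below follows the wording above (chunks in the prompt over chunks retrieved, tokens in the prompt over tokens retrieved); the type and function names are invented, `reachedPrompt` is assumed to be known already, and scoring an empty retrieval as 0 is an arbitrary choice. It is not RAGScope's implementation.

```typescript
// One retrieved chunk as seen in a trace. `reachedPrompt` is assumed to be
// known already; how a tool detects it is outside this sketch.
interface TracedChunk {
  tokens: number;
  reachedPrompt: boolean;
}

// Share of retrieved chunks that made it into the prompt, on a 0-100 scale.
function precisionScore(chunks: TracedChunk[]): number {
  if (chunks.length === 0) return 0; // nothing retrieved: scored 0 by assumption
  const used = chunks.filter((c) => c.reachedPrompt).length;
  return (used * 100) / chunks.length;
}

// Share of retrieved tokens that made it into the prompt, on a 0-100 scale.
function efficiencyScore(chunks: TracedChunk[]): number {
  const total = chunks.reduce((sum, c) => sum + c.tokens, 0);
  if (total === 0) return 0; // no tokens retrieved: scored 0 by assumption
  const used = chunks
    .filter((c) => c.reachedPrompt)
    .reduce((sum, c) => sum + c.tokens, 0);
  return (used * 100) / total;
}

// Two short chunks fit in the prompt; one long chunk is dropped.
const unevenTrace: TracedChunk[] = [
  { tokens: 50, reachedPrompt: true },
  { tokens: 50, reachedPrompt: true },
  { tokens: 400, reachedPrompt: false },
];

const unevenPrecision = precisionScore(unevenTrace);   // 2 of 3 chunks used
const unevenEfficiency = efficiencyScore(unevenTrace); // 100 of 500 tokens used
console.log(unevenPrecision.toFixed(0), unevenEfficiency.toFixed(0));
```

Here precision comes out near 67 while efficiency is 20, which is the kind of split you get when most of the retrieved tokens sit in a chunk that never reached the prompt.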

The overall label maps to:

Score    Label   Meaning
≥ 75     PASS    Retrieval is healthy for this query
50–74    WARN    Issues present; review the recommendations
< 50     FAIL    Significant retrieval problems
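Putting the weights and thresholds together, the overall score can be sketched as a plain weighted sum. That formula is an assumption, as is rounding to the nearest integer, but it does reproduce both sample traces shown earlier (90 and 54). The names below are invented for this example.

```typescript
interface SubScores {
  precision: number;
  efficiency: number;
  uniqueness: number;
  coverage: number;
}

// Weights in percent, as listed in this section.
const WEIGHTS: SubScores = { precision: 40, efficiency: 30, uniqueness: 20, coverage: 10 };

// Weighted sum of the four sub-scores, rounded to the nearest integer.
// Both the linear combination and the rounding are assumptions.
function overallScore(s: SubScores): number {
  const weighted =
    s.precision * WEIGHTS.precision +
    s.efficiency * WEIGHTS.efficiency +
    s.uniqueness * WEIGHTS.uniqueness +
    s.coverage * WEIGHTS.coverage;
  return Math.round(weighted / 100);
}

type Label = "PASS" | "WARN" | "FAIL";

// Thresholds from the table above: 75 and up passes, 50 to 74 warns.
function labelFor(score: number): Label {
  if (score >= 75) return "PASS";
  if (score >= 50) return "WARN";
  return "FAIL";
}

// Sub-scores copied from the two sample traces.
const passTrace = overallScore({ precision: 90, efficiency: 80, uniqueness: 100, coverage: 100 });
const warnTrace = overallScore({ precision: 40, efficiency: 50, uniqueness: 65, coverage: 100 });
console.log(labelFor(passTrace), passTrace, labelFor(warnTrace), warnTrace);
```

The arithmetic for the second trace is 40×0.4 + 50×0.3 + 65×0.2 + 100×0.1 = 16 + 15 + 13 + 10 = 54, which lands in the WARN band.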

Each WARN or FAIL comes with a concrete recommendation. Not "consider reducing TOP_K" — Reduce TOP_K 10→4 (only 4 chunks reached LLM). The actual number, derived from your actual trace.


Conclusion

Adding a quality gate to your RAG pipeline takes one command and one environment variable. From that point, every query you run during development is scored — precision, efficiency, uniqueness, coverage — with specific recommendations when something is wrong.

You stop guessing whether a model upgrade or a prompt rewrite will fix the vague answers. You see whether the retrieval pipeline is the problem, and exactly where it's breaking.

RAGScope runs entirely locally. No accounts, no configuration files, no data leaving your machine. Trace data lives in memory for the session lifetime. It's the same category of tool as a linter: runs while you build, catches problems before users see them.

Try it on your next RAG session: GitHub · npm


Key Takeaways

  • Most RAG quality problems are retrieval mechanics problems, not model problems — and they're invisible without tracing
  • TOP_K too high is the most common culprit: chunks are retrieved, then silently dropped before the LLM prompt is assembled
  • npx ragscope start + one env var adds a live PASS/WARN/FAIL score to every query during development
  • Precision (40% weight) measures chunk utilization; efficiency (30%) measures token utilization; reducing TOP_K usually fixes both
  • Near-duplicate chunks from sliding-window chunking waste context window space and can bias model outputs
