Siddharth Pandey

How RAGScope Knows Which Chunks Your LLM Actually Used

Your retriever fetched 10 chunks. Your LLM only used 3. RAGScope shows a precision score of 30 out of 100. The question every new user asks: how does it know?

There is no OpenTelemetry attribute that says "this chunk was in the context window." RAGScope infers it — and the way it does this is the most consequential piece of engineering in the whole tool.


There Is No "In Context" Attribute in OTel

The OpenTelemetry semantic conventions for generative AI (gen_ai.*) define attributes for model, input/output tokens, and retrieved documents. They do not define anything like gen_ai.chunk.reached_llm or gen_ai.retrieval.used_document_ids.

When your RETRIEVER span fires, you get a list of documents. When your LLM span fires, you get a prompt and a completion. The two spans belong to the same trace, but no attribute maps which retrieved documents appear in which prompt.

This gap matters. A reranker might drop 7 of your 10 chunks. Your application code might then apply a token budget and cut more of the 3 that remain. From the trace alone, you cannot tell which chunks survived.
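
To make the gap concrete, here is a minimal sketch of the kind of post-retrieval filtering that happens outside any span attribute. The names `Doc`, `rerankTopN`, and `fitToBudget`, and the 4-characters-per-token estimate, are hypothetical and not part of RAGScope or any framework.

```typescript
// Hypothetical post-retrieval steps; none of these names come from RAGScope.
interface Doc {
  id: string;
  content: string;
  score: number;
}

// Keep the n highest-scoring documents (a stand-in for a reranker).
function rerankTopN(docs: Doc[], n: number): Doc[] {
  return docs.slice().sort((a, b) => b.score - a.score).slice(0, n);
}

// Rough token estimate: assume about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep documents in order until the token budget is exhausted.
function fitToBudget(docs: Doc[], maxTokens: number): Doc[] {
  const kept: Doc[] = [];
  let usedTokens = 0;
  for (const doc of docs) {
    const cost = estimateTokens(doc.content);
    if (usedTokens + cost > maxTokens) break;
    kept.push(doc);
    usedTokens += cost;
  }
  return kept;
}

// Ten retrieved documents of 40 characters each (10 estimated tokens apiece).
const retrieved: Doc[] = [];
for (let i = 0; i < 10; i++) {
  retrieved.push({
    id: `doc-${i}`,
    content: `chunk-${i}:` + "x".repeat(32),
    score: 1 - i * 0.05,
  });
}

const reranked = rerankTopN(retrieved, 5); // the reranker drops 5
const inPrompt = fitToBudget(reranked, 30); // the budget fits only 3

console.log(retrieved.length, reranked.length, inPrompt.length); // 10 5 3
```

The retriever span would still list all ten documents. Nothing in it records that only three made it into the prompt.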

RAGScope needs this information to compute the precision sub-score — the highest-weighted metric at 40% of the overall score. Getting it wrong would make precision meaningless.


The Substring Match — How assembleContext Works

RAGScope's answer is in src/enrichment/pipeline.ts, in a function called assembleContext:

function assembleContext(chunks: RagChunk[], llmSpans: ParsedSpan[]): RagChunk[] {
  const llmPrompts = llmSpans.map((s) => s.prompt).filter((p): p is string => !!p);
  if (llmPrompts.length === 0) return chunks;

  let position = 0;
  return chunks.map((chunk) => {
    if (!chunk.content) return chunk;
    const inContext = llmPrompts.some((p) => p.includes(chunk.content!));
    if (inContext) {
      return { ...chunk, inContext: true, contextPosition: position++ };
    }
    return { ...chunk, inContext: false, contextPosition: null };
  });
}

The approach: collect the raw prompt strings from every LLM span in the trace, then check whether each chunk's content appears as a literal substring of any of those prompts.

RAGScope needs two things from the trace: the prompt on the LLM span's input attribute, which TraceAI, Traceloop, and OpenTelemetry's gen_ai conventions all record, and the chunk content in gen_ai.retrieval.documents on the retriever span. With both present, it has everything it needs.

The contextPosition counter assigns an incrementing index, starting at 0, to each in-context chunk in the order chunks.map() visits them. That order is retrieval rank, not position in the prompt. So contextPosition tells you which retrieved chunks are in context and how they rank relative to one another, but not where each one sits in the prompt.
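
The rank-versus-prompt-order distinction is easy to see on toy data. The sketch below re-implements the matching loop with simplified stand-ins (a `Chunk` interface and a plain string array in place of `RagChunk` and `ParsedSpan`, both assumed shapes) and feeds it a prompt that lists the chunks in a different order than they were retrieved.

```typescript
// Simplified stand-ins for RAGScope's RagChunk and ParsedSpan (assumed shapes).
interface Chunk {
  id: string;
  content?: string;
  inContext?: boolean;
  contextPosition?: number | null;
}

// Same matching logic as assembleContext, over plain prompt strings.
function markInContext(chunks: Chunk[], prompts: string[]): Chunk[] {
  if (prompts.length === 0) return chunks;
  let nextPosition = 0;
  return chunks.map((chunk) => {
    const content = chunk.content;
    if (!content) return chunk;
    const found = prompts.some((p) => p.includes(content));
    return found
      ? { ...chunk, inContext: true, contextPosition: nextPosition++ }
      : { ...chunk, inContext: false, contextPosition: null };
  });
}

// Retrieval order: a, b, c. The prompt contains c first, then a; b was dropped.
const rankedChunks: Chunk[] = [
  { id: "a", content: "Paris is the capital of France." },
  { id: "b", content: "Berlin is the capital of Germany." },
  { id: "c", content: "Rome is the capital of Italy." },
];
const promptTexts = [
  "Context:\nRome is the capital of Italy.\nParis is the capital of France.\n\nQuestion: ...",
];

const marked = markInContext(rankedChunks, promptTexts);
for (const c of marked) {
  console.log(c.id, c.inContext, c.contextPosition);
}
// a true 0
// b false null
// c true 1
```

Chunk a gets position 0 even though chunk c appears first in the prompt, because the counter advances in retrieval order.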

Why substring matching works

Frameworks like LangChain and LlamaIndex build LLM prompts by concatenating retrieved chunk contents, often wrapped in minimal formatting like Context:\n{chunk}\n. The chunk text itself is usually present verbatim. As long as the chunk content recorded on the RETRIEVER span matches what was injected into the prompt string — which it does when both come from the same retrieval call — substring matching is reliable.
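
A quick sketch of that condition, using a hypothetical `buildPrompt` template rather than any framework's actual one: verbatim injection matches, while content that was altered between the retriever span and the prompt (here, collapsed whitespace) does not.

```typescript
// Hypothetical prompt template; real frameworks use their own formats.
function buildPrompt(question: string, chunkTexts: string[]): string {
  const context = chunkTexts.map((t) => `Context:\n${t}\n`).join("");
  return `${context}\nQuestion: ${question}`;
}

// Chunk content as it would be recorded on the retriever span.
const recorded = "RAGScope scores\nretrieval precision.";

// Case 1: the prompt receives exactly what the retriever span recorded.
const verbatimPrompt = buildPrompt("What does RAGScope score?", [recorded]);
const verbatimMatch = verbatimPrompt.includes(recorded);

// Case 2: the application collapses whitespace before injecting the chunk.
const normalized = recorded.replace(/\s+/g, " ");
const alteredPrompt = buildPrompt("What does RAGScope score?", [normalized]);
const alteredMatch = alteredPrompt.includes(recorded);

console.log(verbatimMatch, alteredMatch); // true false
```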

The constraint: chunk.content must be non-empty and non-null. RAGScope only stores content when the RETRIEVER span includes it in the documents array. If your instrumentation omits content and only records chunk IDs, assembleContext cannot match, and precision will read 0% until content is included.
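
One way to catch this before reading a misleading score is to count how many chunks carry content at all. `countMatchable` below is a hypothetical helper, not part of RAGScope, and the `RetrievedChunk` shape is an assumption.

```typescript
// Assumed minimal chunk shape; RAGScope's RagChunk has more fields.
interface RetrievedChunk {
  id: string;
  content?: string | null;
}

// Hypothetical helper: how many chunks can substring matching even consider?
function countMatchable(chunks: RetrievedChunk[]): number {
  return chunks.filter(
    (c) => typeof c.content === "string" && c.content.length > 0
  ).length;
}

// Instrumentation that records only IDs.
const idOnly: RetrievedChunk[] = [{ id: "doc-1" }, { id: "doc-2" }];

// Instrumentation that records content, with one empty entry.
const withContent: RetrievedChunk[] = [
  { id: "doc-1", content: "First passage." },
  { id: "doc-2", content: "" },
];

console.log(countMatchable(idOnly)); // 0
console.log(countMatchable(withContent)); // 1
```

A result of 0 on a trace that clearly retrieved documents points at the instrumentation, not at the retriever.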


What This Means for Your Precision Score

scoreRetrieval in src/audit/scorer.ts uses the inContext flag directly:

function scoreRetrieval(chunks: RagChunk[]): SubScore {
  const used = chunks.filter((c) => c.inContext).length;
  const score = Math.round((used / chunks.length) * 100);
  return {
    name: 'precision',
    score,
    symbol: symbol(score),
    finding: `${used}/${chunks.length} chunks used`,
    recommendation:
      score < 60
        ? `Reduce TOP_K ${chunks.length}→${Math.max(used, 3)} (only ${used} chunks reached LLM)`
        : null,
  };
}

If 3 of 10 chunks appear in the LLM prompt, precision = 30. The recommendation fires automatically: Reduce TOP_K 10→3. Precision contributes 40% of the overall score, so if the overall is a plain weighted sum, a 30 here caps it at 0.4 × 30 + 0.6 × 100 = 72, even if efficiency, redundancy, and coverage are perfect.
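
The precision arithmetic is small enough to check by hand. The sketch below isolates the score and the suggested TOP_K from scoreRetrieval into a hypothetical `precisionSummary` helper (the name and return shape are assumptions) so the 60-point threshold and the `Math.max(used, 3)` floor are visible.

```typescript
// Hypothetical helper isolating the arithmetic shown in scoreRetrieval.
interface PrecisionSummary {
  score: number;
  suggestedTopK: number | null; // null when no recommendation fires
}

function precisionSummary(used: number, total: number): PrecisionSummary {
  const score = Math.round((used / total) * 100);
  // Below 60, suggest shrinking TOP_K to the used count, but never below 3.
  const suggestedTopK = score < 60 ? Math.max(used, 3) : null;
  return { score, suggestedTopK };
}

console.log(precisionSummary(3, 10)); // { score: 30, suggestedTopK: 3 }
console.log(precisionSummary(1, 10)); // { score: 10, suggestedTopK: 3 }
console.log(precisionSummary(5, 10)); // { score: 50, suggestedTopK: 5 }
console.log(precisionSummary(6, 10)); // { score: 60, suggestedTopK: null }
```

Note the floor: even when a single chunk reached the LLM, the suggested TOP_K is 3, not 1.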

This is the most common cause of FAIL scores: teams set TOP_K=10 during early experimentation and never reduce it. Ten chunks get retrieved. Three reach the LLM. The other seven waste token budget and push the efficiency score down too.

The --verbose flag makes this explicit. Each sub-score prints with its finding:

   ✗  precision    30/100  3/10 chunks used
   ✗  efficiency   45/100  55% tokens wasted

And the Recommendations section:

 Recommendations
   → Reduce TOP_K 10→3 (only 3 chunks reached LLM)
   → 55% of retrieved tokens never reached the LLM

When precision reads 0% unexpectedly

If your trace has no LLM spans, for example because you're testing your retriever in isolation, llmPrompts will be empty and assembleContext returns the chunks unchanged, so none of them is flagged inContext: true. scoreRetrieval then sees zero used chunks over a non-zero total, and precision reads 0.

If your trace has no chunks at all, scoreRetrieval short-circuits to a score of 100 with the finding "no chunks". That guard is not part of the excerpt above; without it, the division by chunks.length would yield NaN. The assumption is that a trace with no retrieved chunks represents a non-retrieval query that shouldn't be penalized.
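
Both edge cases can be reproduced with a reduced scorer. This is a sketch with an assumed `FlaggedChunk` shape and only the score and finding fields. The early return mirrors the short-circuit described above and is not RAGScope's exact code.

```typescript
// Assumed minimal shapes; RAGScope's RagChunk and SubScore carry more fields.
interface FlaggedChunk {
  content: string;
  inContext?: boolean;
}

interface PrecisionResult {
  score: number;
  finding: string;
}

function scorePrecision(chunks: FlaggedChunk[]): PrecisionResult {
  // No retrieval happened: do not penalize, and avoid dividing by zero.
  if (chunks.length === 0) {
    return { score: 100, finding: "no chunks" };
  }
  const used = chunks.filter((c) => c.inContext).length;
  const score = Math.round((used / chunks.length) * 100);
  return { score, finding: `${used}/${chunks.length} chunks used` };
}

// Retriever tested in isolation: chunks exist, but nothing flagged them.
const unflagged: FlaggedChunk[] = [{ content: "a" }, { content: "b" }];

console.log(scorePrecision(unflagged)); // { score: 0, finding: '0/2 chunks used' }
console.log(scorePrecision([])); // { score: 100, finding: 'no chunks' }
```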


Conclusion

RAGScope's precision score is only meaningful because assembleContext solves the hardest observability problem in RAG pipelines: figuring out which retrieved chunks actually reached the model. It does this by checking chunk content against LLM prompt strings — no extra instrumentation, no special attributes, no embeddings.

The implication for your setup: include chunk content in your RETRIEVER spans. Without it, assembleContext cannot match, precision stays at zero, and the most impactful metric in your audit is blind. With it, you get the exact number that tells you whether your TOP_K setting is costing you context budget.

Try it: github.com/Sidd27/ragscope · npmjs.com/package/ragscope


Key Takeaways

  • OTel has no "in context" attribute — RAGScope determines LLM context inclusion by checking if chunk content is a substring of the LLM span's prompt string
  • assembleContext in src/enrichment/pipeline.ts performs this matching; contextPosition tracks relative order among in-context chunks (by retrieval rank, not prompt position)
  • Precision is 40% of the overall score — a low precision score is the most common cause of FAIL labels
  • If chunk content is missing from your RETRIEVER spans, precision will read 0%; include content in your instrumentation to get accurate scores
  • The automatic recommendation (Reduce TOP_K N→M) fires when precision < 60%, giving a concrete action to take immediately
