How RAGScope Knows Which Chunks Your LLM Actually Used

#rag #opensource #observability #llm

Your retriever fetched 10 chunks. Your LLM only used 3. RAGScope shows a precision score of 30 out of 100. The question every new user asks: how does it know?

There is no OpenTelemetry attribute that says "this chunk was in the context window." RAGScope infers it — and the way it does this is the most consequential piece of engineering in the whole tool.

There Is No "In Context" Attribute in OTel

The OpenTelemetry semantic conventions for generative AI (gen_ai.*) define attributes for model, input/output tokens, and retrieved documents. They do not define anything like gen_ai.chunk.reached_llm or gen_ai.retrieval.used_document_ids.

When your RETRIEVER span fires, you get a list of documents. When your LLM span fires, you get a prompt and a completion. The two spans are connected by a parent-child trace relationship — but there is no attribute that maps which retrieved documents appear in which prompt.

This gap matters. A reranker might drop 4 of your 10 chunks. Your application code might then apply a token budget and truncate 3 more. From the trace alone, you cannot tell which 3 survived.

RAGScope needs this information to compute the precision sub-score — the highest-weighted metric at 40% of the overall score. Getting it wrong would make precision meaningless.

The Substring Match — How `assembleContext` Works

RAGScope's answer is in src/enrichment/pipeline.ts, in a function called assembleContext:

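function assembleContext(chunks: RagChunk[], llmSpans: ParsedSpan[]): RagChunk[] {
  // Keep only the LLM spans that actually recorded a prompt string.
  const llmPrompts = llmSpans.map((s) => s.prompt).filter((p): p is string => !!p);
  // Nothing to search against: return the chunks untouched.
  if (llmPrompts.length === 0) return chunks;

  // Counts in-context chunks in the order they are visited (retrieval rank).
  let position = 0;
  return chunks.map((chunk) => {
    // A chunk without recorded content cannot be matched.
    if (!chunk.content) return chunk;
    // Literal substring check against every prompt in the trace.
    const inContext = llmPrompts.some((p) => p.includes(chunk.content!));
    if (inContext) {
      return { ...chunk, inContext: true, contextPosition: position++ };
    }
    return { ...chunk, inContext: false, contextPosition: null };
  });
}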

The approach: collect the raw prompt strings from every LLM span in the trace, then check whether each chunk's content appears as a literal substring of any of those prompts.

RAGScope has everything it needs when two conditions hold: your LLM span records its prompt in the input attribute (which TraceAI, Traceloop, and OpenTelemetry's gen_ai conventions all do), and your retriever span records the chunk content in gen_ai.retrieval.documents.

The contextPosition counter assigns an incrementing index to each in-context chunk in the order it is encountered during the chunks.map() iteration, which follows retrieval rank, not prompt position. The value therefore gives a chunk's rank among the in-context chunks; it does not say where the chunk's text sits inside the prompt.
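To see the counter in action, here is a minimal, self-contained sketch. The RagChunk and ParsedSpan shapes are trimmed-down assumptions (the real types in RAGScope likely carry more fields), and the sample chunks and prompt are made up. The function body is copied from the excerpt above.

```typescript
// Assumed minimal shapes; the real RAGScope types likely have more fields.
interface RagChunk {
  id: string;
  content?: string;
  inContext: boolean;
  contextPosition: number | null;
}
interface ParsedSpan {
  prompt?: string;
}

function assembleContext(chunks: RagChunk[], llmSpans: ParsedSpan[]): RagChunk[] {
  const llmPrompts = llmSpans.map((s) => s.prompt).filter((p): p is string => !!p);
  if (llmPrompts.length === 0) return chunks;

  let position = 0;
  return chunks.map((chunk) => {
    if (!chunk.content) return chunk;
    const inContext = llmPrompts.some((p) => p.includes(chunk.content!));
    if (inContext) {
      return { ...chunk, inContext: true, contextPosition: position++ };
    }
    return { ...chunk, inContext: false, contextPosition: null };
  });
}

// Chunks are listed in retrieval rank order: a, b, c.
const retrieved: RagChunk[] = [
  { id: "a", content: "Paris is the capital of France.", inContext: false, contextPosition: null },
  { id: "b", content: "Berlin is the capital of Germany.", inContext: false, contextPosition: null },
  { id: "c", content: "Rome is the capital of Italy.", inContext: false, contextPosition: null },
];

// The prompt places chunk c before chunk a and omits chunk b entirely.
const llmSpan: ParsedSpan = {
  prompt: "Context:\nRome is the capital of Italy.\nParis is the capital of France.\n\nQuestion: ...",
};

const assembled = assembleContext(retrieved, [llmSpan]);
for (const c of assembled) {
  console.log(c.id, c.inContext, c.contextPosition);
}
// a true 0
// b false null
// c true 1
```

Chunk a gets position 0 even though its text comes second in the prompt, because the counter advances in retrieval order.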

Why substring matching works

Frameworks like LangChain and LlamaIndex build LLM prompts by concatenating retrieved chunk contents, often wrapped in minimal formatting like Context:\n{chunk}\n. The chunk text itself is usually present verbatim. As long as the chunk content recorded on the RETRIEVER span matches what was injected into the prompt string, which it does when both come from the same retrieval call, substring matching is reliable. It stops being reliable when something rewrites the text between retrieval and prompt assembly, such as a reranker that reformats chunks or a step that paraphrases them.

The constraint: chunk.content must be non-empty and non-null. RAGScope only stores content when the RETRIEVER span includes it in the documents array. If your instrumentation omits content and only records chunk IDs, assembleContext cannot match, and precision will read 0% until content is included.
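The exactness of the match is easy to check by hand. The sketch below uses made-up chunk text and a hypothetical prompt template; appearsVerbatim is an illustrative helper, not a RAGScope function, and it only mirrors the includes check shown earlier.

```typescript
// Mirrors the check in assembleContext: a chunk counts only if its recorded
// content is a literal substring of the prompt.
function appearsVerbatim(content: string | undefined, llmPrompt: string): boolean {
  return !!content && llmPrompt.includes(content);
}

// Made-up chunk text as it might be recorded on a RETRIEVER span.
const recorded = "The warranty lasts two years.\nReturns are accepted within 30 days.";

// Case 1: a hypothetical prompt template injects the chunk verbatim.
const verbatimPrompt = `Context:\n${recorded}\n\nQuestion: How long is the warranty?`;

// Case 2: the application collapses whitespace before building the prompt.
const collapsedPrompt = `Context:\n${recorded.replace(/\s+/g, " ")}\n\nQuestion: How long is the warranty?`;

console.log(appearsVerbatim(recorded, verbatimPrompt));  // true
console.log(appearsVerbatim(recorded, collapsedPrompt)); // false (the newline became a space)
console.log(appearsVerbatim(undefined, verbatimPrompt)); // false (only an ID was recorded)
```

A single changed character is enough to turn a used chunk into an apparently unused one, so it is worth recording on the span exactly the string that goes into the prompt.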

What This Means for Your Precision Score

scoreRetrieval in src/audit/scorer.ts uses the inContext flag directly:

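function scoreRetrieval(chunks: RagChunk[]): SubScore {
  // No retrieved chunks: short-circuit so the ratio below never divides by zero.
  if (chunks.length === 0) {
    return { name: 'precision', score: 100, symbol: symbol(100), finding: 'no chunks', recommendation: null };
  }
  // Precision is the share of retrieved chunks that were found in an LLM prompt.
  const used = chunks.filter((c) => c.inContext).length;
  const score = Math.round((used / chunks.length) * 100);
  return {
    name: 'precision',
    score,
    symbol: symbol(score),
    finding: `${used}/${chunks.length} chunks used`,
    recommendation:
      score < 60
        ? `Reduce TOP_K ${chunks.length}→${Math.max(used, 3)} (only ${used} chunks reached LLM)`
        : null,
  };
}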

If 3 of 10 chunks appear in the LLM prompt, precision = 30. The recommendation fires automatically: Reduce TOP_K 10→3. Because precision carries 40% of the overall score, a 30 here drags the total down sharply: if the overall score is a plain weighted average, it is capped at 72 (0.4 × 30 + 0.6 × 100), even if efficiency, redundancy, and coverage are perfect.
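The 3-of-10 case can be run end to end with a small sketch. The RagChunk and SubScore shapes are assumptions reduced to the fields used here, and symbol is a hypothetical stand-in for RAGScope's helper (its 60-point cutoff is a guess). The scoring logic is trimmed to the ratio and the recommendation shown in the excerpt.

```typescript
// Assumed minimal shapes; the real RAGScope types likely carry more fields.
interface RagChunk {
  inContext: boolean;
}
interface SubScore {
  name: string;
  score: number;
  symbol: string;
  finding: string;
  recommendation: string | null;
}

// Hypothetical stand-in for RAGScope's symbol helper; the cutoff is an assumption.
function symbol(score: number): string {
  return score >= 60 ? "✓" : "✗";
}

function scoreRetrieval(chunks: RagChunk[]): SubScore {
  const used = chunks.filter((c) => c.inContext).length;
  const score = Math.round((used / chunks.length) * 100);
  return {
    name: "precision",
    score,
    symbol: symbol(score),
    finding: `${used}/${chunks.length} chunks used`,
    recommendation:
      score < 60
        ? `Reduce TOP_K ${chunks.length}→${Math.max(used, 3)} (only ${used} chunks reached LLM)`
        : null,
  };
}

// Ten retrieved chunks, of which the first three reached the LLM.
const tenChunks: RagChunk[] = [];
for (let i = 0; i < 10; i++) {
  tenChunks.push({ inContext: i < 3 });
}
const precision = scoreRetrieval(tenChunks);

console.log(precision.score);          // 30
console.log(precision.finding);        // 3/10 chunks used
console.log(precision.recommendation); // Reduce TOP_K 10→3 (only 3 chunks reached LLM)
```

Note the Math.max(used, 3) in the recommendation: the suggested TOP_K never drops below 3, even when fewer chunks were used.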

This is the most common cause of FAIL scores: teams set TOP_K=10 during early experimentation and never reduce it. Ten chunks get retrieved. Three reach the LLM. The other seven are fetched and then discarded, and their tokens push the efficiency score down too.

The --verbose flag makes this explicit. Each sub-score prints with its finding:

   ✗  precision    30/100  3/10 chunks used
   ✗  efficiency   45/100  55% tokens wasted

And the Recommendations section:

 Recommendations
   → Reduce TOP_K 10→3 (only 3 chunks reached LLM)
   → 55% of retrieved tokens never reached the LLM

When precision reads 0% unexpectedly

If your trace has no LLM spans (for example, you're testing your retriever in isolation), llmPrompts will be empty and assembleContext returns all chunks unchanged, so every chunk keeps inContext: false. In that case, scoreRetrieval sees zero used chunks over a non-zero total, and precision reads 0.

If your trace has no chunks at all, scoreRetrieval short-circuits to a score of 100 with the finding no chunks — the assumption being that a trace with no retrieved chunks represents a non-retrieval query that shouldn't be penalized.
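Both edge cases can be reproduced with a stripped-down sketch. The helpers markInContext and precisionScore are hypothetical reductions of the matching and scoring steps, keeping only the empty-prompt pass-through and the empty-trace guard described above; the chunk shape is an assumption.

```typescript
// Assumed minimal chunk shape; chunks start out with inContext: false.
interface RagChunk {
  content: string;
  inContext: boolean;
}

// Hypothetical reduction of the matching step: with no prompts, chunks pass through unchanged.
function markInContext(chunks: RagChunk[], llmPrompts: string[]): RagChunk[] {
  if (llmPrompts.length === 0) return chunks;
  return chunks.map((c) => ({ ...c, inContext: llmPrompts.some((p) => p.includes(c.content)) }));
}

// Hypothetical reduction of the scoring step, including the empty-trace guard.
function precisionScore(chunks: RagChunk[]): number {
  if (chunks.length === 0) return 100; // no chunks: treated as a non-retrieval query
  const used = chunks.filter((c) => c.inContext).length;
  return Math.round((used / chunks.length) * 100);
}

const retrieverOnly: RagChunk[] = [
  { content: "chunk one", inContext: false },
  { content: "chunk two", inContext: false },
];

// Retriever tested in isolation: no LLM spans, so nothing is ever marked as used.
console.log(precisionScore(markInContext(retrieverOnly, []))); // 0

// Trace with no retrieved chunks at all.
console.log(precisionScore(markInContext([], []))); // 100
```

A 0 from a retriever-only trace therefore says nothing about retrieval quality; it only means there was no prompt to match against.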

Conclusion

RAGScope's precision score is only meaningful because assembleContext solves the hardest observability problem in RAG pipelines: figuring out which retrieved chunks actually reached the model. It does this by checking chunk content against LLM prompt strings — no extra instrumentation, no special attributes, no embeddings.

The implication for your setup: include chunk content in your RETRIEVER spans. Without it, assembleContext cannot match, precision stays at zero, and the most impactful metric in your audit is blind. With it, you get the exact number that tells you whether your TOP_K setting is costing you context budget.

Try it: GitHub · npm

Key Takeaways

OTel has no "in context" attribute — RAGScope determines LLM context inclusion by checking if chunk content is a substring of the LLM span's prompt string
assembleContext in src/enrichment/pipeline.ts performs this matching; contextPosition tracks relative order among in-context chunks (by retrieval rank, not prompt position)
Precision is 40% of the overall score — a low precision score is the most common cause of FAIL labels
If chunk content is missing from your RETRIEVER spans, precision will read 0%; include content in your instrumentation to get accurate scores
The automatic recommendation (Reduce TOP_K N→M) fires when precision < 60%, giving a concrete action to take immediately

Top comments (2)

Harjot Singh • May 31

Which chunks your LLM actually used is the question RAG evaluation usually skips, and it's the one that matters most. Everyone measures retrieval (did we fetch the right chunks) and answer quality (was the output good), but the gap in the middle is invisible: the model can retrieve the right chunk and then ignore it, answering from its priors instead, and you'd never know because the answer looks fine and the retrieval looks fine. Knowing which retrieved chunks genuinely informed the output closes that gap, and it unlocks two big things. First, faithfulness: you can tell whether an answer is actually grounded in the sources or just adjacent to them, which is the difference between trustworthy RAG and confident-RAG-shaped hallucination. Second, retrieval pruning: if chunks are consistently retrieved but never used, you're paying tokens to send noise that also degrades attention, so usage data tells you what to stop retrieving. Attribution at the chunk level is basically observability for the grounding step, and it's exactly what you need to debug why a grounded system still went wrong. Measure not just what you retrieved, but what the model actually leaned on. That trace-the-grounding instinct is core to how I think about RAG in Moonshift. How are you attributing usage, attention/citation signals from the model, or a post-hoc check of which chunks the answer is entailed by?

Siddharth Pandey • Jun 1

Exactly the gap RAGScope targets. Right now it's a post-hoc substring match — if the chunk content appears verbatim in the LLM span's prompt string, it counts as used. That catches the common case (LangChain/LlamaIndex concatenate chunks directly) but misses paraphrasing and reranker reformatting. Attention/citation signals would be cleaner but require model-side hooks most setups don't expose yet — curious how Moonshift handles that.