Tracing a RAG Chain End-to-End: Where OpenTelemetry Stops and Where You Need to Instrument Yourself

There are already plenty of "Getting started with OpenTelemetry" tutorials. This is not one of them.

This article starts with a candid observation: if you have OTel running in your infrastructure and you've just added a RAG pipeline to production, your traces look impressive, but they're mostly lying to you by omission. You have spans and latency numbers. What you don't have is visibility into the five stages that actually determine whether your system is working correctly.

OTel wasn't designed for RAG. It was designed for distributed systems built around HTTP, databases, and message queues: all well-understood primitives with established semantic conventions. A RAG pipeline adds several new primitives that have no standard OTel semantics yet. The OpenTelemetry GenAI SIG is working on it, but slowly. In the meantime, production systems are running blind.

The goal here is to be precise about where the boundary is and how to cross it.


What a RAG chain actually traverses

A minimal RAG pipeline involves eight distinct stages, each with its own failure modes and its own observability requirements:

  1. Query ingress: the user request arrives, gets validated, gets routed
  2. Embedding: the query is converted to a vector representation
  3. Retrieval: the vector DB is searched, ranked chunks are returned
  4. Reranking: chunks are rescored by a cross-encoder, poor matches are dropped
  5. Prompt assembly: context is injected, the prompt is constructed, tokens are counted
  6. LLM call: the assembled prompt is sent to the model
  7. Post-processing: the response is parsed, validated, formatted, filtered
  8. Response: the final answer is returned to the caller

Each of these stages has distinct latency characteristics, distinct failure modes, and distinct diagnostic signals. The problem is that OTel treats them very differently.
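Concretely, the eight stages compose into a single request path. This is a minimal runnable sketch with placeholder helpers (none of these names come from a real SDK), useful only to fix the shape of the pipeline before instrumenting it:

```python
# Self-contained sketch of the eight stages as one pipeline function.
# Every helper here is a trivial stand-in, not a real SDK call.

def validate_and_route(q):               # 1. query ingress
    return q.strip()

def embed_query(q):                      # 2. embedding
    return [0.1] * 768

def search_vector_db(vec, top_k):        # 3. retrieval
    return ["chunk-a", "chunk-b"]

def rerank(q, chunks, top_n):            # 4. reranking
    return chunks[:top_n]

def build_prompt(q, chunks):             # 5. prompt assembly
    return q + "\n" + "\n".join(chunks), {"chunks_injected": len(chunks)}

def call_llm(prompt):                    # 6. LLM call
    return " stub answer "

def postprocess(raw):                    # 7. post-processing
    return raw.strip()

def answer(query):
    q = validate_and_route(query)
    chunks = search_vector_db(embed_query(q), top_k=20)
    prompt, _meta = build_prompt(q, rerank(q, chunks, top_n=5))
    return postprocess(call_llm(prompt))  # 8. response
```

Every one of these calls is a place where a span can be opened; the rest of the article is about which of them OTel opens for you and which it doesn't.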


What OTel covers natively

Auto-instrumentation handles the outer envelope well. For a typical Python service running FastAPI or Flask with OpenTelemetry auto-instrumentation, you get:

  • A root span for the HTTP request (stage 1)
  • The HTTP response span (stage 8)
  • Any outbound HTTP calls you make, including the API call to your LLM provider

That's it. Three spans. In a pipeline with eight meaningful stages.

Depending on your vector database client, you might get a span for the retrieval call. Weaviate has partial SDK-level instrumentation; most others don't. But even when you get that span, it gives you network latency, not semantic information. You know the query arrived and returned. You don't know how many results came back, what their similarity scores were, or whether the result set was empty.

The picture after auto-instrumentation: two solid spans at the edges, one partial span in the middle, and four stages that are completely invisible.

The five dead zones below walk through what auto-instrumentation misses, stage by stage.


The five dead zones

Zone 1: Embedding

When you call an embedding model, whether via OpenAI, Cohere, or a local sentence-transformer, you get a latency number if you're lucky and nothing otherwise. What you don't capture:

  • Which model, which version. Model drift is real. A silent upgrade to your embedding provider changes vector geometry and breaks retrieval. If you're not recording model_name and model_version on every span, you'll spend days debugging what looks like a retrieval problem.
  • Vector dimensionality. A dimension mismatch between your embedding model and your index is a hard failure that generates cryptic errors. Logging the output dimension takes one attribute.
  • CPU vs GPU time split. For on-premise inference, this is the first signal that hardware saturation is affecting quality.
  • Similarity score distribution. The embedding stage itself doesn't produce this, but it sets up the retrieval stage. Tracking what "normal" looks like here is your baseline for drift detection.
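As a concrete example of the dimensionality point, a one-function guard at the embedding boundary turns the cryptic downstream failure into an immediate, local one. The 1536 here is an assumed index dimension, not a recommendation:

```python
EXPECTED_DIM = 1536  # must equal the vector index's configured dimension (assumed value)

def check_dimension(vector):
    # Fail loudly at the embedding boundary instead of letting the vector DB
    # raise a cryptic error two stages later.
    if len(vector) != EXPECTED_DIM:
        raise ValueError(
            f"embedding dimension {len(vector)} != index dimension {EXPECTED_DIM}"
        )
    return vector
```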

Zone 2: Retrieval

The retrieval call to your vector database may produce an outbound HTTP span if you're using a REST-based client. But that span contains only timing and status code. What it doesn't contain:

  • Number of chunks returned. If your retrieval returns zero results, you want to know immediately, and you want to know the query that triggered it, not just the timestamp.
  • Similarity scores. The distribution of top-k scores tells you whether the retrieval was confident or speculative. A max score of 0.94 and a max score of 0.41 both count as "retrieval succeeded" in OTel's view. They're completely different situations operationally.
  • Reranking time as a separate stage. Many pipelines combine retrieval and reranking into a single function call. Separating them in your spans is worth the effort: reranking is frequently the actual latency bottleneck, and you'd never know.

Zone 3: Reranking

This stage is the most consistently invisible and the most consistently underestimated. A cross-encoder reranker running a full forward pass over each retrieved chunk pair adds significant latency, sometimes more than the LLM call itself. OTel sees none of it unless you explicitly instrument it.

What to capture:

  • Reranking duration as its own span
  • Input chunk count vs. output chunk count (how many were filtered)
  • Score threshold applied
  • Model name and batch size

Zone 4: Prompt assembly

This is where silent failures are born.

When you assemble a prompt, you make decisions: which chunks to include, how to order them, how to truncate if the context window is tight. OTel has no visibility into any of this. You can have a system that routinely truncates critical context and generates factually incomplete responses, and your traces will show a perfectly healthy green pipeline.

What to capture:

  • Estimated token count before sending
  • Whether truncation occurred (context_truncated: true/false)
  • Number of chunks injected
  • Whether conflicting chunks were injected (requires a light coherence check)

Zone 5: LLM call payload

This is the subtlest dead zone. You do have a span for the LLM call: it's the outbound HTTP request. But the span contains no semantic information about what happened inside that call.

The SDKs for Anthropic, OpenAI, and most other LLM providers do not emit OTel attributes for tokens, stop reason, or model-level parameters. You have to enrich the span yourself after the response arrives. Without this:

  • You cannot track token costs
  • You cannot alert on prompts that regularly hit the context limit (stop_reason: length)
  • You cannot detect when model behavior changes across versions
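The token-cost point is mechanical once llm.usage.* attributes exist: cost is a lookup against a price sheet. The prices below are placeholders, not any provider's actual rates:

```python
# Placeholder price sheet: model -> (input $/1M tokens, output $/1M tokens).
# Substitute your provider's real published rates.
PRICE_PER_MTOK = {
    "example-model": (3.00, 15.00),
}

def request_cost_usd(model, input_tokens, output_tokens):
    # Derive per-request cost from the same numbers you attach as
    # llm.usage.input_tokens / llm.usage.output_tokens.
    price_in, price_out = PRICE_PER_MTOK[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```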

Instrumenting manually

The pattern is consistent across all dead zones: wrap the operation in a span, set semantic attributes, and use a naming convention that survives grep.

Embedding

from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

with tracer.start_as_current_span("rag.embedding") as span:
    result = embed_query(query_text)
    span.set_attribute("rag.query.length", len(query_text))
    span.set_attribute("rag.embedding.model", result.model)
    span.set_attribute("rag.embedding.dimension", len(result.vector))
    span.set_attribute("rag.embedding.latency_ms", result.latency_ms)

Retrieval

with tracer.start_as_current_span("rag.retrieval") as span:
    chunks = vector_db.search(query_vector, top_k=20)
    span.set_attribute("rag.retrieval.query_preview", query_text[:120])
    span.set_attribute("rag.retrieval.chunks_returned", len(chunks))
    span.set_attribute("rag.retrieval.empty_result", len(chunks) == 0)
    if chunks:
        span.set_attribute("rag.retrieval.max_score", round(chunks[0].score, 4))
        span.set_attribute("rag.retrieval.min_score", round(chunks[-1].score, 4))

Reranking

with tracer.start_as_current_span("rag.reranking") as span:
    reranked = reranker.rank(query_text, chunks, top_n=5)
    span.set_attribute("rag.reranking.model", reranker.model_name)
    span.set_attribute("rag.reranking.input_count", len(chunks))
    span.set_attribute("rag.reranking.output_count", len(reranked))
    span.set_attribute("rag.reranking.filtered_count", len(chunks) - len(reranked))
    if reranked:
        span.set_attribute("rag.reranking.top_score", round(reranked[0].score, 4))

Prompt assembly

with tracer.start_as_current_span("rag.prompt_assembly") as span:
    assembled, meta = build_prompt(query_text, reranked_chunks)
    span.set_attribute("rag.prompt.estimated_tokens", meta["token_count"])
    span.set_attribute("rag.prompt.context_truncated", meta["truncated"])
    span.set_attribute("rag.prompt.chunks_used", meta["chunks_injected"])
    span.set_attribute("rag.prompt.system_prompt_version", SYSTEM_PROMPT_VERSION)

Note the system_prompt_version attribute. Prompt changes are the most common cause of unexplained behavioral shifts. Versioning your system prompt and logging it on every span costs nothing and will save you multiple production investigations.
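A build_prompt that produces the meta dict used above might look like the following sketch. The 4-characters-per-token estimate and the token budget are assumptions; substitute a real tokenizer and your model's actual limit:

```python
SYSTEM_PROMPT_VERSION = "v3"      # bump on every prompt change (example value)
MAX_CONTEXT_TOKENS = 8000         # assumed budget, not a real model limit

def estimate_tokens(text):
    # Crude ~4 chars/token heuristic; replace with a real tokenizer.
    return len(text) // 4

def build_prompt(query, chunks):
    included, used, truncated = [], estimate_tokens(query), False
    for chunk in chunks:
        cost = estimate_tokens(chunk.text)
        if used + cost > MAX_CONTEXT_TOKENS:
            truncated = True      # record the decision instead of hiding it
            break
        included.append(chunk.text)
        used += cost
    prompt = query + "\n\n" + "\n\n".join(included)
    meta = {
        "token_count": estimate_tokens(prompt),
        "truncated": truncated,
        "chunks_injected": len(included),
    }
    return prompt, meta
```

The point of the sketch is the meta dict: every truncation decision leaves a trace attribute instead of disappearing into string concatenation.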

LLM call enrichment

The outbound HTTP span from auto-instrumentation already covers timing for the LLM call. What it lacks is semantics, so wrap the call in a span of your own and set the usage attributes after the response arrives:

with tracer.start_as_current_span("llm.completion") as span:
    response = llm_client.messages.create(**request_params)
    span.set_attribute("llm.model", response.model)
    span.set_attribute("llm.usage.input_tokens", response.usage.input_tokens)
    span.set_attribute("llm.usage.output_tokens", response.usage.output_tokens)
    span.set_attribute("llm.stop_reason", response.stop_reason)
    span.set_attribute("llm.request.temperature", request_params.get("temperature", 1.0))
    if "max_tokens" in request_params:  # avoid setting a None attribute on the span
        span.set_attribute("llm.request.max_tokens", request_params["max_tokens"])

The stop_reason attribute alone justifies this instrumentation. When stop_reason is "length", it means the model ran out of context and stopped mid-response. In a RAG pipeline, this is almost always a prompt assembly bug. Without this attribute, the response looks valid until a human reads it.


Naming conventions

There is no OTel standard for RAG semantic conventions as of early 2026. The GenAI SIG has drafts in progress but nothing stable. Until there is a standard, the wrong choice is to invent arbitrary names per-service. The right choice is to define a coherent convention internally and apply it consistently.

The three-prefix approach:

Prefix  Domain             Examples
rag.*   Pipeline logic     rag.retrieval.chunks_returned, rag.prompt.context_truncated
llm.*   Model interaction  llm.usage.input_tokens, llm.stop_reason, llm.model
vec.*   Vector operations  vec.index.name, vec.search.metric, vec.dimension

This naming makes your traces queryable by domain in VictoriaMetrics, OpenObserve, or any backend that supports attribute filtering. A query like rag.retrieval.empty_result = true AND llm.stop_reason = "length" surfaces a specific failure pattern (empty retrieval leading to context-padded fallback response) in seconds.

Avoid prefixes that shadow existing OTel conventions. db.* is already used by database instrumentation. http.* is already HTTP. Pick names that won't collide with auto-instrumented attributes.
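One way to keep the convention from eroding is to enforce it at write time with a small wrapper; set_attrs here is a hypothetical helper, not an OTel API:

```python
ALLOWED_PREFIXES = ("rag.", "llm.", "vec.")

def set_attrs(span, attrs):
    # Reject anything outside the three internal prefixes before it lands in
    # a trace; this also prevents accidental shadowing of db.* or http.*.
    for key, value in attrs.items():
        if not key.startswith(ALLOWED_PREFIXES):
            raise ValueError(f"attribute {key!r} violates the naming convention")
        span.set_attribute(key, value)
```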


What you do with complete traces

Once the instrumentation is in place, four operational patterns become possible that were invisible before.

P95 latency by stage. Most teams assume the LLM call dominates pipeline latency. In practice, reranking is frequently the bottleneck, especially for models running on shared inference infrastructure. Without per-stage spans, you're optimizing the wrong thing.
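Once per-stage spans exist, verifying the bottleneck claim is a one-liner over exported durations; the stdlib's statistics.quantiles with n=20 yields 19 cut points, the last of which is the 95th percentile:

```python
import statistics

def p95_by_stage(samples):
    # samples: {stage name: [span durations in ms]} exported from the tracing
    # backend. quantiles(n=20) returns 19 cut points; index 18 is the p95.
    return {stage: statistics.quantiles(ms, n=20)[18]
            for stage, ms in samples.items()}
```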

Empty retrieval rate as a leading indicator. An uptick in rag.retrieval.empty_result = true before you see quality degradation in user feedback gives you a 24–48 hour warning window. It usually means your document index is stale or your embedding model has been silently upgraded. This is the most valuable leading indicator in a RAG system and it requires exactly one boolean attribute.

Context truncation rate as a prompt quality signal. If rag.prompt.context_truncated = true appears on more than 5–10% of requests, your retrieved chunks are too long for your context window configuration. This is a retrieval tuning problem, not an LLM problem, but without the attribute, it looks like random response degradation.

Stop reason distribution. A rise in llm.stop_reason = "length" correlates directly with content quality issues. Track it as a metric. Alert on it. It's a better signal than user satisfaction scores because it's available in real time.


Conclusion

OTel is a foundation, not a solution. For conventional infrastructure (HTTP services, databases, queues), auto-instrumentation covers most of what matters. For AI pipelines, the meaningful events happen inside application logic that has no standard semantic model.

The gap isn't a criticism of OTel. It's the normal boundary between generic infrastructure tooling and domain-specific observability. Every mature domain eventually develops its own semantic layer on top of the generic tracing substrate.

For RAG pipelines, that layer doesn't exist yet as a standard. Building it yourself is not optional if you're operating these systems in production. The instrumentation described here adds less than 200 lines of code to a typical pipeline and transforms your traces from a latency meter into an operational instrument.

The five dead zones (embedding, retrieval, reranking, prompt assembly, and LLM payload) are exactly where your system fails in interesting ways. Leaving them dark is a choice, and it's the wrong one.


Samuel Desseaux, founder of Erythix · AI Observability & Industrial Monitoring

VictoriaMetrics Training Partner · erythix.tech

This article is part of the AI Observability series. Related: "GPU utilization tells you nothing about inference quality" · "Sovereign observability stack for HPC workloads"
