DEV Community

Tijo Gaucher

[2026] OpenTelemetry for LLM Observability — Self-Hosted Setup

I've been running a small AI automation shop — just me, a handful of agents, and a self-hosted stack that needs to stay observable without blowing the budget. When I started instrumenting my LLM pipelines, I found that most observability guides assumed you'd use a managed platform. But if you're like me and prefer to own your data and infrastructure, OpenTelemetry gives you a solid, vendor-neutral foundation.

Here's what I've learned getting OpenTelemetry working for LLM agent traces on a self-hosted setup in 2026.

Why OpenTelemetry for LLM Workloads?

OpenTelemetry (OTel) has become the de facto standard for distributed tracing, metrics, and logs. The ecosystem matured significantly through 2025, and the semantic conventions for generative AI — covering LLM calls, token usage, model parameters — landed as stable in early 2026.

For LLM workloads specifically, OTel gives you a few things that matter:

Trace continuity across agent steps. When your agent calls an LLM, retrieves from a vector store, then calls another LLM, each step is a span in a single trace. You see the full chain, not just isolated API calls.

Token and cost attribution. The gen_ai semantic conventions include attributes like gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, which let you track per-request costs without bolting on a separate billing layer.

Vendor neutrality. Whether you're calling OpenAI, Anthropic, or a local model via vLLM, the instrumentation shape is the same. Swap providers without rewriting your observability code.
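To make the cost-attribution point concrete, here's a minimal sketch of turning those token attributes into a per-request dollar figure. The per-million-token prices below are hypothetical placeholders, not real rates — substitute your provider's current pricing:

```python
# Hypothetical per-million-token prices -- swap in your provider's real rates.
PRICES = {
    "anthropic": {"input": 3.00, "output": 15.00},
    "openai": {"input": 2.50, "output": 10.00},
}

def request_cost(system: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost from the gen_ai.usage.* span attributes."""
    rates = PRICES[system]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# Example: a request that consumed 1,200 input and 300 output tokens.
cost = request_cost("anthropic", 1200, 300)
```

Because the token counts ride along on every span, you can compute this in a dashboard query or a post-processing job instead of maintaining a separate billing pipeline.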

The Self-Hosted Stack

My setup is modest — a single VPS running the collection and storage layer, with agents deployed separately. Here's the architecture:

[Your LLM Agents]
       |
       v
[OTel Collector]  ← receives traces via OTLP/gRPC
       |
       v
[Tempo / Jaeger]  ← trace storage
[Prometheus]      ← metrics storage
[Grafana]         ← visualization

If you've looked at the self-hosted vs managed cost comparison, you know the economics are favorable when you're running fewer than five agents. The managed platforms charge per span or per seat, which adds up quickly even at small scale.

Setting Up the OTel Collector

The Collector is the central hub. It receives telemetry from your agents, processes it, and exports to your storage backends. Here's a minimal config for LLM traces:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Nothing exotic here. The batch processor keeps things efficient, and we're exporting traces to Tempo and metrics to Prometheus. If you want a deeper walkthrough on getting this into production, the production deployment guide covers Docker Compose configs and health checks.
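To give a feel for how the pieces fit together, here's a rough Docker Compose sketch of the stack above. Treat it as a starting point, not a production config — the image tags, port mappings, and config file names are assumptions, and you should pin versions and add health checks yourself:

```yaml
# docker-compose.yml -- a rough sketch; pin image versions for production.
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from your agents
      - "4318:4318"   # OTLP HTTP
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
  prometheus:
    image: prom/prometheus:latest
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```

The agents only ever talk to the Collector on 4317/4318, so the storage services can stay off the public network entirely.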

Instrumenting LLM Calls

The actual instrumentation depends on your language and SDK. I'll show Python since that's what most agent code runs on.

First, install the packages:

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-requests

Then set up a tracer and wrap your LLM calls:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://your-collector:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-agent")

def call_llm(prompt, model="claude-sonnet-4-20250514", max_tokens=1024):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)

        # your_llm_client stands in for whichever provider SDK you use
        response = your_llm_client.complete(prompt=prompt, model=model, max_tokens=max_tokens)

        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        span.set_attribute("gen_ai.response.model", response.model)

        return response.content

The key is using the gen_ai.* semantic conventions consistently. This means your Grafana dashboards, alerts, and queries work the same regardless of which model or provider you're hitting.
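One way to enforce that consistency is to make a small helper the only place the gen_ai.* strings appear, so provider-specific code never spells the keys by hand. A sketch, assuming a usage object with input_tokens/output_tokens fields like the one above:

```python
def set_genai_attributes(span, system, request_model, usage=None, response_model=None):
    """Apply gen_ai.* semantic-convention attributes in one place,
    so a typo'd key can't silently break a dashboard query."""
    span.set_attribute("gen_ai.system", system)
    span.set_attribute("gen_ai.request.model", request_model)
    if usage is not None:
        span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
    if response_model is not None:
        span.set_attribute("gen_ai.response.model", response_model)
```

Swapping providers then means changing the `system` argument and the client call, and nothing else.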

Tracing Multi-Step Agent Workflows

Where this gets really useful is tracing a full agent workflow. Each tool call, retrieval step, and LLM invocation becomes a child span:

def run_agent(task):
    with tracer.start_as_current_span("agent.run") as parent:
        parent.set_attribute("agent.task", task)

        # Step 1: retrieve context
        with tracer.start_as_current_span("retrieval.vector_search"):
            context = search_vector_store(task)

        # Step 2: call LLM with context
        result = call_llm(f"Context: {context}\nTask: {task}")

        # Step 3: maybe call a tool
        if needs_tool_call(result):
            with tracer.start_as_current_span("tool.execute") as tool_span:
                tool_span.set_attribute("tool.name", "web_search")
                tool_result = execute_tool(result)
                result = call_llm(f"Tool result: {tool_result}\nOriginal task: {task}")

        return result

When you view this in Grafana via Tempo, you get a waterfall trace showing exactly where time was spent — was it the vector search? The first LLM call? The tool execution? This is the kind of visibility that makes debugging agent behavior tractable instead of guesswork.

What You Actually See in the Dashboard

Once everything is wired up, your self-hosted observability dashboard shows you:

  • Latency breakdown per agent step — which spans are slow, and whether it's network or model inference
  • Token usage over time — catch runaway prompts before they drain your API budget
  • Error rates by model/provider — spot degraded model endpoints early
  • Trace search — find the exact trace where an agent went off the rails
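For the trace-search piece, Tempo's TraceQL lets you filter directly on the gen_ai.* span attributes. A couple of queries I reach for — the threshold values are just illustrative:

```
# Traces where any LLM span burned more than 2,000 output tokens
{ span.gen_ai.usage.output_tokens > 2000 }

# Errored agent runs
{ name = "agent.run" && status = error }
```

This is where using the semantic conventions consistently pays off: one query shape covers every provider.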

For a solo operator running a few agents, this level of visibility is the difference between confidently shipping agent workflows and crossing your fingers every deploy.

Rough Edges and Honest Takes

A few things that are still annoying in 2026:

Auto-instrumentation for LLM SDKs is patchy. The OpenAI Python SDK has decent OTel support now, but Anthropic's is still experimental. You'll likely write some manual spans.

Trace volume can surprise you. Agents that loop — retries, multi-turn conversations — generate a lot of spans. Set up sampling early. A simple tail-based sampler that keeps error traces and samples 10% of success traces works well.
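The Collector's tail_sampling processor (shipped in the contrib distribution) can express exactly that policy: keep every error trace, sample the rest. A sketch — treat the wait time and percentage as starting points to tune, not recommendations:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # how long to buffer spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-success
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Wire it into the traces pipeline ahead of the batch processor. Note that tail sampling buffers whole traces in memory, so watch the Collector's RAM on a small VPS.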

Grafana dashboards take time to build. The gen_ai semantic conventions are new enough that there aren't many pre-built dashboards. Budget an afternoon to set up your panels.

Wrapping Up

OpenTelemetry for LLM observability isn't a silver bullet, but it's the most practical foundation I've found for self-hosted setups. The semantic conventions are mature enough to use in production, the Collector is rock-solid, and the cost of running your own Tempo + Grafana stack is a fraction of what you'd pay for a managed platform.

If you're running a handful of agents and want to actually understand what they're doing, this stack is worth the setup time.

Top comments (3)

Max Quimby

The high trace volume problem from agent loops is real and I'm glad you mentioned it. When you have agents calling sub-agents calling tools calling other agents, the trace tree becomes nearly unusable without a sampling strategy. We've found tail-based sampling works well here: keep 100% of error traces (the things you actually need to debug), and sample successful traces at a much lower rate (~5%). You keep the signal, eliminate most of the noise.

One gap in the current GenAI semantic conventions: they model individual LLM calls well but break down for multi-turn agent sessions where you care about the entire session arc. We ended up adding custom span attributes — agent.session_id, agent.iteration, and agent.tool_call_depth — to capture the stateful context that generic OTel doesn't model.

For the self-hosted backend, Grafana Tempo + Loki + Prometheus works cleanly as a trio with the same OTel Collector config you've described here. Tempo's native OTel ingestion means no translation layer, which keeps the ops footprint minimal. The TraceQL query language is worth learning if you're doing frequent debugging — filtering traces by custom span attributes makes root-cause analysis on agent failures dramatically faster.

Max Quimby

Great writeup, Tijo. The high trace volume problem from agent loops hit us hard when we first tried naively instrumenting every LLM call in a multi-step pipeline. What ended up working was tail-based sampling instead of head-based — you capture 100% of traces that contain errors or latency outliers and drop the boring successful ones. The OTel Collector's tail sampling processor handles this cleanly once you tune the decision wait time.

One thing I'd flag: when agents spawn sub-agents through async task queues (Celery, ARQ, etc.), trace context doesn't propagate automatically. You have to explicitly serialize the traceparent into the queue message and re-hydrate it on the consumer side. We spent days debugging why parent spans had no children before catching this.
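For anyone hitting the same wall, the shape of the fix looks roughly like this — in real code you'd call opentelemetry.propagate.inject/extract on a carrier dict, but the idea is just carrying the W3C traceparent header inside the message (the IDs here are made up for illustration):

```python
import json

def enqueue_with_context(payload: dict, trace_id: int, span_id: int) -> str:
    """Serialize the W3C traceparent into the queue message so the
    consumer can continue the same trace. (Prefer
    opentelemetry.propagate.inject on a carrier dict in real code.)"""
    traceparent = f"00-{trace_id:032x}-{span_id:016x}-01"
    return json.dumps({"payload": payload, "traceparent": traceparent})

def dequeue_context(message: str) -> dict:
    """Re-hydrate on the consumer side; hand this carrier to
    opentelemetry.propagate.extract to restore the parent context."""
    body = json.loads(message)
    return {"traceparent": body["traceparent"]}
```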

On semantic conventions — have you run into inconsistencies between how the Anthropic and OpenAI SDKs expose gen_ai.* attributes? Building a unified Grafana dashboard across both providers required a normalization transform at the Collector level. Curious if you've found a cleaner approach than remapping at the pipeline stage.

Max Quimby

Really practical writeup—the vendor-neutral approach here is worth highlighting more. The OTLP-first architecture means you can swap backends (swap Tempo for Jaeger, Grafana for something else) without reinstrumenting your agents, which has real longevity value.

On the sampling strategy challenge: one thing we've found useful is treating LLM spans differently from application spans. Agent loop iterations that complete normally are high-frequency and often low-information—sampling aggressively there (say 5%) while keeping 100% of anything that hits an error, retry, or fallback gives you the full fidelity where you need it without the cardinality explosion.

The auto-instrumentation patchiness you mention is real, especially when agents are doing async multi-step work. Manual context propagation for agent task handoffs ends up being necessary. The OTel semantic conventions for GenAI (gen_ai.* attributes) have improved a lot in the last six months though—worth checking if you're still pinning to an older spec version, the new ones handle tool call spans much more cleanly.