DEV Community

OpenTelemetry just standardized LLM tracing. Here's what it actually looks like in code.

Albert Alov on March 21, 2026

Every LLM tool invents its own tracing format. Langfuse has one. Helicone has one. Arize has one. If you built your own — congratulations, you have...

Mykola Kondratiuk

this is exactly the kind of writeup I was looking for. the fragmentation has been annoying - every observability tool doing its own thing means you learn the vendor not the pattern. been running a few AI agent projects and the tracing story was always the weakest part. having a real before/after with actual attribute names is way more useful than spec docs. the PII logging part especially - curious whether you ended up stripping at the SDK level or filtering in the collector


jidonglab

solid writeup. the token counting inconsistency across providers is the part that bites hardest in practice — we've been building agent pipelines where the same prompt goes through Claude and GPT-4 depending on the task, and without a standard like this you end up maintaining two separate cost tracking implementations that drift apart over time.

one thing i'd add: for agentic workloads the context window usage per turn is almost more important than total tokens, since you're re-sending the full conversation each loop. being able to trace that per-span with OTel attributes would make it way easier to catch context bloat before it eats your budget.


Albert Alov • Edited

Great point about context bloat in agentic loops — this is the kind of thing that's invisible until your budget alert fires.

We track gen_ai.usage.input_tokens per span, so in a ReAct loop each chat child span shows how many tokens were re-sent. In Jaeger you can watch input_tokens grow with each iteration — that's your bloat signal. What we don't have yet is a context_window_utilization_ratio (input_tokens / model_max_context). Your comment just pushed this up our backlog (github.com/vola-trebla/toad-eye/is...)
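For anyone who wants to approximate that ratio today, here's a minimal sketch. The model names and context limits are illustrative placeholders, and `contextUtilization` / `MODEL_MAX_CONTEXT` are hypothetical names, not toad-eye APIs:

```typescript
// Hypothetical lookup table; real limits change per model version.
const MODEL_MAX_CONTEXT: Record<string, number> = {
  "claude-sonnet-4": 200_000, // placeholder limit
  "gpt-4o": 128_000,          // placeholder limit
};

// Ratio of input tokens to the model's max context window, or
// undefined when the model is unknown (so you can skip the attribute).
function contextUtilization(model: string, inputTokens: number): number | undefined {
  const max = MODEL_MAX_CONTEXT[model];
  if (!max) return undefined;
  return inputTokens / max;
}

// In a ReAct loop you'd set this next to the standard token attribute:
// span.setAttribute("gen_ai.usage.input_tokens", inputTokens);
// span.setAttribute("context_window_utilization_ratio", ratio);
```

Watching that ratio climb toward 1.0 per iteration is a more direct bloat signal than eyeballing raw token counts in Jaeger.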

On multi-provider cost drift — we normalize to a single gen_ai.toad_eye.cost USD attribute. Each provider's accumulateChunk() handles the differences (Anthropic splits across message_start/message_delta, OpenAI sends usage in the final chunk, Gemini overwrites per chunk). Different plumbing, same attribute out.
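The shape of that normalization is roughly the following sketch. The prices are placeholders (real pricing changes), and `costUsd` / `PRICING` are hypothetical names, not the actual toad-eye implementation:

```typescript
// Normalized usage, whatever shape the provider originally reported it in.
interface Usage { inputTokens: number; outputTokens: number }

type Pricing = { inputPerMTok: number; outputPerMTok: number };

// Placeholder USD prices per 1M tokens; NOT real provider pricing.
const PRICING: Record<string, Pricing> = {
  anthropic: { inputPerMTok: 3, outputPerMTok: 15 },
  openai: { inputPerMTok: 2.5, outputPerMTok: 10 },
};

// One cost number out, regardless of which provider's plumbing produced
// the usage counts.
function costUsd(provider: string, usage: Usage): number {
  const p = PRICING[provider];
  return (usage.inputTokens * p.inputPerMTok + usage.outputTokens * p.outputPerMTok) / 1_000_000;
}

// span.setAttribute("gen_ai.toad_eye.cost", costUsd(provider, usage));
```

Keeping the pricing table in one place is what stops the two implementations from drifting: provider-specific code only extracts counts, never computes money.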

This topic probably deserves its own deep dive — too much nuance for a comment. Might be the next article.


Shaya K. • Edited

Why not route through a model routing system so you don't have separate billing across providers? Something like OpenRouter, or Azure AI Foundry (they also support Claude), plugging into their endpoint API?


jidonglab

yeah routing through something like OpenRouter works for the billing consolidation part. the tricky bit with OTel tracing specifically is that when you add a routing layer, the span hierarchy gets messy — you end up with the router's spans mixed into your application traces. Azure AI Foundry handles this better since the Claude endpoint there still emits standard OTel spans. for the token counting inconsistency though, routing doesn't fully solve it since each provider still reports usage differently. you'd still want a normalization layer sitting between the OTel exporter and your observability backend.
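That normalization layer can be small. Here's a sketch, assuming the raw usage field names each provider reports (`prompt_tokens`/`completion_tokens` for OpenAI, `input_tokens`/`output_tokens` for Anthropic); `normalizeUsage` itself is a hypothetical helper, not a real library function:

```typescript
// Standard GenAI attribute names the backend should see, regardless of
// which provider the router forwarded the request to.
type GenAiUsage = {
  "gen_ai.usage.input_tokens": number;
  "gen_ai.usage.output_tokens": number;
};

function normalizeUsage(provider: string, raw: Record<string, number>): GenAiUsage {
  switch (provider) {
    case "openai": // OpenAI reports prompt_tokens / completion_tokens
      return {
        "gen_ai.usage.input_tokens": raw.prompt_tokens,
        "gen_ai.usage.output_tokens": raw.completion_tokens,
      };
    case "anthropic": // Anthropic reports input_tokens / output_tokens
      return {
        "gen_ai.usage.input_tokens": raw.input_tokens,
        "gen_ai.usage.output_tokens": raw.output_tokens,
      };
    default:
      throw new Error(`no usage mapping for provider: ${provider}`);
  }
}
```

Running this before export means dashboards and cost queries only ever see the `gen_ai.usage.*` names, even when a router sits in front of the providers.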


Shaya K.

So for my agentic system, running on a VPS with OpenClaw with plans to use model-router, you're saying I should use something like this anyway? I'm more technical than the average user by far, but I'm not a programmer myself.


jidonglab

honestly if you're running on a VPS with OpenClaw and model-router, OTel tracing is probably overkill for your setup right now. it's more useful when you're running multiple models across different providers and need to debug which call is slow or failing. for a single-agent setup the logs from OpenClaw itself should give you enough visibility. where tracing starts paying off is when you add a second model or start chaining agents — then suddenly you need to trace a request across multiple hops and that's where OTel shines.


Lazar Nikolov

Great read @vola-trebla! Shameless plug - Sentry also does AI Agent Monitoring according to the OTel specs, and our free tier includes 5M spans (if you want to include us in your article).


jidonglab

the span naming convention with {operation} {name} is a nice touch — makes it way easier to grep through traces when you're debugging a chain of LLM calls versus agent tool invocations. the part that's going to matter most in practice though is how well this plays with existing OTel backends. a lot of teams already have Jaeger or Tempo set up for their microservices, so being able to see LLM latency right next to API call traces without a separate observability tool is the real win here.
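As a quick illustration of why the convention greps well, a toy sketch (`spanName` is a hypothetical helper; the operation values mirror the spec's `{operation} {name}` pattern, while the model and tool names are examples):

```typescript
// "{operation} {name}": operation first means every span of one kind
// shares a prefix, so filtering is a string match.
function spanName(operation: string, name: string): string {
  return `${operation} ${name}`;
}

const spans = [
  spanName("chat", "gpt-4o"),                 // "chat gpt-4o"
  spanName("execute_tool", "search"),         // "execute_tool search"
  spanName("chat", "claude-sonnet-4"),        // "chat claude-sonnet-4"
  spanName("invoke_agent", "orchestrator"),   // "invoke_agent orchestrator"
];

// All tool invocations, regardless of which tool, in one filter:
const toolSpans = spans.filter((s) => s.startsWith("execute_tool "));
```

The same prefix trick works in any backend's search box, which is exactly the "learn the pattern, not the vendor" win.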


Gohar

I’m currently making a video series on replaying warcraft 2 after 25 years, and just made a video about how damage works. I’m interested in any feedback or changes for this as it wasn’t as clear cut as I thought. Or I hope it can be informative for anyone interested. I can’t post links because I haven’t made enough posts, but if you search “warcraft II remastered - damage mechanics” in YouTube, you will find it released one day ago.


Botánica Andina

This is so timely! I've been wrestling with custom LLM tracing attributes for ages, and the 'wild west' analogy hits home. It's tough to move from llm.provider to a standardized approach, but your breakdown makes the OTel GenAI conventions feel much more approachable. Excited to dig into this.


erndob

This is insane. AI generated article with mostly AI generated comments.
Even if a human is behind this content, the voice of humans is lost and everyone sounds like the same bot.

There's 36 em dashes on this page right now, between the article and comments.


Adarsh Kant

This is exactly the kind of practical content the AI observability space needs. The dual-emit migration strategy is smart — we've learned similar lessons building voice AI agents that interact with real DOM elements on websites. When you're tracing voice-to-action pipelines (speech recognition → intent → DOM action → response) across 50+ languages, having standardized span attributes is critical for debugging latency. The gen_ai.usage.input_tokens tracking per span is especially relevant when you're optimizing for sub-700ms voice response times. Great writeup!


Mr. Lin Uncut

how are you handling the span attributes for streaming responses in the opentelemetry spec? that's where i've seen the most inconsistency across implementations, since you don't have a clean start and end token count until the stream closes


Albert Alov

Great question!
We wrap the async iterator with a StreamAccumulator - the span starts before the stream, chunks are accumulated incrementally (text + token counts from provider-specific metadata), and all span attributes are set once when the stream closes. TTFT is tracked separately via onFirstChunk callback. If the consumer breaks out early, the finally block still records what we have.

For token counts: OpenAI sends usage in the final chunk, Anthropic splits it across message_start (input) and message_delta (output) - each provider has its own accumulateChunk() extractor.

Here's the code:
wrapAsyncIterable() and createStreamingHandler() - github.com/vola-trebla/toad-eye/bl...
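For readers who don't want to dig through the repo, here's a simplified, self-contained sketch of the pattern described above. This is not toad-eye's actual code; the `Chunk` shape and callback names are illustrative:

```typescript
// Minimal chunk shape; real providers carry usage in different fields.
interface Chunk { text: string; outputTokens?: number }

// Wrap a provider stream: fire a TTFT callback on the first chunk,
// accumulate text and token counts, and finalize in `finally` so a
// consumer that breaks out early still records partial data.
async function* wrapStream(
  stream: AsyncIterable<Chunk>,
  onFirstChunk: () => void,
  finalize: (attrs: { text: string; outputTokens: number }) => void,
): AsyncGenerator<Chunk> {
  let text = "";
  let outputTokens = 0;
  let first = true;
  try {
    for await (const chunk of stream) {
      if (first) { onFirstChunk(); first = false; } // TTFT hook
      text += chunk.text;
      // Later chunks overwrite earlier counts, matching providers that
      // only report usage at (or near) the end of the stream.
      if (chunk.outputTokens !== undefined) outputTokens = chunk.outputTokens;
      yield chunk;
    }
  } finally {
    // Runs on normal completion AND early break/throw.
    finalize({ text, outputTokens });
  }
}
```

The key property is that the span attributes are set exactly once, in `finalize`, no matter how the stream ends.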


Adarsh Kant

This is incredibly timely. We're building AnveVoice — a voice AI agent that uses 46 MCP tools via JSON-RPC 2.0 to take real DOM actions on websites. Standardized LLM tracing via OpenTelemetry would be a game-changer for debugging our agent pipelines. Right now tracing multi-step voice commands (user speaks → intent parsed → tool selected → DOM action executed → confirmation spoken) across different LLM providers is painful with custom spans. The GenAI semantic conventions would give us consistent observability across the entire chain. Definitely implementing this. Great writeup on the spec vs. reality gap.


klement Gunndu

The span naming migration is the most useful part. Worth noting this gets harder with multi-agent chains where Agent A calls Agent B which calls a tool — the parent-child span hierarchy matters more than the naming format at that point.


Albert Alov

Totally agree - hierarchy is everything for multi-agent. We handle this via nested traceAgentQuery() calls - each uses startActiveSpan, so inner agents automatically become child spans.

An orchestrator calling a specialist looks like: invoke_agent orchestrator → invoke_agent specialist → execute_tool search - all properly nested in Jaeger without manual context propagation.
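To make the mechanism concrete, here's a self-contained toy model of how active-span tracking produces that nesting. This mimics startActiveSpan with a plain stack; it is not the real OTel API:

```typescript
interface Span { name: string; parent?: string }

const spans: Span[] = [];     // everything we "recorded"
const active: string[] = [];  // stack of currently active span names

// Each new span's parent is whatever is active when it starts; pushing
// onto the stack for the callback's duration is what makes nested calls
// become children without any manual context passing.
function startActiveSpan<T>(name: string, fn: () => T): T {
  spans.push({ name, parent: active[active.length - 1] });
  active.push(name);
  try { return fn(); } finally { active.pop(); }
}

// orchestrator -> specialist -> tool, nested like the example above
startActiveSpan("invoke_agent orchestrator", () =>
  startActiveSpan("invoke_agent specialist", () =>
    startActiveSpan("execute_tool search", () => "done"),
  ),
);
```

The real SDK does the same thing via context propagation rather than a literal stack, but the parent-child result is identical.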


Garvit Singh

So this is used for standardizing tracing, but if I use LangSmith for tracing and evaluation, is this article of any use to me? Currently it appears way out of my league.


Albert Alov

Honest answer — if LangSmith covers your tracing and evals today, you're good. No need to change anything.

Where this kicks in: LangSmith is a closed ecosystem. Your traces live in their format, on their platform. OTel is the open standard — same traces work in Jaeger, Datadog, Grafana, Arize, whatever. If you ever want to switch tools or combine platforms, standardized attributes save you from re-instrumenting everything.

Camera vs. file format: LangSmith is a great camera. OTel is JPEG, the format that lets you open your photos anywhere.