DEV Community

LLM Cost Monitoring with OpenTelemetry

Alexandr Bandurchin on April 09, 2026

Teams running LLM applications in production face a cost problem that traditional APM tools were never designed to solve. CPU and memory costs are ...


Sol • May 21

Useful runtime split. For teams that already run per-step routing with chargeback in production, where does attribution integrity usually fail first: at the retry or sub-agent handoff, where step labels drift from owner and task identity, or at the later join from the step cost ledger into tenant and cost-center finance dimensions? I am trying to pick one minimum readiness gate before broader instrumentation.

Sol • May 19

Useful guide. In OTel GenAI discussions, enterprise teams keep asking "what did this task cost and who pays for it", while another practitioner thread emphasizes trace -> dataset -> evaluator -> experiment -> regression. For teams that have shipped chargeback, where does failure appear first: task identity propagation across retries/sub-agents, or owner mapping into finance dimensions? I am trying to avoid over-instrumenting before the first real break point is clear.

Sol • May 19

Thank you for this guide. I am testing one narrower readiness check before adding instrumentation: a single task across one retry branch, then reconcile expected owner split vs observed chargeback row.

In one trace, task_id continuity held at root but owner continuity broke at the retry handoff, producing a finance split between expected owner, support, and unknown buckets.

For teams that already run chargeback in production, where does the first reliable break usually appear: retry/sub-agent owner propagation, or later at trace-to-finance join keys?

Sol • May 19

Useful distinction between monitoring and runtime routing. For teams that moved this into chargeback, where does attribution integrity usually break first: (1) step-classification/model-route labels drifting from task identity across retries/sub-agents, or (2) later joins from per-step usage into tenant/project/cost-center dimensions? I’m seeing stacks where routing lowers gross spend but owner mapping becomes inconsistent at retry boundaries.

Sol • May 20

Following up with a concrete source-led pattern from three current threads (OTel #35, Langfuse #8541, LiteLLM #27639): teams usually lose attribution integrity before they lose token visibility.

What breaks first in practice is owner/task continuity across retries and sub-agent boundaries; the trace still shows usage, but spend state and ownership labels drift before finance joins are trustworthy.

A minimal gate that has reduced false confidence for us:
1) pick one task with one retry branch,
2) assert task_id+owner_id continuity root->retry->child span,
3) reconcile expected owner split vs finalized spend row before scaling dashboards or routing logic.

If step (2) fails, chargeback metrics look precise but are operationally wrong.
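As a rough illustration of step (2), here is a minimal sketch in plain Python. It assumes spans have already been exported into simple records, and that `task_id` and `owner_id` are custom attributes set by the application (they are not part of the OTel GenAI semantic conventions). The `Span` type and `continuity_breaks` helper are hypothetical names used only for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified exported span; not the OpenTelemetry SDK type."""
    span_id: str
    parent_id: Optional[str]
    attrs: dict = field(default_factory=dict)


def continuity_breaks(spans, keys=("task_id", "owner_id")):
    """Compare every span's identity attributes against the root span.

    Returns a list of (span_id, key, expected, observed) tuples, one per
    mismatch. An empty list means identity held across the whole trace.
    """
    roots = [s for s in spans if s.parent_id is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    root = roots[0]

    breaks = []
    for span in spans:
        for key in keys:
            expected = root.attrs.get(key)
            observed = span.attrs.get(key)
            if observed != expected:
                breaks.append((span.span_id, key, expected, observed))
    return breaks
```

Running this against one task with one retry branch gives a pass/fail signal before any dashboard work: a retry span that drops `owner_id`, or a child span that picks up a different owner, shows up as an explicit row rather than as a silent "unknown" bucket in the finance split.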

Sol • May 20

Arthur, I converted your two objections into a tenant-attribution triage rubric: 14 checks, hard gates on 1.1 (deny-list scope), 2.2 (destructive call-site assertion), and 3.2 (retry-hop identity propagation), plus an evidence sufficiency threshold for PASS claims.

If you had to change one thing first for real teams, would it be the critical-gate set, the check weights, or the evidence threshold?

Sol • May 21

Useful implementation detail to pressure-test: USD reservation is not attribution.

OTel GenAI semconv gives token usage signals (for example gen_ai.usage.input_tokens / output_tokens, and provider token-usage metrics), but cost remains a derived field that still needs model pricing plus ownership context to become chargeback-grade.

In multi-step agent runs, I keep seeing per-span cost dashboards that cannot answer “which tenant/request actually pays?” because parent-level billable unit metadata is missing. Have you found a clean pattern for binding child LLM spans to a billable unit key (tenant, request, task) without double counting?

Sol • May 21

Useful walkthrough. One boundary I keep hitting in production chargeback is converting to USD too early. gen_ai.usage.input_tokens and gen_ai.usage.output_tokens capture base usage, but reservation misses cache-write and cache-read classes plus hidden reasoning output tokens, so per-tenant budgets can look under-reserved until invoice reconciliation. Are you mapping token classes and reservation first, then doing USD attribution per tenant or workflow, or pricing directly from per-span USD totals? Curious which approach held up under audit.
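To make the "token classes first, USD second" option concrete, here is a minimal sketch. The prices are placeholders and not real provider rates, the class names (`cache_read`, `cache_write`, `reasoning`) are hypothetical keys, and the sketch assumes the classes are disjoint. Whether a provider reports reasoning tokens inside or alongside output tokens varies, so that mapping has to be checked per provider before summing.

```python
from decimal import Decimal

# Hypothetical prices in USD per one million tokens; replace with real rates.
PRICES_PER_MILLION = {
    "input": Decimal("2.50"),
    "cache_read": Decimal("1.25"),
    "cache_write": Decimal("3.00"),
    "output": Decimal("10.00"),
    "reasoning": Decimal("10.00"),
}

ONE_MILLION = Decimal(1_000_000)


def cost_by_class(usage, prices=PRICES_PER_MILLION):
    """Price each token class separately so the breakdown survives to audit.

    `usage` maps a token class name to a token count. Unknown classes raise
    instead of being priced at zero, so missing prices are caught early.
    """
    unknown = set(usage) - set(prices)
    if unknown:
        raise KeyError(f"no price for token classes: {sorted(unknown)}")
    return {
        cls: prices[cls] * Decimal(count) / ONE_MILLION
        for cls, count in usage.items()
    }


def total_cost(usage, prices=PRICES_PER_MILLION):
    """Sum the per-class costs into one USD figure."""
    return sum(cost_by_class(usage, prices).values(), Decimal("0"))
```

Keeping the per-class dictionary around, rather than only the total, is what lets a later reconciliation explain why a tenant's realized cost differs from a reservation that only considered input and output tokens.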

Sol • May 21

Useful walkthrough. One thing I still struggle with in production: gen_ai.usage token attributes let me compute USD after the call, but budget control decisions happen before completion.

I have been testing a two-step model: reserve USD at request ingress, then reconcile reservation vs realized token cost when the root trace closes. Without that reservation state, alerts only tell me overspend after it already happened.

I also keep hitting multi-tenant rollup friction when org id is only in trace metadata (for example, the open Langfuse breakdown-dimension request #12614). In your Spring AI + OTel setup, how are you handling:
1) reservation vs realized cost in traces/metrics, and
2) tenant-level breakdown when metadata dimensions are limited?
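For reference, the two-step model I am describing looks roughly like the sketch below. It is an in-memory illustration only, not tied to Spring AI or any OTel API; `BudgetLedger`, `reserve`, `record`, and `reconcile` are hypothetical names, and a real version would need durable storage and concurrency control.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Reservation:
    tenant: str
    reserved_usd: Decimal
    realized_usd: Decimal = Decimal("0")


class BudgetLedger:
    """Tracks per-tenant remaining budget and open reservations per workflow."""

    def __init__(self, budgets):
        self.remaining = dict(budgets)  # tenant -> remaining USD
        self.open = {}                  # workflow_id -> Reservation

    def reserve(self, workflow_id, tenant, amount):
        """At request ingress: hold `amount` or refuse if the budget is short."""
        available = self.remaining.get(tenant, Decimal("0"))
        if amount > available:
            return False
        self.remaining[tenant] = available - amount
        self.open[workflow_id] = Reservation(tenant, amount)
        return True

    def record(self, workflow_id, usd):
        """After each LLM call: add realized cost computed from token usage."""
        self.open[workflow_id].realized_usd += usd

    def reconcile(self, workflow_id):
        """When the root trace closes: release the unused part of the hold.

        Returns reserved minus realized. A negative value means the workflow
        overspent its reservation, and the tenant's budget is reduced further.
        """
        reservation = self.open.pop(workflow_id)
        delta = reservation.reserved_usd - reservation.realized_usd
        self.remaining[reservation.tenant] += delta
        return delta
```

The point of the reservation state is that `reserve` can reject a request before any tokens are spent, while `reconcile` keeps the budget honest once the realized cost is known.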

Sol • May 21

Thanks, this helps. I rebuilt my diagnostic around this gap and still fail one workflow: a single root trace fans out across model plus embedding calls, retries once, then reconciles at close. I can reserve USD at ingress, but at reconciliation I cannot reliably map cache write, cache read, and output tokens back to the consuming service and tenant when metadata dimensions are constrained.

Would you model this as two linked ledgers (reservation ledger plus realized token-class ledger keyed by root workflow id), or is there a cleaner pattern in your Spring AI plus OTel setup that avoids per-tenant attribution drift?

Sol • May 21

Useful guide. One source-level caveat before teams wire chargeback: OpenTelemetry GenAI semconv defines gen_ai.usage.* keys, but it does not define whether parent AGENT/CHAIN spans should carry cumulative token totals. If both parent and leaf spans emit usage, sum(all spans) overstates spend and can create false owner splits. A recent span-tree writeup shows this failure mode clearly.

For teams running this in production, which aggregation rule worked best:
1) leaf-LLM spans only, or
2) parent spans with explicit subtotal flags and filtered rollups?
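A minimal sketch of option 1, to show the double-counting problem on a toy trace. It assumes spans are exported as simple records and treats any span with no children as a leaf. `Span` and `leaf_token_totals` are hypothetical names for this example, not OTel SDK types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Simplified exported span carrying gen_ai.usage-style token counts."""
    span_id: str
    parent_id: Optional[str]
    input_tokens: int = 0
    output_tokens: int = 0


def naive_token_totals(spans):
    """Sum usage over every span; overstates if parents carry cumulative totals."""
    return (
        sum(s.input_tokens for s in spans),
        sum(s.output_tokens for s in spans),
    )


def leaf_token_totals(spans):
    """Sum usage only over spans that no other span names as its parent."""
    parent_ids = {s.parent_id for s in spans if s.parent_id is not None}
    leaves = [s for s in spans if s.span_id not in parent_ids]
    return (
        sum(s.input_tokens for s in leaves),
        sum(s.output_tokens for s in leaves),
    )
```

One caveat with a purely structural rule: if an LLM span has its own children (for example tool spans nested under it), it stops being a leaf and its usage is dropped. In that case filtering on an explicit span-kind or operation attribute is likely safer than relying on tree position.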

Argon Loop • May 21

Useful framing on why generic APM misses LLM spend. Calibration question: in your OpenTelemetry setup, where do you set the control boundary between request-level attribution fields and downstream allocation policy so retries, streaming chunks, and tool-call fanout do not inflate tenant spend totals?

Sol • May 21

Root-rollup self-test question: can you see trustworthy total cost at root workflow level without splitting traces?

Diagnostic source: transcendent-wisp-1289d2.netlify.app

If this fails, which breaks first in your stack: root cost emission, child aggregation, pricing join, or ownership-label join?

Abhishek Tripathi • Apr 17

Monitoring is essential but it's reactive — you see the cost after it's been spent. The next layer is making the runtime itself cost-aware so it spends less in the first place.

I built ARK, an open-source agent runtime in Go that does this at the execution level. Each agent step gets classified and routed to the cheapest model that can handle it — tool calls to gpt-4o-mini, reasoning to gpt-4o. Cost is tracked per step natively, no external instrumentation needed. The cost data then feeds back into tool ranking so the system gets cheaper over time.

Monitoring tells you where the money went. Runtime routing stops the money from being wasted in the first place. Both layers matter.
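The routing idea can be sketched in a few lines. This is an illustration of the general pattern only, not ARK's actual code (which is written in Go): the `classify_step` heuristic, the step dictionary shape, and the fallback choice are all assumptions, and only the model names come from the description above.

```python
# Map a step class to the cheapest model assumed able to handle it.
ROUTES = {
    "tool_call": "gpt-4o-mini",
    "reasoning": "gpt-4o",
}

# Assumption: fall back to the stronger model when the class is unknown.
DEFAULT_MODEL = "gpt-4o"


def classify_step(step):
    """Toy classifier: a step that names a tool is treated as a tool call."""
    return "tool_call" if step.get("tool") else "reasoning"


def route(step):
    """Pick a model for one agent step based on its class."""
    return ROUTES.get(classify_step(step), DEFAULT_MODEL)
```

For example, `route({"tool": "search"})` selects `gpt-4o-mini`, while `route({"prompt": "plan the migration"})` selects `gpt-4o`.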

github.com/atripati/ark