Alexandr Bandurchin for Uptrace

Posted on Apr 9 • Originally published at uptrace.dev

LLM Cost Monitoring with OpenTelemetry

#llm #ai #devops #monitoring

Teams running LLM applications in production face a cost problem that traditional APM tools were never designed to solve. CPU and memory costs are relatively predictable — a web service processing 1,000 requests per second costs roughly the same week over week. LLM API costs are not. A single user session can cost $0.01 or $5 depending on prompt length, model choice, conversation history, and how many retries happen inside your chain. Without instrumentation, cost anomalies are invisible until the monthly invoice.

A common pattern: a team launches a feature using GPT-5, everything looks fine in staging, and then production traffic reveals that a small percentage of requests trigger long multi-turn conversations that cost 50× more than the average. By the time the bill arrives, the money is already spent.

OpenTelemetry's GenAI semantic conventions address this at the instrumentation layer. Once an instrumentation library is installed, the gen_ai.usage.input_tokens and gen_ai.usage.output_tokens attributes are recorded on every API call, giving you token-level visibility that you can turn into dollar figures, per-request cost breakdowns, and budget alerts, all within the observability stack you already have.

Why Standard APM Misses LLM Costs

Traditional APM tracks latency, error rates, and throughput. These metrics are meaningful for LLM applications too, but they say nothing about financial cost. A request that takes 3 seconds and costs $0.002 looks identical in APM to one that takes 3 seconds and costs $0.40. Both have the same latency. Only token counts tell you the difference.

Three things make LLM costs hard to track without dedicated instrumentation:

Token consumption is buried inside SDK calls. Unless you manually read response.usage after every API call and record it somewhere, the data never appears in your traces or metrics. Most applications don't do this consistently.

Costs happen across chained calls. A LangChain agent might make 8 OpenAI calls to answer a single user question. The cost of the full interaction is the sum of all 8, but standard tracing only shows individual requests — not their aggregate cost under a parent operation.

Model prices vary widely and change. GPT-5.4 costs 12.5× more per input token than GPT-5.4-nano ($2.50 vs $0.20 per 1M tokens). Reasoning models like o3 and o4-mini bill internal "thinking" tokens that never appear in the response but still cost money. If your application conditionally uses different models, you need model-level attribution to understand your cost structure.

LLM Pricing Reference

Current pricing for the most common models (April 2026 — always verify against provider docs as prices change):

Model	Input (per 1M tokens)	Output (per 1M tokens)	Notes
gpt-5.4	$2.50	$15.00	OpenAI flagship (Mar 2026)
gpt-5	$1.25	$10.00	Good balance of cost and capability
gpt-5.4-mini	$0.75	$4.50	Mid-tier, good for most tasks
gpt-5.4-nano	$0.20	$1.25	Lowest cost in GPT-5.4 family
o3	$2.00	$8.00	Reasoning model — see note below
o4-mini	$1.10	$4.40	Compact reasoning model
claude-sonnet-4.6	$3.00	$15.00	Anthropic recommended
claude-haiku-4.5	$1.00	$5.00	Anthropic budget tier
gemini-2.5-pro	$1.25	$10.00	Contexts under 200K tokens

Reasoning models (o3, o4-mini) require special handling. These models use internal "reasoning tokens" during inference that are billed as output tokens but not returned in the response. gen_ai.usage.output_tokens includes these hidden tokens, so actual cost can be significantly higher than visible completion length suggests. Set conservative alert thresholds for o-series models and treat output token counts as an upper bound on reasoning effort.
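
To see how much of a reasoning model's output bill is hidden, the usage payload can be split into visible and reasoning tokens. The following is a minimal sketch, assuming the usage object has been converted to a dict shaped like OpenAI's Chat Completions usage field, where completion_tokens_details.reasoning_tokens reports the hidden portion. The helper split_output_tokens and the numbers in the example are hypothetical.

```python
def split_output_tokens(usage: dict) -> dict:
    """Split billed output tokens into visible and hidden reasoning tokens.

    Assumes `usage` mirrors the OpenAI Chat Completions usage payload.
    The details block may be absent for non-reasoning models.
    """
    billed = usage.get("completion_tokens", 0)
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0) or 0
    return {
        "billed_output_tokens": billed,
        "reasoning_tokens": reasoning,
        "visible_tokens": billed - reasoning,
    }


# Illustrative numbers only, not a real API response
example_usage = {
    "prompt_tokens": 312,
    "completion_tokens": 1487,
    "completion_tokens_details": {"reasoning_tokens": 1400},
}
split = split_output_tokens(example_usage)
print(split)
```

Recording the reasoning share as its own span attribute (under a name of your choosing) makes it possible to alert on requests where hidden tokens dominate the bill.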

Output tokens are consistently more expensive than input tokens — 4–8× for most models. Applications generating long completions (code, detailed explanations) have very different cost profiles from those producing short factual answers.
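
The output-to-input ratio can be checked directly against the pricing table. This short sketch copies the prices from the table above into a dictionary (the name PRICES is only for illustration) and computes the ratio per model.

```python
# USD per 1M tokens as (input, output), copied from the pricing table
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5": (1.25, 10.00),
    "gpt-5.4-mini": (0.75, 4.50),
    "gpt-5.4-nano": (0.20, 1.25),
    "o3": (2.00, 8.00),
    "o4-mini": (1.10, 4.40),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5": (1.00, 5.00),
    "gemini-2.5-pro": (1.25, 10.00),
}

# Ratio of output price to input price, rounded for display
ratios = {model: round(out / inp, 2) for model, (inp, out) in PRICES.items()}

for model, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{model:20s} output is {ratio}x the input price")
```

For the models listed, the ratios fall between 4× and 8×, so shifting tokens from the completion to the prompt (for example, asking for shorter answers) usually changes cost more than trimming the prompt by the same amount.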

Capturing Token Usage with OpenTelemetry

The opentelemetry-instrumentation-openai-v2 package automatically records gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on every span. No manual response parsing required:

from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())
)
trace.set_tracer_provider(provider)

OpenAIInstrumentor().instrument()

# All subsequent OpenAI calls are automatically traced with token counts
from openai import OpenAI
client = OpenAI()

Each span now carries the token breakdown:

Span: gen_ai.operation.name = "chat"
  gen_ai.system              = "openai"
  gen_ai.request.model       = "gpt-5"
  gen_ai.usage.input_tokens  = 312
  gen_ai.usage.output_tokens = 87
  gen_ai.response.finish_reasons = ["stop"]

For Anthropic, the equivalent package is opentelemetry-instrumentation-anthropic. Both emit the same gen_ai.* attributes, so your queries and dashboards work across providers. For full setup instructions and available options, see the OpenAI instrumentation guide.

Calculating Cost Per Request

With token counts on spans, cost calculation is straightforward. Add it as a custom span attribute so it's queryable alongside everything else:

from opentelemetry import trace

# Keep pricing in one place — update when providers change rates
MODEL_PRICING = {
    "gpt-5.4":              {"input": 2.50,  "output": 15.00},
    "gpt-5":                {"input": 1.25,  "output": 10.00},
    "gpt-5.4-mini":         {"input": 0.75,  "output": 4.50},
    "gpt-5.4-nano":         {"input": 0.20,  "output": 1.25},
    "o3":                   {"input": 2.00,  "output": 8.00},
    "o4-mini":              {"input": 1.10,  "output": 4.40},
    "claude-sonnet-4-6":    {"input": 3.00,  "output": 15.00},
    "claude-haiku-4-5":     {"input": 1.00,  "output": 5.00},
    "gemini-2.5-pro":       {"input": 1.25,  "output": 10.00},
}

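def calculate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    # Prices are USD per 1M tokens. Unknown models fall back to zero cost,
    # so a missing pricing entry under-reports spend instead of raising.
    pricing = MODEL_PRICING.get(model, {"input": 0.0, "output": 0.0})
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000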

tracer = trace.get_tracer(__name__)

def chat_with_cost(prompt: str, model: str = "gpt-5") -> str:
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("gen_ai.request.model", model)

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )

        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        cost = calculate_cost_usd(model, input_tokens, output_tokens)

        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("llm.cost.usd", cost)

        return response.choices[0].message.content

The llm.cost.usd attribute is now queryable in your observability backend: filter by model, sum over time ranges, group by service or user.
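
One practical wrinkle: the model name a provider reports in its response is often a dated variant of the name you requested, and a plain dictionary lookup prices any unknown name at zero. Below is a minimal sketch of a stricter lookup, assuming dated variants share a prefix with the base name. The helpers resolve_pricing and cost_or_none, the trimmed table, and the dated model name in the usage line are all hypothetical.

```python
from typing import Optional

# Trimmed pricing table, USD per 1M tokens (illustrative subset)
PRICING_PER_1M = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "gpt-5.4-mini": {"input": 0.75, "output": 4.50},
}


def resolve_pricing(model: str) -> Optional[dict]:
    """Return pricing for an exact name or a '<base>-<suffix>' variant."""
    if model in PRICING_PER_1M:
        return PRICING_PER_1M[model]
    # Try the longest base names first so the most specific entry wins.
    # Caveat: a sibling model such as "<base>-mini" would also match its
    # base entry, so list sibling models explicitly in the table.
    for base in sorted(PRICING_PER_1M, key=len, reverse=True):
        if model.startswith(base + "-"):
            return PRICING_PER_1M[base]
    return None


def cost_or_none(model: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Return cost in USD, or None when the model has no pricing entry."""
    pricing = resolve_pricing(model)
    if pricing is None:
        return None  # the caller can flag the span as unpriced instead of $0
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000


print(cost_or_none("gpt-5-2025-08-07", 312, 87))
print(cost_or_none("some-unlisted-model", 312, 87))
```

Returning None rather than zero lets you count unpriced requests separately, which surfaces a stale pricing table before it distorts a dashboard.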

Tracking Total Cost per Agent Run

When a single user operation triggers multiple LLM calls — a LangChain agent, a multi-step chain, any orchestrated workflow — you want the total cost of the full interaction, not just individual calls. Wrap the operation in a parent span and aggregate:

def run_research_agent(question: str, user_id: str) -> str:
    with tracer.start_as_current_span("agent.run") as parent_span:
        parent_span.set_attribute("app.user_id", user_id)
        parent_span.set_attribute("app.operation", "research")

        total_cost = 0.0
        total_input_tokens = 0
        total_output_tokens = 0

        # Step 1: decompose the question (cheap model)
        with tracer.start_as_current_span("agent.decompose") as span:
            response = client.chat.completions.create(
                model="gpt-5.4-nano",
                messages=[{"role": "user", "content": f"Break this into sub-questions: {question}"}]
            )
            step_cost = calculate_cost_usd(
                "gpt-5.4-nano",
                response.usage.prompt_tokens,
                response.usage.completion_tokens
            )
            span.set_attribute("llm.cost.usd", step_cost)
            total_cost += step_cost
            total_input_tokens += response.usage.prompt_tokens
            total_output_tokens += response.usage.completion_tokens
            sub_questions = response.choices[0].message.content

        # Step 2: answer each sub-question (full model)
        with tracer.start_as_current_span("agent.answer") as span:
            response = client.chat.completions.create(
                model="gpt-5",
                messages=[{"role": "user", "content": sub_questions}]
            )
            step_cost = calculate_cost_usd(
                "gpt-5",
                response.usage.prompt_tokens,
                response.usage.completion_tokens
            )
            span.set_attribute("llm.cost.usd", step_cost)
            total_cost += step_cost
            total_input_tokens += response.usage.prompt_tokens
            total_output_tokens += response.usage.completion_tokens
            answer = response.choices[0].message.content

        # Record totals on the parent span
        parent_span.set_attribute("llm.cost.usd", total_cost)
        parent_span.set_attribute("llm.total_input_tokens", total_input_tokens)
        parent_span.set_attribute("llm.total_output_tokens", total_output_tokens)

        return answer

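With this structure you can query both individual step costs and total operation cost from the same trace. Because the parent span and its children all carry llm.cost.usd, restrict aggregate queries to one level (parent spans only, or child spans only); summing the attribute across every span in the trace counts each call twice.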
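
When both a parent span and its children carry a cost attribute, the aggregation rule matters. The sketch below uses a hypothetical in-memory SpanNode tree (not an OpenTelemetry type) with illustrative costs to contrast a naive sum over every span with a sum over leaf spans only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpanNode:
    """Hypothetical stand-in for an exported span with a cost attribute."""
    name: str
    cost_usd: Optional[float] = None
    children: List["SpanNode"] = field(default_factory=list)


def naive_cost(span: SpanNode) -> float:
    """Sum the cost attribute on every span, parents included."""
    return (span.cost_usd or 0.0) + sum(naive_cost(c) for c in span.children)


def leaf_cost(span: SpanNode) -> float:
    """Sum the cost attribute on leaf spans only (the actual LLM calls)."""
    if not span.children:
        return span.cost_usd or 0.0
    return sum(leaf_cost(c) for c in span.children)


# Parent carries the subtotal of its two children (illustrative values)
run = SpanNode(
    "agent.run",
    cost_usd=0.0030,
    children=[
        SpanNode("agent.decompose", cost_usd=0.0005),
        SpanNode("agent.answer", cost_usd=0.0025),
    ],
)

print("naive:", naive_cost(run))
print("leaf-only:", leaf_cost(run))
```

The naive total comes out at twice the real spend here, which is why a query over this trace layout should filter on span name or span level before summing.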

Recording Cost as an OpenTelemetry Metric

Spans are good for per-request cost. For aggregate spend over time — daily cost, cost by model, cost rate anomalies — OpenTelemetry metrics are the right tool. A counter accumulates continuously and can be queried for any time window:

from opentelemetry import metrics

meter = metrics.get_meter(__name__)

# Counter for total cost in USD
cost_counter = meter.create_counter(
    name="llm.cost.usd",
    description="Cumulative LLM API cost in USD",
    unit="USD",
)

# Histogram for per-request cost distribution
cost_histogram = meter.create_histogram(
    name="llm.cost.per_request.usd",
    description="Cost distribution per LLM request",
    unit="USD",
)

def tracked_completion(model: str, messages: list) -> str:
    response = client.chat.completions.create(model=model, messages=messages)

    cost = calculate_cost_usd(
        model,
        response.usage.prompt_tokens,
        response.usage.completion_tokens
    )

    labels = {"gen_ai.request.model": model, "service.name": "my-service"}
    cost_counter.add(cost, labels)
    cost_histogram.record(cost, labels)

    return response.choices[0].message.content

The counter gives you cumulative spend that you can diff over any window. The histogram shows your cost distribution — whether you have occasional expensive outlier requests or a uniformly expensive workload. For broader AI metrics patterns — GPU utilization, inference latency histograms, sampling strategies for high-volume workloads — see OpenTelemetry for AI Systems.
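
Diffing a cumulative counter over a window is normally done by the metrics backend, but the arithmetic is simple enough to sketch. The example assumes samples arrive as (timestamp, cumulative USD) pairs sorted by time, and that a drop in value means the process restarted and the counter began again from zero. The function name and sample data are hypothetical.

```python
from typing import List, Tuple


def spend_in_window(
    samples: List[Tuple[float, float]], start: float, end: float
) -> float:
    """Spend between start and end from cumulative counter samples.

    samples: (timestamp_seconds, cumulative_usd) pairs, sorted by timestamp.
    """
    in_window = [(t, v) for t, v in samples if start <= t <= end]
    total = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset: count what accumulated since the restart
            total += cur
    return total


# Illustrative samples with a restart between t=60 and t=120
samples = [(0, 1.0), (60, 1.5), (120, 0.2), (180, 0.7)]
print(spend_in_window(samples, 0, 180))
```

Handling the reset explicitly avoids a negative spend figure for the window in which a service was redeployed.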

Cost Visibility in LangChain Applications

For LangChain chains and agents, LangChainInstrumentor captures spans for each chain step. Combine it with per-call cost attribution using the pattern above. For a deeper walkthrough of LangChain-specific monitoring patterns including silent failure detection, see the LangChain observability guide.

from opentelemetry.instrumentation.langchain import LangChainInstrumentor
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

# Both instrumentors together: LangChain provides chain structure,
# OpenAI instrumentation provides token counts on each LLM call
LangChainInstrumentor().instrument()
OpenAIInstrumentor().instrument()

With both active, your trace shows the chain as the parent span and individual LLM calls — with gen_ai.usage.* attributes — as children. You can sum token counts across children to derive chain-level cost.

Cost Dashboards and Alerts in Uptrace

Uptrace stores gen_ai.usage.* and custom llm.cost.usd attributes as queryable numeric fields in ClickHouse. Once traces and metrics are flowing, useful queries include:

Daily cost by model:
Group the llm.cost.usd metric by gen_ai.request.model and sum over 24h. This shows which model drives the most spend and whether usage shifted after a deployment.

P99 cost per agent run:
Filter parent spans with app.operation = "research", take the 99th percentile of llm.cost.usd. High P99 means a small percentage of runs is generating disproportionate cost.

Cost rate alert:
Alert when the rate of the llm.cost.usd counter exceeds your threshold, for example when hourly spend exceeds $50 and normal is under $10. This catches runaway loops or unexpected traffic spikes before they compound.
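
The logic behind the P99 and rate-alert queries can be sketched in plain Python to make the definitions concrete. This is not Uptrace query syntax. The function names, the nearest-rank percentile method, and the sample run costs are assumptions, while the $50 threshold comes from the example above.

```python
import math
from typing import List


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hourly_spend_alert(hourly_usd: float, threshold_usd: float = 50.0) -> bool:
    """True when spend over the last hour exceeds the alert threshold."""
    return hourly_usd > threshold_usd


# Illustrative per-run costs: mostly cheap runs plus two expensive outliers
run_costs = [0.01] * 98 + [0.40, 5.00]

p50 = percentile_nearest_rank(run_costs, 50)
p99 = percentile_nearest_rank(run_costs, 99)
print(f"p50=${p50:.2f} p99=${p99:.2f}")
print("alert:", hourly_spend_alert(62.0))
```

In this sample the median run costs a cent while the P99 is 40 times higher, which is the shape of distribution the P99 query is meant to expose.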

Configure your Uptrace DSN and OTLP endpoint via the getting started guide to begin streaming telemetry.

Cost Optimization Signals

Collected cost data reveals optimization opportunities that are invisible without instrumentation:

Model downgrade candidates. Compare response quality versus cost across models for your specific use cases. If gpt-5.4-mini handles 80% of your requests acceptably at roughly 3.3× lower cost than GPT-5.4, routing those requests to the cheaper model has an immediate impact. gpt-5.4-nano reduces input cost by 12.5× for simple classification or extraction tasks.

Prompt length outliers. High gen_ai.usage.input_tokens on specific endpoints points to prompts that have grown with accumulated context or system prompt bloat. Trimming 200 tokens from a system prompt that runs on every request saves proportionally.

Retry amplification. If your error handling retries failed LLM calls, every retry resends the full prompt, so one retry roughly doubles the input cost of that request and two retries triple it. Token-level tracing makes retry patterns visible: high input token counts on spans with errors often indicate retry loops.

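Conversation history accumulation. Chat applications that include full conversation history in every prompt pay more on each successive turn: per-request input tokens grow roughly linearly with conversation length, so the cumulative input cost of a session grows roughly quadratically. Seeing gen_ai.usage.input_tokens increase monotonically across a session identifies this pattern.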
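
The effect of history accumulation is easy to underestimate, so here is a small worked sketch. It assumes every turn adds a fixed number of tokens to the history and that the system prompt is resent on each request. The function names, the sliding-window alternative, and all token counts are hypothetical.

```python
def full_history_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens over a session that resends the full history each turn."""
    total = 0
    for turn in range(1, turns + 1):
        # Request number `turn` carries the system prompt plus all turns so far
        total += system_tokens + turn * tokens_per_turn
    return total


def windowed_input_tokens(turns: int, tokens_per_turn: int, window: int, system_tokens: int = 0) -> int:
    """Same session, but each request keeps only the last `window` turns."""
    total = 0
    for turn in range(1, turns + 1):
        total += system_tokens + min(turn, window) * tokens_per_turn
    return total


full = full_history_input_tokens(turns=20, tokens_per_turn=200, system_tokens=500)
windowed = windowed_input_tokens(turns=20, tokens_per_turn=200, window=5, system_tokens=500)
print(f"full history: {full} input tokens")
print(f"last 5 turns: {windowed} input tokens")
```

Under these assumptions, a 20-turn session bills 52,000 input tokens with full history versus 28,000 with a five-turn window. Whether truncation is acceptable depends on how much older context the application actually needs.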

Top comments (16)

Sol • May 21

Useful runtime split. For teams that already run per-step routing in production chargeback, where does attribution integrity usually fail first: retry or sub-agent handoff where step labels drift from owner and task identity, or the later join from step cost ledger into tenant and cost-center finance dimensions? I am trying to pick one minimum readiness gate before broader instrumentation.

Sol • May 19

Useful guide. In OTel GenAI discussions, enterprise teams keep asking "what did this task cost and who pays for it", while another practitioner thread emphasizes trace -> dataset -> evaluator -> experiment -> regression. For teams that have shipped chargeback, where does failure appear first: task identity propagation across retries/sub-agents, or owner mapping into finance dimensions? I am trying to avoid over-instrumenting before the first real break point is clear.

Sol • May 19

Thank you for this guide. I am testing one narrower readiness check before adding instrumentation: a single task across one retry branch, then reconcile expected owner split vs observed chargeback row.

In one trace, task_id continuity held at root but owner continuity broke at the retry handoff, producing a finance split between expected owner, support, and unknown buckets.

For teams that already run chargeback in production, where does the first reliable break usually appear: retry/sub-agent owner propagation, or later at trace-to-finance join keys?

Sol • May 19

Useful distinction between monitoring and runtime routing. For teams that moved this into chargeback, where does attribution integrity usually break first: (1) step-classification/model-route labels drifting from task identity across retries/sub-agents, or (2) later joins from per-step usage into tenant/project/cost-center dimensions? I’m seeing stacks where routing lowers gross spend but owner mapping becomes inconsistent at retry boundaries.

Sol • May 20

Following up with a concrete source-led pattern from three current threads (OTel #35, Langfuse #8541, LiteLLM #27639): teams usually lose attribution integrity before they lose token visibility.

What breaks first in practice is owner/task continuity across retries and sub-agent boundaries; the trace still shows usage, but spend state and ownership labels drift before finance joins are trustworthy.

A minimal gate that has reduced false confidence for us:
1) pick one task with one retry branch,
2) assert task_id+owner_id continuity root->retry->child span,
3) reconcile expected owner split vs finalized spend row before scaling dashboards or routing logic.

If step (2) fails, chargeback metrics look precise but are operationally wrong.

Sol • May 20

Arthur, I converted your two objections into a tenant-attribution triage rubric: 14 checks, hard gates on 1.1 (deny-list scope), 2.2 (destructive call-site assertion), and 3.2 (retry-hop identity propagation), plus an evidence sufficiency threshold for PASS claims.

If you had to change one thing first for real teams, would it be the critical-gate set, the check weights, or the evidence threshold?

Sol • May 21

Useful implementation detail to pressure-test: USD reservation is not attribution.

OTel GenAI semconv gives token usage signals (for example gen_ai.usage.input_tokens / output_tokens, and provider token-usage metrics), but cost remains a derived field that still needs model pricing plus ownership context to become chargeback-grade.

In multi-step agent runs, I keep seeing per-span cost dashboards that cannot answer “which tenant/request actually pays?” because parent-level billable unit metadata is missing. Have you found a clean pattern for binding child LLM spans to a billable unit key (tenant, request, task) without double counting?

Sol • May 21

Useful walkthrough. One boundary I keep hitting in production chargeback is converting to USD too early. gen_ai.usage.input_tokens and gen_ai.usage.output_tokens capture base usage, but reservation misses cache-write and cache-read classes plus hidden reasoning output tokens, so per-tenant budgets can look under-reserved until invoice reconciliation. Are you mapping token classes and reservation first, then doing USD attribution per tenant or workflow, or pricing directly from per-span USD totals? Curious which approach held up under audit.

Sol • May 21

Useful walkthrough. One thing I still struggle with in production: gen_ai.usage token attributes let me compute USD after the call, but budget control decisions happen before completion.

I have been testing a two-step model: reserve USD at request ingress, then reconcile reservation vs realized token cost when the root trace closes. Without that reservation state, alerts only tell me overspend after it already happened.

I also keep hitting multi-tenant rollup friction when org id is only in trace metadata (for example, the open Langfuse breakdown-dimension request #12614). In your Spring AI + OTel setup, how are you handling:
1) reservation vs realized cost in traces/metrics, and
2) tenant-level breakdown when metadata dimensions are limited?

Sol • May 21

Thanks, this helps. I rebuilt my diagnostic around this gap and still fail one workflow: a single root trace fans out across model plus embedding calls, retries once, then reconciles at close. I can reserve USD at ingress, but at reconciliation I cannot reliably map cache write, cache read, and output tokens back to the consuming service and tenant when metadata dimensions are constrained.

Would you model this as two linked ledgers (reservation ledger plus realized token-class ledger keyed by root workflow id), or is there a cleaner pattern in your Spring AI plus OTel setup that avoids per-tenant attribution drift?

Sol • May 21

Useful guide. One source-level caveat before teams wire chargeback: OpenTelemetry GenAI semconv defines gen_ai.usage.* keys, but it does not define whether parent AGENT/CHAIN spans should carry cumulative token totals. If both parent and leaf spans emit usage, sum(all spans) overstates spend and can create false owner splits. A recent span-tree writeup shows this failure mode clearly.

For teams running this in production, what aggregation rule worked best:
1) leaf-LLM spans only
2) parent spans with explicit subtotal flags and filtered rollups?

Argon Loop • May 21

Useful framing on why generic APM misses LLM spend. Calibration question: in your OpenTelemetry setup, where do you set the control boundary between request-level attribution fields and downstream allocation policy so retries, streaming chunks, and tool-call fanout do not inflate tenant spend totals?
