## TL;DR

- Capture `gen_ai.*` semantic convention attributes on every OpenAI call: request model, input tokens, output tokens. Add `feature`, `user_id`, and `team` on every span so you can break down cost by who and what is spending.
- Compute `gen_ai.usage.cost_usd` from a pricing table you control and emit it as both a span attribute (for per-request drill-down) and a histogram metric (for aggregation and alerting).
- Alert on cost anomalies relative to your historical baseline, not just static budget thresholds. Retry loops and runaway agents show up as deviations before they ever cross a daily spend limit.
## Why OpenAI bills are impossible to predict without instrumentation
Running an LLM app in production without instrumentation is a slow way to find out your margins are negative. Token consumption is non-obvious: a single user with a verbose system prompt and long chat history can cost 20x more per interaction than an average user. A bug in a retry loop can 10x your daily spend in an hour. A single new feature that adds RAG context to every call can double your input token count overnight.
The OpenAI dashboard tells you what you spent yesterday. It does not tell you which feature, which user, which prompt template, or which model variant drove the spend. By the time you notice a cost spike in your billing dashboard, you have already paid for it.
The fix is the same fix you use for any production system: emit structured telemetry at the point of the API call and make it queryable. OpenTelemetry gives you a vendor-neutral way to do this, and a growing set of GenAI-specific conventions means the fields you emit today will still be meaningful in two years.
Quick start: Jump to the Python setup or Node.js setup if you just need the code.
## The three signals you actually need to track
For LLM cost monitoring, three signals carry almost all the value:
- Token usage tells you how much capacity you consumed. Input tokens and output tokens, always separately, because they price differently.
- Cost is the dollar-denominated derivative of token usage. You compute it at emit time using a pricing table you control.
- Latency tells you how long users waited. For streaming endpoints, split this into time to first token and total duration.
Everything else (error rate, finish reason, response model) is useful context for these three. Start with the three and add context as you need it.
## What OpenTelemetry's GenAI semantic conventions give you
OpenTelemetry has a dedicated set of semantic conventions for generative AI workloads, living under the `gen_ai.*` namespace. The point of conventions is that the same attribute names work across providers and observability backends, so your queries do not break when you swap from OpenAI to Anthropic or from one backend to another.
The attributes you will use most:
| Attribute | What it holds |
|---|---|
| `gen_ai.provider.name` | Provider name: `openai` |
| `gen_ai.request.model` | Model requested by your code: `gpt-4o`, `gpt-4o-mini` |
| `gen_ai.response.model` | Model the provider actually used (can differ if the provider routes) |
| `gen_ai.operation.name` | `chat`, `text_completion`, `embeddings` |
| `gen_ai.usage.input_tokens` | Prompt tokens consumed |
| `gen_ai.usage.output_tokens` | Completion tokens generated |
| `gen_ai.request.temperature` | Temperature parameter (useful when debugging determinism) |
| `gen_ai.request.max_tokens` | Max tokens parameter |
| `gen_ai.response.finish_reasons` | Why the model stopped: `stop`, `length`, `content_filter` |
One attribute worth noting: `gen_ai.system` has been renamed to `gen_ai.provider.name` in the current OTel GenAI spec. Most instrumentation libraries still emit `gen_ai.system` today. Your backend should accept both until library adoption catches up.
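If you create any spans by hand, a cheap hedge during the transition is to set both names yourself. A minimal sketch (this helper is mine, not part of any SDK):

```python
from opentelemetry import trace

def set_provider_attrs(span: trace.Span, provider: str = "openai") -> None:
    """Set both the current and legacy provider attribute on a span."""
    span.set_attribute("gen_ai.provider.name", provider)  # current spec name
    span.set_attribute("gen_ai.system", provider)  # legacy name, still widely emitted
```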
## Instrumenting a Python app with the official OTel OpenAI SDK

This guide uses `opentelemetry-instrumentation-openai-v2`, the official OTel package maintained in opentelemetry-python-contrib. It follows the GenAI semantic conventions closely and is the right choice for OpenAI instrumentation.
### Install the three packages

```bash
pip install opentelemetry-distro
pip install opentelemetry-exporter-otlp
pip install opentelemetry-instrumentation-openai-v2
```
Then run the bootstrap command once to install auto-instrumentation for any other libraries in your app (Flask, FastAPI, requests, and so on):
```bash
opentelemetry-bootstrap --action=install
```
### Set the OTLP endpoint for OpenObserve
Grab your OTLP HTTP endpoint and Authorization header from the OpenObserve UI under Data Sources -> Traces (OpenTelemetry) -> OTLP HTTP. Set these environment variables:
```bash
export OTEL_SERVICE_NAME=my-llm-app
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.openobserve.ai/api/<your-org>"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <your-auth-token>"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
```
If you are self-hosting OpenObserve, the endpoint is typically `http://localhost:5080/api/<your-org>`.
### Run with `opentelemetry-instrument`
Wrap your existing run command:
```bash
opentelemetry-instrument python app.py
```
No code changes to `app.py`. The OpenAI SDK is wrapped at import time, and every `chat.completions.create` call emits a span with the `gen_ai.*` attributes populated.
### A minimal example app

```python
# app.py
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize observability in one sentence."}],
)

print(resp.choices[0].message.content)
print("Input tokens:", resp.usage.prompt_tokens)
print("Output tokens:", resp.usage.completion_tokens)
```
Run it with `opentelemetry-instrument python app.py` and check the Traces tab in OpenObserve. You should see a span named `chat gpt-4o-mini` with the token counts attached.
### Capturing message content (and the privacy tradeoff)
The instrumentation does not capture the prompt or completion text by default. To enable it:
```bash
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
```
This ships the full prompt and completion as log events. It is useful for debugging but has real privacy implications: you are now logging whatever your users typed, including anything they pasted in. If your app handles regulated data (health, finance, anything under GDPR or HIPAA), do not enable this globally. Enable it per-environment or per-feature flag, and scrub sensitive fields before the exporter sees them.
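One pragmatic place to scrub is immediately before the call, so neither the provider nor your telemetry ever sees the raw values. A minimal sketch, with the caveat that the patterns here are illustrative, not an exhaustive PII list:

```python
import re

# Illustrative patterns only; production PII scrubbing needs a vetted
# library or service, not two regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub(text: str) -> str:
    """Replace obvious PII with placeholders before the text leaves your process."""
    text = EMAIL_RE.sub("[redacted-email]", text)
    return SSN_RE.sub("[redacted-ssn]", text)

user_input = "My email is jane@example.com, summarize my account"
messages = [{"role": "user", "content": scrub(user_input)}]
# content -> "My email is [redacted-email], summarize my account"
```

Note that scrubbing before the call also changes what the model sees, which is usually what you want for regulated data anyway.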
## Instrumenting a Node.js app
For Node.js, the pattern is the same. Install the packages:
```bash
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/instrumentation-openai
```
Create a `tracing.js` bootstrap file:

```js
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OpenAIInstrumentation } = require('@opentelemetry/instrumentation-openai');
const { Resource } = require('@opentelemetry/resources');

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'my-llm-app-node',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`,
    headers: {
      // OTEL_EXPORTER_OTLP_HEADERS holds "Authorization=Basic <token>";
      // this exporter option wants just the value part.
      Authorization: (process.env.OTEL_EXPORTER_OTLP_HEADERS || '').replace(/^Authorization=/, ''),
    },
  }),
  instrumentations: [new OpenAIInstrumentation()],
});

sdk.start();
```
Then preload it when you run your app:
```bash
node --require ./tracing.js app.js
```
Same result: every OpenAI call produces a span in OpenObserve with the GenAI attributes populated.
## Building a cost calculation layer
OpenAI's SDK gives you token counts. It does not give you dollars. You have to multiply tokens by a price, and that price changes. Build this as a small, updatable module.
### Pricing table as code
Keep this in source control. Review it every quarter, or every time a provider announces a price change.
```python
# pricing.py
# Prices in USD per 1 million tokens, as of April 2026.
# Verify against provider pricing pages before each release.

MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "o1": {"input": 15.00, "output": 60.00},
    "o1-mini": {"input": 3.00, "output": 12.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a single LLM call."""
    pricing = MODEL_PRICING.get(model)
    if not pricing:
        # Unknown model. Emit 0 and alert separately so you can add pricing.
        return 0.0
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)
```
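A quick sanity check of the math, assuming the table above: 1,200 input tokens at $0.15 per million plus 300 output tokens at $0.60 per million is $0.00018 + $0.00018:

```python
>>> from pricing import calculate_cost
>>> calculate_cost("gpt-4o-mini", input_tokens=1_200, output_tokens=300)
0.00036
>>> calculate_cost("brand-new-model", 1_200, 300)  # not in the table yet
0.0
```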
### Emitting cost as a custom metric

The official `-v2` package does not emit cost, only tokens. Add cost yourself with a thin wrapper that runs after each call:
```python
# tracked_llm.py
import time

from openai import OpenAI
from opentelemetry import metrics, trace

from pricing import calculate_cost

tracer = trace.get_tracer("llm-cost")
meter = metrics.get_meter("llm-cost")

cost_histogram = meter.create_histogram(
    name="gen_ai.usage.cost_usd",
    description="Estimated cost of a single LLM call in USD",
    unit="USD",
)

client = OpenAI()

def tracked_chat(messages, model="gpt-4o-mini", feature="unknown", user_id="anon"):
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.provider.name", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("feature", feature)
        span.set_attribute("user_id", user_id)

        start = time.perf_counter()
        response = client.chat.completions.create(model=model, messages=messages)
        elapsed_ms = (time.perf_counter() - start) * 1000

        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        cost = calculate_cost(model, input_tokens, output_tokens)

        # Span attributes for per-request investigation
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("gen_ai.usage.cost_usd", cost)
        span.set_attribute("gen_ai.latency.duration_ms", elapsed_ms)
        span.set_attribute("gen_ai.response.model", response.model)

        # Metric for aggregation
        cost_histogram.record(cost, {
            "gen_ai.provider.name": "openai",
            "gen_ai.request.model": model,
            "feature": feature,
            "user_id": user_id,
        })

        return response
```
You now have cost on the span (for drill-down) and cost as a metric (for aggregation, alerting, and dashboards). Both carry `feature` and `user_id` labels so you can break them down later.
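One caveat: if you are not running under `opentelemetry-instrument` (which wires up exporters from the environment), `metrics.get_meter` hands you a meter on an unconfigured provider and the histogram records are dropped. A minimal manual setup sketch, reusing the same `OTEL_EXPORTER_OTLP_*` environment variables set earlier:

```python
# metrics_setup.py -- run once at startup if you are NOT using
# opentelemetry-instrument. The exporter reads endpoint and auth from the
# OTEL_EXPORTER_OTLP_* environment variables.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(OTLPMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
```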
## Attributing cost to users, features, and teams
This is the section most readers came for. Raw token counts do not answer "who is spending our money." Attribution does.
### Adding attributes on every span
Every LLM call should carry four attribution dimensions:
- `feature`: which product path triggered the call (`document_summary`, `chat_reply`, `rag_answer`)
- `user_id`: hashed user identifier for per-user rollups
- `team`: which internal team or product area owns the feature
- `environment`: `prod`, `staging`, `dev`
Wire them through as keyword arguments on your wrapper:
```python
result = tracked_chat(
    messages=[{"role": "user", "content": prompt}],
    model="gpt-4o",
    feature="document_summary",
    user_id=hashed_user_id,
)
```
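The `hashed_user_id` above matters: raw user identifiers in telemetry are a privacy liability. A sketch of one way to derive it, using an HMAC with a server-side secret so the hash cannot be recomputed by anyone holding only the telemetry (the `USER_ID_HASH_KEY` variable is a hypothetical name of your choosing):

```python
import hashlib
import hmac
import os

# Server-side secret; rotating it rotates every hashed ID.
HASH_KEY = os.environ["USER_ID_HASH_KEY"].encode()

def hash_user_id(raw_user_id: str) -> str:
    """Stable, non-reversible identifier for per-user cost rollups."""
    digest = hmac.new(HASH_KEY, raw_user_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]  # 16 hex chars is plenty for grouping

hashed_user_id = hash_user_id("user-8675309")
```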
### Building the cost attribution dashboard
A complete LLM cost dashboard covers two concerns: spend attribution and token efficiency. Organize it across two tabs.
#### Tab 1: LLM Cost Overview
Four single-stat tiles at the top give you the headline numbers at a glance: Total LLM Cost ($), Total Input Tokens, Total Output Tokens, and Total LLM Calls. These are the first things you check when something looks off.
Below the tiles:
- LLM Cost Over Time ($): bar chart over the selected time range. Reveals bursty spend patterns and days that are trending above baseline.
- Cost by Model: pie chart, one slice per `gen_ai.request.model`. Shows your model mix and whether a cheaper model is handling the bulk of traffic.
- Input vs Output Cost Over Time ($): grouped bar chart with two series, `input_cost` and `output_cost`. Output tokens cost 3-4x more than input tokens on most models; this panel tells you which side is driving cost growth (see the sketch after this list).
- Token Usage by Model: grouped bar chart of `input_tokens` and `output_tokens` per model. Cross-reference this with Cost by Model to spot models that are expensive relative to their token volume.
- Token Usage Over Time: time series of token counts. Useful for capacity planning and catching prompt inflation.
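The Input vs Output panel needs cost split by direction, which `calculate_cost` collapses into one number. A sketch of a split variant; the `gen_ai.usage.input_cost_usd` / `output_cost_usd` attribute names are this article's convention, not part of the OTel spec:

```python
from pricing import MODEL_PRICING

def split_costs(model: str, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD; (0.0, 0.0) for unknown models."""
    pricing = MODEL_PRICING.get(model)
    if not pricing:
        return 0.0, 0.0
    return (
        round((input_tokens / 1_000_000) * pricing["input"], 6),
        round((output_tokens / 1_000_000) * pricing["output"], 6),
    )

# Inside tracked_chat, after computing token counts:
# input_cost, output_cost = split_costs(model, input_tokens, output_tokens)
# span.set_attribute("gen_ai.usage.input_cost_usd", input_cost)
# span.set_attribute("gen_ai.usage.output_cost_usd", output_cost)
```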
## Alerting on cost anomalies and rate-limit errors
Static budget thresholds are table stakes. The interesting failures are the ones that do not cross a static threshold until it is too late.
### Threshold alerts vs anomaly detection
A threshold alert fires when daily spend exceeds $500. It works for the blunt cases. It misses three common failure modes:
- A retry loop that 3x's a specific feature's token usage in an hour. The daily threshold may still be fine by end of day, but you paid 3x for that hour.
- A prompt injection that triggers a long runaway completion on a single request, burning 100k output tokens in one call.
- Seasonal growth that quietly pushes baseline from $300/day to $600/day over a month, outpacing capacity plans.
Anomaly detection catches all three by comparing current behavior to historical baseline rather than to a fixed number.
### A daily budget threshold
Set this first. In OpenObserve, create an alert on the `gen_ai.usage.cost_usd` metric:

- Trigger: `SUM(gen_ai_usage_cost_usd)` over `24h` is greater than `500`
- Evaluation frequency: every 5 minutes
- Action: Slack or PagerDuty, routed to the LLM-platform team
### An anomaly-based alert for cost spikes

This is more valuable. Create an anomaly alert on `gen_ai.usage.cost_usd` grouped by `feature`, with a training window of the last 14 days and a sensitivity tuned to catch 3x deviations. A retry loop in the `document_summary` feature shows up in minutes, before it hits your daily threshold.
### Alert on rate-limit errors (HTTP 429)

When OpenAI rate-limits you, downstream calls fail and retries pile up. Fire an alert when `gen_ai.response.error.type = rate_limit_exceeded` exceeds a low threshold (say, 5 in 5 minutes). This usually surfaces a runaway loop before a cost anomaly does.
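For that alert to fire, the error type has to land on your spans. A sketch of how the wrapper could tag it; the attribute name follows this article's convention (the stock OTel convention is plain `error.type`), while `openai.RateLimitError` is the real exception the OpenAI SDK raises on 429s:

```python
import openai
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("llm-cost")

def chat_with_error_tagging(client, messages, model="gpt-4o-mini"):
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.request.model", model)
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError:
            # Attribute name per this article's convention, for the 429 alert.
            span.set_attribute("gen_ai.response.error.type", "rate_limit_exceeded")
            span.set_status(StatusCode.ERROR)
            raise
```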
## Reconciling estimated cost with the OpenAI billing API
Your OTel-derived cost is an estimate. It is usually within a couple of percent, but it drifts from the real bill for three reasons:
- Cached input tokens. Repeat prompts are billed at a discount. Your naive pricing math assumes full price.
- Reasoning tokens. `o1` and similar models emit internal reasoning tokens that count toward billing but may not appear in the standard `usage` object.
- Batch API discounts. If you use the async batch endpoint, those requests are priced lower.
Reconcile monthly. Pull the OpenAI usage endpoint and compare total cost for the window against your OTel sum. If the drift is more than 5 percent, dig in and adjust your pricing table. This is the pattern production teams use: OTel for real-time signal, billing API for ground truth.
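A sketch of the monthly job, assuming an OpenAI organization admin key and the documented `/v1/organization/costs` endpoint; verify field names against the current API reference before relying on this, and note the OTel-side total is stubbed since it depends on your backend's query API:

```python
# reconcile.py -- monthly drift check (sketch, not production code).
import os

import requests

def openai_costs_usd(start_time: int, end_time: int) -> float:
    """Sum daily cost buckets from OpenAI's organization costs endpoint."""
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_ADMIN_KEY']}"}
    params = {"start_time": start_time, "end_time": end_time, "limit": 31}
    total = 0.0
    while True:
        resp = requests.get(
            "https://api.openai.com/v1/organization/costs",
            headers=headers, params=params, timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        for bucket in data["data"]:
            for result in bucket["results"]:
                total += result["amount"]["value"]
        if not data.get("has_more"):
            return total
        params["page"] = data["next_page"]

def check_drift(openai_total: float, otel_total: float, tolerance: float = 0.05) -> None:
    """Fail loudly when estimate and bill diverge past the tolerance."""
    drift = abs(openai_total - otel_total) / openai_total
    if drift > tolerance:
        raise RuntimeError(
            f"Cost drift {drift:.1%} exceeds {tolerance:.0%}: "
            f"billed ${openai_total:.2f} vs estimated ${otel_total:.2f}"
        )
```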
## Measuring time to first token for streaming
For chat UIs, users feel time to first token (TTFT), not total duration. If you use streaming responses, capture it:
```python
# streaming.py -- reuses the tracer and client pattern from tracked_llm.py
import time

from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer("llm-cost")
client = OpenAI()

def stream_with_ttft(messages, model="gpt-4o"):
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.provider.name", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.response.streaming", True)

        start = time.perf_counter()
        ttft_ms = None

        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True,
        )

        chunks = []
        for chunk in stream:
            # Some chunks (e.g. a final usage chunk) carry no choices.
            if ttft_ms is None and chunk.choices and chunk.choices[0].delta.content:
                ttft_ms = (time.perf_counter() - start) * 1000
                span.set_attribute("gen_ai.latency.ttft_ms", ttft_ms)
            chunks.append(chunk)

        total_ms = (time.perf_counter() - start) * 1000
        span.set_attribute("gen_ai.latency.duration_ms", total_ms)
        return chunks
```
Now you can alert on TTFT regressions separately from total-duration regressions.
## Production checklist
Before shipping this to prod:
- ✅ Retention policy set on your LLM telemetry stream
- ✅ PII scrubbing pipeline in place if capturing message content
- ✅ Sampling strategy decided (100% for LLM spans is usually fine)
- ✅ Pricing table in source control with quarterly review reminder
- ✅ Budget threshold alert and anomaly-based alert configured
- ✅ Monthly reconciliation against OpenAI billing API scheduled
## Send your LLM telemetry to OpenObserve
OpenObserve is an open-source observability platform that accepts standard OTLP over HTTP and gRPC. There is no proprietary SDK to adopt and no special instrumentation to learn. Point your OTLP exporter at OpenObserve Cloud or a self-hosted instance, and your LLM spans, logs, and metrics land in the same place as your infrastructure telemetry.
If you want to see this working end to end, spin up a free account at OpenObserve Cloud or check out the LLM Observability overview.