Temitope

Monitoring LLM API Calls in Python: Latency, Token Usage, and Cost Tracking With OpenTelemetry

LLM API calls are unlike any other external dependency in your Python application.

A database query takes milliseconds. A Redis call takes microseconds. An LLM call takes anywhere from half a second to thirty seconds, consumes a variable number of tokens on every invocation, costs real money on every request, and can fail in ways that have nothing to do with network connectivity — token limits, content filters, model refusals, context window exhaustion.

Standard application monitoring was not built for this. Your existing latency dashboards will show LLM calls as outliers. Your error rate alerts will fire on model refusals that aren't actually errors. Your cost monitoring won't exist at all unless you build it.

This article builds it. We'll instrument LLM API calls in Python with OpenTelemetry — capturing latency, token consumption, estimated cost, and finish reasons as structured telemetry that you can query, dashboard, and alert on.


The Monitoring Gap in LLM Applications

When you add an LLM to a Python application, you typically get visibility into two things: whether the call succeeded, and how long it took. Everything else — how many tokens it consumed, what the model decided to do, how much it cost, whether it hit a limit — is invisible unless you instrument it explicitly.

This creates real operational problems:

  • A feature that works in testing starts timing out in production because prompts grew longer than expected and token counts climbed
  • Costs spike unexpectedly because one endpoint is generating unusually long completions
  • Users report bad responses but you can't tell whether the model refused, truncated, or hallucinated because finish_reason is never captured
  • You can't tell which of your ten LLM-powered features is responsible for 80% of your API spend

Structured telemetry on LLM calls fixes all of these. Let's build it.


Prerequisites

  • Python 3.10+
  • An OpenAI or Anthropic API key
  • A running OpenTelemetry Collector or observability backend (a quick local option is shown below)
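
If you don't have a backend running yet, one quick way to get a local OTLP endpoint for testing is Jaeger's all-in-one image (assuming Docker is available; any OTLP-capable collector or vendor backend works the same way):

docker run --rm \
  -p 16686:16686 \
  -p 4317:4317 \
  -e COLLECTOR_OTLP_ENABLED=true \
  jaegertracing/all-in-one:latest

The UI is on http://localhost:16686 and OTLP gRPC is on port 4317, which matches the exporter default used below.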

Installing Dependencies

pip install opentelemetry-sdk
pip install opentelemetry-api
pip install opentelemetry-exporter-otlp-proto-grpc
pip install openai
pip install anthropic
pip install fastapi uvicorn

Project Structure

llm-monitoring/
├── tracing.py          # OpenTelemetry setup
├── llm_tracer.py       # LLM instrumentation layer
├── cost_estimator.py   # Token cost calculation
├── main.py             # FastAPI application
└── services.py         # LLM-powered features

Step 1: OpenTelemetry Setup

tracing.py

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME


def init_tracer(service_name: str) -> trace.Tracer:
    resource = Resource.create({
        SERVICE_NAME: service_name,
        "service.version": "1.0.0",
    })

    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
        insecure=True,
    )

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    return trace.get_tracer(service_name)
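
One operational note on this setup: BatchSpanProcessor exports spans from a background thread. A long-running FastAPI process flushes continuously, but if you reuse init_tracer in a short-lived script, shut the provider down before exit so pending spans aren't dropped. A minimal sketch:

from opentelemetry import trace

# ... code that creates spans ...

trace.get_tracer_provider().shutdown()  # flushes the batch processor before exit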

Step 2: Cost Estimation

Before building the instrumentation layer, we need a way to estimate costs. LLM providers charge per token, with different rates for input and output tokens.

cost_estimator.py

from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelPricing:
    input_cost_per_token: float   # USD per token
    output_cost_per_token: float  # USD per token


# Pricing as of early 2026 — verify against provider pricing pages
# before building cost dashboards on these numbers
MODEL_PRICING: dict[str, ModelPricing] = {
    # OpenAI
    "gpt-4o": ModelPricing(
        input_cost_per_token=0.000005,
        output_cost_per_token=0.000015,
    ),
    "gpt-4o-mini": ModelPricing(
        input_cost_per_token=0.00000015,
        output_cost_per_token=0.0000006,
    ),
    "gpt-3.5-turbo": ModelPricing(
        input_cost_per_token=0.0000005,
        output_cost_per_token=0.0000015,
    ),
    # Anthropic
    "claude-sonnet-4-6": ModelPricing(
        input_cost_per_token=0.000003,
        output_cost_per_token=0.000015,
    ),
    "claude-haiku-4-5": ModelPricing(
        input_cost_per_token=0.00000025,
        output_cost_per_token=0.00000125,
    ),
}


def estimate_cost(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[float]:
    """
    Estimate the cost of an LLM call in USD.
    Returns None if the model is not in the pricing table.
    """
    pricing = MODEL_PRICING.get(model)
    if not pricing:
        return None

    input_cost = prompt_tokens * pricing.input_cost_per_token
    output_cost = completion_tokens * pricing.output_cost_per_token
    return round(input_cost + output_cost, 8)
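
A quick sanity check against the table above (the exact numbers will drift as providers update pricing):

# 87 prompt tokens and 52 completion tokens on gpt-4o-mini
print(estimate_cost("gpt-4o-mini", prompt_tokens=87, completion_tokens=52))
# -> 4.425e-05, roughly $0.000044 for the call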

Step 3: The LLM Instrumentation Layer

This is the core of the setup — a context manager that wraps any LLM call and captures the telemetry we care about.

llm_tracer.py

import time
from contextlib import contextmanager
from typing import Optional, Generator
from opentelemetry import trace
from opentelemetry.trace import Span, Status, StatusCode

from cost_estimator import estimate_cost

tracer = trace.get_tracer("llm-instrumentation")


@contextmanager
def llm_span(
    model: str,
    operation: str,
    feature: str,
    prompt_tokens: Optional[int] = None,
    temperature: float = 0.0,
    max_tokens: Optional[int] = None,
) -> Generator[Span, None, None]:
    """
    Context manager that creates a span for an LLM API call.

    Args:
        model: The model identifier (e.g. "gpt-4o", "claude-sonnet-4-6")
        operation: What this call is doing (e.g. "summarize", "classify", "generate")
        feature: Which product feature triggered this call (e.g. "order_summary", "search")
        prompt_tokens: Estimated prompt token count (if known before the call)
        temperature: Sampling temperature
        max_tokens: Maximum tokens requested
    """
    with tracer.start_as_current_span(f"llm.{operation}") as span:
        # Request attributes — known before the call
        span.set_attributes({
            "llm.model": model,
            "llm.operation": operation,
            "llm.feature": feature,
            "llm.temperature": temperature,
            "llm.request_time": time.time(),
        })

        if prompt_tokens is not None:
            span.set_attribute("llm.prompt_tokens", prompt_tokens)

        if max_tokens is not None:
            span.set_attribute("llm.max_tokens", max_tokens)

        start_time = time.perf_counter()

        try:
            yield span
        finally:
            latency_ms = (time.perf_counter() - start_time) * 1000
            span.set_attribute("llm.latency_ms", round(latency_ms, 2))


def record_llm_response(
    span: Span,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
    cached: bool = False,
) -> None:
    """
    Record response attributes after an LLM call completes.
    Call this inside the llm_span context manager after the API call returns.
    """
    total_tokens = prompt_tokens + completion_tokens
    cost = estimate_cost(model, prompt_tokens, completion_tokens)

    span.set_attributes({
        "llm.prompt_tokens": prompt_tokens,
        "llm.completion_tokens": completion_tokens,
        "llm.total_tokens": total_tokens,
        "llm.finish_reason": finish_reason,
        "llm.cached": cached,
    })

    if cost is not None:
        span.set_attribute("llm.estimated_cost_usd", cost)

    # Set span status based on finish reason
    # Not all non-"stop" finish reasons are errors — but they need visibility
    if finish_reason == "length":
        # Response was cut off — may indicate prompt is too long
        # or max_tokens is set too low
        span.set_status(Status(StatusCode.ERROR, "Response truncated by token limit"))
        span.set_attribute("llm.truncated", True)

    elif finish_reason == "content_filter":
        # Content policy triggered — usually a prompt design issue
        span.set_status(Status(StatusCode.ERROR, "Response blocked by content filter"))

    elif finish_reason == "stop":
        span.set_status(Status(StatusCode.OK))

    else:
        # tool_calls, function_call, or unknown — not an error
        span.set_status(Status(StatusCode.OK))


def record_llm_error(span: Span, error: Exception, error_type: str) -> None:
    """
    Record an LLM API error on the span.
    Use error_type to distinguish between different failure modes.
    """
    span.record_exception(error)
    span.set_attributes({
        "llm.error": True,
        "llm.error_type": error_type,
    })
    span.set_status(Status(StatusCode.ERROR, str(error)))

The finish_reason handling is worth examining. When an LLM response is truncated because of a token limit, most monitoring systems record it as a successful call — the HTTP request returned 200. But from a product perspective, the response is incomplete and the user may get a broken experience. Treating finish_reason == "length" as an error in the span means you can alert on it separately from network failures or API errors.


Step 4: Instrumenting Real LLM Calls

Now let's use the instrumentation layer with actual API calls.

services.py

import json
import os
from openai import AsyncOpenAI, RateLimitError, APITimeoutError
from anthropic import AsyncAnthropic, APIStatusError
import structlog

from llm_tracer import llm_span, record_llm_response, record_llm_error

logger = structlog.get_logger()
openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


async def summarize_order(order_text: str, user_id: str) -> str:
    """Summarize an order for the customer dashboard."""
    model = "gpt-4o-mini"

    with llm_span(
        model=model,
        operation="summarize",
        feature="order_dashboard",
        temperature=0.0,
        max_tokens=200,
    ) as span:
        try:
            response = await openai_client.chat.completions.create(
                model=model,
                temperature=0.0,
                max_tokens=200,
                messages=[
                    {
                        "role": "system",
                        "content": "Summarize the following order in 2-3 sentences for a customer.",
                    },
                    {
                        "role": "user",
                        "content": order_text,
                    },
                ],
            )

            choice = response.choices[0]
            usage = response.usage

            record_llm_response(
                span=span,
                model=model,
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                finish_reason=choice.finish_reason,
            )

            logger.info(
                "order_summarized",
                user_id=user_id,
                model=model,
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                finish_reason=choice.finish_reason,
            )

            return choice.message.content

        except RateLimitError as e:
            record_llm_error(span, e, error_type="rate_limit")
            logger.warning("llm_rate_limited", model=model, feature="order_dashboard")
            raise

        except APITimeoutError as e:
            record_llm_error(span, e, error_type="timeout")
            logger.error("llm_timeout", model=model, feature="order_dashboard")
            raise

        except Exception as e:
            record_llm_error(span, e, error_type="unknown")
            logger.error("llm_error", model=model, exc_info=True)
            raise


async def classify_support_ticket(ticket_text: str) -> dict:
    """Classify a support ticket by category and urgency."""
    model = "claude-haiku-4-5"

    with llm_span(
        model=model,
        operation="classify",
        feature="support_triage",
        temperature=0.0,
        max_tokens=100,
    ) as span:
        try:
            response = await anthropic_client.messages.create(
                model=model,
                max_tokens=100,
                messages=[
                    {
                        "role": "user",
                        "content": f"""Classify this support ticket. 
Respond with JSON only: {{"category": "...", "urgency": "low|medium|high"}}

Ticket: {ticket_text}""",
                    }
                ],
            )

            usage = response.usage
            finish_reason = response.stop_reason  # Anthropic uses stop_reason

            record_llm_response(
                span=span,
                model=model,
                prompt_tokens=usage.input_tokens,
                completion_tokens=usage.output_tokens,
                finish_reason=finish_reason or "stop",
            )

            result = json.loads(response.content[0].text)

            # Add classification result to span for filtering
            span.set_attributes({
                "ticket.category": result.get("category", "unknown"),
                "ticket.urgency": result.get("urgency", "unknown"),
            })

            return result

        except APIStatusError as e:
            record_llm_error(span, e, error_type=f"api_status_{e.status_code}")
            raise

        except Exception as e:
            record_llm_error(span, e, error_type="unknown")
            raise
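
One caveat on the Anthropic call: stop_reason uses different values than OpenAI's finish_reason (end_turn, max_tokens, stop_sequence, tool_use), so the "length" check in record_llm_response never fires for a truncated Anthropic response. A small normalization helper closes that gap; this is a sketch, not part of the code above:

# Map Anthropic stop_reason values onto the OpenAI-style finish_reason
# values that record_llm_response already checks for
ANTHROPIC_STOP_REASONS = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",    # truncated by the token limit
    "tool_use": "tool_calls",
}


def normalize_stop_reason(stop_reason: str | None) -> str:
    return ANTHROPIC_STOP_REASONS.get(stop_reason, stop_reason or "stop")

In classify_support_ticket, pass normalize_stop_reason(response.stop_reason) as finish_reason instead of the raw value.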

Step 5: Wiring Into FastAPI

main.py

import os
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from tracing import init_tracer
from services import summarize_order, classify_support_ticket

init_tracer("llm-powered-api")

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)


class OrderSummaryRequest(BaseModel):
    order_text: str
    user_id: str


class SupportTicketRequest(BaseModel):
    ticket_text: str


@app.post("/orders/summarize")
async def summarize(request: OrderSummaryRequest):
    try:
        summary = await summarize_order(request.order_text, request.user_id)
        return {"summary": summary}
    except Exception:
        raise HTTPException(status_code=503, detail="Summary service unavailable")


@app.post("/support/classify")
async def classify(request: SupportTicketRequest):
    try:
        classification = await classify_support_ticket(request.ticket_text)
        return classification
    except Exception:
        raise HTTPException(status_code=503, detail="Classification service unavailable")
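
To run this locally, export the API keys and collector endpoint that the code reads, then start uvicorn (the endpoint value matches the default in tracing.py; replace the placeholder keys with your own):

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OTEL_EXPORTER_OTLP_ENDPOINT="localhost:4317"

uvicorn main:app --reload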

What the Telemetry Looks Like

A successful call to /orders/summarize produces a span with these attributes:

{
  "name": "llm.summarize",
  "status": "OK",
  "attributes": {
    "llm.model": "gpt-4o-mini",
    "llm.operation": "summarize",
    "llm.feature": "order_dashboard",
    "llm.temperature": 0.0,
    "llm.max_tokens": 200,
    "llm.prompt_tokens": 87,
    "llm.completion_tokens": 52,
    "llm.total_tokens": 139,
    "llm.finish_reason": "stop",
    "llm.estimated_cost_usd": 0.0000913,
    "llm.latency_ms": 1243.5,
    "llm.cached": false
  }
}

A truncated response — where the model hit the token limit — looks like:

{
  "name": "llm.summarize",
  "status": "ERROR",
  "status_message": "Response truncated by token limit",
  "attributes": {
    "llm.model": "gpt-4o-mini",
    "llm.finish_reason": "length",
    "llm.truncated": true,
    "llm.prompt_tokens": 312,
    "llm.completion_tokens": 200,
    "llm.total_tokens": 512,
    "llm.estimated_cost_usd": 0.0001672,
    "llm.latency_ms": 3821.2
  }
}

Dashboards and Alerts That Actually Matter

With this telemetry in place, here are the queries that become useful:

Cost by feature: Group spans by llm.feature and sum llm.estimated_cost_usd. This tells you which features are driving your LLM spend. In most applications, one or two features account for the majority of cost.

Truncation rate by model: Filter spans where llm.truncated = true and group by llm.model. A rising truncation rate on a specific model usually means prompts are growing — often because you've added more context or the input data has changed.

Latency percentiles by operation: P50 and P99 latency grouped by llm.operation. LLM latency distributions are wide — P50 might be 800ms while P99 is 12 seconds. Alerting on P99 rather than average catches the tail latency issues that users actually experience.

Error rate by error type: Group spans by llm.error_type. Rate limit errors, timeouts, and content filter triggers have completely different remediation paths. Grouping them together hides what's actually wrong.
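
The exact query syntax depends on your backend. As an illustration of the first of these, here is a sketch of an in-process SpanProcessor that keeps a running cost-per-feature total, handy for a quick look before a dashboard exists:

from collections import defaultdict

from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor


class CostByFeatureProcessor(SpanProcessor):
    """Accumulates llm.estimated_cost_usd per llm.feature as spans end."""

    def __init__(self) -> None:
        self.cost_by_feature: dict[str, float] = defaultdict(float)

    def on_end(self, span: ReadableSpan) -> None:
        attrs = span.attributes or {}
        cost = attrs.get("llm.estimated_cost_usd")
        if cost is not None:
            self.cost_by_feature[attrs.get("llm.feature", "unknown")] += cost

Register it in tracing.py with provider.add_span_processor(CostByFeatureProcessor()) alongside the batch processor. In production, the same aggregation belongs in your tracing backend as a saved query rather than in-process.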

Recommended alerts:

  • High latency: P99 of llm.latency_ms above 10,000 ms
  • Truncation spike: llm.truncated = true on more than 5% of calls
  • Rate limiting: more than 10 llm.error_type = rate_limit spans per minute
  • Cost spike: hourly sum of llm.estimated_cost_usd above 2x baseline
  • Content filter: more than 3 llm.error_type = content_filter spans per hour

Handling Retries Without Double-Counting

If your application retries failed LLM calls, you need to track retry counts to avoid double-counting costs and misattributing errors.

import asyncio

async def summarize_with_retry(order_text: str, user_id: str, max_retries: int = 2) -> str:
    model = "gpt-4o-mini"
    last_error = None

    for attempt in range(max_retries + 1):
        with llm_span(
            model=model,
            operation="summarize",
            feature="order_dashboard",
        ) as span:
            span.set_attribute("llm.attempt", attempt)
            span.set_attribute("llm.is_retry", attempt > 0)

            try:
                response = await openai_client.chat.completions.create(
                    model=model,
                    max_tokens=200,
                    messages=[
                        {"role": "system", "content": "Summarize this order."},
                        {"role": "user", "content": order_text},
                    ],
                )

                usage = response.usage
                record_llm_response(
                    span=span,
                    model=model,
                    prompt_tokens=usage.prompt_tokens,
                    completion_tokens=usage.completion_tokens,
                    finish_reason=response.choices[0].finish_reason,
                )

                return response.choices[0].message.content

            except RateLimitError as e:
                record_llm_error(span, e, error_type="rate_limit")
                last_error = e
                if attempt < max_retries:
                    await asyncio.sleep(2 ** attempt)  # exponential backoff before retrying
                continue

    raise last_error

With llm.attempt and llm.is_retry on every span, you can filter your cost dashboard to exclude retry attempts — or specifically query retried calls to understand which operations are flaky.


Summary

LLM API calls require a different approach to monitoring than standard HTTP dependencies. The key attributes to capture are:

  • Latency — LLM calls are slow and variable; P99 matters more than average
  • Token counts — input and output separately, since they have different costs
  • Finish reason — stop, length, content_filter, and tool_calls each indicate different conditions
  • Estimated cost — per-call and aggregated by feature
  • Error type — rate limits, timeouts, and content filters need different responses

The instrumentation layer in this article wraps both OpenAI and Anthropic calls with a consistent span structure. As you add more models or providers, the pattern stays the same — llm_span as the context manager, record_llm_response after the call, record_llm_error in the exception handler.

Without this telemetry, LLM-powered features are a black box. With it, you can answer the questions that actually matter in production: what is this costing, why is it slow, and what is the model actually doing.


Find me on GitHub or LinkedIn.
