Gabriel Anhaia
Wire OpenTelemetry Around Your Anthropic Python Calls


Your chat endpoint is hitting 14s p95 and the only thing your APM has to show for it is a single flat span labeled POST /chat. No children. No model name, no token counts, no tool-loop iterations, no time-to-first-token. The handler is 12 lines and most of them are a single client.messages.create call, so "the LLM" is the answer. Which part of the LLM, you cannot say.

That is what an untraced LLM call looks like. You see the door close and the door open. Everything between them is a black box billed by the token.

What you need installed

Four packages and a few environment variables. Pin versions in your real project; the names below are the current ones.

pip install \
  anthropic \
  opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp

Set ANTHROPIC_API_KEY and point the OTLP exporter at your collector:

export ANTHROPIC_API_KEY=sk-ant-...
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_SERVICE_NAME=chat-api

The collector can be a local Jaeger all-in-one, an OTel Collector,
or whatever your team already runs. The SDK does not care.
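
If you do not have a collector yet, a local Jaeger all-in-one
container is the quickest way to get one. The image, tag, and
ports below are the usual defaults; swap in whatever your team
actually pins:

docker run --rm \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:1.57

The UI comes up on http://localhost:16686 and the OTLP gRPC
receiver on 4317, which matches the OTEL_EXPORTER_OTLP_ENDPOINT
exported above.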

Bootstrap the tracer once

Set up the provider at process start. Two exporters: console
for dev, OTLP for everything else. Console output is loud but it
is the fastest way to confirm spans actually leave your code.

# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)


def init_tracing(service_name: str = "chat-api") -> None:
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(
        BatchSpanProcessor(ConsoleSpanExporter())
    )
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter())
    )
    trace.set_tracer_provider(provider)

Call init_tracing() once, before the first Anthropic call. In a
FastAPI app, do it in a startup hook. In a CLI, do it at the top
of main. The BatchSpanProcessor flushes asynchronously, so a
short script should shut the provider down at exit
(trace.get_tracer_provider().shutdown()) or the last span never
leaves the buffer.
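
For a short CLI script, something like this is enough; it assumes
the tracing.py module above and registers the flush with atexit:

# main.py
import atexit

from opentelemetry import trace

from tracing import init_tracing

init_tracing("chat-api")
# flush whatever is still sitting in the batch processor's buffer
# before the process exits
atexit.register(trace.get_tracer_provider().shutdown)

In a FastAPI app, the same shutdown call belongs in the shutdown
hook or lifespan exit instead.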

The sync wrapper

The wrapper itself follows one rule: one span per HTTP request,
attributes set both before and after, exception recorded if it
raises. Names follow the OpenTelemetry GenAI semantic conventions
where they exist (gen_ai.system, gen_ai.request.model,
gen_ai.usage.input_tokens).

# llm.py
import time
from anthropic import Anthropic
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("chat-api.llm")
client = Anthropic()
SPAN = "anthropic.messages.create"


def chat(prompt: str, model: str = "claude-opus-4-7") -> str:
    with tracer.start_as_current_span(SPAN) as s:
        s.set_attribute("gen_ai.system", "anthropic")
        s.set_attribute("gen_ai.request.model", model)
        s.set_attribute("gen_ai.request.max_tokens", 1024)

        t0 = time.perf_counter()
        try:
            msg = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
        except Exception as exc:
            s.record_exception(exc)
            s.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
        _record(s, msg, t0)
        return msg.content[0].text

The response side reads the usage block and stamps it on the
span. Pull it into a helper so the streaming wrapper later can
share it.

def _record(s, msg, t0: float) -> None:
    u = msg.usage
    s.set_attribute("gen_ai.response.model", msg.model)
    s.set_attribute(
        "gen_ai.response.stop_reason", msg.stop_reason
    )
    s.set_attribute(
        "gen_ai.usage.input_tokens", u.input_tokens
    )
    s.set_attribute(
        "gen_ai.usage.output_tokens", u.output_tokens
    )
    s.set_attribute(
        "gen_ai.latency_ms",
        int((time.perf_counter() - t0) * 1000),
    )

A few details that matter in production. record_exception writes
the traceback as a span event, so you do not need to log it
separately. set_status(ERROR) is what flips the span red in
Jaeger; without it, an exception still shows green. gen_ai.latency_ms
is redundant with span duration, but having it as an attribute
lets you query and group on it without a duration calculation in
your trace UI.

Tool calls deserve their own attributes

Real apps loop on tool use. Each messages.create returns either a
final text or stop_reason="tool_use" plus tool_use blocks the
caller has to execute. You want to see, on the span, which tools
the model picked and how many times. Add this after the response:

tool_uses = [
    b for b in msg.content if b.type == "tool_use"
]
if tool_uses:
    s.set_attribute(
        "gen_ai.response.tool_calls.count",
        len(tool_uses),
    )
    s.set_attribute(
        "gen_ai.response.tool_calls.names",
        [t.name for t in tool_uses],
    )

OTel attribute values can be lists of primitives, so the names list
exports cleanly. Avoid putting the full tool input on the span as
JSON; that is what span events or logs are for, and high-cardinality
strings on attributes wreck most backend cost models. If you want a
cardinality-safe signal, record the argument size in bytes instead,
as sketched below.
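
Here is one way to do that, inside the same tool_uses branch. The
args_bytes attribute name is mine, not a GenAI convention, and it
assumes the tool inputs are the plain dicts the SDK hands back:

import json

# total serialized size of all tool inputs on this response
args_bytes = sum(
    len(json.dumps(t.input).encode("utf-8")) for t in tool_uses
)
s.set_attribute(
    "gen_ai.response.tool_calls.args_bytes", args_bytes
)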

Wrapping the streaming variant

Streaming is where most teams give up on tracing. The HTTP request
opens fast, then sits open while tokens trickle in, and a naive
wrapper closes the span before the final event arrives. The
Anthropic SDK exposes a context manager that gives you the final
message at the end, which is the right place to read usage.

STREAM_SPAN = "anthropic.messages.stream"


def chat_stream(prompt: str, model: str = "claude-opus-4-7"):
    with tracer.start_as_current_span(STREAM_SPAN) as s:
        s.set_attribute("gen_ai.system", "anthropic")
        s.set_attribute("gen_ai.request.model", model)

        t0 = time.perf_counter()
        first_at: float | None = None

        with client.messages.stream(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            for text in stream.text_stream:
                if first_at is None:
                    first_at = time.perf_counter()
                yield text
            final = stream.get_final_message()

        _record(s, final, t0)
        if first_at is not None:
            s.set_attribute(
                "gen_ai.time_to_first_token_ms",
                int((first_at - t0) * 1000),
            )

Two streaming-specific attributes earn their place. time_to_first_token_ms
is the number a user actually feels, the latency before the
typewriter starts. latency_ms is total wall time. Plotting
both gives you the difference between "Anthropic was slow to start"
and "Anthropic was slow to finish", which is the difference between
a queueing problem and a long-context problem.

The generator pattern matters. The with tracer.start_as_current_span
block must wrap the entire iteration, not just the setup, or the
span closes before the stream ends. The same goes for the SDK's
with client.messages.stream block. Yield from inside both.
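
Consuming the wrapper looks like consuming any other generator;
the tracing never shows up at the call site:

# stream tokens to the terminal as they arrive
for chunk in chat_stream("Explain OTLP in one sentence."):
    print(chunk, end="", flush=True)
print()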

Run it and look at one span

Hit the wrapper once with a small prompt and watch the console
exporter. The real output is one long line of JSON; pretty-printed
and trimmed for readability, it looks roughly like this:

{
  "name": "anthropic.messages.create",
  "context": {
    "trace_id": "0x7c1e9a4b2f3d8e6a91c4d8f0a2b3c4d5",
    "span_id": "0x4f2a8b1c9d3e7f02",
    "trace_state": "[]"
  },
  "kind": "SpanKind.INTERNAL",
  "parent_id": null,
  "start_time": "2026-04-27T09:14:22.418736Z",
  "end_time":   "2026-04-27T09:14:24.252901Z",
  "status": { "status_code": "UNSET" },
  "attributes": {
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-opus-4-7",
    "gen_ai.request.max_tokens": 1024,
    "gen_ai.response.model": "claude-opus-4-7",
    "gen_ai.response.stop_reason": "end_turn",
    "gen_ai.usage.input_tokens": 38,
    "gen_ai.usage.output_tokens": 142,
    "gen_ai.latency_ms": 1834
  },
  "events": [],
  "links":  [],
  "resource": {
    "attributes": {
      "service.name": "chat-api",
      "telemetry.sdk.language": "python",
      "telemetry.sdk.name": "opentelemetry",
      "telemetry.sdk.version": "1.27.0"
    }
  }
}

If you see that, OTLP export is also working. The same span is
already on its way to your collector via the second processor.

A Jaeger query that earns its keep

Open Jaeger, pick chat-api as the service, anthropic.messages.create
as the operation. The query that matters most to me is "slow calls
that were not slow because they were huge":

service: chat-api
operation: anthropic.messages.create
minDuration: 5s
tags:
  gen_ai.usage.input_tokens<2000
  gen_ai.response.stop_reason="end_turn"

Five seconds is plenty for a sub-2k-token prompt that ended cleanly.
Anything matching that filter is a candidate for one of three
problems: cold model routing on Anthropic's side, your network,
or a tool loop you forgot to count. The same query in Grafana Tempo
uses TraceQL: { resource.service.name = "chat-api" && duration > 5s
&& span.gen_ai.usage.input_tokens < 2000 }. Either way, the
attribute work above is what makes the question askable.

A team I talked to ran exactly this filter on a chat product I
do not work on and surfaced a single trace where tool_calls.count=4
and output_tokens=18. The model was calling the same retrieval
tool four times in a row because the tool kept returning an empty
result and the loop never short-circuited. The fix took ten
minutes. Finding it without the span attributes would have taken
a day.

If this was useful

The LLM Observability Pocket Guide is the field manual for the rest of this stack: which attributes are worth paying for at scale, how to structure spans across tool loops and agent graphs, and how to pick between OTel-native backends (Jaeger, Tempo, Honeycomb) and the LLM-specific ones (Langfuse, Phoenix, Helicone). If your Anthropic dashboards still show one flat bar per request, this book is the way out.

LLM Observability Pocket Guide
