- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
In January 2026, ClickHouse acquired Langfuse. In April, Cisco announced intent to acquire Galileo. Two of the most visible LLM observability brands, swallowed within the span of a few months.
The takeaway for your instrumentation strategy is narrow: do not tie your observability to any vendor at the layer where it matters. Instrument to a standard. Let the vendor be a detail.
That standard, as of April 2026, is the OpenTelemetry GenAI semantic conventions.
Where the spec is
The open-telemetry/semantic-conventions repository cut version v1.40.0 in February 2026. The gen_ai.* namespace is marked experimental / in development.
In practice, the churn is at the edges: multimodal content, agent graphs, MCP. The core attributes (operation name, provider, model, token usage) have been stable in shape since v1.37.0. Instrument the core today; you will add to it, not rewrite it.
Three backends ingest gen_ai.* spans natively, without a translation layer:
- Datadog LLM Observability — v1.37+ consumed directly. Ship via Datadog Agent in OTLP mode or the Collector.
- Langfuse (now inside ClickHouse) — OTLP HTTP at `/api/public/otel`. Maps `gen_ai.*` onto its internal trace model. gRPC is not supported on that endpoint.
- Arize AX and Phoenix — ingest GenAI spans and translate internally to OpenInference.
You can start on Langfuse self-hosted, decide a year from now you want Datadog, and move without rewriting a line of application code.
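Concretely, the swap is an exporter-configuration change, not an application change. A minimal sketch, driving the standard OTLP environment variables from one place (the endpoint URLs here are illustrative placeholders, not official values):

```python
import os

# Illustrative endpoints only -- substitute your real backend URLs.
BACKENDS = {
    "langfuse": "https://langfuse.internal.example/api/public/otel",
    "datadog-agent": "http://localhost:4318",
}

def use_backend(name: str) -> None:
    """Point the OTLP exporter at a backend. Instrumented code never changes."""
    os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = BACKENDS[name]
    os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = "http/protobuf"

use_backend("langfuse")  # tomorrow: use_backend("datadog-agent")
```

The point is that vendor identity lives entirely in these two variables; every span your code emits is already in the vendor-neutral `gen_ai.*` shape.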
Anatomy of a GenAI span
The spec defines operations (things an LLM application does) and attributes (what describes each operation). An operation is identified by gen_ai.operation.name. The well-known values:
- `chat` — chat completion
- `text_completion` — legacy completion APIs
- `embeddings` — embedding generation
- `generate_content` — multimodal (Gemini)
- `retrieval` — RAG retriever step
- `execute_tool` — tool execution
- `create_agent` — agent creation
- `invoke_agent` — agent invocation (parent span for a tool-calling loop)
The span name format is {gen_ai.operation.name} {gen_ai.request.model}. A call to gpt-4o-mini becomes a span named chat gpt-4o-mini. The operation name and provider name are the two attributes the spec marks required.
The attributes you will actually look at
On a chat span, the ones that earn their keep in production:
| Attribute | What it carries |
|---|---|
| `gen_ai.provider.name` | Provider key (`openai`, `anthropic`, `aws.bedrock`, `gcp.vertex_ai`, `azure.ai.openai`, `cohere`, `mistral_ai`, `groq`, `x_ai`). Replaces the older `gen_ai.system`. |
| `gen_ai.request.model` | The model name your code sent. |
| `gen_ai.response.model` | The model the provider actually served. These differ more often than you expect: Bedrock aliases, OpenAI autorouting, Azure deployment IDs. Capture both. |
| `gen_ai.response.id` | Provider response ID — the string you paste into a support ticket. |
| `gen_ai.response.finish_reasons` | A list: `["stop"]`, `["length"]` (truncated), `["content_filter"]`, `["tool_calls"]`. |
| `gen_ai.usage.input_tokens` | Prompt tokens billed. |
| `gen_ai.usage.output_tokens` | Completion tokens billed. |
| `gen_ai.usage.cache_read.input_tokens` | Cache hits. If you use prompt caching and skip this, your cost dashboard lies. |
| `gen_ai.usage.cache_creation.input_tokens` | Cache writes (Anthropic's surcharge tier). |
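The cache counters matter because per-span cost is a straight function of these usage attributes. A minimal sketch, with made-up prices, assuming cache reads are billed at their own rate and reported in their own counter rather than inside `gen_ai.usage.input_tokens` (that split varies by provider, so check yours):

```python
# USD per million tokens -- illustrative numbers, not real rates.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60, "cache_read": 0.075}}

def span_cost_usd(attrs: dict) -> float:
    """Estimate the cost of one chat span from its gen_ai.usage.* attributes."""
    p = PRICES[attrs["gen_ai.request.model"]]
    return (
        attrs["gen_ai.usage.input_tokens"] * p["input"]
        + attrs.get("gen_ai.usage.cache_read.input_tokens", 0) * p["cache_read"]
        + attrs["gen_ai.usage.output_tokens"] * p["output"]
    ) / 1_000_000
```

Run this as a batch job over exported spans and you have a cost-per-feature report the provider dashboard cannot give you.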
The recommended request-shape attributes (gen_ai.request.temperature, .max_tokens, .top_p, .top_k, .frequency_penalty, .presence_penalty, .stop_sequences, .seed) are cheap, leak no content, and the first time you need to reproduce a weird generation you will want all of them.
What an agent span tree looks like
Agent spans (create_agent, invoke_agent) carry gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description, gen_ai.agent.version, and gen_ai.tool.definitions.
A multi-turn tool loop is expressed as a single invoke_agent parent span with alternating chat and execute_tool children:
```
invoke_agent [agent=support_bot, id=run-abc123]
├── chat gpt-4o-mini [input=412, output=64]
├── execute_tool search_orders [tool.call.id=call-01]
├── chat gpt-4o-mini [input=480, output=102]
├── execute_tool get_order [tool.call.id=call-02]
└── chat gpt-4o-mini [input=560, output=140]
```
Tool execution spans carry gen_ai.tool.name, gen_ai.tool.type (function, retrieval, extension), and gen_ai.tool.call.id — which links the tool span back to the tool_call_id the LLM emitted in the preceding chat turn.
Conversation correlation across separate root traces uses gen_ai.conversation.id. Set it when a user message and the assistant reply live in different HTTP requests (different trace IDs) and you still want to see them grouped.
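That `gen_ai.tool.call.id` link is what makes the tree queryable after export. A sketch of the join you would run over exported spans (the span shapes here are simplified dicts, not real SDK objects):

```python
def pair_tool_spans(chat_spans: list[dict], tool_spans: list[dict]) -> dict:
    """Map each execute_tool span to the chat span whose response requested it."""
    requested_by = {
        call_id: chat
        for chat in chat_spans
        for call_id in chat.get("tool_call_ids", [])
    }
    return {
        tool["gen_ai.tool.call.id"]: requested_by[tool["gen_ai.tool.call.id"]]
        for tool in tool_spans
    }
```

With this join you can answer "which chat turn asked for the tool call that failed?" without eyeballing the trace waterfall.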
Your first instrumented call
For Python, prefer the -v2 packages. That suffix means the library emits the new GenAI semconv. The current beta as of this writing is 2.3b0:
```shell
pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-openai-v2==2.3b0
```
```python
# instrumentation.py
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

# Opt into the new semconv. Legacy contrib packages need this.
os.environ["OTEL_SEMCONV_STABILITY_OPT_IN"] = "gen_ai_latest_experimental"
# Turn on prompt and response capture in dev only.
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

OpenAIInstrumentor().instrument()
```
```python
# app.py
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
# No other changes. The span is emitted automatically.
```
For TypeScript, the equivalent is @opentelemetry/instrumentation-openai with the same OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental env var.
The prompt-capture flag (read this before you ship)
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT attaches gen_ai.input.messages, gen_ai.output.messages, and gen_ai.system_instructions to every span. When it is off (default), those attributes are absent.
This is a privacy decision, not a performance decision. You do not want user prompts sitting in your trace store by accident. Turn it on in dev. Turn it on selectively in staging. Turn it on in production only after you have decided who can read traces and for how long you retain them.
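If you do capture content, one middle ground is stripping it at the storage boundary: keep payloads in short-lived debug traces, drop them before anything long-lived. A sketch of the filtering step (the attribute names are from the spec; the function is my own):

```python
# Attributes that carry raw prompt/response payloads per the GenAI semconv.
CONTENT_ATTRS = frozenset({
    "gen_ai.input.messages",
    "gen_ai.output.messages",
    "gen_ai.system_instructions",
})

def redact_content(attributes: dict) -> dict:
    """Drop prompt/response payloads; keep tokens, models, ids, finish reasons."""
    return {k: v for k, v in attributes.items() if k not in CONTENT_ATTRS}
```

In a real pipeline you would run this inside a span processor or a Collector transform, but the decision of *which* keys to drop is exactly this set.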
Metrics, not just traces
The spec also defines GenAI metrics. The two you want wired up from day one:
- `gen_ai.client.operation.duration` — histogram of span durations by operation, provider, and model.
- `gen_ai.client.token.usage` — histogram of input and output token counts.
Both are emitted automatically by the -v2 instrumentations. In Prometheus, you get per-provider latency and token-usage histograms without writing a metric collector.
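What those histograms buy you is cheap quantiles. PromQL's `histogram_quantile` is roughly linear interpolation inside the first cumulative bucket that crosses the target rank; a simplified re-derivation (skipping the `+Inf` bucket and other edge cases Prometheus handles):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Approximate quantile from cumulative (upper_bound, count) buckets.

    Simplified: assumes sorted bounds, non-empty data, no +Inf bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

So a p95 on `gen_ai.client.operation.duration`, split by `gen_ai.request.model`, costs you nothing beyond the histograms the instrumentation already emits.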
If this was useful
Chapter 4 of Observability for LLM Applications is the full spec walkthrough — every attribute, every operation, TypeScript and Python side by side. Chapter 5 walks through a complete first instrumented call end to end. Chapter 6 covers agents. Chapter 7 covers RAG retrievals.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
