Gabriel Anhaia
OpenTelemetry GenAI Semantic Conventions: Your LLM Traces Should Look Like This in 2026


In January 2026, ClickHouse acquired Langfuse. In April, Cisco announced intent to acquire Galileo. Two of the most visible LLM observability brands, swallowed within four months of each other.

The takeaway for your instrumentation strategy is narrow: do not tie your observability to any vendor at the layer where it matters. Instrument to a standard. Let the vendor be a detail.

That standard, as of April 2026, is the OpenTelemetry GenAI semantic conventions.

Where the spec is

The open-telemetry/semantic-conventions repository cut version v1.40.0 in February 2026. The gen_ai.* namespace is marked experimental / in development.

In practice, the churn is at the edges: multimodal content, agent graphs, MCP. The core attributes (operation name, provider, model, token usage) have been stable in shape since v1.37.0. Instrument the core today; you will add to it, not rewrite it.

Three backends ingest gen_ai.* spans natively, without a translation layer:

  • Datadog LLM Observability — v1.37+ consumed directly. Ship via Datadog Agent in OTLP mode or the Collector.
  • Langfuse (now inside ClickHouse) — OTLP HTTP at /api/public/otel. Maps gen_ai.* onto its internal trace model. gRPC not supported on that endpoint.
  • Arize AX and Phoenix — ingest GenAI spans and translate internally to OpenInference.

You can start on Langfuse self-hosted, decide a year from now you want Datadog, and move without rewriting a line of application code.
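Concretely, the move is an exporter-endpoint change, not a code change. The env vars below are the standard OTLP exporter knobs; the URLs are illustrative, and the Langfuse path is the OTLP HTTP endpoint mentioned above (the exporter appends /v1/traces to it):

```python
import os

# Today: point the app at self-hosted Langfuse over OTLP HTTP.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://localhost:3000/api/public/otel"
os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = "http/protobuf"

# A year later: the same app ships to a Collector in front of Datadog.
# Nothing in the application changes.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://otel-collector:4318"
```

The application code never learns which backend is on the other end; only deployment config moves.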

Anatomy of a GenAI span

The spec defines operations (things an LLM application does) and attributes (what describes each operation). An operation is identified by gen_ai.operation.name. The well-known values:

  • chat — chat completion
  • text_completion — legacy completion APIs
  • embeddings — embedding generation
  • generate_content — multimodal (Gemini)
  • retrieval — RAG retriever step
  • execute_tool — tool execution
  • create_agent — agent creation
  • invoke_agent — agent invocation (parent span for a tool-calling loop)

The span name format is {gen_ai.operation.name} {gen_ai.request.model}. A call to gpt-4o-mini becomes a span named chat gpt-4o-mini. The operation name and provider name are the two attributes the spec marks required.

The attributes you will actually look at

On a chat span, the ones that earn their keep in production:

| Attribute | What it carries |
| --- | --- |
| gen_ai.provider.name | Provider key (openai, anthropic, aws.bedrock, gcp.vertex_ai, azure.ai.openai, cohere, mistral_ai, groq, x_ai). Replaces the older gen_ai.system. |
| gen_ai.request.model | The model name your code sent. |
| gen_ai.response.model | The model the provider actually served. These differ more often than you expect: Bedrock aliases, OpenAI autorouting, Azure deployment IDs. Capture both. |
| gen_ai.response.id | Provider response ID: the string you paste into a support ticket. |
| gen_ai.response.finish_reasons | List: ["stop"], ["length"] (truncated), ["content_filter"], ["tool_calls"]. |
| gen_ai.usage.input_tokens | Prompt tokens billed. |
| gen_ai.usage.output_tokens | Completion tokens billed. |
| gen_ai.usage.cache_read.input_tokens | Cache hits. If you use prompt caching and skip this, your cost dashboard lies. |
| gen_ai.usage.cache_creation.input_tokens | Cache writes (Anthropic surcharge tier). |

The recommended request-shape attributes (gen_ai.request.temperature, .max_tokens, .top_p, .top_k, .frequency_penalty, .presence_penalty, .stop_sequences, .seed) are cheap, leak no content, and the first time you need to reproduce a weird generation you will want all of them.
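The usage attributes in the table above feed straight into cost accounting. A minimal sketch of how the cache tiers change the math; the prices and model key are made-up illustrative numbers, not a real price sheet:

```python
# Illustrative pricing in USD per 1M tokens -- check your provider's page.
PRICES = {
    "example-model": {"input": 3.00, "output": 15.00,
                      "cache_read": 0.30, "cache_write": 3.75},
}

def span_cost(model: str, input_tokens: int, output_tokens: int,
              cache_read_tokens: int = 0, cache_write_tokens: int = 0) -> float:
    # Assumption: input_tokens excludes the cache tiers (Anthropic reports
    # them separately). Dropping the cache terms makes cache-heavy workloads
    # look far more expensive than they are.
    p = PRICES[model]
    return (input_tokens * p["input"]
            + output_tokens * p["output"]
            + cache_read_tokens * p["cache_read"]
            + cache_write_tokens * p["cache_write"]) / 1_000_000

print(span_cost("example-model", 1000, 100, cache_read_tokens=9000))  # 0.0072
```

With 9,000 of 10,000 prompt tokens served from cache at a tenth of the fresh-input rate, the span costs a fraction of what a naive input-times-price formula would report.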

What an agent span tree looks like

Agent spans (create_agent, invoke_agent) carry gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description, gen_ai.agent.version, and gen_ai.tool.definitions.

A multi-turn tool loop is expressed as a single invoke_agent parent span with alternating chat and execute_tool children:

invoke_agent [agent=support_bot, id=run-abc123]
├── chat gpt-4o-mini [input=412, output=64]
├── execute_tool search_orders [tool.call.id=call-01]
├── chat gpt-4o-mini [input=480, output=102]
├── execute_tool get_order [tool.call.id=call-02]
└── chat gpt-4o-mini [input=560, output=140]

Tool execution spans carry gen_ai.tool.name, gen_ai.tool.type (function, retrieval, extension), and gen_ai.tool.call.id — which links the tool span back to the tool_call_id the LLM emitted in the preceding chat turn.
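That linkage is easy to check offline. Plain dicts stand in for exported spans here; the shapes are illustrative, not an SDK API:

```python
# The chat turn emits tool calls; each execute_tool span carries the id back.
chat_span = {
    "name": "chat gpt-4o-mini",
    "tool_calls": [{"id": "call-01", "name": "search_orders"}],
}
tool_span = {
    "name": "execute_tool search_orders",
    "attributes": {
        "gen_ai.tool.name": "search_orders",
        "gen_ai.tool.type": "function",
        "gen_ai.tool.call.id": "call-01",
    },
}

# Join key: gen_ai.tool.call.id must match a call the parent chat turn emitted.
requested_ids = {c["id"] for c in chat_span["tool_calls"]}
assert tool_span["attributes"]["gen_ai.tool.call.id"] in requested_ids
```

An execute_tool span whose call id matches nothing in the preceding chat turn is a wiring bug worth alerting on.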

Conversation correlation across separate root traces uses gen_ai.conversation.id. Set it when a user message and the assistant reply live in different HTTP requests (different trace IDs) and you still want to see them grouped.

Your first instrumented call

For Python, prefer the -v2 packages. That suffix means the library emits the new GenAI semconv. The current beta at the pin date is 2.3b0:

pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-openai-v2==2.3b0
# instrumentation.py
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

# Opt into the new semconv. Legacy contrib packages need this.
os.environ["OTEL_SEMCONV_STABILITY_OPT_IN"] = "gen_ai_latest_experimental"
# Turn on prompt and response capture in dev only.
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

OpenAIInstrumentor().instrument()
# app.py
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
# No other changes. The span is emitted automatically.

For TypeScript, the equivalent is @opentelemetry/instrumentation-openai with the same OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental env var.

The prompt-capture flag (read this before you ship)

OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT attaches gen_ai.input.messages, gen_ai.output.messages, and gen_ai.system_instructions to every span. When it is off (default), those attributes are absent.

This is a privacy decision, not a performance decision. You do not want user prompts sitting in your trace store by accident. Turn it on in dev. Turn it on selectively in staging. Turn it on in production only after you have decided who can read traces and for how long you retain them.
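If you do turn capture on, one middle-ground pattern is to scrub the content-bearing attributes in your own processing step before export. A stdlib sketch of the scrub logic only; the span-processor wiring around it is left out:

```python
# The three content-bearing attributes the capture flag adds.
SENSITIVE = {
    "gen_ai.input.messages",
    "gen_ai.output.messages",
    "gen_ai.system_instructions",
}

def redact(attributes: dict) -> dict:
    """Drop prompt/response content, keep every other span attribute."""
    return {k: v for k, v in attributes.items() if k not in SENSITIVE}

span_attrs = {
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.input.messages": [{"role": "user", "content": "Hello"}],
}
print(redact(span_attrs))  # {'gen_ai.request.model': 'gpt-4o-mini'}
```

That keeps token counts, finish reasons, and model metadata flowing while the content itself never leaves the process.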

Metrics, not just traces

The spec also defines GenAI metrics. The two you want wired up from day one:

  • gen_ai.client.operation.duration — histogram of span durations by operation, provider, and model.
  • gen_ai.client.token.usage — histogram of input and output token counts.

Both are emitted automatically by the -v2 instrumentations. In Prometheus, you get per-provider latency and token-usage histograms without writing a metric collector.
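To make the label shape concrete, here is a toy stand-in for how token-usage points aggregate by operation, provider, model, and token type. Real values come from your metrics backend; nothing here is an OpenTelemetry API:

```python
from collections import defaultdict

# Totals keyed the way the metric is labelled.
totals: dict[tuple, int] = defaultdict(int)

def record(operation: str, provider: str, model: str,
           token_type: str, count: int) -> None:
    totals[(operation, provider, model, token_type)] += count

# Three chat turns from the agent tree earlier in the post.
record("chat", "openai", "gpt-4o-mini", "input", 412)
record("chat", "openai", "gpt-4o-mini", "input", 480)
record("chat", "openai", "gpt-4o-mini", "output", 64)

print(totals[("chat", "openai", "gpt-4o-mini", "input")])  # 892
```

Dashboards slice on exactly these labels: input versus output tokens, per model, per provider.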

If this was useful

Chapter 4 of Observability for LLM Applications is the full spec walkthrough — every attribute, every operation, TypeScript and Python side by side. Chapter 5 walks through a complete first instrumented call end to end. Chapter 6 covers agents. Chapter 7 covers RAG retrievals.

Observability for LLM Applications — the book
