GenAI Semantic Conventions in 2026: The 11 Attributes That Survive Across SDKs

#ai #observability #opentelemetry #llm

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

OpenTelemetry's GenAI semantic conventions are still marked experimental in 2026, and the SDKs that claim to follow them quietly disagree with each other on eight of the most-used attributes. You wire up a dashboard. You filter on gen_ai.request.model. Forty percent of your spans don't show up because one SDK emits llm.model_name instead.

This post is the survival guide. Eleven attributes that haven't moved in twelve months. The seven that have. A 30-line manual wrapper. A 20-line span processor that normalises renames at export, so your queries stop lying.

The drift you can't see

A platform team I talked to last month had a beautiful Grafana panel: tokens-per-minute, broken down by model. It worked perfectly for their RAG service. Then they added an agent runtime using a different SDK and the agent traffic just... didn't appear.

The RAG service emitted gen_ai.request.model. The agent runtime emitted llm.model_name. Both SDKs cited the OpenTelemetry GenAI semantic conventions as their source. Both were right at some point. One of them shipped against a draft from eight months back.

You can't see this drift on a dashboard. The panel renders. The numbers look plausible. The bug is that the filter silently excludes half your data.
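
One way to make the drift visible is to count which attribute name carries the model on each span, instead of trusting the panel. The sketch below assumes you can pull span attributes out of your backend as plain dicts; model_key_coverage and the sample spans are hypothetical names and data for illustration, not part of any SDK.

```python
from collections import Counter

# Attribute names that may carry the model, canonical name first.
# Hypothetical list: extend it with whatever your SDKs emit.
MODEL_KEYS = ("gen_ai.request.model", "llm.model_name")


def model_key_coverage(spans):
    """Count which attribute name (if any) carries the model on each span."""
    counts = Counter()
    for attrs in spans:
        key = next((k for k in MODEL_KEYS if k in attrs), None)
        counts[key or "<missing>"] += 1
    return counts


# Made-up sample: one span per naming scheme plus one non-LLM span.
spans = [
    {"gen_ai.request.model": "gpt-4o-2024-08-06"},  # RAG service
    {"llm.model_name": "gpt-4o-2024-08-06"},        # agent runtime
    {"llm.model_name": "claude-sonnet-4-5"},        # agent runtime
    {"http.request.method": "GET"},                 # not an LLM span
]

coverage = model_key_coverage(spans)
for key, n in coverage.items():
    print(f"{key}: {n}")
```

If anything other than the canonical name has a non-zero count, a filter on gen_ai.request.model is excluding those spans.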

What "stable" means in an experimental spec

The GenAI semconv page has a banner at the top: "This document is a work in progress." That banner has been there since the convention was introduced. It's still there. It may always be there.

What changes underneath the banner is what bites you. Attribute names get renamed. Attributes get split (one becomes two). Attributes get demoted from required to opt-in. The spec moves; the SDKs move at their own pace; auto-instrumentation libraries lag behind the SDKs.

"Stable" in this context doesn't mean what the spec says is stable. It means what hasn't moved in 12 months across the four SDKs you actually have to support. That's a measurable definition. The 11 attributes below all meet it. The 7 in the next section don't.

If you build dashboards and alerts against the 11, you can ignore the spec churn. If you build them against the 7, you're going to rewrite queries every quarter.

The 11 attributes that survive

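Pin queries to these names. OpenLLMetry (Traceloop), OpenInference (Arize), the OTel-native Python and Node SDKs, and the LangSmith OTel exporter all emit the underlying values, and none of them has changed its name for any of these in the last twelve months. The names are not always the same string across the four; the rename table in the next section maps each SDK's spelling back to the canonical one.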

gen_ai.request.model: the model name the client asked for. Example: "gpt-4o-2024-08-06", "claude-sonnet-4-5", "gemini-2.0-flash". String. Never empty in practice.
gen_ai.response.model: the model the provider actually served. Differs from request when you ask for gpt-4o and get pinned to a dated variant. Example: "gpt-4o-2024-08-06".
gen_ai.operation.name: one of "chat", "text_completion", "embeddings", "execute_tool". Required if you want to filter chat completions out of embedding noise.
gen_ai.system: the provider. Example: "openai", "anthropic", "google.vertex", "azure.openai", "ollama". Lowercase, dot-namespaced for sub-providers.
gen_ai.usage.input_tokens: integer, prompt tokens consumed. Example: 1842.
gen_ai.usage.output_tokens: integer, completion tokens generated. Example: 247.
gen_ai.usage.total_tokens: integer, sum. Redundant but every backend expects it. Emit it.
gen_ai.response.finish_reason: array of strings, one per choice. Values: ["stop"], ["length"], ["tool_calls"], ["content_filter"]. The array shape matters: Arize emits a string, the spec says array. Pick one and stick to it.
gen_ai.conversation.id: string, your conversation/session identifier. Not part of the original spec; promoted to "stable across SDKs" because everyone added it independently for the same reason. Example: "conv_01HKQ7Z...".
gen_ai.prompt.template.hash: the first 16 hex characters of the SHA-256 hash of the template string, taken before variable substitution. Example: "a4f3e2b1c9d8e7f6". Lets you group spans by which prompt version produced them without leaking the prompt itself.
error.type: when the call fails. Example: "openai.RateLimitError", "anthropic.APIConnectionError". This is from the general OTel error semconv, not GenAI-specific, which is why it's stable.

Plus duration on the span itself (not an attribute, it's the span's end_time minus start_time). Counts as the 12th if you count it. Many backends derive a latency metric from it; the spec's name for that metric is gen_ai.client.operation.duration.
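
A quick way to keep yourself honest about the list above is to check spans against it in a test or a sampling job. The sketch below is a minimal validator for a successful call, assuming span attributes arrive as a plain dict; EXPECTED_TYPES and check_span are hypothetical names, and error.type is left out because it only appears on failures.

```python
# Expected Python type for each attribute on a successful call.
# error.type is deliberately absent: it is only set when the call fails.
EXPECTED_TYPES = {
    "gen_ai.request.model": str,
    "gen_ai.response.model": str,
    "gen_ai.operation.name": str,
    "gen_ai.system": str,
    "gen_ai.usage.input_tokens": int,
    "gen_ai.usage.output_tokens": int,
    "gen_ai.usage.total_tokens": int,
    "gen_ai.response.finish_reason": (list, tuple),
    "gen_ai.conversation.id": str,
    "gen_ai.prompt.template.hash": str,
}


def check_span(attrs):
    """Return a list of human-readable problems; an empty list means clean."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in attrs:
            problems.append(f"missing: {name}")
            continue
        value = attrs[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type: {name} is {type(value).__name__}")
    return problems


good = {
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.response.model": "gpt-4o-2024-08-06",
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "openai",
    "gen_ai.usage.input_tokens": 1842,
    "gen_ai.usage.output_tokens": 247,
    "gen_ai.usage.total_tokens": 2089,
    "gen_ai.response.finish_reason": ["stop"],
    "gen_ai.conversation.id": "conv_example",
    "gen_ai.prompt.template.hash": "a4f3e2b1c9d8e7f6",
}
# Same span, but with finish_reason emitted as a bare string.
bad = dict(good, **{"gen_ai.response.finish_reason": "stop"})

print(check_span(good))
print(check_span(bad))
```

The string-versus-array case is the one this catches most often in practice, which is why the example uses it.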

The rename table

Here's what the four major SDKs actually emit for eight of these. The first column is the canonical name; the other four columns are the strings your normaliser has to map back to it.

| Canonical name | OpenLLMetry | OpenInference | OTel-native | LangSmith |
| --- | --- | --- | --- | --- |
| `gen_ai.request.model` | `gen_ai.request.model` | `llm.model_name` | `gen_ai.request.model` | `gen_ai.request.model` |
| `gen_ai.response.model` | `gen_ai.response.model` | `llm.model_name` | `gen_ai.response.model` | `gen_ai.response.model` |
| `gen_ai.operation.name` | `llm.request.type` | `openinference.span.kind` | `gen_ai.operation.name` | `gen_ai.operation.name` |
| `gen_ai.system` | `gen_ai.system` | `llm.provider` | `gen_ai.system` | `gen_ai.system` |
| `gen_ai.usage.input_tokens` | `gen_ai.usage.prompt_tokens` | `llm.token_count.prompt` | `gen_ai.usage.input_tokens` | `gen_ai.usage.input_tokens` |
| `gen_ai.usage.output_tokens` | `gen_ai.usage.completion_tokens` | `llm.token_count.completion` | `gen_ai.usage.output_tokens` | `gen_ai.usage.output_tokens` |
| `gen_ai.usage.total_tokens` | `gen_ai.usage.total_tokens` | `llm.token_count.total` | `gen_ai.usage.total_tokens` | `gen_ai.usage.total_tokens` |
| `gen_ai.response.finish_reason` | `gen_ai.response.finish_reasons` | `llm.invocation_parameters` | `gen_ai.response.finish_reasons` | `gen_ai.response.finish_reason` |

Note the input/output rename in OpenLLMetry. The spec said prompt_tokens/completion_tokens in 2024, then renamed to input_tokens/output_tokens in 2025. OpenLLMetry kept the old names for backward compatibility. OpenInference never adopted either name; it has its own llm.token_count.* namespace from before the spec existed.

The 7 attributes still moving

Don't build dashboards on these without a normaliser layer. They are: tool-call name, tool-call arguments, tool-call ID, message role, message content, prompt content, response content. The whole "what was actually said and which tool got called" surface is the most volatile part of the spec.

Tool-call attributes got renamed three times in 18 months. Message content moved from a flat string to an event with structured parts and back. If you need this data (for evals, for replay, for the "show me the conversation" UI), capture it, but capture it under your own namespace (myapp.llm.tool_call.*) and let the spec settle.

The token counts (input_tokens, output_tokens, total_tokens) are stable. The token contents are not.
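
Capturing the volatile surface under your own namespace can be as small as one flattening helper. The sketch below is one possible layout, not a convention from the spec: the input shape (a list of dicts with id, name, and arguments), the indexed key scheme, and the name tool_call_attributes are all assumptions made for illustration.

```python
import json


def tool_call_attributes(tool_calls, prefix="myapp.llm.tool_call"):
    """Flatten tool calls into indexed span attributes under an app-owned prefix.

    Assumes each call is a dict with "id", "name", and "arguments" keys.
    """
    attrs = {f"{prefix}.count": len(tool_calls)}
    for i, call in enumerate(tool_calls):
        attrs[f"{prefix}.{i}.id"] = call["id"]
        attrs[f"{prefix}.{i}.name"] = call["name"]
        # Span attribute values must be primitives, so serialise the arguments.
        # sort_keys makes identical calls produce identical strings.
        attrs[f"{prefix}.{i}.arguments"] = json.dumps(call["arguments"], sort_keys=True)
    return attrs


calls = [{"id": "call_1", "name": "get_weather", "arguments": {"unit": "c", "city": "Lisbon"}}]
tool_attrs = tool_call_attributes(calls)
for key, value in tool_attrs.items():
    print(key, "=", value)
```

Because the prefix is yours, a spec rename never touches these keys; if the spec settles, you can add the official names alongside them without breaking existing queries.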

A minimal manual instrumentation snippet

When you don't trust the auto-instrumentation, wrap the call yourself. Here's a 30-line wrapper for the OpenAI Python SDK that emits the 11 attributes correctly. Drop it in front of client.chat.completions.create and your spans will be canonical-name clean.

import hashlib

from opentelemetry import trace
from openai import OpenAI, OpenAIError

tracer = trace.get_tracer("myapp.llm")
client = OpenAI()

def chat(model, messages, template, conversation_id, **kwargs):
    """Non-streaming chat call that emits the canonical gen_ai.* attributes."""
    # Hash the raw template (before variable substitution) and keep 16 hex chars
    tpl_hash = hashlib.sha256(template.encode()).hexdigest()[:16]
    # record_exception=False: the except block records it, so it isn't logged twice
    with tracer.start_as_current_span("chat", record_exception=False) as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.conversation.id", conversation_id)
        span.set_attribute("gen_ai.prompt.template.hash", tpl_hash)
        try:
            r = client.chat.completions.create(
                model=model, messages=messages, **kwargs
            )
        except OpenAIError as e:
            # error.type is OTel-general, not gen_ai-namespaced.
            # Use the qualified class name, e.g. "openai.RateLimitError".
            err = type(e)
            span.set_attribute("error.type", f"{err.__module__}.{err.__qualname__}")
            span.record_exception(e)
            raise
        span.set_attribute("gen_ai.response.model", r.model)
        u = r.usage
        if u is not None:  # usage is optional on the response object
            span.set_attribute("gen_ai.usage.input_tokens", u.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", u.completion_tokens)
            span.set_attribute("gen_ai.usage.total_tokens", u.total_tokens)
        # Always a list, one entry per choice, even when there is only one
        span.set_attribute(
            "gen_ai.response.finish_reason",
            [c.finish_reason for c in r.choices],
        )
        return r

Three things worth pointing out. The hash is truncated to 16 chars: full SHA-256 inflates your trace payload and 16 hex chars is enough to disambiguate every prompt template you'll ever ship. error.type is set before record_exception so the attribute lands even if the exception path skips later code. And finish_reason is set as a list, matching what the spec says, even when there's only one choice. The list-vs-string disagreement between SDKs is one of the most common dashboard-breakers.
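
For spans you didn't produce yourself, the list-versus-string disagreement can be smoothed over with a small coercion step wherever you read the attribute. This is a minimal sketch; as_finish_reason_list is a hypothetical helper name, and treating a missing value as an empty list is a design choice rather than anything the spec mandates.

```python
def as_finish_reason_list(value):
    """Coerce a finish_reason attribute to the list-of-strings shape."""
    if value is None:
        return []           # attribute missing: no choices recorded
    if isinstance(value, str):
        return [value]      # SDK emitted a bare string
    return [str(v) for v in value]  # already a sequence: copy into a list


print(as_finish_reason_list("stop"))
print(as_finish_reason_list(("stop", "length")))
print(as_finish_reason_list(None))
```

The string check has to come before the sequence branch, because iterating a Python string would otherwise split "stop" into single characters.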

A query that joins across SDKs

If you can't normalise at ingest, normalise at query time. SQL has COALESCE; PromQL and Grafana Tempo's TraceQL can get a similar effect with their or and || operators. Pseudocode in SQL because it reads cleanest:

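SELECT
  COALESCE(
    span.attributes['gen_ai.request.model'],
    span.attributes['llm.model_name']
  ) AS model,
  COALESCE(
    span.attributes['gen_ai.usage.input_tokens'],
    span.attributes['gen_ai.usage.prompt_tokens'],
    span.attributes['llm.token_count.prompt']
  ) AS input_tokens,
  span.duration AS latency_ms  -- assumes the backend stores duration in milliseconds
FROM spans
-- Filter on the coalesced model rather than span.name:
-- span names are not guaranteed to match across SDKs either
WHERE COALESCE(
    span.attributes['gen_ai.request.model'],
    span.attributes['llm.model_name']
  ) IS NOT NULL
  AND span.start_time > now() - INTERVAL '1 hour';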

Painful, but you only write it once per dashboard. Better is to do the rename at ingest.
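
If your spans land somewhere you can script against, the same coalescing fits in a few lines of Python. The sketch below assumes spans are available as attribute dicts; coalesce, input_tokens_by_model, and the sample data are hypothetical and only illustrate the first-non-null lookup.

```python
from collections import defaultdict

# Lookup order matters: canonical name first, then the known variants.
MODEL_KEYS = ("gen_ai.request.model", "llm.model_name")
INPUT_TOKEN_KEYS = (
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.prompt_tokens",
    "llm.token_count.prompt",
)


def coalesce(attrs, keys):
    """Return the first non-None value among keys, like SQL COALESCE."""
    for key in keys:
        value = attrs.get(key)
        if value is not None:
            return value
    return None


def input_tokens_by_model(spans):
    """Sum input tokens per model, whatever naming scheme each span used."""
    totals = defaultdict(int)
    for attrs in spans:
        model = coalesce(attrs, MODEL_KEYS)
        if model is None:
            continue  # not an LLM span
        totals[model] += coalesce(attrs, INPUT_TOKEN_KEYS) or 0
    return dict(totals)


# Made-up sample covering the three naming schemes.
spans = [
    {"gen_ai.request.model": "gpt-4o-2024-08-06", "gen_ai.usage.input_tokens": 1842},
    {"llm.model_name": "gpt-4o-2024-08-06", "llm.token_count.prompt": 100},
    {"gen_ai.request.model": "claude-sonnet-4-5", "gen_ai.usage.prompt_tokens": 50},
    {"http.request.method": "GET"},
]

totals = input_tokens_by_model(spans)
print(totals)
```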

What to instrument when you don't control the SDK

When the SDK is vendored or auto-instrumented and you can't touch the call site, normalise at the OTel collector or in a span processor before export. Here's a 20-line SpanProcessor (Python OTel SDK) that copies the renamed attributes onto their canonical names as spans end, before they hit your exporter.

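from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# old name -> canonical name. gen_ai.operation.name is left out on purpose:
# its variants differ in value as well as key, so a key rename isn't enough.
RENAMES = {
    "llm.model_name": "gen_ai.request.model",
    "llm.provider": "gen_ai.system",
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.token_count.completion": "gen_ai.usage.output_tokens",
    "llm.token_count.total": "gen_ai.usage.total_tokens",
    "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
    "gen_ai.response.finish_reasons": "gen_ai.response.finish_reason",
}

class GenAINormaliser(SpanProcessor):
    def on_start(self, span, parent_context=None):
        pass

    def on_end(self, span):
        # span.attributes is a read-only view; the private mapping behind it isn't.
        # This relies on SDK internals and may break between SDK versions.
        attrs = span._attributes
        if not attrs:
            return
        for old, new in RENAMES.items():
            # Check the live mapping so two old names never overwrite each other
            if old in attrs and new not in attrs:
                attrs[new] = attrs[old]

    def shutdown(self):
        pass

    def force_flush(self, timeout_millis=30000):
        return True

provider = TracerProvider()
otlp_exporter = OTLPSpanExporter()
provider.add_span_processor(GenAINormaliser())  # must be registered before the exporter
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)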

Order matters. The normaliser runs first because the SDK invokes span processors in registration order when a span ends. Touching span._attributes is reaching into a private field. Yes, ugly. The public API doesn't let you mutate attributes after the fact. If that bothers you, do the rewrite at the collector layer with an attributes processor in otel-collector-config.yaml instead. Same idea, different layer.

The gotcha you'll hit: the normaliser only sees what is on the span at the moment its on_end fires. Some SDKs (OpenInference is the worst offender) set attributes late, after set_status is called. Anything set after the span has actually ended is dropped by the SDK, and no layer can bring it back. Anything written after your normaliser has run but before export is the recoverable case: you'll see a span with llm.model_name set but no gen_ai.request.model even after your normaliser ran. The fix is to switch to a collector-side rewrite. By the time the OTLP payload arrives at the collector, every attribute that was exported is present.
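
Whichever layer ends up doing the rewrite, the mapping itself can be tested without an OTel SDK in the loop. The sketch below pulls the rename logic into a pure function over a plain dict; normalise is a hypothetical name, and the mapping is copied from the rename list in this section. Like the processor, it copies values to the canonical name and leaves the old key in place.

```python
# Old attribute name -> canonical name (copied from the rename list above).
RENAMES = {
    "llm.model_name": "gen_ai.request.model",
    "llm.provider": "gen_ai.system",
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.token_count.completion": "gen_ai.usage.output_tokens",
    "llm.token_count.total": "gen_ai.usage.total_tokens",
    "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
    "gen_ai.response.finish_reasons": "gen_ai.response.finish_reason",
}


def normalise(attrs):
    """Return a copy of attrs with canonical names filled in from old ones.

    An existing canonical value is never overwritten, and when two old names
    map to the same canonical name the first one listed in RENAMES wins.
    """
    out = dict(attrs)
    for old, new in RENAMES.items():
        if old in out and new not in out:
            out[new] = out[old]
    return out


raw = {
    "llm.model_name": "gpt-4o-2024-08-06",
    "llm.token_count.prompt": 10,
    "gen_ai.usage.prompt_tokens": 12,  # conflicting value from a second SDK
}
clean = normalise(raw)
print(clean["gen_ai.request.model"], clean["gen_ai.usage.input_tokens"])
```

A test like this is also where you pin down what happens when two old names disagree, which is easy to get wrong if the lookup runs against a stale snapshot of the attributes.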

When you should stop pinning

The day the GenAI semconv loses its experimental banner, re-read this post and check whether your 11 still match. Probably 9 of them will. The two most likely to move when the spec stabilises are gen_ai.conversation.id (might become session.id to match a different OTel proposal) and gen_ai.prompt.template.hash (might land as a structured attribute rather than a raw hash).

Until then: dashboards on the 11, normaliser for the renames, your own namespace for the volatile 7.

Which SDK are you stuck with, and which attribute has burned you most? Drop the rename pair you keep tripping over in the comments. The collective list is more useful than any spec page.

If this was useful

The mismatch between what the spec says and what your SDK actually emits is one of the things the LLM Observability Pocket Guide digs into. The chapter on choosing a tracing tool walks through the same SDK comparison from the angle of which one fits your stack, and the SLO chapter shows how to build alerts that don't break when the semconv moves.