# LLM Observability: Making Blackbox AI Transparent
As LLM-based applications are being deployed at full scale in production, traditional APM (Application Performance Monitoring) alone is no longer sufficient to understand how AI systems operate. Metrics like API response time, error rate, and CPU usage remain important, but AI systems require a completely different set of questions and measurements.
We need specialized LLM metrics: "Which models were called with which prompts, and what responses were generated?", "How many tokens were consumed?", "Are responses consistent?", "Where are hallucinations occurring?" The problem is that the current ecosystem is severely fragmented. Tools like Langfuse, Helicone, Traceloop, and LangSmith each use incompatible, proprietary tracing formats, locking users in to a single vendor.
OpenTelemetry GenAI Semantic Conventions were created specifically to solve this fragmentation problem.
## The Structure of OpenTelemetry GenAI Semantic Conventions
Developed by the GenAI SIG (Special Interest Group) of OpenTelemetry since April 2024, this standard unifies attribute names, types, and enumeration values for LLM calls, agent steps, vector database queries, token usage, cost tracking, and quality metrics (hallucination indicators, scores, etc.).
As of March 2026, most GenAI semantic conventions are still in experimental status, meaning the attribute set isn't fully stabilized yet. Major observability vendors have nonetheless started supporting it: Datadog began native support with OTel semantic conventions v1.37, and the Grafana stack can now ingest LLM telemetry (traces in Tempo, logs in Loki).
| Domain | Attribute | Type | Description |
|---|---|---|---|
| Model calls | `gen_ai.operation.name` | string | Operation type of the LLM request (e.g., "chat", "embedding") |
| Model calls | `gen_ai.request.model` | string | Model name used (e.g., "gpt-4o", "claude-opus") |
| Token tracking | `gen_ai.usage.input_tokens` | int | Number of input tokens |
| Token tracking | `gen_ai.usage.output_tokens` | int | Number of output tokens |
| Agent | `gen_ai.agent.name` | string | AI agent identifier |
| Agent | `gen_ai.agent.description` | string | Agent role description |
| Tool calls | `gen_ai.tool.name` | string | Name of the external tool used by the agent |
| Tool calls | `gen_ai.tool.type` | enum | Tool type (function, web_search, database, etc.) |
| Quality metrics | `gen_ai.*.score` | double | Response quality score (0–1) |
| Quality metrics | `gen_ai.*.cost` | double | Cost per request (USD) |
## Hands-On Implementation: Applying LLM Tracing with Python
### Step 1: Install OpenTelemetry Packages
```shell
# Core OpenTelemetry packages
pip install opentelemetry-api opentelemetry-sdk

# Provider-specific instrumentation
pip install opentelemetry-instrumentation-openai \
            opentelemetry-instrumentation-anthropic \
            opentelemetry-instrumentation-langchain

# Exporters (choose based on your backend)
pip install opentelemetry-exporter-otlp        # OTLP: Jaeger, Datadog, Grafana Tempo
pip install opentelemetry-exporter-prometheus
# Note: the dedicated Jaeger exporter is deprecated; modern Jaeger ingests OTLP directly

# Optional: context propagation (for microservices)
pip install opentelemetry-propagators-jaeger
```
### Step 2: Configure Tracing and Auto-Instrumentation
```python
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
# Requires opentelemetry-instrumentation-requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure TracerProvider
trace_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True
)
trace_provider = TracerProvider()
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)

# Configure MeterProvider (metrics)
metric_exporter = OTLPMetricExporter(
    endpoint="http://localhost:4317",
    insecure=True
)
metric_reader = PeriodicExportingMetricReader(metric_exporter)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Enable auto-instrumentation for OpenAI
OpenAIInstrumentor().instrument()

# Also trace HTTP requests (external API calls)
RequestsInstrumentor().instrument()

# Now OpenAI calls are automatically traced
import openai

client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# Auto-generated span attributes:
#   gen_ai.operation.name: "chat"
#   gen_ai.request.model: "gpt-4o"
#   gen_ai.usage.input_tokens: 15
#   gen_ai.usage.output_tokens: 8
#   gen_ai.response.finish_reason: "stop"
```
💡 Pro Tip: OpenTelemetry exporters support both gRPC (port 4317) and HTTP (port 4318). In environments with strict firewall policies, using HTTP is simpler.
### Step 3: Add Custom Agent Spans
```python
from opentelemetry import trace
import time

tracer = trace.get_tracer("my-ai-agent")

def execute_agent_task(task_description: str):
    """Function for an AI agent to execute a task"""
    # Top-level span: agent execution
    with tracer.start_as_current_span("agent_execution") as agent_span:
        agent_span.set_attribute("gen_ai.agent.name", "research-bot")
        agent_span.set_attribute("gen_ai.agent.description", "Web research and analysis")
        agent_span.set_attribute("gen_ai.operation.name", "agent_plan_and_execute")

        # Step 1: Planning
        with tracer.start_as_current_span("plan_step") as plan_span:
            plan_span.set_attribute("gen_ai.operation.name", "chat")
            plan_span.set_attribute("gen_ai.request.model", "gpt-4o")
            # Actual LLM call is handled by auto-instrumentation

        # Step 2: Web search tool call
        with tracer.start_as_current_span("tool_call") as tool_span:
            tool_span.set_attribute("gen_ai.tool.name", "web_search")
            tool_span.set_attribute("gen_ai.tool.type", "function")
            tool_span.set_attribute("gen_ai.tool.description", "Search the web for information")
            # Execute tool
            search_results = perform_web_search(task_description)
            time.sleep(0.5)  # Simulate actual API latency
            tool_span.set_attribute("gen_ai.tool.call.status", "success")
            tool_span.set_attribute("gen_ai.tool.output.count", len(search_results))

        # Step 3: Database query tool
        with tracer.start_as_current_span("tool_call") as db_span:
            db_span.set_attribute("gen_ai.tool.name", "database_lookup")
            db_span.set_attribute("gen_ai.tool.type", "database")

        # Step 4: Generate final response
        with tracer.start_as_current_span("generate_response") as response_span:
            response_span.set_attribute("gen_ai.operation.name", "chat")
            response_span.set_attribute("gen_ai.request.model", "gpt-4o")

        # Agent execution summary
        agent_span.set_attribute("gen_ai.agent.status", "completed")
        agent_span.set_attribute("gen_ai.agent.steps_executed", 4)

def perform_web_search(query: str):
    """Web search simulation"""
    return [
        {"title": "Result 1", "url": "https://example.com/1"},
        {"title": "Result 2", "url": "https://example.com/2"}
    ]

# Execute
execute_agent_task("How has climate change impacted global agriculture?")
```
### Step 4: Hallucination Detection Metrics
```python
from opentelemetry import trace, metrics

tracer = trace.get_tracer("hallucination-detector")
meter = metrics.get_meter("llm-quality-metrics")

# Hallucination detection counter
hallucination_counter = meter.create_counter(
    name="gen_ai.hallucination.detected",
    description="Number of hallucinations detected in LLM responses",
    unit="1"
)

# Response quality histogram
quality_histogram = meter.create_histogram(
    name="gen_ai.response.quality",
    description="Quality score of LLM responses",
    unit="1"
)

def verify_llm_response(response: str, fact_check_fn) -> dict:
    """Verify an LLM response and determine whether a hallucination occurred"""
    with tracer.start_as_current_span("verify_response") as span:
        span.set_attribute("gen_ai.operation.name", "verify")
        span.set_attribute("gen_ai.response.length", len(response))

        # Perform fact checking
        fact_check_result = fact_check_fn(response)
        quality_score = fact_check_result.get("quality_score", 0.5)
        has_hallucination = fact_check_result.get("has_hallucination", False)
        confidence = fact_check_result.get("confidence", 0.0)

        # Record metrics
        if has_hallucination:
            hallucination_counter.add(1, {"model": "gpt-4o"})
        quality_histogram.record(quality_score, {"model": "gpt-4o"})

        # Set span attributes
        span.set_attribute("gen_ai.response.quality.score", quality_score)
        span.set_attribute("gen_ai.response.hallucination_detected", has_hallucination)
        span.set_attribute("gen_ai.response.confidence", confidence)

        return fact_check_result

# Usage example
response = "Paris is the capital of France and has a population of about 2 million people."
result = verify_llm_response(response, lambda r: {
    "quality_score": 0.95,
    "has_hallucination": False,
    "confidence": 0.98
})
```
⚠️ Important: GenAI semantic conventions do not capture prompt and response content by default. This is a design principle to prevent personally identifiable information (PII) leakage. If content logging is required, explicitly enable it with the `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true` environment variable.
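Content capture has to be opted into before the instrumentor initializes; one way is to set the flag from Python itself. A minimal sketch:

```python
import os

# Opt in to prompt/response capture BEFORE calling OpenAIInstrumentor().instrument();
# leave this unset in production if prompts may contain PII
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"

capture_enabled = os.environ.get(
    "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT", "false"
).lower() == "true"
print(capture_enabled)  # True
```

In containerized deployments the same flag is usually set in the pod or service environment rather than in code.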
## Building an AI Dashboard with Grafana + OTel Collector
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 100
  # Note: the attributes processor does not support wildcard keys such as
  # gen_ai.*; each action targets one key (use the transform processor for
  # pattern-based rewrites)
  attributes:
    actions:
      - key: gen_ai.request.model
        action: upsert
        value: "unknown"  # default when a span arrives without a model attribute

exporters:
  # Prometheus: export metrics
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    headers:
      Authorization: Bearer ${env:PROMETHEUS_TOKEN}
  # Tempo: store traces
  otlp:
    endpoint: http://tempo:4317
    tls:
      insecure: true
  # Loki: logs and text-based data (requires the basicauth extension declared
  # under `extensions:`; deprecated in newer Collector releases in favor of
  # sending OTLP logs to Loki directly)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    auth:
      authenticator: basicauth

connectors:
  # Derive request-rate/duration metrics from LLM spans
  spanmetrics:
    dimensions:
      - name: gen_ai.request.model
      - name: gen_ai.agent.name

service:
  pipelines:
    # Trace pipeline: track LLM calls (also feeds the spanmetrics connector)
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp, spanmetrics]
    # Metrics pipeline: token count, cost, response time
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheusremotewrite]
    # Logs pipeline: errors, hallucination detection
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```
With this configuration:
- LLM call traces → Tempo/Jaeger (distributed tracing)
- Token usage, cost, response time → Prometheus (metrics)
- Prompts, responses, error logs → Loki (logs)
All unified in a Grafana dashboard for visualization.
## AI Agent Observability: Next Steps
Beyond simple LLM calls, standardization for multi-agent system observability is also in development. The OpenTelemetry GenAI SIG is working on two parallel efforts:
1) AI Agent Application Conventions: A standard for tracking individual agent tasks, actions, and memory. Currently in experimental phases, standardizing agent-to-agent communication, state transitions, and memory management.
2) AI Agent Framework Conventions: Framework-specific instrumentation standardization for CrewAI, AutoGen, LangGraph, Semantic Kernel, and others. Since each framework has different agent architectures, suitable adapters are being developed for each.
There are two implementation approaches:
- Built-in: Native OTel support embedded in frameworks. CrewAI already supports native OpenTelemetry.
- External Instrumentation Library: Provided through external libraries like Traceloop and Langtrace.
In production environments, a hybrid approach combining both methods is practical.
## Cost Tracking: Optimizing LLM Costs with OpenTelemetry
```python
from opentelemetry import trace, metrics

tracer = trace.get_tracer("llm-cost-tracker")
meter = metrics.get_meter("llm-cost-metrics")

# Token pricing by model (2026 rates; illustrative, so verify against your
# provider's current price list)
PRICING = {
    "gpt-4o": {
        "input": 0.003 / 1000,   # $0.003 per 1K input tokens
        "output": 0.006 / 1000   # $0.006 per 1K output tokens
    },
    "gpt-4-turbo": {
        "input": 0.01 / 1000,
        "output": 0.03 / 1000
    },
    "claude-opus": {
        "input": 0.015 / 1000,
        "output": 0.075 / 1000
    }
}

# Cumulative cost counter
cost_counter = meter.create_counter(
    name="gen_ai.cost.total",
    description="Total cost of LLM API calls",
    unit="USD"
)

def calculate_and_record_cost(model: str, input_tokens: int, output_tokens: int):
    """Calculate and record cost based on token counts"""
    with tracer.start_as_current_span("cost_calculation") as span:
        if model not in PRICING:
            span.set_attribute("gen_ai.cost.error", "unknown_model")
            return 0.0

        pricing = PRICING[model]
        input_cost = input_tokens * pricing["input"]
        output_cost = output_tokens * pricing["output"]
        total_cost = input_cost + output_cost

        # Record metric
        cost_counter.add(
            total_cost,
            {
                "model": model,
                "operation": "chat"
            }
        )

        # Set span attributes
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("gen_ai.cost.input", input_cost)
        span.set_attribute("gen_ai.cost.output", output_cost)
        span.set_attribute("gen_ai.cost.total", total_cost)

        return total_cost

# Usage example
cost = calculate_and_record_cost("gpt-4o", 100, 50)
print(f"Cost: ${cost:.4f}")
```
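In practice the token counts come from the provider's response rather than hard-coded values. A sketch of the glue code, where `FakeUsage`/`FakeResponse` are hypothetical stand-ins shaped like the OpenAI SDK's response object (`usage.prompt_tokens` / `usage.completion_tokens`), and `plain_cost` is a local stand-in for a recorder like `calculate_and_record_cost` so the example runs on its own:

```python
# Hypothetical stand-ins for an OpenAI-style chat-completions response
class FakeUsage:
    prompt_tokens = 100
    completion_tokens = 50

class FakeResponse:
    model = "gpt-4o"
    usage = FakeUsage()

def record_cost_from_response(response, record_fn):
    """Extract model and token usage from the response and hand them to a recorder."""
    return record_fn(
        response.model,
        response.usage.prompt_tokens,
        response.usage.completion_tokens,
    )

# Stand-in recorder; in the real pipeline, pass calculate_and_record_cost instead
def plain_cost(model, input_tokens, output_tokens):
    pricing = {"gpt-4o": {"input": 0.003 / 1000, "output": 0.006 / 1000}}
    rates = pricing[model]
    return input_tokens * rates["input"] + output_tokens * rates["output"]

print(round(record_cost_from_response(FakeResponse(), plain_cost), 6))  # 0.0006
```

Wiring this into a thin client wrapper means every chat call records its own cost without touching application code.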
## Frequently Asked Questions
Q: I'm already using Langfuse. Do I need to migrate to OTel?
There's no immediate need to switch. However, once OTel GenAI conventions stabilize, most observability tools will support this standard, so a practical strategy is to start new projects with OTel and gradually migrate existing projects. In fact, Langfuse is publicly considering OTel integration.
Q: What's the current stability level of the conventions?
As of March 2026, most GenAI semantic conventions are in experimental status. For production adoption, the OTEL_SEMCONV_STABILITY_OPT_IN environment variable allows dual-emission of both legacy and new attribute names, maintaining compatibility during version transitions.
Q: Which LLM providers are supported?
The OpenAI Python SDK instrumentation is currently the most mature. Anthropic, Cohere, AWS Bedrock, and others are also supported through community libraries. If using LiteLLM, auto-tracing is possible through OpenAI-compatible interfaces. Even for providers not yet officially supported, custom spans can easily be added.
Q: What's the performance overhead in production environments?
OpenTelemetry is designed with asynchronous batch processing, so the impact on main application performance is less than 1%. Since LLM calls already take several seconds, the tracing overhead is negligible.
## Conclusion: A New Standard for LLM Observability
OpenTelemetry GenAI semantic conventions fundamentally solve the observability problem for LLM applications. Without vendor-specific attributes, you can track all AI-related metrics—tokens, costs, quality, hallucinations—in a standardized manner.
I recommend starting now. Build new projects with OTel-based architecture, and gradually migrate existing projects. OpenTelemetry is already a CNCF standard and will be supported by all observability tools in the future.
This article was written with AI technology assistance. For more cloud-native engineering insights, visit the ManoIT Tech Blog.