# LLM Observability: Making Blackbox AI Transparent
As LLM-based applications are being deployed at full scale in production, traditional APM (Application Performance Monitoring) alone is no longer sufficient to understand how AI systems operate. Metrics like API response time, error rate, and CPU usage remain important, but AI systems require a completely different set of questions and measurements.
We need specialized LLM metrics: "Which models were called with which prompts, and what responses were generated?", "How many tokens were consumed?", "Are responses consistent?", "Where are hallucinations occurring?" The problem is that the current ecosystem is severely fragmented. Tools like Langfuse, Helicone, Traceloop, and LangSmith each use incompatible, proprietary tracing formats, locking users in to a single vendor.
OpenTelemetry GenAI Semantic Conventions were created specifically to solve this fragmentation problem.
## The Structure of OpenTelemetry GenAI Semantic Conventions
Developed by the GenAI SIG (Special Interest Group) of OpenTelemetry since April 2024, this standard unifies attribute names, types, and enumeration values for LLM calls, agent steps, vector database queries, token usage, cost tracking, and quality metrics (hallucination indicators, scores, etc.).
As of March 2026, most GenAI semantic conventions are still in experimental status, meaning the attribute set isn't fully stabilized yet. Major observability vendors have nonetheless started supporting it: Datadog began native support with OTel semantic conventions v1.37, and the Grafana stack can now ingest LLM telemetry (traces in Tempo, logs in Loki).
| Domain | Attribute | Type | Description |
|---|---|---|---|
| Model calls | `gen_ai.operation.name` | string | Operation type of the LLM request (e.g., "chat", "embedding") |
| Model calls | `gen_ai.request.model` | string | Model name used (e.g., "gpt-4o", "claude-opus") |
| Token tracking | `gen_ai.usage.input_tokens` | int | Number of input tokens |
| Token tracking | `gen_ai.usage.output_tokens` | int | Number of output tokens |
| Agent | `gen_ai.agent.name` | string | AI agent identifier |
| Agent | `gen_ai.agent.description` | string | Agent role description |
| Tool calls | `gen_ai.tool.name` | string | Name of the external tool used by the agent |
| Tool calls | `gen_ai.tool.type` | enum | Tool type (function, web_search, database, etc.) |
| Quality metrics | `gen_ai.*.score` | double | Response quality score (0–1) |
| Quality metrics | `gen_ai.*.cost` | double | Cost per request (USD) |
## Hands-On Implementation: Applying LLM Tracing with Python
### Step 1: Install OpenTelemetry Packages
```shell
# Core OpenTelemetry packages
pip install opentelemetry-api opentelemetry-sdk

# Provider-specific instrumentation
pip install opentelemetry-instrumentation-openai \
            opentelemetry-instrumentation-anthropic \
            opentelemetry-instrumentation-langchain

# Exporters (choose based on your backend)
pip install opentelemetry-exporter-otlp        # OTLP: Jaeger, Datadog, Grafana Tempo
pip install opentelemetry-exporter-prometheus
# Note: the dedicated Jaeger exporter is deprecated; modern Jaeger ingests OTLP directly

# Optional: context propagation (for microservices)
pip install opentelemetry-propagators-jaeger
```
### Step 2: Configure Tracing and Auto-Instrumentation
```python
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
# Requires opentelemetry-instrumentation-requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure TracerProvider
trace_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True
)
trace_provider = TracerProvider()
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)

# Configure MeterProvider (metrics)
metric_exporter = OTLPMetricExporter(
    endpoint="http://localhost:4317",
    insecure=True
)
metric_reader = PeriodicExportingMetricReader(metric_exporter)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Enable auto-instrumentation for OpenAI
OpenAIInstrumentor().instrument()

# Also trace HTTP requests (external API calls)
RequestsInstrumentor().instrument()

# Now OpenAI calls are automatically traced
import openai

client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# Auto-generated span attributes:
#   gen_ai.operation.name: "chat"
#   gen_ai.request.model: "gpt-4o"
#   gen_ai.usage.input_tokens: 15
#   gen_ai.usage.output_tokens: 8
#   gen_ai.response.finish_reason: "stop"
```
💡 Pro Tip: OpenTelemetry exporters support both gRPC (port 4317) and HTTP (port 4318). In environments with strict firewall policies, using HTTP is simpler.
### Step 3: Add Custom Agent Spans
```python
from opentelemetry import trace
import time

tracer = trace.get_tracer("my-ai-agent")

def execute_agent_task(task_description: str):
    """Function for an AI agent to execute a task"""
    # Top-level span: agent execution
    with tracer.start_as_current_span("agent_execution") as agent_span:
        agent_span.set_attribute("gen_ai.agent.name", "research-bot")
        agent_span.set_attribute("gen_ai.agent.description", "Web research and analysis")
        agent_span.set_attribute("gen_ai.operation.name", "agent_plan_and_execute")

        # Step 1: Planning
        with tracer.start_as_current_span("plan_step") as plan_span:
            plan_span.set_attribute("gen_ai.operation.name", "chat")
            plan_span.set_attribute("gen_ai.request.model", "gpt-4o")
            # Actual LLM call is handled by auto-instrumentation

        # Step 2: Web search tool call
        with tracer.start_as_current_span("tool_call") as tool_span:
            tool_span.set_attribute("gen_ai.tool.name", "web_search")
            tool_span.set_attribute("gen_ai.tool.type", "function")
            tool_span.set_attribute("gen_ai.tool.description", "Search the web for information")
            # Execute tool
            search_results = perform_web_search(task_description)
            time.sleep(0.5)  # Simulate actual API latency
            tool_span.set_attribute("gen_ai.tool.call.status", "success")
            tool_span.set_attribute("gen_ai.tool.output.count", len(search_results))

        # Step 3: Database query tool
        with tracer.start_as_current_span("tool_call") as db_span:
            db_span.set_attribute("gen_ai.tool.name", "database_lookup")
            db_span.set_attribute("gen_ai.tool.type", "database")

        # Step 4: Generate final response
        with tracer.start_as_current_span("generate_response") as response_span:
            response_span.set_attribute("gen_ai.operation.name", "chat")
            response_span.set_attribute("gen_ai.request.model", "gpt-4o")

        # Agent execution summary
        agent_span.set_attribute("gen_ai.agent.status", "completed")
        agent_span.set_attribute("gen_ai.agent.steps_executed", 4)

def perform_web_search(query: str):
    """Web search simulation"""
    return [
        {"title": "Result 1", "url": "https://example.com/1"},
        {"title": "Result 2", "url": "https://example.com/2"}
    ]

# Execute
execute_agent_task("How has climate change impacted global agriculture?")
```
### Step 4: Hallucination Detection Metrics
```python
from opentelemetry import trace, metrics

tracer = trace.get_tracer("hallucination-detector")
meter = metrics.get_meter("llm-quality-metrics")

# Hallucination detection counter
hallucination_counter = meter.create_counter(
    name="gen_ai.hallucination.detected",
    description="Number of hallucinations detected in LLM responses",
    unit="1"
)

# Response quality histogram
quality_histogram = meter.create_histogram(
    name="gen_ai.response.quality",
    description="Quality score of LLM responses",
    unit="1"
)

def verify_llm_response(response: str, fact_check_fn) -> dict:
    """Verify an LLM response and determine whether a hallucination occurred"""
    with tracer.start_as_current_span("verify_response") as span:
        span.set_attribute("gen_ai.operation.name", "verify")
        span.set_attribute("gen_ai.response.length", len(response))

        # Perform fact checking
        fact_check_result = fact_check_fn(response)
        quality_score = fact_check_result.get("quality_score", 0.5)
        has_hallucination = fact_check_result.get("has_hallucination", False)
        confidence = fact_check_result.get("confidence", 0.0)

        # Record metrics
        if has_hallucination:
            hallucination_counter.add(1, {"model": "gpt-4o"})
        quality_histogram.record(quality_score, {"model": "gpt-4o"})

        # Set span attributes
        span.set_attribute("gen_ai.response.quality.score", quality_score)
        span.set_attribute("gen_ai.response.hallucination_detected", has_hallucination)
        span.set_attribute("gen_ai.response.confidence", confidence)

        return fact_check_result

# Usage example
response = "Paris is the capital of France and has a population of about 2 million people."
result = verify_llm_response(response, lambda r: {
    "quality_score": 0.95,
    "has_hallucination": False,
    "confidence": 0.98
})
```
⚠️ Important: GenAI semantic conventions do not capture prompt and response content by default. This is a design principle to prevent personally identifiable information (PII) leakage. If content logging is required, explicitly enable it with the `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true` environment variable.
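Content capture has to be opted into before the instrumentor initializes; one way is to set the flag from Python itself. A minimal sketch:

```python
import os

# Opt in to prompt/response capture BEFORE calling OpenAIInstrumentor().instrument();
# leave this unset in production if prompts may contain PII
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"

capture_enabled = os.environ.get(
    "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT", "false"
).lower() == "true"
print(capture_enabled)  # True
```

In containerized deployments the same flag is usually set in the pod or service environment rather than in code.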
## Building an AI Dashboard with Grafana + OTel Collector
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 100
  # Note: the attributes processor does not support wildcard keys such as
  # gen_ai.*; each action targets one key (use the transform processor for
  # pattern-based rewrites)
  attributes:
    actions:
      - key: gen_ai.request.model
        action: upsert
        value: "unknown"  # default when a span arrives without a model attribute

exporters:
  # Prometheus: export metrics
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    headers:
      Authorization: Bearer ${env:PROMETHEUS_TOKEN}
  # Tempo: store traces
  otlp:
    endpoint: http://tempo:4317
    tls:
      insecure: true
  # Loki: logs and text-based data (requires the basicauth extension declared
  # under `extensions:`; deprecated in newer Collector releases in favor of
  # sending OTLP logs to Loki directly)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    auth:
      authenticator: basicauth

connectors:
  # Derive request-rate/duration metrics from LLM spans
  spanmetrics:
    dimensions:
      - name: gen_ai.request.model
      - name: gen_ai.agent.name

service:
  pipelines:
    # Trace pipeline: track LLM calls (also feeds the spanmetrics connector)
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp, spanmetrics]
    # Metrics pipeline: token count, cost, response time
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheusremotewrite]
    # Logs pipeline: errors, hallucination detection
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```
With this configuration:
- LLM call traces → Tempo/Jaeger (distributed tracing)
- Token usage, cost, response time → Prometheus (metrics)
- Prompts, responses, error logs → Loki (logs)
All unified in a Grafana dashboard for visualization.
## AI Agent Observability: Next Steps
Beyond simple LLM calls, standardization for multi-agent system observability is also in development. The OpenTelemetry GenAI SIG is working on two parallel efforts:
1) AI Agent Application Conventions: A standard for tracking individual agent tasks, actions, and memory. Currently in experimental phases, standardizing agent-to-agent communication, state transitions, and memory management.
2) AI Agent Framework Conventions: Framework-specific instrumentation standardization for CrewAI, AutoGen, LangGraph, Semantic Kernel, and others. Since each framework has different agent architectures, suitable adapters are being developed for each.
There are two implementation approaches:
- Built-in: Native OTel support embedded in frameworks. CrewAI already supports native OpenTelemetry.
- External Instrumentation Library: Provided through external libraries like Traceloop and Langtrace.
In production environments, a hybrid approach combining both methods is practical.
## Cost Tracking: Optimizing LLM Costs with OpenTelemetry
```python
from opentelemetry import trace, metrics

tracer = trace.get_tracer("llm-cost-tracker")
meter = metrics.get_meter("llm-cost-metrics")

# Token pricing by model (2026 rates; illustrative, so verify against your
# provider's current price list)
PRICING = {
    "gpt-4o": {
        "input": 0.003 / 1000,   # $0.003 per 1K input tokens
        "output": 0.006 / 1000   # $0.006 per 1K output tokens
    },
    "gpt-4-turbo": {
        "input": 0.01 / 1000,
        "output": 0.03 / 1000
    },
    "claude-opus": {
        "input": 0.015 / 1000,
        "output": 0.075 / 1000
    }
}

# Cumulative cost counter
cost_counter = meter.create_counter(
    name="gen_ai.cost.total",
    description="Total cost of LLM API calls",
    unit="USD"
)

def calculate_and_record_cost(model: str, input_tokens: int, output_tokens: int):
    """Calculate and record cost based on token counts"""
    with tracer.start_as_current_span("cost_calculation") as span:
        if model not in PRICING:
            span.set_attribute("gen_ai.cost.error", "unknown_model")
            return 0.0

        pricing = PRICING[model]
        input_cost = input_tokens * pricing["input"]
        output_cost = output_tokens * pricing["output"]
        total_cost = input_cost + output_cost

        # Record metric
        cost_counter.add(
            total_cost,
            {
                "model": model,
                "operation": "chat"
            }
        )

        # Set span attributes
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("gen_ai.cost.input", input_cost)
        span.set_attribute("gen_ai.cost.output", output_cost)
        span.set_attribute("gen_ai.cost.total", total_cost)

        return total_cost

# Usage example
cost = calculate_and_record_cost("gpt-4o", 100, 50)
print(f"Cost: ${cost:.4f}")
```
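In practice the token counts come from the provider's response rather than hard-coded values. A sketch of the glue code, where `FakeUsage`/`FakeResponse` are hypothetical stand-ins shaped like the OpenAI SDK's response object (`usage.prompt_tokens` / `usage.completion_tokens`), and `plain_cost` is a local stand-in for a recorder like `calculate_and_record_cost` so the example runs on its own:

```python
# Hypothetical stand-ins for an OpenAI-style chat-completions response
class FakeUsage:
    prompt_tokens = 100
    completion_tokens = 50

class FakeResponse:
    model = "gpt-4o"
    usage = FakeUsage()

def record_cost_from_response(response, record_fn):
    """Extract model and token usage from the response and hand them to a recorder."""
    return record_fn(
        response.model,
        response.usage.prompt_tokens,
        response.usage.completion_tokens,
    )

# Stand-in recorder; in the real pipeline, pass calculate_and_record_cost instead
def plain_cost(model, input_tokens, output_tokens):
    pricing = {"gpt-4o": {"input": 0.003 / 1000, "output": 0.006 / 1000}}
    rates = pricing[model]
    return input_tokens * rates["input"] + output_tokens * rates["output"]

print(round(record_cost_from_response(FakeResponse(), plain_cost), 6))  # 0.0006
```

Wiring this into a thin client wrapper means every chat call records its own cost without touching application code.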
## Frequently Asked Questions
Q: I'm already using Langfuse. Do I need to migrate to OTel?
There's no immediate need to switch. However, once OTel GenAI conventions stabilize, most observability tools will support this standard, so a practical strategy is to start new projects with OTel and gradually migrate existing projects. In fact, Langfuse is publicly considering OTel integration.
Q: What's the current stability level of the conventions?
As of March 2026, most GenAI semantic conventions are in experimental status. For production adoption, the OTEL_SEMCONV_STABILITY_OPT_IN environment variable allows dual-emission of both legacy and new attribute names, maintaining compatibility during version transitions.
Q: Which LLM providers are supported?
The OpenAI Python SDK instrumentation is currently the most mature. Anthropic, Cohere, AWS Bedrock, and others are also supported through community libraries. If using LiteLLM, auto-tracing is possible through OpenAI-compatible interfaces. Even for providers not yet officially supported, custom spans can easily be added.
Q: What's the performance overhead in production environments?
OpenTelemetry is designed with asynchronous batch processing, so the impact on main application performance is less than 1%. Since LLM calls already take several seconds, the tracing overhead is negligible.
## Conclusion: A New Standard for LLM Observability
OpenTelemetry GenAI semantic conventions fundamentally solve the observability problem for LLM applications. Without vendor-specific attributes, you can track all AI-related metrics—tokens, costs, quality, hallucinations—in a standardized manner.
I recommend starting now. Build new projects with OTel-based architecture, and gradually migrate existing projects. OpenTelemetry is already a CNCF standard and will be supported by all observability tools in the future.
This article was written with AI technology assistance. For more cloud-native engineering insights, visit the ManoIT Tech Blog.