Saket

Posted on Jun 28

Observability in Agentic AI: What I Learned After Instrumenting a Real LLM Agent with OpenTelemetry

#ai #observability #agenticai #llm

A hands-on walkthrough for AI architects who want visibility into tools, API calls, MCP servers, and model interactions—not just “did the API return 200?”

Introduction

If you ship traditional microservices, observability is a solved problem in principle: traces show request paths, metrics expose latency and error rates, and logs give you the narrative. Agentic AI breaks that mental model.

The same user question can produce different outputs. A single “chat turn” might trigger guardrail checks, session reads, multiple LLM calls, web search, custom tool execution, and follow-up reasoning loops. Failures are often subtle—slow tools, ballooning context windows, guardrails firing too often, or models that degrade under load rather than returning hard errors.

I recently worked through the SigNoz guide on observing LLM applications with OpenTelemetry and ran the OpenTelemetry NBA Agent demo end to end. This post is my architect’s take on what that exercise taught me—and how observability needs to span the entire agentic ecosystem, not just the model call at the center.

Why Agentic AI Demands a Different Observability Mindset

Three ideas from the SigNoz article stuck with me after running the demo myself:

1. Non-determinism is the default

The same prompt can yield different answers across runs. That makes regression testing harder and makes distributed traces more valuable than unit tests alone. You need to see what path the agent took, not just what text came back.

2. “Correct” is contextual

A one-word answer might be fine for “Will it rain tomorrow?” and unacceptable for “Should I rebalance my portfolio?” Observability helps you correlate guardrail decisions, tool usage, and response structure with user intent and topic.

3. The stack moves fast

Model updates, provider brownouts, new APIs (Responses vs Chat Completions), and evolving OpenTelemetry GenAI conventions all change behavior under you. You need baseline metrics before you can tell whether a deploy or a model swap made things better or worse.

OpenTelemetry gives you a vendor-neutral way to capture that picture. SigNoz (or any OTLP-compatible backend) is where you visualize it. The demo proves the plumbing works; production agentic systems need you to extend that plumbing across every integration point.

What I Built (and Ran)

The demo is intentionally small—a FastAPI app with one agent, one custom tool, one guardrail, and session-backed conversations. That restraint is a feature. It produces meaningfully agent-shaped telemetry without hiding the signal in product complexity.

Component	Role
FastAPI	`POST /agent/turn` — one conversational turn per request
OpenAI Agents SDK	Orchestrates the agent loop, tools, and guardrails
WebSearchTool	External retrieval (high latency, high cost if abused)
`calculate_win_percentage`	Deterministic custom tool for stats questions
Input guardrail	Blocks off-topic queries (e.g., weather in Barcelona)
OpenAIConversationsSession	Server-managed context across turns

Running it is straightforward once environment variables are set:

cd python/opentelemetry-llm-demo
pip install -r requirements.txt

OPENAI_API_KEY="<your-key>" \
OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.<region>.signoz.cloud:443" \
OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=<your-key>" \
OTEL_SERVICE_NAME="opentelemetry-llm-demo" \
OTEL_RESOURCE_ATTRIBUTES="service.version=0.1.0,deployment.environment=dev" \
OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
opentelemetry-instrument fastapi run --port 8085 --workers 1

I could not get a SigNoz free trial account because of regional limitations on their site. Since the demo exports over OTLP and works with any OpenTelemetry-compatible backend, I created a 15-day Grafana Labs free trial account and used that to receive the telemetry instead:

cd python/opentelemetry-llm-demo
pip install -r requirements.txt

OPENAI_API_KEY="<your-key>" \
OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-prod-ap-south-1.grafana.net/otlp" \
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <GrafanaLabstoken>" \
OTEL_SERVICE_NAME="opentelemetry-llm-demo" \
OTEL_RESOURCE_ATTRIBUTES="service.version=0.1.0,deployment.environment=dev" \
OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
opentelemetry-instrument fastapi run --port 8085 --workers 1

The important detail: opentelemetry-instrumentation-openai-agents-v2 is pinned manually in requirements.txt. opentelemetry-bootstrap does not yet discover the OpenAI Agents SDK. That’s a real-world snapshot of GenAI observability in 2026—powerful, but still maturing.

What a Good Trace Actually Looks Like

After sending a few turns—initial question, follow-up with session_id, a stats question that triggers the tool, and an off-topic question blocked by the guardrail—I knew what “healthy” vs “blocked” looked like in the trace UI.

A successful multi-turn flow

POST /agent/turn — root HTTP span
Session I/O — GET/POST spans for conversation fetch and persist
invoke_agent — agent workflow container
guardrail_check — with gen_ai.guardrail.triggered=false
Model call span — model name, token usage, optional message content attributes
Tool spans — when the agent calls calculate_win_percentage or web search

Follow-up requests reuse session_id; traces show the agent pulling prior context instead of starting cold. That context growth shows up in rising input token counts over a session—something worth dashboarding.

A guardrail block

Off-topic queries still produce useful telemetry:

guardrail_check with gen_ai.guardrail.triggered=true
Workflow stops before expensive model/tool work
FastAPI records an exception on the span via span.record_exception(exc)
API returns 400 with a clear message

That’s observability doing safety and cost control: you can measure how often guardrails fire, by topic, without paying for a full agent loop.

Honest gaps I noticed

Several spans appeared as unknown—internal SDK paths not yet mapped to semantic span names. Child spans (e.g., guardrail_check) still tell the story. An open conformance effort for the Agents SDK should improve this. As an architect, plan for partial coverage today and richer conventions tomorrow—design dashboards around stable attributes (gen_ai.*) where possible.

Observability Across the Full Agentic Ecosystem

The NBA demo covers HTTP → agent → guardrail → LLM → tools → session store. Production agentic systems add more layers. Each needs the same treatment: trace context propagation, consistent span naming, and metrics that roll up to SLOs.

LLM interactions

Every model call should emit:

Operation name, provider, and model (gen_ai.operation.name, gen_ai.system, gen_ai.request.model)
Input / output / total tokens (and cached tokens when available)
Duration (gen_ai.client.operation.duration)
Finish reason, error type, retry count
Optional: prompt/completion content (off in production via OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=0)

The demo surfaces usage in the API response as well—useful for debugging, but metrics and traces are the source of truth for trends:

{
  "usage": {
    "input_tokens": 123,
    "output_tokens": 45,
    "total_tokens": 168
  }
}

Tool usage

Tools are where agents become reliable—or expensive.

Instrument each tool with:

Span name: e.g., tool.calculate_win_percentage or tool.web_search
Arguments (sanitized), result size, success/failure
Latency and timeout events
Parent link back to the agent span

Compare tool latency vs model latency. In many agents, a slow external API dominates p95 while the LLM looks fine in isolation.
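To make the checklist above concrete, here is a minimal, standard-library-only sketch of a tool wrapper that records latency, outcome, and result size. It is not the demo's instrumentation: `instrumented_tool`, `ToolCallRecord`, and `RECORDS` are hypothetical names, and the body of `calculate_win_percentage` is my own stand-in for the demo's tool. Arguments are deliberately not recorded here; in a real system you would sanitize them first and attach them to an OpenTelemetry span rather than an in-memory list.

```python
import functools
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolCallRecord:
    name: str                 # span-style name, e.g. "tool.calculate_win_percentage"
    duration_s: float         # wall-clock latency of the call
    success: bool
    result_size: int          # rough size of the result (length of its repr)
    error_type: Optional[str] = None


# Hypothetical in-memory sink; a real system would emit spans or metrics instead.
RECORDS: List[ToolCallRecord] = []


def instrumented_tool(func):
    """Record latency, outcome, and result size for every call to `func`."""
    name = f"tool.{func.__name__}"

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            elapsed = time.perf_counter() - start
            RECORDS.append(ToolCallRecord(name, elapsed, False, 0, type(exc).__name__))
            raise  # never swallow the failure; the agent still needs to see it
        elapsed = time.perf_counter() - start
        RECORDS.append(ToolCallRecord(name, elapsed, True, len(repr(result))))
        return result

    return wrapper


@instrumented_tool
def calculate_win_percentage(wins: int, losses: int) -> float:
    # Stand-in implementation, assumed for illustration only.
    return round(100 * wins / (wins + losses), 1)
```

Calling `calculate_win_percentage(46, 36)` appends one successful record, while a call that raises (for example with zero games played) appends a failed record carrying the exception type and then re-raises.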

API calls (REST, GraphQL, internal services)

Web search in the demo is a stand-in for any external dependency. Treat them like microservice calls:

Propagate trace context in outbound headers
Record status codes, payload sizes, and retry behavior
Tag by dependency name for dependency maps

MCP server calls

Model Context Protocol (MCP) servers are becoming the standard way agents reach databases, repos, ticketing systems, and internal APIs. They deserve first-class spans:

Signal	Why it matters
`mcp.server.name`, `mcp.tool.name`	Which integration failed or slowed down
Request/response size	Context bloat and cost
Duration	Often dominates agent turn time
Error taxonomy	Auth vs timeout vs schema mismatch

Propagate the same trace ID from your API through the agent into MCP client calls. Without that, “the agent was slow” is impossible to decompose.
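What "the same trace ID" means on the wire can be sketched with the W3C Trace Context `traceparent` header: the trace ID stays fixed while each hop gets a fresh span ID. The helper below is a hand-rolled illustration using only the standard library, and `child_traceparent` is a hypothetical name. In practice I would expect the OpenTelemetry propagation API (for example `opentelemetry.propagate.inject`) to do this for you, and whether a given MCP transport forwards the header is something to verify for your setup.

```python
import re
import secrets

# Version 00 traceparent: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(parent: str) -> str:
    """Build the traceparent for an outbound call that continues the parent's trace."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    # Same trace id and sampling flags, new span id for this hop.
    return f"00-{trace_id}-{new_span_id}-{flags}"


# Example: header received by the API, forwarded on the call to an MCP server.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outbound_headers = {"traceparent": child_traceparent(incoming)}
```

Because every hop shares the trace ID, the backend can stitch the API span, the agent span, and the MCP call into one tree.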

Guardrails and policy layers

The demo’s keyword guardrail is simple; production systems use classifiers, PII detectors, and budget enforcers. Track:

Trigger rate by guardrail name
Time spent in guardrail evaluation
Block vs warn vs redact outcomes

Correlate guardrail blocks with saved tokens and avoided tool calls to justify the complexity to leadership.
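As a rough illustration of the first and third items, the sketch below computes a trigger rate per guardrail from a list of evaluation events. The event shape (a guardrail name plus one of `pass`, `block`, `warn`, `redact`) and the function name `guardrail_trigger_rates` are assumptions for this example; in a real pipeline the events would come from span attributes or a counter metric.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def guardrail_trigger_rates(events: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return triggered / total per guardrail name.

    Each event is (guardrail_name, outcome); any outcome other than
    "pass" counts as a trigger (block, warn, or redact).
    """
    totals: Counter = Counter()
    triggered: Counter = Counter()
    for name, outcome in events:
        totals[name] += 1
        if outcome != "pass":
            triggered[name] += 1
    return {name: triggered[name] / totals[name] for name in totals}


# Hypothetical sample: a topic guardrail and a PII guardrail.
sample_events = [
    ("topic", "pass"),
    ("topic", "block"),
    ("pii", "pass"),
    ("topic", "pass"),
    ("pii", "redact"),
]
rates = guardrail_trigger_rates(sample_events)
```

Splitting the rate by outcome (block vs warn vs redact) is a small extension: key the counters on `(name, outcome)` instead of `name`.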

Sessions and memory

OpenAIConversationsSession makes context explicit. For any memory layer (vector store, Redis, custom DB), observe:

Read/write latency per turn
Context size growth (tokens or bytes)
Cache hit rate for retrieved chunks

Runaway context is a leading cause of latency and cost spikes in long conversations.
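One way to catch runaway context early is to look at the turn-over-turn change in input tokens within a session. The sketch below assumes you already have the per-turn input token counts (for example from the usage field or the token usage metric); the function names and the threshold are hypothetical and would need tuning against your own baseline.

```python
from typing import List


def token_growth(input_tokens_per_turn: List[int]) -> List[int]:
    """Delta in input tokens between consecutive turns of one session."""
    return [
        later - earlier
        for earlier, later in zip(input_tokens_per_turn, input_tokens_per_turn[1:])
    ]


def runaway_turns(input_tokens_per_turn: List[int], max_delta: int) -> List[int]:
    """1-based turn numbers whose input grew by more than max_delta tokens."""
    deltas = token_growth(input_tokens_per_turn)
    # deltas[0] describes the jump into turn 2, hence the +2 offset.
    return [index + 2 for index, delta in enumerate(deltas) if delta > max_delta]


# Hypothetical session: the fourth turn pulls in a large tool result.
session = [120, 310, 520, 1900]
deltas = token_growth(session)          # [190, 210, 1380]
flagged = runaway_turns(session, 500)   # [4]
```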

Metrics Every AI Architect Should Track

Traces explain one request. Metrics tell you whether the system is healthy at scale. Below is a practical catalog—some exported automatically by GenAI instrumentation today, others you’ll derive or add custom instrumentation for.

Latency and streaming

Metric	Definition	Why it matters
Time to First Token (TTFT)	Request start → first streamed token	User-perceived responsiveness; spikes often indicate queueing or cold starts
Time per token / inter-token latency	(Total generation time − TTFT) / (output tokens − 1)	Detects “stuttery” streaming and provider throttling
End-to-end turn latency	HTTP request → final response	Your real SLO; includes tools, MCP, and multiple model rounds
Agent loop depth	Number of model invocations per turn	High depth → runaway reasoning or tool retry loops
`gen_ai.client.operation.duration`	Client-measured duration of each GenAI operation	Compare across models and regions
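The TTFT and inter-token definitions in the table can be computed directly from stream timestamps. This is a small sketch under one loud assumption: each streamed chunk is treated as exactly one output token, which real providers do not guarantee. The function name `streaming_latency` is my own.

```python
from typing import List, Tuple


def streaming_latency(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (ttft, inter_token_latency) in the same unit as the inputs.

    request_start: time the request was sent.
    token_times:   arrival time of each streamed token, in order
                   (assumes one token per chunk).
    """
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - request_start
    total_generation_time = token_times[-1] - request_start
    output_tokens = len(token_times)
    if output_tokens == 1:
        return ttft, 0.0  # a single token has no gaps to average
    # (Total generation time - TTFT) / (output tokens - 1), as in the table.
    inter_token = (total_generation_time - ttft) / (output_tokens - 1)
    return ttft, inter_token


# Hypothetical stream: first token after 0.5 s, then one token every 0.1 s.
ttft, inter_token = streaming_latency(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
```

For this sample the result is a TTFT of 0.5 s and an inter-token latency of roughly 0.1 s.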

Tokens and cost

Metric	Definition	Why it matters
`gen_ai.client.token.usage` (input / output / total)	Tokens per operation	Direct driver of cost and context pressure
Cached input tokens	Tokens served from provider cache	Cost optimization signal (when exposed)
Cost per turn / per user / per session	tokens × price table	Finance and capacity planning
Token growth per session turn	Δ input tokens turn-over-turn	Early warning for context explosion
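The "tokens × price table" row is simple arithmetic, sketched below. The model name and prices are placeholders I made up, not real provider pricing; substitute your own price table and extend it if your provider bills cached input tokens at a different rate.

```python
from typing import Dict

# Hypothetical prices in dollars per one million tokens. Not real provider pricing.
PRICE_PER_MILLION: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 2.00, "output": 8.00},
}


def turn_cost(model: str, input_tokens: int, output_tokens: int,
              prices: Dict[str, Dict[str, float]] = PRICE_PER_MILLION) -> float:
    """Estimated dollar cost of one turn from its token usage."""
    price = prices[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Using the usage numbers from the sample API response earlier in the post.
cost = turn_cost("example-model", input_tokens=123, output_tokens=45)
```

With these placeholder prices the sample turn costs about $0.000606; summing `turn_cost` over a session or tenant gives the per-session and per-user views.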

Reliability and quality proxies

Metric	Definition	Why it matters
Error rate by span kind	LLM vs tool vs MCP vs HTTP	Target fixes where failures cluster
Tool success rate & p95 latency	Per tool name	Fragile tools break agent reliability
Guardrail trigger rate	Blocks / total requests	Mis-tuned guardrails hurt UX; under-tuned ones hurt safety
Empty or truncated response rate	502s, zero output tokens	Model or SDK integration issues
Provider availability / timeout rate	By model and region	Brownout detection

Agent-specific operational metrics

Metric	Definition	Why it matters
Tool calls per turn	Count by tool	Cost and latency multipliers
MCP calls per turn	Count by server/tool	Integration hot paths
Parallel vs sequential tool time	Sum of tool spans vs wall clock	Opportunities to parallelize
Retry count	On LLM and downstream calls	Instability or aggressive client config
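The "sum of tool spans vs wall clock" comparison can be sketched by merging overlapping span intervals. The input shape (a list of start and end times for the tool spans in one turn) and the name `tool_time_summary` are assumptions for illustration; in practice the intervals would come from your trace backend.

```python
from typing import List, Tuple


def tool_time_summary(spans: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (summed_span_time, wall_clock_time) for the tool spans of one turn.

    summed_span_time adds every span's duration; wall_clock_time counts
    overlapping stretches only once by merging intervals.
    """
    summed = sum(end - start for start, end in spans)
    wall = 0.0
    current_start = current_end = None
    for start, end in sorted(spans):
        if current_end is None or start > current_end:
            # Gap before this span: close out the previous merged interval.
            if current_end is not None:
                wall += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the current merged interval: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        wall += current_end - current_start
    return summed, wall


# Hypothetical turn: two overlapping tool calls, then a third on its own.
summed, wall = tool_time_summary([(0.0, 2.0), (1.0, 3.0), (5.0, 6.0)])
```

Here the spans sum to 5.0 s but cover only 4.0 s of wall clock, so some overlap already exists. When the two numbers are equal, every tool call ran sequentially, which is where parallelizing independent calls is worth a look.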

Suggested dashboards

Cost & usage — tokens and estimated spend by model, topic, tenant
Latency — p50/p95 TTFT, turn latency, tool latency
Reliability — error rates, guardrail blocks, tool failures
Agent behavior — tool/MCP call mix, loop depth, session length

SigNoz’s OpenAI Python SDK dashboard template is a reasonable starting point; expect to extend it for Agents SDK, MCP, and custom tools.

GenAI Semantic Conventions: What I Internalized

OpenTelemetry groups GenAI attribute and metric names under the gen_ai.* namespace, and the GenAI semantic conventions behind them are still evolving. Treat them as a contract in motion:

gen_ai.agent.name — which agent ran
gen_ai.input.messages / gen_ai.output.messages — powerful for debug, risky for PII in production
gen_ai.guardrail.triggered — policy outcomes
gen_ai.client.token.usage — standardized token metrics
gen_ai.client.operation.duration — standardized latency metrics

Content capture: With session-backed conversations, gen_ai.input.messages may include prior turns, not just the latest user message. Disable capture in production:

export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=0

Instrumentation choice: The plain OpenAI Python SDK instrumentation lagged the Responses API; the demo uses OpenAI Agents SDK + opentelemetry-instrumentation-openai-agents-v2. Pick instrumentation that matches your API surface, or you’ll fly blind on the code path you actually run.

Lessons I’m Taking Into Production Designs

Instrument at the agent boundary, not only at the LLM client. The demo’s value is seeing guardrails, sessions, tools, and model calls in one trace—not three disconnected logs.
Treat tools and MCP like microservices. They need spans, timeouts, and RED metrics (rate, errors, duration). An agent is only as fast as its slowest dependency and only as reliable as its least reliable one.
Measure cost and latency together. A “cheap” model that needs three extra tool rounds or a longer context window can cost more than a faster model with better cache behavior.
Guardrails are observability events. Blocked requests should be visible, measurable, and cheap—exactly what the demo’s guardrail path demonstrates.
Plan for convention churn. Manual dependency pins, occasional unknown spans, and missing cache metrics today are normal. Abstract your dashboards around stable concepts (turn, tool, token, cost), not one vendor’s span titles.
Correlate logs and traces. Enabling OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true ties application logs to trace IDs—essential when debugging “the agent said something weird” incidents.

Try It Yourself

If you’re evaluating observability for an agentic product:

Clone the SigNoz examples repo and run python/opentelemetry-llm-demo.
Read the original SigNoz walkthrough for setup screenshots and trace UI context.
Run four scenarios: first turn, follow-up with session_id, a tool question (“Raptors went 46-36—what’s their win percentage?”), and an off-topic guardrail block.
In your backend, note span hierarchy, token metrics, and guardrail attributes.
List your production integrations (APIs, MCP servers, memory stores) and mark which ones lack spans today—that’s your instrumentation backlog.

Closing Thoughts

Agentic AI isn’t “an LLM behind an API.” It’s a distributed system where the planner is probabilistic and the dependencies are heterogeneous. OpenTelemetry gives you a shared language for that system—traces for debugging individual turns, metrics for TTFT, token cost, tool reliability, and guardrail behavior at scale.

Running the SigNoz NBA agent demo didn’t just validate a tutorial. It clarified what I need to require from every agent service we design: propagated trace context, GenAI semantic attributes, tool and MCP spans, and dashboards that connect latency to tokens to dollars.

The conventions and SDK instrumentation will keep improving. The architectural imperative won’t: if you can’t see the full agent loop, you can’t operate it in production.

Have you instrumented MCP servers or multi-agent workflows yet? I’d be interested in what span attributes and metrics you’ve standardized on—drop a comment or reach out.
