DEV Community

Saket

Observability in Agentic AI: What I Learned After Instrumenting a Real LLM Agent with OpenTelemetry

A hands-on walkthrough for AI architects who want visibility into tools, API calls, MCP servers, and model interactions—not just “did the API return 200?”


Introduction

If you ship traditional microservices, observability is a solved problem in principle: traces show request paths, metrics expose latency and error rates, and logs give you the narrative. Agentic AI breaks that mental model.

The same user question can produce different outputs. A single “chat turn” might trigger guardrail checks, session reads, multiple LLM calls, web search, custom tool execution, and follow-up reasoning loops. Failures are often subtle—slow tools, ballooning context windows, guardrails firing too often, or models that degrade under load rather than returning hard errors.

I recently worked through the SigNoz guide on observing LLM applications with OpenTelemetry and ran the OpenTelemetry NBA Agent demo end to end. This post is my architect’s take on what that exercise taught me—and how observability needs to span the entire agentic ecosystem, not just the model call at the center.


Why Agentic AI Demands a Different Observability Mindset

Three ideas from the SigNoz article stuck with me after running the demo myself:

1. Non-determinism is the default

The same prompt can yield different answers across runs. That makes regression testing harder and makes distributed traces more valuable than unit tests alone. You need to see what path the agent took, not just what text came back.

2. “Correct” is contextual

A one-word answer might be fine for “Will it rain tomorrow?” and unacceptable for “Should I rebalance my portfolio?” Observability helps you correlate guardrail decisions, tool usage, and response structure with user intent and topic.

3. The stack moves fast

Model updates, provider brownouts, new APIs (Responses vs Chat Completions), and evolving OpenTelemetry GenAI conventions all change behavior under you. You need baseline metrics before you can tell whether a deploy or a model swap made things better or worse.

OpenTelemetry gives you a vendor-neutral way to capture that picture. SigNoz (or any OTLP-compatible backend) is where you visualize it. The demo proves the plumbing works; production agentic systems need you to extend that plumbing across every integration point.


What I Built (and Ran)

The demo is intentionally small—a FastAPI app with one agent, one custom tool, one guardrail, and session-backed conversations. That restraint is a feature. It produces meaningfully agent-shaped telemetry without hiding the signal in product complexity.

| Component | Role |
| --- | --- |
| FastAPI | POST /agent/turn: one conversational turn per request |
| OpenAI Agents SDK | Orchestrates the agent loop, tools, and guardrails |
| WebSearchTool | External retrieval (high latency, high cost if abused) |
| calculate_win_percentage | Deterministic custom tool for stats questions |
| Input guardrail | Blocks off-topic queries (e.g., weather in Barcelona) |
| OpenAIConversationsSession | Server-managed context across turns |

Running it is straightforward once environment variables are set:

cd python/opentelemetry-llm-demo
pip install -r requirements.txt

OPENAI_API_KEY="<your-key>" \
OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.<region>.signoz.cloud:443" \
OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=<your-key>" \
OTEL_SERVICE_NAME="opentelemetry-llm-demo" \
OTEL_RESOURCE_ATTRIBUTES="service.version=0.1.0,deployment.environment=dev" \
OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
opentelemetry-instrument fastapi run --port 8085 --workers 1

I could not get a SigNoz free trial account because of regional limitations on their site. Since the demo works with any OTLP-compatible OpenTelemetry backend, I created a 15-day Grafana Labs free trial account and exported the telemetry there instead:

cd python/opentelemetry-llm-demo
pip install -r requirements.txt

OPENAI_API_KEY="<your-key>" \
OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-prod-ap-south-1.grafana.net/otlp" \
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <GrafanaLabstoken>" \
OTEL_SERVICE_NAME="opentelemetry-llm-demo" \
OTEL_RESOURCE_ATTRIBUTES="service.version=0.1.0,deployment.environment=dev" \
OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
opentelemetry-instrument fastapi run --port 8085 --workers 1

The important detail: opentelemetry-instrumentation-openai-agents-v2 is pinned manually in requirements.txt. opentelemetry-bootstrap does not yet discover the OpenAI Agents SDK. That’s a real-world snapshot of GenAI observability in 2026—powerful, but still maturing.


What a Good Trace Actually Looks Like

After sending a few turns—initial question, follow-up with session_id, a stats question that triggers the tool, and an off-topic question blocked by the guardrail—I knew what “healthy” vs “blocked” looked like in the trace UI.

A successful multi-turn flow

  1. POST /agent/turn — root HTTP span
  2. Session I/O — GET/POST spans for conversation fetch and persist
  3. invoke_agent — agent workflow container
  4. guardrail_check — with gen_ai.guardrail.triggered=false
  5. Model call span — model name, token usage, optional message content attributes
  6. Tool spans — when the agent calls calculate_win_percentage or web search

Follow-up requests reuse session_id; traces show the agent pulling prior context instead of starting cold. That context growth shows up in rising input token counts over a session—something worth dashboarding.
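
As a rough illustration of that dashboard signal, the sketch below computes turn-over-turn input-token deltas for a single session. The token counts are made up, and `token_growth` is a hypothetical helper rather than anything the demo ships.

```python
from typing import List


def token_growth(input_tokens_per_turn: List[int]) -> List[int]:
    """Return the change in input tokens between consecutive turns."""
    return [
        later - earlier
        for earlier, later in zip(input_tokens_per_turn, input_tokens_per_turn[1:])
    ]


# Hypothetical input-token counts for four turns in one session.
session_input_tokens = [120, 310, 540, 980]
print(token_growth(session_input_tokens))  # [190, 230, 440]
```

A delta that keeps widening, as in the last turn here, is the kind of pattern worth alerting on before the context window becomes the dominant cost.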

A guardrail block

Off-topic queries still produce useful telemetry:

  • guardrail_check with gen_ai.guardrail.triggered=true
  • Workflow stops before expensive model/tool work
  • FastAPI records an exception on the span via span.record_exception(exc)
  • API returns 400 with a clear message

That’s observability doing safety and cost control: you can measure how often guardrails fire, by topic, without paying for a full agent loop.
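
To make "measure how often guardrails fire" concrete, here is a minimal sketch that computes a trigger rate from exported spans. It assumes spans have already been pulled from the backend as plain dictionaries with `name` and `attributes` keys; that shape is an assumption for illustration, not a SigNoz or Grafana export format.

```python
from typing import Dict, Iterable


def guardrail_trigger_rate(spans: Iterable[Dict]) -> float:
    """Fraction of guardrail_check spans where the guardrail fired."""
    checks = [s for s in spans if s.get("name") == "guardrail_check"]
    if not checks:
        return 0.0
    fired = sum(
        1
        for s in checks
        if s.get("attributes", {}).get("gen_ai.guardrail.triggered") is True
    )
    return fired / len(checks)


# Hypothetical spans: four guardrail checks, one of which blocked the request.
sample_spans = [
    {"name": "guardrail_check", "attributes": {"gen_ai.guardrail.triggered": False}},
    {"name": "guardrail_check", "attributes": {"gen_ai.guardrail.triggered": False}},
    {"name": "guardrail_check", "attributes": {"gen_ai.guardrail.triggered": True}},
    {"name": "guardrail_check", "attributes": {"gen_ai.guardrail.triggered": False}},
    {"name": "invoke_agent", "attributes": {}},
]
print(guardrail_trigger_rate(sample_spans))  # 0.25
```

In practice the same ratio would usually be a query in the observability backend; the point is that the attribute alone is enough to compute it.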

Honest gaps I noticed

Several spans appeared as unknown—internal SDK paths not yet mapped to semantic span names. Child spans (e.g., guardrail_check) still tell the story. An open conformance effort for the Agents SDK should improve this. As an architect, plan for partial coverage today and richer conventions tomorrow—design dashboards around stable attributes (gen_ai.*) where possible.


Observability Across the Full Agentic Ecosystem

The NBA demo covers HTTP → agent → guardrail → LLM → tools → session store. Production agentic systems add more layers. Each needs the same treatment: trace context propagation, consistent span naming, and metrics that roll up to SLOs.

[Figure: Interactions between components in the agentic ecosystem]

LLM interactions

Every model call should emit:

  • Operation name, provider, and model (gen_ai.operation.name, gen_ai.system, gen_ai.request.model)
  • Input / output / total tokens (and cached tokens when available)
  • Duration (gen_ai.client.operation.duration)
  • Finish reason, error type, retry count
  • Optional: prompt/completion content (off in production via OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=0)

The demo surfaces usage in the API response as well—useful for debugging, but metrics and traces are the source of truth for trends:

{
  "usage": {
    "input_tokens": 123,
    "output_tokens": 45,
    "total_tokens": 168
  }
}

Tool usage

Tools are where agents become reliable—or expensive.

Instrument each tool with:

  • Span name: e.g., tool.calculate_win_percentage or tool.web_search
  • Arguments (sanitized), result size, success/failure
  • Latency and timeout events
  • Parent link back to the agent span

Compare tool latency vs model latency. In many agents, a slow external API dominates p95 while the LLM looks fine in isolation.

API calls (REST, GraphQL, internal services)

Web search in the demo is a stand-in for any external dependency. Treat each such dependency like a microservice call:

  • Propagate trace context in outbound headers
  • Record status codes, payload sizes, and retry behavior
  • Tag by dependency name for dependency maps
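
To show what "propagate trace context in outbound headers" means on the wire, the sketch below builds a W3C Trace Context `traceparent` header by hand using only the standard library. In a real service I would expect the OpenTelemetry propagator API (`opentelemetry.propagate.inject`) to do this; the helper name `outbound_headers` and the lowercase-key header dictionary are assumptions for illustration.

```python
import re
import secrets
from typing import Dict

# traceparent format: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def outbound_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Continue the caller's trace on an outbound call, or start a new one."""
    match = TRACEPARENT_RE.match(incoming.get("traceparent", ""))
    if match:
        # Keep the trace ID and sampling flags; only the span ID changes.
        trace_id, _parent_span_id, flags = match.groups()
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outbound_headers(incoming)["traceparent"])
```

The trace ID stays constant from the API request through every downstream call, which is what lets the backend stitch the spans into one trace.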

MCP server calls

Model Context Protocol (MCP) servers are becoming the standard way agents reach databases, repos, ticketing systems, and internal APIs. They deserve first-class spans:

| Signal | Why it matters |
| --- | --- |
| mcp.server.name, mcp.tool.name | Which integration failed or slowed down |
| Request/response size | Context bloat and cost |
| Duration | Often dominates agent turn time |
| Error taxonomy | Auth vs timeout vs schema mismatch |

Propagate the same trace ID from your API through the agent into MCP client calls. Without that, “the agent was slow” is impossible to decompose.
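
One way to get the error taxonomy from the table is to map exceptions to a small, fixed set of labels before recording them on the span. The mapping below is purely illustrative: real MCP client libraries raise their own exception types, so `classify_mcp_error` and its categories are assumptions you would adapt.

```python
from collections import Counter


def classify_mcp_error(exc: BaseException) -> str:
    """Map an exception to a coarse category suitable for a span attribute."""
    if isinstance(exc, PermissionError):
        return "auth"
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, (ValueError, KeyError, TypeError)):
        return "schema_mismatch"
    return "other"


# Hypothetical failures observed across a batch of MCP calls.
failures = [TimeoutError(), PermissionError(), TimeoutError(), KeyError("id")]
print(Counter(classify_mcp_error(e) for e in failures))
```

Keeping the label set small matters more than the exact mapping, because a low-cardinality attribute is what makes "errors by category" cheap to chart.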

Guardrails and policy layers

The demo’s keyword guardrail is simple; production systems use classifiers, PII detectors, and budget enforcers. Track:

  • Trigger rate by guardrail name
  • Time spent in guardrail evaluation
  • Block vs warn vs redact outcomes

Correlate guardrail blocks with saved tokens and avoided tool calls to justify the complexity to leadership.

Sessions and memory

OpenAIConversationsSession makes context explicit. For any memory layer (vector store, Redis, custom DB), observe:

  • Read/write latency per turn
  • Context size growth (tokens or bytes)
  • Cache hit rate for retrieved chunks

Runaway context is a leading cause of latency and cost spikes in long conversations.


Metrics Every AI Architect Should Track

Traces explain one request. Metrics tell you whether the system is healthy at scale. Below is a practical catalog—some exported automatically by GenAI instrumentation today, others you’ll derive or add custom instrumentation for.

Latency and streaming

| Metric | Definition | Why it matters |
| --- | --- | --- |
| Time to First Token (TTFT) | Request start → first streamed token | User-perceived responsiveness; spikes often indicate queueing or cold starts |
| Time per token / inter-token latency | (Total generation time − TTFT) / (output tokens − 1) | Detects “stuttery” streaming and provider throttling |
| End-to-end turn latency | HTTP request → final response | Your real SLO; includes tools, MCP, and multiple model rounds |
| Agent loop depth | Number of model invocations per turn | High depth → runaway reasoning or tool retry loops |
| gen_ai.client.operation.duration | Client-side duration of each model operation, in seconds | Compare across models and regions |
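
The TTFT and inter-token formulas in the table can be worked through with a small sketch. It assumes you have the arrival time of each streamed token; in practice providers stream chunks that may not map one-to-one to tokens, so treat the result as an approximation. `streaming_latency` is a hypothetical helper.

```python
from typing import List, Tuple


def streaming_latency(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean inter-token latency) in seconds.

    token_times holds the arrival time of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - request_start
    total_generation = token_times[-1] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # (total generation time - TTFT) / (output tokens - 1)
    inter_token = (total_generation - ttft) / (len(token_times) - 1)
    return ttft, inter_token


# Hypothetical timings: request at t=0.0 s, five tokens arriving 0.25 s apart.
print(streaming_latency(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))  # (0.5, 0.25)
```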

Tokens and cost

| Metric | Definition | Why it matters |
| --- | --- | --- |
| gen_ai.client.token.usage (input / output / total) | Tokens per operation | Direct driver of cost and context pressure |
| Cached input tokens | Tokens served from provider cache | Cost optimization signal (when exposed) |
| Cost per turn / per user / per session | tokens × price table | Finance and capacity planning |
| Token growth per session turn | Δ input tokens turn-over-turn | Early warning for context explosion |
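
The "tokens × price table" row is simple arithmetic, sketched here with the usage payload shown earlier. The model name and prices are placeholders I made up; substitute your provider's actual price sheet.

```python
from typing import Dict

# Hypothetical prices in USD per 1M tokens; not real provider pricing.
PRICE_PER_MILLION = {
    "example-model": {"input": 1.00, "output": 4.00},
}


def turn_cost_usd(model: str, usage: Dict[str, int]) -> float:
    """Estimate the cost of one turn from its token usage."""
    prices = PRICE_PER_MILLION[model]
    return (
        usage["input_tokens"] * prices["input"]
        + usage["output_tokens"] * prices["output"]
    ) / 1_000_000


usage = {"input_tokens": 123, "output_tokens": 45, "total_tokens": 168}
# (123 * 1.00 + 45 * 4.00) / 1,000,000 = 0.000303 USD
print(f"{turn_cost_usd('example-model', usage):.6f}")  # 0.000303
```

Summing this per session or per tenant gives the cost rollups in the table; cached-token discounts would need an extra price entry when the provider exposes that count.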

Reliability and quality proxies

| Metric | Definition | Why it matters |
| --- | --- | --- |
| Error rate by span kind | LLM vs tool vs MCP vs HTTP | Target fixes where failures cluster |
| Tool success rate & p95 latency | Per tool name | Fragile tools break agent reliability |
| Guardrail trigger rate | Blocks / total requests | Mis-tuned guardrails hurt UX; under-tuned ones hurt safety |
| Empty or truncated response rate | 502s, zero output tokens | Model or SDK integration issues |
| Provider availability / timeout rate | By model and region | Brownout detection |
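
For the per-tool success rate and p95 latency, a backend query is the usual route, but the calculation itself is small enough to sketch. The input shape (tool name, latency in milliseconds, success flag) and the nearest-rank percentile method are my assumptions; backends may interpolate percentiles differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def tool_stats(calls: Iterable[Tuple[str, float, bool]]) -> Dict[str, Dict[str, float]]:
    """Aggregate (tool name, latency in ms, succeeded) records per tool."""
    latencies: Dict[str, List[float]] = defaultdict(list)
    successes: Dict[str, int] = defaultdict(int)
    for name, latency_ms, ok in calls:
        latencies[name].append(latency_ms)
        successes[name] += int(ok)
    return {
        name: {"success_rate": successes[name] / len(vals), "p95_ms": p95(vals)}
        for name, vals in latencies.items()
    }


# Hypothetical data: 20 web_search calls at 100..2000 ms, with the slowest one failing.
calls = [("web_search", 100.0 * i, i != 20) for i in range(1, 21)]
print(tool_stats(calls))  # {'web_search': {'success_rate': 0.95, 'p95_ms': 1900.0}}
```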

Agent-specific operational metrics

| Metric | Definition | Why it matters |
| --- | --- | --- |
| Tool calls per turn | Count by tool | Cost and latency multipliers |
| MCP calls per turn | Count by server/tool | Integration hot paths |
| Parallel vs sequential tool time | Sum of tool spans vs wall clock | Opportunities to parallelize |
| Retry count | On LLM and downstream calls | Instability or aggressive client config |
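
The "sum of tool spans vs wall clock" comparison can be sketched by merging overlapping span intervals. The (start, end) tuples below are hypothetical span timings in seconds, and `tool_time_summary` is an illustrative helper rather than a backend feature.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def tool_time_summary(spans: List[Interval]) -> Tuple[float, float]:
    """Return (sum of span durations, wall-clock time covered by any span)."""
    total = sum(end - start for start, end in spans)
    wall = 0.0
    current_start: Optional[float] = None
    current_end: Optional[float] = None
    for start, end in sorted(spans):
        if current_end is None or start > current_end:
            # Gap before this span: close the previous merged interval.
            if current_end is not None:
                wall += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching span: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        wall += current_end - current_start
    return total, wall


print(tool_time_summary([(0.0, 2.0), (2.0, 3.0), (3.0, 4.5)]))  # (4.5, 4.5)
print(tool_time_summary([(0.0, 2.0), (0.5, 1.5), (1.0, 3.0)]))  # (5.0, 3.0)
```

When the two numbers are equal, the tools ran back to back, and any independent calls among them are candidates for parallel execution. When the sum exceeds the wall clock, some overlap already exists.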

Suggested dashboards

  1. Cost & usage — tokens and estimated spend by model, topic, tenant
  2. Latency — p50/p95 TTFT, turn latency, tool latency
  3. Reliability — error rates, guardrail blocks, tool failures
  4. Agent behavior — tool/MCP call mix, loop depth, session length

SigNoz’s OpenAI Python SDK dashboard template is a reasonable starting point; expect to extend it for Agents SDK, MCP, and custom tools.


GenAI Semantic Conventions: What I Internalized

OpenTelemetry uses gen_ai.* attributes under evolving GenAI semantic conventions. Treat them as a contract in motion:

  • gen_ai.agent.name — which agent ran
  • gen_ai.input.messages / gen_ai.output.messages — powerful for debug, risky for PII in production
  • gen_ai.guardrail.triggered — policy outcomes
  • gen_ai.client.token.usage — standardized token metrics
  • gen_ai.client.operation.duration — standardized latency metrics

Content capture: With session-backed conversations, gen_ai.input.messages may include prior turns, not just the latest user message. Disable capture in production:

export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=0

Instrumentation choice: The plain OpenAI Python SDK instrumentation lagged the Responses API; the demo uses OpenAI Agents SDK + opentelemetry-instrumentation-openai-agents-v2. Pick instrumentation that matches your API surface, or you’ll fly blind on the code path you actually run.


Lessons I’m Taking Into Production Designs

  1. Instrument at the agent boundary, not only at the LLM client. The demo’s value is seeing guardrails, sessions, tools, and model calls in one trace—not three disconnected logs.

  2. Treat tools and MCP like microservices. They need spans, timeouts, and RED metrics (rate, errors, duration). An agent is only as fast as its slowest dependency and only as reliable as its flakiest one.

  3. Measure cost and latency together. A “cheap” model that needs three extra tool rounds or a longer context window can cost more than a faster model with better cache behavior.

  4. Guardrails are observability events. Blocked requests should be visible, measurable, and cheap—exactly what the demo’s guardrail path demonstrates.

  5. Plan for convention churn. Manual dependency pins, occasional unknown spans, and missing cache metrics today are normal. Abstract your dashboards around stable concepts (turn, tool, token, cost), not one vendor’s span titles.

  6. Correlate logs and traces. Enabling OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true ties application logs to trace IDs—essential when debugging “the agent said something weird” incidents.


Try It Yourself

If you’re evaluating observability for an agentic product:

  1. Clone the SigNoz examples repo and run python/opentelemetry-llm-demo.
  2. Read the original SigNoz walkthrough for setup screenshots and trace UI context.
  3. Run four scenarios: first turn, follow-up with session_id, a tool question (“Raptors went 46-36—what’s their win percentage?”), and an off-topic guardrail block.
  4. In your backend, note span hierarchy, token metrics, and guardrail attributes.
  5. List your production integrations (APIs, MCP servers, memory stores) and mark which ones lack spans today—that’s your instrumentation backlog.

Closing Thoughts

Agentic AI isn’t “an LLM behind an API.” It’s a distributed system where the planner is probabilistic and the dependencies are heterogeneous. OpenTelemetry gives you a shared language for that system—traces for debugging individual turns, metrics for TTFT, token cost, tool reliability, and guardrail behavior at scale.

Running the SigNoz NBA agent demo didn’t just validate a tutorial. It clarified what I need to require from every agent service we design: propagated trace context, GenAI semantic attributes, tool and MCP spans, and dashboards that connect latency to tokens to dollars.

The conventions and SDK instrumentation will keep improving. The architectural imperative won’t: if you can’t see the full agent loop, you can’t operate it in production.


Have you instrumented MCP servers or multi-agent workflows yet? I’d be interested in what span attributes and metrics you’ve standardized on—drop a comment or reach out.
