A hands-on walkthrough for AI architects who want visibility into tools, API calls, MCP servers, and model interactions—not just “did the API return 200?”
Introduction
If you ship traditional microservices, observability is a solved problem in principle: traces show request paths, metrics expose latency and error rates, and logs give you the narrative. Agentic AI breaks that mental model.
The same user question can produce different outputs. A single “chat turn” might trigger guardrail checks, session reads, multiple LLM calls, web search, custom tool execution, and follow-up reasoning loops. Failures are often subtle—slow tools, ballooning context windows, guardrails firing too often, or models that degrade under load rather than returning hard errors.
I recently worked through the SigNoz guide on observing LLM applications with OpenTelemetry and ran the OpenTelemetry NBA Agent demo end to end. This post is my architect’s take on what that exercise taught me—and how observability needs to span the entire agentic ecosystem, not just the model call at the center.
Why Agentic AI Demands a Different Observability Mindset
Three ideas from the SigNoz article stuck with me after running the demo myself:
1. Non-determinism is the default
The same prompt can yield different answers across runs. That makes regression testing harder and makes distributed traces more valuable than unit tests alone. You need to see what path the agent took, not just what text came back.
2. “Correct” is contextual
A one-word answer might be fine for “Will it rain tomorrow?” and unacceptable for “Should I rebalance my portfolio?” Observability helps you correlate guardrail decisions, tool usage, and response structure with user intent and topic.
3. The stack moves fast
Model updates, provider brownouts, new APIs (Responses vs Chat Completions), and evolving OpenTelemetry GenAI conventions all change behavior under you. You need baseline metrics before you can tell whether a deploy or a model swap made things better or worse.
OpenTelemetry gives you a vendor-neutral way to capture that picture. SigNoz (or any OTLP-compatible backend) is where you visualize it. The demo proves the plumbing works; production agentic systems need you to extend that plumbing across every integration point.
What I Built (and Ran)
The demo is intentionally small—a FastAPI app with one agent, one custom tool, one guardrail, and session-backed conversations. That restraint is a feature. It produces meaningfully agent-shaped telemetry without hiding the signal in product complexity.
| Component | Role |
|---|---|
| FastAPI | POST /agent/turn: one conversational turn per request |
| OpenAI Agents SDK | Orchestrates the agent loop, tools, and guardrails |
| WebSearchTool | External retrieval (high latency, high cost if abused) |
| calculate_win_percentage | Deterministic custom tool for stats questions |
| Input guardrail | Blocks off-topic queries (e.g., weather in Barcelona) |
| OpenAIConversationsSession | Server-managed context across turns |
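The demo's tool source is not reproduced in this post, so here is a minimal sketch of what a deterministic tool like calculate_win_percentage might look like. The signature, the validation, and the one-decimal rounding are my assumptions, and the Agents SDK registration step is left out.

```python
def calculate_win_percentage(wins: int, losses: int) -> float:
    """Return wins as a percentage of games played, rounded to one decimal.

    Hypothetical body: the demo's real implementation may differ.
    """
    if wins < 0 or losses < 0:
        raise ValueError("wins and losses must be non-negative")
    games = wins + losses
    if games == 0:
        raise ValueError("no games played")
    return round(100 * wins / games, 1)


if __name__ == "__main__":
    # A 46-36 record is 46 wins out of 82 games.
    print(calculate_win_percentage(46, 36))
```

Because the result depends only on its inputs, a tool like this is a useful control in traces: its span latency should be near zero, so any slowness in a stats turn points at the model or session layers instead.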
Running it is straightforward once environment variables are set:
cd python/opentelemetry-llm-demo
pip install -r requirements.txt
OPENAI_API_KEY="<your-key>" \
OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.<region>.signoz.cloud:443" \
OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=<your-key>" \
OTEL_SERVICE_NAME="opentelemetry-llm-demo" \
OTEL_RESOURCE_ATTRIBUTES="service.version=0.1.0,deployment.environment=dev" \
OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
opentelemetry-instrument fastapi run --port 8085 --workers 1
I could not get a SigNoz free trial account because of regional limitations on their site. Since the demo exports standard OTLP, any OpenTelemetry-compatible backend works, so I created a 15-day Grafana Labs free trial account and exported the telemetry there instead:
cd python/opentelemetry-llm-demo
pip install -r requirements.txt
OPENAI_API_KEY="<your-key>" \
OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-prod-ap-south-1.grafana.net/otlp" \
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <GrafanaLabstoken>" \
OTEL_SERVICE_NAME="opentelemetry-llm-demo" \
OTEL_RESOURCE_ATTRIBUTES="service.version=0.1.0,deployment.environment=dev" \
OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
opentelemetry-instrument fastapi run --port 8085 --workers 1
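The Authorization value in the Grafana variant is a placeholder. As far as I understand Grafana Cloud's OTLP gateway, it expects HTTP Basic auth built from the stack's instance ID and an access token, and the Python exporter reads header values percent-encoded, so the space after Basic is written as %20. Treat both points as assumptions to check against your account's connection details; the function name and arguments below are mine.

```python
import base64


def otlp_basic_auth_header(instance_id: str, api_token: str) -> str:
    """Build a value for OTEL_EXPORTER_OTLP_HEADERS using HTTP Basic auth.

    Assumes credentials are "<instance_id>:<api_token>", base64-encoded.
    """
    credentials = f"{instance_id}:{api_token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    # "%20" stands in for the space between "Basic" and the credentials.
    return f"Authorization=Basic%20{encoded}"


if __name__ == "__main__":
    # Placeholder values; substitute your own instance ID and token.
    print(otlp_basic_auth_header("123456", "example-token"))
```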
The important detail: opentelemetry-instrumentation-openai-agents-v2 is pinned manually in requirements.txt. opentelemetry-bootstrap does not yet discover the OpenAI Agents SDK. That’s a real-world snapshot of GenAI observability in 2026—powerful, but still maturing.
What a Good Trace Actually Looks Like
After sending a few turns—initial question, follow-up with session_id, a stats question that triggers the tool, and an off-topic question blocked by the guardrail—I knew what “healthy” vs “blocked” looked like in the trace UI.
A successful multi-turn flow
- POST /agent/turn: root HTTP span
- Session I/O: GET/POST spans for conversation fetch and persist
- invoke_agent: agent workflow container
- guardrail_check: with gen_ai.guardrail.triggered=false
- Model call span: model name, token usage, optional message content attributes
- Tool spans: when the agent calls calculate_win_percentage or web search
Follow-up requests reuse session_id; traces show the agent pulling prior context instead of starting cold. That context growth shows up in rising input token counts over a session—something worth dashboarding.
A guardrail block
Off-topic queries still produce useful telemetry:
- guardrail_check with gen_ai.guardrail.triggered=true
- Workflow stops before expensive model/tool work
- FastAPI records an exception on the span via span.record_exception(exc)
- API returns 400 with a clear message
That’s observability doing safety and cost control: you can measure how often guardrails fire, by topic, without paying for a full agent loop.
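To make "how often guardrails fire, by topic" concrete, here is a rough sketch that computes the trigger rate from exported span records. The record shape (plain dicts with a name key) and the topic attribute are assumptions of mine; only guardrail_check and gen_ai.guardrail.triggered come from the traces described above.

```python
from collections import Counter


def guardrail_trigger_rate(spans: list[dict]) -> dict[str, float]:
    """Fraction of guardrail_check spans that triggered, keyed by topic."""
    totals = Counter()
    triggered = Counter()
    for span in spans:
        if span.get("name") != "guardrail_check":
            continue  # ignore agent, model, and tool spans
        topic = span.get("topic", "unknown")  # hypothetical custom attribute
        totals[topic] += 1
        if span.get("gen_ai.guardrail.triggered") is True:
            triggered[topic] += 1
    return {topic: triggered[topic] / totals[topic] for topic in totals}


if __name__ == "__main__":
    sample = [
        {"name": "guardrail_check", "topic": "nba", "gen_ai.guardrail.triggered": False},
        {"name": "guardrail_check", "topic": "weather", "gen_ai.guardrail.triggered": True},
    ]
    print(guardrail_trigger_rate(sample))
```

In a real backend this would be a query over span attributes instead of a Python loop; the sketch only pins down the ratio being measured.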
Honest gaps I noticed
Several spans appeared as unknown—internal SDK paths not yet mapped to semantic span names. Child spans (e.g., guardrail_check) still tell the story. An open conformance effort for the Agents SDK should improve this. As an architect, plan for partial coverage today and richer conventions tomorrow—design dashboards around stable attributes (gen_ai.*) where possible.
Observability Across the Full Agentic Ecosystem
The NBA demo covers HTTP → agent → guardrail → LLM → tools → session store. Production agentic systems add more layers. Each needs the same treatment: trace context propagation, consistent span naming, and metrics that roll up to SLOs.
LLM interactions
Every model call should emit:
- Operation name and provider (gen_ai.system, gen_ai.request.model)
- Input / output / total tokens (and cached tokens when available)
- Duration (gen_ai.client.operation.duration)
- Finish reason, error type, retry count
- Optional: prompt/completion content (off in production via OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=0)
The demo surfaces usage in the API response as well—useful for debugging, but metrics and traces are the source of truth for trends:
{
  "usage": {
    "input_tokens": 123,
    "output_tokens": 45,
    "total_tokens": 168
  }
}
Tool usage
Tools are where agents become reliable—or expensive.
Instrument each tool with:
- Span name: e.g., tool.calculate_win_percentage or tool.web_search
- Arguments (sanitized), result size, success/failure
- Latency and timeout events
- Parent link back to the agent span
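"Sanitized" is doing a lot of work in that list, so here is one possible shape for it: redact obviously sensitive keys and truncate long values before attaching arguments to a span. The key list, the length limit, and the function name are illustrative assumptions, not something the demo ships.

```python
# Assumed deny-list; extend it for your own tools.
SENSITIVE_KEYS = {"api_key", "authorization", "password", "token"}


def sanitize_tool_args(args: dict, max_len: int = 200) -> dict:
    """Return a copy of tool arguments that is safer to attach to a span."""
    clean = {}
    for key, value in args.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
            continue
        text = str(value)  # span attributes are stored as strings here
        if len(text) > max_len:
            text = text[:max_len] + "...[truncated]"
        clean[key] = text
    return clean


if __name__ == "__main__":
    print(sanitize_tool_args({"wins": 46, "losses": 36, "api_key": "secret"}))
```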
Compare tool latency vs model latency. In many agents, a slow external API dominates p95 while the LLM looks fine in isolation.
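One way to run that comparison offline is to group span durations by kind and compute p95 for each group. This is a sketch using the standard library; the kind and duration_ms keys are an assumed export shape, not a format the demo defines.

```python
from statistics import quantiles


def p95(durations_ms: list[float]) -> float:
    """95th percentile (inclusive method); needs at least two samples."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return quantiles(durations_ms, n=100, method="inclusive")[94]


def p95_by_kind(spans: list[dict]) -> dict[str, float]:
    """Group span durations by kind and return p95 per kind."""
    grouped: dict[str, list[float]] = {}
    for span in spans:
        grouped.setdefault(span["kind"], []).append(span["duration_ms"])
    # Kinds with a single sample are skipped because p95 is undefined there.
    return {kind: p95(values) for kind, values in grouped.items() if len(values) >= 2}


if __name__ == "__main__":
    sample = [
        {"kind": "llm", "duration_ms": 200.0},
        {"kind": "llm", "duration_ms": 300.0},
        {"kind": "tool", "duration_ms": 100.0},
        {"kind": "tool", "duration_ms": 900.0},
    ]
    print(p95_by_kind(sample))
```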
API calls (REST, GraphQL, internal services)
Web search in the demo is a stand-in for any external dependency. Treat them like microservice calls:
- Propagate trace context in outbound headers
- Record status codes, payload sizes, and retry behavior
- Tag by dependency name for dependency maps
MCP server calls
Model Context Protocol (MCP) servers are becoming the standard way agents reach databases, repos, ticketing systems, and internal APIs. They deserve first-class spans:
| Signal | Why it matters |
|---|---|
| mcp.server.name, mcp.tool.name | Which integration failed or slowed down |
| Request/response size | Context bloat and cost |
| Duration | Often dominates agent turn time |
| Error taxonomy | Auth vs timeout vs schema mismatch |
Propagate the same trace ID from your API through the agent into MCP client calls. Without that, “the agent was slow” is impossible to decompose.
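In practice OpenTelemetry's propagators handle this for you. To show what actually travels on the wire, here is a hand-rolled sketch of deriving a child W3C traceparent header for an outbound call: the trace ID is kept and a new span ID is minted. The function name is mine, and how the header reaches an MCP server depends on the transport you use.

```python
import re
import secrets

# Version-00 traceparent: version, 32-hex trace ID, 16-hex span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        raise ValueError("not a version-00 traceparent header")
    trace_id, _parent_span_id, flags = match.groups()
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{new_span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
```

Every hop that forwards a header like this shares one trace ID, which is what lets a backend stitch the API, agent, and MCP spans into a single waterfall.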
Guardrails and policy layers
The demo’s keyword guardrail is simple; production systems use classifiers, PII detectors, and budget enforcers. Track:
- Trigger rate by guardrail name
- Time spent in guardrail evaluation
- Block vs warn vs redact outcomes
Correlate guardrail blocks with saved tokens and avoided tool calls to justify the complexity to leadership.
Sessions and memory
OpenAIConversationsSession makes context explicit. For any memory layer (vector store, Redis, custom DB), observe:
- Read/write latency per turn
- Context size growth (tokens or bytes)
- Cache hit rate for retrieved chunks
Runaway context is a leading cause of latency and cost spikes in long conversations.
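A simple way to catch runaway context is to track the turn-over-turn change in input tokens within a session. The sketch below assumes you already have per-turn input token counts for one session; the 2000-token threshold is an arbitrary placeholder.

```python
def token_growth(input_tokens_per_turn: list[int]) -> list[int]:
    """Turn-over-turn change in input tokens for one session."""
    pairs = zip(input_tokens_per_turn, input_tokens_per_turn[1:])
    return [later - earlier for earlier, later in pairs]


def runaway_turns(input_tokens_per_turn: list[int], max_delta: int = 2000) -> list[int]:
    """0-based indexes of turns whose input grew by more than max_delta tokens."""
    deltas = token_growth(input_tokens_per_turn)
    # deltas[i] describes the jump into turn i + 1.
    return [i + 1 for i, delta in enumerate(deltas) if delta > max_delta]


if __name__ == "__main__":
    session = [123, 400, 3000]
    print(token_growth(session), runaway_turns(session))
```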
Metrics Every AI Architect Should Track
Traces explain one request. Metrics tell you whether the system is healthy at scale. Below is a practical catalog—some exported automatically by GenAI instrumentation today, others you’ll derive or add custom instrumentation for.
Latency and streaming
| Metric | Definition | Why it matters |
|---|---|---|
| Time to First Token (TTFT) | Request start → first streamed token | User-perceived responsiveness; spikes often indicate queueing or cold starts |
| Time per token / inter-token latency | (Total generation time − TTFT) / (output tokens − 1) | Detects “stuttery” streaming and provider throttling |
| End-to-end turn latency | HTTP request → final response | Your real SLO; includes tools, MCP, and multiple model rounds |
| Agent loop depth | Number of model invocations per turn | High depth → runaway reasoning or tool retry loops |
| gen_ai.client.operation.duration | Client-measured duration of each GenAI operation | Compare across models and regions |
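The first two rows translate directly into code. This sketch assumes you record one timestamp per streamed token (real streams often deliver several tokens per chunk, so treat the result as an approximation); the function name is mine.

```python
from typing import Optional


def streaming_latency(
    request_start: float, token_times: list[float]
) -> tuple[float, Optional[float]]:
    """Return (TTFT, mean inter-token latency) in the unit of the inputs."""
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None  # inter-token latency is undefined for one token
    total_generation = token_times[-1] - request_start
    # (total generation time - TTFT) / (output tokens - 1)
    inter_token = (total_generation - ttft) / (len(token_times) - 1)
    return ttft, inter_token


if __name__ == "__main__":
    print(streaming_latency(0.0, [0.5, 0.75, 1.0, 1.25]))
```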
Tokens and cost
| Metric | Definition | Why it matters |
|---|---|---|
| gen_ai.client.token.usage (input / output / total) | Tokens per operation | Direct driver of cost and context pressure |
| Cached input tokens | Tokens served from provider cache | Cost optimization signal (when exposed) |
| Cost per turn / per user / per session | tokens × price table | Finance and capacity planning |
| Token growth per session turn | Δ input tokens turn-over-turn | Early warning for context explosion |
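"tokens × price table" can be sketched in a few lines. The model name and prices below are made-up placeholders, not real provider pricing; the token counts reuse the usage figures from the sample API response.

```python
# Hypothetical USD prices per million tokens; replace with your provider's rates.
PRICE_PER_MILLION = {
    "example-model": {"input": 1.00, "output": 4.00},
}


def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of one model call."""
    price = PRICE_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    print(turn_cost("example-model", input_tokens=123, output_tokens=45))
```

Summing this over every model call in a trace gives cost per turn; grouping by a user or session attribute gives the other two roll-ups.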
Reliability and quality proxies
| Metric | Definition | Why it matters |
|---|---|---|
| Error rate by span kind | LLM vs tool vs MCP vs HTTP | Target fixes where failures cluster |
| Tool success rate & p95 latency | Per tool name | Fragile tools break agent reliability |
| Guardrail trigger rate | Blocks / total requests | Mis-tuned guardrails hurt UX; under-tuned ones hurt safety |
| Empty or truncated response rate | 502s, zero output tokens | Model or SDK integration issues |
| Provider availability / timeout rate | By model and region | Brownout detection |
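As a rough illustration of the first row, here is how error rate by span kind could be computed from exported span records. The kind and status keys are an assumed export shape; a real backend would express this as a grouped query.

```python
from collections import defaultdict


def error_rate_by_kind(spans: list[dict]) -> dict[str, float]:
    """Fraction of spans with status ERROR, grouped by span kind."""
    counts = defaultdict(lambda: [0, 0])  # kind -> [errors, total]
    for span in spans:
        bucket = counts[span["kind"]]
        bucket[1] += 1
        if span.get("status") == "ERROR":
            bucket[0] += 1
    return {kind: errors / total for kind, (errors, total) in counts.items()}


if __name__ == "__main__":
    sample = [
        {"kind": "llm", "status": "OK"},
        {"kind": "tool", "status": "ERROR"},
        {"kind": "tool", "status": "OK"},
    ]
    print(error_rate_by_kind(sample))
```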
Agent-specific operational metrics
| Metric | Definition | Why it matters |
|---|---|---|
| Tool calls per turn | Count by tool | Cost and latency multipliers |
| MCP calls per turn | Count by server/tool | Integration hot paths |
| Parallel vs sequential tool time | Sum of tool spans vs wall clock | Opportunities to parallelize |
| Retry count | On LLM and downstream calls | Instability or aggressive client config |
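The "sum of tool spans vs wall clock" comparison needs overlapping spans merged before measuring wall-clock time. A minimal sketch, assuming each tool span is reduced to a (start, end) pair in seconds:

```python
def tool_time_summary(intervals: list[tuple[float, float]]) -> dict[str, float]:
    """Compare summed tool span durations with the wall-clock time they cover."""
    summed = sum(end - start for start, end in intervals)
    wall_clock = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap found: close the merged interval built so far.
            if current_end is not None:
                wall_clock += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        wall_clock += current_end - current_start
    return {"summed": summed, "wall_clock": wall_clock}


if __name__ == "__main__":
    print(tool_time_summary([(0, 2), (1, 3), (5, 6)]))
```

When the two numbers are equal, the tools ran strictly one after another and parallelizing could help; the larger the summed value is relative to wall clock, the more overlap the agent already achieves.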
Suggested dashboards
- Cost & usage — tokens and estimated spend by model, topic, tenant
- Latency — p50/p95 TTFT, turn latency, tool latency
- Reliability — error rates, guardrail blocks, tool failures
- Agent behavior — tool/MCP call mix, loop depth, session length
SigNoz’s OpenAI Python SDK dashboard template is a reasonable starting point; expect to extend it for Agents SDK, MCP, and custom tools.
GenAI Semantic Conventions: What I Internalized
OpenTelemetry uses gen_ai.* attributes under evolving GenAI semantic conventions. Treat them as a contract in motion:
- gen_ai.agent.name: which agent ran
- gen_ai.input.messages / gen_ai.output.messages: powerful for debug, risky for PII in production
- gen_ai.guardrail.triggered: policy outcomes
- gen_ai.client.token.usage: standardized token metrics
- gen_ai.client.operation.duration: standardized latency metrics
Content capture: With session-backed conversations, gen_ai.input.messages may include prior turns, not just the latest user message. Disable capture in production:
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=0
Instrumentation choice: The plain OpenAI Python SDK instrumentation lagged the Responses API; the demo uses OpenAI Agents SDK + opentelemetry-instrumentation-openai-agents-v2. Pick instrumentation that matches your API surface, or you’ll fly blind on the code path you actually run.
Lessons I’m Taking Into Production Designs
Instrument at the agent boundary, not only at the LLM client. The demo’s value is seeing guardrails, sessions, tools, and model calls in one trace—not three disconnected logs.
Treat tools and MCP like microservices. They need spans, timeouts, and RED metrics (rate, errors, duration). An agent is only as reliable as its slowest dependency.
Measure cost and latency together. A “cheap” model that needs three extra tool rounds or a longer context window can cost more than a faster model with better cache behavior.
Guardrails are observability events. Blocked requests should be visible, measurable, and cheap—exactly what the demo’s guardrail path demonstrates.
Plan for convention churn. Manual dependency pins, occasional unknown spans, and missing cache metrics today are normal. Abstract your dashboards around stable concepts (turn, tool, token, cost), not one vendor’s span titles.
Correlate logs and traces. Enabling OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true ties application logs to trace IDs, which is essential when debugging “the agent said something weird” incidents.
Try It Yourself
If you’re evaluating observability for an agentic product:
- Clone the SigNoz examples repo and run python/opentelemetry-llm-demo.
- Read the original SigNoz walkthrough for setup screenshots and trace UI context.
- Run four scenarios: first turn, follow-up with session_id, a tool question (“Raptors went 46-36, what’s their win percentage?”), and an off-topic guardrail block.
- In your backend, note span hierarchy, token metrics, and guardrail attributes.
- List your production integrations (APIs, MCP servers, memory stores) and mark which ones lack spans today; that’s your instrumentation backlog.
Closing Thoughts
Agentic AI isn’t “an LLM behind an API.” It’s a distributed system where the planner is probabilistic and the dependencies are heterogeneous. OpenTelemetry gives you a shared language for that system—traces for debugging individual turns, metrics for TTFT, token cost, tool reliability, and guardrail behavior at scale.
Running the SigNoz NBA agent demo didn’t just validate a tutorial. It clarified what I need to require from every agent service we design: propagated trace context, GenAI semantic attributes, tool and MCP spans, and dashboards that connect latency to tokens to dollars.
The conventions and SDK instrumentation will keep improving. The architectural imperative won’t: if you can’t see the full agent loop, you can’t operate it in production.
Have you instrumented MCP servers or multi-agent workflows yet? I’d be interested in what span attributes and metrics you’ve standardized on—drop a comment or reach out.
