An LLM API call is a function: input goes in, output comes out, duration is bounded. An AI agent is a loop: it plans, executes tools, observes results, and decides what to do next — potentially for dozens of iterations. The loop is the thing that makes agents useful and the thing that makes them dangerous to run in production without observability.
Traditional LLM observability tracks individual model calls: token usage, latency, error rates, finish reasons. Agent observability tracks the behavior of the loop itself: how many iterations it runs, which tools it calls, how much it costs per session, whether it's making progress or spinning, and whether it stays within its defined boundaries.
If you run agents in production — coding assistants, customer support bots, SRE automation, data pipelines with LLM steps — you need both layers. This guide covers the agent-specific layer.
What makes agents different
An API call has a predictable cost ceiling: one prompt, one completion, one bill. An agent has none of these guarantees:
Unbounded iteration. An agent that encounters an error might retry the same failing approach indefinitely. A coding agent that misreads a test failure can loop through 50 edit-test cycles without making progress. Each iteration costs tokens.
Tool-call chains. Agents call external tools — database queries, API requests, file operations, web searches. Each tool call introduces latency, cost, and a new failure mode. A tool that returns unexpected output can send the agent down a completely wrong investigation path.
State accumulation. Each iteration adds to the agent's context window. After 15 turns of investigation, the agent is reasoning over 50,000+ tokens of accumulated context. Performance degrades, costs increase, and the risk of the agent "forgetting" early context grows.
Non-deterministic behavior. Two identical inputs to an agent can produce completely different tool-call sequences. One run might solve the problem in 3 turns; another might take 20. You can't predict execution cost or duration from the input alone.
The four pillars
1. Execution traces
Every agent run should produce a trace that shows the complete decision chain. The OpenTelemetry GenAI semantic conventions define a span for each kind of step, identified by its gen_ai.operation.name:
- invoke_agent: the root span for an agent session, carrying gen_ai.agent.name and gen_ai.agent.id
- chat: each LLM call within the session (the "thinking" step)
- execute_tool: each tool invocation, carrying gen_ai.tool.name and gen_ai.tool.type
The span tree looks like:
invoke_agent (sre-investigator, session-42)
├── chat claude-sonnet-4-20250514 [2.1s, 800 in / 200 out tokens]
│ → decided to check database metrics
├── execute_tool query_prometheus [0.8s]
│ → returned: connection_pool_usage = 94%
├── chat claude-sonnet-4-20250514 [1.8s, 1200 in / 350 out tokens]
│ → decided to check recent deploys
├── execute_tool list_recent_deploys [0.3s]
│ → returned: migration deployed 20min ago
├── chat claude-sonnet-4-20250514 [2.4s, 1800 in / 500 out tokens]
│ → conclusion: migration added N+1 query, saturating pool
└── [total: 7.4s, 3800 in / 1050 out tokens, $0.04]
Export these traces to Jaeger via the OTel Collector and you get a visual timeline of every decision the agent made, which tools it called, and how long each step took.
2. Tool-call auditing
Every tool call is a potential side effect. A coding agent that calls write_file is modifying your codebase. An SRE agent that calls restart_pod is modifying your infrastructure. Even read-only tools matter — an agent that calls query_database with a poorly constructed query can create load.
For each tool call, record:
- Tool name and type (gen_ai.tool.name, gen_ai.tool.type)
- Input arguments (what the agent asked the tool to do)
- Output (what the tool returned — or the error it threw)
- Duration (how long the tool took)
- Whether it was a read or write operation (custom attribute)
The audit trail serves two purposes: debugging (why did the agent do that?) and governance (was the agent authorized to call this tool with these arguments?). For write operations, consider requiring human approval before execution: the agent proposes the action, and a human confirms it.
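One way to capture those fields is a decorator around each tool function. This is a sketch under assumptions: `AUDIT_LOG`, `audited`, and the `is_write` key are hypothetical names, the log is an in-memory list rather than a durable store, and the tool body is a placeholder.

```python
import functools
import time

# Hypothetical in-memory sink; a real system would ship records to a log store.
AUDIT_LOG = []


def audited(tool_name, tool_type, is_write):
    """Wrap a tool function so every call appends one audit record."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            record = {
                "gen_ai.tool.name": tool_name,
                "gen_ai.tool.type": tool_type,
                "arguments": kwargs,
                "is_write": is_write,  # custom attribute, not part of the conventions
            }
            start = time.perf_counter()
            try:
                record["output"] = func(**kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)  # keep the failure in the trail
                raise
            finally:
                record["duration_s"] = time.perf_counter() - start
                AUDIT_LOG.append(record)
        return wrapper
    return decorator


@audited("list_recent_deploys", tool_type="function", is_write=False)
def list_recent_deploys(service):
    # Placeholder body; a real tool would query the deploy system.
    return [f"{service}: migration"]


list_recent_deploys(service="checkout")
```

Because the record is appended in `finally`, failed calls are audited too, which matters most for the calls you will later need to explain.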
3. Cost and token tracking
Agent cost tracking is harder than single-call cost tracking because costs accumulate across turns:
Session cost breakdown:
Turn 1: 800 input + 200 output = $0.008
Turn 2: 1,200 input + 350 output = $0.014
Turn 3: 1,800 input + 500 output = $0.022
Turn 4: 2,400 input + 300 output = $0.025
Turn 5: 3,100 input + 450 output = $0.034
─────────────────────────────────────────
Total: 9,300 input + 1,800 output = $0.103
Notice the pattern: input tokens grow with every turn because the agent accumulates context. By turn 20, you might be sending 20,000+ input tokens per turn. Because per-turn input grows roughly linearly with the turn number, the cumulative cost of a session grows roughly quadratically in the number of turns, not linearly.
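The arithmetic behind that growth can be checked with a few lines. The sketch below assumes the context grows by a fixed 600 tokens per turn starting from 800; both numbers are illustrative, chosen to be close to the breakdown above rather than taken from it.

```python
def cumulative_input_tokens(turns, first_turn=800, growth_per_turn=600):
    """Total input tokens billed across a session when context grows linearly.

    Turn k (1-based) sends first_turn + (k - 1) * growth_per_turn tokens,
    so n turns sum to n * first_turn + growth_per_turn * n * (n - 1) / 2.
    """
    return sum(first_turn + (k - 1) * growth_per_turn for k in range(1, turns + 1))


five = cumulative_input_tokens(5)      # 10,000 tokens
ten = cumulative_input_tokens(10)      # 35,000 tokens
twenty = cumulative_input_tokens(20)   # 130,000 tokens
```

Under these assumptions, doubling a session from 10 to 20 turns multiplies the billed input tokens by about 3.7, which is why long sessions dominate the bill.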
Track these metrics per session:
- Total tokens (input + output)
- Total cost (computed from provider pricing)
- Tokens per turn (watch for the growth curve)
- Turn count (how many iterations the agent ran)
- Cost per tool call (which tools are expensive?)
Set alerts on:
- Single session cost exceeding a threshold (e.g., $5)
- Daily aggregate cost exceeding a budget (e.g., $50)
- Average turns per session increasing week-over-week (may indicate the agent is becoming less efficient)
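These three rules can be sketched as a single check over a day's sessions. `SessionStats` and `check_alerts` are hypothetical names, the thresholds are the example values above, and comparing today's average against last week's is a simplification of a real week-over-week comparison.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    cost_usd: float
    turns: int


def check_alerts(sessions_today, avg_turns_last_week,
                 session_limit_usd=5.0, daily_budget_usd=50.0):
    """Return alert messages for per-session cost, daily cost, and turn drift."""
    alerts = []
    for index, session in enumerate(sessions_today):
        if session.cost_usd > session_limit_usd:
            alerts.append(f"session {index}: cost ${session.cost_usd:.2f} over limit")
    total = sum(s.cost_usd for s in sessions_today)
    if total > daily_budget_usd:
        alerts.append(f"daily spend ${total:.2f} over budget")
    if sessions_today:
        avg_turns = sum(s.turns for s in sessions_today) / len(sessions_today)
        if avg_turns > avg_turns_last_week:
            alerts.append(
                f"average turns rose from {avg_turns_last_week:.1f} to {avg_turns:.1f}"
            )
    return alerts


alerts = check_alerts(
    [SessionStats(cost_usd=6.0, turns=12), SessionStats(cost_usd=1.0, turns=8)],
    avg_turns_last_week=9.0,
)
```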
4. Safety boundary monitoring
Agents need boundaries. Without them, a misinterpreted instruction or a hallucinated tool call can cause real damage. Monitor these boundaries:
Turn budget. Cap the maximum number of iterations per session. When we run AI SRE investigations, we set a hard limit of 25 turns. If the agent hasn't resolved the investigation in 25 turns, it stops and hands off to a human. Track how often sessions hit the turn budget — a high hit rate means the budget is too low or the agent is struggling with certain problem types.
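A turn budget is a bounded loop with an explicit handoff result. In this sketch, `step` is a hypothetical callable standing in for one plan/act/observe iteration; it returns `None` while the work is still in progress.

```python
MAX_TURNS = 25  # the hard limit described above


def run_with_turn_budget(step, max_turns=MAX_TURNS):
    """Call step(turn) until it returns a result or the budget is spent."""
    for turn in range(1, max_turns + 1):
        result = step(turn)
        if result is not None:
            return {"status": "resolved", "turns": turn, "result": result}
    # Budget exhausted: stop and hand off instead of looping further.
    return {"status": "handoff_to_human", "turns": max_turns, "result": None}


solved = run_with_turn_budget(lambda turn: "root cause found" if turn == 3 else None)
stuck = run_with_turn_budget(lambda turn: None)
```

Counting how many sessions end with the handoff status gives the budget hit rate directly.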
Cost circuit breaker. Set a daily spend limit across all agent sessions. If total spend exceeds the limit, new sessions queue for human approval instead of auto-launching. Track circuit-breaker activation frequency.
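A minimal sketch of that breaker, assuming a single process and an in-memory queue; `CostCircuitBreaker` and its method names are hypothetical, and the daily reset would need to be called by a scheduler.

```python
class CostCircuitBreaker:
    """Queue new sessions for approval once daily spend exceeds the limit."""

    def __init__(self, daily_limit_usd):
        self.daily_limit_usd = daily_limit_usd
        self.spent_usd = 0.0
        self.activations = 0        # how often the breaker blocked a launch
        self.pending_approval = []  # session ids waiting for a human

    def record_spend(self, usd):
        self.spent_usd += usd

    def request_session(self, session_id):
        """Return 'launch' if under the limit, otherwise queue the session."""
        if self.spent_usd > self.daily_limit_usd:
            self.activations += 1
            self.pending_approval.append(session_id)
            return "queued"
        return "launch"

    def reset_for_new_day(self):
        self.spent_usd = 0.0


breaker = CostCircuitBreaker(daily_limit_usd=10.0)
first = breaker.request_session("s1")   # nothing spent yet
breaker.record_spend(11.0)              # s1 turned out to be expensive
second = breaker.request_session("s2")  # now over the limit
```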
Tool allowlist. Define which tools the agent can call and with what argument patterns. A coding agent should be able to read files but maybe not delete directories. An SRE agent should be able to query metrics but maybe not restart production services. Log every tool call that was attempted but blocked by the allowlist.
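An allowlist with argument patterns can be as small as a dictionary of regular expressions. The policy below is hypothetical: the tool names, the `src/` restriction, and the `BLOCKED_CALLS` list are illustrative choices, not a recommended production policy.

```python
import re

# Hypothetical policy: tool name -> regex that each argument must fully match.
ALLOWLIST = {
    # Files under src/ only, and no ".." path segments.
    "read_file": {"path": r"src/(?!.*\.\.)[\w./-]+"},
    "query_prometheus": {"query": r".{1,500}"},
}

BLOCKED_CALLS = []  # every attempted-but-blocked call is logged here


def is_allowed(tool, args):
    """Return True if the tool is listed and every argument matches its pattern."""
    patterns = ALLOWLIST.get(tool)
    allowed = (
        patterns is not None
        and set(args) == set(patterns)
        and all(re.fullmatch(patterns[k], str(v)) for k, v in args.items())
    )
    if not allowed:
        BLOCKED_CALLS.append({"tool": tool, "arguments": args})
    return allowed


ok = is_allowed("read_file", {"path": "src/app/main.py"})
traversal = is_allowed("read_file", {"path": "src/../etc/passwd"})
unlisted = is_allowed("delete_directory", {"path": "src"})
```

The check denies by default: an unlisted tool, an unexpected argument name, or a value that fails its pattern are all blocked and recorded.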
Output guardrails. If the agent produces user-facing output, run it through the same safety filters you use for direct LLM calls. Track guardrail violation rates per agent type.
Getting started
If you're running agents today with no observability:
Step 1: Add session-level cost tracking. Wrap your agent loop with a counter that sums input and output tokens across turns. Log the total at session end. Set an alert on daily cost. This takes 30 minutes and catches the most expensive failure mode (runaway loops).
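A minimal version of that counter might look like the following. `SessionCostTracker` is a hypothetical name, and the per-million-token prices are placeholders rather than any provider's actual pricing; substitute the rates for the model you use.

```python
class SessionCostTracker:
    """Sum input and output tokens across turns and price them at session end."""

    def __init__(self, input_usd_per_mtok, output_usd_per_mtok):
        self.input_usd_per_mtok = input_usd_per_mtok
        self.output_usd_per_mtok = output_usd_per_mtok
        self.input_tokens = 0
        self.output_tokens = 0
        self.turns = 0

    def add_turn(self, input_tokens, output_tokens):
        """Call once per LLM response, using the usage the provider reports."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.turns += 1

    @property
    def cost_usd(self):
        return (
            self.input_tokens * self.input_usd_per_mtok
            + self.output_tokens * self.output_usd_per_mtok
        ) / 1_000_000


# Placeholder prices in USD per million tokens.
tracker = SessionCostTracker(input_usd_per_mtok=3.0, output_usd_per_mtok=15.0)
tracker.add_turn(input_tokens=1000, output_tokens=100)
tracker.add_turn(input_tokens=2000, output_tokens=100)
summary = {"turns": tracker.turns, "cost_usd": tracker.cost_usd}
```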
Step 2: Add OTel auto-instrumentation. Install the OTel instrumentation for your LLM provider (opentelemetry-instrumentation-openai, opentelemetry-instrumentation-anthropic). This gives you per-call spans automatically. Export to your existing tracing backend.
Step 3: Add custom spans for tool calls. Wrap each tool invocation in a span with gen_ai.tool.name and the tool's input/output as attributes. This completes the execution trace.
Step 4: Add boundary monitoring. Implement turn budgets and cost circuit breakers. Track how often they activate. Tune the thresholds based on real session data.
The investment is modest — a few hours of instrumentation work — and the payoff is the difference between "our agent ran up a $200 bill overnight" and "our agent hit its $10 circuit breaker, queued the session, and we reviewed it in the morning."
Monitor the infrastructure your agents depend on — model provider endpoints, vector databases, tool APIs — with external checks at app.devhelm.io. When an agent session fails because the OpenAI API is returning 503s, you want to know it's a provider issue before you start debugging your agent logic.
Originally published on DevHelm.