Engineers love shipping agents… right up until the first production incident.
Databricks found that tool-calling accuracy can swing by as much as 10 percent on parts of BFCL just by changing generation settings like temperature. That's a friendly reminder that agents can behave "correctly" one day and drift the next, even when nothing obvious changes.
A tool-calling agent can return the "correct" final answer while doing five retries behind the scenes, calling the wrong tool twice, and quietly burning your budget. Or it can fail, recover, and still look "fine" if you only judge it by the last message it prints.
That is why agent observability matters.
Agent observability is the ability to understand, measure, and debug an agent's decisions over time, not just its final answer.
It means you can answer questions like:
- Why did the agent pick that tool?
- Where did the first wrong decision come from?
- Did it loop, retry, or deviate from the plan?
- Why did this run cost 10x more than usual?
That matters because agents don't behave like classic chatbots. A chatbot usually responds in one shot: prompt in, answer out. An agent runs a multi-step loop: plan → act → observe → repeat. "Act" often refers to using tools such as search, databases, internal APIs, code execution, ticketing systems, or workflows that interact with real production data.
Without observability, you're essentially shipping a distributed workflow without telemetry. You might know something went wrong, but not whether it was a flaky tool, bad parameters, a misread tool response, or an agent that simply didn't stop when it should have.
In this blog, you'll learn a practical baseline you can apply to almost any tool-using agent. We'll start with tracing (one trace per run), then move to three core metrics: loop rate, error rate, and cost per successful task (with p95 latency as the honesty check). Finally, you'll add guardrails that stop silent failures early and prevent runaway spend.
What "observability" means for agents (and what it isn't)
Traditional observability is often framed as logs, metrics, and traces. In distributed systems, traces show the end-to-end path of a request, and spans are the timed operations along that path.
Agent observability builds on the same foundations as classic logs/metrics/traces, but it must capture more than service timings. You need visibility into what the agent decided, which tools it called and what came back, how its state changed after each observation, and why the run stopped. Without these four pieces, a "bad run" looks the same as a "good run" in a dashboard: you see the final output, but not the workflow that produced it.
Here is what agent observability is not:
- Not just prompt logging. Prompts alone do not explain tool failures, retries, or plan drift.
- Not only vendor dashboards. Dashboards can help, but you still need run-level traceability that maps to your workflow.
- Not "Collect everything and hope." Raw telemetry without structure becomes expensive noise. You want a consistent model of "one run" and "one step."
A helpful way to stay disciplined is to define the minimum unit of analysis:
One trace = one agent run, from input (goal) to terminal outcome (success, failure, or safe stop)
This is your "single source of truth" for debugging. Then you layer metrics on top.
- Qualitative signals (trace review) tell you why it happened.
- Quantitative signals (metrics) tell you how often it happened and whether it is getting worse.
Traces: the backbone of agent observability
What does a good agent trace contain?
Instead of raw transcripts, capture a structured timeline of events such as the user goal, each tool invocation, and the stop condition, so the trace stays focused on meaningful data.
At a minimum, capture these events per run:
- User goal: what the user asked for (or the job input)
- System constraints: policies and boundaries (what tools are allowed, budget caps)
- Plan / next step decision: what the agent intends to do next
- Tool invoked: tool name + version
- Tool parameters: log shape and key fields, but protect secrets and PII
- Tool result summary: short structured summary (success, error type, key outputs)
- Stop condition: why the agent ended (completed, max steps, budget hit, escalation)
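As a sketch, the events above can be captured with a minimal trace model. The class, field, and event names here are illustrative, not a standard schema; a real system would likely emit these as OpenTelemetry spans instead.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any

# Minimal trace model: one trace per run, one event per step.
@dataclass
class TraceEvent:
    kind: str            # e.g. "goal", "plan", "tool_call", "tool_result", "stop"
    payload: dict
    ts: float = field(default_factory=time.time)

@dataclass
class AgentTrace:
    run_id: str
    events: list = field(default_factory=list)

    def log(self, kind: str, **payload: Any) -> None:
        self.events.append(TraceEvent(kind, payload))

    def to_json(self) -> str:
        # Serialize the whole run for storage or later trace review.
        return json.dumps(asdict(self), default=str)

# One run, start to terminal outcome.
trace = AgentTrace(run_id="run-001")
trace.log("goal", text="Find the latest invoice for customer 42")
trace.log("tool_call", tool="invoice_search", version="v2", params={"customer_id": 42})
trace.log("tool_result", tool="invoice_search", status="success", n_results=1)
trace.log("stop", reason="completed", steps=2)
```

The key design choice is that the trace is the unit of storage: everything about one run lives under one `run_id`, so debugging never requires stitching logs together.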
You also want state tracking, because many agent failures are really state failures. The agent might be working with stale evidence, drop a vital constraint, or keep reusing the same assumption even after a tool returns new information. In practice, state tracking is just recording two things consistently: what the agent believes it knows (its current evidence/working memory) and what changed after each tool result (new evidence added, assumptions removed, plan updated).
That gives you enough context to spot common issues like "plan drift," repeated tool calls, or looping on outdated evidence.
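Recording the state change can be as simple as diffing the evidence dictionary before and after each tool result. This is a minimal sketch, assuming the agent's working memory is a flat dict:

```python
# Diff the agent's evidence before and after a tool result.
# If several consecutive deltas are empty, the agent is looping on stale state.
def state_delta(before: dict, after: dict) -> dict:
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = sorted(before.keys() - after.keys())
    changed = {k: after[k] for k in before.keys() & after.keys()
               if before[k] != after[k]}
    return {"added": added, "removed": removed, "changed": changed}

before = {"customer_id": 42, "invoice": None}
after = {"customer_id": 42, "invoice": "INV-1009", "currency": "EUR"}
print(state_delta(before, after))
```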
Finally, track error behavior explicitly:
- Timeouts and retries
- Fallback tool choices
- Escalation decisions (handoff to a human, safe failure response)
One practical tip: treat tool arguments and tool results as sensitive by default. Redact aggressively and only allow safe attributes through. OpenTelemetry, for example, documents redaction approaches for sensitive data.
How to read traces to find root causes
Traces are only valuable if they help you debug faster.
A simple debugging workflow that works in practice:
- Locate the first wrong decision. The final answer is often "wrong late," but the root cause is "wrong early."
- Inspect the tool result quality. Was the tool output empty, stale, partial, or erroring?
- Check interpretation. Did the agent misread a correct tool response?
- Check stopping behavior. Did it fail to stop when it had enough evidence?
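The first step of this workflow can be partially automated: scan the trace for the earliest suspicious event. The event shape below is an assumption (a list of dicts with `kind`, `status`, and `n_results` fields), not a standard format.

```python
from typing import Optional

# Return the index of the first suspicious step in a trace:
# a failed tool result or an empty output. Later failures often
# trace back to this point ("wrong early, not wrong late").
def first_wrong_step(events: list) -> Optional[int]:
    for i, ev in enumerate(events):
        if ev.get("kind") == "tool_result":
            if ev.get("status") != "success" or ev.get("n_results", 1) == 0:
                return i
    return None

events = [
    {"kind": "tool_call", "tool": "search"},
    {"kind": "tool_result", "tool": "search", "status": "success", "n_results": 0},
    {"kind": "tool_call", "tool": "search"},
]
print(first_wrong_step(events))  # the empty search result at index 1
```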
As you review traces, a few patterns show up repeatedly:
- Loops: the agent repeats the same step or tool call without making progress.
- Repeated tool calls: the same API hit multiple times with nearly identical parameters.
- Plan drift: the agent's stated plan changes subtly over iterations until it no longer matches the user's goal.
- Silent failures masked by retries: the agent "eventually succeeds" but only after a storm of timeouts and retries.
Traces help you explain why a failure happened. Metrics help you quantify how often it happens and whether it is trending up after a deployment.
Metrics that matter: loop rate, tool error rate, cost per successful task
Once you can trace every run, you can compute metrics that keep production stable.
Here are the baseline metrics worth tracking from day one:
Loop Rate
Loop rate is the number of iterations your agent performs per task.
Track:
- Average iterations per run
- Distribution (p50/p95) to catch outliers
- Spike detection (sudden "thrash" after a prompt or tool change)
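All three of these can be computed from stdlib Python; the iteration counts below are made up for illustration:

```python
import statistics

# Iterations per run for recent tasks (illustrative numbers; note the outlier).
iterations = [3, 4, 3, 5, 3, 4, 12, 3, 4, 3]

avg = statistics.mean(iterations)
p50 = statistics.median(iterations)
# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
p95 = statistics.quantiles(iterations, n=20)[18]
print(f"avg={avg:.1f} p50={p50} p95={p95:.1f}")
```

The p50/p95 gap is the useful part: a healthy median with a blown-out p95 means a minority of runs are thrashing.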
What it signals
- A rising loop rate often means the agent is not converging. It can be confused, under-specified, or stuck due to weak tool outputs.
Tool error rate
Tool error rate measures tool failures and timeouts.
Track it in two ways:
- Per call: failures ÷ total tool calls
- Per task: runs with ≥1 tool failure ÷ total runs
Also, break down by tool, because a single flaky dependency can dominate your incident rate.
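Both views fall out of the same per-run call log. A sketch, assuming each run is recorded as a list of `(tool, ok)` pairs:

```python
# Each run: a list of (tool_name, succeeded) tool calls. Data is illustrative.
runs = [
    [("search", True), ("db", True)],
    [("search", False), ("search", True)],  # one failure, then a retry succeeds
    [("db", True)],
]

calls = [c for run in runs for c in run]
per_call = sum(1 for _, ok in calls if not ok) / len(calls)
per_task = sum(1 for run in runs if any(not ok for _, ok in run)) / len(runs)
print(f"per-call={per_call:.2f} per-task={per_task:.2f}")
```

Note how the two rates diverge: retries can keep the per-call rate low while the per-task rate shows that a third of runs touched a failure.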
What it signals
- Rising tool errors often mean flaky infrastructure, rate limiting, invalid parameters, or auth issues.
Cost per successful task
Cost per successful task is the most practical cost metric for agents:
(LLM token cost + tool/API cost + retry cost) ÷ (number of tasks completed correctly)
This is more honest than "cost per run," because it penalizes wasted attempts and failed runs.
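The formula in code, with illustrative per-run cost records:

```python
# Per-run records: total cost (LLM tokens + tools + retries) and outcome.
runs = [
    {"cost": 0.04, "success": True},
    {"cost": 0.09, "success": False},  # a failed run still costs money
    {"cost": 0.05, "success": True},
]

total_cost = sum(r["cost"] for r in runs)
successes = sum(r["success"] for r in runs)
cost_per_success = total_cost / successes if successes else float("inf")
print(f"${cost_per_success:.3f} per successful task")
```

The failed run's $0.09 is not ignored; it is amortized over the two successes, which is exactly why this metric catches waste that "cost per run" hides.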
What it signals
- Rising costs with flat success usually mean inefficiency: loops, retries, overly long context, or poor tool selection.
p95 latency (the honesty metric)
Latency averages lie. Percentiles tell the truth about tail behavior.
p95 latency means 95% of runs complete faster than this value (and 5% are slower).
Track p95 latency for:
- Total run time
- Time spent in tool calls vs model calls
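Splitting p95 by tool time vs model time tells you where the tail comes from. A sketch with made-up timings:

```python
import statistics

def p95(xs):
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    return statistics.quantiles(xs, n=20)[18]

# Per-run wall-clock seconds, split by where the time went (illustrative).
tool_s  = [1.2, 0.9, 1.1, 1.0, 6.5, 1.3, 1.1, 0.8, 1.2, 1.0]
model_s = [0.8, 1.1, 0.9, 1.0, 1.0, 0.9, 1.2, 1.1, 0.9, 1.0]
total_s = [t + m for t, m in zip(tool_s, model_s)]

print(f"p95 total={p95(total_s):.1f}s tool={p95(tool_s):.1f}s "
      f"model={p95(model_s):.1f}s")
```

Here the tail clearly lives in tool calls (one 6.5s outlier), so the fix is a tool timeout or fallback, not a smaller model.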
A quick mapping example
- Loop rate up + success flat → likely retrieval thrash or poor stop conditions.
- Tool error rate up → flaky tool, rate limits, or bad parameters.
- Cost per successful task up → wasted attempts, retries, or runaway context.
- p95 latency up → slow tools, slow model calls, or compounding retries.
Why this matters: these four signals together tell you whether the agent is converging, reliable, and economically sane.
Operational guardrails: alerts, budgets, and safe rollouts
Metrics without guardrails are just charts of your next outage.
Here are practical guardrails that prevent silent failures and runaway spend.
1) Step, time, and tool-call budget (hard caps)
Give every run a budget
- Max steps (iterations)
- Max tool calls (total and per tool)
- Max wall-clock time
If the agent hits a cap, it should stop in a controlled way, not spiral.
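A minimal sketch of hard caps around a hypothetical agent loop; the cap values and `step_fn` callback are illustrative, not recommendations:

```python
import time

# Hard caps for a single run (illustrative values).
MAX_STEPS, MAX_TOOL_CALLS, MAX_SECONDS = 10, 15, 60.0

def run_agent(step_fn):
    start, steps, tool_calls = time.monotonic(), 0, 0
    while True:
        # Check every budget before doing more work, so a confused
        # agent stops in a controlled way instead of spiraling.
        if steps >= MAX_STEPS:
            return {"stop": "max_steps", "steps": steps}
        if tool_calls >= MAX_TOOL_CALLS:
            return {"stop": "max_tool_calls", "steps": steps}
        if time.monotonic() - start > MAX_SECONDS:
            return {"stop": "timeout", "steps": steps}
        action = step_fn(steps)  # stand-in for one plan/act iteration
        steps += 1
        if action == "tool":
            tool_calls += 1
        elif action == "done":
            return {"stop": "completed", "steps": steps}

# An agent that never decides to finish still terminates with a stop reason.
result = run_agent(lambda i: "tool")
print(result)
```

The stop reason is part of the return value on purpose: it is exactly the "stop condition" event your trace should record.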
2) Stop conditions and "final answer required" mode
Agents often fail by continuing when they should stop.
Add explicit stop logic.
- Stop when confidence is high enough.
- Stop when the required evidence is collected.
- Stop when additional tool calls are not improving the state.
For user-facing agents, add a mode that forces a stable response:
- "Provide the best-effort final answer with uncertainty and next steps."
- "Escalate to human" for high-risk scenarios.
3) Tool-specific rate limits and fallbacks
Not all tools are equal. Treat them differently:
- Rate limit expensive tools
- Add fallbacks (cached results, alternate endpoints, simpler queries)
- Fail-safe when tool responses are missing or malformed
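One way to wrap an expensive tool is a per-tool rate limit with a cached fallback. This is a sketch, not a production limiter; the class and the one-second interval are illustrative:

```python
import time

# Wrap a tool with a minimum call interval and a fail-safe fallback.
class RateLimitedTool:
    def __init__(self, fn, min_interval_s: float, fallback=None):
        self.fn = fn
        self.min_interval_s = min_interval_s
        self.fallback = fallback
        self._last = float("-inf")
        self.cache = None

    def call(self, *args):
        now = time.monotonic()
        if now - self._last < self.min_interval_s:
            # Rate limited: serve the cached result instead of hammering the API.
            return self.cache if self.cache is not None else self.fallback
        self._last = now
        try:
            self.cache = self.fn(*args)
            return self.cache
        except Exception:
            return self.fallback  # fail safe instead of crashing the run

search = RateLimitedTool(lambda q: f"results for {q}", min_interval_s=1.0,
                         fallback="(cached/unavailable)")
first = search.call("invoices")
second = search.call("invoices")  # within the interval: served from cache
```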
4) Alerting that matches failure modes
Trigger alerts on:
- Loop spikes (loop rate anomaly)
- Tool error spikes (overall and per tool)
- Cost-per-success jumps (spend anomaly)
- p95 latency breaches (tail slowdown)
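The four checks map directly to the four baseline metrics. A sketch comparing a current window against a baseline; the thresholds are illustrative starting points, not recommendations:

```python
# Compare the current window to a baseline and fire alerts on jumps.
def check_alerts(baseline: dict, current: dict) -> list:
    alerts = []
    if current["loop_rate"] > 1.5 * baseline["loop_rate"]:
        alerts.append("loop-rate spike")
    if current["tool_error_rate"] > baseline["tool_error_rate"] + 0.05:
        alerts.append("tool-error spike")
    if current["cost_per_success"] > 1.3 * baseline["cost_per_success"]:
        alerts.append("cost-per-success jump")
    if current["p95_latency_s"] > 1.5 * baseline["p95_latency_s"]:
        alerts.append("p95 latency breach")
    return alerts

baseline = {"loop_rate": 4.0, "tool_error_rate": 0.02,
            "cost_per_success": 0.05, "p95_latency_s": 8.0}
current = {"loop_rate": 9.0, "tool_error_rate": 0.03,
           "cost_per_success": 0.12, "p95_latency_s": 9.0}
print(check_alerts(baseline, current))
```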
Use these alerts as a release gate: if loop rate and cost per success spike after a deployment, roll back.
5) Close the loop with regression tests
Every real failure becomes a test case:
- Capture "known bad" scenarios
- Build a small regression suite
- Re-run it before shipping prompt or tool changes
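A regression case can be as small as a goal plus the properties a replay must satisfy. `run_agent_stub` below is a hypothetical stand-in for your real agent entry point:

```python
# Known-bad scenarios captured from production, replayed before each release.
REGRESSIONS = [
    {"goal": "refund order 17", "must_call": "refund_api", "max_steps": 6},
]

def run_agent_stub(goal: str) -> dict:
    # Stand-in for the real agent; returns the tools it called and step count.
    return {"tools": ["order_lookup", "refund_api"], "steps": 3}

def replay(case: dict) -> bool:
    out = run_agent_stub(case["goal"])
    assert case["must_call"] in out["tools"], "required tool not called"
    assert out["steps"] <= case["max_steps"], "loop regression"
    return True

assert all(replay(c) for c in REGRESSIONS)
```

Asserting on behavior (tools called, step budget) rather than exact output keeps the suite stable across prompt tweaks.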
This turns production incidents into improved reliability over time.
Conclusion: a baseline you can ship this week
Agent observability is how you keep autonomy from turning into chaos, and how you keep compute bills from turning into comedy. The goal isn't to "collect everything"; it's to capture the few signals that let you explain behavior, measure reliability, and control costs.
A practical baseline is simple: trace every run so you can see decisions, tool calls, state updates, and the stop reason. Then track metrics that expose drift early, such as loop rate, tool error rate, cost per successful task, and p95 latency, so you know whether a problem is rare or happening at scale. Finally, add guardrails such as step/time/tool budgets, explicit stop conditions, and alerting, so bad runs fail safely rather than silently spiral.
Once that foundation is in place, improvement becomes a repeatable loop: investigate failures in traces, confirm patterns in metrics, apply a fix or guardrail, and lock it in with regression tests.
If you want to dig deeper into the building blocks behind agent observability, tool calling, traces/spans, safe telemetry, and percentile latency, these references are a solid next step. They cover how traces are structured, how to avoid leaking sensitive data in telemetry, why tail latency matters more than averages, and why tool-calling performance can drift with seemingly small config changes.