- Book: Observability for LLM Applications — Tracing, Evals, and Shipping AI You Can Trust
- Also by me: Agents in Production — the companion book in The AI Engineer's Library (2-book series)
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You get paged: the triage agent is writing wrong labels on support tickets. You open your tracing backend and there are four thousand runs from the last hour. You know how to read a single LLM call. You emitted gen_ai.request.model, watched a span land in the waterfall, and read it top to bottom. That skill does not scale here. The question changed. It is no longer "what did the model say." It is "where in this fourteen-step run did the agent decide to do the wrong thing, and is that the same place it went wrong yesterday."
An LLM call is a span. An agent run is a tree. Once you model the run as a tree, every hard question about an agent becomes a query against that tree. Here is how to build it, what each node should carry, and how to replay it when a run goes sideways.
The root span is the trajectory boundary
The single most useful thing you can add is a parent span that wraps the whole run. The OpenTelemetry GenAI conventions call it invoke_agent. It fires once per task and it is the parent of every chat and tool span underneath.
Without that parent, a trace is a flat list of chat spans and tool spans, and you rebuild the trajectory in your head from timestamps. That reconstruction breaks the first time two runs interleave. With the parent, your backend has a boundary. It can render a trajectory view, sum cost per task, count turns, and diff one run against another, all without you scripting anything.
from opentelemetry import trace

tracer = trace.get_tracer("triage-agent")

with tracer.start_as_current_span(
    "invoke_agent triage-agent"
) as root:
    root.set_attribute("gen_ai.agent.name", "triage-agent")
    root.set_attribute("gen_ai.agent.version", "2.1.0")
    # session_id is whatever conversation identifier your harness already has
    root.set_attribute("gen_ai.conversation.id", session_id)
    # run the loop; child chat/tool spans nest under root
    ...
If you already instrumented single LLM calls, this is the whole migration. Wrap the outer loop. The existing gen_ai.* chat and tool spans nest under it unchanged. No second collector, no re-instrumentation.
Of the three attributes above, version is the one teams skip and regret. Agents regress quietly. A prompt tweak, a new tool on the whitelist, a model swap between two minor revisions all shift the trajectory distribution. When an eval regresses overnight, your first move is to filter by gen_ai.agent.version and check whether the agent changed between yesterday and today. If it did, the fix is in your version control. If it did not, the regression is in the model provider, a tool backend, or the data.
Each child span records a decision, not a response
Here is the mental shift. On a plain LLM call, the chat span records what the model said: tokens, model name, latency, maybe the message payload. That is a record of the response, and it is enough when answering was the only job.
Inside an agent, the model's output is rarely the user-facing answer. It is a decision. Call this tool with these arguments. Hand off to that agent. Stop and emit a final message. The response payload is not what you want on the span. The decision is. And a decision has structure: which branch of the loop the model took, which tool it picked, what arguments it chose, and why.
with tracer.start_as_current_span(
    "chat claude-sonnet-4-6"
) as chat:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
        tools=tools,
    )
    chat.set_attribute("gen_ai.agent.step", step)
    # Literal values for illustration; a real harness derives them from resp.
    chat.set_attribute("gen_ai.agent.decision", "call_tool")
    chat.set_attribute(
        "gen_ai.agent.decision.tool", "search_kb"
    )
    chat.set_attribute(
        "gen_ai.usage.input_tokens", resp.usage.input_tokens
    )
    chat.set_attribute(
        "gen_ai.usage.output_tokens", resp.usage.output_tokens
    )
Keep decision to a small fixed vocabulary your harness owns: call_tool, handoff, final_answer, reflect, stop. The step and decision attributes are names the harness defines; they are not part of the OpenTelemetry GenAI conventions at the time of writing, so nothing will emit them for you. Consistency is what lets you group on them later. When you build trajectory evals, you score runs by grouping on gen_ai.agent.decision and measuring how often each decision type leads to task completion. That query is impossible unless the attribute is on every step.
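To make that grouping concrete, here is a minimal sketch in plain Python. The flattened step records, the field names, and the completed flag are assumptions for illustration; in practice you would run the equivalent query in your tracing backend over exported spans.

```python
from collections import defaultdict

# Hypothetical step records: one per chat span, joined with the outcome
# of the run it belongs to. Field names are illustrative only.
steps = [
    {"run": "a", "decision": "call_tool", "completed": True},
    {"run": "a", "decision": "final_answer", "completed": True},
    {"run": "b", "decision": "call_tool", "completed": False},
    {"run": "b", "decision": "reflect", "completed": False},
    {"run": "c", "decision": "reflect", "completed": True},
]


def completion_rate_by_decision(steps):
    """For each decision type, the share of runs containing it that completed."""
    runs_seen = defaultdict(set)  # decision -> run ids that used it
    runs_done = defaultdict(set)  # decision -> run ids that used it and completed
    for s in steps:
        runs_seen[s["decision"]].add(s["run"])
        if s["completed"]:
            runs_done[s["decision"]].add(s["run"])
    return {d: len(runs_done[d]) / len(runs_seen[d]) for d in runs_seen}


rates = completion_rate_by_decision(steps)
```

A step with a missing or free-text decision value would land in its own bucket here, which is exactly why the vocabulary has to stay small and fixed.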
Set one integer on the root when the loop exits: gen_ai.agent.step.count, the final turn count. Guardrail alerts like "agent exceeded 25 steps" key off it directly. If your harness raises on a recursion limit, set it in the exception handler before you re-raise, so the span still carries the truth.
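A sketch of that exit path, under some loud assumptions: StepLimitExceeded, RecordingSpan, and run_loop are hypothetical names, and RecordingSpan is a stand-in that only mimics set_attribute so the example runs without the SDK. With real instrumentation, root would be the span yielded by tracer.start_as_current_span.

```python
class StepLimitExceeded(RuntimeError):
    """Hypothetical error a harness raises at its step limit."""


class RecordingSpan:
    """Stand-in for a real span; it only records set_attribute calls."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


def run_loop(root, decide, max_steps=25):
    """Call decide(step) until it returns 'final_answer' or the limit is hit."""
    step = 0
    try:
        while True:
            if step >= max_steps:
                raise StepLimitExceeded(f"agent exceeded {max_steps} steps")
            step += 1
            if decide(step) == "final_answer":
                break
    except StepLimitExceeded:
        # Record the count before re-raising so the failed run still carries it.
        root.set_attribute("gen_ai.agent.step.count", step)
        raise
    root.set_attribute("gen_ai.agent.step.count", step)
    return step
```

Both exits write the same attribute, so an alert on the step count sees runaway runs as well as clean ones.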
Read the tree and the pathology is obvious
Here is a clean run. Indentation is parent-child nesting. The first number on each span line is the start offset from the root in seconds; the second is the span's duration.
invoke_agent triage-agent                0.000  4.812s
├─ chat claude-sonnet-4-6                0.004  1.102s
│     in=842 out=96  decision: call search_kb
├─ execute_tool search_kb                1.118  0.612s
│     args={"query":"refund 14 day"}  result_len=3
├─ chat claude-sonnet-4-6                1.741  0.980s
│     in=1910 out=44  decision: call classify_ticket
├─ execute_tool classify_ticket          2.734  0.018s
│     args={"label":"billing/refund"}
├─ chat claude-sonnet-4-6                2.760  1.420s
│     in=2004 out=312  decision: call write_reply
├─ execute_tool write_reply              4.198  0.598s
└─ chat claude-sonnet-4-6                4.810  0.002s
      decision: final_answer
Four chat turns, three tool calls, input tokens climbing from 842 to 2004 as context accretes. The final chat span is a two-millisecond decision to answer. The model did not reflect, did not double-check, did not loop. That is what good looks like.
Now the same agent on a different ticket:
invoke_agent triage-agent                0.000  31.204s
├─ chat claude-sonnet-4-6                0.004   1.240s
│     decision: call search_kb
├─ execute_tool search_kb                1.244   0.590s
│     result_len=0
├─ chat claude-sonnet-4-6                1.840   1.510s
│     decision: call search_kb
├─ execute_tool search_kb                3.360   0.612s
│     result_len=0
... [18 more search_kb retries] ...
└─ chat claude-sonnet-4-6               30.920   0.280s
      decision: final_answer
      text: "I could not find this information."
Every execute_tool span is green. No tool errored. The run failed on a decision loop: the model kept asking the same query, kept getting zero rows, kept deciding to try again, and burned thirty seconds before giving up. You would never spot this on a single-call waterfall. You spot it because the root span shows the turn count at a glance and the trajectory view surfaces "this run called search_kb twenty times" as the top anomaly. The unit of debugging is the trajectory.
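If your backend does not surface that anomaly on its own, a rough detector is short. This is a sketch over hypothetical exported spans represented as plain dicts; the name and args fields and the threshold of three are assumptions, not part of any convention.

```python
import json
from collections import Counter


def repeated_tool_calls(spans, threshold=3):
    """Return (tool, args_json, count) for identical tool calls seen >= threshold times."""
    calls = Counter()
    for span in spans:
        if span["name"].startswith("execute_tool "):
            tool = span["name"].split(" ", 1)[1]
            # Serialize args with sorted keys so identical calls compare equal.
            args = json.dumps(span.get("args", {}), sort_keys=True)
            calls[(tool, args)] += 1
    return [(tool, args, n) for (tool, args), n in calls.items() if n >= threshold]


# Hypothetical reconstruction of a looping run: twenty identical searches.
looping_run = []
for _ in range(20):
    looping_run.append({"name": "chat claude-sonnet-4-6"})
    looping_run.append({"name": "execute_tool search_kb", "args": {"query": "q"}})
looping_run.append({"name": "chat claude-sonnet-4-6"})

loops = repeated_tool_calls(looping_run)
```

Keying on the arguments as well as the tool name matters: twenty searches with twenty different queries is exploration, while twenty identical ones is a loop.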
Handoffs nest as child invocations
Multi-agent systems keep the same shape. When a supervisor hands off to a specialist, the specialist's entire run is another invoke_agent span, parented under the supervisor's own invoke_agent span and sitting right after the chat span that decided the handoff.
invoke_agent research-supervisor         0.000  18.440s
├─ chat claude-opus-4-6                  0.004   2.110s
│     decision: handoff to web-searcher
├─ invoke_agent web-searcher             2.120   6.880s
│  ├─ chat claude-sonnet-4-6             2.122   0.810s
│  ├─ execute_tool web_search            2.940   1.402s
│  └─ chat claude-sonnet-4-6             4.350   1.680s
│        decision: final_answer
└─ chat claude-opus-4-6                  9.010   1.380s
      decision: final_answer
Cost accounting falls out of the nesting. Sum the child chat spans of each invoke_agent and you get per-agent token spend. The supervisor's context is usually small; the specialists burn most of the bill. You cannot see that from summing all chat spans flat.
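Here is a minimal sketch of that per-agent sum. The span dicts, their id and parent fields, and every token count are invented for illustration; the only idea taken from the text is attributing each chat span to its nearest enclosing invoke_agent.

```python
from collections import defaultdict

# Hypothetical exported spans for a supervisor run with one handoff.
spans = [
    {"id": "1", "parent": None, "name": "invoke_agent research-supervisor"},
    {"id": "2", "parent": "1", "name": "chat claude-opus-4-6",
     "input_tokens": 600, "output_tokens": 80},
    {"id": "3", "parent": "1", "name": "invoke_agent web-searcher"},
    {"id": "4", "parent": "3", "name": "chat claude-sonnet-4-6",
     "input_tokens": 900, "output_tokens": 60},
    {"id": "5", "parent": "3", "name": "execute_tool web_search"},
    {"id": "6", "parent": "3", "name": "chat claude-sonnet-4-6",
     "input_tokens": 4200, "output_tokens": 350},
    {"id": "7", "parent": "1", "name": "chat claude-opus-4-6",
     "input_tokens": 1100, "output_tokens": 240},
]


def tokens_per_agent(spans):
    """Sum chat-span tokens under the nearest enclosing invoke_agent span."""
    by_id = {s["id"]: s for s in spans}
    totals = defaultdict(int)
    for s in spans:
        if not s["name"].startswith("chat "):
            continue
        # Climb the parent chain until we reach an invoke_agent span.
        node = by_id.get(s["parent"])
        while node is not None and not node["name"].startswith("invoke_agent "):
            node = by_id.get(node["parent"])
        owner = node["name"].split(" ", 1)[1] if node else "(no agent)"
        totals[owner] += s["input_tokens"] + s["output_tokens"]
    return dict(totals)


spend = tokens_per_agent(spans)
```

Chat spans that never reach an invoke_agent ancestor fall into the "(no agent)" bucket, which doubles as a cheap signal that the tree is flattened.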
The common failure here is a framework that emits a flat list of chat spans for a multi-agent run instead of nested invoke_agent spans. The tree collapses to one line and the trajectory view goes useless. The fix is a one-line wrapper at each handoff: start a child invoke_agent before you call the specialist, end it after. Check for this first when a multi-agent trace looks wrong.
Replaying a run to find the break
When you get paged, run the same procedure every time.
First, filter by gen_ai.agent.version. If it jumped in the last hour, diff the versions. The regression is almost certainly there.
Second, filter by decision = final_answer and the failing outcome, then pull one bad run and read it top to bottom. Look for the first decision that surprises you, not the one that produced the bad output. The bad output usually sits three or four steps downstream of the broken decision. You want the upstream one.
Third, pull a passing run from before the regression and read them side by side. Nine times out of ten the difference is a single tool call now returning something different, or a chat span where the model picked a different tool than it used to.
Fourth, grep that one divergence across the whole population. If decision.tool = classify_ticket used to fire in 92% of runs and now fires in 74%, you have localized the regression. Fix it there.
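Steps three and four reduce to two small functions once each run is flattened to its ordered list of tool decisions. This is a sketch under that assumption; the function names and the sample runs are hypothetical.

```python
from itertools import zip_longest


def first_divergence(good, bad):
    """Index and values of the first step where two decision sequences differ."""
    for i, (g, b) in enumerate(zip_longest(good, bad)):
        if g != b:
            return i, g, b
    return None  # the two runs made identical decisions


def tool_rate(runs, tool):
    """Share of runs in which `tool` fires at least once."""
    return sum(tool in run for run in runs) / len(runs)


# Hypothetical runs: the bad one skips classify_ticket.
good = ["search_kb", "classify_ticket", "write_reply", "final_answer"]
bad = ["search_kb", "write_reply", "final_answer"]

divergence = first_divergence(good, bad)
rate = tool_rate([good, bad, good, good], "classify_ticket")
```

Run tool_rate once over yesterday's population and once over today's; the gap between the two numbers is the regression you are localizing.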
This procedure fails in exactly one case: an incomplete tree. Chat spans with no parent, an invoke_agent with no children, a multi-agent run flattened to one level. Then you scroll raw spans and lose an hour. Emitting the full tree is what buys you the fifteen-minute debug instead of the two-hour one.
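You can check for an incomplete tree mechanically before you need it. The sketch below assumes spans exported as dicts with hypothetical id, parent, and name fields, and flags the failure shapes that show up as missing parents, childless agent spans, and spans outside any agent.

```python
def tree_problems(spans):
    """Return (span_id, problem) pairs for structural gaps in one trace."""
    by_id = {s["id"]: s for s in spans}
    has_children = {s["parent"] for s in spans if s["parent"] is not None}
    problems = []
    for s in spans:
        is_agent = s["name"].startswith("invoke_agent ")
        if s["parent"] is not None and s["parent"] not in by_id:
            problems.append((s["id"], "parent span missing from trace"))
        if is_agent and s["id"] not in has_children:
            problems.append((s["id"], "invoke_agent has no children"))
        if not is_agent and s["parent"] is None:
            problems.append((s["id"], "span outside any invoke_agent"))
    return problems


# Hypothetical broken trace exhibiting all three gaps.
broken = [
    {"id": "1", "parent": None, "name": "invoke_agent triage-agent"},
    {"id": "2", "parent": None, "name": "chat claude-sonnet-4-6"},
    {"id": "3", "parent": "9", "name": "execute_tool search_kb"},
]
found = tree_problems(broken)
```

This does not catch a multi-agent run flattened under a single root; for that you would also need to compare the number of handoff decisions against the number of nested invoke_agent spans.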
What the tree cannot tell you
A trajectory shows what the agent did. It does not tell you whether what it did was good. The twenty-retry run is obvious because zero results twenty times is visibly wrong. Most bad runs are not visibly wrong: a reasonable tool, reasonable arguments, a reasonable result, and a reply that is slightly off. Read that trace without knowing ground truth and you shrug.
Tracing is the substrate. Evals are the judgment layer that grades the trajectory, and every eval technique assumes the tree exists first. So before Monday, open your backend, pick one production agent, and confirm every run has an invoke_agent root with gen_ai.agent.name, gen_ai.agent.version, and a complete child tree. If it does not, fix that before anything else. You cannot grade a trajectory you cannot see.
If you are building the agents, Agents in Production walks through the loop, the handoff, and the harness that emits these spans. If you are trying to see inside them, Observability for LLM Applications is the tracing and evals half of the pair. Together they are The AI Engineer's Library, and this post lives at the seam where they meet.
