Here is a debugging session I have watched play out at four different companies now.
An agent does something dumb in production. A user complains. An engineer opens the logs. They find this:
    [INFO] agent.run started
    [INFO] calling tool: search
    [INFO] calling tool: fetch_document
    [INFO] agent.run completed in 14.2s
And that is it. That is everything. The agent burned 14 seconds, made three model calls, fetched the wrong document, and confidently told the user something false — and the logs have nothing to say about why. The engineer shrugs, marks the ticket "could not reproduce," and moves on. The bug ships forever.
The problem is not that they forgot to log. They logged plenty. The problem is they logged the wrong layer. Application logs are a record of what your code did. An agent's behavior does not live in your code — it lives in the gap between your code and the model's decisions. That gap is invisible to console.log.
Why traditional logging fails for agents
In a normal service, the interesting events are deterministic. A request comes in, you branch on some conditions, you hit a database, you return a response. If you log the branches and the query, you can reconstruct what happened. The control flow is the explanation.
Agents invert this. Your control flow is trivial: usually a while loop that calls the model, executes whatever tool the model asked for, and feeds the result back. All of the actual decision-making happens inside the model, expressed as tokens you never wrote. When the agent goes wrong, the answer is rarely "the loop had a bug." The answer is in the content: what was in the context window, what the model chose, what the tool returned, and how the model interpreted that return.
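To make "trivial" concrete, here is a rough sketch of that loop. The `Message`, `ModelReply`, `CallModel`, and `ToolFn` shapes are hypothetical stand-ins I made up for illustration, not any particular SDK's types, and a real loop would carry more state than this.

```typescript
// Hypothetical shapes for illustration only; real SDKs differ.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelReply = { content: string; toolCall?: ToolCall };
type CallModel = (messages: Message[]) => Promise<ModelReply>;
type ToolFn = (args: Record<string, unknown>) => Promise<string>;

async function runAgent(
  callModel: CallModel,
  tools: Map<string, ToolFn>,
  userInput: string,
  maxTurns = 8,
): Promise<string> {
  const messages: Message[] = [{ role: "user", content: userInput }];

  for (let turn = 0; turn < maxTurns; turn++) {
    // Decide: the model picks the next action. This is where behavior lives.
    const reply = await callModel(messages);
    messages.push({ role: "assistant", content: reply.content });

    // No tool requested means the model considers itself done.
    if (!reply.toolCall) return reply.content;

    // Act: run whichever tool the model asked for.
    const tool = tools.get(reply.toolCall.name);
    const result = tool
      ? await tool(reply.toolCall.args)
      : `unknown tool: ${reply.toolCall.name}`;

    // Perceive: the result re-enters the context and steers the next turn.
    messages.push({ role: "tool", content: result });
  }
  return "stopped: turn limit reached";
}
```

Nothing in that function explains a bad answer. Everything interesting is in `messages`, `reply`, and `result`, which is exactly what plain application logs tend to drop.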
So the unit of observability for an agent is not the log line. It is the step: one full turn of perceive, decide, act. And steps nest — a sub-agent's steps live inside a parent step, a tool call may itself trigger a model call. You need a tree, not a stream. This is exactly the trace-and-span model from distributed tracing, and it maps onto agents shockingly well.
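As a small illustration of "a tree, not a stream", the sketch below assumes each step is stored as a flat record carrying its parent's id, and rebuilds the nesting from that. `StepLike` and `renderTree` are names I am inventing here; treat this as one possible shape, not a reference implementation.

```typescript
// Hypothetical minimal step record: just enough to rebuild the nesting.
interface StepLike {
  id: string;
  parentId: string | null; // null marks a top-level step
  kind: "model" | "tool";
  name: string;
}

// Group steps by parent, then walk depth-first from the roots,
// indenting each step by its depth in the tree.
function renderTree(steps: StepLike[]): string[] {
  const children = new Map<string | null, StepLike[]>();
  for (const step of steps) {
    const siblings = children.get(step.parentId) ?? [];
    siblings.push(step);
    children.set(step.parentId, siblings);
  }

  const lines: string[] = [];
  const walk = (parentId: string | null, depth: number): void => {
    for (const step of children.get(parentId) ?? []) {
      lines.push(`${"  ".repeat(depth)}${step.kind}: ${step.name}`);
      walk(step.id, depth + 1);
    }
  };
  walk(null, 0);
  return lines;
}
```

Siblings keep the order in which they were recorded, so a step log written in start order renders as a readable timeline with the nesting intact.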
What to actually capture per step
For every model invocation, you want the things that let you replay the decision without rerunning it. At minimum:
- The fully resolved input — not your prompt template, the actual rendered messages including retrieved context and prior tool outputs. The template is what you intended; the resolved input is what the model saw. Bugs hide in the difference.
- The raw output, including tool-call arguments, before any parsing.
- Token counts for input and output, separately. They are your early-warning system for both cost and latency.
- The model and parameters actually used. "We use GPT-4" is not specific enough when half your traffic silently fell back to a cheaper model.
- Timing, split into network/queue time versus generation time.
For every tool call: the arguments the model produced, the result you returned to it, whether it errored, and how long it took. The tool result is the single most overlooked field, because that text re-enters the context and steers everything after it. In my experience, garbage in a tool result is one of the most common root causes of a confidently wrong final answer, and it is invisible unless you store it.
Here is a minimal tracer in TypeScript. The shape matters more than the implementation:
    type StepKind = "model" | "tool";

    interface Step {
      id: string;
      parentId: string | null;
      kind: StepKind;
      name: string;
      input: unknown;   // resolved messages or tool args
      output: unknown;  // raw completion or tool result
      startedAt: number;
      endedAt?: number;
      tokensIn?: number;
      tokensOut?: number;
      error?: string;
      meta: Record<string, unknown>; // model, temperature, etc.
    }

    class Trace {
      readonly steps: Step[] = [];
      private stack: string[] = []; // ids of steps that are still open

      begin(kind: StepKind, name: string, input: unknown, meta: Record<string, unknown> = {}): string {
        // crypto.randomUUID() is a global in browsers and recent Node.js releases
        const id = crypto.randomUUID();
        this.steps.push({
          id,
          parentId: this.stack.at(-1) ?? null, // innermost open step, if any
          kind, name, input,
          output: undefined,
          startedAt: Date.now(),
          meta,
        });
        this.stack.push(id);
        return id;
      }

      end(id: string, patch: Partial<Step>): void {
        const step = this.steps.find((s) => s.id === id);
        if (step) Object.assign(step, patch, { endedAt: Date.now() });
        // Remove this id wherever it sits, so an out-of-order end() cannot leave it stuck on the stack
        const open = this.stack.lastIndexOf(id);
        if (open !== -1) this.stack.splice(open, 1);
      }
    }
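The parentId plus the stack is the whole trick. You get a tree for free, and a sub-agent just pushes more steps onto the same trace. One caveat: a single stack assumes steps open and close in nested order, so if you run tool calls in parallel you will need to pass the parent id explicitly instead. Wrap your model client and your tool dispatcher so this happens automatically; if instrumenting requires discipline at every call site, it will rot within a month.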
    // `client` and `Message` stand in for your model SDK and its message type.
    async function tracedModelCall(trace: Trace, messages: Message[], model: string) {
      const id = trace.begin("model", model, messages, { model });
      try {
        const res = await client.chat({ model, messages });
        trace.end(id, {
          output: res, // raw response, before any parsing
          tokensIn: res.usage.prompt_tokens,
          tokensOut: res.usage.completion_tokens,
        });
        return res;
      } catch (err) {
        trace.end(id, { error: String(err) });
        throw err;
      }
    }
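The tool dispatcher can get the same treatment. The sketch below is one way it might look; `TraceLike`, `ToolFn`, and `tracedToolCall` are hypothetical names, and `TraceLike` is only a structural stand-in for the tracer so the snippet compiles on its own.

```typescript
// Structural stand-in for the tracer, declared here so this sketch is self-contained.
interface TraceLike {
  begin(kind: "model" | "tool", name: string, input: unknown, meta?: Record<string, unknown>): string;
  end(id: string, patch: { output?: unknown; error?: string }): void;
}

type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

// Hypothetical dispatcher wrapper: every tool call goes through here,
// so call sites never have to remember to instrument themselves.
async function tracedToolCall(
  trace: TraceLike,
  tools: Map<string, ToolFn>,
  name: string,
  args: Record<string, unknown>,
): Promise<unknown> {
  // Record the arguments exactly as the model produced them.
  const id = trace.begin("tool", name, args);
  try {
    const tool = tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    const result = await tool(args);
    // Store exactly what goes back to the model, even when it is empty.
    trace.end(id, { output: result });
    return result;
  } catch (err) {
    // A hallucinated tool name is traced as an errored step, not lost.
    trace.end(id, { error: String(err) });
    throw err;
  }
}
```

Opening the step before the lookup is deliberate: a request for a tool that does not exist is itself a model decision worth seeing in the trace.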
The part everyone skips: make traces queryable
Capturing the trace is half the job. The half that actually pays off is being able to ask questions across traces. "Show me every run where a tool returned an empty result and the final answer still claimed success." "Which model version started producing 3x the tool calls last Tuesday?" "What did the context window look like for the five worst-rated responses this week?"
None of those are answerable from a log file. They require treating each trace as structured, queryable data — which means a real schema, indexed fields, and ideally a way to attach evaluation scores and user feedback onto the same trace. The moment you can join "this trace failed our eval" to "here is the exact resolved input that caused it," debugging stops being archaeology and becomes a query.
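To show what such a question looks like once traces are structured data, here is the first one written as an in-memory filter. `RunRecord`, `StepRecord`, and the `claimedSuccess` flag are assumptions of mine; in practice this would more likely be a query against an indexed store, and how you decide that an answer "claimed success" is up to your own scoring.

```typescript
// Hypothetical stored shapes; field names are illustrative.
interface StepRecord {
  kind: "model" | "tool";
  name: string;
  output: unknown;
  error?: string;
}

interface RunRecord {
  traceId: string;
  steps: StepRecord[];
  claimedSuccess: boolean; // assumed to be set by your own outcome check
}

// Treat null, undefined, "" and [] as "the tool came back with nothing".
const isEmpty = (value: unknown): boolean =>
  value == null || value === "" || (Array.isArray(value) && value.length === 0);

// Runs where some tool returned nothing (without erroring)
// and the final answer still claimed success.
function emptyToolButClaimedSuccess(runs: RunRecord[]): string[] {
  return runs
    .filter((run) => run.claimedSuccess)
    .filter((run) =>
      run.steps.some((s) => s.kind === "tool" && !s.error && isEmpty(s.output)),
    )
    .map((run) => run.traceId);
}
```

The point is less the filter itself than the fact that it is expressible at all: it depends on the tool output having been stored as a field rather than flattened into a log message.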
This is also where observability and evaluation stop being separate concerns. An eval failure is just a trace with a verdict attached. A production incident is a trace with a bad outcome. They are the same object viewed from two directions, and the teams who treat them as one thing move dramatically faster.
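One way to picture "a trace with a verdict attached" is a store where eval results, user feedback, and incidents all land on the same record. The names below (`Verdict`, `ScoredTrace`, `TraceStore`) are hypothetical, and the in-memory `Map` is a placeholder for whatever database you actually use.

```typescript
// Hypothetical shapes: one trace record, many verdicts from different sources.
type VerdictSource = "eval" | "user_feedback" | "incident";

interface Verdict {
  source: VerdictSource;
  passed: boolean;
  note?: string;
}

interface ScoredTrace {
  traceId: string;
  resolvedInput: unknown; // what the model actually saw
  verdicts: Verdict[];
}

class TraceStore {
  private traces = new Map<string, ScoredTrace>();

  save(traceId: string, resolvedInput: unknown): void {
    this.traces.set(traceId, { traceId, resolvedInput, verdicts: [] });
  }

  // CI evals and production feedback both attach to the same object.
  // Returns false when the trace id is unknown.
  attach(traceId: string, verdict: Verdict): boolean {
    const trace = this.traces.get(traceId);
    if (!trace) return false;
    trace.verdicts.push(verdict);
    return true;
  }

  // The join: every trace with a failing verdict, with its resolved input still attached.
  failures(): ScoredTrace[] {
    return Array.from(this.traces.values()).filter((t) =>
      t.verdicts.some((v) => !v.passed),
    );
  }
}
```

Because the verdict lives next to the resolved input, "which inputs failed" needs no second lookup, whether the failure came from CI or from a user.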
The takeaway
If you build one thing this quarter for your agents, build the trace tree. Not more INFO lines — a structured, nested record of every model and tool step, with the resolved inputs and raw outputs intact, that you can query and score after the fact. Everything else in agent reliability gets easier once you can actually see what happened.
This is the philosophy behind the tooling I work on: agent-eval for turning those traces into pass/fail verdicts in CI, and AgentLens for keeping the same traces searchable once the agent is live in production. Whether you adopt those or roll your own, the principle holds — your agent's behavior lives in the steps, so that is what you have to capture. Log the decisions, not the function calls.