AI Agents in Practice — Part 7: When the Loop Goes Wrong: Reading Agent Failures from the Trace

#agents #ai #architecture #llm

Part 7 of 8 — AI Agents in Practice

Previous — Building the Production Agent Loop (Part 6)

Part 6 ended with a question. The agent cancelled an order and issued a refund, the run reported success, and the closing line asked: when the verification read comes back wrong, what kind of failure is it, and how do you tell? Part 7 begins with that mismatch.

Here is what makes the question real. The cancel-then-refund agent from Part 6 is in production, except the team shipped the unsafe variant: no verification gate, and a refund backend with no precondition check of its own. In review, both looked like optional hardening. A few days in, a refund goes out on an order that was never actually cancelled. The customer keeps the item and gets their money back, and nobody notices until the numbers do not reconcile at the end of the week. The agent did not crash and did not throw an error. It reported that everything worked.

The instinct in that moment is to reach for a better model, or to wrap the whole thing in a retry and hope the next run behaves. Both are guesses, and you do not have to guess, because the agent already wrote down what it did. The skill this article teaches is reading that record: inspect the trace, name the kind of failure, decide whether a retry will actually help, and add a check so the same failure cannot come back quietly. Trace, classify, respond, eval. That is the whole article.

The trace is the evidence

A demo trace and a production trace are not the same artifact. A demo trace, if it exists at all, is usually the conversation: what the user said, what the model said back, which tools got called. It is enough to see that the agent did something. It is not enough to see whether what it did was right.

A useful production trace should record the loop step by step, in the terms Part 6 built. For each step it holds what the agent observed, what it decided, which tool it called, what the tool returned, what the verification read came back with, the resulting state, the cost and latency, and, when the loop ends, why it stopped. Most of those fields are routine. The most important comparison is between the tool response and the verification read.
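As a minimal sketch, the fields listed above could be carried in a record like this one. The class and field names (StepRecord, RunTrace, state_after, and the rest) are assumptions made for illustration, not the schema Part 6 or the companion lab uses.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StepRecord:
    """One loop step. Field names are illustrative, not a fixed schema."""
    step: int
    observation: dict[str, Any]                  # what the agent observed
    decision: str                                # what it decided
    tool_name: Optional[str]                     # which tool it called, if any
    tool_response: Optional[dict[str, Any]]      # what the tool returned
    verification_read: Optional[dict[str, Any]]  # what the verification read returned
    state_after: dict[str, Any]                  # resulting control state
    cost_usd: float = 0.0
    latency_ms: float = 0.0


@dataclass
class RunTrace:
    run_id: str
    steps: list[StepRecord] = field(default_factory=list)
    stop_reason: Optional[str] = None            # why the loop ended


# The naive cancel step: the tool said "accepted", nothing was verified,
# and the loop still wrote "cancelled" into its state.
naive_step = StepRecord(
    step=1,
    observation={"order_id": "order-1", "order_status": "open"},
    decision="cancel the order",
    tool_name="cancel_order",
    tool_response={"status": "accepted"},
    verification_read=None,
    state_after={"cancellation_status": "cancelled"},
)
trace = RunTrace(run_id="run-1", steps=[naive_step], stop_reason="success")
```

Keeping tool_response and verification_read as separate fields is the design choice that matters here: it lets a reader, or a grader, compare the claim with the world instead of inferring one from the other.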

Part 6 singled out that gap. A tool response describes the request, not the world. A response of accepted means the request was taken, not that the order is cancelled. The verification read asks an authoritative source what business state actually holds. The trace should also record which source was read, because a cache or delayed projection is not automatically ground truth. When the verification read is missing, or when it disagrees with the tool response, you have found the point where the run first diverged from the world. A trace that records that gap lets you see it after the fact instead of reconstructing it from a reconciliation report.
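One way to find that point mechanically is to walk the steps and flag the first one where the loop wrote a new value into its state without a verification read that agrees. This is a sketch under assumed names: steps are plain dicts with hypothetical state_after and verification_read keys, and cancellation_status is the field being tracked.

```python
from typing import Any, Optional


def first_divergence(steps: list[dict[str, Any]],
                     field_name: str = "cancellation_status") -> Optional[int]:
    """Return the index of the first step where a newly written state value
    is not backed by a matching verification read, or None if every write is backed."""
    previous = None
    for index, step in enumerate(steps):
        claimed = step.get("state_after", {}).get(field_name)
        changed = claimed is not None and claimed != previous
        previous = claimed
        if not changed:
            continue  # inherited value, already judged at the step that wrote it
        read = step.get("verification_read")
        if read is None or read.get(field_name) != claimed:
            return index
    return None


naive_steps = [
    {"tool_name": "cancel_order",
     "tool_response": {"status": "accepted"},
     "verification_read": None,
     "state_after": {"cancellation_status": "cancelled"}},
    {"tool_name": "issue_refund",
     "tool_response": {"status": "ok"},
     "verification_read": None,
     "state_after": {"cancellation_status": "cancelled"}},
]
verified_steps = [
    {"tool_name": "cancel_order",
     "tool_response": {"status": "accepted"},
     "verification_read": {"cancellation_status": "cancelled"},
     "state_after": {"cancellation_status": "cancelled"}},
]

naive_result = first_divergence(naive_steps)        # index of the cancel step
verified_result = first_divergence(verified_steps)  # None
```

The function only reports where to start reading. It does not say why the loop skipped the read, which is still a question for the person holding the trace.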

So the discipline is simple to state and easy to skip: when something goes wrong, read the trace before you change anything. The Part 6 trace is the right place to start because it already records the one comparison that matters. The companion lab's naive trace records this exact failure, if you want to read one before your pager makes you.

Two places failures show up

It helps to know roughly where to look before you start reading, and agent failures tend to come from one of two places. Either the agent did something wrong, or the loop around it did. These are not rigid categories to memorize. They are a way to point your attention. The model is often the first thing teams blame because it is the most visible part of the system, but the trace may point somewhere else entirely.

Execution failures: tool failure, model decision failure, or control-state failure, where the recorded state no longer matches the world.

Structural loop failures: context degradation, loop runaway with no progress, silent stall, or wrong escalation where the loop exits or hands off too early or too late.

The first place is execution. The agent took a step and the step was wrong. A tool failed: wrong arguments, a timeout, a malformed response. Or the model made a bad decision: it selected the wrong action or attempted an unsupported one. Or, the case Part 6 was built around, the control state drifted from the world. The tool returned accepted, the loop wrote down cancelled, and the order stayed open. Nothing errored. The loop simply believed a claim about a request and recorded it as a fact about the world.

That last kind of failure has a tendency to spread. Once the loop has written cancellation_status: cancelled into its state, the next step reads that as established truth. The refund step does not re-examine the cancellation; it trusts the state it inherited and fires. So a single control-state failure at one step becomes the false assumption the next step is built on. The damage is not contained to where it started. When a chain of steps each trusts the one before it, authoritative verification matters at every consequential boundary, not only the last one. (In the Part 6 design, the backend's own precondition check refuses this attempt; the trace still has to expose why the loop tried.)
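A guard at the refund boundary might look like the sketch below. It re-reads authoritative state instead of trusting the inherited cancellation_status. The function names and the injected read_order_status and issue_refund callables are hypothetical stand-ins, not the Part 6 implementation.

```python
from typing import Callable


def refund_if_cancelled(order_id: str,
                        read_order_status: Callable[[str], str],
                        issue_refund: Callable[[str], dict]) -> dict:
    """Issue a refund only when a fresh authoritative read says the order is cancelled."""
    status = read_order_status(order_id)  # fresh read, not inherited loop state
    if status != "cancelled":
        return {"outcome": "blocked", "reason": f"order is {status!r}, not cancelled"}
    return {"outcome": "refunded", "refund": issue_refund(order_id)}


issued: list[str] = []


def fake_issue_refund(order_id: str) -> dict:
    """Stand-in for the refund tool; records which orders were refunded."""
    issued.append(order_id)
    return {"order_id": order_id}


blocked = refund_if_cancelled("order-1", lambda _id: "open", fake_issue_refund)
allowed = refund_if_cancelled("order-2", lambda _id: "cancelled", fake_issue_refund)
```

In this sketch the guard sits in the loop, not the backend. It complements a backend precondition check rather than replacing it.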

The second place is the loop itself. Here the individual steps may each be fine, but the structure that runs them went wrong. The context the agent is working from goes stale or overflows, and it starts deciding on outdated information. A run can also drift from its original objective even when each step looks reasonable on its own; in a single trace, that usually points back to degraded context or a decision step that lost the thread. Or the loop runs longer than it should because the stopping condition was never tight enough. Or it escalates to a human when it did not need to, or worse, keeps going when it should have escalated. And there is one structural failure that hides better than the rest.

Not every stall announces itself. A loop can hang on a step that never errors and never returns, such as a stream that goes idle without closing, and a stopping rule that only watches for errors and iteration caps will wait forever. Production loops need a watchdog that treats silence as a failure too. Silence has to resolve into a defined state the loop can act on: a timeout, or the blocked condition Part 3 named. From there, policy decides whether to retry, escalate, or stop; the failure is letting silence remain an invisible non-state.
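A minimal watchdog can be sketched with the standard library's asyncio.wait_for, which cancels the awaited call once the deadline passes. The wrapper name call_with_watchdog, the returned status strings, and the two demo coroutines are assumptions for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def call_with_watchdog(call: Callable[[], Awaitable[Any]],
                             timeout_s: float) -> dict[str, Any]:
    """Run one tool call and turn silence into an explicit 'timeout' state."""
    try:
        result = await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # No error and no response: report a state the loop's policy can act on.
        return {"status": "timeout", "result": None}
    return {"status": "ok", "result": result}


async def idle_stream() -> str:
    """Stand-in for a call that never errors and never returns in time."""
    await asyncio.sleep(60)
    return "never reached in the demo"


async def healthy_tool() -> dict[str, str]:
    """Stand-in for a call that answers promptly."""
    return {"status": "accepted"}


stalled = asyncio.run(call_with_watchdog(idle_stream, timeout_s=0.05))
healthy = asyncio.run(call_with_watchdog(healthy_tool, timeout_s=1.0))
```

The wrapper only converts silence into a state. Whether a timeout leads to a retry, an escalation, or a stop is left to the policy that reads that state.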

How do you tell these apart from the trace? You read the recorded fields for the signature each one leaves. For each step, four questions do most of the work: what did the agent see, what did it decide, what did the tool return, and what state did the loop write next. The signature is the pattern across those fields, and it points you at the likely class and the first thing to check.

Trace signature	Likely failure class	First thing to check
Bad arguments, tool error, timeout, or a malformed, empty, or incomplete response	Execution, tool failure	Tool contract, required fields, and inputs
Decision or tool choice does not follow from the observed state	Execution, model decision	Context and tool descriptions visible at that step
Tool response is treated as completion, but verification is missing or the world disagrees	Execution, control-state	Verification read and source of truth
Inherited state contradicts the current world	Propagated control-state failure	Where state and world first diverged
Backend rejects a consequential action because an authoritative precondition is false	Propagated control-state or decision failure	Where the loop first marked the precondition satisfied
Context is stale or oversized	Structural, context degradation	What the loop kept or dropped
Cost or steps rise with no progress	Structural, loop runaway	Stopping condition and budget
No response and no error	Structural, silent stall	Timeout and watchdog coverage
Loop exits or escalates at the wrong time	Structural, wrong escalation	Exit and escalation condition
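A few of these rows can be sketched as a first-pass triage function. It is a rough aid under stated assumptions, not a replacement for reading the step: the dict keys, the rule that a well-formed tool response carries a status field, and the returned labels are all hypothetical.

```python
from typing import Any


def classify_step(step: dict[str, Any]) -> tuple[str, str]:
    """Return (likely failure class, first thing to check) for one recorded step."""
    tool = step.get("tool_response")
    read = step.get("verification_read")
    error = step.get("error")
    marked_cancelled = step.get("state_after", {}).get("cancellation_status") == "cancelled"

    # A tool was called, but there is no response and no error.
    if step.get("tool_name") and tool is None and not error:
        return ("structural: silent stall", "timeout and watchdog coverage")
    # An explicit error, or a response missing the field the contract requires.
    if error or (tool is not None and "status" not in tool):
        return ("execution: tool failure", "tool contract, required fields, and inputs")
    # State says cancelled, but no verification read confirms it.
    if marked_cancelled and (read is None or read.get("cancellation_status") != "cancelled"):
        return ("execution: control-state", "verification read and source of truth")
    return ("no signature matched", "read the step by hand")


stall = classify_step({"tool_name": "cancel_order", "tool_response": None})
tool_failure = classify_step({"tool_name": "cancel_order", "tool_response": None,
                              "error": "connection reset"})
control_state = classify_step({"tool_name": "cancel_order",
                               "tool_response": {"status": "accepted"},
                               "verification_read": None,
                               "state_after": {"cancellation_status": "cancelled"}})
```

The rules run in a fixed order, so a step that matches several signatures gets only the first label. That is acceptable for pointing attention and too coarse for a final diagnosis.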

The table is not the lesson. The lesson is the habit: a failure leaves a signature in the recorded fields, and reading the signature is faster and more honest than guessing at a cause. Classification matters because each class points to a different kind of fix. And the signature you name here is the same thing you will turn into a test at the end.

Does retry help?

Once you have named the failure, the next question is what to do about it, and the most common reflex is to retry. Retry is a response strategy, not a diagnosis, and whether it helps depends entirely on the kind of failure you named.

Diagnosis and retry policy answer different questions. Diagnosis asks why a step failed: a tool error, a control-state gap, or a decision mistake. Retry policy asks what to do about it, and the right answer depends on the nature of the failure, not on where it happened. A transient fault, such as a rate limit or dropped connection, may justify a short backoff and another attempt. A failure that recurs identically needs a fix, not another call that spends budget reaching the same wall. A condition the current run cannot resolve, such as a bad credential or missing token, should stop the loop and surface the problem rather than spend the remaining budget retrying. Retrying every failure the same way is how a system manages to be expensive and unreliable at once.
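The three cases above can be sketched as a small decision function. The enum, the action names, and the default budget and delay are assumptions chosen for illustration; a real policy would tune them per tool.

```python
from enum import Enum


class FailureNature(Enum):
    TRANSIENT = "transient"        # rate limit, dropped connection
    RECURRING = "recurring"        # fails the same way every time
    UNRESOLVABLE = "unresolvable"  # bad credential, missing token


def decide_response(nature: FailureNature, attempt: int,
                    max_attempts: int = 3, base_delay_s: float = 0.5) -> dict:
    """Choose a response from the nature of the failure.

    `attempt` is the number of attempts already made, starting at 1.
    """
    if nature is FailureNature.TRANSIENT:
        if attempt < max_attempts:
            # Exponential backoff: 0.5s, 1.0s, 2.0s, ...
            return {"action": "retry", "delay_s": base_delay_s * 2 ** (attempt - 1)}
        return {"action": "stop_and_surface", "reason": "retry budget spent"}
    if nature is FailureNature.RECURRING:
        return {"action": "stop_and_fix", "reason": "same failure would repeat"}
    return {"action": "stop_and_surface", "reason": "run cannot resolve this"}


first_retry = decide_response(FailureNature.TRANSIENT, attempt=1)
out_of_budget = decide_response(FailureNature.TRANSIENT, attempt=3)
bad_credential = decide_response(FailureNature.UNRESOLVABLE, attempt=1)
```

Only one branch ever returns retry. Everything else ends the attempt with a reason attached, so the decision shows up in the trace.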

It is worth being deliberate about that last case, because "stop" sounds like giving up and it is not. When the loop reaches a blocker or a condition it cannot resolve safely, it should stop and surface the problem. Surfacing means handing the situation to a person with enough of the trace for them to act, rather than letting the loop spend its budget retrying into a wall or, worse, continuing past the blocker as if it were cleared. An explicit stopping condition is a control, not a failure of nerve.

There is a subtlety in retrying that the trace makes visible. Before you replay a step, look at what the step already did. A step that completed part of its work, or that called a model and got a non-deterministic answer, is not safe to re-run blindly: replaying it can repeat an effect or produce a different result than the one the rest of the run assumed.

This is the backend discipline of idempotency: repeating the same request should not create a second side effect. Before retrying a side-effecting call, reuse the caller-minted idempotency key for that same logical operation, and when the outcome is uncertain, re-read authoritative state before deciding whether another attempt is needed. Do not mint a new key for the retry; that turns one logical operation into two. Without that protection, retrying issue_refund can issue the refund twice.
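A small in-memory sketch shows the difference between reusing the key and minting a new one. FakeRefundBackend is a hypothetical stand-in for a refund service that deduplicates by idempotency key; real services differ in how long they retain keys and in how they treat a reused key with different parameters.

```python
import uuid


class FakeRefundBackend:
    """In-memory stand-in for a refund service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._results_by_key: dict[str, dict] = {}
        self.refunds_created = 0

    def issue_refund(self, order_id: str, amount_cents: int, idempotency_key: str) -> dict:
        # Same key: return the stored result instead of creating a second refund.
        if idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        self.refunds_created += 1
        result = {"refund_id": f"rf_{self.refunds_created}",
                  "order_id": order_id, "amount_cents": amount_cents}
        self._results_by_key[idempotency_key] = result
        return result


backend = FakeRefundBackend()
key = str(uuid.uuid4())  # minted once by the caller for this logical refund

first = backend.issue_refund("order-1", 2500, key)
retry = backend.issue_refund("order-1", 2500, key)  # safe: same key, same refund
count_after_safe_retry = backend.refunds_created

# The mistake the text warns about: a fresh key makes the retry a new operation.
duplicate = backend.issue_refund("order-1", 2500, str(uuid.uuid4()))
count_after_new_key = backend.refunds_created
```

With the reused key the backend holds one refund; the fresh key produces a second one for the same order.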

The safer move is to classify what failed, preserve the work that already succeeded, and retry only the part that genuinely needs it. That is the same point from a different angle: the kind of failure decides the response, and a uniform retry ignores the one thing that should drive the decision.

Add an eval so it does not come back

Fixing the failure in front of you is satisfying and incomplete. The run that failed will not be the last run, and a fix you cannot verify on the next deploy is a fix you are trusting on faith. The durable move is to turn the failure into a test, so the same failure cannot return quietly.

The trace makes this concrete, because the trace already contains the failed case. Start with a real failure, the one you just diagnosed. Turn it into a clearly specified task: cancel an order, verify the cancellation, then issue one refund. Define what success actually means as something you can check, not a vague sense that it worked. Write a grader that checks that condition. Run it more than once, because an agent is not deterministic and a single passing run does not prove the behavior is reliable. Then keep the case in a regression suite so it runs on every change.
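The "run it more than once" step can be sketched as a small trial runner. The name run_trials, the shape of the summary, and the scripted outcomes standing in for a real agent run are all assumptions for illustration.

```python
from typing import Callable


def run_trials(run_once: Callable[[], bool], trials: int = 5) -> dict:
    """Run the same graded task several times and summarize the results."""
    results = [bool(run_once()) for _ in range(trials)]
    passes = sum(results)
    return {
        "trials": trials,
        "passes": passes,
        "pass_rate": passes / trials,
        "all_passed": passes == trials,
    }


# Stand-in for a non-deterministic agent: a scripted sequence of graded outcomes.
scripted_outcomes = iter([True, True, False, True, True])
summary = run_trials(lambda: next(scripted_outcomes), trials=5)
```

A single run of this scripted agent would pass four times out of five. For a consequential action like a refund, gating the suite on all_passed rather than on the pass rate is the stricter choice.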

A note on words, because they get used loosely. Terms such as transcript, trace, and trajectory are used differently across systems. In this article, trace means the recorded path of the run, while outcome means the resulting world state. Both matter, and they catch different problems. Checking only the outcome can miss unsafe or invalid behavior along the way: the agent reached the right answer but used the wrong tool, leaked data, retried a dozen times, or skipped an approval. Checking only the path can be too rigid, rejecting a valid run that reached the right result by a different route than the one you expected. So grade the outcome where you can, inspect the trace when the path is what matters, and do not force one exact sequence unless the sequence is part of the requirement.

For the cancel-then-refund failure, use two checks. The trace grader verifies that the loop never attempts issue_refund before an authoritative verification read confirms cancelled. The outcome grader verifies that the backend never creates a refund unless the order is actually cancelled, and that a successful run creates exactly one refund. The first catches bad sequencing even when the backend protects the money; the second verifies the final enforcement boundary. Run both across several trials, and the exact failure you saw in production becomes something the suite catches before it ships again.
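The two checks might be sketched as follows. The step keys, the orders_db source name, and the function signatures are hypothetical; the logic follows the two conditions described above.

```python
from typing import Any


def grade_trace(steps: list[dict[str, Any]],
                authoritative_sources: frozenset = frozenset({"orders_db"})) -> bool:
    """Pass only if no issue_refund attempt precedes a verified 'cancelled' read
    from an authoritative source."""
    verified = False
    for step in steps:
        # The tool call happens before this step's own verification read.
        if step.get("tool_name") == "issue_refund" and not verified:
            return False
        read = step.get("verification_read") or {}
        if (read.get("cancellation_status") == "cancelled"
                and step.get("verification_source") in authoritative_sources):
            verified = True
    return True


def grade_outcome(order_status: str, refunds: list[str], run_succeeded: bool) -> bool:
    """Pass only if no refund exists on an uncancelled order, and a successful
    run created exactly one refund."""
    if order_status != "cancelled" and refunds:
        return False
    if run_succeeded and len(refunds) != 1:
        return False
    return True


naive_steps = [
    {"tool_name": "cancel_order", "verification_read": None},
    {"tool_name": "issue_refund"},
]
safe_steps = [
    {"tool_name": "cancel_order",
     "verification_read": {"cancellation_status": "cancelled"},
     "verification_source": "orders_db"},
    {"tool_name": "issue_refund"},
]

naive_trace_ok = grade_trace(naive_steps)                 # False
safe_trace_ok = grade_trace(safe_steps)                   # True
naive_outcome_ok = grade_outcome("open", ["rf_1"], True)  # False
```

The trace grader treats a read from any source outside authoritative_sources as no verification at all, so a confirming read from a cache does not unlock the refund.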

Structural failures can be turned into checks too, though the assertion looks different. For a step that stalls, the test is not about a final answer; it is that the run stops with a defined blocked state or surfaces the failure within its turn, time, or retry budget rather than hanging. The check is on the path and the stopping behavior, not the outcome. (Part 3 defined blocked as a stopping condition; the point is that the stall has to resolve into a state the loop can detect.)
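A check of that kind can be sketched as a grader over the finished trace. The trace keys (steps, elapsed_s, stop_reason) and the set of acceptable stop reasons are assumptions for illustration.

```python
from typing import Any


def grade_stall_handling(trace: dict[str, Any], max_steps: int, max_seconds: float,
                         allowed_stops: frozenset = frozenset({"blocked", "timeout", "escalated"})) -> bool:
    """Pass only if the run ended in a defined stop state within its step and time budget."""
    within_budget = (len(trace["steps"]) <= max_steps
                     and trace["elapsed_s"] <= max_seconds)
    return within_budget and trace.get("stop_reason") in allowed_stops


resolved_stall = {"steps": [{"tool_name": "cancel_order"}],
                  "elapsed_s": 12.0, "stop_reason": "blocked"}
hung_run = {"steps": [{"tool_name": "cancel_order"}],
            "elapsed_s": 900.0, "stop_reason": None}

resolved_ok = grade_stall_handling(resolved_stall, max_steps=10, max_seconds=30.0)
hung_ok = grade_stall_handling(hung_run, max_steps=10, max_seconds=30.0)
```

Nothing in this grader looks at the order or the refund. It asserts only that silence became a state and that it did so inside the budget.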

You do not need a large suite to start. A useful first set is on the order of twenty to fifty real tasks, drawn from the checks you already run by hand, the failures you have actually seen in production, bug reports, and the edge cases you know are dangerous. The point is not coverage of everything. The point is that the failures you have already paid for become failures you never pay for twice.

Three takeaways

A well-instrumented trace helps distinguish execution failures from structural ones. The recorded fields carry a signature; reading it beats guessing at a cause or reaching for a bigger model.
Retry is a separate decision from diagnosis. The failure class decides whether a retry helps, hurts, or just spends budget arriving at the same wall. Classify first, then choose the response.
Every production failure is a test case waiting to be written. Turn the trace into a task and a grader, run it across several trials, and add it to the suite before the next page fires.

Next: even a correctly running loop can fail at its boundaries, including what the agent may see, do, and remember, and a wrong action may be induced by untrusted input or inherited from poisoned memory. That is Part 8.

Source note: this article builds on the evaluation concepts in Anthropic's Demystifying Evals for AI Agents, and on the human-oversight, simplicity, and tool-design principles in Building Effective Agents (Schluntz & Zhang). The failure framing, the trace-signature reading, the retry classification, and the diagnostic workflow are this series' own synthesis.

Part of AI in Practice: three practical series on MCP, RAG, and AI Agents, focused on why these patterns exist, where they break, and how to think through the engineering decisions behind them.