Part 7 of 8 — AI Agents in Practice series.
Previous — Building the Production Agent Loop (Part 6)
Part 6 ended with a question. The agent cancelled an order and issued a refund, the run reported success, and the closing line asked: when the verification read comes back wrong, what kind of failure is it, and how do you tell? Part 7 begins with that mismatch.
Here is what makes the question real. The cancel-then-refund agent from Part 6 is in production. A few days in, a refund goes out on an order that was never actually cancelled. The customer keeps the item and gets their money back, and nobody notices until the numbers do not reconcile at the end of the week. The agent did not crash and did not throw an error. It reported that everything worked.
The instinct in that moment is to reach for a better model, or to wrap the whole thing in a retry and hope the next run behaves. Both are guesses, and you do not have to guess, because the agent already wrote down what it did. The skill this article teaches is reading that record: inspect the trace, name the kind of failure, decide whether a retry will actually help, and add a check so the same failure cannot come back quietly. Trace, classify, retry, eval. That is the whole article.
The trace is the evidence
A demo trace and a production trace are not the same artifact. A demo trace, if it exists at all, is usually the conversation: what the user said, what the model said back, which tools got called. It is enough to see that the agent did something. It is not enough to see whether what it did was right.
A useful production trace should record the loop step by step, in the terms Part 6 built. For each step it holds what the agent observed, what it decided, which tool it called, what the tool returned, what the verification read came back with, the resulting state, the cost and latency, and, when the loop ends, why it stopped. Most of those fields are routine. The most important comparison is between the tool response and the verification read.
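As a rough illustration, the per-step record described above could be held in a structure like the following. This is a minimal sketch, not the schema from Part 6: the class name `TraceStep`, every field name, and the sample values are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TraceStep:
    """One recorded loop step. All field names are hypothetical."""
    step: int
    observed: dict[str, Any]                     # what the agent saw
    decision: str                                # what it decided to do
    tool_name: Optional[str]                     # which tool it called, if any
    tool_response: Optional[dict[str, Any]]      # what the tool returned
    verification_read: Optional[dict[str, Any]]  # what the source of truth said
    state_after: dict[str, Any]                  # state the loop wrote next
    cost_usd: float = 0.0
    latency_ms: float = 0.0
    stop_reason: Optional[str] = None            # set only when the loop ends


# A step that shows the gap: the tool accepted the request,
# the order is still open, and the loop recorded "cancelled".
step = TraceStep(
    step=3,
    observed={"order_id": "A-1001", "status": "open"},
    decision="cancel_order",
    tool_name="cancel_order",
    tool_response={"status": "accepted"},
    verification_read={"status": "open"},
    state_after={"cancellation_status": "cancelled"},
    latency_ms=412.0,
)
```

Keeping `tool_response` and `verification_read` as separate fields is the design choice that matters here: it keeps the claim about the request and the read of the world side by side, so they can be compared later.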
Part 6 singled out that gap. A tool response describes the request, not the world. An accepted status means the request was taken, not that the order is cancelled. The verification read is the loop going back to the source of truth and asking what actually happened. When the verification read is missing, or when it disagrees with the tool response, you have found the point where the run first diverged from the world. A trace that records that gap lets you see it after the fact instead of reconstructing it from a reconciliation report.
So the discipline is simple to state and easy to skip: when something goes wrong, read the trace before you change anything. The Part 6 trace is the right place to start because it already records the one comparison that matters.
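Reading the trace for that one comparison can be partly mechanized. The sketch below assumes each step is a plain dict with hypothetical `state_after` and `verification_read` keys. It returns the first step where the state the loop wrote is not confirmed by the verification read, and it treats a missing read as unconfirmed.

```python
from typing import Optional


def first_divergence(trace: list[dict], state_key: str, world_key: str) -> Optional[int]:
    """Index of the first step whose written state is not confirmed by the
    verification read, or None if every claim is confirmed."""
    for index, step in enumerate(trace):
        written = step.get("state_after", {}).get(state_key)
        if written is None:
            continue  # this step made no claim about the tracked field
        # A missing verification read counts as "not confirmed".
        verified = (step.get("verification_read") or {}).get(world_key)
        if verified != written:
            return index
    return None


# Hypothetical trace for the cancel-then-refund run.
trace = [
    {"tool": "get_order", "state_after": {}},
    {"tool": "cancel_order",
     "tool_response": {"status": "accepted"},
     "verification_read": {"status": "open"},
     "state_after": {"cancellation_status": "cancelled"}},
    {"tool": "issue_refund",
     "tool_response": {"status": "accepted"},
     "verification_read": None,
     "state_after": {"cancellation_status": "cancelled", "refund_status": "issued"}},
]

diverged_at = first_divergence(trace, "cancellation_status", "status")
```

Here `diverged_at` is `1`, the cancel step, even though the visible damage happened at the refund step after it.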
Two places failures show up
It helps to know roughly where to look before you start reading, and agent failures tend to come from one of two places. Either the agent did something wrong, or the loop around it did. These are not rigid categories to memorize. They are a way to point your attention. The model is often the first thing teams blame because it is the most visible part of the system, but the trace may point somewhere else entirely.
Execution failures: tool failure, model reasoning failure, or control-state failure, where the recorded state no longer matches the world.
Structural loop failures: context degradation, loop runaway with no progress, silent stall, or wrong escalation where the loop exits or hands off too early or too late.
The first place is execution. The agent took a step and the step was wrong. A tool failed: wrong arguments, a timeout, a malformed response. Or the model reasoned badly: it chose the wrong action, or it invented one. Or, the case Part 6 was built around, the control state drifted from the world. The tool returned accepted, the loop wrote down cancelled, and the order stayed open. Nothing errored. The loop simply believed a claim about a request and recorded it as a fact about the world.
That last kind of failure has a tendency to spread. Once the loop has written cancellation_status: cancelled into its state, the next step reads that as established truth. The refund step does not re-examine the cancellation; it trusts the state it inherited and fires. So a single control-state failure at one step becomes the false assumption the next step is built on. The damage is not contained to where it started. When a chain of steps each trusts the one before it, an independent check matters at every consequential boundary, not only the last one.
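One way to put an independent check at a consequential boundary is to have the refund step re-read the source of truth instead of trusting inherited state. This is a sketch under assumptions: `read_order_status` and `issue_refund` are hypothetical callables passed in by the caller, and the stand-ins below exist only to make the example runnable.

```python
class PreconditionNotVerified(Exception):
    """Raised when the source of truth does not confirm the required state."""


def issue_refund_guarded(order_id, read_order_status, issue_refund):
    """Issue a refund only if a fresh read confirms the order is cancelled."""
    status = read_order_status(order_id)  # fresh read, not the loop's inherited state
    if status != "cancelled":
        raise PreconditionNotVerified(
            f"order {order_id} is {status!r}, not 'cancelled'; refund withheld"
        )
    return issue_refund(order_id)


# Stand-ins for the real system: the order was never actually cancelled.
refund_calls = []


def fake_read_order_status(order_id):
    return "open"


def fake_issue_refund(order_id):
    refund_calls.append(order_id)
    return {"refund": "issued", "order_id": order_id}


try:
    issue_refund_guarded("A-1001", fake_read_order_status, fake_issue_refund)
    guard_result = "refunded"
except PreconditionNotVerified:
    guard_result = "withheld"
```

With this guard the loop's recorded `cancellation_status` no longer decides whether money moves; the fresh read does.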
The second place is the loop itself. Here the individual steps may each be fine, but the structure that runs them went wrong. The context the agent is working from goes stale or overflows, and it starts deciding on outdated information. A run can also drift from its original objective even when each step looks reasonable on its own; in a single trace, that usually points back to degraded context or a reasoning step that lost the thread. The loop runs longer than it should because the stopping condition was never tight enough. It escalates to a human when it did not need to, or worse, it keeps going when it should have escalated. And there is one structural failure that hides better than the rest.
Not every stall announces itself. A loop can hang on a step that never errors and never returns, such as a stream that goes idle without closing, so a stopping rule that only watches for errors and iteration caps will wait forever. Production loops need a watchdog that treats silence as a failure too, which means a second exit condition besides completion: blocked. The loop has to detect and act on that state rather than sit indefinitely in silence.
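A watchdog of this kind can be as small as a timeout around each step. The sketch below uses Python's `asyncio.wait_for`; the function name `run_step_with_watchdog`, the result shape, and the stand-in steps are assumptions for illustration.

```python
import asyncio


async def run_step_with_watchdog(step_coro, timeout_s: float) -> dict:
    """Await one step, treating silence past timeout_s as a failure."""
    try:
        result = await asyncio.wait_for(step_coro, timeout=timeout_s)
        return {"status": "completed", "result": result}
    except asyncio.TimeoutError:
        # No error and no response: resolve the stall into a state the loop can detect.
        return {"status": "blocked", "reason": f"no response within {timeout_s}s"}


async def idle_stream():
    """Stand-in for a call that never errors and never returns."""
    await asyncio.Event().wait()


async def quick_step():
    """Stand-in for a step that returns promptly."""
    return "ok"


stalled = asyncio.run(run_step_with_watchdog(idle_stream(), timeout_s=0.05))
finished = asyncio.run(run_step_with_watchdog(quick_step(), timeout_s=0.05))
```

`asyncio.wait_for` cancels the awaited step when the timeout expires, so the stalled call does not keep running in the background after the loop has moved to blocked.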
How do you tell these apart from the trace? You read the recorded fields for the signature each one leaves. For each step, four questions do most of the work: what did the agent see, what did it decide, what did the tool return, and what state did the loop write next. The signature is the pattern across those fields, and it points you at the likely class and the first thing to check.
| Trace signature | Likely failure class | First thing to check |
|---|---|---|
| Bad arguments, tool error, timeout, or a malformed, empty, or incomplete response | Execution, tool failure | Tool contract, required fields, and inputs |
| Decision or tool choice does not follow from the observed state | Execution, model reasoning | Context and tool descriptions visible at that step |
| Tool response is treated as completion, but verification is missing or the world disagrees | Execution, control-state | Verification read and source of truth |
| Inherited state contradicts the current world | Propagated control-state failure | Where state and world first diverged |
| Context is stale or oversized | Structural, context degradation | What the loop kept or dropped |
| Cost or steps rise with no progress | Structural, loop runaway | Stopping condition and budget |
| No response and no error | Structural, silent stall | Timeout and watchdog coverage |
| Loop exits or escalates at the wrong time | Structural, wrong escalation | Exit and escalation condition |
The table is not the lesson. The lesson is the habit: a failure leaves a signature in the recorded fields, and reading the signature is faster and more honest than guessing at a cause. Classification matters because each class points to a different kind of fix. And the signature you name here is the same thing you will turn into a test at the end.
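To make the habit concrete, here is a deliberately simplified classifier covering three rows of the table. The step keys (`timed_out`, `error`, `tool_response`, `verification_read`, `state_after`) and the returned labels are assumptions for this sketch, and real signatures will need more context than three rules can capture.

```python
def classify_step(step: dict) -> str:
    """Map one recorded step to a likely failure class using simplified rules."""
    response = step.get("tool_response")
    verification = step.get("verification_read")
    claimed = step.get("state_after", {}).get("cancellation_status")

    # No response and no error: the step went silent.
    if step.get("timed_out") and not response and not step.get("error"):
        return "structural: silent stall"
    # Explicit error, or an empty or missing response.
    if step.get("error") or not response:
        return "execution: tool failure"
    # The loop wrote a state that verification is missing for or contradicts.
    if claimed is not None and (verification or {}).get("status") != claimed:
        return "execution: control-state"
    return "no signature matched"


cancel_step = {
    "tool_response": {"status": "accepted"},
    "verification_read": {"status": "open"},
    "state_after": {"cancellation_status": "cancelled"},
}
label = classify_step(cancel_step)
```

For the cancel step from the production incident, `label` is `"execution: control-state"`, which points at the verification read and the source of truth as the first thing to check.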
Does retry help?
Once you have named the failure, the next question is what to do about it, and the most common reflex is to retry. Retry is a response strategy, not a diagnosis, and whether it helps depends entirely on the kind of failure you named.
Diagnosis and retry policy answer different questions. Diagnosis asks why a step failed: a tool error, a control-state gap, or a reasoning mistake. Retry policy asks what to do about it, and the right answer depends on the nature of the failure, not on where it happened. A transient fault, such as a rate limit or dropped connection, may justify a short backoff and another attempt. A failure that recurs identically needs a fix, not another call that spends budget reaching the same wall. A condition the current run cannot resolve, such as a bad credential or missing token, should stop the loop and surface the problem rather than spend the remaining budget retrying. Retrying every failure the same way is how a system manages to be expensive and unreliable at once.
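The three cases can be written down as a small policy function. This is a sketch, not a prescribed policy: the `FailureKind` names, the action labels, the attempt limit, and the backoff constants are all assumptions chosen for the example.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"          # rate limit, dropped connection
    DETERMINISTIC = "deterministic"  # recurs identically on every attempt
    UNRESOLVABLE = "unresolvable"    # bad credential, missing token


def next_action(kind: FailureKind, attempt: int,
                max_attempts: int = 3, base_delay_s: float = 0.5) -> dict:
    """Choose a response from the failure kind; attempt counts from 1."""
    if kind is FailureKind.TRANSIENT:
        if attempt < max_attempts:
            # Exponential backoff: 0.5s, 1.0s, 2.0s, ...
            return {"action": "retry", "delay_s": base_delay_s * 2 ** (attempt - 1)}
        return {"action": "stop_and_surface", "reason": "retry budget exhausted"}
    if kind is FailureKind.DETERMINISTIC:
        return {"action": "fix_required", "reason": "same failure would recur"}
    return {"action": "stop_and_surface", "reason": "run cannot resolve this condition"}


first_try = next_action(FailureKind.TRANSIENT, attempt=1)
bad_credential = next_action(FailureKind.UNRESOLVABLE, attempt=1)
```

Only the transient branch ever returns `retry`; the other two kinds never spend budget on another attempt.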
It is worth being deliberate about that last case, because "stop" sounds like giving up and it is not. When the loop reaches a blocker or a condition it cannot resolve safely, it should stop and surface the problem. Surfacing means handing the situation to a person with enough of the trace for them to act, rather than letting the loop spend its budget retrying into a wall or, worse, continuing past the blocker as if it were cleared. An explicit stopping condition is a control, not a failure of nerve.
There is a subtlety in retrying that the trace makes visible. Before you replay a step, look at what the step already did. A step that completed part of its work, or that called a model and got a non-deterministic answer, is not safe to re-run blindly: replaying it can repeat an effect or produce a different result than the one the rest of the run assumed.
This is the backend discipline of idempotency: repeating the same request should not create a second side effect. Before retrying a side-effecting call, reuse the same idempotency key so the system recognizes it as the same operation, or check the source of truth to see whether the effect already happened. Without that protection, retrying issue_refund can issue the refund twice.
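The effect of an idempotency key is easy to see with an in-memory stand-in. `FakeRefundService` below is hypothetical and does not model any real payments API. It only shows the contract: the same key returns the original result instead of creating a second refund.

```python
import uuid


class FakeRefundService:
    """In-memory stand-in for a backend that honors idempotency keys."""

    def __init__(self):
        self._by_key = {}
        self.refunds_created = 0

    def issue_refund(self, order_id: str, amount_cents: int, idempotency_key: str) -> dict:
        # A key seen before means the same logical operation: return the stored result.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        self.refunds_created += 1
        refund = {
            "refund_id": f"rf_{self.refunds_created}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self._by_key[idempotency_key] = refund
        return refund


service = FakeRefundService()
key = str(uuid.uuid4())  # generated once per logical refund and stored with the step
first = service.issue_refund("A-1001", 2500, idempotency_key=key)
retried = service.issue_refund("A-1001", 2500, idempotency_key=key)
```

The key has to be created once and persisted alongside the step. Generating a fresh key on each retry would defeat the protection, because the backend would see two distinct operations.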
The safer move is to classify what failed, preserve the work that already succeeded, and retry only the part that genuinely needs it. Which is the same point from a different angle: the kind of failure decides the response, and a uniform retry ignores the one thing that should drive the decision.
Add an eval so it does not come back
Fixing the failure in front of you is satisfying and incomplete. The run that failed will not be the last run, and a fix you cannot verify on the next deploy is a fix you are trusting on faith. The durable move is to turn the failure into a test, so the same failure cannot return quietly.
The trace makes this concrete, because the trace already contains the failed case. Start with a real failure, the one you just diagnosed. Turn it into a clearly specified task: cancel an order, verify the cancellation, then issue one refund. Define what success actually means as something you can check, not a vague sense that it worked. Write a grader that checks that condition. Run it more than once, because an agent is not deterministic and a single passing run does not prove the behavior is reliable. Then keep the case in a regression suite so it runs on every change.
A note on words, because they get used loosely. Terms such as transcript, trace, and trajectory are used differently across systems. In this article, trace means the recorded path of the run, while outcome means the resulting world state. Both matter, and they catch different problems. Checking only the outcome can miss unsafe or invalid behavior along the way: the agent reached the right answer but used the wrong tool, leaked data, retried a dozen times, or skipped an approval. Checking only the path can be too rigid, rejecting a valid run that reached the right result by a different route than the one you expected. So grade the outcome where you can, inspect the trace when the path is what matters, and do not force one exact sequence unless the sequence is part of the requirement.
For the cancel-then-refund failure, the assertion writes itself: a consequential action must not proceed unless the verification read confirms the required terminal state. Concretely, the grader is a strict check against the source of truth: no refund may be created unless the order status is actually cancelled. That single check, run across several trials, turns the exact failure you saw in production into something the suite will catch before it ships again.
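A grader for that assertion, run across several trials, might look like the sketch below. The world-state shape (`order_status`, `refunds`), the function names, and the two stand-in agents are assumptions. The stand-ins are deterministic so the example is reproducible, whereas a real agent would vary between trials.

```python
def grade_refund_guard(world: dict) -> bool:
    """Pass if no refund exists, or the order is actually cancelled."""
    return not world["refunds"] or world["order_status"] == "cancelled"


def run_trials(run_agent, n_trials: int = 5) -> dict:
    """Run the task n_trials times and grade the resulting world state each time."""
    results = [grade_refund_guard(run_agent(trial)) for trial in range(n_trials)]
    return {"passed": sum(results), "trials": n_trials, "all_passed": all(results)}


def buggy_agent(trial: int) -> dict:
    """Stand-in: cancellation silently fails on odd trials, the refund goes out anyway."""
    cancelled = trial % 2 == 0
    return {"order_status": "cancelled" if cancelled else "open",
            "refunds": [{"amount_cents": 2500}]}


def guarded_agent(trial: int) -> dict:
    """Stand-in: same cancellation failures, but the refund waits for verification."""
    cancelled = trial % 2 == 0
    return {"order_status": "cancelled" if cancelled else "open",
            "refunds": [{"amount_cents": 2500}] if cancelled else []}


buggy_report = run_trials(buggy_agent)
guarded_report = run_trials(guarded_agent)
```

The buggy stand-in passes three of five trials, which is the reason to run more than once: a single run on an even trial would have reported success.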
Structural failures can be turned into checks too, though the assertion looks different. For a step that stalls, the test is not about a final answer; it is that the run stops with a defined blocked state or surfaces the failure within its turn, time, or retry budget rather than hanging. The check is on the path and the stopping behavior, not the outcome. (The word blocked here is this series' own; the point is that the stall has to resolve into a state the loop can detect.)
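A check on stopping behavior can be sketched the same way. The run summary keys (`stop_reason`, `steps`, `elapsed_s`) and the set of allowed stop reasons are assumptions for this example; the grader never looks at the final answer.

```python
ALLOWED_STOP_REASONS = {"completed", "blocked", "escalated"}


def grade_stop_behavior(run: dict, max_steps: int, max_seconds: float) -> bool:
    """Pass if the run ended in a defined state and stayed within its budgets."""
    # Missing step or time data defaults to infinity, so it fails the check.
    return (
        run.get("stop_reason") in ALLOWED_STOP_REASONS
        and run.get("steps", float("inf")) <= max_steps
        and run.get("elapsed_s", float("inf")) <= max_seconds
    )


stalled_but_detected = {"stop_reason": "blocked", "steps": 4, "elapsed_s": 31.0}
hung_until_killed = {"stop_reason": None, "steps": 4, "elapsed_s": 900.0}

detected_ok = grade_stop_behavior(stalled_but_detected, max_steps=10, max_seconds=60.0)
hung_ok = grade_stop_behavior(hung_until_killed, max_steps=10, max_seconds=60.0)
```

A run that stalls but resolves into blocked within budget passes; a run that had to be killed from outside does not.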
You do not need a large suite to start. A useful first set is on the order of twenty to fifty real tasks, drawn from the checks you already run by hand, the failures you have actually seen in production, bug reports, and the edge cases you know are dangerous. The point is not coverage of everything. The point is that the failures you have already paid for become failures you never pay for twice.
Three takeaways
A well-instrumented trace helps distinguish execution failures from structural ones. The recorded fields carry a signature; reading it beats guessing at a cause or reaching for a bigger model.
Retry is a separate decision from diagnosis. The failure class decides whether a retry helps, hurts, or just spends budget arriving at the same wall. Classify first, then choose the response.
Every production failure is a test case waiting to be written. Turn the trace into a task and a grader, run it across several trials, and add it to the suite before the next page fires.
Next: even a correctly running loop can fail at its boundaries, including what the agent may see, do, and remember, and a wrong action may be induced by untrusted input or inherited from poisoned memory. That is Part 8.
Source note: this article builds on the evaluation concepts in Anthropic's Demystifying Evals for AI Agents, and on the human-oversight, simplicity, and tool-design principles in Building Effective Agents (Schluntz & Zhang). The failure framing, the trace-signature reading, the retry classification, and the diagnostic workflow are this series' own synthesis.
Part of AI in Practice: three practical series on MCP, RAG, and AI Agents, focused on why these patterns exist, where they break, and how to think through the engineering decisions behind them.
