A few months into building AI agents for client projects, we hit a pattern that should sound familiar to anyone shipping this technology beyond the demo stage: the agent worked beautifully in front of stakeholders, then quietly fell apart the moment real users got their hands on it.
Not catastrophically. That would've been easier to catch.
A tool call would be made with a slightly malformed argument and get stuck in a retry loop. A multi-step task would drift away from its original objective halfway through execution. An agent would confidently report success while accomplishing nothing useful at all.
Nothing crashed. Nobody got paged. The damage was a slow leak of trust.
That's the moment we stopped treating reliability as something the model would eventually supply on its own and started treating it as something we had to engineer for directly.
Demos Lie About Reliability
A demo is a curated path through a system.
You ask the question you know it handles well, in the phrasing you know it understands, and you stop before it has the chance to wander.
Production doesn't give you that courtesy.
Users paraphrase. They contradict themselves halfway through a conversation. They paste malformed data. They ask for things that are three steps removed from anything in your evaluation set.
The uncomfortable realization for us was that an agent's reliability in the real world has very little to do with how impressive it looked across fifteen carefully selected examples.
It has everything to do with how it behaves on the long tail—the situations nobody anticipated.
The Incident That Changed Our Thinking
One workflow in particular forced us to rethink our assumptions.
We had an agent responsible for collecting information from multiple sources and updating records in an external system. Most of the time it worked perfectly.
Then we started noticing duplicate records appearing sporadically.
After digging through logs, we found the culprit.
The external system completed the update, but the request timed out before the response reached the agent. The agent interpreted the timeout as a failure and retried the action. Since the update had already succeeded, the retry created a duplicate.
The model didn't hallucinate.
The reasoning wasn't wrong.
The failure came from how the surrounding system handled uncertainty.
That realization changed how we approached reliability.
Reliability Isn't One Thing
For a long time, we treated reliability as a single fuzzy goal.
The problem with that approach is that you can't improve what you can't define.
Breaking reliability into separate concerns made it much easier to reason about:
Determinism
Does the same input produce roughly the same behavior each time, or does the agent behave differently on every run?
Failure Visibility
When something goes wrong, does the system fail loudly and clearly, or does it generate a confident-sounding but incorrect answer?
Recoverability
If a workflow fails halfway through execution, can it resume safely, or does it need to start from scratch?
Boundedness
Does the agent know when to stop, or can it continue calling tools indefinitely because it never reaches a satisfying conclusion?
Once we started treating these as separate engineering problems, reliability became much easier to improve.
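Boundedness is the easiest of the four to show in code. The following is a minimal sketch, assuming the agent loop can be modeled as a function called once per step. The names `run_bounded`, `StepResult`, and `StepBudgetExceeded` are hypothetical and not taken from any framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class StepBudgetExceeded(Exception):
    """Raised when the loop runs out of steps without reaching a result."""


@dataclass
class StepResult:
    done: bool
    output: Optional[str] = None


def run_bounded(step: Callable[[int], StepResult], max_steps: int = 8) -> str:
    """Call `step` until it reports done, or fail loudly at the step limit."""
    for i in range(max_steps):
        result = step(i)
        if result.done:
            return result.output or ""
    raise StepBudgetExceeded(f"no result after {max_steps} steps")


# Hypothetical stand-ins for real agent steps.
def finishes_on_third_step(i: int) -> StepResult:
    return StepResult(done=(i == 2), output="summary ready" if i == 2 else None)


def never_finishes(i: int) -> StepResult:
    return StepResult(done=False)
```

Raising an exception at the limit, rather than returning whatever the agent had so far, also serves failure visibility: the caller cannot mistake an exhausted budget for a finished task.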
What We Actually Changed
Smaller Tools Instead of Clever Ones
Our early tool definitions tried to be flexible.
A single tool might accept numerous optional parameters and support several different workflows.
In theory, that made development easier.
In practice, it increased ambiguity.
The model had too many ways to call the same tool, and we had too many code paths to validate.
We replaced these with smaller, narrowly scoped tools that performed one job well and enforced strict schemas.
The reduction in malformed tool calls was immediate.
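To make "one job, strict schema" concrete, here is a small sketch of argument parsing for a single-purpose tool. The tool itself (updating an email address on a record) and the names `UpdateEmailArgs` and `parse_update_email_args` are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateEmailArgs:
    record_id: str
    email: str


ALLOWED_KEYS = {"record_id", "email"}


def parse_update_email_args(raw: dict) -> UpdateEmailArgs:
    """Accept exactly the two expected string fields and nothing else."""
    unknown = set(raw) - ALLOWED_KEYS
    missing = ALLOWED_KEYS - set(raw)
    if unknown or missing:
        raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
    if not all(isinstance(raw[k], str) and raw[k] for k in ALLOWED_KEYS):
        raise ValueError("record_id and email must be non-empty strings")
    if "@" not in raw["email"]:
        raise ValueError("email must contain '@'")
    return UpdateEmailArgs(record_id=raw["record_id"], email=raw["email"])
```

With no optional parameters, there is only one valid shape for a call, so any deviation is rejected before it reaches the code that performs the update.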
Validating Outputs Like Untrusted Input
Because that's exactly what they are.
Every structured response now passes through schema validation before it can trigger a real action.
Validation failures are treated as expected branches in the workflow rather than exceptional situations.
This single change prevented numerous downstream failures.
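One way to treat validation failure as an ordinary branch is to return a result object instead of raising. The sketch below assumes a hypothetical refund workflow where the model emits JSON with `order_id` and `amount_cents`. The field names and the `validate_refund` and `handle` functions are illustrative only.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Validation:
    ok: bool
    value: Optional[dict] = None
    error: Optional[str] = None


def validate_refund(raw_text: str) -> Validation:
    """Check a model's JSON output before anything acts on it."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return Validation(ok=False, error=f"not valid JSON: {exc.msg}")
    if not isinstance(data, dict):
        return Validation(ok=False, error="expected a JSON object")
    amount = data.get("amount_cents")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool) or amount <= 0:
        return Validation(ok=False, error="amount_cents must be a positive integer")
    if not isinstance(data.get("order_id"), str):
        return Validation(ok=False, error="order_id must be a string")
    return Validation(ok=True, value=data)


def handle(raw_text: str) -> str:
    result = validate_refund(raw_text)
    if not result.ok:
        # An ordinary branch: feed the error back to the model or escalate.
        return f"rejected: {result.error}"
    return f"queued refund for {result.value['order_id']}"
```

Because the rejection path returns a value rather than throwing, the workflow has to decide explicitly what happens next, which is the point.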
Idempotency and Circuit Breakers
Retries are useful until they aren't.
Some of our strangest bugs came from retrying actions that had partially succeeded.
We introduced idempotent operations wherever possible and capped retries with circuit breakers instead of allowing endless loops.
When failures happen now, they fail cleanly and visibly.
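The duplicate-record incident described earlier can be reproduced and fixed in a few lines. This is a sketch under assumptions: `FlakyRecordStore` is a made-up stand-in for the external system, and the retry cap shown here is simpler than a full circuit breaker, which would also track failures across calls and stop sending requests for a cooling-off period.

```python
class FlakyRecordStore:
    """Stand-in for the external system: applies the write, then times out once."""

    def __init__(self):
        self.records = {}  # idempotency key -> payload
        self.calls = 0

    def create(self, key: str, payload: dict) -> str:
        self.calls += 1
        # Writing under the key makes a repeated request a no-op.
        self.records.setdefault(key, payload)
        if self.calls == 1:
            raise TimeoutError("response lost after the write succeeded")
        return key


class RetriesExhausted(Exception):
    """Raised when every attempt timed out."""


def create_with_retries(store, key: str, payload: dict, max_attempts: int = 3) -> str:
    """Retry timeouts a bounded number of times, reusing the same key."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return store.create(key, payload)
        except TimeoutError as exc:
            last_error = exc
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error


store = FlakyRecordStore()
result = create_with_retries(store, "job-42:create", {"name": "Ada"})
```

The first call writes the record and times out, the second call hits the same key and changes nothing, so the store ends up with one record instead of two. The key has to be derived from the task rather than generated per attempt, otherwise each retry looks like a new request.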
Checkpointing Agent State
For longer workflows, we persist state after each completed step.
If a seven-step process fails at step four, the agent resumes from step four instead of repeating the first three actions.
This reduced duplicate side effects and made recovery significantly more predictable.
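A minimal sketch of the idea, assuming steps can be identified by name and that a JSON file is an acceptable place to persist progress. The function `run_with_checkpoints` and the three example steps are hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoints(steps, path: str) -> list:
    """Run steps in order, skipping any that a previous run already completed."""
    done = []
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue
        fn()
        done.append(name)
        # Persist after each completed step so a crash loses at most one step.
        with open(path, "w") as f:
            json.dump(done, f)
    return done


log = []
attempts = {"update": 0}


def fetch():
    log.append("fetch")


def update():
    attempts["update"] += 1
    if attempts["update"] == 1:
        raise RuntimeError("simulated failure")
    log.append("update")


def notify():
    log.append("notify")


steps = [("fetch", fetch), ("update", update), ("notify", notify)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run_with_checkpoints(steps, path)  # first run fails at "update"
except RuntimeError:
    pass

run_with_checkpoints(steps, path)  # second run resumes at "update"
```

After both runs, `fetch` has executed once, not twice. Note that the step which failed is re-run in full, so checkpointing works best alongside idempotent operations: a step that crashed after its side effect but before its checkpoint will be repeated.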
Human Approval for Irreversible Actions
Sending emails.
Charging cards.
Deleting records.
Publishing content.
These actions now pass through explicit approval gates rather than relying solely on the model's confidence.
Confidence and correctness are not the same signal.
Treating them as if they are creates unnecessary risk.
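An approval gate can be as simple as a routing decision made outside the model. The sketch below assumes actions are identified by name. The action names and the `ApprovalQueue` class are illustrative, not an existing library.

```python
from dataclasses import dataclass, field

# Hypothetical action names for the irreversible operations listed above.
IRREVERSIBLE = {"send_email", "charge_card", "delete_record", "publish_content"}


@dataclass
class ApprovalQueue:
    pending: list = field(default_factory=list)
    executed: list = field(default_factory=list)

    def submit(self, action: str, args: dict) -> str:
        """Run reversible actions immediately; park irreversible ones for a human."""
        if action in IRREVERSIBLE:
            self.pending.append((action, args))
            return "pending_approval"
        self.executed.append((action, args))
        return "executed"

    def approve(self, index: int = 0) -> None:
        """Called by a human reviewer, never by the agent."""
        self.executed.append(self.pending.pop(index))


queue = ApprovalQueue()
lookup_status = queue.submit("lookup_order", {"order_id": "A1"})
charge_status = queue.submit("charge_card", {"order_id": "A1", "amount_cents": 500})
```

The gate keys off the action type, not off anything the model says about how sure it is, so a confident but wrong decision still stops at the queue.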
Turning Evals into Regression Tests
Most teams run evaluations before deployment.
We started treating them as a permanent regression suite.
Every time an agent failed in production, we captured the example and added it to our test set.
That meant every future change had to prove it wasn't reintroducing an old failure.
Some of our most promising "improvements" turned out to solve one problem while creating three new ones.
Without regression testing, we never would've noticed.
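In code, a regression suite of this kind is a list of captured cases plus a loop. The sketch below assumes a simple substring check as the pass criterion. The two cases and the `candidate_agent` function are invented for illustration and are not real production examples.

```python
# Each case records an input that once failed and a condition that must now hold.
REGRESSION_CASES = [
    {"id": "dup-on-timeout", "input": "update record 17", "must_contain": "record 17"},
    {"id": "empty-input", "input": "", "must_contain": "clarify"},
]


def run_regressions(agent, cases) -> list:
    """Return the ids of cases the candidate agent fails."""
    failures = []
    for case in cases:
        output = agent(case["input"])
        if case["must_contain"] not in output:
            failures.append(case["id"])
    return failures


def candidate_agent(text: str) -> str:
    # Hypothetical stand-in for a real agent call.
    if not text.strip():
        return "Could you clarify what you need?"
    return f"Done: {text}"


failures = run_regressions(candidate_agent, REGRESSION_CASES)
```

An empty `failures` list means the candidate has not reintroduced any known failure. Real checks are usually richer than a substring match (tool-call sequences, side effects, schema validity), but the structure stays the same.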
Tracing Every Step
This was the least glamorous improvement and probably the most valuable.
We began tracing every reasoning step, tool call, validation check, and decision point.
Debugging stopped feeling like archaeology.
The majority of our mysterious failures became obvious once we could see the sequence of events that led to them.
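Tracing does not need heavy infrastructure to be useful. Here is a minimal sketch, assuming one trace object per agent run and JSON lines as the output format. The `Trace` class and the event kinds are hypothetical; a production setup would more likely use an existing tracing library.

```python
import json
import time
import uuid


class Trace:
    """Collects an ordered list of events for one agent run."""

    def __init__(self):
        self.run_id = uuid.uuid4().hex
        self.events = []

    def record(self, kind: str, **detail) -> None:
        """Append one event with a sequence number and timestamp."""
        self.events.append({
            "run_id": self.run_id,
            "seq": len(self.events),
            "ts": time.time(),
            "kind": kind,
            **detail,
        })

    def dump(self) -> str:
        """One JSON object per line, easy to grep or load into a log store."""
        return "\n".join(json.dumps(event) for event in self.events)


trace = Trace()
trace.record("tool_call", tool="update_email", args={"record_id": "r1"})
trace.record("validation", ok=True)
trace.record("decision", note="no retry needed")
```

The shared `run_id` and the `seq` counter are what turn scattered log lines into a sequence of events that can be read in order.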
None of This Made the Agent Smarter
That's the part worth emphasizing.
None of these changes improved the model's reasoning ability.
What they did was reduce the number of ways a reasoning mistake could become a production problem.
They made failures visible.
They made failures recoverable.
They reduced the blast radius when things inevitably went wrong.
That distinction changed how we scope projects today.
We no longer start by asking:
Can the model perform this task?
Increasingly, the answer is yes.
Instead, we ask:
When this fails—and it will—what does failure look like, who sees it, and how do we recover?
That question turns out to be far more important.
If You're Building AI Agents Right Now
A few lessons we'd share with teams early in their journey:
- Write down failure cases before writing prompts.
- Don't let one tool do five jobs.
- Validate every structured output.
- Log intermediate reasoning and tool calls, not just final answers.
- Treat retries as a reliability strategy, not a default reaction.
- Decide which actions are reversible and which aren't.
- Add real production failures to your evaluation suite.
The happy path is rarely the hard part.
The edge cases are where reliability is won or lost.
Final Thoughts
We're still finding new ways for agents to surprise us.
That part probably never goes away.
But the failures look different now.
They're visible instead of silent.
Bounded instead of endless.
Recoverable instead of catastrophic.
For production systems, that's most of the battle.
As models continue to improve, reliability will increasingly become an engineering challenge rather than a model-quality problem.
The teams that recognize that shift early will build systems users can trust.
If you're working through similar challenges, we'd be especially interested in how you're approaching recoverability and state management in long-running agent workflows. It's one of the areas we're still actively refining.