Pallavi Sharma

Posted on Jun 18

The Reliability Problem That Forced Us to Rethink AI Agents

#ai #googleaichallenge #opensource #machinelearning

A few months into building AI agents for client projects, we hit a pattern that should sound familiar to anyone shipping this technology beyond the demo stage: the agent worked beautifully in front of stakeholders, then quietly fell apart the moment real users got their hands on it.

Not catastrophically. That would've been easier to catch.

A tool call would go out with a slightly malformed argument and get stuck in a retry loop. A multi-step task would drift away from its original objective halfway through execution. An agent would confidently report success while accomplishing nothing useful at all.

Nothing crashed. Nobody got paged. The damage was a slow leak of trust.

That's the moment we stopped treating reliability as something the model would eventually grow into and started treating it as something we had to engineer for directly.

Demos Lie About Reliability

A demo is a curated path through a system.

You ask the question you know it handles well, in the phrasing you know it understands, and you stop before it has the chance to wander.

Production doesn't give you that courtesy.

Users paraphrase. They contradict themselves halfway through a conversation. They paste malformed data. They ask for things that are three steps removed from anything in your evaluation set.

The uncomfortable realization for us was that an agent's reliability in the real world has very little to do with how impressive it looked across fifteen carefully selected examples.

It has everything to do with how it behaves on the long tail—the situations nobody anticipated.

The Incident That Changed Our Thinking

One workflow in particular forced us to rethink our assumptions.

We had an agent responsible for collecting information from multiple sources and updating records in an external system. Most of the time it worked perfectly.

Then we started noticing duplicate records appearing sporadically.

After digging through logs, we found the culprit.

The external system completed the update, but the request timed out before the response reached the agent. The agent interpreted the timeout as a failure and retried the action. Since the update had already succeeded, the retry created a duplicate.

The model didn't hallucinate.

The reasoning wasn't wrong.

The failure came from how the surrounding system handled uncertainty.

That realization changed how we approached reliability.
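One way to picture the distinction, as a minimal sketch rather than our production code: a timeout is modeled as its own outcome, separate from a confirmed failure. The names `Outcome`, `classify`, `should_retry`, and the `already_written` callback are all hypothetical, introduced only for illustration.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    FAILED = "failed"      # the system told us the write did not happen
    UNKNOWN = "unknown"    # no answer arrived; the write may or may not have landed


def classify(error):
    """Map the result of a call to an outcome. `error` is None on success."""
    if error is None:
        return Outcome.OK
    if isinstance(error, TimeoutError):
        return Outcome.UNKNOWN
    return Outcome.FAILED


def should_retry(outcome, already_written):
    """Retry a confirmed failure; for an unknown outcome, look before writing again.

    `already_written` is a hypothetical callback that re-reads the external
    system and reports whether the earlier write is present.
    """
    if outcome is Outcome.FAILED:
        return True
    if outcome is Outcome.UNKNOWN:
        return not already_written()
    return False
```

In the incident above, the agent effectively collapsed `UNKNOWN` into `FAILED`, which is what made the retry unsafe.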

Reliability Isn't One Thing

For a long time, we treated reliability as a single fuzzy goal.

The problem with that approach is that you can't improve what you can't define.

Breaking reliability into separate concerns made it much easier to reason about:

Determinism

Does the same input produce roughly the same behavior each time, or does the agent behave differently on every run?

Failure Visibility

When something goes wrong, does the system fail loudly and clearly, or does it generate a confident-sounding but incorrect answer?

Recoverability

If a workflow fails halfway through execution, can it resume safely, or does it need to start from scratch?

Boundedness

Does the agent know when to stop, or can it continue calling tools indefinitely because it never reaches a satisfying conclusion?

Once we started treating these as separate engineering problems, reliability became much easier to improve.
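Boundedness is probably the easiest of the four to show in code. The sketch below assumes the agent loop can be expressed as a `step` function that returns an answer or `None`; `run_bounded`, `RunResult`, and the default budget of 8 are illustrative names and values, not something from our codebase.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    status: str              # "done" or "budget_exhausted"
    steps_used: int
    answer: Optional[str]


def run_bounded(step: Callable[[int], Optional[str]], max_steps: int = 8) -> RunResult:
    """Call `step` until it returns an answer or the step budget runs out."""
    for i in range(max_steps):
        answer = step(i)
        if answer is not None:
            return RunResult("done", i + 1, answer)
    # An explicit, visible outcome instead of looping indefinitely.
    return RunResult("budget_exhausted", max_steps, None)


def never_done(i: int) -> Optional[str]:
    """A step that never reaches a conclusion."""
    return None


def done_on_third(i: int) -> Optional[str]:
    """A step that finishes on its third call."""
    return "ok" if i == 2 else None
```

Returning `budget_exhausted` as a normal result, rather than raising, also helps with failure visibility: the caller has to decide what to do with it.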

What We Actually Changed

Smaller Tools Instead of Clever Ones

Our early tool definitions tried to be flexible.

A single tool might accept numerous optional parameters and support several different workflows.

In theory, that made development easier.

In practice, it increased ambiguity.

The model had too many ways to call the same tool, and we had too many code paths to validate.

We replaced these with smaller, narrowly scoped tools that performed one job well and enforced strict schemas.

The reduction in malformed tool calls was immediate.
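To make "one job, strict schema" concrete, here is a rough sketch of a single narrow tool. The tool itself (updating a contact's email) and the names `UpdateEmailArgs` and `parse_update_email_args` are hypothetical; the point is only that unknown, missing, or mistyped fields are rejected rather than interpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateEmailArgs:
    """Arguments for one narrowly scoped, hypothetical tool."""
    contact_id: str
    email: str


def parse_update_email_args(raw: dict) -> UpdateEmailArgs:
    """Reject unknown, missing, or mistyped fields instead of guessing intent."""
    expected = {"contact_id", "email"}
    extra = set(raw) - expected
    missing = expected - set(raw)
    if extra or missing:
        raise ValueError(f"unexpected={sorted(extra)} missing={sorted(missing)}")
    for name in sorted(expected):
        if not isinstance(raw[name], str) or not raw[name].strip():
            raise ValueError(f"{name} must be a non-empty string")
    # Deliberately shallow check; real email validation is out of scope here.
    if "@" not in raw["email"]:
        raise ValueError("email must contain '@'")
    return UpdateEmailArgs(contact_id=raw["contact_id"], email=raw["email"])
```

With no optional parameters, there is exactly one valid shape for a call, which leaves the model fewer ways to get it wrong.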

Validating Outputs Like Untrusted Input

Because that's exactly what they are.

Every structured response now passes through schema validation before it can trigger a real action.

Validation failures are treated as expected branches in the workflow rather than exceptional situations.

This single change prevented numerous downstream failures.
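A sketch of what "validation failure as an expected branch" can look like, using only the standard library. The refund payload, its field names, and the branch labels `execute` and `ask_model_to_repair` are assumptions made up for this example.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Validation:
    ok: bool
    value: dict[str, Any] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)


def validate_refund(raw_text: str) -> Validation:
    """Validate a hypothetical 'refund' payload produced by the model."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return Validation(ok=False, errors=[f"not JSON: {exc.msg}"])
    if not isinstance(data, dict):
        return Validation(ok=False, errors=["top level must be an object"])
    errors = []
    if not isinstance(data.get("order_id"), str):
        errors.append("order_id must be a string")
    amount = data.get("amount_cents")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool) or amount <= 0:
        errors.append("amount_cents must be a positive integer")
    if errors:
        return Validation(ok=False, errors=errors)
    return Validation(ok=True, value=data)


def next_step(raw_text: str) -> str:
    """A validation failure picks a different branch; it does not raise."""
    result = validate_refund(raw_text)
    return "execute" if result.ok else "ask_model_to_repair"
```

Because `validate_refund` returns a result instead of raising, the workflow has to name what happens on failure, which is the whole point.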

Idempotency and Circuit Breakers

Retries are useful until they aren't.

Some of our strangest bugs came from retrying actions that had partially succeeded.

We introduced idempotent operations wherever possible and capped retries with circuit breakers instead of allowing endless loops.

When failures happen now, they fail cleanly and visibly.
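The sketch below shows an idempotency key combined with a capped retry loop. It is a simplification: `RecordStore` is an in-memory stand-in for the external system, `flaky_send_factory` simulates a response that is lost after the write lands, and the cap is a plain attempt limit rather than a full circuit breaker, which would also track failures across calls. All names are hypothetical.

```python
class RecordStore:
    """In-memory stand-in for the external system (hypothetical)."""

    def __init__(self):
        self.records = []
        self._seen = {}  # idempotency key -> record id

    def create(self, key: str, payload: dict) -> int:
        # A repeated key returns the original result instead of writing again.
        if key in self._seen:
            return self._seen[key]
        self.records.append(payload)
        record_id = len(self.records)
        self._seen[key] = record_id
        return record_id


class RetryBudgetExceeded(Exception):
    """Raised when the attempt limit is reached, so the failure is visible."""


def create_with_retries(store, key, payload, send, max_attempts=3):
    """`send` performs the call and may raise TimeoutError after the write landed."""
    for _ in range(max_attempts):
        try:
            return send(store, key, payload)
        except TimeoutError:
            # Outcome unknown: retrying is only safe because `key` makes it a no-op.
            continue
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts")


def flaky_send_factory(timeouts: int):
    """Build a sender whose first `timeouts` calls write, then lose the response."""
    state = {"left": timeouts}

    def send(store, key, payload):
        record_id = store.create(key, payload)
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("response lost")
        return record_id

    return send
```

With two simulated timeouts and the default limit of three attempts, the call eventually returns and the store holds one record; with more timeouts than attempts, it raises `RetryBudgetExceeded` and the store still holds one record.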

Checkpointing Agent State

For longer workflows, we persist state after each completed step.

If a seven-step process fails at step four, the agent resumes from step four instead of repeating the first three actions.

This reduced duplicate side effects and made recovery significantly more predictable.
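Here is a minimal sketch of the idea, assuming steps are plain callables with JSON-serializable results and the checkpoint is a local JSON file. `run_workflow`, `make_step`, and the file name are illustrative; a real system would more likely use a database and an atomic write.

```python
import json
import tempfile
from pathlib import Path


def run_workflow(steps, checkpoint: Path):
    """Run `steps` in order, persisting the index of the next step after each one."""
    state = {"next": 0, "log": []}
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    for i in range(state["next"], len(steps)):
        result = steps[i]()            # may raise; the checkpoint still points at step i
        state["log"].append(result)
        state["next"] = i + 1
        checkpoint.write_text(json.dumps(state))  # not atomic; fine for a sketch
    return state["log"]


calls = []                  # records every step that actually executed
fail_once = {"armed": True}


def make_step(name):
    def step():
        if name == "step4" and fail_once["armed"]:
            fail_once["armed"] = False
            raise RuntimeError("transient failure")
        calls.append(name)
        return name
    return step


steps = [make_step(f"step{n}") for n in range(1, 8)]
path = Path(tempfile.mkdtemp()) / "checkpoint.json"

try:
    run_workflow(steps, path)      # steps 1-3 run, step 4 raises
except RuntimeError:
    pass
log = run_workflow(steps, path)    # resumes at step 4; steps 1-3 are not repeated
```

One caveat worth stating: if the process dies after a step's side effect but before the checkpoint is written, that step runs again on resume, so checkpointing complements idempotency rather than replacing it.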

Human Approval for Irreversible Actions

Sending emails.

Charging cards.

Deleting records.

Publishing content.

These actions now pass through explicit approval gates rather than relying solely on the model's confidence.

Confidence and correctness are not the same signal.

Treating them as if they are creates unnecessary risk.
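An approval gate can be a very small piece of code. In this sketch, the action names mirror the examples above, while `Action`, `execute`, and the `approve` callback are hypothetical; in practice `approve` would block on a human decision rather than return immediately.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names for the irreversible actions listed above.
IRREVERSIBLE = {"send_email", "charge_card", "delete_record", "publish_content"}


@dataclass
class Action:
    name: str
    args: dict


def execute(action: Action,
            run: Callable[[Action], str],
            approve: Callable[[Action], bool]) -> str:
    """Run reversible actions directly; route irreversible ones through `approve`."""
    if action.name in IRREVERSIBLE and not approve(action):
        return "rejected"
    return run(action)
```

Note that the gate keys off the action name, not anything the model says about its own confidence.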

Turning Evals into Regression Tests

Most teams run evaluations before deployment.

We started treating them as a permanent regression suite.

Every time an agent failed in production, we captured the example and added it to our test set.

That meant every future change had to prove it wasn't reintroducing an old failure.

Some of our most promising "improvements" turned out to solve one problem while creating three new ones.

Without regression testing, we never would've noticed.
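As a rough illustration, captured failures can live in a JSON Lines file and be replayed against every change. The case format, the field names `must_call` and `max_calls`, and the `agent` callable (which here just returns the tool names it called) are assumptions for the sake of the example, not our actual harness.

```python
import json

# Each captured production failure becomes a permanent case (hypothetical format).
CASES_JSONL = """\
{"id": "dup-on-timeout", "input": "update record 42", "must_call": "update_record", "max_calls": 1}
{"id": "runaway-search", "input": "find the invoice", "must_call": "search", "max_calls": 3}
"""


def load_cases(text: str) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def check_case(case: dict, tool_calls: list) -> list:
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []
    count = tool_calls.count(case["must_call"])
    if count == 0:
        failures.append(f"{case['id']}: never called {case['must_call']}")
    if count > case["max_calls"]:
        failures.append(f"{case['id']}: called {case['must_call']} {count} times")
    return failures


def run_suite(cases: list, agent) -> list:
    """`agent` maps an input string to the list of tool names it called."""
    failures = []
    for case in cases:
        failures.extend(check_case(case, agent(case["input"])))
    return failures
```

An empty list from `run_suite` means no captured failure has come back; anything else names the case that regressed.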

Tracing Every Step

This was the least glamorous improvement and probably the most valuable.

We began tracing every reasoning step, tool call, validation check, and decision point.

Debugging stopped feeling like archaeology.

The majority of our mysterious failures became obvious once we could see the sequence of events that led to them.
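Tracing does not need heavy tooling to start paying off. The sketch below records one event per step with a shared run ID and a sequence number; `Tracer`, `span`, and the event fields are hypothetical names, and a real setup would likely ship these events to a tracing backend rather than keep them in a list.

```python
import json
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects one event per step, in the order the steps started."""

    def __init__(self):
        self.run_id = uuid.uuid4().hex
        self.events = []

    @contextmanager
    def span(self, kind: str, **fields):
        event = {"run_id": self.run_id, "seq": len(self.events), "kind": kind, **fields}
        self.events.append(event)
        start = time.monotonic()
        try:
            yield event                 # the step can attach its own fields
            event["status"] = "ok"
        except Exception as exc:
            event["status"] = "error"
            event["error"] = repr(exc)
            raise                       # tracing never swallows the failure
        finally:
            event["duration_s"] = round(time.monotonic() - start, 6)

    def dump(self) -> str:
        """One JSON object per line."""
        return "\n".join(json.dumps(e) for e in self.events)


tracer = Tracer()

with tracer.span("tool_call", tool="search", args={"q": "invoice"}) as ev:
    ev["result_count"] = 3

try:
    with tracer.span("validation", schema="refund"):
        raise ValueError("amount_cents missing")
except ValueError:
    pass
```

Reading `tracer.dump()` top to bottom gives the sequence of events for a run, including the step that failed and why.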

None of This Made the Agent Smarter

That's the part worth emphasizing.

None of these changes improved the model's reasoning ability.

What they did was reduce the number of ways a reasoning mistake could become a production problem.

They made failures visible.

They made failures recoverable.

They reduced the blast radius when things inevitably went wrong.

That distinction changed how we scope projects today.

We no longer start by asking:

Can the model perform this task?

Increasingly, the answer is yes.

Instead, we ask:

When this fails—and it will—what does failure look like, who sees it, and how do we recover?

That question turns out to be far more important.

If You're Building AI Agents Right Now

A few lessons we'd share with teams early in their journey:

Write down failure cases before writing prompts.
Don't let one tool do five jobs.
Validate every structured output.
Log intermediate reasoning and tool calls, not just final answers.
Treat retries as a deliberate, bounded reliability strategy, not a default reaction.
Decide which actions are reversible and which aren't.
Add real production failures to your evaluation suite.

The happy path is rarely the hard part.

The edge cases are where reliability is won or lost.

Final Thoughts

We're still finding new ways for agents to surprise us.

That part probably never goes away.

But the failures look different now.

They're visible instead of silent.

Bounded instead of endless.

Recoverable instead of catastrophic.

For production systems, that's most of the battle.

As models continue to improve, reliability will increasingly become an engineering challenge rather than a model-quality problem.

The teams that recognize that shift early will build systems users can trust.

If you're working through similar challenges, I'd be especially interested in how you're approaching recoverability and state management in long-running agent workflows. It's one of the areas we're still actively refining.

Top comments (3)

Max Quimby • Jun 21

The duplicate-record incident is the most important part of this piece, and I'd push it even harder: that wasn't an agent failure at all, it was an at-least-once delivery problem that's existed in distributed systems for decades — the LLM just inherited it. The fix that's saved us the most pain is treating every external mutation as needing an idempotency key, so a retry of a side effect that already landed becomes a no-op, and treating a timeout as unknown rather than failed (retrying an unknown is only safe once it's idempotent). Your "confidently reports success while accomplishing nothing" line is the scarier one to me, because eval sets structurally can't catch it — the tool returned 200, so the trace looks clean. The only thing that's worked is outcome verification: re-read the record after the write and confirm the field actually changed, rather than trusting the tool's own status. Did splitting reliability into separate dimensions change what you measure, or mostly where you intervene in the loop?

Andrii Krugliak • Jun 20

The slow leak of trust is the exact phrase for it. A crash you can catch; the confident wrong answer with nobody paged is the one that quietly costs you the user. We ended up treating 'did it actually do the thing' as a separate gate from 'did it return without error', because the demo only ever tests the second one.

Theo Valmis • Jun 22

The duplicate-records detail is the tell that this is an enforcement problem more than a model problem. The agent did exactly what it was told, called the update, and nothing held the constraint that this record was already written, so a retry or a re-run became a second insert. Nothing crashed because nothing was wrong from the model's point of view; the missing piece was a check outside the model that refuses the second write before it happens. That's the shift you landed on: reliability isn't a property the model accumulates, it's a boundary you put around its actions. Let the agent decide freely, and make the irreversible steps, the write, the delete, the send, pass a check it can't talk its way around. We're building exactly that kind of pre-action enforcement at Mneme, because the long tail you describe is unbounded and the thing that scales against it is a deterministic gate on the actions that matter, not a bigger eval set.