Keesan

Posted on Jun 13

The expensive part of an AI agent failure is usually the retry loop

#ai #llm #automation #productivity

The first failure usually is not the expensive one.

The expensive part is what happens after the first failure when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation changed.

We kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. That is the point where the problem stops being a model-quality issue and becomes a control-system issue.

Why the loop hurts more than the mistake

A single bad step is recoverable. An unbounded retry loop compounds the mistake.

That is true for token spend, API calls, and operator attention. It is also true for trust. Once a system gets a reputation for wandering, people stop letting it touch real work.

The failure mode is boring, which is why it gets missed. Nobody looks at a happy-path demo and thinks about what happens after the third identical error. But that is where the real cost lives.

What we tried first

The obvious moves are usually the wrong ones:

make the prompt longer
add a generic retry
increase the timeout
let the model "reason more"
rerun the same command with slightly different wording

Those changes can make a demo look better, but they do not fix a stuck loop.

If the environment is unchanged, a retry is often just a second copy of the same mistake.

What actually worked

The fix was not smarter language. It was stricter boundaries.

We had to make the runtime answer four questions before it kept going:

What is the budget?
What counts as success?
What is the verifier?
What happens when the same failure repeats?

A small policy block is often enough to make this concrete:

{
  "budget_cap": 250,
  "max_attempts": 3,
  "stop_on_same_error": true,
  "require_verifier": true,
  "emit_receipt": true
}

That does not sound ambitious. That is the point.

The biggest reliability gain came from refusing to treat repeated failure as progress. Once the runtime could detect the same blocker twice or three times in a row, it had permission to stop instead of pretending the next rerun would somehow be different.

Why receipts matter

Receipts turn a run from a vague story into a checkable fact.

A receipt should show:

what the agent tried
what changed
what failed
why the run stopped

Without that, a loop can hide inside a confidence-generating summary. With it, you can see the exact stopping point and decide whether the next action should be a human intervention, a different tool, or no action at all.

That is also why this kind of work ends up feeling less like prompt engineering and more like operations.

The tradeoff

Stricter control means the system stops earlier.

That can feel annoying when you want the agent to push through friction. But earlier stopping is cheaper than a long blind retry sequence. More importantly, it preserves operator trust.

A bounded agent is less flashy than an agent that "never gives up." It is also much more usable.

That is the core of the control-layer approach we keep coming back to in MartinLoop: the runtime should know when to stop, when to ask for help, and when to write down what happened.

What we are watching next

The next improvement is not more retries.

It is better failure classification so the runtime can separate:

missing permission
stale state
tool mismatch
external outage
real task completion

When those are distinct, the system can choose a better next step instead of recycling the same command.

That is the line between an agent that looks autonomous and an agent that is actually operable.

What failure shape are you still letting your runtime retry too many times?

Top comments (1)

Mehmet Can Farsak • Jun 13

Solid point about retry loops being the real cost center. I've seen a specific variant of this: agents that get stuck because they're trying to execute when they should still be in brainstorming mode. The retry loop compounds the problem since each attempt just produces more premature code. I put together Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) which adds mode discipline via PreToolUse hooks — the agent literally can't jump to execution during ideation. Cuts down on those wasted retry cycles considerably.