DEV Community

Patrick Hughes

Posted on • Originally published at bmdpat.com

Your AI Agent's Retry Loop Is a Cost Bug Waiting to Happen

This morning a small piece of my own automation got stuck. A repair agent tried to fix one blog draft. It failed the same check 27 times in a row. Each attempt was a full model call. The loop never asked the obvious question: if 26 tries did not work, why would the 27th?

That is a retry loop with no circuit breaker. And it is one of the most common ways AI agents quietly waste money.

Retries are not free

In normal code, a retry is cheap. You hit a flaky network call, you try again, you move on. The cost of one extra attempt is a few milliseconds.

Agent retries are different. Every attempt is a model call. Every model call costs tokens. A loop that retries 27 times is 27 paid attempts at the same task. If the task is impossible the way it is framed, you pay 27 times to learn nothing.

The error handling looks responsible. It catches the failure. It tries again. It logs the attempt. But "try again" without "and stop at some point" is not error handling. It is a slow leak.

A quick cost example

Say each attempt is a 4,000 token call. At a few dollars per million tokens, one attempt costs about a cent. Twenty-seven attempts is still small, well under a dollar. Now scale it. A loop that runs on a 40,000 token context, across ten agents, several times a day, adds up. The per-attempt cost hides the total. That is the danger. Small numbers times a loop with no ceiling become a large number you never chose to spend.
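To make the arithmetic concrete, here is a minimal sketch. The price of $3 per million tokens and the "three stuck loops per day" figure are assumptions picked for illustration, not real pricing or measured data. The function name `loop_cost_usd` is hypothetical.

```python
def loop_cost_usd(tokens_per_attempt: int, attempts: int,
                  usd_per_million_tokens: float) -> float:
    """Total spend for a retry loop where every attempt is a full model call."""
    return tokens_per_attempt * attempts * usd_per_million_tokens / 1_000_000


PRICE = 3.0  # assumed: USD per million tokens, illustrative only

# One 4,000 token attempt, then the same attempt repeated 27 times.
one_attempt = loop_cost_usd(4_000, 1, PRICE)
stuck_loop = loop_cost_usd(4_000, 27, PRICE)

# Scaled up: 40,000 token context, 27 attempts per stuck loop,
# 10 agents, and an assumed 3 stuck loops per agent per day.
scaled_daily = loop_cost_usd(40_000, 27, PRICE) * 10 * 3

print(f"one attempt:   ${one_attempt:.3f}")
print(f"27 attempts:   ${stuck_loop:.3f}")
print(f"scaled, daily: ${scaled_daily:.2f}")
```

Under these assumptions a single attempt is about 1.2 cents and the 27-attempt loop is about 32 cents. The scaled version is roughly $97 a day. Nothing changed except the multipliers.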

Why agents thrash

Traditional retries assume the failure is transient. The server was busy. The connection dropped. Wait a beat and the same input works.

Agent failures are often not transient. The model produced output that breaks a hard rule. A paragraph too long. A forbidden word. A schema mismatch. Feed the same prompt back and you get a slightly different wrong answer. The constraint that blocked attempt one blocks attempt 27.

So the loop runs until something else stops it. A timeout. A token budget. A human noticing the bill. None of those are good stopping conditions.
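One way to picture the difference is to give the two kinds of failure different types. This is only an illustration of the distinction, not the fix this post recommends. The class names, the 120 word limit, and `worth_retrying_unchanged` are all hypothetical.

```python
class TransientError(Exception):
    """Failure that may clear on its own: busy server, dropped connection."""


class RuleViolation(Exception):
    """Failure caused by the output itself: too long, forbidden word, bad schema."""


MAX_PARAGRAPH_WORDS = 120  # hypothetical hard rule for this example


def check_paragraph(text: str) -> None:
    """Raise RuleViolation if the paragraph breaks the length rule."""
    words = len(text.split())
    if words > MAX_PARAGRAPH_WORDS:
        raise RuleViolation(
            f"paragraph has {words} words, limit is {MAX_PARAGRAPH_WORDS}"
        )


def worth_retrying_unchanged(error: Exception) -> bool:
    """Only a transient failure has a reason to succeed on the same input."""
    return isinstance(error, TransientError)
```

A `RuleViolation` says the same prompt will likely hit the same wall again. Retrying it unchanged is the thrash described above.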

The fix is a counter and a ceiling

You do not need anything fancy. You need two things most retry loops skip.

First, count the attempts. Not just "did it fail," but "how many times in a row." A simple integer.

Second, set a ceiling. After N tries, stop. Escalate. Write the failure somewhere a human will see it. Hand the work to a different path. Anything except trying again forever.

In my case the right move was obvious in hindsight. After three failed repairs, the draft should have been flagged for a human and the loop should have stopped. Instead it ground through 27 attempts before a separate check caught it.
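A counter and a ceiling fit in a few lines. The sketch below assumes three callables supplied by the caller: `attempt_repair` makes one model call, `check` returns an error message or `None`, and `escalate` puts the failure somewhere a human will see it. All names here are hypothetical, and this is not the actual repair agent's code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # the ceiling: three failed repairs, then stop


@dataclass
class RepairResult:
    ok: bool
    attempts: int
    output: Optional[str] = None
    last_error: Optional[str] = None


def repair_with_ceiling(
    draft: str,
    attempt_repair: Callable[[str], str],           # one paid model call
    check: Callable[[str], Optional[str]],          # None means the check passed
    escalate: Callable[[str, Optional[str], int], None],  # hand off to a human
    max_attempts: int = MAX_ATTEMPTS,
) -> RepairResult:
    last_error: Optional[str] = None
    # The counter: range() bounds the loop, so it cannot run forever.
    for attempt in range(1, max_attempts + 1):
        candidate = attempt_repair(draft)
        last_error = check(candidate)
        if last_error is None:
            return RepairResult(ok=True, attempts=attempt, output=candidate)
    # The ceiling: stop and escalate instead of trying again.
    escalate(draft, last_error, max_attempts)
    return RepairResult(ok=False, attempts=max_attempts, last_error=last_error)
```

The design choice that matters is that the loop bound lives outside the model call. The agent never gets a vote on whether there is a fourth attempt.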

Make the ceiling visible

The trap is that an uncapped loop looks fine from the outside. The agent is busy. Logs are filling. Work appears to be happening. Nobody sees the problem until the invoice arrives or a quota runs dry.

So make the cap explicit and loud. Log when you hit it. Count retries as their own metric, separate from successes and failures. A spike in retries is an early warning that something is stuck, long before it shows up as cost.

If you run more than one agent, track this across all of them. One stuck loop is annoying. Ten stuck loops at once is a real bill.
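Here is a minimal sketch of retries as their own metric, tracked per agent and summed across all of them. It uses only the standard library. The `RetryMetrics` class and the metric names are assumptions for illustration. A real setup would likely push these numbers to whatever metrics system you already run.

```python
import logging
from collections import Counter

logger = logging.getLogger("agent.retries")


class RetryMetrics:
    """Counts retries separately from successes and failures, per agent."""

    def __init__(self) -> None:
        self.by_agent: Counter = Counter()  # keyed by (agent_id, metric_name)

    def record(self, agent_id: str, attempt: int, succeeded: bool,
               max_attempts: int) -> None:
        # Any attempt after the first is a retry, whatever its outcome.
        if attempt > 1:
            self.by_agent[(agent_id, "retries")] += 1
        if succeeded:
            self.by_agent[(agent_id, "successes")] += 1
        elif attempt >= max_attempts:
            # Hitting the ceiling is the loud event.
            self.by_agent[(agent_id, "failures")] += 1
            logger.error("retry ceiling hit: agent=%s attempts=%d",
                         agent_id, attempt)

    def total(self, metric_name: str) -> int:
        """Sum one metric across every agent."""
        return sum(count for (_, name), count in self.by_agent.items()
                   if name == metric_name)
```

Watching `total("retries")` over time gives the early warning. Splitting it by agent shows which loop is stuck.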

This is the same idea as a budget

A retry ceiling and a token budget are the same instinct. Both say an autonomous process should have a hard stop it cannot cross. The agent does not get to decide it needs one more try, one more call, one more dollar. The ceiling decides.
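The budget version of the same hard stop can be sketched like this. `TokenBudget` and `BudgetExceeded` are hypothetical names written for this example and are not taken from any library. The 10,000 token limit is arbitrary.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push usage past the hard limit."""


class TokenBudget:
    """A ceiling on total tokens that the caller enforces, not the agent."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Refuse before spending, so the limit is never crossed.
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{self.used} used + {tokens} requested > {self.max_tokens} allowed"
            )
        self.used += tokens


budget = TokenBudget(max_tokens=10_000)
budget.charge(4_000)  # first attempt fits
budget.charge(4_000)  # second attempt fits
try:
    budget.charge(4_000)  # third attempt would cross the limit
except BudgetExceeded as exc:
    print(f"stopped: {exc}")
```

Calling `charge` before each model call means the stop happens before the spend. A rejected charge leaves `used` unchanged.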

That is why I built loop and budget limits into AgentGuard in the first place. Agents are good at doing things over and over. They are bad at knowing when to quit. The stop has to come from outside the agent.

If your agents retry, audit those loops today. Find the ones with no counter. Add a ceiling. Make the ceiling visible. The 27-attempt loop I hit cost me cents because the task was tiny. The next one might not be.

Want hard caps on retries, tokens, and spend for your agents? Start here: https://bmdpat.com/tools/agentguard
