Bounded retries for agent tool calls: the budget that stopped our infinite-loop incidents

#agents #ai #llm #programming

The worst incident our agent ever caused was not a wrong answer. It was a loop. A tool call failed, the agent retried, the retry failed the same way, and it kept going, burning tokens and hammering a downstream API a few hundred times in a minute before anything stopped it. The agent was doing exactly what we had told it: if a tool fails, try again. We just never told it when to stop.

Retries are the right instinct. A transient failure should be retried. The problem is that an agent does not reliably distinguish "transient" from "this will fail every time," and left to its own judgment it will cheerfully retry a permanently broken call until something external kills it. Human-written code learned this lesson decades ago: every retry loop has a bound. Agent tool-calling quietly forgot it, because the retry decision moved from a for-loop you can see into the model's reasoning, where it is invisible and unbounded.

So we put the bound back, outside the model. Two budgets: a per-call retry cap (this specific tool call gets N attempts, then it is a hard failure the agent must handle differently), and a per-session attempt budget (the whole task gets a ceiling on total tool calls, after which we stop and escalate rather than let it spin).

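class ToolGivingUp(Exception):
    """Per-call retry cap reached: the agent should try a different path."""

class SessionGivingUp(Exception):
    """Per-session budget reached: stop the task and escalate."""

class ToolBudget:
    def __init__(self, per_call=2, per_session=40):
        self.per_call, self.per_session = per_call, per_session
        self.session_calls = 0  # every attempt in the session counts, retries included

    def check(self, tool, attempt):
        # Call before each attempt; attempt is 1-based, so per_call=2 allows attempts 1 and 2.
        self.session_calls += 1
        if attempt > self.per_call:
            raise ToolGivingUp(f"{tool}: {self.per_call} attempts used, stop retrying, try another path")
        if self.session_calls > self.per_session:
            raise SessionGivingUp("tool budget exhausted, escalate to a human")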

The important part is not the cap; it is what happens at the cap. A retry loop that just stops leaves the agent stuck. So when the per-call budget is exhausted we hand the agent a specific message ("this call has failed twice, do not retry it, try a different approach or ask for help"), which turns a dead loop into a decision. Most of the time the agent then does something sensible, because now it knows retrying is off the table.
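Here is a rough sketch of what that handoff could look like. It is an illustration, not our production code: `call_with_budget`, `flaky_lookup`, and the shape of the result dictionary are hypothetical names made up for this example, and the compact `ToolBudget` is restated only so the snippet runs on its own. The idea is that the wrapper owns the retry loop, and when the per-call cap trips it returns an instruction as the tool result instead of one more error.

```python
class ToolGivingUp(Exception):
    pass


class SessionGivingUp(Exception):
    pass


class ToolBudget:
    # Compact restatement of the budget class so this sketch is self-contained.
    def __init__(self, per_call=2, per_session=40):
        self.per_call, self.per_session = per_call, per_session
        self.session_calls = 0

    def check(self, tool, attempt):
        self.session_calls += 1
        if attempt > self.per_call:
            raise ToolGivingUp(f"{tool}: retry cap reached")
        if self.session_calls > self.per_session:
            raise SessionGivingUp("tool budget exhausted, escalate to a human")


def call_with_budget(budget, name, fn, *args):
    """Run one tool call under the budget and return a dict the agent sees as the tool result."""
    attempt, last_error = 0, None
    while True:
        attempt += 1
        try:
            budget.check(name, attempt)
        except ToolGivingUp:
            # Per-call cap hit: hand back a decision prompt, not another stack trace.
            return {
                "ok": False,
                "retryable": False,
                "message": (
                    f"{name} has failed {attempt - 1} times (last error: {last_error}). "
                    "Do not retry it. Try a different approach or ask for help."
                ),
            }
        # SessionGivingUp is deliberately not caught here: it should end the whole task.
        try:
            return {"ok": True, "result": fn(*args)}
        except Exception as exc:  # assumption: real code would catch narrower error types
            last_error = repr(exc)


def flaky_lookup(key):
    # Hypothetical tool that fails the same way every time.
    raise TimeoutError("upstream did not respond")


budget = ToolBudget(per_call=2, per_session=40)
outcome = call_with_budget(budget, "flaky_lookup", flaky_lookup, "order-123")
print(outcome["message"])
```

The session-level exception is left to propagate on purpose, so that whatever runs the agent loop can stop the task and escalate in one place.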

Tool-misuse loops in our logs went from a handful of nasty incidents a month to basically none, and the few that remain hit the session budget and escalate cleanly instead of paging someone at 2am about a runaway API bill.

The tension I have not fully resolved: a per-session budget that is too tight kills legitimately long tasks (a genuine multi-step workflow can need a lot of tool calls), and one that is too loose lets a slow loop run up real cost before it trips. We set ours from the 95th percentile of healthy task lengths and pad it, which is empirical and a little arbitrary. If you have found a non-arbitrary way to bound agent tool-call budgets, that is the comment I am reading.
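For concreteness, the sizing rule can be sketched in a few lines. Everything specific here is an assumption for illustration: the function name `session_budget`, the nearest-rank percentile method, the 1.5x padding factor, and the sample history are all made up, not our real numbers.

```python
import math


def session_budget(healthy_call_counts, percentile=95, padding=1.5):
    """Nearest-rank percentile of tool calls per healthy task, scaled by a padding factor."""
    if not healthy_call_counts:
        raise ValueError("need at least one healthy task to size the budget")
    ordered = sorted(healthy_call_counts)
    # 1-based nearest rank: the smallest rank covering `percentile` percent of the tasks.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    p_value = ordered[rank - 1]
    return math.ceil(p_value * padding)


# Made-up tool-call counts from 20 tasks that finished normally.
history = [4, 6, 6, 7, 8, 9, 9, 10, 11, 12, 12, 13, 14, 15, 17, 18, 20, 22, 26, 31]
print(session_budget(history))  # → 39 (95th percentile is 26, padded by 1.5)
```

The padding factor is where the arbitrariness lives: the percentile comes from data, but how much headroom to add on top is still a judgment call.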

DEV Community
