We almost burned $400 in one afternoon.
Not because of a bad model. Not because of a broken API. Because our agent got stuck in a loop — calling itself over and over, retrying a task that was never going to succeed — and nothing told it to stop.
That incident forced us to rethink how we build agents at Anythoughts.ai. Here's what we learned.
The Setup
We had an outreach agent that:
- Fetches a list of prospects
- Enriches each one via an external API
- Drafts a personalized email
- Flags anything it can't enrich for human review
Simple enough. The bug: step 2, the enrichment call, was hitting a rate-limited endpoint. The agent got a 429, retried, got another 429, retried again, and never stopped. It had no concept of "this task is failing, escalate or quit."
After about 90 minutes (and several hundred unnecessary API calls), we caught it manually.
Why Agents Loop
Most agent frameworks are optimized for completing tasks, not for stopping gracefully. The default behavior is:
- Tool call fails → retry
- Retry fails → retry again
- No explicit exit condition → keep trying
This makes sense for transient failures (network blip, timeout). It's catastrophic for systematic ones (rate limits, invalid input, missing permissions).
The agent isn't being stupid. It's doing exactly what it was told: keep going until done. The problem is we never defined "done" to include "unable to proceed."
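One way to make "unable to proceed" a first-class result is to give each step an explicit outcome and have the loop stop on any terminal one. This is a minimal sketch, not how any particular framework does it: `StepOutcome`, `run_until_terminal`, and the `max_steps` default are all names and values we made up for illustration.

```python
from enum import Enum, auto


class StepOutcome(Enum):
    DONE = auto()               # task finished successfully
    RETRYABLE = auto()          # transient failure, worth another attempt
    UNABLE_TO_PROCEED = auto()  # systematic failure, stop and escalate


def run_until_terminal(step, max_steps=20):
    """Call step() until it reports a terminal outcome or max_steps is hit.

    `step` is a hypothetical zero-argument callable returning a StepOutcome.
    """
    for _ in range(max_steps):
        outcome = step()
        if outcome in (StepOutcome.DONE, StepOutcome.UNABLE_TO_PROCEED):
            return outcome
    # Running out of steps also counts as "unable to proceed".
    return StepOutcome.UNABLE_TO_PROCEED
```

The point of the design is that the loop has two ways to finish, and only one of them is success.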
The Fix: Three Termination Layers
We now build every agent with three explicit termination layers:
Layer 1: Per-tool retry caps
Every tool call has a max retry count with exponential backoff. After N failures on the same call, it throws a hard error — not a soft retry signal.
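    import time

    class ToolError(Exception):
        """Hard failure: the caller should stop retrying this tool call."""

    def call_with_limit(tool_fn, args, max_retries=3):
        for attempt in range(max_retries):
            result = tool_fn(**args)
            if result.ok:
                return result
            if result.status != 429:
                # Anything other than a rate limit is not worth retrying.
                raise ToolError(f"Unrecoverable: {result.status}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
        raise ToolError(f"Exceeded {max_retries} retries")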
This sounds obvious. We didn't have it.
Layer 2: Task-level failure budget
Each agent run gets a failure budget — a max number of errors across all tool calls. Once exceeded, the entire run halts and logs state for recovery.
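    class BudgetExhausted(Exception):
        """Raised when a run has used up its failure budget."""

    class AgentRun:
        def __init__(self, failure_budget=10):
            self.errors = 0
            self.budget = failure_budget

        def record_error(self, err):
            # err is accepted so callers can pass it along for logging.
            self.errors += 1
            if self.errors >= self.budget:
                raise BudgetExhausted("Too many failures, halting run")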
For our outreach agent, the budget is 5. If 5 enrichment calls fail, we stop, log the failed prospects, and ping Slack.
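Here is roughly how that could look in the enrichment loop. Treat it as a sketch under assumptions: `enrich_all`, `enrich`, and `notify` are hypothetical stand-ins for the real enrichment call and the Slack ping, and the two classes are repeated only so the example runs on its own.

```python
class BudgetExhausted(Exception):
    """Raised when a run records too many failures."""


class AgentRun:
    # Same shape as the class above, repeated so this sketch is self-contained.
    def __init__(self, failure_budget=10):
        self.errors = 0
        self.budget = failure_budget

    def record_error(self, err):
        self.errors += 1
        if self.errors >= self.budget:
            raise BudgetExhausted("Too many failures, halting run")


def enrich_all(prospects, enrich, notify, failure_budget=5):
    """Enrich prospects until the list is done or the budget runs out.

    `enrich` and `notify` are hypothetical callables supplied by the caller.
    Returns (enriched, failed) so failed prospects can go to human review.
    """
    run = AgentRun(failure_budget=failure_budget)
    enriched, failed = [], []
    for prospect in prospects:
        try:
            enriched.append(enrich(prospect))
        except Exception as err:
            failed.append(prospect)
            try:
                run.record_error(err)
            except BudgetExhausted:
                # Budget spent: report what failed and stop the whole run.
                notify(f"Run halted after {run.errors} failures: {failed}")
                break
    return enriched, failed
```

Prospects after the halt are left untouched, so a later run can pick up where this one stopped.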
Layer 3: Wall-clock timeout
Every agent process runs inside a timeout wrapper. If it hasn't finished in 10 minutes (or whatever makes sense for the task), it's killed and the partial state is saved.
This is your last resort. If layers 1 and 2 fail, layer 3 ensures you don't burn resources indefinitely.
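A timeout wrapper can be as small as the sketch below, which assumes the agent runs as its own child process. `run_with_deadline` is a name we picked for illustration; the `timeout` argument to `subprocess.run` is standard library behavior, and it kills the child before raising `TimeoutExpired`.

```python
import subprocess


def run_with_deadline(cmd, timeout_s=600):
    """Run cmd as a child process and kill it after timeout_s seconds.

    Returns the child's exit code, or None if the deadline was hit.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # The child has already been killed by subprocess.run at this point.
        return None


# Example call (agent.py is a hypothetical entry point):
#   run_with_deadline(["python", "agent.py"], timeout_s=600)
```

A killed process cannot save anything on its way out, so this only preserves partial state if the agent checkpoints to disk as it goes. That is an assumption about the agent, not something the wrapper provides.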
The Bigger Lesson
We spent a lot of early time making our agents smarter — better prompts, better models, better tool design. What actually made them reliable was making them safer to fail.
Every production agent we now ship answers three questions before it runs:
- What does success look like? (exit condition)
- What does unrecoverable failure look like? (halt condition)
- What's the worst-case resource cost if it loops? (budget)
If you can't answer all three, the agent isn't ready for production.
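If you want that checklist enforced rather than remembered, one option is a small policy object that refuses to pass review while any answer is missing. This is a sketch only: `RunPolicy`, its field names, and `assert_production_ready` are hypothetical, chosen to mirror the three questions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunPolicy:
    # Hypothetical fields, one per question above.
    is_success: Optional[Callable[[dict], bool]] = None        # exit condition
    is_unrecoverable: Optional[Callable[[dict], bool]] = None  # halt condition
    max_cost_usd: Optional[float] = None                       # worst-case budget

    def missing(self):
        """Return the names of the questions left unanswered."""
        return [name for name, value in vars(self).items() if value is None]


def assert_production_ready(policy: RunPolicy) -> None:
    """Raise ValueError if any of the three answers is missing."""
    gaps = policy.missing()
    if gaps:
        raise ValueError(f"Agent not ready for production, missing: {gaps}")
```

Calling `assert_production_ready(RunPolicy())` raises, which is the behavior you want from a deploy check.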
What This Costs You
About 2 hours to retrofit an existing agent. About 30 minutes to build it in from the start.
The $400 afternoon cost us a lot more than that.
At Anythoughts.ai, we build AI agents that run real business workflows autonomously. If you're building something similar and hit a wall, drop a comment — we've probably broken it the same way.