The Infinite Loop Problem: How We Stopped Our Agent From Running Forever

#ai #indiehackers #startup #automation

We almost burned $400 in one afternoon.

Not because of a bad model. Not because of a broken API. Because our agent got stuck in a loop — calling itself over and over, retrying a task that was never going to succeed — and nothing told it to stop.

That incident forced us to rethink how we build agents at Anythoughts.ai. Here's what we learned.

The Setup

We had an outreach agent that:

1. Fetches a list of prospects
2. Enriches each one via an external API
3. Drafts a personalized email
4. Flags anything it can't enrich for human review

Simple enough. The bug: step 2 was hitting a rate-limited endpoint. The agent got a 429, retried, got another 429, retried again — and never stopped. It had no concept of "this task is failing, escalate or quit."

After about 90 minutes (and several hundred unnecessary API calls), we caught it manually.

Why Agents Loop

Most agent frameworks are optimized for completing tasks, not for stopping gracefully. The default behavior is:

Tool call fails → retry
Retry fails → retry again
No explicit exit condition → keep trying

This makes sense for transient failures (network blip, timeout). It's catastrophic for systematic ones (rate limits, invalid input, missing permissions).

The agent isn't being stupid. It's doing exactly what it was told: keep going until done. The problem is we never defined "done" to include "unable to proceed."
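One way to make "unable to proceed" explicit is to decide up front which errors are worth retrying at all. The snippet below is a minimal sketch of that idea, not our production code: the mapping from Python's built-in exception types to "transient" is an assumption, and `should_retry` is a hypothetical helper name. A rate-limit error would need its own exception type and is left out here.

```python
# Assumed mapping: which built-in exception types count as transient.
# Anything not listed is treated as systematic (fail closed), so an
# unknown error never triggers an open-ended retry.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def should_retry(err: Exception) -> bool:
    """Return True only for failures that may succeed on a later attempt."""
    return isinstance(err, TRANSIENT_ERRORS)


if __name__ == "__main__":
    for err in (TimeoutError("slow"), ValueError("bad input"),
                PermissionError("no access")):
        print(type(err).__name__, "retry" if should_retry(err) else "halt")
```

Defaulting unknown errors to "halt" is the design choice that matters: the agent has to earn a retry rather than get one for free.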

The Fix: Three Termination Layers

We now build every agent with three explicit termination layers:

Layer 1: Per-tool retry caps

Every tool call has a max retry count with exponential backoff. Rate-limit responses (429) are retried up to N times; any other error fails immediately. After N failures on the same call, it throws a hard error, not a soft retry signal.

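import time

class ToolError(Exception):
    """Raised when a tool call cannot succeed and must not be retried."""

def call_with_limit(tool_fn, args, max_retries=3):
    for attempt in range(max_retries):
        result = tool_fn(**args)
        if result.ok:
            return result
        if result.status != 429:
            # Anything other than a rate limit is treated as unrecoverable.
            raise ToolError(f"Unrecoverable: {result.status}")
        if attempt < max_retries - 1:
            # Rate limited: back off 1s, 2s, 4s, ... before the next attempt.
            time.sleep(2 ** attempt)
    raise ToolError(f"Exceeded {max_retries} retries")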

This sounds obvious. We didn't have it.

Layer 2: Task-level failure budget

Each agent run gets a failure budget: a max number of errors across all tool calls. Once the budget is used up, the entire run halts and logs state for recovery.

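class BudgetExhausted(Exception):
    """Raised when a run has used up its failure budget."""

class AgentRun:
    def __init__(self, failure_budget=10):
        self.errors = 0
        self.budget = failure_budget

    def record_error(self, err):
        # Count every tool failure in this run, whichever tool raised it.
        self.errors += 1
        if self.errors >= self.budget:
            raise BudgetExhausted(f"Too many failures, halting run (last: {err})")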

For our outreach agent, the budget is 5. If 5 enrichment calls fail, we stop, log the failed prospects, and ping Slack.
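To show how the budget interacts with the prospect loop, here is a rough sketch of the "stop, log, ping" flow. It is an illustration under assumptions, not our actual agent: `run_with_budget`, `enrich`, and `notify` are hypothetical names, `notify` stands in for the Slack ping, and `ToolError` is redefined so the snippet runs on its own.

```python
class ToolError(Exception):
    """A tool call failed for good (redefined here so the sketch is standalone)."""


def run_with_budget(prospects, enrich, failure_budget=5, notify=print):
    """Enrich prospects until the list is done or the failure budget is spent.

    `enrich` and `notify` are placeholders for the real enrichment tool and
    the alerting hook; both are assumptions made for this sketch.
    """
    enriched, failed = [], []
    for index, prospect in enumerate(prospects):
        try:
            enriched.append(enrich(prospect))
        except ToolError as err:
            failed.append((prospect, str(err)))
            if len(failed) >= failure_budget:
                # Budget spent: stop now and report what was never attempted.
                pending = list(prospects[index + 1:])
                notify(f"Run halted after {len(failed)} failures; "
                       f"{len(pending)} prospects not attempted")
                return {"halted": True, "enriched": enriched,
                        "failed": failed, "pending": pending}
    return {"halted": False, "enriched": enriched,
            "failed": failed, "pending": []}
```

Returning the failed and pending prospects, rather than only raising, is what makes the run recoverable: a later run can pick up the pending list once the underlying problem is fixed.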

Layer 3: Wall-clock timeout

Every agent process runs inside a timeout wrapper. If it hasn't finished in 10 minutes (or whatever makes sense for the task), it's killed and the partial state is saved.

This is your last resort. If layers 1 and 2 fail, layer 3 ensures you don't burn resources indefinitely.
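A timeout wrapper like this can be sketched with Python's standard `subprocess` module, assuming the agent can be launched as its own process. `run_with_timeout` is a hypothetical helper name, and the sketch does not save partial state itself; one option is for the agent to checkpoint its progress to disk as it goes, since a killed process cannot clean up after itself.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` as a child process and kill it if it exceeds `timeout_s`.

    Returns (finished, returncode); returncode is None when the run was killed.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so nothing is left running by the time we get here.
        return False, None
    return True, completed.returncode


if __name__ == "__main__":
    # Stand-in for a stuck agent: a child process that sleeps far too long.
    stuck_agent = [sys.executable, "-c", "import time; time.sleep(60)"]
    finished, code = run_with_timeout(stuck_agent, timeout_s=1)
    print("finished" if finished else "killed by timeout", code)
```

Running the agent in a separate process matters here: a timeout enforced from outside still works when the agent's own loop logic is the thing that is broken.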

The Bigger Lesson

We spent a lot of early time making our agents smarter — better prompts, better models, better tool design. What actually made them reliable was making them safer to fail.

Every production agent we now ship answers three questions before it runs:

What does success look like? (exit condition)
What does unrecoverable failure look like? (halt condition)
What's the worst-case resource cost if it loops? (budget)

If you can't answer all three, the agent isn't ready for production.
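The third question can be answered with simple arithmetic once the retry cap exists. The sketch below assumes each item triggers one tool call and that every attempt is billed at a flat rate; the function names and the per-attempt price are hypothetical, so plug in your own numbers.

```python
def worst_case_attempts(n_items: int, max_retries: int) -> int:
    """Upper bound on tool attempts when every item uses all of its retries."""
    return n_items * max_retries


def worst_case_cost_cents(n_items: int, max_retries: int,
                          cents_per_attempt: int) -> int:
    """Upper bound on spend for one run, in cents, under flat per-attempt pricing."""
    return worst_case_attempts(n_items, max_retries) * cents_per_attempt


if __name__ == "__main__":
    # Hypothetical run: 200 prospects, 3 attempts each, 2 cents per attempt.
    print(worst_case_attempts(200, 3))
    print(worst_case_cost_cents(200, 3, 2))
```

Without a retry cap there is no such bound to compute, which is exactly the situation the looping agent was in.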

What This Costs You

About 2 hours to retrofit an existing agent. About 30 minutes to build it in from the start.

The $400 afternoon cost us a lot more than that.

At Anythoughts.ai, we build AI agents that run real business workflows autonomously. If you're building something similar and hit a wall, drop a comment — we've probably broken it the same way.