Mike
Your Agents Run Forever — Here's How I Make Mine Stop

Rubber duck reaching for the kill switch on a runaway agent loop

Here's what happens when you put two models in an iterative refinement loop without a termination strategy.

One model generates API documentation. The other critiques it. Generate, critique, improve, repeat. The pattern works beautifully — three rounds, maybe four, and you get documentation that's better than what either model produces alone.

Until the critic is too good. It says "the error handling section could be more specific." The generator makes it more specific. The critic says "now the specificity makes the overview section feel vague by comparison." The generator improves the overview. The critic says "the improved overview introduces terminology that should be defined earlier."

Seventeen rounds. Both models are being helpful. Neither is wrong. They just never converge. By the time a billing alert fires, the workflow has burned through 50x its expected budget overnight.

This is the failure mode nobody writes tutorials about. Everyone shows you how to start agents. Nobody talks about how to make them stop.

Why max_iterations = 10 is not a termination strategy

The obvious first fix:

const MAX_ITERATIONS = 10;

Cargo cult engineering at its finest. Why 10? Because it's a round number. Because some blog post used 10. Because it felt like enough.

Here's the problem: 10 is a constant solving a dynamic problem. Sometimes a refinement loop converges in 2 rounds and you're wasting 8 rounds of tokens on marginal improvements. Sometimes the task genuinely needs 15 rounds and you're cutting it off right before the output gets good.

Hard iteration limits are a safety net, not a strategy. They're the catch (Exception e) of agent orchestration — better than nothing, dangerous to rely on.

You need exit conditions that respond to what's actually happening in the loop.

Exit conditions that actually work

Six conditions that handle real-world agent loops. Use them together, not individually.

1. Budget ceiling

The simplest and most important. Set a hard dollar cap per workflow. Not per model call — per workflow. When you hit it, you stop. Not "try to stop gracefully." Stop.

{
  "workflow": "api-doc-refinement",
  "budget": {
    "max_cost_usd": 2.00,
    "tracking": "cumulative",
    "on_exceeded": "kill"
  }
}

The key word is kill. Not "warn." Not "try to wrap up." The orchestrator terminates the loop and returns whatever output it has. A $2 answer that exists beats a $50 answer that's 4% better.

This is your seatbelt. Everything else is driving skill.
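The guard itself is a few lines of deterministic code. A minimal sketch — the state shape, function names, and per-call costs are illustrative, not from any particular orchestrator:

```typescript
// Cumulative budget guard: every model call in the workflow counts
// against the same ceiling, matching "tracking": "cumulative".
interface BudgetState {
  maxCostUsd: number;
  spentUsd: number;
}

function recordCall(state: BudgetState, callCostUsd: number): "continue" | "kill" {
  state.spentUsd += callCostUsd;
  // Hard kill at the ceiling — no "try to wrap up" path exists.
  return state.spentUsd >= state.maxCostUsd ? "kill" : "continue";
}

const state: BudgetState = { maxCostUsd: 2.0, spentUsd: 0 };
recordCall(state, 0.8); // "continue" — $0.80 spent
recordCall(state, 0.9); // "continue" — $1.70 spent
recordCall(state, 0.5); // "kill" — $2.20 >= $2.00
```

The important design choice is that `recordCall` is the only path cost flows through, so nothing can spend outside the envelope.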

2. Convergence detection

Diff the last two outputs. If they're nearly identical, you've converged — further iterations are burning tokens for marginal gains.

Round 3 output vs Round 4 output:
- Similarity: 0.94
- Changed tokens: 31 out of 847
- Semantic diff: rewording only, no new information

→ Converged. Stop.

You can measure this with cosine similarity on embeddings, token-level diff ratios, or even structured checks like "no new action items in the critique." A reliable approach: embedding similarity above ~0.92 combined with a check that the critique contains no novel issues — either signal alone can false-positive, but together they work.

The exact threshold depends on your embedding model and what you're comparing (full document vs. sections). Tune it for your use case. Documentation converges faster than code generation. Debate loops need a lower threshold because the format stays similar even when the arguments change. The threshold matters less than having one at all.
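Here's a sketch of the combined check. I'm using token-set Jaccard similarity as a stand-in for embedding cosine (which needs a model call); the 0.92 threshold is the starting point suggested above:

```typescript
// Jaccard similarity over token sets — a cheap proxy for embedding
// cosine similarity. Swap in your real metric in production.
function tokenSimilarity(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 1 : inter / union;
}

function hasConverged(
  prev: string,
  curr: string,
  novelIssues: string[], // new issues extracted from this round's critique
  threshold = 0.92
): boolean {
  // Both signals must agree: outputs nearly identical AND the critique
  // raised nothing new. Either alone can false-positive.
  return tokenSimilarity(prev, curr) >= threshold && novelIssues.length === 0;
}
```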

3. Step-limit with escalation

Sometimes you hit your step limit and the output genuinely isn't ready. max_iterations = 10 just truncates. Escalation does something useful instead.

Step limit reached (10 iterations).
Output quality score: 0.64 (below 0.80 threshold).

→ Escalating to frontier model for final pass.

After N steps, instead of stopping cold, you hand the accumulated context to a more capable (and more expensive) model for a single final pass. Or you flag it for human review. The point is: the step limit triggers an action, not just a halt.
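The decision at the limit is small enough to sketch. The `frontierPass` hook is hypothetical — substitute whatever calls your more capable model:

```typescript
type EscalationResult =
  | { action: "accept"; output: string }
  | { action: "frontier_pass"; output: string }
  | { action: "human_review"; output: string };

// Called only when the step limit is reached. The limit triggers an
// action, not just a halt.
function onStepLimit(
  output: string,
  qualityScore: number,
  qualityThreshold = 0.8,
  frontierPass?: (context: string) => string
): EscalationResult {
  if (qualityScore >= qualityThreshold) {
    return { action: "accept", output }; // good enough — no escalation needed
  }
  if (frontierPass) {
    // One final pass with a more capable model over the accumulated context
    return { action: "frontier_pass", output: frontierPass(output) };
  }
  return { action: "human_review", output }; // no frontier hook: flag it instead
}
```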

This is the difference between a circuit breaker that trips and protects the system, and a fuse that blows and leaves you in the dark.

4. Deadlock breaker

This is the one that catches the overnight loop. Detect when agents are passing the same information back and forth without making progress.

The simplest check: if Agent B's input is more than 90% similar to its previous input, the agents might be in a cycle. But that can false-positive on structured templates where inputs naturally look similar. A better signal is detecting repeating states across both agents: if the critic raises the same class of issue for the third time (even if the wording differs), you're cycling. Track critique themes, not just text similarity.

Round 5: Critic says "error examples could be more specific"
Round 7: Critic says "the error handling examples lack specificity"
Round 9: Critic says "consider adding more specific error scenarios"

→ Cycle detected. Same feedback pattern repeated 3x. Breaking.

Implementation: keep a sliding window of the last 3-4 inputs to each agent. Compute pairwise similarity. If any pair exceeds your threshold, break the cycle.

Deadlock detection catches the failure mode that convergence detection misses: when outputs are changing (so they don't look converged) but the nature of the changes is circular.
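The sliding-window check sketched above, with token-set Jaccard again standing in for whatever similarity metric you actually use:

```typescript
// Cheap stand-in similarity metric — replace with embeddings or
// critique-theme matching in production.
function jaccard(a: string, b: string): number {
  const sa = new Set(a.toLowerCase().split(/\s+/));
  const sb = new Set(b.toLowerCase().split(/\s+/));
  const inter = [...sa].filter((t) => sb.has(t)).length;
  return inter / new Set([...sa, ...sb]).size;
}

function cycleDetected(
  recentInputs: string[], // sliding window: last 3-4 inputs to the agent
  threshold = 0.9
): boolean {
  // Compare every pair in the window. Any near-duplicate pair means the
  // agents may be passing the same information back and forth.
  for (let i = 0; i < recentInputs.length; i++) {
    for (let j = i + 1; j < recentInputs.length; j++) {
      if (jaccard(recentInputs[i], recentInputs[j]) >= threshold) return true;
    }
  }
  return false;
}
```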

5. Quality gate

Define acceptance criteria upfront and check them each round. "Does the output cover all 5 API endpoints? Are error codes documented? Are examples included for each method?" These are structured yes/no checks — not "should we continue?" but "are these specific criteria met?"

This is the missing piece from the max_iterations approach: instead of "stop after N rounds," it's "stop when the output is done." The acceptance criteria make termination goal-directed rather than arbitrary. An LLM can evaluate them — but as binary checklist items, not as an open-ended quality judgment.

The distinction from convergence matters: convergence says "nothing is changing." A quality gate says "everything required is present." An output can converge on something incomplete (criteria not met), or meet all criteria on round 2 (no need to keep going).
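As code, the gate is just a checklist of predicates. The criteria names mirror the examples above; the regex checks are illustrative placeholders for real evaluators:

```typescript
type Criterion = { name: string; met: (output: string) => boolean };

// Binary yes/no checks — not "should we continue?" but "is this present?"
const criteria: Criterion[] = [
  { name: "endpoints_covered", met: (o) => /GET|POST|PUT|DELETE/.test(o) },
  { name: "error_codes_documented", met: (o) => /\b[45]\d\d\b/.test(o) },
  { name: "examples_present", met: (o) => /example:/i.test(o) },
];

function qualityGateMet(output: string): boolean {
  // Stop when the output is done, not after N rounds.
  return criteria.every((c) => c.met(output));
}
```

An LLM can sit behind any individual `met` function as a yes/no evaluator; the aggregation stays deterministic.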

6. Diminishing returns

Convergence detection asks "are the outputs the same?" Diminishing returns asks a different question: "are they getting better?"

Track the rate of improvement per round. If the delta between round N and round N+1 is 80% smaller than the delta between round N-1 and round N, improvement is flattening. Stop.

This catches the case where the critic keeps finding real issues but they're increasingly minor — comma placement, word choice, formatting nits. Technically not converged (outputs are still changing), technically not cycling (the changes are genuine), but practically done. You're burning tokens for marginal gains that no human would notice.
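A sketch of the delta comparison, assuming you already produce a per-round quality score from some evaluator (the scores themselves are hypothetical):

```typescript
// Flag when the latest improvement delta has collapsed relative to the
// previous one — "80% smaller" is the dropRatio default.
function improvementFlattened(
  scores: number[], // quality score per round, oldest first
  dropRatio = 0.8
): boolean {
  if (scores.length < 3) return false; // need two deltas to compare
  const n = scores.length;
  const prevDelta = scores[n - 2] - scores[n - 3];
  const lastDelta = scores[n - 1] - scores[n - 2];
  if (prevDelta <= 0) return true; // already stopped improving
  return lastDelta <= prevDelta * (1 - dropRatio);
}
```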

Beyond the loop

Four operational guards that don't need their own subsections but belong in any production config:

  • Human-in-the-loop checkpoint. At the alert threshold or after N rounds, pause and notify a human (Slack, webhook) instead of auto-escalating to a frontier model. Not every workflow should auto-resolve.
  • Error rate threshold. If tool calls or model calls fail 3+ times consecutively, break out instead of retrying into the same wall.
  • External abort signal. An outside system — monitoring dashboard, user action, webhook — should be able to kill a running loop. The orchestrator polls for abort signals between rounds.
  • Output length cap. If generated output exceeds a max token count, the model is rambling or over-generating. Terminate and return what you have.
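Three of the four collapse into one deterministic check between rounds (the human checkpoint needs its own notification path). The state shape and thresholds here are my own assumptions:

```typescript
interface GuardState {
  consecutiveErrors: number;
  outputTokens: number;
  abortRequested: boolean; // set by webhook/dashboard, polled between rounds
}

function guardsTripped(
  s: GuardState,
  maxErrors = 3,
  maxOutputTokens = 8000
): boolean {
  if (s.consecutiveErrors >= maxErrors) return true; // retrying into the same wall
  if (s.outputTokens >= maxOutputTokens) return true; // model is rambling
  if (s.abortRequested) return true; // external kill from outside the loop
  return false;
}
```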

The orchestrator's kill switch

Here's the rule that matters most:

Never let an LLM be the only thing standing between you and an infinite loop.

It's tempting. You have a sophisticated system. Why not ask a model "given this conversation history, should we continue or stop?" The model will understand the nuance. It'll make a smart decision.

It won't. It will say "let's do one more round." Almost every time. LLMs are biased toward being helpful, and "let's stop here, this is good enough" is not a helpful-sounding answer. Try it: ask four different models "given this refinement history, should we do another round?" after outputs have clearly converged. Most will say yes. The holdout will say "one more round couldn't hurt."

Can LLMs participate in quality checks? Sure — rubric scoring, self-eval, "are acceptance criteria met?" can all work as inputs to the decision. But the final kill switch must be deterministic code. The termination conditions are if statements, not prompts. The kill switch is a function that returns a boolean, not a chat completion.

const CONVERGENCE_THRESHOLD = 0.92;
const DIMINISHING_RETURNS_THRESHOLD = 0.05;

interface LoopState {
  totalCost: number; budgetCeiling: number;
  wallClockMs: number; timeoutMs: number;
  outputTokens: number; maxOutputTokens: number;
  consecutiveErrors: number; errorThreshold: number;
  abortSignal: boolean; cycleDetected: boolean;
  similarity: number; hasNovelIssues: boolean;
  improvementRate: number; acceptanceCriteriaMet: boolean;
  iteration: number; maxIterations: number;
}

function shouldTerminate(state: LoopState): boolean {
  // Hard stops — non-negotiable resource limits
  if (state.totalCost >= state.budgetCeiling) return true;
  if (state.wallClockMs >= state.timeoutMs) return true;
  if (state.outputTokens >= state.maxOutputTokens) return true;
  if (state.consecutiveErrors >= state.errorThreshold) return true;
  if (state.abortSignal) return true;
  // Smart stops — the loop achieved its goal or stopped improving
  if (state.cycleDetected) return true;
  if (state.similarity > CONVERGENCE_THRESHOLD
      && !state.hasNovelIssues) return true;
  if (state.improvementRate < DIMINISHING_RETURNS_THRESHOLD) return true;
  if (state.acceptanceCriteriaMet) return true;
  // Safety net
  if (state.iteration >= state.maxIterations) return true;
  return false;
}

Ten conditions in priority order. Hard stops (budget, time, output length, errors, external abort) always win — they fire before anything else is checked. Smart stops (deadlock, convergence, diminishing returns, quality gate) come next. The step limit is last because it might trigger escalation rather than a hard stop. All deterministic. No LLM in the loop. This function runs in microseconds and never hallucinates.

Cost envelopes: budgeting agent runs like cloud compute

Once you start treating token spend as a resource to manage rather than a cost to absorb, the mental model clicks. It's cloud compute. You already know how to think about this.

cost_envelopes:
  doc_refinement:
    budget_usd: 2.00
    alert_at: 1.50
    kill_at: 2.00
    expected_cost: 0.60

  code_review_debate:
    budget_usd: 5.00
    alert_at: 3.50
    kill_at: 5.00
    expected_cost: 1.20

  architecture_consensus:
    budget_usd: 8.00
    alert_at: 6.00
    kill_at: 8.00
    expected_cost: 3.00

Three numbers per workflow: expected (what it should cost), alert (something's off), kill (hard stop).

Track the ratio between expected and actual over time. If your doc refinement workflow consistently costs $1.40 instead of $0.60, either your budget is wrong or your convergence detection needs tuning. Both are useful signals.

The burn rate matters too. A workflow that spends $1.80 in 3 rounds is probably fine — that's a complex task doing real work. A workflow that spends $1.80 in 12 rounds is looping. Same cost, very different health.

One thing that catches people: refinement loops get more expensive per iteration, not less. Each round adds to the conversation context. By round 10, you're paying for the full history of all previous rounds in every call. Your $0.20/round estimate from round 2 might be $0.40 by round 8. Budget accordingly — or truncate/summarize context between rounds.

Workflow: doc_refinement
Iteration 1: $0.14 (cumulative: $0.14)
Iteration 2: $0.17 (cumulative: $0.31)
Iteration 3: $0.19 (cumulative: $0.50) ← expected range
Iteration 4: $0.22 (cumulative: $0.72)  ← context growing
Iteration 5: $0.26 (cumulative: $0.98)
⚠️  Alert: 163% of expected cost. Convergence not detected.
Iteration 6: $0.29 (cumulative: $1.27)  ← context growth visible
Iteration 7: $0.33 (cumulative: $1.60)
⚠️  Kill threshold ($2.00) approaching. Budget alert.

If you're running agent loops in production and you don't have this visibility, you don't have production. You have a demo with a credit card attached.
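The envelope check is the same kind of deterministic code as the kill switch. A sketch of the three numbers plus the expected-vs-actual drift ratio (field names are my own):

```typescript
interface Envelope {
  expected: number; // what the workflow should cost
  alertAt: number;  // something's off
  killAt: number;   // hard stop
}

function envelopeStatus(e: Envelope, spentUsd: number): "ok" | "alert" | "kill" {
  if (spentUsd >= e.killAt) return "kill";
  if (spentUsd >= e.alertAt) return "alert";
  return "ok";
}

// Track this ratio over time: consistently > 1 means either the budget
// is wrong or convergence detection needs tuning.
function costDrift(e: Envelope, actualUsd: number): number {
  return actualUsd / e.expected;
}
```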

When to use DAGs instead

Here's the honest version of this section: if your loop count is predictable, you probably want a DAG instead.

A loop says: "I don't know how many steps this will take. I'll keep going until the output is good enough." That's valid for refinement, debate, open-ended exploration.

A DAG says: "I know the steps. Step A feeds into Step B which feeds into Step C. Done." That's valid for everything else.

Loop (use when you genuinely don't know):
    ┌──→ Generate ──→ Critique ──┐
    └────────────────────────────┘

DAG (use when you do know):
    Gather Context → Analyze → Draft → Format → Output

If you find yourself setting max_iterations = 3 because you know it always takes exactly 3 rounds — you don't have a loop. You have a 3-step pipeline pretending to be a loop. Make it a DAG. You'll get the same output without the termination complexity.

Loops are expensive, hard to debug, and require all the termination machinery in this article. They earn their complexity when you genuinely need open-ended exploration — refinement, debate, search. Don't use them when a straight line will do.

Putting it all together

Here's a full termination config. Every agent loop should get something like this.

termination:
  budget:
    max_cost_usd: 2.00
    alert_threshold_usd: 1.50
    on_exceeded: kill

  convergence:
    similarity_threshold: 0.92
    min_iterations_before_check: 2
    method: embedding_cosine

  escalation:
    max_iterations: 10
    quality_threshold: 0.80
    on_limit_reached: escalate_to_frontier

  deadlock:
    window_size: 4
    similarity_threshold: 0.90
    min_cycle_length: 2
    on_detected: break_and_return_best

  quality_gate:
    criteria: [endpoints_covered, error_codes_documented, examples_present]
    check_after_iteration: 2
    on_met: stop

  diminishing_returns:
    min_improvement_delta: 0.05
    window: 3
    on_detected: stop

  guards:
    max_output_tokens: 8000
    max_consecutive_errors: 3
    abort_signal: webhook
    human_checkpoint:
      trigger_at_iteration: 5
      channel: slack

  # The safety net under the safety nets
  hard_timeout_seconds: 300

Five minutes. That's the outer boundary. No agent workflow should take longer than five minutes. If it does, something is wrong — better to debug it tomorrow than pay for it tonight.

The boring truth

The exciting part of multi-model orchestration is the routing — consensus voting, adversarial debate, iterative refinement. That's the part people write about — the Fowler patterns article.

The important part is termination. It's if statements and YAML configs and budget spreadsheets. It doesn't make for exciting conference talks. But it's the difference between a system that runs in production and a system that runs up your bill.

Orchestration patterns tell you how to get better answers from multiple models. Termination conditions tell you when to stop asking.

Build both. Start with the second one.


The orchestration patterns referenced here — consensus, debate, iteration, and judgment — are all tools in MCP Rubber Duck. The termination and budget machinery wraps around them.
