Loop Termination Architecture: How Production Agents Know When to Stop


By W. Kyle Million (~K¹) | IntuiTek¹ | Published on dev.to/@thebrierfox


Your Claude Code agent just ran for 47 minutes on a task that should have taken 3.

It made 200 bash calls. 150 of them were retrying the same failing operation. It wrote 12 intermediate files. Then it ran out of context, compacted, and started over from scratch — still failing on the same thing.

This isn't a model problem. It's an architecture problem. Your agent has no circuit breaker.


The Problem: Autonomous Agents Don't Know When to Quit

When you hand a task to a Claude Code agent in -p mode, you're handing it a goal. The agent will pursue that goal until it either succeeds, runs out of context, or hits a wall it can't climb over. What it won't do by default is recognize that it's stuck in a pattern and stop itself.

Three failure modes that kill production runs:

1. Infinite retry loops

The task fails. The agent retries. It fails again with the same error. The agent retries again. Repeat until context exhausts. If the error is permission denied or service unavailable, no amount of retrying will fix it — but the agent doesn't know that.

2. False progress spirals

The agent makes some progress on each iteration, but never enough to complete the goal. It keeps finding new things to try. From inside the loop it looks like progress. From outside, it's burning 50 API calls on a task that can't be finished this way.

3. Scope creep cascades

The agent discovers that completing task A requires completing task B, which requires task C. Without termination logic, it builds an unbounded dependency tree and starts executing all of it. The original task is buried.


The Pattern: Three-Layer Termination Architecture

Production agents need termination logic at three levels:

Layer 1: Step Counter (Hard Limit)

Every agent execution should have a maximum step count. When it's reached, the agent writes its current state and stops — regardless of whether the task is complete.

class TerminationError(Exception):
    """Raised to halt the agent loop cleanly."""

MAX_STEPS = 50
step_count = 0

def execute_step(action):
    global step_count
    step_count += 1

    if step_count >= MAX_STEPS:
        write_state_file(f"STOPPED: max steps ({MAX_STEPS}) reached at step {step_count}")
        raise TerminationError(f"Hard limit reached: {step_count} steps")

    return perform_action(action)

The right max for your agent depends on the task class. For filesystem operations: 20-30 steps. For multi-file code generation: 50-75. For research and synthesis: 100+. Start conservative; you can always raise it after a clean run.
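One way to make the budget configurable is a per-task-class lookup. The class names and numbers below are assumptions that mirror the ranges above:

```python
# Hypothetical task classes; budgets mirror the ranges suggested above
STEP_BUDGETS = {
    "filesystem": 30,   # filesystem operations: 20-30
    "codegen": 75,      # multi-file code generation: 50-75
    "research": 120,    # research and synthesis: 100+
}

def max_steps_for(task_class: str, default: int = 50) -> int:
    """Look up the step budget for a task class, falling back to a default."""
    return STEP_BUDGETS.get(task_class, default)
```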

In Claude Code terms: if your headless agent is triggered via claude -p, you can't inject this directly into Claude's reasoning. But you can wrap the invocation:

#!/bin/bash
# run_task.sh with timeout-based termination
timeout 300 claude -p "$TASK_PROMPT" \
  --allowedTools "Bash(*),Read(*),Write(*),Edit(*)" \
  --output-format text

if [[ $? -eq 124 ]]; then
  echo "TERMINATED: 5-minute timeout reached" >> logs/errors.log
  bash notify.sh "⚠️ Agent timeout: $TASK_PROMPT"
fi

A 5-minute timeout is a blunt instrument. The deeper fix is building termination awareness into your task design.
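If you orchestrate from Python rather than bash, the same blunt instrument can be sketched with `subprocess` (the command and timeout are placeholders; the 124 mirrors GNU timeout's convention):

```python
import subprocess

def run_with_timeout(cmd: list[str], timeout_s: int) -> tuple[int, bool]:
    """Run a command; return (exit_code, timed_out).

    subprocess.run kills the child process when the timeout expires,
    so the agent cannot outlive its budget.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        return 124, True  # mirror GNU timeout's exit code

# e.g. run_with_timeout(["claude", "-p", task_prompt], 300)
```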

Layer 2: Error Accumulation Counter (Smart Limit)

Not all failure is equal. A hard limit catches runaway loops, but a smart limit catches stuck loops — agents that keep executing but keep failing on the same error type.

error_counts = {}
ERROR_THRESHOLD = 3  # same error type: stop after 3

def handle_error(error_type: str, context: str):
    error_counts[error_type] = error_counts.get(error_type, 0) + 1

    if error_counts[error_type] >= ERROR_THRESHOLD:
        write_escalation(
            f"BLOCKED: {error_type} failed {error_counts[error_type]} times. "
            f"Last context: {context}. Stopping."
        )
        raise TerminationError(f"Repeated failure: {error_type}")

    return retry_with_backoff()
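The `retry_with_backoff` helper is left undefined above. A minimal exponential-backoff sketch with jitter might look like this:

```python
import random
import time

def retry_with_backoff(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Sleep for an exponentially growing, jittered delay; return the delay used."""
    # Double the delay each attempt, cap it, then jitter downward so
    # concurrent agents don't all retry at the same instant.
    delay = min(base * (2 ** attempt), cap) * random.uniform(0.5, 1.0)
    time.sleep(delay)
    return delay
```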

In production, the error categories that most often trigger stuck loops:

  • PermissionError / permission denied — fix is environmental, not retry
  • ConnectionRefusedError / service unavailable — fix requires intervention
  • KeyError / AttributeError on the same field — the data isn't there; more retries won't produce it
  • FileNotFoundError on a generated file — prior step failed silently

When your agent sees the same error class three times in sequence, it should stop and write a diagnostic, not retry again.
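That distinction can be encoded as a tiny retry policy: environmental errors stop immediately, everything else gets the three-strike threshold. The fatal-error set below is an assumption; tune it for your stack.

```python
# Errors whose fix is environmental: retrying cannot help (assumed set)
FATAL_ERRORS = {"PermissionError", "ConnectionRefusedError"}

def should_retry(error_type: str, seen_count: int, threshold: int = 3) -> bool:
    """Decide whether another retry is worthwhile for this error class."""
    if error_type in FATAL_ERRORS:
        return False  # fix requires human or environment intervention
    return seen_count < threshold
```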

Layer 3: Goal Proximity Check (Semantic Limit)

The hardest termination problem: the agent is making progress but will never finish this way. Step counters won't catch it. Error counters won't catch it. But goal proximity can.

Before each step, the agent should evaluate: Is this action moving me toward completion, or is it maintenance?

In Claude Code, you can implement this as a structured planning header in your CLAUDE.md:

## TASK EXECUTION PROTOCOL

Before each action:
1. State what completion looks like (one sentence)
2. State what this specific action accomplishes toward that goal
3. If you cannot articulate the connection, STOP and write the blocker to outputs/

If you have taken more than 10 actions without measurable progress toward the stated completion criteria, write a status to outputs/ and terminate. Do not keep trying.

This sounds simple. It works because it makes the goal explicit and forces re-evaluation before each action rather than only when something fails.
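If your harness can observe progress directly, the same rule can be enforced programmatically. A sketch, where "progress" is whatever completion metric you define:

```python
class ProgressMonitor:
    """Stop the loop after `window` consecutive actions with no measurable progress."""

    def __init__(self, window: int = 10):
        self.window = window
        self.stalled = 0  # consecutive actions without progress

    def record(self, made_progress: bool) -> bool:
        """Record one action; return True if the agent should keep going."""
        self.stalled = 0 if made_progress else self.stalled + 1
        return self.stalled < self.window
```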


What "Stopping Clean" Means

Stopping is not failing. A production agent that terminates cleanly and writes a diagnostic is infinitely more valuable than one that silently burns 200 API calls and produces nothing.

Clean termination means:

  1. Write current state — what was completed, what wasn't, what state the filesystem is in
  2. Write the blocker — exactly what stopped execution (error, step limit, goal check)
  3. Preserve partial work — don't clean up partially-completed files; document them instead
  4. Notify — push to your notification channel so you know to review
import sys

def terminate_clean(reason: str, state: dict):
    output_file = f"outputs/terminated_{timestamp()}.md"

    with open(output_file, "w") as f:
        f.write(f"# Agent Terminated: {reason}\n\n")
        f.write(f"**Steps taken:** {state['steps']}\n")
        f.write(f"**Completed:** {state['completed']}\n")
        f.write(f"**Incomplete:** {state['incomplete']}\n")
        f.write(f"**Files written:** {state['files']}\n")
        f.write(f"**Blocker:** {state['last_error']}\n")

    notify(f"⚠️ Agent stopped: {reason}. Details in {output_file}")
    sys.exit(0)  # 0, not 1: clean termination is not an error

Note the sys.exit(0). If your orchestrator maps exit codes to success or failure, a clean termination should return 0. Reserve exit 1 for unhandled crashes.


Putting It Together: The Circuit Breaker Pattern

The full architecture looks like this:

Agent starts
    │
    ▼
Check: step_count < MAX_STEPS
    │ no → terminate_clean("max steps")
    ▼
Execute action
    │
    ├── success → continue
    │
    └── error → error_counts[type]++
                    │
                    ├── count < THRESHOLD → retry with backoff
                    │
                    └── count >= THRESHOLD → terminate_clean("repeated error: {type}")
    │
    ▼
Goal proximity check
    │ no progress in N steps → terminate_clean("no progress")
    ▼
Repeat

The three layers create defense-in-depth:

  • Step counter catches runaway loops
  • Error accumulator catches stuck loops
  • Goal check catches false progress spirals

None of them requires you to predict what will go wrong. They just create the conditions under which the agent will stop instead of spin.
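Pulled together, the three layers fit in one small class. This is a minimal sketch of the pattern under the thresholds discussed above, not a full implementation:

```python
class TerminationError(Exception):
    """Raised to halt the agent loop cleanly."""

class CircuitBreaker:
    """Three-layer termination: hard step limit, per-error threshold, progress window."""

    def __init__(self, max_steps: int = 50, error_threshold: int = 3,
                 progress_window: int = 10):
        self.max_steps = max_steps
        self.error_threshold = error_threshold
        self.progress_window = progress_window
        self.steps = 0
        self.errors: dict[str, int] = {}
        self.stalled = 0

    def check_step(self) -> None:
        """Layer 1: hard step limit."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise TerminationError(f"max steps ({self.max_steps}) reached")

    def check_error(self, error_type: str) -> None:
        """Layer 2: same error class repeated too often."""
        self.errors[error_type] = self.errors.get(error_type, 0) + 1
        if self.errors[error_type] >= self.error_threshold:
            raise TerminationError(f"repeated error: {error_type}")

    def check_progress(self, made_progress: bool) -> None:
        """Layer 3: too many consecutive actions without progress."""
        self.stalled = 0 if made_progress else self.stalled + 1
        if self.stalled >= self.progress_window:
            raise TerminationError("no progress")
```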


The Difference Between a Script and a System

A script does what you tell it. A system knows when it can't do what you told it.

Production Claude Code agents are systems. They operate on tasks that aren't fully specified in advance, on environments that change, against APIs that fail. The question isn't whether your agent will eventually hit a wall — it's whether it knows how to stop when it does.

Loop termination architecture is the difference between an agent that costs $0.003 per run and one that costs $0.47 because it retried a network error 80 times before context death.


The Production Implementation

The patterns above are the foundation. The production implementation includes:

  • Full circuit breaker class with configurable thresholds per error type
  • Goal proximity evaluator with configurable check intervals
  • Clean termination handler with filesystem state snapshot
  • Integration hooks for common notification channels (Telegram, Slack, webhook)
  • Claude Code CLAUDE.md templates for embedding termination logic in agent instructions
  • Tested with the 6 most common failure patterns in production Claude Code deployments

Loop Termination Architecture — Production Agent Circuit Breaker:
https://www.shopclawmart.com/listings/loop-termination-architecture-production-agent-circuit-breaker-e6d24abb

$19. Instant download. One-time purchase.


Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Tags: claudecode, devtools, aiagents, programming, productivity
