Every autonomous agent eventually gets stuck. Not "fails with an error" stuck — that's easy to handle. I mean the subtler kind: where the agent keeps doing things, keeps running, keeps spending tokens, but makes no actual progress.
This is one of the harder problems in agent engineering because stuck agents look normal from the outside. The heartbeat fires, the logs show activity, the API calls succeed. But the task never completes.
Here's a pattern I use to detect and recover from stuck states.
What "Stuck" Actually Looks Like
Stuck agents fall into three categories:
The Repeater — takes the same action over and over. Usually happens when the action should change state but doesn't (API returns the same result, tool isn't working as expected, model keeps selecting the same tool because its description is misleading).
The Wanderer — keeps doing things but nothing connects to the goal. The agent is "busy" but not making progress. Common when the goal isn't encoded clearly in state, or when the model gets distracted by interesting-but-irrelevant information.
The Looper — alternates between a small set of actions without resolving. Action A leads to situation B, B leads back to A. Usually a failure to distinguish between "I tried this and it didn't work" and "I should try this".
All three have the same symptom: high activity, zero progress.
The Detection Pattern
from dataclasses import dataclass, field
from datetime import datetime, UTC
from typing import Optional
import json
@dataclass
class StuckDetector:
# How many identical recent actions before we call it stuck
repeat_threshold: int = 3
# How many heartbeats without measurable progress before we escalate
no_progress_threshold: int = 5
# Track recent action types
recent_actions: list = field(default_factory=list)
# Track progress metric (define this for your domain)
progress_snapshots: list = field(default_factory=list)
def record_action(self, action_type: str, progress_metric: float) -> None:
"""Call this after every action."""
self.recent_actions.append({
"type": action_type,
"timestamp": datetime.now(UTC).isoformat(),
"progress": progress_metric
})
# Keep last 10
self.recent_actions = self.recent_actions[-10:]
def is_repeating(self) -> bool:
"""Returns True if last N actions are the same type."""
if len(self.recent_actions) < self.repeat_threshold:
return False
last_n = [a["type"] for a in self.recent_actions[-self.repeat_threshold:]]
return len(set(last_n)) == 1
def is_not_progressing(self) -> bool:
"""Returns True if progress hasn't changed in N heartbeats."""
if len(self.recent_actions) < self.no_progress_threshold:
return False
recent_progress = [a["progress"] for a in self.recent_actions[-self.no_progress_threshold:]]
# Progress is the same if all values are within 1% of each other
return max(recent_progress) - min(recent_progress) < 0.01
def stuck_reason(self) -> Optional[str]:
"""Returns description of why agent is stuck, or None if not stuck."""
if self.is_repeating():
action = self.recent_actions[-1]["type"]
return f"Repeating action '{action}' {self.repeat_threshold}+ times"
if self.is_not_progressing():
return f"No measurable progress in last {self.no_progress_threshold} heartbeats"
return None
Defining your progress metric is the hard part. It depends on what your agent is supposed to accomplish:
- Research agent: number of unique sources gathered
- Sales agent: pipeline stage (0=prospect, 1=contacted, 2=replied, 3=meeting)
- Code agent: test pass rate or number of failing tests
- Content agent: word count of completed draft
Pick a metric that can only go up when real work is done. Not "number of API calls made" — that goes up even when the agent is spinning.
What To Do When Stuck
The recovery strategy depends on the stuck type:
def handle_stuck(state: dict, reason: str, detector: StuckDetector) -> dict:
"""Called when stuck detection fires. Updates state with recovery plan."""
state["stuck_reason"] = reason
state["stuck_detected_at"] = datetime.now(UTC).isoformat()
if detector.is_repeating():
# The action isn't working. Try something different.
state["recovery_mode"] = "try_alternative"
state["last_failed_action"] = detector.recent_actions[-1]["type"]
state["instructions"] = (
f"Your last {detector.repeat_threshold} actions were all '{state['last_failed_action']}' "
f"and didn't advance the goal. "
f"Try a completely different approach. "
f"What else could you do to make progress?"
)
elif detector.is_not_progressing():
# Busy but not productive. Reassess the goal.
state["recovery_mode"] = "reassess_goal"
state["instructions"] = (
"You've been active for several heartbeats but the task isn't getting closer to done. "
"Stop and think: what does 'done' actually mean here? "
"What's the single most important thing that would move the needle?"
)
# Track stuck count
state["consecutive_stuck_heartbeats"] = state.get("consecutive_stuck_heartbeats", 0) + 1
# Escalate if stuck for too long
if state["consecutive_stuck_heartbeats"] >= 3:
state["status"] = "needs_human"
state["escalation_message"] = (
f"STUCK: {reason}. "
f"Tried recovery {state['consecutive_stuck_heartbeats']} times without progress. "
f"Task: {state.get('current_task', 'unknown')}. "
f"Last action: {detector.recent_actions[-1] if detector.recent_actions else 'none'}"
)
notify_human(state["escalation_message"])
return state
Integrating Into Your Heartbeat
def run_heartbeat():
state = load_state()
detector = StuckDetector.from_state(state) # Persist detector state between runs
# Check for stuck before doing anything
stuck_reason = detector.stuck_reason()
if stuck_reason:
state = handle_stuck(state, stuck_reason, detector)
if state.get("status") == "needs_human":
save_state(state)
return # Stop and wait for human
# Otherwise, state["instructions"] will guide the recovery action
# Clear stuck counter on successful progress
if not stuck_reason:
state["consecutive_stuck_heartbeats"] = 0
state.pop("recovery_mode", None)
# ... rest of your heartbeat logic
action_result = take_action(state, detector)
detector.record_action(action_result["type"], measure_progress(state))
# Persist detector state
state["detector"] = {
"recent_actions": detector.recent_actions,
}
save_state(state)
The Kill Switch
Stuck detection handles gradual failure. But sometimes you need to stop an agent immediately — runaway costs, unexpected behaviour, system issue.
The simplest kill switch is a file:
from pathlib import Path
KILL_FILE = Path("KILL")
def should_stop() -> bool:
"""Call at the start of every heartbeat."""
return KILL_FILE.exists()
def run_heartbeat():
if should_stop():
print("Kill switch active — stopping")
# Optionally: write final state, send alert
return
# ... rest of heartbeat
To stop the agent: touch KILL. To restart: rm KILL.
No dependencies, no API calls, no chance of the stop signal getting lost. Just a file.
Summary
- Define a progress metric specific to your agent's goal
- Track recent actions to detect repetition (The Repeater) and stagnation (The Wanderer/Looper)
- Recovery strategy depends on stuck type: change approach for repeater, reassess goal for wanderer
- Escalate to human after N failed recovery attempts — don't loop on recovery indefinitely
- Kill switch for immediate stop: just check for a file at the start of each heartbeat
These five things prevent agents from running up large bills making no progress, which is one of the most common production failures I've seen.
If you're also thinking about agents that need to generate revenue autonomously, the failure modes get more specific. Covered them here: what actually goes wrong when autonomous agents try to make money.
Jamie Cole — indie developer, UK. Building autonomous agents and writing about what breaks.
More patterns in the full guide: Autonomous AI Agents with Claude (£25)
About to deploy? Check the production readiness checklist first ($9).
Building autonomous agents? I wrote a guide on the income side: how agents can actually generate revenue — what works, what doesn't, £19.
Top comments (0)