Jamie Cole

Posted on Feb 26 • Edited on Mar 3

How to Detect When Your AI Agent Is Stuck (And What to Do About It)

#ai #automation #python #programming

Every autonomous agent eventually gets stuck. Not "fails with an error" stuck — that's easy to handle. I mean the subtler kind: where the agent keeps doing things, keeps running, keeps spending tokens, but makes no actual progress.

This is one of the harder problems in agent engineering because stuck agents look normal from the outside. The heartbeat fires, the logs show activity, the API calls succeed. But the task never completes.

Here's a pattern I use to detect and recover from stuck states.

What "Stuck" Actually Looks Like

Stuck agents fall into three categories:

The Repeater — takes the same action over and over. Usually happens when the action should change state but doesn't (API returns the same result, tool isn't working as expected, model keeps selecting the same tool because its description is misleading).

The Wanderer — keeps doing things but nothing connects to the goal. The agent is "busy" but not making progress. Common when the goal isn't encoded clearly in state, or when the model gets distracted by interesting-but-irrelevant information.

The Looper — alternates between a small set of actions without resolving. Action A leads to situation B, B leads back to A. Usually a failure to distinguish between "I tried this and it didn't work" and "I should try this".

All three have the same symptom: high activity, zero progress.

The Detection Pattern

from dataclasses import dataclass, field
from datetime import datetime, UTC
from typing import Optional
import json

@dataclass
class StuckDetector:
    # How many identical recent actions before we call it stuck
    repeat_threshold: int = 3
    # How many heartbeats without measurable progress before we escalate
    no_progress_threshold: int = 5
    # Track recent action types
    recent_actions: list = field(default_factory=list)
    # Track progress metric (define this for your domain)
    progress_snapshots: list = field(default_factory=list)

    def record_action(self, action_type: str, progress_metric: float) -> None:
        """Call this after every action."""
        self.recent_actions.append({
            "type": action_type,
            "timestamp": datetime.now(UTC).isoformat(),
            "progress": progress_metric
        })
        # Keep last 10
        self.recent_actions = self.recent_actions[-10:]

    def is_repeating(self) -> bool:
        """Returns True if last N actions are the same type."""
        if len(self.recent_actions) < self.repeat_threshold:
            return False
        last_n = [a["type"] for a in self.recent_actions[-self.repeat_threshold:]]
        return len(set(last_n)) == 1

    def is_not_progressing(self) -> bool:
        """Returns True if progress hasn't changed in N heartbeats."""
        if len(self.recent_actions) < self.no_progress_threshold:
            return False
        recent_progress = [a["progress"] for a in self.recent_actions[-self.no_progress_threshold:]]
        # Progress is the same if all values are within 1% of each other
        return max(recent_progress) - min(recent_progress) < 0.01

    def stuck_reason(self) -> Optional[str]:
        """Returns description of why agent is stuck, or None if not stuck."""
        if self.is_repeating():
            action = self.recent_actions[-1]["type"]
            return f"Repeating action '{action}' {self.repeat_threshold}+ times"
        if self.is_not_progressing():
            return f"No measurable progress in last {self.no_progress_threshold} heartbeats"
        return None

Defining your progress metric is the hard part. It depends on what your agent is supposed to accomplish:

Research agent: number of unique sources gathered
Sales agent: pipeline stage (0=prospect, 1=contacted, 2=replied, 3=meeting)
Code agent: test pass rate or number of failing tests
Content agent: word count of completed draft

Pick a metric that can only go up when real work is done. Not "number of API calls made" — that goes up even when the agent is spinning.

What To Do When Stuck

The recovery strategy depends on the stuck type:

def handle_stuck(state: dict, reason: str, detector: StuckDetector) -> dict:
    """Called when stuck detection fires. Updates state with recovery plan."""

    state["stuck_reason"] = reason
    state["stuck_detected_at"] = datetime.now(UTC).isoformat()

    if detector.is_repeating():
        # The action isn't working. Try something different.
        state["recovery_mode"] = "try_alternative"
        state["last_failed_action"] = detector.recent_actions[-1]["type"]
        state["instructions"] = (
            f"Your last {detector.repeat_threshold} actions were all '{state['last_failed_action']}' "
            f"and didn't advance the goal. "
            f"Try a completely different approach. "
            f"What else could you do to make progress?"
        )

    elif detector.is_not_progressing():
        # Busy but not productive. Reassess the goal.
        state["recovery_mode"] = "reassess_goal"
        state["instructions"] = (
            "You've been active for several heartbeats but the task isn't getting closer to done. "
            "Stop and think: what does 'done' actually mean here? "
            "What's the single most important thing that would move the needle?"
        )

    # Track stuck count
    state["consecutive_stuck_heartbeats"] = state.get("consecutive_stuck_heartbeats", 0) + 1

    # Escalate if stuck for too long
    if state["consecutive_stuck_heartbeats"] >= 3:
        state["status"] = "needs_human"
        state["escalation_message"] = (
            f"STUCK: {reason}. "
            f"Tried recovery {state['consecutive_stuck_heartbeats']} times without progress. "
            f"Task: {state.get('current_task', 'unknown')}. "
            f"Last action: {detector.recent_actions[-1] if detector.recent_actions else 'none'}"
        )
        notify_human(state["escalation_message"])

    return state

Integrating Into Your Heartbeat

def run_heartbeat():
    state = load_state()
    detector = StuckDetector.from_state(state)  # Persist detector state between runs

    # Check for stuck before doing anything
    stuck_reason = detector.stuck_reason()
    if stuck_reason:
        state = handle_stuck(state, stuck_reason, detector)
        if state.get("status") == "needs_human":
            save_state(state)
            return  # Stop and wait for human
        # Otherwise, state["instructions"] will guide the recovery action

    # Clear stuck counter on successful progress
    if not stuck_reason:
        state["consecutive_stuck_heartbeats"] = 0
        state.pop("recovery_mode", None)

    # ... rest of your heartbeat logic
    action_result = take_action(state, detector)
    detector.record_action(action_result["type"], measure_progress(state))

    # Persist detector state
    state["detector"] = {
        "recent_actions": detector.recent_actions,
    }
    save_state(state)

The Kill Switch

Stuck detection handles gradual failure. But sometimes you need to stop an agent immediately — runaway costs, unexpected behaviour, system issue.

The simplest kill switch is a file:

from pathlib import Path

KILL_FILE = Path("KILL")

def should_stop() -> bool:
    """Call at the start of every heartbeat."""
    return KILL_FILE.exists()

def run_heartbeat():
    if should_stop():
        print("Kill switch active — stopping")
        # Optionally: write final state, send alert
        return

    # ... rest of heartbeat

To stop the agent: touch KILL. To restart: rm KILL.

No dependencies, no API calls, no chance of the stop signal getting lost. Just a file.

Summary

Define a progress metric specific to your agent's goal
Track recent actions to detect repetition (The Repeater) and stagnation (The Wanderer/Looper)
Recovery strategy depends on stuck type: change approach for repeater, reassess goal for wanderer
Escalate to human after N failed recovery attempts — don't loop on recovery indefinitely
Kill switch for immediate stop: just check for a file at the start of each heartbeat

These five things prevent agents from running up large bills making no progress, which is one of the most common production failures I've seen.

If you're also thinking about agents that need to generate revenue autonomously, the failure modes get more specific. Covered them here: what actually goes wrong when autonomous agents try to make money.

Jamie Cole — indie developer, UK. Building autonomous agents and writing about what breaks.

More patterns in the full guide: Autonomous AI Agents with Claude (£25)

About to deploy? Check the production readiness checklist first ($9).

Building autonomous agents? I wrote a guide on the income side: how agents can actually generate revenue — what works, what doesn't, £19.

DEV Community