Your Claude Code agent runs clean in dev. You test it 10 times. It works every time.
Then you deploy it headless — scheduled, no supervision — and three days later you check and the thing has been silently failing since Tuesday.
No output. No errors. Just… nothing.
This is the most common failure mode for autonomous Claude Code agents, and it's not a Claude problem. It's an architecture problem. The agent had no recovery layer.
Here are the five patterns that fix it.
## Why autonomous agents fail differently than interactive ones
In an interactive session, you notice when something goes wrong. You see the error, you correct, you continue.
In a headless agent (scheduled via cron, running in `-p` mode), there's no one watching. Failures don't surface until something downstream breaks — a report doesn't appear, a delivery doesn't happen, a deadline is missed.
The architecture has to do what you'd do manually: detect the failure, classify it, attempt recovery if safe, escalate if not.
That requires explicit design. It doesn't happen by default.
## Pattern 1: Fail-safe exits with full state capture
Before any long or risky operation, write your current state.
```bash
# Write a checkpoint before the risky step
cat > ~/intuitek/coordination/in_progress/deploy_checkpoint.json << EOF
{
  "task": "deploy_new_listing",
  "step": "pre-upload",
  "timestamp": "$(date -Iseconds)",
  "payload_hash": "$(sha256sum listing.json | cut -d' ' -f1)"
}
EOF

# Now do the risky operation
upload_listing listing.json
EXIT_CODE=$?

if [ $EXIT_CODE -ne 0 ]; then
  # State is captured. Log the failure.
  echo "$(date -Iseconds) | ERROR | deploy_new_listing failed at pre-upload | exit=$EXIT_CODE" >> logs/errors.log
  # Notify. Don't silently die.
  bash notify.sh "⚠️ Deploy failed at upload step — checkpoint at coordination/in_progress/deploy_checkpoint.json"
  exit $EXIT_CODE
fi

# Clean checkpoint on success
rm ~/intuitek/coordination/in_progress/deploy_checkpoint.json
```
The pattern: write state, do work, delete state on success. If the checkpoint file exists at next run, you know where the last run died.
## Pattern 2: Idempotent task design
An idempotent task can run twice and produce the same result as running once. This is non-negotiable for any automated agent.
**Wrong:** publish article to dev.to → always creates a new article

**Right:** check if an article with that title exists → if yes, skip → if no, publish

**Wrong:** create Stripe payment link → always creates a new link

**Right:** check if a link already exists for this product → if yes, return the existing link → if no, create
Idempotency means a failed-and-retried task doesn't create duplicates, double-charges, or conflicting state. Write every autonomous task with the assumption it will be retried.
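The check-before-create shape above can be sketched with a local title registry standing in for the remote "does it already exist" query. `publish_once` and `REGISTRY` are illustrative names, not part of any real API:

```bash
# Idempotency sketch: record what has already been published, and check
# the record before publishing again. REGISTRY is a local stand-in for
# an API lookup by title.
REGISTRY="${REGISTRY:-/tmp/published_titles.txt}"

publish_once() {
  local title="$1"
  touch "$REGISTRY"
  # Already published? Then a retry is a no-op, not a duplicate.
  if grep -Fxq "$title" "$REGISTRY"; then
    echo "skip: $title already published"
    return 0
  fi
  # The real publish call (API POST, CLI, etc.) would go here.
  echo "$title" >> "$REGISTRY"
  echo "published: $title"
}
```

Run it twice with the same title and the second run skips, which is exactly the property a retried task needs.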
## Pattern 3: Error classification before recovery
Not all errors are the same. Treating them the same is how recovery logic makes things worse.
| Error type | Description | Recovery action |
|---|---|---|
| Transient | Network timeout, rate limit, temporary unavailability | Retry with backoff |
| Input error | Bad data, wrong format, missing field | Log and skip — retrying won't help |
| Auth error | Expired token, invalid key | Attempt token refresh, then escalate |
| Fatal | Data corruption, unrecoverable state | Halt, alert, require human review |
In practice:
```bash
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_ENDPOINT" ...)

case $HTTP_STATUS in
  200|201) echo "Success" ;;
  429) echo "Rate limited — sleeping 60s" && sleep 60 && retry ;;
  401) echo "Auth failed — attempting token refresh" && refresh_token ;;
  400) echo "Bad request — input error, skipping" && log_skip ;;
  5*)  echo "Server error — transient, will retry next cycle" ;;
  *)   echo "Unknown status $HTTP_STATUS — escalating" && notify_kyle ;;
esac
```
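The "retry with backoff" action from the transient row can be a small wrapper. This is a sketch; `retry_with_backoff`, `MAX_TRIES`, and `BASE_DELAY` are illustrative names, not an existing utility:

```bash
# Run a command up to MAX_TRIES times, doubling the delay between
# attempts. Returns 0 on the first success, 1 if every attempt fails.
retry_with_backoff() {
  local max_tries="${MAX_TRIES:-4}"
  local delay="${BASE_DELAY:-1}"
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

Only transient errors go through this path; a 400 retried four times is still a 400.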
The 401 case is worth expanding: many automated agents die permanently on expired credentials because they log "auth error" and stop. A proper recovery layer refreshes the token and retries before escalating to a human.
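One way that refresh-then-retry flow might look. The stubs here only simulate an expired token; `call_api`, `refresh_token`, and `escalate` are placeholders you would replace with real calls:

```bash
# Stubs simulating an API whose token has expired. TOKEN_OK stands in
# for "the credential currently works" — all illustrative.
TOKEN_OK="${TOKEN_OK:-0}"
call_api()      { [ "$TOKEN_OK" -eq 1 ]; }  # succeeds only with a fresh token
refresh_token() { TOKEN_OK=1; }             # pretend the refresh worked
escalate()      { echo "escalate: $1" >&2; }

call_with_auth_recovery() {
  if call_api; then
    return 0                                # token was fine, nothing to recover
  fi
  # First failure: assume a stale token, refresh once, retry once.
  if refresh_token && call_api; then
    echo "recovered after token refresh"
    return 0
  fi
  # Refresh didn't help: this is no longer auto-recoverable.
  escalate "auth failure persisted after token refresh"
  return 1
}
```

The point is the ordering: one cheap self-recovery attempt, then escalation, never a silent stop.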
## Pattern 4: The write-before-execute rule
Any operation that modifies external state — API calls, database writes, file publishes — should be preceded by writing what you're about to do.
```
ABOUT TO: publish article ID pending_20260409 to dev.to
REASON: idle-state content publication
ROLLBACK: delete article via DELETE /api/articles/{id}
```
This serves two purposes:
- Audit trail — you know exactly what the agent was doing when it failed
- Rollback instructions — when something goes wrong, recovery is scripted, not improvised
Write these to ~/intuitek/coordination/in_progress/ as lock files. Check for existing lock files before starting duplicate work. Delete them on success. This is also how you prevent two agent instances from racing on the same resource.
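A minimal sketch of that lock-file writer. The directory layout mirrors the one above, but `write_intent` and `clear_intent` are hypothetical helpers, not part of any existing tool:

```bash
# Record intent (with rollback instructions) to a lock file before
# touching external state; refuse to start if a lock already exists.
INTENT_DIR="${INTENT_DIR:-$HOME/intuitek/coordination/in_progress}"

write_intent() {
  local task="$1" action="$2" rollback="$3"
  mkdir -p "$INTENT_DIR"
  local lock="$INTENT_DIR/${task}.lock"
  if [ -f "$lock" ]; then
    echo "refusing: $task already in progress" >&2
    return 1
  fi
  {
    echo "ABOUT TO: $action"
    echo "ROLLBACK: $rollback"
    echo "STARTED: $(date -Iseconds)"
  } > "$lock"
}

# Delete the lock on success.
clear_intent() {
  rm -f "$INTENT_DIR/${1}.lock"
}
```

Because the lock carries its own rollback line, whoever (or whatever) cleans up a crashed run knows exactly what to undo.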
## Pattern 5: The escalation boundary
Some failures should trigger automatic recovery. Others should stop the agent and put the decision in human hands.
The rule: if recovery requires judgment, escalate.
Auto-recoverable (no human needed):
- Expired OAuth token → refresh and retry
- Rate limit hit → wait and retry
- Temporary network error → retry with backoff
- Stale lock file (>1 hour old) → remove and continue
Escalate to human (write to outputs/question_for_kyle_{timestamp}.md):
- Multiple retries all failed
- Unexpected API response that doesn't match any known error type
- Data integrity concern (checksums don't match, unexpected state)
- Action would be irreversible and confidence is <100%
```bash
write_escalation() {
  local TASK="$1"
  local REASON="$2"
  local TIMESTAMP=$(date +%Y%m%d_%H%M%S)

  cat > ~/intuitek/outputs/question_for_kyle_${TIMESTAMP}.md << EOF
# Escalation Required: $TASK

**Time:** $(date -Iseconds)
**Reason:** $REASON
**Last known state:** [checkpoint data here]
**Suggested action:** [what you'd do if you had authorization]
**Risk of waiting:** [what happens if Kyle doesn't respond for 24h]
EOF

  bash ~/intuitek/notify.sh "⚠️ Agent escalation: $TASK — see outputs/question_for_kyle_${TIMESTAMP}.md"
}
```
The escalation is not a failure. It's the agent correctly recognizing its own authorization boundary.
## Putting it together: the recovery-aware task wrapper
Every autonomous task should follow this structure:
```bash
task_with_recovery() {
  local TASK_ID="$1"
  local TASK_FN="$2"
  # Tilde doesn't expand inside quotes, so use $HOME
  local LOCK="$HOME/intuitek/coordination/in_progress/${TASK_ID}.lock"

  # 1. Check for existing lock (idempotency)
  if [ -f "$LOCK" ]; then
    local LOCK_AGE=$(( $(date +%s) - $(stat -c %Y "$LOCK") ))  # GNU stat; on macOS use stat -f %m
    if [ $LOCK_AGE -lt 3600 ]; then
      echo "Task $TASK_ID already in progress (lock age: ${LOCK_AGE}s) — skipping"
      return 0
    fi
    echo "Stale lock for $TASK_ID (${LOCK_AGE}s) — removing and continuing"
    rm "$LOCK"
  fi

  # 2. Write checkpoint
  echo "{\"task\": \"$TASK_ID\", \"started\": \"$(date -Iseconds)\"}" > "$LOCK"

  # 3. Execute with error handling
  if ! $TASK_FN; then
    echo "$(date -Iseconds) | ERROR | $TASK_ID failed" >> ~/intuitek/logs/errors.log
    bash ~/intuitek/notify.sh "⚠️ Task failed: $TASK_ID"
    rm "$LOCK"
    return 1
  fi

  # 4. Clean checkpoint on success
  rm "$LOCK"
  echo "$(date -Iseconds) | COMPLETE | $TASK_ID" >> ~/intuitek/logs/heartbeat.log
  return 0
}
```
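The stale-lock check in step 1 is worth extracting so it can be tested on its own. A standalone sketch, assuming GNU `stat` as the wrapper does (`is_lock_stale` is an illustrative name):

```bash
# Return 0 if the lock file is older than the threshold (default 1h),
# 1 if it is fresh or unreadable.
is_lock_stale() {
  local lock="$1" threshold="${2:-3600}"
  local mtime age
  mtime=$(stat -c %Y "$lock") || return 1  # GNU stat; on macOS use stat -f %m
  age=$(( $(date +%s) - mtime ))
  [ "$age" -ge "$threshold" ]
}
```

A fresh lock means another instance is (probably) still working; an hour-old one means a run died mid-task and it is safe to sweep.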
## The observable test
Your error recovery is working when:
- `~/intuitek/logs/errors.log` — errors appear here with full context, not silently discarded
- `~/intuitek/coordination/in_progress/` — only contains lock files for currently-running tasks
- Telegram — you receive escalations before you notice the failure yourself
- `~/intuitek/outputs/question_for_kyle_*.md` — rare, but always actionable
If errors are silently disappearing, the recovery layer isn't there. If lock files are accumulating, tasks are dying mid-run. Both are diagnosable from the file system alone.
## The architecture underneath this
These patterns came from building the ACE license delivery system — autonomous Stripe webhook processing, license provisioning, and email delivery — where a silent failure means a customer paid and got nothing.
The full recovery-aware agent architecture is one of the distilled skill sets available at shopclawmart.com/@thebrierfox. If you're building autonomous agents that need to stay running without supervision, the architecture skills there will save you the same debugging sessions it took to build these patterns.
The difference between an agent that runs for a week without supervision and one that silently dies on Tuesday is this layer. It's not glamorous. It's not AI. It's just good infrastructure.
Build it before you need it.