Your Claude Code agent runs clean in dev. You test it 10 times. It works every time.
Then you deploy it headless — scheduled, no supervision — and three days later you check and the thing has been silently failing since Tuesday.
No output. No errors. Just… nothing.
This is the most common failure mode for autonomous Claude Code agents, and it's not a Claude problem. It's an architecture problem. The agent had no recovery layer.
Here are the five patterns that fix it.
## Why autonomous agents fail differently than interactive ones
In an interactive session, you notice when something goes wrong. You see the error, you correct, you continue.
In a headless agent (scheduled via cron, running in `-p` mode), there's no one watching. Failures don't surface until something downstream breaks — a report doesn't appear, a delivery doesn't happen, a deadline is missed.
The architecture has to do what you'd do manually: detect the failure, classify it, attempt recovery if safe, escalate if not.
That requires explicit design. It doesn't happen by default.
## Pattern 1: Fail-safe exits with full state capture
Before any long or risky operation, write your current state.
```bash
# Write a checkpoint before the risky step
cat > ~/intuitek/coordination/in_progress/deploy_checkpoint.json << EOF
{
  "task": "deploy_new_listing",
  "step": "pre-upload",
  "timestamp": "$(date -Iseconds)",
  "payload_hash": "$(sha256sum listing.json | cut -d' ' -f1)"
}
EOF

# Now do the risky operation
upload_listing listing.json
EXIT_CODE=$?

if [ $EXIT_CODE -ne 0 ]; then
  # State is captured. Log the failure.
  echo "$(date -Iseconds) | ERROR | deploy_new_listing failed at pre-upload | exit=$EXIT_CODE" >> logs/errors.log
  # Notify. Don't silently die.
  bash notify.sh "⚠️ Deploy failed at upload step — checkpoint at coordination/in_progress/deploy_checkpoint.json"
  exit $EXIT_CODE
fi

# Clean checkpoint on success
rm ~/intuitek/coordination/in_progress/deploy_checkpoint.json
```
The pattern: write state, do work, delete state on success. If the checkpoint file exists at next run, you know where the last run died.
## Pattern 2: Idempotent task design
An idempotent task can run twice and produce the same result as running once. This is non-negotiable for any automated agent.
**Wrong:** publish article to dev.to → always creates a new article

**Right:** check if an article with that title exists → if yes, skip → if no, publish

**Wrong:** create Stripe payment link → always creates a new link

**Right:** check if a link already exists for this product → if yes, return the existing link → if no, create
Idempotency means a failed-and-retried task doesn't create duplicates, double-charges, or conflicting state. Write every autonomous task with the assumption it will be retried.
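The check-before-create shape above can be sketched with a local title registry standing in for the remote "does it already exist" query. `publish_once` and `REGISTRY` are illustrative names, not part of any real API:

```bash
# Idempotency sketch: record what has already been published, and check
# the record before publishing again. REGISTRY is a local stand-in for
# an API lookup by title.
REGISTRY="${REGISTRY:-/tmp/published_titles.txt}"

publish_once() {
  local title="$1"
  touch "$REGISTRY"
  # Already published? Then a retry is a no-op, not a duplicate.
  if grep -Fxq "$title" "$REGISTRY"; then
    echo "skip: $title already published"
    return 0
  fi
  # The real publish call (API POST, CLI, etc.) would go here.
  echo "$title" >> "$REGISTRY"
  echo "published: $title"
}
```

Run it twice with the same title and the second run skips, which is exactly the property a retried task needs.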
## Pattern 3: Error classification before recovery
Not all errors are the same. Treating them the same is how recovery logic makes things worse.
| Error type | Description | Recovery action |
|---|---|---|
| Transient | Network timeout, rate limit, temporary unavailability | Retry with backoff |
| Input error | Bad data, wrong format, missing field | Log and skip — retrying won't help |
| Auth error | Expired token, invalid key | Attempt token refresh, then escalate |
| Fatal | Data corruption, unrecoverable state | Halt, alert, require human review |
In practice:
```bash
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_ENDPOINT" ...)

case $HTTP_STATUS in
  200|201) echo "Success" ;;
  429) echo "Rate limited — sleeping 60s" && sleep 60 && retry ;;
  401) echo "Auth failed — attempting token refresh" && refresh_token ;;
  400) echo "Bad request — input error, skipping" && log_skip ;;
  5*)  echo "Server error — transient, will retry next cycle" ;;
  *)   echo "Unknown status $HTTP_STATUS — escalating" && notify_kyle ;;
esac
```
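The "retry with backoff" action from the transient row can be a small wrapper. This is a sketch; `retry_with_backoff`, `MAX_TRIES`, and `BASE_DELAY` are illustrative names, not an existing utility:

```bash
# Run a command up to MAX_TRIES times, doubling the delay between
# attempts. Returns 0 on the first success, 1 if every attempt fails.
retry_with_backoff() {
  local max_tries="${MAX_TRIES:-4}"
  local delay="${BASE_DELAY:-1}"
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

Only transient errors go through this path; a 400 retried four times is still a 400.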
The 401 case is worth expanding: many automated agents die permanently on expired credentials because they log "auth error" and stop. A proper recovery layer refreshes the token and retries before escalating to a human.
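One way that refresh-then-retry flow might look. The stubs here only simulate an expired token; `call_api`, `refresh_token`, and `escalate` are placeholders you would replace with real calls:

```bash
# Stubs simulating an API whose token has expired. TOKEN_OK stands in
# for "the credential currently works" — all illustrative.
TOKEN_OK="${TOKEN_OK:-0}"
call_api()      { [ "$TOKEN_OK" -eq 1 ]; }  # succeeds only with a fresh token
refresh_token() { TOKEN_OK=1; }             # pretend the refresh worked
escalate()      { echo "escalate: $1" >&2; }

call_with_auth_recovery() {
  if call_api; then
    return 0                                # token was fine, nothing to recover
  fi
  # First failure: assume a stale token, refresh once, retry once.
  if refresh_token && call_api; then
    echo "recovered after token refresh"
    return 0
  fi
  # Refresh didn't help: this is no longer auto-recoverable.
  escalate "auth failure persisted after token refresh"
  return 1
}
```

The point is the ordering: one cheap self-recovery attempt, then escalation, never a silent stop.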
## Pattern 4: The write-before-execute rule
Any operation that modifies external state — API calls, database writes, file publishes — should be preceded by writing what you're about to do.
```
ABOUT TO: publish article ID pending_20260409 to dev.to
REASON: idle-state content publication
ROLLBACK: delete article via DELETE /api/articles/{id}
```
This serves two purposes:
- Audit trail — you know exactly what the agent was doing when it failed
- Rollback instructions — when something goes wrong, recovery is scripted, not improvised
Write these to ~/intuitek/coordination/in_progress/ as lock files. Check for existing lock files before starting duplicate work. Delete them on success. This is also how you prevent two agent instances from racing on the same resource.
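A minimal sketch of that lock-file writer. The directory layout mirrors the one above, but `write_intent` and `clear_intent` are hypothetical helpers, not part of any existing tool:

```bash
# Record intent (with rollback instructions) to a lock file before
# touching external state; refuse to start if a lock already exists.
INTENT_DIR="${INTENT_DIR:-$HOME/intuitek/coordination/in_progress}"

write_intent() {
  local task="$1" action="$2" rollback="$3"
  mkdir -p "$INTENT_DIR"
  local lock="$INTENT_DIR/${task}.lock"
  if [ -f "$lock" ]; then
    echo "refusing: $task already in progress" >&2
    return 1
  fi
  {
    echo "ABOUT TO: $action"
    echo "ROLLBACK: $rollback"
    echo "STARTED: $(date -Iseconds)"
  } > "$lock"
}

# Delete the lock on success.
clear_intent() {
  rm -f "$INTENT_DIR/${1}.lock"
}
```

Because the lock carries its own rollback line, whoever (or whatever) cleans up a crashed run knows exactly what to undo.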
## Pattern 5: The escalation boundary
Some failures should trigger automatic recovery. Others should stop the agent and put the decision in human hands.
The rule: if recovery requires judgment, escalate.
Auto-recoverable (no human needed):
- Expired OAuth token → refresh and retry
- Rate limit hit → wait and retry
- Temporary network error → retry with backoff
- Stale lock file (>1 hour old) → remove and continue
Escalate to human (write to outputs/question_for_kyle_{timestamp}.md):
- Multiple retries all failed
- Unexpected API response that doesn't match any known error type
- Data integrity concern (checksums don't match, unexpected state)
- Action would be irreversible and confidence is <100%
```bash
write_escalation() {
  local TASK="$1"
  local REASON="$2"
  local TIMESTAMP=$(date +%Y%m%d_%H%M%S)

  cat > ~/intuitek/outputs/question_for_kyle_${TIMESTAMP}.md << EOF
# Escalation Required: $TASK

**Time:** $(date -Iseconds)
**Reason:** $REASON
**Last known state:** [checkpoint data here]
**Suggested action:** [what you'd do if you had authorization]
**Risk of waiting:** [what happens if Kyle doesn't respond for 24h]
EOF

  bash ~/intuitek/notify.sh "⚠️ Agent escalation: $TASK — see outputs/question_for_kyle_${TIMESTAMP}.md"
}
```
The escalation is not a failure. It's the agent correctly recognizing its own authorization boundary.
## Putting it together: the recovery-aware task wrapper
Every autonomous task should follow this structure:
```bash
task_with_recovery() {
  local TASK_ID="$1"
  local TASK_FN="$2"
  # Tilde doesn't expand inside quotes, so use $HOME
  local LOCK="$HOME/intuitek/coordination/in_progress/${TASK_ID}.lock"

  # 1. Check for existing lock (idempotency)
  if [ -f "$LOCK" ]; then
    local LOCK_AGE=$(( $(date +%s) - $(stat -c %Y "$LOCK") ))  # GNU stat; on macOS use stat -f %m
    if [ $LOCK_AGE -lt 3600 ]; then
      echo "Task $TASK_ID already in progress (lock age: ${LOCK_AGE}s) — skipping"
      return 0
    fi
    echo "Stale lock for $TASK_ID (${LOCK_AGE}s) — removing and continuing"
    rm "$LOCK"
  fi

  # 2. Write checkpoint
  echo "{\"task\": \"$TASK_ID\", \"started\": \"$(date -Iseconds)\"}" > "$LOCK"

  # 3. Execute with error handling
  if ! $TASK_FN; then
    echo "$(date -Iseconds) | ERROR | $TASK_ID failed" >> ~/intuitek/logs/errors.log
    bash ~/intuitek/notify.sh "⚠️ Task failed: $TASK_ID"
    rm "$LOCK"
    return 1
  fi

  # 4. Clean checkpoint on success
  rm "$LOCK"
  echo "$(date -Iseconds) | COMPLETE | $TASK_ID" >> ~/intuitek/logs/heartbeat.log
  return 0
}
```
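The stale-lock check in step 1 is worth extracting so it can be tested on its own. A standalone sketch, assuming GNU `stat` as the wrapper does (`is_lock_stale` is an illustrative name):

```bash
# Return 0 if the lock file is older than the threshold (default 1h),
# 1 if it is fresh or unreadable.
is_lock_stale() {
  local lock="$1" threshold="${2:-3600}"
  local mtime age
  mtime=$(stat -c %Y "$lock") || return 1  # GNU stat; on macOS use stat -f %m
  age=$(( $(date +%s) - mtime ))
  [ "$age" -ge "$threshold" ]
}
```

A fresh lock means another instance is (probably) still working; an hour-old one means a run died mid-task and it is safe to sweep.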
## The observable test
Your error recovery is working when:
- `~/intuitek/logs/errors.log` — errors appear here with full context, not silently discarded
- `~/intuitek/coordination/in_progress/` — only contains lock files for currently-running tasks
- Telegram — you receive escalations before you notice the failure yourself
- `~/intuitek/outputs/question_for_kyle_*.md` — rare, but always actionable
If errors are silently disappearing, the recovery layer isn't there. If lock files are accumulating, tasks are dying mid-run. Both are diagnosable from the file system alone.
## The architecture underneath this
These patterns came from building the ACE license delivery system — autonomous Stripe webhook processing, license provisioning, and email delivery — where a silent failure means a customer paid and got nothing.
The full recovery-aware agent architecture is one of the distilled skill sets available at shopclawmart.com/@thebrierfox. If you're building autonomous agents that need to stay running without supervision, the architecture skills there will save you the same debugging sessions it took to build these patterns.
The difference between an agent that runs for a week without supervision and one that silently dies on Tuesday is this layer. It's not glamorous. It's not AI. It's just good infrastructure.
Build it before you need it.