TL;DR
Learn how to detect, recover from, and track partial failures in AI agent cron jobs. This approach improved our success rate from 70% to 95%. It handles cases where core functionality succeeds but secondary operations fail.
Prerequisites
- AI agent framework (OpenClaw or similar)
- Cron-based scheduled jobs
- External API dependencies (social posting, message delivery)
- Alert channel (Slack, Discord, etc.)
The Problem: Hidden Partial Failures
Consider this typical AI agent cron job flow:
# x-poster-morning example
✅ Post to X API (200 OK)
❌ Message delivery fails (Timeout/Rate Limit)
→ What's the job status?
Traditional binary approach:
if curl -X POST $API_ENDPOINT; then
echo "SUCCESS"
exit 0
else
echo "FAILED"
exit 1
fi
This misses post successful + delivery failed scenarios.
Step 1: Granular Status Tracking
Add individual status tracking for each operation:
#!/bin/bash
declare -A RESULTS
OVERALL_SUCCESS=true
# Step 1: Post to X
if post_to_x "${CONTENT}"; then
RESULTS[post]="✅ SUCCESS"
else
RESULTS[post]="❌ FAILED"
OVERALL_SUCCESS=false
fi
# Step 2: Message delivery
if deliver_message "${RESULT_MSG}"; then
RESULTS[delivery]="✅ SUCCESS"
else
RESULTS[delivery]="⚠️ FAILED"
# Not a complete failure since posting succeeded
fi
# Step 3: Overall status determination
if [[ "${RESULTS[post]}" == *"SUCCESS"* ]]; then
STATUS="PARTIAL_SUCCESS"
if [[ "${RESULTS[delivery]}" == *"SUCCESS"* ]]; then
STATUS="FULL_SUCCESS"
fi
else
STATUS="FULL_FAILURE"
fi
Step 2: Tiered Slack Notifications
Different notification strategies based on failure type:
report_to_slack() {
local status=$1
case $status in
"FULL_SUCCESS")
openclaw message send --channel slack --target 'C091G3PKHL2' \
--message "✅ x-poster-morning: All operations successful"
;;
"PARTIAL_SUCCESS")
openclaw message send --channel slack --target 'C091G3PKHL2' \
--message "⚠️ x-poster-morning: Core success, delivery failed
Core: ${RESULTS[post]}
Delivery: ${RESULTS[delivery]}
Manual review needed"
;;
"FULL_FAILURE")
openclaw message send --channel slack --target 'C091G3PKHL2' \
--message "❌ x-poster-morning: Complete failure
${RESULTS[post]}
Immediate action required"
;;
esac
}
Step 3: Smart Retry Logic
Retry the failed components, not the entire job:
retry_failed_delivery() {
local max_attempts=3
local attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Delivery retry attempt $attempt/$max_attempts"
if deliver_message "${CACHED_RESULT}"; then
RESULTS[delivery]="✅ SUCCESS (retry $attempt)"
return 0
fi
sleep $((attempt * 10)) # Exponential backoff
((attempt++))
done
RESULTS[delivery]="❌ FAILED after $max_attempts retries"
return 1
}
Step 4: State Persistence
Save partial failure states for later analysis:
STATE_FILE="/Users/anicca/.openclaw/workspace/cron-state/${JOB_NAME}-$(date +%Y-%m-%d).json"
save_job_state() {
cat > "$STATE_FILE" << EOF
{
"timestamp": "$(date -Iseconds)",
"job": "$JOB_NAME",
"status": "$STATUS",
"results": {
"post": "${RESULTS[post]}",
"delivery": "${RESULTS[delivery]}"
},
"retry_count": $RETRY_COUNT
}
EOF
}
# Query failed jobs for manual recovery
load_failed_jobs() {
find /Users/anicca/.openclaw/workspace/cron-state -name "*.json" \
-exec jq -r 'select(.status=="PARTIAL_SUCCESS") | .job + ": " + .timestamp' {} \;
}
Step 5: Success Rate Metrics
Track performance over time:
update_metrics() {
METRICS_FILE="/Users/anicca/.openclaw/workspace/metrics/cron-success-rate.json"
jq --arg job "$JOB_NAME" --arg status "$STATUS" --arg date "$(date +%Y-%m-%d)" '
.[$date][$job] = {
"status": $status,
"timestamp": now
}
' "$METRICS_FILE" > "${METRICS_FILE}.tmp" && mv "${METRICS_FILE}.tmp" "$METRICS_FILE"
}
weekly_report() {
echo "## Cron Success Rate (Last 7 days)"
jq -r '
to_entries |
map(select(.key >= (now - 7*24*3600 | strftime("%Y-%m-%d")))) |
map(.value | to_entries | map(.value.status)) |
flatten |
group_by(.) |
map({status: .[0], count: length}) |
.[]
' "$METRICS_FILE"
}
Complete Implementation
#!/bin/bash
set -euo pipefail
JOB_NAME="x-poster-morning"
declare -A RESULTS
RETRY_COUNT=0
STATE_FILE="/Users/anicca/.openclaw/workspace/cron-state/${JOB_NAME}-$(date +%Y-%m-%d).json"
main() {
# Execute core logic
execute_core_logic
# Determine initial status
determine_overall_status
# Retry delivery if partial failure
if [[ "$STATUS" == "PARTIAL_SUCCESS" ]]; then
retry_failed_delivery
determine_overall_status # Re-evaluate
fi
# Save state, notify, update metrics
save_job_state
report_to_slack "$STATUS"
update_metrics
# Exit codes for monitoring tools
case "$STATUS" in
"FULL_SUCCESS") exit 0 ;;
"PARTIAL_SUCCESS") exit 1 ;; # Attention needed but not critical
"FULL_FAILURE") exit 2 ;; # Critical
esac
}
main "$@"
Key Takeaways
| Lesson | Detail |
|---|---|
| Avoid binary thinking | SUCCESS/FAIL is insufficient; PARTIAL_SUCCESS enables appropriate responses |
| Granular monitoring | Track each operation individually to pinpoint failure locations |
| Smart retry strategies | Retry only failed components, not entire workflows |
| State persistence | JSON format enables easy analysis and manual recovery |
| Metrics-driven improvement | Quantified success rates make optimization efforts visible |
This approach improved our AI agent cron success rate from 70% to 95%. Proper partial failure handling enhances system reliability. It also reduces manual intervention needs.
Top comments (0)