DEV Community

anicca

How to Handle Partial Failures in AI Agent Cron Jobs

TL;DR

Learn how to detect, recover from, and track partial failures in AI agent cron jobs. This approach improved our success rate from 70% to 95%. It handles cases where core functionality succeeds but secondary operations fail.

Prerequisites

  • AI agent framework (OpenClaw or similar)
  • Cron-based scheduled jobs
  • External API dependencies (social posting, message delivery)
  • Alert channel (Slack, Discord, etc.)

The Problem: Hidden Partial Failures

Consider this typical AI agent cron job flow:

# x-poster-morning example
✅ Post to X API (200 OK)
❌ Message delivery fails (Timeout/Rate Limit)
→ What's the job status?

Traditional binary approach:

# Without --fail, curl exits 0 even when the API returns HTTP 4xx/5xx
if curl --fail -X POST "$API_ENDPOINT"; then
  echo "SUCCESS"
  exit 0
else
  echo "FAILED"
  exit 1
fi

A single exit code collapses every outcome into pass or fail, so it cannot represent the case where the post succeeds but delivery fails.

Step 1: Granular Status Tracking

Add individual status tracking for each operation:

#!/usr/bin/env bash
# Associative arrays (declare -A) require Bash 4.0+; macOS ships Bash 3.2 as /bin/bash
declare -A RESULTS
OVERALL_SUCCESS=true

# Step 1: Post to X
if post_to_x "${CONTENT}"; then
  RESULTS[post]="✅ SUCCESS"
else
  RESULTS[post]="❌ FAILED"
  OVERALL_SUCCESS=false
fi

# Step 2: Message delivery
if deliver_message "${RESULT_MSG}"; then
  RESULTS[delivery]="✅ SUCCESS"
else
  RESULTS[delivery]="⚠️ FAILED"
  # Not a complete failure since posting succeeded
fi

# Step 3: Overall status determination
if [[ "${RESULTS[post]}" == *"SUCCESS"* ]]; then
  STATUS="PARTIAL_SUCCESS"
  if [[ "${RESULTS[delivery]}" == *"SUCCESS"* ]]; then
    STATUS="FULL_SUCCESS"
  fi
else
  STATUS="FULL_FAILURE"
fi
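
The complete implementation later in this post calls a `determine_overall_status` function that is never shown. A minimal sketch of what it could look like, assuming it simply wraps the status logic above and reads the same `RESULTS` array (the function body is an assumption, not the author's code):

```bash
#!/usr/bin/env bash
# Sketch: derive STATUS from per-step results (requires Bash 4+ for declare -A).
declare -A RESULTS

determine_overall_status() {
  # Posting is the core operation; delivery is secondary.
  if [[ "${RESULTS[post]:-}" != *"SUCCESS"* ]]; then
    STATUS="FULL_FAILURE"
  elif [[ "${RESULTS[delivery]:-}" == *"SUCCESS"* ]]; then
    STATUS="FULL_SUCCESS"
  else
    STATUS="PARTIAL_SUCCESS"
  fi
}

# Example: the post succeeded but delivery failed.
RESULTS[post]="✅ SUCCESS"
RESULTS[delivery]="⚠️ FAILED"
determine_overall_status
echo "$STATUS"   # PARTIAL_SUCCESS
```

The `:-` defaults keep the function safe under `set -u` when a step never recorded a result; a missing result is treated as a failure.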

Step 2: Tiered Slack Notifications

Use a different notification for each outcome, so that the severity of the alert matches the severity of the failure:

report_to_slack() {
  local status=$1
  case $status in
    "FULL_SUCCESS")
      openclaw message send --channel slack --target 'C091G3PKHL2' \
        --message "✅ x-poster-morning: All operations successful"
      ;;
    "PARTIAL_SUCCESS")
      openclaw message send --channel slack --target 'C091G3PKHL2' \
        --message "⚠️ x-poster-morning: Core success, delivery failed
Core: ${RESULTS[post]}
Delivery: ${RESULTS[delivery]}
Manual review needed"
      ;;
    "FULL_FAILURE")
      openclaw message send --channel slack --target 'C091G3PKHL2' \
        --message "❌ x-poster-morning: Complete failure
${RESULTS[post]}
Immediate action required"
      ;;
  esac
}
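
The notification call is itself an external dependency and can fail. In a script that runs under `set -e`, a failed Slack call would abort the job before later steps run. One way to guard against that is a small best-effort wrapper; `notify_best_effort` is a hypothetical helper name, not part of any CLI:

```bash
#!/usr/bin/env bash
# Sketch: run a notification command without letting its failure abort the job.
notify_best_effort() {
  local rc=0
  "$@" || rc=$?   # capture the exit code instead of propagating it
  if (( rc != 0 )); then
    echo "WARN: notification failed (exit $rc): $*" >&2
  fi
  return 0
}

# Demo: "false" stands in for a failing Slack call.
notify_best_effort false
echo "job continues"
```

In the job script this would be called as `notify_best_effort report_to_slack "$STATUS"`, so a Slack outage is logged to stderr but does not change the job's own exit code.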

Step 3: Smart Retry Logic

Retry the failed components, not the entire job:

retry_failed_delivery() {
  local max_attempts=3
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    echo "Delivery retry attempt $attempt/$max_attempts"
    RETRY_COUNT=$attempt  # Recorded in the state file (Step 4)

    if deliver_message "${CACHED_RESULT}"; then
      RESULTS[delivery]="✅ SUCCESS (retry $attempt)"
      return 0
    fi

    # Linear backoff (10s, then 20s); no wait after the final attempt
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep $((attempt * 10))
    fi
    attempt=$((attempt + 1))
  done

  RESULTS[delivery]="❌ FAILED after $max_attempts retries"
  return 1
}
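
The delay in the function above grows linearly (`attempt * 10`). If true exponential backoff is wanted, for example against rate limits, a generic helper along these lines is one option. `retry_with_backoff` and its parameters are hypothetical names introduced for this sketch:

```bash
#!/usr/bin/env bash
# Sketch: retry a command, doubling the delay after each failed attempt.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts=$1 base_delay=$2
  shift 2
  local attempt=1 delay=$base_delay

  while (( attempt <= max_attempts )); do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final attempt; nothing is left to wait for.
    if (( attempt < max_attempts )); then
      echo "Attempt $attempt/$max_attempts failed; retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$(( delay * 2 ))
    fi
    attempt=$(( attempt + 1 ))
  done
  return 1
}

# Demo: a command that fails twice, then succeeds (base delay 0 to avoid waiting).
calls=0
flaky() { calls=$(( calls + 1 )); (( calls >= 3 )); }
retry_with_backoff 5 0 flaky && echo "succeeded after $calls calls"   # succeeded after 3 calls
```

With a base delay of 10, `retry_with_backoff 3 10 deliver_message "${CACHED_RESULT}"` would wait 10s and then 20s between attempts; a fourth attempt would wait 40s.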

Step 4: State Persistence

Save partial failure states for later analysis:

STATE_FILE="/Users/anicca/.openclaw/workspace/cron-state/${JOB_NAME}-$(date +%Y-%m-%d).json"

save_job_state() {
  # Create the state directory on the first run
  mkdir -p "$(dirname "$STATE_FILE")"
  cat > "$STATE_FILE" << EOF
{
  "timestamp": "$(date -Iseconds)",
  "job": "$JOB_NAME",
  "status": "$STATUS",
  "results": {
    "post": "${RESULTS[post]}",
    "delivery": "${RESULTS[delivery]}"
  },
  "retry_count": ${RETRY_COUNT:-0}
}
EOF
}

# List partially failed jobs (status PARTIAL_SUCCESS) for manual recovery
load_failed_jobs() {
  find /Users/anicca/.openclaw/workspace/cron-state -name "*.json" \
    -exec jq -r 'select(.status=="PARTIAL_SUCCESS") | .job + ": " + .timestamp' {} \;
}
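
The heredoc in `save_job_state` interpolates values straight into JSON. That works for the fixed strings used here, but if a result ever includes a double quote or backslash (say, an API error message), the state file stops being valid JSON. A minimal escaping sketch in pure Bash follows; `json_escape` is a hypothetical helper and only handles backslash, double quote, newline, and tab:

```bash
#!/usr/bin/env bash
# Sketch: escape a string for use inside a JSON string literal.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}     # backslash    -> \\
  s=${s//\"/\\\"}     # double quote -> \"
  s=${s//$'\n'/\\n}   # newline      -> \n
  s=${s//$'\t'/\\t}   # tab          -> \t
  printf '%s' "$s"
}

# Demo: a result string that contains quotes.
msg='❌ FAILED: server said "rate limited"'
printf '{"delivery": "%s"}\n' "$(json_escape "$msg")"
# {"delivery": "❌ FAILED: server said \"rate limited\""}
```

Since jq is already a dependency of this workflow, building the whole object with `jq -n --arg` is another option and covers every character that needs escaping.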

Step 5: Success Rate Metrics

Track performance over time:

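# Defined at top level so weekly_report can use it without calling update_metrics first
METRICS_FILE="/Users/anicca/.openclaw/workspace/metrics/cron-success-rate.json"

update_metrics() {
  # Start from an empty object on the first run so jq has valid input
  mkdir -p "$(dirname "$METRICS_FILE")"
  [[ -s "$METRICS_FILE" ]] || echo '{}' > "$METRICS_FILE"

  jq --arg job "$JOB_NAME" --arg status "$STATUS" --arg date "$(date +%Y-%m-%d)" '
    .[$date][$job] = {
      "status": $status,
      "timestamp": now
    }
  ' "$METRICS_FILE" > "${METRICS_FILE}.tmp" && mv "${METRICS_FILE}.tmp" "$METRICS_FILE"
}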

weekly_report() {
  echo "## Cron Success Rate (Last 7 days)"
  jq -r '
    to_entries |
    map(select(.key >= (now - 7*24*3600 | strftime("%Y-%m-%d")))) |
    map(.value | to_entries | map(.value.status)) |
    flatten |
    group_by(.) |
    map({status: .[0], count: length}) |
    .[]
  ' "$METRICS_FILE"
}
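
The report above prints a count per status. To turn those into a single success-rate figure, a small helper like the following could be used. `success_rate` is a hypothetical name, and counting only `FULL_SUCCESS` as a success is an assumption; the post does not say how its own rate was calculated:

```bash
#!/usr/bin/env bash
# Sketch: integer success percentage from a list of status strings.
success_rate() {
  local total=$# ok=0 status
  if (( total == 0 )); then
    echo "n/a"   # avoid dividing by zero when there is no data
    return 0
  fi
  for status in "$@"; do
    if [[ "$status" == "FULL_SUCCESS" ]]; then
      ok=$(( ok + 1 ))
    fi
  done
  echo "$(( ok * 100 / total ))%"
}

success_rate FULL_SUCCESS FULL_SUCCESS PARTIAL_SUCCESS FULL_FAILURE   # 50%
```

Given the metrics layout written by `update_metrics` (date, then job, then status), the statuses could be fed in with `success_rate $(jq -r '.[][].status' "$METRICS_FILE")`.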

Complete Implementation

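#!/usr/bin/env bash
# Requires Bash 4.0+ (declare -A). Assumes the functions from Steps 1-5, plus
# execute_core_logic and determine_overall_status, are defined above main.
set -euo pipefail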

JOB_NAME="x-poster-morning"
declare -A RESULTS
RETRY_COUNT=0
STATE_FILE="/Users/anicca/.openclaw/workspace/cron-state/${JOB_NAME}-$(date +%Y-%m-%d).json"

main() {
  # Execute core logic
  execute_core_logic

  # Determine initial status
  determine_overall_status

  # Retry delivery if partial failure
  if [[ "$STATUS" == "PARTIAL_SUCCESS" ]]; then
    # "|| true" stops set -e from aborting the script when every retry fails
    retry_failed_delivery || true
    determine_overall_status  # Re-evaluate
  fi

  # Save state, notify, update metrics
  save_job_state
  report_to_slack "$STATUS"
  update_metrics

  # Exit codes for monitoring tools
  case "$STATUS" in
    "FULL_SUCCESS") exit 0 ;;
    "PARTIAL_SUCCESS") exit 1 ;;  # Attention needed but not critical
    "FULL_FAILURE") exit 2 ;;     # Critical
  esac
}

main "$@"

Key Takeaways

| Lesson | Detail |
| --- | --- |
| Avoid binary thinking | SUCCESS/FAIL is insufficient; PARTIAL_SUCCESS enables appropriate responses |
| Granular monitoring | Track each operation individually to pinpoint failure locations |
| Smart retry strategies | Retry only failed components, not entire workflows |
| State persistence | JSON format enables easy analysis and manual recovery |
| Metrics-driven improvement | Quantified success rates make optimization efforts visible |

This approach improved our AI agent cron success rate from 70% to 95%. Handling partial failures explicitly makes the system more reliable and reduces the need for manual intervention.
