Cron Jobs That Fix Themselves
Every developer has been there: a cron job starts failing silently at 3 AM, stays broken for three days, and nobody notices until a customer files a bug report. Traditional cron gives you zero visibility. You either babysit it or it rots.
The solution isn't more monitoring — it's building cron jobs that know when they're broken and fix themselves before anyone has to care.
The Problem With Traditional Cron
Cron treats failures like a vending machine treats a jammed coil: it just moves on. Exit code 1? Nobody is told. The job silently failed: your backup didn't run, your report didn't send, your daily digest never happened.
The core issues:
- No automatic retry — fail once and you're done for the day
- No context — you don't know if it failed because of a network blip or a real bug
- No recovery — even if you detect failure, you don't know how to fix it
# Traditional cron: blind faith
0 2 * * * /scripts/backup.sh # If this fails, good luck knowing why
Self-Healing Pattern: Detect, Retry, Alert
The fix is a three-layer approach, with each layer owning one concern:
Layer 1: Execution — run the job
Layer 2: Detection — did it actually succeed?
Layer 3: Recovery — attempt self-repair, escalate if it keeps failing
Layer 1: The Wrapper That Tracks Everything
#!/bin/bash
# run-with-guard.sh — wraps any cron job with retry logic
JOB_NAME="${1:?usage: run-with-guard.sh <job-name> <command...>}"
shift
COMMAND=("$@")   # keep the command as an array so arguments survive intact
MAX_RETRIES=3
RETRY_DELAY=60   # seconds

# Set up isolated logging
LOG_DIR="/var/log/cron-guardian"
mkdir -p "$LOG_DIR"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_FILE="${LOG_DIR}/${JOB_NAME}_${TIMESTAMP}.log"

attempt=1
while [ "$attempt" -le "$MAX_RETRIES" ]; do
    echo "[$(date)] Attempt $attempt/$MAX_RETRIES: ${COMMAND[*]}" >> "$LOG_FILE"
    "${COMMAND[@]}" >> "$LOG_FILE" 2>&1
    EXIT_CODE=$?
    if [ "$EXIT_CODE" -eq 0 ]; then
        echo "[$(date)] SUCCESS after $attempt attempt(s)" >> "$LOG_FILE"
        exit 0
    fi
    echo "[$(date)] Failed with exit code $EXIT_CODE" >> "$LOG_FILE"
    if [ "$attempt" -lt "$MAX_RETRIES" ]; then
        echo "[$(date)] Retrying in ${RETRY_DELAY}s..." >> "$LOG_FILE"
        sleep "$RETRY_DELAY"
    fi
    attempt=$((attempt + 1))
done

echo "[$(date)] ALL RETRIES EXHAUSTED — escalating" >> "$LOG_FILE"
exit 99  # Special code for "escalate me"
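With the wrapper in place, the blind crontab entry from earlier becomes a guarded call. The paths here are illustrative; adjust them to wherever you install the script:

```shell
# Guarded cron: same schedule, but now with retries and per-run logs
0 2 * * * /scripts/run-with-guard.sh backup /scripts/backup.sh
```

The job name (`backup`) doubles as the log-file prefix, which is what the escalation layer greps for later.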
Layer 2: Smart Escalation
The wrapper exits 99 when all retries fail. Now your orchestrator decides what to do:
# cron-escalation.py — handles escalation decisions
import shlex
import subprocess
from datetime import datetime, timedelta
from pathlib import Path

ESCALATION_THRESHOLD = 3  # Fail N times in a window = escalate
WINDOW_HOURS = 24

def check_failure_history(job_name: str) -> int:
    log_dir = Path("/var/log/cron-guardian")
    cutoff = datetime.now() - timedelta(hours=WINDOW_HOURS)
    failures = 0
    for log_file in sorted(log_dir.glob(f"{job_name}_*.log")):
        if log_file.stat().st_mtime < cutoff.timestamp():
            continue
        if "ALL RETRIES EXHAUSTED" in log_file.read_text():
            failures += 1
    return failures

def escalate(job_name: str, command: str):
    # Send alert (customize: email, Slack, PagerDuty, etc.)
    print(f"🚨 ALERT: {job_name} failing repeatedly")
    print(f"   Command: {command}")
    print(f"   Run: openclaw cron run --job-id {job_name}")
    # Optional: open a ticket automatically
    # create_issue(title=f"Cron failure: {job_name}", body=f"...")

def run_job(job_name: str, command: str) -> int:
    # shlex.split turns the command string into real argv entries
    # so the wrapper sees the program and its arguments separately
    result = subprocess.run(
        ["bash", "/scripts/run-with-guard.sh", job_name, *shlex.split(command)],
        capture_output=True,
    )
    if result.returncode == 99:  # Escalation needed
        failures = check_failure_history(job_name)
        if failures >= ESCALATION_THRESHOLD:
            escalate(job_name, command)
    return result.returncode
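The health-check cron in the next section invokes this script with a `--check-all` flag, which the snippet above doesn't define. One way to sketch that entry point, assuming a hard-coded job registry (in practice you'd load it from a config file):

```python
import argparse

# Hypothetical job registry; in practice, load this from config
JOBS = {
    "backup": "/scripts/backup.sh",
    "daily-report": "/scripts/daily-report.sh",
}

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Cron escalation runner")
    parser.add_argument("--check-all", action="store_true",
                        help="run every registered job through the guard wrapper")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.check_all:
        for name, cmd in JOBS.items():
            run_job(name, cmd)  # run_job from the snippet above
```

Keeping the parser in its own function makes the flag handling testable without actually launching any jobs.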
Layer 3: The One Cron That Rules Them All
With OpenClaw, you can set up a single monitoring cron that checks all your other crons:
# In your openclaw config — one cron to monitor the others
openclaw cron add \
--name "cron-health-check" \
--schedule "0 */2 * * *" \
--command "python3 /scripts/cron-escalation.py --check-all"
Putting It Together
The output isn't just "job ran." It's:
✓ backup.sh — succeeded on attempt 1
✓ daily-report.sh — succeeded on attempt 2 (network blip on first try)
⚠️ sync-users.sh — failed all 3 retries, escalating to Slack
You're no longer surprised. The system tells you when something is wrong and tries to fix it first. Only the stubborn failures make it to your attention.
The One Thing That Actually Works
The secret isn't the retry logic — it's the separation of concerns. Let the job focus on doing its task. Let the wrapper focus on execution quality. Let the escalation layer focus on communication.
Traditional cron: one thing, poorly.
Self-healing cron: three layers, each doing one thing well.
Build the wrapper once, use it everywhere. Your future self will thank you when the 3 AM pager never goes off.
I cover cron patterns and automation resilience in depth in my book Why Is My OpenClaw Dumb? on Amazon ($9.99).