Cron Jobs That Fix Themselves
Every developer has been there: a cron job starts failing silently at 3 AM, stays broken for three days, and nobody notices until a customer files a bug report. Traditional cron gives you zero visibility. You either babysit it or it rots.
The solution isn't more monitoring — it's building cron jobs that know when they're broken and fix themselves before anyone has to care.
The Problem With Traditional Cron
Cron treats failures like a vending machine treats a jammed coil: it just moves on. Exit code 1? Nobody is told. The job silently failed: your backup didn't run, your report didn't send, your daily digest never happened.
The core issues:
- No automatic retry — fail once and you're done for the day
- No context — you don't know if it failed because of a network blip or a real bug
- No recovery — even if you detect failure, you don't know how to fix it
# Traditional cron: blind faith
0 2 * * * /scripts/backup.sh # If this fails, good luck knowing why
Self-Healing Pattern: Detect, Retry, Alert
The fix is a three-layer approach, with each layer owning one concern:
Layer 1: Execution — run the job
Layer 2: Detection — did it actually succeed?
Layer 3: Recovery — attempt self-repair, escalate if it keeps failing
Layer 1: The Wrapper That Tracks Everything
#!/bin/bash
# run-with-guard.sh — wraps any cron job with retry logic
JOB_NAME="${1:?usage: run-with-guard.sh <job-name> <command...>}"
shift
COMMAND=("$@")   # keep the command as an array so arguments survive intact
MAX_RETRIES=3
RETRY_DELAY=60   # seconds

# Set up isolated logging
LOG_DIR="/var/log/cron-guardian"
mkdir -p "$LOG_DIR"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_FILE="${LOG_DIR}/${JOB_NAME}_${TIMESTAMP}.log"

attempt=1
while [ "$attempt" -le "$MAX_RETRIES" ]; do
    echo "[$(date)] Attempt $attempt/$MAX_RETRIES: ${COMMAND[*]}" >> "$LOG_FILE"
    "${COMMAND[@]}" >> "$LOG_FILE" 2>&1
    EXIT_CODE=$?
    if [ "$EXIT_CODE" -eq 0 ]; then
        echo "[$(date)] SUCCESS after $attempt attempt(s)" >> "$LOG_FILE"
        exit 0
    fi
    echo "[$(date)] Failed with exit code $EXIT_CODE" >> "$LOG_FILE"
    if [ "$attempt" -lt "$MAX_RETRIES" ]; then
        echo "[$(date)] Retrying in ${RETRY_DELAY}s..." >> "$LOG_FILE"
        sleep "$RETRY_DELAY"
    fi
    attempt=$((attempt + 1))
done

echo "[$(date)] ALL RETRIES EXHAUSTED — escalating" >> "$LOG_FILE"
exit 99  # Special code for "escalate me"
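With the wrapper in place, the blind crontab entry from earlier becomes a guarded call. The paths here are illustrative; adjust them to wherever you install the script:

```shell
# Guarded cron: same schedule, but now with retries and per-run logs
0 2 * * * /scripts/run-with-guard.sh backup /scripts/backup.sh
```

The job name (`backup`) doubles as the log-file prefix, which is what the escalation layer greps for later.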
Layer 2: Smart Escalation
The wrapper exits 99 when all retries fail. Now your orchestrator decides what to do:
# cron-escalation.py — handles escalation decisions
import shlex
import subprocess
from datetime import datetime, timedelta
from pathlib import Path

ESCALATION_THRESHOLD = 3  # Fail N times in a window = escalate
WINDOW_HOURS = 24

def check_failure_history(job_name: str) -> int:
    log_dir = Path("/var/log/cron-guardian")
    cutoff = datetime.now() - timedelta(hours=WINDOW_HOURS)
    failures = 0
    for log_file in sorted(log_dir.glob(f"{job_name}_*.log")):
        if log_file.stat().st_mtime < cutoff.timestamp():
            continue
        if "ALL RETRIES EXHAUSTED" in log_file.read_text():
            failures += 1
    return failures

def escalate(job_name: str, command: str):
    # Send alert (customize: email, Slack, PagerDuty, etc.)
    print(f"🚨 ALERT: {job_name} failing repeatedly")
    print(f"   Command: {command}")
    print(f"   Run: openclaw cron run --job-id {job_name}")
    # Optional: open a ticket automatically
    # create_issue(title=f"Cron failure: {job_name}", body=f"...")

def run_job(job_name: str, command: str) -> int:
    # shlex.split turns the command string into real argv entries
    # so the wrapper sees the program and its arguments separately
    result = subprocess.run(
        ["bash", "/scripts/run-with-guard.sh", job_name, *shlex.split(command)],
        capture_output=True,
    )
    if result.returncode == 99:  # Escalation needed
        failures = check_failure_history(job_name)
        if failures >= ESCALATION_THRESHOLD:
            escalate(job_name, command)
    return result.returncode
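The health-check cron in the next section invokes this script with a `--check-all` flag, which the snippet above doesn't define. One way to sketch that entry point, assuming a hard-coded job registry (in practice you'd load it from a config file):

```python
import argparse

# Hypothetical job registry; in practice, load this from config
JOBS = {
    "backup": "/scripts/backup.sh",
    "daily-report": "/scripts/daily-report.sh",
}

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Cron escalation runner")
    parser.add_argument("--check-all", action="store_true",
                        help="run every registered job through the guard wrapper")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.check_all:
        for name, cmd in JOBS.items():
            run_job(name, cmd)  # run_job from the snippet above
```

Keeping the parser in its own function makes the flag handling testable without actually launching any jobs.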
Layer 3: The One Cron That Rules Them All
With OpenClaw, you can set up a single monitoring cron that checks all your other crons:
# In your openclaw config — one cron to monitor the others
openclaw cron add \
--name "cron-health-check" \
--schedule "0 */2 * * *" \
--command "python3 /scripts/cron-escalation.py --check-all"
Putting It Together
The output isn't just "job ran." It's:
✓ backup.sh — succeeded on attempt 1
✓ daily-report.sh — succeeded on attempt 2 (network blip on first try)
⚠️ sync-users.sh — failed all 3 retries, escalating to Slack
You're no longer surprised. The system tells you when something is wrong and tries to fix it first. Only the stubborn failures make it to your attention.
The One Thing That Actually Works
The secret isn't the retry logic — it's the separation of concerns. Let the job focus on doing its task. Let the wrapper focus on execution quality. Let the escalation layer focus on communication.
Traditional cron: one thing, poorly.
Self-healing cron: three layers, each doing one thing well.
Build the wrapper once, use it everywhere. Your future self will thank you when the 3 AM pager never goes off.
I cover cron patterns and automation resilience in depth in my book Why Is My OpenClaw Dumb? on Amazon ($9.99).