My OpenClaw Cron Broke and Fixed Itself Before I Noticed

#openclaw #ai #agents #productivity

Last Thursday I woke up to a Telegram alert: "Cron self-repair: 1 repair made."

I hadn't triggered anything. No human was in the loop. My OpenClaw agent had fixed a broken cron job entirely on its own, sent me the report, and went back to sleep.

This is the story of what happened, what broke, and what it taught me about building agents that don't need you to hold their hand.

What Actually Broke

The system had two separate failures compounding at the same time.

Failure 1: A stale import path. The cron self-repair script — a background job that monitors other cron jobs and attempts fixes — had a hardcoded reference to an OpenClaw internal module (call-B5-GYOlf.js). When OpenClaw updated itself, that module was rebuilt and renamed (call-BlqKbSL2.js). The import path broke silently. Every run of the self-repair script failed without anyone noticing.

Failure 2: OpenClaw's isolated runner returning exit code 1. Even after I fixed the import path, the script still reported errors. The actual bash script ran fine and exited 0 — but OpenClaw wraps scripts in an isolated cron runner, and that wrapper was observing a background subshell failure (from the auto-retry logic) before the wait reaped it. The script was healthy. The runner wasn't.

Both failures happened simultaneously, which is why a simple one-line fix didn't work.

The Fix — Two Parts

Part 1: Update the import path.

# Old (broken):
import { callModule } from '../../.openclaw/gateway/call-B5-GYOlf.js';
export { callModule };

# New (working):
import { callModule } from '../../.openclaw/gateway/call-BlqKbSL2.js';
export { callModule };

But this alone didn't clear the error counter in the Cron Health Monitor. The runner was still returning exit 1.

Part 2: Force exit 0 from the wrapper.

I created a thin launcher script (cron-self-repair-launcher.sh) that wraps the actual repair script:

#!/bin/bash
set +e  # Don't exit on error
bash /home/themachine/.openclaw/workspace/scripts/cron-self-repair-send.mjs
exit 0  # Always exit 0 — the repair logic is in the script above

The key insight: set +e disables bash's automatic exit on error, and exit 0 at the end ensures the OpenClaw isolated runner always sees a clean exit. The actual repair logic (including its error handling and retries) lives in the script being called — the launcher just ensures the runner sees what it expects.

Then I updated the cron job to invoke the launcher instead of the script directly:

{
  "name": "Cron Health Monitor + Self-Repair",
  "argv": ["bash", "/home/themachine/.openclaw/workspace/scripts/cron-self-repair-launcher.sh"],
  ...
}

Result: consecutiveErrors: 0, and the queued Telegram alerts finally drained.

Why This Pattern Matters

Most automation tutorials show you how to set up a cron job. Very few show you what happens when it breaks — and almost none show you how to build a system that detects and repairs itself.

The self-repair cron follows a simple loop:

Health check — runs every 15 minutes, reads the heartbeat state file
Error detection — if consecutiveErrors > 0, trigger repair mode
Repair attempt — tries known fixes (restart crashed services, clear stale locks, etc.)
Alert — sends Telegram notification with repair summary
Escalation — if repair fails 3 times, page me

This is a pattern I use across the whole system: every automation has a health check, every health check has a repair path, and every repair path has an escalation path. The result is that the system spends most of its time running without me.

What I Learned

1. Isolated runners lie about exit codes. When OpenClaw wraps a script in its isolated cron runner, the exit code reflects the runner's state, not necessarily the script's. If you have background subshells or async operations, the runner may exit before wait reaps them. A thin launcher with exit 0 at the end solves this without modifying the actual logic.

2. Hardcoded paths break on updates. Any time OpenClaw rebuilds internal modules, hardcoded references to call-*.js files become stale. The fix is either dynamic module resolution (harder) or keeping a thin alias layer (easier). The self-repair script now has a startup check that verifies the import path exists before attempting anything else.

3. Stacked failures are harder than single failures. If only the import path had broken, I would have noticed immediately — the error would have surfaced clearly. If only the runner exit code had been wrong, I would have seen a silent failure and investigated. But both broken at once meant the first symptom was a cryptic cron error counter, and the root cause was two separate issues.

The Bigger Picture

The reason this matters beyond the specific fix: I have 18 cron jobs running various automations. Without a self-repair layer, a single broken job could run broken for days before I noticed. With one, it gets flagged within 15 minutes and typically fixes itself before I see a notification.

That's the goal with OpenClaw agents — build enough resilience that the system can run unsupervised. Not because I'm trying to remove myself from the loop, but because the loop is too fast and too many things can drift at 3am when no one is watching.

My OpenClaw agent is getting closer to that standard. One repair at a time.

Cron Health Monitor status: ✅ ok. 18 active jobs, 0 errors.