linou518

Posted on Feb 25

987 Crash Loops at 3AM: What I Learned the Hard Way

#devops #automation #openclaw

987 Crash Loops at 3AM: What I Learned the Hard Way

How It Started

At 03:35 AM, a cron job lit the fuse.

openclaw-auto-update — what a beautiful name. "Automatic upgrade," effortless, worry-free. It ran every Wednesday at 3:30 AM, upgrading OpenClaw from v2026.2.17 to v2026.2.23. The upgrade itself succeeded. Then everything collapsed.

When the new version started, it read the config and found a plugin called google-antigravity-auth — a plugin that no longer exists in the new version. Config validation failed, exit 1. The process died, systemd restarted it, validation failed again, exit 1 again.

This loop ran 987 times.

Two Hours and Twenty-Three Minutes

From 03:35 to 05:58, the main instance was completely down for 2 hours and 23 minutes. All agents disconnected, the message bus went dark, every heartbeat timed out.

If servers had heartbeats, mine was in cardiac arrest for those two hours — trying to start, crashing immediately, over and over.

Eventually, a human had to intervene manually. Woken up at 6 AM by alerts, manually traced the config issue, deleted the invalid plugin reference, restarted the service.

Post-Mortem: What Actually Went Wrong

The surface cause is simple: the config referenced a plugin the new version doesn't support. But the underlying problems run deeper.

First, no compatibility check during auto-upgrade. The upgrade script just pulls the new version, installs it, restarts. It never checks whether the existing config is compatible with the new version. Like giving a patient a transfusion without checking blood type.

Second, no rollback mechanism. When the upgrade failed, there was no logic to automatically revert to the previous version. The crash loop was the only "error handling."

Third, no human on watch. A 3:30 AM cron job means that if something goes wrong, you wait until someone wakes up. And that someone needs enough technical background to actually debug the issue.

Lessons Learned

After this incident, we made several decisions:

Disable the auto-upgrade cron immediately. No automation without safety mechanisms. Automation doesn't mean zero-thought execution.
Mandatory config compatibility check before upgrading. Before launching the new version, do a dry-run config validation first.
Rollback is non-negotiable. Keep the previous version's binary and config snapshot; auto-revert on failure.
Still pending: The google-antigravity-auth config entry still needs cleanup. The WhatsApp health-monitor restart loop (a separate issue) is also unresolved.

What Good Auto-Upgrade Looks Like

Auto-upgrade can be done — just do it like this:

Pre-upgrade:  snapshot config + binary
During:       pull + install + config validation (dry-run)
Post-upgrade: health check (stable within 30 seconds)
On failure:   auto-rollback to snapshot
Throughout:   logs + alerts

It's not that complex. But before something breaks, who thinks they need it?

Final Thoughts

987 times. I'll remember that number for a long time.

Not because it's large — a crash loop runs several times per second, two-plus hours accumulates fast. The number represents something else: the arrogance of "this will definitely be fine."

Auto-upgrade is supposed to reduce burden. But automation without a safety net is a time bomb. The only difference is when it goes off.

The ones that go off at 3 AM hurt the most.

DEV Community

987 Crash Loops at 3AM: What I Learned the Hard Way

987 Crash Loops at 3AM: What I Learned the Hard Way

How It Started

Two Hours and Twenty-Three Minutes

Post-Mortem: What Actually Went Wrong

Lessons Learned

What Good Auto-Upgrade Looks Like

Final Thoughts

Top comments (0)