Lessons in resilience from running a 9-node, 20+ agent OpenClaw cluster in production.
Introduction
I run 20+ AI agents around the clock on 9 GMK Mini PCs and a Mac Mini at home. They handle everything from business automation to learning to family support. The use cases vary, but they share one common concern: they can't go down.
Today alone I hit two failures and had to implement fixes. Both were "obvious in hindsight" problems that could have been prevented.
Landmine 1: Anthropic API Overload — Every Agent Goes Silent
What Happened
The Claude Opus API became overloaded. OpenClaw's Gateway retries with backoff, but after consecutive failures, sessions get interrupted. Because no fallback model was configured, every agent on all 9 nodes went unresponsive simultaneously.
A textbook single point of failure (SPOF). When you depend on an API provider, this risk is unavoidable.
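The failure mode boils down to retry-with-backoff that eventually gives up with nowhere else to go. A minimal sketch of the retry-then-fallback behavior we wanted, where `call_model` is a hypothetical stand-in for the actual provider call:

```python
import time

class ProviderOverloaded(Exception):
    """Raised when a model provider reports it is overloaded."""

def call_with_fallback(prompt, models, call_model,
                       max_retries=3, base_delay=1.0):
    """Try each model in preference order, with exponential backoff
    per model before moving on to the next one.

    `models` is an ordered list, e.g. ["anthropic/claude-opus-4-6",
    "openai-codex/gpt-5.3-codex"]; `call_model(model, prompt)` is a
    hypothetical provider call, not a real OpenClaw API.
    """
    for model in models:
        delay = base_delay
        for _ in range(max_retries):
            try:
                return call_model(model, prompt)
            except ProviderOverloaded:
                time.sleep(delay)
                delay *= 2  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("all models exhausted")
```

Without a second entry in `models`, the outer loop has exactly one iteration, and exhausted retries mean a dead agent. That is what happened to us.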
Fix: Cluster-Wide Codex Fallback Deployment
OpenClaw supports `model.fallbacks` for specifying fallback models. We chose OpenAI Codex as the fallback.
Steps:
- Auth profile propagation — Extract OAuth tokens from the main node's auth config and inject them into every node's `auth-profiles.json`
- Bulk config update — A Python script updated `openclaw.json` on all nodes:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["openai-codex/gpt-5.3-codex"]
      }
    }
  }
}
```
- Gateway restart + verification on all nodes — 9/9 succeeded
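The bulk config update can be sketched roughly as follows. The key layout mirrors the JSON snippet above; the function name and the idea of editing the file in place are my own illustration, not OpenClaw tooling, and the real script additionally pushed the change to each node over the network.

```python
import json
from pathlib import Path

def set_model_fallbacks(config_path, primary, fallbacks):
    """Merge a primary model and fallback chain into an
    openclaw.json-style file, preserving any settings
    already present in the config."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    model = (config.setdefault("agents", {})
                   .setdefault("defaults", {})
                   .setdefault("model", {}))
    model["primary"] = primary
    model["fallbacks"] = list(fallbacks)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Using `setdefault` at each level means the script works both on fresh configs and on nodes that already have unrelated agent settings.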
Impact
When the Anthropic API goes down, agents automatically switch to Codex. Users see no errors. Quality drops slightly, but silence is far worse.
Takeaways
- Running production without a fallback is running naked. You can't control your API provider's availability, so an alternate path is mandatory.
- Templatize auth profiles so they can be applied instantly when adding new nodes. We store a template in shared storage.
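Templatizing can be as simple as a shared template file plus a script that fills in per-provider tokens when a node joins. A sketch, with a `providers`/`oauth_token` layout that is purely illustrative (not OpenClaw's actual `auth-profiles.json` schema):

```python
import json
from pathlib import Path

def render_auth_profile(template_path, out_path, tokens):
    """Fill a shared auth-profile template with per-provider
    OAuth tokens and write it to the new node's config dir.

    `tokens` maps provider name -> token string.
    """
    profile = json.loads(Path(template_path).read_text())
    for provider, entry in profile.get("providers", {}).items():
        entry["oauth_token"] = tokens[provider]  # KeyError = fail loudly
    Path(out_path).write_text(json.dumps(profile, indent=2) + "\n")
    return profile
```

Letting a missing token raise immediately is deliberate: a node that silently joins with half its credentials is the same SPOF wearing a different hat.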
Landmine 2: macOS sleep=1 — Mac Mini M1 Dies Every Minute
What Happened
An agent on the Mac Mini M1 node stopped responding. The logs showed:
- Slack WebSocket going stale (disconnecting) roughly every 30 minutes
- 7 disconnections today alone (07:18–10:21)
- The health monitor detected and auto-restarted, but messages were lost during disconnections
On top of that, a migration from an old node left behind stale configuration, resulting in two Gateways connecting with the same Slack token.
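The stale-connection detection the health monitor performs amounts to a watchdog over a heartbeat timestamp. A minimal sketch, where the 90-second threshold and the restart hook are assumptions for illustration, not OpenClaw's actual values:

```python
import time

class StaleConnectionMonitor:
    """Flag a connection as stale when no heartbeat arrives
    within the timeout window, and trigger a restart."""

    def __init__(self, timeout_s=90, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock           # injectable for testing
        self.last_heartbeat = clock()
        self.restarts = 0

    def heartbeat(self):
        """Call on every message/ping from the WebSocket."""
        self.last_heartbeat = self.clock()

    def check(self, restart):
        """Call periodically; invokes `restart` if the link is stale."""
        if self.clock() - self.last_heartbeat > self.timeout_s:
            restart()
            self.restarts += 1
            self.heartbeat()  # reset the window after restarting
```

Note what this does *not* solve: messages sent while the socket was dead are still lost, which is exactly the gap we observed. Detection bounds the outage; it doesn't make it free.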
Root Cause
```shell
$ pmset -g | grep sleep
 sleep  1   # ← macOS sleeps after 1 minute
```
We were using a Mac Mini as a server but had left the default sleep settings in place. The machine sleeps, the network drops, the WebSocket dies, and the Gateway disconnects.
Fix
```shell
# Disable sleep entirely
sudo pmset -a sleep 0 displaysleep 0 disksleep 0

# Remove duplicate launchd service
sudo launchctl bootout system/com.openclaw.gateway
sudo rm /Library/LaunchDaemons/com.openclaw.gateway.plist

# Restart Gateway
launchctl kickstart -k gui/501/ai.openclaw.gateway
```
Takeaways
- If you use macOS as a server, run `sudo pmset -a sleep 0` on day one. This should be "obvious," but we missed it during setup.
- Old and new launchd plists coexisting can cause one to crash-loop and consume resources. Always clean up after migration.
- macOS Gateway logs live at `~/.openclaw/logs/gateway.log`. Looking in the wrong place first wastes time.
Summary: Cluster Resilience Checklist
| Risk | Mitigation |
|---|---|
| API provider outage | Configure fallback models (different provider) |
| Node sleep/power management | Disable on day one for server use |
| Stale config after migration | Clean up old services and configs |
| Auth credential propagation | Templatize and store in shared storage |
| WebSocket disconnect detection | Verify health monitor configuration |
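The checklist is easy to automate once you collect per-node facts with your own tooling. A minimal sketch, where every key in the `facts` dict is illustrative (you would populate it from `pmset` output, parsed configs, and service listings):

```python
def audit_node(facts):
    """Evaluate one node's collected facts against the
    resilience checklist; returns a list of problems."""
    problems = []
    if not facts.get("fallback_models"):
        problems.append("no fallback model configured")
    if facts.get("sleep_minutes", 0) != 0:
        problems.append("system sleep enabled")
    if facts.get("duplicate_services"):
        problems.append("stale services left over from migration")
    if not facts.get("auth_profile_present", True):
        problems.append("auth profile missing")
    return problems
```

Run it per node on a schedule and alert on any non-empty result; every item above would have caught one of today's failures before it paged anyone.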
None of this is flashy technology, but doing all of it drastically reduces 3 AM emergencies. Before building automation, build a foundation that doesn't fall over.