Lessons in resilience from running a 9-node, 20+ agent OpenClaw cluster in production.
Introduction
I run 20+ AI agents around the clock on 9 GMK Mini PCs and a Mac Mini at home. They handle everything from business automation to learning to family support. The use cases vary, but they share one common concern: they can't go down.
Today alone I hit two failures and had to implement fixes. Both were "obvious in hindsight" problems that could have been prevented.
Landmine 1: Anthropic API Overload — Every Agent Goes Silent
What Happened
The Claude Opus API became overloaded. OpenClaw's Gateway retries with backoff, but after consecutive failures, sessions get interrupted. Because no fallback model was configured, every agent on all 9 nodes went unresponsive simultaneously.
A textbook single point of failure (SPOF). When you depend on an API provider, this risk is unavoidable.
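The failure mode boils down to retry-with-backoff that eventually gives up with nowhere else to go. A minimal sketch of the retry-then-fallback behavior we wanted, where `call_model` is a hypothetical stand-in for the actual provider call:

```python
import time

class ProviderOverloaded(Exception):
    """Raised when a model provider reports it is overloaded."""

def call_with_fallback(prompt, models, call_model,
                       max_retries=3, base_delay=1.0):
    """Try each model in preference order, with exponential backoff
    per model before moving on to the next one.

    `models` is an ordered list, e.g. ["anthropic/claude-opus-4-6",
    "openai-codex/gpt-5.3-codex"]; `call_model(model, prompt)` is a
    hypothetical provider call, not a real OpenClaw API.
    """
    for model in models:
        delay = base_delay
        for _ in range(max_retries):
            try:
                return call_model(model, prompt)
            except ProviderOverloaded:
                time.sleep(delay)
                delay *= 2  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("all models exhausted")
```

Without a second entry in `models`, the outer loop has exactly one iteration, and exhausted retries mean a dead agent. That is what happened to us.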
Fix: Cluster-Wide Codex Fallback Deployment
OpenClaw supports `model.fallbacks` for specifying fallback models. We chose OpenAI Codex as the fallback.
Steps:
- Auth profile propagation — Extract OAuth tokens from the main node's auth config and inject them into every node's `auth-profiles.json`
- Bulk config update — A Python script updated `openclaw.json` on all nodes:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["openai-codex/gpt-5.3-codex"]
      }
    }
  }
}
```
- Gateway restart + verification on all nodes — 9/9 succeeded
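The bulk config update can be sketched roughly as follows. The key layout mirrors the JSON snippet above; the function name and the idea of editing the file in place are my own illustration, not OpenClaw tooling, and the real script additionally pushed the change to each node over the network.

```python
import json
from pathlib import Path

def set_model_fallbacks(config_path, primary, fallbacks):
    """Merge a primary model and fallback chain into an
    openclaw.json-style file, preserving any settings
    already present in the config."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    model = (config.setdefault("agents", {})
                   .setdefault("defaults", {})
                   .setdefault("model", {}))
    model["primary"] = primary
    model["fallbacks"] = list(fallbacks)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Using `setdefault` at each level means the script works both on fresh configs and on nodes that already have unrelated agent settings.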
Impact
When the Anthropic API goes down, agents automatically switch to Codex. Users see no errors. Quality drops slightly, but silence is far worse.
Takeaways
- Running production without a fallback is running naked. You can't control your API provider's availability, so an alternate path is mandatory.
- Templatize auth profiles so they can be applied instantly when adding new nodes. We store a template in shared storage.
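Templatizing can be as simple as a shared template file plus a script that fills in per-provider tokens when a node joins. A sketch, with a `providers`/`oauth_token` layout that is purely illustrative (not OpenClaw's actual `auth-profiles.json` schema):

```python
import json
from pathlib import Path

def render_auth_profile(template_path, out_path, tokens):
    """Fill a shared auth-profile template with per-provider
    OAuth tokens and write it to the new node's config dir.

    `tokens` maps provider name -> token string.
    """
    profile = json.loads(Path(template_path).read_text())
    for provider, entry in profile.get("providers", {}).items():
        entry["oauth_token"] = tokens[provider]  # KeyError = fail loudly
    Path(out_path).write_text(json.dumps(profile, indent=2) + "\n")
    return profile
```

Letting a missing token raise immediately is deliberate: a node that silently joins with half its credentials is the same SPOF wearing a different hat.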
Landmine 2: macOS sleep=1 — Mac Mini M1 Dies Every Minute
What Happened
An agent on the Mac Mini M1 node stopped responding. The logs showed:
- Slack WebSocket going stale (disconnecting) roughly every 30 minutes
- 7 disconnections today alone (07:18–10:21)
- The health monitor detected and auto-restarted, but messages were lost during disconnections
On top of that, a migration from an old node left behind stale configuration, resulting in two Gateways connecting with the same Slack token.
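The stale-connection detection the health monitor performs amounts to a watchdog over a heartbeat timestamp. A minimal sketch, where the 90-second threshold and the restart hook are assumptions for illustration, not OpenClaw's actual values:

```python
import time

class StaleConnectionMonitor:
    """Flag a connection as stale when no heartbeat arrives
    within the timeout window, and trigger a restart."""

    def __init__(self, timeout_s=90, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock           # injectable for testing
        self.last_heartbeat = clock()
        self.restarts = 0

    def heartbeat(self):
        """Call on every message/ping from the WebSocket."""
        self.last_heartbeat = self.clock()

    def check(self, restart):
        """Call periodically; invokes `restart` if the link is stale."""
        if self.clock() - self.last_heartbeat > self.timeout_s:
            restart()
            self.restarts += 1
            self.heartbeat()  # reset the window after restarting
```

Note what this does *not* solve: messages sent while the socket was dead are still lost, which is exactly the gap we observed. Detection bounds the outage; it doesn't make it free.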
Root Cause
```shell
$ pmset -g | grep sleep
 sleep  1   # ← macOS sleeps after 1 minute
```
We were using a Mac Mini as a server but had left the default sleep settings in place. The machine sleeps, the network drops, the WebSocket dies, and the Gateway disconnects.
Fix
```shell
# Disable sleep entirely
sudo pmset -a sleep 0 displaysleep 0 disksleep 0

# Remove duplicate launchd service
sudo launchctl bootout system/com.openclaw.gateway
sudo rm /Library/LaunchDaemons/com.openclaw.gateway.plist

# Restart Gateway
launchctl kickstart -k gui/501/ai.openclaw.gateway
```
Takeaways
- If you use macOS as a server, run `sudo pmset -a sleep 0` on day one. This should be "obvious," but we missed it during setup.
- Old and new launchd plists coexisting can cause one to crash-loop and consume resources. Always clean up after migration.
- macOS Gateway logs live at `~/.openclaw/logs/gateway.log`. Looking in the wrong place first wastes time.
Summary: Cluster Resilience Checklist
| Risk | Mitigation |
|---|---|
| API provider outage | Configure fallback models (different provider) |
| Node sleep/power management | Disable on day one for server use |
| Stale config after migration | Clean up old services and configs |
| Auth credential propagation | Templatize and store in shared storage |
| WebSocket disconnect detection | Verify health monitor configuration |
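The checklist is easy to automate once you collect per-node facts with your own tooling. A minimal sketch, where every key in the `facts` dict is illustrative (you would populate it from `pmset` output, parsed configs, and service listings):

```python
def audit_node(facts):
    """Evaluate one node's collected facts against the
    resilience checklist; returns a list of problems."""
    problems = []
    if not facts.get("fallback_models"):
        problems.append("no fallback model configured")
    if facts.get("sleep_minutes", 0) != 0:
        problems.append("system sleep enabled")
    if facts.get("duplicate_services"):
        problems.append("stale services left over from migration")
    if not facts.get("auth_profile_present", True):
        problems.append("auth profile missing")
    return problems
```

Run it per node on a schedule and alert on any non-empty result; every item above would have caught one of today's failures before it paged anyone.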
None of this is flashy technology, but doing all of it drastically reduces 3 AM emergencies. Before building automation, build a foundation that doesn't fall over.