linou518
Two Landmines That Nearly Killed Our 20-Agent Cluster — And How We Fixed Them

Lessons in resilience from running a 9-node, 20+ agent OpenClaw cluster in production.

Introduction

I run 20+ AI agents around the clock on 9 GMK Mini PCs and a Mac Mini at home. They handle everything from business automation to learning to family support. The use cases vary, but they share one common concern: they can't go down.

Today alone I hit two failures and had to implement fixes. Both were "obvious in hindsight" problems that could have been prevented.

Landmine 1: Anthropic API Overload — Every Agent Goes Silent

What Happened

The Claude Opus API became overloaded. OpenClaw's Gateway retries with backoff, but after consecutive failures, sessions get interrupted. Because no fallback model was configured, every agent on all 9 nodes went unresponsive simultaneously.

A textbook single point of failure (SPOF). When you depend on an API provider, this risk is unavoidable.

Fix: Cluster-Wide Codex Fallback Deployment

OpenClaw supports `model.fallbacks` for specifying fallback models. We chose OpenAI Codex as the fallback.

Steps:

  1. Auth profile propagation — Extract OAuth tokens from the main node's auth config and inject them into every node's auth-profiles.json

  2. Bulk config update — A Python script updated openclaw.json on all nodes:

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["openai-codex/gpt-5.3-codex"]
      }
    }
  }
}
```
  3. Gateway restart + verification on all nodes — 9/9 succeeded
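The bulk update in step 2 can be sketched roughly as follows. This is a minimal sketch, assuming SSH access to each node and a config at `~/.openclaw/openclaw.json`; the node list and helper names are hypothetical, not the actual script.

```python
import json
import subprocess

# Hypothetical node list -- adjust to your cluster.
NODES = ["node1.local", "node2.local", "node3.local"]
CONFIG_PATH = "~/.openclaw/openclaw.json"

FALLBACKS = {"fallbacks": ["openai-codex/gpt-5.3-codex"]}


def patch_config(raw: str) -> str:
    """Merge the fallback list into agents.defaults.model."""
    cfg = json.loads(raw)
    model = (cfg.setdefault("agents", {})
                .setdefault("defaults", {})
                .setdefault("model", {}))
    model.update(FALLBACKS)
    return json.dumps(cfg, indent=2)


def update_node(host: str) -> None:
    # Read, patch, and write back the config over SSH.
    raw = subprocess.run(["ssh", host, f"cat {CONFIG_PATH}"],
                         capture_output=True, text=True, check=True).stdout
    patched = patch_config(raw)
    subprocess.run(["ssh", host, f"cat > {CONFIG_PATH}"],
                   input=patched, text=True, check=True)


if __name__ == "__main__":
    for node in NODES:
        update_node(node)
        print(f"{node}: updated")
```

Merging into the existing config instead of overwriting it preserves any per-node settings.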

Impact

When the Anthropic API goes down, agents automatically switch to Codex. Users see no errors. Quality drops slightly, but silence is far worse.

Takeaways

  • Running production without a fallback is running naked. You can't control your API provider's availability, so an alternate path is mandatory.
  • Templatize auth profiles so they can be applied instantly when adding new nodes. We store a template in shared storage.
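The template idea from the second bullet could be applied per node with something like this sketch; the template path, the `$OAUTH_TOKEN` placeholder, and the file shape are all assumptions.

```python
import json
from pathlib import Path
from string import Template

# Hypothetical paths -- the real template lives in shared storage.
TEMPLATE_PATH = Path("/mnt/shared/openclaw/auth-profiles.template.json")
TARGET_PATH = Path.home() / ".openclaw" / "auth-profiles.json"


def render_profile(template_text: str, oauth_token: str) -> str:
    """Fill the shared template with this node's OAuth token."""
    return Template(template_text).substitute(OAUTH_TOKEN=oauth_token)


def install_profile(oauth_token: str) -> None:
    rendered = render_profile(TEMPLATE_PATH.read_text(), oauth_token)
    json.loads(rendered)  # fail fast if the result is not valid JSON
    TARGET_PATH.parent.mkdir(parents=True, exist_ok=True)
    TARGET_PATH.write_text(rendered)
```

Validating the rendered JSON before writing catches a bad token paste before it takes a node down.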

Landmine 2: macOS sleep=1 — Mac Mini M1 Dies Every Minute

What Happened

An agent on the Mac Mini M1 node stopped responding. The logs showed:

  • Slack WebSocket going stale (disconnecting) roughly every 30 minutes
  • 7 disconnections today alone (07:18–10:21)
  • The health monitor detected and auto-restarted, but messages were lost during disconnections
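For reference, a health monitor of the kind described above can be as small as the following sketch; the health endpoint URL, poll interval, and restart command are assumptions, not OpenClaw's actual monitor.

```python
import subprocess
import time
import urllib.error
import urllib.request

# Hypothetical health endpoint and restart command.
HEALTH_URL = "http://127.0.0.1:8080/health"
RESTART_CMD = ["launchctl", "kickstart", "-k", "gui/501/ai.openclaw.gateway"]


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """True if the Gateway answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def watch(interval: float = 30.0) -> None:
    # Poll forever; kick the Gateway whenever the check fails.
    while True:
        if not is_healthy(HEALTH_URL):
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(interval)
```

A restart-based monitor recovers the connection but not the messages lost while it was down, which is exactly what we saw.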

On top of that, a migration from an old node left behind stale configuration, resulting in two Gateways connecting with the same Slack token.

Root Cause

```shell
$ pmset -g | grep sleep
 sleep           1   # ← macOS sleeps after 1 minute
```

We were using a Mac Mini as a server but had left the default sleep settings in place. The machine sleeps, the network drops, the WebSocket dies, and the Gateway disconnects.

Fix

```shell
# Disable sleep entirely
sudo pmset -a sleep 0 displaysleep 0 disksleep 0

# Remove the duplicate launchd service
sudo launchctl bootout system/com.openclaw.gateway
sudo rm /Library/LaunchDaemons/com.openclaw.gateway.plist

# Restart the Gateway
launchctl kickstart -k gui/501/ai.openclaw.gateway
```

Takeaways

  • If you use macOS as a server, run `pmset -a sleep 0` on day one. This should be "obvious," but we missed it during setup.
  • Old and new launchd plists coexisting can cause one to crash-loop and consume resources. Always clean up after a migration.
  • On macOS, the Gateway logs live at `~/.openclaw/logs/gateway.log`. Looking in the wrong place first wastes time.
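When debugging this kind of flapping, it helps to count disconnects straight from that log. A minimal sketch, assuming lines begin with an HH:MM timestamp and contain the word "disconnected" (the real log format may differ):

```python
import re

# Assumed line shape: "07:18 ... websocket disconnected ..."
DISCONNECT_RE = re.compile(r"^(\d{2}:\d{2}).*disconnected", re.IGNORECASE)


def disconnect_times(log_text: str) -> list[str]:
    """Return the timestamp of every disconnect event in the log."""
    return [m.group(1) for line in log_text.splitlines()
            if (m := DISCONNECT_RE.match(line))]


# Usage (path from the takeaway above):
# from pathlib import Path
# text = (Path.home() / ".openclaw/logs/gateway.log").read_text()
# times = disconnect_times(text)
# print(len(times), "disconnects:", times)
```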

Summary: Cluster Resilience Checklist

| Check | Mitigation |
| --- | --- |
| API provider outage | Configure fallback models (different provider) |
| Node sleep/power management | Disable on day one for server use |
| Stale config after migration | Clean up old services and configs |
| Auth credential propagation | Templatize and store in shared storage |
| WebSocket disconnect detection | Verify health monitor configuration |

None of this is flashy technology, but doing all of it drastically reduces 3 AM emergencies. Before building automation, build a foundation that doesn't fall over.
