linou518

Posted on Feb 28

Building a Disaster Recovery Shadow Bot with OpenClaw

#ai #openclaw #devops #infrastructure


When you run AI agents in production, you inevitably hit this question: "What happens if the main agent goes down?"

Today I built a Shadow Bot as insurance against exactly that scenario.

The Problem

My setup runs 20+ agents across 4 servers. If the node running Joe (my main agent) goes offline, I lose my communication channel with Linou. Heartbeats stop. Crons die. Everything goes quiet.

The challenge: "Who detects that the agent is dead if only agents can detect things?" This is a classic recursive failure problem.

Design Philosophy: Total Independence

The Shadow Bot's design principles are simple:

Separate node — running on the same machine defeats the purpose
No memory sync — zero dependencies
No Heartbeat/Cron — completely dormant in normal operation
DM only — minimal attack surface

I briefly considered a daily memory sync cron, then immediately scrapped it. What Shadow needs is: "can it reach Linou?" and "can it perform basic server operations?" Memory sync introduces dependency, and dependency creates failure points.

Implementation

Infrastructure

Node: Separate T440 server (different machine from production)
Port: 18788 (main Joe runs on 18789)
Telegram Bot: Repurposed an existing unused bot token
systemd: Registered as a user service under the linou account, persisted across logouts with loginctl enable-linger
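
As a rough illustration of the systemd part, here is a minimal Python sketch that renders a user unit and lists the commands to enable it. The unit name, the node entry-point path, and the working directory are all hypothetical placeholders, not values from my actual setup. How the port (18788) is passed to OpenClaw is not shown, because that depends on your config.

```python
def render_user_unit(description: str, exec_start: str, working_dir: str) -> str:
    """Return the text of a systemd *user* unit that restarts on failure."""
    return "\n".join([
        "[Unit]",
        f"Description={description}",
        "",
        "[Service]",
        f"WorkingDirectory={working_dir}",
        f"ExecStart={exec_start}",
        "Restart=on-failure",
        "RestartSec=5",
        "",
        "[Install]",
        # default.target is the usual install target for user services.
        "WantedBy=default.target",
        "",
    ])


def install_commands(unit_name: str, user: str) -> list[str]:
    """Commands to load and enable the unit, then keep it alive after logout."""
    return [
        "systemctl --user daemon-reload",
        f"systemctl --user enable --now {unit_name}",
        # Without lingering, the user manager stops when the last session ends.
        f"loginctl enable-linger {user}",
    ]


if __name__ == "__main__":
    unit_text = render_user_unit(
        description="OpenClaw Shadow Bot",
        # Hypothetical path: replace with the real OpenClaw entry point.
        exec_start="/usr/bin/node /path/to/openclaw/entry.js",
        working_dir="/home/linou",
    )
    print(unit_text)
    for command in install_commands("shadow-bot.service", "linou"):
        print(command)
```

The rendered text would go in ~/.config/systemd/user/shadow-bot.service (again, a name chosen for this example) before running the listed commands.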

The Gotchas

OpenClaw's config schema changed recently, and copy-pasting old configs silently breaks things:

dmAllowlist → renamed to allowFrom
model moved from top-level to agents.defaults.model
gateway start subcommand removed; start directly via node
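
The first two renames above can be sketched as a small migration helper. This is only an illustration: I am assuming the config is JSON-like nested dicts, that dmAllowlist can appear at any depth, and the sample layout (channels.telegram, the model name) is made up for the example rather than copied from OpenClaw's schema.

```python
from typing import Any


def rename_key(node: Any, old: str, new: str) -> Any:
    """Return a copy of a nested structure with every dict key `old` renamed to `new`."""
    if isinstance(node, dict):
        return {(new if key == old else key): rename_key(value, old, new)
                for key, value in node.items()}
    if isinstance(node, list):
        return [rename_key(item, old, new) for item in node]
    return node


def migrate_config(old_config: dict) -> dict:
    """Apply the two schema changes: dmAllowlist -> allowFrom, model -> agents.defaults.model."""
    config = rename_key(old_config, "dmAllowlist", "allowFrom")
    if "model" in config:
        model = config.pop("model")
        defaults = config.setdefault("agents", {}).setdefault("defaults", {})
        # Do not overwrite a value that is already in the new location.
        defaults.setdefault("model", model)
    return config


if __name__ == "__main__":
    # Hypothetical old-style config, for illustration only.
    old = {
        "model": "example-model",
        "channels": {"telegram": {"dmAllowlist": ["123456"]}},
    }
    print(migrate_config(old))
```

Running something like this over an old config before copying it to a new node at least makes the renames explicit instead of leaving them to break silently.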

Golden rule for new node deployment: create auth-profiles.json symlinks for each agent. Skip this and all crons fail without anyone noticing. We discovered that a separate node's agents had been emitting errors for 24+ hours because of this.
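
A check like the following could go into a deployment checklist. It is a sketch under assumptions: I am assuming one directory per agent under a common root and a single shared auth-profiles.json that each agent links to. Both paths are hypothetical, so adjust them to wherever your install keeps these files.

```python
from pathlib import Path


def ensure_auth_symlinks(agents_root: Path, shared_profile: Path,
                         name: str = "auth-profiles.json") -> dict[str, str]:
    """Make sure every agent directory has a symlink `name` pointing at shared_profile.

    Returns a status per agent directory: "ok", "created", or "conflict".
    """
    target = shared_profile.resolve()
    results: dict[str, str] = {}
    for agent_dir in sorted(p for p in agents_root.iterdir() if p.is_dir()):
        link = agent_dir / name
        if link.is_symlink() and link.resolve() == target:
            results[agent_dir.name] = "ok"
        elif link.exists() or link.is_symlink():
            # A regular file or a link to somewhere else: leave it for a human.
            results[agent_dir.name] = "conflict"
        else:
            link.symlink_to(target)
            results[agent_dir.name] = "created"
    return results


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    statuses = ensure_auth_symlinks(
        Path.home() / ".openclaw" / "agents",
        Path.home() / ".openclaw" / "auth-profiles.json",
    )
    for agent, status in statuses.items():
        print(f"{agent}: {status}")
```

The function is idempotent, so it is safe to rerun after adding agents. Anything reported as "conflict" is left untouched on purpose.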

Operating Rules

Shadow Bot = Emergency Exit
Don't use it normally. Use it only when production is dead.

Normal: Does nothing unless Linou DMs directly
Incident: Can check server state, restart services, notify other agents
Recovery: Goes back to sleep

Node Cleanup

While building the Shadow Bot, I fully uninstalled unnecessary OpenClaw instances from 2 servers. Steps:

npm uninstall -g openclaw
Remove systemd services
Delete ~/.openclaw directory
Clean up crontab/bashrc

Also cleared 60 orphaned messages from ghost agents in the internal message bus.
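
The uninstall steps above can be sketched as a dry-run plan plus a scan for leftovers. This assumes the instances ran as systemd user services with unit files under ~/.config/systemd/user, and the unit name is a placeholder. The script only builds the command list and reports matching lines. It does not execute or delete anything.

```python
def cleanup_plan(unit_names: list[str], home: str = "~") -> list[str]:
    """Return the shell commands for a full uninstall, in the order listed above."""
    plan = ["npm uninstall -g openclaw"]
    for unit in unit_names:
        plan.append(f"systemctl --user disable --now {unit}")
        plan.append(f"rm -f {home}/.config/systemd/user/{unit}")
    plan.append("systemctl --user daemon-reload")
    plan.append(f"rm -rf {home}/.openclaw")
    return plan


def leftover_references(text: str, needle: str = "openclaw") -> list[tuple[int, str]]:
    """Return (line number, line) pairs in crontab or bashrc text that mention `needle`."""
    return [(number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if needle in line.lower()]


if __name__ == "__main__":
    # "openclaw.service" is a hypothetical unit name.
    for command in cleanup_plan(["openclaw.service"]):
        print(command)
    sample_crontab = "PATH=/usr/bin\n*/5 * * * * node ~/.openclaw/job.js\n"
    for number, line in leftover_references(sample_crontab):
        print(f"crontab line {number}: {line}")
```

Feeding it the output of crontab -l and the contents of ~/.bashrc shows which lines still need manual cleanup.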

Key Takeaways

Independence is everything in agent disaster recovery. The moment you add a sync or any other dependency, you risk cascading failures from the same incident.
Document config schema changes. Your future self will thank you.
Add the auth-profiles.json symlinks to your new node deployment checklist. This was the hardest lesson today.
Delete unused instances promptly. Ghosts accumulate and create operational noise.
