When Discord Takes Down Your Entire Agent Fleet

#openclaw #discord #reliability #aiagents

Your Discord bot loses its WebSocket connection. Normal Tuesday. Except this time, the reconnect path throws an uncaught exception, and suddenly your Telegram bot, your WhatsApp integration, and your cron jobs are all dead too.

That's the story of #54667 and #54691, two issues filed on the same day that together paint a nasty picture of blast radius in multi-channel agent deployments.

The Crash Path

1. Discord health monitor detects a stale socket
2. Triggers a provider restart
3. Reconnect hits `Max reconnect attempts (0) reached after code 1005`
4. Exception goes uncaught
5. Entire gateway process exits

One channel's reconnect failure kills everything. Telegram, WhatsApp, cron scheduler, the whole process.
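One way to picture the containment that was missing is a restart wrapper that turns a thrown error into a result value. This is a minimal sketch, not OpenClaw's actual code: `ChannelProvider`, `restartIsolated`, and `restartAll` are hypothetical names, and the restart is synchronous only to keep the example short. A real reconnect would be async, with the same boundary wrapped around the awaited call.

```typescript
// Hypothetical shapes for illustration; not OpenClaw's actual API.
type RestartResult =
  | { channel: string; ok: true }
  | { channel: string; ok: false; error: string };

interface ChannelProvider {
  name: string;
  restart(): void; // may throw, e.g. when reconnect attempts are exhausted
}

// Restart one provider and convert any thrown error into a result value,
// so a failure in one channel cannot propagate and end the process.
function restartIsolated(provider: ChannelProvider): RestartResult {
  try {
    provider.restart();
    return { channel: provider.name, ok: true };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { channel: provider.name, ok: false, error };
  }
}

// Restart every provider independently; one failure does not stop the others.
function restartAll(providers: ChannelProvider[]): RestartResult[] {
  return providers.map(restartIsolated);
}

// Simulated providers: Discord fails the way the crash path describes.
const discordProvider: ChannelProvider = {
  name: "discord",
  restart() {
    throw new Error("Max reconnect attempts (0) reached after code 1005");
  },
};
const telegramProvider: ChannelProvider = { name: "telegram", restart() {} };

const restartResults = restartAll([discordProvider, telegramProvider]);
console.log(restartResults);
```

With this shape, the Discord failure shows up as data the supervisor can log or retry, and the Telegram restart still runs.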

The Zombie Path

#54691 is the flip side: instead of crashing too hard, Discord bots don't crash enough. After a Discord outage, bots sit in a zombie state: `running=true` but `connected` is `undefined`. The health monitor checks `connected === false`, which `undefined` doesn't match, so the monitor never flags them. Three bots sat zombified for 35 minutes.

The fix: check `connected !== true` instead of `connected === false`, so a missing value counts as unhealthy. Pessimistic health checks beat optimistic ones.
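The difference between the two checks is easy to see side by side. The sketch below assumes a hypothetical `ProviderStatus` shape in which `connected` can be absent; the function names are illustrative and not taken from the codebase.

```typescript
// Hypothetical status shape; `connected` may be missing after an outage.
interface ProviderStatus {
  running: boolean;
  connected?: boolean;
}

// Optimistic: flags only an explicit `false`, so `undefined` passes as healthy.
function needsRestartOptimistic(s: ProviderStatus): boolean {
  return s.running && s.connected === false;
}

// Pessimistic: anything other than an explicit `true` counts as unhealthy.
function needsRestartPessimistic(s: ProviderStatus): boolean {
  return s.running && s.connected !== true;
}

const zombieStatus: ProviderStatus = { running: true }; // connected is undefined

console.log(needsRestartOptimistic(zombieStatus));  // false: the zombie goes unnoticed
console.log(needsRestartPessimistic(zombieStatus)); // true: the zombie gets restarted
```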

The Pattern: Shared-Process Blast Radius

Issue	Failure mode	Blast radius
#54667	Uncaught exception in one channel	Kills all channels
#54691	Health check misses zombie state	One channel silently dead

Both stem from running multiple channel providers in a single process.

Lessons for Agent Builders

Map your blast radius. If one component throwing kills everything, fix that first.
Three-state health checks. Running/stopped isn't enough. You need running-and-working / running-but-broken / stopped.
Strict comparison in health logic. `=== false` and `!== true` are very different when `undefined` enters the picture: only `!== true` treats a missing value as unhealthy.
Test the reconnect path. Initial connection works in every demo. Reconnect-after-failure is where the bugs hide.
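The three-state idea can be sketched as a small classifier. The `Health` labels and the `ChannelStatus` shape are assumptions made for illustration; the point is only that "running" and "connected" are tracked separately, and that an unknown connection state lands in the broken bucket.

```typescript
// Hypothetical names: three states instead of a running/stopped boolean.
type Health = "healthy" | "degraded" | "stopped";

interface ChannelStatus {
  running: boolean;
  connected?: boolean; // may be undefined when the state is unknown
}

// Classify pessimistically: only an explicit `connected === true` is healthy.
function classifyHealth(status: ChannelStatus): Health {
  if (!status.running) return "stopped";
  return status.connected === true ? "healthy" : "degraded";
}

console.log(classifyHealth({ running: true, connected: true })); // "healthy"
console.log(classifyHealth({ running: true }));                  // "degraded"
console.log(classifyHealth({ running: false }));                 // "stopped"
```

A monitor built on this would restart anything that is `degraded`, which covers the zombie case where `connected` is `undefined`.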