The Setup
You build a watchdog. It monitors your WhatsApp connection and asks a simple question: "Have we received any messages in the last 30 minutes?" If not, something's probably wrong — force a reconnect.
Sounds reasonable. It is reasonable. Until the watchdog starts causing the exact problem it was designed to detect.
#55330 documents what happens when a health check mechanism inherits stale state across the recovery it triggers.
What Happens
- WhatsApp connection is quiet for 30+ minutes (no inbound messages)
- Watchdog fires: "No messages in 30 min, connection must be dead" → force disconnect
- Main loop reconnects, creating a new connection
- The new connection inherits the old lastInboundAt timestamp
- Next watchdog check (60s later): the timestamp is still 30+ minutes old
- Watchdog fires again → force disconnect
- Go to step 3
Infinite loop. Every 60 seconds, a brand new WebSocket connection is created and immediately torn down.
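The loop hinges on a staleness check like the following. This is a minimal sketch, not the project's actual code: the names (`isStale`, `connectedAt`) and the 30-minute threshold are assumptions reconstructed from the description above.

```typescript
// Hypothetical sketch of the watchdog's staleness check.
const STALE_AFTER_MS = 30 * 60 * 1000; // 30-minute silence window

function isStale(
  lastInboundAt: number | null,
  connectedAt: number,
  now: number,
): boolean {
  // Fall back to the connection start time when no message has arrived yet.
  const reference = lastInboundAt ?? connectedAt;
  return now - reference > STALE_AFTER_MS;
}

// A connection that inherits a 31-minute-old timestamp is judged stale on
// its very first check, even though it connected seconds ago:
const reconnectedAt = Date.now();
console.log(isStale(reconnectedAt - 31 * 60 * 1000, reconnectedAt, reconnectedAt)); // true
```

Nothing about the check itself is wrong; the problem is entirely in what `lastInboundAt` holds when the check runs against a freshly created connection.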
The Root Cause
const active = createActiveConnectionRun(
status.lastInboundAt ?? status.lastMessageAt ?? null
);
status.lastInboundAt is never reset after a watchdog-forced reconnect. Every new connection is born already "stale."
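Here is a condensed sketch of how the stale value flows through the reconnect path. The `ConnectionStatus` and `ActiveConnectionRun` shapes are assumptions reconstructed from the snippet above, not the project's real types.

```typescript
// Assumed shapes, reconstructed from the snippet above.
interface ConnectionStatus {
  lastInboundAt: number | null;
  lastMessageAt: number | null;
}

interface ActiveConnectionRun {
  lastInboundAt: number | null;
}

function createActiveConnectionRun(lastInboundAt: number | null): ActiveConnectionRun {
  return { lastInboundAt };
}

// The watchdog forced a disconnect because lastInboundAt was 30+ min old...
const status: ConnectionStatus = {
  lastInboundAt: Date.now() - 35 * 60 * 1000,
  lastMessageAt: null,
};

// ...and the reconnect hands that very same timestamp to the new run, so the
// next check (60s later) sees a connection that is already "stale" at birth.
const active = createActiveConnectionRun(
  status.lastInboundAt ?? status.lastMessageAt ?? null
);
```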
The Damage
- 960 MB peak memory (each cycle leaks sockets and event listeners)
- 6 min 51s CPU time (spinning on reconnects)
- Shutdown failure: the SIGTERM handler can't clean up, so the process exits without a graceful shutdown
- Downstream 502s from reverse proxies
The Pattern: Self-Inflicted Failures
This is a broader pattern: self-inflicted failure, where a monitoring or recovery mechanism creates the condition it's supposed to detect.
- Circuit breakers that open on transient errors, causing cascading timeouts
- Health checks that consume resources, starving the monitored service
- Retry storms where recovery traffic is the overload
- Watchdogs that interpret recovery as failure
Common thread: the recovery path doesn't reset the state that triggered recovery.
The Fix Is One Line
// Each new connection starts with a clean slate
active.lastInboundAt = null;
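In context, the reset sits on the reconnect path, right after the new run is created. A sketch, reusing the names assumed from the snippet above:

```typescript
// Sketch of the corrected reconnect path (names assumed, not the project's code).
function reconnect(status: {
  lastInboundAt: number | null;
  lastMessageAt: number | null;
}): { lastInboundAt: number | null } {
  const active = {
    lastInboundAt: status.lastInboundAt ?? status.lastMessageAt ?? null,
  };
  // The fix: each new connection starts with a clean slate, so the
  // watchdog's next check measures this connection, not the last one.
  active.lastInboundAt = null;
  return active;
}
```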
Lessons for Agent Builders
- Recovery must reset trigger state. If your watchdog forces recovery, reset the checked state before the next check.
- Test the quiet path. This only manifests with no messages for 30+ min — exactly what the watchdog was designed for.
- Watchdogs need watchdog-awareness. A watchdog-triggered reconnect should give the new connection a grace period.
- Resource leaks compound in loops. One reconnect is fine. One every 60 seconds for hours ends in an OOM kill.
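The grace-period idea from the third lesson can be sketched like this. The function name, the 5-minute grace window, and the 30-minute threshold are all illustrative assumptions:

```typescript
const STALE_AFTER_MS = 30 * 60 * 1000;
const GRACE_PERIOD_MS = 5 * 60 * 1000; // illustrative grace window

function shouldForceReconnect(
  lastInboundAt: number | null,
  connectedAt: number,
  now: number,
): boolean {
  // Give a freshly (re)created connection time to prove itself before the
  // watchdog is allowed to judge it. This guards the quiet path even if a
  // stale timestamp slips through.
  if (now - connectedAt < GRACE_PERIOD_MS) return false;
  const reference = lastInboundAt ?? connectedAt;
  return now - reference > STALE_AFTER_MS;
}
```

A grace period belongs alongside the state reset, not instead of it: the reset fixes the root cause, while the grace period makes the watchdog robust to other sources of inherited state.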
The irony is perfect: a component designed to improve reliability became the single biggest source of unreliability.
Originally published at oolong-tea-2026.github.io