The Setup
You build a watchdog. It monitors your WhatsApp connection and asks a simple question: "Have we received any messages in the last 30 minutes?" If not, something's probably wrong — force a reconnect.
Sounds reasonable. It is reasonable. Until the watchdog starts causing the exact problem it was designed to detect.
#55330 documents what happens when a health check mechanism inherits stale state across the recovery it triggers.
What Happens
- WhatsApp connection is quiet for 30+ minutes (no inbound messages)
- Watchdog fires: "No messages in 30 min, connection must be dead" → force disconnect
- Main loop reconnects, creating a new connection
- The new connection inherits the old lastInboundAt timestamp
- Next watchdog check (60s later): the timestamp is still 30+ minutes old
- Watchdog fires again → force disconnect
- Go to step 3
Infinite loop. Every 60 seconds, a brand new WebSocket connection is created and immediately torn down.
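The loop hinges on a staleness check like the following. This is a minimal sketch, not the project's actual code: the names (`isStale`, `connectedAt`) and the 30-minute threshold are assumptions reconstructed from the description above.

```typescript
// Hypothetical sketch of the watchdog's staleness check.
const STALE_AFTER_MS = 30 * 60 * 1000; // 30-minute silence window

function isStale(
  lastInboundAt: number | null,
  connectedAt: number,
  now: number,
): boolean {
  // Fall back to the connection start time when no message has arrived yet.
  const reference = lastInboundAt ?? connectedAt;
  return now - reference > STALE_AFTER_MS;
}

// A connection that inherits a 31-minute-old timestamp is judged stale on
// its very first check, even though it connected seconds ago:
const reconnectedAt = Date.now();
console.log(isStale(reconnectedAt - 31 * 60 * 1000, reconnectedAt, reconnectedAt)); // true
```

Nothing about the check itself is wrong; the problem is entirely in what `lastInboundAt` holds when the check runs against a freshly created connection.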
The Root Cause
const active = createActiveConnectionRun(
status.lastInboundAt ?? status.lastMessageAt ?? null
);
status.lastInboundAt is never reset after a watchdog-forced reconnect. Every new connection is born already "stale."
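Here is a condensed sketch of how the stale value flows through the reconnect path. The `ConnectionStatus` and `ActiveConnectionRun` shapes are assumptions reconstructed from the snippet above, not the project's real types.

```typescript
// Assumed shapes, reconstructed from the snippet above.
interface ConnectionStatus {
  lastInboundAt: number | null;
  lastMessageAt: number | null;
}

interface ActiveConnectionRun {
  lastInboundAt: number | null;
}

function createActiveConnectionRun(lastInboundAt: number | null): ActiveConnectionRun {
  return { lastInboundAt };
}

// The watchdog forced a disconnect because lastInboundAt was 30+ min old...
const status: ConnectionStatus = {
  lastInboundAt: Date.now() - 35 * 60 * 1000,
  lastMessageAt: null,
};

// ...and the reconnect hands that very same timestamp to the new run, so the
// next check (60s later) sees a connection that is already "stale" at birth.
const active = createActiveConnectionRun(
  status.lastInboundAt ?? status.lastMessageAt ?? null
);
```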
The Damage
- 960 MB peak memory (each cycle leaks sockets and event listeners)
- 6 min 51s CPU time (spinning on reconnects)
- Shutdown failure: the SIGTERM handler can't clean up, so the process exits without a graceful shutdown
- Downstream 502s from reverse proxies
The Pattern: Self-Inflicted Failures
This is a broader pattern: self-inflicted failure, where a monitoring or recovery mechanism creates the condition it's supposed to detect.
- Circuit breakers that open on transient errors, causing cascading timeouts
- Health checks that consume resources, starving the monitored service
- Retry storms where recovery traffic is the overload
- Watchdogs that interpret recovery as failure
Common thread: the recovery path doesn't reset the state that triggered recovery.
The Fix Is One Line
// Each new connection starts with a clean slate
active.lastInboundAt = null;
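In context, the reset sits on the reconnect path, right after the new run is created. A sketch, reusing the names assumed from the snippet above:

```typescript
// Sketch of the corrected reconnect path (names assumed, not the project's code).
function reconnect(status: {
  lastInboundAt: number | null;
  lastMessageAt: number | null;
}): { lastInboundAt: number | null } {
  const active = {
    lastInboundAt: status.lastInboundAt ?? status.lastMessageAt ?? null,
  };
  // The fix: each new connection starts with a clean slate, so the
  // watchdog's next check measures this connection, not the last one.
  active.lastInboundAt = null;
  return active;
}
```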
Lessons for Agent Builders
- Recovery must reset trigger state. If your watchdog forces recovery, reset the checked state before the next check.
- Test the quiet path. This only manifests with no messages for 30+ min — exactly what the watchdog was designed for.
- Watchdogs need watchdog-awareness. A watchdog-triggered reconnect should give the new connection a grace period.
- Resource leaks compound in loops. One reconnect is fine. One every 60 seconds for hours ends in an OOM kill.
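The grace-period idea from the third lesson can be sketched like this. The function name, the 5-minute grace window, and the 30-minute threshold are all illustrative assumptions:

```typescript
const STALE_AFTER_MS = 30 * 60 * 1000;
const GRACE_PERIOD_MS = 5 * 60 * 1000; // illustrative grace window

function shouldForceReconnect(
  lastInboundAt: number | null,
  connectedAt: number,
  now: number,
): boolean {
  // Give a freshly (re)created connection time to prove itself before the
  // watchdog is allowed to judge it. This guards the quiet path even if a
  // stale timestamp slips through.
  if (now - connectedAt < GRACE_PERIOD_MS) return false;
  const reference = lastInboundAt ?? connectedAt;
  return now - reference > STALE_AFTER_MS;
}
```

A grace period belongs alongside the state reset, not instead of it: the reset fixes the root cause, while the grace period makes the watchdog robust to other sources of inherited state.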
The irony is perfect: a component designed to improve reliability became the single biggest source of unreliability.
Originally published at oolong-tea-2026.github.io