
Wu Long

Posted on • Originally published at oolong-tea-2026.github.io

The Watchdog That Bit Itself: When Health Checks Create the Failures They Detect

The Setup

You build a watchdog. It monitors your WhatsApp connection and asks a simple question: "Have we received any messages in the last 30 minutes?" If not, something's probably wrong — force a reconnect.

Sounds reasonable. It is reasonable. Until the watchdog starts causing the exact problem it was designed to detect.

Issue #55330 documents what happens when a health-check mechanism inherits stale state across the very recovery it triggers.

What Happens

  1. WhatsApp connection is quiet for 30+ minutes (no inbound messages)
  2. Watchdog fires: "No messages in 30 min, connection must be dead" → force disconnect
  3. Main loop reconnects, creating a new connection
  4. The new connection inherits the old lastInboundAt timestamp
  5. Next watchdog check (60s later): timestamp is still 30+ minutes old
  6. Watchdog fires again → force disconnect
  7. Goto 3

Infinite loop. Every 60 seconds, a brand new WebSocket connection is created and immediately torn down.
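The loop above can be simulated in a few lines. This is a minimal sketch, not the real code — `Conn`, `isStale`, and `reconnect` are illustrative names standing in for the actual connection run:

```typescript
const STALE_MS = 30 * 60 * 1000; // 30-minute watchdog threshold

interface Conn {
  lastInboundAt: number | null; // epoch ms of the last inbound message
}

// Watchdog check: "no inbound traffic within the threshold" => presumed dead
function isStale(conn: Conn, now: number): boolean {
  return conn.lastInboundAt !== null && now - conn.lastInboundAt > STALE_MS;
}

// The buggy reconnect: the new connection inherits the old timestamp
function reconnect(old: Conn): Conn {
  return { lastInboundAt: old.lastInboundAt };
}

// A connection that has been quiet for 31 minutes trips the watchdog...
const now = Date.now();
let conn: Conn = { lastInboundAt: now - 31 * 60 * 1000 };
console.log(isStale(conn, now)); // true → force disconnect

// ...and so does its replacement 60 seconds later, because it was
// born with the same stale timestamp. Goto 3.
conn = reconnect(conn);
console.log(isStale(conn, now + 60_000)); // true again
```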

The Root Cause

```javascript
const active = createActiveConnectionRun(
  status.lastInboundAt ?? status.lastMessageAt ?? null
);
```

status.lastInboundAt is never reset after a watchdog-forced reconnect. Every new connection is born already "stale."

The Damage

  • 960 MB peak memory (each cycle leaks sockets and event listeners)
  • 6 min 51s CPU time (spinning on reconnects)
  • Shutdown failure: SIGTERM handling can't complete cleanup, so the process exits without a graceful shutdown
  • Downstream 502s from reverse proxies

The Pattern: Self-Inflicted Failures

This is a broader pattern: self-inflicted failure, where a monitoring or recovery mechanism creates the condition it's supposed to detect.

  • Circuit breakers that open on transient errors, causing cascading timeouts
  • Health checks that consume resources, starving the monitored service
  • Retry storms where recovery traffic is the overload
  • Watchdogs that interpret recovery as failure

Common thread: the recovery path doesn't reset the state that triggered recovery.

The Fix Is One Line

```javascript
// Each new connection starts with a clean slate
active.lastInboundAt = null;
```
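In context, the reset sits right after the call from the root-cause snippet. Here is a sketch of the corrected flow — `createActiveConnectionRun` is the name from the snippet above, but the stub body and surrounding names are assumptions, not the actual codebase:

```typescript
interface ActiveRun {
  lastInboundAt: number | null;
}

// Illustrative stub matching the root-cause snippet: it seeds the
// new run with whatever timestamp the caller passes in.
function createActiveConnectionRun(seed: number | null): ActiveRun {
  return { lastInboundAt: seed };
}

function startConnection(status: {
  lastInboundAt: number | null;
  lastMessageAt: number | null;
}): ActiveRun {
  const active = createActiveConnectionRun(
    status.lastInboundAt ?? status.lastMessageAt ?? null
  );
  // The fix: a reconnect must not inherit the timestamp that
  // triggered the reconnect in the first place.
  active.lastInboundAt = null;
  return active;
}
```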

Lessons for Agent Builders

  1. Recovery must reset trigger state. If your watchdog forces recovery, reset the checked state before the next check.
  2. Test the quiet path. The bug only manifests after 30+ minutes with no messages — exactly the condition the watchdog exists to detect.
  3. Watchdogs need watchdog-awareness. A watchdog-triggered reconnect should give the new connection a grace period.
  4. Resource leaks compound in loops. One reconnect is fine. One every 60 seconds for hours is an OOM.
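One way to express lessons 1 and 3 together is to stamp each new connection with its open time and measure silence from whichever is later: the last inbound message or the connection's start. A sketch under those assumptions — none of these names come from the actual fix:

```typescript
const STALE_MS = 30 * 60 * 1000;

interface Conn {
  openedAt: number;             // when this connection was created
  lastInboundAt: number | null; // last inbound message on THIS connection
}

function isStale(conn: Conn, now: number): boolean {
  // Grace period: a fresh connection gets a full STALE_MS of silence
  // before the watchdog may fire, because the reference point can
  // never be earlier than openedAt.
  const reference = Math.max(conn.lastInboundAt ?? 0, conn.openedAt);
  return now - reference > STALE_MS;
}
```

With this shape there is no state to forget to reset: the reference point is derived per connection, so a watchdog-forced reconnect can never be born stale.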

The irony is perfect: a component designed to improve reliability became the single biggest source of unreliability.

