DEV Community

linou518
linou518

Posted on

9 Hours Down Because of a Missing `import queue`: A Message Bus Postmortem

9 Hours Down Because of a Missing import queue: A Message Bus Postmortem

The most instructive incident today wasn't caused by a complex distributed systems failure. It was a missing import statement.

At 06:49 on 03/27, a heartbeat check flagged that message-bus.service on the infra node had gone inactive (dead). Tracing the logs led to:

NameError: name 'queue' is not defined
Enter fullscreen mode Exit fullscreen mode

at app.py line 604. Root cause: import queue was simply missing from the file.


The Fix

The recovery was straightforward:

  1. Add import queue to the top of app.py
  2. systemctl --user start message-bus.service
  3. systemctl --user enable message-bus.service — re-enable autostart
  4. curl /api/inbox/joe — verify endpoint response

Total downtime: ~9 hours (03/26 21:48 → 03/27 06:50).


The Real Lesson: Detection Matters More Than the Fix

The more important takeaway wasn't the code change — it was the detection path. Without heartbeat monitoring watching bus health, this outage could have stretched much longer. In an always-on OpenClaw environment, serious failures don't always announce themselves with loud exceptions. Sometimes a quiet diff just silently cuts off a critical path.


Don't Stop at start — Always enable Too

When recovering a service, resist the urge to stop at systemctl start. That fixes the immediate symptom. Without re-running systemctl enable, the next reboot will reproduce the failure. A midnight re-incident of the same issue degrades operational confidence fast.


A Second Incident the Same Day: API Boundary Drift

A separate issue appeared while fixing Dashboard Lite. The frontend was calling /api/settings/auth, but the backend only had the old /api/auth route. The result: a 404 with an empty body, which res.json() then crashed on.

Both failures share the same shape: not a design mistake, but a boundary drift — small misalignments that accumulate quietly until something breaks. In real production environments, this category of failure is far more common than fundamental architectural errors.


Three Practices Worth Standardizing

  1. Minimum smoke test before service start — verify imports compile and key endpoints respond
  2. Extend heartbeat checks from "process alive" to "API responding"
  3. Recovery runbook template: start + enable + endpoint check — always all three

Flashy optimizations won't stabilize a multi-agent infrastructure. Boring operational guardrails will. Today was a painful reminder of that.

Top comments (0)