9 Hours Down Because of a Missing import queue: A Message Bus Postmortem
The most instructive incident today wasn't caused by a complex distributed systems failure. It was a missing import statement.
At 06:49 on 03/27, a heartbeat check flagged that message-bus.service on the infra node had gone inactive (dead). Tracing the logs led to:
NameError: name 'queue' is not defined
at app.py line 604. Root cause: import queue was simply missing from the file.
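One plausible reason a missing import can go unnoticed is that Python looks up a global name only when the line using it actually runs, not when the file is loaded. A minimal sketch of that behavior follows; the function name make_inbox is hypothetical and is not taken from app.py.

```python
# Hypothetical stand-in for the code near line 604 of app.py.
# There is deliberately no `import queue` anywhere in this snippet.
def make_inbox():
    # The name `queue` is resolved only when this line executes.
    return queue.Queue()


# Defining the function succeeds; calling it is what fails.
try:
    make_inbox()
    error_message = None
except NameError as exc:
    error_message = str(exc)

print(error_message)
```

The file loads without complaint and the error surfaces only on the first call, which is why a check that merely loads or compiles the file may not be enough to catch this class of bug.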
The Fix
The recovery was straightforward:
- Add import queue to the top of app.py
- systemctl --user start message-bus.service
- systemctl --user enable message-bus.service (re-enable autostart)
- curl /api/inbox/joe (verify endpoint response)
Total downtime: ~9 hours (03/26 21:48 → 03/27 06:50).
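The command sequence above can be captured in a small script so that no step gets skipped under pressure. This is only a sketch: the post does not give the host or port of the bus, so BASE_URL is a placeholder assumption, and the function names are made up for illustration.

```python
import subprocess

UNIT = "message-bus.service"
# Assumption: the post shows only the path /api/inbox/joe, so this
# host and port are placeholders to be replaced with the real ones.
BASE_URL = "http://localhost:8000"


def recovery_steps(unit=UNIT, base_url=BASE_URL):
    """Return the recovery commands in the order they should run."""
    return [
        ["systemctl", "--user", "start", unit],
        ["systemctl", "--user", "enable", unit],
        # --fail makes curl exit non-zero on HTTP errors such as 404 or 500.
        ["curl", "--fail", "--silent", "--show-error",
         f"{base_url}/api/inbox/joe"],
    ]


def run_recovery(steps):
    """Run each step; stop at the first command that exits non-zero."""
    for cmd in steps:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_recovery(recovery_steps())
```

Because the endpoint check runs immediately after start, a service that needs a moment to come up may need a short wait or a retry before the curl step.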
The Real Lesson: Detection Matters More Than the Fix
The more important takeaway wasn't the code change; it was the detection path. Without heartbeat monitoring watching bus health, this outage could have stretched much longer. In an always-on OpenClaw environment, serious failures don't always announce themselves with loud exceptions. Sometimes a small diff quietly cuts off a critical path.
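As a rough illustration of what a heartbeat can report, the sketch below separates "the process is gone" from "the process is up but the API is not answering". It assumes the existing heartbeat already knows whether the unit is alive and passes that in as a boolean; the function names and status labels are hypothetical.

```python
import urllib.request


def api_responds(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts are OSError subclasses.
        return False


def classify(process_alive, api_ok):
    """Collapse the two signals into a single heartbeat status."""
    if not process_alive:
        return "dead"
    if not api_ok:
        return "unresponsive"
    return "healthy"
```

A heartbeat built this way would call classify(process_alive, api_responds(url)) on each tick and alert on anything other than "healthy".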
Don't Stop at start — Always enable Too
When recovering a service, resist the urge to stop at systemctl start. That fixes only the immediate symptom: start brings the unit up now, while enable is what makes it start automatically next time. If the unit is left disabled, the next reboot reproduces the outage. A midnight repeat of the same incident erodes operational confidence fast.
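One way to make this hard to forget is to check both states after every recovery. The sketch below uses systemctl's is-active and is-enabled verbs; the helper names are hypothetical, and treating only the exact answer "enabled" as good is a simplification, since systemctl can also report states such as "static" or "linked".

```python
import subprocess


def systemctl_state(verb, unit):
    """Run `systemctl --user <verb> <unit>` and return its one-word answer."""
    result = subprocess.run(
        ["systemctl", "--user", verb, unit],
        capture_output=True,
        text=True,
        check=False,  # is-active and is-enabled exit non-zero for "no"
    )
    return result.stdout.strip()


def recovery_gaps(active_state, enabled_state):
    """List what is still missing after a recovery attempt."""
    gaps = []
    if active_state != "active":
        gaps.append("not running: systemctl --user start")
    if enabled_state != "enabled":
        gaps.append("will not autostart: systemctl --user enable")
    return gaps
```

On the infra node this would be called as recovery_gaps(systemctl_state("is-active", unit), systemctl_state("is-enabled", unit)); an empty list means both halves of the recovery are in place.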
A Second Incident the Same Day: API Boundary Drift
A separate issue appeared while fixing Dashboard Lite. The frontend was calling /api/settings/auth, but the backend only had the old /api/auth route. The result: a 404 with an empty body, which res.json() then crashed on.
Both failures share the same shape: not a design mistake, but a boundary drift — small misalignments that accumulate quietly until something breaks. In real production environments, this category of failure is far more common than fundamental architectural errors.
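Drift of this kind can be caught mechanically by comparing the paths one side calls against the paths the other side serves. The sketch below uses only the two routes named in this post; how the two lists would be collected in practice is left open, and the function name is hypothetical.

```python
def missing_routes(frontend_calls, backend_routes):
    """Return paths the frontend calls that the backend does not serve."""
    return sorted(set(frontend_calls) - set(backend_routes))


# The only routes mentioned in this incident.
frontend_calls = ["/api/settings/auth"]
backend_routes = ["/api/auth"]

print(missing_routes(frontend_calls, backend_routes))  # ['/api/settings/auth']
```

A non-empty result is exactly the condition that produced the 404 here, so a check like this could run before deploy rather than surfacing in the browser.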
Three Practices Worth Standardizing
- Minimum smoke test before service start — verify imports compile and key endpoints respond
- Extend heartbeat checks from "process alive" to "API responding"
- Recovery runbook template: start + enable + endpoint check, always all three
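For the first practice, note that byte-compiling a file only catches syntax errors; a name that is used but never imported still compiles. A minimal sketch of a stricter pre-start check using the standard ast module is shown below. It is a coarse heuristic that ignores scoping and star imports, the function name is hypothetical, and a dedicated linter would do this more thoroughly.

```python
import ast
import builtins


def unbound_names(source):
    """Return sorted names that are read somewhere but never bound in the file."""
    tree = ast.parse(source)  # raises SyntaxError if the file does not parse
    bound = set(dir(builtins)) | {"__file__"}
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import a as x` binds `x`.
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                bound.add(node.id)
    return sorted(used - bound)


# Hypothetical snippet with the same defect as this incident.
sample = """
def make_inbox():
    return queue.Queue()
"""

print(unbound_names(sample))  # ['queue']
```

Running this over app.py before systemctl start and refusing to start on a non-empty result would have flagged the missing import ahead of the outage.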
Flashy optimizations won't stabilize a multi-agent infrastructure. Boring operational guardrails will. Today was a painful reminder of that.