linou518

Posted on Mar 28

9 Hours Down Because of a Missing `import queue`: A Message Bus Postmortem

#openclaw #ai #dashboard

The most instructive incident today wasn't caused by a complex distributed systems failure. It was a missing import statement.

At 06:49 on 03/27, a heartbeat check flagged that `message-bus.service` on the infra node had gone `inactive (dead)`. Tracing the logs led to:

NameError: name 'queue' is not defined

at `app.py` line 604. Root cause: `import queue` was simply missing from the file.
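A `NameError` for a missing import is raised only when the offending line actually runs, which may be one reason a change like this can slip past a quick syntax check. The sketch below is an illustration under assumptions, not the real `app.py`: the function name `make_inbox` is hypothetical, and the post does not say whether line 604 runs at import time or inside a handler.

```python
# Illustrative source only; NOT the real app.py. `make_inbox` is a made-up name.
BROKEN_SOURCE = """
def make_inbox():
    return queue.Queue()  # 'queue' is used but never imported
"""

# Same code with the one-line fix applied at the top.
FIXED_SOURCE = "import queue\n" + BROKEN_SOURCE


def load(source):
    """Compile and execute source in a fresh namespace, much like importing a module."""
    namespace = {}
    exec(compile(source, "<app>", "exec"), namespace)
    return namespace


def try_make_inbox(source):
    """Return None on success, or the NameError message raised when the line runs."""
    module = load(source)  # succeeds for both versions: the function body has not run yet
    try:
        module["make_inbox"]()
    except NameError as exc:
        return str(exc)
    return None


broken_error = try_make_inbox(BROKEN_SOURCE)
fixed_error = try_make_inbox(FIXED_SOURCE)
print(broken_error)
print(fixed_error)
```

Both versions compile and load cleanly; only calling the function exposes the missing name. A check that stops at "does the file compile" would therefore not catch this class of error.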

The Fix

The recovery was straightforward:

1. Add `import queue` to the top of `app.py`.
2. `systemctl --user start message-bus.service` to bring the service back up.
3. `systemctl --user enable message-bus.service` to re-enable autostart.
4. `curl /api/inbox/joe` to verify the endpoint responds.

Total downtime: ~9 hours (03/26 21:48 → 03/27 06:50).

The Real Lesson: Detection Matters More Than the Fix

The more important takeaway was not the code change but the detection path. Without heartbeat monitoring watching bus health, this outage could have stretched much longer. In an always-on OpenClaw environment, serious failures don't always announce themselves with loud exceptions. Sometimes a small, unremarkable diff quietly cuts off a critical path.

Don't Stop at `start` — Always `enable` Too

When recovering a service, resist the urge to stop at `systemctl start`. That fixes only the immediate symptom. If the unit is no longer enabled, it will not come back after the next reboot, so run `systemctl enable` as well. A midnight repeat of the same incident erodes operational confidence fast.
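The start-plus-enable habit can be encoded so that neither step is forgotten. The following is a minimal sketch, assuming Python is available on the node. The function names are hypothetical, and the trailing `is-active` and `is-enabled` checks are an addition that the post does not mention.

```python
import subprocess

UNIT = "message-bus.service"  # unit name taken from the post


def recovery_commands(unit=UNIT):
    """Commands for a full recovery: start now, enable for future boots, then verify."""
    return [
        ["systemctl", "--user", "start", unit],
        ["systemctl", "--user", "enable", unit],
        ["systemctl", "--user", "is-active", unit],   # exits non-zero if not running
        ["systemctl", "--user", "is-enabled", unit],  # exits non-zero if not enabled
    ]


def recover(unit=UNIT, run=subprocess.run):
    """Run each command in order and stop at the first failure.

    `run` is injectable so the sequence can be dry-run without systemd.
    Returns the list of commands that completed successfully.
    """
    completed = []
    for command in recovery_commands(unit):
        result = run(command, check=False)
        if result.returncode != 0:
            break
        completed.append(command)
    return completed
```

Calling `recover()` on the node would run all four commands. A result shorter than four entries shows which step failed first.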

A Second Incident the Same Day: API Boundary Drift

A separate issue appeared while fixing Dashboard Lite. The frontend was calling `/api/settings/auth`, but the backend only had the old `/api/auth` route. The result: a 404 with an empty body, and `res.json()` then threw while trying to parse it.
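The crash came from parsing a body that was never JSON. The Dashboard Lite frontend presumably runs in JavaScript, so the sketch below is not its code. It shows the same guard in Python, with a hypothetical helper name: check the status first, then the body, and only then parse.

```python
import json


def parse_api_response(status, body):
    """Return (data, error). Never raises on an empty or non-JSON body."""
    if not 200 <= status < 300:
        return None, f"HTTP {status}"
    if not body.strip():
        return None, "empty response body"
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"


# The incident's shape: a 404 with nothing in the body.
data, error = parse_api_response(404, "")
print(data, error)
```

With this ordering, a missing route surfaces as a readable "HTTP 404" rather than as a parse exception far from the real cause.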

Both failures share the same shape: not a design mistake, but a boundary drift — small misalignments that accumulate quietly until something breaks. In real production environments, this category of failure is far more common than fundamental architectural errors.

Three Practices Worth Standardizing

1. Minimum smoke test before service start: verify that imports resolve and key endpoints respond.
2. Extend heartbeat checks from "process alive" to "API responding".
3. Recovery runbook template: start + enable + endpoint check, always all three.
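An "API responding" check can be sketched with the standard library alone. This is an assumption-laden example, not the heartbeat used in the incident: the function name is hypothetical, and it assumes the probed endpoint returns a JSON body on success.

```python
import http.client
import json
import urllib.request


def api_is_responding(url, timeout=3.0):
    """True only if the endpoint answers 2xx with a body that parses as JSON."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if not 200 <= response.status < 300:
                return False
            json.loads(response.read().decode("utf-8"))
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers connection failures, timeouts, and urllib's HTTPError (e.g. 404);
        # ValueError covers undecodable or non-JSON bodies.
        return False
```

A heartbeat could call `api_is_responding` against the same endpoint used in the recovery check. The host and port are left out here because the post does not give them. A process that is alive but failing on every request would report unhealthy under this check, which a "process alive" probe would miss.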

Flashy optimizations won't stabilize a multi-agent infrastructure. Boring operational guardrails will. Today was a painful reminder of that.

DEV Community
