Fixing the Agent Wake Notification Bug: Unifying to Slack Across All Agents
TL;DR
- Our home server cluster runs 16 AI agents, and the message bus was intermittently returning HTTP 500 errors
- Root cause: `wake_agent()` hardcoded Telegram notifications for all agents, failing for those on Slack
- Fix: Added channel info to the agent registry and branched notification logic by channel type
- Bonus: Discovered and registered 6 previously missing agents in the process
Background: 16 Agents, 2 Notification Channels
Our home server cluster runs 16 AI Agents. Joe (overall manager), Jack (personal domain coordinator), and a team of domain-specialist agents handle everything from investment tracking to health management.
Agent-to-agent communication goes through a custom message bus (Python Flask, running at http://192.168.x.x:8091). Agents periodically poll their inbox and reply when needed.
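As a rough sketch of that poll cycle (the `/api/inbox` endpoint name and the response shape are assumptions for illustration, not the bus's documented API):

```python
import json
import urllib.request

BUS_URL = "http://192.168.x.x:8091"  # message bus address from above

def poll_inbox(agent_id: str, token: str) -> list:
    """Fetch pending messages for one agent.

    The /api/inbox endpoint and the {"messages": [...]} response shape
    are hypothetical; the real bus API may differ.
    """
    req = urllib.request.Request(
        f"{BUS_URL}/api/inbox?agent={agent_id}",
        headers={"X-Bus-Token": token},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read()).get("messages", [])
```

Each agent runs something like this on a timer and answers any messages it finds.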
The wrinkle: some agents run on Slack, others on Telegram.
| Agent Group | Notification Channel |
|---|---|
| joe, jack | Slack |
| pi4 | Slack |
| Most others | Telegram |
The Symptom
During morning health checks, we noticed HTTP 500s coming back from the message bus when sending messages to certain agents.
```
POST /api/send
→ 500 Internal Server Error
→ Log: "Telegram sendMessage failed: chat not found"
```
Messages to specific agents consistently failed — and worse, the error was propagating and halting bus-wide processing.
Root Cause
Looking at app.py on the message bus server, the wake_agent() function looked like this:
```python
def wake_agent(agent_id: str):
    """Wake an agent by sending a notification about new messages"""
    telegram_chat_id = AGENT_CHAT_MAP.get(agent_id)
    if telegram_chat_id:
        send_telegram_message(telegram_chat_id, f"📬 New message: {agent_id}")
```
Telegram-only, for every agent. When this tried to ping joe, jack, or pi4 via Telegram (they're on Slack), it failed — hard.
Checking AGENT_CHAT_MAP also revealed that 6 recently added agents weren't registered at all:

```
Missing from registry: pi4, windows, monitor, shadow, erp-chat-bridge, learn-infra-05
```
The Fix
1. Add channel info to the agent registry
```python
AGENT_REGISTRY = {
    "joe": {"channel": "slack", "slack_channel": "#joe"},
    "jack": {"channel": "slack", "slack_channel": "#jack"},
    "pi4": {"channel": "slack", "slack_channel": "#pi4"},
    "health": {"channel": "telegram", "chat_id": "..."},
    "investment": {"channel": "telegram", "chat_id": "..."},
    # ... all 16 agents
}
```
2. Branch wake_agent() by channel
```python
def wake_agent(agent_id: str):
    info = AGENT_REGISTRY.get(agent_id, {})
    channel = info.get("channel", "telegram")
    if channel == "slack":
        slack_channel = info.get("slack_channel", f"#{agent_id}")
        send_slack_message(slack_channel, f"📬 New message: {agent_id}")
    else:
        chat_id = info.get("chat_id")
        if chat_id:
            send_telegram_message(chat_id, f"📬 New message: {agent_id}")
```
3. Register the missing agents
Added all 6 missing agents to AGENT_REGISTRY. pi4 was registered as Slack; the other five as Telegram-based.
Verification
```bash
# Check registered agents
curl -s http://192.168.x.x:8091/api/agents \
  -H "X-Bus-Token: <TOKEN>" | jq '.agents[].id'

# Send test message to Slack-based agent
curl -X POST http://192.168.x.x:8091/api/send \
  -H "Content-Type: application/json" \
  -H "X-Bus-Token: <TOKEN>" \
  -d '{"from":"test","to":"jack","subject":"test","body":"wake test"}'
```

→ 200 OK ✅
HTTP 500s gone. Slack agents now get Slack pings; Telegram agents get Telegram pings.
Lessons Learned
1. Don't hardcode notification channels in infrastructure code
The notification channel an agent uses is configuration, not implementation. What worked when everything was on Telegram broke the moment we introduced Slack-based agents. Keep channel mappings in a registry, not in logic.
2. New agent = bus registration required
Every time a new agent is added, registering it on the message bus should be a mandatory step — not an afterthought. We missed 6 in a row. Time to add it to the agent-launch checklist.
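One cheap guard is a startup check that diffs the launch list against the registry. A sketch with illustrative data only; in practice `KNOWN_AGENTS` would come from wherever agents are actually launched:

```python
def find_unregistered(known_agents, registry):
    """Return agent ids that are launched but missing from the bus registry."""
    return sorted(set(known_agents) - set(registry))

# Illustrative data only; the real lists live in the launch config / registry.
KNOWN_AGENTS = ["joe", "jack", "pi4", "health", "investment", "monitor"]
AGENT_REGISTRY = {"joe": {}, "jack": {}, "health": {}, "investment": {}}

missing = find_unregistered(KNOWN_AGENTS, AGENT_REGISTRY)
print(missing)  # → ['monitor', 'pi4']
```

Running this at bus startup (or in the health check) would have flagged the 6 missing agents immediately.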
3. Wake failures shouldn't propagate as 500s
If wake_agent() fails, it should log and move on — not kill the whole request. Notification failure and message delivery failure are different things. Defensive error handling matters at the infrastructure layer.
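One way to sketch that separation (a hypothetical wrapper, not the bus's actual code): run the wake call behind a guard that logs failures and swallows them, so the stored message still counts as delivered:

```python
import logging

logger = logging.getLogger("message_bus")

def safe_wake(wake_fn, agent_id: str) -> bool:
    """Run a wake callback; log failures instead of raising, so the
    surrounding /api/send request can still return 200."""
    try:
        wake_fn(agent_id)
        return True
    except Exception:
        logger.exception("wake failed for %s (message already stored)", agent_id)
        return False

def broken_wake(agent_id: str):
    # Simulates the original failure mode.
    raise RuntimeError("Telegram sendMessage failed: chat not found")

print(safe_wake(broken_wake, "jack"))  # → False
```

With this in place, a dead chat_id degrades to a missed ping instead of a bus-wide 500.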
Conclusion
In a multi-agent environment, configuration drift is inevitable as components multiply. This bug was a textbook example: a single hardcoded string broke the notification layer for a third of the fleet.
Any infrastructure-level change — even something as small as "we added Slack" — needs a system-wide audit pass. That habit would have caught this much sooner.