Fixing the Agent Wake Notification Bug: Unifying to Slack Across All Agents
TL;DR
- Our home server cluster runs 16 AI agents, and the message bus was intermittently returning HTTP 500 errors
- Root cause: `wake_agent()` hardcoded Telegram notifications for all agents, failing for those on Slack
- Fix: Added channel info to the agent registry and branched notification logic by channel type
- Bonus: Discovered and registered 6 previously missing agents in the process
Background: 16 Agents, 2 Notification Channels
Our home server cluster runs 16 AI Agents. Joe (overall manager), Jack (personal domain coordinator), and a team of domain-specialist agents handle everything from investment tracking to health management.
Agent-to-agent communication goes through a custom message bus (Python Flask, running at http://192.168.x.x:8091). Agents periodically poll their inbox and reply when needed.
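As a rough sketch of that poll cycle (the `/api/inbox` endpoint name and the response shape are assumptions for illustration, not the bus's documented API):

```python
import json
import urllib.request

BUS_URL = "http://192.168.x.x:8091"  # message bus address from above

def poll_inbox(agent_id: str, token: str) -> list:
    """Fetch pending messages for one agent.

    The /api/inbox endpoint and the {"messages": [...]} response shape
    are hypothetical; the real bus API may differ.
    """
    req = urllib.request.Request(
        f"{BUS_URL}/api/inbox?agent={agent_id}",
        headers={"X-Bus-Token": token},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read()).get("messages", [])
```

Each agent runs something like this on a timer and answers any messages it finds.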
The wrinkle: some agents run on Slack, others on Telegram.
| Agent Group | Notification Channel |
|---|---|
| joe, jack | Slack |
| pi4 | Slack |
| Most others | Telegram |
The Symptom
During morning health checks, we noticed HTTP 500s coming back from the message bus when sending messages to certain agents.
```
POST /api/send
→ 500 Internal Server Error
→ Log: "Telegram sendMessage failed: chat not found"
```
Messages to specific agents consistently failed — and worse, the error was propagating and halting bus-wide processing.
Root Cause
Looking at app.py on the message bus server, the wake_agent() function looked like this:
```python
def wake_agent(agent_id: str):
    """Wake an agent by sending a notification about new messages"""
    telegram_chat_id = AGENT_CHAT_MAP.get(agent_id)
    if telegram_chat_id:
        send_telegram_message(telegram_chat_id, f"📬 New message: {agent_id}")
```
Telegram-only, for every agent. When this tried to ping joe, jack, or pi4 via Telegram (they're on Slack), it failed — hard.
Checking AGENT_CHAT_MAP also revealed that 6 recently added agents weren't registered at all:

```
Missing from registry: pi4, windows, monitor, shadow, erp-chat-bridge, learn-infra-05
```
The Fix
1. Add channel info to the agent registry
```python
AGENT_REGISTRY = {
    "joe": {"channel": "slack", "slack_channel": "#joe"},
    "jack": {"channel": "slack", "slack_channel": "#jack"},
    "pi4": {"channel": "slack", "slack_channel": "#pi4"},
    "health": {"channel": "telegram", "chat_id": "..."},
    "investment": {"channel": "telegram", "chat_id": "..."},
    # ... all 16 agents
}
```
2. Branch wake_agent() by channel
```python
def wake_agent(agent_id: str):
    info = AGENT_REGISTRY.get(agent_id, {})
    channel = info.get("channel", "telegram")
    if channel == "slack":
        slack_channel = info.get("slack_channel", f"#{agent_id}")
        send_slack_message(slack_channel, f"📬 New message: {agent_id}")
    else:
        chat_id = info.get("chat_id")
        if chat_id:
            send_telegram_message(chat_id, f"📬 New message: {agent_id}")
```
3. Register the missing agents
Added all 6 missing agents to AGENT_REGISTRY. pi4 was registered as Slack; the other five as Telegram-based.
Verification
```bash
# Check registered agents
curl -s http://192.168.x.x:8091/api/agents \
  -H "X-Bus-Token: <TOKEN>" | jq '.agents[].id'

# Send test message to Slack-based agent
curl -X POST http://192.168.x.x:8091/api/send \
  -H "Content-Type: application/json" \
  -H "X-Bus-Token: <TOKEN>" \
  -d '{"from":"test","to":"jack","subject":"test","body":"wake test"}'
```

→ 200 OK ✅
HTTP 500s gone. Slack agents now get Slack pings; Telegram agents get Telegram pings.
Lessons Learned
1. Don't hardcode notification channels in infrastructure code
The notification channel an agent uses is configuration, not implementation. What worked when everything was on Telegram broke the moment we introduced Slack-based agents. Keep channel mappings in a registry, not in logic.
2. New agent = bus registration required
Every time a new agent is added, registering it on the message bus should be a mandatory step — not an afterthought. We missed 6 in a row. Time to add it to the agent-launch checklist.
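One cheap guard is a startup check that diffs the launch list against the registry. A sketch with illustrative data only; in practice `KNOWN_AGENTS` would come from wherever agents are actually launched:

```python
def find_unregistered(known_agents, registry):
    """Return agent ids that are launched but missing from the bus registry."""
    return sorted(set(known_agents) - set(registry))

# Illustrative data only; the real lists live in the launch config / registry.
KNOWN_AGENTS = ["joe", "jack", "pi4", "health", "investment", "monitor"]
AGENT_REGISTRY = {"joe": {}, "jack": {}, "health": {}, "investment": {}}

missing = find_unregistered(KNOWN_AGENTS, AGENT_REGISTRY)
print(missing)  # → ['monitor', 'pi4']
```

Running this at bus startup (or in the health check) would have flagged the 6 missing agents immediately.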
3. Wake failures shouldn't propagate as 500s
If wake_agent() fails, it should log and move on — not kill the whole request. Notification failure and message delivery failure are different things. Defensive error handling matters at the infrastructure layer.
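One way to sketch that separation (a hypothetical wrapper, not the bus's actual code): run the wake call behind a guard that logs failures and swallows them, so the stored message still counts as delivered:

```python
import logging

logger = logging.getLogger("message_bus")

def safe_wake(wake_fn, agent_id: str) -> bool:
    """Run a wake callback; log failures instead of raising, so the
    surrounding /api/send request can still return 200."""
    try:
        wake_fn(agent_id)
        return True
    except Exception:
        logger.exception("wake failed for %s (message already stored)", agent_id)
        return False

def broken_wake(agent_id: str):
    # Simulates the original failure mode.
    raise RuntimeError("Telegram sendMessage failed: chat not found")

print(safe_wake(broken_wake, "jack"))  # → False
```

With this in place, a dead chat_id degrades to a missed ping instead of a bus-wide 500.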
Conclusion
In a multi-agent environment, configuration drift is inevitable as components multiply. This bug was a textbook example: a single hardcoded string broke the notification layer for a third of the fleet.
Any infrastructure-level change — even something as small as "we added Slack" — needs a system-wide audit pass. That habit would have caught this much sooner.