DEV Community

linou518

How We Cut Agent-to-Agent Message Latency from 30 Minutes to 1 Second


TL;DR

We run 19 AI agents across 9 mini-PCs using OpenClaw. Agent-to-agent message delivery was taking up to 30 minutes — we got it down to ~1 second using a lightweight SSE + systemd bridge architecture. Here's how.

The Problem: Heartbeat-Driven Polling

OpenClaw agents are event-driven by design. They respond to user messages instantly — but inter-agent communication is a different story.

In our setup, we run a custom message bus: a simple Flask + Gunicorn HTTP API where agents post messages and recipients poll for them. The polling happens via OpenClaw's cron.wake heartbeat.

The heartbeat interval maxes out at 30 minutes. This means:

  • Agent A posts a message → 0 seconds
  • Agent B's next heartbeat fires → up to 30 minutes later
  • B reads and processes the message → a few more seconds

For real-time coordination tasks, this was a dealbreaker.
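The arithmetic behind that dealbreaker is simple: with a fixed polling interval and messages arriving at random times, the expected extra wait is half the interval, and the worst case is the full interval. A one-liner makes the numbers concrete:

```python
# Extra delivery latency introduced by heartbeat polling, assuming
# messages arrive at uniformly random times within the interval.
HEARTBEAT_INTERVAL_S = 30 * 60  # OpenClaw's maximum heartbeat interval

worst_case_s = HEARTBEAT_INTERVAL_S      # posted just after a heartbeat fired
expected_s = HEARTBEAT_INTERVAL_S / 2    # average under uniform arrivals
```

So even the *average* case adds 15 minutes per hop, and multi-hop agent conversations compound it.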

First Attempt: sessions_send (Didn't Work)

OpenClaw has a sessions_send API for injecting messages directly into another session:

```python
sessions_send(sessionKey="agent:some-agent:main", message="New task for you")
```

This looked perfect — messages delivered instantly! But there was a catch.

sessions_send only works for main/webchat sessions. Our agents primarily run on Telegram sessions. Messages injected this way were silently ignored by the agents.

Back to the drawing board.

The Solution: SSE + bus-watcher Bridge

We flipped the approach: instead of agents polling the bus, the bus pushes events to a lightweight watcher process running on each node.

Architecture

```
[Agent A] → POST /api/send → [Message Bus] → SSE /api/stream
                                                     ↓
                                             [bus-watcher.py]
                                                     ↓
                                           cron.wake(mode=now)
                                                     ↓
                                               [Agent B wakes]
                                                     ↓
                                          heartbeat → GET /api/inbox
                                                     ↓
                                             [Message processed]
```
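For context, here's a minimal sketch of the two bus endpoints the diagram references. The handler shapes and payload fields (`to_agent`, auto-incremented `id`) are assumptions for illustration; the real bus persists messages rather than keeping them in a list:

```python
# Minimal sketch of the bus's send/inbox endpoints (illustrative only).
from flask import Flask, request, jsonify

app = Flask(__name__)
MESSAGES = []  # in-memory store; the real bus persists messages

@app.route('/api/send', methods=['POST'])
def send():
    msg = request.get_json()
    msg['id'] = len(MESSAGES) + 1  # monotonically increasing id for SSE cursors
    MESSAGES.append(msg)
    return jsonify({"id": msg['id']}), 201

@app.route('/api/inbox/<agent>')
def inbox(agent):
    # Everything addressed to this agent; the real bus also marks-as-read
    return jsonify([m for m in MESSAGES if m.get('to_agent') == agent])
```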

Step 1: Add SSE Endpoint to the Message Bus

We added /api/stream to the Flask app — a persistent connection that pushes new messages in real time:

```python
from flask import Response
import json, time

@app.route('/api/stream')
def stream():
    def generate():
        last_id = 0
        while True:
            # get_messages_after: our helper returning messages with id > last_id
            new_msgs = get_messages_after(last_id)
            for msg in new_msgs:
                yield f"data: {json.dumps(msg)}\n\n"
                last_id = msg['id']
            time.sleep(1)
    return Response(generate(), mimetype='text/event-stream')
```

Gotcha — Gunicorn worker count: We initially ran with 2 workers, which caused SSE subscribers to be spread across workers. A message arriving at worker 1 wouldn't reach a subscriber on worker 2. Switching to a single gevent worker fixed this.
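Concretely, the fix was an invocation along these lines (the `bus:app` module path and port are assumptions from our setup; the gevent worker class requires the `gevent` package):

```shell
# One async worker so every SSE subscriber lives in the same process.
gunicorn --worker-class gevent --workers 1 --bind 0.0.0.0:8091 bus:app
```

A single gevent worker handles many concurrent long-lived connections, which is exactly the SSE workload; multiple sync workers are the wrong shape for it.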

Step 2: bus-watcher.py on Each Node

A minimal Python script subscribes to the SSE stream and triggers cron.wake when a message arrives for a local agent:

```python
#!/usr/bin/env python3
"""SSE → cron.wake bridge"""
import urllib.request, json, subprocess

# Agents hosted on this node (placeholder names)
LOCAL_AGENTS = {"agent-a", "agent-b"}

def watch():
    url = "http://192.168.x.x:8091/api/stream"  # internal message bus
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # line-based reading; resp.read() would buffer
            line = line.decode().strip()
            if line.startswith("data:"):
                msg = json.loads(line[5:])
                if msg["to_agent"] in LOCAL_AGENTS:
                    subprocess.run([
                        "openclaw", "cron", "wake",
                        msg["to_agent"], "--mode=now"
                    ])

if __name__ == "__main__":
    watch()
```

Gotcha — urllib buffering: Using resp.read() buffered the stream and events didn't arrive in real time. Switching to readline()-based iteration (iterating over the response object directly) solved it.
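The line-based approach is easy to verify in isolation. This self-contained sketch runs the same `data:` parsing the watcher uses, with `io.BytesIO` standing in for the HTTP response object:

```python
import io, json

def parse_sse(stream):
    """Yield JSON payloads from the 'data:' lines of an SSE byte stream."""
    for raw in stream:  # line-by-line, exactly as the watcher iterates resp
        line = raw.decode().strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):])

# Fake stream: two SSE events, each terminated by a blank line
fake = io.BytesIO(
    b'data: {"id": 1, "to_agent": "researcher"}\n'
    b'\n'
    b'data: {"id": 2, "to_agent": "writer"}\n'
    b'\n'
)
events = list(parse_sse(fake))
```

Because the generator yields per line, each event is available the moment its line arrives, instead of after the connection closes.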

Step 3: systemd Service for Reliability

We deployed bus-watcher.service on every node for auto-start and auto-reconnect:

```ini
[Unit]
Description=Message Bus Watcher
After=network.target

[Service]
ExecStart=/usr/bin/python3 /path/to/bus-watcher.py
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
```
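Rolling it out on a node is the usual systemd routine (paths assumed; adjust for your layout):

```shell
# Install, reload unit files, then start and enable in one step
sudo cp bus-watcher.service /etc/systemd/system/bus-watcher.service
sudo systemctl daemon-reload
sudo systemctl enable --now bus-watcher.service

# Confirm it's running; Restart=always + RestartSec=5 handles reconnects
systemctl status bus-watcher.service
```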

Deployed to all 7 nodes, tested with all 19 agents.

Results

| Metric | Before | After |
| --- | --- | --- |
| Message delivery latency | Up to 30 min | ~1 second |
| Additional infrastructure | None | SSE endpoint + lightweight watcher |
| CPU/Memory overhead | N/A | Nearly zero |
| New dependencies | N/A | None (stdlib only) |

Watching agents respond to each other in real time for the first time was genuinely exciting. Multiple agents firing off replies in rapid succession — it finally felt like a live agent network.

Key Takeaways

  1. Know sessions_send's limits: OpenClaw session injection is channel-aware. It's not a universal delivery mechanism.
  2. SSE is underrated: Far simpler than WebSockets for this use case, and more than sufficient.
  3. Gunicorn + SSE = watch your worker count: Single gevent worker is the right setup for SSE.
  4. urllib buffering bites: For streaming, always iterate line-by-line rather than calling read().
  5. cron.wake --mode=now is powerful: OpenClaw's hidden gem for instant agent activation without waiting for the next heartbeat.

Wrap-Up

You don't need Redis, RabbitMQ, or any heavy message queue to build real-time inter-agent communication. SSE + a few dozen lines of Python got the job done.

In a multi-agent system, communication latency defines the responsiveness of the whole network. The gap between 30 minutes and 1 second isn't just a performance metric — it's the difference between a batch system and a live collaborative agent team.


Tags: #OpenClaw #MultiAgent #SSE #Python #Infrastructure #RealTime #MessageBus #systemd
