TL;DR:
- Direct agent-to-agent calls create distributed monoliths. Use a message bus.
- Big agent tasks hallucinate. Small sequential spawns with review between each are faster.
- Your 2GB server can't run Chromium. Match workloads to hardware or watch things OOM.
- Shared knowledge + private memory. Not everything belongs in the same bucket.
- Agents go down. Build for it.
I run 8 AI agents across 3 machines: a $15/month EC2, a Mac Mini, and a WSL2 workstation with a GPU. They handle QA, voice AI, ad creative, knowledge management, and interview analysis.
After two months of things breaking in creative ways, here are the coordination patterns that survived contact with reality.
1. Message Bus Over Direct Calls
My first architecture: Agent A calls Agent B's endpoint. Agent B needs context from Agent C. Agent C is offline.
Cascading failure. Everything dies.
The fix was embarrassingly simple — a shared message bus. We built Quoth, a multi-agent knowledge platform. Agents publish messages to a shared channel. Other agents subscribe to what they care about. Messages persist until acknowledged.
```
agent:main       → bus: "New interview starting, candidate-123" (priority: high)
agent:interviews → bus: (subscribes, picks it up when ready)
agent:main       → bus: "Run QA on PR #234" (priority: normal)
agent:attqa      → bus: (offline, picks it up 3 hours later)
```
Why it works: Agents don't need to know each other's APIs, endpoints, or even if they're online. The bus decouples everything. An agent can be down for hours and catch up when it comes back.
The gotcha: You need priority levels. Without them, a low-priority "update docs" message blocks a high-priority "production is broken" alert.
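Quoth's internals aren't shown here, but the publish/subscribe-with-priorities behavior can be sketched as a minimal in-memory queue. Names like `MessageBus`, `publish`, and `pull` are illustrative, not Quoth's actual API:

```python
import heapq
import itertools

class MessageBus:
    """Minimal in-memory sketch: one priority queue per topic.
    Messages sit in the queue until a subscriber pulls them."""
    PRIORITY = {"high": 0, "normal": 1, "low": 2}

    def __init__(self):
        self._queues = {}               # topic -> heap of (prio, seq, msg)
        self._seq = itertools.count()   # tie-breaker: FIFO within a priority

    def publish(self, topic, msg, priority="normal"):
        heap = self._queues.setdefault(topic, [])
        heapq.heappush(heap, (self.PRIORITY[priority], next(self._seq), msg))

    def pull(self, topic):
        """Return the highest-priority pending message, or None."""
        heap = self._queues.get(topic)
        if not heap:
            return None
        return heapq.heappop(heap)[2]

bus = MessageBus()
bus.publish("qa", "update docs", priority="low")
bus.publish("qa", "production is broken", priority="high")
bus.pull("qa")  # the high-priority alert comes out first
```

A real bus also needs persistence and acknowledgments so an offline agent can catch up hours later; this sketch only shows why the priority ordering matters.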
2. Spawn Small, Review, Repeat
I built a swarm skill that parallelizes work across sub-agents. First attempt: "Refactor all 6 modules and update the test suite."
The result was a mess. Agents hallucinated imports that didn't exist, created circular dependencies, and duplicated work. More than 5-6 parallel agents doesn't improve output — it degrades it.
The pattern that works:
```
spawn("Refactor module A: extract shared utils")
  → review output
spawn("Refactor module B: use new shared utils from A")
  → review output
spawn("Update tests for A and B")
  → review output
```
Sequential. Small. Reviewed between each step.
It feels slower. It's not. Each spawn gets complete, updated context. No mid-flight corrections. No "wait, also change this" messages that may or may not arrive before the agent finishes.
The gotcha: If requirements change while an agent is running, let it finish. Review. Spawn a new one with the updated requirements. Trying to steer a running agent is unreliable.
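The loop above can be sketched in a few lines. Here `spawn` and `review` are stand-in callables for your agent runner and your review step, not a real API; `review` returns `None` when the output passes, or feedback notes when it doesn't:

```python
def run_pipeline(tasks, spawn, review):
    """Run tasks one at a time; each spawn sees the reviewed
    output of everything before it."""
    context = []
    for task in tasks:
        output = spawn(task, context=context)      # small, focused prompt
        feedback = review(output)
        if feedback is not None:                   # review failed: respawn
            output = spawn(task, context=context + [feedback])
        context.append(output)                     # next spawn gets updated context
    return context
```

Note that a requirements change never steers a running spawn: it just becomes part of the context for the next one.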
3. Match Hardware to Workload
This one cost me an afternoon and a crashed EC2 instance.
Chromium takes ~500MB of RAM. My EC2 has 1.9GB total. I ran a Playwright script to take screenshots. Two browser instances later, the OOM killer nuked everything — including the main gateway process. All 3 agents on that node went dark.
Now each node has an explicit role:
- EC2 (2GB RAM): Orchestration, text processing, API calls. Never a browser.
- Mac Mini: Browser automation, development workflows, QA testing.
- WSL2 + RTX 3080: GPU inference, image generation, heavy Playwright jobs.
```python
# Every script that uses Playwright starts with this
import os
import sys

# Total physical RAM in GB (Linux/macOS, via POSIX sysconf)
mem_gb = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024**3)
if mem_gb < 4:
    print("Not enough RAM for browser automation. Aborting.")
    sys.exit(1)
```
Crude but effective. I haven't crashed a node since.
The gotcha: It's not just RAM. CPU matters for video processing, disk I/O matters for large model files, and network latency matters for real-time voice. Profile your workloads, don't just count gigabytes.
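One way to make those node roles explicit is a small routing table checked before dispatch. The specs and the `pick_node` helper below are illustrative, not my actual fleet config:

```python
NODES = {
    # illustrative specs; profile your own hardware
    "ec2":  {"ram_gb": 2,  "gpu": False, "browser_ok": False},
    "mac":  {"ram_gb": 16, "gpu": False, "browser_ok": True},
    "wsl2": {"ram_gb": 32, "gpu": True,  "browser_ok": True},
}

def pick_node(needs_gpu=False, needs_browser=False, min_ram_gb=1):
    """Return the first node that satisfies the workload's requirements."""
    for name, spec in NODES.items():
        if needs_gpu and not spec["gpu"]:
            continue
        if needs_browser and not spec["browser_ok"]:
            continue
        if spec["ram_gb"] < min_ram_gb:
            continue
        return name
    raise RuntimeError("no node can run this workload")

pick_node(needs_browser=True, min_ram_gb=4)   # -> "mac", never the EC2
pick_node(needs_gpu=True)                     # -> "wsl2"
```

Extending the spec dicts with CPU, disk, and latency fields is where the "don't just count gigabytes" advice actually lands.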
4. Shared Knowledge, Private Memory
Every agent needs memory. The question is: what gets shared?
Wrong approach: everything in one database. My QA agent's test failure patterns mixed with the ad pipeline's lead scoring data. Searches returned irrelevant noise.
The split that works:
- Shared knowledge (Quoth): Architecture decisions, API contracts, deployment procedures. Things any agent might need.
- Private memory (local files): Session notes, work-in-progress, agent-specific context. Things only that agent cares about.
Each agent has a MEMORY.md (curated long-term) and daily memory/YYYY-MM-DD.md files (raw logs). The shared knowledge bus handles cross-agent documentation.
The gotcha: Agents will write shared docs from their own perspective. "The deployment process" means something different to the QA agent (run tests → deploy) versus the ad pipeline agent (generate assets → upload → deploy). Shared knowledge needs a review step — don't let agents auto-publish to shared indexes without validation.
5. Design for Agent Downtime
Agents crash. Nodes lose network. Gateways restart. SSH connections drop.
In any given week, at least one of my 8 agents is offline for some period. The Mac goes to sleep. The WSL2 instance loses its network bridge. The EC2 gets rate-limited.
The system can't depend on 100% uptime from any agent. Three rules:
- Messages persist: If an agent is offline, messages queue. When it comes back, it catches up.
- No blocking dependencies: Agent A can request work from Agent B, but A keeps working. If B never responds, A doesn't hang.
- Health checks with alerts: A simple heartbeat (every 30 min). If an agent misses 3 heartbeats, alert. Don't wait for a user to notice.
```python
# Heartbeat check (runs on the orchestrator)
for agent in fleet:
    last_seen = get_last_heartbeat(agent)
    minutes = (now() - last_seen).total_seconds() / 60
    if minutes > 90:  # 3 missed 30-minute heartbeats
        alert(f"{agent.name} hasn't checked in for {minutes:.0f} minutes")
```
The gotcha: "Offline" isn't binary. An agent can respond to heartbeats but be stuck in an error loop, burning tokens on repeated 429 retries. Check for useful activity, not just any activity.
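A sketch of what "useful activity, not just any activity" might look like, assuming each agent record exposes a last-heartbeat timestamp and a list of recent task-completion timestamps (a hypothetical interface, not a real monitoring API):

```python
import time

def is_healthy(agent, window_sec=3600, min_completions=1):
    """An agent stuck in a 429 retry loop still heartbeats,
    so also require at least one completed task in the window."""
    now = time.time()
    alive = now - agent["last_heartbeat"] < 3 * 30 * 60  # 3 missed 30-min beats
    recent_work = sum(
        1 for t in agent["completions"] if now - t < window_sec
    ) >= min_completions
    return alive and recent_work
```

An agent that fails the `recent_work` check while passing `alive` is exactly the token-burning error loop worth alerting on.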
The Honest Part
These patterns weren't invented. They were extracted from failures. The message bus exists because direct calls failed. Small spawns exist because big ones hallucinated. The hardware matching exists because I crashed production.
Two months in, the fleet handles work across QA, voice AI, ad creative, knowledge curation, and interview analysis. The total infrastructure cost is about $20/month. The actual AI inference costs $0 in API keys (Claude Max subscription through OpenClaw).
It's not elegant. But it works.
Building from Argentina. The code, the agents, and the ecosystem are at triqual.dev.