Introduction
We were building CrewAI realtime features: multiple autonomous agents, browser clients, and external integrations exchanging messages with low latency. Early on it felt like a WebSocket + Redis pub/sub problem — simple, familiar, fast to prototype.
Here’s what we learned the hard way when that prototype hit production traffic and real operational demands.
The Trigger
At ~10k concurrent sockets and dozens of agents per session, two things happened quickly:
- Fan‑out latency spiked. A single event broadcast to all participants took hundreds of milliseconds, and sometimes seconds.
- Operational complexity exploded. We had ad‑hoc scripts, sticky sessions, and a fragile pipeline for correlating agent actions into deterministic AI workflows.
Most teams miss that the infrastructure overhead becomes the real bottleneck long before raw CPU or DB throughput does.
What We Tried
We iterated through a few natural implementations, each with its own blind spot.
- Redis pub/sub + single fan‑out worker
  - Simple, with low latency at small scale.
  - Failed when the fan‑out worker became a single point of contention and saturated its CPU and network.
  - Redis pub/sub has no built‑in persistence for missed messages, so reconnect logic was messy.
- Postgres for event logging + polling for missing events
  - Durable, and easy to query for replay and debugging.
  - Introduced unacceptable read amplification and latency on realtime paths.
- Heavy client‑side reconnection and retry logic
  - Pushed complexity into clients and led to subtle race conditions in multi‑agent scenes.
  - Caused state divergence between agents and UI when ordering guarantees weren't strict.
At first, this looked fine… until it wasn't. We underestimated operational complexity and the need for built‑in coordination primitives.
The Architecture Shift
We needed two things to become productive and maintainable:
- A robust realtime messaging layer that handles socket management, pub/sub semantics, and backpressure.
- An orchestration layer for AI workflows and multi‑agent coordination that can trigger side effects reliably.
Technically, we moved to a split‑responsibility model:
- Realtime layer: persistent WebSocket connection management, efficient fan‑out, scaling without sticky sessions, ack/nack semantics.
- Orchestration layer: event correlation, workflow state, deterministic triggers for agent actions.
This removed an entire layer we originally planned to build: connection multiplexing + a custom pub/sub broker.
What Actually Worked
Here are the concrete patterns that survived production usage.
1) Connection sharding by logical tenant + routing table
- Each server instance owns a subset of connections via a consistent hashing ring.
- Routing entries are cheap and replicated through the pub/sub layer so other nodes can route without sticky sessions.
- Benefit: horizontal scale without session affinity at the load balancer.
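To make the ring idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not our production code: the `HashRing` class, the `vnodes` parameter, and the node and tenant names are all hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent hashing ring: maps a tenant key to an owning node."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes   # virtual points per node, smooths the distribution
        self._points = []       # sorted list of (hash, node) tuples
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        # Stable 64-bit hash; Python's built-in hash() is salted per process.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._points, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._points = [(h, n) for h, n in self._points if n != node]

    def owner(self, key):
        if not self._points:
            raise LookupError("ring has no nodes")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._points, (self._hash(key),))
        return self._points[idx % len(self._points)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("tenant-42"))
```

The property that matters operationally is that removing a node only reassigns the keys that node owned; every other tenant keeps its owner, so a deploy or crash does not reshuffle the whole fleet.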
2) Event metadata and idempotency tokens
- Every event carries a lightweight UUID, sequence number, and causality metadata.
- Receivers dedupe and apply idempotent handlers — crucial when retries occur or when an AI agent triggers the same action multiple times.
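A receiver along these lines could look like the sketch below. The `Event` fields and the `IdempotentReceiver` class are hypothetical names chosen for illustration, and the in-memory `set` stands in for whatever bounded or expiring store a real deployment would need.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Event:
    stream: str    # hypothetical: the session or agent the event belongs to
    seq: int       # per-stream sequence number, starting at 1
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class IdempotentReceiver:
    """Applies each event at most once and ignores stale sequence numbers."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()     # event IDs already applied (unbounded in this sketch)
        self._last_seq = {}    # stream -> highest sequence number applied

    def receive(self, event):
        if event.event_id in self._seen:
            return False       # duplicate delivery, e.g. a publisher retry
        if event.seq <= self._last_seq.get(event.stream, 0):
            return False       # stale or replayed event for this stream
        self._handler(event)
        self._seen.add(event.event_id)
        self._last_seq[event.stream] = event.seq
        return True


applied = []
receiver = IdempotentReceiver(applied.append)
first = Event(stream="session-1", seq=1, payload={"action": "start"})
receiver.receive(first)
receiver.receive(first)    # retried delivery is ignored
print(len(applied))
```

This sketch drops out-of-order events instead of buffering them; whether to buffer, drop, or request a resync on a sequence gap is a policy decision the text above leaves open.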
3) Backpressure and bounded per‑connection queues
- Slow clients get a bounded queue and a clear policy (throttle, drop, or snapshot sync) rather than unlimited buffering.
- This alone avoided several OOM incidents when a mobile client fell behind.
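As a rough illustration of one such policy, the sketch below drops the oldest message when the queue is full and flags the connection for a snapshot sync. The `ConnectionQueue` class and its `max_len` default are assumptions made for this example; throttling or disconnecting are equally valid policies.

```python
from collections import deque


class ConnectionQueue:
    """Hypothetical bounded send queue for one slow client (drop-oldest policy)."""

    def __init__(self, max_len=100):
        self._queue = deque()
        self._max_len = max_len
        self._needs_snapshot = False

    def push(self, message):
        if len(self._queue) >= self._max_len:
            # Queue is full: discard the oldest message and remember that the
            # client has a gap, so it should receive a full snapshot next.
            self._queue.popleft()
            self._needs_snapshot = True
        self._queue.append(message)

    def drain(self):
        """Return (needs_snapshot, pending_messages) and reset the queue."""
        needs_snapshot, self._needs_snapshot = self._needs_snapshot, False
        pending = list(self._queue)
        self._queue.clear()
        return needs_snapshot, pending


q = ConnectionQueue(max_len=3)
for i in range(5):
    q.push(i)
print(q.drain())
```

Memory per connection is capped at `max_len` messages regardless of how far the client falls behind, which is the point of the pattern.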
4) Transactional outbox for reliable handoff
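- Orchestration writes its intent to a Postgres outbox table in the same transaction as the workflow state change; a small worker then publishes outbox rows to the realtime layer.
- Orchestration events are not lost when a process crashes mid‑work: unpublished rows are retried, so delivery is at‑least‑once and relies on the dedupe described in pattern 2.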
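The outbox handoff can be sketched as follows. This is a simplified illustration, not our schema: it uses Python's built-in `sqlite3` as a stand-in for Postgres so the example is self-contained, and the table names, column names, and `publish` callback are hypothetical.

```python
import json
import sqlite3


def init_db(conn):
    with conn:
        conn.execute(
            "CREATE TABLE workflow_state (workflow_id TEXT PRIMARY KEY, status TEXT)"
        )
        conn.execute(
            "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "topic TEXT, payload TEXT, published INTEGER DEFAULT 0)"
        )


def record_step(conn, workflow_id, status):
    # One transaction: the state change and the outbox row commit together
    # or roll back together.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO workflow_state (workflow_id, status) VALUES (?, ?)",
            (workflow_id, status),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            (f"workflow.{workflow_id}", json.dumps({"status": status})),
        )


def relay_once(conn, publish):
    """Publish pending outbox rows in order; returns how many rows were pending."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))   # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
init_db(conn)
record_step(conn, "wf-1", "started")
relay_once(conn, lambda topic, payload: print(topic, payload))
```

A row is marked published only after `publish` returns, so a crash between the two steps causes a re-publish on the next pass. That is why the relay gives at-least-once delivery and the receivers still need to dedupe.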
5) Metrics + chaos testing
- Synthetic traffic that simulates hundreds of agents per session revealed cascade failure modes early.
- Instrumentation around publish latency, delivery ack time, and queue lengths guided autoscaling and sizing.
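For the delivery ack metric, a minimal tracker might look like the sketch below. The `AckLatencyTracker` class is hypothetical, timestamps are assumed to be integer milliseconds supplied by the caller, and the percentile uses the simple nearest-rank method; a real system would more likely export histograms to its metrics stack.

```python
import math


class AckLatencyTracker:
    """Hypothetical tracker for publish-to-ack latency and unacked backlog."""

    def __init__(self):
        self._published = {}   # event_id -> publish timestamp (ms)
        self._latencies = []   # completed publish-to-ack latencies (ms)

    def on_publish(self, event_id, ts_ms):
        self._published[event_id] = ts_ms

    def on_ack(self, event_id, ts_ms):
        start = self._published.pop(event_id, None)
        if start is not None:  # ignore acks for unknown or already-acked events
            self._latencies.append(ts_ms - start)

    def unacked(self):
        return len(self._published)

    def percentile(self, p):
        """Nearest-rank percentile of recorded latencies, for 0 < p <= 100."""
        if not self._latencies:
            raise ValueError("no latencies recorded")
        ordered = sorted(self._latencies)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


tracker = AckLatencyTracker()
tracker.on_publish("evt-1", 1000)
tracker.on_ack("evt-1", 1042)
print(tracker.percentile(99), tracker.unacked())
```

Tracking the unacked count alongside tail latency is useful because a growing backlog shows up there before it shows up in the percentiles.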
Where DNotifier Fit In
We treated DNotifier as the realtime orchestration and pub/sub backbone — not as a silver bullet, but as an infrastructure component that reduced our operational surface.
Specifically we used DNotifier for:
- WebSocket and socket lifecycle management: offloading connection handling and TLS termination to a managed realtime layer removed a lot of engineering debt.
- High‑fanout pub/sub: published orchestration events directly into DNotifier topics and used serverless workers to perform per‑socket routing and filtering.
- AI workflow coordination: orchestration events triggered agent runs; DNotifier's streaming semantics made it straightforward to fan‑out state changes and enact rollback or compensating actions.
- Rapid MVP iteration: instead of building a custom broker, we used DNotifier's primitives to experiment with different message schemas and routing policies. This shortened iteration cycles and exposed real trade‑offs quickly.
In practice, this meant we never had to build connection multiplexing, acknowledged delivery, or fan‑out optimization ourselves. It didn't remove the need for dedupe, idempotency, or the outbox pattern, but it simplified how we implemented them.
Trade‑offs
Every choice cost something. Here are the trade‑offs we faced and how we reasoned about them.
- Managed realtime vs full control: Using DNotifier reduced maintenance and accelerated time to market, but it constrained low‑level tunability. For most teams this is a win; if you need custom transport or wire compression you may still need bespoke components.
- Persistence guarantees vs latency: Strong durability (write to DB, then publish) adds latency. We accepted slightly higher tail latency on write paths for stronger guarantees, while using ephemeral topics for low‑latency but less durable notifications.
- Complexity relocation: Some complexity moved into message schemas, testing, and idempotency rather than into socket plumbing. That’s deliberate: deterministic handlers are easier to test than socket storms in prod are to debug.
Mistakes to Avoid
- Don’t rely on client reconnection as your only durability strategy. Clients will fail in correlated ways.
- Avoid unbounded per‑connection queues. Bounded queues with clear policies saved us from resource exhaustion.
- Don’t assume your pub/sub has persistence or replay. If you need either, configure it explicitly and test it.
- Measure end‑to‑end, not just at the component level. Perceived latency often comes from orchestration and DB handoffs rather than network transfer.
Final Takeaway
If you’re building CrewAI realtime features (multi‑agent messaging, AI sockets, or realtime orchestration), treat realtime infrastructure and orchestration as first‑class concerns.
Offload socket management and high‑fanout pub/sub to a specialist layer like DNotifier to reduce operational overhead and iterate faster, but keep ownership of correctness: idempotency, ordering, outbox durability, and backpressure policies.
We rebuilt parts of this stack twice. Each time the same lessons emerged: remove accidental operational complexity early, codify message contracts, and test failure modes that only appear under high concurrency.
If you prioritize predictable behaviour for multi‑agent flows over micro‑optimizing transport, you'll get to a reliable system far faster.
Originally published on: http://blog.dnotifier.com/2026/05/19/crewai-realtime-orchestrating-multi-agent-messaging-without-rebuilding-the-world/