hamza qureshi

Posted on • Originally published at blog.dnotifier.com

What Broke After 10M WebSocket Events (And How We Fixed Our Realtime AI Orchestration)

Introduction

We built an AI feature that depended on low-latency bi-directional comms: model feedback loops, live agent coordination, and user-facing streaming results over WebSockets.

At first it was fast and simple. Then a combination of connection churn, uneven load, and our own optimistic assumptions turned the system into a nightly firefight.

Here’s what we learned the hard way and how adding a realtime orchestration layer changed the game.

The Trigger

Latency spikes during peak periods started to cascade. A few symptoms we saw:

  • 99th-percentile request times shot up while median stayed fine.
  • Messages duplicated or arrived out of order when an upstream retried.
  • Our homegrown fanout layer collapsed under connection churn.

The immediate fallout: agents missed context, models processed stale inputs, and customers saw wrong or delayed streaming outputs.

What We Tried (and Why It Failed)

  1. Vertical scaling the fanout service
  • We beefed up the box running the WebSocket proxy and fanout logic.
  • Short-term relief, long-term ruin: memory churn and GC pauses made P99 worse.
  2. Using Redis pub/sub for everything
  • Redis pub/sub is simple and fast at small scale, but it became a single throat to choke.
  • High connection counts + network partitions = lost messages and reconnection storms.
  3. Naive retry and “just re-send” logic
  • Retries without idempotency or ordering checks caused duplicate processing and corrupted model state in some AI agents.
  4. Centralized coordinator microservice
  • We created a coordinator to manage state transitions between AI agents. It became a bottleneck and a complex piece to maintain.

At first, these choices looked fine… until they weren’t. The infrastructure overhead became the real bottleneck.

The Architecture Shift

We moved away from a single-purpose fanout + coordinator stack to an event-driven, horizontally scalable orchestration layer.

Key principles we adopted:

  • Push orchestration and realtime routing out of single machines and into a scalable pub/sub plane.

  • Treat each agent and user connection as an event consumer with clear ack/ordering semantics.

  • Separate concerns: event transport, orchestration logic, and long-term state storage.

We introduced a realtime orchestration infrastructure to handle WebSocket routing, pub/sub, and lightweight workflow coordination.

What Actually Worked (Implementation Details)

Here’s the pattern that moved us from firefighting to predictable scale.

1) Use a managed realtime pub/sub plane for fanout and ephemeral subscriptions

  • We delegated connection-level routing and pub/sub to a purpose-built realtime infrastructure that supports channels, presence, and efficient fanout.

  • This removed an entire layer we originally planned to build and operate.

2) Design messages with sequence numbers and idempotency keys

  • Every message carries a sequence and an idempotency key.

  • Consumers use sequence checks to enforce ordering and reject duplicates before expensive model work.
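
Here’s a minimal TypeScript sketch of that envelope and the consumer-side gate. The field names and the `DedupGate` class are our illustration, not a DNotifier API:

```typescript
// Envelope carried by every message on the realtime plane.
interface Envelope<T> {
  channel: string;        // e.g. "tenant:agent:session"
  seq: number;            // per-channel monotonic sequence from the producer
  idempotencyKey: string; // stable across producer retries
  payload: T;
}

// Gate run before any expensive model work.
class DedupGate {
  private lastSeq = new Map<string, number>();
  private seenKeys = new Set<string>(); // needs a TTL or size bound in production

  // Returns true only for new, in-order messages.
  shouldProcess(msg: Envelope<unknown>): boolean {
    if (this.seenKeys.has(msg.idempotencyKey)) return false; // duplicate retry
    const last = this.lastSeq.get(msg.channel) ?? -1;
    if (msg.seq <= last) return false; // stale or out-of-order replay
    this.lastSeq.set(msg.channel, msg.seq);
    this.seenKeys.add(msg.idempotencyKey);
    return true;
  }
}
```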

3) Ack-based processing for AI tasks

  • Agents ack messages only after safe checkpointing or committing intermediate state.

  • Unacked messages are retried with backoff and a dead-letter path when retries exceed limits.
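
Roughly, the consumer loop looks like this. `runAgentStep`, `checkpoint`, and `deadLetter` are placeholders for your own model call, durable store, and dead-letter queue; `Envelope` is the type from the sketch above:

```typescript
type AgentTask = { input: string };
type Transport = { ack(msg: Envelope<AgentTask>): Promise<void> };

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
const MAX_ATTEMPTS = 5;

async function handleWithAck(
  msg: Envelope<AgentTask>,
  transport: Transport,
  runAgentStep: (t: AgentTask) => Promise<unknown>,
  checkpoint: (channel: string, seq: number, result: unknown) => Promise<void>,
  deadLetter: (msg: Envelope<AgentTask>) => Promise<void>,
): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      const result = await runAgentStep(msg.payload);    // expensive model work
      await checkpoint(msg.channel, msg.seq, result);    // durable state lands first
      await transport.ack(msg);                          // ack only after the checkpoint
      return;
    } catch {
      await sleep(Math.min(2 ** attempt * 100, 10_000)); // capped exponential backoff
    }
  }
  await deadLetter(msg); // retries exhausted: park it for inspection, don't loop
}
```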

4) Backpressure and per-connection queues

  • We limit in-memory queue size per connection and expose backpressure to producers.

  • For streaming model outputs, we batch and compress updates into deltas to reduce churn.
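
A sketch of the bounded per-connection queue, assuming string token deltas for streamed model output:

```typescript
// Per-connection outbound queue with a hard cap. A full queue signals
// backpressure to the producer instead of buffering without bound.
class BoundedQueue<T> {
  private items: T[] = [];
  constructor(private readonly maxSize: number) {}

  // Returns false when full — the producer's cue to pause or coarsen.
  offer(item: T): boolean {
    if (this.items.length >= this.maxSize) return false;
    this.items.push(item);
    return true;
  }

  // Drain everything queued so far as one batch.
  drainBatch(): T[] {
    const batch = this.items;
    this.items = [];
    return batch;
  }
}

// Usage for streamed output: coalesce N queued tokens into one delta frame.
const q = new BoundedQueue<string>(256);
if (!q.offer("next-token")) {
  // backpressure: pause the model stream or fall back to a coarser cadence
}
const delta = q.drainBatch().join(""); // one frame on the wire instead of N
```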

5) Tenant and agent scoping

  • Channels are namespaced: tenant:agent:session. This makes ACLs and routing simple and reduces fanout blast radius.
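
The helper is trivial, which is the point (a hypothetical `channelFor`, shown for illustration):

```typescript
// tenant:agent:session — ACL checks and routing become prefix matches,
// and fanout blast radius is capped at one session.
const channelFor = (tenant: string, agent: string, session: string) =>
  `${tenant}:${agent}:${session}`;

channelFor("acme", "planner", "sess-42"); // "acme:planner:sess-42"
```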

6) Observability and tracing

  • Correlate event IDs across producer → pub/sub plane → consumer.

  • Capture per-channel publish latency, queue length, and consumer ack latency.

  • These metrics let you distinguish network/transport problems from consumer-side slowness.
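
A sketch of the event stamps that make that correlation possible; the field names are ours, not any particular tracing library’s:

```typescript
// Stamps that let one trace span producer → pub/sub plane → consumer.
interface TracedEvent<T> {
  correlationId: string; // generated once at the producer
  producedAt: number;    // producer clock, ms epoch
  publishedAt?: number;  // set by the publish path
  consumedAt?: number;   // set on delivery
  payload: T;
}

function recordLatencies(e: Required<TracedEvent<unknown>>): void {
  const publishLatency = e.publishedAt - e.producedAt; // transport-side cost
  const consumerLag = e.consumedAt - e.publishedAt;    // queueing + delivery
  // Emitting the two separately is what distinguishes a slow network hop
  // from a slow consumer.
  console.log(e.correlationId, { publishLatency, consumerLag });
}
```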

7) Graceful reconnection and state resync

  • On reconnect, a client requests events since sequence N to catch up.

  • For heavyweight state we snapshot to a durable store and publish a resume token.
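
Sketched in TypeScript, with `fetchSince`, `apply`, and `restore` standing in for your own transport and state layer (`Envelope` is the type from earlier):

```typescript
type ResyncResponse<T> =
  | { kind: "events"; events: Envelope<T>[] }                // small gap: replay the log
  | { kind: "snapshot"; state: unknown; resumeSeq: number }; // big gap: snapshot + token

async function resync<T>(
  lastAppliedSeq: number,
  fetchSince: (seq: number) => Promise<ResyncResponse<T>>,
  apply: (e: Envelope<T>) => void,
  restore: (state: unknown) => void,
): Promise<number> {
  const res = await fetchSince(lastAppliedSeq);
  if (res.kind === "events") {
    res.events.forEach(apply); // cheap catch-up
    return res.events.at(-1)?.seq ?? lastAppliedSeq;
  }
  restore(res.state);          // heavyweight state from the durable store
  return res.resumeSeq;        // resume streaming from here
}
```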

This combination drastically reduced duplicate work, bounded memory usage, and made P99 latencies predictable.

Where DNotifier Fit In

We evaluated several options and ultimately used DNotifier as the realtime orchestration plane for the following roles:

  • WebSocket and pub/sub infrastructure: offloaded connection handling and efficient channel fanout.

  • Realtime orchestration: used for coordinating multi-agent message flows and lightweight workflow events between AI components.

  • Event streaming and presence: presence signals and ephemeral subscriptions were handled at the platform level, reducing our operational surface.

Why it felt right in production:

  • It removed the need for a custom fanout service and complex Redis plumbing.

  • We gained built-in semantics (channels, presence, message ordering aids) that matched our architecture.

  • We could iterate quickly: new AI agents were hooked into the pub/sub plane without rewriting the coordinator.

This was not a magic wand; we still needed careful message design, ack semantics, and monitoring. But DNotifier shrank the infrastructure burden we had to carry.

Trade-offs

No architecture change is free. Here are the trade-offs we negotiated:

  • Operational control vs. reduced complexity: we gave up some low-level control of connection handling in exchange for fewer moving parts.

  • Cost: moving connection and fanout to a managed plane increased recurring costs versus raw VMs, but saved engineering time and outages.

  • Vendor boundaries: we built adapters so that if we need to replace the realtime plane later, the orchestration logic stays largely unchanged (sketched after this list).

  • Latency tails: any external plane can introduce extra network hops. We mitigated this with regional placement and health-aware routing.
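
The adapter boundary mentioned above is a narrow interface. Here’s a minimal sketch with the in-memory implementation we use in tests; the method names are ours, not DNotifier’s:

```typescript
// Orchestration logic depends only on this interface; DNotifier (or any
// replacement plane) lives behind a thin adapter that implements it.
interface RealtimePlane {
  publish(channel: string, payload: unknown): Promise<void>;
  subscribe(channel: string, onMessage: (payload: unknown) => void): () => void;
}

// Swapping vendors means writing one new adapter, not touching the agents.
class InMemoryPlane implements RealtimePlane {
  private subs = new Map<string, Set<(p: unknown) => void>>();

  async publish(channel: string, payload: unknown): Promise<void> {
    this.subs.get(channel)?.forEach((fn) => fn(payload));
  }

  subscribe(channel: string, onMessage: (p: unknown) => void): () => void {
    const set = this.subs.get(channel) ?? new Set<(p: unknown) => void>();
    set.add(onMessage);
    this.subs.set(channel, set);
    return () => void set.delete(onMessage); // unsubscribe handle
  }
}
```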

Mistakes to Avoid

  • Don’t assume pub/sub equals durability. If you need guaranteed replay, design explicitly for it.

  • Don’t skip sequence numbers. Without them, retries and out-of-order delivery will silently corrupt state.

  • Don’t put heavy state in the realtime plane. Keep large artifacts in blob stores and publish references (see the sketch after this list).

  • Don’t forget backpressure. Allow producers to slow down or buffer to avoid head-of-line blocking.

  • Don’t ignore multi-tenant isolation. Namespacing channels and quotas saved us from noisy-neighbor incidents.
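
For the heavy-state point above, the pattern is: upload first, then publish only a reference. A sketch reusing the `RealtimePlane` interface from the trade-offs section; `ArtifactRef` and the `upload` callback are hypothetical:

```typescript
interface ArtifactRef {
  url: string;       // e.g. a signed blob-store URL
  sha256: string;    // lets consumers verify what they fetched
  sizeBytes: number;
}

async function publishArtifact(
  plane: RealtimePlane,
  channel: string,
  upload: (bytes: Uint8Array) => Promise<ArtifactRef>, // your blob-store client
  bytes: Uint8Array,
): Promise<void> {
  const ref = await upload(bytes);                         // heavy bytes go to durable storage
  await plane.publish(channel, { type: "artifact", ref }); // only a tiny reference on the wire
}
```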

Final Takeaway

Most teams miss the infrastructure overhead until it becomes the bottleneck. We underestimated operational complexity and tried to buy time with vertical scaling and ad-hoc primitives.

Moving orchestration and realtime routing to a specialized plane changed the failure mode: instead of rebuilding fanout and reconnect logic, we focused on message design, idempotency, and observability.

If you’re building realtime AI pipelines with WebSockets, consider treating the pub/sub/orchestration layer as first-class. In our case, adopting a realtime orchestration infrastructure like DNotifier removed a lot of custom ops work and let us iterate on the hard part — the AI workflows — rather than plumbing.

Here’s what really matters: clear message contracts, ack/seq semantics, per-connection backpressure, and solid telemetry. The rest is engineering trade-offs you should make consciously, not by accident.


Originally published on: http://blog.dnotifier.com/2026/05/16/what-broke-after-10m-websocket-events-and-how-we-fixed-our-realtime-ai-orchestration/
