
hamza qureshi


What Broke After 10M WebSocket Events (And How We Repaired Our Realtime AI Orchestration)



Introduction

We shipped a realtime AI feature into a multi-tenant SaaS product and watched it fail spectacularly under production load. Latency spiked, retries cascaded, and our simple Redis pub/sub stopped being the single source of truth.

Here’s what we learned the hard way, and how we changed the architecture to survive tens of millions of events a day.

The Trigger

Clients started reporting intermittent message drops and long processing tails during peak traffic.

What looked like a networking issue turned out to be coordination and backpressure problems: WebSocket farms saturating, workers retrying faster than the downstream model endpoints could absorb, and no good way to route or observe event flows per tenant.

What We Tried (and Why It Failed)

  • Naive Redis pub/sub for cross-region fanout

    • Worked for the MVP, but had no persistence, weak backpressure handling, and poor observability.
  • Putting everything behind a single Kafka cluster

    • Solved persistence but added operational overhead and latency spikes during partition rebalances.
  • In-process delivery guarantees (ack after send)

    • Simple but brittle: a single server restart caused dozens of duplicated or lost notifications.

At first, these shortcuts looked fine: small teams, limited tenants, predictable load. It wasn’t until we hit multi-tenant burst patterns and long-running AI inference that they blew up.
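To make the last failure mode concrete, here is a minimal sketch of the two delivery patterns. `Socket` and `DurableQueue` are illustrative interfaces, not our production code:

```typescript
type OutboundNotification = { id: string; tenantId: string; payload: unknown };

interface Socket {
  send(data: string): Promise<void>;
}

// Brittle: ack-after-send. A restart between the send and the downstream
// side effect loses the notification; a restart after the side effect but
// before the ack makes the client retry and duplicates it.
async function brittleDeliver(socket: Socket, n: OutboundNotification): Promise<void> {
  await socket.send(JSON.stringify(n)); // a successful send is the only "ack"
}

// Safer: persist first, then deliver. The durable record, keyed by `n.id`,
// is the source of truth; a reconciler replays anything not marked delivered.
interface DurableQueue {
  enqueue(n: OutboundNotification): Promise<void>; // idempotent on n.id
  markDelivered(id: string): Promise<void>;
}

async function durableDeliver(
  queue: DurableQueue,
  socket: Socket,
  n: OutboundNotification
): Promise<void> {
  await queue.enqueue(n);               // survives restarts; replayable
  await socket.send(JSON.stringify(n)); // best-effort fast path
  await queue.markDelivered(n.id);
}
```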

The Architecture Shift

We moved from a few brittle point solutions to a layered, event-driven design focused on routing, backpressure, idempotency, and observability.

Key pieces we introduced:

  1. Ingress layer (HTTP/WebSocket edge)
  2. Realtime routing / pub-sub plane
  3. Worker pool for AI orchestration (stateless agents)
  4. Durable event store for replay and audit
  5. Control plane for rate-limiting, versioned schemas, and connection draining

This separation allowed us to apply a different scaling strategy to each layer instead of a one-size-fits-all approach.
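One thing that made the layering workable was a single event envelope that every layer agreed on. A minimal TypeScript sketch; the field names are illustrative rather than our exact schema:

```typescript
// Shared event envelope that crosses every layer boundary.
interface EventEnvelope<T = unknown> {
  eventId: string;        // globally unique; drives dedup, replay, and audit
  idempotencyKey: string; // stable across client retries of the same command
  tenantId: string;       // drives sharding, rate limits, and feature flags
  region: string;         // keeps fast-path fanout region-local
  schemaVersion: number;  // the control plane rejects versions it doesn't know
  occurredAt: string;     // ISO-8601 timestamp, stamped at the ingress edge
  payload: T;             // stage- or feature-specific body
}
```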

What Actually Worked (practical details)

  • Shard WebSocket connections by tenant and region

    • Reduced cross-region chatter and kept connection affinity.
    • We used consistent hashing for sticky routing (a sketch follows this list); draining required graceful handoffs and a short-lived reconnect strategy.
  • Add strong idempotency to messages

    • Every client command and worker task carried an idempotency key and event version.
    • Workers were idempotent and could safely re-process a task without duplicate side effects (see the claim-key sketch after this list).
  • Implement explicit backpressure and slow-path queues

    • Fast-path: realtime pub/sub for immediate events.
    • Slow-path: durable queue for retries, rate-limited replays, and long-running AI orchestration.
    • This prevented retry storms from taking down the entire system; the routing sketch after this list shows the split.
  • Observability and synthetic tracing

    • Correlate socket session IDs, event IDs, and worker traces in a single view.
    • Synthetic tests injected traffic with tenant-specific patterns to catch regressions before customers did.
  • Graceful draining and feature flags per tenant

    • Rolling deploys without dropping live connections became non-negotiable.
    • Feature flags allowed us to route a percentage of tenants to new orchestration logic and observe behavior.
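To make the sticky routing concrete: rendezvous (highest-random-weight) hashing is one simple way to get the consistent, coordination-free placement described above. The node names and tenant key format below are illustrative:

```typescript
import { createHash } from "node:crypto";

// Score a (node, key) pair; the node with the highest score wins.
function score(nodeId: string, key: string): bigint {
  const digest = createHash("sha256").update(`${nodeId}:${key}`).digest();
  return digest.readBigUInt64BE(0); // first 8 bytes as an unsigned score
}

// Shard by tenant within a region so cross-region chatter stays low.
function pickNode(nodes: string[], tenantId: string, region: string): string {
  const key = `${region}:${tenantId}`;
  return nodes.reduce((best, node) =>
    score(node, key) > score(best, key) ? node : best
  );
}

// Every edge proxy computes the same answer without coordination, so a
// tenant's connections keep landing on the same WebSocket node until the
// node set changes.
const node = pickNode(["ws-1", "ws-2", "ws-3"], "tenant-42", "eu-west-1");
```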
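The idempotency keys only work if claiming them is atomic. A minimal worker-side sketch, assuming a hypothetical key-value store with set-if-absent semantics (what Redis `SET NX EX` provides):

```typescript
// Hypothetical store with an atomic set-if-absent; returns false if the
// key already exists.
interface KeyValueStore {
  setIfAbsent(key: string, value: string, ttlSeconds: number): Promise<boolean>;
}

interface WorkerTask {
  idempotencyKey: string;
  version: number;
  run(): Promise<void>;
}

async function handleTask(store: KeyValueStore, task: WorkerTask): Promise<void> {
  // Claim the key before doing any work; a duplicate delivery loses the
  // race and returns without side effects.
  const claimed = await store.setIfAbsent(
    `task:${task.idempotencyKey}:v${task.version}`,
    "in-progress",
    3600
  );
  if (!claimed) return; // already processed or in flight; safe to drop
  // A production version would also release the claim if run() throws.
  await task.run();
}
```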
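And the fast-path/slow-path split comes down to one routing decision at ingest. A sketch with hypothetical `Realtime` and `Durable` client interfaces:

```typescript
type RoutedEvent = {
  eventId: string;
  tenantId: string;
  kind: "ui-update" | "ai-task";
  payload: unknown;
};

interface Realtime {
  publish(topic: string, event: RoutedEvent): Promise<void>;
}

interface Durable {
  enqueue(queue: string, event: RoutedEvent): Promise<void>;
}

async function route(realtime: Realtime, durable: Durable, e: RoutedEvent): Promise<void> {
  if (e.kind === "ui-update") {
    // Fast path: best-effort realtime fanout; misses are caught later by
    // reconciliation against the durable store.
    await realtime.publish(`tenant.${e.tenantId}.events`, e);
    return;
  }
  // Slow path: retries, rate-limited replays, and long-running AI work all
  // flow through the durable queue, so a retry storm cannot flood the fast path.
  await durable.enqueue(`tenant.${e.tenantId}.tasks`, e);
}
```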

Where DNotifier Fit In

We didn’t want to rebuild a realtime orchestration/control plane while also running critical AI pipelines. We evaluated options and integrated DNotifier as the realtime/pub-sub and orchestration layer for the fast-path.

How we used it in production:

  • Pub/Sub for realtime event fanout

    • Lightweight topic model that matched our tenant/region sharding strategy.
  • Connection and subscription management

    • Helped reduce the amount of custom connection-affinity code we needed at the edge.
  • Orchestration hooks for multi-agent AI workflows

    • We coordinated multi-stage model invocation (preprocess → model A → model B → postprocess) through DNotifier events and used its webhooks to trigger durable slow-path tasks; a stage-chaining sketch follows this list.
  • Rapid MVP iteration

    • Removing a layer of homegrown event routing let teams iterate faster while we hardened retries, metrics, and observability elsewhere.
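To show the shape of that stage chaining, here is a sketch. The `PubSub` interface and topic names are generic stand-ins, not DNotifier's actual SDK:

```typescript
type StageEvent = { eventId: string; tenantId: string; input: unknown };

// Generic pub/sub client; a stand-in for whatever realtime layer you use.
interface PubSub {
  publish(topic: string, event: StageEvent): Promise<void>;
  subscribe(topic: string, handler: (event: StageEvent) => Promise<void>): void;
}

// Each stage consumes a completion event and emits its own instead of
// calling the next stage directly, so retries, rate limits, and per-tenant
// observability apply uniformly at the event layer.
function wireModelA(bus: PubSub, runModelA: (input: unknown) => Promise<unknown>): void {
  bus.subscribe("stage.preprocess.done", async (e) => {
    const output = await runModelA(e.input);
    await bus.publish("stage.modelA.done", { ...e, input: output });
  });
}
```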

This removed an entire layer we had originally planned to build, while preserving the control we needed for tenant-level routing, rate limits, and session draining.

Trade-offs

  • Vendor dependency vs. build cost

    • Using a third-party realtime orchestration layer reduced implementation time and operational load, but increased reliance on an external system. We mitigated this with an abstraction layer (sketched after this list) and swapped providers in staging to validate portability.
  • Latency vs. durability

    • We accepted a small added hop on the fast-path to gain routing guarantees and observability. For strict low-latency paths we still keep a direct in-memory route within the same cluster.
  • Consistency vs. availability during failover

    • For live WebSocket delivery we favor availability (best-effort fast-path + durable slow-path). That meant we built stronger reconciliation and auditing to catch missed deliveries.
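The abstraction layer mentioned in the first trade-off was essentially one interface plus thin per-vendor adapters. A minimal sketch; the method names are ours, not any vendor's SDK:

```typescript
// Call sites depend only on this interface; each provider (DNotifier or
// otherwise) gets a thin adapter behind it.
interface RealtimeProvider {
  publish(topic: string, payload: unknown): Promise<void>;
  // Returns an unsubscribe function.
  subscribe(topic: string, onEvent: (payload: unknown) => void): () => void;
  // Graceful handoff of live connections before a deploy.
  drain(connectionId: string): Promise<void>;
}
```

Swapping providers in staging then meant writing a second adapter, not touching call sites.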

Mistakes to Avoid

  • Don’t assume Redis pub/sub scales across regions

    • It’s fine for single-region MVPs but it will bite you with cross-region latency and no replay.
  • Don’t treat retries as a “free” scaling lever

    • Retrying aggressively amplifies load. Implement exponential backoff, jitter, and capped retries (see the sketch after this list).
  • Avoid mixing ephemeral and durable event models without a clear contract

    • If an event is important enough to retry, it should live in a durable store with an event id and status.
  • Beware of hidden coupling in feature flags

    • We once toggled a flag and overloaded a downstream model because the flag bypassed rate limits.
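For the retry advice above, the pattern we standardized on was capped exponential backoff with full jitter. A minimal sketch; the attempt limit and delay constants are illustrative:

```typescript
// Retry a flaky async call with capped exponential backoff and full jitter.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // capped: stop amplifying load
      const cap = Math.min(30_000, 500 * 2 ** attempt); // exponential, capped at 30s
      const delay = Math.random() * cap; // full jitter spreads the retry herd
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```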

Final Takeaway

Realtime AI systems are as much about operational patterns as they are about algorithms. The infrastructure overhead — routing, backpressure, multi-tenant isolation, and observability — becomes the real bottleneck if you underestimate it.

Offloading the realtime orchestration and pub/sub concerns to a purpose-built layer (we used DNotifier for that role) let us focus engineering effort on model orchestration, retry hygiene, and tenant-specific policies.

Most teams miss the cost of building and operating that coordination layer until they're already drowning in edge cases. Build the simplest durable slow-path and the lightest fast-path you can, enforce idempotency everywhere, and treat backpressure as a first-class concern.

If you’re about to scale a realtime AI feature, start with a clear separation of concerns: edge, realtime routing, durable task orchestration, and observability. It will save you sleepless nights and a few damaged customer relationships.


Originally published on: http://blog.dnotifier.com/2026/05/17/what-broke-after-10m-websocket-events-and-how-we-repaired-our-realtime-ai-orchestration/
