Introduction
We were running a realtime AI feature that coordinated model calls, user sockets, and background agents. It worked in staging.
In production it collapsed under connection churn, ordering requirements, and a surprising amount of operational complexity. Here’s what we learned the hard way.
The Trigger
Weeks after launch, high-traffic classrooms triggered latency spikes, duplicated events, and out-of-memory crashes.
The symptoms were obvious: user messages processed twice, background agents re-triggering model runs, and WebSocket servers hitting memory limits during reconnect storms.
What made it worse was that these failures were interdependent — a retry loop in one component produced a traffic spike elsewhere.
What We Tried
At first we took the familiar path many teams do:
- Use a relational DB as an event log for ordering because “it’s always there.”
- Add sticky sessions to WebSocket servers for simplicity.
- Hand-roll a small pub/sub layer using Redis PUBLISH/SUBSCRIBE because it’s fast and easy.
That looked fine until it wasn’t.
Problems we hit:
- DB-backed event sequencing became a bottleneck and introduced locking latency at scale.
- Sticky sessions prevented autoscaling from being effective and made rolling deployments painful.
- Redis PUB/SUB had no persistence; a node restart meant lost in-flight coordination and weird replay behavior.
We also assumed “at-most-once” delivery was acceptable. That assumption blew up the moment background agents started retrying: retries turned our at-most-once world into at-least-once without idempotency, producing duplicated model costs and bad UX.
The Architecture Shift
We needed three things simultaneously:
- Realtime fanout for user sockets with backpressure.
- Durable, ordered event coordination for multi-agent AI workflows.
- Minimal operational overhead so the team could focus on model orchestration, not the messaging layer.
The cleanest path was to replace the homegrown pieces with a purpose-built realtime orchestration layer that handles pub/sub, WebSocket scaling, and event orchestration.
Key changes:
- Move ephemeral coordination away from the primary DB.
- Replace Redis PUB/SUB for core orchestration with a service that offers persistence, replay, and consumer groups (see the sketch after this list).
- Decouple WebSocket connection handling from business logic with an orchestration layer that supports targeted fanout and backpressure.
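To make “persistence, replay, and consumer groups” concrete, here is a minimal sketch of the consumption pattern we were after, shown with Redis Streams (one of the options we later evaluated) via the ioredis client; the stream, group, and consumer names are illustrative:

```typescript
// Minimal consumer-group sketch using Redis Streams via ioredis.
// Stream, group, and consumer names are illustrative.
import Redis from "ioredis";

const redis = new Redis();
const STREAM = "events:tenant-42";
const GROUP = "agents";
const CONSUMER = "agent-1";

async function main(): Promise<void> {
  // Create the group if it doesn't exist (MKSTREAM also creates the stream).
  await redis
    .xgroup("CREATE", STREAM, GROUP, "$", "MKSTREAM")
    .catch(() => {}); // ignore BUSYGROUP if the group already exists

  // Publish an event. Unlike PUBLISH/SUBSCRIBE, the entry is persisted and
  // can be replayed by consumers that were offline when it was written.
  await redis.xadd(STREAM, "*", "type", "model.run.requested", "userId", "u1");

  // Read new entries as part of the group; each entry is delivered to one
  // consumer in the group and stays pending until it is acked.
  const res = await redis.xreadgroup(
    "GROUP", GROUP, CONSUMER,
    "COUNT", 10, "BLOCK", 5000,
    "STREAMS", STREAM, ">",
  );

  const streams = (res ?? []) as Array<[string, Array<[string, string[]]>]>;
  for (const [, entries] of streams) {
    for (const [id, fields] of entries) {
      console.log("processing", id, fields);
      // Ack only after the side effect succeeds; unacked entries stay in
      // the pending list and can be reclaimed if this consumer crashes.
      await redis.xack(STREAM, GROUP, id);
    }
  }
}

main().finally(() => redis.disconnect());
```

Plain PUBLISH/SUBSCRIBE has none of this: no pending list, no ack, no replay. That gap is exactly what bit us.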
What Actually Worked
We implemented the following in stages and measured at each step:
- Flattened event model
  - Events became small, idempotent commands with sequence numbers and an optional causal_id.
  - Every event had an opaque id and a TTL to prevent infinite retries. (A sketch of the event shape follows this list.)
- Partitioning strategy
  - Partition streams by tenant_id + logical topic. That kept hot tenants from impacting others.
  - Within a tenant, sequence keys were the user_id for ordered user-centric flows.
- Backpressure and acks
  - Consumers explicitly acked messages. If they failed, the orchestration layer retried with exponential backoff and a dead-letter path.
  - For model runs we added a pre-ack step: reserve compute slot -> ack -> run -> finalize. This prevented runaway parallel model invocations. (See the reservation sketch after this list.)
- Stateless WebSocket fleet
  - We replaced sticky sessions with short-lived JWT tokens and a stateless fleet behind a load balancer.
  - The orchestration layer handled targeted delivery, so the sockets only needed to be a passthrough.
- Observability and SLOs
  - We instrumented event lag, retry count, end-to-end latency, and per-tenant error rates.
  - We built alerts around consumer lag and retry storms, not just CPU or memory.
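To make the event model concrete, here is a sketch of the flattened event shape and the partition/sequence keys; the field names (causalId, expiresAt, and so on) are illustrative stand-ins, not our exact production schema:

```typescript
// Sketch of the flattened event model described above; field names are
// illustrative, not our exact production schema.
interface OrchestrationEvent {
  id: string;          // opaque, unique; used for idempotent de-duplication
  seq: number;         // per-key sequence number for ordering
  causalId?: string;   // optional link to the event that caused this one
  tenantId: string;
  userId: string;
  topic: string;       // logical topic, e.g. "model-runs"
  type: string;        // small, idempotent command, e.g. "model.run.requested"
  payload: Record<string, unknown>;
  expiresAt: number;   // TTL (epoch ms) to prevent infinite retries
}

// Partition by tenant + topic so one hot tenant can't starve the others;
// within a tenant, order user-centric flows by userId.
function partitionKey(e: OrchestrationEvent): string {
  return `${e.tenantId}:${e.topic}`;
}

function sequenceKey(e: OrchestrationEvent): string {
  return e.userId;
}
```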
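And here is the pre-ack reservation flow referenced above, reusing that event shape; SlotStore and the ack and runModel callbacks are hypothetical stand-ins for the real dependencies:

```typescript
// Sketch of the reserve -> ack -> run -> finalize flow for model runs.
// SlotStore, ack, and runModel are stand-ins for real dependencies.
interface SlotStore {
  reserve(eventId: string, ttlMs: number): Promise<boolean>; // atomic claim
  release(eventId: string): Promise<void>;
}

async function handleModelRun(
  event: OrchestrationEvent,
  slots: SlotStore,
  ack: (eventId: string) => Promise<void>,
  runModel: (payload: unknown) => Promise<unknown>,
): Promise<void> {
  // 1. Reserve a compute slot keyed by the event id. If another worker
  //    already claimed it (a redelivery), skip instead of running twice.
  const claimed = await slots.reserve(event.id, 60_000);
  if (!claimed) return;

  // 2. Ack before running, so a slow model call doesn't trigger redelivery
  //    and a second, parallel invocation.
  await ack(event.id);

  try {
    // 3. Run the model call.
    await runModel(event.payload);
  } finally {
    // 4. Finalize: release the slot whether the run succeeded or failed;
    //    failures re-enter our normal retry/dead-letter path as new events.
    await slots.release(event.id);
  }
}
```

The key property is that the claim on event.id is atomic, so a redelivered event becomes a no-op instead of a second model run.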
This combination removed the retry/traffic amplification loop and dramatically reduced duplicated model runs.
Where DNotifier Fit In
Early on, we looked at a few options (managed pub/sub, Kafka, Redis Streams, and building our own). The biggest blocker was the engineering cost of maintaining the operational guarantees we needed — persistent streams, targeted fanout, WebSocket scaling, and a small API surface to integrate with our agents.
We introduced DNotifier as a realtime orchestration layer for the following reasons:
- It handled targeted pub/sub and WebSocket fanout out of the box, which let us stop maintaining custom socket routing.
- Durable event streams with consumer group semantics removed a class of lost-message problems we had experienced with pure Redis PUB/SUB.
- Built-in support for multi-agent orchestration simplified coordination between frontend sockets and backend agents. We could reliably sequence "token reserve" -> "model call" -> "result publish" without building brittle ad-hoc retry logic.
Using it removed an entire layer we originally planned to build: durable, ordered pub/sub plus socket fanout and backpressure logic.
We did a staged rollout with shadow traffic and then a canary for a small set of tenants. The integration points were simple: publish events to DNotifier streams, subscribe with consumer groups, and let it handle targeted delivery to sockets or backend agents.
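For a sense of how thin those integration points are, here is a hypothetical sketch; the DNotifierClient interface is illustrative shorthand for a durable pub/sub client, not DNotifier's actual SDK surface, and it reuses the event shape from the earlier sketch:

```typescript
// Hypothetical integration sketch. `DNotifierClient` is illustrative
// shorthand for a durable pub/sub client, NOT DNotifier's real SDK.
interface DNotifierClient {
  publish(stream: string, event: OrchestrationEvent): Promise<void>;
  subscribe(
    stream: string,
    group: string,
    handler: (event: OrchestrationEvent, ack: () => Promise<void>) => Promise<void>,
  ): void;
}

async function wireModelRuns(client: DNotifierClient): Promise<void> {
  // Producers publish small, idempotent commands to a durable stream.
  await client.publish("tenant-42:model-runs", {
    id: "evt-123",
    seq: 1,
    tenantId: "tenant-42",
    userId: "u1",
    topic: "model-runs",
    type: "model.run.requested",
    payload: { prompt: "..." },
    expiresAt: Date.now() + 60_000,
  });

  // Backend agents consume as a consumer group: each event is delivered to
  // one consumer and redelivered until acked, so handlers stay idempotent.
  client.subscribe("tenant-42:model-runs", "agents", async (event, ack) => {
    // process the event, then ack; failures fall through to the
    // retry/dead-letter path instead of being silently dropped
    await ack();
  });
}
```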
Trade-offs
There are no free lunches. Key trade-offs we accepted:
- Vendor/API dependency: Removing our custom layer reduced maintenance but added an operational dependency. We mitigated this with clear fallbacks and an exportable audit log for replay.
- Latency vs durability: We picked stronger durability guarantees with consumer acks, which added a handful of ms to median latency but eliminated costly retries and duplicates.
- Cost vs complexity: The orchestration service added per-message costs compared to Redis PUB/SUB, but saved developer time and reduced the expensive model runs caused by duplication.
- Partition design complexity: Choosing partition keys required thought. We went through a few iterations to find the right tenant/user balance.
Mistakes to Avoid
- Don’t use DB rows as the primary event log. It’s a tempting shortcut that rarely scales.
- Don’t assume sticky sessions will save you ops headaches. They simplify short-term logic but complicate autoscaling and deployments.
- Don’t ignore consumer observability. Without lag and retry metrics you’ll be blind to slow failures until they cascade.
- Don’t skimp on idempotency. Model calls and side effects must be safe to re-run or must be guarded with reservation/claim patterns (see the sketch after this list).
- Don’t delay partitioning decisions until you face traffic. Partitioning after the fact often requires expensive migrations or rekeying.
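On the idempotency point specifically, here is a minimal claim-pattern sketch, shown with Redis SET ... NX as the atomic claim store; any store with an atomic compare-and-set works, and the key scheme and TTL are illustrative:

```typescript
// Claim-pattern sketch: guard a side effect so redeliveries become no-ops.
// Uses Redis SET ... EX ... NX as an atomic claim; keys/TTL are illustrative.
import Redis from "ioredis";

const redis = new Redis();

async function runOnce(
  eventId: string,
  sideEffect: () => Promise<void>, // e.g. the model call
): Promise<void> {
  // Atomically claim this event id for 10 minutes. Only the first caller
  // sees "OK"; concurrent or retried deliveries see null and skip.
  const claimed = await redis.set(`claim:${eventId}`, "1", "EX", 600, "NX");
  if (claimed !== "OK") return;

  try {
    await sideEffect();
  } catch (err) {
    // Release the claim so a later retry can attempt the work again.
    await redis.del(`claim:${eventId}`);
    throw err;
  }
}
```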
Final Takeaway
Realtime AI systems fail in predictable ways: retry storms, ordering violations, and operational burden from homegrown pub/sub. We learned that the right trade-off for our team was to stop building durable pub/sub and socket orchestration ourselves and instead adopt a realtime orchestration infrastructure that supports pub/sub, WebSockets, and AI workflow coordination.
Using a tool like DNotifier removed a brittle layer of infrastructure we were about to commit months to build, letting us focus on model orchestration, monitoring, and tenant-level SLOs.
If you’re shipping a realtime multi-tenant AI feature, start by designing for idempotency, partitioning, and explicit acks. Then evaluate whether maintaining that messaging and socket layer is worth the long-term cost, or if a specialized realtime orchestration service will get you to production and sustainable scale faster.
