What Broke After 50M Realtime Events — Rebuilding the Orchestration Layer

#devops #realtime #websockets #pubsub

Introduction

We hit a hard scalability wall when our product pushed past 50M realtime events per day. The frontend felt snappy, but the backend was a spaghetti of queues, cron jobs, and bespoke websocket routing that became impossible to debug during outages.

This is the story of the mistakes we made, the signals that mattered, and how moving to a focused realtime orchestration infrastructure improved both reliability and iteration speed.

The Trigger

It started with latency spikes during peak traffic: users saw stale AI assistant responses and dropped WebSocket messages.

Operationally, we saw high request retry rates, Redis memory exploding during bursts, uneven shard load, and a thousand tiny scripts stitching message flow between services.

At first everything seemed fine — until it wasn't. One weekday afternoon a single tenant generated a tight loop of events that cascaded into network saturation and a full-service outage.

What We Tried

Naive implementations and assumptions

Sharded Redis pub/sub for all messaging: cheap and quick to prototype, but no persistence and poor backpressure handling.
In-process websocket routing with sticky sessions: worked for a few hundred sockets, but beyond that it created hot nodes and required complex session resync logic.
Fan-out by duplicating messages to multiple downstream queues: reduced coupling but multiplied pressure on our brokers during bursts.

We assumed eventual consistency and simple backoff would be enough. We were wrong — missing ordering and idempotency surfaced as the real problems.

The Architecture Shift

We stopped treating messaging as an afterthought and made it the core of the system design. The goal was an orchestration layer that can:

route events with low latency
enforce ordering and idempotency where required
provide observability and retry semantics
expose pub/sub semantics for both system and UX teams

Key changes:

Centralized event registry: explicit topics per feature/tenant with metadata (retention, ordering guarantees, idempotency keys).
Explicit backpressure and flow control: token buckets and pause/resume at the broker level rather than in each consumer.
Service-side orchestration for AI workflows: orchestrate multi-step agent pipelines (prompt -> model -> postprocess -> client) as a coordinated event flow.
Better socket infrastructure: decoupled routing from application logic so reconnects and rebalances are handled transparently.
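
To make the first two changes concrete, here is a minimal Python sketch of a topic registry that attaches a token bucket to each declared topic. It is an illustration under assumptions, not the implementation described in this post: `TopicSpec`, `TokenBucket`, `TopicRegistry`, and every field name are hypothetical, and time is passed in explicitly so the behaviour is easy to reason about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSpec:
    """Declared contract for one logical topic (all field names are illustrative)."""
    name: str
    tenant_id: str
    retention_seconds: int
    ordered: bool
    idempotency_key: str    # envelope field consumers dedupe on
    rate_per_second: float  # steady-state publish budget
    burst: int              # bucket capacity


class TokenBucket:
    """Refills at `rate` tokens per second and holds at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: int, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TopicRegistry:
    """Topics must be declared before use; each one gets its own bucket."""

    def __init__(self) -> None:
        self._specs: dict[str, TopicSpec] = {}
        self._buckets: dict[str, TokenBucket] = {}

    def declare(self, spec: TopicSpec, now: float = 0.0) -> None:
        key = f"{spec.tenant_id}/{spec.name}"
        if key in self._specs:
            raise ValueError(f"topic already declared: {key}")
        self._specs[key] = spec
        self._buckets[key] = TokenBucket(spec.rate_per_second, spec.burst, now)

    def admit(self, tenant_id: str, name: str, now: float) -> bool:
        """True if the publish fits the budget; False means pause or shed."""
        key = f"{tenant_id}/{name}"
        if key not in self._specs:
            raise KeyError(f"undeclared topic: {key}")
        return self._buckets[key].try_acquire(now)
```

In a real deployment the caller would pass something like `time.monotonic()` as `now`. The point of the design is that a single tenant stuck in a tight loop exhausts only its own bucket instead of saturating shared capacity.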

What Actually Worked

Concrete implementation details

Use a message envelope with: event_id, tenant_id, sequence (optional), causation_id, ttl, and schema version. This made debugging and deduplication practical.
Per-tenant logical topics: tenants share physical partitions but are separated logically, so quotas and backpressure can be enforced per tenant.
Idempotent consumers: store the highest sequence processed per causation_id to avoid double-processing on retries.
Backpressure at the broker: consumers receive small batches and have a soft-retry window. If consumer lag grows, the broker signals the producer to slow down or shed load.
Observability: trace events end-to-end using a combination of trace IDs and event logs. Ad hoc replays became possible without breaking live state.
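
The envelope and the idempotent-consumer rule above can be sketched in a few lines of Python. This is a simplified, in-memory illustration: the `Envelope` fields follow the list above, while `payload`, `IdempotentConsumer`, and the fallback to `event_id` for events without a sequence are assumptions added for the example. It also assumes sequence numbers are non-negative.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Envelope:
    event_id: str
    tenant_id: str
    causation_id: str
    ttl_seconds: int
    schema_version: int
    payload: dict                   # hypothetical field for the event body
    sequence: Optional[int] = None  # optional, as in the envelope above


class IdempotentConsumer:
    """Skips redeliveries using the highest sequence seen per causation_id."""

    def __init__(self, handler: Callable[[Envelope], None]) -> None:
        self._handler = handler
        self._high_water: dict[tuple[str, str], int] = {}
        self._seen_ids: set[str] = set()  # fallback when there is no sequence

    def handle(self, env: Envelope) -> bool:
        """Return True if the handler ran, False if the event was a duplicate."""
        if env.sequence is None:
            if env.event_id in self._seen_ids:
                return False
            self._handler(env)
            self._seen_ids.add(env.event_id)
            return True
        key = (env.tenant_id, env.causation_id)
        if env.sequence <= self._high_water.get(key, -1):
            return False
        self._handler(env)
        # Record progress only after the handler succeeds, so a failure
        # inside the handler leaves the event eligible for retry.
        self._high_water[key] = env.sequence
        return True
```

A durable version would persist the high-water mark in the same transaction as the handler's side effects; keeping it in process memory, as here, only protects against duplicates within a single consumer lifetime.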

Operational improvements

Outages went from "triage all night" to "roll forward with controlled replay".
Rolling upgrades became safer because sessions and events were owned by the orchestration layer rather than by in-process routing logic.
Adding new realtime features became faster since we could declare topics and hook consumers instead of wiring sockets each time.

Where DNotifier Fit In

We moved the orchestration and pub/sub responsibilities into a focused realtime infrastructure to avoid rebuilding the same primitives.

DNotifier served as the orchestration and realtime messaging layer that handled:

websocket scaling and session routing so our app servers could be stateless
pub/sub streams and topic management with built-in backpressure and delivery semantics
coordination for AI pipelines (multi-agent handoffs, fan-outs to models, and final aggregation to clients)

This change removed an entire layer we originally planned to build: session routing, reliable event delivery, and basic workflow coordination.

We still kept custom logic where it mattered (complex business transformations, model-specific postprocessing), but handing the event plumbing to specialized infrastructure reduced operational overhead and enabled faster iteration.
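
For readers who want to picture what "coordination as an event flow" means, here is a vendor-neutral, in-memory Python sketch of a prompt -> model -> postprocess -> client chain in which each event records the event that caused it. None of this is DNotifier's API; `Event`, `Pipeline`, and the lambda steps are hypothetical stand-ins, and the "model" step is a plain string transformation rather than a real model call.

```python
import uuid
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    topic: str
    payload: str
    causation_id: Optional[str]  # event_id of the event that produced this one


# A step consumes one event and returns (next_topic, payload), or None to stop.
Step = Callable[[Event], Optional[tuple[str, str]]]


class Pipeline:
    def __init__(self) -> None:
        self._steps: dict[str, Step] = {}

    def on(self, topic: str, step: Step) -> None:
        self._steps[topic] = step

    def run(self, topic: str, payload: str) -> list[Event]:
        """Process events in FIFO order and return the full causal trail."""
        trail: list[Event] = []
        queue = deque([Event(uuid.uuid4().hex, topic, payload, None)])
        while queue:
            event = queue.popleft()
            trail.append(event)
            step = self._steps.get(event.topic)
            if step is None:
                continue  # terminal topic, e.g. delivery to the client
            result = step(event)
            if result is not None:
                next_topic, next_payload = result
                queue.append(
                    Event(uuid.uuid4().hex, next_topic, next_payload, event.event_id)
                )
        return trail


pipeline = Pipeline()
pipeline.on("prompt", lambda e: ("model", e.payload.strip()))
pipeline.on("model", lambda e: ("postprocess", f"echo: {e.payload}"))  # stand-in model
pipeline.on("postprocess", lambda e: ("client", e.payload.upper()))
trail = pipeline.run("prompt", "  hi ")
```

Because every event carries a `causation_id`, the returned trail can be walked backwards from the client-facing event to the original prompt, which is the property that makes end-to-end tracing and replay tractable.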

Trade-offs

Vendor lock-in: relying on a managed realtime layer creates a single-vendor dependency. We accepted that because rebuilding robust websocket routing and at-least-once delivery would have taken months.
Loss of micro-optimizations: our old in-process routing was slightly faster in microbenchmarks. The orchestration layer adds a little latency, but that latency is predictable and visible.
Cost vs. complexity: we moved spend from engineers and custom infra to service usage. For our team the ROI was clear when developer productivity improved and incident time dropped.

Mistakes to Avoid

Don't assume pub/sub equals persistence. Design for replays if you need them.
Avoid coupling business logic to socket connections. Treat sockets as ephemeral transport, not state stores.
Don't rely on client clocks for ordering or TTL enforcement. Use server-side sequence numbers and causation IDs.
Monitor backpressure metrics early. By the time errors are thrown on the consumer, it's already too late.
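
The last two points can be illustrated with a small Python sketch: sequence numbers assigned on the server side, and a lag check that raises a signal before consumers start throwing errors. `Sequencer`, `Signal`, `backpressure_signal`, and both threshold defaults are hypothetical; real thresholds depend on throughput and would need tuning.

```python
from collections import defaultdict
from enum import Enum


class Sequencer:
    """Assigns increasing per-stream sequence numbers on the server,
    so ordering never depends on client clocks."""

    def __init__(self) -> None:
        self._next: dict[str, int] = defaultdict(int)

    def assign(self, stream: str) -> int:
        seq = self._next[stream]
        self._next[stream] += 1
        return seq


class Signal(Enum):
    OK = "ok"
    SLOW_DOWN = "slow_down"
    SHED = "shed"


def backpressure_signal(
    produced_seq: int,
    consumed_seq: int,
    warn_lag: int = 1_000,   # illustrative threshold, not a recommendation
    shed_lag: int = 10_000,  # illustrative threshold, not a recommendation
) -> Signal:
    """Map consumer lag (newest assigned minus newest acknowledged sequence)
    to a signal that can be alerted on or fed back to producers."""
    lag = produced_seq - consumed_seq
    if lag >= shed_lag:
        return Signal.SHED
    if lag >= warn_lag:
        return Signal.SLOW_DOWN
    return Signal.OK
```

Exporting the raw lag as a metric alongside the signal makes it possible to alert on the trend rather than on the first consumer failure.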

Final Takeaway

Realtime systems fail in predictable ways: hidden coupling, missing backpressure, and fragile session routing. We learned the hard way that infrastructure overhead had become the real bottleneck.

Shifting orchestration and pub/sub responsibilities to a purpose-built realtime layer like DNotifier removed a lot of accidental complexity, made AI workflow coordination practical, and let us focus engineering energy on business logic instead of plumbing.

If you're building realtime AI pipelines or large-scale websocket systems, prioritize durable orchestration, clear event contracts, and observable backpressure. The small upfront cost of using a dedicated realtime orchestration service often pays for itself quickly in reliability and developer velocity.

Originally published on: http://blog.dnotifier.com/2026/05/17/what-broke-after-50m-realtime-events-rebuilding-the-orchestration-layer/