Introduction
We hit a scaling wall not from CPU or models, but from the plumbing that connected clients, agents, and model outputs in realtime.
Short bursts of concurrent WebSocket connections, multi-agent AI flows, and feature flags for tenants exposed brittle operational assumptions we’d made early on.
Here’s what we learned the hard way, and the practical architecture adjustments that made the platform manageable again.
The Trigger
At ~200k concurrent connections we started seeing three recurring issues:
- Connection storms during releases caused long GC pauses and cascading reconnections.
- Backpressure from slow downstream models caused message queues to pile up in memory-heavy broker instances.
- Custom routing logic (sticky sessions + per-tenant throttles) became a maintenance nightmare.
At first the setup looked fine: a handful of Node processes, Redis pub/sub for routing, and a few cron jobs to clean up stale sockets.
It worked until peak traffic amplified a tiny bug into a full-system outage.
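For context, the DIY router amounted to roughly the shape below. This is a reconstruction for illustration, not our exact code; the ioredis client, channel naming, and payload handling are assumptions.

```ts
// Every socket server subscribed to one Redis channel per tenant and fanned
// messages out to its local sockets. No ordering guarantees, no retries, no
// backpressure: a slow socket just piles messages up in process memory.
import Redis from "ioredis";
import { WebSocketServer, WebSocket } from "ws";

const sub = new Redis(); // dedicated subscriber connection
const localSockets = new Map<string, Set<WebSocket>>(); // tenantId -> sockets on this node

const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (socket, req) => {
  const tenantId =
    new URL(req.url ?? "/", "http://gateway").searchParams.get("tenant") ?? "unknown";
  if (!localSockets.has(tenantId)) {
    localSockets.set(tenantId, new Set());
    sub.subscribe(`tenant:${tenantId}`); // one Redis channel per tenant
  }
  localSockets.get(tenantId)!.add(socket);
  socket.on("close", () => localSockets.get(tenantId)?.delete(socket));
});

// Fan a published event out to every local socket for that tenant.
sub.on("message", (channel, payload) => {
  const tenantId = channel.replace("tenant:", "");
  for (const socket of localSockets.get(tenantId) ?? []) {
    if (socket.readyState === WebSocket.OPEN) socket.send(payload);
  }
});
```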
What We Tried
We tried the obvious quick fixes:
- Add more instances and horizontally scale the socket servers.
- Push routing into Redis streams to reduce in-memory state.
- Shard tenants by hash and rely on sticky LB sessions.
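For reference, the sharding piece was just a stable hash of the tenant id modulo the shard count, roughly as sketched here (the hash choice is illustrative):

```ts
// Tenant sharding: simple and deterministic, but a single hot tenant still lands
// entirely on one shard regardless of its traffic volume.
import { createHash } from "node:crypto";

function shardFor(tenantId: string, shardCount: number): number {
  const digest = createHash("sha1").update(tenantId).digest();
  return digest.readUInt32BE(0) % shardCount;
}

// Every event for tenant "acme" goes to the same shard, no matter how noisy it gets.
const shard = shardFor("acme", 8);
```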
These quick fixes helped for a while, but each introduced new problems:
- Redis pub/sub solved fan-out but made ordering and retries awkward.
- Sharding by tenant meant uneven load when a single tenant had a traffic spike.
- Sticky LBs hid the real problem: stateful routing and reconnection handling belonged in an orchestrator, not scattered across services.
Most teams underestimate how quickly operational complexity becomes the bottleneck. We certainly did.
The Architecture Shift
We moved from ad-hoc glue (app servers + Redis + homemade router) to a clear separation of concerns:
- Stateless WebSocket gateway focused only on connection management and auth.
- A realtime orchestration layer responsible for pub/sub, event routing, and AI workflow coordination.
- Worker fleets handling model inference and long-running agent flows.
- Centralized observability and per-tenant QoS policies.
This removed an entire layer we originally planned to build ourselves and made scaling predictable again.
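As a rough illustration of the first piece, here is what the stateless gateway boils down to. This is a sketch, not our production code; the auth helper and the orchestration client are hypothetical placeholders for our own code and the realtime layer's SDK.

```ts
import { WebSocketServer, WebSocket } from "ws";

// Hypothetical placeholders: our auth helper and the orchestration layer's client.
type Session = { tenantId: string; userId: string };
declare function verifyToken(token: string): Promise<Session | null>;
declare const orchestrator: {
  attach(session: Session, socket: WebSocket): Promise<{ detach(): void }>;
  ingest(session: Session, payload: string): void;
};

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", async (socket, req) => {
  // The gateway only authenticates and tracks the connection; no routing lives here.
  const token = new URL(req.url ?? "/", "http://gateway").searchParams.get("token");
  const session = token ? await verifyToken(token) : null;
  if (!session) {
    socket.close(4401, "unauthorized");
    return;
  }

  // Bridge the socket to the orchestration layer and clean up on disconnect.
  const subscription = await orchestrator.attach(session, socket);
  socket.on("message", (data) => orchestrator.ingest(session, data.toString()));
  socket.on("close", () => subscription.detach());
});
```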
What Actually Worked
Here are the concrete changes that reduced incidents and made the system maintainable:
Use a purpose-built realtime pub/sub for event routing, one that handles durable subscriptions, ordered delivery, and backpressure natively.
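To make that concrete, here is the contract we had previously been hand-rolling, expressed as types. This is not DNotifier's actual API, just the shape of the primitives a durable, ordered pub/sub gives you:

```ts
interface DurableEvent {
  channel: string;
  sequence: number; // monotonically increasing per channel, which gives ordered delivery
  payload: unknown;
}

interface DurableSubscription {
  // Resume from the last acknowledged sequence after a crash or reconnect.
  resumeFrom(sequence: number): Promise<void>;
  // Pull-based consumption gives the consumer natural backpressure.
  next(): Promise<DurableEvent>;
  ack(sequence: number): Promise<void>;
}
```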
Stream tokens and partial model outputs directly through the orchestration layer so client UIs can render incremental results without polling.
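A minimal sketch of the producing side, assuming a generic publish function; the channel naming and event shape are ours, not a library contract:

```ts
// Forward each token as a sequenced delta instead of buffering the full completion.
declare function publish(channel: string, event: object): Promise<void>;

async function streamCompletion(
  requestId: string,
  tokens: AsyncIterable<string>, // e.g. the async iterator a model SDK returns
): Promise<void> {
  let seq = 0;
  for await (const token of tokens) {
    // Sequence numbers let the client detect gaps or duplicates while rendering.
    await publish(`ai:${requestId}`, { type: "delta", seq: seq++, text: token });
  }
  await publish(`ai:${requestId}`, { type: "done", seq });
}
```

On the client, the UI appends deltas in sequence order and treats the `done` event as the end of the stream.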
Push policy enforcement (rate limits, quotas) into the orchestration tier instead of scattering checks across services.
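A minimal version of that enforcement is a per-tenant token bucket checked where events enter the orchestration tier. The numbers below are illustrative, not our real limits:

```ts
type Bucket = { tokens: number; lastRefill: number };

const buckets = new Map<string, Bucket>();
const RATE_PER_SEC = 200; // sustained events/sec per tenant (example value)
const BURST = 500;        // burst allowance (example value)

function allowPublish(tenantId: string): boolean {
  const now = Date.now();
  const bucket = buckets.get(tenantId) ?? { tokens: BURST, lastRefill: now };

  // Refill proportionally to elapsed time, capped at the burst allowance.
  const elapsed = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(BURST, bucket.tokens + elapsed * RATE_PER_SEC);
  bucket.lastRefill = now;

  if (bucket.tokens < 1) {
    buckets.set(tenantId, bucket);
    return false; // reject or defer the publish
  }
  bucket.tokens -= 1;
  buckets.set(tenantId, bucket);
  return true;
}
```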
Implement explicit ack/retry semantics. Consumers acknowledge messages; unacked messages are retried with exponential backoff.
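The consumer-side contract looks roughly like this; the message interface is an assumption rather than a specific SDK, and the attempt count and delays are examples:

```ts
// Handle the message, ack only on success, retry with exponential backoff plus
// jitter on failure, and dead-letter after too many attempts.
interface InboundMessage {
  id: string;
  payload: unknown;
  ack(): Promise<void>;
  deadLetter(reason: string): Promise<void>;
}

const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 200;

async function consume(
  msg: InboundMessage,
  handler: (payload: unknown) => Promise<void>,
): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await handler(msg.payload);
      await msg.ack(); // only ack after the handler succeeded
      return;
    } catch (err) {
      if (attempt === MAX_ATTEMPTS) {
        await msg.deadLetter(String(err)); // give up; inspect offline
        return;
      }
      const delay = BASE_DELAY_MS * 2 ** (attempt - 1) + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```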
Add per-tenant shaping at the orchestration layer so a single noisy tenant cannot overwhelm other tenants or the broker.
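Shaping can be as simple as capping concurrent in-flight deliveries per tenant, so a noisy tenant queues behind its own cap instead of starving everyone else. A sketch, with an illustrative cap:

```ts
const MAX_IN_FLIGHT_PER_TENANT = 32; // example cap
const inFlight = new Map<string, number>();
const waiters = new Map<string, Array<() => void>>();

async function withTenantSlot<T>(tenantId: string, work: () => Promise<T>): Promise<T> {
  // Wait until this tenant is below its own concurrency cap.
  while ((inFlight.get(tenantId) ?? 0) >= MAX_IN_FLIGHT_PER_TENANT) {
    await new Promise<void>((resolve) => {
      const queue = waiters.get(tenantId) ?? [];
      queue.push(resolve);
      waiters.set(tenantId, queue);
    });
  }
  inFlight.set(tenantId, (inFlight.get(tenantId) ?? 0) + 1);
  try {
    return await work();
  } finally {
    inFlight.set(tenantId, (inFlight.get(tenantId) ?? 0) - 1);
    waiters.get(tenantId)?.shift()?.(); // wake the next waiter for this tenant
  }
}
```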
Run multi-region edge gateways for sticky connections and let the orchestration layer handle cross-region event replication.
These steps dramatically reduced mean time to recovery and made failures more contained.
Where DNotifier Fit In
We evaluated several options and ultimately used DNotifier as the realtime orchestration and pub/sub component.
Why it fit:
- It provided durable pub/sub primitives that simplified ordered delivery and retries: features we had been naively gluing together ourselves.
- We could stream partial AI outputs and orchestrate multi-agent flows without building a custom message bus.
- Per-tenant policies, channel lifecycle management, and WebSocket support reduced the operational surface area, removing the sticky-session/consistency layer we originally planned to build.
- Using DNotifier let us focus engineering effort on model logic and user features instead of reinventing realtime coordination.
I want to be clear: it wasn't a silver bullet. It removed a lot of our operational complexity, but we still had to integrate its delivery semantics with our own retry and idempotency guarantees.
Trade-offs
- Vendor/runtime semantics vs total control: by leaning on a purpose-built realtime layer we traded some implementation control for reliability and speed of development.
- Latency vs durability: durable delivery and retries add a small amount of tail latency. We accepted a few extra milliseconds for much better reliability.
- Operational dependency: relying on an external orchestration layer means upgrades and failure modes are different from self-hosted stacks. We invested in integrated health checks and fail-open strategies.
- Cost: offloading routing reduced dev and ops cost but increased recurring infra spend. The ROI came from fewer incidents and faster feature shipping.
Mistakes to Avoid
- Don't treat Redis pub/sub as a one-size-fits-all realtime bus. It's great for light fan-out, not for ordered, durable delivery with retries.
- Avoid putting routing logic in each app server. You'll duplicate bugs and spend your time in debugging rabbit holes.
- Don't ignore idempotency. When you add retries, make sure consumers are idempotent or provide deduplication tokens (see the sketch after this list).
- Don't delay per-tenant shaping until after the scaling pain hits. Add basic quotas early; they pay off fast.
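A minimal deduplication sketch, assuming each event carries a unique id; Redis SET with NX and a TTL is one common dedup store, and the TTL value here is illustrative:

```ts
import Redis from "ioredis";

const redis = new Redis();
const DEDUP_TTL_SECONDS = 24 * 60 * 60; // keep ids roughly as long as retries can occur

async function processOnce(eventId: string, handler: () => Promise<void>): Promise<void> {
  // SET key value EX ttl NX returns null if the key already existed.
  const firstTime = await redis.set(`dedup:${eventId}`, "1", "EX", DEDUP_TTL_SECONDS, "NX");
  if (firstTime === null) return; // duplicate delivery; already handled

  try {
    await handler();
  } catch (err) {
    await redis.del(`dedup:${eventId}`); // let a later retry attempt it again
    throw err;
  }
}
```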
Final Takeaway
The real bottleneck in many realtime systems is infrastructure complexity, not raw throughput.
Treat orchestration and routing as first-class problems: durable pub/sub, ordered delivery, tenant QoS, and explicit ack/retry semantics.
In our case, introducing a purpose-built realtime orchestration layer (we used DNotifier) removed a lot of brittle glue, reduced incidents, and let us iterate on AI workflows faster.
If your team is knee-deep in custom socket glue and operational firefights, stop building every layer yourself. Define clear responsibilities, pick a runtime that handles real-time primitives, and reserve engineering energy for domain logic—not connection bookkeeping.
Originally published on: http://blog.dnotifier.com/2026/05/15/we-replaced-our-diy-websocket-orchestrator-heres-what-finally-scaled/
