When Virat Kohli walks to the crease, traffic on a cricket scoring app doesn't climb gradually — it spikes vertically. One moment you have 5,000 connected users, three minutes later you have 120,000, and every single one wants a push notification on the next ball. That graph broke our first attempt at real-time at Xenotix Labs. Here's what we learned rebuilding it.
The naive stack (don't do this)
Our first iteration: one Node.js process running socket.io, every connected client subscribed to every live match. It worked beautifully at 2,000 concurrent connections. At 15,000 it started dropping heartbeats. At 40,000 the event loop lag crossed 3 seconds and reconnection storms made everything worse.
Lessons from the ashes:
- A single Node process caps out somewhere between 20k and 40k sockets, depending on what else the event loop is doing.
- Broadcasting to all clients from a single process is O(N) per event: one hot match drives the whole loop.
- Reconnection storms are real. When you restart a gateway, every disconnected client reconnects within ~2 seconds, a self-inflicted DDoS.
The architecture that held
We rebuilt around three principles. First, WebSocket gateway nodes are dumb and stateless — they only hold connections and forward messages, no business logic. Second, Redis pub/sub is the bus — every gateway subscribes to Redis channels keyed by match_id; score updates are published once and every gateway fans out to its own connections. Third, sticky sessions on the ALB — client reconnects to the same gateway via cookie, so we don't thrash connection state.
The flow: score provider → ingest worker → Redis PUB match:123 → N gateways SUB match:123 → WS push to clients. Scaling is now horizontal: add gateway nodes, Redis fans out. A single Redis cluster handles hundreds of thousands of pub/sub messages per second.
Delta, not snapshots
Every WebSocket message is a delta, not a full state refresh. When a ball is bowled we push {over: 14.3, runs: 4, batsman: "Kohli"}, not the whole scorecard. Why: at 120k connections, a 200-byte delta vs. a 4KB snapshot is the difference between 24 MB and 480 MB of outbound traffic for every ball, spread across the gateway fleet. That changes what instance sizes you need.
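The client-side merge can be sketched like this. The delta fields follow the example payload above; the full scorecard shape (`totalRuns`, `lastBall`) is an assumption for illustration:

```javascript
// Client-side delta merge: apply a per-ball delta to the local scorecard
// instead of replacing the whole state.
function applyDelta(scorecard, delta) {
  return {
    ...scorecard,
    over: delta.over,
    totalRuns: scorecard.totalRuns + delta.runs,
    lastBall: { runs: delta.runs, batsman: delta.batsman },
  };
}

// Example: 112 runs after 14.2 overs, then Kohli hits a four.
const next = applyDelta(
  { over: 14.2, totalRuns: 112 },
  { over: 14.3, runs: 4, batsman: 'Kohli' }
);
```

If a delta is ever missed, the client falls back to fetching a full snapshot, which is exactly the resync path described below.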
Backpressure and slow clients
A real production killer: a mobile client on 2G takes 8 seconds to ACK each message. If you don't handle this, the server buffers pending messages in memory, and eventually that buffer OOMs your Node process. Our rule: if a client hasn't ACKed in 5 seconds, drop the oldest queued messages and send a "resync" event. The client re-fetches the full scorecard from a REST endpoint and resumes the WebSocket. Trades a small UX hiccup for server stability.
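The drop-oldest-plus-resync policy, sketched as a per-client send queue. The 5-second ACK window is the rule from the text; the `MAX_QUEUE` cap, class name, and API are illustrative assumptions:

```javascript
const ACK_TIMEOUT_MS = 5000; // the 5-second ACK rule from the text
const MAX_QUEUE = 100;       // assumption: cap chosen per memory budget

// Per-client outbound queue: if the client stalls, drop the oldest
// message and flag that it must resync via REST instead of OOMing us.
class ClientQueue {
  constructor(now = Date.now()) {
    this.queue = [];
    this.lastAckAt = now;
    this.needsResync = false;
  }
  ack(now = Date.now()) {
    this.lastAckAt = now;
    this.needsResync = false;
  }
  push(msg, now = Date.now()) {
    const stalled = now - this.lastAckAt > ACK_TIMEOUT_MS;
    if ((stalled && this.queue.length > 0) || this.queue.length >= MAX_QUEUE) {
      this.queue.shift();      // drop the oldest queued message
      this.needsResync = true; // client must re-fetch the full scorecard
    }
    this.queue.push(msg);
  }
}
```

When `needsResync` flips, the server emits the "resync" event and the client hits the REST scorecard endpoint before resuming deltas.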
Reconnection jitter
When a gateway restarts, add random 0–5 second jitter to the client's reconnect delay. Without it, all N clients reconnect simultaneously and crush the ALB. With it, the load spreads smoothly. On the server side, drain gateways gracefully: ALB stops sending new connections, existing connections finish their current messages, then the process exits. Rolling deploys become a non-event.
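The client-side delay calculation is a one-liner; a sketch with the 0–5 second jitter from the text, where the backoff base and 30-second cap are illustrative assumptions:

```javascript
// Reconnect delay: capped exponential backoff plus 0-5 s of random jitter,
// so a gateway restart doesn't make every client reconnect at once.
function reconnectDelayMs(attempt, rand = Math.random) {
  const base = Math.min(1000 * 2 ** attempt, 30_000); // capped backoff
  const jitter = rand() * 5000;                       // 0-5000 ms spread
  return base + jitter;
}
```

Injecting `rand` keeps the function testable; in production you'd just use `Math.random`.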
Monitoring: three numbers matter
Forget fancy dashboards. Three numbers tell you if real-time is healthy: event loop lag on each gateway (p99 under 50 ms, always), connection count per gateway (under 25k each), Redis pub/sub fan-out latency (time from PUB to last gateway receive, under 100 ms). If any of those drift, rebalance or scale before users notice.
What we'd do differently
Use uWebSockets.js from the start — it's ~5x more efficient than socket.io for raw WebSocket throughput. We migrated mid-project and regretted not doing it day one. Build a load-shedding mechanism earlier: when the system is overloaded, drop low-priority events ("commentary") before high-priority ones ("wicket") — don't treat all messages equally. Test with airplane-mode and 2G emulation — most WebSocket bugs appear during bad-network transitions, not at steady state.
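The load-shedding idea can be sketched in a few lines. The priority ordering (wicket above commentary) follows the text; the event names, tiers, and queue cap are illustrative assumptions:

```javascript
// Priority-based load shedding: when the outbound queue is over budget,
// drop whole low-priority tiers first, preserving chronological order.
function shed(queue, maxLen) {
  let q = queue;
  for (const tier of ['commentary', 'score']) {
    if (q.length <= maxLen) break;
    q = q.filter((e) => e.type !== tier); // shed the lowest tier entirely
  }
  return q.slice(-maxLen); // still over budget: keep only the most recent
}
```

Filtering tiers rather than sorting keeps events in arrival order, so a client that survives shedding still sees a coherent timeline.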
Stack summary
- Gateway: Node.js + uWebSockets.js, containerized on ECS
- Bus: Redis pub/sub on ElastiCache
- Ingestion: Node.js worker, consuming from the score provider
- Client: Flutter + Next.js with delta-merge logic
- Load balancer: AWS ALB with sticky sessions
Building a real-time product?
Whether it's live sports, collaborative editing, trading platforms, or real-time dashboards — scaling WebSockets is a discipline with sharp edges. If you're building in this space, Xenotix Labs has shipped real-time stacks that survive match-day India traffic. Reach out at https://xenotixlabs.com.