DEV Community

JSGuruJobs


7 WebSocket Scaling Patterns That Let Node.js Handle 1M Real-Time Connections

Most WebSocket demos die at 10,000 connections. Not because Node.js cannot handle more, but because the architecture is wrong.

Here are 7 production patterns that take you from “it works locally” to “it survives 1 million concurrent connections”.


1. Single Instance Chat → Redis Pub/Sub Fan-Out

A single ws server works until you add a second instance behind a load balancer.

Before (single instance only):

import { WebSocketServer, WebSocket } from 'ws';

const wss = new WebSocketServer({ port: 3000 });
const clients = new Set<WebSocket>();

wss.on('connection', (ws) => {
  clients.add(ws);

  ws.on('message', (msg) => {
    for (const client of clients) {
      if (client.readyState === ws.OPEN) {
        client.send(msg.toString());
      }
    }
  });

  ws.on('close', () => clients.delete(ws));
});

Deploy this across 3 instances and messages disappear. Each process only knows about its own connections.

After (Redis pub/sub bridge):

import { WebSocketServer, WebSocket } from 'ws';
import { createClient } from 'redis';

const wss = new WebSocketServer({ port: 3000 });

const pub = createClient({ url: process.env.REDIS_URL });
const sub = createClient({ url: process.env.REDIS_URL });

await pub.connect();
await sub.connect();

const localClients = new Set<WebSocket>();

wss.on('connection', (ws) => {
  localClients.add(ws);

  ws.on('message', async (msg) => {
    await pub.publish('chat', msg.toString());
  });

  ws.on('close', () => localClients.delete(ws));
});

await sub.subscribe('chat', (message) => {
  for (const client of localClients) {
    if (client.readyState === client.OPEN) {
      client.send(message);
    }
  }
});

Now any instance can publish. All instances broadcast locally. Horizontal scaling unlocked.


2. No Sticky Sessions → Session Affinity at the Load Balancer

Without sticky sessions, reconnects land on different instances and break in-memory state.

Before: default round-robin load balancer.

After: enable session affinity.

Nginx example:

upstream websocket_backend {
  ip_hash;
  server app1:3000;
  server app2:3000;
  server app3:3000;
}

Or use cookie-based affinity on managed platforms.

This keeps a client pinned to one instance, which drastically reduces cross-node coordination overhead.


3. No Heartbeat → Ping/Pong Cleanup

Dead connections silently consume memory and file descriptors.

Before:

wss.on('connection', (ws) => {
  // no liveness tracking
});

This leaks connections on mobile network drops.

After:

const INTERVAL = 30000;

wss.on('connection', (ws) => {
  let alive = true;

  ws.on('pong', () => {
    alive = true;
  });

  const heartbeat = setInterval(() => {
    if (!alive) {
      ws.terminate();
      return;
    }
    alive = false;
    ws.ping();
  }, INTERVAL);

  ws.on('close', () => clearInterval(heartbeat));
});

Without this, you will eventually crash under load. Every production server needs it.


4. Blocking the Event Loop → Worker Threads for Heavy Work

JSON parsing, crypto, compression: do that work on the main thread and latency spikes for every connection.

Before:

ws.on('message', (data) => {
  const parsed = JSON.parse(data.toString());
  const result = expensiveOperation(parsed);
  ws.send(JSON.stringify(result));
});

This blocks the event loop.

After (worker threads):

import { Worker } from 'worker_threads';

function runWorker(payload: unknown) {
  return new Promise((resolve, reject) => {
    // Note: spawning a Worker per message is expensive;
    // in production, reuse workers via a pool.
    const worker = new Worker('./worker.js', {
      workerData: payload
    });

    worker.on('message', resolve);
    worker.on('error', reject);
  });
}

ws.on('message', async (data) => {
  const parsed = JSON.parse(data.toString());
  const result = await runWorker(parsed);
  ws.send(JSON.stringify(result));
});

In production, use a worker pool like piscina. This isolates CPU spikes from your real-time loop.
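To show the shape of the pooled approach without pulling in a dependency, here is a minimal fixed-size pool built on Node's own worker_threads. It is a sketch of what a library like piscina does for you; the inline worker script and the summing task are illustrative stand-ins for your real CPU-bound work.

```typescript
import { Worker } from 'worker_threads';
import { cpus } from 'os';

// Inline worker (CommonJS, via eval): runs a simulated expensive task.
const workerScript = `
const { parentPort } = require('worker_threads');
parentPort.on('message', ({ id, n }) => {
  // Simulated CPU-bound work: sum 1..n.
  let sum = 0;
  for (let i = 1; i <= n; i++) sum += i;
  parentPort.postMessage({ id, result: sum });
});
`;

class WorkerPool {
  private workers: Worker[] = [];
  private next = 0;
  private pending = new Map<number, (result: number) => void>();
  private seq = 0;

  constructor(size = cpus().length) {
    for (let i = 0; i < size; i++) {
      const w = new Worker(workerScript, { eval: true });
      w.on('message', (msg: { id: number; result: number }) => {
        this.pending.get(msg.id)?.(msg.result);
        this.pending.delete(msg.id);
      });
      this.workers.push(w);
    }
  }

  run(n: number): Promise<number> {
    const id = this.seq++;
    return new Promise((resolve) => {
      this.pending.set(id, resolve);
      // Round-robin dispatch spreads tasks across workers.
      this.workers[this.next].postMessage({ id, n });
      this.next = (this.next + 1) % this.workers.length;
    });
  }

  async close() {
    await Promise.all(this.workers.map((w) => w.terminate()));
  }
}
```

The workers are created once and reused, so the per-message cost is a postMessage round trip instead of a process-like spawn.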


5. Stateless Reconnect → Exponential Backoff With Jitter

If 20,000 clients reconnect at once, your server never recovers.

Before:

ws.onclose = () => {
  setTimeout(connect, 1000);
};

All clients retry at the same time.

After:

function reconnect(attempt: number) {
  const base = 1000;
  const max = 30000;

  const delay = Math.min(base * 2 ** attempt, max);
  const jitter = Math.random() * 1000;

  // `connect` should re-open the socket and, on failure,
  // call reconnect(attempt + 1)
  setTimeout(connect, delay + jitter);
}

Backoff plus jitter prevents the thundering herd problem.
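For completeness, here is one way to wire the policy into a client, assuming a browser-style global WebSocket; the url is a placeholder, and `nextDelay` is kept pure so the backoff math can be tested in isolation.

```typescript
// Pure backoff policy: exponential growth capped at `max`, plus random jitter.
export function nextDelay(
  attempt: number,
  base = 1000,
  max = 30000,
  jitterMax = 1000
): number {
  const delay = Math.min(base * 2 ** attempt, max);
  return delay + Math.random() * jitterMax;
}

// Illustrative wiring: reset the attempt counter on a successful open.
export function connectWithBackoff(url: string): void {
  let attempt = 0;

  function connect(): void {
    const ws = new WebSocket(url);

    ws.onopen = () => {
      attempt = 0;
    };

    ws.onclose = () => {
      setTimeout(connect, nextDelay(attempt++));
    };
  }

  connect();
}
```

Resetting `attempt` on open matters: without it, a client that has failed many times in the past keeps waiting 30 seconds to recover from every future blip.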


6. Using WebSockets for Everything → Event Bus Architecture

A common mistake is replacing your REST API with WebSockets.

That does not scale cleanly.

Before:

ws.on('message', async (msg) => {
  const data = JSON.parse(msg.toString());
  const result = await db.query('SELECT * FROM users WHERE id = ?', [data.id]);
  ws.send(JSON.stringify(result));
});

Your WebSocket layer now owns business logic and database access.

After:

// REST handles writes
app.post('/update-user', async (req, res) => {
  const user = await updateUser(req.body);
  await redis.publish('user.updated', JSON.stringify(user));
  res.json(user);
});

// WebSocket layer only pushes events
// (node-redis needs a dedicated connection in subscriber mode)
await redis.subscribe('user.updated', (msg) => {
  broadcastToRoom('users', msg);
});

WebSockets become a delivery layer for events. Not your query engine.
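The `broadcastToRoom` helper above is left undefined; a minimal room registry might look like this. Sockets are duck-typed here so the logic is testable without a live server, and the names are illustrative.

```typescript
// Minimal room registry. `Sendable` is the small surface we need from ws.
type Sendable = { readyState: number; send(data: string): void };
const OPEN = 1; // matches WebSocket.OPEN

const rooms = new Map<string, Set<Sendable>>();

export function joinRoom(room: string, ws: Sendable): void {
  let members = rooms.get(room);
  if (!members) {
    members = new Set();
    rooms.set(room, members);
  }
  members.add(ws);
}

export function leaveRoom(room: string, ws: Sendable): void {
  rooms.get(room)?.delete(ws);
}

export function broadcastToRoom(room: string, msg: string): void {
  for (const ws of rooms.get(room) ?? []) {
    if (ws.readyState === OPEN) ws.send(msg);
  }
}
```

Call `leaveRoom` from each socket's close handler, or the room sets leak connections the same way unmonitored sockets do in pattern 3.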

If you are designing large JavaScript systems, this separation mirrors the event-driven architecture patterns discussed in the JavaScript application architecture system design guide.


7. Ignoring OS Limits → Raising File Descriptors

You will hit the OS limit before you hit CPU.

Check:

ulimit -n

Typical default: 1024.

Raise it:

ulimit -n 100000

Persist in /etc/security/limits.conf:

* soft nofile 100000
* hard nofile 100000

Each WebSocket connection consumes one file descriptor. At scale, this matters more than your Node.js code.
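It also helps to watch descriptor usage from inside the process so you see pressure before the limit bites. On Linux this can be done by counting entries in /proc/self/fd; the helper name here is illustrative.

```typescript
import { readdirSync } from 'fs';

// Counts file descriptors currently open in this process (Linux only).
export function openFdCount(): number {
  return readdirSync('/proc/self/fd').length;
}

// Example: log descriptor pressure periodically alongside connection counts.
// setInterval(() => console.log('open fds:', openFdCount()), 10_000);
```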


What Actually Gets You to 1 Million Connections

It is not a magic library.

It is:

  • Horizontal scaling with Redis or a coordinator
  • Sticky sessions
  • Heartbeat cleanup
  • Worker isolation for CPU work
  • Backoff on reconnect
  • Clear separation between API and event delivery
  • Proper OS tuning

Most teams fail at one of these and blame Node.js.

You can implement all 7 patterns in a weekend. The difference between a 5,000-connection demo and a 1,000,000-connection system is not syntax. It is architecture.

If you are building real-time systems, start by load testing at 1,000 connections locally. Measure latency. Watch file descriptors. Then scale horizontally before production forces you to.
