DEV Community

speed engineer

Posted on • Originally published at Medium

What Broke When We Pushed WebSockets From 100k to 1M Users

A post-mortem on OOM kills, GC pauses, and the slow consumers that ate our RAM.



We thought we had a leak. Turns out, we just didn’t know how to turn off the tap.

At 100k users, everything looked perfect. The dashboards were green, latency was flat, and we felt like geniuses. At 1M users, the exact same architecture started killing nodes like clockwork.

We were building a live commentary platform for a massive sports event. The premise was simple: ingest scores, push them to the browser. We tested it. We load-tested it. We thought we were ready.

Then the finals started.

The user count ticked past 300k, and latency jittered. By 600k, the alerts weren’t just pinging; they were screaming. By 800k, our nodes turned into zombies — connected, technically “alive,” but totally unresponsive — before being abruptly shot in the head by the Linux Out-Of-Memory (OOM) killer.

We didn’t just hit a bottleneck; we drove 100 miles an hour into a brick wall we didn’t even know was there.

Here is exactly what broke, the specific “obvious” design choice that doomed us, and how we redesigned the system to survive the night.

The Architecture: A Fan-Out Illusion

On paper, the design was standard. It’s the same diagram you’d draw on a whiteboard during a system design interview.

  • Ingest: A writer service pushes a score update (payload: ~500 bytes) to Redis Pub/Sub.
  • Hubs: A fleet of Go services subscribes to the Redis channel.
  • Fan-out: Each hub iterates through its local map of connected WebSockets and broadcasts the JSON payload.

Caption: Ideally, one message in leads to $N$ messages out. In reality, $N$ is variable, dangerous, and expensive.

It looked beautifully simple — until we realized this “fan-out” pattern behaved more like a “RAM hoarder.”

Here is the code we thought was fine. It’s standard Go concurrency: a buffered channel per client to prevent blocking the main loop.

```go
type Client struct {
	hub  *Hub
	conn *websocket.Conn
	// We gave every client a buffer. 256 slots seems generous, right?
	// If they lag a bit, this catches the overflow.
	send chan []byte
}

func (c *Client) writePump() {
	for {
		select {
		case message, ok := <-c.send:
			// If the channel is closed, we kill the socket.
			if !ok {
				c.conn.WriteMessage(websocket.CloseMessage, []byte{})
				return
			}

			// Here is the silent killer.
			// We assume this Write() is fast.
			// We assume the network is a pipe.
			// It is not.
			w, err := c.conn.NextWriter(websocket.TextMessage)
			if err != nil {
				return
			}
			w.Write(message)
			w.Close() // flushes the frame to the connection

			// If this write blocks because the user is on 3G...
			// We are stuck here. Meanwhile, the channel fills up.
		}
	}
}
```

We allocated a buffer of 256 messages for the send channel. We reasoned, "256 messages is plenty. If a client lags, they have a buffer. If they lag too much, they’ll catch up."

This was the lie.

The Symptom: The Silent Memory Creep

When the user count climbed, we expected CPU usage to rise. JSON serialization is expensive, right? But the CPU wasn’t the bottleneck. It was barely sweating.

The graphs showed memory rising in a sawtooth pattern. It would creep up, hit the container limit, the node would crash (OOM), restart, and repeat.

I stared at the heap profiles for an hour while the production channels were on fire. “It’s the GC,” I yelled to anyone listening. “The Garbage Collector is thrashing because we have too many pointers! We need to optimize the struct alignment!”

I was wrong. I was looking at what was using memory, not why.

I thought the problem was the number of connections. It wasn’t. It was the speed of the connections.

The Misconception: The Slow Consumer

Here is the trap: The internet is not a uniform pipe.

In our load tests, we used EC2 instances to simulate clients. These “users” had gigabit bandwidth. They could read data as fast as we could write it.

In the real world, 15% of our users were on cellular networks in crowded stadiums. Another 20% were on dodgy home Wi-Fi where the microwave kills the signal.

When you “write” to a TCP socket, you aren’t hitting the network — you’re dumping data into the kernel’s send buffer. Fast clients empty it quickly by ACKing packets. But for a user on flaky 3G, the ACKs don’t come fast enough. The kernel buffer fills up.

Once that kernel buffer is full, the Write call in our application blocks (under the hood, the non-blocking socket returns EAGAIN and the Go runtime parks the goroutine until there’s room).

Because we were using buffered channels in Go (c.send), our main "pump" loop was flinging messages into these channels. For a fast client, the channel stays empty. For a slow client, the channel fills up to 256 messages instantly.

The Math That Killed Us

Let’s look at the numbers.

  • Payload: ~1KB (the ~500-byte update plus framing and protocol overhead).
  • Buffer size: 256 messages.
  • Slow consumers: ~200,000 (20% of 1M).

At a cluster level, that’s:

$$200,000 \times 256 \times 1\text{KB} \approx 51.2\text{GB}$$

That sounds manageable distributed across a cluster, right? But looking at it that way hides the density.

Spread across 8 nodes, that’s ~6.4GB of just queued messages per node — on top of connection structs, heap fragmentation, and the kernel’s own TCP buffers (which aren’t cheap). Our 16GB nodes were quietly being eaten alive by stale sports data.

We weren’t experiencing a traffic spike; we were experiencing a backpressure failure. We were holding onto gigabytes of data for users who were 45 seconds behind reality.

The Fixes: Discipline and Brutality

To survive the night (and the next match), we had to stop being polite. We needed to treat the system not as a storage engine, but as a pressure valve.

1. The “Ruthless” Send (Drop-If-Full)

The first change was identifying that real-time data has an expiration date.

In a chat app, every message matters. In a live sports dashboard, “Goal for Team A!” matters. But “Time: 12:01”, followed by “Time: 12:02”, renders the previous message useless. If a client is 10 seconds behind, sending them the second-by-second updates for the last 10 seconds is a waste of bandwidth.

We changed our send logic from a blocking queue to a “drop-if-full” pattern.

```go
// The "Ruthless" Send
func (c *Client) sendUpdate(msg []byte) {
	select {
	// Try to push to the channel...
	case c.send <- msg:
		// Success! The client is keeping up.

	// If the buffer is full, we hit the default case immediately.
	default:
		// The buffer is full. The client is a slow consumer.
		// We have two choices: drop the message, or kill the client.

		// We chose violence.
		c.conn.Close()
	}
}
```

Wait, we closed the connection?

Yes. If a client is so slow that they have filled their buffer, they are already looking at stale data. Keeping them connected consumes resources (file descriptors, heartbeats, memory) for zero value. It is better to cut them loose and let the client-side logic attempt a reconnect (hopefully with better network conditions) or fall back to long-polling.

Constraints force clarity. By enforcing a hard limit, we capped our memory usage.

2. Message Coalescing

Dropping connections is a nuclear option. We needed a middle ground for users who were slightly slow but not dead.

We realized that our fan-out loop was dumb. It looked like this:

Redis Message -> Loop users -> Write to every user’s channel

If we generated 50 updates a second (clock ticks, player positions), we tried to push 50 updates to the client.

We introduced a Coalescing Buffer. Instead of a queue of raw messages, the client struct held a pointer to the latest state.

When the write pump wakes up:

  • Check pending state.
  • Write one aggregated message.
  • Sleep until the socket is ready again.

This decoupled the production rate from the consumption rate. The backend could generate 100 updates/sec. If the client could only read 5 updates/sec, they just got 5 sample snapshots. They missed the intermediate animation frames, but they never fell behind on the actual score.

The Realization

The pivotal moment came when I looked at the Grafana dashboard after deploying the “drop-if-full” logic.

The memory line, which used to look like a jagged mountain range, turned into a flat prairie.

Watching that line flatten out was embarrassing and relieving at the same time. We hadn’t been “under-provisioned”; we’d been sentimental. We tried to save every packet for every user, and in doing so, we crashed the platform for everyone.

Since then, every push system we design starts with one question: “Where do we cut people off when they fall behind?” Not “if” — where.

Gotcha: The Thundering Herd

Of course, fixing the memory introduced a new bug. That’s how it always goes, right?

Remember when I said we closed connections for slow consumers? When a big play happened (GOAL!), thousands of users on bad networks would get dropped simultaneously because the payload size spiked and their buffers filled up.

Their clients would immediately try to reconnect.

Suddenly, our NGINX ingress controller was hit with 50,000 handshake requests in a single second. This spiked the CPU on the load balancers, causing healthy users to time out, which made them reconnect. It was a self-sustaining DDoS.

The Fix: Random Jitter.

We had to patch the client-side JavaScript. Instead of:

```javascript
// The "please destroy my server" logic
socket.onclose = () => connect();
```

We did:

```javascript
socket.onclose = () => {
    // Add randomness so they don't all hit us at the exact same millisecond
    // 0 to 5000ms delay
    const delay = Math.random() * 5000;
    setTimeout(connect, delay);
};
```

This spread the reconnect storm over a 5-second window, making it digestible for the auth service and the WebSocket nodes.

The Result

By the finals, we were handling 1.2M concurrent users. Node memory stayed flat around ~6GB instead of racing to the 16GB ceiling and dying, and p99 latency dropped under 200ms. We didn’t add more servers; we just stopped hoarding useless data.

Closing

Scaling WebSockets isn’t about opening more sockets. It’s about managing queues.

When you move from request/response (HTTP) to streaming, you lose the natural flow control that HTTP gives you (where the server just waits for the next request). In WebSockets, the server is the aggressor. You are pushing data into a pipe that might be clogged.

If you don’t account for the clogged pipes, your memory becomes the buffer of last resort. And your memory is expensive.

Next time you build a push system, ask yourself: “What happens when the user goes into a tunnel?” If the answer isn’t “we drop the data,” you’re planning for a crash.


Enjoyed the read? Let’s stay connected!

  • 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
  • 💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
  • ⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.

Your support means the world and helps me create more content you’ll love. ❤️

Top comments (1)

Andre Cytryn

the slow consumer problem is so underdiagnosed. your load tests with EC2 clients hit a fundamental blind spot: uniform network conditions mask exactly the failure mode that kills you in production. the OOM sawtooth pattern you described is almost always slow consumers, not GC pressure.

one thing worth adding to the circuit breaker on top of drop-if-full: differentiate between messages by type before dropping. for a sports platform, score updates are idempotent and droppable, but session/auth messages aren't. keeping a tiny priority queue (2-3 slots) for critical control messages even when the buffer is full can avoid a class of subtle bugs where clients reconnect in a broken state. have you run into that edge case?