<devtips/>

Load Balancing 100,000 WebSockets (and Somehow Surviving It)

HAProxy, Nginx, and one bad idea that taught me more about scaling than any tutorial ever did.

When the Graph Went Vertical

It didn’t start with a bang. It started with graphs.
You know the kind: the “oh cool, traffic is up!” kind that turns into the “wait, why is everything red?” kind before your next sip of coffee.

The dashboards weren’t just spiking; they were ascending.
CPU at 98%. RAM evaporating like your confidence in distributed systems.
Connections kept climbing, long after they should’ve stopped.

That’s when it hit us: we were sitting on 100,000 live WebSocket connections.
Not “requests per second.” Not “unique users.” Actual, ongoing, needy little sockets that refused to hang up.

Each one was clinging to a thread of server memory like a cat dangling off a ledge, and we had a hundred thousand cats.

You think you understand scaling until you meet WebSockets.
Then you realize it’s not about throughput; it’s about emotional endurance.

Load balancing them isn’t “divide and conquer.” It’s “please don’t drop this one guy’s connection while 99,999 others are breathing fire.”
Every reconnection multiplied the chaos. Every retry was a punchline in a cosmic joke.

There’s a moment, right before your infra gives up, when everything still looks almost fine.
Then a single metric twitches, and your logs start confessing crimes you didn’t even know existed.

That’s the moment you stop blaming your code and start suspecting your load balancer.

WebSockets: The Sticky Kind of Trouble

WebSockets are like that one friend who says,
“Hey, I’ll just crash on your couch for a night,”
and six months later, they’re still there eating your RAM and using your CPU cycles.

That’s the difference between HTTP and WebSockets.
HTTP shows up, asks for something, and leaves.
WebSockets move in.
They unpack their state, start sending heartbeats, and expect you to be emotionally available.

With HTTP, you can scale requests horizontally: easy math, stateless bliss. With WebSockets, you’re scaling relationships.
You can’t just toss a connection to any random server; it needs to stay sticky, loyal, connected, because the second it jumps nodes, that connection dies faster than a junior dev’s dream after seeing Kubernetes YAML.

Sticky sessions sound simple.
In practice? They’re glue traps.
You spread traffic evenly, and suddenly one unlucky node is babysitting 20,000 chatty users,
while the others are chilling, sipping CPU cocktails.

You try to fix it with round robin, and now your connections are constantly jumping servers, dropping like flies.
Least connections? Great, until the ones that stay connected forever hog your entire pool.
IP hashing? Sure, until your users start coming through random CDNs and proxies.
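For the record, each of those strategies is a one-line change in HAProxy, which is exactly why they’re so tempting to cycle through (backend and server names below are made up for illustration):

```
backend ws_back
    balance roundrobin     # spreads new connects evenly, ignores lifetime
    # balance leastconn    # better for long-lived sockets, until the
    #                      # immortal ones hog the pool anyway
    # balance source       # IP hashing; falls apart behind CDNs/proxies

    # Cookie-based stickiness pins the WebSocket handshake to one node
    # and survives client IP changes, unlike source hashing.
    cookie SRV insert indirect nocache
    server ws1 10.0.0.11:8080 cookie ws1 check
    server ws2 10.0.0.12:8080 cookie ws2 check
```

The cookie only matters during the HTTP handshake; once the connection upgrades, TCP keeps it pinned to that node anyway, which is both the point and the problem.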

It’s not just routing anymore; it’s matchmaking. And every time you tweak your setup, half your users reconnect, half your pods restart, and your infra goes through another round of “let’s see who crashes first.”

Scaling WebSockets teaches you something profound:
Every persistent connection is a promise you have to keep, and your servers are terrible at keeping promises.

HAProxy: The Veteran That Gets It Done

HAProxy isn’t shiny.
It doesn’t have fancy dashboards or slick marketing pages with words like edge, mesh, or AI-powered load distribution.

It’s old-school, like that sysadmin who still uses vi and doesn’t trust the cloud, but can fix anything with a single command and a muttered curse.

So, naturally, we trusted it with our WebSocket apocalypse. And to its credit, it delivered.
HAProxy handled tens of thousands of persistent connections like a champ.
CPU load dropped, latency stabilized, and the room went quiet for the first time in days.
You could almost hear the servers breathe again.

But here’s the thing about HAProxy:
It’s powerful because it makes you earn it.

Want better performance?
You’ll tweak tune.bufsize, maxconn, and thread pinning until you start hearing phantom fans in your sleep.
Want to squeeze more out of your hardware?
You’ll stare at nbproc and ulimit -n like you’re decoding a dead language.
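To give a flavor of that negotiation, here’s roughly where those knobs live. The numbers are illustrative, not a recommendation, and the kernel has to agree: `maxconn` means nothing if `ulimit -n` caps file descriptors below it.

```
global
    maxconn 200000        # concurrent connections, which for WebSockets
                          # means concurrent users, not requests/sec
    nbthread 8            # threads in one process (nbproc is gone from
                          # modern HAProxy releases)
    tune.bufsize 16384    # per-connection buffer: memory ~ maxconn * bufsize

defaults
    mode http
    timeout client 1h     # long-lived sockets outlive the usual 30s defaults
    timeout server 1h
    timeout tunnel 1h     # the one that actually governs upgraded WebSockets
```

`timeout tunnel` is the easy one to miss: after the upgrade, HAProxy stops applying the client/server timeouts and uses this instead.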

It’s not plug-and-play; it’s plug-and-pray.

You don’t use HAProxy; you negotiate with it.
It’s a relationship built on patience, coffee, and a growing list of things you swore you’d never touch in production.

Still, for all its quirks, it’s one of the few tools that actually respects your connections.
It doesn’t flake out, doesn’t ghost users mid-stream.
Once you get it dialed in, it just… works.

If Nginx is the popular kid at the party,
HAProxy is the quiet one in the corner who’s already optimized the sound system, fixed the lights, and is now casually holding your entire network together.

Nginx: The Cool Kid Who Couldn’t Commit

On paper, Nginx looked perfect.
Everyone uses it. Everyone swears by it.
It’s fast, elegant, modern: the Swiss Army knife of web servers.

So, naturally, we thought, “Hey, if it handles millions of HTTP requests effortlessly, WebSockets should be a breeze, right?”
Yeah… about that.

HTTP and WebSockets might look similar from afar,
but one is a friendly handshake; the other is an eternal bear hug that slowly drains your soul.

At first, Nginx made everything look smooth.
The dashboards turned green, the metrics looked calm, the team started to breathe again.
Then… the weirdness began.

Connections started dropping silently. Clients would reconnect for no reason. Some sockets stayed alive but refused to transmit anything,
like ghost users haunting your system.

We combed through configs like archaeologists digging for lost wisdom: proxy_pass, the Upgrade and Connection headers, timeouts, buffering, the stream module.
Every fix we tried worked for a while… until it didn’t.
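For anyone walking the same path, the pieces that finally mattered for us fit in a few lines; the upstream and location names here are placeholders:

```nginx
# Forward the upgrade handshake explicitly, or the upgrade
# never reaches the backend and you get "ghost" sockets.
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

upstream ws_backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 80;

    location /ws/ {
        proxy_pass http://ws_backend;
        proxy_http_version 1.1;                          # WebSockets need HTTP/1.1
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_read_timeout 3600s;                        # default 60s reaps idle sockets
        proxy_send_timeout 3600s;
        proxy_buffering off;                             # don't buffer long-lived frames
    }
}
```

The timeouts are the sneaky part: `proxy_read_timeout` defaults to 60 seconds, so any socket that goes quiet for a minute gets silently reaped.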

The “stream” module seemed like a savior until it wasn’t.
We’d fix one bug, and three new ones would spawn.
It felt like playing Whac-A-Mole, except every mole was a 502 Bad Gateway.

Nginx is incredible at short-lived traffic.
It’s poetry for serving static files, a magician for reverse proxying APIs.
But when it comes to long-lived, chatty, unpredictable WebSocket connections?
It starts sweating.
You can almost hear it whisper, “Please don’t make me do this.”

We kept trying to make it work: tuning, retrying, hoping.
But eventually, we accepted the truth:
Nginx wasn’t broken. It was just too nice to say no.
It wanted to please everyone… and ended up pleasing no one.

The Custom Proxy That Should’ve Stayed an Idea

Every developer eventually hits that point.
The moment where, after fighting with every off-the-shelf tool,
you stare into the void and say:

“How hard could it be to just build my own?”

Those are the seven most dangerous words in software engineering.

So yeah, we built our own load balancer from scratch in Go.
Because obviously, if Google uses Go for networking,
we could totally whip up something that handles 100,000 WebSockets on a weekend.

And to be fair at first, it worked.

  • The logs were clean.
  • Connections were stable.
  • The CPU graph looked normal.
  • It was beautiful.

For about an hour.

Then, small ghosts started appearing.
A connection here, mysteriously timed out.
A goroutine there, stuck forever.
Memory usage climbing like a SpaceX launch.

We added metrics. Then logs. Then more logs.
Eventually, we had so many debug statements that the proxy spent more time writing about its pain than actually routing packets.

We found the leaks. Patched them.
Then new ones appeared, in different places,
like a hydra made of deadlocks.

At some point, someone on the team said,

“So… we basically reimplemented HAProxy, just worse?”
And yeah, that was the moment the caffeine wore off and reality set in.

Building a load balancer isn’t hard.
Building a reliable one is basically system-level dark magic.
You’re not just moving bytes; you’re juggling timeouts, thread contention, kernel buffers, and every weird corner case the TCP gods can throw at you.

It worked technically.
But emotionally?
It broke us.

We buried it quietly in the repo.
No postmortem, no announcement, no ceremony.
Just a commit message that read:

“Revert custom proxy. It knew too much.”

The Real Problem: It Wasn’t the Proxy

At some point, after enough caffeine and failure,
you stop blaming the tools.

  • You stop side-eyeing HAProxy.
  • You stop cursing at Nginx.
  • You even stop making jokes about your home-brewed disaster.

Because you finally realize it was never about the proxy.

The real problem was the architecture.

We weren’t balancing traffic; we were balancing state, stickiness, and trust.
And we’d been doing it wrong.

WebSockets aren’t like requests.
They’re not transactional, they’re emotional.
They stay open, they carry identity, and they expect memory, both yours and the server’s.

The moment we started designing around that truth, things changed.

We stopped trying to force everything through one load balancer.
We distributed connections across shards, each handling a slice of the world.
We moved session state to a shared store, so any node could pick up a user if another went down.
Redis, message queues, connection registries: suddenly they weren’t “extras.”
They were the foundation.

We started to think asynchronously, not reactively.
We built backpressure, health checks, and connection lifecycles
that made the system feel more alive and less fragile.
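The shared-registry idea is simple enough to sketch. Here a mutex-guarded map stands in for Redis, and the names (`SessionStore`, the shard IDs) are made up for illustration; the point is that session ownership lives outside any single node:

```go
package main

import (
	"fmt"
	"sync"
)

// Session records which shard currently owns a user's socket, plus a
// heartbeat timestamp so stale entries can be reaped.
type Session struct {
	Shard    string
	LastSeen int64
}

// SessionStore is the shape of the shared registry. In production this
// was Redis; a mutex-guarded map stands in for it here.
type SessionStore struct {
	mu       sync.RWMutex
	sessions map[string]Session // keyed by user ID
}

func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: make(map[string]Session)}
}

// Attach is called when a node accepts a WebSocket: it claims the user.
func (s *SessionStore) Attach(userID, shard string, ts int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[userID] = Session{Shard: shard, LastSeen: ts}
}

// Lookup tells a reconnecting client (or a failover node) where the
// session lives, so state survives a node death.
func (s *SessionStore) Lookup(userID string) (Session, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	sess, ok := s.sessions[userID]
	return sess, ok
}

func main() {
	store := NewSessionStore()
	store.Attach("user-42", "shard-3", 1700000000)
	if sess, ok := store.Lookup("user-42"); ok {
		fmt.Println(sess.Shard) // shard-3
	}
}
```

Once ownership is external, a dying node stops being a catastrophe: the user reconnects, any shard looks them up, and the session resumes.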

And when it finally stabilized, when the dashboards stopped flickering like horror-movie lights, we realized the ugly truth about scaling WebSockets:

You can’t fix architectural problems with config files.

You can tune all you want: tweak buffers, add CPUs,
rewrite YAML until your wrists give out. But if your design doesn’t respect how connections actually live and breathe,
you’re just delaying the crash.

HAProxy didn’t save us.
Neither did Nginx.
Our architecture did.

What Actually Worked (and What Didn’t)

When the dust finally settled, the graphs stopped screaming.
No more dropped connections. No more midnight Slack pings that start with “weird question…”

Everything was still, and it felt wrong, like the silence after a deploy that should have broken something but didn’t.

We stuck with HAProxy in the end.
It wasn’t perfect, but it was predictable, and predictability is gold when your infrastructure’s been through therapy.

Nginx went back to doing what it does best:
serving static files, routing HTTP traffic, looking pretty in blog posts.

The custom proxy?
It’s now archived in /graveyard/side_projects/, right next to “build_our_own_queue.go” and
“let’s_write_a_database_v2”.
A monument to youthful overconfidence.

But here’s the thing: that doomed proxy experiment taught us more about load balancing than success ever could.
It forced us to understand why HAProxy worked, not just that it did.
It exposed the real bottlenecks, the hidden assumptions, and the fact that
“working” in production is never just about code; it’s about empathy for your systems.

We learned to respect the layers, from kernel threads to socket buffers to the poor network engineer who warned us,
“don’t mess with stateful traffic unless you’re ready to suffer.”

He was right.

And yet, looking at those clean, steady graphs now,
you feel a strange kind of pride.
Not because it’s perfect (it never will be), but because it’s alive, stable, and finally understood.

Scaling WebSockets isn’t just an engineering challenge.
It’s a rite of passage.

You start thinking you’re optimizing connections.
But really, you’re learning patience, humility, and the quiet art of letting the right things stay persistent.
