DEV Community

speed engineer

Posted on • Originally published at Medium

How to Engineer a Single Backend Server for 5M Concurrent Connections

The invisible cost of scale isn’t your code — it’s every assumption the operating system makes about what “normal” looks like.



Your server didn’t run out of resources — it ran out of the kernel’s ability to track what “connected” even means.

We had 63,000 IoT devices connected when everything just… stopped growing. Not crashed. Not erroring. Just stuck at 63K like we’d hit some invisible ceiling.

New devices tried connecting. Connection refused. Tried again. Same thing. My graphs showed CPU at 15%, memory at 8GB of 64GB available. Nothing looked wrong. Everything felt wrong.

I spent three days convinced it was my code. Checked database pools — barely using them. Rewrote message handlers — no change. Suspected the load balancer — wasn’t it. That nightmare debugging where you start questioning whether you actually understand computers.

Found it eventually: ulimit -n showed 65,536. Every network connection in Linux is a file descriptor. Hit that limit and the kernel just stops accepting new ones. Some handshakes threw EMFILE errors my accept() loop didn't log. Clients got connection refused. My server thought everything was fine because the failures happened before my code could even see them.

The really stupid part? I’d set that limit myself six months earlier. Ran the command and felt generous about it. “64K file descriptors? Way more than we’ll need!” Past Me was so confident. Past Me was also completely wrong.

Why Our Problem Was Weird (And Why We Couldn’t Just “Scale Horizontally”)

Our workload was strange: millions of mostly-idle WebSockets staying open for weeks. Industrial IoT monitoring — temperature sensors in warehouses, humidity trackers in data centers, equipment monitors in factories. Each device held one connection, sent a JSON blob when readings changed, then went quiet again.

Most systems don’t work like this. They have HTTP requests that come and go, database queries, short bursts of activity. Our devices needed to feel always-connected because we were selling “real-time monitoring” not “check back in 30 seconds and hope nothing broke.”

Everyone said “just scale horizontally.” Run 100 servers at 50K connections each, load balance, done.

Okay, but here’s what that costs: a single c6i.8xlarge on AWS was roughly $800/month in our region. The 100-server approach, even on much smaller instances, came to about $8,000 monthly. Even if I’m off by 20–30%, that’s still roughly 10x more expensive.

Plus every extra server means more monitoring agents, more log pipelines, more DNS complexity, more TLS handshakes. The load balancer needs its own redundancy. Client code has to handle 100 different endpoints. We’d need to shard connections somehow — by device ID? Geography? What happens when devices move?

The math was clear. Getting there was not.

Two Weeks Optimizing the Wrong Thing

I did what any developer does when things break mysteriously: optimized application code. Rewrote handlers, swapped JSON for MessagePack, tuned connection pools that weren’t even hot. Every profiler said “you’re fine.” Every graph said “you’re fine.” The only thing broken was my understanding of what was actually happening.

Then one afternoon I’m reading old setup docs — you know, the documentation you write for Future You that Present You never reads — and I see a note about ulimits. Just randomly ran ulimit -n in production.

```
$ ulimit -n
65536
```

Oh. Oh.

Bumped it to 6 million. Restarted. Felt smart. Watched the server accept 200K connections then die in a completely new way.

The TCP accept queue was overflowing. There’s this kernel setting called somaxconn that controls how many completed connections can wait before your app processes them. Default is 128. At high connection rates that fills in milliseconds. Handshake completes, tries to join queue, queue's full, kernel drops it.

This happens after the three-way handshake but before my accept() sees it. Pure kernel-space rejection. No logs. No visibility. Just silent failure.
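The drops aren’t entirely untraceable, though: the kernel counts them in /proc/net/netstat, in the TcpExt ListenOverflows field (a header line of names paired with a line of values). A sketch of pulling that counter out, with listenOverflows as an illustrative helper:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// listenOverflows parses /proc/net/netstat content, which pairs a
// "TcpExt:" header line of field names with a "TcpExt:" line of values,
// and returns the ListenOverflows counter (accept-queue drops).
func listenOverflows(netstat string) (int64, bool) {
	lines := strings.Split(netstat, "\n")
	for i := 0; i+1 < len(lines); i++ {
		if !strings.HasPrefix(lines[i], "TcpExt:") {
			continue
		}
		names := strings.Fields(lines[i])
		values := strings.Fields(lines[i+1])
		for j := 1; j < len(names) && j < len(values); j++ {
			if names[j] == "ListenOverflows" {
				v, err := strconv.ParseInt(values[j], 10, 64)
				return v, err == nil
			}
		}
	}
	return 0, false
}

func main() {
	// Sample in the real file's format; in production, read the file itself.
	sample := "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n" +
		"TcpExt: 0 1234 1300\n"
	v, ok := listenOverflows(sample)
	fmt.Println(v, ok) // 1234 true
}
```

Graphing that counter during load tests is how the “silent” rejections stop being silent.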

That’s when I understood: the kernel has opinions about connection limits, and those opinions live in config files scattered across the system that nobody ever touches.

What Actually Fixed It (And Why These Numbers Matter)

Here’s what worked after weeks of testing. These are specific to our workload — your numbers will differ. But understanding why each matters is more important than the exact values.

File descriptor limits:

```
# sysctl (e.g. /etc/sysctl.conf)
fs.file-max = 6000000    # system-wide, needs headroom over target

# /etc/security/limits.conf: the per-process limit that bit us
* soft nofile 6000000
* hard nofile 6000000
```

Connection setup under load:

```
net.core.somaxconn = 4096            # completed handshakes waiting for accept()
net.ipv4.tcp_max_syn_backlog = 8192  # half-open connections during handshake
net.core.netdev_max_backlog = 5000   # NIC queue before kernel processing
```

We kept seeing “TCP: Possible SYN flooding” in dmesg until we hit 8192 for the syn backlog. At our test rate of 10K new connections per second, that’s where warnings stopped.

Buffer ceilings (not actual allocations):

```
net.core.rmem_max = 16777216             # 16MB max recv buffer
net.core.wmem_max = 16777216             # 16MB max send buffer
net.ipv4.tcp_rmem = 4096 87380 16777216  # min, default, max
net.ipv4.tcp_wmem = 4096 65536 16777216
```

First time I saw “16MB per socket” I panicked: 5M × 16MB = 80TB of RAM. But these are ceilings. TCP negotiates down based on actual usage. Our idle connections used about 4KB. During data bursts maybe 256KB. The 16MB max is for TCP autotuning under high throughput — rare for us but critical when it happens.
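A quick sanity check of that math, using the article’s numbers as illustrative constants (integer TiB/GiB, so the figures round down slightly):

```go
package main

import "fmt"

func main() {
	const connections = 5_000_000
	const ceiling = 16 * 1024 * 1024 // rmem_max: what TCP *may* grow a socket to
	const idle = 4 * 1024            // what our idle connections actually used

	// The panic scenario: every socket at its ceiling.
	fmt.Printf("ceiling math:  %d TB\n", connections*ceiling/(1<<40)) // 76 TB
	// The observed reality: autotuned-down idle sockets.
	fmt.Printf("observed idle: %d GB\n", connections*idle/(1<<30)) // 19 GB
}
```

The gap between those two numbers is the whole point: the sysctl sets a ceiling for autotuning, not a per-socket allocation.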

After a reboot (limits.conf only applies to new sessions, and rebooting was the simplest way to get one everywhere), the kernel could track millions. But my app still couldn’t handle them efficiently because I was thinking in blocking I/O terms.

Why epoll Changed Everything

Traditional I/O: call select() with your FD list, kernel iterates through every single one checking if data's ready. O(n) complexity.

At 5K connections it’s slow. At 500K your CPU spins checking the same dormant sockets: “Socket 47,382 ready? No. 47,383? No. 47,384? No…” while data waits in buffers.

epoll is different. Register FDs once with epoll_ctl(). Then epoll_wait() returns only active ones. Kernel uses a red-black tree and ready list—O(1) operations, you only wake when something needs attention.

Switched to epoll-based I/O and connection handling CPU dropped 75%. Same workload. Just asking the kernel differently.

Go hides epoll complexity, which is both great and dangerous:

```go
package main

import (
	"log"
	"net"
)

// Works at 10K-50K connections.
// At millions this kills you.
func acceptConnections(listener net.Listener) {
	for {
		conn, err := listener.Accept() // waits for connection
		if err != nil {
			log.Printf("accept error: %v", err) // log these!
			continue
		}
		go handleClient(conn) // goroutine per connection - doesn't scale
	}
}

func handleClient(conn net.Conn) {
	defer conn.Close()        // cleanup
	buf := make([]byte, 4096) // 4KB × 5M = 20GB in buffers alone
	for {
		n, err := conn.Read(buf) // yields to epoll internally
		if err != nil {
			return
		}
		processMessage(buf[:n])
	}
}

func processMessage(msg []byte) { /* application logic */ }
```

This works beautifully at tens of thousands. At millions that 4KB per connection is 20GB mostly sitting idle. Each goroutine takes a few KB of stack. You’re paying memory costs for idle resources.

What Go does is clever: conn.Read() doesn't block an OS thread. It registers with epoll and yields the goroutine. When data arrives, epoll wakes the scheduler which resumes that goroutine. This multiplexes hundreds of thousands of goroutines onto a handful of threads.

At 5M scale you need buffer pools and different patterns. The shape of the solution matters more than the exact implementation.
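The buffer-pool shape, at its simplest, looks like this. A minimal sketch using stdlib sync.Pool; the 4KB size is illustrative, and withBuffer is my name for the pattern:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out 4KB scratch buffers. Idle connections hold nothing;
// a buffer is borrowed only for the duration of one read.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 4096) },
}

// withBuffer borrows a buffer, runs fn, and returns the buffer to the
// pool, so memory scales with *active* connections, not total ones.
func withBuffer(fn func(buf []byte)) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	fn(buf)
}

func main() {
	withBuffer(func(buf []byte) {
		fmt.Println(len(buf)) // 4096
	})
}
```

With mostly-idle connections, the pool only ever holds roughly as many buffers as there are concurrently active reads, which is why the 20GB figure collapses.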

Three Runtimes, Three Different Pains

Python + asyncio : Fast development. But around 35K connections RSS hit 12GB and latency spiked past 200ms. Every connection is a Python object with overhead. The GIL becomes real. Could push further but I’d be fighting Python’s design.

Go : What shipped. At 5M connections we saw 50–100ms GC pauses every 30–40 seconds with default settings. After buffer pooling and tuning GOGC to 50, pauses dropped to 20–40ms. Noticeable but acceptable. We shipped on schedule.

Rust + Tokio : Built a prototype. Same 5M connections at 16GB RSS versus Go’s 23GB. P99 latency under 5ms even during connection ramps. Incredible performance — 3x longer development time. Spent two days debugging a use-after-free in my buffer pool. The borrow checker prevented production disasters but slowed everything down.

Rust was faster. Go shipped three weeks earlier. Shipping mattered more.

But what mattered most: SO_REUSEPORT.

```rust
// Multi-core scaling with SO_REUSEPORT (socket2 + tokio)
use socket2::{Domain, Socket, Type};
use tokio::net::TcpListener;

async fn start_server() -> std::io::Result<()> {
    let socket = Socket::new(Domain::IPV4, Type::STREAM, None)?;
    socket.set_reuse_port(true)?; // multiple processes, same port
    let addr: std::net::SocketAddr = "0.0.0.0:8080".parse().unwrap();
    socket.bind(&addr.into())?;
    socket.listen(4096)?; // matches somaxconn
    socket.set_nonblocking(true)?; // required before handing to tokio
    let listener = TcpListener::from_std(socket.into())?;
    loop {
        let (stream, _) = listener.accept().await?;
        tokio::spawn(async move { /* handle stream */ });
    }
}
```

Multiple processes bind to identical port. Kernel load-balances automatically. No userspace proxy, no contention. Run one per core, let the kernel distribute.

Before: one epoll loop handling 5M on one thread, 15 cores idle. After: 16 loops each managing ~300K, every core working.

The Memory That Disappeared

Hit 2M connections in testing. Looked good. Then: OOM error.

App RSS: 2.3GB. System memory: 59GB used. Where’s 57GB?

Spent a day hunting leaks. Ran Valgrind, logged allocations. My app really was only using 2.3GB. The memory wasn’t in user space.

Kernel was using it. Every TCP connection has kernel state: TCP state machine, buffers, routing info, connection tracking. From slabtop:

```
At 2M connections:
tcp_sock:      ~1.5KB each  →  ~3.0GB
inet_sock:     ~0.7KB each  →  ~1.4GB
nf_conntrack:  ~0.9KB each  →  ~1.8GB
```

Roughly 3.1KB per connection before buffers. At 2M that’s 6.2GB just for tracking state.

nf_conntrack killed me—we weren't using iptables meaningfully but connection tracking was on by default, eating 1.8GB for nothing.

Fix:

```
iptables -t raw -A PREROUTING -p tcp --dport 8080 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 8080 -j NOTRACK
```

Freed gigabytes instantly. Started tracking slabtop snapshots and tcp_sock/nf_conntrack counts in every load test. When those diverged from expectations, we investigated immediately.

Also fixed keepalive:

```
net.ipv4.tcp_keepalive_time = 300    # first probe after 5 min idle, not 2 hours
net.ipv4.tcp_keepalive_intvl = 30    # 30s between probes
net.ipv4.tcp_keepalive_probes = 3    # give up after 3 failed probes
```

Dead connections now clean up in about six and a half minutes (300s idle + 3 × 30s probes) instead of 2+ hours. At 5M, every zombie matters.

When Standard Tools Break

Hit 5M ESTABLISHED in staging. Server was fine — 19% CPU, stable memory.

Tried debugging one bad connection. Ran netstat. Terminal hung 90 seconds then dumped 5M lines. Tried ss—better but slow. Started tcpdump—drowned in noise.

Standard Unix tools assume hundreds or thousands of connections, not millions.

Wrote custom eBPF programs to observe specific patterns without drowning. Found a kernel 4.19 bug at ~4.7M connections — TIME_WAIT recycling caused random 30+ second hangs. Upgraded to 5.4, disappeared. Only found it through week-long A/B testing because symptoms were intermittent.

You develop strange patience debugging issues that only exist at scales your laptop can’t reproduce. Problems impossible to unit test, impossible to see locally, only visible in production under specific load.

But it worked: 100 servers ($8K/month, operational nightmare) became 3 servers with failover ($2,400/month, actually understandable). Devices stayed connected, dashboards updated real-time, costs dropped 70%.

What It Felt Like When It Worked

When Grafana showed 5,000,000 ESTABLISHED and CPU at 18%, it didn’t feel like winning a benchmark. It felt like finally understanding a conversation I’d been having wrong for months — learning what the kernel could actually do, what it couldn’t, and why.

Then we noticed the next bottleneck wasn’t connections. It was copying bytes. Processing millions of tiny messages means the kernel does shocking amounts of memcpy between buffers. That’s where zero-copy like sendfile() starts mattering, where you question if the kernel should touch hot data at all.

Past a certain scale you’re not building on top of the OS — you’re negotiating with it. Understanding its opinions, working within constraints, sometimes fighting assumptions about “normal” workloads. Exhausting and occasionally infuriating when kernel edge cases only surface under traffic patterns your laptop will never see.

But that night when Grafana showed 5,000,000 and 18% CPU, it stopped feeling like heroics. Started feeling like what it was: learning to speak the kernel’s language instead of shouting at it from user space.


Enjoyed the read? Let’s stay connected!

  • 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
  • 💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
  • ⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.

Your support means the world and helps me create more content you’ll love. ❤️
