DEV Community

Krun_pro

Your Unix Socket Stack Is Misconfigured. Here's What to Fix and Why.

You already switched from TCP to UDS and saw the first win — fair. You closed the ticket, merged the PR, called it a day. But if you haven't touched Unix domain socket configuration beyond the default path swap, you're leaving the real performance on the table — and running a half-tuned system that fails silently in ways that will only show up at 3am under real production load.

The default kernel and Nginx settings were not designed for 5k–10k RPS over a local socket. They were designed to not obviously break. Under controlled benchmarks — Linux 6.6, Node.js 20, Nginx 1.24, autocannon at 100 connections, 60-second measurement runs — UDS shows p50 latency of 0.31ms versus 0.48ms for TCP localhost. At p999 the gap widens to 59%: 3.8ms versus 9.2ms. That's not marketing. That's syscall reduction — 4 per request instead of 8–10, because sendmsg/recvmsg bypass the IP stack, checksum computation, and Nagle's algorithm delay entirely. But those numbers assume your stack is actually configured to use them. Most aren't.

Nginx unix socket keepalive: the formula everyone skips

The Nginx side is where most setups silently bleed performance. The fix sounds simple: set keepalive in the upstream block to 2× your Node worker count. Four Node workers means keepalive 8. The ×2 factor gives headroom for the burst window where new requests arrive while every pooled connection is still busy serving a previous one. Too low and you get connection churn and p99 spikes under burst. Too high and you're holding idle file descriptors that never get used, burning FD budget from your ulimit.

But here's the part that kills it silently: skip proxy_http_version 1.1 and the companion proxy_set_header Connection "", and every proxied request opens a brand new UDS connection regardless of your keepalive setting. HTTP/1.0 does not support persistent connections. Your keepalive pool exists on paper only. Full connection setup cost on every single request, zero log entries about it, zero 502s to alert you. The Nginx error log will eventually say worker_connections are not enough — but only if you know to look.
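Putting both halves together, a minimal config sketch might look like this — the upstream name, socket path, and worker count are illustrative placeholders, not values from the benchmark setup:

```nginx
# nginx.conf (fragment) — 4 Node workers → keepalive 8
upstream node_app {                      # hypothetical upstream name
    server unix:/run/app/app.sock;       # hypothetical socket path
    keepalive 8;                         # 2 × Node worker count
}

server {
    listen 80;

    location / {
        proxy_pass http://node_app;
        proxy_http_version 1.1;          # HTTP/1.0 has no persistent connections
        proxy_set_header Connection "";  # strip "Connection: close" so the pool is used
    }
}
```

Without the last two directives inside the location block, the keepalive line above does nothing — which is exactly the silent failure described above.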

Node.js cluster IPC socket: the broken pattern every tutorial shows

This one is obvious in retrospect and wrong in almost every guide you'll find. Multiple workers calling server.listen(sockPath) directly means only one worker successfully binds. The second worker to call bind() on an already-bound path gets EADDRINUSE and either crashes or fails silently, leaving you with one live worker and no indication anything is wrong. The socket file exists. Nginx connects. Requests flow — to one worker. Congratulations, your cluster is a single-threaded server with extra memory usage and the illusion of horizontal scale.

The correct pattern: the master process binds the socket, then passes the server handle to each worker over IPC with worker.send('server', serverHandle). One accept queue, one bound socket path, true OS-level load distribution: the kernel hands each accepted connection to whichever worker is free to take it. Benchmark difference at 5k RPS with 4 workers: the correct IPC pattern shows ~4× throughput and flat p99. The broken pattern shows 1× throughput with erratic p99 spikes from the single overloaded worker. Most tutorials skip this entirely.

net.core.somaxconn and ulimit: the kernel drops connections before your app even runs

Pass backlog: 2048 to server.listen() all you want. If net.core.somaxconn is still at 128 — the long-standing Linux default, raised to 4096 only in kernel 5.4 — the kernel silently clamps your backlog to that value. Connections beyond queue depth get ECONNREFUSED immediately — no stack trace, no Node.js error event, no log entry. They just disappear. Your load balancer sees dropped requests. Your application sees nothing at all.
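A sketch of the fix as a sysctl.d drop-in — the file name and value are illustrative; the value should match the backlog you actually pass to listen():

```
# /etc/sysctl.d/90-listen-backlog.conf
# Raise the kernel clamp above the application's requested backlog.
net.core.somaxconn = 2048
```

Apply it with `sudo sysctl --system` and confirm with `sysctl net.core.somaxconn`; on the Node side, the matching request is `server.listen({ path: sockPath, backlog: 2048 })`.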

Then there's ulimit -n 1024 — the per-process file descriptor ceiling that ships as default on most Linux distributions. A Node.js process at 1k concurrent connections needs roughly 1000 sockets plus internal FDs. You hit the wall around 980 connections and the process starts getting EMFILE. Node doesn't crash. It doesn't log. It just silently rejects new connections. Your monitoring shows nothing. Your users see timeouts. The fix is setting LimitNOFILE=65536 in your systemd unit — it propagates to all forked cluster workers automatically, which is exactly why the systemd unit is the right place and not /etc/security/limits.conf.
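A minimal systemd unit sketch of that fix — the unit name and paths are hypothetical:

```
# /etc/systemd/system/app.service — [Service] fragment only
[Service]
LimitNOFILE=65536
ExecStart=/usr/bin/node /srv/app/server.js
```

After `systemctl daemon-reload && systemctl restart app`, verify it took effect with `cat /proc/$(systemctl show -p MainPID --value app)/limits` — the "Max open files" row should read 65536, both in the master and in every forked worker.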

When unix socket performance tuning stops mattering — and how to find out fast

UDS wins on transport overhead. That's the only thing it wins on. The p50 latency advantage over TCP localhost is roughly 0.17ms. If your average request handler takes 2ms, you just optimized 8% of the problem. GC pauses exceeding 5ms, payloads above 512KB, a misconfigured accept queue — three scenarios where socket type is irrelevant and further socket tuning buys you exactly nothing.

The guide includes a two-minute strace -c workflow that confirms whether you're actually transport-bound before you spend an afternoon adjusting kernel buffer sizes. Attach to the running Node process, filter to sendmsg, recvmsg, epoll_wait, and accept4, and let it run for 10 seconds. If epoll_wait dominates at over 60% of syscall time, you're I/O bound and socket tuning helps. If syscall time is a thin slice and your own functions top a perf report instead, stop tuning the socket and go fix what actually dominates. Every config block in this guide is annotated. Every directive has a reason. If you can't explain why a line is there, it doesn't belong in a production config.
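That workflow could look like the following — it requires root and strace installed, and `pgrep -o node` (oldest matching process, i.e. the cluster master) is an assumption about your process tree:

```shell
# Attach for 10 seconds, counting only the syscalls on the socket path.
# -s INT makes timeout deliver SIGINT, which is what triggers strace's -c summary.
sudo timeout -s INT 10 \
  strace -c -f -p "$(pgrep -o node)" \
  -e trace=sendmsg,recvmsg,epoll_wait,accept4
```

Read the `% time` column of the summary: epoll_wait far ahead of everything else means the process is mostly waiting on I/O and transport tuning can still pay off; a small share means the time is going to your own JavaScript, and the socket is not your bottleneck.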
