Mehmet TURAÇ

Posted on Jun 4

Great Stack to Doesn't Work #5 — Linux: "Not a Kernel Panic, an Engineer Panic"

#linux #devops #backend #discuss

A survival guide for when everything goes wrong in production.

The system is slow. Not crashing, not failing — just slow. Response times are 10x normal. CPU usage looks fine. Memory looks fine. Disk looks fine. Every metric on the dashboard says "normal" but nothing feels normal.

The problem isn't in your application. It's three layers below, in kernel parameters you've never touched because the defaults "should be fine."

The defaults are fine for a laptop. They're not fine for a server handling 50,000 concurrent connections.

CPU: The Scheduler Isn't Always Fair

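Linux's default scheduler has long been CFS (the Completely Fair Scheduler); kernels 6.6 and later replace it with EEVDF, which keeps the same nice-value and cgroup model. Either way, it distributes CPU time proportionally across processes based on priority (nice values) and cgroup allocations. The scheduler is good at being fair. It's not always good at being fast.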

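Nice values range from -20 (highest priority) to 19 (lowest). Your application runs at nice 0 by default. A batch job someone started with nice -n 19 runs at the lowest priority: under contention it still gets a small slice of CPU (about 1.4% against a single nice-0 process on the same core), and it takes whatever is left when nothing else wants it.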

But nice values only matter under contention. If you have 16 cores and 8 processes, nice values are irrelevant — everyone gets a core. They start mattering when you have 32 processes competing for 16 cores.
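To put a number on "mattering under contention", here is a minimal sketch that estimates how one core is split between two runnable tasks. It assumes the common rule of thumb that each nice step changes a task's weight by a factor of about 1.25 (weight = 1024 / 1.25^nice); the kernel uses a precomputed table that is close to this, so treat the output as approximate. The function name cfs_share is made up for this example.

```shell
# Approximate CPU share (percent) of a task at nice $1 competing with
# one task at nice $2 on a single core.
# Assumption: weight = 1024 / 1.25^nice, an approximation of the kernel's table.
cfs_share() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    wa = 1024 / (1.25 ^ a)
    wb = 1024 / (1.25 ^ b)
    printf "%.1f\n", 100 * wa / (wa + wb)
  }'
}

cfs_share 0 0     # two nice-0 tasks split the core evenly
cfs_share 19 0    # a nice-19 batch job against a nice-0 service
```

The second call shows why a nice-19 job is mostly harmless on a busy core, and also why it is not free: it still takes a small share.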

CPU pinning (taskset/cpuset): For latency-sensitive workloads, pin your application to specific cores and keep everything else off them. This cuts down on cache pollution: when processes bounce between cores, they lose their L1/L2 cache lines and spend cycles reloading data.

# Pin process to cores 0-3
taskset -c 0-3 ./my-application

# Or via cgroups (v1 path shown; with cgroup v2 the file is /sys/fs/cgroup/my-app/cpuset.cpus)
echo "0-3" > /sys/fs/cgroup/cpuset/my-app/cpuset.cpus

Financial trading systems and game servers live and die by CPU pinning. For web services, it's rarely worth the operational complexity — unless you've measured and confirmed cache misses are your bottleneck.

The NUMA trap: On multi-socket servers, NUMA (Non-Uniform Memory Access) means each CPU socket has "local" memory and "remote" memory. Accessing remote memory can be 2-3x slower, depending on the hardware. If your application runs on socket 0 but allocates memory on socket 1's RAM, every memory access pays a latency penalty.

numactl --hardware     # See NUMA topology
numactl --localalloc ./my-application   # Force local memory allocation

Most cloud VMs abstract NUMA away, but bare metal servers? Check your topology.

Memory: Page Cache Is Your Best Friend

Linux uses all free memory as page cache — buffering disk reads in RAM. When you see "10 GB used, 2 GB free" on a 16 GB server, it doesn't mean you're low on memory. It means 4 GB is page cache, and it'll be released the moment a process needs it.

free -h lies to you if you don't read it carefully. Look at the "available" column, not "free." Available = free + reclaimable cache.
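The same "available" figure can be read straight from /proc/meminfo, which is handy for alerts. This is a small sketch, not a monitoring tool: the function name mem_available_pct is hypothetical, and it assumes the MemTotal and MemAvailable fields that current kernels expose in /proc/meminfo.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# Reads /proc/meminfo-formatted text on stdin.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage on a live system:
# mem_available_pct < /proc/meminfo
```

Alerting on this percentage rather than on "free" avoids paging someone just because the page cache is doing its job.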

Swap: When physical memory runs short, Linux moves pages to swap (disk). This staves off OOM kills but makes the system extremely slow: even a fast NVMe drive is on the order of 1,000x slower than RAM, and a spinning disk is far worse. A system actively swapping is a system dying slowly.

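vm.swappiness controls how aggressively the kernel swaps. Default is 60. For database servers: set it to 1, not 0. Setting 0 does not disable swap (only swapoff or removing the swap device does that), but it tells the kernel to avoid swapping application memory until it is nearly out of options, so the first visible sign of pressure is often the OOM killer. For Redis: set it to 1 and monitor closely. Redis's dataset should never touch swap.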

sysctl vm.swappiness=1

OOM Killer: When memory is truly exhausted and swap (if any) is full, the kernel picks a process to kill. It chooses based on memory usage and oom_score_adj. Critical processes should have a low score:

echo -1000 > /proc/$(pidof my-critical-app)/oom_score_adj

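This exempts the process from the OOM killer: everything else gets killed first. But if that process is the one eating the memory, the exemption makes things worse. The kernel kills its neighbors one by one and can panic once nothing killable is left. Use -1000 sparingly.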

The team that disabled swap: They read a blog post saying swap hurts performance. They set vm.swappiness=0 and disabled the swap partition. For months, everything was fine — they had plenty of RAM. Then a memory leak in a sidecar container slowly consumed memory over 3 weeks. Without swap as a buffer, the OOM killer fired without warning at 2 AM, killing the primary database process. No graceful shutdown. Transaction log corruption. 4-hour recovery.

Swap isn't the enemy. Uncontrolled swap is the enemy. A small swap partition (2-4 GB) gives you and your monitoring a window to notice memory pressure before the OOM killer starts killing processes.

I/O: The Scheduler You Didn't Know Existed

Disk I/O has its own scheduler, separate from the CPU scheduler. It determines the order in which read/write requests reach the disk.

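deadline: Assigns a deadline to each request (500ms for reads, 5s for writes by default). Guarantees no request starves. Good for databases. This is the legacy single-queue scheduler; kernels 5.0 and later ship only its multi-queue successor.

mq-deadline: Multi-queue version of deadline, and the usual default for SATA SSDs and spinning disks on current kernels. A sensible choice whenever the device benefits from request ordering.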

none (noop): No reordering. Passes requests directly to the device. Use for NVMe SSDs where the device has its own sophisticated scheduler. Adding a kernel scheduler on top just adds latency.

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Change it
echo "none" > /sys/block/nvme0n1/queue/scheduler

For SSDs and NVMe: use none or mq-deadline. For spinning disks (if you still have them): use mq-deadline (deadline on pre-5.0 kernels) or bfq (Budget Fair Queueing, good for interactive workloads).
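The scheduler file lists every available scheduler and marks the active one with square brackets, for example "[mq-deadline] kyber bfq none". A small helper makes that easy to audit across devices. This is a sketch; the function name active_scheduler is made up for this example.

```shell
# Print the active scheduler from a queue/scheduler line read on stdin.
# The active entry is the bracketed one, e.g. "[mq-deadline] kyber bfq none".
active_scheduler() {
  sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage on a live system (device names vary):
# for f in /sys/block/*/queue/scheduler; do
#   printf '%s: %s\n' "$f" "$(active_scheduler < "$f")"
# done
```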

Network Stack: The Parameters That Change Everything

Default Linux network settings are conservative. They're designed for a general-purpose machine, not a server handling tens of thousands of connections.

net.core.somaxconn: The maximum number of connections that can be queued for acceptance. Default: 4096 (128 before kernel 5.4). If your application can't accept connections fast enough, the queue fills up and new connections get dropped.

sysctl net.core.somaxconn=65535

Nginx, HAProxy, and any high-connection service should have this bumped. Also increase the application's own listen backlog to match.
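A full accept queue does not show up on CPU or memory dashboards, but the kernel counts it. The sketch below pulls the ListenOverflows and ListenDrops counters out of /proc/net/netstat, which stores a header line of names followed by a line of values for each protocol group. The function name listen_queue_counters is hypothetical; the counter names are the kernel's.

```shell
# Print the accept-queue overflow counters from /proc/net/netstat text on stdin.
# The first "TcpExt:" line holds column names, the second holds the values.
listen_queue_counters() {
  awk '$1 == "TcpExt:" {
         if (!header_seen) {
           for (i = 2; i <= NF; i++) col[$i] = i
           header_seen = 1
         } else {
           printf "overflows=%d drops=%d\n", $(col["ListenOverflows"]), $(col["ListenDrops"])
         }
       }'
}

# Usage on a live system:
# listen_queue_counters < /proc/net/netstat
```

The counters are cumulative since boot, so what matters is whether they grow between two readings taken under load.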

net.ipv4.tcp_tw_reuse: Allows reusing sockets in TIME_WAIT state for new outgoing connections. On a server making many short-lived connections to backend services, TIME_WAIT sockets can accumulate in the thousands, exhausting ephemeral ports.

sysctl net.ipv4.tcp_tw_reuse=1

net.core.rmem_max / net.core.wmem_max: Maximum receive and send buffer sizes. Default values are often too low for high-throughput applications.

sysctl net.core.rmem_max=16777216
sysctl net.core.wmem_max=16777216
sysctl net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl net.ipv4.tcp_wmem="4096 65536 16777216"

The three values in tcp_rmem and tcp_wmem are: minimum, default, maximum. The kernel auto-tunes within this range based on available memory and connection count.
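One way to sanity-check a maximum like 16777216 is the bandwidth-delay product: the number of bytes that must be in flight to keep a link busy for one round trip. The sketch below is a back-of-the-envelope calculation, not something taken from the tuning above; the function name bdp_bytes and the example link speeds are assumptions for illustration.

```shell
# Bandwidth-delay product in bytes for a link of $1 Mbit/s and an RTT of $2 ms.
# 1 Mbit/s for 1 ms is 1,000,000 / 8 / 1000 = 125 bytes.
bdp_bytes() {
  echo $(( $1 * $2 * 125 ))
}

bdp_bytes 1000 100    # 1 Gbit/s link, 100 ms round trip
bdp_bytes 10000 1     # 10 Gbit/s link, 1 ms round trip inside a datacenter
```

If the result is larger than the maximum buffer size, a single connection cannot fill the link no matter how fast the application is.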

net.ipv4.tcp_keepalive_time: How long a connection sits idle before the kernel starts sending keepalive probes. Default: 7200 seconds (2 hours). These settings only apply to sockets where the application has enabled SO_KEEPALIVE. If a client disappears without closing the connection (network failure, crash), the server won't notice for more than 2 hours. That's 2 hours of a socket slot wasted.

sysctl net.ipv4.tcp_keepalive_time=600
sysctl net.ipv4.tcp_keepalive_intvl=60
sysctl net.ipv4.tcp_keepalive_probes=5
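The three keepalive settings combine into one number: how long a dead peer can go unnoticed. Roughly, that is the idle time before the first probe plus one interval for each unanswered probe. A quick sketch of the arithmetic, with a made-up function name keepalive_detect_secs:

```shell
# Approximate worst-case seconds before an idle, dead peer is dropped:
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_detect_secs() {
  echo $(( $1 + $2 * $3 ))
}

keepalive_detect_secs 7200 75 9   # kernel defaults
keepalive_detect_secs 600 60 5    # values from the sysctl commands above
```

With the defaults that is a little over two hours; with the tuned values it is 15 minutes.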

Profiling: perf, strace, eBPF

When the metrics don't tell you enough, go deeper.

perf — CPU profiling. Shows you where CPU time is being spent at the function level.

# Record 30 seconds of CPU activity
perf record -g -p $(pidof my-app) -- sleep 30
perf report

The flame graph (generated with Brendan Gregg's scripts) makes perf output readable. The widest bars are where your CPU spends the most time. If 40% of CPU time is in malloc, you have a memory allocation problem. If 30% is in pthread_mutex_lock, you have a contention problem.

strace — System call tracing. Shows every interaction between your application and the kernel.

strace -p $(pidof my-app) -f -e trace=network -T

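-f follows child processes and threads. -e trace=network filters to network calls only. -T shows time spent in each syscall. If connect() calls are taking 50ms, the TCP handshake to that peer is slow: look at network latency or an overloaded backend. Slow DNS resolution shows up instead as long waits around the sendto()/recvfrom() calls to your resolver. If sendto() or sendmsg() calls are taking 10ms, the send path is blocking, usually on a full socket buffer. Plain write() calls are not part of the network filter, so they won't appear in this trace.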

Warning: strace adds overhead. Don't run it on a production process during peak traffic unless you understand the impact. For production tracing, use eBPF instead.

eBPF: The modern way to observe production systems with very low overhead. eBPF programs run in the kernel, attached to specific events, and are checked by a verifier before they load.

# Using bcc tools (installed with a -bpfcc suffix on Debian/Ubuntu, e.g. tcplife-bpfcc)
tcplife          # Track TCP connection lifetimes
biolatency       # Disk I/O latency histogram
runqlat          # CPU scheduler queue latency
funccount        # Count function calls

eBPF tools like bcc and bpftrace give you kernel-level visibility without modifying your application, at a cost that is usually small enough for production. Because they can trace every event instead of sampling, they also catch the short-lived outliers that sampled metrics miss.

The USE Method: Systematic Performance Analysis

Brendan Gregg's USE method: for every resource (CPU, memory, disk, network), check Utilization, Saturation, and Errors.

CPU:

Utilization: mpstat -P ALL 1 (per-core usage)
Saturation: vmstat 1, r column (run queue). If it stays above the core count, CPUs are overloaded
Errors: dmesg | grep -i error

Memory:

Utilization: free -h, "available" column
Saturation: vmstat 1, si/so columns (swap in/out). Sustained non-zero values mean the system is actively swapping
Errors: dmesg | grep -i oom

Disk:

Utilization: iostat -xz 1, %util column
Saturation: iostat -xz 1, aqu-sz column (avgqu-sz in older sysstat releases), the average queue size. High values mean requests are waiting
Errors: smartctl -a /dev/sda

Network:

Utilization: sar -n DEV 1 (bytes per second vs link capacity)
Saturation: netstat -s | grep -i drop (dropped packets)
Errors: ip -s link (or ifconfig on older systems), error counters

Go through this checklist when "the system is slow." Most of the time, one resource will be saturated and everything else will look fine. That's your bottleneck.
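Two of the saturation checks above come from the same vmstat output, so they are easy to script. The sketch below reads vmstat output on stdin, finds the r, si, and so columns from the header row, and counts the samples where the run queue exceeds the core count or swap traffic is non-zero. The function name vmstat_saturation is hypothetical. Keep in mind that the first sample vmstat prints is an average since boot, so judge by the later ones.

```shell
# Summarize CPU and memory saturation from vmstat output on stdin.
# $1 = number of CPU cores to compare the run queue against.
vmstat_saturation() {
  awk -v cores="$1" '
    $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header row
    ("r" in col) && $1 ~ /^[0-9]+$/ {                          # data row
      if ($(col["r"]) + 0 > cores + 0) cpu++
      if ($(col["si"]) + 0 > 0 || $(col["so"]) + 0 > 0) swap++
      n++
    }
    END { printf "samples=%d cpu_saturated=%d swapping=%d\n", n, cpu, swap }'
}

# Usage on a live system:
# vmstat 1 10 | vmstat_saturation "$(nproc)"
```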

The 7 Kernel Parameters Story

Production API server. Latency: 200ms average, 800ms P99. A week of profiling showed the time was going to kernel-level network and memory operations, not application code.

The 7 parameters that changed everything:

net.core.somaxconn = 65535 (was 128)
net.ipv4.tcp_tw_reuse = 1 (was 0)
net.core.rmem_max = 16777216 (was 212992)
net.core.wmem_max = 16777216 (was 212992)
vm.swappiness = 1 (was 60)
net.ipv4.tcp_keepalive_time = 600 (was 7200)
I/O scheduler to none (was cfq on an NVMe drive)

Result: average latency dropped to 20ms. P99 dropped to 85ms. No code changes. No infrastructure changes. Six sysctl settings and one I/O scheduler change.
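Settings applied with the sysctl command are lost on reboot, so the values from the list above would normally go into a drop-in file. A minimal sketch follows, with the function name render_sysctl_conf and the file name in the usage note both chosen for illustration. The I/O scheduler is not a sysctl; it lives in sysfs and needs a separate mechanism such as a udev rule.

```shell
# Print a sysctl drop-in containing the six sysctl values from the list above.
render_sysctl_conf() {
  cat <<'EOF'
net.core.somaxconn = 65535
net.ipv4.tcp_tw_reuse = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
vm.swappiness = 1
net.ipv4.tcp_keepalive_time = 600
EOF
}

# Usage (as root; the file name is hypothetical):
# render_sysctl_conf > /etc/sysctl.d/99-api-server.conf
# sysctl --system
```

These are the values from one server's story, not universal recommendations. Measure before and after on your own workload.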

The defaults are designed for safety and generality. Production servers are neither safe nor general — they're specific, high-performance machines with specific workloads. Tune accordingly.

Key Takeaways

The kernel is not a black box. /proc and /sys expose everything. perf, strace, and eBPF let you look inside without guessing.

When "the system is slow," use the USE method. Check utilization, saturation, and errors for every resource. The bottleneck will reveal itself.

Default kernel parameters are fine for development machines. They're wrong for production. Every production server should have a tuned sysctl.conf based on its workload.

And never disable swap without understanding what happens when memory runs out. The OOM killer doesn't negotiate.

Over to You

Which kernel parameter change gave you the biggest performance win? Have you ever had the OOM killer strike at the worst possible moment?

If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.

This is part of the Great Stack to Doesn't Work series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.

Top comments (1)

Valentyn Kit • Jun 29

The "every metric says normal but it's slow" symptom is usually not the scheduler in my experience, it's the accept queue. A full net.core.somaxconn / SYN backlog silently delays connections and never shows up on a CPU or memory dashboard, and at 50k concurrent it's the first thing I'd check before touching nice values. nf_conntrack hitting its table limit bites the same invisible way.