DEV Community

speed engineer

Posted on • Originally published at Medium

How to Tweak Linux Kernel Settings for Maximum Throughput on 10G Links

Most packet loss doesn’t happen on the wire — it happens in a 512-slot queue that nobody knew existed.


That’s your 10Gbps NIC trying to fit into kernel buffers sized for the dial-up era.

If your “10G” link stalls around 1G, start with three suspects: tiny NIC ring buffers, a netdev_max_backlog stuck at 1000, and RSS dumping everything onto one CPU.

I spent three weeks thinking our switches were defective. Clean captures, zero NIC errors, but throughput would spike to 10Gbps for maybe two seconds then crater to 800Mbps and just camp there. Turned out the bottleneck was staring at me from /proc/sys/net/core the whole time, laughing.

If Your 10G Link Behaves Like Hotel Wi-Fi

If you’ve ever stared at Grafana showing 8% link utilization while everyone swears “it’s the network,” this is your bug. Replicated DBs, Kafka, object storage, CDN origins — anything that should saturate 10G but mysteriously plateaus for no good reason lives here.

The Market Data Incident Nobody Wants to Talk About

We were building a high-frequency pipeline for market ticks. Intel X710 NICs, direct fiber, NUMA-pinned processes — the whole nine yards. But under burst load we’d silently lose 15–20% of packets. No kernel warnings. No ethtool counters. Just gaps that the finance team’s reconciliation found later, which led to some deeply unpleasant Slack conversations with traders who thought we’d cost them their edge.

I assumed fabric packet loss. Switch counters showed zero drops. Meaning packets were vanishing after the NIC received them but before our app saw them. That’s when I learned the kernel still thinks you’re running Thunderbird on DSL and allocates memory accordingly.

The Three Bouncers (And Why Two Are Asleep)

When a frame hits your NIC, it goes through three checkpoints:

NIC ring buffer sits in hardware memory on the card — think of it as the holding pen before anything touches kernel land.

netdev_max_backlog is the per-CPU software queue before protocol processing kicks in. This is where bursts go to die if you haven’t tuned it — like a bouncer who only remembers the last thousand faces.

Socket receive buffer is the last checkpoint — if this bouncer is full, your app never even gets to complain. The packets just vanish outside the club.

Default config assumes light browsing. For a 10G NIC under real burst traffic, queues start dropping packets before you even blink.
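Before changing anything, it's worth seeing what you're starting from. The live values are readable straight out of /proc on any modern kernel (`sysctl -n` works too):

```shell
# What the kernel is running with right now
cat /proc/sys/net/core/netdev_max_backlog   # stock default: 1000
cat /proc/sys/net/core/rmem_default         # often around 208KB
cat /proc/sys/net/core/rmem_max             # the ceiling auto-tuning can reach
```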

My first attempt was adorably naive:

sysctl -w net.core.rmem_max=16777216   # max recv - thought this would save me  
sysctl -w net.core.wmem_max=16777216   # max send - narrator: it did not

Maximum buffer sizes mean nothing on their own: rmem_max and wmem_max only raise the ceiling, and it's tcp_rmem/tcp_wmem that decide whether auto-tuning ever climbs toward it. I didn’t understand TCP auto-tuning yet. Just kidding — I had no idea those were even separate settings.

When The Kernel Ghosts Your Packets

I thought buffer overflows would scream at me. Syslog messages, error counters, something. The kernel just drops packets and moves on with its life. Basically shrugged and said “not my problem.”

Then I ran netstat -s | grep -i drop and watched drop counters spin like crazy. Not NIC drops, not switch fabric—kernel-side backlog drops. And netdev_max_backlog was still sitting at 1000 while we were pushing 800,000 packets per second through that tiny queue.

I thought the driver was choking — then I actually looked at what the queue depth was screaming at me.

Quick diagnosis: check the second column of /proc/net/softnet_stat. If those numbers climb, your backlog is drowning.
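One way to watch that column, a small sketch assuming bash and the stock /proc layout (one row per CPU, fields in hex):

```shell
#!/usr/bin/env bash
# Sum the per-CPU backlog drop counters from /proc/net/softnet_stat.
# The second hex field on each row counts packets dropped because the
# netdev_max_backlog queue was full. Run this twice a few seconds apart:
# a rising total means the software backlog is overflowing.
total=0
while read -r line; do
    set -- $line
    total=$(( total + 0x$2 ))    # field 2 = drops, in hex
done < /proc/net/softnet_stat
echo "softnet backlog drops: $total"
```

On a healthy box the total stays flat between runs; anything steadily climbing means the kernel is shedding packets before your app ever sees them.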

What Finally Stopped The Bleeding

Here’s what worked on our 10G boxes. Steal these, then scale down if you’re not flinging firehose traffic:

net.core.rmem_max = 134217728        # 128MB ceiling (recv)  
net.core.wmem_max = 134217728        # 128MB ceiling (send)  
net.core.rmem_default = 16777216     # 16MB starting point  
net.core.wmem_default = 16777216     # lets auto-tuning grow from here  

# Per-socket TCP tuning: min, default, max in bytes  
net.ipv4.tcp_rmem = 4096 87380 134217728    # recv can grow to 128MB  
net.ipv4.tcp_wmem = 4096 65536 134217728    # send grows too  
# CRITICAL: This is in 4KB pages, NOT bytes  
net.ipv4.tcp_mem = 6291456 8388608 12582912 # low/pressure/max = 24/32/48GB across all sockets  
net.core.netdev_max_backlog = 250000    # the hero that saved us

That tcp_mem line almost ended me. It's in pages. I set it to 16777216 thinking "16MB, plenty of headroom" and couldn't figure out why connections throttled. At 4KB per page, that's actually 64GB of TCP memory. Reading kernel docs at 3AM hits different.
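The page math is easy to sanity-check in a shell before you commit a value (assuming the usual 4KB pages; `getconf` tells you for sure):

```shell
# tcp_mem is counted in pages, not bytes. Convert before you set it.
PAGE=$(getconf PAGESIZE)                                          # 4096 on most systems
echo "high threshold: $(( 12582912 * PAGE / 1024 / 1024 / 1024 ))GB"   # 48GB with 4KB pages
echo "my 'mistake':   $(( 16777216 * PAGE / 1024 / 1024 / 1024 ))GB"   # 64GB, not 16MB
```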

The backlog jump from 1000 to 250000 was the real win. That’s where the silent drops were actually happening, and once we fixed it, drop counters flatlined.
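None of these sysctls survive a reboot on their own. One way to persist them is a sysctl.d fragment (the filename here is my choice):

```
# /etc/sysctl.d/90-10g-tuning.conf  (hypothetical filename)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_mem = 6291456 8388608 12582912
net.core.netdev_max_backlog = 250000
```

Then `sudo sysctl --system` reloads every fragment, and `sysctl net.core.netdev_max_backlog` confirms it took.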

BBR: When I Stopped Fighting The Algorithm

Bigger buffers handle bursts. But they don’t solve the actual question: how does TCP know how fast to send without triggering congestion?

CUBIC (the default) uses packet loss as a signal. Fill buffers until something drops, back off, repeat. On high-bandwidth, high-latency links this creates a cycle where you never hit full utilization — you’re always either filling or recovering.

BBR measures actual delivery rate and RTT to model the path. It sends at the calculated bottleneck rate without needing to cause congestion first.

net.core.default_qdisc = fq             # fair queuing for pacing  
net.ipv4.tcp_congestion_control = bbr   # goodbye CUBIC

Our WAN throughput jumped 40% immediately. No buffer tweaks needed — just smarter congestion sensing.

Catch: BBR can be aggressive on shared links. When long-haul BBR flows share a bottleneck with short CUBIC flows, BBR tends to grab more than its share of the bandwidth. Our pipeline was isolated so we didn’t care, but watch for this if you’re on shared infrastructure.
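Before flipping those two sysctls, check that your kernel actually ships BBR (on mainline kernels it lives in the tcp_bbr module):

```shell
# Which congestion control algorithms does this kernel offer?
cat /proc/sys/net/ipv4/tcp_available_congestion_control
# And which one is currently in effect?
cat /proc/sys/net/ipv4/tcp_congestion_control
```

If bbr isn’t in the first list, `modprobe tcp_bbr` usually fixes it. After switching, `ss -ti` shows which algorithm live sockets are actually using.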

The Hardware Queue I Forgot Existed

Oh wait — there’s a hardware queue before the kernel even sees packets. The NIC ring buffer is where frames sit while the driver DMAs them into kernel memory.

ethtool -g eth0  # check current and max ring sizes  

# What I saw:  
# Pre-set maximums:  
# RX:         4096      ← NIC supports this  
# Current hardware settings:  
# RX:         512       ← are you kidding me

Default: 512 packets. Our burst: 2000 packets in 100 microseconds. Cool, cool, definitely not a problem.

ethtool -G eth0 rx 4096 tx 4096    # bump to hardware max

That change alone recovered 8% of our lost packets.

But — and this bit me later — if you’re chasing ultra-low latency, huge rings add queueing delay. We needed throughput over latency so maxing out made sense. If you’re doing sub-millisecond stuff, step up gradually and watch your p99s. Once we accepted that we couldn’t have infinite buffers and low latency, the tradeoffs became obvious. Constraints force clarity.

When I Accidentally DOSed My Own CPU

Every incoming packet fires a hardware interrupt to a CPU. At 800,000 packets/sec, that’s 800,000 context switches. Your cores spend more time fielding interrupts than doing actual work.

Interrupt coalescing batches packets. The NIC waits a few microseconds or until N packets arrive, then interrupts once for the batch.

ethtool -c eth0  # see current coalescing settings


# What worked for bulk replication on our 10G boxes:  
ethtool -C eth0 rx-usecs 128 rx-frames 64    # wait 128µs or 64 packets


# What you'd use for twitchy latency (we used something in between):  
ethtool -C eth0 rx-usecs 0 rx-frames 1       # interrupt immediately

We landed on rx-usecs 10 rx-frames 8 for market ticks because we needed p99 under 50 microseconds but couldn't afford drowning in interrupt storms.

Gotcha that wrecked my weekend: these reset on reboot. Persist them in systemd, udev rules, whatever. I found out during maintenance when everything came back with defaults and our alerts went nuclear.
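Here’s one way to persist them, a oneshot systemd unit (the unit name, interface, and ethtool path are mine; adjust to taste):

```
# /etc/systemd/system/nic-tuning.service  (hypothetical unit name)
[Unit]
Description=Apply NIC ring and coalescing settings
After=network-pre.target
Before=network.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -G eth0 rx 4096 tx 4096
ExecStart=/usr/sbin/ethtool -C eth0 rx-usecs 10 rx-frames 8

[Install]
WantedBy=multi-user.target
```

`systemctl enable nic-tuning` and the settings come back with the box instead of with your pager.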

How I Finally Connected The Dots

At 3AM, staring at graphs, it clicked. Ring buffer size sets your burst absorption at hardware. netdev_max_backlog is your software surge tank before protocol processing. tcp_rmem/wmem controls individual flow throughput. tcp_mem caps everything combined across all connections. BBR versus CUBIC changes how aggressively you fill those buffers. Interrupt coalescing determines how often CPUs even check for new packets.

Tune one, ignore the others, and you just relocate the bottleneck. I fixed ring buffers but left backlog at default — all I did was move packet drops 50 microseconds downstream.

Imagine a 10K packet burst. Your 512-slot ring buffer thrashes — anything the driver can’t service in time just disappears. Whatever survives piles into a 1,000-slot netdev_max_backlog, where more packets quietly vanish under load. Only what's left ever reaches your socket buffers. At every stage, the kernel is making drop decisions based on settings from when 100Mbps Fast Ethernet was exotic.

What Actually Changed

After tuning everything:

  • Zero packet loss under production bursts (was losing 15–20%)
  • 10Gbps sustained (was averaging 7.2Gbps)
  • WAN transfers: 3.5Gbps → 5.8Gbps with BBR
  • P99 latency: 2.1ms → 180µs
  • CPU cost per packet dropped 15% (fewer interrupts = more useful work)

The real moment was during a major market data spike — the kind that used to trigger loss alerts and panicked messages. This time? Nothing. Metrics stayed green. The system just absorbed it. That’s infrastructure becoming boring in the best way.

The RSS Disaster I Nearly Shipped

All those buffers and queues are per-CPU. RSS (Receive Side Scaling) spreads packets across cores. If RSS is misconfigured, you can tune everything perfectly and still bottleneck because all traffic lands on CPU 0.

ethtool -l eth0  # how many RSS queues are active  
ethtool -L eth0 combined 16    # match your CPU count  
cat /proc/interrupts | grep eth0    # see which CPUs handle which queues

I had perfectly tuned buffers on 16 cores but RSS only used 4 queues. Bottlenecked on those 4 while the other 12 sat idle. Redistributing interrupts gave us another 20% throughput.

If you see only a few eth0 IRQ lines in /proc/interrupts with huge counts while other CPUs show almost nothing, RSS isn't distributing. Fix it before you tune anything else or you're just optimizing 25% of your hardware.
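A sketch of spreading those IRQs manually, assuming an interface named eth0 and that irqbalance is stopped (if it’s running, it will rewrite your affinities behind your back):

```shell
#!/usr/bin/env bash
# Round-robin each eth0 queue IRQ onto its own CPU (needs root).
cpu=0
for irq in $(awk -F: '/eth0/ { gsub(/ /, "", $1); print $1 }' /proc/interrupts); do
    # Pin this queue's IRQ to one CPU; skip gracefully if not root
    echo "$cpu" > "/proc/irq/$irq/smp_affinity_list" 2>/dev/null \
        || echo "skipping IRQ $irq (need root?)"
    cpu=$(( (cpu + 1) % $(nproc) ))
done
```

After running it, re-check /proc/interrupts: the per-queue counters should now climb across all CPUs instead of piling onto CPU 0.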

Where This Goes Next

This covers receive path. Transmit has its own maze: TSO/GSO offloads, qdisc queuing, how app send buffers interact with kernel TCP pacing.

If I needed sub-50 microsecond tick handling, I’d probably look at XDP next — it bypasses the kernel stack entirely for line-rate packet filtering. But that’s a different kind of pain.

If you want a Monday-morning checklist:

  • Check your RX ring: ethtool -g eth0 and bump it if you're still at 512.
  • Check your backlog and drops: net.core.netdev_max_backlog + /proc/net/softnet_stat.
  • Check RSS: ethtool -l/-L and /proc/interrupts to make sure all cores are actually getting traffic.

The dirty truth: hardware stopped being the bottleneck years ago. It’s the software defaults — and most systems still ship with config from when YouTube didn’t exist. Go check netdev_max_backlog, your RX ring size, and /proc/net/softnet_stat tomorrow. I'd bet good money at least one is wrong.


Enjoyed the read? Let’s stay connected!

  • 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
  • 💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
  • ⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.

Your support means the world and helps me create more content you’ll love. ❤️
