I Lost 30% of My UDP Packets — and the Network Was Innocent

#networking #debugging #programming #performance

A receiver pulling a UDP feed was missing roughly 30% of its messages. No errors, no exceptions, no stack traces — just gaps in the sequence numbers. The first suspect is always the network: a flaky switch, a saturated link, a tired NIC.

The network was innocent. The packets were being dropped on the receiving host, after they'd already arrived. Here's how to tell the difference, and why it matters.

Why UDP makes this sneaky

UDP has no retransmission and no backpressure. When a datagram is lost, nobody is notified — not the sender, not the receiver. The packet simply isn't there.

That means two completely different failures look identical from the application's point of view:

The network dropped the packet before it reached your machine.
Your own host accepted the packet and then threw it away after it arrived.

The application sees the same thing in both cases: a missing sequence number. But the fix is in a different building depending on which one it is.

Where the packets actually go

The receive path is: NIC → kernel socket receive buffer → your recv() call. The kernel parks incoming datagrams in a per-socket buffer until your code reads them. If your code doesn't drain that buffer fast enough, it fills, and the kernel drops the overflow. Crucially, the kernel counts those drops.

On Linux:

# Per-protocol summary — look for "receive buffer errors"
netstat -su

# Or straight from the kernel counters (header line, then values)
grep '^Udp:' /proc/net/snmp
#   Udp: InDatagrams ... InErrors ... RcvbufErrors ...

If RcvbufErrors is climbing, the network did its job and your host discarded the datagrams. That single counter collapses a week of "is it the switch?" into about ten seconds of certainty.
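To watch that counter from code rather than by eye, a minimal sketch along these lines may help. It assumes the usual /proc/net/snmp layout on Linux (a "Udp:" line of names followed by a "Udp:" line of values); the function names are illustrative, not part of any library.

```python
def parse_udp_counters(snmp_text):
    """Return the 'Udp:' counters from /proc/net/snmp-style text as a dict."""
    # Each protocol appears as two lines sharing a prefix: names, then values.
    # Matching "Udp:" with the colon skips the separate "UdpLite:" rows.
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) != 2:
        raise ValueError("expected one header line and one value line for Udp")
    names, values = rows
    return dict(zip(names, (int(v) for v in values)))


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())
```

Calling read_udp_counters() twice a few seconds apart and comparing the RcvbufErrors values shows whether the host is dropping datagrams right now, rather than at some point since boot.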

The actual cause

In this case the socket receive buffer was sitting at the Linux default (net.core.rmem_default, about 208 KB). The sender burst faster than a single receive thread could call recv(). Average throughput looked fine on every dashboard, but the bursts filled the buffer in milliseconds, and everything past the brim was dropped. The metric that mattered wasn't mean throughput; it was peak burst versus drain rate.
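The burst-versus-drain arithmetic can be sketched in a few lines. The rates below are hypothetical round numbers chosen for illustration, not measurements from this incident, and the model ignores the per-packet bookkeeping overhead the kernel charges against the buffer, so real capacity is somewhat lower.

```python
def seconds_until_overflow(buffer_bytes, arrival_bytes_per_s, drain_bytes_per_s):
    """Time for a burst to fill the receive buffer, or None if it never fills."""
    surplus = arrival_bytes_per_s - drain_bytes_per_s
    if surplus <= 0:
        return None  # the consumer keeps up, so the buffer never fills
    return buffer_bytes / surplus


# Hypothetical numbers (assumptions for illustration only):
BUFFER = 212_992        # a common Linux default, about 208 KB
BURST = 100_000_000     # 100 MB/s arriving during a burst
DRAIN = 20_000_000      # 20 MB/s the receive loop can sustain

t = seconds_until_overflow(BUFFER, BURST, DRAIN)
print(f"buffer fills after {t * 1000:.2f} ms")  # buffer fills after 2.66 ms
```

With these inputs the buffer lasts under 3 ms, which is why a dashboard averaging over seconds shows nothing wrong.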

The fix, in order of leverage

1. Drain faster. The receive loop was parsing each message and doing a database write inline. Anything that isn't "copy bytes out of the socket" belongs off the hot path: recv() → hand the buffer to a queue → immediately loop back to recv().
2. Raise the buffer. Bump SO_RCVBUF, and raise net.core.rmem_max first, because the kernel silently caps any SO_RCVBUF request above that limit. A bigger buffer doesn't fix a slow consumer; it absorbs bursts so that a fast-enough consumer never falls behind. You usually need both this and #1.
3. Batch your syscalls. recvmmsg() pulls many datagrams per system call, which cuts per-packet overhead when volume is high.
4. Spread the load. If one core genuinely can't keep up, SO_REUSEPORT lets multiple threads each bind their own socket to the same port, each with its own receive buffer.
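The first two fixes above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code from the incident: open_receiver, drain, worker, and the handle callback are hypothetical names, and the queue is unbounded for brevity.

```python
import queue
import socket
import threading


def open_receiver(host, port, rcvbuf_bytes):
    """Bind a UDP socket and request a larger receive buffer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    sock.bind((host, port))
    # Read the value back instead of trusting the request: on Linux the
    # kernel caps it at net.core.rmem_max and reports double the granted
    # size to account for bookkeeping overhead.
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


def drain(sock, out_queue, stop):
    """Hot path: copy bytes out of the socket and hand them off, nothing else."""
    sock.settimeout(0.2)  # wake up periodically so the loop can see `stop`
    while not stop.is_set():
        try:
            data = sock.recv(65535)
        except socket.timeout:
            continue
        out_queue.put(data)


def worker(in_queue, handle):
    """Slow path: parsing and database writes happen here, off the hot path."""
    while True:
        data = in_queue.get()
        if data is None:  # sentinel: shut down
            break
        handle(data)
```

A typical arrangement runs drain and worker in separate threads sharing one queue.Queue. The queue only moves the buffering into user space, so a worker that is slower on average will still fall behind, just later. In CPython the GIL also means a CPU-heavy handler competes with the drain thread; the shape of the loop is the point here, not the language.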

Key takeaways

"Packet loss" is a location, not a cause. Find out where before you theorize about why.
With UDP, silent drops are the default — the protocol won't tell you, so the kernel counters have to.
RcvbufErrors is the first thing to check. It almost always points at a receive buffer that's too small or a consumer that's too slow.
A bigger buffer absorbs bursts; a faster drain prevents them. You usually want both.

The full debugging story — the live-feed before/after, the buffer math, and the exact counters I watched while tuning it — is on Medium:

Networking for Developers: I Lost 30% of UDP Packets — The Debugging Story

I write more like this on Medium as The Speed Engineer: performance engineering, debugging stories, and the lower-level systems work that doesn't fit in a tweet.