Networking bugs are humbling. You think you understand sockets, buffers, and protocols — until packets start vanishing into thin air.
I recently spent a painful week tracking down why my application was silently dropping nearly a third of its UDP traffic. The root cause surprised me, and the debugging journey taught me more about networking fundamentals than any textbook.
The Setup
I was building a high-throughput data ingestion service. UDP was the obvious choice for speed — no handshake overhead, no connection state, just raw datagrams flying across the wire. Everything looked great in unit tests.
Then I deployed to staging.
The Symptoms
Monitoring showed roughly 70% delivery rate. Not terrible enough to trigger alarms immediately, but devastating for data integrity. The worst part? No errors. No logs. Packets were just... gone.
The Debugging Journey
Here's what I checked (and what you should check too):
1. Socket Buffer Sizes
The default receive buffer on most Linux systems is embarrassingly small for high-throughput workloads. Check yours:
```
sysctl net.core.rmem_default
sysctl net.core.rmem_max
```
If your application sends bursts faster than the receiver processes them, the kernel silently drops overflow packets. No error, no warning.
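You can raise the buffer per-socket from application code. A minimal sketch in Python (the 4 MiB figure is illustrative, not a universal recommendation; the kernel will clamp the request to `net.core.rmem_max`, so always read the value back):

```python
import socket

# Create a UDP socket and request a larger receive buffer.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)

# The kernel may clamp (or, on Linux, double) the requested value,
# so check what was actually granted rather than trusting the request.
actual = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"Requested 4 MiB, kernel granted {actual} bytes")
```

If `actual` comes back far below what you asked for, raising `net.core.rmem_max` is the fix, not calling `setsockopt` again.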
2. Network Interface Queues
```
ethtool -S eth0 | grep -i drop
netstat -su | grep "packet receive errors"
```
These counters tell you whether drops happen at the NIC level or the kernel level — a critical distinction.
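The kernel-level counters also live in `/proc/net/snmp`, which is easy to poll from a monitoring script. A sketch of parsing the two `Udp:` lines (the sample text and its numbers are made up for illustration; field names can vary slightly across kernel versions):

```python
SAMPLE = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 104216 12 37 98801 37 0\n"
)

def parse_udp_counters(snmp_text):
    """Zip the 'Udp:' header line with its value line into a dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    return dict(zip(rows[0], map(int, rows[1])))

counters = parse_udp_counters(SAMPLE)
# RcvbufErrors counts packets dropped because the socket
# receive buffer was full -- exactly the silent drops described above.
print(counters["RcvbufErrors"])
```

On a real Linux box you would feed it `open("/proc/net/snmp").read()` and watch `RcvbufErrors` over time; a climbing counter during traffic spikes points at the application, not the network.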
3. Application-Level Backpressure
Even with large buffers, if your application blocks on processing while new packets arrive, you're toast. The fix? Decouple receive and process:
```python
# Don't do this
while True:
    data, addr = sock.recvfrom(65535)  # recvfrom returns (bytes, address)
    process(data)  # If this is slow, packets pile up in the kernel buffer
```

```python
# Do this instead
import queue
import threading

packet_queue = queue.Queue(maxsize=10000)

def receiver():
    while True:
        data, addr = sock.recvfrom(65535)
        packet_queue.put(data)

def processor():
    while True:
        data = packet_queue.get()
        process(data)

threading.Thread(target=receiver, daemon=True).start()
threading.Thread(target=processor, daemon=True).start()
```
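One design decision hides in that pattern: what the receiver does when the queue is full. A blocking `put` just moves the stall back into the kernel buffer. An alternative sketch: drop deliberately with `put_nowait` and count it, so the loss is visible instead of silent (the helper name and queue size are mine, for illustration):

```python
import queue

# A deliberately tiny queue to demonstrate the overflow path.
packet_queue = queue.Queue(maxsize=3)
dropped = 0

def enqueue_or_drop(data):
    """Receiver-side helper: never block on a full queue; count the drop."""
    global dropped
    try:
        packet_queue.put_nowait(data)
    except queue.Full:
        dropped += 1  # we chose to drop, and the metric shows it

for i in range(5):
    enqueue_or_drop(f"packet-{i}".encode())

print(packet_queue.qsize(), dropped)  # → 3 2
```

Either policy can be right; the point is to make the choice explicitly rather than let the kernel make it for you.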
The Root Cause
In my case, it was a combination of undersized socket buffers AND a processing bottleneck in a downstream serialization step. The kernel was faithfully receiving packets, but my application couldn't drain the socket buffer fast enough during traffic spikes.
Key Takeaways
- UDP gives you speed but zero safety net. If you choose UDP, you own reliability.
- Always instrument your packet counts. Send-side count vs. receive-side count should be your first dashboard.
- Kernel defaults are conservative. Tune `rmem_max` and `SO_RCVBUF` for your workload.
- Decouple I/O from processing. This pattern prevents backpressure-induced drops.
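The send-vs-receive count comparison is easiest when every datagram carries a sequence number. A minimal sketch of that instrumentation (the wire format and class are my own illustration, not the article's production code):

```python
import struct

SEQ_HDR = struct.Struct("!Q")  # 8-byte big-endian sequence number prefix

def wrap(seq, payload):
    """Sender side: prepend a monotonically increasing sequence number."""
    return SEQ_HDR.pack(seq) + payload

class LossCounter:
    """Receiver side: count gaps in the sequence numbers seen."""
    def __init__(self):
        self.expected = 0
        self.lost = 0

    def observe(self, packet):
        seq, = SEQ_HDR.unpack_from(packet)
        if seq > self.expected:
            self.lost += seq - self.expected  # gap size = packets dropped
        self.expected = seq + 1
        return packet[SEQ_HDR.size:]  # strip the header off the payload

counter = LossCounter()
for seq in (0, 1, 2, 5, 6):          # packets 3 and 4 never arrived
    counter.observe(wrap(seq, b"data"))
print(counter.lost)  # → 2
```

This ignores reordering (a reordered packet would look like a loss followed by a stale sequence number), which is fine for a dashboard but not for exact accounting.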
If you're building systems that handle real-time data — whether it's time tracking events, analytics pipelines, or monitoring — getting networking fundamentals right is non-negotiable. Tools like FillTheTimesheet deal with exactly this kind of reliability challenge when tracking time events across distributed teams.
Want the Full Deep Dive?
I wrote a more detailed version of this debugging story on Medium, covering the exact strace commands, kernel tuning parameters, and production fixes that solved the problem.
👉 Read the full article on Medium
Also check out my other systems engineering posts:
- Checksum Everything: Corruption Caught Before Catastrophe
- Binary Protocols: Designing Messages For Cache Lines
By The Speed Engineer — writing about performance, systems, and the bugs that keep you up at night.