speed engineer

Posted on Jun 11 • Originally published at Medium

Networking for Developers: TCP vs UDP (When Each Protocol Kills Your App)

Your video call stutters. Your game lags. You picked the wrong protocol, and now you’re debugging packets at midnight.

Choosing between TCP and UDP isn’t academic — it’s the difference between your app working and your users complaining. Pick wrong and you’ll trace symptoms for days before finding the real cause.

Our P99 latency hit 5 seconds randomly. Three days of packet tracing led me to a single dropped packet.

Not corrupted. Not delayed. Just gone.

I’d been running a real-time sensor network. Temperature readings every 100ms. Life-or-death? No. But the client paid for sub-second response times. We were violating the SLA hourly.

The system used TCP. Reliable delivery, ordered packets — textbook choice for anything important, right?

Wrong.

The Debugging Session That Changed Everything

So I checked the obvious first. Network saturation? No. Server CPU? Fine. Memory leaks? Clean. I spent two days looking at application code before someone suggested I actually look at the network.

I fired up tcpdump on the sensor gateway:

tcpdump -i eth0 -n 'tcp port 8080' -w capture.pcap

Watched it for an hour. Then opened Wireshark.

That’s when I saw it. One packet dropped. TCP’s retransmission timer kicked in. 200ms wait. Retry. Another drop. Exponential backoff. 400ms. 800ms. 1600ms. By the time the packet finally made it through, we’d blown past 5 seconds.

Five seconds of latency because of one 512-byte packet.
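To see how quickly those retries add up, here is a minimal sketch of the arithmetic. It assumes a 200 ms initial retransmission timeout that simply doubles after every failed attempt, matching the values seen in the capture. Real TCP stacks derive the timeout from measured round-trip time, so treat the numbers as illustrative. The function name `cumulative_backoff_ms` is made up for this example.

```python
def cumulative_backoff_ms(initial_rto_ms=200, lost_attempts=5):
    """Return the total time waited after each consecutive lost attempt.

    Assumption: the timeout doubles on every loss, which simplifies
    what a real TCP stack does.
    """
    totals = []
    total = 0
    rto = initial_rto_ms
    for _ in range(lost_attempts):
        total += rto        # wait out the current timeout before retrying
        totals.append(total)
        rto *= 2            # exponential backoff: double the next timeout
    return totals


print(cumulative_backoff_ms())  # [200, 600, 1400, 3000, 6200]
```

Under these assumptions, four lost attempts cost 3 seconds in total and a fifth pushes the stall past 6 seconds, which is consistent with spikes beyond 5 seconds.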

I Assumed TCP Was Reliable — Then Packet Loss Taught Me Different

TCP is reliable in that it eventually delivers your data. But reliable doesn’t mean fast. It doesn’t even mean predictable.

Networks fail.

When a packet drops on TCP, delivery on that connection stalls. TCP guarantees ordering. So if packet #47 disappears, packets #48 through #500 just… wait. They’re already at the receiver. Sitting in a buffer. Unusable. This is head-of-line blocking.

UDP doesn’t care. Packet #47 vanishes? Packet #48 gets delivered anyway. No waiting. No retries. No guarantees.

I had temperature sensors. If reading #47 was lost, reading #48 was still useful. More useful than waiting 5 seconds for stale data.
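Head-of-line blocking is easy to reproduce without a network. The sketch below is a toy model, not TCP itself: `deliver_in_order` is a hypothetical helper that releases data to the application only when the sequence is contiguous, the way an ordered receiver must.

```python
def deliver_in_order(arrivals, first_seq):
    """Simulate an ordered receiver.

    Returns a list of (arrived_seq, released_seqs) pairs showing what the
    application is handed after each arrival.
    """
    buffered = set()
    next_seq = first_seq
    log = []
    for seq in arrivals:
        buffered.add(seq)
        released = []
        # Release only the contiguous run starting at next_seq.
        while next_seq in buffered:
            buffered.remove(next_seq)
            released.append(next_seq)
            next_seq += 1
        log.append((seq, released))
    return log


# Packet 47 is lost and its retransmission arrives after 48, 49, and 50.
arrivals = [46, 48, 49, 50, 47]
for seq, released in deliver_in_order(arrivals, first_seq=46):
    print(f"arrived {seq}: application receives {released}")
```

Packets 48 to 50 produce empty deliveries until 47 shows up, at which point all four are released at once. A UDP-style receiver would hand each datagram to the application the moment it arrived.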

This matters for revenue. Our client was aggregating sensor data for HVAC optimization. Five-second-old temperature readings meant HVAC systems reacting to old conditions. Wasted energy. Real dollars.

The Foundation: What These Protocols Actually Do

TCP establishes connections. Three-way handshake: SYN, SYN-ACK, ACK. Overhead before you send a single byte of application data. Every packet gets acknowledged. Missing ACK? Retransmit. Receiver buffers out-of-order packets and delivers them in sequence to your application.

Flow control prevents fast senders from overwhelming slow receivers. Congestion control backs off when the network is saturated. TCP is a state machine with 11 different states. It’s complex because it’s trying to make an unreliable network look reliable.

UDP is simpler. You call sendto(). Packet goes on the wire. That's it. No connection. No state. No acknowledgments. No retries. No guarantees about ordering or delivery.

Here’s what UDP looks like:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"temperature:23.5", ("10.0.1.50", 8080))

Three lines of code. Fire and forget. If the network drops it, you’ll never know.
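The receiving side is just as small. Here is a sketch that sends and receives one datagram on the loopback interface. It uses an OS-assigned port instead of the 10.0.1.50:8080 address so that it runs on a single machine; the one-second timeout is an arbitrary choice for the example.

```python
import socket

# Receiver: bind to loopback and let the OS pick a free port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(1.0)  # do not block forever if the datagram is lost
address = receiver.getsockname()

# Sender: no connect, no handshake, just one datagram.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"temperature:23.5", address)

try:
    payload, peer = receiver.recvfrom(2048)
    print(payload.decode(), "from", peer)
except socket.timeout:
    payload = None
    print("nothing arrived, and UDP will not tell you why")
finally:
    sender.close()
    receiver.close()
```

Note that the timeout belongs to the application, not the protocol. UDP itself never waits for anything.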

TCP needs more ceremony:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("10.0.1.50", 8080))
sock.sendall(b"temperature:23.5")
sock.close()

Connection setup and teardown. More syscalls. More network round-trips. Higher latency even when nothing goes wrong.

TCP’s handshake and acknowledgment overhead seems reasonable — until you’re sending hundreds of small messages per second and every round-trip adds milliseconds. UDP skips all of it.

When Different Protocols Make Sense

I rebuilt the sensor system with UDP. Latency dropped to 50ms P99. Problem solved.

Except not every problem wants UDP.

Use TCP when data loss is unacceptable. HTTP requests: you can’t skip part of an HTML page. File transfers: corrupted or incomplete files are useless. Database queries: missing rows break application logic. Email delivery: partial messages don’t work. API calls where you need confirmation.

Use UDP when timeliness beats completeness. Live video: one dropped frame is barely visible, while pausing to retry is noticeable. Gaming: 200ms-old player positions are worthless, and waiting for retransmits is worse. DNS queries: if the response doesn’t arrive, just ask again. Metrics collection: one missing data point doesn’t invalidate the trend. VoIP: humans tolerate brief audio dropouts better than lag.

Speaking of DNS, that’s often your first failure point. DNS uses UDP for speed. When a query goes unanswered, the resolver times out (typically after a few seconds, depending on configuration) and retries. If DNS is slow, everything feels slow: even TCP connections stall at hostname resolution.

Actually, most people don’t realize DNS is UDP by default. It falls back to TCP only for responses over 512 bytes (before EDNS). This design choice prioritizes speed for the 99% case.
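The "just ask again" pattern looks roughly like this in application code. This is a generic sketch, not a DNS client: `udp_request` is a hypothetical helper, and the timeout and attempt count are illustrative defaults. A real resolver would also match each reply to its query ID and check the sender's address.

```python
import socket


def udp_request(server, payload, timeout=2.0, attempts=3):
    """Send one datagram and wait for one reply, resending on timeout.

    Returns the reply bytes, or None if every attempt timed out.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(attempts):
            sock.sendto(payload, server)
            try:
                reply, _addr = sock.recvfrom(4096)
                return reply
            except socket.timeout:
                continue  # no answer in time: ask again
    return None
```

The retry policy lives entirely in the application, so the caller can choose how long stale data is worth waiting for. That is exactly the control TCP takes away.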

The Moment I Actually Saw Network Behavior

Back to my debugging session.

I set up continuous packet capture. Left it running overnight. Next morning I filtered for retransmissions:

tcp.analysis.retransmission

Thousands of entries. Not random though. They clustered around 2 AM. Same time every night.
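One way to spot that kind of clustering is to bucket the retransmission timestamps by hour of day. The sketch below assumes the timestamps have already been exported from the capture as epoch seconds. The sample values are invented for illustration and are not taken from the original capture.

```python
from collections import Counter
from datetime import datetime, timezone


def retransmissions_per_hour(epoch_seconds):
    """Count timestamps per hour of day (UTC)."""
    hours = (
        datetime.fromtimestamp(ts, tz=timezone.utc).hour
        for ts in epoch_seconds
    )
    return Counter(hours)


# Invented sample: three events shortly after 02:00 UTC, one at 14:00 UTC.
sample = [7200 + 60, 7200 + 120, 7200 + 900, 14 * 3600 + 5]
counts = retransmissions_per_hour(sample)
for hour, count in counts.most_common():
    print(f"{hour:02d}:00  {count} retransmissions")
```

A spike in a single bucket is a strong hint to go looking for scheduled jobs at that hour.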

So I checked the infrastructure logs. Backup jobs. Every night at 2 AM, backup traffic saturated the 1Gbps link. TCP saw congestion, slowed down, retransmitted dropped packets. My sensor traffic got caught in it.

With UDP, the backup traffic still saturated the link. Some sensor packets still dropped. But the surviving packets arrived immediately. No cascading retransmission delays.

I’m still not sure why the network team scheduled backups during production hours. Politics, probably.

The Gotcha: Timeouts and Buffer Sizes

Here’s my real mistake: I set my read timeout too short. Wasted a week tuning it.

sock.settimeout(0.5)  # Don't do this blindly

Half a second seemed reasonable. But under congestion, legitimate packets took 800ms to arrive. My timeout fired. Connection closed. Data lost.

I bumped it to 2 seconds. Helped sometimes. Made the head-of-line blocking worse other times. Longer timeouts meant the application waited longer when packets actually were lost.

With UDP, timeouts work differently. You’re not waiting for the protocol to retry. You’re just waiting for data. If nothing arrives, your application decides what to do. Send a new request? Use stale data? Your choice.
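A sketch of what "your choice" can look like: wait briefly for a fresh reading, and fall back to the last known value if nothing arrives. `read_latest` is a hypothetical helper, and the loopback sockets stand in for a real sensor.

```python
import socket


def read_latest(sock, last_value, timeout=0.5):
    """Return (value, is_fresh); fall back to last_value on timeout."""
    sock.settimeout(timeout)
    try:
        data, _addr = sock.recvfrom(2048)
    except socket.timeout:
        return last_value, False  # nothing new: the caller decides what to do
    return data.decode(), True


receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

sender.sendto(b"23.5", receiver.getsockname())
fresh = read_latest(receiver, last_value=None)
# No second datagram is sent, so this call times out and reuses the old value.
stale = read_latest(receiver, last_value=fresh[0], timeout=0.1)
print(fresh, stale)

sender.close()
receiver.close()
```

The `is_fresh` flag lets the caller mark a reading as stale, skip it, or trigger a new request, instead of blocking on a retransmission it never asked for.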

Buffer sizes matter too. TCP receive buffers hide latency problems until they overflow. Then you get tail latency spikes. UDP sockets need no reordering queue, so a lost packet never holds later data in the buffer. The flip side: when a UDP receive buffer fills up, new datagrams are dropped silently.
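To check what your own system actually gives you, the receive buffer size can be read and requested per socket. This small sketch uses the standard `SO_RCVBUF` option. The 1 MiB request is an arbitrary illustrative value, and the operating system is free to cap or adjust it (Linux, for example, reports double the requested size to account for bookkeeping overhead).

```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    default_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    # Ask for a 1 MiB buffer; the OS may grant a different size.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
    granted_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print(f"default receive buffer: {default_size} bytes")
print(f"after requesting 1 MiB: {granted_size} bytes")
```

Always read the value back after setting it, because the size you asked for is not necessarily the size you got.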

Diagnosis Cascades Through Layers

Here’s how the debugging actually went. Timeouts are symptoms. I saw timeouts in application logs. Traced them to slow responses. Speaking of symptoms, high CPU often masks network issues — if your app is busy retrying, CPU looks busy, but you’re not doing useful work.

Slow responses came from retransmissions. Retransmissions came from packet loss. Packet loss came from link saturation. Link saturation came from backup jobs.

Each layer revealed the next. This is how network debugging works. You start at the application layer (HTTP 500s, timeouts) and work down through TCP (retransmissions, connection resets) to IP (routing, fragmentation) to the physical layer (link saturation, bit errors).

Why connection pooling matters: every TCP connection has setup cost. If you’re making hundreds of requests per second, connection setup becomes the bottleneck. Pools amortize that cost. But they also hide problems: a bad connection stays in the pool, failing or stalling requests until health checks remove it.

UDP doesn’t have connection pools. No connections to pool. Each packet is independent. Lower complexity, but you lose connection-level metrics and circuit breaking.

The Middle Ground: QUIC

I mentioned the sensor network earlier. We eventually migrated to QUIC.

QUIC runs on UDP but adds reliability features. Independent streams with precise loss recovery: only the lost packets are retransmitted, and a loss stalls only the stream it belongs to, not everything behind it. Connection migration: your phone switches from WiFi to cellular, and the connection survives. Reduced handshake latency: the transport handshake and TLS setup are combined into one round-trip, where TCP plus TLS needs at least two.

HTTP/3 uses QUIC. Google built it to fix TCP’s head-of-line blocking for web traffic. A slow-loading image doesn’t block JavaScript anymore.

QUIC isn’t perfect. It’s complex. Debugging is harder — encrypted from the start, so tcpdump shows less. NAT traversal can be tricky. CPU overhead is higher than plain UDP.

But for applications that need reliability and low latency, it’s worth considering.

[Figure: a three-tier timeline of TCP (1981), UDP (1980), and QUIC (2012), with arrows marking the “reliability” vs “speed” tradeoffs and showing how QUIC attempts to combine both.]

Caption: QUIC isn’t replacing TCP everywhere, but it’s solving real problems for specific use cases. When you need both reliability and speed, the protocol layer matters more than you think.

What This Means For Your Next Project

Don’t cargo-cult protocol choices. “Everyone uses TCP” isn’t engineering reasoning.

Ask: what happens when packets drop? If you need every byte in order, TCP is right. If recent data beats complete data, consider UDP.

Test under realistic conditions. Packet loss isn’t theoretical. Saturate your network in staging. Drop packets with tc on Linux:

tc qdisc add dev eth0 root netem loss 1%

One percent packet loss. Watch how your application behaves. TCP might be fine. Or you might see latency spike to seconds.
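One percent sounds harmless until you multiply it out. As a rough worked example, assuming each packet is dropped independently (real loss tends to come in bursts, so this is only an approximation), the chance of hitting at least one drop grows quickly with the number of packets sent:

```python
def p_at_least_one_loss(loss_rate, packets):
    """Probability that at least one of `packets` is dropped,
    assuming independent losses."""
    return 1 - (1 - loss_rate) ** packets


# 600 packets is one minute of readings at one every 100 ms.
for n in (1, 10, 100, 600):
    print(f"{n:>4} packets at 1% loss: {p_at_least_one_loss(0.01, n):.1%}")
```

At 100 packets the chance of at least one drop is already about 63%, so a sensor sending every 100ms should expect to trigger a retransmission within seconds.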

Measure what matters. Throughput? Latency? P99? P999? Different protocols optimize for different metrics. Under packet loss, UDP tends to give better P99 latency. TCP guarantees that every byte eventually arrives, in order.
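If you are not already tracking tail percentiles, here is a minimal sketch of the nearest-rank method. The latency samples are invented to mimic a mostly fast service with occasional retransmission stalls, and `percentile` is a hypothetical helper rather than a library function.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Invented latencies in ms: 985 fast requests and 15 that hit a stall.
latencies = [20] * 985 + [5000] * 15
print("mean:", sum(latencies) / len(latencies))
print("P50: ", percentile(latencies, 50))
print("P99: ", percentile(latencies, 99))
```

The median stays at 20 ms and the mean is under 100 ms, while P99 lands on the 5-second stall. This is why averages hide exactly the problem described in this post.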

My sensor network runs on QUIC now. P99 latency is 45ms. Packet loss doesn’t cascade anymore. We still lose packets — that’s networks — but the system degrades gracefully.

Tomorrow: I debugged a UDP packet loss nightmare. Turns out application-level acknowledgments are harder than they look. More on that soon.

Sources consulted: RFC 793 (TCP), RFC 768 (UDP), RFC 9000 (QUIC), Linux kernel TCP implementation docs, Cloudflare’s blog on QUIC deployment
