The networking problem behind every "random" backend outage.

#networking #backend #python #devops

You get paged at 2am. The service is down. You check the app — no deploys, no config changes, nothing. You restart the container and it comes back. You go to sleep. It happens again Thursday.

It was never the app.

I spent three years doing satellite internet support before I moved into backend engineering. That job taught me one thing: most "application" problems are network problems wearing a disguise. I see the same patterns now in backend systems that I saw then in rural broadband infrastructure.

Here are the ones that get teams every time.

The timeout that isn't a timeout

Your service calls a third-party API. It times out. You log it, you retry, life goes on. But the retries pile up. Each retry holds a connection open. Your connection pool fills. New requests start queuing. The queue backs up. Now your service looks down — but the third-party API recovered ten seconds ago.

The fix is not a shorter timeout. The fix is a circuit breaker. Don't retry into a wall. Detect the wall and stop knocking.
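Here is a minimal sketch of that idea in Python. It is not a production library: the class name, the thresholds, and the injectable clock are all assumptions made for illustration, and it ignores thread safety. After a run of consecutive failures it refuses calls outright for a cool-down period, then lets a trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the breaker is open."""


class CircuitBreaker:
    # failure_threshold and reset_after are illustrative defaults, not recommendations.
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: the upstream call is never attempted,
                # so no pooled connection is tied up waiting on it.
                raise CircuitOpenError("upstream marked unhealthy")
            # Cool-down has elapsed: fall through and allow a trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap the outbound call, something like breaker.call(fetch_from_vendor, request), and treat CircuitOpenError as "serve a fallback now" rather than "retry".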

DNS TTL lying to you in production

You rotate a database host. You update the DNS record. You wait for TTL to expire — 300 seconds, fine. But your app has been running for six hours and the old IP is baked into the JVM DNS cache, or your connection pool, or a library that ignores TTL entirely.

The new host is up. The app is still talking to the old one. The old one is gone. Outage.

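Lower the TTL well before a planned DNS change, at least one full old-TTL period ahead, so that resolvers which cached the record under the long TTL have already expired it by the time you switch. And test your app's actual DNS resolution behaviour, not just the record.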
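One rough way to test the app rather than the record is to compare the IP a live connection is using against what the hostname resolves to right now. The sketch below assumes you can get the connected IP yourself (for a raw socket, sock.getpeername()[0]); the function names and the resolver parameter are hypothetical, added so the check can be exercised without a real DNS server.

```python
import socket


def resolve_ips(hostname, port, resolver=socket.getaddrinfo):
    """Return the set of IP addresses the hostname resolves to right now."""
    infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def is_stale(connected_ip, hostname, port, resolver=socket.getaddrinfo):
    """True if an open connection points at an IP the name no longer resolves to."""
    return connected_ip not in resolve_ips(hostname, port, resolver)
```

Keep in mind that socket.getaddrinfo goes through the operating system's resolver, so a local caching layer can still hand back an old answer. Treat a "not stale" result as a hint, not proof.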

The packet that never comes back

TCP connections are stateful. A NAT device, a load balancer, a firewall — they all keep track of active connections. Leave a connection idle long enough and that state entry gets evicted. The next packet on that connection goes nowhere. Your app is still waiting for a response that will never arrive.

This is the silent killer of database connection pools. The DB is fine. The network path is fine. But the connection your pool thinks is open has been silently dropped by a load balancer that forgot it existed.

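Keepalives exist for this reason. Use them. Set tcp_keepalive_time lower than the shortest idle timeout on the path, whether that is a NAT, a firewall, or a load balancer. The Linux default is 7200 seconds, two hours, which outlasts almost any middlebox timeout, and it only applies to sockets that enable SO_KEEPALIVE in the first place. Most default settings are wrong for production.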
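If you control the socket, you can set this per connection instead of system-wide. A minimal sketch, assuming Linux-style socket options: the helper name and the 60/10/5 values are illustrative, and the right numbers depend on the idle timeout of whatever sits in your path.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive for one socket, overriding kernel defaults where possible."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform-specific (all three exist on Linux),
    # so each one is applied only if this Python build exposes it.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the connection is dropped
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Most database drivers and HTTP clients create their own sockets, so in practice you would look for the equivalent keepalive settings in the driver or pool configuration rather than calling this directly.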

MTU mismatch on the path nobody checks

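A packet leaves your server at 1500 bytes. Somewhere between you and the destination, a link has an MTU of 1400. The packet needs to be fragmented. If the DF (don't fragment) bit is set, the router drops it and sends back an ICMP "fragmentation needed" message. If that ICMP is blocked, which it often is, the sender never hears about it and the drop is effectively silent. The connection hangs, usually once a payload is big enough to fill a packet, so small requests work and large responses stall. Nothing in your application logs explains why.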

I saw this constantly in satellite networks where overhead compression changed effective MTU. I still see it in cloud environments where overlay networks, VPNs, and tunnel encapsulation all shave bytes off the path.

Run tracepath instead of traceroute: it reports the path MTU as it goes. Check that PMTUD (path MTU discovery) actually works end to end. If you're running on Kubernetes with Flannel or Calico, know what your overlay MTU actually is.
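The arithmetic behind the overlay MTU is simple enough to sanity-check by hand. A small sketch, assuming an IPv4 underlay and TCP without options; the table covers only two common encapsulations, and the function names are made up for this example.

```python
# Per-packet bytes added by two common encapsulations over an IPv4 underlay.
ENCAP_OVERHEAD = {
    "none": 0,
    "ipip": 20,   # one extra IPv4 header
    "vxlan": 50,  # outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14)
}


def effective_mtu(link_mtu, encap="none"):
    """MTU left for the inner packet once the encapsulation overhead is paid."""
    return link_mtu - ENCAP_OVERHEAD[encap]


def tcp_mss(mtu):
    """Largest TCP payload per packet: MTU minus IPv4 (20) and TCP (20) headers, no options."""
    return mtu - 40


overlay_mtu = effective_mtu(1500, "vxlan")
print(overlay_mtu, tcp_mss(overlay_mtu))  # 1450 1410
```

If the pod interface is configured at 1500 while the overlay can only carry 1450, you have exactly the mismatch described above, just inside your own cluster.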

The retry storm

Your upstream is slow. Your service retries. Every service instance retries at the same time because they all hit the same timeout window. Your upstream — which was recovering — now gets hit with 10x normal traffic. It goes down again.

Add jitter to your retries. Exponential backoff without jitter is a coordinated attack on your own infrastructure.
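One common way to do this is "full jitter": compute the exponential ceiling, then sleep for a random amount between zero and that ceiling. A minimal sketch; the function name, base, and cap are assumptions, and the rng parameter only exists so the behaviour can be tested with a seeded generator.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: a random delay between 0 and the capped exponential ceiling."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

In a retry loop you would call time.sleep(backoff_delay(attempt)) with attempt starting at 0. Because each instance draws its own random delay, retries that started in the same timeout window spread out instead of landing together.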

Why this matters more now

More surface area means more network paths. Microservices, managed databases, external APIs, LLM providers — each hop is a place for the network to betray you. The app is often the last thing to blame.

When something breaks randomly and the restart fixes it, start at the network. Check connection pool state, check DNS, check keepalive settings. The answer is usually there.

The app was fine the whole time.
