Liora
Your Retry Logic Works. Your Timeout Doesn't.

I watched a junior developer spend three days debugging why our retry logic worked perfectly on his machine but failed on staging. He'd written a beautiful exponential backoff algorithm with jitter and circuit breaker patterns. It was a work of art. It also never worked in staging.

The problem wasn't his code. It was that he'd been testing with a local server that responded in 50 milliseconds. Our staging environment averaged 800 milliseconds per request. His retry logic was giving up faster than the server could respond.

This is what happens when we write retry mechanisms in a vacuum. We focus on the algorithms - exponential backoff, linear backoff, polynomial backoff, whatever the latest blog post told us to use. We forget that users don't care about our elegant mathematics. They care that their upload doesn't vanish into the void.

Here's what actually matters when building retry logic:

Start with the timeout, not the algorithm. Most failures happen because developers pick random timeout values. Test your actual API under real conditions. If 95% of requests complete in 2 seconds, set your initial timeout to 3 seconds, not 500 milliseconds because "it feels right."
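A minimal sketch of deriving that number from data instead of guessing. The function name, the nearest-rank percentile method, and the 1.5x headroom multiplier are my assumptions, not a standard:

```python
def pick_timeout(latency_samples_ms, percentile=95, headroom=1.5):
    """Derive an initial timeout from measured latencies, not guesswork.

    latency_samples_ms: request durations collected under real conditions.
    headroom: multiplier above the percentile so normal slow requests
    still finish (1.5x is an assumption; tune it for your service).
    """
    samples = sorted(latency_samples_ms)
    # Nearest-rank percentile: index of the sample at the cutoff.
    idx = max(0, int(len(samples) * percentile / 100) - 1)
    return samples[idx] * headroom

# Example: latencies clustered around 800 ms, like our staging box.
samples = [750, 790, 810, 820, 850, 900, 950, 1100, 1900, 2000]
print(pick_timeout(samples))  # 95th percentile 1900 ms * 1.5 = 2850.0 ms
```

Run it against a day of production latencies and you get a timeout grounded in how your API actually behaves, not how it feels.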

Count failures differently. Don't count "attempts." Count "time since last success." A function that fails three times in one minute needs different handling than one that fails three times across three hours. The second one might just be a server having a bad moment.
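One way to encode "time since last success" as the health signal; the class name and the 120-second drought window are illustrative choices, not a prescription:

```python
import time

class SuccessTracker:
    """Decide whether to keep retrying based on how long it's been
    since ANYTHING worked, not on a raw attempt count."""

    def __init__(self, max_drought_s=120.0, clock=time.monotonic):
        self.max_drought_s = max_drought_s
        self.clock = clock  # injectable for testing
        self.last_success = clock()

    def record_success(self):
        self.last_success = self.clock()

    def should_keep_trying(self):
        # Three failures spread across three hours reset this window
        # each time a success lands in between; three failures inside
        # one drought window do not.
        return (self.clock() - self.last_success) < self.max_drought_s
```

Injecting the clock makes the "bad moment vs. real outage" distinction trivially testable without sleeping in your test suite.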

Handle the three failure types separately. Network timeouts need aggressive retries. 5xx errors need exponential backoff. 4xx errors need to stop immediately because you're probably sending bad data. Mixing these together is like using the same medicine for headaches and broken arms.
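The three-way split above can be sketched as a single decision function. The status-code ranges follow HTTP semantics; the attempt limits and delay values are assumptions to tune:

```python
def retry_decision(status_code=None, timed_out=False, attempt=1):
    """Return (should_retry, delay_seconds) for one failed request."""
    if timed_out:
        # Network timeout: retry aggressively with a short, flat delay.
        return (attempt <= 5, 0.5)
    if status_code is not None and 500 <= status_code < 600:
        # Server error: back off exponentially, capped at 30 seconds.
        return (attempt <= 3, min(2 ** attempt, 30))
    if status_code is not None and 400 <= status_code < 500:
        # Client error: the request itself is bad; retrying won't help.
        return (False, 0)
    # Unknown failure: be conservative and stop.
    return (False, 0)

print(retry_decision(timed_out=True, attempt=2))   # (True, 0.5)
print(retry_decision(status_code=503, attempt=2))  # (True, 4)
print(retry_decision(status_code=422))             # (False, 0)
```

Keeping the decision in one pure function means the "which medicine for which injury" logic is testable in isolation from any network code.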

Build visibility in from day one. Every retry should log: what failed, why it failed, how long it waited, and whether it succeeded. Without this, you're debugging in the dark. With it, patterns emerge. Like how our payment API always fails at 2:17 AM because that's when the backup server reboots. Not coincidentally, this is also when our payment success rate mysteriously drops.
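A sketch of what "log every retry" looks like in practice, using Python's standard logging module; the wrapper name and the structured key=value log format are my conventions, not a library API:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retry")

def retry_with_logging(func, attempts=3, base_delay=1.0):
    """Run func(), logging what failed, why it failed, how long
    we waited, and whether a retry eventually succeeded."""
    for attempt in range(1, attempts + 1):
        try:
            result = func()
            log.info("attempt=%d status=success", attempt)
            return result
        except Exception as exc:
            delay = base_delay * 2 ** (attempt - 1)
            log.warning(
                "attempt=%d status=failed error=%r next_wait=%.1fs",
                attempt, exc, delay if attempt < attempts else 0.0,
            )
            if attempt == attempts:
                raise  # out of attempts; surface the real error
            time.sleep(delay)
```

Grep those lines for a week and the 2:17 AM pattern falls out of the data instead of out of a three-day debugging session.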

Test with real chaos. Don't just unplug your ethernet cable. Throttle your connection to 56k speeds. Introduce 10% packet loss. Run your code on a cheap Android phone from 2016. Your users are already doing this. You should too.
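You can fake that chaos in-process before reaching for network-level tools. A hypothetical wrapper that mirrors the numbers above, 10% loss and heavy added latency; the name and defaults are illustrative:

```python
import random
import time

def chaotic(func, loss_rate=0.10, extra_latency_s=0.8, rng=random.random):
    """Wrap a callable so it behaves like a bad network:
    drops roughly loss_rate of calls and delays the rest."""
    def wrapper(*args, **kwargs):
        if rng() < loss_rate:
            raise TimeoutError("simulated packet loss")
        time.sleep(extra_latency_s)  # simulated slow link
        return func(*args, **kwargs)
    return wrapper
```

Point your retry logic at a `chaotic(...)`-wrapped client in tests and the "worked on my machine, died on staging" gap shows up before deploy. (For real network shaping, OS tools like Linux's `tc netem` can add delay and loss below the application.)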

The junior developer fixed his retry logic by adding a 1-second initial timeout and stopping after 3 attempts instead of 5. Success rate jumped from 73% to 98%. Sometimes the best code is the code you don't write - or in this case, the retry attempts you don't attempt.

He learned that debugging retry logic is 20% understanding algorithms and 80% understanding how networks actually behave in the wild. The real world doesn't care about your beautiful code. It cares whether the upload worked before the user rage-quit the app.

The best part? Two months later, he caught me making the same mistake with a different service.
