Retry in Distributed Systems — How Production Systems Recover From Temporary Failures

#productivity #architecture #learning #writing

Not every failure is permanent.

This is something I didn't think about before. When something failed in my app, my first thought was: something broke, fix it. But when I started learning how distributed systems actually work, I realized that some failures are not really failures. They're just temporary.

Network glitch. API timeout. A service that just restarted. Rate limiting kicking in. These are all failures, but they only last for a short time. If your system tries the same operation again after a few seconds, it will probably succeed.

So the question is: does your system know how to try again? Or does it just give up the first time something goes wrong?
That's what retry is about.

What Retry Actually Does

Without a retry system, if a temporary failure happens, that's it. The entire operation fails. The user sees an error. The request is gone.
With retry, your system automatically attempts the operation again after a failure. The goal is simple: recover from temporary failures without the user even knowing something went wrong.
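
Here is a minimal sketch of that idea in Python. The name retry_simple and its parameters are my own, made up for illustration, and it deliberately retries on any exception with a fixed wait, which a real system would not do.

```python
import time

def retry_simple(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation(); if it raises, wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(delay_seconds)  # fixed wait before the next attempt
```

The sleep parameter is only there so the wait can be swapped out in tests. If the operation succeeds on the second or third attempt, the caller just gets the result and never sees the earlier failures.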

This felt obvious to me once I understood it. But building it properly is where it gets interesting.

The Configuration: What Each Part Controls

When I looked into how retry systems are actually configured, there were more options than I expected. And each one exists for a specific reason.

maxAttempts — this defines the maximum number of times the operation can be attempted. You don't want infinite retries. At some point if it keeps failing, it's probably not a temporary problem.

exponential backoff — instead of retrying immediately every time, the delay between retries doubles after each failure. First retry after 1 second, second after 2 seconds, third after 4 seconds. This gives the failing service time to recover instead of bombarding it with requests.

baseDelay — this is the starting delay used in the exponential backoff. The first wait time before retrying.

maxDelay — this caps the maximum delay. Without this, the exponential backoff keeps doubling forever and the delay becomes too long to be useful.

shouldRetry — this determines whether another retry should actually happen. Not every error deserves a retry. A 404 is not a temporary failure. A network timeout is. This config lets you define that logic.

onRetry — a callback that runs before each retry attempt. This is mainly used for logging, metrics, and monitoring. So you have a record of how many retries happened and why.
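
To see how these options fit together, here is a rough Python sketch. It is not taken from any particular library. The snake_case names mirror the options above (max_attempts for maxAttempts, and so on), and the is_transient rule and the default values are assumptions I picked for the example.

```python
import time

def is_transient(exc):
    """Assumed default rule: timeouts and connection errors are worth retrying."""
    return isinstance(exc, (TimeoutError, ConnectionError))

def backoff_delay(retry_number, base_delay=1.0, max_delay=30.0):
    """Wait before retry 1, 2, 3, ...: base_delay doubled each time, capped at max_delay."""
    return min(max_delay, base_delay * 2 ** (retry_number - 1))

def retry(operation, max_attempts=4, base_delay=1.0, max_delay=30.0,
          should_retry=is_transient, on_retry=None, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up when attempts are exhausted or the error is not retryable.
            if attempt == max_attempts or not should_retry(exc):
                raise
            delay = backoff_delay(attempt, base_delay, max_delay)
            if on_retry is not None:
                on_retry(attempt, exc, delay)  # hook for logging and metrics
            sleep(delay)
```

With the defaults, the waits are 1, 2, and 4 seconds, and no single wait ever goes above 30 seconds. An error that should_retry rejects is raised immediately, without waiting at all.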

But There's a Catch — The Thundering Herd Problem

This is the part that I found really interesting.
Imagine 200 workers all connected to the same service. The service goes down for a moment. All 200 workers detect the failure at about the same time and, because they all use the same exponential backoff calculation, they all wait the same amount of time and then retry at the same moment.

What happens? 200 requests hit the service at the same time the moment it comes back up. The service crashes again. Then all 200 retry again at the same time. It becomes a loop.

This is called the thundering herd problem. And exponential backoff alone doesn't prevent it because all workers are using the same delay calculation.
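
A small simulation makes this concrete. This is only a sketch under simplifying assumptions: every worker fails at time 0, every attempt fails instantly, and the function names are made up for the example.

```python
def backoff_delay(retry_number, base_delay=1.0, max_delay=30.0):
    """Plain exponential backoff: no randomness anywhere."""
    return min(max_delay, base_delay * 2 ** (retry_number - 1))

def retry_times(failure_time, retries, base_delay=1.0, max_delay=30.0):
    """Clock times at which one worker retries, if each attempt fails instantly."""
    times, now = [], failure_time
    for n in range(1, retries + 1):
        now += backoff_delay(n, base_delay, max_delay)
        times.append(now)
    return times

# 200 workers that all saw the failure at time 0.
workers = [retry_times(failure_time=0.0, retries=3) for _ in range(200)]
distinct_schedules = {tuple(schedule) for schedule in workers}
print(len(distinct_schedules), workers[0])  # prints: 1 [1.0, 3.0, 7.0]
```

All 200 workers end up with one identical schedule, so the service gets 200 requests at second 1, 200 more at second 3, and 200 more at second 7.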

Jitter: The Solution for Thundering Herd

Jitter means adding randomness to the retry timing. Instead of every worker waiting exactly 2 seconds, one waits 1.7 seconds, another waits 2.3 seconds, another waits 1.4 seconds. The requests spread out across a time window instead of hitting all at once.

This one small addition of randomness goes a long way toward solving the thundering herd problem. The service gets requests gradually instead of all at once, has room to recover, and the retry system actually works the way it's supposed to.
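
Here is one way jitter could be added on top of exponential backoff, as a hedged Python sketch. The jitter_ratio parameter and the plus or minus 30% range are assumptions chosen to roughly match the numbers above, not values from any specific library.

```python
import random

def backoff_delay(retry_number, base_delay=1.0, max_delay=30.0):
    """Exponential backoff without randomness, capped at max_delay."""
    return min(max_delay, base_delay * 2 ** (retry_number - 1))

def jittered_delay(retry_number, base_delay=1.0, max_delay=30.0,
                   jitter_ratio=0.3, rng=random):
    """Scale the backoff delay by a random factor in [1 - ratio, 1 + ratio]."""
    delay = backoff_delay(retry_number, base_delay, max_delay)
    # Note: jitter is applied after the cap, so a wait can exceed max_delay
    # by up to jitter_ratio in this sketch.
    return delay * rng.uniform(1 - jitter_ratio, 1 + jitter_ratio)

# 200 workers on their second retry: waits spread around 2 seconds.
rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [jittered_delay(2, rng=rng) for _ in range(200)]
print(min(delays), max(delays), len(set(delays)))
```

Each worker now lands somewhere between 1.4 and 2.6 seconds instead of exactly on 2. Another common variant, often called full jitter, picks a random wait anywhere between 0 and the backoff delay, which spreads the requests even more.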

Why This Matters in Production

What I realized going through this: retry is not just "try again". It's a carefully designed system. Without exponential backoff, you overload the failing service. Without jitter, you get thundering herd. Without maxAttempts, you retry forever. Without shouldRetry, you retry on errors that will never recover.

Every config option exists because someone ran into a real problem in production.

That's the thing about distributed systems: the failures are real, the edge cases are real, and every piece of this infrastructure exists because someone hit a wall and had to find a way through it.

If I got something wrong or anything can be improved — please drop it in the comments. I'm still learning and I want to get this right.