EP 6 - Don't Kill Flaky APIs: The Art of Resilient Retries

#backend #sre #webdev

The Problem: The "Thundering Herd" Effect

You call a payment gateway like Stripe or Razorpay, and it returns a 503 Service Unavailable. If you just give up, you lose revenue. But if you retry every 100ms, you are essentially launching a DDoS attack on an already struggling service.

When thousands of your server instances do this simultaneously, you get the "Thundering Herd" problem: synchronized retries hit the service in waves and keep it from recovering.

The Solution: Exponential Backoff + Jitter

Instead of retrying at fixed intervals, we wait longer after each failure and randomize the exact timing.

1. Exponential Backoff

We increase the wait time exponentially between attempts:

Attempt 1: Wait 1s
Attempt 2: Wait 2s
Attempt 3: Wait 4s
Attempt 4: Wait 8s
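
Here is a minimal sketch of that schedule in Python. The function name backoff_delay and the 30-second cap are assumptions added for illustration (the cap just stops the delay from growing forever); neither comes from a specific library.

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait for retry number `attempt` (counted from 1).

    `cap` is an assumed upper bound so the delay cannot grow without limit.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Doubles every attempt: base, 2*base, 4*base, 8*base, ...
    return min(cap, base * 2 ** (attempt - 1))


schedule = [backoff_delay(n) for n in range(1, 5)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0]
```

With the default base of 1 second, the first four delays match the list above; from attempt 6 onward the assumed cap holds the delay at 30 seconds.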

2. Adding Jitter (Randomness)

If all your servers see the failure at once and each waits exactly 2 seconds, they will all hit the API again at the same millisecond. We add a small random amount of "Jitter" to each delay to desynchronize the retries.

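Wait Time = (Base * 2^(attempt - 1)) + Random_Jitter

(Attempts are counted from 1, so with Base = 1s the first wait is about 1s, the second about 2s, and so on.)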
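
To make the formula concrete, here is a rough sketch of a retry loop that applies it. Everything named here is hypothetical: TransientError stands in for whatever your HTTP client raises on a 503, and retry, jittered_delay, and max_jitter are illustrative names, not part of any gateway SDK. Attempts are counted from 1, so the first wait is roughly Base.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure, such as a 503."""


def jittered_delay(attempt: int, base: float = 1.0, max_jitter: float = 0.5) -> float:
    """Exponential backoff plus a random extra of up to `max_jitter` seconds."""
    return base * 2 ** (attempt - 1) + random.uniform(0.0, max_jitter)


def retry(
    call: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientError with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(jittered_delay(attempt))
    raise AssertionError("unreachable")
```

Passing sleep as a parameter is only a convenience for testing without real waiting. Because each server draws its own random jitter, two servers that fail at the same instant will very likely retry at slightly different times.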

Best Practices:

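Idempotency: Only retry "Idempotent" operations (like GET or PUT), where repeating the request has the same effect as sending it once. Be very careful retrying a POST request (like createOrder), or you might charge a customer twice! If you must retry one, send an idempotency key so the gateway can recognize the duplicate (Stripe accepts one in the Idempotency-Key header).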
Circuit Breakers: If an API fails 10 times in a row, stop trying entirely for 30 seconds to let it recover.
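
The circuit breaker idea can be sketched in a few lines. This is a simplified illustration, not a production implementation: the class and method names are made up for this post, it is not thread-safe, and it treats every exception as a failure. The defaults (10 failures, 30 seconds) mirror the numbers above, and the injectable clock is an assumption added so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is refusing calls during the cooldown."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 10,
        cooldown: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0  # consecutive failures seen so far
        self.opened_at: Optional[float] = None  # None means calls are allowed

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("cooling down, call skipped")
            # Cooldown is over: let one trial call through. The failure
            # count is kept, so a failed trial reopens the breaker at once.
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the streak
        return result
```

In practice you would wrap the gateway call in breaker.call(...) and run your retry loop around that, treating CircuitOpenError as a signal to fail fast instead of retrying.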