The Problem: The "Thundering Herd" Effect
You call a payment gateway like Stripe or Razorpay, and it returns a 503 Service Unavailable. If you just give up, you lose revenue. But if you retry every 100ms, you are essentially launching a DDoS attack on an already struggling service.
If thousands of your server instances do this simultaneously, it's called the "Thundering Herd" problem.
The Solution: Exponential Backoff + Jitter
Instead of retrying at fixed intervals, we use a smarter mathematical approach.
1. Exponential Backoff
We increase the wait time exponentially between attempts:
- Attempt 1: Wait 1s
- Attempt 2: Wait 2s
- Attempt 3: Wait 4s
- Attempt 4: Wait 8s
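The schedule above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `backoff_delay`, the 1-based `attempt` numbering, and the 60-second `cap` are all assumptions added here. The cap is a common safeguard so the delay does not grow without bound.

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return the wait in seconds before retry number `attempt` (1-based).

    `base` and `cap` are illustrative defaults, not values from any gateway.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on every attempt, but never exceed the cap.
    return min(cap, base * 2 ** (attempt - 1))


delays = [backoff_delay(n) for n in range(1, 5)]
print(delays)  # [1.0, 2.0, 4.0, 8.0]
```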
2. Adding Jitter (Randomness)
If all your servers see the failure at once and each waits exactly 2 seconds, they will all hit the API again at almost the same instant. We add a small random amount of "Jitter" to each delay to desynchronize the retries.
- Wait Time = (Base * 2^(attempt - 1)) + Random_Jitter, where Base = 1s reproduces the 1s, 2s, 4s, 8s schedule above
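Here is a rough sketch of the jittered delay, with attempts numbered from 1 so that it lines up with the schedule in step 1. The name `delay_with_jitter`, the `max_jitter` of 0.5 seconds, and the injectable `rng` parameter are assumptions made for illustration. A uniform random offset is one simple choice among several jitter strategies.

```python
import random


def delay_with_jitter(attempt: int, base: float = 1.0,
                      max_jitter: float = 0.5, rng=random) -> float:
    """Exponential backoff plus a uniform random offset, in seconds.

    `rng` can be the `random` module or a `random.Random` instance
    (useful for seeding in tests). All defaults are illustrative.
    """
    backoff = base * 2 ** (attempt - 1)
    # Each caller draws its own offset, so retries spread out in time.
    return backoff + rng.uniform(0.0, max_jitter)


for attempt in range(1, 5):
    print(f"attempt {attempt}: wait {delay_with_jitter(attempt):.2f}s")
```

Because every server draws its own random offset, two servers that failed at the same moment are unlikely to retry at the same moment.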
Best Practices:
- Idempotency: Only retry "Idempotent" operations (like GET or PUT). Be very careful retrying a POST request (like createOrder), or you might charge a customer twice!
- Circuit Breakers: If an API fails 10 times in a row, stop trying entirely for 30 seconds to let it recover.
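Both practices can be sketched together. Everything named here (`IDEMPOTENT_METHODS`, `is_safe_to_retry`, `CircuitBreaker`, and the injectable `clock`) is hypothetical and written for this example. The threshold of 10 failures and the 30-second cooldown come from the text above, and the method set follows the HTTP definition of idempotent methods. Treat it as a simplified sketch: a production breaker would usually add a "half-open" state that lets a single trial request through before closing fully.

```python
import time

# HTTP methods defined as idempotent; POST and PATCH are deliberately absent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"}


def is_safe_to_retry(method: str) -> bool:
    """True if the HTTP method can be retried without risking duplicates."""
    return method.upper() in IDEMPOTENT_METHODS


class CircuitBreaker:
    """Stop calling an API after repeated consecutive failures."""

    def __init__(self, failure_threshold: int = 10, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock          # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown:
            # Cooldown elapsed: close the circuit and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


breaker = CircuitBreaker()
print(is_safe_to_retry("PUT"), is_safe_to_retry("POST"))  # True False
print(breaker.allow_request())  # True
```

A caller would check `breaker.allow_request()` before each attempt, then report the outcome with `record_success()` or `record_failure()`.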