Retry isn't bad. But use it incorrectly, and you could unknowingly become a "DDoS attacker"... of your own system.
Retry — the mechanism of repeating a request upon failure — is a crucial part of distributed system design. When one API call to another service fails due to network errors, timeouts, or temporary issues, retries are often configured to increase the chance of success.
Left uncontrolled, retry can easily turn from a supporting mechanism into the trigger of a domino-style cascade of failures.
1. When Retry Is a Double-Edged Sword
Imagine a simple scenario:
- Service A calls Service B.
- Service B is under heavy load and returns a 503 (Service Unavailable).
- Service A retries 3 times, with a 100ms delay between each attempt.
Now suppose 1000 requests hit Service A at the same time:
- Each request makes 4 calls to Service B (1 original + 3 retries).
- Total: 1000 × 4 = 4000 requests to Service B.
- While Service B is already overloaded, these retries choke it completely, leading to cascading failure.
Uncontrolled retries = shooting yourself in the foot.
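Here is what that uncontrolled pattern looks like in code, a minimal Java sketch (the service URL and wiring are illustrative, not from a real system):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NaiveRetryClient {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Anti-pattern: fixed retry count, tiny fixed delay, no jitter, no error classification.
    // Every caller multiplies traffic to the struggling service by 4x.
    static String callServiceB() throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://service-b.internal/api/resource")) // hypothetical URL
                .GET()
                .build();
        for (int attempt = 1; attempt <= 4; attempt++) { // 1 original + 3 retries
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() < 500) {
                return response.body();
            }
            if (attempt < 4) {
                Thread.sleep(100); // fixed 100ms delay: all callers retry in lockstep
            }
        }
        throw new IllegalStateException("Service B unavailable after 4 attempts");
    }
}
```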
2. Dangerous Retry Patterns
- Retry without delay → causes request storms the moment errors occur.
- Simultaneous retries from multiple instances → many services retrying at once → sudden traffic spikes → downstream crashes.
- Infinite retries → memory leaks, jammed queues, and unstoppable request storms.
3. When to Retry and When Not To
Not every error should be retried.
Retry if:
- Temporary issues: timeouts, connection resets
- System errors: HTTP 5xx like 500, 502, 503, 504
- Downstream service is restarting
Do NOT retry if:
- Client errors: 400, 401, 403, 404
- Business logic errors: user not found, insufficient funds, validation failed
- 422 Unprocessable Entity: the request is semantically invalid, so retrying cannot change the outcome
✅ Only retry if the error is recoverable.
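In code, that classification can be a small helper. A sketch based on the lists above (tune the sets to your API contract):

```java
import java.util.Set;

public final class RetryPolicy {
    // Transient server-side failures that are worth retrying.
    private static final Set<Integer> RETRYABLE_STATUSES = Set.of(500, 502, 503, 504);

    // Client errors and business failures that a retry can never fix.
    private static final Set<Integer> NON_RETRYABLE_STATUSES = Set.of(400, 401, 403, 404, 422);

    public static boolean isRetryable(int statusCode) {
        if (NON_RETRYABLE_STATUSES.contains(statusCode)) {
            return false;
        }
        return RETRYABLE_STATUSES.contains(statusCode);
    }

    public static boolean isRetryable(Throwable error) {
        // Timeouts and connection resets are typically transient.
        return error instanceof java.net.SocketTimeoutException
                || error instanceof java.net.ConnectException;
    }
}
```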
4. How to Retry the Right Way
- Limit retry attempts: never retry infinitely; use a maximum of 2–3 attempts depending on the context.
- Use delay and jitter: add delays between retries (exponential or linear), with jitter to avoid synchronized spikes (see the sketch after this list).
- Only retry idempotent actions: GET and PUT are safer than POST; avoid duplicate orders or repeated payments.
- Use a circuit breaker: temporarily cut off retries when the downstream service keeps failing.
- Deferred retry (smart retries using jobs): instead of retrying immediately, queue the task or store it in a DB, and process it later via background jobs; this avoids adding load during a system failure.
- Log everything: record the error reason, retry count, and retry time for easier debugging and alerting.
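A minimal sketch of capped exponential backoff with full jitter, in plain Java (the attempt budget and delay values are illustrative):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public final class BackoffRetry {
    // Retries the call with capped exponential backoff and full jitter.
    // Assumes the caller has already decided the failure is retryable.
    public static <T> T withRetry(Callable<T> call, int maxAttempts) throws Exception {
        long baseDelayMs = 100;   // first backoff step
        long maxDelayMs = 2_000;  // cap so waits don't grow unbounded

        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // budget exhausted: surface the original failure
                }
                // Exponential backoff: 100ms, 200ms, 400ms, ... capped at maxDelayMs.
                long exp = Math.min(maxDelayMs, baseDelayMs << (attempt - 1));
                // Full jitter: random wait in [0, exp) so callers desynchronize.
                Thread.sleep(ThreadLocalRandom.current().nextLong(exp));
            }
        }
    }
}
```

Usage would look like `BackoffRetry.withRetry(() -> callServiceB(), 3)`: the random jitter is what keeps a thousand instances from all retrying in the same 100ms window.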
5. How Do You Know When It's Safe to Retry?
- Use circuit breakers: stop retrying temporarily when a service fails repeatedly, then probe recovery gradually via the half-open state (see the sketch after this list).
- Monitor health checks and metrics: check /health endpoints, or tools like Prometheus and Grafana, to see whether services have recovered.
- Respect the Retry-After header: some APIs return it to indicate the recommended wait time before retrying.
- Rate-limit retries: avoid flooding the service again just as it starts recovering.
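As a sketch with Resilience4j (covered in the tools section below); the thresholds are illustrative and callPaymentService is a hypothetical downstream call:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {
    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open after 50% of calls fail
                .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open before half-open probes
                .permittedNumberOfCallsInHalfOpenState(3)        // trial calls while half-open
                .build();

        CircuitBreaker breaker = CircuitBreaker.of("payment", config);

        // While the breaker is open, calls fail fast instead of hammering the service.
        Supplier<String> guarded =
                CircuitBreaker.decorateSupplier(breaker, () -> callPaymentService());
        System.out.println(guarded.get());
    }

    static String callPaymentService() {
        return "OK"; // hypothetical downstream call
    }
}
```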
6. Tools for Effective Retry Implementation
Java / Spring Ecosystem:
- Spring Retry: supports @Retryable, configurable delays, backoff, and fallback with @Recover (see the sketch after this list).
- Resilience4j: combines retry, circuit breaker, rate limiter, and bulkhead into one library; works well with Spring Boot and Micrometer.
- Kafka Retry Topic: separate retry topics with delays avoid blocking the main consumer; combine with dead-letter topics for reliability.
- Quartz / Spring Task: schedule deferred retries using background jobs.
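A minimal Spring Retry sketch, assuming Spring Retry 2.x and AOP on the classpath (the exception type, timings, and PaymentClient service are illustrative):

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.EnableRetry;
import org.springframework.retry.annotation.Recover;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

import java.net.SocketTimeoutException;

@Configuration
@EnableRetry
class AppRetryConfiguration {}

@Service
class PaymentClient {

    // Retry only transient timeouts: max 3 attempts, exponential backoff from 200ms.
    @Retryable(
            retryFor = SocketTimeoutException.class,
            maxAttempts = 3,
            backoff = @Backoff(delay = 200, multiplier = 2))
    public String charge(String orderId) throws SocketTimeoutException {
        // ... call the payment service here ...
        throw new SocketTimeoutException("simulated timeout");
    }

    // Fallback invoked once all attempts are exhausted.
    @Recover
    public String recover(SocketTimeoutException e, String orderId) {
        return "Payment for " + orderId + " deferred to a background job";
    }
}
```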
Other Languages / Platforms:
- Python: tenacity (a powerful retry decorator); celery (built-in retry policy for async tasks)
- Node.js: retry, bull, agenda (retry support with timing and retry limits)
- Go: go-retryablehttp, backoff (lightweight and effective)
- Cloud-native:
  - AWS: SQS + Lambda + DLQ; Step Functions with retry/catch blocks
  - GCP: Cloud Tasks, Pub/Sub retry + DLQ; Workflows with built-in retry logic
  - Azure: Service Bus with configurable retry policy; Azure Durable Functions with built-in retry
7. Real Case: Saving the System During Peak Load with Strategic Retry
Context:
At year-end, the system was under heavy traffic due to a promotional campaign. A payment processing service got overloaded, frequently timing out. Meanwhile, a batch job was firing thousands of requests per minute, with 5 retries per request, no delay, no jitter.
Result:
Massive retry storm completely choked the payment service → triggered cascading failures in related systems → 15 minutes of downtime during peak hours.
Solution:
- Reduced retries to 2
- Added exponential backoff and jitter
- Applied circuit breaker on the job
- Moved retries to a queue and processed via background jobs
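Sketched with Resilience4j's retry module, the new policy might look like this (the values mirror the changes above; the real system's wiring may differ):

```java
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.util.function.Supplier;

public class BatchJobRetry {
    public static void main(String[] args) {
        RetryConfig config = RetryConfig.custom()
                // 2 attempts total instead of the old 5 retries per request
                .maxAttempts(2)
                // exponential backoff starting at 500ms, with randomized jitter
                .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0))
                .build();

        Retry retry = Retry.of("payment-batch", config);
        Supplier<String> guarded = Retry.decorateSupplier(retry, () -> callPayment());
        System.out.println(guarded.get());
    }

    static String callPayment() {
        return "OK"; // hypothetical payment call
    }
}
```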
Outcome:
System stabilized in under 10 minutes. Retries no longer overwhelmed the backend.
Lesson:
Retry isn’t about “hammering through” — it’s about helping the system recover gracefully.
8. Conclusion
Retry is a powerful tool when used correctly. But if applied without control, it can bring down your system faster than the original error.
Keep in mind:
- Retry only for temporary, recoverable errors
- Always limit retries, add delay + jitter, and use circuit breakers
- Effective retry isn’t about "how many times you call back", but "knowing when to stop and wait"
Retry is medicine — used wisely, it heals. Used wrong, it poisons your system.