DEV Community

lowkey dev

Be careful with retries — don't DDoS your own system

Retry isn't bad. But used incorrectly, you could unknowingly become a "DDoS hacker"... of your own system.

Retry — the mechanism of repeating a request upon failure — is a crucial part of distributed system design. When one API call to another service fails due to network errors, timeouts, or temporary issues, retries are often configured to increase the chance of success.

Left uncontrolled, retry can easily turn from a supporting mechanism into the trigger of a domino-style cascading failure.


1. When Retry Is a Double-Edged Sword

Imagine a simple scenario:

  • Service A calls Service B.
  • Service B is under heavy load and returns a 503 (Service Unavailable).
  • Service A retries 3 times, with a 100ms delay between each attempt.

Now suppose 1000 requests hit Service A at the same time:

  • Each request makes 4 calls to Service B (1 original + 3 retries).
  • Total: 1000 × 4 = 4000 requests to Service B.
  • While Service B is already overloaded, these retries choke it completely, leading to cascading failure.

Uncontrolled retries = shooting yourself in the foot.
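The arithmetic above is worth making explicit. A minimal Python sketch (the function name is illustrative):

```python
# Retry amplification from the scenario above: every failing request
# costs one original attempt plus max_retries retries.
def retry_amplification(concurrent_requests: int, max_retries: int) -> int:
    return concurrent_requests * (1 + max_retries)

print(retry_amplification(1000, 3))  # 4000 calls hit Service B, not 1000
```

Note that the multiplier compounds across hops: if Service B also retries its own downstream calls 3 times, the original 1000 requests can become 16,000 at the bottom of the chain.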



2. Dangerous Retry Patterns

Retry without delay
→ Causes request storms when errors occur.

Simultaneous retries from multiple instances
→ Multiple services retrying at once → sudden traffic spikes → downstream crashes.

Infinite retries
→ Can cause memory leaks, jammed queues, and unstoppable request storms.
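To make the first pattern concrete, here is a deliberately bad sketch (names are illustrative): a bounded but zero-delay retry loop. Every attempt fires back-to-back, so one user-visible request multiplies into several simultaneous downstream calls.

```python
# ANTI-PATTERN (illustrative): zero-delay retries amplify load downstream.
def call_with_naive_retry(call, max_attempts=4):
    last_exc = None
    for _ in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_exc = exc  # no delay, no jitter: attempts land back-to-back
    raise last_exc

attempts = 0
def flaky():
    global attempts
    attempts += 1
    raise TimeoutError("downstream overloaded")

try:
    call_with_naive_retry(flaky)
except TimeoutError:
    pass
print(attempts)  # 4 downstream calls for one user-visible request
```

The unbounded variant (`while True:`) is strictly worse: the storm never stops. Bounding attempts only caps the multiplier; it does not remove the spike.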


3. When to Retry and When Not To

Not every error should be retried.

Retry if:

  • Temporary issues: timeouts, connection resets
  • System errors: HTTP 5xx like 500, 502, 503, 504
  • Downstream service is restarting

Do NOT retry if:

  • Client errors: 400, 401, 403, 404
  • Business logic errors: user not found, insufficient funds, validation failed
  • 422 – Unprocessable Entity

Only retry if the error is recoverable.
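One way to encode this rule is a small classifier keyed on HTTP status codes. A sketch, with illustrative (not exhaustive) status sets:

```python
# Retry only transient, recoverable failures; never client/business errors.
RETRYABLE_STATUSES = {500, 502, 503, 504}           # transient server errors
NON_RETRYABLE_STATUSES = {400, 401, 403, 404, 422}  # client/business errors

def should_retry(status_code: int) -> bool:
    if status_code in NON_RETRYABLE_STATUSES:
        return False
    return status_code in RETRYABLE_STATUSES

print(should_retry(503))  # True  - service temporarily unavailable
print(should_retry(422))  # False - the request itself is invalid
```

Defaulting to "do not retry" for anything unlisted is the safer posture: an unknown error is not known to be recoverable.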


4. How to Retry the Right Way

  • Limit retry attempts
    Never retry infinitely. Use a max of 2–3 tries depending on the context.

  • Use delay and jitter
    Add delays between retries (exponential or linear), with jitter to avoid synchronized spikes.

  • Only retry idempotent actions
    E.g., GET and PUT are idempotent by HTTP semantics, so they are safer to retry than POST — retrying a POST risks duplicate orders or repeated payments.

  • Use a circuit breaker
    Temporarily cut off retries when the downstream service keeps failing.

  • Deferred Retry – Smart retries using jobs
    Instead of retrying immediately, queue the task or store it in a DB, and process later via background jobs. Helps avoid additional load during a system failure.

  • Log everything
    Record the error reason, retry count, and retry time for easier debugging and alerting.
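The first two points combine into a few lines of code. A sketch of capped exponential backoff with full jitter, assuming the caller has already verified the operation is idempotent and the error is recoverable (names and defaults are illustrative):

```python
import random
import time

def retry_with_backoff(call, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry `call` with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # limit reached: give up and surface the error
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: spread clients out so retries don't synchronize.
            time.sleep(random.uniform(0, backoff))

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"

print(retry_with_backoff(flaky))  # "ok" on the third attempt
```

Full jitter (sleeping a random time between 0 and the backoff ceiling) is what breaks the synchronized spikes described in the "dangerous patterns" section: a thousand clients that failed together no longer retry together.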


5. How Do You Know When It's Safe to Retry?

  • Use circuit breakers
    Stop retrying temporarily when a service fails repeatedly, then probe recovery gradually through the half-open state.

  • Monitor health checks and metrics
    Check /health endpoints or tools like Prometheus and Grafana to see if services have recovered.

  • Respect the Retry-After header
    Some APIs return this to indicate the recommended wait time before retrying.

  • Rate-limit retries
    Avoid flooding the service again after it starts recovering.
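The circuit-breaker idea above can be sketched as a tiny state machine (closed → open → half-open). This is an illustrative toy, not a replacement for a library such as Resilience4j or pybreaker:

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open circuit breaker (illustrative sketch)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None => circuit closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                    # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return True                    # half-open: allow probe traffic
        return False                       # open: fail fast, stop retrying

    def record_success(self):
        self.failures = 0
        self.opened_at = None              # probe succeeded: close again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip open: stop hammering

cb = CircuitBreaker(failure_threshold=2, reset_timeout=0.1)
cb.record_failure(); cb.record_failure()
print(cb.allow_request())  # False - circuit open, fail fast
time.sleep(0.15)
print(cb.allow_request())  # True  - half-open, send a single probe
```

The key property: while the circuit is open, callers fail immediately instead of adding retry load to a service that is already struggling.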


6. Tools for Effective Retry Implementation

Java / Spring Ecosystem:

  • Spring Retry
    Supports @Retryable, configurable delays, backoff, and fallback with @Recover.

  • Resilience4j
    Combines retry, circuit breaker, rate limiter, and bulkhead into one library. Works well with Spring Boot and Micrometer.

  • Kafka Retry Topic
    Separate retry topics with delays avoid blocking the main consumer. Combine with dead-letter topics for reliability.

  • Quartz / Spring Task
    Schedule deferred retries using background jobs.

Other Languages / Platforms:

  • Python:

    • tenacity: powerful retry decorator
    • celery: built-in retry policy for async tasks
  • Node.js:

    • retry, bull, agenda: retry support with timing and retry limits
  • Go:

    • go-retryablehttp, backoff: lightweight and effective

Cloud-native:

  • AWS:

    • SQS + Lambda + DLQ
    • Step Functions with retry/catch blocks
  • GCP:

    • Cloud Tasks, Pub/Sub retry + DLQ
    • Workflows with built-in retry logic
  • Azure:

    • Service Bus with configurable retry policy
    • Azure Durable Functions with built-in retry


7. Real Case: Saving the System During Peak Load with Strategic Retry

Context:
At year-end, the system was under heavy traffic due to a promotional campaign. A payment processing service got overloaded, frequently timing out. Meanwhile, a batch job was firing thousands of requests per minute, with 5 retries per request, no delay, no jitter.

Result:
Massive retry storm completely choked the payment service → triggered cascading failures in related systems → 15 minutes of downtime during peak hours.

Solution:

  • Reduced retries to 2
  • Added exponential backoff and jitter
  • Applied circuit breaker on the job
  • Moved retries to a queue and processed via background jobs

Outcome:
System stabilized in under 10 minutes. Retries no longer overwhelmed the backend.

Lesson:

Retry isn’t about “hammering through” — it’s about helping the system recover gracefully.


8. Conclusion

Retry is a powerful tool when used correctly. But if applied without control, it can bring down your system faster than the original error.

Keep in mind:

  • Retry only for temporary, recoverable errors
  • Always limit retries, add delay + jitter, and use circuit breakers
  • Effective retry isn’t about "how many times you call back", but "knowing when to stop and wait"

Retry is medicine — used wisely, it heals. Used wrong, it poisons your system.
