DEV Community

lowkey dev

Be careful with retries — don't DDoS your own system

Retry isn't bad. But used incorrectly, you could unknowingly become a "DDoS hacker"... of your own system.

Retry — the mechanism of repeating a request upon failure — is a crucial part of distributed system design. When one API call to another service fails due to network errors, timeouts, or temporary issues, retries are often configured to increase the chance of success.

Left uncontrolled, retry can easily turn from a supporting mechanism into the trigger of a domino-style cascading failure.


1. When Retry Is a Double-Edged Sword

Imagine a simple scenario:

  • Service A calls Service B.
  • Service B is under heavy load and returns a 503 (Service Unavailable).
  • Service A retries 3 times, with a 100ms delay between each attempt.

Now suppose 1000 requests hit Service A at the same time:

  • Each request makes 4 calls to Service B (1 original + 3 retries).
  • Total: 1000 × 4 = 4000 requests to Service B.
  • While Service B is already overloaded, these retries choke it completely, leading to cascading failure.

Uncontrolled retries = shooting yourself in the foot.
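The arithmetic above is worth making explicit. A minimal Python sketch (the function name is illustrative):

```python
# Retry amplification from the scenario above: every failing request
# costs one original attempt plus max_retries retries.
def retry_amplification(concurrent_requests: int, max_retries: int) -> int:
    return concurrent_requests * (1 + max_retries)

print(retry_amplification(1000, 3))  # 4000 calls hit Service B, not 1000
```

Note that the multiplier compounds across hops: if Service B also retries its own downstream calls 3 times, the original 1000 requests can become 16,000 at the bottom of the chain.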



2. Dangerous Retry Patterns

Retry without delay
→ Causes request storms when errors occur.

Simultaneous retries from multiple instances
→ Multiple services retrying at once → sudden traffic spikes → downstream crashes.

Infinite retries
→ Can cause memory leaks, jammed queues, and unstoppable request storms.
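To make the first pattern concrete, here is a deliberately bad sketch (names are illustrative): a bounded but zero-delay retry loop. Every attempt fires back-to-back, so one user-visible request multiplies into several simultaneous downstream calls.

```python
# ANTI-PATTERN (illustrative): zero-delay retries amplify load downstream.
def call_with_naive_retry(call, max_attempts=4):
    last_exc = None
    for _ in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_exc = exc  # no delay, no jitter: attempts land back-to-back
    raise last_exc

attempts = 0
def flaky():
    global attempts
    attempts += 1
    raise TimeoutError("downstream overloaded")

try:
    call_with_naive_retry(flaky)
except TimeoutError:
    pass
print(attempts)  # 4 downstream calls for one user-visible request
```

The unbounded variant (`while True:`) is strictly worse: the storm never stops. Bounding attempts only caps the multiplier; it does not remove the spike.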


3. When to Retry and When Not To

Not every error should be retried.

Retry if:

  • Temporary issues: timeouts, connection resets
  • System errors: HTTP 5xx like 500, 502, 503, 504
  • Downstream service is restarting

Do NOT retry if:

  • Client errors: 400, 401, 403, 404
  • Business logic errors: user not found, insufficient funds, validation failed
  • 422 – Unprocessable Entity

Only retry if the error is recoverable.
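One way to encode this rule is a small classifier keyed on HTTP status codes. A sketch, with illustrative (not exhaustive) status sets:

```python
# Retry only transient, recoverable failures; never client/business errors.
RETRYABLE_STATUSES = {500, 502, 503, 504}           # transient server errors
NON_RETRYABLE_STATUSES = {400, 401, 403, 404, 422}  # client/business errors

def should_retry(status_code: int) -> bool:
    if status_code in NON_RETRYABLE_STATUSES:
        return False
    return status_code in RETRYABLE_STATUSES

print(should_retry(503))  # True  - service temporarily unavailable
print(should_retry(422))  # False - the request itself is invalid
```

Defaulting to "do not retry" for anything unlisted is the safer posture: an unknown error is not known to be recoverable.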


4. How to Retry the Right Way

  • Limit retry attempts
    Never retry infinitely. Use a max of 2–3 tries depending on the context.

  • Use delay and jitter
    Add delays between retries (exponential or linear), with jitter to avoid synchronized spikes.

  • Only retry idempotent actions
    E.g., GET and PUT are idempotent by HTTP semantics, so they are safer to retry than POST — retrying a POST risks duplicate orders or repeated payments.

  • Use a circuit breaker
    Temporarily cut off retries when the downstream service keeps failing.

  • Deferred Retry – Smart retries using jobs
    Instead of retrying immediately, queue the task or store it in a DB, and process later via background jobs. Helps avoid additional load during a system failure.

  • Log everything
    Record the error reason, retry count, and retry time for easier debugging and alerting.
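The first two points combine into a few lines of code. A sketch of capped exponential backoff with full jitter, assuming the caller has already verified the operation is idempotent and the error is recoverable (names and defaults are illustrative):

```python
import random
import time

def retry_with_backoff(call, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry `call` with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # limit reached: give up and surface the error
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: spread clients out so retries don't synchronize.
            time.sleep(random.uniform(0, backoff))

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"

print(retry_with_backoff(flaky))  # "ok" on the third attempt
```

Full jitter (sleeping a random time between 0 and the backoff ceiling) is what breaks the synchronized spikes described in the "dangerous patterns" section: a thousand clients that failed together no longer retry together.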


5. How Do You Know When It's Safe to Retry?

  • Use circuit breakers
    Stop retrying temporarily when a service fails repeatedly, then probe recovery gradually through the half-open state.

  • Monitor health checks and metrics
    Check /health endpoints or tools like Prometheus and Grafana to see if services have recovered.

  • Respect the Retry-After header
    Some APIs return this to indicate the recommended wait time before retrying.

  • Rate-limit retries
    Avoid flooding the service again after it starts recovering.
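The circuit-breaker idea above can be sketched as a tiny state machine (closed → open → half-open). This is an illustrative toy, not a replacement for a library such as Resilience4j or pybreaker:

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open circuit breaker (illustrative sketch)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None => circuit closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                    # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return True                    # half-open: allow probe traffic
        return False                       # open: fail fast, stop retrying

    def record_success(self):
        self.failures = 0
        self.opened_at = None              # probe succeeded: close again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip open: stop hammering

cb = CircuitBreaker(failure_threshold=2, reset_timeout=0.1)
cb.record_failure(); cb.record_failure()
print(cb.allow_request())  # False - circuit open, fail fast
time.sleep(0.15)
print(cb.allow_request())  # True  - half-open, send a single probe
```

The key property: while the circuit is open, callers fail immediately instead of adding retry load to a service that is already struggling.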


6. Tools for Effective Retry Implementation

Java / Spring Ecosystem:

  • Spring Retry
    Supports @Retryable, configurable delays, backoff, and fallback with @Recover.

  • Resilience4j
    Combines retry, circuit breaker, rate limiter, and bulkhead into one library. Works well with Spring Boot and Micrometer.

  • Kafka Retry Topic
    Separate retry topics with delays avoid blocking the main consumer. Combine with dead-letter topics for reliability.

  • Quartz / Spring Task
    Schedule deferred retries using background jobs.

Other Languages / Platforms:

  • Python:

    • tenacity: powerful retry decorator
    • celery: built-in retry policy for async tasks
  • Node.js:

    • retry, bull, agenda: retry support with timing and retry limits
  • Go:

    • go-retryablehttp, backoff: lightweight and effective

Cloud-native:

  • AWS:

    • SQS + Lambda + DLQ
    • Step Functions with retry/catch blocks
  • GCP:

    • Cloud Tasks, Pub/Sub retry + DLQ
    • Workflows with built-in retry logic
  • Azure:

    • Service Bus with configurable retry policy
    • Azure Durable Functions with built-in retry


7. Real Case: Saving the System During Peak Load with Strategic Retry

Context:
At year-end, the system was under heavy traffic due to a promotional campaign. A payment processing service got overloaded, frequently timing out. Meanwhile, a batch job was firing thousands of requests per minute, with 5 retries per request, no delay, no jitter.

Result:
Massive retry storm completely choked the payment service → triggered cascading failures in related systems → 15 minutes of downtime during peak hours.

Solution:

  • Reduced retries to 2
  • Added exponential backoff and jitter
  • Applied circuit breaker on the job
  • Moved retries to a queue and processed via background jobs

Outcome:
System stabilized in under 10 minutes. Retries no longer overwhelmed the backend.

Lesson:

Retry isn’t about “hammering through” — it’s about helping the system recover gracefully.


8. Conclusion

Retry is a powerful tool when used correctly. But if applied without control, it can bring down your system faster than the original error.

Keep in mind:

  • Retry only for temporary, recoverable errors
  • Always limit retries, add delay + jitter, and use circuit breakers
  • Effective retry isn’t about "how many times you call back", but "knowing when to stop and wait"

Retry is medicine — used wisely, it heals. Used wrong, it poisons your system.
