siffiyan assauri
Retry and Backoff: Building Resilient Systems

In network based applications, transient failures like network delays or temporarily unresponsive servers are common. Without a retry mechanism, these disruptions can cause immediate failures, affecting user experience and service reliability. Implementing retries is key to maintaining stability during such interruptions.

Case Study

Consider an application that connects to an external API to fetch weather data. If the API experiences a temporary disruption, our application will fail immediately without attempting to retry. This could result in failing to retrieve important data, even though the disruption might be temporary.

In the following example, the FetchWithoutRetry function makes an HTTP request to the external API. If the network call fails, the function immediately returns an error without retrying the request.

func FetchWithoutRetry(apiURL string) (*http.Response, error) {
    response, err := http.Get(apiURL)
    if err != nil {
        return nil, fmt.Errorf("request failed: %w", err)
    }
    return response, nil
}

This approach misses the opportunity to recover from a transient error that might be resolved with a simple retry.

Implementing a Retry Pattern

A retry mechanism attempts the same request again if it initially fails due to a transient issue. By implementing a retry pattern, you make your system more resilient to failures in downstream services: instead of immediately returning an error, it attempts the request a predefined number of times before giving up.

Here’s an example implementation in Go:

const (
    maxRetries    = 3
    retryInterval = 2 * time.Second
)

func FetchWithRetry(apiURL string) (*http.Response, error) {
    for attempt := 1; attempt <= maxRetries; attempt++ {
        response, err := http.Get(apiURL)
        if err == nil {
            return response, nil
        }

        // Use %v here: %w is only supported by fmt.Errorf.
        fmt.Printf("Request failed, attempt %d/%d: %v\n", attempt, maxRetries, err)
        time.Sleep(retryInterval)
    }

    return nil, fmt.Errorf("request failed after %d attempts", maxRetries)
}

In this example, the system attempts the request up to maxRetries times, waiting retryInterval between each attempt.
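
For completeness, a minimal caller might look like the sketch below. The weather API URL is a placeholder, and the deferred Body.Close matters because the successful response is returned with its body still open.

import (
    "fmt"
    "io"
    "log"
)

func main() {
    // Hypothetical endpoint; substitute your real weather API URL.
    resp, err := FetchWithRetry("https://api.example.com/weather?city=Jakarta")
    if err != nil {
        log.Fatalf("giving up: %v", err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalf("reading response body: %v", err)
    }
    fmt.Println(string(body))
}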

Exponential Backoff

While simple retries improve resilience, retrying at a fixed interval can prolong a transient failure, because each retry keeps hitting a server that is already under heavy load.

Exponential backoff optimizes retries by increasing the delay between
each attempt. This reduces the load on the failing system and improves
the likelihood of a successful retry by giving the server more time to
recover with each successive attempt. The progressively longer intervals
also help prevent overwhelming a recovering server with repeated bursts
of traffic.

Exponential backoff gradually increases the wait time after each failure. For example, the wait time might start at one second, then increase to two seconds, four seconds, and so on. Here’s how you might implement exponential backoff in Go:

const (
    maxRetries      = 3
    initialInterval = 2 * time.Second
    backoffFactor   = 2.0
)

func FetchWithExponentialBackoff(apiURL string) (*http.Response, error) {
    waitTime := initialInterval

    for attempt := 1; attempt <= maxRetries; attempt++ {
        response, err := http.Get(apiURL)
        if err == nil {
            return response, nil
        }

        fmt.Printf("Request failed, attempt %d/%d: %v\n", attempt, maxRetries, err)
        time.Sleep(waitTime)
        // Multiply via float64 so non-integer factors (e.g. 1.5) also work.
        waitTime = time.Duration(float64(waitTime) * backoffFactor)
    }

    return nil, fmt.Errorf("request failed after %d attempts", maxRetries)
}

Adding Jitter to Backoff

While exponential backoff reduces server load by spacing out individual
client retries, it doesn't prevent the "thundering herd" problem when
multiple clients fail simultaneously. If 1,000 clients all experience
a failure at the same moment (for example, when a server crashes),
exponential backoff means they'll all retry together at 2 seconds,
then again at 4 seconds, then 8 seconds, and so on. Each wave of
synchronized requests can overwhelm the recovering server.

Jitter solves this by adding random variation to the retry intervals.
Instead of all 1,000 clients retrying at exactly 2 seconds, they might
retry anywhere between 2.0 and 2.2 seconds. This spreads the load over
a 200ms window instead of hitting the server with all 1,000 requests
in the same millisecond.

Here's how to implement jitter with exponential backoff:

const (
    maxRetries      = 3
    initialInterval = 2 * time.Second
    backoffFactor   = 2.0
)

func FetchWithExponentialBackoffAndJitter(apiURL string) (*http.Response, error) {
    waitTime := initialInterval

    for attempt := 1; attempt <= maxRetries; attempt++ {
        response, err := http.Get(apiURL)
        if err == nil {
            return response, nil
        }

        fmt.Printf("Request failed, attempt %d/%d: %v\n", attempt, maxRetries, err)

        // Add jitter: a random extra delay of up to 10% of the wait time
        // (rand comes from the math/rand package).
        jitter := time.Duration(rand.Int63n(int64(waitTime / 10)))
        actualWaitTime := waitTime + jitter

        time.Sleep(actualWaitTime)
        waitTime = time.Duration(float64(waitTime) * backoffFactor)
    }

    return nil, fmt.Errorf("request failed after %d attempts", maxRetries)
}

In this implementation, we add up to 10% random variation to the wait
time. This small adjustment can significantly reduce peak load during
recovery periods. The combination of exponential backoff (which gives
the server progressively more recovery time) and jitter (which prevents
synchronized retries) creates a robust retry strategy that helps both
client and server handle failures gracefully.

When to Retry Your Request

Although retry mechanisms enhance resilience, not every request to a downstream service should be retried. Consider the following factors (a sketch of a simple retry-eligibility check follows the lists below):

  1. Transient Errors: Focus on retrying errors that are likely to be temporary, such as network timeouts, server overload, or transient database issues.

  2. Type of Error: Not all errors should be retried. For example, retrying a 404 Not Found error is futile because the resource doesn’t exist.

  3. Idempotency: Only retry idempotent operations such as GET or DELETE requests. Retrying operations that modify state, such as POST, could cause unintended side effects.

  4. Retry Limits: Set reasonable limits to prevent infinite loops and excessive resource usage.

Common Scenarios Where Retrying Might Not Help:

  • 404 Not Found: The requested resource does not exist, so retrying won’t help.

  • 401 Unauthorized: This error indicates invalid credentials. Retrying without correcting the credentials will fail.

  • 403 Forbidden: The client does not have permission to access the resource. Retrying will not change the authorization status.

  • Permanent Errors: Errors caused by permanent issues, like configuration problems, should not be retried.
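
To make these rules concrete, here is a minimal sketch of a retry-eligibility check. The shouldRetry helper and its classification of status codes are illustrative assumptions rather than part of the examples above; adapt the rules to your own API's semantics.

import "net/http"

// shouldRetry reports whether a failed request is worth retrying.
// A non-nil error from http.Get means a transport-level failure
// (timeout, connection refused), which is usually transient.
func shouldRetry(resp *http.Response, err error) bool {
    if err != nil {
        return true
    }
    switch {
    case resp.StatusCode == http.StatusTooManyRequests: // 429: back off and retry
        return true
    case resp.StatusCode >= 500: // server-side errors are often transient
        return true
    default: // 2xx success and most other 4xx errors: do not retry
        return false
    }
}

A retry loop like FetchWithRetry could call this helper after each attempt and stop early as soon as the failure is clearly permanent.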

Conclusions

Implementing retry mechanisms with exponential backoff and jitter
significantly improves application resilience against transient failures
such as temporary network disruptions, server overloads, and brief
service outages. These techniques help applications recover automatically
from temporary issues without requiring manual intervention.

However, retry strategies come with important trade-offs to consider:

Benefits:

  • Automatic recovery from transient failures without user impact
  • Better tolerance for downstream service instabilities
  • Improved overall system reliability during peak loads

Trade-offs:

  • Increased end-to-end latency when retries are triggered
  • Additional resource consumption (network bandwidth, server capacity)
  • Risk of masking underlying problems that require investigation

Remember that retries are just one component of a comprehensive
resilience strategy. For production systems, consider combining retries
with complementary patterns such as:

  • Circuit breakers to prevent cascading failures when downstream services are consistently failing
  • Timeouts to avoid indefinitely waiting for responses (see the sketch after this list)
  • Monitoring and alerting to track retry rates and identify patterns that indicate deeper issues
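
On the timeout point: the examples in this post use http.Get, which applies no request timeout by default. A minimal sketch of pairing the retry loops with a bounded per-request timeout might look like this (the 5-second value is an arbitrary assumption):

import (
    "net/http"
    "time"
)

// A client with a bounded per-request timeout, so a hung connection fails
// fast and the retry logic gets a chance to run instead of waiting forever.
var apiClient = &http.Client{Timeout: 5 * time.Second}

func fetchOnce(apiURL string) (*http.Response, error) {
    // Drop-in replacement for http.Get inside the retry loops above.
    return apiClient.Get(apiURL)
}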

Always evaluate the error type before retrying; not all failures benefit from retries. Focus on idempotent operations and transient errors, and avoid retrying authentication failures, permanent client errors (most 4xx status codes), or operations that modify state without proper idempotency guarantees. By thoughtfully implementing these patterns, you build systems that gracefully handle failures while remaining observable and maintainable.

