Transient Errors: Retry Wisely

#distributedsystem #retrymechanims #transienterrors #software

Every remote service that we call is eventually going to fail. No matter how reliable it is, failure is inevitable.

“Everything fails all the time” — Werner Vogels

These failures can come from a variety of causes: network issues, hardware problems, temporarily unavailable services, exceeded response times, and so on.

Some of these failures resolve themselves automatically within a short period: if the remote service is invoked again, it responds successfully. We call these kinds of errors transient errors.

When we encounter a transient error, there are a few things we can do. The simplest option would be to log an error and give up. Since transient errors are most likely to be resolved when you retry, you may guess that this is not the wisest option. So, the better strategy is to retry the failed operation.

However, retrying the failed operation without a clear strategy will most likely create extra load on the remote service and therefore probably make the situation worse.

Some questions that should be answered before applying a retry strategy are:

How would the client determine whether an error is transient or not?
How often should the client retry?
How long should the client keep retrying?
When should the client give up?

So, when to retry?

You can identify that it is a transient error

If the remote service returns a TransientErrorException, that is great. However, this is not always the case. In those situations, we need to be smart about interpreting the errors.

Client Errors: These are the errors caused by the client itself. Examples are badly formed requests, requests that cause a conflict, or sending too many requests. In those cases, the remote service returns a 4xx error. The only way to handle client errors is to fix either the client or the request itself through human intervention. There is no point in retrying these requests.

Server Errors: These are the 5xx errors that indicate something went wrong on the server side. In those cases, it is usually safe to retry, although not every 5xx error is a transient error.

Network Errors: These are errors caused by network issues. Examples are packet loss and hardware problems in routers or switches. It is safe to retry if you can identify them.
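The three categories above can be turned into a small decision function. The following is only a sketch: the helper name `is_probably_transient` and its exact rules are assumptions made for illustration, and a real client would likely refine them per service.

```python
from typing import Optional


# Hypothetical helper: the name and the rules are assumptions for illustration.
def is_probably_transient(status_code: Optional[int] = None,
                          error: Optional[BaseException] = None) -> bool:
    """Guess whether a failed call is worth retrying."""
    # Network errors: no HTTP response was received at all.
    if error is not None:
        return isinstance(error, (ConnectionError, TimeoutError))
    if status_code is None:
        return False
    # Client errors (4xx): the request itself is at fault, so sending
    # the same request again will not help.
    if 400 <= status_code < 500:
        return False
    # Server errors (5xx): usually worth another attempt.
    return 500 <= status_code < 600
```

Here `ConnectionError` and `TimeoutError` are Python's built-in exception types; an HTTP library may raise its own exception classes, which would need to be mapped in the same way.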

The service is idempotent

Idempotence means that making multiple identical requests has the same effect as making a single request. ¹

According to the HTTP specification², GET, HEAD, PUT, and DELETE are idempotent operations. So, you are fine to retry those requests unless advised otherwise by the remote service owner. On the other hand, POST and PATCH are not idempotent, and if idempotency is not implemented for them, it is not safe to retry since it might cause side effects such as charging a customer multiple times.
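As a rough sketch, the rule above could be checked before every retry. The function name `is_safe_to_retry` is hypothetical, and the `Idempotency-Key` header is an assumption: some services use such a header to deduplicate POST and PATCH requests, but whether a given remote service honors it has to be confirmed with its owner.

```python
from typing import Mapping

# Methods the text lists as idempotent according to the HTTP specification.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE"})


# Hypothetical helper, for illustration only.
def is_safe_to_retry(method: str, headers: Mapping[str, str]) -> bool:
    """Return True if repeating the request should not cause extra side effects."""
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Assumption: the remote service deduplicates POST/PATCH requests that
    # carry the same "Idempotency-Key" header value. Not every service does.
    return any(name.lower() == "idempotency-key" for name in headers)
```

With this check, a `POST` that charges a customer is retried only when the request carries a key that the service can use to recognize the duplicate.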

Retry strategies

Several strategies can be applied as a retry mechanism. Choosing the right strategy depends on the use case.

We can retry immediately after the operation fails. This is the simplest retry strategy that we can implement. It is a good idea to give up or fall back to a better strategy after the first failed retry, since retrying continuously will create too much load on the remote service.

We can retry at fixed intervals after the operation fails. This strategy gives the remote service more time to recover.

Both of these strategies are useful for applications that users interact with directly: they retry the failed operation and, if it is still not successful, most likely give up. That way, the user does not have to wait for a long time.
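Both strategies fit in one small function, sketched here under a few assumptions: the name `retry_fixed`, its parameters, and the choice to treat only `ConnectionError` and `TimeoutError` as transient are all illustrative. An interval of zero gives the immediate retry; a positive interval gives the fixed-interval retry.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


# Hypothetical helper: name, parameters, and defaults are assumptions.
def retry_fixed(operation: Callable[[], T],
                max_retries: int = 1,
                interval_seconds: float = 0.0,
                sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation; on a transient failure retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt >= max_retries:
                raise  # give up and surface the last error to the caller
            attempt += 1
            # 0 seconds means "retry immediately"; otherwise wait a fixed interval.
            sleep(interval_seconds)
```

The `sleep` parameter exists only so that the waiting can be replaced in tests; in normal use the default `time.sleep` applies. Keeping `max_retries` small matches the advice above to give up quickly when a user is waiting.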

If your service or application does not directly interact with a user, or you have the luxury of waiting longer (e.g. background operations), exponential backoff is a strategy that you should try. This strategy is based on increasing the wait time exponentially between subsequent retries. This is a pretty useful technique since it gives the remote service more time to recover and creates less load than either of the previous two strategies in a given period.

So, what does the exponential backoff strategy look like?

Here is simplified pseudocode for the exponential backoff algorithm:

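retries = 0;
retry = true;
do {
    // wait 1 s, 2 s, 4 s, ... capped at MAX_WAIT_INTERVAL (in milliseconds)
    wait for Math.min(2^retries * 1000, MAX_WAIT_INTERVAL) milliseconds
    status = fn(); // retry the failed operation
    retry = (status is FAILED); // stop retrying as soon as the operation succeeds
    retries++;
} while (retry && retries < MAX_RETRY_COUNT)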
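To make the growth of the wait time concrete, the schedule produced by the pseudocode can be computed directly. This is a minimal sketch; the function name `backoff_delays_ms` and the 30-second cap used as the default for `MAX_WAIT_INTERVAL` are assumptions, since the text does not give a value.

```python
from typing import List


# Hypothetical helper; the 30_000 ms cap is an assumed MAX_WAIT_INTERVAL.
def backoff_delays_ms(max_retries: int,
                      base_ms: int = 1000,
                      max_wait_ms: int = 30_000) -> List[int]:
    """Wait before each retry: base_ms * 2^n, capped at max_wait_ms."""
    return [min(base_ms * 2 ** n, max_wait_ms) for n in range(max_retries)]


print(backoff_delays_ms(6))
# [1000, 2000, 4000, 8000, 16000, 30000]
```

The sixth wait would be 32 seconds without the cap, which is why a maximum interval matters once the retry count grows.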

Distribute load by adding some jitter

Most likely there will be multiple instances of the client, and if the requests from all of them fail at exactly the same time, we don’t want their retries to overlap. Adding jitter distributes the load more evenly. With jitter, our algorithm would be:

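retries = 0;
retry = true;
do {
    waitTime = Math.min(2^retries * 100, MAX_WAIT_INTERVAL) // in milliseconds
    waitTime += random(0...3000) // jitter so that clients do not retry in lockstep
    wait for waitTime milliseconds
    status = fn(); // retry the failed operation
    retry = (status is FAILED); // stop retrying as soon as the operation succeeds
    retries++;
} while (retry && retries < MAX_RETRY_COUNT)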
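The wait-time calculation with jitter might look like this in Python. It is only a sketch: the name `jittered_delay_ms` is hypothetical, the 30-second cap is an assumed value for `MAX_WAIT_INTERVAL`, and the random generator is passed in so that each client instance can use its own.

```python
import random


# Hypothetical helper; the defaults mirror the numbers in the pseudocode,
# except max_wait_ms, which is an assumed MAX_WAIT_INTERVAL.
def jittered_delay_ms(retries: int,
                      rng: random.Random,
                      base_ms: int = 100,
                      max_wait_ms: int = 30_000,
                      max_jitter_ms: int = 3000) -> int:
    """Capped exponential delay plus a random jitter of 0..max_jitter_ms."""
    capped = min(base_ms * 2 ** retries, max_wait_ms)
    # randint includes both endpoints, so the jitter can be 0 or max_jitter_ms.
    return capped + rng.randint(0, max_jitter_ms)


# Two client instances with independent random generators will usually
# compute different waits for the same retry number.
client_a = random.Random()
client_b = random.Random()
print(jittered_delay_ms(2, client_a), jittered_delay_ms(2, client_b))
```

Because the jitter is added after the cap, the actual wait can exceed the maximum interval by up to the jitter range, as it does in the pseudocode.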

Summary

Retry is a powerful technique that, if applied correctly, allows the client to offer higher availability than its dependencies. Retrying a failed operation without a clear retry strategy will most likely create extra load on the service and make the situation worse.

So, use it wisely!