Sohaib Tariq

Originally published at sohaibtariq.com

Exponential backoff for AWS Lambda

I recently set up a Lambda function that reads data from an SQS Queue
and makes an API call to one of our microservices.
Naturally, this calls for an error handling mechanism, considering that the microservice
could be down or unresponsive.

AWS Lambda provides its own retry mechanism where a message is picked up from the queue by the Lambda
consumer and becomes invisible to other consumers for a specific duration called the visibility timeout.
If the consumer completes execution successfully, it automatically deletes the message from the queue.
In case of unsuccessful execution (such as a RuntimeException), the approximate receive count
of the message is incremented and it becomes available to other consumers after the visibility timeout passes.
The number of times a message can be re-read from the queue
before it is finally sent to a Dead Letter Queue (DLQ) is configured in the redrive policy of the
SQS queue and is tracked via the approximate receive count.
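
For reference, that redrive policy is just an attribute on the queue. A minimal sketch of setting it with the AWS SDK for Java, where the queue URL, DLQ ARN, and maxReceiveCount of 5 are placeholder values, might look like this:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import java.util.Collections;

AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

// After 5 failed receives, SQS moves the message to the dead letter queue.
// The queue URL and DLQ ARN below are placeholders.
String redrivePolicy = "{\"maxReceiveCount\":\"5\","
        + "\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq\"}";

sqs.setQueueAttributes("https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
        Collections.singletonMap("RedrivePolicy", redrivePolicy));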

This retry mechanism was not exactly what I had in mind for our use case. I was thinking along the
lines of a backoff strategy that keeps retrying the API call with exponentially increasing wait times,
finally sending the message to a DLQ after a set number of retries. This would give us ample time to
fix any issues with our microservice and prevent it from being bombarded with failing API calls.

This is what I ended up with:

First, a very basic Java function to calculate the exponential wait time, given the number of
retries recvCount:

        int randomInt = rand.nextInt(60);                                // jitter: a random 0-59 seconds
        Long result = (long) Math.pow(2, recvCount) + 30 + randomInt;    // exponential wait plus base and jitter

Notice the addition of randomInt. That is 'jitter'. A bit of randomness. I read about it in some
documentation by Google Cloud and included
it as a good practice.

Next up, set the visibility timeout of the message to the value that we just calculated above. The maximum value allowed by AWS is 43200 seconds
or 12 hours.

  sqs.changeMessageVisibility(queueUrl, msg.getReceiptHandle(), newVisibilityTimeout.intValue());
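One thing worth noting: for a high enough recvCount the calculated value will exceed that 12-hour limit and the call above will be rejected. The snippet doesn't cap the value, but a minimal guard could look like this:

  int cappedTimeout = (int) Math.min(newVisibilityTimeout, 43200);  // SQS allows at most 43200 seconds (12 hours)
  sqs.changeMessageVisibility(queueUrl, msg.getReceiptHandle(), cappedTimeout);
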

Finally, we check the response to our API call. If it is a 400 or 500 series response, we change the visibility timeout of the
message and throw a RuntimeException. This is
the easiest way I could come up with to signal unsuccessful execution of the Lambda function. Plus, we can only throw unchecked exceptions
from our handler method.

...
// api call
...
if (response.getStatusLine().getStatusCode() >= 400) {
    new ExponentialBackoff().setVisibilityTimeout(msg);
    throw new RuntimeException("Request to server failed");
}

ExponentialBackoff is my utility class where the code that calculates and sets the visibility timeout lives. It also has some other
utility functions that are not essential for this demonstration.
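
Put together, a stripped-down sketch of that class might look something like the code below. The class and method names match the call above; the rest (the queue URL lookup, the SQS client, reading the retry count from the ApproximateReceiveCount message attribute) is an assumption of how the pieces fit together, not the exact code from our service.

import com.amazonaws.services.lambda.runtime.events.SQSEvent.SQSMessage;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import java.util.Random;

public class ExponentialBackoff {

    private static final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private static final String queueUrl = System.getenv("QUEUE_URL"); // placeholder: however you resolve the queue URL
    private final Random rand = new Random();

    public void setVisibilityTimeout(SQSMessage msg) {
        // SQS tracks how many times this message has been received
        int recvCount = Integer.parseInt(msg.getAttributes().get("ApproximateReceiveCount"));

        int randomInt = rand.nextInt(60);                                        // jitter
        long newVisibilityTimeout = (long) Math.pow(2, recvCount) + 30 + randomInt;
        int capped = (int) Math.min(newVisibilityTimeout, 43200);                // 12-hour maximum

        sqs.changeMessageVisibility(queueUrl, msg.getReceiptHandle(), capped);
    }
}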

There you have it: a bare-bones exponential backoff implementation for AWS Lambda.

Top comments (1)

ArsalanWork

Exponential backoff will be good approach in the following two scenarios:

  1. when there are different downstream api endpoints triggered from the lambda for different types of records it receives.
  2. when the downstream service is expected to behave differently for different records you process from the queue.

Exponential backoff won't be a good pattern in a solution where high volumes of records are processed and pushed to a single api endpoint. Because exponential backoff works in the context of a single record. Imagine you have thousands of records in the queue, and the downstream api is down, each record will hit the api in the first processing attempt and will then have visibility timeout increased exponentially. So all the records in the queue will keep hitting the api. In these kind of scenarios, Circuit breaker pattern works far better exponential back off.