Prabusah

AWS Lambda retry using Backoff and Jitter

TL;DR

Use backoff and jitter to retry on TooManyRequestsException, ThrottlingException or RateExceeded errors when calling APIs that have rate limits.

AWS Service API limit:

There are 200+ AWS services, and each service has quotas (limits). For example, Amazon Connect is a fully managed cloud contact center service. Amazon Connect service API operations such as CreateContactFlow and DeleteContactFlow have a rate limit of 2 requests per second. The Amazon Connect documentation also lists the throttling quotas for Amazon Connect Participant Service API operations. Similarly, every AWS account has default quotas (or limits) for the APIs of each AWS service.

APIs that an organization hosts itself, whether on-premises or in the cloud, also have their own throttling limits.
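Before retrying, it helps to tell throttling errors apart from other failures. Here is a minimal sketch of such a check, using error names like the ones in the TL;DR. The helper name isThrottlingError is hypothetical, and it assumes the error object carries its error code in a name or code property, which may differ depending on the SDK or client you use.

```javascript
// Error names treated as throttling signals (assumed list, adjust for your API).
const THROTTLING_ERROR_NAMES = new Set([
    "TooManyRequestsException",
    "ThrottlingException",
    "RateExceeded",
]);

// Hypothetical helper: returns true when an error looks like a throttling error.
// Assumption: the error code is exposed as err.name or err.code.
function isThrottlingError(err) {
    if (!err) return false;
    return THROTTLING_ERROR_NAMES.has(err.name) || THROTTLING_ERROR_NAMES.has(err.code);
}
```

Any other error (validation, permissions and so on) should normally be rethrown instead of retried.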

Throttle scenario in AWS Lambda:

A typical AWS Lambda function interacts with other AWS services such as DynamoDB, S3, SQS and SNS by calling their APIs through the aws-sdk.

Consider an AWS Lambda function named "FunLambda" that calls a particular AWS service API operation (say NoSuchAWSOperation), and assume its rate limit is 1 request per second.

const handler = async (event, context) => {
    try {
        return await awsService.NoSuchAWSOperation();
    } catch (err) {
        // Retry only when the failure is due to throttling.
        // Assumes the SDK reports the error code in err.name.
        if (err.name === "ThrottlingException") {
            return retryLogic();
        }
        throw err;
    }
};

const retryLogic = async () => {
    // Retries NoSuchAWSOperation immediately by calling it again.
    // Assume similar logic retries the operation 2 more times if this retry fails.
    return awsService.NoSuchAWSOperation();
};

Assume 2 instances of "FunLambda" run concurrently and call NoSuchAWSOperation at exactly the same time.

Time format used HH:MM:SS:milliseconds.
LI1 - Lambda Instance 1; LI2 - Lambda Instance 2.
01:01:00:000 - LI1 & LI2 call NoSuchAWSOperation and receive ThrottlingException.
01:01:00:000 - LI1 & LI2 retry logic tries (1st time) immediately and fails.
01:01:00:000 - LI1 & LI2 retry logic tries (2nd time) immediately and fails.
01:01:00:000 - LI1 & LI2 retry logic tries (3rd time) immediately and fails.

Explanation

Both instances of "FunLambda" call NoSuchAWSOperation, whose rate limit is 1 request per second. In the scenario above, all 3 retries happen within the same second (01:01:00), so the failures are expected.

Solution (retry with static delay):

Add a static wait time to retryLogic before it calls the AWS service API.

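const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const retryLogic = async () => {
    await sleep(1000); // Wait for 1000ms before retrying.
    // Assume similar logic retries the operation 2 more times if this retry fails.
    return awsService.NoSuchAWSOperation();
};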

Time format used HH:MM:SS:milliseconds.
LI1 - Lambda Instance 1; LI2 - Lambda Instance 2.
01:01:00:000 - LI1 & LI2 call NoSuchAWSOperation and receive TooManyRequestsException.
01:01:01:000 - LI1 & LI2 retry logic waits for 1000ms and calls the API again (1st retry), but fails.
01:01:02:000 - LI1 & LI2 retry logic waits for 1000ms and calls the API again (2nd retry), but fails.
01:01:03:000 - LI1 & LI2 retry logic waits for 1000ms and calls the API again (3rd retry), but fails.

Explanation

Both instances of "FunLambda" wait exactly the same amount of time, so all 3 retries land together at 01:01:01, 01:01:02 and 01:01:03. The failures are expected because each pair of simultaneous calls exceeds the rate limit of 1 request per second.

Instead of a static wait time, let's add exponential backoff.
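To make the "retry 2 more times" comment concrete, here is a minimal sketch of a bounded retry loop with a static delay. The helper name retryWithStaticDelay and its options (maxRetries, delayMs, wait) are hypothetical and not part of aws-sdk. The wait function is injectable only so the loop can be tried out without real delays.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls `operation`, waiting a fixed `delayMs` before each retry.
async function retryWithStaticDelay(operation, { maxRetries = 3, delayMs = 1000, wait = sleep } = {}) {
    let lastError;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        if (attempt > 0) {
            await wait(delayMs); // the same wait before every retry
        }
        try {
            return await operation();
        } catch (err) {
            lastError = err; // a real handler would rethrow non-throttling errors here
        }
    }
    throw lastError; // all attempts failed
}
```

Since every instance uses the same delayMs, concurrent instances that fail together also retry together, which is exactly the problem shown in the timeline above.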

Solution (retry with exponential delay):

Add a wait time to retryLogic that grows exponentially with each retry attempt before it calls the AWS service API.

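const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const retryLogic = async (attempt = 0) => {
    // The wait doubles with each attempt: 1s, 2s, 4s, 8s etc.
    await sleep(1000 * 2 ** attempt);
    // Assume similar logic retries the operation 2 more times if this retry fails.
    return awsService.NoSuchAWSOperation();
};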

Time format used HH:MM:SS:milliseconds.
LI1 - Lambda Instance 1; LI2 - Lambda Instance 2.
01:01:00:000 - LI1 & LI2 call NoSuchAWSOperation and receive TooManyRequestsException.
01:01:01:000 - LI1 & LI2 retry logic waits for 1000ms and calls the API again (1st retry), but fails.
01:01:03:000 - LI1 & LI2 retry logic waits for 2000ms and calls the API again (2nd retry), but fails.
01:01:07:000 - LI1 & LI2 retry logic waits for 4000ms and calls the API again (3rd retry), but fails.

Explanation:

Both instances of "FunLambda" still wait exactly the same amount of time, so all 3 retries land together at 01:01:01, 01:01:03 and 01:01:07. The failures are expected because each pair of simultaneous calls exceeds the rate limit of 1 request per second.

Let's add backoff and jitter to calculate the wait time.
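The arithmetic behind this timeline can be sketched as below. The function names and the 60 second cap are assumptions for illustration. The point is that the schedule is fully deterministic, so every instance that fails at the same moment computes the same retry times.

```javascript
// Wait before the n-th retry (n starts at 1): baseMs * 2^(n - 1), limited to capMs.
function backoffDelayMs(retryNumber, baseMs = 1000, capMs = 60000) {
    return Math.min(capMs, baseMs * 2 ** (retryNumber - 1));
}

// Offsets (in ms after the first failed call) at which each retry fires.
function retryOffsetsMs(retries, baseMs = 1000) {
    const offsets = [];
    let elapsed = 0;
    for (let n = 1; n <= retries; n++) {
        elapsed += backoffDelayMs(n, baseMs);
        offsets.push(elapsed);
    }
    return offsets;
}

console.log(retryOffsetsMs(3)); // [ 1000, 3000, 7000 ], identical for LI1 and LI2
```

Exponential backoff spreads retries out over time, but it does nothing to spread concurrent callers apart from each other.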

Solution (retry with Jitter delay):

Use the backoff and jitter formula to calculate the wait time. See the AWS Architecture Blog post "Exponential Backoff And Jitter" for more info on backoff and jitter.

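// Returns a random integer between min and max (inclusive).
const random_between = function(min, max) {
    return Math.floor(Math.random() * (max - min + 1) + min);
};

const waitTimeInMS = function() {
    let sleep = 350; // seed value; sleep * 3 is the upper bound of the random range
    const cap = 1000; // wait at most a second
    const base = 500; // wait at least half a second
    sleep = Math.min(cap, random_between(base, sleep * 3));
    return sleep;
};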
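Here is a minimal sketch of how a wait-time function like the one above could be plugged into a retry loop. The helper name retryWithJitter and its options (maxRetries, wait) are hypothetical. The block repeats the wait-time formula so that it runs on its own, and the wait function is injectable only so the loop can be tried out without real delays.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Random integer between min and max (inclusive).
const randomBetween = (min, max) => Math.floor(Math.random() * (max - min + 1) + min);

// Same formula as waitTimeInMS: a random wait between 500ms and 1000ms.
const jitteredWaitMs = () => Math.min(1000, randomBetween(500, 350 * 3));

// Hypothetical helper: retries `operation` with a fresh random wait before each retry.
async function retryWithJitter(operation, { maxRetries = 3, wait = sleep } = {}) {
    let lastError;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        if (attempt > 0) {
            await wait(jitteredWaitMs()); // each instance draws its own wait time
        }
        try {
            return await operation();
        } catch (err) {
            lastError = err; // a real handler would rethrow non-throttling errors here
        }
    }
    throw lastError; // all attempts failed
}
```

In the handler this could be called as retryWithJitter(() => awsService.NoSuchAWSOperation()), where awsService is the same placeholder used throughout this post.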

Time format used HH:MM:SS:milliseconds.
LI1 - Lambda Instance 1; LI2 - Lambda Instance 2.
01:01:00:000 - LI1 & LI2 call NoSuchAWSOperation and receive TooManyRequestsException.
01:01:00:763 - LI1 retry logic waits for 763ms and calls the API again, but fails (000 + 763).
01:01:00:907 - LI2 retry logic waits for 907ms and calls the API again, but fails (000 + 907).
01:01:01:587 - LI1 retry logic waits for 824ms and calls the API again, and this time it succeeds (763 + 824).
01:01:01:850 - LI2 retry logic waits for 943ms and calls the API again, but fails (907 + 943).
01:01:02:459 - LI2 retry logic waits for 609ms and calls the API again, and this time it succeeds (1850 + 609).

Explanation:

LI1 and LI2 waited for 763ms and 907ms. Both retries fell within the same second as the original calls, so both failed because the quota was exceeded (this is expected until 01:01:01:000).
LI1 retried for the 2nd time at 01:01:01:587 and succeeded.
LI2 retried for the 2nd time at 01:01:01:850 and failed, because LI1 had already used the quota for that second (this is expected until 01:01:02:000).
LI2 retried for the 3rd time at 01:01:02:459 and succeeded.
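The timeline can be replayed against a toy rate limiter to check the reasoning. This is only a sketch under two assumptions that the post does not state explicitly: the API counts requests in fixed one-second windows, and the slot for second 01:01:00 was already used when LI1 and LI2 made their first calls. The FixedWindowLimiter class is hypothetical, and real AWS throttling may work differently.

```javascript
// Assumed throttling model: at most `limit` accepted requests per one-second window.
class FixedWindowLimiter {
    constructor(limit = 1) {
        this.limit = limit;
        this.counts = new Map(); // second index -> accepted requests
    }

    // offsetMs: milliseconds since 01:01:00:000. Returns true if the call is accepted.
    tryAcquire(offsetMs) {
        const slot = Math.floor(offsetMs / 1000);
        const used = this.counts.get(slot) || 0;
        if (used >= this.limit) return false;
        this.counts.set(slot, used + 1);
        return true;
    }
}

const limiter = new FixedWindowLimiter(1);
limiter.tryAcquire(0); // assumption: the slot for second 00 is already taken

// Retry offsets from the timeline: LI1 at 763 and 1587, LI2 at 907, 1850 and 2459.
const timeline = [763, 907, 1587, 1850, 2459];
const outcomes = timeline.map((offsetMs) => limiter.tryAcquire(offsetMs));
console.log(outcomes); // [ false, false, true, false, true ]
```

The outcomes match the timeline: two failures in second 00, LI1 succeeds in second 01, LI2 fails in the same second and then succeeds in second 02.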

Conclusion:

As we can see, retry logic with backoff and jitter works better than retry logic with immediate, static-wait or exponential-wait retries.

Image by Gerd Altmann from Pixabay
