
Vadym Kazulkin for AWS Community Builders


AWS Lambda SnapStart - Part 10: Troubleshooting errors and timeouts of the Init and Restore phases

Introduction

In the previous parts of our article series about AWS Lambda SnapStart, we measured Lambda cold start times with various Java runtime versions and frameworks. Now let's look at how to troubleshoot errors and timeouts in the Init and Restore phases of a SnapStart-enabled Lambda function.

Let's imagine we have enabled SnapStart for our function and, during deployment, received the following or a similar CloudFormation error message:

The following resource(s) failed to create: [GetProductByIdWithPrimingFunctionVersion5b3d011e02, GetProductByIdFunctionVersion5b3d011e

This will also leave the CloudFormation stack in the "UPDATE_ROLLBACK_IN_PROGRESS" state.

Previously, Lambda reported errors and timeouts to CloudWatch Logs during the Invoke phase only, so it was difficult to figure out the cause of a deployment failure like this one.

Troubleshooting Errors and Timeouts of SnapStart-enabled Lambda function Init and Restore Phase

Since November 8, 2023, AWS Lambda has made it easier to troubleshoot errors and timeouts of the Init and Restore phases (https://aws.amazon.com/about-aws/whats-new/2023/11/aws-lambda-troubleshoot-errors-timeouts-init-restore-phases/).

Let's explore what this means with a short example in which we intentionally produce an error in the beforeCheckpoint method of the CRaC Lambda hook.

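@Override
public void beforeCheckpoint(org.crac.Context<? extends Resource> context) throws Exception {
   // Priming call: no product with id "0" exists, so the Optional is empty
   Optional<Product> optionalProduct = productDao.getProduct("0");
   // get() on an empty Optional throws NoSuchElementException (intentional here)
   Product product = optionalProduct.get();
}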

We do some priming here, and as the product with id equal to 0 doesn't exist, we'll run into the following error:

java.util.NoSuchElementException: No value present
        at java.base/java.util.Optional.get()

As beforeCheckpoint for a SnapStart-enabled Lambda function is invoked before the snapshot is taken, the error should, according to the recent troubleshooting improvement, be published to CloudWatch Logs. And it is.

Now it's at least clear what happens during the "Init phase", so the error in this phase can be identified more easily.
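The failure above comes down to calling Optional.get() on an empty Optional inside the hook. The following is a minimal sketch of that behavior, using only the standard library so it runs on its own. InMemoryProductDao, PrimingSketch, primeStrict and primeLenient are hypothetical names introduced for illustration, and products are plain strings here, not the article's Product class. The lenient variant is one possible way to keep a priming miss from failing the deployment, not necessarily what the sample application does.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the article's productDao (assumption: the real one queries a database).
class InMemoryProductDao {
    private final Map<String, String> products = Map.of("1", "Sample product");

    Optional<String> getProduct(String id) {
        return Optional.ofNullable(products.get(id));
    }
}

class PrimingSketch {
    private final InMemoryProductDao productDao = new InMemoryProductDao();

    // Mirrors the hook body: get() throws NoSuchElementException when the product is missing.
    void primeStrict(String id) {
        productDao.getProduct(id).get();
    }

    // Defensive variant: a missing product is reported on stderr and does not throw.
    boolean primeLenient(String id) {
        Optional<String> product = productDao.getProduct(id);
        if (!product.isPresent()) {
            System.err.println("Priming skipped: no product with id " + id);
            return false;
        }
        return true;
    }
}
```

Calling primeStrict("0") throws, as in the hook above, while primeLenient("0") returns false and lets the caller continue.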

For the sake of completeness, let's reproduce the same error during the Restore phase:

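@Override
public void afterRestore(org.crac.Context<? extends Resource> context) throws Exception {
   // Same lookup as before: no product with id "0" exists, so the Optional is empty
   Optional<Product> optionalProduct = productDao.getProduct("0");
   // get() on an empty Optional throws NoSuchElementException (intentional here)
   Product product = optionalProduct.get();
}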

It doesn't make much sense to do such priming in the Restore phase. This code is only for the sake of provoking an error. In this case as well, the error appears in CloudWatch Logs and is easy to understand and fix.
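To make the difference between the two hooks concrete, here is a toy simulation of when each one runs and how a failure can be attributed to a phase. It is a sketch under stated assumptions: SnapshotHook is a hypothetical stdlib-only stand-in for org.crac.Resource, LifecycleSketch is not how Lambda is implemented, the message texts are made up for this example and are not Lambda's log format, and ordering across multiple hooks is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for org.crac.Resource, reduced to stdlib types.
interface SnapshotHook {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Toy lifecycle driver, for illustration only.
class LifecycleSketch {
    private final List<SnapshotHook> hooks = new ArrayList<>();

    void register(SnapshotHook hook) {
        hooks.add(hook);
    }

    // Runs beforeCheckpoint hooks (before the snapshot); returns the first failure, if any.
    Optional<String> snapshot() {
        for (SnapshotHook hook : hooks) {
            try {
                hook.beforeCheckpoint();
            } catch (Exception e) {
                return Optional.of("Init phase failed: " + e);
            }
        }
        return Optional.empty();
    }

    // Runs afterRestore hooks (after restoring from the snapshot); returns the first failure, if any.
    Optional<String> restore() {
        for (SnapshotHook hook : hooks) {
            try {
                hook.afterRestore();
            } catch (Exception e) {
                return Optional.of("Restore phase failed: " + e);
            }
        }
        return Optional.empty();
    }
}
```

With a hook that throws only in afterRestore, snapshot() returns an empty Optional and restore() returns a message carrying the exception, which mirrors how the same exception surfaces in a different phase depending on the hook it is thrown from.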

Of course, there are more complex errors, such as internal failures while the snapshot is taken or restored, which can only be fixed by the corresponding AWS team. But providing them with the exact error message in the support case will help resolve the issue much more quickly.

Conclusion

In this article, we demonstrated the recent improvement: Lambda now automatically captures logs about each phase of the Lambda execution environment lifecycle and sends them to CloudWatch Logs. This includes the Init phase, in which Lambda initializes the Lambda runtime and static code outside the function handler; the Restore phase, in which Lambda restores the execution environment from a snapshot for SnapStart-enabled functions; and the Invoke phase, in which Lambda executes the code in your function handler. This improvement enables us to troubleshoot errors and timeouts of the Init and Restore phases of a Lambda function.
