Vadym Kazulkin for AWS Community Builders

Reducing cold starts on AWS Lambda with Java runtime - Future ideas about SnapStart, GraalVM and Co

Introduction

In the previous 8 parts of our series about AWS Lambda SnapStart, we measured the cold starts of Lambda functions on the Java 11 and 17 runtimes, first without and then with SnapStart enabled, and also applied various optimization techniques on top of SnapStart, such as priming. We also compared these numbers with the cold start times measured with GraalVM Native Image. The current measurements reveal that the fastest cold starts are achieved with GraalVM Native Image, followed by SnapStart with priming (in case you can apply such an optimization for your use case, for example when you’re using DynamoDB as your database of choice), followed by SnapStart without any optimizations. The slowest cold starts, of course, occur when neither GraalVM Native Image nor SnapStart is used. See the summarized measurements in my previous articles of this series or in one of my presentations, like this one.

In this article, I’d like to share some thoughts on how AWS can further improve its offering around reducing cold starts on AWS Lambda with Java Runtime, and also improve the developer experience.

Thoughts on how AWS can further improve its offering around reducing cold starts on AWS Lambda with Java Runtime

Let’s start with the potential improvements to SnapStart-enabled Lambda functions:

  • Now that the correct snapshot restore durations have been reported since the end of September 2023, we can see that there is big potential to reduce the time required for the snapshot restore itself. I’m quite sure that AWS is already working on it.

  • Another potential improvement concerns how SnapStart is implemented. The snapshot is taken during the deployment phase, which currently takes a bit more than 2 minutes (when I first measured it, it took 2 minutes and 40 seconds, so some improvement has already been achieved). I’d personally like a configurable option to take the Firecracker microVM snapshot on the first Lambda function invocation instead of during the deployment phase. While that snapshot is being created, a regular cold start would occur, as if SnapStart were disabled. With that, we trade a quicker deployment (which improves the developer experience) for slower cold starts during roughly the first 2 minutes after the initial invocation following the deployment. Once the Firecracker microVM snapshot is fully created and ready to be restored, SnapStart becomes automatically re-enabled.

  • I’d like the same option for Lambda functions that haven’t been invoked for 14 days: in that case the microVM snapshot is deleted, which leads to its re-creation during the next invocation. This causes huge cold start times and even timeouts, and currently makes SnapStart hardly usable for Lambda functions with such an invocation pattern, where they can remain uninvoked for 14 days.

  • In parts 5 and 6 of this series, we introduced an optimization technique called “priming”, based on the optional Lambda hooks of the CRaC API, and discussed how it can be applied to known scenarios. Priming currently requires deep knowledge: you need to understand how things work behind the scenes to decide whether it’s worth applying and how. In the end, nearly everybody ends up writing the same boilerplate code, for example to prime the invocation to DynamoDB. I personally expect that not only AWS (which can surely help with priming, at least for its own services) but also the Java community will provide open source frameworks capable of applying priming “out of the box”: analyzing which AWS services, libraries, and frameworks a Lambda function uses and then generating the corresponding priming code in CRaC-based Lambda hooks behind the scenes (for example, during the compilation phase) for us.
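
    The boilerplate in question usually follows the pattern below. This is a minimal, self-contained sketch: the `Resource` interface is a simplified local stand-in for `org.crac.Resource`, and `primeDynamoDbCall()` is a hypothetical placeholder for a real AWS SDK call (such as a dummy `DynamoDbClient.getItem(...)`) that forces class loading and JIT work before the snapshot is taken.

    ```java
    // Sketch of the CRaC priming pattern discussed in parts 5 and 6.
    // Resource is a stand-in for org.crac.Resource; in a real Lambda function
    // you would import org.crac.* and register via Core.getGlobalContext().

    interface Resource {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    class PrimedHandler implements Resource {

        private boolean primed = false;

        // Placeholder: a real implementation would issue a dummy DynamoDB
        // request so the SDK classes are loaded before the snapshot is taken.
        private void primeDynamoDbCall() {
            primed = true;
        }

        @Override
        public void beforeCheckpoint() {
            // Runs just before the Firecracker microVM snapshot is taken
            // (during the deployment phase with SnapStart enabled).
            primeDynamoDbCall();
        }

        @Override
        public void afterRestore() {
            // Runs after the snapshot is restored; re-establish
            // network connections and refresh credentials here.
        }

        String handleRequest(String input) {
            return (primed ? "primed:" : "cold:") + input;
        }
    }

    public class PrimingSketch {
        public static void main(String[] args) throws Exception {
            PrimedHandler handler = new PrimedHandler();
            handler.beforeCheckpoint(); // simulates the snapshot phase
            handler.afterRestore();     // simulates restore on invocation
            System.out.println(handler.handleRequest("order-42")); // primed:order-42
        }
    }
    ```

    A framework that generates such hooks automatically would only need to detect the DynamoDB dependency and emit the equivalent of `beforeCheckpoint()` for it.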

  • Another improvement should be made to the reliability of the microVM snapshot creation itself. Several people have reported that errors occasionally occur during the deployment phase without any given reason, which leads to a CloudFormation rollback. AWS has recently provided some improvements for troubleshooting such cases, which we'll explore in a future article.

  • Another interesting potential optimization with SnapStart, not directly related to cold start times, concerns reaching the peak performance of the Lambda function. A current best practice for reducing cold start times on the Java managed runtime is to restrict tiered compilation to the C1 compiler. As Lambda functions are often small and single-purpose, it might be possible during the Firecracker microVM snapshot creation phase to invoke them several thousand times (usually around 10,000) so that C2 compilation kicks in and the function reaches its peak performance, which would also benefit warm executions. There is, of course, a huge trade-off between the gain (performance) and the associated cost of executing the Lambda function so many times, as well as the developer experience, since the Firecracker microVM snapshot creation would then take much longer. Once again, snapshot creation during the first Lambda invocation, as discussed above, could at least improve the developer experience.
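
    For reference, the C1-only best practice mentioned above is applied today with a single environment variable on the Java managed runtime. A hedged SAM template fragment (the resource name and runtime version are assumptions for illustration):

    ```yaml
    # SAM template fragment: restrict the JIT to the C1 compiler,
    # the documented best practice for Java cold starts on Lambda.
    MyFunction:                    # resource name is an example
      Type: AWS::Serverless::Function
      Properties:
        Runtime: java17
        Environment:
          Variables:
            JAVA_TOOL_OPTIONS: "-XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    ```

    The hypothetical peak-performance snapshot would effectively replace this setting with full tiered compilation plus a warm-up loop before the checkpoint.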

Now let’s look into the GraalVM Native Image and potential AWS offering around it:

  • I personally still think that providing a Lambda managed GraalVM (Native Image) runtime would give developers more options and benefit everybody. Delivering GraalVM Native Image through a Lambda custom runtime or a Lambda container image shifts many operational responsibilities (such as runtime updates and CI/CD pipelines) to the developers, but it gives them the freedom to use the runtime version they'd like, including new ones. Maybe the new Amazon Linux 2023 runtime for AWS Lambda can become the base of it.

  • I’d also suggest that, should such an offering become available in Lambda in the future, AWS give developers an option similar to the one discussed above for SnapStart: create the Native Image (which currently takes between 20 seconds and several minutes) either during the Lambda deployment phase or during the first Lambda invocation, defaulting to plain GraalVM with just-in-time compilation, or even the AWS Corretto Java runtime, as long as the native image hasn’t been fully created.

  • Ahead-of-time compilation has its own set of challenges, and developers fear runtime errors, especially as we don’t control all dependencies and whether they are GraalVM-ready. Maybe for such cases developers could provide a set of tests to be executed by AWS on their behalf in the Lambda GraalVM Native Image execution environment sandbox; if they pass, the environment becomes active. Otherwise, the function would once again default to plain GraalVM with just-in-time compilation, or even the AWS Corretto Java runtime, and developers would be notified about the errors encountered during test execution. Of course, this approach is tricky to implement and may not be required at all, as more and more dependencies become GraalVM Native Image ready.
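
    To illustrate what “GraalVM-ready” means in practice: a dependency that uses reflection needs reachability metadata, such as the following `reflect-config.json` fragment, or the closed-world native image fails at runtime. The class name below is a hypothetical example:

    ```json
    [
      {
        "name": "com.example.OrderEntity",
        "allDeclaredFields": true,
        "allDeclaredMethods": true,
        "allDeclaredConstructors": true
      }
    ]
    ```

    Libraries that ship this metadata themselves (or are covered by the shared reachability-metadata repository) are the ones that work with Native Image “out of the box”.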

Conclusion

In this article, we looked at possible improvements to AWS's existing offerings for reducing the cold start times of Lambda functions on the Java managed runtime, such as SnapStart, and also at how a managed GraalVM (Native Image) Lambda runtime could benefit developers. The Java runtime itself is also subject to improvement for the serverless use case: I encourage you to look at Project Leyden, whose primary goal is to improve the startup time, time to peak performance, and footprint of Java programs. I also encourage you to watch the video about the state of Project Leyden as of October 2023, presented by the Java architects at Oracle.

Please also check out my website for more technical content and upcoming public speaking activities.
