Introduction
In the course of this blog series, we measured cold starts across many different scenarios, mostly with SnapStart enabled. Now let's explore one more SnapStart detail, called "tiered caching" of the microVM snapshot. This mechanism was briefly mentioned in the article where SnapStart was announced. Here is the relevant sentence:

> With SnapStart, when a customer publishes a function version, the Lambda service initializes the function’s code. It takes an encrypted snapshot of the initialized execution environment, and persists the snapshot in a tiered cache for low-latency access.

So, what might this tiered cache be?
Tiered cache of the snapshot
In our experiment we'll re-use the application introduced in part 9. Let's take the GetProductByIdWithPureJava21Lambda function with SnapStart enabled (but without priming) and 1024 MB of memory, and measure the cold start for exactly 1 invocation. Let's assume it's the very first execution after a newer version of the Lambda function has been published. My result was 2270.28 ms. This cold start is still quite high. Without SnapStart enabled, the result would likely be in the range of 3100 to 3600 ms.
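As a reminder of the setup, here is an illustrative AWS SAM template fragment for such a function (handler and resource names are assumptions, not necessarily those of the sample application). Note that SnapStart only applies to published function versions, not to $LATEST:

```yaml
# Illustrative SAM fragment; handler and alias names are hypothetical.
GetProductByIdWithPureJava21Lambda:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: java21
    MemorySize: 1024
    Handler: com.example.GetProductByIdHandler::handleRequest
    SnapStart:
      ApplyOn: PublishedVersions   # snapshot is taken when a version is published
    AutoPublishAlias: live         # publishes a new version on each deployment
```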
But what will happen with the subsequent cold start times with SnapStart enabled? Let's see how percentiles change for an increasing number of cold starts.
| date and time | number of cold starts | p50 (ms) | p75 (ms) | p90 (ms) | p99 (ms) | p99.9 (ms) | max (ms) |
|---|---|---|---|---|---|---|---|
| 8.3. 18:15 | 1 | 2270.28 | 2270.28 | 2270.28 | 2270.28 | 2270.28 | 2270.28 |
| 8.3. 18:26 | 4 | 2078.54 | 2196.68 | 2270.28 | 2270.28 | 2270.28 | 2270.28 |
| 8.3. 18:38 | 9 | 2131.58 | 2210.69 | 2340.34 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 18:51 | 14 | 1880.21 | 2131.58 | 2270.28 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:05 | 20 | 1792.05 | 2015.11 | 2196.68 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:20 | 34 | 1706.08 | 1856.04 | 2131.58 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:32 | 49 | 1662.7 | 1792.05 | 2168.88 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:44 | 66 | 1642.87 | 1709.27 | 2078.54 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:44 | 76 | 1640.13 | 1703.17 | 2064.59 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 20:10 | 85 | 1640.13 | 1700.9 | 2015.11 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 20:20 | 98 | 1642.74 | 1703.17 | 1880.21 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 20:30 | 109 | 1639.75 | 1691.35 | 1865.41 | 2269.10 | 2338.17 | 2340.34 |
| 8.3. 20:41 | 120 | 1633.21 | 1679.56 | 1854.25 | 2269.10 | 2338.17 | 2340.34 |
| 8.3. 20:52 | 129 | 1629.95 | 1676.21 | 1854.25 | 2269.10 | 2338.17 | 2340.34 |
So, what we observe is that cold start times decrease as we experience more cold starts. After 50 cold starts, the effect becomes less and less visible for p50, and after 100 cold starts for p90. This is the tiered cache for the microVM snapshot in action. The effect of the tiered cache depends on the percentile and is significant (up to 600 ms).
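Percentiles like those in the table above can be computed from the raw cold-start durations with a simple nearest-rank calculation. Here is a minimal, self-contained sketch; the sample durations are a subset of the measurements above, and in practice you would parse them from CloudWatch REPORT log lines:

```java
import java.util.Arrays;
import java.util.Locale;

public class Percentiles {

    // Nearest-rank percentile over an ascending-sorted array: the smallest
    // value such that at least p percent of the measurements are <= it.
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // A sample of cold-start durations (ms) from the measurements above.
        double[] coldStarts = {2270.28, 2078.54, 1880.21, 1792.05, 1706.08,
                               1662.70, 1642.87, 1640.13, 2340.34, 2131.58};
        Arrays.sort(coldStarts);
        System.out.printf(Locale.US, "p50=%.2f p90=%.2f max=%.2f%n",
                percentile(coldStarts, 50),
                percentile(coldStarts, 90),
                coldStarts[coldStarts.length - 1]);
        // prints: p50=1792.05 p90=2270.28 max=2340.34
    }
}
```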
If you are interested in the deep details of how Lambda SnapStart (which is currently only available for the Java runtime) is implemented, and particularly how the microVM (the whole execution environment) snapshot and its tiered caching work under the hood, I recommend the talk AWS Lambda Under the Hood by Mike Danilov. There is also a detailed summary of his talk here and additional resources here.
Of course, the next question is: what happens if we don't invoke the Lambda function for a while and then execute it later? Will the cold start increase? Let's check.
Let's stop invoking the Lambda function and execute it again 30 minutes later at 21:22: the cold start was 1674.06 ms. After another pause, an invocation at 21:52 gave a cold start of 1702.17 ms, and one at 23:00 gave 1735.06 ms. So the cold start got slightly bigger, but we don't observe the worst values from the first executions. Then I stopped invoking the function for 8 hours and executed it 15,000 times the next morning, running into 16 cold starts with a p50 of 1669.07 ms and 2019.88 ms from p90 upwards. So the tiered caching effect was still there after so many hours, and the p90 and higher values were not as big as during the first invocations.
To complete the test of snapshot tiered caching, I ran the same experiments on the GetProductByIdWithPureJava21LambdaAndPriming function, which uses SnapStart with DynamoDB request invocation priming on top. The results are summarized in the table below.
| date and time | number of cold starts | p50 (ms) | p75 (ms) | p90 (ms) | p99 (ms) | p99.9 (ms) | max (ms) |
|---|---|---|---|---|---|---|---|
| 8.3. 18:15 | 1 | 1189.55 | 1189.55 | 1189.55 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 18:26 | 4 | 1046.09 | 1166.55 | 1189.55 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 18:38 | 9 | 801.74 | 1046.09 | 1189.55 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 18:51 | 14 | 763.37 | 808.96 | 1166.55 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:05 | 23 | 730.28 | 801.74 | 1046.09 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:20 | 32 | 720.01 | 796.2 | 941.29 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:32 | 47 | 700 | 758.39 | 903.36 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:44 | 58 | 692.52 | 749.01 | 831.72 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:44 | 68 | 684 | 748.61 | 831.72 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 20:10 | 80 | 679.44 | 731.52 | 801.74 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 20:20 | 91 | 688.25 | 748.61 | 799.25 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 20:30 | 100 | 689.34 | 748.22 | 799.24 | 1166.16 | 1188.52 | 1189.55 |
| 8.3. 20:41 | 110 | 679.76 | 744.49 | 799.24 | 1166.16 | 1188.52 | 1189.55 |
| 8.3. 20:52 | 122 | 679.08 | 744.49 | 799.24 | 1166.16 | 1188.52 | 1188.52 |
So, what we observe is the same as without priming: the cold start time decreases as we experience more cold starts. After 50 cold starts, the effect becomes less and less visible; it becomes negligible for p50 after 80 cold starts and for p90 after 90 cold starts. The effect of the tiered cache depends on the percentile and is significant (up to 500 ms).
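For context, this kind of priming is typically implemented with the CRaC (Coordinated Restore at Checkpoint) hooks that SnapStart supports: the handler registers itself as an `org.crac.Resource` and issues a real DynamoDB request in `beforeCheckpoint`, so that the SDK's request path (class loading, JIT compilation, TLS setup) is captured in the snapshot. Below is a minimal sketch of that wiring; the class, table, and key names are illustrative and not necessarily those used in the sample application:

```java
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import java.util.Map;

// Illustrative handler skeleton; names are hypothetical.
public class GetProductByIdHandler implements Resource {

    private final DynamoDbClient dynamoDb = DynamoDbClient.create();

    public GetProductByIdHandler() {
        // Register this object so the runtime invokes beforeCheckpoint
        // right before the SnapStart snapshot is taken.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Prime the DynamoDB request path during initialization so its
        // one-time costs land in the snapshot instead of the first invocation.
        dynamoDb.getItem(GetItemRequest.builder()
                .tableName("Products") // illustrative table name
                .key(Map.of("id", AttributeValue.builder().s("0").build()))
                .build());
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        // Nothing to re-initialize in this sketch.
    }
}
```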
Of course, I had the same question: what happens if I don't invoke the Lambda function for a while and then execute it later? Let's check.
Let's stop invoking the Lambda function and execute it again 30 minutes later at 21:22: the cold start was 746.63 ms. After another pause, an invocation at 21:52 gave a cold start of 617.7 ms, and one at 23:00 gave 673.5 ms. Then I stopped invoking the function for 8 hours and executed it 15,000 times the next morning, running into 17 cold starts with a p50 of 723.99 ms and a p90 of 894.05 ms. So the tiered caching effect was still there after so many hours as well, and the p90 and higher values were not as big as during the first invocations.
Conclusion
In this article, we saw the microVM snapshot tiered cache in action for SnapStart-enabled Lambda functions (with and without priming) on the Java 21 runtime. The conclusion is quite obvious: don't stop after enabling SnapStart and measuring only one cold start time, or a couple of them. Yes, the first cold starts take longer, but cold starts improve with the number of invocations of the same Lambda function version, and seem to stay at a good level regardless of whether you pause invoking your Lambda for a while. I assume the same or a very similar effect would be observed with Java 17, as it's not about the Java version itself, but about the technical implementation of the microVM snapshot tiered cache done by AWS.