There is a class of AWS incident I have started calling the "everything looked fine in testing" failure.
The pattern is consistent. You design a serverless API. Lambda function with sensible defaults, wired through API Gateway, pointing at DynamoDB. You test it in dev throughout the week. Latency is acceptable. Costs track to plan. Then a campaign drops, or a new enterprise customer brings their three thousand users on day one, and your traffic goes from 300 RPS to 3,000 RPS in under a minute.
Lambda, which has never had to spin up more than a dozen concurrent environments at once, is now being asked to handle a hundred. Cold starts accumulate. p99 latency goes from 80ms to 2,400ms. API Gateway timeout windows close on in-flight requests. Customers see errors. The Slack channel lights up. You spend a Saturday explaining to your CTO why the architecture that "passed all our tests" just fell over under a load it should have anticipated.
I have been in this situation. More than once.
The second time is when I stopped treating load testing as a post-deployment activity.
## The cold start problem, precisely stated
Lambda's execution model does not maintain persistent servers. When an invocation arrives and no warm execution environment exists, Lambda must provision one: select a host, initialise the execution environment, load the runtime, execute your module-level initialisation code.
That sequence is the cold start. And its duration varies along several dimensions that are non-obvious:
Runtime matters. Node.js 20 with V8: typically under 100ms for lightweight functions. Python: comparable. Java with JVM class-loading: 300ms to well over a second.
Memory allocation matters. Lambda allocates CPU proportionally to memory. A function at 1,024 MB gets significantly more CPU than one at 128 MB. Counterintuitively, increasing memory can reduce cold start latency and total cost simultaneously - billing is memory × duration, and the faster initialisation can more than offset the larger per-millisecond charge.
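A back-of-envelope model makes the counterintuitive part concrete. For purely CPU-bound work, an 8× memory bump cuts duration roughly 8×, leaving cost flat; the fixed overheads of a real cold start are what tip the larger allocation ahead on both axes. The durations below are illustrative, not measured:

```python
GB_SECOND_RATE = 0.0000166667  # USD per GB-second (us-east-1 x86, at time of writing)

def invocation_cost(memory_mb: float, duration_ms: float) -> float:
    # Lambda bills memory x duration, metered in 1 ms increments.
    return (memory_mb / 1024) * (duration_ms / 1000) * GB_SECOND_RATE

# Illustrative: fixed overhead means an 8x memory bump cuts a cold
# invocation from 1,200 ms to 140 ms - slightly better than the pure-CPU 8x.
small = invocation_cost(128, 1200)   # 0.125 GB x 1.2 s  = 0.15 GB-seconds
large = invocation_cost(1024, 140)   # 1.0 GB   x 0.14 s = 0.14 GB-seconds

assert large < small  # more memory, lower latency AND lower cost
```

When scaling is closer to linear the costs converge instead of crossing, so the result is workload-dependent - which is exactly why it is worth measuring rather than assuming.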
The spike dynamic is what kills you. Cold starts at steady state are manageable. The problem is spike behaviour. Under rapid traffic increase, Lambda must provision new environments in parallel. You can hit dozens or hundreds of concurrent cold starts at the exact moment your users' experience is most consequential. A steady-state load test does not expose this.
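Module-level initialisation - the last step in the provisioning sequence above - is also the part you control most directly. A minimal sketch of the pattern (Python here for brevity; the idea is identical in Node.js, where anything outside the handler plays the same role):

```python
# Module scope runs once per execution environment, i.e. during the cold
# start. Every warm invocation afterwards reuses what it built.
INIT_COUNT = 0  # instrumentation for this sketch, not production code

def _build_clients():
    """Stand-in for expensive setup: SDK clients, config, connection pools."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"table": "orders"}  # hypothetical resource handle

CLIENTS = _build_clients()  # paid once, at cold start

def handler(event, context):
    # Per-invocation work only; nothing heavy should live here.
    return {"statusCode": 200, "table": CLIENTS["table"]}
```

Calling `handler` repeatedly in the same environment leaves `INIT_COUNT` at 1; only a fresh environment - a cold start - pays `_build_clients` again.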
## The pre-deployment simulation model
For the simulation, I built the architecture on a pinpole canvas:
Route 53 → CloudFront → API Gateway → Lambda (Node.js 20, 512MB) → DynamoDB
Lambda configured without provisioned concurrency - the common default for a new service with uncertain traffic. Reserved concurrency set explicitly rather than left unset, where the function competes for the shared account-level pool.
The Lambda config panel exposes the parameters that directly affect cold-start modelling:
| Parameter | Value | Cold Start Relevance |
|---|---|---|
| Runtime | Node.js 20.x | High - directly factors into latency model |
| Memory | 512 MB | High - CPU allocation, init speed |
| Reserved concurrency | Set explicitly | Critical - defines throttle ceiling |
| Provisioned concurrency | Off | The variable under test |
I ran a Spike traffic pattern: 300 RPS baseline → 3,000 RPS over 60 seconds. The concurrency graph showed cold-start accumulation in real time. At peak: 90 concurrent cold starts, 2,400ms p99 latency.
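Those peak numbers are consistent with a Little's-law sanity check - in-flight requests ≈ arrival rate × time in system - which is worth doing on paper before any simulation. Using the article's latency figures as rough averages (an assumption; p99 is not the mean):

```python
def required_concurrency(rps: float, avg_duration_s: float) -> float:
    # Little's law: concurrent in-flight requests = arrival rate x duration
    return rps * avg_duration_s

warm_baseline = required_concurrency(300, 0.08)    # ~24 environments
warm_peak     = required_concurrency(3000, 0.08)   # ~240 environments
degraded_peak = required_concurrency(3000, 2.4)    # ~7,200 if 2,400 ms were typical
```

The third line is the feedback loop that makes spikes dangerous: cold starts inflate duration, inflated duration inflates the concurrency needed, and the gap between ~240 and ~7,200 is filled with queueing and timeouts.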
## What the simulation surfaced that a load test could not
1. The burst scaling limit. Lambda's initial burst quota is 500–3,000 concurrent executions depending on region, then 500 new environments per minute thereafter. This is not visible in the Lambda console until you hit it. The simulation reflects these constraints - the concurrency graph under spike traffic is not a smooth ramp. It shows the actual burst behaviour, including the plateau and the recovery slope.
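The plateau-and-recovery shape falls straight out of that burst model. A sketch under the classic published limits (region-dependent initial burst, then a linear 500-per-minute refill):

```python
def time_to_capacity(target: int, initial_burst: int = 500,
                     per_minute: int = 500) -> float:
    """Seconds until the classic burst model can supply `target` concurrent
    environments: the initial burst immediately, then a linear refill."""
    if target <= initial_burst:
        return 0.0
    return (target - initial_burst) * 60 / per_minute

# A 500-burst region asked for 1,200 concurrent executions:
# 500 arrive at once, the remaining 700 drip in over the next 84 seconds.
print(time_to_capacity(1200))  # → 84.0
```

Everything above the burst ceiling during those 84 seconds is throttled or queued - that is the plateau on the concurrency graph, and the refill rate is the recovery slope.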
2. Timeout alignment. The simulation flagged a configuration where API Gateway's integration timeout and Lambda's execution timeout were both set to 29 seconds. Under concurrency pressure, invocations that queue before executing can consume their window before execution even begins. Surface this in a canvas session: costs nothing. Discover it in a 2 AM incident: costs considerably more.
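The same check is cheap to encode as a lint rule in review tooling. This is a hypothetical helper for illustration, not an AWS API:

```python
def check_timeout_alignment(gateway_timeout_s: float, lambda_timeout_s: float,
                            headroom_s: float = 1.0) -> list[str]:
    """Flag configurations where the function can outlive the gateway's
    window, or where queueing eats the window before execution begins."""
    findings = []
    if lambda_timeout_s >= gateway_timeout_s:
        findings.append(
            "Lambda may still be running after API Gateway has already "
            "returned a 504: the caller sees an error while billed work continues."
        )
    if gateway_timeout_s - lambda_timeout_s < headroom_s:
        findings.append(
            "No headroom for queueing or cold start before execution begins."
        )
    return findings

# The configuration the simulation flagged: both timeouts at 29 seconds.
print(len(check_timeout_alignment(29, 29)))  # → 2
```

A Lambda timeout a few seconds below the gateway's integration timeout clears both findings.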
3. The provisioned concurrency trade-off, quantified. I accepted the recommendation to enable provisioned concurrency, reran the simulation, and compared in the execution history view. p99 at peak: 80ms. The cost of provisioned concurrency was visible in the live estimate alongside the latency improvement. The trade-off was explicit and quantified before any IaC was written.
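The cost side of that trade-off is also simple to estimate by hand. The rate below is the published us-east-1 x86 figure at the time of writing - verify current pricing before relying on it:

```python
# USD per GB-second of provisioned concurrency, charged whether or not
# the capacity is invoked. Execution time on it is billed separately.
PC_RATE_PER_GB_SECOND = 0.0000041667

def monthly_provisioned_cost(memory_mb: float, provisioned_units: int,
                             hours: float = 730) -> float:
    """Floor cost of keeping `provisioned_units` environments warm all month."""
    gb = memory_mb / 1024
    return provisioned_units * gb * hours * 3600 * PC_RATE_PER_GB_SECOND

# Illustrative sizing for this architecture: 100 warm 512 MB environments.
print(round(monthly_provisioned_cost(512, 100), 2))  # ≈ 547.50 USD/month
```

Roughly $550 a month to keep the spike's worth of capacity warm, weighed against a p99 drop from 2,400 ms to 80 ms at peak - the trade-off the comparison view makes explicit.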
## The reproducibility argument
The result I value most is not the simulation output itself - it is that the output is reproducible and shareable. When I shared this analysis with my team, I shared the simulation run: the exact canvas configuration, the traffic pattern, the concurrency graph, the before-and-after comparison. Not an assertion about expected behaviour. A versioned record of what the model showed.
That is a materially different quality of architectural evidence.
Full post with complete simulation methodology, burst scaling model details, provisioned concurrency cost analysis, and the pre-deployment Lambda checklist →