One of the most compelling promises of serverless is "infinite scale" — your function handles 1 request or 100,000 requests, and you never touch a config file.
But that promise comes with nuance. Lambda's scaling model has hard limits, specific behaviors under load, and failure modes that will surprise you in production if you haven't studied them.
In this article, we'll break down exactly how Lambda scales: the concurrency model, how scale-out works internally, what happens when you hit limits, and how to design for scale in production.
The Fundamental Unit: Concurrency
Lambda doesn't scale by CPU or memory — it scales by concurrency.
Concurrency = the number of function instances handling requests simultaneously.
Time →
Request A: [===========] (200ms)
Request B: [===========] (200ms)
Request C: [===========] (200ms)
Concurrency at peak = 3 simultaneous executions
Each concurrent execution runs in its own isolated execution environment (a Firecracker microVM, as we covered in Part 1). There is no shared state between concurrent executions.
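To make the timeline above concrete, here is a minimal sketch that computes peak concurrency from request start and end times. The function name `peak_concurrency` and the sample timings are illustrative assumptions, not something Lambda exposes.

```python
def peak_concurrency(requests):
    """Return the maximum number of requests in flight at the same time.

    `requests` is an iterable of (start_ms, end_ms) pairs. A request counts
    as in flight for start_ms <= t < end_ms.
    """
    events = []
    for start, end in requests:
        events.append((start, 1))   # request begins: one more in flight
        events.append((end, -1))    # request ends: one fewer in flight
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a request ending at t frees its slot before another starts at t.
    events.sort()
    in_flight = 0
    peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)
    return peak


# Three 200ms requests that overlap (hypothetical timings)
overlapping = [(0, 200), (50, 250), (100, 300)]
# The same three requests arriving one after another
back_to_back = [(0, 200), (200, 400), (400, 600)]

print(peak_concurrency(overlapping))   # -> 3
print(peak_concurrency(back_to_back))  # -> 1
```

The same three requests need three environments when they overlap and only one when they arrive back to back, which is why concurrency, not request count, is the unit to watch.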
The scaling formula is simple:
$$\text{Concurrency} = \text{Requests per second} \times \text{Average duration (seconds)}$$
Example: 500 requests/second, each taking 200ms (0.2s):
$$\text{Concurrency} = 500 \times 0.2 = 100 \text{ concurrent executions}$$
This means you need 100 execution environments running simultaneously. If your function takes longer (say, 2 seconds per request), the same traffic requires 1,000 concurrent environments.
Key insight: Slow functions are expensive at scale — not just in duration cost, but in the concurrency they consume. Optimizing function duration directly reduces the concurrency you need.
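The formula above is easy to turn into a quick estimator. This is a minimal sketch; `required_concurrency` is a hypothetical helper name, and it gives a steady-state average rather than a guarantee for bursty traffic.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Estimate concurrent executions as RPS x average duration, rounded up."""
    raw = requests_per_second * avg_duration_seconds
    # Round away floating-point noise before taking the ceiling
    return math.ceil(round(raw, 6))


print(required_concurrency(500, 0.2))  # -> 100
print(required_concurrency(500, 2.0))  # -> 1000
print(required_concurrency(500, 0.1))  # -> 50
```

Halving the duration from 200ms to 100ms halves the concurrency needed for the same 500 requests/second.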
Three Types of Concurrency
AWS Lambda has three concurrency concepts you need to understand:
1. Account-Level Concurrency Limit
Every AWS account has a regional concurrency limit: by default, 1,000 concurrent executions per region. This quota is shared across ALL Lambda functions in that region, and it acts as a ceiling until you request an increase.
Account concurrency pool: 1,000
├── function-A: using 400
├── function-B: using 350
├── function-C: using 200
└── Available: 50
If all 1,000 slots are consumed, any new invocation is throttled — Lambda returns a 429 TooManyRequestsException.
You can request a limit increase through Service Quotas or AWS Support (up to tens of thousands for production workloads).
2. Reserved Concurrency
You can reserve a fixed number of concurrency slots for a specific function. This does two things:
- Guarantees the function always has those slots available (other functions can't consume them)
- Caps the function at that maximum (it can never exceed the reserved amount)
# Set reserved concurrency via boto3
import boto3
lambda_client = boto3.client('lambda')
lambda_client.put_function_concurrency(
    FunctionName='brand-logo-processor',
    ReservedConcurrentExecutions=200  # max 200 concurrent executions
)
Account pool: 1,000
├── brand-logo-processor: RESERVED 200 (guaranteed + capped)
├── brand-api: RESERVED 300 (guaranteed + capped)
└── Unreserved pool: 500 (shared by all other functions)
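The pool arithmetic in the diagram can be checked with a few lines of code. This is a sketch only: `unreserved_pool` is a hypothetical helper, and the one AWS-specific rule it encodes is that Lambda keeps at least 100 units of concurrency unreserved for functions without a reservation.

```python
MIN_UNRESERVED = 100  # Lambda keeps at least this much concurrency unreserved


def unreserved_pool(account_limit: int, reservations: dict) -> int:
    """Return the concurrency left for functions without a reservation.

    `reservations` maps function name -> reserved concurrency.
    Raises ValueError if the reservations would leave fewer than
    MIN_UNRESERVED units in the shared pool.
    """
    remaining = account_limit - sum(reservations.values())
    if remaining < MIN_UNRESERVED:
        raise ValueError(
            f"reservations leave {remaining} unreserved; "
            f"at least {MIN_UNRESERVED} must remain"
        )
    return remaining


reservations = {"brand-logo-processor": 200, "brand-api": 300}
print(unreserved_pool(1000, reservations))  # -> 500
```

Running this kind of check before adding a new reservation shows how much shared capacity every other function in the region will be left with.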
When to use reserved concurrency:
- Protect a downstream database from being overwhelmed (cap Lambda concurrency to match DB connection pool size)
- Guarantee capacity for a critical function during high traffic
- Prevent a runaway function from consuming the entire account pool
3. Provisioned Concurrency
As covered in Part 1, Provisioned Concurrency pre-initializes execution environments so they're ready to handle requests without a cold start. When a function also has reserved concurrency, its provisioned concurrency counts against that reservation and cannot exceed it.
Reserved: 200
├── Provisioned: 50 (always warm, zero cold start)
└── On-demand: 150 (scale up as needed, may cold start)
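To see how a given level of demand is served under this split, here is a small sketch. `split_demand` is a hypothetical helper that models the diagram above; it ignores account-level limits and scale-out speed.

```python
def split_demand(demand: int, reserved: int, provisioned: int) -> dict:
    """Split concurrent demand into warm, on-demand, and throttled requests.

    warm      -> served by pre-initialized (provisioned) environments
    on_demand -> served by environments created as needed (may cold start)
    throttled -> demand beyond the function's reserved concurrency
    """
    if not 0 <= provisioned <= reserved:
        raise ValueError("provisioned must be between 0 and reserved")
    warm = min(demand, provisioned)
    on_demand = min(max(demand - provisioned, 0), reserved - provisioned)
    throttled = max(demand - reserved, 0)
    return {"warm": warm, "on_demand": on_demand, "throttled": throttled}


for demand in (30, 120, 260):
    print(demand, split_demand(demand, reserved=200, provisioned=50))
# 30  -> {'warm': 30, 'on_demand': 0, 'throttled': 0}
# 120 -> {'warm': 50, 'on_demand': 70, 'throttled': 0}
# 260 -> {'warm': 50, 'on_demand': 150, 'throttled': 60}
```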
How Lambda Scales Out: The Internal Mechanics
When traffic increases, Lambda needs to spin up new execution environments. This is where the internal scaling mechanics matter.
The Burst Scaling Limit
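Lambda doesn't scale from 0 to 10,000 instantly. (AWS announced a faster per-function scaling model in late 2023, so confirm the current figures in the Lambda documentation.) In the burst model described here, scale-out has two stages, an initial burst limit that Lambda can add immediately and a fixed per-minute scale rate after that: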
| Region | Initial Burst Limit | Scale Rate After Burst |
|---|---|---|
| us-east-1, us-west-2, eu-west-1 | 3,000 | +500/minute |
| All other regions | 500–1,000 | +500/minute |
What this means in practice:
Minute 0: Traffic spike hits. Lambda adds environments immediately, up to the initial burst limit.
Minute 1: +500 new environments provisioned
Minute 2: +500 more
Minute 3: +500 more
...until account limit is reached or traffic stabilizes
If your traffic spikes from 0 to 5,000 concurrent requests instantly (e.g., a viral event), Lambda cannot serve all of them immediately, even if your account limit has been raised above 5,000. The first ~3,000 are handled (in us-east-1), and the rest are throttled until the +500/minute scale rate catches up, roughly four minutes later.
Mitigation: Use Provisioned Concurrency for latency-sensitive functions that may experience sudden spikes.
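The burst behavior described above can be sketched as a small simulation. The defaults (3,000 initial burst, +500/minute) come from the table; the 10,000 account limit and the function name `capacity_by_minute` are assumptions for illustration.

```python
def capacity_by_minute(target: int,
                       initial_burst: int = 3000,
                       per_minute: int = 500,
                       account_limit: int = 10_000) -> list:
    """Return available concurrency at minute 0, 1, 2, ... until `target`
    (or the account limit, whichever is lower) is reached."""
    if per_minute <= 0:
        raise ValueError("per_minute must be positive")
    ceiling = min(target, account_limit)
    capacity = min(initial_burst, ceiling)   # minute 0: the initial burst
    timeline = [capacity]
    while capacity < ceiling:
        capacity = min(capacity + per_minute, ceiling)  # each later minute
        timeline.append(capacity)
    return timeline


timeline = capacity_by_minute(5000)
print(timeline)           # -> [3000, 3500, 4000, 4500, 5000]
print(len(timeline) - 1)  # minutes until the spike is fully served -> 4
```

During those four minutes, demand above the available capacity is throttled, which is the gap Provisioned Concurrency is meant to cover.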
Scale-to-Zero
Unlike traditional servers or Kubernetes deployments, Lambda scales all the way down to zero when there's no traffic. No idle execution environments, no cost.
Traffic pattern:
08:00 ████████████████ (high traffic — many environments active)
12:00 ████ (moderate traffic)
03:00 (no traffic — zero environments, zero cost)
This is fundamentally different from a default Kubernetes HPA (Horizontal Pod Autoscaler) setup, which keeps at least one replica running. HPA typically scales on CPU/memory metrics from running pods; with no pods there are no such metrics, so scaling up from zero needs an extra component such as KEDA or Knative.
Lambda solves this with a different model: the trigger mechanism itself (API Gateway, SQS, EventBridge) acts as the "activator" — it wakes Lambda up when traffic arrives, even from zero.
The Metrics → Decision → Scale Loop
Lambda's auto-scaling can be understood as a three-phase loop, the same shape most auto-scalers follow:
Phase 1: Metrics Collection
Lambda continuously monitors:
- Concurrent executions: how many environments are actively processing
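- Throttles: how many invocations are being rejected for lack of concurrency
- Queue depth or iterator age (for SQS/Kinesis triggers): how much unprocessed work is waiting
These metrics are available in CloudWatch (for example, ConcurrentExecutions and Throttles in the AWS/Lambda namespace) and can be used to build custom scaling alarms.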
# Monitor Lambda concurrency with CloudWatch
from datetime import datetime, timedelta, timezone
import boto3
cloudwatch = boto3.client('cloudwatch')
# Get concurrent executions for the last 5 minutes
end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(minutes=5)
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Lambda',
    MetricName='ConcurrentExecutions',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'brand-api'}],
    StartTime=start_time,
    EndTime=end_time,
    Period=60,
    Statistics=['Maximum']
)
# Datapoints are not guaranteed to arrive in time order, so sort them first
for datapoint in sorted(response['Datapoints'], key=lambda d: d['Timestamp']):
    print(f"Max concurrency at {datapoint['Timestamp']}: {datapoint['Maximum']}")
Phase 2: Scaling Decision
Lambda's internal scheduler makes scaling decisions based on incoming request rate and available environments. AWS does not document this logic in detail, so treat the following as a mental model with two modes (similar to Knative's Stable/Panic modes):
Normal mode (gradual traffic increase):
- Scale out proportionally to match incoming request rate
- Target: keep enough headroom that utilization stays comfortably below the limit (the ~70% figure is a rule of thumb, not a documented Lambda setting)
Burst mode (sudden traffic spike):
- Scale as fast as the burst limit allows (+500 environments/minute after initial burst)
- Prioritize getting new environments online over perfect efficiency
Phase 3: Execution
Once the scaling decision is made, Lambda provisions new execution environments (triggering cold starts for the new instances) and routes traffic to them. Existing warm environments continue handling requests uninterrupted.
Throttling: What Happens When You Hit the Limit
When Lambda can't scale further (account limit reached, reserved concurrency exhausted, or burst limit hit), it throttles: new invocations are rejected with a 429 TooManyRequestsException.
Throttling behavior differs by invocation type:
| Invocation Type | Throttle Behavior |
|---|---|
| Synchronous (API Gateway) | Returns 429 immediately to caller |
| Asynchronous (S3, EventBridge) | Retries for up to 6 hours with exponential backoff |
| SQS | Message stays in queue, retried based on visibility timeout |
| Kinesis/DynamoDB Streams | Shard processing pauses, retried until success or expiry |
# Handle throttling in synchronous callers
import json
import random
import time
import boto3
from botocore.exceptions import ClientError
lambda_client = boto3.client('lambda')
def invoke_with_retry(function_name: str, payload: dict, max_retries: int = 3):
    """Invoke Lambda with exponential backoff on throttling"""
    for attempt in range(max_retries):
        try:
            response = lambda_client.invoke(
                FunctionName=function_name,
                InvocationType='RequestResponse',
                Payload=json.dumps(payload)
            )
            return response
        except ClientError as e:
            if e.response['Error']['Code'] == 'TooManyRequestsException':
                # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f'Throttled. Retrying in {wait_time:.1f}s (attempt {attempt + 1}/{max_retries})')
                time.sleep(wait_time)
            else:
                # Not a throttle: let the caller see the original error
                raise
    raise RuntimeError(f'Max retries exceeded for {function_name}')
Production Scaling Patterns
Pattern 1: Protect Downstream Services with Reserved Concurrency
The most common production issue: Lambda scales to 500 concurrent executions, each opening a database connection, overwhelming your RDS instance (which supports ~100 connections).
# Bad: Lambda scales freely, destroys your database
import os
import psycopg2
DATABASE_URL = os.environ['DATABASE_URL']
def handler(event, context):
    conn = psycopg2.connect(DATABASE_URL)  # new connection every invocation!
    # ...
# Good: Use connection pooling + reserved concurrency cap
import os
from aws_lambda_powertools import Logger
logger = Logger()
# RDS Proxy handles connection pooling: Lambda connects to the proxy
DB_PROXY_ENDPOINT = os.environ['DB_PROXY_ENDPOINT']
def handler(event, context):
    # RDS Proxy multiplexes Lambda's connections to RDS
    # (get_db_connection is not shown here; assume it opens a connection to the proxy endpoint)
    conn = get_db_connection(DB_PROXY_ENDPOINT)
    # ...
# serverless.yml: cap Lambda concurrency to match RDS Proxy pool
functions:
  brandDataProcessor:
    handler: handler.handler
    reservedConcurrency: 100  # RDS Proxy supports 100 connections
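One way to arrive at the reservedConcurrency value is to derive it from the database connection limit. The sketch below assumes each execution environment holds a fixed number of connections; the helper name `safe_reserved_concurrency` and the 20% default headroom are illustrative choices, not AWS recommendations.

```python
def safe_reserved_concurrency(db_max_connections: int,
                              connections_per_environment: int = 1,
                              headroom_percent: int = 20) -> int:
    """Return a concurrency cap that keeps Lambda within the DB connection limit.

    headroom_percent reserves part of the pool for other clients
    (migrations, admin sessions, other services).
    """
    if connections_per_environment < 1:
        raise ValueError("connections_per_environment must be >= 1")
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    usable = db_max_connections * (100 - headroom_percent) // 100
    return usable // connections_per_environment


print(safe_reserved_concurrency(100))                      # -> 80
print(safe_reserved_concurrency(100, headroom_percent=0))  # -> 100
print(safe_reserved_concurrency(100, connections_per_environment=2))  # -> 40
```

With no headroom the result matches the 100 used in the config above; leaving some headroom is a judgment call that depends on what else connects to the database.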
Pattern 2: Use SQS as a Concurrency Buffer
For high-volume async workloads, put SQS between your event source and Lambda. SQS absorbs traffic spikes; Lambda processes at a controlled rate.
High-volume events → SQS Queue → Lambda (controlled concurrency)
[buffer] [max 50 concurrent]
# serverless.yml
functions:
  processBrandAsset:
    handler: handler.handler
    reservedConcurrency: 50  # never more than 50 concurrent
    events:
      - sqs:
          arn: !GetAtt BrandAssetQueue.Arn
          batchSize: 5
          maximumBatchingWindow: 10
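Capping concurrency trades latency for safety, so it helps to estimate how long a backlog will take to drain. This is a back-of-the-envelope sketch: `drain_time_seconds` is a hypothetical helper, the 2-second batch duration and 100,000-message backlog are made-up inputs, and it ignores polling overhead, the batching window, and retries.

```python
def drain_time_seconds(backlog: int,
                       batch_size: int,
                       max_concurrency: int,
                       avg_batch_duration_s: float) -> float:
    """Estimate seconds needed to process `backlog` queued messages.

    Throughput = concurrent environments x messages per batch / seconds per batch.
    """
    if min(batch_size, max_concurrency) < 1 or avg_batch_duration_s <= 0:
        raise ValueError("batch_size, max_concurrency and duration must be positive")
    messages_per_second = (max_concurrency * batch_size) / avg_batch_duration_s
    return backlog / messages_per_second


# 100,000 queued messages, batchSize 5, reservedConcurrency 50,
# and an assumed 2 seconds to process each batch
print(drain_time_seconds(100_000, 5, 50, 2.0))  # -> 800.0
```

If that estimate exceeds what the workload can tolerate, the levers are the same three inputs: larger batches, a higher concurrency cap, or faster processing per batch.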
Pattern 3: Scheduled Scaling with Provisioned Concurrency
For predictable traffic patterns (business hours spike), pre-scale with Provisioned Concurrency on a schedule.
# scale_provisioned.py: run this as a scheduled Lambda or local script
import boto3
aas_client = boto3.client('application-autoscaling')
# Register the function alias as a scalable target
aas_client.register_scalable_target(
    ServiceNamespace='lambda',
    ResourceId='function:brand-api:prod',
    ScalableDimension='lambda:function:ProvisionedConcurrency',
    MinCapacity=2,
    MaxCapacity=100
)
# Scale up at 8 AM UTC (business hours start)
aas_client.put_scheduled_action(
    ServiceNamespace='lambda',
    ResourceId='function:brand-api:prod',
    ScheduledActionName='scale-up-morning',
    Schedule='cron(0 8 * * ? *)',
    ScalableDimension='lambda:function:ProvisionedConcurrency',
    ScalableTargetAction={'MinCapacity': 50, 'MaxCapacity': 50}
)
# Scale down at 8 PM UTC
aas_client.put_scheduled_action(
    ServiceNamespace='lambda',
    ResourceId='function:brand-api:prod',
    ScheduledActionName='scale-down-evening',
    Schedule='cron(0 20 * * ? *)',
    ScalableDimension='lambda:function:ProvisionedConcurrency',
    ScalableTargetAction={'MinCapacity': 5, 'MaxCapacity': 5}
)
Pattern 4: Monitor and Alert on Throttling
Never let throttling go unnoticed in production.
# CloudFormation / serverless.yml: throttle alarm
resources:
  Resources:
    LambdaThrottleAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: brand-api-throttles
        AlarmDescription: Lambda throttling detected
        MetricName: Throttles
        Namespace: AWS/Lambda
        Dimensions:
          - Name: FunctionName
            Value: brand-api
        Statistic: Sum
        Period: 60
        EvaluationPeriods: 1
        Threshold: 10  # alert if >10 throttles per minute
        ComparisonOperator: GreaterThanThreshold
        AlarmActions:
          - !Ref AlertSNSTopic
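The alarm above fires on an absolute count, which means different things at different traffic levels. A throttle rate is often easier to compare over time. The sketch below assumes, as the CloudWatch documentation describes, that the Invocations metric does not include throttled requests, so total attempts are the sum of both; `throttle_rate` is a hypothetical helper that takes the two metric sums for the same period.

```python
def throttle_rate(throttles: int, invocations: int) -> float:
    """Return the fraction of invocation attempts that were throttled.

    `throttles` and `invocations` are the Sum statistics of the
    Throttles and Invocations metrics over the same period.
    """
    attempts = throttles + invocations
    if attempts == 0:
        return 0.0  # no traffic in the period, so nothing was throttled
    return throttles / attempts


print(throttle_rate(10, 990))  # -> 0.01
print(throttle_rate(10, 10))   # -> 0.5
print(throttle_rate(0, 0))     # -> 0.0
```

Ten throttles per minute is noise at 100,000 invocations and a serious problem at 20, which the rate makes visible.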
Scaling Limits Reference
| Limit | Default | Adjustable |
|---|---|---|
| Account concurrency (per region) | 1,000 | ✅ Yes (via Support) |
| Burst concurrency (us-east-1) | 3,000 initial | ❌ No |
| Scale rate after burst | +500/minute | ❌ No |
| Reserved concurrency (per function) | Up to account limit minus 100 (kept unreserved) | ✅ Yes |
| Provisioned concurrency (per function) | Up to the function's reserved concurrency, if set | ✅ Yes |
| Max execution duration | 15 minutes | ❌ No |
Summary
Lambda's auto-scaling model is powerful but not magic. Here's what to internalize:
| Concept | Key Point |
|---|---|
| Concurrency = RPS × Duration | Slow functions consume more concurrency at the same traffic level |
| Burst limit | Lambda can't go from 0 to 10,000 instantly — plan for gradual scale-out |
| Scale-to-zero | Unlike Kubernetes HPA, Lambda scales all the way to zero |
| Reserved concurrency | Both a guarantee AND a cap — use it to protect downstream systems |
| Throttling | Sync = immediate 429; Async = retried for up to 6 hours |
| SQS as buffer | Absorbs traffic spikes, lets Lambda process at a controlled rate |
The teams that run Lambda smoothly in production aren't the ones who trust "infinite scale" — they're the ones who set reserved concurrency, configure DLQs, monitor throttle rates, and design their downstream systems to handle Lambda's scaling behavior.
Next in this series: **Part 4 — Traffic Routing in Serverless: Canary Deployments, Weighted Aliases & Blue/Green with Lambda**