James Lee

Posted on May 26

Auto Scaling in AWS Lambda: Concurrency, Throttling & Scale-to-Zero

#architecture #aws #performance #serverless

One of the most compelling promises of serverless is "infinite scale" — your function handles 1 request or 100,000 requests, and you never touch a config file.

But that promise comes with nuance. Lambda's scaling model has hard limits, specific behaviors under load, and failure modes that will surprise you in production if you haven't studied them.

In this article, we'll break down exactly how Lambda scales: the concurrency model, how scale-out works internally, what happens when you hit limits, and how to design for scale in production.

The Fundamental Unit: Concurrency

Lambda doesn't scale by CPU or memory — it scales by concurrency.

Concurrency = the number of function instances handling requests simultaneously.

Time →
Request A: [===========] (200ms)
Request B:    [===========] (200ms)
Request C:       [===========] (200ms)

Concurrency at peak = 3 simultaneous executions

Each concurrent execution runs in its own isolated execution environment (a Firecracker microVM, as we covered in Part 1). There is no shared state between concurrent executions.

The scaling formula is simple:

$$\text{Concurrency} = \text{Requests per second} \times \text{Average duration (seconds)}$$

Example: 500 requests/second, each taking 200ms (0.2s):

$$\text{Concurrency} = 500 \times 0.2 = 100 \text{ concurrent executions}$$

This means you need 100 execution environments running simultaneously. If your function takes longer (say, 2 seconds per request), the same traffic requires 1,000 concurrent environments.

Key insight: Slow functions are expensive at scale — not just in duration cost, but in the concurrency they consume. Optimizing function duration directly reduces the concurrency you need.
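
The formula is easy to turn into a quick sizing helper. The sketch below is a minimal illustration, not an AWS API: the function name `required_concurrency` is made up for this example, and it takes the duration in milliseconds so the arithmetic stays exact for typical inputs.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_ms: float) -> int:
    """Estimate steady-state concurrency: requests/second x average duration (seconds)."""
    if requests_per_second < 0 or avg_duration_ms < 0:
        raise ValueError("rate and duration must be non-negative")
    # Round up: a fractional result still occupies a whole execution environment.
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


if __name__ == "__main__":
    print(required_concurrency(500, 200))   # the 200ms example above
    print(required_concurrency(500, 2000))  # same traffic, 2-second function
```

Running it with the numbers from this section reproduces the 100 and 1,000 figures, which makes it a handy sanity check before you request a quota increase.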

Three Types of Concurrency

AWS Lambda has three concurrency concepts you need to understand:

1. Account-Level Concurrency Limit

Every AWS account has a regional concurrency limit, 1,000 concurrent executions per region by default. It is shared across ALL Lambda functions in that region and acts as a ceiling until you have it raised.

Account concurrency pool: 1,000
├── function-A: using 400
├── function-B: using 350
├── function-C: using 200
└── Available: 50

If all 1,000 slots are consumed, any new invocation is throttled — Lambda returns a 429 TooManyRequestsException.

You can request a limit increase through Service Quotas or AWS Support (up to tens of thousands for production workloads).

2. Reserved Concurrency

You can reserve a fixed number of concurrency slots for a specific function. This does two things:

Guarantees the function always has those slots available (other functions can't consume them)
Caps the function at that maximum (it can never exceed the reserved amount)

# Set reserved concurrency via boto3
import boto3

lambda_client = boto3.client('lambda')

lambda_client.put_function_concurrency(
    FunctionName='brand-logo-processor',
    ReservedConcurrentExecutions=200  # max 200 concurrent executions
)

Account pool: 1,000
├── brand-logo-processor: RESERVED 200 (guaranteed + capped)
├── brand-api:            RESERVED 300 (guaranteed + capped)
└── Unreserved pool:      500 (shared by all other functions)

When to use reserved concurrency:

Protect a downstream database from being overwhelmed (cap Lambda concurrency to match DB connection pool size)
Guarantee capacity for a critical function during high traffic
Prevent a runaway function from consuming the entire account pool
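
The pool arithmetic from the diagram above can be checked with a few lines of Python. This is a sketch under stated assumptions: `unreserved_pool` is a hypothetical helper, and the 100-execution floor reflects Lambda's rule that at least 100 concurrent executions must stay unreserved in each account and region.

```python
MIN_UNRESERVED = 100  # Lambda keeps at least this much concurrency unreserved


def unreserved_pool(account_limit: int, reservations: dict[str, int]) -> int:
    """Return the concurrency left for functions without a reservation."""
    remaining = account_limit - sum(reservations.values())
    if remaining < MIN_UNRESERVED:
        raise ValueError(
            f"reservations leave {remaining} unreserved; at least {MIN_UNRESERVED} required"
        )
    return remaining


if __name__ == "__main__":
    # Function names taken from the diagram above
    print(unreserved_pool(1000, {"brand-logo-processor": 200, "brand-api": 300}))
```

A check like this is worth running in CI against your infrastructure config, because every reservation you add quietly shrinks the pool that all other functions share.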

3. Provisioned Concurrency

As covered in Part 1, Provisioned Concurrency pre-initializes execution environments so they can serve requests without a cold start. When a function also has reserved concurrency, the provisioned amount counts against that reservation and cannot exceed it.

Reserved: 200
├── Provisioned: 50  (always warm, no cold start)
└── On-demand: 150   (scale up as needed, may cold start)
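
For reference, provisioned concurrency is set per published version or alias rather than on `$LATEST`. The sketch below builds the arguments for boto3's `put_provisioned_concurrency_config` call; the helper names are hypothetical, and `brand-api` with a `prod` alias is a placeholder that must already exist in your account.

```python
def provisioned_concurrency_request(function_name: str, qualifier: str, executions: int) -> dict:
    """Build keyword arguments for Lambda's put_provisioned_concurrency_config call."""
    if qualifier == "$LATEST":
        raise ValueError("provisioned concurrency needs a published version or alias")
    if executions < 1:
        raise ValueError("executions must be at least 1")
    return {
        "FunctionName": function_name,
        "Qualifier": qualifier,
        "ProvisionedConcurrentExecutions": executions,
    }


def apply_provisioned_concurrency(request: dict) -> dict:
    """Send the request to AWS (needs credentials and an existing alias)."""
    import boto3  # imported lazily so the builder above runs without the AWS SDK

    return boto3.client("lambda").put_provisioned_concurrency_config(**request)


if __name__ == "__main__":
    # Placeholder function name and alias
    print(provisioned_concurrency_request("brand-api", "prod", 50))
```

Keeping the request builder separate from the AWS call makes the validation easy to unit test without credentials.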

How Lambda Scales Out: The Internal Mechanics

When traffic increases, Lambda needs to spin up new execution environments. This is where the internal scaling mechanics matter.

The Burst Scaling Limit

Lambda doesn't scale from 0 to 10,000 instantly. There's a burst concurrency limit, the number of execution environments Lambda can add immediately, followed by a fixed scale rate per minute:

Region	Initial Burst Limit	Scale Rate After Burst
us-east-1, us-west-2, eu-west-1	3,000	+500/minute
All other regions	500–1,000	+500/minute

Note: AWS revised this model in late 2023, and each function can now add up to 1,000 concurrent executions every 10 seconds. The figures in this section describe the earlier burst model, so check the current Lambda scaling quotas before sizing for a spike.

What this means in practice:

Minute 0: Traffic spike hits. Lambda scales immediately up to the initial burst limit.
Minute 1: +500 new environments provisioned
Minute 2: +500 more
Minute 3: +500 more
...until account limit is reached or traffic stabilizes

If your traffic spikes from 0 to 5,000 concurrent requests instantly (e.g., a viral event), Lambda cannot serve all of them immediately. The first ~3,000 are handled (in us-east-1), and the rest are throttled until the next minute's burst capacity is available.

Mitigation: Use Provisioned Concurrency for latency-sensitive functions that may experience sudden spikes.
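
To get a feel for how long a spike stays throttled under the burst model described in this section, here is a small simulation. It is only a sketch: `minutes_until_served` is a made-up helper, the defaults mirror the us-east-1 row of the table, and the 10,000 account limit is an assumption standing in for a raised quota.

```python
import math


def minutes_until_served(
    target: int,
    initial_burst: int = 3000,
    rate_per_minute: int = 500,
    account_limit: int = 10_000,
) -> int:
    """Minutes before `target` concurrency is reachable when starting from zero."""
    if target > account_limit:
        raise ValueError("target exceeds the account concurrency limit")
    if target <= initial_burst:
        return 0  # the initial burst covers the spike
    # Beyond the burst, capacity grows by a fixed amount each minute.
    return math.ceil((target - initial_burst) / rate_per_minute)


if __name__ == "__main__":
    print(minutes_until_served(5000))  # the 0-to-5,000 spike from the text
```

For the 5,000-request spike above, the remaining 2,000 environments arrive over four minutes, and that window is what Provisioned Concurrency is meant to cover.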

Scale-to-Zero

Unlike traditional servers or Kubernetes deployments, Lambda scales all the way down to zero when there's no traffic. No idle execution environments, no cost.

Traffic pattern:
08:00 ████████████████  (high traffic — many environments active)
12:00 ████              (moderate traffic)
03:00                   (no traffic — zero environments, zero cost)

This is fundamentally different from Kubernetes HPA (Horizontal Pod Autoscaler), which cannot scale to zero by default. HPA requires at least one replica unless an alpha feature gate is enabled, and its usual CPU/memory metrics come from running pods: if there are no pods running, there are no metrics to monitor, so HPA can't trigger scale-up from zero.

Lambda solves this with a different model: the trigger mechanism itself (API Gateway, SQS, EventBridge) acts as the "activator" — it wakes Lambda up when traffic arrives, even from zero.

The Metrics → Decision → Scale Loop

Lambda's auto-scaling follows a three-phase loop that mirrors how all serious auto-scalers work:

Phase 1: Metrics Collection

Lambda continuously monitors:

Concurrent executions: how many environments are actively processing
Throttle rate: percentage of invocations being throttled
Queue depth (for SQS/Kinesis triggers): how many unprocessed messages

These metrics are available in CloudWatch and can be used to build custom scaling alarms.

# Monitor Lambda concurrency with CloudWatch
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

# Get concurrent executions for the last 5 minutes
end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(minutes=5)

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Lambda',
    MetricName='ConcurrentExecutions',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'brand-api'}],
    StartTime=start_time,
    EndTime=end_time,
    Period=60,
    Statistics=['Maximum']
)

# Datapoints are not returned in order, so sort them by timestamp
for datapoint in sorted(response['Datapoints'], key=lambda d: d['Timestamp']):
    print(f"Max concurrency at {datapoint['Timestamp']}: {datapoint['Maximum']}")

Phase 2: Scaling Decision

Lambda's internal scheduler makes scaling decisions based on incoming request rate and available environments. AWS does not publish the exact logic, so treat the two modes below as a mental model (similar to Knative's Stable/Panic modes):

Normal mode (gradual traffic increase):

Scale out proportionally to match incoming request rate
Target: keep concurrency utilization below ~70% of reserved limit

Burst mode (sudden traffic spike):

Scale as fast as the burst limit allows (+500 environments/minute after initial burst)
Prioritize getting new environments online over perfect efficiency

Phase 3: Execution

Once the scaling decision is made, Lambda provisions new execution environments (triggering cold starts for the new instances) and routes traffic to them. Existing warm environments continue handling requests uninterrupted.

Throttling: What Happens When You Hit the Limit

When Lambda can't scale further (account limit reached, reserved concurrency exhausted, or burst limit hit), it throttles: new invocations are rejected with a 429 TooManyRequestsException.

Throttling behavior differs by invocation type:

Invocation Type	Throttle Behavior
Synchronous (API Gateway)	Returns `429` immediately to caller
Asynchronous (S3, EventBridge)	Retries for up to 6 hours with exponential backoff
SQS	Message stays in queue, retried based on visibility timeout
Kinesis/DynamoDB Streams	Shard processing pauses, retried until success or expiry

# Handle throttling in synchronous callers
import json
import random
import time

import boto3
from botocore.exceptions import ClientError

lambda_client = boto3.client('lambda')

def invoke_with_retry(function_name: str, payload: dict, max_retries: int = 3):
    """Invoke Lambda with exponential backoff on throttling"""
    for attempt in range(max_retries):
        try:
            response = lambda_client.invoke(
                FunctionName=function_name,
                InvocationType='RequestResponse',
                Payload=json.dumps(payload)
            )
            return response
        except ClientError as e:
            if e.response['Error']['Code'] == 'TooManyRequestsException':
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f'Throttled. Retrying in {wait_time:.1f}s (attempt {attempt + 1}/{max_retries})')
                time.sleep(wait_time)
            else:
                raise
    raise Exception(f'Max retries exceeded for {function_name}')
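
For asynchronous invocations, the retry window from the table above is configurable per function. The sketch below builds arguments for boto3's `put_function_event_invoke_config`; the helper names and the queue ARN are hypothetical, and the bounds in the checks are the ranges AWS documents for these settings (event age of 60 to 21,600 seconds, 0 to 2 retry attempts).

```python
from typing import Optional


def async_invoke_config(
    function_name: str,
    max_event_age_seconds: int = 21600,  # 6 hours, the service maximum
    max_retry_attempts: int = 2,
    on_failure_arn: Optional[str] = None,
) -> dict:
    """Build keyword arguments for put_function_event_invoke_config."""
    if not 60 <= max_event_age_seconds <= 21600:
        raise ValueError("max_event_age_seconds must be between 60 and 21600")
    if not 0 <= max_retry_attempts <= 2:
        raise ValueError("max_retry_attempts must be between 0 and 2")
    kwargs = {
        "FunctionName": function_name,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
        "MaximumRetryAttempts": max_retry_attempts,
    }
    if on_failure_arn:
        # Events that expire or exhaust their retries are sent here
        kwargs["DestinationConfig"] = {"OnFailure": {"Destination": on_failure_arn}}
    return kwargs


def apply_async_invoke_config(kwargs: dict) -> dict:
    """Send the configuration to AWS (needs credentials)."""
    import boto3  # imported lazily so the builder above runs without the AWS SDK

    return boto3.client("lambda").put_function_event_invoke_config(**kwargs)


if __name__ == "__main__":
    # Placeholder function name and a made-up SQS queue ARN
    print(async_invoke_config(
        "brand-api",
        max_event_age_seconds=3600,
        on_failure_arn="arn:aws:sqs:us-east-1:123456789012:brand-api-failures",
    ))
```

Shortening the event age is useful when a late result is worse than no result, and the failure destination keeps throttled events from disappearing silently.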

Production Scaling Patterns

Pattern 1: Protect Downstream Services with Reserved Concurrency

The most common production issue: Lambda scales to 500 concurrent executions, each opening a database connection, overwhelming your RDS instance (which supports ~100 connections).

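# Bad: Lambda scales freely and overwhelms your database
import os
import psycopg2

DATABASE_URL = os.environ['DATABASE_URL']

def handler(event, context):
    conn = psycopg2.connect(DATABASE_URL)  # new connection every invocation!
    # ...

# Good: connect through RDS Proxy, reuse the connection, and cap reserved concurrency
import os
import psycopg2

# RDS Proxy handles connection pooling; Lambda connects to the proxy endpoint
DB_PROXY_ENDPOINT = os.environ['DB_PROXY_ENDPOINT']
_conn = None  # cached for the lifetime of the execution environment

def get_db_connection(host):
    """Open a connection on first use and reuse it on warm invocations."""
    global _conn
    if _conn is None or _conn.closed:
        # Database name and credentials are read from the PG* environment variables
        _conn = psycopg2.connect(host=host)
    return _conn

def handler(event, context):
    # RDS Proxy multiplexes Lambda's connections to RDS
    conn = get_db_connection(DB_PROXY_ENDPOINT)
    # ...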

# serverless.yml — cap Lambda concurrency to match RDS Proxy pool
functions:
  brandDataProcessor:
    handler: handler.handler
    reservedConcurrency: 100  # RDS Proxy supports 100 connections
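
Choosing the cap itself is simple arithmetic. The helper below is a hypothetical sketch, assuming each execution environment holds a fixed number of connections and that you want to leave some headroom for migrations, admin sessions, and other clients.

```python
def reserved_concurrency_cap(
    db_max_connections: int,
    connections_per_environment: int = 1,
    headroom: int = 10,
) -> int:
    """Largest reserved concurrency that keeps Lambda within the connection budget."""
    usable = db_max_connections - headroom
    if usable < connections_per_environment:
        raise ValueError("not enough connections left after headroom")
    # Each concurrent environment holds its own connection(s)
    return usable // connections_per_environment


if __name__ == "__main__":
    print(reserved_concurrency_cap(100))              # keep 10 connections spare
    print(reserved_concurrency_cap(100, headroom=0))  # use the full budget
```

With no headroom this gives the 100 used in the config above; in practice a small reserve is usually worth the lost concurrency.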

Pattern 2: Use SQS as a Concurrency Buffer

For high-volume async workloads, put SQS between your event source and Lambda. SQS absorbs traffic spikes; Lambda processes at a controlled rate.

High-volume events → SQS Queue → Lambda (controlled concurrency)
                     [buffer]      [max 50 concurrent]

# serverless.yml
functions:
  processBrandAsset:
    handler: handler.handler
    reservedConcurrency: 50       # never more than 50 concurrent
    events:
      - sqs:
          arn: !GetAtt BrandAssetQueue.Arn
          batchSize: 5
          maximumBatchingWindow: 10
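
On the function side, a batch of 5 messages should not be retried in full because one of them failed. The handler below is a minimal sketch of SQS partial batch responses: `process_message` and the `asset_id` field are placeholders for your own logic, and the pattern only takes effect if the event source mapping reports batch item failures (in the Serverless Framework, `functionResponseType: ReportBatchItemFailures` on the `sqs` event).

```python
import json


def process_message(body: dict) -> None:
    """Placeholder business logic; raises when a message cannot be processed."""
    if "asset_id" not in body:
        raise ValueError("missing asset_id")


def handler(event, context):
    """Process each SQS record and report only the failed ones back to SQS."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_message(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue after the visibility timeout
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"messageId": "1", "body": json.dumps({"asset_id": "logo-42"})},
            {"messageId": "2", "body": "not json"},
        ]
    }
    print(handler(sample_event, None))
```

Successfully processed messages are deleted from the queue, so a single poison message no longer forces the whole batch through another round of concurrency.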

Pattern 3: Scheduled Scaling with Provisioned Concurrency

For predictable traffic patterns (business hours spike), pre-scale with Provisioned Concurrency on a schedule.

# scale_provisioned.py — run this as a scheduled Lambda or local script
import boto3

lambda_client = boto3.client('lambda')
aas_client = boto3.client('application-autoscaling')

# Register the function alias as a scalable target
aas_client.register_scalable_target(
    ServiceNamespace='lambda',
    ResourceId='function:brand-api:prod',
    ScalableDimension='lambda:function:ProvisionedConcurrency',
    MinCapacity=2,
    MaxCapacity=100
)

# Scale up at 8 AM UTC (business hours start)
aas_client.put_scheduled_action(
    ServiceNamespace='lambda',
    ResourceId='function:brand-api:prod',
    ScheduledActionName='scale-up-morning',
    Schedule='cron(0 8 * * ? *)',
    ScalableDimension='lambda:function:ProvisionedConcurrency',
    ScalableTargetAction={'MinCapacity': 50, 'MaxCapacity': 50}
)

# Scale down at 8 PM UTC
aas_client.put_scheduled_action(
    ServiceNamespace='lambda',
    ResourceId='function:brand-api:prod',
    ScheduledActionName='scale-down-evening',
    Schedule='cron(0 20 * * ? *)',
    ScalableDimension='lambda:function:ProvisionedConcurrency',
    ScalableTargetAction={'MinCapacity': 5, 'MaxCapacity': 5}
)

Pattern 4: Monitor and Alert on Throttling

Never let throttling go unnoticed in production.

# CloudFormation / serverless.yml — throttle alarm
resources:
  Resources:
    LambdaThrottleAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: brand-api-throttles
        AlarmDescription: Lambda throttling detected
        MetricName: Throttles
        Namespace: AWS/Lambda
        Dimensions:
          - Name: FunctionName
            Value: brand-api
        Statistic: Sum
        Period: 60
        EvaluationPeriods: 1
        Threshold: 10          # alert if >10 throttles per minute
        ComparisonOperator: GreaterThanThreshold
        AlarmActions:
          - !Ref AlertSNSTopic
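
An absolute threshold like 10 throttles per minute means different things at different traffic levels, so a rate is often easier to reason about. The helper below is a sketch with a made-up name; it relies on the fact that CloudWatch's `Invocations` metric does not count throttled requests, so the denominator has to add the two metrics together.

```python
def throttle_rate(invocations: int, throttles: int) -> float:
    """Fraction of attempted invocations that were throttled in a period."""
    if invocations < 0 or throttles < 0:
        raise ValueError("metric sums must be non-negative")
    attempted = invocations + throttles  # Invocations excludes throttled requests
    return 0.0 if attempted == 0 else throttles / attempted


if __name__ == "__main__":
    # Example period: 90 invocations ran, 10 were throttled
    print(f"{throttle_rate(90, 10):.0%}")
```

The same ratio can be expressed as a CloudWatch metric math alarm if you prefer to alert on percentage rather than count.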

Scaling Limits Reference

Limit	Default	Adjustable
Account concurrency (per region)	1,000	✅ Yes (via Support)
Burst concurrency (us-east-1)	3,000 initial	❌ No
Scale rate after burst	+500/minute	❌ No
Reserved concurrency (per function)	Up to account limit minus 100 (kept unreserved)	✅ Yes
Provisioned concurrency (per function)	Up to reserved limit	✅ Yes
Max execution duration	15 minutes	❌ No

Summary

Lambda's auto-scaling model is powerful but not magic. Here's what to internalize:

Concept	Key Point
Concurrency = RPS × Duration	Slow functions consume more concurrency at the same traffic level
Burst limit	Lambda can't go from 0 to 10,000 instantly — plan for gradual scale-out
Scale-to-zero	Unlike Kubernetes HPA, Lambda scales all the way to zero
Reserved concurrency	Both a guarantee AND a cap — use it to protect downstream systems
Throttling	Sync = immediate 429; Async = retried for up to 6 hours
SQS as buffer	Absorbs traffic spikes, lets Lambda process at a controlled rate

The teams that run Lambda smoothly in production aren't the ones who trust "infinite scale" — they're the ones who set reserved concurrency, configure DLQs, monitor throttle rates, and design their downstream systems to handle Lambda's scaling behavior.

Next in this series: **Part 4 — Traffic Routing in Serverless: Canary Deployments, Weighted Aliases & Blue/Green with Lambda**
