"Serverless scales automatically" is one of those claims that is technically true and practically misleading. The platform will scale your code but the rate, the ceiling, and most importantly the failure modes of that scaling are determined by decisions you make at three specific layers of the system. Get any of them wrong and your perfectly elastic, pay-per-use architecture will either stall out during a traffic spike, silently corrupt data under load, or quietly DDoS itself into a tarpit.
This post is about the parts of serverless scaling that aren't on the marketing page. It's organised around the three boundaries where scale is actually decided: the event source that feeds your functions, the throughput quotas AWS enforces against you, and the deployment pipeline that ships changes without breaking production.
The three boundaries where scale is decided
Every serverless system has the same topology. Work enters at an edge (API Gateway, an event bus, a queue, a stream). It's handed to a compute layer (Lambda, Fargate) that does the work. The compute layer writes to downstream systems (DynamoDB, S3, another queue, a third-party API). A control plane, the Lambda service itself, plus IAM and your deployment tooling governs how much of this can happen concurrently.
Almost every scaling problem I've seen in production falls into one of three buckets:
Event source misconfiguration. The Lambda is fine, but the queue or stream feeding it is throttling throughput, triggering duplicate deliveries, or creating head-of-line blocking.
Quota collision. Your function can scale, but something upstream or downstream can't - API Gateway burst, downstream database connections, account-wide Lambda concurrency, a third-party rate limit.
Deployment fragility. The system scales correctly, but a bad deploy takes it down globally in thirty seconds because there's no canary and no automatic rollback.
The rest of this post works through each of those three layers.
Layer 1: SQS-backed Lambda
Lambda integrates with SQS through a pull model that most people never inspect. When you connect a queue to a function, AWS stands up a small fleet of pollers on your behalf (typically starting at five) which continuously ask the queue "any work?" and invoke your function when the answer is yes.
That polling fleet is the scaling unit. As the queue backlog grows, AWS adds pollers, which spawn more concurrent function invocations, which drain the queue faster. The ramp continues until one of three things happens: the queue empties, your function's reserved concurrency ceiling is hit, or the account-wide concurrency limit is reached.
Batch size and batch window
The single biggest throughput lever on an SQS-backed Lambda is how much work each invocation does. By default, a poller might hand your function a single message, which is wasteful because the invocation overhead dominates. Raising the batch size (up to 10,000 for standard queues, 10 for FIFO) lets a single invocation drain many messages at once.
The tradeoff is latency. If you tell Lambda to wait for 10 messages before invoking, and traffic is light, the first message in the batch sits idle until nine others arrive.
The MaximumBatchingWindowInSeconds setting (the batch window) puts a ceiling on that wait. It says "gather up to N messages, but if this many seconds pass, send whatever you have." Setting it to a few seconds typically captures most of the batching benefit while keeping tail latency bounded.
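To get a feel for how the window bounds that wait, here's a back-of-the-envelope sketch. The helper name and the constant-arrival-rate model are mine, not anything AWS exposes:

```python
def worst_case_wait_s(batch_size: int, batch_window_s: float,
                      arrival_rate_per_s: float) -> float:
    """Approximate how long the first message in a batch can sit before dispatch.

    The poller invokes when either the batch fills or the window expires,
    so the first message waits for whichever happens first.
    """
    if arrival_rate_per_s <= 0:
        return batch_window_s  # no further arrivals: the window is the only ceiling
    time_to_fill = (batch_size - 1) / arrival_rate_per_s
    return min(batch_window_s, time_to_fill)

# At 2 msg/s, filling a batch of 10 would take ~4.5 s; a 2 s window caps it.
print(worst_case_wait_s(batch_size=10, batch_window_s=2.0, arrival_rate_per_s=2.0))  # 2.0
# Under heavy traffic the batch fills almost instantly and the window never bites.
print(worst_case_wait_s(batch_size=10, batch_window_s=2.0, arrival_rate_per_s=100.0))  # 0.09
```

The takeaway: the window only matters when traffic is light, which is exactly when latency per message matters most.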
The visibility timeout rule
When a poller hands your function a batch, those messages become invisible to other pollers for a configurable period, the visibility timeout. If the function succeeds, it deletes the messages. If the function crashes or times out, the messages become visible again and get retried.
The failure mode to understand is subtle. Suppose your function has a 10 second timeout and the queue's visibility timeout is also 10 seconds. If a single invocation hits a slow downstream and runs for 9 seconds, then deletes the messages at 9.5 seconds, you're fine. But if anything causes the invocation to slip past 10 seconds, the message reappears in the queue while the original function is still running. A second function picks it up. Now two invocations are processing the same message: duplicated work, possible data corruption, and, if the downstream isn't idempotent, a real mess.
The rule of thumb is to set the visibility timeout to at least six times the function timeout. Overkill? Yes. But this is one of those parameters where being paranoid costs you nothing and the failure mode is insidious enough that you want margin.
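The rule is simple enough to encode as a guard in your infrastructure tests. A minimal sketch (the helper name is hypothetical):

```python
def visibility_timeout_is_safe(function_timeout_s: int,
                               visibility_timeout_s: int) -> bool:
    """Check the 6x rule of thumb: visibility timeout >= 6 * function timeout.

    Anything tighter risks a retried copy of a message running concurrently
    with the original invocation, as described above.
    """
    return visibility_timeout_s >= 6 * function_timeout_s

# A 10 s function timeout wants a visibility timeout of at least 60 s.
print(visibility_timeout_is_safe(function_timeout_s=10, visibility_timeout_s=60))  # True
print(visibility_timeout_is_safe(function_timeout_s=10, visibility_timeout_s=10))  # False
```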
Layer 1, continued: Kinesis-backed Lambda
Kinesis looks like SQS on the surface but behaves nothing like it. SQS is a buffer: messages are independent, order doesn't matter, and consumers scale horizontally. Kinesis is an ordered stream: records are partitioned into shards, order within a shard matters, and concurrency is fundamentally bounded by shard count.
The rule is one Lambda execution environment per shard. A stream with four shards gets four concurrent Lambda invocations regardless of backlog size. You could have a billion records waiting and still have only four workers draining them. This is the piece that bites teams migrating from SQS: "just add more Lambda" doesn't work.
There are two ways to scale past the shard ceiling. The infrastructure answer is to reshard: splitting four shards into eight doubles your concurrency. The software answer is the Parallelization Factor, which lets a single shard be processed by up to 10 concurrent Lambda invocations, as long as records with the same partition key are still delivered to the same invocation. Order is preserved within a partition key, not across the whole shard. For most analytics and event-processing workloads, that's a meaningful distinction that buys you a 10x concurrency boost without resharding.
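The ordering guarantee is easiest to see as key-based routing. Here's an illustrative sketch; AWS's internal routing isn't published, so the hashing scheme here is mine, but the invariant it demonstrates is the one that matters:

```python
import hashlib

def worker_for_key(partition_key: str, parallelization_factor: int) -> int:
    """Route records with the same partition key to the same worker slot.

    This mimics the guarantee the Parallelization Factor gives you: up to 10
    concurrent invocations per shard, with per-key ordering preserved because
    a given key always lands on the same slot.
    """
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % parallelization_factor

# Every record for "user-42" hits the same slot, so their relative order holds.
slots = {worker_for_key("user-42", 10) for _ in range(100)}
print(len(slots))  # 1
```

Records for different keys spread across the 10 slots, which is where the concurrency comes from; records for one key never race each other.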
Iterator age: the lag signal
In SQS you watch queue depth to know you're falling behind. In Kinesis you watch iterator age - the age of the most recent record your function has processed. A flat iterator age means you're keeping up. A climbing iterator age means records are entering the stream faster than you can drain them, and data is aging toward the retention cliff. If iterator age crosses retention (24 hours by default, up to 365 days with extended retention), records fall off the back of the stream and are gone.
Iterator age is the single most important metric to alarm on for any Kinesis-backed Lambda. Queue depth tells you about volume; iterator age tells you about time remaining before data loss.
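When you set that alarm, the number you actually care about is time remaining before the retention cliff, not the raw age. A hypothetical helper for reasoning about the threshold (not an AWS metric):

```python
def seconds_until_data_loss(iterator_age_s: float, retention_s: float,
                            age_growth_per_s: float) -> float:
    """Estimate time until the oldest unprocessed record expires off the stream.

    If iterator age is flat or shrinking (growth <= 0) you are keeping up and
    the answer is effectively infinite.
    """
    if age_growth_per_s <= 0:
        return float("inf")
    return (retention_s - iterator_age_s) / age_growth_per_s

# 20 h of lag against 24 h retention, gaining a second of lag per second of
# wall clock: four hours until records start expiring unprocessed.
print(seconds_until_data_loss(20 * 3600, 24 * 3600, 1.0) / 3600)  # 4.0
```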
Enhanced Fan-Out
The default Kinesis read bandwidth is 2 MB/s per shard, shared across all consumers. Attach a Lambda and a Firehose to the same stream and each effectively gets 1 MB/s. Add a third consumer and now everyone gets 666 KB/s. This is the noisy-neighbour problem applied to data streams.
Enhanced Fan-Out solves it by giving each registered consumer its own dedicated 2 MB/s pipe. For production pipelines with multiple downstream consumers, this is not optional: it's the difference between a stream that scales with consumers and one that gets slower with every addition.
DynamoDB Streams vs Kinesis Data Streams
When you need to capture changes from DynamoDB, you have two architecturally similar but operationally very different options. Both use shards as the parallelism unit, but the management model diverges sharply.
The choice is almost entirely about how much scaling responsibility you want to own. DynamoDB Streams is the right default for triggers, CDC to a single downstream, and most small-to-medium workloads - you pay nothing for operational simplicity. Kinesis Data Streams is the right choice when you have many consumers, need long replay windows (reprocessing the last week of events for a new feature is a common pattern), or need dedicated bandwidth per consumer for SLA reasons.
Poison pills and the negative scaling trap
There's a counterintuitive behaviour of Lambda's event source integrations that every serverless team eventually discovers the hard way. If your function starts returning errors at a high rate (crashing, timing out, throwing exceptions), the Lambda service doesn't scale up to retry faster. It scales down. It reduces polling rate, reduces concurrency, and backs off.
From Lambda's perspective this is sensible: a wave of errors probably means a downstream database is struggling, and pouring more traffic at it will turn a degradation into an outage. The service is protecting your infrastructure from your own code. But for an operations team watching the dashboard, this self-imposed slowdown shows up as a rapidly climbing iterator age or queue depth exactly when you can least afford it.
The way out is to stop throwing hard errors when individual records fail. Instead of letting one bad record crash the entire batch, use the ReportBatchItemFailures response pattern. This tells Lambda "the invocation succeeded overall, but here are the specific record IDs that failed; don't delete those, but keep the rest." The healthy records move forward, the failed ones go to a DLQ or on-failure destination, and Lambda sees a succeeding function and maintains full polling velocity.
Here's a clean implementation of the pattern for an SQS or DynamoDB Streams-backed function:
```python
import base64
import json
from typing import Any


def handler(event: dict, context: Any) -> dict:
    """Process a batch, reporting per-record failures to preserve scaling velocity."""
    batch_item_failures: list[dict[str, str]] = []

    for record in event.get("Records", []):
        record_id = _extract_record_id(record)
        try:
            payload = _extract_payload(record)
            process_item(payload)
        except Exception as exc:
            # Log with structured context so the failure is diagnosable
            print(json.dumps({
                "level": "error",
                "record_id": record_id,
                "error": str(exc),
                "error_type": type(exc).__name__,
            }))
            batch_item_failures.append({"itemIdentifier": record_id})

    # Returning this shape keeps the invocation status "Success" from Lambda's
    # perspective, while telling the poller exactly which records to retry.
    return {"batchItemFailures": batch_item_failures}


def _extract_record_id(record: dict) -> str:
    """SQS uses messageId; DynamoDB/Kinesis use sequenceNumber."""
    return (
        record.get("messageId")
        or record.get("dynamodb", {}).get("SequenceNumber")
        or record.get("kinesis", {}).get("sequenceNumber", "")
    )


def _extract_payload(record: dict) -> dict:
    if "body" in record:
        return json.loads(record["body"])
    if "dynamodb" in record:
        return record["dynamodb"].get("NewImage", {})
    if "kinesis" in record:
        # Kinesis record data arrives base64-encoded, not as a dict
        return json.loads(base64.b64decode(record["kinesis"]["data"]))
    return {}


def process_item(data: dict) -> None:
    if not data:
        raise ValueError("Empty payload")
    # Actual business logic here (idempotent, please)
```
You also need to enable ReportBatchItemFailures on the event source mapping itself (in SAM, CDK, or the console); the function-side response is inert without it.
Layer 2: the token bucket
Underneath almost every AWS throttling decision is the same algorithm: the token bucket. API Gateway uses it for per-route throttling. Lambda uses it for burst concurrency. DynamoDB uses it for provisioned-throughput tables. Every AWS SDK client uses one internally for retry management. Understanding it is the difference between tuning limits with intent and adjusting them until the alarms stop firing.
The mental model has three pieces. The bucket has a maximum capacity (the burst limit) and starts full. Each successful request consumes one token. If the bucket is empty when a request arrives, the request is throttled (HTTP 429). Tokens refill at a steady rate (the rate limit), expressed as requests per second. A bucket with a 1,000-request burst and a 100 RPS refill rate can absorb a 1,000-request spike instantly, but then needs 10 seconds of zero traffic to fully recover its burst capacity.
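The whole model fits in a few lines of code. This is an illustrative implementation of the algorithm described above, not AWS's actual code:

```python
class TokenBucket:
    """Minimal token bucket: a burst capacity that starts full and refills
    at a steady rate. Uses a caller-supplied logical clock in seconds."""

    def __init__(self, burst: float, rate_per_s: float) -> None:
        self.capacity = burst
        self.tokens = burst
        self.rate = rate_per_s
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # throttled: HTTP 429 territory

bucket = TokenBucket(burst=1000, rate_per_s=100)
# An instantaneous spike of 1,500 requests: the first 1,000 drain the bucket.
absorbed = sum(bucket.allow(0.0) for _ in range(1500))
print(absorbed)  # 1000
# After 10 seconds of quiet, the bucket has fully refilled.
print(bucket.allow(10.0))  # True
```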
Three things about this are operationally painful.
CloudWatch lies about it. Standard metrics aggregate over 1 or 5 minute windows. If 6,000 requests arrive at 100 RPS evenly across a minute, the graph looks identical to 6,000 requests arriving in the first five seconds. The first scenario is healthy; the second emptied your bucket, throttled hundreds of requests, and then sat idle. The only metric that tells you the truth is the throttle count itself: in API Gateway, 4XXError or ThrottledCount; in Lambda, Throttles. Alarm on throttles, not on request counts.
Enforcement is distributed. There isn't one bucket sitting on one server. API Gateway enforces its quotas across a fleet of nodes, and tokens don't refill in perfect synchrony across all of them. At the edges you'll see "jitter": a request throttled on node A that would have succeeded on node B a millisecond later. This is why single-burst load tests often pass and then production fails: you tested an idealised bucket, not the real distributed one.
Mismatched buckets upstream and downstream create phantom capacity. If API Gateway has a 5,000 RPS burst but the Lambda it fronts has a reserved concurrency of 500, the API Gateway quota is fiction. The real ceiling is 500. Every quota in your chain has to be reconciled against the weakest link, or you'll think you have headroom you don't actually have.
Practical tuning
Four habits make token-bucket behaviour predictable in practice:
First, implement exponential backoff with jitter on every client. A fixed backoff from 100 simultaneous throttled clients causes all 100 to retry at exactly the same millisecond, re-emptying the bucket instantly. Randomised backoff spreads the retries out so the bucket has time to refill between waves.
Second, calculate time-to-refill explicitly: refill_seconds = burst_limit / rate_limit. If your burst is 1,000 and your rate is 100, you need 10 seconds of quiet to recover full burst capacity. If your traffic is continuous, you may never recover it, which means your effective capacity is the rate limit, not the burst.
Third, load-test for sustained burst, not just peak. A burst of 500 with a rate of 100 RPS can absorb 200 RPS for about five seconds before the bucket drains; after that you'll see ~50% throttling. If your expected peak is sustained, you need to size the rate limit, not the burst.
Fourth, use Lambda Provisioned Concurrency as a "floor of warm tokens" for latency-sensitive paths but understand the cost. Provisioned concurrency is subtracted from your account's unreserved pool. Provisioning 500 units for one function permanently removes those 500 units from every other function in the account, even when your provisioned function is idle. Over-provisioning quietly starves the rest of your workloads.
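The first habit is worth spelling out in code. This is the "full jitter" scheme: each retry sleeps a random amount between zero and an exponentially growing ceiling, so a wave of throttled clients spreads out instead of retrying in lockstep. The base and cap values here are placeholders to tune against your own rate limits:

```python
import random

def backoff_with_full_jitter(attempt: int, base_s: float = 0.1,
                             cap_s: float = 20.0) -> float:
    """Return a randomised sleep (seconds) before retry number `attempt`.

    The ceiling doubles each attempt up to cap_s; the actual sleep is drawn
    uniformly below it, so 100 throttled clients never re-empty the bucket
    in the same millisecond.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0, ceiling)

# Successive attempts back off further on average, but never synchronise.
for attempt in range(5):
    print(round(backoff_with_full_jitter(attempt), 3))
```

Compare this with a fixed backoff: deterministic retries recreate the original spike on every wave, which is precisely the pattern token buckets punish hardest.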
The pre-production scaling review
Before putting any ingestion-heavy serverless pipeline in front of real traffic, there are four questions worth writing down the answers to. I've seen each of them caught in review and missed in launch, with predictable outcomes.
What are the hard limits in every hop of this chain?
Not the defaults, the actual limits on this account, in this region, this month. Lambda concurrency, API Gateway RPS, DynamoDB provisioned throughput, SQS message size, Kinesis shard count. Put them in a table. The ceiling of the whole system is the lowest number on the page.
Is the timeout hierarchy consistent?
The function timeout must be shorter than the visibility timeout, which must be shorter than the retry window, which must be shorter than any upstream timeout. Any inversion creates ghost retries: invocations that succeed but get replayed because the upstream decided they'd failed.
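This check mechanises well. A sketch of a validator (the checker and the example timeouts are hypothetical) that walks the chain from innermost to outermost and flags any hop that isn't strictly longer than the one inside it:

```python
def find_timeout_inversions(chain: list[tuple[str, float]]) -> list[str]:
    """Given (name, timeout_s) pairs ordered from innermost (the function)
    to outermost (the upstream caller), report every hop whose timeout is
    not strictly longer than the one it wraps. Each hit is a ghost-retry risk."""
    inversions = []
    for (inner_name, inner_t), (outer_name, outer_t) in zip(chain, chain[1:]):
        if outer_t <= inner_t:
            inversions.append(f"{outer_name} ({outer_t}s) <= {inner_name} ({inner_t}s)")
    return inversions

chain = [("function timeout", 30), ("visibility timeout", 180),
         ("retry window", 120), ("upstream timeout", 300)]
print(find_timeout_inversions(chain))  # ['retry window (120s) <= visibility timeout (180s)']
```

An empty list means the hierarchy is consistent; anything else names the hop to fix.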
What's the error strategy, and is it written down?
Is this system at-least-once or exactly-once? When a record fails, does it halt the pipeline (preserving order, stopping throughput) or go to a dead-letter queue (preserving throughput, losing order)? There's no universally right answer, but there is a right answer for your business and it should be decided before traffic arrives, not during an incident.
Are there native integrations you're replacing with custom glue?
If you're moving data from Kinesis to S3 with a Lambda function, you're probably reimplementing Amazon Data Firehose, badly. If you're parsing DynamoDB Stream records and writing them to OpenSearch with a Lambda, the Zero-ETL integration likely exists. Custom glue is the highest-maintenance part of any pipeline; push it into a managed service wherever possible.
Layer 3: Shipping changes without breaking production
A perfectly scaling system is one bad deploy away from an outage. The final part of the playbook is the deployment pipeline specifically, how SAM and CodeDeploy work together to make Lambda deploys boring.
The core primitives are Lambda versions (immutable snapshots of function code) and aliases (mutable pointers to versions, like live or canary). A SAM template with AutoPublishAlias: live tells the deploy pipeline: every time my code changes, publish a new immutable version and shift the live alias to point to it gradually, with monitoring, with a kill switch.
The mechanism behind that gradual shift is DeploymentPreference. Three strategies are available:
AllAtOnce: the default Lambda behaviour. Instant cutover. Fast and risky; appropriate only for non-production or for tooling functions where a failed invocation is inconvenient but not expensive.
Linear: shift traffic in fixed increments (e.g., Linear10PercentEvery10Minutes). Simple, predictable, and gives alarms time to notice problems before they're global.
Canary: shift a small slice (say 10%) immediately, hold for a configurable bake time, then shift the rest. Lower latency to full rollout than linear, still lets you catch regressions on the canary slice.
The deployment runs four phases:
Publish. SAM publishes the new version as an immutable snapshot.
Pre-traffic validation. CodeDeploy invokes a PreTraffic Lambda hook you provide: a synthetic transaction or smoke test run against the new version before any real traffic sees it. If the hook fails, the deploy halts immediately and live stays on the old version.
Weighted traffic shift. CodeDeploy updates the alias to use weighted routing; sending a configurable percentage to the new version, the rest to the old. During the shift window, it watches the CloudWatch alarms you've listed (typically error rate, p99 latency, downstream throttling). If any alarm fires, traffic snaps back to 100% old version automatically.
Post-traffic validation. Once the shift completes, CodeDeploy runs a PostTraffic hook for final verification, then marks the deploy done.
This is what "safe deploys" actually means in practice: not a manual runbook, but a machine that's watching metrics and can undo itself faster than you can type.
SAM or CDK?
Both deploy via CloudFormation under the hood. SAM is declarative (YAML) with shorthand resources like AWS::Serverless::Function that expand into a dozen primitive resources; it's the right choice when your infrastructure is mostly serverless and mostly stable. CDK is imperative (TypeScript, Python) and gives you loops, conditionals, abstractions, and IDE autocomplete; it's the right choice when your infrastructure has real logic, many environments, or needs reusable constructs across teams. For a single-team serverless app, SAM will get you there faster. For a platform that many teams build on, CDK's abstraction power pays off.
Four habits that keep the pipeline boring
Test at every stage, not at the end. Linters and unit tests in the build stage; integration tests against a deployed staging environment; synthetic transactions in pre-traffic hooks; post-deploy smoke tests. Each stage catches a different class of bug and none of them are substitutes for each other.
One AWS account per environment. Dev, staging, and production should be separate accounts, not separate regions or separate resource prefixes in one account. The boundary is for blast radius (a compromised dev IAM role can't reach production), cost attribution (one bill per environment), and accident prevention (you can't accidentally terraform destroy prod if prod is in an account you're not authenticated against).
One template, parameterised per environment. If your staging and production templates diverge, you stop testing production in staging. Use CloudFormation parameters for environment-specific values (table names, instance sizes, domain names) and keep the resource shape identical across environments.
Secrets in Secrets Manager or Parameter Store, referenced dynamically. Never bake credentials into environment variables at deploy time; you'll end up redeploying the app to rotate a secret. Reference secrets by ARN in the template, grant the function IAM permission to read them, and fetch them at runtime (with caching). Rotation becomes a secrets-manager operation, not a code deploy.
