Darryl Ruggles for AWS Community Builders

Posted on May 16 • Originally published at darryl-ruggles.cloud

Live Canary Deployments with AWS SAM, the New WebSocket API Resource, and Lambda Durable Functions

#aws #durablefunctions #lambda #websocket

It's great to see new serverless offerings all the time. Two recent serverless features pair unusually well: Lambda Durable Functions and SAM's new first-class WebSocket API resource. On May 5, 2026, SAM gained AWS::Serverless::WebSocketApi, a resource type that turns the verbose dance of ApiGatewayV2::Api + Stage + Route + Integration + Lambda::Permission into a few lines of YAML.

Lambda Durable Functions also reached an important maturity point. They went GA in December 2025, have been supported in SAM CLI since v1.150.0, and now have an AWS best-practices post. They let a single Lambda handler run a multi-step workflow that can pause for hours (or days) waiting on a human callback without burning compute and costing you money. I covered the durable mechanics in detail in AWS Lambda Durable Functions: Build a Loan Approval Workflow; this post focuses on what changes when you pair them with a real-time UI.

A quick disambiguation up front: "canary deployments with SAM" has traditionally meant SAM's DeploymentPreference feature - CodeDeploy traffic-shifting on a Lambda alias, with built-in Canary10Percent5Minutes and Linear10PercentEvery1Minute types. This post isn't about that. It's about a custom canary pipeline built on Durable Functions, where the operator watches live baseline-vs-canary metrics and intervenes in real time. The two patterns are complementary - you could (and arguably should) use DeploymentPreference on the orchestrator itself once you ship this to production.

These features are made for each other. WebSocket APIs without a long-running backend tend to collapse into chat-app demos. Durable Functions without a real-time UI mean operators stare at CloudWatch waiting for a workflow to finish. Put them together and you get a workflow the user can both observe and steer in real time, with checkpoint-and-replay handling failures behind the scenes.

I built the most universally relatable version of that pairing I could think of: a canary deployment pipeline. Submit a build artifact, watch it move through smoke tests, staging deploy, integration tests, and a canary rollout, then watch live baseline-vs-canary metrics during a configurable observation window. Promote to 100%, roll back, or extend the window - and the bidirectional WebSocket means the "roll back NOW" button is always one click away while the canary is still serving traffic. Every developer reading this has lived this exact moment. The orchestrator runs Python 3.14 on arm64 with Powertools for AWS Lambda, IAM is scoped with SAM policy templates where they fit and targeted inline policies where the templates are too broad, and the whole stack deploys with a single sam deploy.

The full source - SAM template, Python handlers, React+Vite frontend, Makefile, and architecture diagram - is on GitHub: live-canary-deploys-with-sam-the-new-websocket-api-and-durable-functions.

Architecture

The flow is:

Browser submits the build artifact to POST /deploy on the HTTP API. The Start Deploy Lambda persists deployment metadata to DynamoDB and asynchronously kicks off the durable orchestrator, returning the deploymentId synchronously.
Browser opens a WebSocket to wss://...execute-api..., fires the $connect route, and sends a subscribe frame naming the deploymentId.
The orchestrator runs durable steps: smoke -> staging deploy -> integration tests -> canary deploy. Each step is checkpointed; failures replay from the last checkpoint, not from scratch. Each stage streams its log lines over WebSocket so the pipeline tracker in the UI animates as the workflow progresses.
Observation window: the orchestrator pauses on a durable wait_for_callback and asynchronously invokes a separate MetricsEmitter Lambda. The emitter streams synthetic baseline-vs-canary metrics (error rate, p50, p99, RPS) every 2 seconds via the Progress Publisher. The browser renders a side-by-side metrics dashboard with sparklines, threshold deltas, and a countdown timer.
The operator decides - Promote / Roll back / Extend - by clicking a button that sends an intervene WebSocket frame. The Intervene Lambda completes the durable callback via lambda:SendDurableExecutionCallbackSuccess and the orchestrator resumes from exactly where it stopped. If the operator does nothing, the metrics emitter completes the callback at end-of-window with an auto decision based on the configured thresholds.
Promote or rollback: the orchestrator runs the corresponding step (each emitting its own log lines to the UI), writes the final state to DynamoDB, and emits a completed event so the UI shows the final result.

A single DynamoDB table (PK = DEPLOY#<id>, SK = META | CONN#<connectionId> plus a connectionId GSI) holds both the deployment state and the WebSocket subscriptions. To keep it simple, CloudFront and S3 hosting are deliberately absent: Vite serves the frontend on localhost:5173, the browser calls the AWS-hosted endpoints directly, and there's one fewer layer of infrastructure to learn.

The pipeline stages are intentionally simulated. Each step function (step_smoke_tests, step_deploy_staging, etc.) sleeps briefly and emits realistic-looking log lines. In a real deployment you'd replace each body with calls to CodeDeploy / ECS service updates / Lambda alias shifting / your CI provider's API. The shape stays identical; only the leaves change.

Here is a shot of the sample app i built.

SAM in 30 seconds

If you've never used SAM before, the elevator pitch is: it's a CloudFormation macro, not a separate tool. Add Transform: AWS::Serverless-2016-10-31 to a CloudFormation template and you can write resources like AWS::Serverless::Function that expand at deploy time into full CloudFormation - a Lambda function plus its IAM execution role, log group, event source mappings, version, alias, and so on. Anything you can write in vanilla CloudFormation still works. I am a big fan of using Terraform for IaC but SAM is my second favorite - especially when mostly dealing with AWS serverless resources.

The SAM CLI layers on developer tooling that vanilla CloudFormation lacks: local invoke, the sam sync --watch inner loop, log tailing, sample event generation, and pipeline scaffolding.

Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Runtime: python3.14
    Architectures: [arm64]
    Tracing: Active
    LoggingConfig:
      LogFormat: JSON
      ApplicationLogLevel: INFO
      SystemLogLevel: WARN
    Environment:
      Variables:
        POWERTOOLS_SERVICE_NAME: deploy-pipeline
        TABLE_NAME: !Ref DeploymentTable

The Globals: block is one of SAM's nicest small features: every AWS::Serverless::Function in the template inherits these defaults. No copy-paste of Runtime: python3.14 across nine handlers. New runtimes (Python 3.14 was added in November 2025), structured JSON logging, table names - changing the global block changes them all.

For deeper SAM background and the broader IaC landscape, the AWS SAM developer guide is a good starting point. The Powertools setup used throughout this project - logger, tracer, metrics decorators in the right order, structured JSON logging, EMF metrics - is the same pattern I covered in Powertools for AWS Lambda Best Practices. The rest of this post focuses on what's new.

The new WebSocket API resource

Before May 2026, defining a WebSocket API in CloudFormation looked like this for the simplest possible case:

# ~70 lines, redacted for brevity
MyApi: { Type: AWS::ApiGatewayV2::Api, Properties: {...} }
ConnectRoute: { Type: AWS::ApiGatewayV2::Route, Properties: {...} }
ConnectIntegration: { Type: AWS::ApiGatewayV2::Integration, Properties: {...} }
ConnectPermission: { Type: AWS::Lambda::Permission, Properties: {...} }
DisconnectRoute: { ... }
DisconnectIntegration: { ... }
DisconnectPermission: { ... }
DefaultRoute: { ... }
DefaultIntegration: { ... }
DefaultPermission: { ... }
Stage: { Type: AWS::ApiGatewayV2::Stage, Properties: {...} }
Deployment: { Type: AWS::ApiGatewayV2::Deployment, Properties: {...} }

Every basic Lambda-backed route needed three resources: Route, Integration, and Permission. Integration URIs needed manually constructed arn:aws:apigateway:.../invocations strings. Forgetting the Lambda::Permission resource on any route was an easy mistake to make: the route would exist, the integration would exist, the connection would even succeed, and route invocations would fail at runtime in a way that looks like an integration problem rather than a missing permission.

The new AWS::Serverless::WebSocketApi collapses all of that into:

DeployWebSocketApi:
  Type: AWS::Serverless::WebSocketApi
  Properties:
    StageName: dev
    RouteSelectionExpression: "$request.body.action"
    Routes:
      "$connect":
        FunctionArn: !GetAtt ConnectFunction.Arn
      "$disconnect":
        FunctionArn: !GetAtt DisconnectFunction.Arn
      subscribe:
        FunctionArn: !GetAtt SubscribeFunction.Arn
      intervene:
        FunctionArn: !GetAtt InterveneFunction.Arn

That's the entire WebSocket API. Behind the scenes SAM still generates the same set of CloudFormation resources - one Api, one Stage, and per-route Route + Integration + Lambda::Permission - so anything you could do in raw CloudFormation you can still do here. You're trading verbosity for a smaller blast radius for typos.

A few things worth noting from the resource reference:

RouteSelectionExpression is required. The conventional value is $request.body.action, which means: parse incoming messages as JSON and dispatch on the action field. The frontend sends { "action": "subscribe", "deploymentId": "..." } and API Gateway routes it to the subscribe integration.
The route property is FunctionArn, not Function. Easy to get wrong if you're paraphrasing the resource by analogy with other SAM event sources. Mistakes here fail at deploy time, not runtime.
There is no documented !GetAtt MyApi.ApiEndpoint. !Ref MyWebSocketApi returns the API ID, and !GetAtt MyWebSocketApi.Stage returns the generated stage's logical reference. To get the wss endpoint you construct it: wss://${MyApi}.execute-api.${AWS::Region}.amazonaws.com/${StageName}.

Authorization is `$connect`-only, and `$connect` auth alone isn't enough

The SAM Auth block applies only to $connect. Valid AuthType values are NONE, AWS_IAM, and CUSTOM for a Lambda authorizer.

I left this as NONE in the demo for simplicity, but for this specific design that's a real leak vector worth naming. Any connected client can send {"action":"subscribe","deploymentId":"..."} and start receiving the live metrics and log stream for an arbitrary deployment ID. Worse, intervene is a control-plane action that promotes or rolls back a deploy.

$connect authentication proves who opened the socket. It doesn't prove that the caller is allowed to subscribe to a given deployment or send intervene for it.

Production deployments need:

A Lambda authorizer on $connect that validates the caller. Browsers can't set arbitrary headers on the WebSocket upgrade, so the conventional carriers are a query-string token or the Sec-WebSocket-Protocol subprotocol header. Both can leak into request logs, devtools, reverse-proxy logs, and support captures.
A short-lived, audience-scoped WebSocket ticket minted by your backend after the user authenticates through your normal flow, instead of a long-lived bearer token in any of those carriers. The $connect authorizer should bind the validated claims to the connectionId so route handlers can look them up later.
Origin validation in the $connect authorizer. WebSocket APIs don't enforce CORS the same way fetch does, so the Origin header is your own perimeter check. Reject anything not from your expected frontend origin list.
Per-message authorization in every route handler. Store the authenticated principal alongside each connection record, and in subscribe and intervene, check the principal against the deployment owner, team, or environment before honoring the action. Reject unauthorized subscribe, promote, rollback, or extend messages with a 403-equivalent and log it.
AccessLogSettings on the stage, with a format that doesn't log raw tokens.
DefaultRouteSettings.ThrottlingBurstLimit and DefaultRouteSettings.ThrottlingRateLimit so a misbehaving client can't burn your connection-message budget.

Why not just SSE?

The first question I asked myself was "do I actually need WebSockets?" Server-Sent Events would handle one-way progress streaming with much less infrastructure - a Lambda Function URL with RESPONSE_STREAM invoke mode, no connection store, no fan-out publisher. (Function URLs are public endpoints unless protected with IAM auth or your own application-level authorization layer; that's a separate piece of work either way.)

But this workflow needs a return channel, and a low-latency one. The operator's "roll back NOW" decision has to land while the canary is still serving traffic - if a customer-impacting regression appeared at second 35 of a 60-second observation window, you don't want to wait until the window ends and hope the auto-decision picks rollback. You want a button that triggers an immediate callback completion. SSE is one-way. You'd end up with SSE for metrics streaming and HTTP for the intervention, two different protocols, two different idle-connection lifecycles, two sets of error-handling. WebSockets give you one bidirectional channel for both.

The same logic applies to any workflow with bidirectional, low-latency interaction over long time horizons: collaborative editors, multi-agent systems where the human can re-prompt, interactive ML training where the operator can adjust hyperparameters mid-run, IoT control planes. Each of those becomes a tractable SAM application with WebSocketApi + Durable Functions; without one or the other, you're either gluing together more services or polling.

Sending messages back from Lambda

The other half of the WebSocket story is how a Lambda function pushes a frame to a connection. API Gateway exposes a small management API - POST /@connections/{id} - and boto3 has a dedicated client for it:

import boto3

# domain and stage come from event['requestContext'] inside any route handler,
# or are constructed from the WEBSOCKET_API_ID env var elsewhere.
client = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url=f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}",
)
client.post_to_connection(ConnectionId=connection_id, Data=b'{"hello":"world"}')

The IAM action is execute-api:ManageConnections and the resource ARN format is arn:aws:execute-api:{region}:{account}:{api-id}/{stage}/POST/@connections/*. In the SAM template I scope this tightly:

- Version: '2012-10-17'
  Statement:
    - Effect: Allow
      Action: execute-api:ManageConnections
      Resource: !Sub 'arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${DeployWebSocketApi}/dev/POST/@connections/*'

The Lambda needs both the apigatewaymanagementapi boto3 client (which is just an HTTP client pointed at a constructed endpoint URL) and an IAM policy granting execute-api:ManageConnections against the specific WebSocket API + stage. Wildcard either piece and you'll regret it during pen-test time.

The Progress Publisher Lambda also has to handle the case where a browser closed the tab between the moment the orchestrator decided to publish and the moment the publisher actually POSTed. API Gateway returns HTTP 410 Gone; boto3 surfaces this as GoneException. The publisher catches it and prunes the dead connection from DynamoDB:

for connection_id in subscribers:
    try:
        client.post_to_connection(ConnectionId=connection_id, Data=frame)
        sent += 1
    except client.exceptions.GoneException:
        dead.append(connection_id)
if dead:
    remove_subscriptions(dead)

Lambda Durable Functions

Standard Lambda has one execution environment per request and a 15-minute hard timeout. Step Functions handles long-running orchestration but at the cost of writing your business logic in Amazon States Language and paying per state transition. Durable Functions split the difference: you write the workflow as plain Python code in a Lambda handler, and the Lambda runtime handles the checkpoint-and-replay underneath. If you want the full mental model of how checkpoints, replays, and callbacks work, my loan approval workflow post walks through it from scratch with a different example.

Each call to context.step() is a checkpoint. If the underlying Lambda crashes, the runtime kills the environment for capacity, or your replay hits the 15-minute wall-clock, the runtime simply re-invokes the handler with the same execution ID. Recorded step results are returned without re-executing the body. From the developer's perspective, the function "just resumes" - even if "resume" means the wait was 23 hours and 59 minutes.

The replay model also means the orchestration body should stay deterministic. Generate UUIDs, read the clock, call external APIs, query DynamoDB, write to a file, or emit side effects inside @durable_step functions rather than inline in the orchestration path. Otherwise a replay can compute a different value (a new UUID, a different current timestamp, a now-different DynamoDB row) and take a different branch than the original execution, which defeats the point of checkpoint-and-replay. The orchestration body should read like a recipe of step calls plus branching on their recorded results, nothing else.

In SAM, you opt in by adding a DurableConfig block to an AWS::Serverless::Function:

DeployOrchestratorFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: handlers.deploy_orchestrator.handler
    Timeout: 900             # per-replay wall clock; the durable timeout is what matters
    AutoPublishAlias: live
    DurableConfig:
      ExecutionTimeout: 86400        # 24 hours, integer seconds (max 31622400 = 366 days)
      RetentionPeriodInDays: 30

ExecutionTimeout is the outer hard limit on how long a single durable execution can live, in plain integer seconds (not ISO 8601 - early blog posts disagreed on this). RetentionPeriodInDays controls how long the execution history is kept after completion (1-90, default 14).

Supported runtimes today are python3.13, python3.14, nodejs22.x, nodejs24.x, java17, java21, and java25, plus container images.

IAM for durable execution

Two execution-role actions are required (these go on the orchestrator's role, not the caller's):

Statement:
  - Effect: Allow
    Action:
      - lambda:CheckpointDurableExecution
      - lambda:GetDurableExecutionState
    Resource: !Sub '${DeployOrchestratorFunction.Arn}:*'

AWS publishes a managed policy (AWSLambdaBasicDurableExecutionRolePolicy) that bundles these with CloudWatch Logs basics, but you can construct it inline with two lines. The :* suffix matches all published versions, since durable execution requires a published version (not $LATEST).

For services that need to complete a callback - the Intervene Lambda and the Metrics Emitter in this project - the actions are different:

- Effect: Allow
  Action:
    - lambda:SendDurableExecutionCallbackSuccess
    - lambda:SendDurableExecutionCallbackFailure
    - lambda:SendDurableExecutionCallbackHeartbeat
  Resource: !Sub '${DeployOrchestratorFunction.Arn}:*'

The heartbeat action is the pattern that makes the metrics emitter behave nicely when the operator interrupts; more on that below.

The Python SDK

The durable SDK is a separate PyPI package, not part of the runtime or Powertools. It's installed via requirements.txt:

aws-durable-execution-sdk-python>=0.1.0

The handler signature changes slightly: context is no longer a LambdaContext but a DurableContext, and the function is decorated with @durable_execution. Steps are decorated with @durable_step. Heads-up: a few of the config classes (Duration, WaitForCallbackConfig, etc.) aren't re-exported from the top-level package - import them from aws_durable_execution_sdk_python.config:

from aws_durable_execution_sdk_python import (
    DurableContext, durable_execution, durable_step,
)
from aws_durable_execution_sdk_python.config import Duration, WaitForCallbackConfig
# WaitForCallbackContext is a runtime-protocol type used to annotate the
# submitter callable; it lives in .types, not .config.
from aws_durable_execution_sdk_python.types import WaitForCallbackContext

@durable_step
def step_smoke_tests(_ctx, deployment_id, artifact):
    # Each "stage" is a durable step. The body emits log lines through the
    # Progress Publisher and sleeps to simulate real work; in a real deploy
    # this would call CodeDeploy / ECS / your CI provider.
    publish_log(deployment_id, "smoke", f"pulling artifact {artifact[:12]}")
    publish_log(deployment_id, "smoke", "/health 200 ok")
    return {"passed": True}

@durable_execution
def handler(event, context: DurableContext):
    deployment_id = event["deploymentId"]
    context.step(step_smoke_tests(deployment_id, event["artifact"]))
    context.step(step_deploy_staging(...))
    context.step(step_integration_tests(...))
    context.step(step_deploy_canary(...))
    decision = context.wait_for_callback(register_observation, ...)
    if decision == "promote":
        context.step(step_promote(...))
    else:
        context.step(step_rollback(...))
    context.step(step_finalize(...))

Each @durable_step-decorated function takes a StepContext as its first argument; context.step(step_smoke_tests(deployment_id, artifact)) calls the wrapper to produce a curried Callable[[StepContext], T], and the durable runtime then calls that with a real StepContext. Once you see the pattern, all the parallel/map/wait helpers follow it consistently.

The killer feature: callbacks

This is the single most useful primitive Durable Functions adds to the Lambda model. We have callback with Step Functions but it's great to have these now with Lambda!context.wait_for_callback() registers an external completion point and pauses the workflow without paying for compute while waiting:

def register_observation(callback_id: str, _ctx) -> None:
    # Persist the callback ID so the Intervene Lambda can find it.
    store_callback_id(deployment_id, callback_id)
    # Kick off the metrics emitter as an async fire-and-forget invoke.
    # It will stream metrics_tick events to subscribers and complete this
    # callback at end-of-window with an auto decision (or stop early if
    # the operator beats it to the punch).
    invoke_metrics_emitter(deployment_id, callback_id, observation_seconds)
    publish(deployment_id, {"type": "stage", "stage": "observation", "status": "running"})

decision_payload = context.wait_for_callback(
    register_observation,
    name="canary-observation",
    config=WaitForCallbackConfig(
        timeout=Duration.from_seconds(observation_seconds + 120)
    ),
)

Duration is a frozen dataclass with seconds: int and a family of factory classmethods (from_seconds, from_minutes, from_hours, from_days). The keyword form Duration(seconds=N) works too because it's the underlying field, but the factory methods are the canonical pattern shown in the SDK examples and read better for non-trivial durations: Duration.from_hours(24) is unambiguous; Duration(seconds=86400) makes a reader do mental math.

The submitter's signature is (callback_id: str, ctx: WaitForCallbackContext) -> None. The runtime calls it once with a fresh callback_id, the function persists that ID and notifies the outside world (here: kicks off the metrics emitter and tells the UI), and wait_for_callback blocks until something completes the callback via the Lambda API.

In this project there are two parties racing to complete the same callback:

The operator clicks Promote / Roll back / Extend in the UI. The Intervene Lambda receives the WebSocket frame, looks up the callback ID, and calls lambda:SendDurableExecutionCallbackSuccess with the chosen decision.
The metrics emitter runs the observation window to completion (e.g. 60 seconds), evaluates the canary metrics against the configured thresholds, and calls the same API with an auto decision ("promote" if metrics stayed within thresholds, "rollback" if they didn't).

First call wins. The orchestrator's wait_for_callback returns whatever payload was passed, and execution continues from there.

Heartbeating to detect superseded callbacks

When the operator clicks Extend observation, the orchestrator gets decision = "extend", registers a new callback, and invokes a new metrics emitter for the extension window. But the original metrics emitter is still running - Lambda invocations can't be cancelled mid-flight. Without intervention, both emitters publish ticks for the same deployment ID for the next ~60 seconds, and the dashboard shows two timers and conflicting metrics.

The fix is a heartbeat at the top of every tick:

while time.time() < deadline:
    try:
        _lambda_client.send_durable_execution_callback_heartbeat(
            CallbackId=callback_id
        )
    except _lambda_client.exceptions.ClientError as exc:
        # CallbackTimeoutException covers both "no longer pending" cases:
        # the callback was completed by another party (operator clicked
        # extend/rollback/promote, or a sibling emitter), or its heartbeat
        # timeout expired. Either way, we should stop emitting.
        logger.info("callback no longer pending; stopping emitter early")
        return
    publish_tick(...)
    time.sleep(2)

The exception name is a bit misleading - CallbackTimeoutException is documented as the timeout-expiry signal, but in practice it's also what surfaces when the callback was already completed by another caller (I caught this on a SendDurableExecutionCallbackSuccess race in testing - same exception, message "The callback is either timed out or already completed"). Catching it covers both supersede paths. The result: the moment the operator clicks Extend, the next heartbeat from the old emitter raises and the old emitter exits cleanly. Only the new extension emitter continues. This is the durable equivalent of "send a cancellation token down" and it costs essentially nothing.

An image of the pipeline in action is below.

Step Functions vs Durable Functions

Worth a quick comparison since teams will reach for both. Step Functions is mature, has a visual editor, supports JSONata for transforms (added at re:Invent 2024; AWS now recommends it for transformation-heavy workflows, though JSONPath remains supported and widely used), and is the right answer for state machines that span Lambda + ECS + Bedrock + SQS + SNS with no application code. I built a version of that pattern in Serverless Data Processor with Step Functions, Lambda, and Fargate (Rust) - that workflow has multiple compute backends and a clean state-machine shape, which Step Functions handles well.

Durable Functions is the right answer when:

The workflow logic is code you'd want to write anyway (loops, conditionals, branching on metrics, optional extension windows that recurse into more callbacks)
You want callbacks measured in minutes-to-days without paying State Transitions costs - the canary observation window in this project is exactly that pattern
The team already lives in Python/Node.js/Java and doesn't want to maintain ASL

A canary deploy with operator intervention is a borderline case. Step Functions can absolutely express it - the wait-for-callback / Task-with-task-token pattern has been around for years. But in practice the deploy logic ends up being half ASL and half code, with the branching rules ("if errors > threshold then rollback else if extend then loop") split awkwardly between the two. Writing it as one Python function with if statements and wait_for_callback reads like the workflow you'd describe to a colleague.

What SAM CLI brings to the table

The CLI is the half of SAM that turns a CloudFormation template into a tight inner loop. Worth calling out the commands I actually use day-to-day on this project:

sam validate --lint: Schema plus cfn-lint pass over the template; runs in seconds.
sam build: Bundles each function's CodeUri with its requirements.txt into .aws-sam/build/.
sam deploy --guided: First-time interactive deploy that writes samconfig.toml.
sam deploy: Subsequent deploys; uses the parameters in samconfig.toml.
sam sync --watch: Dev inner loop; code-only changes deploy in seconds via service APIs, bypassing CloudFormation.
sam logs --tail --name MyFunc: Live tail of one function's logs.
sam local invoke MyFunc --event events/foo.json: Run a function locally in Docker or Finch.
sam local invoke MyFunc --durable-execution-name local-1: Run a function locally as a durable execution.
sam local execution history <name>: Inspect the durable runtime's view of a workflow's steps.
sam local callback succeed <id>: Manually complete a paused callback during local testing.
sam local generate-event s3 put: Emit sample event payloads; also supports services such as SQS, SNS, EventBridge, Kinesis, and Cognito.
sam pipeline init: Generate CI/CD pipeline configs for CodePipeline, GitHub Actions, GitLab, Jenkins, and Bitbucket.

For reference, samconfig.toml for this project is six lines of parameter overrides plus the standard scaffolding. The first sam deploy --guided writes this for you; subsequent sam deploy runs use it without prompting:

version = 0.1
[default.global.parameters]
stack_name = "canary-deploy"
region = "us-east-1"

[default.deploy.parameters]
profile = "blog_admin"
capabilities = "CAPABILITY_IAM"
confirm_changeset = false
fail_on_empty_changeset = false
resolve_s3 = true
parameter_overrides = "Environment=\"dev\" AllowedOrigin=\"http://localhost:5173\" ObservationSeconds=\"60\" DurableExecutionTimeoutSeconds=\"86400\""

The big one in real-world use is sam sync --watch. A standard sam deploy runs a CloudFormation changeset, which takes 30-90 seconds even when the only change is one line of Python. sam sync differentiates code vs. infrastructure changes: code changes are deployed directly via lambda:UpdateFunctionCode, taking 2-3 seconds. Infrastructure changes (a new resource, a changed property) still go through CloudFormation. The trade-off is that sam sync introduces drift between CloudFormation's view of the stack and reality - never use it on a production stack, but for local development it's transformative.

A few smaller things landed in SAM CLI v1.156.0 (March 2026) that this project uses:

.env file format support for --env-vars. You can now write KEY=value lines instead of the old JSON envelope.
Route-specific CORS on AWS::Serverless::HttpApi - useful for the demo's localhost dev origin without opening up production routes.
BuildKit support for container image builds - not used here (zip is simpler) but a meaningful win for teams packaging Lambdas as containers.
Rust cargo-lambda graduated from experimental to stable - separate story, covered in Daniel Abib's multi-threaded Rust on Lambda post.

Policy templates over Connectors

A note on IAM. SAM offers two abstractions for granting permissions: policy templates (named, scoped policies like DynamoDBCrudPolicy) and Connectors (declarative Read/Write semantics between resources). Jeremy Daly's critique of Connectors is still the most coherent position on this: the Read/Write semantics are confusing (e.g., Write on DynamoDB enables deletions, Read on SQS only receives but you need Write to delete after processing), and they don't compose well across nested or multi-stack architectures. For this project, policy templates are easier to reason about because each function's permissions are visible right where the function is defined. Note that template names like DynamoDBCrudPolicy include deletes too - "least privilege" with the CRUD template still grants more than a strict read-only function needs, so I fall back to inline policies for the few cases where the templates are too broad (the execute-api:ManageConnections grant scoped to a specific API and stage, the durable-callback grants, etc.).

This project uses policy templates where they fit, plus targeted inline policies for the permissions the templates don't cover:

Policies:
  - DynamoDBWritePolicy:
      TableName: !Ref DeploymentTable
  - LambdaInvokePolicy:
      FunctionName: !Sub '${AWS::StackName}-ProgressPublisher'
  - Version: '2012-10-17'      # inline for the cases templates don't cover
    Statement:
      - Effect: Allow
        Action: execute-api:ManageConnections
        Resource: !Sub 'arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${DeployWebSocketApi}/dev/POST/@connections/*'

Every Lambda has its own role. The Start Deploy function can write to DynamoDB and invoke the orchestrator - that's it. The Intervene function can read DynamoDB and complete durable callbacks - that's it. The Progress Publisher can read DynamoDB, prune subscriptions, and POST to WebSocket connections - that's it. The blast radius of any one function being compromised stays small.

Things to watch for

A real list of things I tripped over while building this. Most aren't obvious until you hit them.

Use a recent SAM CLI. AWS::Serverless::WebSocketApi support landed in 1.159.1 (released April 28, 2026, ahead of the public AWS announcement on May 5). I verified this template against 1.159.1. As of mid-May 2026, 1.160.0 is the current release - use that or newer unless you have a reason to pin. Older versions (the Durable Functions launch 1.150.x, the post-BuildKit 1.156.0) don't know the new resource type and sam deploy fails at changeset creation with the unhelpful Transform AWS::Serverless-2016-10-31 failed with: Internal transform failure. Upgrade with pip3 install --user --upgrade aws-sam-cli (or grab the packaged installer from the latest AWS SAM CLI GitHub release) and confirm with sam --version. Expected: SAM CLI, version 1.160.0 or newer.
StageName on WebSocketApi must be a literal string in SAM 1.159.1. Any intrinsic - !Ref Environment, !Sub "${Environment}" - trips a TypeError deep inside the SAM transform's per-route Lambda permission generator: Error transforming template: can only concatenate str (not "dict_node") to str. The traceback points at samtranslator/model/api/websocket_api_generator.py line 299 (_construct_permission); the constructor builds the permission's SourceArn by string concatenation with StageName and chokes on the dict node. The workaround in this template is hard-coding StageName: dev and matching it in the Lambda env var (WEBSOCKET_STAGE: dev) and the Outputs. Per-environment parameterization for the rest of the stack still works through the Environment parameter; only the WebSocket stage name has to be a constant. Reported in SAM 1.159.1; if you're on a later release, try the intrinsic form first and drop back to the literal only if you hit the same traceback.
AutoPublishAlias only republishes on code changes, not env-var-only changes. This bites hard with durable functions because each version freezes its environment variables. If you change a config value but don't touch any Python file, sam deploy updates $LATEST but doesn't publish a new version - the alias keeps pointing at the old version with the stale env var. Verify with aws lambda get-alias --function-name X --name live --query FunctionVersion and force-publish with aws lambda publish-version + aws lambda update-alias if the version is older than your env-var change. A practical workaround if you control the function code: change a Description field or bump a comment in the Python file at the same time as the env var change; SAM treats that as a code change and republishes.
wait_for_callback returns the Result field as a raw string, not a parsed object. Both the Intervene Lambda and the Metrics Emitter pass Result=json.dumps({...}) to SendDurableExecutionCallbackSuccess. The orchestrator's wait_for_callback returns that string verbatim. Forgetting this gives you TypeError: string indices must be integers, not 'str' the moment you try result["decision"]. The quick fix is a one-liner: if isinstance(decision_payload, str): decision_payload = json.loads(decision_payload). The more idiomatic fix is to configure a JSON SerDes on WaitForCallbackConfig (and on the matching CallbackConfig of the sender) so the SDK handles round-tripping for you; CallbackConfig and WaitForCallbackConfig both expose a serdes field. The demo uses the manual parse to keep the dependency surface minimal and the failure mode visible, but for production code the SerDes path is cleaner.
Watch for circular dependencies. Anything in Globals.Function.Environment that uses !Ref DeployWebSocketApi creates a dependency from every function (including the four route-handler functions the API references back) to the API itself. SAM rejects this. The same trap exists for any IAM policy whose Resource is !Sub '...${DeployWebSocketApi}/...' if that policy is on a function in the API's Routes map. The fix in this template is to keep WEBSOCKET_API_ID only on ProgressPublisherFunction, which isn't in the routes map. The publisher and metrics emitter also get explicit FunctionName values based on ${AWS::StackName}, so the orchestrator can reference them as strings (PUBLISHER_NAME: !Sub '${AWS::StackName}-ProgressPublisher') instead of with !Ref ProgressPublisherFunction. The ${AWS::StackName} pseudo-parameter has no resource dependency, so the loop never forms.
Depending on your cfn-lint version, cfn-lint may still complain about the new resource types. Until its schema catches up in your environment, expect E3006 on AWS::Serverless::WebSocketApi and E3002 on DurableConfig. Suppress them at the resource level with the Metadata.cfn-lint.config.ignore_checks pattern shown in template.yaml. SAM CLI 1.159.1 accepts both correctly; newer releases should as well.
WaitConfig doesn't exist in the durable SDK. The naming I'd seen referenced in early write-ups didn't match the published package. The real types are WaitForCallbackConfig (extends CallbackConfig) and Duration, both in aws_durable_execution_sdk_python.config (not re-exported from the top-level). Submitter callable is (callback_id: str, ctx) -> None, not (callback_id) -> None. Importing the wrong name throws Runtime.ImportModuleError at Lambda cold start and the orchestrator never runs. That can be silent from the user's perspective because lambda_client.invoke(InvocationType="Event") returns 202 once Lambda accepts the event for asynchronous invocation, even though the handler may fail later during cold start. Always tail the orchestrator's CloudWatch log group when wiring up a durable function for the first time.
The Powertools v3 Lambda layer for python314-arm64 isn't published yet. As of 2026-05, AWS publishes AWSLambdaPowertoolsPythonV3-python313-arm64 (latest version 30, Powertools 3.26.0) but not the python3.14 variant. This is the kind of thing that changes quickly; check again before publishing or before copying this template into a long-lived production repo. Confirm yourself with aws lambda list-layer-versions --layer-name AWSLambdaPowertoolsPythonV3-python314-arm64 --region us-east-1. Expected result today: "LayerVersions": []. If you point your template at a layer ARN that doesn't exist, sam deploy rolls back at function-creation time with lambda:GetLayerVersion AccessDenied (the layer's resource-based policy doesn't grant access because the layer doesn't exist - a misleading error). The fix here is to install Powertools via src/requirements.txt so each function bundles its own copy. Adds ~5 MB to each function package; cheap insurance until the layer ships.
Async fan-out tasks need to know when their callback has been superseded. The metrics emitter spawned at the start of the observation window keeps running until its window expires - even if the operator clicks Extend, completes the original callback, and a fresh emitter has already been launched for the extension window. Without intervention you get two emitters streaming overlapping metrics_tick events to the same UI. The cleanest fix is the heartbeat pattern shown earlier: lambda:SendDurableExecutionCallbackHeartbeat at the top of every tick raises CallbackTimeoutException the moment someone else completes the callback, and the emitter exits early.
Async invokes can fail silently before durable replay can save you. lambda_client.invoke(InvocationType="Event", ...) returns 202 once Lambda accepts the event for asynchronous invocation. That doesn't mean the target handler actually ran successfully. If the async invoke later fails because of throttling, handler errors, bad payload shape, or runtime/import problems, the orchestrator may never make visible progress and the user can see a frozen UI. The fix is an async-invoke destination (OnFailure -> SQS or EventBridge) or a Lambda DLQ on the orchestrator. In SAM, configure that with EventInvokeConfig on the function: set MaximumRetryAttempts, then add a DestinationConfig.OnFailure target such as an SQS queue ARN. Skipped in the demo for brevity; not skippable in production.
API Gateway WebSocket doesn't enforce a fixed concurrent-connection quota. The practical ceiling is shaped by the new-connections-per-second quota (500 per account per Region by default), the 2-hour maximum connection duration, and the 10-minute idle timeout. AWS's own example: 500 new connections per second sustained over the 2-hour window can support up to 3.6M concurrent connections. For most canary dashboards this isn't the first limit you hit, but reconnect storms (browser refreshes during a multi-environment deploy) and large internal audiences can still justify a quota-increase request on new-connection rate. Long-lived dashboards also need application-level ping/pong or periodic traffic to avoid the 10-minute idle disconnect.
CloudWatch alarms aren't optional for a workflow whose auto-decision is a safety net. The orchestrator's "auto-promote at end-of-window if metrics look fine" branch is only as good as your confidence that the orchestrator and metrics emitter are themselves healthy. At minimum, alarm on: Lambda Errors and Throttles on each function, lambda:CheckpointDurableExecution failures (visible as Lambda errors but worth a dedicated metric filter), API Gateway WebSocket 5xx rate, and DynamoDB throttling on the table. None of these alarms are in the template - omitted for demo simplicity, mandatory before you trust the auto-decision in anger.
DynamoDB encryption is AWS-owned by default; deploy metadata may want a CMK. The table uses default AWS-managed encryption. For a deploy-pipeline table holding build artifact IDs and operator decisions, customer-managed KMS keys give you per-environment isolation, key rotation control, and an audit trail of who/what decrypted. Trade-off: every read and write needs kms:Decrypt / kms:GenerateDataKey on the CMK, which adds a small per-request cost and a configuration surface (key policies, grants).
DynamoDB on-demand is the right default for hobby use; provisioned will be cheaper at scale. The template uses PAY_PER_REQUEST because the demo's traffic is bursty and small. A real CI pipeline running this dozens of times an hour will hit on-demand's per-write cost, roughly $1.25 per million write request units in us-east-1 as of this writing, which can be several times higher than provisioned capacity once traffic is steady. Switch to provisioned with auto-scaling once your request rate stabilizes.

Production hardening checklist

Because intervene can promote, roll back, or extend a deployment, this WebSocket is a control-plane interface, not just a UI convenience. Treat it like you'd treat a deployment API.

Demo vs production is more than a one-line caveat. Before you wire this up to anything that matters, walk this list:

[ ] Require $connect authorization (Lambda authorizer with a short-lived WebSocket ticket, not a long-lived bearer token).
[ ] Validate Origin in the $connect authorizer.
[ ] Persist authenticated principal/claims alongside each connection record.
[ ] Authorize every subscribe and intervene message against the deployment owner/team/environment, not just the connection identity.
[ ] Configure AccessLogSettings on the WebSocket stage with a format that omits raw tokens.
[ ] Configure DefaultRouteSettings.ThrottlingBurstLimit / ThrottlingRateLimit per route, and know your account-level throttling limits.
[ ] Add DynamoDB TTL on CONN#... rows to clean up stale subscriptions if $disconnect ever misfires.
[ ] Encrypt the deploy state table with a customer-managed KMS key if it holds anything you wouldn't paste in chat.
[ ] Add an async-invoke destination (SQS / EventBridge) or DLQ on the orchestrator so lambda_client.invoke(InvocationType="Event") failures aren't silent.
[ ] Add CloudWatch alarms on Lambda errors/throttles, durable-execution failures, API Gateway 5xx rate, and DynamoDB throttling.
[ ] Plan for new-connection-rate quota increase requests if you scale internal audiences.
[ ] Keep the orchestration body deterministic; push side effects into @durable_step functions.
[ ] Use AutoPublishAlias plus a DeploymentPreference on the orchestrator itself if you want canary semantics on your canary orchestrator (yes, really).

Most of these are one-liners or one-resource additions in the template. None are skippable for production.

Cost and cleanup

The stack is cheap but not free at idle. With zero traffic:

DynamoDB on-demand with PITR enabled: pennies per day for an empty-ish table
Published Lambda versions: free at rest, billed per invoke
API Gateway HTTP API and WebSocket API: free at rest, billed per request and per connection-minute respectively
CloudWatch Logs: pennies per day for the application/system log groups

Per-deployment costs:

Lambda durable executions: standard Lambda compute pricing for the active steps. The orchestrator only runs during checkpoints; the wait_for_callback pause is free.
Metrics emitter Lambda: one invocation per observation window, running for about 60 seconds and publishing roughly 30 tick messages, plus the callback lifecycle calls. Most of that time is time.sleep between ticks, so it's billed Lambda duration, but still small at 512 MB for a demo-scale observation window.
DynamoDB: a handful of writes and reads per workflow, fractions of a cent.
WebSocket API: $0.25 per million connection-minutes plus $1 per million messages. The demo uses ~30 messages per workflow and a connection of a few minutes; the connection-minute share is essentially noise.

For this demo, expect well under a cent to around a cent per completed workflow on the AWS side, dominated by Lambda execution time during the observation window. The exact number depends on Lambda memory size, duration, and region. Negligible at hobby scale, worth understanding if you wire this up to your real CI pipeline and start firing it dozens of times a day.

When you're done:

make destroy

Verify in the console that the stack is gone and no orphaned API Gateway APIs remain. The DynamoDB table is part of the stack and will be deleted with it - if you've put data you care about there, export it first.

Wrapping up

The new AWS::Serverless::WebSocketApi resource type is exactly the kind of incremental-but-meaningful improvement SAM has been shipping since 2018. It doesn't enable anything CloudFormation couldn't do before; it just removes 60+ lines of boilerplate per WebSocket API and eliminates a category of "forgot the Lambda::Permission" silent failures.

Where it gets genuinely interesting is when you pair it with Lambda Durable Functions. Real-time UIs and long-running workflows have always wanted to talk to each other; the standard pattern was either polling (clunky), Step Functions + WebSocket fan-out (works, but lots of glue), or maintaining a custom orchestrator on Fargate (overkill for most workloads). Now you can write the workflow as plain Python in a single Lambda handler, deploy it with one SAM template, and the human-in-the-loop story is two API calls and a callback.

The key decisions in this project:

WebSocket over SSE when you need a low-latency return channel, not just one-way streaming. Watching a canary's metrics live is a one-way story, but a "roll back NOW" button is bidirectional and time-sensitive.
Durable Functions over Step Functions when the workflow is naturally code, not a state machine. A canary deploy with operator intervention has branching that reads more naturally as Python if statements than as ASL.
Policy templates plus targeted inline policies instead of Connectors, so the effective permissions stay visible beside each function.
Vite locally over CloudFront/S3 when the frontend doesn't need to be reachable from outside your laptop.
Heartbeat callbacks from any background Lambda that might be racing another finisher, so the loser exits cleanly instead of double-publishing.
Decoupled metrics emitter so the orchestrator's durable replay surface stays minimal: one wait_for_callback, the emitter does the rest.

The full source - SAM template, Python handlers, React+Vite frontend, Makefile, samconfig, architecture diagram - is on GitHub: live-canary-deploys-with-sam-the-new-websocket-api-and-durable-functions. Clone it, swap in your AWS profile, run make deploy-guided && make frontend-env && make frontend-dev, and you should have the demo running locally in a few minutes once your AWS credentials and local toolchain are set up. Toggle the Inject canary error spike checkbox in the form to see the rollback path with the metrics dashboard turning red.

Resources

AWS::Serverless::WebSocketApi reference
Generated CloudFormation resources for WebSocketApi
Lambda Durable Functions launch post
Building fault-tolerant applications with Lambda Durable Functions
Best practices for Durable Functions (fraud detection example)
Test and debug durable functions with SAM
SAM CLI v1.156.0 release notes - .env support, route-specific CORS, BuildKit, Rust GA
API Gateway WebSocket connection management API
SAM policy template list
Jeremy Daly: Getting abstractions wrong with SAM Serverless Connectors
Powertools for AWS Lambda (Python) - logger, tracer, metrics, idempotency
Serverless ICYMI Q1 2026 - quarterly recap, AI-assisted serverless tooling

My related posts

AWS Lambda Durable Functions: Build a Loan Approval Workflow - my deep dive on the durable mechanics (checkpoint, replay, callbacks) with a different worked example
Powertools for AWS Lambda Best Practices - the logger/tracer/metrics pattern used throughout this project
Serverless Data Processor with Lambda, Step Functions, and Fargate (Rust) - companion piece on Step Functions orchestration, when ASL is the right tool over Durable
Lambda Managed Instances with Terraform - the rest of the Lambda compute continuum (sustained throughput, 32 GB memory, EC2 pricing)
Elastic Container Service - my default for containers on AWS - when you graduate beyond Lambda

Connect with me on X, Bluesky, LinkedIn, GitHub, Medium, Dev.to, or the AWS Community. Check out more of my projects at darryl-ruggles.cloud and join the Believe In Serverless community.

DEV Community

Live Canary Deployments with AWS SAM, the New WebSocket API Resource, and Lambda Durable Functions

Architecture

SAM in 30 seconds

The new WebSocket API resource

Authorization is `$connect`-only, and `$connect` auth alone isn't enough

Why not just SSE?

Sending messages back from Lambda

Lambda Durable Functions

IAM for durable execution

The Python SDK

The killer feature: callbacks

Heartbeating to detect superseded callbacks

Step Functions vs Durable Functions

What SAM CLI brings to the table

Policy templates over Connectors

Things to watch for

Production hardening checklist

Cost and cleanup

Wrapping up

Resources

My related posts

Top comments (0)

Architecture

SAM in 30 seconds

The new WebSocket API resource

Authorization is $connect-only, and $connect auth alone isn't enough

Why not just SSE?

Sending messages back from Lambda

Lambda Durable Functions

IAM for durable execution

The Python SDK

The killer feature: callbacks

Heartbeating to detect superseded callbacks

Step Functions vs Durable Functions

What SAM CLI brings to the table

Policy templates over Connectors

Things to watch for

Production hardening checklist

Cost and cleanup

Wrapping up

Resources

My related posts

Authorization is `$connect`-only, and `$connect` auth alone isn't enough