DEV Community: Gunnar Grosch

Picking a Phone Verification Method: SMS, Flash Call, Phone Call, and Data Verification

Gunnar Grosch — Sun, 14 Jun 2026 21:34:07 +0000

When your app needs to confirm that a user actually owns the phone number they gave you, the pattern looks the same from the outside: send something to the device, user usually confirms it. Under that surface, there are four distinct approaches using phone and carrier networks, each with different security characteristics, user experiences, and requirements. The right one depends on your context.

If you want the implementation side, the Sinch Verification API is a good starting point. I've covered the code in detail in Phone-Based User Verification in TypeScript and Python.

The four methods

Method	Delivery	User action	Requires mobile data
SMS OTP	Text message	Read and type a numeric code (auto-fill possible on Android)	No
Flash Call	Missed call (caller ID is the code)	None (Android SDK) / Enter caller ID (iOS, web)	No
Phone Call Verification	Inbound phone call	Listen and type a numeric code	No
Data Verification	Carrier network check	None	Yes

The differences matter more than they appear in that table.

SMS OTP

The default choice for most apps. It works on any phone, any network, over Wi-Fi or cellular. Your users already know what to do with a six-digit code. Delivery is global, the integration is straightforward, and it pairs with any backend.

On Android, the SMS Retriever API makes auto-fill possible: the mobile SDK can read the incoming message and fill the code without user input, if your app implements it. Most apps don't, so users typically still read and type the code manually.

The trade-off is that a code the user can read is a code that can be relayed, whether by accident or by a phishing page. For most consumer flows that's an acceptable trade. For account recovery or financial transactions, you may want to weigh methods where no code changes hands at all.

Flash Call

A call is placed to the user's number and immediately disconnected. The incoming caller ID is the verification code. On Android, the mobile SDK can intercept the caller ID automatically, completing verification without any user input. On iOS and web, the user reads the incoming number from their recent calls and enters it manually.

On Android, both SMS and Flash Call can complete without user input. The difference is that Flash Call never delivers a visible code at all. There's nothing to relay to a phishing page, misread, or enter incorrectly. On iOS the UX is more similar to SMS since the user still has to retrieve and enter a value, just from the call log instead of a text. Flash Call is still PSTN-based, so it shares the same SIM-level exposure as SMS. The specific benefit is phishing-resistance, not a broader security upgrade.

Flash Call is typically priced at a fixed rate per attempt, while SMS pricing varies by destination country and operator. In markets where SMS termination is expensive, Flash Call can be meaningfully cheaper at scale. Voice calls typically cost more than SMS, and Data Verification is priced separately.

Phone Call Verification

A text-to-speech call reads the code aloud. By default the user listens and types it in. Most providers also support prompting the user to press a digit on the keypad to confirm, which helps bypass voicemail: if the user doesn't answer, the call goes to voicemail and the code is spoken to the wrong listener. The keypress prompt ensures a human answered before the code is read out.

Choose it for coverage: it reaches numbers that SMS can't. Landlines, VoIP numbers, users in regions where SMS delivery is consistently unreliable. For mobile-first apps, it adds friction without much benefit over SMS. For apps where users might be verifying a non-mobile number or where SMS delivery is unreliable in your target market, it closes a gap the other methods leave open.

Data Verification

Your backend calls the verification API with the user's phone number. The carrier authenticates the device through its network infrastructure, with no visible interaction. The result comes back in the API response. If it succeeds, verification is done before the user does anything.

There's no code for the user to handle at any point. The constraint is cellular data: it must be enabled and available on the device. Modern mobile OSes can route the authentication request over the cellular interface even when Wi-Fi is connected, but if cellular data is off or unavailable, the carrier can't authenticate the device and the request fails. A second constraint worth knowing: on dual-SIM devices, the carrier authenticates based on the SIM handling data traffic. If that SIM is different from the number being verified, the check fails. You need a fallback either way.

One more practical note: on Sinch, Data Verification isn't self-serve. Enabling it requires carrier routing agreements, so you'll need to contact your account manager before it's available. Plan for that lead time if you're building it into your architecture.

Data Verification is a good first attempt for mobile apps where you control the client environment, but it can't stand alone.

How to pick

The decision comes down to a few questions:

Is this a mobile app or a web app? For web, SMS is the baseline. Flash Call and Phone Call Verification are useful as fallbacks but neither delivers the native SDK experience that makes them frictionless on mobile. For mobile apps, a practical default: Flash Call on Android (the SDK intercepts automatically, zero user input), SMS on iOS (Flash Call is less streamlined there). Layer Data Verification on top of that if your carrier coverage allows it.

How much friction can your flow afford? Signup confirmation can tolerate a code-entry step. A checkout flow or account recovery step benefits from fewer interruptions. Data Verification and Flash Call (with the Android SDK) add zero friction for the user. If reducing drop-off at the verification step matters to your conversion, start there.

Who are your users and where are they? SMS delivery is not uniform globally. In some regions, high termination costs make SMS pricing volatile at scale, and Flash Call's fixed per-attempt pricing is worth considering. In others, carrier support for CLI delivery or Data Verification varies. If you're building for a specific geography, check delivery rates and method availability for that market before committing.

Do your users have landlines or VoIP numbers? If yes, Phone Call Verification is the only method that reaches them. Build it as a fallback option even if it's not your primary path.

Fallback chains

No single method works for every user in every context. Build a fallback chain rather than picking one and moving on.

The general sequence: attempt Data Verification first for mobile users (best UX, no code handling), fall back to Flash Call or SMS if it fails or the user is on Wi-Fi, and offer Phone Call Verification as a last resort for users who don't receive the first attempt. On the web, start with SMS and offer Phone Call Verification as the fallback path, surfaced as a "didn't receive the code?" option in your UI.

If you're using a mobile SDK, it will typically provide a callback you can hook into to trigger a fallback automatically when the primary method fails. If you're integrating via the REST API directly, you implement the fallback yourself: catch the failure response from start, then issue a new request with a different method value.

The two-step API is the same regardless of method: start to trigger delivery, report to validate what the user confirms. Switching between methods is a one-field change in the request body, which keeps the fallback logic straightforward to implement.

Wrapping up

SMS OTP is broadly compatible and the right default for most apps. Flash Call eliminates the code-entry step and offers fixed-rate pricing. Phone Call Verification covers landlines and numbers that SMS can't reach. Data Verification is the lowest friction option for mobile but requires cellular data, a fallback, and provisioning through your account manager.

For most apps, the answer isn't picking one but combining them: Data Verification or Flash Call as the primary path for mobile, SMS as the reliable fallback, Phone Call Verification as the last resort. That covers most users and handles the edge cases where your first choice fails.

If you want the backend implementation for any of these methods, including TypeScript and Python examples for the Sinch Verification API, that's covered in Phone-Based User Verification in TypeScript and Python.

Additional Resources

What verification method are you using, and what drove the choice? Let me know in the comments.

Phone-Based User Verification in TypeScript and Python

Gunnar Grosch — Wed, 10 Jun 2026 13:07:26 +0000

A user signs up for your app. You need to confirm the phone number they entered actually belongs to them. A returning user wants to log in without a password. An account change needs a second factor before it goes through.

The underlying pattern is always the same: send a code to the device, wait for the user to confirm it.

The Sinch Verification API handles this with a two-step flow. Your backend calls start to trigger code delivery, and report to validate what the user submits. It supports four methods: SMS, FlashCall (a missed call where the incoming caller ID is the code), voice (text-to-speech), and Data Verification (silent authentication through the carrier's network, no user input at all). The payload shape is the same for all four, so switching methods is a one-field change.

This post covers the backend piece. Whether your frontend is a web app, iOS, or Android, the pattern is the same: your client calls your API, your API calls Sinch. A mobile app calling a Lambda function, a web form submitting to an Express server, a Next.js API route. The code here works for all of them.

This post uses the official Sinch Node.js SDK. Python and no-dependency Node.js examples are in the raw HTTP section at the end.

Prerequisites

You'll need a Sinch account (sign up here, a trial account is enough to test, see pricing for details) and Node.js 18+ or Python 3.9+.

In the Sinch Dashboard:

Create a Verification app. Go to Verification > Apps and click New app. Enter a name and optional description.
Copy your App Key and App Secret. They appear on the right side of the creation dialog under "Your summary". Copy both before closing.
Set the authentication level. In the Settings panel on the left, find Minimal Authentication Level and set it to Application. This ensures only signed requests (like the ones in this post) are accepted. The default is Public, which allows unsigned requests and is intended for mobile clients calling Sinch directly.

Note: on a trial account you can only verify numbers on your verified numbers list. To verify any number, upgrade your account.

How verification works

The flow is two steps:

Start: your backend calls the Verification API with the phone number and method. Sinch delivers the code to the device.
Report: the user enters the code. Your backend calls the API with the phone number and code. Sinch returns SUCCESSFUL or FAILED.

For SMS and voice, the code is a numeric OTP. For FlashCall, the "code" is the full incoming caller ID from the missed call. For Data Verification there is no report step: Sinch authenticates the device silently through the carrier's network and returns the result in the start response.

The SDK approach (Node.js)

The Node.js SDK handles authentication internally. You initialize it once with your credentials. The Verification API has method-specific calls for start and report rather than a single generic endpoint, so each method gets its own SDK call.

Note: the Python SDK doesn't include Verification yet. Python developers can use the raw HTTP section below.

Install the SDK:

npm install @sinch/sdk-core

import { SinchClient } from "@sinch/sdk-core";

const sinchClient = new SinchClient({
  applicationKey: process.env.SINCH_APP_KEY!,
  applicationSecret: process.env.SINCH_APP_SECRET!,
});

const E164_REGEX = /^\+[1-9]\d{1,14}$/;

export async function startVerification(
  phoneNumber: string,
  method: "sms" | "flashcall" | "callout" | "data" = "sms"
) {
  if (!E164_REGEX.test(phoneNumber)) {
    throw new Error(`Invalid phone number format. Use E.164 (e.g. +15551234567), got: ${phoneNumber}`);
  }

  if (method === "sms") {
    return sinchClient.verification.verifications.startSms({
      startVerificationWithSmsRequestBody: {
        identity: { type: "number", endpoint: phoneNumber },
      },
    });
  }
  if (method === "flashcall") {
    return sinchClient.verification.verifications.startFlashCall({
      startVerificationWithFlashCallRequestBody: {
        identity: { type: "number", endpoint: phoneNumber },
      },
    });
  }
  if (method === "callout") {
    return sinchClient.verification.verifications.startPhoneCall({
      startVerificationWithPhoneCallRequestBody: {
        identity: { type: "number", endpoint: phoneNumber },
      },
    });
  }
  return sinchClient.verification.verifications.startData({
    startDataVerificationRequestBody: {
      identity: { type: "number", endpoint: phoneNumber },
    },
  });
}

export async function reportVerification(
  phoneNumber: string,
  code: string,
  method: "sms" | "flashcall" | "callout" = "sms"
) {
  if (method === "sms") {
    return sinchClient.verification.verifications.reportSmsByIdentity({
      endpoint: phoneNumber,
      reportSmsVerificationByIdentityRequestBody: { sms: { code } },
    });
  }
  if (method === "flashcall") {
    return sinchClient.verification.verifications.reportFlashCallByIdentity({
      endpoint: phoneNumber,
      reportFlashCallVerificationByIdentityRequestBody: { flashCall: { cli: code } },
    });
  }
  return sinchClient.verification.verifications.reportPhoneCallByIdentity({
    endpoint: phoneNumber,
    reportPhoneCallVerificationByIdentityRequestBody: { phoneCall: { code } },
  });
}

Call startVerification when the user submits their phone number, and reportVerification when they submit the code:

// Step 1: user submits phone number
const started = await startVerification("+15559876543", "sms");
console.log(started.id);

// Step 2: user submits the code from their device
const result = await reportVerification("+15559876543", "123456", "sms");
if (result.status === "SUCCESSFUL") {
  // issue session token, complete login, approve account change, etc.
}

A successful start response:

{
  "id": "01ABC123DEF456GHI789JKL012",
  "method": "sms",
  "sms": {
    "template": "Your verification code is {{CODE}}. Verified by Sinch",
    "interceptionTimeout": 198
  }
}

The response includes additional fields (_links, etc.) that you don't need for the basic flow.

A successful report response:

{
  "id": "01ABC123DEF456GHI789JKL012",
  "method": "sms",
  "status": "SUCCESSFUL",
  "source": "manual"
}

If the code is wrong or expired, the SDK throws. Treat that as a prompt for the user to request a new code, not a retry of the same one.

Raw HTTP

If you're using Python, or prefer no runtime dependency in Node.js, the same flow works with plain HTTP. Each request must be signed with HMAC-SHA256 using your App Secret. The signature covers the HTTP method, a hash of the request body, the content type, a timestamp, and the request path. This lets the server verify the request hasn't been tampered with and isn't a replay.

Python

import hashlib
import hmac as hmac_mod
import base64
import json
import os
import re
import urllib.request
import urllib.error
import urllib.parse
from datetime import datetime, timezone

APP_KEY = os.environ["SINCH_APP_KEY"]
APP_SECRET = os.environ["SINCH_APP_SECRET"]

CONTENT_TYPE = "application/json; charset=UTF-8"
E164_REGEX = re.compile(r"^\+[1-9]\d{1,14}$")


def sign(method, path, body):
    timestamp = datetime.now(timezone.utc).isoformat()
    content_md5 = base64.b64encode(hashlib.md5(body.encode()).digest()).decode() if body else ""
    string_to_sign = f"{method}\n{content_md5}\n{CONTENT_TYPE}\nx-timestamp:{timestamp}\n{path}"
    signature = base64.b64encode(
        hmac_mod.new(base64.b64decode(APP_SECRET), string_to_sign.encode(), hashlib.sha256).digest()
    ).decode()
    return f"Application {APP_KEY}:{signature}", timestamp


def sinch_request(url, http_method, payload):
    path = urllib.parse.urlparse(url).path
    body = json.dumps(payload)
    auth, timestamp = sign(http_method, path, body)
    req = urllib.request.Request(
        url, data=body.encode(),
        headers={"Content-Type": CONTENT_TYPE, "Authorization": auth, "x-timestamp": timestamp},
        method=http_method,
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as e:
        raise RuntimeError(f"Sinch API error {e.code}: {e.read().decode()}")


def start_verification(phone_number, method="sms"):
    if not E164_REGEX.match(phone_number):
        raise ValueError(f"Invalid phone number format. Use E.164 (e.g. +15551234567), got: {phone_number}")
    return sinch_request(
        "https://verification.api.sinch.com/verification/v1/verifications", "POST",
        {"identity": {"type": "number", "endpoint": phone_number}, "method": method},
    )


def report_verification(phone_number, code, method="sms"):
    body = {"method": method}
    if method == "sms": body["sms"] = {"code": code}
    elif method == "flashcall": body["flashcall"] = {"cli": code}
    elif method == "callout": body["callout"] = {"code": code}
    return sinch_request(
        f"https://verification.api.sinch.com/verification/v1/verifications/number/{phone_number}",
        "PUT", body,
    )

Node.js (no SDK)

import { createHmac, createHash } from "crypto";

const APP_KEY = process.env.SINCH_APP_KEY!;
const APP_SECRET = process.env.SINCH_APP_SECRET!;

const CONTENT_TYPE = "application/json; charset=UTF-8";

function sign(method: string, path: string, body: string) {
  const timestamp = new Date().toISOString();
  const contentMd5 = body
    ? createHash("md5").update(body, "utf8").digest("base64")
    : "";
  const stringToSign = `${method}\n${contentMd5}\n${CONTENT_TYPE}\nx-timestamp:${timestamp}\n${path}`;
  const signature = createHmac("sha256", Buffer.from(APP_SECRET, "base64"))
    .update(stringToSign)
    .digest("base64");
  return { auth: `Application ${APP_KEY}:${signature}`, timestamp };
}

async function sinchRequest(method: string, url: string, body?: object) {
  const bodyStr = body ? JSON.stringify(body) : "";
  const path = new URL(url).pathname;
  const { auth, timestamp } = sign(method, path, bodyStr);
  const response = await fetch(url, {
    method,
    headers: { "Content-Type": CONTENT_TYPE, Authorization: auth, "x-timestamp": timestamp },
    ...(bodyStr && { body: bodyStr }),
  });
  if (!response.ok) throw new Error(`Sinch API error ${response.status}: ${await response.text()}`);
  return response.json();
}

export async function startVerification(phoneNumber: string, method = "sms") {
  return sinchRequest("POST", "https://verification.api.sinch.com/verification/v1/verifications", {
    identity: { type: "number", endpoint: phoneNumber },
    method,
  });
}

export async function reportVerification(phoneNumber: string, code: string, method = "sms") {
  const body: Record<string, unknown> = { method };
  if (method === "sms") body.sms = { code };
  else if (method === "flashcall") body.flashcall = { cli: code };
  else if (method === "callout") body.callout = { code };
  return sinchRequest(
    "PUT",
    `https://verification.api.sinch.com/verification/v1/verifications/number/${phoneNumber}`,
    body,
  );
}

Both functions raise on a non-2xx response. A wrong or expired code returns HTTP 400. Catch that and prompt the user to request a new code.

Sinch also publishes SDKs for Java and .NET if those are your languages.

Things worth knowing

Verifications expire

A started verification is only valid for a short window: a few minutes for SMS. If the user takes too long, the report call will fail. Build your UI to handle this: show a "resend code" option and treat expiry errors as a prompt to start fresh, not a permanent failure.

Phone numbers must be E.164

The phone number must be in E.164 format: +15551234567, not 5551234567 or (555) 123-4567. Both the SDK and raw HTTP examples validate this before calling the API. The Sinch Verification API won't normalize it for you.

FlashCall code format

For FlashCall, the code you pass to reportVerification is the full incoming caller ID: the complete phone number that called the device. Your mobile client captures this from the call log and passes it to your backend.

Data Verification requires cellular data

Data Verification only works when the device is on cellular data. If the user is on Wi-Fi, the carrier can't authenticate the device. Configure a fallback chain: attempt data first, then fall back to FlashCall or SMS. The API supports this natively so you don't have to retry manually.

Wrapping up

Two functions, a clear two-step flow, and support for four verification methods. The SDK version needs no auth code. Initialize once and call. The raw HTTP version adds no dependencies at all. Both drop into any backend regardless of framework or platform.

If you're building a mobile app, Sinch also publishes native SDKs for iOS and Android. They handle the client side of verification: automatically capturing the incoming caller ID for FlashCall so the user never has to read it, and managing the carrier network handshake for Data Verification. The backend code in this post pairs directly with those SDKs.

Additional Resources

What are you building that needs user verification? Let me know in the comments.

Sending SMS from AWS Lambda with the Sinch SDK

Gunnar Grosch — Sun, 07 Jun 2026 19:14:34 +0000

Your payment just failed. The user needs to know before they try to check out again. Your deployment pipeline just finished. The engineer who kicked it off is waiting. A patient's appointment is in 24 hours and they haven't confirmed.

These are the moments where SMS works better than email. It's immediate, it doesn't get filtered, and it doesn't require the recipient to be looking at an app. Lambda is already where your event-driven logic lives. Connecting it to SMS is a few dozen lines of code.

This post uses the official Sinch SDKs. The examples cover Node.js and Python. Sinch also publishes SDKs for Java and .NET that follow the same pattern. If you'd rather manage the HTTP calls and OAuth token exchange yourself, there's a companion post that covers the raw HTTP approach using the same repo.

Prerequisites

The Sinch dashboard setup takes about 10 minutes. After that, sending is a single SDK call. You'll need a Sinch account (sign up here, a trial account is enough to test, see pricing for details), the AWS SAM CLI, and Node.js 22+ or Python 3.13+.

When you first sign up, Sinch shows an onboarding wizard. If you completed it and selected SMS, you likely already have an app configured with a sender number. In that case, skip to step 4 below. If you dismissed the wizard or are not sure, follow all steps.

Before writing any code, configure the following in the Sinch Build Dashboard:

Get access to Conversation API. Click Conversation API in the left menu, accept the terms, and click GET ACCESS.
Create a Conversation API app. Go to Conversation API > Apps and click Create app. Record the app ID.
Enable the SMS channel on your app. Open the app, find SMS in the channel list, click Set up channel, and connect your service plan.
Find your sender number. Go to SMS > SMS Channel > Numbers. The assigned number is your SMS_SENDER value.
Note your project ID. Click the project name in the top bar and go to Project Settings.
Create an access key. Go to Settings > Access Keys. Record the access key ID and secret. The secret is only shown once.

Note: on a trial account, you can only send to verified numbers and the message content is fixed. To send to any number with custom content, upgrade your account.

One thing that trips people up: the Conversation API is regional. Your app must be created in the same region as your SMS service plan. If they're in different regions, messages will fail with no obvious error. The most common regions are us and eu. A Brazil region (br) also exists.

Storing credentials in SSM Parameter Store

The Lambda function reads your Sinch access key and secret from SSM Parameter Store at cold start. Storing them there as SecureString keeps them out of CloudFormation state and encrypted at rest using KMS.

Create the two parameters before deploying. Run these in the same AWS region where you'll deploy the Lambda:

aws ssm put-parameter \
  --name /sinch/access-key \
  --value "YOUR_ACCESS_KEY" \
  --type SecureString

aws ssm put-parameter \
  --name /sinch/access-key-secret \
  --value "YOUR_ACCESS_KEY_SECRET" \
  --type SecureString

The Lambda function

The SDK handles OAuth 2.0 token exchange internally, so there's no token caching code in the handler. What you do cache is the SinchClient instance itself. Initializing it fetches credentials from SSM, so you want that to happen once at cold start, not on every invocation.

TypeScript

import { SSMClient, GetParameterCommand } from "@aws-sdk/client-ssm";
import { SinchClient } from "@sinch/sdk-core";
import { SQSEvent, SQSBatchResponse, EventBridgeEvent, SNSEvent } from "aws-lambda";

const ssm = new SSMClient({});
const SINCH_REGION = process.env.SINCH_REGION || "us";
const SINCH_PROJECT_ID = process.env.SINCH_PROJECT_ID!;
const SINCH_APP_ID = process.env.SINCH_APP_ID!;
const SINCH_SMS_SENDER = process.env.SINCH_SMS_SENDER!;
const SINCH_ACCESS_KEY_PARAM = process.env.SINCH_ACCESS_KEY_PARAM!;
const SINCH_ACCESS_KEY_SECRET_PARAM = process.env.SINCH_ACCESS_KEY_SECRET_PARAM!;

let sinchClient: SinchClient | null = null;

async function getClient(): Promise<SinchClient> {
  if (sinchClient) return sinchClient;
  const [keyRes, secretRes] = await Promise.all([
    ssm.send(new GetParameterCommand({ Name: SINCH_ACCESS_KEY_PARAM, WithDecryption: true })),
    ssm.send(new GetParameterCommand({ Name: SINCH_ACCESS_KEY_SECRET_PARAM, WithDecryption: true })),
  ]);
  sinchClient = new SinchClient({
    projectId: SINCH_PROJECT_ID,
    keyId: keyRes.Parameter!.Value!,
    keySecret: secretRes.Parameter!.Value!,
    conversationRegion: SINCH_REGION as "us" | "eu",
  });
  return sinchClient;
}

const E164_REGEX = /^\+[1-9]\d{1,14}$/;

async function sendSms(to: string, message: string) {
  if (!E164_REGEX.test(to)) {
    throw new Error(`Invalid phone number format. Use E.164 (e.g. +15551234567), got: ${to}`);
  }
  const client = await getClient();
  return client.conversation.messages.send({
    sendMessageRequestBody: {
      app_id: SINCH_APP_ID,
      recipient: { identified_by: { channel_identities: [{ channel: "SMS", identity: to }] } },
      message: { text_message: { text: message } },
      channel_properties: { SMS_SENDER: SINCH_SMS_SENDER },
    },
  });
}

// --- Direct invocation (default) ---
export const handler = async (event: { to: string; message: string }) => {
  if (!event.to || !event.message) return { statusCode: 400, body: "Missing to or message" };
  try {
    return { statusCode: 200, body: await sendSms(event.to, event.message) };
  } catch (err) {
    console.error(err);
    return { statusCode: 500, body: String(err) };
  }
};

// --- SQS (uncomment to use) ---
// Message body format: { "to": "+15551234567", "message": "Hello!" }
// Returns batchItemFailures so only failed records are retried, preventing duplicate sends.
//
// export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
//   const failures: { itemIdentifier: string }[] = [];
//   for (const record of event.Records) {
//     try {
//       const payload = JSON.parse(record.body);
//       await sendSms(payload.to, payload.message);
//     } catch (err) {
//       console.error(`Failed record ${record.messageId}:`, err);
//       failures.push({ itemIdentifier: record.messageId });
//     }
//   }
//   return { batchItemFailures: failures };
// };

// --- EventBridge (uncomment to use) ---
// Event detail format: { "to": "+15551234567", "message": "Hello!" }
//
// export const handler = async (
//   event: EventBridgeEvent<"SendSms", { to: string; message: string }>
// ) => {
//   await sendSms(event.detail.to, event.detail.message);
// };

// --- SNS (uncomment to use) ---
// SNS message body format: { "to": "+15551234567", "message": "Hello!" }
//
// export const handler = async (event: SNSEvent) => {
//   for (const record of event.Records) {
//     const payload = JSON.parse(record.Sns.Message);
//     await sendSms(payload.to, payload.message);
//   }
// };

Python

import json
import os
import re
import boto3
from sinch import SinchClient
from sinch.core.exceptions import SinchException

ssm = boto3.client("ssm")

SINCH_REGION = os.environ.get("SINCH_REGION", "us")
SINCH_PROJECT_ID = os.environ["SINCH_PROJECT_ID"]
SINCH_APP_ID = os.environ["SINCH_APP_ID"]
SINCH_SMS_SENDER = os.environ["SINCH_SMS_SENDER"]
SINCH_ACCESS_KEY_PARAM = os.environ["SINCH_ACCESS_KEY_PARAM"]
SINCH_ACCESS_KEY_SECRET_PARAM = os.environ["SINCH_ACCESS_KEY_SECRET_PARAM"]

E164_REGEX = re.compile(r"^\+[1-9]\d{1,14}$")

_sinch_client = None


def get_client():
    global _sinch_client
    if _sinch_client:
        return _sinch_client
    key = ssm.get_parameter(Name=SINCH_ACCESS_KEY_PARAM, WithDecryption=True)["Parameter"]["Value"]
    secret = ssm.get_parameter(Name=SINCH_ACCESS_KEY_SECRET_PARAM, WithDecryption=True)["Parameter"]["Value"]
    _sinch_client = SinchClient(
        project_id=SINCH_PROJECT_ID,
        key_id=key,
        key_secret=secret,
        conversation_region=SINCH_REGION,
    )
    return _sinch_client


def send_sms(to, message):
    if not E164_REGEX.match(to):
        raise ValueError(f"Invalid phone number format. Use E.164 (e.g. +15551234567), got: {to}")
    client = get_client()
    return client.conversation.messages.send_text_message(
        app_id=SINCH_APP_ID,
        text=message,
        recipient_identities=[{"channel": "SMS", "identity": to}],
        channel_properties={"SMS_SENDER": SINCH_SMS_SENDER},
    )


# --- Direct invocation (default) ---
def handler(event, context):
    to = event.get("to")
    message = event.get("message")
    if not to or not message:
        return {"statusCode": 400, "body": "Missing to or message"}
    try:
        result = send_sms(to, message)
        return {"statusCode": 200, "body": {"message_id": result.message_id, "accepted_time": str(result.accepted_time)}}
    except Exception as e:
        if isinstance(e, SinchException):
            print(f"Sinch API error {e.response_status_code}: {e}")
            return {"statusCode": e.response_status_code or 500, "body": str(e)}
        print(f"Error: {e}")
        return {"statusCode": 500, "body": str(e)}


# --- SQS (uncomment to use) ---
# Message body format: {"to": "+15551234567", "message": "Hello!"}
# Returns batchItemFailures so only failed records are retried, preventing duplicate sends.
#
# def handler(event, context):
#     failures = []
#     for record in event["Records"]:
#         try:
#             payload = json.loads(record["body"])
#             send_sms(payload["to"], payload["message"])
#         except Exception as e:
#             print(f"Failed record {record['messageId']}: {e}")
#             failures.append({"itemIdentifier": record["messageId"]})
#     return {"batchItemFailures": failures}


# --- EventBridge (uncomment to use) ---
# Event detail format: {"to": "+15551234567", "message": "Hello!"}
#
# def handler(event, context):
#     detail = event["detail"]
#     send_sms(detail["to"], detail["message"])


# --- SNS (uncomment to use) ---
# SNS message body format: {"to": "+15551234567", "message": "Hello!"}
#
# def handler(event, context):
#     for record in event["Records"]:
#         payload = json.loads(record["Sns"]["Message"])
#         send_sms(payload["to"], payload["message"])

Compared to the raw HTTP version, there's no getAccessToken() function and no token cache. The SDK handles the OAuth 2.0 exchange internally and refreshes tokens as needed. The tradeoff: you add @sinch/sdk-core (Node.js) or sinch (Python) as a runtime dependency. The Java and .NET SDKs work the same way if those are your languages.

The SAM template

The repo uses a single template.yaml at the root that deploys all four functions together. Here's the relevant section for the Node.js SDK function:

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Timeout: 10
    MemorySize: 256
    Environment:
      Variables:
        SINCH_REGION: !Ref SinchRegion
        SINCH_PROJECT_ID: !Ref SinchProjectId
        SINCH_APP_ID: !Ref SinchAppId
        SINCH_SMS_SENDER: !Ref SinchSmsSender
        SINCH_ACCESS_KEY_PARAM: !Ref SinchAccessKeyParam
        SINCH_ACCESS_KEY_SECRET_PARAM: !Ref SinchAccessKeySecretParam

Parameters:
  SinchRegion:
    Type: String
    Default: us
    AllowedValues: [us, eu]
  # ... SinchProjectId, SinchAppId, SinchSmsSender,
  #     SinchAccessKeyParam, SinchAccessKeySecretParam

Resources:
  NodeSdkFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: nodejs22.x
      Handler: src/handler.handler
      CodeUri: node-sdk/
      Policies:
        - Statement:
            - Effect: Allow
              Action: ssm:GetParameter
              Resource:
                - !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter${SinchAccessKeyParam}"
                - !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter${SinchAccessKeySecretParam}"
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        Minify: false
        Target: es2022
        EntryPoints:
          - node-sdk/src/handler.ts

Deploy from the repo root:

sam build
sam deploy --guided

SAM handles all dependencies automatically during sam build: Node.js packages via npm install, Python packages via requirements.txt.

Invoking the function

With no event source attached, you invoke the function directly with a JSON payload. The simplest way is the AWS CLI:

# Get the function name from the stack outputs after sam deploy --guided,
# then invoke it:
aws lambda invoke \
  --function-name <NodeSdkFunctionName from outputs> \
  --payload '{"to": "+15559876543", "message": "Hello from Lambda!"}' \
  --cli-binary-format raw-in-base64-out \
  response.json

cat response.json

A successful response looks like:

{
  "statusCode": 200,
  "body": {
    "accepted_time": "2026-05-27T18:00:00.000Z",
    "message_id": "01ABC123DEF456GHI789JKL012"
  }
}

message_id is what you'd use to correlate delivery receipts if you set up a webhook.

If something goes wrong:

{
  "statusCode": 400,
  "body": "{\"error\":{\"code\":400,\"message\":\"recipient identity is not a valid MSISDN\",\"status\":\"INVALID_ARGUMENT\"}}"
}

Common causes: malformed phone number, wrong app_id, region mismatch between your app and service plan, or missing SMS_SENDER without a default originator configured.

When you're ready to wire it up to something, change only the handler function. sendSms() / send_sms() doesn't change. The SAM template has commented event source blocks for SQS, EventBridge, SNS, and API Gateway ready to uncomment:

SQS for queued, rate-controlled sending
EventBridge for event-driven notifications (order placed, alarm fired, job completed)
SNS for fan-out scenarios
API Gateway if you need an HTTP endpoint

Testing locally

echo '{"to": "+15559876543", "message": "Test"}' > event.json
sam local invoke NodeSdkFunction --event event.json
# or
sam local invoke PythonSdkFunction --event event.json

Note: sam local invoke won't be able to reach SSM or the Sinch API unless your local environment has AWS credentials with the right permissions. For local testing without live credentials, stub the SSM calls or use environment variables directly.

Things worth knowing

Phone numbers must be E.164

The to value must be in E.164 format: +15551234567, not 5551234567 or (555) 123-4567. The handler validates this before calling the SDK. The Sinch API won't normalize it for you.

Long messages get split

SMS has a 160-character limit for standard GSM encoding. Longer messages get split into multiple parts. You can cap this with SMS_MAX_NUMBER_OF_MESSAGE_PARTS in channel_properties:

"channel_properties": {
  "SMS_SENDER": "+15551231234",
  "SMS_MAX_NUMBER_OF_MESSAGE_PARTS": "2"
}

If the message would require more parts than the cap, the API rejects it. It won't truncate or silently drop it. You'll get an error back, so you can handle it in your application.

Error handling and delivery status

When sendSms() throws, the handler logs the error and returns a 500. For SQS triggers, implement partial batch failure reporting to avoid retrying already-sent messages: catch errors per record and return { batchItemFailures: [{ itemIdentifier: record.messageId }] }.

For true end-to-end idempotency, you need deduplication before the Sinch call since the Conversation API doesn't support idempotency keys. Use an SQS FIFO queue with a MessageDeduplicationId derived from your business event, or the Powertools for AWS Lambda idempotency utility which handles a DynamoDB check-before-send with a decorator.

For delivery status, the Conversation API sends a DELIVERED or FAILED callback to a webhook URL you configure on your app. Register one in the Sinch Build Dashboard or via the Webhooks API.

Wrapping up

The SDK version is shorter than the raw HTTP version: no token management, no fetch calls, just client.conversation.messages.send(). The cost is a runtime dependency. Whether that tradeoff is worth it depends on your project.

Sinch also publishes SDKs for Java and .NET. The same pattern applies.

The natural next step is delivery receipts: configuring a webhook so your app knows whether each message actually reached the recipient's device. After that, if you need to send to many recipients, wiring an SQS queue in front of the function gives you rate control and automatic retries without changing the handler.

The source code for this post is available on GitHub: gunnargrosch/sinch-sms-lambda

What are you building that needs SMS notifications? Let me know in the comments.

Additional Resources

Two-Way SMS Conversations on AWS

Gunnar Grosch — Mon, 01 Jun 2026 19:39:17 +0000

A customer gets your appointment reminder and texts back "Confirmed" or "Can we reschedule?" Someone texts your support number asking about their order. A user responds to your verification flow with a question. Something needs to receive each of these, understand the context, and respond. That's the gap between broadcasting SMS and actually having a conversation.

Two scenarios matter here. The customer texts you first: a question, a support request, a reply to a notification from a different channel. Or your application sends the opening message: an appointment reminder, a verification code, an order update. Either way, the customer can reply. Both flows use the same webhook. What changes is whether your application fires the first shot.

This post builds that on AWS serverless using the Sinch Conversation API. If you've read the previous posts in this series, you'll recognize the SSM credential pattern. This post stands on its own if you're coming in fresh.

The architecture

DLQs not shown. Both queues have a dead-letter queue that captures messages that fail after three attempts.

SenderFunction: reads from the outbound queue, sends an SMS via the Conversation API, calls storeConversation with the messageId. All outbound messages flow through here, whether they come from your application, a human agent, or the processor.

WebhookFunction: validates the HMAC signature, extracts the sender, text, and conversationId, puts them on the inbound queue, returns 200. Fast, never fails because of slow downstream logic.

ProcessorFunction: reads from the inbound queue, calls handleInboundMessage to decide what to reply, puts the reply on the outbound queue. Has 60 seconds to work. No Sinch credentials needed here. If it fails, SQS retries automatically.

This means your processor never touches the Sinch API directly. A human agent, a support tool, or any other system can send messages through the same outbound queue.

The webhook function

Validates the signature, rejects stale requests, and enqueues. Nothing else.

The Function URL uses AuthType: NONE because Sinch needs to reach it without an AWS auth header. The HMAC signature check covers two things: the payload hasn't been tampered with, and the request was sent recently. The timestamp header is signed alongside the body, so an attacker can't replay a captured request after the 5-minute window closes. Without these checks, anyone who knows the URL can send arbitrary payloads.

import { createHmac, timingSafeEqual } from "crypto";
import { SSMClient, GetParameterCommand } from "@aws-sdk/client-ssm";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const ssm = new SSMClient({});
const sqs = new SQSClient({});
const SINCH_WEBHOOK_SECRET_PARAM = process.env.SINCH_WEBHOOK_SECRET_PARAM!;
const QUEUE_URL = process.env.QUEUE_URL!;
const TIMESTAMP_TOLERANCE_SECONDS = 300; // 5 minutes

let webhookSecret: string | null = null;

async function getWebhookSecret(): Promise<string> {
  if (webhookSecret) return webhookSecret;
  const res = await ssm.send(
    new GetParameterCommand({ Name: SINCH_WEBHOOK_SECRET_PARAM, WithDecryption: true })
  );
  webhookSecret = res.Parameter!.Value!;
  return webhookSecret;
}

function validateTimestamp(timestamp: string): boolean {
  const ts = parseInt(timestamp, 10);
  if (isNaN(ts)) return false;
  return Math.abs(Math.floor(Date.now() / 1000) - ts) <= TIMESTAMP_TOLERANCE_SECONDS;
}

function validateSignature(
  body: string, signature: string, nonce: string, timestamp: string, secret: string
): boolean {
  const signedData = body + "." + nonce + "." + timestamp;
  const expected = createHmac("sha256", secret).update(signedData).digest("base64");
  return signature.length === expected.length &&
    timingSafeEqual(Buffer.from(signature), Buffer.from(expected));
}

export const handler = async (event: { headers: Record<string, string>; body: string }) => {
  const body = event.body;
  const signature = event.headers["x-sinch-webhook-signature"];
  const nonce = event.headers["x-sinch-webhook-signature-nonce"];
  const timestamp = event.headers["x-sinch-webhook-signature-timestamp"];

  if (!signature || !nonce || !timestamp) {
    return { statusCode: 401, body: "Missing signature headers" };
  }

  if (!validateTimestamp(timestamp)) {
    console.error("Webhook timestamp outside acceptable window");
    return { statusCode: 401, body: "Request expired" };
  }

  const secret = await getWebhookSecret();

  if (!validateSignature(body, signature, nonce, timestamp, secret)) {
    console.error("Invalid webhook signature");
    return { statusCode: 401, body: "Invalid signature" };
  }

  const payload = JSON.parse(body);

  if (payload.message && payload.message.direction === "TO_APP") {
    const rawFrom = payload.message.channel_identity?.identity;
    const from = rawFrom && !rawFrom.startsWith('+') ? `+${rawFrom}` : rawFrom;
    const text = payload.message.contact_message?.text_message?.text;

    if (from && text) {
      await sqs.send(new SendMessageCommand({
        QueueUrl: QUEUE_URL,
        MessageBody: JSON.stringify({
          from,
          text,
          messageId: payload.message.id,
          conversationId: payload.message.conversation_id,
        }),
      }));
    }
  }

  return { statusCode: 200, body: "OK" };
};

The processor function

Reads from the inbound queue, runs your logic, and enqueues the reply. This is where your application lives.

import { SQSEvent } from "aws-lambda";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const OUTBOUND_QUEUE_URL = process.env.OUTBOUND_QUEUE_URL!;

async function getConversation(from: string): Promise<Record<string, string> | null> {
  // Add your retrieval logic here, e.g. read from DynamoDB by 'from' (phone number)
  // to find what your application last sent to this customer.
  console.log("getConversation", { from });
  return null;
}

// Add your application logic here. The function signature and SQS wiring are done.
async function handleInboundMessage(from: string, text: string, conversationId: string): Promise<string> {
  const context = await getConversation(from);
  // context contains what your application originally sent, if you stored it in storeConversation.
  // For customer-initiated messages, context will be null. Treat them as fresh inquiries.
  return `Thanks for your message! You said: "${text}"`;
}

export const handler = async (event: SQSEvent) => {
  for (const record of event.Records) {
    const { from, text, conversationId } = JSON.parse(record.body);
    try {
      const reply = await handleInboundMessage(from, text, conversationId);
      await sqs.send(new SendMessageCommand({
        QueueUrl: OUTBOUND_QUEUE_URL,
        MessageBody: JSON.stringify({ to: from, message: reply }),
      }));
    } catch (err) {
      console.error(`Failed to process message from ${from}:`, err);
      throw err; // Let SQS retry
    }
  }
};

handleInboundMessage is where your logic goes. It calls getConversation first: a stub that looks up by the sender's phone number. If your application sent the first message and stored it in storeConversation, that's how you retrieve what was originally sent. For customer-initiated messages, getConversation returns null and you handle it as a fresh inquiry.

The sender function

SenderFunction is triggered by the outbound queue. Any message put on that queue gets sent to the recipient via Sinch. The processor puts replies there. Your application puts company-initiated messages there. A human agent's tool puts messages there. One function handles all of it.

Sinch credentials and auth logic live in exactly one place. Anything that needs to send a message puts { to, message } on the queue. No knowledge of the Sinch API required.

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const { to, message } = JSON.parse(record.body);
    // ... E164 validation, auth, Sinch API call ...
    const data = await response.json() as { message_id: string; accepted_time: string };
    await storeConversation(data.message_id, to, message);
  }
};

The messageId comes back from Sinch after every send. storeConversation is a stub: add your DynamoDB write there, keyed by the recipient's phone number, so you can look up what was sent when they reply.

To kick off a conversation from your application, put a message on the outbound queue:

aws sqs send-message \
  --queue-url YOUR_OUTBOUND_QUEUE_URL \
  --message-body '{"to": "+15559876543", "message": "Your appointment is tomorrow at 2pm. Reply to confirm."}'

YOUR_OUTBOUND_QUEUE_URL is in the deploy outputs as OutboundQueueUrl.

What an inbound message looks like

The webhook receives this from Sinch:

{
  "message": {
    "id": "01EXAMPLE8235TD19N21XQTH12B",
    "direction": "TO_APP",
    "contact_message": {
      "text_message": {
        "text": "What's my order status?"
      }
    },
    "channel_identity": {
      "channel": "SMS",
      "identity": "+15559876543"
    },
    "conversation_id": "01EXAMPLECONV172WMDB8008EFT"
  }
}

The webhook extracts from, text, and conversationId and puts them on the queue. The processor picks them up.

Deploying

If you followed the sending SMS post, your Sinch app is already configured. Skip to the SSM parameter setup below.

If you're starting here, configure the following in the Sinch Build Dashboard first:

Get access to the Conversation API. Click Conversation API in the left menu, accept the terms, and click Get Access.
Create a Conversation API app. Go to Conversation API > Apps and click Create app. Record the app ID.
Switch the app to Conversation mode. New apps default to Dispatch mode. Open your app, go to Settings, and change the processing mode to Conversation. This enables the conversation_id field in webhook payloads, which Sinch uses to track threads on their side. The conversationId is available in your processor if you want to use it as a thread key. This example keys on phone number instead, since that's available in both directions: Sinch returns it on inbound webhooks, and you already know the recipient when sending outbound. If you expect one active thread per number, phone number is the simpler choice.
Enable the SMS channel on your app. Open the app, find SMS in the channel list, click Set up channel, and connect your service plan.
Find your sender number. Go to SMS > Numbers. The assigned number is your SMS sender.
Note your project ID. Click the project name in the top bar and go to Project Settings.
Create an access key. Go to Settings > Access Keys. Record the key ID and secret. The secret is only shown once.

You'll need three things in SSM Parameter Store before deploying. If you followed the sending SMS post, your access key and secret are already there. The webhook secret is new to this post.

# Skip these two if you already have them from the sending SMS post
aws ssm put-parameter --name /sinch/access-key --value "YOUR_ACCESS_KEY" --type SecureString
aws ssm put-parameter --name /sinch/access-key-secret --value "YOUR_ACCESS_KEY_SECRET" --type SecureString

# New: a dedicated secret for this webhook's HMAC signature validation.
# Use a different value from any secret you set for the delivery receipts webhook.
aws ssm put-parameter \
  --name /sinch/conversation-webhook-secret \
  --value "your-webhook-secret-here" \
  --type SecureString

Then build and deploy from your preferred language directory:

cd typescript   # or: cd python
sam build
sam deploy --guided

sam deploy --guided will prompt for SinchRegion, SinchProjectId, SinchAppId, and SinchSmsSender. After the deploy completes, the outputs include WebhookUrl and OutboundQueueUrl.

Registering the webhook

Register the Function URL with Sinch so it starts receiving inbound message callbacks. You can do this in the Sinch Build Dashboard (Conversation API > Apps > your app > Webhooks) or via the API.

Get an OAuth 2.0 token:

curl https://auth.sinch.com/oauth2/token \
  -d grant_type=client_credentials \
  -u YOUR_ACCESS_KEY:YOUR_ACCESS_KEY_SECRET

Then register:

curl -X POST \
  "https://us.conversation.api.sinch.com/v1/projects/YOUR_PROJECT_ID/webhooks" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "app_id": "YOUR_APP_ID",
    "target": "YOUR_FUNCTION_URL",
    "target_type": "HTTP",
    "secret": "your-webhook-secret-here",
    "triggers": ["MESSAGE_INBOUND"]
  }'

The secret must match what you stored in SSM at /sinch/conversation-webhook-secret. Replace us with eu if your app is in the EU region.

Testing

Tail the logs first:

sam logs --stack-name YOUR_STACK_NAME --tail

Flow 1: Your application sends first

Put a message on the outbound queue using the URL from the deploy outputs:

aws sqs send-message \
  --queue-url YOUR_OUTBOUND_QUEUE_URL \
  --message-body '{"to": "+15559876543", "message": "Your appointment is tomorrow at 2pm. Reply to confirm."}'

The customer gets the SMS. When they reply, you'll see the webhook log the inbound enqueue, the processor log the outbound enqueue, and the sender log the Sinch call. Your phone gets the response.

Flow 2: Customer sends first

Text your Sinch number directly. The webhook logs the incoming message, the processor enqueues the reply, the sender logs the Sinch call, your phone gets the response.

Things worth knowing

Why three functions and two queues?

The webhook must return 200 quickly or Sinch retries. If your AI agent takes 10 seconds, the webhook would time out. The inbound queue decouples the two: the webhook always returns quickly and the processor gets its own 60-second window with automatic retries on failure.

The outbound queue does the same for sending. The processor doesn't need Sinch credentials or auth logic. It just enqueues { to, message }. The sender handles the Sinch API call in one place, which means your application backend and any human agent tools can send messages the same way without reimplementing credentials.

Retries are automatic

If the processor throws an error, SQS makes the message visible again after the visibility timeout. The visibility timeout is set to 360 seconds, six times the processor's 60-second Lambda function timeout. That gap is intentional: it prevents SQS from redelivering a message while the processor is still running it.

After three failures, the message moves to a dead-letter queue. Both the inbound and outbound queues have one. Monitor them for processing failures: messages there didn't make it through after three attempts.

Sinch stores conversation history too

In Conversation mode, Sinch keeps a copy of all messages for the retention period configured on your app (7 days by default, up to 180 days). You can query it via the Conversation API. That's useful for debugging and audit trails, but don't rely on it as your application's source of truth. Implement your own storage if you need conversation context beyond the retention window.

Duplicates are possible

SQS is at-least-once delivery. Your processor might see the same message twice. Use the messageId field as an idempotency key if your logic has side effects.

Non-text messages are ignored

The webhook extracts contact_message?.text_message?.text using optional chaining. If Sinch delivers a non-text message type (media, location, template), the text field is undefined and the message is silently dropped before it reaches the queue. That's the right default for a text conversation app. If your use case requires handling other types, add explicit branching in the webhook before the enqueue step.

This is the foundation for conversational AI

The processor function is where you plug in your agent. For multi-turn conversations that need to wait for replies and branch based on responses, Lambda Durable Functions let you write that as sequential code that suspends and resumes.

Wrapping up

You now have the full loop: user texts you, the webhook validates and enqueues, the processor generates a reply and sends it back. Clean separation, automatic retries, room for slow processing.

If you also need to know whether your outbound messages were delivered, see the delivery receipts post.

The source code for this post is available on GitHub: sinch-sms-conversation

The repo has a typescript/ and python/ directory, each fully self-contained with its own template.yaml. Pick your language and deploy from that directory.

What are you building with two-way SMS? An AI support agent? Appointment confirmations? Let me know in the comments.

Additional Resources

SMS Delivery Receipts on AWS Lambda

Gunnar Grosch — Fri, 29 May 2026 13:12:50 +0000

You sent an SMS. Did it arrive?

A password reset link that never reached the user. An appointment reminder that bounced because the number was disconnected. A fraud alert that failed silently while the transaction went through.

The previous post covered sending SMS from Lambda. But messages:send returning 200 only means Sinch accepted the message. It doesn't mean the recipient's phone received it. Delivery receipts close that gap. They tell you whether the message was delivered, and if not, why it failed. You can retry, fall back to email, alert an operator, or update a record. On the success side, you can mark a notification as confirmed, start a countdown timer for a response, or log proof of delivery for compliance.

The Sinch Conversation API delivers status updates as webhook callbacks. You register a URL, Sinch POSTs to it every time a message changes state, and your function logs or acts on the result. This post builds that webhook endpoint.

How delivery receipts work

After you send a message, it moves through a series of states:

QUEUED_ON_CHANNEL: Sinch accepted it and dispatched it to the SMS carrier
DELIVERED: the carrier confirmed it reached the recipient's device
FAILED: delivery failed (with a reason code explaining why)
READ: the recipient opened/read it (rare for SMS, more common on WhatsApp/RCS)

Each state change triggers a MESSAGE_DELIVERY callback to your webhook. A single sent message typically generates 2-3 callbacks as it moves through these states.

HMAC signature validation

When you create a webhook with a secret, Sinch signs every callback. Your function must verify the signature before processing anything. If it doesn't match, return 401 and ignore the payload.

The signature is: HMAC-SHA256(secret, body + "." + nonce + "." + timestamp), base64-encoded.

import { createHmac, timingSafeEqual } from "crypto";

function validateSignature(
  body: string, signature: string, nonce: string, timestamp: string, secret: string
): boolean {
  const signedData = body + "." + nonce + "." + timestamp;
  const expected = createHmac("sha256", secret).update(signedData).digest("base64");
  return signature.length === expected.length &&
    timingSafeEqual(Buffer.from(signature), Buffer.from(expected));
}

The Lambda handler

import { createHmac, timingSafeEqual } from "crypto";
import { SSMClient, GetParameterCommand } from "@aws-sdk/client-ssm";

const ssm = new SSMClient({});
const SINCH_WEBHOOK_SECRET_PARAM = process.env.SINCH_WEBHOOK_SECRET_PARAM!;

let webhookSecret: string | null = null;

async function getWebhookSecret(): Promise<string> {
  if (webhookSecret) return webhookSecret;
  const res = await ssm.send(
    new GetParameterCommand({ Name: SINCH_WEBHOOK_SECRET_PARAM, WithDecryption: true })
  );
  webhookSecret = res.Parameter!.Value!;
  return webhookSecret;
}

export const handler = async (event: { headers: Record<string, string>; body: string }) => {
  const body = event.body;
  const signature = event.headers["x-sinch-webhook-signature"];
  const nonce = event.headers["x-sinch-webhook-signature-nonce"];
  const timestamp = event.headers["x-sinch-webhook-signature-timestamp"];

  if (!signature || !nonce || !timestamp) {
    return { statusCode: 401, body: "Missing signature headers" };
  }

  const secret = await getWebhookSecret();

  if (!validateSignature(body, signature, nonce, timestamp, secret)) {
    console.error("Invalid webhook signature");
    return { statusCode: 401, body: "Invalid signature" };
  }

  const payload = JSON.parse(body);

  if (payload.message_delivery_report) {
    const report = payload.message_delivery_report;
    handleDeliveryReceipt(report.message_id, report.status, report.reason);
  }

  return { statusCode: 200, body: "OK" };
};

function validateSignature(
  body: string, signature: string, nonce: string, timestamp: string, secret: string
): boolean {
  const signedData = body + "." + nonce + "." + timestamp;
  const expected = createHmac("sha256", secret).update(signedData).digest("base64");
  return signature.length === expected.length &&
    timingSafeEqual(Buffer.from(signature), Buffer.from(expected));
}

// Replace this with your application logic
function handleDeliveryReceipt(messageId: string, status: string, reason?: any) {
  console.log(`Message ${messageId}: ${status}`);

  if (status === "FAILED") {
    // Alert, retry on a different channel, update a database record, etc.
    console.error(`Delivery failed:`, reason);
  }
}

handleDeliveryReceipt is where your logic goes. The demo logs the status. In production you might update a database record, trigger an alert on failure, or retry via a different channel.

What a delivery receipt looks like

Here's an actual callback payload for a successfully delivered message:

{
  "app_id": "01EXAMPLE1NFQ9H0N6HBWPTB10A",
  "accepted_time": "2026-05-28T09:34:19.812Z",
  "event_time": "2026-05-28T09:34:21.267Z",
  "project_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "message_delivery_report": {
    "message_id": "01EXAMPLE54NBHBQZDHHHPYQ3FS",
    "conversation_id": "01EXAMPLEBTR4C1HJZS05PVTXZ8",
    "status": "DELIVERED",
    "channel_identity": {
      "channel": "SMS",
      "identity": "+15559876543",
      "app_id": ""
    },
    "contact_id": "01EXAMPLEN3NANZ05VM0FS80EHD",
    "metadata": "",
    "processing_mode": "CONVERSATION"
  }
}

And here's a failed delivery:

{
  "message_delivery_report": {
    "message_id": "01EQBF0BT63J7S1FEKJZ0Z08VD",
    "status": "FAILED",
    "reason": {
      "code": "RECIPIENT_NOT_REACHABLE",
      "description": "The channel reported: Unable to deliver message to recipient",
      "sub_code": "UNSPECIFIED_SUB_CODE",
      "channel_code": "150"
    }
  }
}

Common failure codes: RECIPIENT_NOT_REACHABLE, RECIPIENT_INVALID_CHANNEL_IDENTITY, OUTSIDE_ALLOWED_SENDING_WINDOW, CHANNEL_FAILURE.

Correlating receipts to your business logic

The delivery receipt includes the message_id from when you sent the message. If you set message_metadata in your original messages:send request, it's also included in the receipt. This is how you tie a delivery status back to a specific order, alert, or notification in your system.

The triggers array controls which callback types Sinch sends to this URL. MESSAGE_DELIVERY means "only delivery receipts." Other triggers you might use:

MESSAGE_INBOUND: a user sent you a message (covered in the companion post about two-way conversations)
EVENT_INBOUND: typing indicators and other events from users
CONVERSATION_START / CONVERSATION_STOP: conversation lifecycle events
CONTACT_CREATE: a new contact was created

You can subscribe to multiple triggers on the same webhook, or create separate webhooks for different triggers. Up to 5 webhooks per app.

Setting up the webhook endpoint

A Lambda Function URL is the simplest way to expose a Lambda as an HTTPS endpoint for Sinch to call. No API Gateway needed.

DeliveryReceiptFunction:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: nodejs22.x
    Handler: src/delivery-receipt.handler
    CodeUri: webhook/
    FunctionUrlConfig:
      AuthType: NONE
    Environment:
      Variables:
        SINCH_WEBHOOK_SECRET_PARAM: !Ref SinchWebhookSecretParam

The complete template (with Parameters and Outputs) is in the repo. The snippet above shows the key parts.

AuthType: NONE makes the URL publicly accessible. That's why HMAC validation is required.

Deploying

Store the webhook secret in SSM (same region where you'll deploy):

aws ssm put-parameter \
  --name /sinch/webhook-secret \
  --value "your-webhook-secret-here" \
  --type SecureString

Then build and deploy:

sam build
sam deploy --guided

After the deploy completes, sam deploy prints the stack outputs including your Function URL. Copy it for the next step.

Registering the webhook

Register the Function URL with Sinch so it starts receiving delivery receipt callbacks. You can do this via the dashboard (Conversation API > Apps > your app > Webhooks) or via the API.

First, get an OAuth 2.0 token:

curl https://auth.sinch.com/oauth2/token \
  -d grant_type=client_credentials \
  -u YOUR_ACCESS_KEY:YOUR_ACCESS_KEY_SECRET

Then register the webhook using the access_token from the response:

curl -X POST \
  "https://us.conversation.api.sinch.com/v1/projects/YOUR_PROJECT_ID/webhooks" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "app_id": "YOUR_APP_ID",
    "target": "YOUR_FUNCTION_URL",
    "target_type": "HTTP",
    "secret": "your-webhook-secret-here",
    "triggers": ["MESSAGE_DELIVERY"]
  }'

You can also do this in the Sinch Build Dashboard under Conversation API > Apps > your app > Webhooks.

The secret value here must match what you stored in SSM at /sinch/webhook-secret.

Watching the logs

After registering the webhook, send an SMS through your Sinch Conversation app (using the Lambda function from the previous post, the Sinch dashboard, or the API directly) and tail the logs:

sam logs --stack-name sinch-sms-delivery-receipts --tail

You should see delivery receipt callbacks arriving within seconds of sending a message.

Things worth knowing

Multiple callbacks per message

Each sent message generates 2-3 callbacks. First QUEUED_ON_CHANNEL, then DELIVERED or FAILED. Don't treat the first callback as the final status. This means your function is invoked 2-3 times per message you send. If you're sending 1000 messages, expect 2000-3000 invocations for delivery receipts alone. At Lambda pricing this is negligible, but worth knowing when you look at your metrics.

Return 200 immediately

Sinch expects a quick response. For simple logging or a single database write, processing inline is fine. If your logic involves external API calls or complex processing, accept the callback immediately, write the payload to SQS, and return 200. A separate function processes the queue.

Callbacks can arrive out of order

A DELIVERED callback might arrive before QUEUED_ON_CHANNEL due to network timing. If your logic depends on ordering, track state per message ID rather than assuming sequential delivery.

Duplicates are possible

Network issues can cause the same callback to arrive more than once. Use message_delivery_report.message_id + status as a deduplication key.

Retries

Sinch retries on 5xx, 429, and network failures with exponential backoff. 4xx (except 429) is treated as a permanent failure and not retried. If your function is down, you'll miss callbacks permanently once retries are exhausted.

Wrapping up

You now know whether your messages are reaching recipients. The handler validates the webhook signature, parses the delivery report, and routes to a function you control.

The natural next step is receiving inbound messages: what happens when the user replies? That's covered in the companion post about two-way SMS conversations.

The source code for this post is available on GitHub: gunnargrosch/sinch-sms-delivery-receipts

The repo also includes a Python implementation of the same handler.

Have you built retry logic on top of delivery receipts, or are you using them differently? I'd like to hear about it in the comments.

Additional Resources

Sending SMS from AWS Lambda

Gunnar Grosch — Thu, 28 May 2026 16:53:37 +0000

A user places an order. Your backend processes it, updates the database, and fires an event. Somewhere in that chain, you want to send them a text message: "Your order is confirmed. It ships tomorrow."

Or maybe it's 2 AM and a CloudWatch alarm just fired. You need an on-call engineer to know about it right now, not when they next check their email.

Or a long-running job just finished processing a large file. The user who submitted it hours ago should know it's ready.

In all of these cases, the pattern is the same: something happens, and you need to send an SMS as part of the response. Lambda is the natural fit for event-driven logic, whether the trigger comes from within AWS or from an external system. You just need a way to get a text message out.

Sinch's Conversation API handles SMS, WhatsApp, RCS, and more through a single endpoint. Your Lambda function won't change if you add channels later.

The post walks through building a Lambda function in TypeScript and Python that sends an SMS using the Sinch Conversation API. Credentials are stored in SSM Parameter Store. The handler is structured so it can be wired to whatever triggers you need: direct invocation, SQS, EventBridge, SNS, or API Gateway.

Prerequisites

The Sinch dashboard setup takes about 10 minutes. After that, sending is a single API call. You'll need a Sinch account (sign up here, a trial account is enough to test, see pricing for details), the AWS SAM CLI, and Node.js 22+ or Python 3.13+.

Before writing any code, configure the following in the Sinch Build Dashboard:

Get access to Conversation API. Click Conversation API in the left menu, accept the terms, and click GET ACCESS.
Create a Conversation API app. Go to Conversation API > Apps and click Create app. Record the app ID.
Enable the SMS channel on your app. Open the app, find SMS in the channel list, click Set up channel, and connect your service plan.
Find your sender number. Go to SMS > SMS Channel > Numbers. The assigned number is your SMS_SENDER value.
Note your project ID. Click the project name in the top bar and go to Project Settings.
Create an access key. Go to Settings > Access Keys. Record the access key ID and secret. The secret is only shown once.

Note: on a trial account, you can only send to verified numbers and the message content is fixed. To send to any number with custom content, upgrade your account.

Storing credentials in SSM Parameter Store

Create the two parameters before deploying. Run these in the same AWS region where you'll deploy the Lambda:

aws ssm put-parameter \
  --name /sinch/access-key \
  --value "YOUR_ACCESS_KEY" \
  --type SecureString

aws ssm put-parameter \
  --name /sinch/access-key-secret \
  --value "YOUR_ACCESS_KEY_SECRET" \
  --type SecureString

The Lambda function

The structure is the same in both languages: module-level caches for credentials and the OAuth token, a send_sms / sendSms function that owns the validation and API call, and a thin handler that adapts the event.

TypeScript

import { SSMClient, GetParameterCommand } from "@aws-sdk/client-ssm";
import { SQSEvent, SQSBatchResponse, EventBridgeEvent, SNSEvent } from "aws-lambda";

const ssm = new SSMClient({});
const SINCH_REGION = process.env.SINCH_REGION || "us";
const SINCH_PROJECT_ID = process.env.SINCH_PROJECT_ID!;
const SINCH_APP_ID = process.env.SINCH_APP_ID!;
const SINCH_SMS_SENDER = process.env.SINCH_SMS_SENDER!;
const SINCH_ACCESS_KEY_PARAM = process.env.SINCH_ACCESS_KEY_PARAM!;
const SINCH_ACCESS_KEY_SECRET_PARAM = process.env.SINCH_ACCESS_KEY_SECRET_PARAM!;

// Cached at module level. Reused across warm invocations regardless of trigger type
let credentials: { accessKey: string; accessKeySecret: string } | null = null;
let cachedToken: { value: string; expiresAt: number } | null = null;

async function getCredentials() {
  if (credentials) return credentials;
  const [keyRes, secretRes] = await Promise.all([
    ssm.send(new GetParameterCommand({ Name: SINCH_ACCESS_KEY_PARAM, WithDecryption: true })),
    ssm.send(new GetParameterCommand({ Name: SINCH_ACCESS_KEY_SECRET_PARAM, WithDecryption: true })),
  ]);
  credentials = {
    accessKey: keyRes.Parameter!.Value!,
    accessKeySecret: secretRes.Parameter!.Value!,
  };
  return credentials;
}

async function getAccessToken(): Promise<string> {
  const now = Date.now();
  if (cachedToken && now < cachedToken.expiresAt) return cachedToken.value; // Refreshes automatically when expired

  const { accessKey, accessKeySecret } = await getCredentials();
  const encoded = Buffer.from(`${accessKey}:${accessKeySecret}`).toString("base64");

  // Exchange credentials for a short-lived OAuth 2.0 Bearer token (valid ~1 hour).
  // Basic auth is rate-limited and for testing only. Don't use it directly on the Sinch API.
  const response = await fetch("https://auth.sinch.com/oauth2/token", {
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Authorization: `Basic ${encoded}`,
    },
    body: "grant_type=client_credentials",
  });

  if (!response.ok) throw new Error(`Token request failed: ${await response.text()}`);

  const data = await response.json() as { access_token: string; expires_in: number };
  cachedToken = { value: data.access_token, expiresAt: now + (data.expires_in - 60) * 1000 };
  return cachedToken.value;
}

const E164_REGEX = /^\+[1-9]\d{1,14}$/;

async function sendSms(to: string, message: string) {
  if (!E164_REGEX.test(to)) {
    throw new Error(`Invalid phone number format. Use E.164 (e.g. +15551234567), got: ${to}`);
  }

  const token = await getAccessToken();

  const response = await fetch(
    `https://${SINCH_REGION}.conversation.api.sinch.com/v1/projects/${SINCH_PROJECT_ID}/messages:send`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${token}` },
      body: JSON.stringify({
        app_id: SINCH_APP_ID,
        recipient: { identified_by: { channel_identities: [{ channel: "SMS", identity: to }] } },
        message: { text_message: { text: message } },
        channel_properties: { SMS_SENDER: SINCH_SMS_SENDER },
      }),
    }
  );

  if (!response.ok) {
    const error = await response.text();
    throw new Error(`Sinch API error ${response.status}: ${error}`);
  }

  return response.json();
}

// --- Direct invocation (default) ---
export const handler = async (event: { to: string; message: string }) => {
  if (!event.to || !event.message) {
    return { statusCode: 400, body: "Missing to or message" };
  }
  try {
    return { statusCode: 200, body: await sendSms(event.to, event.message) };
  } catch (err) {
    console.error(err);
    return { statusCode: 500, body: String(err) };
  }
};

// --- SQS (uncomment to use) ---
// Message body format: { "to": "+15551234567", "message": "Hello!" }
// Returns batchItemFailures so only failed records are retried, preventing duplicate sends.
//
// export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
//   const failures: { itemIdentifier: string }[] = [];
//   for (const record of event.Records) {
//     try {
//       const payload = JSON.parse(record.body);
//       await sendSms(payload.to, payload.message);
//     } catch (err) {
//       console.error(`Failed record ${record.messageId}:`, err);
//       failures.push({ itemIdentifier: record.messageId });
//     }
//   }
//   return { batchItemFailures: failures };
// };

// --- EventBridge (uncomment to use) ---
// Event detail format: { "to": "+15551234567", "message": "Hello!" }
//
// export const handler = async (
//   event: EventBridgeEvent<"SendSms", { to: string; message: string }>
// ) => {
//   await sendSms(event.detail.to, event.detail.message);
// };

// --- SNS (uncomment to use) ---
// SNS message body format: { "to": "+15551234567", "message": "Hello!" }
//
// export const handler = async (event: SNSEvent) => {
//   for (const record of event.Records) {
//     const payload = JSON.parse(record.Sns.Message);
//     await sendSms(payload.to, payload.message);
//   }
// };

Python

import json
import os
import re
import time
import urllib.request
import urllib.error
from base64 import b64encode
import boto3

ssm = boto3.client("ssm")

SINCH_REGION = os.environ.get("SINCH_REGION", "us")
SINCH_PROJECT_ID = os.environ["SINCH_PROJECT_ID"]
SINCH_APP_ID = os.environ["SINCH_APP_ID"]
SINCH_SMS_SENDER = os.environ["SINCH_SMS_SENDER"]
SINCH_ACCESS_KEY_PARAM = os.environ["SINCH_ACCESS_KEY_PARAM"]
SINCH_ACCESS_KEY_SECRET_PARAM = os.environ["SINCH_ACCESS_KEY_SECRET_PARAM"]

E164_REGEX = re.compile(r"^\+[1-9]\d{1,14}$")

# Module-level cache. Shared across warm invocations regardless of trigger type
_credentials = None
_token_cache = {"value": None, "expires_at": 0}


def get_credentials():
    global _credentials
    if _credentials:
        return _credentials
    key = ssm.get_parameter(Name=SINCH_ACCESS_KEY_PARAM, WithDecryption=True)["Parameter"]["Value"]
    secret = ssm.get_parameter(Name=SINCH_ACCESS_KEY_SECRET_PARAM, WithDecryption=True)["Parameter"]["Value"]
    _credentials = (key, secret)
    return _credentials


def get_access_token():
    now = time.time()
    if _token_cache["value"] and now < _token_cache["expires_at"]:
        return _token_cache["value"]

    access_key, access_key_secret = get_credentials()
    encoded = b64encode(f"{access_key}:{access_key_secret}".encode()).decode()

    req = urllib.request.Request(
        "https://auth.sinch.com/oauth2/token",
        data=b"grant_type=client_credentials",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": f"Basic {encoded}",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            data = json.loads(resp.read())
    except urllib.error.HTTPError as e:
        raise RuntimeError(f"Token request failed {e.code}: {e.read().decode()}")

    _token_cache["value"] = data["access_token"]
    _token_cache["expires_at"] = now + data["expires_in"] - 60
    return _token_cache["value"]


def send_sms(to, message):
    if not E164_REGEX.match(to):
        raise ValueError(f"Invalid phone number format. Use E.164 (e.g. +15551234567), got: {to}")

    token = get_access_token()
    payload = json.dumps({
        "app_id": SINCH_APP_ID,
        "recipient": {"identified_by": {"channel_identities": [{"channel": "SMS", "identity": to}]}},
        "message": {"text_message": {"text": message}},
        "channel_properties": {"SMS_SENDER": SINCH_SMS_SENDER},
    }).encode()

    url = f"https://{SINCH_REGION}.conversation.api.sinch.com/v1/projects/{SINCH_PROJECT_ID}/messages:send"
    req = urllib.request.Request(
        url, data=payload,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read())
    except urllib.error.HTTPError as e:
        error = e.read().decode()
        print(f"Sinch API error: {error}")
        raise RuntimeError(f"Sinch API error {e.code}: {error}")


# --- Direct invocation (default) ---
def handler(event, context):
    to = event.get("to")
    message = event.get("message")
    if not to or not message:
        return {"statusCode": 400, "body": "Missing to or message"}
    try:
        return {"statusCode": 200, "body": send_sms(to, message)}
    except Exception as e:
        print(f"Error: {e}")
        return {"statusCode": 500, "body": str(e)}


# --- SQS (uncomment to use) ---
# Message body format: {"to": "+15551234567", "message": "Hello!"}
# Returns batchItemFailures so only failed records are retried, preventing duplicate sends.
#
# def handler(event, context):
#     failures = []
#     for record in event["Records"]:
#         try:
#             payload = json.loads(record["body"])
#             send_sms(payload["to"], payload["message"])
#         except Exception as e:
#             print(f"Failed record {record['messageId']}: {e}")
#             failures.append({"itemIdentifier": record["messageId"]})
#     return {"batchItemFailures": failures}


# --- EventBridge (uncomment to use) ---
# Event detail format: {"to": "+15551234567", "message": "Hello!"}
#
# def handler(event, context):
#     detail = event["detail"]
#     send_sms(detail["to"], detail["message"])


# --- SNS (uncomment to use) ---
# SNS message body format: {"to": "+15551234567", "message": "Hello!"}
#
# def handler(event, context):
#     for record in event["Records"]:
#         payload = json.loads(record["Sns"]["Message"])
#         send_sms(payload["to"], payload["message"])

The SAM template

The repo uses a single template.yaml at the root that deploys all four functions together (Node.js and Python, raw HTTP and SDK). Here's the relevant section for the Node.js raw HTTP function:

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Timeout: 10
    MemorySize: 256
    Environment:
      Variables:
        SINCH_REGION: !Ref SinchRegion
        SINCH_PROJECT_ID: !Ref SinchProjectId
        SINCH_APP_ID: !Ref SinchAppId
        SINCH_SMS_SENDER: !Ref SinchSmsSender
        SINCH_ACCESS_KEY_PARAM: !Ref SinchAccessKeyParam
        SINCH_ACCESS_KEY_SECRET_PARAM: !Ref SinchAccessKeySecretParam

Parameters:
  SinchRegion:
    Type: String
    Default: us
    AllowedValues: [us, eu]
  # ... SinchProjectId, SinchAppId, SinchSmsSender,
  #     SinchAccessKeyParam, SinchAccessKeySecretParam

Resources:
  NodeHttpFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: nodejs22.x
      Handler: src/handler.handler
      CodeUri: node-http/
      Policies:
        - Statement:
            - Effect: Allow
              Action: ssm:GetParameter
              Resource:
                - !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter${SinchAccessKeyParam}"
                - !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter${SinchAccessKeySecretParam}"
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        Minify: false
        Target: es2022
        EntryPoints:
          - node-http/src/handler.ts

Deploy from the repo root:

sam build
sam deploy --guided

SAM handles all dependencies automatically during sam build: Node.js packages via npm install, Python packages via requirements.txt.

Invoking the function

With no event source attached, you invoke the function directly with a JSON payload. The simplest way is the AWS CLI:

# Get the function name from the stack outputs after sam deploy --guided,
# then invoke it:
aws lambda invoke \
  --function-name <NodeHttpFunctionName from outputs> \
  --payload '{"to": "+15559876543", "message": "Hello from Lambda!"}' \
  --cli-binary-format raw-in-base64-out \
  response.json

cat response.json

A successful response looks like:

{
  "statusCode": 200,
  "body": {
    "accepted_time": "2026-05-27T18:00:00.000Z",
    "message_id": "01ABC123DEF456GHI789JKL012"
  }
}

message_id is what you'd use to correlate delivery receipts if you set up a webhook.

If something goes wrong, the Sinch API returns a structured error:

{
  "statusCode": 400,
  "body": "{\"error\":{\"code\":400,\"message\":\"recipient identity is not a valid MSISDN\",\"status\":\"INVALID_ARGUMENT\"}}"
}

Common causes: malformed phone number, wrong app_id, region mismatch between your app and service plan, or missing SMS_SENDER without a default originator configured.

When you're ready to wire it up to something, change only the handler signature. sendSms() doesn't change. The SAM template has commented event source blocks for SQS, EventBridge, SNS, and API Gateway ready to uncomment:

SQS for queued, rate-controlled sending
EventBridge for event-driven notifications from other AWS services (order placed, alarm fired, job completed)
SNS for fan-out scenarios where multiple subscribers need to react
API Gateway if you need an HTTP endpoint for third-party webhooks or client-side triggers

Testing locally

Create an event file:

{
  "to": "+15559876543",
  "message": "Test from local"
}

Then invoke locally with SAM:

sam local invoke NodeHttpFunction --event event.json
# or
sam local invoke PythonHttpFunction --event event.json

Things worth knowing

Phone numbers must be E.164

The to value must be in E.164 format: +15551234567, not 5551234567 or (555) 123-4567. The handler validates this before calling the API. The Sinch API won't normalize it for you.

Long messages get split

SMS has a 160-character limit for standard GSM encoding. Longer messages get split into multiple parts. You can cap this with SMS_MAX_NUMBER_OF_MESSAGE_PARTS in channel_properties:

"channel_properties": {
  "SMS_SENDER": "+15551231234",
  "SMS_MAX_NUMBER_OF_MESSAGE_PARTS": "2"
}

If the message would require more parts than the cap, the API rejects it. It won't truncate or silently drop it. You'll get an error back, so you can handle it in your application.

Error handling and delivery status

When sendSms() throws, the handler logs the error and returns a 500. If the Lambda is invoked asynchronously (EventBridge, SNS), Lambda will retry up to two more times by default before sending the event to a dead-letter queue if one is configured. If you're sending high volumes, set up a DLQ on the function so failed sends don't disappear silently.

For SQS triggers, be aware of partial batch failures. If one message in a batch fails, the entire batch is retried by default, which means already-sent messages get sent again. To avoid duplicate SMS, implement partial batch failure reporting: catch errors per record and return { batchItemFailures: [{ itemIdentifier: record.messageId }] } so only the failed records are retried. The SQS handler variant in the code has a comment pointing to this.

For delivery status on the recipient's device, the Conversation API sends a callback to a webhook URL you configure on your app. The status will be either DELIVERED or FAILED. To receive these, register a webhook in the Sinch Build Dashboard under your app's webhook settings, or via the Webhooks API. That's a natural next step once this is working.

The SDK and other languages

This post uses raw HTTP. The tradeoff compared to the official Sinch SDKs (Node.js, Python, Java, .NET): raw HTTP has zero runtime dependencies and the auth flow is visible in the code. The SDKs remove the token caching code entirely (they handle OAuth 2.0 internally) but add a dependency. Both are valid. The repo includes node-sdk and python-sdk examples alongside the raw HTTP ones.

Wrapping up

The function is deployed, the credentials are in SSM, and you have a handler that works with whatever trigger makes sense for your use case.

The source code for this post is available on GitHub: gunnargrosch/sinch-sms-lambda

What are you building that needs SMS notifications? Let me know in the comments.

Additional Resources

Visualizing AWS Lambda Durable Function Workflows with durable-viz

Gunnar Grosch — Fri, 27 Mar 2026 15:20:58 +0000

If you're new to durable functions, start with the durable functions post for the core concepts. The short version: your handler re-runs from the beginning on every resume, but completed steps return cached results instantly instead of re-executing. The SDK handles this checkpointing and replay transparently.

Durable functions encourage you to write sequential code. But the execution flow isn't always sequential. You have parallel branches that fan out and converge. Conditionals that route to callbacks or skip to the end. Invocations that call other Lambda functions. The more primitives you use, the harder it gets to see the full picture just by reading the handler.

I hit this when building the purchasing coordinator. Five specialist agents dispatched in parallel, a conditional approval callback, plan and synthesis steps on either side. The code reads top to bottom, but the workflow branches and converges in ways that aren't obvious from the source. Two primitives in particular: context.invoke() calls another Lambda function with automatic checkpointing (unlike the AWS SDK's lambda.invoke(), the result is cached so the target function isn't called again on replay), and waitForCallback suspends the workflow until an external signal arrives. This is how the purchasing coordinator pauses for human approval.

So I built durable-viz: a static analysis tool that turns durable function handlers into flowcharts. No deployment, no execution, no AWS credentials. Point it at a source file and it extracts the workflow structure from the code.

It supports TypeScript, Python, and Java. You can run it as a CLI (Mermaid output, browser, or JSON), or as a VS Code extension with a live diagram panel next to your code.

What It Does

Run durable-viz against any file containing a durable function handler:

npx durable-viz handler.ts --open

It parses the file, extracts the durable primitives (step, parallel, invoke, waitForCallback, conditionals), builds a directed graph, and renders it as a Mermaid flowchart. The --open flag generates an interactive HTML page with zoom, pan, and PNG export.

Here's what the purchasing coordinator from the multi-agent post looks like:

npx durable-viz src/handlers/coordinator.ts

The four-phase flow is immediately visible: plan, five specialists fanning out from a parallel node, synthesize, and the conditional approval callback gated by a diamond. The "no" branch skips straight to End. Each node shape encodes the primitive type (the --open browser view and VS Code extension add color coding). These primitives (step, parallel, invoke, waitForCallback) are the SDK methods that automatically checkpoint their results. On replay, completed primitives return cached results without re-executing. That's what makes them "durable."

The tool didn't execute the code or read a deployment. It parsed the TypeScript AST, found withDurableExecution(), walked the handler body, and extracted every durable primitive with its name and structure. The five specialist branches came from resolving the SPECIALISTS registry object at module scope.

CLI

The default output is Mermaid flowchart syntax printed to stdout. Paste it into GitHub Markdown, Notion, Confluence, or any Mermaid-compatible renderer.

npx durable-viz handler.ts

You can change the graph direction from top-down to left-right:

npx durable-viz handler.ts --direction LR

The --open flag generates a self-contained HTML page and opens it in your browser. Dark theme, scroll-to-zoom, click-drag panning, and fit-to-view. You can save the diagram as a high-resolution PNG for documentation, pull requests, or presentations.

npx durable-viz handler.ts --open

For custom tooling, --json outputs the raw workflow graph (nodes, edges, source line numbers):

npx durable-viz handler.ts --json

VS Code Extension

The extension renders the diagram in a side panel next to your code. Install from the VS Code Marketplace, then open a durable function handler and run Durable Viz: Open Lambda Durable Function Workflow from the command palette.

Feature	Description
Click-to-navigate	Click any node to jump to that line in the source file
Auto-refresh	Diagram updates on file save
Save PNG	Export the diagram as a high-resolution transparent PNG
Source view	View the raw Mermaid syntax or JSON graph

The extension supports zoom, pan, direction toggle, and fit-to-view. See the Marketplace listing for the full feature list.

Multi-Language Support

The tool supports TypeScript/JavaScript, Python, and Java. Each language has its own parser, but the graph model, edge builder, and renderers are shared.

TypeScript / JavaScript

Uses ts-morph for full AST parsing. This is the most capable parser with two features the others don't have:

Function-reference following. If your handler calls a helper function that accepts DurableContext, the parser resolves the call and inlines the helper's durable primitives at the call site. Only works for functions defined in the same file.
Registry key resolution. For context.parallel() calls that use .map() over a module-scope registry object, the parser enumerates the registry keys to show all possible parallel branches. This is how the purchasing coordinator's five specialists appear in the diagram even though the code dispatches them dynamically.

Python

npx durable-viz examples/order_processor.py --direction LR

Finds @durable_execution decorated handlers and extracts context.<method>() calls. Uses indentation to determine block boundaries.

Java (preview)

npx durable-viz Handler.java --open

Finds classes extending DurableHandler and extracts ctx.<method>() calls from the handleRequest method. Some primitives (parallel, waitForCallback, waitForCondition) are still in development in the Java durable execution SDK.

Both the Python and Java parsers use regex rather than full AST parsing. This keeps the tool as a single Node.js package without requiring Python or Java parser dependencies. The trade-off: standard single-line call patterns work well, but method calls split across many lines or unusual argument formatting may not be detected. For most idiomatic durable function code, it works without issues.

Supported Primitives

Primitive	TypeScript	Python	Java (preview)
Step	`context.step()`	`context.step()`	`ctx.step()`
Invoke	`context.invoke()`	`context.invoke()`	`ctx.invoke()`
Parallel	`context.parallel()`	`context.parallel()`	in development
Map	`context.map()`	`context.map()`	in development
Wait	`context.wait()`	`context.wait()`	`ctx.wait()`
Wait for Callback	`context.waitForCallback()`	`context.wait_for_callback()`	in development
Create Callback	`context.createCallback()`	`context.create_callback()`	`ctx.createCallback()`
Wait for Condition	`context.waitForCondition()`	`context.wait_for_condition()`	in development
Child Context	`context.runInChildContext()`	`context.run_in_child_context()`	`ctx.runInChildContext()`

TypeScript also detects context.promise.all(), context.promise.any(), context.promise.race(), and context.promise.allSettled().

Visual encoding

Each primitive type has a distinct shape and color:

Node	Shape	Color
Start / End	Stadium	Blue
Step	Rectangle	Green
Invoke	Trapezoid	Amber
Parallel / Map	Hexagon	Purple
Wait / Callback	Circle	Red
Condition	Diamond	Indigo
Child Context	Subroutine	Teal

How It Works

The tool performs static analysis on your source file. It never imports, executes, or deploys your code.

The architecture is a three-stage pipeline:

[source file] → Parser → WorkflowGraph → Renderer → [output]
                  │                          │
          TypeScript / Python / Java    Mermaid / JSON

The parser interface

Adding a new language means implementing two methods:

export interface Parser {
  extensions: string[]
  parseFile(filePath: string, options?: ParseOptions): WorkflowGraph
}

extensions declares which file types the parser handles. parseFile takes a file path and returns a WorkflowGraph with nodes, edges, and source line numbers. The dispatcher selects the right parser by file extension.

The graph model

The parser produces a WorkflowGraph: an ordered list of nodes with branches (for parallel blocks) and metadata (for conditionals). Here's a simplified view of what each node looks like:

interface WorkflowNode {
  id: string
  kind: 'start' | 'end' | 'step' | 'invoke' | 'parallel' | 'map'
      | 'wait' | 'waitForCallback' | 'condition' | /* ... */  // maps to the primitives table above
  label: string
  branches?: WorkflowBranch[]   // for parallel/map nodes
  thenCount?: number            // for conditions: nodes in the then-branch
  thenReturns?: boolean         // for conditions: does then-branch return?
  sourceLine?: number           // 1-based line number for click-to-navigate
}

The edge builder constructs edges from the node list, handling sequential flow, parallel fan-out/fan-in, and conditional routing.

For conditionals, the parser tracks whether the if block ends with a return. If it does, the "yes" branch connects to End instead of falling through. The "no" branch skips the conditional block and continues to the next node.

Registry key resolution in practice

Here's a pattern from the purchasing coordinator. Don't worry about the specifics of the durable function code. The key thing is that the parallel branches are built dynamically at runtime from a registry object:

const SPECIALISTS: Record<string, { functionName: string; display: string }> = {
  'price-research': { functionName: process.env.PRICE_RESEARCH_FUNCTION!, display: 'Price Research' },
  'financing': { functionName: process.env.FINANCING_FUNCTION!, display: 'Financing' },
  // ... 3 more
}

const results = await context.parallel('specialists',
  plan.specialists.map((spec) => ({
    name: spec.name,
    func: async (ctx) => {
      const result = await ctx.invoke(spec.name, specialist.functionName, { prompt: spec.prompt })
      return { name: spec.name, response: result.response }
    },
  }))
)

The .map() call means the parallel branches are determined at runtime. The parser can't execute the code, but it can look at the SPECIALISTS object and enumerate its keys. It finds five keys, creates five invoke branches, and labels them with the key names. This is how the diagram shows all five specialists even though the code builds the branch list dynamically.

If the parser can't resolve the registry (the object is imported from another file, or the pattern doesn't match), it falls back to showing a single representative branch.

If you point the tool at a file that isn't a durable function handler, or a file with syntax errors, it exits with a clear error message. The VS Code extension shows the error in the webview panel instead of a diagram.

Design Decisions

Static analysis over runtime tracing

The main design choice: parse the code, don't execute it. Runtime tracing would give you the actual execution path for a specific input, but it requires deployment, credentials, and a real invocation. Static analysis gives you all possible paths from the source alone. You see every parallel branch, every conditional route, every callback. The trade-off is that dynamic branches (like the specialist .map()) require heuristics to resolve.

For documentation and code review, seeing all possible paths is usually more useful than seeing one specific execution. For debugging a specific run, the durable execution history API (get-durable-execution) is the right tool.

Language-agnostic graph model

The parsers are language-specific. The graph model, edge builder, and renderers are not. Adding a new language means writing a parser that produces WorkflowGraph nodes. Everything downstream is shared. The TypeScript parser uses ts-morph for full AST analysis. The Python and Java parsers use regex, which handles standard patterns well but can miss unusual formatting. The regex approach was a deliberate trade-off: full AST parsing for Python would require a Python parser dependency, and Java would need a Java parser. Regex keeps the tool as a single Node.js package.

Visual encoding for primitive types

Each primitive type gets a unique shape and color combination so you can identify the primitive at a glance without reading labels. Steps are green rectangles (the most common node). Invocations are amber trapezoids (they call out to external functions). Parallel blocks are purple hexagons (they branch). Callbacks are red circles (they suspend execution). Conditionals are indigo diamonds (standard flowchart convention). The color palette is optimized for dark backgrounds since most developer tools use dark themes.

Try It Out

You'll need:

Node.js 20+

Run against the examples

Clone the repo and run against the included examples:

git clone https://github.com/gunnargrosch/durable-viz.git
cd durable-viz

# TypeScript order workflow
npx durable-viz examples/order-workflow.ts --open

# Python order processor
npx durable-viz examples/order_processor.py --open

# Java order processor
npx durable-viz examples/OrderProcessor.java --open

Run against the purchasing coordinator

If you have the multi-agent purchasing demo cloned:

cd durable-multi-agent-purchasing
npx durable-viz src/handlers/coordinator.ts --open

Run against your own handler

For your own durable function handlers, npx downloads and runs the tool directly. No cloning needed:

npx durable-viz path/to/your-handler.ts --open

Install the VS Code extension

Search "Durable Viz" in the Extensions panel, or run:

ext install gunnargrosch.durable-viz

Open a durable function handler, open the command palette, and run Durable Viz: Open Lambda Durable Function Workflow.

Additional Resources

durable-viz on GitHub
durable-viz on npm
VS Code Marketplace
Multi-Agent Systems on AWS Lambda with Durable Functions: The purchasing coordinator used as the primary example
AWS Lambda Durable Functions: Building Long-Running Workflows in Code: Durable execution primitives and the support triage demo
AWS Lambda Durable Functions documentation

Run npx durable-viz against your handler and share the diagram. I'd love to see what your workflows look like!

Multi-Agent Systems on AWS Lambda with Durable Functions

Gunnar Grosch — Wed, 25 Mar 2026 13:27:42 +0000

In my previous post on multi-agent systems, I built a purchasing coordinator where a coordinator agent routes requests to specialist agents based on RISEN prompt contracts. RISEN structures system prompts into five components: Role, Instructions, Steps, Expectation, and Narrowing. In a multi-agent system, the Steps section encodes the routing logic and the Narrowing section prevents agents from doing each other's work. A laptop triggers Price Research and Delivery. A used car triggers all five specialists. The routing logic lives in the prompts, not in code. It works, but it runs in a single process. No fault isolation, no independent scaling, no durability. If the process crashes halfway through specialist consultations, you start over.

The durable functions post solved the durability problem for a support ticket triage workflow: checkpoint each step, suspend for human review, resume where you left off. But that was a single-agent workflow.

This post combines the two. The same purchasing coordinator, deployed to AWS Lambda, with each specialist as its own Lambda function. The coordinator is a durable function that checkpoints every specialist call. If it's interrupted after consulting three of five specialists, it resumes from the fourth. When a high-value purchase needs human approval, the function suspends, compute charges stop, and it picks up exactly where it left off when the approver responds.

The two SDKs have distinct roles. The Strands Agents SDK handles the AI reasoning: the planning agent decides which specialists to call, and the synthesis agent produces the recommendation. The durable execution SDK handles the infrastructure: checkpointing, parallel dispatch, suspension, and replay. The coordinator uses both. The specialists only use Strands.

The complete source code is on GitHub: github.com/gunnargrosch/durable-multi-agent-purchasing.

What Changes from the In-Process Demo

The multi-agent post used tool() callbacks for specialist dispatch: each specialist was defined as a Strands tool, and the coordinator agent called them as functions within the same process. That's the simplest possible architecture, and it's fine for development. Here's what changes when you deploy:

	In-Process (previous post)	Lambda + Durable (this post)
Specialist invocation	`tool()` callback, in-process	`context.invoke()`, separate Lambda function
Execution model	Sequential (one specialist at a time)	Parallel via `context.parallel()`
Fault tolerance	None. Process crash = restart	Checkpointed. Resume from last completed step
Scaling	Single process	Each specialist scales independently
Human-in-the-loop	Not supported	`waitForCallback()` with zero-cost suspension
IAM isolation	Shared process permissions	Per-specialist least-privilege policies
Routing visibility	Real-time hook as each tool is called	Checkpointed plan step with routing summary
Infrastructure	`npm start`	SAM template, 6 Lambda functions

The RISEN prompts carry over with only minor adjustments. The coordinator's prompt splits into two phases (plan and synthesis) instead of a single invocation, because the durable function needs to checkpoint the plan before dispatching specialists. The specialist prompts are unchanged.

How It Works

If you haven't read the durable functions post, here's the key mental model: durable functions use checkpoint and replay. Your handler re-executes from the top on every resume, but completed steps return their cached results instantly without re-executing. New work picks up from where it left off. The SDK manages this transparently. You write sequential code and the infrastructure handles the rest.

[CoordinatorFunction] — Lambda Durable Function (Sonnet, 1536MB)
  ├→ context.step('plan')            → Planning agent selects specialists
  ├→ context.parallel('specialists') → Runs selected specialists concurrently:
  │     ├─ context.invoke(PriceResearchFunction)  (Haiku)   ← always
  │     ├─ context.invoke(FinancingFunction)       (Haiku)   ← if value > $5K
  │     ├─ context.invoke(DeliveryFunction)        (Haiku)   ← if physical product
  │     ├─ context.invoke(RiskAssessmentFunction)  (Sonnet)  ← if value > $10K / used
  │     └─ context.invoke(ContractReviewFunction)  (Haiku)   ← if subscription/lease
  ├→ context.step('synthesize')      → Synthesis agent combines findings
  └→ context.waitForCallback()       → Human approval (when requireApproval=true)

Three things to notice:

context.invoke() is a new primitive not covered in the durable functions post. It's the SDK's built-in method for calling other Lambda functions. It checkpoints the result automatically and suspends the coordinator while waiting, so you don't pay for compute during specialist execution. Despite the SDK's API reference describing it as invoking "another durable function," context.invoke() works with any Lambda function. The specialists here are standard functions without DurableConfig.
context.step() wraps the planning and synthesis phases with retry strategies, just like the Bedrock calls in the support triage demo. Each step checkpoints its result. On replay, it returns the cached result without re-executing.
context.parallel() wraps the specialist invocations so they run concurrently, each independently checkpointed. If specialist 3 of 5 fails, the other four results are preserved.

The SAM Template

Here are the key parts of the coordinator's definition in the SAM template. The template defines 6 functions total: the coordinator plus 5 specialists that follow the same pattern.

CoordinatorFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: coordinator.handler
    MemorySize: 1536
    Timeout: 300
    Environment:
      Variables:
        COORDINATOR_MODEL_ID: !Sub global.${CoordinatorModelId}
        PRICE_RESEARCH_FUNCTION: !Ref PriceResearchFunction
        # ... remaining specialist function references
    Policies:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy
      - Statement:
          - Effect: Allow
            Action:
              - bedrock:InvokeModel
              - bedrock:InvokeModelWithResponseStream
            Resource: # Bedrock model + inference profile ARNs
          - Effect: Allow
            Action: lambda:InvokeFunction
            Resource:
              - !GetAtt PriceResearchFunction.Arn
              - !GetAtt FinancingFunction.Arn
              - !GetAtt DeliveryFunction.Arn
              - !GetAtt RiskAssessmentFunction.Arn
              - !GetAtt ContractReviewFunction.Arn
    DurableConfig:
      ExecutionTimeout: 86400
      RetentionPeriodInDays: 7
    AutoPublishAlias: live

A few things to note:

No API Gateway, no function URLs, no public endpoints. The multi-agent post previewed two deployment options: Lambda with HTTP endpoints, or AgentCore containers. This implementation goes a simpler route. context.invoke() calls specialists directly via the Lambda API. The coordinator's IAM policy grants lambda:InvokeFunction on each specialist ARN. Specialists are unreachable from outside the coordinator's execution role.
Shared handler, different prompts. All five specialists use the same specialist.ts handler. The PROMPT_NAME environment variable selects which RISEN prompt to load. This keeps the template repetitive but predictable: each specialist block differs only in its name, PROMPT_NAME, and model ID.
Per-specialist model selection. Risk Assessment gets AdvancedSpecialistModelId (Sonnet) for stronger reasoning. The other four get SpecialistModelId (Haiku). Same pattern as the in-process demo, now enforced at the infrastructure level.
DurableConfig with 24-hour execution timeout. The used car scenario with human approval can sit overnight. Each individual invocation is bounded by Timeout: 300 (the coordinator's per-replay limit). ExecutionTimeout: 86400 is the outer wall-clock limit across all replays and suspensions.
InvokeModelWithResponseStream is included because the Strands SDK's BedrockModel may use streaming internally for token generation. Without it, the coordinator would get AccessDeniedException on agent invocations.
ESM build. The coordinator uses ESM (Format: esm) with esbuild. The full template includes a Banner that injects a createRequire shim because some dependencies expect CommonJS require. This is a known pattern for ESM Lambda functions with mixed dependencies. See the repo's template.yaml for the complete Metadata block.

The Coordinator Handler

The coordinator is the only durable function. Here's the handler skeleton showing the four durable phases. The full source includes types, the specialist registry, retry strategies, and routing summary logic.

export const handler = withDurableExecution(async (
  event: CoordinatorEvent,
  context: DurableContext
): Promise<Record<string, unknown>> => {

  // Phase 1: Plan — agent decides which specialists to consult
  const plan = await context.step<AnalysisPlan>('plan', async () => {
    let capturedPlan: AnalysisPlan | null = null
    const planTool = tool({
      name: 'create_analysis_plan',
      inputSchema: z.object({
        specialists: z.array(z.object({
          name: z.enum(['price-research', 'financing', 'delivery',
                        'risk-assessment', 'contract-review']),
          prompt: z.string(),
        })),
      }),
      callback: async (input) => { capturedPlan = input; return 'Plan created.' },
    })
    const agent = new Agent({ model, systemPrompt: planPrompt, tools: [planTool], printer: false })
    await agent.invoke(event.request)
    if (!capturedPlan) throw new Error('Planning agent did not call create_analysis_plan')
    return capturedPlan
  }, { retryStrategy: bedrockRetry })

  // Phase 2: Consult specialists in parallel via context.invoke()
  const results = await context.parallel<{ name: string; response: string }>('specialists',
    plan.specialists.map((spec) => ({
      name: spec.name,
      func: async (ctx) => {
        try {
          const specialist = SPECIALISTS[spec.name]
          if (!specialist) {
            throw new Error(`Unknown specialist: ${spec.name}`)
          }
          const result = await ctx.invoke<{ prompt: string }, { response: string }>(
            spec.name, specialist.functionName, { prompt: spec.prompt }
          )
          return { name: spec.name, response: result.response }
        } catch (err) {
          const display = SPECIALISTS[spec.name]?.display ?? spec.name
          return { name: spec.name, response: `[${display} unavailable: ${err instanceof Error ? err.message : 'unknown error'}]` }
        }
      },
    }))
  )

  // Phase 3: Synthesize specialist findings into a recommendation
  const response = await context.step('synthesize', async () => {
    // Build findings from checkpointed parallel results, invoke synthesis agent
    // (see full source for details)
  }, { retryStrategy: bedrockRetry })

  // Phase 4: Human approval (optional)
  if (event.requireApproval) {
    const approval = await context.waitForCallback<ApprovalPayload>(
      'approval',
      async (callbackId) => { context.logger.info('approval_callback_created', { callbackId }) },
      { timeout: { hours: 24 }, serdes: defaultSerdes }
    )
    if (!approval.approved) return { status: 'rejected', recommendation: response }
  }

  return { status: event.requireApproval ? 'approved' : 'completed', recommendation: response }
})

Let's walk through what's happening in each phase.

Phase 1: Plan

The planning phase is the biggest change from the in-process demo. In the previous post, the coordinator was a single agent that both decided which specialists to call and synthesized their findings. Here, planning and synthesis are separate agents wrapped in separate context.step() calls.

Why split them? Checkpointing. In the in-process demo, if the coordinator crashes after calling three specialists, you lose the routing decision and all three results. With durable functions, the plan is checkpointed as a single unit. If the function replays after the plan step, it returns the cached plan instantly without calling Bedrock again.

The planning agent uses tool() with a Zod schema to produce structured output. The create_analysis_plan tool captures the plan into a closure variable, and the step returns it as its checkpointed result. If the agent doesn't call the tool, the step throws an error. The retry strategy will attempt it three times, but this is a non-transient failure: if the model didn't call the tool on the first attempt, retries won't help. After retries are exhausted, the execution fails.

Phase 2: Parallel specialists

This is where context.invoke() replaces tool() callbacks. Each specialist invocation is a branch inside context.parallel():

const result = await ctx.invoke<{ prompt: string }, { response: string }>(
  spec.name,             // step name for checkpoint history
  specialist.functionName, // Lambda function name from environment
  { prompt: spec.prompt }  // payload
)

context.invoke() calls the specialist Lambda function directly, checkpoints the result, and suspends the coordinator while waiting. You don't pay for coordinator compute while specialists are executing. On replay, completed invocations return their cached results without re-invoking the specialist.

Each branch has a try/catch for graceful degradation. If the Delivery specialist times out, the coordinator still gets results from Price Research, Financing, Risk Assessment, and Contract Review. The synthesis agent notes the gap and advises the buyer to investigate delivery independently. This is a meaningful improvement over the in-process demo, where a failed specialist tool call could derail the entire coordinator.

Phase 3: Synthesize

The synthesis agent receives all specialist findings and produces a structured recommendation. Like the plan step, the entire synthesis is one checkpointed unit. The synthesis prompt's Narrowing section prevents it from overriding specialist findings or filling in gaps from failed specialists.

Phase 4: Human approval

When requireApproval is true (the used car scenario sets this), the function suspends at waitForCallback. Compute charges stop. The approver reviews the recommendation and sends a callback via the Lambda API or the demo's interactive prompt. The function resumes and returns the final status.

Note the serdes: defaultSerdes option on the callback. As covered in the durable functions post's gotchas, waitForCallback defaults to passthrough serialization (not JSON.parse). Without defaultSerdes, approval.approved would be undefined at runtime even though TypeScript thinks it's a boolean.

The Specialist Handler

All five specialists share one handler. The PROMPT_NAME environment variable selects the behavior:

const promptName = process.env.PROMPT_NAME!
const systemPrompt = loadPrompt(promptName)
const model = createModel(SPECIALIST_MODEL)

export const handler = async (
  event: SpecialistEvent,
  context: Context
): Promise<{ response: string }> => {
  const logger = makeLogger({ handler: promptName, requestId: context.awsRequestId })
  const tools = loadSpecialistTools(promptName, logger)
  const agent = new Agent({ model, systemPrompt, tools, printer: false })
  const result = await agent.invoke(event.prompt)
  return { response: result.toString() }
}

Specialists are standard Lambda functions, not durable. They don't need checkpointing because each one completes in a single invocation (a Bedrock call and response processing). The coordinator's context.invoke() handles the durability: if a specialist invocation times out, the coordinator can retry from the checkpoint without re-running specialists that already succeeded.

loadSpecialistTools returns specialist-specific tools (see src/lib/specialist-tools.ts). Most specialists are pure reasoning (no tools). The Price Research specialist has a save_price_snapshot tool that logs structured price data. In production, that tool could write to DynamoDB or call a pricing API. The coordinator never sees these tools. They're scoped to each specialist's domain.

The Prompt Split

The in-process demo had one coordinator prompt with routing in the Steps section. The durable version splits this into two prompts:

coordinator-plan handles routing. Here's the Steps and Expectation sections:

# Steps
1. Read the purchase request and identify what is being purchased, its likely category
   (vehicle, electronics, real estate, software, etc.), the approximate value range,
   and any special circumstances mentioned or implied (used/secondhand, financing needed,
   physical delivery, contract or subscription, high value).
2. Select specialists using these routing rules:
   - Always include price-research to compare options and assess market value.
   - Include financing if the estimated value exceeds $5,000, or if financing or a loan
     is mentioned or implied.
   - Include delivery if the item is a tangible physical product that must be physically
     received — electronics, appliances, vehicles, furniture, machinery. Vehicles always
     need delivery planning (transport, pickup, or test drive logistics). Do not include
     for purely digital purchases (software, SaaS subscriptions, downloadable content).
   - Include contract-review if the purchase involves a subscription, lease, warranty
     agreement, service contract, purchase agreement, or any multi-year financial
     commitment. Vehicle and real estate purchases always involve contracts.
   - Include risk-assessment if the estimated value exceeds $10,000, the item is used or
     secondhand, or the category carries known risk (vehicles, real estate, machinery).
     Do not include for new electronics, appliances, or standard retail under $10,000.
3. For each selected specialist, write a focused prompt describing what to analyze about
   this specific purchase. Include relevant details from the request (item, price,
   condition, location, urgency). The specialist's own system prompt defines its role —
   do not repeat the role description in your prompt.
4. Call create_analysis_plan exactly once with the selected specialists and their prompts.

# Expectation
A single call to create_analysis_plan containing:
- An array of specialists, each with a name (from the allowed set) and a prompt string.
- Only specialists whose routing criteria are met.
- Prompts that are specific to this purchase, not generic templates.

These are the same routing rules from the original coordinator prompt in the multi-agent post. The difference: instead of calling specialist tools directly (Steps 2-6 in the original said "invoke the research_prices tool," "invoke the evaluate_financing tool"), the plan agent calls create_analysis_plan once with all selected specialists. The actual dispatch happens via context.invoke() in Phase 2.

coordinator-synthesis handles the final recommendation. It receives specialist findings and produces the buyer-facing output. Its Narrowing prevents it from contradicting specialists or filling in missing analysis.

This split means the coordinator makes two agent invocations (plan + synthesize) instead of one. Each agent invocation may involve multiple Bedrock round-trips internally as the Strands SDK handles reasoning. The trade-off is worth it: the plan is checkpointed before any specialist is called, and the synthesis is checkpointed after all specialists complete. On replay, both return cached results instantly.

Testing with the Local Runner

The durable functions post showed LocalDurableTestRunner for the support triage workflow. The multi-agent demo adds a new pattern: registerFunction for mocking context.invoke() targets.

// Register mock handlers for each specialist Lambda function
runner = new LocalDurableTestRunner({ handlerFunction: handler });
runner
  .registerFunction("PriceResearchFunction", specialistHandler)
  .registerFunction("FinancingFunction", specialistHandler)
  .registerFunction("DeliveryFunction", specialistHandler)
  .registerFunction("RiskAssessmentFunction", specialistHandler)
  .registerFunction("ContractReviewFunction", specialistHandler);

// Run the full workflow and verify the result
const result = runner.run({ payload: laptopPayload });
const output = await result;
expect(output.getStatus()).toBe("SUCCEEDED");
expect(output.getResult().routing.called).toContain("Price Research");

registerFunction maps function names to local handlers. When the coordinator calls context.invoke("PriceResearchFunction", payload), the test runner routes it to the registered mock instead of calling Lambda. This lets you test the full checkpoint/replay lifecycle without deploying or calling Bedrock. The full test suite also tests callback suspension and resumption using runner.getOperation() and sendCallbackSuccess(), the same pattern from the durable functions post.

The test suite covers seven scenarios: standard flow, all-5-specialists flow, approval, rejection, specialist failure with graceful degradation, callback failure, and planning agent failure.

Try the Demo

The repo includes an interactive demo with three purchase scenarios:

Scenario	Value	Specialists	Approval
Laptop	$1,500	Price Research + Delivery	No
Used car	$18,000	All 5 specialists	Yes (`waitForCallback`)
SaaS subscription	$200/mo	Price Research + Contract Review	Yes (`waitForCallback`)

You'll need:

Node.js 24+ and npm
For cloud mode: AWS SAM CLI 1.153.1+ and Bedrock access to Claude Sonnet 4.6 and Claude Haiku 4.5

Local mode (no AWS credentials needed)

git clone https://github.com/gunnargrosch/durable-multi-agent-purchasing.git
cd durable-multi-agent-purchasing
npm install
npm run demo:local -- --ticket=used-car

Local mode uses mocked Bedrock responses. The used car scenario exercises all four durable primitives: step (plan and synthesis), invoke (specialist calls), parallel (concurrent dispatch), and waitForCallback (human approval where you play the purchase approver).

Cloud mode (real Bedrock responses)

sam build
sam deploy --guided
npm run demo:cloud -- --ticket=used-car --region=us-east-1

Cloud mode invokes the deployed coordinator with real Bedrock calls. You'll see actual AI-generated specialist analyses and a synthesized recommendation. The demo polls execution history and prompts you when the approval callback is created.

The demo uses direct aws lambda invoke with --invocation-type Event for simplicity. In production, the coordinator would typically sit behind an upstream service: an API Gateway endpoint receiving purchase requests, an EventBridge rule triggered by order events, or an SQS queue processing a backlog. The coordinator itself doesn't care how it's invoked. It receives the event payload and the durable SDK handles the rest.

Inspecting execution history

After a cloud run, you can inspect the execution in the Lambda console under the Durable executions tab, or via the CLI:

aws lambda list-durable-executions-by-function \
  --function-name durable-multi-agent-purchasing-CoordinatorFunction

aws lambda get-durable-execution \
  --durable-execution-arn "<arn-from-list>"

The execution history shows each step's status and timing: when the plan completed, how long each specialist took, and whether the callback is pending or resolved.

Design Decisions

context.invoke() over HTTP

The multi-agent post previewed two deployment approaches: Lambda with API Gateway/Function URLs (HTTP), or AgentCore containers. This demo takes a third path: context.invoke() for direct Lambda-to-Lambda invocation.

The result is simpler than either preview option. No API Gateway resources, no function URLs, no SigV4 signing, no HTTP client configuration. The coordinator calls specialists via the Lambda API, and the durable SDK handles checkpointing and retry. Specialists are unreachable from outside the coordinator's execution role, which is better isolation than an HTTP endpoint with IAM auth.

The trade-off: specialists are only callable from within a durable function. If you later need specialists accessible from other services (a REST API, a Step Functions state machine, another team's coordinator), you'd need to add API Gateway or function URLs at that point. For this use case, where one coordinator owns all specialist dispatch, direct invocation is the right call.

Splitting plan from synthesis

The in-process coordinator was a single agent invocation: read the request, call specialist tools, synthesize findings. With durable functions, that becomes two separate agents wrapped in separate context.step() calls.

This introduces an extra Bedrock call, but the checkpointing benefits are significant. The plan is preserved before any specialist runs. If the function replays after three of five specialists complete, the plan doesn't need to be regenerated. The synthesis is preserved after all specialists complete. If the function replays during the approval callback, the recommendation doesn't need to be regenerated.

The alternative would be wrapping the entire coordinator in a single step. That would mean one Bedrock conversation (cheaper) but no intermediate checkpoints. A failure during synthesis would replay from the beginning, including all specialist invocations. With the split, a synthesis failure only retries the synthesis.

Planning agent with tool capture

The planning agent uses tool() with a Zod schema to produce structured output. This is the same pattern from the in-process demo, but used differently. In the previous post, tools were the dispatch mechanism (calling tools = calling specialists). Here, the tool is purely for structured output capture. The plan step returns the captured plan as its checkpointed result, and context.invoke() handles the actual specialist dispatch.

Why not just have the planning agent return JSON directly? Tool calling with a schema gives you validation at the SDK level. If the agent returns a plan with an invalid specialist name, Zod catches it before the plan is checkpointed. Without the tool, you'd parse and validate the JSON yourself, and an invalid plan could be checkpointed and break on every subsequent replay.

Graceful degradation in parallel

Each specialist branch in context.parallel() has a try/catch that returns an error message instead of throwing:

catch (err) {
  return { name: spec.name, response: `[unavailable: ${err instanceof Error ? err.message : 'unknown error'}]` }
}

This means a failed specialist doesn't fail the entire parallel block. The synthesis agent receives partial results and notes the gap. Without the catch, a failed specialist would fail its branch. Whether that fails the entire parallel block depends on the completion config, but either way, synthesis would never run with partial results. With the catch, every branch succeeds (some with error messages), the parallel block always completes, and the synthesis agent can work with whatever it has.

Cost

The main cost is Bedrock token usage, not Lambda compute.

Scenario	Specialists	Agent invocations	Estimated cost
Laptop (2 specialists)	2	~4 (plan + 2 specialists + synthesize)	~$0.02-0.05
Used car (5 specialists)	5	~7 (plan + 5 specialists + synthesize)	~$0.05-0.15

The coordinator uses Sonnet (~$3/$15 per million input/output tokens). Most specialists use Haiku (~$1/$5 per million input/output tokens). Risk Assessment uses Sonnet. Lambda compute for the active execution periods (plan, specialist dispatch, synthesis, replay) totals ~$0.0001. During the waitForCallback suspension, compute charges are zero.

Things to Watch For

Checkpoint size. Each step result is serialized and stored as a checkpoint. The durable functions post covered the 256KB limit per checkpoint. With 5 specialists returning Bedrock responses, the parallel result could get large if the model is verbose. Monitor response sizes. If you hit the limit, truncate or summarize specialist responses before returning them from the branch, or store full responses in S3 and return a reference.

Debugging failures. If the plan step fails after exhausting retries, or a specialist consistently times out, the execution moves to FAILED status. Use get-durable-execution to see which step failed and the error message. The coordinator uses context.logger (replay-aware), so CloudWatch Logs show each phase's progress without duplicate lines from replays. Specialist failures are easier to debug since each specialist has its own log group (sam logs --name PriceResearchFunction --tail).

Replay safety. Durable functions re-execute your handler from the top on every resume. Completed context.step() and context.invoke() calls return cached results, but any code outside those primitives runs again on every replay. If you add a side effect (writing to DynamoDB, sending a notification, calling an external API), wrap it in a context.step() so it executes exactly once. The coordinator's comment on the save-recommendation step shows this pattern. Without the step wrapper, you'd send duplicate notifications every time the function replays.

Cold starts. The used car scenario invokes 5 specialist Lambda functions in parallel. If all 5 are cold, that's 5 concurrent cold starts: arm64 Node.js 24, the Strands SDK, and the Bedrock client. This can add several seconds to the first specialist round. The in-process demo had no cold start penalty for specialists since everything ran in one process. In practice, the cold start overhead is small relative to the Bedrock inference time that follows. For workflows that already include multi-second model calls and human approval, a few extra seconds on the first invocation is rarely the bottleneck.

Wrapping Up

This post showed how to take a multi-agent system from a single-process demo to a deployed, fault-tolerant system on Lambda with durable functions. The RISEN prompts carry over with minimal changes. The architectural shift is from in-process tool calls to checkpointed Lambda-to-Lambda invocations with independent scaling, failure isolation, and human-in-the-loop approval.

Additional Resources

Demo repository for this post
Building Multi-Agent Systems with RISEN Prompts and Strands Agents: The in-process multi-agent demo this builds on
AWS Lambda Durable Functions: Building Long-Running Workflows in Code: Durable execution primitives and the support triage demo
AWS Lambda Durable Functions documentation
Durable Execution SDK for JavaScript (GitHub)
Strands Agents SDK (TypeScript)

What multi-agent workflow would you deploy with durable functions? Let me know in the comments!

AWS Lambda Durable Functions: Building Long-Running Workflows in Code

Gunnar Grosch — Tue, 17 Mar 2026 22:06:16 +0000

If you've built anything non-trivial on AWS Lambda, you've hit the wall. The function runs for 15 minutes and it's stateless. Any multi-step workflow requires stitching together Step Functions, SQS queues, DynamoDB tables for state, and a whole lot of glue. It works, but it's a lot of infrastructure for what should be straightforward sequential logic.

AWS Lambda Durable Functions, launched at re:Invent 2025, change that. You write sequential code in a single Lambda function. The SDK handles checkpointing, failure recovery, and suspension. Your function can run for up to a year, and you only pay for active compute time. During waits (human approvals, timers, external callbacks), the function suspends and compute charges stop.

In this post, I'll walk through what problem durable functions solve, how the checkpoint/replay model works, and then dig into a complete AI-powered support ticket workflow in TypeScript that demonstrates every primitive in action.

What Problem This Solves

Here's a scenario most teams deal with: a support ticket arrives, someone needs to triage it, figure out who should handle it, wait for them to respond, and then close the loop with the customer. Before durable functions, you had a few options:

Step Functions: Define an ASL state machine with states for each step, configure IAM for each integration, manage the state machine as a separate resource. Great for cross-service orchestration, but heavyweight for application logic that naturally reads as sequential code.

SQS + multiple Lambda functions: Break the workflow into separate functions connected by queues. Now you're managing message formats, dead-letter queues, idempotency, and correlating state across function boundaries.

Polling loop with DynamoDB: One function writes state to DynamoDB, another polls for changes. Works, but you're paying for polling compute and managing your own state machine.

All three approaches take what should be straightforward sequential logic and spread it across multiple services, IAM policies, and configuration files.

With durable functions, that same workflow looks like this:

const handler = async (event: TicketEvent, context: DurableContext) => {
  const analysis = await context.step("analyze", async () => analyzeTicket(event));
  const response = await context.waitForCallback("agent-review",
    async (callbackId) => notifyAgent(callbackId, analysis)
  );
  if (analysis.needsEscalation) {
    await context.waitForCallback("specialist-review",
      async (callbackId) => notifySpecialist(callbackId, analysis)
    );
  }
  await context.parallel("close-ticket", [
    { name: "reply", func: async (ctx) => ctx.step("send-reply", async () => sendReply(response)) },
    { name: "survey", func: async (ctx) => ctx.step("send-survey", async () => sendSurvey(event)) },
  ]);
  return { status: "resolved", ticketId: event.ticketId };
};

One function. Sequential code. The SDK handles checkpointing each step, suspending during the human review waits, and resuming when the callbacks arrive.

How Checkpoint/Replay Works

This is the part that makes everything else make sense. Durable functions use a checkpoint and replay model. Here's how it works:

First invocation: Your handler runs from the beginning. Each context.step() executes your code and checkpoints the result.
Suspension: When context.wait() (fixed-duration pause) or context.waitForCallback() (external signal) is called, the function terminates. Compute charges stop.
Resumption: When the wait completes or a callback arrives, Lambda invokes your handler again from the beginning. But this time, completed steps return their cached results instantly without re-executing. Execution picks up from the first non-checkpointed operation.

First invocation:
  analyze     ->  [executes Bedrock call, checkpoints result]
  agent-review ->  [creates callback, function suspends]

Second invocation (agent responds):
  analyze     ->  [returns cached result, skips Bedrock call]
  agent-review ->  [returns callback result]
  close-ticket ->  [sends reply + survey in parallel]

There's one critical rule that falls out of this: code outside steps re-executes on every replay and must be deterministic. If you use Date.now(), Math.random(), or crypto.randomUUID() outside a step, you'll get different values on each replay. Wrap non-deterministic operations in steps.

// Wrong: different value on each replay
const id = crypto.randomUUID();

// Right: checkpointed, same value on every replay
const id = await context.step("gen-id", async () => crypto.randomUUID());

What You'll Build

A support ticket triage workflow where AI handles the first pass and humans make the final call. This is the pattern that makes durable functions click: the AI analysis takes seconds, but the human reviews take hours or days. Without durable functions, you'd need to persist state somewhere and wire up resumption logic. With them, you just write await context.waitForCallback() and the function suspends until the human responds.

Primitive	What It Does	Where You'll See It
`step()`	Execute and checkpoint an atomic operation	AI ticket analysis with Bedrock
`waitForCallback()`	Suspend until an external system responds	Agent review, specialist escalation
`parallel()`	Run multiple branches concurrently	Customer reply + satisfaction survey
Retry strategies	Automatic retry with exponential backoff	Bedrock API calls
`context.logger`	Replay-aware structured logging (suppresses duplicate output during replay)	Throughout

The complete source code is on GitHub: github.com/gunnargrosch/durable-support-triage. Clone the repo and run npm run demo to try the full workflow locally with mocked Bedrock responses, or deploy to AWS and run it with real Bedrock.

Getting Started

You'll need:

An AWS account with credentials configured
AWS SAM CLI 1.153.1 or later (minimum version with DurableConfig support)
Node.js 24 or later
Access to Amazon Bedrock with Claude Haiku 4.5 enabled in your region

Clone and install

git clone https://github.com/gunnargrosch/durable-support-triage.git
cd durable-support-triage
npm install

The SAM Template

Here's the template.yaml:

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: AI-powered support ticket triage with durable functions

Parameters:
  BedrockModelId:
    Type: String
    Default: anthropic.claude-haiku-4-5-20251001-v1:0
    Description: Bedrock foundation model ID (uses global inference profile prefix automatically)

Globals:
  Function:
    Timeout: 120
    MemorySize: 256
    Runtime: nodejs24.x

Resources:
  SupportTriageFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: durable-support-triage
      Handler: index.handler
      CodeUri: src/
      DurableConfig:
        ExecutionTimeout: 604800
        RetentionPeriodInDays: 14
      Policies:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - bedrock:InvokeModel
              Resource:
                - !Sub "arn:aws:bedrock:*::foundation-model/${BedrockModelId}"
                - !Sub "arn:aws:bedrock:${AWS::Region}:${AWS::AccountId}:inference-profile/global.${BedrockModelId}"
      AutoPublishAlias: live
      Environment:
        Variables:
          BEDROCK_MODEL_ID: !Sub "global.${BedrockModelId}"
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        Minify: false
        Target: es2022
        EntryPoints:
          - index.ts

Outputs:
  FunctionArn:
    Value: !GetAtt SupportTriageFunction.Arn
  AliasArn:
    Value: !Ref SupportTriageFunctionAliaslive

A few things to note about this template:

Globals.Timeout: 120 is the standard Lambda invocation timeout. It applies to each individual invocation (each replay round), not the overall workflow. Two minutes is plenty for each replay round. ExecutionTimeout in DurableConfig is the total wall-clock time for the entire durable execution.
DurableConfig is the only new property compared to a standard Lambda function. ExecutionTimeout is in seconds (604,800 = 7 days). The individual callback timeouts handle the per-step boundaries, but the execution timeout is your outer safety net. A ticket that needs specialist review might sit over a weekend, so 7 days gives headroom. RetentionPeriodInDays controls how long execution history is kept (1 to 90 days).
AutoPublishAlias: live automatically creates a Lambda version and alias on each deploy. This is important for two reasons: durable functions require a qualified ARN (with version or alias) for invocation, and Lambda pins each execution to the version that started it. If you deploy new code while an execution is suspended, replay still uses the original version. This prevents inconsistencies from code changes mid-workflow.
AWSLambdaBasicDurableExecutionRolePolicy is an AWS managed policy that grants the checkpoint and state permissions your function needs (lambda:CheckpointDurableExecutions, lambda:GetDurableExecutionState) plus the standard CloudWatch Logs permissions.
Bedrock IAM uses a cross-region inference profile (global. prefix on the model ID) so that Bedrock routes requests to whichever region has capacity. The policy needs two resource ARNs: the foundation model (wildcard region, no account ID) and the inference profile (your region and account). The BedrockModelId parameter defaults to Claude Haiku 4.5 but you can override it at deploy time with --parameter-overrides BedrockModelId=<model-id>. Check Bedrock model availability for what's enabled in your region.

The RISEN Prompt

The AI triage uses Amazon Bedrock with Claude Haiku 4.5 to analyze incoming tickets. I'm using the RISEN framework for the system prompt. RISEN structures prompts into five components: Role, Instructions, Steps, Expectation, and Narrowing. Each component serves a specific purpose, and together they produce consistent, structured output that your code can reliably parse.

Here's the system prompt for the triage agent:

const TRIAGE_SYSTEM_PROMPT = `
# Role
You are a senior technical support analyst with 10 years of experience
triaging customer support tickets for a SaaS platform. You specialize in
categorizing issues by severity, identifying root causes, and drafting
professional responses.

# Instructions
Analyze the incoming support ticket and produce a structured triage
assessment with category, priority, sentiment, a suggested response,
and an escalation recommendation.

# Steps
1. Read the ticket subject and body to identify the core issue.
2. Categorize the issue (billing, technical, account, feature-request, other).
3. Assess priority based on business impact and urgency (critical, high, medium, low).
4. Evaluate customer sentiment (frustrated, neutral, positive).
5. Draft a suggested response that acknowledges the issue and outlines next steps.
6. Determine whether the ticket needs specialist escalation.

# Expectation
Return a JSON object with this exact structure:
{
  "category": "billing" | "technical" | "account" | "feature-request" | "other",
  "priority": "critical" | "high" | "medium" | "low",
  "sentiment": "frustrated" | "neutral" | "positive",
  "suggestedResponse": "string",
  "needsEscalation": boolean,
  "escalationReason": "string or null",
  "summary": "One-sentence summary of the issue"
}

# Narrowing
- Return only raw JSON. Do not wrap it in markdown code fences, backticks,
  or any other formatting. No explanation, no preamble, no commentary.
- Do not fabricate account details or order numbers not present in the ticket.
- Do not promise refunds, credits, or policy exceptions in the suggested response.
- needsEscalation MUST be false unless one of these exact conditions is met:
  1. The ticket describes confirmed or suspected data loss.
  2. The ticket describes a security breach, unauthorized access, or credential compromise.
  3. The ticket involves a legal or compliance issue.
  4. The customer tier is "enterprise".
  For all other tickets (billing issues, bugs, feature requests, general questions),
  needsEscalation MUST be false regardless of priority or sentiment.
- Keep the suggested response under 200 words.
`;

The Narrowing section does the heavy lifting for reliability. The explicit numbered escalation conditions prevent the model from over-escalating standard tickets (without these, Haiku flagged a routine CSV bug for specialist review). The constraint against promising refunds keeps the AI from making commitments that only a human should make. The pipe syntax in the Expectation section ("billing" | "technical" | ...) is instructional for the model, not literal JSON. It tells the model which values are valid without requiring a separate schema document. Note that "return only raw JSON" doesn't guarantee it: some models still wrap output in markdown code fences despite the instruction. The handler strips them defensively before parsing.

The Handler

Here's the handler from src/index.ts, trimmed to show the durable execution primitives. The full source (input validation, response parsing, integration stubs) is in the repo. Types are defined in src/types.ts. The TRIAGE_SYSTEM_PROMPT shown in the RISEN section above is defined at module scope.

import {
  withDurableExecution, DurableContext,
  createRetryStrategy, JitterStrategy, defaultSerdes,
} from "@aws/durable-execution-sdk-js";
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";
import type { TicketEvent, TriageResult, AgentReview, SpecialistReview, TicketResolution } from "./types";

const bedrock = new BedrockRuntimeClient();

async function closeTicket(
  context: DurableContext,
  name: string,
  email: string,
  ticketId: string,
  response: string
): Promise<void> {
  await context.parallel(name, [
    {
      name: "send-reply",
      func: async (ctx) =>
        ctx.step("reply", async () => {
          await sendCustomerReply(email, ticketId, response);
        }),
    },
    {
      name: "send-survey",
      func: async (ctx) =>
        ctx.step("survey", async () => {
          await sendSatisfactionSurvey(email, ticketId);
        }),
    },
  ]);
}

export const handler = withDurableExecution(
  async (event: TicketEvent, context: DurableContext): Promise<TicketResolution> => {
    validateEvent(event);

    context.logger.info("Ticket received", {
      ticketId: event.ticketId,
      customerTier: event.customerTier,
    });

    // Step 1: AI analyzes the ticket using Bedrock
    const analysis = await context.step(
      "analyze-ticket",
      async () => {
        const response = await bedrock.send(new InvokeModelCommand({
          modelId: process.env.BEDROCK_MODEL_ID,
          contentType: "application/json",
          accept: "application/json",
          body: JSON.stringify({
            anthropic_version: "bedrock-2023-05-31",
            max_tokens: 1024,
            system: TRIAGE_SYSTEM_PROMPT,
            messages: [
              {
                role: "user",
                content: `Ticket ID: ${event.ticketId}\nCustomer Tier: ${event.customerTier}\nSubject: ${event.subject}\n\n${event.body}`,
              },
            ],
          }),
        }));
        return parseBedrockResponse(response.body as Uint8Array);
      },
      {
        retryStrategy: createRetryStrategy({
          maxAttempts: 3,
          initialDelay: { seconds: 2 },
          maxDelay: { seconds: 30 },
          backoffRate: 2.0,
          jitter: JitterStrategy.FULL,
        }),
      }
    );

    // Step 2: Support agent reviews AI suggestion
    const agentReview = await context.waitForCallback<AgentReview>(
      "agent-review",
      async (callbackId) => {
        await notifyAgent({ callbackId, ticketId: event.ticketId, analysis, customerTier: event.customerTier });
      },
      { timeout: { hours: 8 }, serdes: defaultSerdes }
    );

    const finalResponse = agentReview.editedResponse || analysis.suggestedResponse;

    if (!agentReview.approved) {
      return { status: "rejected", ticketId: event.ticketId, category: analysis.category, priority: analysis.priority, finalResponse: "" };
    }

    // Step 3: If escalation needed, wait for specialist
    if (analysis.needsEscalation) {
      const specialistResponse = await context.waitForCallback<SpecialistReview>(
        "specialist-review",
        async (callbackId) => {
          await notifySpecialist({ callbackId, ticketId: event.ticketId, analysis, agentNotes: agentReview.agentNotes });
        },
        { timeout: { days: 3 }, serdes: defaultSerdes }
      );

      const resolvedResponse = specialistResponse.response || finalResponse;
      await closeTicket(context, "close-escalated-ticket", event.contactEmail, event.ticketId, resolvedResponse);
      return { status: "escalated", ticketId: event.ticketId, category: analysis.category, priority: analysis.priority, finalResponse: resolvedResponse };
    }

    // Step 4: Send reply and survey in parallel
    await closeTicket(context, "close-ticket", event.contactEmail, event.ticketId, finalResponse);
    return { status: "resolved", ticketId: event.ticketId, category: analysis.category, priority: analysis.priority, finalResponse };
  }
);

Let's walk through what's happening:

withDurableExecution wraps your async function and returns a standard Lambda handler. The runtime calls it like any other handler; the SDK intercepts the execution to manage checkpoints. The BedrockRuntimeClient is instantiated at module scope, which is standard Lambda practice for connection reuse across warm-start invocations. Each replay is a new Lambda invocation, but it may reuse a warm container just like any regular invocation.

validateEvent runs before the first context.step(). This is intentional: if the payload is malformed, the execution fails immediately instead of after the Bedrock step has already been checkpointed. Durable executions that fail after partial checkpointing are harder to reason about than ones that fail fast. Validation is deterministic and cheap, so re-running it on every replay is harmless.

context.step("analyze-ticket", ...) calls Amazon Bedrock with the RISEN prompt and checkpoints the result. The retry strategy handles transient Bedrock API errors (throttling, temporary unavailability) with exponential backoff. parseBedrockResponse handles the response parsing separately: it strips markdown code fences (some models wrap JSON output despite the prompt instruction), validates the response structure, and gives clear error messages on parse failures. If the function replays later, this step returns the cached analysis without calling Bedrock again. That matters for both cost and consistency: you don't want the AI to produce a different triage on replay.

context.waitForCallback("agent-review", ...) is where the function suspends. The SDK creates a callback ID and passes it to your submitter function (which sends it to Slack, email, or your ticketing UI). The submitter runs exactly once, on the invocation that creates the callback. On replay, the SDK skips the submitter entirely and returns the callback result directly. This is important: even though the submitter isn't wrapped in a context.step(), it won't re-execute on replay. The SDK then terminates the Lambda function. Compute charges stop. The agent might respond in minutes or hours. When they do, an external system calls SendDurableExecutionCallbackSuccess with their review. Lambda resumes the function from where it left off. The serdes: defaultSerdes option is required for typed callbacks. Without it, the SDK uses passthrough serialization (not JSON.parse) and agentReview.approved would be undefined at runtime even though TypeScript thinks it's a boolean. This isn't documented yet: step() defaults to JSON serdes, but waitForCallback defaults to passthrough. The SDK exports defaultSerdes for this purpose. If the callback times out (8 hours for the agent, 3 days for the specialist), the SDK throws a CallbackTimeoutError. In production, wrap the callback in a try/catch to handle the timeout (re-queue the ticket, notify a manager, or auto-escalate).

If the agent rejects the AI suggestion (approved: false), the workflow returns early with a rejected status without sending a customer reply. Your ticketing system handles the next step (re-queue, reassign, or manual response).

The escalation path adds a second waitForCallback. If the AI flagged the ticket for escalation (security concern, data loss, enterprise customer), the function suspends again waiting for a specialist. The specialist sends back a SpecialistReview with a response and notes. This callback has a 3-day timeout because specialist reviews can take time. Without durable functions, you'd need a separate state machine or database to track which tickets are waiting for specialists.

context.logger replaces console.log. During replay, completed steps don't re-execute, but code outside steps does. context.logger suppresses duplicate log output during replay so your CloudWatch Logs stay clean. With console.log, you'd see the same log lines repeated on every replay invocation.

closeTicket extracts the parallel close-out into a helper. Both the escalation and standard paths send a customer reply and satisfaction survey concurrently using context.parallel. Each branch gets its own child context with isolated state tracking. The helper takes a dynamic context name so the escalated and standard close-out steps are distinguishable in the execution history.

Testing Locally

The testing SDK (@aws/durable-execution-sdk-js-testing) lets you run durable functions locally without deploying. Here's the key pattern from src/index.test.ts, showing how to drive a callback-based workflow in a test:

import { LocalDurableTestRunner } from "@aws/durable-execution-sdk-js-testing";

// jest.mock replaces BedrockRuntimeClient so analyze-ticket returns controlled results
// (see full mock setup in the repo)
import { handler } from "./index";

describe("Support Triage", () => {
  let runner: LocalDurableTestRunner;

  beforeAll(async () => {
    await LocalDurableTestRunner.setupTestEnvironment({ skipTime: true });
  });

  afterAll(async () => {
    await LocalDurableTestRunner.teardownTestEnvironment();
  });

  beforeEach(() => {
    runner = new LocalDurableTestRunner({ handlerFunction: handler });
  });

  afterEach(async () => {
    await runner.reset();
  });

  it("should resolve a standard ticket after agent review", async () => {
    const result = runner.run({
      payload: {
        ticketId: "TKT-001",
        customerId: "CUST-123",
        customerTier: "pro",
        subject: "Cannot export CSV reports",
        body: "When I click the export button, nothing happens.",
        contactEmail: "customer@example.com",
      },
    });

    // The handler suspends at waitForCallback("agent-review").
    // getOperation blocks until the callback is created.
    const agentCallback = await runner.getOperation("agent-review");
    const agentDetails = await agentCallback.waitForData();
    await agentDetails.sendCallbackSuccess(JSON.stringify({
      approved: true,
      editedResponse: "We have identified the CSV export issue and a fix is rolling out today.",
      agentNotes: "Known bug, fix in deploy pipeline",
    }));

    const output = await result;
    expect(output.getStatus()).toBe("SUCCEEDED");
    expect(output.getResult()).toMatchObject({ status: "resolved", ticketId: "TKT-001" });
  });

  // Additional tests in the repo: escalation flow, agent rejection, callback failure
});

setupTestEnvironment and teardownTestEnvironment are static methods that start and stop the local checkpoint server. They run once per test file in beforeAll/afterAll. The runner instance is created per test in beforeEach and reset in afterEach. The skipTime: true option fast-forwards any context.wait() calls so tests run instantly.

The interesting part is the callback interaction: runner.getOperation("agent-review") blocks until the handler reaches waitForCallback("agent-review") and creates the callback. Then sendCallbackSuccess simulates the external system responding. This lets you test the full suspend/resume lifecycle without deploying. The repo includes tests for all three paths: standard resolution, escalation with specialist review, and agent rejection. There's also a test that uses sendCallbackFailure to verify error handling when an external system reports a failure.

Run the tests with npm test, or use the run-durable CLI to run the handler directly with a payload:

npx run-durable --skip-time --verbose --event '{"ticketId":"TKT-001","subject":"Test ticket"}' src/index.ts

Try the Demo

The repo includes an interactive demo that runs the entire workflow in your terminal, showing each stage: AI analysis, human review prompts, checkpoint history, and final resolution.

Run npm run demo to open the interactive menu, or skip it with a direct command:

Local mode (no AWS credentials needed)

npm run demo:local -- --ticket=standard

Local mode uses mocked Bedrock responses so you can see the full workflow without calling AWS. Here's what the first few steps look like:

  ──────────────────────────────────────────────
  Ticket TKT-001
  Customer:  CUST-123 (pro)
  Subject:   Cannot export CSV reports
  ──────────────────────────────────────────────

  ▶ step: analyze-ticket
    ✓ Checkpointed analyze-ticket (1204ms)

  ──────────────────────────────────────────────
  AI Triage Result
  Category:    technical
  Priority:    high
  Sentiment:   frustrated
  Escalation:  No
  ──────────────────────────────────────────────

  ⏸ waitForCallback: agent-review
    Function suspended. Compute charges stopped.

The demo pauses at each callback and prompts you to respond as the agent or specialist. It walks through three ticket scenarios:

Standard ticket (pro tier, CSV export bug): AI analyzes, agent approves with edits, reply sent.
Enterprise escalation (security concern): AI flags for escalation, agent approves, specialist reviews, reply sent.
Agent rejection (feature request): AI suggests a response, agent rejects, ticket returned for manual handling.

At each human review step, the demo pauses and lets you play the role of the support agent or specialist. You approve, reject, or edit the AI's suggestion. The demo shows the execution history after each step so you can see the checkpoint/replay model in action.

Cloud mode (real Bedrock responses)

npm run demo:cloud -- --ticket=standard --region=us-east-2

Cloud mode invokes the deployed Lambda function with real Bedrock calls. You'll see actual AI-generated triage analysis and can watch the durable execution checkpoints in the Lambda console. Add --profile=<name> if you're not using the default AWS profile.

Deploying and Invoking

Build and deploy:

sam build
sam deploy --guided

Once deployed, invoke the function asynchronously with a qualified ARN. The AutoPublishAlias in the template created a live alias:

aws lambda invoke \
  --function-name durable-support-triage:live \
  --invocation-type Event \
  --durable-execution-name "ticket-TKT-001" \
  --cli-binary-format raw-in-base64-out \
  --payload '{
    "ticketId": "TKT-001",
    "customerId": "CUST-123",
    "customerTier": "pro",
    "subject": "Cannot export CSV reports",
    "body": "When I click the export button, nothing happens.",
    "contactEmail": "customer@example.com"
  }' \
  response.json

A few things to note:

--invocation-type Event is required for long-running workflows. Synchronous invocation (RequestResponse) times out after 15 minutes.
--durable-execution-name provides built-in idempotency. If you invoke the function twice with the same execution name, the second invocation returns the existing execution instead of creating a duplicate. Using the ticket ID as the execution name is a natural fit. Note: execution names must be alphanumeric, hyphens, or underscores. If your ticket IDs contain dots, slashes, or other special characters, sanitize them first.
:live on the function name is the alias qualifier. Without it, you'll get InvalidParameterValueException: Durable execution requires qualified function identifier.

Monitoring execution progress

Check execution status in the Lambda console under the Durable executions tab. You'll see each step's status and timing, including when the function suspended waiting for the agent callback.

You can also check programmatically:

aws lambda get-durable-execution \
  --durable-execution-arn "arn:aws:lambda:us-east-2:123456789012:function:durable-support-triage:live/durable-execution/ticket-TKT-001/<run-id>"

Replace <run-id> with the run ID returned in the initial invoke response. You can also find it in the Lambda console under the Durable executions tab.

Completing the callbacks

When the support agent finishes their review, send the callback from your ticketing system:

aws lambda send-durable-execution-callback-success \
  --callback-id "your-callback-id-here" \
  --cli-binary-format raw-in-base64-out \
  --result '{
    "approved": true,
    "editedResponse": "Hi, thanks for reporting this. We have identified the issue and a fix is rolling out today.",
    "agentNotes": "Known bug in CSV export module"
  }'

The --result value is a JSON string. The CLI handles serialization, so you pass the JSON object directly (unlike the test SDK, where you explicitly call JSON.stringify()).

The function resumes, checks for escalation, and continues. If a specialist callback is pending, the same pattern applies: the specialist's system calls send-durable-execution-callback-success when their review is complete. There's also send-durable-execution-callback-failure for when an external system needs to report an error (e.g., agent rejects the ticket, or an integration fails).

You can also react to execution state changes via EventBridge. Lambda emits events to the default event bus when executions start, succeed, fail, or time out. Create an EventBridge rule with this event pattern:

{
  "source": ["aws.lambda"],
  "detail-type": ["Durable Execution Status Change"]
}

Gotchas and Hard-Won Lessons

Determinism is not optional

This is the rule that trips people up. Code outside steps re-executes on every replay. If it produces different results each time, your workflow breaks in subtle ways.

// This generates a different ID on each replay
const requestId = crypto.randomUUID();
await context.step("use-id", async () => saveToDb(requestId));

// This generates one ID, checkpoints it, and returns the same value on replay
const requestId = await context.step("gen-id", async () => crypto.randomUUID());
await context.step("use-id", async () => saveToDb(requestId));

The same applies to Date.now(), Math.random(), API calls, and database queries. If it can return different values, wrap it in a step.

The ESLint plugin (@aws/durable-execution-sdk-js-eslint-plugin) catches common violations. Set it up early.

Closure mutations are lost on replay

Variables you modify inside a step are not preserved across replays:

let total = 0;
await context.step("calculate", async () => {
  total = 42; // This mutation is lost on replay
});
console.log(total); // Always 0 on replay

// Instead, return values from steps
total = await context.step("calculate", async () => 42);
console.log(total); // Always 42

This is because the step function doesn't re-execute on replay. It returns the cached result. But the closure variable was modified by the function body, which never ran. Return values from steps instead of modifying outer variables.

The qualified ARN requirement is easy to miss

Durable functions require a qualified function identifier: a version number, alias, or $LATEST. Using an unqualified ARN silently fails or throws InvalidParameterValueException.

The AutoPublishAlias SAM property solves this. It creates a new version and updates the alias on every deploy. If you're using EventBridge Scheduler or other services to invoke your function, make sure they target the alias ARN, not the unqualified ARN.

More steps means slower replay

Every time your function resumes, the SDK replays from the beginning, returning cached results for completed steps. The more steps you have, the more replay overhead per resumption.

This is a trade-off. More granular steps give you better debuggability and more precise retry boundaries. Fewer steps give you faster replay. In practice, group related operations into a single step unless you need separate retry behavior or checkpoint boundaries.

context.wait() is the simplest superpower

The "How Checkpoint/Replay Works" section mentions context.wait() for fixed-duration pauses, but the handler only uses waitForCallback(). In practice, context.wait() is one of the most useful primitives. Need a 24-hour cooling-off period before sending a follow-up? One line:

await context.wait("cooling-off", { hours: 24 });

The function suspends, compute charges stop, and Lambda resumes it 24 hours later. No cron jobs, no EventBridge Scheduler, no polling.

You can't enable durable execution on existing functions

DurableConfig can only be set when creating a function. You can't toggle it on an existing function. If you need to migrate, you'll need to create a new function. Plan for this.

Changing DurableConfig in CloudFormation also requires resource replacement, not an in-place update.

Checkpoint payloads have a 256KB limit

Each step result is serialized and stored as a checkpoint. If a step returns an object larger than 256KB, you'll get a CheckpointUnrecoverableExecutionError.

The workaround: store large data in S3 or DynamoDB and return a reference from the step.

const dataRef = await context.step("store-large-data", async () => {
  const key = `tickets/${ticketId}/attachments.json`;
  await s3.putObject({ Bucket: bucket, Key: key, Body: JSON.stringify(largeData) });
  return { bucket, key };
});

Know your failure modes

If a step throws an unrecoverable error (after all retries are exhausted), the execution moves to a FAILED state. You can inspect the error in the Lambda console or via the get-durable-execution API. For cases where you want to fail immediately without retries, throw an UnrecoverableInvocationError:

import { UnrecoverableInvocationError } from "@aws/durable-execution-sdk-js";
throw new UnrecoverableInvocationError("Customer account not found");

Failed executions can't be resumed. You'd need to start a new execution. Design your workflows so that steps are idempotent in case you need to re-run from scratch.

Use Lambda versions for deploy safety

If you update your function code while an execution is suspended, replay uses the version that started the execution. This prevents inconsistencies from code changes mid-workflow. AutoPublishAlias handles this, but it's worth understanding why: if your new code changes a step's return shape or removes a step, replay on the old version still works because Lambda pins executions to their starting version.

Version pinning protects your function code, but it doesn't protect external schemas. If an execution suspends on Monday waiting for a callback, and on Wednesday your ticketing system starts sending a different JSON structure in the callback payload, the Monday execution will fail when it resumes. Keep callback payloads backwards compatible for as long as executions can be in flight.

Plan your observability early

For production workflows that can run for days, go beyond basic CloudWatch Logs. Set up CloudWatch alarms on stuck executions (no state change within expected timeframes), use the EventBridge integration to track execution lifecycle events, and consider CloudWatch Logs Insights queries for filtering by execution name across replays.

When to Use What

Durable functions and Step Functions are not competing. They solve different problems.

	Durable Functions	Step Functions
Workflow definition	Sequential code in your language	Amazon States Language (JSON/YAML)
Best for	Application logic, tightly coupled workflows	Cross-service orchestration
Service integrations	Via SDK in your code	220+ native integrations
Debugging	CloudWatch Logs, execution history	Visual console, step-by-step
Infrastructure	Single Lambda function	State machine + Lambda functions
Scaling	Lambda concurrency limits	Distributed Map for large-scale parallel processing
Mental model	Write code	Design state machines

On cost: durable functions use standard Lambda pricing for active compute time. During waits, compute charges stop. Step Functions charges per state transition, which adds up for high-volume workflows. See Lambda pricing and Step Functions pricing for current details.

Use durable functions when your workflow is application logic that reads naturally as sequential code. Support ticket triage, approval workflows, AI agent loops, saga patterns.

Use Step Functions when you're orchestrating across multiple AWS services, need visual debugging, or need the 220+ native integrations. ETL pipelines, media processing, infrastructure provisioning.

Use both together when you have a high-level orchestration (Step Functions) that delegates to individual workflows (durable functions) for complex application logic.

What's Next

This post covered the fundamentals: what durable functions are, how checkpoint/replay works, and how to build and test a complete AI-powered workflow with human-in-the-loop callbacks. In the next post, I'll use durable functions to build a multi-agent orchestration workflow where multiple AI agents collaborate on complex tasks with checkpointed reasoning.

Additional Resources

What workflows are you thinking about building with durable functions? Let me know in the comments!

Chaos Engineering for AWS Lambda: failure-lambda 1.0

Gunnar Grosch — Thu, 12 Mar 2026 16:44:15 +0000

I wrote the first version of failure-lambda back in 2019. The idea was simple: inject faults into AWS Lambda functions so you can test how your system behaves when things go wrong. Latency spikes, exceptions, blocked network calls. The kind of failures that happen in production whether you're ready for them or not.

That version worked. People used it. But the codebase was showing its age. JavaScript with no types. AWS SDK v2. A flat configuration format that only allowed one failure mode at a time. And it only worked with Node.js.

failure-lambda 1.0 is a ground-up rewrite. TypeScript, AWS SDK v3, a feature flag configuration model, two new failure modes (timeout and corruption), and a Lambda Layer that brings fault injection to any managed runtime with zero code changes.

Why Chaos Engineering for Lambda?

If you're building on Lambda, you're building on a managed service. You don't manage servers, but you still manage dependencies. Your function calls DynamoDB, S3, third-party APIs, other microservices. Any of those can be slow, unreliable, or unavailable.

The question isn't whether failures will happen. It's whether your system handles them gracefully when they do. Does your function retry correctly? Does your circuit breaker trip? Does your API return a useful error message instead of a 500?

Failure injection lets you answer those questions before your users do. Enable latency injection and watch your downstream timeouts. Block a dependency with the denylist and see if your fallback logic works. Return a 503 and check that your retry policy backs off properly.

The important part: you control when and how these failures happen. Start with one mode at a low percentage in a test environment. Increase gradually. Build confidence that your system does what you think it does.

What's New

The short version: everything. TypeScript with full type definitions. AWS SDK v3. Two new failure modes.

The old format was flat: one failure mode, one rate, one toggle. The new format treats each mode as an independent feature flag. You can enable latency injection at 50% and DNS denylist at 100% simultaneously:

{
  "latency": {"enabled": true, "percentage": 50, "min_latency": 200, "max_latency": 500},
  "denylist": {"enabled": true, "deny_list": ["s3.*.amazonaws.com"]}
}

Seven failure modes:

Mode	Effect
`latency`	Adds random delay between configured bounds
`timeout`	Sleeps until the function is about to time out
`exception`	Throws an exception with a configurable message
`statuscode`	Returns a specific HTTP status code
`diskspace`	Fills `/tmp` to consume available disk space
`denylist`	Blocks network calls to matching hostnames
`corruption`	Mangles the response body after the handler returns

Beyond the modes:

Event-based targeting. Match conditions restrict injection to specific requests. Only corrupt GET requests to the prod stage. Only add latency to requests hitting /api. Conditions support exact match, exists, startsWith, and regex operators.
```
{
  "latency": {
    "enabled": true,
    "min_latency": 200,
    "max_latency": 500,
    "match": [{"path": "requestContext.http.path", "operator": "startsWith", "value": "/api"}]
  }
}
```
AppConfig Feature Flags. Native support for the AWS.AppConfig.FeatureFlags profile type. AppConfig gives you deployment strategies and automatic rollback, useful when you don't want an accidental "enable all failures at 100%" to take down your environment.
Middy middleware. If you use Middy, import failure-lambda/middy instead of wrapping your handler.
CLI. npx failure-lambda gives you an interactive CLI for managing failure configuration. Check status, enable modes, disable everything. Supports both SSM and AppConfig backends and saves connection profiles so you don't retype region and parameter names every time.

The Lambda Layer

This is the biggest addition. The npm package requires you to import failure-lambda and wrap your handler. That's fine for Node.js, but it doesn't help if your Lambda functions are written in Python, Java, .NET, or Ruby.

The Lambda Layer solves this. Add the layer to your function, set two environment variables, and fault injection works without touching your code. No imports, no wrapper, no middleware.

AWS_LAMBDA_EXEC_WRAPPER=/opt/failure-lambda-wrapper
FAILURE_INJECTION_PARAM=failureLambdaConfig

Under the hood, it's a lightweight Rust proxy that sits between your handler and the Lambda Runtime API. Single static binary, no runtime dependencies, negligible cold start impact. On each invocation, the proxy reads your failure configuration from SSM or AppConfig and decides whether to inject faults before or after forwarding the request to your handler. Your code never knows it's there.

┌─────────────────────────────────────────────────┐
│ Lambda Execution Environment                    │
│                                                 │
│  Lambda Runtime API                             │
│       │                                         │
│       ▼                                         │
│  failure-lambda proxy (Rust)                    │
│       │                                         │
│       ├── Read config from SSM / AppConfig      │
│       ├── Inject fault? ──yes──▶ Return early   │
│       │                         (statuscode,    │
│       │                          exception,     │
│       │                          latency)       │
│       no                                        │
│       │                                         │
│       ▼                                         │
│  Your handler (any runtime)                     │
│       │                                         │
│       ├── Inject fault? ──yes──▶ Modify response│
│       │                         (corruption)    │
│       no                                        │
│       │                                         │
│       ▼                                         │
│  Response returned                              │
└─────────────────────────────────────────────────┘

The layer works with all managed Lambda runtimes that support AWS_LAMBDA_EXEC_WRAPPER: Node.js, Python, Java, .NET, and Ruby. Both x86_64 and arm64 architectures. Custom runtimes can use the layer if they support AWS_LAMBDA_EXEC_WRAPPER: the runtime bootstrap must check for the variable and invoke the specified executable before starting its own runtime loop. Download the zip from the GitHub release, publish it to your account, and you're ready to go.

Getting Started: npm Package

For Node.js, the npm package gives you the most control.

You'll need:

Node.js 20 or later
An AWS account with permissions to create SSM parameters and Lambda functions
ssm:GetParameter granted to the function's execution role

npm install failure-lambda

Wrap your handler:

import failureLambda from "failure-lambda";

export const handler = failureLambda(async (event, context) => {
  return { statusCode: 200, body: "OK" };
});

Create an SSM parameter with your failure configuration:

aws ssm put-parameter \
  --name failureLambdaConfig \
  --type String \
  --value '{"latency": {"enabled": false, "min_latency": 100, "max_latency": 400}}' \
  --region eu-west-1

Set the FAILURE_INJECTION_PARAM environment variable on your Lambda function to the parameter name, grant ssm:GetParameter, and deploy.

When you're ready to inject a fault:

npx failure-lambda enable latency --param failureLambdaConfig --region eu-west-1

Getting Started: Lambda Layer

For any runtime, the layer path requires no code changes.

You'll need:

An AWS account with permissions to publish Lambda layers and create Lambda functions
AWS CLI configured
ssm:GetParameter granted to the function's execution role

Download failure-lambda-layer-x86_64.zip or failure-lambda-layer-aarch64.zip from the latest release.

Publish the layer:

aws lambda publish-layer-version \
  --layer-name failure-lambda \
  --zip-file fileb://failure-lambda-layer-x86_64.zip \
  --compatible-architectures x86_64 \
  --region eu-west-1

Add the layer ARN to your function, set AWS_LAMBDA_EXEC_WRAPPER=/opt/failure-lambda-wrapper and FAILURE_INJECTION_PARAM=failureLambdaConfig, create the SSM parameter, and grant ssm:GetParameter.

That's it. Your Python handler, your Java handler, your .NET handler: they all get the same fault injection capabilities without a single line of code changed.

Try It Out: Injecting Faults Step by Step

Let's walk through a concrete example. We'll deploy a simple function with the layer, verify it works normally, inject latency, inject a status code error, and then turn everything off. The whole thing uses a standard Node.js handler with zero failure-lambda code in it.

Here's the handler. It simulates a quick database lookup and returns the response time:

export const handler = async (event) => {
  const start = Date.now();
  await new Promise((resolve) => setTimeout(resolve, 5));
  return {
    statusCode: 200,
    body: JSON.stringify({
      message: "Order processed",
      duration_ms: Date.now() - start,
    }),
  };
};

The SAM template adds the layer and sets the two required environment variables. Nothing else:

DemoFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: index.handler
    Runtime: nodejs22.x
    CodeUri: src/
    Timeout: 10
    Layers:
      - !Ref FailureLambdaLayerArn
    Environment:
      Variables:
        AWS_LAMBDA_EXEC_WRAPPER: /opt/failure-lambda-wrapper
        FAILURE_INJECTION_PARAM: !Ref FailureConfig
    Policies:
      - SSMParameterReadPolicy:
          ParameterName: !Ref FailureConfig

The full template including Parameters definitions is in the examples directory. The snippet above shows only the function resource for clarity.

After deploying with sam build && sam deploy --guided, we hit the endpoint a few times to see steady state:

The examples below use curl -s -o - -w ' HTTP %{http_code} | %{time_total}s\n' to append status code and total request time to each response. Standard curl won't show this without the -w flag.

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.21s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.19s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":6}   HTTP 200 | 0.19s

5ms handler duration, ~190ms end-to-end on warm invocations. That's our baseline. Now let's see what happens when conditions aren't ideal.

Injecting latency

A third-party API starts responding slowly. A DynamoDB table is throttling. A downstream microservice is under load. These are real scenarios, and you want to know how your system behaves before they happen in production.

Update the SSM parameter to add 500-1000ms of random latency on every invocation:

aws ssm put-parameter --name <your-param-name> --type String --overwrite \
  --value '{"latency": {"enabled": true, "percentage": 100, "min_latency": 500, "max_latency": 1000}}' \
  --region eu-west-1

Configuration is cached for 60 seconds. If you update the SSM parameter and immediately hit the endpoint, you'll see the old behavior. Wait for the cache to refresh before testing.

Wait about 60 seconds for the configuration cache to refresh, then hit the endpoint again:

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":11}   HTTP 200 | 0.99s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}    HTTP 200 | 1.08s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}    HTTP 200 | 1.14s

Responses jumped from ~190ms to over a second. The handler itself still runs in 5ms: the latency is injected by the proxy before the handler executes. This simulates what happens when a dependency responds slowly. Does your API Gateway timeout kick in at the right threshold? Do callers retry or give up? Does a slow function cause a queue to back up?

Injecting a status code error

Latency is one thing. Complete failure is another. A downstream service returning 5xx errors, an expired API key, a misconfigured endpoint: all of these surface as error responses. Replace the config with a 503 Service Unavailable:

aws ssm put-parameter --name <your-param-name> --type String --overwrite \
  --value '{"statuscode": {"enabled": true, "percentage": 100, "status_code": 503}}' \
  --region eu-west-1

After the cache refreshes (up to 60 seconds):

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Injected status code 503"}   HTTP 503 | 0.31s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Injected status code 503"}   HTTP 503 | 0.21s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Injected status code 503"}   HTTP 503 | 0.18s

The handler never runs. The proxy short-circuits the invocation and returns a 503 directly. This represents a function that's failing for any reason: a permissions change, a missing environment variable, an unhandled exception. Does your frontend show a useful error message or a blank page? Does your Step Functions workflow retry or fail the entire execution? Does your monitoring alert you?

What the logs show

The proxy writes structured JSON logs to CloudWatch. You can see exactly what it's doing on each invocation:

{"source":"failure-lambda","action":"config","config_source":"ssm","enabled_flags":"[]"}
{"source":"failure-lambda","action":"config","config_source":"ssm","enabled_flags":"[\"latency\"]"}
{"source":"failure-lambda","mode":"latency","action":"inject","latency_ms":670,"min_latency":500.0,"max_latency":1000.0}
{"source":"failure-lambda","mode":"latency","action":"inject","latency_ms":859,"min_latency":500.0,"max_latency":1000.0}
{"source":"failure-lambda","mode":"latency","action":"inject","latency_ms":933,"min_latency":500.0,"max_latency":1000.0}
{"source":"failure-lambda","action":"config","config_source":"ssm","enabled_flags":"[\"statuscode\"]"}
{"source":"failure-lambda","mode":"statuscode","action":"inject","status_code":503}
{"source":"failure-lambda","mode":"statuscode","action":"inject","status_code":503}
{"source":"failure-lambda","mode":"statuscode","action":"inject","status_code":503}

Every config fetch and every injection is logged with the mode, action, and parameters. You can query these in CloudWatch Logs Insights:

fields @timestamp, mode, action
| filter source = "failure-lambda"
| sort @timestamp desc

Turning it off

The CLI can disable everything in one command:

npx failure-lambda disable --all --param <your-param-name> --region eu-west-1

Or set the parameter back to an empty config manually:

aws ssm put-parameter --name <your-param-name> --type String --overwrite \
  --value '{}' --region eu-west-1

After the cache refreshes, everything is back to normal:

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.33s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.23s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.19s

No redeployment. No code changes. The proxy saw the empty config, disabled injection, and passed everything through.

Cleaning up

To remove everything deployed in this walkthrough:

sam delete
aws ssm delete-parameter --name <your-param-name> --region eu-west-1
aws lambda delete-layer-version --layer-name failure-lambda --version-number <version> --region eu-west-1

What You Learn

The walkthrough above uses a single function. A real application with multiple functions, queues, and dependencies will surface more. But even one function reveals things about your system:

Timeout behavior (latency, timeout). Are your Lambda timeouts, API Gateway timeouts, and client timeouts configured consistently? Latency injection exposes mismatches fast. A function with a 10-second timeout behind an API Gateway with a 3-second timeout will fail in ways that look intermittent.
Retry and backoff (statuscode, exception). When a function returns a 503, do callers retry with exponential backoff or hammer the endpoint? Do SQS redrive policies work as configured? Injecting errors at a percentage less than 100% lets you see if partial failures are handled differently than complete outages.
Error propagation (statuscode, exception). Does a failure in one function produce a clear error message at the API boundary, or does it cascade into a generic 500? Injecting status codes and exceptions at different points in a call chain shows you exactly where error context gets lost.
Alerting and observability (any mode). Do your CloudWatch alarms fire? Do they fire quickly enough? Injecting faults and watching your dashboards is the most direct way to validate your monitoring. If you don't get paged during a controlled experiment, you won't get paged during an incident.
Fallback behavior (denylist). If you've built fallback logic for when a dependency is unavailable, does it actually work? The denylist mode blocks specific hostnames, so you can test what happens when S3 or DynamoDB is unreachable without affecting other dependencies.
Capacity and scaling (latency). What happens when latency increases and concurrent executions climb? Do you hit reserved concurrency limits? Does a slow function cause upstream queues to grow? These are the kinds of cascading effects that are hard to predict and easy to test.

The point of chaos engineering isn't to cause outages. It's to discover how your system responds to conditions that will eventually occur, in a controlled way, before your users encounter them.

Troubleshooting

Injection isn't happening

The most common cause is a missing ssm:GetParameter permission on the function's execution role. Check CloudWatch Logs for a permission denied error from the proxy. The second most common cause is the configuration cache: changes take up to 60 seconds to take effect. If you've just updated the SSM parameter, wait for the next cache refresh before concluding injection isn't working.

Architecture mismatch

If you publish the x86_64 layer and attach it to an arm64 function (or vice versa), the proxy binary won't execute. Download the correct zip for your function's architecture: failure-lambda-layer-x86_64.zip or failure-lambda-layer-aarch64.zip.

AWS_LAMBDA_EXEC_WRAPPER has no effect

The AWS_LAMBDA_EXEC_WRAPPER mechanism is built into managed Lambda runtimes. Custom runtimes need to explicitly support it by checking for the variable and invoking the wrapper before starting their own runtime loop. If you're using a custom runtime that doesn't implement this, the layer won't intercept anything.

Design Decisions

A few things I learned building this.

Feature flags over a single toggle

The original version had one failure mode active at a time. That's not how production fails. In the real world, you might have a slow dependency and a flaky DNS resolution at the same time. The feature flag model lets you compose failures. Each mode is independent with its own percentage, so you can build realistic failure scenarios.

Why Rust for the layer

The proxy sits in the critical path of every Lambda invocation. It needs to be fast with minimal memory overhead. Rust was the natural choice: predictable performance, no garbage collector pauses, and the single binary keeps the layer small.

Caching with a purpose

Every invocation used to call SSM to get the configuration. That's unnecessary latency and API costs. The library now caches SSM responses for 60 seconds by default. For AppConfig, the Lambda extension already handles caching, so the library disables its own cache entirely to avoid staleness.

Validation that fails closed

If your configuration JSON is malformed or has invalid values, the library logs a clear error and disables injection. It doesn't crash your function and it doesn't silently inject with bad parameters. Regex patterns in denylist rules are checked for nested quantifiers to prevent ReDoS.

What's Next

failure-lambda 1.0 brings TypeScript, a feature flag configuration model, seven failure modes, and a Lambda Layer that works across all managed runtimes without touching your code. This release covers the core use cases I've seen in practice. There are things I'd like to explore next: more granular targeting with Lambda function aliases, integration with AWS Fault Injection Service, and better observability into what's being injected across a fleet of functions. If you have ideas, open an issue.

Additional Resources

What failure scenarios have you tested in your serverless applications? I'm curious what surprises people find with timeout mismatches and retry behavior. Let me know in the comments.

Can It Run DOOM? Playing DOOM in Claude Code with doom-mcp

Gunnar Grosch — Wed, 11 Mar 2026 02:04:03 +0000

DOOM runs on everything. Calculators, ATMs, pregnancy tests, an oscilloscope you bought at a garage sale. It was only a matter of time before AI coding assistants joined the list. Now they have.

doom-mcp is an MCP server that embeds the real DOOM engine and exposes it to Claude Code, Kiro, Cursor, or any MCP client through three tools. The AI sees the game, fights demons, navigates levels, and reports what's happening. You can direct it or let it play on its own.

> Let's play DOOM!

  Starting E1M1 on medium difficulty...

  HP:100 Armor:0 | Pistol (key:2) | Ammo: 50b 0s 0r 0c | Kills:0 | (-416,256) 0deg | E1M1
  No enemies in sight. Exploring the corridor ahead.

> Go forward and find some enemies

  Sprinting down the hallway... I can see an opening ahead.

  HP:100 Armor:0 | Pistol (key:2) | Ammo: 50b 0s 0r 0c | Kills:0 | (0,256) 0deg | E1M1
  ENEMIES IN SIGHT (aim for angle~0 then fire): Imp (HP:60) to your left (turn_left ~9) nearby

  There's an Imp to the left! Turning to face it...

> Shoot it!

  Aimed and firing!

  HP:100 Armor:0 | Pistol (key:2) | Ammo: 47b 0s 0r 0c | Kills:0 | (0,256) 30deg | E1M1
  ENEMIES IN SIGHT (aim for angle~0 then fire): Imp (HP:15) AHEAD close

  Hit! The Imp is wounded (HP:15). One more burst should finish it.

  HP:95 Armor:0 | Pistol (key:2) | Ammo: 44b 0s 0r 0c | Kills:1 | (0,256) 30deg | E1M1
  ITEMS: Stimpack (+10 HP) slightly to your right (turn_right ~4) point-blank (~3 ticks fwd+run to reach)
  1 kill! Want to see a screenshot?

Getting Started

Register with Claude Code

claude mcp add doom --scope user -- npx -y doom-mcp

For Kiro, Cursor, Windsurf, or any other MCP client, add to .mcp.json:

{
  "mcpServers": {
    "doom": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "doom-mcp"]
    }
  }
}

Play

Open a new session and say: "Let's play DOOM"

The AI will ask which mode you want, start on E1M1, and begin.

Play Modes

Two ways to play:

User-directed: You give commands ("go forward", "open that door", "shoot the imp"). The AI executes one action at a time and describes what happens. Good for a text-adventure feel where you call the shots and the AI handles the execution.

Autonomous: The AI makes all decisions: movement, combat, exploration. You watch and intervene if you want. It's genuinely entertaining to watch it work through a level, spot an Imp, and decide whether to charge or take cover.

WAD Files

doom-mcp ships with Freedoom out of the box. Freedoom is a free and open-source replacement IWAD (DOOM's game data format) with its own levels and enemy designs. If you want the original id Software levels, enemies, and atmosphere, the shareware DOOM1.WAD is free to download legally. Set the path in your MCP config:

{
  "mcpServers": {
    "doom": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "doom-mcp"],
      "env": {
        "DOOM_WAD_PATH": "/path/to/DOOM1.WAD"
      }
    }
  }
}

If you own DOOM or DOOM 2, those WADs work the same way.

A Few Things I Learned Building This

FFI over subprocess

The obvious approach is to run DOOM as a child process and communicate via pipes. The problem is timing and synchronization: you're fighting the engine's internal clock, process startup overhead, and serialization on every frame.

Instead, doom-mcp embeds doomgeneric (a portable C implementation of the DOOM engine) directly via Rust FFI (Foreign Function Interface). Rust was the right choice here: it has excellent FFI support for C code, compiles to a single native binary, and gives memory safety without a garbage collector that could interrupt the game loop. No subprocess spawning, no pipes. Each tool call advances the engine by calling doomgeneric_Tick() directly and reading the frame buffer in-memory.

Virtual time

DOOM normally ties its game clock to wall time. That's fine for a real-time player, but it's wrong for an AI that might take 500ms to decide its next move. Without intervention, the engine would skip ticks during the AI's thinking time and produce non-deterministic behavior.

The solution is to decouple the engine's clock from wall time entirely. Each doomgeneric_Tick() call advances exactly one game tic (1/35th of a second) regardless of how much real time has passed. Gameplay is fully deterministic: the same sequence of actions always produces the same result.

Line-of-sight, not wallhack

Enemy detection could just iterate the object list and report everything on the map. That would be cheating in a way that makes the game too easy and less interesting.

Instead, the server performs a proper line-of-sight check for each enemy, the same check the DOOM engine uses internally (P_CheckSight()). The AI only sees enemies it could see if it were a human looking at the screen. When an enemy moves behind a wall or around a corner, it drops from the AI's view immediately. It still needs to explore to find things.

What the AI gets per action

Each doom_action call returns structured game state alongside a small PNG:

HP, armor, ammo by type, current weapon, kill count
Visible enemies with human-readable direction and distance (Imp (HP:60) to your left (turn_left ~9) nearby)
Nearby items within pickup range, with CLOSING/RECEDING indicators
Nearby doors and switches (NEARBY: Door AHEAD (use to activate))
A 200x125 thumbnail PNG using a 216-color palette for inference

The structured data gives the AI something it can reason about without having to interpret pixel-level vision. The image fills in the spatial context. Together they let the AI make reasonable decisions: "Imp to my left, nearby, HP 60 — turn left and fire."

Tools Reference

doom_start

Starts or restarts a game. Safe to call at any time.

Parameter	Type	Default	Description
`skill`	int 1-5	3	1=baby, 2=easy, 3=medium, 4=hard, 5=nightmare
`episode`	int 1-4	1	Episode number
`map`	int 1-9	1	Map number

doom_action

Advances the game by executing actions for a number of ticks.

Parameter	Type	Required	Description
`actions`	string	yes	Comma-separated: `forward`, `backward`, `turn_left`, `turn_right`, `strafe_left`, `strafe_right`, `fire`, `use`, `run`, `1`-`7`
`ticks`	int 1-105	no	Ticks to advance. Default 7. 7 ticks ≈ 0.2s, 35 ticks ≈ 1s

Weapon keys: 1=fists, 2=pistol, 3=shotgun, 4=chaingun, 5=rocket launcher, 6=plasma, 7=BFG.

doom_screenshot

Saves a full-resolution 320x200 screenshot to the system temp directory and opens it in the default image viewer. Does not advance the game. Note: the viewer launch will fail silently on headless systems or SSH sessions.

How Well Does It Actually Play?

Realistically: well enough to be fun. On E1M1 at medium difficulty, it gets 5-10 kills in a typical 50-action session. It can navigate corridors, spot enemies, aim, and fire. It struggles with enemies behind partial cover and complex door sequences.

It improves significantly when you direct it. "There's an Imp to your left" turns a wandering AI into a focused combatant. The user-directed mode is where most of the entertainment is. Two AI agents in deathmatch is the obvious next experiment, and the architecture could extend to other doomgeneric-compatible titles: Heretic, Hexen, DOOM II.

The token cost is real: each action call is roughly 1,500-2,500 total tokens (input and output combined: game state text plus the PNG). A 50-action session is 75,000-125,000 tokens, which works out to roughly $0.50-2.00 depending on your model. Worth it.

Additional Resources

doom-mcp on GitHub: Source, docs, and examples
doom-mcp on npm: Package page
doomgeneric by ozkl: The portable DOOM engine this is built on
Freedoom: The open-source IWAD that ships with the package
DOOM1.WAD shareware download: The original shareware episode, free and legal

The real question was never whether it could run DOOM. It's what you do with it now that it can. Let me know in the comments how far you get.

Circuit Breakers on AWS Lambda: Why In-Memory State Silently Fails

Gunnar Grosch — Mon, 09 Mar 2026 21:53:44 +0000

You added a circuit breaker to your Lambda function. It compiles, your tests pass, and it works correctly in local testing. But it's silently useless. The problem isn't the implementation. It's an assumption every in-memory circuit breaker makes that doesn't hold on Lambda.

What Circuit Breakers Do

The circuit breaker pattern comes from Michael Nygard's Release It! and is named after the electrical component. Think about the services your Lambda functions actually call: a payment processor, a third-party enrichment API, a database under load, another service in your own fleet. Anything external your function depends on is a downstream service, and any of them can start responding slowly or fail outright. Slow is often worse than down. A dependency that takes 10 seconds to time out costs you 10 seconds of held concurrency per call, not a fast failure you can handle gracefully.

That concurrency cost is the Lambda-specific reason to care. When a downstream call hangs, your function holds a concurrency unit. At 100 concurrent executions and a 10-second timeout, one flaky dependency can saturate your function in seconds, throttling every other request, including the ones with nothing to do with the sick service. The cascade happens fast: payment API slows down → order function saturates concurrency → order requests fail → the service calling orders backs up → users see errors across your entire checkout flow.

When a downstream service starts failing, a circuit breaker stops calling it entirely, returns a fallback response immediately, and probes for recovery. It also gives the downstream service breathing room: instead of a flood of timeouts hammering something that's already struggling, it gets near-silence while the circuit is open. The naming follows the electrical analogy: a closed circuit is complete and current flows; an open circuit is broken and nothing gets through.

Three states:

CLOSED: Normal operation. Calls go through. Failures are counted.
OPEN: Circuit tripped. Calls fail fast without reaching the downstream service. A timeout runs.
HALF-OPEN: One trial call allowed. If it succeeds, the circuit closes. If it fails, it reopens with a longer timeout.

The problem is how Lambda runs code.

Why In-Memory State Fails on Lambda

Lambda's concurrency model is built around isolated execution environments. From the AWS documentation: "For each concurrent request, Lambda provisions a separate instance of your execution environment." Two simultaneous invocations of the same function run in two separate environments with completely independent memory spaces. There is no shared memory between them.

Consider what this means for a circuit breaker with a failure threshold of 5. Your function is receiving 50 concurrent requests. A downstream service starts failing:

Execution environment 1 takes a request. The call fails. Its local failure count: 1/5.
Execution environment 2 takes a request. The call fails. Its local failure count: 1/5.
...
Execution environment 50 takes a request. The call fails. Its local failure count: 1/5.

50 failures have hit the downstream service. No circuit has opened. Each environment has counted 1 failure and needs 4 more before it does anything. Meanwhile, all 50 environments continue sending requests to a service that is already failing. In the worst case, where traffic distributes evenly across environments, you need up to 250 total failures before any single execution environment opens its circuit.

And that's assuming the same 50 execution environments handle all the traffic. Lambda scales by adding new execution environments as load increases. Each new environment starts with a failure count of zero. As long as traffic grows and new environments spin up, the fleet will always have environments that haven't seen enough failures to open. The circuit can never effectively protect you across the fleet.

This isn't a hypothetical. The most widely-used Node.js circuit breaker libraries (opossum, cockatiel) store state in process memory. They work correctly in a single-process server where all traffic goes through one circuit. They don't work for Lambda's distributed execution model. opossum does provide state export and import hooks (toJSON()) specifically documented for serverless environments, but these don't solve the cross-environment isolation problem: each environment still starts from whatever state you restore, not a live shared view of current circuit state.

Provisioned Concurrency reduces but doesn't eliminate this problem. PC keeps a fixed number of execution environments initialized and warm, so they accumulate local failure counts across more requests than standard on-demand environments. But they're still isolated from each other, and scaling events still add fresh environments that start at zero. In-memory state is less useless with PC, but it's still wrong at any meaningful concurrency level.

Lambda also periodically terminates execution environments for runtime maintenance and updates, even for continuously invoked functions. An environment accumulating failure counts can be replaced with a fresh one starting at zero at any time, adding another layer of unreliability to in-memory state.

Lambda Managed Instances (launched at re:Invent 2025) are an exception: they support multiple concurrent invocations per environment, so in-memory state accumulates across requests within the same environment. The argument above applies to standard Lambda functions, which remain the default.

Shared State Across Execution Environments

The fix is to store circuit state in a shared external store. When execution environment 1 records a failure, execution environment 2 sees it. When any environment opens the circuit, every environment stops calling the downstream service. Yes, this adds a network call to every invocation. The Performance and Cost sections have the numbers. For most workloads the overhead is small, and CachedProvider can reduce it further. ElastiCache (Valkey) is the fastest option (sub-millisecond reads) and is the right choice if your functions are already in a VPC. DynamoDB is the right default for most Lambda workloads: no VPC required, single-digit millisecond latency, and it supports atomic operations and conditional writes for concurrent safety. Adding a VPC solely for circuit breaker state adds deployment complexity and a modest cold start overhead, which isn't worth it unless you're already VPC-attached.

circuitbreaker-lambda is an open-source library I built that takes the DynamoDB path. It stores circuit state in DynamoDB and shares it across all execution environments running the same function.

Two paths to choose from:

	npm package	Lambda Layer
Runtimes	Node.js 20+	Any managed runtime
Integration	Import library	HTTP calls to local sidecar
Cold start overhead	~50ms	~350ms

Both paths share the same DynamoDB state schema, so a Node.js function using the npm package and a Python function using the Layer can share circuit state for the same downstream service.

Getting Started: npm Package

Install

npm install circuitbreaker-lambda

Requires Node.js 20+.

Create a DynamoDB table

aws dynamodb create-table \
  --table-name circuitbreaker-table \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Set the environment variable

Add CIRCUITBREAKER_TABLE as a Lambda environment variable. In a SAM template:

Environment:
  Variables:
    CIRCUITBREAKER_TABLE: circuitbreaker-table

Grant the function access to DynamoDB

The function needs GetItem and UpdateItem on the table. In a SAM template:

Policies:
  - Statement:
      - Effect: Allow
        Action:
          - dynamodb:GetItem
          - dynamodb:UpdateItem
        Resource: !GetAtt CircuitBreakerTable.Arn

Use it in your handler

import { CircuitBreaker } from 'circuitbreaker-lambda'

// callDownstreamService is any async function that calls your downstream service.
// It should throw on failure; the circuit breaker catches the throw and counts it.
// Initialized outside the handler so the same instance
// is reused across warm invocations of this execution environment
const breaker = new CircuitBreaker(callDownstreamService, {
  failureThreshold: 5,
  successThreshold: 2, // successes (across any environment) required to close from HALF-OPEN
  timeout: 10000, // ms to wait in OPEN state before allowing a trial call (HALF-OPEN)
  fallback: async () => ({ data: 'cached response' }),
})

export const handler = async () => {
  try {
    const result = await breaker.fire()
    return { statusCode: 200, body: JSON.stringify(result) }
  } catch (err) {
    // fire() throws if the circuit is OPEN and no fallback is configured,
    // or if the downstream call fails and propagates the error.
    return { statusCode: 503, body: JSON.stringify({ error: 'Service unavailable' }) }
  }
}

fire() calls callDownstreamService. If the call succeeds, it records a success in DynamoDB. If it fails, it records a failure. When the failure count hits the threshold, it opens the circuit and subsequent calls return the fallback immediately (or throw if no fallback is configured). Every execution environment handling that function reads the same DynamoDB state. successThreshold counts successes across any execution environment via shared DynamoDB state, following the same last-writer-wins behavior as failure counts. Real fallbacks return something useful under degradation: cached data, a default empty state, or a simplified response. The { data: 'cached response' } placeholder in the example is where that goes.

If you're using Middy middleware, there's an integration that wraps your handler directly:

import middy from '@middy/core'
import { circuitBreakerMiddleware } from 'circuitbreaker-lambda/middy'

export const handler = middy(myHandler)
  .use(circuitBreakerMiddleware({
    failureThreshold: 5,
    fallback: async () => ({ statusCode: 503, body: 'Service unavailable' }),
  }))

The middleware wraps the entire handler rather than a specific downstream function. When the circuit is OPEN, the middleware short-circuits the handler before it runs and returns the fallback response. Without a fallback configured, it throws so your error handler can respond. If your handler calls multiple downstream services, use the npm package directly with a distinct circuit ID for each service.

Circuit IDs and Shared State

The circuit ID is what links circuit state to a specific downstream service. By default it uses AWS_LAMBDA_FUNCTION_NAME. Two execution environments running the same function share one circuit because they have the same function name and read from the same DynamoDB item.

If one function calls multiple downstream services, give each a distinct circuit ID:

const paymentBreaker = new CircuitBreaker(callPaymentService, {
  circuitId: 'payment-service',
})
const inventoryBreaker = new CircuitBreaker(callInventoryService, {
  circuitId: 'inventory-service',
})

If multiple functions protect the same downstream service and you want them to share a circuit, give them the same ID. A circuit open in one function will be seen by all functions using that ID. For the Lambda Layer, use the same circuit ID string in the HTTP path across all functions.

Getting Started: Lambda Layer

If your functions use a runtime other than Node.js, or if you want a single circuit breaker deployment that works across runtimes, the Lambda Layer is the other path. It ships a Rust extension that runs as a local sidecar on port 4243. Your handler makes HTTP calls to it instead of importing a library.

Add the layer to your SAM template

Download the layer zip from the GitHub releases page. The Rust extension is architecture-specific: download the x86_64 build for standard Lambda functions or the arm64 build for Graviton. Reference it as a AWS::Serverless::LayerVersion resource and attach it to your function. The examples/layer/template.yaml in the repo shows the full setup with both architectures. The key function configuration:

LayerNodeFunction:
  Type: AWS::Serverless::Function
  Properties:
    Layers: [!Ref CircuitBreakerLayer]
    Environment:
      Variables:
        CIRCUITBREAKER_TABLE: !Ref CircuitBreakerTable

Node.js handler

const CIRCUIT_ID = process.env.AWS_LAMBDA_FUNCTION_NAME
const CB_URL = 'http://127.0.0.1:4243'

// Check circuit state before calling downstream
const check = await fetch(`${CB_URL}/circuit/${CIRCUIT_ID}`)
const { allowed, state } = await check.json()

if (!allowed) {
  return { statusCode: 503, body: JSON.stringify({ error: 'Circuit OPEN', state }) }
}

// Call downstream and report result
try {
  const result = await callDownstream()
  await fetch(`${CB_URL}/circuit/${CIRCUIT_ID}/success`, { method: 'POST' })
  return { statusCode: 200, body: JSON.stringify(result) }
} catch (err) {
  await fetch(`${CB_URL}/circuit/${CIRCUIT_ID}/failure`, { method: 'POST' })
  throw err  // Lambda returns a non-200; event sources like SQS will retry
}

Python handler

import json
import os
import urllib.request

circuit_id = os.environ.get('AWS_LAMBDA_FUNCTION_NAME', 'default')
cb_url = 'http://127.0.0.1:4243'

def handler(event, context):
    try:
        with urllib.request.urlopen(f'{cb_url}/circuit/{circuit_id}', timeout=5) as resp:
            check = json.loads(resp.read())
    except Exception:
        # Sidecar unavailable — fail open and allow the downstream call
        check = {'allowed': True}

    if not check['allowed']:
        return {'statusCode': 503, 'body': json.dumps({'error': 'Circuit OPEN'})}

    try:
        result = call_downstream()
        urllib.request.urlopen(
            urllib.request.Request(f'{cb_url}/circuit/{circuit_id}/success', method='POST'),
            timeout=5
        )
        return {'statusCode': 200, 'body': json.dumps(result)}
    except Exception as e:
        urllib.request.urlopen(
            urllib.request.Request(f'{cb_url}/circuit/{circuit_id}/failure', method='POST'),
            timeout=5
        )
        # For event-driven triggers like SQS, raise here instead so Lambda retries.
        return {'statusCode': 500, 'body': json.dumps({'error': str(e)})}

Both examples make two HTTP calls to the sidecar per invocation: one to check state before the downstream call, one to report the result. These are loopback calls to 127.0.0.1, not network calls, so the round-trip is sub-millisecond. The Rust sidecar also runs the CachedProvider logic internally, so it rarely reaches DynamoDB on warm invocations. The warm latency numbers in the Performance section reflect this.

The Layer approach requires more boilerplate per handler, but it works in any managed runtime and keeps the state management and DynamoDB logic out of your application code. The handler wires up local HTTP calls to the sidecar. That part does live in your code. But the actual circuit state tracking, DynamoDB reads and writes, failure counting, and backoff logic are all inside the Rust extension.

Design Decisions

Fail-open

The term comes from physical security, not circuit states: a fail-open lock releases when power fails, defaulting to permissive. Here it means the same thing. If the DynamoDB state provider is unavailable, requests pass through rather than failing. This is a deliberate trade-off. The alternative is failing closed: a transient DynamoDB error takes down your service even if the downstream service it's protecting is completely healthy. Your circuit breaker becomes a single point of failure.

Fail-open accepts that brief periods of unprotected calls are better than self-inflicted downtime. State provider errors are logged as structured JSON so you can monitor and alert on them, but they don't block requests. The counter-argument: if DynamoDB is unavailable during an active downstream incident, fail-open leaves traffic unprotected. For most workloads this is the right call: a simultaneous DynamoDB outage and downstream failure is an unlikely combination, and failing closed (blocking all traffic because the circuit breaker can't read state) makes things worse. If your downstream is fragile enough that this scenario is a real concern, a lower-level fallback or degraded mode is a better answer than fail-closed.

For the Lambda Layer path, there are two sidecar failure modes to distinguish. If the extension fails during the INIT phase, Lambda restarts the execution environment entirely. The handler never runs, and Lambda retries automatically. If the extension crashes after initialization during an invocation, fetch calls to http://127.0.0.1:4243 throw connection refused errors. For this second case, wrap the sidecar calls in a try/catch and fail open: allow the downstream call to proceed. The same principle applies as with DynamoDB unavailability.

Warm invocation caching

Every fire() call reads circuit state from DynamoDB. For a function handling high throughput, that's a DynamoDB read on every invocation. You can reduce this with the CachedProvider, which caches state in memory for warm execution environments:

const breaker = new CircuitBreaker(callDownstream, {
  cacheTtlMs: 200, // use cached state for 200ms on warm invocations
})

On a warm invocation, the cache is checked first. If the state is fresh, no DynamoDB call is made. The cache is write-through: when state is saved to DynamoDB, the cache is also updated. Keep the TTL short. A long cache window can delay the CLOSED to OPEN transition: an execution environment that cached a CLOSED state won't see a newly-opened circuit until the cache expires. 200ms is a reasonable starting point: it caps the detection lag while cutting DynamoDB reads significantly for high-throughput functions. Increase the TTL to reduce costs further at the cost of slower circuit detection. Decrease it for faster propagation at higher DynamoDB cost.

What the DynamoDB item looks like

When debugging a stuck circuit, this is what you're looking for in the table:

{
  "id": "my-function-name",
  "circuitState": "OPEN",
  "failureCount": 5,
  "successCount": 0,
  "nextAttempt": 1741234567890,
  "lastFailureTime": 1741234557890,
  "consecutiveOpens": 1
}

circuitState is CLOSED, OPEN, or HALF-OPEN. nextAttempt is a Unix timestamp in milliseconds. The circuit won't probe until after that time. consecutiveOpens tracks how many consecutive HALF-OPEN→OPEN transitions have occurred, which drives the exponential backoff on the timeout.

The library uses last-writer-wins writes rather than atomic increments. Under extreme concurrent failures (many execution environments failing at the exact same millisecond) some failure counts can be lost: if 10 environments all read failureCount: 4 and each write 5, the count advances by 1 instead of 10. In practice this means the circuit may take slightly longer to open than the threshold suggests under burst concurrency. It will still open. For the CLOSED→OPEN transition itself, multiple environments writing OPEN simultaneously all succeed, which is fine: you want the circuit open. Atomic counter increments via DynamoDB's ADD operation could prevent lost failure counts, but a state transition updates multiple fields simultaneously: state, failure count, and timestamp. Last-writer-wins on the full item keeps the write logic simple at the cost of occasional lost counts under extreme concurrency. If your function handles high burst concurrency, set failureThreshold lower than you would in a single-process application. Lost counts mean the effective threshold is higher than the configured value, so a lower setting brings the actual behavior closer to the intended one.

Exponential backoff on repeated failures

When a circuit transitions from HALF-OPEN back to OPEN (a recovery probe failed), the timeout before the next probe doubles. This prevents a repeatedly-failing service from being probed too aggressively. The backoff resets when the circuit closes successfully. The maxTimeout option caps how long the backoff can grow.

HALF-OPEN probe behavior

With shared DynamoDB state, when the circuit transitions to HALF-OPEN, every warm execution environment that reads the updated state may attempt a trial call. Unlike a single-process circuit breaker where exactly one probe goes out, a fleet of 50 environments can send up to 50 simultaneous probes to a recovering downstream service. CachedProvider staggers probes across the TTL window as environments pick up the state change at different times, but doesn't eliminate the burst. A single-leader approach (using a DynamoDB conditional write to claim the probe slot) would be more precise, and it's tracked as a future improvement in the repo. The current behavior favors simplicity: the probe burst is proportional to the number of warm environments, which is typically small for functions with reasonable traffic patterns, and distributed leader election adds significant complexity for a probe that's designed to be retried on failure anyway.

Custom state backends

The StateProvider interface is pluggable. If you need Redis, a relational database, or anything else, implement two methods:

class RedisProvider implements StateProvider {
  async getState(circuitId: string): Promise<CircuitBreakerState | undefined> { ... }
  async saveState(circuitId: string, state: CircuitBreakerState): Promise<void> { ... }
}

const breaker = new CircuitBreaker(fn, { stateProvider: new RedisProvider() })

DynamoDB is the right default for most Lambda workloads. Valkey or Redis makes sense if you're already VPC-attached and running ElastiCache for caching: reusing existing infrastructure avoids the extra DynamoDB dependency. For most teams, running a cache cluster solely for circuit state isn't worth the VPC overhead and operational cost.

Performance

Using the npm package and Lambda Layer, here are measured results from a test run with 50 warm invocations per configuration and a shared DynamoDB table in the same region. All functions were configured at 512MB memory. The "downstream" in all cases was an HTTP call through an API Gateway endpoint backed by DynamoDB, which could be toggled healthy or unhealthy. The HTTP round-trip through API Gateway accounts for the ~590ms baseline. Raw DynamoDB read latency is single-digit milliseconds. Cold start times scale inversely with memory allocation. Lambda allocates CPU proportionally to memory, so at 128MB (where CPU is highly constrained) you would expect larger overhead, particularly for the Layer which initializes a Rust extension sidecar alongside the function runtime.

Cold start (forced by updating a function environment variable):

Configuration	Cold start	vs. baseline
Baseline (no circuit breaker)	1300ms	—
npm package (Node.js)	1353ms	+4%
Lambda Layer (Node.js)	1679ms	+29%
Lambda Layer (Python)	1541ms	+18%

The Layer cold start penalty comes from initializing the Rust extension sidecar alongside the function runtime. It's a one-time cost per execution environment. Since August 2025, AWS bills for the Lambda INIT phase on managed runtimes with ZIP deployment packages, so the Layer's +29% cold start overhead (379ms) is now both a latency and a cost consideration for functions with frequent cold starts.

Warm invocations (50 calls each):

Configuration	Median	p99
Baseline (no circuit breaker)	590ms	620ms
npm package (Node.js)	592ms	621ms
Lambda Layer (Node.js)	589ms	797ms
Lambda Layer (Python)	585ms	639ms

The npm package p99 (621ms) is essentially identical to baseline (620ms). The Lambda Layer Node.js p99 (797ms) is higher because the Rust extension sidecar occasionally adds latency on the first few invocations after a warm start. The median is fine but the tail is longer. The Layer configurations showing slightly below baseline median are within measurement noise, not a genuine speedup. With CachedProvider, DynamoDB reads are eliminated for subsequent invocations within the TTL window, which brings tail latency down for high-throughput functions.

Cost

Each fire() call reads circuit state from DynamoDB (one GetItem) and writes on state changes. At on-demand pricing, a GetItem on a small item costs $0.125 per million read request units. At one million Lambda invocations per day (around 11 RPS), that's roughly $0.125/day for the reads. State writes only happen on failures and state transitions, so they're a rounding error for a healthy function. During an active failure scenario writes increase. At $1.25 per million write request units, a function failing on every invocation could see $1.25/day in write costs before the circuit opens and stops the calls. In practice the circuit opens quickly (after the first threshold of failures per environment), so write volume drops sharply once OPEN. With CachedProvider at 200ms TTL and warm execution environments, reads drop by an order of magnitude on high-throughput functions.

Testing

The package includes a MemoryProvider for unit testing. Pass it as the stateProvider option to skip DynamoDB entirely in tests:

import { CircuitBreaker, MemoryProvider } from 'circuitbreaker-lambda'

const breaker = new CircuitBreaker(callDownstream, {
  stateProvider: new MemoryProvider(),
  failureThreshold: 3,
})

MemoryProvider uses an in-memory Map and is not safe for production. It's for tests and local development only.

When NOT to Use a Circuit Breaker

Not every Lambda function needs one:

Very low concurrency. If your function runs at a single execution environment (low traffic, no bursting), in-memory circuit breakers work: there's only one environment, so state is effectively shared. The overhead of distributed state isn't worth it for something handling a few requests per minute.
Calls to AWS services. The AWS SDK handles retries, timeouts, and transient failures with exponential backoff. Wrapping a DynamoDB GetItem or an S3 PutObject in a circuit breaker adds complexity without much benefit. AWS manages the resilience layer for you.
When fail-fast isn't better than retry. Circuit breakers are for cascading failure protection. If your function's caller expects a synchronous result and there's no meaningful fallback response, letting the error propagate and retry may be simpler than managing circuit state.

Try It Yourself

The repo includes two deployable SAM examples with a toggleable downstream service so you can watch the full circuit lifecycle (healthy calls, failures accumulating, circuit opening, recovery) against real AWS infrastructure:

examples/sam/: npm package example. Single Node.js function at /. Toggle the downstream at /toggle, check circuit state at /status.
examples/layer/: Layer example. Node.js (/node) and Python (/python) functions side by side, sharing the same Layer and DynamoDB table.
examples/minimal-npm/ and examples/minimal-layer/: Stripped-down versions if you just want the bare minimum code without the toggle/status test infrastructure.

Additional Resources

circuitbreaker-lambda on GitHub: Source, docs, and examples
circuitbreaker-lambda on npm: Package page
Using the circuit-breaker pattern with AWS Lambda extensions and Amazon DynamoDB: AWS Compute Blog post covering the same Lambda extension + DynamoDB architecture
AWS Prescriptive Guidance: Circuit Breaker: AWS' recommended approach using DynamoDB
AWS Lambda execution environment documentation: The concurrency and isolation model this post is based on
Circuit Breaker pattern (Martin Fowler): The canonical pattern reference

The silent failure mode of in-memory circuit breakers on Lambda isn't obvious until you're debugging a production incident. If you're running a circuit breaker today, check whether it's sharing state across execution environments. If it's not, it's not protecting you. The fix is a DynamoDB table and three lines of configuration. The alternative is finding out during the next downstream outage. Let me know in the comments how you're handling downstream resilience on Lambda.