How to configure, build, and deploy long-running workflows using SAM templates
canonical_url: https://dev.to/aws/developing-aws-lambda-durable-functions-with-aws-sam-ga9
You've written a durable function that orchestrates a multi-step workflow. Now you may want to deploy it. AWS Serverless Application Model (AWS SAM) makes this straightforward with specific configurations in your template. Let's walk through how to set up durable functions effectively.
Prerequisites
You'll need AWS SAM CLI version 1.150.1 or greater. Check your version:
sam --version
If you need to upgrade, follow the AWS SAM installation guide.
The Basics: Enabling Durable Functions
To make an AWS Lambda function durable, add the DurableConfig property to your AWS SAM template:
Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      Runtime: nodejs22.x
      Architectures:
        - arm64
      Timeout: 900 # Function timeout: 15 minutes
      DurableConfig:
        ExecutionTimeout: 3600 # Execution timeout: 1 hour
        RetentionPeriodInDays: 7 # Keep execution history for 7 days
Understanding the two timeout settings is crucial:
Function Timeout (Timeout) controls how long each individual Lambda invocation can run. This is still capped at 15 minutes (900 seconds), just like regular Lambda functions. Each time your durable function checkpoints and resumes, it's a new invocation with its own 15-minute window.
Execution Timeout (ExecutionTimeout) controls how long the entire workflow can run across all invocations. This can be up to 1 year. Your workflow can pause, wait for callbacks, and resume many times, as long as the total elapsed time doesn't exceed this limit.
For long-running workflows where the execution timeout exceeds the function timeout, you must invoke the function asynchronously. Synchronous invocations will fail with a validation error if the execution timeout is longer than the function timeout.
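The relationship between the two timeouts can be written as a small check. This is a minimal sketch, not part of AWS SAM or the durable execution SDK: `chooseInvocationType` is a hypothetical helper name, and `RequestResponse` and `Event` are the standard Lambda invocation types for synchronous and asynchronous calls.

```typescript
// Hypothetical helper: picks a Lambda invocation type from the two timeouts.
type InvocationType = "RequestResponse" | "Event";

const MAX_FUNCTION_TIMEOUT_SECONDS = 900; // 15-minute cap per invocation

function chooseInvocationType(
  functionTimeoutSeconds: number,
  executionTimeoutSeconds: number,
): InvocationType {
  if (functionTimeoutSeconds < 1 || functionTimeoutSeconds > MAX_FUNCTION_TIMEOUT_SECONDS) {
    throw new RangeError("Timeout must be between 1 and 900 seconds");
  }
  if (executionTimeoutSeconds < 1) {
    throw new RangeError("ExecutionTimeout must be at least 1 second");
  }
  // A synchronous call is only valid when the whole workflow fits
  // inside a single invocation's time window.
  return executionTimeoutSeconds > functionTimeoutSeconds ? "Event" : "RequestResponse";
}

console.log(chooseInvocationType(900, 3600)); // "Event"
console.log(chooseInvocationType(900, 600)); // "RequestResponse"
```

With the template above (Timeout 900, ExecutionTimeout 3600) the helper returns `Event`, so callers would need to invoke the function asynchronously.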
Installing the SDK
Add the durable execution SDK to your function's dependency manifest. SAM will automatically install it during the build process.
TypeScript/JavaScript - Add to package.json:
{
  "dependencies": {
    "@aws/durable-execution-sdk-js": "^1.0.0"
  }
}
Python - Add to requirements.txt:
aws_durable_execution_sdk_python
For client functions that send callbacks or query execution state, add the AWS SDK:
TypeScript/JavaScript:
{
  "dependencies": {
    "@aws-sdk/client-lambda": "^3.0.0"
  }
}
Python:
boto3
Build Configuration
Configure SAM to build your TypeScript functions with esbuild:
Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      Runtime: nodejs22.x
      Architectures:
        - arm64
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts
SAM will automatically compile your TypeScript code and bundle dependencies into a deployment package.
IAM Permissions
AWS SAM automatically grants the necessary permissions for durable functions to checkpoint and manage their execution state. You don't need to explicitly configure these permissions.
However, you may want to configure permissions for client functions that interact with durable executions. For functions that send callbacks:
BaristaCallbackFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: src/barista-callback
    Handler: index.handler
    Policies:
      - Statement:
          - Effect: Allow
            Action:
              - lambda:SendDurableExecutionCallbackSuccess
              - lambda:SendDurableExecutionCallbackFailure
              - lambda:SendDurableExecutionCallbackHeartbeat
            Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'
For functions that monitor execution state or stop executions:
MonitoringFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: src/monitoring
    Handler: index.handler
    Policies:
      - Statement:
          - Effect: Allow
            Action:
              - lambda:GetDurableExecution
              - lambda:GetDurableExecutionHistory
              - lambda:ListDurableExecutionsByFunction
              - lambda:SendDurableExecutionCallbackSuccess
              - lambda:SendDurableExecutionCallbackHeartbeat
              - lambda:SendDurableExecutionCallbackFailure
              - lambda:StopDurableExecution
            Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'
These permissions allow client functions to interact with your durable workflows by sending callbacks, monitoring execution state, or stopping running executions.
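To see what the `!Sub` expression in the callback policy resolves to, here is a rough sketch that builds the same statement as a plain object. `callbackPolicyFor` is a hypothetical helper, and the account ID and function name in the usage line are placeholders; the action names are the ones listed in the template above.

```typescript
// Hypothetical helper that mirrors the callback policy statement above.
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
}

function callbackPolicyFor(
  region: string,
  accountId: string,
  functionName: string,
): PolicyStatement {
  return {
    Effect: "Allow",
    Action: [
      "lambda:SendDurableExecutionCallbackSuccess",
      "lambda:SendDurableExecutionCallbackFailure",
      "lambda:SendDurableExecutionCallbackHeartbeat",
    ],
    // Same shape as the !Sub string: the durable function is the resource.
    Resource: `arn:aws:lambda:${region}:${accountId}:function:${functionName}`,
  };
}

// Placeholder account ID and function name, for illustration only.
const callbackStatement = callbackPolicyFor("us-east-1", "123456789012", "coffee-order");
console.log(callbackStatement.Resource);
// "arn:aws:lambda:us-east-1:123456789012:function:coffee-order"
```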
Using Globals for Common Configuration
AWS SAM's Globals section reduces repetition when you have multiple functions with shared settings:
Globals:
  Function:
    Runtime: nodejs22.x
    Architectures:
      - arm64
    Timeout: 30
    MemorySize: 512
    Tracing: Active
Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
  CallbackHandlerFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/callback-handler
      Handler: index.handler
      # No DurableConfig - this is a regular Lambda function
You can set common runtime, architecture, and memory settings at the global level. For DurableConfig, configure it at the function level to make it explicit which functions are durable. Functions without DurableConfig are regular Lambda functions.
A Complete Example
Here's a full AWS SAM template for a coffee ordering system with durable workflows:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Coffee Ordering System with Durable Functions
Globals:
  Function:
    Runtime: nodejs22.x
    Architectures:
      - arm64
    Timeout: 30
    MemorySize: 512
    Tracing: Active
Resources:
  # ========== AMAZON DYNAMODB ==========
  OrdersTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: CoffeeOrders
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: orderId
          AttributeType: S
      KeySchema:
        - AttributeName: orderId
          KeyType: HASH
  # ========== FUNCTIONS ==========
  CoffeeOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/coffee-order
      Handler: index.handler
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
      Environment:
        Variables:
          ORDERS_TABLE: !Ref OrdersTable
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref OrdersTable
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /orders
            Method: post
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts
  BaristaCallbackFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/barista-callback
      Handler: index.handler
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - lambda:SendDurableExecutionCallbackSuccess
                - lambda:SendDurableExecutionCallbackFailure
              Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /barista/accept/{orderId}
            Method: post
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts
Outputs:
  ApiUrl:
    Description: API Gateway endpoint URL
    Value: !Sub 'https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod'
  CoffeeOrderFunctionArn:
    Description: Coffee Order Function ARN
    Value: !GetAtt CoffeeOrderFunction.Arn
Building and Deploying
Build your application with AWS SAM:
sam build
This builds your functions and any layers defined in the template. AWS SAM uses esbuild for the TypeScript functions that set BuildMethod: esbuild and, for layers that declare a Makefile build, runs the layer's Makefile.
Deploy to AWS:
sam deploy --guided
The --guided flag walks you through configuration options. After the first deployment, AWS SAM saves your settings in samconfig.toml, so subsequent deploys are just:
sam deploy
Local Testing
Test your durable function locally before deploying:
sam local invoke CoffeeOrderFunction --event events/order.json
The event file contains your test payload:
{
  "orderId": "test-123",
  "attendeeId": "user-456",
  "orderDetails": {
    "drinkType": "Latte",
    "size": "Grande"
  }
}
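A handler usually validates this payload before starting any workflow steps. The sketch below is one way to do that in plain TypeScript; `OrderEvent` and `parseOrderEvent` are hypothetical names, and the field names are taken from the sample event above.

```typescript
// Hypothetical shape of the test payload shown above.
interface OrderEvent {
  orderId: string;
  attendeeId: string;
  orderDetails: { drinkType: string; size: string };
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.length > 0;
}

// Throws a TypeError when a required field is missing or has the wrong type.
function parseOrderEvent(input: unknown): OrderEvent {
  if (typeof input !== "object" || input === null) {
    throw new TypeError("event must be an object");
  }
  const record = input as Record<string, unknown>;
  const details = record.orderDetails;
  if (typeof details !== "object" || details === null) {
    throw new TypeError("orderDetails must be an object");
  }
  const { orderId, attendeeId } = record;
  const { drinkType, size } = details as Record<string, unknown>;
  if (!isNonEmptyString(orderId) || !isNonEmptyString(attendeeId)) {
    throw new TypeError("orderId and attendeeId must be non-empty strings");
  }
  if (!isNonEmptyString(drinkType) || !isNonEmptyString(size)) {
    throw new TypeError("drinkType and size must be non-empty strings");
  }
  return { orderId, attendeeId, orderDetails: { drinkType, size } };
}

const sampleOrder = parseOrderEvent({
  orderId: "test-123",
  attendeeId: "user-456",
  orderDetails: { drinkType: "Latte", size: "Grande" },
});
console.log(sampleOrder.orderDetails.drinkType); // "Latte"
```

Failing fast on a malformed event keeps bad input from being checkpointed into a long-running execution.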
Accessing Cloud Resources Locally
When testing locally, you often need to access deployed AWS resources like Amazon DynamoDB tables or Amazon EventBridge buses. Use the --env-vars flag with a JSON file to provide environment variables and credentials:
sam local invoke CoffeeOrderFunction \
--event events/order.json \
--env-vars locals.json
Create a locals.json file with environment variables for each function:
{
  "CoffeeOrderFunction": {
    "AWS_REGION": "us-east-1",
    "ORDERS_TABLE": "CoffeeOrders",
    "EVENT_BUS_NAME": "CoffeeOrderingEventBus"
  },
  "BaristaCallbackFunction": {
    "AWS_REGION": "us-east-1",
    "ORDERS_TABLE": "CoffeeOrders"
  }
}
This allows your local function to interact with deployed resources in AWS, assuming your AWS credentials have the necessary permissions. Use the same variable names that your template declares under Environment (ORDERS_TABLE in the complete example above). Your function code can access these variables through process.env:
const tableName = process.env.ORDERS_TABLE;
const region = process.env.AWS_REGION;
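Because a missing variable only shows up as `undefined` at runtime, it can help to check required variables once at startup. The following is a minimal sketch; `requireEnv` is a hypothetical helper, and it takes the environment as an argument so it can be exercised without a Lambda runtime.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the requested variables or throws,
// listing every variable that is missing or empty.
function requireEnv<K extends string>(env: Env, names: readonly K[]): Record<K, string> {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const result = {} as Record<K, string>;
  for (const name of names) {
    result[name] = env[name] as string;
  }
  return result;
}

// In a handler you would pass process.env; a literal is used here
// so the sketch runs anywhere.
const localConfig = requireEnv({ AWS_REGION: "us-east-1" }, ["AWS_REGION"] as const);
console.log(localConfig.AWS_REGION); // "us-east-1"
```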
Tracking Execution State
For durable functions, you can track execution state:
# Get execution details
sam local execution get $EXECUTION_ARN
# View execution history
sam local execution history $EXECUTION_ARN
# Stop a running execution
sam local execution stop $EXECUTION_ARN
Test callbacks locally:
# Send success callback
sam local callback succeed $CALLBACK_ID --result '{"status": "accepted"}'
# Send failure callback
sam local callback fail $CALLBACK_ID --error '{"message": "Rejected"}'
# Send heartbeat
sam local callback heartbeat $CALLBACK_ID
Remote Testing
After deploying, test against your live functions:
# Invoke the function with a payload file (automatically uses $LATEST qualifier)
sam remote invoke CoffeeOrderFunction --event-file events/order.json
# Invoke a specific qualifier/alias
sam remote invoke CoffeeOrderFunction --event-file events/order.json --parameter Qualifier=prod
# Get execution details
sam remote execution get $EXECUTION_ARN
# View execution history
sam remote execution history $EXECUTION_ARN
By default, sam remote invoke uses the $LATEST qualifier. You can override this with --parameter Qualifier=<your-qualifier> to test against a specific version or alias.
The execution history shows every step, checkpoint, and state transition. It's invaluable for debugging workflows in production.
Configuration Best Practices
ExecutionTimeout can be up to 1 year (31,536,000 seconds) and controls how long your entire workflow can run across all invocations. For a coffee order that waits up to 5 minutes for barista acceptance, consider setting it to 600 seconds (10 minutes) to allow some buffer. For document processing that might take hours or days, you can set it much higher - up to the 1-year maximum.
Function Timeout is still capped at 15 minutes (900 seconds) and controls how long each individual invocation can run. Set this based on how long your function needs between checkpoints. If your steps typically complete in seconds, starting with 30 seconds is reasonable. If you have longer-running operations, consider increasing it up to the 15-minute maximum.
Async vs Sync Invocation: If your execution timeout exceeds your function timeout, you must use asynchronous invocation. Synchronous invocations will fail validation when the execution timeout is longer than the function timeout. Configure your event sources (API Gateway, EventBridge, etc.) to invoke asynchronously for long-running workflows.
RetentionPeriodInDays determines how long execution history is kept. Consider using 7-30 days for development environments where you're actively debugging. For production environments where you might need to investigate issues weeks later, consider using 90+ days.
Memory affects both performance and cost. Starting with 512 MB and adjusting based on your function's needs is a common approach. Durable functions checkpoint frequently, so memory usage is typically lower than equivalent non-durable functions.
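ExecutionTimeout is expressed in seconds, so it is easy to get the arithmetic wrong for multi-day workflows. The sketch below converts a human-readable duration and checks it against the 1-year maximum quoted above; `toExecutionTimeoutSeconds` and the `Duration` shape are hypothetical names, not part of AWS SAM.

```typescript
const SECONDS_PER_MINUTE = 60;
const SECONDS_PER_HOUR = 3600;
const SECONDS_PER_DAY = 86400;
// 1-year maximum from the guidance above: 365 days = 31,536,000 seconds.
const MAX_EXECUTION_TIMEOUT_SECONDS = 365 * SECONDS_PER_DAY;

interface Duration {
  days?: number;
  hours?: number;
  minutes?: number;
}

// Hypothetical helper: converts a duration to a value for ExecutionTimeout.
function toExecutionTimeoutSeconds({ days = 0, hours = 0, minutes = 0 }: Duration): number {
  const total =
    days * SECONDS_PER_DAY + hours * SECONDS_PER_HOUR + minutes * SECONDS_PER_MINUTE;
  if (!Number.isInteger(total) || total < 1) {
    throw new RangeError("duration must be a whole number of seconds, at least 1");
  }
  if (total > MAX_EXECUTION_TIMEOUT_SECONDS) {
    throw new RangeError("duration exceeds the 1-year maximum");
  }
  return total;
}

console.log(toExecutionTimeoutSeconds({ minutes: 10 })); // 600
console.log(toExecutionTimeoutSeconds({ hours: 24 })); // 86400
console.log(toExecutionTimeoutSeconds({ days: 365 })); // 31536000
```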
Monitoring and Observability
Enable AWS X-Ray tracing to see how your workflow executes:
Globals:
  Function:
    Tracing: Active
This traces each invocation and shows you the complete execution path across multiple invocations.
Add Amazon CloudWatch Logs Insights queries to analyze execution patterns:
fields @timestamp, @message
| filter @message like /CHECKPOINT/
| sort @timestamp desc
| limit 100
Create CloudWatch alarms for execution failures:
ExecutionFailureAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: CoffeeOrderExecutionFailures
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: FunctionName
        Value: !Ref CoffeeOrderFunction
Common Patterns
Event-driven workflows triggered by API Gateway, EventBridge, or Amazon Simple Queue Service (Amazon SQS) - useful for order processing, approval workflows, and asynchronous task execution:
CoffeeOrderFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Events:
      ApiTrigger:
        Type: Api
        Properties:
          Path: /orders
          Method: post
      EventBridgeTrigger:
        Type: EventBridgeRule
        Properties:
          Pattern:
            source:
              - coffee.orders
            detail-type:
              - OrderPlaced
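For long-running workflows, API Gateway must invoke the function asynchronously. The snippet below declares the X-Amz-Invocation-Type header as an optional request parameter on the method; the API integration still has to pass that header through to Lambda for the invocation to be asynchronous: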
CoffeeOrderFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Timeout: 900 # 15 minutes
    DurableConfig:
      ExecutionTimeout: 86400 # 24 hours
    Events:
      ApiTrigger:
        Type: Api
        Properties:
          Path: /orders
          Method: post
          RequestParameters:
            - method.request.header.X-Amz-Invocation-Type:
                Required: false
                Caching: false
Scheduled workflows that run on a timer - useful for daily reports, periodic data processing, or scheduled maintenance tasks:
DailyReportFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Events:
      DailySchedule:
        Type: Schedule
        Properties:
          Schedule: cron(0 9 * * ? *) # 9 AM daily
Fan-out workflows that process items in parallel - useful for batch processing, data transformation, or concurrent API calls:
BatchProcessorFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    ReservedConcurrentExecutions: 10 # Limit parallel executions
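ReservedConcurrentExecutions caps concurrency at the Lambda level. Inside a single invocation, the same idea of bounded parallelism can be sketched in plain TypeScript. This is an illustration only: `mapWithConcurrency` is a hypothetical helper, not an API of the durable execution SDK, which may offer its own primitives for parallel work.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight at once, returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let nextIndex = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function runLane(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}

// Doubles four numbers with at most two workers running at a time,
// then logs the doubled values in input order.
mapWithConcurrency([1, 2, 3, 4], 2, async (n) => n * 2).then((doubled) => {
  console.log(doubled);
});
```

Because JavaScript runs these callbacks on a single thread, claiming an index with `nextIndex++` needs no locking.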
Troubleshooting
Validation error on invoke: If you get a validation error saying "Execution timeout must be within the function timeout," you're trying to synchronously invoke a function where the execution timeout exceeds the function timeout. Use asynchronous invocation for long-running workflows.
Individual invocation timeout: If your function times out during a single invocation, consider increasing the Timeout property. Start by doubling the current value, up to the maximum of 900 seconds. This is separate from the workflow's total ExecutionTimeout.
Workflow timeout: If your entire workflow times out, consider increasing the ExecutionTimeout in DurableConfig. This can be up to 1 year (31,536,000 seconds).
Permission denied for callbacks: Ensure client functions have the correct IAM permissions to interact with durable executions. Functions sending callbacks need SendDurableExecutionCallback* permissions with the target durable function as the resource.
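The "double it, up to 900 seconds" advice for individual invocation timeouts can be written down as a one-line rule. This is only a sketch of that arithmetic; `nextFunctionTimeout` is a hypothetical helper name.

```typescript
const FUNCTION_TIMEOUT_CAP_SECONDS = 900; // 15-minute maximum for Timeout

// Hypothetical helper: suggests the next Timeout value to try after a
// single invocation timed out, doubling until the cap is reached.
function nextFunctionTimeout(currentSeconds: number): number {
  if (!Number.isInteger(currentSeconds) || currentSeconds < 1) {
    throw new RangeError("currentSeconds must be a positive integer");
  }
  return Math.min(currentSeconds * 2, FUNCTION_TIMEOUT_CAP_SECONDS);
}

console.log(nextFunctionTimeout(30)); // 60
console.log(nextFunctionTimeout(600)); // 900
```

Once the suggestion reaches 900, raising Timeout further is not possible; at that point the work needs to be split into smaller steps between checkpoints.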
Summary
Deploying durable functions with AWS SAM involves a few key configurations: add DurableConfig to your function to set the execution timeout and retention period, install the durable execution SDK as a dependency, and configure esbuild for TypeScript compilation. AWS SAM automatically grants the IAM permissions the durable function itself needs to checkpoint and manage its execution state. Client functions that interact with durable executions need explicit permissions for callback and monitoring operations.
Use Globals to reduce repetition across functions, set appropriate timeouts for both individual invocations and total workflow duration, and enable tracing for observability. Test locally with sam local invoke and sam local execution commands before deploying to AWS.
With these patterns in place, you can deploy long-running workflows that span hours or days, handle failures gracefully, and provide full visibility into execution state.