Eric D Johnson for AWS

Developing AWS Lambda Durable Functions with AWS SAM

How to configure, build, and deploy long-running workflows using SAM templates


canonical_url: https://dev.to/aws/developing-aws-lambda-durable-functions-with-aws-sam-ga9

You've written a durable function that orchestrates a multi-step workflow, and now you need to deploy it. AWS Serverless Application Model (AWS SAM) handles this with a few durable-specific settings in your template. Let's walk through how to configure, build, test, and deploy durable functions.

Prerequisites

You'll need AWS SAM CLI version 1.150.1 or greater. Check your version:

sam --version

If you need to upgrade, follow the AWS SAM installation guide.

The Basics: Enabling Durable Functions

To make an AWS Lambda function durable, add the DurableConfig property to your AWS SAM template:

Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      Runtime: nodejs22.x
      Architectures:
        - arm64
      Timeout: 900                # Function timeout: 15 minutes
      DurableConfig:
        ExecutionTimeout: 3600    # Execution timeout: 1 hour
        RetentionPeriodInDays: 7  # Keep execution history for 7 days

Understanding the two timeout settings is crucial:

Function Timeout (Timeout) controls how long each individual Lambda invocation can run. This is still capped at 15 minutes (900 seconds), just like regular Lambda functions. Each time your durable function checkpoints and resumes, it's a new invocation with its own 15-minute window.

Execution Timeout (ExecutionTimeout) controls how long the entire workflow can run across all invocations. This can be up to 1 year. Your workflow can pause, wait for callbacks, and resume many times, as long as the total elapsed time doesn't exceed this limit.

For long-running workflows where the execution timeout exceeds the function timeout, you must invoke the function asynchronously. Synchronous invocations will fail with a validation error if the execution timeout is longer than the function timeout.
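
The two limits and the invocation rule can be checked before you deploy. The following is a minimal sketch, assuming the limits stated above (900 seconds per invocation, 1 year per execution). `allowedInvocationTypes` and `DurableTimeouts` are hypothetical names for illustration and are not part of AWS SAM or the durable execution SDK. `RequestResponse` and `Event` are the synchronous and asynchronous Lambda invocation types.

```typescript
// Limits as stated in this article (assumed, not read from any AWS API).
const MAX_FUNCTION_TIMEOUT_SECONDS = 900;       // 15 minutes per invocation
const MAX_EXECUTION_TIMEOUT_SECONDS = 31536000; // 1 year per execution

type InvocationType = "RequestResponse" | "Event";

interface DurableTimeouts {
  functionTimeout: number;  // SAM `Timeout`, in seconds
  executionTimeout: number; // SAM `DurableConfig.ExecutionTimeout`, in seconds
}

// Hypothetical helper: lists the invocation types expected to pass validation.
function allowedInvocationTypes(t: DurableTimeouts): InvocationType[] {
  if (!(t.functionTimeout > 0) || t.functionTimeout > MAX_FUNCTION_TIMEOUT_SECONDS) {
    throw new RangeError(`Timeout must be between 1 and ${MAX_FUNCTION_TIMEOUT_SECONDS} seconds`);
  }
  if (!(t.executionTimeout > 0) || t.executionTimeout > MAX_EXECUTION_TIMEOUT_SECONDS) {
    throw new RangeError(`ExecutionTimeout must be between 1 and ${MAX_EXECUTION_TIMEOUT_SECONDS} seconds`);
  }
  // A synchronous call is rejected when the workflow may outlive one invocation.
  return t.executionTimeout > t.functionTimeout
    ? ["Event"]
    : ["RequestResponse", "Event"];
}
```

With the template above (Timeout: 900, ExecutionTimeout: 3600) the helper returns only `Event`, which matches the rule that such a function has to be invoked asynchronously.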

Installing the SDK

Add the durable execution SDK to your function's dependency manifest. SAM will automatically install it during the build process.

TypeScript/JavaScript - Add to package.json:

{
  "dependencies": {
    "@aws/durable-execution-sdk-js": "^1.0.0"
  }
}

Python - Add to requirements.txt:

aws_durable_execution_sdk_python

For client functions that send callbacks or query execution state, add the AWS SDK:

TypeScript/JavaScript:

{
  "dependencies": {
    "@aws-sdk/client-lambda": "^3.0.0"
  }
}

Python:

boto3

Build Configuration

Configure SAM to build your TypeScript functions with esbuild:

Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      Runtime: nodejs22.x
      Architectures:
        - arm64
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
    Metadata:                     # Resource-level attribute, sibling of Properties
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts

SAM will automatically compile your TypeScript code and bundle dependencies into a deployment package.

IAM Permissions

AWS SAM automatically grants the necessary permissions for durable functions to checkpoint and manage their execution state. You don't need to explicitly configure these permissions.

However, you may want to configure permissions for client functions that interact with durable executions. For functions that send callbacks:

BaristaCallbackFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: src/barista-callback
    Handler: index.handler
    Policies:
      - Statement:
          - Effect: Allow
            Action:
              - lambda:SendDurableExecutionCallbackSuccess
              - lambda:SendDurableExecutionCallbackFailure
              - lambda:SendDurableExecutionCallbackHeartbeat
            Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'

For functions that monitor execution state or stop executions:

MonitoringFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: src/monitoring
    Handler: index.handler
    Policies:
      - Statement:
          - Effect: Allow
            Action:
              - lambda:GetDurableExecution
              - lambda:GetDurableExecutionHistory
              - lambda:ListDurableExecutionsByFunction
              - lambda:SendDurableExecutionCallbackSuccess
              - lambda:SendDurableExecutionCallbackHeartbeat
              - lambda:SendDurableExecutionCallbackFailure
              - lambda:StopDurableExecution
            Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'

These permissions allow client functions to interact with your durable workflows by sending callbacks, monitoring execution state, or stopping running executions.

Using Globals for Common Configuration

AWS SAM's Globals section reduces repetition when you have multiple functions with shared settings:

Globals:
  Function:
    Runtime: nodejs22.x
    Architectures:
      - arm64
    Timeout: 30
    MemorySize: 512
    Tracing: Active

Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7

  CallbackHandlerFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/callback-handler
      Handler: index.handler
      # No DurableConfig - this is a regular Lambda function

You can set common runtime, architecture, and memory settings at the global level. For DurableConfig, configure it at the function level to make it explicit which functions are durable. Functions without DurableConfig are regular Lambda functions.

A Complete Example

Here's a full AWS SAM template for a coffee ordering system with durable workflows:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Coffee Ordering System with Durable Functions

Globals:
  Function:
    Runtime: nodejs22.x
    Architectures:
      - arm64
    Timeout: 30
    MemorySize: 512
    Tracing: Active

Resources:
  # ========== AMAZON DYNAMODB ==========

  OrdersTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: CoffeeOrders
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: orderId
          AttributeType: S
      KeySchema:
        - AttributeName: orderId
          KeyType: HASH

  # ========== FUNCTIONS ==========

  CoffeeOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/coffee-order
      Handler: index.handler
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
      Environment:
        Variables:
          ORDERS_TABLE_NAME: !Ref OrdersTable
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref OrdersTable
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /orders
            Method: post
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts

  BaristaCallbackFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/barista-callback
      Handler: index.handler
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - lambda:SendDurableExecutionCallbackSuccess
                - lambda:SendDurableExecutionCallbackFailure
              Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /barista/accept/{orderId}
            Method: post
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts

Outputs:
  ApiUrl:
    Description: API Gateway endpoint URL
    Value: !Sub 'https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod'

  CoffeeOrderFunctionArn:
    Description: Coffee Order Function ARN
    Value: !GetAtt CoffeeOrderFunction.Arn

Building and Deploying

Build your application with AWS SAM:

sam build

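This builds every function in the template. For functions whose Metadata sets BuildMethod: esbuild, AWS SAM compiles and bundles the TypeScript with esbuild. If the template also defines layers built with a Makefile, AWS SAM runs those Makefiles as part of the same command.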

Deploy to AWS:

sam deploy --guided

The --guided flag walks you through configuration options. After the first deployment, AWS SAM saves your settings in samconfig.toml, so subsequent deploys are just:

sam deploy

Local Testing

Test your durable function locally before deploying:

sam local invoke CoffeeOrderFunction --event events/order.json

The event file contains your test payload:

{
  "orderId": "test-123",
  "attendeeId": "user-456",
  "orderDetails": {
    "drinkType": "Latte",
    "size": "Grande"
  }
}
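
Because the handler receives this payload as untyped JSON, it can help to validate it at the top of the function. This is a sketch under the assumption that the handler expects exactly the fields shown in the sample event. `OrderEvent` and `parseOrderEvent` are illustrative names, not part of any SDK.

```typescript
// Shape of the sample event above (assumed to be the full contract).
interface OrderEvent {
  orderId: string;
  attendeeId: string;
  orderDetails: { drinkType: string; size: string };
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.length > 0;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper: returns a typed event or throws a TypeError.
function parseOrderEvent(raw: unknown): OrderEvent {
  if (!isRecord(raw)) throw new TypeError("event must be a JSON object");
  const details = raw["orderDetails"];
  if (!isRecord(details)) throw new TypeError("orderDetails must be an object");

  const orderId = raw["orderId"];
  const attendeeId = raw["attendeeId"];
  const drinkType = details["drinkType"];
  const size = details["size"];
  if (!isNonEmptyString(orderId) || !isNonEmptyString(attendeeId) ||
      !isNonEmptyString(drinkType) || !isNonEmptyString(size)) {
    throw new TypeError("orderId, attendeeId, drinkType and size must be non-empty strings");
  }
  // Copy only the known fields so unexpected properties are dropped.
  return { orderId, attendeeId, orderDetails: { drinkType, size } };
}
```

Rejecting a malformed payload before the first step runs keeps bad input out of the execution history.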

Accessing Cloud Resources Locally

When testing locally, you often need to access deployed AWS resources like Amazon DynamoDB tables or Amazon EventBridge buses. Use the --env-vars flag with a JSON file to override the function's environment variables:

sam local invoke CoffeeOrderFunction \
  --event events/order.json \
  --env-vars locals.json

Create a locals.json file with environment variables for each function:

{
  "CoffeeOrderFunction": {
    "AWS_REGION": "us-east-1",
    "ORDERS_TABLE_NAME": "CoffeeOrders",
    "EVENT_BUS_NAME": "CoffeeOrderingEventBus"
  },
  "BaristaCallbackFunction": {
    "AWS_REGION": "us-east-1",
    "ORDERS_TABLE_NAME": "CoffeeOrders"
  }
}

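This allows your local function to interact with deployed resources in AWS, provided the AWS credentials in your shell or profile have the necessary permissions. The SAM CLI applies these overrides only to variables the template already declares under Environment, so declare each one there first. Your function code can access these variables through process.env: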

const tableName = process.env.ORDERS_TABLE_NAME;
const region = process.env.AWS_REGION;
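
Reading the variables once and failing fast when one is missing makes a misconfigured env-vars file easier to spot. The following is a minimal sketch. `loadConfig` and `AppConfig` are hypothetical names, and the environment is passed in as a parameter so the function can be exercised without a Lambda runtime.

```typescript
interface AppConfig {
  tableName: string;
  region: string;
}

// Same shape as process.env: every value may be undefined.
type Env = Record<string, string | undefined>;

// Hypothetical helper: collects required settings or throws naming the missing one.
function loadConfig(env: Env): AppConfig {
  const tableName = env["ORDERS_TABLE_NAME"];
  const region = env["AWS_REGION"];
  if (!tableName) throw new Error("ORDERS_TABLE_NAME is not set");
  if (!region) throw new Error("AWS_REGION is not set");
  return { tableName, region };
}
```

In a handler module this would typically be called once at load time as `loadConfig(process.env)`.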

Tracking Execution State

For durable functions, you can track execution state:

# Get execution details
sam local execution get $EXECUTION_ARN

# View execution history
sam local execution history $EXECUTION_ARN

# Stop a running execution
sam local execution stop $EXECUTION_ARN

Test callbacks locally:

# Send success callback
sam local callback succeed $CALLBACK_ID --result '{"status": "accepted"}'

# Send failure callback
sam local callback fail $CALLBACK_ID --error '{"message": "Rejected"}'

# Send heartbeat
sam local callback heartbeat $CALLBACK_ID

Remote Testing

After deploying, test against your live functions:

# Invoke the function (automatically uses $LATEST qualifier)
sam remote invoke CoffeeOrderFunction --event events/order.json

# Invoke a specific qualifier/alias
sam remote invoke CoffeeOrderFunction --event events/order.json --parameter Qualifier=prod

# Get execution details
sam remote execution get $EXECUTION_ARN

# View execution history
sam remote execution history $EXECUTION_ARN

By default, sam remote invoke uses the $LATEST qualifier. You can override this with --parameter Qualifier=<your-qualifier> to test against a specific version or alias.

The execution history shows every step, checkpoint, and state transition. It's invaluable for debugging workflows in production.

Configuration Best Practices

ExecutionTimeout can be up to 1 year (31,536,000 seconds) and controls how long your entire workflow can run across all invocations. For a coffee order that waits up to 5 minutes for barista acceptance, consider setting it to 600 seconds (10 minutes) to allow some buffer. For document processing that might take hours or days, you can set it much higher - up to the 1-year maximum.

Function Timeout is still capped at 15 minutes (900 seconds) and controls how long each individual invocation can run. Set this based on how long your function needs between checkpoints. If your steps typically complete in seconds, starting with 30 seconds is reasonable. If you have longer-running operations, consider increasing it up to the 15-minute maximum.

Async vs Sync Invocation: If your execution timeout exceeds your function timeout, you must use asynchronous invocation. Synchronous invocations will fail validation when the execution timeout is longer than the function timeout. Configure your event sources (API Gateway, EventBridge, etc.) to invoke asynchronously for long-running workflows.

RetentionPeriodInDays determines how long execution history is kept. Consider using 7-30 days for development environments where you're actively debugging. For production environments where you might need to investigate issues weeks later, consider using 90+ days.

Memory affects both performance and cost. Starting with 512 MB and adjusting based on your function's needs is a common approach. Durable functions checkpoint frequently, so memory usage is typically lower than equivalent non-durable functions.
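
The buffer arithmetic for ExecutionTimeout can be written down as a small helper. This is only a sketch: the default factor of 2 is an assumption drawn from the coffee order example (a 5-minute wait rounded up to 600 seconds), and `suggestExecutionTimeout` is a hypothetical name.

```typescript
const ONE_YEAR_SECONDS = 31536000; // maximum ExecutionTimeout stated in this article

// Hypothetical helper: pads the longest expected workflow duration and clamps it to the maximum.
function suggestExecutionTimeout(longestExpectedSeconds: number, bufferFactor: number = 2): number {
  if (!(longestExpectedSeconds > 0)) {
    throw new RangeError("longestExpectedSeconds must be positive");
  }
  if (!(bufferFactor >= 1)) {
    throw new RangeError("bufferFactor must be at least 1");
  }
  const padded = Math.ceil(longestExpectedSeconds * bufferFactor);
  return Math.min(padded, ONE_YEAR_SECONDS);
}
```

`suggestExecutionTimeout(300)` yields 600, the value suggested for the coffee order workflow. Very large inputs are clamped to the 1-year maximum.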

Monitoring and Observability

Enable AWS X-Ray tracing to see how your workflow executes:

Globals:
  Function:
    Tracing: Active

This traces each invocation and shows you the complete execution path across multiple invocations.

Add Amazon CloudWatch Logs Insights queries to analyze execution patterns:

fields @timestamp, @message
| filter @message like /CHECKPOINT/
| sort @timestamp desc
| limit 100
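
The query above only returns rows if something writes the text CHECKPOINT to the log group. Whether the SDK emits such a line on its own is not covered here, so one option is to log an explicit marker from your own code after each step. The sketch below assumes that approach. `formatCheckpointLog` and its field names are hypothetical.

```typescript
interface CheckpointLog {
  marker: "CHECKPOINT"; // literal text the Logs Insights filter matches on
  step: string;
  executionId: string;
  timestamp: string;    // ISO 8601
}

// Hypothetical helper: builds one structured log line for a completed step.
function formatCheckpointLog(step: string, executionId: string, now: Date = new Date()): string {
  const entry: CheckpointLog = {
    marker: "CHECKPOINT",
    step,
    executionId,
    timestamp: now.toISOString(),
  };
  return JSON.stringify(entry);
}
```

Writing the returned string with `console.log` inside the handler produces a line the filter can match, and the JSON fields can then be selected in the same query.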

Create CloudWatch alarms for execution failures:

ExecutionFailureAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: CoffeeOrderExecutionFailures
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: FunctionName
        Value: !Ref CoffeeOrderFunction

Common Patterns

Event-driven workflows triggered by API Gateway, EventBridge, or Amazon Simple Queue Service (Amazon SQS) - useful for order processing, approval workflows, and asynchronous task execution:

CoffeeOrderFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Events:
      ApiTrigger:
        Type: Api
        Properties:
          Path: /orders
          Method: post
      EventBridgeTrigger:
        Type: EventBridgeRule
        Properties:
          Pattern:
            source:
              - coffee.orders
            detail-type:
              - OrderPlaced

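For long-running workflows, API Gateway has to invoke the function asynchronously. The snippet below declares X-Amz-Invocation-Type as an optional request header; the integration request must still pass that header to Lambda with the value Event for the call to be asynchronous: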

CoffeeOrderFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Timeout: 900  # 15 minutes
    DurableConfig:
      ExecutionTimeout: 86400  # 24 hours
    Events:
      ApiTrigger:
        Type: Api
        Properties:
          Path: /orders
          Method: post
          RequestParameters:
            - method.request.header.X-Amz-Invocation-Type:
                Required: false
                Caching: false

Scheduled workflows that run on a timer - useful for daily reports, periodic data processing, or scheduled maintenance tasks:

DailyReportFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Events:
      DailySchedule:
        Type: Schedule
        Properties:
          Schedule: cron(0 9 * * ? *)  # 9 AM UTC daily

Fan-out workflows that process items in parallel - useful for batch processing, data transformation, or concurrent API calls:

BatchProcessorFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    ReservedConcurrentExecutions: 10  # Limit parallel executions
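
ReservedConcurrentExecutions caps how many invocations run at once, but the caller still decides how the work is split. As a minimal sketch, assuming the caller fans out by sending one group of items per invocation, a batch can be divided into fixed-size groups first. `chunk` is a hypothetical helper, not part of AWS SAM or the durable execution SDK.

```typescript
// Hypothetical helper: splits items into consecutive groups of at most `size` elements.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const groups: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    groups.push(items.slice(start, start + size));
  }
  return groups;
}
```

Sizing the groups so that the number of groups in flight stays at or below the reserved concurrency avoids throttled invocations.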

Troubleshooting

Validation error on invoke: If you get a validation error saying "Execution timeout must be within the function timeout," you're trying to synchronously invoke a function where the execution timeout exceeds the function timeout. Use asynchronous invocation for long-running workflows.

Individual invocation timeout: If your function times out during a single invocation, consider increasing the Timeout property. Start by doubling the current value, up to the maximum of 900 seconds. This is separate from the workflow's total ExecutionTimeout.

Workflow timeout: If your entire workflow times out, consider increasing the ExecutionTimeout in DurableConfig. This can be up to 1 year (31,536,000 seconds).

Permission denied for callbacks: Ensure client functions have the correct IAM permissions to interact with durable executions. Functions sending callbacks need SendDurableExecutionCallback* permissions with the target durable function as the resource.
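
When a callback is denied, a quick check is whether the client's policy covers all three callback actions. The sketch below compares a list of granted action patterns against the actions used in the IAM examples earlier. `missingCallbackActions` is a hypothetical helper, and it only understands the `*` wildcard, which is a simplification of IAM's matching rules.

```typescript
// Callback actions taken from the IAM policy examples in this article.
const CALLBACK_ACTIONS = [
  "lambda:SendDurableExecutionCallbackSuccess",
  "lambda:SendDurableExecutionCallbackFailure",
  "lambda:SendDurableExecutionCallbackHeartbeat",
] as const;

// Treats '*' in the pattern as "any run of characters"; IAM action names are case-insensitive.
function actionMatches(pattern: string, action: string): boolean {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except '*'
    .replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`, "i").test(action);
}

// Hypothetical helper: returns the callback actions not covered by any granted pattern.
function missingCallbackActions(granted: readonly string[]): string[] {
  return CALLBACK_ACTIONS.filter(
    (action) => !granted.some((pattern) => actionMatches(pattern, action)),
  );
}
```

An empty result means the Action list is complete, so the next thing to check is the Resource ARN on the statement.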

Summary

Deploying durable functions with AWS SAM involves several key configurations: enable DurableConfig on your function to set execution timeout and retention period, install the durable execution SDK as a dependency, and configure esbuild for TypeScript compilation. AWS SAM automatically grants the IAM permissions the durable function itself needs to checkpoint and manage its execution state. For client functions that interact with durable executions, explicitly configure permissions for callback and monitoring operations.

Use Globals to reduce repetition across functions, set appropriate timeouts for both individual invocations and total workflow duration, and enable tracing for observability. Test locally with sam local invoke and sam local execution commands before deploying to AWS.

With these patterns in place, you can deploy long-running workflows that span hours or days, handle failures gracefully, and provide full visibility into execution state.
