Ali Zgheib

Posted on Jan 10

🔥 This AWS Lambda Update Changes Everything (Durable Functions)

#serverless #aws #lambda #cloud

TL;DR: In December 2025, AWS released Lambda Durable Functions. A single workflow can now run for up to 366 days instead of being capped at 15 minutes, although each individual invocation still has the 15-minute limit. The function checkpoints progress automatically, suspends during waits without compute charges, retries with built-in strategies, waits for external callbacks such as human approvals, and processes batches with concurrency control. All in a single Lambda function.

What Are Lambda Durable Functions?

Lambda Durable Functions extends Lambda to support long-running, stateful workflows that can pause, wait, and resume. Unlike standard Lambda functions (max 15 minutes), durable functions can run for up to 366 days through checkpoint-and-replay.

Key capabilities:

Automatic checkpointing after each operation
No compute charges while the function is suspended during waits
Built-in retry with configurable strategies
External callback support (human approvals, webhooks)
Batch processing with per-item checkpoints

Getting Started

This tutorial uses TypeScript and Serverless Framework for infrastructure-as-code. You can also use another supported language, the AWS console, or another infrastructure-as-code tool such as CDK, SAM, or Terraform.

Enable Durable Execution

Configure your Lambda function to support durable execution in serverless.yml:

functions:
  myFunction:
    handler: handler.main
    timeout: 900                      # Lambda execution timeout (seconds, max 900 = 15 min)
    durableConfig:
      executionTimeout: 86400         # Workflow timeout (seconds, max 31,622,400 = 366 days)
      retentionPeriodInDays: 7        # Keep execution history for 7 days

Parameter explanations:

timeout: Maximum time for a single Lambda invocation (max 15 minutes). Your function is replayed multiple times, and each replay must complete within this limit.
executionTimeout: Maximum time for the entire workflow across all replays (max 366 days). This is how long your durable function can run from start to finish, including all waits.
retentionPeriodInDays: How long AWS keeps your execution history and checkpoint logs after completion (1-90 days). Used for debugging and observability.

Example scenario: You have a workflow that processes a payment, waits 2 hours, then ships an order.

Set timeout: 60 because each individual Lambda execution (processing payment, then later shipping order) completes in under 60 seconds
Set executionTimeout slightly above 7200 (for example 7500) because the wait alone takes 2 hours and the payment and shipping steps add time on top of it
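
One way to sanity-check these two values is a small helper that adds up the waits and a per-invocation budget for each active phase, then checks the result against the limits quoted above. This is a minimal sketch: suggestExecutionTimeout and checkLimits are hypothetical names, not part of any AWS tooling, and the limits are simply the ones stated in this article.

```typescript
// Limits as quoted in this article; confirm against current AWS documentation.
const MAX_INVOCATION_TIMEOUT_S = 900;             // 15 minutes
const MAX_EXECUTION_TIMEOUT_S = 366 * 24 * 3600;  // 31,622,400 seconds
const RETENTION_DAYS = { min: 1, max: 90 };

interface DurableSettings {
  timeout: number;               // per-invocation limit, in seconds
  executionTimeout: number;      // whole-workflow limit, in seconds
  retentionPeriodInDays: number;
}

// Hypothetical helper: total wait time plus one per-invocation budget for each active phase.
function suggestExecutionTimeout(
  waitsSeconds: number[],
  activePhases: number,
  timeoutSeconds: number,
): number {
  const totalWait = waitsSeconds.reduce((sum, w) => sum + w, 0);
  return totalWait + activePhases * timeoutSeconds;
}

// Hypothetical helper: returns a list of problems, or an empty list if every limit holds.
function checkLimits(s: DurableSettings): string[] {
  const problems: string[] = [];
  if (s.timeout < 1 || s.timeout > MAX_INVOCATION_TIMEOUT_S) {
    problems.push('timeout must be between 1 and 900 seconds');
  }
  if (s.executionTimeout > MAX_EXECUTION_TIMEOUT_S) {
    problems.push('executionTimeout exceeds 366 days');
  }
  if (
    s.retentionPeriodInDays < RETENTION_DAYS.min ||
    s.retentionPeriodInDays > RETENTION_DAYS.max
  ) {
    problems.push('retentionPeriodInDays must be between 1 and 90');
  }
  return problems;
}

// 2-hour wait, two active phases (payment, shipping), 60 seconds each.
const suggested = suggestExecutionTimeout([7200], 2, 60);
console.log(suggested, checkLimits({ timeout: 60, executionTimeout: suggested, retentionPeriodInDays: 7 }));
```

The suggested value lands a little above the wait itself, because the active phases also count toward the workflow total.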

Set Up the Durable Execution SDK

Durable functions depend on the SDK. It handles checkpoint-and-replay, manages execution state, and provides the durable operations you'll use in your code. Without it, you would have to implement all state management, checkpoint tracking, and recovery logic yourself.

Available languages:

JavaScript
TypeScript
Python

Support for more languages may follow over time.

Install:

npm install @aws/durable-execution-sdk-js

Wrap Your Handler

Wrap your Lambda handler with withDurableExecution to enable durable execution:

import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Your durable workflow code here
    return { statusCode: 200 };
  }
);

The DurableContext gives you access to durable operations including step(), wait(), parallel(), map(), waitForCallback(), and several others for building long-running workflows.

How Durable Execution Works

Checkpoint and Replay

Durable functions run multiple times during their lifecycle. Each time Lambda invokes your function, it replays your code from the beginning - but skips completed operations by reading from the checkpoint log.

Example:

export const handler = withDurableExecution(async (event, context) => {
  // Step 1: Charge payment
  const charge = await context.step('charge', async () => {
    return processPayment(event.amount);
  });

  // Step 2: Wait 2 hours
  await context.wait({ seconds: 7200 });

  // Step 3: Ship order
  const shipment = await context.step('ship', async () => {
    return createShipment(charge.orderId);
  });

  return { shipment };
});

Execution timeline:

Invocation 1 (T+0):
┌─────────────────────────────────────────────────┐
│ [charge ✓] → checkpoint saved                   │
│ [wait 2h...] → Lambda suspends (no charges)     │
└─────────────────────────────────────────────────┘
                    ⏳ 2 hours pass...

Invocation 2 (T+2h):
┌─────────────────────────────────────────────────┐
│ [charge ⚡cached] ← reads from checkpoint       │
│ [wait ⚡skipped] ← already completed            │
│ [ship ✓] → checkpoint saved                     │
│ Return result ✓                                 │
└─────────────────────────────────────────────────┘

During the 2-hour wait, no invocation is running, so there are no compute charges.
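
To make the replay mechanics concrete, here is a toy, synchronous model of a checkpoint log. It is not the AWS SDK and does not reflect its internals: makeToyContext, invokeOnce, and the SUSPEND marker are all hypothetical names, and the toy assumes the next invocation only happens after the wait has elapsed.

```typescript
// Toy model of checkpoint-and-replay. NOT the AWS SDK: every name below is hypothetical.
const SUSPEND = Symbol('suspend');
type CheckpointLog = Map<string, unknown>;

function makeToyContext(log: CheckpointLog) {
  return {
    // Run fn only if no checkpoint exists; otherwise return the saved result.
    step<T>(stepName: string, fn: () => T): T {
      if (log.has(stepName)) return log.get(stepName) as T;
      const result = fn();
      log.set(stepName, result); // checkpoint
      return result;
    },
    // First encounter: record the wait and end the invocation. On replay: skip it.
    wait(waitName: string): void {
      if (log.has(waitName)) return;
      log.set(waitName, 'elapsed');
      throw SUSPEND;
    },
  };
}

type ToyContext = ReturnType<typeof makeToyContext>;

// Invoke the workflow once; returns its result, or 'suspended' if it reached a new wait.
function invokeOnce<R>(workflow: (ctx: ToyContext) => R, log: CheckpointLog): R | 'suspended' {
  try {
    return workflow(makeToyContext(log));
  } catch (e) {
    if (e === SUSPEND) return 'suspended';
    throw e;
  }
}

let chargeCalls = 0;
const orderWorkflow = (ctx: ToyContext): string => {
  const charge = ctx.step('charge', () => {
    chargeCalls += 1;
    return { orderId: 'o-1' };
  });
  ctx.wait('two-hours');
  return ctx.step('ship', () => `shipped ${charge.orderId}`);
};

const checkpointLog: CheckpointLog = new Map();
const firstInvocation = invokeOnce(orderWorkflow, checkpointLog);  // runs 'charge', then suspends
const secondInvocation = invokeOnce(orderWorkflow, checkpointLog); // replays from the log, then ships
console.log(firstInvocation, secondInvocation, chargeCalls);
```

The second invocation re-enters the workflow from the top, yet the payment side effect runs only once because its result comes from the log.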

Determinism Requirements

Replay depends on your code producing the same results every time it runs. Any code outside durable operations must be deterministic - meaning it returns the same output for the same input.

Non-deterministic operations must be wrapped:

// ❌ Wrong: Random value changes on each replay
const id = uuid();
await context.step('save', async () => saveWithId(id));

// ✅ Correct: Random value generated once, checkpointed
const id = await context.step('generate-id', async () => {
  return uuid();
});
await context.step('save', async () => saveWithId(id));

Wrap these in steps:

Random values (Math.random(), uuid(), uuidv4())
Timestamps (Date.now(), new Date())
External API calls
Database queries
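
The difference between the wrapped and unwrapped versions can be shown with a toy checkpoint map. This is an illustration only, not the SDK: toyStep is a hypothetical stand-in for a durable step, and nextId stands in for uuid() so the example has no dependencies.

```typescript
// Toy illustration, not the AWS SDK. `nextId` stands in for uuid(): each call returns a new value.
let idCounter = 0;
const nextId = (): string => `id-${++idCounter}`;

const checkpoints = new Map<string, string>();

// Hypothetical stand-in for a durable step: compute once, then serve from the checkpoint map.
function toyStep(stepName: string, fn: () => string): string {
  const saved = checkpoints.get(stepName);
  if (saved !== undefined) return saved;
  const value = fn();
  checkpoints.set(stepName, value);
  return value;
}

// One "replay" of the workflow body: returns the id each approach would see.
function replayOnce(): { unwrapped: string; wrapped: string } {
  const unwrapped = nextId();                     // regenerated on every replay
  const wrapped = toyStep('generate-id', nextId); // generated once, then read back
  return { unwrapped, wrapped };
}

const run1 = replayOnce();
const run2 = replayOnce();
console.log(run1.unwrapped === run2.unwrapped); // false: the value drifts between replays
console.log(run1.wrapped === run2.wrapped);     // true: the checkpointed value is stable
```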

Core Operations

The SDK provides several operations for building durable workflows. Each operation creates checkpoints automatically, ensuring your function can resume from any point.

context.step()

Executes business logic with automatic checkpointing and retry. Once a step succeeds, it never re-executes - the checkpointed result is used on replay.

const result = await context.step('process-payment', async () => {
  return await paymentService.charge(amount);
});

Use for: Database calls, API requests, any side-effecting operation.
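
For intuition about what a retry strategy does, here is a rough, synchronous sketch of retrying with exponential backoff. It is a swapped-in illustration, not the SDK's retry configuration: retryWithBackoff is a hypothetical helper, and it only records the delays it would apply instead of actually sleeping.

```typescript
// Hypothetical retry helper for illustration; the SDK's real retry options may look different.
interface RetryOutcome<T> {
  value: T;
  attempts: number;
  delaysSeconds: number[]; // delays that would be applied between attempts
}

function retryWithBackoff<T>(
  fn: (attempt: number) => T,
  maxAttempts: number,
  baseDelaySeconds: number,
): RetryOutcome<T> {
  const delaysSeconds: number[] = [];
  let attempt = 0;
  let lastError: unknown;
  while (attempt < maxAttempts) {
    attempt += 1;
    try {
      return { value: fn(attempt), attempts: attempt, delaysSeconds };
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Exponential backoff: base, 2x base, 4x base, ...
        delaysSeconds.push(baseDelaySeconds * Math.pow(2, attempt - 1));
      }
    }
  }
  throw lastError;
}

// A flaky operation that fails twice, then succeeds on the third attempt.
const retryOutcome = retryWithBackoff((attempt) => {
  if (attempt < 3) throw new Error('temporary failure');
  return 'charged';
}, 5, 1);
console.log(retryOutcome);
```

In a durable step, the successful result would then be checkpointed so later replays never repeat the attempts.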

context.wait()

Pauses execution for a specified duration. The SDK creates a checkpoint, terminates the function invocation, and schedules resumption. When the wait completes, Lambda invokes your function again.

await context.wait({ seconds: 3600 }); // Wait 1 hour

Use for: Delays between operations, rate limiting, scheduled actions.
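
When budgeting executionTimeout it can help to convert wait durations to seconds. The sketch below is a hypothetical helper, not an SDK function, and it models only the three keys that appear in this article (seconds, minutes, hours); whether the SDK accepts other keys is not assumed here.

```typescript
// Hypothetical helper; only the duration keys used in this article are modeled.
interface ToyDuration {
  seconds?: number;
  minutes?: number;
  hours?: number;
}

function toSeconds(d: ToyDuration): number {
  return (d.seconds ?? 0) + (d.minutes ?? 0) * 60 + (d.hours ?? 0) * 3600;
}

// Sum every wait in a workflow to see how much of executionTimeout the waits consume.
function totalWaitSeconds(waits: ToyDuration[]): number {
  return waits.reduce((sum, d) => sum + toSeconds(d), 0);
}

console.log(toSeconds({ seconds: 3600 }), toSeconds({ minutes: 30 }), toSeconds({ hours: 24 }));
console.log(totalWaitSeconds([{ hours: 1 }, { hours: 24 }])); // 90000
```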

context.parallel()

Executes multiple operations concurrently with optional concurrency control.

const results = await context.parallel([
  async (ctx) => ctx.step('task1', async () => processTask1()),
  async (ctx) => ctx.step('task2', async () => processTask2()),
  async (ctx) => ctx.step('task3', async () => processTask3())
]);

Use for: Independent operations that can run simultaneously.

context.map()

Concurrently executes an operation on each item in an array with optional concurrency control.

const results = await context.map(itemArray, async (ctx, item, index) =>
  ctx.step('task', async () => processItem(item, index))
);

Use for: Batch processing, parallel data transformation.
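
To show what concurrency control means in practice, here is a plain-TypeScript limiter that never runs more than a fixed number of items at once. It illustrates the general technique only: mapWithLimit and its limit parameter are hypothetical and say nothing about how the SDK's map() is configured or implemented.

```typescript
// Illustrative only: a generic concurrency limiter, not the SDK's map() implementation.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  const queue = items.map((item, index) => ({ item, index }));

  // Each worker pulls the next job until the queue is empty.
  async function worker(): Promise<void> {
    for (let job = queue.shift(); job !== undefined; job = queue.shift()) {
      results[job.index] = await fn(job.item, job.index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Track how many items are in flight at once to confirm the limit holds.
let inFlight = 0;
let peakInFlight = 0;
const limitedRun = mapWithLimit([1, 2, 3, 4, 5], 2, async (n) => {
  inFlight += 1;
  peakInFlight = Math.max(peakInFlight, inFlight);
  await Promise.resolve(); // yield, as real I/O would
  inFlight -= 1;
  return n * 10;
});
limitedRun.then((out) => console.log(out, 'peak in flight:', peakInFlight));
```

Results keep the input order even though items finish at different times, because each result is written to its original index.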

context.waitForCallback()

Suspends execution until an external system submits a callback. The SDK creates a callback, executes your submitter function with the callback ID, and waits for the result.

const result = await context.waitForCallback(
  'external-api',
  async (callbackId, ctx) => {
    await submitToExternalAPI(callbackId, requestData);
  },
  { timeout: { minutes: 30 } }
);

The external system receives the callbackId and sends the result back using the Lambda API (SendDurableExecutionCallbackSuccess or SendDurableExecutionCallbackFailure).

Use for: Human approvals, webhook integrations, external system coordination.
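
The callback handshake can be pictured as a small registry keyed by callback ID. The sketch below is a toy, in-memory model with hypothetical names (createToyCallback, settleToyCallback); in the real service the callback is tracked for you and completed through the Lambda API operations named above. The toy ignores a second completion, which is an assumption of this sketch and not a statement about how the real API treats duplicates.

```typescript
// Toy callback registry for illustration only; all names are hypothetical.
type Settled =
  | { status: 'succeeded'; result: unknown }
  | { status: 'failed'; error: string };
type CallbackState = { status: 'pending' } | Settled;

const callbackRegistry = new Map<string, CallbackState>();
let callbackSeq = 0;

// Workflow side: register a callback and hand its ID to the submitter function.
function createToyCallback(submit: (callbackId: string) => void): string {
  const callbackId = `cb-${++callbackSeq}`;
  callbackRegistry.set(callbackId, { status: 'pending' });
  submit(callbackId);
  return callbackId;
}

// External side: settle a pending callback; later attempts are ignored in this toy.
function settleToyCallback(callbackId: string, outcome: Settled): boolean {
  const current = callbackRegistry.get(callbackId);
  if (current === undefined || current.status !== 'pending') return false;
  callbackRegistry.set(callbackId, outcome);
  return true;
}

const outbox: string[] = [];
const approvalId = createToyCallback((id) => {
  outbox.push(`approve?callback=${id}`); // e.g. a link placed in an approval email
});
const firstSettle = settleToyCallback(approvalId, { status: 'succeeded', result: { approved: true } });
const secondSettle = settleToyCallback(approvalId, { status: 'failed', error: 'late rejection' });
console.log(outbox, firstSettle, secondSettle, callbackRegistry.get(approvalId));
```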

context.createCallback()

Creates a callback and returns both a promise and callback ID. You send the callback ID to an external system, which submits the result using the Lambda API (SendDurableExecutionCallbackSuccess or SendDurableExecutionCallbackFailure).

const [promise, callbackId] = await context.createCallback('approval', {
  timeout: { hours: 24 }
});
await sendApprovalRequest(callbackId, requestData);
const approval = await promise;

Use for: Advanced scenarios where you need the callback ID before suspending.

context.invoke()

Invokes another Lambda function and waits for its result.

const result = await context.invoke(
  'invoke-processor',
  'arn:aws:lambda:us-east-1:123456789012:function:processor',
  { data: inputData }
);

Use for: Function composition, workflow decomposition, calling other Lambda functions.

context.waitForCondition()

Polls for a condition with automatic checkpointing between attempts. The SDK executes your check function, creates a checkpoint with the result, waits according to your strategy, and repeats until the condition is met.

const result = await context.waitForCondition(
  async (state, ctx) => {
    const status = await checkJobStatus(state.jobId);
    return { ...state, status };
  },
  {
    initialState: { jobId: 'job-123', status: 'pending' },
    waitStrategy: (state) =>
      state.status === 'completed'
        ? { shouldContinue: false }
        : { shouldContinue: true, delay: { seconds: 30 } }
  }
);

Use for: Polling external systems, waiting for resources to be ready, implementing retry with backoff.
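
The check-then-decide loop can be simulated without the SDK or real timers. The sketch below reuses the shape of the waitStrategy result from the example above (shouldContinue plus a delay in seconds); simulatePolling, its maxAttempts guard, and the fake job that completes on the third check are assumptions made for illustration.

```typescript
// Simulation only: mirrors the waitStrategy shape from the article, without the SDK or real timers.
interface JobState {
  jobId: string;
  status: string;
}
type WaitDecision =
  | { shouldContinue: false }
  | { shouldContinue: true; delay: { seconds: number } };

function simulatePolling(
  initialState: JobState,
  check: (state: JobState, attempt: number) => JobState,
  waitStrategy: (state: JobState) => WaitDecision,
  maxAttempts: number,
): { finalState: JobState; attempts: number; waitedSeconds: number } {
  let state = initialState;
  let waitedSeconds = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    state = check(state, attempt);          // in the real flow, this result is checkpointed
    const decision = waitStrategy(state);
    if (decision.shouldContinue === false) {
      return { finalState: state, attempts: attempt, waitedSeconds };
    }
    waitedSeconds += decision.delay.seconds; // simulated wait, no real sleeping
  }
  throw new Error('condition not met within maxAttempts');
}

// Hypothetical job that reports 'completed' on the third check.
const polled = simulatePolling(
  { jobId: 'job-123', status: 'pending' },
  (state, attempt) => ({ ...state, status: attempt >= 3 ? 'completed' : 'pending' }),
  (state) =>
    state.status === 'completed'
      ? { shouldContinue: false }
      : { shouldContinue: true, delay: { seconds: 30 } },
  10,
);
console.log(polled);
```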

context.runInChildContext()

Creates an isolated execution context for grouping operations. Child contexts have their own checkpoint log and can contain multiple steps, waits, and other operations. The SDK treats the entire child context as a single unit for retry and recovery.

const result = await context.runInChildContext(
  'batch-processing',
  async (childCtx) => {
    return await processBatch(childCtx, items);
  }
);

Use for: Organizing complex workflows, implementing sub-workflows, isolating operations that should retry together.

Complete Example: Order Fulfillment Workflow

Here's a real-world example combining multiple operations:

import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: { orderId: string; items: string[] }, context: DurableContext) => {
    // Step 1: Process payment
    const payment = await context.step('process-payment', async () => {
      return await paymentService.charge(event.orderId);
    });

    // Step 2: Wait 1 hour for fraud check window
    await context.wait({ seconds: 3600 });

    // Step 3: Parallel operations - reserve inventory and calculate shipping
    const [inventory, shipping] = await context.parallel([
      async (ctx) => ctx.step('reserve-inventory', async () => 
        inventoryService.reserve(event.items)
      ),
      async (ctx) => ctx.step('calculate-shipping', async () => 
        shippingService.calculate(event.orderId)
      )
    ]);

    // Step 4: Wait for external approval
    const approval = await context.waitForCallback(
      'order-approval',
      async (callbackId, ctx) => {
        await notificationService.sendApprovalRequest(callbackId, event.orderId);
      },
      { timeout: { hours: 24 } }
    );

    if (!approval.approved) {
      return { status: 'rejected', orderId: event.orderId };
    }

    // Step 5: Process each item
    const shipments = await context.map(
      event.items,
      async (ctx, item, index) => 
        ctx.step('ship-item', async () => 
          shippingService.shipItem(item, shipping.address)
        )
    );

    // Step 6: Send confirmation
    await context.step('send-confirmation', async () => {
      return notificationService.sendConfirmation(event.orderId, shipments);
    });

    return { status: 'completed', orderId: event.orderId, shipments };
  }
);

This workflow demonstrates:

Sequential steps with automatic checkpointing (payment, confirmation)
Time-based waits for fraud checks (no charges during wait)
Parallel execution for independent operations (inventory + shipping)
External callbacks for order approval with 24-hour timeout
Batch processing with map() for shipping multiple items

The entire workflow can span up to roughly 25 hours (a 1-hour fraud-check wait plus an approval window of up to 24 hours) while only consuming compute during active operations.
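
A quick back-of-the-envelope calculation shows how small the active portion is. The wait durations come from the workflow above; the per-phase compute times are made-up illustration values, not measurements.

```typescript
// Wait durations from the workflow above; active-compute durations are hypothetical.
const workflowWaits = { fraudCheckSeconds: 3600, approvalTimeoutSeconds: 24 * 3600 };
const activePhaseSeconds = [2, 1, 1, 3, 1]; // payment, parallel, approval request, map, confirmation

const maxWaitSeconds = workflowWaits.fraudCheckSeconds + workflowWaits.approvalTimeoutSeconds;
const computeSeconds = activePhaseSeconds.reduce((sum, s) => sum + s, 0);
const computeShare = computeSeconds / (maxWaitSeconds + computeSeconds);

console.log(`max wait: ${maxWaitSeconds / 3600} h, compute: ${computeSeconds} s`);
console.log(`compute share of wall-clock time: ${(computeShare * 100).toFixed(4)}%`);
```

Under these assumptions the function is actively running for well under a tenth of a percent of the workflow's wall-clock time.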

Have you tried Lambda Durable Functions yet? What workflows are you building with them? Share your experiences and questions in the comments below!