Eric D Johnson

Posted on Dec 3

The Replay Model: How AWS Lambda Durable Functions Actually Work

#serverless #aws #lambda #durable

Understanding the checkpoint-based execution that makes long-running workflows possible

You write an AWS Lambda function that looks like it runs continuously for hours. But a single Lambda invocation can run for at most 15 minutes. How does this work?

The answer is replay - a checkpoint-based execution model that makes your function restart from the beginning on every invocation, but skip the work it's already done. It's elegant, efficient, and once you understand it, surprisingly intuitive.

The Core Principle

Here's the fundamental truth about durable functions:

Your handler function re-executes from the beginning on every invocation, but completed operations return cached results from checkpoints instead of re-executing.

Let's see this in action with a simple workflow:

async function processOrder(event: any, ctx: DurableContext) {
  const order = await ctx.step('create-order', async () => {
    console.log('Creating order...');
    return { orderId: '123', total: 50 };
  });

  const payment = await ctx.step('process-payment', async () => {
    console.log('Processing payment...');
    return { transactionId: 'txn-456', status: 'success' };
  });

  await ctx.wait({ seconds: 300 }); // Wait 5 minutes

  const notification = await ctx.step('send-notification', async () => {
    console.log('Sending notification...');
    return { sent: true };
  });

  return { order, payment, notification };
}

Here's what actually happens across two separate Lambda invocations:

Invocation 1 (t=0s):

Creating order...
[Checkpoint: create-order completed]
Processing payment...
[Checkpoint: process-payment completed]
[Function terminates - waiting 5 minutes]

Invocation 2 (t=300s, after wait completes):

[REPLAY MODE: Skipping create-order - returning cached result]
[REPLAY MODE: Skipping process-payment - returning cached result]
[EXECUTION MODE: Running send-notification]
Sending notification...
[Checkpoint: send-notification completed]
[Function completes]

Notice what happened: The function started from the beginning both times, but on the second invocation, create-order and process-payment didn't re-execute. The logs only appeared once, even though the code ran twice. The function seamlessly continued from where it left off.

This is replay.
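
To make the mechanics concrete, here is a toy model of the same workflow. It is a sketch and not the SDK: the steps are synchronous, checkpoints live in an in-memory Map rather than durable storage, and the names `runProcessOrder`, `CheckpointStore`, and `Suspended` are made up for illustration. It also assumes the second invocation only happens after the wait has elapsed.

```typescript
// Toy model of replay. This is NOT the SDK: steps are synchronous and
// checkpoints live in an in-memory Map instead of durable storage.
type CheckpointStore = Map<string, unknown>;

// Control-flow signal thrown by wait() to end the current invocation.
class Suspended {}

function runProcessOrder(store: CheckpointStore, log: string[]): unknown {
  let nextId = 0; // operation IDs are assigned in call order

  function step<T>(name: string, fn: () => T): T {
    const id = String(++nextId);
    if (store.has(id)) return store.get(id) as T; // replay: cached result
    const result = fn();                          // first run: do the work
    store.set(id, result);                        // checkpoint the result
    log.push(`[Checkpoint: ${name} completed]`);
    return result;
  }

  function wait(): void {
    const id = String(++nextId);
    if (store.has(id)) return; // this toy treats a recorded wait as elapsed
    store.set(id, "WAIT");
    throw new Suspended();     // end the invocation until the next one
  }

  try {
    const order = step("create-order", () => {
      log.push("Creating order...");
      return { orderId: "123", total: 50 };
    });
    const payment = step("process-payment", () => {
      log.push("Processing payment...");
      return { transactionId: "txn-456", status: "success" };
    });
    wait();
    const notification = step("send-notification", () => {
      log.push("Sending notification...");
      return { sent: true };
    });
    return { order, payment, notification };
  } catch (signal) {
    if (signal instanceof Suspended) return "SUSPENDED";
    throw signal;
  }
}

const demoStore: CheckpointStore = new Map();
const demoLog: string[] = [];
const firstResult = runProcessOrder(demoStore, demoLog);  // stops at the wait
const secondResult = runProcessOrder(demoStore, demoLog); // replays, then finishes
```

After both calls, `demoLog` contains "Creating order..." and "Processing payment..." exactly once each, even though the handler body ran from the top twice.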

Execution Modes: The Secret Sauce

The SDK operates in two modes that automatically switch based on what's happening.

ExecutionMode is when the function is executing operations for the first time. Operations execute normally, results are saved to checkpoints, logs are emitted, and side effects happen.

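ReplayMode is when the function is replaying previously completed operations. Operations return cached results instantly without actual execution, logs inside completed steps and logs written through ctx.logger are suppressed, and side effects inside completed steps do not happen again. Code outside of steps still runs on every replay, which is why side effects belong inside steps.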

The SDK automatically transitions from ReplayMode to ExecutionMode when it reaches an operation that hasn't been completed yet.
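
One way to picture the transition rule is as a comparison between an operation's position and the number of operations that already have completed checkpoints. The helper below is hypothetical (`modeFor` is not an SDK function) and only illustrates the rule described above.

```typescript
type Mode = "ReplayMode" | "ExecutionMode";

// Hypothetical helper: the mode for the operation at `position` (1-based),
// given how many operations already have completed checkpoints.
function modeFor(position: number, completedCount: number): Mode {
  return position <= completedCount ? "ReplayMode" : "ExecutionMode";
}

// Invocation 1 of the order workflow: nothing is checkpointed yet.
const modesOnFirstInvocation = [1, 2, 3].map((p) => modeFor(p, 0));
// → ["ExecutionMode", "ExecutionMode", "ExecutionMode"]

// Invocation 2: both steps and the wait are complete, the notification is new.
const modesOnSecondInvocation = [1, 2, 3, 4].map((p) => modeFor(p, 3));
// → ["ReplayMode", "ReplayMode", "ReplayMode", "ExecutionMode"]
```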

How Checkpoints Work

Every operation creates a checkpoint that stores:

{
  operationId: "2",           // Sequential ID
  operationType: "STEP",      // STEP, WAIT, INVOKE, etc.
  operationName: "process-payment",
  status: "SUCCEEDED",        // STARTED, SUCCEEDED, FAILED, PENDING
  result: {                   // The actual return value
    transactionId: "txn-456",
    status: "success"
  }
}

When your function restarts, the SDK loads all checkpoints from storage, indexes them by operation ID, returns cached results for completed operations, and executes new operations normally.
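
The load-and-index step might look roughly like the sketch below. The `Checkpoint` interface mirrors the fields shown above, but the helper names `indexCheckpoints` and `cachedResult` are assumptions for illustration, and how a FAILED checkpoint is treated (rethrow, retry) is deliberately left out.

```typescript
// Assumed record shape, mirroring the fields shown above.
interface Checkpoint {
  operationId: string;
  operationType: "STEP" | "WAIT" | "INVOKE";
  operationName: string;
  status: "STARTED" | "SUCCEEDED" | "FAILED" | "PENDING";
  result?: unknown;
}

type Lookup = { hit: true; value: unknown } | { hit: false };

// Build an index keyed by operation ID so lookups during replay are O(1).
function indexCheckpoints(records: Checkpoint[]): Map<string, Checkpoint> {
  const byId = new Map<string, Checkpoint>();
  for (const record of records) byId.set(record.operationId, record);
  return byId;
}

// Only a SUCCEEDED checkpoint yields a cached result in this sketch;
// everything else is reported as a miss.
function cachedResult(index: Map<string, Checkpoint>, operationId: string): Lookup {
  const record = index.get(operationId);
  if (record !== undefined && record.status === "SUCCEEDED") {
    return { hit: true, value: record.result };
  }
  return { hit: false };
}

const loadedIndex = indexCheckpoints([
  { operationId: "1", operationType: "STEP", operationName: "create-order",
    status: "SUCCEEDED", result: { orderId: "123", total: 50 } },
  { operationId: "2", operationType: "STEP", operationName: "process-payment",
    status: "SUCCEEDED", result: { transactionId: "txn-456", status: "success" } },
  { operationId: "3", operationType: "WAIT", operationName: "wait", status: "PENDING" },
]);

const paymentLookup = cachedResult(loadedIndex, "2"); // hit: cached result returned
const waitLookup = cachedResult(loadedIndex, "3");    // miss: wait is still pending
const newOpLookup = cachedResult(loadedIndex, "4");   // miss: never ran before
```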

The Determinism Requirement

For replay to work, your code must be deterministic - the same sequence of operations must happen in the same order every time.

What Breaks Determinism

// ❌ Random control flow
if (Math.random() > 0.5) {
  await ctx.step('optional-step', async () => doSomething());
}
// First run: random = 0.7, step executes, checkpoint created
// Second run: random = 0.3, step skipped
// Error: Expected operation 'optional-step' at position 2, not found!

// ❌ Time-based branching
const day = new Date().getDay(); // 0 = Sunday, 6 = Saturday
const isWeekend = day === 0 || day === 6;
if (isWeekend) {
  await ctx.step('weekend-task', async () => doWeekendWork());
}
// First run: Saturday (day 6), step executes
// Second run: Monday (day 1), step skipped
// Error: Replay consistency violation!

// ❌ External state
let counter = 0;
await ctx.step('step1', async () => {
  counter++; // Won't increment during replay!
  return counter;
});

How to Write Deterministic Code

The rule is simple: capture non-deterministic values inside steps.

// ✅ Capture random values in steps
const randomId = await ctx.step('generate-id', async () => {
  return crypto.randomUUID(); // Executed once, cached on replay
});

// ✅ Capture timestamps in steps
const timestamp = await ctx.step('get-timestamp', async () => {
  return Date.now(); // Same timestamp on every replay
});

// ✅ Use event data for control flow
if (event.shouldProcess) { // Deterministic - same event every time
  await ctx.step('process', async () => doWork());
}

// ✅ Capture time-based decisions in steps
const isWeekend = await ctx.step('check-day', async () => {
  const day = new Date().getDay(); // 0 = Sunday, 6 = Saturday
  return day === 0 || day === 6;
});
if (isWeekend) {
  await ctx.step('weekend-task', async () => doWeekendWork());
}

Replay Consistency Validation

The SDK validates that operations occur in the same order on every invocation:

// What gets validated:
// 1. Operation type (STEP, WAIT, INVOKE)
// 2. Operation name (your identifier)
// 3. Operation position (sequential order)

// Example validation error:
// "Replay consistency violation: Expected operation 'process-payment' 
//  of type STEP at position 2, but found operation 'send-email' of type STEP"

This catches bugs early. If your code's execution path changes between invocations, you'll know immediately.
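
A check along these lines could be sketched as follows. `validateReplay` and `RecordedOp` are hypothetical names, and the error text simply reuses the example message above. The real SDK's validation logic may differ.

```typescript
interface RecordedOp {
  type: string; // e.g. "STEP", "WAIT", "INVOKE"
  name: string; // the identifier passed to the operation
}

// Hypothetical validator: compares the operation the code is attempting at
// `position` (1-based) against what was recorded at that position.
function validateReplay(recordedOps: RecordedOp[], position: number, attempted: RecordedOp): void {
  const expected = recordedOps[position - 1];
  if (expected === undefined) return; // nothing recorded here: this is new work
  if (expected.type !== attempted.type || expected.name !== attempted.name) {
    throw new Error(
      `Replay consistency violation: Expected operation '${expected.name}' ` +
      `of type ${expected.type} at position ${position}, but found operation ` +
      `'${attempted.name}' of type ${attempted.type}`
    );
  }
}

const recordedOps: RecordedOp[] = [
  { type: "STEP", name: "create-order" },
  { type: "STEP", name: "process-payment" },
];

// The code path changed: position 2 now attempts a different step.
let violationMessage = "";
try {
  validateReplay(recordedOps, 2, { type: "STEP", name: "send-email" });
} catch (err) {
  violationMessage = (err as Error).message;
}
```

Attempting an operation at position 3 passes the check, because nothing was recorded there yet and the operation counts as new work.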

A Complete Example: Order Processing with Replay

Let's trace a more realistic workflow across multiple invocations:

async function processOrder(event: any, ctx: DurableContext) {
  ctx.logger.info('Order processing started', { orderId: event.orderId });

  // Step 1: Validate inventory
  const inventory = await ctx.step('check-inventory', async () => {
    ctx.logger.info('Checking inventory');
    const response = await fetch(`https://api.inventory.com/check`, {
      method: 'POST',
      body: JSON.stringify({ items: event.items })
    });
    return response.json();
  });

  if (!inventory.available) {
    ctx.logger.warn('Out of stock', { missing: inventory.missing });
    return { status: 'out-of-stock' };
  }

  // Step 2: Process payment
  const payment = await ctx.step('process-payment', async () => {
    ctx.logger.info('Processing payment', { amount: inventory.total });
    const response = await fetch(`https://api.payments.com/charge`, {
      method: 'POST',
      body: JSON.stringify({ 
        customerId: event.customerId, 
        amount: inventory.total 
      })
    });
    return response.json();
  });

  // Step 3: Wait for warehouse confirmation (5 minute timeout)
  ctx.logger.info('Waiting for warehouse confirmation');
  const confirmation = await ctx.waitForCallback(
    'warehouse-confirm',
    async (callbackId) => {
      // Send callback ID to warehouse system
      await fetch(`https://api.warehouse.com/notify`, {
        method: 'POST',
        body: JSON.stringify({ orderId: event.orderId, callbackId })
      });
    },
    { timeout: { seconds: 300 } }
  );

  // Step 4: Send notification
  await ctx.step('notify-customer', async () => {
    ctx.logger.info('Sending customer notification');
    await fetch(`https://api.notifications.com/send`, {
      method: 'POST',
      body: JSON.stringify({
        customerId: event.customerId,
        message: 'Your order is confirmed!'
      })
    });
  });

  ctx.logger.info('Order processing completed');
  return { status: 'completed', orderId: event.orderId };
}

Invocation Timeline

Invocation 1 (t=0s) runs in ExecutionMode. The logs show "Order processing started", "Checking inventory", "Processing payment", and "Waiting for warehouse confirmation". Checkpoints are created for check-inventory and process-payment, both marked as SUCCEEDED. The function then enters a waiting state for the callback 'warehouse-confirm', creating a checkpoint to persist the callbackId and set the timer for the timeout.

Invocation 2 (t=120s, warehouse confirms) starts in ReplayMode and transitions to ExecutionMode. During the replay phase, "Order processing started", "Checking inventory", "Processing payment", and "Waiting for warehouse confirmation" are all suppressed - the inventory and payment steps return their cached results without re-executing. Once the function reaches new operations, it switches to ExecutionMode, checkpoints the callback result, and logs "Sending customer notification" and "Order processing completed". A checkpoint is created for notify-customer marked as SUCCEEDED, and the function completes.

Notice how the function ran from the beginning both times, but the inventory and payment APIs were only called once. Logs only appeared once with no duplicates, and the function seamlessly continued after the callback.
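
The "no duplicate logs" behavior can be pictured as a logger that checks the current mode before emitting anything. This is a minimal sketch under that assumption: `makeLogger` is a made-up helper, not the SDK's ctx.logger, and it writes to an array so the effect is easy to inspect.

```typescript
// Hypothetical replay-aware logger: drops messages while `isReplaying()`
// returns true, so each line is emitted by only one invocation.
function makeLogger(isReplaying: () => boolean, sink: string[]) {
  return {
    info(message: string): void {
      if (!isReplaying()) sink.push(message);
    },
  };
}

// Simulate invocation 2 of the order workflow.
const emittedLines: string[] = [];
let replaying = true;
const demoLogger = makeLogger(() => replaying, emittedLines);

demoLogger.info("Order processing started");            // suppressed during replay
demoLogger.info("Waiting for warehouse confirmation");  // suppressed during replay
replaying = false;                                       // first new operation reached
demoLogger.info("Sending customer notification");       // emitted
demoLogger.info("Order processing completed");          // emitted
```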

Operation IDs: The Replay Index

Operations are identified by sequential IDs that determine replay order:

// Root context operations
await ctx.step('step1', ...);  // ID: "1"
await ctx.step('step2', ...);  // ID: "2"
await ctx.step('step3', ...);  // ID: "3"

// Child context operations (from ctx.runInChildContext)
await ctx.runInChildContext('child', async (childCtx) => {
  await childCtx.step('child-step1', ...);  // ID: "parent-1"
  await childCtx.step('child-step2', ...);  // ID: "parent-2"
});

IDs are deterministic - they're based on execution order, not operation names. This is why operation order must be consistent.
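
A counter per context is enough to produce IDs like these. The sketch below is an illustration only: `IdAllocator` is a hypothetical class, and it assumes the "parent" prefix in the child IDs above stands for the ID of the operation that created the child context.

```typescript
// Hypothetical ID allocator: each context counts its own operations and
// prefixes them with the ID of the operation that created the context.
class IdAllocator {
  private counter = 0;
  private readonly prefix: string;

  constructor(prefix: string = "") {
    this.prefix = prefix;
  }

  // Next ID in this context, based purely on call order.
  next(): string {
    this.counter += 1;
    return this.prefix === "" ? String(this.counter) : `${this.prefix}-${this.counter}`;
  }

  // Creating a child context is itself an operation, so it consumes an ID.
  child(): IdAllocator {
    return new IdAllocator(this.next());
  }
}

const rootAlloc = new IdAllocator();
const rootIds = [rootAlloc.next(), rootAlloc.next()];     // "1", "2"
const childAlloc = rootAlloc.child();                     // child context is "3"
const childIds = [childAlloc.next(), childAlloc.next()];  // "3-1", "3-2"
const afterChildId = rootAlloc.next();                    // "4"
```

Because the names passed to each operation never enter the calculation, renaming a step leaves its ID unchanged, while reordering operations changes the IDs.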

Common Pitfalls and Solutions

1. Non-Deterministic Control Flow

// ❌ BAD: Random branching
if (Math.random() > 0.5) {
  await ctx.step('optional', async () => doWork());
}

// ✅ GOOD: Event-based branching
if (event.shouldDoWork) {
  await ctx.step('optional', async () => doWork());
}

// ✅ GOOD: Capture decision in step
const shouldDoWork = await ctx.step('decide', async () => {
  return Math.random() > 0.5;
});
if (shouldDoWork) {
  await ctx.step('optional', async () => doWork());
}

2. Mutating Closure Variables

// ❌ BAD: External mutation
let total = 0;
await ctx.step('add-items', async () => {
  total += 10; // Won't happen during replay!
});

// ✅ GOOD: Return values
const total = await ctx.step('calculate-total', async () => {
  return 10;
});
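
The difference shows up as soon as the handler runs a second time. The sketch below uses a toy, synchronous `toyStep` keyed by name (a simplification, and not the SDK's step) to run each handler twice against the same checkpoint cache.

```typescript
// Toy step(): runs fn only when no checkpoint exists for `name`.
function toyStep<T>(cache: Map<string, unknown>, name: string, fn: () => T): T {
  if (cache.has(name)) return cache.get(name) as T; // replay: skip fn entirely
  const result = fn();
  cache.set(name, result);
  return result;
}

function badHandler(cache: Map<string, unknown>): number {
  let total = 0;
  toyStep(cache, "add-items", () => {
    total += 10; // only happens when the step body actually runs
  });
  return total;
}

function goodHandler(cache: Map<string, unknown>): number {
  // The value travels through the checkpoint, so replay sees it too.
  return toyStep(cache, "calculate-total", () => 10);
}

const badCache = new Map<string, unknown>();
const badTotals = [badHandler(badCache), badHandler(badCache)];      // [10, 0]

const goodCache = new Map<string, unknown>();
const goodTotals = [goodHandler(goodCache), goodHandler(goodCache)]; // [10, 10]
```

On the second run the bad handler returns 0, because the mutation lived inside a step body that replay skipped.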

3. Side Effects Outside Steps

// ❌ BAD: Direct API calls
const data = await fetch('https://api.example.com/data');
await ctx.step('process', async () => processData(data));
// API called on every replay!

// ✅ GOOD: API calls inside steps
const data = await ctx.step('fetch-data', async () => {
  const response = await fetch('https://api.example.com/data');
  return response.json(); // Return serializable data, not the Response object
});
await ctx.step('process', async () => processData(data));

Debugging Replay Issues

When replay goes wrong, use the execution history:

# View execution history
sam remote execution history $EXECUTION_ARN

# See detailed operation data
sam remote execution history $EXECUTION_ARN --format json

The history shows every operation that executed, the order they ran in, their results or errors, and when mode transitions occurred. Look for operations appearing in different orders, missing or extra operations, or operations with different names at the same position.

Best Practices for Replay-Safe Code

Wrap all non-deterministic operations in steps - random numbers, timestamps, API calls, and database queries should always be inside ctx.step(). Use event data for control flow rather than runtime-generated values, and never mutate closure variables - return values from steps instead. Keep operation order consistent so the same sequence happens every time. Test with multiple invocations to verify replay behavior locally, and check execution history to debug replay issues quickly.

Summary

The replay model is what makes durable functions possible. Your function restarts from the beginning on every invocation, but completed operations return cached results without re-executing. The SDK automatically switches between ReplayMode and ExecutionMode, and your code must be deterministic for replay to work correctly.

Once you internalize these principles, writing durable functions becomes natural. You write straightforward procedural code, and the SDK handles all the complexity of checkpointing, replay, and state management. The result? Long-running workflows that look like simple functions. That's the magic of replay.

DEV Community