Building AI Video Generation Pipelines with AWS Lambda Durable Functions

At re:Invent 2025, AWS announced Lambda Durable Functions — a new capability that lets you write long-running, stateful workflows as simple sequential code while the SDK handles checkpointing, retries, and state management automatically.

I wanted to try it on a realistic use case, so I used it in a content generation platform that transforms product photos into social media content, with Gemini handling image and video generation. This post covers why I chose Lambda Durable Functions and the patterns that made it work.

The Problem: AI Video Generation is Slow

Video generation with models like Veo 3.1 takes roughly 90 seconds or longer. That's a problem when:

  • API Gateway times out at 29 seconds
  • A single Lambda invocation is capped at 15 minutes
  • Users expect a response, not a "check back later"

Traditional solutions involve Step Functions, SQS queues, or webhook callbacks. All require orchestration code, state management, and error handling boilerplate.

What is Lambda Durable Functions?

Lambda Durable Functions uses a checkpoint-and-replay model:

  1. Execute & Checkpoint — The function runs, and the SDK saves progress at each context.step() call
  2. Wait & Suspend — When encountering a wait or external call, the function terminates gracefully, preserving state
  3. Resume & Replay — On the next invocation, completed steps are skipped using their checkpointed results
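
The three stages above can be sketched with a toy, in-memory model. This is not the SDK's implementation: `MiniContext`, its `step()` method, and the `Map` used as a checkpoint store are hypothetical names for illustration, and the steps are synchronous for brevity, whereas real steps are async.

```typescript
// Toy model of checkpoint-and-replay (hypothetical, not the real SDK).
type Checkpoints = Map<string, unknown>;

class MiniContext {
  // Names of steps whose bodies actually ran during this invocation
  executed: string[] = [];
  private checkpoints: Checkpoints;

  constructor(checkpoints: Checkpoints) {
    this.checkpoints = checkpoints;
  }

  step<T>(name: string, fn: () => T): T {
    if (this.checkpoints.has(name)) {
      // Replay: return the saved result without running the body again
      return this.checkpoints.get(name) as T;
    }
    const result = fn();
    this.checkpoints.set(name, result); // checkpoint before moving on
    this.executed.push(name);
    return result;
  }
}

function workflow(ctx: MiniContext): string {
  const id = ctx.step('start', () => 'op-123');
  return ctx.step('check', () => `${id}:done`);
}

// First invocation runs both step bodies and fills the store
const store: Checkpoints = new Map();
const first = new MiniContext(store);
const firstResult = workflow(first);

// A later invocation re-runs the same code against the same store
const second = new MiniContext(store);
const secondResult = workflow(second);
```

On the second run no step body executes, yet the workflow returns the same value. That is the property that lets the handler be written as ordinary sequential code.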

Key capabilities:

  • Workflows up to 1 year — Overcomes the 15-minute limit
  • Automatic retries — Handles failures with exponential backoff from the last successful step
  • No infrastructure — No Step Functions state machines, no queues, no additional services
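
For readers unfamiliar with exponential backoff, here is a minimal sketch of how such a retry schedule is usually computed. The base delay, the cap, and the `backoffDelayMs` name are assumptions for illustration; the SDK's actual retry parameters may differ.

```typescript
// Illustrative exponential backoff: the delay doubles per attempt, up to a cap.
// baseMs and maxMs are assumed values, not the SDK's defaults.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Delays for attempts 0 through 5
const schedule = [0, 1, 2, 3, 4, 5].map((attempt) => backoffDelayMs(attempt));
// schedule is [1000, 2000, 4000, 8000, 16000, 30000]
```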

The SDK (@aws/durable-execution-sdk-js) wraps your handler and manages the checkpointing transparently.

Architecture Overview

In my implementation, the Lambda function handles multiple workflow modes:

  • GENERATE_VIDEO — Image → Veo 3.1 → Video URL (with polling)
  • GENERATE_IMAGE — Prompt + Reference → S3 URL
  • GENERATE_CONTENT — Image → Content Strategy JSON
  • UPLOAD_IMAGE — Base64 → S3 (no durable steps needed)
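
One way to keep these modes honest in code is a string-literal union plus a small parser. This is a sketch rather than the project's actual code: `parseMode` and `needsDurableSteps` are hypothetical helpers, and it assumes every mode other than UPLOAD_IMAGE uses durable steps.

```typescript
// The four workflow modes listed above, as a typed constant
const WORKFLOW_MODES = [
  'GENERATE_VIDEO',
  'GENERATE_IMAGE',
  'GENERATE_CONTENT',
  'UPLOAD_IMAGE',
] as const;

type WorkflowMode = (typeof WORKFLOW_MODES)[number];

// Validate an incoming string before dispatching on it
function parseMode(value: string): WorkflowMode {
  const match = WORKFLOW_MODES.find((mode) => mode === value);
  if (!match) {
    throw new Error(`Unknown workflow mode: ${value}`);
  }
  return match;
}

// Assumption: only the plain upload skips checkpointing
function needsDurableSteps(mode: WorkflowMode): boolean {
  return mode !== 'UPLOAD_IMAGE';
}
```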

Pattern 1: Checkpointed Polling

Video generation is async. You start a job, get an operation ID, and poll until completion. Here's the pattern:

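import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

// Plain timer used between polls
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// startVideoGeneration and checkOperationStatus are the app's own
// Gemini client helpers (not shown here).
export const handler = withDurableExecution(
  async (event, context: DurableContext) => {
    // Assumption: the request payload carries the prompt and source image
    const { prompt, imageBase64 } = event;

    // Step 1: Start the operation and checkpoint its ID
    const operationId = await context.step('start-generation', async () => {
      return await startVideoGeneration(prompt, imageBase64);
    });

    // Steps 2-N: Poll with unique step names, up to 60 attempts
    let result: string | null = null;
    let pollCount = 0;

    while (!result && pollCount < 60) {
      pollCount++;

      const status = await context.step(`poll-${pollCount}`, async () => {
        return await checkOperationStatus(operationId);
      });

      if (status.done) {
        result = status.videoUrl;
        break;
      }

      // Checkpointed pause: a replay skips pauses that already completed.
      // Note: sleep() inside a step keeps the invocation running. The SDK's
      // wait operation is what suspends the function between invocations.
      await context.step(`wait-${pollCount}`, async () => {
        await sleep(10000);
        return true;
      });
    }

    return { videoUrl: result };
  }
);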

Each poll iteration has a unique step name (poll-1, poll-2, etc.). If Lambda times out mid-poll, it resumes from the last completed step on the next invocation.

Pattern 2: Large Payload Handling

Durable function checkpoints have a 256KB limit. Generated images easily exceed this. The solution: upload to S3 first, return the URL.

// Wrong — returns large base64, checkpoint fails
const imageData = await context.step('generate-image', async () => {
  return await generateImage(prompt); // ~500KB base64
});

// Correct — upload to S3, return small URL
const imageUrl = await context.step('generate-and-upload', async () => {
  const { imageBase64, mimeType } = await generateImage(prompt);
  const url = await uploadToS3(imageBase64, mimeType);
  return url; // ~200 characters
});

This pattern applies to any step that produces large outputs.
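
A cheap guard is to measure a step's serialized result before returning it. The sketch below uses the 256KB figure quoted above as its threshold; `serializedSizeBytes` and `fitsInCheckpoint` are hypothetical helpers, not SDK functions, and the URL is a made-up example value.

```typescript
// Limit quoted in this post; verify it against the current service quotas
const CHECKPOINT_LIMIT_BYTES = 256 * 1024;

// Size of a value once JSON-serialized and UTF-8 encoded
function serializedSizeBytes(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

function fitsInCheckpoint(value: unknown): boolean {
  return serializedSizeBytes(value) <= CHECKPOINT_LIMIT_BYTES;
}

// A ~500KB base64 string is too large; a short URL is fine
const fakeBase64 = 'A'.repeat(500 * 1024);
const exampleUrl = 'https://example-bucket.s3.amazonaws.com/images/abc123.png';

const base64Fits = fitsInCheckpoint({ imageBase64: fakeBase64 });
const urlFits = fitsInCheckpoint({ imageUrl: exampleUrl });
```

Here `base64Fits` is false and `urlFits` is true, which is the same contrast as the two versions of the step above.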

Pattern 3: Selective Durability

Not every operation needs checkpointing. Quick operations like generating presigned URLs or uploading base64 to S3 can run without durable steps:

if (mode === 'UPLOAD_IMAGE') {
  // No context.step() — runs directly, no checkpoint overhead
  return await handleUploadImage(body);
}

if (mode === 'GENERATE_VIDEO') {
  // Uses context.step() for long-running polling
  return await handleGenerateVideoWorkflow(body, context);
}

Reserve durable steps for operations that:

  • Take more than a few seconds
  • Involve external API calls that might fail
  • Need retry capability

CDK Infrastructure

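Deploying durable functions requires enabling the feature on the Lambda function:

import { Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

const workflow = new lambda.Function(this, 'Workflow', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('dist'), // assumed path to the bundled handler
  timeout: Duration.minutes(15),
  memorySize: 1024,
});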

// Enable durable execution
const cfnFunction = workflow.node.defaultChild as lambda.CfnFunction;
cfnFunction.addPropertyOverride('DurableConfig', {
  Enabled: true,
});

Invoking the function through a Lambda function URL (not shown in the snippet above) gives direct HTTPS access without API Gateway, which avoids the 29-second timeout entirely.

Lessons Learned

1. Unique step names are critical

Step names must be unique across the entire execution. Using poll-status for every poll iteration causes the SDK to return cached results instead of making new API calls.
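
The effect is easy to reproduce with a toy name-keyed cache. This sketch only mimics the behavior described above; the `step` function and the fake `checkStatus` API are hypothetical stand-ins, not the SDK.

```typescript
// Hypothetical name-keyed step cache, standing in for the SDK's checkpoints
const cache = new Map<string, unknown>();

function step<T>(name: string, fn: () => T): T {
  if (cache.has(name)) {
    return cache.get(name) as T; // cached result, fn is never called
  }
  const result = fn();
  cache.set(name, result);
  return result;
}

// Fake status API that reports completion on the third call
let calls = 0;
function checkStatus(): { done: boolean } {
  calls++;
  return { done: calls >= 3 };
}

// Reused name: only the first iteration reaches the API
const reused = [1, 2, 3].map(() => step('poll-status', checkStatus));
const callsWithReusedName = calls;

// Unique names: every iteration makes a fresh call
calls = 0;
const unique = [1, 2, 3].map((n) => step(`poll-${n}`, checkStatus));
const callsWithUniqueNames = calls;
```

With the reused name the loop sees `done: false` forever because it keeps reading the first result. With unique names the third poll observes completion.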

2. Step outputs are serialized

Everything returned from context.step() must be JSON-serializable. No functions, no circular references, no Buffers.
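
A quick way to see what survives is to push a value through `JSON.stringify` and `JSON.parse` yourself. The sketch below assumes the checkpoint format behaves like a plain JSON round trip; `roundTrip` and `canStringify` are hypothetical helpers, and a `Uint8Array` stands in for binary data such as a Buffer.

```typescript
// Simulate what a JSON-based checkpoint does to a step result
function roundTrip(value: unknown): unknown {
  return JSON.parse(JSON.stringify(value));
}

// Returns false for values JSON cannot encode at all (e.g. circular references)
function canStringify(value: unknown): boolean {
  try {
    JSON.stringify(value);
    return true;
  } catch {
    return false;
  }
}

// Function-valued properties are silently dropped
const withFunction = roundTrip({ id: 'op-1', onDone: () => 'x' });

// Dates come back as ISO strings, not Date objects
const withDate = roundTrip({ createdAt: new Date(0) }) as { createdAt: unknown };

// Typed arrays come back as plain objects keyed by index
const withBytes = roundTrip({ data: new Uint8Array([1, 2]) });

// Circular references throw during serialization
const node: { self?: unknown } = {};
node.self = node;
const circularOk = canStringify(node);
```

Converting binary data to a base64 string, or better, to an S3 URL as in Pattern 2, avoids these surprises.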

3. Cold starts add latency

Each resume is a new Lambda invocation. For workflows with many steps, cold starts accumulate. Consider provisioned concurrency for latency-sensitive workflows.

4. Debugging requires understanding replay

Console logs in completed steps won't appear on replay — the step returns its cached result immediately. Add logging outside steps or use CloudWatch to trace the full history.

When to Use Durable Functions vs Step Functions

| Use Case                 | Durable Functions | Step Functions |
| ------------------------ | ----------------- | -------------- |
| Simple linear workflows  | ✓                 |                |
| Complex branching logic  |                   | ✓              |
| Visual workflow designer |                   | ✓              |
| Code-first development   | ✓                 |                |
| Sub-second coordination  |                   | ✓              |
| Long waits (hours/days)  | ✓                 | ✓              |

Durable Functions shine when you want to write workflows as regular code without learning a new DSL or managing state machine definitions.

Results

With Lambda Durable Functions, the video generation pipeline:

  • Handles 30-90 second video generation without timeout issues
  • Automatically retries failed API calls from the last checkpoint
  • Scales to concurrent requests without queue management
  • Costs nothing while suspended (Lambda does not bill compute during a durable wait)

The entire backend is a single Lambda function. No Step Functions, no SQS, no orchestration infrastructure.
