Gunnar Grosch

Multi-Agent Systems on AWS Lambda with Durable Functions

In my previous post on multi-agent systems, I built a purchasing coordinator where a coordinator agent routes requests to specialist agents based on RISEN prompt contracts. RISEN structures system prompts into five components: Role, Instructions, Steps, Expectation, and Narrowing. In a multi-agent system, the Steps section encodes the routing logic and the Narrowing section prevents agents from doing each other's work. A laptop triggers Price Research and Delivery. A used car triggers all five specialists. The routing logic lives in the prompts, not in code. It works, but it runs in a single process. No fault isolation, no independent scaling, no durability. If the process crashes halfway through specialist consultations, you start over.

The durable functions post solved the durability problem for a support ticket triage workflow: checkpoint each step, suspend for human review, resume where you left off. But that was a single-agent workflow.

This post combines the two. The same purchasing coordinator, deployed to AWS Lambda, with each specialist as its own Lambda function. The coordinator is a durable function that checkpoints every specialist call. If it's interrupted after consulting three of five specialists, it resumes from the fourth. When a high-value purchase needs human approval, the function suspends, compute charges stop, and it picks up exactly where it left off when the approver responds.

The two SDKs have distinct roles. The Strands Agents SDK handles the AI reasoning: the planning agent decides which specialists to call, and the synthesis agent produces the recommendation. The durable execution SDK handles the infrastructure: checkpointing, parallel dispatch, suspension, and replay. The coordinator uses both. The specialists only use Strands.

The complete source code is on GitHub: github.com/gunnargrosch/durable-multi-agent-purchasing.

What Changes from the In-Process Demo

The multi-agent post used tool() callbacks for specialist dispatch: each specialist was defined as a Strands tool, and the coordinator agent called them as functions within the same process. That's the simplest possible architecture, and it's fine for development. Here's what changes when you deploy:

| | In-Process (previous post) | Lambda + Durable (this post) |
| --- | --- | --- |
| Specialist invocation | tool() callback, in-process | context.invoke(), separate Lambda function |
| Execution model | Sequential (one specialist at a time) | Parallel via context.parallel() |
| Fault tolerance | None. Process crash = restart | Checkpointed. Resume from last completed step |
| Scaling | Single process | Each specialist scales independently |
| Human-in-the-loop | Not supported | waitForCallback() with zero-cost suspension |
| IAM isolation | Shared process permissions | Per-specialist least-privilege policies |
| Routing visibility | Real-time hook as each tool is called | Checkpointed plan step with routing summary |
| Infrastructure | npm start | SAM template, 6 Lambda functions |

The RISEN prompts carry over with only minor adjustments. The coordinator's prompt splits into two phases (plan and synthesis) instead of a single invocation, because the durable function needs to checkpoint the plan before dispatching specialists. The specialist prompts are unchanged.

How It Works

If you haven't read the durable functions post, here's the key mental model: durable functions use checkpoint and replay. Your handler re-executes from the top on every resume, but completed steps return their cached results instantly without re-executing. New work picks up from where it left off. The SDK manages this transparently. You write sequential code and the infrastructure handles the rest.
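A toy model makes the mental model concrete. This is an illustrative in-memory sketch, not the SDK: real durable functions persist checkpoints outside the process, but the caching behavior on replay is the same.

```typescript
// Toy model of checkpoint-and-replay. A Map stands in for the SDK's
// persistent checkpoint store.
const checkpoints = new Map<string, unknown>()

async function step<T>(name: string, fn: () => Promise<T>): Promise<T> {
  if (checkpoints.has(name)) {
    return checkpoints.get(name) as T // replay: return cached result, skip the body
  }
  const result = await fn()           // first run: execute and checkpoint
  checkpoints.set(name, result)
  return result
}

async function handler(): Promise<string[]> {
  const executed: string[] = []
  await step('plan', async () => { executed.push('plan'); return 'plan-result' })
  await step('synthesize', async () => { executed.push('synthesize'); return 'synth-result' })
  return executed // which step bodies actually ran on this pass
}

const firstRun = await handler()  // both bodies execute: ['plan', 'synthesize']
const replayRun = await handler() // both cached, neither body executes: []
```

The handler re-runs from the top both times; on the second pass every step returns its cached result, so the function reaches new work without repeating old work.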

[CoordinatorFunction] — Lambda Durable Function (Sonnet, 1536MB)
  ├→ context.step('plan')            → Planning agent selects specialists
  ├→ context.parallel('specialists') → Runs selected specialists concurrently:
  │     ├─ context.invoke(PriceResearchFunction)   (Haiku)   ← always
  │     ├─ context.invoke(FinancingFunction)       (Haiku)   ← if value > $5K
  │     ├─ context.invoke(DeliveryFunction)        (Haiku)   ← if physical product
  │     ├─ context.invoke(RiskAssessmentFunction)  (Sonnet)  ← if value > $10K / used
  │     └─ context.invoke(ContractReviewFunction)  (Haiku)   ← if subscription/lease
  ├→ context.step('synthesize')      → Synthesis agent combines findings
  └→ context.waitForCallback()       → Human approval (when requireApproval=true)

Three things to notice:

  1. context.invoke() is a new primitive not covered in the durable functions post. It's the SDK's built-in method for calling other Lambda functions. It checkpoints the result automatically and suspends the coordinator while waiting, so you don't pay for compute during specialist execution. Despite the SDK's API reference describing it as invoking "another durable function," context.invoke() works with any Lambda function. The specialists here are standard functions without DurableConfig.
  2. context.step() wraps the planning and synthesis phases with retry strategies, just like the Bedrock calls in the support triage demo. Each step checkpoints its result. On replay, it returns the cached result without re-executing.
  3. context.parallel() wraps the specialist invocations so they run concurrently, each independently checkpointed. If specialist 3 of 5 fails, the other four results are preserved.

The SAM Template

Here are the key parts of the coordinator's definition in the SAM template. The template defines 6 functions total: the coordinator plus 5 specialists that follow the same pattern.

CoordinatorFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: coordinator.handler
    MemorySize: 1536
    Timeout: 300
    Environment:
      Variables:
        COORDINATOR_MODEL_ID: !Sub global.${CoordinatorModelId}
        PRICE_RESEARCH_FUNCTION: !Ref PriceResearchFunction
        # ... remaining specialist function references
    Policies:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy
      - Statement:
          - Effect: Allow
            Action:
              - bedrock:InvokeModel
              - bedrock:InvokeModelWithResponseStream
            Resource: # Bedrock model + inference profile ARNs
          - Effect: Allow
            Action: lambda:InvokeFunction
            Resource:
              - !GetAtt PriceResearchFunction.Arn
              - !GetAtt FinancingFunction.Arn
              - !GetAtt DeliveryFunction.Arn
              - !GetAtt RiskAssessmentFunction.Arn
              - !GetAtt ContractReviewFunction.Arn
    DurableConfig:
      ExecutionTimeout: 86400
      RetentionPeriodInDays: 7
    AutoPublishAlias: live

A few things to note:

  • No API Gateway, no function URLs, no public endpoints. The multi-agent post previewed two deployment options: Lambda with HTTP endpoints, or AgentCore containers. This implementation goes a simpler route. context.invoke() calls specialists directly via the Lambda API. The coordinator's IAM policy grants lambda:InvokeFunction on each specialist ARN. Specialists are unreachable from outside the coordinator's execution role.
  • Shared handler, different prompts. All five specialists use the same specialist.ts handler. The PROMPT_NAME environment variable selects which RISEN prompt to load. This keeps the template repetitive but predictable: each specialist block differs only in its name, PROMPT_NAME, and model ID.
  • Per-specialist model selection. Risk Assessment gets AdvancedSpecialistModelId (Sonnet) for stronger reasoning. The other four get SpecialistModelId (Haiku). Same pattern as the in-process demo, now enforced at the infrastructure level.
  • DurableConfig with 24-hour execution timeout. The used car scenario with human approval can sit overnight. Each individual invocation is bounded by Timeout: 300 (the coordinator's per-replay limit). ExecutionTimeout: 86400 is the outer wall-clock limit across all replays and suspensions.
  • InvokeModelWithResponseStream is included because the Strands SDK's BedrockModel may use streaming internally for token generation. Without it, the coordinator would get AccessDeniedException on agent invocations.
  • ESM build. The coordinator uses ESM (Format: esm) with esbuild. The full template includes a Banner that injects a createRequire shim because some dependencies expect CommonJS require. This is a known pattern for ESM Lambda functions with mixed dependencies. See the repo's template.yaml for the complete Metadata block.
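One piece the template implies but doesn't show: the coordinator needs a registry mapping specialist names to the function names carried in those environment variables. A sketch of what that registry might look like, where every variable name beyond PRICE_RESEARCH_FUNCTION is an assumption following the same pattern:

```typescript
// Hypothetical sketch of the SPECIALISTS registry the handler references.
// Only PRICE_RESEARCH_FUNCTION appears in the template excerpt above; the
// other environment variable names are assumed to follow the same pattern.
const SPECIALISTS: Record<string, { functionName: string; display: string }> = {
  'price-research':  { functionName: process.env.PRICE_RESEARCH_FUNCTION  ?? '', display: 'Price Research' },
  'financing':       { functionName: process.env.FINANCING_FUNCTION       ?? '', display: 'Financing' },
  'delivery':        { functionName: process.env.DELIVERY_FUNCTION        ?? '', display: 'Delivery' },
  'risk-assessment': { functionName: process.env.RISK_ASSESSMENT_FUNCTION ?? '', display: 'Risk Assessment' },
  'contract-review': { functionName: process.env.CONTRACT_REVIEW_FUNCTION ?? '', display: 'Contract Review' },
}
```

Because SAM resolves !Ref PriceResearchFunction to the generated function name at deploy time, the coordinator's code never hardcodes a function name.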

The Coordinator Handler

The coordinator is the only durable function. Here's the handler skeleton showing the four durable phases. The full source includes types, the specialist registry, retry strategies, and routing summary logic.

export const handler = withDurableExecution(async (
  event: CoordinatorEvent,
  context: DurableContext
): Promise<Record<string, unknown>> => {

  // Phase 1: Plan — agent decides which specialists to consult
  const plan = await context.step<AnalysisPlan>('plan', async () => {
    let capturedPlan: AnalysisPlan | null = null
    const planTool = tool({
      name: 'create_analysis_plan',
      inputSchema: z.object({
        specialists: z.array(z.object({
          name: z.enum(['price-research', 'financing', 'delivery',
                        'risk-assessment', 'contract-review']),
          prompt: z.string(),
        })),
      }),
      callback: async (input) => { capturedPlan = input; return 'Plan created.' },
    })
    const agent = new Agent({ model, systemPrompt: planPrompt, tools: [planTool], printer: false })
    await agent.invoke(event.request)
    if (!capturedPlan) throw new Error('Planning agent did not call create_analysis_plan')
    return capturedPlan
  }, { retryStrategy: bedrockRetry })

  // Phase 2: Consult specialists in parallel via context.invoke()
  const results = await context.parallel<{ name: string; response: string }>('specialists',
    plan.specialists.map((spec) => ({
      name: spec.name,
      func: async (ctx) => {
        try {
          const specialist = SPECIALISTS[spec.name]
          if (!specialist) {
            throw new Error(`Unknown specialist: ${spec.name}`)
          }
          const result = await ctx.invoke<{ prompt: string }, { response: string }>(
            spec.name, specialist.functionName, { prompt: spec.prompt }
          )
          return { name: spec.name, response: result.response }
        } catch (err) {
          const display = SPECIALISTS[spec.name]?.display ?? spec.name
          return { name: spec.name, response: `[${display} unavailable: ${err instanceof Error ? err.message : 'unknown error'}]` }
        }
      },
    }))
  )

  // Phase 3: Synthesize specialist findings into a recommendation
  const response = await context.step('synthesize', async () => {
    // Build findings from checkpointed parallel results, invoke synthesis agent
    // (see full source for details)
  }, { retryStrategy: bedrockRetry })

  // Phase 4: Human approval (optional)
  if (event.requireApproval) {
    const approval = await context.waitForCallback<ApprovalPayload>(
      'approval',
      async (callbackId) => { context.logger.info('approval_callback_created', { callbackId }) },
      { timeout: { hours: 24 }, serdes: defaultSerdes }
    )
    if (!approval.approved) return { status: 'rejected', recommendation: response }
  }

  return { status: event.requireApproval ? 'approved' : 'completed', recommendation: response }
})

Let's walk through what's happening in each phase.

Phase 1: Plan

The planning phase is the biggest change from the in-process demo. In the previous post, the coordinator was a single agent that both decided which specialists to call and synthesized their findings. Here, planning and synthesis are separate agents wrapped in separate context.step() calls.

Why split them? Checkpointing. In the in-process demo, if the coordinator crashes after calling three specialists, you lose the routing decision and all three results. With durable functions, the plan is checkpointed as a single unit. If the function replays after the plan step, it returns the cached plan instantly without calling Bedrock again.

The planning agent uses tool() with a Zod schema to produce structured output. The create_analysis_plan tool captures the plan into a closure variable, and the step returns it as its checkpointed result. If the agent doesn't call the tool, the step throws an error. The retry strategy will attempt it three times, but this is a non-transient failure: if the model didn't call the tool on the first attempt, retries won't help. After retries are exhausted, the execution fails.
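The transient versus non-transient distinction is worth pinning down. A generic retry loop (illustrative only, not the SDK's retryStrategy API) shows why: a deterministic failure burns every attempt and still fails, adding only latency and token cost.

```typescript
// Illustrative retry loop, not the SDK's retryStrategy API. A failure
// that is deterministic ("the model never calls the tool") fails on
// every attempt, so retries cannot save the execution.
async function withRetries<T>(attempts: number, fn: () => Promise<T>): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err // a transient error might succeed on a later attempt
    }
  }
  throw lastError     // a non-transient error exhausts all attempts
}

let calls = 0
const nonTransient = () => {
  calls++
  return Promise.reject(new Error('agent did not call the tool'))
}
const outcome = await withRetries(3, nonTransient).catch((e: Error) => e.message)
// calls === 3, outcome === 'agent did not call the tool'
```

Retries still earn their keep for throttling and timeout errors from Bedrock, which is what the bedrockRetry strategy in the demo is there for.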

Phase 2: Parallel specialists

This is where context.invoke() replaces tool() callbacks. Each specialist invocation is a branch inside context.parallel():

const result = await ctx.invoke<{ prompt: string }, { response: string }>(
  spec.name,             // step name for checkpoint history
  specialist.functionName, // Lambda function name from environment
  { prompt: spec.prompt }  // payload
)

context.invoke() calls the specialist Lambda function directly, checkpoints the result, and suspends the coordinator while waiting. You don't pay for coordinator compute while specialists are executing. On replay, completed invocations return their cached results without re-invoking the specialist.

Each branch has a try/catch for graceful degradation. If the Delivery specialist times out, the coordinator still gets results from Price Research, Financing, Risk Assessment, and Contract Review. The synthesis agent notes the gap and advises the buyer to investigate delivery independently. This is a meaningful improvement over the in-process demo, where a failed specialist tool call could derail the entire coordinator.

Phase 3: Synthesize

The synthesis agent receives all specialist findings and produces a structured recommendation. Like the plan step, the entire synthesis is one checkpointed unit. The synthesis prompt's Narrowing section prevents it from overriding specialist findings or filling in gaps from failed specialists.

Phase 4: Human approval

When requireApproval is true (the used car scenario sets this), the function suspends at waitForCallback. Compute charges stop. The approver reviews the recommendation and sends a callback via the Lambda API or the demo's interactive prompt. The function resumes and returns the final status.

Note the serdes: defaultSerdes option on the callback. As covered in the durable functions post's gotchas, waitForCallback defaults to passthrough serialization (not JSON.parse). Without defaultSerdes, approval.approved would be undefined at runtime even though TypeScript thinks it's a boolean.
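The failure mode is easy to reproduce outside the SDK. If a JSON payload is passed through as a raw string, property access still type-checks but silently returns undefined at runtime:

```typescript
// What passthrough serialization does to a JSON callback payload: the
// handler receives the raw string, TypeScript trusts the declared type,
// and property access silently yields undefined.
interface ApprovalPayload { approved: boolean }

const rawCallbackBody = '{"approved": true}'

// Passthrough: the string is cast to the payload type without parsing.
const passthrough = rawCallbackBody as unknown as ApprovalPayload
const passthroughApproved = passthrough.approved // undefined at runtime

// JSON serdes: parse first, and the field is really there.
const parsed = JSON.parse(rawCallbackBody) as ApprovalPayload
const parsedApproved = parsed.approved           // true
```

Because `!approval.approved` is truthy for undefined, the passthrough bug would silently reject every approval rather than crash, which makes it particularly easy to miss.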

The Specialist Handler

All five specialists share one handler. The PROMPT_NAME environment variable selects the behavior:

const promptName = process.env.PROMPT_NAME!
const systemPrompt = loadPrompt(promptName)
const model = createModel(SPECIALIST_MODEL)

export const handler = async (
  event: SpecialistEvent,
  context: Context
): Promise<{ response: string }> => {
  const logger = makeLogger({ handler: promptName, requestId: context.awsRequestId })
  const tools = loadSpecialistTools(promptName, logger)
  const agent = new Agent({ model, systemPrompt, tools, printer: false })
  const result = await agent.invoke(event.prompt)
  return { response: result.toString() }
}

Specialists are standard Lambda functions, not durable. They don't need checkpointing because each one completes in a single invocation (a Bedrock call and response processing). The coordinator's context.invoke() handles the durability: if a specialist invocation times out, the coordinator can retry from the checkpoint without re-running specialists that already succeeded.

loadSpecialistTools returns specialist-specific tools (see src/lib/specialist-tools.ts). Most specialists are pure reasoning (no tools). The Price Research specialist has a save_price_snapshot tool that logs structured price data. In production, that tool could write to DynamoDB or call a pricing API. The coordinator never sees these tools. They're scoped to each specialist's domain.

The Prompt Split

The in-process demo had one coordinator prompt with routing in the Steps section. The durable version splits this into two prompts:

coordinator-plan handles routing. Here's the Steps and Expectation sections:

# Steps
1. Read the purchase request and identify what is being purchased, its likely category
   (vehicle, electronics, real estate, software, etc.), the approximate value range,
   and any special circumstances mentioned or implied (used/secondhand, financing needed,
   physical delivery, contract or subscription, high value).
2. Select specialists using these routing rules:
   - Always include price-research to compare options and assess market value.
   - Include financing if the estimated value exceeds $5,000, or if financing or a loan
     is mentioned or implied.
   - Include delivery if the item is a tangible physical product that must be physically
     received — electronics, appliances, vehicles, furniture, machinery. Vehicles always
     need delivery planning (transport, pickup, or test drive logistics). Do not include
     for purely digital purchases (software, SaaS subscriptions, downloadable content).
   - Include contract-review if the purchase involves a subscription, lease, warranty
     agreement, service contract, purchase agreement, or any multi-year financial
     commitment. Vehicle and real estate purchases always involve contracts.
   - Include risk-assessment if the estimated value exceeds $10,000, the item is used or
     secondhand, or the category carries known risk (vehicles, real estate, machinery).
     Do not include for new electronics, appliances, or standard retail under $10,000.
3. For each selected specialist, write a focused prompt describing what to analyze about
   this specific purchase. Include relevant details from the request (item, price,
   condition, location, urgency). The specialist's own system prompt defines its role —
   do not repeat the role description in your prompt.
4. Call create_analysis_plan exactly once with the selected specialists and their prompts.

# Expectation
A single call to create_analysis_plan containing:
- An array of specialists, each with a name (from the allowed set) and a prompt string.
- Only specialists whose routing criteria are met.
- Prompts that are specific to this purchase, not generic templates.

These are the same routing rules from the original coordinator prompt in the multi-agent post. The difference: instead of calling specialist tools directly (Steps 2-6 in the original said "invoke the research_prices tool," "invoke the evaluate_financing tool"), the plan agent calls create_analysis_plan once with all selected specialists. The actual dispatch happens via context.invoke() in Phase 2.

coordinator-synthesis handles the final recommendation. It receives specialist findings and produces the buyer-facing output. Its Narrowing prevents it from contradicting specialists or filling in missing analysis.

This split means the coordinator makes two agent invocations (plan + synthesize) instead of one. Each agent invocation may involve multiple Bedrock round-trips internally as the Strands SDK handles reasoning. The trade-off is worth it: the plan is checkpointed before any specialist is called, and the synthesis is checkpointed after all specialists complete. On replay, both return cached results instantly.

Testing with the Local Runner

The durable functions post showed LocalDurableTestRunner for the support triage workflow. The multi-agent demo adds a new pattern: registerFunction for mocking context.invoke() targets.

// Register mock handlers for each specialist Lambda function
const runner = new LocalDurableTestRunner({ handlerFunction: handler });
runner
  .registerFunction("PriceResearchFunction", specialistHandler)
  .registerFunction("FinancingFunction", specialistHandler)
  .registerFunction("DeliveryFunction", specialistHandler)
  .registerFunction("RiskAssessmentFunction", specialistHandler)
  .registerFunction("ContractReviewFunction", specialistHandler);

// Run the full workflow and verify the result
const output = await runner.run({ payload: laptopPayload });
expect(output.getStatus()).toBe("SUCCEEDED");
expect(output.getResult().routing.called).toContain("Price Research");

registerFunction maps function names to local handlers. When the coordinator's context.invoke() call targets PriceResearchFunction, the test runner routes it to the registered mock instead of invoking a deployed Lambda function. This lets you test the full checkpoint/replay lifecycle without deploying or calling Bedrock. The full test suite also covers callback suspension and resumption using runner.getOperation() and sendCallbackSuccess(), the same pattern from the durable functions post.

The test suite covers seven scenarios: standard flow, all-5-specialists flow, approval, rejection, specialist failure with graceful degradation, callback failure, and planning agent failure.

Try the Demo

The repo includes an interactive demo with three purchase scenarios:

| Scenario | Value | Specialists | Approval |
| --- | --- | --- | --- |
| Laptop | $1,500 | Price Research + Delivery | No |
| Used car | $18,000 | All 5 specialists | Yes (waitForCallback) |
| SaaS subscription | $200/mo | Price Research + Contract Review | Yes (waitForCallback) |

You'll need:

  • Node.js 24+ and npm
  • For cloud mode: AWS SAM CLI 1.153.1+ and Bedrock access to Claude Sonnet 4.6 and Claude Haiku 4.5

Local mode (no AWS credentials needed)

git clone https://github.com/gunnargrosch/durable-multi-agent-purchasing.git
cd durable-multi-agent-purchasing
npm install
npm run demo:local -- --ticket=used-car

Local mode uses mocked Bedrock responses. The used car scenario exercises all four durable primitives: step (plan and synthesis), invoke (specialist calls), parallel (concurrent dispatch), and waitForCallback (human approval where you play the purchase approver).

Cloud mode (real Bedrock responses)

sam build
sam deploy --guided
npm run demo:cloud -- --ticket=used-car --region=us-east-1

Cloud mode invokes the deployed coordinator with real Bedrock calls. You'll see actual AI-generated specialist analyses and a synthesized recommendation. The demo polls execution history and prompts you when the approval callback is created.

The demo uses direct aws lambda invoke with --invocation-type Event for simplicity. In production, the coordinator would typically sit behind an upstream service: an API Gateway endpoint receiving purchase requests, an EventBridge rule triggered by order events, or an SQS queue processing a backlog. The coordinator itself doesn't care how it's invoked. It receives the event payload and the durable SDK handles the rest.

Inspecting execution history

After a cloud run, you can inspect the execution in the Lambda console under the Durable executions tab, or via the CLI:

aws lambda list-durable-executions-by-function \
  --function-name durable-multi-agent-purchasing-CoordinatorFunction

aws lambda get-durable-execution \
  --durable-execution-arn "<arn-from-list>"

The execution history shows each step's status and timing: when the plan completed, how long each specialist took, and whether the callback is pending or resolved.

Design Decisions

context.invoke() over HTTP

The multi-agent post previewed two deployment approaches: Lambda with API Gateway/Function URLs (HTTP), or AgentCore containers. This demo takes a third path: context.invoke() for direct Lambda-to-Lambda invocation.

The result is simpler than either preview option. No API Gateway resources, no function URLs, no SigV4 signing, no HTTP client configuration. The coordinator calls specialists via the Lambda API, and the durable SDK handles checkpointing and retry. Specialists are unreachable from outside the coordinator's execution role, which is better isolation than an HTTP endpoint with IAM auth.

The trade-off: specialists are only callable from within a durable function. If you later need specialists accessible from other services (a REST API, a Step Functions state machine, another team's coordinator), you'd need to add API Gateway or function URLs at that point. For this use case, where one coordinator owns all specialist dispatch, direct invocation is the right call.

Splitting plan from synthesis

The in-process coordinator was a single agent invocation: read the request, call specialist tools, synthesize findings. With durable functions, that becomes two separate agents wrapped in separate context.step() calls.

This introduces an extra Bedrock call, but the checkpointing benefits are significant. The plan is preserved before any specialist runs. If the function replays after three of five specialists complete, the plan doesn't need to be regenerated. The synthesis is preserved after all specialists complete. If the function replays during the approval callback, the recommendation doesn't need to be regenerated.

The alternative would be wrapping the entire coordinator in a single step. That would mean one Bedrock conversation (cheaper) but no intermediate checkpoints. A failure during synthesis would replay from the beginning, including all specialist invocations. With the split, a synthesis failure only retries the synthesis.

Planning agent with tool capture

The planning agent uses tool() with a Zod schema to produce structured output. This is the same pattern from the in-process demo, but used differently. In the previous post, tools were the dispatch mechanism (calling tools = calling specialists). Here, the tool is purely for structured output capture. The plan step returns the captured plan as its checkpointed result, and context.invoke() handles the actual specialist dispatch.

Why not just have the planning agent return JSON directly? Tool calling with a schema gives you validation at the SDK level. If the agent returns a plan with an invalid specialist name, Zod catches it before the plan is checkpointed. Without the tool, you'd parse and validate the JSON yourself, and an invalid plan could be checkpointed and break on every subsequent replay.
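For contrast, here is roughly what the hand-rolled alternative looks like: parse the agent's JSON output yourself and reject invalid specialist names before anything reaches a checkpoint. The helper is illustrative; the demo gets this for free from the Zod schema.

```typescript
// Hand-rolled validation of a plan returned as raw JSON, i.e. the work
// the Zod tool schema otherwise does. An invalid specialist name must be
// rejected here, before the plan is checkpointed and replayed forever.
const ALLOWED_SPECIALISTS = new Set([
  'price-research', 'financing', 'delivery', 'risk-assessment', 'contract-review',
])

interface PlanSpec { name: string; prompt: string }

function parsePlan(json: string): { specialists: PlanSpec[] } {
  const candidate = JSON.parse(json)
  if (!Array.isArray(candidate?.specialists)) {
    throw new Error('plan must contain a specialists array')
  }
  for (const spec of candidate.specialists) {
    if (!ALLOWED_SPECIALISTS.has(spec?.name)) {
      throw new Error(`Unknown specialist: ${spec?.name}`)
    }
    if (typeof spec?.prompt !== 'string') {
      throw new Error('each specialist needs a prompt string')
    }
  }
  return candidate
}
```

Every branch of this validation is a bug waiting to be missed; the tool-with-schema approach moves the same checks into a declarative definition the SDK enforces.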

Graceful degradation in parallel

Each specialist branch in context.parallel() has a try/catch that returns an error message instead of throwing:

catch (err) {
  return { name: spec.name, response: `[unavailable: ${err instanceof Error ? err.message : 'unknown error'}]` }
}

This means a failed specialist doesn't fail the entire parallel block. The synthesis agent receives partial results and notes the gap. Without the catch, a failed specialist would fail its branch. Whether that fails the entire parallel block depends on the completion config, but either way, synthesis would never run with partial results. With the catch, every branch succeeds (some with error messages), the parallel block always completes, and the synthesis agent can work with whatever it has.
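The branch behavior can be reproduced with plain Promise.all as a stand-in for context.parallel (the real primitive adds checkpointing on top, but the error-handling shape is the same):

```typescript
// Per-branch try/catch with Promise.all standing in for context.parallel:
// a throwing branch is converted into a sentinel result, so the combined
// promise always resolves and synthesis can run on partial findings.
const branches: Record<string, () => Promise<string>> = {
  'price-research': async () => 'Fair market range is $15,500 to $17,000.',
  'delivery': async () => { throw new Error('timed out') },
}

const results = await Promise.all(
  Object.entries(branches).map(async ([name, fn]) => {
    try {
      return { name, response: await fn() }
    } catch (err) {
      return {
        name,
        response: `[${name} unavailable: ${err instanceof Error ? err.message : 'unknown error'}]`,
      }
    }
  })
)
// results[1].response === '[delivery unavailable: timed out]'
```

Without the inner catch, the delivery rejection would propagate and the price-research result would be unreachable even though that branch succeeded.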

Cost

The main cost is Bedrock token usage, not Lambda compute.

| Scenario | Specialists | Agent invocations | Estimated cost |
| --- | --- | --- | --- |
| Laptop | 2 | ~4 (plan + 2 specialists + synthesize) | ~$0.02-0.05 |
| Used car | 5 | ~7 (plan + 5 specialists + synthesize) | ~$0.05-0.15 |

The coordinator uses Sonnet (~$3/$15 per million input/output tokens). Most specialists use Haiku (~$1/$5 per million input/output tokens). Risk Assessment uses Sonnet. Lambda compute for the active execution periods (plan, specialist dispatch, synthesis, replay) totals ~$0.0001. During the waitForCallback suspension, compute charges are zero.
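Those estimates follow from the per-token prices. A back-of-the-envelope calculation for the used car scenario, where the ~2K input and ~1K output tokens per agent invocation are assumptions (real usage varies with prompt and response length):

```typescript
// Back-of-the-envelope Bedrock cost for the used-car scenario. Token
// counts per agent invocation are assumptions; prices are per token.
const SONNET = { input: 3 / 1_000_000, output: 15 / 1_000_000 } // ~$3 / $15 per M tokens
const HAIKU  = { input: 1 / 1_000_000, output: 5 / 1_000_000 }  // ~$1 / $5 per M tokens

const perInvocation = (price: typeof SONNET, inTokens = 2_000, outTokens = 1_000) =>
  inTokens * price.input + outTokens * price.output

// plan + synthesis (coordinator, Sonnet) + risk-assessment (Sonnet),
// plus price-research, financing, delivery, contract-review (Haiku)
const usedCarCost = 3 * perInvocation(SONNET) + 4 * perInvocation(HAIKU)
// ≈ $0.09, inside the ~$0.05-0.15 estimate above
```

Multi-turn reasoning inside each agent invocation pushes the real number toward the top of the range, which is why the table gives a spread rather than a point estimate.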

Things to Watch For

Checkpoint size. Each step result is serialized and stored as a checkpoint. The durable functions post covered the 256KB limit per checkpoint. With 5 specialists returning Bedrock responses, the parallel result could get large if the model is verbose. Monitor response sizes. If you hit the limit, truncate or summarize specialist responses before returning them from the branch, or store full responses in S3 and return a reference.
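A minimal truncation guard might look like this. The 8,000-character budget is an arbitrary illustration, chosen only to keep five branch results comfortably under the limit:

```typescript
// Truncate a specialist response before it becomes part of the parallel
// checkpoint. The 8,000-character budget is an illustrative assumption;
// size it so all branch results together stay well under 256KB.
const MAX_RESPONSE_CHARS = 8_000

function truncateForCheckpoint(response: string, max = MAX_RESPONSE_CHARS): string {
  if (response.length <= max) return response
  const omitted = response.length - max
  return `${response.slice(0, max)}\n[truncated ${omitted} characters]`
}
```

The S3 variant is the better fit when the synthesis agent genuinely needs the full text: checkpoint only the object key and fetch the body inside the synthesis step.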

Debugging failures. If the plan step fails after exhausting retries, or a specialist consistently times out, the execution moves to FAILED status. Use get-durable-execution to see which step failed and the error message. The coordinator uses context.logger (replay-aware), so CloudWatch Logs show each phase's progress without duplicate lines from replays. Specialist failures are easier to debug since each specialist has its own log group (sam logs --name PriceResearchFunction --tail).

Replay safety. Durable functions re-execute your handler from the top on every resume. Completed context.step() and context.invoke() calls return cached results, but any code outside those primitives runs again on every replay. If you add a side effect (writing to DynamoDB, sending a notification, calling an external API), wrap it in a context.step() so it executes exactly once. The coordinator's comment on the save-recommendation step shows this pattern. Without the step wrapper, you'd send duplicate notifications every time the function replays.

Cold starts. The used car scenario invokes 5 specialist Lambda functions in parallel. If all 5 are cold, that's 5 concurrent cold starts: arm64 Node.js 24, the Strands SDK, and the Bedrock client. This can add several seconds to the first specialist round. The in-process demo had no cold start penalty for specialists since everything ran in one process. In practice, the cold start overhead is small relative to the Bedrock inference time that follows. For workflows that already include multi-second model calls and human approval, a few extra seconds on the first invocation is rarely the bottleneck.

Wrapping Up

This post showed how to take a multi-agent system from a single-process demo to a deployed, fault-tolerant system on Lambda with durable functions. The RISEN prompts carry over with minimal changes. The architectural shift is from in-process tool calls to checkpointed Lambda-to-Lambda invocations with independent scaling, failure isolation, and human-in-the-loop approval.

What multi-agent workflow would you deploy with durable functions? Let me know in the comments!