Michael Uanikehi for AWS Community Builders

Posted on Mar 21

Measuring What Matters: Rethinking Serverless Workflows with AWS Lambda Durable Functions

#aws #serverless #stepfunctions #lambda

Most serverless workflows don’t fail because they can’t scale.

They fail because when something goes wrong, engineers can’t easily answer:
• Where did this workflow break?
• What state was it in?
• What happened before the failure?

This is where “measuring what matters” becomes important.

Not more metrics.
Not more dashboards.
But better ways to understand system behaviour.

Recently, I explored AWS Lambda Durable Functions, and it exposed something interesting:

The way we structure workflows directly affects how well we can observe and debug them.

The Problem: Orchestration vs Understanding

If you’ve built workflows using AWS Step Functions, you already know the benefits:
• Clear state transitions
• Visual workflows
• Strong integration with AWS services

But in practice, there’s a trade-off: workflow logic lives outside your application code.

That means:
• You switch between code and state machine definitions
• Debugging often requires jumping across tools
• Context is split across logs, states, and services

This works well for orchestration.

But it doesn’t always optimise for debugging and reasoning under pressure.

What Durable Functions Change

AWS Lambda Durable Functions take a different approach.

Instead of defining workflows externally, you write them directly in code.

The biggest shift is state management. Durable Functions are regular Lambda functions enhanced with stateful execution capabilities. They automatically checkpoint progress, suspend execution for up to one year during long-running tasks, and recover from failures.

Here’s a simplified example:

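from aws_lambda_powertools import Logger

logger = Logger()

# Placeholder steps so the example runs; real versions would call your
# own services. The durable SDK wiring (checkpointed steps and waits)
# is left out to keep the shape of the workflow visible.
def validate_order(order_id): ...
def wait_for_payment(order_id): ...
def ship_order(order_id): ...

def order_workflow(event):
    order_id = event["order_id"]

    logger.info(f"Processing order {order_id}")

    # Step 1: Validate order
    validate_order(order_id)

    # Step 2: Wait for payment confirmation
    wait_for_payment(order_id)

    # Step 3: Process shipment
    ship_order(order_id)

    return {"status": "completed"}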

Now imagine this workflow:
• pauses after wait_for_payment()
• resumes hours later when payment is confirmed
• continues with full context preserved
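
To make the pause-and-resume idea concrete, here is a toy model of checkpoint and replay in plain Python. It is not the AWS SDK and does not reflect how the service is implemented: ToyCheckpointStore, run_step, and Suspend are hypothetical names invented for this sketch, and the "durable" state is just an in-memory dict.

```python
class Suspend(Exception):
    """Raised by a step to signal that the workflow must pause (toy model)."""


class ToyCheckpointStore:
    """In-memory stand-in for durable state; a real service persists this."""

    def __init__(self):
        self.results = {}


def run_step(store, name, fn):
    # On replay, a step that already finished returns its saved result
    # instead of executing again.
    if name in store.results:
        return store.results[name]
    result = fn()
    store.results[name] = result
    return result


calls = []  # records which step bodies actually executed
payment_confirmed = {"value": False}


def validate():
    calls.append("validate")
    return "valid"


def wait_for_payment():
    calls.append("wait_for_payment")
    if not payment_confirmed["value"]:
        raise Suspend()
    return "paid"


def ship():
    calls.append("ship")
    return "shipped"


def order_workflow(store):
    run_step(store, "validate", validate)
    run_step(store, "payment", wait_for_payment)
    run_step(store, "ship", ship)
    return {"status": "completed"}


store = ToyCheckpointStore()

# First invocation: runs until the payment step suspends.
first_run_calls = []
try:
    order_workflow(store)
except Suspend:
    first_run_calls = list(calls)

# Later invocation: payment has arrived, so the workflow is replayed.
payment_confirmed["value"] = True
result = order_workflow(store)
```

On the second invocation the validate step is skipped because its result was checkpointed, which is why the workflow can read like ordinary top-to-bottom code even though it ran across two separate invocations.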

Why This Matters for Observability

This isn’t just about developer experience. It changes how you instrument and observe workflows.

With traditional orchestration, you piece the picture together from:
• Step Function execution graphs
• Distributed logs
• External state tracking

With Durable Functions, you can:
• Log at each logical step
• Track state transitions in code
• Correlate execution paths more naturally

Example:

logger.info({
    "step": "payment_wait",
    "order_id": order_id,
    "status": "pending"
})

Now your logs reflect business flow, not just system events.
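
If you want every step to log the same way, one option is a small helper that wraps each step. The sketch below uses the standard library's logging module instead of Powertools so it runs anywhere; step_event and logged_step are hypothetical helpers made up for this post, not part of any AWS library.

```python
import json
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("order_workflow")


def step_event(step, order_id, status, **extra):
    """Build the structured payload for one step transition."""
    return {"step": step, "order_id": order_id, "status": status, **extra}


@contextmanager
def logged_step(step, order_id):
    """Log a JSON line when a step starts, and another when it ends or fails."""
    start = time.monotonic()
    log.info(json.dumps(step_event(step, order_id, "started")))
    try:
        yield
    except Exception as exc:
        log.error(json.dumps(step_event(step, order_id, "failed", error=str(exc))))
        raise
    duration_ms = round((time.monotonic() - start) * 1000)
    log.info(json.dumps(step_event(step, order_id, "completed", duration_ms=duration_ms)))


with logged_step("validate_order", "order-123"):
    pass  # the real step body would go here
```

Because every line carries the same step and order_id keys, you can filter one order's full journey with a single log query.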

Measuring What Actually Matters

In real systems, useful signals are not:
• “Lambda ran successfully”
• “Step transitioned”

Useful signals are:
• “Order is waiting on payment”
• “Workflow resumed after 2 hours”
• “Shipment failed after approval”

Durable Functions make it easier to express these signals because your workflow structure matches your mental model.

That alignment reduces the gap between:
• what the system is doing
• and what you think it’s doing
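
A signal like “Workflow resumed after 2 hours” can be derived from two timestamps. The sketch below is one way to do it; describe_resume and its field names are hypothetical. It assumes both timestamps were recorded inside checkpointed steps, since a value computed outside a step could change when the code is replayed.

```python
from datetime import datetime, timezone


def describe_resume(order_id, suspended_at, resumed_at):
    """Turn two timestamps into a business-level signal about the wait."""
    waited = resumed_at - suspended_at
    hours = waited.total_seconds() / 3600
    return {
        "signal": "workflow_resumed",
        "order_id": order_id,
        "waited_seconds": int(waited.total_seconds()),
        "message": f"Workflow resumed after {hours:.1f} hours",
    }


# Example values only; in a real workflow these come from checkpointed state.
suspended = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
resumed = datetime(2025, 1, 1, 11, 0, tzinfo=timezone.utc)
event = describe_resume("order-123", suspended, resumed)
```

Emitting waited_seconds as a number, not only as text, means you can later alert on orders that wait longer than expected.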

Durable Functions vs Step Functions (Practical View)

Use Step Functions when you need:
• Service orchestration across AWS (Lambda, ECS, Glue)
• Visual workflows for operations teams
• Built-in execution tracing

Use Durable Functions when you need:
• Workflow logic tightly coupled with application code
• Faster iteration and local testing
• Simpler debugging of business logic
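
As a rough illustration of the local testing point, here is a sketch where the steps are passed in as plain callables, so the business logic can be checked without deploying anything. This injection style is my own assumption for the example, not something the durable runtime requires, and it only tests the ordering logic, not checkpointing or suspension.

```python
def order_workflow(order_id, validate, wait_for_payment, ship):
    """Same three-step flow as before, with the steps injected."""
    validate(order_id)
    wait_for_payment(order_id)
    ship(order_id)
    return {"status": "completed", "order_id": order_id}


def run_with_fakes(order_id):
    """Run the workflow with fake steps that only record what was called."""
    seen = []
    result = order_workflow(
        order_id,
        validate=lambda oid: seen.append(("validate", oid)),
        wait_for_payment=lambda oid: seen.append(("wait_for_payment", oid)),
        ship=lambda oid: seen.append(("ship", oid)),
    )
    return seen, result


seen, result = run_with_fakes("order-123")
```

The fakes make it cheap to assert that steps run in the expected order, and swapping one for a function that raises lets you test failure paths the same way.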

Trade-offs (Important)

Durable Functions are not a silver bullet.

You lose:
• visual workflow diagrams
• some operational visibility for non-engineers

And you gain:
• code-level control
• simpler reasoning
• tighter integration with your application

Final Thoughts

Reliable systems are not just systems that run. They’re systems engineers can:
• understand
• debug
• trust during incidents

Durable Functions don’t magically solve observability, but they remove a layer of abstraction that often gets in the way.

And that makes it easier to measure what actually matters.

If you’re already using Step Functions, you don’t need to replace them. But if your workflows feel harder to reason about than they should…

It might be worth trying a different approach.

DEV Community