Most serverless workflows don’t fail because they can’t scale.
They fail because when something goes wrong, engineers can’t easily answer:
• Where did this workflow break?
• What state was it in?
• What happened before the failure?
This is where “measuring what matters” becomes important.
Not more metrics.
Not more dashboards.
But better ways to understand system behaviour.
Recently, I explored AWS Lambda Durable Functions, and it exposed something interesting:
The way we structure workflows directly affects how well we can observe and debug them.
The Problem: Orchestration vs Understanding
If you’ve built workflows using AWS Step Functions, you already know the benefits:
• Clear state transitions
• Visual workflows
• Strong integration with AWS services
But in practice, there’s a trade-off: workflow logic lives outside your application code.
That means:
• You switch between code and state machine definitions
• Debugging often requires jumping across tools
• Context is split across logs, states, and services
This works well for orchestration.
But it doesn’t always optimise for debugging and reasoning under pressure.
What Durable Functions Change
AWS Lambda Durable Functions take a different approach.
Instead of defining workflows externally, you write them directly in code.
The biggest shift is state management. Durable Functions are regular Lambda functions enhanced with stateful execution capabilities.
Durable Functions automatically checkpoint progress, suspend execution for up to one year during long-running tasks, and recover from failures.
Here’s a simplified example:
from aws_lambda_powertools import Logger

logger = Logger()

# validate_order, wait_for_payment and ship_order are placeholders for
# your own step implementations; they are not defined here.
def order_workflow(event):
    order_id = event["order_id"]
    logger.info(f"Processing order {order_id}")

    # Step 1: Validate order
    validate_order(order_id)

    # Step 2: Wait for payment confirmation
    wait_for_payment(order_id)

    # Step 3: Process shipment
    ship_order(order_id)

    return {"status": "completed"}
Now imagine this workflow:
• pauses after wait_for_payment()
• resumes hours later when payment is confirmed
• continues with full context preserved
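To make the pause-and-resume idea concrete, here is a toy sketch of checkpoint-and-replay in plain Python. It is not the Durable Functions SDK: `run_step`, `PaymentPending`, and the in-memory `checkpoints` dict are hypothetical names used only for illustration, and a real service would persist checkpoints for you. The point is that a completed step is never executed twice when the workflow is re-entered.

```python
from typing import Any, Callable, Dict, List


class PaymentPending(Exception):
    """Raised by a step to signal that the workflow should suspend."""


def run_step(checkpoints: Dict[str, Any], name: str, fn: Callable[[], Any]) -> Any:
    """Run fn once; on later replays, return the saved result instead."""
    if name in checkpoints:
        return checkpoints[name]
    result = fn()
    checkpoints[name] = result  # checkpoint only after the step succeeds
    return result


def order_workflow(
    event: Dict[str, Any], checkpoints: Dict[str, Any], calls: List[str]
) -> Dict[str, str]:
    """Same three steps as above; `calls` records which steps really ran."""
    order_id = event["order_id"]

    def validate() -> str:
        calls.append("validate")
        return "valid"

    def wait_for_payment() -> str:
        if not event.get("payment_confirmed"):
            raise PaymentPending(order_id)  # suspend: nothing is checkpointed
        calls.append("payment")
        return "paid"

    def ship() -> str:
        calls.append("ship")
        return "shipped"

    run_step(checkpoints, "validate", validate)
    run_step(checkpoints, "payment", wait_for_payment)
    run_step(checkpoints, "ship", ship)
    return {"status": "completed"}


if __name__ == "__main__":
    checkpoints: Dict[str, Any] = {}
    calls: List[str] = []
    try:
        order_workflow({"order_id": "A1"}, checkpoints, calls)
    except PaymentPending:
        print("suspended after:", list(checkpoints))
    # Hours later: re-enter with the same checkpoints.
    print(order_workflow({"order_id": "A1", "payment_confirmed": True}, checkpoints, calls))
    print("steps executed:", calls)
```

On the second invocation, the validation step is skipped because its result is already checkpointed, so the workflow continues from the payment step with its earlier context intact.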
Why This Matters for Observability
This isn’t just about developer experience. It changes how you instrument and observe workflows.
With traditional orchestration, you rely on:
• Step Function execution graphs
• Distributed logs
• External state tracking
With Durable Functions, you can:
• Log at each logical step
• Track state transitions in code
• Correlate execution paths more naturally
Example:
logger.info({
    "step": "payment_wait",
    "order_id": order_id,
    "status": "pending"
})
Now your logs reflect business flow, not just system events.
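One way to keep these step events consistent is a small helper that every step calls. The sketch below uses only the standard library so it runs anywhere; `build_step_event` and `log_step` are hypothetical helper names, not part of Powertools. With the Powertools `Logger` you could pass the same dictionary straight to `logger.info`.

```python
import json
import logging
from typing import Any, Dict


def build_step_event(step: str, order_id: str, status: str, **extra: Any) -> Dict[str, Any]:
    """Build one business-level event with the same core fields for every step."""
    event: Dict[str, Any] = {"step": step, "order_id": order_id, "status": status}
    event.update(extra)
    return event


def log_step(logger: logging.Logger, step: str, order_id: str, status: str, **extra: Any) -> str:
    """Emit the event as a single JSON line and return that line."""
    line = json.dumps(build_step_event(step, order_id, status, **extra), sort_keys=True)
    logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("orders")
    log_step(log, "validate", "A1", "completed")
    log_step(log, "payment_wait", "A1", "pending")
    log_step(log, "shipment", "A1", "failed", reason="carrier_unavailable")
```

Because every event carries `step`, `order_id`, and `status`, you can filter logs by order and read the workflow back in the order it actually ran.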
Measuring What Actually Matters
In real systems, useful signals are not:
• “Lambda ran successfully”
• “Step transitioned”
Useful signals are:
• “Order is waiting on payment”
• “Workflow resumed after 2 hours”
• “Shipment failed after approval”
Durable Functions make it easier to express these signals because your workflow structure matches your mental model.
That alignment reduces the gap between:
• what the system is doing
• and what you think it’s doing
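As a rough illustration, a signal like “Workflow resumed after 2 hours” can be derived directly from step-level events. The sketch below assumes each event is a dictionary with `step`, `status`, and an ISO 8601 `timestamp` field; that event shape and the function names are assumptions for this example, not something Durable Functions emits for you.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def wait_duration(events: List[Dict[str, str]], step: str) -> Optional[timedelta]:
    """Time between a step's 'pending' and 'completed' events, if both exist."""
    started: Optional[datetime] = None
    completed: Optional[datetime] = None
    for event in events:
        if event["step"] != step:
            continue
        timestamp = datetime.fromisoformat(event["timestamp"])
        if event["status"] == "pending":
            started = timestamp
        elif event["status"] == "completed":
            completed = timestamp
    if started is None or completed is None:
        return None
    return completed - started


def describe_wait(events: List[Dict[str, str]], step: str) -> str:
    """Turn raw step events into a sentence an on-call engineer can read."""
    waited = wait_duration(events, step)
    if waited is None:
        return f"{step}: no completed wait recorded"
    hours = waited.total_seconds() / 3600
    return f"{step}: resumed after {hours:.1f} hours"


if __name__ == "__main__":
    events = [
        {"step": "payment_wait", "status": "pending", "timestamp": "2025-01-01T10:00:00"},
        {"step": "payment_wait", "status": "completed", "timestamp": "2025-01-01T12:00:00"},
    ]
    print(describe_wait(events, "payment_wait"))
```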
Durable Functions vs Step Functions (Practical View)
Use Step Functions when you need:
• Service orchestration across AWS (Lambda, ECS, Glue)
• Visual workflows for operations teams
• Built-in execution tracing
Use Durable Functions when you need:
• Workflow logic tightly coupled with application code
• Faster iteration and local testing
• Simpler debugging of business logic
Trade-offs (Important)
Durable Functions are not a silver bullet.
You lose:
• visual workflow diagrams
• some operational visibility for non-engineers
And you gain:
• code-level control
• simpler reasoning
• tighter integration with your application
Final Thoughts
Reliable systems are not just systems that run. They’re systems engineers can:
• understand
• debug
• trust during incidents
Durable Functions don’t magically solve observability. But they remove a layer of abstraction that often gets in the way.
And that makes it easier to measure what actually matters.
If you’re already using Step Functions, you don’t need to replace them. But if your workflows feel harder to reason about than they should…
It might be worth trying a different approach.