Last month, I spent two hours debugging a Step Functions state machine because someone on my team added an extra comma in the JSON definition. The workflow itself? Dead simple—validate an expense report, wait for manager approval, process the payment. But the state machine definition? 150 lines of JSON that felt like I was programming in the year 2000.
That debugging session cost us a production deployment delay and made me seriously question my life choices. So when AWS announced Lambda Durable Functions at re:Invent 2025, I was skeptical but curious. Another orchestration tool? Really?
Then I actually tried it. And honestly, I think this might be the most significant serverless announcement since Lambda itself launched in 2014.
The Problem We've All Been Ignoring
Here's the thing nobody talks about: Step Functions are amazing for complex workflows with lots of branching logic and AWS service integrations. But for 80% of real-world use cases—order processing, approval workflows, data pipelines—they're overkill.
I recently audited our AWS bill and found we were spending $2,847 per month on Step Functions state transitions for workflows that literally just wait for things to happen. An approval workflow with 8 state transitions, running 10,000 times monthly, costs about $2.00. That sounds cheap until you realize you're paying for states that do absolutely nothing except... exist.
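As a quick sanity check on that per-workflow figure (using Step Functions Standard's listed price of $25 per million state transitions):

```python
# Step Functions Standard: $0.025 per 1,000 state transitions ($25 per million)
price_per_transition = 25 / 1_000_000

monthly_cost = 8 * 10_000 * price_per_transition  # 8 transitions, 10,000 runs
```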
And then there's the cognitive overhead. Every time I need to modify a workflow, I'm context-switching between:
- Python code for the business logic
- JSON/YAML for the workflow definition
- The visual Step Functions console to understand what's actually happening
- CloudWatch Logs to debug when something inevitably breaks
It's exhausting. And it slows down development velocity to a crawl.
Enter Lambda Durable Functions: Finally, Just Write Code
Lambda Durable Functions, announced December 2nd at re:Invent 2025, let you write long-running, stateful workflows as regular Python or Node.js code. No JSON. No YAML. No state machines.
The magic is deceptively simple: when your function hits a checkpoint (using context.step()), AWS saves your progress, shuts down the function, and brings it back to life when needed. Could be 5 seconds later. Could be 5 months later. You don't pay for the wait.
Here's what makes it revolutionary:
- Executions up to 1 year: Your workflow can pause and resume for up to a year without idle compute costs
- Automatic checkpointing: Built-in retry logic and failure recovery
- Zero wait costs: No charges while suspended waiting for callbacks or external events
- Write in code you know: Python 3.13/3.14 or Node.js 22/24—that's it
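To build intuition for the mechanic, here's a deliberately simplified toy model. This is not the real AWS runtime or SDK, just a sketch of the core idea: completed step results are persisted, suspension costs nothing, and on resume the handler replays from the top with finished steps returning their saved results instantly.

```python
class Suspend(Exception):
    """Raised when the workflow must pause for an external event."""

class ToyDurableContext:
    """Toy stand-in for the durable runtime (illustration only)."""
    def __init__(self, checkpoints, callbacks):
        self.checkpoints = checkpoints  # step name -> persisted result
        self.callbacks = callbacks      # callback id -> delivered payload

    def step(self, name, fn):
        if name in self.checkpoints:     # replay: skip completed work
            return self.checkpoints[name]
        result = fn()                    # first run: execute the step
        self.checkpoints[name] = result  # ...and checkpoint the result
        return result

    def wait_for_callback(self, callback_id):
        if callback_id in self.callbacks:
            return self.callbacks[callback_id]
        raise Suspend(callback_id)       # no event yet: suspend, bill nothing

def workflow(ctx):
    validation = ctx.step("validate", lambda: {"status": "validated"})
    approval = ctx.wait_for_callback("manager")  # may pause for days
    payment = ctx.step("pay", lambda: {"status": "paid"})
    return [validation, approval, payment]

store, events = {}, {}

# First invocation: validates, then suspends waiting for the manager.
try:
    workflow(ToyDurableContext(store, events))
except Suspend:
    pass

# Days later the approval arrives; the handler replays from the top,
# returns the cached "validate" result instantly, and runs to completion.
events["manager"] = {"action": "approved"}
result = workflow(ToyDurableContext(store, events))
```

The real service adds durability, retries, and billing semantics on top, but the replay-and-skip shape is the part that matters for reasoning about your code.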
Real-World Example: Multi-Day Expense Approval Workflow
Let me show you a real use case that perfectly demonstrates why this matters. I built an expense approval system that needs to:
- Validate the expense report (30 seconds)
- Wait for manager approval (could be 5 days)
- Wait for finance approval if over $5,000 (could be another 3 days)
- Process the payment (10 seconds)
The Old Way: Step Functions Hell
With Step Functions, I had to:
- Create separate Lambda functions for each business logic step
- Define a state machine in JSON with Task states, Wait states, Choice states
- Handle callbacks manually with task tokens
- Deploy and version the state machine separately from the code
The state machine definition alone was 180 lines. Here's just the approval wait state:
```json
{
  "WaitForManagerApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "SendApprovalEmail",
      "Payload": {
        "taskToken.$": "$$.Task.Token",
        "expenseId.$": "$.expenseId"
      }
    },
    "TimeoutSeconds": 604800,
    "Next": "CheckApprovalStatus",
    "Catch": [{
      "ErrorEquals": ["States.Timeout"],
      "Next": "AutoReject"
    }]
  }
}
```
And this is just ONE state. Multiply that complexity across every step, every error handler, every timeout scenario.
The New Way: Durable Functions Simplicity
With Durable Functions, the entire workflow is just regular Python:
```python
import time

import boto3
from aws_durable_execution_sdk_python import (
    DurableContext,
    durable_execution,
    durable_step,
)
from aws_durable_execution_sdk_python.config import Duration

ses = boto3.client('ses')
dynamodb = boto3.resource('dynamodb')
expenses_table = dynamodb.Table('expenses')


@durable_step
def validate_expense(step_context, expense_id):
    step_context.logger.info(f"Validating expense {expense_id}")

    # Fetch expense from DynamoDB
    expense = expenses_table.get_item(Key={'id': expense_id})['Item']

    # Business validation logic
    if expense['amount'] <= 0:
        raise ValueError("Invalid expense amount")
    if not expense.get('receipt_url'):
        raise ValueError("Missing receipt")

    return {
        'expense_id': expense_id,
        'amount': expense['amount'],
        'category': expense['category'],
        'status': 'validated'
    }


@durable_step
def process_payment(step_context, expense_id, amount):
    step_context.logger.info(f"Processing payment for {expense_id}")

    paid_at = int(time.time())

    # Update expense status
    expenses_table.update_item(
        Key={'id': expense_id},
        UpdateExpression='SET #status = :status, paid_at = :timestamp',
        ExpressionAttributeNames={'#status': 'status'},
        ExpressionAttributeValues={
            ':status': 'paid',
            ':timestamp': paid_at
        }
    )
    return {
        'expense_id': expense_id,
        'amount': amount,
        'status': 'paid',
        'paid_at': paid_at
    }


@durable_execution
def lambda_handler(event, context: DurableContext):
    expense_id = event['expense_id']

    # Step 1: Validate the expense
    validation = context.step(validate_expense(expense_id))
    amount = validation['amount']

    # Step 2: Wait for manager approval (could be days)
    context.logger.info("Sending manager approval request")
    manager_callback = context.create_callback(
        timeout=Duration.from_days(7)
    )

    # Send email with callback URLs
    ses.send_email(
        Source='noreply@company.com',
        Destination={'ToAddresses': ['manager@company.com']},
        Message={
            'Subject': {'Data': f'Approve expense {expense_id}'},
            'Body': {
                'Text': {
                    'Data': f'Amount: ${amount}\n\n'
                            f'Approve: {manager_callback.approve_url}\n'
                            f'Reject: {manager_callback.reject_url}'
                }
            }
        }
    )

    manager_response = context.wait_for_callback(manager_callback)
    if manager_response['action'] != 'approved':
        return {'status': 'rejected_by_manager'}

    # Step 3: Finance approval for high amounts
    if amount > 5000:
        context.logger.info("Requires finance approval")
        finance_callback = context.create_callback(
            timeout=Duration.from_days(5)
        )
        ses.send_email(
            Source='noreply@company.com',
            Destination={'ToAddresses': ['finance@company.com']},
            Message={
                'Subject': {'Data': f'Finance approval needed: {expense_id}'},
                'Body': {
                    'Text': {
                        'Data': f'High-value expense: ${amount}\n\n'
                                f'Approve: {finance_callback.approve_url}'
                    }
                }
            }
        )
        finance_response = context.wait_for_callback(finance_callback)
        if finance_response['action'] != 'approved':
            return {'status': 'rejected_by_finance'}

    # Step 4: Process the payment
    payment_result = context.step(process_payment(expense_id, amount))

    return {
        'status': 'completed',
        'expense_id': expense_id,
        'amount': amount,
        'paid_at': payment_result.get('paid_at')
    }
```
That's it. The entire workflow in ~100 lines of actual Python code. No JSON. No state machines. Just regular code with context.step() for checkpointed operations and context.wait_for_callback() for human approvals.
The Cost Difference Will Surprise You
Let's run the numbers for our expense approval system processing 50,000 expenses per month:
Step Functions Approach:
- 8 state transitions per workflow × 50,000 executions = 400,000 transitions
- Cost: 400,000 ÷ 1,000,000 × $25 = $10.00/month (just for state transitions)
- Plus Lambda invocation costs: ~$15.00/month
- Plus DynamoDB costs, API Gateway, etc.
- Total orchestration cost: ~$25.00/month
Durable Functions Approach:
- Request charges: 50,000 requests × $0.20 per million = $0.01/month
- Durable operations: 4 steps × 50,000 = 200,000 operations × $0.000001 = $0.20/month
- Compute time: ~5 seconds per workflow × 50,000 = 250,000 seconds
- At 1GB memory: 250,000 GB-seconds × $0.0000166667 = $4.17/month
- Checkpoint storage: ~32KB per execution = 1.6GB × $0.10 = $0.16/month
- Total cost: ~$4.54/month
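If you want to check these figures yourself, the arithmetic (using the prices listed above; your actual bill will vary with memory size and execution time) works out like this:

```python
# Step Functions: $25 per million state transitions
sfn_cost = (8 * 50_000) / 1_000_000 * 25

# Durable Functions, component by component
requests   = 50_000 / 1_000_000 * 0.20       # $0.20 per million requests
operations = 4 * 50_000 * 0.000001           # 4 durable operations per workflow
compute    = 5 * 50_000 * 0.0000166667       # 5 s per workflow at 1 GB memory
storage    = 32 * 50_000 / 1_000_000 * 0.10  # 32 KB checkpoints at $0.10/GB-month
durable_total = requests + operations + compute + storage

# Savings versus the ~$25/month all-in Step Functions figure
reduction_pct = (25 - durable_total) / 25 * 100
```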
That's an 82% cost reduction for orchestration alone. And the numbers get even better for workflows with more wait states.
But here's the killer feature: you pay nothing while waiting. To be fair, Step Functions Standard doesn't bill for wait time either; it bills per state transition, so every Wait, Choice, and callback state adds to the bill whether the wait is five seconds or five days. With Durable Functions, the function suspends completely and the wait itself incurs zero compute charges.
When This Makes Sense (And When It Doesn't)
Let's be real: Durable Functions aren't replacing Step Functions for everything. Here's when each makes sense:
Use Durable Functions when:
- Your workflow is mostly sequential business logic
- You have long wait periods (hours to days)
- You want to write and test workflows as code
- Your team is comfortable with Python or Node.js
- You need human-in-the-loop approvals
- Cost optimization matters for high-volume workflows
Stick with Step Functions when:
- You need visual workflow design for non-developers
- Complex branching logic is easier to represent graphically
- You're orchestrating multiple AWS services (Lambda + S3 + DynamoDB + SQS)
- You need sub-second coordination between steps
- Your workflow has 20+ complex parallel branches
- Compliance requires detailed audit trails with visual representations
The Technical Gotchas You Should Know
After migrating several workflows, I've hit some interesting edge cases:
1. Determinism is critical
Your code must be deterministic during replay. Don't use random(), Date.now(), or external API calls outside of context.step(). AWS replays your function from the beginning when resuming, skipping completed checkpoints. Non-deterministic code will cause weird behavior.
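A toy replay harness (again, not the real SDK) makes the failure mode concrete: a value computed outside a step is recomputed on every replay, while a step-wrapped value is checkpointed once and stays stable:

```python
import random

def step(checkpoints, name, fn):
    # Checkpointed step: compute once, return the saved result on replay.
    if name not in checkpoints:
        checkpoints[name] = fn()
    return checkpoints[name]

def bad_handler(checkpoints):
    # WRONG: recomputed on every replay, so each resume sees a new value.
    return random.random()

def good_handler(checkpoints):
    # RIGHT: the nondeterministic call is wrapped in a step, so the first
    # result is persisted and reused when the function is replayed.
    return step(checkpoints, "draw", random.random)

cp = {}
first = good_handler(cp)
replay = good_handler(cp)   # simulates Lambda resuming the execution
assert first == replay      # stable across replays

random.seed(1)
a = bad_handler({})
b = bad_handler({})         # a fresh "replay" recomputes the value
assert a != b               # drifts between resumes
```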
2. Cold starts accumulate
Each resume is a new Lambda invocation. For workflows with 10+ steps, cold starts can add up. Consider Provisioned Concurrency for latency-sensitive use cases.
3. Logging is different
Console logs in completed steps won't appear on replay—the step returns its cached result immediately. Use context.logger and check CloudWatch for the full execution history.
4. Region availability is limited
At launch, Durable Functions are only in us-east-2 (Ohio). AWS plans wider rollout in Q2 2026, but if you need multi-region right now, you're out of luck.
5. Version pinning matters
When you deploy a new function version while executions are suspended, replays use the original version. This is a feature (prevents inconsistencies), but you need to plan your deployment strategy accordingly.
The Developer Experience is What Matters
Here's what sold me: I can now test my entire approval workflow locally using pytest, without AWS credentials:
```python
from aws_durable_execution_sdk_python.testing import DurableExecutionTestClient


def test_expense_approval():
    client = DurableExecutionTestClient()

    # Start the workflow
    execution = client.start_execution(
        lambda_handler,
        {'expense_id': 'test-123'}
    )

    # Simulate manager approval
    callback = execution.get_pending_callbacks()[0]
    callback.complete({'action': 'approved'})

    # Get result
    result = execution.get_result()
    assert result['status'] == 'completed'
```
This changes everything for development velocity. No more deploying to AWS, triggering workflows, manually clicking approval links, and checking CloudWatch. Just regular unit tests.
Why This Announcement Matters for 2025
AWS announcing Durable Functions isn't just about adding another feature—it's acknowledging that the serverless community has been asking for code-first orchestration for years. Azure has had Durable Functions since 2017. DBOS and Temporal have been showing that embedded orchestration is the future.
The timing is perfect too. With AI agents and multi-step LLM workflows becoming mainstream, we need better primitives for long-running, stateful operations. Durable Functions nail this use case.
One of our AI content moderation pipelines—which analyzes images, waits for LLM processing (90 seconds), and routes for human review if needed—was a nightmare in Step Functions. With Durable Functions, it's just code. The LLM call is wrapped in context.step(), the human review is context.wait_for_callback(), and we're done.
The Bottom Line
Lambda Durable Functions represent a fundamental shift in how we think about serverless orchestration. They take the simplicity of Lambda—just write code, AWS handles the rest—and extend it to complex, long-running workflows.
Are they perfect? No. The regional availability is limited, there are edge cases to understand, and Step Functions still win for visual workflows and multi-service orchestration.
But for the majority of real-world use cases—order processing, approval workflows, multi-step data pipelines, AI agent orchestration—Durable Functions are simpler, cheaper, and faster to develop.
I've already migrated three production workflows from Step Functions to Durable Functions. The code is cleaner, the tests are better, and our AWS bill went down. That's a win in my book.
If you're building new long-running workflows, start with Durable Functions. You'll thank me when you're not debugging JSON state machines at 2 AM.
Have you tried Lambda Durable Functions yet? What workflows are you thinking of migrating? Let me know in the comments—I'd love to hear about your use cases and challenges.