Kazuya

Posted on Dec 6, 2025 • Edited on Dec 7, 2025

AWS re:Invent 2025 - [NEW LAUNCH] Deep Dive on AWS Lambda durable functions (CNS380)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.


Overview


In this video, Eric Johnson and Michael announce AWS Lambda durable functions, a new capability that enables developers to write reliable business logic as sequential code using familiar programming languages like JavaScript, Python, and TypeScript. They explain how durable functions use checkpoint and replay mechanisms through a new open source SDK, allowing Lambda functions to suspend execution for up to one year while waiting for callbacks or external events. The demo showcases a Serverlesspresso application rebuilt with durable functions, demonstrating local testing with SAM, execution monitoring in the console, and real-time order processing with callback handling. Key features include automatic retries, idempotency, version pinning during replay, and integration with existing Lambda capabilities like VPC, layers, and EventBridge notifications. They discuss pricing based on operations and storage, provide best practices for deterministic code, and compare use cases with Step Functions, emphasizing that durable functions excel at application code orchestration while Step Functions suits AWS service orchestration.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction and Setting the Stage: Meet the Speakers and the Rules

Good morning. How are you all doing? All right, the five of you that wooed, you may stay. The rest out. Hey, we're so glad to see you here. My name is Eric Johnson. I'm a Principal Developer Advocate for AWS, and I talk about serverless, and I love serverless, and I have to tell you this is one of my most favorite announcements I've ever done, but we'll talk more about that in a little bit. I'm going to turn it over to Mike. Mike, tell them who you are. Hey, good morning, everyone. I'm Michael. I'm a Product Manager in AWS for the serverless organization.

All right, I'm glad to have you here. I have to be honest, I'm really honored to speak with Mike. He's one of, you know, you have the folks, he's one of the good ones. I love him. So we're super excited you're here. How many of you all heard about this announcement? How many of you all just followed the crowd? OK, that's fair. All right. How many of you all have heard me speak before? OK. Oh, a good amount of you. All right. Do you know the rules? OK.

All right, for those of you who didn't raise your hand, I'm going to tell you the rules real quick because you've got to understand the rules when we're doing this. So these are the rules. They're fairly simple. Number one is this is any number I want it to be. OK? Now there's people coming in later and when I say five, they're going to be really confused, but that's a five. And I'm the only one that gets to do this because you look silly doing it, OK? If you say there's 25 of us, they'll be like, that's a five. All right?

So the second thing is these are quotes, not apostrophes. I know that, OK, because this looks better than this, OK? And finally, these are thumbs, because this will get you beat up. So those are the rules to help you out. Also, I'm not listening to a football game or music, unless I am. I use these for hearing aids. I just realized I still have them in. Oh yeah, I was wondering. So yeah, so I'm catching some music, stuff like that. But anyway, listen to the story.

So anyway, those are the rules. I do tell a lot of one finger jokes. I wasn't, but I didn't wake up this, I was born this way. I didn't wake up for the first time today like this, so if I did, I wouldn't be here with you. So I'm very comfortable. You're going to hear those jokes, however, and I really do mean this, if it makes you uncomfortable, I'm comfortable with that as well, so I'm going to be fine this morning regardless. I'm good.

So all right, so we're going to jump in and we're going to get started here. We have a lot to cover and we want to show you. Normally I would do a lot more jokes, but we want to show you this. We're super excited. We have a lot to cover. So you ready? I'm ready. All right. You timing me? Yes, right, I'm already watching. He's watching. He's already stressing. Move it, Eric. OK, all right. OK. This is the first time we've gotten to speak together. We actually just met for the first time this morning. You are John, right? Yeah, yeah, yeah, exactly, yeah.

The Evolution from Monoliths to Microservices: Understanding the Journey

All right, so let's jump in. Today we're going to be talking about, hopefully you've heard about this, the AWS Lambda durable functions. Now before we could talk about this we've got to kind of talk about evolution, right? So in the beginning, think of it, do you want to sing? Can you do like mysterious music? OK, all right, that was good. I like that.

All right, so in the beginning there was the monolith. And here's the truth, we love the monolith, right? Or at least as developers, we liked developing on the monolith because everything was in one place, right? It was all on a single screen. I didn't have to go over here to this room to do this thing and then run over here to another state and zip code to do this thing. I'm hearing people going yeah yeah yeah yeah, so it was all in one place. But the problem or struggle with this is that it was coupled, right? We had a problem if we wanted to scale this. It meant we were no longer developers, we were operations people scaling large computers and dealing with that, so we needed to change that.

So then we came with the microservice and we love the microservice, right? Because microservices give us this ability to be detached, to be decoupled, to work on independent things, and we love that. However, with every good thing comes a little struggle sometimes, and we have this complexity in coordination that could happen, you know, with cascading failures and it was cognitive overload or overhead, but I say overload to make those work, right?

So AWS came out and said we think we can help you do that. We have a lot of practice running large scale applications. Anybody ever heard of a little startup called Amazon.com? OK, that joke kills every time. Yeah, I'm still laughing. Good. I'm glad to hear that. We appreciate that you use that, but we're good at this, so we said we want to help developers do this, so that's why we brought out serverless.

The Serverless Journey: From Lambda to Step Functions to EventBridge

And so if you think of the serverless journey, we started with AWS Lambda, and this was really the first time that the word serverless was used in mass quantity because while we had serverless services before, the Lambda function was the first time we introduced it as compute. And you could do compute serverlessly, and we loved it. We were so excited about it, right? But it was stateless, short-lived, and those aren't bad things.

But we found we really need a little more orchestration for some things because I don't know about y'all, but is anybody else the master of the if-then, the switch, and bad code, right? My title is Developer Advocate. It shouldn't say developer. I'm a hack developer, right? So we said, you know what, we want to be able to orchestrate things. So in 2016 we came out with AWS Step Functions, which gave us orchestration without infrastructure. It's a serverless way to orchestrate things together. We were very, very happy about that, right? This is perfect for AWS service integration as we're doing that, and it gives you a visual orchestration of infrastructure services along with that.

So you've got this orchestration, but we found we needed to be able to choreograph between these domains, right, between these running orchestrations. So in 2019 we announced Amazon EventBridge, which is event-driven architecture without infrastructure, and this allowed us to easily, or more easily, decouple our architectures, right? So this was the journey that we were doing. However, we still have this question the developers ask us all the time, and I as a developer am asking, and that was, what about application logic orchestration? How can you help me do this better?

So here's what I think, all right, and here's what developers think, and I call myself a developer. There's many in the room who'd argue that. Developers want to build like you're building monoliths, but you want to deploy microservices, right? We want the old days of a single screen, but we want to be able to use the decoupled architectures that are out there without the cognitive overload. I think that's the question that we're having.

Announcing AWS Lambda Durable Functions: Building Demanding Applications with Familiar Tools

So, yeah, can I just interrupt you for a second? Well then, I'm just going to suspend you. Okay, can you give me this please? All right, so I was wondering, Eric, as you were talking, what if you could build even the most demanding applications like order processing, payment systems, and user onboarding directly on Lambda, which means using your familiar processes and tools like programming languages, IDEs, and even LLM agents these days to build, test, and debug these applications locally before deploying to production. And if you needed to even pause the execution of those functions in your applications, for example, if you have long-running operations like human-in-the-loop processes.

And so I'm excited to tell you and everyone in the room here as well that yesterday we announced AWS Lambda durable functions. Thank you. So with durable functions, you write your reliable business logic as simple sequential steps, almost like the good old monolith days, just cooler, and that means you use a familiar programming language: JavaScript, Python, TypeScript, you name it. You can even suspend the execution of those functions when you need to wait for extended durations. And lastly, as you know, it's Lambda, it's fully managed, no servers to deploy and operate.

So I'm going to pause here for a second as well because I want to say thank you to our AWS customers, partners, heroes, and my amazing AWS colleagues who contributed to this launch, so this wouldn't have been possible without you and also not without Eric. Thank you so much. Yes, thank you. This is for you.

Understanding Durable Functions: Superpowers Through Checkpoints and Replay

So I talked a bit about what these functions can do for you, but not so much how, and I think this is a deep dive, so let's go a bit deeper. First and foremost, and maybe that's the biggest thing you should know from this talk, is durable functions are regular Lambda functions. It's not a new resource. It is literally the same function that you know today. That means your processes don't change. You have the existing event integrations, the tools, the things that you use today.

However, durable functions have superpowers, and so Eric created this little superpower logo there, and I think there will be stickers at some point. So I'm rooting for you. Do this by hand. Oh yeah. So what are these superpowers? One is durable functions let you checkpoint progress in your event handler. It's almost like hitting the save button as your event handler executes to persist the progress the function has made, and this is useful for two things. One is if there's a failure, there's a crash, you know where you can recover from, but it's also useful to suspend the execution of a Lambda function when you need to wait.

So checkpoints are useful for the progress but also for suspension. Well then the question is if we have these interruptions or suspensions, how do we recover from them or resume from these points, and this is where replay comes into the picture. Replay runs the event handler, your business logic, from top to bottom again, but it skips over completed checkpoints, so it's not going to redo the work or the side effects that are already completed.

But in order to simplify that for you as a developer, we offer you a new open source SDK that abstracts these lower level primitives like checkpoints and replay into higher level operations. Some of them are called steps, some of them are called waits, and you'll see more in the demo later. Generally, this concept of checkpoint and replay is known as durable execution. So durable execution gives you these checkpoint and replay capabilities expressed through the SDK, and therefore you will see durable execution in our API, in the console, and in the SDK as well. So in a nutshell, durable functions are regular Lambda functions that use durable execution to make reliable progress and suspend execution.
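To make the checkpoint-and-replay idea concrete, here is a minimal in-memory sketch. It is a toy model, not the Lambda durable execution SDK: `ToyContext`, `CheckpointLog`, and `orderHandler` are hypothetical names, and the steps are synchronous to keep the example short. The handler runs top to bottom twice, and the step that completed the first time is skipped on the second run.

```typescript
// Toy model of checkpoint-and-replay. NOT the real SDK: ToyContext,
// CheckpointLog and orderHandler are hypothetical names for illustration.
type CheckpointLog = Map<string, unknown>;

class ToyContext {
  private readonly log: CheckpointLog;

  constructor(log: CheckpointLog) {
    this.log = log;
  }

  // Run `fn` once per named step; on replay, return the saved result instead.
  step<T>(name: string, fn: () => T): T {
    if (this.log.has(name)) {
      return this.log.get(name) as T; // completed earlier: skip the work
    }
    const result = fn();
    this.log.set(name, result); // the "save button": persist the progress
    return result;
  }
}

const sideEffects: string[] = [];
let inventoryAvailable = false;

function orderHandler(ctx: ToyContext): string {
  const orderId = ctx.step("validate", () => {
    sideEffects.push("validated");
    return "order-1";
  });
  ctx.step("reserve", () => {
    if (!inventoryAvailable) {
      throw new Error("inventory service unavailable");
    }
    sideEffects.push("reserved");
    return true;
  });
  return `${orderId} confirmed`;
}

const checkpointLog: CheckpointLog = new Map();

// First invocation: "validate" is checkpointed, then "reserve" fails.
let firstError = "";
try {
  orderHandler(new ToyContext(checkpointLog));
} catch (err) {
  firstError = (err as Error).message;
}

// Replay: the same handler runs from the top again with the same log.
inventoryAvailable = true;
const outcome = orderHandler(new ToyContext(checkpointLog));
console.log(outcome, sideEffects);
```

The "validated" side effect is recorded only once even though the handler body ran twice, which is the property the talk describes as skipping completed checkpoints.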

Getting Started: Configuration and SDK Integration Made Simple

Okay, how do you get started? And this is something I'm super happy about, the experience here. So in the console there's this new little toggle. It's almost like a little innocent toggle when you create a Lambda function where you can just enable the durable execution capability in your function. That's it, you just set it there. You can also do it through the APIs, which means infrastructure as code tools, CloudFormation, CDK. You will see that there, and you can set two new properties. One is the execution timeout for the whole execution of the Lambda function as it executes multiple times. I'm gonna add something here. Please do. So when we say API, you're all developers, you probably know this, but for me myself, this is also IaC, so infrastructure as code. We'll talk about it a little bit, but this is literally as simple as adding these two configurations, correct, and off you go. And one is even optional, the retention period. So the execution timeout can go up to one year. So the whole lifecycle of this durable function can be one year with suspension points, and the retention period is for the checkpoints. How long do we want to persist the data, the checkpoints, after the execution has completed? And this is configurable.
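The two settings described above can be pictured as a small typed object. This is only a sketch: the field names `executionTimeoutSeconds` and `retentionPeriodDays` are made up for illustration and are not the actual API or template property names. The one-year ceiling comes from the talk; treating a year as 365 days is an assumption.

```typescript
// Hypothetical settings shape. Field names are illustrative only and are
// NOT the real Lambda API, CloudFormation, or SAM property names.
interface DurableSettings {
  executionTimeoutSeconds: number; // whole execution, across all invocations
  retentionPeriodDays?: number; // optional: how long checkpoints are kept after completion
}

// Assumption: "one year" is modelled as 365 days.
const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60;

function validateDurableSettings(settings: DurableSettings): string[] {
  const problems: string[] = [];
  const timeout = settings.executionTimeoutSeconds;
  if (!Number.isInteger(timeout) || timeout <= 0) {
    problems.push("executionTimeoutSeconds must be a positive integer");
  } else if (timeout > ONE_YEAR_SECONDS) {
    problems.push("executionTimeoutSeconds exceeds one year");
  }
  const retention = settings.retentionPeriodDays;
  if (retention !== undefined && retention <= 0) {
    problems.push("retentionPeriodDays must be positive when set");
  }
  return problems;
}

const oneHour = validateDurableSettings({ executionTimeoutSeconds: 3600 });
const tooLong = validateDurableSettings({
  executionTimeoutSeconds: ONE_YEAR_SECONDS + 1,
  retentionPeriodDays: 7,
});
```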

Next, in your event handler or in the Lambda function, you import the durable functions, durable execution SDK. I'm showing TypeScript here, and you will have two new primitives. One is the wrapper, that withDurableExecution wrapper, which you wrap around your event handler, so it upgrades the existing event handler that you use in your function, and you get access to this new durable context. The durable context has these superpowers. And with the durable context you get then these steps that let you persist or checkpoint your business logic progress. So anything in white here is literally existing code that you would write anyways in your function. The blue things are the new superpowers that you just use as you write your business logic in Lambda. And we also have waits, different wait capabilities. I'm just showing here one, the context.wait where I can say I'm gonna wait for five seconds, and what happens here is it's gonna make a checkpoint as well and then terminate the function, and after five seconds bring it back to life, run the checkpoints again, and then move on.
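The shape described here, a wrapper around the handler plus a context offering steps and waits, can be sketched with a small simulation. Everything below is a hypothetical stand-in (`withToyDurableExecution`, `ToyDurableContext`, `ToyState`), written synchronously with an injected clock; it is not the SDK's `withDurableExecution` or its context. It only illustrates the described behavior: the wait is checkpointed, the invocation ends, and a later invocation replays past the finished step and the elapsed wait.

```typescript
// Toy model only. withToyDurableExecution and ToyDurableContext are
// hypothetical stand-ins, not the SDK's wrapper or durable context.
interface ToyState {
  results: Map<string, unknown>; // completed step results
  wakeAt: Map<string, number>; // wait name -> earliest resume time (ms)
}

class Suspend {
  readonly wakeAt: number;
  constructor(wakeAt: number) {
    this.wakeAt = wakeAt;
  }
}

class ToyDurableContext {
  private readonly state: ToyState;
  private readonly now: number;

  constructor(state: ToyState, now: number) {
    this.state = state;
    this.now = now;
  }

  step<T>(name: string, fn: () => T): T {
    if (!this.state.results.has(name)) {
      this.state.results.set(name, fn());
    }
    return this.state.results.get(name) as T;
  }

  // Checkpoint the wake-up time, then end this invocation if it is too early.
  wait(name: string, seconds: number): void {
    if (!this.state.wakeAt.has(name)) {
      this.state.wakeAt.set(name, this.now + seconds * 1000);
    }
    const wakeAt = this.state.wakeAt.get(name) as number;
    if (this.now < wakeAt) {
      throw new Suspend(wakeAt);
    }
  }
}

type Outcome<R> =
  | { status: "SUSPENDED"; wakeAt: number }
  | { status: "SUCCEEDED"; result: R };

function withToyDurableExecution<E, R>(
  handler: (event: E, ctx: ToyDurableContext) => R,
) {
  return (event: E, state: ToyState, now: number): Outcome<R> => {
    try {
      const result = handler(event, new ToyDurableContext(state, now));
      return { status: "SUCCEEDED", result };
    } catch (signal) {
      if (signal instanceof Suspend) {
        return { status: "SUSPENDED", wakeAt: signal.wakeAt };
      }
      throw signal;
    }
  };
}

let greetingsBuilt = 0;
const toyHandler = withToyDurableExecution(
  (event: { name: string }, ctx: ToyDurableContext): string => {
    const greeting = ctx.step("greet", () => {
      greetingsBuilt += 1;
      return `Hello, ${event.name}`;
    });
    ctx.wait("pause", 5);
    return ctx.step("finish", () => `${greeting} (after the wait)`);
  },
);

const toyState: ToyState = { results: new Map(), wakeAt: new Map() };
const firstRun = toyHandler({ name: "Ada" }, toyState, 0); // suspends at the wait
const secondRun = toyHandler({ name: "Ada" }, toyState, 5000); // replays and finishes
```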

Deep Dive into Checkpoint and Replay Behavior: An Order Processing Example

So let's talk a little bit more about the checkpoint and replay behavior, because it's fundamental to how durable functions work. So I'm just gonna use a very simple order processing example here. It has four steps that we want to go through. Obviously your example would be bigger, but I couldn't fit more on the screen here. First, what we do is all of our business logic, the steps that we want to run, we put them in these steps, context.step, and then whatever you wanna do inside those with your code. So steps are used for checkpointing, so we keep track of the checkpoints in this example, but we also keep track of the benefits that we gain from these different primitives as we go through the example.

So let's start with the first one. Your event handler starts running, it's triggered by an event maybe from SQS or from another upstream. So it runs the first step, executes the validation phase, that's fine, cool, succeeds, moves on to the next one and does the checkpoint. So we get this progress tracking, that's kind of fundamental. Now the event handler moves on, hits the next step, which is the reservation of our stock, maybe reserving it. Here we have an issue. It blows up, whether the downstream is not available or the function has some issue, whatever. And in this case, if you put a step around it, we get automatic retries, backoff, and jitter. So I'm not saying you can just write happy path code, but we remove a lot of the boilerplate that you have to put in your function to make it resilient with automatic retries. We also give you deduplication or idempotency, depends a little bit on how you wanna frame it, because let's assume this function crashed and the upstream caller, whether it's an event poller or a user, clicks on retry because something happened and it's not really reacting. What do we do here? Well, behind the scenes a durable function will not spin up another durable function for another request, so we have built-in idempotency and deduplication logic on the front end to ensure this execution runs only once.
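The deduplication behavior can be illustrated with a tiny registry. This sketch assumes the caller supplies some key that identifies the execution; the talk does not say at this point what the service actually keys on, and `ExecutionRegistry`, `ToyExecution`, and `dedupeKey` are hypothetical names.

```typescript
// Toy model of idempotent starts. ExecutionRegistry and dedupeKey are
// hypothetical; how the real service identifies duplicates is not shown here.
interface ToyExecution {
  id: number;
  dedupeKey: string;
  state: "RUNNING" | "SUCCEEDED" | "FAILED";
}

class ExecutionRegistry {
  private readonly byKey = new Map<string, ToyExecution>();
  private nextId = 1;

  // A start request whose key is already known returns the existing
  // execution instead of creating a second one.
  start(dedupeKey: string): { execution: ToyExecution; created: boolean } {
    const existing = this.byKey.get(dedupeKey);
    if (existing) {
      return { execution: existing, created: false };
    }
    const execution: ToyExecution = {
      id: this.nextId++,
      dedupeKey,
      state: "RUNNING",
    };
    this.byKey.set(dedupeKey, execution);
    return { execution, created: true };
  }

  count(): number {
    return this.byKey.size;
  }
}

const registry = new ExecutionRegistry();
const firstStart = registry.start("order-123");
const retryStart = registry.start("order-123"); // e.g. the user clicked retry
```

Here the second request resolves to the same execution, so the handler's work is not started twice.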

Now replay kicks in. This is when the system tries to recover from this interruption.

Replay first makes sure that the invocation, when we run the function again, uses the exact same version that the durable function used for this execution. You might have had code changes in between, bug fixes, whatsoever. Replay makes sure it always replays on the version it was started with, so the lifecycle is guaranteed and pinned to the version.
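Version pinning can be sketched as an execution that remembers which version it started on. The names here (`publishedVersions`, `startExecution`, `replayExecution`) are hypothetical and only model the described behavior; they are not Lambda APIs.

```typescript
// Toy model of version pinning during replay; all names are hypothetical.
type VersionedHandler = (input: string) => string;

const publishedVersions = new Map<number, VersionedHandler>();
publishedVersions.set(1, (input) => `v1 handled ${input}`);
let latestVersion = 1;

interface PinnedExecution {
  input: string;
  pinnedVersion: number; // recorded once, when the execution starts
}

function startExecution(input: string): PinnedExecution {
  return { input, pinnedVersion: latestVersion };
}

// Replay always resolves the version recorded at start, not the latest one.
function replayExecution(execution: PinnedExecution): string {
  const handler = publishedVersions.get(execution.pinnedVersion);
  if (!handler) {
    throw new Error(`version ${execution.pinnedVersion} not found`);
  }
  return handler(execution.input);
}

const inFlight = startExecution("order-7");

// A bug fix is published while order-7 is still in flight.
publishedVersions.set(2, (input) => `v2 handled ${input}`);
latestVersion = 2;

const replayed = replayExecution(inFlight); // still runs on version 1
const fresh = replayExecution(startExecution("order-8")); // new work uses version 2
```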

Then as the event handler starts processing again, we are going to skip over the completed checkpoints we already did. We're not going to redo them, and that's very useful, for example, if you have steps in your code that are either expensive or latency sensitive or have side effects that you don't want to redo again. If you completed them, we just skip them. Next we hit the reserve point because we didn't really complete it the last time, so we're going to rerun the reservation step again to ensure it works. Let's say this time the downstream is available and we don't crash, so we have a checkpoint here.

So you might have noticed that the payment step was not wrapped in a step, and it was for a reason, because the payment here should signal in our code that it's waiting. It has a wait capability we want to express that we need to wait for someone to click a button or swipe the credit card through, so we use one of these wait capabilities that our context provides. Waits are also checkpoints. And then when a checkpoint, when a wait point is hit, the function terminates. This is literally when it stops executing gracefully because you indicated that you want to stop here. Now it's suspended. It's not running. The function is suspended.

The question is now, how do we bring it back? What wakes up the function again? This depends a little bit on the wait that you use. A typical wait would be context wait five seconds, so you have a timer-based wake, or you can do wake me up in a week so I can send an email to someone else. But we also have callbacks where you can send a callback ID or token to a downstream service like a payment system, which then uses this token to complete this callback. Once completed, the function will resume. Another one is conditions where you can poll external APIs and say every five seconds, please call this API, but between the five seconds when you don't do work, just sleep. So we have different wait strategies.
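The callback flow, where a token is handed to a downstream system that later completes or fails it, can be modelled in a few lines. `CallbackStore`, `PaymentExecution`, and `awaitPayment` are hypothetical names for this sketch and do not represent the Lambda callback APIs or the SDK's wait operations.

```typescript
// Toy model of callback-based waits; not the real Lambda callback APIs.
type CallbackState =
  | { kind: "PENDING" }
  | { kind: "SUCCEEDED"; result: string }
  | { kind: "FAILED"; error: string };

class CallbackStore {
  private readonly callbacks = new Map<string, CallbackState>();
  private counter = 0;

  create(): string {
    const id = `cb-${++this.counter}`;
    this.callbacks.set(id, { kind: "PENDING" });
    return id;
  }

  succeed(id: string, result: string): void {
    this.settle(id, { kind: "SUCCEEDED", result });
  }

  fail(id: string, error: string): void {
    this.settle(id, { kind: "FAILED", error });
  }

  get(id: string): CallbackState {
    const current = this.callbacks.get(id);
    if (!current) {
      throw new Error(`unknown callback ${id}`);
    }
    return current;
  }

  // A callback can only be settled once.
  private settle(id: string, next: CallbackState): void {
    if (this.get(id).kind !== "PENDING") {
      throw new Error(`callback ${id} already settled`);
    }
    this.callbacks.set(id, next);
  }
}

interface PaymentExecution {
  callbackId?: string; // checkpointed so replays reuse the same token
}

// One "invocation" of the payment part of a handler.
function awaitPayment(
  store: CallbackStore,
  execution: PaymentExecution,
  sendToPaymentSystem: (callbackId: string) => void,
): string {
  let id = execution.callbackId;
  if (id === undefined) {
    id = store.create();
    execution.callbackId = id;
    sendToPaymentSystem(id); // hand the token to the downstream system
  }
  const current = store.get(id);
  if (current.kind === "PENDING") {
    return "SUSPENDED";
  }
  if (current.kind === "FAILED") {
    throw new Error(current.error);
  }
  return `PAID:${current.result}`;
}

const callbackStore = new CallbackStore();
const paymentExecution: PaymentExecution = {};
const sentTokens: string[] = [];
const send = (callbackId: string): void => {
  sentTokens.push(callbackId);
};

const beforeCallback = awaitPayment(callbackStore, paymentExecution, send);
callbackStore.succeed(sentTokens[0] ?? "", "txn-42"); // the payment system reports back
const afterCallback = awaitPayment(callbackStore, paymentExecution, send);
```

The token is sent downstream only once; the second invocation finds the stored callback ID and reads its outcome instead of creating a new one.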

But more importantly, sometimes you want to cancel a wait, and if you've done this before in other architectures, cancellation of sleeping or pending resources is really hard in those systems. With durable functions, it's literally just a feature or a configuration that you put on the wait when you say cancel this in five seconds if you don't hear back from the other system. Okay, let's assume we have a cancellation here because the user didn't click the link or didn't swipe the credit card through. Through cancellation, we will bring back this function. It's going to go through the replay process again. It's going to skip the wait, obviously, because we've done this. We're not going to wait again. And now we might want to do some compensation.

We might want to go down a different path in our code that says, oh, actually the user didn't click, so we have to roll back, undo work that we did before. Undoing work across different services in distributed architectures, often called the saga pattern, is quite simple with durable functions because the undo operations are just steps. You put them in steps, try catch, and in your catch block, for example, you just run these steps to undo work. And because they are steps, they get the same reliability guarantees that we discussed before. Steps are also then checkpointed for these undo operations.
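Here is a small sketch of that compensation pattern: forward steps register an undo step, and a timeout in the middle triggers the undo steps in reverse order inside the catch block. The `step` helper is a plain in-memory function, not the SDK's `context.step`, and the step names and the `WaitTimedOut` marker are hypothetical.

```typescript
// Toy saga. step() is an in-memory helper, not the SDK's context.step;
// step names and the WaitTimedOut marker are made up for illustration.
const actions: string[] = [];
const completedSteps = new Map<string, unknown>();

function step<T>(name: string, fn: () => T): T {
  if (!completedSteps.has(name)) {
    completedSteps.set(name, fn()); // checkpoint after the work succeeds
  }
  return completedSteps.get(name) as T;
}

function waitTimedOut(message: string): Error {
  const error = new Error(message);
  error.name = "WaitTimedOut";
  return error;
}

function isWaitTimedOut(err: unknown): boolean {
  return err instanceof Error && err.name === "WaitTimedOut";
}

function processOrder(paymentArrived: boolean): string {
  const undo: Array<() => void> = [];
  try {
    step("reserve-stock", () => {
      actions.push("reserve stock");
    });
    undo.push(() => step("release-stock", () => {
      actions.push("release stock");
    }));

    step("hold-funds", () => {
      actions.push("hold funds");
    });
    undo.push(() => step("release-funds", () => {
      actions.push("release funds");
    }));

    if (!paymentArrived) {
      throw waitTimedOut("payment not confirmed in time");
    }
    step("ship", () => {
      actions.push("ship");
    });
    return "SHIPPED";
  } catch (err) {
    if (!isWaitTimedOut(err)) {
      throw err;
    }
    // Compensate in reverse order; each undo is itself a checkpointed step.
    for (const compensate of undo.reverse()) {
      compensate();
    }
    return "ROLLED_BACK";
  }
}

const firstAttempt = processOrder(false);
const actionsAfterFirst = actions.length;
const replayAttempt = processOrder(false); // replay: nothing is redone or undone twice
```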

Live Demo: Building Serverlesspresso with Durable Functions

What I cannot show you here on the slide is local testing and observability, and I think this is a good segue into your demo area. So let me now, a couple caveats here. Eric Johnson coding, we already talked about that. Most liability code in the house is mine, so you're going to get to see me. I have one finger and I have fat finger worse than anybody. But I'm going to try some live coding today. We're going to see how it happens. But for some of this, I'm going to show you on the screen.

But first of all, how many of you all have heard of Serverlesspresso? All right, a few of you. How many of you have had the coffee? All right, so if you want to go over to the expo hall, get a Serverlesspresso, coffee's on me. Not really, but you know what I'm saying. But we're going to use this as an example. What I decided to do is this is a really well-known application that is based on Step Functions, which we love, but I wanted to see how well could I do this in durable functions. So just to kind of give you an idea, the durable function is kind of the hero of this application.

Real quick overview, and this isn't really about architecture, but this is kind of what the architecture looks like. It is an event-driven architecture that I'm using, and the durable function is pausing and restarting and doing things in parallel and child context, all the things. Crazy, yeah, it is crazy. So let's take a look here. More importantly, I want you to see kind of what do the steps look like. So if you look inside the durable function, this is what's going on. We've got the order placed. We've initialized the order, then we validate the order, and when we're validating, we're doing that kind of in parallel. So let me show a couple of these.

So when we initialize the order, you can see in the blue here I've got the context.step, and this is the initialize order step. So this is the name it shows up under when I'm looking at my console, and then the durable function has a context, right, the step context. And then in there I can use the step context to get access to different things, and I've kind of abstracted away the DynamoDB code. It's probably bad anyway, but that's, you know, we do code to talk to DynamoDB to write the data.

Then we're going to actually do the logger, so we're going to write that out. Hey, here's what's going on. And when you see me run this later, I logged everything so you can see that going on. Right, so finally I'm going to set a retry strategy. Now I can set this once and apply it to all, or I could set different retry strategies, but I can tell it right in there, hey, I'm going to have you retry twice with a certain amount of jitter with different things. So there's a lot of aspects I can change on that, and I have a lot of control over that, and that's a lot better than trying to write that in my code.
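A retry strategy of the kind described, a couple of retries with backoff and jitter, might look like the sketch below. The `RetryStrategy` shape and its field names are hypothetical and are not the SDK's actual options; the jitter function is injectable so the behavior can be checked deterministically, and the delays are only recorded rather than slept.

```typescript
// Hypothetical retry-strategy shape; the SDK's real option names may differ.
interface RetryStrategy {
  maxAttempts: number; // total tries, including the first one
  baseDelayMs: number;
  maxDelayMs: number;
  jitter: (ceilingMs: number) => number; // picks a delay in [0, ceilingMs]
}

// "Full jitter": a random delay between zero and the exponential ceiling.
const fullJitter = (ceilingMs: number): number => Math.random() * ceilingMs;

function delayBeforeRetry(strategy: RetryStrategy, retryNumber: number): number {
  const exponential = strategy.baseDelayMs * Math.pow(2, retryNumber - 1);
  return strategy.jitter(Math.min(strategy.maxDelayMs, exponential));
}

function runWithRetry<T>(
  strategy: RetryStrategy,
  fn: (attempt: number) => T,
): { value: T; attempts: number; delaysMs: number[] } {
  const delaysMs: number[] = [];
  for (let attempt = 1; ; attempt++) {
    try {
      return { value: fn(attempt), attempts: attempt, delaysMs };
    } catch (err) {
      if (attempt >= strategy.maxAttempts) {
        throw err; // retries exhausted: surface the last error
      }
      // A real runtime would wait this long before the next attempt.
      delaysMs.push(delayBeforeRetry(strategy, attempt));
    }
  }
}

// Retry twice (three attempts in total), without randomness for the demo.
const retryTwice: RetryStrategy = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 150,
  jitter: (ceilingMs) => ceilingMs,
};

const flakyResult = runWithRetry(retryTwice, (attempt) => {
  if (attempt < 3) {
    throw new Error("downstream unavailable");
  }
  return "reserved";
});
```

Swapping `jitter` for `fullJitter` spreads the retries out randomly, which is the usual reason to add jitter when many executions fail at the same moment.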

Let me just pause there, Eric. Just to be clear, all the stuff in white is just pure business logic. Yeah, it's what you would do in any other Lambda function, right, or anywhere else. And I think that's a really important point to make: we built this specifically to work in a Lambda function, not to have to learn, I mean, yes, there's an SDK and you're going to learn those steps and things like that, but it's easy to use in a Lambda function because it was built on top of all the Lambda function stuff already. So, all right, good. All right, here we go. Awesome.

So the next thing we're going to do is we're going to validate the order, and in this one I'm actually going to run some things in parallel, right? So I want to do, there's no sense me doing this, then this, then this. I need to grab a couple of things and get that information. So I'm actually going to use a child context here, and so you see I've got the child context that's being delivered by the parallel step. And in there I'm going to fetch the event config which will tell me is the store open, do I have capacity, do I have the menu, whatever, right? And then I'm also going to check the amount of orders to fetch the attendee orders because here at re:Invent, if you've been over there a lot of times we limit it to one or two per day because wow, coffee in Vegas is expensive, right?

So, okay, so then I'm going to add my recharge chart. That's going to be on recording. I'm going to be in trouble for that later. So, all right, there we go. Okay, so we got a retry strategy and then I'm also adding another thing here. I'm adding the max concurrency, how many times or how many do I want running at one time, right? So I have control over that and this is super handy if I'm trying to protect downstream services and different things like that. So really important.
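The effect of a max concurrency setting on parallel work can be shown with a generic limiter. `mapWithMaxConcurrency` is a hypothetical helper written for this sketch, not the SDK's parallel operation, and the lookup names are placeholders; it only demonstrates that no more than the configured number of branches run at once.

```typescript
// Generic concurrency limiter. mapWithMaxConcurrency is a hypothetical
// helper for illustration, not the SDK's parallel operation.
async function mapWithMaxConcurrency<T, R>(
  items: readonly T[],
  maxConcurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (maxConcurrency < 1) {
    throw new Error("maxConcurrency must be at least 1");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed item until none are left.
  const lane = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  };

  const laneCount = Math.min(maxConcurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

let running = 0;
let peak = 0;
const lookups = ["event-config", "attendee-orders", "menu"];

const lookupResults = mapWithMaxConcurrency(lookups, 2, async (lookup) => {
  running += 1;
  peak = Math.max(peak, running);
  await Promise.resolve(); // stand-in for a call to a downstream service
  running -= 1;
  return lookup.toUpperCase();
});
```

With three lookups and a limit of two, the third lookup starts only after one of the first two has finished, which is how a limit like this protects downstream services.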

So the next thing we're going to do, we're going to skip a couple of these, and we're going to do a wait for barista acceptance. So this is where we, it's all powerful, but this is really cool. So what I'm going to do is I'm going to call a wait for callback and I'm going to pass it, I'm going to get a callback ID and then I can store that. And so then I can do some different work and then I'm going to pause, right? I'm going to suspend the Lambda function until that callback ID, so somebody calls the API with that callback ID and said this was successful, this was a failure, and so on. Okay, so that gives you an idea of how that works.

I'm going to actually get to the code here so we can do that a little quicker, and then again I'm setting a specific retry on this, or I'm sorry, a specific timeout, how long do I want to do that? Some other settings I'm going to do is I'm actually catching this. So if someone, if it times out because I have a two minute timeout, if it times out, I want to catch that and say, oh, it was canceled because the barista didn't pick it up on time or the user canceled it. That was the undo topic, the saga compensation that we, exactly, it's compensation. Some of you might wonder now, okay, how does the callback come back and how does this work? We added new callback APIs to Lambda APIs so you can complete or fail these callbacks as Eric just mentioned. That's right.

All right, so I'm going to go ahead and switch over the demo we should see that. Okay, good. All right, here we go. All right, first thing I want to show you is this is a SAM template. This is built into SAM, and the only thing I'm doing is if I scroll down just a little bit here on line 676, you can see durable config. Is that big enough for y'all in the back? Can you see it? All right, perfect. All right, I see thumbs up. I think that's now you're just bragging. Okay, all right. So, that's the first thing I'm going to add.

Now down here I'm adding a statement, and these are two new permissions. You don't have to add this if you're using SAM. It'll add it for you, but I wanted you to see that. So what am I doing? I'm adding the permissions to checkpoint durable execution and get durable execution state, and then I'm scoping it down to the particular function I'm going to use. That gives you an idea of how it looks in SAM, but let's actually see it in action. All right, so we're going to refresh this. We've got a coffee in there. Somebody scanned my coffee when they saw the QR code.

All right, but I can play that way. Too early. Okay, all right, so that's gone. All right, so I'm going to go ahead and first, because where do we start as developers? We start locally, right? So I'm going to go ahead, let's clear this out real quick. We won't clear it out because I can't spell it. So I'm going to do sam local, hold on that's a comma invoke that comma. All right, here we go. I'm going to stress Michael out, you watch it. But it was a single tick comma. Yeah, yeah, he's going to lose it because I, yeah, if you had to watch me, all right, so we're going to move on.

So I'm going to go ahead and start this. So this is actually locally invoking the Lambda, the durable function. Right, and so it's going to run through this, and now you can see, and I'm running through it synchronously. I could do this asynchronously as well, but one thing I'm going to do is I've come to a point where it needs a response from me. So this is that first callback, and what it says is, hey, I've hit a callback. Here's the callback ID. What do you want to do? Well for this we're going to do a happy path. We're going to say go ahead and send a callback success and here's my results. And you notice over here I've got some coming in here, somebody else is ordering coffee. You're going to see this one be accepted, there it goes and now we're going to go ahead and complete that. And so at the time this, the durable function is paused and waiting for my interaction. So now I'm going to go ahead and complete that, and you'll see that go away.

So we were able to interact because I have already deployed this, so I'm able to interact with this. However, what if I haven't deployed that? Well, one of the cool things is I can go in now. Well, actually let me show you this first. I'm going to do a sam local and I'm going to get the execution. So if I'm local and I want to see the results of this, let's just paste this here. I'm moving fast here, Michael. Keep me honest on time. You're good on time, my friend. All right, so there's the results of that one. Now what if I want that full history? Well, I'm going to copy this and I'm going to paste that there. And now I've got the full history, and I want to actually blow this up so you can see this. Look at this table here. So in here I can see everything that happened.

Okay, now if I wanted it, I could also add a format JSON and I would get back the, well, there's no reason we can't do it here, so we'll do format JSON, and this will actually pull that and again this is all stored in the container or stored in this local, and here's all the information that I need that I can use to debug. All right, so this is great, but what if I haven't deployed it yet, right? So right now I'm using environment variables to talk to services in the cloud, but as a developer I want to run local tests, no problem. So I'm going to go into my Lambda function, which is in the coffee orders, and then I'm just going to run npm test and I'm going to use the test runner that's provided in the SDK that's actually going to run this and it's going to, and I've got all kinds of assertions that I can make and we won't go through all this. I actually have a blog that I've put out that'll show you a lot of this, but this allows you to test these through mocking without having to talk to the cloud and you can fully wrap a test suite around that.

Yeah, just to briefly correct you: not in the CDK, in the SDK. The SDK comes with a test kit. You did, because you love CDK, but I love the SDK. So the SDK gives you a testing kit now where you can run and mock these unit tests, use SAM as Eric just described to run the function locally for some more integration testing against the cloud, and then later deploy, which I think you did, right? I did, yeah. All right, so now I'm going to get you to help me out. Get your phones out, and we're going to see how this goes, all right? I'm going to let you order coffee, although some of you jumped ahead; we'll have words. All right, so let's throw up this QR code. You should be able to see this, and I should see orders starting to pop in. If not, Michael built this. Wasn't it Kiro? It was Kiro, yeah. Let's be really honest, Kiro and I did this. All right. More Kiro than you, I guess. Yeah, yeah, yeah, exactly.

All right, so here we go. We got orders. Oh my God, whoa, whoa, whoa, whoa. Okay, go warn the baristas. All right, okay, so we're going to take this off because you're killing our baristas. All right, so if I go in here now, these have a two minute timeout. I'm going to try really hard to get to all of you in two minutes, but I'm probably going to cancel some of you. But let's look what's happening under the hood. So here's the Lambda function, and we have a new tab called durable executions, and you know it's new because we put the word dash new in there. So there's the new tab, all right, in case some of you couldn't find it, look for new. All right? So when we go to new, you're going to see a ton of these running.

Now I've got one that we may have, that I may have, oh, this may have been one that I canceled earlier, but these are all running. So if I go in here, let's actually read this and look at the output. And yeah, I canceled my, oh, someone ordered it and canceled the attendee, so I'm able to look at the input and the output here. So let me go back and you can see I'm on version 99. I've been doing this for a while here, so I'm going to go into one of these and you'll see here that we are sitting in wait for a callback in the acceptance. So it's actually gone in here.

Hold on a sec, Eric, can you pull up the steps again, the durable ones, because this is easy to miss in his code? Eric is using step names, so you can provide names for these steps, which we then visualize in this table here. So for example, generate timestamp. You have full visibility into your code if you provide step names, and you also see the logical sequence of your operations, including parallel. Yeah, parallel is there as well. Yep, that's right, I've got parallel going on. So here you can see the parallel, what's going on. You can go in here. Here's the fetch event config, and you can see the outputs I got from that. Now you can see here, and now that you've talked, I'm not going to get to some of these. Yeah, so there's the callback, right? So you can actually go here and you can see the callback ID that's going in.

What else do you want me to show? Well, in the console, if you scroll right a little bit, you can even complete a callback in the console if you want to. It's in the actions. You see the little buttons, so you can even complete those in failure. That's right. So you do that right in the console to see how that works. All right, now I'm going to go in here to the barista. I'm going to accept a few of these, and we'll complete a few of these. And so what's happening is the Lambda functions are coming back up. They're starting that replay model, they're skipping over everything they've done, and they're continuing on. And you'll see what you're going to see is a lot of these start canceling out. Oh yeah, we've had a lot canceled out already because I'm sorry, I'm barista-ing as fast as I can.
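The replay model described here can be sketched with a toy context. This is a minimal illustration, not the real durable functions SDK: `ToyContext`, `Suspend`, `step`, and `wait_for_callback` are hypothetical names invented for the example, and the in-memory `store` dictionary stands in for the checkpoint backend.

```python
from typing import Any, Callable, Dict, List


class Suspend(Exception):
    """Raised to end the current invocation while a callback is pending."""


class ToyContext:
    """Hypothetical stand-in for a durable context (not the real SDK)."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        self.checkpoints = checkpoints   # survives across invocations
        self.executed: List[str] = []    # steps that really ran this time

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.checkpoints:     # replay: skip work already done
            return self.checkpoints[name]
        result = fn()
        self.checkpoints[name] = result  # checkpoint the result
        self.executed.append(name)
        return result

    def wait_for_callback(self, name: str) -> Any:
        if name in self.checkpoints:     # callback was already completed
            return self.checkpoints[name]
        raise Suspend(name)              # nothing to do until it arrives


def order_handler(ctx: ToyContext) -> str:
    order_id = ctx.step("create-order", lambda: "order-1")
    decision = ctx.wait_for_callback("barista-acceptance")
    return ctx.step("finish", lambda: f"{order_id}:{decision}")


store: Dict[str, Any] = {}

# First invocation runs until it reaches the pending callback.
first = ToyContext(store)
try:
    order_handler(first)
    suspended = False
except Suspend:
    suspended = True

# The barista completes the callback while the function is not running.
store["barista-acceptance"] = "accepted"

# Second invocation replays from the top and skips the finished step.
second = ToyContext(store)
final_result = order_handler(second)
```

In this sketch the handler runs twice from the top, but `create-order` only does its work once; the second run picks up its checkpointed result and continues after the callback.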

All right, so we'll, oh, they're going. I feel a lot of pressure, Mike. All right, so we'll complete. I'll say I completed a few, and there you go. The other thing I want to show you real quick is we'll go into CloudWatch. You can see on this, let me go back and get the most recent. And you can see here that I've got all my structured logs. I can pull this as I need it, much like what I'm seeing in the console. So good, awesome. All right, great demo. Turn back over to you. No, I think it's you. That's still me. That's right. It's okay, so let's go to there's that code again. So I'm going to be taking this down today. So there you go. All right, so production really a reality. Let me talk about what this means.

One of the really cool things I want to throw out, all kidding aside: I'm not a strong developer. I'm an average developer. I probably represent the average developer. And so when I decided to build this, of course with the new code assistants, we're able to move very fast. Take away all the wrestling I did with types and bundling, because I struggle with those things, and I was able to build this application in roughly six hours using Kiro. The reason that's so cool is that coding assistants love durable functions. They love code, right? They understand the code. There are many references available. Obviously we're pushing a lot more out in our documentation, but load up your steering docs with the documentation and the references, and Kiro really, really rocked on this. I was so proud of this. I bought him a hat, so they moved quickly.

All right, so let me, the other thing is the durable function unit test framework, right? So you've got two testing modes that you can do this. This is really slick. You can do local and cloud, and I have a working version of this I'll be posting probably early next week. Complete execution inspections, storage options for storing to take that, testing strategies, local test for business logic, cloud test for integration staging, and then you focus on the outcomes on this, right? So this is just an example of what it looks like. I have a whole blog on testing that I'll be posting later this afternoon as well.

All right, so the last thing I want to talk about here is infrastructure as code, which you already saw in effect. Obviously SAM supports this, and the new version of SAM is coming out; actually, as we speak they're deploying it. It allows for local and remote invocation, execution data, and execution history like we showed you. You can do callbacks, you can stop the execution, and you can get the logs. But SAM's not the only one supporting this, as Michael said. There's CloudFormation: if SAM supports it, CloudFormation supports it. That's how it works, right? The Cloud Development Kit, or CDK, is coming out this week, and Terraform. Yeah, it's merged already. I'm sorry, it's merged already. Michael's way ahead of me. So that gives you an idea of what's going on with infrastructure as code. Michael, I'm going to turn it over to you. Awesome, thank you. Yeah, you bet. I think we should give a round of applause for the demo, because this was very brave, very brave.

Key Integrations, Features, and Technical Capabilities of Durable Functions

Thank you. Well done, Eric. Thank you. All right, let's talk more about the key integrations and Lambda features that you can use with durable functions. So first, runtime. Durable functions support at launch Node.js 22 and 24, and Python 3.13 and 3.14. And also OCI can be used for bundling the SDK. We'll come back to this in a second. We have more developer tips. If it does Node, it does TypeScript. Oh yeah, good point. We also have, everybody knew that, a good call. Yeah, thank you and thanks for interrupting me. Yeah, you got it. It's what I do. Yeah, okay, let me replay then. Okay, sorry. So for runtime we have more languages planned in the pipeline as well.

Event sources. It's a Lambda function, as I said before, so it works with all the event sources that you have. With direct synchronous invocations, the durable function is limited to 15 minutes of execution, because that is the way synchronous invocations work in Lambda today. You would not want to hang on to a connection for a year. However, there are some cool features even with synchronous invocations. If you have a synchronous invocation that, let's say, fails after five minutes because the caller terminates or there's a network glitch, you can now reattach to the execution if it's still running. If it's completed, you get the result back, so we have idempotent behavior if you pass the execution name parameter.
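The idempotent behavior of the execution name can be pictured with a small in-memory model. This is only a sketch of the idea: `ToyService`, `ToyExecution`, and the status strings are hypothetical, and nothing here calls the actual Lambda API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyExecution:
    name: str
    status: str = "RUNNING"          # hypothetical status values
    result: Optional[str] = None


class ToyService:
    """Hypothetical model of name-based deduplication (not the Lambda API)."""

    def __init__(self) -> None:
        self.executions: Dict[str, ToyExecution] = {}
        self.started = 0             # how many executions really started

    def invoke(self, execution_name: str) -> ToyExecution:
        existing = self.executions.get(execution_name)
        if existing is not None:
            # Same name: reattach to the running execution,
            # or hand back the finished result.
            return existing
        self.started += 1
        execution = ToyExecution(execution_name)
        self.executions[execution_name] = execution
        return execution

    def complete(self, execution_name: str, result: str) -> None:
        execution = self.executions[execution_name]
        execution.status = "SUCCEEDED"
        execution.result = result


service = ToyService()
original = service.invoke("order-42")
retry = service.invoke("order-42")       # caller retries after a glitch
service.complete("order-42", "latte ready")
late_retry = service.invoke("order-42")  # retry after completion
```

However many times the caller retries with the same name, only one execution starts in this model, and a retry after completion simply sees the stored result.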

On async invocations, which will probably be the most common use case, you get up to one year of execution. By the way, I forgot to say that in direct sync you can also use waits. We heard yesterday from a customer who said that they have a synchronous invocation, but then they need to wait for a callback from an external backend. Now with synchronous invocations, as long as you stay within the 15 minutes, which you usually do on a sync, you can even use waits in between, and the caller won't even notice that the function is terminating and sleeping behind the scenes, so that's pretty fancy.

Next, idempotency. If you invoke a durable function through sync or async, you can pass an execution name, and that gives you the deduplication effect that we spoke about earlier. We also have, not shown in the demo, invoking other durable or non-durable functions. You will likely have a lot of existing Lambda functions, so in your context you can do context.invoke and invoke non-durable and durable functions. While those are running, your durable function will go to sleep and suspend, which means function chaining now becomes somewhat less of an anti-pattern, because you get the reliability and you don't pay for the wait while the other function is executing.

Obviously, Lambda functions support event source mapping, so durable functions do so as well. However, event source mappings today also invoke your functions synchronously, which means you're bound to the 15 minute execution duration. If you want to go for longer, you can invoke a non-durable function, for example, and just dispatch on an async path, but that is just the way event source mappings work today, and that's for a reason. You might want to use SQS as a buffer with concurrency controls in your Lambda functions. They do still apply. Therefore, with a synchronous behavior in this sense, we still keep the promise of concurrency control. Alternatively, if you have FIFO order-based event sources, synchronous ensures that the processing remains in order for the system. So we didn't want to break you in this way.

And all the other integrations, S3, EventBridge: wherever you can pass a function ARN, things will just work. Versions and aliases are not just supported but also very important for durable functions, because the checkpoint replay behavior requires that your code remains deterministic when it's re-executed. Therefore, we don't allow unqualified ARNs, and this is a safety behavior that we put into the system, because unqualified ARNs are kind of saying, I don't really care what's going on. But if you tell us you don't care, we also can't really care about your code, because we don't know what version of the code you want executed. So replay becomes hard in those scenarios.

Therefore we only support latest, as kind of a you-only-live-once mode if you really want to do fast prototyping, or proper versions and aliases, because they are strict. They are kind of required in the replay model of durable functions. Yeah, I'll throw this out: with the IaC tools, SAM, they're going to handle a lot of that for you. True, yeah, and to be honest it's kind of a best practice to know what code is executing in production.
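To make the qualified versus unqualified distinction concrete, here is a small helper that inspects a Lambda function ARN. It relies only on the standard function ARN layout, where an optional qualifier (a version number, an alias, or `$LATEST`) follows the function name; the helper names and the sample ARNs are made up for illustration.

```python
from typing import Optional


def get_qualifier(function_arn: str) -> Optional[str]:
    """Return the version or alias suffix of a Lambda function ARN, if any.

    Layout: arn:partition:lambda:region:account:function:name[:qualifier]
    """
    parts = function_arn.split(":")
    if len(parts) not in (7, 8) or parts[2] != "lambda" or parts[5] != "function":
        raise ValueError(f"not a Lambda function ARN: {function_arn}")
    return parts[7] if len(parts) == 8 else None


def is_qualified(function_arn: str) -> bool:
    """True when the ARN names $LATEST, a version, or an alias explicitly."""
    return get_qualifier(function_arn) is not None


# Sample ARNs with a placeholder account ID and function name.
base = "arn:aws:lambda:us-east-1:123456789012:function:coffee-orders"
unqualified = is_qualified(base)
version_qualifier = get_qualifier(base + ":99")
alias_qualifier = get_qualifier(base + ":prod")
latest_qualifier = get_qualifier(base + ":$LATEST")
```

A check like this could sit in a deployment script to catch unqualified ARNs before they reach an invoke call; whether you need it depends on how much your IaC tooling already handles.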

x86, ARM, I'm just gonna skip through some of these kind of things. Dead letter queues are supported, and you might still want to use them. For example, if an execution fails or it can't be executed, you want to put it in that dead letter queue. Layers and extensions are supported, but we couldn't test all the layers and extensions out there, so there might be some things, technologies, SDKs, integrations that might not be aware of the durable functions, so we have to work together with them to make them aware of durable functions, of this replay behavior. VPC attachments, running them in private networks, concurrency settings, I already explained, especially for event source mapping with SQS, quite interesting. SnapStart works, and Powertools. Powertools doesn't just work; we really worked amazingly together with the Powertools team, if they are in the room or watching in one of the other broadcast rooms. Amazing folks, great support, kudos to you. Awesome. We've got more covered in the Lambda developer guide. There's a big new section on durable functions, so it has all the details, so please take a look there.

On the security side, we introduced a new managed policy because while you might not want to distinguish between non-durable and durable functions in the future, we still want to make sure that you can gracefully adopt this technology. So we introduced new IAM conditions and resource policies so you can gradually roll out durable functions and also prevent access or control gate the usage of durable functions. Therefore, we have these two new checkpoints in the system. There are more APIs that we offer like the callbacks, for example, or the durable invokes. They are not part of the managed policy for security reasons. You can just adopt and amend them.

On the encryption side, data encryption at rest, we support the existing customer managed KMS keys on Lambda function for code, environment variables, and ESM filters. At launch, we only support an AWS owned KMS key for the checkpoint encryption. I'll come back to this in a second. For monitoring, we obviously integrate with CloudWatch logs. You've seen this. Anything that you log in your function will just be logged. However, we have this new context logger which is replay-aware, so it will suppress logs on replay so you don't get spam in your system. However, even that logger can be configured for some testing. You might actually want to see the duplicate logs printing out, so that's not a big change on CloudWatch here.
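The idea behind a replay-aware logger can be sketched in a few lines. This is a toy model, not the SDK's context logger: `ToyReplayAwareLogger`, its `log_during_replay` flag, and `run_invocation` are hypothetical names used only to show the suppression behavior.

```python
from typing import List


class ToyReplayAwareLogger:
    """Hypothetical logger that stays quiet while code is being replayed."""

    def __init__(self, log_during_replay: bool = False) -> None:
        self.log_during_replay = log_during_replay
        self.replaying = False
        self.lines: List[str] = []   # stands in for CloudWatch output

    def info(self, message: str) -> None:
        if self.replaying and not self.log_during_replay:
            return                   # already logged in an earlier invocation
        self.lines.append(message)


def run_invocation(logger: ToyReplayAwareLogger, steps: List[str],
                   already_checkpointed: int) -> None:
    """Walk the steps; the first `already_checkpointed` ones are replays."""
    for index, step_name in enumerate(steps):
        logger.replaying = index < already_checkpointed
        logger.info(f"running {step_name}")


quiet = ToyReplayAwareLogger()
run_invocation(quiet, ["reserve", "charge"], already_checkpointed=0)
run_invocation(quiet, ["reserve", "charge", "ship"], already_checkpointed=2)

verbose = ToyReplayAwareLogger(log_during_replay=True)
run_invocation(verbose, ["reserve", "charge"], already_checkpointed=0)
run_invocation(verbose, ["reserve", "charge", "ship"], already_checkpointed=2)
```

With suppression on, each step is logged once across both invocations; with it off, the replayed steps show up again, which can be useful while testing.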

CloudTrail is supported and we also emit new CloudWatch metrics for how many durable executions are there running in my system, which ones have failed, and all the operations that you create like checkpoints and waits. X-Ray is also supported for tracing. Something that's really cool and the person is even in the room who worked on this, we now have notification support with EventBridge, so durable functions emit execution status change events to EventBridge so you can capture execution completions when they succeed, and even when they fail through EventBridge. So here's an example of a running execution so you can even see the progress of those executions through EventBridge events in the system.

For quotas, there are new quotas we are introducing for durable functions. Some of them are also to protect you and your workloads. For example, the number of running open executions. We have 1 million here, but you might want to adjust based on your needs and a couple other quotas as well for the system. Quotas should, by the way, be high enough to cover most of the use cases. Please discuss them with us if you need more.

Best Practices, Code Considerations, and Pricing Model

Okay, let's move to best practices. I think the biggest one here, and we've discussed this with the community before, is start simple. Don't build these gigantic functions. The monolithic experience is great, but what Eric was not saying is throw all your 1,000 Lambda functions into one gigantic Lambda function. Good coding practices, I think that's what you did not want to say, correct? Okay, awesome, yeah, just want to make sure because you're not a developer, you say I'm not, you're just a drummer, awesome.

So coding best practices still apply, and with the durable invokes and the callback patterns and so on, you still have good ways to compose your applications through multiple Lambda functions, including non-durable functions. The SDKs are also super fast moving. So while it's easy to get started in the console with the SDKs that we bundle with the runtime, we recommend you use your favorite package manager to bundle the SDK from the open source GitHub repositories, because they move much faster than the runtimes can provide those new features. Okay, the bundled SDK is good for getting started in the console, but please use the SDK directly.

Your AI copilot was super helpful for you, and that's true. However, all those agents are based on knowledge from before durable functions existed. Therefore you might have to prime your LLM agents a little bit before running them, with our existing code examples and the blog posts Eric and the community are putting out, to make sure they understand what durable functions are, because otherwise they're just going to make stuff up. As this evolves and the context gets added to these LLM agents, it will naturally get better. Because, as I said, this is still a Lambda function, the timeouts that you can put on a Lambda function still apply, so we now have two timeouts that you can adjust. One is for the function event handler execution, for the single loop that it's doing, and one is for the whole execution duration, which can span multiple of these invocations. So do make sure that your function timeout allows sufficient time to actually get through a successful execution.

Now let's go a little bit more into the code. We talked about the checkpoint replay and the versioning behavior. Because your code can be re-executed multiple times, although we skip in some places, make sure that any non-deterministic code, like generating UUIDs, timestamps, Math.random, and these kinds of things, is wrapped in steps. Any non-deterministic code needs to be within these step bubbles, because otherwise it's going to lead to different outcomes on replay, which you don't want.
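A short sketch shows why this matters. The `ToyContext` class below is a hypothetical stand-in for a durable context (it is not the real SDK); its `step` method simply returns the checkpointed value when one exists. The handler generates one UUID outside a step and one inside.

```python
import uuid
from typing import Any, Callable, Dict, Tuple


class ToyContext:
    """Hypothetical durable context: a step runs once, then is replayed."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        self.checkpoints = checkpoints

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name not in self.checkpoints:
            self.checkpoints[name] = fn()   # first run: execute and record
        return self.checkpoints[name]       # replay: reuse recorded value


def handler(ctx: ToyContext) -> Tuple[str, str]:
    # Outside a step: regenerated on every replay, so the value changes.
    unsafe_id = str(uuid.uuid4())
    # Inside a step: generated once, then read back from the checkpoint.
    safe_id = ctx.step("generate-order-id", lambda: str(uuid.uuid4()))
    return unsafe_id, safe_id


store: Dict[str, Any] = {}
first_unsafe, first_safe = handler(ToyContext(store))
replay_unsafe, replay_safe = handler(ToyContext(store))
```

After the replay, `first_safe` and `replay_safe` are the same value, while the two unwrapped IDs differ. Any later logic that branches on the unwrapped value could take a different path on replay.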

We covered the logger already, so you can switch logging on replay on and off or use your own logger. The SDK has its own concurrency primitives to also ensure reliable replay. You might know these from TypeScript or JavaScript, for example in Node, where you have Promise.race, Promise.all, and so on. We provide deterministic, safe versions of those to ensure that even on replay they behave correctly. Now, I mentioned this in the security section for encryption.

The steps also allow you to provide your own serializer and deserializer for the checkpoint data, and this is for a couple of reasons. Sometimes you might have a very complex object that you want to checkpoint, and therefore you can provide your custom serializer and deserializer, which will be used to process the data eventually in the backend. Also, if you have large payloads exceeding our checkpoint sizes, you can use your custom serializer to offload to another data store like S3, and on replay we will just use that to retrieve the data. And lastly, the serializer can also be used if you have very specific encryption requirements, so that you can encrypt your data, including with a CMK, before it hits the backend.
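The offloading idea can be sketched as a serializer pair that keeps small values inline and stores large ones elsewhere, leaving only a reference in the checkpoint. Everything here is an assumption made for illustration: the `OffloadingSerdes` class, its method names, the 64-byte threshold, and the dictionary that stands in for a store such as S3. The real SDK's serializer interface may look different.

```python
import json
from typing import Any, Dict


class OffloadingSerdes:
    """Hypothetical serializer: large values go to an external store."""

    def __init__(self, blob_store: Dict[str, str], max_inline_bytes: int = 64) -> None:
        self.blob_store = blob_store            # stand-in for S3 or similar
        self.max_inline_bytes = max_inline_bytes

    def serialize(self, step_name: str, value: Any) -> str:
        body = json.dumps(value)
        if len(body.encode("utf-8")) <= self.max_inline_bytes:
            return json.dumps({"inline": value})
        key = f"checkpoints/{step_name}.json"   # made-up key scheme
        self.blob_store[key] = body             # offload the payload
        return json.dumps({"ref": key})         # checkpoint only the pointer

    def deserialize(self, data: str) -> Any:
        envelope = json.loads(data)
        if "ref" in envelope:                   # fetch offloaded payload
            return json.loads(self.blob_store[envelope["ref"]])
        return envelope["inline"]


external_store: Dict[str, str] = {}
serdes = OffloadingSerdes(external_store)

small_value = {"ok": True}
large_value = {"items": list(range(100))}

small_checkpoint = serdes.serialize("validate", small_value)
large_checkpoint = serdes.serialize("fetch-menu", large_value)

small_restored = serdes.deserialize(small_checkpoint)
large_restored = serdes.deserialize(large_checkpoint)
```

The same hook is where custom encryption would go in this model: encrypt in `serialize` before returning, and decrypt in `deserialize` before parsing.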

What we didn't really talk about, when we covered steps and checkpoints, is that you control what you checkpoint. There will always be a checkpoint when you use a step, which will just be the name, and only if you return something from a step will it be persisted in the system. So you have full control. You can say, I don't really want to return anything from my step, I just want to see it in the observability, like in the execution history, and that's perfectly fine. But you can obviously also return something, which then will be part of the checkpoint. And there's much more, including details, examples, and best practices, in our developer guide.

Okay, let's talk about pricing. For pricing, we wanted to achieve different use cases and requirements. We wanted to make durable functions work for any scale, whether it's your little hobby project like Eric opening a coffee shop or you're running this at scale in your payment processing system. Pricing should also be flexible and transparent, so if you don't persist state in those steps, you shouldn't be charged for persisting those states. Same for if you have different duration and retention requirements, the system should adapt.

Therefore, we introduced three new dimensions for durable functions. One is the number of operations, checkpoints, like steps and waits that you perform in your system, and this is eight dollars per million. Prices are all for US East 1. And then we have two storage dimensions. One is for the data that you write within these checkpoints. I mentioned before that you fully control that process. If you offload, sideline, or don't even return, there's no data written in the system. And then there's the data retained for the data that you persist within the durable functions backend, which you fully control. Note that existing Lambda compute charges do apply, so when a function runs, the compute charges still apply.
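As a quick worked example of the operations dimension, the snippet below applies the eight dollars per million figure quoted in the session. The workload numbers are invented, and the two storage dimensions and the regular Lambda compute charges are left out because no prices for them are given here.

```python
# Price quoted in the session for US East 1; check current pricing before relying on it.
OPERATION_PRICE_PER_MILLION_USD = 8.00


def operations_cost_usd(operations: int) -> float:
    """Cost of durable operations (steps, waits, and so on) only."""
    return operations * OPERATION_PRICE_PER_MILLION_USD / 1_000_000


# Hypothetical workload: 50,000 executions a month, 10 operations each.
executions_per_month = 50_000
operations_per_execution = 10
total_operations = executions_per_month * operations_per_execution

monthly_operations_cost = operations_cost_usd(total_operations)
```

For this made-up workload that is 500,000 operations, or four dollars a month for the operations dimension alone.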

Choosing Between Step Functions and Durable Functions: Final Thoughts and Call to Action

Oh, what is this elephant? What is the elephant? So I get to talk about the elephant in the room. It is not me. But the elephant in the room, and we get this, is choosing a service. How many of you all are wondering that? When do I do Step Functions? When do I do durable functions? I don't know. I do know, I have some opinions on this.

Well, here's what I would tell you: we unapologetically offer both. We think these are both fantastic services, right? We think different people use things in different ways, and the two services have some differences, so we have a little guidance we want to throw up here. You know, if your primary focus is workflow orchestration across AWS, where I'm orchestrating a bunch of different services, batches, things like that, Step Functions might make more sense to you, right? Whereas if I'm doing application code orchestration, then durable functions make more sense. But again, that line is super fuzzy, and I'm not going to read all this to you, but you kind of get the idea of some things that we thought about.

But in reality, like I said, unapologetically, we know that you can do a lot of things with either one of these. So it really comes down to how do you think about it? Do you want to use orchestration to cross AWS services? Go with Step Functions. You want to do it in an app, durable functions. If you want a visual builder, Step Functions.

So it really comes down to what do you want and how do you approach it, right? There are a lot of different things to consider. So the reality is, what do you prefer? What do you want to do? They're both going to be there, they're both available, and we're working full-time on both. So I encourage you to let us know your preference. I want to hear back from you on that because we want to know, right?

Okay, the future is durable. We see more and more needs for this, and so we encourage you to check this out. We want you to play with this. We want to hear feedback. Twitter, LinkedIn, let me know what you're thinking. But here's a couple of things I would give you to kind of walk away with. First is build like a monolith. Now, back to what we were saying, I'm not saying build a bunch of what I call fat lambdas with a PH, or Lambda. We don't know if that's working or not, but I'm not saying go out and build those. But what I'm saying is get everything on a single screen if you can.

Build like a monolith, deploy with microservices. Enjoy that single pane of glass. No more choosing between simple and reliable, you have both. Either way you go, however you want to do it, right? Choose the right tool for the right job. Again, it comes down to that right tool for the right job. How are you going to do that? How does your preference weigh in, right?

And I love that statement because so many times every good technical question, hey, how do you do this or how do you do this, has a solid, anybody know the common answer? It depends. It depends. That's right. Did you say that? No, it sounds like that was awesome, yeah. So it depends. There's always a bunch of different ways to do things, so I encourage you to do that.

So use your familiar programming languages with this, focus on business logic. Yeah, we absolutely encourage you to do that. Sorry, Mike. So, all righty. And finally, how to reach out. AWS forums, support channels, feedback mechanisms, we want to hear from you. This is just the beginning. We're going to be working on this. Michael and team have done a fantastic job. And with that, one sec, go ahead.

The SDKs are open source, so issues, contributions are very welcome. Yeah, yeah, yeah. With that, Michael, thank you for letting me speak with you. I appreciate that. You're the demo master. Give us feedback, tell us how we did, what can we do better. I hope you have a great rest of the half of AWS that's left. We'll see you later. I'm talking about Step Functions later today, both ends of the spectrum. Thank you so much.

; This article is entirely auto-generated using Amazon Bedrock.