🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Building event-driven architectures using Amazon ECS with AWS Fargate (CNS307)
In this video, Eric Johnson and Matt Meckes from AWS demonstrate how to build event-driven architectures using Amazon ECS with Fargate. They introduce Sarah, a developer whose synchronous microservices architecture failed during Black Friday due to tight coupling. The presenters explain EDA fundamentals and showcase various integration patterns: public APIs with EventBridge and API Gateway, point-to-point messaging with SQS using custom metric math for auto-scaling, event-based containers with Step Functions' callback tokens, and the activities pattern for batch processing. They cover event routers (EventBridge, SNS), event stores (SQS), and streams (Kinesis, MSK), emphasizing how ECS Managed Instances and Fargate provide serverless container orchestration. The session includes practical examples of transforming tightly coupled services into loosely coupled architectures, demonstrating how one customer reduced Step Functions costs from $450 to $1 per invocation using the activities pattern with ECS workers.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Sarah's Black Friday Crisis and the Challenge of Synchronous Microservices
How is everyone doing? I'm glad to see you're good. That makes me happy. I'm so excited you're here. We started early—it's not even time yet—but we did it because we're excited. My name is Eric Johnson, and I'm a Principal Developer Advocate at AWS. Let me get our slide up here. Alright, now we're seeing what we should see. Matt, introduce yourself.
Hi everybody, my name's Matt. I'm part of the specialist team here at AWS. I'm a container specialist. I spend most of my time helping customers build cool, exciting, interesting architectures with ECS and Lambda and all of our container technologies.

I have to be honest, I'm super thrilled to talk with Matt. Matt and I have traveled a lot together, and I have now learned to say his last name properly. It's Meckes, not Meeks. At least that's what he tells me.
So how many of you have heard me speak before? A good room. Do you know the rules? A couple of you know the rules. Well, for those of you who don't know the rules, here they are. When I'm speaking, I like to have a lot of fun, and these rules will really help you as we go through today. There are three of them. The first is this is any number I want it to be. I'm going to hold this up and say there are three rules, and someone who comes in late is going to say that's not a three. Well, it is because that's the rule.
The second is these are quotes, not apostrophes, because this looks dumb. And finally, these are thumbs because this will get you beat up. So those are the three rules. I do like to make a lot of one-finger jokes. I will make jokes as we're going through today. I'm very comfortable with that. I was born this way, not just for the first time this morning. However, if that makes you uncomfortable, I'm also comfortable with that, so we'll have a good time today. Super happy you're here.
Matt and I have been working through this and are excited about this deck. How many of you are using ECS now? Good. How many of you are doing event-driven architectures with ECS? Well, why are you here? You want to learn how to do it better. We really have to meet the bar. Exactly. There's a lot of pressure on us to hit this. So that's what we're going to be talking about today: event-driven architectures using Amazon ECS with AWS Fargate. That may not be the first thing you think about. You may not think ECS and Fargate together, but it's a very viable solution. Today we're going to talk through that.
We want to introduce you to a buddy of ours. This is Sarah. Let me ask you this: how many of you are developers? How many of you dabble with code but don't claim to be a developer? If you dabble with code, you're a developer. This is Sarah. Sarah runs a mid-size development team, and they use ECS. They primarily use ECS and they love ECS and they love Sarah.
Here we are. This time it's November 27th. It's Black Friday, late in the evening. Anybody been watching their architecture right before Black Friday? If you haven't, you haven't been a developer long. I've been that developer in the middle of the night, watching, and you're good, you're solid. Things are looking okay. Maybe there's a little hiccup, but you're okay. And then all of a sudden, ten minutes later, everything falls apart. Ten minutes after you drop a brand new SKU, something exciting is coming out, your system can't handle it. And this is where Sarah's at. We've all been there at 3:30 in the morning. If you haven't had your head on a desk crying, you're not a developer. We've all been there.
Understanding the Problem: From Synchronous Microservices to Event-Driven Architecture
But here's the truth: this isn't an ECS problem. It's an architectural problem. If you don't know who that is, that's me. Or Matt. Really, it's just a beard that makes a difference. So what went wrong?

Well, like so many people, when Sarah built her architecture, she built it using synchronous microservices. Microservices are a good thing, right? We hear that all the time—small, independent services that run together. We love that. But the synchronous part can get us into trouble. When we make that order and hit the discount, everything's good. The next thing we're going to do is check our loyalty, and oh no, our loyalty blew up. The whole thing fizzles and falls apart because we've tightly coupled this architecture and it's all dependent on each other. This probably isn't news to a lot of you. You might have some architecture this way. Anybody responsible for that old code that's built like this? I've been there. So how do we fix this?
Sarah needs a different way. When we say a different way, let's look at the synchronous model and see why we want to change that. When you think about synchronous request-response models, you think these are good, right? They have some advantages. They're low latency. They're very fast. It's simple. Anybody can build a synchronous microservice. And it fails fast. When we're going to fail, we want it to fail fast, and we want it to fail in development. That's a good thing.
But there are some pretty big disadvantages to this as well, ones worth addressing. The first is throttling. When you build these out, your producer needs to know about your consumers, and as traffic grows, it gets complex. The producer can get throttled because it can't send everything, or maybe your consumer gets throttled because it can't handle everything. Then you get resiliency issues, and you have problems on Black Friday at 3:30 in the morning. That's where you're at. So how do you fix that? That's what we're going to talk about today in relation to running microservices and running things on ECS.
How do we think you should do this? It really comes down to not using synchronous microservices. It comes down to what we like to do: insert a broker of some type. Rather than in that previous model where the producer talks to the consumer and the consumer responds to the producer, instead we're going to put a broker in. This broker is just a term we're using. I'm not picking a specific one. We'll get into that in a bit. This is the idea of something in the middle, and their entire job is to say, "Hey, I got your request. Have a good day." Then the consumer, either through polling or pushing or any different kind of models, will say, "Hey, give me that," or "Hey, thank you for that, and I'll acknowledge." Maybe they acknowledge to the broker, maybe they acknowledge back to the client. There are different patterns we'll look at. But this idea of having something in the middle to help break these apart and decouple the producer from the consumer is where we get into a lot more flexibility and reliability.
We're talking about this idea in what we call event-driven architecture. Now, let's look at what event-driven architecture means. Event-driven architecture, or EDA, is a distributed computing paradigm implementing asynchronous message passing semantics through intermediary message brokers, enabling temporal and spatial decoupling of system components via non-blocking IO operations and persistent event storage. I intentionally had that complex explanation because sometimes EDA is a mystery to us. We look at it and go, "Oh, that's scary stuff." But in reality, I'm going to give you a more technical definition of EDA. If you were ever going to write something down or take a picture, this is the time, because this is going to get super technical. I might have to do it a couple of times.
Let me explain EDA to you. Here's a less technical but still technical definition of EDA. Something happens, and we react. That's EDA. Yes, that's oversimplified, and yes, it's a little tongue in cheek, but it's the idea that rather than our consumers going, "What are you doing? What are you doing? What are you doing?" or our producers going, "Here's what I'm doing. Here's what I'm doing. Here's what I'm doing," and the logic getting very coupled, we're able to say, "Let's break that apart and make them decoupled from each other." So what does that look like?

Well, here in the before-EDA picture we have this producer who's talking to the consumers, and the producer needs to know about each of these consumers. He needs to know who's consuming my data, and so it needs to send to each one. So if something happens, it's going to send to consumer 1, send to consumer 2, send to consumer 3. We kind of saw this pattern already. So let's throw in a broker, some kind of broker, and break this down and say, "OK, now instead I'm going to just send out an event and I don't care what happens after that." Metaphorically and allegorically (I don't even know the right words), we care, but the producer doesn't care. The producer just says, "Hey, I did my job, here's an event. Whoever consumes it, do it." So these consumers go, "OK, yeah, I want that. I want that. I want that."
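The decoupling described above can be sketched with a toy in-memory broker. This is illustrative pseudarchitecture, not any AWS service: the point is only that the producer publishes an event without knowing who, or how many, consumers exist.

```python
from collections import defaultdict

class Broker:
    """A toy in-memory broker: producers publish without knowing the consumers."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        # Consumers register interest; the producer never learns about them.
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, detail):
        # The producer's job ends here: "I did my job, here's an event."
        for handler in self.subscribers[event_type]:
            handler(detail)

broker = Broker()
received = []
# Adding a new consumer requires no change to the producer at all.
broker.subscribe("order.created", lambda d: received.append(("loyalty", d)))
broker.subscribe("order.created", lambda d: received.append(("shipping", d)))
broker.publish("order.created", {"order_id": 123})
```

If the loyalty handler were removed or crashed, the producer's `publish` call would be unchanged, which is exactly the agility and resilience argument being made here.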
And so there's some benefits to this. When we build this out and we say OK, we're going to break these apart, it gives us the ability to build very scalable applications, right? We're able to build bigger because we're not so tightly coupled. It also gives us resilience, right? So if consumer A goes down, consumer B is still chugging along. If there's some dependency on each other, we can handle that, but they can still handle that gracefully. We can degrade gracefully, right? And finally, it gives us a lot of agility. When you build this way rather than constantly updating that producer to send it to, you need to know more logic to talk to these consumers. All we do is we add another consumer and say I'll consume that as well. So we can do that unbeknownst to the producer, we can do that right.
The Mental Model for Transformation: Business Logic, Compute, and Event Routers
This is a really cool pattern, however, here's the catch. Like so many of you, Sarah's asking the question, how do we go from synchronous microservices to event-driven architecture? And that's why we're here today, right? So I'm going to turn it over to my buddy Matt, and Matt's going to walk you through some of this, and we'll be talking about some things. So I think this is often when someone like me gets involved. You've got a customer who's had a major problem, and they want to make some sort of transformation or modernization, and we've got to think about what's the kind of mental model we're going to use to go through this journey from kind of monolithic synchronous services to EDA.
I like to think of three real buckets of categories of questions we want to think about. First we've got our business logic, so the actual stuff that we care about as developers, the stuff we've written. We've got to run that somewhere, so we've got to think about what's our compute. Eric talked about event routers and he just put a kind of nice green circle in his diagram and said, hey, we've now decoupled. There's a lot more subtlety to that, so we've really got to think about what event routers are we going to use, what are the trade-offs, what are the patterns that we can apply.
And then we've got to take those components and combine them into all of these architectural patterns and blueprints that we want to use at scale, because whilst "something happens, we react to it" is really, really simple on the surface, there's actually a lot of subtlety. Within the category of event-driven architectures, there's a whole bunch of quite precise sub-patterns we can use that have their own trade-offs and complexities, and we want to codify those so we can use them at scale across our organization, baking best practice into what we do.
So first off we've got some options around compute. One of the really common things I see customers do when they come out of a big outage and their CTO is grumpy ("hey, we built this modern microservices thing, we're running it on ECS, and we still went down") is they kind of throw the baby out with the bathwater. We're going to throw all this away and we're going to refactor to Lambda, because we've seen that Lambda is scalable and, you know, AWS takes care of it and we don't have to think about it.
You know, Lambda's great at scaling, but you know, you can still with bad architectural patterns, get yourself caught into some dead ends. Or let's throw away this ECS thing. What if we move to Kubernetes and the Kubernetes ecosystem? Maybe we can solve all of our problems with this technology change, and yeah, that's going to be more scalable, more powerful, look at all these cool open source things that we can buy and we can use this whole ISV ecosystem. Well, actually, you know, I'd like to propose that when you're going through a big architectural change, you want to really focus on the bits where you can add value, and you probably don't want to change everything at once.
If you're changing programming language, don't change architectural pattern. If you're changing architectural pattern, maybe don't change your compute. Limit yourself to a set of things that you can change. And also, we want to go on an incremental journey. We don't want to say, "Hey, we're going to spend the next year building a brand new platform that's going to be ready in early November, so when Black Friday comes around we can throw a big switch and swap everything out underneath." We want to make a series of changes, iterating every week towards this path over the next year, so when our next big sale comes around, we're all ready.
So my hypothesis is, let's go and do this, let's stay on ECS and let's see how we can build this with ECS because we're going to make a whole bunch of other interesting changes.
Choosing ECS and Fargate: Compute Options for Event-Driven Architectures
A little bit of information about what ECS is—I think we've got a lot of hands raised in the room, so I won't belabor the point too much—but I think it's a pretty cool service run by AWS that allows you to deploy, run, and launch containers at scale. It does a whole bunch of optimizations behind the scenes for you, so you can focus on delivering business value.
Customers tend to agree with this approach. Something like 3 billion Amazon ECS tasks are launched every week, which is a completely wild number. And a really interesting one that I really like is that when people start building with containers, the majority of them start building with ECS. So it's a really common pattern. When you're using the ECS control plane, there are a couple of options for your data plane. You could use Fargate, our serverless data plane, where you just define your tasks and your services and we'll manage that compute for you. We have an isolated workload model where you get strong tenant isolation and you only pay for the compute that you're using. However, we're making some choices on your behalf that we think are sensible, but they're not the right choices in every case.
You can also bring your own data plane, so you can say, "I want these specific instance types in this size and I'm going to manage scaling them out," and ECS is going to run your tasks on those EC2 instances. One of the big pre-Invent releases a couple of weeks ago was ECS Managed Instances. I'll do another little poll in the room—who's seen the ECS Managed Instances release in the last few weeks? Has anyone heard of it? Yeah, a few hands, but it's pretty early still. ECS Managed Instances and the Lambda Managed Instances that were launched last night are really interesting models where you can bring your own data plane of EC2 instances, but AWS does all the management of those instances on your behalf. We'll do all the patching, OS patching, and management of those instances. You just define what instance types you want to use.
So you get the flexibility of saying, "I want a specific Graviton instance in my fleet because I've worked out my Java code runs best on those instances," but without all of that overhead. It's also got a really interesting modification to the capacity provider that means it scales faster than running your own EC2 instance fleet, which is really cool. So that moves the shared responsibility a bit: instead of having to think about your own autoscaling and launch templates and availability distribution and instance selection, you can hand that over to AWS. You define the instances that you want in your task templates, and we'll take care of it. And you still get all the benefits of your savings plans and all your capacity planning.
However, it doesn't mean that Fargate isn't still the default way that people are building on ECS, and I think that most people should probably start with the most serverless option, which is Fargate, and only break out if they've got more complex requirements. So we've chosen ECS, we're going to continue running on Fargate, and we think that's still a sensible option. There are a couple of different models when you're running with ECS. The first is running a standalone task. With a standalone task, something happens and we make a run-task call. The Fargate service goes, "Okay, great, something's happened, let's react." So we're going to provision some compute to deal with that event, and that takes some time, so we'll go through a provisioning state. We're then going to run that task, and afterwards we're going to stop that container and not reuse it. Now that's a really good pattern and really powerful, but it is creating a new container to run each of your events. So it's not really appropriate if you're sending thousands of events—we're trying to do our Black Friday sale, we've got 1,000 orders a second coming into our e-commerce store. We don't want to be using this model for that. It's going to get us into some trouble.
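The standalone-task model above can be sketched with the parameters you would pass to boto3's `ecs.run_task`. The cluster, task definition, and subnet names here are hypothetical placeholders, not anything from the talk:

```python
# Sketch of launching a standalone Fargate task per event via ecs.run_task.
# All resource names below are hypothetical placeholders.
run_task_params = {
    "cluster": "orders-cluster",
    "taskDefinition": "video-transcode:3",
    "launchType": "FARGATE",
    "count": 1,  # one fresh task per event: fine for rare events, costly at 1,000/sec
    "networkConfiguration": {
        "awsvpcConfiguration": {
            "subnets": ["subnet-0abc"],
            "assignPublicIp": "DISABLED",
        }
    },
}
# In a real event handler you would then call:
#   import boto3
#   boto3.client("ecs").run_task(**run_task_params)
```

Each invocation pays the provisioning latency and tears the container down afterwards, which is why the speakers steer high-throughput workloads toward the service model instead.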
We also have this other model where we can run as a service. So this is where we can define some compute, so we're going to say we want a vCPU, we want a couple of gigs of RAM, we want to have at least 4 copies, and we want to scale that out with some scale parameters. And Eric's going to talk about some of the metrics you can use to scale out later. So we're going to go and build a target group that's going to be running, and actually we can think about how we can send events to that statically stable compute and how it's going to scale out.
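The "run as a service" model just described (statically stable compute with a floor of four copies that scales out) can be sketched as two parameter sets: one for the ECS service, one registering it with Application Auto Scaling. Resource names are hypothetical; the dict shapes mirror boto3's `ecs.create_service` and `application-autoscaling.register_scalable_target`:

```python
# Sketch of the service model: a stable fleet with scale-out bounds.
# Resource names are hypothetical placeholders. CPU/memory ("a vCPU, a couple
# of gigs of RAM") would live on the registered task definition itself.
service_params = {
    "cluster": "orders-cluster",
    "serviceName": "order-processor",
    "taskDefinition": "order-processor:7",
    "desiredCount": 4,            # "at least 4 copies"
    "launchType": "FARGATE",
}
scalable_target = {
    "ServiceNamespace": "ecs",
    "ResourceId": "service/orders-cluster/order-processor",
    "ScalableDimension": "ecs:service:DesiredCount",
    "MinCapacity": 4,             # never scale below the stable floor
    "MaxCapacity": 50,            # illustrative ceiling
}
# boto3.client("ecs").create_service(**service_params)
# boto3.client("application-autoscaling").register_scalable_target(**scalable_target)
```

Scaling policies (target tracking or the step scaling discussed later) then attach to this scalable target.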
Eric will go into some more details around that. We have these big patterns, and my hypothesis is these are really powerful and useful characteristics when we're building event-driven architectures. We have scalability, we can have target tracking, and we can build all sorts of interesting target tracking with custom metrics. Containers and microservices work very well together with these patterns. Because ECS is an AWS managed service, it has native integrations with all of the building blocks we use to build event-driven architectures. Things like Step Functions, EventBridge, and even SQS and SNS all work very well with ECS. You can use all of those things like managed instances and savings plans and Spot to make sure that you're getting the best price for your compute and the best total cost of ownership.
Event Brokers and Routers: EventBridge, SNS, SQS, and Streaming Services
So, we've settled on compute. Our next big category of questions is what event brokers and routers are we going to use. Our first category of brokers is routers. Routers are a pattern where the router processes one event at a time, and you can point that event at multiple targets and protocols, so it's a one-to-many pattern. You're building logic at the router, so you can build rules and filtering rules and targets at the router. You've got quite a smart, sophisticated router that's doing a whole bunch of work for you, and that means the router can handle things like retries, dead-letter queues, error handling, and even things like throttling and rate limiting.
It's a pretty smart router, which means you can keep the logic at your receiver and your sender simple. Your receiver just has to either pull an event through a native integration or make a simple API call. You don't have to build a lot of error handling and additional logic into your producers and consumers. That allows you to really efficiently reduce coupling. In the Amazon world, we're usually talking about EventBridge when we're talking about this pattern. It's a serverless service with native integrations with most of the things on AWS, and it allows you to build global architectures with global endpoints.
The EventBridge architecture starts with defining some event sources, which could be an AWS service or custom events, and in our case, a lot of custom events. Those then go to a number of buses, so a default event bus or a custom event bus, depending on whether it's an AWS event or your own event. Then we have rules, and a rule is the way that we map those events into the downstream services. What is an event? This is one of the things that we have to be really deliberate about when we're building event-driven architectures. We go one step deeper than "something happens and we react." Something happened in the past, and we emit a signal that state has changed. An order has been created, a channel was created, or the lights switched off. We can't change it, so we can't un-switch-off the lights. We have to emit a new event that says the lights are switched on. These events act as a contract between our producers and our consumers. In the EventBridge world, that's a piece of JSON.
An EventBridge rule is pretty simple. We have a bit of JSON, which is our event that our producer has emitted. Our rule has another bit of JSON, which says our source is com.flix, our region is Australia and New Zealand, and we want to get all of those events and pass them to a particular consumer. They've matched, that's great. Those consumers can be a whole bunch of different patterns, so we can point to Lambda, and lots of event-driven architectures are built with Lambda, but we can also send that to Amazon ECS as we discussed earlier. You can also talk to API destinations, and that's a really powerful feature of EventBridge, and Eric's going to go into some detail around the nuances of that shortly. And there are a whole bunch of other services.
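The matching just described can be made concrete with a minimal sketch. The `com.flix` source comes from the talk's example; the region value and detail fields are hypothetical, and this toy matcher covers only exact-value matching, whereas real EventBridge patterns also support prefix, anything-but, numeric ranges, and more:

```python
# An event (emitted by the producer) and a rule pattern, both plain JSON-style dicts.
event = {
    "source": "com.flix",
    "detail-type": "OrderCreated",
    "detail": {"region": "anz", "order_id": 42},  # hypothetical detail fields
}
rule_pattern = {
    "source": ["com.flix"],
    "detail": {"region": ["anz"]},
}

def matches(pattern, evt):
    """Toy EventBridge-style matcher: lists are OR-ed candidate values,
    nested dicts recurse into the event's structure."""
    for key, expected in pattern.items():
        if key not in evt:
            return False
        if isinstance(expected, dict):
            if not isinstance(evt[key], dict) or not matches(expected, evt[key]):
                return False
        elif evt[key] not in expected:
            return False
    return True
```

A matched event is forwarded to the rule's targets (Lambda, ECS, an API destination); an unmatched one is simply ignored by that rule.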
However, this is not the only pattern we have for event routers. We also have topics, which are a subtly different form of event router. Here we can broadcast to multiple consumers, and we still can filter consumers at the topic. However, this is really powerful for scaling out to lots and lots of consumers. We can support multiple protocols, things like SMS and push notifications. We also have options to maintain order with FIFO topics. A fully managed service for doing this in AWS is SNS.
It allows you to send nearly unlimited messages per second, and it says that in our documentation: nearly unlimited. So we're quite serious about it. It's very difficult to put any strain on the SNS service, and you can send out to 12.5 million subscriptions on each individual topic. If you're doing fan-out and you want to send codes or messages to everybody, that's really powerful. So we talked about routers, and our next category of things is event stores.
An event store allows our receiver to process events in batches and control how it's doing processing. It can control its order, it can configure its own retry, and it can configure its own dead letter queue. Now one of the important things about a queue is that only one receiver can process a message, so it's a one-to-one integration rather than a one-to-many integration. You can manage that ordering, but it's typically not used as the backbone of our event-driven architecture because of that one-to-one pattern. It's used more as a kind of sub-pattern we're going to think about.
In AWS, the main service for doing that is SQS. It's another one where the docs say we scale nearly infinitely: super easy to use, a simple API, dead-letter queues, first-in-first-out options, all the kind of features we expect. We also support some of the open source services, so things like Amazon MQ, which supports RabbitMQ or ActiveMQ. If you have some kind of integration with existing open source stuff, that's definitely a sensible option and supports many of the same patterns that you'd be using.
And then the third and final big bucket of brokers we can think about are streams. Streams are subtly different again. Here we can do one-to-many, so we can send messages to multiple consumers. However, all the consumers reading a stream are going to get all the messages, so the consumer itself has to manage the filtering. You've got ordering enforced in the stream, but again the consumers have to manage their own kind of error handling and processing to ensure that ordering is enforced. Really powerful service, particularly if you're doing really large throughput of messages.
We've got Amazon Kinesis as an AWS native version of dealing with streams. You could do gigabits per second of streaming, all the kind of integrations that we talked about with native integrations. And the big player in the open source space here is Kafka. Apache Kafka is an absolutely huge project, really powerful streaming solution which has loads of open source integrations and is used for lots of event-driven architectures. Our managed service for that is MSK, so you can use all the kind of Kafka goodness and get lots of the management done by AWS.
So we've now got this kind of toolbox. We've got some compute, we've got some routers, and we've got all these services. It's really easy to start putting some of those services into our diagrams and building with them, particularly since setting up an EventBridge bus takes a few lines of code, a few lines of SAM or CDK or Terraform, and you can be up and running. But they're actually very powerful, and there are a lot of places you can go wrong. So the meat of what we want to do is this: how do we turn all these Lego bricks, all of these different building blocks, into useful things?
Pattern One and Two: Public API Integration with EventBridge and Private Endpoints
Eric, I think you're going to talk to us about some of the patterns we can apply. All right, so we've talked to you about what is EDA. And for those of you here in America, routers is routers. Just throw that out there. I know some of you are like, well, I don't even know what a router is, but it's a router. All right, so we're going to talk about what are these patterns. How do we apply this to ECS? How do we make that work?
So let's jump in, and we're going to talk about long-running containers. This is kind of those services that Matt was talking about earlier. It's push or pull, it's kind of end to end: how many do you want to send, how many do you want to process? A traditional long-running container setup looks something like this. You have an ECS cluster with many different types of tasks with different containers running in them. A lot of ECS users are familiar with that, so I don't really need to teach you about it. And of course you have your capacity providers with our new managed instances, which is super cool. You can do a lot with that.
So I want to show you some different patterns, and the first pattern is probably the most simple. This is a very common pattern. Pattern number one is the public API. What this usually is: you're going to have some type of producer pushing data to a consumer, and it's going to go through an API. So somewhere on another machine, they have a signed URL or something, and they're pushing data to it. But your consumer, even though you run this container, maybe it doesn't handle all that. We have legacy code from 1988 that isn't ready to process everything, so how do we put a broker in something like this?
EventBridge is where I generally start. This is a really good way to say events are coming in, and I'm going to send those and then push them on to the container through the API. The producer sends directly into EventBridge with IAM roles that have permissions to do that. It pushes data into EventBridge, and EventBridge uses an API destination to make a secure connection to the API and drop events into the consumer. It can run however many instances we need, and we can control that through EventBridge and how many we're sending and different things like that.
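The wiring just described can be sketched as the parameters for an EventBridge connection and API destination. The endpoint, ARN, and credential values are hypothetical placeholders; the dict shapes mirror boto3's `events.create_connection` and `events.create_api_destination`:

```python
# Sketch of pointing EventBridge at a container's HTTPS API via an API destination.
# All names, ARNs, and secrets are hypothetical placeholders.
connection_params = {
    "Name": "consumer-api-conn",
    "AuthorizationType": "API_KEY",
    "AuthParameters": {
        "ApiKeyAuthParameters": {"ApiKeyName": "x-api-key", "ApiKeyValue": "REPLACE_ME"}
    },
}
api_destination_params = {
    "Name": "consumer-api",
    "ConnectionArn": "arn:aws:events:us-east-1:111122223333:connection/consumer-api-conn/abc",
    "InvocationEndpoint": "https://api.example.com/orders",
    "HttpMethod": "POST",
    "InvocationRateLimitPerSecond": 50,  # throttle what reaches the legacy consumer
}
# events = boto3.client("events")
# events.create_connection(**connection_params)
# events.create_api_destination(**api_destination_params)
```

The rate limit is the interesting knob here: EventBridge buffers and retries on your behalf, so the 1988-era consumer only ever sees traffic at a pace it can survive.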
Now you may say, "Eric, that's great, but not all of us are doing machine to machine, right? What if it's clients?" That's all right, I got you. You can actually do something like this where you have an authenticated user send data to API Gateway. Let's say they're sending analytics or something like that, and then I'm going to use VTL. Are you familiar with VTL? Do you love it? I know exactly. VTL is Apache Velocity Templating Language, and it runs in API Gateway. You can actually transform a request coming from a user and put it right onto EventBridge. I have some examples out in the wild of how to do that, but this allows you secure access where they can continue to push data and I don't have to run a Lambda function in between. I can still get data to the container.
This is pattern number one, and writing VTL is no fun at all if anyone's tried it, but ChatGPT is really good at writing VTL code. A couple of prompts and you'll have something working really quickly. VTL is back because of ChatGPT, and it's velocity templating language, by the way. I'm just proud I know that.
The second pattern looks a whole lot like the first because it really is. What we're doing here is we have this ability in EventBridge with PrivateLink and VPC Lattice. You can actually connect to private endpoints. This is super helpful if you need to throttle how much is getting in there. Let's go ahead and drop data into EventBridge, and then we'll use the private endpoint connection. Those are the first two patterns, kind of starting easy, and we'll climb up there.
Pattern Three: Point-to-Point with SQS and Auto Scaling Strategies
The next pattern is what we call point-to-point. This looks pretty simple, but what we're doing here is we have ECS containers that poll for events. Instead of things getting pushed to the ECS container, it's going to pull data off the queue. You manage the rate limit in the code: you can say grab ten, grab twenty, but you can also do some of that with the batching window and controls on SQS as well. Your retry and failure are managed at the queue as well. If something doesn't happen, you get a retry out of that, so this gives you a lot of control and downstream protection.
When you're running queues with polling patterns, it looks kind of like this. Let's say you have six tasks. They'll grab messages off the queue, and each task handles its share. If you're not familiar with how SQS works: when a task grabs messages, they don't actually disappear from the queue. They're made invisible. It's magic. It's the invisible queue, or the invisible event. After a certain amount of time, if the consumer hasn't said "yep, I took care of it," the message is made visible again and can be reprocessed. This works really well when you have metered, predictable traffic.
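The visibility-timeout mechanics above can be sketched with a toy in-memory queue. This is purely an illustration of the semantics, not real SQS; with boto3 you would use `receive_message` and `delete_message` against an actual queue, and the class and method names here are my own.

```python
import time

class VisibilityQueue:
    """Toy in-memory model of SQS visibility-timeout semantics."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # msg_id -> (body, invisible_until)

    def send(self, msg_id, body):
        self._messages[msg_id] = (body, 0.0)

    def receive(self, max_messages=10, now=None):
        now = time.monotonic() if now is None else now
        batch = []
        for msg_id, (body, invisible_until) in self._messages.items():
            if now >= invisible_until and len(batch) < max_messages:
                # Receiving hides the message; it stays in the queue.
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                batch.append((msg_id, body))
        return batch

    def delete(self, msg_id):
        # Only an explicit delete removes the message for good.
        self._messages.pop(msg_id, None)
```

If a consumer crashes without deleting its messages, the timeout simply expires and another task picks them up, which is exactly the retry behavior Eric describes.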
However, what if your traffic is unpredictable? You get a lot of traffic, then a little bit, then a lot, up and down like that. Well, that's all right. We can handle that with auto scaling, either through predefined metrics or step scaling. With step scaling, we watch the SQS message count and step capacity up and down to match. In this particular example, I'm adding one task at five messages and two tasks at fifteen, so it's watching that metric, going up as needed and coming back down. Then we put in a cooldown of three hundred seconds, because you don't want your auto scaling thrashing; you want it moving smoothly. How relaxing was that? That was cool. All right, I feel good.
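The step policy described here (plus one task at five messages, plus two at fifteen) boils down to picking the largest step the metric has crossed. A minimal sketch, with the talk's thresholds as illustrative defaults:

```python
def step_scale_adjustment(visible_messages, steps=((5, 1), (15, 2))):
    """Return how many tasks to add for the current queue depth.

    `steps` is (threshold, tasks_to_add), ordered ascending; the highest
    crossed threshold wins. Thresholds mirror the talk's example.
    """
    adjustment = 0
    for threshold, tasks_to_add in steps:
        if visible_messages >= threshold:
            adjustment = tasks_to_add
    return adjustment
```

In a real deployment this logic lives in an Application Auto Scaling step-scaling policy, and the three-hundred-second cooldown is enforced by the service rather than your code.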
So that's why we put in that cooldown, and then you get multi-period evaluation. We're not just looking at one data point and reacting, over and over. We're looking at the overall trend. This works really well for a lot of your traffic: you use these predefined metrics, and there you go. However, not all traffic is the same size. I don't know about you all, but you never know: am I processing a 2-minute video or a 2-day video? So what if the workload is big and the item size isn't consistent? Are those predefined metrics going to work? Probably not.
So then we get into custom metric math. Now, I've never been great at math. I can only count to 2. Some of you are awake. I can actually get to 4 if I take my shoes off, but that gets weird for everybody, so we'll go with 2. The idea here is we're going to use custom metric math and combine some metrics. We take the visible messages divided by the running tasks, which gives us the backlog per task, and we target a value of acceptable latency divided by processing time per message. We're looking at a few different things, and then we set this up. We still do our cooldown so we're not reacting too fast, and we prevent division by zero with a max function in the expression.
This allows us to do some mathematical equations. I'm saying all kinds of big words, some math to figure out how we're going to do it, but not just one metric. You're using multiple metrics to do this. This is an idea of how that policy would look and how you set that up. Well done, Eric. I've been watching Eric rehearse that for the last two weeks and trying to get his math out. I can't do math to save my life. Matt's been tense up till that moment. Thanks buddy. So that's how you do that.
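That metric math reduces to a target-tracking formula: one task can clear (acceptable latency ÷ seconds per message) messages in time, so the desired capacity is the backlog divided by that target, with `max()` guarding against division by zero. A hedged sketch of the arithmetic; the function and parameter names are my own, and in practice the expression would live in a CloudWatch metric math query:

```python
import math

def desired_task_count(visible_messages, acceptable_latency_s,
                       seconds_per_message, min_tasks=1, max_tasks=50):
    # Messages one task can clear within the acceptable latency window;
    # max() guards against a zero or negative target.
    target_backlog_per_task = max(acceptable_latency_s / seconds_per_message, 1)
    desired = math.ceil(visible_messages / target_backlog_per_task)
    # Clamp to the service's configured capacity bounds.
    return min(max(desired, min_tasks), max_tasks)
```

For example, if a task clears one message every ten seconds and you can tolerate ten minutes of latency, each task can absorb a backlog of sixty messages, so three hundred visible messages calls for five tasks.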
So when you're doing metrics, don't just let your system run and hope it makes it. You have the story. You have the observability to go and follow it. That's important anytime you're building anything, but especially in event-driven architecture, because you can build really optimized systems that scale with your load rather than over-provisioning to handle anything and paying for it.
Event-Based Containers: EventBridge Run Task, Step Functions Patterns, and Activities
So the next thing we're going to get into is what I call event-based containers. Now, I haven't made that a popular term yet, but I've said it every re:Invent for the last two re:Invents, and someday it's going to stick. I'm coining this term. I want to see it in a blog. So, event-based containers. What I mean by this is kind of like event-driven containers, but a little different. It's this idea of a one to one: one event, one container, like Matt explained earlier. So what does that look like? Well, we've got a producer, and we're going to kick an event into EventBridge.
Normally if I'm running a container on Fargate, I have a service, and that service has a task definition. This all lives at the container level. But with this pattern, the task definition reference lives on the EventBridge rule and comes along with the event. I can say, hey, here's a definition, here's an event, do your thing. Right now I'm not going to get any data back. It's a simple fire and forget, but that works for a lot of what we're doing. Your producer kicks out a single event, EventBridge takes it, calls the RunTask API, and you get one task per event. Actually, it's not necessarily one container, because a task can run multiple containers, like sidecars.
This is the EventBridge pattern, and it's very popular. But you don't want to use it for dumping a million events into EventBridge and bringing up a million containers; the RunTask API isn't designed for that kind of volume and will get overrun. We actually built this exact example a couple of years ago. We were doing video processing in serverless, because it can be done. We had the metadata, so we could look at file size or duration: if the video is under 2 minutes, process it in a Lambda function, because that's all we need. If it's over 2 minutes, spin up a container. Most of our videos were under 2 minutes, but every once in a while we had those one-offs. This is a really good use case, and we could set it up with rules.
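A rule target for this pattern might look like the following sketch. All ARNs, names, and the input path are placeholders; with boto3 this dict would be passed to `events.put_targets(Rule=..., Targets=[target])`, and EventBridge then calls RunTask once per matched event.

```python
# Shape of an EventBridge rule target that launches one Fargate task per event.
target = {
    "Id": "run-video-processor",
    "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/processing",
    "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-ecs-run-task",
    "EcsParameters": {
        "TaskDefinitionArn": (
            "arn:aws:ecs:us-east-1:123456789012:"
            "task-definition/video-processor:3"
        ),
        "LaunchType": "FARGATE",
        "TaskCount": 1,  # one task per matched event
        "NetworkConfiguration": {
            "awsvpcConfiguration": {
                "Subnets": ["subnet-aaaa1111"],
                "AssignPublicIp": "DISABLED",
            }
        },
    },
    # Forward only the event detail to the task.
    "InputPath": "$.detail",
}
```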
The second pattern is Step Functions. Who uses Step Functions today? I love Step Functions. I'm a Step Functions nerd. I believe in Step Functions first, and I'm super excited about it. Step Functions can work much like EventBridge in that it can hold the task definition and call the RunTask API for you. But it gives you a little more flexibility: synchronous and asynchronous options. That means blocking and non-blocking, right?
Let me show you how this can work. The async pattern is the first one we're going to do, and this is kind of a fire and forget, just like EventBridge. We use a run task, and it's an async fire and forget that returns an acknowledgement of the request. Yes, I got your request. I'm working on it. Move on. The Step Function will move on from this point. It won't wait. You don't get any data back. Off you go.
However, sometimes you need to stop and wait. You need to know whether it succeeded or failed. So we also offer what we call the .sync pattern. This is really cool because it handles the polling for you. When it sends to the RunTask API, it then polls DescribeTasks for the status. If it worked, you get a 200 back. If it failed, you get a 400 or a 500 or whatever is going on. However, you still don't get data back from the ECS task, and a lot of times that's okay. This is for when I need to wait on a service, and that happens a lot.
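In Amazon States Language, the difference between the two patterns is the `.sync` suffix on the resource ARN. A sketch of the blocking state, expressed as a Python dict for readability; the state name, cluster ARN, and task definition are illustrative:

```python
# Dropping the ".sync" suffix (resource "arn:aws:states:::ecs:runTask")
# turns this into the fire-and-forget async variant.
run_task_state = {
    "ProcessVideo": {
        "Type": "Task",
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/processing",
            "TaskDefinition": "video-processor:3",  # placeholder name
            "LaunchType": "FARGATE",
        },
        "End": True,
    }
}
```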
But you may say, Eric, we need data back. How do I do that? Well, let me show you. The next pattern we're going to talk about is the callback pattern using the token and send task completion. Does anybody use this now? A couple of you, okay. Does anybody like it? Well, I haven't explained it yet. Let's wait on that question. I will get a wow out of you.
The way this works is when you come to this task, it generates a token and sends that token, along with the data and the task definition, to the RunTask API to bring the task up. All that data goes to the container. The container starts up and does the work while the workflow sits and waits. When the job is done, you call the service; there are API actions to send success or send failure. You call one with the callback token, and Step Functions goes, oh, I know which workflow that is based on your token. Here you go, success or fail, and here's the return data. You're able to get data back to the Step Function and work with it later. This is a very powerful event-driven model you can do in Step Functions.
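The container-side half of the callback pattern can be sketched as a small function around the real boto3 calls, `send_task_success` and `send_task_failure`. The function name, the injected client, and the `worker` callable are my own framing so the logic is testable without AWS:

```python
import json

def handle_job(sfn_client, task_token, payload, worker):
    """Callback-pattern worker body (sketch).

    `worker` does the actual job; the token arrives with the task's input.
    Reports the outcome back to Step Functions either way.
    """
    try:
        result = worker(payload)
        sfn_client.send_task_success(
            taskToken=task_token, output=json.dumps(result))
    except Exception as exc:
        sfn_client.send_task_failure(
            taskToken=task_token, error="JobFailed", cause=str(exc))
```

In the container, `sfn_client` would be `boto3.client("stepfunctions")`, and the token would be read from the task's input or an environment override set by the workflow.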
The task calls back into Step Functions through the API to complete the state, and off you go. There's one more pattern I want to show you that I'm going to be really honest about. I just learned about it last year, and I think it is super cool. I learned about this when you talked about it a few months ago, Eric. Is anybody using activities in Step Functions? Anybody heard of activities in Step Functions? One person, but he knows a lot, so he's kind of a nerd.
Let me tell you about activities. You create an activity in the console, or through the API, and it sits on its own, outside your Step Function. Step Functions creates a managed SQS queue for you and handles all of that. From your Step Functions, not just one of them but any of them, you can pump data into this activity queue. You have a pool of ECS workers. This pool, much like in any other pattern we've seen, polls for the data and does the work. It has the same kind of interaction as SQS, because it is an SQS queue, and then sends back to Step Functions with the task token and the data. When you have thousands upon thousands of events that need to be processed from multiple workflows, this is the pattern.
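One iteration of that worker-pool loop can be sketched around the real boto3 calls `get_activity_task`, `send_task_success`, and `send_task_failure`. The function name, injected client, and `worker` callable are my own; in the ECS task this would run inside `while True:`:

```python
import json

def poll_activity_once(sfn_client, activity_arn, worker,
                       worker_name="ecs-worker-1"):
    """One pass of an activity worker (sketch). Returns True if work was done."""
    task = sfn_client.get_activity_task(
        activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:  # long poll timed out with no work available
        return False
    try:
        result = worker(json.loads(task["input"]))
        sfn_client.send_task_success(
            taskToken=token, output=json.dumps(result))
    except Exception as exc:
        sfn_client.send_task_failure(
            taskToken=token, error="WorkerError", cause=str(exc))
    return True
```

Because any number of workers can poll the same activity ARN, scaling the pool is just a matter of running more ECS tasks, including on Spot capacity as noted below.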
Real-World Success: Applying EDA Patterns to Transform Sarah's Architecture
We discussed this last year in this very session, actually. I will tell you, this stuff really works. Don't just take my word for it. This tweet came out in September, a few months back, saying it was inspired by a great session by EDJ Gee. No, I'm just kidding. Let's move on.
We made changes to an AWS Step Functions workflow, reducing its cost from $450 per invocation to $1. If you're intrigued, check it out; I'll post it in the resources where you can find it. Now the first question is, what were they doing that cost $450 per invocation? They were doing a lot of work with hundreds and hundreds of tiny state transitions that probably made more sense to run in an ECS cluster instead.
The activities pattern handled that workload perfectly. Activities give you several benefits. First, you get a managed queue with one-year retention. Second, you can connect on-premises services. This is where you can say, I need to process this with on-premises infrastructure, and you can do that from there. Finally, you get a loosely coupled architecture and the flexibility to use Spot instances, which is something a lot of people don't think about. If I'm using a clustered ECS setup, I can use Spot for that.
Let's talk about how this helps Sarah. Sarah was in trouble, and now we have a whole set of tools that we can combine with our ECS architecture. We have all of our different brokers that we can apply and all of these different patterns that we can apply to our architecture. So what does that actually look like? We talked before about having a relatively sensibly designed microservices architecture that is containerized.
The first thing we can do is start publishing some events. We have our order service, and every time an order is created, we can create an order created event with a single SDK call. That's going to publish it to EventBridge. Now we want to do something with that event. Maybe we want to take our payment service that is currently happening synchronously but depends on some third-party APIs, and we want to make that asynchronous. We accept the transaction and use some validation on the payment details. We assume that our payment process is going to process it, and if there's a problem down the road, we can send the customer a message saying, hey, can you come and retry with some new payment details? We can do that with our existing service by having a rule with that API integration. We can do a private API integration into our VPC using all the stuff that Eric showed, and so pretty quickly we can take two synchronous services and break them up using a really simple pattern and make them asynchronous.
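The single SDK call described above would build a PutEvents entry along these lines. The bus name, source, and detail fields are illustrative, not from the session:

```python
import json

def order_created_entry(order):
    """Shape of the 'order created' event the order service publishes."""
    return {
        "EventBusName": "commerce-bus",        # placeholder bus name
        "Source": "orders.service",            # placeholder source
        "DetailType": "OrderCreated",
        "Detail": json.dumps({
            "orderId": order["id"],
            "total": order["total"],
            "customerId": order["customerId"],
        }),
    }

# With boto3: events.put_events(Entries=[order_created_entry(order)])
```

Downstream rules then match on `Source` and `DetailType` to fan the event out to the payment, messaging, and loyalty integrations described next, without the order service knowing any of them exist.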
We also want to send a bunch of messages to our customers. Instead of having a direct SDK call to one of the messaging services inside our code, maybe doing it synchronously, we can break that out. We'll have a separate rule that's looking at those orders and sending an SMS message saying, hey, your order's been confirmed, or actually you need to go and update your payment details.
We have our loyalty service. A year ago, this is the service that blew up. We had a third-party service that we were expecting to respond immediately. Now it's important that our customers receive loyalty points when they make a purchase, but it probably doesn't have to happen immediately as part of that flow. In fact, we're probably happy if it happens at some point before the end of the month. Here we can send all of those orders and the order details into a queue, and that loyalty service can process them on their own time. We don't need to scale out that service. It can just chug away and make sure it's catching up with that queue.
And then finally, we have this really cool activity pattern that Eric taught us about. Once a month we want to run a process that's going to look at all of our customers' sales and loyalty points and tier changes and do a whole bunch of processing. We go and send them some personalized discounts to help encourage them to come back and buy more stuff. Here we can use Step Functions and the activity pattern, and we can actually do that batch processing in a really efficient event-driven way. So we can keep applying these patterns in different ways to our architecture, with EventBridge at the heart of it, using ECS for our compute, and quite quickly get to a new solution.
So next year in November when things ramp up, Sarah sits at her desk and nothing bad happens, and her boss is happy. Everyone gets to celebrate. Oh Sarah, we're happy for you. That's right. So Sarah's happy. She's partially doing EDA now, she's moved a good part of her architecture over, and she could do a lot more.
Key Takeaways and Resources: Starting Your Event-Driven Architecture Journey
So here's what I want to help you with when you walk away from here. If you're not doing EDA now, it's not a flip of a switch. I love this statement from Dr. Werner Vogels, VP and CTO at Amazon.com, who says systems that don't evolve will die. He's also the one who says everything breaks all the time. These are very well-known quotes, and it's important that we evolve. If your system is good enough, it probably won't be soon. You always want to be watching, even if it's working. Is it working efficiently? Is it saving you money?
So one thing we encourage when you're looking at your EDA journey is to consider where to start. You may not have a full understanding of EDA. Like anything else, you learn as you go. I encourage you to start small with an implementation that you can work on, and your understanding will grow with that. Obviously, don't start in production. Hopefully you know that already, but there it is. As you grow with that, you learn more and you go.
Here are some key takeaways I would give you. Synchronous microservices crack under pressure. You learned that Sarah didn't fail because of ECS. She failed because of tight coupling. Things like that cause systems to collapse at scale. EDA is the ultimate decoupling superpower. It really is. It allows you to build that scale and reliability. ECS plus Fargate brings the muscle. Serverless containers with zero servers to manage, instant scale, and rock-solid isolation. You've got routing options for days. It can give you a lot of flexibility to do that.
ECS runs every EDA pattern like a champ. We just talked about push, pull, sync, async, and callbacks. Auto scaling becomes your superpower. ECS Managed Instances means less ops, so you benefit from the reduced operational overhead. Step Functions unlocks next-level orchestration. And finally, loosely coupled systems win every time. Yes, Sarah does win by going this route.
We promised you some resources. I encourage you to check these out. You can find information at S12D.com/CNS307-25. Here are a couple of other resources. Let me go back; I saw some of you still trying to get that, so I'll leave it up for a moment. I tried to pick sessions scheduled after this one. The best practices for serverless developers session is a super strong one. API207 is about using event-driven architectures to modernize legacy applications at scale, so more EDA. And Serverlesspresso is a great example of an event-driven application at scale. Get the coffee in the expo hall or the certification lounge, and then go see how it was built. I really encourage you to do that.
If you want to learn more about service and application integration resources, there's a lot out there. On behalf of Matt and myself, I really want to tell you thank you, Matt. Do you have anything else you want to throw out? Shameless plugs? If you can make it to the Mandalay Bay at 5:30, we'll be doing a chalk talk diving into a lot of these patterns in more detail and really helping unpack them with the audience. I think that's the last one.
And for my shameless plug, on Wednesday I'll be discussing something at 10:30. I can't mention what that is right now, but I encourage you to keep 10:30 at Mandalay Bay open. If you love serverless, you'll love me for this. That's all I can say. I wish I could say more, but I encourage you to do that. With that, I want to say thank you very much. Please fill out the survey. Let us know. We always like to know how we can do better. Enjoy re:Invent. Thank you very much.
; This article is entirely auto-generated using Amazon Bedrock.