🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Overview
📖 AWS re:Invent 2025 - Circuit Breaker, Saga & Strangler Fig: Patterns for Transformation (MAM358)
In this video, three AWS Solutions Architects explain transformation patterns for modernizing monolithic applications. They introduce the Strangler Fig pattern for incrementally extracting services from a monolith using API Gateway as a proxy and anti-corruption layers. The session covers the Saga pattern in both Choreography and Orchestration flavors, demonstrating how AWS Step Functions and Amazon EventBridge manage distributed transactions across microservices. Finally, they present the Circuit Breaker pattern to prevent cascading failures when downstream APIs fail, showing implementations using CloudWatch alarms, EventBridge Pipes, and rate limiting strategies with API Gateway usage plans to protect both consuming and providing APIs.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Three Critical Transformation Patterns for Modern Architecture
Good afternoon everybody, thank you for joining us today. My name is Adrian Begg, and I'm a Solutions Architect at AWS helping our customers with modernization.
Hello everyone, I'm Dirk. I'm also a Solutions Architect with AWS. I work with software companies on their multi-dimensional transformation.
Hello, my name is Robert, and not to bore you, but I'm the third Solutions Architect on stage.
All right. So today we're going to be talking about three really important transformation patterns: Strangler Fig, Sagas, and Circuit Breakers. And it doesn't really matter where you are on your transformation journey. Maybe you're working on a monolithic application. Maybe you've started to carve some of the components out because that monolith is literally keeping you up at night. Or maybe you've already started to move towards a divide and conquer pattern, where you have a bunch of integrations between services and you're integrating with external services as well.
In the same way that you want to be prepared when you're on a journey, for example by making sure you have a four-wheel drive to navigate a bumpy road, a good understanding of different software architecture patterns and how they can help you will prepare you for unexpected challenges along the way. But before we begin, I just want to start by acknowledging that every architecture decision that we make involves trade-offs. In other words, whatever you do, something's going to suck.
For the patterns that we're going to be discussing today, this typically means trading some kind of simplicity in the architecture for some kind of complexity, but with a reason and a purpose. What we need to do as architects is choose the least painful option and really focus on what's most important to us or our customers, whether that be improving customer experience, reducing operational burden, or increasing the agility and evolvability of our application.
New Science: From Simple Startup to Monolithic Nightmare
Okay, so for the purpose of the discussion today, we're going to have a sample use case. We're going to explore each of these patterns and see how they can solve each of the problems. So to start with, we have our sample application. Now in the beginning, it all started with three friends and a simple idea. We were all interested in science news, and we were really not satisfied with the services that were available. We wanted something entertaining, something engaging.
So we sat together and we thought, how can we solve this problem? A couple of late nights, a lot of coffee, and New Science was born. And in the beginning, it was a pretty simple application. We had a way for users to register, we had a way for our contributors to publish articles, and we had a way for our users to comment and give us some feedback. We pushed to AWS and we were live.
And within 24 hours, we're getting our first feedback. Our customers love it. But wouldn't it be good if we could find the articles, because there's no search feature at the moment. Now we always knew that this was going to be something that would come up fairly early, and we thought about it, but it just wasn't important for the MVP. So we organized a meeting, we're going to get a whiteboard, we're going to solve this problem.
The next day, we get into the office, Slack notification pops up, it's from my CTO. It's done. Sorry, what's done? Search, it's implemented. Okay. Now, it wasn't the most robust implementation, but at this point in time, the whole application's a couple hundred lines of code, and there's only three developers. We can kind of understand the whole application, and look, it works, and our customers are delighted. We can always go back and fix it later, right?
We're growing, and everything's great, and this worked really well. Until it didn't. Now we're a few years down the track, and a few problems are starting to occur. So we've prioritized getting new features into the application as quickly as possible, and as we started to grow, we started to notice, hey, our costs are starting to explode. This really led us into a path where we really leaned into our monolithic architecture.
Our original developers have now since moved on to greener pastures. And they were great builders, but they weren't great at writing documentation. We've got a whole bunch of these features now in our application, a whole bunch of different components, and nobody really understands how all of these things work. We've even got things in there that we don't understand what they really do.
We only know they're really important because we changed them once and the entire application went down. We're not sure exactly when this happened, but at some point we went from developing new features and focusing on our backlog to just coming into the office every day and constantly seeing what's on fire today. Everyone's scared, fearful, and now anytime we make a change, we need to get approval from every product manager. At some point, everything becomes a severity one incident.
On top of that, as we've started to grow, our application's just doing things that we never really designed it to do. We've been successful, we've grown, but we never originally designed for the user base that we have now. As I mentioned, our team's paralyzed, fearful, and unhappy. The monolith that was helping us to move quickly in the beginning is now the thing that's starting to hold us back.
We thought about rebuilding the application, but honestly, it's going to take too long. We estimate that it's going to take at least two years to do this, and we're not even sure this is the right approach. We make changes in some parts of the application very frequently, which we can't do at the moment. But other parts of the application haven't really changed since we implemented them.
So if these or some of these issues sound familiar to you, it's because you're not alone. These are problems that we see commonly with customers. Untangling this mess can seem like an impossible task. It's hard to know where to begin. So you may be asking yourself, why is there a picture of a tree on screen right now?
The Strangler Fig Pattern: Nature's Blueprint for System Transformation
Just a quick show of hands, does anybody actually recognize this tree? Okay, I see a few hands. There is a hint in the title. So this is in fact the Strangler Fig tree. It's native to Queensland in Australia where I actually grew up, and this particular Strangler Fig tree, this photo was taken in the Botanical Gardens in Brisbane, and the buildings in the background are actually the university that I studied software architecture at many, many years ago.
Now, these trees are, as you can see, quite interesting in terms of how they grow. They've got quite interesting patterns with all of these branches or roots that are coming down, looks like from the top of the canopy. To explore why these trees look like this, we need to understand how they start their lives. So the Strangler Fig tree is actually a parasitic tree.
A bird or a possum or some kind of other animal eats its seed, flies away, finds another tree, which I'll refer to as a host tree from now on, and deposits the seed on one of the branches up top. Now because of the material that's mixed in with the seed, it kind of sticks to the branch. That seed will germinate, and the tree will actually start to grow from the top down.
So it'll start its life up the top and kind of wrap itself around one of the branches, and then as quickly as possible, its objective is to get into the soil so that it can suck up all the nutrients, which will enable it to grow faster. So you get this quite interesting pattern where these trees are actually hollow in the center. So there's a cavity, for example, in the center there, and you can actually go in, kind of move your way through the tree, and it's actually hollow up the top.
The reason for this is because this is not the tree that was originally there. Eventually, they'll start to consume and envelop the entire host tree and kill it. But an interesting benefit for the host tree is, while this is happening, those roots, all of these extra rooting points, are actually providing more stability to the host tree that's being killed.
This is relevant because in Queensland, where these trees live, severe storms, cyclones or hurricanes of category five are common. In summer, almost every day there's severe storms with very high winds, and ironically, the Strangler Fig is actually providing more stability to that tree, the host tree that it's killing, during the period of time where it's growing. So it's really helping the host tree survive while it's killing it.
So how does this apply to software, and how can we apply this behavior of providing the stability to the existing system while killing it at the same time? Now, this is where the Strangler Fig pattern comes in. This is not a new pattern. This was actually coined by Martin Fowler back in 2004, and it's still relevant in 2025 because it works. Turns out that this is a very effective pattern. The core concept is we want to incrementally build a new system on the edges of the existing system.
The focus is really to do this incrementally, focusing on the biggest problem that we have and extracting that from the monolith and building a new component around the edges of the system. We can do the same thing with new features. So we can start to build smaller independent systems around the edges of the monolith and slowly strangle it.
This gives us a couple of advantages. Firstly, as we move those services off the monolith, we're going to improve the stability and the durability of the monolith itself. We'll start to solve some of the problems in the monolith, and by doing so, the monolith is going to be more stable, and we can start to get some of that agility back. But at the same time, we're also reducing the risk of the thing falling over.
Extracting the Search Service: First Steps in Strangling the Monolith
So this is the general idea, but let's actually look at how we do this. So this is a typical starting point for most of our customers. We've got our monolith. It's been simplified a little bit, but just for illustration purposes. So the whole thing is a problem, and we talked about that, but the biggest problem we have is this search service. It's been there for years. There's so much duct tape on that thing, every time there's any release, there's always a problem there. So this is where we're going to start.
To compound this, the search is implemented in a relational database. That was probably not a good idea, but it's what we used. We had a batch process that ran every hour, and it used to finish in about one minute. Now it runs for the full hour, if it completes at all. So this is thrashing our monolithic database as well.
So we've decided to start, but before we get started, we need to introduce a component into the design so that we can make these changes without having to go into the code and change the monolith every time. So the first thing we're going to do is implement a facade or a proxy, and it's going to sit in between the users and the application. This is going to be used to steer traffic, so initially it's just going to be a thin routing layer that's just going to send all of the existing requests from the users or the user interface straight through to the monolithic application, pretty much as it's operating. It should be transparent to the end users and the clients.
But we're going to start selectively steering the traffic, based on the new services we develop, to these new services or to the monolith. This allows us to make those changes without having to go back in and change everything in the monolith every time we want to implement new services, so it gives us a bit more control.
So we've dedicated two sprints to developing our new search service, and we're going to make different architectural choices this time. We're going to have the service own its own data, because this will allow us to evolve that service independently without having a shared data model that requires us to go back in and change all of those components and potentially impact other services. So this is going to give us a bit more agility to independently develop that service.
We're also going to use a purpose-built data structure that's optimized for search, so we're going to move away from our "relational database for everything" model. And we're going to reimagine how we do indexing as well. We're going to move to a constant work pattern, where we're only loading things in incrementally.
So we've made our choices, we've tested the service, and everything's looking pretty good. We're ready to cut over. To do this, we go back into the proxy and add a route that sends search traffic to our new modern service. All the other existing requests will continue to go through to the monolith as usual, but requests for search will be redirected to our new search service.
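The cutover described here can be sketched as a tiny routing table. This is a minimal illustration in plain Python, not the speakers' implementation: the `StranglerProxy` class, the backend names, and the `/search` prefix are all hypothetical, and a real proxy would forward the request rather than just return a backend name.

```python
from dataclasses import dataclass, field

MONOLITH = "monolith"  # hypothetical name for the default backend


@dataclass
class StranglerProxy:
    """Thin routing facade: path prefix -> backend, defaulting to the monolith."""

    routes: dict[str, str] = field(default_factory=dict)

    def add_route(self, prefix: str, backend: str) -> None:
        self.routes[prefix] = backend

    def resolve(self, path: str) -> str:
        # `path` is the URL path only (no query string).
        # A prefix matches the exact path or any sub-path below it.
        matches = [
            p for p in self.routes
            if path == p or path.startswith(p.rstrip("/") + "/")
        ]
        if not matches:
            return MONOLITH  # everything not yet extracted stays on the monolith
        return self.routes[max(matches, key=len)]  # longest prefix wins


proxy = StranglerProxy()
before = proxy.resolve("/search/articles")    # still served by the monolith
proxy.add_route("/search", "search-service")  # the cutover is one new route
after = proxy.resolve("/search/articles")     # now served by the new service
```

Because the default never changes, rolling back the cutover only means removing the route again; the monolith's own code is not touched in either direction.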
So we've taken our first step towards moving away from the monolith, and we've solved two key problems. We've taken a significant problem area in our monolithic application and removed it. At this point, this is just some dead code. It's not doing anything. We could go and clean it up, but we don't really need to. But we've also alleviated a lot of the pressure on our monolithic database at the same time.
Anti-Corruption Layers: Decoupling Article Management from the Monolith
So this one was fairly simple because it didn't have too many dependencies between other components in the monolithic application. We solved some of our problems, but as I mentioned, we need to change parts of this application very frequently, in particular the article management service. We're actively developing this, or we want to, if we can get our way out of fighting fires all day. So this is where we're going to focus next, but as you can see, there's more interaction between these components.
So we need to actually introduce another component into our design to help us to be able to transform this component. So we're going to introduce something that's called an anti-corruption layer. Now, we could do this differently. We could go back in and try and find all of the references to these internal interfaces and go and update them in our code, but as I said, we'd have to get approval from all the product managers to do this. It's going to be difficult to do, and it's going to be difficult to roll back as well.
Instead, what we're going to do is introduce a small wrapper into our monolith, and all that's going to do is translate the old to the new. So it's just going to convert the data formats and interfaces used by the old interactions into the new interactions.
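As a rough sketch of such a wrapper, assuming the monolith's callers pass a legacy record and expect an integer id back: every field name here (`ART_TITLE`, `ART_BODY`, `AUTH_ID`, `title`, `content`, `authorId`) and the `save_article` signature are invented for illustration, and the transport to the new service is injected as a plain function so the example runs without a network.

```python
from typing import Callable


def to_new_article(legacy: dict) -> dict:
    """Translate the (hypothetical) legacy record shape into the new service's shape."""
    return {
        "title": legacy["ART_TITLE"].strip(),
        "content": legacy["ART_BODY"],
        "authorId": str(legacy["AUTH_ID"]),
    }


class ArticleAntiCorruptionLayer:
    """Keeps the old internal interface, forwards to the new article service."""

    def __init__(self, send: Callable[[dict], dict]) -> None:
        self._send = send  # e.g. an HTTP client call in a real system

    def save_article(self, legacy_record: dict) -> int:
        # Same call the monolith already makes (assumed), so callers don't change.
        response = self._send(to_new_article(legacy_record))
        return int(response["id"])  # legacy callers expect an integer id


# Stand-in for the new service, recording what it receives.
sent: list[dict] = []


def fake_article_service(payload: dict) -> dict:
    sent.append(payload)
    return {"id": "101"}


acl = ArticleAntiCorruptionLayer(fake_article_service)
article_id = acl.save_article(
    {"ART_TITLE": "  Black holes  ", "ART_BODY": "...", "AUTH_ID": 7}
)
```

The translation lives in one place, so the rest of the monolith keeps calling the interface it always did.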
The goal here is really to minimize the change to the monolith as much as possible. So again, we've taken the same approach. We've dedicated a couple of sprints to developing our new service, we've implemented our anti-corruption layer in our monolith, and we're ready to cut over.
So we follow the same pattern. We go back into our proxy, we implement our route, and we deploy our anti-corruption layer, which is forwarding any internal interfaces out to our new modern service. At this point, we essentially just rinse and repeat, focusing on the biggest problems or the most important problems to solve to give us back that agility to independently evolve these components, but also to relieve pressure on the monolith itself while we're doing this and give us more cycles so that we can move towards innovating instead of fighting fires all day.
Implementing Strangler Fig with AWS: API Gateway, Lambda, and Data Migration
By doing it this way, we reduce a lot of the risks that would be involved in just trying to redevelop the whole system. We can continue to use this pattern also to develop new features as well, and we slowly start to build those components around the edges of our existing system. So we'll quickly go back to our original application and see how you can implement this pattern using AWS services, for example.
So we have our application, and the first thing we need to do is insert that proxy. For that, we're going to use Amazon API Gateway as our Strangler Fig proxy. There are other options here, but for our use case API Gateway works pretty well because it's billed per request. It scales automatically, and we hate managing reverse proxy fleets, so the fact that it's serverless is great for us. To start with, we're just going to have a route that proxies all requests through to our monolithic application.
Next we go in and develop our search service. As I said, we're going to make a couple of different design decisions this time. We're going to use AWS Lambda to implement the front-end components, or the API components. We're doing that because we have variable load. We don't have requests coming in all day, so this fits pretty well for our variable usage pattern, and it also allows us to scale up as we need to and reduces a lot of the operational burden that we have managing the application today.
We're also going to use Amazon OpenSearch Service for our data store. The reason for this is it's much better suited for our search use case, and it's going to allow us to ship new features around that search service and serve our future customer needs. So again, we've tested extensively and we're happy with it. It's time to cut over. The next thing we'll do is add a route into our API Gateway to redirect any search requests to the new service, and continue to send requests for the rest of the services through to our existing application.
Now you might have noticed that once we've removed the search service, we've actually reduced the number of tasks that are serving the original monolithic application. This is the other benefit we get by doing this. As we start to reduce load on our monolith, we can actually start to right-size that down as well, and we get an additional cost benefit there as we reduce the load on that component.
Next we have our article management service. We pretty much take the same approach, but again, we want to make different design decisions here, and we can. There's one slight difference. For our search use case, we didn't migrate any data from our relational database; we just re-indexed it. But with our articles, we've got five or six years' worth of articles, and we do want to migrate that data.
We've chosen Amazon S3 as the store for our article data in our new application because it's pretty effective for the static content that we have. It's cost effective, and it scales well. We have some metadata, which we're just going to store in DynamoDB. And we're going to use AWS Database Migration Service, or AWS DMS, to do a heterogeneous migration of selected parts of the database into S3. This way we can load just the article data into S3 and continually synchronize it until we're ready to cut over.
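In AWS DMS, the "selected parts" of a source database are described with a table-mapping JSON document made up of selection rules. The sketch below only builds that document; the schema name `newscience` and table name `articles` are assumptions, since the talk does not name them, and creating the actual endpoints and replication task is left out.

```python
import json


def selection_rule(rule_id: int, schema: str, table: str, action: str = "include") -> dict:
    """Build one DMS table-mapping selection rule for a schema/table pair."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": f"{action}-{schema}-{table}",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": action,
    }


# Hypothetical schema and table names: migrate only the article data.
table_mappings = {"rules": [selection_rule(1, "newscience", "articles")]}

# DMS expects the mappings as a JSON string when a task is defined.
table_mappings_json = json.dumps(table_mappings, indent=2)
print(table_mappings_json)
```

Tables that are not matched by an include rule are not migrated, which is what keeps the rest of the monolith's database out of the new article store.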
So once we're ready to cut over, we again go back into our API Gateway and we implement a new route for our article service, and we can use the Lambda integration to directly integrate with our new service that we've built around the edges. So using this pattern, we started to build smaller components around the edges of our application. We're able to move a little bit faster again. We can evolve those components independently, and whilst this process does take time, we're incrementally tackling the problems that are most important to us.
From Monolith to Microservices: Managing Distributed System Complexity with Sagas
We're gaining time back. We're not fighting fires all day, and we can actually focus back on our roadmap. Easy, right? Well, as I mentioned, every decision comes with trade-offs. As we move from a monolithic pattern to a distributed system, integrating and coordinating those interactions between those components does get more complex.
Dirk is going to dive a little bit deeper into some of the more complex components, and we'll explore how we can address those.
Thank you, Adrian. It's an interesting journey so far, and I'm very happy that I'm part of this journey as the new Chief Integration Architect for New Science. Of course, this journey cannot stop here. We need to continue. Just as with a lift and shift migration, where you shouldn't stop after the migration but should go on to modernize your workload, we now need to consider a few other things. For instance, how to integrate the number of systems that we now get out of that first Strangler Fig migration. And then, obviously, we also need to understand how we can actually manage the interaction between those systems.
For that, we have a pattern at hand that is widely known and used. It is called the Saga pattern, and it actually comes with two flavors: Choreography and Orchestration. Both of them, however, are built on a workflow and the execution of that workflow that manages the interaction between those systems.
Premium Tier Onboarding: Designing a Multi-Step Subscription Workflow
Now, a few months back when we showed this to our product team, they really liked that we gain a lot of flexibility when we cut our monolith into several pieces. And they thought, okay guys, from now on we can build all those cool products that we can think of. Of course, they didn't look at the price point on the lower right end that we can see here and thought they could leave it to us.
Well, for next year, our product team has decided that a new feature of a premium tier is to be the most important feature of our New Science service, and customers can onboard to that new feature and then consume premium content because they are premium subscribers. We will have a look at the onboarding process for such a premium customer to discuss the interaction between systems in our microservices landscape.
As you can see, there are a number of steps that need to be completed in that onboarding process. Don't look at them in detail; this is just to illustrate that there's a ton of work to be done. And of course, the product team doesn't get away for free here. They need to make up their minds about a few business decisions. But to understand that flow and use case a little bit better, let's put it into a diagram.
So what we intend is that customers use our end user clients to submit the request to subscribe to a premium tier, and it lands in the Subscription Management service. We will make it asynchronous so that the user directly receives a response, and we take care of the actual subscription onboarding process under the hood. Since we now have a landscape of systems under the hood, the Subscription Management service is not going to do everything on its own. It's going to employ a lot of downstream systems that contribute to the overall workflow, and one of the most important steps for us, because we want to earn money, is collecting payment from our onboarding customers.
But then we also need to enable that premium content and make it available for the new customer. And last but not least, we want to add some customer relationship goodies, like subscribing them to a premium newsletter and onboarding them to a loyalty service. You can probably already guess that we want a certain order in place, so that we make sure we get the money first before we do everything else.
Now, the fundamental question before we even think about how we can manage the interaction between systems is that we need to make up our minds how we do the integration between the systems. And even one step back, we also need to make up our mind about what do we actually want to build ourselves and what do we leave to others that maybe do it better. For instance, a payment service. I don't want to build this on my own. I don't want to undergo a certification. I would leave that to somebody who does this for a living. And the loyalty service and also a newsletter service, those are assets that we have to have, but they don't really distinguish us from our competitors, so we also make use of a third-party service for that.
All right, so with that up front, we can now discuss how do we actually want to integrate our systems. Our goal should be in that context to make the integration as loosely coupled and simple as possible for added flexibility, and also to make the overall app as resilient as possible.
How could we do that?
Evolving Integration Patterns: From Synchronous APIs to Event-Driven Architecture
Well, probably most of you would by default, when it comes to integrating systems, think about the synchronous request-response model and using that with APIs. Why is that so? Because probably most of us have used synchronous request-response all of our professional lives with programming languages when we call procedures or functions, and we also do the same when we consume a website with our browser. So that seems quite natural.
Unfortunately, it introduces a lot of coupling into our overall architecture, particularly runtime coupling, also called temporal coupling. While a request is running through this entire network, it ties up resources at every stage, and that can unfortunately violate one of the major promises of microservices, which is that you can scale every service out and in individually. Well, you can still do this, but you might be forced to scale out an upstream system just because a downstream system is struggling, and this is obviously not what we want.
And if one of those downstream systems does start to struggle, it may also jeopardize upstream systems and degrade the stability of your overall app. So the natural next step would be to transform this integration from synchronous into asynchronous request-response. This reduces a lot of our temporal coupling, or runtime coupling, because now an upstream system just sends a POST request to a downstream system, and that downstream system immediately responds with 202 Accepted, as we have seen in the communication with the end user client. Then you don't have that runtime coupling anymore.
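The asynchronous request-response exchange can be sketched as a handler that records the work and answers straight away. This is a simplified, in-process illustration: the function names, the `requestId` field, and the payload are assumptions, and a real service would persist the pending work rather than hold it in memory.

```python
import uuid
from collections import deque

# Work accepted but not yet processed (in memory only for this sketch).
pending_work: deque = deque()


def handle_subscribe_request(body: dict) -> tuple[int, dict]:
    """Accept the request and return immediately, before any processing happens."""
    request_id = str(uuid.uuid4())
    pending_work.append({"requestId": request_id, "payload": body})
    # 202 Accepted: the request was received, the outcome is not known yet.
    return 202, {"requestId": request_id, "status": "accepted"}


def process_next() -> dict:
    """Runs later, decoupled from the caller's request."""
    job = pending_work.popleft()
    return {"requestId": job["requestId"], "status": "done"}


status, ack = handle_subscribe_request({"customerId": "c-1", "tier": "premium"})
result = process_next()
```

The caller holds no connection open while the work runs; it only keeps the `requestId` to correlate the eventual outcome.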
There's still a very important dependency left, however, which is the availability dependency. You cannot send a request to a downstream system using an API if that downstream system is not ready to receive it: if it's not online, or if it's currently struggling and responding with increased error rates. And this is where message queues help you. Message queues have the very helpful ability to buffer messages. That means even if one of those downstream systems is currently struggling and might not be able to respond to an API request directly, the message queue buffers the message, and the downstream system can consume it when it is ready again.
It's also quite handy for a scale-out scenario. A queue has the point-to-point characteristic, which means every message in the queue is delivered to one process on the consumer side, so you can just add more processes to handle more load. This is why you can also call a message queue a buffering load balancer. So that's quite helpful.
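Both properties, buffering and point-to-point delivery, can be seen with an in-process queue standing in for the messaging system. This is only an analogy using Python's standard library, not Amazon SQS; the consumer names and message count are arbitrary.

```python
import queue
import threading

messages: "queue.Queue[int]" = queue.Queue()

# The producer sends while no consumer is running: the queue buffers everything.
for i in range(20):
    messages.put(i)

processed: list[tuple[str, int]] = []
lock = threading.Lock()


def worker(name: str) -> None:
    """Competing consumer: takes messages until the queue is drained."""
    while True:
        try:
            msg = messages.get_nowait()  # each message goes to one worker only
        except queue.Empty:
            return
        with lock:
            processed.append((name, msg))


# Scale out by simply starting more consumers on the same queue.
workers = [threading.Thread(target=worker, args=(f"consumer-{n}",)) for n in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

How the 20 messages are split across the three consumers varies from run to run, but every message is handled by exactly one of them.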
Obviously, you now shift a lot of operational responsibility to your messaging system. That means you will probably want one that is highly available, scalable, and reliable, and this is why I typically recommend going with Amazon SQS in this case as a cloud-native serverless messaging system. Fun fact here: on Prime Day 2025, at peak times, Amazon SQS handled 166 million messages per second, so it's quite scalable for sure. If you're bound to industry standard protocols like JMS or AMQP, you can fall back to Amazon MQ, which provides managed Apache ActiveMQ or RabbitMQ.
So that's quite good already. However, our diagram is a bit busy still, right? We have a lot of channels here. Maybe we can reduce that a little bit. And in fact we can: we can now add publish-subscribe channels in some places instead of the queues. How does that help us? The characteristic of publish-subscribe channels, also known as topics, is that every subscriber on the right-hand side receives every one of those messages that the producer on the left-hand side sends.
So in this case, the subscription management service only needs to send one downstream message compared to three previously. That simplifies it a little bit. And since we added it, we can also replace the return channel from a queue with a topic. So that's also a good simplification. And the other positive aspect that it brings to us is that if an additional service should eventually pop up, it can just subscribe to that topic too and can start receiving the messages. The subscription management service, for instance, doesn't have to be changed at all because it's all in the responsibility of the subscribers.
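A small in-memory topic shows both points: one publish reaches every subscriber, and a new subscriber can be added without touching the publisher. The `Topic` class and the service names are illustrative only and do not represent the Amazon SNS API.

```python
from collections import defaultdict
from typing import Callable


class Topic:
    """Publish-subscribe channel: every subscriber gets every message."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, message: dict) -> int:
        for handler in self._subscribers:
            handler(message)
        return len(self._subscribers)  # number of deliveries made


received: dict[str, list[dict]] = defaultdict(list)


def subscriber(name: str) -> Callable[[dict], None]:
    """Build a handler that records what the named service received."""
    return lambda message: received[name].append(message)


upgrades = Topic()
for service in ("payment", "content", "customer-relationship"):
    upgrades.subscribe(subscriber(service))

first_deliveries = upgrades.publish({"customerId": "c-1"})  # one send, three receivers

# A new service appears later; the publishing code above is unchanged.
upgrades.subscribe(subscriber("analytics"))
second_deliveries = upgrades.publish({"customerId": "c-2"})
```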
Again, the operational responsibility is something you shift onto the messaging system. And I would again recommend to go with Amazon Simple Notification Service as a cloud native serverless service, or you can also fall back to Amazon MQ because Apache ActiveMQ and RabbitMQ also support publish-subscribe channels. Now you might ask, hey Dirk, you have introduced queues for additional resilience and reliability. Now you have cut out those queues again. Why are you doing this in the first place?
Well, I wouldn't cut out the queues. Just conceptually, we have now focused on publish-subscribe channels, but I always recommend using the best of both worlds and applying the topic-queue-chaining pattern, so you can still add a queue in front of that subscriber system to have that characteristic of queues to buffer messages and also to be able to scale out on the consumer side. Okay, so it gets simpler and simpler, and now maybe we can make it even simpler, and maybe we can also forget about our beloved request-response model.
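Topic-queue chaining can be sketched by giving each subscriber its own buffer behind the topic. On AWS this is commonly done by subscribing SQS queues to an SNS topic; the code below is only an in-memory analogy, and the `ChainedTopic` class and subscriber names are hypothetical.

```python
from collections import deque


class ChainedTopic:
    """Topic that fans out into one queue per subscriber."""

    def __init__(self) -> None:
        self.queues: dict[str, deque] = {}

    def subscribe(self, name: str) -> deque:
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message: dict) -> None:
        # Fan-out from the topic, buffering in each subscriber's own queue.
        for q in self.queues.values():
            q.append(message)


topic = ChainedTopic()
payment_queue = topic.subscribe("payment")
newsletter_queue = topic.subscribe("newsletter")

# The payment consumer keeps up; the newsletter consumer is down for a while.
handled_by_payment = []
for n in range(3):
    topic.publish({"event": n})
    handled_by_payment.append(payment_queue.popleft())

backlog = len(newsletter_queue)  # messages waited in the queue, none were lost

# The newsletter consumer comes back and drains its backlog in order.
handled_by_newsletter = [newsletter_queue.popleft() for _ in range(backlog)]
```

The publisher still sends once, while each subscriber gets the buffering and scale-out behaviour of its own queue.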
Maybe we can start thinking in events. And with the introduction of our topics, publish-subscribe channels, we actually provided the precondition to stop sending requests or commands downstream. Now we can send events downstream, and we can make the messaging infrastructure also simpler when we just replace all those messaging channels with one message bus. You will see two icons that are labeled with bus, but it's actually the same instance. It just appears like that to keep the visualization of the different layers up here.
And if we want to drive it to the extreme, we can also put a tiny proxy component in front of our third-party service providers to push the event-driven experience down to the lowest level of our architecture. Those proxies can be simple Lambda functions that just handle the traffic between us and those third-party service providers. Again, you want to have a reliable and available service for that. The recommendation would be Amazon EventBridge as a serverless cloud native message bus implementation.
Saga Choreography: Distributed Workflow Through Event Notifications
So that's quite a simplification of our integration architecture, and guess what, with that we have naturally provided the preconditions to now think about saga choreography because everything we need is now in place. The idea of saga choreography is that the workflow that we talked about in the beginning is distributed over all those involved parties, and each of those parties is listening to an event that is relevant to do some processing. And then each party does some processing and sends out another event, publishes a notification that certain things have happened.
How can that look in concrete details with our use case? So we have the subscription management service that is being touched by the end user client. It receives the subscription request. What it can do now is to publish an event. The type is tier upgrade requested, and it goes into the bus. And what we have as the next level of downstream systems are probably the systems that are interested in this event, and they can now do some work, some sanity checks, some pre-work, and so on, and they will probably emit and publish events on their own.
And other downstream systems are interested.
For instance, the payment management service can prepare everything internally for the payment, maybe add some database records and so on, and then send out a customer payment requested event. Similarly, the customer relationship service can prepare everything for the status update of the customer and then send out an event that the status has been updated. Maybe some downstream systems are interested in doing more work in that respect, as we can see here. For instance, the customer payment requested event seems to be interesting for the fraud detection service and the proxy for our payment service provider. In the same way, the proxies for the loyalty service provider and the newsletter service provider are interested in that customer status update event.
Now those downstream systems do their work and they come back with a passed or failed event maybe, and what you can see here already is that we would actually want to have an ordering or a condition between those two. We would want to do the fraud detection before we actually trigger the payment collection, so you can already keep that in mind. This is something that the choreography pattern does not conceptually support out of the box.
The other systems downstream also do their work and publish afterwards status update events alike. Now one of those downstream services, in this case the payment management service, is certainly interested in those events that stem from those detailed services downstream. In the same way, the customer relationship service is interested in learning about those events that come from the loyalty service and the newsletter service. That will certainly trigger some status updates also in their systems of records.
Then they provide updates again into the bus about the overall progress of the status, and the subscription management service will be interested in those events because after all it needs to be able to provide a status update to the customer. Now while it might look at first glance still like request response because things in this visualization went down and other things went back again, there is an important difference here. We are only sending events, so we are not sending commands. An upstream service is not sending a command with the expectation that a certain downstream system is doing a certain thing. It just notifies about it.
There is obviously the expectation that eventually it can see that something has changed here or there to provide a status update, but there is no obvious command to an obvious target. That would again by its own apply tight coupling into our overall architecture again. With that in mind, we can have a quick look at the benefits and challenges that we see in the saga choreography pattern.
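To make the event flow above a bit more tangible, here is a minimal sketch of the choreography, assuming an in-memory bus in place of Amazon EventBridge. The class `EventBus`, the function `wire_services`, and the event names `FraudCheckPassed` and `PaymentCollected` are illustrative assumptions and were not shown in the session. Only the tier upgrade requested, customer payment requested, and customer status updated events come from the talk.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for a message bus (assumption: not EventBridge itself)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.log = []  # every published event type, in publication order

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, detail):
        self.log.append(event_type)
        for handler in list(self._subscribers[event_type]):
            handler(detail)


def wire_services(bus):
    """Register each service's reactions; returns the statuses the
    subscription management service has observed."""
    statuses = []

    # Payment management service: prepares the payment, then announces it.
    bus.subscribe("TierUpgradeRequested",
                  lambda d: bus.publish("CustomerPaymentRequested", d))
    # Customer relationship service: reacts to the same event independently.
    bus.subscribe("TierUpgradeRequested",
                  lambda d: bus.publish("CustomerStatusUpdated", d))
    # Fraud detection and the payment provider proxy both listen to the
    # payment event. Nothing here enforces an order between the two.
    bus.subscribe("CustomerPaymentRequested",
                  lambda d: bus.publish("FraudCheckPassed", d))
    bus.subscribe("CustomerPaymentRequested",
                  lambda d: bus.publish("PaymentCollected", d))
    # Subscription management service only observes outcomes; it sends no commands.
    for outcome in ("FraudCheckPassed", "PaymentCollected", "CustomerStatusUpdated"):
        bus.subscribe(outcome,
                      lambda d, o=outcome: statuses.append((d["customer"], o)))
    return statuses


bus = EventBus()
statuses = wire_services(bus)
bus.publish("TierUpgradeRequested", {"customer": "c-1"})
```

Note that no publisher names a target. The workflow only exists as the sum of the subscriptions, which is exactly what makes it hard to see as a whole.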
Obviously it is quite easy to start. We arrived at it quite naturally from our integration architecture setup. It also really supports autonomous teams, so that every service that is involved here can still work individually, and it also provides you with high scale. The challenge, however, is the fact that the workflow is distributed across all those parties and encoded there, which makes it really tough to grasp. It might be well documented in the very beginning, but everybody knows that over time people will inject more event types into the system, and then you might end up with that big ball of mud.
Saga Orchestration: Externalizing Workflows with AWS Step Functions
Also, if you want to change the workflow, you will always have to touch either code or configuration or both of the involved systems. Now maybe there's an opportunity to externalize that workflow and make it better visible. Yes, there is, and this is where saga orchestration comes into action, because the idea of saga orchestration is to make that workflow externally available and explicit. To illustrate this, let's only look at the first half or the first two thirds of this diagram, the subscription management service and the first downstream services. How can we externalize that using an orchestrator now?
Now we can think about the workflow, how we would want to have it. The subscription management service would actually launch such a workflow. It would call the payment management service's API called initiate a payment, and then there's a condition check.
Was that actually successful? If no, then we go back to the subscription management service and let it send a notification about the failure to the end user. But maybe it was successful, and hopefully that is the case most of the time. Then we can continue. We can unblock the content.
If that was not successful, we would provide a refund to the customer for the time being. In the future, maybe we will make it more comprehensive, but for now that is how we are going to start. If the unblocking of the content was successful, we can also use the customer relationship management service to initiate the provisioning of the newsletter subscription and the loyalty program onboarding. If that wasn't successful, currently our agreement with those service providers is that we do a manual intervention and then we can notify the customer of partial success. If that was successful, we can notify full success to our customers.
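The workflow just described can be sketched as a single explicit function. This is only an illustration under assumptions: the function name `run_tier_upgrade`, the action strings, and the return values are hypothetical, and the four callables stand in for the real service APIs. In the session the same logic is drawn as an AWS Step Functions state machine rather than written as code.

```python
def run_tier_upgrade(payment, content, crm, notify):
    """Orchestrate the tier upgrade. Each argument is a callable supplied by
    the caller; payment/content/crm return True on success."""
    if not payment("initiate"):
        notify("failure")            # payment failed: tell the end user
        return "failed"
    if not content("unblock"):
        payment("refund")            # compensating action for the payment
        return "refunded"
    if not crm("provision"):
        notify("partial_success")    # the rest is handled by manual intervention
        return "partial"
    notify("full_success")
    return "succeeded"


# Usage with trivial stand-ins where every step succeeds:
sent = []
result = run_tier_upgrade(
    payment=lambda action: True,
    content=lambda action: True,
    crm=lambda action: True,
    notify=sent.append,
)
```

The whole workflow is now readable in one place, at the price that this function has to know which operations each service offers.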
Okay, so while it might be fun to draw this on a slide, you need to implement and actually execute this, right? So this is where AWS Step Functions comes into action, which is again a managed serverless service that is made exactly for such things to execute workflows at scale. And you can redraw literally the same thing also in the AWS Step Functions management console where it would look like that. If you want to take a picture, you can do it now. Right, so what are the benefits and challenges of orchestration?
So the main benefit obviously is that you make this workflow explicit, and that also makes it easier to evolve the workflow, to uncover error situations and find solutions for them, and to make sure that everything is consistent, because the executor also keeps state. On the other hand, you introduce a new component here, and that should be highly available and highly scalable. And I also want to stress the last point on the challenges side: you actually need to couple this workflow a little bit to the implementation details of the involved parties.
So that brings us to the conclusion of when to use what. In the end, it's quite simple. You use choreography when the benefits of choreography outweigh the challenges for you personally, or you use orchestration when the benefits of orchestration matter more to you than its challenges. My personal recommendation, however, is you use both. You can actually combine both approaches in the same architecture. In our example use case, I would use orchestration for everything that is related to payment and content unblocking, and I would use choreography for everything that is related to customer relationship management.
So that's all good. I have one problem left, Robert, and I hope you can help me. Those third-party downstream systems often give us a hard time. Like they are moody on Monday mornings, particularly the payment service provider, and we don't seem to treat them well either. So maybe you have a solution for that. Gladly, let's have a look at that.
The Circuit Breaker Pattern: Preventing Cascading Failures in Distributed Systems
So looking at our current situation, we should have every reason to celebrate. We have this very nice new subscription workflow with a saga pattern, and our customers love our premium content. They are practically lining up to become paid subscribers. The issue is our subscription team is not having a good time regardless. How come? Well, as Dirk already mentioned, we have some downstream APIs that can be a little bit moody. With some of them, maybe we are at fault because we are also overloading them.
And when that is happening, the saga pattern is kicking in and it rolls back the whole distributed transaction with all of the involved parties that we have set up, and that is good. And after some time we can retry this whole operation. And if it succeeds, that's fine, but if it fails again, we have to roll back again. And if we do that multiple times, if we retry and fall back and retry and fall back, we are adding so many remediating API calls that undo any state changes that have happened in the downstream systems that we are also overloading APIs that used to work perfectly fine for us. So we are cascading the issue and making everything worse for ourselves.
How do we handle this ourselves? Let's compare it to a real world example. Where do you have an overload situation in your home, for example, in your electrical system? If your electrical system at home is overloaded, you have a circuit breaker that trips, and this is saving your appliances from being damaged. It also protects your wiring, and it might even protect your home from fire breaking out. The cool thing about the circuit breaker system at your home is, if this is happening in the kitchen, most likely all of your appliances in your bedroom, in your bathroom, and elsewhere still run fine because only the circuit of the kitchen is affected in this moment. So the issue at hand is isolated, contained, and can be safely dealt with.
Wouldn't it be nice if we had a similar thing for our distributed systems that we are handling? As luck would have it, we have a pattern for this, and aptly it is also named the Circuit Breaker pattern. A circuit breaker is operating in two primary states: the good closed state and the bad open state. Please do remember that because it is a little bit counterintuitive. Normally we associate something being closed with blocking our progress, where we cannot continue doing what we want. However, for a circuit, the closed state is the only one where current can successfully flow. An open circuit is a broken circuit.
So our normal state is a closed circuit with normal operation. If we get too many errors here, however, this flips over and trips into the open state and we stop operations. In our physical example, now somebody would need to go to the kitchen, check the different appliances, find the culprit, disconnect it, and remediate the situation. It's a very binary and manual process. In our integration pattern, however, we can do something more clever. We introduce a third state: the half-open state.
This half-open state kicks in after a little bit of time has elapsed that you configure, and we just try a few operations and we see if they succeed or if they continue failing. If they fail, we know the system isn't ready yet. We can't continue. We go back to the open state and we wait for another time to pass, and we try the half-open processing again. However, if our executions are successful, we hit the success threshold and we go to the closed state and continue our normal operation.
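The three states and their transitions can be sketched in a few lines. This is a minimal in-process illustration, not the implementation shown in the session. The class name `CircuitBreaker`, its parameters, and the choice to count consecutive failures are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Closed: calls flow. Open: calls are refused. Half-open: a few trial calls."""

    def __init__(self, failure_threshold=3, success_threshold=2,
                 reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before probing
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._successes = 0
        self._opened_at = None

    def allow_request(self):
        """Return True if a call may be attempted right now."""
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                return False
            # The configured time has elapsed: start trying a few operations.
            self.state = "half_open"
            self._successes = 0
        return True

    def record_success(self):
        if self.state == "half_open":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "closed"
                self._failures = 0
        else:
            self._failures = 0  # assumption: only consecutive failures count

    def record_failure(self):
        if self.state == "half_open":
            self._open()  # not ready yet: back to open and wait again
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self._opened_at = self._clock()
```

A caller would check `allow_request()` before each downstream call and then report the outcome with `record_success()` or `record_failure()`.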
Implementing Circuit Breakers on AWS: Automated Failure Detection and Recovery
What would it look like to protect a single execution with this Circuit Breaker pattern? Let's have a look here. First of all, I need to know at the start of my execution, what is the state of the circuit? Is it open or is it closed? So I need to have that information stored somewhere, have a circuit status stored that I'm checking. Then I decide very binary, is my circuit closed, yes or no? If it is not closed, or if it's open, the bad state, I don't continue. I don't want to overwhelm the system even more. I stop at that point and I might retry later.
However, if the circuit is closed, I execute my business logic and I hope that it now is successful. If it is, all is done, all is fine, we end here. If the business logic fails at this point, I will record the failure. I will increase a failure counter that, if it reaches my configured threshold, opens the circuit and stops further operations from happening. A key message here is that this is not a theoretical process that I'm showing you here. Exactly this process is something that our customers are implementing with AWS Step Functions to protect their executions with a Circuit Breaker pattern.
Knowing this now, let's get back to our story. We have a lot of business success. We have a lot of customer interest coming in, but it is all at risk because of the technical failures that are happening downstream. What do we do in this situation as a proper company? We have a crisis meeting, of course. We have the subscription crisis meeting. Here our ops lead is kicking it off and is saying, well, the thing keeps failing at different points under high load. Sometimes it's this, sometimes it's that, sometimes it's a timeout or it fails outright. Things are not working properly.
Immediately our CFO is leaning forward and is saying, in our current growth phase with this customer interest, this is not acceptable. We need these subscriptions to come in. We need the payment to continue growing. The always helpful architects are saying, well, what if we store the subscription intents, but we do not process them while the system is overloaded? What if we do just that? To which the user experience leader is saying, well, we cannot do that silently because our customers will just continue retrying and retrying, and maybe their credit card gets charged multiple times. They will hate that. Our support will be overloaded.
Finally, the poor on-call guy who has been dealing with this the whole weekend already is just tired of hitting F5 in the monitoring system all the time and trying five minutes after five minutes new executions until everything works again.
So he just wants to be done with this. As you see, the situation is quite chaotic, like the speech bubbles on this slide here, but let's have a look at how we can step by step resolve this with an event-driven pattern on AWS.
First of all, the concern of the on-call engineer. The thing keeps failing, and here you see the subscription state machine that we have set up, and what we can do is to configure a CloudWatch alarm that triggers after a certain threshold of failed executions is reached. That is our detection mechanism here.
Next, the CFO wants the business running, so this in the middle is our happy path. What usually happens in our application is we get a subscription intent from a customer via a POST request to API Gateway. This is being persisted in an Amazon SQS queue, and when we have a closed circuit, there is an EventBridge Pipe that is forwarding the intent into the Step Functions workflow that is processing the subscription for us with all of the downstream systems. So this is the good path that the CFO wants to have happening.
However, if we do have errors, the architect was saying, well, let's just pause the execution. And the good thing is our CloudWatch alarm is always publishing state changes to an EventBridge event bus. So when it goes into the alarm state, when it reaches the failure threshold, it publishes an event there to which we can subscribe with a rule, and that rule could trigger a different Step Functions workflow that could stop the EventBridge Pipe between the subscription intent queue and the actual subscription execution state machine. So at this point we have opened the circuit. Nothing is going on, but we are not losing subscription intents because they are stored and persisted in the SQS queue coming in.
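As a rough sketch of the "open the circuit" step, the function below reacts to a CloudWatch alarm state change event and stops the pipe. In the session this step is a Step Functions workflow; a plain function is used here only to keep the example short. The names `is_alarm_event` and `open_circuit` are hypothetical, and `pipes_client` is assumed to be something like `boto3.client("pipes")`, injected by the caller so the sketch does not need AWS credentials.

```python
def is_alarm_event(event):
    """True if the event is a CloudWatch alarm that just entered the ALARM state."""
    return (
        event.get("source") == "aws.cloudwatch"
        and event.get("detail-type") == "CloudWatch Alarm State Change"
        and event.get("detail", {}).get("state", {}).get("value") == "ALARM"
    )


def open_circuit(event, pipes_client, pipe_name):
    """Stop the pipe between the intent queue and the state machine.

    Messages stay in the SQS queue, so no subscription intent is lost.
    Returns True if the pipe was stopped.
    """
    if not is_alarm_event(event):
        return False
    pipes_client.stop_pipe(Name=pipe_name)
    return True
```

The same three fields checked in `is_alarm_event` could equally be expressed as the event pattern of the EventBridge rule, so that the target is only invoked for ALARM transitions.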
The concern of user experience was we need to inform our users about this, so we could also add this responsibility to the Step Functions workflow that we just introduced and say this workflow can additionally change a parameter in Parameter Store which is signaling that we have a processing delay. And this information we make available via the API Gateway as an endpoint, which gives our front-end team the possibility now to have a warning box there. Hey, we are under heavy load. We have received your subscription intent, but it might take a while until we get to process it. Don't retry all the time.
And finally the on-call guy, who is really tired. He just wants to sleep. He doesn't want to do something every 5 minutes, so let's automate that for him. We can have an EventBridge schedule that executes, for example, a Lambda function every 5 minutes. That Lambda function takes some test messages out of the subscription intent queue and executes them itself against our subscription state machine. And if all of these executions are successful, we can start the pipe again between the subscription intent queue and the subscription execution state machine.
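The scheduled probe can be sketched as follows. This is an assumption-laden illustration: `probe_and_close` and `run_subscription` are hypothetical names, the test messages are passed in rather than read from SQS, and `pipes_client` is assumed to behave like `boto3.client("pipes")`.

```python
def probe_and_close(test_messages, run_subscription, pipes_client, pipe_name):
    """Half-open check: run a few test messages and restart the pipe only
    if every one of them succeeds. Returns True if the pipe was started."""
    if not test_messages:
        return False  # nothing to test with, so stay open
    # all() stops at the first failure, so a struggling system is not hit further.
    if all(run_subscription(message) for message in test_messages):
        pipes_client.start_pipe(Name=pipe_name)
        return True
    return False
```

Returning early on the first failed test message is a deliberate choice here: the point of the half-open state is to probe gently, not to replay the whole backlog.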
So you might say, Robert, this is all fine and well, but you have painted so many icons here and so many arrows. Isn't this extremely complicated and a mess to maintain? To which personally I would say it's rather structured actually, because if we layer on top the different states of our circuit breaker that we discussed, then you see on the very top you have this half-open state that is just testing whether something works again and executing it. In the middle you have the happy path, the closed state, and on the bottom are the components that are acting when we have an error and an open state. So this is rather structured and maintainable still for the parts that they represent.
Rate Limiting and API Protection: Being a Responsible API Citizen
Now this can solve our challenge at hand, but it might be a little bit drastic, especially if our main problem is that we are overloading some third-party APIs downstream. Just stopping our subscription workflow completely and having some not that nice looking error message towards our customers is not the best signal, maybe. So can we do something differently, something more targeted for this situation of overloading APIs? And of course we can.
This is where rate limiting comes in, and it's all about consuming APIs responsibly, being a good API citizen. Usually you have a relationship and a contract with your third-party API provider that says you are allowed to execute so many API calls per minute, for example, and anything more, they will either block you or they will just fail. There are two different scenarios here, and you handle them differently. First of all, you might have APIs where you do not really care what the return body of the API is. You just send messages somewhere and you need to know that the call was successful, but you don't need the response body coming back.
For this, the pattern would be rather straightforward. You could just have an API execution intent again being persisted in Amazon Simple Queue Service, which has an EventBridge pipe that we talked about before, and this EventBridge pipe is configured with an API destination, which is a feature of EventBridge that conveniently has a max invocations per second parameter. So this is doing the throttling for you and you do not violate any of your agreements, but it is more for a fire and forget situation where you're not interested in the results.
The other situation that might even be more common is that you do need the response of a request because you need to decide how to react depending on what you get back as a response. There might be a decision to be made, for example, the fraud check that we have seen. If a customer is known to be a fraudster, you probably don't want to continue your subscription process with them. For that we have a different pattern that is not too dissimilar to the general circuit breaker pattern that we talked about before, but instead of having a counter where you have the error threshold that is going up if something is failing, you maintain a bucket of request tokens that are available to you.
You need to define a little bit of logic. How often am I allowed to execute that API and in what time? But if you have the logic for that, you could use DynamoDB to record your requests and calculate, at the moment where you want to execute something, do I still have capacity left? If you do not have capacity left, then you can just go into a wait task and, let's say, after five minutes retry and check again, do I have any tokens available to me.
If you do have a token available, however, you reserve one of these tokens, so you record that you are executing this now to adhere to your contract, and then you are executing a task in AWS Step Functions which will give you back the response body that you can then make decisions upon. So it is very similar to the circuit breaker pattern itself and can help you in this situation.
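The token logic can be illustrated with a small token bucket. This is a sketch under assumptions: state is kept in memory here, whereas the session keeps it in DynamoDB and does the waiting in a Step Functions wait task. The class name `TokenBucket` and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allows up to `capacity` calls in a burst, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._updated_at = clock()

    def try_acquire(self):
        """Reserve one token if available; return False if the caller must wait."""
        now = self._clock()
        elapsed = now - self._updated_at
        # Add the tokens earned since the last check, capped at capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        self._updated_at = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# Example: a contract of 60 calls per minute with bursts of up to 10.
bucket = TokenBucket(capacity=10, refill_per_second=1.0)
if bucket.try_acquire():
    pass  # call the third-party API here and use its response
else:
    pass  # no capacity left: wait and check again later
```

With a shared store instead of instance attributes, the read, refill, and decrement would need to happen as one conditional update so that concurrent executions cannot reserve the same token.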
Knowing how we can be a responsible citizen and not overwhelm downstream APIs, can we rely on everybody also respecting our boundaries out on the internet? Well, of course we cannot, and this is why you also need to protect your own APIs. This is something that we also want to go into. If I mention protecting APIs, the first service that many people will think about is AWS Web Application Firewall, and indeed you can of course protect your APIs with the Web Application Firewall from overloading.
Here you can react with counting the requests to just get a picture of what is happening currently, blocking any requests that are being too much, or challenging them with a CAPTCHA, for example, and you can do that on a five minute sliding window. This is very important to remember. You're not very flexible with the timing here. It's always the five minute window that you are working with, but you can scope it very flexibly. You can look at the IP, the path, the header, and a combination of all of these to decide this is now too many requests coming from that source.
In many cases, you will, however, know who your counterpart is or you will want to know, and here you can use Amazon API Gateway with its usage plans and API keys. This is exclusive to the REST flavor of API Gateway, not available for HTTP APIs, and here you hand out API keys. Important: never use API keys for authentication and security. They are just for identification and tracking, so you still need a proper authentication mechanism. But you can associate usage plans with your API keys that define how often your customer may call your APIs in what time frame. This is one possibility.
The other thing is you can also configure throttling on the API Gateway in general. You can do that on all APIs in the region or per route, and if you want to be more granular, you can do it on the API, stage, and method level, but only for the REST flavor again. Coming back to this very important message that we also started the talk with, you have to make some decisions. We gave you some options, some patterns that you can work with here. It's not a full list at all, but we hope that you have more tools in your belt now to handle these situations.
Coming to the actions part, the first action that I want you to do is to take your phones out because there will be a little bit of information for you to retain. First is we have curated a list of other sessions that you might be interested in that go into similar topics or do a deep dive on ones that we brought up right now. So please take a picture if you're interested, and the next picture opportunity is on the next slide that I'm going down to, which is this QR code, which will lead you to a website where we have curated further resources where you will find slides and also our contact information if you want to discuss further.
Talking about discussing further, this is a silent session, so we cannot be too loud here in the room, but the three of us will be outside of the room and happy to take any questions. With that, thank you all so much for coming, especially during lunchtime. It's been a pleasure.
; This article is entirely auto-generated using Amazon Bedrock.






































































































