
Kazuya


AWS re:Invent 2025 - Architecting resilient multicloud operations, feat. Monzo Bank (HMC201)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Architecting resilient multicloud operations, feat. Monzo Bank (HMC201)

In this video, AWS Principal Technologists Clark Richey and Bruno Emer, along with Monzo Bank's Andrew Lawson, discuss multi-cloud resilience strategies. They introduce the SEEMS framework (Single points of failure, Excessive load, Excessive latency, Misconfiguration/bugs, Shared fate) for analyzing failure modes. The speakers emphasize that multi-cloud doesn't inherently increase resilience due to added complexity, but can be valuable for specific scenarios like disaster recovery when data sovereignty requirements exist. Andrew presents Monzo's "Stand-in" platform—a simplified banking system running on Google Cloud as a lifeboat strategy while their primary platform operates on AWS. This system processes real customer transactions daily for testing, costs only 1% of their primary platform, and has been successfully used in actual incidents. Key best practices include maintaining fault isolation boundaries between cloud providers, implementing comprehensive observability, extensive testing, and ensuring critical dependencies like DNS and authentication aren't single points of failure.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Multi-Cloud and Resilience at AWS

Good afternoon. Thank you for coming today to this talk. I know it's during lunch, which is a tough time for everybody. Just to start off real quick, I have a quick question. Show of hands, who really likes capybaras? Yeah, me too. Nothing to do with the talk. I'm just kind of into capybaras today. But another question more related to the talk: who here is thinking about multi-cloud or is actively doing multi-cloud in your organization? Awesome. Let's keep those hands up for a second. Keep your hand up if you're doing multi-cloud because you are concerned about increasing your resilience, or if someone's asked you about whether multi-cloud will make you more resilient. Awesome. You're definitely in the right talk then.

Thumbnail 70

My name is Clark Richey. I'm a Principal Technologist at AWS, and I'm joined today by my colleague Bruno Emer, another Principal Technologist on my team, and more importantly, Andrew Lawson, a Senior Staff Engineer with Monzo Bank. Today we're going to talk to you about multi-cloud resilience. We'll start off by going over a few key concepts so we're all level set and talking about the same things. As part of that, we'll move into a framework that we call SEEMS, which is a way of thinking about resilience and potential failure modes in a structured way. From there we'll dive into some best practices around how you can be resilient in a multi-cloud environment, and then you'll get the part you've all been waiting for, which is to hear Andrew talk to you about how Monzo Bank is actually achieving this with their application.

Thumbnail 100

Thumbnail 110

Thumbnail 150

So as I said, a couple of key concepts just to level set all of us. If I were to go around the room and ask everyone here to tell me what they think of when they say multi-cloud, we'd probably get 100 different answers, and that's totally fine. Here at AWS, when we talk about multi-cloud, we're talking about having an organization where you actually have more than one cloud service provider where you're operating applications. For example, if you're just in AWS with all your applications but you're in Office 365 because everyone seems to be, that's not what we're talking about, but actually operating your own business applications in more than one cloud service provider. At AWS, we believe that multi-cloud is about people, processes, and technology coming together to help you achieve your business goals in an intentional way. Today, of course, we're going to focus on the technology aspect of that, but also people and processes as it relates to being resilient in your multi-cloud environment.

Thumbnail 170

Understanding the Relationship Between Multi-Cloud Complexity and Resilience

As Clark was talking to us about multi-cloud and what it means, a natural question that you might be asking yourself is: what about resilience? I know that this is in the title of the session as well, and the reality is that multi-cloud is not something that inherently increases your overall resilience. We all understand that resilience and complexity are actually enemies. As you increase your complexity, you tend to decrease your resilience because you have more things that can fail in the middle, and this is not exactly what we want.

But this is not to say that multi-cloud cannot help. There are certain situations where the implementation of multi-cloud is actually going to help you increase your overall resilience or achieve whatever goals you have around resilience, and I can give you an example. Let's suppose that you are running your workload in a country where your primary cloud service provider has only one region, and you need to build a disaster recovery solution but your data cannot leave that country for regulatory reasons. For this case, multi-cloud becomes a very valid answer. It's something you are going to implement that helps you put your disaster recovery practice in place and, because of that, increase your overall resilience.

Thumbnail 250

And I know that everybody here wants to hear more about it, and we have Andrew who's going to talk to us about how Monzo did that. Just keep calm, stay tuned, stay here with us. We'll get the word to Andrew in a few minutes. You're going to hear an amazing story of how Monzo did that, but just hang out with us for a few more minutes before we get there.

Thumbnail 270

And similarly to what Clark did, I also wanted to start by talking about what is resilience so that we all can be on the same page when it comes to the concepts. When we are talking about resilience, we are talking about the notion of preventing, mitigating, and recovering from failures as fast as you can on your overall application stack. So a resilient application is an application that's going to prevent failures from happening. If failures are going to happen, it's going to mitigate them and reduce the scope of impact and actually allow you to recover from them and get into your normal operations as soon as you can.

Thumbnail 310

AWS Mental Model for Resilience: HA, DR, and the Resilience Life Cycle Framework

The way we see resilience at AWS is actually following this mental model. I just wanted to bring this mental model here with you so that we can all be on the same page and understand how we think through that. The reality is that we don't understand resiliency as being a single thing, a single action that you are going to take. It's not something that you are going to implement and it's going to magically solve or fix or mitigate or anticipate all of the problems that you might have in your application stack.

The reality is there are different practices that are going to help you navigate through different failure modes or different categories of failures. The first of them that we like to talk about is the notion of high availability. When we are talking about high availability, commonly referred to as HA, we are talking about having your workload in the primary site and avoiding the notion of single points of failure. So you have pretty much everything replicated using a primary site. This is going to help you navigate towards the most common day-to-day problems that your application might face.

The second practice that I wanted to talk about is the notion of disaster recovery, or DR, which I already mentioned earlier in the session. The idea of DR is essentially that you have a different site where you implement your application, or parts of your application, and if there is a major issue with your primary site, you start running your workload on that secondary site. So now we are talking about two different sites.

And again, as I started mentioning, we don't see resilience as being a single action, a single thing. Those practices need to be there. We believe in them, but we also understand that as a customer you need to be thinking through the notion of your CI/CD, your observability stack, your deployments, your management of your platform, and everything needs to be done in a resilient fashion. So it's really something that permeates both HA and DR practice implementation.

And this is not something that we at AWS just tell customers and want them to figure out by themselves. We actually have created additional mechanisms to help customers navigate through the notion of resilience here. The first one that I wanted to talk to you about today is really one that we announced. We released this about two years ago if I'm not mistaken, and that's called the Resilience Life Cycle Framework.

Thumbnail 440

So when we think about resilience, we understand that this is something that needs to be embedded across different areas of your software development life cycle. We understand that resilience is something that needs to be present when you are architecting your solutions, something that needs to be present when you are engineering your solutions, and something that needs to be present when you are operating your solutions. And thinking through that and in order to help our customers navigate this, we created the Life Cycle Framework which actually is comprised of five different stages and has the goal of helping customers think through what they need to do in each of those phases aligned with their software development life cycle.

Thumbnail 490

So we start with the Set Objectives phase, which is much more of a business conversation. Then we go to Design and Implement, then Evaluate and Test, Operate, and finally Respond and Learn, which is when you have everything in production, and we call out a lot of practices that can be implemented. But more important than just having the framework itself is understanding that this is all very tied to the notion of culture. We want this to be embedded in the culture so that when you are building your systems, every single team that's engaged in that will be thinking about resilience.

Thumbnail 510

Introducing the SEEMS Framework: Five Categories of Failure Modes

Now another of those resources that I wanted to talk to you about is the SEEMS framework. You saw that Clark mentioned this on the agenda, and what it boils down to is thinking about how a workload can fail, or what capabilities a given workload needs to have to be considered resilient. We understand that every workload, to be considered resilient, needs to have five specific capabilities. We are talking about redundancy. We are talking about having sufficient capacity. We are talking about your workload providing answers and responses in a timely way, which is what we call timely output. We are talking about your workload doing exactly what you expect your systems to be doing, and we are talking about implementing the correct fault isolation boundaries within your workload.

Thumbnail 560

When those capabilities are not met, we get what we call the categories of failure, which is pretty much what SEEMS stands for. Essentially, those categories of failure are what happens when those capabilities are not met. When you don't have redundancy in your system, what you are going to see are single points of failure. If you don't have enough capacity to handle additional demand, you are going to have what we call excessive load. If your workload is not responding in a timely way, you are going to see excessive latency. If your workload is not doing what it is supposed to do, very often this is caused by misconfiguration and bugs, and when failures cross fault isolation boundaries, we see what we call shared fate. So this is really how the SEEMS framework is composed and where it comes from.
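As a hedged illustration only (this simply restates the framework as data for review purposes; it is not an AWS artifact), the capability-to-failure mapping can be written down as a small checklist:

```go
package main

import "fmt"

// SEEMS: each resilience capability maps to the failure category you see
// when that capability is missing.
var seems = []struct {
	Capability string
	Failure    string
}{
	{"Redundancy", "Single points of failure"},
	{"Sufficient capacity", "Excessive load"},
	{"Timely output", "Excessive latency"},
	{"Correct behavior", "Misconfiguration and bugs"},
	{"Fault isolation boundaries", "Shared fate"},
}

func main() {
	for _, s := range seems {
		fmt.Printf("Missing %-28s -> %s\n", s.Capability, s.Failure)
	}
}
```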

Thumbnail 610

Another thing that I wanted to mention here is that for this specific session we are going to mostly be focusing on how this works and on the multi-cloud approach for a disaster recovery practice. We are not talking about active-active multi-cloud. This is not what we're covering today. We are definitely going to talk to you about how to build a disaster recovery practice and what are some of the things you need to think about when building a disaster recovery implementation across multiple providers.

Thumbnail 640

Thumbnail 660

Complexity as the Enemy of Resilience and the Lifeboat Strategy

At AWS we have a saying that complexity and resilience are enemies. What do I mean by that? Well, let me give you an analogy. As I try to look out across the room, it's challenging. I think some of you here are probably old enough to have had, or maybe had a parent or grandparent who had a refrigerator that maybe looks something like this, right? These were really simple devices. You had a coil in the back with some freon in it and a fan. If it was a really nice one, you might have had a light bulb inside. That's it. They were very simple and they lasted forever, 20 to 30 years, no problem. There are very few things that can go wrong with them.

Thumbnail 690

I, on the other hand, have a refrigerator that looks a lot more like this. I can come downstairs sometimes in the morning and I can literally see that my refrigerator is performing a software update. I keep my fingers crossed every time that happens that I'm still going to have a working refrigerator. Every day when my food is still cold, I am grateful. This is not going to last 10, 20, or 30 years. It's a miracle every day it keeps going. I mean, is it cool? Yes, but is it resilient? Not as much as the other refrigerator.

Thumbnail 740

Why? Again, more things can fail in a complex system, whether that's a refrigerator or your software. More things mean more potential points of failure. That's what we mean when we say that complexity reduces resilience or is the enemy of resilience. And of course we know that building in a distributed world, never mind a multi-cloud world, is inherently complicated.

There's good news though. One piece of that good news is something we call fault isolation boundaries. We have them internal to our AWS architecture that we provide you, as do other cloud service providers. The idea of a fault isolation boundary is these are inherent constructs to your infrastructure or architecture that exist such that should an impairment occur, that impairment is contained within that boundary. So since we're talking about multi-cloud, the fault isolation boundary we're going to talk about here is that of the cloud service provider. Cloud service providers are inherently a fault isolation boundary.

Thumbnail 790

Thumbnail 800

Thumbnail 810

So here we have a picture of AWS and pick your other least favorite cloud service provider. And so if something were to happen, let's say, I don't know, angry fire bees appear in that provider, that could be a problem. But because it's a fault isolation boundary, AWS would be unimpaired. Now conversely, because I don't want to be unfair, it is possible that we could have angry fire bees at AWS as well, which would be bad. And in just the same way, that other cloud service provider, because of the fault isolation boundaries, would be unimpaired.

Thumbnail 820

Thumbnail 830

Thumbnail 850

Why does that matter? Why do I care about that? Well, let's start talking now about some actual best practices for achieving multi-cloud resilience. So here I have my steady state full workload. This is all the things that I do to run my business, whether they're the most important functions down to the least important functions, and it's running on my cloud service provider of choice. Everything is well and it's good and things are going fine until one day, it's not.

Thumbnail 880

And as Bruno mentioned earlier, in my case, I can't take that data and move it to another country because I have regulatory requirements for data sovereignty, data protection, and so forth. And my primary cloud service provider of choice doesn't have another region in that country. Enter the lifeboat strategy. I launch my steady state minimal critical functionality in that other cloud service provider. So those fault isolation boundaries that we just talked about prevent whatever occurred that took down my steady state full workload from impacting me in the other cloud service provider. And I'm just going to run those core critical business services that I really need to keep my business up and functioning while we repair our steady state full workload.

Thumbnail 910

So in general, as you're thinking about multi-cloud and multi-cloud resilience strategies, it's really important to understand how your chosen cloud service providers are architected for resilience, fault isolation boundaries, and other features that are inherent to that provider to help you be resilient within your primary cloud service provider. And as you go through and think about that, we highly encourage you to think in what we call user stories or user journeys. The idea there is to think about specific actions that your customers, your users, are going to take from a business perspective, end to end through a service or set of services to achieve a business goal. And we say this because it's going to allow you to think from a top-down business perspective in terms of what is critical to your business and what services need to actually be functioning for that business function to occur.

Thumbnail 970

Single Points of Failure: Infrastructure Components Beyond Applications

So again, we wanted to help you think through what needs to be implemented and what are some of the things you need to be looking at when building this lifeboat strategy, when spreading your workloads, even if it's using DR in a secondary provider. And we understand that many times, when we are talking specifically about single points of failure, which is the first S of SEEMS, we tend to look a lot at the application components. But I wanted to invite us to extend this a little bit more and think about other areas where single points of failure could be a problem for us.

So the first area that I wanted to mention is really thinking about the communications of your systems, and not necessarily the communication between cloud providers, but how users are going to reach your systems, how your systems are going to reach their dependencies, and how one system is going to talk to other systems that might be part of that minimum system you are deploying on the secondary provider. Because of this, it's extremely important that we avoid having single points of failure in infrastructure components such as DNS, which is really the backbone of your communications if we think about it that way.

Another thing that we need to avoid is having single points of failure attached to your CI/CD pipelines. When you think about that, using the example Clark just brought with AWS and the different fault isolation boundaries, you want to have the ability to perform deployments to those different CSPs independently. You don't want an impairment of your CI/CD to take away your ability to deploy your systems in either provider. So think about your CI/CD. You definitely need to think about your security components. The last thing you want is to activate your DR, then need to access your system, and you cannot because your authentication flow is in your primary CSP and got affected in some way because of an impairment that you're seeing.

So think about your authentication processes. Think about the systems that are serving the purpose of AuthN and AuthZ requests that you are going to be using. And finally, you definitely need to think about network connectivity. Think about how your systems are going to interact with each other. Think about how your users are going to be connected to your secondary CSP. Think about how your on-premises environment might be talking to the secondary CSP as well. You don't want to have single points of failure. You don't want to have a central place where all the communications are flowing, and if there's a problem, that central place gets affected as well. So those are really a few things, a few areas that we invite you to think through and to avoid having single points of failure on that.

Thumbnail 1130

Excessive Load: Managing Capacity and Service Limits Across Cloud Providers

So E brings us to excessive load. It's critical to understand how excessive load is going to impact you with each of your applications in each different cloud service provider. We of course want to encourage you to be as resilient as possible in your primary cloud service provider, and part of that is understanding how will that application handle excessive load, what mechanisms do you have in place to do load shedding to potentially scale up capacity and deal with that situation. But it's equally important to have that in place for your lifeboat strategy as well. It's entirely possible that when you are forced at some point to launch that lifeboat and to fail over, you might get an unexpected surge in customers for some completely unrelated reason. So that lifeboat needs to have the capacity to deal with excessive load just as much as your primary application.

And you need to understand what data access patterns exist, if any, between these two cloud service providers. So while we're launching two separate applications, there could still be, for example, some data that is synchronized between the two, and we need to understand how that can also impact the load on each of our applications. All of the cloud service providers give us quotas and service limits which govern how much of a service we can use at any given period of time.

Each cloud service provider handles that a little bit differently, so we need to be mindful in each provider that we're aware of our limits, how they're configured, and how that relates to our strategy for managing excessive load. You can hear us talk about this a lot for the rest of the talk. It's critical to test these things. It's a complex environment. You have multiple systems now, and the only way we're going to know how each of these systems responds to excessive load is to test them on a regular basis against high amounts of load.
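To make the load-shedding idea above a bit more concrete, here is a minimal sketch (not Monzo's or AWS's implementation; the concurrency budget and handler are placeholders): an HTTP middleware that rejects requests beyond a fixed in-flight budget instead of letting the whole service degrade.

```go
package main

import "net/http"

// loadShed rejects requests once maxInFlight are already being handled,
// returning 503 so callers can back off instead of piling on.
func loadShed(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "shedding load, try again", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// 100 is an arbitrary example budget; size it from load testing in each CSP.
	http.ListenAndServe(":8080", loadShed(100, api))
}
```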

Thumbnail 1250

Excessive Latency: Caching, Data Access Patterns, and Testing Under Load

Let's go to the second E of SEEMS, which is excessive latency. Latency is an interesting one because many times when we are talking to customers about latency, there is this idea of my system might be working but it's just taking a little longer. The reality is that if your system is taking longer to respond or to process requests than what your users are willing to wait, the user's perspective is that your system is unavailable. So there is a direct connection between excessive latency and resiliency, and we definitely need to think about that.

In order to deal with latency, there are a few tips and tricks that we can share as well. The first of them is to always leverage caching mechanisms appropriately whenever you can. They tend to reduce latency and allow you to scale out your system better. What I mean by leveraging caching mechanisms appropriately is that when implementing them, you will want to think about avoiding what we call bimodal behavior. Bimodal behavior means your system behaves one way under normal conditions and a different way when something in the path changes. You don't want that: you want your system to behave consistently, no matter what's going on in the underlying environment.

For example, if you have a caching mechanism implemented, users go and get data from the cache, and it goes quick. But what if there's a problem with the cache? What if you activate your DR environment? Your cache comes out cold, so the data is not loaded yet. What if your application starts then defaulting to the database? Your users might be seeing additional latency, meaning that your system will behave differently under different circumstances. So if implementing cache, always remember to think through that. You want to avoid the notion of bimodal behavior, and this is really what we mean by using that appropriately.
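As a minimal sketch of that idea (the cache and database interfaces, timeout, and data below are invented for illustration, not taken from the talk), one way to stay in a single behavioral mode is to give the cached path and the fallback path the same overall request budget, so a cold cache slows the request down but never silently switches it into an unbounded slower mode:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Reader is a generic lookup; both the cache and the database satisfy it.
type Reader interface {
	Get(ctx context.Context, key string) (string, error)
}

type mapStore struct {
	data  map[string]string
	delay time.Duration
}

func (m mapStore) Get(ctx context.Context, key string) (string, error) {
	select {
	case <-time.After(m.delay): // simulate lookup latency
	case <-ctx.Done():
		return "", ctx.Err()
	}
	if v, ok := m.data[key]; ok {
		return v, nil
	}
	return "", errors.New("not found")
}

// readConsistently applies one overall deadline whether the answer comes from
// the cache or the database. A cold cache makes the request slower, but it
// cannot push the request past the budget the caller already expects.
func readConsistently(ctx context.Context, cache, db Reader, key string) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()
	if v, err := cache.Get(ctx, key); err == nil {
		return v, nil
	}
	return db.Get(ctx, key)
}

func main() {
	cache := mapStore{data: map[string]string{}, delay: 2 * time.Millisecond}            // cold cache
	db := mapStore{data: map[string]string{"acct:1": "100.00"}, delay: 50 * time.Millisecond}
	v, err := readConsistently(context.Background(), cache, db, "acct:1")
	fmt.Println(v, err)
}
```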

Another thing is understanding our data access patterns. You want to understand where the data comes from, where the data goes to. You want to understand what is the kind of information that you are sharing to end users or to downstream systems in case you are being used as a dependency, because by understanding the data access patterns, you will be able to start thinking about how to improve that and how to reduce the latency on that communication.

Talking about dependencies, another very important thing, especially when you are going to the lifeboat strategy, is understanding how to measure the latency to external dependencies. What this means is: you are running on your primary CSP, and your system has dependencies that are not sitting inside your environment, let's suppose a third-party dependency. You activate the DR and go to the minimum environment that sits in the secondary CSP. The secondary CSP has the same dependency, but the dependency does not move together with your application. In this case, it is extremely important that you map this out so that you understand what those dependencies are, how your application behaves, and what needs to move together with your application in case you activate the DR and use the lifeboat strategy.

And last but not least, Clark just mentioned this in the previous E. I'm going to mention this again. The best way for you to understand how to handle that and how it affects your system is through testing. So you definitely want to test your system under load, but you also want to keep an eye on the latency. You want to test your system on the minimum environment as well, and you want to see what happens with latency. You want to have specific metrics on that, on the observability platform, and you want to test to validate how your system is behaving under those circumstances.
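A minimal sketch of that kind of per-dependency latency measurement, using only the Go standard library (the dependency name, URL, and 500ms budget are placeholders; a real system would export these timings as histograms to its observability platform rather than logging them):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// timedCall wraps one call to an external dependency with a deadline and
// records how long it took, so the same measurement exists in the primary
// environment and in the lifeboat environment.
func timedCall(ctx context.Context, client *http.Client, name, url string) error {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	start := time.Now()
	resp, err := client.Do(req)
	elapsed := time.Since(start)
	if resp != nil {
		resp.Body.Close()
	}
	// In practice this would feed a per-dependency latency histogram; a log
	// line keeps the sketch self-contained.
	log.Printf("dependency=%s latency=%s err=%v", name, elapsed, err)
	return err
}

func main() {
	client := &http.Client{}
	// Placeholder endpoint standing in for a third-party dependency.
	_ = timedCall(context.Background(), client, "card-network", "https://example.com/health")
}
```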

Thumbnail 1470

Misconfiguration and Bugs: Observability, Rollback, and Deployment Practices

And M brings us to misconfiguration and bugs. This is by far the most common source of impairment for all distributed systems, and even I would venture to say non-distributed systems. I know because I'm a software engineer and whenever I write code and all the tests pass right out the gate, I know I've done something really wrong somewhere. We write software as people, we configure it as people, and we make mistakes. These mistakes can cause bad deployments, misconfigurations that bring down our system. So we want to have really good observability of our systems to understand, particularly when a software deployment is happening or a change to configuration, what is the impact to our system.

If something negative is happening, is it potentially tied to that change? If so, I want to have processes in place to automatically roll back that change and take me back to the last known good state. After the fact, I can figure out what went wrong. All I care about in that moment is getting the business back up and operating and reducing the scope of impact of whatever that bad change or misconfiguration was.

We want to make sure we align these new code releases or configuration changes to those fault isolation boundaries we've talked about. One of the reasons we have completely separate applications for our full workload and our minimum critical functionality is so that we don't unintentionally violate those fault isolation boundaries. If I had the same application but in different places and I had a bad code deployment, well, I now just have bad code in two places and I'm down in multiple places because I've coupled those fault isolation boundaries by deploying to both of them the same code at the same time.

As I mentioned, the observability, not just for our deployments but generally speaking, now has to be in line with our fault isolation boundaries. I want to constantly be monitoring my application in each CSP to see if things are healthy, and I need to understand where that information, those metrics that are coming in, where they're actually coming from, which application, which CSP. Of course, we're going to continue to test this thoroughly for each cloud service provider and in lower environments for each. So I'm going to have testing, integration test, staging, pre-production, etc. for both applications in both cloud service providers.
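To make the automatic-rollback idea concrete, here is a hedged sketch; the deployer interface, error-rate threshold, and bake time are invented for illustration and stand in for whatever deployment and observability tooling you actually use:

```go
package main

import (
	"fmt"
	"time"
)

// deployer is a hypothetical hook into your deployment and observability
// tooling; the method names are illustrative, not a real CLI or API.
type deployer interface {
	ErrorRate() float64 // current error rate of the newly deployed version
	Rollback() error    // return to the last known good version
}

// watchCanary observes a fresh deployment for a bake period and rolls back
// automatically if the error rate crosses the threshold, instead of waiting
// for a human to diagnose it under stress.
func watchCanary(d deployer, threshold float64, bake, poll time.Duration) error {
	deadline := time.Now().Add(bake)
	for time.Now().Before(deadline) {
		if rate := d.ErrorRate(); rate > threshold {
			fmt.Printf("error rate %.2f%% above threshold, rolling back\n", rate*100)
			return d.Rollback()
		}
		time.Sleep(poll)
	}
	fmt.Println("deployment looks healthy")
	return nil
}

type fakeDeploy struct{ rate float64 }

func (f fakeDeploy) ErrorRate() float64 { return f.rate }
func (f fakeDeploy) Rollback() error    { fmt.Println("rolled back"); return nil }

func main() {
	// 1% threshold, 30s bake, 5s poll are arbitrary example values.
	_ = watchCanary(fakeDeploy{rate: 0.02}, 0.01, 30*time.Second, 5*time.Second)
}
```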

Thumbnail 1620

Shared Fate: Maintaining Fault Isolation Boundaries and Best Practices Summary

And here we are with the last S of SEEMS, which is shared fate. If you remember, probably my favorite slide from this presentation is the one where we had the boundaries, right, where the problems would be contained within the individual CSPs. This is pretty much what we're talking about, and this is pretty much what we want to achieve when we are implementing the lifeboat strategy or implementing pretty much any type of disaster recovery. We don't want problems to be cascading from one CSP to the other in this case.

When you activate your disaster recovery, you want your minimum system to be up and running independently, without being affected by whatever problem caused the disaster recovery activation. The number one thing to do is to avoid dependencies across CSPs, so you want to minimize as much as possible the traffic between your systems. You don't want the application that sits in one CSP to be calling the application that sits in the other. We understand that this might need to happen at some point, but you will want to minimize it as much as you can. You don't want that kind of dependency, because if the primary environment is impaired for some reason, you want the secondary environment to work flawlessly.

Another way of doing this is to remember to always move your data together with your application. What this means is that in the full workload that sits on the primary CSP, you might have an application layer and a database layer. If there is a problem with the application layer, you want to move both the application and the data. You don't want to just move the application and keep the data where it was. There are a few reasons for that. Obviously, if we have the application sitting in one CSP and the database on the other side in the other CSP, I might have additional latency, and I might be introducing cross-CSP communication that I wouldn't want. And again, depending on the problem in the primary CSP, it actually might take our system down in the secondary one. So we don't want to do that.

Another practice is to remember to always implement visibility across your CSPs. What this means is that CSP 2 needs to be able to understand the status of CSP 1 and vice versa. Essentially, what you don't want is to rely only on the system sitting in the primary CSP telling you that it is working. Obviously this data is important, but you also want to go beyond that. You want to measure and understand how your system is behaving from the outside. This is really what we call differential observability, which is extremely important when you are implementing this type of pattern and this type of deployment.
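As a sketch of measuring from the outside (the endpoints below are placeholders; a real prober would run outside both platforms and feed your alerting), the point is that the health signal is not produced by either platform reporting on itself:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// probe hits a health endpoint from outside the platform being measured and
// reports whether it answered successfully within the deadline.
func probe(ctx context.Context, name, url string) bool {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Printf("%s unhealthy: %v", name, err)
		return false
	}
	defer resp.Body.Close()
	healthy := resp.StatusCode == http.StatusOK
	log.Printf("%s healthy=%v", name, healthy)
	return healthy
}

func main() {
	// Placeholder endpoints standing in for the primary and secondary CSPs.
	targets := map[string]string{
		"primary-csp":   "https://primary.example.com/health",
		"secondary-csp": "https://standby.example.com/health",
	}
	for name, url := range targets {
		probe(context.Background(), name, url)
	}
}
```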

And again, make sure to always map your shared dependencies. An example of it, your system is running on the primary CSP, has dependencies on-premises. You activate it, you move your system to the secondary CSP. The dependencies are still on-premises. You want to know which ones they are. You want to understand if you have the connectivity. You want to understand how the second system is going to talk to that dependency as well.

So here it's much more a matter of being aware of how those dependencies work and of how you can make sure an eventual impact to a dependency stays contained within that dependency. This is pretty much what we have to talk about regarding the SEEMS framework and some of the best practices around it.

Thumbnail 1830

Thumbnail 1840

Just summing up our best practices for resilience in multi-cloud. Again, testing, constant, constant testing. It's a complicated system. Testing is more important than ever before. Observability. Bruno and I both mentioned different aspects of observability. Observability is absolutely one of the most challenging, nuanced aspects of operating a distributed system in the cloud. It becomes even more challenging, of course, in a multi-cloud environment. Expect to spend some time on this. You're not going to get it right the first time or the second. It's a journey. Take metrics, measure how well you're doing, and just resolve to do a little bit better on every increment. It's a lifecycle.

Thumbnail 1870

And then automation. Automation is absolutely critical whether in one cloud or two, but again it becomes more critical in two. We want to make sure that if something happens, there's an impairment, my disaster recovery plan, such as launching my lifeboat strategy, is automated. That's going to reduce the amount of time it takes for me to recover from that impairment. And perhaps even more importantly, it's going to reduce the opportunity for a human to have to take a lot of manual steps and make a mistake during a period of high stress.

Thumbnail 1910

And continuing with the best practices, if you think through what we just shared using the SEEMS framework as a baseline for us, we talked quite a lot about technologies. We talked about DNS, authentication, we talked about databases and how to replicate that. But I also wanted to make it very clear that when we talk about implementing this type of pattern, it's not a matter of only picking the right services or the right technology stack. We are definitely talking about having very aligned people, process, and technology with your business needs.

We definitely will need people, and we want our teams, we want our operators to understand when to push the button, when we are activating the DR. Where do they go? Where is the information? Who do they have to call? What are the things that they need to do? We rely on people in the end, even though we have a lot of automation and you won't succeed without automation. People are still a very important part of it.

When those people need to implement these changes, well, we want to have processes, right? And as examples, I'm activating the DR. Who do I call? Who do I need to communicate with? Where is the information? Where's the database? Do I need to tell something to my customers? Do I need to tell something to my partners? So processes are extremely important and you need to have them documented. What we mean by process is not just defining the process but also validating those processes, testing those processes, making sure that they are properly spelled out in your documentation, that everybody that's being onboarded is aware of what those processes look like.

And finally, we will definitely talk about technology. You think about the services. You think about how do you deploy. You're thinking about your compute infrastructure, right? So you need to think about all those three different areas. So it's not just by picking the right services, but it's also thinking about the people, process, and technology. Everything there needs to be very aligned when you do it.

Thumbnail 2020

And again, I know we've been talking quite a lot about some of the challenges or about a lot of things that customers need to do when implementing that. Even though all of this is true, the reality is that multi-cloud actually can help you increase your overall resilience. We wanted to bring here the cases, we wanted to bring here really the areas that we expect customers to be looking at when implementing this. Because again, as Clark shared with the example of the refrigerators, we want to reduce complexity, and when dealing with complexity we want to understand where it will be and how we can do better on that. But the reality is that yes, multi-cloud can be a pretty good idea and can actually help you to increase your resilience.

Thumbnail 2060

And then to talk more to us about how this is actually implemented, I would like to invite Andrew to the stage so he can share a little bit more with us how Monzo Bank has built a DR strategy across multiple cloud providers. Thank you, Andrew.

Thumbnail 2080

Monzo Bank's Challenge: Why Traditional DR and Active-Active Don't Work

Hi everyone, I'm Andrew. I'm an engineer at Monzo. If you've never heard of Monzo before, Monzo is a retail and business bank with products across the UK, the US, and soon in some European markets. Over 50% of adults aged 25 to 34 use Monzo. Our mission is to make money work for everyone, and these customers use Monzo to make their money work for them. We're digital only, and so resilience in our technology is really, really important to us and to our customers.

Thumbnail 2110

In the last two years, nine major UK retail banks suffered from a total of 803 hours of downtime caused by IT outages. To give those numbers some context, those nine UK banks account for over 90% of the UK's retail deposits and over 80% of the UK's retail banking population. So if you bank in the UK, you are almost certainly impacted by these outages.

These outages happen not because these banks don't care about reliability. They've spent an awful lot of time, money, and effort making sure that they have reliable systems, but these incidents still keep happening. If Monzo's systems went down, it would have a really large impact on our customers' lives. They might not be able to buy their groceries, get their cab home, pay their bills, or even just check that their money is safe. So we have to really think about what we do about this.

Thumbnail 2160

Resiliency offered by CSPs like Amazon AWS is usually really enough, so building reliable systems starts with following their guidance. Cloud service providers like AWS help us distribute workloads across many co-located servers and help us embrace hardware failure as something that we can recover from automatically. The next obvious logical step from co-located servers is to distribute workloads across multiple data centers and availability zones, giving us resilience to a single site going down. Here we solve for a failure of host machines and data centers, so there must be something else that we're missing here.

Thumbnail 2210

Thumbnail 2220

Active-active platforms, I believe, aren't practical for most of us. Whether that's multi-region within a single cloud service provider or multi-cloud across multiple cloud service providers, we just don't believe that this works for most. CSPs like AWS offer products that support multi-region workloads within a single cloud, but these products don't solve for a global outage of that single cloud service provider. At Monzo we recognize that just depending on these products doesn't solve the underlying resiliency problems that we have. So we have to consider how we use multiple cloud providers, rather than baking our resilience into these products alone.

Thumbnail 2270

Multi-region and multi-cloud designs also introduce technically complex data sync issues, and that complexity can increase latency for our systems. As we talked about earlier, it will multiply your infrastructure costs whenever you add a new region or a new cloud. Traditional disaster recovery setups, which we talked about before, also don't, we believe, solve the most likely causes of outages. Firstly, it's difficult to build and maintain absolute confidence in an inactive recovery site. Making the decision to pull the lever in the heat of a severe incident is just so complex, and it's incredibly difficult to make.

Most importantly, running the same software across two sites will ultimately fail in the same way. An incident in 2023 impacting UK aviation and tourism is a super interesting example of this. The system that they'd built had a really impressive disaster recovery site with automated failover that activated within seconds of an incident being detected. But they were running the same software on both sites, so their secondary disaster recovery site failed in exactly the same way. So they couldn't recover from the outage with their disaster recovery site. I'll share a link to that report later in the slides.

Thumbnail 2330

Thumbnail 2350

Monzo Stand-in: A Separate Platform Built from the Ground Up

So we had to think about what to do differently, and so we built a system that we call Monzo Stand-in. It's built from the ground up with new software, new data, deployed into a separate cloud service provider, and it's always running and ready to take production workloads within seconds. On the left-hand side you can see what our usual Monzo app experience sort of looks like, and on the right, what our Monzo Stand-in app experience looks like. It's a much more reduced version of our app.

The Stand-in app explains the types of things that they're still able to do while we're in Stand-in, things like pay with their card, send bank transfers, see that their money still exists and it's still fine, but ultimately it's a much more reduced version of our usual app. Keeping the app experience of Monzo Stand-in simple but functional is a really key design principle for us. As Clark mentioned earlier, keeping these features really simple means that we can reduce the risk of these components failing when we need to depend on them the most. We still want the Stand-in app to feel like Monzo.

Thumbnail 2420

Choosing not to add layers of complexity and loads of bells and whistles keeps both the front end and the back end of Monzo Stand-in super simple, so we can really depend on it. Here's a high level diagram explaining how our infrastructure works. On the left hand side, within AWS, you can see our primary platform. We have over 3,000 services deployed over hundreds of thousands of replicas built on top of shared platform components, on top of AWS infrastructure and products like EKS, Amazon Keyspaces, etc.

On the right hand side, you can see our Stand-in platform running within Google Cloud. We only have 18 services, and these are separate services from the services we run in our primary platform. They perform similar behaviors of some of the services in our primary platform, but they're entirely separate, again using platform components, but in this case using Google Cloud's product lineup instead.

Thumbnail 2490

Monzo Stand-in can be activated automatically within seconds of us detecting an outage for our most critical systems like processing payments. For serving API traffic to things like our mobile apps, we take a couple of minutes to ramp up, but we can switch that over pretty quickly too. In our primary platform, we need to be able to sync data into our Stand-in platform. Within our primary platform, we have a system called our Stand-in data syncer that's consuming all the events of all the things that are happening with our primary platform.

Thumbnail 2510

For some simple examples, if an account is opened or a transaction is created, or a card is expired, this data syncer system is just consuming those events. In near real time, it's effectively transforming that data and loading it into our Stand-in platform. We know that there will be a point at which the primary platform will fail and that some of that data that's being synced will still be in flight. We want to make sure that that ETL system is really, really simple, so we reduce the complexity as much as we can. To do that, we effectively accept the risk that some of the data might still be in flight as it fails.
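As a very rough sketch of that kind of one-way syncer (the event and record types, channel, and write function below stand in for whatever event stream and Stand-in store Monzo actually uses; those details are not in the talk): consume an event, transform it into the minimal shape the secondary platform needs, and forward it, deliberately accepting that anything still in flight at failure time may be lost.

```go
package main

import "fmt"

// Event is a primary-platform event; Record is the minimal shape the
// secondary (stand-in) platform stores. Both types are illustrative.
type Event struct {
	Type        string // e.g. "transaction.created"
	AccountID   string
	AmountPence int64
}

type Record struct {
	AccountID    string
	BalanceDelta int64
}

// syncLoop transforms each event and writes it to the stand-in store in near
// real time. There is deliberately no retry/replay machinery here: the design
// accepts that a handful of in-flight events may be lost when the primary
// platform fails, in exchange for keeping the ETL simple.
func syncLoop(events <-chan Event, write func(Record) error) {
	for ev := range events {
		if ev.Type != "transaction.created" {
			continue // only a small set of event types matter to stand-in
		}
		rec := Record{AccountID: ev.AccountID, BalanceDelta: -ev.AmountPence}
		if err := write(rec); err != nil {
			fmt.Println("sync write failed:", err)
		}
	}
}

func main() {
	events := make(chan Event, 1)
	events <- Event{Type: "transaction.created", AccountID: "acc_123", AmountPence: 10000}
	close(events)
	syncLoop(events, func(r Record) error {
		fmt.Printf("stand-in store <- %+v\n", r)
		return nil
	})
}
```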

Thumbnail 2550

We also need to think about syncing data back from the Stand-in platform to the primary platform. By data, what I mean is really the processing decisions made by Monzo Stand-in as it's processing payments and so on. Say a customer spends $100 in Whole Foods, as an example. In this case, the payment is processed in Monzo Stand-in, and the decision to approve or decline is made by the Stand-in platform's Mastercard processor.

Thumbnail 2600

That Stand-in platform Mastercard processor will update the balance in Stand-in for that customer, create transactions that they can see in that reduced functionality app, and it will also emit a message that we call an advice. That advice is effectively telling the primary platform of decisions it's already made. The advice message is consumed at some point in the future by our primary platform, whether it was completely down and it's all come back, or if it was partially down.

That Stand-in advice consumer effectively takes the advice and sends it to these processors within our primary platform to apply the effects of the advice. We don't want it to remake a new decision, because effectively this customer has already walked out of Whole Foods with their $100 of groceries. We can't make a different decision, so we're effectively applying the decision that was made within Stand-in. The advice queue is also designed so that Monzo Stand-in can work for up to two weeks without the primary platform.

Effectively, these advice messages are queued up until the primary platform is able to consume them. Once the primary platform is back online, we'll continue to operate within Monzo Stand-in until we can reduce the lag on that advice queue to near zero, indicating the primary platform is applying advices basically as soon as Stand-in produces them. Once we're at the point that we're processing advice messages in near real time, we know there is still a risk that there will be some messages in flight while we switch back. But again, that's a risk that we've chosen to accept for the reduced complexity of the system.
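A hedged sketch of the advice idea (the message fields and in-memory ledger below are invented for illustration, not Monzo's schema): the primary platform idempotently applies decisions Stand-in has already made and executed, rather than re-deciding them.

```go
package main

import "fmt"

// Advice describes a decision Stand-in has already made and executed, such as
// approving a $100 card payment. Fields are illustrative.
type Advice struct {
	ID          string
	AccountID   string
	AmountPence int64
	Approved    bool
}

// applyAdvice applies the already-made decision to the primary platform's
// ledger. It must be idempotent, because the same advice may be delivered
// more than once while the platforms reconcile.
func applyAdvice(applied map[string]bool, balances map[string]int64, a Advice) {
	if applied[a.ID] {
		return // already applied; do not double-count
	}
	if a.Approved {
		balances[a.AccountID] -= a.AmountPence
	}
	applied[a.ID] = true
	fmt.Printf("applied advice %s, new balance %d\n", a.ID, balances[a.AccountID])
}

func main() {
	applied := map[string]bool{}
	balances := map[string]int64{"acc_123": 50000}
	adv := Advice{ID: "adv_1", AccountID: "acc_123", AmountPence: 10000, Approved: true}
	applyAdvice(applied, balances, adv) // consumed from the queued advice stream
	applyAdvice(applied, balances, adv) // redelivery is a no-op
}
```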

Thumbnail 2690

Control Plane and Traffic Routing: Managing Decisions Between Two Platforms

Between the two platforms, we operate a control plane that effectively negotiates and decides which platform is responsible for making which decisions at which time. So this control plane controls which API serves our mobile apps.

It decides which payment systems should run within stand-in or the primary platform, which percentage of our customers should be routed to which platform, and even which precise customers should be routed where. We have manual controls to do all these things, like the simple CLI that you can see. This is how we first built and rolled out our system. Now we have a whole bunch of automation to automatically flip into Monzo Stand-in as soon as we detect an outage, but we'll always need these manual controls in case our automated systems fail.
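A simplified sketch of that kind of routing decision (this is not Monzo's actual control plane; the config shape and percentages are invented): explicit per-customer overrides win, and everyone else falls into a deterministic percentage bucket so a rollout stays stable for a given customer.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type Platform string

const (
	Primary Platform = "primary"
	StandIn Platform = "stand-in"
)

// RoutingConfig is an illustrative stand-in for the control plane's state.
type RoutingConfig struct {
	Overrides      map[string]Platform // specific customers pinned to a platform
	StandInPercent uint32              // 0-100, share of remaining customers sent to stand-in
}

// route decides which platform serves a customer. Hashing the customer ID
// keeps the decision deterministic, so ramping the percentage up or down
// moves customers predictably rather than flapping per request.
func (c RoutingConfig) route(customerID string) Platform {
	if p, ok := c.Overrides[customerID]; ok {
		return p
	}
	h := fnv.New32a()
	h.Write([]byte(customerID))
	if h.Sum32()%100 < c.StandInPercent {
		return StandIn
	}
	return Primary
}

func main() {
	cfg := RoutingConfig{
		Overrides:      map[string]Platform{"user_42": StandIn}, // enrolled in today's test
		StandInPercent: 5,
	}
	for _, id := range []string{"user_42", "user_7", "user_8"} {
		fmt.Println(id, "->", cfg.route(id))
	}
}
```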

Thumbnail 2740

Our inbound traffic from our payment networks is still routed through the primary platform, even when we're processing payments in Monzo Stand-in. This sounds really silly, given we've built this whole Monzo Stand-in platform to work in the case that our primary platform is out. But in the usual case, most of our primary platform is still running, and by depending on the edge network of our primary platform and the control plane, we can route subsets of customers between these two platforms as we want, which lets us roll customers in gradually and automatically. In the case that our primary platform is completely down, we can still connect directly from Monzo Stand-in to our payment networks.

Thumbnail 2790

In the background, while we're in stand-in, our mobile apps are constantly talking to both our stand-in platform and our primary platform. We serve effectively similar APIs: API.Monzo.com in the case of our primary platform from AWS, and a different domain entirely for our stand-in platform with similar interfaces on some of these APIs. If we think about the SEEMS framework that Bruno was talking through before, we chose to have separate APIs serving our mobile clients, and those separate APIs are on separate domains with different DNS providers entirely, so DNS is not a single point of failure for us.
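As a hedged sketch of what that looks like from a client's point of view (the stand-in hostname and the /ping path are placeholders; Monzo's real stand-in domain isn't named in the talk), the client keeps two fully independent base URLs rather than one hostname with failover hidden behind it:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// The stand-in hostname below is a placeholder: the real stand-in domain is
// deliberately separate from the primary API domain and uses a different DNS
// provider, so neither domain is a single point of failure.
var baseURLs = []string{
	"https://api.monzo.com",
	"https://standin.example.com",
}

// fetchHealth tries each platform's API in turn. A DNS or edge failure on one
// platform does not take the other path down with it.
func fetchHealth(path string) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for _, base := range baseURLs {
		resp, err := client.Get(base + path)
		if err != nil {
			lastErr = err
			continue
		}
		if resp.StatusCode < 500 {
			return resp, nil
		}
		resp.Body.Close()
		lastErr = fmt.Errorf("%s returned %d", base, resp.StatusCode)
	}
	return nil, fmt.Errorf("all platforms unavailable, last error: %v", lastErr)
}

func main() {
	resp, err := fetchHealth("/ping")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("served by:", resp.Request.URL.Host)
}
```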

Thumbnail 2840

Always Running: Daily Testing with Real Customers and Shadow Testing

Monzo Stand-in is always, always running every day to make sure that it's ready to use at a moment's notice. Every day we enroll a subset of our customers into Monzo Stand-in. These customers are production customers, not in test. In real life, in practice, they'll open their app and they'll see that they're in stand-in. We'll explain that we're using them to test our fallback systems and their payments will be processed in Monzo Stand-in. They'll use the reduced functionality in the stand-in app, and this really helps us test end to end all the data syncing, all the messaging back to our payment networks, all the advice processing that we talked about before, the whole thing. This is really, really powerful for us. If customers really want to use features that aren't in stand-in, they can simply opt out and that takes them straight out of the testing for Monzo Stand-in.

This is all powered by that control plane where we can precisely specify that this customer is routed into stand-in. That can only happen because we can route payments and other things through our primary platform while we use stand-in. We also do shadow testing, and what I mean by that is that all the payments that we process in our primary platform are also asynchronously sent to our stand-in platform to make decisions about those same payments. Now we have two decisions. The ones from the primary platform are effectively the ones that are honored and sent back to our payment networks. The decision that's made in stand-in, we compare against the decision we've already made in the primary platform, and we effectively monitor and alert on diverging differences over time.
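A tiny sketch of that comparison step (the decision type and the way divergence is tracked are invented for illustration): the primary decision is always the one honored, and the shadow decision from Stand-in only feeds monitoring and alerting.

```go
package main

import "fmt"

// Decision is the outcome of a payment authorisation; the fields are
// illustrative, not Monzo's actual schema.
type Decision struct {
	PaymentID string
	Approved  bool
}

// DivergenceTracker counts how often the shadow (Stand-in) decision differs
// from the primary decision. The primary decision is the one sent back to the
// payment network; the comparison only drives alerting over time.
type DivergenceTracker struct {
	total, diverged int
}

func (t *DivergenceTracker) Compare(primary, shadow Decision) {
	t.total++
	if primary.Approved != shadow.Approved {
		t.diverged++
		fmt.Printf("divergence on %s: primary=%v shadow=%v\n",
			primary.PaymentID, primary.Approved, shadow.Approved)
	}
}

func (t *DivergenceTracker) Rate() float64 {
	if t.total == 0 {
		return 0
	}
	return float64(t.diverged) / float64(t.total)
}

func main() {
	var tracker DivergenceTracker
	tracker.Compare(Decision{"pay_1", true}, Decision{"pay_1", true})
	tracker.Compare(Decision{"pay_2", true}, Decision{"pay_2", false})
	// Alert if this drifts above an example threshold over time.
	fmt.Printf("divergence rate: %.0f%%\n", tracker.Rate()*100)
}
```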

We don't expect the two platforms to always make the same decisions. We expect them to be pretty close all the time, but effectively we're able to spot differences if over time changes happen in our primary platform that aren't being also changed in our stand-in platform. To make sure that we can still process payments in our stand-in platform without routing via our primary platform, in the case that our primary platform is completely out, we want to know that we can connect from stand-in. So every day we have an automated test that will directly connect Monzo Stand-in to our payments networks. It will receive a portion of our inbound payments traffic. It will process those payments, it will make decisions, it will send the decisions back to our payments networks. And again, that's a big end to end. We see the whole end to end flow again with the advices being sent back to our primary platform.

The main purpose of that test is specifically to test that our network connection is alive and it can be used if we need it. We also run coherence testing over the data that's syncing both to the Stand-in platform and the advices that are coming back. What this helps us with is to be able to detect systemic problems over time where we have two data sets that are effectively diverging. We can monitor and alert on all those things, and we also automatically fix forward if we spot differences. All of these tests and all this work really gives us continuous high confidence that the Stand-in platform is performing as we expect, and that makes it super trivial to use during an incident when we really need it.

Thumbnail 3050

Real-World Results, Key Takeaways, and Additional Resources

So Monzo Stand-in has been used in real incidents over a range of severities. We built Monzo Stand-in to help us in the most severe outages, when our whole primary platform is down. But in practice we've used it in many, many more cases than that. Everything from a partial outage of a portion of our primary platform all the way to key infrastructure failure, we've used Monzo Stand-in. Sometimes not the whole thing, sometimes simply just some customers being routed for Mastercard payments or some customers being routed for another specific payment processor. But in some cases, the whole thing, with the app, with all payments, and so on.

Thumbnail 3100

Some of the questions that we hear when we come and talk about Monzo Stand-in are: doesn't this cost an absolute fortune? Because of the reduced complexity of the Stand-in platform, even though we're running it all the time in the background, it costs us about 1% of the cost of our primary platform. So it really doesn't cost us an awful lot. Isn't it a pain to maintain, having two completely different platforms? And the answer is just no. We automatically update things like library versions and language versions, but across the last year, fewer than 1% of our changes were made explicitly to our Stand-in platform, so it's really not a burden to maintain.

And finally, I guess the question is: should I build one? While the decision really is yours to take, I believe this design only really works well if you operate both the client and server side of your product, availability is absolutely critical to you and your customers, you can't tolerate any downtime, and you're able to take the business decisions to accept some of the trade-offs I talked about. If those things are true, I think this design could really work well; unless they are, I think it might not fit well. I'm going to hand you back to Clark.

Thumbnail 3190

Thumbnail 3200

Thumbnail 3210

Thank you, Andrew. Appreciate that. So just to wrap up. Unsurprisingly, we are going to finish by stressing how important it is to test. We're going to test our application in both cloud service providers in lower environments, pre-production, staging, and so on. We're going to test them under load. We're going to test under various conditions for latency, and we're going to test them all of the time. We're even going to run test traffic to our lifeboat, whether that's synthetic traffic or some percentage of our actual production customer traffic, so that we know with high confidence that the system is running and working as we expect, and it's there when the business needs it to be.

Thumbnail 3230

I want to leave you with just a couple of other sessions to go check out if you have time. There are three more there on the board. One that's not listed here is coming up tomorrow afternoon, where I will be speaking about how to centralize your multicloud management with AWS. There is a kiosk in the AWS village where we have demos, socks, even nice little cloud temporary tattoos and stickers, and a lot more information about all of the multicloud features that AWS has to offer. And Andrew and I will both be at the Meet the Expert booth in the middle of the AWS village, starting at 2 o'clock today all the way until 3. So if you have any questions that we haven't answered here for you today, we'd love to see you and talk to you then.

Thumbnail 3280

And just to share a few more resources that we think will help you move faster on your multicloud journey, and on your resilience journey as well, we have a few additional links. We have the AWS multicloud page, and we have the recently launched AWS multicloud blogs, where you can find additional information on how to use AWS services to build your multicloud systems. We have a link to the blog post where Monzo talks about how they built the Stand-in platform that Andrew told us about. There's plenty of useful information, and we also have the link Andrew mentioned to the Civil Aviation Authority report from May 2023. It's available there for you to take a look as well.

Thumbnail 3350

With that, we would like to thank you all for being here, for being in Vegas this week, and for joining us for this session. Thank you, Andrew, as well, for joining us and talking to us about Monzo. Please don't forget to rate the session in the mobile app, and if you have additional questions or anything else you want to talk to us about, we will certainly be around. We'll be at the experts booth as well. It was really a pleasure to be here with you all today. Thank you all. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
