
Kazuya


AWS re:Invent 2025 - Capital One: Building resilient systems by engineering for five nines (IND3308)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Capital One: Building resilient systems by engineering for five nines (IND3308)

In this video, Capital One's engineering leaders share their journey building highly resilient platforms for business-critical applications at massive scale. They discuss their 2019 transformation from duplicated line-of-business capabilities to a unified platform organization, focusing on reliability as a foundational pillar. The presentation covers architectural evolution from anti-patterns (single region deployments, tightly coupled microservices) to modern standards including multi-region deployment, domain-driven design, and shuffle sharding techniques that reduce blast radius to 0.0007%. Key topics include serverless adoption with ECS Fargate, zero-downtime deployments using AWS CodeDeploy, infrastructure as code with AWS CDK, chaos engineering with AWS FIS, and observability standards. They detail handling poison pill requests through circuit breakers and rate limiters, achieving five nines availability for mission-critical applications, and implementing DynamoDB multi-region strong consistency for cross-regional data integrity. The session emphasizes shift-left testing, automated failover/failback mechanisms, and moving from static to dynamic alerts based on traffic patterns.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Capital One's World-Class Resilience Practices at AWS re:Invent 2025

First of all, welcome to re:Invent 2025. I hope day one is going well for you. My name is Enrique Bastante. I am a Senior Customer Solutions Manager with AWS Strategic Accounts, and I'm really excited about this session today for a couple of reasons. First, as part of my role, I get to work closely with Capital One on some really cool stuff, including some of the things that you'll hear about today. But also because resilience is one of my favorite topics when it comes to cloud. It cuts across every industry and it's important to everyone.

You may or may not know this, but Capital One is really good at resilience. They're often regarded across AWS as a world-class organization when it comes to resilience practices, so I'm really excited for you to hear about it today. I've got with me two leaders from their enterprise platform organization, who are going to talk to you about how Capital One goes about building highly resilient platforms for business-critical applications that run at massive scale. They're going to talk a little bit about how they got to where they are today, some of the challenges they faced early on, and some of the decisions that they made along the way.

They're also going to talk to you about some of the best practices and lessons learned. But they're not going to stop there. They're going to talk to you also about where they're headed, so some of the more advanced resilience patterns that they're implementing now to continue to build more resilience into these business-critical applications. Because as you know, when it comes to the stuff that runs your business, you always have to continue to raise the bar. So welcome again, thank you for joining us, and with that said, I'm going to pass it over to Sobhan.

Thumbnail 150

From Duplicated Capabilities to Platform Organization: Capital One's Transformation Journey

Thanks for setting the stage for us. First of all, good afternoon, everyone. Thanks for joining us today. I'll start with a simple question. How many of you are using any of Capital One's apps, like our shopping, banking, or any of our other services? A lot of you. All of these services and apps are built on top of our modern platforms. So today we are going to focus on our journey as a platform company: the principles we believe in when building platforms, our focus on reliability, and the architectural patterns and best practices we adopted to earn our customers' trust.

Thumbnail 180

Thumbnail 200

Platforms are nothing but digital Lego blocks that we leverage to build our customer capabilities and functionality, much like building a vibrant city from a box of Lego bricks. Before 2019, we used to have duplicated capabilities across our lines of business. For example, our lines of business like card and bank each built payment or transaction processing capabilities according to their own business needs. So we observed non-standardized architectural patterns and duplication of data. For example, if I'm a customer that has both a banking account and a credit card account, my customer data is duplicated across these lines of business in different formats.

Thumbnail 300

We spent a lot of developer effort and runtime cost on that, and it hurt our speed to market for business capabilities and our ability to scale out platforms. In turn, it resulted in a loss of customer trust. So in 2019, we declared ourselves a platform organization. We put heavy investment into building foundational structures, or platforms, that are not just reusable capabilities but are also scalable, reliable, and trustworthy.

There are multiple flavors of these platforms that can be leveraged by both our internal customers, like our developers, our LOB partners, and our business stakeholders, as well as our external customers and end users. Some of them are developer partner platforms like the CI/CD platform, where we streamline our deployment processes with a lot of controls and guardrails embedded into it so that every engineering team or engineer can benefit from it and doesn't need to reinvent the wheel.

Similarly, we resolved the arbitrary uniqueness across lines of business by building core business platforms, like a transaction processing platform or a payment processing platform, to provide a seamless experience across all lines of business. We have also invested in building customer-facing platforms, like an identity platform or a messaging platform, that provide a seamless customer experience regardless of which platform the capability is built on.

Thumbnail 400

Our objective is to scale as an organization, and these platforms can only scale based on the trust we earn from our users. To secure that foundational trust, we believe our modern platforms have to stand on seven foundational pillars. All seven pillars are critical, but today we will be focusing on the first one: reliability.

Reliability as the Engine for Trust: Measuring Availability, Resiliency, and Customer Success

Let me take an example. If I'm a customer trying to make a payment and my deadline is today, I use the Capital One app, and my payment was not successful on the first attempt. I retry it, and it still hasn't gone through. As a customer, I look at my other options to make that payment. As a company, I'm losing my customer trust and the business as well.

Thumbnail 460

In this journey, the availability of your platform and its resiliency are very important. Your availability is measured based on your uptime and your success rate. Uptime is a classic metric that can help define whether your service is operational or not. Even though the service is up and running, if customer transactions are ending up with errors, that means customers are not happy. So how can we measure the success rate? We aggregate all the requests that we receive as a platform and look at how many of them actually error out. As an industry standard, we measure this in nines.

Resilience plays a critical role. However well we build, as you all know, we still face failures. While we are designing these platforms, how can we ensure they can recover from those failure points and unexpected chaos? Fault tolerance is key. When we design and build a bridge with redundant pillars, we ensure that if something goes wrong with one of those structures, the bridge is still up. Similarly, when we are building a platform, we need to ensure that if a component fails, our platform is still serviceable.

Thumbnail 590

We also measure our mean time to recover. When a failure happens, how quickly are we going to identify the failure? How can we isolate from that failure? How quickly can we auto-resolve from that failure? These are very critical. I treat reliability as an engine for trust. So along with availability and resiliency, it is also very important that the capabilities provided by our platform are functioning as intended.

Over a period of time, that is going to build trust in our platform with our users. Our users, our customers, provide their data, spend their time, and rely on our systems. With so much trust placed in these services, as platform builders and owners we need to make sure our capabilities work as expected and are reliable, scalable, and trustworthy. To get into the details of the how, I'm going to hand it over to Aaron, who will walk through the architectural practices and best practices that we adopted throughout this journey. Thank you, Sobhan.

Thumbnail 670

Thumbnail 680

Architectural Strategies and the Shift-Left Approach to Resilience Engineering

Cool. Nice. All the good things have to start like this. Cool. So I have about 47 minutes of engineering content relating to reliability engineering and resilience engineering, and Capital One's approach to accomplishing those, so I'm going to dive deep. There are two focus areas in my talk. In the first, I'm going to dive deep into the architectural strategies that we follow, the architectural pitfalls we had in the past, and our architectural journey, basically our evolution towards high availability and resiliency. Then I will talk about our observability adoption: why we are doing observability at Capital One and how it is improving the reliability of the systems we build.

Thumbnail 750

A lot of times, when we talk about resiliency and reliability, we tend to focus only on the architecture. But there are a lot of non-architectural factors that impact the reliability of your system. So I'm going to dive deep into the different failure modes that could impact your platform services, and then I will talk about zero-downtime deployments using AWS CodeDeploy. Then I will talk about building reliable infrastructure using AWS CDK, then move on to a resiliency testing framework that we built using AWS FIS, and finally wrap it up with observability standards.

Thumbnail 780

Resilient architecture is the foundational building block of any system. When I talk about architecture, I'm talking about the system architecture as well as the deployment architecture. These are the building blocks, and resiliency and reliability shouldn't be an afterthought. The moment we build something into the architecture, or add more components to it, we need to bring a shift-left mindset about reliability into the architecture.

Thumbnail 790

Anti-Patterns from the Pre-Platform Era: Single Points of Failure and Tightly Coupled Microservices

So let me start with some of the anti-patterns that we had in the past, in the pre-platform world, because Sobhan talked about some of the non-standard architectures and the pitfalls that we had. Let's start with this. We call this the all-in-one approach. You can see a lot of single points of failure in this approach. We have all these platform capabilities packaged into a single monolith. It deploys into a single AWS region with a single availability zone, with no auto scaling and no database replication. This pattern is a big no at Capital One. From day one of Capital One's cloud migration journey, we made a clear decision that we are not going to deploy in a single region or with so many single points of failure. We have always deployed our services into two regions.

Thumbnail 870

Then you might be wondering why I'm talking about this now. AWS had a us-east-1 outage recently, in October. All of our systems were able to quickly recover from it by automatically failing over to the other region. But I have seen some systems outside of Capital One struggle during this outage because they were heavily reliant on a single region, with no other region to run their systems. This is a big no. If in 2026 you're still debating whether to run in multiple regions, remember that you're in this room because you think resiliency is important. You wouldn't take your car on the road without insurance. The same concept applies here. You have to have multiple regions for higher resiliency and higher reliability.

So avoid single points of failure; that's one bad architectural pattern. The next one that we saw in our pre-platform paradigm was a proliferation of microservices with zero fault tolerance. The moment we thought about moving from a monolith to a microservices-based approach, we tended to create a lot of microservices in a tightly coupled manner. So if one service goes down, it impacts all the other services. When we think about microservices, every single service should be able to serve a capability on its own and avoid cascading failures.

In this example, if one service goes down, the customer is impacted because all these services are tightly coupled in order to provide a capability to the customer. This is an architectural anti-pattern for high resiliency, or rather, it results in low resiliency.

Thumbnail 930

Another anti-pattern from the pre-platform paradigm was on the deployment side: we had a single cluster sharing different traffic patterns. Web and mobile customers were using the real-time clusters, and at the same time, batch load was hitting those same real-time clusters. This put pressure on our systems, causing a lot of database timeouts and impacting the real-time users' experience. With this approach, you cannot reach high availability. This is an anti-pattern. In the modern platform world, we have routed all the batch traffic to a separate cluster, using Step Functions or other vendor products for a batch execution model.

Thumbnail 980

Another anti-pattern we had in the pre-platform paradigm was a single database shared between analytical loads and real-time customer loads. Whenever an analytical job ran, it hogged the system resources. When it consumed those resources, we started getting a lot of database timeouts and throttling, which impacted our user experience. This is another anti-pattern that existed in our pre-platform world.

I have talked about most of the anti-patterns that hurt our reliability. Now let's talk about the patterns we follow and our enterprise architectural standards.

Thumbnail 1010

Enterprise Architectural Standards: Multi-Region Deployment and Domain-Driven Design

Thumbnail 1020

When it comes to deployment, we always deploy into two regions. All of our CI/CD pipelines deploy the same version of software into two different regions, and all of our services are scaled out so that a single region can take the full load at any time. When it comes to architectural dependency, no region should have tight coupling or any dependency on the other region. All these services should have regional affinity. Think of this: you fail over from one region to another because region one is impaired, but if you still have a dependency on the impaired region, you are not solving the problem. You are defeating the purpose of having two regions. Our guidance for you is that all these regional services should have regional affinity and should operate in a silo.

Our systems also have dependencies on other platforms within Capital One, so we maintain the same level of resiliency standards across all of our systems. If I have a top-tier resiliency standard, all my dependencies should follow the same resiliency standard. When it comes to data consistency, we always tend to choose databases that are auto-replicable. When it comes to failure recovery management, any single-region failure should not impact our service capabilities, and all of our services should be equipped with auto failover. All of our platform services have a very strict RTO and RPO defined.

Thumbnail 1120

The bottom line here is that when we are aiming to build a highly reliable system, we need to clearly define the goals we are trying to achieve. Without clearly defined goals, you cannot track your resilience and reliability aspirations.

Thumbnail 1130

Let's move on to the architectural patterns. Putting the words into a diagram, you can see we follow the domain-driven design pattern. All of our modern platforms follow domain-driven design because it gives us high flexibility. It creates modular services which are highly decoupled and fault tolerant.

Thumbnail 1150

Think of it this way: take a bigger domain and divide it into multiple subdomains. These subdomains have their own entities, aggregates, and their own database and layers, and they operate heavily within a bounded context. Now map this onto a platform model. You have a platform. This platform has multiple capabilities. Each capability has its own entities, aggregate models, and value objects, and they do not depend on each other. Every capability runs in a single bounded context with no dependency on any other capability.

Thumbnail 1180

Take banking as an example: treat the banking domain as the larger domain. You can divide it into subdomains like a transaction processing service, a reporting service, customer management, and an account service. The bottom line is that these subdomains operate independently and every capability has its own SLAs. You are not targeting five nines for all of these services.

Thumbnail 1220

So you can set different uptime SLAs and response time SLAs, and domain-driven design is best for this. What are the other benefits? These domain-driven designs are modular, which means the services are easy to maintain, repair, and replace. They operate within a bounded context in a highly decoupled manner, are fault tolerant, and are independently scalable. These service capabilities give you SLA flexibility. Not only that, our customers can now pick and choose the capability they want to connect to. For example, our finance teams will use the reporting module, whereas card and bank will use capabilities like transaction processing. So this offers extreme flexibility.

Thumbnail 1270

Putting these concepts into a diagram, this is the minimum deployment requirement at Capital One: two regions with multiple availability zones. We have domain-driven design with bounded-context services, auto scaling enabled, and a database that auto-replicates the data. We also have Route 53 record sets of two types. One is geolocation-based, so every request that comes from a customer goes to the nearest data center. On top of that, we also have the failover record set. This failover record set is the mechanism we use for auto failover.

Thumbnail 1340

The failover record set has two routes. The primary route goes to the region nearest to the customer. The secondary route takes care of the failover mechanism. It keeps an eye on the primary region's health, and the moment we find out that the primary region is faulty or impaired, auto failover kicks in through the Route 53 failover record set. All of that is done automatically. This is the minimum deployment architecture requirement for us, and it gives us guaranteed availability of 99.9% up to 99.99%. But there are some mission-critical applications that require five nines availability, so let's look at the challenges we have with this architecture pattern.
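
As a rough illustration of the failover record set described above (not Capital One's actual configuration), the sketch below uses boto3 to define a PRIMARY/SECONDARY failover pair in Route 53. The hosted zone ID, domain, health check ID, and ALB targets are hypothetical placeholders.

```python
# Hedged sketch: a Route 53 failover record pair. All identifiers below
# (hosted zone, domain, health check, ALB targets) are hypothetical.
import boto3

route53 = boto3.client("route53")

def upsert_failover_record(set_id, role, alb_dns, alb_zone_id, health_check_id=None):
    """Create or refresh one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": "payments.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,                       # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,        # the ALB's canonical hosted zone
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",         # hypothetical public hosted zone
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

# The primary serves traffic while its health check passes; Route 53 flips to
# the secondary automatically when the primary region looks impaired.
upsert_failover_record("region-1", "PRIMARY", "alb-east.example.com",
                       "Z00000000000000000001", health_check_id="00000000-primary")
upsert_failover_record("region-2", "SECONDARY", "alb-west.example.com",
                       "Z00000000000000000002")
```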

Thumbnail 1350

Thumbnail 1360

Thumbnail 1380

Advanced Resilience Patterns: From Circuit Breakers to Shuffle Sharding for Five Nines Availability

Now imagine you have 10 customers and 2 regions, with these bounded-context services and all the good things I talked about. Let me introduce something called poison pill requests. A poison pill request is a type of request that is capable of killing your service every single time it gets processed. It could cause a memory outage, saturate your resources, or cause database timeouts and retry storms. The bottom line is that it causes an outage very consistently, every single time.

Now think about this poison pill and apply it to our architecture. Customer C2 is sending a poison pill request, and we have the router that sends the request to region 1. The region will have multiple tasks. This poison pill request goes to service task 1, and service task 1 crashes. We have an Application Load Balancer in front of the service tasks, and the ALB operates in a round-robin manner. So every time the customer sends the request, service task 1 crashes, then service task 2 crashes, then service task 3 crashes. Now guess what? Our auto failover will kick in, because it will think region 1 is impaired since all the tasks are crashing. We have alerts based on the unhealthy host count and HTTP error codes, so Route 53 will think our region is down and will automatically flip the traffic to region 2.

Thumbnail 1460

But now what happens? The customer is still sending the poison pill request. The request goes to region 2, where it starts killing the tasks one after the other. So now it is worse than the previous situation: you are not only impacting region 1, you are also impacting region 2. How are we solving this problem? We are creating a circuit breaker. Typically, a circuit breaker is used in the microservices world when a dependency is struggling: we do not want to overwhelm the dependency, so we back off for a while. We are shifting that circuit breaker capability left, into our router. Our circuit breaker has customer-level metrics, so it is enabled with observability that tracks poison pill requests at the customer level. The moment we find out a customer is sending a poison pill request, we open the circuit and stop taking their requests.

Thumbnail 1490

We have also enabled this router with rate limiters, because sometimes our customers used to overrun the systems and cause reliability issues. Even though we have these two capabilities, they are more reactive capabilities, right?
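
To make the two router capabilities more concrete, here is a minimal, illustrative sketch in Python of a per-customer circuit breaker and a token-bucket rate limiter. The thresholds, cooldowns, and class names are assumptions for illustration, not Capital One's implementation.

```python
# Illustrative only: per-customer circuit breaker + token-bucket rate limiter
# applied at the router, in front of the shards. Thresholds are assumptions.
import time
from collections import defaultdict

class CustomerCircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = defaultdict(int)   # consecutive failures per customer
        self.opened_at = {}                # customer -> time the circuit opened

    def allow(self, customer_id):
        opened = self.opened_at.get(customer_id)
        if opened is None:
            return True
        if time.time() - opened >= self.cooldown_seconds:
            # Half-open: let one probe request through to test recovery.
            del self.opened_at[customer_id]
            self.failures[customer_id] = self.failure_threshold - 1
            return True
        return False                       # circuit open: stop taking this customer's requests

    def record(self, customer_id, success):
        if success:
            self.failures[customer_id] = 0
            return
        self.failures[customer_id] += 1
        if self.failures[customer_id] >= self.failure_threshold:
            self.opened_at[customer_id] = time.time()

class TokenBucketRateLimiter:
    def __init__(self, rate_per_second=100.0, burst=200):
        self.rate, self.burst = rate_per_second, burst
        self.tokens = defaultdict(lambda: float(burst))
        self.last_seen = defaultdict(time.time)

    def allow(self, customer_id):
        now = time.time()
        elapsed = now - self.last_seen[customer_id]
        self.last_seen[customer_id] = now
        self.tokens[customer_id] = min(self.burst, self.tokens[customer_id] + elapsed * self.rate)
        if self.tokens[customer_id] >= 1.0:
            self.tokens[customer_id] -= 1.0
            return True
        return False                       # customer is overrunning the system
```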

Thumbnail 1520

When we talk about five nines, your budget for errors or downtime in a year is about five minutes. When these things happen, you have already breached your reliability aspirations. This is a reactive system. What are we doing now?

Thumbnail 1550

Think of this. We have been designing sharding techniques. We started with iteration one, the standard sharding pattern, where we have three shards and three customer groups, and a consistent hashing algorithm that always creates stickiness. So customer group one always gets attached to shard one, group two goes to shard two, and group three goes to shard three. Now imagine customer C6 is sending a poison pill request. It goes to shard one and completely takes down shard one, but the other customer groups are not impacted because their requests go to the other shards.

Thumbnail 1590

Is this good enough? It is not good enough for us. Imagine we did not have the circuit breaker or the rate limiters: we used to have a blast radius of one hundred percent. After the sharding technique, our blast radius is reduced to thirty-three percent. Out of ten customers, you are only impacting three, because the other customers are on healthy shards. Is this good enough for us to build five nines services? Definitely not.

So where we are heading is the shuffle sharding pattern. I am pretty sure everyone in this room has played cards. You take a deck of cards, shuffle it, and with multiple players you deal five, seven, or three cards according to the game, and each player ends up with a different combination. The same concept is applied here. Imagine the deck of cards as a deck of shards. You have multiple shards and multiple customers. You create multiple combinations of shards, and every time you onboard a customer, you create a unique combination and attach those particular shards to that customer.

Thumbnail 1690

In this diagram, take a close look at customer one and customer three. In the middle block, they share the same shard, shard thirty. Now think of the poison situation. Customer one is sending a poison request. It will only impact the shards that customer one belongs to: shard one, shard thirty, and shard seventy. Even though customer three's shard thirty is impacted, customer three has two other healthy shards. So either when the customer retries or when we retry internally, customer three's request will go to shard seven or shard ninety-nine. Likewise, if shard thirty itself had a problem, a database problem or a shard problem, neither customer one nor customer three is fully impacted, because they have other healthy shards to use. This is the pattern that has proven very useful for us in achieving five nines.

How do we create these multiple shard combinations? With a mathematical formula: the binomial coefficient. You key in the different values. In my example, I have one hundred shards, a million customers, and I am trying to assign three shards per customer. With this formula, I get one hundred sixty-one thousand seven hundred (161,700) unique combinations. These unique combinations of shards are split across the different customers, in this case a million customers. The success of this model depends on how much you reduce the overlap. With this formula, at any given point in time, about seven customers will share the same combination of shards.

Thumbnail 1750

Now think of this. From a one hundred percent blast radius to a thirty-three percent blast radius, we have come down to a 0.0007 percent blast radius because of the shuffle sharding technique. This has proven very helpful for us in reaching the five nines standard for some of our mission-critical applications. Where do we put this logic? There is a sample Python snippet for creating the different shard combinations: you key in your number of customers, the number of nodes you want, and the shards per customer, and the code outputs the different shard combinations. In this example, you can see customer C1 has S2, S4, and S5, and customer one hundred has S2, S4, and S5.
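
The Python sample shown on screen isn't reproduced in this article, so below is a minimal reconstruction of the same idea under stated assumptions (the shard naming, random seeding, and round-robin assignment of combinations are mine). It also confirms the binomial-coefficient count: math.comb(100, 3) is 161,700.

```python
# Reconstruction of the idea only, not the code from the slide.
import math
import random
from itertools import combinations

print(math.comb(100, 3))   # 161700 unique 3-shard combinations from 100 shards

def assign_shuffle_shards(num_customers, num_shards, shards_per_customer, seed=7):
    shards = [f"S{i}" for i in range(1, num_shards + 1)]
    combos = list(combinations(shards, shards_per_customer))
    random.Random(seed).shuffle(combos)
    # Walk the shuffled combination list; with 1M customers and 161,700
    # combinations, roughly 7 customers end up sharing each combination.
    return {f"C{i + 1}": combos[i % len(combos)] for i in range(num_customers)}

# Small demo so the output stays readable.
assignments = assign_shuffle_shards(num_customers=10, num_shards=6, shards_per_customer=3)
print(assignments["C1"])   # e.g. ('S2', 'S4', 'S5')
```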

Thumbnail 1780

Where does this logic sit? We have this router, and this is where the logic sits. Every time a customer is onboarded, the router creates the shuffle: it creates the shard combination and attaches it to the customer in durable storage. We have a database attached to the router, so the router creates this combination and creates the stickiness between the customer and their shards.

In front of the shards, we also have the load balancer. The load balancer by default uses a round-robin algorithm. Let's take a close look at customer C1. Imagine C1, whose assigned shards are S1, S2, and S6, sends a request. The first time, the router knows this is the first request the customer is sending and that their shards are S1, S2, and S6. With the round-robin algorithm built into it, the router sets the header to S1. The request, carrying this header, then goes to the load balancer, which uses header-based routing to send it to that shard.

The second time the customer sends a request, the router attaches the header S2, and the request goes to S2. The third time, it goes to S6, and so on. You can think of each shard as a target group of Amazon Elastic Container Service tasks. Every shard shown here is an ECS target group, and the tasks within each shard can be scaled independently.
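
The sketch below shows the routing behaviour just described in simplified Python: the router looks up the customer's shard set (created at onboarding), picks the next shard round-robin, and stamps it into a header that the ALB's header-based rules map to an ECS target group. The header name and in-memory storage are assumptions; in practice the assignment lives in the router's durable store.

```python
# Conceptual sketch of the router behaviour, not Capital One's implementation.
from itertools import cycle

class ShardRouter:
    ROUTING_HEADER = "x-shard-id"   # hypothetical header the ALB rules match on

    def __init__(self, shard_assignments):
        # shard_assignments: customer_id -> tuple of shard names, created at
        # onboarding. Kept in memory here; durable storage in production.
        self._cursors = {cust: cycle(shards) for cust, shards in shard_assignments.items()}

    def route(self, customer_id, request_headers):
        # Round-robin across the customer's own shards only.
        shard = next(self._cursors[customer_id])
        return {**request_headers, self.ROUTING_HEADER: shard}

router = ShardRouter({"C1": ("S1", "S2", "S6")})
print(router.route("C1", {}))   # {'x-shard-id': 'S1'}
print(router.route("C1", {}))   # {'x-shard-id': 'S2'}
print(router.route("C1", {}))   # {'x-shard-id': 'S6'}
```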

Thumbnail 1870

There was another way of achieving this. We explored shifting the sharding mechanism into an SDK. Our platforms have an SDK, and this is where the code would live. The SDK knows the backend capability: the capacity of the backend, how many shards are running, how many customers will be onboarded, and how many shards need to be assigned per customer. The SDK creates these combinations, creates the headers, and sends each request to the load balancer.

With this approach you avoid the router, but the biggest problem is that every time you make a code change or want to release a new version of the SDK, you have to reach all your customers and have them adopt the latest version. For this reason, maintaining the SDK would be challenging, so we are inclined to go with the router approach instead.

Thumbnail 1930

We have talked a lot about architecture. What are the recommendations? Use multiple regions and multiple availability zones, and use domain-driven design if it is applicable to you. Use auto scaling, watch out for retry storms, and try to use fail-fast timeouts, because we had some reliability issues due to high timeouts. Use the sharding technique that fits your use case best, keeping in mind that managing the infrastructure for sharding will be challenging. And beware of the different failure modes, which I will talk about in the later part of the session.

When we talk about five nines, we are not saying that every platform capability must be five nines. Only the mission-critical capabilities should be five nines. It is very hard to manage a five nines service.

Thumbnail 1980

Thumbnail 1990

Moving to Serverless and Managing Failure Modes: From EC2 to ECS and Sandbox Safety Models

We are also a serverless company. Why did we move to serverless? Because we had some critical loads running on EC2 instances, and that caused reliability issues. We had Docker containers running on EC2 instances, and whenever we wanted to scale out, we had a few incidents where the specific instance type was not available or we ran out of IP addresses, so we could not auto scale in a timely manner. Also, whenever we wanted to scale, EC2 takes a lot of lead time to come up: you need to bring up an EC2 instance, run the user data scripts, and then bring the Docker container up. It takes a lot of time for EC2-based instances to come up and start taking requests.

Not only that, it also created operational problems. Managing a fleet of thousands of EC2 instances and having to patch and maintain them was a big human toil for us. It also created reliability issues when we tried to update the EC2s: we made some manual errors, and that caused outages. That is the reason we moved to serverless. With serverless, none of these manual operations exist anymore, and we are able to scale out reliably. All of our operational overhead has been taken care of. If you are in the cloud and still managing EC2 instances, you have to reconsider that and try to adopt serverless technology; all of our mission-critical platforms now run on ECS Fargate services.

When we talk about AWS Lambda, I want to make a note here. We use AWS Lambda only on non-critical asynchronous paths, because in the past we had some reliability issues. Whenever you run AWS Lambda functions, make sure you create bulkhead patterns between the functions by setting maximum concurrency limits on the Lambdas, because if you're not careful they have a tendency to occupy the entire account-level concurrency limit, which will cause reliability incidents.
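
As a hedged sketch of that bulkhead guidance (function names and limits are illustrative), reserved concurrency can be capped per function so no single Lambda can drain the account-level concurrency pool:

```python
# Illustrative bulkheads: cap each function's reserved concurrency so one
# busy Lambda cannot exhaust the shared account-level concurrency pool.
import boto3

lambda_client = boto3.client("lambda")

BULKHEADS = {                      # hypothetical function names and limits
    "async-notifications": 50,
    "batch-report-generator": 20,
}

for function_name, reserved in BULKHEADS.items():
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
```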

Thumbnail 2130

These are all the different serverless services that we use at Capital One, and we are benefiting greatly from them. This has been the biggest game changer for us operationally. We have become really efficient with this adoption.

Thumbnail 2140

Let's talk about failure modes. We constantly review the different failure modes. These are the ones we look into. Your cloud provider will fail you. Your internal platform dependencies could fail you. Your external vendors could fail you. Your customers can send a poison request and fail your systems. Your own platform engineers will introduce bugs and create reliability problems. The last one is untrusted code, which is a very special condition: we are building a platform that takes business logic from internal developers in our different business units. For instance, we have a card business unit and a bank business unit. They pack their business logic in code and ship it to us, and we execute it in our platform. The concern here is not security; it is reliability.

Thumbnail 2190

We had this design in our version one, where our services and this untrusted code shared the same VM resources. When a request comes in, we call the untrusted code function. Even though these functions run in separate threads, they share the underlying JVMs and VMs. Imagine the untrusted code has something like a System.exit, a wild infinite loop where the code runs nonstop, or recursion causing stack overflow errors. There are possibilities that, intentionally or unintentionally, our internal developers could introduce reliability issues into these platforms.

Thumbnail 2300

For platforms like this, we are moving towards a sandbox safety model. In the sandbox safety model, we create multiple micro JVMs, each heavily built around a bounded context, and the service VMs are never shared with these micro JVMs. The untrusted code is allocated a fixed capacity of, let's say, ten megabytes. We set very strict execution limits, like timeouts for every single function execution, and we strictly prohibit IO access paths, so the untrusted code can never access critical file system paths and cause an outage. If malicious code runs inside this untrusted sandbox block, whatever happens within the sandbox stays within the sandbox. It will never impact your service, because you're not sharing anything with the untrusted code. If you are building a platform similar to this, I highly recommend you read about the sandbox safety approach.
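
Capital One's sandbox is built around micro JVMs; as a loose analogue only, the Python sketch below applies the same ideas (a hard memory cap, a strict timeout, an isolated process) using the standard library. Limits and paths are illustrative, and the resource limits are POSIX-only.

```python
# Analogous sketch only (their platform uses micro JVMs): run untrusted code
# in a separate process with a memory cap and a hard timeout, sharing nothing
# with the service process. POSIX-only because of resource.setrlimit/preexec_fn.
import resource
import subprocess
import sys

MEMORY_LIMIT_BYTES = 10 * 1024 * 1024   # ~10 MB address-space cap
TIMEOUT_SECONDS = 2                     # strict execution timeout

def _apply_limits():
    # Runs in the child before exec: a runaway allocation kills only the
    # sandboxed process, never the service that launched it.
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT_BYTES, MEMORY_LIMIT_BYTES))

def run_untrusted(snippet_path):
    return subprocess.run(
        [sys.executable, "-I", snippet_path],   # -I: isolated mode
        preexec_fn=_apply_limits,
        timeout=TIMEOUT_SECONDS,                # infinite loops are cut off here
        capture_output=True,
        text=True,
    )

# Whatever happens inside the child (OOM, sys.exit, stack overflow) stays in
# the child; the caller sees a CompletedProcess or subprocess.TimeoutExpired.
```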

Thumbnail 2310

Thumbnail 2320

Infrastructure as Code with AWS CDK: Eliminating Manual Errors Through Automation

That's a lot of architectural context. Let's move on to some of the non-architectural factors. What challenges did we have? One of the biggest was infrastructure management. Our generation-one pipeline was heavily YAML-based, with multiple values files, multiple environments, and a lot of copy-paste, which led to manual errors and caused reliability problems. For example, we had a port misconfiguration: instead of port 8080, someone typed port 8808, a one-digit typo, and it caused a reliability incident. Our pipeline did not have drift detection capability, so we had one setup running in the cloud while our repository said something completely different. The next time we created the infrastructure, we created it with the faulty configuration, which caused a reliability incident.

Thumbnail 2370

Thumbnail 2390

Thumbnail 2400

We are taking a quantum leap to mitigate this with complete automation using AWS CDK. When we talk about reliability, architecture is one thing, but these non-architectural things like infrastructure and code releases have to support your reliability goals. We are taking this quantum leap from infrastructure as configuration into infrastructure as code. We manage our complete infrastructure using code. We use AWS CDK for that. AWS CDK is the Cloud Development Kit.

Thumbnail 2440

Now think about this: you have Route 53, Application Load Balancers, and ECS containers. If I say all of this can be coded using an imperative programming language such as TypeScript, Java, Go, or Python instead of using a vendor-specific domain-specific language or YAML file, this unblocks significant capability. This is an example of AWS CDK code where I'm creating an ECS Fargate service behind an Application Load Balancer. When you create this, it generates a CloudFormation template underneath, and the infrastructure is managed using CloudFormation templates. This also unblocks the capability of writing test cases.
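
The CDK snippet shown in the session isn't reproduced here; the sketch below is a minimal equivalent in CDK for Python, creating an ECS Fargate service behind an Application Load Balancer. The stack name, image, ports, and sizing are placeholders.

```python
# Minimal CDK (Python) sketch: ECS Fargate service behind an ALB.
from aws_cdk import App, Stack
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ecs_patterns as ecs_patterns
from constructs import Construct

class PaymentsServiceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Creates a default VPC for the cluster; in practice you would pass
        # your own VPC and subnets.
        cluster = ecs.Cluster(self, "PlatformCluster")

        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "PaymentsService",
            cluster=cluster,
            cpu=512,
            memory_limit_mib=1024,
            desired_count=3,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_registry("public.ecr.aws/nginx/nginx:latest"),
                container_port=8080,
            ),
            public_load_balancer=False,
        )

app = App()
PaymentsServiceStack(app, "PaymentsServiceStack")
app.synth()   # emits the CloudFormation template that manages the infrastructure
```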

Thumbnail 2450

The moment you have infrastructure as code, you can write code to test your infrastructure. In this example, I'm verifying whether the Application Load Balancer is using the right SSL certificate. If not, I won't create the resource: the test case fails on my developer machine and in the CI/CD pipeline. This is shift-left testing capability, and it's the biggest game changer for us. We're no longer creating faulty infrastructure or causing incidents.
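
Here is a hedged sketch of that kind of infrastructure unit test, using the aws_cdk.assertions module: synthesize the stack and fail if the load balancer's listener isn't HTTPS with the expected certificate. The stack class and certificate ARN are placeholders carried over from the previous sketch.

```python
# Sketch of an infrastructure unit test; the certificate ARN is a placeholder
# and PaymentsServiceStack is the hypothetical stack from the earlier sketch
# (assumed here to configure an HTTPS listener).
from aws_cdk import App
from aws_cdk.assertions import Match, Template

from payments_service_stack import PaymentsServiceStack  # hypothetical module

def test_listener_uses_expected_certificate():
    app = App()
    stack = PaymentsServiceStack(app, "TestStack")
    template = Template.from_stack(stack)

    # Fails on the developer machine and in the pipeline if the listener is
    # not HTTPS or carries the wrong certificate.
    template.has_resource_properties(
        "AWS::ElasticLoadBalancingV2::Listener",
        {
            "Protocol": "HTTPS",
            "Certificates": Match.array_with([
                {"CertificateArn": "arn:aws:acm:us-east-1:111111111111:certificate/placeholder"}
            ]),
        },
    )
```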

Thumbnail 2480

Here's another example of how we find bugs. Let's imagine you're trying to create an ECS task with 256 CPU and some memory that is non-standard. This can be caught much earlier. In version one of our CI/CD pipeline, we used to create resources in the cloud and then later discover we created them with faulty configuration. We'd go back, destroy the resources, and rerun to fix and correct it. This back-and-forth process caused reliability issues and significantly impacted developer productivity and cloud costs.

Thumbnail 2530

We also run infrastructure rules as code using CDK NAG. We have multiple rules that we run, such as naming standards, Application Load Balancer port configurations, and whether you have the right SSL ports open. These are all managed using CDK NAG. Infrastructure as code offers this capability, and we run all these rules on the developer machine and in the pipeline before resources are created in the cloud.
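
Capital One's rule packs are internal, but the public cdk-nag library illustrates the mechanism: attach a rule pack as a CDK Aspect and every construct is checked at synth time, on the developer machine and in the pipeline. The stack below is the hypothetical one from the earlier sketches.

```python
# Rules-as-code at synth time; AwsSolutionsChecks is the public rule pack
# that ships with cdk-nag (internal packs would be attached the same way).
from aws_cdk import App, Aspects
from cdk_nag import AwsSolutionsChecks

from payments_service_stack import PaymentsServiceStack  # hypothetical module

app = App()
PaymentsServiceStack(app, "PaymentsServiceStack")

# Every construct in the app is inspected before anything reaches the cloud.
Aspects.of(app).add(AwsSolutionsChecks(verbose=True))
app.synth()
```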

Thumbnail 2560

Zero Downtime Deployments: Leveraging AWS CodeDeploy and DynamoDB Multi-Region Strong Consistency

Let's talk about some of the release techniques we have. When we do releases, we never take down the services; we release while customers are using them. Let me bring back the shuffle sharding diagram. Imagine we have a new version, version N+1, in the blue box, and this blue box is ready to go to production. Instead of applying it to all the shards at once, I pick which shard and which particular task to update. In this example, for customer group one, I drop the new code into a particular task, replace it, and then constantly monitor the behavior of the new version. If I'm confident, I automatically roll the new version out to the other shards, one after another.

Thumbnail 2610

How are we doing this? We're using AWS CodeDeploy completely. We don't have any manual release process. Since we moved to AWS Fargate services, we've been using CodeDeploy. With CodeDeploy and the AppSpec deployment configuration files, we orchestrate our releases. We can specify what our roll-forward strategy is, what our rollback strategy is, and CodeDeploy takes care of everything for you.

Thumbnail 2640

The biggest force multiplier of CodeDeploy is the deployment lifecycle hooks. I've highlighted three examples here: before install, after install, and before allow traffic. Before you install a particular version of the software, make sure the software has all the right configuration. After you install, before you serve customer traffic, make sure the release is actually correct by running some functional tests with synthetic data. We run these functional tests in production using synthetic data to ensure the release is good and won't hurt reliability. And before we allow customer traffic, because some of our systems had cold start problems, we use this lifecycle hook to warm up our containers.

Thumbnail 2690

This is a Lambda example that listens to all these different lifecycle events. For every lifecycle event, we run different pre-validation steps.
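
The Lambda from the slide isn't reproduced here; below is a minimal reconstruction of a CodeDeploy lifecycle hook handler. CodeDeploy invokes a function registered for a hook (for example BeforeAllowTraffic) with a deployment ID and a hook execution ID, and the function reports Succeeded or Failed back. The validation helpers are hypothetical.

```python
# Reconstruction of the idea only. This handler would be registered for the
# BeforeAllowTraffic hook in the AppSpec; similar functions cover the other hooks.
import boto3

codedeploy = boto3.client("codedeploy")

def warm_up_containers():               # hypothetical: prime caches before real traffic
    return True

def run_synthetic_functional_tests():   # hypothetical: exercise the release with synthetic data
    return True

def handler(event, context):
    passed = warm_up_containers() and run_synthetic_functional_tests()
    # Tell CodeDeploy whether this lifecycle event passed; "Failed" triggers rollback.
    codedeploy.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status="Succeeded" if passed else "Failed",
    )
    return {"status": "Succeeded" if passed else "Failed"}
```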

Thumbnail 2710

Thumbnail 2720

Thumbnail 2740

Another great part is that all of this is maintained in AWS CDK. This includes not only creating our infrastructure but also managing releases as code, which allows us to test everything before we create it. As I mentioned, we use CodeDeploy to perform gradual rollouts, and we have a warranty period before we completely route customer traffic to the new version. We use CodeDeploy's linear traffic shifting for gradual traffic routing. There is another check we perform, which we call readiness checks.

We had some reliability incidents that taught us valuable lessons. For example, we introduced a new capability that required DynamoDB table access, but we had an IAM bug. What happened was we opened customer traffic, and the customer use case failed because the IAM policy was not updated. That incident taught us an important lesson: we should shift the capability of checking readiness to the application side. Now, before applications even come up, they call a custom IAM simulate policy endpoint. They check whether they have access to the services they need before taking customer traffic, before declaring themselves healthy and attaching to the Application Load Balancer. The application itself performs several checks before it takes customer traffic, and this has been a game changer for us. Many reliability issues related to code releases have been eliminated.
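
The readiness endpoint described above is internal; the sketch below shows the underlying idea using the public IAM policy simulator API: before reporting healthy, the service simulates its task role against the actions it needs. Role and table ARNs are placeholders.

```python
# Hedged sketch of a readiness check built on IAM policy simulation.
import boto3

iam = boto3.client("iam")

REQUIRED_ACCESS = [  # hypothetical actions/resources this service needs
    ("dynamodb:GetItem", "arn:aws:dynamodb:us-east-1:111111111111:table/payments"),
    ("dynamodb:PutItem", "arn:aws:dynamodb:us-east-1:111111111111:table/payments"),
]

def ready(task_role_arn="arn:aws:iam::111111111111:role/payments-task-role"):
    for action, resource in REQUIRED_ACCESS:
        result = iam.simulate_principal_policy(
            PolicySourceArn=task_role_arn,
            ActionNames=[action],
            ResourceArns=[resource],
        )
        if any(r["EvalDecision"] != "allowed" for r in result["EvaluationResults"]):
            return False    # stay unhealthy; do not attach to the load balancer
    return True
```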

Thumbnail 2800

Thumbnail 2810

We also have automated failover and automated failback. The moment we fail over, we create an event that goes to AWS Lambda. The Lambda listens to this failover event, kicks in, and starts sending particular load to the unhealthy region to ensure it comes back up. Sometimes we use synthetic load, and sometimes we open up a little bit of customer traffic to verify the other region is functioning properly. All of this is automated, so neither the failover nor the failback is manual.

Thumbnail 2840

I want to spend a minute on this. Some of our mission-critical platforms require zero RTO and RPO goals, and they also require cross-regional consistency. Think of this scenario: you are a new customer with a zero dollar balance. You make a one dollar deposit, and we have DynamoDB Global Tables. Your request goes to region one, and your one dollar deposit is processed and persisted in the DynamoDB global table. However, for the data to replicate to the other region, there is no guaranteed SLA. It will eventually be available, but it could take a few milliseconds, a few seconds, or even hours. We do not know exactly when.

Now imagine you made a one dollar deposit to region one and we failed over to region two. You try to make another one dollar payment or deposit. It goes to region two, but after the second transaction, your balance will still be one dollar because the first transaction you made in region one has not replicated to region two. This has been a problem for some of our mission-critical platforms that require cross-region consistency. What we are exploring is DynamoDB multi-region strong consistency, in which every time you write to DynamoDB, it will durably write to the other region in a strongly consistent manner. This way, when you fail over and read from the other region, your data is available.

Thumbnail 2930

Thumbnail 2950

Some of our learnings involve cold starts. There are a couple of things that impacted our high availability. When you are talking about five nines, your downtime budget is very low and your error budget is very low. We were impacted in two scenarios: one is the cold start immediately after a release, and the other is after a long lull period. A customer comes, uses your platform, and then there is a lull period. When they come back and use the platform again, you see a spike in response time, and this pattern repeats every time. This is a problem because we have set a clear response time SLA of 500 milliseconds, and every time we breach that SLA, we count the request as not available, even though it was successful.

Thumbnail 2980

Thumbnail 3000

Resilience Testing, Observability Standards, and Building Customer Trust Through Reliability

How are we mitigating this problem? We are using CodeDeploy deployment hooks. The moment we deploy, the CodeDeploy deployment hook knows the new version is deployed and runs a pre-warming script. With that, we are able to eliminate the cold start problems. Resilience testing as code is another major area we have been working on; you can also call this chaos engineering. If you remember, I talked about the different failure modes. What do we do with those failure modes?

For every failure mode, we will apply a what-if scenario. What if your cloud provider region one has gone down? What if your cloud provider's availability zones have gone down? What if the platform had a bug? We will convert the different failure modes into several what-if scenarios.

Thumbnail 3030

After we define these what-if scenarios, we rehearse them in a very controlled environment. Rehearsing those scenarios helps in two ways. First, it helps us check whether our alerts are working properly and our runbooks are up to date. Second, it helps tremendously with preparing our engineers for when real incidents happen. This is where we get the biggest benefit, because it constantly trains our engineers.

Thumbnail 3060

How are we doing this? There are several different what-if scenarios. We are creating an AWS FIS-based SDK where we use the Gherkin format to define the different failure scenarios in a simple way, and the SDK converts these into failure scenario tests. We also run game days. In our game days, we completely isolate traffic to one of the regions and see whether our systems are able to operate independently in a single region. I talked about regional affinity, and this is how we test it: we completely isolate a region and ensure the systems running there can take the load with no cross-regional dependency.
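
The Gherkin-driven SDK is internal to Capital One; as a rough sketch of the final step it automates, a what-if scenario can be mapped to a pre-built AWS FIS experiment template and started with boto3. Template IDs and scenario wording are placeholders.

```python
# Sketch only: start a pre-built FIS experiment for a given what-if scenario.
import boto3

fis = boto3.client("fis")

SCENARIO_TO_TEMPLATE = {  # hypothetical mapping maintained alongside the scenarios
    "What if an availability zone becomes unreachable?": "EXT1AZIMPAIRMENT",
    "What if ECS tasks are killed mid-traffic?": "EXT2TASKKILL",
}

def run_failure_scenario(scenario):
    experiment = fis.start_experiment(
        experimentTemplateId=SCENARIO_TO_TEMPLATE[scenario],
        tags={"scenario": scenario},
    )
    return experiment["experiment"]["id"]
```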

Thumbnail 3080

Thumbnail 3110

Thumbnail 3120

Resilience testing is how you test your architecture's resiliency. We have also standardized several aspects of observability. We standardize all our logging, so all of it is structured logging. This helps us immediately identify problems and reduces the mean time to identify issues. We also standardize several metrics, including metrics related to saturation, availability, and error budget. All our platforms are enabled with tracing: we have distributed tracing and we also have continuous profiling. We run continuous profiling on different versions of the same function and use histograms to compare their performance.

Thumbnail 3190

We also do error code standardization, because error code standardization is very important. Some errors need a failover, some errors need a fallback, and some error codes might need you to retry. You cannot have a single error code for all your resiliency requirements. Standardizing error codes is another important practice we follow at Capital One. We constantly measure these KPIs: our mean time between failures, mean time to identify an issue, and mean time to recover from an issue.

Another important thing we have been doing in this space is moving away from static alerts to dynamic alerts. We had this problem where, if you look at the left side, we had static alerts over every five-minute window. The alert fires when the count of errors is greater than one hundred in five minutes. Now imagine that in a five-minute window you have ninety-nine requests and all ninety-nine fail. This alert will never go off, because even though you have a one hundred percent failure rate, the alert is based on a static threshold on the count of errors. In that scenario, it never fires.

Thumbnail 3260

What we are doing now is moving to dynamic alerts based on different traffic patterns: low traffic, medium traffic, and high throughput. For different patterns, we have different dynamic thresholds and error budgets. With this, there is no way we miss alerts. With static alerts, we used to miss a lot of them because we were tracking a static number.
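
One building block of such alerts (a sketch, with placeholder namespaces, metric names, and thresholds) is a CloudWatch metric-math alarm on the error rate rather than on a raw error count, so ninety-nine failures out of ninety-nine requests still fires; per-traffic-pattern thresholds would be additional alarms created the same way.

```python
# Sketch: alarm on error *rate* via CloudWatch metric math instead of a static
# error count. Namespace, metric names, and threshold are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="payments-error-rate-high",
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=1,
    Threshold=5.0,                      # percent, tuned per traffic pattern
    TreatMissingData="notBreaching",    # a window with no requests is "good"
    Metrics=[
        {"Id": "error_rate", "Expression": "100 * errors / requests",
         "Label": "ErrorRatePercent", "ReturnData": True},
        {"Id": "errors", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "Payments", "MetricName": "Errors"},
            "Period": 300, "Stat": "Sum"}},
        {"Id": "requests", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "Payments", "MetricName": "Requests"},
            "Period": 300, "Stat": "Sum"}},
    ],
)
```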

Thumbnail 3290

We have this kind of dashboard for all our platforms, with multiple cells of five-minute intervals. The green boxes are where our thresholds were not breached, the white boxes are where no requests came to our platform, which is considered good, and the red boxes are where we breached the threshold. We review these dashboards every day, every week, and every month. Another important aspect of running reliable and resilient systems is using bulkhead patterns. We had an incident where our service task and our observability tool were sharing the same resources. Our observability backend had a problem, and that created back pressure on our service task, impacting its throughput.

That incident taught us that we need to create a bounded context between our services and auxiliary concerns like logging, audits, and retries. So we are moving away from tightly coupled services to bulkhead patterns for all of our auxiliary services, using the sidecar pattern. Because of our serverless adoption, using the sidecar pattern has become very easy for us. Imagine you're running on an EC2 instance: if any of these auxiliary systems come under pressure, they will occupy the instance's resources and impact your services. With serverless, we are able to apply these sidecar patterns really well.

Thumbnail 3360

Thumbnail 3380

Why are we doing all these things? As we talked about at the beginning, we want to build trust with our customers. Customers trust that our systems are highly reliable, highly resilient, always on, always secure, and always scalable. Beyond that, there are other factors at stake: competitive edge, because the moment your systems are not reliable, your customers will start using other systems; compliance, because in highly regulated industries such as ours we always have to adhere to our obligations; and the company's reputation. These are very important, and that's why we take resiliency very seriously. Every time you make a code commit and push it to production, please remember that you are responsible for the company's reputation and the customers' trust.

Thumbnail 3420

That's pretty much what I have. Thanks for tuning in. I hope that you got something out of this and that you will bring it back and build reliable systems in your organizations. Thanks a lot.


; This article is entirely auto-generated using Amazon Bedrock.
