🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Capital One: From Chaos Testing to Continuous Verification (SPS328)
In this video, Sheng Liao from AWS and Troy Koss from Capital One discuss how Capital One transformed from point-in-time chaos testing to continuous verification. They explain why chaos engineering matters in complex distributed systems, highlighting the resilience gap created by quarterly game days versus continuous testing. Troy details their automated reliability verification framework, built on four key dimensions: a controlled self-service platform using AWS FIS, emergency stop mechanisms with rollback capabilities, service level objectives for measuring impact, and continuous automated testing. The approach addresses constant system entropy from daily code changes and configuration drift, enabling engineering teams to build confidence through repeated failure scenario testing and verification rather than one-time game days.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Why Chaos Engineering Matters: From Point-in-Time Testing to Continuous Verification
Come on out, folks. Hello everyone. Welcome to our lightning talk. My name is Sheng Liao, and I lead the technical account managers within the strategic segment at AWS. I have the pleasure of working with Troy. Troy, I'd like you to introduce yourself.
Thank you so much. I'm Troy Koss, and I lead reliability engineering at Capital One. A lot of what we work on is enabling other teams to build their applications more reliably. We're here to talk about how Capital One transformed from point-in-time chaos testing to continuous verification.
First, we're going to start with the why, then we will talk about Capital One's transformation journey and their automated reliability verification framework. Finally, we'll discuss how they scale chaos engineering and measure outcomes. So why does chaos engineering matter? In today's complex landscape, we can't just hope everything works. We need to know. I want to highlight how we can build more resilient applications by applying chaos engineering principles early in development.
First, incorporating chaos testing at the beginning ensures resilience isn't something we think about only after failures occur. It should be something that's part of our development process. By designing with failure in mind, we make sure that our applications can degrade gracefully when there's a failure. Second, we need to make sure that our applications have no single point of failure. This means simulating network and system conditions like latency, timeouts, packet loss, or disconnections.
These tests should be both deterministic and randomized so we uncover unexpected behavior. Finally, we should proactively anticipate production-level failures. We can do this by simulating the loss of virtual machines, availability zones, or even an entire region. Running these tests ensures our applications can degrade gracefully and recover quickly when there's a failure.
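To make that concrete, here is a minimal sketch (not taken from the talk) of how one such scenario, losing an Availability Zone, could be expressed with AWS FIS through boto3. The tag key, role ARN, alarm ARN, and account ID are placeholder assumptions.

```python
import uuid
import boto3

fis = boto3.client("fis")

# Hypothetical experiment template: stop every opt-in tagged instance in one
# Availability Zone to rehearse losing that zone, and halt automatically if a
# CloudWatch alarm fires. All ARNs and the tag key are placeholders.
template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Simulate loss of us-east-1a for the payments service",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:payments-slo-breach",
    }],
    targets={
        "az-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},   # opt-in tag (assumption)
            "filters": [{
                "path": "Placement.AvailabilityZone",
                "values": ["us-east-1a"],
            }],
            "selectionMode": "ALL",
        }
    },
    actions={
        "stop-az": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},  # restart after 10 minutes
            "targets": {"Instances": "az-instances"},
        }
    },
    tags={"team": "reliability-engineering"},
)
print(template["experimentTemplate"]["id"])
```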
Now, here's a challenge with point-in-time chaos testing. In a complex distributed system, entropy is constant. Code changes daily, configurations drift, and dependencies update. If you only test during, let's say, your quarterly game day, you are creating a resilience gap. Between tests, your application reliability can gradually degrade because new changes happen, and as a result, we introduce new vulnerabilities.
Capital One's Automated Reliability Verification Framework: Four Key Dimensions
Continuous verification closes this gap. This is the shift Capital One has made. This allows them to match the speed of testing to the speed of their deployment. Now, Troy is going to walk you through it. Awesome, thank you. So, like Sheng mentioned, everything's in motion and everything's moving constantly. We had to come up with a way and a solution to really address that, close those gaps, and ultimately achieve a new level of reliability.
We see our systems growing, with new microservices and new things coming online. Everything's interconnected, with dependencies. I'm sure you've all thought about dependencies at some point in your journey. What you can't have is a rigid system where you fix something in place once, call it done, and wait for the next thing to fail. You can't assume there will always be a consistent happy path, or that you'll be able to predict all the failure modes of your applications.
And lastly, in an environment specifically for us, we have to be regulated and compliant at all times. I don't know about you guys, but when I want to get to my bank account, I want to make sure it works. So what has to fundamentally change, and a big part of it is the culture that we had to bring, is how do we embed those learnings and findings into our systems and have them running all the time. So we've repaired something and we've made a fix.
How do we check that this is happening consistently? How do we have the flexibility to understand that our system will have different failure modes in different states? What happens if our primary database isn't active and there's a problem? How do we have a different read replica or something that we can count on and depend on in that environment? And lastly, there's governance. We don't want to stop everything to fix everything. I think that's the mode and mindset you have to get into: we want to be able to be regulated and well managed, but to do that in a consistent way, in an automated way, and have our systems adapt to that versus slowing down with manual changes.
In order to achieve something like this, there has to be automation at play. I'll be the first person to say that automation isn't going to solve everything, but in this case, we really had to step back and say: how do we put all of those things in place at one time? The real crux comes down to the automation and tooling that you can build to enable your engineering teams to achieve this. Getting there is a rather steep cliff, but actually doing it and enabling the teams takes us from running quarterly game days, as Sheng mentioned, to this always-on continuous approach.
We'll take a look at four key dimensions that helped us really unlock this capability. The first is being able to get the capability out to our engineering teams, but in a way that's controlled, that's regulated, that has audit compliance baked in, and that gives us blast radius control. Instead of just letting every engineer grab a fault injection simulator and start turning things off, especially if you want to mature from doing your tests in lower environments and working your way up into production environments, you have to have these controls and safety mechanisms in place.
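As one illustration of that kind of guardrail (a sketch, not Capital One's actual setup), the IAM policy attached to the FIS experiment role can be scoped so faults only reach resources that have explicitly opted in via a tag. The tag key and policy name below are assumptions.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical guardrail: the FIS experiment role may only stop or start EC2
# instances that carry an explicit opt-in tag, keeping the blast radius
# limited to workloads that have agreed to be tested. Tag key is an assumption.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ec2:StopInstances", "ec2:StartInstances"],
        "Resource": "arn:aws:ec2:*:*:instance/*",
        "Condition": {
            "StringEquals": {"aws:ResourceTag/chaos-ready": "true"}
        },
    }],
}

iam.create_policy(
    PolicyName="fis-chaos-opt-in-guardrail",  # placeholder name
    PolicyDocument=json.dumps(policy_document),
    Description="Restrict FIS fault injection to opt-in tagged instances",
)
```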
As an institution like ours, we were able to use AWS FIS under the hood to achieve a lot of the different chaos tests. There are plenty to choose from, and I think what's important is to think beyond just the compute layer and the traditional approach of breaking things. You're really testing your hypotheses across the different layers of your application: the compute layer, the database layer, and all the way across network actions and others. There's a wide catalog available. The self-service platform also enables us to build our own capabilities that maybe FIS doesn't have, or to introduce other chaos potentials.
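To get a feel for how wide that catalog is, one simple approach (a sketch assuming boto3; the available actions vary by region and grow over time) is to enumerate the FIS actions and group them by the service they touch.

```python
import boto3
from collections import defaultdict

fis = boto3.client("fis")

# List every fault injection action FIS offers and group by service prefix,
# e.g. ec2, rds, network, fis (API-level faults), to show the catalog spans
# more than the compute layer.
actions, token = [], None
while True:
    page = fis.list_actions(**({"nextToken": token} if token else {}))
    actions.extend(page["actions"])
    token = page.get("nextToken")
    if not token:
        break

by_service = defaultdict(list)
for action in actions:
    service = action["id"].split(":")[1]   # "aws:ec2:stop-instances" -> "ec2"
    by_service[service].append(action["id"])

for service, ids in sorted(by_service.items()):
    print(f"{service}: {len(ids)} actions")
    for action_id in sorted(ids):
        print(f"  {action_id}")
```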
Once we have that in place, the emergency stop button is needed. Nobody wants to let a runaway train take down a bank. So we make sure we have the telemetry and the data to tell us when things are going south fast: our hypothesis has failed, we need to stop, and we're going to eject. We need to be able to halt on that. And then there's the general concept, which has come up in a lot of conversations I've had throughout the week, of being able to roll back in all capacities, but especially with these tests. We need to stop what we're doing and roll back whatever injection we've made to the previous state.
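The automated side of that stop button is the alarm-backed stop condition declared on the experiment template itself, as in the earlier sketch. The manual side can be as simple as a script that halts every running experiment; a minimal sketch, assuming boto3:

```python
import boto3

fis = boto3.client("fis")

def emergency_stop_all() -> None:
    """Halt every FIS experiment that is currently initiating or running."""
    token = None
    while True:
        page = fis.list_experiments(**({"nextToken": token} if token else {}))
        for experiment in page["experiments"]:
            if experiment["state"]["status"] in ("initiating", "running"):
                print(f"Stopping experiment {experiment['id']}")
                fis.stop_experiment(id=experiment["id"])
        token = page.get("nextToken")
        if not token:
            break

if __name__ == "__main__":
    emergency_stop_all()
```

For many FIS actions, stopping the experiment also reverses the injected fault (network disruptions are removed, paused I/O resumes); how much is rolled back automatically depends on the action, so rollback for anything custom still has to be built in.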
This is one that's near and dear to me. You can't hit a target you don't have, and that's why having service level objectives is so important and foundational for your application. If you don't have them, I highly recommend starting there. But even if you don't, chaos testing is a good mechanism to learn what your SLOs need to be. Your SLOs give you safety bounds when you're working on your application, so you know it's okay to go do chaos testing. You don't want to start doing chaos testing on an application that is in turmoil or has failed recently while your customers are unsatisfied.
And at the opposite end of the spectrum, you really want that measure to tell you when these chaos tests are genuinely impacting your availability and you need to stop. Everywhere in between, there's also opportunity for learning. If you inject latency, you expect that your system can handle it and that your availability stays consistent. If that doesn't happen and you see a drop-off, then you know your understanding of where your latency cutoffs are is clearly incorrect, and you need to work on that. There's a lot of intelligence you can get from this data, but without it, without knowing that the outcome of your system works for your customers, chaos testing becomes a lot more difficult.
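As a deliberately simplified illustration of that guardrail, here is a sketch built on an assumed availability SLO and an error-budget threshold chosen for this example; real inputs would come from your telemetry.

```python
# Hypothetical SLO guardrail: only start (or keep running) a chaos test while
# the availability objective and the remaining error budget are healthy.
SLO_TARGET = 0.999          # 99.9% availability objective (assumed)
MIN_BUDGET_REMAINING = 0.5  # stop injecting faults once half the budget is spent

def availability(good_requests: int, total_requests: int) -> float:
    return good_requests / total_requests if total_requests else 1.0

def error_budget_remaining(measured_availability: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    allowed_error = 1.0 - SLO_TARGET
    observed_error = 1.0 - measured_availability
    if allowed_error == 0:
        return 0.0
    return max(0.0, 1.0 - observed_error / allowed_error)

def safe_to_inject(good: int, total: int) -> bool:
    return error_budget_remaining(availability(good, total)) >= MIN_BUDGET_REMAINING

# Example: 29,985 good requests out of 30,000 is 99.95% availability. Half of
# the 0.1% error budget is spent, so the test may (just barely) proceed.
print(safe_to_inject(29_985, 30_000))
```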
So the pinnacle of this whole thing is setting it all into motion. We can still do point-in-time chaos tests, and I highly encourage people to keep doing active chaos testing, game days, those kinds of events. They're going to be very valuable for learning, for testing things, and for getting them working. And once you've achieved that, once you've had those tests running, you can set them into motion on a recurring basis.
What we're identifying here is that our systems have behaviors and different mood swings depending on the day, just like we all do. Finding out what those edge cases are that happen really allows us to get ahead of problems before they surface and build into something larger.
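Setting it into motion can be as simple as starting the experiment template on a schedule. Below is a sketch of a handler that a recurring trigger, for example an EventBridge schedule, could invoke; the event shape and template ID are assumptions.

```python
import uuid
import boto3

fis = boto3.client("fis")

def handler(event: dict, context) -> dict:
    """Start an FIS experiment from its template on a recurring schedule."""
    template_id = event["experiment_template_id"]   # assumed to be supplied by the trigger
    response = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment = response["experiment"]
    print(f"Started {experiment['id']} in state {experiment['state']['status']}")
    return {"experiment_id": experiment["id"]}
```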
Measuring Outcomes and Building Confidence Through Continuous Chaos Testing
What we're testing and verifying is that, at the end of the day, our system is behaving the way we expect it to behave. Point-in-time testing is still valuable, but if you don't set it into motion, that gap Sheng illustrated earlier becomes apparent. You have an incident, you identify a fix, some time passes, and then the same thing happens again. No matter what we do, that's going to happen, because of how much our systems change and the entropy that naturally exists in systems like these.
Assuming we have those four things in place and we look at the outcomes, I think the most important one is the confidence that your engineering teams are going to get. In the chaos of an incident, there's always that natural hesitancy, right? What should I do? Should I fail over or not fail over? I don't know, and if I hit this button, I think it's going to work, but I'm not sure. The best way to feel confident doing something is to do it, and to do it continuously, building that confidence in your engineering team and weeding out fragility that exists not just at the technical level but at the human level too. We want to feel confident hitting that button. When you get paged at three o'clock in the morning and you have to go make sure your system's working, you want to have confidence doing that, so it's business as usual at that point.
Ultimately, the goal is better reliability and better trust in our systems themselves, and this gives us that resilience. It allows us to recover fast because we know better how our system behaves. The learning of what happens when this breaks has already occurred, and you've already had the opportunity to understand the behavior. You become more familiar with your system, and then you know how to react to it. These kinds of tests really allow us to do that, especially when they're running all the time.
To put it all together and conclude: the foundation of all this is being safe. The first thing I opened with is to put a big safety harness around it. Make sure you know what you're doing, and make sure you've got audit, tests, logs, and monitoring in place so you can see how it's going to behave. That really unlocks the next level, where you can start doing more advanced testing, setting things into motion, and getting the evidence that your test is actually successful too.
I didn't really hit on it too much earlier, but not just testing but verifying is really important. We're moving from just breaking things, which I think is still, unfortunately, a misunderstood expectation of chaos engineering, to actually saying: I believe my system can withstand this level of latency, rejection, or degradation, and it will behave in a particular way when this happens. It will scale up if instances are shutting down, or, in the example I used earlier, if my primary database isn't responsive, it will cut over to a read replica. We want to make sure those things are in place and that they work, especially over time.
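To make the verification half of that concrete, here is a hedged sketch that runs an experiment and checks the hypothesis while the fault is active: availability, probed against a placeholder health endpoint, should stay above an assumed threshold for the test to pass.

```python
import time
import urllib.request
import uuid
import boto3

fis = boto3.client("fis")

HEALTH_URL = "https://payments.example.internal/health"  # placeholder endpoint
REQUIRED_SUCCESS_RATE = 0.999                             # assumed verification threshold

def probe() -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as response:
            return response.status == 200
    except Exception:
        return False

def verify_hypothesis(template_id: str) -> bool:
    """Start the fault, keep probing while it runs, and verify availability held."""
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    successes = attempts = 0
    while True:
        status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
        if status not in ("initiating", "running"):
            break
        successes += probe()
        attempts += 1
        time.sleep(5)
    rate = successes / attempts if attempts else 0.0
    print(f"Experiment {experiment_id} finished {status}; probe success rate {rate:.4f}")
    return rate >= REQUIRED_SUCCESS_RATE
```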
If you put together failure modes and failure scenarios and you say, yes, this works, we have this failover, we have this resilience added, and then you never actually exercise it, you may find out in the heat of the moment that it doesn't work. That's not the best time to discover it. You want to keep exercising it, and the overall theme, again, is to run continuously and to set this into motion to prevent those gaps in reliability.
Thank you so much for listening to us chat. Our Capital One booth is over there by the Expo B entryway. Come check us out and see what we're up to. I'll be over there throughout the day as well. I'm looking forward to talking with you all. We do have a little bit of time. I don't know if they wanted to open up for questions or if it's up to them. Any questions?
This article is entirely auto-generated using Amazon Bedrock.