
Kazuya

AWS re:Invent 2025 - Mastering Root Cause Analysis: Rebuilding trust after outages (ARC211)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Mastering Root Cause Analysis: Rebuilding trust after outages (ARC211)

In this video, AWS principals Giorgio and Donald share their expertise on root cause analysis (RCA) practices at AWS. They explain the six phases of incident management, emphasizing the review and improvement stages. Key topics include when RCAs warrant investment, the importance of data collection during active events, finding true root causes by zooming out rather than linear analysis, and defining meaningful preventive actions that are both deep and extensive. They introduce AWS's bar raiser mechanism for external review, demonstrate their Command Center tooling, and outline core principles like prioritizing quality over speed and taking ownership of dependencies. The session covers RCA structure including impact summary, root cause analysis, timeline, and action items, while providing style tips to focus on customer experience over internal terminology. They emphasize that RCAs aren't just documents to archive but tools for driving actual improvement, requiring customer support through learning, planning, and implementation phases.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Mastering Root Cause Analysis at AWS

OK. So welcome to ARC 211, Mastering Root Cause Analysis, Rebuilding Trust After Outages. We are quite conscious we are the only thing left between you and your evening plans, so let's get right into it.

Thumbnail 20

So I'm Giorgio. I'm a principal in AWS support, and for the last few years I've been the owner of the AWS RCA program. This is the program through which we turn our internal postmortems into documents our customers can see and use. And I'm Donald. I'm also a principal in AWS support, and for some time now I've been helping customers create and develop their own RCA practices and also leverage the AWS RCAs in the context of their business.

Now together, Giorgio and I have helped hundreds of organizations worldwide turn what could have been just very painful events into opportunities for improving resilience and rebuilding customer trust. So just to set the context for the next hour or so, we start by sharing how we do root cause analysis, also known as RCAs, at AWS and the experience we have built over the past 10 to 15 years. We'll do a particularly deep dive into ways of turning the chaos of an event or an outage into the clarity of a postmortem that your customers can use.

Thumbnail 90

We do this by showing the mechanisms, mental models, and tools that we have built over time. And finally, we are going to cover what happens after the RCA document is delivered to customers because we want to make sure that they make actual use of it and that it doesn't just end up as a document that is archived.

Thumbnail 110

Thumbnail 130

So a bit of context. Every event or outage or incident, whatever you call them, goes through six phases. You first detect that there is something wrong, then you investigate to figure out what exactly is wrong, mitigate, and resolve. After that, you get into the second macro phase: you go back, review the event, review what happened, try to draw the learnings from it, and then define and implement the action items. This talk is going to focus on the second part, between review and improvement, with a particular focus on doing this in a role like ours where we support external customers that use our services. What we are going to discuss is going to particularly apply to ISVs and any sort of service provider really.
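As a rough sketch only (this is not from the talk, and the names are invented), those six phases could be modeled like this:

```python
from enum import Enum, auto

class IncidentPhase(Enum):
    """The six phases described above: four during the event, two after."""
    DETECT = auto()       # notice that something is wrong
    INVESTIGATE = auto()  # figure out exactly what is wrong
    MITIGATE = auto()     # reduce or stop the impact
    RESOLVE = auto()      # restore normal operation
    REVIEW = auto()       # postmortem / RCA: draw the learnings
    IMPROVE = auto()      # define and implement the action items

# This talk focuses on the last two phases.
POST_EVENT_PHASES = [IncidentPhase.REVIEW, IncidentPhase.IMPROVE]
```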

Thumbnail 180

Now before we dive in too deeply, I want to set a little bit of context. Throughout this presentation we may use some terms interchangeably, such as incident, which at AWS we call events or service events, and RCA or postmortem, which we refer to as event analysis. So let's first talk about what RCAs are in this section and why they're an important aspect of event management, and importantly, when are they most effective in helping rebuild customer trust.

Thumbnail 190

What is an RCA and Why Does It Matter?

So first, what is an RCA? Well, root cause analysis is actually a collective term for both the method of investigating failures and understanding what happened, as well as the resulting document that customers will ultimately read to understand more about the event. Now you should think about an RCA as a well-constructed story that always includes a few foundational components.

Thumbnail 210

Thumbnail 250

First is the impact summary. Now the impact summary is going to set the context for the customer of what happened, where it happened, and what services were involved. Next, we want to take a look at the triggers, which are things that set the event in motion, as well as the ultimate causation, and we'll dive into each of these in a little bit. And thirdly, a clear resolution timeline. Now the timeline will span from event detection all the way through to mitigation and resolution, which we'll look at as well. And most importantly, an RCA is not complete unless it has actionable and committed preventive actions.
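As a minimal sketch of those foundational components (the field names here are my own, purely illustrative):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ActionItem:
    description: str
    due_date: datetime
    status: str = "planned"  # e.g. planned / in progress / completed

@dataclass
class RootCauseAnalysis:
    # What happened, where, and which services were involved.
    impact_summary: str
    # What set the event in motion, versus the underlying causation.
    triggers: list[str]
    root_cause: str
    # From detection through mitigation and resolution.
    timeline: list[tuple[datetime, str]]
    # An RCA is not complete without committed preventive actions.
    preventive_actions: list[ActionItem] = field(default_factory=list)
```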

Thumbnail 260

Thumbnail 280

So let's get into the next section with a quote from our founder that says something that might be obvious or not really, and that's that trust is really hard earned. It's going to take a significant effort to earn a customer's trust, but it's easily lost. You make one mistake and you start from scratch. This gets us into why we do RCAs. So there are some reasons that are obvious and some that are a bit less obvious. The first one is the main aim. We want to provide clarity into what happened. We want to share what we experienced and why. Additionally, we want to prove understanding. We know that customers are going to use our RCAs as a document to examine our ability to respond and a way to validate whether they still want to keep using our services. And additionally, there is a bit of a mea culpa, where we want to show that we hold ourselves

accountable to the event that we caused. If we got to the point of doing an RCA, it means that one way or another we have disrupted our customers' business, and it's important to recognize that. Together, all of this has the side effect of helping boost confidence or regaining the trust that was lost.

Thumbnail 340

So here we see a customer inquiring about one of their very old antique instances. It's a Windows 2003 instance, of course it's a production instance, and as you can tell by the name, it may have become the backbone of their business line, and they're very curious about why it was unreachable for just a couple of minutes. I provide this example because it's a common question we receive, and we completely understand why customers might become curious about single instance failures. However, the reality is hardware failures are a part of operations and they're expected. So this raises the question: is this a good candidate for an RCA?

Thumbnail 380

Thumbnail 410

Now each RCA requires a significant effort, making it very important to be intentional as to when to invest time in an RCA. Not every failure warrants an RCA, as we've seen in the last example. They should be focused on situations where a promise was broken, for example something that was tied to a Service Level Agreement, or on failure modes that emerge for the first time. These could be conditions tied to the software stack that aren't covered directly by the test suite you used to build that service.

Thumbnail 420

Thumbnail 440

Now more broadly, any unexpected or unforeseen failures absolutely warrant an RCA, and the reason for this is they become critical opportunities for learning, improvements, and improving overall service. So let's look now into the various items and the various components that you need to produce high quality Root Cause Analysis documents.

Data Collection: The Foundation of Quality RCAs

So while it might seem obvious, you always start with data collection, and RCAs are only as good as the data that supports them. The reason you need data is so that you can form a full picture that's clear, and you use this data to make informed decisions, and those decisions turn into action items. So let's take a closer look at data and actions.

Thumbnail 480

You might be thinking, well, when I'm in an event I will remember all these details later and we can grab some logs and graphs, but our experience shows that that's rarely true. What this means is your work begins during the event, so Root Cause Analysis begins while that event is still active. Think of this as documenting a crime scene: the data you collect is freshest in that very moment.

Thumbnail 500

During the heat of the moment, you want to focus on capturing four critical elements. The first is the scope of impact. The scope of impact is where you look at what was affected in this event and to what extent. The next thing is metrics. So metrics take on a few different shapes.

First, metrics should be broad. You initially look at the triggering metrics that set off the alarm for this particular event, but then you want to look beyond those service side metrics and into other areas of the stack that might be related. Second, metrics are deep. You might have started with a customer-facing metric that tells how the experience has been, like the error rate of accessing your system, and then you dive deeper into the physical infrastructure like what are the metrics telling us about the storage layer and hardware layers.

Metrics should also be categorized a bit by cause and effect: look at whether the metric you're examining is closest to the likely causation, or whether it is a symptom or a victim of that original cause. Third, you want to look at observations from your support teams or your field teams. It's really important to understand what customers are seeing and what your support teams are saying about the event. This gives us a bit of a qualitative angle to the metrics and data that you're collecting or observing.

Thumbnail 610

And fourth, we wrap this all together into a timeline, and the timeline simply is just going to tell us what happened and when. This data collection is really important as it becomes the backbone and foundation of determining what happened and how you build the action items.
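A minimal sketch of capturing those four elements while the event is still active (the class and field names are invented for illustration):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EventCapture:
    """Capture the four critical elements while the event is still active."""
    scope_of_impact: list[str] = field(default_factory=list)       # what was affected, and to what extent
    metrics: dict[str, list[float]] = field(default_factory=dict)  # broad, deep, cause vs. symptom
    field_observations: list[str] = field(default_factory=list)    # what support and field teams are hearing
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def note(self, entry: str) -> None:
        """Append a timestamped observation so the timeline is built in the moment."""
        self.timeline.append((datetime.now(timezone.utc), entry))

capture = EventCapture()
capture.note("Error rate alarm fired for the checkout API")
capture.note("Storage-layer latency metrics elevated in one availability zone")
```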

Thumbnail 630

Finding the Real Root Cause: Going Deep and Wide

So one of the themes that we focus on at AWS during service events is mitigation, and it's an important milestone in any service event because it helps shift our focus from the firefighting to the detective work. And at this stage, you want to ensure to understand three key elements. The first is what were the initial triggers, so looking at what took place to set the service and things in motion that didn't go as planned. Additionally, you need to identify the specific actions that led to recovery. Now this could be automated actions that systems automatically took on your behalf to help mitigate the event, or it could be operator actions such as shifting away from a portion of the network that was impaired or rolling back a suspect recent change. And the third is establishing a sequence of events. Now the sequence of events will help build out that more detailed timeline, and we'll talk about what that looks like in the structure of an RCA.

Thumbnail 700

Thumbnail 710

Thumbnail 720

Thumbnail 730

Thumbnail 740

So here we see a customer inquiring about a failed capacitor. Now, like all the other quotes in this talk, it is completely made up, but we got really close to a similar situation, and it brings us to the following stage: what we try to define as finding the real root cause. I'm going to share an example of what we mean by a true and deep root cause. So we start from quite a common problem, which is that we have an experience of queries that were timing out, and we start looking into why this was and figure out that the database instance was unreachable. So we ask ourselves another why, go one layer down, and figure out that the instance was running on a server that stopped running because the underlying power supply failed. And with an additional few months of investigation, we pin this failure down to a specific capacitor on the 12-volt rail.

Thumbnail 750

Thumbnail 760

Thumbnail 770

Now, as you see, we are going really deep, but there is a problem. We are going in a very specific and single direction. We can be as deep as we want, but we are focusing on a single component on the board and missing everything else along the way. So let's look at what we think is a better way to go about this. We start from the same problem because that's our customer experience. We get to the same second layer conclusion. The database instance was unreachable, but then we drift and we start looking into why there was no standby replica, right? Because if there was a secondary node, the failure of the primary database instance would have been completely irrelevant.

And when we ask why we didn't have a replica, we find out that the software cannot really handle the potential replication lag. And as you see, we started from the same point, but we zoomed out a bit. And by doing this, we got to a much more relevant root cause, and we get to a position where we found an issue that, if solved, is going to help with a wide range of failure conditions. And you see, if you go through a linear analysis, you go fix that capacitor, which might actually be a real problem, right? That capacitor needs to be fixed if there is a structural reason why it's failing. But equally, by zooming out a bit, you get to a condition where you can implement different preventive actions that are much broader and much stronger than focusing on that single component.
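One way to picture the difference is to treat the analysis as a tree of "whys" rather than a single chain, so you can branch sideways at any level. A rough sketch, reusing the example above (illustrative only):

```python
from dataclasses import dataclass, field

@dataclass
class Why:
    """One node in the analysis: each finding can branch into several further 'whys'."""
    finding: str
    children: list["Why"] = field(default_factory=list)

# Linear analysis drills ever deeper into a single component...
linear = Why("Queries timing out",
             [Why("Database instance unreachable",
                  [Why("Server lost power",
                       [Why("Capacitor on the 12-volt rail failed")])])])

# ...while zooming out at the same level surfaces a broader, more useful cause.
broad = Why("Queries timing out",
            [Why("Database instance unreachable",
                 [Why("Capacitor on the 12-volt rail failed"),
                  Why("No standby replica to fail over to",
                      [Why("Software cannot handle replication lag")])])])
```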

Thumbnail 850

Thumbnail 860

Thumbnail 890

Defining Meaningful and Achievable Preventive Actions

So now that we've identified the real root cause as Giorgio walked us through, we need to define some meaningful preventive actions. So preventive actions broadly take on two personas. The first is that they're deep. You should think of deep preventive actions as going vertically within a very specific causation, tracking the issue down to its deepest level and surfacing whether something was systemic. For example, what Giorgio was talking about is looking all the way down to the particular power supply, which may turn out to be a systemic issue with a batch of hardware from a vendor that is now distributed across your fleet and needs to be addressed. So going deep is important.

Thumbnail 900

Second, and equally important, action items should be extensive. Now instead of continuing to look more vertically and linearly, you want to look at action items that are extensive, which means

Thumbnail 930

going horizontally. So we step outside of that primary causation and we look at what could we do to address other triggers as well as looking around corners for other root causes or other similar things that could happen in the future. And together we also include action items that will cover detection. So in the event that a similar situation arises, we want to ask ourselves and challenge ourselves to take an action on how can we enhance our identification capabilities and detect this event faster.

Thumbnail 960

Thumbnail 970

Thumbnail 980

By doing this, you might fall into a really common mistake, which is going for the perfect solution. So here we have another completely made up quote about an engineering lead who is trying to solve the database problem by rebuilding the entire database layer of the application. The problem is that they plan to do this over the next 10 years. While thinking about preventive actions, there are a few things we want to focus on. When you work on this, you are always going to have to find a balance between ambition, delivering perfect solutions, and practicality, being realistic with timelines.

Thumbnail 990

So the first thing that we try to focus on when we do our root cause analysis is we want to make sure our plans are achievable. By the time we commit on them in the document, it's because we know that we can implement them and we know we have the resources to do so. If we don't have that, we might push out the entire commitment or we might remove the commitment from the document.

Thumbnail 1010

The second one is that timing is important. Your customers are going to validate the quality of your preventive actions, but they are also going to hold you accountable to delivering them reasonably soon, because one way or the other, until they are implemented, you are at risk of a recurrence of the event. You might have temporary measures in place that help, but they tend to be really hard to sustain. So the set of midterm action items, the ones that complete in weeks to months maximum, are really the most important ones.

Thumbnail 1040

Thumbnail 1060

As you go through the implementation and in terms of timing, let's remember you generally commit on those preventive actions in the first seven days after an event, when you share the document with your customers. Afterwards, as you go towards implementation, you might learn that there are better ways to do it or you might figure out different solutions. We generally think that changing the solution is fine as long as we maintain the same goal. And as long as we're maintaining the same goal, we want to maintain the same delivery time estimate.

Now they are estimates, so there is fine print involved with the term itself, but also they are a promise to customers. So slipping by one or two days is fine. Slipping by months is definitely not. Your customers are going to call you out on that one.
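As a small illustration of that "days, not months" tolerance, here is a sketch that flags action items whose revised estimate has drifted too far from the committed date (the threshold, field names, and data are invented):

```python
from datetime import date, timedelta

# Days of slip we treat as acceptable; months of slip are not (illustrative threshold).
ACCEPTABLE_SLIP = timedelta(days=2)

def flag_slipped_commitments(action_items):
    """Return action items whose revised estimate drifted beyond the committed date."""
    slipped = []
    for item in action_items:
        drift = item["revised_estimate"] - item["committed_estimate"]
        if drift > ACCEPTABLE_SLIP:
            slipped.append((item["title"], drift.days))
    return slipped

items = [
    {"title": "Add standby replica support", "committed_estimate": date(2025, 12, 15),
     "revised_estimate": date(2025, 12, 16)},   # fine: one day
    {"title": "Rework replication lag handling", "committed_estimate": date(2026, 1, 10),
     "revised_estimate": date(2026, 3, 20)},    # not fine: months
]
print(flag_slipped_commitments(items))  # [('Rework replication lag handling', 69)]
```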

Thumbnail 1100

Thumbnail 1110

AWS RCA Mechanism: Steps and Accountability

All right, so now that we've pulled together all of our data and we've established some actions, we need a mechanism to help keep ourselves organized, and we may also need some tooling. Now this is going to be true whether you're a small startup or a large enterprise: you'll need these things to build an effective RCA practice. So let's explore what this looks like.

So at AWS we've developed and refined our RCA mechanism over time as well as our tools. And we've built some steps that are very intentional to ensure that we are focused on preventing a reoccurrence and delivering quality for our customers. So let me walk you through those steps now.

Thumbnail 1140

So it all begins with a request. Now you might be thinking this is a very basic step, but it actually goes beyond just clicking the button in a tool saying, hey, we better do an RCA for this event or our customers are asking for this. Each request actually includes context about customer impact, and what this does is it creates a full feedback loop that helps keep us focused on the customer experience as well as prioritizing specific aspects of our analysis.

Thumbnail 1170

Thumbnail 1190

Next, the RCA is assigned. Accountability is really important in our mechanism, and so we assign the RCA to a single threaded owner, and this could be a service owner or service leader. Next, just like any good novel, there's going to be a star author. In our case, our authors come from the service team that's impacted, typically an engineer or a leader within that service who deeply understands the service and the subcomponents that were impacted.

Now their job is to transform the chaos and mystery of what happened into clarity and trust that your customers, our customers, will read in the final document. Similarly, going back to the assignee, there's also an aspect of ownership, and that's shared between both the author as well as that service leader. Now there may be some instances that are very complex, so co-authors are also a possibility here, and we keep the authorship within the service or subfunctions of their expertise.

Thumbnail 1260

Fourth is review. All RCAs undergo a review, which at AWS we call bar raising, and Giorgio will explain that in more detail in a moment. And lastly we publish the RCA. This is where the RCA becomes publicly available for customers to consume and, more importantly, we track the RCA action items through to completion. It's our philosophy here that an RCA isn't done and our job isn't done until those action items are actually implemented and delivered for our customers.
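Purely as an illustration of that request-to-completion flow (this is not AWS's actual tooling), the steps could be modeled as a simple linear workflow:

```python
# The mechanism steps described above, as a simple linear workflow (illustrative only).
RCA_WORKFLOW = ["requested", "assigned", "authored", "bar_raised", "published", "actions_completed"]

def advance(current_state: str) -> str:
    """Move an RCA to the next step; an RCA isn't done until action items are delivered."""
    index = RCA_WORKFLOW.index(current_state)
    if index == len(RCA_WORKFLOW) - 1:
        raise ValueError("RCA is already complete")
    return RCA_WORKFLOW[index + 1]

state = "requested"
while state != "actions_completed":
    state = advance(state)
    print(state)
```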

Thumbnail 1300

Bar Raisers: Ensuring Quality Through External Review

So I mentioned bar raisers. Bar raisers are, in general terms, the external reviewers at Amazon. They are the foundation of our culture, and you'll find bar raisers at Amazon in literally every sort of process, design, and implementation. We have bar raisers in hiring interviews. We have bar raisers involved in the design of APIs, command lines, and services, and we have bar raisers involved in the review of pretty much any type of written content that we deliver. This session itself was reviewed by a bar raiser, to give you an idea.

Thumbnail 1330

We believe that they are really important because they come without any unconscious bias. But more importantly, they come without conflicting interests. So if you think in the context of hiring, a team might be under pressure to hire a new engineer because they just cannot cope with the workload. The bar raiser is going to come as an external reviewer to that meeting, and the only thing that they're going to evaluate is the quality of the candidate. They are not going to feel the pressure. They are not going to even know that the organization needs to hire someone really quick, and so they really help as an external eye.

Thumbnail 1390

And the third part, most relevant in this context, is that they do not have comprehensive context on what they are reviewing. So imagine an RCA. An RCA is a document written by the service team that caused the impact to customers. The bar raiser is going to come from another service, and so their review is going to be much more reflective of how customers are going to receive the document than of our internal standards or of how the team itself would read it. And overall, since they tend to do this frequently (it's not a full-time job, and that's an important thing), an RCA bar raiser is going to review at least five or six per shift, and they do one shift a week, more or less, so they end up with a really trained eye that can help spot opportunities for improvement or, in general, things that the team might have missed.

Thumbnail 1420

Thumbnail 1430

So in the context of RCAs, the first and most obvious goal is to improve quality. They do not have side interests. They are there to ensure that the document we deliver is consistent and of high quality. The second thing they do is ensure accuracy. They do not have the context of the team that wrote it, and so they are going to review it as an external customer would, and they might, for example, spot that we need to better explain an internal component or that we need to provide more details for the story we are telling to even make sense.

Thumbnail 1450

And then they guarantee consistency, so customers see us as a single company providing a set of services. Our customers don't really care that two services are managed by two different teams, and the expectation is that regardless of the team that was impacted, we deliver a document and we carry out an internal analysis that is to the same standard. And they help with this. They really help making sure that regardless of who's impacted, we go through the same steps and we deliver documents that are at the same quality bar.

Thumbnail 1500

Tooling and Principles: Building an Effective RCA Practice

So next we'll talk about tooling, and you can think of tooling as equipping a kitchen. Now to make a delicious meal you don't need every utensil that you can think of, but you do need a few basics to get things started, so let's look at what those are. So first you want to have tooling that helps you track some of the key mechanism steps, like

Thumbnail 1550

initiating the request and tracking approvals and things of that nature that we talked about earlier. The second is automation. You want to automate notifications around the RCAs as well as escalate when certain due dates are passed. The third aspect of tooling is being able to monitor the status of the preventative actions that we were talking about earlier. And one of the most important things to know is that starting small is okay. Sophistication can evolve over time, and it's perfectly fine to grow your tooling as your needs and your business also grow.
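As a tiny sketch of that automation point, here is an overdue-action check (the field names are invented, and the print statement stands in for whatever paging or email mechanism your stack uses):

```python
from datetime import date

def check_overdue(action_items, today=None):
    """Escalate any preventive action whose due date has passed and isn't completed."""
    today = today or date.today()
    for item in action_items:
        if item["status"] != "completed" and item["due_date"] < today:
            # In real tooling this would notify or page the single-threaded owner.
            print(f"ESCALATE: '{item['title']}' owned by {item['owner']} "
                  f"was due {item['due_date'].isoformat()}")

check_overdue([
    {"title": "Add graceful failover for the replica", "owner": "service-team",
     "due_date": date(2025, 11, 30), "status": "in-progress"},
], today=date(2025, 12, 4))
```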

Thumbnail 1560

So now what I'd like to do is pull back the curtain a little bit and show you what our RCA tooling looks like here at AWS. Now we have a suite of unified tools called Command Center that we use, and RCAs are a component of that. Now on the right-hand side of the screen you see an issue summary. You may have seen a version of this issue summary in our post-event analysis on our website in the past, so that may be familiar to you. Now what I'd like to do is call your attention to the left, where this part of the tooling includes some of the metadata that we mentioned is essential earlier in the session. So here you can see we're tracking things like the status, in this case it's published, the general manager or service leader approvals, and certain SLAs around the RCA. We track SLAs with our RCAs closely, and we do this because we want to make sure that we're providing the right level of support for our authors and our leaders to deliver the RCA.

Thumbnail 1640

Now in addition to this portion of the tool which is focused on creation of the RCA, there's also the delivery aspect of completing the action items, so our tooling supports and houses all of the action items that are tied to this RCA. Notice how each action has a particular status and a due date tied to it. And we also include in our tooling a link to our engineering ticketing system where more details about what's being delivered in the action item are housed, and we also use this ticket for our change control systems as well as implementation tracking.

Thumbnail 1680

This is the request tab and it illustrates what I mentioned earlier that requesting an RCA is much more than just clicking the button. Notice how each request here includes important information and context about the customer experience and the impact that they saw. Understanding the customer voice in our tooling is very important to us when we are working on the RCA. It helps us ensure that we're addressing the experience the customer has and not overindexing on the technical components of what failed, although those are also important.

Thumbnail 1720

Thumbnail 1750

Thumbnail 1760

And finally, here is our activity timeline. This is where we track the complete history of an RCA all the way from the moment it was requested through the bar raising step and through to publication. Each step in the activity timeline includes the owner of who completed the action as well as a timestamp for later audits or other needs. So our tooling and RCA practice has evolved over time. I think we have been running it for about 10 to 12 years, but two things that have remained the same are our high-level approach and the principles we stick to. So there are a couple of ones that I want to share with you today.

Thumbnail 1770

Thumbnail 1780

Core Principles: Quality, Unity, and Appropriate Depth

The first one is we will always prioritize quality over speed. And this is really important because after an event there is going to be extreme pressure to deliver the root cause analysis as soon as possible, but we also recognize a few things. The first one is that we only really get one chance at delivering a quality document. If we deliver a document that is incomplete or that doesn't have the full story, the full chain of events, customers are going to complain about it and customers are going to expect answers in that document. If they don't find them, they are going to escalate, and all of this process of re-earning trust is going to be more painful and go through multiple iterations.

Thumbnail 1850

So we recognize the importance of delivering the perfect document, or as close as possible to perfection, at the first shot. Additionally, one thing that we stick to, and that is unfortunately not really common in the industry, is we do not deliver a root cause analysis document until we can commit on preventive actions. This is the part that generally takes the most time because after a large event we will get the entire write-up ready in 24 to 36 hours maximum, one day and a half. But then we need to pull together a wide amount of teams and we need to make sure that they define and prioritize actions in the roadmap, and this can take another couple of days. Generally the rule we go by is that if by taking an extra day we can deliver a much better quality document, we'll take it without any question.

Thumbnail 1890

Thumbnail 1930

And the second thing we are quite big on is dependencies. Now dependencies, as in using a software component, a library, or a functional block from someone else, be that an internal team or an external company, are amazing because they allow a team that is developing a service to focus on its core functionality. Dependencies are challenging because they are run by someone else. That someone else can be in the same company or external. When it comes to us, we recognize that the dependencies we take are our responsibility and not the responsibility of our customers. We carefully review dependencies before taking them and reconsider this choice over time, so we will reassess whether they are still the best possible option. When they fail, we tend to take ownership of those failures as if they were our own, for one simple reason. Although we do not control the operations of that dependency and the failure might be completely external, one thing we absolutely control is our implementation. So we can implement for graceful failover, for example, and the final and most drastic choice we have is to stop using that dependency altogether.

Thumbnail 1950

Generally we don't see RCAs as a place to play the blame game. We'll take as much ownership as we can and we'll try to be as independent as possible in delivering the promise to our customers. So here we see a customer inquiring about the interaction between two teams that they're curious about and want to understand more. And the reality here is that they shouldn't have to understand these subteam interactions during a service event they may have observed. These team boundaries are internal, and they shouldn't surface during an RCA either. And as an RCA author, you want to ensure that these aspects do not surface in your writing.

Thumbnail 2000

Thumbnail 2020

Thumbnail 2040

So this brings us to another principle that we lean on, which is acting as one. For us, unity is very important, and in RCAs we present ourselves as a single entity. Remember that outcomes are what really matter to the customer, along with what we are doing about preventing a recurrence. So just like an athlete in a relay race running their leg, it doesn't move the team closer to the finish line unless all of the athletes are moving in lockstep and winning the race as a complete team. So this means your organizational structure might be complex, and that's okay, but it's not your customers' concern. To them, you're a single provider delivering a single service or many services, and what you need to do here is ensure that you're keeping those team boundaries where they belong, and that's internal.

And in the RCA you focus on delivering a unified message about what happened and what we're doing about it as a complete team, and ensure that everything looks and is delivered in lockstep. Okay, so a topic that is really important when facing root cause analysis is finding the appropriate depth. You might be inclined to provide as much detail as possible, but the reality is that excessive detail can be a distraction. There are a few things that we tend to stick to, and the first one is that if customers use our services, it's because they don't want to build them themselves, and so they might not even be interested in the architecture behind our services.

Thumbnail 2090

When explaining a failure, we try to connect what happened internally with functions our customers are familiar with and use in the day to day. So we explain an event through the impact on external functionality rather than explaining it through impact on internal components.

Thumbnail 2110

Thumbnail 2120

And then quite obviously, internal terminology and components are internal, and you're not going to get away with it by just providing a table of examples and explanations on the side. Now, this might not be particularly tangible, so let me show you an example, and this is a real example.

Thumbnail 2130

So here we have an internal event summary. We are talking about L514 that experienced a cut while L528 was shifted for an XLA replacement. Now, the top part is AWS internal terminology. Those things mean something very specific for us but mean nothing externally. The second part is terminology only familiar to engineers that work with optical networks. So let's look into how we go and explain a failure like this to customers.

Thumbnail 2160

You will notice we start talking about increasing latency, and increasing latency is not even mentioned on the left, because on the left we know that when that event happens, there is going to be an increase in latency. While talking to external customers, we need to disclose the impact and describe it, as we said, in terms of components they are aware of. Then we talk about a physical disruption. It's not really relevant that it was a fiber cut, because what's relevant is that that fiber connection was unavailable, and we demystify the L514 and L528 terminology by describing them as the two fiber paths that connect the EU West 3 (Paris) and EU Central 1 (Frankfurt) regions.

Thumbnail 2230

An XLA replacement is a replacement of an optical component, and again, we removed this detail and replaced it with something that is much more relevant to the customer. So we explained that one of the two paths was unavailable because of planned maintenance. You see how we added some context; we removed some terminology and, rather than swapping it for another internal term, we replaced it with context that is relevant externally. And as a customer, if you read the summary on the right, it's going to be relevant, it's going to be on point, and it's going to explain the impact exactly on the components you use.

Thumbnail 2260

A similar example here. We are talking about a loss of 2 million packets caused by a dirty reboot of something that is called W54 VCAR 331. Now, once again, as an AWS employee who works really close to our network, I know exactly that that thing is a router, and just from the name I can tell you its function and where it sits. Externally, we explain this by demystifying that name: it's a Direct Connect router that is part of the EU West 1 (Ireland) regional network, and it impacted customers being routed towards the AZ2 availability zone. The other thing we do here is translate the internal slang of a dirty reboot, I don't know where it comes from, by explaining that it's simply an unexpected reboot. Once again, we turned something that was really internal, really specific, and of really poor value into an explanation that has the context required. Done.
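A toy sketch of that terminology pass, reusing the examples above (the mapping is illustrative, and as the talk notes, a mechanical substitution alone isn't enough; you still have to add impact context rather than just swap terms):

```python
# Illustrative mapping from internal terminology to customer-facing descriptions,
# mirroring the examples above. Substitution is only a starting point.
TRANSLATIONS = {
    "L514": "one of the two fiber paths connecting the Paris and Frankfurt regions",
    "L528": "the second fiber path connecting the Paris and Frankfurt regions",
    "an XLA replacement": "planned maintenance on an optical component",
    "W54 VCAR 331": "a Direct Connect router in the Ireland regional network",
    "dirty reboot": "unexpected reboot",
}

def externalize(internal_summary: str) -> str:
    """Replace internal terms with descriptions customers can relate to."""
    text = internal_summary
    for internal, external in TRANSLATIONS.items():
        text = text.replace(internal, external)
    return text

print(externalize("L514 experienced a cut while L528 was shifted for an XLA replacement."))
```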

Thumbnail 2310

Thumbnail 2320

Thumbnail 2330

Structure and Style: Crafting Effective RCA Documents

So now let's talk about structure and style. Now customers expect consistency in how RCAs are written and presented, so let's explore how you can meet this expectation. So first is the anatomy of the RCA. How you deliver your message impacts its effectiveness. This comes down to three key elements. First is using the right tone with the audience: you want to demonstrate technical understanding as well as convey ownership, because after all, the service that you built and your customers placed their trust in didn't work, or didn't go as planned, or wasn't as available as they expected it to be.

Thumbnail 2360

Thumbnail 2370

Thumbnail 2380

Next is following an efficient structure that's easy to navigate, and we'll dive into that structure in a moment. And lastly, maintaining appropriate style throughout, and Giorgio will help us with some style tips in just a moment as well. Now remember that how you say something is often as important as what you say. So let's take a look through a structure that we use here at AWS in our event summaries. Everything starts with the impact summary, and this component is where we establish the scope of the event. So we distinguish which services were involved in the event and which ones were not, specifically which regions or which parts of them (what we call availability zones) were impacted, and of course the impact period: when things started and when they ended.

Thumbnail 2410

Thumbnail 2430

You focus on describing the impact of functionality that your customers experienced, and you lean away from overindexing on which internal components specifically failed. What you want to do here is find the sweet spot between providing enough detail to be meaningful while also keeping it readable and relevant, and that becomes the impact summary.

Thumbnail 2450

Next is the root cause, and this is where the meat of the document is; it's what your customers have come to read to understand more about the event. Writing an effective root cause section requires a few things. First is setting the proper context. The context helps the reader understand the landscape of where this event occurred and also gives you an opportunity to explain some of the subsystems that may be involved, even if they are not visible in the service that the customer is consuming. You may do this to ensure that there's proper context about the underlying subsystem, and it's okay to reveal some of that as long as it's related to the ultimate root cause that you're talking about.

Thumbnail 2520

You also want to keep the analysis focused and clearly distinguish between triggers, which are things that set the event in motion, and the actual root cause, which is the underlying problem that we're going to address through our action items. We also need to explain the chain of events, and you need to connect the dots between what the customer had experienced during the event and explain why it happened. Think of this as like a detective story where you really are here to help the readers understand not just what happened but also how it happened and why.

Thumbnail 2530

Thumbnail 2540

Thumbnail 2560

Now we also need a timeline, and the timeline is going to span from the moment that you detected the incident. We focus also on mitigation in this timeline, and we also notate when remediation was taking place as well as when initial signs of recovery occurred. You want to focus the timeline on answering three main questions that your customers will often want to understand more about. First is when and how did you detect the issue? Was it automation or some other means, and when did that first alarm bell go off? The next is what key steps led to recovery and who was involved? Was it autonomous or were operators involved and things of that nature? And thirdly, what preventive actions were taken to further reduce the impact? During mitigation you may take some short-term actions to ensure that the impact doesn't spread, and identifying those in your timeline is also important.

Thumbnail 2600

Now there's a balance here. You want to keep it focused on significant actions. You don't need to document or reveal in your timeline every command that was issued or every message that was sent, but keep it focused on those key milestones throughout the recovery.

Thumbnail 2620

Thumbnail 2630

Thumbnail 2640

Now earlier we mentioned an RCA isn't complete without action items, and the reason for that is they become the promises to your customers. Additionally, your action items demonstrate a few things. Your action items should first showcase how you would prevent this specific event from happening again. The second is if you had a similar issue that was related or had similar characteristics, how will you detect that earlier and faster? And thirdly, how will you recover from that similar event in a faster way?

Thumbnail 2660

Thumbnail 2680

Remember that action items are not just a checklist to check off to say that you did something because you had an event, but they are very much commitments to continuous improvement for your services and your demonstration of rebuilding trust with your customers.

Thumbnail 2690

Now the last section of an RCA document is one that is only sometimes included, and we call this recommendations. Throughout root cause analysis you might discover that your customers use your services in ways that you didn't anticipate. This puts you in a unique position to offer some suggestions.

Thumbnail 2710

Some of the suggestions that you can think about in this section are how can the customer reduce the blast radius of their implementations. Would it be useful for them to use a horizontal construct or regional constructs of your services? And second is avoiding anti-patterns. So have we spotted that customers at large are using hard dependencies on certain other systems or control plane dependencies that were not intended for the availability that they were expecting? And lastly, how can you help them evolve their architecture to be more resilient?
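Pulling the structure together, here is a skeletal outline of the anatomy described above (the section names come from the talk; the placeholder text is mine):

```python
# A skeletal outline of the RCA anatomy described in this section (illustrative only).
RCA_TEMPLATE = """\
# Impact Summary
Which services and regions/availability zones were affected, and over what period.

# Root Cause
Context, triggers versus the underlying root cause, and the chain of events as customers saw it.

# Timeline
Detection, key recovery steps (automated or operator-driven), and short-term mitigations.

# Action Items
Commitments to prevent recurrence, detect similar issues faster, and recover faster.

# Recommendations (optional)
How customers can reduce blast radius, avoid anti-patterns, and evolve their architecture.
"""
print(RCA_TEMPLATE)
```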

Thumbnail 2770

One key thing to keep in mind here is that resilience is always a shared responsibility, and while you're taking ownership for the event that was caused by your services, having recommendations helps rebuild and build stronger partnerships with your customers. So let's review a couple of style tips that we have shared or implied up to here. The first one is, while writing RCAs, always start from customers. Focus on their experience and not on yours. Where possible, try to talk about functions that are familiar to them and components of your service and not about internals.

Thumbnail 2790

Thumbnail 2810

If you need to talk about internals, make sure that you do that to the bare minimum possible and provide all the relevant context for your customers to understand. An RCA that includes a lot of acronyms or that just jumps right into the middle of an architecture without context is not going to be easily readable and it's not going to be of particularly good value. And finally, it might be obvious, but be self-critical. So if you're writing an RCA, it's because one way or another you impacted customers.

Thumbnail 2840

Beyond the Document: Driving Real Improvement and Rebuilding Trust

This is the time for self-reflection, so mention whichever area you think you should have done better in. And if you think that there is an area where you can do better, take the related action items. The reason is that if you don't criticize yourself, someone else will, so this is a good opportunity. All right, so let's now explore how RCAs drive actual improvement to support customer success, and this means going beyond the document that you deliver to your customers.

Thumbnail 2850

Okay, so this is the first real quote of this talk, and it's actually quite a common pattern. In contrast to the way we explained we approach dependencies, we often see customers taking the opposite approach: tagging an event as caused by a third party and archiving it. So if you're a service provider, it's really important to support customers through closing the loop. And this is because RCAs are not really a document to receive, read, and archive. They are a document to digest and implement across the organization.

Thumbnail 2890

Thumbnail 2900

You want to make sure you support your customers through three stages. The first one is learning from the event. The second one is planning their side of the preventive actions, and the third one is the one that makes everything else valuable and complete, and that's implementing those actions. So in terms of learning, every RCA is an opportunity for learning, and there are two types of learning. The first one, for your customers as customers of a service provider, is going to be your direct recommendations.

These are going to suggest how to use those dependencies, that is your service, better and how to implement them better. The second one is the fact that technology is built on a really limited number of patterns, libraries, and components. And so the customer might have an unrelated component that runs in a similar way and that is subject to the same failure conditions. So there is also going to be some opportunity for indirect learnings for them.

Thumbnail 2940

Thumbnail 2970

And what we find extremely valuable is for account teams that are customer facing to provide the relevant context of an RCA to every customer. This goes from simple things like contextualizing the impact to their specific business case, to contextualizing the action items and our recommendations, maybe as part of a roadmap they were already discussing with the customer. So here we see a dangerous mindset that you might encounter with your customers. It's like saying we won't fix the brakes on the car because we're going to sell it shortly anyway.

Thumbnail 3000

But just because a product or service has a planned end of life doesn't excuse you from maintaining its reliability today. Now for your customers this might mean they need help with establishing a plan. And you can help customers create meaningful action plans by doing a few things. First, you want to identify specific improvements and help lay out clear implementation steps. This is where you guide them through setting outcomes that are measurable and have realistic timelines. A successful plan balances ambition with achievability.


Thumbnail 3020

Here is where you guide them through setting outcomes that are realistic and achievable, similar to what we talked about earlier. It's important to remember that a successful plan needs to balance ambition and achievability, and the quote we saw, which lacked both, should motivate us to build that plan and execute on it.

Thumbnail 3030

Thumbnail 3040

So that leads us to implementation. Now after planning comes the execution phase, and this is where you're turning intentions that are written down on a piece of paper into reality, and you should look at a few things here. First is how can you actively support implementation. Now actively means going beyond just providing prescriptive guidance on how you could do something, but also going into the realm of providing hands-on professional services and also being the technical advocate within an organization where you are helping the customers advocate at the leadership level that we need to implement these changes for various reasons related to availability and reliability and ultimately rebuilding that trust.

Thumbnail 3090

The next thing is that you want to look at how you can make implementation faster. This is where you as a service provider can help simplify cloud operations, or operations in general, for your customers by creating new capabilities. And lastly, is there work that you can just do on behalf of your customers, where you change the way resilience works or make it fundamentally easier and simpler? Sometimes the best support you can give your customers is taking ownership of certain improvements yourselves.

So let's bring everything together and look at the key elements that make RCAs a powerful tool for improvement and trust. The first one is, I just said the word, trust. Remember RCAs are a tool for customer trust. They help build confidence through transparency and ownership. So be open with your customers. They are going to reward you for that.

Thumbnail 3140

Thumbnail 3160

Next, have an early start. You want to begin documenting while the incident or service event is ongoing and capture the timeline, metrics, impact, and observations while they're fresh and accurate. Complete a systemic analysis. Go as deep as possible and don't just stop at the immediate root cause and the triggers that are visible.

Thumbnail 3170

Thumbnail 3180

Now your customers will expect a clear structure in how you deliver your RCA documents, and they're most effective when they have three core elements: one, the impact; two, the cause; and three, which preventions you're taking through action items. Next, include material action items. So make sure that the commitments you make are realistic and achievable, and focus them not only on prevention but also on detection and response. Those three items come together to deliver an effective incident response and ultimately shorten the disruption for your customers.

Thumbnail 3200

And finally and most importantly, RCAs go beyond just the words written on the paper, and they should never become shelfware. They should be used to turn findings into actual improvements that improve quality and of course rebuild customer trust. So today we looked at how RCAs create a foundation for rebuilding trust. We transform chaos into clarity and meaningful improvement actions. We hope that these insights have given you some tools in your toolbox when you're reading one of our RCAs or building your own RCA practice.

Thumbnail 3250

Feel free to take a picture of this screen. We'd love for you to stay in touch and reach out to us, and thank you so much for your attention and joining our session today. Please complete the survey in the mobile app. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
