🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - The incident is over: Now what? (COP216)
In this video, AWS Principal engineers Giorgio and Anthony share AWS's incident management practices, covering detection through aggregate alarms and customer-driven metrics, engagement via AWS Incident Response (AIR) with tech and support calls coordinated by call leaders, and mitigation strategies like shifting away from failures and rollbacks. They emphasize the Correction of Errors (COE) process using five whys as a tree structure rather than a chain, creating customer-facing root cause analyses to regain trust, and scaling lessons through operational metrics meetings with thousands of AWS employees. The presentation includes a case study of migrating STS from global to regional endpoints, demonstrating centralized change as the most effective scaling method after team education and distributed change proved insufficient.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: AWS Incident Management at Scale
Hello and welcome everyone to re:Invent 2025. Thank you for being here, especially considering this is a 9:00 a.m. session. I hope jet lag is treating you all well. We are here today to talk about the incident, and it's a really good thing that we haven't had one in a while. Let's start off with a bit of background on who we are and why we are here today. I'm Giorgio. I'm a Principal in AWS Enterprise Support, which is the organization that supports our customers in their day-to-day improvement of operations and resilience. I'm Anthony. I'm a Principal Engineer in the AWS Event Management team. We're part of the AWS Health organization. Together we have more than 20 years of combined experience in all things incident at AWS, from detection to resolution and implementation of action items. We are here today to share some lessons we have learned over time, tips and tricks, and discuss how we do things here at Amazon.
So with a show of hands, how many folks here have been involved either in an incident or after an incident? It looks like most of us. Well, you're in the right spot. Today we're going to go through a couple of different things. We're going to talk about event management at AWS, specifically how we do it, how we detect, and how we engage, and then we're going to focus the discussion on the retrospective phase, and specifically how we do that at scale. We'll start with how we trigger and manage events and then zoom into that. But I want to key in on the fact that the post-incident is not just the creation of a document. It's really deep diving and truly understanding the root cause. We'll talk about the five whys and a couple of different ways to get to that. Then to wrap up, we'll talk about specifically how to scale it across a very large organization.
Detection: Service-Driven and Customer-Driven Metrics
So we're going to start with detection. All events start with detection and they end, at least in the short term, with mitigation. For detection, we have two broad categories of events. The first is service-driven. Service-driven metrics include things like alarms on the service itself, such as EC2 instance launch failures and launch latency. But we also have metrics and alarms at the subsystem layer. Those cover the other subsystems in the path, in the control plane or the data plane, that are involved in creating the instances. We also supplement that with synthetic monitoring, which we call canaries. These are intended to emulate the customer experience end to end. So you can think of launching an EC2 instance. Certainly we're monitoring whether the launch was successful, but we also want to know whether that instance is available, whether it can be contacted, and whether it's responding to DNS requests. Those are the first two types of service-driven metrics we use to detect issues.
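As an aside, here is a minimal sketch of the canary idea, not AWS's internal tooling: a synthetic check that resolves DNS, calls an endpoint end to end, and publishes the result as a CloudWatch metric with boto3. The endpoint, namespace, and metric names are made-up placeholders.

```python
import socket
import time
import urllib.request

import boto3

# Hypothetical endpoint exercised end to end; swap in a real service URL.
ENDPOINT_HOST = "service.example.internal"
ENDPOINT_URL = f"https://{ENDPOINT_HOST}/health"

cloudwatch = boto3.client("cloudwatch")


def run_canary() -> None:
    """Emulate the customer experience: resolve DNS, call the endpoint, record the result."""
    success, latency_ms = 0, 0.0
    try:
        socket.gethostbyname(ENDPOINT_HOST)               # can the name be resolved?
        start = time.monotonic()
        with urllib.request.urlopen(ENDPOINT_URL, timeout=5) as resp:
            success = 1 if resp.status == 200 else 0      # is the service responding?
        latency_ms = (time.monotonic() - start) * 1000
    except Exception:
        success = 0                                       # any failure counts against availability

    # Publish the observation so alarms can be built on top of it.
    cloudwatch.put_metric_data(
        Namespace="Canaries/ExampleService",
        MetricData=[
            {"MetricName": "Success", "Value": success, "Unit": "Count"},
            {"MetricName": "LatencyMs", "Value": latency_ms, "Unit": "Milliseconds"},
        ],
    )


if __name__ == "__main__":
    run_canary()
```

Run on a schedule, this gives an availability and latency signal that reflects the path customers actually take, rather than only internal component health.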
But the third one is maybe a little bit unique to AWS, and these are what we call aggregate alarms. These are alarms that trigger when multiple services are in alarm at the same time or multiple metrics are affected. Those are a little bit different in that they engage a full incident response, and we'll talk about that here in just a second. The second category of metrics that we use for detection are what we call customer-driven. Those are of two types. The first one is that we monitor for traffic anomalies. We have models that allow us to predict the amount of traffic that should be hitting a service at a certain time of day in a certain region. If the traffic is not there, then we trigger alarms that start investigations into why the traffic cannot reach the service. Related to that, we track impact reports from customers. This ranges from mentions and reports on social media to support cases raised by customers. We analyze those trends, and if there is an uptick in cases about a specific service, then it might be a sign that we need to engage and start investigating a potential issue that the first category of metrics is not catching.
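One way to approximate an aggregate-style alarm outside AWS's internal systems is with CloudWatch composite alarms, which only go into ALARM when several underlying alarms fire together. The sketch below is a hedged example; the alarm names and the SNS topic ARN are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical per-service alarms that already exist in the account. The composite
# alarm goes into ALARM only when more than one of them fires at the same time,
# which is the spirit of an aggregate alarm that escalates to a full response.
cloudwatch.put_composite_alarm(
    AlarmName="aggregate-multi-service-impact",
    AlarmRule=(
        '(ALARM("service-a-error-rate") AND ALARM("service-b-error-rate")) OR '
        '(ALARM("service-a-error-rate") AND ALARM("service-c-p99-latency")) OR '
        '(ALARM("service-b-error-rate") AND ALARM("service-c-p99-latency"))'
    ),
    AlarmDescription="Escalate when multiple services are in alarm at once.",
    ActionsEnabled=True,
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:incident-response"],  # placeholder SNS topic
)
```

The single-service alarms keep paging their owning teams as usual; the composite only adds the "multiple things are broken together" escalation path.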
So let's talk about detection a little bit more. I want to show you the AWS dashboard. Legend has it that this dashboard was built around 2008 or 2009 by an engineer from Dublin who was visiting Seattle. He had identified the problem that, as we scale, we need a centralized view into the health of AWS services and AWS regions, along with the ability to deep dive by partition. So what we have here is a view of all of that. We can drill into each service and each region. Each service onboards three to five key performance metrics that indicate the service's overall health. Over the years we've continued to develop and own this same tool. Not much has changed, and the goal remains the same: we need that centralized view into AWS's health.
Engagement: Single Service vs. Multi-Service Events
After detection, we have engagement. We have two categories of events here as well. We have single service events that are triggered by service alarms. These are alarms that engage the service team, and generally only the service team for what's in alarm. They can engage others as required, and ultimately they can engage AWS support. AWS support is in the loop if customers are impacted by an issue.
Second, we have multi-service events. These are triggered by the aggregate alarms that I was mentioning before. These are engagements at scale, and they trigger the engagement of a couple of different teams. The first one is AWS Incident Response, or as we call it, AIR. Additionally, AWS support is involved and engaged at the onset of the event. All impacted services are engaged from the creation of the event all the way through its full lifecycle, not just what's in alarm at creation, but throughout the entire time. At the onset of these events, we also engage what we like to call the usual suspects. These are foundational or core services that are low in the stack and are often a root cause of multi-service events. These include things like authentication, DNS, and networking. So we'll engage Route 53, IAM, and our networking teams.
Coordination: Tech Call Operations and Documentation Practices
After engagements, we have the coordination phase. Coordinating incident response at AWS scale is quite a challenge. We do this by breaking down into work streams, each one coordinated by a related call. The first one is the tech call. It involves engineers and leaders of the affected services, and it's focused on delivering the fastest possible mitigation and resolution for customers. Secondarily, we have the support call. The support call focuses on communicating with impacted customers and giving customers advice on how to recover faster. There are often scenarios where we can provide information to get you out of pain earlier. The participants on that call include support and service leadership.
The way we think about the tech call is that we want to run a large-scale parallel investigation instead of a serial one. We do this by engaging all of the impacted teams and asking them to review their services and their metrics and come up with impact statements. This might be slightly different from what you're generally used to seeing, which is a sequential investigation where each possible root cause is ruled out in order. That's a bit deeper into the tech call. Anthony mentioned it's supported by a team called AWS Incident Response, or AWS AIR for short. This is not the classic operations team. They are not just there to run the calls and coordinate the incident response. They own the process end to end. They own the mental models. They own the foundations of our incident response, and they own all of the related tooling. So they are quite active in the process. They are not there just for coordination.
This call is focused on one thing: mitigation, and after mitigation, resolution. Large events are supported by what we call a call leader. The industry-standard term for a call leader is probably something like incident commander. This is an extremely senior person who is on the call and makes decisions when we face scenarios that we have not experienced before. It's a really small number of individuals, single digits across all of AWS, and they are there to take those split-second decisions that help with incident mitigation.
The call is paired with a ticket, and we have quite a strong mindset about what is discussed in the call. Imagine that this call often runs with hundreds of attendees. Having a mindset and a clear definition of what needs to be said in the call, what can be written in the ticket instead, and what must be written in the ticket because we want to study it for the long term is really key. We support this with strong etiquette, so it's quite clear who's supposed to talk and who has the microphone at any given point in time. I wish we were a bit better with muting people, because it's not uncommon to have dogs barking in the background, helicopters taking off, or similar.
This call is the one main coordination bridge for the event. When a service team needs time to look deeper into their functionality and involve different engineers, they fork off to a separate call while still keeping at least one engineer who acts as a bridge between the two. So the tech call remains the core of the incident response, and it can be forked out as appropriate and as required from time to time.
I mentioned the tactics that we use for long-term storage of observations. I want to share with you a couple of examples of those observations that might make this more tangible. In the first example, we have an engineer pasting an error log. Now, you might think this is quite redundant since the error log is stored somewhere else and it already has a timestamp, but putting it here makes it immediately visible to the hundreds of resolvers that are involved. It also gives us an indirect piece of information. We will know later, in the postmortem review, that at this point in time we were aware of this error. The fact that it was written in a log somewhere is not the entire story. We want to remember that at this time we were looking at this service and we were looking at this error condition.
Then we have another engineer reporting that they have started a rollback. Our pipeline tooling obviously tracks timestamps and the start of rollback, but again, it is shared here for immediate visibility to everyone. The other thing they are doing is sharing the link to the pipeline for reviewing the deployment state. This is one of those measures that helps keep the traffic on the tech call low because you will not hear anyone asking about the deployment status or the rollback status. They are just going to click on that link and see for themselves.
Here is a case where we share metrics. We see the engineer reporting 100% recovery on their fleet, and they do this by sharing a metric that shows the steady state, shows when the anomaly started, shows when the anomaly crossed the alarming threshold, and then shows when they are back to normal. Again, this is all data that is going to be stored somewhere else, but we put it here for easy reference, especially in the post-event review phase.
Finally, we have another quite common type of observation. An engineer is looking around while trying to figure out the root cause and spots a potential correlation with a metric. Now, this might be a red herring, but sharing it here for everyone to see might trigger someone else into looking at their service and figuring out that the correlation was in fact relevant and related to the root cause of the event.
Support Call: Customer Communication and Command Center
The support call owns customer communication and is run by AWS Event Management. We are most frequently the ones communicating to you for operational issues that you see on the Personal Health Dashboard and the Service Health Dashboard. Additionally, we proactively engage with account teams. We want to make sure that they are prepared to respond and check in with customers early. We also place a heavy focus on recovery guidance and best practices wherever possible. For some events, there are actions that customers can take that can get them out of pain earlier, and we want to provide that information as soon as we have it.
Additionally, we use the data from case trends, customer impact reports, and sentiment as additional data points to verify that our mitigations are working as we expect and that we understand the customer experience correctly. In order to provide detailed and real-time updates, it is super important that we are in contact with the tech call, and we need that real-time update to be able to send communications to you, the customers. We have two-way communication occurring the entire time, where we are providing information about the customer experience to the technical call and we are getting information back about what is happening in real time, where we are at in mitigation and recovery efforts.
I want to show you a visual here of Command Center. Command Center is our all-in-one support tool. This is a Command Center notice, and we use Command Center notices to track, centralize, and share information as it pertains to an event. In this example, what you see on the left-hand side is metadata about the incident. This will include the service or services that are impacted, the start time, and on the bottom left you will see the customer contacts as we have them and as we are relating them.
In the middle, at the very bottom, you will see the internal summary, and this is where we keep that up to date so that our customer-facing teams have access to that real-time information about what is happening. Additionally, you will see at the top here there is a messaging tab, and that messaging tab is where communications that have been sent to customers are visible to those account teams as well. Finally, we have the sentiments tab here at the top, and that is where we are tracking and reporting on customer feedback.
Communication is key on the support call, so it is important to communicate to the right audience. There are two different audiences that are our main stakeholders here, and those are internal stakeholders and external customers. External customers get regular updates and regular recommendations, and we are communicating to them with specificity about what is impacted and with clarity over time.
We're providing live status updates to our field teams and customer-facing account teams. We want to empower them and ensure that they have the latest information so they're not caught off guard. Ultimately, we also want to ingest customer inquiries so we can respond by creating FAQs or bolstering the FAQs that we already have. Let's take a deep dive into external communication specifically.
Communication is a matter of balance, and we bias towards speed. We want to communicate as soon as we know that something is occurring and give that heads up. Secondarily, we have accuracy. We define accuracy as targeting customers who are actually impacted. We want the communications to be relevant and actionable, and we want them to have as much detail as possible, including effective resources and other relevant information.
Finally, we have depth of our communications, where clarity is incremental and comes with time. Our first communication is often very generic, such as "we're investigating increased error rates and latencies," because we want to give you that communication first and we bias towards speed. Over time, we update those communications and provide details with specificity about what exactly is happening, what we're seeing, what we're going to do next, any workarounds that are available, and the time we think those steps will take.
Mitigation and Resolution: Shifting Away from Failure
Here you see probably the most common quote from our tech call, and it's the call leader asking about a rollback. I want to discuss now a bunch of measures that we take for mitigation. Remember, when we start the event management process and set up the incident response, our one focus is mitigating the impact for our customers. We want the service to recover as soon as possible, sometimes through side measures and without tackling the real root cause, so that our customers can go back to their normal operations and we gain some time for a deeper investigation.
Our most common type of response is shifting away from the failure. This is about removing the component that is failing when we can identify it. Imagine a large fleet of instances where we identify that only some of them are responding with increased latency. We have a health check that automatically takes them out of the fleet. Those instances remain broken, so we are not really resolving the root cause, but the customer impact just goes away. Your customers recover and you get more time to conduct a deeper investigation.
In more complex cases, we might be investigating an increase in latency in a service, and our metrics show that the increase is happening only in a specific availability zone. The first measure is to remove that availability zone from service and let traffic hit only the remaining two, allowing customers to recover. Once they are recovered, you have more time and far less pressure to resolve the issue.
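To make "shift away from the failure" concrete, here is a small, generic sketch (plain Python, not AWS tooling) of a health-check pass that pulls slow or erroring hosts out of a fleet while leaving them running for later investigation. The Host class and thresholds are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    az: str
    p99_latency_ms: float
    error_rate: float
    in_service: bool = True


# Illustrative thresholds, not AWS defaults.
LATENCY_THRESHOLD_MS = 500.0
ERROR_RATE_THRESHOLD = 0.05


def shift_away(fleet: list[Host]) -> list[Host]:
    """Take unhealthy hosts out of service but keep them around for investigation."""
    quarantined = []
    for host in fleet:
        if host.p99_latency_ms > LATENCY_THRESHOLD_MS or host.error_rate > ERROR_RATE_THRESHOLD:
            host.in_service = False      # customer traffic stops hitting this host
            quarantined.append(host)     # the broken state is preserved for the deep dive
    return quarantined


fleet = [
    Host("host-1", "az-a", 80.0, 0.001),
    Host("host-2", "az-a", 900.0, 0.20),   # degraded host: high latency and errors
    Host("host-3", "az-b", 95.0, 0.002),
]

print([h.name for h in shift_away(fleet)])   # ['host-2']: impact mitigated, root cause still unknown
```

The same shape works one level up: treat a whole availability zone as the unhealthy unit and stop routing to it while the healthy zones carry the traffic.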
Similarly, with rollbacks, when we start the incident management, the first thing we look at is whether there is any deployment that is ongoing or that started around the time the event itself started. We do not only look at the component that is affected; we sometimes look at all of AWS. If we suspect some correlation, we don't waste time to confirm that correlation. We just bring everything back to the previous known and stable state to mitigate as soon as we can.
Third, we are in the cloud, and there is a large category of events that result in increased resource usage. Alongside increased latency, you might see increased CPU usage. While you work to understand why that fleet or that type of request is using more CPU power than it was yesterday, you can just scale the fleet up and provide more resources so that the API latency goes back to normal. At that point, you can investigate.
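As a hedged sketch of the "add resources while you investigate" move, the snippet below raises an EC2 Auto Scaling group's desired capacity with boto3. The group name and capacity numbers are placeholders, not values from the talk.

```python
import boto3

autoscaling = boto3.client("autoscaling")

GROUP_NAME = "example-api-fleet"   # placeholder Auto Scaling group name
EXTRA_CAPACITY = 10                # headroom to absorb the extra CPU usage while investigating

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[GROUP_NAME]
)["AutoScalingGroups"][0]

# Scale the fleet up so latency returns to normal; the CPU investigation continues
# in parallel. The group's MaxSize may also need to be raised for this to apply.
autoscaling.set_desired_capacity(
    AutoScalingGroupName=GROUP_NAME,
    DesiredCapacity=group["DesiredCapacity"] + EXTRA_CAPACITY,
    HonorCooldown=False,           # during an incident we want the change applied immediately
)
```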
There are a couple of measures that are part of the toolkit that we generally prefer not to take. The first one is the turn it off and on again kind of thing. Software carries state, there are caches and buffers, and sometimes restarting a component can help with recovery. It's quite a tricky measure to take because sometimes with the restart you're also going to lose the state that led to the error condition. Whenever we have to do this, we try to isolate some nodes on the side that we can use for the investigation and restart everything else.
Finally, changes are challenging, especially when not completely tested, but some events are clearly solved by a small change in configuration or by rolling out a new software version that we were already testing to improve that specific component. We will only make configuration changes or update software in this way as a last resort.
If we have extreme confidence that a configuration change will resolve the problem, or if we have extreme confidence that a new software version is stable, trustworthy, and solves the issue, we will roll forward. Once you have mitigated the issue, there is a deeper resolution phase. One of the first questions you ask is about the risk of recurrence. Before disengaging, there are a few things we want to do. As Anthony mentioned multiple times, we use data from our customers to detect events, and we use that data throughout the event process to improve our communication and ensure our internal metrics are telling us the right story. We do exactly the same with mitigation. Once we are confident that we have mitigated the issue and our metrics are back in the clear, we spend an additional five to ten minutes to confirm with some customers that they are seeing the same improvement. Hopefully they will, but if not, we go back into incident management mode.
The second consideration is the risk of recurrence. If you made a small configuration change or changed something at runtime, there might be a pipeline, a recurring process, or a scheduled change that will undo it. You want to make sure that whatever temporary patch you have implemented will stay in place. Additionally, software components, SDKs, and libraries are used across services. If we suspect a specific library or piece of code to be the culprit, we want to check other services that are using the same component and validate that they are not on the brink of facing that same issue.
Finally, this is the point where we start forming root cause hypotheses. They do not have to be detailed—something you can do in fifteen to twenty minutes—but we want to have an idea of what services we want to conduct a deeper dive into over the following hours to days. When ready to close the incident bridge or the tech call, there is a small list of things we always make sure we do before closing it. We obviously want to validate that all short-term fixes critical to preventing recurrence are complete. We do not consider the incident resolved until we have this done and validated.
Second, we talked about tech tickets, and we make sure we are preserving all the data and logs that we might need later. This ranges from logs that might be flushed out of systems every twenty-four hours to screenshots that live on the laptop of a developer who was involved in the incident resolution. This is where we check that everything we might need later is stored essentially forever in the tech ticket. This is also when we start assigning postmortems. We have a large number of engineers taking notes during the event resolution phase, and at the end we come together and assign postmortems to everyone who was involved and where we see opportunities for improvement: maybe we paged a team and their colleague joined five minutes late, or another team was not really sure about their automation to remove traffic from an availability zone.
It is not a blaming process. This is where we look at how we responded, we start being self-critical, and we ask teams to investigate more deeply whether they could have done anything better. Finally, quite obviously, we want to make sure there is a common understanding of what the next steps are. We are going to move out of twenty-four-seven operation mode and disengage from the incident. If someone was paged overnight, they can go back to bed, but there are two or three days where work is still going to be intense. We want to make sure there is coordination and a plan for what everyone needs to do. Let's take a look at phase two, reflecting and planning.
Post-Incident Analysis: COEs and the Five Whys Framework
After an incident, there is an author or someone responsible for deep diving and performing that post-incident analysis. It is critical to ensure that you are examining it through an effective lens. How can you do that? Well, first and foremost, you need to create a safe space so that you can understand the context of decisions that were made when the software was designed, during the incident, and also after the incident in hindsight. Also assume that decisions were made with the best intent. Everyone is doing their best, and you want it to be a safe space so that folks feel comfortable challenging and criticizing constructively.
Sometimes it's important to do multiple retrospectives, so don't assume that one can cover everything. It's important to deep dive into any lesson or any failure that you want to prevent from recurring. It's not uncommon for us to do this when there's a multi-service event and different teams have different learnings they want to take away and share broadly. Each team will be responsible for deep diving into every failure and everything that we want to prevent recurrence of.
At AWS, we call them COEs, which stands for Correction of Errors. This is our post-incident analysis mechanism. A COE starts with an impact summary that goes over the timeline of the incident and what exactly the customer experience was. It should stand on its own, be multiple paragraphs generally in a narrative format, and talk about the customer experience and the lifecycle of the event. You should hit on all the key milestones of an incident.
It also goes into the root causes. We do that through the five whys, which allow you to really deep dive and get to the true root cause. It's super critical that you're actually getting to the true root causes in order to ensure that you're taking the right lessons and implementing the right actions. What comes next are those learnings. These lessons should directly come from and flow from those five whys and those root causes. From those lessons, each of them will generally have an action item. The lessons are usually things that we didn't know beforehand that we want to ensure we're capturing for posterity and taking actions to either educate or resolve a technical issue.
There are a couple of things we are really big on when doing COEs. The first one is that we try to be extremely self-critical about detection. We ask ourselves whether we detected the event fast enough and whether we are happy with our response time. The chances that the answer to this question is no are high. Until detection gets really immediate and you have automated mitigation, there is always going to be an opportunity for improvement. If the answer is no and we believe there are opportunities for detecting faster, we will create action items for this just as we do for the root cause of the event itself.
Additionally, we mentioned a few times that we use our internal metrics and we use customers to confirm that those metrics are telling us the right story, but we also want to validate this. We want to make sure that throughout the event we had the right observability. We want to validate that our metrics were giving us the real status and that they were reflective of the customer experience. Finally, we zoom out a bit. We do not focus on this very specific event or error condition, but look broadly and try to figure out if we are confident with our ability to detect similar event patterns and similar incidents in the future.
The second thing we are big on is dependencies. Whether they are internal or external, such as using a piece of software from another team or from a third party, dependencies are really useful because they help avoid repeated work. You might just use a component built by someone else and focus on the core of your own software. When an event is caused by a dependency, it might be really tempting to just tag it as such and completely disengage from the post-incident review. We do not really allow our teams to do so, and we want them to still analyze what happened.
The first question, and probably the most obvious, is that we validate that the dependency performed as promised. Every dependency is going to come with an availability promise, or RPO, RTO, or those sorts of metrics. We want to make sure that they delivered on what was sold to us internally. The second one is validating the implementation. A dependency is not the whole story. How you implement it in your software might determine whether an outage of that dependency impacts customers or not. Then we look at the failure mode and check whether it's something we should have expected, something we should have planned for and didn't, or something completely new that there was no way to plan for, where we need to shift to immediate contingency. Overall, we use those three questions to find opportunities for reducing the blast radius. Some dependencies are just not critical. Think about a web page that is built by calling multiple components and APIs.
You might just gracefully degrade the page experience and not load the component from that dependency when it's not available. It is a degraded experience, as the name says, but it's going to be much better than timing out on the entire page. Or similarly, you might be able to cache results, so you can still serve a request that you have already served before. Again, it's not going to be 100% availability, but it's certainly better than 0% availability.
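Here is a minimal sketch of that graceful-degradation pattern, assuming a hypothetical fetch_recommendations dependency: on failure, serve the previously cached result if one exists, otherwise render the page without that component.

```python
import time

# Tiny in-process cache of the last good response per key. A real service would
# use something with TTLs and eviction, but the shape of the idea is the same.
_last_good: dict[str, tuple[float, list[str]]] = {}


def fetch_recommendations(user_id: str) -> list[str]:
    """Hypothetical non-critical dependency; it may time out or throw during an incident."""
    raise TimeoutError("dependency unavailable")


def recommendations_with_fallback(user_id: str) -> list[str]:
    try:
        result = fetch_recommendations(user_id)
        _last_good[user_id] = (time.time(), result)   # remember the last good answer
        return result
    except Exception:
        cached = _last_good.get(user_id)
        if cached is not None:
            return cached[1]   # stale, but better than failing the whole page
        return []              # gracefully degrade: render the page without this section


print(recommendations_with_fallback("user-123"))   # [] here, since nothing was cached yet
```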
Anthony mentioned the five whys, and the five whys are a really appealing topic. I think they are one of the most misunderstood concepts in the industry. They are often explained as a sequence of questions where you identify one root cause and then you ask yourself what caused the root cause and then what caused the second layer and so on. The first thing is you shouldn't really stop at five. Five is a number to give an idea that it's more than two and less than one hundred. But you should really keep going until you find something that is meaningful, until you find a measure or an opportunity for improvement that if taken helps with a large range of root causes.
The second point is that the way they are often explained is as a chain, and this is really not the best way to think about it. The key is looking at them as a tree and acknowledging that events are rarely caused by a single root cause or a single trigger; they are often a contribution of multiple factors. To make this more tangible, let's look at an example. We start from the issue that API calls were failing, probably the most common failure. An API, third party or internal, was erroring out, and we tracked those errors down to one host in the fleet that was returning 500 errors. So we ask the next why. Why was that host returning 500 errors? We find some I/O errors in its logs. Then we look into those I/O errors and find a problem with the underlying EBS volume. We look at the underlying EBS volume and find a problem with a portion of an availability zone.
Now this is really effective. You are going really deep, and you have found a fourth-layer root cause that explains the event. However, it's only covering one of the various potential root causes. So let's look at a better option. We start with the same question, but we don't follow just one thread, and this is going to become apparent soon. We also added some data. So now we are specifically talking about 3% of API errors over a period of 45 minutes. The first question and the first answer are the same as before. We have one out of one hundred hosts that is returning 500 errors. We go down the same branch as before, and it is going to point to a temporary EBS failure in the availability zone.
In parallel, though, we go look at our health checks, because the reality is that if the health checks had been working and had removed that failing host from the fleet, our service would have kept working regardless of the underlying disruption to the storage layer. We figure out that there was a bug or gap in our service templates that was incorrectly implementing the health check. Then there is the problem of those 45 minutes. Forty-five minutes of impact from a single failure is quite significant. We start looking into that and find out that the engineer was not engaged for the first 30 minutes. Again, this is an immediately visible opportunity for improving the detection time. If detection had taken 15 minutes instead of 30, the entire event would have been 15 minutes shorter, even without changing anything about the resolution.
Finally, I was talking about numbers. I don't know how many of you picked up on it, but we are saying that a 1% loss of capacity caused a 3% error rate, and there is clearly a disconnect there. We go look into that and find a problem with the load balancing algorithm. That kind of algorithm can be quite painful here, because when a service is erroring out, errors are generally served faster than actual responses, so a single failing node attracts more traffic than it does when it's working correctly.
You can see how, by doing this, we have not arrived at a single action, which in the previous case was basically to wait for AWS to solve the EBS problem, but rather at several nearly independent root causes. Tackling one, two, or three of them builds layers of resilience against that failure condition. When you think about the five whys, do not imagine a chain; think about a tree. It is going to surface a range of different issues and a range of different preventive actions.
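To make the tree shape concrete, here is a small illustrative sketch that encodes the example above as nested nodes rather than a single chain. The node text paraphrases the branches discussed in the talk; the structure itself is just an illustration, not an AWS tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Why:
    finding: str
    action: str | None = None              # preventive action suggested by this node, if any
    children: list[Why] = field(default_factory=list)


# The issue branches into several nearly independent contributing causes,
# each with its own preventive action, instead of a single chain.
five_whys = Why(
    "3% of API calls failed over 45 minutes",
    children=[
        Why("1 of 100 hosts was returning 500 errors",
            children=[Why("I/O errors on that host",
                          children=[Why("EBS disruption in a portion of one availability zone")])]),
        Why("the health check did not remove the failing host",
            action="fix the gap in the service template"),
        Why("the engineer was not engaged for the first 30 minutes",
            action="improve detection and paging time"),
        Why("a 1% capacity loss caused a 3% error rate",
            children=[Why("the load balancer sent extra traffic to the fast-failing host",
                          action="review the load-balancing algorithm")]),
    ],
)


def list_actions(node: Why) -> list[str]:
    """Walk every branch of the tree and collect the preventive actions."""
    actions = [node.action] if node.action else []
    for child in node.children:
        actions.extend(list_actions(child))
    return actions


print(list_actions(five_whys))
```

Walking the tree yields several independent preventive actions, which is exactly the breadth a single chain would have hidden.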
Root Cause Analysis and Action Items: From Customer Perspective to Implementation
When we are done with the internal part, we start looking at the customer-facing root cause analysis. One thing we try to do and recommend everyone does is write a root cause analysis from your customers' point of view, not from your internal components. We try to open a root cause analysis by explaining the impact to customers. When we need to go deeper into the sequence of events, we focus on the functions that customers are familiar with, not on our internal services that provide those functions.
This is overall about removing unnecessary complexity. Spending a page describing an internal service when the only relevant thing you have to say is that 20% of units in that service were misbehaving and were not detected will just divert attention and not be productive. While writing the root cause analysis, there is a really important balance to strike between quality and speed. Anthony mentioned that in event communications, speed is our primary metric. We want to send out a notification that something is wrong as soon as we can, and through iteration, we explain what that something is. We share details on the service, the region, and the type of failure. Root cause analyses are documents that are delivered once, so we do not get that opportunity to iterate. However, we are aware that our customers are waiting for it. They want to understand what happened, but more importantly, they want to read what we are doing about it and when we are planning to do those things.
One thing that might not be immediately visible is that root cause analyses are not only a way to explain a technical failure, but also a way to regain the trust of your customers. If you are writing one, it is because you failed, and customers are going to inspect you. They are going to inspect your response. They are going to inspect your long-term plans, and they are going to use that to decide whether to keep using your service or not. Seeing root cause analyses as an opportunity to earn customer trust or regain their confidence is quite key to delivering high-quality documents.
After this part is done and we have made a high-level promise to customers, we start looking into the detail of our action items. We categorize action items based on how long it takes to implement them and how stable they are. The first type, short-term action items, happens in hours. We work 24/7 until they are implemented, and they are focused on preventing immediate recurrence. Short-term action items are effective but might not be particularly sustainable. You can imagine this as restarting a software component once a day to clear the buffers and caches, or over-engaging engineers every time a metric starts spiking in the wrong direction. It is fine to keep them for a few days, but not fine to keep them forever.
The second category is the mid-term action items. This is when you start building a self-sustaining solution that allows your teams to go back to the priorities they were working on before. They are generally implemented in days to weeks, and more rarely months. These tend to be the promises that we put in the customer-facing root cause analysis. Then we have a third type, which is really long-term action items. This is when we think about systemic change or when we start reinventing. This type of action might be building a completely new service that solves a problem better than the previous version or solves a problem that no service is solving.
This is a real quote that shows how, while building these action items, you should not just focus on incremental improvement. Once again, it is a matter of trade-offs and balances. There are four forces that are really relevant when looking at long-term plans and preventive actions. The first one is the trade-off between incremental change and innovation. Incremental change is something that is rolled out slowly, is always backward compatible, and does not require any action from the customers, from the users of your APIs, or from downstream dependencies. It is easy to implement but slow. Innovation is building new solutions. New solutions are characterized by two things. The first one is that they are new, so it is going to be a new service that customers are going to have to move to.
This does not really excuse you from fixing the previous version, because there is going to be a ramp down from the old one and a ramp up to the new one, and that is going to take quite a significant amount of time. And then they might be completely incompatible. It might be a new service that behaves in a completely different way and doesn't have the full functionality of the previous one.
The second issue is related to timing. After an event, it's really common to start focusing on what seem to be perfect solutions. So you're going to completely rebuild a component, or you're going to deprecate an API, or stop doing something. The problem is that it's really key to have realistic completion dates. When internal and external customers have been impacted by the failure of a dependency, they are going to expect some significant actions and some significant prevention happening in weeks to months at most.
Here is where you make a decision between promising a perfect solution, maybe two years out, that doesn't really solve the immediate problem, or promising something in the middle that you can deliver quicker. Once you have made that decision, you can move on to delivery. At this stage you have your action items defined and promised to the customers, and you have internal resourcing and planning to carry them out.
All of this phase happens in the days right after the event. The RCA needs to be out in seven days at most, and by that point you are making a promise. Then you go into an implementation phase that might last three to four weeks. As you do that, you might find better opportunities. So you might decide that the plan you promised to your customers is not really the best one. We are generally fine with changing those plans, as long as we don't change the goal and the promise.
If you have a different way to implement a similar solution that prevents the same failure condition, we will do that. While doing this, it's extremely important to stick to the estimated completion dates that you have promised to your customers. They are really not set in stone. It's fine to push them out by one or two days if you need them to validate a solution, but it's important to not push them out by months. I was trying to come up with a number before this talk to say being twenty percent slower is fine. Being two thousand percent slower is maybe not. I didn't find this number, but I have the feeling it's going to be quite low.
Finally, this is drilled into our mindset, but preventive action is not completed the moment you implement it. It's completed the moment you test it and validate that the same underlying conditions do not lead to the same failure. This is key in communication, but also in terms of thinking. The job doesn't stop at implementation. It stops after you have confirmed that what's implemented was right and works at scale.
Learning and Scaling: From Team Education to Centralized Change
Let's look at phase three. Phase three is learning and scaling. The most important COEs get to phase three. There's a quote here, and it's worth acknowledging that, as privileged male technical folks, Malcolm X was probably not thinking of us when writing these words, but the words are inspirational nonetheless. It's super important to take every opportunity to learn a lesson and improve your performance the next time. It's the mental model that we have here at AWS, it's part of our operational excellence, and it's critical in the COE process.
Let's look at learning. There are a couple of different phases of learning. We review all COEs within a team. Every service or author that writes a COE is going to review it with their individual team. Additionally, they'll review it with their wider organization. We also have the AWS operational metrics meetings, which occur on Wednesdays. Everyone across all of AWS is invited and involved: product managers, engineers, leaders, everyone. It's super important that they're all engaged as well, and this is part of our culture that allows us to scale these lessons across not only one organization or one team but all of AWS.
It's a critical mechanism to ensure that these lessons are learned once and not multiple times. The other key here is looking for patterns and ensuring that you're leveraging solutions across the organization. A safe space is super important, and there are some ground rules there. Open discussion is absolutely welcome, but everyone should feel comfortable speaking up. Everyone should feel comfortable challenging action items, asking whether they're the right things, and challenging the five whys and how we got to the solutions that we did.
You'd be surprised, but for a meeting with thousands of folks, we get an incredible amount of value out of it. Let's look at the scaling aspect. Scaling goes through three approaches, in order of increasing effectiveness. The first one is team education. Education comes with the best intentions, but it's not necessarily mechanistic, and lessons don't necessarily scale to other teams and other organizations. Then you have distributed change, where everyone has to go do something. It can be expensive, but ultimately it can scale. And then finally we have centralized change, where you make one change centrally and everyone gets it by default. They just consume it. These are changes to the SDKs, default behaviors, and so on.
One example of team education is not only reviewing the COEs within your team and within your organization but also tenets. Tenets are core foundational principles that we use to help guide us. Here's an example of some of our operational tenets. The first one is that failures must not impact multiple regions. It's certainly a lesson that everyone at AWS is very well aware of. Additionally, there's detecting failures before your customers do. This is an obvious one: in that operational metrics meeting, if we determine that customers were the catalyst for identifying an issue, it's an easy COE to write in the sense that it was a failure and we have to figure out why. There are improvements to monitoring, metrics, and alarming that we need to make.
For an example of distributed change, we have Trusted Advisor. Trusted Advisor is a service that we built to scale our learnings directly with customers. It monitors your infrastructure and categorizes recommendations based on severity. On the left here you can see an example of four critical recommendations. They're not only categorized by priority, but they're also related to performance, cost optimization, and so on. The key here is that we have identified lessons and we want to scale those lessons to customers. That's the distributed change model here, but ultimately customers have to prioritize them. So it's not necessarily the most effective, but it's certainly well worth doing. That's a non-production account. I was looking for examples of bad Trusted Advisor pages and so I dug into an account that I rarely use.
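For teams that want to fold those recommendations into their own prioritization, Trusted Advisor results can also be pulled programmatically through the AWS Support API. This is a hedged example, not the only approach, and the filtering logic is just one possibility.

```python
import boto3

# The Support API is only available in us-east-1 and requires a Business,
# Enterprise On-Ramp, or Enterprise Support plan.
support = boto3.client("support", region_name="us-east-1")

checks = support.describe_trusted_advisor_checks(language="en")["checks"]

for check in checks:
    result = support.describe_trusted_advisor_check_result(
        checkId=check["id"], language="en"
    )["result"]
    # Surface only the checks flagged as at risk ("error" status) so they can be
    # fed into a team's own backlog or ticketing system for prioritization.
    if result["status"] == "error":
        print(f'{check["category"]}: {check["name"]} -> {result["status"]}')
```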
The third type is centralized change and invention, which we have mentioned multiple times. Most of the services you use today in AWS come from our experience of a world without them; they are solutions we built to actual problems we were facing. A really common example is DNS. If you were managing DNS fifteen years ago (fun fact, Route 53 is turning fifteen in about three days), you knew how hard it was, not only building a scalable and resilient data plane, but also managing access at an organizational scale. Building the right policies so that only the right individuals could modify a certain DNS zone or certain records was just not easy. That's why we built Amazon Route 53. Similarly, load balancers used to be physical. They used to be physical appliances that you were racking, managing, and updating, and it was quite painful overall. That's one of the reasons why we built Elastic Load Balancing. We built it for ourselves in the first place, to solve a problem we were having. We wanted a really scalable virtual load balancer that could be provisioned on demand. We just wanted to stop wheeling in servers every time we needed to launch a new service, and that's how it came to be.
Similarly, if you have ever had to manage certificate rotation in a large organization, you know that doing it through spreadsheets and meetings is not the best possible idea. AWS Certificate Manager today makes it so simple that you basically forget about certificates, and you can also afford to rotate them far more often than we used to. There is a case study I want to show where all of those things come together. You have probably heard of Identity and Access Management, AWS IAM.
Identity and Access Management is a globally consistent service that you use in the build phase, or while deploying a service, to define permissions and access policies and to define what users and roles can do. Next to IAM there is the Security Token Service, also known as AWS STS. This service is regional instead, it's stateless, and it is used at runtime to generate the temporary credentials that are then used to connect to and make requests against AWS services. Now, this service is sort of on the critical path. It's not in the middle of every request, but temporary credentials are really short-lived, so you need STS to be extremely resilient.
As with many things, STS was at the beginning available only in a single region. So when we started launching multiple regions across the globe, for a few years you would have had to make a request to the global STS endpoint to get credentials and then be allowed to use services in whichever regions you were operating in. By 2015, we figured out this was a problem and launched regional STS endpoints. This allowed customers to isolate their workloads and make sure that a failure in any region did not spread to others. This was when we changed the best practice.
So this is when we started with team education. We immediately documented the fact that using the regional endpoints was the better and more resilient option, but left it pretty much at that. By 2022, we figured out that traffic was still significant on the legacy endpoint and that there was quite a significant availability risk for our customers. So we changed the defaults and started active campaigns for customers to move off it. This is when you get a Personal Health Dashboard notification asking you to stop using a certain thing or implement a change by a certain date.
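On the customer side, the change those campaigns asked for usually amounts to making the SDK resolve STS regionally. Below is a hedged sketch in Python; both the AWS_STS_REGIONAL_ENDPOINTS environment variable and an explicitly pinned region are standard options, though exact defaults vary by SDK version.

```python
import os

import boto3

# Option 1: ask the SDK to prefer regional STS endpoints. The shared config file
# equivalent is "sts_regional_endpoints = regional"; newer SDK versions already
# default to regional resolution.
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

# Option 2: be explicit and create the client against a specific region, so the
# temporary credentials come from sts.<region>.amazonaws.com rather than the
# global endpoint.
sts = boto3.client("sts", region_name="eu-west-1")

identity = sts.get_caller_identity()
print(identity["Account"], identity["Arn"])
```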
Unfortunately, by 2024, traffic on the global endpoint was still really substantial. This is when we figured out that giving the right guidance, or this distributed kind of change, was not effective, and we had to take a measure that would allow us to solve this quickly, for everyone, and without customer action. That's why in April this year we implemented a change whereby, and the naming here gets really confusing, the global endpoint is now served regionally: for regions that are enabled by default, requests to the global endpoint are served by the region itself. Customers don't change anything and use the same endpoint as before, but the request, instead of going to the global STS endpoint, stays in the same region, which removes the cross-region dependency.
To give you an idea of how this worked, here we have a graph that goes from early February to mid-April. It's normalized, with 100 percent on the Y-axis as the starting point, representing all of the affected accounts. This is how traffic on the global endpoint changes as we roll out the change. You can see a deployment that starts slowly, with tests to make sure we are deploying the right thing. Then towards the end you see the deployment speeding up and reaching all of our regions, and cross-region traffic drops to nearly zero.
Key Takeaways: Engage Early, Mitigate First, and Learn Once
So there are a couple of takeaways that we want you to remember when leaving this talk. First and foremost, engage early and engage often. Ensure that leaders are engaged. Everyone wants to be involved, and they should welcome engagement. Make sure that you have the right folks at the right time. The second one is fast communication. During the event, make sure you send out a notification that tells your customers that something is potentially going wrong, and think about iterating on depth later. Mitigate first and root cause later. We are all engineers and we want to deep dive and figure out what the root cause is, but during an incident is not the time to do that. Mitigate customer impact first; root cause has its time and its place afterwards.
Finally, incidents are really powerful lessons. The problem is that they are extremely expensive and come with customer impact. You want to make sure that you learn once as an organization, so don't limit the learnings to the team that was affected or the team that caused the incident; make sure that whatever they learn spreads across all of the engineers in the organization. And then ensure that when you reflect, you reflect with empathy. Create a safe space where folks are welcome to ask questions and feel confident doing so.
And finally, whenever you need a really simple set of rules that are easy to remember, that are easy to teach to new hires, and that can be used as a tiebreaker when there are hard decisions to take, just define tenets. You have seen ours. Something similar can be done, and tenets are supposed to be really short sentences that withstand the test of time and can help drive decisions. You will find them extremely useful when two teams that maybe are not closely working together need to collaborate to launch a service. That set of five or ten tenets is going to help them build a shared roadmap and shared outcomes that are relevant for the company.
Well, Giorgio and I both want to thank you all for coming. Feel free to take a picture of this slide. We will be available outside the room after the presentation here. We are obviously passionate about this topic and so we are happy to discuss. Please complete the session survey in the mobile app. We love your feedback and we want to improve next time.
This article is entirely auto-generated using Amazon Bedrock.