🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Behind the scenes: How AWS drives operational excellence & reliability (COP415)
In this video, Mustafa Tun and Umaya Majaganathan from AWS discuss operational excellence mechanisms used at Amazon, emphasizing that excellence differs from perfection by accepting calculated risks while learning from mistakes. They explain AWS's approach through a flywheel model with four drivers: readiness (Operational Readiness Review, testing, game days), observability (instrumentation with Embedded Metric Format, CloudWatch handling 5TB logs/second and 20 quadrillion metrics monthly), incident response (SOPs, escalation culture, AI-assisted CloudWatch Investigations with 90% satisfaction), and reviews (weekly dashboard reviews, Correction of Error process with 5 Whys analysis). Key architectural patterns include dependency isolation and cellular architecture for fault tolerance. They demonstrate CloudWatch Application Signals for SLO-based monitoring, MCP servers for IDE integration, and automated incident reporting, showing how these mechanisms create a virtuous cycle that continuously improves operational excellence at AWS.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Coffee Shop Analogy: Why Excellence Matters More Than Perfection
Hey folks, thanks for being here. Hey Umaya, what's up? Hey Mustafa, I'm doing great, but I could use another coffee, not enough. You want another cup of coffee? Say you pull up your phone now, Umaya, and you find a coffee shop nearby. You walk up to this store only to see that they are closed for maintenance. That would suck, right? Yeah, yeah, sad smiley right there, yeah.
So are you a coffee geek like myself? Yes, yeah, they sell coffee in Atlanta too. There is, yeah, yeah. All right. I mean, in Seattle it's the law. You have to be a coffee geek. Do you follow your local coffee shop on social media? Yeah, of course I do. Yeah, I only drink like five coffees before 10:00 a.m. Yeah, I didn't ask if you have an addiction or something. I asked if you had a coffee. I might have one. I see.
So say you get an update like this on social media from your local coffee shop. They say they have new features. They have new syrups to choose from, new baked goods. So you go up there and check it out, and it's open, which is a great sign. But then you see the coffee machine is broken. So how would you feel? Yeah, almost crying, maybe actually crying. I see. You have coffee addiction, that's for sure. So I think these coffee shops could use some operational excellence and reliability, and that's the topic of our conversation today.
I'm Mustafa Tun. I'm a senior principal engineer in AWS. I've been with the company for 12.5 years now, and most of that time was with CloudWatch or observability and the monitoring team. And I have Umaya joining me on stage. Yeah, folks, my name is Umaya Majaganathan. I lead the worldwide specialist solution architecture team at AWS and my team focuses on cloud operations. I've been at AWS for close to eight years now, working almost all the time on CloudWatch and observability services.
So I want to start with the word itself: excellence. It's not perfection, it's excellence, and we use this word for a reason. The difference between the two is best explained by two Amazon leadership principles. We have more than two, but these two apply here: bias for action and insist on the highest standards. One of them asks for speed and says take calculated risks. The other one says it might be uncomfortable sometimes, but you should push for the highest quality. And there were times I was pushing my team toward that end, where we were very careful but we were not shipping as fast, right?
In the center of this balance sits excellence. It's not perfection. Perfection would instill fear, perfection would slow us down, and it wouldn't accept mistakes. With operational excellence, we understand that we will make mistakes at times because we have to move forward, but the key is that we learn from our mistakes.
Building Reliability Through Infrastructure and Architectural Patterns
Let's get back to our coffee shop and see what would happen if AWS were operating it. We would add redundancy at the infrastructure level: many more coffee machines in the store, just like how we run our systems on multiple servers and create multiple copies of the data you give us. And we wouldn't run just one coffee shop. We would run several, just like how we put multiple data centers within a zone. We would put them in different parts of the city, just like Availability Zones, and in different parts of the world, just like Regions. Some services are zonal, some services are regional, and all of these create fault isolation boundaries.
And there's a great white paper on our website if you want to read more about our infrastructure and fault isolation. And this QR code is a companion website to this talk. So there are links to external resources we are going to refer to in the remainder of the talk. So let's talk about some architectural choices we make for reliability and operational excellence. So say we have an AWS service that has three dependencies and say one of them is having a bad day.
We design our API services such that only the APIs that depend on that dependency are impacted; the rest should keep functioning. This is done through dependency isolation. The simplest thing you can do is create a separate thread pool for each dependency; there are of course more advanced methods. The second pattern I'm going to talk about is cellular architecture. When a big ship hits something and its hull is breached, the ship doesn't sink, because the hull is divided into separate cells.
It's a similar idea: we create multiple copies of our stacks, place them next to each other in our data plane, and route traffic to them with a thin routing layer. This cellular architecture allows us to reduce our blast radius. If one of those cells is impacted, the rest can continue to function.
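As an illustration only, and not AWS's internal implementation, here is a minimal Python sketch of the thread-pool form of dependency isolation: each downstream dependency gets its own bounded executor, so a slow or failing dependency can only exhaust its own pool. The dependency names and timeout values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# One bounded thread pool per dependency (a "bulkhead"): a misbehaving
# dependency can only exhaust its own pool, not the whole service.
POOLS = {
    "payments": ThreadPoolExecutor(max_workers=8),
    "inventory": ThreadPoolExecutor(max_workers=8),
    "recommendations": ThreadPoolExecutor(max_workers=4),
}

def call_dependency(name, fn, *args, timeout_seconds=0.5):
    """Run a dependency call on that dependency's dedicated pool."""
    future = POOLS[name].submit(fn, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        future.cancel()
        # Only the APIs that need this dependency fail; others are unaffected.
        raise RuntimeError(f"{name} dependency timed out")

# Example: an API that needs only 'inventory' keeps working even if every
# thread in the 'recommendations' pool is stuck.
def get_menu():
    return call_dependency("inventory", lambda: ["espresso", "latte"])

if __name__ == "__main__":
    print(get_menu())
```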
There are articles up there on our website in the Amazon Builders Library. Again, the link is in the companion website, and you can read through the architectural choices we make and apply them to your own systems. And again, the Well-Architected Framework is also a great resource, and it discusses all these topics, the infrastructure, the architecture, and operational excellence.
Mechanisms Over Good Intentions: The Operational Excellence Flywheel
All right, operational excellence. It's a feature for us. We add features, we add new services to AWS, but we invest equally in operational excellence. Operational excellence for AWS is not an afterthought. It's not a burden. We invest in it. We are intentional about it. We know you choose AWS for many reasons, and one big reason is that we pay attention to our operational excellence, and we are serious about it.
I'm going to share some learnings. I've been in the front seat of our operational excellence journey over this decade. I've seen us try new things; some of them worked, some of them didn't. One common thread you will see is that we don't rely on good intentions. We rely on mechanisms. I'm going to quote Jeff Bezos. He said this in an all-hands meeting in 2008, before I joined. He said that often, when we find a recurring problem, something that happens over and over again, we pull the team together and ask them to try harder, do better. Essentially we ask for good intentions. And he said this really doesn't work, because people already have good intentions. So if good intentions don't work, what works? Mechanisms. Nobody wakes up in the morning and says, I'm going to make operations worse, and during an incident nobody is trying to make things worse. Everybody is trying their best.
So let's talk about what a mechanism is. It's a system: you have an input and an output, and it starts with a tool. It's not a manifesto you put up, it's not an email you send and expect people to read. You create a tool, and if that tool is good, it will drive adoption. People will start using it, and the more people use it, the more feedback they will give you. That feedback will allow you to inspect, and that inspection will allow you to improve the tool. As you can see, this is a virtuous cycle. Systems like this are also called flywheels. If you are a gym person and you use the stationary bike, that big wheel is a flywheel. It comes from mechanics; it keeps the momentum going and stores the energy. I wouldn't know. I don't go to gyms. I prefer donating to gyms. In January, I have a gym in mind that I'm going to donate to.
The first virtuous cycle, or flywheel, at Amazon was introduced by Bezos himself and his team at the time. They were targeting growth, and customer experience was the driver they focused on. If you improve customer experience, it drives traffic; traffic brings more sellers; more sellers bring more selection; and more selection makes customers happier. And as we grow with these drivers, we get a lower cost structure, which lowers our prices, and those lower prices feed back into customer experience itself.
So I thought about which drivers in AWS drive our operational excellence, because it has gotten a lot better over the past decade. This is my take. I think it starts with observability. We make our services observable so that we get more insights from them and can respond to incidents better, so incident response benefits from it. The more incidents we have, the more we review and learn, so all that data feeds into our review process. We of course don't wait for incidents to happen; we also continuously review our systems, and observability helps with that too. The more we review, the more action items we take, and that feeds into our readiness process. We become more ready, and many of those action items are, hey, let's close this blind spot, which feeds back into observability. And as operational excellence keeps growing with these drivers, we become better reviewers ourselves, so it feeds into our review process. We ask better questions as we get better at operational excellence. And in my opinion, operational excellence feeds into Bezos's wheel: customer experience benefits from operational excellence.
Operational Readiness Review and Release Excellence: Testing Before Production
For the remainder of this talk, we are going to stay with this wheel; you may get tired of it. There will be four sections, and we will dive deep into each one of these drivers, starting with readiness. I have five topics to discuss in this section, and the first one is literally called the Operational Readiness Review.
This is a mechanism we ask service owners and operators to fill out. It's a form, a checklist, and it asks questions like this one. These are verbatim questions from a new-feature Operational Readiness Review template. It says: does this feature have SLOs defined on customer experience metrics, and does it have alarms? It's a short question, but the critical parts are that it says create Service Level Objectives, so have goal-oriented operations, and that it focuses on customer experience metrics. You could be alarming on many things, so this question is a great one.
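As a rough sketch of what that first question is asking for, here is what an alarm on a customer-experience metric might look like with boto3. This is not the ORR tooling itself; the namespace, metric name, and thresholds are hypothetical and just illustrate alarming on a customer-facing latency objective.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical customer-experience metric: p99 latency of the order API.
# The 300 ms target is an illustrative number, not an AWS value.
cloudwatch.put_metric_alarm(
    AlarmName="order-api-p99-latency-slo",
    Namespace="CoffeeShop/OrderService",
    MetricName="Latency",
    Dimensions=[{"Name": "Operation", "Value": "PlaceOrder"}],
    ExtendedStatistic="p99",          # percentile statistic for tail latency
    Period=60,                        # evaluate one-minute windows
    EvaluationPeriods=5,              # must breach 5 consecutive periods
    Threshold=300.0,                  # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmDescription="Customer-facing objective: PlaceOrder p99 latency under 300 ms",
)
```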
The second one is testing. Does this feature have testing to discover and address any unexpected performance issues? It's not just asking for discovering issues, it's asking for addressing them, so it's asking the team for a mechanism. These checklists are filled out and reviewed by bar raisers, and then they have to be approved by a director or above before the service can go live. If the bar raiser or the director is not happy, the team has to address their concerns before they can even take the service online.
Speaking of testing, we test our software extensively, of course. I'm not going to go into every detail of how we test, but the thing that applies to our conversation today is that we test for failure. We continuously create failure scenarios in our deployment pipelines and test against them. We also do load tests: on every check-in, our pipelines run automated load tests that stress the system in pre-production just to see whether we introduced a regression. One thing that is often overlooked is that we also test the load from a single user, to see whether noisy neighbors cause a regression.
We also do some manual testing days; we call them game days. We literally break our systems in these tests: we create network outages and dependency failures. We simulate these so that we can test two things: how our systems behave and, more importantly, how our operators behave. Were they paged on time? Were they able to respond on time? Did they find the correct runbooks? Were they able to find the right dashboards? How long did it take them to diagnose and mitigate the issue? We learn from these simulations and improve our systems, processes, and mechanisms accordingly. We have the Fault Injection Service in AWS, which we use ourselves on these days, and you can also use it to run game days in your own organization.
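For reference, a game-day fault injection can be kicked off programmatically with the Fault Injection Service. A minimal boto3 sketch is below; it assumes you have already created an experiment template (the template ID here is a placeholder) that defines the faults to inject, such as network disruption on a set of instances.

```python
import uuid
import boto3

fis = boto3.client("fis", region_name="us-east-1")

# Placeholder template ID; the template itself (targets, fault actions, and
# stop conditions) is created ahead of the game day in FIS.
TEMPLATE_ID = "EXT0000000000000000"

response = fis.start_experiment(
    clientToken=str(uuid.uuid4()),   # idempotency token
    experimentTemplateId=TEMPLATE_ID,
    tags={"Purpose": "game-day"},
)
print("Started experiment:", response["experiment"]["id"])
```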
We don't want humans to touch our systems; we believe in automation and continuous deployment. But if a human has to touch a system at times, we write every step down in detail. We call these change management documents, CMs. The CMs are also bar raised, and they have to come from a bar-raised template. What bar raisers pay attention to when reviewing a CM is whether the team put rollback steps next to each step. If something goes wrong, we want the team to be able to roll back with ease, and we also check whether they tested the change in pre-production, so we don't leave anything to surprise.
We have a team called Release Excellence. All this automation is great, but how do we know those pipelines are doing their job? We have systems in place that analyze all our deployment pipelines and check them against a set of rules. For example, a pipeline might not be adhering to seven of those rules, so it gets blocked. It might be CI/CD, but it's not able to deploy until the team addresses those expectations. What are those expectations? Let's look at an example pipeline before we even hit production.
We usually have three stages in our AWS services before production, and we treat the last stage before production, gamma, like it's a production environment. We run all sorts of tests in that stage, including load tests. If this is not in place, if the team missed bake time, for example, that would be a miss and the pipeline would be blocked. Clare Liguori, another Senior Principal Engineer at our company, wrote a great article, Automating Safe, Hands-Off Deployments, which is also in the Builders' Library. Let's move on to the second driver of our wheel: observability.
Making Systems Observable: Standardized Instrumentation and CloudWatch at Scale
So what's observability? It's a measure. It comes from control theory. You can measure a system and say the system is less observable or more observable, and the more observable it is, the more insights you can get out of it.
The more observable it is, the more questions you can ask that you didn't even know you should be asking. To make our systems observable, we basically do two things: we instrument our systems and we standardize them. Of course, we also use great tools to accomplish observability.
So we run our services, but as I said, we invest heavily in operations, and we collect the three pillars of observability: metrics, logs, and spans for traces. One thing I think Amazon did right, and I want to emphasize this, is that we standardized this. Say you are an AWS developer and you are building a new API using Java. We have a library for building API services. You import that library, import a class called Activity, and extend Activity for your API. You implement one function, and that function is executed when a request for this API hits this particular server.
Even if you don't add anything and deploy literally this piece of code, you will get a blob like this with every request. This is a simplified example for demonstration purposes, but we measure many things: infrastructure-level and application-level measurements like duration, CPU, memory used, and whether the call was successful, along with the request ID and so on. This is called the Embedded Metric Format, and it is available to you; it's a public format. If you write this to CloudWatch Logs, CloudWatch Logs will process the directive and emit metrics out of it. The directive here is saying: take the duration field and emit a metric from it.
So what a team gets by just deploying that code which does nothing is all these measurements, and again, it's a simplified example. We collect a lot more data than this. Of course, the developer can get a metrics object and add their own business measurements. They can add counts, they can add time, they can have their own key-value pairs, and all of those will be captured in the same JSON blob as different fields. Since bean count and brew time here are measurements, they will also be added to the Embedded Metric Format directive so that we can get metrics out of it.
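To make the idea concrete, here is a hedged Python sketch of what such an Embedded Metric Format log line could look like, hand-built and written to stdout, where a log agent would pick it up. The namespace, dimension, and the BeanCount/BrewTime business measurements are illustrative names rather than AWS's internal schema; the _aws directive structure follows the public EMF specification.

```python
import json
import time

def emit_request_record(operation, duration_ms, bean_count, brew_time_ms, customer):
    """Print one EMF-formatted log line for a single request."""
    record = {
        # The EMF directive: tells CloudWatch Logs which fields become metrics.
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "CoffeeShop",             # illustrative namespace
                    "Dimensions": [["Operation"]],
                    "Metrics": [
                        {"Name": "Duration", "Unit": "Milliseconds"},
                        {"Name": "BeanCount", "Unit": "Count"},
                        {"Name": "BrewTime", "Unit": "Milliseconds"},
                    ],
                }
            ],
        },
        # Fields referenced by the directive become metrics...
        "Operation": operation,
        "Duration": duration_ms,
        "BeanCount": bean_count,
        "BrewTime": brew_time_ms,
        # ...while everything else stays as queryable log fields.
        "CustomerName": customer,
    }
    print(json.dumps(record))

emit_request_record("BrewCoffee", 142, 18, 95, "alice")
```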
The team, after deploying this software, will be able to add new widgets for these metrics. So it's super easy to emit your metrics: you don't have to think about the transport or the formatting, because it's all implemented in the library you are using to build your application. So, observability. Metrics are fine, but what if you want to dig deeper and do some analytics? Since in this example I'm emitting the customer name to CloudWatch Logs, I can run a CloudWatch Logs query like this one, which simply finds the average brew time and groups it by customer.
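A rough sketch of that kind of query, run through the CloudWatch Logs Insights API with boto3, is below. The log group name and field names (BrewTime, CustomerName) are assumptions matching the hypothetical EMF record above, not the exact query from the talk.

```python
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Hypothetical log group receiving the EMF records shown earlier.
query = logs.start_query(
    logGroupName="/coffee-shop/order-service",
    startTime=int(time.time()) - 3600,   # last hour
    endTime=int(time.time()),
    queryString="stats avg(BrewTime) as avgBrewTime by CustomerName",
)

# Logs Insights queries run asynchronously; poll until the query finishes.
while True:
    result = logs.get_query_results(queryId=query["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```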
Since everything starts as a log, I can do cool things like this. Every request is a blob of log, some of it is aggregated into metrics, and I can run whatever analytics I want. The Embedded Metric Format, again, is externally available; you can read about it on our companion website. When it comes to tooling, we use CloudWatch. Amazon and AWS prefer CloudWatch. One reason is that we built it, so we dogfood it: we use the product ourselves so that we can improve it for our customers.
The second reason is that it can take our scale. Amazon and AWS generate lots of data, and this is probably an outdated slide by now. CloudWatch Logs is accepting 13 exabytes of logs per month; that's about 5 terabytes of logs per second. And CloudWatch Metrics is accepting 20 quadrillion metric data points every month, a quadrillion being a million billions. In the time since I brought this slide up, we have probably accepted another 100 terabytes of logs.
Meeting Customer Expectations: Open Standards and SLO-Based Monitoring
Of course, we have other priorities, and you have told us what matters to you when you choose an observability tool. Umaya will walk us through those. Go ahead. All right, so my team and I work with hundreds of customers every year, engaging in very deep conversations about how customers choose an observability solution. You have made it very clear for us; we see these kinds of patterns emerging as to what is important to you when you decide which observability solution to use.
I'm going to focus on demonstrating how CloudWatch as a solution meets these expectations, based on the learnings we have internally and that we have discussed so far. The first one is open standards. Mustafa talked about the instrumentation part and how the standard instrumentation library available within Amazon means developers really don't have to worry, because all the essential signals are automatically captured. You have told us that you want to use open standards, and there's no question about it: everybody would like to be open-standards compliant, and that is beneficial for obvious reasons. OpenTelemetry is the way to go, which is why we have the AWS Distro for OpenTelemetry SDK, and we also have our auto-instrumentation agent, which automatically captures the industry-standard RED metrics: request rate, errors, and duration. In addition, we also capture faults. CloudWatch also has OpenTelemetry (OTLP) endpoints for logs and traces to which you can send data from anywhere, whether it's AWS, on premises, or a hybrid environment. It doesn't really matter: as long as you use AWS Signature Version 4 for authentication, the OTLP endpoint will accept signals from you, and you're using OpenTelemetry at that point.
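For orientation, here is a generic OpenTelemetry tracing setup in Python with an OTLP/HTTP exporter. This is a minimal sketch, not the ADOT distribution itself: the endpoint is a placeholder, and in practice SigV4 signing and forwarding to CloudWatch are typically handled by a collector (for example an ADOT collector) rather than the bare exporter shown here.

```python
# Requires: opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "order-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(
        # Placeholder endpoint: point this at your local collector, which
        # handles authentication and export to your backend of choice.
        OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("coffee-shop")

with tracer.start_as_current_span("BrewCoffee") as span:
    span.set_attribute("coffee.size", "grande")  # illustrative attribute
```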
The second one is: yes, I have all these features and capabilities, but if it's hard for you to actually set things up and collect signals, it's going to be kind of pointless. So if you're using Amazon EKS, you have the CloudWatch Observability add-on, which automatically deploys the CloudWatch agent, which is also OpenTelemetry compatible. It also injects the auto-instrumentation containers into the pods you want, based on a simple configuration. A similar thing is available for Amazon ECS as well: you just mount the auto-instrumentation container into the application container in the task definition, and you're good to go. We have OpenTelemetry-based Lambda layers available for serverless workloads as well.
The third one is: yes, we have all this data and we set it up really quickly, but now we have a lot of data. How do we manage it? You don't want to be in the business of creating dashboards or doing manual correlation between signals. By signals I mean, of course, logs, metrics, and traces from applications, but you also don't want to be manually correlating infrastructure data from containers with application data like microservice interactions and database calls. You want all of that out of the box, and that message was received as well. You wanted a highly opinionated getting-started experience. Of course you want the ability to create your own dashboards, but you also wanted an opinionated way of doing things, and that's what we have with Application Signals. It's an APM solution within CloudWatch. It gives you application-centric observability, where you're not focusing on infrastructure or individual microservices but looking at things from an application point of view, which is how everyone within a company looks at it. With AI-driven root cause analysis, we make it even easier for you to find issues across all the different signals you're collecting from different environments.
You also want to reduce the data that is being collected. You don't want to get into fear-of-missing-out mode and collect everything. I see a lot of customers get into that, and then cost goes up and it takes a lot of time to actually find the root cause, because there's a lot of noise and the signal-to-noise ratio is out of balance. You want to be more deliberate, so what we recommend to customers is SLO-based monitoring. Application Signals, in fact, is centered around SLOs. So how do you set SLOs? You start with the business SLA: you start by asking what the business wants your application to do. For example, this request should respond within 200 milliseconds; that's the requirement you get. Then you build an SLO, a Service Level Objective, to track whether the application is actually doing that. You have an error budget, so there is wiggle room to play with if there are any challenges, and that dictates what kind of metrics you actually need to collect. When you do that, you have a direct correlation between the business requirements and the technical work, so observability is no longer only a technical concern; it is also a business function. With that, we have seen customers really reduce how much data they collect, and it also really improves Mean Time To Resolution.
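As a quick worked example of the error-budget idea, with made-up numbers: a 99.9% availability objective over a 30-day window leaves roughly 43 minutes of allowed failure. The small sketch below does that arithmetic and shows how much of the budget a given number of bad minutes consumes.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed failure for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes

def budget_consumed(bad_minutes: float, slo_target: float, window_days: int) -> float:
    """Fraction of the error budget already spent."""
    return bad_minutes / error_budget_minutes(slo_target, window_days)

budget = error_budget_minutes(0.999, 30)
print(f"99.9% over 30 days -> {budget:.1f} minutes of error budget")   # ~43.2
print(f"20 bad minutes consume {budget_consumed(20, 0.999, 30):.0%} of it")
```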
Troubleshooting with CloudWatch Application Signals: A Live Demonstration
So let's take a look at how you can use CloudWatch Application Signals and SLO-based monitoring to troubleshoot.
In this example, I am on the Application Signals services screen. I have a sample application deployed that has multiple microservices in it. The moment you go there, it shows a microservices view, listed based on SLO status, and in the overview I can find out whether any applications or services are having issues. In this case, there are some services that are unhealthy based on their SLOs. I can click on the SLOs, look at the list, and click on view all SLOs for this microservice.
Once I click on that, I'm able to see the SLO status: whether I'm meeting the SLO I've set, whether I have exceeded the error budget, and what kind of latency is causing that. What I'm actually interested in is one of these SLOs, the availability SLO. Once I click, the widget changes accordingly. What I want to do is find out what this SLO even means. In this case, the SLO is measuring a specific operation called visits, so let's go take a look at what this looks like in that particular service.
Let's go look at the SLO so we understand what we're talking about. What I did was set up Application Signals on this environment, which is deployed on EKS with multiple microservices in it, and it automatically collects the signal data. Like I mentioned before, all the industry-standard metrics like request rate, errors, duration, and so on are collected automatically. In this case, that is what I'm making use of, but it's not confined to that: you can select any CloudWatch metric, whether it comes from containers or EC2, it doesn't matter, and create an SLO based on it. You can also publish your own metrics from logs if you want, by extracting metrics out of log events.
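For the "extract metrics out of log events" part, here is a minimal boto3 sketch using a CloudWatch Logs metric filter; the log group, filter pattern, and metric names are hypothetical and just illustrate the mechanism.

```python
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Turn matching log events into a metric without changing the application.
logs.put_metric_filter(
    logGroupName="/coffee-shop/order-service",      # hypothetical log group
    filterName="brew-time-from-logs",
    # Matches JSON log events that carry a numeric brewTimeMs field.
    filterPattern='{ $.brewTimeMs = * }',
    metricTransformations=[
        {
            "metricName": "BrewTimeFromLogs",
            "metricNamespace": "CoffeeShop",
            "metricValue": "$.brewTimeMs",           # publish the field's value
            "unit": "Milliseconds",
        }
    ],
)
```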
Here, in this case, is just a description of what this SLO looks like: it's an availability SLO, it's measured every day, there's a goal, and if the error budget consumption goes beyond 50%, I want to be alerted, and so on. Right now the SLO is not healthy, so I can go and see what is going on in the operations and find out what is causing the issue.
When I click on that operation, it takes me to a screen showing all the metrics being collected: the request rate, errors, faults, and so on. Everything you're seeing here is out of the box; there's nothing I had to create. There are runtime metrics as well: for certain workloads like Java, .NET, and so on, we collect runtime environment metrics too, so if there is a problem with the garbage collector or something like that, it is also visible out of the box and you can go fix it.
Going further up, you can see the faults graph, and there is a peak I want to investigate: I see 284 errors in the last three hours or so. If I click on that, it shows me all the spans that were collected at that point. Yes, I can look at that particular trace, go through all the spans, and investigate if I want to, but I also talked about the ability to correlate infrastructure information with the application data. In this case, because the application is deployed on EKS, it can show the nodes as well; in other environments it would show, for example, EC2 instances. So I can see not only what is going on in that infrastructure, but also go into the pod-level details and find out what is happening there.
I can go to Container Insights from here. I don't have to go hunting for it; it's a direct deep link. I go there, look at Container Insights and all the container-specific information, and dig into the different deployments of that pod and the infrastructure details. If there were a problem here, I would easily spot a pod crashing or a crash loop. But what I really want to do is go back and look at the trace itself. If I expand and investigate that trace, it shows me the trace map: how the request was processed in my environment.
There is a client, a couple of services, and then a DynamoDB call. Everything you see on the right-hand side is OpenTelemetry-based data that was captured and enriched. If I go down, I can see the span timeline, which shows how the request flowed through and which services were involved. When I select the DynamoDB span, it shows that there was a DynamoDB throttling issue; this is an error that was captured in the trace itself as a trace event.
Going further down, I can see all the log events that were captured along with this trace. I can read through them if I want to do the analysis myself, or I can jump into CloudWatch Logs Insights and do my own analysis there; the log group and query are already filled in for me. When I click run query, it gives me all the information.
Now I can do the analysis by hand, or I can use pattern analysis. If there are, say, 600 or 6,000 log events, it's really useful to find out what kinds of patterns are emerging instead of going through every single event. If we go to patterns, it extracts and groups logs based on their patterns, and you can see that it extracted tokens and is showing me the token for the endpoint, which is that visits API. If I go further, and this is a little bit slow, I think, it also shows my container. This makes it really easy for me to troubleshoot the issue and find the root cause very quickly.
I talked about the application-centric piece. What we just did was start troubleshooting from the status of an SLO on a microservice, but I also talked about looking at things from an application point of view, and that is where the application map feature comes in. I'm on the application map screen, and all the different microservices I showed you are part of these few applications. I have five applications here, and I can see at a glance that one application is having an issue. I can go into that application, and once I go into it, it's supposed to load; I don't know why it isn't.
Okay, so I can see all the dependent microservices there, shown with a dependency graph. It also shows all the log events and even does a log audit to surface any obvious issues that are emerging, and I can do troubleshooting from here as well: I can find the application logs here, and that should give me all the details. But how is the application being grouped? Certain things are inferred, based on the topology graph and so on; that is an automatic understanding of what an application looks like. But that is also in your control: you can create an application yourself based on AWS tags or OpenTelemetry attributes to group what an application should look like, for example by environment or team name. So it is really easy for you to find out which application is breaking. You can also filter based on SLI status, server faults, and so on, which really declutters the interface and helps you find what the actual issue could be. Thank you.
Incident Response Culture: SOPs, Escalation, and AI-Assisted Operations
So let's talk about the third driver on our operational excellence flywheel, and I have five topics to cover, folks. When we are responding to an incident, we are very comfortable, and the DevOps model helps us a lot. If you think about it, DevOps is a flywheel itself: we learn from our operational experience as developers, and then we design, implement, and deploy our services accordingly.
When I respond to an event, I don't want any surprises, so we are very disciplined about writing standard operating procedures and runbooks. The thing is, this practice is increasingly fading, because, as Umaya just showed you, CloudWatch and other observability tools are getting so good that you start from a dashboard or an error and the product itself walks you through and helps you find the root cause. But there are still scenarios where we have to write steps out in detail, and ideally these SOPs go away because we are also able to automate them. So, and it might be a hot take, I think the need for SOPs should go down to zero at some point.
If you have to write an SOP, there are some things we pay attention to. For example, we are not vague when we write a step. We don't say go find this front-end service's logs and look for errors; what would an operator do with that in the middle of the night? We want to provide deep links, so operators can immediately reach whatever we want them to reach and figure things out from there.
The first question in any Standard Operating Procedure should be: can you roll back? Even if the issue doesn't look deployment related, operators should check whether there was a recent deployment rather than sitting there trying to fix around a bad deployment. If you have SOPs, or if you create one, you might be thinking: in the age of generative AI, why would I do this? Well, you can feed them into a knowledge base. And like I said, ideally your SOPs lead to automated runbooks.
We have escalation rules. Our ticketing system will escalate: if I don't respond to a page, the secondary will be paged, and if they don't respond, the manager on call will be paged. That's the system that makes sure we respond to our tickets on time, but I also want to speak about the culture. When I first went on call, years ago, my manager told me: it might sound counterintuitive, but please page me if you see customer impact. Please page the manager on call. Don't be scared; we are going to come and help you.
When there is an outage in AWS, usually many leaders are on a call somewhere. When it is very large, most of us are on a call. We are trying to help the teams. So we have this culture of escalate, let me know, I'm going to come and help you. We have an incident response team. This team will be paged if nobody is responding. They have a dashboard where they know everybody's KPIs, every service's KPIs, and if there is no engagement, they will create a ticket and they will engage you.
They also help us during large-scale events. They create aggregated tickets, they engage, and they help us find the right team, because they have the phone book of everyone. They have also made their service available externally, so you can buy it if you want to implement something like this in your own company. And of course, over the last two or three years, we have been incorporating AI into our operations. The way we approach it is that AI is helping us: we are still in the driver's seat, we as humans are solving the problems, but increasingly we are getting help from AI to drive our operations forward.
AI-Powered Troubleshooting: MCP Servers and CloudWatch Investigations
We have already externalized some of the things we do: we launched CloudWatch Investigations and the CloudWatch MCP servers, and I will walk us through a demo of those features. It's really important to empower your developers. I had some issues with the animation earlier, so I hope it doesn't happen again. We have a couple of MCP servers, the CloudWatch MCP server and the CloudWatch Application Signals MCP server, and it's really powerful for a developer to have all these tools so they can get insights into what is going on in an application without having to switch contexts, open the browser, log into the AWS console, and look at different things. What if we gave them all that power right in their IDEs?
Here I'm in Claude, and obviously you can use any IDE you have. I'm asking Claude to list all the SLOs in my AWS account, and it lists them. I'm going to do exactly the same troubleshooting workflow I did manually earlier, but this time using Claude and the MCP servers. It has listed all the SLOs; now I ask it to list the SLOs that have issues, and it reports that there are issues. Then I ask it for details about the specific SLO that is breaching, which is availability for scheduling visits.
It gives me an understanding of the likely causes and what the problem could be. Let me go to the next step, which I hoped would work, but apparently it doesn't like it. Anyway, what happened is that it asked my permission to make a CloudTrail API call through the MCP server, and then it came up with an understanding of what exactly happened. In this case it also came back and told me there was a DynamoDB throttling issue, and it is even making suggestions about what I could do to solve the problem, without me having to log into the AWS console or go through that process. You can use your own model; it really doesn't matter, because at the end of the day you're making API calls through the MCP server.
Next, what if you actually have access to the AWS console? How could you do this there? That's where CloudWatch Investigations comes into play. What I'm going to demonstrate is the ability to go and investigate.
If an SLO is unhealthy, you can click on it and click on investigate. That's not the only way to start an investigation: you can start one when you're looking at a metric that doesn't look right, or when you're looking at some log events, and ask CloudWatch Investigations to check whether there is a problem. You can even automate the whole thing: create a CloudWatch alarm and, as an alarm action, have it start an investigation as soon as the alarm goes off. So by the time the alarm fires and you go and look, it has potentially already found the problem.
So here I'm going to click on investigate. I gave it a name, but you didn't get to see it; I apologize. When you click on start investigation, it kicks off a series of tasks. In this case it is looking at CloudTrail logs, CloudWatch logs, and metrics, and you can even create an access entry and give it access to the EKS cluster so it can look at the resources and find out what's going on. It creates a topology map based on the trace data it finds, and you don't have to stay on this console at all; you can leave and come back later, because it takes a few minutes. I've sped this up for the demo.
You can also feed more information to it. Say you start an investigation, run a query yourself, and then say, hey, I have this data too, include it, or add another metric to it. So you are aiding the troubleshooting process as well. Once you do this, it goes and does all the analysis, and any second now it's going to tell me what the root cause is. It comes back with a hypothesis and tells me what the potential problem could be. It says the analysis is complete and the investigation has concluded.
Here it came up with this hypothesis, and I want you to pay attention to it, mainly to the what-happened section in the third item. The root cause it found is that the Claude V2 model was not available, and that was the problem. It found all that, but interestingly it even reported that the Application Signals fault metrics did not catch this problem, so it had to dig deeper to find it for me. At the very bottom it even says that this represents a monitoring gap I should focus on, so I know there is some work I have to go do. Without this, I would have to do all those steps I did earlier, like going through the SLOs. It's not that that is hard, but it is harder than this, because this does the work for me instead of me going and finding the root cause.
And you can connect CloudWatch Investigations to a ticketing system. We connect it to ours, so every time an operator in AWS receives a ticket, by the time they engage, CloudWatch Investigations is already running and posting updates to the ticket, and we collect feedback from these operators. The most recent number I read was about a 90% satisfaction rate, so teams in AWS are liking CloudWatch Investigations. And as I said, there are many services in AWS now whose stacks are entirely native AWS and use CloudWatch end to end.
Learning from Operations: Dashboard Reviews and Correction of Error Process
Let's talk about the fourth and last driver: reviews. We review our incidents and we review our operations, and I'm going to cover three topics here. First, the weekly dashboard reviews. We get together as small teams on Monday, go through our dashboards, and look for anomalies and spikes. This is an excuse to do a retrospective, to be honest: you start with the dashboards, but then you have an honest conversation with your team about your operational posture and whether you should take any action items for that week.
When we review these dashboards, our widgets usually have two lines: an alarm threshold and an investigation threshold. Say we are looking at a latency metric over time for an API service. If we saw a widget like this as a team, we would be asking, why is our investigation line that high? Can we pull it down so that we raise the operational bar? If we see a spike, of course we investigate, but sometimes you see things that are not breaching anything yet look suspicious, and we investigate those too. And of course we use machine learning and AI to help us come up with a report, so by the time we start the meeting, we already have a report telling us where to pay the most attention.
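As a sketch of those two threshold lines, here is what a dashboard widget with "alarm" and "investigation" horizontal annotations could look like when created with boto3; the namespace, metric, and threshold values are hypothetical.

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# One latency widget with two horizontal lines: the alarm threshold and a
# lower "investigation" threshold the team discusses in the weekly review.
widget = {
    "type": "metric",
    "width": 12,
    "height": 6,
    "properties": {
        "title": "PlaceOrder p99 latency (ms)",
        "region": "us-east-1",
        "period": 300,
        "metrics": [
            ["CoffeeShop/OrderService", "Latency",
             "Operation", "PlaceOrder", {"stat": "p99"}]
        ],
        "annotations": {
            "horizontal": [
                {"label": "Alarm threshold", "value": 500},
                {"label": "Investigation threshold", "value": 300},
            ]
        },
    },
}

cloudwatch.put_dashboard(
    DashboardName="weekly-operational-review",
    DashboardBody=json.dumps({"widgets": [widget]}),
)
```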
Another learning I have: don't just look at the last week if you are doing this weekly, because you will miss longer trends.
So use the seasonality-aware features of those machine learning tools, because otherwise you will miss things that are actually getting worse but happen to look fine that particular week.
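One concrete way to get that seasonality awareness in CloudWatch is an anomaly detection alarm, which learns the metric's expected band (including daily and weekly patterns) instead of relying on a fixed threshold. A minimal sketch, again with hypothetical metric names, is below.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when p99 latency rises above the learned expected band, which adapts
# to daily and weekly seasonality instead of using a single static threshold.
cloudwatch.put_metric_alarm(
    AlarmName="place-order-p99-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "CoffeeShop/OrderService",
                    "MetricName": "Latency",
                    "Dimensions": [{"Name": "Operation", "Value": "PlaceOrder"}],
                },
                "Period": 300,
                "Stat": "p99",
            },
            "ReturnData": True,
        },
        {
            "Id": "band",
            # Two standard deviations around the model's expected value.
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
            "Label": "Expected range",
            "ReturnData": True,
        },
    ],
    TreatMissingData="notBreaching",
)
```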
We review these dashboards again on Tuesdays as larger groups and on Wednesdays as a whole company. We have an operational excellence meeting in AWS where all the senior leaders join, and we go through certain things. One part of it is the weekly dashboard review, where we pick a team at random. It was a physical wheel when we first started this years ago, but of course that stopped scaling, so we made it a software wheel, and it's on GitHub. You can download it and create your own weekly dashboard review process.
The idea is not to stress anyone. It's stressful when you come up there and present your dashboard. I've been there multiple times. You have your cold sweats, of course. But the idea is to learn from each other, and as leaders, when we ask questions, everybody who's dialing in is also hearing, so they take those actions as well. It's a way to scale our operational excellence. Similarly, we review our tickets every week. So in the same meeting when we are reviewing our dashboards, we also have a discussion about the high severity tickets we received.
The point is that we are trying to find the recurring problems, the things that keep coming back. Remember the Bezos quote about recurring problems: if we don't look, how are we going to find them? The idea is that we don't want to be in a reactionary mode where we keep resolving tickets and moving on. We are trying to identify the root causes, and generative AI, as you can imagine, is great at summarizing tickets, so we are increasingly using it to start from something rather than from nothing.
The last thing I want to talk about is Correction of Error. We have discussed Correction of Error extensively, and there are lots of articles we have written about it, but for the completeness of our discussion, here we go. When we have a customer-impacting event, a large-scale event, or an event we think we can learn from and share with the broader company, we write a very detailed report called a Correction of Error, COE for short.
When we write a COE, we are very careful not to make it a blame tool. The person or team writing the COE should never feel like they are being blamed or punished, and it is very important, as leaders of your companies and organizations, to instill the culture that this is a learning tool. We don't assume bad intentions; we assume good intentions and look for the mechanism that failed. The correct output of a COE should be which mechanism failed and what we can put in place to fix it.
Here are the sections of a Correction of Error. It's a template, and it's a tool because it's a mechanism: you create a Correction of Error, it gives you a template to fill out, and then it is bar raised and reviewed, including in the Wednesday operational meeting. If it is broad enough or has great learnings, we pick it and review it there. We write a summary, include the relevant metrics and graphs, and describe the customer impact. We also share what went well; maybe the team did something great and we want to share that.
Incident response analysis is about how we reacted to the incident, and a favorite question of mine is this one: how long did it take you to detect the issue, and how would you have cut that time in half? If you keep asking this question and keep taking action items over time, you will naturally improve. Post-incident analysis covers things like issue diagnosis, with the same question: okay, you detected it, but did you spend an hour trying to figure out what went wrong? Again we ask the team to do the thinking exercise: how would you have cut that time in half? Or did you have a test for this scenario? This is similar to how we ask for tests in the pipeline rules or the Operational Readiness Review.
We write a very detailed timeline just so we can see the gaps in it and the things we can improve. And there is the 5 Whys section; 5 is not a hard rule, but it's a rule of thumb. The first why is the symptom, why such and such service impacted something, and the answer to the last why should be the root cause. It's a peel-the-onion exercise for the team. And of course, the COE has lessons learned and action items at the end. If you were working in AWS or Amazon, every COE you read would have this structure.
We refer back to these Corrections of Error, and the thing I want to point out is that, as you can see, it's a tedious process. It takes time. So with AI, we built some systems to, again, start from something rather than from nothing. For example, can the timeline be drafted for us? Can we use prior Corrections of Error as a knowledge base to give teams ideas on action items and so on? We have already externalized this as the CloudWatch incident report feature, and we have a quick demo of it.
I'm going to continue from where we were, looking at that investigation. One of the investigations I already completed is right there at the bottom; it's called Appointment Scheduling Troubleshooting. When I click on that, there's a button called Incident Report, which the animation was supposed to show but didn't. Once you click on it, it automatically creates this report. On the right-hand side, everything you see are the facts it collected while the investigation was running, and you can edit them if you want, for example if the incident actually impacted more or fewer customers.
What I want to focus on, however, is the report itself on the left-hand side. You can see that it puts in a title, a background of the incident, and a nice summary, adds visualizations with all the graphs, and covers the customer impact, exactly like Mustafa mentioned, along with what went well, the whole incident response analysis, and so on. What I'm really interested in showing is the Five Whys, like Mustafa talked about: it asks all the right questions, documents all of the Five Whys, and finds the root cause. Everything is fully documented, and obviously it doesn't have to stay here; it's only useful if you actually share it with your team. So you can export it, copy it to the clipboard, put it in your document repository, or download it and share it with other team members.
Yeah, all right. I mean, it wouldn't be a talk if there weren't any technical difficulties, right? But we are not aiming for perfection. We have come full circle, folks. We covered our flywheel, and as I said earlier, the better we get at operational excellence, the better reviewers we become. And I forgot to mention something: the last thing we covered was Correction of Error in the review section, and the first thing we covered was the Operational Readiness Review. This is an example of Correction of Error feeding into the Operational Readiness Review. Not every Correction of Error, but every interesting one, every one that is broadly applicable, eventually becomes an Operational Readiness Review question and feeds into our process. And again, operational excellence makes us better reviewers.
Getting Started: Implementing Operational Excellence in Your Organization
You might be thinking these processes and mechanisms are great, but Amazon is a big company; they can afford these things, and maybe you're just a startup. Remember, Amazon was also a startup. The Operational Readiness Review, when it first started, was just a Word document with a bunch of questions in it. Now it has become a tool, and there are 2,000 templates in there; anybody can create their own Operational Readiness Review template. You have to start from somewhere, so I'm going to suggest some steps. If you were inspired by anything you heard here, any mechanism, I suggest you start thinking about how you can implement a version of it in your own company. It will be your own version; it won't be the same. And I would love to help you folks. You don't have to use CloudWatch; you could be using anything. If you want to invest in operational excellence improvements, I would love to talk to you, give you some ideas, and review your plan, for example.
Umaya, the same. He helps our customers raise their operational excellence every day. And there is a lot of material online that we already referred to, and there is even more. You can read the Observability Best Practices Guide that Umaya and team put together; again, it's on the same companion website. We also have an Observability Workshop: all those demos Umaya was showing, you can deploy them yourselves, play with them, and learn more.
And I want to finish on a lighter note. When we first put this deck together with Umaya, I was in the Bay Area visiting our San Francisco office. I'm from Seattle, and I was going to do a dry run for a small group of people. I wanted to grab a coffee in the morning before I went to the San Francisco office, and I found a coffee shop right across the street. If you can believe it, it was called Flywheel Cafe. That's the store, and inside there is a flywheel up there, and, if you can believe it, there was a first responder from the San Francisco Fire Department ordering right before me. So I thought I was destined to give this talk.
Thank you for your time, and thanks for being here. I hope you found it useful. If you want to connect with me or Umaya, we are on social media, and if you have any questions, we will be right outside for the next five to ten minutes. We would love to connect with you in person. Thank you. Thank you, folks.
; This article is entirely auto-generated using Amazon Bedrock.