🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Behind the scenes: How AWS drives operational excellence & reliability (COP415)
In this video, AWS engineers Mustafa Torun and Imaya Kumar Jagannathan explain how AWS achieves operational excellence through mechanisms rather than good intentions. They present a flywheel model with four drivers: readiness (Operational Readiness Reviews, testing, game days), observability (standardized instrumentation using Embedded Metric Format, CloudWatch collecting 5 terabytes of logs per second), incident response (SOPs, escalation culture, AI-powered CloudWatch Investigations with 90% satisfaction rate), and reviews (weekly dashboard reviews, Correction of Error reports using 5 Whys methodology). Key architectural patterns include dependency isolation and cellular architecture for fault tolerance. They demonstrate CloudWatch Application Signals for SLO-based monitoring, MCP servers for IDE integration, and automated incident report generation, emphasizing that these mechanisms started simple and evolved over time.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Coffee Shop Analogy: Introduction to Operational Excellence at AWS
Thanks for being here. Hey Imaya, what's up? Hey Mustafa, I'm doing great, but I could use another coffee. Say you pull up your phone now, Imaya, and you find a coffee shop nearby. You walk up to this store only to see that they are closed for maintenance. That would suck, right? So are you a coffee geek like myself? Yes, they sell coffee in Atlanta too. In Seattle it's the law. You have to be a coffee geek. Do you follow your local coffee shop on social media? Of course I do. I only drink about five coffees before 10:00 a.m. So say you get an update like this on social media from your local coffee shop. They say they have new features. They have new syrups to choose from and new baked goods. So you go up there and check it out, and it's open, which is a great sign. But then you see the coffee machine is broken. How would you feel? Almost crying, maybe actually crying. You have a coffee addiction, that's for sure.
So I think these coffee shops could use some operational excellence and reliability, and that's the topic of our conversation today. I'm Mustafa Torun. I'm a Senior Principal Engineer at AWS. I've been with the company for 12.5 years now, and most of that time was with CloudWatch or observability and the monitoring team. I have Imaya joining me on stage. My name is Imaya Kumar Jagannathan. I lead the worldwide specialist Solution Architecture team at AWS, and my team focuses on cloud operations. I've been at AWS for close to eight years now, working almost all the time on CloudWatch and observability services.
Excellence Over Perfection: AWS Infrastructure and Architectural Principles
I want to start with the word itself, excellence. It's not perfection. It's excellence. We use this word for a reason. The difference between these two words relates to two leadership principles of Amazon. We have more than two, but these two apply here: bias for action and insist on highest standards. One of them asks for speed and says take calculated risks. The other one says it might be uncomfortable sometimes, but you should push for the highest quality. There were times I was pushing my team on this end where we were very careful, but we were not shipping as fast.
So at the center of this balance sits excellence. It's not perfection. Perfection would instill fear. Perfection would slow us down and it wouldn't accept mistakes. With operational excellence, we understand that we will make mistakes at times because we have to move forward, but the key is we learn from our mistakes. Let's get back to our coffee shop and see what would happen if AWS was operating this coffee shop. We would add redundancy at the infrastructure level. We would add many more coffee machines to the store, just like how we run our systems on multiple servers. We create multiple copies of the data you give us. We wouldn't run just one coffee shop. We would run multiples of these, just like how we put multiple data centers within a zone. We would put them in different parts of the region, different parts of the city, just like how we do availability zones. We would put these in different parts of the world. Some services are zonal, some services are regional, and all of these create fault isolation boundaries. There's a great white paper on our website if you want to read more about our infrastructure and fault isolation. This QR code links to a companion website for this talk, with links to the external resources we will refer to in the remainder of the talk.
So let's talk about some architectural choices we make for reliability and operational excellence. Say we have an AWS service that has three dependencies and one of them is having a bad day. We design our API services such that only those APIs that depend on that dependency are impacted. The rest should keep functioning. This is done through dependency isolation. The simplest thing you can do is create a separate thread pool for each dependency. There are of course more advanced methods. The second one I'm going to talk about is cellular architecture. When a ship hits something and the hull of a big ship is breached, the ship doesn't sink because the hull is divided into separate cells. We apply a similar idea. We create multiple copies of our stacks, put them in our data plane next to each other, and route traffic to them with a thin routing layer. This cellular architecture allows us to reduce our blast radius. If one of those cells is impacted, the rest of the service can continue to function.
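To make the dependency-isolation idea concrete, here is a minimal bulkhead sketch in Python, assuming a hypothetical service with three downstream dependencies; the dependency names, pool sizes, and timeout are illustrative, not AWS's actual implementation.

```python
# Minimal bulkhead sketch: one bounded thread pool per downstream dependency,
# so a slow or failing dependency can only exhaust its own pool.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical dependencies; pool sizes and timeout are illustrative.
POOLS = {
    "auth": ThreadPoolExecutor(max_workers=8, thread_name_prefix="auth"),
    "billing": ThreadPoolExecutor(max_workers=8, thread_name_prefix="billing"),
    "inventory": ThreadPoolExecutor(max_workers=8, thread_name_prefix="inventory"),
}

def call_dependency(name, fn, *args, timeout_s=0.5):
    """Run a dependency call on that dependency's own pool with a hard timeout."""
    future = POOLS[name].submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()
        raise RuntimeError(f"dependency '{name}' timed out")

# An API that only needs "auth" and "inventory" never touches the "billing" pool,
# so a billing outage cannot starve its request threads.
```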
Articles are available on our website in the Amazon Builders Library. The link is in the companion website, and you can read through the architectural choices we make and apply them to your own systems. The Well-Architected Framework is also a great resource, and it discusses all these topics: infrastructure, architecture, and operational excellence.
Operational excellence is a feature for us. We add features and new services to AWS, but we invest equally in operational excellence. Operational excellence for AWS is not an afterthought or a burden. We invest in it intentionally. We know you choose AWS for many reasons, and one big reason is that we pay attention to our operational excellence and we are serious about it.
Mechanisms Over Good Intentions: The Operational Excellence Flywheel
I'm going to share some learnings from my experience. I've been in the front seat of our operational excellence journey over this decade. I've seen us try new things, and some of them worked while others didn't. One common theme you will see is that we don't rely on good intentions. We rely on mechanisms. I'm going to quote Jeff Bezos. He said this at an all-hands meeting in 2008, before I joined. He said that often when we find a recurring problem, something that happens over and over again, we pull the team together and ask them to try harder and do better. Essentially, we ask for good intentions. He said this rarely works because people already have good intentions. But if good intentions don't work, what does? Mechanisms work.
Nobody wakes up in the morning and says they're going to make operations worse, and during an incident, nobody is trying to make things worse. Everyone is trying their best. So let's talk about what a mechanism is. It's a system with an input and an output, and it starts with a tool. It's not a manifesto you put up or an email you send and expect people to read. You create a tool, and if that tool is good, it will drive adoption. People will start using it, and the more people use it, the more feedback they will give you. That feedback allows you to inspect and retrospect, which allows you to improve the tool. This is a virtuous cycle, also called a flywheel. If you use a stationary bike at a gym, that big wheel is a flywheel. It comes from mechanics and keeps the momentum going while storing energy.
The first virtuous cycle flywheel at Amazon was introduced by Bezos and his team. They were targeting growth, with customer experience as a driver. Customer experience drives traffic. If you improve customer experience, traffic brings more sellers, more sellers bring more selection, and more selection makes customers happy. As we grow with these drivers, we have a lower cost structure, which lowers our prices. Lower prices feed back into customer experience itself.
I want to talk about what drivers exist in AWS that drive our operational excellence, because it has improved significantly over this decade. In my view, it starts with observability. We make our services observable so we get more insights from them and can respond to incidents better. Incident response benefits from observability, and the more incidents we have, the more we review and learn. All that data feeds into our review process. Of course, we don't wait for incidents to happen. We also continuously review our systems, and observability helps with that too. The more we review, the more action items we take, and it feeds into our readiness process. We become more ready, and most action items we take are about closing blind spots, which feeds back into observability. As operational excellence keeps growing with these drivers, we become better reviewers ourselves, so it feeds into our review process. We ask better questions as we get better at operational excellence. In my opinion, operational excellence feeds into Bezos's wheel. Customer experience benefits from operational excellence.
Readiness: Operational Readiness Reviews, Testing, and Release Excellence
In the remainder of this talk, we are going to stay with this wheel. There will be four sections, and we will start with readiness and dive deep into each one of these drivers. I have five topics to discuss in this section, and the first one is called operational readiness review.
We have a mechanism called an Operational Readiness Review. We ask service owners and operators to fill out a form, a checklist, that asks certain questions like these. These are verbatim questions from a new feature Operational Readiness Review template. It asks: Does this feature have SLOs defined on customer experience metrics and have alarms? This is a short question, but the critical elements are that it asks you to create service level objectives—SLOs—so you have goal-oriented operations with a focus on customer experience metrics. You could be alarming on many things, so this question is a great one. The second question is about testing: Does this feature have testing to discover and address any unexpected performance issues? It's not just asking for discovering issues; it's asking for addressing them. It's asking for a mechanism from the team.
These checklists are filled out and the bar is raised by bar raisers, and then the checklist has to be approved by a director or above before the service can go live. If the bar raiser or the director is not satisfied, the team has to address those concerns before they can even take their service online. Speaking of testing, we test our software extensively. I won't go into every detail of how we test our software, but what applies to our conversation today is that we test for failure. We create failure scenarios continuously in our deployment pipelines and test for them. Of course, we do load tests. Every day our pipelines run automated load tests that stress the system in pre-production just to see if we created any regression. One thing that is usually overlooked is that we also test for load from a single user to see if noisy neighbors will cause a regression.
We also conduct manual testing days that we call game days. We literally break our system in these tests. We create network outages and simulate dependency failures so that we test two things: how our systems behave, but more importantly, how our operators behave. Were they paged on time? Were they able to respond on time? Did they find the correct runbooks? Were they able to find the right dashboards? How long did it take for them to diagnose and mitigate the issue? We learn from these simulations and improve our systems accordingly, along with our processes and mechanisms. We have a Fault Injection Service in AWS that you can also use to create game days in your own organization.
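If you want to script such a game day, the Fault Injection Service can be driven from code. Below is a hedged sketch that assumes you have already defined an experiment template in FIS; the template ID and tag value are placeholders.

```python
# Sketch: start a pre-defined AWS FIS experiment as part of a game day.
# The experiment template ID and tag values are placeholders.
import uuid
import boto3

fis = boto3.client("fis")

response = fis.start_experiment(
    clientToken=str(uuid.uuid4()),           # idempotency token
    experimentTemplateId="EXT_EXAMPLE123",   # placeholder: your game-day template
    tags={"Purpose": "game-day"},
)

experiment = response["experiment"]
print(experiment["id"], experiment["state"]["status"])
```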
We don't want humans to touch our systems; we believe in automation and continuous deployment. But if a human has to touch a system at times, we write every step down in detail. We call this change management, or CMs. The CMs are also bar raised and have to come from a bar-raised template. When bar raisers review a CM, they pay attention to whether the team included rollback steps for each step. If something goes wrong, we want the team to be able to roll back with ease. We also check whether they have tested this in pre-production so we don't leave anything to surprise.
We have a team called Release Excellence. All this automation is great, but how do we know those pipelines are doing their job? We have systems in place that analyze all our deployment pipelines and check them against a set of rules. For example, if we have 70 rules and a pipeline is not adhering to 7 of them, that pipeline is blocked. It may still be CI/CD, but it cannot deploy until the team addresses those expectations. What are those expectations? Let's look at an example pipeline and what happens before we even hit production.
We usually have three pre-production stages in our AWS services before production. We treat the last stage before production, called gamma, like it's a production environment, and we run all sorts of tests in that stage, including load tests. If this is not in place, if the team missed bake time, for example, that would be a miss and the pipeline would be blocked. Clare Liguori, a Senior Principal Engineer in our company, wrote a great article called "Automating Safe, Hands-Off Deployments," and it's also in the Builders' Library.
Observability: Instrumentation, Standardization, and CloudWatch at Scale
Let's move on to the second driver of our wheel: observability. What is observability? It's a measure that comes from control theory. You can measure a system and say the system is less observable or more observable, and the more observable it is, the more insights you can get out of it.
The more observable it is, the more questions you can ask that you didn't even know you should be asking. To make our systems observable, we basically do two things: we instrument our systems and we standardize that instrumentation. We also use great tools to accomplish observability. So we run our services, but as I said, we invest heavily in operations, and we collect the three pillars of observability: metrics, logs, and traces (made up of spans). One thing I think Amazon did right is that we standardized this, and I want to emphasize that.
Say you are an AWS developer building a new API using Java. We have a library for building API services; you can import that library and extend a class called Activity for your API. You have to implement one function, and this function will be executed when a request for this API hits this particular server. Even if you don't do anything else and deploy literally this piece of code, you will get a blob like this with every request. This is a simplified example for demonstration purposes, but we will be measuring many things at both the infrastructure level and the application level, such as duration, CPU, memory used, and whether the call was successful or not, along with the request ID and so on. This is called the Embedded Metric Format.
This is an external format. If you write this to CloudWatch Logs, CloudWatch Logs will process this directive and will emit metrics out of it. So this directive is saying: take the duration field and emit a metric from that field. What the team gets by just deploying that code which does nothing is all these measurements. Again, it's a simplified example; we collect a lot more data than this. Of course, the developer can get a metrics object and add their own business measurements. They can add accounts, they can add time, they can have their own key-value pairs, and all of those will be captured in the same JSON blob as different fields.
Since bean count and brew time here are measurements, they will also be added to the EMF directive so that we can get metrics out of EMF. After deploying this software, the team will be able to add these new widgets for their metrics. It's very easy to emit your metrics. You don't have to think about the transport or the formatting. It's all implemented into the library that you are using to build your application. So it's very observable.
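For readers who haven't seen the Embedded Metric Format before, here is a hand-rolled sketch of the kind of log blob described above, written in Python for illustration; the namespace, dimension, and field names follow the coffee-shop example and are assumptions, and in practice the service framework or an EMF client library builds this for you.

```python
# Sketch: emit one Embedded Metric Format (EMF) log line.
# CloudWatch Logs reads the "_aws" directive and turns the named fields into metrics.
import json
import time

record = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "CoffeeShopService",        # illustrative namespace
            "Dimensions": [["Operation"]],
            "Metrics": [
                {"Name": "Duration", "Unit": "Milliseconds"},
                {"Name": "BrewTime", "Unit": "Milliseconds"},
            ],
        }],
    },
    # Dimension and metric values live as plain fields in the same blob...
    "Operation": "BrewCoffee",
    "Duration": 42,
    "BrewTime": 31,
    # ...alongside high-cardinality properties that stay queryable as log fields.
    "RequestId": "req-0001",
    "CustomerName": "Imaya",
    "Success": True,
}

# In Lambda or agent-based setups, anything written to the log stream ends up in
# CloudWatch Logs, which extracts the metrics named in the directive.
print(json.dumps(record))
```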
Metrics are fine, but what if you want to dig deeper and do some analytics? Since in this example I'm emitting customer name to CloudWatch Logs, I can run a CloudWatch Logs query like this, which finds average brew time and groups it by customer. Since everything starts as a log, I can do cool things like this. Every request is a blob of log, and some of them are aggregated metrics, and I can run any analytics I want. EMF is externally available. You could read about it on our companion website.
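Here is a hedged sketch of that kind of analysis run through the CloudWatch Logs Insights API; the log group name and field names continue the illustrative coffee-shop example above.

```python
# Sketch: average brew time per customer, computed from the EMF-style logs above.
import time
import boto3

logs = boto3.client("logs")

query = """
fields @timestamp, CustomerName, BrewTime
| filter ispresent(BrewTime)
| stats avg(BrewTime) as avgBrewTime by CustomerName
| sort avgBrewTime desc
"""

now = int(time.time())
start = logs.start_query(
    logGroupName="/coffee-shop/api",   # illustrative log group
    startTime=now - 3600,              # last hour
    endTime=now,
    queryString=query,
)

# Poll until the query finishes, then print one row per customer.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```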
When it comes to tooling, we use CloudWatch. Amazon and AWS prefer CloudWatch. One reason is we built it. We dogfood; we use the product ourselves so that we can improve it for our customers. The second reason is it can take our scale. Amazon and AWS generate lots of data, and this is probably an outdated slide by now. CloudWatch Logs is accepting 13 exabytes of logs per month. That's 5 terabytes of logs per second. And a quadrillion, for reference, is a thousand trillion. CloudWatch Metrics is accepting 20 quadrillion metric data points every month. Since I brought this slide up, we have probably accepted 100 terabytes of logs already.
CloudWatch Application Signals: SLO-Based Monitoring and Application-Centric Troubleshooting
Of course, we have other priorities, and you told us what priorities you have when you choose an observability tool. Imaya will walk us through those. My team and I work with hundreds of customers every year, engaging in very deep conversations about what customers want to do in terms of choosing an observability solution. You have made it very clear for us. We see these kinds of patterns emerging as to what is important for you when you make decisions on what observability solution to use.
What I'm going to do is focus on demonstrating how CloudWatch as a solution meets these expectations, based on the internal learnings we have discussed so far. The first one is open standards. Mustafa talked about the instrumentation part: with the standard instrumentation library available within Amazon, developers really don't have to worry, because all the essential signals are automatically captured. You have told us that you want to use an open-standards-compliant solution, and there's no question about it. Everybody would like to be open-standards compliant, and that is beneficial for obvious reasons. OpenTelemetry is the way to go, which is why we have the AWS Distro for OpenTelemetry SDK and also our auto-instrumentation agent, which automatically captures the industry-standard RED metrics: request rate, errors, and duration. We also capture faults.
CloudWatch also has OpenTelemetry endpoints, or OTLP endpoints, for logs and traces, to which you can send data from anywhere, whether it's AWS, on-premises, or a hybrid environment. It doesn't really matter as long as you use SigV4 for authentication. The OTLP endpoint will accept signals from you, and you're using OpenTelemetry at that point. The second consideration is that yes, you have all these features, but if it's hard to actually set things up and collect signals, it's going to be pointless. If you're using EKS, you have the CloudWatch Observability add-on, which automatically deploys the CloudWatch agent, which is also OpenTelemetry compatible. It also deploys the auto-instrumentation agent by injecting it into the pods you choose, based on simple configuration. A similar thing is available for ECS as well: you just have to mount the auto-instrumentation container into the application container in the task definition and you're good to go. We have OpenTelemetry-based Lambda layers available for serverless workloads as well.
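As a hedged sketch of that EKS setup path, the add-on can be enabled with a single API call; the cluster name below is a placeholder, and amazon-cloudwatch-observability is assumed to be the published name of the CloudWatch Observability add-on described above.

```python
# Sketch: enable the CloudWatch Observability EKS add-on from code.
# The cluster name is a placeholder; adjust the optional IAM role for your setup.
import boto3

eks = boto3.client("eks")

eks.create_addon(
    clusterName="demo-cluster",                     # placeholder cluster name
    addonName="amazon-cloudwatch-observability",    # CloudWatch agent + auto-instrumentation
    # serviceAccountRoleArn="arn:aws:iam::123456789012:role/CloudWatchAgentRole",  # optional
)

# The add-on deploys the CloudWatch agent and injects the OpenTelemetry-based
# auto-instrumentation into annotated pods, as described above.
```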
Then the third consideration is that yes, we have all this data and we set it up really quickly, but we have a lot of data now. How do we handle it? You don't want to be in the business of creating dashboards manually. You don't want to be manually correlating between signals—and when I say between signals, I mean logs, metrics, and traces from applications, but also you don't want to be manually correlating between infrastructure data from containers to application data like microservice interactions and database calls. You want all of that to be out of the box. That message was received as well.
You wanted a highly opinionated sort of getting-started approach. Of course, you want the ability to create your own dashboards, but you also wanted an opinionated way of doing things. That's what we have with Application Signals, which is an APM solution within CloudWatch. It offers application-centric observability, where you're not essentially focusing on infrastructure or individual microservices but looking at things from an application point of view, which is how everybody within the company looks at the system. With AI-driven root cause analysis, we make it even easier for you to find issues across all the different signals that you're collecting from different environments.
You also want to reduce the data that is being collected. You don't want to get into the fear of missing out, or FOMO mode, and collect everything, which I see a lot of customers get into. Then cost goes up and it takes a lot of time for you to actually find the root cause because there's a lot of noise and the signal-to-noise ratio is not in the right balance. You wanted to be more controlled. Most of our discussions mentioned SLOs, and what we recommend for customers is SLO-based monitoring. Application Signals is in fact centered around SLOs.
How do you set SLOs? You start with the business SLA. You ask what the business wants in terms of what your application should do. For example, this request should be responding within 200 milliseconds. That's a requirement that you get. Then you build an SLO, which is a service level objective, to track whether the application is actually doing that or not. You have an error budget, so there is wiggle room for you to play around with if there are any challenges. That will dictate to you what kind of metrics you actually need to collect. When you do that, you have a direct correlation between the business requirements and the technical things. Observability at that point is not a technical concern anymore. It is also a business function.
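As a back-of-the-envelope illustration of the SLO and error-budget arithmetic described above (all numbers are illustrative, and the 50 percent alert threshold simply mirrors the one used in the demo SLO later in this walkthrough):

```python
# Sketch: turning a business SLA into an SLO target and an error budget.
slo_target = 0.999            # e.g. 99.9% of requests respond within 200 ms
period_requests = 1_000_000   # requests observed in the SLO period

error_budget = (1 - slo_target) * period_requests   # requests allowed to miss the target
breaches_so_far = 412                               # slow or failed requests observed

budget_consumed_pct = 100 * breaches_so_far / error_budget

print(f"Error budget: {error_budget:.0f} requests")
print(f"Consumed: {budget_consumed_pct:.1f}% "
      f"({'alert' if budget_consumed_pct >= 50 else 'ok'} against a 50% alert threshold)")
```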
With that, we have seen customers really reduce how much data they're collecting, and it really improves mean time to resolution. Let's take a look at how you can use CloudWatch Application Signals and SLO-based monitoring to troubleshoot. In this example, I am on the Application Signals screen for services.
If I click on that, you will see the service screen. I have a sample application deployed that has multiple microservices. When you go there, it shows a microservices view, all listed based on the SLO status. In the overview, I can find out if there are any applications or services that are actually having issues. In this case, there are some applications that are unhealthy based on SLOs. I can click on the SLOs and look at the list of SLOs, then click on view all SLOs in this microservice. Once I click on that, I'm able to see the SLO status—whether I'm meeting the SLO that I've set or not, if I have exceeded the error budget, and what kind of latency is causing that. What I'm interested in is one of these SLOs, the availability SLO. Once I click on it, the widget automatically changes accordingly. What I want to do is find out what this SLO even means. In this case, this SLO is actually measuring a specific operation called visits. Let me go take a look at what this looks like in that particular service.
Let me go look at the SLO so we understand what we're talking about. What I did was set up Application Signals on this environment, which is deployed on an EKS environment with multiple microservices. It automatically collects signal data. As I mentioned before, all the industry standard metrics like request rate, errors, and duration are automatically collected. In this case, that's what I'm making use of, but it's not just confined to that. You can actually select any CloudWatch metric. It could be a metric from containers, EC2, or anything else. You can select any metric and create an SLO based on whatever metric is there. You can also publish your own metric from logs if you want to by extracting metrics out of log events.
Here in this case, this is just a description of what this SLO looks like. It's the availability SLO, and it's measured every day. There's a goal, and if the error budget goes beyond 50 percent, then I want to be alerted. That's just the description of the SLO. Right now, the SLO is not healthy. I can actually go and see what is going on in the operations and find out what is causing the issue. When I click on that operation, it takes me to the screen where it shows all the actual metrics that are being collected—the request rate, errors, faults, and so on. All of this that you're seeing here is out of the box. There's nothing that I had to create. Here are the runtime metrics as well. For certain workloads like Java and .NET, we collect runtime environment metrics too. So if there is a problem with the garbage collector or something like that, it's also visible out of the box so you can go fix it if you want to.
Further up, you can see there is a faults graph and there is a peak. I want to go investigate that. I see 284 errors in the last three hours or so. If I click on that, it's going to show me all the different spans that were collected at that point. I can look at that particular trace and go look at all the spans and investigate if I want to. But I also talked about wanting the ability to correlate the infrastructure information along with the application data. Because it is deployed on an EKS environment, it is able to show the nodes as well. Obviously, if it's another environment, it will show EC2 instances as well. In this case, I can not only see what is going on in that particular infrastructure, but I can also go into the pod-level details and find out what is happening there as well. I can look into Container Insights from here. I don't have to really go hunt for this—it's just a direct deep link. I go there and then look at Container Insights and look at all the container-specific information. I can go dig deep and look at all the different deployments of that pod and then look at all that infrastructure information. If there was a problem here, I would obviously easily spot if there was a pod crashing or a crash loop going on.
But what I really want to do is go back and look at the trace itself. In this span, if I go and expand and investigate that trace, it's going to show me the specific trace map, how the request itself was processed in my environment. There is a client, there are multiple services, and then a DynamoDB call at the end. Everything that you see on the right-hand side is OpenTelemetry-based data that got extracted, captured, and shown here, all enriched through OTel. If I go down, I can actually see the span timeline, which shows how the request went through and which services were involved. When I select the DynamoDB span itself, it actually shows that there was a DynamoDB throttling issue. This is an error that was captured in the trace itself as a trace event.
When I go further down, I can see all the log events that were captured while this particular trace was captured. I can go read through it if I want to do the analysis myself, or I can go look at CloudWatch Logs Insights. I can go inside and do my own analysis here. I automatically see the log group and query and all that. When I click on run query, it gives me all the information.
What I can do is perform the analysis myself, or I can use pattern analysis. If there are, let's say, 600 or 6,000 log events, then it would be really useful for me to find out what kind of patterns are emerging in these log events instead of going through every single log event. If we go to patterns, it extracts and groups logs based on the patterns here. You can see that it extracted tokens and is showing me the token for the endpoint, which is the visits API. This is a little bit slow, I think. It also shows my container as well. This makes it really easy for me to troubleshoot the issue and find the root cause very quickly.
I talked about the application-centric piece. What we looked at so far started from a microservice and the status of an SLO, and we troubleshot from there. But I also talked about how you can look at things from an application point of view. That is where the application map feature comes in. I'm on the application map screen, and all the different microservices that I showed you are part of these few applications. I have five applications here, and what you can do now is look at the map and see that one application is having an issue. I can go into that application.
Once I go into that application, I can see all the dependent microservices. All these microservices are shown with the dependency graph and so on, but it also shows all the log events and even does a log audit to find out and show me if there are any obvious issues emerging. I can do troubleshooting from this as well. I can go find the application logs and troubleshoot from here, and that should basically give me all the details. But how am I grouping the application? Certain things are inferred, based on the topology graph, which gives an automatic understanding of what an application looks like. But that is also in your control, so you can essentially go and create an application yourself based on AWS tags or OpenTelemetry attributes to define what an application should look like. You can group based on environment, team name, or anything like that, so it is really easy for you to find out which application is breaking. This also allows you to filter based on SLI status, server faults, and so on. It really declutters the interface so you can find out what the actual issue could be.
Incident Response: SOPs, Escalation Culture, and AI-Powered Investigations
Let's talk about the third driver on our operational excellence flywheel, and I have five topics to cover. When we are responding to an incident, we are very comfortable. The DevOps model helps us a lot. If you think about DevOps, it is a flywheel itself. We learn from our operational experience as developers, and then we design, implement, and deploy our services accordingly.
When I respond to an event, I don't want any surprises. We are very disciplined in writing standard operating procedures and runbooks. The thing is, the need for these is slowly going down because, as you just saw, CloudWatch and other observability tools are getting so good that you start from a dashboard or from an error, and the product itself walks you through and helps you find the root cause. But there are still scenarios where we have to write steps in detail. Ideally, these SOPs go away because we are also able to automate. This might be a hot take, but I think the number of SOPs should come down to zero at some point. If you do have to write an SOP, there are some things we pay attention to. For example, we are not vague when we write a step. We don't say go find the front-end service logs and look for errors. What would an operator do with that in the middle of the night? We want to provide deep links. We want them to be able to immediately reach whatever we want them to reach and figure it out from there. The first question in an SOP should be: can you roll back? Even if it doesn't look like a deployment caused the issue.
The operators should check whether there was a recent deployment and roll it back rather than try to fix a bad deployment forward. If you have SOPs or are creating one, you might be thinking: in the age of AI, why would I do this? Well, you can feed them into a knowledge base. Ideally, your SOPs lead to automated runbooks. We have escalation rules. Our ticketing system will escalate. If I don't respond to a page, the secondary will be paged, and if they don't respond, then the manager on call will be paged. That's the system to make sure we respond to our tickets on time, but I also want to speak about the culture.
When I first went on call years ago, my manager told me something that might sound counterintuitive: please page me if you see customer impact. Please page the manager on call. Don't be scared. We are going to come and help you. So when there is an outage in AWS, usually many leaders are on a call somewhere. When it is very large, most of us are on a call, trying to help the teams. We have this culture of escalating and letting people know, and of saying: I'm going to come and help you.
We have an incident response team. This team will be paged if nobody is responding. They have a dashboard where they know everybody's KPIs and every service's KPIs. If there is no engagement, they will create a ticket and engage you. They also help us during large scale events. They create aggregated tickets, engage teams, and help us find the right team because they have the phone book of everyone. They made their services external, so you can buy their services if you want to implement something like this in your own companies.
Of course, with AI in the last two or three years, we have been incorporating AI into our operations. The way we approach AI is that AI is helping us; we are still in the driver's seat. We as humans are still solving the problems, but increasingly we are getting help from AI to drive our operations forward. We have already externalized some of the things we do. We launched CloudWatch Investigations and the CloudWatch MCP servers, and we will walk you through a demo of those features.
It's really important for us to empower your developers. So what we have is a couple of MCP servers: the CloudWatch MCP server and the CloudWatch Application Signals MCP server. It's a really powerful thing for a developer to have all these tools so they can get insights into what is going on within an application without having to switch applications, open the browser, log into the AWS console, and look at different things. What if we gave them all that power inside their IDE itself? Here I'm in Kiro, and obviously you can use any IDE that you have.
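Wiring those MCP servers into an MCP-capable IDE is typically a small JSON configuration; the exact package identifiers, profile, and region below are assumptions based on the AWS Labs MCP server naming, so check the published documentation for your IDE.

```json
{
  "mcpServers": {
    "cloudwatch": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "env": { "AWS_PROFILE": "demo", "AWS_REGION": "us-east-1" }
    },
    "cloudwatch-appsignals": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-appsignals-mcp-server@latest"],
      "env": { "AWS_PROFILE": "demo", "AWS_REGION": "us-east-1" }
    }
  }
}
```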
Here, what I'm doing is asking Kiro to list all the SLOs in my AWS account, and it lists them. It's the same thing that I did earlier: the troubleshooting workflow I walked through manually, I'm going to show exactly the same thing using Kiro and the MCP servers. Now I'm going to ask it to list only the SLOs that have issues, and it lists the ones that do. I'm asking it to give me details about the specific SLO that is breaching, which is availability for scheduling visits. It gives me an understanding of what the likely causes are for this issue.
Let me go to the next step and hope it works. Apparently it doesn't like that. But anyway, it actually went and asked me for permission to make a call to the audit services API through the MCP server, and then it came up with an understanding of what exactly happened. In this case, it also came back and told me that there was a DynamoDB throttling issue, and it even made suggestions on what I could potentially do to solve that problem, without me having to log into the AWS console or go through that process. You can obviously use your own model. It really doesn't matter, because at the end of the day you're making API calls through the MCP server.
So next, what if you do have access to the AWS console? How could you do this there? That's where CloudWatch Investigations comes into play. What I'm going to demonstrate here is the ability to go and investigate.
There is a way for you to investigate an SLO if it is unhealthy: you can click on it and then click on investigate. That is not the only way to start an investigation. You can start one when you are looking at a metric that doesn't look good, or maybe when you are querying some logs or log events. You can ask CloudWatch Investigations to investigate whether there is a problem, or you can even automate the whole thing by creating a CloudWatch alarm with an action that starts an investigation as soon as the alarm goes off. By the time the alarm goes off and you go there, it will probably have already found the problem.
I am going to click on investigate. I gave it a name, though you did not get to see it, and I apologize for that. When you click on start investigation, it basically goes and starts a series of tasks. In this case, it is looking at CloudTrail logs, CloudWatch logs, and metrics. You can even create an access entry and give it access to the EKS cluster so it can go into the resources and find out what is going on. It will create a topology map based on the trace data that it finds, and you do not have to stay in this console at all. You can actually leave and come back later because it takes a few minutes. I have actually sped up this process, but it takes a few minutes to actually complete.
You can also feed more information to it. Let us say you start an investigation; you can go run a query and then say, I have this data also, so include this data. Or you can add another metric or some metric information to it, so you are actually aiding the troubleshooting process as well. Once you do this, it will essentially go and do all the analysis, and any second now it is going to come back and tell me what the root cause is. It basically comes back with a hypothesis. It tells me what the potential problem could be. It says the analysis is complete and the investigation has concluded, and here it came up with this hypothesis. I want you to pay attention to this hypothesis, mainly to the what happened section in the third one.
The root cause it found is that the Claude 2 model was not available, and that was the problem. It found all that, but interestingly it even pointed out that the Application Signals fault metrics did not surface this problem, so it had to dig deeper to find it for you. At the bottom, you can see that it says this represents a monitoring gap that you really want to focus on. So I know there is some work I have to do. This makes it really valuable, because without it I would probably have to do all those steps that I did earlier, like going into the SLOs. It's not that it is hard, but yes, it is harder than this, because this is doing the work for me instead of me going and finding the root cause.
You can connect CloudWatch Investigations to a ticketing system. We connect it to ours, so every time an operator in AWS receives a ticket, by the time they engage, they already have a CloudWatch investigation running and providing updates on the ticket. We collect feedback from these operators, and the most recent number I read was about a 90 percent satisfaction rate, so the teams in AWS like CloudWatch Investigations. As I said, there are many services now in AWS whose stacks are native AWS and use CloudWatch end to end.
Reviews: Weekly Dashboard Analysis and Correction of Error Process
Let us talk about the last driver, the fourth one in our flywheel, which is reviews. We review our incidents and we review our operations, and I am going to cover three topics here. First is the weekly dashboard reviews. We get together as small teams on Monday, go through our dashboards, and look for anomalies and spikes. This is an excuse to do a retrospective, to be honest. You start with dashboards, but then you have an honest conversation with your team members about your operational posture and whether you should take any action items for that week.
When we are looking at these dashboards, our widgets usually have two lines: an alarm threshold and an investigation threshold. Let us say we are looking at latency over time for an API service. If we were to see a widget like this as a team, we would be thinking: why is our investigation line that high? Can we pull it down so that we raise the operational bar? If we see a spike, of course we investigate, but sometimes there are things that are not breaching anything yet look suspicious, so we investigate those too. We use machine learning and AI to help us come up with a report, so by the time we start that meeting, we already have a report telling us where to pay the most attention.
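As a hedged sketch of what such a widget could look like when defined in code, here is a dashboard with one latency widget carrying both threshold lines; the namespace, metric, dashboard name, and threshold values are illustrative.

```python
# Sketch: a latency widget with an alarm line and a lower "investigation" line.
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

dashboard_body = {
    "widgets": [{
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "GetCoffee p99 latency",
            "region": "us-east-1",
            "stat": "p99",
            "period": 60,
            "metrics": [["CoffeeShopService", "Duration", "Operation", "GetCoffee"]],
            "annotations": {
                "horizontal": [
                    {"label": "Alarm threshold", "value": 500},
                    {"label": "Investigation threshold", "value": 300},
                ]
            },
        },
    }]
}

cloudwatch.put_dashboard(
    DashboardName="weekly-ops-review",          # illustrative dashboard name
    DashboardBody=json.dumps(dashboard_body),
)
```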
Another learning I have: if you are doing this weekly, do not just look at the last week, because you will miss longer trends. Look at the seasonal tendencies as well.
Using machine learning tools to identify these seasonal tendencies helps you avoid missing things that are actually going bad even though that particular week might look just fine. We also review dashboards on Tuesdays as larger groups, and on Wednesdays as a whole company. We have an operational excellence meeting in AWS where all the senior leaders join, and we go through certain things. One part of it is weekly dashboard reviews, so we pick a team at random. It was a physical wheel when we first started this years ago, but of course that stopped scaling. We made it a software wheel, and it's on GitHub. You can download it and create your own weekly dashboard review process. The idea is not to stress anyone. It's stressful when you come up there and present your dashboard. I've been there multiple times. You have your cold sweats, of course, but the idea is to learn from each other. As leaders, when we ask questions, everybody who's dialing in also hears them, so they take those actions as well. It's a way to scale our operational excellence.
Similarly, we review our tickets every week. In the same meeting where we review our dashboards, we also have a discussion about the high severity tickets we received. The point is we are trying to find the recurring problems, things that keep coming back. Remember the Bezos quote about recurring problems: if we don't look, how are we going to find them? The idea is we don't want to be in a reactionary mode where we keep resolving tickets and moving on. We are trying to identify the root causes. Gen AI, as you can imagine, is great at summarizing tickets, so we are increasingly using it to start from something rather than nothing.
The last thing I want to talk about is Correction of Error. We have discussed Correction of Error extensively, and there are lots of articles we have written about it, but for the completeness of our discussion, here we go. When we have a customer-impacting event, when we have an event that is large scale, or when we think we can learn from an event and share it with the broader company, we write a very detailed report called a Correction of Error, or COE for short. When we write this, we are very careful that we don't make it a blame tool. The person or team writing the COE should never feel like they are being blamed or punished. As leaders of your companies and organizations, it's very important to instill the culture that this is a learning tool, because we don't assume bad intentions. We assume good intentions, and we look for the mechanism that failed. The output of a good COE should be which mechanism failed and what we can put in place to fix it.
Here are the sections of a Correction of Error. It's a template and a tool, because it's a mechanism. When you create a Correction of Error, it gives you a template to fill in, and then it is bar raised and reviewed, including in the Wednesday operational meeting. If it is broad enough or has great learnings, we pick it and review it there. We write a summary, we include the relevant metrics and graphs, and we write up the customer impact. We also share what went well; maybe the team did something great and we want to share that. Incident response analysis is about how we reacted to the incident, and one favorite question of mine is this one: How long did it take you to detect the issue, and how would you have cut that time in half? If you keep asking this question and keep taking action items over time, you will naturally improve.
Post-incident analysis covers things like issue diagnosis, and the same question applies. You've detected the issue, but did you then spend an hour trying to figure out what went wrong? We ask the team to do the thinking exercise: how would you have cut that time in half? Or did you have a test for this scenario? Similarly, we ask for tests in the pipeline rules and in the Operational Readiness Review. We write a very detailed timeline just so that we can see the gaps in it and the things we can improve. The 5 Whys section is not a hard rule, but it's a rule of thumb. The first why addresses the symptom, why this service saw that impact, and the answer to the last why should be the root cause. This is like an onion-peeling exercise for the team. And of course, it has lessons learned and action items at the end. If you were working at Amazon or AWS, every COE you saw would have this structure.
We use these COEs as references for each other, and the thing I want to point out is that, as you can see, it's a tedious process that takes time. With AI, we build systems so teams can start from something rather than nothing. For example, can a timeline be drafted for us? Can we use prior COEs as a knowledge base to give teams ideas for action items? We have already externalized this as CloudWatch incident reports, and we have a quick demo for that.
Closing the Loop: Incident Reports, Practical Steps, and Final Reflections
I'm going to continue from where we were. We're looking at an investigation, and one of the investigations I've already completed is at the bottom, called appointment scheduling and troubleshooting. When I click on that, what I did was click a button called incident report (the animation didn't show here), and once you click on it, it automatically creates this report.
On the right-hand side, you see all the facts that it collected when the investigation actually happened. You can edit these if you want to update information, such as whether it impacted more or fewer customers. However, what I want to focus on is what's on the left-hand side in the report itself. It creates a title, provides background on the incident, includes a nice summary, and adds visualization with all the graphs. It talks about customer impact, exactly like Mustafa mentioned, what went well, and the entire incident report analysis.
What I'm really interested in showing is the five whys, like Mustafa talked about. It essentially asks all the right questions, documents all five whys, and finds out what the root cause is. Everything is fully documented. Of course, it's even more useful if you actually share it with your team. You can export it, copy it to the clipboard, share it to put it in your document repository, or download it and share it with other team members.
I mean, it wouldn't be a talk if there wasn't some technical difficulty, but we're not aiming for perfection. We have come full circle. We covered our flywheel, and as I said earlier, the better we get at operational excellence, the better reviewers we become. I forgot to mention something important. The last thing we covered in the review section was Correction of Error reports, and the first thing we covered was Operational Readiness Reviews. This is an example of how COEs actually feed into Operational Readiness Reviews. Not every COE, but every interesting COE, every COE that is broadly applicable, eventually ends up as an Operational Readiness Review question and feeds into our process.
Again, operational excellence makes us better reviewers. You might be thinking that these processes and mechanisms are great, but Amazon is a big company and can afford these things. Maybe you're just a startup. Remember, Amazon was also a startup. When Operational Readiness Reviews first started, it was just a Word document with a bunch of questions in it. Now it's become a tool with 2,000 templates. Anybody can create their own Operational Readiness Review template. You have to start from somewhere.
I'm going to suggest some steps. If you were inspired by anything you heard here, any mechanism, I suggest you start thinking about how you can implement a version of it in your own company. It will be your own version and won't be the same, but you can perhaps start looking into that. I would love to help you folks. You don't have to use CloudWatch; you could be using anything. I just care about operational excellence improvements. If you want to invest in them, I would love to talk to you, give you some ideas, and review your plan. Imaya is the same; he helps our customers raise their operational excellence every day.
There's lots of stuff online that we've already referred to, and there's even more. You can read the observability best practices guide that Imaya and his team put together. It's on the same companion website. We also have an observability workshop. All those demos Imaya was showing, you can deploy them yourselves and play with them to learn more.
I want to finish on a lighter note. When we first put this deck together with Imaya, I was in the Bay Area visiting our San Francisco office. I'm from Seattle, and I was going to do a dry run for a small group of people. I wanted to grab a coffee in the morning before going to the office, and I found a coffee shop right across the street. If you can believe it, it was called Flywheel Cafe. Inside, there was a flywheel up there, and if you can believe it, a first responder from the San Francisco Fire Department was ordering right before me. I thought this was destiny for me to give this talk.
Thank you for your time and for being here. I hope you found it useful. If you want to connect with me or Imaya, we are on social media. If you have any questions, we will be right outside in the next five to ten minutes. We would love to connect with you in person. Thank you, folks.
; This article is entirely auto-generated using Amazon Bedrock.