🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
In this video, Alex Livingstone and Jared Nance present enterprise observability best practices using AWS CloudWatch at scale. They emphasize measuring business outcome metrics like "orders per minute" rather than just infrastructure metrics, explaining that downtime costs enterprises $300,000-$5 million per hour. The session covers Amazon's internal monitoring approach using CloudWatch to process 20 quadrillion metric observations monthly, including centralized logging across accounts, multi-resource alarms via Metric Insights, Transaction Search for 100% trace capture without sampling, and Contributor Insights for high-cardinality data analysis. Key features demonstrated include Application Signals for automatic OpenTelemetry instrumentation, CloudWatch Investigations for AI-powered root cause analysis, Container/Database/Lambda Insights, anomaly detection across metrics/logs/traces, and MCP servers for natural language queries. They provide a practical implementation blueprint from foundational setup through AI-powered intelligence layers.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Reality of Enterprise Observability: Challenges, Costs, and the Business Case
OK, hi everyone. My name's Alex Livingstone. I'm a Principal Specialist Solutions Architect specializing in cloud operations and particularly observability. I've been doing operations work for about 25 years, which makes me feel old now, and I've been doing this in AWS for about 9 years.
I want to start with the reality of enterprise observability today. It's a lot different from traditional monitoring. You might have to worry about hundreds or even thousands of accounts across multiple regions, and thousands, tens of thousands, or in some cases even hundreds of thousands of microservices or services. Some of you might have petabytes of telemetry daily. Not everyone's going to have that, and what we're going to talk about today is best practices generally, so it doesn't necessarily have to be at huge scale.
Some of the challenges that this leads to, generally when I speak to customers, is they have lots of different tooling. You might have logs in one tool, traces in another tool, metrics in another tool, and this becomes challenging. Also, it can become very costly, and it can lead to alert fatigue. How many of you have alert fatigue as a problem? Yeah, quite a few of you. And obviously that can lead to missed alerts and not spotting critical issues. So these are the daily realities probably for a lot of you and for enterprise teams in general.
When your observability doesn't scale, you see these kinds of problems. Increased mean time to resolution. As I said, alert noise and fatigue. Data silos, and this might not just be actual data silos, but you might also have team silos as well, and that can make things more difficult. If your data is siloed, that kind of reinforces your teams being siloed as well. And then if you've got multiple tools, then you have to do manual correlation across those tools.
I see this with customers a lot. If your metrics are in one tool and your logs are in another, then you have to be looking at one tab for your metrics, another for your logs. You have to look at the time across these different tabs and switch back and forth, and that can be a bit of a nightmare. And then you also have inconsistent coverage if you're using different tools, different teams are using different tools, different standards, and that makes it even more difficult.
Now I'm going to come onto the cost. Before I click next on my clicker, how many of you here know exactly the dollar amount per hour of downtime for your application or for your organization? Anyone? OK. For 90% of enterprises, it's $300,000 an hour, and for 41% of enterprises it's $1 to $5 million per hour. Now I'm not saying this to scare you. This is to help you justify why observability is important to your business and your company, and why you should be investing in it.
Because if downtime comes at that cost, the return on investment for the relatively small amount of money you can invest in observability to prevent it should be really clear. Beyond the direct cost, you have the lost time from engineers, and customers experiencing problems can lead to reputational loss as well. It can also lead to delayed feature delivery; there are other repercussions too, but delayed features mean revenue that you're not yet getting.
Rethinking Metrics: From Infrastructure Monitoring to Business Outcome Measurement
OK, so before we go into the technical solutions, I'm going to start off by trying to get you, for some of you anyway, this might not be all of you, to fundamentally think differently about the way you do monitoring. So this section is just about the way you think about metrics. Traditionally, when we look at monitoring, we look at infrastructure metrics. So here we can see things like compute, network, containers, storage, and then obviously we vend metrics for nearly all of the AWS services. And then on top of that you might have your
application metrics, so performance, API calls, IOPs. You might even have metrics on individual features. And then golden signals, so latency, traffic, errors, and saturation. I'm sure most of you have probably heard about these.
So let me think how to phrase this. How many of you kind of stop at that top layer? Do any of you think about metrics that may be above that layer? Yeah, a couple of, yeah, a few hands. And this is what I want you to think about: business outcomes. So business outcome metrics are really, really important, and I want you to think about this. And there's two fundamental reasons why, which I'll go into in a minute.
But think about what your customers care about. They don't care about the CPU on your EC2 instance. They don't care about the utilization of your containers. They don't care about how much storage or CPU or memory or Lambda functions are using. They care about things that matter to them, and you should be caring about the things that matter to your customers.
So if you go onto Amazon.com, for example, our most important metric, as you can guess, it's not going to be CPU utilization. It's not even going to be some performance-related metric. It's not even latency. These things are all important, but the most important metric is orders per minute, because the idea of you coming onto Amazon.com is you want to buy something. If you can't buy something, there's a problem.
And there are two major reasons why you should measure these business outcomes. One is that even if there's no technical metric telling you you've got an issue, the business outcome metric will definitely tell you. The other reason is that if there is a technical metric telling you you've got an issue, how do you know what the business impact is? A lot of the time, maybe you've experienced the issue before and you can take a rough estimate of what it might be, but the only way to be sure is if you're measuring business outcomes.
So if you take, I hope you take more than just this away, but if you take one thing away from this, I'd really like you to go away and think about the business outcome metrics for your applications, what matters to your customers, and measure those. The other advantage it gives you as well is your managers will love it if they can see these business outcome metrics in near real time, rather than waiting for BI tools and getting weekly and monthly reports.
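To make this concrete, here is a minimal sketch of publishing a business outcome metric to CloudWatch with boto3 so it can be dashboarded and alarmed on in near real time. The namespace, metric name, and dimension are hypothetical examples, not something from the session; use whatever reflects your own outcomes (orders, sign-ups, checkouts).

```python
# A minimal sketch: publish an "orders per minute" style business metric.
# Namespace, metric name, and dimension values are hypothetical placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_orders_per_minute(order_count: int, environment: str = "prod") -> None:
    """Publish one data point so the business outcome is visible in near real time."""
    cloudwatch.put_metric_data(
        Namespace="MyCompany/BusinessOutcomes",   # hypothetical namespace
        MetricData=[
            {
                "MetricName": "OrdersPerMinute",  # hypothetical metric name
                "Dimensions": [{"Name": "Environment", "Value": environment}],
                "Value": float(order_count),
                "Unit": "Count",
            }
        ],
    )

# Example: called once a minute by whatever aggregates completed orders.
publish_orders_per_minute(1234)
```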
A Blueprint for Enterprise-Scale Observability
So we're going to go through like a blueprint for how you can implement observability at this kind of enterprise scale. We'll talk about centralized logging, how to collect metrics. We'll talk about tracing. Then we'll talk about adding intelligence to that. So we'll talk about anomaly detection. Jared will talk about how to look at high cardinality metrics, and then also correlation and how to do that automatically and not have to mess around between tabs.
And then to add onto that, we'll talk about CloudWatch Investigations, Application Signals for APM, and then some specialized insights integration, which I will get into later. So now I'm going to hand over to Jared, and he's going to talk about how we do this at Amazon.
How Amazon Uses CloudWatch: Scale, Service Blueprints, and Instrumentation Frameworks
Hey folks, my name is Jared Nance, and I'm a principal engineer at CloudWatch. I'm going to talk a bit about how we use CloudWatch internally to monitor our own services. But to set some context, I want to take a moment to help you understand the scale of CloudWatch. So every month we're processing 20 quadrillion metric observations, 13 exabytes of log data, 455 billion traces, and we're running 861 million canaries every month.
Why is this relevant? Well, it gives you some insights into the scale of our own services, and we actually use CloudWatch to monitor these services. At Amazon, we've largely standardized on CloudWatch for new workloads. There are, of course, some older systems that don't use CloudWatch from the early days, but most of the new ones all depend on CloudWatch, and we use this across Amazon. And the way that we approach monitoring and observability really starts with how we organize our teams and our services.
As I'm sure a lot of you know, Amazon follows a DevOps model where we partition system complexity into services, and those services are operated by what we call two pizza teams. Let's start by looking at what happens when we create a new service. We have a set of blueprints that we use for common architectures, and a team can just create an instance of a blueprint with everything already set up.
So consider an API workload running on Amazon API Gateway with a Lambda function integration that might look like this. How do we ensure that this system is set up for success with CloudWatch? Well, for all of those vended services, the infrastructure as code templates that we use to provision them already configure things like execution logging from our Amazon API Gateways and active tracing from the gateway and the Lambda functions. Of course, you get all of those vended metrics for free from the services. And if you're running on AWS Lambda, it's automatically integrated with CloudWatch Logs.
But what about your custom telemetry? And when I talk about custom telemetry, what I'm really referring to is the workload or business specific instrumentation that you need from your services. All of the data that we need to know what this workload is doing, what it's interacting with, how many records it processed in a batch, this is all custom data that we can add into our telemetry. So whenever we create these blueprints, they come prewired with instrumentation frameworks. A basic Lambda handler like this would also come with an instrumentation framework that's preconfigured the way we need it.
Whenever we create an instrument within this function, it interoperates with all the other frameworks we're using, like our caching frameworks or our web frameworks. So here when we create the instrument, it'll automatically add things like the identity that's calling the function. It'll link trace IDs into our logs, and these are all things that as a developer I don't have to think about. Then we can go in and we can add our custom metrics and the labels that we want to include in our telemetry. We then pass the instrument factory to child operations so they can continue further instrumentation, carrying all of the context as it goes.
And interoperability here is super important. At Amazon, we standardized on our instrumentation frameworks a long time ago, but you can use tools like OpenTelemetry to get this kind of interoperability, so that the frameworks you use, whether they're caching frameworks or web frameworks, all emit telemetry in a consistent way and carry context as you add information into your telemetry. When we emit telemetry, we co-locate metrics, labels, and event data, which might look like this. This allows us to store high cardinality context in the spans while using metrics for things like alarms and dashboards. It also enables us to quickly answer questions like which requests failed or which customers are impacted, by running Logs Insights queries and using Contributor Insights.
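As an illustration of co-locating metrics and high-cardinality context in one event, here is a minimal sketch using CloudWatch Embedded Metric Format (EMF), which the speakers mention later. The latency metric is extracted into CloudWatch Metrics, while fields like the customer ID and trace ID stay in the log event for Logs Insights and Contributor Insights. The field and namespace names are hypothetical.

```python
# A minimal sketch of one EMF-structured event emitted from application code.
import json
import time

def emit_request_event(operation: str, latency_ms: float,
                       customer_id: str, trace_id: str) -> None:
    event = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "MyService",          # hypothetical namespace
                    "Dimensions": [["Operation"]],     # keep dimensions low-cardinality
                    "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
                }
            ],
        },
        "Operation": operation,
        "Latency": latency_ms,
        # High-cardinality context stays in the log event, not in metric dimensions.
        "customerId": customer_id,
        "traceId": trace_id,
    }
    # In Lambda, printing to stdout lands in CloudWatch Logs, and the metric
    # is extracted automatically from the EMF structure.
    print(json.dumps(event))

emit_request_event("GetWeather", 182.4, "customer-123",
                   "1-67891233-abcdef012345678912345678")
```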
Solving High Cardinality Challenges with Contributor Insights
One of the most challenging things in observability today is dealing with high cardinality data. With traditional monitoring systems, you may be emitting millions of unique metrics, which can hit limits in some existing systems, and the insights get buried in that cardinality explosion. Consider a service that just wants to emit a latency metric with dimensions like customer, API endpoint, and region: with 1,000 customers, 50 API endpoints, and 10 regions, that one metric explodes into 500,000 unique metric dimensions.
So Contributor Insights takes a different approach. It uses an automatic top-N analysis where we take those structured logs and we identify top contributors for the metrics in your data. It allows us to do real-time contributor ranking, and it's cost effective as it doesn't result in metric dimensionality and cardinality explosion. We offer a structured format for the data called Embedded Metric Format which allows this to happen automatically. So when you emit that span data, you have the structured data in your CloudWatch Logs and you have the metrics that you want extracted in your metrics all through a single event.
And we also have alarm integrations so you can actually alarm when you have a single contributor that is breaching some threshold for some metric. So this allows you to identify when just one of your customers is having a bad day.
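Below is a hedged sketch of creating a Contributor Insights rule over structured log events like the EMF example earlier, ranking customers by total latency. The log group name and JSON field paths ($.customerId, $.Latency) are hypothetical; adjust them to your own event schema.

```python
# A hedged sketch of a Contributor Insights rule: top customers by total latency.
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

rule_definition = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "LogGroupNames": ["/my-service/application"],   # hypothetical log group
    "LogFormat": "JSON",
    "Contribution": {
        "Keys": ["$.customerId"],    # who the contributor is
        "ValueOf": "$.Latency",      # what to aggregate per contributor
        "Filters": [],
    },
    "AggregateOn": "Sum",
}

cloudwatch.put_insight_rule(
    RuleName="top-customers-by-latency",
    RuleState="ENABLED",
    RuleDefinition=json.dumps(rule_definition),
)
```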
So this is the kind of graph that you might get if you're using Contributor Insights. Internally, we use this for all of our critical metrics. We've had cases where we see a small drop in availability and we realize that it's all caused by a single customer, and we've been able to work with those customers to identify issues that they may have introduced through deployments that may be impacting them. We actually use Contributor Insights to power the Personal Health Dashboards so that in a short amount of time we can identify during a large-scale event which customers are impacted and quickly notify you. This is the tool that we use to do that.
From Aggregated Metrics to Individual Traces: Cross-Account Visibility and AI-Driven Incident Response
So Contributor Insights allows us to go from an aggregated metric and identify the top contributors for that metric. But what if I want to go from an aggregate metric to a very small slice? Maybe I have one error, one request that's failing, and I want to see the full trace for that one request behind the aggregated metric. Suppose I have a graph for service availability that looks like this. This is an aggregated metric, and each sample in this data may contain many different actual requests, and I just want to get one of those requests to understand what's happening.
Because we're co-locating our metrics and our structured log data in the same events, we're able to run a Logs Insights query that looks like this, where I am first isolating the data by the API and filtering it to just the errored requests, and now I can get things like the trace ID and the exception message. But what about cross-account and cross-region? At AWS, we partition our workloads across accounts for region isolation. So whenever I deploy a service into a new region, it's entirely isolated into a different account. But how do we get visibility across all of these? Whenever an operator gets paged in, they know exactly which region and therefore they know which account to go into. If an operator group owns multiple service accounts, they can create a central monitoring account that aggregates the CloudWatch data across accounts and regions.
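Here is a minimal sketch of the kind of Logs Insights query just described: isolate one API, keep only errored requests, and pull out the trace ID and exception message. The log group and field names (operation, httpStatus, traceId, exceptionMessage) are hypothetical stand-ins for your own structured fields.

```python
# A minimal sketch: run a Logs Insights query and return the matching rows.
import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields @timestamp, traceId, exceptionMessage
| filter operation = "GetWeather" and httpStatus >= 500
| sort @timestamp desc
| limit 20
"""

def find_failed_requests(log_group: str, window_seconds: int = 3600):
    now = int(time.time())
    query_id = logs.start_query(
        logGroupNames=[log_group],
        startTime=now - window_seconds,
        endTime=now,
        queryString=QUERY,
    )["queryId"]

    # Poll until the query finishes, then return the result rows.
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            return response["results"]
        time.sleep(1)

rows = find_failed_requests("/my-service/application")  # hypothetical log group
```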
So now let's talk a little bit about incident detection and response and how we discover that there are issues that may be impacting our customers. Whenever we create those blueprints, they also come with predefined dashboards that we deploy through infrastructure as code into CloudWatch. Every week the service team will go through and review all of our customer experience metrics and ask questions about what's driving an increase in latency. This gives us a regular cadence to deep dive into issues, ask whether we're monitoring the right things, look at what kinds of regressions we're seeing in our services, and identify things that may need automated detection that we haven't configured yet.
But since we don't sit around staring at dashboards all day, we use alarms to notify us when something goes wrong. We don't want every single alarm triggering tickets and paging operators, because large events may trigger multiple issues, so we use composite alarms to actually interface with our teams. Whenever a child alarm goes off, it triggers a composite alarm, and that composite alarm creates a ticket in our incident management system. When that ticket gets cut, we automatically trigger a CloudWatch Investigation, which kicks off an AI-driven investigation that looks to identify the root cause, hopefully before the operator has even logged on to investigate the issue.
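As a hedged sketch of the alarm fan-in just described, the snippet below rolls several child alarms into one composite alarm, and only the composite alarm notifies the incident workflow (here via an SNS topic that a ticketing integration would subscribe to). The alarm names and topic ARN are hypothetical.

```python
# A hedged sketch: one composite alarm per customer-impacting condition.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_composite_alarm(
    AlarmName="checkout-service-customer-impact",
    # Fire when any of the child alarms is in ALARM state.
    AlarmRule=(
        'ALARM("checkout-availability-low") '
        'OR ALARM("checkout-latency-p99-high") '
        'OR ALARM("checkout-error-rate-high")'
    ),
    AlarmActions=[
        "arn:aws:sns:eu-west-1:123456789012:incident-tickets"  # hypothetical topic
    ],
    AlarmDescription="Single page/ticket per customer-impacting event for the checkout service.",
)
```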
We embed a link back to the investigation into the ticket so that when the operator logs on, they go directly into the ticket and then they're able to federate directly into the account that's having the issue, review the AI summary, and if needed, do further manual investigation. So with that, I'm going to hand it back to Alex to talk about these features in more detail as well as a few others.
Centralized Logging, Multi-Resource Alarms, and Transaction Search at Scale
Thanks, Jared. Okay, I'm going to talk about a bunch of things here that are going to help you do this at scale. The first is centralized logging. We introduced a new feature a couple of months ago, and it's filled a big gap: we've had the ability to centralize all of your monitoring, so your metrics and your traces, cross-account and cross-region in either a single monitoring account or multiple monitoring accounts for quite a while now, and the thing that was missing was the ability to do this with logs.
It's a thing that customers have asked for for a long time, and you can now do this. So if you weren't aware, we released this a couple of months ago, and you now have the ability to have a free copy of your logs in one central location, so one account, one region. And that means that you can then query all of your logs from one place.
So I think this is a really big deal. This is really exciting because this now allows you to have CloudWatch as your central destination for metrics, logs, and traces. And before this, it was a bit problematic with logs. So this is multi-account and multi-region. And as I said, this now allows you to do these unified log insights queries across accounts and across regions. And it also means you can have centralized retention policies to manage your costs as well.
Another thing that's difficult to do at scale is creating alarms, and this is something else that was released just a few months ago. Let me talk about the challenge first. Before this, you'd have to create an alarm per resource. Imagine you've got hundreds of thousands of containers and you want to set up an alarm for each one; even if you do this with infrastructure as code, it's still a bit of a pain. It also leads to inconsistent thresholds, with different teams setting different thresholds.
And obviously it creates alarm sprawl. You can end up with tens of thousands, hundreds of thousands, maybe even millions of alarms, and there's no centralized management. And then when you're looking at threshold tuning, it's reactive and different teams are doing different things with their thresholds. So this kind of creates a nightmare scenario where you have thousands of alarms but no confidence that they're actually measuring what matters.
So with Metric Insights, we've always had these SQL-like queries for your CloudWatch metrics. But there are two things we've added recently. One is tag-based filtering for vended metrics, which means you can use tags in the query. The biggest thing, I think, is multi-resource alarms. Say you've got 100,000 containers and you want to be alerted if the CPU on any of those containers goes above 80%. You don't have to create 100,000 alarms; you can create one alarm to do that. This gives you central control, and it also gives you a better view across your entire estate for trend analysis.
So this transforms how we can do alarm management and manage all of these individual resources. And this is what it looks like. It's the example I mentioned: I'm looking at the maximum memory utilization across, potentially, every single container that I have in my account. I don't have that many, so there are only about 20 or so there, but with one query I can immediately set an alarm covering all of those containers.
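Below is a hedged sketch of one "multi-resource" alarm built on a Metrics Insights query: a single alarm that covers every ECS service in the account by alarming on the maximum CPU utilization across all of them, rather than one alarm per resource. The alarm name and threshold are illustrative.

```python
# A hedged sketch: one alarm over a Metrics Insights query spanning all ECS services.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="any-ecs-service-cpu-above-80",
    Metrics=[
        {
            "Id": "q1",
            # Metrics Insights query: highest CPU across every cluster/service.
            "Expression": 'SELECT MAX(CPUUtilization) FROM SCHEMA("AWS/ECS", ClusterName, ServiceName)',
            "Period": 60,
            "ReturnData": True,
        }
    ],
    Threshold=80,
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    TreatMissingData="notBreaching",
)
```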
How many of you here are doing tracing? Okay, quite a few, maybe about a third to a half. And how many of you are having to sample your traces for cost? Yeah, quite a few. Transaction Search solves the problem of sampling: you can capture 100% of your traces rather than sampling maybe 1 to 5%. It also allows you to do real-time searching of those traces, so we can go through millions of traces in seconds. You can add custom attributes to your traces and have them optionally indexed. And this allows you to have correlation without gaps.
Now, if you do sampling, there's a problem.
Even if you set up tail sampling, which I can do, and say I'll keep 100% of my errors and sample maybe 5% of my successful requests, that still doesn't work, because you might have a successful request that you want to go and look at. Maybe it was technically successful, but there was actually a problem and you need to investigate it. The problem with sampling is that you either get an aggregation of what everything looks like, or you can look at the individual traces that happened to be sampled and see what some people had issues with. But if you've got a particular issue and it's not in your sample, then you don't have the chance to go and do that.
So what Transaction Search allows you to do is have every single trace and query across millions of these traces in seconds. You can filter by business context, things like a customer ID, a session ID, even feature flags; you can add those into your tracing, and they correlate with logs and metrics automatically. You can also export this trace data for things like machine learning analysis.
And you can do this search across up to 10,000 accounts, so that's going to work for most scenarios, I think. And this is what it looks like. Here I've just done a search on all of my traces and grouped them by status code, so I've got counts by status code. You'll see there's a button there called summarize results, which just uses AI to summarize the results of my query. You can do this in Logs Insights as well as in Transaction Search, and here it's given me a summary telling me which status codes I've got and how many of each.
Anomaly Detection Across Metrics, Logs, and Traces
So with anomaly detection, we have anomaly detection on all three pillars: metrics, logs, and traces.
With metrics, we get baselines that take seasonality into account, using the last 14 days' worth of data. We continuously adjust the model: if we create a new model that's better, we'll replace the old one, and if it's not better, we'll keep the old model. It supports custom metrics and vended metrics. And what it does is identify outliers.
And this is really useful if you've got metrics that are repeatable, so you know you have a busy pattern during the day and maybe it's quiet at night. Or you have a metric that keeps on rising or one that keeps on going down. Or maybe you're going into production for the first time and you have no idea what the baseline for that metric is going to be. So there are kind of four use cases for using anomaly detection for metrics.
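Here is a minimal sketch of alarming on an anomaly detection band rather than a static threshold, reusing the hypothetical business metric from earlier. CloudWatch builds the expected band from recent history (the session mentions the last 14 days), and this alarm fires when orders drop below the lower bound, since a drop is the interesting case for that metric.

```python
# A minimal sketch: anomaly detection band alarm on a business outcome metric.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-per-minute-anomaly",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyCompany/BusinessOutcomes",   # hypothetical
                    "MetricName": "OrdersPerMinute",
                    "Dimensions": [{"Name": "Environment", "Value": "prod"}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
        {
            "Id": "ad1",
            # Width of the expected band: 2 standard deviations.
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
            "ReturnData": True,
        },
    ],
    ThresholdMetricId="ad1",
    # Fire when orders fall below the expected band.
    ComparisonOperator="LessThanLowerThreshold",
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
)
```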
And then we've also got anomaly detection for logs. This is built on pattern detection in logs, which we introduced, I think, last year. You can automatically surface anomalies in logs just by turning it on for a log group, and there are five different anomaly types.
One is the frequency, is this happening more often, or is this happening less often? Or you might have a new pattern that's emerged, or a pattern that's just disappeared entirely. And you can run this continuously with log anomaly detection for your log group, or you can run it as a log insights query.
Another thing you can take away, if you just take away one Logs Insights query, unfortunately I don't have it on here, but if you run pattern @message | anomaly, it will basically give you a summary of all of the anomalies in your chosen log groups, and you can choose all of the log groups. So if you've centralized all of your logs into one account in one region, you could go into Logs Insights, write that pattern @message | anomaly query, and it would tell you all the anomalies in all of your logs in that time period. Really, really useful.
And then we've got traces. So for traces, it integrates with X-ray analytics and it looks at things like latency and error rates and dependencies on other services as well.
And I'll show you what these look like. So this is a typical metric that's quite steady and repeatable. And you can see here everything is fitting within that gray band. We can adjust the size of that gray band, and then you can alert on anything that goes either above or below, or just above or just below that band.
And this is log anomaly detection. Here it's detecting an error pattern that I've not previously had. You'll see that when we detect patterns in these logs, what we're basically doing is looking at the patterns and taking out the variables, which we call tokens. We've got token values here, and I've chosen to look at these token values because they're showing me the trace ID. So what I can do with this is look at this log anomaly and say, okay, I want to investigate this a bit more, and I can take those trace IDs, go and have a look at those traces, and see what's happened.
And because traces and log events are correlated, if you looked at the trace view of any of those traces, you'd see every single log event correlated with that trace as well. And this is what anomaly detection looks like for traces, in the X-Ray console. This gives a bit of added value as well: it tries to give you the description and the root cause of the issue, or the root cause service, and you can drill down a bit further into this to get more information.
Application Signals and Specialized Insights: Auto-Instrumentation for Full-Stack Visibility
Okay, so Application Signals. I've said more insights, less work, so you have to do a bit of work here, but not very much. Before Application Signals, and before OpenTelemetry I guess, you'd have to do manual code instrumentation: your developers would have to add code to your applications and do manual updates every release. You'd have inconsistencies in data collection, which Jared talked about earlier, and selective coverage in which libraries were instrumented.
But now with Application Signals, at least for Python, Java, .NET, and Node.js, you can have automatic instrumentation. And this uses OpenTelemetry, right, so this is using open standards. We're using OpenTelemetry to do the auto instrumentation, and then we're sending the traces to X-Ray and the metrics to CloudWatch. And it's an agent-based deployment. There's no code, it's just drop-in and plug-and-play, so it's just configuration. This is particularly easy to do in EKS because you just add an observability add-on to EKS, and then you just turn on Application Signals and that's it.
There's a little bit more work to do when you do it in ECS or EC2, but it's basically just deploying an agent and adding some configuration. And it's also built into Lambda as well as an option. And obviously this gives you much faster implementation. There's hardly any developer effort at all. It's easy to maintain. We maintain the agent for you. In EKS, we can update the agent for you as well. You get standardized telemetry across all of your applications, and this is what gives you that full-stack visibility without any effort.
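Here is a hedged sketch of turning on the EKS observability add-on mentioned above, which installs the CloudWatch agent and the auto-instrumentation machinery used by Application Signals. The cluster name is a hypothetical placeholder, and your workloads still need to be opted in to instrumentation injection per the Application Signals documentation.

```python
# A hedged sketch: install the CloudWatch observability add-on on an EKS cluster.
import boto3

eks = boto3.client("eks")

eks.create_addon(
    clusterName="weather-app-cluster",              # hypothetical cluster name
    addonName="amazon-cloudwatch-observability",    # the CloudWatch observability add-on
)
```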
Now I would say there is one exception where you might not want to use this. If you've got very latency-sensitive applications like high-frequency trading or something like that, you wouldn't want to use auto instrumentation at all from any vendor. Doesn't matter if it's CloudWatch or anything, don't use automatic instrumentation because there is a tiny little overhead. Because of the way auto instrumentation works, it looks at the code and it's analyzing it on the fly, so it adds a tiny bit of latency. But unless you're doing that, it's kind of a no-brainer to use.
And it's really hard to get a screenshot that shows you everything with Application Signals, but this is kind of the high-level overview you get.
Application Signals performs service discovery and creates the golden signals for you. It allows you to create SLOs based on those golden signals, so you can see the health of your services. This is a weather application, and you can see I've probably been a bit aggressive with my SLOs. It's actually not that bad, but it's showing that they're all unhealthy. You get services by fault rate, and when you go into these individual services, you'll see things like what dependencies they have, what services they're interacting with, and the SLOs for that individual service.
So this is better, more insights, no work at all. You have to turn it on, but that's about it. Container Insights can be used for EKS and ECS, and it gives you resource utilization right from the cluster level down to the pod level. It can give you metrics on container performance, cluster performance, information about deployments, and it integrates with Application Signals as well.
We have Database Insights. Again, something you can just turn on. Well, actually it's turned on by default, but there are two tiers to it. If you're running something serious in production, you probably want to turn on the next tier. By default you'll just get the basic tier. It gives you performance monitoring for RDS, analysis of your queries, things like connection pool tracking, and it can also give you performance recommendations. So it can proactively give you recommendations on what you should do with your database.
And we've got Lambda Insights. Again, you can just turn this on and this gives you function-level metrics, things like cold start analysis, memory and CPU utilization, network, and this integrates with tracing as well. I'll just show you what this looks like. So this is what Container Insights looks like. I've got a view of one particular service there. I'm just running two pods and I can see all my metrics for that individual service.
For Database Insights, there's lots of information in Database Insights. This particular database is not very busy, but we can see the top SQL there. And for Lambda Insights, you can have a multi-function view or a single function view. Here I've got a single function view. This is one of my services which goes and gets the current weather. We can see things like invocations, duration, memory utilization, CPU utilization. We can also get a view of the last 1000 invocations, we can link directly to application logs, and there's also a link directly to Application Insights there as well.
AI-Powered Observability: CloudWatch Investigations and MCP Servers
And now, so we've gone from doing a little bit of work to doing no work, and now we're going to let AI do the work for you. So we've got anomaly detection, as I've talked about, across metrics, logs and traces. And we can do a correlation of all of these resources using CloudWatch Investigations. So I'm going to show you quickly what this looks like. This gives you a visual mapping of the causal relationships between your services, and it gives you root cause analysis with natural language. It can even recommend runbooks as well.
This has a scope that can cross multiple accounts: you run it in one account, but it can take in the scope of multiple accounts. It can correlate with change events, so as well as metrics, logs, and traces, it looks at things like CloudTrail. As I said, it gives you that visual overview of the scope of your investigation. You can share the findings of the investigation with your teams, and obviously there's programmatic access as well.
So like Jared mentioned, we use it internally. We have an alarm, we create an investigation. You can trigger an investigation based on a CloudWatch alarm, and then obviously you can use the APIs to then send that information wherever you want. So yes, you can integrate it with your incident workflows. And I suggest that if you do use this, do it in a similar way to how we do. Integrate this with your ITSM processes. If you're using ServiceNow for example, put this into your ServiceNow tickets.
So this is an example of what a CloudWatch investigation looks like. It's telling me what happened, it's giving me the evidence and the likely causes, and it's given me that visual overview. There's some reasoning below this, and don't worry, I'm going to go through what I did.
I was basically looking at this metric, and I noticed on the left-hand side where it says impact start, there was a bit of an increase in the duration of my Lambda function. My Lambda function was running for nearly six seconds instead of roughly just under four seconds, so I asked it to investigate. And this is the root cause summary. Essentially, Bedrock was throttling me because I'd reached my rate limits, and my retry logic was failing because of these repeated failed calls to Bedrock. It was telling me my function duration went from four seconds to six seconds. So it's told me the root cause, and it would tell me then to obviously go and increase my Bedrock limits.
We've also got MCP servers for CloudWatch. We've got two of them, in fact. There's the CloudWatch MCP server, which allows you to use natural language to do things like look at metrics, logs, and traces for you, and there's the Application Signals MCP server, which can look at all the data in Application Signals. Using both of them together is probably the best way of doing it. These are the kinds of natural language questions you might ask at the top here, and this is one of the things I did ask. So here I'm using Kiro.
In Kiro, I've got these MCP servers set up, and I've just said my application is deployed in eu-west-1. Kiro will have context of the code base once I include #codebase. I asked it to show me CPU spikes in production in the last hour. Because it's got my code base and I've got the MCP servers, it knows what resources to look for; it's not going to go and find stuff that I don't care about that's not related to this application. You can see here there's logging information and so on, but that's not really important for now. What I want to show you is the output.
The output here is just showing I had some spikes in CPU utilization for my EC2 instances. It said critical spikes detected, though I would argue they're probably not critical. My EC2 instances spiked to like thirty percent maximum. Then I've got some EKS pods, some ECS services, and it's given me a summary of that. At the end it said the good news is your application and other critical services remain stable throughout. But all of that, it's gone and done for me just with that one little prompt. So this is really nice if you're able to use MCP servers. Have a look at the CloudWatch MCP server and the Application Signals MCP server. They can be really useful just to go and dig into things as well as using CloudWatch Investigations.
Implementation Roadmap: From Fundamentals to Intelligence and Optimization
We've talked about lots of different things, logs, metrics, traces. These are the fundamentals before you can build up to using those AI tools in a meaningful way. You have to set the foundation. Configure a monitoring account so you've got everything in one place. Set up log centralization again so you've got everything in one place. Make sure you've got retention policies that reflect your organization's requirements, things like compliance requirements as well.
Deploy an alerting framework, and by that I mean the kind of thing that Jared talked about: when you have a CloudWatch alarm, follow a standard pattern, which might involve triggering a CloudWatch investigation and opening a ticket in your system in the way that Jared described. And make sure you have tagging on your resources, especially now that Metric Insights can run queries based on tags, at least for vended metrics right now. That's super useful.
Another thing I've not mentioned on here, I should have mentioned it probably, is create standards. Create standards for logs, create standards for metrics. You tell all of your teams that they should be logging in JSON format. These are the kind of fields that we want in every single application or service. Things like who owns it, what application is it part of, what environment is it in. All of these things are super important.
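As an illustration of that kind of logging standard, here is a minimal sketch where every service emits JSON log lines with a shared set of fields. The exact field names (owner, application, environment) are illustrative; the point is that every team uses the same ones.

```python
# A minimal sketch: JSON logging with a standard set of required fields.
import json
import logging
import sys

REQUIRED_FIELDS = {
    "owner": "payments-team",       # who owns this service
    "application": "checkout",      # which application it belongs to
    "environment": "prod",          # which environment it runs in
}

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            **REQUIRED_FIELDS,
        }
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed")   # emits one JSON line with the standard fields
```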
And then we come onto insights. On the left we've got the fundamentals; the insights are really simple to deploy. They obviously come at a cost, but in terms of ease of deployment there's no reason not to do these. Container Insights, just a tick. Database Insights, same thing. Lambda Insights, same thing. Obviously you can do this with infrastructure as code as well as through the console. Have CloudWatch agents on EC2. How many of you have not got an agent on EC2? Any kind of agent? Good, well that's good news. Sometimes I speak to customers and they've got no agents running on their EC2 instances at all.
Now the advantage of having the CloudWatch agent on EC2 is obviously that gives you everything in one place. It also means that you can use Compute Optimizer to right size your EC2 instances as well. If you use Compute Optimizer and you're not using the CloudWatch agents, then you will only get CPU and networking metrics, and who wants to right size their EC2 instances without taking into account memory. Enable network monitoring, we've got some really nice tools in CloudWatch to monitor networking. I think very recently there was something announced with monitoring EKS as well that's built into CloudWatch.
We haven't really had time to talk about this in this short space of time, but have a look at Internet Monitor; that's a really nice thing to explore to see the kind of thing you can do. And then start to look at application performance. If you've got applications written in Python, Java, Node.js, or .NET, have a look at Application Signals. Try it on one of your workloads in development and just see what it gives you, because it's very little effort and it gives you lots of information.
If you don't want to do sampling and you're worried about the cost of tracing everything, enable Transaction Search. Again, that's super useful; it means you're not going to lose any data. Make use of Application Signals for your SLOs. But even if you're not using Application Signals, and this isn't immediately obvious, if you go into the CloudWatch console under application performance monitoring and look at SLOs, you can actually create SLOs on any CloudWatch metric. It doesn't have to be related to Application Signals.
You can enable Real User Monitoring for your web applications, and you can create synthetic canaries. Synthetic canaries are really, really useful. You can run them as often as every minute, and you can either do a simple heartbeat check or run through a complex workflow. You can run them on a schedule or on demand, so they might also be useful for things like smoke testing: you create a bunch of smoke tests using synthetic canaries and run them on demand at the end of your pipeline.
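Below is a hedged sketch of creating a simple heartbeat canary that runs every minute. The bucket, role ARN, script location, and runtime version are hypothetical placeholders; the canary script itself would be uploaded to S3 separately, and you should pick a Synthetics runtime version that is currently supported.

```python
# A hedged sketch: a once-per-minute heartbeat canary.
import boto3

synthetics = boto3.client("synthetics")

synthetics.create_canary(
    Name="homepage-heartbeat",
    Code={
        "S3Bucket": "my-canary-scripts",        # hypothetical bucket holding the script zip
        "S3Key": "heartbeat.zip",
        "Handler": "heartbeat.handler",
    },
    ArtifactS3Location="s3://my-canary-artifacts/homepage-heartbeat",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/canary-execution-role",  # hypothetical
    Schedule={"Expression": "rate(1 minute)"},  # canaries can run as often as every minute
    RuntimeVersion="syn-nodejs-puppeteer-9.1",  # hypothetical; use a supported runtime
)
```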
And then add intelligence. Use Contributor Insights. Who uses Contributor Insights here? No one? That's really sad. I think Contributor Insights is one of the most powerful tools we have. I'm not going to go into it again, Jared talked about it, but have a look at Contributor Insights because it's really, really powerful. Use Metric Insights for your alarms. Have a look at metric anomaly detection and log anomaly detection, and also look at how you can use pattern analysis when you're running queries in Logs Insights.
Once you've got all of this, then you can start to really make use of CloudWatch Investigations properly.
Now you can use CloudWatch Investigations when you haven't done all of this, but to really get the best out of CloudWatch Investigations, obviously like anything AI-powered, the more information it has, the better it's going to be at getting to the right answer. You can create runbooks in Systems Manager. How many of you create runbooks in Systems Manager? Yeah, a few of you.
When you have an alarm triggered, you can choose to run a Systems Manager automation document if you've got something repeatable that you can fix with an automation. One really simple example: if you've got an EC2 instance, you don't want to get woken up at 2 o'clock in the morning just because the disk is full, right? There's an automation document already built into Systems Manager where, if there's an alarm for disk space being full, you just trigger the document. It will expand the EBS volume, then expand the OS volume into the EBS volume, and you can investigate in the morning when you get into work rather than being woken up at 2 o'clock in the morning.
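Here is a hedged sketch of one way to wire that pattern up: an EventBridge rule that matches a specific disk-usage alarm going into ALARM state and triggers a Systems Manager automation runbook instead of paging anyone. The alarm name, runbook name, and role ARN are hypothetical; substitute the actual remediation document you use for expanding the volume.

```python
# A hedged sketch: alarm state change -> EventBridge rule -> SSM automation target.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="disk-full-auto-remediation",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": ["web-fleet-disk-usage-high"],   # hypothetical alarm
            "state": {"value": ["ALARM"]},
        },
    }),
)

events.put_targets(
    Rule="disk-full-auto-remediation",
    Targets=[
        {
            "Id": "expand-volume-runbook",
            # Hypothetical automation document that grows the EBS volume and filesystem.
            "Arn": "arn:aws:ssm:eu-west-1:123456789012:automation-definition/ExpandRootVolume:$DEFAULT",
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-ssm-automation",  # hypothetical
        }
    ],
)
```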
We've got AWS Chatbot as well, so you can use that to send information to and from Slack and Teams. It's really important to set up these incident workflows. Please don't be in a situation where you just send your alarms straight to Slack or Teams and you don't have any proper incident management workflow.
And then after you've done all of this, that's when you can begin to optimize the cost and performance. Maybe then consider using composite alarms. Composite alarms allow you to create a tree of alarms using boolean logic. You can use things like Embedded Metric Format, and you can start to optimize things like your log retention. Also, you can optimize metrics, and for tracing as well, you can optimize your sampling, especially if you're using Transaction Search.
But that's it. Thank you very much for coming. I hope you've enjoyed your time here. Please fill in the survey on the app. Thank you very much.
This article is entirely auto-generated using Amazon Bedrock.