Kazuya

AWS re:Invent 2025 - Drive operational excellence for modern applications (COP327)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Drive operational excellence for modern applications (COP327)

In this video, AWS observability-focused solutions architects discuss operational excellence for modern distributed applications. They cover essential observability must-haves including business and technical metrics, the importance of standardized naming conventions and RED metrics (requests, errors, duration), and using tags and dimensions for context. The session demonstrates distributed tracing with OpenTelemetry, showing auto-instrumentation requiring no code changes and manual instrumentation for business context. Key features include Application Signals in CloudWatch for automatic service discovery, correlated metrics and traces, SLO management, and transaction search. A live demo walks through troubleshooting using service maps, trace analysis, and semantic conventions. The speakers explain sampling strategies including head sampling, tail sampling, and X-Ray sampling rules to control telemetry volume. They showcase specialized tools like Container Insights, Network Flow Monitor, and Database Insights for filling observability gaps. The session concludes with AI-powered features including CloudWatch Investigations and MCP servers for accelerating incident detection and resolution, emphasizing continuous improvement through correction of errors processes.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

The Challenge of Observing Modern Distributed Applications

Good afternoon. Please put your headphones on. If you can hear me, please put your hands up. If you can't hear me, there are people here to help you, so make sure you flag them down.

Again, put your hands up if you have ever found yourself frantically searching through a sea of logs and disconnected telemetry to pinpoint the root cause of an outage in your distributed application. Most of us have experienced this. Some very lucky people haven't gone through that yet. This reminds me of an incident from when I was an AWS customer and an SRE. I looked after a financial application, and we had an outage where no new logons would take place. All the existing sessions were fine. The authentication services looked fine, everything looked fine, but we couldn't have new people log on and we had unhappy customers. My manager was asking me when this was going to be fixed and if I could jump on the bridge to tell them more about the problem, all while we were trying to figure out what was going on. Long story short, it was an expired client certificate. But as you know, that could have been anything else. Sometimes you just don't know what you don't know.

Thumbnail 120

But the good news is that with the right tools and techniques, you can make your life easier in these situations. As observability-focused solutions architects, my colleague Helen and I help customers drive operational excellence for modern applications, and that's what we're going to cover today. For the agenda of this session, we're going to first very quickly cover challenges with modern applications, and then we're going to move on to must-haves for operational excellence. We're then going to dive deeper into distributed tracing and OpenTelemetry, then we're going to do a demo. After that, we're going to show you how to fill in the gaps in your telemetry, and we're going to summarize the session for you.

Thumbnail 150

Once upon a time, there was a monolithic application. It was hard to scale and had lots of tightly coupled dependencies. Even one bug would mean redeploying the whole thing, and it made it very hard for teams to work independently. That's why the applications we build these days are modern applications: they run on microservices, serverless technology, and short-lived resources, and they give us the agility and flexibility that we were missing.

Thumbnail 190

Do we miss anything about monoliths, Helen? No, not really. I certainly don't miss how long we had to wait to get a single bug fix to our customers. I certainly don't miss how much downtime we had to have to deploy those changes, or how often we had to jump through hoops if we had to extend that window. I really don't miss the stress I saw my team go through every time they had to go through one of those deployment days, so no, I don't miss it at all.

Thumbnail 230

Yeah, I get it, and ultimately I don't miss monoliths and I don't miss everything that Helen just covered, but there's one thing that I do miss. This is a KVM, and I don't miss working in a data center, but what I do miss is the fact that if there was a problem with one of the applications, I would walk into the data center, pull out the KVM, and all the data that I needed would be in one place. If I really wanted to know what was happening on the database server or the reverse proxy, I'd change the KVM input and it would be there. Now if you think back on the example I used at the beginning of the presentation, I had disconnected telemetry everywhere. I didn't have a clue which services had direct dependencies on each other.

Thumbnail 300

So again, I do miss that part, and only that part, about monoliths that were simpler to observe. But like we said, you can make modern applications more operationally excellent with tools and techniques. We're going to start with must-haves. So let's take a look at some of the things we consider to be really key for your operational excellence.

Thumbnail 310

Must-Haves for Operational Excellence: Starting with Metrics and Context

Metrics are the best place to start. Always. If you've ever listened to an AWS principal engineer talk about how they observe Amazon at scale, you'll know that they are metric obsessed. Business metrics are the best place to start. The specific metrics are going to be different for every business. What is important is that they tell you whether your workload and application are performing for your business and, more importantly, when they're not and your business is impaired.

Thumbnail 350

Technical metrics are important too, but we need to spend more time asking ourselves why. Let's consider load balancer latency. What is the impact of a change in that latency on your business? Is it a backend load balancer serving a batch job overnight? What difference does a change in latency make to your business there? Or is it a customer-facing load balancer, where the smallest change in latency can lead to customer frustration, and your business is going to want that resolved really quickly?

Thumbnail 410

The important thing is to be deliberate with your technical metrics and keep asking why until you get back to the business KPI that the technical metric maps to. The next thing is to standardize the way you name your metrics and what you collect, so that no matter where you are in that distributed application, you have the same view. Metrics like availability, requests, errors, and duration are what we call the golden metrics, or the RED metrics. These are really valuable for you.

Why? Well, they help you understand how to measure your application health through the existence or absence of errors. They also help you measure the quality of responses through availability and latency. And they help you with those unknown unknowns that Anya talked about, because they're not measuring a very specific piece of functionality; they're measuring the impact of the experience.

Thumbnail 460

Finally, sharing is caring. Share your dashboards and information about your service with your colleagues. When you design these dashboards, think about what your colleagues may need to know to understand whether your service is operating, and so that they can also understand if there are any issues impacting their service or their KPIs. Let them be independent.

Thumbnail 490

Thumbnail 500

I've talked a little bit about business metrics and golden metrics. In CloudWatch, these are called custom metrics. They are metrics where you define the metric and the KPIs yourself, and you have full control over the metadata that shapes your metrics. In CloudWatch that metadata is called dimensions; in Prometheus it's called labels. What's important is that this is information to help you make sense of your metrics. Things like operation or service, as you see here, but it might be your application name, your project, anything that helps you get business, organization, or implementation context.
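As a rough illustration of what publishing a custom metric with dimensions can look like, here is a minimal sketch using boto3; the namespace, metric name, and dimension values are made up for this example.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a business metric and attach the context as dimensions.
# Namespace, metric name, and dimension values are illustrative.
cloudwatch.put_metric_data(
    Namespace="PetClinic",
    MetricData=[
        {
            "MetricName": "OrdersPlaced",
            "Dimensions": [
                {"Name": "Service", "Value": "checkout"},
                {"Name": "Operation", "Value": "POST /orders"},
                {"Name": "Environment", "Value": "prod"},
            ],
            "Value": 1,
            "Unit": "Count",
        }
    ],
)
```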

So that's great for custom metrics. I can add metadata with my dimensions. But what about AWS vended metrics, the out-of-the-box ones that come with EC2 or Lambda, for example? I can't add dimensions to those. I don't have that control. Well, if you've ever read a paper on operational excellence, you'll have embraced the advice to tag everything. Tags should be an integral part of your infrastructure. The good news is that CloudWatch allows you to query the metric data for those vended metrics using your resource tags with Metrics Insights.

Thumbnail 590

For example, here we're looking at load balancer latency and we're building a Metrics Insights query. I can access the application tag name and value in my query. I can use them in both the filter and the group by so that I can get at the right data. I'll show you a bit more on this in a minute.

Thumbnail 630

Continuous Improvement Through Correction of Errors and Metrics Insights

So we've talked about starting with metrics and tags and adding context. But what next? Observability is a journey, not a destination. It is a journey of continuous improvement. At Amazon, we have a process called correction of errors. Now there are public blog posts about that, so I'm not going to go into all the details. But I want to highlight a few things to you. First of all, it's a mechanism to power continuous improvement. It is a post-event analysis of your issues that focuses on key questions to help you improve. It is not a documentation of failures. It's designed to push improvement.

Thumbnail 670

Thumbnail 690

And there are two really important questions that we ask when we do correction of errors. The first one is how can I reduce my detection time? How long did it take you to know that there was an issue? Now have a think, would some extra data help you get there sooner? Do you need to tune your alarms or add some extra ones so that you can reduce that time? And the second area you should be focusing on is resolution time, that time that it takes you to find and correct the issue. What else would have helped? Would some additional context, the dimensions or the tags have helped you understand and resolve quicker? Is this a time where you actually need some additional data?

Thumbnail 740

And it's always a good time to revisit your dashboards and your runbooks and make three o'clock in the morning life a little bit better. OK, I said I'd come back to Metrics Insights and alarms. Once we have the important metrics and tags, we need to know when there's an issue. To do this, we need alarms on the right metrics, with the right notifications and the right information so that we can act. And to do this effectively, ideally I want one alarm for many metrics. Here, Metrics Insights is your friend.

Thumbnail 760

So I've talked about dimensions and tags, and here we can see an example of using these in a Metrics Insights query. You can see that I'm controlling which data to look at with the WHERE clause, using the environment tag that is on my CloudWatch Synthetics canaries in this case. I only want the canaries that are tagged with environment prod. And in my GROUP BY, I'm breaking things out by canary name and by my service tag, so I can have control over the data that I return.

The more control I have, the more I can make my alarms the right scope, and I don't need individual alarms for everything. We also have notification control. I can choose to aggregate the results of that query into one single series, and then I get a single notification any time that aggregated series breaches my threshold. Or I can choose to keep my series independent, so that every canary-name series that breaches the threshold will send me an individual notification.
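To make this concrete, here is a hedged sketch of creating one alarm over a Metrics Insights query with boto3. The query shown aggregates across the CanaryName dimension; the tag-based WHERE and GROUP BY from the slide would be built the same way in the console query editor, and the alarm name and description are made up.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# One alarm for many metrics: a Metrics Insights query aggregated into a
# single series. The tag-based filtering shown in the session slots into
# the same SELECT statement.
QUERY = 'SELECT SUM(Failed) FROM SCHEMA("CloudWatchSynthetics", CanaryName)'

cloudwatch.put_metric_alarm(
    AlarmName="prod-canaries-failed",
    AlarmDescription="Canary failures in prod. Runbook and dashboard links go here.",
    Metrics=[
        {
            "Id": "q1",
            "Expression": QUERY,
            "Period": 300,
            "ReturnData": True,
        }
    ],
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```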

Thumbnail 850

The group by information here is what I get in my notification. So if we have a look at a subset here of an alarm event, you will see I have the values for the canary name and the service tag, which is what I had in my group by. So I can control what I query and what information I get, so that I can act either manually or send this automatically to a Lambda function or to Systems Manager. And then I have that information available to those tools to take the right actions.

Remember here, don't ignore your alarm description. We like context. Please use it to put in information about your alarm and links to your dashboards and your documentation. Again, we're back to making life happier at three o'clock in the morning. OK, so let's move on from metrics and alarms and let Anya take us into the world of tracing. Thank you, Helen.

Thumbnail 910

Why Distributed Tracing with OpenTelemetry Is Inevitable

I'm often really surprised when I talk to customers who tell me they are having observability challenges with distributed applications and then they tell me that they're not tracing. The thing is, tracing is inevitable when it comes to distributed applications. It also speeds up your time to resolution and improves your observability when you are navigating a distributed application.

Thumbnail 940

You're going to need a map. If you want to know whether the problem is upstream or downstream, or where the errors or latency are coming from, you're going to need shortcuts. Which again means that tracing is inevitable, and that alone is important for operational excellence. Now, do bear in mind that there are still some components that are hard to trace, like networking. Later on in this session, we cover how to fill the gaps in your telemetry, but over the next few slides, we're going to focus on driving operational excellence with distributed tracing.

Thumbnail 980

One of the reasons my customers tell me that they're not tracing is because they think it's going to be really complicated and it's going to involve a lot of code changes, and that doesn't necessarily have to be the case. It is recommended when it comes to tracing that you start with OpenTelemetry. OpenTelemetry represents a vendor-neutral, open standard for implementing observability. It gives you that unified approach for collecting telemetry data.

Thumbnail 1040

Whilst OpenTelemetry has reached stability for traces and metrics, with logs still maturing, the underlying concepts of distributed tracing are well established and tested. What matters more for distributed applications is the interoperability that you get with OpenTelemetry. Now, OpenTelemetry also supports auto instrumentation. Auto instrumentation wraps itself around your application and listens for calls to common system components: incoming and outgoing HTTP calls, database requests, and database queries. It then generates telemetry and sends it to your trace backend, and that means no code changes as well.

Thumbnail 1070

Thumbnail 1090

This is a really simple example of how to instrument a Python app with OpenTelemetry. All I am doing there is installing and configuring the Python OTel libraries. As you can see, I'm showing you no code because there's absolutely no code change required. In many cases, auto instrumentation will be enough. In the example on the screen, the Flask route, the HTTP calls, and the database calls will be auto instrumented with no code change.

But Helen already mentioned how important business data is, data specific to you, and that also applies to your span and trace data. So there are times where you might need to do a little bit of manual instrumentation. In my example, I added things like the user tier and the order number. They are hard coded just for the ease of the demo, but obviously you'd get them dynamically. This is the part of manual instrumentation that you might have to do. It's a couple of lines of code, and if you're not sure how to get started, you could use an agentic IDE like Kiro, for example.
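A minimal sketch of that manual instrumentation in Python, assuming the OpenTelemetry SDK is already configured; the attribute names and values are illustrative, just as in the demo.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def checkout(order):
    # Auto instrumentation already covers the HTTP and database calls;
    # here we only add the business context to the span.
    with tracer.start_as_current_span("checkout") as span:
        # Hard coded as in the demo; in real code these come from the request.
        span.set_attribute("app.user.tier", "premium")
        span.set_attribute("app.order.number", "12345")
        # ... existing business logic ...
```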

Thumbnail 1160

Thumbnail 1170

Another important aspect of tracing is correlation and context. We already just covered that business context that is important to record. It's really relevant in situations where you're troubleshooting. But there are other lenses here as well. We've got telemetry correlation, for example, metric to trace correlation or trace to log correlation. I'm using Python here. In this example, I'm using the log instrumentation in Python and I'm setting the logging format to true. All that means is that the log formatter will now automatically add the trace ID and the span ID to my log data. So that means if I'm frantically searching through a sea of disconnected logs, I can now correlate them back to the traces.
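A hedged sketch of that logging setup, assuming the opentelemetry-instrumentation-logging package; the exact field names injected into the format may vary by version.

```python
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# set_logging_format=True installs a default log format that includes the
# current trace ID and span ID on every record.
LoggingInstrumentor().instrument(set_logging_format=True)

logging.getLogger(__name__).warning("payment submitted")
# The record now carries otelTraceID and otelSpanID alongside the message.
```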

Thumbnail 1220

Thumbnail 1270

We talked about needing shortcuts, and that's one way to get a shortcut. The other part is the technical context, which is really important when you're troubleshooting. It's really important to know where the telemetry is coming from: what instance ID, which host, what server address. With OpenTelemetry, you have the concept of semantic conventions, which not only give you a set naming structure but also mean that you can understand the meaning of the data. You have general attributes such as host ID, server address, client address, and so on, and you also have cloud provider attributes like availability zone, AWS account, and so forth. You can record these yourself, and often instrumentation will populate some of them for you as well.
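As a rough sketch, recording a few of those attributes yourself on the tracer's resource might look like this; the values are placeholders, and in practice resource detectors and auto instrumentation fill many of them in for you.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Attribute keys follow the OpenTelemetry semantic conventions, so any
# backend knows what they mean. Values here are placeholders.
resource = Resource.create(
    {
        "service.name": "payment-service",
        "cloud.provider": "aws",
        "cloud.availability_zone": "us-east-1a",
        "host.id": "i-0123456789abcdef0",
    }
)

trace.set_tracer_provider(TracerProvider(resource=resource))
```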

Thumbnail 1320

Thumbnail 1330

Now, sometimes you might need to propagate the context, both the business and the technical context. One way you can do that is by using baggage. With baggage, you can set the attribute once and it's available everywhere. However, it is really important to remember that this isn't for business logic. It doesn't replace caching. This is for telemetry data. In the example, the checkout service will add the order ID to baggage. If later down the line the payment service wants to record the order ID, it can retrieve that from baggage as well. I'm using Python, but the services could be written in different languages and that would still apply, so that's again really useful.
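A minimal sketch of that flow in Python; the baggage key is illustrative, and in a real system the two halves would live in different services with the context propagated over HTTP.

```python
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer(__name__)

# Checkout service: put the order ID into baggage once.
ctx = baggage.set_baggage("app.order.id", "12345")
token = context.attach(ctx)

# ... the context (and its baggage) is propagated with the outgoing call ...

# Payment service: read it back and record it on its own span.
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("app.order.id", baggage.get_baggage("app.order.id"))

context.detach(token)
```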

Thumbnail 1350

Thumbnail 1370

Controlling Telemetry Volume: Sampling Strategies and Application Signals

But you might be thinking, what about cost and what about the volume of the telemetry? Sampling is how you control the volume of telemetry. Probably the easiest way to sample would be with head sampling at source. Basically, I'm just setting that 10 percent of all traces are to be sampled. But the caveat is that you need to be careful because if you sample too low, then you might miss some data, so you need to fine-tune that. Another option for sampling is tail sampling, where you get a bit more control.
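Before we get to tail sampling, here is a rough sketch of that head sampling configuration in Python, assuming the OpenTelemetry SDK; the 10 percent figure mirrors the slide.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10 percent of new traces at the source, and follow the decision
# already made by the parent span for downstream services.
sampler = ParentBased(TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

# With auto instrumentation the same thing can be set without code:
#   OTEL_TRACES_SAMPLER=parentbased_traceidratio
#   OTEL_TRACES_SAMPLER_ARG=0.1
```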

Thumbnail 1390

Thumbnail 1430

With tail sampling, in this example, I'm actually telling the code to send all the traces over to my collector. Here is my collector config. I've added a tail sampling processor, and you can see that in the configuration I have asked the processor to sample 100 percent of all traces that have errors, 100 percent of all traces that last longer than a second, but only 5 percent of all other traces. This is how I can get more granularity. But there is one caveat: your tail sampling processor sits between your code and your traces backend, so that does add some operational considerations for you. It needs to be available and needs to have enough memory, but that's just something to bear in mind. It's certainly an option.

Thumbnail 1480

Another way to control sampling would be to set rules on the AWS X-Ray side of things. You can create sampling rules, and then your collector will make decisions based on them. Now X-Ray supports adaptive sampling, which means it will sample extra traces under certain conditions that you have defined. In this example, extra traces will be sampled using the sampling boost in case of 500s, and we are also using anomaly trace capture, so all the spans with 501s will be captured. But I can also limit that if I want to, which is the setting at the bottom.
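For reference, a basic X-Ray sampling rule could be created like this (a hedged sketch with boto3; the rule name and service name are placeholders, and the adaptive sampling boost and anomaly capture settings from the slide are configured on top of rules like this and aren't shown here).

```python
import boto3

xray = boto3.client("xray")

xray.create_sampling_rule(
    SamplingRule={
        "RuleName": "payment-service-baseline",
        "Priority": 100,
        "ReservoirSize": 1,   # keep roughly one trace per second
        "FixedRate": 0.05,    # then sample 5 percent of the rest
        "ServiceName": "payment-service",
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": "*",
        "ResourceARN": "*",
        "Version": 1,
    }
)
```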

Now, this next one is not actually sampling, but another way that you can choose to analyze 100 percent of your span data. What you can do is configure transaction search. You can then also choose to index some of those traces in X-Ray: 1 percent, 10 percent, whatever you need. So with transaction search, you can run span analysis on all your span data and then decide how many traces you index in X-Ray.

Thumbnail 1540

The other thing about transaction search is that the span data gets saved to a log group called aws/spans, which means you can run further analysis on that data. This is really handy, and you can choose to keep the data for longer as well, so that's one way to control how you analyze your spans. Now, Helen mentioned RED metrics, availability, errors, and so on, and we've just covered traces.
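As a quick aside, since that span data lives in a log group, you could also query it with CloudWatch Logs Insights. Here is a hedged sketch with boto3; the query itself is illustrative and the fields available depend on your instrumentation.

```python
import time

import boto3

logs = boto3.client("logs")

# Look for recent spans mentioning the manually recorded owner ID.
query = """
fields @timestamp, @message
| filter @message like /owner.id/
| sort @timestamp desc
| limit 20
"""

now = int(time.time())
started = logs.start_query(
    logGroupName="aws/spans",
    startTime=now - 3600,
    endTime=now,
    queryString=query,
)

# Poll for results (simplified; production code should check the query status).
time.sleep(2)
results = logs.get_query_results(queryId=started["queryId"])
```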

Thumbnail 1600

One way to implement what we've covered would be to use Application Signals in CloudWatch. Application Signals uses auto instrumentation to instrument your applications. It then auto discovers dependencies between your services. You get correlated metrics and traces, and I'll show you that in a minute: if there's a spike in errors, you now get the correlated trace that you can look at. You also get SLO management, so we talked about KPIs; this is how you can track your SLAs, for example. You also get burn rate metrics, so you can actually be alerted when you're on your way to failing an SLA rather than after it's happened.

Thumbnail 1620

You'll also get the application map that I talked about. You get transaction search and you get all the CloudWatch integrations. I think it's a good time to switch over to the console and cover some of the concepts that we've talked about so far and demo some of the Application Signals.

Thumbnail 1680

Live Demo: Troubleshooting with Application Signals and CloudWatch

I am now in my CloudWatch console. I clicked on Application Signals on the left-hand side and I selected Application Map. Here I have a list of the applications that I'm looking after. Straight away on the left-hand side, one of the boxes is slightly red and it has an SLI breach, so that signals to me that there's a problem in that service. Let's park that for just a minute. We'll go back to it in a moment. You can see that I've got other applications here, and there are also three here that have a dotted line around them.

Thumbnail 1700

What this means is that these are uninstrumented, so Application Signals is signaling to me that I might have gaps in telemetry. That's just really useful to know as well. Let me double-click on one of my more complex services, which is the Pet Clinic frontend. I'm going to double-click it.

Thumbnail 1720

I don't expect you to see the entire map. This is a lot of detail, but as you can see, there are a lot of dependencies. It's talking to a lot of services. We're talking to Bedrock there. We're talking to all sorts of different services. I'm going to scroll around a bit so you can see it. I'm going to zoom back out again.

Thumbnail 1730

We talked about shortcuts, right? Straight away I can see a shortcut. There's a circle that's half red. There's clearly something not right with one of the services that the frontend talks to. If I hover my mouse over, there's a 45% fault rate. So again, we talked about shortcuts. That's one way to get them. I want to find out more, so I'm going to select view more.

Thumbnail 1740

Thumbnail 1750

I'm going to expand that here. Here I have the details for payment-service-dotnet. If there were any changes in the last three hours, that would show up here. This is really handy to know when you're troubleshooting at three in the morning. There are some details about availability decline, and there are those red metrics that Helen talked about: availability, latency.

Thumbnail 1760

Thumbnail 1780

We also have percentiles because it's important to remove any outliers. So that information is there for me out of the box. I want to see more, so let's select View Dashboard. In this overview, I will see the metrics that you've just seen in the little side panel. But I want to go over to Service Operations, and I'm going to take a look at fault rate.

Thumbnail 1790

Thumbnail 1800

There's a service operation that has a 100% fault rate. That's not very good. Let's have a look at what it is. Let me do that again.

Thumbnail 1810

Thumbnail 1830

There it is. So it looks like there's a problem with the payments, which is obviously not great. If I head over here to the metrics and pinpoint the faults, I can see my metric to trace correlation. The associated traces will be displayed here, so I will be able to see what was happening. I'm just going to open a couple of those traces.

Thumbnail 1850

Thumbnail 1870

Before I go over troubleshooting those, I want to point out another important aspect. We talked about the importance of business metrics. Because I've manually instrumented the owner ID for this payment (this is a pet clinic where owners are trying to pay for treatment), I have that business context recorded. If I wanted to know the business impact, to see whether it's a specific customer having problems or all customers having problems, I have that information there. This is really powerful. You can see the owner ID 6310-1228, and more or less the same number of spans is captured, so unfortunately it appears everybody is affected. But again, this is so powerful to know.

Thumbnail 1900

Thumbnail 1920

Thumbnail 1930

Thumbnail 1940

Just to show you how this is done, this part of the application is written in .NET. We'll share the QR code at the end with you, but basically this is the code for the pet clinic. In .NET, this is how you would do manual instrumentation: you call SetTag on the current activity, and there's the owner ID. This is how it actually appears in my span data. Let's just have a look at the traces that we opened earlier. I'm just going to close this for a second. So these are all my spans, and I can see that there are exceptions. I'm going to click on this one. There's a 500 error. Let me go to this one. You can see that it says "the specified queue does not exist," so it's pretty clear that there's a problem with the queue. It's either been deleted or it's the wrong name. I can go and tell the application team to get it sorted.

Thumbnail 1960

Thumbnail 1970

Thumbnail 1990

Thumbnail 2000

Thumbnail 2020

Let me just have a look at some of the other data that we have here. If I look at the span, for example, and scroll to the bottom, there are correlated logs. If I was troubleshooting and just went in through the log rather than through the trace, the trace ID and span ID are recorded here. That's one of my shortcuts. If I just have a look at the resources tab, you can see this was auto-instrumented with OpenTelemetry for Java. If I head over to the metadata, for example, these are those semantic conventions and general attributes I was talking about. There's the host image, we've got host ID, and there's that manual owner ID. Now we already know that there's a problem with the queue, so we can speak to the application team to get it fixed. It was pretty straightforward to narrow down the problem.

Thumbnail 2050

Thumbnail 2060

Thumbnail 2070

Thumbnail 2080

But imagine a different scenario. Imagine that your customer service has received a call saying customer number 2 is having a problem. You probably wouldn't go through this route. There's an easier way to do that. What you could do is go over to transaction search. I'm going to search for the owner ID. We also know that it was a payment problem, so I can even narrow it down to which service this was happening on. I can run this query, and then all the spans that relate to this service and this customer will appear in the list. There are the trace IDs. I can click on that, and it will take me back over to the trace console that way. That's yet another shortcut that would be really handy when troubleshooting.

Thumbnail 2090

Thumbnail 2100

I'm just going to close this for now. So if you recall at the beginning, we noticed that there was a problem with one of the services and there was an SLI breach. I'm just going to head over to my SLO console.

Thumbnail 2120

Here you can see that I have an SLO defined for appointment service availability, and my goal is 95% availability. I'm not doing really well. I'm at 74.9%, which is not great. To make things worse, I have actually gone through my error budget, so I've really failed that SLO. But this SLO depends on an operation, so I'm going to click straight through to that.

Thumbnail 2150

Thumbnail 2170

Thumbnail 2180

If I take a look at this operation here, I can see that I'm at 25% fault rate. Like before, let's select a point in the errors and open one of the traces. But there's another shortcut. If I head over to top contributors, and if I sort by versions, I can actually see that version 2 has 29% fault and 32.6% availability. So to me that indicates there's been a deployment and version two's not doing very well. So again, that's another way to shortcut, but let's have a look at the trace.

Thumbnail 2200

Thumbnail 2210

There we go. There is a null pointer exception in there. But actually, if I just look over at the metadata, this faas.version attribute is a semantic convention for the serverless function version, and here it's version 2. So we can see that there's a problem with version 2. We need to tell the application team that there's a null pointer exception and have it sorted. So again, that's a really quick way for me to pinpoint an issue.

Thumbnail 2240

Thumbnail 2260

Now, before we move on to the next part, which is filling in gaps in telemetry, I just wanted to show you some of the other things in the console. I have this service called billing-service-python. If I take a look at the service here, straight away in the console I can see that it's hosted in EKS, along with the namespace and the workload. I'm not going to do it now, but if I wanted to, I could click on the hyperlink and it would take me to Container Insights if I wanted to see the infrastructure metrics that come along with the container and the namespace.

Thumbnail 2290

Thumbnail 2320

But without repeating the whole flow, I'm just going to give you a spoiler. This service has a dependency, and the dependency is on an Aurora PostgreSQL database. It does inserts into the database, selects, and updates. So I can see all the metrics here straight away, which is handy. There is no problem right now, but imagine there was. If I wanted to take a deeper look and have a more specialized view, because I've enabled Database Insights on my database, I can click straight through to view it in Database Insights.

Thumbnail 2330

Thumbnail 2340

Thumbnail 2350

Thumbnail 2360

Here I can see the health of my database fleet. I can have a look at the queries adding to the load on my database. I can see the calling services, which is really useful; when I was an SRE I always struggled to know which services are affected by which database, so that's really handy to know. And then I can dive deeper into the database instance. I'm only showing you this at a high level. Towards the end of the session we share a QR code with you, and there is a section there for interactive demos if you want to dive deeper into Database Insights, so please do check that out.

Thumbnail 2380

Thumbnail 2390

Thumbnail 2400

For now I'm just going to show you some high-level stuff. I can take a look at my database load, and let's just say I want to break it down by blocking SQL. I can have a look at the top SQL on my database that's affecting the load, and I can actually look at the SQL code that's running. If there were any execution plans, I could take a look at those. And again, I can see those calling services, which is really handy. I can see exactly which operations are running and the statistics for those operations.

Thumbnail 2430

This is one way to have a much more specialized view into your application. Before we wrap up the demo, there was just one thing that I wanted to show you. If you want to impress your boss and justify your trip to re:Invent, you could show them how quickly you can implement RED metrics and traces on your serverless applications, for example.

Thumbnail 2450

Thumbnail 2460

Thumbnail 2470

Thumbnail 2480

Thumbnail 2490

Let's say I have a new Lambda function. It's a pet photo processing service that grabs photos from a bucket and puts them on a queue to process. This Lambda function is not instrumented. If I wanted to instrument it, all I would do is head over to Configuration, then Monitoring and operations tools, click Edit here, and tick these two boxes: Application Signals and Lambda service traces. When I click save, two things are going to happen. First, an environment variable is added to indicate that I'm using auto instrumentation. Second, a Lambda layer is added containing the auto instrumentation code. It will take a few minutes for the telemetry to start appearing in Application Signals, but because we don't have a few minutes, I'm going to show you how this looks for a function that I made earlier.
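If you ever need to do the same thing outside the console, those two tick boxes roughly correspond to attaching an instrumentation layer, setting an exec wrapper, and enabling tracing. A loosely hedged sketch with boto3; the function name is made up and the layer ARN is a placeholder that depends on your region and runtime.

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="pet-photo-processing",
    # Placeholder: use the Application Signals / ADOT layer for your
    # region and runtime.
    Layers=["arn:aws:lambda:<region>:<account>:layer:<appsignals-layer>:<version>"],
    Environment={
        "Variables": {
            # Tells the runtime to start through the layer's wrapper script.
            "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
        }
    },
    TracingConfig={"Mode": "Active"},
)
```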

Thumbnail 2500

Thumbnail 2510

Thumbnail 2560

If I head back over to my application map and double-click the photo processing service, my dependencies are automatically discovered. The RED metrics are there, and I can dive deeper into that function. So it took me three clicks. That wraps up the demo: RED metrics, how to instrument with auto instrumentation, and all those deep links. Let me hand back to Helen to cover how to fill in the gaps in your telemetry.

Thumbnail 2590

Filling the Gaps: Specialized Tools and AI-Powered Investigations

Sometimes tracing alone, or even with metrics and logs, doesn't always give the full picture. Sometimes we need to look at specialized tools to fill in the gaps. Let's look at a few of those that are available in CloudWatch. Anya has already shown you a couple of these, and if you haven't explored them, I highly recommend it. My personal favorite is Synthetic Canaries because with no code changes, it's a really easy way to get started with visibility from your customer-facing side. I want to look at two others that we haven't explored today.

Thumbnail 2610

Container Insights gives you out-of-the-box dashboards, and you can explore the telemetry from workloads on ECS, EKS, and Fargate. It collects metrics for you like CPU, memory, disk, and network, and you can explore these at various levels: cluster, node, pod, and container. You also get event information such as container restart failures, and all of this data helps you identify and resolve issues more quickly.

Thumbnail 2650

I also want to take a minute on Network Flow Monitor. It's been around for about a year, but we find that a lot of our customers aren't aware of it. Something Anya mentioned earlier was that networking is hard to get visibility into. What Network Flow Monitor gives you is near real-time observability into network performance between your compute resources, your EC2 and EKS, and AWS services like S3 and DynamoDB. Why do we care? Well, when your application sees high latency, we all look at the network first, right? So it's really handy to know whether that's actually the issue. Network Flow Monitor gives you visibility into network performance metrics like data transfer, packet loss, and round-trip time, and it also gives you overall network health indicators.

You can drill into the flows, look at the topology, and examine the metrics further. Basically, you're getting all the extra information to decide where you need to focus your attention. If this or any other specialized tools that we showed are interesting, please check them out and see if they're valuable to complement your observability and move you forward on your journey to operational excellence.

Thumbnail 2740

Earlier, I said that cutting incident response and detection time was a key aspect of your operational excellence. We didn't forget about AI. AI can accelerate your investigations and reduce your time to detection and resolution. Machines are really good at analyzing huge amounts of data and patterns, so let's use that capability. CloudWatch has had machine learning features and AI features for quite a long time. There are features like machine learning pattern analysis for logs that pulls out the constant patterns and individual tokens you're seeing inside your logs and helps you see when that pattern changes. So we have anomaly detection for log patterns, or you could look at machine learning anomaly detection for metrics.

When you choose to enable anomaly detection on a metric, it builds a machine learning model based on the history of your metric data. This allows it to tell you when things change. In your alarms, instead of having a fixed value threshold, you can set an anomaly detection band and have it tell you when things change. We also have generative AI-backed tools. CloudWatch Investigations allows an AI agent to start gathering data and coming up with hypotheses when an alarm is breached, even before you log on. The key point about CloudWatch Investigations is that the human and the AI agent work together. You can disregard things it suggested or add your own insights. It not only helps you identify the issue and come up with hypotheses, but sometimes suggests how to fix it. You can also get incident reports on your investigation that help you move forward with error correction and your improvement process.
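As a sketch of the anomaly detection band alarm mentioned above, using boto3; the namespace, metric, and dimension are illustrative.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-anomaly",
    AlarmDescription="p99 latency outside the expected band. Dashboard link goes here.",
    ComparisonOperator="GreaterThanUpperThreshold",
    ThresholdMetricId="band",          # alarm on the band, not a fixed value
    EvaluationPeriods=3,
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "PetClinic",
                    "MetricName": "CheckoutLatency",
                    "Dimensions": [{"Name": "Service", "Value": "payment-service"}],
                },
                "Period": 60,
                "Stat": "p99",
            },
            "ReturnData": True,
        },
        {
            "Id": "band",
            # Model-based band roughly two standard deviations wide.
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
            "ReturnData": True,
        },
    ],
)
```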

If you haven't checked out CloudWatch Investigations, please do. Alternatively, you could use AI through MCP servers. AWS has published a number of MCP servers, and more are appearing all the time. There are specific MCP servers for CloudWatch, for Application Signals, and for managed Prometheus. I've done things where I've simply asked what alarms I have and what's happening, and asked it to tell me about the data behind them. Please try it; it's quite impressive.

Thumbnail 2910

Thumbnail 2940

Thumbnail 2950

Key Takeaways and Resources for Your Observability Journey

We have covered a lot, so let's review what we've discussed. We have looked at observability must-haves for operational excellence, at KPIs and metrics, at context through tags and dimensions, and at processes like correction of errors that help you keep taking steps towards a better world. Anya has taken you on a great journey through tracing and shown you how it doesn't need to be complicated, and some of the tools you can use not only to help you get that data but to explore and navigate it. We've also had a quick look at some of the specialized tools and the integrations, and don't forget about AI.

Thumbnail 2970

This is over to you. Where do you want to go next? Behind this QR code, we have a number of resources. Whether you are interested in reading, watching, or hands-on learning, there are great resources here. Our One Observability Workshop gives hands-on activities around many of the features, not just in CloudWatch, but in our open source observability offerings through AWS as well. The Observability Best Practice Guide gives you a great way to think about where you are and where you want to go next, and how other people are doing it. We've built that based on the experiences we have with the many customers that our specialist community supports.

Thumbnail 3020

If you want to learn more, first of all, we will be outside the room after this talk. Give us a minute to pack up and we're happy to answer your questions. In the expo, in the village in the center, there are a number of cloud operations kiosks covering observability, AI Ops, multi-cloud, and many other areas. All of the AWS booths for the different areas are in the same space, and I bet they've got some swag hidden away, so just ask them nicely. Thank you so much for joining us today. Please fill out the session survey. That data is really important so we keep providing the best experience for all of you. The week is only starting, so keep your energy up, stay hydrated, and please enjoy the rest of re:Invent.


This article is entirely auto-generated using Amazon Bedrock.
