🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Building agentic workflows for augmented observability (COP405)
In this video, Kevin and Andrea demonstrate building an AI-powered observability agent using the Strands SDK and MCP (Model Context Protocol) tools to analyze AWS infrastructure and generate automated reports. They showcase a pet adoption website example where the agent discovers resources through CloudWatch metrics and logs, identifies security scanning patterns, creates critical alarms for latency and error rates, and provides actionable recommendations across immediate, short-term, and long-term timeframes. The presentation includes a detailed code walkthrough showing how to integrate AWS Labs MCP servers for CloudWatch, Application Signals, and CloudTrail, emphasizing the importance of precise prompting to guide the LLM effectively. They demonstrate creating custom tools like create_cloudwatch_alarm and discuss strategies for reducing token usage at scale, including using Metrics Insights for bulk queries. The agent generates structured HTML reports with executive summaries, operational audits, and correlations to industry-known incidents, all available as open-source code for deployment via EventBridge triggers or scheduled runs.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Using Agentic Tools to Improve Observability and Avoid 3 AM Wake-Up Calls
Hello everyone. I have a question: who here has been woken up at 3 a.m. to look at a log for a production outage? Can I see a show of hands? I'd say everybody. I see some people with hands down. What's your secret? Maybe you want to come up and do the talk. I know I have. What we're talking about today is how we can use some of our new agentic tools to help avoid that situation. My name's Kevin, and I have Andrea with me here today. This is a code talk, so we're going to walk through what tools we have available and how we can build something that helps us improve our observability posture.
I started working with observability at AWS about 7 years ago, and the space has changed a lot. We used to see a problem where customers did not have the metrics, logs, and traces that they needed to solve their problems. Maybe a key log method was missing, or maybe distributed tracing wasn't there. But that situation has changed. With tools like OpenTelemetry and CloudWatch being integrated into everything, it's easier than ever to produce usable observability data. Sometimes it's just a few clicks of a button, and sometimes you just need one Terraform module, and all of a sudden you have all the observability you need.
So the problem has changed. What we see now is that customers have too much data to quickly solve the problems in front of them. You can produce all of the data, traces, and logging that you need, but if you can't quickly solve a problem, you're not really getting value from it. This is exciting because, in our opinion, observability is a really good use case for AI. I know you hear that AI is a good use case for everything, but for observability we really can make the case, and it works well for us.
When we think about an agent, I'm sure you all know most of this, but we're essentially having an LLM do some processing. We're giving it a lot of context and a query, and then we want it to perform the programmatic work. We want it to run SQL queries, write Metrics Insights queries, and process all those values, and we want to see just a natural language response. I know that when I was deploying something a long time ago, you'd look and just see a Java stack trace or something and think, what is this? This helps that problem a lot. To take that and turn it into natural language that says we think this is the problem and we recommend you do this is a big leap.
Understanding MCP Tools and Building an Observability Agent Architecture
But the agents don't really work great for this unless we give them access to the right data. Show of hands, who here uses MCP tools today? I think probably a lot of you. Great, yeah, most of you. At AWS we of course make a lot of them, but this talk is not really about that. We want to think about how we use the tools and how we can gain extra value from them. What we're doing is thinking about how our MCPs give access to systems like CloudWatch or Prometheus or some type of post-mortem tool. The MCP's job is to tell the agent what it can do and what APIs are available, and then, maybe more importantly, provide a security boundary: my MCP tool can query Prometheus, but I don't want 1,000 agents on everybody's computers having direct permissions to query it. The MCP is very important for that.
So how are we going to use this? Has anybody seen this screen before? Has anybody done our One Observability demo? Just okay. You guys should all check that out, but essentially what we're going to do today and what we're going to build, and Andrea is going to show you, is this pet adoption website with some very cute pets, and these are actually real pictures from our team, most of them. The pet adoption clinic is what it's called, and it's a fictional web application where you can go and adopt a pet, bring your pet to the vet, buy food for your pets, and do all different types of things.
When we look at this, let's think about the data. As applications have changed, they've become more and more distributed, and so each of these services is going to produce observability data: metrics, logs, and traces. You used to have a monolith where you have your logs and your data all in one spot. Now you have this explosion of data because you've taken one thing and abstracted it to maybe fifty things. I have three up here, but it could be way more for you in a more complicated deployment.
When we see that happening, we need a way to easily analyze this. In our case, we're sending it to CloudWatch, but this could be other observability tools or other data stores, and the same thing still applies. So what's our strategy here when we build an agent? I think this is the most important thing: we need to provide it with as much context as possible. We want to define SLOs, which are service level objectives where you're essentially telling the agent what is important to you and your organization. You're saying that over the last thirty days, I expect my application to have a latency of thirty milliseconds for ninety-nine percent of requests. That's an example of an SLO.
When you provide this rich context to the agent, you're going to get a lot more data that is specific to you and your team. Then we need to take that data, analyze the trends, and say to our agent, don't just look at the current state, look around the corners. What could potentially be an issue later? And then finally, we want to make our agent better over time. Let's take an example of our pet adoption website and run this agent that creates, let's say, an HTML report of observability findings. We want to feed those findings back into the agent on each successful invocation so that it can get better over time.
So what are we going to actually show and build today? We start with a natural language query, which could be manual or could be triggered by an alarm. It would be even better to have something lying in wait, so that when an alarm is triggered, your agent gets to work for you. We have our LLM, and what we do, which I think is really cool, is use the Strands Agents SDK. It's open source, and you can build an agent in about ten lines of code. We're going to show you a lot more than just ten lines, but you can get started that quickly. It's really amazing how quickly you can build something interesting.
Then we want to provide it with all the context that it needs. We want to provide it with our operational tenets and our SLOs. We want to tell it exactly its purpose and its job and how it should do things, and we also need to provide it with the MCP access so it can properly query things and get data. Finally, we want some type of usable output. It could be an HTML report that you send out daily to a stakeholder. It could find a gap and then create an alarm for you, which is very useful and we're going to show that in a little bit. The goal is to have some type of human-understandable output from your programmatic queries of structured data.
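As a rough illustration of how little code that starting point takes, here is a minimal sketch of a Strands agent; the prompt and query text below are illustrative, not the repository's code.

```python
# Minimal sketch of a Strands agent (assumes the strands-agents package is
# installed and AWS credentials/model access are configured).
from strands import Agent

agent = Agent(
    system_prompt=(
        "You are an observability expert. Analyze AWS telemetry and answer "
        "in plain language with concrete next steps."
    )
)

# Invoking the agent is a plain call; the result is a natural language answer.
result = agent("Summarize the health of the pet adoption site over the last hour.")
print(result)
```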
Analyzing the Agent-Generated Observability Report: From Executive Summary to Industry Context
I'm going to pass it over to Andrea now. He's going to show you the code and show you what we built and how we think about this. Thank you, Kevin. Alright, so before I swap the screen, one thing I want to mention is that you're going to see a QR code at the end of this presentation.
All the code you're going to see is open source, so you can download it. Before going into that, one caveat: the code will run and do things, so if you try to modify it, make sure you test it as you would in a pre-production environment, because it's going to create artifacts, as we're going to see. Having said that, I want to start by looking at what the agent creates after each run. What you see on screen is the observability report from one run of the agent, where what we're asking the agent to do is go into my account and find interesting things for me. We'll see what "interesting" means, because you need to steer the agent in a very specific way if you want it to be valuable. I want to analyze that report with you so that you can see what the outcome is and relate it to the code we're going to see in a second.
The report is split into different sections. The first one is the executive summary. The agent is telling us that we have AWS infrastructure with mainly an EKS-based pet site adoption system that Kevin mentioned before. We have Lambda functions and comprehensive logging. One interesting thing that it's already highlighting is that our pet site deployment service has active traffic and there are security scans happening in my system. The agent then took the decision to create critical monitoring alarms for us. That's kind of like a summary of what the agent has done. Now we can actually go down and look a little bit more in detail. We can see we have an alarm section here. This system didn't have any alarms, and when the agent ran, it created four critical alarms: pet site high latency, high fault error rate, and a Lambda duration alarm. If you think about those, it's quite impressive that the agent is interpreting what we have in the account, understanding what alarms need to be created, and creating them. We're going to see how we did this in a few minutes.
There is definitely security scanning happening, which the report highlights again. Next, we have some key metrics that the agent is telling us we should be looking at. You can see here we have pet site error rate, pet site fault rate, and pet site latency. If we go to the fault rate first, you're going to see we have a low level of faults at 0.07 percent, nothing concerning, but something you definitely want to be aware of. Most interesting is the error rate time series here. In this new AI world, people tend to say AI is going to find something anomalous, but much of the time, confirming that something is not anomalous is just as important as finding what is going wrong. In this case specifically, the agent found that we have faults, but there are no errors related to those faults. You can see that we have a level of 500 errors but no 400 errors, and you can relate that back to your code, because in general the two may be related.
At the same time, the agent found that we have a latency of 400 milliseconds on average, which is also not great. We're going to talk about that in a few sections. Next, we have an operational audit review where the agent tells us we have an EKS application with a latency of 402 milliseconds on average, roughly over the last day, and we have a link to the console to go and check it ourselves. We have the fault rate and error rate that we just saw above. The agent is saying that the 45 HTTP 500 server errors are still at an acceptable level because they amount to 0.06 percent, not bad, but definitely something to notice. There is definitely active traffic. Then we have a Lambda function analysis where we have a few status update functions, step functions, and synthetic monitoring. How many of you know what synthetic monitoring is? For everybody who doesn't: it's a CloudWatch capability that allows you to write code and run canary tests. It's a way to do full end-to-end testing of your service. You call your API, you expect the result to be 200 or 500, depending on what you're testing, and it runs on Lambda. So it's actually saying that we have four canaries running.
So it's saying we have four end-to-end tests. That's the way to read it. Then we have the EKS infrastructure, with a link to the cluster if you want to go there. There are 2 gigabytes of storage, which is interesting. In fact, the agent also thought it was worth doing a CloudWatch Logs analysis because of those 2 gigabytes of storage: 1.36 gigabytes are just in the application logs. So there are some cost optimizations we can actually look at.
Now for operational findings, this is where the security scan is described in a little more detail. The agent found significant scans across those 100 operations, hitting different common paths: PHP exploitation attempts, admin panel probing, WordPress vulnerability probes, and so on. Then the agent tells us what we should do. This section, which we're going to see in the code, is split into three tiers: what you should do right now, medium term, and long term. For immediate actions, you should enable AWS Web Application Firewall on the PetSite to block the paths being scanned, and review CloudTrail. Then, as part of monitoring, remember there weren't any alarms? So the agent said, OK, I took an action and created the alarms for you, and we're going to see how. This is the first step where we let the agent operate on our observability artifacts.
There should also be SLOs in Application Signals. The agent hasn't done this because we haven't created the tools for it, so that's an exercise for you at home. I'll show you how we did it for alarms, and you can use the same approach to create Application Signals SLOs. Then, moving forward, we have short-term improvements with our 400 millisecond latency, which is definitely something to look into. And long term, we have almost 3% service quota utilization, so we can improve this with auto-scaling policies and things like that.
Now, one interesting part I want to highlight here is the relation to industry-known incidents. One of the things the agent we're going to build today does is not only go into your account, find things, and maybe create some artifacts; we also want something so that when I wake up at 3 a.m. and something is happening, I know whether what I'm seeing is something well known in the industry. That's what the agent is doing. If you look here, the security scan we're seeing in the system is related to a WordPress security vulnerability from 2022, and the agent is telling us what that was: multiple requests to the WordPress login. What's the industry context? This was a problem in 2022. And what is the applied solution? AWS Web Application Firewall. So effectively, what the agent is telling us is: hey, this is a well-known thing, and the industry has fixed it in this way, so go and do it. It's also telling us how: deploy Web Application Firewall, use the top 10 rule sets, and enable CloudTrail for API-level monitoring. Optionally, we can also set up automated incident response for those requests. It's pretty cool.
Code Walkthrough: Setting Up the Natural Language Observability Agent with MCP Clients
So we just had a single run of the agent that created this report, and it found out a few things for us. So now I think what I want to do is show you how the agent is doing that. Are you ready for some coding here? Raise your hand. OK, so get your laptop out. I'm kidding. I'm going to walk you through the code, and we're going to see. Unfortunately, we only have about an hour for this session, so I'll try my best to walk through the most interesting code. And definitely, as I mentioned, this is open source, so you can go off and download, change, and try. OK, so as I talk, I'm actually starting the agent in the terminal, which is what you're going to see there running, and we're going to look at what it's doing bit by bit. Don't get distracted too much.
When you download this repository, you're going to see a folder called observability agent. This folder is effectively a Python agent. There's a main file, and in this main file there are some settings that we're using as arguments. So when we run the agent, we want to specify the regions that we're going to use, unless we use the default.
We need to decide whether we want the report in Markdown, for example to feed it to a different agent, in HTML for consumption by an application, or both, and whether we want some verbosity in case you're debugging and want to see what's going on. But the most important lines of code here are on lines 52 and 53, where we create this natural language observability agent and actually ask the agent to run the analysis.
Now let's go into this observability agent and see what we're talking about. As Kevin mentioned before, we're going to use Strands Agents across all of this code: to create our tools, to use MCP servers and tools, and to create the actual agent. You can see how easy this is. The natural language observability agent loads the application settings, things like regions and so on, which is not super interesting, but it also initializes the MCP client manager, which we're going to see in a second. That's where we take existing tools out there and connect them to our agent so it knows what capabilities it has. We have a report generator that's going to be in charge of creating HTML or Markdown for us, and then we initialize the components.
Let me go into the MCP client manager for a second because I want to show you what this really is. For those of you who are not familiar with MCP, let me step back so we're all on the same page. MCP stands for Model Context Protocol. MCP servers are, in general, vended by companies like AWS, and they contain tools to do work. It's as simple as that. Think of MCP tools as the APIs for LLMs. Raw APIs are not really LLM-friendly because you have a bunch of parameters, and you don't want the LLM figuring them out by trial and error. MCP tools are meant to be use cases that an agent and an LLM can actually use to complete a full job. We're going to see this again when we build our own custom tool.
There are different tools. The idea here is that we don't want to connect directly to CloudWatch or CloudTrail or Application Signals without a client, and have the agent try this API or that API without really having context about what it's doing. That's what MCPs are for. Today, in this agent, we're going to use three MCP servers: the CloudWatch one, which covers metrics, logs, alarms, and so on; Application Signals, for monitoring your applications; and CloudTrail, to see the CloudTrail logs in your account.
Creating an MCP client with Strands Agents is as simple as instantiating an MCP client and passing it the identifier of the public AWS Labs server you want: the AWS Labs CloudWatch MCP server, the AWS Labs Application Signals MCP server, and the AWS Labs CloudTrail MCP server. You're going to find a lot of them in the AWS Labs repository; in this case, we just need those three for our use case.
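As a sketch of what that wiring can look like with the Strands MCPClient: the uvx launch command, the environment variable, and the exact server package names are assumptions about a local stdio setup, so verify them against the AWS Labs repository you clone.

```python
# Sketch: connecting Strands to the AWS Labs CloudWatch MCP server over stdio.
# Assumes the server is launched locally with uvx; adjust for your environment.
from mcp import StdioServerParameters, stdio_client
from strands.tools.mcp import MCPClient

cloudwatch_mcp = MCPClient(lambda: stdio_client(
    StdioServerParameters(
        command="uvx",
        args=["awslabs.cloudwatch-mcp-server@latest"],  # package name per AWS Labs
        env={"AWS_REGION": "us-east-1"},  # region is illustrative
    )
))

# The Application Signals and CloudTrail servers follow the same pattern with
# their own package names; inside a `with cloudwatch_mcp:` block you can call
# cloudwatch_mcp.list_tools_sync() to get the tools to hand to the agent.
```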
The MCP manager creates those MCP clients for the servers we want and then exposes them through a get_all_tools function that returns the tools to be used within the agent. If we go back to the agent again, we see that we initialize the components: we ask the MCP manager to set up the clients, using the code we just saw and the AWS Labs servers, and then we create the agent.
Crafting Effective System Prompts: Defining Agent Persona and Capabilities
Now, before I show you the literally two lines of code for creating an agent (I'm exaggerating a little; there are technically two lines, but there's a bit around them), I want to explain one thing that is very important in what we're doing here.
As you saw in the slides Kevin showed us, to create an agent we need to provide it with some context, and then we interact with it. It's a two-step process, and this is exactly what we do in the code. The first step is to create the agent and set it up with some context, and the second is to prepare a well-written prompt for what we want it to do and actually invoke it.
One thing we learned while building this agent is that writing a good quality prompt is probably as important as the code itself. If you think about it, LLMs have general knowledge, so the last thing you want is to run this agent and have it go off, maybe find something for you, maybe not, and tell you something completely different every time. You really want to create a kind of tunnel vision for it. You're effectively telling it: you know about everything in the world, but we're talking about monitoring here, so restrict your focus to monitoring and do exactly what I'm asking you to do. You're going to see a lot of that in the code from now on.
So the first step is to create the agent, as I said. We start with a system prompt, and we'll come back to this prompt in a second, but I want to show you how easy it is to create the agent itself. All you need to do is instantiate an Agent class. You pass the system prompt to it; this is the initial context, who the agent is, the persona it's impersonating. Then you pass it a list of tools. Those are the tools the agent can use, and you can see here http_request, which basically means it can go off to the web and search for things (that's used for the post-mortem search), and create_cloudwatch_alarm, which is the custom tool we're going to see in a second. And then there are the MCP tools coming from the MCP manager: CloudWatch, CloudTrail, and Application Signals.
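Put together, the construction looks roughly like this. It is a sketch, not the repository's exact code: http_request is the built-in tool from the strands-tools package, while create_cloudwatch_alarm and the MCP manager are the pieces described elsewhere in this walkthrough.

```python
# Sketch of the agent construction described above.
from strands import Agent
from strands_tools import http_request  # built-in tool for web requests

SYSTEM_PROMPT = "You are an observability expert..."  # fuller sketch below

# create_cloudwatch_alarm (custom tool) and mcp_manager (MCP client manager)
# are defined elsewhere in the repo; shown here only to illustrate the wiring.
mcp_tools = mcp_manager.get_all_tools()  # CloudWatch, App Signals, CloudTrail

agent = Agent(
    system_prompt=SYSTEM_PROMPT,
    tools=[http_request, create_cloudwatch_alarm, *mcp_tools],
)
```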
Now let's spend some time looking into these prompts because I think it's really worth it. This is the context where we set the agent, and we say as the first thing: you are an observability expert. Remember the tunnel visioning thing. We're saying to the agent who you are and what persona you have. And we describe what the capabilities are: discovering and analyzing resources, querying CloudWatch metrics, correlating data, and so on, creating CloudWatch alarms using the tool that we created for the agent itself, and generating reports.
Now, one thing we need to do is be specific. If you're generic and just say go off and find stuff for me, what does that mean for the LLM? It will probably go off and do something, but we want to be precise. So we tell the agent how it's supposed to analyze the AWS infrastructure. We say: discover the resources that exist in the region and get infrastructure and application metrics. That's one way you can discover resources. Don't go off and call all the APIs for 250 services, because that's going to be mental; it's going to take days to run. One way you can avoid that is to go into your observability artifacts: you're going to see metrics and logs, and from those you can reverse engineer the resources that you have. It's a shortcut for the agent, but you can see how important that is. Then check for existing alarms, look for patterns and anomalies, and then generate actionable recommendations.
Now, one important thing here is that we're saying to the agent: be conservative with alarm creation. Remember, the agent will do what we tell it to do. So if you're not really precise about the intent, every time this agent runs, it will create alarms for us because we're saying go off and identify alarms that you should create and create them. But we don't want that because you're going to end up creating alarms that are very similar, maybe with different configurations, slightly different data points or thresholds, but effectively the same semantically. We don't want that. We want alarms only on genuine gaps, and that's exactly what we're saying here: be conservative, we want those alarms for genuine gaps, and do not create duplicate or near-duplicate alarms.
Then we ask the agent to pay attention to the resource type, the key metrics, and deep links, because we want that one-click jump to the AWS console. And we say what it should be focusing on: performance monitoring, customer experience and impact, and optimization across cost and security.
That focus on impact and optimization across cost, security, and operational excellence is how we identified the security scan we saw earlier. For example, if you have 2 gigabytes of logs, there are things you can do about it. One interesting thing here is that we also point it at the CloudWatch alarm tool: when you need to create alarms, that's the tool you're going to use; don't try to use a boto3 client or any API by yourself, I'm going to provide you with a tool to do that. Last but not least, be thorough but concise.
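Taken together, the system prompt the last few paragraphs describe looks something like the sketch below. This is a paraphrase of the talk's description, not the repository's exact wording.

```python
# Sketch of the system prompt described above (paraphrased, not the repo's text).
SYSTEM_PROMPT = """You are an observability expert analyzing AWS infrastructure.

Capabilities:
- Discover resources by querying CloudWatch metrics and logs; do NOT enumerate
  every service API.
- Query CloudWatch, Application Signals, and CloudTrail through the provided
  MCP tools, and correlate the data.
- Create CloudWatch alarms ONLY with the create_cloudwatch_alarm tool, never
  by calling APIs directly.

Alarm policy: be conservative. Check existing alarms first and create new ones
only for genuine monitoring gaps; never create duplicate or near-duplicate
alarms.

For every finding, include the resource type, the key metrics, and a deep link
to the AWS console. Focus on performance, customer impact, cost, and security.
Be thorough but concise."""
```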
Agent Invocation and Report Generation: From Query Execution to HTML Output
At this stage, what we've done is define our Natural Language Observability Agent. In the first step, we have the agent primed with the context. It has the tools it can use, including the MCP tools. Now we move to the next step where we can actually run the analysis. Let's see how we run this analysis. In this method, you'll see there are two steps: discover and analyze resources, and then generate the report. Let's look at how we discover these resources. Here is the second step where we interact with the agent. I'm going to skip the query for a second because we'll see what it looks like, but one thing worth noting is how we invoke the agent. Invoking the agent is as simple as using the agent method and passing the query, the prompt that we wrote above.
What that does is start the agent and send in the prompt, and the agent starts working. It finds things using the tools, and then we get our result, a natural language response from the agent. We parse it, extract from it, and do some modification that we'll see in a second. There are two important things to notice here. The first is where we call the MCP manager's execute_with_clients. If you go back here, we have this function that might seem cryptic at first, but effectively it's a context manager that connects the clients to the MCP servers. Remember that we have agent creation as step one and agent invocation as step two. The idea is that we don't want the clients to be connected for the entire lifetime of the agent; we want to connect them only when we invoke the agent, and that's the reason why.
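The connect-only-while-invoking pattern amounts to something like the sketch below; execute_with_clients is reconstructed here from the description, so the repository's helper may differ in detail.

```python
# Sketch: open the MCP client connections only for one agent invocation.
from contextlib import ExitStack

def execute_with_clients(mcp_clients, func):
    """Connect every MCP client, run func, then tear the connections down."""
    with ExitStack() as stack:
        for client in mcp_clients:
            stack.enter_context(client)  # Strands MCPClient is a context manager
        return func()

# Usage: connect, run the analysis query, disconnect. The clients, the agent,
# and ANALYSIS_QUERY are the pieces sketched elsewhere in this walkthrough.
result = execute_with_clients(
    [cloudwatch_mcp, appsignals_mcp, cloudtrail_mcp],
    lambda: agent(ANALYSIS_QUERY),
)
```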
When we create the agent, we pass in all the tools, so we prime the agent with the client initialization, but there is no connection between the client and the MCP server yet. When we actually invoke the agent with the query, that's where we need execute_with_clients. Effectively, we're telling the initialized clients to connect to the MCP servers now, because the LLM is about to need them. This line here is what you're seeing in my terminal: everything that is happening is because the agent is being invoked. After this, we extract the metrics from the response and return the markdown and the metric data. Before we look into that, I want to go back to our prompt, because you can see it's quite verbose, and there's a reason for that. The prompt is what we're asking the agent to do; it's the core of the whole thing. What should it do? The query is to analyze the infrastructure and create the report. But before we go down that path, there's something worth mentioning, and I'm stressing this because it's important: the agent is stateless unless you feed it the previous run, which is something we can do here. Every time you run it, it's going to go off and do things. We've said this already.
To create a report that makes sense for humans, you don't want the output to be random. You're going to find very quickly that if you run the agent without the steering in this code, the LLM will make a few mistakes and hallucinate. For example, it will go off, look at things, and then perhaps make up metric values that it thinks are correct.
The output won't be structured in a way a human can read, or that you can parse and send to, for example, an agent pipeline, and the sections will be more or less random. We want to fix all of that with this prompt, which is why it's so important.
What we're saying here is that this is a multi-step job. First, you need to discover resources and find everything that is relevant: metrics, compute resources, databases, serverless, load balancers, storage services, and so on. Then you need to analyze the health of each resource: look into its metric values, trends, health status, active alarms, recent incidents, performance, and so on. Then we need to generate deep links, because the report should let us jump into the console. Step four is where we link to post-mortems. For any issue, search post-mortems for similar incidents, but don't do it just for the sake of it: check whether what you're seeing matches publicly known incidents, and don't force it, because otherwise the agent would just claim some incident might be related when it really has nothing to do with it.
We're just saying go and look into that, but only show it if it's really warranted. Then there is the alarm management section, which is quite interesting. Think about this: the agent is going to go into your account, and we need to get some recommendations and perhaps create some alarms. We need to tell the agent how to create alarms and how to come up with an alarm that makes sense. Alarms are complicated: you need to choose the right metric, or a query, to alarm on; you need to find the right threshold and number of data points; and you need to decide how to treat missing data, for example whether missing data should count as breaching or not.
To solve all these problems, we use the MCP tools. There is a tool called get_recommended_metric_alarms, vended by AWS in its MCP servers. It gives the LLM best-practice alarm recommendations: how different AWS services' metrics map to good thresholds and so on. This primes the LLM: go check that first, these are AWS's own alarm best practices. After that, check the existing alarms and compare them with the best practices. Then create new alarms only if there are gaps; do not create duplicate alarms. At some point, we say that it can use the tool that we created to create those alarms.
Then we ask for the next step: give me recommendations and include actionable insights for performance, cost monitoring, and security. So far so good. This is what we're asking the agent to do, but we can't stop there, because the agent would output something completely different every time. We need to parse this, because we want to create a report, and we want that report to be as stable as possible. That's why we tell the agent that this is critical: structure the report exactly this way. We want an executive summary with two to three sentences of what you found, active alarms and issues, an operational audit overview, operational findings, actionable recommendations, and the relation to industry-known incidents. Remember that in our report, those are exactly the sections we went through.
Another critical requirement we state is: don't come up with time series on your own. Use the time series that I have in CloudWatch, use the MCP tool get_metric_data, and then return them so that I can actually plot the time series like we saw in the report. That's also very important, because what the LLM will do in general is analyze your environment, perhaps looking at some slice of a window, and then just make up values for you, which is not what we want. To support that, we also ask the agent to append JSON data to the output, with a list of metric names, timestamps, values, units, and a brief description, so we can craft that into our report.
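Collecting the last few paragraphs, the analysis query ends up looking something like the sketch below; it is paraphrased from the talk's description rather than copied from the repository, and the tool names it references (get_recommended_metric_alarms, get_metric_data, create_cloudwatch_alarm) are the ones mentioned above.

```python
# Sketch of the analysis query (paraphrased from the description above).
ANALYSIS_QUERY = """Analyze the AWS infrastructure in this account and region
and produce an observability report.

Steps:
1. Discover resources from CloudWatch metrics and logs: compute, databases,
   serverless, load balancers, storage.
2. Analyze each resource's health: metric values, trends, active alarms,
   recent incidents, performance.
3. Generate deep links to the AWS console for every finding.
4. For real issues only, search for similar publicly known incidents; do not
   force a correlation.
5. Consult get_recommended_metric_alarms, compare with existing alarms, and
   create alarms ONLY for genuine gaps using create_cloudwatch_alarm.

CRITICAL: structure the report with exactly these sections: Executive Summary
(2-3 sentences), Active Alarms and Issues, Operational Audit Overview,
Operational Findings, Actionable Recommendations (immediate, short-term,
long-term), Relation to Industry-Known Incidents.

CRITICAL: do not invent time series. Fetch real values with get_metric_data
and append a JSON block listing each metric's name, timestamps, values, unit,
and a brief description."""
```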
So we're back to where we run the agent with this prompt. The markdown result you see in the run here contains effectively all the sections we talked about. What we need to do is parse the JSON for the metrics out of it. We don't want to show the raw JSON in the HTML report, so we extract it in a structured way that we can hand to our JavaScript charting library, and we remove it from the markdown itself. That's effectively what we're doing here, and then we return the markdown with all the content that was found, plus the metric data.
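A minimal sketch of that extraction step, assuming the agent appends the metric data as a fenced JSON block at the end of its markdown (the repository's parser may be more forgiving):

```python
import json
import re

def extract_metric_data(markdown_text: str):
    """Pull the fenced JSON metric block out of the agent's markdown and
    return (clean_markdown, metric_data)."""
    match = re.search(r"`{3}json\s*(\{.*?\})\s*`{3}", markdown_text, re.DOTALL)
    if not match:
        return markdown_text, {}
    metric_data = json.loads(match.group(1))
    clean_markdown = markdown_text.replace(match.group(0), "").strip()
    return clean_markdown, metric_data
```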
At this stage, if we go back, that's where we analyze our results, and then we generate the report from the analysis result and the metric data. We can check this method very quickly, but in reality what we're doing is taking the markdown and the JSON and using Jinja templates. There is an HTML template in the template folder, which includes the CSS and everything else. It's a Jinja template where we iterate over the markdown and print it, and then use the JSON for the metrics to generate the graphs you've seen before. Nothing special there.
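Rendering can then be as simple as the sketch below; the template name, the use of the markdown package for Markdown-to-HTML conversion, and the variable names are assumptions, not the repository's exact code.

```python
import json
from jinja2 import Environment, FileSystemLoader
import markdown as md  # assumes the 'markdown' package for MD-to-HTML conversion

def render_report(markdown_text: str, metric_data: dict) -> str:
    """Render the HTML report: the agent's markdown becomes the body, and the
    metric JSON is passed through for the client-side chart library."""
    env = Environment(loader=FileSystemLoader("templates"))
    template = env.get_template("report.html")  # template name is illustrative
    return template.render(
        report_html=md.markdown(markdown_text),
        metric_data_json=json.dumps(metric_data),
    )
```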
Building Custom Tools: Creating the CloudWatch Alarm Tool and Observing Agent Execution
One thing I do want to cover, now that we've seen how we invoke this agent, is how we instruct the agent to create those alarms. That's where we create our own tools. In the tools folder there is an alarm tool file; this is the tool the agent uses to create an alarm. There are a few things to keep in mind here. The first is that we need to expose our methods, our Python code, as a tool the agent can call. That is as easy as annotating it with a tool decorator, which you can put on any method you want. And that's about it: you have your tool.
At that stage, you can take it and feed it into the agent. However, if you do that, you're going to have a problem pretty quickly. If you look at this create alarm function, creating an alarm is not as easy as just providing a name and maybe a metric. There is a lot of configuration: alarm name, metric name, namespace, and so on. More importantly, there are configuration parameters like the comparison operator, for example greater than threshold, or the statistic. If we don't instruct the agent on how to use them, it will just try average or something else, or go all over the place, and you'll get errors from the client every time. At some point the agent will give up and say, I'm going to try a different way. We don't want that.
That's why, although creating a tool is as simple as adding this annotation, giving the context to the LLM is probably one of the most critical things you need to do here. That's exactly what we do next. We say what this tool is about, so the LLM knows what this code is going to do. You can see we say this is for creating alarms, which is a common use case, and then we describe the arguments: alarm name, metric name, namespace, and so on. See here, for the comparison operator, don't come up with parameters that probably won't work: the comparison operator can be greater than threshold, less than threshold, or greater than or equal to threshold. Same with the statistic: you can only use average, sum, minimum, and maximum. This is important. This is how the agent can go off, use the tool in one shot, and have it work.
Then we say what we return and give some examples of how to use the tool. The tool itself is a mirror of the CloudWatch alarms API. CloudWatch provides an API called PutMetricAlarm where you send all those parameters and it creates an alarm. This tool has been kept this way on purpose, just for simplicity: it mirrors the API. In reality, think of tools as something where you pass only the parameters that you really need for your use case, and then the code does the rest.
This approach makes it simple for the LLM to use. Here, we're using a boto3 client for CloudWatch, we do some parameter validation, we build a dictionary with the alarm configuration, and then we call put_metric_alarm with that configuration. This is exactly the same strategy used in the MCP tools: if you open the AWS Labs code, you'll see they're very similar to this. Some are a bit more complex. For example, there is a Metrics Insights tool, which uses Metrics Insights, a SQL-like language for querying CloudWatch metrics; that's more involved because you have to build the SQL statements. However, most of the common tools are essentially a mirror of the APIs, or a light flavor on top of them.
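A sketch of such a tool, in the spirit of what's described (the decorator comes from the Strands SDK; the parameter list, validation, and return strings are illustrative rather than the repository's exact code):

```python
# Sketch of a create_cloudwatch_alarm tool: the docstring constrains the LLM's
# parameter choices so it can use the tool in one shot. Validation is abbreviated.
import boto3
from strands import tool

@tool
def create_cloudwatch_alarm(
    alarm_name: str,
    metric_name: str,
    namespace: str,
    threshold: float,
    comparison_operator: str = "GreaterThanThreshold",
    statistic: str = "Average",
    period: int = 300,
    evaluation_periods: int = 2,
) -> str:
    """Create a CloudWatch metric alarm.

    comparison_operator must be one of: GreaterThanThreshold,
    LessThanThreshold, GreaterThanOrEqualToThreshold.
    statistic must be one of: Average, Sum, Minimum, Maximum.

    Returns a short confirmation string for the agent to report back.
    """
    allowed_ops = {"GreaterThanThreshold", "LessThanThreshold",
                   "GreaterThanOrEqualToThreshold"}
    if comparison_operator not in allowed_ops:
        return f"Invalid comparison_operator: {comparison_operator}"

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName=alarm_name,
        MetricName=metric_name,
        Namespace=namespace,
        Threshold=threshold,
        ComparisonOperator=comparison_operator,
        Statistic=statistic,
        Period=period,
        EvaluationPeriods=evaluation_periods,
        TreatMissingData="notBreaching",
    )
    return f"Created alarm {alarm_name} on {namespace}/{metric_name}"
```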
Now, I want to quickly show you what our terminal has output. You can see all the metrics that were found. I actually ran out of space here, but you can see that Application Signals is an MCP server that is executing the CloudWatch API, and there are multiple invocations of the MCP servers here. If we scroll down, you'll see more activity. This is still auditing, which is good. For example, CloudWatch Metrics Tools is another MCP tool where we say, "OK, build a Metrics Insights SQL query for AWS EC2 CPU utilization." This line here shows we're using the MCP tool that contains the build_metrics_insights_query I just talked about, which would be very complex if you had to do all of that from scratch.
I'm taking all the CPU utilization across your entire fleet, no matter how big it is, and getting the metrics out. With one single query, you get a lot of metrics back, and the agent can be instructed to identify any EC2 instance, or any layer in your system, that is not performing well. That's exactly how you instruct the agent to behave. Similarly, with log_group_describe_log_group I can see my log groups, and so on with all those tools. Then at some point it provides my analysis. This is my report: my infrastructure consists of 66 monitored services across ECS, EKS, Lambda, and so on; severity shows latency issues identified in these other services, but no critical outage detected. Effectively, the scan has been completed.
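For reference, the bulk query that tool builds looks something like the sketch below, issued through a single GetMetricData call; the SQL follows standard CloudWatch Metrics Insights syntax and the time window is illustrative.

```python
# Sketch: one bulk Metrics Insights query for fleet-wide CPU utilization,
# instead of one API call per instance.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "fleet_cpu",
        "Expression": (
            'SELECT AVG(CPUUtilization) '
            'FROM SCHEMA("AWS/EC2", InstanceId) '
            'GROUP BY InstanceId ORDER BY AVG() DESC LIMIT 10'
        ),
        "Period": 300,
    }],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=3),
    EndTime=datetime.datetime.utcnow(),
)
# response["MetricDataResults"] holds one series per instance in the top 10.
```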
Deployment Strategies and Q&A: Scaling the Agent and Handling Real-World Challenges
This is a new run of the system. I want to show you very quickly how you can actually deploy this in your environment. You'll find an agent setup Python file that does what we want: running the agent not manually, but automatically based on an event, or even periodically. There are many ways to do this, and there are a lot of talks here at re:Invent about deployment, so I'm not going to go into that; you can use CDK, Bedrock, or whatever you like. But one idea from an observability standpoint is that we would like the agent to be woken up before us. It would be cool if, for every alarm we have, we had a duplicate with a more aggressive threshold that can trigger the agent to run.
The idea is that by the time I get paged at 3 in the morning, the agent has already been doing the work for me, so I can go straight to the report and see what's going on. That's exactly what this setup does. Again, we use the MCP clients and so on; there is a setup context prompt, and then we create CloudWatch alarms in pairs: human alarms for the operator and trigger alarms for preemptive assessment. We also create EventBridge rules to actually start up the agent. This is clearly one way you can run the system.
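The trigger-alarm-to-EventBridge wiring can be sketched roughly as below; the rule name, alarm-name prefix, and target Lambda ARN are illustrative placeholders, not values from the repository.

```python
# Sketch: run the agent whenever a "trigger-" alarm goes into ALARM state.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="observability-agent-preemptive",
    EventPattern="""{
      "source": ["aws.cloudwatch"],
      "detail-type": ["CloudWatch Alarm State Change"],
      "detail": {
        "state": {"value": ["ALARM"]},
        "alarmName": [{"prefix": "trigger-"}]
      }
    }""",
)
events.put_targets(
    Rule="observability-agent-preemptive",
    Targets=[{
        "Id": "obs-agent",
        # Placeholder ARN for whatever runs the agent (e.g. a Lambda function).
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:obs-agent",
    }],
)
```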
I can pass it over to you so we can show the QR code if you want to look at the repository once again. Please try it in pre-prod. As you can see, the agent will go off and do stuff. We also have time for a few questions if anybody wants to ask or talk about anything. Let me give you a mic so I can hear you.
Thanks. MCP servers are notoriously inefficient with token usage. I'm curious how many tokens this report took to run. I don't think we have the statistics, but we did make sure to carefully scope down and be very specific with the context because we ran into the same problem. I think it comes down to being very specific in terms of what you tell it to do. I don't have a token count for the runs, but we had the same problem, so you're right. With an observability tool, you really want to narrow down what its scope is. For us, it was producing a report. It wasn't doing everything. We said give us an overview. When you keep the scope down, you can also keep the token usage down.
One thing you've seen in the code is that we took some shortcuts to make sure we consume fewer tokens. For example, using Metrics Insights in CloudWatch to understand whether there is something wrong with your EC2 instances is one execution of one MCP tool, as opposed to going into EC2 MCP tools, finding all your EC2 instances, and then, for each of them, going into the metrics. Shortcuts like that can help you reduce the number of tokens quite significantly.
How do you execute the analysis? Do you execute it periodically, or each time you have an alert? We execute the whole workflow, so you could do either. With our specific tool, as you can see on GitHub, you can just execute it from the terminal on a cron schedule if you wanted. Or, as you may have seen, we came out with something called Bedrock AgentCore, where you can essentially put this into a Docker container image and deploy it to AWS very easily. It could be listening for an EventBridge event or something like that. Since we were just doing a demonstration, we were running it manually on a schedule, but there are many new options that let you deploy it and trigger it easily.
Did you test this workflow at scale? For example, if we have a platform with hundreds of thousands of alerts per day and we execute it from a webhook, how do you imagine the platform behind it supporting the tool and the whole workflow? Well, we did test this, but definitely not at 100K scale; that's not the case. Take the repository as a demonstration of what you can do. But if you have a high-scale system like 100K, that's exactly where what we said before applies. Think about the simplifications we made, like not going off and discovering resources directly: there are tricks you can use in AWS, especially for observability, like Metrics Insights. Metrics Insights can handle a query on the order of 10K metrics. For example, if you have to analyze CPU utilization for 10K metrics, it takes just one MCP invocation of one specific tool. So at scale, you should expect to write some tools yourself to optimize for that scale, but the trick is to use bulk MCP tools, or bulk APIs behind the scenes. That's going to make your life easier when you go to that big a scale.
So does this have a limitation where it works only through Bedrock, or can I use my own OpenAI keys?
It's pure Python code, with nothing tied to Bedrock AgentCore. With the Strands SDK you can plug in an OpenAI API key or any other provider; it works with all models, whatever you're using.
Thank you for the talk. Really cool stuff. I just cloned the code. I was curious about whether you ran into any issues with structured outputs on any of this, since it sounds like the rendering is taking a lot of data that was generated as JSON from the model and basically putting that into the HTML output. I've run into that before and was just curious how you dealt with that. Maybe it was prompting and that kind of thing.
We had to rely on templating to do it. You're right: if you essentially just say "produce an output," it's not there yet, at least in our experience. So we made our own template and just fill it in. It's a standard Python Jinja template where we fill in the sections extracted from the log and metric data, because we had similar issues. If you just tell it to do things, the graphs would look strange or use some odd CSS or something like that. So we had to be more manual in that part to get it to look proper.
We're about out of time here, and we'll be standing outside if you want to ask us more questions. Please take the survey. Thanks for hanging out with us and please take the survey. It helps us a lot. You guys were awesome. Thank you for joining us today.
; This article is entirely auto-generated using Amazon Bedrock.