🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Streamline Amazon EKS operations with Agentic AI (CNS421)
In this video, AWS principal solutions architects Sai Vennam and Lucas Soriano demonstrate building an AI-powered troubleshooting agent for Amazon EKS using Agentic AI and MCP (Model Context Protocol). They showcase a live coding session where they evolve from basic RAG architecture to a sophisticated multi-agent system using the Strands SDK. The demo includes implementing smart message classification with Amazon Nova Micro, integrating AWS's hosted EKS MCP server for real-time cluster data, creating a memory agent that stores tribal knowledge in S3 Vectors, and building an orchestrator pattern with specialized micro-agents. The system automatically detects Slack alerts from Alertmanager, troubleshoots Kubernetes issues by querying stored solutions and live cluster data, and can autonomously remediate problems like fixing a failing monitoring agent pod by applying the correct image from stored DevOps recommendations.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Streamlining Amazon EKS Operations with Agentic AI
Hello ladies and gentlemen, thank you for joining us and spending your time with us today for a code talk. If you're not familiar, a code talk is a live coding session, not a recorded session. Lucas and I will be live coding for you today. My name is Sai Vennam, and I'm a principal solutions architect with AWS. Hey folks, nice to meet you all. I'm Lucas Soriano Alves Duarte, a principal solutions architect here at AWS as well. Today we're going to talk about how you can streamline Amazon EKS operations with Agentic AI. This will be a very informal and casual session, and we want to give you folks an opportunity to ask questions at the end as well. We're going to be going back and forth between Lucas and me. We've done this a couple of times before, and we're excited to run through what we built for you today.
Last year we did a similar session about using AI. This was before the time of agents and even before MCP was getting popular, so we built a lot of things from scratch. This year we're leveraging tools that are available out there that make it even easier to build really robust troubleshooting agents. That's what we're going to talk about today. But first, I think it's important to understand how the technology has progressed from where we were last year, maybe even two years ago, to where we are today.
Before we talk about that, I just want to introduce to you folks a new member of our team. Meet the Kube Agent. Essentially right now we have an agent that can be responsible for troubleshooting your clusters and also remediating some issues based on root cause analysis using MCP and Agentic AI. What we're building here is not just an LLM and a chatbot because that's 2024. What we're doing right now is actually building an AI agent that can be your partner and help you accelerate your troubleshooting and actually reduce the mean time to remediation of your issues.
Demonstrating the Kube Agent: Automated Troubleshooting in Action
What we are trying to do here is empower network and operations center engineers so they can use these types of tools to reduce the mean time to remediation because we know that Kubernetes is a very spread out ecosystem. That's the idea that we have for our session. Exactly. And here's the thing: we don't have anything running right now because our setup is completely clean, but we want to show you what we're going to build, just a little bit of what's possible with the technology, the art of the possible.
So here, if you can see, we have a Slack channel. The Slack channel is where we're going to receive messages and everything. We have two terminal tabs right there. One is running a memory agent; hold on to that piece of information until the end, because we're going to build it up. The other one is a specialist agent. What's going on right now is that the agent is watching today's Slack. We have received an alert from Alertmanager, and when that alert came in, our troubleshooting agent was configured to identify it and start troubleshooting based on that information.
If you can take a look, we are saying the monitoring agent has been in a pending state for over one minute. Right now in the terminal tab, we are running the agentic loop, and we're going to explain that later in the presentation. The agent keeps executing tools in that loop until it is able to produce a final response. Let's see why the monitoring agent was actually stuck in a pending state for so long.
In this case, it's finishing the troubleshooting and doing the summary, and soon enough we should be able to have a solution here in our thread. Of course, in this case we're not having any human in the loop. We can have that later if you're using GitOps and everything, but just for the sake of the art of the possible, what we're showing here is that the agent got the alert from Alertmanager. It consulted runbooks that we have available in our vector database. It fixed the issue by itself because of the information that we gave to the agent, and as you can see here, we have the issue resolved.
Oh Lucas, but isn't it too dangerous to let the MCP and the agent apply that stuff in our production cluster? Of course it is. That's just the art of the possible, but we could integrate that with GitOps and everything, but that's for later in the presentation. That's essentially what we're going to be building today with our new teammate. Yeah, perfect. And of course that was a recording. We just wanted to show you what's possible, right? These alert messages can automatically fire in your Slack channel, and the agent can pick those up quickly and respond with contextual information. That's something important that we're going to dive into: you know, we all know LLMs are just as good as the amount of context you're able to provide them, and that's even more critical for what we'll show today.
So realistically, again, I love that it's not a chatbot. That's right, 2024 is behind us. We're going into 2026. We want to go beyond that. It's not just a chatbot. We want a teammate. You might just be thinking, "Oh well, you're just wrapping chat messages in Slack. It's still a chatbot." By the end of today's session, I think you'll believe me that it's more than just a chatbot, and we're going to dive into that in a bit.
From RAG to Agents: Understanding Last Year's Approach and Its Limitations
Last year we showed RAG, which is retrieval augmented generation. It's important for us to understand this because then you'll see how far we were able to come with agentic architectures over simple RAG. With any sort of RAG pipeline, you have a massive set of source material, and in this case it can be like Kubernetes logs. Of course, logs can go into the megabytes, gigabytes, terabytes—tons of data. It would be crazy to store all of that as raw data. LLMs understand in a different language. It's embeddings, it's matrices. So the first thing you do with any sort of retrieval augmented generation is you take that data source, you chunk it up, you put it into embeddings, and you put it into a vector store. That was the first thing that we did last year.
Then when the user asks a question, we first embed that question and look for relevant context in the vector store. That's what you see here with the context: we take the retrieved context, combine it with the original question, and pass both into a large language model. The embeddings model is a lightweight model used only to find the relevant context. The large language model uses that context plus the original prompt and gives you a response.
Let's quickly cover what we did. We chunk the logs into vector storage. The user asks the question, "Why is my app response time high?" You take the relevant context, you pipe it in with the original input, and you pass it into a Bedrock LLM, like a big agent, maybe Sonnet or Haiku, whatever you want. And you get a response.
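As a rough sketch of that loop (chunk, embed, retrieve, then prompt a Bedrock LLM), here is a minimal example. The model IDs, the toy in-memory "vector store," and the sample log lines are illustrative assumptions, not the session's actual code:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

def embed(text: str) -> list[float]:
    """Turn human-readable text into an embedding vector (Titan used as an example)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

# 1. Offline: chunk the logs and store their embeddings (a real setup uses a vector database).
chunks = [
    "pod payment-api restarted 14 times: OOMKilled",
    "ingress latency p99 spiked to 4s after the 14:05 deploy",
]
vector_store = [(embed(c), c) for c in chunks]

# 2. Online: embed the question, find the closest chunk, and pass it in as context.
def answer(question: str) -> str:
    q = embed(question)
    similarity = lambda v: sum(a * b for a, b in zip(q, v))  # naive dot product, for illustration
    context = max(vector_store, key=lambda item: similarity(item[0]))[1]
    resp = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}],
        }],
    )
    return resp["output"]["message"]["content"][0]["text"]

print(answer("Why is my app response time high?"))
```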
The problem with that is, one, we are limited by the context window, which can get expensive quickly. And two, we only need this injection in the first place because the model itself is limited to the data it was trained on, so we have to feed in custom data, custom logs, to be able to do the troubleshooting. But this was last year's approach. There's actually one more downside of chunking and storing into the vector database. We were doing that regularly, sure, but maybe once every 30 minutes or once every hour. We weren't getting live data from the cluster. So if an engineer just deployed a pod and they're troubleshooting it, that data might not have been chunked and stored into the vector database yet. If the issue is happening right now, how are we able to reduce the mean time to remediation if we don't have live data?
What we saw last year was developers would find an issue, they would go to kubectl, they would run some commands, get the data, and then pass it into the LLM. It's a lot of manual back and forth and context switching. Really, this is why agents are here to solve some of that.
Building with Strands SDK: Creating a Basic Agent Architecture
Strands is an open source SDK for building agents, and it's the foundation of how we're going to build the agent from scratch today. Is there anyone here already using Strands Agents or who has tested it? Awesome to see a couple of hands raised. It's going to be good. Perfect. We're going to get into why this SDK is so good for building agents. They've really thought through the different use cases of what agents might need to do, and we're going to start using some of those capabilities as well.
In addition, we're starting to see this trend in the industry of not just building one massive agent powered by one LLM, but using smaller agents that use the right LLM for the job. Really, I don't want to go too much into this architecture. Just know that when a user prompts this agent, that agent can use other agents. It can use multiple LLM models. It can talk directly to your AWS resources, and it can hook into MCP servers. The way it communicates and the prompts that the agent uses to talk to one another, that's all customizable, and that's what we'll show you today with a very basic getting started example.
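For reference, a minimal Strands agent is only a few lines. This is a generic getting-started sketch assuming default Bedrock credentials, not the code from the session:

```python
from strands import Agent
from strands.models import BedrockModel

# A single agent backed by a Bedrock-hosted model (the model ID is an example).
agent = Agent(
    model=BedrockModel(model_id="anthropic.claude-3-sonnet-20240229-v1:0"),
    system_prompt="You are a Kubernetes operations assistant. Be concise.",
)

# Calling the agent runs the full agentic loop (reasoning, tool use, final answer).
result = agent("Why might a pod be stuck in Pending?")
print(result)
```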
We have a basic Strands architecture. There's a Slack interface that reads chat messages and passes them to an orchestrator agent. The reason why we use two agents here will be clear in a bit, but we use an orchestrator agent that then routes each message to another agent that's able to load information directly from the API, directly from MCP. What you'll see is that we preconfigured that agent to do a couple of things right off the bat. So let's get into it. We'll show the demo here. We don't have anything running just yet, and if we wanted to, we can. I'll actually just load up the IDE and we'll walk through some of the basic elements of the architecture.
I think this is the first one that we want to see. Let's do that. Hopefully you folks can see that nicely. I think that's pretty big. Right off the bat, you can see we're importing the agent capabilities from the Strands SDK. This is all in Python, by the way. There's a lot of boilerplate that we have in here already that I'll quickly walk through. That orchestrator agent that I talked about essentially receives messages from Slack and is able to do a couple of things. Right now we configured it with one tool. Again, boilerplate stuff: we haven't really set up MCP yet. It's not a very smart agent. We just manually coded this one. We tell it: you have a troubleshoot-Kubernetes tool. So let's jump into that one.
If we go into that, it'll open this file right here. This is our Kubernetes specialist. Some hints here saying we will connect it to MCP, but it doesn't have MCP just yet, so spoiler alert, ignore that for now. We've actually just configured a couple of tools in here. We have describe pod and get pods, and we've kind of manually programmed these in. We'll talk about how these work in a second. We tell the agent to use a specific Bedrock model ID that's configured in some config settings that we'll go through here in just a second, and a troubleshoot command. This actually does the troubleshooting itself.
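A hedged sketch of what that specialist and its manually coded tools might look like. The tool bodies (shelling out to kubectl), the names, and the exact model ID are assumptions; the @tool decorator and agent-as-tools pattern are what the session describes:

```python
import subprocess
from strands import Agent, tool
from strands.models import BedrockModel

@tool
def get_pods(namespace: str = "default") -> str:
    """List pods in a namespace (shelling out to kubectl for simplicity)."""
    return subprocess.run(["kubectl", "get", "pods", "-n", namespace],
                          capture_output=True, text=True).stdout

@tool
def describe_pod(name: str, namespace: str = "default") -> str:
    """Describe a single pod so the agent can inspect its events and status."""
    return subprocess.run(["kubectl", "describe", "pod", name, "-n", namespace],
                          capture_output=True, text=True).stdout

# At this point the specialist only knows about these two manually coded tools.
k8s_specialist = Agent(
    model=BedrockModel(model_id="anthropic.claude-3-sonnet-20240229-v1:0"),
    system_prompt="You are a Kubernetes troubleshooting specialist. Use your tools to inspect the cluster.",
    tools=[get_pods, describe_pod],
)

@tool
def troubleshoot_kubernetes(issue: str) -> str:
    """Exposed to the orchestrator as a tool: delegate the issue to the specialist."""
    return str(k8s_specialist(issue))
```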
Let's go through some of the properties that we've set up. I think the one interesting one here was the model ID. We're deciding to use Claude 3 Sonnet for this one, which is a relatively older model but perfectly good enough for our use case. We also have some other properties in here like the AWS region and the settings for Slack, that kind of thing. There's a Slack handler. I don't want to get too much into this. It's boilerplate code; it just takes messages from Slack and routes them to the agent. This stuff is really well documented. I think we coded this using Claude, right? Just auto-generated it. Yeah, I just sent the prompts to Claude and Claude coded everything. I didn't even touch the code for days. The handler itself for the agents, yes, exactly.
One last thing, let's just look at the requirements file. These are all the SDKs and tools that are loaded, and you'll see us use these throughout. Not that many here, fairly straightforward. Some of the Strands pieces, MCP, that kind of thing. Should we do the first demo? I think so. Let's do the first demo. Just remembering folks that the agent that we have right now doesn't have any access to MCP capabilities. So the only tool that it has access to is describe pods and get pods. What we're going to do now is send a question to the agent with an inquiry requesting information that we know the agent doesn't have the capabilities for, but it's going to try its best to respond in the best way possible.
I'm going to ask it what namespaces are in my EKS cluster. That's me. I'm Lucas. You're Lucas. But let's quickly take a look at the tools that we configured, right? If we look at the agent orchestrator, we actually only configured a couple of tools here: describe pod and get pods. We didn't tell it how to get namespaces. It doesn't know how to, so it's going to try its best. Before I show you what it responded with, let's actually see how it thought through the process. It doesn't know how to list namespaces, so first it's going to get pods and based on that output it's going to try and intelligently figure out what namespaces there are. Not great. It actually doesn't figure out all the namespaces. It says there are three, and if we look at the response, it says there are three namespaces. That's actually not true.
Integrating Hosted EKS MCP Server for Enhanced Cluster Access
If we go to K9s, which is just a handy CLI tool that lets us see what's running in our cluster, you can see there are actually a lot more than three namespaces, so we want to make it better. To do so, I'll very quickly introduce a new capability that we've just announced. I think it was last week, right? Two weeks. Two weeks. It's the hosted EKS MCP server. Essentially, instead of having to launch the MCP server yourself locally, we do it for you. That means instead of having to run a proxy, tunnel permissions, run the MCP server yourself, and scale it locally to handle lots of messages, it's all in the cloud. You just leverage IAM for identity and access management, so you can grant the right permissions to the right set of people who should be able to use it. Let's use that. It's funny, right? Up until two weeks ago we were doing it the old way. As soon as this came out we thought, hey, we can really simplify the MCP configuration for a live demo, and we switched it over.
If you're using EKS and have IAM permissions, you can just use it with your favorite code helper. Here's that new architecture. You'll notice there's one more box on the bottom left—the manually coded tools that we had before—and then now with MCP. The specialist agent can now decide to use the MCP if it wanted to. Let's take a look at how we would configure that.
We'll go back into Kiro. We need to switch the settings. Let me hit the demo button. All right, let's jump in here and we're going to go first to the settings file because we need a property that will essentially tell it we want to enable MCP. We'll enable this again, which just looks at the environment variables and then sets it. Now in the actual code, we'll go to our Kubernetes specialist, which again, remember from the PowerPoint, we built this into the specialist itself so it knows the MCP exists.
First, we'll need to import the system libraries. We import MCP, and this is essentially just using standard IO to talk to MCP; it's a standard protocol for working with MCP servers. Strands already has the MCP client embedded in it, so you can make your Strands agent a client of hosted MCP servers or even local servers. It's very few lines of code. Let me get the spacing right here. I think you need to align that line all the way up. There we go. Perfect.
So essentially what this is doing: if MCP is enabled, talk to the hosted MCP server. This will resolve to us-west-2, where we're running this demo. That's the hosted MCP endpoint. One thing that we had to do in the backend was set up the IAM role so that this app can talk to MCP and get the right information. That's basically it: launch the standard IO client and that should be good. Let me save it and go back to the terminal here. I need to kill the Python process. We also need to add the deletion method. If you don't add the deletion method, the MCP connection will stay open forever and then we're going to have a problem. Since it's a live coding session, just remind me to do that so we don't hit that issue.
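Putting those pieces together, the Strands-side wiring looks roughly like the sketch below. The command and arguments used to reach the EKS MCP server are assumptions (the hosted endpoint details aren't in the transcript); the documented Strands pattern of wrapping an MCP transport in MCPClient is the point:

```python
import os
from mcp import StdioServerParameters, stdio_client
from strands import Agent
from strands.tools.mcp import MCPClient

# The command and args below are assumptions; IAM credentials in the environment
# are what actually authorize the calls to the cluster.
region = os.environ.get("AWS_REGION", "us-west-2")
eks_mcp = MCPClient(lambda: stdio_client(
    StdioServerParameters(
        command="uvx",
        args=["awslabs.eks-mcp-server@latest"],
        env={"AWS_REGION": region},
    )
))

# Documented usage is a context manager; the live demo instead kept the session open
# for the app's lifetime and closed it in a cleanup ("deletion") method.
with eks_mcp:
    mcp_tools = eks_mcp.list_tools_sync()  # e.g. a list-Kubernetes-resources tool
    specialist = Agent(
        tools=mcp_tools,
        system_prompt="You are a Kubernetes troubleshooting specialist.",
    )
    print(specialist("What namespaces are in my EKS cluster?"))
```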
There we go. So here's the cleanup step. We're going to make sure we clean up after ourselves. It's very important. All right, we're going to go back here and, as you were saying earlier, we're running this locally. It makes it easy for a live demo, so we're just going to start it back up. I killed it and then I'm starting it back up now with MCP, and we should see an indicator that MCP is being used now because until then, while we had the library downloaded, we weren't actually using it. Of course, you'll remember right up here we import it and we added some code to talk to MCP. So now it's thinking through the process.
Oh, here we go. I think it's because I made it very big. Yeah, just click on top of it. There we go. Perfect. All right, the spacing got messed up, but you can see it. It says "FastMCP"; we're doing it live. Okay, perfect. The Python app is up and it's listening for Slack messages. Let's ask it the exact same question again. That's me, I'm Lucas. What namespaces are in my EKS cluster? And it's thinking. Now it should be using the Kubernetes specialist again. And it's using a tool that I've never heard of: list Kubernetes resources. We haven't coded that. It's not in our code, and I haven't written it. In fact, it's only mentioned in the README, in our documentation for MCP, but we haven't coded it anywhere. It's getting that tool from the MCP server. The agent talks to that hosted endpoint, which tells it there are countless tools it can use, and that for namespaces this is the perfect one: list Kubernetes resources. It actually gets the correct set of namespaces. Now fingers crossed, I go back to Slack and check it. Boom.
All right, so that's the first step. In just a few steps, you've seen that we're able to make our bot that much more intelligent. It's able to pull anything from our EKS cluster. Realistically, imagine if we needed to create every single tool for every single troubleshooting possibility in our agent; we'd be replicating tools everywhere. I like to think of MCP almost like an API gateway that we can use to centralize the communication, standardize the tools, and then reuse those tools across multiple different agents.
Smart Message Classification with Amazon Nova Micro
Soon enough we're going to be talking about how to break monolithic agents into micro agents. The analogy a lot of folks have heard for MCP is that it's like the USB-C universal plug for integrations, but I really see it as a solution for a combinatorial problem. Think about all the LLMs out there. You probably use three or four yourself, and then think about all the tools they could talk to: maybe your email, maybe an EKS cluster, maybe different AWS services. If you're trying to build integrations for ten LLMs to talk to one hundred different tools, that's ten times one hundred, a thousand integrations. MCP is just an integration layer, so you build each tool integration once and all your LLMs can talk to any of the tools that are built. That takes us from an M×N problem to an M+N problem.
So we've added MCP and our chatbot is that much more intelligent, but there's actually one thing that I wanted to show really quickly. In Kiro, the way that we're actually deciding when to respond to a message: if we look in the prompts, these are the prompts that we've configured, essentially telling the orchestrator how to behave. We have one for the orchestrator and a system prompt, basically telling it its purpose in life: this agent is here to solve problems for us. But we also have this keywords file. We use these keywords to figure out when we should respond; we respond only when one of the keywords is in the message.
So if we asked it a question like "what version is my EKS cluster," would it say anything? I think "cluster" is actually going to trigger a response, because it only looks at keywords. That's right, it's a live demo. Let's do it again. But if I asked it "what version of EKS am I running," for example, it wouldn't actually respond, because we have this manually authored keyword approach and "EKS" isn't on the list. We want to be better. We want to use AI. We want to figure out when we should respond to a message.
That's because in Slack you might have engineers talking to one another, and they might use one of those keywords when they didn't mean to. We should find the intent of what the message was about. If an engineer is talking about a problem and needs a solution, that's when we should fire up our bot. Otherwise, if engineers are just talking about their weekend or what they did, we should ignore that. Keywords is one simple approach to that, but let's use AI. We're going to expand our architecture a little bit here.
We're going to expand our architecture and make the orchestrator a bit smarter. When a Slack message is received, we're going to use a different LLM, Nova Micro, to classify the message intent. If it's for troubleshooting, proceed; otherwise, exit. The analogy I like here is that you wouldn't put a PhD mathematician in charge of your hotel check-in process, right? It's overkill. You don't need that level of intelligence to get people to the right rooms. We want to use a more lightweight, more efficient model. Because the Slack bot listens to every single message in Slack, you can see how that could quickly get more and more expensive if we're not intelligent about when we respond. That's exactly why we want to use something like Nova Micro: it's a very lightweight model that responds very quickly to messages.
Implementing Intent-Based Response with Before Invocation Hooks
So now Lucas, I'll pass it to you and he'll show us how the magic is done. Let's take a look at the terminal, and at the should_respond method real quick. As you can see, if any keyword from the list we defined is found in the message, then we respond. If you look at the list of keywords that we have, it's a very limited set, and it's a very limited validation method. As I just said, we want to do this in a much better way: we want smart classification. So the first thing I'm going to do is add a new prompt for the classification model that we're using, and I'm calling that the classification prompt.
So if you look at this prompt, it's very simple. "Is this message related to Kubernetes, system troubleshooting, technical issues, or requests for help?" Then we pass the message and we're replacing this variable right here, and I'm telling the model to please just respond with yes or no. Remember that we pay by the amount of tokens that gets generated. So if we limit the amount of tokens that the model can generate and still do a smart classification, it should be good enough and should be fast, right? This is exactly what we're going to validate. So we have added the classification prompt.
Now I'm going to go over to the agent orchestrator and add some imports. Let me remove the import on line 5 because we don't need it anymore. We import the boto3 client and, most importantly, the before-invocation hook: the agent already has the capability of intercepting messages before a message actually hits the agent.
What we're going to do now is implement a method that does smart classification using Amazon Nova Micro before we send the message to the agent itself. That way, we don't even pay for the Claude 3.5 Sonnet model that we're using in the agent. This is what we call a before-invocation hook: before invoking the agent, please run this method. What I'm going to do now is replace the constructor method here with an updated one that includes the smart classification we just created. The only thing that's different from the previous version is that we're instantiating a Bedrock client just to be able to talk to Bedrock and do the Amazon Nova classification. The agent itself didn't change. Then here on line 41 is the most important thing: the agent hooks add a callback before invocation, and that callback is the method I want to use to validate my message. I'm calling it callback_message_validator because it's a very straightforward name for what it does.
What we're going to do now is create that function and also the Nova classification function. We're going to go over those methods real quick so we can see what we're doing. The callback_message_validator is the method that's going to be called once we send a message to our agent. Then we're going to trigger the classify_with_nova method, passing the message and the last user message. If you go to classify_with_nova, this is where the trick is happening. First, we have the classification prompt. We're replacing the message variable with the current query that we're sending to the agent. Then here is the magic. We set max tokens to 10 because the response will be less than 10 tokens. By limiting the amount of tokens, we're able to get a really fast response, and because we're using Amazon Nova Micro, we can get that even faster.
The idea here is that LLMs are nondeterministic: you can ask the same question multiple times and get a different response. But there are ways to work around this, and that's exactly what this is doing; it's making the response more deterministic. Now we know that 99.9% of the time, if you ask this question, it's going to come back with either yes or no, and it will likely come back with the same answer every time. This is really just routing logic, and of course, fewer tokens means a faster response. Since every Slack message in the channel will be sent here, it has to be fast. Imagine a channel with 500 people where every message had to be classified with Claude 3.5 Sonnet; that would be really expensive, so Amazon Nova Micro is one option here. Another approach we tested was a small language model running on CPU. We ran that on our laptops and it worked. Of course, Amazon Nova Micro is a little bit better, but if you have good enough CPUs, you can also try small language models on CPU if you don't want to use Amazon Nova Micro or anything else.
Going back to the code, we return the answer. If the answer the model returns is yes, we return true; if it's anything other than yes, we return false. One thing I want to show you folks right here on line 53 is message_classification; we're going to look for that in the logs once we test the new implementation. But before we do that, we also need to remove the old method. The should_respond method isn't used anymore, so we're no longer doing the static keyword analysis. Now we rely 100% on Amazon Nova Micro with the before-invocation hook that we just defined. Let's see if it works. I actually like that we're using this before-agent-invocation hook rather than a manually implemented callback approach. The folks that built Strands really figured out all of the things you might need to do with an agent: before the LLM invocation, do this custom bit; maybe after, you want to format the message in a certain way. These are the kinds of things we wanted to show you to illustrate what's possible.
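Pulling those pieces together, a sketch of the classification hook might look like the following. The prompt text is paraphrased, and the Nova Micro model ID plus the way the hook reads the last user message and aborts the invocation are assumptions; the Bedrock converse call capped at 10 output tokens is the key idea:

```python
import boto3
from strands import Agent
from strands.hooks import BeforeInvocationEvent

CLASSIFICATION_PROMPT = (
    "Is this message related to Kubernetes, system troubleshooting, technical issues, "
    "tips worth storing, or requests for help?\nMessage: {message}\n"
    "Respond with only 'yes' or 'no'."
)

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

def classify_with_nova(message: str) -> bool:
    """Cheap yes/no intent check; maxTokens=10 keeps the call fast and the cost tiny."""
    resp = bedrock.converse(
        modelId="us.amazon.nova-micro-v1:0",  # model / inference-profile ID is an assumption
        messages=[{"role": "user",
                   "content": [{"text": CLASSIFICATION_PROMPT.format(message=message)}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    answer = resp["output"]["message"]["content"][0]["text"].strip().lower()
    return answer.startswith("yes")

class NotForThisBot(Exception):
    """Raised to skip the expensive agent when a message isn't relevant."""

def callback_message_validator(event: BeforeInvocationEvent) -> None:
    # Read the latest user message from the agent's conversation history and bail out
    # early if Nova Micro says it isn't troubleshooting-related.
    last_user_text = event.agent.messages[-1]["content"][0]["text"]
    if not classify_with_nova(last_user_text):
        raise NotForThisBot(last_user_text)

orchestrator = Agent(system_prompt="You route Kubernetes issues to specialist agents.")
orchestrator.hooks.add_callback(BeforeInvocationEvent, callback_message_validator)
```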
Okay, so we're restarting the Python app. No errors, no errors. That's a live demo as you can see. It's completely live, so if something breaks, it's my fault. Alright, so what version of EKS am I running?
I'm going to send the same query again to the model, and hopefully, if everything works and we did it the right way, we should see the smart classification and a response, because "EKS" was not on the keyword list. So let's send it again. Message received, generating response, message classification... there you go. If you look at this, that's the log line I told you about, and the content equals yes.
So before, we didn't reply because "EKS" wasn't on the keyword list, but now we are replying because Nova decided to say yes; it thinks this query is related to system troubleshooting, cluster information, and so on. Now that the classification has happened, we should see a response from the agent that we didn't see before, because the smart classification mechanism is in place.
So let me just try to ask something random. Yeah, we should do the negative use case, right? "Hey folks, what time are we going to have lunch today?" And then, fingers crossed... there we go, no response. All right, and you saw how quickly that came back, too. Yes, it's an API call, but this model runs that fast. We can look at the cost of it: it's one output token, maybe a few more for the processing, but essentially one output token. That's how much you're paying for every execution. Again, if you don't want to use Bedrock, if you want to use your own small language model, you can implement that same architecture. As I said, I think the Strands folks have thought about every single use case we could come up with.
Capturing Tribal Knowledge: Introducing the Memory Agent with S3 Vectors
We were doing the smart classification in a different way, and Sai approached me and said, "I think Strands already has a before-invocation callback handler you can use," and that made it much easier to implement. But before we move any further: if you look at the summary of our session, there's an important concept that we talk about. Yes, this is something critical that I think is going to be the crux of what makes these agents so powerful. It's the concept of tribal knowledge.
I've seen this term coined in the context of tribes in software engineering, which has been around for a while; I credit that to Spotify and the Spotify model of tribes and that kind of thing. Essentially, it's the idea that the knowledge in a company lives in the hearts and minds of the people who work there. These people are talking over email, they're talking over Slack, and there's actually a really interesting example I want to bring up. Have you folks heard of the Voyager 1 probe from NASA? They sent it out in the 70s. I'm seeing a few folks nodding your heads. Last year that Voyager probe had an issue: a memory chip failed and it wasn't sending usable data back to ground control.
So, we have a problem? Yes, we absolutely do. The memory chip failed, and to solve the problem they had to bring up documents from the 70s. They actually went back and pulled out cardboard boxes full of old documentation. Here's the crazy thing: they brought actual NASA engineers out of retirement to consult on the problem. With the minds of those people along with the documentation, they were able to work around that memory chip, solve the issue, and get data flowing back to ground control, and all was good, right?
But that's a concept that we know and we're all familiar with. Even when you have perfect documentation, well, there really is no such thing as perfect docs, but even when you have really deep documentation and you have the folks that are working there, you really have this concept of tribal knowledge that should be stored and it should be present in the places that your developers are already operating. Right now for a lot of our developers, it's Slack or Teams or wherever you're conversing. This is something that we wanted to build into the agent.
Any time an engineer talks about a specific design choice they made, maybe a version of an image, a tip, or how much memory and CPU you should define, we know that's not always easy to pin down exactly. The thing is, you shouldn't have your superstars, your elite SREs, constantly being bombarded with pings of "Hey Frank, what was the version of that image you were using?" We want all of us to be as efficient as possible in the workplace. So when these conversations do happen, let's store them.
In some sort of knowledge database that can be accessed when that question inevitably comes up three months down the line, or, in NASA's case, fifty years down the line. We want to store this type of tribal knowledge somewhere, and to do so we're going to expand our architecture a bit. This is what we had before, and to add tribal knowledge we're going to leverage a capability that we announced back in July: S3 Vectors. If you're wondering why we're using two agents layered on top of one another, now you get the answer, because we're integrating multiple agents. If we crammed all of those use cases into a single agent (trust me, I've tried it), it would take much longer to respond correctly, and hallucination would be a problem as well.
Up until now, the orchestrator agent was just taking the message, figuring out if it's relevant, and then passing it to the specialist. But now we actually have a branching approach where first we have to think about whether that message is actually for troubleshooting or if it's just a tip that we should store away for later use. The cool thing about this is the longer it's implemented in your engineers' Slack channels and wherever this agent is listening, the smarter it gets. It's storing these solutions for future use. Essentially, any time there's some information being shared in Slack, we automatically take it and store it in vectors for future recall.
Vectors, as we touched on briefly before, are a type of database that works nicely with LLMs. Of course, LLMs don't think in terms of human-readable text; they work with tokens, embeddings, number matrices, and transformers. Vectors are a very efficient way of storing what started as human-readable text: you take a bunch of text, encode it into embeddings, and then when the LLM needs to scan through that data later, it's much faster to search a vector database. The matching content then comes back as human-readable text for the end user. This is how we're going to expand the architecture.
It's not a choice between an agent, MCP, or RAG; it's a combination of different tools and capabilities you can use to empower the agent. We're using MCP for the live data, the data we need in real time about what's happening right now in your cluster. But for tips, runbooks, and everything else, we can still rely on RAG; we don't need MCP for that. We could, but in this case we're using RAG on top of another agent, so it's a combination of everything. It's not like once you use MCP and agents you stop using RAG. It's whatever is best for your use case, and like every other solutions architect would probably say: it depends on the use case.
Building the Memory Agent Server: Store and Retrieve Solutions
Let's try to implement the memory agent right now. Hopefully it's going to work. The first thing we're going to do is actually use another agent. We created the memory agent as a standalone agent because other agents could rely on it as well. We've created a Kubernetes troubleshooting chatbot, and maybe later we create a database or analytics troubleshooting chatbot; both could save and retrieve information in the same place, and we could reuse those runbooks across multiple agents.
The agent that we're going to be building now has a server embedded that's powered by Strands, and then we're going to integrate that agent server with our current structure that we have with orchestrator plus specialist using Agent2Agent. For those who don't know, Agent2Agent is just a way to communicate with different agents, JSON over HTTP essentially. It's nothing more than that, but agent to agent communication. It's really easy to understand.
As you're going through that, it sounds a whole lot like the twelve-factor app microservices spiel that we heard, what, twelve or thirteen years ago now in the early 2010s of why we break monoliths into microservices. It's really the same. It's kind of this revolution, a cycle of programming. And it's the same reason why we're breaking our agents apart because the memory agent has its own purpose that might be used outside of the context of the Slackbot, just like a microservice. What is a microservice? Something that should do that use case in a very good way, and then if you want to add other use cases on top of that, then maybe we need to start thinking about creating a different microservice. For the agent, it's exactly the same thing. The agent should have a purpose. We tried to have everything in a single agent, and the response was really awful. Once we broke it down and then used the agent as tools pattern with the orchestrator agent, we got a much better response and performance on top of the agents that we are developing.
Specialist agents, that's what I meant to say. I've pasted the code right here, and I'm going to go over what we're building. It's yet another agent, so it's more of the same kind of code we already have running. We have the constructor, where we declare some variables and initialize clients to talk to Bedrock and to our S3 Vectors database, and then we define the assistant prompt. The assistant prompt is essentially the instructions you give the agent on how it should behave: you are a Kubernetes troubleshooting memory specialist, and your role is to store solutions and retrieve solutions. After that, it's all formatting and configuration, but the tools that we have are store and retrieve. Then, using Strands again, we define the agent and the model that we want to use.
Lucas, I just want to point out something at number three. Some of you might be reading "Only do what you do and nothing more" and thinking it's very direct; you might even consider it rude. However, we've seen research papers on this topic: when you start using words like "please" or asking agents nicely, they actually end up hallucinating more often, not retrieving solutions, and not responding the right way. Additionally, if you say please, you're paying for extra tokens sent to the model. So you do want to be very direct with these models; they behave better when you give them tight, bounded instructions. Try to imagine it's a coworker you don't like and be very direct.
So the agent uses the Bedrock model ID and two tools. I'm not going to go over all the code because we don't have a lot of time, but essentially we have only two tools. One is store solution, which stores solutions in the S3 Vectors database. The other retrieves a solution from the vector database based on a query. Just a question: do we have to do the conversion into embeddings, or does S3 do that for us? How does that work? I was about to say that. S3 will not embed the data for you; you need to embed the data yourself. Embedding is essentially representing the data in a way the model can understand and work with. S3 Vectors is a vector database, and we still need to do the embedding ourselves. In this case, we're using Amazon Titan on Bedrock to generate the embeddings. Once data is embedded, we can't turn the embeddings back into the original text, so for retrieval we also have to embed the query.
Let's go to retrieve. We embed the query we're searching with, using the same embedding model, and then we search for similar embeddings in the vector database; since we can't reverse an embedded query back into text, similarity search over embeddings is how we find matches. That's exactly right. It comes down to what we talked about earlier: you don't want to store and scan the raw human-readable text, because it takes a lot longer to find relevant data; it's just faster with embeddings. Lucas, how many LLMs are we up to at this point? Is that our fourth, or maybe the fifth? We're using Sonnet, we're using Nova Micro, we used another Sonnet version for the memory agent, and Titan for the embeddings. That's four different models.
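A hedged sketch of those two tools, under the assumption of the S3 Vectors and Titan API shapes shown here (the bucket and index names are placeholders):

```python
import json
import uuid
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
s3vectors = boto3.client("s3vectors", region_name="us-west-2")

VECTOR_BUCKET = "kube-agent-solutions"   # placeholder vector bucket name
VECTOR_INDEX = "solutions"               # placeholder index name

def embed(text: str) -> list[float]:
    """Titan embeddings: the same model must be used for storing and for querying."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def store_solution(problem: str, solution: str) -> str:
    """Embed the solution text and persist it (plus readable metadata) in S3 Vectors."""
    key = str(uuid.uuid4())
    s3vectors.put_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=VECTOR_INDEX,
        vectors=[{
            "key": key,
            "data": {"float32": embed(f"{problem}\n{solution}")},
            "metadata": {"problem": problem, "solution": solution},
        }],
    )
    return key

def retrieve_solution(query: str, top_k: int = 3) -> list[dict]:
    """Embed the query and return the closest stored solutions (smaller distance = closer)."""
    resp = s3vectors.query_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=VECTOR_INDEX,
        queryVector={"float32": embed(query)},
        topK=top_k,
        returnMetadata=True,
        returnDistance=True,
    )
    return resp["vectors"]
```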
Look, this is how LLMs are meant to be used. You pick the right one for the tool, and it's more efficient. Every agent that we had there could also use a different model. We're just using the same model for the purpose of exporting one single environment variable, but we could use different models for different agents with different purposes. For instance, Claude Haiku is much faster than Claude Sonnet 3.7, but it's not as smart as Claude Sonnet 3.7. So we need to do those balances and try to pick the best one for our use case. In this case, we have the memory server again. We have two tools—one to store and the other to retrieve. Now what we're going to do is launch an agent server, but before we do that, I'm going to actually try to use those tools manually.
What we're going to do is use the agent variable that we have here from the class. I'm going to call the agent and say something like this. This just invokes the agent directly: imagine the orchestrator agent were talking to the memory agent. We're simulating that for the sake of testing. Yeah, okay, so we're doing a real example. Lucas is copying over a real recommendation that we have for the Node Exporter image. So what I'm saying is: anyone deploying monitoring agents should use this image. If you watched the hook of our presentation, you probably know where we're going with that information.
What we're doing now is triggering it manually first, so hopefully this query right here is going to trigger a store and then we want to create another query, another agent invocation to retrieve the solution. So what I'm going to say here is "what node exporter image should I use?" and this runs synchronously, so one after the other, which will give it time to store the solution. Essentially, the first one we hope will use the store tool and the other one should use the retrieve tool. We're not telling which tool to use, but because we are using an agent, it should figure that out.
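In other words, the test is just two direct, synchronous calls to the memory agent instance (here called memory_agent; the image tag is a placeholder, since the actual image isn't shown in the transcript):

```python
# Simulating the orchestrator talking to the memory agent: two synchronous calls.
memory_agent(
    "Anyone deploying monitoring agents should use image "
    "example.registry/node-exporter:v1.2.3"  # placeholder tag, not the real recommendation
)
result = memory_agent("What node exporter image should I use?")
print(result)  # expected to come back via the retrieve tool, not the store tool
```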
What we're going to do now is go to the folder where we have that agent created. There you go: the memory agent server. You can already notice the difference here. Before, we were just restarting the one Python application that had the orchestrator and the specialist agent. Now we have a separate memory agent that lives on its own, so it is its own microservice. Lucas came up with a word for this earlier: micro agent. You heard it here first, folks.
By the way, we do give you Helm charts to deploy all of this and all the sample code as well. We'll share with you at the end so you could deploy this in a Kubernetes cluster and you're not doing it manually locally. I think it's just for the sake of the demo we're doing now that we are doing it locally because it's a code session, but it all abstracts to Helm charts. So if you just want to take the shot and deploy this architecture, you can. We're going to give you access to the sample code at the end.
Going back to what we're doing, let's see if the tools get executed the right way. We're hoping to see a store and then a retrieve of the node exporter tip. I just triggered the memory agent, and you commented out the server start, so we're just calling the agent synchronously. It's just a test, yeah. The first call stores the solution with the key information: the problem is the team using a consistent, correct image, and that's the image you should use. And the second call retrieves the information. There you go: the DevOps team recommendation is that image, the one we added in the code.
So now we have that information in our S3 Vectors database. What I'm going to do now is go to a dashboard that was completely generated by AI, refresh it, and we should see the new solution. That dashboard just queries S3 to check which solutions we have available. As you can see, we have a total of one solution in that particular bucket. If I search here for node exporter, we should see the solution we just persisted: use a consistent and correct image for node exporter in Kubernetes monitoring deployments, and that's the image you should use. So that's now in the vector database, because we tested our memory agent locally.
What we're going to do now is remove those local test calls and start the A2A server. If you look here, the A2A server is essentially a class we import from the Strands multi-agent capabilities. We pass in the agent that we want to run as a server, and serve behind the scenes starts a uvicorn process. So if you run python memory_agent_server.py, you now have an agent running as a server. And if it's running as a server, you can expose it behind a load balancer and use that agent from your other agents, so you don't need to replicate the memory agent code again and again. So that's it for now for the memory agent server.
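A sketch of memory_agent_server.py along those lines; the A2AServer import path and constructor arguments follow the Strands multi-agent docs, but treat the details (and the memory_tools module) as assumptions:

```python
from strands import Agent
from strands.multiagent.a2a import A2AServer

# memory_tools is a hypothetical module holding the store/retrieve tools sketched
# earlier, wrapped with Strands' @tool decorator.
from memory_tools import store_solution, retrieve_solution

memory_agent = Agent(
    system_prompt=(
        "You are a Kubernetes troubleshooting memory specialist. "
        "Your role is to store solutions and retrieve solutions. "
        "Only do what you do and nothing more."
    ),
    tools=[store_solution, retrieve_solution],
)

# serve() spins up a uvicorn HTTP server exposing the agent's card and tools over A2A.
A2AServer(agent=memory_agent, host="0.0.0.0", port=9000).serve()
```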
Agent-to-Agent Communication: Connecting the Orchestrator to Memory Services
Now we need to make the orchestrator agent communicate with the memory agent server, and for that we're using A2A. So what we're going to do now is implement that A2A communication on top of our agent orchestrator. Remember, the agent orchestrator uses the agent-as-tools pattern, so we have an orchestrator agent that calls other agents as tools. If you can see here, we currently have just a troubleshooting tool that triggers the Kubernetes specialist agent. What we're going to do now is create the memory agent tool that talks to the other agent running as its own server over A2A.
The first thing we need to do is import the library we need, the A2A client tool provider, which lets us use another agent's tools in our own agent without replicating those tools in our code. The next thing we need to do is actually create a tool. Since we're using the agent-as-tools pattern from Strands, we also need a tool to talk to our memory agent, so I'm going to copy and paste the method that we created. As you can see, I have a decorator here; this is particular to Strands. If you put the decorator on top of a method, you can use that method as a tool in your agent. That's how it works.
The most important thing is the A2AClientToolProvider that comes from Strands, and then I'm passing the memory agent URL. As you can see, port 9000 is exactly the same port the memory agent is running on as a server; we've already configured that variable. Then I'm creating an agent here that is just an interface: we use this agent to discover other agents available in our environment. In this case we gave it one single agent, but if you look here it's a list, so we could pass multiple different agents and our agent would figure out which one it should call.
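A hedged sketch of that tool in the orchestrator. The A2AClientToolProvider import path follows the strands-agents-tools package, but the module path, the environment variable name, and the inner agent wiring are assumptions:

```python
import os
from strands import Agent, tool
from strands_tools.a2a_client import A2AClientToolProvider  # ships with strands-agents-tools

MEMORY_AGENT_URL = os.environ.get("MEMORY_AGENT_URL", "http://localhost:9000")

@tool
def memory_agent(query: str) -> str:
    """Store or retrieve tribal knowledge by delegating to the memory agent over A2A."""
    provider = A2AClientToolProvider(known_agent_urls=[MEMORY_AGENT_URL])
    # This small inner agent only exists to discover the remote agent's card and its
    # tools, then route the query to the right one (store or retrieve).
    a2a_caller = Agent(tools=provider.tools)
    return str(a2a_caller(query))

# Agent-as-tools: the orchestrator gets the memory agent as just another tool,
# alongside the troubleshoot_kubernetes tool from the earlier sketch.
orchestrator = Agent(
    system_prompt="Route Kubernetes issues: check memory first, then troubleshoot.",
    tools=[memory_agent],
)
```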
I'm going to switch to the Strands agent architecture diagram just to show you what we did. We have that orchestrator agent, the big box over here, and it's able to discover all the other tools it can work with. These external tools and AWS resources can themselves be additional agents. That is the kind of swarm we're building here; Strands likes to call them swarms, but essentially it's all of these different agents discovering each other and only talking to each other when they realize they're needed. That's really the idea: you can have all of these different tools, the MCP server that we deployed, this memory agent, and then the actual Kubernetes specialist agent. Realistically, it would be hard for us to even code an if-else chain for when each one should talk to the others, so we offload that responsibility to an agent.
You got it. All right, so let's get back to it and see how you did it. That's the memory agent provider, my tool that's going to talk to our agent server. The last thing we need to do is add that to the agent. As you were saying earlier, it's the agent-as-tool paradigm: we took an agent and threw it in there as another tool, just like we did with MCP and the troubleshooting tool. If you look at Strands, that's actually a documented pattern called "agents as tools," and that's the pattern we're following; we're also adding Langfuse on top of that for observability. Essentially, we have an orchestrator agent that is responsible for routing requests to other specialized agents.
Now let's go back. I've already added the memory agent provider here as a tool, so the last thing we need to do is change the system prompt. Let's go to the orchestrator system prompt. If you look at it, it doesn't say anything about memory or vectors; it just says we have access to the Kubernetes specialist agent for troubleshooting, that's the one you have, go for it. What we're doing now is replacing that orchestrator system prompt with something more aligned with the memory agent we created. Every agent has a system prompt. In this case I'm saying: always check memory first before doing any troubleshooting. If you can't find anything in memory, then go and troubleshoot. And if the information you produce is good enough, persist it for me after you're done with the troubleshooting.
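Paraphrasing that into a sketch of the updated system prompt (this is a reconstruction of the gist, not the exact prompt from the repo):

```python
ORCHESTRATOR_SYSTEM_PROMPT = """
You are an orchestrator for Kubernetes operations with two tools:
- memory_agent: stores and retrieves known solutions and tips (tribal knowledge).
- troubleshoot_kubernetes: a specialist that inspects the live cluster via MCP.

Rules:
1. Always check memory first before doing any troubleshooting.
2. If memory has no relevant solution, troubleshoot with the specialist.
3. If troubleshooting produces a useful root cause or fix, store it in memory.
4. If a message is a tip or recommendation from an engineer, store it in memory.
Only do what is described above and nothing more.
"""
```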
We can control that. We can split responsibilities here. We can make the agent fix things, of course we don't want to do that in production. If we're talking about production, we can make the agent open a pull request for a GitOps repository. We already have that if you want to see that, come back to next year's code session, but for this year's code session we are doing an orchestrator agent and then just passing the prompt here. Now I'm going to trigger our main agent that's going to be able to talk to the other agent running as a server.
You've started them up as two separate processes: one for the memory agent, one for the main orchestrator plus specialist. Think about this: if you were to deploy this in Kubernetes, you could deploy each of these as individual Python pods and scale them independently of one another. For example, the memory agent has to respond every single time a message is posted in Slack, whereas the troubleshooting agent doesn't need to respond as often, so you could scale each one in response to demand and load. Let's see if it actually worked. I'm going to try to get the agent to persist a tip that some of our DevOps engineers have shared. So what I'm going to say is: if you don't know how much CPU and memory to define for your limits, always define the same as requests, a tip from the DevOps squad.
If you don't know how much CPU and memory to define for your limits, always define the same as requests. That might well be true, right? So that's the tip I'm giving, and I'm signing it "tip from the DevOps squad." Hopefully we're going to be able to classify this message; the agent should respond to it because it's related to troubleshooting. But wait, we didn't update the classification for tips, did we? If you look at the prompt for the classification model, it only asks whether the message is related to Kubernetes, system troubleshooting, or technical issues. I bet if we updated it and made it a little smarter, telling it to also say yes when there are tips to store, it would handle this. We haven't tested that, so let's try: "Is this message related to Kubernetes, system troubleshooting, technical issues, requests for help, or any tips?" Is that good enough? Pretty good. That should cover tips, and if we wanted to get more precise about troubleshooting tips in the future, we could. For now we're doing it live; we want the classification to respond yes for a message like that. If it doesn't, I can always tag the agent directly, and then you can be sure this is a live demo.

Right, so let's wait for the Kubernetes specialist agent and the orchestrator agent to start. There you go. Now I'm going to send exactly the same message, and if we did it the right way, we should see it work in one go. Yes, there we go. Here's the thing again: it is nondeterministic by nature. That's how LLMs operate. But you saw that with just a little bit of prodding, like changing that classification prompt, we were able to make it respond better for something like this. We're not saying you're going to find the perfect prompt for your use case immediately. OpenAI, for example, has been refining and prompt-engineering their base models for years now. You're going to want to keep optimizing. With that small optimization, the classification worked.

And then, if you look, the agents talk to each other. First we send a message to discover the agents we have, using the agent card that each agent exposes. If you look here, the A2A list-discovered-agents call found the memory agent, and with that we also discovered the tools that the remote agent has. Then we send a message to the right tool: we call store solution on the other agent running as a server. And if you look here, one tool was called on the memory agent, to store a solution related to QoS Guaranteed, essentially defining the same amount of CPU and memory for requests and limits. And then there you go, you get a response here.

By the way, in a real scenario it probably wouldn't even respond; the bot would just sit there silently, listening for tips and storing them in the database. But for the sake of the demo, we have it reply saying, hey, this is a good tip, we'll store it in the database. And if you don't trust me: we had one solution before, so let's refresh. Now we have a total of two solutions. Let's search for CPU limits and see if that solution is already there. This is going directly against the S3 Vectors database, right? Just looking at what solutions we have. Yeah, absolutely.
So that's the thing: I'm doing RAG here in the same way we do it for the agent. We embed the query and then retrieve the solution, and the result with the shortest distance is the one that matters most for our use case. And if you can see, the top result is: define the same amount of CPU and memory for your limits as for your requests. That's great. The distance is like the inverse of a confidence rating, right? A smaller distance means more confident. And you saw that the second solution was at a distance of about 0.9; that was the node exporter tip, not very relevant to a question about CPU limits.

The last thing I'm going to try, since I know we have just five minutes, is to deploy some failing pods. Remember that first demo we saw at the start, fixing the monitoring agent? That's essentially what we're going to try to do now. So let me deploy the failing pods to our cluster. We should see the demo-app namespace here. And you still have the Python main orchestrator running, right? Or did we kill that? No, I killed it, but I'm going to start it again. OK, but essentially we have the pods not running. Let me just trigger the agentic troubleshooting again. There you go. So now if you look at this, we have a lot of issues going on.
Live Troubleshooting Demo: Fixing the Monitoring Agent with Stored Solutions
We have the monitoring agent not working. We have the backend API restarting with out of memory killed, and the front end has some issues as well. Then we have Redis here. Let's try to debug and fix the monitoring agent using the recommended image that we have stored in the solutions. Before we do that, let's see if the agent has started already. Both apps are running, there we go.
So what I'm going to say is: "Hey folks, my monitoring agent on demo-app namespace is not running for some reason." I'm a developer, and I don't know why it's not running. There's usually some poor SRE that's always in this channel having to respond to these kinds of messages. This person is probably overloaded with too many messages. Maybe some of you in this room can relate to this yourself. Hopefully, the agent can pick this up and not one of you. Let's see if it's going to fail or not. Yes, so the first thing that it's going to do, because of our system prompt, is search in our vector database to see if there is any solution related to this. Hopefully, we're going to get the right image because it's related to Node Exporter.
There it goes. So the first solution, the closest one to what we searched for, is solution number one. It was able to retrieve it properly, but it's not enough information on its own, so the agent goes on to do the troubleshooting itself with that information in mind. If we had more relevant information stored, it would use it. I actually love this question because it just demonstrated all the features we developed. On the left side, it found the two memory solutions, realized they weren't quite enough, then went back to the specialist agent and talked to MCP to get more information about what's failing. MCP gives it the application logs and the pod logs, and you can see it all thinking, all orchestrated together, until ideally it figures out what the original problem is.
Yeah, so right now it's just storing the solution because it was able to find the root cause. So if anyone deploys that agent again with that same root cause, we don't need to do the troubleshooting again. We can go straight to the issue. That's actually really cool. So it found the solution and decided to store it in our solutions database. Yeah, we didn't even have to code that in there. It just realized it and decided to store this for future use.
If you look at this, we didn't get the response that we were expecting to get, so let's try to ask that again. What is going on? What is the issue? You know, we've seen this before when it stores the solution in the database. It considers itself done. It's like, found the solution, stored it, we don't have to respond. But since we still have the context on the threads and everything, it responded super fast. So essentially, the root cause is we are using the image non-existing monitoring agent latest. Of course, this image doesn't exist. So would you like me to guide you through the steps to fix the issue? No, I would like you to fix the issue.
So I'm going to say, "Kate's magic pot, please," because you said that you don't want me to edit. I'm going to say please because I still watch Terminator 2, so I'm going to say please. So please, can you fix the monitoring agent for me using the recommended DevOps image? Okay, wildest things, Lucas, because we have 15 seconds left, folks. I want to share with you some resources really quickly. So scan this QR code. It's going to take you to a GitHub of all of the container sessions that we're doing at re:Invent. So find our session here. With CNS 421, find our session, and we'll have the sample code. We'll have all of the code that we showed today linked as well. It's all on GitHub for you to go through. We really appreciate you spending your time here with us today. So scan that QR. You'll find all of the relevant resources there, and we would love for you to leave us some feedback in the app as well. Thank you so much.
Hold on, hold on, hold on. Can you go back to the demo? There it goes. Now we have the monitoring agent running, as you can see right there. And if we go to Alertmanager, there we go: resolved. That's the ending we deserve. Thank you so much.
; This article is entirely auto-generated using Amazon Bedrock.