Kazuya

AWS re:Invent 2025 - Monitor the quality and accuracy of your generative AI workloads (COP418)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Monitor the quality and accuracy of your generative AI workloads (COP418)

In this video, Ganesh Sambandan and Raviteja Sunkavalli demonstrate how to monitor generative AI workloads using Amazon CloudWatch generative AI observability. They explain the evolution from chatbots to AI agents in production, highlighting that while building agents with frameworks like Strands SDK, LangChain, or Crew AI is straightforward, observing them at scale is challenging. The session features a live demo building a Waggle AI agent for the Pet Adoptions application using Strands SDK and deploying it to Bedrock AgentCore. They showcase CloudWatch's zero-code instrumentation using AWS Distro for OpenTelemetry, which provides end-to-end tracing through sessions, traces, spans, and sub-spans. The demo illustrates how to track agent decision-making, monitor metrics like invocations, latency, and token usage, and visualize the entire agent workflow through trajectory maps. They also introduce the new evaluation metrics feature launched during the CEO keynote for assessing agent accuracy and performance.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: The Evolution of Generative AI and the Challenge of Observability

Welcome, everyone. Thank you for choosing to stay this hour with us. It's truly a pleasure and an honor to be standing in front of you to share this session. We know it's day four of re:Invent, after lunch, with a lot of walking between sessions and hotels. By the way, anyone hit 10,000 steps a day during this week? Wow. I hit 15,000 yesterday. Trust me, I can feel it too.

So, good news. This is not going to be a sit and listen session. We are keeping things interactive with live demos, so please feel free to interrupt us if you have any questions. Welcome to our session, COP418, Monitor the Quality and Accuracy of Your Generative AI Workloads. I am Ganesh Sambandan, Senior Technical Account Manager with AWS. I support strategic customer segments with AWS. I've been with this company for around four years now, and I'm also a Cloud Operations Technical Field Community member.

I'm Raviteja Sunkavalli. I'm a Senior Solutions Architect with AWS, specializing in Cloud Operations. My focus areas are observability and instance management. Awesome. Together we are going to explore some interesting topics today: AI, of course, AI agents using Strands SDK, and also CloudWatch generative AI observability, and how you can leverage all these tools to run your AI systems at scale in production.

Thumbnail 90

This is the quick agenda for today. Introduction to agents: we're going to talk about AI agents, the evolution of AI agents, and their current state. Then we will discuss observability and why it is absolutely critical to have observability for AI agents. Then the fun part: we are going to show a live demo of an AI agent and how to build one with the Strands SDK. Then we are going to explore CloudWatch generative AI observability and how you can ingest metrics from AI agents into CloudWatch and monitor them. Then we are going to walk through a couple of common troubleshooting scenarios you may face with AI agents. Finally, we're going to share some resources that you can explore after the session, including live code samples that will help you build AI agents with confidence.

Thumbnail 150

So, generative AI has come a long way, right? Back in 2023, we were discussing chatbots, how prompt engineering works, and how RAG systems work. So it was a year of exploration and proof of concept, right? Then came 2024. We started to adopt a few AI agents and a few AI capabilities within our applications. But then the discussion has shifted from how to use these AI agents securely to being responsible with AI, right? And now we are in 2025. AI is everywhere, right? All the applications want to enable AI. All the enterprise companies want to see the real automation and business value out of AI.

Thumbnail 210

Here's the interesting part. Building an AI agent is fairly straightforward, or at least less complex, but observing them is a real challenge, right? And to give you more options here, at AWS we provide the Strands SDK, and there are other well-known frameworks available, like LangChain and Crew AI, that you can use to build AI agents. All these frameworks have different advantages and capabilities, right? Again, with all the popular frameworks available, building an agent is a straightforward thing. But when you deploy AI agents into production at the scale of tens of thousands of agents, it's extremely difficult to observe and monitor them, right? And customers have told us that getting a complete view of an AI agent from start to end and monitoring its requests is extremely difficult and challenging.

Thumbnail 280

Thumbnail 290

Amazon CloudWatch Generative AI Observability: Features and Architecture

To solve this challenge, we provide Amazon CloudWatch generative AI observability, right? So what is Amazon CloudWatch generative AI observability? It is curated insights and metrics out of the box. You have all the critical metrics like invocations, threads, latency, and token usage in the same place, so you get a 360-degree view of agent workflows. Then zero-code instrumentation: we use AWS Distro for OpenTelemetry to instrument your AI agents so they can send telemetry to CloudWatch for you to monitor. And then end-to-end prompt tracing: you can trace the entire interaction, how your

Thumbnail 310

Thumbnail 340

AI agents invoke the LLMs and use RAG systems. You can track every request from start to end. Then, data protection policies: we have built-in CloudWatch data protection policies that you can use to mask your data and maintain compliance for the data you ingest. And then finally, the new capability that we launched on Tuesday during our CEO keynote: evaluation metrics. What are evaluation metrics? Basically, they monitor your AI agents to make sure they are doing the right thing at the right time in the right way. We have scores and prompts that you can use to evaluate your AI agents. We will show more in the live demo.
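
To make the data protection piece concrete, a CloudWatch Logs data protection policy can be attached to the agent's log group programmatically. The sketch below masks only email addresses as an example, and the log group name is a placeholder; adapt both to your own setup.

```python
import json
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Minimal data protection policy: audit and mask email addresses found in the
# agent's log group. Extend DataIdentifier with other managed identifiers as needed.
policy = {
    "Name": "agent-pii-masking",
    "Version": "2021-06-01",
    "Statement": [
        {
            "Sid": "audit",
            "DataIdentifier": ["arn:aws:dataprotection::aws:data-identifier/EmailAddress"],
            "Operation": {"Audit": {"FindingsDestination": {}}},
        },
        {
            "Sid": "redact",
            "DataIdentifier": ["arn:aws:dataprotection::aws:data-identifier/EmailAddress"],
            "Operation": {"Deidentify": {"MaskConfig": {}}},
        },
    ],
}

logs.put_data_protection_policy(
    logGroupIdentifier="/your/agent/log-group",  # placeholder log group
    policyDocument=json.dumps(policy),
)
```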

Thumbnail 370

Now that you know CloudWatch generative AI observability, one of the biggest advantages of CloudWatch generative AI observability is its flexibility. No matter where your AI agents run, whether it is in Bedrock AgentCore or EKS or ECS or Lambda, you can integrate CloudWatch generative AI observability and you can monitor your applications. That flexibility will give you the ability to monitor your AI agents across your infrastructure.

Thumbnail 400

Thumbnail 430

Thumbnail 440

Now that we've discussed what CloudWatch monitors, let's dive into how it structures and tracks all your metrics. At the top, we have the session. Think of a session as the complete conversation between a user and an AI agent. For example, in a chatbot, everything from your first question to the end of the conversation is the session. Under a session, we have traces. Think of a trace as one interaction: one request and one response. Everything that happens in between is tracked within that trace.

Thumbnail 460

Under a trace, we have spans. A span tracks an individual task happening within that interaction, such as passing the user request or invoking the LLM, along with all the metrics related to it. Under a span, we have sub-spans. A sub-span captures fine-grained detail, like a single API call, returning a result to the user, or an individual LLM invocation. With these key concepts of session, trace, span, and sub-span, you get a 360-degree view of your agents: how they perform and how you track them with evaluations. You get complete control over and confidence in how your agents run in production.
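
Purely as an illustration of how that hierarchy maps onto OpenTelemetry concepts (with ADOT's zero-code instrumentation you do not write these spans yourself), nested spans under one trace might look like the sketch below; the session.id attribute is only a hypothetical grouping key.

```python
from opentelemetry import trace

# Illustrative only: shows how spans nest under a trace. Without an SDK
# configured this uses a no-op tracer, so it runs but exports nothing.
tracer = trace.get_tracer("waggle-ai-demo")

with tracer.start_as_current_span("agent_invocation") as root:     # trace root: one request/response
    root.set_attribute("session.id", "demo-session-123")           # hypothetical session grouping key
    with tracer.start_as_current_span("llm_call"):                  # span: one task in the interaction
        with tracer.start_as_current_span("bedrock_api_request"):   # sub-span: a single API call
            pass
```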

Thumbnail 500

Thumbnail 510

Now that we have discussed the concepts behind CloudWatch and AI agents, let's move from theory to practice. Ravi is going to show us a live demo of an AI agent with CloudWatch generative AI observability. While Ravi is presenting, I will be around to answer any questions. Feel free to interrupt us if you have any questions. Ravi?

Building AI Agents with Strands SDK: From Local Development to Production Deployment

Awesome. Thank you, Ganesh. Before we start, I'm going to structure this demo into three parts. In the first part, we're going to build a simple agent and run it locally, then add tools so the agent can call our microservices, and then deploy it to the runtime. In the second part, we're going to enable observability and observe the entire telemetry in the new CloudWatch generative AI console, and we'll look at some troubleshooting scenarios to monitor the performance of the agents. And in the third part, we'll monitor the accuracy of the agent using the new evaluations feature. Sounds good?

Before we get started, I'm going to ask you a question here. How many of you have ever explored the AWS One Observability Workshop before? Okay, I see a few hands raised. For those who have explored it, this application might be familiar. We let customers get hands-on with AWS-native CloudWatch as well as open-source tools like Amazon Managed Prometheus and Amazon Managed Grafana using this application called Pet Adoptions. It has a front end and four different microservices: PetSearch, PetFood, PetListAdoptions, and PayForAdoption. All the telemetry from this application is ingested into CloudWatch. Today, what we're going to do is build an agent called Waggle AI using Strands.

Thumbnail 610

We're going to deploy it to Bedrock AgentCore, collect the telemetry using the AWS Distro for OpenTelemetry SDK, ingest it into CloudWatch, and then explore the telemetry within CloudWatch. So, this is the code that we're going to explore. You can take a snapshot, and then we can kickstart our demo. All good. Can I move on to the demo? Awesome.

Thumbnail 650

First, let's understand the basics of Strands and see how easy it is to build agents using the Strands SDK. How many of you are already familiar with the Strands SDK and building agents with Strands? Okay, I see quite a few here. So for those who don't know or haven't explored Strands, I'm going to give you a brief overview real quick. Are you able to see my IDE? Zoomed in. Awesome.

Thumbnail 690

Thumbnail 710

First, I'm going to import my agent module, which is used to build our agents. Then I'm importing my Bedrock model from Strands. Again, Strands supports models from other third-party providers as well, but for the demo, I'm using a Bedrock model. Next, I'm defining my system prompt. The system prompt is generic: here, I want my agent to use only the data from the trained LLM, not tapping into any outside tools or APIs. It's just a generic system prompt, like: hey, you're Waggle, a pet-friendly pet food recommendation agent, helping pet parents find the perfect food. I gave other instructions on how to approach finding the food, and I gave some guidelines on exactly how it has to act. Think of this system prompt as a job description for your agent: it has all the instructions it needs to follow.

Thumbnail 730

Thumbnail 740

Thumbnail 750

Thumbnail 760

Thumbnail 770

Thumbnail 780

Then we define the agent instance itself. It's fairly straightforward: open your Agent class and define your model. I'm using Bedrock Claude Sonnet 4, with the model in us-east-1, and I'm passing the system prompt defined above to this agent. That's it. Once it's defined, you can call the agent and pass it an input. It's that easy to build an agent in Python. For this demo, I added an interactive loop so that we can interact with the agent in the CLI. I'm going to save it and run the agent real quick, then ask a sample question. Okay, it's giving me some responses directly. It's calling the LLM, using its pre-trained model, and giving me some responses, right? But you don't know what exactly it's doing behind the scenes. That's where we need the visibility.
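
For reference, a minimal local version of that agent might look roughly like the sketch below. It assumes the Strands SDK's Agent and BedrockModel interfaces; the model ID is written from memory, so verify the exact identifier for Claude Sonnet 4 in your region.

```python
from strands import Agent
from strands.models import BedrockModel

# Abbreviated system prompt; the full one in the demo carries detailed
# instructions on how to approach food recommendations.
SYSTEM_PROMPT = (
    "You are Waggle, a pet-friendly pet food recommendation agent. "
    "Help pet parents find the perfect food and explain your reasoning."
)

model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",  # verify the exact Claude Sonnet 4 ID
    region_name="us-east-1",
)

agent = Agent(model=model, system_prompt=SYSTEM_PROMPT)

if __name__ == "__main__":
    # Simple interactive loop for chatting with the agent from the CLI.
    while True:
        question = input("You: ")
        if question.strip().lower() in {"exit", "quit"}:
            break
        agent(question)  # Strands prints the streamed response to stdout by default
```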

Thumbnail 800

So let's add some more concepts before we move it to production. First, instead of having the agent use its pre-trained data, I want the agent to call my microservices in real time, fetch the data, and then recommend the food. For that, I'm importing a tool called http_request. With this tool, the agent can make HTTP calls to my API endpoints. Then I'm importing the Bedrock AgentCore app, which is used to decorate our agent functions and deploy them to the AgentCore runtime. For those who are new to AgentCore Runtime, it's our new host where you can run and scale your agents dynamically.

Thumbnail 860

Thumbnail 870

Thumbnail 880

Thumbnail 890

Thumbnail 900

Again, the prompt is similar, but here, instead of a generic prompt, I'm specifying certain instructions: first, get the pet details from this endpoint, where all my pets are available; then get the food details from this endpoint, where my food database lives; then match the pet characteristics, consider the nutritional needs, and provide clear reasoning for each recommendation. There are some more instructions on how to give the responses back. Once that is done, you initialize the Bedrock AgentCore app and add an entry point. Earlier, in our local code, all we did was a synchronous invocation, which means the agent waits for the entire response before returning it. But in real time, if you're building chatbots, you have to stream the data. That's why I'm using an asynchronous function here to stream the data, so that my front-end chatbot can catch it and stream it live. Okay. Now we have modified our code so that we can push it to production.
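
A rough sketch of that production variant is below, assuming the http_request tool from strands_tools and the BedrockAgentCoreApp wrapper from the bedrock-agentcore SDK; the microservice URLs are placeholders standing in for the PetSearch and PetFood endpoints.

```python
from strands import Agent
from strands.models import BedrockModel
from strands_tools import http_request                     # lets the agent call HTTP endpoints
from bedrock_agentcore.runtime import BedrockAgentCoreApp  # AgentCore runtime wrapper

# Placeholder URLs standing in for the real PetSearch and PetFood microservices.
SYSTEM_PROMPT = """You are Waggle, a pet food recommendation agent.
1. Get pet details from https://petsearch.example.com/api/search
2. Get available foods from https://petfood.example.com/api/foods
3. Match pet characteristics to nutritional needs and give clear reasoning for each recommendation."""

model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",  # verify the exact model ID
    region_name="us-east-1",
)
agent = Agent(model=model, system_prompt=SYSTEM_PROMPT, tools=[http_request])

app = BedrockAgentCoreApp()

@app.entrypoint
async def invoke(payload):
    # Asynchronous entry point: yield chunks as they are generated so the
    # front-end chatbot can stream the answer live.
    async for event in agent.stream_async(payload.get("prompt", "")):
        if "data" in event:
            yield event["data"]

if __name__ == "__main__":
    app.run()
```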

Thumbnail 910

Thumbnail 930

But how do we add observability here? Do we need to change the code? Is it feasible to change the code for all your agents? It's a hassle to maintain observability packages, right? That's where the AWS Distro for OpenTelemetry (ADOT) SDK comes in. All you need to do is add this package to your requirements file before you deploy. So you'll have your agent framework, for instance LangChain or any other third party, and your hosting framework; right now, I'm using Bedrock AgentCore to host. Then just add the AWS OpenTelemetry Distro. Whenever the agent makes a call to an LLM, a tool, a RAG system, or a knowledge base, ADOT monkey patches the call: it intercepts it, extracts the telemetry such as prompts and metrics, and sends it to CloudWatch.
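
As a rough sketch, the requirements file for a setup like this might list the following packages (names as published on PyPI at the time of writing; pin versions for production):

```
# requirements.txt (sketch)
strands-agents            # Strands SDK for building the agent
strands-agents-tools      # community tools, including http_request
bedrock-agentcore         # Bedrock AgentCore runtime app wrapper
aws-opentelemetry-distro  # ADOT: zero-code instrumentation that exports to CloudWatch
```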

Thumbnail 970

Thumbnail 1000

Once that is done, you can publish the agent to AgentCore in different ways. The easiest way is the AgentCore starter kit we provide. You install the starter kit, configure the agent, provide the entry point (your agent code), grant the permissions needed to publish it, give it a repository where it can create the images, and name your agent. That's it. Once you configure the agent, it creates a Dockerfile, opens up and runs a CodeBuild pipeline, pushes the image to the ECR repository, and deploys the agent to the AgentCore runtime. You can customize this as well. If you don't want to use the starter kit, you can create your own Dockerfile, build and push your image to ECR, reference the ECR repository in your API call, and push it to the runtime.

Thumbnail 1040

Thumbnail 1090

Once that is done, observability is enabled by default. So far, everything has been about AgentCore. What if you want to deploy on EC2, EKS, or any other host and still observe it in CloudWatch? There are a couple of extra steps. Within AgentCore, these parameters are preconfigured by default, but on other hosts like EKS or EC2, you need to export a few environment variables to tell ADOT where to send the telemetry in CloudWatch. For example, you specify the log group the logs should go to, the region, the protocol, and which OpenTelemetry distro you are using (here, the AWS one), and you enable agent observability. You just add these extra environment variables (a rough sketch follows), and that's it; it works the same as what you did for AgentCore. So far, any questions on the AgentCore deployment?
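
For reference, the extra host configuration might look roughly like the sketch below, expressed as a small Python helper that prints the exports to set on the host. The variable names follow the ADOT and AgentCore observability documentation as best I can recall, so double-check them against the current docs; the log group, stream, and service name are placeholders.

```python
# Environment variables ADOT expects when the agent runs outside AgentCore
# (EC2, ECS, EKS, ...). Values below are placeholders.
ADOT_ENV = {
    "AGENT_OBSERVABILITY_ENABLED": "true",           # turn on agent observability
    "OTEL_PYTHON_DISTRO": "aws_distro",              # use the AWS OpenTelemetry distro
    "OTEL_PYTHON_CONFIGURATOR": "aws_configurator",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",  # export protocol
    "OTEL_EXPORTER_OTLP_LOGS_HEADERS": "x-aws-log-group=agents/waggle-ai,x-aws-log-stream=default",
    "OTEL_RESOURCE_ATTRIBUTES": "service.name=waggle-ai-agent",
    "AWS_REGION": "us-east-1",
}

if __name__ == "__main__":
    # Emit shell exports to paste into EC2 user data, an ECS task definition,
    # or a Kubernetes manifest.
    for key, value in ADOT_ENV.items():
        print(f'export {key}="{value}"')
```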

Thumbnail 1100

Thumbnail 1110

Thumbnail 1120

Thumbnail 1140

Thumbnail 1150

Thumbnail 1170

Live Demo: Monitoring Waggle AI Agent Performance Through CloudWatch Dashboard

Sounds good. So let's move to the console here. For this demo, I have already deployed my agent, the pet food agent, to AgentCore. And this is the application I'm talking about, PetAdoptions, where you can see all the pets, adopt a pet, pay for the pet, and also buy food for the pet. This is where I wanted to integrate a chatbot so that customers can come here and chat with the agent to get some food recommendations. I'm going to open a chat here. So this is my new Waggle AI. I'm going to have a conversation with it. There you go. It's so excited to help us find the perfect food. Let me ask a question here: I have Max, a black puppy, two months old. Can you suggest some good food for him?

Thumbnail 1190

So it makes some interesting decisions at the back end and gives us some response. That's where we need the observability, because we need to monitor exactly what decisions the agent is making at the back end. So it gave me some response here. Let me help you find a food for little Max. First, it is searching for Max in our inventory, and then it will find some great food. In the next line, it found several black puppies in our database.

Thumbnail 1210

Thumbnail 1220

Thumbnail 1240

The agent is searching for the food and provides some recommendations, like beef and turkey kibbles and raw chicken bites. Let's see if those two foods are available in our database. Let me open it in a new tab. There you go. We have beef and turkey kibbles, raw chicken bites, and then puppy training treats. So we have three foods, and it recommended the top two. You can see there are some other foods for kittens and others for bunnies. It means it's working fine, but let's see how it is making those interesting decisions at the back end to find the right food for our customers.

Thumbnail 1250

Thumbnail 1270

Thumbnail 1280

I'm going to navigate to the new CloudWatch dashboard. This is the new CloudWatch console where you'll find generative AI observability. On the left-hand side, we provide two dashboards: one is model invocations, and the other is Bedrock AgentCore. For the model invocations dashboard, you need to enable Bedrock model invocation logging. If you are using Bedrock models, it gives you what I'd call the generative AI golden signals, like invocation count, latency, and tokens, and you can also see the input and output of each request for each Bedrock model.
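
If model invocation logging isn't enabled yet, one way to turn it on is through the Bedrock API, as in the hedged sketch below; the log group and IAM role are placeholders you would create beforehand.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Enable Bedrock model invocation logging so the "Model invocations" dashboard
# has data. Log group and role ARN below are placeholders.
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/model-invocations",                    # placeholder
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",  # placeholder
        },
        "textDataDeliveryEnabled": True,   # capture prompt/response text
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)
```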

Thumbnail 1290

Thumbnail 1300

Thumbnail 1310

Thumbnail 1320

So let me click on a particular request here. You can see the input: it's my system prompt, and you can see my user input, like asking to buy food as well. You can see exactly what the model invocation was, so you can see the input and output of each request for every model. If you go to the Bedrock AgentCore dashboard, this is where you can view all your agents hosted on multiple platforms in a single pane of glass. Right now, I have two agents that were invoked in the past twelve hours. You can see the list of agents, the environment where they're hosted, the number of sessions that were run, and the traces. If there are any errors or throttles, you'll see all of this in a single pane.

Thumbnail 1350

Thumbnail 1370

Thumbnail 1390

You can also see the runtime metrics; these are only for agents hosted on AgentCore. You can see the runtime sessions, invocations, vCPU consumed, and memory. You can click on one of the agents here. The pet food agent is the one we interacted with, so I'm going to click on it. This gives you a detailed view of a particular agent. There are multiple preconfigured widgets here: errors and latency by span, and the number of sessions and traces for that particular agent. Then there is foundation model token usage. Right now, I'm using Claude Sonnet 4, and you can see the time series graph of token usage, inputs and outputs. If there are any errors or throttles, you'll see them in this pane.

Thumbnail 1400

Thumbnail 1420

Thumbnail 1430

Thumbnail 1440

Now, if you want to dive deep into a session, you can go to the sessions view. I'm going to grab my session ID, which I have here, and filter on it. It's going to take some time since it's filtering about twelve hours of data. There you go. Once you click on a particular session ID, you'll see all the traces. Each trace is an interaction; think of it as one question to the agent. So here, I've asked two questions: one is "hey Waggle," and the other is the question about Max. Each of those is its own trace.

Thumbnail 1450

Thumbnail 1460

Thumbnail 1470

Thumbnail 1480

Thumbnail 1490

Thumbnail 1500

So let's go back and click on one of the traces here. It opens up a side panel where you can see the trace metrics: how many spans are in that particular trace, the average latency, and how many tokens were used. If there are any errors, you'll see them here. Most importantly, we have the new trajectory map. You can see all the choices made by the generative AI agent in a visualization here. For example, I invoked my Waggle API, so it invoked Bedrock AgentCore one time. Then it invoked Strands Agents, which ran three loops. In the first loop, it searched for the pets. That's what I instructed in the system prompt, if you remember. Where is my prompt? There you go, you can see: the first step is to get the details from this API, and that's what it did to search for the pets.

Thumbnail 1510

Thumbnail 1520

Thumbnail 1530

Thumbnail 1540

Thumbnail 1550

Then in the second loop, it retrieves the food recommendations. In the third loop, it sends all the inputs to the LLM, and the LLM provides the recommendations based on the data. Now you can see the entire conversation by clicking on one of the chat spans. There you go. First, you can see the system prompt, then my user prompt, what the agent responded with, and the turn-by-turn conversation. Now I asked another question about Max, so you can see Max here. You can see the assistant message and the tool message. The tool invoked my microservice to get the pet search results and returned my entire pet database in JSON format, including the age and the URL.

Thumbnail 1590

Thumbnail 1600

Thumbnail 1610

Then what it does is input this as a user message to the LLM so that the LLM can process it. It becomes the user message, and then the assistant says it can see all the adorable black puppies in the database. Now it found the puppies. Now it's looking at the food. Again, it called a tool, got the response from my microservice, and again it is inputting it as a user message to the LLM. Then it is making the recommendation here. So you can see the entire conversation in your CloudWatch observability dashboard. It's very important to see what your customers are seeing.

Thumbnail 1620

Thumbnail 1630

Thumbnail 1650

We have quick filters here. Let's say you want to filter for any spans that have errors; right now, we don't have any errors or any high latency. You can pinpoint exactly at which hop it is taking more time to return responses to the user: is it taking time at the first hop, or when it's calling a tool? You can view all the details here. Also, for each LLM invocation, you will see the input and output tokens, showing how many tokens the LLM is using. You can then set up alarms so that if usage crosses a limit you are not expecting, the alarm triggers and you can start an investigation from there, as sketched below.
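
One possible sketch of such an alarm follows; the namespace, metric name, and dimensions are placeholders, so copy the exact values of the metric your agent actually emits from the CloudWatch console.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical alarm on output-token volume. Namespace, MetricName, and
# Dimensions are placeholders -- use the metric your agent really emits.
cloudwatch.put_metric_alarm(
    AlarmName="waggle-ai-output-tokens-high",
    Namespace="YourAgentMetricsNamespace",      # placeholder
    MetricName="OutputTokens",                  # placeholder
    Dimensions=[{"Name": "AgentName", "Value": "pet-food-agent"}],  # placeholder
    Statistic="Sum",
    Period=300,                                 # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=100000,                           # tokens per period considered too high
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```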

Thumbnail 1670

The first resource gives you code samples for deploying the agents on other platforms. Today we saw Amazon Bedrock AgentCore, but if you want to deploy on Amazon EC2, Amazon ECS, or Amazon EKS, that's the first resource. The second one addresses the question about Strands, LangChain, and other SDKs; it's a great resource for exploring observability across different agentic frameworks. The third and fourth are the documentation and the launch blog, in case you want to get started with generative AI observability. Thank you so much. Please feel free to leave survey feedback in the application. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
