
Kazuya


AWS re:Invent 2025 - Elevate application and generative AI observability (COP326)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Elevate application and generative AI observability (COP326)

In this video, Matheus Arrais and Peter Geng demonstrate Amazon CloudWatch's comprehensive observability solutions for both traditional and AI-powered applications. They introduce CloudWatch Application Signals for APM, featuring automatic service discovery via OpenTelemetry, pre-built dashboards with golden signals (volume, latency, faults, errors), and SLO management. Live demos show troubleshooting using Application Map, trace analysis, and Container Insights integration. For generative AI workloads, they unveil CloudWatch AI Observability (GA), which provides end-to-end prompt tracing across LLM calls, agent actions, and tool invocations. Key features include PII data masking, LLM-as-a-judge evaluations for quality assessment, and native integration with Amazon Bedrock AgentCore Runtime. The session emphasizes unified monitoring across infrastructure, application, and AI layers in a single pane of glass, with practical examples using a Pet Clinic application demonstrating both traditional microservices and AI agent observability.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: The Critical Need for Visibility in Complex AI-Powered Applications

COP 326, elevating application and generative AI observability. My name is Matheus Arrais. I'm a worldwide tech lead for CloudOps based out of Dallas, Texas. This is my sixth re:Invent. I'm very excited to be here. I have Peter with me. Hi, I'm Peter Geng. I'm a product manager in CloudWatch, my first re:Invent, but very excited to meet you as well. Awesome.

Thumbnail 30

So let's start with this picture. Imagine if you were driving down an unfamiliar highway and you have very limited visibility. It's quite nerve-wracking and dangerous as well. You don't know exactly what the road's going to do or what's going to appear in front of you, and you're just starting to grip the steering wheel even harder. Your palms are sweating, you just want to make sure you drive safe home. We've all been in this familiar situation, driving with limited visibility, and we don't wish to repeat it again.

Thumbnail 90

So this is like operating in today's complex applications powered by AI. We often lack visibility and observability, and we just want to understand how the systems are driving forward. Without it, it's quite dangerous. Even our CTO Werner Vogels once said that you need to have visibility, because without it, you're just guessing. You're guessing what the system's going to do next, how the systems will react to the dynamics of your end users.

Thumbnail 110

So in this presentation we're going to help with that. First, we're going to look at the state of application observability and monitoring today, the challenges that arise now that generative AI is in the picture, and how to gain visibility and observability into AI-powered applications. So our commitment, myself and Peter, is that you're going to walk out of this session and go back to your places, your homes, with actionable insights for how you elevate application but also generative AI observability in a single pane of glass.

Thumbnail 150

Understanding Application Observability: John's Journey Through Multi-Layer Monitoring

Let's start with John. So John here is our persona. He's an SRE, he's a developer, he's someone that is using CloudWatch on a daily basis, and as you can see, his application right there is quite complex. It has many layers, backend, database, infrastructure layer. So John's using just simple monitoring, like CPU metrics, stuff like that.

Thumbnail 170

Thumbnail 190

There are many layers on that application that John can leverage, as you can see on this diagram. So each layer that you see on the screen is a possible layer that can be observed. So networking, infrastructure, application, database, users, and so on. From the bottom up, AWS offers many choices for you. For instance, on the infrastructure, you can observe from the EC2 perspective, containers perspective, serverless, on-premises, and so on.

The application layer is one of the crucial ones because you can use application logs to understand how the applications work, traces for the interconnection between your applications, moving from the CPU metric perspective to a service level objective perspective tied back to your business outcome. Last but not least, you have the user perspective where you can create an understanding of your user behavior using CloudWatch for that.

Thumbnail 240

So what John has asked for us is, I want a native integration. Because he's using AWS today, he wants faster time to attack problems and also to resolve them. He wants to cost optimize, doesn't want to move data from here to there. Doing data duplication is quite complex. And also it should be a single pane of glass, it should be a simple service to use that has actionable insights that is smart, using, for instance, machine learning to understand anomalies.

Thumbnail 280

And for each one of those layers, Amazon CloudWatch offers John a perspective and an option to monitor his application cover to cover. So from the bottom up, in the network perspective, we have, for instance, internet monitoring that can analyze internet performance, understanding if it is your internet provider that has problems or some outages.

You have the infrastructure level, the blue one, where you can have CPU and container services and leverage Container Insights and Lambda Insights to understand the system level of that particular infrastructure choice. We have Database Insights too, to collect and monitor database performance in real time, so you can understand which SQL queries are taking longer than others. The green one, which is the application layer, is the most crucial one, like I said before, because you can automatically detect your application KPIs and use SLOs to maintain the service quality commitments you have with your customers.

Thumbnail 380

Last but not least, the top level, which is the user's perspective, you can use real user monitoring and synthetics to produce canaries to simulate the user's behavior and alert if you have any availability issues or performance degradation, all of that in a single tool, which is Amazon CloudWatch. So going to the three pillars of observability, everyone might be familiar with this, but that's the journey that we wanted to discuss here. Every observability strategy will come along with metrics, logs, and traces. This is called the three pillars of observability.

Thumbnail 410

Thumbnail 420

Thumbnail 430

Thumbnail 440

Metrics give me the information that, say, my application response rate increased 35%. Logs give me, in structured or unstructured format, the information that a user just authenticated in my application. And traces let me answer questions about distributed processing time across my microservices. So as you saw in the beginning, John has a quite complex architecture, and of course some challenges arise with that.

A complex microservices architecture needs consistent performance over time to satisfy the end user. John also needs to manually stitch together telemetry data per service to understand availability degradation and high latency. It's also very hard to understand which anomalies matter most to the business and how to prioritize them. Last but not least, there is the disjointed experience: besides correlating metrics, logs, and traces, you also need to put real user monitoring and synthetic monitoring in place. Put it all together, and it can take a very long time to get to the real issue.

Thumbnail 510

Golden Signals and Service Level Objectives: Connecting Technical Metrics to Business Outcomes

Now, to offer a technical perspective on what John should use and look at specifically to elevate his monitoring, it is the golden signals. The golden signals are the volume of requests, which reflects the demand placed on your application and directly impacts latency. Latency is speed: how much time a specific request takes to be resolved, and higher latency directly impacts the user's experience. The two on the bottom are faults and errors. These are tied back to malformed requests or application issues, so you need to understand them in order to take your application monitoring to another level.

Thumbnail 570

Now this is a technical perspective. What John needs to do here is also to connect that from the business perspective, right. So for instance, business impact directly to revenue per minute, or if you have a cart abandonment, what is the level of that. What is the user experience signal, for instance, page load time, API error codes, and session durations, so on and so forth. You have service health indicators like latency, P90, P95, you want to understand that particular piece to understand the overall health. And last but not least, connecting with reliability objectives, tying back each signal with an SLO, a Service Level Objective, to ensure that you're maintaining the optimal level of the application.
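As a rough editorial sketch (not from the session), here is how those golden signals, including the P90/P95 latency mentioned above, could be computed from raw request records; the field names and status-code conventions are illustrative:

```python
# Sketch: computing golden-signal summaries (volume, latency percentiles,
# fault/error rates) from raw request records. Field names are illustrative.

def percentile(sorted_values, p):
    """Nearest-rank percentile on a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    k = max(0, min(len(sorted_values) - 1, round(p / 100 * len(sorted_values)) - 1))
    return sorted_values[k]

def golden_signals(requests):
    latencies = sorted(r["latency_ms"] for r in requests)
    volume = len(requests)
    faults = sum(1 for r in requests if r["status"] >= 500)        # server-side faults
    errors = sum(1 for r in requests if 400 <= r["status"] < 500)  # client errors
    return {
        "volume": volume,
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
        "fault_rate": faults / volume,
        "error_rate": errors / volume,
    }

requests = (
    [{"latency_ms": 80 + i, "status": 200} for i in range(95)]
    + [{"latency_ms": 900, "status": 500} for _ in range(3)]
    + [{"latency_ms": 120, "status": 404} for _ in range(2)]
)
signals = golden_signals(requests)
print(signals)
```

In practice Application Signals derives these from OpenTelemetry spans automatically; the point of the sketch is only to show how the four signals relate to raw request data.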

Thumbnail 630

Thumbnail 640

So in order to answer those questions, John should focus on that particular layer, the application level, because that's where John can connect all those dots. And Amazon offers an APM solution called Amazon CloudWatch Application Signals. By using Application Signals, John can elevate his application monitoring.

Because first of all, Application Signals automatically discovers which applications you have in your account using OpenTelemetry SDK. You don't need to do anything, right? It's already automatically discovered for you by CloudWatch. Second of all, there are pre-built dashboards that will include standard metrics including those golden signals, so we can understand the technical metrics that I just mentioned a few seconds ago. Third, you have easy understanding with a few clicks of the root cause if you have an HTTP malformed request, if you have an exception. You can understand where the exception was, which line of code, and so on. Last but not least, SLO, Service Level Objective. So this is related to goals on your reliability and tying back to the business objective. This is crucial for any application monitoring strategy.

Thumbnail 720

And that's why when we talk about application monitoring, we talk about SLOs because it is a comprehensive understanding of application health from the user perspective, and this is how it works. You have an application or an API called Pet Research API, and I have a particular internal goal of 99.9% of uptime of this particular API. Well, in order to have that, I already select a particular SLI, which is Service Level Indicator, to measure that goal. In this case it is uptime. But as you see, there is a 0.1% that can create something called error budget, which means if I'm evaluating this particular SLO in a 30-day period, I still have 43 minutes of downtime and still meet my SLO.

Why is this important? Because you can tie all this back with your SLA. And SLA, everyone knows, is an agreement that you do with your customers. So when you put it in perspective, what are you gonna have? You're gonna have happy customers because if you are looking for all those APIs, establishing all those SLOs tying back to the SLIs, customers are able to understand that you take care of the application. In this particular case, John will achieve happy users too.
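The error-budget arithmetic behind that 43-minute figure can be written out directly; this small sketch (editorial, not from the session) just restates the math:

```python
# Sketch: the error budget implied by an SLO target.
# A 99.9% SLO over a 30-day window leaves 0.1% of the window as budget.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    window_minutes = window_days * 24 * 60          # 30 days = 43,200 minutes
    return (1.0 - slo_target) * window_minutes      # the 0.1% slice

budget = error_budget_minutes(0.999, 30)
print(f"{budget:.1f} minutes of allowed downtime")  # about 43 minutes, as in the talk
```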

Thumbnail 820

Thumbnail 840

Thumbnail 850

Thumbnail 860

Live Demo: Troubleshooting Faults and Latency with Amazon CloudWatch Application Signals

Okay, so let's go for the demo. Let's go to the fun part. In this demo, I'm going to show two scenarios, a fault and a latency problem in my application, and how we can use Application Signals to monitor my entire application in a single console. First of all, I'm going to show SLOs. I have multiple SLOs, as you can see here, and some of them, of course, are unhealthy. I wanted to show how to create an SLO. To create an SLO, you first choose the SLI. The SLI can be a service operation, a CloudWatch metric, or a service dependency. And again, that service is automatically discovered by the OpenTelemetry SDK.

Thumbnail 870

Thumbnail 880

Thumbnail 890

Thumbnail 900

In this case, I'm gonna use my frontend, Pet Clinic frontend, and a specific operation of POST. And the interesting part right now, I can select the calculation metric method for this particular SLO, which can be by request or by period. And I can select also the condition of the particular calculation method that can be either by latency, I'm choosing 100 milliseconds right now, or also availability. One of the cool things is CloudWatch puts that little phrase over there which exactly determines what I'm doing.

Thumbnail 910

Thumbnail 920

Thumbnail 930

Thumbnail 940

From that, I select what my SLO is. I define my interval or period that will be assessed by this SLO. I set the attainment goal, and I also have the little phrase over there. Of course, none of this makes sense if I don't tie back alarms. Alarms can be directly related to the SLI or the SLO attainment goal and the overall health of this SLO. Everything there makes sense for that particular API.
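The two calculation methods mentioned in the demo, by request and by period, can be illustrated with a toy computation (an editorial sketch; the data and the 100 ms threshold are invented, echoing the demo's choice):

```python
# Sketch: the two SLO calculation methods on toy data.
# Request-based: good requests / total requests over the interval.
# Period-based: good periods / total periods, where a period is "good"
# if it meets the condition (here: average latency under the threshold).

THRESHOLD_MS = 100  # latency condition, as chosen in the demo

# Each inner list is one evaluation period's request latencies (ms).
periods = [
    [40, 60, 80],        # good period
    [90, 95, 250],       # avg 145 ms -> bad period, though 2 of 3 requests are good
    [50, 50, 50, 50],    # good period
]

all_requests = [lat for p in periods for lat in p]
request_based = sum(1 for lat in all_requests if lat <= THRESHOLD_MS) / len(all_requests)

period_based = sum(
    1 for p in periods if sum(p) / len(p) <= THRESHOLD_MS
) / len(periods)

print(round(request_based, 3), round(period_based, 3))
```

Note how the same traffic yields a different attainment under each method (90% by request, about 67% by period), which is why the console makes you choose one explicitly.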

Thumbnail 950

Thumbnail 970

Thumbnail 980

Now I wanted to show also something beyond it, which is Application Map. We launched this a couple of weeks ago. This is what I mentioned related to auto-discovery. From the OpenTelemetry perspective, not just instrumented but also uninstrumented applications are capable of being discovered by Application Map. I can filter by groups. There are multiple filters that I can select over here, and then all the applications that I have in this account will automatically be selected in those little boxes on the right-hand side. Even if I wanted to create my own custom attribute, I can use configurations directly on the OpenTelemetry config file, and it will show up directly here on the console too.
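For the custom-attribute idea, one common OpenTelemetry mechanism is the standard resource-attribute environment variables, which the SDK reads at startup. This is an illustrative sketch, not the exact configuration from the demo; the attribute key `business.unit` is invented:

```python
# Sketch: adding custom attributes via standard OpenTelemetry environment
# variables. An OTel SDK parses OTEL_RESOURCE_ATTRIBUTES at startup; the
# parsing below just shows what the SDK would extract.
import os

os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "business.unit=pet-clinic,deployment.environment=prod"
os.environ["OTEL_SERVICE_NAME"] = "pet-clinic-frontend"

# What an SDK would parse out of that variable:
attrs = dict(
    pair.split("=", 1) for pair in os.environ["OTEL_RESOURCE_ATTRIBUTES"].split(",")
)
print(attrs)
```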

Thumbnail 990

Thumbnail 1000

Thumbnail 1010

Thumbnail 1020

Thumbnail 1030

Thumbnail 1040

Now going a little bit deeper on the Pet Clinic that I mentioned, when I double-click, I see the topology. So this is my topology for my application, all the connections that I have with the front end, with the back end, all the microservices that I have connections with I can see. And if I click on the thin line, I'm able to see the interconnection between one service to the other and understand the top path with fault rate, with high latency, and so on and so forth. If I click on that particular front end and back end again, what I have on the right-hand side is the golden signals that I mentioned before. So request volume, request latency, errors, and faults are out of the gate for me over there. I can enlarge to see this over time and select P90, P99, or P50.

Thumbnail 1050

Thumbnail 1060

Thumbnail 1070

Thumbnail 1090

And one of the cool things that I believe CloudWatch gives here is operational auditing. As you can see, I have some drops, some spikes, and CloudWatch already gives me indications of where the problem might be: what exactly happened in that period of time and what you need to assess even further. To assess further, when I click on the dashboard, I can see the overall metrics for this particular application, and again, all the golden signals are here. One of the things that I mentioned in the SLO creation is the service operation. Using OpenTelemetry, I'm able to see all the POSTs, the GETs, the PUTs that these applications are using behind the scenes and understand, as you can see, that I have SLIs that are unhealthy.

Thumbnail 1110

Thumbnail 1120

Thumbnail 1140

Thumbnail 1150

Thumbnail 1160

Thumbnail 1170

If I double-click on the dashboard because I have a spike of a fault, it will automatically show the correlation of spans that is necessary for that request to be served. And if I click on one of those spans, it will show me exactly what is the trace map from the front end to the back end, what exactly the whole trace and the spans for this request to be resolved. And I can even see the same visualization, but now in a timeline perspective, in a span timeline perspective, where is this host. And as you can see, there is a fault in the visits-service-java. It's throwing an exception, and the exception is a DynamoDB throughput problem. I can see the message and also the line, specifically the line that in my code I need to go back and fix the problem, and even in the DynamoDB's perspective, I can see the message on the right-hand side too.

Thumbnail 1180

Thumbnail 1190

Thumbnail 1200

One of the interesting parts, if I go back to the previous dashboard, I can also see because this is the EKS cluster, where is the node that is having most problems with this particular fault and what exactly are the pods that are having this problem. And then I can prioritize by looking at these nodes and pods.

Thumbnail 1210

Even if I click to Container Insights on this little button, it's going to guide me to a console where I can see my Container Insights performance across the board. This is in a single pane of glass.

Thumbnail 1220

Thumbnail 1230

Thumbnail 1240

So Dependencies is another tab on the same application, where I can see all the dependencies related to this application and, of course, the dependencies related to the service. This is the second use case that I talked about: latency. As you can see, there is a spike at this specific time, and I'm just going to click on one of the traces, using the same methodology as the previous use case. We're now looking at latency, and you're going to see that because it's a different API that has high latency, it's being served by different services, including, for instance, Bedrock.

Thumbnail 1260

Thumbnail 1270

Thumbnail 1280

Thumbnail 1290

I have the same visualization trace map, and also the timeline with all the spans. As you can see, the Bedrock runtime has an error. Specifically, when I click into view, it will show me exactly what exception is being thrown in my code. In this case, the problem is the Bedrock foundation model is being deprecated, so that's why I have high latency, because it's been deprecated, so it's taking too much time to be resolved back to my user. Even if I click on the event, I'll be able to see that more properly.

Thumbnail 1300

Thumbnail 1310

Thumbnail 1320

Thumbnail 1340

Thumbnail 1350

Thumbnail 1360

Thumbnail 1370

Now, this is the application level. I wanted to also show that top-level layer. You can use synthetic canaries to create and simulate users' behavior. In this case, for our persona John, you can create simulations that follow exactly the steps a user would take in his application, and it will take little screenshots as well if you choose to.

You can also use Real User Monitoring, which I mentioned before, to understand page load. When I increase the time range on the right-hand side, I can see the errors over time and understand exactly which pages are having load problems from the user's perspective. When I click on one of these, it gives me the entire information regarding that particular error: when the session happened, which type of browser was used, and so on, and I can understand my user experience across the board.

The Evolution of AI Applications: From Chatbots to Autonomous Agents and New Observability Challenges

Now, this is the application part, and now I want to move over to generative AI, because now John needs to include generative AI in his application. Thank you so much, Matheus. I'm so excited to talk to you about how observability has changed to monitor AI workloads. So raise your hand if you are currently building an AI application or workload in your company. Okay, wow, most of the room, so you're in the right place.

So looking back at this crazy year of 2025, we all know that AI-powered applications are going to transform the way your users interact with your business. Every tech leader here is seeing in real time how easy and powerful it is to build these AI workloads on AWS. So I'm here to help you navigate this time through the lens of observability and show you three things. Number one, how observability has changed and the new tools you can use. Second, how observability is still some of the same things you're already familiar with, and the existing tools you can use. And lastly, a demo that shows it all together on a real workload.

Thumbnail 1460

So walking down memory lane, we started this AI journey back in 2023 with question-and-answer chatbots. What we then had were AI assistants, a kind of AI entity that walks the user step by step through a particular business process. But today, what we have is AI agents: entities that are more autonomous, that can do certain tasks by themselves and make decisions independently to achieve specific goals. What we think the future will be is fully agentic systems, where entire systems behave independently, achieve open-ended requests and goals, and help the business make its own decisions and drive those outcomes.

Thumbnail 1510

And how do you even build these AI applications? In AWS, we offer a full stack of tools for you to build those workloads, and monitoring is across the entire stack. Starting from the bottom, we can help you build, run, train, and deploy those AI models on SageMaker and also on our infrastructure. And the middle layer, I think, is going to be the meat of it. First is Amazon Bedrock, which is a fully managed service that offers you direct access to a slew of foundation models through a single API so you can scale up your application really easily. And then the more recently launched Bedrock AgentCore was the place for you to build and deploy highly scalable and capable AI agents. It comes with a suite of tools to augment those agents with memory so the agent knows the context, with gateway so the agent has control over what kind of third-party APIs it calls, and identity for authentication and control.

Thumbnail 1590

And going higher, even at AWS, we've used these building blocks ourselves to build fully agentic systems: Q for the agentic IDE, Q for agentic business intelligence, and also Amazon Connect with the agentic contact center, loved by millions of customers. So wherever, however you're building AI, we've got you covered, and again, at every layer, observability is there with you. Now, here comes the question: how do you have the right observability on these AI applications? How do you see and track what the AI is doing? And then what's new, what's different? So let me help you navigate through this rapidly changing time.

Thumbnail 1620

Thumbnail 1640

Thumbnail 1650

With these AI applications, when we've talked to customers, they've unanimously come back to us with some common challenges. Number one, these agents can be nondeterministic. Their actions can differ from time to time, even in similar past scenarios, which is very unexpected. They're like teenagers: sometimes they say brilliant things, sometimes you wish they didn't say anything at all. And second, root cause analysis is focused on tracking the sequence of calls the agents made, but it's very difficult to trace. Analyzing those invocations at scale is hard because every model provider has a slightly different format, and tracking that volume across different regions and accounts is harder than it should be.

Thumbnail 1660

And lastly, there is assessing system health. It's still the same goal that Matheus talked about, but what's new is this word: quality. Right now, you need to answer why the agent did the thing it did. Why did it route itself that way? What context did it have? Traditional monitoring tools can tell you surface-level performance parameters like latency and errors, but they don't tell you the reasoning. They don't explain the AI's decision-making process. So you are no longer just observing whether your AI is working or not, going up or down; you're trying to observe reasoning and intent, and we believe this will be your new operational reality. Observability is going to be the control point for trust, safety, and quality.

Thumbnail 1720

Thumbnail 1740

Amazon CloudWatch AI Observability: Monitoring Agent Quality, Performance, and Evaluations

Now, going back to the same stack that Matheus showed you, we believe that AI workloads are going to ride on top of all of this, and observability is there. It operates at the highest level with you as well, connecting all the way from the infrastructure signals, stitching the telemetry throughout the stack, and coalescing the different models and agent actions to observe the entire end-to-end interaction for you. That's why, from those challenges, we introduced Amazon CloudWatch AI Observability, already in GA, which has a slew of capabilities. Number one, it's a 360-degree view of your agent, no matter what model it's using, on frameworks such as Strands, CrewAI, and LangChain, with out-of-the-box dashboards on those performance parameters. Second, it's very simple to instrument your AI workload as long as you're using the OpenTelemetry format. We'll touch on that a little later.

And very importantly is the end-to-end prompt tracing, which traces across the LLM calls, the agent actions, the tools, and the memory calls it makes for you to quickly identify issues. And in the data protection world, now in these AI interactions, it's very likely that what the AI receives or the AI outputs can contain PII content, and those are stored in the logs, and we have data protection features to mask that content and protect it.

Lastly is the evaluations feature for you to monitor the quality, and these are using LLM as a judge to constantly assess how your AI is responding, whether it's saying the right things, whether it's routing itself correctly and solving your customer problems.

Thumbnail 1830

Building these agents is fairly easy with Amazon Bedrock AgentCore. AgentCore allows you to deploy and operate highly capable AI agents securely at scale, and it offers infrastructure purpose-built for these dynamic agent workloads. As you can tell, it brings together the different aspects of what an agent uses: any LLM model offered on Bedrock; AgentCore Memory, which is the context the agent pulls and uses to operate; Identity, which handles security controls; and Gateway, a central place for making third-party API and tool calls. All of the tools and primitives around the agent send telemetry to the observability platform, which is CloudWatch generative AI observability, so you can have that single pane of glass to monitor all of this.

Thumbnail 1890

One thing I do want to highlight that has been resonating with a lot of our customers is that you can host your agents on AgentCore runtime service, and that telemetry comes through CloudWatch. You can get all the features I talked about out of the box. Also, if you're hosting your agents elsewhere, EC2, EKS, on-premises, or other clouds, as long as your data is in the OpenTelemetry format, we can also accept it and you'll get very similar capabilities in CloudWatch as well. So giving you that flexibility wherever fits your needs.

Thumbnail 1920

Going a little deeper on the agents running on AgentCore Runtime: when creating and running these agents, we provide even more flexibility in how you create them. We support popular agentic frameworks such as Strands Agents, CrewAI, and LangChain. The supported instrumentation libraries are broad and all open source: OpenInference, Traceloop. Once you've instrumented, connect with the AWS Distro for OpenTelemetry (ADOT); we collect that instrumentation, send it to the CloudWatch OpenTelemetry endpoint, and it powers the screens I talked about before.

Thumbnail 1960

More specifically, for agents hosted on Runtime, remember the ADOT I talked about? Once all that telemetry is coming in, in the CloudWatch telemetry config, with a single click you can turn on all of the telemetry from those individual primitives like AgentCore Memory and AgentCore Gateway. A single click for the entire account, and all of that comes to CloudWatch and powers those views. For agents hosted elsewhere, it's still ADOT, but I need to emphasize that as long as the data is in the OpenTelemetry format, we also accept it; then turn on CloudWatch Transaction Search to stitch together those actions and tool calls I talked about. Here's a very easy quick-start guide; you can even start working on it right now on your laptop.
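As an illustrative sketch of that quick start (editorial, not the official guide from the slide), the standard OpenTelemetry environment variables are how an SDK or ADOT is pointed at an OTLP endpoint; the endpoint value below is a placeholder, and the real CloudWatch OTLP endpoint and authentication requirements should come from the AWS documentation:

```python
# Illustrative sketch: the standard OpenTelemetry environment variables an
# SDK/ADOT reads at startup. The endpoint is a PLACEHOLDER -- look up the
# actual CloudWatch OTLP endpoint and auth requirements for your region.
import os

os.environ.update({
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://<cloudwatch-otlp-endpoint>",  # placeholder
    "OTEL_RESOURCE_ATTRIBUTES": "service.name=pet-clinic-agent",
})
print(os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"])
```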

Thumbnail 2010

Thumbnail 2020

Thumbnail 2040

So with all the groundwork we've laid, let's focus on what John is doing with his generative AI workloads in more practical terms. This slide looks very similar to what you saw before, and you already know it: it's the metrics, the logs, and the traces, and they still power generative AI observability. These are the same tools you're using today. With metrics, these are things like token consumption, where you're tracking how the volume of work the AI has done changes over time. The logs are where you keep the verbose inputs and outputs from the user, the LLM models, and the agents. Lastly, tracing, which is very important: traces let you understand how a response propagates through the entire system; you can analyze them at the aggregate level, and you can drill down into every single interaction.

Thumbnail 2070

A little more detail on how it really works. Metrics: these are some common metrics you'll be able to monitor, such as the number of invocations, how fast those invocations complete, any throttles, and the volume of work, like input token counts and output token counts. The logs come from model invocation logging and also the span logs, which contain each individual step and the contents of any tool calls and agent actions. Lastly, traces: as long as they contain the same session ID or trace ID, they will be stitched together, and we'll be able to surface those interactions I talked to you about before.
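The session-ID stitching can be pictured with a small sketch (editorial, with invented span fields): group spans by the shared ID, then order them chronologically to reconstruct the end-to-end interaction:

```python
# Sketch: stitching individual LLM/tool spans into end-to-end interactions
# by a shared session ID, as the trace view does. Span fields are illustrative.
from collections import defaultdict

spans = [
    {"session_id": "s-1", "start_ms": 0,   "name": "agent.invoke"},
    {"session_id": "s-1", "start_ms": 120, "name": "llm.call"},
    {"session_id": "s-2", "start_ms": 5,   "name": "agent.invoke"},
    {"session_id": "s-1", "start_ms": 480, "name": "tool.search"},
]

sessions = defaultdict(list)
for span in spans:
    sessions[span["session_id"]].append(span)
for sid in sessions:
    sessions[sid].sort(key=lambda s: s["start_ms"])  # chronological timeline

print([s["name"] for s in sessions["s-1"]])
```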

Thumbnail 2120

Thumbnail 2150

With those metrics, we've put together the same kind of golden metrics we believe you should be monitoring. The first bucket is token usage, which represents the volume of work the AI is doing and helps you forecast not only demand but also cost. Then there's latency, which measures how fast your AI is responding. Throttles help you track how close you're getting to your limits and quotas, and errors show you anything going wrong in your requests. We surface all of these in pre-built, automatic dashboards without any configuration, so you can diagnose and troubleshoot from there.
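If you want to pull those golden signals programmatically rather than from the pre-built dashboards, a hedged sketch of a `GetMetricData` query set might look like the following. The `AWS/Bedrock` namespace and metric names (`Invocations`, `InvocationLatency`, `InvocationThrottles`, `InputTokenCount`, `OutputTokenCount`) match the published Bedrock runtime metrics as I understand them, but verify them against your own account:

```python
def bedrock_golden_metric_queries(model_id, period=300):
    """Build CloudWatch GetMetricData queries for the Bedrock golden signals.

    Metric and dimension names are assumptions to verify against the
    AWS/Bedrock namespace in your account.
    """
    dims = [{"Name": "ModelId", "Value": model_id}]

    def q(qid, metric, stat):
        return {
            "Id": qid,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": metric,
                    "Dimensions": dims,
                },
                "Period": period,
                "Stat": stat,
            },
        }

    return [
        q("invocations", "Invocations", "Sum"),          # volume
        q("latency", "InvocationLatency", "p99"),        # speed
        q("throttles", "InvocationThrottles", "Sum"),    # quota pressure
        q("input_tokens", "InputTokenCount", "Sum"),     # cost driver
        q("output_tokens", "OutputTokenCount", "Sum"),   # cost driver
    ]

# Usage (requires boto3 and AWS credentials):
# cw = boto3.client("cloudwatch")
# resp = cw.get_metric_data(
#     MetricDataQueries=bedrock_golden_metric_queries("anthropic.claude-3-sonnet-20240229-v1:0"),
#     StartTime=start, EndTime=end)
```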

Thumbnail 2180

Additional filters include filtering by the model itself, so you can drill down into any specific model you're using and see if anything is going wrong there. It's also fully integrated with existing CloudWatch capabilities like alerting, if you want to keep track of your total consumption patterns. Now, here's something that's going to be a relatively new concept in the world of observability, and it's very relevant to AI: like I said before, the new operational reality for you is going to be looking at the quality of AI responses and agent actions. So often these LLMs hallucinate, or the agents take a path of their own that's not desired.
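To make the alerting point concrete, here is a hedged sketch of parameters for `cloudwatch.put_metric_alarm(**params)` that fires when hourly input-token consumption suggests a daily budget will be blown. The alarm name and the threshold math (daily budget spread evenly across 24 hours) are illustrative assumptions:

```python
def token_budget_alarm(model_id, daily_input_tokens):
    """Alarm parameters for hourly input-token spend against a daily budget.

    Namespace/metric names assume the AWS/Bedrock runtime metrics; the
    even hourly split of the budget is a simplification.
    """
    return {
        "AlarmName": f"bedrock-input-tokens-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": 3600,                       # one hour
        "EvaluationPeriods": 1,
        "Threshold": daily_input_tokens / 24,  # hourly share of daily budget
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# Usage (requires boto3 and credentials):
# boto3.client("cloudwatch").put_metric_alarm(
#     **token_budget_alarm("anthropic.claude-3-sonnet-20240229-v1:0", 2_400_000))
```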

In the past, the way teams kept an eye on these qualitative issues was that data science teams took a very small sample of the AI workload, manually looked at it and assessed it, and only when that looked good were they okay with deploying the AI in the real world, almost hesitant and timid about what it was really going to do. So there was a lack of trust, and it was a very burdensome process. That's why, with evaluations, you can leverage an LLM as a judge to assess how faithful an agent is in adhering to its context, how well it follows the instructions you've given it, and how helpful it was in helping your customers achieve their tasks and answer their questions. This is done continuously and automatically on the traffic of your AI workload, and you can choose full sampling or only a proportion of the traffic, so it really removes the labor-intensive, manual assessment process our customer teams used to rely on.

Thumbnail 2280

Thumbnail 2290

Deep Dive into Generative AI Telemetry: Logs, Data Protection, and Trace Visualization

Evaluation metrics were just launched yesterday and are available now in CloudWatch, powered by Bedrock AgentCore evaluations. We talked about the metric side of things; coming back to the logs, these are some of my favorites, because this is the most densely packed information you'll be using. I'm going to talk about the invocation logs, how to query them, and how to protect that data. The invocation logs can come from both the LLM model itself and the agents. You can choose to send them to CloudWatch and S3, and once they're in CloudWatch, they're stored in the familiar CloudWatch log groups concept that you can integrate with your existing workflows.
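As a hedged sketch of wiring this up, the payload below targets Bedrock's `put_model_invocation_logging_configuration` API. The field names follow the boto3 API as I recall it (`cloudWatchConfig`, `s3Config`, delivery flags); verify them against the current SDK documentation before relying on them:

```python
def invocation_logging_config(log_group, role_arn, bucket=None):
    """Build a loggingConfig payload for model invocation logging.

    Field names are assumptions about the boto3 bedrock API shape;
    the role must allow Bedrock to write to the log group / bucket.
    """
    cfg = {
        "cloudWatchConfig": {"logGroupName": log_group, "roleArn": role_arn},
        "textDataDeliveryEnabled": True,        # capture prompts/completions
        "embeddingDataDeliveryEnabled": False,  # skip bulky embedding payloads
    }
    if bucket:
        # Optional S3 delivery alongside CloudWatch Logs.
        cfg["s3Config"] = {"bucketName": bucket, "keyPrefix": "bedrock-invocations/"}
    return cfg

# Usage (requires boto3 and credentials):
# boto3.client("bedrock").put_model_invocation_logging_configuration(
#     loggingConfig=invocation_logging_config(
#         "/bedrock/invocations", "arn:aws:iam::123456789012:role/bedrock-logging"))
```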

Thumbnail 2320

Thumbnail 2350

Once those logs are in CloudWatch, you can use CloudWatch Logs Insights, a powerful querying tool, to drill down deeper. We support languages like SQL and OpenSearch PPL, and we launch new query commands regularly. You can use pattern analysis to detect common text structures within the log events for faster insights, and automated real-time anomaly detection to identify changes in agent behavior and performance. And as I said before, these customer interactions can contain PII, and we've got you covered once those logs are in CloudWatch.
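For a flavor of what such a drill-down looks like, here is a hedged Logs Insights query that surfaces the heaviest invocations by input tokens. The field names (`input.inputTokenCount`, `output.outputTokenCount`, `modelId`) are assumptions about the invocation-log schema; adjust them to what your log group actually contains:

```python
def token_hog_query(limit=20):
    """Build a CloudWatch Logs Insights query (Logs Insights QL) that ranks
    model invocations by input token count. Field names are assumptions."""
    return (
        "fields @timestamp, modelId, input.inputTokenCount, output.outputTokenCount\n"
        "| filter ispresent(input.inputTokenCount)\n"
        "| sort input.inputTokenCount desc\n"
        f"| limit {limit}"
    )

# Usage (requires boto3 and credentials):
# logs = boto3.client("logs")
# qid = logs.start_query(logGroupName="/bedrock/invocations",
#                        startTime=start, endTime=end,
#                        queryString=token_hog_query())["queryId"]
# results = logs.get_query_results(queryId=qid)
```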

Thumbnail 2390

The CloudWatch data protection feature can identify and mask sensitive customer information like credit cards, names, and addresses, basically redacting them for you. We also offer granular IAM role-based controls, so that if a super user needs to see the content they can, while most users won't be able to, and we generate automatic audit reports for compliance and reporting. Now, coming to the last part: we've talked about metrics and logs, and now we're talking about traces. This is probably the most important part of generative AI observability.
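A data protection policy like the one described can be attached to a log group via `put_data_protection_policy`. The sketch below builds such a policy document; the version string, statement shape, and managed data-identifier ARNs follow the CloudWatch Logs data protection policy syntax as I recall it, so treat every field as an assumption to verify against the documentation:

```python
def masking_policy():
    """Build an illustrative CloudWatch Logs data protection policy that
    audits and then masks common PII. All field names/ARNs are assumptions."""
    identifiers = [
        "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber",
        "arn:aws:dataprotection::aws:data-identifier/Address",
        "arn:aws:dataprotection::aws:data-identifier/Name",
    ]
    return {
        "Name": "mask-pii",
        "Version": "2021-06-01",
        "Statement": [
            {   # First pass: detect and audit findings.
                "Sid": "audit",
                "DataIdentifier": identifiers,
                "Operation": {"Audit": {"FindingsDestination": {}}},
            },
            {   # Second pass: mask the matched values in the log events.
                "Sid": "redact",
                "DataIdentifier": identifiers,
                "Operation": {"Deidentify": {"MaskConfig": {}}},
            },
        ],
    }

# Usage (requires boto3 and credentials):
# boto3.client("logs").put_data_protection_policy(
#     logGroupIdentifier="/bedrock/invocations",
#     policyDocument=json.dumps(masking_policy()))
```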

Thumbnail 2420

Thumbnail 2430

With tracing, the first thing we do for you is show the detailed information of every single API call and the function calls that agents made, in a very neat list view. You no longer need to query multiple databases to stitch them together; it's done for you. Second, we show those actions in a timeline, so you know which one is taking the longest, or longer than you expect, and you can troubleshoot faster. And there's a trajectory map showing all the different interactions, API calls, and tool calls for a particular trace; each step is a span, aggregated into the trace. This map is visualized automatically for you, so you can understand the hierarchy and the sequence of those tool calls in a very easy-to-understand fashion.

Thumbnail 2470

Thumbnail 2490

Thumbnail 2500

Thumbnail 2510

Comprehensive Demo and Key Takeaways: Observing Bedrock AgentCore Runtime in Action

Now, I've talked about what's different, what's the same, and the logs, metrics, and traces. There's a lot, so let's tie it all back together with a short demo. Okay, let's go to the demo. In this demo, we're going to see the Pet Clinic that I showed you before, but now in the generative AI observability use case, from the perspective of an administrator, a manager using a fully hosted agent on Bedrock AgentCore Runtime. Behind the scenes there are the multiple microservices that you saw before; this is the same Pet Clinic application, so you have the database and so on.

Thumbnail 2520

Thumbnail 2530

Thumbnail 2550

Here I'm just typing a couple of prompts related to the Pet Clinic, and this particular prompt is showing me information related to an owner ID. As you can see, there will be some PII, as Peter showed before; this is of course an agent used from an administrative perspective. And this last prompt relates to billing information for pet ID number one. What you're going to see is that the agent will not be able to fetch that information, for some reason. Now, this is of course the front end. Let's go ahead and look in the console.

Thumbnail 2560

Thumbnail 2570

First of all, on the left-hand side in Amazon CloudWatch, you're going to see a new section called GenAI Observability, and this is the model invocation view that Peter showed before. You have the GenAI golden signals, like invocation latency and so on. For all the invocations of the foundation model, you can filter and query them, drill into the logs, and see the whole JSON payload. You can even filter by model ID, if you're using multiple different foundation models.

Thumbnail 2590

Thumbnail 2610

Thumbnail 2620

Thumbnail 2640

Now, going to Bedrock AgentCore: it's not just the runtime. CloudWatch also observes other parts of AgentCore, such as memory and gateway, but here I'll focus on the agent runtime specifically. I have three agents, and I can see the overall information about them: how many sessions, how many traces. Going deeper into the Pet Clinic agent, because it's the one I've been using, I want to show a couple of things on this overview tab. Out of the gate, I have information related to evaluations: I'm using evaluations, so I can see the evaluation score and the configured metrics. I can see errors and latency by span specifically, and I have this table. As you can see, there are many errors related to DynamoDB, and I can drill into that information if I need to.

Thumbnail 2660

Thumbnail 2670

Thumbnail 2680

Thumbnail 2690

Thumbnail 2700

Scrolling a little further down, still on the overview page, I see how many sessions this agent has had, how many traces, and the token consumption, which is very important for cost optimization and for how Bedrock charges you. The session counts and latency information are useful as well. Everything you're seeing is already prebuilt for you. Last but not least are the CPU and memory utilization. Now, each interaction with the agent creates sessions, sessions contain traces, and traces contain spans. Going deeper into the session part, I just reduced the time range to five minutes so I captured the last prompt that I issued.
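The session-to-trace-to-span hierarchy just described can be sketched with plain Python: given a flat list of span records (illustrative shape, not the actual CloudWatch schema), grouping by session ID and then trace ID reproduces the structure the console shows:

```python
from collections import defaultdict

def group_telemetry(spans):
    """Group flat span records into the session -> trace -> spans hierarchy.

    The record shape here is illustrative; real spans carry many more fields.
    """
    sessions = defaultdict(lambda: defaultdict(list))
    for s in spans:
        sessions[s["session_id"]][s["trace_id"]].append(s)
    return sessions

spans = [
    {"session_id": "s1", "trace_id": "t1", "name": "invoke_llm", "ms": 850},
    {"session_id": "s1", "trace_id": "t1", "name": "tool:get_owner", "ms": 120},
    {"session_id": "s1", "trace_id": "t2", "name": "invoke_llm", "ms": 900},
]
tree = group_telemetry(spans)
# Session "s1" contains two traces; trace "t1" contains two spans.
```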

Thumbnail 2710

Thumbnail 2720

Thumbnail 2730

Thumbnail 2740

I have the same information here, but now I'm looking at just this particular session, which allows me to go deeper on this particular prompt. If I click one of the traces, right out of the gate I can see how many spans there are, the latency, and the token consumption. If there are any errors, I see them immediately; remember that the agent was not able to fetch that data. I also see the trajectory of this whole prompt, cover to cover: exactly what the agent had to do to produce the response and all the tools it had to call on my behalf.

Thumbnail 2770

Thumbnail 2780

Thumbnail 2790

Thumbnail 2800

I'm also able to see the spans in the same format as a timeline and a tree, and in the timeline I'm showing the token consumption of that particular invocation, input and output. This is the same visualization but now from a timeline perspective, so as you can see, I have multiple ways to understand how this particular session is doing. Looking at the event itself, here is where I can see the prompt I issued. It is trying to get information regarding payment, and then a tool is triggered internally. And because that information is in a DynamoDB table the agent doesn't have permission on, it throws an AccessDeniedException; the agent itself doesn't have permission to fetch that information, which is why I was not able to see it in the agent's response.
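The fix for an AccessDeniedException like this is granting the agent's execution role read access to the table. A hedged sketch, with hypothetical table and role names (the real names come from your own deployment):

```python
import json

# Hypothetical resource names for illustration only.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Read-only DynamoDB access; widen only if the agent needs more.
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": "arn:aws:dynamodb:*:*:table/PetClinicBilling",
        }
    ],
}

# Usage (requires boto3 and credentials):
# boto3.client("iam").put_role_policy(
#     RoleName="pet-clinic-agent-role",
#     PolicyName="AllowBillingReads",
#     PolicyDocument=json.dumps(POLICY))
```

Scoping the statement to the single table, rather than `Resource: "*"`, keeps the agent on least privilege even after the error is fixed.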

Thumbnail 2810

Thumbnail 2820

Thumbnail 2840

Thumbnail 2850

So to fix it, I needed to go into the IAM role of the agent and give it the permission to do so. Now, I want to show the evaluations we mentioned. I have an LLM as a judge in front of this particular agent, and I'm going to change the time range here to twelve hours to cover the sampled interactions that this LLM has analyzed. First of all, I can see the evaluation configuration across the board. Filtering is available, so I can filter by stereotyping, by session, and so on.

Thumbnail 2860

Thumbnail 2870

Thumbnail 2880

Thumbnail 2890

Each evaluation offers that same granularity: session, trace, and spans. As you can see, all the traces I have for this particular agent are right there, and I can filter by many of the options used by the evaluator. I also have dashboards, built and created by CloudWatch, so each of the metrics used by the evaluator is right here: instruction following (whether the LLM followed the instructions I gave it), and whether the response is harmful, helpful, or stereotyping. All that information is right here so I can understand the agent's behavior. And of course, I have it in log format too, using CloudWatch Logs.

Thumbnail 2910

Thumbnail 2920

Thumbnail 2930

Thumbnail 2940

Thumbnail 2960

I just want to go back to the demo, and to the session specifically, to show one last use case. Going back to the same session: one of the prompts I issued, if you remember, was related to getting information for a specific owner. Again, I have all the information related to the trace here, but I wanted to show this because I have data protection rules in place in my CloudWatch, and that specific rule is masking birth dates and telephone numbers. As you can see, they are masked in the logs too, right here on the console.

We're closing our session today, so I want to make sure you leave with a few takeaways. First of all, get started today enhancing your observability stack. As you saw in the first demo and the first part, we talked about application monitoring: you can use APM to enhance your overall understanding of the application, its interconnections, and its dependencies. You can use Application Map, for instance, to see the topology of your application, automatically discovered using OpenTelemetry.

Second, if you are building agents, use CloudWatch AI observability to understand and observe them. Whether they are hosted on Bedrock AgentCore or not, you can still generate observability fully integrated with Amazon CloudWatch, and use evaluations to assess your agents programmatically. We just launched that two days ago, so set up evaluations to understand the overall behavior and instruction following of your agents, and you have all of that in a single pane of glass.

Thumbnail 3050

Last but not least, I want to share a couple of useful QR codes. The first one is demo code: we have demos available on GitHub, not just for Bedrock AgentCore but also for agents hosted on EKS, ECS, and EC2. The QR code in the middle has all the links related to observability: we host a show on a YouTube channel called the Correlation Show, where we talk about observability and correlation as a whole, and there's a workshop, best practices, everything I mentioned, in that single QR code. The third one is a blog post about this particular launch I just mentioned, related to generative AI observability.

Thumbnail 3100

If you're interested in this topic, I want to invite you to the CloudWatch kiosks in the observability village. Just swing by. We have one-on-one demos and SMEs from across the globe, so if you want to ask a question, Peter and I are going to be right outside, and we can answer your questions. But if you want to go to the kiosk, please do: we have swag, we have stickers, we can do demos, so just go there and we'll help you.

Thumbnail 3180

Last but not least, I want to thank you. Thank you for your time; I appreciate that you came to this presentation. I really want to ask you to please complete the survey, because we use that data every year to improve our sessions. If you go to the AWS Events app on your phone, go to More and then Surveys at the top. This is COP326, so please share your feedback on this session so we can improve next year. With that, I wish you the best of luck for the rest of re:Invent, and thank you very much for being here. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
