Kazuya

AWS re:Invent 2025 - Improve agent quality in production with Bedrock AgentCore Evaluations (AIM3348)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Improve agent quality in production with Bedrock AgentCore Evaluations (AIM3348)

In this video, Amanda Lester, Vivek Singh, and Ishan Singh introduce Amazon Bedrock AgentCore Evaluations, a fully managed solution for continuous AI agent quality assessment. They explain how agents' non-deterministic nature creates trust gaps and demonstrate how AgentCore Evaluations addresses this with 13 built-in evaluators across quality dimensions like correctness, helpfulness, and tool usage, plus custom evaluator capabilities. The session covers two evaluation modes: online evaluations for continuous production monitoring and on-demand evaluations for CI/CD pipelines. Using a Wanderlust Travel Platform example, they show how the service detected tool selection accuracy dropping from 0.91 to 0.3, enabling rapid diagnosis through detailed explanations. Live demos illustrate the complete workflow from baseline testing to production deployment, emphasizing multi-dimensional success criteria and rigorous statistical analysis as best practices.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction to Amazon Bedrock AgentCore Evaluations at re:Invent

Hello everyone and welcome to Amazon re:Invent. It's great to have you all here. My name is Amanda Lester and I am the worldwide go-to-market leader for Amazon Bedrock AgentCore, and I am joined today by two of my esteemed colleagues, Vivek Singh, Senior Technical Product Manager for AgentCore, and Ishan Singh, Senior GenAI Data Scientist here at AWS. We are incredibly excited to be able to present to you today.

Thumbnail 50

We're going to discuss how you can improve the quality of your agents in production with Amazon Bedrock AgentCore Evaluations, which we just recently launched during the keynote. We're incredibly excited to present to you what we have developed for agent evaluations, which we believe is going to fundamentally help you to improve the way that you do business. In today's session, you're going to learn a couple of things.

First, you're going to learn about Amazon Bedrock AgentCore. We're also going to discuss some of the key fundamental challenges that are associated with operating your agents at scale in production. Third, we're going to provide an overview and introduction of our solution that we've built to address some of those challenges, which is AgentCore Evaluations. Fourth, we're going to provide a couple of demos of our solution. And finally, fifth, we're going to provide some best practices and resources that you can use to get started evaluating agents and get the agents that you've built into production much faster than ever before.

Thumbnail 110

The Technological Revolution of AI Agents and Amazon Bedrock AgentCore Platform

What is incredibly exciting about this year's Amazon re:Invent is that we are at the edge of another technological revolution, which is unfolding right before our very eyes. This future is happening now. This is not a dream, it's happening now, and developers all around the world are empowered to reimagine customer experiences. They are also leveraging agents today to make their operations more efficient and effective, and this is happening across virtually every industry, every size of company from startups to enterprises, and all around the world.

Thumbnail 180

For developers to take advantage of the benefits of agents, they need the confidence and the right foundational set of services and tools to bring those agents into production. One of the ways that AWS is helping you get those agents into production faster is Amazon Bedrock AgentCore, our most advanced agentic platform. It provides developers with everything you need to get your agents into production faster, including a comprehensive set of services to deploy and operate your agents at scale in a secure manner, and you can leverage it with any framework and any model.

Thumbnail 210

AgentCore includes a foundational set of services that you can use to run your agents at scale securely. This includes a set of services that you can use to enhance your agents with tools and memory. It includes purpose-built infrastructure that you can use to deploy your agents at scale, and also controls that you can leverage to gain insights into your agentic operations.

We built Amazon Bedrock AgentCore to be extremely flexible. We understand that as an agent developer, you want the choice and flexibility to build with open source protocols, such as the Model Context Protocol, known as MCP, and A2A for agent-to-agent communication. It's also incredibly important that you can build your agents with the framework of your choice, which is why we've built AgentCore with the flexibility to work with any agentic framework.

Once you've built your agent with your agentic framework and you're ready to go and deploy your agent into production securely and at scale, you can do so with the right confidence and the right set of foundational services to be able to do so with trust. But we are not done innovating on your behalf.

The Non-Deterministic Nature of Agents and the Trust Gap Challenge

Agents are non-deterministic by nature, which means that as a developer, there is an entirely new set of services required to handle the non-deterministic nature of agents. What do I mean by non-deterministic?

Thumbnail 320

Agents can reason and act autonomously. This is incredibly exciting because agents are fundamentally changing the way that work can get done. We are moving to a world where agents can go off and do work on your behalf. They can do this to achieve a specific goal, a specific task that you set out for them to accomplish, and they can do this all without direct supervision.

They can reason, they can create workflows to solve a problem, and they can make decisions, all without direct supervision. This is incredibly exciting and powerful because it means that it can free you up from mundane tasks and busy work. You can then leverage that extra time to work on higher value strategic tasks. But because these agents are autonomous, which is what makes them so powerful, it is fundamentally critical that you can trust that the agents are going to perform their jobs correctly.

Thumbnail 390

The fundamental question that every single person in this room needs to answer if you are going to leverage agents is: can you trust this powerful new digital workforce to do its job correctly? The problem that is keeping developers and CTOs up at night is whether these agents, which are autonomous and are being handed mission-critical business processes and tasks, will do that job correctly, efficiently, and effectively. Will they provide a good experience for your customers?

Can you trust that the agent is going to address and provide an optimal solution? It's not enough for the agent to just produce an answer. That answer needs to be the right answer, the correct answer, and an accurate answer. Because if the agent provides a wrong answer, it may cause more problems for your customers, your users, your company, and your developers. So the fundamental benchmark now for getting an agent into production is trust.

Thumbnail 470

Agents have a trust gap right now. If you cannot trust your agents to be reliable, to consistently do their job accurately and effectively, and to produce the right answers in a consistent manner, that may result in a poor customer experience. That is one of the biggest fundamental blockers to adoption today for companies around the world and agent developers that want to take advantage of agents. So if you're going to use agents, you need to ask yourself how to make sure you're building agents that can be trusted to create a good customer experience, and you need to bridge this gap.

Thumbnail 520

Agent Performance Reviews: The Critical Need for Continuous Evaluation

One of the ways that you can bridge this gap is by making sure that your agents have a job performance review. In other words, doing an agent evaluation. You need to complete this performance review of your agents to evaluate if your agents are making the correct decisions, delivering the correct results, are efficient and effective at their job, and can be relied on to do that job and execute those tasks and workflows in a consistent manner.

When you're creating an agent evaluation, there are some questions that you need to be able to answer on a regular basis. That fundamentally includes answering whether or not the agent achieved its goal and the task and job that it was set out to do. Is that agent making the right decision? Did it generate the correct response? Did the agent select the appropriate tool to do its job, and are the answers that the agent has generated accurate? Were they polite to your customers or not?

What is fundamental about these types of questions is that they are subjective, which makes it very difficult to evaluate and measure these agents to complete a job review. But it is fundamentally critical.

Thumbnail 620

It is an imperative to be able to assess the performance of your agents and whether or not you can rely on them to do their job effectively. And when you're doing these agent evaluations, this is not a one-time task that you complete when you deploy your agents into production. This needs to be done continuously, because agent behavior needs to be monitored in real time.

We're not talking about doing this once a month or once a year. You need to be able to identify in real time and monitor if these agents are failing so you can monitor the behavior and the quality of those agents, so you can address problems proactively and be alerted the very second that these agents fail silently in production. And the reason why agents may fail is because every time that you go in and update a model, every time that there is a new version of the agent that you're rolling out, every time that you are modifying a source prompt or adding and removing tools for your agents to be able to leverage to solve these tasks and do these jobs, that may result in the agent failing and result in the agent not being able to achieve its job or its goal correctly.

Thumbnail 730

And unlike traditional software, you really only know if the agent is effective or not once it's out in the real world in front of your users and customers. And you need to be able to go in and measure and evaluate those agents continuously to address the agent behavior proactively in real time. So what are the primary points of failure for an agent? There are three primary ones.

First, there are quality failures that may be encountered. That includes hallucinations, factual errors, faulty reasoning and planning, and poor agent tool selection, which results in inconsistent outputs. And these quality failures will cascade into reliability issues, which may include context loss for the agent, poor handling of errors, and security and vulnerability gaps. And those reliability issues will result in inefficiencies associated with those agents, which may result in higher costs for you and your company, and it may result in higher latency associated with those agents.

The net effect of all three of these points of failure is a potential loss of customer trust in the Gen AI application, product, or service that you have built. And the worst-case scenario is you've built an agent, you've put it into production, you're super excited about how it's going to give your customers a great experience, and that agent fails silently. And you only find out that the agent failed after receiving an onslaught of customer complaints.

The problem with that is by the time you hear it from the customer, it's too late. You may lose that customer. You may generate a bad reputation for your business, for your company, and for your product, and that may result in the loss of current or future business revenue. We can't afford that, and you need to address it upfront and proactively, which is why you need to understand in real time and get alerted the very second that the agent fails, so you can take action to address those points of failure well in advance, before the customer experiences them.

Thumbnail 860

The Time-Consuming Challenge of Manual Agent Evaluation at Scale

And what we heard from developers all around the world is that they don't have the right tools and mechanisms today to evaluate agents in real time and to monitor those agents. Developers have to go through multiple steps to complete an agent evaluation. This includes finding and creating datasets, selecting the appropriate measures and dimensions to evaluate the agent on, selecting the right model to judge the outputs of those agents, building and maintaining the infrastructure to serve those evaluations, recording the results, making adjustments, and then continuously monitoring in production. And this process can take months and is the difference between

coming up with an idea six months ago and then having to go through multiple evaluations until you can get a trusted agent into production. And again, you have to repeat this process over and over again every time there's a new model, which today seems like it's every week, every time there's a new version of that agent, and every time there's a new source prompt or tool. And what we've heard from developers is that this process is extremely time consuming and is one of the most challenging and painful aspects of operating agents at scale.

And you don't just have to do this for one agent. The reality is you may have 10,000 agents that are off doing and executing mission critical tasks for your organization. And it simply is not feasible to be able to go in and do this at scale, which is why we took a look at this entire end-to-end problem and we asked ourselves how can we make this process simple and easy for you and provide a way that you can evaluate agents faster and easier so you can get agents to production faster in a secure manner and with confidence.

And the good news is there is a better way, and to tell you a little bit more about the solution that we've built that we think will help you manage this process better, I'm going to invite up Vivek Singh, who's our Technical Product Manager for Amazon Bedrock AgentCore. Hi everyone. My name is Vivek. I'm a Product Manager at Amazon AgentCore. And today, I'm very excited to announce and tell you about AgentCore Evaluations.

Thumbnail 1010

Thumbnail 1020

Introducing AgentCore Evaluations: A Fully Managed Solution with 13 Built-in Evaluators

With agents, I'm sure a lot of you, if you've deployed an agent, have faced similar problems. You deploy an agent, operational dashboards look green, and then three weeks later, you're firefighting quality issues, wondering what went wrong. That's the problem that we are solving with AgentCore Evaluations, a fully managed solution providing continuous assessment of AI agent quality.

Now, the keywords there are continuous and fully managed. You can get evaluation frameworks and prompts in a lot of places or build your own, but then you are left managing your LLM infrastructure. You have to manage capacity, API rate limits, cost optimization, and infrastructure scaling. With AgentCore Evaluations, we handle all of that. The evaluation models, the infrastructure, the scaling, that's our problem, and we take care of that.

You and your teams can solely focus on improving the agent experience for your customers without having to worry about the LLM infrastructure for running evaluations and managing those systems. We provide 13 built-in evaluators across common quality dimensions: correctness, helpfulness, stereotyping, tool usage, and more. And when you need domain-specific assessment, you can create custom evaluators. Everything flows into AgentCore Observability through CloudWatch, so you don't have to monitor and manage another monitoring platform.

Thumbnail 1120

Let me show you how this works across your development cycle. AgentCore Evaluations operates in two modes: online evaluations and on-demand evaluations. Online evaluations for production environments continuously monitor live agent interactions. You configure sampling rules, one to two percent of your traffic for baseline monitoring, and we automatically score them in real time. This is how you detect quality degradations and catch those silent failures before they affect your customer experience.

The key is continuous, not weekly reviews or monthly audits. Continuous automated assessment of your production traffic. On-demand evaluation is for development workflows. This integrates with your CI/CD pipelines. Teams run evaluation suites against code changes, test a new prompt, validate new configurations, or compare two models. You can gate deployments when quality scores fall below your thresholds. Block that pull request if it degrades helpfulness by more than 10 percent.
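As a concrete illustration of that kind of gate, here is a minimal sketch of a CI/CD check that fails the build when any metric regresses by more than 10 percent against a stored baseline. The baseline file name and the way the current scores are produced are assumptions; the actual AgentCore Evaluations API call is not shown here.

```python
import json
import sys

BASELINE_FILE = "baseline_scores.json"   # hypothetical file committed with the test suite
MAX_REGRESSION = 0.10                    # block the PR if a metric drops by more than 10%

def gate(current_scores: dict) -> int:
    with open(BASELINE_FILE) as f:
        baseline = json.load(f)

    failures = []
    for metric, base in baseline.items():
        current = current_scores.get(metric)
        if current is None or base <= 0:
            continue
        if (base - current) / base > MAX_REGRESSION:
            failures.append(f"{metric}: {base:.2f} -> {current:.2f}")

    if failures:
        print("Quality gate failed:\n  " + "\n  ".join(failures))
        return 1
    print("Quality gate passed.")
    return 0

if __name__ == "__main__":
    # current scores would come from an on-demand evaluation run earlier in the pipeline
    sys.exit(gate(json.loads(sys.stdin.read())))
```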

Thumbnail 1190

Both modes use the same evaluators, so you're testing against the same quality dimensions throughout your development lifecycle. What you test in CI/CD is what you monitor in production. Now, what can you actually measure? 13 built-in evaluators organized across three levels. Session level evaluates the entire conversation. Did the agent achieve its goals? Did it fulfill the user request and did the conversation succeed end to end? Trace level assesses individual responses. This is where your quality signals lie.

Was the response correct? Was it helpful? Was it faithful to the provided context? Did it follow the right instructions? The span-level focuses on tool usage. This is for agents with heavy tool use. Did it select the right tools, and did it extract the right parameters from the user's query?

Here is the thing: you don't have to select all 13 evaluators. You pick the three or four, or however many you want, that are specific to your use case. For a customer service agent, you probably care about helpfulness, goal success rate, and instruction following. For Retrieval-Augmented Generation components within your agents, correctness and faithfulness are most useful to you. For tool-heavy agents, tool selection accuracy and tool parameter correctness matter most. And when these 13 built-in evaluators do not cover your particular use case, you build your own.

Thumbnail 1260

There are four steps to building custom evaluators. Define your criteria. Let's say you need purchase intent detection for an e-commerce agent. You write your evaluation prompt with a scoring rubric. You define what high purchase intent means and how that compares to simple browsing. You create the grading scale used for that scoring. You select the model to do the evaluation, and you configure the sampling rules. You get the same flexibility as built-in evaluators and the same automated continuous assessment, just tailored to your requirements and your domain.
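To make those four steps concrete, here is an illustrative sketch of what such a purchase-intent evaluator definition could capture. The field names and structure are assumptions for readability, not the actual AgentCore Evaluations schema.

```python
# Illustrative custom evaluator definition covering the four steps above.
purchase_intent_evaluator = {
    "name": "purchase_intent",                           # 1. define your criteria
    "prompt": (                                          # 2. evaluation prompt with a scoring rubric
        "You are judging a conversation with an e-commerce agent. "
        "Rate the user's purchase intent.\n"
        "1.0: the user asks about price, stock, or shipping, or says they want to buy.\n"
        "0.5: the user compares products or asks detailed follow-up questions.\n"
        "0.0: the user is simply browsing with no buying signals.\n"
        "Explain your reasoning before giving the score."
    ),
    "judge_model": "anthropic.claude-3-5-sonnet-20241022-v2:0",  # 3. model that performs the evaluation
    "level": "session",                                  # evaluate whole conversations
    "sampling": {"rate": 0.02},                          # 4. sampling rules: 2% of traffic
}
```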

Thumbnail 1310

How Evaluation Models Work: Structured Rubrics, Reasoning, and OpenTelemetry Integration

We have seen customers build evaluators for brand voice consistency, regulatory compliance, conversation flow quality, and whatever matters for their business that the built-in evaluators do not cover. Now, some of you must be thinking, okay, you're using a language model to evaluate my agent responses. How does that work? How do I trust that? Fair question. Let me address that now.

First, we use detailed structured rubrics. Every evaluator follows a specific framework with clear criteria. This isn't asking, "Is this response good?" It's asking, "Does this response follow the instructions and use the provided context? Does it meet the defined criteria? Does it address the user's specific question?" We have spent months engineering these rubrics to overcome the typical language model limitations: bias, inconsistency, lack of specificity, grading on a continuous scale, and grading on different categories.

Second, we require reasoning for the scoring. The judge provides detailed reasoning before it assigns any score. It has to explain why that judgment occurred and why that particular score was given. It explains why the agent choice was right or wrong, what should have happened, and what the impact is. There are no scores without justification. This ensures consistency and makes every assessment explainable.

Third, we provide complete context. We give the evaluation model everything: full conversation history, user intent, tools available to the agent, tools actually used, parameters passed, execution details from the traces, and the system instructions. It's not giving opinions. It's performing systematic assessment against clearly defined criteria with full visibility into what happened. And you can verify every judgment with complete visibility into the reasoning.

Thumbnail 1430

So once the evaluation result tells you that the tool selection was wrong or poor, you can see the reasoning to find out why, which tools should have been used, which was actually used, and why that matters. Let's take a quick look at how this plugs into your existing setup. Your agent runs using the framework that you prefer, LangGraph or other frameworks, and you can deploy it on a service of your choice. AgentCore, in this case, is associated with a memory and a gateway.

Traces are being generated using standard OpenTelemetry instrumentation. We support popular instrumentations like OpenInference and OpenTelemetry. If you are already doing observability, and you should be, you are already collecting these traces. You configure which evaluations to run, and from there, the service takes over. Using your sampling rules, we pull the traces, perform evaluations, and write those results back along with the explanations in your CloudWatch log groups for your traces.

You monitor and assess through CloudWatch dashboards and AgentCore Observability. All your operational metrics and quality metrics are in the same place. The key point here is no changes to your code and no redeployments. You're already collecting traces for observability. We are just adding quality scores to those. The evaluation happens service-side, fully managed.
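If your agent is not yet emitting traces, a minimal OpenTelemetry setup looks roughly like the sketch below. The OTLP endpoint (for example, a local ADOT collector forwarding to CloudWatch), the service name, and the `run_agent` call are placeholders for your environment.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Send spans to a local collector that forwards them to CloudWatch.
provider = TracerProvider(resource=Resource.create({"service.name": "travel-agent"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("travel-agent")

def handle_request(user_query: str) -> str:
    # Each user request becomes a trace; tool calls show up as child spans.
    with tracer.start_as_current_span("agent.invocation") as span:
        span.set_attribute("user.query", user_query)
        return run_agent(user_query)  # placeholder for your framework call (Strands, LangGraph, ...)
```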

Thumbnail 1500

Wanderlust Travel Platform Case Study: From Seven Weeks to Hours of Detection

Let me now show you what this looks like with an example. Wanderlust Travel Platform. They run a travel search and research assistant with multiple specialized tools: climate data, flight information, currency conversion, and web search.

Here's what started happening. The user query has all the right parameters—destination, duration, and budget—to invoke the calculator budget tool, but the agent instead invokes the generic web research tool and responds with a few generic blog posts, not a calculation, nothing personalized. User engagement drops, unhelpful feedback increases. And look at the silent failures. Look at the monitoring. Response time, latency, error rate, and tool activation all appear healthy and normal. Every operational metric tells you that the system is working fine, and still the user experience is degrading. Four weeks to detect the issue, three weeks to diagnose and fix, seven weeks of degraded experience, because their monitoring was solely focused on operational metrics and not quality assessment.

Thumbnail 1580

This is the gap between "the system is running" versus "the system is working well." They turned to Amazon Bedrock AgentCore Evaluations and got it working in three simple steps. First, select the agent. Thirty seconds, point and click. You could be using a framework of your choice—Strands, LangGraph—deployed on AgentCore, Lambda, EKS, and use an instrumentation library of your choice—OpenInference, Elementary, whatever you need.

Thumbnail 1610

Thumbnail 1650

Step two, select the evaluators. They selected tool selection accuracy for this agent that is heavy on tools; tool parameter accuracy, which tells you whether the agent understands the user's intentions and is extracting the right parameters from the user query; and helpfulness—is the response useful to the user and providing user value? This is the user experience metric. Three metrics that cover tool behavior and user value, a good picture of this agent's quality. No prompt engineering required. These are built-in, just checkboxes. Sampling rules to balance cost with coverage. They selected 2% sampling and got it going. Total setup, less than five minutes. Select agents, pick your evaluators, and set sampling rules. Done.

Thumbnail 1660

Let's go over what they found. The baselines for tool selection accuracy, parameter correctness, and helpfulness were 0.91, 0.9, and 0.8. This is what healthy looks like. What they found was that tool selection accuracy plummeted to 0.3 for this particular use case in this scenario, while tool parameter correctness remained basically stable and helpfulness declined by about 15%. Remember the increase in unhelpful feedback? These metrics are tracking user sentiment. The pattern is clear, and the diagnosis is there. The agent understands what the user wants but is picking the wrong tools.

Thumbnail 1710

Let's see what happens when you drill into the traces. On looking at a specific trace, they found the specific scores for that trace, but the key is they also found explanations for why that particular score. The system explained why it's wrong. It tells you the user's query had all the necessary parameters that should trigger the calculator budget tool, which would provide precise cost breakdown. Using web search introduces latency and accuracy issues. So not just scoring, explaining the logic—which tool should have been used, which was actually used, and what the user is missing. They looked at a few more traces, same pattern, and were quickly able to diagnose and root cause the issue.

Thumbnail 1760

Thumbnail 1780

They realized that a recent prompt modification had emphasized comprehensive information gathering but removed explicit tool selection guidance. They restored the tool selection guidance, provided a few more concrete examples, and deployed the changes. Within a few days, the scores returned to normal for tool selection accuracy as well as helpfulness.

Okay, now let me go over what this means systematically, all of what we just went over. The detection and diagnosis time goes from taking days or weeks to minutes or hours. Review the scores, drill into traces, see the reasoning, and see the pattern. Monitoring—weekly reviews become continuous. You're always watching. Visibility shifts from being unknown until investigated to real-time tracking with trace level details.

Think about your last quality issue for a moment. How long did it take to detect the issue? How long did it take you to diagnose the actual root cause? That's what automated evaluation changes. What previously required weeks of manual work now happens automatically, continuously at scale.

Thumbnail 1840

Let's go over and summarize what we just covered. This is what we did, so you don't have to. We spent a lot of time and extensive testing on prompt engineering, designed rubrics that overcome LLM limitations, built infrastructure for capacity and scaling, and created interactive dashboards and visualizations. Multiple quarters of engineering work. And what you get is select evaluators, configure sampling, then go into the dashboards to review the insights. Drill into the traces. Verify improvements after you have acted on the findings. So we have compressed months of infrastructure work into minutes of configuration. You focus on your agent's quality, not on building evaluation systems. That's the value of fully managed.

Thumbnail 1890

Summarizing, you move from reactive investigation to proactive monitoring, from manual reviews to automated assessment, from weeks of detection to hours of visibility, and from unknown quality to continuous insights. AgentCore Evaluations is available in preview today. Start monitoring your agents and catch those quality issues before they become customer complaints. AgentCore Evaluations is meant to help you move from hoping your agents work to knowing that they work. I'll now hand it over to Ishan to take us through a technical deep dive and a demo of the service.

Thumbnail 1950

Technical Deep Dive: Creating Online Pipelines and Custom Evaluators

Thank you, Vivek. This makes building trustworthy AI agents much easier, much faster. And today I have built four demos for all of us, from getting started to really diving deeper into evaluating agents and looking at the journey of going to production. Without further ado, let's do the first demo. So here, I'm creating an online pipeline, which is supposed to evaluate my agent continuously. So we are going to look at the setup now.

Thumbnail 1960

Thumbnail 1970

Thumbnail 1980

There are three main components here. I am choosing a data source. Then I am selecting the metrics that I want to use for the evaluator. And then, lastly, I am going to select a sampling strategy. So in this case, I'm selecting fifteen percent. And that's it. That's all you need to do in order to continuously evaluate an AI agent across different metrics, across different levels. Vivek covered the levels in detail earlier: session, trace, and span.

Thumbnail 2000

So what we did here is, firstly, we looked at the data source. A data source here could be, one, your agent that is hosted on AgentCore Runtime, or it could also be an agent whose hosting you manage yourself, maybe in your own managed infrastructure or elsewhere on AWS and other services. As long as both of these agents are logging their OpenTelemetry traces into a CloudWatch log group, we can seamlessly fetch those traces, apply evaluation on top of them, and provide you with the scoring and the explanations.
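Putting those three components together, a hedged sketch of what an online evaluation configuration captures might look like this. The keys, values, and log group path are illustrative assumptions, not the service's actual schema.

```python
# Illustrative online evaluation configuration with the three components from the demo.
online_config = {
    "data_source": {
        # traces from an AgentCore Runtime agent, or any agent writing
        # OpenTelemetry traces to a CloudWatch log group
        "cloudwatch_log_group": "/aws/bedrock-agentcore/runtimes/travel-agent",
    },
    "evaluators": [
        "goal_success_rate",
        "helpfulness",
        "correctness",
        "purchase_intent",        # the custom evaluator checked in the demo
    ],
    "sampling": {"rate": 0.15},   # evaluate fifteen percent of sessions, as selected in the demo
}
```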

Thumbnail 2040

We just looked at a demo of how to create that online configuration. I selected a bunch of metrics there, and I also selected a custom metric. Of my four demos, this one was recorded as number three, but I'm showing it to you first. So this is where we create a custom evaluator. If you remember from my previous demo, I had a line item there where I checked one of the custom evaluators. So why do you need a custom evaluator in the first place? Built-in evaluators are really good to get started. You may then start seeing the need for more nuanced evaluations, and that's where custom evaluation comes in. Again, it's LLM-based as well.

Thumbnail 2090

So let's look at the demo for creating a custom evaluation. Here, we have four components. First, we have the load template tool. So from here, you can easily select different types of templates. These basically give you access to the context variables so that you don't have to worry about setting those up.

Thumbnail 2110

Thumbnail 2120

Thumbnail 2140

Then you can edit the prompts as you need to. Then, I am configuring the model and the inference parameters. I set the temperature to 0 because I want more reliable results, but you can experiment based on what works for you. Then I'm configuring the type of scale I want, which is the rubric in this case, essentially telling the judge what score to give, given the agent's context, and how to score it. And then lastly, I'm selecting the level that I want to use. As I was saying earlier, and Vivek also covered this, we have three levels: session, trace, and span.

Thumbnail 2160

So, we just looked at the 13 evaluators, and we also looked at how to create a custom evaluation. Let's dive into how the selection of these metrics works. And if you recall, we already looked at the slide, but we just want to quickly cover how you go about selecting these metrics. Essentially, you work backwards from the success criteria you have for your agentic use case. The success criteria are often a combination of multiple metrics, and usually they should have at least three components: the quality of the responses the agent is providing, the latency at which you are getting those responses, and the cost of inference. Because at the end of the day, as a business stakeholder, you may want your users to start getting the first token of a response within, say, three seconds, the cost of inference to be roughly, let's say, less than one dollar for each interaction, and so on and so forth, and you want your users to have a good experience, and then you can define what that good experience means for your use case.

So, having covered the success criteria, how do you go about selecting these metrics? Well, it really depends on your use case. I'm going to talk about two real quick. If, let's say, and Vivek also covered this in a little detail, you have some sort of customer-facing application, you really care about whether the responses are grounded in the context that was provided to that application, in this case, the agent. The other example here could be that your agent is part of, let's say, some back-end workflow, where it is supposed to read in from some large text corpus and provide you with a structured response. In this case, you may be more interested in instruction following, whether or not it's extracting everything you want, exactly in those formats, and so on and so forth. So selecting the metrics really depends on your use case and your success criteria. Ultimately, what you can measure, you can improve, so take that as a mantra.
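As a small illustration of working backwards from multi-dimensional success criteria, here is a sketch that checks quality, latency, and cost thresholds together. The thresholds mirror the examples from the talk and are assumptions you would replace with your own.

```python
from dataclasses import dataclass

@dataclass
class SuccessCriteria:
    min_helpfulness: float = 0.8           # quality of responses
    max_first_token_s: float = 3.0         # latency: first token within ~3 seconds
    max_cost_per_interaction: float = 1.0  # inference cost in USD per interaction

def meets_criteria(helpfulness: float, first_token_s: float, cost: float,
                   criteria: SuccessCriteria = SuccessCriteria()) -> bool:
    # All dimensions must pass together; a great answer that is slow or expensive still fails.
    return (
        helpfulness >= criteria.min_helpfulness
        and first_token_s <= criteria.max_first_token_s
        and cost <= criteria.max_cost_per_interaction
    )

print(meets_criteria(helpfulness=0.85, first_token_s=2.4, cost=0.40))  # True
```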

Thumbnail 2300

Well, this is all great. Let's look at the process of evaluating AI agents now. There are five main components here. So, you build an AI agent, you're just getting started with it, you build it for the use case. The next obvious thing you do is come up with some questions that you are now testing this agent against. As the agent is producing a response, it is also producing traces, and as we mentioned, you should obviously be doing observability. So observability is tracking those traces and storing them for you, and then those traces are used for scoring based on the metrics that we covered. So you now have an agent, and you have some test cases. Test cases here may or may not have a ground truth, because you're just getting started, and ground truth data is very expensive, as we all know.

So we have built an agent, we are now invoking that agent with the test questions, and we are getting scores. Scores also come with explanations in Amazon Bedrock AgentCore Evaluations, so you're looking at those. Now you're analyzing whether this score is somewhere close to what your success criteria are. If not, then you go back to improving the agent. Maybe you need to break the agent into sub-agents or specialized agents, or maybe you're just looking to improve the model. Maybe you're testing with a faster model, or maybe you now want to go to a higher quality model. Then with these improvements, you put them back into the agent and test again. Every time you see a failure with your agent, you add that question back into your test set, and this process continues until you have met your success criteria.
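A condensed sketch of that evaluate-and-iterate loop is shown below. `invoke_agent`, `score_session`, and `improve_agent` are placeholders for your agent call, an on-demand evaluation, and whatever improvement step you take; only the control flow reflects the process described here.

```python
def evaluation_loop(test_questions, target_goal_success=0.85, max_iterations=5):
    # invoke_agent, score_session, and improve_agent are placeholders for your
    # agent call, an on-demand evaluation, and your improvement step.
    for iteration in range(max_iterations):
        scores, failures = [], []
        for question in test_questions:
            session_id = invoke_agent(question)        # agent run produces traces
            result = score_session(session_id)         # scores plus explanations
            scores.append(result["goal_success_rate"])
            if result["goal_success_rate"] < 1.0:
                failures.append(question)              # collect failing cases to drive the next fix
        average = sum(scores) / len(scores)
        print(f"iteration {iteration}: goal success = {average:.2f}")
        if average >= target_goal_success:
            return True                                # success criteria met, ready to deploy
        improve_agent(failures)                        # e.g. revise the system prompt
    return False
```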

The Path to Production: Baselining, Shadow Mode, and AB Testing Strategies

So this is all great, right? What's the process here? What's the path of building and deploying trustworthy agents? So let's also look at that, and then I'll show you two more demos.

Thumbnail 2430

The path to production for an AI agent starts with creating a baseline. We just looked at the process where I covered five components. So we start with baselining an agent. Baselining of an agent again goes back to the testing. Once you've met the success criteria, you feel comfortable, you are able to trust the agent, and finally that agent can go into production.

Now, once it is in production, you may be wondering, am I done, or do I need to do something else? As we see from the speed at which generative AI is evolving, there are new models, new techniques, new protocols coming up literally every other week. We are also improving the data. The data is changing. Your users' usage patterns will be changing at the same time as well.

So then what happens is every time you want to improve anything in the agent, or if you are updating a new tool, if you are adding any new functionality, or maybe just breaking down the agent into specialized agents, you're going back to that five component loop I was talking about. But since you are now in production, this process looks a tiny bit different. Again, those five components still remain there, but you are now also doing offline evaluation.

There is also a component of shadow mode, shadow evaluations, which basically means that you already have an agent in production that is serving your traffic. You have now built a second version of this agent, which also receives the traffic but does not send its responses to the user. So you can easily compare these two agents and see if this new agent that you have built is really solving for and improving the metrics that you were looking at.

Then comes the AB testing stage. And you may be wondering, we already did shadow testing, why do we need AB testing? Well, in the shadow mode, you never collected user feedback on this. Your actual users never interacted with this second agent that you built. With AB testing, you're now going to give a portion of your original traffic to this agent. Your users are going to interact with this agent, and then you are going to measure whether it's actually solving the problem you are trying to improve or not. And once you feel comfortable with that, you have a full rollout.

Thumbnail 2590

Across both of these stages, POC to production on the left-hand side, and being in production and beyond on the right, one thing remains common. You need two types of evaluations: on-demand and online. You may be thinking that this is a repetition of a slide. Yes, we want to really reinforce the concept that you need both on-demand and online evaluations.

You would use on-demand evaluation when you are testing your agent. And by on-demand, by no means am I talking here about a batch process. It is still a real-time API. AgentCore Evaluation provides both of these as real-time APIs. So you are testing in your pre-deployment phase. You are using on-demand in your CI/CD pipelines. Vivek has covered that already, but you can also use on-demand in production because you may have some nuanced use cases. This is a real-time API. You can really bring together any type of nuanced use case you may have and evaluate using on-demand.

And then continuous monitoring is really needed as you go into production, as you put the agents in production, because in the case of AgentCore Evaluation, it is doing the heavy lifting of building the pipeline. And I believe all of us have interacted with data. We have all seen data pipelines are complex. So in this case, AgentCore Evaluation does the full heavy lifting of fetching the data from the CloudWatch log groups, putting that into the format that these APIs accept, and then doing the inference, collecting the score, collecting the explanation, and providing it to you.

For the online evaluation, you do want dashboards because you want your subject matter experts, your product experts to look at how your users are interacting, and then also at the same time looking at the scores of how helpful, for example, the agent was, or if the agent was faithful, that is, it did not hallucinate.

Thumbnail 2710

Live Demo: On-Demand Evaluation with Travel Agent from Baseline to Production

Now, let's dive deeper into how on-demand evaluation for AgentCore Evaluations works. As we see here, we have a developer who has configured some evaluations. We looked at the first two demos, where the second one more specifically showed us creating a custom evaluator, and then we also have the thirteen built-in metrics. On the on-demand side, you first invoke an agent. That agent produces traces. In the case of on-demand evaluation, you fetch those traces and put them into the on-demand API in the format that the on-demand API accepts. This can be part of your workflow when you're working in your IDE, so it can happen there. It can be part of your CI/CD pipelines, and as I was speaking earlier, you can also put this into production for a nuanced use case because it's a real-time API. Then seamlessly, the on-demand API invokes or works with the AgentCore Evaluation Service and invokes the models that power the evaluations and provide you with a score and an explanation.

Thumbnail 2800

Thumbnail 2820

Thumbnail 2830

Thumbnail 2840

Thumbnail 2850

Now, let's look at the demo of the on-demand evaluations. I'm going to go to that side because this screen is really small. What I have here is the travel agent example that Vivek was speaking about earlier, continued. This is an agent where I'm using Haiku 3.5, and as shown there, it has access to seven tools. This is my baseline implementation. By baseline, I mean this is the first version of the agent that I created. I deployed this agent on AgentCore Runtime. You may ask, why did you deploy it if it's a baseline agent? Well, I want to evaluate my agent exactly in the environment where I'm going to finally deploy it once it is in production. This is the process of deploying the agent with AgentCore. With the AgentCore Development Kit, it's super easy to deploy any agentic code there, literally five to seven lines of code, and this just deployed it. Then I just checked the status, whether it's deployed and ready for me. This whole thing takes about one minute or so.

Thumbnail 2860

Thumbnail 2870

Thumbnail 2880

Thumbnail 2890

Thumbnail 2900

Now I am asking questions to this agent. There's a question set that I have prepared, and I'll show you in a second. This is what it looks like. In the very first question, it asks, I am planning my honeymoon to Maldives for ten days and my budget is forty-five hundred dollars. This agent is supposed to now look through things. As you can see here, it invokes tools, does all that, and generates traces for me. Nowhere in that entire notebook was I looking at or doing evaluations. My agent has run. Here is where I'm doing the evaluations. I'm looking at the session IDs, and please look at this closely. The code that I've highlighted, this is all you need in order to do on-demand evaluations. I just provided the set of metrics that I want to test. I give it the session IDs. We are already collecting those in AgentCore Observability. It is very easy for me to evaluate now.
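For readers following along without the video, the highlighted cell boils down to something like the sketch below: pass the evaluators you care about and the session IDs already captured in AgentCore Observability. `run_on_demand_evaluation`, the metric names, the session IDs, and the result shape are placeholders rather than the exact SDK call shown on screen.

```python
# Placeholder call: the exact SDK function, metric names, and result shape are assumptions.
evaluators = [
    "goal_success_rate",
    "tool_selection_accuracy",
    "helpfulness",
    "conciseness",
]

session_ids = [
    "session-001-maldives-honeymoon",   # illustrative session IDs from the test run
    "session-002-tokyo-budget-trip",
]

results = run_on_demand_evaluation(evaluators=evaluators, session_ids=session_ids)

for result in results:
    print(result["session_id"], result["evaluator"], result["score"])
    print("  explanation:", result["explanation"])
```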

Thumbnail 2920

Thumbnail 2930

As you can see, all the pipeline, all the data management, everything is taken care of. Here are the scores. This is my baseline version. It does an okay job. My goal success rate is in an acceptable range. As I also show in the explanation below, for one of the cases my goal success rate is zero because my agent asked three follow-up questions and did not invoke anything. The goal success rate evaluator gave it a zero score, and it also calls that out. All of that is good.

Thumbnail 2960

Thumbnail 2980

Thumbnail 2990

Thumbnail 3000

Then I'm skipping a part of the process. I'll cover that later, which is how I went to this notebook. This is my notebook where I'm doing my first experiment, where I improved the system prompt. It is the same agent, using the same model, same configuration, but my system prompt has changed now based on the analysis that I have done of the explanation. As you can see, I'm giving it specific instructions around, hey, be more concise, use tools, and all those things. I'm deploying it again because, as I was speaking, I want to evaluate my agent in the environment I'm finally going to deploy it in. Then, well, this is my trial one. I run through the same list of questions again. As you can see, I was doing this yesterday, and I recorded it live.

Thumbnail 3010

The system is generating those responses there, and then it is supposed to give me the scores and explanations. Now, let's talk about how I went from my baseline to experiment. What did I change? How did I know what to change? That's the key to evaluation and iteration and improving these agents.

If you remember from the explanation that I was showing you in the first place, it said that my agent did not invoke a tool. It just continuously collected information from the user. Essentially, I analyzed all those different explanations that I got as a response of my first baseline notebook. Then I came up with, hey, these are the top failure points. Let's improve those in my system prompts. I could have also taken a route to update the model, but in this case, I did not think that at this time I needed that because I did not fully exhaust my prompt engineering capabilities.

Thumbnail 3070

This is the score lift that I see after I analyzed the explanations and improved my agent system prompt. As you can see, there are significant lifts, especially in Goal Success Rate and Conciseness. Well, my Conciseness score is still low, to be honest, but hey, since this is a demo, I feel comfortable with deploying this agent to production. But we can see good improvement between the baseline and just the first experiment.
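To put lifts like these on firmer footing, a simple statistical check between versions might look like the sketch below: bootstrap the difference in mean scores and see whether the 95% confidence interval excludes zero. The score arrays are illustrative, not the demo's actual data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

baseline = np.array([0.72, 0.80, 0.65, 0.78, 0.70, 0.75, 0.68, 0.82])    # illustrative v1 scores
experiment = np.array([0.85, 0.88, 0.79, 0.90, 0.83, 0.86, 0.81, 0.92])  # illustrative v2 scores

def bootstrap_ci(a, b, n_resamples=10_000):
    # Resample each version's scores with replacement and record the difference in means.
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        diffs[i] = rng.choice(b, size=b.size).mean() - rng.choice(a, size=a.size).mean()
    return np.percentile(diffs, [2.5, 97.5])

low, high = bootstrap_ci(baseline, experiment)
print(f"95% CI for the improvement: [{low:.3f}, {high:.3f}]")
if low > 0:
    print("The lift is unlikely to be noise; consider promoting the new version.")
```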

So how do you go about using the explanations and improving the agent? There are two ways. You can, one, have your subject matter expert look at these explanations, dig deeper, do some data analysis, and see if they can come up with these failure points and suggest prompts to you. Or you could use a hybrid approach, and that's exactly what I did. I used an LLM to analyze all those explanations because, to be honest, as you scale, let's say you have thousands of test cases, as a human, I don't think I could read those. So I just used an LLM to extract the main problems. Then I looked through the insights and came up with a prompt.
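Here is a sketch of that hybrid approach: feed the evaluator explanations to a model through the Bedrock Converse API and ask it to cluster them into recurring failure patterns before a human reviews the output. The model ID and prompt are illustrative choices, not necessarily what was used in the demo.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def summarize_failures(explanations: list[str]) -> str:
    # Ask the model to cluster evaluator explanations into recurring failure patterns.
    prompt = (
        "Below are evaluator explanations for low-scoring agent responses.\n"
        "Group them into the top 3-5 recurring failure patterns and suggest a concrete "
        "system prompt change for each pattern.\n\n" + "\n---\n".join(explanations)
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",   # illustrative model choice
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# explanations would come from the on-demand evaluation results shown earlier
# print(summarize_failures(explanations))
```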

Thumbnail 3140

As I was saying, since this specifically is a demo, I felt comfortable deploying it. You may want to do more trials, and you would repeat that until it meets your success criteria. Now, let's talk about the online evaluation. So online evaluation, in this case, and if you remember from my demo one, it took me about twenty seconds to set up the online config, and that's exactly what online evaluation uses.

Every time you invoke an agent, and let's say in this case the agent is deployed on AgentCore Runtime, though as I was saying earlier, it could also be an agent that's not on AgentCore Runtime. So in this case, it's on AgentCore Runtime. It seamlessly transmits its traces to our AgentCore Observability offering. From there, AgentCore Evaluations can seamlessly read the traces as the session completes and then evaluate that session, trace, or span with the metrics that you have chosen, to finally produce the score and the explanation of why that score was generated for you.

Thumbnail 3210

Thumbnail 3230

Thumbnail 3240

Now, finally, I think this is the last demo, and this is where I'm going to show you our dashboard. Since I put my agent in production, I set up the online config from demo number one. So when you invoke the agent, this is what happens. On your AgentCore Observability dashboard, you can also see evaluation scores. You can see that my agent does a decent job. I have, I think, over three hundred invocations. So these scores are not from just a toy sample; it's a decent sample.

Thumbnail 3260

Thumbnail 3270

You can also filter things down based on the category or the label. Remember, in the second demo we created some labels in the rubric, so you can filter through those. You can also look into the traces. A trace is a specific question that was asked of the agent, and you can look at the explanations right there on the dashboard. In this case, I clicked on Correctness, and it tells me what's wrong: the score is 0.5 because there is a mix-up in how the agent used the context from the tool. And you get explanations across all the metrics that you have chosen, as shown in demo number one.

Thumbnail 3290

Best Practices for Agent Evaluation and Getting Started Resources

Okay, so best practices. What are the best practices of evaluating agents? How do you go about evaluation of agents?

Thumbnail 3310

Firstly, define multi-dimensional success criteria. Success criteria, not success rate. Define multi-dimensional success criteria. And your success criteria should also include the experience as part of it, because that's what you're creating with these agents. If it's a customer-facing app, that's the user experience; if it's running some back-end workflow, it could be your downstream impact. So get your subject matter experts involved sooner, so that you can design those metrics and build human-in-the-loop with them, so that you are really targeting this agent at the task and the use case it is meant to solve.

Thumbnail 3350

Do rigorous testing. And as I was showing you, there's a baseline, then there was a trial, and there could be a number of trials until you meet the success criteria. So do that. I suggest you start by defining a success criterion that is not your final target, maybe 5 to 7% below the final threshold. That's a good starting point as you're launching the agent, and then work towards the final bar, because it's very easy to get up to 80%, and then there are diminishing returns.

Thumbnail 3380

Your data also evolves with your agent and your user access patterns. So note that. Then your evaluation framework also should improve over time. Built-in evaluators are really great to get started, but then as you have more nuanced use cases, you want to build those custom evaluators. There is one more thing that I want to cover on this. You may also realize as you are graduating from your built-in evaluators to a custom evaluator, that sometimes there would be cases when your subject matter expert would be like, yeah, this score of 0.8, I don't agree with it, maybe it's 0.7. So this is called a calibration gap, so you also want to do some calibration there.

Thumbnail 3440

Thumbnail 3450

Then monitor your agents continuously. You need that, because you can only improve what you can measure. And do rigorous statistical analysis as you are improving your agent from version 1 to version 2 to version X. And with that, this is the end of the session.

Thumbnail 3460

We have documentation, GitHub samples, workshops, and AgentCore SDK for you. So feel free to take a look at those and get started. We are very excited for you to get started with AgentCore Evaluations. Well, we really hope you are enjoying re:Invent, and have a great rest of your day.


This article is entirely auto-generated using Amazon Bedrock.
