🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Improve agent quality in production with Bedrock AgentCore Evaluations (AIM3348)
In this video, Amanda Lester, Vivek Singh, and Ishan Singh introduce Amazon Bedrock AgentCore Evaluations, a fully managed solution for continuous AI agent quality assessment. They address the trust gap in autonomous agents by demonstrating how to evaluate agents across 13 built-in dimensions including correctness, helpfulness, and tool usage, plus custom evaluators. The session covers both online evaluations for production monitoring and on-demand evaluations for CI/CD pipelines, using a travel agent example to show how tool selection accuracy dropped from 0.91 to 0.3, enabling detection in hours versus weeks. Live demos illustrate setup in under five minutes, trace-level analysis with detailed reasoning, and CloudWatch dashboard integration for continuous monitoring.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to Amazon Bedrock AgentCore Evaluations at re:Invent
Hello everyone and welcome to Amazon re:Invent. It's great to have you all here. My name is Amanda Lester and I am the worldwide go-to-market leader for Amazon Bedrock AgentCore, and I am joined today by two of my esteemed colleagues: Vivek Singh, senior technical product manager for AgentCore, and Ishan Singh, senior GenAI data scientist here at AWS. We are incredibly excited to be able to present to you today.
We're going to discuss how you can improve the quality of your agents in production with Amazon Bedrock AgentCore Evaluations, which we just recently launched during the keynote. We're incredibly excited to present to you what we have developed for agent evaluations, which we believe is going to fundamentally help you to improve the way that you do business. In today's session, you're going to learn several things. First, you're going to learn about Amazon Bedrock AgentCore. We're also going to discuss some of the key fundamental challenges that are associated with operating your agents at scale in production.
Third, we're going to provide an overview of the solution that we've built to address some of those challenges, which is AgentCore Evaluations. Fourth, we're going to provide a couple of demos of our solution. And finally, fifth, we're going to share some best practices and resources that you can use to get started evaluating agents and get the agents that you've built into production much faster than ever before.
The Technological Revolution: Amazon Bedrock AgentCore Platform Overview
What is incredibly exciting about this year's Amazon re:Invent is that we are at the edge of another technological revolution, which is unfolding right before our very eyes. This future is happening now. It's not a dream. It's happening now, and developers all around the world are empowered to reimagine customer experiences and are leveraging agents today to run their operations more efficiently and effectively. This is happening across virtually every industry and every size of company, from startups to enterprises, all around the world.
For developers to take advantage of the benefits of agents, they need the confidence and the right foundational set of services and tools to bring those agents into production. One of the ways that AWS is helping to get those agents into production faster is that we have built Amazon Bedrock AgentCore, our most advanced agentic platform. It provides developers with everything you need to get your agents into production faster, including a comprehensive set of services to deploy and operate your agents at scale in a secure manner. You can leverage that with any framework and any model.
AgentCore includes a foundational set of services that you can use to run your agents at scale securely. This includes services to enhance your agents with tools and memory, purpose-built infrastructure to deploy your agents at scale, and controls that you can leverage to gain insights into your agentic operations. We built Amazon Bedrock AgentCore to be extremely flexible. We understand that as an agent developer, you want the choice and flexibility to build with open source protocols, such as the Model Context Protocol (MCP) and A2A for agent-to-agent communication.
It's also incredibly important that you can build your agents with the framework of your choice, which is why we've built AgentCore with the flexibility to work with any agentic framework. Once you've built your agent and you're ready to deploy it into production securely and at scale, you can do so with confidence and the right set of foundational services to support that trust. But we are not done innovating on your behalf. Agents are non-deterministic by nature, which means that as a developer, there is an entirely new set of services required to handle this non-deterministic nature.
The Non-Deterministic Nature of Agents and the Trust Gap Challenge
What do I mean by non-deterministic? Agents can reason and act autonomously. This is incredibly exciting because it means that agents are fundamentally changing the way that work can get done. We are moving to a world where agents can go off and do work on your behalf to achieve a specific goal or task that you set out for them, and they can do this all without direct supervision.
Agents can reason, create workflows to solve problems, and make decisions, all without direct supervision. This is incredibly exciting and powerful because it means that it can free you up from mundane tasks and busy work. You can then leverage that extra time to work on higher-value strategic tasks. But because these agents are autonomous, which is what makes them so powerful, it is fundamentally critical that you can trust that the agents are going to perform their jobs correctly.
The fundamental question that every single person in this room needs to answer if you are going to leverage agents is: can you trust this powerful new digital workforce to do their job correctly? The problem that is keeping developers and CTOs up at night is whether you can trust that these agents, because they are autonomous and you are going to hand over mission-critical business processes and tasks to them, will do that job correctly, efficiently, and effectively, and will provide a good experience for your customers.
Can you trust that the agent is going to address the situation and provide an optimal solution? It is not enough for the agent to just produce an answer. That answer needs to be the right answer, the correct answer, and an accurate answer. If the agent provides a wrong answer, it may cause more problems for your customers, your users, your company, and your developers. The fundamental benchmark now for getting an agent into production is trust.
Agents have a trust gap right now. If you cannot trust your agents to be reliable, to consistently do their job accurately and effectively, and to produce the right answers in a consistent manner, that may result in a poor customer experience. This is one of the biggest fundamental blockers to adoption today for companies and agent developers around the world that want to take advantage of agents. You need to ask yourself: if you are going to use agents, how can you make sure you are building agents that are trusted, so you can create a good customer experience and bridge this gap?
Why Agent Performance Reviews Are Critical: Understanding Quality Failures and Silent Failures
One of the ways that you can bridge this gap is by making sure that your agents have a job performance review. In other words, you need to go in and do an agent evaluation, and you need to complete this performance review of your agents to evaluate whether your agents are making the correct decisions, delivering the correct results, are efficient and effective at their job, and can be relied on to do that job and execute those tasks and workflows in a consistent manner.
When you are creating an agent evaluation, there are some questions that you need to answer on a regular basis. These include: did the agent achieve the goal and task it was set out to do? Is the agent making the right decisions? Did it generate the correct response? Did the agent select the appropriate tool to do its job? Are the answers that the agent generated accurate? Was it polite to your customers or not?
What is fundamental about these types of questions is that they are subjective, which makes it very difficult to evaluate and measure these agents to complete a job review. But it is a fundamental imperative to assess the performance of your agents and whether you can rely on them to do their job effectively. And conducting agent evaluations is not a one-time task that you perform when you deploy your agents into production. This needs to be done continuously, and here's why.
Agent evaluation needs to be done continuously because it needs to happen in real time. We're not talking about doing this once a month or once a year. You need to identify in real time if these agents are failing, so you can monitor their behavior and quality and address problems proactively. You need to be alerted the very second that these agents fail silently in production.
Agents may fail because every time you update a model, roll out a new version of the agent, modify a system prompt, or add or remove tools for your agents to use, the agent may fail and be unable to achieve its job or goal correctly. Unlike traditional software, you really only know if the agent is effective once it's out in the real world in front of your users and customers. You need to measure and evaluate those agents continuously to address agent behavior proactively in real time.
What are the primary points of failure for an agent? There are three. First, there are quality failures: hallucinations, factual errors, faulty reasoning and planning, and poor tool selection, which result in inconsistent outputs. These quality failures cascade into reliability issues, which may include context loss for the agent, poor handling of errors, and security and vulnerability gaps.
Those reliability issues will result in inefficiencies associated with those agents, which may result in higher costs for you and your company and higher latency associated with those agents. The net effect of all three of these points of failure is a potential loss of customer trust in the generative AI application that you have built or your product and service. In the worst case scenario, you've built an agent, put it into production, and you're excited about how it's going to give your customers a great experience, and that agent fails silently.
You only find out if the agent failed after receiving an onslaught of customer complaints. The problem is that by the time you hear it from the customer, it's too late. You may lose that customer. You may generate a bad reputation for your business, your company, and your product, and that may result in the loss of current or future business revenue. We can't afford that, and you need to address it upfront proactively.
The Complex Challenge of Evaluating Agents at Scale
This is why you need to understand in real time and get alerted the very second that the agent fails so you can proactively take action to address those points of failures well in advance before the customer experiences it. What we heard from developers all around the world is that they don't have the right tools and mechanisms today to evaluate agents in real time and to monitor those agents. Developers have to go through multiple steps to complete an agent evaluation.
This includes finding and creating datasets, selecting the appropriate measures and dimensions to evaluate the agent on, selecting the right model to judge the outputs of those agents, building and maintaining the infrastructure to serve those evaluations, recording the results, making adjustments, and then continuously monitoring in production. This process can take months: it's the difference between coming up with an idea today and only getting a trusted agent into production six months later, after multiple rounds of evaluation.
This is just the beginning. You have to repeat this process over and over again every time there's a new model, which today seems like every week. Every time there's a new version of the agent, a new system prompt, or a new tool, the process becomes even more complex. What we've heard from developers is that this process is extremely time consuming and is one of the most challenging and painful aspects of operating agents at scale.
You don't just have to do this for one agent. The reality is you may have ten thousand agents off executing mission-critical tasks for your organization. Simply put, it is not feasible to do this manually at scale. This is why we took a look at this entire end-to-end problem and asked ourselves how we can make this process simple and easy for you, and provide a way to evaluate agents faster and more easily so you can get agents to production faster, securely, and with confidence.
The good news is there is a better way. To tell you a little bit more about the solution that we've built that we think will help you manage this process better, I'm going to invite up Vivek Singh, who is our technical product manager for Amazon Bedrock AgentCore.
Introducing AgentCore Evaluations: A Fully Managed Solution with Built-In and Custom Evaluators
Hi everyone. My name is Vivek and I'm a product manager at Amazon Bedrock AgentCore. Today, I'm very excited to announce and tell you about AgentCore Evaluations. I'm sure a lot of you who have deployed an agent have faced similar problems. You deploy an agent, operational dashboards look green, and then three weeks later, you're firefighting quality issues, wondering what went wrong. That's the problem that we are solving with AgentCore Evaluations.
AgentCore Evaluations is a fully managed solution providing continuous assessment of AI agent quality. The key words there are continuous and fully managed. You can get evaluation frameworks and prompts in a lot of places, or build your own, but then you are left managing your LLM infrastructure: capacity, API rate limits, cost optimization, and infrastructure scaling. With AgentCore Evaluations, we handle all of that. The evaluation models, the infrastructure, the scaling: that's our problem, and we take care of it.
You and your teams can focus solely on improving the agent experience for your customers without having to worry about the LLM infrastructure for running evaluations and managing those systems. We provide thirteen built-in evaluators across common quality dimensions including correctness, helpfulness, stereotyping, tool usage, and more. When you need domain-specific assessment, you can create custom evaluators. Everything flows into AgentCore Observability through CloudWatch, so you don't have to manage another monitoring platform.
Let me show you how this works across your development cycle. AgentCore Evaluations operates in two modes: online evaluations and on-demand evaluations. Online evaluations are for production environments. This continuously monitors live agent interactions. You configure sampling rules, one to two percent of your traffic for baseline monitoring, and we automatically score them in real time. This is how you detect quality degradations and catch those silent failures before they affect your customer experience. The key is continuous, not weekly reviews or monthly audits. Continuous automated assessment of your production traffic.
On-demand evaluation is for development workflows. This integrates with your CI/CD pipelines. Teams run evaluation suites against code changes, test a new prompt, validate new configurations, or compare two models. You can gate deployments when quality scores fall below certain thresholds. Block that pull request if it degrades helpfulness by more than ten percent. Both modes use the same evaluators, so you're testing against the same quality dimensions throughout your development life cycle. What you test in CI/CD is what you monitor in production.
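As a rough illustration of the quality gate described here, the sketch below fails a CI job when any evaluation metric regresses more than ten percent against a stored baseline. The metric names, thresholds, and the `fetch_scores` helper are placeholders of my own, not the service's API; in a real pipeline that helper would call the on-demand evaluation API shown later in the session.

```python
"""Sketch of a CI/CD quality gate; metric names, thresholds, and the
fetch_scores helper are illustrative placeholders, not the service's API."""
import sys

BASELINE = {"helpfulness": 0.80, "correctness": 0.85, "tool_selection_accuracy": 0.90}
MAX_RELATIVE_DROP = 0.10  # block the change if any metric degrades by more than 10%


def fetch_scores(candidate_build: str) -> dict:
    # Placeholder: in practice this would run the evaluation suite for the
    # candidate build (e.g. via the on-demand evaluation API) and return
    # metric -> average score. Hard-coded values keep the sketch runnable.
    return {"helpfulness": 0.82, "correctness": 0.86, "tool_selection_accuracy": 0.78}


def gate(scores: dict) -> bool:
    ok = True
    for metric, baseline in BASELINE.items():
        current = scores.get(metric, 0.0)
        drop = (baseline - current) / baseline
        status = "FAIL" if drop > MAX_RELATIVE_DROP else "PASS"
        print(f"{status} {metric}: {current:.2f} (baseline {baseline:.2f}, drop {drop:+.0%})")
        ok = ok and status == "PASS"
    return ok


if __name__ == "__main__":
    build_id = sys.argv[1] if len(sys.argv) > 1 else "local"
    if not gate(fetch_scores(build_id)):
        sys.exit(1)  # non-zero exit blocks the pipeline stage or PR merge
```

A step such as `python quality_gate.py $BUILD_ID` can then sit in the pipeline so that a failing exit code blocks the deployment or pull request, which is the gating behavior described above.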
Now, what can you actually measure? Thirteen built-in evaluators are organized across three levels. Session level evaluates the entire conversation. Did the agent achieve its goals? Did it fulfill the user request and did the conversation succeed end to end? Trace level assesses individual responses. This is where your quality signals lie.
Was the response correct? Was it helpful? Faithful to the provided context? Did it follow the right instructions? Span level focuses on tool usage. This is for agents with heavy tool use. Did the agent select the right tools, and did it extract the right parameters from the user's query?
Here is the thing: you don't have to select all 13. You pick the three or four that matter for your use case. For a customer service agent, you probably care about helpfulness, goal success rate, and instruction following. For RAG components within your agents, correctness and faithfulness are most useful. For tool-heavy agents, tool selection accuracy and tool parameter correctness matter most.
When these 13 do not cover your particular use case, you build your own. There are 4 steps. First, define your criteria. Let's say purchase intent detection for an e-commerce agent. You write your evaluation prompt with a scoring rubric. You define what high purchase intent means and how that compares to simple browsing. You create the grading for doing that scoring. You select the model to do the evaluation, and you configure the sampling rules. You get the same flexibility as built-in evaluators and the same automated continuous assessment, just tailored to your requirements and your domain.
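To make those four steps concrete, here is a sketch of what a purchase-intent evaluator could look like expressed as configuration. The overall shape (an evaluation prompt with a scoring rubric, a judge model with its inference parameters, a level, and a sampling rate) mirrors the steps above, but the field names and model ID are illustrative assumptions, not the actual AgentCore Evaluations schema.

```python
# Illustrative only: field names are assumptions, not the AgentCore Evaluations schema.
purchase_intent_evaluator = {
    "name": "purchase_intent_detection",
    "level": "trace",  # score each agent response individually
    "judge_model": "anthropic.claude-3-5-sonnet-20241022-v2:0",  # example judge model
    "inference_params": {"temperature": 0.0},  # deterministic scoring
    "prompt": (
        "You are evaluating one turn of an e-commerce agent conversation.\n"
        "Rate the purchase intent the user reveals in this turn.\n"
        "Conversation context: {conversation}\n"
        "Agent response: {response}\n"
        "Scoring rubric:\n"
        "  1.0 - explicit intent to buy (checkout, price confirmation, delivery questions)\n"
        "  0.5 - comparison shopping or product-specific questions\n"
        "  0.0 - casual browsing or unrelated small talk\n"
        "Explain your reasoning first, then output the score."
    ),
    "sampling_rate": 0.02,  # evaluate roughly 2% of production traffic
}
```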
How LLM-Based Evaluation Works: Structured Rubrics, Reasoning, and Integration
We have seen customers build evaluators for brand voice consistency, regulatory compliance, conversation flow quality, and whatever matters for their business that the built-in evaluators do not cover. Now, some of you must be thinking: you're using an LLM to evaluate my agent responses. How does that work? How do I trust that? That's a fair question. Let me address that now.
First, we use detailed structured rubrics. Every evaluator follows a specific framework with clear criteria. This isn't just asking, "Is this response good?" It's asking, "Does this response follow the instructions and use the provided context? Does it meet the defined criteria? Does it address the user's specific question?" We have spent months engineering these rubrics to overcome typical LLM limitations like bias, inconsistency, lack of specificity, and to enable grading on a continuous scale and across different categories.
Second, we provide reasoning for the scoring. The judge provides detailed reasoning before it assigns any score. It has to explain why that judgment occurred, why that particular score was given, why the agent choice was right or wrong, what should have happened, and what the impact is. No scores without justification. This ensures consistency and makes every assessment explainable.
Third, we provide complete context. We give the evaluation model everything: full conversation history, user intent, tools available to the agent, tools actually used, parameters passed, execution details from the traces, and the system instructions. It's not giving opinions; it's performing systematic assessment against clearly defined criteria with full visibility into what happened. You can verify each judgment with complete visibility into the reasoning. So once the evaluation result tells you that the tool selection was wrong or poor, you can see the reasoning to find out why, which tools should have been used, which were actually used, and why that matters.
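As a conceptual sketch of that "complete context" point, the payload handed to a judge model bundles the pieces listed above, and the rubric forces reasoning before any score is emitted. The structure below is illustrative only; it is not the service's internal prompt or data model.

```python
# Conceptual sketch of an LLM-as-judge input; not the service's internal prompt.
judge_input = {
    "system_instructions": "...",          # the agent's own system prompt
    "conversation_history": [...],         # full user/agent turns for the session
    "available_tools": ["calculate_budget", "web_search", "get_climate_data"],
    "invoked_tools": [{"name": "web_search", "parameters": {"query": "..."}}],
    "trace_details": {"latency_ms": 1840, "spans": [...]},  # execution details from the traces
}

rubric = """Evaluate TOOL SELECTION for the turn below.
1. List the tools the agent could have used and the one it actually used.
2. Explain step by step whether the chosen tool best serves the user's request,
   what should have happened, and what the impact of the choice is.
3. Only after the explanation, output a score between 0.0 and 1.0."""
```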
Let's take a quick look at how this plugs into your existing setup. Your agent runs using the framework that you prefer, LangGraph or Strands, and you can deploy it on a service of your choice: AgentCore, in this case, associated with a memory and a gateway. Traces are generated using standard OpenTelemetry instrumentation. We support popular instrumentations like OpenInference and OpenTelemetry. If you are already doing observability, and you should be, you are already collecting these traces.
You configure which evaluations to run, and from there, the service takes over. Using your sampling rules, we pull the traces, perform evaluations, and write the results back, along with the explanations, into the log groups for your traces. You monitor and assess through CloudWatch dashboards and AgentCore Observability. All your operational metrics and quality metrics are in the same place. The key point here is that there are no changes to your code and no redeployments. You're already collecting traces for observability; we are just adding quality scores to them. The evaluation happens service-side and is fully managed.
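Because results land in CloudWatch log groups alongside your traces, they can also be pulled programmatically rather than only viewed in dashboards. The sketch below uses CloudWatch Logs Insights via boto3; the log group name and the field names in the query are assumptions that would need to match where your evaluation results are actually written.

```python
"""Pull recent evaluation scores with CloudWatch Logs Insights (boto3).

The log group name and result field names are assumptions; adjust them to
wherever your evaluation results are actually written.
"""
import time
import boto3

logs = boto3.client("logs")

query = """
fields @timestamp, evaluator, score, explanation
| filter ispresent(score)
| sort @timestamp desc
| limit 50
"""

start = logs.start_query(
    logGroupName="/aws/my-agent/evaluation-results",  # assumed log group name
    startTime=int(time.time()) - 3600,                # last hour
    endTime=int(time.time()),
    queryString=query,
)

# Logs Insights queries run asynchronously; poll until the query completes.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    record = {field["field"]: field["value"] for field in row}
    print(record.get("evaluator"), record.get("score"), record.get("explanation"))
```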
Wanderlust Travel Platform Case Study: Detecting and Diagnosing Silent Failures in Minutes
Let me now show you what this looks like with an example. Wanderlust Travel Platform runs a travel search research assistant with multiple specialized tools: climate data, flight information, currency conversion, and web search.
Here's what started happening. The user query had all the right parameters, destination, duration, and budget, to invoke the calculator budget tool, but the agent instead invoked the generic web search tool and responded with a few generic blog posts. Not a calculation, nothing personalized. User engagement dropped and unhelpful feedback increased.
Look at the silent failures in the monitoring. Response time, latency, error rate, and tool activation all appeared healthy and normal. Every operational metric told you that the system was working fine, and still the user experience was degrading. Four weeks to detect the issue, three weeks to diagnose and fix, seven weeks of degraded experience because their monitoring was solely focused on operational metrics and not quality assessment. This is the gap between the system is running versus the system is working well.
They turned to AgentCore Evaluations and got it working in three simple steps. First, select the agent, thirty seconds with point and click. You could be using a framework of your choice, such as LangChain or LangGraph, deployed on AgentCore, Lambda, or EKS, and an instrumentation library of your choice like OpenInference or OpenLLMetry, whatever you need.
Step two, select the evaluators. They selected tool selection accuracy for this agent, which is heavy on tools. Tool parameter accuracy tells you whether the agent understands the user's intentions and is extracting the right parameters from the user query. And helpfulness asks whether the response is useful to the user and provides value; this is the user experience metric. Three metrics that cover tool behavior and user value provide a good picture of this agent's quality. No prompt engineering required. These are built in, just checkboxes.
Sampling rules balance cost with coverage. They selected two percent sampling and got it going. Total setup was less than five minutes: select agents, pick your evaluators, and set sampling rules. Done. Let's go over what they found. Baselines for tool selection accuracy, parameter correctness, and helpfulness were 0.91, 0.9, and 0.8. This is what healthy looks like. What they found was that tool selection accuracy plummeted to 0.3 in this scenario, while tool parameter correctness remained basically stable and helpfulness declined by about fifteen percent. Remember the increase in unhelpful feedback: these metrics are tracking user sentiment. The pattern is clear and the diagnosis is there.
The agent understands what the user wants but is picking the wrong tools. Let's see what happens when you drill into the traces. Looking at a specific trace, they found the specific scores for that trace, but the key is they also found explanations for why each score was given. The system explained why it's wrong. It tells you the user's query had all the necessary parameters that should trigger the calculator budget tool, which would provide a precise cost breakdown. Using web search introduces latency and accuracy issues. So not just scoring, but explaining the logic: which tool should have been used, which was actually used, and what the user is missing.
They looked at a few more traces, saw the same pattern, and were quickly able to diagnose and root-cause the issue. They realized that a recent prompt modification had emphasized comprehensive information gathering but removed explicit tool selection guidance. They restored the tool selection guidance, provided a few more concrete examples, and deployed the changes. Within a few days, the scores returned to normal for tool selection accuracy as well as helpfulness.
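One simple way to reason about the baseline-versus-current comparison in this incident is to watch a rolling window of scores against the recorded baselines and raise an alert when the gap exceeds a tolerance. The sketch below is a generic illustration of that check, not a feature of the service; the AgentCore Observability dashboards surface the same signal visually.

```python
from statistics import mean

# Baselines taken from the healthy numbers quoted above; tolerance is illustrative.
BASELINES = {
    "tool_selection_accuracy": 0.91,
    "tool_parameter_correctness": 0.90,
    "helpfulness": 0.80,
}
TOLERANCE = 0.15  # alert if a metric falls more than 15% below its baseline


def check_drift(recent_scores: dict) -> list:
    """Return the metrics whose recent average has drifted below tolerance."""
    alerts = []
    for metric, baseline in BASELINES.items():
        window = recent_scores.get(metric, [])
        if not window:
            continue
        current = mean(window)
        if current < baseline * (1 - TOLERANCE):
            alerts.append(f"{metric}: {current:.2f} (baseline {baseline:.2f})")
    return alerts


# Example resembling the Wanderlust pattern: tool selection collapses,
# parameter correctness holds, helpfulness dips moderately.
print(check_drift({
    "tool_selection_accuracy": [0.30, 0.35, 0.25],
    "tool_parameter_correctness": [0.89, 0.91],
    "helpfulness": [0.68, 0.70],
}))
```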
Now, let me go over what this means systematically, all of what we just went over. The detection and diagnosis time goes from taking days or weeks to minutes or hours. Review the scores, drill into traces, see the reasoning, and see the pattern. Monitoring and weekly reviews become continuous. You're always watching.
Visibility, which was unknown until investigated, goes to real-time tracking with trace level details. Think about your last quality issue for a moment. How long did it take to detect the issue? How long did it take you to diagnose the actual root cause? That's what automated evaluation changes. What previously required weeks of manual work now happens automatically, continuously at scale.
Let's go over and summarize what we just covered. This is what we did, so you don't have to. We spent a lot of time and extensive testing on prompt engineering, designed rubrics that overcome LLM limitations, built infrastructure for capacity and scaling, and created interactive dashboards and visualizations. That's multiple quarters of engineering work. What you get is select evaluators, configure sampling, then go into the dashboards to review the insights. Drill into the traces. Verify improvements after you have acted on the findings. So we have compressed months of infrastructure work into minutes of configuration. You focus on your agent's quality, not on building evaluation systems. That's the value of fully managed.
Summarizing, you move from reactive investigation to proactive monitoring, from manual reviews to automated assessment, from weeks of detection to hours of visibility, and from unknown quality to continuous insights. AgentCore Evaluations is available in preview today. Start monitoring your agents and catch those quality issues before they become customer complaints. AgentCore Evaluations is meant to help you move from hoping your agents work to knowing that they work.
Technical Deep Dive: Creating Online Pipelines, Custom Evaluators, and the Five-Component Evaluation Process
I'll now hand it over to Ishan to take us through a technical deep dive and a demo of the service, which makes building trustworthy AI agents much easier and much faster. Today I have built four demos for all of us, from getting started to really diving deeper into evaluating agents and looking at the journey of going to production. Without further ado, let's do the first demo.
Here, I'm creating an online pipeline, which is supposed to evaluate my agent continuously. We are going to look at the setup now. There are three main components here. I am choosing a data source. Then I am selecting the metrics that I want the evaluator to use. And then, lastly, I am selecting a sampling strategy; in this case, 15%. And that's it. That's all you need to do in order to continuously evaluate an AI agent across different metrics and different levels. Vivek has covered the levels in detail: session, trace, and span. What we did here is, firstly, we looked at the data source. A data source here could be one of two things: an agent that is hosted on AgentCore Runtime, or an agent whose hosting you manage yourself, maybe on your own infrastructure or elsewhere in AWS or other services. As long as both of these agents are logging their OpenTelemetry traces into a CloudWatch log group, we can seamlessly fetch those traces, apply evaluations on top of them, and provide you with the scoring and the explanations.
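The configuration in this demo boils down to those three choices: a data source, a set of evaluators, and a sampling rate. Expressed as code, it could look roughly like the sketch below; the structure and field names are illustrative assumptions, not the console's or API's exact schema.

```python
# Illustrative online-evaluation config; field names are assumptions, not the exact API schema.
online_config = {
    # Data source: either an AgentCore Runtime agent or any self-hosted agent,
    # as long as its OpenTelemetry traces land in a CloudWatch log group.
    "data_source": {"cloudwatch_log_group": "/aws/my-travel-agent/otel-traces"},
    # Built-in evaluators to run, at their respective levels.
    "evaluators": [
        {"name": "goal_success_rate", "level": "session"},
        {"name": "helpfulness", "level": "trace"},
        {"name": "tool_selection_accuracy", "level": "span"},
    ],
    # Sampling strategy: evaluate 15% of completed sessions, as in the demo.
    "sampling": {"rate": 0.15},
}
```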
We just looked at a demo of how to create that online config. I selected some metrics there, and I also selected a custom metric. In my four demos, that was recorded as number three, but I'm showing it to you first. This is where we create a custom evaluator. If you remember from my previous demo, I had a line item there where I checked one of the custom evaluators. So why do you need a custom evaluator in the first place? Built-in evaluators are really good to get started. You may then start seeing the need for more nuanced evaluations, and that's where custom evaluation comes in. Again, it's LLM-based as well.
Let's look at the demo for creating a custom evaluation. Here, we have four components. First, we have the load template tool. From here, you can easily select different types of templates. These basically give you access to the context variables so that you don't have to worry about setting those up.
And then you can edit the prompts as you need to. Then I'm configuring the model and the inference parameters. I set the temperature to 0 because I want reliable results, but you can experiment based on what works for you. Then I'm configuring the type of scale I want, which is the rubric in this case, essentially telling the judge what score to give, given the context of the agent, and how to score it. And lastly, I'm selecting the level that I want to use. As I mentioned earlier, and as Vivek also covered, there are three levels: session, trace, and span.
So we just looked at the 13 evaluators, and we also looked at how to create a custom evaluation. Let's dive into how the selection of these metrics works. We already looked at this slide, but we want to quickly cover how you go about selecting these metrics. Essentially, you work backwards from the success criteria you have for your agentic use case. The success criteria are often a combination of multiple metrics, and they should usually have at least 3 components: the quality of the responses the agent is providing, the latency at which you are getting those responses, and the cost of inference.
Because at the end of the day, as a business stakeholder, you may want to give your users a response quickly, say getting the first token within 3 seconds. The cost of inference should be roughly less than $1 for each interaction, and so on and so forth. And you want your users to have a good experience, and you can define what that good experience means for your use case. So having covered the success criteria, how do you go about selecting these metrics? Well, it really depends on your use case, so I'm going to talk about two real quick.
If, let's say, and Vivek also covered this in a little detail, you have some sort of customer-facing application, you really care about whether the responses are grounded in the context that was provided to the application, in this case the agent. The other example could be that your agent is part of some back-end workflow: it is supposed to read from some large text corpus and provide you with a structured response. In this case, you may be more interested in instruction following, whether or not it's extracting everything you want, exactly in those formats, and so on and so forth.
So selecting the metric really depends on your use case and your success criteria. Ultimately, what you can measure, you can improve; take that as a mantra. Well, this is all great. Let's look at the process of evaluating AI agents now. There are 5 main components here. First, you build an AI agent for your use case; that's just getting started with it. The next obvious thing you do is come up with some questions that you are now testing this agent against.
As the agent is producing a response, it is also producing traces, and as we mentioned, you should obviously be doing observability. Observability is tracking those traces and storing them for you, and those traces are then used to score against the metrics that we covered. So you now have an agent and some test cases. The test cases may or may not have a ground truth, because you're just getting started, and ground truth data is very expensive, as we all know.
So we have built an agent, and we are now invoking that agent with the test questions. We are getting scores, and with AgentCore Evaluations the scores also come with explanations. So you're looking at those. Now you're analyzing: is this score anywhere close to my success criteria, yes or no? If no, then you go back to improving the agent. Maybe you need to break the agent into sub-agents or specialized agents. Or maybe you're just looking to improve the model: maybe you're testing with a faster model, or maybe you now want to go to a higher-quality model.
Then with these improvements, you put them back into the agent and test again. Every time you see a failure with your agent, you add that question back into your test set, and this process continues until you have met your success criteria.
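The loop just described (build, test, collect traces, score, analyze the explanations, improve, repeat) can be summarized in a short sketch. The `invoke_agent`, `evaluate`, and `apply_improvements` callables are placeholders for your own agent call, the evaluation API, and whatever change you decide to make; only the loop structure is the point here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    metric: str
    question: str
    score: float
    explanation: str


def iterate_until_trusted(
    agent,
    test_set: list,
    success_criteria: dict,            # metric name -> minimum acceptable score
    invoke_agent: Callable,            # placeholder: (agent, question) -> trace
    evaluate: Callable,                # placeholder: (traces, metrics) -> list[EvalResult]
    apply_improvements: Callable,      # placeholder: (agent, failures) -> improved agent
    max_trials: int = 5,
):
    """Generic sketch of the build -> test -> score -> analyze -> improve loop."""
    for _ in range(max_trials):
        traces = [invoke_agent(agent, question) for question in test_set]
        results = evaluate(traces, metrics=list(success_criteria))
        failures = [r for r in results if r.score < success_criteria[r.metric]]
        if not failures:
            return agent  # success criteria met: ready for the next stage
        # Analyze the explanations, then change the prompt, tools, or model.
        agent = apply_improvements(agent, failures)
        # Add the failing questions back into the test set for the next trial.
        test_set = test_set + [f.question for f in failures]
    raise RuntimeError("Success criteria not met within the trial budget")
```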
The Path to Production: From Baseline to Full Rollout with On-Demand and Online Evaluations
This is all great, right? So what is the process here, and what is the path of building and deploying trustworthy agents? Let's look at that, and then I'll show you two more demos. The path to production for an AI agent starts with creating a baseline. We just looked at the process where I covered five components. So we start with baselining an agent. Baselining of an agent again goes back to testing. Once you have met the success criteria, you feel comfortable, and you are able to trust the agent, that agent can finally go into production.
Now, once an agent is in production, you may be wondering if you are done or if you need to do something else. Given the speed at which generative AI is evolving, there are new models, new techniques, and new protocols coming up literally every other week. The data is also changing, and your users' usage patterns will be changing at the same time. So every time you want to improve anything in the agent, update a tool, add any new functionality, or maybe break the agent down into specialized agents, you are going back to that five-component loop I was talking about.
However, since you are now in production, this process looks a bit different. Those five components still remain there, but you are now also doing offline evaluation. There is also a component of shadow mode evaluations, which basically means that you already have an agent in production serving your traffic. You have now built a second version of this agent, which is also serving traffic, but it is not sending responses to the user. So you can easily compare these two agents and see if this new agent that you have built is really improving the metrics that you were looking for.
Then comes the A/B testing stage. You may be wondering why we need A/B testing if we already did shadow testing. Well, in shadow mode, you never collected user feedback: your actual users never interacted with this second agent that you built. With A/B testing, you are now going to give a portion of your original traffic to this agent. Your users are going to interact with it, and then you are going to measure whether it is actually solving the problem you are trying to improve. Once you feel comfortable with that, you do a full rollout.
Across both of these stages, one on the left-hand side going from POC to production, and one if you are already in production and beyond, one thing remains common: you need two types of evaluation, on-demand and online. You may be thinking that this is a repetition of an earlier slide. Yes, we really want to reinforce the concept that you need both on-demand and online evaluations. You would use on-demand evaluation when you are testing your agent. By on-demand, I am talking about a real-time API; AgentCore Evaluations provides both of these as real-time APIs.
You are testing in your pre-deployment phase. You are using on-demand in your CI/CD pipelines, which Vivek has covered already, but you can also use on-demand in production because you may have some nuanced use cases. This is a real-time API, so you can bring together any type of nuanced use case you may have and evaluate it using on-demand. Continuous monitoring is really needed as you put agents into production. With AgentCore Evaluations, the service does the heavy lifting of building the pipeline. All of us have worked with data, and we have all seen that data pipelines are complex.
In this case, AgentCore Evaluations does the full heavy lifting of fetching the data from CloudWatch log groups, putting it into the format that these APIs accept, doing the inference, collecting the score, collecting the explanation, and providing it to you. For online evaluation, you do want dashboards, because you want your subject matter experts and product experts to look at how your users are interacting and, at the same time, look at the scores: how helpful the agent was, for example, or whether the agent was faithful, meaning it did not hallucinate.
Live Demo: Evaluating a Travel Agent from Baseline to Production with AgentCore Runtime
Now let's dive deeper into how on-demand evaluation with AgentCore Evaluations works. As we see here, we have a developer who has configured some evaluations. We looked at the first two demos, where the second one more specifically showed us creating a custom evaluator, and we also have the 13 built-in metrics available.
On the on-demand side, you first invoke an agent, which produces traces. In the case of on-demand evaluation, you fetch those traces and put them into the on-demand API in the format that it accepts. This can happen as part of your work in your IDE, it can be part of your CI/CD pipelines, and as I mentioned earlier, you can also put this into production for nuanced use cases because it's a real-time API. The on-demand API then works with the AgentCore Evaluation Service and invokes the models that power the evaluations, providing you with a score and an explanation.
Now let's look at the demo of the on-demand evaluations. I'm going to switch to that side because this screen is quite small. What I have here is a continuation of the example of the travel agent that was discussed earlier. This is an agent where I'm using Haiku 4.5, and this agent has access to 7 tools. This is my baseline implementation, meaning this is the first version of the agent that I created. I deployed this agent on AgentCore runtime. You may ask why I deployed it if it's a baseline agent. Well, I want to evaluate my agent exactly in the environment where I'm going to finally deploy it once it is in production.
This is the process of deploying the agent with AgentCore. With the AgentCore development kit, it's very easy to deploy any agentic code there with literally 5 to 7 lines of code, and this just deployed it. Then I checked the status to verify whether it's deployed and ready for me. This whole process takes about 1 minute or so. Now I'm asking questions to this agent. There's a question set that I prepared, and I'll show you in a second what it looks like.
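As an aside before the questions, the "5 to 7 lines of code" referred to here typically look something like the sketch below, based on the AgentCore starter toolkit samples as I recall them; the module path, class and decorator names, and the CLI commands are assumptions to verify against the current documentation rather than a definitive reference.

```python
# my_agent.py - rough sketch of wrapping an agent for AgentCore Runtime.
# Module path, class, and decorator names follow the starter-toolkit samples
# as recalled and may differ in the current SDK; verify before use.
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()


@app.entrypoint
def invoke(payload):
    user_message = payload.get("prompt", "")
    # ... call your framework of choice (Strands, LangGraph, etc.) here ...
    return {"result": f"echo: {user_message}"}


if __name__ == "__main__":
    app.run()
```

Deployment and the status check are then typically a couple of CLI calls (for example `agentcore configure`, `agentcore launch`, and `agentcore status` in the starter toolkit), which matches the roughly one-minute flow shown in the demo.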
In the very first question, I ask: "I am planning my honeymoon to Maldives for 10 days and my budget is 4500." The agent is supposed to look through things, invoke tools, do all that work, and generate traces for me. Nowhere in that entire notebook was I looking at or doing evaluations while my agent was running. Here is where I'm doing the evaluations. I'm looking at the session IDs. Please look at this closely. The code that I've highlighted is all you need in order to do on-demand evaluations. I just provided the set of metrics that I want to test and gave it the session IDs. We are already collecting those in AgentCore observability, so it's very easy for me to evaluate now.
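The highlighted code itself is not reproduced in this article, but conceptually the call has the shape sketched below: a set of metrics plus the session IDs already captured by AgentCore Observability, with every score returned alongside its explanation. The function name, parameters, and result format are placeholders, not the actual API signature; the stub simply keeps the sketch runnable, and its sample explanation echoes the baseline result described next.

```python
# Conceptual shape only: run_on_demand_evaluation is a stand-in for the real
# on-demand API call, and the result format is assumed for illustration.
metrics = ["goal_success_rate", "helpfulness", "conciseness"]
session_ids = ["session-001", "session-002"]  # already captured by AgentCore Observability


def run_on_demand_evaluation(metrics, session_ids):
    # Stand-in returning illustrative results so the sketch runs end to end.
    return [
        {"metric": "goal_success_rate", "session_id": "session-001", "score": 0.0,
         "explanation": "The agent asked three follow-up questions and never invoked a tool."},
        {"metric": "conciseness", "session_id": "session-002", "score": 0.4,
         "explanation": "The response repeated itinerary details unnecessarily."},
    ]


for r in run_on_demand_evaluation(metrics, session_ids):
    # Every score comes back with the judge's reasoning behind it.
    print(f"{r['metric']} [{r['session_id']}]: {r['score']} - {r['explanation']}")
```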
As you can see, all the pipeline and all the data management is taken care of. Here are the scores for my baseline version, which does an okay job. My goal success rate is in the middle range, and as I also show in the explanation below, for one of the cases where my goal success rate is 0, my agent asked 3 follow-up questions and did not invoke anything. The goal success rate evaluator gave it a 0 score and called that out. So all of that is good.
Then I'm skipping a part of the process that I'll cover later, which is how I got to this notebook. This is my notebook where I'm doing my first experiment, where I improved the system prompt. It's the same agent using the same model and same config, but my system prompt has now changed based on the analysis I did of the explanations. As you can see, I'm giving it specific instructions around being more concise, using tools, and all those things. I'm deploying it again because, as I mentioned, I want to evaluate my agent in the environment I'm finally going to deploy it in.
This is my trial one. I run through the same list of questions again. As you can see, I was doing this yesterday and recorded it live, so it is generating those responses there.
And then it is supposed to give me the scores and explanations. Now, let's talk about how I went from my baseline to the experiment. What did I change? How did I know what to change? That's the key to evaluation, iteration, and improving these agents. If you remember the explanation I was showing you earlier, it said that my agent did not invoke a tool; it just continuously collected information from the user. Essentially, I analyzed all the different explanations that I got as a response from my first baseline notebook, came up with the top failure points, and decided to address those in my system prompt. I could have also taken the route of updating the model, but in this case I did not think I needed that yet, because I had not fully exhausted my prompt engineering options.
All right, so this is the score lift that I see after I analyzed the explanations and improved my agent system prompt. As you can see, there are significant lifts, especially in goal success rate and conciseness. Well, my conciseness score is still low, to be honest, but since this is a demo, I feel comfortable with deploying this agent to production. We can see good improvement between the baseline and just the first experiment.
So how do you go about using the explanations and improving the agent? There are two ways. One, you can have your subject matter expert look at these explanations, dig deeper, do some data analysis, and see if they can come up with these failure points and suggest prompts. Or you could also use a hybrid approach, and that's exactly what I did. I used an LLM to analyze all those explanations because, to be honest, as you scale—let's say you have thousands of test cases—as a human, I don't think I could read those. So I just used an LLM to come up with some sort of extraction of the problem, then I looked through the insights and came up with a prompt.
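The hybrid approach described here, an LLM condensing hundreds of evaluator explanations into a handful of failure themes, can be done with a short script. The sketch below uses the Bedrock Converse API via boto3; the model ID and the two sample explanations are illustrative, and in practice the explanations would come from your evaluation results rather than being hard-coded.

```python
"""Summarize failure explanations at scale with the Bedrock Converse API (boto3).

The model ID and the sample explanations are illustrative; swap in your own
analysis model and the explanations collected from your evaluation results.
"""
import boto3

bedrock = boto3.client("bedrock-runtime")

explanations = [
    "Score 0: the agent asked three follow-up questions and never invoked a tool.",
    "Score 0.5: web_search was used where calculate_budget had all required parameters.",
    # ... hundreds more, pulled from the evaluation results
]

prompt = (
    "Below are evaluator explanations for low-scoring agent responses.\n"
    "Group them into the top recurring failure modes and suggest a concrete\n"
    "system-prompt change for each failure mode.\n\n" + "\n".join(explanations)
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # example analysis model
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)

print(response["output"]["message"]["content"][0]["text"])
```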
All right, so as I was saying, since this specifically is a demo, I felt comfortable deploying it. You may want to do more trials, and you would repeat that until the agent meets your success criteria. Now, let's talk about online evaluation. If you remember from my demo one, it took me about twenty seconds to set up the online config, and that's exactly what online evaluation uses. Every time you invoke an agent, and let's say in this case the agent is deployed on AgentCore Runtime, although as I mentioned earlier it could also be an agent that's not on AgentCore Runtime, it seamlessly transmits its traces to our AgentCore Observability offering. From there, AgentCore Evaluations can seamlessly read the traces as the session completes and then evaluate that session, trace, or span with the metrics that you have chosen, to finally produce the scores and explanations of why each score was generated.
Now, finally, I think this is the last demo, and this is where I'm going to show you our dashboard. Since I put my agent in production, I set up the online config in demo number one. So when you invoke the agent, this is what happens: on your AgentCore Observability dashboard, you can also see evaluation scores. You can see that my agent does a decent job. I have, I think, over three hundred invocations, so these scores are not from just toy samples; it's a decent sample. You can also filter things down based on the category or the label. Remember from the second demo, we created some labels and rubrics, so you can filter through those. You can also look into the traces. A trace is a specific question that was asked of the agent, and you can look at the explanations right there on the dashboard. In this case, I clicked on correctness, and it tells me what's wrong: the score is 0.5 because there is a mix-up in how the agent used the context from the tool. And you get explanations across all the metrics that you have chosen, as shown in demo number one. Okay, so best practices: how do you go about evaluating agents?
Best Practices for Agent Evaluation and Getting Started with AgentCore Evaluations
Firstly, define multi-dimensional success criteria. Your success criteria should include the experience as part of it, because that's what you're creating with these agents. Experience here could be the customer-facing app experience, or if it's a back-end workflow, it could be your downstream impact. Get your subject matter experts involved early so that you can design those metrics and build human-in-the-loop evaluations with them. This ensures you're really targeting the agent to do the task and solve for the use case.
Do rigorous testing. There's a baseline, then a trial, and then there could be a number of trials until you meet the success criteria. I suggest defining a success criterion perhaps five to seven percent below your target threshold as a good starting point when launching the agent, and then working towards your target. It's very easy to reach eighty percent and then encounter diminishing returns.
Your data evolves with your agent and your user access patterns, so account for that. Your evaluation framework should also improve over time. Built-in evaluators are great to get started, but as you have more nuanced use cases, you want to build custom evaluators. You may also realize as you graduate from built-in evaluators to custom evaluators that sometimes your subject matter expert would say, "Yeah, this score of 0.8, I don't agree with it. Maybe it's 0.7." This is called a calibration gap, so you also want to do some calibration there.
Then monitor your agents continuously. You can only improve what you can measure. Do rigorous statistical analysis as you improve your agent from version one to version two to version X. With that, this is the end of the session. We have documentation, GitHub samples, workshops, and the AgentCore SDK for you. Feel free to take a look at those and get started. We are very excited for you to get started with AgentCore Evaluations. We really hope you're enjoying re:Invent and have a great rest of your day.
; This article is entirely auto-generated using Amazon Bedrock.