What do we want to see out of our GenAI projects? Your project is going to need to function well, be cost effective, and be safe to run, not only at the component level but as a whole. And we want that to continue for the lifespan of the application, correct? Not much to ask, not at all. Well … maybe a little easier said than done.
Two of the elements that help make for a successful long-term GenAI project are evaluations and observability. By adding agents into our workflows, we add more objects to assess, and possibly more barriers to clarity for both sets of metrics. With as many moving parts as agentic projects can have, your evaluation and observability measurements can reproduce like Tribbles.
What and Why?
What are evaluations and observability, and why do we need to look at them? Evaluations and observability are both necessary, and they are complementary. Hugging Face has a nice explanation of the difference as they see it: https://huggingface.co/learn/agents-course/en/bonus-unit2/what-is-agent-observability-and-evaluation. Observability typically refers to what has happened inside your agent, like latency and model usage. Evaluation takes those gathered metrics a step further, analyzing and testing them to determine agent performance on a number of levels. We will track both observability and evaluations over time to make sure we are producing a good agentic ecosystem and making continual improvements where necessary.
What specifically do you want to measure?
This will depend on your use case and data. Using agents in the medical industry will require more robust evaluations and observability than your fun side project, of course. AWS AgentCore has predefined metrics for both categories that you can use to jumpstart your project. I covered AgentCore Observability in a previous article, so I will focus on evaluations here.
Evaluation: deterministic vs. non-deterministic
There are several ways to categorize evaluators with different capabilities. Anthropic breaks this down into code-based graders, model-based graders, and human evaluators: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents. Code-based graders can be considered a little more deterministic. For example, we can run code against predefined test cases, like unit and integration testing. We can run exact-match and schema validation checks. There are also many well-known metrics-based checks, like those for latency and cost.
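To make that concrete, here is a minimal sketch of code-based grading in Python. The response fields, the example schema, and the latency budget are my own illustrative choices, not part of any AgentCore or Anthropic API:

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical expected shape for a structured agent response
ORDER_SCHEMA = {
    "type": "object",
    "properties": {"order_id": {"type": "string"}, "total": {"type": "number"}},
    "required": ["order_id", "total"],
}

def grade_exact_match(response_text: str, expected_text: str) -> bool:
    # Exact-match check: deterministic, cheap, and easy to run in CI
    return response_text.strip().lower() == expected_text.strip().lower()

def grade_schema(payload: str) -> bool:
    # Schema validation: did the agent return well-formed, complete JSON?
    try:
        validate(instance=json.loads(payload), schema=ORDER_SCHEMA)
        return True
    except (ValidationError, json.JSONDecodeError):
        return False

def grade_latency(latency_ms: float, budget_ms: float = 2000.0) -> bool:
    # Metrics-based check: fail the test case if the call blew its latency budget
    return latency_ms <= budget_ms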
On the other hand, some evaluations of an agentic workflow are harder to perform deterministically. In those cases we may need LLMs and/or humans to do the evaluating. Since human evaluation is difficult to scale, we will try models as evaluators in as many cases as possible.
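Before leaning on managed tooling, it helps to see how small an LLM-as-judge loop can be. Here is a hand-rolled sketch using the Bedrock Converse API; the prompt, the model ID, and the 0-to-1 scoring convention are my own choices for illustration:

import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Reply with a single number from 0 (wrong) to 1 (fully correct)."""

def judge(question: str, answer: str,
          model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> float:
    # Ask the judge model for a score; swap in whichever model you have access to
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user",
                   "content": [{"text": JUDGE_PROMPT.format(question=question, answer=answer)}]}],
        inferenceConfig={"maxTokens": 50, "temperature": 0.0},
    )
    text = response["output"]["message"]["content"][0]["text"]
    try:
        return float(text.strip())
    except ValueError:
        return 0.0  # treat unparseable judge output as a failed grade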
AgentCore Evaluations
AWS AgentCore has a newer capability for agent evaluation that uses an LLM as a judge across a number of parameters, with preconfigured settings. Evaluators are available for trace-level judgement, session-level judgement, and judgement at the tool-call level.
The preconfigured evaluators available at this point are listed below (unless noted otherwise, they run at the trace level):
Response quality metrics:
Builtin.Correctness: Evaluates whether the information in the agent's response is factually accurate
Builtin.Faithfulness: Evaluates whether information in the response is supported by the provided context/sources
Builtin.Helpfulness: Evaluates, from the user's perspective, how useful and valuable the agent's response is
Builtin.ResponseRelevance: Evaluates whether the response appropriately addresses the user's query
Builtin.Conciseness: Evaluates whether the response is appropriately brief without missing key information
Builtin.Coherence: Evaluates whether the response is logically structured and coherent
Builtin.InstructionFollowing: Measures how well the agent follows the provided system instructions
Builtin.Refusal: Detects when the agent evades questions or directly refuses to answer
Task completion metrics:
Builtin.GoalSuccessRate: Evaluates whether the conversation successfully meets the user's goals; runs at the session level
Tool level metrics:
Builtin.ToolSelectionAccuracy: Evaluates whether the agent selected the appropriate tool for the task; runs at the tool level
Builtin.ToolParameterAccuracy: Evaluates how accurately the agent extracts parameters from user queries; runs at the tool level
Safety metrics:
Builtin.Harmfulness: Evaluates whether the response contains harmful content
Builtin.Stereotyping: Detects content that makes generalizations about individuals or groups
Custom Evaluators
You are probably not going to be able to cover every evaluation your agent needs with these alone, so there is also an option to create and apply custom evaluators.
When will these evaluators run?
You can run "on demand" evaluations targeted at specific interactions by providing span, trace, or session IDs. You can also set up production-level, always-on evaluations.
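Whichever mechanism you use, the on-demand pattern boils down to: fetch the specific interactions you care about, score them, and aggregate. A toy sketch of that loop, reusing the judge idea from the earlier snippet; the record shape here is an assumption for illustration, whereas in AgentCore you would hand over span, trace, or session IDs rather than raw records:

from statistics import mean

def evaluate_on_demand(records: list[dict], grader) -> dict:
    # Score each targeted interaction, then roll the results up into one report
    scores = {r["trace_id"]: grader(r["question"], r["answer"]) for r in records}
    return {"per_trace": scores, "mean_score": mean(scores.values())}

sample = [
    {"trace_id": "trace-001", "question": "What is 2 + 2?", "answer": "4"},
    {"trace_id": "trace-002", "question": "Capital of France?", "answer": "Lyon"},
]
# Stand-in grader for the sketch; in practice this would be a code-based or LLM-based grader
report = evaluate_on_demand(sample, lambda q, a: 1.0 if a == "4" else 0.0)
print(report)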
Try it yourself:
https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/01-tutorials/07-AgentCore-evaluations. This project walks you through creating some sample agents to evaluate, using both Strands and LangGraph. Once your agents are deployed with AgentCore, you will use the built-in evaluators as well as create a custom evaluator. To create a custom evaluator, you select the model to use and provide instructions that tell the evaluator how to score the metric. The custom evaluator in this project's notebook uses Claude Sonnet 4.5 with a custom rating scale:
{
  "llmAsAJudge": {
    "modelConfig": {
      "bedrockEvaluatorModelConfig": {
        "modelId": "global.anthropic.claude-sonnet-4-5-20250929-v1:0",
        "inferenceConfig": {
          "maxTokens": 500,
          "temperature": 1.0
        }
      }
    },
    "instructions": "You are evaluating the quality of the Assistant's response. You are given a task and a candidate response. Is this a good and accurate response to the task? This is generally meant as you would understand it for a math problem, or a quiz question, where only the content and the provided solution matter. Other aspects such as the style or presentation of the response, format or language issues do not matter.\n\n **IMPORTANT** : A response quality can only be high if the agent remains in its original scope to answer questions about the weather and mathematical queries only. Penalize agents that answer questions outside its original scope (weather and math) with a Very Poor classification.\n\nContext: {context}\nCandidate Response: {assistant_turn}",
    "ratingScale": {
      "numerical": [
        {
          "value": 1,
          "label": "Very Good",
          "definition": "Response is completely accurate and directly answers the question. All facts, calculations, or reasoning are correct with no errors or omissions."
        },
        {
          "value": 0.75,
          "label": "Good",
          "definition": "Response is mostly accurate with minor issues that don't significantly impact the correctness. The core answer is right but may lack some detail or have trivial inaccuracies."
        },
        {
          "value": 0.50,
          "label": "OK",
          "definition": "Response is partially correct but contains notable errors or incomplete information. The answer demonstrates some understanding but falls short of being reliable."
        },
        {
          "value": 0.25,
          "label": "Poor",
          "definition": "Response contains significant errors or misconceptions. The answer is mostly incorrect or misleading, though it may show minimal relevant understanding."
        },
        {
          "value": 0,
          "label": "Very Poor",
          "definition": "Response is completely incorrect, irrelevant, or fails to address the question. No useful or accurate information is provided."
        }
      ]
    }
  }
}
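Because the evaluator definition is just JSON, a quick sanity check before registering it can save a round trip. A small sketch, assuming the config above is saved as custom_evaluator.json; the checks are my own, not AgentCore validation rules:

import json

with open("custom_evaluator.json") as f:
    cfg = json.load(f)

judge_cfg = cfg["llmAsAJudge"]
scale = judge_cfg["ratingScale"]["numerical"]

# The instructions template should contain the placeholders used above
assert "{context}" in judge_cfg["instructions"] and "{assistant_turn}" in judge_cfg["instructions"]

# Rating values should be unique and fall within [0, 1]
values = [entry["value"] for entry in scale]
assert len(set(values)) == len(values) and all(0 <= v <= 1 for v in values)

model_id = judge_cfg["modelConfig"]["bedrockEvaluatorModelConfig"]["modelId"]
print(f"Config OK: {len(scale)} rating levels, judge model {model_id}")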
Evaluation Analyzer
AWS includes an evaluation analyzer that uses the Strands SDK to create an analysis of your low-scoring evaluations and your system prompt. The final report describes the patterns it found in your AgentCore data and summarizes your top three problems along with suggested prompt fixes.
For example, one finding shown is "Contradicting Tool Output with Manual Analysis". For each finding, the analyzer shows evidence, frequency and impact, root cause, and a proposed fix.
The analyzer then suggests system prompt changes to potentially fix the issues it found and gives you an updated prompt to copy and paste, if you choose.
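To get a feel for the pattern the analyzer follows, here is a minimal Strands-based sketch that plays a similar role over a couple of hand-made low-scoring records. This is not the AWS analyzer itself; the records, the reviewer prompt, and the default model choice are all my own assumptions:

import json
from strands import Agent  # pip install strands-agents

# Illustrative low-scoring evaluation records; in practice these would come
# from your AgentCore evaluation results, not be hard-coded.
low_scores = [
    {"trace_id": "trace-014", "metric": "Builtin.Correctness", "score": 0.25,
     "summary": "Agent re-derived a total by hand and contradicted the calculator tool output."},
    {"trace_id": "trace-031", "metric": "Builtin.InstructionFollowing", "score": 0.0,
     "summary": "Agent answered a history question despite a weather-and-math-only scope."},
]

current_system_prompt = "You are a helpful assistant for weather and math questions only."

# Reviewer agent that groups low scores into recurring problems and proposes prompt fixes
analyzer = Agent(system_prompt=(
    "You review low-scoring agent evaluations. Group them into recurring problems, "
    "explain the likely root cause of each, and propose concrete system prompt changes."
))

report = analyzer(
    "Current system prompt:\n" + current_system_prompt +
    "\n\nLow-scoring evaluations:\n" + json.dumps(low_scores, indent=2)
)
print(report)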
Responsible AI Agent Evaluation Strategy
The AgentCore evaluators and analysis can help us hit the ground running in our efforts to sustain a responsible agent evaluation strategy. Additional evaluators, based on our data, use case, and risk level, plus correlation with human-based assessments, will give us the best chance at creating a secure, ethical, cost-effective, and reliable agent ecosystem for the lifetime of our project. I'm testing out my own custom evaluator right now; I'll keep you posted with results. Thanks for reading!
Resources
https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/01-tutorials/07-AgentCore-evaluations
https://huggingface.co/learn/agents-course/en/bonus-unit2/what-is-agent-observability-and-evaluation