<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: shashank agarwal</title>
    <description>The latest articles on DEV Community by shashank agarwal (@imshashank).</description>
    <link>https://dev.to/imshashank</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1348998%2F681a6e08-bfe7-46f1-90ad-bb1a45963e91.jpeg</url>
      <title>DEV Community: shashank agarwal</title>
      <link>https://dev.to/imshashank</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/imshashank"/>
    <language>en</language>
    <item>
      <title>The AI Agent Feedback Loop: From Evaluation to Continuous Improvement</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Thu, 01 Jan 2026 00:27:00 +0000</pubDate>
      <link>https://dev.to/imshashank/the-ai-agent-feedback-loop-from-evaluation-to-continuous-improvement-5hm4</link>
      <guid>https://dev.to/imshashank/the-ai-agent-feedback-loop-from-evaluation-to-continuous-improvement-5hm4</guid>
      <description>&lt;h2&gt;
  
  
  Evaluation is Just the First Step
&lt;/h2&gt;

&lt;p&gt;So you've built an evaluation framework for your AI agent. You're tracking metrics, scoring conversations, and identifying failures. That's great. But evaluation, on its own, is useless.&lt;/p&gt;

&lt;p&gt;Data without action is just a dashboard. The real value of evaluation is in creating a tight, continuous &lt;strong&gt;feedback loop&lt;/strong&gt; that drives improvement. It's about turning insights into action.&lt;/p&gt;

&lt;p&gt;Most teams get stuck at the evaluation step. They have a spreadsheet full of failing test cases, but no clear process for fixing them. The result is a backlog of issues and a development process that feels like playing whack-a-mole.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7 Steps of a Powerful Feedback Loop
&lt;/h2&gt;

&lt;p&gt;A truly effective feedback loop is a systematic, automated process that takes you from raw data to a better agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Evaluate at Scale
&lt;/h3&gt;

&lt;p&gt;First, run your evaluation framework on every single agent interaction in production. This gives you the comprehensive dataset you need to find meaningful patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Identify Failure Patterns
&lt;/h3&gt;

&lt;p&gt;Don't just look at individual failures. Look for patterns. Is a specific type of scorer (e.g., &lt;code&gt;is_concise&lt;/code&gt;) failing frequently? Is a particular agent or prompt causing most of the issues?&lt;/p&gt;
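&lt;p&gt;As a sketch, pattern-finding can start as a simple aggregation over evaluation results. The &lt;code&gt;scorer&lt;/code&gt;, &lt;code&gt;agent&lt;/code&gt;, and &lt;code&gt;passed&lt;/code&gt; fields below are illustrative assumptions, not a standard schema:&lt;/p&gt;

```python
from collections import Counter

def failure_patterns(results):
    """Count failures grouped by scorer name and by agent."""
    by_scorer = Counter()
    by_agent = Counter()
    for r in results:
        if not r["passed"]:
            by_scorer[r["scorer"]] += 1
            by_agent[r["agent"]] += 1
    return by_scorer.most_common(5), by_agent.most_common(5)

results = [
    {"scorer": "is_concise", "agent": "support-bot", "passed": False},
    {"scorer": "is_concise", "agent": "support-bot", "passed": False},
    {"scorer": "is_accurate", "agent": "sales-bot", "passed": True},
]
top_scorers, top_agents = failure_patterns(results)
print(top_scorers)  # [('is_concise', 2)]
```

&lt;p&gt;Even this crude grouping will surface whether failures cluster around one scorer or one agent, which tells you where to dig first.&lt;/p&gt;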

&lt;h3&gt;
  
  
  Step 3: Diagnose the Root Cause
&lt;/h3&gt;

&lt;p&gt;This is the most critical step. Once you've identified a pattern, you need to understand the &lt;em&gt;why&lt;/em&gt;. Is the agent failing because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system prompt is ambiguous?&lt;/li&gt;
&lt;li&gt;The underlying LLM has a knowledge gap?&lt;/li&gt;
&lt;li&gt;A specific tool is returning bad data?&lt;/li&gt;
&lt;li&gt;The reasoning logic is flawed?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This requires a powerful analysis engine (like our NovaPilot) that can sift through thousands of traces to find the common thread.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Generate Actionable Recommendations
&lt;/h3&gt;

&lt;p&gt;The diagnosis should lead to a specific, testable hypothesis for a fix. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis:&lt;/strong&gt; "The agent is being too verbose because the system prompt doesn't explicitly ask for conciseness."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation:&lt;/strong&gt; "Add the following instruction to the system prompt: 'Your answers should be clear and concise, under 200 words.'"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5: Implement the Change
&lt;/h3&gt;

&lt;p&gt;Implement the recommended fix. This could be a prompt change, a model swap, or a tweak to a tool's logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Re-evaluate and Compare
&lt;/h3&gt;

&lt;p&gt;Run the evaluation framework again on the same set of interactions with the new change. Compare the results. Did the scores for the &lt;code&gt;is_concise&lt;/code&gt; scorer improve? Did any other scores get worse (a regression)?&lt;/p&gt;
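&lt;p&gt;A minimal before/after comparison might look like the following. The per-scorer average-score dicts are an assumed shape, not any specific tool's API:&lt;/p&gt;

```python
def compare_runs(baseline, candidate, tolerance=0.02):
    """Return (improved, regressed) scorer names between two runs.

    Each run maps a scorer name to its average score in [0, 1];
    tolerance filters out noise-level movements.
    """
    improved, regressed = [], []
    for name, base in baseline.items():
        new = candidate.get(name, base)
        if new > base + tolerance:
            improved.append(name)
        elif base > new + tolerance:
            regressed.append(name)
    return improved, regressed

baseline = {"is_concise": 0.62, "is_accurate": 0.91}
candidate = {"is_concise": 0.88, "is_accurate": 0.84}
print(compare_runs(baseline, candidate))  # (['is_concise'], ['is_accurate'])
```

&lt;p&gt;Note the second return value: a fix that improves one scorer while regressing another is exactly the kind of trade-off this step is meant to catch.&lt;/p&gt;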

&lt;h3&gt;
  
  
  Step 7: Iterate
&lt;/h3&gt;

&lt;p&gt;Based on the results of the re-evaluation, you either deploy the change to production or you go back to Step 3 to refine your diagnosis. This is a continuous cycle.&lt;/p&gt;
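&lt;p&gt;Put together, the seven steps form a loop you could sketch like this. The &lt;code&gt;evaluate&lt;/code&gt;, &lt;code&gt;diagnose&lt;/code&gt;, and &lt;code&gt;apply_fix&lt;/code&gt; callables are placeholders for your own pipeline stages:&lt;/p&gt;

```python
def improvement_loop(interactions, evaluate, diagnose, apply_fix,
                     threshold=0.9, max_cycles=10):
    """Steps 1-7 as a loop: evaluate, diagnose, fix, re-evaluate, repeat."""
    for cycle in range(max_cycles):
        scores = evaluate(interactions)       # steps 1-2: evaluate at scale
        worst = min(scores, key=scores.get)   # weakest scorer this cycle
        if scores[worst] >= threshold:
            return cycle, scores              # good enough: deploy
        hypothesis = diagnose(worst)          # steps 3-4: root cause + fix idea
        apply_fix(hypothesis)                 # step 5: implement the change
        # the next pass re-evaluates and compares (steps 6-7)
    return max_cycles, scores
```

&lt;p&gt;The &lt;code&gt;max_cycles&lt;/code&gt; cap matters in practice: a loop that never converges is itself a signal that the diagnosis (Step 3) is wrong.&lt;/p&gt;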

&lt;h2&gt;
  
  
  The Goal: Faster Iteration
&lt;/h2&gt;

&lt;p&gt;The teams that build the best AI agents are the ones that can iterate through this feedback loop the fastest. If it takes you two weeks to manually diagnose a problem and test a fix, you'll be quickly outpaced by a team that can do it in two hours.&lt;/p&gt;

&lt;p&gt;This is why automation is key. Every step of this process, from trace extraction to root cause analysis to re-evaluation, should be as automated as possible.&lt;/p&gt;

&lt;p&gt;Your goal isn't just to evaluate your agents. It's to build a system that allows them to continuously and automatically improve.&lt;/p&gt;

&lt;p&gt;Noveum.ai's platform automates this entire feedback loop, from evaluation to root cause analysis to actionable recommendations for improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does your feedback loop for agent improvement look like today? Share your process!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monitoring vs. Evaluation: The Critical Distinction Most AI Devs Miss</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 31 Dec 2025 00:24:00 +0000</pubDate>
      <link>https://dev.to/imshashank/monitoring-vs-evaluation-the-critical-distinction-most-ai-devs-miss-1a3d</link>
      <guid>https://dev.to/imshashank/monitoring-vs-evaluation-the-critical-distinction-most-ai-devs-miss-1a3d</guid>
      <description>&lt;h2&gt;
  
  
  Are You Tracking the Right Things?
&lt;/h2&gt;

&lt;p&gt;In the world of DevOps and SRE, we're obsessed with monitoring. We track latency, error rates, CPU utilization, and requests per second. These metrics are essential for understanding the health of our systems.&lt;/p&gt;

&lt;p&gt;Naturally, when we started building AI agents, we applied the same mindset. We created dashboards to monitor our LLM API costs, token counts, and API error rates.&lt;/p&gt;

&lt;p&gt;But this is a critical mistake. For AI agents, &lt;strong&gt;monitoring is not the same as evaluation&lt;/strong&gt;, and confusing the two can lead to a false sense of security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Tells You If It's Running. Evaluation Tells You If It's Working.
&lt;/h2&gt;

&lt;p&gt;Let's break down the difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; is about tracking the operational health of your application. It answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many requests did we process?&lt;/li&gt;
&lt;li&gt;What was the average latency?&lt;/li&gt;
&lt;li&gt;How many times did the OpenAI API return a 500 error?&lt;/li&gt;
&lt;li&gt;How much did we spend on tokens today?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluation&lt;/strong&gt; is about assessing the quality and correctness of your agent's behavior. It answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the agent actually solve the user's problem?&lt;/li&gt;
&lt;li&gt;Did it follow the instructions in its system prompt?&lt;/li&gt;
&lt;li&gt;Did it provide factually accurate information?&lt;/li&gt;
&lt;li&gt;Did it violate any compliance or safety rules?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Dangerous Blind Spot
&lt;/h3&gt;

&lt;p&gt;You can have a perfectly monitored system that is a complete failure from an evaluation perspective. Your dashboard could be all green:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ 99.99% uptime&lt;/li&gt;
&lt;li&gt;✅ 500ms average latency&lt;/li&gt;
&lt;li&gt;✅ 0 API errors&lt;/li&gt;
&lt;li&gt;✅ Costs are within budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ 15% of the agent's responses are factually incorrect.&lt;/li&gt;
&lt;li&gt;❌ 10% of interactions violate your company's brand voice guidelines.&lt;/li&gt;
&lt;li&gt;❌ 5% of conversations expose sensitive user data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your monitoring dashboard tells you that your system is &lt;em&gt;running&lt;/em&gt;. It tells you nothing about whether your system is &lt;em&gt;working correctly&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prioritize Evaluation, Then Monitor
&lt;/h2&gt;

&lt;p&gt;For AI agents, evaluation is the more important discipline. You must first have confidence that your agent is behaving as intended. Only then should you focus on optimizing its performance and cost.&lt;/p&gt;

&lt;p&gt;The ideal approach, of course, is to integrate both. A comprehensive &lt;strong&gt;AI observability&lt;/strong&gt; platform should give you a single pane of glass that shows you both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Metrics:&lt;/strong&gt; Latency, cost, throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Metrics:&lt;/strong&gt; Accuracy, compliance, helpfulness, safety.&lt;/li&gt;
&lt;/ul&gt;
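&lt;p&gt;One way to picture that integration: record both kinds of metrics on every interaction, in one structure. A rough sketch, with all field names as illustrative assumptions:&lt;/p&gt;

```python
import time

def record_interaction(agent_reply, eval_scores, started_at, cost_usd):
    """One record holding both operational and quality signals."""
    return {
        # operational metrics (monitoring)
        "latency_ms": round((time.time() - started_at) * 1000, 1),
        "cost_usd": cost_usd,
        "response_chars": len(agent_reply),
        # quality metrics (evaluation)
        "scores": eval_scores,
        "passed_all": all(s >= 0.7 for s in eval_scores.values()),
    }

record = record_interaction(
    "Your order has shipped.",
    {"accuracy": 0.95, "conciseness": 0.9},
    started_at=time.time() - 0.4,
    cost_usd=0.0031,
)
print(record["passed_all"])  # True
```

&lt;p&gt;With both signal types on the same record, you can ask the questions that matter, like "did quality drop when we switched to the cheaper model?"&lt;/p&gt;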

&lt;p&gt;But if you have to choose where to start, start with evaluation. It's better to have a slow, expensive agent that works correctly than a fast, cheap agent that is silently causing harm to your users and your business.&lt;/p&gt;

&lt;p&gt;Stop conflating monitoring with evaluation. They are two different disciplines, and for AI agents, evaluation is the one that truly matters.&lt;/p&gt;

&lt;p&gt;To implement both monitoring and evaluation, &lt;a href="https://noveum.ai/en/solutions/llm-observability" rel="noopener noreferrer"&gt;Noveum.ai's LLM Observability Platform&lt;/a&gt; provides a unified dashboard for operational metrics and quality evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does your team distinguish between monitoring and evaluation for your AI apps?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Build an AI Agent Evaluation Framework That Scales</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Mon, 29 Dec 2025 00:18:00 +0000</pubDate>
      <link>https://dev.to/imshashank/how-to-build-an-ai-agent-evaluation-framework-that-scales-3570</link>
      <guid>https://dev.to/imshashank/how-to-build-an-ai-agent-evaluation-framework-that-scales-3570</guid>
      <description>&lt;h2&gt;
  
  
  The Scaling Problem
&lt;/h2&gt;

&lt;p&gt;So, you've built a great AI agent. You've tested it with a few dozen examples, and it works perfectly. Now, you're ready to deploy it to production, where it will handle thousands or even millions of conversations.&lt;/p&gt;

&lt;p&gt;Suddenly, your evaluation strategy breaks. You can't manually review every conversation. Your small test set doesn't cover the infinite variety of real-world user behavior. How do you ensure quality at scale?&lt;/p&gt;

&lt;p&gt;The answer is to build an &lt;strong&gt;automated, scalable evaluation framework&lt;/strong&gt;. Manual spot-checking is not a strategy; it's a liability.&lt;/p&gt;

&lt;p&gt;Here's a blueprint for building an evaluation system that can handle production-level traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7 Components of a Scalable Evaluation Framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Automated Trace Extraction
&lt;/h3&gt;

&lt;p&gt;Your framework must automatically capture the complete, detailed trace of every single agent interaction. This is your raw data. It should be a non-negotiable part of your agent's architecture to log every reasoning step, tool call, and output.&lt;/p&gt;
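&lt;p&gt;A bare-bones version of this capture can be a decorator wrapped around each tool. This is a sketch, not a production tracer:&lt;/p&gt;

```python
import functools
import time

TRACE = []  # in production, ship these records to your trace store

def traced(tool):
    """Record every call to a tool: name, inputs, output, duration."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = tool(*args, **kwargs)
        TRACE.append({
            "tool": tool.__name__,
            "args": args,
            "kwargs": kwargs,
            "output": result,
            "duration_ms": round((time.time() - start) * 1000, 2),
        })
        return result
    return wrapper

@traced
def get_order_status(order_id):
    return {"order_id": order_id, "status": "shipped"}  # stubbed tool

get_order_status("12345")
print(TRACE[0]["tool"])  # get_order_status
```

&lt;p&gt;The point is that tracing is opt-out, not opt-in: every tool gets wrapped, so nothing silently escapes the log.&lt;/p&gt;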

&lt;h3&gt;
  
  
  2. Intelligent Trace Parsing (The ETL Agent)
&lt;/h3&gt;

&lt;p&gt;Raw traces are often messy, unstructured JSON or text logs. You need a process to parse this raw data into a clean, structured format. At Noveum.ai, we use a dedicated AI agent for this—an ETL (Extract, Transform, Load) agent that reads the raw trace and intelligently extracts key information like tool calls, parameters, reasoning steps, and final outputs into a standardized schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A Comprehensive Scorer Library
&lt;/h3&gt;

&lt;p&gt;This is the core of your evaluation engine. You need a library of 70+ automated "scorers," each designed to evaluate a specific dimension of quality. These should cover everything from factual accuracy and instruction following to PII detection and token efficiency.&lt;/p&gt;
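&lt;p&gt;The shape of such a library can be sketched as a small registry. The two scorers below, &lt;code&gt;is_concise&lt;/code&gt; and a naive email check, are toy examples rather than production-grade detectors:&lt;/p&gt;

```python
import re

SCORERS = {}

def scorer(name):
    """Register an evaluation function under a name."""
    def register(fn):
        SCORERS[name] = fn
        return fn
    return register

@scorer("is_concise")
def is_concise(response, max_words=200):
    return 0.0 if len(response.split()) > max_words else 1.0

@scorer("no_pii_email")
def no_pii_email(response):
    # naive check: flags anything shaped like an email address
    return 0.0 if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", response) else 1.0

def run_scorers(response):
    return {name: fn(response) for name, fn in SCORERS.items()}

print(run_scorers("Contact me at jane@example.com"))
# {'is_concise': 1.0, 'no_pii_email': 0.0}
```

&lt;p&gt;A registry like this makes the library extensible: new quality dimensions become new registered functions, and the runner never changes.&lt;/p&gt;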

&lt;h3&gt;
  
  
  4. Automated Scorer Recommendation
&lt;/h3&gt;

&lt;p&gt;With 70+ scorers, which ones should you run on a given dataset? A truly scalable system uses another AI agent to analyze your dataset and recommend the top 10-15 most relevant scorers for your specific use case. This saves compute time and focuses your evaluation on what matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Aggregated Quality Assessment
&lt;/h3&gt;

&lt;p&gt;After running the scorers, you'll have thousands of individual data points. Your framework needs to aggregate these scores into a meaningful, high-level assessment of agent quality. This includes identifying trends, common failure modes, and overall performance against your business KPIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Automated Root Cause Analysis (NovaPilot)
&lt;/h3&gt;

&lt;p&gt;This is the most critical component. It's not enough to know &lt;em&gt;that&lt;/em&gt; your agent is failing. You need to know &lt;em&gt;why&lt;/em&gt;. A powerful analysis engine (like our NovaPilot) should be able to analyze all the failing traces and scores to diagnose the root cause of the problem. Is it a bad prompt? A faulty tool? A limitation of the model?&lt;/p&gt;

&lt;h3&gt;
  
  
  7. A Continuous Improvement Loop
&lt;/h3&gt;

&lt;p&gt;Finally, the framework must close the loop. The insights from the root cause analysis should feed directly back into the development process. The system should suggest specific, actionable fixes—like a revised system prompt or a change in model parameters—that will resolve the identified issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Manual to Automated
&lt;/h2&gt;

&lt;p&gt;Building this kind of framework is a significant engineering effort. But it's the only way to move from manual, unreliable spot-checking to a truly scalable, automated quality assurance process. It's the difference between building a prototype and building a production-ready AI system.&lt;/p&gt;

&lt;p&gt;If you're ready to implement this at scale, &lt;a href="https://noveum.ai/en/solutions/agent-evaluation" rel="noopener noreferrer"&gt;Noveum.ai's comprehensive evaluation platform&lt;/a&gt; automates all seven components of a scalable evaluation framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the biggest bottleneck you're facing in scaling your agent evaluation? Let's discuss.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Is Your AI Agent a Compliance Risk? How to Find Violations Hidden in Traces</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Fri, 26 Dec 2025 00:16:00 +0000</pubDate>
      <link>https://dev.to/imshashank/is-your-ai-agent-a-compliance-risk-how-to-find-violations-hidden-in-traces-3g0e</link>
      <guid>https://dev.to/imshashank/is-your-ai-agent-a-compliance-risk-how-to-find-violations-hidden-in-traces-3g0e</guid>
      <description>&lt;h2&gt;
  
  
  The Silent Risk
&lt;/h2&gt;

&lt;p&gt;Here's a thought that should keep every AI developer up at night: your agent might be silently violating compliance regulations like GDPR, HIPAA, or CCPA, and your standard evaluation metrics will never catch it.&lt;/p&gt;

&lt;p&gt;Compliance isn't about the final output. It's about the &lt;strong&gt;process&lt;/strong&gt;. It's about how the agent handles data, makes decisions, and interacts with the user at every step of its trajectory. A final answer can be perfectly correct and helpful, yet have been generated through a process that creates significant legal and financial risk for your company.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Violations Hide
&lt;/h2&gt;

&lt;p&gt;Let's take GDPR as an example. The principles of data minimization and purpose limitation are central. Now, consider a simple customer support agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The User's Request:&lt;/strong&gt; "I'd like to check the status of my order."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Compliant Trajectory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "I can help with that. Could you please provide your order number?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; "It's 12345."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; (Calls &lt;code&gt;getOrderStatus('12345')&lt;/code&gt; tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "Your order has shipped and is expected to arrive tomorrow."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is clean. The agent only asked for the data it absolutely needed (the order number) to fulfill the specific purpose.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Non-Compliant Trajectory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "To help you, I need to verify your identity. Please provide your full name, email address, and date of birth."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; (Provides the information)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; (Stores this PII in its conversation history)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "Thank you. Now, what is your order number?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; "It's 12345."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; (Calls &lt;code&gt;getOrderStatus('12345')&lt;/code&gt; tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "Your order has shipped and is expected to arrive tomorrow."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both agents successfully answered the user's question. But the second agent created a massive compliance risk. It violated the principle of data minimization by collecting PII that was not necessary for the task.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect These Violations
&lt;/h2&gt;

&lt;p&gt;You cannot detect this kind of failure by looking at the final output. You must analyze the entire trace of the agent's interaction. Your evaluation framework needs automated scorers that can check for compliance at each step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PII Detection Scorer:&lt;/strong&gt; Does the agent's internal reasoning or final output contain Personally Identifiable Information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Minimization Scorer:&lt;/strong&gt; Did the agent ask for more information than was strictly necessary to complete the task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purpose Limitation Scorer:&lt;/strong&gt; Did the agent use the provided data for any purpose other than what the user consented to?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Following Scorer:&lt;/strong&gt; Did the agent violate any compliance-related rules in its system prompt (e.g., "Never store user PII in your memory")?&lt;/li&gt;
&lt;/ul&gt;
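&lt;p&gt;A toy version of a PII / data-minimization scorer can scan every trace step against a pattern list. A real deployment would use a dedicated PII detector; the patterns and trace shape here are assumptions for illustration:&lt;/p&gt;

```python
import re

# Illustrative patterns only; real systems use proper PII detection.
PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "date_of_birth": r"\b\d{2}/\d{2}/\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def pii_violations(trace, allowed=("order_number",)):
    """Scan every step of a trace for PII the task did not require."""
    violations = []
    for step in trace:
        for label, pattern in PII_PATTERNS.items():
            if label not in allowed and re.search(pattern, step["content"]):
                violations.append({"role": step["role"], "pii_type": label})
    return violations

trace = [
    {"role": "agent", "content": "Please provide your email and date of birth."},
    {"role": "user", "content": "jane@example.com, born 01/02/1990"},
]
print(pii_violations(trace))
# [{'role': 'user', 'pii_type': 'email'}, {'role': 'user', 'pii_type': 'date_of_birth'}]
```

&lt;p&gt;Crucially, this runs over the whole trace, not just the final answer, which is exactly where the non-compliant trajectory above would be caught.&lt;/p&gt;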

&lt;p&gt;Traditional software testing doesn't prepare us for this. We need to adopt a new mindset where we are constantly auditing the &lt;em&gt;process&lt;/em&gt; of our AI agents, not just the results.&lt;/p&gt;

&lt;p&gt;By implementing trajectory-based compliance evaluation, you can automatically flag these hidden risks before they become a major incident.&lt;/p&gt;

&lt;p&gt;Noveum.ai's &lt;a href="https://noveum.ai/en/solutions/ai-agent-monitoring" rel="noopener noreferrer"&gt;AI Agent Monitoring solution&lt;/a&gt; includes compliance scorers that automatically detect violations in your agent's traces across GDPR, HIPAA, and other regulatory frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How are you currently ensuring your AI agents are compliant? Let's share strategies in the comments.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>privacy</category>
      <category>agents</category>
      <category>security</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Hidden Costs of Inefficient AI Agents (And How to Fix Them)</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 24 Dec 2025 00:16:00 +0000</pubDate>
      <link>https://dev.to/imshashank/the-hidden-costs-of-inefficient-ai-agents-and-how-to-fix-them-2k3d</link>
      <guid>https://dev.to/imshashank/the-hidden-costs-of-inefficient-ai-agents-and-how-to-fix-them-2k3d</guid>
      <description>&lt;h2&gt;
  
  
  Looking Beyond Token Counts
&lt;/h2&gt;

&lt;p&gt;Every developer working with LLMs is acutely aware of token costs. We optimize prompts, choose smaller models, and set max token limits. But this is only scratching the surface of AI agent costs.&lt;/p&gt;

&lt;p&gt;The real, hidden costs of AI agents aren't in the token counts of the final output. They're buried in the &lt;strong&gt;inefficiencies of the agent's trajectory&lt;/strong&gt;—the step-by-step process of reasoning and tool use that leads to the final answer.&lt;/p&gt;

&lt;p&gt;Let's look at a concrete example of two agents tasked with answering, "What is the current price of Apple stock and what was the biggest news about them this week?"&lt;/p&gt;

&lt;h3&gt;
  
  
  The Inefficient Agent Trajectory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;User Asks Question.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (500 tokens):&lt;/strong&gt; "Okay, I need to find the stock price and the latest news. I'll start with the stock price."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Calls &lt;code&gt;getStockPrice('AAPL')&lt;/code&gt; Tool.&lt;/strong&gt; (1 API call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (400 tokens):&lt;/strong&gt; "Great, I have the price. Now I need to find the news."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Calls &lt;code&gt;searchNews('Apple')&lt;/code&gt; Tool.&lt;/strong&gt; (1 API call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (300 tokens):&lt;/strong&gt; "Okay, I have the news. Now I need to combine them into a final answer."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent Provides Final Answer (200 tokens).&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total Cost:&lt;/strong&gt; 1400 LLM tokens + 2 tool calls (sequential)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Efficient Agent Trajectory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;User Asks Question.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (200 tokens):&lt;/strong&gt; "I need two pieces of information: stock price and news. I can get these at the same time."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Calls &lt;code&gt;getStockPrice('AAPL')&lt;/code&gt; and &lt;code&gt;searchNews('Apple')&lt;/code&gt; in Parallel.&lt;/strong&gt; (2 API calls, but in parallel)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (200 tokens):&lt;/strong&gt; "I have both pieces of information. I will now synthesize them."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent Provides Final Answer (150 tokens).&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total Cost:&lt;/strong&gt; 550 LLM tokens + 2 tool calls (in parallel)&lt;/li&gt;
&lt;/ul&gt;
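&lt;p&gt;In Python, the efficient trajectory's simultaneous tool calls map naturally onto &lt;code&gt;asyncio.gather&lt;/code&gt;. The tools below are stubs with made-up return values:&lt;/p&gt;

```python
import asyncio

async def get_stock_price(symbol):
    await asyncio.sleep(0.1)  # stand-in for a real API call
    return {"symbol": symbol, "price": 272.50}  # made-up value

async def search_news(query):
    await asyncio.sleep(0.1)  # stand-in for a real API call
    return {"query": query, "headline": "Apple announces new chip"}  # made-up

async def efficient_agent():
    # Both calls run concurrently: total wait is about 0.1s, not 0.2s.
    price, news = await asyncio.gather(
        get_stock_price("AAPL"),
        search_news("Apple"),
    )
    return f"{price['symbol']} is at {price['price']}; top story: {news['headline']}"

print(asyncio.run(efficient_agent()))
```

&lt;p&gt;Whether the agent framework exposes parallel tool calls this directly varies, but the latency math is the same: independent calls should overlap, not queue.&lt;/p&gt;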

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Both agents produced the same correct answer. But the efficient agent was roughly &lt;strong&gt;60% cheaper&lt;/strong&gt; (550 vs. 1,400 LLM tokens) and likely much faster because it executed its tool calls in parallel.&lt;/p&gt;

&lt;p&gt;Now, imagine this inefficiency scaled across millions of interactions. The hidden costs become astronomical.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Find and Fix Inefficiencies
&lt;/h2&gt;

&lt;p&gt;You can't find these problems by looking at the final output. You have to analyze the entire trajectory. Your evaluation framework should be asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redundant Tool Calls:&lt;/strong&gt; Is the agent calling the same tool with the same parameters multiple times in a single trajectory?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verbose Reasoning:&lt;/strong&gt; Are the internal reasoning steps unnecessarily long and complex?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential vs. Parallel:&lt;/strong&gt; Is the agent calling tools one by one when it could be executing them in parallel?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suboptimal Tool Selection:&lt;/strong&gt; Is it using an expensive, powerful tool for a simple task that a cheaper tool could handle?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where true cost optimization for AI agents happens. It's not about nickel-and-diming your token counts. It's about fundamentally improving the efficiency of your agent's decision-making process.&lt;/p&gt;

&lt;p&gt;By implementing trajectory analysis, you can identify these hidden costs and provide targeted feedback to your system prompt or agent logic to fix them, leading to massive savings at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the most inefficient agent behavior you've seen in production? Share your war stories!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>5 Types of AI Hallucinations (And How to Detect Them)</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Mon, 22 Dec 2025 00:15:00 +0000</pubDate>
      <link>https://dev.to/imshashank/5-types-of-ai-hallucinations-and-how-to-detect-them-3ab9</link>
      <guid>https://dev.to/imshashank/5-types-of-ai-hallucinations-and-how-to-detect-them-3ab9</guid>
      <description>&lt;h2&gt;
  
  
  The Many Faces of Hallucination
&lt;/h2&gt;

&lt;p&gt;When developers hear "AI hallucination," they usually picture an LLM confidently making up a completely false fact, like claiming the moon is made of cheese. While this &lt;strong&gt;Factual Hallucination&lt;/strong&gt; is a real problem, it's only one piece of a much larger puzzle.&lt;/p&gt;

&lt;p&gt;In production AI agents, there are several other, more subtle types of hallucinations that can be just as damaging. If your evaluation framework only checks for factual errors, you're leaving your application vulnerable.&lt;/p&gt;

&lt;p&gt;Here are the five key types of hallucinations you need to be detecting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Type 1: Factual Hallucination
&lt;/h3&gt;

&lt;p&gt;This is the classic definition. The agent states something as a fact that is verifiably false in the real world.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; "The first person to walk on Mars was Neil Armstrong."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires external knowledge validation, often through a search tool or a curated knowledge base.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type 2: Contextual Hallucination (RAG Failure)
&lt;/h3&gt;

&lt;p&gt;This is particularly dangerous in Retrieval-Augmented Generation (RAG) systems. The agent is given a specific context (like a document or a database query result) and is instructed to answer based &lt;em&gt;only&lt;/em&gt; on that context. A contextual hallucination occurs when the agent ignores the context and uses its general knowledge instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A user asks, "According to the provided legal document, what is the termination clause?" The agent responds with a generic, boilerplate termination clause instead of the specific one from the document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires comparing the agent's response directly against the provided context to ensure all claims are supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type 3: Instruction Hallucination
&lt;/h3&gt;

&lt;p&gt;This happens when the agent directly violates one of the core instructions in its system prompt.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; The system prompt says, "You are a helpful assistant. You must never be rude to the user." The agent responds to a user's question with, "That's a stupid question."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires parsing the system prompt into a set of rules and programmatically checking the agent's behavior against those rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type 4: Role Hallucination
&lt;/h3&gt;

&lt;p&gt;This is a subtle but important failure where the agent forgets its assigned persona or role.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; An agent is designed to be a playful, pirate-themed chatbot for a children's game. Midway through the conversation, it drops the persona and starts speaking like a formal, technical document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires evaluating the agent's tone, style, and vocabulary against the persona defined in the system prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type 5: Consistency Hallucination
&lt;/h3&gt;

&lt;p&gt;The agent contradicts itself within the same conversation, showing a lack of stable reasoning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; In the first turn, the agent says, "I cannot access external websites." Three turns later, it says, "I have just checked that website for you."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires analyzing the entire conversation history for logical contradictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most off-the-shelf evaluation frameworks are only good at catching Type 1 hallucinations. They can't detect when your agent is ignoring its context, violating its instructions, or breaking character. This is why so many agents that perform well in benchmarks fail spectacularly in production.&lt;/p&gt;

&lt;p&gt;A robust evaluation strategy must include specific scorers for all five types of hallucinations. You need to analyze the agent's output not just against the real world, but also against its provided context, its system prompt, and its own conversation history.&lt;/p&gt;
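&lt;p&gt;As one concrete example, a contextual-support check (Type 2) can be roughly approximated with lexical overlap. Production systems typically use an NLI model or an LLM judge; this sketch only shows the shape of the check:&lt;/p&gt;

```python
def context_support(response, context, threshold=0.5):
    """Fraction of response sentences whose content words appear in the context.

    Crude lexical proxy for contextual hallucination detection.
    """
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    supported = 0
    for sentence in sentences:
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if words and sum(w in ctx_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / max(len(sentences), 1)

context = "The agreement may be terminated with thirty days written notice."
grounded = "The agreement may be terminated with thirty days written notice."
invented = "Either party can cancel immediately without penalty."
print(context_support(grounded, context), context_support(invented, context))
# 1.0 0.0
```

&lt;p&gt;A low score means the response is saying things the context never did, which is the defining symptom of a RAG failure.&lt;/p&gt;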

&lt;p&gt;Noveum.ai's &lt;a href="https://noveum.ai/en/solutions/llm-observability" rel="noopener noreferrer"&gt;LLM Observability Platform&lt;/a&gt; includes dedicated scorers for detecting all five types of hallucinations across your entire agent fleet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which type of hallucination do you find most challenging to deal with in your projects? Let's discuss below.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>agents</category>
      <category>discuss</category>
    </item>
    <item>
      <title>How to Analyze AI Agent Traces Like a Detective</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Fri, 19 Dec 2025 00:14:00 +0000</pubDate>
      <link>https://dev.to/imshashank/how-to-analyze-ai-agent-traces-like-a-detective-3a03</link>
      <guid>https://dev.to/imshashank/how-to-analyze-ai-agent-traces-like-a-detective-3a03</guid>
      <description>&lt;h2&gt;
  
  
  The Final Output is Just the Tip of the Iceberg
&lt;/h2&gt;

&lt;p&gt;When an AI agent fails, the natural instinct is to look at the final, incorrect output and try to figure out what went wrong. This is like a detective arriving at a crime scene and only looking at the victim, ignoring all the surrounding evidence.&lt;/p&gt;

&lt;p&gt;To truly understand and fix agent failures, you need to become a detective and investigate the &lt;strong&gt;trace&lt;/strong&gt;. A trace is the complete, step-by-step log of everything the agent did during an interaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every internal thought or reasoning step.&lt;/li&gt;
&lt;li&gt;Every tool it decided to call.&lt;/li&gt;
&lt;li&gt;The exact parameters it used for each tool call.&lt;/li&gt;
&lt;li&gt;The raw output it received from each tool.&lt;/li&gt;
&lt;li&gt;Every decision it made based on new information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analyzing these traces is the single most powerful debugging technique for AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  A 6-Step Framework for Trace Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mn1myxo4qm6ofj48g54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mn1myxo4qm6ofj48g54.png" alt="Traces being analyzed in the Noveum.ai platform" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a systematic approach to dissecting an agent trace to find the root cause of any failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Understand the Goal
&lt;/h3&gt;

&lt;p&gt;First, clearly define what a successful outcome would have looked like. What was the user's intent? What was the agent &lt;em&gt;supposed&lt;/em&gt; to do according to its system prompt?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Follow the Trajectory
&lt;/h3&gt;

&lt;p&gt;Start from the beginning of the trace and walk through each step of the agent's reasoning process. Don't make assumptions. Read the agent's internal monologue. Does its chain of thought make logical sense?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Identify Key Decision Points
&lt;/h3&gt;

&lt;p&gt;Pinpoint the exact moments where the agent made a choice. This could be deciding which tool to use, what parameters to pass, or how to interpret a tool's response. Was the choice it made the optimal one?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Scrutinize Tool Calls
&lt;/h3&gt;

&lt;p&gt;This is often where things go wrong. For every tool call, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was this the right tool for this specific sub-task?&lt;/li&gt;
&lt;li&gt;Were the parameters passed to the tool correct and well-formed?&lt;/li&gt;
&lt;li&gt;Was the tool's output what you expected? Did the agent handle an error or unexpected output gracefully?&lt;/li&gt;
&lt;/ul&gt;
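&lt;p&gt;These questions can be partially automated. A rough sketch, assuming tool calls are logged as plain dicts and each tool's required parameters are known; the trace shape here is illustrative, not any specific framework's format:&lt;/p&gt;

```python
def audit_tool_calls(tool_calls, required_params):
    """Flag malformed tool calls and raw tool errors in a trace.

    tool_calls: list of dicts like {"tool": ..., "params": {...}, "output": ...}
    required_params: dict mapping tool name to its required parameter names
    """
    findings = []
    for i, call in enumerate(tool_calls):
        required = required_params.get(call["tool"])
        if required is None:
            findings.append((i, "unknown tool: " + call["tool"]))
            continue
        missing = [p for p in required if p not in call["params"]]
        if missing:
            findings.append((i, "missing params: " + ", ".join(missing)))
        if str(call.get("output", "")).startswith("ERROR"):
            findings.append((i, "tool returned an error"))
    return findings
```

&lt;p&gt;Running this over every trace surfaces malformed calls even when the agent's final answer happens to look fine.&lt;/p&gt;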

&lt;h3&gt;
  
  
  Step 5: Check for Compliance and Constraint Violations
&lt;/h3&gt;

&lt;p&gt;At each step, cross-reference the agent's action with its system prompt. Did it violate any of its core instructions? For example, if it's not supposed to give financial advice, did it call a stock price API?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Pinpoint the Root Cause
&lt;/h3&gt;

&lt;p&gt;By following these steps, you can move beyond simply identifying the failure and pinpoint its origin. Was the root cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Reasoning Error?&lt;/strong&gt; The agent's logic was flawed.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Tool Use Error?&lt;/strong&gt; The agent used the wrong tool or used it incorrectly.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Prompt Issue?&lt;/strong&gt; The system prompt was ambiguous or incomplete.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Model Limitation?&lt;/strong&gt; The underlying LLM simply wasn't capable of the required reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Practical Example
&lt;/h3&gt;

&lt;p&gt;Imagine a customer support agent that gives a user the wrong refund amount. The trace might reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent correctly understands the user's request for a refund.&lt;/li&gt;
&lt;li&gt;Agent correctly calls the &lt;code&gt;getOrderDetails&lt;/code&gt; tool with the right order ID.&lt;/li&gt;
&lt;li&gt;The tool returns the correct order data, including &lt;code&gt;price: 99.99&lt;/code&gt; and &lt;code&gt;discount: 10.00&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The agent's reasoning step says: "The refund amount is the price. I will refund $99.99."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root Cause:&lt;/strong&gt; A reasoning error. The agent failed to account for the discount.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you know exactly what to fix. You don't need to debug the tool or the data. You need to improve the agent's reasoning, likely by updating the system prompt to explicitly mention how to handle discounts.&lt;/p&gt;
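&lt;p&gt;A failure like this can also be guarded against with a targeted check. A sketch for this specific scenario, assuming the order data from the tool output is available as a dict (the field names mirror the example above and are otherwise hypothetical):&lt;/p&gt;

```python
import math

def check_refund_amount(order, stated_refund):
    """Compare the agent's stated refund against price minus discount."""
    expected = round(order["price"] - order["discount"], 2)
    ok = math.isclose(expected, stated_refund, abs_tol=0.005)
    return {"expected": expected, "stated": stated_refund, "ok": ok}
```

&lt;p&gt;For the trace above, a price of 99.99 and a discount of 10.00 give an expected refund of 89.99, so the agent's stated $99.99 fails the check.&lt;/p&gt;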

&lt;p&gt;Without trace analysis, you're just debugging in the dark.&lt;/p&gt;

&lt;p&gt;To streamline your trace analysis process, &lt;a href="https://noveum.ai/en/solutions/debugging" rel="noopener noreferrer"&gt;Noveum.ai's Debugging and Tracing solution&lt;/a&gt; provides hierarchical trace visualization and automated root cause analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have you ever analyzed an agent trace to find a surprising root cause? Share your story!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>agents</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Beyond Accuracy: The 73+ Dimensions of AI Agent Quality</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 17 Dec 2025 00:12:00 +0000</pubDate>
      <link>https://dev.to/imshashank/beyond-accuracy-the-73-dimensions-of-ai-agent-quality-41ni</link>
      <guid>https://dev.to/imshashank/beyond-accuracy-the-73-dimensions-of-ai-agent-quality-41ni</guid>
      <description>&lt;h2&gt;
  
  
  "Is My Agent Good?" Is the Wrong Question
&lt;/h2&gt;

&lt;p&gt;When a developer asks, "Is my AI agent good?" they're often looking for a single score, like an accuracy percentage. This is a dangerous oversimplification. An AI agent is a complex system, and its quality can't be boiled down to one number.&lt;/p&gt;

&lt;p&gt;An agent isn't just "good" or "bad." It can be factually accurate but dangerously non-compliant. It can be helpful but horribly inefficient. It can be safe but provide a terrible user experience.&lt;/p&gt;

&lt;p&gt;To truly understand your agent's performance, you need to evaluate it across multiple dimensions simultaneously. At Noveum.ai, we've identified over 73 distinct scorers, which we group into several key categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2m8pm5vmcwaopcmcsh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2m8pm5vmcwaopcmcsh1.png" alt="Agent Health Dashboard from Noveum.ai" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Dimensions of Agent Quality
&lt;/h2&gt;

&lt;p&gt;Here are some of the most critical dimensions you should be tracking:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Correctness Dimensions
&lt;/h3&gt;

&lt;p&gt;This is about the factual and logical integrity of the agent's output.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Factual Accuracy:&lt;/strong&gt; Does the agent provide information that is verifiably true?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Following:&lt;/strong&gt; Does the agent adhere to the explicit instructions in its system prompt?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Adherence:&lt;/strong&gt; Does the agent use only the information provided in the given context, especially in RAG systems?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Safety and Security Dimensions
&lt;/h3&gt;

&lt;p&gt;These scorers protect your users and your company from harm.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Toxicity Detection:&lt;/strong&gt; Does the agent avoid generating hateful, offensive, or inappropriate language?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII Protection:&lt;/strong&gt; Does it refuse to process or reveal Personally Identifiable Information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Injection Resistance:&lt;/strong&gt; Can the agent be tricked into violating its instructions by a malicious user prompt?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Efficiency Dimensions
&lt;/h3&gt;

&lt;p&gt;An agent that works but is slow and expensive is a liability in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool Call Efficiency:&lt;/strong&gt; Is the agent making redundant or unnecessary API calls?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Efficiency:&lt;/strong&gt; Is it being overly verbose, driving up LLM costs?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning Efficiency:&lt;/strong&gt; Does it get stuck in loops or take a convoluted path to a simple answer?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. User Experience Dimensions
&lt;/h3&gt;

&lt;p&gt;This measures how it &lt;em&gt;feels&lt;/em&gt; to interact with your agent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conversation Coherence:&lt;/strong&gt; Does the agent maintain a logical and easy-to-follow conversation flow?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance:&lt;/strong&gt; Does it stay on topic and provide answers that are relevant to the user's query?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helpfulness:&lt;/strong&gt; Does it actually solve the user's underlying problem?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Compliance Dimensions
&lt;/h3&gt;

&lt;p&gt;For any enterprise application, this is non-negotiable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory Compliance:&lt;/strong&gt; Does the agent's behavior align with legal frameworks like GDPR, HIPAA, or CCPA?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Company Policy Adherence:&lt;/strong&gt; Does it follow your internal guidelines for brand voice, tone, and values?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Multi-Dimensional Evaluation Matters
&lt;/h2&gt;

&lt;p&gt;Most teams only look at one or two of these categories, typically correctness. This creates massive blind spots. You might have an agent that's 99% factually accurate but leaks PII in 5% of conversations. Without a multi-dimensional evaluation framework, you'd never know until it's too late.&lt;/p&gt;

&lt;p&gt;The only way to de-risk your AI agent for production is to have a comprehensive suite of scorers that evaluates its performance from every possible angle. Stop chasing a single accuracy score and start building a holistic view of your agent's quality.&lt;/p&gt;
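&lt;p&gt;In code, the difference is simply refusing to collapse dimensions into one number. A minimal sketch (the dimension names and thresholds are illustrative):&lt;/p&gt;

```python
def quality_report(scores, thresholds, default_threshold=0.8):
    """Produce per-dimension pass/fail instead of a single blended score."""
    report = {}
    for dim, score in scores.items():
        threshold = thresholds.get(dim, default_threshold)
        report[dim] = {"score": score, "passed": score >= threshold}
    return report

def is_production_ready(report):
    # One failing dimension fails the agent, however high the others are.
    return all(entry["passed"] for entry in report.values())
```

&lt;p&gt;An agent with 0.99 factual accuracy but 0.70 PII protection fails this gate, which is exactly the blind spot a single accuracy score hides.&lt;/p&gt;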

&lt;p&gt;Noveum.ai's &lt;a href="https://noveum.ai/en/solutions/scorers" rel="noopener noreferrer"&gt;comprehensive scorer library&lt;/a&gt; includes 73+ pre-built scorers that evaluate agents across all critical dimensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which dimension do you think is most overlooked by developers today? Share your thoughts below!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your System Prompt is Your Ground Truth: Ditch Manual Labeling for AI Agent Evaluation</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Mon, 15 Dec 2025 02:38:00 +0000</pubDate>
      <link>https://dev.to/imshashank/your-system-prompt-is-your-ground-truth-ditch-manual-labeling-for-ai-agent-evaluation-2n5j</link>
      <guid>https://dev.to/imshashank/your-system-prompt-is-your-ground-truth-ditch-manual-labeling-for-ai-agent-evaluation-2n5j</guid>
      <description>&lt;h2&gt;
  
  
  The Manual Labeling Trap
&lt;/h2&gt;

&lt;p&gt;Here's a hard truth for developers building AI agents: if you're relying on manual labeling to create your evaluation datasets, you're setting yourself up for failure.&lt;/p&gt;

&lt;p&gt;We've seen it time and time again. Teams spend months and thousands of dollars hiring annotators to create a "golden dataset." They write complex guidelines, hold training sessions, and run quality checks. The result? A dataset that is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expensive:&lt;/strong&gt; Manual annotation is a significant budget drain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow:&lt;/strong&gt; It can take weeks or months to label a sufficiently large dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent:&lt;/strong&gt; Human annotators are subjective. Two different people will often label the same interaction differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brittle:&lt;/strong&gt; The moment you change your agent's system prompt or add a new tool, your entire dataset becomes obsolete.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach is a dead end. It doesn't scale, and it can't keep up with the pace of modern AI development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1latjk3krgftwhh3yfs9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1latjk3krgftwhh3yfs9.png" alt="Various built-in evaluators from Noveum.ai" width="728" height="902"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paradigm Shift: System Prompt as Ground Truth
&lt;/h2&gt;

&lt;p&gt;There's a better way, and it's been hiding in plain sight: &lt;strong&gt;your system prompt &lt;em&gt;is&lt;/em&gt; your ground truth.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about it. Your system prompt is the constitution for your AI agent. It explicitly defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Agent's Role:&lt;/strong&gt; What is its designated function? (e.g., "You are a senior software engineer helping with code reviews.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Its Constraints:&lt;/strong&gt; What are the hard rules it must never break? (e.g., "You must never suggest code that introduces security vulnerabilities.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Its Instructions:&lt;/strong&gt; How should it behave in specific scenarios? (e.g., "When you see a logic error, provide a corrected code snippet and explain the reasoning.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Its Values:&lt;/strong&gt; What principles should guide its behavior? (e.g., "Prioritize clarity and maintainability in your suggestions.")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything the agent does can, and should, be evaluated against this foundational document. You don't need a human to tell you if the agent followed the rules. You just need a system that can programmatically check the agent's behavior against the prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Concrete Example
&lt;/h3&gt;

&lt;p&gt;Let's say your system prompt includes this instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You are a customer support agent for an e-commerce store. You must be polite, professional, and never discuss politics or religion."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of manually labeling thousands of conversations, you can create automated scorers that check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;is_polite()&lt;/code&gt;: Analyzes the agent's language for politeness.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_professional()&lt;/code&gt;: Checks for slang or overly casual language.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;avoids_prohibited_topics()&lt;/code&gt;: Scans the conversation for keywords related to politics or religion.&lt;/li&gt;
&lt;/ul&gt;
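&lt;p&gt;The last of these is the easiest to sketch. A deliberately simple keyword version (the keyword lists are illustrative; a production scorer would more likely use an embedding- or LLM-based classifier):&lt;/p&gt;

```python
# Illustrative keyword lists; real deployments would use a classifier.
PROHIBITED_TOPICS = {
    "politics": ["election", "senator", "political party"],
    "religion": ["church", "scripture", "religious"],
}

def avoids_prohibited_topics(conversation_text):
    """Return (passed, topics_hit) for a whole conversation transcript."""
    lowered = conversation_text.lower()
    hits = [
        topic
        for topic, keywords in PROHIBITED_TOPICS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return (len(hits) == 0, hits)
```

&lt;p&gt;The same pattern works for any hard constraint in the prompt: derive a check, run it on every conversation, and the constraint never needs a human label.&lt;/p&gt;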

&lt;p&gt;These aren't subjective labels; they are objective, automated checks derived directly from your requirements. This is the foundation of a scalable, reliable, and cost-effective evaluation strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benefits of This Approach
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; You can evaluate thousands of interactions in minutes, not months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Effective:&lt;/strong&gt; It eliminates the need for expensive manual annotation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; The evaluation is objective and repeatable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agility:&lt;/strong&gt; When you update your system prompt, you simply update your scorers. Your entire evaluation framework adapts instantly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system prompt is the ultimate source of truth for your agent's behavior. Stop wasting time and money on manual labeling and start building an evaluation framework that uses your prompt as its guide.&lt;/p&gt;

&lt;p&gt;To see how this approach works in practice, explore &lt;a href="https://noveum.ai/en/solutions/agent-evaluation" rel="noopener noreferrer"&gt;Noveum.ai's Agent Evaluation Framework&lt;/a&gt;, which uses system prompts as ground truth for automated evaluation without manual labeling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How are you currently defining ground truth for your agents? Let's discuss in the comments.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>agents</category>
    </item>
    <item>
      <title>Stop Evaluating AI Agents Like ML Models: A Paradigm Shift for Developers</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Fri, 12 Dec 2025 02:09:00 +0000</pubDate>
      <link>https://dev.to/imshashank/stop-evaluating-ai-agents-like-ml-models-a-paradigm-shift-for-developers-4pgf</link>
      <guid>https://dev.to/imshashank/stop-evaluating-ai-agents-like-ml-models-a-paradigm-shift-for-developers-4pgf</guid>
      <description>&lt;h2&gt;
  
  
  The Flaw in Our Thinking
&lt;/h2&gt;

&lt;p&gt;For years, we've been conditioned to evaluate machine learning models with a standard set of metrics: accuracy, precision, recall, F1-score. We feed the model an input, check the output against a ground truth label, and score it. This works perfectly for tasks like classification or regression.&lt;/p&gt;

&lt;p&gt;But most developers are now realizing this approach completely breaks down for AI agents. Why? Because an AI agent isn't just producing a single output. It's executing a complex, multi-step &lt;strong&gt;trajectory of decisions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Applying simple input/output metrics to an agent is like judging a chess grandmaster based only on whether they won or lost, without analyzing the entire game. You miss the brilliance, the blunders, and the critical turning points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvshz7msvbp74tkez6pz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvshz7msvbp74tkez6pz2.png" alt="Visualization of an AI agent's full trajectory with Noveum.ai" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From Single Predictions to Complex Trajectories
&lt;/h2&gt;

&lt;p&gt;Let's break down a typical agent's workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Receives User Input:&lt;/strong&gt; The agent ingests the initial prompt or query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasons About the Problem:&lt;/strong&gt; It forms an internal plan or hypothesis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decides on a Tool:&lt;/strong&gt; It selects a tool (e.g., an API call, a database query, a web search) from its available arsenal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Receives Tool Output:&lt;/strong&gt; It gets the result from the tool call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasons About the Result:&lt;/strong&gt; It analyzes the new information and updates its plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decides on the Next Action:&lt;/strong&gt; This could be calling another tool, asking a clarifying question, or formulating the final answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provides Final Response:&lt;/strong&gt; The agent delivers the result to the user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you only evaluate the final response, you're blind to potential failures in steps 2 through 6. The agent could have reached the right answer through a horribly inefficient or even incorrect process. This is a ticking time bomb in a production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  A New Framework: Trajectory-Based Evaluation
&lt;/h2&gt;

&lt;p&gt;To properly evaluate an agent, you must analyze its entire decision-making journey. This requires a shift in mindset and tooling. Instead of asking "Was the answer correct?", you need to ask a series of deeper questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Adherence:&lt;/strong&gt; Did the agent follow its core system prompt at every step of the conversation? If it was told to be a helpful pirate, did it maintain that persona?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical Coherence:&lt;/strong&gt; Was the reasoning sound at each decision point? Did the agent make logical leaps or get stuck in loops?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Use Efficiency:&lt;/strong&gt; Did it use the right tools for the job? Did it call them in the correct sequence? Could it have achieved the same result with fewer calls?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robustness and Edge Cases:&lt;/strong&gt; How did the agent handle unexpected tool outputs, errors, or ambiguous user queries?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why traditional metrics fail. You can't capture the nuance of an agent's performance with a single number. You need a framework that can dissect the entire process.&lt;/p&gt;
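&lt;p&gt;One way to make these questions concrete is to run every check against every step of the trajectory, not just once against the final answer. A minimal sketch (the step shape and checks are illustrative):&lt;/p&gt;

```python
def evaluate_trajectory(steps, checks):
    """Apply each named check to each step; any failure fails the run.

    steps: ordered list of step records (reasoning, tool call, response, ...)
    checks: dict mapping a check name to a predicate over a single step
    """
    failures = []
    for i, step in enumerate(steps):
        for name, check in checks.items():
            if not check(step):
                failures.append({"step": i, "check": name})
    return {"passed": len(failures) == 0, "failures": failures}
```

&lt;p&gt;Because the result names the exact step and check that failed, a bad run points you straight at the decision point to investigate.&lt;/p&gt;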

&lt;h2&gt;
  
  
  What This Means for You
&lt;/h2&gt;

&lt;p&gt;As a developer building with AI agents, you need to move beyond simple test cases. Your evaluation suite should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trace Analysis:&lt;/strong&gt; The ability to log and inspect the full trajectory of every agent interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Dimensional Scoring:&lt;/strong&gt; A system that can score not just the final output, but also the quality of the reasoning, tool use, and adherence to constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Evaluation:&lt;/strong&gt; A way to run these complex evaluations at scale, so you're not manually inspecting thousands of traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stop thinking in terms of input/output. Start thinking in terms of trajectories. It's the only way to build reliable, production-ready AI agents.&lt;/p&gt;

&lt;p&gt;If you're looking to implement trajectory-based evaluation for your agents, check out &lt;a href="https://noveum.ai/en/solutions/ai-agent-monitoring" rel="noopener noreferrer"&gt;Noveum.ai's AI Agent Monitoring solution&lt;/a&gt;, which provides comprehensive trace analysis and multi-dimensional evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the biggest mistake you've seen in agent evaluation? Share your thoughts in the comments!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to Build an AI Agent Evaluation Framework from Scratch</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 10 Dec 2025 08:55:17 +0000</pubDate>
      <link>https://dev.to/imshashank/how-to-build-an-ai-agent-evaluation-framework-from-scratch-5h54</link>
      <guid>https://dev.to/imshashank/how-to-build-an-ai-agent-evaluation-framework-from-scratch-5h54</guid>
      <description>&lt;p&gt;Building AI agents is hard. Evaluating them is harder.&lt;/p&gt;

&lt;p&gt;Most teams I talk to are evaluating their agents the wrong way. They look at the final output and ask, "Is it correct?" But that's like grading a math test by only looking at the final answer, not the work.&lt;/p&gt;

&lt;p&gt;In this post, I'll show you how to build a proper AI agent evaluation framework from scratch. We'll cover the concepts, the implementation, and the best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Traditional Evaluation Fails for Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional ML evaluation metrics (accuracy, precision, recall) don't work for agents because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agents take multiple steps:&lt;/strong&gt; An agent might get the right answer through the wrong path. Traditional metrics only look at the final output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The path matters:&lt;/strong&gt; An agent that takes 10 steps to answer a question is worse than one that takes 2 steps, even if both get the right answer. Cost and efficiency matter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hallucinations are subtle:&lt;/strong&gt; An agent might hallucinate in an intermediate step but still get the right final answer. You'd miss this with output-only evaluation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance violations are hidden:&lt;/strong&gt; An agent might violate a constraint (like discussing a competitor) in the middle of a conversation but still provide a correct final answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Right Way to Evaluate Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the framework I recommend:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Define Your Ground Truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't manually label data. Use your system prompt as ground truth. Your system prompt defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the agent should do&lt;/li&gt;
&lt;li&gt;How it should behave&lt;/li&gt;
&lt;li&gt;What constraints it should follow&lt;/li&gt;
&lt;li&gt;What role it should play&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is your evaluation ground truth. Everything else is a deviation from this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Collect Traces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every time your agent runs, collect a trace. A trace includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The initial user input&lt;/li&gt;
&lt;li&gt;Every LLM call (input and output)&lt;/li&gt;
&lt;li&gt;Every tool call&lt;/li&gt;
&lt;li&gt;The final output&lt;/li&gt;
&lt;li&gt;Metadata (tokens, latency, cost)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what a trace structure might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMCall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;tool_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LLMCall&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;total_latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Define Evaluation Dimensions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't use a single metric. Evaluate across multiple dimensions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Did it achieve the goal?
&lt;/span&gt;    &lt;span class="n"&gt;EFFICIENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Did it take the optimal path?
&lt;/span&gt;    &lt;span class="n"&gt;HALLUCINATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Did it invent facts?
&lt;/span&gt;    &lt;span class="n"&gt;COMPLIANCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compliance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Did it follow constraints?
&lt;/span&gt;    &lt;span class="n"&gt;COHERENCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coherence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Was it logically consistent?
&lt;/span&gt;    &lt;span class="n"&gt;COST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# How many tokens did it use?
&lt;/span&gt;    &lt;span class="n"&gt;TOOL_VALIDITY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_validity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Were tool calls valid?
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
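The class above is a plain namespace of string constants, which works fine. If you want iteration and typo-safety for free, Python's built-in `enum` module is a drop-in alternative; this is a sketch of that variant, not part of the original framework:

```python
from enum import Enum

class EvaluationDimension(str, Enum):
    """String-valued enum: members compare equal to their string values,
    so they still work as plain dict keys in the scorers below."""
    TASK_COMPLETION = "task_completion"
    EFFICIENCY = "efficiency"
    HALLUCINATION = "hallucination"
    COMPLIANCE = "compliance"
    COHERENCE = "coherence"
    COST = "cost"
    TOOL_VALIDITY = "tool_validity"

# Mistyping a dimension name now raises AttributeError instead of
# silently creating a new string key.
assert EvaluationDimension.COST == "cost"
assert len(list(EvaluationDimension)) == 7
```

The `str` mixin matters: it keeps serialized scores (JSON logs, dashboards) identical to the string-constant version.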



&lt;p&gt;&lt;strong&gt;Step 4: Implement Scorers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each dimension, implement a scorer. Here are a few examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_task_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Score whether the agent completed its task.

    Uses the system prompt to determine what &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; means.
    Returns a score from 0-10.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract task from system prompt
&lt;/span&gt;    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_task_from_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if final output indicates task completion
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;indicates_task_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Score how efficient the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s path was.

    Fewer steps = higher efficiency.
    Returns a score from 0-10.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Count steps taken
&lt;/span&gt;    &lt;span class="n"&gt;steps_taken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Estimate optimal steps (this is domain-specific)
&lt;/span&gt;    &lt;span class="n"&gt;optimal_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_optimal_steps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate efficiency ratio
&lt;/span&gt;    &lt;span class="n"&gt;efficiency_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimal_steps&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;steps_taken&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to 0-10 scale
&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;efficiency_ratio&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_hallucination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Score whether the agent hallucinated.

    Hallucinations = lower score.
    Returns a score from 0-10 (10 = no hallucinations).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;hallucinations_detected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# Check each LLM output for hallucinations
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;llm_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;contains_hallucination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;hallucinations_detected&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to score
&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hallucinations_detected&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_compliance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Score whether the agent followed its constraints.

    Constraint violations = lower score.
    Returns a score from 0-10 (10 = no violations).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract constraints from system prompt
&lt;/span&gt;    &lt;span class="n"&gt;constraints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_constraints_from_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# Check each LLM output against constraints
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;llm_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;constraint&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;violates_constraint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constraint&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to score
&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
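Helpers like `violates_constraint` and `contains_hallucination` are left abstract above. As one minimal sketch, assuming constraints are represented as dicts with a hypothetical `"forbidden"` phrase list, a keyword heuristic could look like this (a real system would more likely use an LLM judge):

```python
from typing import Dict, List

def violates_constraint(output: str, constraint: Dict[str, List[str]]) -> bool:
    """Heuristic checker: a constraint forbids certain phrases.

    Expects constraints shaped like {"forbidden": ["refund", ...]}.
    This shape is an assumption for illustration, not the article's API.
    """
    forbidden = constraint.get("forbidden", [])
    lowered = output.lower()
    return any(phrase.lower() in lowered for phrase in forbidden)

# Example: a support agent that must never promise refunds
constraint = {"forbidden": ["refund", "money back"]}
print(violates_constraint("I can offer you a full refund today.", constraint))  # True
print(violates_constraint("Let me check your order status.", constraint))       # False
```

Keyword matching is cheap enough to run on every trace; reserve the expensive LLM-judge version for traces this filter flags.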



&lt;p&gt;&lt;strong&gt;Step 5: Aggregate Scores&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combine individual dimension scores into an overall evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_agent_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Evaluate an agent trace across all dimensions.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_task_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EFFICIENCY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALLUCINATION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_hallucination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLIANCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_compliance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COHERENCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_coherence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COST&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TOOL_VALIDITY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_tool_validity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate overall score (weighted average)
&lt;/span&gt;    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLIANCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALLUCINATION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EFFICIENCY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COHERENCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COST&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TOOL_VALIDITY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Included in task completion
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;overall_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;overall_score&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
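As a sanity check on the weighting scheme, here is the aggregation logic run standalone with plain string keys and made-up per-dimension scores (the numbers are illustrative only):

```python
# Hypothetical scores for a single trace
scores = {
    "task_completion": 10.0,
    "compliance": 8.0,
    "hallucination": 10.0,
    "efficiency": 6.0,
    "coherence": 9.0,
    "cost": 7.0,
    "tool_validity": 10.0,
}

weights = {
    "task_completion": 0.3,
    "compliance": 0.3,
    "hallucination": 0.2,
    "efficiency": 0.1,
    "coherence": 0.05,
    "cost": 0.05,
    "tool_validity": 0.0,  # folded into task completion
}

# Weights must sum to 1.0 so the overall score stays on the 0-10 scale
assert abs(sum(weights.values()) - 1.0) < 1e-9

overall = sum(scores[dim] * weights[dim] for dim in scores)
print(round(overall, 2))  # → 8.8
```

Worth noting: because compliance carries a 0.3 weight, a single bad compliance score drags the overall down faster than any efficiency win can recover.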



&lt;p&gt;&lt;strong&gt;Step 6: Identify Root Causes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an agent scores poorly, analyze why:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;identify_root_causes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Identify why the agent performed poorly.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;root_causes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALLUCINATION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is hallucinating. Review system prompt for clarity.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLIANCE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is violating constraints. Strengthen system prompt.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EFFICIENCY&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is taking inefficient paths. Consider simplifying task or providing better tools.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is not completing task. Review system prompt and tool availability.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;root_causes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 7: Continuous Improvement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use evaluation results to improve your agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_recommendations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Generate specific recommendations for improving the agent.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;recommendations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;root_causes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;identify_root_causes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucinating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add specific facts to system prompt that agent should reference.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide relevant context in user input.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;violating constraints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make constraints more explicit in system prompt.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consider using tool constraints to prevent violations.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inefficient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide better tools to reduce steps needed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simplify the task or break it into sub-tasks.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recommendations&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Putting It All Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's how you'd use this framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Collect a trace from your agent
&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect_agent_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate the trace
&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_agent_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Identify problems
&lt;/span&gt;&lt;span class="n"&gt;root_causes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;identify_root_causes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate recommendations
&lt;/span&gt;&lt;span class="n"&gt;recommendations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_recommendations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Log results
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Overall Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;overall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task Completion: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Efficiency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EFFICIENCY&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hallucination: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALLUCINATION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compliance: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLIANCE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Root Causes:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recommendations:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Limitations of DIY Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building your own evaluation framework is a good exercise, but it has limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scorer Implementation:&lt;/strong&gt; Implementing reliable scorers for hallucination, compliance, and coherence is non-trivial and typically requires NLP expertise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; As your agent grows more complex, maintaining scorers becomes a full-time job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimization:&lt;/strong&gt; Hand-written scorers are often suboptimal. ML-based scorers (like LLM-as-Judge) perform better but require more infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root Cause Analysis:&lt;/strong&gt; Identifying root causes and generating recommendations requires deep domain knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
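&lt;p&gt;As a rough illustration of the third point, here is a minimal LLM-as-Judge scorer sketch. The &lt;code&gt;judge_fn&lt;/code&gt; callable, the prompt template, and the 1-10 scale are assumptions for illustration, not the API of any particular library:&lt;/p&gt;

```python
# Minimal LLM-as-Judge scorer sketch. judge_fn is a placeholder for
# whatever LLM client you use; all names here are illustrative.
from dataclasses import dataclass

JUDGE_PROMPT = (
    "Rate the assistant reply from 1-10 for {dimension}. "
    "Reply with only the number.\n\nReply: {reply}"
)

@dataclass
class JudgeResult:
    dimension: str
    score: float

def llm_as_judge(reply, dimension, judge_fn):
    """Score one reply on one dimension using a judge-model callable."""
    prompt = JUDGE_PROMPT.format(dimension=dimension, reply=reply)
    raw = judge_fn(prompt)  # e.g. a thin wrapper around your LLM client
    try:
        # Clamp to the 1-10 scale in case the judge drifts out of range.
        score = max(1.0, min(10.0, float(raw.strip())))
    except ValueError:
        score = 0.0  # judge returned something unparsable; flag for review
    return JudgeResult(dimension, score)

# Stubbed judge for demonstration: always answers "8".
result = llm_as_judge("Sure, I can help with that.", "coherence", lambda p: "8")
```

&lt;p&gt;In practice you would also ask the judge for its reasoning alongside the score, and calibrate it against a small set of human-labeled examples before trusting it.&lt;/p&gt;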

&lt;p&gt;This is where a purpose-built evaluation platform becomes valuable. Noveum.ai, for example, provides all of this out of the box: 73+ pre-built scorers, automated root cause analysis through NovaPilot, and prescriptive recommendations. You can learn more about their approach to &lt;a href="https://noveum.ai/en/solutions/agent-evaluation" rel="noopener noreferrer"&gt;agent evaluation here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Evaluating AI agents properly requires evaluating the entire trajectory across multiple dimensions, not just the final output. By following this framework, you'll have much better visibility into your agent's behavior and be able to improve it iteratively.&lt;/p&gt;

&lt;p&gt;Start with the basic scorers I've outlined here, then expand as your needs grow. And remember: the system prompt is your ground truth. Use it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to use System prompts as Ground Truth for Evaluation</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 10 Dec 2025 03:50:00 +0000</pubDate>
      <link>https://dev.to/imshashank/how-to-use-system-prompts-as-ground-truth-for-evaluation-ni6</link>
      <guid>https://dev.to/imshashank/how-to-use-system-prompts-as-ground-truth-for-evaluation-ni6</guid>
      <description>&lt;p&gt;Here's a hard truth: most teams don't know how to evaluate their AI agents because they don't have a clear ground truth.&lt;/p&gt;

&lt;p&gt;They spend months creating manual labels, hiring annotators, and building datasets. Then they realize the labels are inconsistent, expensive, and don't scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoflzut725dg8f4mzegi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoflzut725dg8f4mzegi.png" alt="Hallucination Detection from Noveum.ai with reasoning"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a better way.&lt;/p&gt;

&lt;p&gt;Your system prompt IS your ground truth.&lt;/p&gt;

&lt;p&gt;Think about it. Your system prompt defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent's role: What is it supposed to be?&lt;/li&gt;
&lt;li&gt;Its constraints: What should it NOT do?&lt;/li&gt;
&lt;li&gt;Its instructions: How should it behave?&lt;/li&gt;
&lt;li&gt;Its values: What matters to it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything the agent does should be evaluated against these instructions.&lt;/p&gt;

&lt;p&gt;For example, if your system prompt says: "You are a customer support agent. You must be polite, professional, and never discuss politics," then you can evaluate every response by asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Is it polite?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is it professional?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does it avoid political topics?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't subjective labels. They're objective criteria derived from your system prompt.&lt;/p&gt;
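&lt;p&gt;A minimal sketch of that idea, assuming simple keyword heuristics stand in for real classifiers (a production system would use an LLM or a trained model per rule, not substring matching):&lt;/p&gt;

```python
# Sketch: turning system-prompt rules into automatic pass/fail checks.
# The rule names and keyword lists are illustrative placeholders.
SYSTEM_PROMPT_RULES = {
    "polite": lambda r: not any(w in r.lower() for w in ("idiot", "stupid")),
    "no_politics": lambda r: not any(w in r.lower() for w in ("election", "senator")),
}

def evaluate_response(response):
    """Return a pass/fail verdict per rule derived from the system prompt."""
    return {rule: check(response) for rule, check in SYSTEM_PROMPT_RULES.items()}

results = evaluate_response("Happy to help you track your order today!")
```

&lt;p&gt;Each check maps one sentence of the system prompt to one objective verdict, which is exactly what makes the results reproducible.&lt;/p&gt;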

&lt;p&gt;This is the foundation of proper agent evaluation. You don't need expensive annotators. You need a framework that automatically evaluates whether the agent followed its instructions.&lt;/p&gt;

&lt;p&gt;The system prompt is the source of truth. Everything else is just implementation.&lt;/p&gt;

&lt;p&gt;That's how &lt;a href="https://noveum.ai/en/solutions/agent-evaluation" rel="noopener noreferrer"&gt;Noveum.ai&lt;/a&gt; works today. If you'd like early access, reach out to us.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>agents</category>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
