<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: shashank agarwal</title>
    <description>The latest articles on DEV Community by shashank agarwal (@imshashank).</description>
    <link>https://dev.to/imshashank</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1348998%2F681a6e08-bfe7-46f1-90ad-bb1a45963e91.jpeg</url>
      <title>DEV Community: shashank agarwal</title>
      <link>https://dev.to/imshashank</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/imshashank"/>
    <language>en</language>
    <item>
      <title>The AI Agent Feedback Loop: From Evaluation to Continuous Improvement</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Thu, 01 Jan 2026 00:27:00 +0000</pubDate>
      <link>https://dev.to/imshashank/the-ai-agent-feedback-loop-from-evaluation-to-continuous-improvement-5hm4</link>
      <guid>https://dev.to/imshashank/the-ai-agent-feedback-loop-from-evaluation-to-continuous-improvement-5hm4</guid>
      <description>&lt;h2&gt;
  
  
  Evaluation is Just the First Step
&lt;/h2&gt;

&lt;p&gt;So you've built an evaluation framework for your AI agent. You're tracking metrics, scoring conversations, and identifying failures. That's great. But evaluation, on its own, is useless.&lt;/p&gt;

&lt;p&gt;Data without action is just a dashboard. The real value of evaluation is in creating a tight, continuous &lt;strong&gt;feedback loop&lt;/strong&gt; that drives improvement. It's about turning insights into action.&lt;/p&gt;

&lt;p&gt;Most teams get stuck at the evaluation step. They have a spreadsheet full of failing test cases, but no clear process for fixing them. The result is a backlog of issues and a development process that feels like playing whack-a-mole.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7 Steps of a Powerful Feedback Loop
&lt;/h2&gt;

&lt;p&gt;A truly effective feedback loop is a systematic, automated process that takes you from raw data to a better agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Evaluate at Scale
&lt;/h3&gt;

&lt;p&gt;First, run your evaluation framework on every single agent interaction in production. This gives you the comprehensive dataset you need to find meaningful patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Identify Failure Patterns
&lt;/h3&gt;

&lt;p&gt;Don't just look at individual failures. Look for patterns. Is a specific type of scorer (e.g., &lt;code&gt;is_concise&lt;/code&gt;) failing frequently? Is a particular agent or prompt causing most of the issues?&lt;/p&gt;
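&lt;p&gt;As a sketch, pattern-finding can start as a simple aggregation over evaluation results. The &lt;code&gt;scorer&lt;/code&gt;, &lt;code&gt;agent&lt;/code&gt;, and &lt;code&gt;passed&lt;/code&gt; fields below are illustrative assumptions, not a standard schema:&lt;/p&gt;

```python
from collections import Counter

def failure_patterns(results):
    """Count failures grouped by scorer name and by agent."""
    by_scorer = Counter()
    by_agent = Counter()
    for r in results:
        if not r["passed"]:
            by_scorer[r["scorer"]] += 1
            by_agent[r["agent"]] += 1
    return by_scorer.most_common(5), by_agent.most_common(5)

results = [
    {"scorer": "is_concise", "agent": "support-bot", "passed": False},
    {"scorer": "is_concise", "agent": "support-bot", "passed": False},
    {"scorer": "is_accurate", "agent": "sales-bot", "passed": True},
]
top_scorers, top_agents = failure_patterns(results)
print(top_scorers)  # [('is_concise', 2)]
```

&lt;p&gt;Even this crude grouping will surface whether failures cluster around one scorer or one agent, which tells you where to dig first.&lt;/p&gt;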

&lt;h3&gt;
  
  
  Step 3: Diagnose the Root Cause
&lt;/h3&gt;

&lt;p&gt;This is the most critical step. Once you've identified a pattern, you need to understand the &lt;em&gt;why&lt;/em&gt;. Is the agent failing because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system prompt is ambiguous?&lt;/li&gt;
&lt;li&gt;The underlying LLM has a knowledge gap?&lt;/li&gt;
&lt;li&gt;A specific tool is returning bad data?&lt;/li&gt;
&lt;li&gt;The reasoning logic is flawed?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This requires a powerful analysis engine (like our NovaPilot) that can sift through thousands of traces to find the common thread.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Generate Actionable Recommendations
&lt;/h3&gt;

&lt;p&gt;The diagnosis should lead to a specific, testable hypothesis for a fix. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis:&lt;/strong&gt; "The agent is being too verbose because the system prompt doesn't explicitly ask for conciseness."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation:&lt;/strong&gt; "Add the following instruction to the system prompt: 'Your answers should be clear and concise, under 200 words.'"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5: Implement the Change
&lt;/h3&gt;

&lt;p&gt;Implement the recommended fix. This could be a prompt change, a model swap, or a tweak to a tool's logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Re-evaluate and Compare
&lt;/h3&gt;

&lt;p&gt;Run the evaluation framework again on the same set of interactions with the new change. Compare the results. Did the scores for the &lt;code&gt;is_concise&lt;/code&gt; scorer improve? Did any other scores get worse (a regression)?&lt;/p&gt;
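&lt;p&gt;A minimal before/after comparison might look like the following. The per-scorer average-score dicts are an assumed shape, not any specific tool's API:&lt;/p&gt;

```python
def compare_runs(baseline, candidate, tolerance=0.02):
    """Return (improved, regressed) scorer names between two runs.

    Each run maps a scorer name to its average score in [0, 1];
    tolerance filters out noise-level movements.
    """
    improved, regressed = [], []
    for name, base in baseline.items():
        new = candidate.get(name, base)
        if new > base + tolerance:
            improved.append(name)
        elif base > new + tolerance:
            regressed.append(name)
    return improved, regressed

baseline = {"is_concise": 0.62, "is_accurate": 0.91}
candidate = {"is_concise": 0.88, "is_accurate": 0.84}
print(compare_runs(baseline, candidate))  # (['is_concise'], ['is_accurate'])
```

&lt;p&gt;Note the second return value: a fix that improves one scorer while regressing another is exactly the kind of trade-off this step is meant to catch.&lt;/p&gt;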

&lt;h3&gt;
  
  
  Step 7: Iterate
&lt;/h3&gt;

&lt;p&gt;Based on the results of the re-evaluation, you either deploy the change to production or you go back to Step 3 to refine your diagnosis. This is a continuous cycle.&lt;/p&gt;
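&lt;p&gt;Put together, the seven steps form a loop you could sketch like this. The &lt;code&gt;evaluate&lt;/code&gt;, &lt;code&gt;diagnose&lt;/code&gt;, and &lt;code&gt;apply_fix&lt;/code&gt; callables are placeholders for your own pipeline stages:&lt;/p&gt;

```python
def improvement_loop(interactions, evaluate, diagnose, apply_fix,
                     threshold=0.9, max_cycles=10):
    """Steps 1-7 as a loop: evaluate, diagnose, fix, re-evaluate, repeat."""
    for cycle in range(max_cycles):
        scores = evaluate(interactions)       # steps 1-2: evaluate at scale
        worst = min(scores, key=scores.get)   # weakest scorer this cycle
        if scores[worst] >= threshold:
            return cycle, scores              # good enough: deploy
        hypothesis = diagnose(worst)          # steps 3-4: root cause + fix idea
        apply_fix(hypothesis)                 # step 5: implement the change
        # the next pass re-evaluates and compares (steps 6-7)
    return max_cycles, scores
```

&lt;p&gt;The &lt;code&gt;max_cycles&lt;/code&gt; cap matters in practice: a loop that never converges is itself a signal that the diagnosis (Step 3) is wrong.&lt;/p&gt;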

&lt;h2&gt;
  
  
  The Goal: Faster Iteration
&lt;/h2&gt;

&lt;p&gt;The teams that build the best AI agents are the ones that can iterate through this feedback loop the fastest. If it takes you two weeks to manually diagnose a problem and test a fix, you'll be quickly outpaced by a team that can do it in two hours.&lt;/p&gt;

&lt;p&gt;This is why automation is key. Every step of this process, from trace extraction to root cause analysis to re-evaluation, should be as automated as possible.&lt;/p&gt;

&lt;p&gt;Your goal isn't just to evaluate your agents. It's to build a system that allows them to continuously and automatically improve.&lt;/p&gt;

&lt;p&gt;Noveum.ai's platform automates this entire feedback loop, from evaluation to root cause analysis to actionable recommendations for improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does your feedback loop for agent improvement look like today? Share your process!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monitoring vs. Evaluation: The Critical Distinction Most AI Devs Miss</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 31 Dec 2025 00:24:00 +0000</pubDate>
      <link>https://dev.to/imshashank/monitoring-vs-evaluation-the-critical-distinction-most-ai-devs-miss-1a3d</link>
      <guid>https://dev.to/imshashank/monitoring-vs-evaluation-the-critical-distinction-most-ai-devs-miss-1a3d</guid>
      <description>&lt;h2&gt;
  
  
  Are You Tracking the Right Things?
&lt;/h2&gt;

&lt;p&gt;In the world of DevOps and SRE, we're obsessed with monitoring. We track latency, error rates, CPU utilization, and requests per second. These metrics are essential for understanding the health of our systems.&lt;/p&gt;

&lt;p&gt;Naturally, when we started building AI agents, we applied the same mindset. We created dashboards to monitor our LLM API costs, token counts, and API error rates.&lt;/p&gt;

&lt;p&gt;But this is a critical mistake. For AI agents, &lt;strong&gt;monitoring is not the same as evaluation&lt;/strong&gt;, and confusing the two can lead to a false sense of security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Tells You If It's Running. Evaluation Tells You If It's Working.
&lt;/h2&gt;

&lt;p&gt;Let's break down the difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; is about tracking the operational health of your application. It answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many requests did we process?&lt;/li&gt;
&lt;li&gt;What was the average latency?&lt;/li&gt;
&lt;li&gt;How many times did the OpenAI API return a 500 error?&lt;/li&gt;
&lt;li&gt;How much did we spend on tokens today?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluation&lt;/strong&gt; is about assessing the quality and correctness of your agent's behavior. It answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the agent actually solve the user's problem?&lt;/li&gt;
&lt;li&gt;Did it follow the instructions in its system prompt?&lt;/li&gt;
&lt;li&gt;Did it provide factually accurate information?&lt;/li&gt;
&lt;li&gt;Did it violate any compliance or safety rules?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Dangerous Blind Spot
&lt;/h3&gt;

&lt;p&gt;You can have a perfectly monitored system that is a complete failure from an evaluation perspective. Your dashboard could be all green:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ 99.99% uptime&lt;/li&gt;
&lt;li&gt;✅ 500ms average latency&lt;/li&gt;
&lt;li&gt;✅ 0 API errors&lt;/li&gt;
&lt;li&gt;✅ Costs are within budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ 15% of the agent's responses are factually incorrect.&lt;/li&gt;
&lt;li&gt;❌ 10% of interactions violate your company's brand voice guidelines.&lt;/li&gt;
&lt;li&gt;❌ 5% of conversations expose sensitive user data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your monitoring dashboard tells you that your system is &lt;em&gt;running&lt;/em&gt;. It tells you nothing about whether your system is &lt;em&gt;working correctly&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prioritize Evaluation, Then Monitor
&lt;/h2&gt;

&lt;p&gt;For AI agents, evaluation is the more important discipline. You must first have confidence that your agent is behaving as intended. Only then should you focus on optimizing its performance and cost.&lt;/p&gt;

&lt;p&gt;The ideal approach, of course, is to integrate both. A comprehensive &lt;strong&gt;AI observability&lt;/strong&gt; platform should give you a single pane of glass that shows you both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Metrics:&lt;/strong&gt; Latency, cost, throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Metrics:&lt;/strong&gt; Accuracy, compliance, helpfulness, safety.&lt;/li&gt;
&lt;/ul&gt;
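&lt;p&gt;One way to picture that integration: record both kinds of metrics on every interaction, in one structure. A rough sketch, with all field names as illustrative assumptions:&lt;/p&gt;

```python
import time

def record_interaction(agent_reply, eval_scores, started_at, cost_usd):
    """One record holding both operational and quality signals."""
    return {
        # operational metrics (monitoring)
        "latency_ms": round((time.time() - started_at) * 1000, 1),
        "cost_usd": cost_usd,
        "response_chars": len(agent_reply),
        # quality metrics (evaluation)
        "scores": eval_scores,
        "passed_all": all(s >= 0.7 for s in eval_scores.values()),
    }

record = record_interaction(
    "Your order has shipped.",
    {"accuracy": 0.95, "conciseness": 0.9},
    started_at=time.time() - 0.4,
    cost_usd=0.0031,
)
print(record["passed_all"])  # True
```

&lt;p&gt;With both signal types on the same record, you can ask the questions that matter, like "did quality drop when we switched to the cheaper model?"&lt;/p&gt;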

&lt;p&gt;But if you have to choose where to start, start with evaluation. It's better to have a slow, expensive agent that works correctly than a fast, cheap agent that is silently causing harm to your users and your business.&lt;/p&gt;

&lt;p&gt;Stop conflating monitoring with evaluation. They are two different disciplines, and for AI agents, evaluation is the one that truly matters.&lt;/p&gt;

&lt;p&gt;To implement both monitoring and evaluation, &lt;a href="https://noveum.ai/en/solutions/llm-observability" rel="noopener noreferrer"&gt;Noveum.ai's LLM Observability Platform&lt;/a&gt; provides a unified dashboard for operational metrics and quality evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does your team distinguish between monitoring and evaluation for your AI apps?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Build an AI Agent Evaluation Framework That Scales</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Mon, 29 Dec 2025 00:18:00 +0000</pubDate>
      <link>https://dev.to/imshashank/how-to-build-an-ai-agent-evaluation-framework-that-scales-3570</link>
      <guid>https://dev.to/imshashank/how-to-build-an-ai-agent-evaluation-framework-that-scales-3570</guid>
      <description>&lt;h2&gt;
  
  
  The Scaling Problem
&lt;/h2&gt;

&lt;p&gt;So, you've built a great AI agent. You've tested it with a few dozen examples, and it works perfectly. Now, you're ready to deploy it to production, where it will handle thousands or even millions of conversations.&lt;/p&gt;

&lt;p&gt;Suddenly, your evaluation strategy breaks. You can't manually review every conversation. Your small test set doesn't cover the infinite variety of real-world user behavior. How do you ensure quality at scale?&lt;/p&gt;

&lt;p&gt;The answer is to build an &lt;strong&gt;automated, scalable evaluation framework&lt;/strong&gt;. Manual spot-checking is not a strategy; it's a liability.&lt;/p&gt;

&lt;p&gt;Here's a blueprint for building an evaluation system that can handle production-level traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7 Components of a Scalable Evaluation Framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Automated Trace Extraction
&lt;/h3&gt;

&lt;p&gt;Your framework must automatically capture the complete, detailed trace of every single agent interaction. This is your raw data. It should be a non-negotiable part of your agent's architecture to log every reasoning step, tool call, and output.&lt;/p&gt;
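&lt;p&gt;A bare-bones version of this capture can be a decorator wrapped around each tool. This is a sketch, not a production tracer:&lt;/p&gt;

```python
import functools
import time

TRACE = []  # in production, ship these records to your trace store

def traced(tool):
    """Record every call to a tool: name, inputs, output, duration."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = tool(*args, **kwargs)
        TRACE.append({
            "tool": tool.__name__,
            "args": args,
            "kwargs": kwargs,
            "output": result,
            "duration_ms": round((time.time() - start) * 1000, 2),
        })
        return result
    return wrapper

@traced
def get_order_status(order_id):
    return {"order_id": order_id, "status": "shipped"}  # stubbed tool

get_order_status("12345")
print(TRACE[0]["tool"])  # get_order_status
```

&lt;p&gt;The point is that tracing is opt-out, not opt-in: every tool gets wrapped, so nothing silently escapes the log.&lt;/p&gt;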

&lt;h3&gt;
  
  
  2. Intelligent Trace Parsing (The ETL Agent)
&lt;/h3&gt;

&lt;p&gt;Raw traces are often messy, unstructured JSON or text logs. You need a process to parse this raw data into a clean, structured format. At Noveum.ai, we use a dedicated AI agent for this—an ETL (Extract, Transform, Load) agent that reads the raw trace and intelligently extracts key information like tool calls, parameters, reasoning steps, and final outputs into a standardized schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A Comprehensive Scorer Library
&lt;/h3&gt;

&lt;p&gt;This is the core of your evaluation engine. You need a library of 70+ automated "scorers," each designed to evaluate a specific dimension of quality. These should cover everything from factual accuracy and instruction following to PII detection and token efficiency.&lt;/p&gt;
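&lt;p&gt;The shape of such a library can be sketched as a small registry. The two scorers below, &lt;code&gt;is_concise&lt;/code&gt; and a naive email check, are toy examples rather than production-grade detectors:&lt;/p&gt;

```python
import re

SCORERS = {}

def scorer(name):
    """Register an evaluation function under a name."""
    def register(fn):
        SCORERS[name] = fn
        return fn
    return register

@scorer("is_concise")
def is_concise(response, max_words=200):
    return 0.0 if len(response.split()) > max_words else 1.0

@scorer("no_pii_email")
def no_pii_email(response):
    # naive check: flags anything shaped like an email address
    return 0.0 if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", response) else 1.0

def run_scorers(response):
    return {name: fn(response) for name, fn in SCORERS.items()}

print(run_scorers("Contact me at jane@example.com"))
# {'is_concise': 1.0, 'no_pii_email': 0.0}
```

&lt;p&gt;A registry like this makes the library extensible: new quality dimensions become new registered functions, and the runner never changes.&lt;/p&gt;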

&lt;h3&gt;
  
  
  4. Automated Scorer Recommendation
&lt;/h3&gt;

&lt;p&gt;With 70+ scorers, which ones should you run on a given dataset? A truly scalable system uses another AI agent to analyze your dataset and recommend the top 10-15 most relevant scorers for your specific use case. This saves compute time and focuses your evaluation on what matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Aggregated Quality Assessment
&lt;/h3&gt;

&lt;p&gt;After running the scorers, you'll have thousands of individual data points. Your framework needs to aggregate these scores into a meaningful, high-level assessment of agent quality. This includes identifying trends, common failure modes, and overall performance against your business KPIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Automated Root Cause Analysis (NovaPilot)
&lt;/h3&gt;

&lt;p&gt;This is the most critical component. It's not enough to know &lt;em&gt;that&lt;/em&gt; your agent is failing. You need to know &lt;em&gt;why&lt;/em&gt;. A powerful analysis engine (like our NovaPilot) should be able to analyze all the failing traces and scores to diagnose the root cause of the problem. Is it a bad prompt? A faulty tool? A limitation of the model?&lt;/p&gt;

&lt;h3&gt;
  
  
  7. A Continuous Improvement Loop
&lt;/h3&gt;

&lt;p&gt;Finally, the framework must close the loop. The insights from the root cause analysis should feed directly back into the development process. The system should suggest specific, actionable fixes—like a revised system prompt or a change in model parameters—that will resolve the identified issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Manual to Automated
&lt;/h2&gt;

&lt;p&gt;Building this kind of framework is a significant engineering effort. But it's the only way to move from manual, unreliable spot-checking to a truly scalable, automated quality assurance process. It's the difference between building a prototype and building a production-ready AI system.&lt;/p&gt;

&lt;p&gt;If you're ready to implement this at scale, &lt;a href="https://noveum.ai/en/solutions/agent-evaluation" rel="noopener noreferrer"&gt;Noveum.ai's comprehensive evaluation platform&lt;/a&gt; automates all seven components of a scalable evaluation framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the biggest bottleneck you're facing in scaling your agent evaluation? Let's discuss.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Is Your AI Agent a Compliance Risk? How to Find Violations Hidden in Traces</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Fri, 26 Dec 2025 00:16:00 +0000</pubDate>
      <link>https://dev.to/imshashank/is-your-ai-agent-a-compliance-risk-how-to-find-violations-hidden-in-traces-3g0e</link>
      <guid>https://dev.to/imshashank/is-your-ai-agent-a-compliance-risk-how-to-find-violations-hidden-in-traces-3g0e</guid>
      <description>&lt;h2&gt;
  
  
  The Silent Risk
&lt;/h2&gt;

&lt;p&gt;Here's a thought that should keep every AI developer up at night: your agent might be silently violating compliance regulations like GDPR, HIPAA, or CCPA, and your standard evaluation metrics will never catch it.&lt;/p&gt;

&lt;p&gt;Compliance isn't about the final output. It's about the &lt;strong&gt;process&lt;/strong&gt;. It's about how the agent handles data, makes decisions, and interacts with the user at every step of its trajectory. A final answer can be perfectly correct and helpful, yet have been generated through a process that creates significant legal and financial risk for your company.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Violations Hide
&lt;/h2&gt;

&lt;p&gt;Let's take GDPR as an example. The principles of data minimization and purpose limitation are central. Now, consider a simple customer support agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The User's Request:&lt;/strong&gt; "I'd like to check the status of my order."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Compliant Trajectory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "I can help with that. Could you please provide your order number?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; "It's 12345."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; (Calls &lt;code&gt;getOrderStatus('12345')&lt;/code&gt; tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "Your order has shipped and is expected to arrive tomorrow."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is clean. The agent only asked for the data it absolutely needed (the order number) to fulfill the specific purpose.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Non-Compliant Trajectory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "To help you, I need to verify your identity. Please provide your full name, email address, and date of birth."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; (Provides the information)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; (Stores this PII in its conversation history)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "Thank you. Now, what is your order number?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; "It's 12345."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; (Calls &lt;code&gt;getOrderStatus('12345')&lt;/code&gt; tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; "Your order has shipped and is expected to arrive tomorrow."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both agents successfully answered the user's question. But the second agent created a massive compliance risk. It violated the principle of data minimization by collecting PII that was not necessary for the task.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect These Violations
&lt;/h2&gt;

&lt;p&gt;You cannot detect this kind of failure by looking at the final output. You must analyze the entire trace of the agent's interaction. Your evaluation framework needs automated scorers that can check for compliance at each step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PII Detection Scorer:&lt;/strong&gt; Does the agent's internal reasoning or final output contain Personally Identifiable Information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Minimization Scorer:&lt;/strong&gt; Did the agent ask for more information than was strictly necessary to complete the task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purpose Limitation Scorer:&lt;/strong&gt; Did the agent use the provided data for any purpose other than what the user consented to?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Following Scorer:&lt;/strong&gt; Did the agent violate any compliance-related rules in its system prompt (e.g., "Never store user PII in your memory")?&lt;/li&gt;
&lt;/ul&gt;
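&lt;p&gt;A toy version of a PII / data-minimization scorer can scan every trace step against a pattern list. A real deployment would use a dedicated PII detector; the patterns and trace shape here are assumptions for illustration:&lt;/p&gt;

```python
import re

# Illustrative patterns only; real systems use proper PII detection.
PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "date_of_birth": r"\b\d{2}/\d{2}/\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def pii_violations(trace, allowed=("order_number",)):
    """Scan every step of a trace for PII the task did not require."""
    violations = []
    for step in trace:
        for label, pattern in PII_PATTERNS.items():
            if label not in allowed and re.search(pattern, step["content"]):
                violations.append({"role": step["role"], "pii_type": label})
    return violations

trace = [
    {"role": "agent", "content": "Please provide your email and date of birth."},
    {"role": "user", "content": "jane@example.com, born 01/02/1990"},
]
print(pii_violations(trace))
# [{'role': 'user', 'pii_type': 'email'}, {'role': 'user', 'pii_type': 'date_of_birth'}]
```

&lt;p&gt;Crucially, this runs over the whole trace, not just the final answer, which is exactly where the non-compliant trajectory above would be caught.&lt;/p&gt;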

&lt;p&gt;Traditional software testing doesn't prepare us for this. We need to adopt a new mindset where we are constantly auditing the &lt;em&gt;process&lt;/em&gt; of our AI agents, not just the results.&lt;/p&gt;

&lt;p&gt;By implementing trajectory-based compliance evaluation, you can automatically flag these hidden risks before they become a major incident.&lt;/p&gt;

&lt;p&gt;Noveum.ai's &lt;a href="https://noveum.ai/en/solutions/ai-agent-monitoring" rel="noopener noreferrer"&gt;AI Agent Monitoring solution&lt;/a&gt; includes compliance scorers that automatically detect violations in your agent's traces across GDPR, HIPAA, and other regulatory frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How are you currently ensuring your AI agents are compliant? Let's share strategies in the comments.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>privacy</category>
      <category>agents</category>
      <category>security</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Hidden Costs of Inefficient AI Agents (And How to Fix Them)</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 24 Dec 2025 00:16:00 +0000</pubDate>
      <link>https://dev.to/imshashank/the-hidden-costs-of-inefficient-ai-agents-and-how-to-fix-them-2k3d</link>
      <guid>https://dev.to/imshashank/the-hidden-costs-of-inefficient-ai-agents-and-how-to-fix-them-2k3d</guid>
      <description>&lt;h2&gt;
  
  
  Looking Beyond Token Counts
&lt;/h2&gt;

&lt;p&gt;Every developer working with LLMs is acutely aware of token costs. We optimize prompts, choose smaller models, and set max token limits. But this is only scratching the surface of AI agent costs.&lt;/p&gt;

&lt;p&gt;The real, hidden costs of AI agents aren't in the token counts of the final output. They're buried in the &lt;strong&gt;inefficiencies of the agent's trajectory&lt;/strong&gt;—the step-by-step process of reasoning and tool use that leads to the final answer.&lt;/p&gt;

&lt;p&gt;Let's look at a concrete example of two agents tasked with answering, "What is the current price of Apple stock and what was the biggest news about them this week?"&lt;/p&gt;

&lt;h3&gt;
  
  
  The Inefficient Agent Trajectory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;User Asks Question.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (500 tokens):&lt;/strong&gt; "Okay, I need to find the stock price and the latest news. I'll start with the stock price."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Calls &lt;code&gt;getStockPrice('AAPL')&lt;/code&gt; Tool.&lt;/strong&gt; (1 API call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (400 tokens):&lt;/strong&gt; "Great, I have the price. Now I need to find the news."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Calls &lt;code&gt;searchNews('Apple')&lt;/code&gt; Tool.&lt;/strong&gt; (1 API call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (300 tokens):&lt;/strong&gt; "Okay, I have the news. Now I need to combine them into a final answer."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent Provides Final Answer (200 tokens).&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total Cost:&lt;/strong&gt; 1400 LLM tokens + 2 tool calls (sequential)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Efficient Agent Trajectory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;User Asks Question.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (200 tokens):&lt;/strong&gt; "I need two pieces of information: stock price and news. I can get these at the same time."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Calls &lt;code&gt;getStockPrice('AAPL')&lt;/code&gt; and &lt;code&gt;searchNews('Apple')&lt;/code&gt; in Parallel.&lt;/strong&gt; (2 API calls, but in parallel)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasons (200 tokens):&lt;/strong&gt; "I have both pieces of information. I will now synthesize them."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent Provides Final Answer (150 tokens).&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total Cost:&lt;/strong&gt; 550 LLM tokens + 2 tool calls (in parallel)&lt;/li&gt;
&lt;/ul&gt;
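&lt;p&gt;In Python, the efficient trajectory's simultaneous tool calls map naturally onto &lt;code&gt;asyncio.gather&lt;/code&gt;. The tools below are stubs with made-up return values:&lt;/p&gt;

```python
import asyncio

async def get_stock_price(symbol):
    await asyncio.sleep(0.1)  # stand-in for a real API call
    return {"symbol": symbol, "price": 272.50}  # made-up value

async def search_news(query):
    await asyncio.sleep(0.1)  # stand-in for a real API call
    return {"query": query, "headline": "Apple announces new chip"}  # made-up

async def efficient_agent():
    # Both calls run concurrently: total wait is about 0.1s, not 0.2s.
    price, news = await asyncio.gather(
        get_stock_price("AAPL"),
        search_news("Apple"),
    )
    return f"{price['symbol']} is at {price['price']}; top story: {news['headline']}"

print(asyncio.run(efficient_agent()))
```

&lt;p&gt;Whether the agent framework exposes parallel tool calls this directly varies, but the latency math is the same: independent calls should overlap, not queue.&lt;/p&gt;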

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Both agents produced the same correct answer. But the efficient agent was roughly &lt;strong&gt;60% cheaper&lt;/strong&gt; (550 vs. 1,400 LLM tokens) and likely much faster because it executed its tool calls in parallel.&lt;/p&gt;

&lt;p&gt;Now, imagine this inefficiency scaled across millions of interactions. The hidden costs become astronomical.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Find and Fix Inefficiencies
&lt;/h2&gt;

&lt;p&gt;You can't find these problems by looking at the final output. You have to analyze the entire trajectory. Your evaluation framework should be asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redundant Tool Calls:&lt;/strong&gt; Is the agent calling the same tool with the same parameters multiple times in a single trajectory?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verbose Reasoning:&lt;/strong&gt; Are the internal reasoning steps unnecessarily long and complex?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential vs. Parallel:&lt;/strong&gt; Is the agent calling tools one by one when it could be executing them in parallel?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suboptimal Tool Selection:&lt;/strong&gt; Is it using an expensive, powerful tool for a simple task that a cheaper tool could handle?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where true cost optimization for AI agents happens. It's not about nickel-and-diming your token counts. It's about fundamentally improving the efficiency of your agent's decision-making process.&lt;/p&gt;

&lt;p&gt;By implementing trajectory analysis, you can identify these hidden costs and provide targeted feedback to your system prompt or agent logic to fix them, leading to massive savings at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the most inefficient agent behavior you've seen in production? Share your war stories!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>5 Types of AI Hallucinations (And How to Detect Them)</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Mon, 22 Dec 2025 00:15:00 +0000</pubDate>
      <link>https://dev.to/imshashank/5-types-of-ai-hallucinations-and-how-to-detect-them-3ab9</link>
      <guid>https://dev.to/imshashank/5-types-of-ai-hallucinations-and-how-to-detect-them-3ab9</guid>
      <description>&lt;h2&gt;
  
  
  The Many Faces of Hallucination
&lt;/h2&gt;

&lt;p&gt;When developers hear "AI hallucination," they usually picture an LLM confidently making up a completely false fact, like claiming the moon is made of cheese. While this &lt;strong&gt;Factual Hallucination&lt;/strong&gt; is a real problem, it's only one piece of a much larger puzzle.&lt;/p&gt;

&lt;p&gt;In production AI agents, there are several other, more subtle types of hallucinations that can be just as damaging. If your evaluation framework only checks for factual errors, you're leaving your application vulnerable.&lt;/p&gt;

&lt;p&gt;Here are the five key types of hallucinations you need to be detecting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Type 1: Factual Hallucination
&lt;/h3&gt;

&lt;p&gt;This is the classic definition. The agent states something as a fact that is verifiably false in the real world.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; "The first person to walk on Mars was Neil Armstrong."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires external knowledge validation, often through a search tool or a curated knowledge base.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type 2: Contextual Hallucination (RAG Failure)
&lt;/h3&gt;

&lt;p&gt;This is particularly dangerous in Retrieval-Augmented Generation (RAG) systems. The agent is given a specific context (like a document or a database query result) and is instructed to answer based &lt;em&gt;only&lt;/em&gt; on that context. A contextual hallucination occurs when the agent ignores the context and uses its general knowledge instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A user asks, "According to the provided legal document, what is the termination clause?" The agent responds with a generic, boilerplate termination clause instead of the specific one from the document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires comparing the agent's response directly against the provided context to ensure all claims are supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type 3: Instruction Hallucination
&lt;/h3&gt;

&lt;p&gt;This happens when the agent directly violates one of the core instructions in its system prompt.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; The system prompt says, "You are a helpful assistant. You must never be rude to the user." The agent responds to a user's question with, "That's a stupid question."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires parsing the system prompt into a set of rules and programmatically checking the agent's behavior against those rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type 4: Role Hallucination
&lt;/h3&gt;

&lt;p&gt;This is a subtle but important failure where the agent forgets its assigned persona or role.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; An agent is designed to be a playful, pirate-themed chatbot for a children's game. Midway through the conversation, it drops the persona and starts speaking like a formal, technical document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires evaluating the agent's tone, style, and vocabulary against the persona defined in the system prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type 5: Consistency Hallucination
&lt;/h3&gt;

&lt;p&gt;The agent contradicts itself within the same conversation, showing a lack of stable reasoning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; In the first turn, the agent says, "I cannot access external websites." Three turns later, it says, "I have just checked that website for you."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Requires analyzing the entire conversation history for logical contradictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most off-the-shelf evaluation frameworks are only good at catching Type 1 hallucinations. They can't detect when your agent is ignoring its context, violating its instructions, or breaking character. This is why so many agents that perform well in benchmarks fail spectacularly in production.&lt;/p&gt;

&lt;p&gt;A robust evaluation strategy must include specific scorers for all five types of hallucinations. You need to analyze the agent's output not just against the real world, but also against its provided context, its system prompt, and its own conversation history.&lt;/p&gt;
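&lt;p&gt;As one concrete example, a contextual-support check (Type 2) can be roughly approximated with lexical overlap. Production systems typically use an NLI model or an LLM judge; this sketch only shows the shape of the check:&lt;/p&gt;

```python
def context_support(response, context, threshold=0.5):
    """Fraction of response sentences whose content words appear in the context.

    Crude lexical proxy for contextual hallucination detection.
    """
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    supported = 0
    for sentence in sentences:
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if words and sum(w in ctx_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / max(len(sentences), 1)

context = "The agreement may be terminated with thirty days written notice."
grounded = "The agreement may be terminated with thirty days written notice."
invented = "Either party can cancel immediately without penalty."
print(context_support(grounded, context), context_support(invented, context))
# 1.0 0.0
```

&lt;p&gt;A low score means the response is saying things the context never did, which is the defining symptom of a RAG failure.&lt;/p&gt;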

&lt;p&gt;Noveum.ai's &lt;a href="https://noveum.ai/en/solutions/llm-observability" rel="noopener noreferrer"&gt;LLM Observability Platform&lt;/a&gt; includes dedicated scorers for detecting all five types of hallucinations across your entire agent fleet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which type of hallucination do you find most challenging to deal with in your projects? Let's discuss below.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>agents</category>
      <category>discuss</category>
    </item>
    <item>
      <title>How to Analyze AI Agent Traces Like a Detective</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Fri, 19 Dec 2025 00:14:00 +0000</pubDate>
      <link>https://dev.to/imshashank/how-to-analyze-ai-agent-traces-like-a-detective-3a03</link>
      <guid>https://dev.to/imshashank/how-to-analyze-ai-agent-traces-like-a-detective-3a03</guid>
      <description>&lt;h2&gt;
  
  
  The Final Output is Just the Tip of the Iceberg
&lt;/h2&gt;

&lt;p&gt;When an AI agent fails, the natural instinct is to look at the final, incorrect output and try to figure out what went wrong. This is like a detective arriving at a crime scene and only looking at the victim, ignoring all the surrounding evidence.&lt;/p&gt;

&lt;p&gt;To truly understand and fix agent failures, you need to become a detective and investigate the &lt;strong&gt;trace&lt;/strong&gt;. A trace is the complete, step-by-step log of everything the agent did during an interaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every internal thought or reasoning step.&lt;/li&gt;
&lt;li&gt;Every tool it decided to call.&lt;/li&gt;
&lt;li&gt;The exact parameters it used for each tool call.&lt;/li&gt;
&lt;li&gt;The raw output it received from each tool.&lt;/li&gt;
&lt;li&gt;Every decision it made based on new information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analyzing these traces is the single most powerful debugging technique for AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  A 6-Step Framework for Trace Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mn1myxo4qm6ofj48g54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mn1myxo4qm6ofj48g54.png" alt="Traces being analyzed in the Noveum.ai platform" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a systematic approach to dissecting an agent trace to find the root cause of any failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Understand the Goal
&lt;/h3&gt;

&lt;p&gt;First, clearly define what a successful outcome would have looked like. What was the user's intent? What was the agent &lt;em&gt;supposed&lt;/em&gt; to do according to its system prompt?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Follow the Trajectory
&lt;/h3&gt;

&lt;p&gt;Start from the beginning of the trace and walk through each step of the agent's reasoning process. Don't make assumptions. Read the agent's internal monologue. Does its chain of thought make logical sense?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Identify Key Decision Points
&lt;/h3&gt;

&lt;p&gt;Pinpoint the exact moments where the agent made a choice. This could be deciding which tool to use, what parameters to pass, or how to interpret a tool's response. Was the choice it made the optimal one?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Scrutinize Tool Calls
&lt;/h3&gt;

&lt;p&gt;This is often where things go wrong. For every tool call, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was this the right tool for this specific sub-task?&lt;/li&gt;
&lt;li&gt;Were the parameters passed to the tool correct and well-formed?&lt;/li&gt;
&lt;li&gt;Was the tool's output what you expected? Did the agent handle an error or unexpected output gracefully?&lt;/li&gt;
&lt;/ul&gt;
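&lt;p&gt;These questions can be partially automated. A rough sketch, assuming tool calls are logged as plain dicts and each tool's required parameters are known; the trace shape here is illustrative, not any specific framework's format:&lt;/p&gt;

```python
def audit_tool_calls(tool_calls, required_params):
    """Flag malformed tool calls and raw tool errors in a trace.

    tool_calls: list of dicts like {"tool": ..., "params": {...}, "output": ...}
    required_params: dict mapping tool name to its required parameter names
    """
    findings = []
    for i, call in enumerate(tool_calls):
        required = required_params.get(call["tool"])
        if required is None:
            findings.append((i, "unknown tool: " + call["tool"]))
            continue
        missing = [p for p in required if p not in call["params"]]
        if missing:
            findings.append((i, "missing params: " + ", ".join(missing)))
        if str(call.get("output", "")).startswith("ERROR"):
            findings.append((i, "tool returned an error"))
    return findings
```

&lt;p&gt;Running this over every trace surfaces malformed calls even when the agent's final answer happens to look fine.&lt;/p&gt;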

&lt;h3&gt;
  
  
  Step 5: Check for Compliance and Constraint Violations
&lt;/h3&gt;

&lt;p&gt;At each step, cross-reference the agent's action with its system prompt. Did it violate any of its core instructions? For example, if it's not supposed to give financial advice, did it call a stock price API?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Pinpoint the Root Cause
&lt;/h3&gt;

&lt;p&gt;By following these steps, you can move beyond simply identifying the failure and pinpoint its origin. Was the root cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Reasoning Error?&lt;/strong&gt; The agent's logic was flawed.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Tool Use Error?&lt;/strong&gt; The agent used the wrong tool or used it incorrectly.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Prompt Issue?&lt;/strong&gt; The system prompt was ambiguous or incomplete.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Model Limitation?&lt;/strong&gt; The underlying LLM simply wasn't capable of the required reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Practical Example
&lt;/h3&gt;

&lt;p&gt;Imagine a customer support agent that gives a user the wrong refund amount. The trace might reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent correctly understands the user's request for a refund.&lt;/li&gt;
&lt;li&gt;Agent correctly calls the &lt;code&gt;getOrderDetails&lt;/code&gt; tool with the right order ID.&lt;/li&gt;
&lt;li&gt;The tool returns the correct order data, including &lt;code&gt;price: 99.99&lt;/code&gt; and &lt;code&gt;discount: 10.00&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The agent's reasoning step says: "The refund amount is the price. I will refund $99.99."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root Cause:&lt;/strong&gt; A reasoning error. The agent failed to account for the discount.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you know exactly what to fix. You don't need to debug the tool or the data. You need to improve the agent's reasoning, likely by updating the system prompt to explicitly mention how to handle discounts.&lt;/p&gt;
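&lt;p&gt;A failure like this can also be guarded against with a targeted check. A sketch for this specific scenario, assuming the order data from the tool output is available as a dict (the field names mirror the example above and are otherwise hypothetical):&lt;/p&gt;

```python
import math

def check_refund_amount(order, stated_refund):
    """Compare the agent's stated refund against price minus discount."""
    expected = round(order["price"] - order["discount"], 2)
    ok = math.isclose(expected, stated_refund, abs_tol=0.005)
    return {"expected": expected, "stated": stated_refund, "ok": ok}
```

&lt;p&gt;For the trace above, a price of 99.99 and a discount of 10.00 give an expected refund of 89.99, so the agent's stated $99.99 fails the check.&lt;/p&gt;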

&lt;p&gt;Without trace analysis, you're just debugging in the dark.&lt;/p&gt;

&lt;p&gt;To streamline your trace analysis process, &lt;a href="https://noveum.ai/en/solutions/debugging" rel="noopener noreferrer"&gt;Noveum.ai's Debugging and Tracing solution&lt;/a&gt; provides hierarchical trace visualization and automated root cause analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have you ever analyzed an agent trace to find a surprising root cause? Share your story!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>agents</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Beyond Accuracy: The 73+ Dimensions of AI Agent Quality</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 17 Dec 2025 00:12:00 +0000</pubDate>
      <link>https://dev.to/imshashank/beyond-accuracy-the-73-dimensions-of-ai-agent-quality-41ni</link>
      <guid>https://dev.to/imshashank/beyond-accuracy-the-73-dimensions-of-ai-agent-quality-41ni</guid>
      <description>&lt;h2&gt;
  
  
  "Is My Agent Good?" Is the Wrong Question
&lt;/h2&gt;

&lt;p&gt;When a developer asks, "Is my AI agent good?" they're often looking for a single score, like an accuracy percentage. This is a dangerous oversimplification. An AI agent is a complex system, and its quality can't be boiled down to one number.&lt;/p&gt;

&lt;p&gt;An agent isn't just "good" or "bad." It can be factually accurate but dangerously non-compliant. It can be helpful but horribly inefficient. It can be safe but provide a terrible user experience.&lt;/p&gt;

&lt;p&gt;To truly understand your agent's performance, you need to evaluate it across multiple dimensions simultaneously. At Noveum.ai, we've identified over 73 distinct scorers, which we group into several key categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2m8pm5vmcwaopcmcsh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2m8pm5vmcwaopcmcsh1.png" alt="Agent Health Dashboard from Noveum.ai" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Dimensions of Agent Quality
&lt;/h2&gt;

&lt;p&gt;Here are some of the most critical dimensions you should be tracking:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Correctness Dimensions
&lt;/h3&gt;

&lt;p&gt;This is about the factual and logical integrity of the agent's output.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Factual Accuracy:&lt;/strong&gt; Does the agent provide information that is verifiably true?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Following:&lt;/strong&gt; Does the agent adhere to the explicit instructions in its system prompt?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Adherence:&lt;/strong&gt; Does the agent use only the information provided in the given context, especially in RAG systems?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Safety and Security Dimensions
&lt;/h3&gt;

&lt;p&gt;These scorers protect your users and your company from harm.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Toxicity Detection:&lt;/strong&gt; Does the agent avoid generating hateful, offensive, or inappropriate language?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII Protection:&lt;/strong&gt; Does it refuse to process or reveal Personally Identifiable Information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Injection Resistance:&lt;/strong&gt; Can the agent be tricked into violating its instructions by a malicious user prompt?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Efficiency Dimensions
&lt;/h3&gt;

&lt;p&gt;An agent that works but is slow and expensive is a liability in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool Call Efficiency:&lt;/strong&gt; Is the agent making redundant or unnecessary API calls?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Efficiency:&lt;/strong&gt; Is it being overly verbose, driving up LLM costs?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning Efficiency:&lt;/strong&gt; Does it get stuck in loops or take a convoluted path to a simple answer?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. User Experience Dimensions
&lt;/h3&gt;

&lt;p&gt;This measures how it &lt;em&gt;feels&lt;/em&gt; to interact with your agent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conversation Coherence:&lt;/strong&gt; Does the agent maintain a logical and easy-to-follow conversation flow?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance:&lt;/strong&gt; Does it stay on topic and provide answers that are relevant to the user's query?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helpfulness:&lt;/strong&gt; Does it actually solve the user's underlying problem?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Compliance Dimensions
&lt;/h3&gt;

&lt;p&gt;For any enterprise application, this is non-negotiable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory Compliance:&lt;/strong&gt; Does the agent's behavior align with legal frameworks like GDPR, HIPAA, or CCPA?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Company Policy Adherence:&lt;/strong&gt; Does it follow your internal guidelines for brand voice, tone, and values?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Multi-Dimensional Evaluation Matters
&lt;/h2&gt;

&lt;p&gt;Most teams only look at one or two of these categories, typically correctness. This creates massive blind spots. You might have an agent that's 99% factually accurate but leaks PII in 5% of conversations. Without a multi-dimensional evaluation framework, you'd never know until it's too late.&lt;/p&gt;

&lt;p&gt;The only way to de-risk your AI agent for production is to have a comprehensive suite of scorers that evaluates its performance from every possible angle. Stop chasing a single accuracy score and start building a holistic view of your agent's quality.&lt;/p&gt;
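&lt;p&gt;In code, the difference is simply refusing to collapse dimensions into one number. A minimal sketch (the dimension names and thresholds are illustrative):&lt;/p&gt;

```python
def quality_report(scores, thresholds, default_threshold=0.8):
    """Produce per-dimension pass/fail instead of a single blended score."""
    report = {}
    for dim, score in scores.items():
        threshold = thresholds.get(dim, default_threshold)
        report[dim] = {"score": score, "passed": score >= threshold}
    return report

def is_production_ready(report):
    # One failing dimension fails the agent, however high the others are.
    return all(entry["passed"] for entry in report.values())
```

&lt;p&gt;An agent with 0.99 factual accuracy but 0.70 PII protection fails this gate, which is exactly the blind spot a single accuracy score hides.&lt;/p&gt;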

&lt;p&gt;Noveum.ai's &lt;a href="https://noveum.ai/en/solutions/scorers" rel="noopener noreferrer"&gt;comprehensive scorer library&lt;/a&gt; includes 73+ pre-built scorers that evaluate agents across all critical dimensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which dimension do you think is most overlooked by developers today? Share your thoughts below!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your System Prompt is Your Ground Truth: Ditch Manual Labeling for AI Agent Evaluation</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Mon, 15 Dec 2025 02:38:00 +0000</pubDate>
      <link>https://dev.to/imshashank/your-system-prompt-is-your-ground-truth-ditch-manual-labeling-for-ai-agent-evaluation-2n5j</link>
      <guid>https://dev.to/imshashank/your-system-prompt-is-your-ground-truth-ditch-manual-labeling-for-ai-agent-evaluation-2n5j</guid>
      <description>&lt;h2&gt;
  
  
  The Manual Labeling Trap
&lt;/h2&gt;

&lt;p&gt;Here's a hard truth for developers building AI agents: if you're relying on manual labeling to create your evaluation datasets, you're setting yourself up for failure.&lt;/p&gt;

&lt;p&gt;We've seen it time and time again. Teams spend months and thousands of dollars hiring annotators to create a "golden dataset." They write complex guidelines, hold training sessions, and run quality checks. The result? A dataset that is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expensive:&lt;/strong&gt; Manual annotation is a significant budget drain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow:&lt;/strong&gt; It can take weeks or months to label a sufficiently large dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent:&lt;/strong&gt; Human annotators are subjective. Two different people will often label the same interaction differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brittle:&lt;/strong&gt; The moment you change your agent's system prompt or add a new tool, your entire dataset becomes obsolete.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach is a dead end. It doesn't scale, and it can't keep up with the pace of modern AI development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1latjk3krgftwhh3yfs9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1latjk3krgftwhh3yfs9.png" alt="Various built-in evaluators from Noveum.ai" width="728" height="902"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paradigm Shift: System Prompt as Ground Truth
&lt;/h2&gt;

&lt;p&gt;There's a better way, and it's been hiding in plain sight: &lt;strong&gt;your system prompt &lt;em&gt;is&lt;/em&gt; your ground truth.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about it. Your system prompt is the constitution for your AI agent. It explicitly defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Agent's Role:&lt;/strong&gt; What is its designated function? (e.g., "You are a senior software engineer helping with code reviews.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Its Constraints:&lt;/strong&gt; What are the hard rules it must never break? (e.g., "You must never suggest code that introduces security vulnerabilities.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Its Instructions:&lt;/strong&gt; How should it behave in specific scenarios? (e.g., "When you see a logic error, provide a corrected code snippet and explain the reasoning.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Its Values:&lt;/strong&gt; What principles should guide its behavior? (e.g., "Prioritize clarity and maintainability in your suggestions.")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything the agent does can, and should, be evaluated against this foundational document. You don't need a human to tell you if the agent followed the rules. You just need a system that can programmatically check the agent's behavior against the prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Concrete Example
&lt;/h3&gt;

&lt;p&gt;Let's say your system prompt includes this instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You are a customer support agent for an e-commerce store. You must be polite, professional, and never discuss politics or religion."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of manually labeling thousands of conversations, you can create automated scorers that check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;is_polite()&lt;/code&gt;: Analyzes the agent's language for politeness.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_professional()&lt;/code&gt;: Checks for slang or overly casual language.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;avoids_prohibited_topics()&lt;/code&gt;: Scans the conversation for keywords related to politics or religion.&lt;/li&gt;
&lt;/ul&gt;
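&lt;p&gt;The last of these is the easiest to sketch. A deliberately simple keyword version (the keyword lists are illustrative; a production scorer would more likely use an embedding- or LLM-based classifier):&lt;/p&gt;

```python
# Illustrative keyword lists; real deployments would use a classifier.
PROHIBITED_TOPICS = {
    "politics": ["election", "senator", "political party"],
    "religion": ["church", "scripture", "religious"],
}

def avoids_prohibited_topics(conversation_text):
    """Return (passed, topics_hit) for a whole conversation transcript."""
    lowered = conversation_text.lower()
    hits = [
        topic
        for topic, keywords in PROHIBITED_TOPICS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return (len(hits) == 0, hits)
```

&lt;p&gt;The same pattern works for any hard constraint in the prompt: derive a check, run it on every conversation, and the constraint never needs a human label.&lt;/p&gt;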

&lt;p&gt;These aren't subjective labels; they are objective, automated checks derived directly from your requirements. This is the foundation of a scalable, reliable, and cost-effective evaluation strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benefits of This Approach
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; You can evaluate thousands of interactions in minutes, not months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Effective:&lt;/strong&gt; It eliminates the need for expensive manual annotation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; The evaluation is objective and repeatable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agility:&lt;/strong&gt; When you update your system prompt, you simply update your scorers. Your entire evaluation framework adapts instantly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system prompt is the ultimate source of truth for your agent's behavior. Stop wasting time and money on manual labeling and start building an evaluation framework that uses your prompt as its guide.&lt;/p&gt;

&lt;p&gt;To see how this approach works in practice, explore &lt;a href="https://noveum.ai/en/solutions/agent-evaluation" rel="noopener noreferrer"&gt;Noveum.ai's Agent Evaluation Framework&lt;/a&gt;, which uses system prompts as ground truth for automated evaluation without manual labeling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How are you currently defining ground truth for your agents? Let's discuss in the comments.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>agents</category>
    </item>
    <item>
      <title>Stop Evaluating AI Agents Like ML Models: A Paradigm Shift for Developers</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Fri, 12 Dec 2025 02:09:00 +0000</pubDate>
      <link>https://dev.to/imshashank/stop-evaluating-ai-agents-like-ml-models-a-paradigm-shift-for-developers-4pgf</link>
      <guid>https://dev.to/imshashank/stop-evaluating-ai-agents-like-ml-models-a-paradigm-shift-for-developers-4pgf</guid>
      <description>&lt;h2&gt;
  
  
  The Flaw in Our Thinking
&lt;/h2&gt;

&lt;p&gt;For years, we've been conditioned to evaluate machine learning models with a standard set of metrics: accuracy, precision, recall, F1-score. We feed the model an input, check the output against a ground truth label, and score it. This works perfectly for tasks like classification or regression.&lt;/p&gt;

&lt;p&gt;But most developers are now realizing this approach completely breaks down for AI agents. Why? Because an AI agent isn't just producing a single output. It's executing a complex, multi-step &lt;strong&gt;trajectory of decisions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Applying simple input/output metrics to an agent is like judging a chess grandmaster based only on whether they won or lost, without analyzing the entire game. You miss the brilliance, the blunders, and the critical turning points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvshz7msvbp74tkez6pz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvshz7msvbp74tkez6pz2.png" alt="Visualization of an AI agent's full trajectory with Noveum.ai" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From Single Predictions to Complex Trajectories
&lt;/h2&gt;

&lt;p&gt;Let's break down a typical agent's workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Receives User Input:&lt;/strong&gt; The agent ingests the initial prompt or query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasons About the Problem:&lt;/strong&gt; It forms an internal plan or hypothesis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decides on a Tool:&lt;/strong&gt; It selects a tool (e.g., an API call, a database query, a web search) from its available arsenal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Receives Tool Output:&lt;/strong&gt; It gets the result from the tool call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasons About the Result:&lt;/strong&gt; It analyzes the new information and updates its plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decides on the Next Action:&lt;/strong&gt; This could be calling another tool, asking a clarifying question, or formulating the final answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provides Final Response:&lt;/strong&gt; The agent delivers the result to the user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you only evaluate the final response, you're blind to potential failures in steps 2 through 6. The agent could have reached the right answer through a horribly inefficient or even incorrect process. This is a ticking time bomb in a production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  A New Framework: Trajectory-Based Evaluation
&lt;/h2&gt;

&lt;p&gt;To properly evaluate an agent, you must analyze its entire decision-making journey. This requires a shift in mindset and tooling. Instead of asking "Was the answer correct?", you need to ask a series of deeper questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Adherence:&lt;/strong&gt; Did the agent follow its core system prompt at every step of the conversation? If it was told to be a helpful pirate, did it maintain that persona?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical Coherence:&lt;/strong&gt; Was the reasoning sound at each decision point? Did the agent make logical leaps or get stuck in loops?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Use Efficiency:&lt;/strong&gt; Did it use the right tools for the job? Did it call them in the correct sequence? Could it have achieved the same result with fewer calls?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robustness and Edge Cases:&lt;/strong&gt; How did the agent handle unexpected tool outputs, errors, or ambiguous user queries?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why traditional metrics fail. You can't capture the nuance of an agent's performance with a single number. You need a framework that can dissect the entire process.&lt;/p&gt;
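&lt;p&gt;One way to make these questions concrete is to run every check against every step of the trajectory, not just once against the final answer. A minimal sketch (the step shape and checks are illustrative):&lt;/p&gt;

```python
def evaluate_trajectory(steps, checks):
    """Apply each named check to each step; any failure fails the run.

    steps: ordered list of step records (reasoning, tool call, response, ...)
    checks: dict mapping a check name to a predicate over a single step
    """
    failures = []
    for i, step in enumerate(steps):
        for name, check in checks.items():
            if not check(step):
                failures.append({"step": i, "check": name})
    return {"passed": len(failures) == 0, "failures": failures}
```

&lt;p&gt;Because the result names the exact step and check that failed, a bad run points you straight at the decision point to investigate.&lt;/p&gt;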

&lt;h2&gt;
  
  
  What This Means for You
&lt;/h2&gt;

&lt;p&gt;As a developer building with AI agents, you need to move beyond simple test cases. Your evaluation suite should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trace Analysis:&lt;/strong&gt; The ability to log and inspect the full trajectory of every agent interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Dimensional Scoring:&lt;/strong&gt; A system that can score not just the final output, but also the quality of the reasoning, tool use, and adherence to constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Evaluation:&lt;/strong&gt; A way to run these complex evaluations at scale, so you're not manually inspecting thousands of traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stop thinking in terms of input/output. Start thinking in terms of trajectories. It's the only way to build reliable, production-ready AI agents.&lt;/p&gt;

&lt;p&gt;If you're looking to implement trajectory-based evaluation for your agents, check out &lt;a href="https://noveum.ai/en/solutions/ai-agent-monitoring" rel="noopener noreferrer"&gt;Noveum.ai's AI Agent Monitoring solution&lt;/a&gt;, which provides comprehensive trace analysis and multi-dimensional evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the biggest mistake you've seen in agent evaluation? Share your thoughts in the comments!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to Build an AI Agent Evaluation Framework from Scratch</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 10 Dec 2025 08:55:17 +0000</pubDate>
      <link>https://dev.to/imshashank/how-to-build-an-ai-agent-evaluation-framework-from-scratch-5h54</link>
      <guid>https://dev.to/imshashank/how-to-build-an-ai-agent-evaluation-framework-from-scratch-5h54</guid>
      <description>&lt;p&gt;Building AI agents is hard. Evaluating them is harder.&lt;/p&gt;

&lt;p&gt;Most teams I talk to are evaluating their agents the wrong way. They look at the final output and ask, "Is it correct?" But that's like grading a math test by only looking at the final answer, not the work.&lt;/p&gt;

&lt;p&gt;In this post, I'll show you how to build a proper AI agent evaluation framework from scratch. We'll cover the concepts, the implementation, and the best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Traditional Evaluation Fails for Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional ML evaluation metrics (accuracy, precision, recall) don't work for agents because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agents take multiple steps:&lt;/strong&gt; An agent might get the right answer through the wrong path. Traditional metrics only look at the final output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The path matters:&lt;/strong&gt; An agent that takes 10 steps to answer a question is worse than one that takes 2 steps, even if both get the right answer. Cost and efficiency matter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hallucinations are subtle:&lt;/strong&gt; An agent might hallucinate in an intermediate step but still get the right final answer. You'd miss this with output-only evaluation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance violations are hidden:&lt;/strong&gt; An agent might violate a constraint (like discussing a competitor) in the middle of a conversation but still provide a correct final answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Right Way to Evaluate Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the framework I recommend:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Define Your Ground Truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't manually label data. Use your system prompt as ground truth. Your system prompt defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the agent should do&lt;/li&gt;
&lt;li&gt;How it should behave&lt;/li&gt;
&lt;li&gt;What constraints it should follow&lt;/li&gt;
&lt;li&gt;What role it should play&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is your evaluation ground truth. Everything else is a deviation from this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Collect Traces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every time your agent runs, collect a trace. A trace includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The initial user input&lt;/li&gt;
&lt;li&gt;Every LLM call (input and output)&lt;/li&gt;
&lt;li&gt;Every tool call&lt;/li&gt;
&lt;li&gt;The final output&lt;/li&gt;
&lt;li&gt;Metadata (tokens, latency, cost)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what a trace structure might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMCall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;tool_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LLMCall&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;total_latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Define Evaluation Dimensions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't use a single metric. Evaluate across multiple dimensions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Did it achieve the goal?
&lt;/span&gt;    &lt;span class="n"&gt;EFFICIENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Did it take the optimal path?
&lt;/span&gt;    &lt;span class="n"&gt;HALLUCINATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Did it invent facts?
&lt;/span&gt;    &lt;span class="n"&gt;COMPLIANCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compliance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Did it follow constraints?
&lt;/span&gt;    &lt;span class="n"&gt;COHERENCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coherence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Was it logically consistent?
&lt;/span&gt;    &lt;span class="n"&gt;COST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# How many tokens did it use?
&lt;/span&gt;    &lt;span class="n"&gt;TOOL_VALIDITY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_validity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Were tool calls valid?
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
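The class above is a plain namespace of string constants, which works fine. If you want iteration and typo-safety for free, Python's built-in `enum` module is a drop-in alternative; this is a sketch of that variant, not part of the original framework:

```python
from enum import Enum

class EvaluationDimension(str, Enum):
    """String-valued enum: members compare equal to their string values,
    so they still work as plain dict keys in the scorers below."""
    TASK_COMPLETION = "task_completion"
    EFFICIENCY = "efficiency"
    HALLUCINATION = "hallucination"
    COMPLIANCE = "compliance"
    COHERENCE = "coherence"
    COST = "cost"
    TOOL_VALIDITY = "tool_validity"

# Mistyping a dimension name now raises AttributeError instead of
# silently creating a new string key.
assert EvaluationDimension.COST == "cost"
assert len(list(EvaluationDimension)) == 7
```

The `str` mixin matters: it keeps serialized scores (JSON logs, dashboards) identical to the string-constant version.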



&lt;p&gt;&lt;strong&gt;Step 4: Implement Scorers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each dimension, implement a scorer. Here are a few examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_task_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Score whether the agent completed its task.

    Uses the system prompt to determine what &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; means.
    Returns a score from 0-10.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract task from system prompt
&lt;/span&gt;    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_task_from_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if final output indicates task completion
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;indicates_task_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Score how efficient the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s path was.

    Fewer steps = higher efficiency.
    Returns a score from 0-10.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Count steps taken
&lt;/span&gt;    &lt;span class="n"&gt;steps_taken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Estimate optimal steps (this is domain-specific)
&lt;/span&gt;    &lt;span class="n"&gt;optimal_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_optimal_steps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate efficiency ratio
&lt;/span&gt;    &lt;span class="n"&gt;efficiency_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimal_steps&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;steps_taken&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to 0-10 scale
&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;efficiency_ratio&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_hallucination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Score whether the agent hallucinated.

    Hallucinations = lower score.
    Returns a score from 0-10 (10 = no hallucinations).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;hallucinations_detected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# Check each LLM output for hallucinations
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;llm_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;contains_hallucination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;hallucinations_detected&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to score
&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hallucinations_detected&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_compliance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Score whether the agent followed its constraints.

    Constraint violations = lower score.
    Returns a score from 0-10 (10 = no violations).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract constraints from system prompt
&lt;/span&gt;    &lt;span class="n"&gt;constraints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_constraints_from_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# Check each LLM output against constraints
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;llm_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;constraint&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;violates_constraint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constraint&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to score
&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
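Helpers like `violates_constraint` and `contains_hallucination` are left abstract above. As one minimal sketch, assuming constraints are represented as dicts with a hypothetical `"forbidden"` phrase list, a keyword heuristic could look like this (a real system would more likely use an LLM judge):

```python
from typing import Dict, List

def violates_constraint(output: str, constraint: Dict[str, List[str]]) -> bool:
    """Heuristic checker: a constraint forbids certain phrases.

    Expects constraints shaped like {"forbidden": ["refund", ...]}.
    This shape is an assumption for illustration, not the article's API.
    """
    forbidden = constraint.get("forbidden", [])
    lowered = output.lower()
    return any(phrase.lower() in lowered for phrase in forbidden)

# Example: a support agent that must never promise refunds
constraint = {"forbidden": ["refund", "money back"]}
print(violates_constraint("I can offer you a full refund today.", constraint))  # True
print(violates_constraint("Let me check your order status.", constraint))       # False
```

Keyword matching is cheap enough to run on every trace; reserve the expensive LLM-judge version for traces this filter flags.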



&lt;p&gt;&lt;strong&gt;Step 5: Aggregate Scores&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combine individual dimension scores into an overall evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_agent_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Evaluate an agent trace across all dimensions.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_task_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EFFICIENCY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALLUCINATION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_hallucination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLIANCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_compliance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COHERENCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_coherence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COST&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TOOL_VALIDITY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;score_tool_validity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate overall score (weighted average)
&lt;/span&gt;    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLIANCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALLUCINATION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EFFICIENCY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COHERENCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COST&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TOOL_VALIDITY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Included in task completion
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;overall_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;overall_score&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
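As a sanity check on the weighting scheme, here is the aggregation logic run standalone with plain string keys and made-up per-dimension scores (the numbers are illustrative only):

```python
# Hypothetical scores for a single trace
scores = {
    "task_completion": 10.0,
    "compliance": 8.0,
    "hallucination": 10.0,
    "efficiency": 6.0,
    "coherence": 9.0,
    "cost": 7.0,
    "tool_validity": 10.0,
}

weights = {
    "task_completion": 0.3,
    "compliance": 0.3,
    "hallucination": 0.2,
    "efficiency": 0.1,
    "coherence": 0.05,
    "cost": 0.05,
    "tool_validity": 0.0,  # folded into task completion
}

# Weights must sum to 1.0 so the overall score stays on the 0-10 scale
assert abs(sum(weights.values()) - 1.0) < 1e-9

overall = sum(scores[dim] * weights[dim] for dim in scores)
print(round(overall, 2))  # → 8.8
```

Worth noting: because compliance carries a 0.3 weight, a single bad compliance score drags the overall down faster than any efficiency win can recover.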



&lt;p&gt;&lt;strong&gt;Step 6: Identify Root Causes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an agent scores poorly, analyze why:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;identify_root_causes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Identify why the agent performed poorly.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;root_causes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALLUCINATION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is hallucinating. Review system prompt for clarity.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLIANCE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is violating constraints. Strengthen system prompt.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EFFICIENCY&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is taking inefficient paths. Consider simplifying task or providing better tools.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is not completing task. Review system prompt and tool availability.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;root_causes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 7: Continuous Improvement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use evaluation results to improve your agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_recommendations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Generate specific recommendations for improving the agent.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;recommendations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;root_causes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;identify_root_causes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucinating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add specific facts to system prompt that agent should reference.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide relevant context in user input.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;violating constraints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make constraints more explicit in system prompt.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consider using tool constraints to prevent violations.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inefficient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide better tools to reduce steps needed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simplify the task or break it into sub-tasks.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recommendations&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Putting It All Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's how you'd use this framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Collect a trace from your agent
&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect_agent_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate the trace
&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_agent_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Identify problems
&lt;/span&gt;&lt;span class="n"&gt;root_causes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;identify_root_causes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate recommendations
&lt;/span&gt;&lt;span class="n"&gt;recommendations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_recommendations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Log results
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Overall Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;overall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task Completion: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TASK_COMPLETION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Efficiency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EFFICIENCY&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hallucination: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALLUCINATION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compliance: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EvaluationDimensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLIANCE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Root Causes:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;root_causes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recommendations:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Limitations of DIY Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building your own evaluation framework is a good exercise, but it has limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scorer Implementation:&lt;/strong&gt; Implementing reliable scorers for hallucination, compliance, and coherence is non-trivial and typically requires NLP expertise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; As your agent grows more complex, maintaining scorers becomes a full-time job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimization:&lt;/strong&gt; Hand-written scorers are often suboptimal. ML-based scorers (like LLM-as-Judge) perform better but require more infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root Cause Analysis:&lt;/strong&gt; Identifying root causes and generating recommendations requires deep domain knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
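&lt;p&gt;As a rough illustration of the third point, here is a minimal LLM-as-Judge scorer sketch. The &lt;code&gt;judge_fn&lt;/code&gt; callable, the prompt template, and the 1-10 scale are assumptions for illustration, not the API of any particular library:&lt;/p&gt;

```python
# Minimal LLM-as-Judge scorer sketch. judge_fn is a placeholder for
# whatever LLM client you use; all names here are illustrative.
from dataclasses import dataclass

JUDGE_PROMPT = (
    "Rate the assistant reply from 1-10 for {dimension}. "
    "Reply with only the number.\n\nReply: {reply}"
)

@dataclass
class JudgeResult:
    dimension: str
    score: float

def llm_as_judge(reply, dimension, judge_fn):
    """Score one reply on one dimension using a judge-model callable."""
    prompt = JUDGE_PROMPT.format(dimension=dimension, reply=reply)
    raw = judge_fn(prompt)  # e.g. a thin wrapper around your LLM client
    try:
        # Clamp to the 1-10 scale in case the judge drifts out of range.
        score = max(1.0, min(10.0, float(raw.strip())))
    except ValueError:
        score = 0.0  # judge returned something unparsable; flag for review
    return JudgeResult(dimension, score)

# Stubbed judge for demonstration: always answers "8".
result = llm_as_judge("Sure, I can help with that.", "coherence", lambda p: "8")
```

&lt;p&gt;In practice you would also ask the judge for its reasoning alongside the score, and calibrate it against a small set of human-labeled examples before trusting it.&lt;/p&gt;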

&lt;p&gt;This is where a purpose-built evaluation platform becomes valuable. Noveum.ai, for example, provides all of this out of the box: 73+ pre-built scorers, automated root cause analysis through NovaPilot, and prescriptive recommendations. You can learn more about their approach to &lt;a href="https://noveum.ai/en/solutions/agent-evaluation" rel="noopener noreferrer"&gt;agent evaluation here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Evaluating AI agents properly requires evaluating the entire trajectory across multiple dimensions, not just the final output. By following this framework, you'll have much better visibility into your agent's behavior and be able to improve it iteratively.&lt;/p&gt;

&lt;p&gt;Start with the basic scorers I've outlined here, then expand as your needs grow. And remember: the system prompt is your ground truth. Use it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to use System prompts as Ground Truth for Evaluation</title>
      <dc:creator>shashank agarwal</dc:creator>
      <pubDate>Wed, 10 Dec 2025 03:50:00 +0000</pubDate>
      <link>https://dev.to/imshashank/how-to-use-system-prompts-as-ground-truth-for-evaluation-ni6</link>
      <guid>https://dev.to/imshashank/how-to-use-system-prompts-as-ground-truth-for-evaluation-ni6</guid>
      <description>&lt;p&gt;Here's a hard truth: most teams don't know how to evaluate their AI agents because they don't have a clear ground truth.&lt;/p&gt;

&lt;p&gt;They spend months creating manual labels, hiring annotators, and building datasets. Then they realize the labels are inconsistent, expensive, and don't scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoflzut725dg8f4mzegi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoflzut725dg8f4mzegi.png" alt="Hallucination Detection from Noveum.ai with reasoning"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a better way.&lt;/p&gt;

&lt;p&gt;Your system prompt IS your ground truth.&lt;/p&gt;

&lt;p&gt;Think about it. Your system prompt defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent's role: What is it supposed to be?&lt;/li&gt;
&lt;li&gt;Its constraints: What should it NOT do?&lt;/li&gt;
&lt;li&gt;Its instructions: How should it behave?&lt;/li&gt;
&lt;li&gt;Its values: What matters to it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything the agent does should be evaluated against these instructions.&lt;/p&gt;

&lt;p&gt;For example, if your system prompt says: "You are a customer support agent. You must be polite, professional, and never discuss politics," then you can evaluate every response by asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Is it polite?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is it professional?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does it avoid political topics?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't subjective labels. They're objective criteria derived from your system prompt.&lt;/p&gt;
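&lt;p&gt;A minimal sketch of that idea, assuming simple keyword heuristics stand in for real classifiers (a production system would use an LLM or a trained model per rule, not substring matching):&lt;/p&gt;

```python
# Sketch: turning system-prompt rules into automatic pass/fail checks.
# The rule names and keyword lists are illustrative placeholders.
SYSTEM_PROMPT_RULES = {
    "polite": lambda r: not any(w in r.lower() for w in ("idiot", "stupid")),
    "no_politics": lambda r: not any(w in r.lower() for w in ("election", "senator")),
}

def evaluate_response(response):
    """Return a pass/fail verdict per rule derived from the system prompt."""
    return {rule: check(response) for rule, check in SYSTEM_PROMPT_RULES.items()}

results = evaluate_response("Happy to help you track your order today!")
```

&lt;p&gt;Each check maps one sentence of the system prompt to one objective verdict, which is exactly what makes the results reproducible.&lt;/p&gt;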

&lt;p&gt;This is the foundation of proper agent evaluation. You don't need expensive annotators. You need a framework that automatically evaluates whether the agent followed its instructions.&lt;/p&gt;

&lt;p&gt;The system prompt is the source of truth. Everything else is just implementation.&lt;/p&gt;

&lt;p&gt;That's how &lt;a href="https://noveum.ai/en/solutions/agent-evaluation" rel="noopener noreferrer"&gt;Noveum.ai&lt;/a&gt; works today. If you'd like early access, reach out to us.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>agents</category>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
