shashank agarwal

How to Build an AI Agent Evaluation Framework from Scratch

Building AI agents is hard. Evaluating them is harder.

Most teams I talk to are evaluating their agents the wrong way. They look at the final output and ask, "Is it correct?" But that's like grading a math test by only looking at the final answer, not the work.

In this post, I'll show you how to build a proper AI agent evaluation framework from scratch. We'll cover the concepts, the implementation, and the best practices.

Why Traditional Evaluation Fails for Agents

Traditional ML evaluation metrics (accuracy, precision, recall) don't work for agents because:

  1. Agents take multiple steps: An agent might get the right answer through the wrong path. Traditional metrics only look at the final output.

  2. The path matters: An agent that takes 10 steps to answer a question is worse than one that takes 2 steps, even if both get the right answer. Cost and efficiency matter.

  3. Hallucinations are subtle: An agent might hallucinate in an intermediate step but still get the right final answer. You'd miss this with output-only evaluation.

  4. Compliance violations are hidden: An agent might violate a constraint (like discussing a competitor) in the middle of a conversation but still provide a correct final answer.

The Right Way to Evaluate Agents

Here's the framework I recommend:

Step 1: Define Your Ground Truth

Don't manually label data. Use your system prompt as ground truth. Your system prompt defines:

  • What the agent should do
  • How it should behave
  • What constraints it should follow
  • What role it should play

This is your evaluation ground truth. Everything else is a deviation from this.
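
To make this concrete, here's a minimal sketch of the two ground-truth helpers that the scorers in Step 4 rely on (extract_task_from_prompt and extract_constraints_from_prompt). It assumes the system prompt states the task in its first sentence and writes constraints as lines beginning with phrases like "Never" or "You must"; those markers are illustrative assumptions, and a production version might use an LLM to parse the prompt instead.

from typing import List

def extract_task_from_prompt(system_prompt: str) -> str:
    """
    Pull the task description out of the system prompt.

    Sketch assumption: the first sentence states the agent's job,
    e.g. "You are a support agent that resolves billing questions."
    """
    return system_prompt.strip().split(".")[0].strip()

def extract_constraints_from_prompt(system_prompt: str) -> List[str]:
    """
    Pull explicit constraints out of the system prompt.

    Sketch assumption: constraints are written as lines beginning with
    "Never", "Do not", "Don't", "Always", or "You must".
    """
    markers = ("never", "do not", "don't", "always", "you must")
    constraints = []
    for line in system_prompt.splitlines():
        stripped = line.strip().lstrip("-*• ").strip()
        if stripped.lower().startswith(markers):
            constraints.append(stripped)
    return constraints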

Step 2: Collect Traces

Every time your agent runs, collect a trace. A trace includes:

  • The initial user input
  • Every LLM call (input and output)
  • Every tool call
  • The final output
  • Metadata (tokens, latency, cost)

Here's what a trace structure might look like:

from dataclasses import dataclass
from typing import List, Dict, Any

@dataclass
class LLMCall:
    input: str
    output: str
    model: str
    tokens_used: int
    latency_ms: float

@dataclass
class ToolCall:
    tool_name: str
    tool_input: Dict[str, Any]
    tool_output: str
    latency_ms: float

@dataclass
class AgentTrace:
    user_input: str
    system_prompt: str
    llm_calls: List[LLMCall]
    tool_calls: List[ToolCall]
    final_output: str
    total_tokens: int
    total_cost: float
    total_latency_ms: float
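
You'll see a collect_agent_trace helper used in the "Putting It All Together" section later. Here's a rough sketch of what it could look like. The agent.run() interface and the step-record fields are assumptions for illustration; in practice you'd populate the trace from whatever callbacks or middleware your agent framework exposes.

import time

def collect_agent_trace(agent, user_input: str) -> AgentTrace:
    """
    Run the agent once and package everything it did into an AgentTrace.

    Sketch assumptions: `agent` exposes a `system_prompt` attribute and a
    `run()` method returning (final_output, steps), where each step is a dict
    describing one LLM or tool call. Real frameworks expose callbacks or
    middleware for this; map whatever they give you into the same structure.
    """
    start = time.monotonic()
    final_output, steps = agent.run(user_input)  # hypothetical interface
    total_latency_ms = (time.monotonic() - start) * 1000

    llm_calls = [
        LLMCall(
            input=s["input"], output=s["output"], model=s["model"],
            tokens_used=s["tokens"], latency_ms=s["latency_ms"],
        )
        for s in steps if s["type"] == "llm"
    ]
    tool_calls = [
        ToolCall(
            tool_name=s["name"], tool_input=s["args"],
            tool_output=s["output"], latency_ms=s["latency_ms"],
        )
        for s in steps if s["type"] == "tool"
    ]

    return AgentTrace(
        user_input=user_input,
        system_prompt=agent.system_prompt,
        llm_calls=llm_calls,
        tool_calls=tool_calls,
        final_output=final_output,
        total_tokens=sum(c.tokens_used for c in llm_calls),
        total_cost=0.0,  # fill in from your provider's per-token pricing
        total_latency_ms=total_latency_ms,
    )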

Step 3: Define Evaluation Dimensions

Don't use a single metric. Evaluate across multiple dimensions:

class EvaluationDimensions:
    TASK_COMPLETION = "task_completion"  # Did it achieve the goal?
    EFFICIENCY = "efficiency"  # Did it take the optimal path?
    HALLUCINATION = "hallucination"  # Did it invent facts?
    COMPLIANCE = "compliance"  # Did it follow constraints?
    COHERENCE = "coherence"  # Was it logically consistent?
    COST = "cost"  # How many tokens did it use?
    TOOL_VALIDITY = "tool_validity"  # Were tool calls valid?

Step 4: Implement Scorers

For each dimension, implement a scorer. Here's an example:

def score_task_completion(trace: AgentTrace) -> float:
    """
    Score whether the agent completed its task.

    Uses the system prompt to determine what "task completion" means.
    Returns a score from 0-10.
    """
    # Extract task from system prompt
    task = extract_task_from_prompt(trace.system_prompt)

    # Check if final output indicates task completion
    if indicates_task_completion(trace.final_output, task):
        return 10.0
    else:
        return 0.0

def score_efficiency(trace: AgentTrace) -> float:
    """
    Score how efficient the agent's path was.

    Fewer steps = higher efficiency.
    Returns a score from 0-10.
    """
    # Count steps taken (at least 1, so the ratio below can't divide by zero)
    steps_taken = max(len(trace.llm_calls) + len(trace.tool_calls), 1)

    # Estimate optimal steps (this is domain-specific)
    optimal_steps = estimate_optimal_steps(trace.user_input)

    # Calculate efficiency ratio
    efficiency_ratio = optimal_steps / steps_taken

    # Convert to 0-10 scale
    score = min(efficiency_ratio * 10, 10.0)

    return score

def score_hallucination(trace: AgentTrace) -> float:
    """
    Score whether the agent hallucinated.

    Hallucinations = lower score.
    Returns a score from 0-10 (10 = no hallucinations).
    """
    hallucinations_detected = 0

    # Check each LLM output for hallucinations
    for llm_call in trace.llm_calls:
        if contains_hallucination(llm_call.output):
            hallucinations_detected += 1

    # Convert to score
    score = max(10 - (hallucinations_detected * 2), 0.0)

    return score

def score_compliance(trace: AgentTrace) -> float:
    """
    Score whether the agent followed its constraints.

    Constraint violations = lower score.
    Returns a score from 0-10 (10 = no violations).
    """
    # Extract constraints from system prompt
    constraints = extract_constraints_from_prompt(trace.system_prompt)

    violations = 0

    # Check each LLM output against constraints
    for llm_call in trace.llm_calls:
        for constraint in constraints:
            if violates_constraint(llm_call.output, constraint):
                violations += 1

    # Convert to score
    score = max(10 - (violations * 2), 0.0)

    return score
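
The aggregation step in Step 5 also references score_coherence, score_cost, and score_tool_validity. Here are minimal sketches so the example runs end to end; the contains_contradiction helper, the 10,000-token budget, and the "error"-prefix convention for failed tool calls are all illustrative assumptions, not fixed rules.

def score_coherence(trace: AgentTrace) -> float:
    """
    Score logical consistency across the agent's steps.

    Contradictions between intermediate outputs and the final answer = lower score.
    Returns a score from 0-10 (10 = fully coherent).
    """
    # contains_contradiction is a placeholder helper, in the same spirit
    # as contains_hallucination above
    contradictions = sum(
        1 for llm_call in trace.llm_calls
        if contains_contradiction(llm_call.output, trace.final_output)
    )
    return max(10 - contradictions * 2, 0.0)

def score_cost(trace: AgentTrace) -> float:
    """
    Score token usage against a budget.

    Returns a score from 0-10 (10 = well under budget).
    """
    token_budget = 10_000  # illustrative budget; set this per use case
    return max(10.0 * (1 - trace.total_tokens / token_budget), 0.0)

def score_tool_validity(trace: AgentTrace) -> float:
    """
    Score whether tool calls were well-formed and succeeded.

    Returns a score from 0-10 (10 = every tool call valid).
    """
    if not trace.tool_calls:
        return 10.0
    # Illustrative convention: failed tools return an output starting with "error"
    invalid = sum(
        1 for call in trace.tool_calls
        if call.tool_output.lower().startswith("error")
    )
    return 10.0 * (1 - invalid / len(trace.tool_calls))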

Step 5: Aggregate Scores

Combine individual dimension scores into an overall evaluation:

def evaluate_agent_trace(trace: AgentTrace) -> Dict[str, float]:
    """
    Evaluate an agent trace across all dimensions.
    """
    scores = {
        EvaluationDimensions.TASK_COMPLETION: score_task_completion(trace),
        EvaluationDimensions.EFFICIENCY: score_efficiency(trace),
        EvaluationDimensions.HALLUCINATION: score_hallucination(trace),
        EvaluationDimensions.COMPLIANCE: score_compliance(trace),
        EvaluationDimensions.COHERENCE: score_coherence(trace),
        EvaluationDimensions.COST: score_cost(trace),
        EvaluationDimensions.TOOL_VALIDITY: score_tool_validity(trace),
    }

    # Calculate overall score (weighted average)
    weights = {
        EvaluationDimensions.TASK_COMPLETION: 0.3,
        EvaluationDimensions.COMPLIANCE: 0.3,
        EvaluationDimensions.HALLUCINATION: 0.2,
        EvaluationDimensions.EFFICIENCY: 0.1,
        EvaluationDimensions.COHERENCE: 0.05,
        EvaluationDimensions.COST: 0.05,
        EvaluationDimensions.TOOL_VALIDITY: 0.0,  # Included in task completion
    }

    overall_score = sum(
        scores[dim] * weights[dim]
        for dim in scores
    )

    return {**scores, "overall": overall_score}

Step 6: Identify Root Causes

When an agent scores poorly, analyze why:

def identify_root_causes(trace: AgentTrace, scores: Dict[str, float]) -> List[str]:
    """
    Identify why the agent performed poorly.
    """
    root_causes = []

    if scores[EvaluationDimensions.HALLUCINATION] < 5:
        root_causes.append("Agent is hallucinating. Review system prompt for clarity.")

    if scores[EvaluationDimensions.COMPLIANCE] < 5:
        root_causes.append("Agent is violating constraints. Strengthen system prompt.")

    if scores[EvaluationDimensions.EFFICIENCY] < 5:
        root_causes.append("Agent is taking inefficient paths. Consider simplifying task or providing better tools.")

    if scores[EvaluationDimensions.TASK_COMPLETION] < 5:
        root_causes.append("Agent is not completing task. Review system prompt and tool availability.")

    return root_causes

Step 7: Continuous Improvement

Use evaluation results to improve your agent:

def generate_recommendations(trace: AgentTrace, scores: Dict[str, float]) -> List[str]:
    """
    Generate specific recommendations for improving the agent.
    """
    recommendations = []

    root_causes = identify_root_causes(trace, scores)

    for cause in root_causes:
        if "hallucinating" in cause:
            recommendations.append("Add specific facts to system prompt that agent should reference.")
            recommendations.append("Provide relevant context in user input.")

        if "violating constraints" in cause:
            recommendations.append("Make constraints more explicit in system prompt.")
            recommendations.append("Consider using tool constraints to prevent violations.")

        if "inefficient" in cause:
            recommendations.append("Provide better tools to reduce steps needed.")
            recommendations.append("Simplify the task or break it into sub-tasks.")

    return recommendations

Putting It All Together

Here's how you'd use this framework:

# Collect a trace from your agent
trace = collect_agent_trace(agent, user_input)

# Evaluate the trace
scores = evaluate_agent_trace(trace)

# Identify problems
root_causes = identify_root_causes(trace, scores)

# Generate recommendations
recommendations = generate_recommendations(trace, scores)

# Log results
print(f"Overall Score: {scores['overall']:.1f}/10")
print(f"Task Completion: {scores[EvaluationDimensions.TASK_COMPLETION]:.1f}/10")
print(f"Efficiency: {scores[EvaluationDimensions.EFFICIENCY]:.1f}/10")
print(f"Hallucination: {scores[EvaluationDimensions.HALLUCINATION]:.1f}/10")
print(f"Compliance: {scores[EvaluationDimensions.COMPLIANCE]:.1f}/10")
print()
print("Root Causes:")
for cause in root_causes:
    print(f"  - {cause}")
print()
print("Recommendations:")
for rec in recommendations:
    print(f"  - {rec}")

The Limitations of DIY Evaluation

Building your own evaluation framework is a good exercise, but it has limitations:

  1. Scorer Implementation: Implementing scorers for hallucination, compliance, and coherence is non-trivial. You need NLP expertise.

  2. Scalability: As your agent grows more complex, maintaining scorers becomes a full-time job.

  3. Optimization: Hand-written scorers are often suboptimal. ML-based scorers (like LLM-as-Judge) perform better but require more infrastructure; see the sketch after this list.

  4. Root Cause Analysis: Identifying root causes and generating recommendations requires deep domain knowledge.
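
To illustrate point 3, here's a minimal sketch of an LLM-as-Judge scorer using the OpenAI Python SDK as the judge backend. The model name and the 0-10 grading prompt are illustrative choices; any sufficiently capable judge model and rubric will do.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge_hallucination(trace: AgentTrace) -> float:
    """
    Ask a judge model to rate how grounded the agent's final output is.

    Returns a score from 0-10 (10 = fully grounded, no hallucinations).
    """
    judge_prompt = (
        "You are grading an AI agent's output for hallucinations.\n\n"
        f"System prompt:\n{trace.system_prompt}\n\n"
        f"User input:\n{trace.user_input}\n\n"
        f"Final output:\n{trace.final_output}\n\n"
        "Reply with a single number from 0 (severe hallucinations) "
        "to 10 (fully grounded). Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable judge model works
        messages=[{"role": "user", "content": judge_prompt}],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat an unparseable judge reply as a failed check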

This is where a purpose-built evaluation platform becomes valuable. Noveum.ai, for example, provides all of this out of the box: 73+ pre-built scorers, automated root cause analysis through NovaPilot, and prescriptive recommendations. You can learn more about their approach to agent evaluation here.

Conclusion

Evaluating AI agents properly means scoring the entire trajectory across multiple dimensions, not just the final output. By following this framework, you'll have much better visibility into your agent's behavior and be able to improve it iteratively.

Start with the basic scorers I've outlined here, then expand as your needs grow. And remember: the system prompt is your ground truth. Use it.
