Gunnar Grosch
Evaluating Agent Output Quality: Lightweight Evals Without a Framework

In Writing System Prompts That Actually Work, I ended with this advice: "run it against a few representative inputs and check the output against your Expectation section." That's a good starting point. But if you've iterated on a few prompts, you've probably noticed the problem: eyeballing doesn't scale. You change the Steps section, re-run your test input, skim the output, and think "yeah, that looks better." Then two iterations later you realize you broke something that was working before. You're doing regression testing by memory.

If you haven't read the RISEN post: RISEN is a framework for writing agent system prompts with five components: Role, Instructions, Steps, Expectation, and Narrowing. The key idea is that each component doubles as an eval lever. Expectation defines what the output should look like, so it tells you what structural checks to write. Narrowing defines what the agent should avoid, so it tells you what scope violations to flag.

This post covers practical evaluation patterns for agent output. No heavyweight eval framework required. Three tiers: structural checks you can run in pure code, an LLM-as-judge pattern for content quality, and a calibration loop for tuning both. I'll walk through each tier with a working demo that evaluates a RISEN-structured code review agent that reviews Lambda functions for security vulnerabilities, performance issues, and AWS best practice violations.

Why Eval Before Picking a Framework

Evaluation frameworks exist for a reason. The Strands Agents Python SDK includes an evals package with output evaluators, trajectory evaluators, and benchmark runners. If you're building a production agent in Python with dozens of test cases and CI integration, use it.

But most of the time you're not there yet. You're still iterating on your system prompt, changing a sentence in Narrowing to see if it stops the agent from going off-scope. For that stage, you need something lighter.

Here's the model I use:

| Tier | What it checks | Cost | Speed |
| --- | --- | --- | --- |
| Structural | Format compliance, section presence, vocabulary | Free | Instant |
| LLM Judge | Content quality, finding detection, reasoning | ~$0.01/check | Seconds |
| Human | Calibration, edge cases, subjective quality | Your time | Minutes |

If you want to run it first and read later:

```bash
git clone https://github.com/gunnargrosch/agent-evals-demo.git
cd agent-evals-demo
npm install
npm run eval
```

Two principles guide the approach:

Binary pass/fail over scales. A finding is caught or it isn't. A section is present or it isn't. Likert scales (1-5 ratings) sound more nuanced, but they're harder to calibrate and harder to act on. If you need to score "how good" a finding is, you don't yet know what "good" means for your use case. Define it first, then check for it.

RISEN components are eval levers. Each component of your system prompt maps directly to something you can check:

  • Expectation defines structural checks: are the required sections present? Is the summary table formatted correctly?
  • Narrowing defines scope checks: did the agent stay in bounds? Did it avoid things you told it to avoid?
  • Steps define content checks: did the agent follow the workflow? Did it catch the issues each step should surface?

The Test Cases

To evaluate a code review agent, you need code to review. Not random code, but code with known issues so you can check whether the agent found them.

I built three test cases:

| Test case | Purpose | Expected findings |
| --- | --- | --- |
| Basic Vulnerabilities | Well-known issues. Calibration baseline. | 5 (SSN exposure, no validation, client in handler, wildcard CORS, no error handling) |
| Subtle Issues | Harder problems. Tests depth. | 4 (NoSQL injection, scan vs query, missing idempotency, no batch write) |
| False Positive Bait | Correct code that looks suspicious. Tests precision. | 0 (should find nothing wrong) |

The first is the same Lambda function from the RISEN demo: a user lookup function with SSN exposure, missing input validation, a DynamoDB client instantiated inside the handler, wildcard CORS, and no error handling. It has obvious problems that any decent review should catch. If your agent misses SSN exposure on this function, something is fundamentally broken.

The second is harder. It processes SQS messages and writes to DynamoDB with a FilterExpression built by string concatenation. That's NoSQL injection, but it's subtler than SQL injection and many reviewers miss it. The code also uses Scan instead of Query and doesn't handle SQS redelivery (missing idempotency).

The third is the interesting one. It has test constants that look like hardcoded secrets (TEST_API_KEY = 'test-ak-00000...'), a structured error handler that could look like error swallowing, and environment variable configuration that could be mistaken for hardcoded values. A good reviewer should find nothing wrong here. A trigger-happy one will flag false positives.

Each test case is a TypeScript object with the code, expected findings, and things that should not be flagged:

```typescript
interface TestCase {
  name: string
  description: string
  input: string
  expectedFindings: Finding[]
  expectedAbsent: string[]
}

interface Finding {
  id: string
  description: string
  severity: 'critical' | 'high' | 'medium' | 'low'
  keywords: string[]
}
```

The keywords array gives the LLM judge contextual hints about what vocabulary to look for. Keywords are passed directly to the judge as part of the user message. The judge reads them and uses them to decide whether the review demonstrated real understanding of the problem. For the NoSQL injection finding: ['injection', 'FilterExpression', 'concatenat', 'ExpressionAttributeValues', 'parameteriz']. The review doesn't need all of them, but it needs to demonstrate understanding of the actual problem, not just mention a related keyword in passing.

The stems ('concatenat', 'parameteriz') are intentional. The judge reads them semantically, so 'concatenat' cues the judge to look for "concatenation", "concatenated", and similar. Calibrate keyword specificity carefully: too generic and the judge gives credit for tangentially related mentions; too specific and you miss a review that describes the problem correctly but uses different phrasing.
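Put together, the NoSQL injection finding from the second test case might look like this. The keywords are the ones quoted above; the description wording is illustrative, not copied from the demo:

```typescript
// Shape repeated from the TestCase definition above
interface Finding {
  id: string
  description: string
  severity: 'critical' | 'high' | 'medium' | 'low'
  keywords: string[]
}

// Illustrative instance: the id and description are assumptions,
// the keywords are the ones discussed in the text.
const nosqlInjectionFinding: Finding = {
  id: 'nosql-injection',
  description:
    'User input is concatenated into a DynamoDB FilterExpression, allowing NoSQL injection. The fix is ExpressionAttributeValues placeholders.',
  severity: 'critical',
  keywords: ['injection', 'FilterExpression', 'concatenat', 'ExpressionAttributeValues', 'parameteriz'],
}
```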

Tier 1: Structural Checks

Structural checks are pure code. No LLM calls, no cost, instant results. They verify that the agent's output matches the format defined in your Expectation and respects the boundaries in your Narrowing.

For the code review prompt, the Expectation section says:

> Return a structured review with a summary table listing each finding with its severity, a detailed section for each finding with severity level, description, problematic code, and corrected code, and a summary count of findings by severity at the end.

That translates directly to checks:

```typescript
function runStructuralChecks(output: string): StructuralCheckResult[] {
  const checks: StructuralCheckResult[] = []

  // Derived from Expectation: "summary table listing each finding"
  const hasTable = /\|.*severity.*\|/i.test(output)
    || /\|.*finding.*\|/i.test(output)
  checks.push({
    name: 'Summary table present',
    passed: hasTable,
    detail: hasTable
      ? 'Found summary table with findings'
      : 'No summary table found.',
  })

  // Derived from Expectation: "corrected code"
  // Assumes paired backtick fences: malformed output with an odd backtick
  // count produces a non-integer, which Math.floor rounds down silently.
  const codeBlockCount = (output.match(/```/g) || []).length / 2
  checks.push({
    name: 'Code blocks with fixes',
    passed: codeBlockCount >= 1,
    detail: `Found ${Math.floor(codeBlockCount)} code block(s)`,
  })

  // Derived from Narrowing: "Do not suggest rewriting the entire function"
  const outputLines = output.split('\n').length
  const longestCodeBlock = getLongestCodeBlock(output)
  const isFullRewrite = longestCodeBlock > 40 && longestCodeBlock > outputLines * 0.4
  checks.push({
    name: 'No full rewrite',
    passed: !isFullRewrite,
    detail: isFullRewrite
      ? `Longest code block is ${longestCodeBlock} lines.`
      : 'Code blocks are targeted fixes.',
  })

  return checks
}
```

The full demo includes six checks: summary table, severity vocabulary (at least two severity levels used), code blocks, no full rewrite, summary count at end, and scope compliance (no style/readability suggestions). All derived from two RISEN components.

Here's what the output looks like:


```text
Structural Checks (6/6 passed)

  PASS Summary table present
  PASS Severity vocabulary
  PASS Code blocks with fixes
  PASS No full rewrite
  PASS Summary count at end
  PASS Stays within scope
```

These are pattern-matching checks, not semantic ones. An agent that rephrases a style suggestion to dodge the regex will pass. That's fine: structural checks catch gross format violations cheaply. The judge handles subtlety.

Structural checks catch the most common prompt problems: the agent ignored your format instructions, or it drifted out of scope. They're the first thing to run because they're free and fast. If structural checks fail, there's no point running the judge.
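That ordering can be a simple gate in the eval runner. A sketch of the wiring (the function shapes here are assumptions, not the demo's actual signatures):

```typescript
interface CheckResult { name: string; passed: boolean }

// Run the free structural checks first; only spend judge tokens if they pass.
async function runEval(
  output: string,
  structural: (o: string) => CheckResult[],
  judge: (o: string) => Promise<CheckResult[]>
): Promise<CheckResult[]> {
  const structuralResults = structural(output)
  if (structuralResults.some((c) => !c.passed)) {
    // Format is already broken; judging content quality would be wasted cost.
    return structuralResults
  }
  return [...structuralResults, ...(await judge(output))]
}
```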

Tier 2: LLM-as-Judge

Structural checks tell you the output looks right. They can't tell you the content is right. For that, you need a judge: a second agent that reads the review output and assesses whether specific findings were caught.

Different model, mandatory reasoning

Two design decisions matter here. First, the judge uses a different model than the review agent. The review agent runs on Claude Sonnet 4.5. The judge runs on Claude Haiku 4.5. Using the same model to judge its own output creates self-enhancement bias: the model that produced a vague or incomplete finding will tend to accept that same vague finding as "caught." A different model gives you a more honest assessment, and Haiku is fast and cheap enough to run per-finding.

Second, the judge must write its reasoning before its verdict. This is the same principle behind chain-of-thought prompting: forcing the model to explain its logic before committing to an answer produces better answers. The judge prompt's Expectation section enforces this:


```text
# Expectation
Return exactly one JSON object with this structure:
{
  "findingId": "<the finding ID provided>",
  "reasoning": "<2-3 sentences explaining why you believe the finding was
                 or was not caught>",
  "caught": <true or false>
}
```

Reasoning comes before the verdict in the JSON. The model writes the reasoning first, then decides.

The judge prompt

The judge gets a RISEN-structured prompt. It's worth showing in full. This is a RISEN prompt evaluating another RISEN prompt's output, and the structure makes that relationship explicit:


```text
# Role
You are a code review evaluator. You assess whether a code review
correctly identified a specific security or performance issue. You are
precise and literal: a finding is caught only if the review clearly
describes the problem and its impact.

# Instructions
You will receive a code review output and a specific finding to check
for. Determine whether the review caught the finding. Return your
assessment as JSON.

# Steps
1. Read the finding description and keywords carefully.
2. Search the review output for mentions of the issue.
3. Assess whether the review identified the core problem (not just
   mentioned a related keyword in passing).
4. Write your reasoning first, then your verdict.

# Expectation
Return exactly one JSON object with this structure:
{
  "findingId": "<the finding ID provided>",
  "reasoning": "<2-3 sentences explaining why you believe the finding
                 was or was not caught>",
  "caught": <true or false>
}

Return ONLY the JSON object. No markdown fences, no extra text.

# Narrowing
- A finding is "caught" only if the review identifies the specific
  problem described, not just a vaguely related concern.
- If the review mentions the general area but misses the specific
  vulnerability (e.g., mentions DynamoDB but not the injection vector),
  that is NOT caught.
- Do not give credit for partial matches. The review must demonstrate
  understanding of the actual issue.
```

The Narrowing is where the precision comes from. Without it, the judge tends to give credit for proximity. The review mentions DynamoDB? Close enough to "NoSQL injection"? No. The review has to demonstrate understanding of the actual vulnerability.

Running the judge

The judge evaluates each expected finding independently:


```typescript
async function judgeFindings(
  reviewOutput: string,
  findings: Finding[]
): Promise<JudgeVerdict[]> {
  // Each finding gets its own agent instance and runs in parallel.
  // Promise.allSettled means one timeout or error doesn't block the rest.
  const results = await Promise.allSettled(
    findings.map(async (finding) => {
      const agent = createJudgeAgent(judgePrompt)
      const prompt = `
Review output to evaluate:
---
${reviewOutput}
---

Finding to check:
- ID: ${finding.id}
- Description: ${finding.description}
- Severity: ${finding.severity}
- Keywords that indicate detection: ${finding.keywords.join(', ')}

Did the review catch this finding?`

      const result = await agent.invoke(prompt)
      return parseJudgeResponse(result.toString(), finding.id)
    })
  )

  return results.map((result, i) =>
    result.status === 'fulfilled'
      ? result.value
      : { findingId: findings[i].id, reasoning: 'Judge failed', caught: false }
  )
}
```

For the false positive test case, a separate judge prompt checks whether the review incorrectly flagged correct code as problematic.

The judge returns a JSON object with reasoning first and caught second. The caught boolean drives the PASS/FAIL in the terminal output; the reasoning string is printed below it. That schema is exactly the one defined in the Expectation section of the judge prompt.
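parseJudgeResponse isn't shown above; a defensive sketch might look like this. The fence-stripping is an assumption: the prompt forbids markdown fences, but models sometimes add them anyway, and a parse failure should count as not caught rather than crash the run:

```typescript
interface JudgeVerdict {
  findingId: string
  reasoning: string
  caught: boolean
}

function parseJudgeResponse(raw: string, findingId: string): JudgeVerdict {
  try {
    // Strip markdown fences in case the model ignored "no markdown fences"
    const cleaned = raw.replace(/```(?:json)?/g, '').trim()
    const parsed = JSON.parse(cleaned)
    return {
      findingId,
      reasoning: String(parsed.reasoning ?? ''),
      caught: parsed.caught === true,
    }
  } catch {
    // Unparseable output is treated like a failed judge call: not caught
    return { findingId, reasoning: 'Unparseable judge response', caught: false }
  }
}
```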

Here's what the judge output looks like on the Subtle Issues test case:


```text
Judge Verdicts (2/4 caught)

  PASS nosql-injection
    The review explicitly identifies the NoSQL injection vulnerability,
    clearly describing the core problem: user input is concatenated
    directly into the FilterExpression without parameterization.

  PASS scan-instead-of-query
    The review explicitly identifies this issue, clearly describing the
    core problem: the code uses ScanCommand which reads the entire table
    before filtering.

  FAIL missing-idempotency
    The review does not identify the missing idempotency issue. While
    the review addresses error handling, it does not discuss the specific
    problem of duplicate orders when SQS messages are reprocessed.

  FAIL no-batch-write
    The review does not identify the inefficiency of sequential
    PutCommand operations in a loop instead of BatchWriteCommand.
```

The baseline prompt catches 2 out of 4 findings on the harder test case. It finds the NoSQL injection and the scan problem, but misses idempotency and batch writes. 50% on the subtle test case isn't a failure: it's the starting point. That's the data you need to improve the prompt. If your results look different across runs, that's expected. The calibration section covers non-determinism and what to do about it.

Tier 3: The Calibration Loop

The first time you run evals, you'll disagree with some of the judge's verdicts. That's expected and useful. The calibration loop is how you turn those disagreements into a better rubric.

The process:

  1. Run the eval.
  2. Read every judge verdict, especially the reasoning.
  3. For each disagreement, decide which of three things happened:
    • The judge is wrong. Tighten the judge prompt's Narrowing or add a keyword to the finding definition.
    • The agent is wrong. The agent should have caught this. Tighten the review prompt's Steps.
    • The test case is wrong. The expected finding is unreasonable, or the code doesn't actually have the issue you thought it did. Fix the test case.

After my first run, I found two calibration issues. The "summary count at end" structural check was too strict: it only looked in the last 800 characters and required specific phrasing. I widened the search window and added more patterns. The false positive check for "hardcoded secret" was catching cases where the review mentioned test constants neutrally rather than flagging them as issues. I tightened the false positive judge prompt to distinguish between "flagged as a finding" and "mentioned in passing."

Two iterations were enough to get a stable rubric for three test cases. If you have more test cases or a more complex agent, you might need three or four rounds.

After the summary, the eval also prints a Suggestions section that maps each failure back to the RISEN component to edit: missed findings point to Steps, structural failures point to Expectation, false positives point to Narrowing. It doesn't tell you what to change, but it tells you where to look.
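The mapping behind that Suggestions section is mechanical enough to sketch. The category names here are assumptions; the failure-to-component mapping is the one described above:

```typescript
type FailureKind = 'missed-finding' | 'structural' | 'false-positive'

// Each failure category points at the RISEN component worth editing first.
const risenLever: Record<FailureKind, string> = {
  'missed-finding': 'Steps',
  'structural': 'Expectation',
  'false-positive': 'Narrowing',
}

function suggest(kind: FailureKind, detail: string): string {
  return `${detail}: look at the ${risenLever[kind]} section of the prompt`
}
```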

A note on non-determinism

LLMs don't produce the same output twice. Your eval might pass on one run and fail on the next for the same prompt and same input. This is normal, and it matters for how you interpret results.

For structural checks, non-determinism is rarely an issue: the agent either returns a table or it doesn't. For the judge, a borderline verdict (the review hinted at a finding but didn't nail it) may flip between runs. If a finding fails consistently, it's a real signal. If it flips, that's a signal too: the finding is on the boundary of what the current prompt reliably catches, and the judge rubric may need tightening.

For CI, this means not treating every eval failure as a blocker. Run structural checks in CI: they're stable. Use judge results as monitoring: track pass rates over multiple runs and alert on sustained regressions rather than single failures. The --ci flag exits non-zero on any failure; use it in CI only once your rubric is stable enough that flakiness is rare.

A simple strategy for borderline findings: run the eval three times and consider the finding caught if it passes at least two of three runs. The demo doesn't do this automatically; you'd wire it up in your CI script. It filters out most random variance without being too permissive. The README notes which findings are intermittent in the baseline prompt; those are good candidates for this approach before you tighten the prompt further.
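The two-of-three aggregation itself is a few lines. A sketch, assuming you've collected a boolean per finding per run:

```typescript
// verdictsPerRun[r][f] is true if finding f was caught on run r.
// A finding counts as caught if it passes in at least `threshold` runs.
function majorityVote(verdictsPerRun: boolean[][], threshold = 2): boolean[] {
  const findingCount = verdictsPerRun[0]?.length ?? 0
  return Array.from({ length: findingCount }, (_, i) => {
    const passes = verdictsPerRun.filter((run) => run[i]).length
    return passes >= threshold
  })
}
```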

Comparing Across Prompt Iterations

The real payoff comes when you change your prompt and want to know if the change helped. The compare script runs both prompts against all test cases and shows what changed.

The v2 prompt adds explicit steps for NoSQL injection detection and idempotency checking:


```text
# Steps
...
2. Check for security issues: injection (including NoSQL injection via
   string concatenation in DynamoDB expressions), overly permissive IAM
   assumptions, hardcoded secrets, missing input validation.
3. Check for data handling: look for string concatenation in
   FilterExpression, KeyConditionExpression, or ProjectionExpression.
   These must use ExpressionAttributeValues with placeholders.
4. Check for idempotency: if processing messages from SQS, SNS, or
   EventBridge, verify the handler is idempotent.
...
```

Here's the comparison output:


```text
COMPARISON: BASELINE vs VARIANT

Basic Vulnerabilities
  Structural: 6/6 -> 6/6
  Findings:   5/5 -> 5/5

Subtle Issues
  Structural: 6/6 -> 6/6
  Findings:   2/4 -> 3/4 (+1)
    + now catches: missing-idempotency

False Positive Bait
  Structural: 5/6 -> 6/6 (+1)
  False pos:  0 -> 1 (-1)
```

The v2 prompt catches the missing idempotency issue that the baseline missed. That's the targeted improvement. But it also introduced a false positive on the clean code: it incorrectly flagged "wildcard CORS" on a function that uses environment-based CORS configuration.

This is the trade-off you're always navigating with prompt changes. Adding specificity to Steps improves recall (catches more real issues) but can hurt precision (flags more non-issues). The eval gives you data to make that trade-off deliberately instead of guessing.
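If you want the trade-off as numbers rather than a gut feel, recall and precision fall straight out of the verdicts. A sketch (the counting convention is an assumption; the demo reports raw counts rather than these ratios):

```typescript
// caught: expected findings the judge marked as found
// expected: total expected findings across test cases
// falsePositives: issues incorrectly flagged on the clean test case
function recallAndPrecision(caught: number, expected: number, falsePositives: number) {
  const recall = expected === 0 ? 1 : caught / expected
  const precision = caught + falsePositives === 0 ? 1 : caught / (caught + falsePositives)
  return { recall, precision }
}
```

On the comparison above, the v2 prompt trades a point of precision (one false positive) for a point of recall (missing-idempotency now caught).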

The Same Pattern, Different Domain

Code review is a useful test bed, but the interesting question is whether the pattern holds for something more subjective. The demo includes a second domain: content review. An agent that reviews blog post drafts for completeness, structure, and technical accuracy.

The test cases are blog post drafts instead of Lambda functions. The expected findings are things like "missing prerequisites section" and "unexplained command flags" instead of "SSN exposure" and "NoSQL injection." The structural checks are completely different: section checklist present, finding categories used, readiness assessment at end. No severity tables, no code blocks with fixes.

But the pieces that change are exactly three files:

  1. Test cases: blog post drafts with known issues, using the same TestCase interface.
  2. Structural checks: regex/string checks derived from the content review prompt's Expectation section.
  3. Agent prompt: a RISEN-structured prompt for technical editing instead of security review.

The judge, report formatter, and TestCase/Finding types stay the same. Run it with:


```bash
npm run eval:content
```

```text
TEST CASE: Incomplete Tutorial

Structural Checks (5/5 passed)

  PASS Section checklist present
  PASS Finding categories
  PASS Actionable suggestions
  PASS Readiness assessment
  PASS Stays within scope

Judge Verdicts (4/5 caught)

  PASS missing-prerequisites
  PASS unexplained-command-flags
  PASS missing-import
  PASS no-troubleshooting
  FAIL missing-conclusion
```

Different domain, same eval pattern. If you're building a summarization agent, a customer support agent, or anything else, the approach is the same: define test cases with known expected findings, write structural checks from your Expectation section, and let the judge handle content quality.

Running the Demo

The demo repo has everything you need to run this yourself.

You'll need:

  • An AWS account with Amazon Bedrock access to Claude Sonnet 4.5 and Claude Haiku 4.5
  • Node.js 20+
  • AWS credentials configured (AWS_PROFILE or default)

Getting started:


```bash
git clone https://github.com/gunnargrosch/agent-evals-demo.git
cd agent-evals-demo
npm install
npm test
```

npm test runs the unit tests for the structural checks. No LLM calls, no AWS credentials needed. A good first step to verify the setup.

Commands:

| Command | What it does | Time |
| --- | --- | --- |
| npm run eval:structural | Code review structural checks only. No judge, no cost. | ~30s |
| npm run eval | Code review full eval: structural + LLM judge. | ~2min |
| npm run eval:content | Content review eval: structural + LLM judge. | ~1min |
| npm run compare | Baseline vs v2 code review prompt, side by side. | ~4min |

You can also run the full eval with the v2 prompt:


```bash
npm run eval -- --v2
```

To test your own code review prompt, write it in a text file and pass it with --prompt:


```bash
npm run eval -- --prompt ./my-code-review-prompt.txt
```

If you want to eval a completely different kind of agent without writing any TypeScript, --test-case accepts a JSON file of test cases and --skip-structural skips the built-in code review checks:


```bash
npm run eval -- --prompt ./my-prompt.txt --test-case ./my-cases.json --skip-structural
```

The compare script accepts --baseline and --variant for comparing any two prompt files:


```bash
npm run compare -- --baseline ./my-prompt-v1.txt --variant ./my-prompt-v2.txt
```

Pass --ci to any eval script to exit non-zero on failures, useful for pipeline integration:


```bash
npm run eval -- --ci
```

The eval uses Sonnet 4.5 for the review agent and Haiku 4.5 for the judge. A full npm run eval across all three test cases costs roughly $0.10-0.15 at current Bedrock pricing. npm run compare runs six agent calls plus the judge, so roughly double.

When to Graduate to a Framework

This approach works well for iterating on a single agent's system prompt with a handful of test cases. When you hit these signals, it's time to look at a framework:

  • More than 10 test cases. You'll want parallel execution, caching, and proper test runners.
  • CI integration. The demo has a --ci flag that exits non-zero on failures, so you can hook it into a pipeline. But once you need test result history, trend tracking, or gating deploys across multiple agents, a framework handles that better.
  • Multiple agents coordinating. Trajectory evaluation (did the agents take the right steps?) matters as much as output evaluation.
  • Team collaboration. Others need to run and extend evals without understanding your bespoke scripts.

The Strands Agents SDK includes an evals package (Python) with OutputEvaluator and TrajectoryEvaluator classes that handle these scenarios. Note that this is the Python SDK. The TypeScript SDK doesn't include an evals package yet, so graduating means either switching to Python or building on top of what this demo started. The lightweight approach in this post is for the earlier stage: when you're still figuring out what "good" looks like for your agent's output.

Wrapping Up

If you try it, the calibration loop is where the interesting disagreements show up. What does your current approach to agent eval look like? Let me know in the comments!
