baltz
Stop Vibing, Start Eval-ing: EDD in Practice

In the first part I talked about what EDD is and why it matters. Now I want to show how to actually do it. No theory, just code.

I'm going to build an eval harness from scratch for a support agent. The same idea applies to anything you're building with LLMs.


The dataset

Everything starts with a JSON file. Real questions, expected behaviors, stuff the agent should and should not say.

[
  {
    "id": "wire-transfer-reversal",
    "input": "Can I reverse a wire transfer I made yesterday?",
    "expected": {
      "must_mention": ["24-hour window", "fee"],
      "must_not_mention": ["instant refund", "no fee"],
      "expected_tone": "helpful and clear"
    }
  },
  {
    "id": "account-locked",
    "input": "My account is locked and I can't access my funds",
    "expected": {
      "must_mention": ["identity verification", "support team"],
      "must_not_mention": ["call the police"],
      "expected_tone": "empathetic and urgent"
    }
  }
]

Start with 5 cases. I know that feels like nothing, but the point is to start measuring; you add more as you find failures in production.
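Before wiring anything up, a quick schema check pays off: a typo in the dataset will silently skew scores otherwise. A minimal sketch, assuming the schema above (the `validate_dataset` helper and its rules are my own, not part of the harness):

```python
import json

# Hypothetical sanity check: catch missing keys, duplicate ids, and
# empty must_mention lists before they silently skew scores.
REQUIRED_KEYS = {"must_mention", "must_not_mention", "expected_tone"}

def validate_dataset(cases: list) -> list:
    errors = []
    seen_ids = set()
    for case in cases:
        cid = case.get("id", "<missing id>")
        if cid in seen_ids:
            errors.append(f"{cid}: duplicate id")
        seen_ids.add(cid)
        expected = case.get("expected", {})
        missing = REQUIRED_KEYS - expected.keys()
        if missing:
            errors.append(f"{cid}: missing {sorted(missing)}")
        elif not expected["must_mention"]:
            errors.append(f"{cid}: must_mention is empty")
    return errors

cases = json.loads("""[
  {"id": "wire-transfer-reversal",
   "input": "Can I reverse a wire transfer I made yesterday?",
   "expected": {"must_mention": ["24-hour window", "fee"],
                "must_not_mention": ["instant refund", "no fee"],
                "expected_tone": "helpful and clear"}}
]""")
print(validate_dataset(cases))  # []
```

Run it once at load time and fail loudly; a broken case that never gets graded is a case you think you're covered on but aren't.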


The grader

I use two types of grading. Deterministic checks for things I can verify with code, and LLM-as-judge for subjective stuff like tone. Deterministic always comes first because it's fast, free, and reproducible.

import json
from anthropic import Anthropic

client = Anthropic()

def deterministic_grade(output: str, expected: dict) -> dict:
    # substring checks: fast, free, and reproducible
    mentioned = sum(1 for t in expected["must_mention"] if t.lower() in output.lower())
    coverage = mentioned / len(expected["must_mention"])

    hallucinations = sum(1 for t in expected["must_not_mention"] if t.lower() in output.lower())
    no_hallucination = 1.0 if hallucinations == 0 else 0.0

    return {"coverage": coverage, "no_hallucination": no_hallucination}

def llm_judge_grade(output: str, expected: dict) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Score this response from 0.0 to 1.0.
Expected tone: {expected['expected_tone']}
Response: {output}
Reply ONLY with JSON: {{"tone": 0.85, "clarity": 0.9}}"""
        }]
    )
    # assumes the model returns bare JSON; in production, wrap this in
    # try/except and retry on parse failures
    return json.loads(response.content[0].text)

def grade(output: str, expected: dict) -> dict:
    # deterministic first, then merge in the judge's subjective scores
    scores = deterministic_grade(output, expected)
    scores.update(llm_judge_grade(output, expected))
    return scores
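The deterministic half can be sanity-checked without an API key. A quick standalone check (`deterministic_grade` reproduced verbatim; the sample responses are made up for illustration):

```python
# deterministic_grade copied from the grader so this runs on its own
def deterministic_grade(output: str, expected: dict) -> dict:
    mentioned = sum(1 for t in expected["must_mention"] if t.lower() in output.lower())
    coverage = mentioned / len(expected["must_mention"])
    hallucinations = sum(1 for t in expected["must_not_mention"] if t.lower() in output.lower())
    return {"coverage": coverage, "no_hallucination": 1.0 if hallucinations == 0 else 0.0}

expected = {
    "must_mention": ["24-hour window", "fee"],
    "must_not_mention": ["instant refund", "no fee"],
}
good = "Reversals are possible within the 24-hour window, and a fee applies."
bad = "Sure, you get an instant refund with no fee."

print(deterministic_grade(good, expected))  # {'coverage': 1.0, 'no_hallucination': 1.0}
print(deterministic_grade(bad, expected))   # {'coverage': 0.5, 'no_hallucination': 0.0}
```

Note the `bad` response still scores 0.5 coverage because "fee" appears as a substring; naive substring matching has exactly this kind of blind spot, which is one more reason to pair it with a judge.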

The runner

This ties everything together: it loads the dataset, calls the agent, grades each response, and saves the results so I can compare runs later.

import json, time
from agent import run_agent
from grader import grade

def run_evals(dataset_path="evals/dataset.json"):
    with open(dataset_path) as f:
        cases = json.load(f)

    results = []
    for case in cases:
        start = time.time()
        output = run_agent(case["input"])
        latency = time.time() - start

        scores = grade(output, case["expected"])
        # latency is a pass/fail score so it averages in with the rest
        scores["latency_ok"] = 1.0 if latency < 2.0 else 0.0
        avg = sum(scores.values()) / len(scores)

        results.append({"id": case["id"], "avg": round(avg, 3)})
        print(f"  {case['id']}: {avg:.1%} ({latency:.1f}s)")

    overall = sum(r["avg"] for r in results) / len(results)
    print(f"\n  Overall: {overall:.1%}")

    # persist the run so future runs can be diffed against it
    with open("evals/results.json", "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    run_evals()

When I run this I get something like:

  wire-transfer-reversal: 92.5% (1.3s)
  account-locked: 87.0% (1.8s)
  international-fee: 78.3% (1.1s)

  Overall: 85.9%

That's my baseline. From here on every change I make goes through this before anything else. Changed the prompt? Run evals. Swapped models? Run evals. Added RAG? Run evals.

If tone goes from 0.78 to 0.83 but coverage drops from 0.91 to 0.86, now I have a real tradeoff conversation with my team instead of "I think it's better now".
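That comparison can be automated. A sketch of a regression gate that diffs a fresh run against a saved baseline (the `check_regressions` helper, the 0.05 tolerance, and the sample numbers are all mine, not part of the harness):

```python
# Hypothetical regression gate: fail the run if any case drops more
# than a tolerance below its baseline score.
TOLERANCE = 0.05

def check_regressions(baseline: list, current: list, tol: float = TOLERANCE) -> list:
    base = {r["id"]: r["avg"] for r in baseline}
    regressions = []
    for r in current:
        before = base.get(r["id"])
        if before is not None and r["avg"] < before - tol:
            regressions.append((r["id"], before, r["avg"]))
    return regressions

# illustrative numbers -- in practice both lists come from results.json files
baseline = [{"id": "wire-transfer-reversal", "avg": 0.925},
            {"id": "account-locked", "avg": 0.870}]
current  = [{"id": "wire-transfer-reversal", "avg": 0.930},
            {"id": "account-locked", "avg": 0.790}]

print(check_regressions(baseline, current))
# [('account-locked', 0.87, 0.79)]
```

Wire it into CI and exit non-zero when the list is non-empty; per-case comparison catches regressions that an improving overall average would hide.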


Where to go from here

As your dataset grows you split into smoke evals (10 critical cases, 30 seconds, every change) and full evals (50+ cases, nightly). Same logic as unit tests vs integration tests.
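One way to get that split without duplicating files is a tier tag on each case, so the same runner serves both (the `tier` field and `select_cases` helper are an assumption of mine, not part of the dataset schema above):

```python
# Hypothetical tiering: tag critical cases "smoke"; full runs everything.
def select_cases(cases: list, tier: str) -> list:
    if tier == "full":
        return cases
    return [c for c in cases if c.get("tier") == "smoke"]

cases = [
    {"id": "wire-transfer-reversal", "tier": "smoke"},
    {"id": "account-locked", "tier": "smoke"},
    {"id": "international-fee"},  # full-only: no tier tag
]
print([c["id"] for c in select_cases(cases, "smoke")])
# ['wire-transfer-reversal', 'account-locked']
```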

The thing that really changes the game is feeding production failures back into your dataset. Every bad response you spot becomes a new eval case. After a few months that dataset becomes your most valuable asset because it represents real failure modes no one could have predicted upfront.
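Converting a logged failure into a case can be a two-line ritual. A sketch under my own assumptions (the helper names and the sample failure are made up; the case shape matches the dataset above):

```python
import json

# Hypothetical helpers: turn a logged production failure into a new
# eval case so the dataset accumulates real failure modes.
def failure_to_case(case_id: str, user_input: str, must_mention: list,
                    must_not_mention: list, tone: str) -> dict:
    return {
        "id": case_id,
        "input": user_input,
        "expected": {
            "must_mention": must_mention,
            "must_not_mention": must_not_mention,
            "expected_tone": tone,
        },
    }

def append_case(cases: list, case: dict) -> list:
    # skip duplicates so re-importing the same failure is safe
    if any(c["id"] == case["id"] for c in cases):
        return cases
    return cases + [case]

cases = []  # in practice: json.load(open("evals/dataset.json"))
case = failure_to_case(
    "blocked-card-abroad",
    "My card got blocked while traveling and the agent told me to open a new account",
    must_mention=["temporary block", "travel notice"],
    must_not_mention=["open a new account"],
    tone="empathetic and clear",
)
cases = append_case(cases, case)
print(cases[0]["id"])  # blocked-card-abroad
```

Note the `must_not_mention` entry is the exact bad advice the agent gave: the failure itself writes the assertion.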

Your evals are your product. Start with five cases and measure everything.
