baltz

Stop Vibing, Start Eval-ing: EDD for AI-Native Engineers

When I was doing traditional development, I had TDD. I wrote a test, it passed or failed, done. But when you're working with LLMs, the output is different every time you run it. You ask the model to generate a function and sometimes it's perfect, sometimes it changes the structure, sometimes it just ignores part of the spec. You can't just `assert(output == expected)` because the output is probabilistic; it's never exactly the same.

That's where EDD comes in: Eval-Driven Development. The idea is simple: instead of testing whether something works, yes or no, you measure how well it works on a scale of 0 to 100%. And the important part is that you define what "good" means before you start building.


How it works in practice

Say I'm building a support agent for a fintech app. Before I write a single prompt, I sit down and think: OK, what does success look like here? The agent should resolve at least 80% of queries without escalation, be factually accurate above 95%, respond in under 2 seconds, and cost less than $0.02 per conversation. That's my eval spec.
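An eval spec like this can live as plain data right next to your code. Here's a minimal sketch, where the metric names (`resolution_rate`, `factual_accuracy`, and so on) are my own illustrative choices, not a standard:

```python
# Eval spec for the fintech support agent: the thresholds from the text above
# expressed as plain data. All key names are illustrative assumptions.
EVAL_SPEC = {
    "resolution_rate": 0.80,   # min share of queries resolved without escalation
    "factual_accuracy": 0.95,  # min share of factually accurate answers
    "max_latency_s": 2.0,      # max seconds per response
    "max_cost_usd": 0.02,      # max cost per conversation
}

def meets_spec(metrics: dict) -> bool:
    """Return True only if every measured metric clears its threshold."""
    return (
        metrics["resolution_rate"] >= EVAL_SPEC["resolution_rate"]
        and metrics["factual_accuracy"] >= EVAL_SPEC["factual_accuracy"]
        and metrics["latency_s"] <= EVAL_SPEC["max_latency_s"]
        and metrics["cost_usd"] <= EVAL_SPEC["max_cost_usd"]
    )
```

The point of writing it down as data is that the spec stops being a vibe in your head and becomes something a script can check on every run.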

Then I build what's called an eval harness, which is basically three things: a dataset with real examples and expected outcomes, a grader that scores the output (it can be a function or even another LLM), and a runner that executes everything. I grab 50 real customer questions, define what a great answer looks like for each one, and write a script that runs my agent against all 50 and gives me a score.
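The runner part is tiny. Here's a sketch assuming a hypothetical `agent(question)` callable and a dataset of dicts with `"question"` and `"expected"` keys; the `grader` signature is also my assumption:

```python
# Minimal eval runner: run the agent on every case, grade each answer,
# return the mean score between 0.0 and 1.0.
def run_evals(agent, dataset, grader):
    scores = []
    for case in dataset:
        answer = agent(case["question"])               # call the system under test
        scores.append(grader(answer, case["expected"]))  # score this one answer
    return sum(scores) / len(scores)
```

That's genuinely all a runner is at the start: a loop, a grader call, and an average. Parallelism, retries, and dashboards can come later.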


Quick example

One of my eval cases is "Can I reverse a wire transfer?". The agent should explain the 24-hour reversal window, mention the $15 fee, and suggest contacting the bank for international wires.

My grader checks each of those things: did it mention the time window, did it mention the fee, did it hallucinate some policy that doesn't exist? Each check gets a score, I average them, and now I have a number. Not a gut feeling, a number.
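As a sketch, that rubric grader can start as dumb keyword checks, one per rubric item, averaged with equal weight (the equal weighting and the exact string matches are my assumptions; a real version might use an LLM judge for the phrasing-sensitive checks):

```python
# Rubric grader for the "can I reverse a wire transfer?" eval case.
# Each boolean check maps to one rubric item; the score is their average.
def grade_wire_transfer_answer(answer: str) -> float:
    text = answer.lower()
    checks = [
        any(w in text for w in ("24-hour", "24 hour", "24h")),  # reversal window
        "$15" in text,                                          # the fee
        "bank" in text,                                         # contact bank for intl wires
    ]
    return sum(checks) / len(checks)
```

Keyword checks are brittle, but they're deterministic and free, which makes them a fine first grader. You swap in an LLM judge only for the criteria keywords can't capture, like tone or hallucinated policies.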


Every change goes through evals

From that point on, every change goes through evals. Changed the system prompt? Run evals. Swapped models? Run evals. Added RAG? Run evals.
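In CI, "run evals" boils down to a gate: fail the build if the score drops meaningfully below the last known-good baseline. Here's a sketch; the baseline value and the noise tolerance are assumptions you'd tune for your own suite:

```python
# CI gate: a change passes only if its eval score stays within tolerance
# of the current baseline. Both constants are illustrative.
BASELINE = 0.85
TOLERANCE = 0.02  # allow small run-to-run noise from nondeterministic outputs

def check_regression(score: float) -> bool:
    """Return True if the new score is acceptable, False if it's a regression."""
    return score >= BASELINE - TOLERANCE
```

The tolerance matters because LLM outputs are noisy: without it you'd get flaky CI failures on runs that are fine, and people would start ignoring the gate.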

Vercel does exactly this with v0: they run evals on every PR, and after a model swap they caught a tone regression, the score dropping from 0.85 to 0.72, within 20 minutes. Without evals, that would have been weeks of user complaints before anyone noticed.


Your eval suite is your product

What I realized is that in AI engineering, your eval suite is basically your product. Anyone can copy your prompts; anyone can use the same model. But a dataset with 200+ edge cases and grading criteria built from months of real production feedback, that's what actually differentiates you.

If you're starting out, start small: five test cases, one grader, one local script. You don't need a platform; you need the habit of measuring before and after every change.


If you want to see how to actually build this from zero, with dataset, grader, and runner code you can copy and run, I wrote a second part: Stop Vibing, Start Eval-ing: EDD in Practice.
