Imran Siddique

The Death of TDD: Why "Evaluation Engineering" is the New Source Code

I recently watched a Junior Engineer try to write a unit test for an LLM agent.

They were trying to assert that response == "I can help with that". The test failed because the AI replied, "I would be happy to help with that." The engineer sighed, updated the string, and ran it again. It failed again.

This is the state of AI engineering today: we are trying to force probabilistic systems into deterministic boxes. And it is breaking our workflows.

In traditional software, we write the implementation (parse_date()) and then the test (assert parse_date("2024-01-01") == date(2024, 1, 1)). But with AI, the AI writes the implementation. Our job is no longer to write the logic; our job is to write the exam.

I call this Evaluation Engineering, and it is the most valuable code you will write this year.

The Paradigm Shift: From TDD to Eval-DD

In the old world, the Human was the Coder and the Machine was the Executor. In the new world, the AI is the Coder and the Human is the Examiner.

You can’t write a unit test that covers every creative variation of an AI’s answer. Instead, you need to shift from Test-Driven Development (TDD) to Evaluation-Driven Development (Eval-DD).

Here is what that looks like in practice.

The 3 Pillars of Evaluation Engineering

I’ve built a simple framework to replace my unit tests. It consists of three core components that I believe every AI codebase needs.

1. The Golden Dataset (The "Spec")

Stop writing prose specifications. They are useless to an LLM. In Eval-DD, the dataset is the specification.

Instead of writing a Jira ticket that says "The bot should handle bad dates gracefully," you write this:

# The Dataset IS the Spec
dataset.add_case(
    id="edge_001", 
    input="Parse this date: 2024-13-01", # Invalid month
    expected_output="ERROR",
    tags=["invalid", "edge_case"],
    difficulty="hard"
)


This dataset defines exactly what "good" looks like. It is the single source of truth. If the AI passes this dataset, it is ready for production. If it fails, it is not.
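
The container itself doesn't need to be fancy. Here is a minimal sketch of what it might look like, assuming the add_case() signature above; GoldenDataset and EvalCase are illustrative names, not a library you can pip install.

from dataclasses import dataclass, field

# Minimal sketch only -- adapt the fields to your own domain.
@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str
    tags: list[str] = field(default_factory=list)
    difficulty: str = "normal"

class GoldenDataset:
    def __init__(self, name: str):
        self.name = name
        self.cases: list[EvalCase] = []

    def add_case(self, **kwargs) -> None:
        # Each case is one exam question with a known right answer.
        self.cases.append(EvalCase(**kwargs))

    def by_tag(self, tag: str) -> list[EvalCase]:
        # Slice the spec, e.g. run only the edge cases on every commit.
        return [c for c in self.cases if tag in c.tags]

dataset = GoldenDataset("date_parser_spec")

Because it is just data, the spec lives in version control and gets reviewed in a pull request like any other change.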

2. The Scoring Rubric (The "Judge")

This is where most teams fail. They grade on binary Correctness.

But in the real world, an answer can be correct but toxic. Or safe but useless.

I use a ScoringRubric class that allows for multi-dimensional grading. It evaluates an AI response across different axes, weighted by importance.

rubric = ScoringRubric("Customer Service Rubric", "Evaluates correctness AND tone")

# Correctness is important...
rubric.add_criteria(
    dimension="correctness",
    weight=0.5, 
    description="Does it solve the problem?",
    evaluator=correctness_evaluator
)

# ...but so is not being a jerk.
rubric.add_criteria(
    dimension="tone",
    weight=0.5, 
    description="Is it polite and empathetic?",
    evaluator=tone_evaluator
)


If the AI answers "Just click forgot password, duh," it gets:

  • Correctness: 10/10
  • Tone: 0/10
  • Final Score: 5/10 (Fail)

This captures the nuance that a simple assert statement misses.
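
What do correctness_evaluator and tone_evaluator look like under the hood? Here is a minimal sketch, assuming each evaluator maps (response, expected) to a 0-10 score and the rubric takes a weight-normalized average. The string checks are crude placeholders where you would normally plug in an LLM-as-judge call or a task-specific validator.

# Sketch only -- assumed evaluator signature: (response, expected) -> score from 0 to 10.
def correctness_evaluator(response: str, expected: str) -> float:
    # Placeholder check; swap in a real validator or an LLM judge.
    return 10.0 if expected.lower() in response.lower() else 0.0

def tone_evaluator(response: str, expected: str) -> float:
    # Placeholder: penalize obviously dismissive phrasing.
    rude = ("duh", "obviously", "just google")
    return 0.0 if any(phrase in response.lower() for phrase in rude) else 10.0

class ScoringRubric:
    def __init__(self, name: str, description: str = ""):
        self.name = name
        self.description = description
        self.criteria = []

    def add_criteria(self, dimension, weight, description, evaluator):
        self.criteria.append({"dimension": dimension, "weight": weight,
                              "description": description, "evaluator": evaluator})

    def score(self, response: str, expected: str) -> float:
        # Weighted average, normalized so the result stays on a 0-10 scale.
        total = sum(c["weight"] for c in self.criteria)
        return sum(c["weight"] * c["evaluator"](response, expected)
                   for c in self.criteria) / total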

3. The Evaluation Runner (The "Test Suite")

Finally, you need a runner that executes your AI against the Golden Dataset and grades it with your Rubric.

This replaces pytest. It runs the exam, calculates the pass rate, and tells you if your prompt engineering actually worked.

runner = EvaluationRunner(dataset, rubric, my_ai_function)
results = runner.run(verbose=True)

if results['pass_rate'] > 0.9:
    print("🎉 AI meets requirements!")
else:
    print("❌ AI needs improvement")

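The runner is the least magical piece. Here is a sketch, assuming the dataset and rubric shapes above, where pass_rate is simply the fraction of cases whose weighted score clears a threshold.

# Sketch only -- my_ai_function takes the case input and returns the model's response.
class EvaluationRunner:
    def __init__(self, dataset, rubric, ai_fn, pass_threshold: float = 7.0):
        self.dataset = dataset
        self.rubric = rubric
        self.ai_fn = ai_fn
        self.pass_threshold = pass_threshold

    def run(self, verbose: bool = False) -> dict:
        scores, failures = [], []
        for case in self.dataset.cases:
            response = self.ai_fn(case.input)
            score = self.rubric.score(response, case.expected_output)  # 0 to 10
            scores.append(score)
            if score < self.pass_threshold:
                failures.append(case)
            if verbose:
                print(f"{case.id}: {score:.1f}")
        n = len(self.dataset.cases)
        return {
            "pass_rate": (n - len(failures)) / n if n else 0.0,
            "mean_score": sum(scores) / n if n else 0.0,
            "failures": failures,
        }

Returning the failed cases alongside the rate matters: they are what you will dig into during failure analysis later.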

Why This Matters

This isn't just semantics. It changes how you work.

  1. You write the Rubric first. Before you write a single prompt, you define what success looks like.
  2. You iterate on the Prompt, not the Code. When a test fails, you don't rewrite Python logic; you tweak the system prompt or provide few-shot examples to the LLM (see the sketch after this list).
  3. The "Source Code" moves. The intellectual property of your application is no longer the Python code wrapping the API call. The IP is the Evaluation Suite.

The Senior Engineer's New Job

If you are worried about AI taking your coding job, don't be. The job just changed.

The hard part isn't generating the code anymore (Cursor/Copilot can do that). The hard part is:

  1. Defining the Golden Dataset (capturing edge cases).
  2. Tuning the Rubric (encoding engineering judgment into weights).
  3. Analyzing the Failures (figuring out why the AI messed up; see the sketch below).
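
For that last point, even a crude breakdown beats eyeballing raw transcripts. Assuming the runner returns the failed cases (as in the sketch above), the results object you already have gives a tag-level view of where the AI breaks.

from collections import Counter

# Which kinds of cases is the model failing on? Tags come straight from the golden dataset.
failure_tags = Counter(tag for case in results["failures"] for tag in case.tags)
for tag, count in failure_tags.most_common(3):
    print(f"{tag}: {count} failures")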

We are leaving the era of deterministic logic and entering the era of probabilistic engineering. Stop begging your AI to be good via prompts. Start grading it.
