Ma Uttaram
Evaluation Techniques

There are eight main evaluation techniques, falling into two broad families: those that compare against a known answer, and those that use judgment. Here's each technique explained:

Exact match — the simplest form. You know the correct answer, and you check whether the output equals it exactly. Works well for structured tasks: intent classification ("is this a booking request?"), entity extraction where the expected output is a fixed JSON, or tool selection ("should the agent call the calendar API or the email API here?"). Brittle for open-ended text because two correct answers can be worded differently.
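A minimal sketch of an exact-match check. The helper name and the normalization step (trimming and lowercasing) are my assumptions, not part of the article:

```python
def exact_match(predicted: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing; for strict JSON or
    # tool-selection checks you might skip normalization entirely.
    return predicted.strip().lower() == expected.strip().lower()

# Intent classification: the label space is fixed, so equality is enough.
assert exact_match("booking_request", "Booking_Request")
assert not exact_match("cancel_request", "booking_request")
```

Note how the same check fails for open-ended text: "The meeting is at 3pm" and "3pm is when the meeting starts" are both correct but not equal, which is exactly the brittleness described above.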

Schema / constraint validation — instead of checking exact values, you check the shape of the output. Does entity extraction return a valid Task schema with all required fields? Did the plan builder produce a properly ordered list? This is what Pydantic and Zod do, and it's directly relevant to BuddingBuilder's FR #7. It catches malformed outputs even when the content is hard to verify.
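A stdlib-only sketch of the idea (Pydantic and Zod do this far more thoroughly, with nested models and coercion). The `Task` field names here are invented for illustration, since the article doesn't show BuddingBuilder's actual schema:

```python
# Hypothetical required fields for a Task schema, mapped to expected types.
TASK_SCHEMA = {"title": str, "priority": int, "due_date": str}

def validates(raw: dict) -> bool:
    """Check shape, not content: every required field present and correctly typed."""
    return all(
        field in raw and isinstance(raw[field], expected_type)
        for field, expected_type in TASK_SCHEMA.items()
    )

assert validates({"title": "Book flight", "priority": 1, "due_date": "2025-06-01"})
assert not validates({"title": "Book flight"})          # missing required fields
assert not validates({"title": "Book flight", "priority": "high", "due_date": ""})  # wrong type
```

The point is that this check passes or fails regardless of whether the task content is sensible, which is what makes it cheap to run on every output.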

Code execution / unit test — the gold standard for any agent that produces code or structured plans. You run the output and check whether tests pass. For BuddingBuilder this applies to any task whose result is deterministically verifiable — a calculation, a formatted document, a database query result.
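A toy sketch of the run-and-verify loop. In a real system the generated code would run in a sandbox, not a bare `exec`; the helper name is mine:

```python
def run_and_check(code: str, test: str) -> bool:
    """Execute generated code in a fresh namespace, then run a test against it.
    WARNING: exec on untrusted output is unsafe outside a sandbox."""
    namespace = {}
    try:
        exec(code, namespace)   # define whatever the agent produced
        exec(test, namespace)   # run the unit test; AssertionError means failure
        return True
    except Exception:
        return False

generated = "def add(a, b):\n    return a + b"
assert run_and_check(generated, "assert add(2, 3) == 5")
assert not run_and_check(generated, "assert add(2, 3) == 6")
```

The same pattern covers any deterministically verifiable task: swap the `assert` for a query against the database the agent was supposed to modify, or a diff against the expected document.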

Reference-based LLM judge — you have a golden answer, and you ask a judge model to compare the agent's output against it and score the match. Returns a score and a reason. More flexible than exact match because it can handle paraphrasing, but requires you to maintain a library of golden examples, which takes effort to build and keep current.
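A sketch of the reference-based judge call. The prompt wording, the JSON verdict format, and the `call_model` function are all assumptions; any model API would slot in behind it:

```python
import json

JUDGE_PROMPT = """Compare the candidate answer to the golden answer.
Score 5 if they convey the same information (paraphrasing is fine), 1 if they contradict.
Golden: {golden}
Candidate: {candidate}
Reply as JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge_reference(golden: str, candidate: str, call_model) -> tuple[int, str]:
    # call_model is a hypothetical stand-in for your LLM client.
    raw = call_model(JUDGE_PROMPT.format(golden=golden, candidate=candidate))
    verdict = json.loads(raw)
    return verdict["score"], verdict["reason"]
```

Notice the judge returns both a score and a reason: the reason is what makes a low score actionable when a flagged trace lands in review.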

Rubric-based LLM judge — no golden answer needed. You give the judge a scoring rubric ("rate this response 1–5 on correctness, task completion, and safety") and it evaluates the output on its own. This is the most practical technique for staging, because you can write rubrics faster than you can curate golden answers. The key is writing rubrics that are specific enough that the judge can't wriggle around them.
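A sketch of what "specific enough" can look like in practice: each criterion is phrased as a concrete yes/no question rather than a vague quality label. The criteria names and wording are illustrative, not BuddingBuilder's actual rubric:

```python
# Each criterion is a pointed question the judge must answer, not just a label.
RUBRIC = {
    "correctness": "Is every factual claim in the response accurate?",
    "task_completion": "Did the response fully address what the user asked for?",
    "safety": "Does the response avoid harmful or out-of-scope actions?",
}

def build_rubric_prompt(response: str) -> str:
    criteria = "\n".join(
        f"- {name}: {question} Rate 1-5 and quote the evidence."
        for name, question in RUBRIC.items()
    )
    return f"Score this response on each criterion.\n{criteria}\nResponse: {response}"
```

Asking the judge to quote evidence for each score is one way to stop it from wriggling: a 5 with no supporting quote is a signal the rubric, or the judge, needs work.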

Pairwise preference — the judge sees two outputs side by side and picks the better one. You're not asking "is this good?" but "which is better — the old prompt or the new one?" This is the right technique for promotion gates: before moving from dev to staging, run pairwise eval between the new version and the current prod version. If the new version wins consistently, promote. This is also how RLHF preference data is collected.
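A sketch of a promotion gate built on pairwise judging. The function names and the 70% win threshold are assumptions; randomizing presentation order is a standard precaution against position bias in judge models:

```python
import random

def promotion_gate(pairs, judge, threshold=0.7):
    """pairs: list of (new_output, prod_output) for the same input.
    judge(a, b) is a hypothetical LLM judge returning 'A' or 'B'."""
    wins = 0
    for new, prod in pairs:
        # Randomize which side the new version appears on, so the judge's
        # position bias doesn't systematically favor either version.
        if random.random() < 0.5:
            wins += judge(new, prod) == "A"
        else:
            wins += judge(prod, new) == "B"
    return wins / len(pairs) >= threshold

# If the new prompt wins at least `threshold` of the matchups, promote it.
```

The same per-pair record ((input, output A, output B, winner)) is exactly the shape of an RLHF preference dataset, which is why the article notes the connection.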

Human eval — a human reads and rates the output. Highest signal, but too slow and expensive to run on everything. Its real job is to calibrate your automated judges — you periodically sample flagged traces, have a human rate them, and check whether your judge model's scores agree. If they don't, your rubric needs refining.
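The calibration check itself is simple arithmetic. This sketch assumes scores on the same 1-5 scale and treats the judge as agreeing when it lands within one point of the human; the tolerance and any target agreement bar are choices you'd make, not fixed rules:

```python
def agreement_rate(human_scores, judge_scores, tolerance=1):
    """Fraction of sampled traces where the judge's score falls within
    `tolerance` points of the human rating."""
    matches = sum(
        abs(human - judge) <= tolerance
        for human, judge in zip(human_scores, judge_scores)
    )
    return matches / len(human_scores)

# Periodically sample flagged traces, collect human ratings, and compare.
# If agreement drops below your bar, the rubric needs refining.
```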

Online monitoring — the only technique running continuously in prod. The guard model scores inputs before the agent acts; the output validator scores responses after. Neither produces a detailed critique — they produce a fast pass/fail signal with enough metadata to route flagged interactions to the human review queue. This is what closes BuddingBuilder's prod → dev feedback loop.
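A sketch of the guard-then-validate pipeline around a single agent call. The `Verdict` shape, the function names, and the keyword check standing in for a real guard model are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    reason: str  # minimal metadata for routing, not a detailed critique

def guard(user_input: str) -> Verdict:
    # Stand-in for a fast guard model scoring the input before the agent acts.
    if "ignore previous instructions" in user_input.lower():
        return Verdict(False, "possible prompt injection")
    return Verdict(True, "ok")

def handle(user_input, agent, validate, review_queue):
    pre = guard(user_input)
    if not pre.passed:
        review_queue.append((user_input, pre.reason))  # route to human review
        return None
    output = agent(user_input)
    post = validate(output)  # output validator scores the response after
    if not post.passed:
        review_queue.append((output, post.reason))
        return None
    return output
```

Everything that lands in `review_queue` is what the human-eval step above samples from, which is how the prod → dev feedback loop closes.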
