AI Evals, Explained: How We Actually Know Our AI Is Any Good

#ai #evals #llm #dotnet

Part 1 of a series on building production AI on .NET — drawn from TextStack, a reader with seven shipping AI features.

You can build an AI feature in an afternoon. Wiring up an API call and a prompt is genuinely easy now. The hard part — the part that separates a demo from a product — is answering one deceptively simple question:

Is it any good? And did my last change make it better or worse?

For normal code, that question has a normal answer: a test suite. Add(2, 2) should return 4; if it doesn't, the build goes red. But an AI feature doesn't return 4. Ask it to explain a word and it returns a paragraph — a slightly different paragraph every single time, and "correct" is a whole range of good answers, not one. You cannot write Assert.Equal against prose. The thing software engineering relies on most — a fast, automatic signal that something broke — is gone.

Evals are how you get that signal back. This post is a plain-English introduction to what they are and how we actually run them in production. No hype, no notebooks — just the mental model and a real implementation.

So what is an eval?

Strip away the jargon and an eval is just a systematic way to measure the quality of an AI output. Where a unit test gives you pass/fail by exact match, an eval gives you a graded judgement over a representative sample of inputs. Instead of "is this exactly right?" it asks "across 30 realistic cases, how good is this, on the axes I care about?"

That measurement gets used in three different places, and it helps to keep them separate:

As monitoring — you score a sample of real traffic over time, to catch quality silently drifting downward.
As a guardrail — you score an output before the user sees it, and block or retry if it fails.
As a ruler for improvement — you score before and after a change, so "did this prompt edit help?" finally has an answer.
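
To make the guardrail use concrete, here is a minimal sketch. The `generate` and `score` delegates are hypothetical stand-ins (a real system would call the feature's model and a judge), and the 4.0 floor and two-attempt limit are arbitrary assumptions, not TextStack's settings.

```csharp
using System;

// Hypothetical stand-ins. The fake generator returns an empty draft first and a
// usable one on the retry, so the retry path is visible.
Func<string, int, string> generate = (input, attempt) =>
    attempt == 0 ? "" : $"'{input}' here means a river's edge, not a financial institution.";
Func<string, double> score = output => output.Length > 20 ? 4.5 : 1.0;

// Guardrail: score each draft before returning it; retry up to maxAttempts, then give up.
string? Guarded(string input, double floor, int maxAttempts)
{
    for (var attempt = 0; attempt < maxAttempts; attempt++)
    {
        var draft = generate(input, attempt);
        if (score(draft) >= floor) return draft;
    }
    return null; // the caller shows a fallback instead of a bad answer
}

var answer = Guarded("bank", floor: 4.0, maxAttempts: 2);
Console.WriteLine(answer ?? "(fallback: no answer met the quality floor)");
```

The same scoring function, pointed at a sample of logged traffic instead of a live request, gives you the monitoring use.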

Most teams want the third one first and never build it. That's the gap this series is about.

The lifecycle: Analyze → Measure → Improve

The most useful framing I've found is to treat evaluation as a loop of Analyze, Measure, Improve. It's worth internalising because it stops you from doing the steps in the wrong order — which is the single most common mistake.

1. Analyze — look at your failures before you measure anything.
The instinct is to jump straight to a metrics dashboard. Resist it. The highest-leverage activity in all of evals is boring: take 50–100 real outputs, read them, and label how each one is wrong. Not a score — a category. "Restated the dictionary definition instead of using the sentence's context." "Translation was accurate but too formal." You cluster these into a failure taxonomy, and that's what tells you which dimensions are even worth measuring. Skip this and you'll confidently measure the wrong things while users churn.
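
Once the labels exist, turning them into a taxonomy is mostly counting. A small sketch, assuming one category label per output; the category names and data below are made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical labels from reading real outputs: one failure category per output,
// or "ok" when nothing was wrong.
var labels = new[]
{
    "restated-dictionary", "ok", "too-formal", "restated-dictionary",
    "ok", "restated-dictionary", "too-long", "too-formal",
};

// Count failures per category, most frequent first.
var taxonomy = labels
    .Where(label => label != "ok")
    .GroupBy(label => label)
    .Select(group => (Category: group.Key, Count: group.Count()))
    .OrderByDescending(entry => entry.Count)
    .ThenBy(entry => entry.Category, StringComparer.Ordinal)
    .ToList();

foreach (var (category, count) in taxonomy)
    Console.WriteLine($"{count,3}  {category}");
```

The categories at the top of that list are the candidates for rubric dimensions.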

2. Measure — turn those failure modes into a repeatable number.
This is where the golden dataset and the LLM judge come in (each gets its own post later in the series). In short: you assemble a set of representative inputs with reference answers, run your feature over them, and have a second, stronger model score each output against a rubric built from your taxonomy.
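
The shape of that loop fits in a few lines. This is a sketch under loud assumptions: `generate` stands in for the real feature, and `judge` is a toy word-overlap scorer, not an LLM. Only the structure (cases in, per-case scores out, one aggregate number) is the point.

```csharp
using System;
using System.Linq;

// Hypothetical golden cases: (input, reference answer).
var cases = new (string Input, string Reference)[]
{
    ("bank | She sat on the bank of the river.", "The sloping land beside a river."),
    ("bank | He deposited cash at the bank.", "A business that holds and lends money."),
};

// Stand-in for the feature under test; a real run would call the production code path.
Func<string, string> generate = input =>
    input.Contains("river") ? "The land along the edge of a river." : "A riverbank.";

// Stand-in for the judge. A real judge would be a stronger model scoring 1-5
// against a rubric; this toy version only rewards word overlap with the reference.
Func<string, string, int> judge = (output, reference) =>
{
    var overlap = output.ToLowerInvariant().Split(' ', '.')
        .Intersect(reference.ToLowerInvariant().Split(' ', '.'))
        .Count(word => word.Length > 3);
    return Math.Clamp(1 + 2 * overlap, 1, 5);
};

var scores = cases.Select(c => judge(generate(c.Input), c.Reference)).ToArray();
var mean = scores.Average();
Console.WriteLine($"scores: {string.Join(", ", scores)}  mean: {mean:F2}");
```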

3. Improve — change something, re-run, and trust the delta.
Now you can edit a prompt, swap a model, or restructure a pipeline, run the eval, and see whether quality moved. When you wire that comparison into CI, a quality regression turns the build red — the same safety net you have for ordinary code, finally extended to the non-deterministic part.
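
A regression gate can be as small as the sketch below. The score arrays and the 0.25 tolerance are invented for illustration; in a real suite they would come from two runs over the same golden cases.

```csharp
using System;
using System.Linq;

// Hypothetical per-case judge scores (1-5) for the same golden cases,
// before and after a prompt change.
double[] baseline  = { 4, 5, 3, 4, 4, 5 };
double[] candidate = { 4, 4, 3, 3, 4, 4 };

// Gate: fail when the mean drops by more than the allowed tolerance.
bool Regressed(double[] before, double[] after, double tolerance) =>
    before.Average() - after.Average() > tolerance;

var failed = Regressed(baseline, candidate, tolerance: 0.25);
Console.WriteLine(
    $"baseline {baseline.Average():F2} -> candidate {candidate.Average():F2}: "
    + (failed ? "REGRESSION" : "ok"));
```

In CI, `failed` would become a failing test assertion or a non-zero exit code, which is what actually turns the build red.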

It's a flywheel: production traffic reveals new failure modes → you analyze them → they become new measured cases → improvements get gated → better output produces cleaner traffic. Round and round.

How we run evals at TextStack

Theory is cheap, so here's the concrete version. TextStack is an ASP.NET Core reading app with seven AI surfaces, including Explain (a word in context), Translate, vocabulary quiz distractors, book metadata, and an audio podcast. One rule sits above all of them:

Every AI feature ships with its own eval suite from day one. Eval is part of the pull request, not a follow-up.

Concretely, for each feature there's:

A golden dataset. ~30 hand-curated cases per feature, stored as plain JSON, each pairing a realistic input with a reference answer a human would accept.
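
For a sense of what "plain JSON" can look like, here is a hedged sketch that loads two cases and runs basic sanity checks. The field names (`id`, `input`, `reference`) are illustrative assumptions, not TextStack's actual schema, and the raw string literal needs C# 11 or later.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hypothetical file contents, inlined so the example is self-contained.
const string json = """
[
  { "id": "explain-001", "input": "bank | She sat on the bank of the river.",
    "reference": "The sloping land beside a river." },
  { "id": "explain-002", "input": "bank | He deposited cash at the bank.",
    "reference": "A business that holds and lends money." }
]
""";

using var doc = JsonDocument.Parse(json);
var cases = doc.RootElement.EnumerateArray()
    .Select(e => (
        Id: e.GetProperty("id").GetString() ?? "",
        Input: e.GetProperty("input").GetString() ?? "",
        Reference: e.GetProperty("reference").GetString() ?? ""))
    .ToList();

// Cheap checks that catch copy-paste mistakes in a hand-curated file.
var duplicateIds = cases.GroupBy(c => c.Id).Where(g => g.Count() > 1).Select(g => g.Key).ToList();
var incomplete = cases.Where(c => c.Input == "" || c.Reference == "").Select(c => c.Id).ToList();

Console.WriteLine(
    $"{cases.Count} cases, {duplicateIds.Count} duplicate ids, {incomplete.Count} incomplete");
```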

Generation through the real path. The eval runs each case through the same code production uses — the same prompt, the same model gateway — so the test can never quietly drift away from what users actually get. (That drift is a classic, silent way to make an eval lie; more on it in the golden-dataset post.)

A dedicated judge. A second, stronger model (we use a gpt-4.1-class model, deliberately separate from the small, cheap models that generate the features) scores each output 1–5 on a short, feature-specific rubric — for Explain that's accuracy / conciseness / usefulness.
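
Stripped of any library, the judge's job has two halves: build a prompt from the rubric, then parse the scores that come back. The sketch below shows both. The prompt wording and the comma-separated reply format are assumptions made for illustration, not the prompt TextStack uses, and the model call between the two steps is left out.

```csharp
using System;
using System.Linq;

string[] rubric = { "accuracy", "conciseness", "usefulness" };

// Build the judge's instructions from whatever rubric is passed in.
string BuildJudgePrompt(string[] dimensions, string input, string output) =>
    "Score the OUTPUT from 1 to 5 on each dimension.\n"
    + string.Join("\n", dimensions.Select((d, i) => $"{i + 1}. {d}"))
    + $"\nReply with {dimensions.Length} integers separated by commas.\n"
    + $"INPUT: {input}\nOUTPUT: {output}";

// Parse a reply such as "5, 4, 4"; return null when it does not fit the rubric.
int[]? ParseScores(string reply, int expectedCount)
{
    var parts = reply.Split(',', StringSplitOptions.TrimEntries);
    if (parts.Length != expectedCount) return null;

    var scores = new int[expectedCount];
    for (var i = 0; i < parts.Length; i++)
    {
        if (!int.TryParse(parts[i], out scores[i]) || scores[i] is < 1 or > 5)
            return null;
    }
    return scores;
}

Console.WriteLine(BuildJudgePrompt(rubric, "bank | She sat on the bank.", "The river's edge."));
var parsed = ParseScores("5, 4, 4", rubric.Length);
Console.WriteLine(parsed is null ? "unparseable reply" : string.Join("/", parsed));
```

Treating an unparseable or out-of-range reply as a failure, rather than guessing a score, keeps a flaky judge from quietly skewing the average.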

The judge runs on Microsoft.Extensions.AI.Evaluation — Microsoft's official, open-source evaluation library for .NET. This is a deliberate choice: most of the eval ecosystem assumes you're in Python (Braintrust, Phoenix, LangSmith), but a .NET shop doesn't have to leave the platform to do this properly. Our judge is implemented as a custom IEvaluator, so it slots into the same harness as Microsoft's built-in evaluators and runs as an ordinary dotnet test. The whole pipeline is plain C# — no Python bridge, no LangChain. The library is young and moving fast, which also makes it one of the more approachable corners of the .NET AI stack to contribute back to.

// The rubric handed to our custom IEvaluator (Microsoft.Extensions.AI.Evaluation).
// One judge, many features: the rubric is a parameter, not hardcoded.
var explain = new Rubric(
    "accuracy: matches the meaning the word carries in THIS sentence",
    "conciseness: 2-3 sentences, no dictionary boilerplate",
    "usefulness: would a learner find it genuinely helpful");

// In a single file, type declarations must follow top-level statements.
public sealed record Rubric(string Dim1, string Dim2, string Dim3);

Persistence and a dashboard. Every run is stored, and an internal /ai-quality page shows scores and traces per feature, so quality is something we can actually watch over time — not a number that scrolls past in a CI log.

The honest status: we can run the full suite on demand and gate individual features against a quality floor; turning that into an automatic "fail the build if we regress more than X% versus last week" ratchet is the next step. The measuring instrument is built — and building the instrument is the hard 80%.

The traps (so you don't learn them the expensive way)

A quick preview of what the rest of the series unpacks, because these are where eval setups quietly break:

Metrics before error analysis — you measure what was easy to imagine, not what actually fails.
An easy golden set — the score goes up while the product goes down.
A judge you never validated — an LLM grading prose is itself a model; if it doesn't agree with human judgement, your whole pipeline is theatre.
Judge bias — judges quietly prefer longer answers, the first option shown, and text from their own model family.
Shipping on noise — with 30 cases, a 0.1 bump in the average is probably random, not progress.

Each of those is a post of its own.
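
The "shipping on noise" trap is easy to check with arithmetic. As a rough yardstick, the standard error of the mean is the sample standard deviation divided by the square root of the number of cases. The scores below are invented, but they have the spread you would expect from a 1-5 rubric.

```csharp
using System;
using System.Linq;

// Hypothetical judge scores for 30 golden cases (illustrative numbers only).
double[] scores =
{
    4, 5, 3, 4, 4, 5, 2, 4, 5, 3,
    4, 4, 5, 3, 4, 5, 4, 3, 4, 5,
    4, 2, 5, 4, 3, 4, 5, 4, 3, 4,
};

// Standard error of the mean: sqrt(sample variance / n).
double StandardError(double[] values)
{
    var mean = values.Average();
    var variance = values.Sum(v => (v - mean) * (v - mean)) / (values.Length - 1);
    return Math.Sqrt(variance / values.Length);
}

var se = StandardError(scores);
Console.WriteLine($"mean {scores.Average():F2}, standard error {se:F2}");
```

With this data the standard error is larger than 0.1, so a 0.1 move in the average sits inside the noise of a single run. Comparing the same cases pairwise before and after a change tightens that, but it does not make 30 cases a precise instrument.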

Where this is going

Evals are not a dashboard you bolt on at the end. They're the discipline that lets you change an AI product without flying blind — look at your failures, measure them honestly, and gate on the result. Done right, they turn "I think this feature is fine" into "I can prove it, and I'll know the moment it stops being true."

The series:

This post — what evals are and how we run them.
Error analysis — the unglamorous superpower, and how to build a failure taxonomy.
Golden datasets that don't lie — curation, leakage, and the drift trap.
LLM-as-judge, done right — rubrics, a dedicated judge, and the biases that wreck it.
From a number to a gate — evals in CI and online monitoring.

TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET, not a notebook. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.