DEV Community

Stop Testing Your AI. Start Measuring It.

I built a training simulator as a side project. You hand a language model a short premise and it writes an entire choose-your-own-adventure up front — every scene, the characters and their dialogue, the handful of choices at each turn, and the scoring.

That created a testing problem I didn't expect. Every time I tweaked a prompt or swapped a model, the whole thing had to be regenerated from scratch before it could be tested, across multiple template variations and many more permutations of them. I couldn't read it all, and I couldn't tell from a single playthrough whether a one-line change made it better or worse. Manual review wasn't just slow here; it was infeasible. So the real project became a harness that could automatically test the permutations that matter, well enough to let me iterate on prompts and models with confidence. This post is about that harness: what it checks, how it scores, how a run flows, and what it taught me when I used it to pick the models.

What goes in, what comes out

The input is small: a couple of pages of structured brief. A premise (scenario, industry, stakes). A cast — each character with a role, background, personality, and a receptiveness rating from 1 (hostile) to 5 (eager ally). A set of competency rubrics to score against. And a few scope dials — length tier, timeline in weeks, named milestone checkpoints.

The output is enormous. The model produces a branching scene graph: scenes anchored to a week and a role (opening, working, milestone, crisis, resolution); per-scene choices each tagged with a quality tier (optimal / acceptable / suboptimal / wrong); routing that wires each choice to the next scene; and success/partial/failure endings. On top of that sits per-character dialogue that reacts to how the relationship is going — the same choice in the same scene can land differently depending on the running state with that character. And a scoring layer threads through all of it, moving both each stakeholder's disposition and the player's competency.

The shape is multiplicative, and that's the whole problem:

authored reactions ≈ scenes × choices-per-scene × relationship states × characters

A medium template is roughly 21 scenes × 4 choices × ~4 relationship states × ~3 characters — on the order of a thousand individually-scored character reactions, plus the scene narratives, the routing graph, and the per-ending results.
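To make the multiplication concrete, here is a minimal sketch of the estimate above. The dictionary keys and the function name are illustrative, not taken from the actual pipeline; the sizes are the medium-template figures quoted in the text.

```python
from math import prod

# Rough sizes for a medium template, taken from the estimate above.
medium_template = {
    "scenes": 21,
    "choices_per_scene": 4,
    "relationship_states": 4,
    "characters": 3,
}


def authored_reactions(shape: dict[str, int]) -> int:
    """Approximate count of individually scored character reactions."""
    return prod(shape.values())


print(authored_reactions(medium_template))  # 1008
```

Doubling any single factor doubles the total, which is why a longer tier or a larger cast moves the review burden so quickly.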

It's produced by an AWS Step Functions pipeline — a sequence of Lambda phases, some calling a model, some pure deterministic code.

Why this is measurement, not testing

A unit test asserts add(2,2) === 4 — a known-correct answer. But there's no "correct" line for a skeptical stakeholder in scene seven, and no correct answer for whether four choices are a real dilemma or three obvious wrongs and a right one. The structure I can verify deterministically; the writing I can only judge by degrees, with an instrument that has its own error.

That reframe carries three rules unit testing doesn't have:

  1. The instrument has error bars. A model judge has real variance, so you sample it more than once and treat a single score as noisy.

  2. The ruler drifts. Swap or upgrade the judge and every number you've ever recorded silently changes meaning. You re-baseline, or you're comparing two rulers.

  3. A red is a finding, not a verdict. Before you fix anything, you triage what the number is actually measuring — because half the time the bug is in the instrument, not the output.

And a human can't be that instrument, for three compounding reasons:

  • the volume has no answer key (a reviewer just gets more lenient around block four hundred);
  • non-determinism means one read covers one path and the next regeneration isn't the path you read;
  • and every change re-authors the whole thing, so a manual pass is a tax you repay from scratch each iteration, while the regressions that matter (a dimension drifting 4.3 → 4.0) are invisible to a person with no prior number in hand.

So I built the grader.

One run, gated cheap-to-expensive

Most reds die before a token is spent on grading. The two expensive stages (generate, judge) sit behind the two cheap deterministic ones.

The eval CLI runs a fixed matrix of configurations (different scenarios at different lengths). Each one flows through ordered gates, cheapest first, so a broken template never reaches the expensive judge:

  1. Generate (expensive — generator tokens): drive the real production pipeline end-to-end, poll until done, record the template to a manifest before anything judges it.
  2. Structural check (cheap, deterministic): run the production validator, fail on any enforced violation. No judge spend if it fails.
  3. Playthrough (cheap, deterministic): walk the template with the production engine on several strategies. No judge spend if it fails.
  4. Judge (expensive): only reached if both cheap gates passed. Score each quality dimension.
  5. Floor + baseline (cheap, in-process): compare each score against a hard floor and a committed baseline.

Generation and judging are decoupled through that manifest, which makes the slow half resumable: if judging is interrupted, a --rejudge re-scores the same stored templates with zero regeneration cost. Losing a judge mid-run costs seconds, not the hour it took to generate.
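The cheapest-first ordering can be sketched roughly as follows. This is an illustration under assumptions, not the harness's code: the `Gate` class, `run_gates`, and the dictionary fields on the template are hypothetical stand-ins for the real validator, playthrough engine, and judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Gate:
    name: str
    expensive: bool
    run: Callable[[dict], bool]  # returns True when the gate passes


def run_gates(template: dict, gates: list[Gate]) -> tuple[bool, list[str]]:
    """Run gates in order and stop at the first failure."""
    executed = []
    for gate in gates:
        executed.append(gate.name)
        if not gate.run(template):
            return False, executed
    return True, executed


# Hypothetical stand-ins for the real checks, ordered cheap to expensive.
gates = [
    Gate("structural", expensive=False, run=lambda t: t["reachable"]),
    Gate("playthrough", expensive=False, run=lambda t: t["terminates"]),
    Gate("judge", expensive=True, run=lambda t: t["min_score"] >= 3.0),
]

broken = {"reachable": False, "terminates": True, "min_score": 4.2}
passed, executed = run_gates(broken, gates)
print(passed, executed)  # False ['structural']
```

The broken template never reaches the `judge` gate, so no judge tokens are spent on it.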

Those gates map onto four jobs:

One AI feature, split into four jobs, each with its own tool: guarantee what has a correct answer, grade what's subjective, play what gets traversed, and measure options against each other.

GUARANTEE: structure is enforced and self-healing

Structure has a correct answer, so it gets unit-tested with no model in the loop: every scene reachable, no dead ends, the routing valid, the score math sound. The catch is that checkable isn't enforced. My pipeline could call the validator mid-generation, get back "scene 7 is unreachable" — and return the broken outline anyway, because nothing consumed the verdict.
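A deterministic structural check of this kind might look like the sketch below. The data shape is an assumption made for illustration: `routes` maps each scene id to the scene ids its choices lead to, and `endings` names the terminal scenes. The production validator presumably checks more (for example the score math).

```python
from collections import deque


def validate_graph(routes: dict[str, list[str]], start: str, endings: set[str]) -> list[str]:
    """Return human-readable violations; an empty list means the graph is sound."""
    errors = []
    # Breadth-first search from the opening scene finds everything reachable.
    seen = {start}
    queue = deque([start])
    while queue:
        scene = queue.popleft()
        for nxt in routes.get(scene, []):
            if nxt not in routes and nxt not in endings:
                errors.append(f"{scene} routes to unknown scene {nxt}")
            elif nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    for scene in sorted(routes):
        if scene not in seen:
            errors.append(f"{scene} is unreachable")
        if not routes[scene] and scene not in endings:
            errors.append(f"{scene} is a dead end")
    return errors


routes = {"s1": ["s2", "s3"], "s2": ["end"], "s3": [], "s7": ["end"]}
errors = validate_graph(routes, start="s1", endings={"end"})
print(errors)  # ['s3 is a dead end', 's7 is unreachable']
```

Returning a list of specific messages rather than a boolean matters: those messages are what a retry step can feed back to the model.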

So I made generation self-healing, in two layers.

  1. I re-validate the model's final output myself and, if it breaks the contract, feed the specific errors back and make it retry.
  2. As a last resort, a pure, deterministic, unit-tested repair pass splices orphaned scenes back into the main spine. A no-op on healthy output, and my logs literally read "repaired unreachable scene(s)" when it fires. This is what lets a cheaper, less reliable model carry the structure phase: I don't need the model to be perfect, I need the pipeline to be self-correcting around it. A flaky generator plus enforce-and-repair beats an expensive generator you trust blindly.
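The two layers can be sketched as one loop: validate, retry with the errors, and only then fall back to deterministic repair. Everything here is hypothetical scaffolding: `generate_with_healing`, the fake generator `always_orphans`, and the simplified inbound-edge check standing in for the real validator and repair pass.

```python
from typing import Callable

Outline = dict[str, list[str]]


def generate_with_healing(
    generate: Callable[[list[str]], Outline],
    validate: Callable[[Outline], list[str]],
    repair: Callable[[Outline], Outline],
    max_retries: int = 2,
) -> tuple[Outline, str]:
    """Return the outline and how it was obtained: 'clean', 'retried', or 'repaired'."""
    errors: list[str] = []
    for attempt in range(max_retries + 1):
        outline = generate(errors)  # errors from the previous attempt go back in
        errors = validate(outline)
        if not errors:
            return outline, ("clean" if attempt == 0 else "retried")
    repaired = repair(outline)  # last resort, no model involved
    if validate(repaired):
        raise ValueError("outline still invalid after repair")
    return repaired, "repaired"


def _orphans(outline: Outline) -> list[str]:
    first = next(iter(outline))
    targets = {t for nxts in outline.values() for t in nxts}
    return [s for s in outline if s != first and s not in targets]


def find_orphans(outline: Outline) -> list[str]:
    return [f"{s} is unreachable" for s in _orphans(outline)]


def splice_orphans(outline: Outline) -> Outline:
    """Attach every orphaned scene to the first scene; a no-op on healthy output."""
    fixed = {s: list(nxts) for s, nxts in outline.items()}
    fixed[next(iter(outline))] += _orphans(outline)
    return fixed


def always_orphans(errors: list[str]) -> Outline:  # hypothetical flaky model
    return {"s1": ["s2"], "s2": [], "s7": []}


outline, how = generate_with_healing(always_orphans, find_orphans, splice_orphans)
print(how, outline["s1"])  # repaired ['s2', 's7']
```

The repair function is pure, so it can be unit-tested on its own without ever calling a model.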

The scoping lesson here stung: I first folded soft quality signals into this hard gate, and it instantly rejected 50 of my own known-good fixtures. A hard gate must be scoped to exactly what breaks the product if it ships — an unreachable scene breaks gameplay; a slightly flat line of dialogue does not. The first goes in the gate; the second goes to the judge.

GRADE: engineer the judge like an instrument

The judge is a second model that scores the subjective half. The iron law came from my very first run, which flagged a scenario as "nearly linear — all choices route to the same place." It was wrong; the graph branched everywhere. The judge had tried to count edges in its head, which models do confidently and wrongly. So I stopped asking. I compute every countable fact (branching factor, reachability, coverage) in code and hand it to the judge as ground truth. The same dimension on the same output went from a failing 2.0 to a clear 4.0 with zero change to the generator. Anything countable belongs in code; the judge scores meaning, never arithmetic.
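One way to apply that rule is to compute the graph facts first and paste them into the judge prompt. The sketch below assumes a `routes` mapping from scene id to next-scene ids; the fact names and the prompt wording are invented for illustration and are not the prompts used in the harness.

```python
import json


def graph_facts(routes: dict[str, list[str]], start: str) -> dict:
    """Compute the countable facts the judge should never have to estimate."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in routes.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    # Distinct destinations per scene, ignoring endings that have no choices.
    branching = [len(set(nxts)) for nxts in routes.values() if nxts]
    return {
        "scene_count": len(routes),
        "reachable_scenes": len(seen & set(routes)),
        "mean_distinct_next_scenes": round(sum(branching) / len(branching), 2) if branching else 0.0,
        "scenes_where_all_choices_converge": sum(1 for b in branching if b == 1),
    }


def judge_prompt(dimension: str, facts: dict) -> str:
    return (
        f"Score the dimension '{dimension}' from 1 to 5.\n"
        "Treat these computed facts as ground truth; do not recount them:\n"
        f"{json.dumps(facts, indent=2)}"
    )


routes = {"s1": ["s2", "s3", "s3", "s4"], "s2": ["s4"], "s3": ["s4"], "s4": []}
facts = graph_facts(routes, "s1")
print(judge_prompt("scene-graph pedagogy", facts))
```

The judge still decides whether the branching is pedagogically meaningful; it just no longer has to guess how much branching there is.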

Around that law, the judge is built like an instrument with known error:

  • Forced tool-use: the judge must return a structured {dimension, score, rationale} via a single constrained tool, on a 1–5 scale. No regexing scores out of prose.
  • Temperature 0, and still N = 3 samples per dimension, averaged. Why both? Even at temperature 0, one dimension scored 4.00, 3.33, and 2.67 across three identical runs on identical output. One sample is a coin flip you'd gate a release on by accident.
  • A hard floor of 3.0/5 and a committed regression baseline (tolerance 0.5). The floor catches "this is unacceptable"; the baseline catches "this silently got worse than last week," which is the failure a human can't feel.
  • A different, stronger model family than the generator. The generator is non-Anthropic; the judge is Claude Opus 4.8. Upgrading the judge re-sets the meaning of every past number, so the judge is a deliberate, fixed instrument. Not something you swap casually.
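The sampling and gating rules in the list above can be sketched in a few lines. The constants come from the text (floor 3.0, tolerance 0.5, N = 3); the function names and the canned samples standing in for real judge calls are assumptions.

```python
from statistics import mean
from typing import Callable

FLOOR = 3.0      # hard floor on the 1-5 scale
TOLERANCE = 0.5  # allowed drop below the committed baseline


def score_dimension(judge_once: Callable[[], float], n: int = 3) -> float:
    """Average n independent judge samples for one dimension."""
    return round(mean(judge_once() for _ in range(n)), 2)


def check(scores: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return the findings for a run; an empty list means the run is green."""
    findings = []
    for dim, score in scores.items():
        if score < FLOOR:
            findings.append(f"{dim}: {score} is below the floor {FLOOR}")
        elif dim in baseline and score < baseline[dim] - TOLERANCE:
            findings.append(f"{dim}: {score} regressed from baseline {baseline[dim]}")
    return findings


# Canned samples stand in for three judge calls on identical output.
canned = iter([4.00, 3.33, 2.67])
tone = score_dimension(lambda: next(canned))
print(tone)  # 3.33

findings = check(
    {"tone": tone, "immersion": 4.0, "difficulty": 2.0},
    baseline={"tone": 4.0, "immersion": 4.3},
)
print(findings)
```

Here `tone` clears the floor but trips the baseline check, while `difficulty` breaches the floor outright, so the two rules catch different failures.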

It scores 21 dimensions across the seven generation phases — concretely:

| Phase | Dimensions scored |
| --- | --- |
| Outline | narrative coherence · milestone appropriateness · scene-graph pedagogy |
| Scenes & actions | scene immersion · action distinctness · difficulty calibration · tier↔framework alignment |
| Responses | in-character voice · tone monotonicity across relationship states · sentiment↔score alignment · response realism |
| Callback lines | salience · concision & voice · continuity |
| Score calibration | delta magnitude · delta-dimension appropriateness |
| Input fidelity | scenario grounding · character fidelity · milestone fidelity |
| Tag semantics | effect-tag accuracy · attribute-tag accuracy |

Every dimension is a 1–5 score with the same 3.0 floor; any single floor breach reds the whole run.

The reusable habit: read the artifact, not the score. A tone dimension sat at 2.0 for weeks. I tuned the prompt, re-ran, watched it not move — chasing a number. Then I pulled the actual responses and read them: the character meant to be cold and skeptical was saying "I appreciate that you're willing to…" — a line indistinguishable from the warmest character in the cast. The bands had collapsed into one warm blur. The root cause was structural: each tonal variant was generated in its own isolated call, so asking one to be "colder than the others" was meaningless — it couldn't see the others. The fix was to stop asking for a relative property and give each call an absolute anchor:

```text
# BEFORE — relative, and invisible to this call
Make this character's reply colder than the others'.

# AFTER — absolute rung on a shared scale, with the leaks named
You are warmth level 1 of 4: guarded and curt — clipped, transactional, no reassurance.
Never use warmth phrases such as: "I appreciate", "I'm glad to", "happy to", "great question".
```
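The same idea can be made mechanical: build each call's prompt from a shared ladder, and scan the replies for the named leaks in code. In this sketch only rung 1 and the banned phrases come from the prompt above; rungs 2 to 4 and both function names are invented placeholders.

```python
WARMTH_PHRASES = ("I appreciate", "I'm glad to", "happy to", "great question")

# Rung 1 is quoted from the prompt above; rungs 2-4 are invented placeholders.
WARMTH_LADDER = {
    1: "guarded and curt: clipped, transactional, no reassurance",
    2: "neutral and professional: polite, no personal warmth",
    3: "cooperative: engaged, acknowledges good points",
    4: "warm and supportive: openly encouraging",
}


def anchor_prompt(level: int) -> str:
    """Build an absolute instruction that makes sense in an isolated call."""
    lines = [f"You are warmth level {level} of {len(WARMTH_LADDER)}: {WARMTH_LADDER[level]}."]
    if level == 1:
        banned = ", ".join(f'"{p}"' for p in WARMTH_PHRASES)
        lines.append(f"Never use warmth phrases such as: {banned}.")
    return "\n".join(lines)


def leaked_phrases(reply: str) -> list[str]:
    """List the banned warmth phrases that appear in a reply meant to be cold."""
    text = reply.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return [p for p in WARMTH_PHRASES if p.lower() in text]


print(anchor_prompt(1))
print(leaked_phrases("I appreciate that you're willing to try."))  # ['I appreciate']
```

A phrase scan like this is a cheap deterministic check that can run before the judge, in the same spirit as the structural gate.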

PLAY: simulate the user, not the artifact

Some bugs only exist when something traverses the artifact. Three of ten generated simulations let a consistently wrong player loop forever, and every structural check was green. Correctly so: a scene inside a cycle can still reach an ending, so reachability analysis is blind to the loop. So I wrote a deterministic walker that plays each simulation through every strategy (best path, worst path, alternating) with no model calls, in milliseconds. It found the infinite loops instantly. And the most useful part was where the bug lived: not in the generated content but in the runtime, which had no "after N turns, force an ending" budget. The fix shipped in the runtime, not the generator.

I walk every strategy, not just the failure path, because the same walker caught the opposite bug on the best path: on one model tier, optimal play ended the story weeks early. Worst-play looping and best-play content-starvation are opposite failures, invisible to each other, both falling out of the same nearly free tool.
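A deterministic walker of this kind is small. The sketch below assumes a scene format invented for illustration (each scene has a list of choices with a `tier` and a `next` scene id, and endings have no choices); the turn budget plays the role of the "after N turns, force an ending" guard described above.

```python
TIERS = ["optimal", "acceptable", "suboptimal", "wrong"]  # best to worst


def pick(choices: list[dict], strategy: str, turn: int) -> dict:
    ranked = sorted(choices, key=lambda c: TIERS.index(c["tier"]))
    if strategy == "best":
        return ranked[0]
    if strategy == "worst":
        return ranked[-1]
    return ranked[0] if turn % 2 == 0 else ranked[-1]  # alternating


def walk(scenes: dict, start: str, strategy: str, max_turns: int = 50) -> tuple[str, list[str]]:
    """Play one strategy with no model calls; return (outcome, path)."""
    current, path = start, [start]
    for turn in range(max_turns):
        choices = scenes[current]["choices"]
        if not choices:  # an ending has no choices
            return "ended", path
        current = pick(choices, strategy, turn)["next"]
        path.append(current)
    outcome = "ended" if not scenes[current]["choices"] else "turn budget exceeded"
    return outcome, path


# Every scene can reach "end", so a reachability check passes,
# yet a consistently wrong player never leaves s1.
scenes = {
    "s1": {"choices": [{"tier": "optimal", "next": "s2"}, {"tier": "wrong", "next": "s1"}]},
    "s2": {"choices": [{"tier": "optimal", "next": "end"}, {"tier": "wrong", "next": "s1"}]},
    "end": {"choices": []},
}

for strategy in ("best", "worst", "alternating"):
    outcome, path = walk(scenes, "s1", strategy, max_turns=10)
    print(strategy, outcome, len(path))
```

Comparing path lengths across strategies is also how a best-play run that ends too early would show up.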

MEASURE: ranking models, and why the answer was two of them

I started the way everyone does: pick one model, run the whole pipeline on it, compare. I ran a fleet that way, judged by Claude Sonnet 4.6. The result was one stubborn pattern: every model sat somewhere on a reliability-vs-quality axis, and not one was strong on both. (Cost is modeled, not measured per model: tokens were measured once as a single pipeline-aggregate profile, ~15K in / ~126K out per generation with output ~88% of the bill, and each model's $/gen is derived from that profile × its published rates.)
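The modeled cost is a single multiplication of the shared token profile by a model's rates. In this sketch the token counts are the aggregate profile quoted above, while the per-million-token rates in the example call are arbitrary placeholders, not any provider's published prices.

```python
TOKENS_IN = 15_000    # pipeline-aggregate profile quoted above
TOKENS_OUT = 126_000


def cost_per_generation(usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Modeled $/gen: one shared token profile times per-million-token rates."""
    cost = TOKENS_IN / 1e6 * usd_per_m_in + TOKENS_OUT / 1e6 * usd_per_m_out
    return round(cost, 4)


# Placeholder rates purely for illustration: $1.00/M input, $2.00/M output.
print(cost_per_generation(1.00, 2.00))  # 0.267
```

Because the profile is shared, the ranking by cost depends only on each model's rates; a model that emits far more or fewer tokens than the profile would be mispriced by this method.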

| Generator (whole pipeline) | Reliability | Quality | Latency | ~$/gen | Verdict |
| --- | --- | --- | --- | --- | --- |
| Amazon Nova 2 Lite | 2/2 valid | hard dims capped ~2/5 | ~2–3 min | $0.19 | reliable, quality-capped |
| Qwen3 235B | 1/2 (orphan scene) | beats Nova everywhere (3–4) | ~5 min | $0.16 | strong, unreliable |
| Qwen3 Coder 480B | 2/2 valid | capped (difficulty 2) | ~4 min | $0.32 | reliable, capped |
| DeepSeek V3.2 | 0/2 (bad JSON, then orphan) | failed structurally | ~2 min | $0.35 | looked like the worst fit |
| Kimi K2 Thinking | 2/2 full pass | immersion 5, distinctness 4 | ~15 min | $0.63 | best, but slow + pricey |
| Nova 1 Pro | 0/2 (tool loop, 10 breaches) | worse than Nova 2 Lite | ~10 min+ | $0.25 | ruled out |

No single-model winner. But when I tabulated where each model failed, the split was absolute: every reliability failure was a structure-phase failure, and every quality ceiling was a content-phase concern. Those are two different jobs.
Structure wants a cheap model that reliably emits valid JSON in a tool loop; content wants the strongest writer I can afford.
Forcing one model to be good at both is what made the search look unwinnable. So I stopped picking one model and split the pipeline (that's the two model roles in Figure 2).

For the content seam I ran a second, narrower bake-off and upgraded the judge to a stricter, stronger model, Claude Opus 4.8, re-running the same output. That swap mattered on its own: Qwen3 235B had "cracked" the hardest dimension under the Sonnet judge, then under Opus its difficulty score dipped to 2.0 with ten floor breaches. The content didn't change — the ruler did.
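One way to keep two rulers from being compared by accident is to store the judge's identity alongside the baseline and refuse a comparison when it differs. This is a sketch of that guard, not something the post says the harness does; the record layout and the judge ids `judge-a` and `judge-b` are hypothetical.

```python
def compare_to_baseline(run: dict, baseline: dict, tolerance: float = 0.5) -> list[str]:
    """Return regressed dimensions, refusing to compare across different judges."""
    if run["judge"] != baseline["judge"]:
        raise ValueError(
            f"baseline was recorded with {baseline['judge']}, "
            f"run was judged by {run['judge']}: re-baseline first"
        )
    return [
        dim
        for dim, score in run["scores"].items()
        if score < baseline["scores"].get(dim, score) - tolerance
    ]


baseline = {"judge": "judge-a", "scores": {"difficulty": 3.5, "distinctness": 3.6}}
same_ruler = {"judge": "judge-a", "scores": {"difficulty": 2.0, "distinctness": 3.5}}
new_ruler = {"judge": "judge-b", "scores": {"difficulty": 2.0, "distinctness": 3.5}}

print(compare_to_baseline(same_ruler, baseline))  # ['difficulty']
try:
    compare_to_baseline(new_ruler, baseline)
except ValueError as exc:
    print(exc)
```

With the guard in place, a judge upgrade forces an explicit re-baseline instead of showing up as a mysterious regression.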

| Content model | difficulty | distinctness | reliability | latency | note |
| --- | --- | --- | --- | --- | --- |
| DeepSeek V3.2 | 3.23 | 3.63 | 0 fail / 0 timeout | ~4.6 min | best callback lines (3.90/3.93) |
| Claude Sonnet 4.6 | 3.77 (only ceiling-cracker) | 3.9 | 0 fail | 7.6 min | ~$2.09/gen |
| Claude Haiku 4.5 | 3.0 | | 0 fail | 3.9 min | ~$0.79/gen |
| Qwen3 235B | 2.75 | | 0 fail | 4.1 min | 10 breaches under Opus |
| gpt-oss-120b | 2.0 (breached all) | 2.0 | 0 timeout, fastest | 3.4 min | speed + reliability ≠ quality |
| Nova 2 Lite + reasoning | 2.0 | | 2/10 timed out >90 min | 12.7 min | reasoning is a liability for bulk content |

Two findings no single-model test would have shown me. First, the model I'd dismissed as the worst became the model I shipped. DeepSeek went 0/2 in the full-pipeline round, but both failures were structure-phase tool-loop problems, not content problems. Given only the content job, it came second to Sonnet at a fraction of the cost. Second, fast and reliable still lost on quality: gpt-oss-120b was the fastest with zero failures, and floor-breached both hard dimensions at 2.0, below even cheap Haiku.

What shipped, and the tradeoffs

  • Structure → Amazon Nova 2 Lite. The cheapest model that reliably re-emits valid JSON as its final tool-loop answer, carried by the self-healing pipeline. ~$0.19/generation.
  • Content → DeepSeek V3.2. Reliable, strong, and ~1/6 the cost of the best writer.
  • Judge → Claude. Sonnet 4.6 as the standing grader; I escalate to Opus 4.8 for the hardest, most subjective comparisons.

The biggest tradeoff was deliberate, and it was about cost, not about denying the quality gap. Claude (Sonnet) was the best content model I measured, full stop, but at ~$2.09 per generation versus DeepSeek's ~$0.35 and Nova's ~$0.19 it was roughly 6–11× the price.
And generation runs per template, over and over, every time I iterate on a prompt. The judge, by contrast, runs only when I evaluate — far less often. So the economically right move was to put the cheap, self-healed models in the pipeline that runs constantly, and reserve the expensive Claude for the judging seat where its cost is amortized across an entire eval, not paid per template. I accepted a measurably weaker writer in the pipeline to keep the per-generation cost an order of magnitude lower, because generation is pre-computed once and amortizes to fractions of a cent per play session.

The playbook

  1. Split the feature — guarantee, grade, play, measure — and give each the cheapest tool that sees its failures. Hold your own tooling to the same split.
  2. Enforce structure and make it self-healing — re-validate the final output, loop errors back to retry, keep a deterministic repair pass. That's what lets a cheap model be reliable enough to ship.
  3. Never make the judge do arithmetic — compute everything countable in code and inject it as ground truth.
  4. Engineer the judge as an instrument — forced tool-use, temperature 0, N≥3 averaged, a hard floor and a committed baseline, a stronger and different family than the generator. Any judge change invalidates every score; re-baseline.
  5. If your output is traversed, play it through every strategy with a deterministic walker — it's nearly free and catches loops and content-starvation no judge can express.
  6. Read the artifact, not the score — the specific leaking phrase points at the one-line fix, and prove every fix across the matrix, because one green run on a non-deterministic generator is not evidence.
  7. Make model selection a results table with reliability as its own column — and when no single model wins on reliability and quality, split the work by phase and compose two, instead of forcing one to be good at everything.

Every time I treated a red as a bad model, I was wrong. Every time I treated it as a measurement question first — what is this number actually measuring, and can I trust the instrument? — it pointed me at the real lever.

Stop testing your AI. Start measuring it, and engineer the instruments as carefully as the thing they measure.
