I built a training simulator as a side project. You hand a language model a short premise and it writes an entire choose-your-own-adventure up front — every scene, the characters and their dialogue, the handful of choices at each turn, and the scoring.
That created a testing problem I didn't expect. Every time I tweaked a prompt or swapped a model, the whole thing had to be regenerated from scratch before it could be tested, across multiple template variations and many more permutations of settings. I couldn't read it all, and I couldn't tell from a single playthrough whether a one-line change made it better or worse. Manual review wasn't merely slow here; it was infeasible. So the real project became a harness that could test the permutations that matter automatically, well enough to let me iterate on prompts and models with confidence. This post is about that harness: what it checks, how it scores, how a run flows, and what it taught me when I used it to pick the models.
What goes in, what comes out
The input is small: a couple of pages of structured brief. A premise (scenario, industry, stakes). A cast — each character with a role, background, personality, and a receptiveness rating from 1 (hostile) to 5 (eager ally). A set of competency rubrics to score against. And a few scope dials — length tier, timeline in weeks, named milestone checkpoints.
The output is enormous. The model produces a branching scene graph: scenes anchored to a week and a role (opening, working, milestone, crisis, resolution); per-scene choices each tagged with a quality tier (optimal / acceptable / suboptimal / wrong); routing that wires each choice to the next scene; and success/partial/failure endings. On top of that sits per-character dialogue that reacts to how the relationship is going — the same choice in the same scene can land differently depending on the running state with that character. And a scoring layer threads through all of it, moving both each stakeholder's disposition and the player's competency.
The shape is multiplicative, and that's the whole problem:
authored reactions ≈ scenes × choices-per-scene × relationship states × characters
A medium template is roughly 21 scenes × 4 choices × ~4 relationship states × ~3 characters — on the order of a thousand individually-scored character reactions, plus the scene narratives, the routing graph, and the per-ending results.
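To make the multiplication concrete, here is the same estimate as a few lines of Python. The factor values are the rough figures quoted above; the dictionary keys and the function name are illustrative, not part of the real pipeline.

```python
from math import prod

# Rough factors for a medium template, taken from the estimate above.
medium_template = {
    "scenes": 21,
    "choices_per_scene": 4,
    "relationship_states": 4,
    "characters": 3,
}


def authored_reactions(factors: dict) -> int:
    """Estimate individually-scored character reactions as the product of the factors."""
    return prod(factors.values())


print(authored_reactions(medium_template))  # 1008
```

Because the count is a product, doubling any one factor doubles the review surface, which is why reading it all by hand stops scaling almost immediately.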
All of this is produced by an AWS Step Functions pipeline: a sequence of Lambda phases, some calling a model, some pure deterministic code.
Why this is measurement, not testing
A unit test asserts add(2,2) === 4 — a known-correct answer. But there's no "correct" line for a skeptical stakeholder in scene seven, and no correct answer for whether four choices are a real dilemma or three obvious wrongs and a right one. The structure I can verify deterministically; the writing I can only judge by degrees, with an instrument that has its own error.
That reframe carries three rules unit testing doesn't have:
The instrument has error bars. A model judge has real variance, so you sample it more than once and treat a single score as noisy.
The ruler drifts. Swap or upgrade the judge and every number you've ever recorded silently changes meaning. You re-baseline, or you're comparing two rulers.
A red is a finding, not a verdict. Before you fix anything, you triage what the number is actually measuring — because half the time the bug is in the instrument, not the output.
And a human can't be that instrument, for three compounding reasons:
- the volume has no answer key (a reviewer just gets more lenient around block four hundred);
- non-determinism means one read covers one path and the next regeneration isn't the path you read;
- and every change re-authors the whole thing, so a manual pass is a tax you repay from scratch each iteration, while the regressions that matter (a dimension drifting 4.3 → 4.0) are invisible to a person with no prior number in hand.
So I built the grader.
One run, gated cheap-to-expensive
The eval CLI runs a fixed matrix of configurations (different scenarios at different lengths). Each one flows through ordered gates, cheapest first, so a broken template never reaches the expensive judge:
- Generate (expensive — generator tokens): drive the real production pipeline end-to-end, poll until done, record the template to a manifest before anything judges it.
- Structural check (cheap, deterministic): run the production validator, fail on any enforced violation. No judge spend if it fails.
- Playthrough (cheap, deterministic): walk the template with the production engine on several strategies. No judge spend if it fails.
- Judge (expensive): only reached if both cheap gates passed. Score each quality dimension.
- Floor + baseline (cheap, in-process): compare each score against a hard floor and a committed baseline.
Generation and judging are decoupled through that manifest, which makes the slow half resumable: if judging is interrupted, a --rejudge re-scores the same stored templates with zero regeneration cost. Losing a judge mid-run costs seconds, not the hour it took to generate.
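A minimal sketch of that gate ordering and the manifest handoff, assuming the manifest is a JSON file keyed by configuration name. `run_config`, `RunResult`, and the callables passed in are hypothetical stand-ins for the real CLI internals, and the final floor-and-baseline comparison is left out for brevity.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class RunResult:
    config: str
    failed_gate: Optional[str] = None  # name of the cheap gate that stopped the run
    scores: dict = field(default_factory=dict)


def run_config(
    config: str,
    generate: Callable[[str], dict],
    cheap_gates: list,  # ordered (name, check) pairs; check(template) -> bool
    judge: Callable[[dict], dict],
    manifest_path: Path,
    rejudge: bool = False,
) -> RunResult:
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    if rejudge and config in manifest:
        template = manifest[config]  # reuse the stored template: no generator spend
    else:
        template = generate(config)
        manifest[config] = template  # record the template before anything judges it
        manifest_path.write_text(json.dumps(manifest))
    for name, check in cheap_gates:  # e.g. structural check, then playthrough
        if not check(template):
            return RunResult(config, failed_gate=name)  # the judge is never called
    return RunResult(config, scores=judge(template))
```

The property worth keeping from this shape is that the expensive judge call sits behind every cheap check, and that a re-judge reads the stored template instead of calling the generator again.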
Those gates map onto four jobs: guarantee, grade, play, and measure.
GUARANTEE: structure is enforced and self-healing
Structure has a correct answer, so it gets unit-tested with no model in the loop: every scene reachable, no dead ends, the routing valid, the score math sound. The catch is that checkable isn't enforced. My pipeline could call the validator mid-generation, get back "scene 7 is unreachable" — and return the broken outline anyway, because nothing consumed the verdict.
So I made generation self-healing, in two layers.
- I re-validate the model's final output myself and, if it breaks the contract, feed the specific errors back and make it retry.
- As a last resort, a pure, deterministic, unit-tested repair pass splices orphaned scenes back into the main spine. A no-op on healthy output, and my logs literally read "repaired unreachable scene(s)" when it fires. This is what lets a cheaper, less reliable model carry the structure phase: I don't need the model to be perfect, I need the pipeline to be self-correcting around it. A flaky generator plus enforce-and-repair beats an expensive generator you trust blindly.
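The two layers can be sketched as follows. This is an illustration under assumptions, not the production code: the scene graph is reduced to a dict mapping each scene to the scenes its choices route to, `generate` is a hypothetical callable that receives the previous attempt's error messages, and the splice rule (attach an orphan to the nearest earlier reachable scene in authored order) is one plausible repair strategy rather than the one the pipeline necessarily uses.

```python
from collections import deque


def unreachable_scenes(routes: dict, start: str) -> list:
    """Breadth-first search from the opening scene; return scenes never visited."""
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in routes.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return [scene for scene in routes if scene not in seen]


def repair(routes: dict, start: str) -> dict:
    """Pure repair pass: returns a copy, and is a no-op when nothing is orphaned."""
    fixed = {scene: list(nexts) for scene, nexts in routes.items()}
    order = list(fixed)  # authored order stands in for the main spine
    for orphan in unreachable_scenes(fixed, start):
        still_orphaned = set(unreachable_scenes(fixed, start))
        if orphan not in still_orphaned:
            continue  # reconnected as a side effect of an earlier splice
        earlier = reversed(order[: order.index(orphan)])
        parent = next((s for s in earlier if s not in still_orphaned), start)
        fixed[parent].append(orphan)
    return fixed


def generate_with_healing(generate, start: str, max_retries: int = 2):
    """Layer 1: validate and retry with specific errors. Layer 2: deterministic repair."""
    errors = []
    for _ in range(max_retries + 1):
        routes = generate(errors)
        orphans = unreachable_scenes(routes, start)
        if not orphans:
            return routes, "valid"
        errors = [f"scene {s} is unreachable" for s in orphans]
    return repair(routes, start), "repaired"
```

The point of the sketch is that the validator's verdict is always consumed: it either drives a retry or triggers the repair, and a broken outline is never returned as-is.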
The scoping lesson here stung: I first folded soft quality signals into this hard gate, and it instantly rejected 50 of my own known-good fixtures. A hard gate must be scoped to exactly what breaks the product if it ships — an unreachable scene breaks gameplay; a slightly flat line of dialogue does not. The first goes in the gate; the second goes to the judge.
GRADE: engineer the judge like an instrument
The judge is a second model that scores the subjective half. The iron law came from my very first run, which flagged a scenario as _"nearly linear — all choices route to the same place."_
It was wrong; the graph branched everywhere. The judge had tried to count edges in its head, which models do confidently and wrong. So I stopped asking. I compute every countable fact (branching factor, reachability, coverage) in code and hand it to the judge as ground truth. The same dimension on the same output went from a failing 2.0 to a clear 4.0 with zero change to the generator. Anything countable belongs in code; the judge scores meaning, never arithmetic.
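As a rough illustration of "countable facts in code", the sketch below derives a few graph statistics and formats them as a block to paste into the judge prompt. The graph shape (scene mapped to one target per choice), the metric names, and the wording of the ground-truth header are all assumptions made for this example.

```python
def graph_facts(routes: dict, start: str) -> dict:
    """Compute countable properties of the scene graph so the judge never has to."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in routes.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    # Distinct destinations per scene, ignoring ending scenes with no choices.
    distinct = [len(set(targets)) for targets in routes.values() if targets]
    return {
        "scene_count": len(routes),
        "reachable_scenes": len(seen & set(routes)),
        "mean_distinct_targets": round(sum(distinct) / len(distinct), 2) if distinct else 0.0,
        "scenes_where_all_choices_converge": sum(
            1 for targets in routes.values() if len(targets) > 1 and len(set(targets)) == 1
        ),
    }


def ground_truth_block(facts: dict) -> str:
    """Render the facts as text the judge is told to trust rather than recount."""
    lines = [f"- {name}: {value}" for name, value in facts.items()]
    return "GROUND TRUTH (computed in code; do not recount):\n" + "\n".join(lines)


routes = {
    "a": ["b", "c", "b", "d"],  # four choices, three distinct destinations
    "b": ["e"],
    "c": ["e"],
    "d": ["e", "e"],  # two choices that converge on one scene
    "e": [],
}
print(ground_truth_block(graph_facts(routes, "a")))
```

With numbers like these in the prompt, a claim such as "all choices route to the same place" becomes something the judge can check against a supplied fact instead of estimating it.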
Around that law, the judge is built like an instrument with known error:
- Forced tool-use: the judge must return a structured {dimension, score, rationale} via a single constrained tool, on a 1–5 scale. No regexing scores out of prose.
- Temperature 0, and still N = 3 samples per dimension, averaged. Why both? Even at temperature 0, one dimension scored 4.00, 3.33, and 2.67 across three identical runs on identical output. One sample is a coin flip you'd gate a release on by accident.
- A hard floor of 3.0/5 and a committed regression baseline (tolerance 0.5). The floor catches "this is unacceptable"; the baseline catches "this silently got worse than last week," which is the failure a human can't feel.
- A different, stronger model family than the generator. The generator is non-Anthropic; the judge is Claude Opus 4.8. Upgrading the judge re-sets the meaning of every past number, so the judge is a deliberate, fixed instrument. Not something you swap casually.
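The sampling, floor, and baseline rules above reduce to a small amount of arithmetic. The sketch below assumes a regression means the averaged score dropped more than the tolerance below the committed baseline; the function names are illustrative, and the three samples are the ones quoted above.

```python
from statistics import mean

FLOOR = 3.0      # hard floor on the 1-5 scale
TOLERANCE = 0.5  # allowed drop below the committed baseline


def aggregate(samples: list) -> float:
    """Average repeated judge samples for one dimension."""
    return round(mean(samples), 2)


def check_dimension(samples: list, baseline: float):
    """Return the averaged score plus any findings against the floor and baseline."""
    score = aggregate(samples)
    findings = []
    if score < FLOOR:
        findings.append("floor breach")
    if baseline - score > TOLERANCE:
        findings.append("regression vs baseline")
    return score, findings


# The three identical runs from above, against a hypothetical baseline of 4.0.
print(check_dimension([4.00, 3.33, 2.67], baseline=4.0))
# Gating on a single sample would have given a different verdict each time:
for single in (4.00, 3.33, 2.67):
    print(single, check_dimension([single], baseline=4.0))
```

Taken one at a time, those three samples produce three different verdicts against the same baseline (clean, regression, and floor breach plus regression), which is the practical argument for averaging before gating.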
The judge scores 21 dimensions across the seven generation phases:
| Phase | Dimensions scored |
|---|---|
| Outline | narrative coherence · milestone appropriateness · scene-graph pedagogy |
| Scenes & actions | scene immersion · action distinctness · difficulty calibration · tier↔framework alignment |
| Responses | in-character voice · tone monotonicity across relationship states · sentiment↔score alignment · response realism |
| Callback lines | salience · concision & voice · continuity |
| Score calibration | delta magnitude · delta-dimension appropriateness |
| Input fidelity | scenario grounding · character fidelity · milestone fidelity |
| Tag semantics | effect-tag accuracy · attribute-tag accuracy |
Every dimension is a 1–5 score with the same 3.0 floor; any single floor breach reds the whole run.
The reusable habit: read the artifact, not the score. A tone dimension sat at 2.0 for weeks. I tuned the prompt, re-ran, watched it not move — chasing a number. Then I pulled the actual responses and read them: the character meant to be cold and skeptical was saying "I appreciate that you're willing to…" — a line indistinguishable from the warmest character in the cast. The bands had collapsed into one warm blur. The root cause was structural: each tonal variant was generated in its own isolated call, so asking one to be "colder than the others" was meaningless — it couldn't see the others. The fix was to stop asking for a relative property and give each call an absolute anchor:
```text
# BEFORE — relative, and invisible to this call
Make this character's reply colder than the others'.

# AFTER — absolute rung on a shared scale, with the leaks named
You are warmth level 1 of 4: guarded and curt — clipped, transactional, no reassurance.
Never use warmth phrases such as: "I appreciate", "I'm glad to", "happy to", "great question".
```
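One way to apply the same idea across all four rungs, and to check generated replies for the named leaks, is sketched below. Only the level 1 wording and the four banned phrases come from the prompt above; the descriptions for levels 2 to 4, and the choice to attach the banned list to the coldest rung only, are hypothetical placeholders.

```python
# Level 1 paraphrases the prompt above; levels 2-4 are hypothetical placeholders.
WARMTH_LEVELS = {
    1: "guarded and curt, clipped, transactional, no reassurance",
    2: "neutral and businesslike, polite but noncommittal",
    3: "open and cooperative, acknowledges the player's effort",
    4: "warm and supportive, volunteers help and encouragement",
}
WARMTH_PHRASES = ["I appreciate", "I'm glad to", "happy to", "great question"]


def anchor_prompt(level: int) -> str:
    """Build an absolute instruction that makes sense inside an isolated call."""
    prompt = f"You are warmth level {level} of {len(WARMTH_LEVELS)}: {WARMTH_LEVELS[level]}."
    if level == 1:
        banned = ", ".join(f'"{p}"' for p in WARMTH_PHRASES)
        prompt += f"\nNever use warmth phrases such as: {banned}."
    return prompt


def warmth_leaks(reply: str) -> list:
    """List the banned warmth phrases that appear in a reply (case-insensitive)."""
    lowered = reply.lower()
    return [p for p in WARMTH_PHRASES if p.lower() in lowered]


print(anchor_prompt(1))
print(warmth_leaks("I appreciate that you're willing to walk me through this."))
```

A leak check like `warmth_leaks` is deterministic and costs nothing to run, so it can sit alongside the judge's tone score and point at the exact phrase when the score drops.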
PLAY: simulate the user, not the artifact
Some bugs only exist when something traverses the artifact. Three of ten generated simulations let a consistently wrong player loop forever. Every structural check was green, and correctly so: a scene inside a cycle can still reach an ending, so reachability analysis is blind to the loop. So I wrote a deterministic walker that plays each simulation through every strategy (best path, worst path, alternating) with no model calls, in milliseconds. It found the infinite loops instantly. And the most useful part was where the bug lived: not in the generated content but in the runtime, which had no "after N turns, force an ending" budget. The fix shipped in the runtime, not the generator.
I walk every strategy, not just the failure path, because the same walker caught the opposite bug on the best path: on one model tier, optimal play ended the story weeks early. Worst-play looping and best-play content-starvation are opposite failures, invisible to each other, both falling out of the same nearly free tool.
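A stripped-down version of such a walker might look like the sketch below. It assumes a template reduced to scene ids mapped to choices, each with a quality tier and a next scene; the real walker drives the production engine, so `walk`, `pick`, and the `max_turns` budget here are illustrative only.

```python
TIER_RANK = {"optimal": 0, "acceptable": 1, "suboptimal": 2, "wrong": 3}


def pick(choices: list, strategy: str, turn: int) -> dict:
    """Choose deterministically: best tier, worst tier, or alternate between them."""
    ranked = sorted(choices, key=lambda c: TIER_RANK[c["tier"]])
    if strategy == "best":
        return ranked[0]
    if strategy == "worst":
        return ranked[-1]
    return ranked[0] if turn % 2 == 0 else ranked[-1]  # "alternating"


def walk(template: dict, start: str, strategy: str, max_turns: int = 50) -> dict:
    """Play one strategy until an ending (a scene with no choices) or the turn budget."""
    scene, path, turns = start, [start], 0
    while template[scene]:
        if turns >= max_turns:
            return {"ended": False, "turns": turns, "path": path}
        scene = pick(template[scene], strategy, turns)["next"]
        path.append(scene)
        turns += 1
    return {"ended": True, "turns": turns, "path": path}


template = {
    "open": [{"tier": "optimal", "next": "work"}, {"tier": "wrong", "next": "crisis"}],
    "work": [{"tier": "optimal", "next": "win"}, {"tier": "wrong", "next": "crisis"}],
    "crisis": [{"tier": "acceptable", "next": "work"}, {"tier": "wrong", "next": "crisis"}],
    "win": [],
}
for strategy in ("best", "worst", "alternating"):
    result = walk(template, "open", strategy, max_turns=10)
    print(strategy, result["ended"], result["turns"])
```

In this toy template every scene can reach `win`, so a reachability check passes, yet the worst-play strategy never leaves `crisis`. The same output also exposes the opposite symptom: a best-play run that ends in very few turns is a cue to check for content being skipped.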
MEASURE: ranking models, and why the answer was two of them
I started the way everyone does: pick one model, run the whole pipeline on it, compare. I ran a fleet that way, judged by Claude Sonnet 4.6. The result was one stubborn pattern: every model sat somewhere on a reliability-vs-quality axis, and not one was strong on both. (Cost is modeled, not measured per model: tokens were measured once as a single pipeline-aggregate profile, ~15K in / ~126K out per generation with output ~88% of the bill, and each model's $/gen was derived from that profile × its published rates.)
| Generator (whole pipeline) | Reliability | Quality | Latency | ~$/gen | Verdict |
|---|---|---|---|---|---|
| Amazon Nova 2 Lite | 2/2 valid | hard dims capped ~2/5 | ~2–3 min | $0.19 | reliable, quality-capped |
| Qwen3 235B | 1/2 (orphan scene) | beats Nova everywhere (3–4) | ~5 min | $0.16 | strong, unreliable |
| Qwen3 Coder 480B | 2/2 valid | capped (difficulty 2) | ~4 min | $0.32 | reliable, capped |
| DeepSeek V3.2 | 0/2 (bad JSON, then orphan) | failed structurally | ~2 min | $0.35 | looked like the worst fit |
| Kimi K2 Thinking | 2/2 full pass | immersion 5, distinctness 4 | ~15 min | $0.63 | best, but slow + pricey |
| Nova 1 Pro | 0/2 (tool loop, 10 breaches) | worse than Nova 2 Lite | ~10 min+ | $0.25 | ruled out |
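For reference, the modeled cost column is simple arithmetic over the shared token profile. The sketch below shows the calculation only: the token counts are the approximate ones quoted above, while the rates in the usage line are made-up placeholders, not any model's real pricing.

```python
# Approximate pipeline-aggregate token profile quoted above.
PROFILE = {"input_tokens": 15_000, "output_tokens": 126_000}


def cost_per_generation(in_rate_per_mtok: float, out_rate_per_mtok: float, profile: dict = PROFILE) -> dict:
    """Modeled $/generation from a fixed token profile and per-million-token rates."""
    input_cost = profile["input_tokens"] / 1_000_000 * in_rate_per_mtok
    output_cost = profile["output_tokens"] / 1_000_000 * out_rate_per_mtok
    total = input_cost + output_cost
    return {"total": round(total, 4), "output_share": round(output_cost / total, 3)}


# Placeholder rates purely for illustration: $1.00 in and $2.00 out per million tokens.
print(cost_per_generation(1.00, 2.00))
```

Because the profile is shared across models, the cost column ranks models by price list only; a model that writes noticeably longer or shorter output than the profile would cost something different in practice.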
No single-model winner. But when I tabulated where each model failed, the split was absolute: every reliability failure was a structure-phase failure, and every quality ceiling was a content-phase concern. Those are two different jobs.
Structure wants a cheap model that reliably emits valid JSON in a tool loop; content wants the strongest writer I can afford.
Forcing one model to be good at both is what made the search look unwinnable. So I stopped picking one model and split the pipeline (that's the two model roles in Figure 2).
For the content seam I ran a second, narrower bake-off and upgraded the judge to a stricter, stronger model, Claude Opus 4.8, re-running the same output. That swap mattered on its own: Qwen3 235B had "cracked" the hardest dimension under the Sonnet judge, then under Opus its difficulty score dipped to 2.0 with ten floor breaches. The content didn't change — the ruler did.
| Content model | difficulty | distinctness | reliability | latency | note |
|---|---|---|---|---|---|
| DeepSeek V3.2 | 3.23 | 3.63 | 0 fail / 0 timeout | ~4.6 min | best callback lines (3.90/3.93) |
| Claude Sonnet 4.6 | 3.77 (only ceiling-cracker) | 3.9 | 0 fail | 7.6 min | ~$2.09/gen |
| Claude Haiku 4.5 | 3.0 | — | 0 fail | 3.9 min | ~$0.79/gen |
| Qwen3 235B | 2.75 | — | 0 fail | 4.1 min | 10 breaches under Opus |
| gpt-oss-120b | 2.0 (breached all) | 2.0 | 0 timeout, fastest | 3.4 min | speed + reliability ≠ quality |
| Nova 2 Lite + reasoning | 2.0 | — | 2/10 timed out >90 min | 12.7 min | reasoning is a liability for bulk content |
Two findings no single-model test would have shown me. First, the model I'd dismissed as the worst became the model I shipped. DeepSeek went 0/2 in the full-pipeline round, but both failures were structure-phase tool-loop problems, not content problems. Given only the content job, it came second to Sonnet at a fraction of the cost. Second, fast and reliable still lost on quality: gpt-oss-120b was the fastest with zero failures, and floor-breached both hard dimensions at 2.0, below even cheap Haiku.
What shipped, and the tradeoffs
- Structure → Amazon Nova 2 Lite. The cheapest model that reliably re-emits valid JSON as its final tool-loop answer, carried by the self-healing pipeline. ~$0.19/generation.
- Content → DeepSeek V3.2. Reliable, strong, and ~1/6 the cost of the best writer (~$0.35 vs ~$2.09 per generation).
- Judge → Claude. Sonnet 4.6 as the standing grader; I escalate to Opus 4.8 for the hardest, most subjective comparisons.
The biggest tradeoff was deliberate, and it was about cost, not about denying the quality gap. Claude (Sonnet) was the best content model I measured, full stop, but at ~$2.09 per generation versus DeepSeek's ~$0.35 and Nova's ~$0.19, it costs roughly 6–11× as much.
And generation runs per template, over and over, every time I iterate on a prompt. The judge, by contrast, runs only when I evaluate — far less often. So the economically right move was to put the cheap, self-healed models in the pipeline that runs constantly, and reserve the expensive Claude for the judging seat where its cost is amortized across an entire eval, not paid per template. I accepted a measurably weaker writer in the pipeline to keep the per-generation cost an order of magnitude lower, because generation is pre-computed once and amortizes to fractions of a cent per play session.
The playbook
- Split the feature — guarantee, grade, play, measure — and give each the cheapest tool that sees its failures. Hold your own tooling to the same split.
- Enforce structure and make it self-healing — re-validate the final output, loop errors back to retry, keep a deterministic repair pass. That's what lets a cheap model be reliable enough to ship.
- Never make the judge do arithmetic — compute everything countable in code and inject it as ground truth.
- Engineer the judge as an instrument — forced tool-use, temperature 0, N≥3 averaged, a hard floor and a committed baseline, a stronger and different family than the generator. Any judge change invalidates every score; re-baseline.
- If your output is traversed, play it through every strategy with a deterministic walker — it's nearly free and catches loops and content-starvation no judge can express.
- Read the artifact, not the score — the specific leaking phrase points at the one-line fix, and prove every fix across the matrix, because one green run on a non-deterministic generator is not evidence.
- Make model selection a results table with reliability as its own column — and when no single model wins on reliability and quality, split the work by phase and compose two, instead of forcing one to be good at everything.
Every time I treated a red as a bad model, I was wrong. Every time I treated it as a measurement question first — what is this number actually measuring, and can I trust the instrument? — it pointed me at the real lever.
Stop testing your AI. Start measuring it, and engineer the instruments as carefully as the thing they measure.

