DEV Community

Mario Hayashi
Mario Hayashi

Posted on • Originally published at blog.mariohayashi.com on

The Factory Must Grow (Part IV): Testing AI To Walk Away From the Keyboard

Welcome back. Thanks again for reading my previous posts. Part III was about stopping the line when the factory breaks. Part IV is about the question: how do I actually know the orchestrator is producing expected outputs?

A good factory runs by itself

Why: I Want to Walk Away From the Keyboard

The end goal of building this orchestrator is to one day leave the keyboard. Gradually reduce my AI-babysitting and then focus on direction, narrative, quality control and product or customer research. Perhaps go for a walk. Part I of this series opened up to what’s possible. Parts II and III were the architecture and the discipline around improving the system. But I want to eventually walk away from my laptop.

Early on, my orchestrator’s healthcheck reported success while the event log told a different story: the worker process exited successfully but didn’t invoke the model. Zero tokens were consumed but also there was no code! The dispatcher saw success and dutifully closed the issue. The orchestrator wasn’t wrong per se but the contract between orchestrator and worker was broken.

The hard part about autonomy (and engineering) isn’t getting it to work just once. It’s knowing that your unattended agents deliver what they claim. If I have to babysit the healthchecks, I haven’t replaced myself. I’ve just created a second job. The real “walk-away” test is to verify whether the orchestrator and agents produce the expected results.

End-to-end testing and evals can be the answer here. Unit tests have their purpose but we need to see orchestrator working in a real environment. What I need is a harness that tests the end-to-end (E2E) on real GitHub, a real LLM, a real AI agent worker and grades the outcome.

A passing E2E test

How: Harness, Test Scenarios, Five Layers

Harness

The harness is an E2E test runner script. It reads a YAML test scenario, files a real tracker issue (e.g. GitHub) from the spec, then polls 3x sources every fifteen to thirty seconds: GitHub state (issues, board status, PRs, etc), the orchestrator’s NDJSON event log and the worker outcome inside the event log. The harness evaluates a rubric of criteria, terminates early when a set of conditions are met, scores the run, writes a report and cleans up. The whole thing runs in < 15 minutes for a happy-path scenario.

The automation layer handles polling, APIs and bookkeeping. The judgement layer asserts and only looks at issues, board state, PRs, structured audit markers — things you and I would look at. Termination conditions are typed and don’t depend on random string comparisons.

Test Scenarios

I started with ten scenarios, covering multiple types of failure. There’s a happy path that writes a one-line diff to a known file. There’s a “split chain” that fans out into four sequential children with dependencies. And then there are three negative-path scenarios: paused (input too vague, worker should ask), abandoned (task impossible, worker should escalate) and unable (work undeliverable, the worker should bail). Finally, there is a marketing path and a PR-review path, each with their own happy and negative tests.

Negative paths matter. A test suite that only proves “it works when nothing weird happens” doesn’t inspire confidence. The negative path scenarios show that the orchestrator stops correctly and escalates correctly. This is Toyota’s Jidoka at work (see Part III blogpost).

Signals, Not Regex

Other harnesses I’ve seen for agent systems grep for strings. The worker says “Unable to complete” and the rubric matches /unable/i, so we get a green tick. While that works once or even a few times, it’ll be like whack-a-mole when you discover the thousands of ways it can go wrong. The prompt response could change all of a sudden across model releases and the regex could be too brittle. You need to “fix” the regex every week and by the end of it you won’t trust the system.

How end-to-end tests feel like when they keep breaking

The harness I made reads structured signals: GitHub labels, board fields, sub-issues, linked-PR diffs and HTML audit comments the orchestrator emits on every state transition (<!-- audit verdict=DONE kind=delivered -->). The audit comment is a structured field on a comment, not freetext. The rubric criteria matches that field, with no regex.

This is poka yoke mistake-proofing for harnesses: the scenario YAML is typed (termination rules required, criteria on structured fields), so regex can’t sneak in through config. It’s always possible for regex to slip into the test code but it needs to be intentional.

Caching to Prevent Burning Tokens

I’d only recommend real LLM evals if you can afford to run them. Nobody wants to burn tokens, so I built a cache proxy to store prompts and their results, which get replayed on the next run. But it wasn’t working exactly as I wanted, so I rebuilt it again.

The first cache was content-addressed: the message was normalised to reduce noise, hashed with (model, system, messages, tools) and we’d look up the response with the hash. However, two runs with the same prompt never had byte-identical bodies. Today’s date, the working directory and the GitHub issue number flipped every session and made caching really, really hard. The normalisation layer kept growing to scrub timestamps, paths and IDs and to rewrite GitHub entity references to known placeholders (”#786” to “parent issue”). I kept saying to myself, “the next one will be a cache hit”. After one week of adding one normalisation rule after another, I knew something was wrong. The hit-rate ceiling was about 30% on back-to-back identical reruns and, after any workflow edit, it dropped to zero.

Caching prompt results so that they can be replayed later

The mistake was architectural. A multi-turn run is a chain: turn N+1’s input is the entire message list so far, and that list embeds turn N’s assistant output plus any tool outputs (tool_result) appended after tool execution. One byte of drift in assistant prompt cascades into a permanent miss for subsequent turns. And you can’t reliably normalise assistant prompts. No rule addition will change the ceiling. I was chasing a structural problem with syntactic patches.

So I rebuilt it. The proxy doesn’t compare payloads. For each scenario it writes traffic to a per-dispatch NDJSON trace and, on subsequent runs, returns stored responses in order. Like a VHS recorder. Determinism comes from what was recorded and not from normalising traffic. If the scenario or workflow drifts, you will get TRACE-STALE. If the worker needs more turns than the recording has, you get TRACE-EXHAUSTED. Both of these failures force a re-record. You pay for a scenario recording once and subsequent evals will cost nothing.

🪨 Caveman x Orchestration to Reduce Burning Tokens

As an experiment, I also turned on the viral, compressed “caveman” style with github.com/juliusbrussee/caveman made by Julius Brussee (star it!). The worker sessions appear to inherit it. Caveman doesn’t get in the way of decisions and reduces waffling in the replies. Instead of “I traced the failure to a blocked hook that the worker retried three times” becomes caveman “Blocked hook retried 3x”. I’ve not run a A/B on tokens, so the benefits are honest speculation. But shorter assistant output means less text carried forward into the next turn. On long orchestration loops where the harness and agents expand context, this could be a big win...! How much of a real saving that’ll produce is still an open question. For now I’m treating it as a small, promising experiment, not a proven optimisation (yet).

🪨

“is rock!

want walk away from keyboard. good dream. but if caveman still stare at healthcheck like hawk, caveman not replaced — caveman now have two job. very bad.”


Five-layer failure model

When a scenario fails, the rubric isn’t enough. Every failure has to name its layer. Here are the layers I’ve settled on and some example failures:

  1. Environment : The worker process didn’t run. Subprocess succeeds with zero tokens consumed.Usually a quota or session cap turning up as no-op success.

  2. Cache / Proxy : The replay trace doesn’t match this run. TRACE-STALE means the scenario or workflow has moved; TRACE-EXHAUSTED means the worker took more turns than the recording has. Fix is to re-record, not to edit the proxy.

  3. Orchestrator : The dispatcher misclassifies the verdict, drifts the board state or mis-engages the andon cord.

  4. Worker Contract : The prompt didn’t enforce a behaviour. Fix is structural where the bad decision is defined better with a classifier, a typed schema. A prompt is only edited when the work is generative or genuinely ambiguous.

  5. Rubric : The test itself was wrong.

There’s a cost to getting the triage wrong. I shipped many commits editing the worker prompt. But the actual root cause was in the “environment” layer. The worker was returning success with zero tokens consumed and no amount of prompt editing could have prevented it. The fix was a classifier in the orchestrator that looked at the token count before returning “success”.

Naming the layer is now a rule. Every PR description that fixes an eval has to state which layer, show the evidence and why the fix is not in other layers.

Intermission: How You Can Get Started

If you want to build something like this for your own agent system, I have a few suggestions. Start with one happy-path scenario that delivers a trivial real artifact (one line of one file). Read structured signals like a label change, a PR diff size, a sub-issue count and avoid regex on prose. And type your termination conditions and be disciplined. Declare what ‘done’ looks like. It will pay back over time.

Work In Progress

The system still breaks and it’s not perfect. I keep finding issues with work I throw at the agents. The latest fix was the harness itself: the eval now reads the orchestrator’s audit trail and names the failure layer structurally, derived from the trace. The gap between “it ran” and “it worked” feels smaller now that I can leave the laptop for an hour and not check. E2E testing will buy us freedom in the long run.


I would love to have your feedback! I write more like this at blog.mariohayashi.com, and feel free to follow me on X: @logicalicy.

Top comments (0)