
Mario Hayashi

Originally published at blog.mariohayashi.com

The Factory Must Grow (Part III): Stopping the AI Agent Production Line Toyota-style

Welcome back. Thanks for reading this blog series. I got insightful questions and comments in the past weeks — please keep them coming. But, yes, let’s get back to business: the factory must grow.

The Factory Broke

My orchestrator reported a clean healthcheck: the PRD issue moved to Done, workers exited without errors. There was just one problem. A feature I was keen on didn’t ship. No branch, no commit, no PR. The orchestrator closed what appeared to be a successful issue even though the agent had done zilch.

Part II of this The Factory Must Grow series concluded along these lines: the right architecture helps debug tricky issues; loud failures help us learn to grow the factory. This post is about the operating discipline that goes into growing AI agent orchestration. For full context on how the factory came to be, see Part I — The Factory Must Grow: I Replaced Myself with AI. Now What?, where it all started.

Over two weeks, the system silently dropped a dozen or so PRD issues and closed them as if they shipped. The stalled agents kept retrying and spun endlessly in cycles before quitting. Nothing committed.

The hardest bugs to fix are the silent ones. A crash gives you a stack trace. A silent drop gives you false confidence. In my case, the orchestrator reported success, closed the issue and moved on when it shouldn’t have. I had some choice words with Claude but it wasn’t Claude’s fault.

The fix wasn’t a plaster. It was the realisation that I needed the factory to stop completely, while I investigated the failures one by one. Dead code beats limping code. I can recover from a halted run but not as easily from corrupted state. The discipline I needed to get on top of my mounting issues came from a Japanese car maker.

Why the Toyota Production System? Does It Apply to Agents?

The Toyota Production System (TPS) is the set of principles Toyota’s factories run on: quality at source, making the wrong thing impossible, letting anyone stop the line, walking symptoms back to root policies and treating waste as a defect. In TPS terms: jidoka, poka yoke, the andon cord, the five whys and muda.

Why did I settle on TPS specifically? There are a whole bunch of DevOps principles you could apply to agent orchestration, I’m sure. But I’ve been fascinated by Taiichi Ohno’s TPS recently and thought it’d be apt to use it for my factory and agent management.

AI agent failures come in many shapes: a worker corrupting the board, or a worker deciding to call it quits and going AWOL. They can look like any of the following:

  1. Agents can hallucinate about their own output. An AI worker can write a long essay claiming it built a feature while writing zero lines of code.

  2. Agents are non-deterministic. You can’t necessarily replay a run with the same input and expect the exact same output. This can be a real issue, as we may want determinism in critical decision points.

  3. Agents can loop out of control. The worst offender: a misfiring agent burning millions of tokens.

How much of TPS can we apply to managing agents? That’s the rest of this post.

Jidoka: The Silent Drops

The biggest failures, in retrospect, were the silent ones. A dozen PRD issues were closed as if they shipped, but none of them did. A worker reported success, the orchestrator transitioned the issue to “terminal”, and the PRD was gone.

Jidoka is Sakichi Toyoda’s principle of designing equipment to stop automatically the moment an abnormality appears. On a car line it means the machine stops as soon as a defect is observed: it does not wait for the quality team at the very end of the line. For agents, this means every worker writes a “verdict” file at the end of its run. The orchestrator doesn’t look at whether the agent exited cleanly (which can create false positives) but reads this verdict. If the verdict is present and positive, we ship. If it’s missing or failed, we investigate.

While it’s fairly obvious whether a car has been built or not, an AI worker can answer “yes” to anything. So to achieve jidoka for agents, we have to verify the worker’s output through evidence that isn’t self-reported: commits, diffs, a verdict file written deterministically to a known path.
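Here’s a minimal sketch of that check, assuming a JSON verdict file at a known path. The `Verdict` shape and `jidokaCheck` name are illustrative, not the actual implementation:

```typescript
import * as fs from "fs";

// Hypothetical verdict shape; the real schema may differ.
type Verdict = { status: "success" | "failed" | "halt"; commit?: string };

// Jidoka check: trust deterministic artifacts, not the worker's exit code.
// Ship only when a verdict file exists, claims success AND names a commit.
function jidokaCheck(verdictPath: string): "ship" | "investigate" {
  if (!fs.existsSync(verdictPath)) return "investigate"; // missing verdict = stop
  const verdict: Verdict = JSON.parse(fs.readFileSync(verdictPath, "utf8"));
  if (verdict.status !== "success") return "investigate";
  if (!verdict.commit) return "investigate"; // "success" with no commit is a hallucination
  return "ship";
}
```

The point of the commit check is that a commit hash is evidence the orchestrator can verify independently, unlike anything the worker says about itself.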

Taiichi Ohno, who built TPS, said that stopping the machine when there is an issue forces awareness on everyone. When the problem is clearly understood, improvement becomes possible.

Replace “machine” with “AI agent” and it’s still true. In this particular case, “success” is not when the worker finished without error but is when the worker has output a deterministic result. (Of course, a malicious worker might game the system and lie but that’s out of scope for this discussion.)

Poka yoke: The Retry Storm

The second type of failure I encountered looked different. Some agents would keep retrying an issue, making zero progress. The retry logic was supposed to have a cap but it didn’t. The switch statement to decide the next step was missing some logic and fell back to retrying.

Shigeo Shingo’s “poka yoke” is about making the wrong thing impossible as part of the machinery, not the operational discipline. Poka yoke prevents a part from being installed unless it’s oriented correctly. A prompt saying “remember to handle XYZ” is not poka yoke.

Poka yoke needs to live at the orchestrator’s interface boundary, defined in the form of accepted dispatch outcomes and state transitions. The fix was to make sure every possible dispatch outcome is handled explicitly, enforced at compile time. A second poka yoke was placed at the state-transition layer: the orchestrator errors if asked to close an issue that still has the “waiting for human answers” label, because a paused issue should never be silently closed.
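A sketch of both guards, with hypothetical outcome and label names. The `never` assignment is what turns a forgotten case into a compile-time error instead of a silent fallback to retry:

```typescript
// Hypothetical dispatch outcomes; the real set may differ.
type DispatchOutcome = "shipped" | "halted" | "no_progress" | "awaiting_answers";

// Poka yoke 1: exhaustive switch at the interface boundary.
function nextStep(outcome: DispatchOutcome): string {
  switch (outcome) {
    case "shipped": return "close";
    case "halted": return "lock_workflow";
    case "no_progress": return "abandon";
    case "awaiting_answers": return "pause";
    default: {
      // If a new outcome is added without a case, this line fails to compile.
      const unhandled: never = outcome;
      throw new Error(`Unhandled outcome: ${unhandled}`);
    }
  }
}

// Poka yoke 2: the state-transition layer refuses to close a paused issue.
function closeIssue(labels: string[]): void {
  if (labels.includes("waiting-for-human-answers")) {
    throw new Error("Refusing to close: issue is paused for human input");
  }
  // ...transition the issue to the workflow's declared done state
}
```

The design choice here is that the second guard throws rather than logs: a loud error is the whole point, since a warning buried in logs is exactly how the silent drops happened.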

Andon Cord: Hook Timeouts

Andon signboard. Source: https://en.wikipedia.org/wiki/Andon_(manufacturing)#/media/File:Andon_Otomasyon_Panosu2.jpg

The third failure didn’t burn tokens but killed work. A hook inside the worker loop occasionally timed out and took the whole run down with it. The issue was marked “exhausted” even though the agent was fine. No commit, no signal.

The andon cord is the rope anyone on a Toyota line can pull to stop production (these days, it’s an electronic button). Production halts, the line manager looks, the problem is fixed at the station where it surfaced. The principle is that it is always cheaper to stop the line than to ship a defect downstream.

I built the software version of the andon cord. Any worker, on detecting a potential systemic risk, can write a “halt” verdict to a file and exit. The orchestrator catches the “stop” signal and engages a workflow-scoped lock. The orchestrator stops dispatching workflows until a human jumps in and clears it.
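The cord itself can be sketched as a workflow-scoped lock. This in-memory version is illustrative; a real orchestrator would persist the halt across restarts:

```typescript
// Minimal andon lock, scoped per workflow. Names are illustrative.
class AndonLock {
  private halted = new Set<string>();

  // A worker "pulls the cord" by reporting a halt for its workflow.
  pull(workflow: string, reason: string): void {
    this.halted.add(workflow);
    console.error(`ANDON: ${workflow} halted: ${reason}`);
  }

  // The orchestrator checks this before every dispatch.
  canDispatch(workflow: string): boolean {
    return !this.halted.has(workflow);
  }

  // Only a human clears the lock, after investigating the root cause.
  clear(workflow: string): void {
    this.halted.delete(workflow);
  }
}
```

Scoping the lock per workflow means one halted workflow doesn’t stop unrelated lines, while still guaranteeing no further dispatches land on a possibly corrupted board.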

Andon is doubly useful for non-deterministic AI workers. With agents, you can’t replay a halted run cheaply. Re-running the same prompt against the same state can yield a different result, possibly a different failure. But a missed stop could mean a corrupted board and not being able to debug the issue, as the original issue is buried deep in the logs. False positive stops are always better as they cost just five minutes and a re-dispatch.

Agents pulled the cord several times in the first week after the cord’s introduction. Each time, I investigated the root cause with Claude and the system got better.

Knowing the line would stop itself the moment it sensed something off meant I could look at each edge case properly, instead of adding a “plaster” fix with retries that would have papered over the root cause.

Muda: The Budget Burn

The fourth failure cost lots of tokens. An agent picked up an issue, ran, stalled, retried, ran, stalled, retried in cycles, burning tokens.

Muda means waste and, in TPS, waste is a defect.

A stamping press wasting time on a car line wastes steel and electricity for that one car. An agent wasting time can burn tokens all the way up to the weekly token limit until you stop the fleet. Muda on a car line is bad, but muda in an agent fleet accelerates with every agent running.

The no-progress guard is simple. After every reported success and every budget breach, the orchestrator checks the worker’s workspace for a diff. If the diff is empty, the issue is abandoned regardless of what the agent claimed. No commit means no progress and that means the run should be halted.

Muda in practice means not just asking “how much should I let an agent spend before I cut it off?” but asking “how much should I let an agent spend before it has produced the first commit?” The latter spend should be much smaller.
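A sketch of that two-tier guard, with illustrative budget numbers; `diffIsEmpty` would come from inspecting the worker’s workspace, e.g. the output of `git status --porcelain`:

```typescript
// Hypothetical budgets; tune to your fleet.
const FIRST_COMMIT_BUDGET = 50_000; // tokens allowed before the first commit exists
const TOTAL_BUDGET = 500_000;       // tokens allowed for the whole run

// No-progress guard: run after every reported success and every budget breach.
// An empty diff means no progress, regardless of what the agent claimed.
function shouldAbandon(tokensSpent: number, diffIsEmpty: boolean): boolean {
  if (diffIsEmpty && tokensSpent > FIRST_COMMIT_BUDGET) return true; // burning with nothing to show
  return tokensSpent > TOTAL_BUDGET;
}
```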

Five Whys

You’ll probably be familiar with the five whys: keep asking “why” until you hit a root policy, not just the first plausible explanation. The root cause might be hiding in the worker logs or the orchestrator dispatch code. The five whys chain from the PRD-drop incident goes something like this:

  1. Why was the issue closed? The orchestrator closed an issue on a successful run.

  2. Why did that path fire? The dispatch handler treated success as terminal and then routed the issue to the workflow listed first in its terminal-states list.

  3. Why did that move the issue to a paused state? The PRD workflow’s terminal-states list had “Awaiting Answers” first.

  4. Why wasn’t this caught by config validation? No field required the workflow to declare which state actually meant “done”, separate from “terminal”.

  5. Why was there no config field? The code had a hard-coded fallback to “Done” that quietly substituted whenever config left it undefined, so the absence was never validated as missing.

The fix for the final why requires every workflow config to declare its done state explicitly and makes the schema reject any config that doesn’t. This needed a change to the schema and a migration, but that’s better than a bug lurking under code smell.
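The resulting policy can be sketched as schema validation; the field names here are illustrative, not the actual config schema:

```typescript
// Workflow config after the fix: doneState is required, with no hard-coded fallback.
interface WorkflowConfig {
  name: string;
  terminalStates: string[];
  doneState: string; // which terminal state actually means "done"
}

// Validation rejects any config that leaves the done state implicit.
function validateConfig(raw: Partial<WorkflowConfig>): WorkflowConfig {
  if (!raw.name || !raw.terminalStates?.length) {
    throw new Error("Workflow config needs a name and terminal states");
  }
  if (!raw.doneState) {
    throw new Error(`Workflow "${raw.name}" must declare an explicit done state`);
  }
  if (!raw.terminalStates.includes(raw.doneState)) {
    throw new Error(`Done state "${raw.doneState}" must be one of the terminal states`);
  }
  return raw as WorkflowConfig;
}
```

The last check is the one that would have caught the original incident: a config listing “Awaiting Answers” first can no longer smuggle a paused state into the “done” slot.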

Intermission: How You Can Get Started

I’ve been asked where to get started with agent orchestration. If you’re building your own agent factory, perhaps the fastest place to start is the Symphony spec. It’s a language- and tracker-agnostic orchestrator spec, written by OpenAI. I suggest you read the spec, pick a language, pick an issue tracker and you have a starting point.

What would I add to it? I made a fork, so you can have a look at changes I’d make to the spec based on my own preferences and experiences. It contains “Recommended Extensions” that e.g. account for worker hallucinations, hook timeouts or prevent tokens from being burned.

Your mileage may vary.

Work In Progress

The system still breaks. It’s a work in progress and I expect it to stay that way for a while. Every new failure is now spotted earlier, thanks to jidoka, poka yoke, andon cord, muda and five whys. What’s changed with TPS is that the factory is no longer allowed to limp. If an agent moves an issue to “Done” but produces nothing, the line stops. If a hook times out, the line stops. If a verdict is missing, the line stops. I’d rather have dead code than limping code, because limping code can create real damage and dead code can be investigated.

What TPS teaches us for an agent factory is the discipline about _where_ to enforce principles. The principles aren’t optional gates on the way to autonomy (in my opinion). They make autonomy possible at all. You can’t trust a worker to run unattended until you’ve watched it fail in every shape it knows how to fail in. The only way to safely have workers fail is to stop the line every time and walk the problem to a root policy.

My factory stopped dropping PRDs. The finish line isn’t necessarily a system that never stops. It’s a system that stops for the right reasons and that I trust enough to leave running when it doesn’t. (And won’t burn millions of tokens!)


If you’re experimenting with AI in your workflow, I’d love to hear from you! I write more like this at blog.mariohayashi.com, and feel free to follow me on X: @logicalicy.
