Joseph Yeo

Posted on Jun 28

The Human in the Loop Doesn't Scale. I Kept Him Anyway.

#ai #agents #llm #automation

What it costs to be the last reviewer your own system has

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on a frontier model; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

In the last post, I described how my agent's rulebook learned to forget — rules that age out get flagged, and a human decides whether to retire them. I ended on a line that's been nagging me ever since: the whole thing works because I'm still small enough to read everything, and I suspect that doesn't last.

This post is about that suspicion. It's about the design decision I keep reaching for — keep a human in the loop — and the uncomfortable thing underneath it: that human is me, I am exactly one person, and "put a human on it" is not a plan. It's a debt I haven't been billed for yet.

"Keep a human in the loop" is the answer I keep reaching for

When an automated decision feels risky, my reflex is usually the same: don't let the machine do it alone, put a person on the final call. It sounds responsible. It's the answer I reach for when I don't yet trust the automation boundary, and it's the one I'd give if you asked me how to make an agent safe.

And it's right, as far as it goes. A human on the final decision catches the failure modes a confidence score can't see. I argued for exactly this last time: the machine is good at noticing a rule hasn't earned its keep lately; it is not good at knowing whether that's because the rule is obsolete or because I just haven't exercised it. So the machine flags, and I decide.

The part I glossed over is what "I decide" actually costs.

The cost nobody puts on the diagram

When you draw a human-in-the-loop system, the human is a small box near the end with an arrow labeled approve / reject. The box looks free. It isn't. Every decision routed to that box spends three things that don't show up in the diagram.

It spends attention — each call needs enough context loaded into a human head to judge it, and that context isn't cached between decisions the way it is for a machine.

It spends latency — the system now moves at the speed of when I happen to look, not the speed of the runs. A flag raised at 2am waits for me. The agent doesn't.

It spends a budget that doesn't grow — the machine's side scales with hardware. My side barely scales at all. I get the same hours next year. If the flags-per-week curve goes up and the human-hours curve is flat, the lines cross. After they cross, "a human reviews it" quietly becomes "a human is supposed to review it," which is a different claim wearing the same label.

That last failure is the one I actually fear. Not the human-in-the-loop that says no. The human-in-the-loop that has too much queued to look properly, and starts rubber-stamping — approving on a glance because the backlog is the real pressure. A reviewer who can't keep up doesn't fail loudly. They fail by agreeing, and a rubber-stamped approval still moves the system forward — it just carries a decision nobody actually made, buried until something downstream breaks.

What I tried: making the human's time the scarce resource it actually is

Once I stopped treating my own attention as free, the design question flipped. It stopped being "where should a human review?" and became "this human has a small, fixed number of real decisions in him per week — which ones are worth spending?"

That reframing changed the design more than another automation pass would have. A few things fell out of it.

Most decisions don't need me; they need a default. A flag I almost always approve isn't a decision, it's a ceremony. The honest move is to pick the safe default, let it happen automatically, and log it where I can audit a sample later — not to stand at the gate nodding. The clearest case: stale-rule flags for rules that only ever applied to throwaway scaffolding from old projects. There's no real call to make there — retiring them is the safe default, so the system does it and tells me, instead of asking. I moved a whole class of "review" into "do the safe thing and log it," and got the time back for the calls that were actually close.

The decisions worth my time are the irreversible and the ambiguous. Retiring a rule that can't easily be un-retired; anything where the machine's confidence is truly split rather than just low. Those I keep. They're rare, which is the point — keeping a human in the loop only scales if the loop is small.

Batching beats interrupting. Ten flags reviewed in one sitting, with shared context, cost a fraction of ten flags reviewed across ten interruptions. So the system holds non-urgent decisions and presents them together, instead of paging me the moment each one appears.

None of this removes the human. It does the opposite — it admits the human is the bottleneck and budgets around it, instead of pretending the bottleneck is free and acting surprised when it backs up.

What I deliberately refused to automate, even knowing it doesn't scale

Here's the tension I haven't resolved.

Some decisions I keep for myself even though I know that choice is what limits how far the system can run without me. Retiring hard-won knowledge is one. The call to let the agent act on something it's never done before is another. Not because a model couldn't make those calls — increasingly it could — but because I'm not yet willing to not know when they happen.

That's an honest admission, not a principle. It might be that I'm holding onto these out of caution that's already obsolete. It might be that one of them is the next thing to hand off, and I'm just attached. The reason I can still tell the difference is, again, that I'm small enough to feel each of these decisions individually. The day there are too many to feel is the day this stops being judgment and starts being a story I tell myself about judgment.

So I'm not claiming I solved it. I'm claiming I stopped pretending the human was free, and that alone changed which decisions I let reach me.

What this didn't prove

This is one person's setup, and the bottleneck I'm describing is me, specifically — my hours, my attention, my unwillingness to look away from certain decisions. A team has a different shape of this problem: more reviewers, but also coordination cost, and the rubber-stamp failure mode gets easier to hide, not harder, when "someone reviewed it" can mean anyone.

I haven't shown that my particular triage — defaults for the routine, human for the irreversible and the ambiguous — is the right cut. It's the cut that fit a single-operator system small enough to audit by sampling. I also haven't escaped the core problem; I've only delayed it. Every decision I automate to save my attention is a decision I now have to trust without watching, which is the exact move the earlier posts in this series were nervous about. I traded one risk for another with my eyes open. That's not a solution. It's a position.

The takeaway

"Keep a human in the loop" is true and incomplete. It's true because the human catches what the metric can't. It's incomplete because it quietly assumes the human's time is free, and in my setup the human's time is the least scalable resource in the whole system. A loop with a person in it only works while the loop stays small enough that the person can actually be in it — not nominally, actually.

If I had to compress it: the goal isn't to keep a human in the loop. It's to spend the human on the decisions that deserve a human, and to be honest that everything else was a default you chose, not a review you did.

I'd like to hear how others handle this, because I don't think being small saves me for long. In your systems, what do you actually keep a person on — and how do you tell the difference between a review that's real and one that's become a rubber stamp? And when you handed a decision to automation, how did you decide it was safe to stop watching?

Top comments (2)

Alex Shev • Jun 28

Human-in-the-loop scales better when the human approves boundaries, not every token. The agent should handle low-risk steps, surface uncertainty, and ask for permission at action boundaries: spend money, send externally, delete, deploy, publish, or touch production data.

Joseph Yeo • Jun 28

Agreed — approving boundaries instead of tokens is exactly the right cut, and your list (spend, send externally, delete, deploy, publish, touch prod) is a cleaner enumeration of "irreversible" than I gave in the post. I'm stealing that framing.

The thing I keep circling is that the boundary set isn't fixed either. Each new capability adds a boundary, and every boundary is still a call routed to me. So the same pressure shows up one level higher: when the boundaries themselves multiply, "approve at the boundary" quietly becomes "rubber-stamp at the boundary." Which is why I landed on treating even the boundary list as something that needs pruning, not just defining. But your action-boundary framing is the right place to start — thanks for laying it out so concretely.