AI made a lot of my work faster.
It did not make the week feel lighter.
That gap matters, because it is where most AI adoption advice gets too optimistic. The visible cost of producing work goes down. The hidden cost of deciding whether that work is right often goes up.
After looking back at my own production work, I changed the rule:
I only hand a task to AI when there is a machine-checkable gate on the way back.
If the result can only be judged by me, I either keep it by hand or add a gate first.
The work got faster only when a machine could say no
The pattern was simple once I stopped averaging everything together.
The work that sped up had a cheap oracle.
Tests pass. The compiler complains. Output matches a fixture. A schema lines up. A linter rejects the shape. In those cases, AI gives me generation speed and the machine gives me a way to catch bad output before it reaches my attention.
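A minimal sketch of what a cheap oracle looks like in code, in Python. The names `run_gates`, `parses_as_json`, and `matches_fixture` are hypothetical and not taken from any specific tool. The only point is that a machine rejects a candidate before a person reads it.

```python
import json
from typing import Callable, NamedTuple, Optional

# A check returns None when the candidate passes, or a short reason when it fails.
Check = Callable[[str], Optional[str]]


class GateResult(NamedTuple):
    accepted: bool
    reasons: list[str]


def parses_as_json(candidate: str) -> Optional[str]:
    """Cheap oracle 1: the output must at least be valid JSON."""
    try:
        json.loads(candidate)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    return None


def matches_fixture(expected: str) -> Check:
    """Cheap oracle 2: the output must equal a stored fixture, ignoring JSON formatting."""
    def check(candidate: str) -> Optional[str]:
        try:
            same = json.loads(candidate) == json.loads(expected)
        except json.JSONDecodeError:
            return "cannot compare against fixture"
        return None if same else "differs from fixture"
    return check


def run_gates(candidate: str, checks: list[Check]) -> GateResult:
    """Run every check and collect the reasons for rejection, if any."""
    reasons = [r for r in (check(candidate) for check in checks) if r is not None]
    return GateResult(accepted=not reasons, reasons=reasons)
```

A candidate that fails comes back with machine-written reasons attached, so it can be regenerated or discarded without costing a human read.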
The work that did not speed up had no cheap oracle.
Does this match how we talk to customers? Does this fit the architecture we already agreed on? Is this the right abstraction for this codebase? Is this email technically correct but wrong in tone?
The model can produce five plausible answers. No compiler can tell me which one belongs. I become the oracle.
That changed what I delegate. After pruning the list, less than half of what I first wanted to hand off still qualified.
The delegated work got faster. The kept work did not get slower, because I stopped paying the round-trip cost of generating things that only a human could judge.
Generation got cheap. Judgment did not.
The thinking did not disappear. The trail did.
Low-level decisions left my plate. Variable names. Boilerplate. Obvious test cases. Small rewrites. I do not miss them.
Higher-level decisions grew heavier. Which approach do we adopt? Should we build this at all? Which tradeoff are we making? Which failure mode are we willing to accept?
It would be comforting to call that a clean upgrade: I now work at a higher level of abstraction. Sometimes that is true. Not always.
The symptom that stopped me from selling that story was memory. I would make a call, move on, and later fail to reconstruct why I chose that route. The decision remained. The reasoning evaporated.
The cost per decision changed. When a decision felt expensive, its reasoning stuck. When each call felt cheap, the reason slid off the desk.
That is not the same as thinking less. It is leaving fewer tracks.
A decision you cannot explain later is hard to learn from later. AI did not remove that problem. It made it easier to produce more of it.
Training is downstream of the judgment flood
This is where many organizational responses feel backward.
The number of "accept, fix, or throw away?" moments did not move by a few percent. It moved to a different scale. The exact count depends on the work, but the shape is clear enough: AI increases the number of things that ask for a decision.
The hard cases are not the obviously wrong ones. Those are cheap. A glance is enough.
The expensive cases are the ones that run fine but are subtly wrong. The code passes tests but violates a convention nobody wrote down. The paragraph reads well but answers a slightly different question. The proposal is coherent in isolation but conflicts with an earlier decision.
Catching that takes real attention. I have missed it more than once.
"We need to upskill people on AI" is not wrong. It is late in the chain.
Train a person and drop them into a flood of subtly wrong outputs, and the training does not change the math. The person still has to judge too many things. The better order is to shrink the number of decisions that reach a human first, then put trained humans on top of the smaller pile.
More bluntly: scale the gates before scaling the people.
The first lever is not "more humans who can take responsibility." The first lever is "fewer moments that demand a human take responsibility."
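The "math" here can be written down as a rough model. This is a simplification I am assuming for illustration, not a measurement: human judgment time is the number of outputs that get past the gates, multiplied by the time each one takes to review. The function and parameter names are hypothetical.

```python
def human_review_minutes(
    outputs: int,
    gate_reject_rate: float,
    minutes_per_review: float,
) -> float:
    """Minutes of human judgment needed after gates have filtered the pile.

    outputs: how many AI outputs are produced in the period.
    gate_reject_rate: fraction of outputs a machine rejects before a human sees them.
    minutes_per_review: average attention a human spends per remaining output.
    """
    if not 0.0 <= gate_reject_rate <= 1.0:
        raise ValueError("gate_reject_rate must be between 0 and 1")
    reaching_human = outputs * (1.0 - gate_reject_rate)
    return reaching_human * minutes_per_review
```

In this model, training mostly lowers `minutes_per_review`, while gates lower the count that reaches a human at all. Plugging in your own numbers shows which lever moves the total more for your work.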
A gate is triage, not proof
This is where the argument can sound too neat, so it needs a correction.
Tests are not magic gates.
Gergely Orosz and Hamel Husain describe the Gulf of Specification as the gap between what we want an LLM to do and what our prompt or specification asks it to do. That maps cleanly to AI-generated tests. A test checks that the code matches what the test author wrote down. If the test author misunderstood the intent, green tests only prove that the code matches a possibly wrong spec.
Hamel Husain and Shreya Shankar make the same practical point in their evals FAQ: even as models improve, you still have to verify that the system is solving the right problem. The model does not read your mind.
A gate does not prove correctness. It shrinks the pile so that human attention lands where it matters.
The gates that helped me most were not exotic:
- Plan review before implementation review. Have the model produce a plan first. Review the intent while it is still cheap. Let implementation start only after the plan survives.
- A diff against a known-good reference. When the output should resemble an accepted ADR, an existing component, or a past customer email, compare it explicitly instead of trusting your eyes.
- A structured output validator. If you expect JSON, make the schema reject ambiguity before a human reads the content.
- A small assertion or lint rule. If the same subtle mistake appears twice, stop treating it as a reminder and turn it into a check.
None of these eliminate the "runs fine but subtly wrong" category. They lower the cost of noticing it.
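Two of those gates, the structured-output validator and the diff against a known-good reference, fit in a few lines of standard-library Python. This is a sketch under assumptions: the field names in `REQUIRED_FIELDS` and the allowed `risk` values are made up for illustration, and a real project might prefer a dedicated schema library.

```python
import difflib
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"summary": str, "risk": str, "files_changed": list}
ALLOWED_RISK = {"low", "medium", "high"}


def validate_structured_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed the gate."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    # Reject unexpected keys so ambiguity does not slip through.
    for name in sorted(set(data) - set(REQUIRED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    if isinstance(data.get("risk"), str) and data["risk"] not in ALLOWED_RISK:
        problems.append("risk must be one of: high, low, medium")
    return problems


def diff_against_reference(reference: str, candidate: str) -> list[str]:
    """Return only the added and removed lines relative to a known-good reference."""
    lines = list(difflib.unified_diff(
        reference.splitlines(), candidate.splitlines(),
        fromfile="reference", tofile="candidate", lineterm="", n=0,
    ))
    # The first two lines are file headers; hunk headers start with "@@".
    return [line for line in lines[2:] if not line.startswith("@@")]
```

Neither function says whether the content is right. The validator stops malformed output from reaching a reader, and the diff points attention at the lines that actually changed.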
The handoff question
The practical test is small.
Before handing a task to AI, ask:
Can something other than me reject the answer?
If yes, hand it off.
If no, either keep the work by hand or add a gate first.
That gate can be a test, schema check, lint rule, plan review, structured-output validator, or diff against a known-good reference. The form matters less than the return path. The output should not come back to a human as an undifferentiated pile of plausible text.
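The handoff question is small enough to write as a function. A sketch in Python, with hypothetical names throughout. The `gate_is_feasible` flag is my own addition to choose between "keep by hand" and "add a gate first"; it stands in for a human estimate that the rule above leaves open.

```python
from dataclasses import dataclass, field
from enum import Enum


class Handoff(Enum):
    DELEGATE = "hand it to the model"
    ADD_GATE_FIRST = "add a gate, then hand it off"
    KEEP_BY_HAND = "keep it by hand"


@dataclass
class Task:
    name: str
    # Machine-checkable gates already on the return path, e.g. "unit tests".
    gates: list[str] = field(default_factory=list)
    # Assumption: a human judges whether a gate could be built at reasonable cost.
    gate_is_feasible: bool = False


def handoff_decision(task: Task) -> Handoff:
    """Can something other than me reject the answer?"""
    if task.gates:
        return Handoff.DELEGATE
    if task.gate_is_feasible:
        return Handoff.ADD_GATE_FIRST
    return Handoff.KEEP_BY_HAND
```

For example, `handoff_decision(Task("parser refactor", gates=["unit tests"]))` returns `Handoff.DELEGATE`, while a task with no gates and no feasible gate stays with the human.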
A gated return path changes how many decisions reach your desk and how well filtered they are when they arrive. That is the part that scales.
The training-first instinct is comfortable. It also turns your best people into the bottleneck. Build the gates first. Put the people on top.
What is the smallest gate you can add before the next AI output reaches you?
References
- A pragmatic guide to LLM evals for devs, Gergely Orosz and Hamel Husain
- LLM Evals: Everything You Need to Know, Hamel Husain and Shreya Shankar
Sho Naka (nomurasan). I ship, I teach, and I spend most of my week deciding which part of my workflow to hand to a model and which to keep close.
This piece was adapted from a Japanese-language essay I wrote, with AI assistance for the cross-language rewrite. The reasoning, the data, and the conclusions are mine.