The most useful thing an AI coding agent can do is let you check its work. It will not enjoy it.
I once watched my own agent do beautiful work on the wrong half of the job.
I had built a harness called Mycelium to make Claude do real product development, discovery through delivery. On a little macOS app, it aced the discovery. Mapped the opportunity tree like it had written the textbook. Then it shipped the thing with zero tests and no accessibility, and told me it looked great. It wasn't lying, exactly. It just had no idea what it didn't know, and neither of us found out until I went looking.
That is the part nobody prices. Not how good the agent is. How long it takes you to find out whether it is right.
The agents people call a joy to use are the ones that hand you the answer and the receipt at the same time: a diff, a screen recording, a before and after you can glance at and trust in two seconds. The ones that wear you down hand you something you have to re-derive yourself just to know whether you can believe it. Same model underneath, sometimes the same task. The only difference is who pays to check, and how much it costs them. Hamel Husain has put the sharp version of this: if a thing is hard to evaluate, that is a product smell. He is right, and I would push it one step further.
If the cost of checking is the thing that matters, you do not win by making the agent smarter. You win by making it pay that cost up front, the moment it makes a claim, instead of dropping an unverifiable heap on your desk and letting you find the problem at review time, which is the most expensive time there is.
That is what a gate is, in the thing I build. People hear "gate" and picture a slow approval step with a clipboard. It is closer to an eval that runs at discovery time instead of after the output. Before the agent earns the right to write code, it has to clear one question: does the evidence for this actually exist. Not does it look plausible. Plausible is exactly how I ended up with a test-free app that demoed beautifully. Does it exist. When it does not, the agent stops, and it does not get to guess its way past. The current build has thirteen of these defined across the road from a vague idea to shipped work, and a project clears the ones its scale calls for.
Here is one, because the abstraction is cheap and the example is where it lives or dies.
An agent is about to build against an external API it has never actually read. The contract, the schema, the real shape of the payload: all imagined. Left alone it writes something plausible against an interface that exists only in its head, and you find out at integration time, which is to say the worst time. The gates already stop it on the missing evidence. The part I shipped recently is the part that makes the stop useful: instead of a shrug, the agent names what it is actually facing, "technical discovery," and sends itself off to read the docs and pull a real payload before it touches a line of code. The bill got paid up front, for almost nothing. The alternative is me paying it by hand, later, after the agent has already talked itself into being sure.
Move the check from after to before. From me to the loop. That is the whole trade.
Theory is cheap, though, and I have been burned by my own confidence often enough to distrust it. What tells you whether any of this holds is what happens when other people touch it.
Drew Hoskins, who wrote The Product-Minded Engineer, approved the line I now lead with: the agent should stop and say "I don't have enough to claim this" before it ships code or marks anything done. An agent admitting the gap is a lot cheaper than you finding it.
The evidence I actually trust came from someone with no reason to be kind about it. One of the developers testing Mycelium ran it on her own project, a public-sector app for next-of-kin in home care. At some point the agent decided to improve her brief by quietly rewriting it in its own words. She caught it. She made it keep her original and ask before changing the record. She enforced, by hand, the exact discipline the framework is built to enforce, in the one moment the framework had not quite managed to. She was the gatekeeper. The framework was supposed to be. Being out-disciplined by your own tester is humbling, and I recommend it to anyone who thinks their harness is finished.
I will give you the failure in the same breath, because a post that only reports wins is its own kind of product smell. That same tester told me, flatly, that the vocabulary reads like it was written for people who already know the frameworks. The gate caught the agent. The words lost the human. Straight into the backlog. The machinery was sound and the on-ramp was not, both true at once, and pretending otherwise would be exactly the overconfidence the gates exist to catch.
It is tempting to read all this and conclude that hard-to-check work is work agents should keep their hands off. I think that is backwards. The harder a thing is to verify, the more it needs the discipline, not less, because that is precisely where confident nonsense survives longest. Nobody can cheaply tell it is wrong. That is not the place to give up on checking. It is the place to spend the most on it, early, while it is still cheap.
The bet under all of it is small and faintly boring, which in my experience is what a good bet usually looks like. Gates do not make the agent smarter. They make it pay for its claims while the bill is small, instead of leaving the tab for you to find at review time, next to a demo that looked great. The agent does not get to move on until it has earned the right to. Same as the rest of us, on a good team.
Top comments (0)