What a Fast LLM Taught Me About Assumptions

Giovani Machado Corrêa — Tue, 23 Jun 2026 18:54:33 +0000

Most of our tooling was built for one kind of error. The expensive one is the other.

A cheap, fast model worked for over an hour on a coupled task without falling apart, and that didn't square with anything I knew about these models. They're lazy on long tasks, and the reputation is earned: you hand one an hour-long task and it delivers half, drifts, or declares victory early and leaves you with a result that looks done and isn't. This time, running inside the same disciplined workflow I already use, the usual thing didn't happen. And the task wasn't an easy one: a progress callback that had to thread through an entire pipeline up to the interface, the kind you can't slice into independent pieces, because showing progress only works if the callback crosses everything end to end. The only difference from the times I'd watched these models give up was that the plan had grown a list of deliverables: concrete states that had to exist by the time it was done.

The easy inference is that a weak model needs a list, and I went looking for whether anyone had already measured the effect of explicit deliverables — more out of a habit of never letting an explanation stand without checking it than out of any real suspicion. And it was the search that started knocking my explanations down, one by one.

The first to fall was the idea that the gain was in correctness. A study by Vincent Schmalbach on delegation contracts shows that, on small, well-specified tasks, correctness saturates: with or without a contract, the agent got it right just the same. What the contract changed was reviewability — the agents started documenting what they'd done and leaving evidence on the table, and they only did it when the contract asked. The amount of evidence scaled to what the contract demanded, regardless of what the task itself needed. The deliverables' payoff was in verifiability, and I'd been conflating verifiability with correctness as if they were the same thing.

There was still a gap between two explanations that, up to that point, predicted exactly the same thing. "Deliverables are a crutch for a weak model" and "deliverables are a tool for coupled work" say the same thing when the model is weak and the task is coupled, because both agree the deliverable helps there. The case that tells them apart is the opposite one: a strong model on an equally coupled task. If it's a crutch for weakness, the strong model shouldn't need it. If it's a tool for coupling, the strong model should suffer the same way — just later and at higher cost, because it documents better along the way but doesn't carve the real boundary any more accurately just for being smarter.

The second inference to fall was that deliverables were a crutch for a weak model. In the same study, the effect shows up in both tiers and only shrinks in the strong one, and the reason is that the contract hand-codes the reporting discipline the strong model already tends to practice on its own. The strong model needs the contract less, but for the wrong reason, as far as my intuition was concerned. It doesn't need it less because it's smarter and gets more right. It needs it less because it already documents what it did. It was still about reviewability on both sides; correctness had never entered into it.

With both assumptions knocked down, what was left was the question the study didn't answer, because it wasn't what the study measured: its tasks were small, the kind where correctness saturates on its own, and my callback crossing the whole pipeline wasn't. On coupled work, the plan and the deliverables list play different roles, and only one of them survives the task going wrong in the middle. The plan is a sequence of steps built on a decomposition you guessed at the start, and when the task is large and coupled, that decomposition tends to be exactly the part that can be wrong: you draw a boundary where the system has none, and the model follows the plan with discipline, building precisely the wrong thing. The deliverables list lives at a different level. It describes the state the system has to be in at the end: the callback reaching the endpoint, the old interface still working, the test covering the transition, all of it holding at once. If the model gets the boundary wrong and redoes it down a path I never anticipated, the deliverable doesn't move, because it never spoke about the path. The plan becomes disposable. On coupled work, the route is a guess and the destination is the only thing you actually know.
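One way to picture the difference, as a minimal sketch and not the workflow I actually ran: a deliverable can be written as a check on the end state and nothing else. Every name below (the State dictionary, its keys, the Deliverable class) is hypothetical and only illustrates the three deliverables named above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot of the system once the agent says it is finished.
State = Dict[str, bool]


@dataclass(frozen=True)
class Deliverable:
    name: str
    holds: Callable[[State], bool]  # inspects the end state only, never the steps


# Illustrative deliverables; the dictionary keys are assumptions, not a real API.
DELIVERABLES: List[Deliverable] = [
    Deliverable("callback reaches the endpoint",
                lambda s: s.get("callback_reaches_endpoint", False)),
    Deliverable("old interface still works",
                lambda s: s.get("old_interface_works", False)),
    Deliverable("test covers the transition",
                lambda s: s.get("transition_test_passes", False)),
]


def unmet(state: State) -> List[str]:
    """Return the names of the deliverables that do not hold in this state."""
    return [d.name for d in DELIVERABLES if not d.holds(state)]


def done(state: State) -> bool:
    """True only when every deliverable holds at once; the route is never consulted."""
    return not unmet(state)
```

Nothing in done() takes a plan as input, so the model can throw the plan away halfway through and the check still means the same thing. What the sketch cannot do is tell me whether these three were the right deliverables to begin with.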

Spec-driven development (SDD) frameworks that hand each piece to a separate subagent attack this from another angle, but isolation only helps when the boundary between the parts already exists. When the boundary is the hard part, each piece comes out flawless in its own slice and wrong in the system.

And that caveat isn't an implementation detail — it's the point. Real software development was almost never about small, isolated units. Modules, interfaces, bounded contexts: all of it is a bet that you can fake a boundary in a system that's internally more coupled than the boundary admits — a bet engineering has been making for decades, because the alternative, holding the whole system in your head every time, doesn't scale for anyone.

So why would it work for an LLM? Isolating a subagent is the same old bet, just made by an AI instead of a person. And when the bet doesn't pay off — when the real boundary doesn't match the drawn one — it doesn't matter who did the cutting: the error lives exactly where nobody was looking, because everyone was only watching their own piece.

There are two kinds of error mixed into the same task.

The execution error — the swapped comma, the off-by-one, the forgotten edge case — yields to structure without drama: tests catch it, linting catches it, review catches it, and the more discipline you throw at it, the less it shows up. The assumption error is a different animal: the coupling boundary in the wrong place, the crooked reading of what needed to be done. Process reduces this one too, but only up to the point where the reviewers stop sharing the same blind spot — and the assumption everyone gets wrong together is exactly the one nobody catches. Both live in the same task, and the practical difference is the ceiling: on an execution error, more process keeps helping; on an assumption error, it stalls early.

And the assumption error is stubborn for a reason with no elegant way out: the thing that would catch a wrong assumption is human judgment about whether it's right, and judgment is exactly what failed and let it through in the first place. When I review to catch a bad assumption, I'm running one more pass of the same judgment that let it slip before. I can add people, add eyes, add passes, and it helps, but it doesn't escape — because the layer that's supposed to fix the error is made of the same substance that committed it. We fail when we review, and we fail trying not to fail at reviewing.

The obvious reaction at this point is "skill issue, review harder" — and it isn't wrong, it just runs into a ceiling that has nothing to do with discipline: a genuinely coupled task often doesn't fit in any single vantage point from which you could review the whole assumption, whether that's your own head or the model's context. So more review lowers the odds and never zeroes them out.

If zeroing it out is off the table, what's left is detecting early. Anyone who's run systems in production already lives with this, and there's even an acronym for it: MTTD, mean time to detect — the average time before someone notices something broke. You don't prevent every incident, so you measure and shorten the time to noticing. With a wrong assumption it's the same logic, and that's where I got stuck, because this is the thing nobody measures in coding agents.

Benchmarks measure the probability of being right — success rate, all of it binary and after the fact — and we celebrate a model for how often it lands on the correct answer, with almost no vocabulary for how long a wrong assumption survives before anyone notices. The closest thing that exists looks backward: it points at which step the agent got wrong after the run has already failed. What interests me is the opposite — how long the wrong assumption propagates before it gets caught in practice.
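To make the missing number concrete, here is a rough sketch of what recording it could look like. It assumes something no tool I know of gives you: a log where each wrong assumption has a timestamp for when it entered the work (known only in hindsight) and one for when someone caught it. AssumptionRecord and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class AssumptionRecord:
    """Hypothetical log entry for one wrong assumption in an agent run."""
    introduced_at: datetime          # reconstructed after the fact
    detected_at: Optional[datetime]  # None while nobody has noticed yet


def time_to_detect(record: AssumptionRecord) -> Optional[timedelta]:
    """How long the assumption survived, or None if it is still undetected."""
    if record.detected_at is None:
        return None
    return record.detected_at - record.introduced_at


def mean_time_to_detect(records: List[AssumptionRecord]) -> Optional[timedelta]:
    """Average over detected assumptions only; undetected ones are left out."""
    seconds = [
        d.total_seconds()
        for d in (time_to_detect(r) for r in records)
        if d is not None
    ]
    if not seconds:
        return None
    return timedelta(seconds=mean(seconds))


def undetected_count(records: List[AssumptionRecord]) -> int:
    """Assumptions nobody has caught yet; the mean above says nothing about them."""
    return sum(1 for r in records if r.detected_at is None)
```

The average only covers the assumptions somebody eventually caught, so it flatters you by construction. That is why the sketch reports the undetected count next to it instead of hiding it.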

And here's an admission I'd rather not have to make: measuring time-to-detect doesn't escape the problem, it only shrinks it. Whoever checks the checkpoint is the same layer of judgment that let the assumption through, so TTD isn't immune to a tired reviewer, a rushed one, or one with the same blind spot as whoever wrote the assumption. What it buys isn't a guarantee — it's a smaller defeat: the error stays possible, it just gets cheaper to undo when someone, just as fallible, looks in time. It's not a solution. It's the size of the problem once you accept it doesn't have one.

And it's speed that turns this from a detail into a problem.

Because what speed changed is precisely the gap between making the mistake and noticing it. The odds of getting an assumption wrong didn't drop with AI; that part is human, and it still stands. What shrank was the time anyone had to see the drift before it was buried under code. Before, a wrong assumption took a while to turn into much code, and that slowness was, by accident, a window for inspection. Now the model stacks three hours of competent work on top of the crooked assumption before you even notice.

And here's the perverse part: the more capable the model, the farther you let it go, because every time you check in it looks like it's going fine, and stopping to verify starts to feel like wasting a model that clearly knows what it's doing. Except looking fine at every check and being fine are two different things when the error is one of assumption, because a wrong assumption doesn't flash a warning light while it's being executed competently. The interval between checks stretches on its own, pushed along by confidence, and a wrong assumption found in the third hour costs three times what it would have in the first, because it's three times the competent work to undo. The model's capability inflates that interval without ever giving you a concrete reason to stop — and the absence of a reason is the danger.

Speed itself isn't the villain here, which is exactly why the cheap model at the start didn't fall apart: fast, with checkpoints kept at the same pace, means more chances to catch the crooked assumption per hour of wall-clock work, because each review cycle is cheaper and repeats more often. The danger shows up when speed buys confidence, and confidence spends that speed on a longer interval instead of more cycles — that's when the same fast model, unchecked, stacks up work just as fast as it could have been handing you chances to review. Capability isn't the problem. What you do with the time it frees up is.
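The arithmetic behind that can be sketched with a toy model. It is my own simplification, not a measurement: the wrong assumption appears at a uniformly random moment between two checks, checks come at a fixed interval, and each check independently catches it with some probability. The function name and both parameters are illustrative.

```python
def expected_rework_hours(check_interval_hours: float,
                          detect_probability: float) -> float:
    """Expected hours of work stacked on a wrong assumption before it is caught.

    Toy model, all of it assumed: the assumption goes wrong at a uniformly
    random point between two checks, checks happen every
    `check_interval_hours`, and each check catches the problem independently
    with probability `detect_probability`.
    """
    if check_interval_hours <= 0:
        raise ValueError("check_interval_hours must be positive")
    if not 0 < detect_probability <= 1:
        raise ValueError("detect_probability must be in (0, 1]")
    # On average the first check arrives half an interval after the mistake.
    # Each check that misses adds one full interval, and the expected number
    # of misses before a catch is (1 / p) - 1.
    return check_interval_hours * (1.0 / detect_probability - 0.5)
```

With a reviewer who always catches the problem, checking every hour leaves half an hour of work to undo on average, and checking every three hours leaves an hour and a half. A reviewer who catches it only half the time triples both figures. In this model, stretching the interval and sharing the blind spot multiply each other.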

The industry's answer to all this has been to add more process.

More spec, more plan, more phases — and the SDD that caught on ended up looking like the very waterfall we spent years running from: verification all bunched up at a single point, at the end, right in the era when the model is pushing you to let it run further before you get there. Even the modern version, with a living plan you revise midway, just adds one more layer to maintain — overhead on top of overhead, none of it touching the one thing that matters: the judgment about whether the assumption was right before any of the stacking began.

It's an execution-error tool thrown at an assumption error, dressed up as rigor.

What's left is an observation, not a recipe. The one study I found shows the contract helping nearly twice as much on the weaker model — but it measures reviewability, and what I want to know is something else: how much sooner did this weaker, faster model reach the goal, and what did it cost to recover from a crooked assumption? Nobody has measured that axis, so we're deciding in the dark whether more intelligence solves the right bottleneck.

My bet is that it doesn't — that pouring probabilistic intelligence into the part that's irreducibly human judgment buys speed and nothing more, and speed, on an error you don't prevent, is just more competent work stacked on top of an assumption nobody checked.

I started this trying to understand why a cheap model held up for an hour on a coupled task, and I ended up convinced that AI-assisted software engineering has been measuring the wrong bottleneck. Most of our tooling is still optimized for the execution error — precisely the one more process has always helped with. What got more expensive with AI is the other thing: how long a wrong assumption manages to survive before anyone notices. With an assumption that might be wrong, all you can do is accept that, keep going anyway, and get very good at noticing early when it starts to slip.

Maybe the question that matters isn't which model to run, but where, in all of this, it's still worth being the one who decides.

DEV Community: Giovani Machado Corrêa
