Two days ago I opened my Claude Code dashboard. I wanted to see how many outputs I had killed (killed before review, before diff, killed because three seconds were enough to see the thing wasn't salvageable...). The ratio slapped me. The vast majority of raw outputs never passed my first filter. Not review-able. Not fixable. Just garbage. (admit it, you have the same graph waiting for you)
In 2024 I was fixing. In 2026 I refuse. Not the same job anymore.
TLDR: Stack Overflow shipped its 2025 survey. 84% of devs use AI, 29% trust it. Between those two numbers, a gap. Everyone named it. Sonar calls it the Verification Gap. Osmani calls it the 80% Problem. Everyone prescribed the same remedy: review better, review more, review with better tools. Except review assumes the output is review-able. And when it isn't, every minute of review is a tax. There's a skill nobody wrote down yet. The one that happens before the review.
The Number I Didn't Want to See
A friend asked me how many of Claude Code's outputs I actually keep. I said "most of them" because that's what I assumed. Then I pulled /insights. "Most of them" was not the answer.
The graph showed something I'd been doing without naming. I wasn't "fixing" AI outputs most of the time. I was rejecting them. Three seconds of scroll, a look at the file list, a peek at the function signatures, and the thing went to git reset --hard. No review. No diff. No investigation. Killed.
The part that bothered me wasn't the kill rate. It was the realization that I'd been doing this for months without calling it anything. There's no section for it in any AI coding article I've read. Nobody teaches it. Nobody benchmarks it. The :wq of the AI era. Nobody ships it as a feature because nobody's seen it yet.
So I started paying attention. What makes an output refusable in three seconds? Which signals do I read without thinking? Why do juniors keep "trying to make it work" when the correct move is reset --hard?
Five signals kept showing up. Same shape every time.
Everyone Named the Same Thing
Stack Overflow's 2025 survey came out with the numbers everyone quoted. 84% of developers now use or plan to use AI tools, up from 76% the year before. Half the profession uses them daily. Trust in accuracy fell from 40% to 29%. Only 3% report high trust. 46% actively distrust their own tools. 66% say the AI gives them code that's "almost right but not quite." 45% find debugging AI-generated code slower than writing from scratch.
Almost half the profession now spends more time debugging AI than they'd spend writing the thing themselves. Adoption climbs anyway.
Sonar ran its own survey, published in January. Different sample, same shape: 96% don't fully trust the output, and only 48% actually verify it. They coined a term for the gap: the Verification Gap. Clean name, wrong diagnosis.
Addy Osmani theorized the same phenomenon from another angle in "The 80% Problem in Agentic Coding." AI solves 80% of the problem in minutes. The remaining 20% costs exactly what it used to. He quoted Karpathy saying, "I really am mostly programming in English now," describing the flip where human work shrinks to edits and touchups. He cited Boris Cherny, who built Claude Code, saying essentially all of Anthropic's code is now written by Claude itself. That's the ceiling of the current paradigm.
All these pieces pointed at a real problem. All of them prescribed the same cure. Review better. Build review tooling. Hire AI-fluent reviewers. Add verification layers.
Except the cure presumes the patient is treatable. I wrote about this exact misdiagnosis when Bloomberg ran their productivity panic piece last month: Bloomberg is diagnosing the wrong disease. The industry keeps naming the symptoms and missing the treatment. Review only works on review-able outputs.
The Review Cure Scaled for One Dev. It Breaks at 150 Agents.
The review playbook wasn't wrong. It was right for the shape of the work in 2023.
One dev, one output, one PR, one reviewer. You read the diff. You understand what the model tried. You correct. You merge. The cost was linear, and Osmani's numbers say it grew: PRs got about 18% larger as AI adoption rose, incidents per PR climbed around 24%, change failure rates up near 30%. Painful, but tractable. You could hire more reviewers. You could add diff tooling. You could tighten PR templates.
That entire playbook presupposes the dev can keep up with the reading. When you generate five outputs a day, reading is doable. When you generate fifty, you're already underwater. When you generate more than that, reading every diff isn't just slow, it's physically impossible.
So the review cure quietly changes shape. "Read every line" becomes "trust the agent-reviewer to read every line for you." Which is how you end up with /ultrareview and its cousins. Review delegated downstream. But delegation isn't the same move as refusal, and I'll come back to that.
The bottleneck stopped being the reading. It moved to the decision before reading happens: is this output worth reading at all?
Refusal is the new compile step. It runs before the read.
The Five Signals of Immediate Refusal
Most of my refusals fit into five patterns. None of them need a diff. None need investigation. They're signals you learn to read by burning hours on the outputs that taught you.
Signal 1: Fabricated API. The model invents a function, an endpoint, a library method. The moment it happens once, the whole output is structurally contaminated. The hallucinated call sits inside logic built on the assumption that the call exists, so the logic is shaped by a lie. Fixing the one line doesn't fix the output, it fixes the symptom. Refuse. Re-prompt with the real API surface in context. Takes two minutes instead of twenty.
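One way to make that re-prompt cheap is to extract the real surface mechanically instead of typing it. A minimal sketch in Python, assuming the hallucination targeted an importable Python library; build_api_surface is my name for it, not a standard tool:

```python
import inspect

def build_api_surface(module) -> str:
    """List the real, importable callables of a module, with signatures,
    ready to paste into the re-prompt so the model can't invent calls."""
    lines = []
    for name, obj in inspect.getmembers(module, callable):
        if name.startswith("_"):
            continue  # private helpers: the model shouldn't call these either
        try:
            lines.append(f"{name}{inspect.signature(obj)}")
        except (ValueError, TypeError):
            lines.append(name)  # some C-level callables hide their signature
    return "\n".join(lines)

# Example: re-prompting a model that hallucinated a requests method.
# import requests
# print(build_api_surface(requests))
```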
Signal 2: Broken Contract. You wrote a constraint in the prompt. "SQL only, no ORM." "Don't touch auth middleware." "Pure functions, no side effects." And the output violates it. Not because the model forgot. Because the model decided its way was better. You do not renegotiate a contract after delivery. The point of a contract is that a violation isn't a bug you patch, it's a trigger for refusal. I wrote the full prompt contracts framework around this exact move. The contract is what makes refusal fast. No contract, no refusal, only endless review.
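The contract only triggers a fast refusal if it's machine-checkable. A minimal sketch, assuming the clauses can be expressed as patterns over the diff; the clauses and regexes here are illustrative, not a standard format:

```python
import re

# Each clause from the prompt, paired with a pattern that proves its violation.
CONTRACT = {
    "SQL only, no ORM": re.compile(r"\b(sqlalchemy|django\.db|peewee)\b"),
    "Don't touch auth middleware": re.compile(r"middleware/auth", re.I),
    "Pure functions, no side effects": re.compile(r"\bglobal\b"),  # crude proxy
}

def broken_clauses(diff_text: str) -> list[str]:
    """Return every clause the output violates. One hit is enough to refuse."""
    return [clause for clause, pattern in CONTRACT.items()
            if pattern.search(diff_text)]

# if broken_clauses(diff): refuse. Re-prompt with the contract restated;
# don't renegotiate after delivery.
```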
Signal 3: Time-to-Understand Exceeds Threshold. You've been staring at the diff for ten minutes. You still can't tell what the model was trying to do. Since Opus 4.7, confident outputs don't mean legible outputs. When your time-to-understand exceeds your time-to-rewrite, refuse. Rewriting from scratch isn't a failure, it's a move. The model generated a scaffold you didn't ask for. Trash the scaffold. Ask again with a better spec.
Signal 4: Out-of-Scope Writes. You asked for a fix on one file. The output touches three others. Your brain goes "let me just check what changed in the others." That sentence is a trap. The model chose to rewrite things you didn't authorize. The question isn't "are the other changes correct." The question is "did I authorize them." No? Refuse. Re-prompt with explicit file scope. And yes, you'll lose the good parts that were in the out-of-scope writes. That's the tax for not refusing earlier.
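Scope is the easiest of the five to check mechanically, because git already has the answer. A sketch, assuming the output is sitting uncommitted in your working tree:

```python
import subprocess

def out_of_scope_writes(authorized: set[str]) -> set[str]:
    """Files the model touched that you never authorized.
    A non-empty result is a refusal, not a review item."""
    changed = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return set(changed) - authorized

# if out_of_scope_writes({"src/billing.py"}):
#     subprocess.run(["git", "reset", "--hard"], check=True)  # the kill
```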
Signal 5: "Let me test and see." The model runs something, watches it fail, then "fixes" it by guessing. No theory of why the fix should work. No explanation of why the original failed. Just stochastic patching against the output of the last run. This isn't engineering, it's while true; do pray; done. Refuse. Prompt again with the stack trace, the suspected cause, and a demand for a diagnosis, not a patch.
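The re-prompt that works here is the one that forbids a patch until a theory exists. A sketch of the kind of template I mean; the names and wording are mine, invented for illustration:

```python
DIAGNOSIS_PROMPT = """\
The previous fix is rejected: it patched the symptom with no theory.

Stack trace:
{stack_trace}

Suspected cause: {suspected_cause}

Before proposing any code:
1. Explain why the original code failed.
2. Explain why your fix addresses that cause, not just this one failing run.
Only then propose the patch."""

def diagnosis_prompt(stack_trace: str, suspected_cause: str) -> str:
    """Turn a Signal 5 refusal into a re-prompt that demands a diagnosis."""
    return DIAGNOSIS_PROMPT.format(
        stack_trace=stack_trace, suspected_cause=suspected_cause)
```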
These five cover easily nine out of ten kills. The rest is just weird stuff. Models having a bad day. Prompts I should have thought about longer. Not worth a framework.
Five signals. Three seconds each. Fifteen seconds to save a morning.
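If you want the whole triage as one artifact, it compresses to a handful of checks. A sketch, not a tool: every field is something you'd fill from the signals above, and the dataclass and names are mine, invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Triage:
    fabricated_api: bool = False                          # Signal 1
    broken_clauses: list = field(default_factory=list)    # Signal 2
    minutes_to_understand: float = 0.0                    # Signal 3: your clock
    minutes_to_rewrite: float = 3.0                       # Signal 3: honest estimate
    out_of_scope: set = field(default_factory=set)        # Signal 4
    patched_without_theory: bool = False                  # Signal 5

def verdict(t: Triage) -> str:
    if t.fabricated_api:
        return "refuse: fabricated API"
    if t.broken_clauses:
        return "refuse: broken contract"
    if t.minutes_to_understand > t.minutes_to_rewrite:
        return "refuse: illegible, rewrite is cheaper"
    if t.out_of_scope:
        return "refuse: unauthorized writes"
    if t.patched_without_theory:
        return "refuse: stochastic patching"
    return "accept: now, and only now, review"
```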
Why Refusing Is Culturally Hard
The dev reflex is "make it work." Thirty years of that reflex. Print statements at 2 AM, Stack Overflow tabs piled up, the satisfaction of wrestling a broken thing back into shape. It's the core identity of the profession for a lot of us.
The AI era breaks this reflex in a specific way. Before, broken code was broken because you wrote it wrong. Fixing it taught you something. Now, broken code is broken because a stochastic model sampled badly. Fixing it teaches you nothing, because there's no repeatable pattern to internalize. You're not debugging a mistake, you're hand-polishing a coin that got stamped crooked.
The senior reflex, "I can always save this", is exactly the thing that costs you. Your pattern matching is so good at mapping broken code to "here's the fix" that you enter salvage mode automatically. You don't notice you've been salvaging for forty minutes on a thing that would have taken three minutes to re-prompt. The 45% of devs who tell Stack Overflow that debugging AI is slower than writing from scratch? A lot of them are seniors. They got there because their salvage reflex is too strong to let an output die.
The refusal muscle hurts because it runs against three decades of "good devs don't give up." You're not triaging a patient. You're closing a bad tab.
What 150 Agents Taught Me About the Refusal Rate
I run roughly 150 AI agents across my stack. Some generate code. Some write copy. Some classify signals. Some do ops. At that scale you can't review everything. You can't even fix everything. You can only refuse fast and re-prompt.
The math is brutal. If you accept an output that needed to be refused, the cost doesn't appear today. It appears three weeks later, when a weird behavior in prod traces back to a line nobody understood when it shipped. GitClear's multi-year analysis, which Osmani cited, saw code duplication jump 48% and refactoring collapse from a quarter of all changes to under a tenth. That's the silent debt. Accepted-when-it-should-have-been-refused, shipped, forgotten, detonating later.
The devs who refuse too early miss wins. The model can do good work. Killing its first attempt on every task is wasteful. The sweet spot isn't "refuse everything" or "fix everything." It's refuse fast, re-spec, accept. The cycle used to be prompt → review → fix. At 150 agents it's prompt → three-second triage → refuse or accept. Review comes only inside the accept branch, and by then most of the bad outputs have already been killed.
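The shape of that cycle, as a loop. A sketch under obvious assumptions: generate, triage, respec, and review stand in for whatever your agent stack actually calls; all four are hypothetical names here.

```python
def run_task(spec: str, generate, triage, respec, review, max_attempts: int = 3):
    """prompt -> three-second triage -> refuse (re-spec) or accept (then review)."""
    for _ in range(max_attempts):
        output = generate(spec)
        v = triage(output)                 # the three-second check, not a review
        if v.startswith("refuse"):
            spec = respec(spec, v)         # fold the refusal reason into the spec
            continue
        return review(output)              # review only what survived triage
    raise RuntimeError("three refusals: the spec is the problem, not the model")
```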
Review is what you do after the triage. Triage is the job.
The Trap: /ultrareview as Laundered Acceptance
Anthropic shipped /ultrareview with Opus 4.7. Multi-agent code review: several agents read the diff, flag issues, report back. Sounds like the answer to the review explosion Osmani described. It isn't.
The mechanic goes like this. You get an output. You don't trust it. You run /ultrareview. Three agents read it. They find two small things. You fix the small things. You ship. You feel reviewed.
Except none of those three agents can invalidate the paradigm. They can find bugs. They can find style issues. They cannot tell you "this approach is wrong, refuse the whole thing." They're trained to find issues, not to kill outputs. That's a subtle difference that matters enormously in production.
What happens in practice: you delegate the refusal decision to an agent whose architecture can't refuse. It's Karen from Accounting approving the expense report because the math adds up, without checking whether the trip should have happened at all. The numbers balance, the trip was a disaster.
The trap is that /ultrareview feels like due diligence. You've reviewed. Multiple agents reviewed. Everyone reviewed. Nobody refused.
To be fair, /ultrareview is useful on non-critical changes. Internal tests, boilerplate, format-level refactoring, the kind of stuff where the paradigm is obviously correct and you just want extra eyes on the execution. Use it there. On critical paths, money, auth, data writes, the refusal decision stays with a human. Not because humans are smarter, but because humans are allowed to say "start over" and agents aren't.
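The routing rule is simple enough to write down. A sketch with a hypothetical repo layout; the point is the split, not the paths:

```python
CRITICAL_PREFIXES = ("payments/", "auth/", "db/migrations/")  # hypothetical layout

def review_route(changed_files: set[str]) -> str:
    """Critical paths keep a human who is allowed to say 'start over'."""
    if any(f.startswith(p) for f in changed_files for p in CRITICAL_PREFIXES):
        return "human triage: refusal stays on the table"
    return "/ultrareview: extra eyes on execution only"
```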
If you can't refuse, you can only patch. Patched systems ship faster. They also die sooner.
The 2026 Seniority Test
The dev seniority test used to be "can you debug this." Later it became "can you architect this." For a while in the 2020s it looked like it was becoming "can you prompt this."
In 2026 it's something else. How much do you refuse, and how fast.
Not how much you ship. Not how much you review. Not how thick your PR is. How much you kill before it drains your day. That's the metric that separates the dev who shipped 22 PRs yesterday from the dev who spent the same day hand-polishing three outputs that were never going to work.
Stack Overflow says 69% of devs report AI has boosted their personal productivity. I believe it. The catch is where the boost goes. For most it goes into review time, because that's the only muscle they've been taught to use. For a few it goes into a shorter refusal loop, and those few ship meaningfully more than the rest.
The industry will keep stacking review tooling. /ultrareview, multi-agent diff, AI that reviews AI. Prompts will become specs, specs will become plans, plans will become ceremonies. Everyone adding a layer to feel reassured about what comes out of the pipe.
Meanwhile the devs who ship will have dropped the "I can save this code" reflex. A three-second look at the output. Kill. Re-spec. Go again. No drama, no heroic 2 AM diff. Just triage, then back to work. 🔥
Stack Overflow counted what we accept. The real 2026 indicator is what we refuse.
Write less. Review less. Refuse more. Ship better.
Sources
- Stack Overflow 2025 Developer Survey: survey.stackoverflow.co/2025/ai
- Addy Osmani, "The 80% Problem in Agentic Coding": addyo.substack.com
- Sonar State of Code Developer Survey, January 2026: sonarsource.com