I spent the last few months building production software almost entirely with AI agents (Claude Code with Opus). A SaaS app, a photography portal, two open source tools. Hundreds of hours, thousands of commits.
You probably already know agents don't finish the job. What I didn't expect was why, and what makes it worse.
How incomplete
One project: SaaS app, 7 spec documents (~7500 lines), 70 business processes. Agent produced 261 E2E test cases and marked it done. I told it to cross-check. It spawned 4 subagents that read the masterplan about 40 times combined. They found 117 missing scenarios. Agent added them, marked it done. I told it to check again. More gaps.
Same project, code side. 8 days, 280 commits, 32k lines of production code. Agent marked all 10 phases as COMPLETED. Actual state: 32% of API endpoints had input validation. 1 Sentry call in 32k LOC. Zero error boundaries. Zero loading states. 68% of planned E2E tests implemented. 13% of background jobs had retry logic.
Where the rule is binary (does this table have RLS? yes/no), compliance is near 100%. Where it requires judgment (does this endpoint need validation?), it drops to 30-70%.
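The binary half of that split is exactly what a machine can enforce. A minimal sketch, assuming table metadata has already been dumped into plain dicts (the field names are illustrative, not a real driver API):

```python
# A binary rule needs no judgment: every table either has row-level
# security (RLS) enabled or it does not.
def missing_rls(tables):
    """Return names of tables without RLS, from a metadata dump."""
    return [t["name"] for t in tables if not t["rls_enabled"]]

tables = [
    {"name": "users", "rls_enabled": True},
    {"name": "invoices", "rls_enabled": False},
]
print(missing_rls(tables))  # -> ['invoices']
```

The judgment half ("does this endpoint need validation?") has no check this simple, which is where the 30-70% lives.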
This was consistent across every project. The pattern is recursive: the agent does 80%, declares done, you push back, it does 80% of the remainder, declares done again. Theoretically that converges. In practice, context compression interrupts convergence, because compression preserves facts but loses the connections between them.
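If each pass really cleared 80% of what remained, the gap would shrink geometrically; a toy calculation of the leftover fraction per pass:

```python
# Leftover work after n passes, if each pass completes 80% of the rest.
remaining = 1.0
for n in range(1, 5):
    remaining *= 0.2
    print(f"pass {n}: {remaining:.2%} left")
# pass 1: 20.00% left
# pass 4: 0.16% left
```

Four uninterrupted pushbacks would get you under 1%. The point of the compression caveat is that the passes are not independent samples of the same spec: once context is compressed, later passes are sampling a degraded copy.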
Why
I gave the agent its own performance data and asked it to explain. Two quotes that stuck:
"Every pass is sampling, not exhaustive scan. I read a 2227-line document and 'catch' scenarios. But I don't do it mechanically line by line. I do it like a human: scan, catch patterns, extract what fits my mental model. What doesn't fit, I skip. And I don't know that I skip."
"When I write 'Status: COMPLETED', I am not measuring. I am not comparing the code to the spec. I am saying 'I finished doing what I planned to do.' And the plan was an abstraction of the spec. So 'completed' means 'the abstraction is realized', which tells you nothing about conformance to the original."
This is the key distinction. The agent is not lying. It genuinely finished what it set out to do. The problem is that what it set out to do was always a lossy compression of what you asked for.
Why verification doesn't help (and often makes it worse)
The intuitive response is to add verification. Checklists, self-audits, "before you commit, verify you followed all rules." I tested this with controlled experiments.
I had a set of cross-cutting rules for an agent to follow (things like "all destructive actions need confirmation dialogs", "autosave every 30 seconds", "use casual Polish language forms"). Same task, same codebase, different instruction strategies. 7 experiments.
Every attempt to add per-behavior verification regressed from the best score. 4 out of 7 experiments were regressions. The worst one, mandatory re-reading triggered by code patterns, scored 1.4/10 against a 1.82 baseline. The agent performed worse with the verification step than with no special instruction at all.
Separately, I ran 18 experiments optimizing a single prompt. Same pattern: structured self-audit and mandatory verification either had no effect or made things worse.
What worked in both cases: reframing what the rules mean ("these are constraints, not suggestions") and restructuring how instructions are laid out (separating the reading phase from the writing phase, using XML tags to create attention hierarchy, numbered steps instead of paragraphs).
The agent explained why verification backfires: "You have knowledge graphs, skills, task trackers, plans, checklists. And I execute all of them. But I do it ceremonially, not substantively. The act of running a tool gives me a false sense of completion. Tooling gives me an alibi for incompleteness: 'I used all the tools.'"
More procedure means more tokens spent on ceremony, fewer tokens on actual work. The agent treats structure as mandatory and procedure as optional. Tell it what things mean and it uses the knowledge. Tell it to verify it used the knowledge and it performs the verification without actually verifying.
What works instead
Three things, from everything I tested. None involve asking the agent to check itself.
Framing over policing. The single highest-impact change across both experiment sets was not adding instructions. It was changing how existing instructions are framed. "These are constraints, not suggestions" tripled the score. "List every rule before implementing" caused regression. The agent responds to meaning, not procedure.
Structural hierarchy in instructions. XML tags that mark sections as critical vs reference material. Numbered steps instead of prose paragraphs. Critical rules at the very top and very bottom of the document (primacy/recency). In one experiment set, the same text scored differently just by being wrapped in different XML tags. The structure of the instruction matters more than what it says.
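As a concrete illustration, the kind of layout this points at (the tag names here are mine, not the ones from the experiments):

```xml
<critical>
1. All destructive actions need confirmation dialogs.
2. Autosave every 30 seconds.
</critical>

<task>
3. Read the relevant spec sections fully before writing any code.
4. Implement the endpoint.
</task>

<reference>
Style notes, naming conventions, links to related modules.
</reference>

<critical>
Reminder: rules 1-2 are constraints, not suggestions.
</critical>
```

Critical rules appear at both the top and the bottom to exploit primacy/recency; reference material is fenced off so it cannot masquerade as instruction.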
External mechanical checks. Git hooks, coverage gates, deterministic validators. Things that don't depend on the model's judgment. The agent put it best: "The only thing that actually helps is external constraint, something that is not me, does not have my biases, and does not let me say 'done' until a mechanically measurable condition is met. Boring, mechanical, unsexy. And the only thing that works."
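A sketch of such a gate, assuming an Express-style codebase; the regexes are illustrative and would need adapting to your framework. Wired into a pre-commit hook or CI step, it makes "done" a measured condition rather than a claim:

```python
import re

# Crude but deterministic: the codebase should contain at least one
# validate() call per mutating route. A count comparison, not a proof,
# but it does not depend on the model's judgment.
ROUTE = re.compile(r"app\.(post|put|patch|delete)\(")
VALIDATE = re.compile(r"\bvalidate\(")

def validation_gap(source: str) -> int:
    """How many mutating routes lack a matching validate() call."""
    return max(0, len(ROUTE.findall(source)) - len(VALIDATE.findall(source)))

src = '''
app.post("/users", (req, res) => create(validate(req)))
app.put("/users/:id", (req, res) => update(req))
'''
print(validation_gap(src))  # -> 1, so the gate fails the commit
```

The same shape works for the other gaps from the audit: count Sentry calls per module, count background jobs without retry config, diff planned vs implemented E2E test names.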
I also built a proof of concept document compiler based on this principle: agent generates content with embedded assertions, deterministic pipeline verifies consistency. Like static typing for documents. Separate generation from verification and let each side do what it is good at.
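The post does not specify the assertion format, so here is a hypothetical minimal version of the verify side: assertions embedded as HTML comments claiming how many times a term appears, checked by a deterministic pass:

```python
import re

# Invented assertion syntax for illustration: <!-- assert count:X = N -->
# claims the label X appears N times in the document body.
ASSERT = re.compile(r"<!--\s*assert count:(\w+)\s*=\s*(\d+)\s*-->")

def verify(doc: str) -> list[str]:
    """Check every embedded count assertion against the document body."""
    body = ASSERT.sub("", doc)  # do not count labels inside assertions
    errors = []
    for label, expected in ASSERT.findall(doc):
        actual = len(re.findall(rf"\b{label}\b", body))
        if actual != int(expected):
            errors.append(f"{label}: claimed {expected}, found {actual}")
    return errors

doc = """<!-- assert count:scenario = 2 -->
scenario A covers login; scenario B covers logout.
"""
print(verify(doc))  # -> []
```

The generator is free to be an 80% sampler; the verifier is exhaustive because counting is mechanical. That is the whole division of labor.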
The meta-problem
The agent's final observation: "And you know what's worst? This analysis is also 80%. Somewhere here is an insight I don't see, a blind spot I don't detect, because my weights don't catch it."
There is no escape from this inside the current paradigm. No prompt, no skill, no amount of tooling will make a model with finite context exhaustively verify its own output. You can add verification layers, but each layer is another 80% pass with the same blind spots.
The useful question is not "how do I make the agent complete." It is "where do I put the mechanical checks so the 20% the agent misses gets caught before it matters."
All the experiments were conducted with Researcher Skill: a one-file skill that turns the agent into an experimenter, running 30+ experiments unattended.