<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Krzysztof Dudek</title>
    <description>The latest articles on DEV Community by Krzysztof Dudek (@krzysztofdudek).</description>
    <link>https://dev.to/krzysztofdudek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838880%2F22d4c438-216d-4659-a8fb-0034d66f4876.png</url>
      <title>DEV Community: Krzysztof Dudek</title>
      <link>https://dev.to/krzysztofdudek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krzysztofdudek"/>
    <language>en</language>
    <item>
      <title>AI agents say "COMPLETED" after doing 80% of the job. I have the numbers.</title>
      <dc:creator>Krzysztof Dudek</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:04:13 +0000</pubDate>
      <link>https://dev.to/krzysztofdudek/ai-agents-say-completed-after-doing-80-of-the-job-i-have-the-numbers-lko</link>
      <guid>https://dev.to/krzysztofdudek/ai-agents-say-completed-after-doing-80-of-the-job-i-have-the-numbers-lko</guid>
      <description>&lt;p&gt;I spent the last few months building production software almost entirely with AI agents (Claude Code with Opus). A SaaS app, a photography portal, two open source tools. Hundreds of hours, thousands of commits.&lt;/p&gt;

&lt;p&gt;You probably already know agents don't finish the job. What I didn't expect was why, and what makes it worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  How incomplete
&lt;/h2&gt;

&lt;p&gt;One project: SaaS app, 7 spec documents (~7500 lines), 70 business processes. Agent produced 261 E2E test cases and marked it done. I told it to cross-check. It spawned 4 subagents that read the masterplan about 40 times combined. They found 117 missing scenarios. Agent added them, marked it done. I told it to check again. More gaps.&lt;/p&gt;

&lt;p&gt;Same project, code side. 8 days, 280 commits, 32k lines of production code. Agent marked all 10 phases as COMPLETED. Actual state: 32% of API endpoints had input validation. 1 Sentry call in 32k LOC. Zero error boundaries. Zero loading states. 68% of planned E2E tests implemented. 13% of background jobs had retry logic.&lt;/p&gt;

&lt;p&gt;Where the rule is binary (does this table have RLS? yes/no), compliance is near 100%. Where it requires judgment (does this endpoint need validation?), it drops to 30-70%.&lt;/p&gt;

&lt;p&gt;This was consistent across every project, and it is recursive: the agent does 80%, declares done, you push back, it does 80% of the remainder, declares done again. In theory this converges. In practice, context compression interrupts the convergence, because compression preserves facts but loses the connections between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why
&lt;/h2&gt;

&lt;p&gt;I gave the agent its own performance data and asked it to explain. Two quotes that stuck:&lt;/p&gt;

&lt;p&gt;"Every pass is sampling, not exhaustive scan. I read a 2227-line document and 'catch' scenarios. But I don't do it mechanically line by line. I do it like a human: scan, catch patterns, extract what fits my mental model. What doesn't fit, I skip. And I don't know that I skip."&lt;/p&gt;

&lt;p&gt;"When I write 'Status: COMPLETED', I am not measuring. I am not comparing the code to the spec. I am saying 'I finished doing what I planned to do.' And the plan was an abstraction of the spec. So 'completed' means 'the abstraction is realized', which tells you nothing about conformance to the original."&lt;/p&gt;

&lt;p&gt;This is the key distinction. The agent is not lying. It genuinely finished what it set out to do. The problem is that what it set out to do was always a lossy compression of what you asked for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why verification doesn't help (and often makes it worse)
&lt;/h2&gt;

&lt;p&gt;The intuitive response is to add verification. Checklists, self-audits, "before you commit, verify you followed all rules." I tested this with controlled experiments.&lt;/p&gt;

&lt;p&gt;I had a set of cross-cutting rules for an agent to follow (things like "all destructive actions need confirmation dialogs", "autosave every 30 seconds", "use casual Polish language forms"). Same task, same codebase, different instruction strategies. 7 experiments.&lt;/p&gt;

&lt;p&gt;Every attempt to add per-behavior verification regressed from the best score. 4 out of 7 experiments were regressions. The worst one, mandatory re-reading triggered by code patterns, scored 1.4/10 against a 1.82 baseline. The agent performed worse with the verification step than with no special instruction at all.&lt;/p&gt;

&lt;p&gt;Separately, I ran 18 experiments optimizing a single prompt. Same pattern: structured self-audit and mandatory verification either had no effect or made things worse.&lt;/p&gt;

&lt;p&gt;What worked in both cases: reframing what the rules mean ("these are constraints, not suggestions") and restructuring how instructions are laid out (separating the reading phase from the writing phase, using XML tags to create attention hierarchy, numbered steps instead of paragraphs).&lt;/p&gt;
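
&lt;p&gt;For illustration, here is the shape that layout takes. The tag names and rule wording below are hypothetical, not the exact prompt I tested; the point is the structure, not the text:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;constraints&amp;gt;          # top of the document: primacy
These are constraints, not suggestions.
1. All destructive actions need confirmation dialogs.
2. Autosave every 30 seconds.
3. Use casual Polish language forms.
&amp;lt;/constraints&amp;gt;

&amp;lt;reference&amp;gt;            # background material, explicitly lower priority
...API notes, style docs...
&amp;lt;/reference&amp;gt;

&amp;lt;task&amp;gt;
Phase 1 (read): list the files and rules this change touches.
Phase 2 (write): implement, one numbered step at a time.
&amp;lt;/task&amp;gt;

&amp;lt;constraints-reminder&amp;gt;  # bottom of the document: recency
Re-read the constraints above before finishing.
&amp;lt;/constraints-reminder&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;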

&lt;p&gt;The agent explained why verification backfires: "You have knowledge graphs, skills, task trackers, plans, checklists. And I execute all of them. But I do it ceremonially, not substantively. The act of running a tool gives me a false sense of completion. Tooling gives me an alibi for incompleteness: 'I used all the tools.'"&lt;/p&gt;

&lt;p&gt;More procedure means more tokens spent on ceremony, fewer tokens on actual work. The agent treats structure as mandatory and procedure as optional. Tell it what things mean and it uses the knowledge. Tell it to verify it used the knowledge and it performs the verification without actually verifying.&lt;/p&gt;

&lt;h2&gt;
  
  
  What works instead
&lt;/h2&gt;

&lt;p&gt;Three things, from everything I tested. None involve asking the agent to check itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framing over policing.&lt;/strong&gt; The single highest-impact change across both experiment sets was not adding instructions. It was changing how existing instructions are framed. "These are constraints, not suggestions" tripled the score. "List every rule before implementing" caused regression. The agent responds to meaning, not procedure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structural hierarchy in instructions.&lt;/strong&gt; XML tags that mark sections as critical vs reference material. Numbered steps instead of prose paragraphs. Critical rules at the very top and very bottom of the document (primacy/recency). In one experiment set, the same text scored differently just by being wrapped in different XML tags. The structure of the instruction matters more than what it says.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External mechanical checks.&lt;/strong&gt; Git hooks, coverage gates, deterministic validators. Things that don't depend on the model's judgment. The agent put it best: "The only thing that actually helps is external constraint, something that is not me, does not have my biases, and does not let me say 'done' until a mechanically measurable condition is met. Boring, mechanical, unsexy. And the only thing that works."&lt;/p&gt;
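
&lt;p&gt;A minimal sketch of such a gate in Python. The file contents and the &lt;code&gt;validate(&lt;/code&gt; convention are invented for illustration; a real pre-commit hook would read staged files from git instead of an inline dict:&lt;/p&gt;

```python
# Deterministic "done" gate: refuse completion until every endpoint file
# contains an input-validation call. Sources are inlined for the sketch.
SOURCES = {
    "orders.py": "def create_order(req):\n    validate(req)\n    ...",
    "users.py": "def create_user(req):\n    ...",  # agent skipped validation
}

def unvalidated_endpoints(sources):
    """Return the files that define an endpoint but never call validate()."""
    return sorted(name for name, code in sources.items()
                  if "def " in code and "validate(" not in code)

missing = unvalidated_endpoints(SOURCES)
# A CI step would exit non-zero here; the agent cannot argue with it.
print("BLOCKED" if missing else "OK", missing)  # prints: BLOCKED ['users.py']
```

The check is crude on purpose. It measures a mechanically decidable condition, which is exactly the category where compliance was near 100%.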

&lt;p&gt;I also built a proof of concept document compiler based on this principle: agent generates content with embedded assertions, deterministic pipeline verifies consistency. Like static typing for documents. Separate generation from verification and let each side do what it is good at.&lt;/p&gt;
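
&lt;p&gt;A toy version of that split, with an invented &lt;code&gt;@assert&lt;/code&gt; syntax (the real compiler's format differs; this only demonstrates separating generation from deterministic verification):&lt;/p&gt;

```python
import re

# The generator embeds assertions as special lines; a deterministic pass
# checks them against the rest of the document, like a type checker.
DOC = """\
# Checkout spec
@assert section: Refunds
## Refunds
Refunds are processed within 14 days.
@assert text: 14 days
"""

def check(doc):
    """Return assertion failures; an empty list means the document is consistent."""
    body = "\n".join(l for l in doc.splitlines() if not l.startswith("@assert"))
    failures = []
    for kind, arg in re.findall(r"@assert (\w+): (.+)", doc):
        if kind == "section" and "## " + arg not in body:
            failures.append("missing section: " + arg)
        if kind == "text" and arg not in body:
            failures.append("missing text: " + arg)
    return failures

print(check(DOC))  # a consistent document prints []
```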

&lt;h2&gt;
  
  
  The meta-problem
&lt;/h2&gt;

&lt;p&gt;The agent's final observation: "And you know what's worst? This analysis is also 80%. Somewhere here is an insight I don't see, a blind spot I don't detect, because my weights don't catch it."&lt;/p&gt;

&lt;p&gt;There is no escape from this inside the current paradigm. No prompt, no skill, no amount of tooling will make a model with finite context exhaustively verify its own output. You can add verification layers, but each layer is another 80% pass with the same blind spots.&lt;/p&gt;

&lt;p&gt;The useful question is not "how do I make the agent complete." It is "where do I put the mechanical checks so the 20% the agent misses gets caught before it matters."&lt;/p&gt;

&lt;p&gt;All the experiments were conducted with: &lt;a href="https://github.com/krzysztofdudek/ResearcherSkill" rel="noopener noreferrer"&gt;Researcher Skill: One file. Your AI agent becomes a scientist. 30+ experiments while you sleep.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Code has logic. It does not have meaning.</title>
      <dc:creator>Krzysztof Dudek</dc:creator>
      <pubDate>Sun, 22 Mar 2026 20:37:05 +0000</pubDate>
      <link>https://dev.to/krzysztofdudek/code-has-logic-it-does-not-have-meaning-11kl</link>
      <guid>https://dev.to/krzysztofdudek/code-has-logic-it-does-not-have-meaning-11kl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F438air8zsnsubpl9zivx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F438air8zsnsubpl9zivx.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why repositories need semantic memory, not bigger context windows
&lt;/h2&gt;

&lt;p&gt;Three weeks into a new project. I know the language. I know the framework. I still cannot ship a feature without asking someone. Not because the code is hard. Because nobody wrote down why this service exists, what it is actually responsible for, and what breaks when you touch it.&lt;/p&gt;

&lt;p&gt;That knowledge is somewhere. A Slack thread from October. A meeting recording nobody will watch. The head of a developer who left in January. The code itself is perfectly legible. Every symbol is indexed. Every function has a name. And I still do not know why the order service calls the payment gateway twice, or why there is a retry loop that looks wrong but is not.&lt;/p&gt;

&lt;p&gt;Humans survive this. You ask around. You read between the lines. You build a map in your head over weeks and months. It is slow, expensive, and completely invisible to the organization. But it works.&lt;/p&gt;

&lt;p&gt;Now hand that same codebase to an AI agent.&lt;/p&gt;

&lt;p&gt;The agent does not ask around. It does not read between the lines. It processes the text in front of it. If the meaning is missing from that text, the agent does not recover it. It guesses. Or it confidently writes code that violates a constraint nobody documented.&lt;/p&gt;

&lt;p&gt;This is why agents look brilliant on small repositories and unreliable on large ones. The usual explanation is model quality. I do not think that is the problem. The problem is context shape.&lt;/p&gt;

&lt;p&gt;Give the agent one file and it misses the system around it. Give it the whole repository and it drowns. One file breaks neighboring contracts. The entire repo turns into noise. A bigger context window does not fix that. Fifty thousand tokens of noise is still noise. You do not need more input. You need the right input.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tricks that stop working
&lt;/h2&gt;

&lt;p&gt;Rules files. Long system prompts. Context dumps. They all help for a while. Then the project grows. The rules file becomes a junk drawer where a global naming convention sits next to a quirk of one specific service. The prompt becomes a wall nobody reads. The dump becomes so large that the model ignores half of it.&lt;/p&gt;

&lt;p&gt;There is no clean way to say "give me only the meaning of this part of the system." Not with any of these tools. That is not an intelligence problem. It is an information architecture problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What repositories remember and what they forget
&lt;/h2&gt;

&lt;p&gt;Most repositories already have one kind of memory. Git remembers changes. Who changed what, when, and how.&lt;/p&gt;

&lt;p&gt;What git does not remember is what the system is. Why a rule exists. What a module is responsible for. What constraints apply here. What else breaks when an interface changes. What business process this code participates in.&lt;/p&gt;

&lt;p&gt;That knowledge exists in real teams. It is scattered, implicit, or gone. And it matters more now because agents are part of the team. A human joining a project can ask a senior engineer why something is the way it is. An agent cannot DM your former teammate. It will either guess, fail, or keep asking you until you become the context window yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;That is the idea behind Yggdrasil (&lt;a href="https://github.com/krzysztofdudek/Yggdrasil" rel="noopener noreferrer"&gt;link to GitHub repo&lt;/a&gt;). Not a code generator. Not a graph database. Not another documentation ritual that humans will ignore in two months. Semantic memory for a repository.&lt;/p&gt;

&lt;p&gt;The implementation is deliberately boring. Plain Markdown for content. Plain YAML for structure. A &lt;code&gt;.yggdrasil/&lt;/code&gt; folder inside the repo. It stores a structured map of modules, responsibilities, interfaces, constraints, cross-cutting aspects, and end-to-end business flows.&lt;/p&gt;

&lt;p&gt;Here is what that looks like in practice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.yggdrasil/
├── model/
│   └── orders/
│       └── order-service/
│           ├── yg-node.yaml        # type, relations, code mapping
│           ├── responsibility.md   # what it does and what it does NOT do
│           └── interface.md        # public methods, failure modes, contracts
├── aspects/
│   └── requires-audit/
│       └── content.md              # cross-cutting rule: all data changes need audit
└── flows/
    └── checkout/
        └── description.md          # full business flow: happy path, failures, invariants
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A node's &lt;code&gt;responsibility.md&lt;/code&gt; says things like: "OrderService creates orders, manages state transitions, orchestrates payment and inventory. It is NOT responsible for computing prices, managing stock levels, or sending emails." That negative boundary is as important as the positive one. It tells the agent what to leave alone.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;yg-node.yaml&lt;/code&gt; declares relations: this service calls that service, consumes these methods, and if the call fails, here is what happens. It also maps to actual source files, so the agent knows where the code lives.&lt;/p&gt;
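
&lt;p&gt;A hypothetical &lt;code&gt;yg-node.yaml&lt;/code&gt; along those lines. The field names are my sketch of the shape, not the actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type: service
code:
  - src/orders/order_service.ts
relations:
  calls:
    - target: billing/payment-gateway
      consumes: [charge, refund]
      on-failure: order moves to payment-failed and retries via outbox
aspects:
  - requires-audit
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;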

&lt;p&gt;Aspects work like cross-cutting rules. You define "requires audit" once. Then you tag nodes with it. When the agent works on a tagged node, it gets the audit requirements automatically. If a node has an exception to the rule, that exception is declared right there in the node, not buried in a global file.&lt;/p&gt;

&lt;p&gt;Flows describe business processes end to end. The checkout flow lists every participant, every path (happy, payment failed, inventory unavailable), and the invariants that must hold across all paths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Small context, right context
&lt;/h2&gt;

&lt;p&gt;The whole point is to give the agent 5,000 useful tokens instead of 50,000 random ones. Before the agent touches code, it gets a bounded package: the unit's responsibility, its interface, the constraints that apply to it, the interfaces of its dependencies, and the business flow it participates in. Nothing more. Nothing less.&lt;/p&gt;
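
&lt;p&gt;Mechanically, assembling that package is simple. An in-memory sketch follows; the paths mirror the &lt;code&gt;.yggdrasil/&lt;/code&gt; layout above, but the helper name and contents are illustrative, not Yggdrasil's API:&lt;/p&gt;

```python
# Stand-in for a .yggdrasil tree: path -> content. A real implementation
# would read these from disk.
MEMORY = {
    "model/orders/order-service/responsibility.md": "Creates orders. NOT pricing.",
    "model/orders/order-service/interface.md": "create(order): OrderId",
    "model/billing/payment-gateway/interface.md": "charge(amount): Receipt",
    "aspects/requires-audit/content.md": "All data changes need an audit entry.",
    "flows/checkout/description.md": "customer, order-service, payment-gateway",
}
RELATIONS = {
    "orders/order-service": {
        "calls": ["billing/payment-gateway"],
        "aspects": ["requires-audit"],
        "flows": ["checkout"],
    }
}

def context_package(node):
    """The bounded package: own docs, dependency interfaces, aspects, flows."""
    rel = RELATIONS[node]
    pkg = [MEMORY["model/%s/responsibility.md" % node],
           MEMORY["model/%s/interface.md" % node]]
    pkg += [MEMORY["model/%s/interface.md" % dep] for dep in rel["calls"]]
    pkg += [MEMORY["aspects/%s/content.md" % a] for a in rel["aspects"]]
    pkg += [MEMORY["flows/%s/description.md" % f] for f in rel["flows"]]
    return pkg

print(len(context_package("orders/order-service")))  # 5 pieces, nothing else
```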

&lt;p&gt;This changes the question. Not "how do we show the model more of the repo." But "how do we make the repo legible." Those are not the same thing. A codebase can be fully visible and still semantically opaque. You can index every symbol and still not know what a service is actually responsible for or what breaks if you change it. Search answers "where is X." It does not answer "what is X for."&lt;/p&gt;

&lt;p&gt;Think of it as the difference between a compass and a map. A compass tells you a direction. A map tells you what exists, what connects to what, and what terrain you are standing on. The repository needs a map.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this does not rot like normal docs
&lt;/h2&gt;

&lt;p&gt;Normal documentation rots because people stop reading it and stop updating it. Agent-facing semantic memory has a harsher test. If it is wrong, the output gets worse immediately. Bad memory produces bad code. That is painful enough to force maintenance.&lt;/p&gt;

&lt;p&gt;In this model, code and graph are one unit of work. Change one without the other, and you create drift. And drift is not a corner case. It is normal life. People hotfix things. They experiment. They edit code directly. They forget to update the knowledge around it. So the system treats drift as first class. Detect it. Force a decision. Either the graph absorbs reality, or the code gets brought back in line.&lt;/p&gt;
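
&lt;p&gt;One way to make that detection mechanical is content hashing: the graph records a hash of each mapped source file, and CI fails when the working tree disagrees. This is an invented scheme for illustration, not Yggdrasil's actual mechanism:&lt;/p&gt;

```python
import hashlib

def sha(blob):
    return hashlib.sha256(blob).hexdigest()

# Hashes recorded in the graph when code and knowledge were last in sync.
RECORDED = {"src/order_service.py": sha(b"original implementation")}
# What the working tree actually contains after a hotfix.
WORKING_TREE = {"src/order_service.py": b"hotfixed implementation"}

def drifted(recorded, tree):
    """Files whose code changed without a matching graph update."""
    return sorted(path for path, blob in tree.items()
                  if sha(blob) != recorded.get(path))

print(drifted(RECORDED, WORKING_TREE))  # prints: ['src/order_service.py']
```

A failed check does not decide anything by itself; it forces the decision the paragraph above describes.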

&lt;p&gt;Adoption does not need to be all or nothing either. A project with 500 files should not model the entire world before getting value. Start where the pain is. One module where the agent keeps making the same mistake. One area where people keep re-explaining the same decision. Coverage grows where the work is happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest part
&lt;/h2&gt;

&lt;p&gt;This does not replace source code. And it does not help equally in every situation.&lt;/p&gt;

&lt;p&gt;The real advantage is cross-module reasoning. It helps when the question is "why was this designed this way," or "what else is affected by this change," or "how does this business flow work end to end." Declared relations, shared constraints, and flows create actual leverage over raw file reading.&lt;/p&gt;

&lt;p&gt;But if I need to know the exact failure behavior of a call, the exact await pattern, the exact transaction boundary, or whether a feature exists in the implementation right now, I trust the code first. The graph is strongest at why, should we, and what else. The code is strongest at what exactly happens here. Those are complementary. Treating one as a replacement for the other is a mistake.&lt;/p&gt;

&lt;p&gt;Another uncomfortable finding. Agents are much better at spotting contradictions than omissions. If the graph says something false and the code says something else, they often catch it. If an important rule is simply missing from the graph, they are much worse at noticing. That means incompleteness is more dangerous than inconsistency. The agent can confidently reason from a map that has a hole in it. The hard part is not keeping semantic memory correct. It is keeping it complete enough where it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;A huge amount of value in a software system has never lived in code. It lives in responsibility boundaries, business rules, rejected alternatives, architecture constraints, and the reason a strange decision was made six months ago. Humans carried that in their heads because they had to. Now we work with tools that cannot survive on tacit knowledge.&lt;/p&gt;

&lt;p&gt;Git remembers changes. The repository should also remember meaning. If it does, the next engineer does not have to rebuild the whole map from scratch. And the next agent does not have to guess why the road bends there in the first place.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
