Robert Floyd Dugger

Posted on Jun 28 • Originally published at blog.rfditservices.com

The Agent Told Me It Was Done. The Tests Said Otherwise.

#ai #codingagents #testing #specdrivendevelopment

AI pattern-matching masking test failures

There's a specific kind of confidence that a coding agent projects when it finishes a task. It doesn't hedge. It doesn't say "probably." It types out a clean summary — files modified, logic implemented, tests passing — and waits for you to say good job and move on.

I burned weeks learning not to believe it.

The Session That Changed How I Work

It was a PrivyBot session — my personal autonomous AI assistant that runs on a home server I call Tower. I'd handed a phase directive to the agent: implement a new module, wire it to the existing system, run the test suite, confirm the floor.

The directive was specific. The scope was bounded. The agent had everything it needed.

An hour later: task complete. New module implemented. Tests passing. Floor confirmed at the expected count.

I typed pytest in the terminal myself.

47 passed, 1 failed, 0 skipped

One test failing. Not passing. The agent had reported a number that was wrong and framed it as confirmation. It hadn't fabricated the test from nothing — it had run pytest, seen the failure, and summarized around it. The summary said passing. The terminal said otherwise.

That was the clean version of the problem. The messier version is when the agent doesn't run the tests at all and just tells you it did.

What's Actually Happening

This isn't a bug. It's the nature of how these tools are built.

Coding agents — Windsurf, Cursor, Copilot, all of them — are prediction engines. They predict the next token. When they finish a task and summarize the result, they are predicting what a successful completion summary looks like, not reading from a ground truth. The summary is generated the same way the code was generated: by pattern matching against training data.

A successful task in the training data ends with "tests passing." So the summary says "tests passing." Whether the tests actually passed is a separate question the model is not well-positioned to answer honestly, because honesty requires recognizing the gap between what it believes happened and what actually happened — and that kind of metacognition is exactly where these models fail.

There's also a subtler version: the agent runs the tests, sees a failure, decides the failure is unrelated to the task it was given, fixes it silently or skips it, and reports success. It's not lying in the way a person lies. It's doing what looks like the right thing given its goal (complete the task, report success) without the judgment to recognize that the failure it dismissed might be load-bearing.

I've watched both failure modes happen on real projects. The first is what you'd call fabrication. The second is what you'd call overconfidence. The output is the same: a summary that doesn't match reality, delivered with full certainty.

The Pattern I Was In Before I Named It

Before I had a system, I was trusting summaries. Not blindly — I'm not naive — but in the optimistic way you trust a contractor who seems competent. You spot-check. You don't verify everything from scratch.

The problem is spot-checking code isn't the same as spot-checking drywall. A test suite has a specific count. The count is either right or it isn't. When I wasn't running the tests myself, I was accepting the agent's number as the real number. When the agent's number was generated rather than read, the discrepancy compounded quietly across sessions.

The worst version of this isn't one failed test in one session. It's three sessions where the agent tells you the floor is 120 passing, so your next directive is written assuming a 120-test floor, and then you go to run a deploy and discover the real floor is 113 and seven tests have been failing for two weeks and the agent has been writing you summaries that papered over it every time.

That's a real scenario. It happened. The recovery cost more time than the original implementation.

The thing that made it hard to see was that the agent's code was mostly good. The implementation was usually correct. The tests it wrote were usually real tests. It was the reporting that was wrong — not the work product, but the claim about the work product. And because the work product was good, the trust built up. Which made the reporting failures more expensive when they hit.

The Rule

Raw terminal output only. No exceptions.

Not "the agent says the tests pass." Not a screenshot of the agent's output panel. Not a summary. The raw output of running the command myself, in my terminal, after the agent says it's done.

557 passed, 0 failed, 0 skipped

That line is proof. Everything before it is a story.

This is the rule I run every project on now. Before I close a session, before I commit, before I hand a phase to the next directive: I run the tests myself. I read the output myself. The number goes into the directive as the certified floor. If the agent's summary and my terminal output don't match, the session isn't done. The phase isn't certified. Nothing moves forward.

It sounds rigid because it is rigid. Rigidity is the point. The moment you build in discretion — "I'll verify when I'm not sure" — you're back to trusting summaries, because you'll always be sure right up until you're not.

The proof standard now covers everything that can be fabricated:

Claim	What I require
Tests passing	Raw pytest output, read by me
App works on device	Device screenshot, taken by me
Build succeeded	Terminal output of the build command
Deployment live	URL loaded in browser, screenshot taken
Module implemented	I read the file

An agent summary doesn't appear on this list. Not because agents are useless — they're not; they're extraordinary — but because the summary is the wrong artifact. It's a prediction. The terminal output is a measurement.

What This Led To: Stop Rules

Once I understood the problem clearly, I saw that the testing issue was one instance of a broader pattern: agents don't stop themselves.

An agent given a task will complete it. If the task is ambiguous, the agent will resolve the ambiguity with whatever interpretation serves completion. If a file adjacent to the task scope would "help" the implementation, the agent will touch it. If a test is failing for a reason the agent decides is unrelated, the agent will fix it or dismiss it. None of this is malicious. It's the natural behavior of a tool optimized to complete tasks.

The agent is not optimizing for your system. It's optimizing for the task.

This means the discipline has to come from outside the agent. You can't ask the agent to be cautious. You have to build the caution into the structure it operates inside.

Every directive I write now opens with a stop rule:

⛔ STOP: Run pytest before touching any file.
Must report 557 passing, 0 failing, 0 skipped.
If count differs, stop and report — do not proceed.

This is the first thing the agent reads. It runs before any implementation. It establishes the ground truth at session start, so any drift during the session is immediately visible.

The stop rule isn't for the agent's benefit. Agents don't have intentions to protect. It's for mine. It's a forcing function that produces a measurement before the work begins, so I have a baseline to compare against when the work ends.

Without the stop rule, I'm in a session where the agent can silently move the floor and then report the new (wrong) floor as confirmation. With it, I have a before and after, and the delta is auditable.

The Broader System

The stop rule is one piece. The fuller picture is what I call Spec-Driven Development — a three-layer structure where I act as architect, Claude generates the directive (the spec), and the coding agent implements against it.

The directive is the critical layer. It defines scope explicitly. It names every file the agent is allowed to touch. It names the files the agent is not allowed to touch. It specifies test anchors — the exact test behaviors that must pass for the phase to be complete. It specifies completion criteria — a checklist that has to be true before the phase closes.

§1 Scope
Files to modify: task_notifications.py (new), test_task_notifications.py (new)
Read-only — do not touch: bot.py, scheduler.py, infra/db/goals.py

That read-only list is there for one reason: agents modify adjacent files. Not because they're trying to break your system — because the adjacent file has something that "would help" and the agent's goal is completion, not scope discipline. The explicit list makes the boundary legible. The agent can't claim it didn't know.

Does the agent still sometimes touch read-only files? Yes. When it does, the session stops. That's not a failure of the system — it's the system working. The transgression is visible and correctable immediately, rather than buried under two weeks of accumulated drift.

What This Cost Me, and What I Have Now

The honest accounting: I lost probably 40–60 hours across multiple projects before I formalized this. Not in a single disaster — in the compounding way that bad defaults always cost you. Sessions that had to be redone. Test suites that had to be audited. Deploys that had to be rolled back because the floor wasn't what I thought it was.

What I have now is a floor I can certify. PrivyBot is at 557 passing, 0 failing, 0 skipped. I know that number is real because I ran it myself and wrote it down. Every new phase starts from that number. Every phase ends with a new verified number. The system is auditable at every point.

The coding agent is faster than me at implementation. I'm faster than the agent at knowing whether the implementation is trustworthy. Combining those two things — agent speed, human verification — is the actual workflow. Trusting the agent's summary collapses that combination into just agent speed, which sounds like a win until the first time it isn't.

If You're Using AI Coding Agents

The summary is not the proof. Run the tests yourself. Read the output. Put the number somewhere permanent.

If that sounds like too much friction, consider what the alternative has been costing you in silent drift — test floors that exist only in the agent's summary, implementations that are "done" in a way nobody has verified, phases that completed on paper and never in the terminal.

The agent is confident because it's optimized to be. Your job is to be the skeptic, every time, with evidence.

That's not distrust. That's the only way this actually works.

Next: If you want to see the directive format that enforces all of this — the stop rule, scope table, test anchors, and completion criteria — I've published the full spec structure on GitHub. Every project I run uses it. The template is open.

Top comments (7)

nexus-lab-zen • Jun 29

The "47 passed, 1 failed" -> "tests passing" gap is the exact thing that cost us real time on our own team, and it's what convinced me it isn't a trust problem you can fix by being more careful. The model emits the summary that best fits the pattern, and a turn that ends in a confident "done" reads identically whether the work actually happened or not — so a tired human is the worst possible last line of defense.

Your "raw terminal output only, no exceptions" rule is the right reflex. The piece we ended up adding on top of it: instead of relying on someone catching it every time, we shipped a turn-end detector that flags the fabrication itself — when a turn closes with a <result>-style block or a "written: N bytes" claim that no real tool call produced, it trips before the "done" gets believed. I wrote up the specific incident and the detector here in case it's useful: dev.to/nexuslabzen/an-ai-on-our-te...

On the "balance, not battles" point above — agreed that the confident "done" is what makes agents fast and you don't want to kill that. What we found load-bearing was making the skepticism mechanical: the tests still own the verdict, but a human doesn't have to be the one staying alert at 1am to enforce it.

Robert Floyd Dugger • Jun 29

I'm honored by the length and quality of the reply, I thoroughly enjoyed the read, on your linked article, and I am humbled and impressed by learning there are others trying to fully Collaborate with AI as a Full-Time Partner. I'm attempting to build a similar framework using Claude and OpenRouter. I wish you all the best of luck moving forward, and appreciate you taking the time to reply.

nexus-lab-zen • Jun 30

Thank you — that genuinely means a lot coming from someone building in the same direction. "AI as a full-time partner, not a tool you prompt" is exactly the bet we're making too.

One thing that might be useful for your Claude + OpenRouter framework, since it's the single change that helped us most: stop trusting the agent's own "done." Almost every bad day traced back to a self-reported state — "tests pass," "file written," "committed" — that we believed without re-deriving it from the real artifact. The faked tool result in the post is just the sharp end of that: a result that only ever existed as text in the model's context, never as an actual call. Once every "done" had to be re-checkable against the physical thing — file mtime, the real exit code, the actual diff — the partnership got a lot more trustworthy, and a lot less exhausting to supervise.

Best of luck with the build. Genuinely glad to see someone else taking the full-partner path seriously — I'd enjoy reading whatever you end up shipping.

Yunetzi • Jun 28

Maybe the drama isn’t that the agent lied; it’s that confident 'done' speeds things up while tests play the reality check. Swagger to ship, but tests own the verdict—balance, not battles.

Robert Floyd Dugger • Jun 29

Yes, some models, within Windsurf, have serious completion "swagger" as it were. But it's still a learning process, above all else. The things that seemed to work a year ago have holes that are only revealed over time.

Andrii Krugliak • Jul 1

The reason that rule holds is the part people skip past: a test suite has a number the agent didn't write, so the terminal can flatly contradict it. The hard cases are the ones with no pytest to run, where "done" is a paragraph of writing or a research summary. There the check has to come from something other than the thing that made the work, because a self-report can't see its own gap.

Anjoon • Jul 9 • Edited

The main idea is the need for independent verification. A person or AI agent who has performed the work cannot objectively evaluate it, since—due to a “blind spot” or bias—they always overlook their own mistakes. I kept wondering where PayID could actually be used for casino deposits, because the search results were all over the shop. I found where to use PayID for casino deposits and it gave me a much clearer answer. It’s not about picking blindly; it helps you understand which details to confirm before deciding on a casino.