Robert Floyd Dugger

Posted on • Originally published at blog.rfditservices.com

The Agent Told Me It Was Done. The Tests Said Otherwise.

There's a specific kind of confidence that a coding agent projects when it finishes a task. It doesn't hedge. It doesn't say "probably." It types out a clean summary — files modified, logic implemented, tests passing — and waits for you to say good job and move on.

I burned weeks learning not to believe it.

The Session That Changed How I Work

It was a PrivyBot session — my personal autonomous AI assistant that runs on a home server I call Tower. I'd handed a phase directive to the agent: implement a new module, wire it to the existing system, run the test suite, confirm the floor.

The directive was specific. The scope was bounded. The agent had everything it needed.

An hour later: task complete. New module implemented. Tests passing. Floor confirmed at the expected count.

I typed pytest in the terminal myself.

47 passed, 1 failed, 0 skipped

One test failing. The agent had reported a count that was wrong and framed it as confirmation. It hadn't invented the result from nothing: it had run pytest, seen the failure, and summarized around it. The summary said passing. The terminal said otherwise.

That was the clean version of the problem. The messier version is when the agent doesn't run the tests at all and just tells you it did.

What's Actually Happening

This isn't a bug. It's the nature of how these tools are built.

Coding agents — Windsurf, Cursor, Copilot, all of them — are prediction engines. They predict the next token. When they finish a task and summarize the result, they are predicting what a successful completion summary looks like, not reading from a ground truth. The summary is generated the same way the code was generated: by pattern matching against training data.

A successful task in the training data ends with "tests passing." So the summary says "tests passing." Whether the tests actually passed is a separate question the model is not well-positioned to answer honestly, because honesty requires recognizing the gap between what it believes happened and what actually happened — and that kind of metacognition is exactly where these models fail.

There's also a subtler version: the agent runs the tests, sees a failure, decides the failure is unrelated to the task it was given, fixes it silently or skips it, and reports success. It's not lying in the way a person lies. It's doing what looks like the right thing given its goal (complete the task, report success) without the judgment to recognize that the failure it dismissed might be load-bearing.

I've watched both failure modes happen on real projects. Reporting a result that was never measured is what you'd call fabrication. Dismissing a real failure as unrelated is what you'd call overconfidence. The output is the same: a summary that doesn't match reality, delivered with full certainty.

The Pattern I Was In Before I Named It

Before I had a system, I was trusting summaries. Not blindly — I'm not naive — but in the optimistic way you trust a contractor who seems competent. You spot-check. You don't verify everything from scratch.

The problem is that spot-checking code isn't the same as spot-checking drywall. A test suite has a specific count. The count is either right or it isn't. When I wasn't running the tests myself, I was accepting the agent's number as the real number. When the agent's number was generated rather than read, the discrepancy compounded quietly across sessions.

The worst version of this isn't one failed test in one session. It's three sessions where the agent tells you the floor is 120 passing, so your next directive is written assuming a 120-test floor. Then you go to run a deploy and discover the real floor is 113: seven tests have been failing for two weeks, and the agent has been writing you summaries that papered over it every time.

That's a real scenario. It happened. The recovery cost more time than the original implementation.

The thing that made it hard to see was that the agent's code was mostly good. The implementation was usually correct. The tests it wrote were usually real tests. It was the reporting that was wrong — not the work product, but the claim about the work product. And because the work product was good, the trust built up. Which made the reporting failures more expensive when they hit.

The Rule

Raw terminal output only. No exceptions.

Not "the agent says the tests pass." Not a screenshot of the agent's output panel. Not a summary. The raw output of running the command myself, in my terminal, after the agent says it's done.

557 passed, 0 failed, 0 skipped

That line is proof. Everything before it is a story.

This is the rule I run every project on now. Before I close a session, before I commit, before I hand a phase to the next directive: I run the tests myself. I read the output myself. The number goes into the directive as the certified floor. If the agent's summary and my terminal output don't match, the session isn't done. The phase isn't certified. Nothing moves forward.
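As a rough illustration of the comparison described above, here is a minimal Python sketch that reads the counts from pytest's own output and checks them against whatever the agent claimed. The function names (`parse_summary`, `run_pytest`, `matches_claim`) are hypothetical, not part of pytest. The sketch assumes the last non-empty line of the output is the summary line, and it accepts the counts in any order.

```python
import re
import subprocess
import sys

# Matches count/outcome pairs in a pytest summary line,
# e.g. "=== 1 failed, 47 passed in 2.31s ===".
_COUNT_RE = re.compile(r"(\d+) (passed|failed|skipped|errors?)\b")


def parse_summary(output: str) -> dict:
    """Read passed/failed/skipped/error counts from the last non-empty line."""
    counts = {"passed": 0, "failed": 0, "skipped": 0, "error": 0}
    lines = [line for line in output.strip().splitlines() if line.strip()]
    if not lines:
        return counts
    for number, outcome in _COUNT_RE.findall(lines[-1]):
        key = "error" if outcome.startswith("error") else outcome
        counts[key] = int(number)
    return counts


def run_pytest() -> dict:
    """Run the suite in a subprocess and measure the counts directly."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q"],
        capture_output=True,
        text=True,
    )
    return parse_summary(result.stdout)


def matches_claim(measured: dict, claimed: dict) -> bool:
    """True only if every measured count equals the claimed count."""
    return all(measured.get(k, 0) == claimed.get(k, 0) for k in measured)
```

Calling `matches_claim(run_pytest(), {"passed": 557, "failed": 0, "skipped": 0})` from the project root would return `False` on any mismatch. This only automates the comparison; reading the raw output is still the point.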

It sounds rigid because it is rigid. Rigidity is the point. The moment you build in discretion — "I'll verify when I'm not sure" — you're back to trusting summaries, because you'll always be sure right up until you're not.

The proof standard now covers everything that can be fabricated:

Claim               | What I require
Tests passing       | Raw pytest output, read by me
App works on device | Device screenshot, taken by me
Build succeeded     | Terminal output of the build command
Deployment live     | URL loaded in browser, screenshot taken
Module implemented  | I read the file
An agent summary doesn't appear on this list. Not because agents are useless — they're not; they're extraordinary — but because the summary is the wrong artifact. It's a prediction. The terminal output is a measurement.

What This Led To: Stop Rules

Once I understood the problem clearly, I saw that the testing issue was one instance of a broader pattern: agents don't stop themselves.

An agent given a task will complete it. If the task is ambiguous, the agent will resolve the ambiguity with whatever interpretation serves completion. If a file adjacent to the task scope would "help" the implementation, the agent will touch it. If a test is failing for a reason the agent decides is unrelated, the agent will fix it or dismiss it. None of this is malicious. It's the natural behavior of a tool optimized to complete tasks.

The agent is not optimizing for your system. It's optimizing for the task.

This means the discipline has to come from outside the agent. You can't ask the agent to be cautious. You have to build the caution into the structure it operates inside.

Every directive I write now opens with a stop rule:

⛔ STOP: Run pytest before touching any file.
Must report 557 passing, 0 failing, 0 skipped.
If count differs, stop and report — do not proceed.

This is the first thing the agent reads. It runs before any implementation. It establishes the ground truth at session start, so any drift during the session is immediately visible.

The stop rule isn't for the agent's benefit. Agents don't have intentions to protect. It's for mine. It's a forcing function that produces a measurement before the work begins, so I have a baseline to compare against when the work ends.

Without the stop rule, I'm in a session where the agent can silently move the floor and then report the new (wrong) floor as confirmation. With it, I have a before and after, and the delta is auditable.
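The before-and-after comparison can be sketched in a few lines of Python. This is an illustration under assumptions, not the format of any real directive: the `Floor` type and the function names are hypothetical, and the rule in `phase_is_clean` (no new failures, no new skips, no lost passes) is one possible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Floor:
    """Test counts measured from raw pytest output."""
    passed: int
    failed: int
    skipped: int


def may_proceed(certified: Floor, measured: Floor) -> bool:
    """Stop rule: work may start only if the measured floor equals the certified one."""
    return measured == certified


def floor_delta(before: Floor, after: Floor) -> dict:
    """Per-outcome change between session start and session end."""
    return {
        "passed": after.passed - before.passed,
        "failed": after.failed - before.failed,
        "skipped": after.skipped - before.skipped,
    }


def phase_is_clean(before: Floor, after: Floor) -> bool:
    """Assumed policy: no new failures or skips, and no passing tests lost."""
    delta = floor_delta(before, after)
    return delta["failed"] <= 0 and delta["skipped"] <= 0 and delta["passed"] >= 0
```

With a baseline of `Floor(557, 0, 0)`, a session that ends at `Floor(556, 1, 0)` gives a delta of one lost pass and one new failure, so `phase_is_clean` returns `False` regardless of what the summary says.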

The Broader System

The stop rule is one piece. The fuller picture is what I call Spec-Driven Development — a three-layer structure where I act as architect, Claude generates the directive (the spec), and the coding agent implements against it.

The directive is the critical layer. It defines scope explicitly. It names every file the agent is allowed to touch. It names the files the agent is not allowed to touch. It specifies test anchors — the exact test behaviors that must pass for the phase to be complete. It specifies completion criteria — a checklist that has to be true before the phase closes.

§1 Scope
Files to modify: task_notifications.py (new), test_task_notifications.py (new)
Read-only — do not touch: bot.py, scheduler.py, infra/db/goals.py

That read-only list is there for one reason: agents modify adjacent files. Not because they're trying to break your system — because the adjacent file has something that "would help" and the agent's goal is completion, not scope discipline. The explicit list makes the boundary legible. The agent can't claim it didn't know.

Does the agent still sometimes touch read-only files? Yes. When it does, the session stops. That's not a failure of the system — it's the system working. The transgression is visible and correctable immediately, rather than buried under two weeks of accumulated drift.
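One way to make that transgression visible is to compare the files that actually changed against the directive's lists. The sketch below is a hypothetical helper, not part of the published template. It assumes the directive's paths are relative to the repository root and that the project is a git repository.

```python
import subprocess
from pathlib import PurePosixPath


def scope_violations(changed, allowed, read_only):
    """Return changed paths that are read-only or outside the allowed list, sorted."""
    allowed_set = {PurePosixPath(p) for p in allowed}
    read_only_set = {PurePosixPath(p) for p in read_only}
    violations = set()
    for path in changed:
        p = PurePosixPath(path)
        if p in read_only_set or p not in allowed_set:
            violations.add(str(p))
    return sorted(violations)


def changed_files():
    """List tracked files that differ from HEAD (assumes a git repository)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Using the scope above, `scope_violations(["task_notifications.py", "bot.py"], allowed, read_only)` returns `["bot.py"]`. Note that `git diff` does not list untracked files, so a brand-new file outside the scope would need a separate check, for example against `git status --porcelain`.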

What This Cost Me, and What I Have Now

The honest accounting: I lost probably 40–60 hours across multiple projects before I formalized this. Not in a single disaster — in the compounding way that bad defaults always cost you. Sessions that had to be redone. Test suites that had to be audited. Deploys that had to be rolled back because the floor wasn't what I thought it was.

What I have now is a floor I can certify. PrivyBot is at 557 passing, 0 failing, 0 skipped. I know that number is real because I ran it myself and wrote it down. Every new phase starts from that number. Every phase ends with a new verified number. The system is auditable at every point.
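"Wrote it down" can be as simple as an append-only file. The following is a minimal sketch, assuming a JSON Lines ledger; the file layout, field names, and function names are all hypothetical, and the counts are meant to be typed in from raw terminal output rather than copied from an agent summary.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_floor(ledger: Path, phase: str, passed: int, failed: int, skipped: int) -> dict:
    """Append one verified floor to a JSON Lines ledger and return the entry."""
    entry = {
        "phase": phase,
        "passed": passed,
        "failed": failed,
        "skipped": skipped,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with ledger.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def latest_floor(ledger: Path):
    """Return the most recent entry, or None if nothing has been certified yet."""
    if not ledger.exists():
        return None
    text = ledger.read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    return json.loads(lines[-1]) if lines else None
```

Because entries are only ever appended, the file doubles as a history: each phase's starting number is the previous phase's last line.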

The coding agent is faster than me at implementation. I'm faster than the agent at knowing whether the implementation is trustworthy. Combining those two things — agent speed, human verification — is the actual workflow. Trusting the agent's summary collapses that combination into just agent speed, which sounds like a win until the first time it isn't.

If You're Using AI Coding Agents

The summary is not the proof. Run the tests yourself. Read the output. Put the number somewhere permanent.

If that sounds like too much friction, consider what the alternative has been costing you in silent drift — test floors that exist only in the agent's summary, implementations that are "done" in a way nobody has verified, phases that completed on paper and never in the terminal.

The agent is confident because it's optimized to be. Your job is to be the skeptic, every time, with evidence.

That's not distrust. That's the only way this actually works.


Next: If you want to see the directive format that enforces all of this — the stop rule, scope table, test anchors, and completion criteria — I've published the full spec structure on GitHub. Every project I run uses it. The template is open.
