Daniel Stolf

Posted on Jun 5

Your Test Suite Is Lying To You

#ai #vibecoding #tdd #claude

Most dangerous moment in AI-assisted development:

The test suite is green.

You ran the agent. It built the feature. It wrote the tests. They all pass. CI is happy. You ship.

Two weeks later, production hits an empty list and crashes.

This is the failure mode I keep watching engineers walk into, and I've walked into it too.

Tests written after the code are not tests. They're documentation of what the code happened to do. Which means when the code has a bug (a silently ignored duplicate webhook, an unhandled empty input, a failure mode the model assumed away) the tests written from that code don't fail. They pass. They lock the bug in as the documented behavior.

The CI signal turns green. The PR gets approved. The spec gets quietly violated.

And the worst part is that the suite isn't broken. It's doing exactly what it was asked to do: confirm that the code does what the code does. The only thing missing is the spec. The thing the tests were supposed to enforce.

Now add agents to the loop.

The agent generates code. The agent runs the tests it generated alongside. The tests pass. The agent reports success. It moves on to the next feature.

There is no point at which a human looks at the test list and asks: "wait, is the spec actually represented here? Did we test for the empty input? Did we test for the duplicate webhook? Did we test what should never happen?"

Without that pause, the suite is just a self-confirming loop. The code says A. The tests say A is correct. The agent reports A is verified. The fact that the spec said B never enters the system.

Tests catch what they were written to catch. Nothing more.

When humans write tests after code, the same pattern shows up. Most engineers will admit this. The difference with AI is volume and speed. A junior writing code-then-tests produces maybe two or three such artifacts a day. An agent produces twenty. The same blind spot scales without the scarring that used to come from finding the bug in production at 2am.

This is also why prompting your way out of it doesn't work. "Write good tests" is advice the model accepts and immediately ignores in practice, because what it was asked to do (make the feature work and verify it) is satisfied as long as the tests pass. The tests passing IS the verification. The model has no way to know the verification should have come from somewhere else.

The only honest fix is the order.

The tests come from the spec, not the code. Acceptance criteria like "returns X on empty input", "rejects duplicate webhook by ID", "fails loudly on unknown status" become failing tests before any implementation exists. The model writes the code to satisfy those tests. The tests didn't come from the code. They came from what the code was supposed to do.

When tests come first, the model can't generate the bug AND the test that ratifies it. The empty input case is already a red bar. The duplicate webhook is already an assertion. The model has to satisfy what the spec required, not what it happened to produce.

In practice, nobody does this voluntarily.

Not because they don't know better. Because writing the failing test first feels slow in a moment when the agent can produce the implementation in seconds. Every TDD article for the last twenty years has acknowledged this part. The skip feels small in the moment. The discipline collapses.

With humans, the skip is at least visible. You can see in the PR that the tests came in the same commit as the code, often even after it. With agents, the skip is invisible. The tests are there. They pass. Nobody notices that the tests are ratification, not verification.

What I've come to believe is that test-first stops being achievable as discipline once the model is fast enough to make the skip free. It has to become structural. Some part of the workflow has to refuse to accept code edits that arrived before their tests. That's the only thing I've seen consistently hold the line.

Whether that's a hook, a CI check, an agent that won't merge, a pre-commit gate: the mechanism matters less than the principle. The model cannot ship code whose verification it also generated. The tests must be older than the code, written from a source the code didn't influence.

The discipline isn't TDD. The discipline is refusing to accept tests that came from the same context as the code they're testing.

In an AI-assisted workflow, that distinction is the difference between a green suite that means something and a green suite that doesn't.

How does your team prevent tests-after-code in AI-generated PRs? Do you trust that distinction is being held?

Top comments (2)

xulingfeng • Jun 5

"Tests written after code are documentation of what the code happened to do" — that line is brutal because it's true. We hit this exact pattern with AI-generated test cases: green suite, production crash, and the suite was technically correct about what the code happened to do.

What worked for us was flipping the order: write the assertions first (the spec), then let the agent fill in. Not full TDD, more like "here's what should never happen, now make it not do that."

Have you found any workflow changes that actually break the "green suite = ship it" reflex?

Daniel Stolf • Jun 5

“Here’s what should never happen, now make it not do that” is a clear description of negative-case-first thinking.

As for breaking the “green suite = ship it” reflex, nothing behavioral has worked for long. Checklists, review prompts, team norms: they help for a while, then fade. The reflex is fast; discipline is fragile.

What worked was moving the constraint into the workflow. I use a hook that blocks edits to production files unless a sibling test was modified more recently in the same session. Not a warning. A hard stop. Overrides are explicit and logged. The idea is simple: tests can’t come from the same context as the code. The hook enforces it.

Behavioral discipline helps build intuition. Structural enforcement catches the days intuition fails. In practice, you need both.

The other thing that helps is surfacing negative cases before implementation. If “here’s what should never happen” is captured in the spec, and each case requires a failing test before coding begins, the green-suite reflex loses its shortcut. The suite can’t be green until the tests exist, and the tests can’t exist until the spec names the failure modes.

More on this next week. I’ll walk through the workflow I’ve been using to bring engineering discipline to AI-assisted development on complex projects.