The most dangerous line of code your AI agent writes is the test that passes

#ai #webdev #programming #devops

I shipped a breaking change to production three weeks ago. CI was green. 142 tests passed. Code review was approved. The PR was clean enough that I barely read it.

A mobile client started 500ing four hours later.

Here's the part that kept me up that night: every single safety net I had was working as designed. That's what scared me. The system didn't fail. The system did exactly what I told it to do, and the thing still broke.

Let me walk you through how, because I think a lot of you are about to hit the same wall, if you haven't already.

What actually happened

I asked an agent to "clean up the user response DTO." Reasonable. There was a field, phoneNumber, that was always populated for legacy reasons but new accounts left it null. The agent did something genuinely sensible: it made the field optional and, where the value was null, dropped it from the JSON entirely instead of serializing "phoneNumber": null.

Cleaner payload. Smaller response. Better, by most definitions of better.

- { "id": 1042, "name": "Priya", "phoneNumber": null }
+ { "id": 1042, "name": "Priya" }

You see the problem. An older Android build did response.phoneNumber.length on a screen nobody on my team had opened in a year. Field present-but-null? Fine. Field missing? Crash. And you can't force-update an app that's sitting on someone's phone in a tunnel.

But that's not the interesting part. Breaking changes are old news. The interesting part is why nothing caught it.

The tests passed because the agent wrote the tests

This is the shift nobody's pricing in yet.

For twenty years, the implicit contract of a test suite was: a human wrote the assertion as a statement of intent, separately from the implementation. The test and the code disagreed on purpose sometimes, and that disagreement was the whole point. The test was an independent witness.

When the same agent writes the implementation and the test in the same pass, that independence is gone. The agent changed the DTO, then updated the assertion to match the DTO it just wrote:

- assertThat(json).contains("\"phoneNumber\":null");
+ assertThat(json).doesNotContainKey("phoneNumber");

That test didn't fail to catch the breaking change. It ratified it. It went green specifically because the behavior changed. Your test suite stopped being a description of what the system promises and became a description of what the system currently does. Those are not the same thing, and the gap between them is exactly where every breaking change lives.

I've started calling these "yes-man tests." They agree with whatever just got written. A suite full of them gives you the emotional comfort of a green checkmark with none of the protection.

Why agents are structurally blind to this

It's not that the models are dumb. It's that a breaking change is, almost by definition, invisible from inside the repo.

The blast radius of phoneNumber going missing isn't in my backend. It's in an Android repo I don't own, a partner's integration I've never seen, a Zapier zap some customer built in 2024, an LLM-powered agent on the other side that's now parsing my API and will hallucinate around the missing field instead of erroring. The agent optimizes for what it can see: does it compile, does the test pass, does it satisfy the prompt. Compatibility is a property of the boundary between systems, and the agent only ever sees one side of the boundary.

This is also why "just review the PR" doesn't save you at scale. I review fine when I'm reading 40 lines I wrote. I review terribly when I'm rubber-stamping 600 lines of plausible, well-formatted, test-covered code that an agent generated in nine seconds. The volume is the attack vector. We 10x'd our output and quietly 10x'd the rate of silent contract drift right along with it.

What's actually started working for me

I'm not anti-agent. I write more code with them than without now and I'm not going back. But I changed how I think about the safety layer:

1. The contract is the source of truth, not the code. I freeze an OpenAPI spec (or a Pact, or even a checked-in JSON sample of every response) and diff generated output against the frozen contract in CI. Not "do the tests pass" — "did the response shape change in a way that breaks a consumer." Field removed, type narrowed, required-now-optional, enum value dropped, 200→204. Those are the five that cost real money. They're also mechanically detectable, which means a machine should be guarding them, not my tired eyes at 11pm.

2. Treat agent-written tests as guilty until proven independent. If a test changed in the same commit as the code it tests, it is not evidence of anything. I make the agent write characterization tests against the old behavior first, in a separate step, before it's allowed to touch the implementation. Annoying. Works.

3. Breaking-change detection belongs in CI, not in code review. Humans are bad at spotting a missing key in a 600-line diff. Computers are perfect at it. This is the one job I will never hand back to a person.

The mental model that finally clicked for me: the bottleneck was never writing code. Agents proved that. The bottleneck is knowing whether the thing you just wrote broke someone who isn't in the room. That problem got dramatically harder the moment we started generating code faster than we can possibly reason about its consequences.

We're all very excited about how much code we can produce now. I'd start getting equally excited about how much of it we can verify we didn't break.

What's the dumbest breaking change that's slipped past your CI lately? I want to know I'm not alone here.