DEV Community


Posted on • Originally published at quentin209.substack.com

Are Our Development Methodologies Obsolete in the Age of AI Agents?

TDD is broken. Not because the philosophy is wrong — but because the agent executing it has changed.

Quentin
Feb 24, 2026

We’ve spent decades building methodologies designed for a very specific kind of agent: a human engineer who reads a spec, internalizes business rules, and translates intent into code. TDD, DDD, clean architecture — all of these assume an agent that reasons in invariants and intentions.

That agent is no longer the only one writing code.

The Illusion of Compliance

Anyone who has used Copilot, Cursor, or any LLM-assisted coding tool seriously has seen it: the code looks right. It compiles. It even passes the tests. And then, three weeks later, something breaks in production in a way that makes no sense — until you realize the AI quietly violated a business rule that was never explicitly stated anywhere in the codebase.

This is what I call AI slop — not hallucinated code, not broken code, but plausible code that is semantically wrong. Syntactically valid. Business-false.

The frustrating part? This happens despite every guardrail we put in place. Linting rules, code reviews, unit tests — the AI navigates around them not by breaking them, but by satisfying their letter while missing their spirit.

Why?

Consider something as simple as a discount applied to a price:

# utils/pricing.py
def calculate_margin(price, cost):
    return (price - cost) / price

# discounts.py
def apply_discount(price, user_tier):
    if user_tier == "gold":
        return price * 0.8  # 20% discount
    return price * 0.9      # 10% default discount

An LLM generates this. It looks correct. Tests pass. Code review is clean — both functions are simple, readable, well-separated. But apply_discount and calculate_margin live in different files and nobody enforces that they talk to each other. On a low-margin product, the Gold discount silently pushes the margin negative. The invariant — “final margin must never be negative” — exists somewhere in a business spec or internal doc, never formalized as something the machine can check. The code is syntactically valid. Business-false. And your test suite, written by the same agent, never thought to cross-reference the two.
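The fix the example points toward is making the invariant live in code rather than in a doc. A minimal sketch, assuming we are free to change the signature (the `cost` parameter and the guard are my additions, not part of the original functions):

```python
# Sketch: the margin invariant made explicit and machine-checkable.
# The extra `cost` parameter and the guard are illustrative additions.

def calculate_margin(price, cost):
    return (price - cost) / price

def apply_discount(price, cost, user_tier):
    discounted = price * 0.8 if user_tier == "gold" else price * 0.9
    # "Final margin must never be negative" is now code, not folklore:
    # a generated change that violates it fails loudly, whatever file it lives in.
    if calculate_margin(discounted, cost) < 0:
        raise ValueError("discount would push margin negative")
    return discounted
```

On a low-margin product (say price 100, cost 90) the Gold discount now raises instead of silently shipping a loss.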

The Root Cause: Tokens, Not Contracts

The common diagnosis is context window limitations. “The LLM doesn’t see the whole codebase.” True — but this is a symptom, not the cause. Context windows will grow. The problem won’t disappear with them.

The structural issue is deeper: LLMs have no persistent, binding, machine-checkable representation of your business invariants. They have tokens.

When a human engineer implements a pricing rule, they carry a mental model: margins must never be negative, discounts have floors, edge cases exist and matter. This model is implicit, persistent, and shapes every decision they make.

An LLM has no such model. It has a probability distribution over the next token, conditioned on everything in its context. It will generate code that looks like correct pricing logic because it has seen thousands of pricing implementations. But it has no contract binding it to your pricing invariants.

More context helps — but it doesn’t solve this. You cannot prompt-engineer your way to formal correctness. Prompt engineering treats the symptoms. The disease is architectural.
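To make "binding" concrete: one lightweight way to attach a contract to a function is a postcondition wrapper, so the invariant travels with the code instead of living only in the prompt. A minimal sketch — the `postcondition` helper is hypothetical, not an existing library:

```python
import functools

def postcondition(check, message):
    """Attach a machine-checked invariant to a function's result."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            # The contract is verified on every call, not just in a test run.
            if not check(result, *args, **kwargs):
                raise AssertionError(message)
            return result
        return wrapper
    return decorate

# The human states the contract; the generated body must satisfy it.
@postcondition(lambda out, price, cost, tier: out >= cost,
               "final price must not fall below cost")
def apply_discount(price, cost, tier):
    return price * (0.8 if tier == "gold" else 0.9)
```

The point is not the decorator itself but the separation: the contract is written once, by a human, and no amount of plausible-looking generated code can satisfy it accidentally.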

The TDD Inversion Problem

There’s a subtler issue that doesn’t get enough attention.

TDD’s core philosophy is adversarial: you write tests that try to break your implementation. The tests are hostile witnesses. The implementation must survive them.

Watch what happens when you ask an LLM to write both the tests and the implementation.

The LLM writes the implementation first — because that’s what its training data looks like. Then it writes tests that validate that implementation. The tests are not hostile witnesses. They are accomplices. They are crafted, often unconsciously, to pass with exactly the code that was generated.

This is a complete inversion of TDD’s philosophy. The adversarial relationship between spec and code collapses. You end up with high test coverage that proves nothing except that the AI is internally consistent with itself.

This isn’t a prompt problem. It’s a structural one. The same agent cannot be both the accused and the judge.
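One structural fix is to derive tests from the invariant rather than from the implementation. Even a crude stdlib fuzzer, seeded from the business rule alone, acts as a hostile witness where the AI-written suite does not. A sketch against the earlier discount code (the fuzzer, its ranges, and the tier names are illustrative):

```python
import random

def apply_discount(price, user_tier):
    # The generated implementation under test, unchanged.
    if user_tier == "gold":
        return price * 0.8
    return price * 0.9

def fuzz_margin_invariant(trials=1000, seed=0):
    """Search for inputs where the final margin goes negative."""
    rng = random.Random(seed)
    violations = []
    for _ in range(trials):
        cost = rng.uniform(1, 100)
        price = cost * rng.uniform(1.0, 1.3)   # include low-margin products
        final = apply_discount(price, rng.choice(["gold", "standard"]))
        if (final - cost) / final < 0:          # the unstated invariant
            violations.append((round(price, 2), round(cost, 2), round(final, 2)))
    return violations
```

A test generated alongside the implementation tends to exercise the happy path it already knows; a test derived from the invariant finds the negative-margin inputs immediately, because it was never told what the code expects.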

The Over/Underengineering Trap

There’s a third failure mode, less discussed but equally real: LLMs systematically struggle to calibrate solution complexity.

Ask for a simple data transformer — you get an abstract factory with three layers of indirection. Ask for a robust domain service — you get a 40-line function with no error handling. The LLM has no cost function for architectural complexity relative to the problem at hand. It pattern-matches to what “looks correct” in its training distribution, which skews toward either tutorial-simple or enterprise-over-engineered depending on the prompt framing.

DDD was supposed to help here — bounded contexts, aggregates, domain language as a forcing function for appropriate complexity. But DDD requires the agent to understand the domain model and maintain that understanding across the entire implementation. LLMs lose this thread. Without explicit, external, machine-checkable constraints, the domain model degrades into decoration.

What This Implies

The value of the engineer is shifting. The “How” — implementation — is increasingly delegatable. What is not delegatable is the “What”: the formal specification of what the system must guarantee, under what conditions, without exception.

This shift is not optional. It’s already happening. The question is whether we formalize it deliberately, or let it happen by accident — with all the hidden debt that entails.

The engineers who will remain valuable are not those who write the best code. They are those who can formally specify what correct behavior looks like, and make that specification executable and verifiable.

A Direction Worth Exploring

I’ve been working on a methodology that tries to address this structurally — not by improving how LLMs generate code, but by changing what governs the generation. This is the logical conclusion of the diagnosis above: if the problem is the absence of a binding, machine-checkable contract, then the solution is to make the contract explicit, formal, and adversarial.

The core idea: separate the contract (what the system must guarantee) from the implementation (how it achieves it), make the contract formally executable, and use it as an adversarial harness against which the AI-generated code must survive.

Tests are no longer written by the same agent that writes the code. They are derived mechanically from the contract — and the contract is written by the human.
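To make that concrete, here is a toy sketch of the shape — my illustration, not the actual C-TDD specification from the white paper. The human writes the contract as data; checks against the implementation are derived mechanically from it:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Contract:
    """A human-written, machine-checkable business invariant."""
    name: str
    invariant: Callable[..., bool]   # invariant(output, *inputs) -> bool

def derive_checks(contract: Contract, fn: Callable, cases: Sequence[tuple]):
    """Mechanically turn a contract into concrete checks against fn."""
    return [(args, fn(*args))
            for args in cases
            if not contract.invariant(fn(*args), *args)]

# Written once, by the human -- never by the code's author.
margin_contract = Contract(
    name="non_negative_margin",
    invariant=lambda out, price, cost, tier: out >= cost,
)

def apply_discount(price, cost, tier):
    # Stand-in for the AI-generated implementation.
    return price * (0.8 if tier == "gold" else 0.9)

failures = derive_checks(margin_contract, apply_discount,
                         [(100, 50, "gold"), (100, 90, "gold")])
```

The asymmetry is the mechanism: the implementation can be regenerated endlessly, but it is always judged by a contract it did not write.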

It’s called C-TDD (Contract-Driven Test Development). It’s still a draft and an open question — I’m not claiming it’s a solved problem. But the white paper is public, and I’d genuinely like to know whether this diagnosis resonates, and whether the direction makes sense to people working with these tools every day.

C-TDD — Contract-Driven Test Development

Governing AI by Contract.


The Real Question

I’m not arguing we should slow down AI adoption in engineering. That’s neither realistic nor desirable.

What I’m arguing is that the implicit assumption — that our existing methodologies scale to AI-assisted development without modification — is wrong. TDD was designed for human agents. DDD was designed for human agents. Neither was designed for an agent that has no model of invariants, that inverts the adversarial relationship between test and code, and that has no intrinsic cost function for architectural complexity.

The question isn’t whether AI changes software engineering. It already has.

The question is: what do we put in its place?

I don’t have the full answer. But I think the conversation starts with formalizing what we delegate — and making that formalization something the machine can actually check.

What are you seeing in the field? Is this diagnosis accurate? What am I missing?

Top comments (4)

Matthew Hou

Interesting experience. I've found the combination of good context + good verification is the real multiplier. Context makes AI generate the right thing more often. Verification catches it when context isn't enough. Either one alone gives maybe 30% of the value. Together it's closer to 80%. The remaining 20% still needs human judgment — and honestly, that's the fun part.
