Most AI-assisted coding right now is just… vibes.
You ask for a feature, you get a plausible diff, it compiles, your dopamine spikes, and two days later production starts doing interpretive dance.
That failure mode has a name: vibe coding.
What I’m aiming for instead is AI-driven development. Whether you call it AI-Assisted Engineering or (as we say in my native Denmark) AI-drevet udvikling, the goal is the same: a workflow where AI helps you design and implement software every day, while humans keep full accountability for correctness, security, and operations.
The difference isn’t “which model” you use. It’s contracts.
Vibe coding is a trust problem
Vibe coding is what happens when we treat generated code like it’s “probably fine” because:
- it looks reasonable
- it uses familiar patterns
- it passes a quick smoke test
- it’s late and we want to go home
AI is great at producing plausible code. But plausible code is not the same as correct code.
If your workflow is “generate → ship”, you’re not doing AI-driven development. You’re doing probabilistic deployment.
AI-driven development is a contract problem
The reliable version is simple, but it requires discipline:
- Define the contract (Humans)
- Let AI work inside that contract (Machine)
- Verify the result like an adult (Humans + CI)
The “contract” can be acceptance tests, unit tests, invariants, schemas, types, or all of the above. But tests are the most universal contract because they encode behavior and they fail loudly.
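Types and schemas can carry part of the contract too. A minimal sketch in plain TypeScript (the names `Email` and `parseEmail` are illustrative, not from any particular library): a branded type plus a validating parser means invalid values fail loudly at the boundary, and everything downstream can trust the invariant.

```typescript
// A contract expressed as a type plus a runtime invariant.
// The branded type prevents passing a raw string where a
// validated Email is expected; the parser either returns a
// value that satisfies the contract or throws.

type Email = string & { readonly __brand: "Email" };

function parseEmail(raw: string): Email {
  // Deliberately simple format check for illustration only.
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(raw)) {
    throw new Error(`Invalid email: ${raw}`);
  }
  return raw as Email;
}
```

Unlike a test, this contract is enforced on every run, not just in CI. Tests still matter because they pin down behavior, not just shape.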
The boring loop: Red → Gold → Refactor (GS-TDD)
To operationalize this, I use a variation of TDD I call Gold Standard TDD (GS-TDD). It evolves the classic Red–Green–Refactor loop into Red–Gold–Refactor.
- Red: Write (or generate) a failing test suite that defines behavior. Ideally BDD-style so intent and edge cases are explicit.
- Gold: Instruct the AI to produce a "Gold Standard" implementation on the first pass, production-oriented by design. That means:
  - Security-aware defaults
  - Maintainable structure
  - Solid error handling and boundaries
  - Boring, standard architecture
  - No MVP, no prototyping
- Refactor: Improve the internals safely while keeping behavior unchanged.
The key shift: GS-TDD replaces “minimal Green” with “Gold Standard”, because AI allows us to skip the boilerplate phase. Your tests (the contract) keep the AI honest, and the "Gold" prompt keeps the architecture clean.
For a deeper dive into the methodology, check out my research note on GS-TDD.
A tiny example (The Workflow)
Say you’re adding a rule: “Users can’t create more than 3 API keys”.
Red: Write the contract first.
```typescript
it("limits API keys per user", async () => {
  const user = await seedUser();
  await createKey(user);
  await createKey(user);
  await createKey(user);
  // This defines the contract: behavior AND error shape
  await expect(createKey(user)).rejects.toThrow("API key limit reached");
});
```
Now the test suite is the boss.
Gold: Prompt the AI for a production-oriented first pass under constraints. You don't just say "make it pass". You say:
"Implement this feature. Enforce the limit transactionally to avoid race conditions, use our standard domain error types, and add a structured log line when the limit is hit."
The “Gold” code might still fail a test or two on the first run — that’s fine. You debug inside the contract until it’s green.
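For a sense of what that first pass looks like, here is a hedged, in-memory sketch. A real implementation would enforce the limit inside a database transaction; here a per-user promise chain stands in for that serialization, and all names (`createKey`, `ApiKeyLimitError`, the log event) are illustrative, not a prescribed API.

```typescript
// In-memory sketch of a "Gold" first pass: limit enforced without
// race conditions, a standard domain error type, and a structured
// log line when the limit is hit.

const API_KEY_LIMIT = 3;

class ApiKeyLimitError extends Error {
  constructor() {
    super("API key limit reached");
    this.name = "ApiKeyLimitError";
  }
}

const keysByUser = new Map<string, string[]>();
const locks = new Map<string, Promise<unknown>>();

// Serialize operations per user so two concurrent createKey calls
// cannot both pass the limit check. A DB transaction plays this
// role in production.
function withUserLock<T>(userId: string, fn: () => Promise<T>): Promise<T> {
  const prev = locks.get(userId) ?? Promise.resolve();
  const run = prev.then(fn, fn); // run after the previous op settles
  locks.set(userId, run.catch(() => undefined)); // keep the chain alive on errors
  return run;
}

async function createKey(userId: string): Promise<string> {
  return withUserLock(userId, async () => {
    const keys = keysByUser.get(userId) ?? [];
    if (keys.length >= API_KEY_LIMIT) {
      // Structured log line, per the prompt's constraints.
      console.warn(JSON.stringify({ event: "api_key_limit_hit", userId }));
      throw new ApiKeyLimitError();
    }
    const key = `key_${userId}_${keys.length + 1}`;
    keysByUser.set(userId, [...keys, key]);
    return key;
  });
}
```

Note how the constraints from the prompt show up directly in the code: the serialization, the named error type, the log event. That is what "production-oriented by design" buys you over "make it pass".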
Refactor: Once behavior is proven, you clean up naming and file structure.
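A typical behavior-preserving refactor is extracting the rule into a named, separately testable function while the suite stays green (a sketch with illustrative names):

```typescript
// Hypothetical refactor: the limit rule gets a name and a home,
// so the policy can be read and tested on its own. Behavior is
// unchanged; the existing contract tests prove it.

const API_KEY_LIMIT = 3;

function canCreateKey(existingKeyCount: number): boolean {
  return existingKeyCount < API_KEY_LIMIT;
}
```

If the tests still pass after a change like this, the contract has done its job.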
AI can help in all steps — but the contract decides what’s allowed.
The Boring AI Checklist (8 rules I actually follow)
- Define the contract: Tests describe what must remain true.
- Verify, don’t trust: Run tests, lint, and build locally.
- Prefer small diffs: AI hallucinates more in large contexts. Keep changes reviewable.
- Keep prompts in the diff: The “why” belongs in PRs/docs, not in your chat history.
- Review like it’s human code: Check naming, invariants, and boundaries.
- Guard risky edges: Auth, data loss, security headers, rate limits.
- Watch production: Logs/metrics/alerts confirm the change behaved.
- Rollback is a feature: Know how you undo the change before you ship.
That’s it. No mysticism. Just boring verification at high speed.
“But isn’t this just TDD?”
Sort of — but with a twist in motivation.
Classic TDD helps humans think clearly and design well. With AI in the loop, tests also become safety rails that stop plausible nonsense from sliding into main.
If your AI workflow doesn’t have contracts, your delivery pipeline is basically a slot machine that sometimes pays out.
Read the full series
I’m documenting this entire methodology (and the concept of "Wards") on my site.
- The Hub: AI-driven development / AI-drevet udvikling
- The Core Idea: Ward #2: Red-Gold-Refactor
The future of coding is boring: small diffs, explicit contracts, and fewer production surprises.
Which is exactly the kind of future worth shipping.