Jovan Chan

Posted on Jun 1 • Originally published at aicoderscope.com

Test-Driven AI Coding: The Workflow That Actually Catches Bugs

#cursor #githubcopilot #cline #aider


The bug doesn't come from code that fails to compile. It comes from code that compiles, passes all 24 tests, ships to production, and then silently returns wrong results for every input that wasn't in the original prompt.

AI coding tools have a specific failure mode that raw code review doesn't catch: tautological tests. When you ask Cursor, Copilot, or Cline to "add tests for this function," the tool reads your implementation and writes tests that verify it. If the implementation is wrong, the tests confirm the wrong behavior. Coverage goes up; confidence goes up; bugs survive.

Test-Driven Development forces a different order. Tests come first — written against a spec, not an implementation. The implementation is written to make those tests pass. This constraint is exactly what transforms AI coding tools from confident bug generators into reliable collaborators.

The overhead is real. This article covers what actually works per tool, what discipline costs you in speed, and where TDD still leaves gaps even when you do it right.

The tautological test problem, quantified

When developers let AI generate tests after writing code, roughly 35% of the resulting tests are tautological — they pass because they mirror the implementation's internal logic, not because they verify the correct behavior. Flip the order: write a spec, generate tests from the spec, then generate the implementation, and that rate drops to 5–10% (source: GitHub Copilot documentation on spec-driven workflows).

The mechanism is simple. A test for a calculate_discount(price, tier) function that was written by reading the implementation will assert whatever the implementation does. If the implementation has an off-by-one error on Gold tier discounts, the test passes with the wrong expected value. Neither coverage metrics nor CI pipelines catch this — the test suite is green, and the bug ships.
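A minimal sketch of that failure, assuming a hypothetical spec in which Gold tier earns a 20% discount and the function returns the discount amount. The rates, tier names, and function body are illustrative and not taken from any real codebase:

```python
from decimal import Decimal


def calculate_discount(price: Decimal, tier: str) -> Decimal:
    """Return the discount amount. Buggy on purpose: gold is 19%, not 20%."""
    rates = {
        "standard": Decimal("0.00"),
        "silver": Decimal("0.10"),
        "gold": Decimal("0.19"),  # off by one percentage point vs. the assumed spec
    }
    return price * rates[tier]


def tautological_test_passes() -> bool:
    # Expected value was copied from what the implementation returns today.
    return calculate_discount(Decimal("100"), "gold") == Decimal("19.00")


def spec_test_passes() -> bool:
    # Expected value comes from the assumed spec: 20% of 100 is 20.
    return calculate_discount(Decimal("100"), "gold") == Decimal("20.00")


print("tautological test passes:", tautological_test_passes())
print("spec-derived test passes:", spec_test_passes())
```

The first check stays green while the bug is live. Only the check whose expected value was written down independently of the code turns red.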

There's a deeper variant: the test computes its expected output by calling the function under test. This is worse than tautological — it's circular. Any input produces a "passing" test because the expected and actual values are generated by identical code paths.
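The circular variant can be sketched the same way. The implementation below is deliberately wrong for every input, and the 20% Gold rate in the last check is again an assumed spec, not something stated in the source:

```python
def calculate_discount(price: float, tier: str) -> float:
    """Deliberately wrong: ignores the tier and always returns zero."""
    return 0.0


def circular_check(price: float, tier: str) -> bool:
    # The expected value is produced by the function under test,
    # so the comparison can never fail.
    expected = calculate_discount(price, tier)
    return calculate_discount(price, tier) == expected


def spec_check() -> bool:
    # The expected value is a literal taken from the (assumed) spec.
    return calculate_discount(100.0, "gold") == 20.0


print("circular check passes:", circular_check(100.0, "gold"))
print("spec check passes:", spec_check())
```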

The fix in both cases is the same: tests must precede implementation, and expected values must come from a spec, not the code.

The three-phase discipline (and why enforcement matters)

Red–Green–Refactor is the standard TDD loop. In AI-assisted development, each phase requires explicit enforcement because AI tools will collapse them if you don't stop them.

Red — The agent writes tests only. No implementation changes. If you run the test suite at the end of this phase, everything should fail. If a test passes without implementation, the test is wrong.

Green — The agent writes the minimal implementation to pass the new tests. "Minimal" matters. Permitting large, speculative implementations here is how scope creeps and bugs enter.

Refactor — Structure improves; behavior doesn't change. Tests stay green throughout. This is the phase where the agent can rename, extract, and reorganize safely.

Any agent that modifies production code during Red is violating the contract. With Cursor and Cline, the enforcement mechanism is your rules file. With Aider, it's the --auto-test flag. With Copilot in VS Code (2026), it's a dedicated custom agent for each phase.
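The Red contract can also be double-checked mechanically. A minimal sketch, assuming tests live under a top-level tests/ directory and that the list of changed paths comes from something like `git diff --name-only`. The function name and the directory layout are hypothetical:

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def red_phase_violations(
    changed_paths: Iterable[str],
    test_roots: Tuple[str, ...] = ("tests",),
) -> List[str]:
    """Return every changed path that is not under an allowed test root."""
    violations = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # A path counts as a test file only if its first component is a test root.
        if not parts or parts[0] not in test_roots:
            violations.append(raw)
    return violations


changed = ["tests/test_discount.py", "src/discount.py"]
print(red_phase_violations(changed))
```

An empty result means the Red phase touched only test files. Anything else is worth reverting before moving on to Green.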

Cursor: rules + Agent mode

Cursor's Agent mode and .cursor/rules files give you the enforcement surface you need for TDD. The minimal setup:

Create .cursor/rules/tdd.mdc with the following content (for more on rule file structure, see Custom Cursor Rules: Templates That Actually Work):

---
description: TDD discipline — enforce phase separation
alwaysApply: true
---

PHASE RED: Write failing tests only. Do not touch production code in src/. 
Stop and confirm when all new tests exist and are failing.

PHASE GREEN: Write minimum implementation to pass failing tests. 
Do not refactor. Do not add features not covered by a failing test.

PHASE REFACTOR: Improve code structure only. No behavior changes.
All tests must stay green throughout.

NEVER compute expected test values by calling the function under test.
NEVER add tests after writing implementation — tests come first.

In practice, the workflow is:

Open Cursor Agent (Cmd/Ctrl+Shift+I). Switch to Plan mode.
Describe the feature as a spec — inputs, outputs, edge cases, what should fail. Do not describe implementation.
Ask Agent to write tests in Plan mode. Review them before any code runs.
Switch to Agent mode (not Plan). Ask it to implement until tests pass.
Ask it to refactor — then run the suite one more time.

The step most developers skip is reviewing the tests before switching to implementation. That five-second check is where you catch circular assertions before they become production bugs.

GitHub Copilot / VS Code: dedicated TDD agents

VS Code's Copilot introduced purpose-built TDD agents in 2026. Three agent files, one for each phase, with handoffs that chain them together:

.github/agents/TDD-red.agent.md — writes failing tests only, explicitly forbidden from touching implementation
.github/agents/TDD-green.agent.md — writes minimal implementation, runs test suite automatically
.github/agents/TDD-refactor.agent.md — refactors with tests running as a guard

Create these via Command Palette → Chat: New Custom Agent.

The handoff points between phases are manual checkpoints — you click to advance. This is intentional. The documentation notes: "Handoffs provide control points where you can assess each step, verify the AI's work, and steer the agent." Treating them as friction to click through defeats the purpose.

For simpler scenarios, the /tests slash command in Copilot Chat generates tests that match your project's existing conventions (pytest fixture patterns, Jest describe/it structure) without configuration. The catch: /tests runs after your implementation exists. For real TDD, you need to use the agent workflow above and explicitly prompt: "Write tests for username validation that enforces these rules: [your spec]. Do not write any implementation code."
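To make "tests from a spec" concrete, here is a sketch of what the Red and Green phases might produce for that username prompt. The rules are a hypothetical spec invented for illustration (3 to 20 characters, starting with a lowercase letter, followed by lowercase letters, digits, or underscores), and every name below is an assumption:

```python
import re
from typing import Callable, List

# Each expected value is read off the assumed spec, never off an implementation.
SPEC_CASES = [
    ("abc", True),          # minimum length
    ("a" * 20, True),       # maximum length
    ("user_name1", True),   # digits and underscores allowed after the first letter
    ("ab", False),          # too short
    ("a" * 21, False),      # too long
    ("1abc", False),        # must start with a letter
    ("User", False),        # uppercase rejected
    ("", False),            # empty string rejected
]


def failing_cases(validate: Callable[[str], bool]) -> List[str]:
    """Return the inputs for which `validate` disagrees with the spec."""
    return [name for name, expected in SPEC_CASES if validate(name) is not expected]


def stub_validate(username: str) -> bool:
    """Red phase placeholder: rejects everything, so the 'valid' cases fail."""
    return False


def validate_username(username: str) -> bool:
    """Green phase: the smallest implementation that satisfies the spec cases."""
    return re.fullmatch(r"[a-z][a-z0-9_]{2,19}", username) is not None


print("red phase failures:", len(failing_cases(stub_validate)))
print("green phase failures:", len(failing_cases(validate_username)))
```

The stub fails exactly the cases the spec marks as valid, which is the red state the first agent should hand off. The Green implementation is then judged only against the same table.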

Aider: automated red-green loop via --auto-test

Aider's TDD integration is the tightest of any tool in the category because the test-run loop is baked into the architecture, not bolted on via a prompt. If you haven't set Aider up yet, start with Aider with Local LLM via Ollama in 2026 — in particular the context window configuration, which affects how reliably Aider tracks failing tests.

aider --test-cmd "pytest tests/" --auto-test

With --auto-test enabled, Aider runs your test suite after every code change. If tests fail, it reads the failure output, reads the changed code, proposes a fix, and re-runs. This loop continues until tests pass or Aider gives up and asks you.
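Conceptually, the loop looks something like the sketch below. This is not Aider's actual implementation: the callables, the round limit, and the function name are all assumptions chosen to illustrate the run, read failure, fix, re-run cycle described above.

```python
from typing import Callable, Tuple


def auto_test_loop(
    run_tests: Callable[[], Tuple[bool, str]],
    propose_fix: Callable[[str], None],
    max_rounds: int = 4,
) -> bool:
    """Run tests, feed failures to a fixer, and repeat up to max_rounds fixes."""
    for _ in range(max_rounds):
        passed, output = run_tests()
        if passed:
            return True
        # Hand the failure output to whatever edits the code (the model, in Aider's case).
        propose_fix(output)
    # One final run to see whether the last fix worked; otherwise give up.
    passed, _ = run_tests()
    return passed


# Fake collaborators for demonstration: each "fix" removes one failing test.
state = {"failing": 2}


def fake_run() -> Tuple[bool, str]:
    return state["failing"] == 0, f"{state['failing']} tests failing"


def fake_fix(failure_output: str) -> None:
    state["failing"] -= 1


print("suite green:", auto_test_loop(fake_run, fake_fix))
```

A `False` return corresponds to the point where the tool stops iterating and hands control back to you.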

For a real TDD flow with Aider:

Write your test file manually (or prompt Aider in a fresh session with no production code added to the chat).
Verify the tests fail: pytest tests/ should show red.
Start Aider with the test command: aider src/feature.py tests/test_feature.py --test-cmd "pytest tests/test_feature.py" --auto-test
Prompt: "Implement feature.py to pass the tests. Do not modify the test file."

Aider will iterate — often 2–4 rounds — until the suite is green. This is the closest any tool comes to automated TDD without human intervention at each phase.
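The "verify the tests fail" step is easy to skip, so it can help to script it. A small sketch that relies on pytest's documented exit codes (0 for all passed, 1 for test failures, 5 for no tests collected). The helper name is hypothetical:

```python
import subprocess
from typing import Sequence

# pytest exit code meaning "tests were collected and run, and some failed".
PYTEST_TESTS_FAILED = 1


def suite_is_red(test_cmd: Sequence[str]) -> bool:
    """True only if the command exits with pytest's 'tests failed' code."""
    result = subprocess.run(list(test_cmd), capture_output=True, text=True)
    return result.returncode == PYTEST_TESTS_FAILED


# Example call (assumes pytest is installed and the test file exists):
# suite_is_red(["pytest", "tests/test_feature.py"])
```

Checking for exit code 1 specifically, rather than any non-zero code, avoids mistaking "no tests collected" for a red suite. If the test file cannot import the module at all, pytest typically reports a collection error with a different exit code, so this sketch assumes a stub module already exists.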

The context window trap still applies: keep tests and implementation files small. If Aider is loading 3,000+ lines of context, it starts losing track of which tests are failing and why. One feature, one test file, one session.
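A rough pre-flight check for that rule of thumb, assuming the 3,000-line figure above is used as a budget. The helper names are hypothetical, and line count is only a crude proxy for token usage:

```python
from pathlib import Path
from typing import Iterable

LINE_BUDGET = 3000  # rule-of-thumb threshold from the paragraph above


def total_lines(paths: Iterable[str]) -> int:
    """Count lines across all files that would be added to the session."""
    return sum(
        len(Path(p).read_text(encoding="utf-8").splitlines()) for p in paths
    )


def within_budget(paths: Iterable[str], budget: int = LINE_BUDGET) -> bool:
    """True if the combined files stay under the line budget."""
    return total_lines(paths) < budget


# Example call (paths are placeholders):
# within_budget(["src/feature.py", "tests/test_feature.py"])
```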

Cline: Plan mode for specs, Act mode for implementation

Cline's Plan/Act toggle maps directly onto the Red/Green boundary.

Use Plan mode to write the spec and
