Tao Dong

Posted on Jun 1 • Originally published at Medium

The End of Manual QA: How I Built a Self-Testing App with Claude Code and Waterwheel Agent

#webdev #ai #productivity #testing

What if your AI coding agent could write code, test it, fix bugs, and ship — all without a single human in the loop?

That's not a thought experiment. I just did it — and the whole thing cost one cent in test runs.

Here's the short version: I wired Claude Code (as the code agent) together with Waterwheel (as a browser test agent) so that one implements features and the other verifies them autonomously. The tests are plain Markdown files. When a test fails, the code agent reads the failure, fixes the bug, and re-runs — no human in between. I shipped a complete user-authentication feature this way without touching the keyboard during the build-test-fix loop.

The rest of this post breaks down how it works and how you can run the whole thing yourself.

The Problem with Vibe Coding

AI coding agents have gotten remarkably good at writing code. But there's a dirty secret in the vibe coding workflow that nobody talks about: humans are still doing all the QA.

The current cycle looks something like this:

That's a lot of human in there. And it's not just tedious — it's becoming a genuine bottleneck. AI agents can generate features in minutes. Humans cannot verify them at the same speed. As AI coding capabilities accelerate, the manual QA step goes from being a nuisance to being the thing that breaks the whole workflow.

There had to be a better way.

The Proposal: Test Driven Development as the New Paradigm

The solution isn't to make humans faster. It's to remove them from the loop entirely — at least for verification.

I propose adopting Test Driven Development (TDD) as the new development paradigm for AI-assisted coding. Not TDD in the traditional sense of writing unit tests before code, but TDD as an end-to-end philosophy: define what "done" looks like before a single line is written, then let agents handle everything from implementation to verification.

The cycle looks like this instead:

In this new model, the human shows up at the start to define requirements and test cases, then steps away.

This approach also addresses a scalability problem in AI-driven development. AI outpaces humans at implementing well-defined tasks, but its narrow focus on the current task often breaks existing features. Under the TDD flow, automated regression testing comes almost for free: the regression suite is simply the accumulation of every test case defined across features and versions. If any test breaks, the team can have the AI fix it immediately. Even as product complexity grows, the human team does not need to expand, as long as the test cases created through feature design and external bug reports are well organised.
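To make the accumulation idea concrete, here is a minimal sketch in Python. It assumes the Markdown test cases live under a single root folder with one sub-folder per feature; that layout and both function names are my own illustration, not something the demo project prescribes.

```python
from pathlib import Path


def collect_regression_suite(tests_root: Path) -> list[Path]:
    """Return every Markdown test case under tests_root in a stable order.

    Assumption: tests are *.md files, grouped in per-feature sub-folders.
    """
    return sorted(tests_root.rglob("*.md"))


def find_regressions(results: dict[str, bool]) -> list[str]:
    """Return the names of accumulated tests that no longer pass."""
    return sorted(name for name, passed in results.items() if not passed)
```

Every new feature adds files to the folder, so the suite grows without anyone maintaining a separate regression list; anything returned by `find_regressions` is what gets handed back to the code agent.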

With a well-managed test accumulation process, TDD significantly lowers the technical bar for product creation. In the future, anyone capable of describing their vision in plain text will be able to build great products that solve real problems.

The Demo: A Real Project, Zero Human Intervention

To prove this works in practice, I built a demo project combining two agents:

Claude Code — acting as the code agent, responsible for implementation and bug fixes
Waterwheel Test Agent — acting as the automation agent, responsible for executing browser-based tests and reporting results

The project implements a basic user authentication feature for a Spring Boot web application. The full source is on GitHub, and the initial_structure branch contains only the starting materials — design requirements and agent instructions, all in plain text — so you can try the entire process yourself.

Here's what the handoff between agents looks like in practice:

1. Claude Code generates the feature code
2. Claude Code instructs the Waterwheel agent to run the predefined tests
3. The Waterwheel agent executes both tests
4. The Waterwheel agent reports the results back
5. Claude Code reads the results: if all tests pass, it marks the task complete; otherwise it identifies the cause of the failure and loops back to step 1

No human touched the keyboard between step 1 and step 5.
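Stripped of the agents themselves, the control flow of that handoff is a small loop. The sketch below is only a model of it: `implement` and `run_tests` are hypothetical stand-ins for Claude Code and the Waterwheel agent, and the `max_rounds` cap is my own safety assumption rather than something the demo configures.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AcceptanceReport:
    passed: bool
    failures: list[str] = field(default_factory=list)


def build_test_fix_loop(
    implement: Callable[[list[str]], None],
    run_tests: Callable[[], AcceptanceReport],
    max_rounds: int = 5,  # assumption: cap the loop so it cannot spin forever
) -> int:
    """Alternate implementation and testing until the tests pass.

    Returns the number of rounds used; raises RuntimeError if the tests
    are still failing after max_rounds.
    """
    failures: list[str] = []
    for round_number in range(1, max_rounds + 1):
        implement(failures)         # step 1: write code, or fix the reported failures
        report = run_tests()        # steps 2-4: delegate to the test agent, get results
        if report.passed:           # step 5: all green, task complete
            return round_number
        failures = report.failures  # step 5: otherwise loop back with the failure details
    raise RuntimeError(f"tests still failing after {max_rounds} rounds: {failures}")
```

The first call to `implement` receives an empty failure list (build the feature); every later call receives whatever the test agent reported, which is what lets the code agent debug without asking a human.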

How the Agents Are Wired Together

The key to making this work is giving each agent a clearly defined role and the right instructions.

Claude Code's Role

Claude Code is initialised with a CLAUDE.md file that defines three custom skills:

Confirm QA Environment — verifies the test environment is ready before running tests
Feature Acceptance Test — delegates test execution to the Waterwheel agent and reads the results
Feature Debug — triggers the inner bug-fix loop when tests fail

These skills are what enable the autonomous inner loop: when a test fails, Claude Code doesn't ask a human what to do. It uses the Feature Debug skill to fix the bug, then triggers another acceptance test.

Waterwheel Agent's Role

The Waterwheel agent runs as a Docker container configured against the project's test folder. Its volume mappings point directly at the project structure:

/tests → the agent's task input folder (read-only)
/outputs → where test results are written (read/write)
/waterwheel-instructions → agent instructions and permissions
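For readers who have not used Docker bind mounts, this is roughly what three such mappings look like when assembled into a `docker run` command. Treat it as a sketch under assumptions: the container-side paths (`/agent/tasks` and so on) and the image name in the usage note are placeholders I made up, and mounting the instructions folder read-only is my guess. Check the project README for the real values.

```python
def docker_run_args(project_dir: str, image: str) -> list[str]:
    """Build a `docker run` argument list with three bind mounts.

    Host folders follow the article; container paths are hypothetical.
    """
    mounts = [
        # (host folder, container path, mode)
        (f"{project_dir}/tests", "/agent/tasks", "ro"),      # task input, read-only
        (f"{project_dir}/outputs", "/agent/outputs", "rw"),  # test results, read/write
        # Assumption: instructions are mounted read-only.
        (f"{project_dir}/waterwheel-instructions", "/agent/instructions", "ro"),
    ]
    args = ["docker", "run", "--rm"]
    for host_path, container_path, mode in mounts:
        args += ["-v", f"{host_path}:{container_path}:{mode}"]
    args.append(image)
    return args
```

Calling `docker_run_args("/work/demo", "example/test-agent:latest")` returns a list you could pass to `subprocess.run`. The read-only flag on the tests folder is the important part: the test agent can read the acceptance criteria but cannot rewrite them.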

Two chained test files define the acceptance criteria. The first test handles user registration and login. The second depends on the first succeeding — if registration is broken, the login test is automatically skipped rather than producing a misleading failure.
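The skip-on-failed-prerequisite behaviour is easy to model. The following is a minimal sketch of the idea, not Waterwheel's implementation: tests run in order, and once one fails, everything after it is marked as skipped instead of being executed. The test names in the usage note are hypothetical.

```python
from typing import Callable


def run_chained(tests: list[tuple[str, Callable[[], bool]]]) -> dict[str, str]:
    """Run tests in order; after the first failure, skip all later tests."""
    results: dict[str, str] = {}
    chain_broken = False
    for name, run in tests:
        if chain_broken:
            # A prerequisite failed, so running this test would only
            # produce a misleading secondary failure.
            results[name] = "skipped"
            continue
        passed = run()
        results[name] = "passed" if passed else "failed"
        chain_broken = not passed
    return results
```

With `run_chained([("first", lambda: False), ("second", lambda: True)])` the second test never runs and is reported as skipped, so the code agent sees one root-cause failure rather than two.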

The agent is configured to use DeepSeek as its AI provider, which keeps costs low without sacrificing reliability on straightforward browser automation tasks.

The Results

Both agents completed the full feature development and testing without any human intervention. Because the Feature Acceptance Test passed, the Feature Debug skill was never triggered.

Claude Code's output confirmed all tasks completed successfully, and the final application runs at http://localhost:8080 with working user authentication.

The total cost accumulated by the Waterwheel agent for one complete test run: $0.01.

That's not a typo. One cent covered the browser-based verification for this feature, work that would otherwise have been tedious manual clicking.

Why This Matters

This demo is small by design — one feature, two tests, one developer. But the implications scale.

Consider what this means for a team shipping multiple features per day. Every feature gets:

Test cases defined upfront by a human, once, at design time
Automated implementation by a code agent
Automated verification by a test agent
Automated bug fixing in a tight loop
Automated regression before merge

The human's job shifts entirely to design and requirements. The execution loop — the part that currently consumes most of a QA engineer's time — becomes fully autonomous.

And because the Waterwheel agent is packaged as a Docker image, it can slot into any existing CI/CD pipeline to close the final bug-fix loop, as long as the regression tests are in place.

This isn't the future of software development. Based on this demo, it's already possible today.

Try It Yourself

The demo project is open source. Check out the initial_structure branch, follow the setup instructions in the README, and run the whole process yourself. You'll need: