merrickcr

Posted on Jun 12 • Originally published at Medium

My AI Said the Feature Was Done. It Didn't Exist.

#ai #android #programming #productivity

Building a self-verifying AI dev team: role separation, a three-gate verification model, and YAML state on disk.

I built a multi-agent feature development framework on top of Claude Code, and every design decision in it is "scar tissue": architecture built specifically to solve a failure I'd already hit. This post walks through how role separation, YAML-based state, and a three-gate verification model took a PRD to a verified, tested feature without me writing the implementation by hand. I call it the Sage Feature Team. Check out my framework here: github.com/merrickcr/sage-feature-team.

Two stories. Breadcrumbs is why I built Sage. Melody is what broke after I did.

Breadcrumbs: The Ghost Implementation

I gave Claude Code a PRD for an Achievements and Badges feature in Breadcrumbs, my Android trail journaling app. The requirements were rigid: the evaluation logic had to be unit-testable with fake repository implementations and free of direct Android framework dependencies.

Claude delivered 17 files. It was a masterclass in architectural adherence: a full domain layer, a persistence layer, and five Clean Architecture use cases, injected flawlessly using Hilt.

I went to verify the work by running the tests, and the test suite was empty.

Claude had done everything to make the code testable, but it never actually wrote a single test. This is the first failure of a "one-shot" feature development attempt using AI. The agent writing the code cannot be the agent deciding whether the code is done.

This gave way to a separation of roles in my agent framework. A ProductOwner translates the PRD into a spec, epics, and stories. A TestCreator writes tests against each story's acceptance criteria before any code is written. A Developer implements against those tests. A Tester runs them and decides whether the story is actually done. Call this Gate A. Tests come first, and the agent that writes the code never gets to grade it. An EpicVerifier (described below) catches cross-story regressions once every story in an epic is complete. This worked great, until I encountered yet another problem.

Melody: The Green Facade

I was using Melody, my health journaling app, as a playground for on-device LLM functionality. I built the MVP using the new Sage workflow: 14 stories, role-separated agents, tests written first against each story's acceptance criteria before any line of code. This was the system the Breadcrumbs failure had bought me, and I trusted it.

All 14 stories reached DONE. The test suite was a sea of green. I deployed the app to a physical device, eager to see the results of my new MVI domain layer, SQLCipher encrypted databases, and on-device LLM.

"Hello Android."

It was the default Android Studio template. The entire domain layer (the ViewModels, Reducers, and Stores) existed and was "tested." But nothing was wired to actually display the UI. In fact, hardly any UI had been created at all.

The tests passed because they were @Ignore'd. In the Android test runner, a skipped test looks exactly like a passing one in a high-level summary. The suite was green, but the feature didn't exist.
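One way to keep a skipped test from hiding inside a green summary is to count outcomes per test case instead of trusting the headline. The sketch below is a minimal illustration and not part of Sage itself. It assumes the test runner emits a JUnit-style XML report, where a skipped case carries a `<skipped/>` child element. The report contents and the `JournalStoreTest` name are made up for the example.

```python
import xml.etree.ElementTree as ET


def summarize_junit_xml(xml_text: str) -> dict:
    """Count passed, failed, and skipped test cases in a JUnit-style XML report."""
    root = ET.fromstring(xml_text)
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    # iter() finds <testcase> at any depth, so <testsuite> and <testsuites> roots both work.
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            counts["skipped"] += 1
        elif case.find("failure") is not None or case.find("error") is not None:
            counts["failed"] += 1
        else:
            counts["passed"] += 1
    return counts


def gate_passes(counts: dict) -> bool:
    """Pass only if something actually ran, nothing failed, and nothing was skipped."""
    return counts["passed"] > 0 and counts["failed"] == 0 and counts["skipped"] == 0


# Hypothetical report: three tests, two of them ignored, zero failures.
SAMPLE_REPORT = """<testsuite name="JournalStoreTest" tests="3" failures="0" errors="0" skipped="2">
  <testcase name="savesEntry" classname="JournalStoreTest"/>
  <testcase name="loadsEntry" classname="JournalStoreTest"><skipped/></testcase>
  <testcase name="deletesEntry" classname="JournalStoreTest"><skipped/></testcase>
</testsuite>"""

if __name__ == "__main__":
    counts = summarize_junit_xml(SAMPLE_REPORT)
    print(counts, "gate passes:", gate_passes(counts))
```

A report like this has zero failures, which is why a summary view reads as green. Treating "skipped" as a blocking outcome is a policy choice. A stricter or looser rule may suit other projects.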

Gate B: The Implementation Map

I realized that for an LLM, "green tests" are a request, not a result. They can be gamed. I needed something stricter: proof of implementation. Tests alone might not capture the complete picture, and the LLM could fake a pass by ignoring the tests or by mocking the data entirely.

So I added a new layer of checks: an Implementation Map for each story, mapping every acceptance criterion (AC) to the concrete files and line numbers that satisfy it. And because a map is just more text an LLM can fabricate, my verify_ac_map.py script checks it against the actual code: every cited file has to exist, every line number has to fall within that file, and every symbol the map names has to actually appear in the cited files. A map that points at a missing file, a line past the end of the file, or a symbol that isn't there fails the build.

I didn't stop there. I noticed Claude loved to use terms like "TODO," "future story," and "placeholder" while marking an AC complete. So I added a regex check for "banned" words that catches Claude trying to weasel its way to done.
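A banned-phrase check of this kind can be very small. The following is a sketch under assumptions, not the check that ships in the repo. The pattern is seeded only with the three phrases named above, and the function name and sample text are hypothetical.

```python
import re

# Hypothetical banned list, seeded with the three phrases named in the post.
BANNED_PATTERN = re.compile(r"\b(TODO|future\s+story|placeholder)\b", re.IGNORECASE)


def find_banned_phrases(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_phrase) for each banned phrase in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in BANNED_PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "AC1: implemented in parser.py\nAC2: placeholder until a future story lands\n"
    for lineno, phrase in find_banned_phrases(sample):
        print(f"line {lineno}: banned phrase {phrase!r}")
```

Any hit on an AC that is marked complete would be grounds to reject the claim and send the story back.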

To be clear about what this buys me: it doesn't prove the code is correct. It proves that the agent's claim of "done" is anchored to real files and real symbols instead of a confident story. That's a smaller promise than "this works," but it's one I can actually enforce.

STORY-1 Implementation Map (cycle 1)
AC1 ("full front-matter: title + date + rendered body")
Implemented in:
    src/ssg/parser.py:38   (parse_post reads title/date, renders body)
    src/ssg/parser.py:51   (title field read from front-matter mapping)
    src/ssg/parser.py:57   (date read and normalized)
    src/ssg/parser.py:59   (body rendered to HTML via _MD.render)
    src/ssg/parser.py:124  (_coerce_date returns ISO string)
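To make the three checks concrete, here is a minimal sketch of how a map in that shape could be verified. It is not the actual `verify_ac_map.py`. It assumes each citation row looks like `path:line (note)`, and it takes the symbols to look for as an explicit argument, because the real script's way of extracting them is not shown here.

```python
import re
import tempfile
from pathlib import Path

# Matches rows shaped like "    src/ssg/parser.py:38   (note)"; header rows do not match.
CITATION = re.compile(r"^\s*(?P<path>[^\s:]+):(?P<line>\d+)\s*(?:\((?P<note>.*)\))?\s*$")


def parse_citations(map_text):
    """Yield (path, line_number) for every 'path:line' row in the map."""
    for row in map_text.splitlines():
        match = CITATION.match(row)
        if match:
            yield match["path"], int(match["line"])


def verify_map(map_text, repo_root, symbols=()):
    """Return a list of problems; an empty list means every claim is anchored."""
    problems = []
    cited = {}  # path -> list of source lines, for cited files that exist
    for path, line_no in parse_citations(map_text):
        file_path = Path(repo_root) / path
        if not file_path.is_file():
            problems.append(f"{path}: file does not exist")
            continue
        if path not in cited:
            cited[path] = file_path.read_text(encoding="utf-8").splitlines()
        if not 1 <= line_no <= len(cited[path]):
            problems.append(f"{path}:{line_no}: outside file ({len(cited[path])} lines)")
    source = "\n".join(line for lines in cited.values() for line in lines)
    for symbol in symbols:
        if symbol not in source:
            problems.append(f"symbol {symbol!r} not found in cited files")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = Path(root) / "src/ssg/parser.py"
        target.parent.mkdir(parents=True)
        target.write_text("def parse_post(text):\n    return text\n", encoding="utf-8")
        claimed = (
            "Implemented in:\n"
            "    src/ssg/parser.py:1   (parse_post)\n"
            "    src/ssg/parser.py:124  (_coerce_date)\n"
        )
        for problem in verify_map(claimed, root, symbols=["parse_post", "_coerce_date"]):
            print("FAIL", problem)
```

The demo builds a two-line file in a temporary directory, so the citation at line 124 and the `_coerce_date` symbol are both reported as problems. A plain substring test for symbols is deliberately crude; it can be fooled by a comment that mentions the name.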

Gate C: Epic Verifier

Gate B ensures every story is individually complete. But individual completeness isn't the same as overall correctness. Story 3 can pass every test it owns while silently breaking something Story 1 built. Nobody runs Story 1's tests during Story 3's work.

That's what the EpicVerifier is for. Once every story in an epic reaches DONE, it runs the whole epic's tests in one pass: every story's suite together instead of story-by-story. In Melody it paid for itself during a MediaPipe migration. One story correctly removed the AICore <uses-feature> gate from the manifest so the feature could run on any device, and its own tests verified that cleanly. Gate A and Gate B passed; the story was done. But a stale test from the pre-migration code, tagged to a different story, still asserted that the gate was required. No per-story run ever touched it. Running the epic's whole suite caught it the moment the epic closed. One Developer cycle deleted the stale test, and the epic verified.
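The difference between a per-story run and an epic-wide run can be shown with a toy model. This is an illustration of the idea only, not the EpicVerifier's code. The story IDs, check names, and the `manifest` dictionary are stand-ins invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in for the manifest after the migration story removed the gate.
manifest = {"aicore_required": False}


def check_gate_removed() -> bool:
    """Owned by the migration story: the gate should be gone."""
    return manifest["aicore_required"] is False


def check_gate_still_required() -> bool:
    """Stale check, tagged to an earlier story: still expects the gate."""
    return manifest["aicore_required"] is True


# Hypothetical ownership table: which story each check is tagged to.
TESTS_BY_STORY: Dict[str, List[Callable[[], bool]]] = {
    "STORY-1": [check_gate_still_required],
    "STORY-3": [check_gate_removed],
}


def run_story(story: str) -> List[str]:
    """Names of failing checks among the checks one story owns."""
    return [check.__name__ for check in TESTS_BY_STORY[story] if not check()]


def run_epic() -> List[Tuple[str, str]]:
    """(story, check) pairs that fail when every story's suite runs together."""
    return [(story, name) for story in TESTS_BY_STORY for name in run_story(story)]


if __name__ == "__main__":
    print("STORY-3 alone:", run_story("STORY-3"))
    print("whole epic:   ", run_epic())
```

The migration story passes on its own, and the stale check only surfaces once the suites run together. That is the gap the epic-level pass closes.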

The Architecture of Scale: Decoupling the "What" from the "How"

The Sage Orchestrator in action. Here, the parallel scheduler has spawned five ephemeral workers to handle implementation and testing for three different stories simultaneously. Note the token telemetry for each agent: the visible price of this level of autonomous coordination.

To make Sage a reusable framework and not just a pile of Android scripts, I built it around a strict separation of concerns: a clean split between the generic "what" and the project-specific "how."

Each agent (ProductOwner, TestCreator, Developer, Tester, EpicVerifier) has a fixed, project-agnostic job description. The Developer agent knows how to implement a feature, but it doesn't know your project's folder structure or testing framework. All project knowledge (the test commands, the coding conventions, the file paths) is injected at runtime via .yaml configuration files located in the project's .sage/ directory.
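As a rough sketch of that split, a role can be a fixed template and the project a bag of values filled in at run time. Everything below is illustrative. The template wording, the field names, and both project dictionaries are assumptions, and the plain dictionaries stand in for whatever the parsed `.sage/` YAML actually contains.

```python
from string import Template

# Hypothetical project-agnostic role description with slots for project context.
DEVELOPER_ROLE = Template(
    "You are the Developer. Implement the story so its tests pass.\n"
    "Source root: $source_root\n"
    "Run tests with: $test_command\n"
    "Conventions: $conventions\n"
)

# Hypothetical project contexts; in the framework these would come from .sage/ YAML files.
ANDROID_PROJECT = {
    "source_root": "app/src/main/kotlin",
    "test_command": "./gradlew testDebugUnitTest",
    "conventions": "MVI, Hilt injection",
}
PYTHON_PROJECT = {
    "source_root": "src",
    "test_command": "pytest -q",
    "conventions": "type hints, no global state",
}


def build_prompt(role: Template, project: dict) -> str:
    """Fill the role's slots; substitute() raises KeyError if a slot is missing."""
    return role.substitute(project)


if __name__ == "__main__":
    print(build_prompt(DEVELOPER_ROLE, ANDROID_PROJECT))
    print(build_prompt(DEVELOPER_ROLE, PYTHON_PROJECT))
```

Using `substitute()` instead of `safe_substitute()` means an incomplete project config fails loudly instead of producing a prompt with holes in it.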

By decoupling the role from the environment, the same Developer agent can implement a Kotlin feature in the morning and a Python microservice in the afternoon. The agent stays the same; the project context is what changes.

This architecture is managed by a parallel scheduler that spawns ephemeral workers for each story, allowing the team to work on multiple features simultaneously. But as I quickly learned, scaling a team of parallel agents creates a massive coordination problem. How do you keep them all in sync without drowning in "hallucinated" chat history?

The full Sage pipeline. The ProductOwner drafts the spec and stories; the parallel scheduler spawns one ephemeral worker per ready story (TestCreator → Developer ⇌ Tester); the EpicVerifier closes the loop on each epic with cross-story regression. The dashed line is the failure path: failed gates reopen specific stories rather than stalling the whole feature.

The Great Protocol Simplification

My first attempt at scaling the team was a disaster of coordination. In a parallel environment, agents would drop "packets" of communication, either ignoring commands or hallucinating that they had sent messages they never actually sent. I tried to solve this with a 3-way TCP-style handshake: SYN, SYN-ACK, ACK.

The result was spectacular over-engineering. It added latency and burned thousands of tokens on "handshake ceremony." Worse, it buried the agents in protocol overhead: multi-step retries, message IDs, acknowledgment timing tables. All of it spelled out in painful detail in a coordination handbook. The agents weren't network cards. They were models, and they were suffocating under the weight of the rules.

I realized I was trying to make my agents "communicate" better when I should have been making them "reconcile" better. I deleted the handshake and moved the source of truth to a YAML-based state machine on disk.

Each story's current state (IN_DEV, TESTING, DONE) lives in a file-locked YAML written atomically on disk, not in an agent's fleeting memory. It's durable current state. Nothing fancy, no replay log, no reconciliation engine. But making that file the single source of truth let me delete the entire handshake protocol and replace it with one rule: re-read the story YAML. The file is the source of truth now, not whatever an agent claims to remember.
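For readers who want to see the shape of "locked, atomic, re-read", here is a minimal sketch using only the standard library. It is not Sage's implementation. The file layout, the lock-file convention, and the function names are assumptions, and the state file is written as flat `key: value` lines (which is valid YAML) to avoid depending on a YAML library.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path

VALID_STATES = ("IN_DEV", "TESTING", "DONE")


@contextmanager
def story_lock(story_path: Path):
    """Crude cross-process lock: O_EXCL makes creation fail if the lock file exists.
    A real implementation would retry or time out instead of raising immediately."""
    lock_path = story_path.with_suffix(".lock")
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def write_state(story_path: Path, story_id: str, state: str) -> None:
    """Write the story file under a lock, via a temp file and an atomic rename."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state}")
    body = f"id: {story_id}\nstate: {state}\n"
    with story_lock(story_path):
        # The temp file lives in the same directory so the rename stays on one filesystem.
        fd, tmp_name = tempfile.mkstemp(dir=story_path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(body)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, story_path)  # readers see the old file or the new one, never half


def read_state(story_path: Path) -> str:
    """The one rule: re-read the file instead of trusting what an agent remembers."""
    for line in story_path.read_text(encoding="utf-8").splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "state":
            return value.strip()
    raise KeyError("state")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "STORY-1.yaml"
        write_state(path, "STORY-1", "IN_DEV")
        write_state(path, "STORY-1", "TESTING")
        print(read_state(path))
```

The write-then-rename pattern is what makes a crash mid-write harmless: the story file is replaced in one step or not at all.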

The Honest Trade-offs

Sage is not a free lunch. This framework provides a level of reliability and traceability that single-shot prompts can't match, but it comes with specific costs. The first and most obvious cost is the Token Tax. Multi-agent coordination is expensive. Spawning specialized workers for every story and running them through the gates burns significantly more tokens than a single long-context prompt. You are trading compute cost for a massive reduction in "human-in-the-loop" review time. That's a calculation that only makes sense for high-reliability features where a bug costs more than the tokens.

There is also the Latency Penalty. Sage is not built for raw speed; it's built for correctness. Even with parallel agents, a multi-stage pipeline has inherent overhead. You are trading the "instant" response of a single prompt for the thoroughness of a multi-gate verification process. Furthermore, the framework is currently a reference design built on Claude Code-specific primitives. While the core ideas are portable, the implementation relies on the specific behavior of Claude's agentic tools.

Perhaps the most fragile part of the system is the Orchestrator itself. In this version, the scheduler's logic (cycle budgets, deadlock detection, and dependency resolution) lives as instructions in a Markdown file. It is an LLM-managed state machine. This makes the system incredibly flexible and easy to iterate on, but it introduces a layer of non-determinism into the brain of the framework: if a future model reads those instructions differently, the behavior shifts. The fix I'm planning is to move the scheduling logic into typed Python and leave Markdown for the role prompts, not the control flow.
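To give a sense of what deterministic scheduling logic might look like, here is a small sketch covering the three responsibilities named above. It is speculative and not the planned implementation. The `PENDING` state, the `Story` fields, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class State(Enum):
    PENDING = "PENDING"  # hypothetical "not started" state, not named in the post
    IN_DEV = "IN_DEV"
    TESTING = "TESTING"
    DONE = "DONE"


@dataclass
class Story:
    id: str
    depends_on: List[str] = field(default_factory=list)
    state: State = State.PENDING
    cycles_used: int = 0


def ready_stories(stories: Dict[str, Story], cycle_budget: int) -> List[str]:
    """Pending stories with budget left whose dependencies are all DONE."""
    return [
        story.id
        for story in stories.values()
        if story.state is State.PENDING
        and story.cycles_used < cycle_budget
        and all(stories[dep].state is State.DONE for dep in story.depends_on)
    ]


def is_deadlocked(stories: Dict[str, Story], cycle_budget: int) -> bool:
    """True when work remains but nothing is running and nothing can start."""
    unfinished = [s for s in stories.values() if s.state is not State.DONE]
    running = [s for s in unfinished if s.state in (State.IN_DEV, State.TESTING)]
    return bool(unfinished) and not running and not ready_stories(stories, cycle_budget)


if __name__ == "__main__":
    stories = {
        "A": Story("A"),
        "B": Story("B", depends_on=["A"]),
        "C": Story("C", depends_on=["D"]),  # C and D wait on each other
        "D": Story("D", depends_on=["C"]),
    }
    print(ready_stories(stories, cycle_budget=3))
    stories["A"].state = State.DONE
    stories["B"].state = State.DONE
    print(is_deadlocked(stories, cycle_budget=3))
```

Because these are ordinary functions over typed data, the same inputs always give the same schedule, and the deadlock rule can be unit-tested directly.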

Conclusion: Beyond the Green Build

I didn't build Sage to be the fastest way to generate code; I built it to be a reliable way to ship a feature. The Breadcrumbs workflow that once produced 17 architecturally perfect files with no tests now runs a TestCreator before any Developer writes a line. The Melody app that launched to a blank "Hello Android" screen now has an itemized implementation map for every requirement and a cross-story regression suite that catches the bugs individual stories miss.

The failures of Breadcrumbs and Melody weren't just bugs; they were the blueprint for a system that holds AI-written code to the same bar as anything else we ship. "Done" has to be something the system can check, not something an agent gets to declare.

The gates aren't magic, and it's worth being honest about their edge. They catch completion lies and cross-story breakage. They don't catch architectural drift. Four stories can each ship perfectly reasonable, working code and still add up to something incoherent. Catching that still takes a human architect.

By moving the source of truth from agent memory to the file system, we move one step closer to an AI-driven workflow we can actually trust. The failures built the system. That's what scar tissue is for.

The framework, including the static-site-generator example and a fully-committed end-to-end reference run, is open source at github.com/merrickcr/sage-feature-team.


Top comments (1)

Alex Shev • Jun 12

The strongest part here is separating “produced code” from “earned confidence.” A lot of AI coding workflows still treat tests as a final accessory, so the agent can generate a clean-looking architecture and then quietly skip the proof. Having a different role own acceptance criteria and verification changes the failure mode: the system has to prove progress, not just describe it.