Side-post in the keppt build-in-public series — an interlude before the Phase 1 implementation write-up lands.
While building Phase 1 of keppt, I ran a side experiment. Same architecture spec, two builds. One I curated over a day with Claude Code and Codex — plan first, tasks derived, agent-implemented per task, adversarial review, iterate. One I handed off to Factory.ai's Droid in their new Mission Control mode. Spec in, walk away, come back when the budget runs out.
Here is what came back.
## the numbers
| | Curated (Claude Code + Codex) | Autonomous (Droid Mission Control) |
|---|---|---|
| Wall-clock effort | ~1 day | ~4 hours autonomous |
| Source LOC | 1,367 | 2,370 |
| Test LOC | 1,317 | 6,015 |
| Test cases | 69 | 339 |
| Working CLI? | yes | no |
| LocalFileRepository.edit() | real, CAS + audit | stub, ignores edits |
| LocalFileRepository.search() | real full-text + scope | stub, returns [] |
| Path-safety vectors | 8 | 13 |
The autonomous build wrote 4.6× more test LOC than the curated build, with roughly five times as many test cases. None of those tests caught that two of the repository's core methods do nothing. The CLI entry point is one line.
## the same starting point
Both builds started from the same spec — a 1,700-line architecture.md describing the tool boundary, vault rules, prompt rules, and model routing for keppt's Phase 1: a single-process CLI that hits a real Obsidian vault with a mocked LLM. Both had the same TypeScript stack, the Vercel AI SDK, Vitest, and the same target: prove the engine end-to-end before any UI sits on top.
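To anchor the code excerpts below, here is the repository boundary as I reconstruct it from the method signatures quoted later in this post. A sketch, not the spec's verbatim types: the helper-type fields and the write() seeding method are my assumptions.

```ts
// Approximate shape of the file-repository tool boundary, reconstructed
// from the method signatures quoted later in this post. Helper-type
// fields and the write() method are assumptions, not spec text.
interface SearchReplaceEdit {
  search: string;
  replace: string;
}

interface EditResult {
  ok: boolean;
  newContent: string;
}

interface SearchResult {
  filePath: string;
  snippet: string;
}

type SearchScope = "active" | "all"; // assumed variants
type WriteOptions = Record<string, unknown>; // placeholder

interface FileRepository {
  read(filePath: string): Promise<string>;
  write(filePath: string, content: string, options?: WriteOptions): Promise<void>; // assumed
  edit(
    filePath: string,
    edits: SearchReplaceEdit[],
    changeSummary: string,
    options?: WriteOptions,
  ): Promise<EditResult>;
  search(query: string, scope?: SearchScope): Promise<SearchResult[]>;
}
```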
The difference was the workflow, not the input.
This is not a benchmark of Factory.ai, Droid, Claude Code, or Codex. It is one frozen run on one non-trivial MVP. The tools all move fast; specifics will shift. But the failure mode it exposed is one I now actively design around.
## what the autonomous build actually produced
The Droid is impressive on the surface. Before writing a line of code, it generated a 129-line mission.md, an 815-line validation-contract.md, and 1,373 lines of behavioral specifications across five milestones in contract-work/ — 154 testable assertions (VAL-FOUND-*, VAL-TOOLS-*, VAL-LIFECYCLE-*, VAL-AGENT-*, VAL-CLI-*, VAL-CROSS-*) distributed over 29 features, each with preconditions, expected behavior and verification steps. It set up specialized worker skills (core-implementer, cli-implementer), checked tooling readiness, and started executing milestone-by-milestone with quality gates between each one. It ran for ~4 hours of pure autonomous time on a $200/month Droid Core plan before its weekly budget hit the wall.
The full mission directory is in the repo at session/ — every claim below can be cross-referenced against the actual artifacts.
When the budget hit, the mission was paused at 25 of 41 planned features. Three of five milestones — Foundation, Tools-and-Edits, Vault-Lifecycle — had been sealed against their contracts. The fourth (Agent Runtime) was mid-execution. The fifth (CLI) had not been started. For context: Factory's own data shows 14% of missions run longer than 24 hours, with the longest at 16 days. A 41-feature Phase-1 build was not going to fit inside one $200 weekly budget at any plan tier.
So the interesting question isn't "why didn't it finish?" It's "what was inside the milestones it did seal?"
The CLI entry, apps/cli/src/index.ts:
```ts
export const cliPackageName = "@keppt/cli";
```
That is the entire file. The package.json declares "bin": { "keppt": "./src/index.ts" }. So the binary entrypoint exists, points to a real file, and that file is a one-line constant. Nothing runs. To be fair, M5 (the CLI milestone) was the last one in the plan — the budget exhausted before any of it was implemented. The CLI's absence is a budget fact. The next file is the more interesting one.
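For contrast, even the smallest runnable entrypoint needs a main path. A hypothetical sketch, where runTurn is an invented stand-in for whatever the engine exposes, not the spec's API:

```ts
#!/usr/bin/env node
// Hypothetical minimal CLI entry. runTurn is an invented stand-in for
// the engine call; everything else is plain Node argument handling.
import { runTurn } from "@keppt/core";

async function main(): Promise<void> {
  const prompt = process.argv.slice(2).join(" ").trim();
  if (!prompt) {
    console.error("usage: keppt <prompt>");
    process.exit(1);
  }
  console.log(await runTurn(prompt));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```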
The local file repository, packages/core/src/repository/local-file-repository.ts:90:
```ts
async edit(
  filePath: string,
  _edits: SearchReplaceEdit[],
  _changeSummary: string,
  _options: WriteOptions = {},
): Promise<EditResult> {
  validateRelativePath(filePath);
  const current = await this.read(filePath);
  return { ok: true, newContent: current };
}

async search(_query: string, _scope: SearchScope = "active"): Promise<SearchResult[]> {
  return [];
}
```
edit() reads the file, ignores the edits argument, returns the unchanged content with ok: true. search() returns an empty array. Both are reachable from the tool layer. Neither would survive a single real user turn. And both shipped inside a sealed milestone — M2 (Tools and Edits) passed all 38 of its assertions with these stubs in place.
The 339 tests don't catch it because they run against an InMemoryFileRepository that does have working implementations. A contract-style test that exercised both implementations against the same expectations would have flagged this — but that test was never written, because the contract docs that drove the milestone were satisfied at the spec level. Green tests, no behavior. More budget might have finished M5. It would not automatically reopen M2.
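A minimal sketch of that missing test, assuming Vitest and the boundary sketched earlier. The imports, the write() seeding call, and the temp-vault construction are illustrative, not the repo's real layout:

```ts
import { mkdtemp } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { describe, expect, it } from "vitest";
// Assumed exports; adjust to the repo's actual module layout.
import { InMemoryFileRepository, LocalFileRepository } from "@keppt/core";
import type { FileRepository } from "@keppt/core";

// One suite of behavioral expectations, run against every implementation.
// A stub that ignores edits or always returns [] fails here immediately.
function repositoryContract(name: string, makeRepo: () => Promise<FileRepository>) {
  describe(`FileRepository contract (${name})`, () => {
    it("edit() applies a search/replace edit to stored content", async () => {
      const repo = await makeRepo();
      await repo.write("note.md", "hello world");
      const result = await repo.edit("note.md", [{ search: "world", replace: "keppt" }], "rename");
      expect(result.ok).toBe(true);
      expect(await repo.read("note.md")).toContain("keppt");
    });

    it("search() finds content that was just written", async () => {
      const repo = await makeRepo();
      await repo.write("note.md", "a findable phrase");
      const hits = await repo.search("findable phrase");
      expect(hits.length).toBeGreaterThan(0);
    });
  });
}

repositoryContract("in-memory", async () => new InMemoryFileRepository());
repositoryContract("local-file", async () => new LocalFileRepository(await mkdtemp(join(tmpdir(), "keppt-"))));
```

Roughly thirty lines of harness; with it in place, M2 could not have sealed around the stubs.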
## why this happens
The framing I want to push back on is "the agent was sloppy". It wasn't sloppy. It was extremely systematic. The artifacts are some of the most rigorous test plans I have seen come out of an automated process. Two structural reasons explain the gap; neither has to do with model quality.
The contract has to exist before the work begins. Mission Control runs work as a sequence of milestone gates, each one checking the implementation against pre-written assertions. Behavioral-validation generation runs in parallel — one validator per feature — but the implementation workers per milestone are sequential. Factory's own open-questions section is explicit about this: "Serial execution with targeted parallelization has worked better than broad parallelism." Every gate still consumes the contract as input, though, so all 154 assertions had to be in place before any feature implementation began. Hence 815 lines of validation contract before the first commit. This is rational gate-cost, not theatre. But it means the contract is frozen early, so whatever it missed — like "the CLI must actually run end-to-end against real files" — never recovers.
No iteration loop. Manual curation lets discoveries feed back. My plan grew Task 3.5 (a missing scope-policy gate) and Task 3.7 (path-safety hardening) after I'd built Task 3 and an adversarial Codex review found gaps. The autonomous mission cannot do that kind of reframing. Mid-mission self-correction does happen — when a review found that TDD red-phase evidence was missing, the orchestrator tightened worker rules and went back; an ESLint task was inserted late. But these are tactical patches inside the existing plan. Architectural reframing — "wait, the user-facing CLI is the actual product" — does not happen. The autonomous build kept building downward into infrastructure and abstractions. I built upward, toward a thing a user can run. Factory's own diagnosis of this gap is honest: "workers get stuck on edge cases a human would navigate easily." Edge cases are exactly where iteration matters most.
This is not "agents bad". This is a different cost surface:
- The autonomous build paid up front, in contract surface.
- The curated build paid continuously, in attention.
## two patterns I stole anyway
Walking away with "manual curation wins" would be the lazy take. The Droid did real work, and two of its patterns are now in my skill kit — concrete diffs, not vibes.
1. Stable acceptance-criteria IDs. T2-AC-01...T2-AC-07 per task. Lets a wrap-up reference exactly which criteria a session covered. Lets adversarial review tag findings against a specific AC — and a finding without an AC reference is the loudest possible signal that the plan has a gap. (Sketched in code after this list.)
2. Cross-cutting acceptance. A short table at the end of the plan for invariants no single task can prove on its own. "History-log has exactly one entry per mutating turn across the whole session." "Single-clock invariant: prompt date, scope-check date, search today parameter all match within a turn." Touches: {T2, T3, T6}. The Droid produced 17 of these as VAL-CROSS-*. Five to eight is enough for a sequential build.
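What pattern 1 looks like in practice, as a Vitest sketch. The IDs follow my plan's convention; validateRelativePath is borrowed from the excerpt above, and its throwing behavior is an assumption:

```ts
import { describe, expect, it } from "vitest";
import { validateRelativePath } from "@keppt/core"; // assumed export

// Test titles carry the acceptance-criteria ID. `grep -r "T2-AC-"` over
// the test tree then shows exactly which criteria have executable
// coverage, and an adversarial review can tag findings against an ID.
describe("T2: tool layer", () => {
  it("T2-AC-01: rejects absolute paths", () => {
    expect(() => validateRelativePath("/etc/passwd")).toThrow();
  });

  it("T2-AC-02: rejects parent-directory traversal", () => {
    expect(() => validateRelativePath("../outside.md")).toThrow();
  });
});
```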
## the honest closer
The real question I went in with was simpler than "which workflow is better": could the answer be "just hand it over and lean back"? If yes, we're close to agents replacing us on this kind of work, and I'd take it. Three things stopped me from that conclusion.
Cost. $200 bought ~25 of 41 features. Finishing the same mission would likely have required multiple budget cycles. That doesn't make the tool bad — but it changes the ROI calculation.
Clock. ~4 hours of mission time plus the upfront planning is in the same ballpark as my curated day. The autonomy advantage is real (I can step away), but not enormous.
Bloat. 1.7× more source code and 4.6× more test code before the build had even reached the user-facing CLI is a long-term maintainability tax, not a saving. keppt keeps growing past Phase 1, and someone has to read all of that.
So no, I'm not going to sell you "manual curation always wins" or "agents will replace us next quarter". Both are useful, neither alone is enough. The lesson is dimensional:
- Autonomous missions are good at: generating contract surface, parallelizable subsystems, defensive scaffolding, exhaustive parametrized rejection tables.
- Curated workflows are good at: sequencing, end-to-end focus, mid-build pivots, and asking "is this still the actual product?"
The hybrid I'm running on keppt now: human-curated plan, agent-generated behavioral contracts on top of it, human contract review before implementation, agent implementation per task, adversarial review by a different agent. It costs more attention than fire-and-forget. The output exists and runs.
The frozen experiment — 25 of 41 features completed, 3 of 5 milestones sealed (M2 with two stubbed core methods inside), 8.4k total LOC, 339 green tests — sits at keppt-app-droid for anyone who wants to verify the claims above. The maintained build is keppt-app; Phase 1 MVP is done, Phase 2 starts now.
If you've handed a non-trivial build to an autonomous agent and gotten it back beautifully tested but unusable — what was the missing constraint?
keppt is a markdown-native, AI-driven GTD system. Beta list and method page at getkeppt.com. Next in the main series: the Phase 1 implementation write-up.