I get asked for receipts on every cost number I publish. So here is one full run, end to end, with screenshots replaced by file paths you can read on GitHub.
The feature is real. It shipped to a real fintech codebase in March 2026. The prompt, agent outputs, costs, and gate decisions are all recorded in tasks/W14-stripe-webhook-retry in that project (I am not linking the client repo, but I am willing to share artefacts on request).
The feature
"Add idempotent retry handling to our Stripe webhook receiver. If Stripe re-delivers an event (network blip, our 5xx response, manual replay), we should not double-process. PCI scope must stay SAQ-A."
Two sentences. No requirements document. This is roughly the level of brief I get from the team's PM most weeks.
What ran
```
$ npx great-cto init
…detected archetype: fintech (PCI-DSS)
…attached pack: api-platform-pack
…attached pack: pci-pack
```
Pipeline rolled forward unattended through:

- `archetype-detector` — 12 seconds. Scanned `package.json`, `infra/`, README. Flagged the Stripe webhook handler, BNPL flow, and three PII columns in the `users` table.
- `architect` — 4 minutes. Wrote `ARCH.md` with: idempotency key strategy (deterministic from `event.id`), 24-hour retention window for the processed-event log, exit criteria for "duplicate", interaction with the existing audit log.
- Gate: plan — I read the ARCH note for ~3 minutes, asked one clarifying question about the retention window ("why 24h, not the 7d that Stripe retries for"), the agent updated it to 7 days, I approved.
- `pm` decomposed the work into 4 tasks, scheduled 2 in parallel.
- `senior-dev` × 2 — 38 minutes, parallel git worktrees. Output: 6-file diff, +287 / −41 lines.
- `qa-engineer` — 11 minutes. Wrote 17 new tests, including a property-based test for replay ordering. Coverage on touched code: 94%.
- `pci-reviewer` (auto-attached by the fintech archetype) — 8 minutes. Verified no card data hits the new code path; the idempotency log table is excluded from CHD scope.
- `api-platform-reviewer` — 7 minutes. Checked the webhook signature replay window (5 min skew, ok), idempotency key collision math, Sunset header (n/a, internal endpoint).
- `security-officer` — 4 minutes. Verified no new secret access patterns; the audit trail covers the retry path.
- `code-reviewer` (12-angle) — 6 minutes. Three minor refactor suggestions, all accepted.
- Gate: ship — I saw 5 reviewer cards, scrolled through 2, approved.
- `devops` opened the PR. Branch CI was already green. Merged.
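To make the architect's strategy concrete: deterministic key from `event.id`, a processed-event log with a 7-day retention window, and "duplicate" as an explicit exit. Here is a minimal Python sketch under stated assumptions — the names (`ProcessedEventLog`, `handle_webhook`) are hypothetical, and an in-memory dict stands in for the real database table the agents built:

```python
import time

# Stripe retries delivery for up to 7 days, so the processed-event
# log keeps entries for the same window (the gate:plan correction).
RETENTION_SECONDS = 7 * 24 * 3600

class ProcessedEventLog:
    """In-memory stand-in for the processed-event log table."""

    def __init__(self):
        self._seen = {}  # event_id -> processed-at timestamp

    def mark(self, event_id: str, now: float) -> bool:
        """Record event_id. Returns False if it was already processed
        inside the retention window, i.e. this delivery is a replay."""
        self._purge(now)
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True

    def _purge(self, now: float):
        # Drop entries older than the retention window.
        cutoff = now - RETENTION_SECONDS
        self._seen = {k: v for k, v in self._seen.items() if v >= cutoff}

def handle_webhook(log: ProcessedEventLog, event: dict, now=None) -> str:
    """The idempotency key is deterministic: Stripe's event.id itself."""
    now = time.time() if now is None else now
    if not log.mark(event["id"], now):
        return "duplicate"  # already processed: ack with 200, do nothing
    # ...actual event processing would happen here...
    return "processed"
```

First delivery processes; a redelivery of the same `event.id` inside the window is acknowledged as a duplicate and does no work, which is exactly the "network blip, our 5xx, manual replay" behaviour the brief asked for.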
The receipts
Total wall-clock: 1h 26m. That includes the ~7 minutes I spent reading two artefacts and approving two gates. The agents themselves did about 78 minutes of work, mostly in parallel.
Total LLM cost: $1.42. Breakdown by agent (rounded):
| agent | cost |
| --- | --- |
| `architect` | $0.34 |
| `pm` | $0.04 |
| `senior-dev` × 2 | $0.62 |
| `qa-engineer` | $0.18 |
| `pci-reviewer` | $0.09 |
| `api-platform-reviewer` | $0.07 |
| `security-officer` | $0.05 |
| `code-reviewer` | $0.03 |
`senior-dev` is the cost driver (it writes the code). Reviewers are cheap (they output verdicts, not code). The whole run fit inside the team account's free monthly Anthropic credit.
What I would have done manually
About 4-5 hours of senior backend work for the code, plus 1-2 hours of PCI review (we have an internal expert) before merge. Call it $700-$900 in fully-loaded engineering time. So the cost ratio is roughly 500× cheaper, the time ratio roughly 4× faster, with the same level of review rigour.
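The arithmetic behind those ratios, as a quick check (the hourly figures are the post's own fully-loaded estimates, not measured rates):

```python
# Per-agent LLM costs from the breakdown above (USD, rounded).
costs = {
    "architect": 0.34, "pm": 0.04, "senior-dev x2": 0.62,
    "qa-engineer": 0.18, "pci-reviewer": 0.09,
    "api-platform-reviewer": 0.07, "security-officer": 0.05,
    "code-reviewer": 0.03,
}
total = round(sum(costs.values()), 2)  # the $1.42 headline number

# Manual baseline: 4-5h senior backend + 1-2h PCI review,
# estimated at $700-$900 fully loaded. Take the midpoint.
manual_cost_mid = (700 + 900) / 2
cost_ratio = manual_cost_mid / total   # ~560x, i.e. "roughly 500x"

# Wall-clock: pipeline took 1h 26m vs a 5-7h manual estimate.
pipeline_hours = 86 / 60
time_ratio = 6 / pipeline_hours        # ~4.2x, i.e. "roughly 4x"
```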
That is not "AI replaces the engineer." It is "AI does the mechanical 80% so the engineer spends 7 minutes on the part where judgment matters." The clarifying question I asked at gate:plan (retention window) is the kind of thing the agent would not have caught on its own.
⚠ Honest caveats
- This was a small, well-scoped feature with clear PCI implications already known to the team. For greenfield features in unfamiliar regulatory territory, expect 2-3× longer wall-clock and a chunkier ARCH approval cycle.
- The 78-minute agent runtime ran on Claude Sonnet 4.6. On Haiku it would be ~30% faster and ~3× cheaper, with measurably weaker `architect` output (we tried).
- One reviewer agent (the `data-platform-reviewer`) was opted out because we do not warehouse webhook events. If you do, that adds ~10 minutes and ~$0.10.
- The team's existing test suite was already in good shape. On a codebase with poor test infra, `qa-engineer` would either spend longer building scaffolding or punt — and you would notice at gate:ship.
What this proves and what it doesn't
It proves the pipeline works on a feature you would have hand-built. It does not prove the pipeline can run your whole engineering org. We have not tried that and I do not recommend it. Two human gates per feature is the upper bound for "responsible automation"; more gates means slower, fewer means broken prod (see my earlier post on missing gates for why two is the right number).
If you want the same level of detail for a non-fintech archetype — say a voice-AI MVP or a clinical decision support tool — DM me on Twitter and I will publish another one.
About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Pay your own LLM API. The full architecture diagram, with every node mapped to its source on GitHub, is here.