Rayanou

I stopped prompting Claude Code. I gave it a team instead.

I've been building with Claude Code for two months. The first few weeks taught me something that took longer than it should have to name.

I spent weeks trying to write the best, most effective prompts... but Claude doesn't drift because of a bad prompt. It drifts because there's no structure beneath the session. Every conversation starts fresh. Decisions get re-made. Context you thought was stable evaporates. You end up not with a codebase shaped by clear intent, but with a layered record of every session that started over.

That's not a model problem. It's a workflow problem.


Before I explain what I built — the part that happened two days ago

Two days before publishing this, the workflow caught two bugs in itself.

The strategic-PM agent was running an autonomous sprint and hit an ambiguous instruction in the project's CLAUDE.md: "one phase = one session." It interpreted this as requiring human approval between phases, which was exactly the opposite of what autonomous orchestration should do. Sessions are technical CLI isolation for context purity, not human gates. The agent had been operating with a subtly wrong mental model, and nothing had surfaced it until the system ran against itself.

The second bug: the /sprint-plan skill instructed Claude to enter plan mode inside claude -p sessions. In non-interactive mode, plan mode triggers an exit waiting for human approval. Exit code 0. Nothing written. Silent failure.
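A cheap guard against that failure mode is to treat exit code 0 as necessary but not sufficient, and require the phase's artifact to actually exist. A minimal sketch, assuming a one-file-per-phase artifact layout (the path convention is hypothetical, not part of the workflow's actual layout):

```python
from pathlib import Path

def session_succeeded(exit_code: int, artifact: Path) -> bool:
    """A claude -p session that slipped into plan mode exits 0 without
    writing anything, so 'success' means a clean exit AND a non-empty
    artifact on disk -- the exit code alone proves nothing."""
    return exit_code == 0 and artifact.exists() and artifact.stat().st_size > 0
```

With that check in place, the v3.5.1 bug would have surfaced as a loud failure instead of a silently empty sprint plan.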

Both are fixed in v3.5.1, documented in the CHANGELOG. I mention this here, at the top, because it's the most honest thing I can say about why this workflow matters: a system that catches its own inconsistencies before they reach production is the actual goal. Not a system that looks clean.


What I built

A sprint workflow, mainly inspired by the agile methods most teams already use. I built it because I kept seeing frameworks and CLAUDE.md templates being released, the kind you paste once and gradually forget. This is different: a methodology. The agent team runs the sprint; you do two things: invoke and validate.

The team:

  • 9 specialized agents in 3 groups: 3 strategic (PM, independent QA challenger, marketing strategist) · 5 technical (architect, code reviewer, security auditor, ops engineer, QA tester) · 1 ops monitor.
  • Each has a defined role, persistent memory across sessions, and instructions it can't override.
  • No agent reviews its own work.

The cycle:
/sprint-plan → /build → /review → /fix → (optional /red-team) → /capture-lessons

The skills:
18 skills encode every phase. You stop prompt-engineering the same context every sprint. The skill runs it.

Two modes:

  • Manual: you invoke each phase, validate, move on.
  • Autonomous: the strategic-PM orchestrates end-to-end. You review the PR.

Everything is plain markdown and JSON. Nothing to install beyond Claude Code.

Agent team architecture


The proof

Two months. 55+ sprints on a real production SaaS — multi-tenant, AWS Lambda + ECS Fargate, Stripe billing, currently serving live traffic.

| Metric | Value |
| --- | --- |
| Total API cost | $1,394.38 (via ccusage) |
| Without prompt caching | ~$13,000 |
| Cache hit rate | 95.5% |
| Sprint duration (autonomous mode) | 30–45 min end-to-end |

The ~9.3x ratio between what I paid and what it would have cost without caching comes entirely from workflow discipline: cached tokens cost roughly 10x less than fresh input tokens, so a session that opens with a warm cache costs a fraction of a cold one. CLAUDE.md stays small and stable because it's the cache anchor, and skills load deferred, so every session starts warm. That's it. No API trick, just consistency.

Every number is verifiable. ccusage parses Claude Code's local usage logs with daily breakdowns by model. The sprint history lives in the CHANGELOG with commit SHAs you can inspect.
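The arithmetic, as a back-of-envelope sketch: if cached reads cost about a tenth of a fresh input token (the discount I'm assuming for cache reads; cache-write premiums and output tokens are ignored here), a 95.5% hit rate gives an input-side multiplier of about 7x. The gap up to ~9.3x in my actual numbers comes from the pricing details this simplification drops.

```python
def cache_savings(hit_rate: float, read_discount: float = 0.10) -> float:
    """Cost multiplier vs. no caching, counting input tokens only:
    cache hits pay the discounted rate, misses pay full price."""
    effective = hit_rate * read_discount + (1 - hit_rate)
    return 1 / effective

print(round(cache_savings(0.955), 1))  # ~7.1x on input tokens alone
```

The useful takeaway isn't the exact multiplier; it's that savings are hyper-sensitive to the hit rate, which is why a churning CLAUDE.md quietly destroys the economics.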


What I'm not claiming

This isn't a claim that AI replaced my team. I'm one person operating with the discipline of a small team.

The 30–45 minutes applies to focused, well-scoped features in autonomous mode. Sprint duration scales with what you ask for; if you scope a complex MVP in a single sprint, you'll burn proportionally more tokens and time. Manual mode takes longer, but gives you finer control over each phase (useful when you want to inspect or redirect between steps). Large refactors can take hours. The workflow doesn't change that math; it makes it more predictable.

The $1,394.38 assumes a high cache hit rate and this specific usage pattern. Your numbers will differ.


If you want to go deeper — how the autonomous mode actually works

You write a direction file: what to build, what success looks like. The strategic-PM reads it, proposes a sprint plan, and launches each phase in its own isolated claude -p subprocess for context purity.

The strategic-QA agent independently reviews each sprint. It challenges PM decisions rather than just validating them. The PM defends its choices with evidence. "I agree" without explanation is explicitly forbidden. After 3 rounds without consensus, it escalates to a blockers file for you to decide.
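The challenge loop can be sketched as follows; `challenge` and `defend` stand in for the two agents' actual prompts, and the dict shape of the PM's response is my own invention for illustration:

```python
def review_sprint(challenge, defend, max_rounds: int = 3) -> str:
    """QA raises an objection each round; the PM must answer with
    evidence. Agreement without evidence doesn't count. No consensus
    after max_rounds means escalation to a blockers file."""
    for round_no in range(1, max_rounds + 1):
        objection = challenge(round_no)
        if objection is None:  # QA has nothing left to contest
            return "consensus"
        response = defend(objection)
        if response.get("evidence") and response.get("resolves"):
            return "consensus"
    return "escalate"
```

The point of the structure is the third branch: a defense that merely concedes, with no evidence attached, burns a round instead of closing the debate.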

You're not sitting there waiting. The separation between phases is technical isolation for context purity — not a human gate. The PM chains all phases in isolated subprocesses, you can close your terminal and come back. When it's done, there's a PR: plan, build output, review findings, fix log, retrospective. You decide if it ships.
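At its core, chaining the phases is a loop over isolated `claude -p` calls. This sketch assumes the phase names, a one-artifact-per-phase layout, and a blockers file, none of which are prescriptive; the runner is injectable so the loop can be exercised without the CLI:

```python
import subprocess
from pathlib import Path

PHASES = ["sprint-plan", "build", "review", "fix", "capture-lessons"]

def run_sprint(workdir: Path, run=subprocess.run) -> list[str]:
    """Each phase gets a fresh `claude -p` process, so no phase
    inherits another's context. Stop and leave a blockers file at
    the first phase that fails or produces no output."""
    completed = []
    for phase in PHASES:
        result = run(["claude", "-p", f"Run the /{phase} skill"],
                     capture_output=True, text=True)
        if result.returncode != 0 or not result.stdout.strip():
            (workdir / "blockers.md").write_text(f"{phase} produced no output\n")
            break
        (workdir / f"{phase}.md").write_text(result.stdout)
        completed.append(phase)
    return completed
```

Isolation is the design choice worth copying: a subprocess per phase means a confused /build session can't poison the context /review runs in.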

Sprint cycle


The repo

Open-sourced today: github.com/rbah31/claude-code-workflow

If you've been building with Claude Code and something feels off — the drift, the inconsistency, the feeling that each session starts from zero — it's not you. It's the single-session model. There's a better one. This is mine. Clone it, run a few sprints, tell me what you'd change.


This will be part of a series on building in production with AI (if I don't flop, I mean).

Top comments (1)

Alvarito1983 • Edited

The "Claude drifts because there's no structure beneath the session" framing is exactly right, and it took me longer than it should have to name it too.

I've been running a similar pattern — orchestrator + specialists — building a 6-tool Docker management ecosystem. What I found matches your experience: the parallel execution isn't the hard part. The hard part is scoping sub-agents so they're genuinely independent. The moment Agent 2 needs to "follow the pattern Agent 1 will establish," you've created a dependency that collapses the parallel gain.

The bug you caught — the agent misinterpreting "one phase = one session" as a human gate — is exactly the failure mode that makes me nervous about fully autonomous sprints. The system can be internally consistent and still be operating on a wrong assumption for weeks. Your fix (documenting it in CHANGELOG, surfacing it explicitly) is the right response. Most people would just fix it silently.

I published a piece on the sub-agent pattern today if you want to compare notes on the orchestration side: dev.to/alvarito1983/claude-code-pa...