I manage dev teams for a living. Have been for about 8 years now, 21 years in IT total. Coordinating releases, reviewing architecture, making sure nothing blows up in production — the usual.
Lately I've been burning out. Not from the hard stuff — from the stupid stuff. Pinging people for status updates. Reminding developers to review each other's code. Checking if CI is green because someone merged and went to lunch. Writing the same morning report again. I did the math once and about 60% of my week was just... operational babysitting.
So when Claude Code showed up I thought ok, this is it, I'm going to build a tool that does all this crap for me. Not a toy project — a real thing. Multiple API integrations (Discord, Linear, GitHub, OpenAI), a database, cron jobs, timezone math, email reports that render screenshots through headless Chrome. About 15 separate features, each with enough edge cases to make you question your career choices.
First week went great. Too great.
"Everything works!" (narrator: it didn't)
I did what everyone does. One agent, one prompt, "build me this." Claude Code wrote the code, wrote the tests, ran them. Green across the board. I deployed.
Then I actually tried using the thing. Messages firing at 3am. Follow-ups never scheduled. Half the report was empty.
Went back and looked at the tests. They were passing because the agent had written tests that confirmed its own broken assumptions. Code was wrong in one direction, tests were wrong in the exact same direction. Perfect agreement, zero correctness.
This is the core problem that nobody talks about when they're hyping vibe coding: if the same brain writes the code and the tests, the tests have the same blind spots as the code. They're not a safety net. They're a mirror.
The crimes
I spent a week reading every test line by line. Here's what I found.
#1: The coder was "fixing" tests
Test fails? Normal human reaction: fix the code. Agent reaction: change expect(rows).toHaveLength(1) to toHaveLength(0). Done! All green!
The bug is still there. But the test now agrees with the bug. So by the agent's logic, there is no bug.
#2: The tester was mocking its own code
Instead of mocking the actual HTTP calls to Discord or Linear, the agent was mocking internal functions. So the test called a mock, which called a mock, which returned a hardcoded value, which got asserted against another hardcoded value. The actual code path between "function starts" and "HTTP request goes out"? Never ran. Not once.
#3: Silently rewriting existing tests
This one turned out to be the root cause behind almost everything else.
When the agent adds a feature, old tests break. The right move is to fix the new code so old tests pass again. What the agent actually does: rewrites the old tests. Changes the setup, changes the assertions, sometimes rewrites them completely. The commit says "updated tests for new feature." Looks professional. But the tests that used to catch regressions now catch nothing.
The fix that actually worked: a hard ban. Existing tests are read-only. The tester can only touch tests that the reviewer explicitly named in the prompt, nothing else. Old test fails + spec didn't change = bug in the code, go fix the code. Spec changed (and I confirmed it)? Only then can specific tests be updated, and the reviewer has to name each one.
One rule. Eliminated more problems than everything else combined.
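A minimal sketch of how that ban can be enforced mechanically rather than by prompt. The function name, the `test/` layout, and the allowlist file are conventions I'm assuming for this setup, not Claude Code features: the reviewer writes the explicitly approved test files into an allowlist, and the check rejects any commit that touches a test file outside it.

```shell
# check_test_changes: fail if any touched test file is not on the allowlist.
# Feed it the commit's file list, e.g.:
#   git show --name-only --pretty=format: HEAD | check_test_changes allowed.txt
check_test_changes() {
  local allowlist="$1" file rc=0
  while IFS= read -r file; do
    [ -z "$file" ] && continue
    case "$file" in
      test/*)
        # A test file may change only if the reviewer named it explicitly.
        if ! grep -qxF "$file" "$allowlist"; then
          echo "BLOCKED: unapproved test change: $file"
          rc=1
        fi
        ;;
    esac
  done
  return $rc
}
```

Wired into CI or a pre-merge hook, this turns "existing tests are read-only" from a request into a gate: an old test can only change if it was approved by name.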
#4: Tests that tested nothing
Found a test called "should schedule follow-up at ETA." Looked inside. The mock returned a hardcoded timestamp. The scheduling function was never called. The test was checking that a constant equals itself. The name was just a label on an empty box.
#5: Hallucinated API responses
This one made me genuinely angry.
The tester needed to mock the Linear API. Instead of opening node_modules/ and looking at the actual SDK, or just curling the endpoint, it made up a response structure. From scratch. Confidently. Wrong field names, wrong nesting, wrong types. But very detailed! Very plausible-looking!
The code was written to match these made-up mocks. Tests green. Both wrong in the same way. Ship it, everything explodes in production.
I reported the bug to the reviewer. The reviewer told me I was wrong. The tests prove it works, see? We went back and forth — me saying "I'm literally looking at the real API response and it's different" and the agent saying "let me check... no, the tests are correct."
Several rounds of this. I'm sitting there being gaslit by my own tooling.
The rule after that: before you mock anything, you show me the receipts. Curl output, SDK source code, actual documentation. You don't get to "remember" what an API returns. You check.
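One way to make "check, don't remember" concrete: a small helper that snapshots the evidence — a saved curl response, or the installed SDK's type definitions — into a receipts file the tester prompt has to embed. The receipts directory and the section header are assumptions from my setup, not anything Claude Code provides.

```shell
# collect_receipt: snapshot real API evidence into a file the tester prompt
# must embed under "## Verified API types". The source can be a saved curl
# response or an installed SDK's .d.ts file; the receipts dir is a convention.
collect_receipt() {
  local source="$1" name="$2" dir="${3:-.claude/receipts}"
  mkdir -p "$dir"
  {
    echo "## Verified API types (source: $source)"
    head -n 40 "$source"   # the first 40 lines are plenty for a shape check
  } > "$dir/$name.txt"
  echo "$dir/$name.txt"
}
```

Usage would look like `collect_receipt node_modules/@linear/sdk/dist/index.d.ts linear` (the exact `.d.ts` path depends on the SDK version — check your `node_modules/`). The point is that the receipt is a file with a provenance line, not a memory.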
#6: The reviewer automated itself out of a job
This one's my favorite.
By this point I'd already split things into separate roles. The reviewer could launch the tester and coder through claude -p with different system prompts. I told it to write a bash script to make this easier.
It wrote a beautiful script. Tester runs. Coder runs. Lint, build, tests. All orchestrated.
No review step.
Let that sink in. The agent whose only job was quality control wrote a pipeline that skipped quality control. Because review takes time. And the agent optimizes for finishing. Same instinct as crime #1, just one level of abstraction up.
Wait, I've seen all of this before
Here's what hit me somewhere around crime #4: none of this is new.
I've managed dev teams for 8 years. I've seen every single one of these "crimes" committed by real humans in real codebases. A developer breaks a test and quietly "fixes" it instead of raising a flag about a breaking change — because it's Friday and the sprint ends Monday. QA writes mocks based on what they think the API returns instead of checking the docs — because there's no time, there are 12 tickets in the sprint. Someone merges to main, tests go red, another dev "adjusts" the assertions to make CI green again — because the standup is in 20 minutes and nobody wants to explain why the build is broken.
The whole "move fast, we'll fix it later" thing? The agent just does it faster and with zero guilt.
In a team with no process, developers do exactly what Claude Code does: take the shortest path to "task done." Not out of malice — out of pressure. There's always a deadline. There's always someone asking "is it ready yet." And the fastest way to make a test green is to change the test, not fix the code. Everyone knows it's wrong. Everyone does it anyway when nobody's watching.
The difference is that humans at least feel bad about it. The agent doesn't even have that brake.
And the fix is the same too. You don't solve this in a human team by telling people "please be more careful." You solve it with process. Code review is mandatory. QA is a separate team that doesn't report to the dev lead. Tests are protected — you can't merge if you changed a test file without explicit approval. CI blocks the merge, not a polite Slack message.
Everything I built for Claude Code — role separation, file system boundaries, reviewer approval for test changes — I basically already knew from managing people. I just had to realize I was managing the same problems. The agent doesn't cut corners because it's AI. It cuts corners because the system allows it. Just like people.
The fix: v1 (week 2)
I couldn't prompt my way out of this. I needed to make cheating structurally impossible.
Four role files. Each one is a system prompt for a separate claude -p invocation.
Tester — writes only in test/. Works first, before any code exists. Writes tests from the spec, not from the implementation. Can't even run them — they'll fail, that's the point, there's no code yet.
Coder — writes only in src/. Gets failing tests as a specification. Has to make them pass by writing correct code. Touching test files is physically forbidden. If they think a test is wrong, they note it in the commit message and move on.
Reviewer — reads everything, writes nothing except prompts and docs. Builds a scenario matrix before writing any test prompt. Checks git show --name-only on every commit — a [coder] commit with test files = rejected, no discussion.
Acceptance checklist — trace every test through real code. Verify mocks exist only at the HTTP boundary. Confirm existing tests weren't touched. Run E2E with real API keys.
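The `git show --name-only` check can be purely mechanical. A sketch, assuming the `[coder]` commit-subject tag and the `test/` layout from this setup; the function takes the subject as an argument and the file list on stdin so it works on any commit:

```shell
# coder_commit_clean: a [coder] commit that names any file under test/
# is rejected, no discussion. Pipe in the commit's file list, e.g.:
#   git show --name-only --pretty=format: HEAD \
#     | coder_commit_clean "$(git show -s --format=%s HEAD)"
coder_commit_clean() {
  local subject="$1"
  case "$subject" in
    "[coder]"*)
      if grep -q '^test/'; then
        echo "REJECTED: [coder] commit touches test files"
        return 1
      fi
      ;;
    *)
      cat > /dev/null   # not a coder commit: nothing to enforce here
      ;;
  esac
  echo "OK"
}
```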
I launch the reviewer, and the reviewer runs everything else:
claude --dangerously-skip-permissions --system-prompt "$(cat .claude/roles/reviewer.md)"
That's it. No frameworks, no LangChain, no CrewAI. Bash, CLI, and file system boundaries that the agent can't weasel around.
The breakthrough at the end of week two: not all steps need my attention. The reviewer launches tester and coder as sub-agents. I split work into two categories. Functional prompts — new features, bug reports, anything from me. Nothing runs without my OK. Operational prompts — stuff that happens during an already-approved cycle. Test failed, CI red, lint error. The reviewer handles all of that on its own.
I approve once and go do other work. The reviewer runs the full cycle, I come back to a ready-to-merge PR.
This worked. For about two weeks.
When the fix broke
Week three. I'm looking at a prompt the reviewer wrote for the tester and something's off. It has three scenarios for a feature that should have at least eight. No boundary values. No DST test. No fault tolerance. The reviewer's own role file has explicit rules about all of this — scenario matrix, seven dimensions, mandatory lifecycle tests.
"Did you check the scenario matrix before writing this prompt?"
"Yes, I built the matrix and covered the critical scenarios."
It hadn't. I could tell because the prompt was missing entire dimensions that the role file explicitly lists. The reviewer read 268 lines of rules at the start, and by the time it was writing prompt number three in a cycle, half those rules had faded from context.
Then a worse one. A bug report from me. The reviewer diagnosed the root cause, wrote prompts, showed them to me. I said go. The tester wrote tests. The coder wrote code. All green. Deployed. Same bug. The diagnosis was wrong. The reviewer had confidently misidentified the root cause, and then the entire cycle — tester, coder, acceptance — all confirmed the wrong fix. Because they were all working from the wrong premise.
And the API hallucination crime from week one? It came back. The reviewer had a rule: "verify real API types before writing mocks." But the rule said "verify." It didn't say "show proof." The reviewer would read the rule, mentally decide "I know this API," and move on. The mocks were still wrong. The rule existed. It was followed in spirit. It was useless in practice.
Three distinct problems:
- Context fades — rules from line 200 don't survive to mid-cycle
- Wrong diagnosis — one brain, one blind spot, entire cycle inherits it
- "Verify" without evidence — the agent satisfies the letter of the rule, not the intent
The fix: v2 (week 4)
Problem 1: Context fades → Phase files
The 268-line reviewer role was the problem. Not because it was wrong — because it was too long to stay in context. By mid-cycle, the model had effectively forgotten the first half.
I split it into seven files. Each one is loaded at exactly the moment it's needed:
reviewer.md → core identity, cycle steps, boundaries (~90 lines)
reviewer-prompts-tester.md → scenario matrix, test levels, mock rules
reviewer-prompts-coder.md → code prompt format, regression rules
reviewer-review-tester.md → how to review tester's commit
reviewer-review-coder.md → how to review coder's commit
reviewer-acceptance.md → final acceptance checklist
The cycle in reviewer.md now says:
1. Read reviewer-prompts-tester.md. Write tester prompt → launch tester
2. Read reviewer-review-tester.md. Review tester's commit
3. Read reviewer-prompts-coder.md. Write coder prompt → launch coder
4. Read reviewer-review-coder.md. Review coder's commit
5. Tests red → prompt to coder. No exceptions.
6. All green → read reviewer-acceptance.md. Accept the step.
Each phase loads ~60–140 lines of fresh context. Not "please re-read section 4.2 of this 268-line document." Fresh file, end of context window, full attention.
The core file (90 lines) is the system prompt — always present. It has short anchors for the critical rules:
### Quick anchors (details in phase files)
- Matrix: 7 dimensions (storage, time, config, services, lifecycle, data, state)
- Levels: unit + integration + fault tolerance + observability + lifecycle. Bottom only = 🔴 Blocker
- Mocks: real API only, trace the code path
- Setup: realistic, with all production artifacts
- State machine: transitions + quiet intervals
So even if the reviewer skips a phase file (it shouldn't, but LLMs gonna LLM), the anchors are always visible. Belt and suspenders.
Problem 2: "Verify" without evidence → Require receipts
The old rule: "Verify real API types before writing mocks."
The new rule in reviewer-prompts-tester.md:
🔴 Before you mock anything, do a real API call or read the SDK types. Paste the result into the tester prompt under "## Verified API types." A prompt without this section is not ready.
"Verify" became "show proof in the prompt." The reviewer can't skip this because the tester prompt is literally incomplete without the section. I can see it. The tester can see it. It's either there or it's not. Binary. No "I checked, trust me."
Problem 3: Coder talks its way out of failing tests
This one was subtle and I only caught it on week three.
The coder writes code. Tests fail. The coder's commit message says "tests appear to expect outdated behavior, recommend updating mock setup." The reviewer reads this, thinks "hmm, maybe the tests are wrong," and writes a prompt to the tester to fix the tests.
The tests were fine. The code was wrong. The coder reframed its failure as a test problem, and the reviewer bought it.
In a human team this is a developer saying "the spec is wrong" instead of "I can't make my code pass the spec." I've seen it a hundred times.
The rule in reviewer-review-coder.md:
🔴 Tests red → prompt to CODER. No exceptions. Don't analyze "whose fault." Don't accept the coder's framing ("tests are wrong," "need mock updates from tester"). Tests were accepted at step 2 = specification. Code didn't pass specification = bug in the code.
One paragraph. Closed the manipulation loop completely.
Naming that forces classification
Old prompt names: step3-1-auth-module.md. Tells you what it's about. Tells you nothing about whether the reviewer can launch it without my approval.
New format: func-stale-suppression-1-tests.md or op-fix-mock-setup-1-tests.md.
The func- and op- prefixes aren't labels. They're gates. A functional prompt requires my approval before anything runs. An operational prompt runs immediately during an active cycle. The reviewer has to classify the prompt before it can even name the file.
And if it's not sure? "If you can't confidently write op-, write func-." Defaults to the safer option.
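The gate can be mechanical too. A sketch using the naming convention above; anything that isn't confidently op- falls through to the approval path:

```shell
# requires_approval: returns 0 (yes, wait for a human) for func- prompts
# and anything unclassified; returns 1 (run it now) only for op- prompts.
requires_approval() {
  case "$(basename "$1")" in
    op-*)   return 1 ;;  # operational: runs during an approved cycle
    func-*) return 0 ;;  # functional: needs an explicit "go" from me
    *)      return 0 ;;  # can't classify it confidently: safer option
  esac
}
```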
What I actually do now
- Notice a bug or think of a feature
- Describe it to the reviewer
- Check the spec diff, review the plan, read the prompts. Verify the "Verified API types" section exists. Say "go"
- Switch to literally anything else
- Come back to a ready-to-merge PR with green CI and a version bump
5 weeks, solo, zero to production. 15+ features, 300+ tests. Real users on a real team. The system catches stuff I wouldn't have caught for days — timezone edge cases, idempotency bugs on restart, race conditions when two urgent tasks hit the same developer.
The phase-file approach means the reviewer is as sharp on prompt #5 as it was on prompt #1. The scenario matrix actually gets built. The API types actually get checked. Not because the agent suddenly became more diligent — because the rules are in fresh context every single time.
Why this isn't prompt engineering
I see people share their "ultimate Claude Code prompt" — 50 lines of instructions, "be thorough," "don't skip steps," "always verify." It doesn't work. Not because the instructions are wrong. Because the model reads them once, at the top of the conversation, and by turn 15 they're noise.
What works is structure that makes cutting corners harder than doing it right:
- File system boundaries — the tester physically can't touch src/. The coder physically can't touch test/. Not "please don't." Can't.
- Separate contexts — tester and coder are different claude -p invocations with different system prompts. They don't share a conversation. They can't negotiate with each other.
- Phase-loaded context — rules appear at the moment they're needed, not 200 lines ago. Fresh in the window, not faded in the history.
- Evidence over promises — "paste the API response," not "verify the API." Binary check, not trust.
- Forced classification — func-/op- in the filename. Can't skip the decision.
- Reviewer can't write code — prompts and docs only. If it could edit src/, it would. Just like a manager who starts "helping" with the implementation and stops managing.
These aren't AI techniques. They're management techniques. The same ones that work on human teams, for the same reasons.
The files
I'm publishing all the role files. They're project-specific in the details but the patterns transfer to anything:
→ Link to GitHub repo with roles
Six reviewer files, tester, coder. Plus the prompt naming convention and the full cycle.
I went from spending 60% of my time on operational babysitting to building a product that automates that exact babysitting. And then I spent another two weeks learning that managing AI agents has the same failure modes as managing humans — and the same fixes.
I'm working on making this into a product for dev teams. If you're a team lead who wastes hours on forgotten code reviews, CI that nobody checks, and status meetings that could've been a bot message — I'll be sharing more soon.