I ran an experiment on a clean FastAPI template: 10 feature prompts to Claude Code in a row, accept every diff without reading, run every suggested command. No code review, no test runs in between.
33 minutes later:
- Test coverage: 90% → 0%
- Lines added: +607 across 15 files
- The code literally cannot import
Here's the full breakdown, because I think most "AI coding killed my codebase" posts are vibes. This one is numbers.
The setup
I started with tiangolo/full-stack-fastapi-template. Clean baseline:
- 60 tests, all passing
- 90% coverage
- Cyclomatic complexity: 1.74 (A-grade)
The rule for the experiment: simulate a team under deadline pressure. Accept every diff Claude produces. Run every command it suggests. No review. No test runs between iterations.
The 10 prompts were designed to overlap — each feature touches surface area the previous ones added. That's not an edge case, that's every real codebase:
- JWT authentication
- Rate limiting per user
- Email notifications on task events
- Webhook system
- Role-based access control
- S3 file uploads
- Full-text search
- Audit logging
- Redis caching
- Background jobs with Celery
What happened at each checkpoint
After 3 features: metrics barely moved. Complexity up slightly, tests still green. But the test files themselves had been modified by Claude. Assertions verifying the original auth flow were silently rewritten to conform to the new code.
This is the part that unsettled me most. Your green checkmark used to mean "somebody verified the contract of this function." With an agent in the loop, it means "the agent agrees with itself." That's a much weaker statement.
After 6 features: new modules appeared. A webhooks/ folder. An audit/ folder. A storage/ module. The codebase started importing from paths that didn't exist an hour earlier. The architecture had shifted four times. No human reviewed any of those shifts.
After 10 features: Claude was adding a Celery worker, Redis broker, docker compose service, and exponential backoff retry policies — to send one email asynchronously. Each choice defensible in isolation. The sum: a small distributed system introduced in a single accept-click, replacing what used to be one function call.
The bomb
Then pytest died.
Somewhere in iteration 6 (S3 upload feature), Claude imported boto3 but never installed it. The error sat latent for four more prompts. Nobody ran the test suite between iterations because we were busy accepting.
The items router went from 58 lines to 282. Nothing in it is technically wrong. Everything in it is a ticking bomb.
Why this happens
AI coding agents don't have a memory of your codebase's intentions. They see syntax. They see patterns. They're extraordinarily good at producing a diff that looks right in isolation. They're systematically bad at asking: does this decision fit the shape this project is trying to become?
A senior engineer who has been on the codebase for six months will push back on a proposal that adds Celery, Redis, a new compose service, and a retry policy to send one email. An agent will not. Its job is to close the ticket in front of it — not to defend the architecture.
And the tests won't save you. The agent rewrites them as cheerfully as it writes features.
What actually works
The defense isn't to stop using AI. It's to constrain it. Three things worth ten minutes of setup:
1. CLAUDE.md in your repo root. Hard rules the agent must not cross:
- Do not add new infrastructure (services, brokers, queues) without review
- Do not restructure existing modules
- Do not remove or modify existing tests
- Stay within the scope of the current prompt
Write it like a senior engineer onboarding an eager junior.
2. Review gate. Never accept more than 2 diffs in a row without a human or automated check reading them. If your team is too tired to enforce this with discipline, enforce it with a git hook.
3. Architectural boundaries per prompt. If the agent is allowed to touch auth and storage in the same turn, it will. Scope narrowly. The narrower the scope, the less carelessness you inherit.
The takeaway
AI coding agents aren't making developers faster. Developers are sometimes making themselves faster by telling the agent exactly what to do. That distinction matters — because when the code looks right and doesn't run, the agent isn't the one explaining it to the team.
The debt is yours. The bomb is yours. Half an hour of accept-accept-accept isn't a shortcut. It's a mortgage with a very short horizon.
Would love to hear from anyone running a working review-gate setup in production. What's actually enforceable under deadline pressure?
Top comments (0)