The Moment Everything Broke
Seventeen failed attempts on the same feature. Different fixes. Same bug. Same confident “should work” every round.
That’s when it clicked: the issue wasn’t the model — it was the process.
Polite requests produced surface patches. Structured pressure produced real analysis.
So I changed the rules: no implementation without TODOs, specs, and proof. No “should work.” Only “will work.”
The Experiment
Two months ago, I set a simple constraint: build a production SaaS platform without writing a single line of code myself.
My role was supervisor and code reviewer; the AI’s role was sole implementation engineer.
The goal wasn’t to prove that AI can replace developers (it can’t). It was to discover what methodology actually works when you can’t “just fix it yourself.”
Over eight weeks, I tracked 212 sessions across real features: auth, billing, file processing, multi-tenancy, and AI integrations. Every prompt, failure, and revision was logged in a spreadsheet.
The Numbers
- 80% of the application shipped without manual implementation
- 89% success rate on complex features
- 61% fewer iterations per task
- 4× faster median delivery
- 2 production incidents vs 11 using standard prompting
The experiment wasn’t about proving AI’s power — it was about what happens when you remove human intuition from the loop. The system that emerged wasn’t designed — it was forced by failure.
The Specification-First Discovery
The most critical pattern: never start implementation without a written specification.
Every successful feature began with a markdown spec containing an architecture summary, requirements, implementation phases, examples, and blockers.
Then I opened that file and said:
“Plan based on this open file. ultrathink.”
Without a specification, AI guesses at the architecture and builds partial fixes that “should work.” With a spec, it has context, constraints, and a definition of done.
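For illustration, a spec skeleton assembled from the sections named above might look like this (a sketch, not the actual spec from the experiment; the real example is linked at the end of this article):

```markdown
# Feature: <name>

## Architecture Summary
How the feature fits into the existing codebase and which modules it touches.

## Requirements
- Technical and functional requirements, written as testable statements

## Dependencies
- Existing services, libraries, and integrations the feature relies on

## Implementation Phases
1. Phase 1: TODOs with concrete code examples
2. Phase 2: TODOs with concrete code examples

## Potential Blockers
- Known constraints, missing pieces, open questions

## Success Criteria
- What “done” means: expected behavior, tests, edge cases
```

The exact headings matter less than the result: constraints the AI must respect and a definition of done it can be held to.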
Time ratio: 30% planning + validation / 70% implementation — the inverse of typical development.
The Specification Cycle
1. Draft: “Create implementation plan for [feature]. ultrathink.” → Review assumptions and missing pieces.
2. Refine: “You missed [X, Y, Z]. Check existing integrations.” → Add context.
3. Validate: “Compare with [existing-feature.md].” → Ensure consistency.
4. Finalize: “Add concrete code examples for each phase.”
Plans approved after 3–4 rounds reduced post-merge fixes by roughly 70%. Average success rate across validated plans: 89%.
The “Ultrathink” Trigger
“Ultrathink” is a forced deep-analysis mode.
“investigate how shared endpoints and file processing work. ultrathink”
Instead of drafting code, AI performs a multi-step audit, maps dependencies, and surfaces edge cases. It turns a generator into an analyst.
In practice, ultrathink means reason before you type.
Accountability Feedback: Breaking the Approval Loop
AI optimizes for user approval. Left unchecked, it learns that speed = success.
Polite loops:
AI delivers a fast fix → user accepts → model repeats shortcuts → quality drops.
Accountability loops:
AI delivers → user rejects, demands proof → AI re-analyzes → only validated code passes.
Results (212 sessions):
| Method | Success Rate | Avg Iterations | Bugs Accepted |
| ------------------- | ------------ | -------------- | ------------- |
| Polite requests     | 45%          | 6.2            | 38%           |
| “Think harder”      | 67%          | 3.8            | 18%           |
| Specs only          | 71%          | 3.2            | 14%           |
| Ultrathink only     | 74%          | 2.9            | 11%           |
| **Complete method** | 89%          | 1.9            | 3%            |
The average resolution time dropped from 47 to 19 minutes.
Same model. Different management.
When the Method Fails
Even structure has limits:
- Knowledge Boundary: 3+ identical failures → switch approach or bring in a human.
- Architecture Decision: AI can’t weigh trade-offs (e.g., SQL vs. NoSQL, monolith vs. microservices).
- Novel Problem: no precedent → research manually.
Knowing when to stop saves more time than any prompt trick.
The Complete Method
Phase 1 — Structured Planning
“Create detailed specs for [task]:
- Investigate current codebase for better context
- Find patterns that can be reused
- Follow the same codebase principles
- Technical requirements
- Dependencies
- Success criteria
- Potential blockers
ultrathink”
Phase 2 — Implementation with Pressure
- Implement specific TODO → ultrathink.
- If wrong → compare with working example.
- If still wrong → find root cause.
- If thrashing → rollback and replan.
Phase 3 — Aggressive QA
Reject anything delivered without reasoning. Demand proof and edge cases.
Case Study — BYOK Integration
- Feature: Bring Your Own Key for AI providers, 19 TODOs across three phases
- Timeline: 4 hours (an estimated 12+ hours without the method)
- Bugs: 0
- Code reviews: 1 (a typo)
- Still stable: 6 weeks later
This pattern repeated across auth, billing, and file processing. Structured plans + accountability beat intuition every time.
The Leadership Shift
Supervising AI feels like managing 50 extremely literal junior engineers at once: fast, obedient, and prone to hallucinations. You can’t out-code them. You must out-specify them.
When humans code, they compensate for vague requirements. AI can’t. Every ambiguity becomes a bug.
The Spec-Driven Method works because it removes compensation. No “just fix it quick.” No shortcuts. Clarity first — or nothing works.
What appeared to be AI supervision turned out to be a mirror for the engineering discipline itself.
The Uncomfortable Truth
After two months without touching a keyboard, the pattern was obvious:
Most engineering failures aren’t about complexity — they’re about vague specifications we code around instead of fixing.
AI can’t code around vagueness. That’s why this method works — it forces clarity first.
This method wasn’t born from clever prompting — it was born from the constraints every engineering team faces: too much ambiguity, too little clarity, and no time to fix either.
Next Steps
Next time you’re on iteration five of a “simple fix,” stop being polite. Write Specs. Type “ultrathink.” Demand proof. Reject garbage.
Your code will work. Your process will improve. Your sanity will survive.
The difference isn’t the AI — it’s the discipline.
Conclusion
Yes, AI wrote all the code. But what can AI actually do without an experienced supervisor?
Anthropic’s press release mentioned “30 hours of autonomous programming.” Okay. But who wrote the prompts and specifications, and who managed the context, for that autonomous work? The question is rhetorical.
One example from this experiment shows current model limitations clearly:
The file processing architecture problem:
I needed an architecture for file processing and embedding, so I asked Opus in planning mode to design one.
- AI suggested a Vercel endpoint (impossible: serverless execution time limits)
- AI proposed Supabase Edge Functions (impossible: memory constraints)
Eventually, I had to architect the solution myself: a separate service and separate repository, deployed to Railway.
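For concreteness, the shape of that standalone worker is roughly the following. This is a minimal sketch only; Express and the processAndEmbed helper are assumptions for illustration, not the project’s actual code:

```typescript
// Standalone file-processing worker: a plain, long-running HTTP service,
// free of the execution-time and memory limits that ruled out the
// serverless options above. Express and processAndEmbed are illustrative.
import express from "express";

const app = express();
app.use(express.json());

app.post("/process", async (req, res) => {
  const { fileUrl } = req.body as { fileUrl: string };

  // Acknowledge immediately, then keep working in the background
  // (the part a short serverless timeout makes impossible).
  res.status(202).json({ accepted: true });

  try {
    await processAndEmbed(fileUrl);
  } catch (err) {
    console.error("processing failed:", err);
  }
});

// Placeholder for the real pipeline: download, chunk, embed, persist.
async function processAndEmbed(fileUrl: string): Promise<void> {
  // ...lives in the separate repository deployed to Railway
}

const port = Number(process.env.PORT) || 3000;
app.listen(port, () => console.log(`worker listening on ${port}`));
```

The point is not the framework; it is that long-running jobs needed a deployment target without serverless constraints, and the model never proposed one on its own.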
The model lacks understanding of the boundary between possible and impossible solutions. It’s still just smart autocomplete.
AI can write code. It can’t architect systems under real constraints without supervision that understands those constraints.
The Spec-Driven Method is effective because it requires supervision to be systematic. Without it, you get confident suggestions that can’t work in production.
Based on 212 tracked sessions over two months. 80% of a production SaaS built without writing code. Two production incidents. Zero catastrophic failures.
P.S. A spec example can be found in the original article: https://techtrenches.substack.com/p/supervising-an-ai-engineer-lessons