"My GPT-5 coding agent was a 10x developer until I actually looked at the Git diff"

Jenny — Sun, 22 Feb 2026 16:04:46 +0000

I love modern AI coding tools. I really do.

They are incredibly fast, hyper-confident, and fully willing to refactor your entire codebase because you casually typed the words "clean up the auth flow" at 2 AM.

Which is how I learned a very annoying, very expensive truth:
AI doesn't ship your project. Your workflow does.

For a while, my workflow was basically what everyone is calling "vibe coding." It looked like this:

Ask for a feature.
Watch the AI write 400 lines of code.
Ship it.
Fix the weird edge case it broke.
Repeat until my repo looks like a thrift store of mismatched design patterns.

It worked... until I tried to change something two weeks later and realized half the code was written for a version of my app (and a version of me) that no longer exists.

So, I ran an experiment on a real project task (small SaaS backend stuff: an auth tweak, a webhook handler, and a permissions edge case). Not huge, but enough surface area for the AI to cause some damage.

And here’s the punchline: The model didn’t matter as much as I expected. The spec did.

The Real Enemy is "Drift"

Most people vibe-code like this: You give the agent a vague prompt, chat back and forth, and let it "figure it out."

Even with the absolute bleeding-edge February 2026 models (GPT-5, Claude 4.5 Sonnet, Gemini 3.1 Pro), this approach eventually fails. Sometimes it holds up for a while. But usually, the AI does extra stuff you didn’t ask for:

It changes unrelated files.
It introduces new NPM dependencies.
It invents a completely new architecture mid-task.
It fixes the symptom but silently breaks an edge case.

All of which leaves you with code that technically compiles but makes future-you suffer. That’s drift. And drift is what turns "AI speed" into "AI debt."

What Changed Everything: The Micro-Spec

I don’t write long documents. I’m not trying to cosplay as a PM for my own side projects. But I realized I had to stop handing the AI "vibes" and start handing it rules.

Now, I write a one-screen execution spec before I touch any code. This is the template I use:

Goal: One sentence. What should happen when the feature is done?
Non-goals: What are we explicitly NOT doing?
Scope: Which specific modules/files are allowed to change?
Constraints: No new dependencies? Follow existing patterns?
Acceptance checks: Tests or behavior checks that prove it’s actually done.

Example (Realistic Webhook Task):

Goal: Handle subscription upgrade webhook from Stripe.
Non-goals: No database schema refactors. No UI changes.
Scope: billing.service.ts + webhook.route.ts ONLY.
Constraints: Handler must be idempotent. Strict signature verification.
Acceptance: Replay test passes. Invalid signature is rejected. Double-event test passes.
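
If you’d rather keep the spec next to the code than in a chat window, here is a minimal sketch of the same template as typed data. The `MicroSpec` interface and both helper functions are hypothetical names made up for illustration, not part of any tool mentioned here, and the acceptance wording is lightly paraphrased.

```typescript
// Hypothetical shape for a micro-spec; the fields mirror the template above.
interface MicroSpec {
  goal: string;
  nonGoals: string[];
  scope: string[]; // files the agent is allowed to change
  constraints: string[];
  acceptance: string[]; // checks that prove the task is done
}

// The webhook example, written as data.
const webhookSpec: MicroSpec = {
  goal: "Handle subscription upgrade webhook from Stripe.",
  nonGoals: ["No database schema refactors.", "No UI changes."],
  scope: ["billing.service.ts", "webhook.route.ts"],
  constraints: ["Handler must be idempotent.", "Strict signature verification."],
  acceptance: [
    "Replay test passes.",
    "Invalid signature is rejected.",
    "Double-event test passes.",
  ],
};

// Returns the names of any sections that are still empty.
function missingSections(spec: MicroSpec): string[] {
  const missing: string[] = [];
  if (spec.goal.trim() === "") missing.push("goal");
  if (spec.nonGoals.length === 0) missing.push("nonGoals");
  if (spec.scope.length === 0) missing.push("scope");
  if (spec.constraints.length === 0) missing.push("constraints");
  if (spec.acceptance.length === 0) missing.push("acceptance");
  return missing;
}

// Renders the spec as plain text to paste at the top of an agent prompt.
function renderSpec(spec: MicroSpec): string {
  return [
    `Goal: ${spec.goal}`,
    `Non-goals: ${spec.nonGoals.join(" ")}`,
    `Scope: ${spec.scope.join(" + ")} ONLY.`,
    `Constraints: ${spec.constraints.join(" ")}`,
    `Acceptance: ${spec.acceptance.join(" ")}`,
  ].join("\n");
}
```

If `missingSections(webhookSpec)` comes back non-empty, the task probably isn’t ready to hand to an agent yet.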

Once this exists, the AI stops improvising product decisions. It becomes an executor. Which is exactly what you want.
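
To make the example spec’s constraints concrete, here is a rough sketch of what "idempotent" plus "strict signature verification" could look like. Everything in it is an assumption for illustration: the `UpgradeEvent` shape, the in-memory stores, and the plain HMAC-SHA256 check over the raw body are stand-ins, not Stripe’s actual header format or SDK.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical event shape; a real Stripe payload has many more fields.
interface UpgradeEvent {
  id: string;
  customerId: string;
  plan: string;
}

type WebhookResult = "applied" | "duplicate" | "rejected";

// Generic HMAC-SHA256 over the raw request body (illustrative scheme only).
function sign(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

function isValidSignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(sign(rawBody, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// In-memory stand-ins for the database; a real handler would persist both.
const processedEventIds = new Set<string>();
const planByCustomer = new Map<string, string>();

function handleUpgradeWebhook(rawBody: string, signature: string, secret: string): WebhookResult {
  // Verify before parsing, so unsigned input never reaches business logic.
  if (!isValidSignature(rawBody, signature, secret)) return "rejected";

  const event = JSON.parse(rawBody) as UpgradeEvent;

  // Idempotency guard: an event id that was already handled is a no-op.
  if (processedEventIds.has(event.id)) return "duplicate";

  planByCustomer.set(event.customerId, event.plan);
  processedEventIds.add(event.id);
  return "applied";
}
```

The acceptance checks map onto the three return values: a correctly signed payload gives "applied", sending the same payload again gives "duplicate", and a wrong signature gives "rejected" without touching any state.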

The 2026 Tool Stack (What actually helps)

I’m not married to any tool, but after testing the latest updates, here is how they fit if you want to be spec-driven:

1. Chat Models (Claude 4.5 / GPT-5 / Gemini 3.1 Pro)

Best for: Drafting the micro-spec, listing edge cases, and suggesting acceptance tests.
Worst for: Directly prompting "build the whole feature" without constraints. That’s how you get scope creep.

2. Planning & Spec Layers (Traycer.ai)

If the project is big enough, your one-screen spec turns into multiple sections (auth, DB, UI). At that point, manual markdown gets messy.
Best for: Forcing file-level breakdowns before any code is written. I’ve started using planning extensions inside VS Code, specifically Traycer. You give Traycer your high-level goal, and it generates a strict Phase Plan. It’s not magic; it’s just structured enough that you stop handing agents vibes. Plus, it verifies the final code against the plan to catch hallucinations.

3. IDE Agents (Cursor / Copilot Workspace)

Best for: Implementing a strict Traycer spec in a scoped blast radius.
Worst for: Vague requirements. If you ask Cursor to "fix billing," it will still drift. Just faster.

The Workflow That Stopped My Repo from Haunting Me

Here’s what I do now for anything that actually matters:

Write the tiny spec (or use Traycer to generate the Phase Plan).
Ask a chat model to list edge cases I missed.
Execute in an IDE agent with a strictly locked scope.
Review the diff against the original spec. (Did it sneak in a dependency?)
Run tests and commit.
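
The diff review step is the easiest one to automate. Below is a minimal sketch of a drift check, assuming you’ve already collected the changed file paths (for example from `git diff --name-only`) and the dependency maps from the old and new package.json. `DiffSummary`, `DriftReport`, and `checkDrift` are hypothetical names made up for this example.

```typescript
// Hypothetical summary of a diff, gathered however you like.
interface DiffSummary {
  changedFiles: string[];
  depsBefore: Record<string, string>;
  depsAfter: Record<string, string>;
}

interface DriftReport {
  outOfScopeFiles: string[];
  newDependencies: string[];
}

function checkDrift(diff: DiffSummary, allowedFiles: string[]): DriftReport {
  const allowed = new Set(allowedFiles);
  // Compare on the file name, so "src/billing/billing.service.ts"
  // matches a spec entry of "billing.service.ts".
  const baseName = (path: string): string => path.split("/").pop() ?? path;

  return {
    // Files the agent touched that the spec's Scope line did not allow.
    outOfScopeFiles: diff.changedFiles.filter((file) => !allowed.has(baseName(file))),
    // Packages present after the change that were not there before.
    newDependencies: Object.keys(diff.depsAfter).filter((name) => !(name in diff.depsBefore)),
  };
}
```

If either list in the report is non-empty, that’s drift, and it’s a good moment to go back to the spec before committing.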

If something goes wrong, I don't yell at the AI. I update the spec first, then re-run it.

My hot take:
If you can’t write acceptance checks, the task isn’t ready for a GPT-5 or Claude 4.5 agent. It’s ready for you to sit down and think.

Questions for you guys (because I’m curious):