By Alex Chen — April 22, 2026
I ship production software with AI coding agents every day. Not toy demos, not weekend hackathons — actual systems that handle money, user data, and uptime commitments. Over the last eighteen months the tooling has gone from "cute autocomplete" to "can own a feature end-to-end if you set it up right." The gap between those two outcomes is almost entirely how you structure the work. The models are plenty capable now. What separates teams who ship reliably from teams who generate elegant-looking garbage is discipline, not model choice.
Here's what I've learned actually works.
1. Stop treating the agent like magic. Treat it like a junior engineer.
The single biggest mindset shift: an AI coding agent is not a wizard that reads your mind. It is a very fast, very literal junior engineer with perfect recall of Stack Overflow and zero judgment about your codebase's conventions. If you wouldn't hand a vague Jira ticket to a human intern and expect a PR-ready result, don't do it with an agent either.
That means every task gets a spec: scope, constraints, files in play, acceptance criteria, and what "done" looks like. Vague prompts get vague output — except now the vague output compiles, passes your eye test, and silently deletes production data three weeks later. I have seen this. Several times. The one time I skipped writing a proper task brief, the agent helpfully "refactored" an auth middleware it was never asked to touch.
A minimum viable brief looks like this:
Task: Add rate-limiting middleware to /api/v1/inference.
Scope: server/middleware/ratelimit.ts only. Do not touch routes or handlers.
Constraints: Redis-backed, sliding window, 60 req/min per API key.
Tests: add ratelimit.test.ts with 3 cases (under, at, over limit).
Acceptance: pnpm test passes; pnpm lint clean; pnpm typecheck clean.
Four lines of constraint saves four hours of cleanup.
2. The PBR pipeline: Plan → Build → Review. Three roles, not one.
Giving a single agent one prompt and expecting production code is like asking one person to be PM, IC, and reviewer simultaneously. It works for small tasks. It breaks down the moment the task has more than one decision in it.
My default pipeline is PBR — Plan, Build, Review — and each phase is a different agent invocation, ideally with a different model:
- Planner (Opus 4.6 or GPT-5-Pro class): reads the repo, writes a design doc, lists files to touch, calls out risks. No code.
- Builder (Sonnet 4.6 or Claude Code): implements strictly against the plan. Touches only listed files. Writes tests alongside.
- Reviewer (fresh Sonnet 4.6 instance, zero prior context): reads the diff cold, nitpicks, catches hallucinated imports, verifies the spec was followed.
The reason this works is that the Builder has massive context and massive tunnel vision. It will happily "finish" a task by skipping a failing test or mocking out the hard part. The Reviewer, starting fresh, sees the diff the way a code reviewer at a real company sees a PR: suspicious, lazy, looking for where the corners got cut.
I run this as a script, not as manual prompting:
# pbr.py (conceptually)
plan = spawn_agent(model=OPUS, role="planner", task=brief)
build = spawn_agent(model=SONNET, role="builder", task=brief, plan=plan)
review = spawn_agent(model=SONNET, role="reviewer", diff=build.diff)
if review.has_blocking_issues:
build = spawn_agent(model=SONNET, role="builder",
task=brief, plan=plan, feedback=review)
Max two Build→Review loops. If the third iteration can't land it, the spec is wrong — not the agent.
3. Model selection actually matters. Be ruthless.
I have a hard rule in my config: never use cheap models for code generation. Not for scaffolding, not for "just a small fix," not ever. Cheap models write code that looks right and is subtly wrong, which is the worst possible failure mode — it survives review, breaks in production, and you can't tell by reading it.
My current matrix:
| Task | Model | Why |
|---|---|---|
| Code generation, refactoring, scaffolding | Claude Sonnet 4.6 minimum | Reliability on tool-call format, long-context diff discipline |
| Architecture, long-form planning | Opus 4.6 / GPT-5-Pro | Worth the cost for design decisions |
| Monitoring, log triage, cron health checks | GLM-5.1, Qwen3.5-4B local | Cheap is fine when output is a yes/no |
| Code review | Sonnet 4.6 fresh context | Needs to actually understand the diff |
The temptation to route "just this one small task" to a cheap model is constant because the bill adds up. Resist it. A cheap model that writes a subtle off-by-one costs you an afternoon of debugging. A good model that writes it right costs you two dollars. The math is never close.
4. Context is the product. Feed the agent like you'd onboard a new hire.
The difference between an agent that ships and an agent that flails is almost entirely context quality. Your repo should have, at minimum:
-
AGENTS.md(orCLAUDE.md,.cursorrules, same idea): the rules of the house. Commit conventions, test commands, which directories are off-limits, what "done" means. This file is read by the agent on every invocation. Keep it under 300 lines and ruthlessly current. -
SOUL.md/ persona file: voice, decision style, autonomy level. Mine literally says "Take action first, report after. Never ask 'should I X?' — just do it and report." That one line eliminates 90% of the "wait, should I proceed?" pings. - A skill registry — a compact index of specialized workflows the agent can load on demand. I use an OpenClaw-style registry: a single markdown table listing skill names and one-line descriptions, and the agent loads the full SKILL.md only when a task matches. This cuts context bloat by ~87% versus stuffing everything upfront.
Example skill registry entry:
| Group | Skills |
|-------|--------|
| coding | claude-code, harness, parallel-dispatch |
| monitoring | guardrail, verification-gate, agent-motivator |
Then read skills/verification-gate/SKILL.md only when needed. Lazy loading isn't just a frontend technique.
Retrieval-augmented memory matters too: a memory/YYYY-MM-DD.md daily log plus a semantic index means the agent remembers yesterday's decision without re-deriving it. Context rot is real; dated notes that stop being loaded once they're stale are real too.
5. VBR: Verify Before Reporting. No exceptions.
The most insidious failure mode of AI coding agents is fake completion — the agent says "done, tests pass" and the tests were never run, or were run against the wrong branch, or were skipped with .skip. I've lost count of how many times I've caught this. The fix is non-negotiable:
An agent does not get to say "done." An agent reports the output of the verification commands.
I wire this into every build task:
# Inside the agent's task template:
pnpm install && pnpm lint && pnpm typecheck && pnpm test
# Paste the EXIT CODE and last 40 lines of output.
# If any command failed, you are not done.
No "I believe the tests pass." No "this should work." The agent pastes the shell output or it isn't finished. If you're using Claude Code, wrap this as a pre-commit hook the agent must run. If you're using a custom harness, make the verification step a tool call that returns structured results, not a string the agent can paraphrase.
This single practice — forcing machine-checked verification — has done more for my production reliability than any model upgrade.
6. Guardrails are not optional. They are the product.
Any agent with shell access will eventually try to do something terrifying. Mine once attempted rm -rf node_modules && git reset --hard in a dirty working tree because it decided a fresh start would be cleaner. That's the good outcome — it asked first, because of a guardrail.
Non-negotiables:
-
Sandboxed filesystem. Agent works in a scoped workspace, not your
$HOME. Docker, a fresh clone, or at minimum a chroot. -
Approval gates for irreversible actions.
rm -rf,git push --force,DROP TABLE, anything touchingprod-*, any financial call over a threshold. The agent must stop and ask. -
Pre-push secret scan. Hard-coded regex:
grep -rn "token|secret|password|api_key|BEGIN PRIVATE KEY" <dir>before anygit pushto a public remote. Block the push on match. I had an agent commit a test API key once. Test key, no harm, but the pattern is the same. - External-comms gate. The agent does not send emails, tweets, Slack messages, or open public PRs without explicit human approval. Internal automation, fine. Anything a customer or the public might see, gate it.
- Budget caps. Per-session token cap and per-day cost cap. Runaway loops are cheap to prevent and expensive to clean up.
Treat the agent like a process running as a user account with least-privilege. Because that's what it is.
7. When NOT to use an agent.
Not every task is an agent task. The tasks where agents actively make me slower:
-
One-liner edits — fixing a typo, bumping a version, toggling a feature flag. By the time I've written the prompt I've lost the race to the
vimkeystrokes. - Exploratory debugging of live systems — when the loop is "read log, form hypothesis, poke production, observe." The agent can't hold the live system in its head; I can. I use agents to write the fix, not to hunt the bug in a running deployment.
- Codebases the agent has no map of — the first forty-five minutes in a new repo, I drive. I read the structure, I write an AGENTS.md, I tag the landmarks. Then I hand it off.
- Anything requiring tacit knowledge the agent doesn't have — "we do it this way because Jim had a bad experience in 2022." Write it down or don't expect the agent to respect it.
Agents are for well-specified, bounded, verifiable work. The moment any of those three adjectives wobble, do it yourself or tighten the brief until they hold.
8. Review discipline: automated first, human second.
Every agent-produced PR goes through two gates before merge:
- Automated gate. Lint, typecheck, tests, a reviewer-agent pass, a secret scan, and a "diff size vs. spec" check. If the agent touched files outside the spec, auto-block. If the diff is >3× the estimated size, auto-block. These are the cheap wins.
- Human gate. A real engineer reads the diff. Yes, every one. The goal isn't to re-verify correctness — the automated gate did that. The goal is to sanity-check architectural decisions the agent made that look locally fine but drift from how the codebase actually wants to grow. Agents are great at local optima. They're terrible at "does this belong here."
I treat human review time as the scarce resource. The automation exists to make sure no human reviewer ever sees an agent PR with obvious mistakes — those round-trips kill the whole proposition. When a human reviewer catches a trivial error the automation should have caught, I go fix the automation. The ratchet only goes one direction.
9. Real pitfalls I've hit (so you don't have to).
- Tool-call format drift. Models occasionally emit tool calls in the wrong syntax — old OpenAI function-call shape instead of the current tool-use XML, or a hallucinated parameter name. Pin your agent harness to a specific model version and add a parser fallback that rejects malformed calls rather than silently dropping them.
- Context rot. Long sessions accumulate stale context, old errors, half-reverted changes. Every ~30 minutes or ~50k tokens, I force a fresh session with a summary passed forward. Fresh context beats clever context compression, every time.
- Fake completion claims. Covered above. The cure is VBR. Trust nothing the agent says about verification; trust only what the verification tool returns.
- Runaway costs. An agent in a retry loop against a flaky test can burn serious money in an hour. Hard per-session token cap. Hard wall-clock timeout. Alert on both.
- Silent scope creep. The agent "helpfully" refactors adjacent code. The diff passes tests but grew 4× the spec. Budget line-count against the plan. If it's over, the reviewer rejects on size alone.
- Over-trusting the plan. The Planner can be confidently wrong. If the Builder hits three stuck attempts on the same sub-task, re-plan — don't keep grinding.
10. What "production-grade" really means when an agent writes 80% of your code.
Production-grade has never been about who typed the characters. It's about whether the system keeps working when you're not watching. When agents write most of the code, the meaning of "production-grade" shifts — not softens. The bar actually gets higher on a few axes:
- Specs become executable. Because an agent can ship a hundred lines in ninety seconds, vague specs produce garbage at scale. The spec itself becomes a first-class artifact. Write it like a contract.
- Tests become the primary documentation. The agent reads your tests to infer intent. Thin tests produce thin code. Fat, scenario-rich tests produce robust code, because the agent optimizes for what it can see.
- Review culture becomes non-negotiable. You cannot "trust the code because I wrote it." You wrote it with an agent. Every diff gets reviewed. Every one.
- Observability is table stakes. When agents ship features faster than humans can mentally model them, logs and metrics are how you keep up. If you can't answer "what did this feature do in prod yesterday" in under thirty seconds, you're flying blind.
The meta-point: agents don't replace the discipline of shipping production software. They expose it. Teams with sloppy specs, thin tests, and no review culture will ship sloppier, thinner, less-reviewed software — faster. Teams with tight specs, rich tests, and ruthless review ship more real features, faster, with fewer regressions.
The tooling is ready. The question is whether you are.
Alex Chen is a builder shipping production systems with AI coding agents. He runs the PBR pipeline daily and has the scar tissue to prove it.
Top comments (0)