Karpathy coined "vibe coding" in February 2025 for the mode where you accept agent output without reading it carefully. A year later he reframed his own term: fine for throwaway experimentation, a skill-atrophy risk for anything you are supposed to own. The reframing is not a retraction. It is the acknowledgment that the reviewer muscle withers when it is not used, and the withered reviewer is what makes the agent's output undetectably wrong.
Five habits separate the operator from the vibe-coder. Each is pinned to a working 2025-2026 practitioner, and each has a specific failure mode it is designed to catch. I run them as the agentic-engineering.md rule in my framework, claude-code-agent-skills-framework. The QA-automation surface where the same pattern ships against a production-shaped workload is claude-code-mcp-qa-automation.
The paradigm
The operator's job is not to write code. It is to specify, route, and review code written by a fleet of agents. Three hats, rotated consciously:
- Systems architect. Designs the system — inputs, outputs, error budgets, eval harness, escalation paths — before any agent starts executing.
- Orchestrator. Routes work to the right agent primitive. Assembly-line tasks go to assembly-line agents; manager-worker tasks go to manager-worker trees.
- Reviewer. Reads every agent-produced diff. Can explain why each change is correct, or flag the ones that are not.
Without the three hats, "agentic engineering" collapses back into prompt-writing, which is exactly what LLMs are fastest at replacing.
Habit 1 — Spec first (Harper Reed)
Every agent-led task starts with a spec doc and a plan doc. The agent executes against the plan; the operator reviews the plan, not every line. Reed's workflow: brainstorm (idea.md), plan (plan.md), execute (Aider runs the plan).
The compounding artifact is the spec + plan, not the code. Code rots; specs compound across projects.
What this looks like in practice: before asking the agent to write anything, the operator has a markdown file named {feature}-plan.md with the I/O contract, the test cases that will verify it, the acceptance criteria, and the rollback plan. If the plan is never written, the spec never gets shared and the operator forfeits the review authority the rest of the habits depend on.
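That gate can even be mechanical. A minimal sketch, assuming hypothetical section names — nothing below is part of the framework's actual rule file:

```python
# Pre-dispatch gate: refuse to hand a task to an agent unless the plan
# doc contains every section the operator must review. The section
# names are illustrative, not a standard.
REQUIRED_SECTIONS = [
    "## I/O contract",
    "## Test cases",
    "## Acceptance criteria",
    "## Rollback plan",
]

def plan_is_reviewable(plan_text: str) -> list[str]:
    """Return the missing sections; an empty list means dispatchable."""
    return [s for s in REQUIRED_SECTIONS if s not in plan_text]

draft = "## I/O contract\n...\n## Test cases\n..."
missing = plan_is_reviewable(draft)
# missing names the two sections the operator still owes the plan
```

Ten lines of check is cheaper than one rubber-stamped merge.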
Habit 2 — Primary/secondary split (Geoffrey Litt)
Two parallel agent streams at any given time.
- Stream A (human-primary). Tight-loop design work — where the operator is still discovering what the system should do. The agent is a pair-programmer at most. REPL-style conversation on a hard design question.
- Stream B (async, executor). Well-defined execution tasks where the spec is settled and the work is mechanical. Rename this. Add test coverage here. Port this module to the new API. Dispatched async and reviewed in batch — hourly, end-of-day — not as they land.
The failure mode Litt names is treating everything as Stream B: firing off prompts, context-switching, reviewing nothing in depth. The symmetric failure is treating everything as Stream A: pairing with the agent on trivial refactors that should have been async execution.
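The routing decision is small enough to write down. A toy sketch — the Task fields and stream names are my illustration, not Litt's terminology:

```python
# Habit 2 as a routing function: settled, mechanical work goes to the
# async executor queue; open design questions stay in the tight loop.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    spec_settled: bool  # plan doc written and reviewed?
    mechanical: bool    # rename / port / add-coverage style work?

def route(task: Task) -> str:
    if task.spec_settled and task.mechanical:
        return "stream_b"  # dispatch async, review in batch
    return "stream_a"      # pair with the agent, tight loop

assert route(Task("port module to new API", True, True)) == "stream_b"
assert route(Task("design retry semantics", False, False)) == "stream_a"
```

The point is not the code; it is that the decision has exactly two inputs, so drift in either direction is detectable.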
Habit 3 — Agent primitive vocabulary (Shrivu Shankar)
Three reusable patterns for multi-agent work. Pick the one that matches the job shape. Do not default to the most complex.
Assembly-line. Sequential pipeline. Agent 1 stage 1, Agent 2 stage 2. Example: extract → validate → transform → load. Easy to debug because each stage has a single input and a single output.
Call-center. A router agent dispatches incoming work to one of several specialist agents. Example: triage agent routes a customer query to support / billing / engineering. The router is the bottleneck and the failure point.
Manager-worker. A manager agent decomposes a task into subtasks, dispatches each to a worker, then aggregates. Example: Claude Code with sub-agents for independent research streams.
The anti-pattern is manager-worker-as-default. It is the most expensive and most failure-prone pattern — state spreads, partial failures hide, aggregation bugs accumulate. If an assembly-line would work, use the assembly-line. Complexity cost is real.
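An assembly-line can be as little as function composition. A minimal sketch with placeholder stages — the CSV-ish rows are illustrative, not a real ETL:

```python
# Assembly-line primitive: each stage is a pure function with a single
# input and a single output, so a failing stage is trivially isolated.
from functools import reduce

def extract(raw: str) -> list[str]:
    return [line for line in raw.splitlines() if line.strip()]

def validate(rows: list[str]) -> list[str]:
    return [r for r in rows if "," in r]  # drop malformed rows

def transform(rows: list[str]) -> list[tuple[str, int]]:
    return [(k, int(v)) for k, v in (r.split(",") for r in rows)]

def run_pipeline(raw: str, stages) -> object:
    return reduce(lambda data, stage: stage(data), stages, raw)

result = run_pipeline("a,1\nbad row\nb,2\n", [extract, validate, transform])
# result -> [('a', 1), ('b', 2)]
```

Swap any stage's function body for an agent call and the debugging story stays the same: one input, one output, one place to look.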
Habit 4 — Evaluation on agent output (Hamel Husain)
Husain's manual-trace-labeling discipline applies to agent output the same way it applies to LLM output. Before trusting the agent at scale, the operator labels 20-100 real agent traces by hand. That labeled set becomes the eval harness.
When a new agent is deployed — or a new sub-agent inside an existing pipeline — the first 25-50 outputs are read in full, scored, and the failures are categorized into a small taxonomy. The taxonomy becomes the prompt for an LLM-judge, which is then calibrated against the operator's own scores. Only then does the pipeline run unsupervised.
Husain's demonstrated 90%+ human-judge agreement in his LLM-judge field guide is a workflow outcome, not a universal KPI — he explicitly warns that raw agreement misleads on imbalanced data. The discipline is the labeling, not the agreement number.
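The imbalance warning is easy to demonstrate with synthetic labels. If 90% of traces pass, a judge that never flags a failure still scores 90% raw agreement while catching nothing; Cohen's kappa corrects for that chance baseline:

```python
# Raw agreement vs Cohen's kappa on imbalanced labels. Numbers are
# synthetic, chosen only to show the failure mode.
from collections import Counter

def agreement(human, judge):
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human, judge):
    n = len(human)
    p_o = agreement(human, judge)
    h, j = Counter(human), Counter(judge)
    p_e = sum(h[k] * j[k] for k in set(h) | set(j)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

human = ["pass"] * 18 + ["fail"] * 2  # imbalanced: 90% pass
lazy_judge = ["pass"] * 20            # never flags a failure

print(agreement(human, lazy_judge))   # 0.9 -- looks great
print(cohens_kappa(human, lazy_judge))  # 0.0 -- no better than chance
```

This is why the discipline is the labeling, not the headline agreement number.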
Habit 5 — Foundational-fluency check (Andrej Karpathy)
Never deploy an agent at scale on a problem the operator could not solve at the 100-line scale. If the tiny version would be beyond the operator, the operator cannot review what the agent ships.
This is the implicit contract behind Karpathy's nanoGPT / micrograd / nanochat series. The lesson: understand the 40-line version before trusting the 40k-line version.
Before deploying an agent to build a production RAG system, the operator has built — from scratch, no agent help — a 40-line toy RAG pipeline. Before deploying an agent to fine-tune a model, the operator has manually done one epoch of gradient descent on paper. The toy version is what the operator reviews against. Without it, the review degrades into rubber-stamping, which is the failure mode the entire rule exists to prevent.
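For scale: a toy retrieval pipeline of the kind Habit 5 demands fits comfortably in this range. A sketch where bag-of-words cosine stands in for embeddings and the generate step is a stub (no vector DB, no LLM call):

```python
# Toy RAG at the 40-line scale: small enough to hold in your head,
# which is the whole point of the foundational-fluency check.
import math
from collections import Counter

DOCS = [
    "agents route work through an assembly line of stages",
    "manager worker trees decompose tasks and aggregate results",
    "labeled traces calibrate an llm judge before unsupervised runs",
]

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = bow(query)
    return sorted(DOCS, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def answer(query: str) -> str:
    context = retrieve(query)[0]  # a real system calls an LLM here
    return f"Based on: {context!r}"

print(answer("calibrate an llm judge"))
```

When the agent's production RAG misroutes a query, this is the mental model the operator debugs against.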
Five anti-patterns to flag on sight
If a teammate (or the operator themselves) is doing any of these, name the anti-pattern verbatim and reference the habit it violates.
1. Blind diff-accepting. Merging agent PRs without reading them. Violates Habit 5: the reviewer muscle atrophies exactly when the operator most needs it.
2. "It works, ship it" with no eval loop. Shipping agent output because the happy path returned 200 OK, without labeling traces. Violates Habit 4.
3. Reaching for a fancy agent when a regex baseline was not tried. Violates Eugene Yan's "start without ML." If the regex gets you 70%, the regex ships first; the agent handles the tail.
4. Habit drift to manager-worker for everything. Defaulting to the most complex primitive because it feels sophisticated. Violates Habit 3.
5. No spec, just a prompt. Asking the agent to do X without a written plan the operator can review and version. Violates Habit 1.
The anti-patterns are not theoretical. They are the specific shapes the drift takes when the operator is tired, the deadline is close, and the agent happens to produce something plausible. Naming them in advance is what makes them catchable in the moment.
The scale position this sits on
Every trade-off between "ship faster with the agent" and "derive first, build from scratch" is implicitly a position on a 0-10 axis. Position 3 is where LLMs are reaching competence fastest — shipping-mode applied work is the exact profile currently being consumed by coding agents. Position 5+ is where the moat lives: problem framing, quality judgment, cross-layer debugging, first-principles derivation.
The five habits are the practice that keeps the operator at Position 5 instead of drifting to Position 2. Position 2 is the prompt-writer. Position 5 is the architect whose review authority agents are built to serve, not replace.
Related practitioners worth reading
The canonical habits come from Reed, Litt, Shrivu Shankar, Husain, and Karpathy. A wider reading list for operator context:
- Shreya Shankar — Evals for AI Engineers and "Who Validates the Validators?" (UIST 2024). Validation design for LLM-judges.
- Addy Osmani — Beyond Vibe Coding. Engineering discipline layered on top of AI-assisted flows.
- Horace He — "Making Deep Learning Go Brrrr From First Principles." The compute / bandwidth / overhead mental model for when agent output is slow and you must diagnose why.
- Woosuk Kwon — vLLM / PagedAttention (SOSP 2023). OS virtual-memory paging applied to LLM KV cache — the production substrate you will deploy against.
- Simon Willison — continues to use "vibe coding" for throwaway experimental work, and is explicit about it. Reference for when the mode is appropriate.
The five habits come from the practitioners named in the sections above. The reading list is context for the hats the operator rotates through.
The practice is not glamorous. It is a plan doc before every non-trivial task. A labeled trace set before every new agent. A 40-line version before every fancy deployment. Five named checks you can run in under an hour that decide whether the agent is working for you or whether you are working for the agent.
Aman Bhandari. Operator of an AI-engineering research lab running Claude Opus as the coaching partner, plus a QA-automation surface shipping against a real sprint workload. Public artifacts: claude-code-agent-skills-framework and claude-code-mcp-qa-automation. github.com/aman-bhandari.