Ziv Kfir

Why the V-Model Is the Natural Way to Work with AI Coding Agents

The 1980s solved a problem that AI just re-introduced.


TL;DR

The V-model — born from aerospace and defense to manage complexity in large-scale systems — was designed to solve three main problems: shrinking shared memory across growing organizations, organization drift ("I say A, you understand B"), and coordination collapse as system complexity scales. AI coding agents have the exact same problems: limited context windows, ambiguous natural language, and compounding chaos when multiple agents or sessions interact. As coding becomes cheap, project complexity will explode. The discipline of aerospace, refined from Apollo-era systems engineering to the later V-model, will enable high-quality AI-based agentic coding.


The Original Problem: Shared Memory Shrinks as Organizations Grow

In 1975, Fred Brooks published The Mythical Man-Month. The core insight: adding people to a software project doesn't make it faster — it makes coordination overhead grow at N(N-1)/2. Ten people means 45 communication channels. Twenty means 190. The more people involved, the smaller the fraction of the project any single person can hold in their head. [1]

The V-model emerged in the 1980s as a direct response — developed in the software industry as an evolution of the traditional waterfall model. [2] Rather than hoping everyone stays aligned through meetings and hallway conversations, the V-model creates formal contracts between phases:

  • Goals → validated by E2E tests
  • Spec (black-box use cases) → validated by E2E tests
  • Architecture (SDD) → validated by component tests
  • Implementation → validated by unit tests (TDD)
Goals ─────────────────────── E2E Tests
  ↓                              ↑
  Spec (Black-Box UCs) ───── E2E Tests
    ↓                            ↑
    Architecture (SDD) ───── Component Tests
      ↓                          ↑
      Implementation ──────── Unit Tests (TDD)

Each left-side phase produces a written contract. Each right-side phase verifies that contract. The shared memory of the organization lives in these documents — not in people's heads.
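To make "contract" concrete, here is a minimal sketch of what a left-side spec line and its right-side verification can look like. The spec sentence, the `get_profile` function, and the session shape are all invented for illustration, not taken from any cited project:

```python
# Spec line (left side of the V): "Given a valid session, GET /profile
# returns the user's display name; an invalid session gets a 401."
# The test below is the right-side verification of that contract.

def get_profile(session):
    """Toy stand-in for the system under test."""
    if session.get("valid"):
        return {"status": 200, "display_name": session["user"]}
    return {"status": 401}

def test_profile_contract():
    # Contract clause 1: valid session -> 200 with the display name
    ok = get_profile({"valid": True, "user": "ada"})
    assert ok["status"] == 200 and ok["display_name"] == "ada"
    # Contract clause 2: invalid session -> 401, no profile data leaks
    denied = get_profile({"valid": False})
    assert denied["status"] == 401 and "display_name" not in denied

test_profile_contract()
```

The point is not the code itself but the pairing: if either the spec sentence or the implementation changes, the test makes the mismatch visible.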

This wasn't just theory. NASA's Systems Engineering Handbook structures its entire mission lifecycle around V-model-shaped processes with successive stages of increasing system definition and maturity. [3] DO-178C (aviation software), ISO 26262 (automotive safety), and IEC 62304 (medical devices) all mandate V-model-shaped processes. [4] When the cost of failure is a crashed spacecraft or a malfunctioning pacemaker, you don't rely on "I think we agreed on X in yesterday's standup."

NASA and other aerospace teams were already using rigorous systems engineering on Apollo-style programs decades before the V-model label became common; the V-model later formalized those patterns. [22]


The AI Parallel: Same Problems, Different Actor

Now replace "team members" with "AI coding agents." The problems are structurally identical.

Problem 1: Limited Shared Memory

A human organization's shared memory shrinks as it grows. An AI agent's memory is always small — bounded by its context window. Even with 1M tokens, a real codebase doesn't fit. The agent can't hold the full picture any more than a 200-person org can hold it in collective memory.

Cursor tested giving agents shared context and letting them coordinate through a shared file. Each agent could check what the others were doing, claim tasks, and update status. It failed: agents held locks too long, forgot to release them, and 20 agents ended up producing the output of 2-3. Most time was spent waiting. [5]

Steve Yegge's Gastown framework independently arrived at the same conclusion after four different failed orchestration patterns: peer coordination does not scale. [5] The solution both converged on? Isolated workers with external contracts — specs, task queues, structured state — replacing shared memory.
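The "isolated workers with external contracts" pattern can be sketched in a few lines. The task format and names below are invented for illustration; the queue stands in for the external contract, and workers never read each other's state:

```python
# Toy sketch: workers share no memory, only a task queue (input contract)
# and a results store (output contract). Run single-threaded here for
# clarity; threaded workers would follow the same discipline.
from queue import Queue

def worker(task_queue, results):
    while not task_queue.empty():
        task = task_queue.get()
        # Each worker sees only its own task, never another worker's state.
        results[task["id"]] = f"done:{task['payload']}"
        task_queue.task_done()

tasks = Queue()
for i, payload in enumerate(["spec-a", "spec-b", "spec-c"]):
    tasks.put({"id": i, "payload": payload})

results = {}
worker(tasks, results)
print(results)  # {0: 'done:spec-a', 1: 'done:spec-b', 2: 'done:spec-c'}
```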

A Google/MIT study (December 2025) quantified this: once single-agent accuracy exceeds about 45% on a task, adding more agents yields diminishing or negative returns. In tool-heavy environments (10+ tools), multi-agent efficiency dropped by a factor of 2-6 compared to single agents. [6] [7]

AWS built Kiro — their agentic IDE — around the same insight. Kiro's core innovation isn't faster code generation — it's forcing developers to write a testable specification before any code gets generated. The spec becomes the shared memory. [8]

Problem 2: Organization Drift — "I Say A, You Understand B"

In human organizations, requirements mutate as they pass through people. Product manager says "fast." Designer hears "minimal UI." Developer implements "skip validation." The V-model's traceability matrix exists specifically to catch this: every requirement maps to a design element, maps to code, maps to a test case. If something drifts, the gap is visible. [4]
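A traceability matrix can be as simple as a mapping you can check mechanically. This toy sketch, with invented requirement IDs and test names, flags any requirement that has no test tracing back to it:

```python
# Toy traceability check: every requirement must be claimed by at least
# one test. IDs and descriptions are invented for illustration.
requirements = {
    "REQ-1": "CLI args override file values",
    "REQ-2": "defaults apply last",
}
tests = {
    "test_cli_overrides_file": ["REQ-1"],
    "test_defaults_apply_last": ["REQ-2"],
}

def untraced(requirements, tests):
    """Return requirement IDs with no test tracing back to them."""
    covered = {req for reqs in tests.values() for req in reqs}
    return sorted(set(requirements) - covered)

print(untraced(requirements, tests))  # [] -> every requirement is covered
```

In a real project the mapping would be extracted from test metadata rather than written by hand, but the check itself stays this simple: drift shows up as a non-empty list.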

AI agents have the same problem — except worse:

  • Ambiguous prompts: "Add photo sharing to my app" forces the AI to guess format, permissions, storage, size limits, error handling — dozens of unstated assumptions. As the arXiv SDD paper puts it: "AI models are excellent at pattern completion but poor at mind reading." [9]

  • Context pollution: Cursor found that "specifications would mutate as agents misremembered or misinterpreted earlier choices." Quality degraded within hours, regardless of context window size. The system experienced entropy — losing coherence over time. [5]

  • Semantic drift in hierarchies: Deep agent hierarchies (3+ levels) "accumulate drift as objectives mutate through delegation layers" — essentially playing telephone. Two-tier systems significantly outperform both flat architectures and deeper ones. [5]

The data backs this up. CodeRabbit's State of AI vs Human Code Generation report analyzed 470 GitHub pull requests and found AI-generated code produces 1.7x more logic issues than human-written code. Not syntax errors — the code doing the wrong thing correctly. [10] [11]

Google's 2025 DORA report found that AI adoption has a negative relationship with software delivery stability — teams with higher AI usage showed increased change failure rates and more time spent on rework. The report notes that "AI is being adopted so quickly, both within and outside of software delivery, that our ways of working haven't been able to keep up." [12] [13]

The code ships faster, but it's often more wrong.

The V-model's answer — executable specs, traceability from requirements to tests, contracts between phases — directly addresses this. When the spec says "merge CLI args over file values over defaults" and the component test verifies exactly that, there's no room for the agent to drift into "merge file values over CLI args" because it seemed reasonable.
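That precedence contract is easy to pin down as a component test. The sketch below uses a hypothetical `merge_config` function; the assertion encodes exactly the "CLI over file over defaults" rule, so a drifted implementation fails immediately:

```python
# Hypothetical config merge, illustrating a spec clause pinned by a test:
# CLI args override file values, which override defaults.

def merge_config(defaults, file_values, cli_args):
    merged = dict(defaults)
    merged.update(file_values)  # file values override defaults
    merged.update(cli_args)     # CLI args override everything
    return merged

def test_cli_over_file_over_defaults():
    result = merge_config(
        defaults={"port": 8080, "debug": False, "host": "localhost"},
        file_values={"port": 9000, "debug": True},
        cli_args={"port": 7000},
    )
    # port from CLI, debug from file, host from defaults
    assert result == {"port": 7000, "debug": True, "host": "localhost"}

test_cli_over_file_over_defaults()
```

An agent that "reasonably" swaps the precedence order turns this test red on the spot, instead of shipping a config system that silently does the opposite of the spec.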


The Coming Complexity Explosion

Here's where it gets interesting. Coding is becoming cheap — approaching zero marginal cost. Anthropic's Mike Krieger stated that Claude is now effectively writing the vast majority of its own code — what began as "70, 80, 90%" has approached near-100% AI-generated contributions. [14] [15] StrongDM's AI team of three engineers built what would have previously required a 10-person team, using a "software factory" approach where engineers don't even look at the code directly. [16] [17] Cursor has grown to roughly $500M ARR with a small team, generating extraordinary revenue per employee. [18]

Every time in economic history that production costs collapsed, demand exploded:

  • Desktop publishing didn't eliminate designers — it created a universe of design work
  • Phone cameras didn't kill photography — they multiplied it by orders of magnitude
  • Mobile didn't replace developers — it multiplied the number of applications the world needed

Software is about to go through the same expansion, except bigger. Every business process running in spreadsheets, email, and phone calls is up for grabs. Every workflow that wasn't worth automating at $200/hour engineering rates becomes viable at $2 in API calls. [19]

Anthropic's 2026 Agentic Coding Trends Report predicts: "When agents can work autonomously for extended periods, formerly non-viable projects become feasible. Technical debt that accumulated for years because no one had time to address it gets systematically eliminated." [20] Yet their research also shows that while developers use AI in roughly 60% of their work, they can "fully delegate" only 0-20% of tasks — the rest requires active human oversight, validation, and judgment. [20]

This means the software systems of the near future will be dramatically more complex than today's. Not because individual components will be harder — but because there will be far more of them, with far more interdependencies, built far faster.

Think about it like the space program. Landing on the moon wasn't hard because any single component was impossibly complex. It was hard because thousands of components, built by thousands of contributors, had to work together flawlessly. That required extraordinary specification discipline — shared "memory" encoded in interface documents, test protocols, and traceability matrices.

We're heading toward moon-landing-scale software complexity, built at unprecedented speed, by AI agents with goldfish memory. Without V-model discipline — formal specs, architectural contracts, traceable tests — the result will be what Forte Group calls "generating legacy code at machine speed." [21]


What This Looks Like in Practice

Here's how I apply the V-model to AI coding:

| Phase | What the human typically does | What the AI can do |
| --- | --- | --- |
| Goals | Clarify target, goals, high-level use cases | Help brainstorm examples, rephrase for clarity |
| Risk & Research | Decide what needs de-risking | Collect docs, compare options, run quick PoCs |
| Spec (Black-Box) | Own final use cases, constraints, tradeoffs | Propose flows, edge cases, alternative wordings |
| Architecture | Choose structure, interfaces, boundaries | Suggest patterns, draft diagrams, spot inconsistencies |
| Test Strategy | Decide what "good enough" means at each level | Propose test cases, generate boilerplate tests |
| Implementation | Review, accept, steer direction | Generate code via TDD-style loops |
| Verification | Decide when it's done, interpret failures | Run tests, surface failures, propose fixes |

The key insight: the left side of the V is human work. The bottom and right side are AI work. Humans specify. AI implements. Tests verify that the implementation matches the specification.

This isn't theoretical. I use this daily with Claude Code, and the difference between "vibe coding" and V-model-guided sessions is night and day. With a spec, the agent produces correct code on the first try far more often. Without a spec, it confidently builds the wrong thing — and you don't discover it until you've already built three layers on top.


The Techniques That Make It Work

The V-model is the overarching framework, but several complementary techniques make it practical for AI:

| Technique | Why it helps AI |
| --- | --- |
| SDD (Spec-Driven Development) | Specs as "super-prompts" — structured, unambiguous, modular inputs that fit context windows [9] |
| TDD (Test-Driven Development) | Red/green/refactor gives the agent a tight feedback loop and clear success criteria |
| SOLID Principles | Narrow interfaces = less context needed per component = better agent performance |
| Traceability | Requirement → design → code → test mapping catches drift before it compounds [4] |
| Use Cases | Structured actor/system interaction replaces ambiguous prose with verifiable scenarios |
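As a small illustration of the TDD technique listed above, here is the red/green loop in miniature. The function and test names are invented; the sequence is what matters, the test is written before the code it exercises:

```python
import re

def test_slugify():
    # red: this assertion fails until slugify exists and behaves as specified
    assert slugify("Hello, World!") == "hello-world"

def slugify(title):
    # green: the minimal implementation that satisfies the test
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

test_slugify()  # passes once the minimal implementation is in place
```

For an agent, the failing test is the success criterion: it iterates until green, and the refactor step is safe because the contract stays executable.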

The Bottom Line

The V-model wasn't designed for AI. It was designed for the same problems AI has — limited shared memory, organization drift, and coordination collapse in complex systems. As coding becomes cheap and project complexity grows by orders of magnitude, the discipline that aerospace and defense developed for moon landings and flight controllers is exactly what we need to enable high-quality AI-based agentic coding.

The bottleneck has shifted from "can we build it?" to "did we define what to build correctly?" [19] The V-model is a proven machine for defining things correctly.

Vibe coding is for prototypes. V-model is for production.


References

[1] Fred Brooks, The Mythical Man-Month (1975). Book.
[2] V-Model Origins — BHI Consulting. Article.
[3] NASA Systems Engineering Process. Reference.
[4] V-Model: Verification & Validation in SDLC — Teaching Agile. Article.
[5] Nate B Jones, "Google Just Proved More Agents Can Make Things WORSE" (reporting Cursor & Gastown findings). Video.
[6] Google/MIT multi-agent study — The Decoder. Article.
[7] Google/MIT multi-agent study — Fortune. Article.
[8] AWS Kiro: Spec-Driven Development — LinearB. Article.
[9] Piskala, "Spec-Driven Development: From Code to Contract". arXiv paper.
[10] CodeRabbit, "State of AI vs Human Code Generation" report. Press release.
[11] "AI-authored code contains worse bugs" — The Register. Article.
[12] Google 2025 DORA Report — official announcement. Blog.
[13] DORA 2025 AI Report analysis — Nish Tahir. Article.
[14] Anthropic: Claude writing nearly 100% of AI-generated code. Reddit discussion.
[15] Anthropic's Mike Krieger on Claude writing its own code. Reddit discussion.
[16] StrongDM Software Factory — official blog. Article.
[17] StrongDM AI team analysis — Simon Willison. Article.
[18] Cursor Revenue & Valuation — Sacra. Report.
[19] Nate B Jones, "The Job Market Split Nobody's Talking About". Video.
[20] Anthropic, 2026 Agentic Coding Trends Report. PDF report.
[21] Forte Group, "Specification Engineering: The Non-Negotiable Prerequisite". Article.
[22] Apollo Systems Engineering — Optima SC. Article.



Disclaimer: The views, opinions, methodologies, and recommendations expressed in this document are solely those of the author(s) and contributors. They do not represent, reflect, or are endorsed by any past, present, or future employer, organization, or affiliated entity. This documentation is provided for educational and informational purposes based on practical experience and publicly available resources. Users should evaluate and adapt these practices to their own specific contexts, requirements, and organizational standards.


Tags: #ai #softwareengineering #vmodel #specdriven #codingagents #tdd #architecture

Top comments (2)

Lev Magazinnik

Let me disagree with you. The code must be split into blocks/modules, into classes by functionality. So if you start working from scratch, say you write some class, some block, and then you add another. And if we're talking about the frontend and backend of an application that contains both, you just propagate these blocks from bottom to top or from top to bottom, and these are separate flows that do not interact with each other.
And from my experience, Opus 4.6 is an excellent model that understands the contracts by itself. It does not break the contract between modules when communication exists between them; it just adds new functionality on top of the existing one. It's not breaking the old one; it tries to merge into the existing functionality and add the new one. So it ends up with a new feature added, without breaking the old ones.
I agree with you that a specification can help build software that precisely matches the specification, and I also agree that if the specification is wrong, AI will build the wrong thing. But at the same time, I think we need to give AI space to find the right decision. As far as it knows the best practices — all the best practices it learned — it also keeps them in its coding... From my experience, it outperforms humans in finding best practices and writing the best possible code.
I do not agree that if something is not specified in the specification, AI will assume something wrong. With best practices, it will find the best possible solution for the conditions it works in, given the surrounding code. And I also do not agree that the context-window limitation has such a dramatic influence on the quality of the code AI generates — at least not from what I see with Cursor. It greps a codebase of about 50k lines and does not need to take every line of code into consideration, thanks to smarter grepping and smarter construction of the context. The current state of Cursor does this very well. You don't need to keep everything in context; you just need to understand the dependencies and keep them in context, and current models with current windows do well for 50,000 lines of code. But if you have requirements that you do know, they must be in the specification. AI can't read your mind...
If you have a requirement to make it beautiful, I guess AI will make it work reasonably well, because you don't set requirements you don't have. At the end of the day, if you know the requirement, specify it. If you yourself do not know the requirement, or are not sure you know the best practice that will match the desired application, do not specify it. AI will do that much better than you can. On average, I mean.
That's what I saw, and I want to stress that I'm talking only about the best models right now — Opus 4.6. I did have some experience with Sonnet at the same time, and it really can break the dependencies. Cursor Composer-1 can also break dependencies, and the same is true for other models — but not for Opus, in my spectrum of tasks.

Ziv Kfir

From your comment, Lev, I believe you disagree on the architecture phase, right?
Well, let's agree to disagree. I will elaborate.
First, the architecture phase was done together with the AI.
Second, the architecture avoids drift, and this is super important. I have pasted a few examples where the AI implementation was off the ask. It is possible to mitigate those with another prompt; however, having an aligned architecture enables running to completion.

Last, it is another checkpoint. Often, humans have a much deeper understanding of their needs than is expressed in requirements (e.g., a vision for the future), and it is, IMO, impossible to write perfect requirements (specs, in my terms) up front. In other words, this is a way to debug the requirements together with the AI.

So for now, still trying to improve after running zero-code development for about a year, I am in a position where dark factories should pass through both the requirements and architecture checkpoints.
Things might change; there may be other, better ideas now or in the future. But meanwhile, two checkpoints (requirements/spec, and architecture) are my way to dark factories.
This conclusion comes from a few dozen runs of the same project, developed with several methods, AI tools, and LLMs.
Despite the above, I am continuing to learn and try new approaches, and I will update when there is a big idea shift.