<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kotaro Andy</title>
    <description>The latest articles on DEV Community by Kotaro Andy (@kotaroyamame).</description>
    <link>https://dev.to/kotaroyamame</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3845884%2F7a434bac-98bb-47cd-9078-9d3fb4dfa39a.png</url>
      <title>DEV Community: Kotaro Andy</title>
      <link>https://dev.to/kotaroyamame</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kotaroyamame"/>
    <language>en</language>
    <item>
      <title>I Built a Claude Code Plugin for Formal Specifications and Ran a 30-Trial Evaluation. Here's What I Found.</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Wed, 01 Apr 2026 16:34:57 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/i-built-a-claude-code-plugin-for-formal-specifications-and-ran-a-30-trial-evaluation-heres-what-i-2g02</link>
      <guid>https://dev.to/kotaroyamame/i-built-a-claude-code-plugin-for-formal-specifications-and-ran-a-30-trial-evaluation-heres-what-i-2g02</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: LLMs Write Plausible Code That Misses Edge Cases
&lt;/h2&gt;

&lt;p&gt;If you've used Claude Code (or any LLM-powered coding assistant) for non-trivial tasks, you've probably noticed a pattern: the generated code looks correct, passes the obvious test cases, and then breaks on edge cases that were never specified in the prompt.&lt;/p&gt;

&lt;p&gt;This isn't a model quality issue — it's a specification problem. When you tell an LLM "build a bank transfer system," there are dozens of implicit rules the model has to guess at: What happens if the source and destination accounts are the same? Is a transfer of zero valid? Is the transfer atomic?&lt;/p&gt;
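&lt;p&gt;To make that concrete, here is a minimal sketch (the names and rules are my own illustration, not the plugin's output) of what those implicit rules look like once they are forced into explicit guard clauses:&lt;/p&gt;

```typescript
// Illustrative only: the implicit "bank transfer" rules made explicit.
// The interface and function names are hypothetical, not from the plugin.
interface Account {
  id: string;
  balance: number;
}

function transfer(from: Account, to: Account, amount: number): void {
  if (from.id === to.id) {
    throw new Error("source and destination accounts must differ");
  }
  if (!(amount > 0)) {
    throw new Error("transfer amount must be positive");
  }
  if (amount > from.balance) {
    throw new Error("insufficient funds");
  }
  // Atomicity: mutate state only after every precondition has passed.
  from.balance -= amount;
  to.balance += amount;
}
```

&lt;p&gt;Nothing here is sophisticated; the point is that each guard answers a question the prompt never asked.&lt;/p&gt;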

&lt;p&gt;I built a Claude Code plugin called &lt;strong&gt;Formal Agent Contracts&lt;/strong&gt; to address this. It guides the LLM through defining business rules in &lt;a href="https://en.wikipedia.org/wiki/Vienna_Development_Method" rel="noopener noreferrer"&gt;VDM-SL&lt;/a&gt; (a formal specification language) before writing any implementation code. The idea is that the act of formalizing constraints forces the model to confront ambiguities upfront rather than silently making assumptions.&lt;/p&gt;

&lt;p&gt;The real question was: does this actually improve output quality in a measurable way?&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;p&gt;I designed a controlled evaluation comparing two conditions: 3 benchmark tasks × 5 trials per task per condition, for 2 × 3 × 5 = 30 total runs (15 per condition).&lt;/p&gt;

&lt;h3&gt;
  
  
  Conditions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control (A)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code only, no formal methods&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"{task spec}. Implement in TypeScript with tests."&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Treatment (B)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code + Formal Agent Contracts plugin&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"Run the integrated workflow for {task spec}. Generate TypeScript."&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both groups receive &lt;strong&gt;identical task specifications&lt;/strong&gt; — the treatment group gets no extra information, just a different workflow that starts with spec formalization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Tasks
&lt;/h3&gt;

&lt;p&gt;The tasks increase in complexity, and each carries a set of deliberately hidden "traps" — edge cases not explicitly stated in the prompt that a thorough implementation should handle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Trap Count&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bank Account&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Single agent. CRUD + transfers. Traps: transfer atomicity, zero-balance withdrawal, negative initial balance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Library System&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;3 agents (Catalog, Member, Loan). Traps: inventory count consistency across agents, immediate re-borrow after overdue return, double extension prevention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Online Auction&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;State machine + time constraints + concurrency. Traps: cascading bid extensions, simultaneous bid priority, payment timeout boundaries, re-listing after cancellation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Four metrics, each scored per-trial:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M1 — Contract Violation Detection Rate&lt;/strong&gt;: For each hidden trap, score 0–3 based on whether the code handles it and whether a test covers it. &lt;code&gt;CVDR = Σ(scores) / (traps × 3) × 100%&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M2 — Specification Coverage&lt;/strong&gt;: What fraction of business rules have explicit validation in code (runtime checks, assertions, type constraints)? &lt;code&gt;SC = Σ(rule_scores) / (rules × 3) × 100%&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M4 — Specification Explicitness&lt;/strong&gt;: Are business rules captured in a machine-verifiable form (formal spec = 3, code assertion = 2, comment = 1, implicit = 0)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M6 — Test Effectiveness&lt;/strong&gt;: Heuristic analysis of test code quality — edge case coverage, boundary value testing, negative test cases, test density.&lt;/p&gt;

&lt;p&gt;Scoring was automated via keyword-matching heuristics against gold-standard trap definitions and business rule extractions. All raw scores are in the repo.&lt;/p&gt;
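&lt;p&gt;To make the scoring concrete, here is a toy version of that keyword-matching heuristic. The real scorer is in the repo's &lt;code&gt;eval/&lt;/code&gt; directory; the 2-points-for-code, 1-point-for-tests split below is my simplification of the 0–3 rubric:&lt;/p&gt;

```typescript
// Toy sketch of keyword-based trap scoring (illustrative only; the real
// scorer lives in the repo's eval/ directory).
interface Trap {
  name: string;
  keywords: string[]; // textual evidence that code or tests address the trap
}

// Score one trap 0-3: 2 points if any keyword appears in the implementation,
// 1 more if one also appears in the tests (a simplification of the rubric).
function scoreTrap(trap: Trap, code: string, tests: string): number {
  const inCode = trap.keywords.some((k) => code.includes(k));
  const inTests = trap.keywords.some((k) => tests.includes(k));
  let s = 0;
  if (inCode) s += 2;
  if (inTests) s += 1;
  return s;
}

// CVDR = sum(scores) / (traps * 3) * 100%
function cvdr(traps: Trap[], code: string, tests: string): number {
  const total = traps.reduce((acc, t) => acc + scoreTrap(t, code, tests), 0);
  return (total / (traps.length * 3)) * 100;
}
```

&lt;p&gt;The obvious weakness, discussed under threats to validity below, is that keyword presence says nothing about whether the handling is actually correct.&lt;/p&gt;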

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Aggregate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Treatment&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;Cliff's δ&lt;/th&gt;
&lt;th&gt;Effect Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M1: Violation Detection&lt;/td&gt;
&lt;td&gt;52.0% ± 33.6&lt;/td&gt;
&lt;td&gt;63.9% ± 22.5&lt;/td&gt;
&lt;td&gt;+11.8pp&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M2: Spec Coverage&lt;/td&gt;
&lt;td&gt;39.1% ± 21.2&lt;/td&gt;
&lt;td&gt;81.9% ± 26.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+42.8pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;large&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M4: Spec Explicitness&lt;/td&gt;
&lt;td&gt;8.9% ± 15.3&lt;/td&gt;
&lt;td&gt;100.0% ± 0.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+91.1pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;large&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M6: Test Effectiveness&lt;/td&gt;
&lt;td&gt;66.7% ± 23.4&lt;/td&gt;
&lt;td&gt;87.0% ± 8.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+20.3pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;large&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three out of four metrics show &lt;strong&gt;large effect sizes&lt;/strong&gt; (Cliff's δ ≥ 0.474). For context, a Cliff's δ of 0.74 on M2 means that in 87% of pairwise comparisons between control and treatment runs, the treatment run had higher spec coverage.&lt;/p&gt;
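&lt;p&gt;Cliff's δ is simple to compute: it is the fraction of pairwise comparisons the treatment wins minus the fraction it loses. A minimal implementation:&lt;/p&gt;

```typescript
// Cliff's delta: (#pairs where x beats y minus #pairs where y beats x),
// divided by the total number of pairs.
function cliffsDelta(treatment: number[], control: number[]): number {
  let wins = 0;
  let losses = 0;
  for (const x of treatment) {
    for (const y of control) {
      if (x > y) wins += 1;
      if (y > x) losses += 1;
    }
  }
  return (wins - losses) / (treatment.length * control.length);
}
```

&lt;p&gt;With no ties, the treatment's win rate over all pairs is (1 + δ) / 2, which is where the 87% figure for δ = 0.74 comes from.&lt;/p&gt;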

&lt;h3&gt;
  
  
  The Complexity Scaling Effect
&lt;/h3&gt;

&lt;p&gt;This was the most interesting finding. The plugin's impact scales with task complexity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;M1 Δ&lt;/th&gt;
&lt;th&gt;M2 Δ&lt;/th&gt;
&lt;th&gt;M6 Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Low (Bank)&lt;/td&gt;
&lt;td&gt;-3.3pp&lt;/td&gt;
&lt;td&gt;+40.0pp&lt;/td&gt;
&lt;td&gt;+13.5pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium (Library)&lt;/td&gt;
&lt;td&gt;+15.0pp&lt;/td&gt;
&lt;td&gt;+62.2pp&lt;/td&gt;
&lt;td&gt;+17.5pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High (Auction)&lt;/td&gt;
&lt;td&gt;+23.8pp&lt;/td&gt;
&lt;td&gt;+26.2pp&lt;/td&gt;
&lt;td&gt;+30.0pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the low-complexity task, the control group already does well on M1 — simple business rules are within the LLM's "intuition." But once you introduce multi-agent coordination (medium) or state machines with temporal constraints (high), the gap widens sharply.&lt;/p&gt;

&lt;p&gt;The auction task is where it gets stark. The control group averaged 11.4% on violation detection vs. 35.2% for treatment. Edge cases like cascading bid extension limits, simultaneous bid resolution, and payment timeout boundaries were almost universally missed without formal spec guidance.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Treatment Group Actually Does Differently
&lt;/h3&gt;

&lt;p&gt;Looking at the generated artifacts, the mechanism is clear. The treatment group produces three files per run (&lt;code&gt;.vdmsl&lt;/code&gt; + &lt;code&gt;.ts&lt;/code&gt; + &lt;code&gt;.test.ts&lt;/code&gt;) vs. two for control (&lt;code&gt;.ts&lt;/code&gt; + &lt;code&gt;.test.ts&lt;/code&gt;). The VDM-SL spec acts as an intermediate artifact that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forces explicit enumeration&lt;/strong&gt; of invariants, pre-conditions, and post-conditions before code generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surfaces implicit requirements&lt;/strong&gt; through the formalization process — when you try to write &lt;code&gt;inv balance &amp;gt;= 0&lt;/code&gt; you immediately ask "what about the initial balance?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates more targeted tests&lt;/strong&gt; because the test cases are derived from spec violations rather than happy-path scenarios&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, in the auction task, a typical control run might test &lt;code&gt;bid &amp;gt; currentPrice&lt;/code&gt; but not &lt;code&gt;bid &amp;gt;= currentPrice + minIncrement(currentPrice)&lt;/code&gt;. The treatment run, having defined &lt;code&gt;minBidIncrement&lt;/code&gt; as a VDM-SL function with explicit boundary conditions, naturally produces tests for the 1% minimum and the ¥100 floor.&lt;/p&gt;
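&lt;p&gt;As a hedged reconstruction (the exact rule lives in the benchmark's gold spec, so treat the numbers as illustrative), the increment rule and its boundary look like this:&lt;/p&gt;

```typescript
// Hypothetical reconstruction of the auction task's increment rule:
// minimum increment is 1% of the current price, with a 100-yen floor.
function minBidIncrement(currentPrice: number): number {
  return Math.max(Math.ceil(currentPrice * 0.01), 100);
}

// The spec-derived validity check that treatment runs tended to test.
function isValidBid(bid: number, currentPrice: number): boolean {
  return bid >= currentPrice + minBidIncrement(currentPrice);
}
```

&lt;p&gt;The off-by-one at &lt;code&gt;currentPrice + minBidIncrement(currentPrice)&lt;/code&gt; is exactly the kind of boundary a happy-path test suite never probes.&lt;/p&gt;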

&lt;h2&gt;
  
  
  Threats to Validity
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about the limitations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internal validity&lt;/strong&gt;: The same Claude instance generated both conditions. I introduced deliberate stylistic variation between runs, but the fundamental issue remains — this isn't a human-subjects experiment. The treatment group's improvement could partly reflect the structured workflow itself rather than the formal specification per se.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoring methodology&lt;/strong&gt;: M1 and M2 use heuristic keyword matching, not actual code execution. A function named &lt;code&gt;validateBalance&lt;/code&gt; scores the same whether it's correctly implemented or not. M6 was originally designed to run gold-standard tests via Jest, but filesystem permission constraints forced a fallback to heuristic test code analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M4 is structurally biased&lt;/strong&gt;: The treatment group produces VDM-SL files by definition, which automatically earns the maximum explicitness score of 3. The +91.1pp delta on M4 is real but partially tautological. The more meaningful signal comes from M2 and M6, which measure actual code behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scale&lt;/strong&gt;: Three tasks, five trials per condition. The effect sizes are large enough to be meaningful at n=15 per group, but replication with more tasks and different LLMs would strengthen the conclusions.&lt;/p&gt;

&lt;p&gt;All raw data (30 directories of generated code, scores.json, the evaluation script, and gold-standard test suites) are in the repository for independent verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works in Practice
&lt;/h2&gt;

&lt;p&gt;The plugin bundles 10 skills for Claude Code. The typical workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "Run the integrated workflow for this auction system spec."

Claude (via plugin):
  1. define-contract  → Generates VDM-SL from your natural language description
  2. verify-spec      → Runs VDMJ syntax/type checking + proof obligation generation
  3. smt-verify       → Converts POs to SMT-LIB, proves with Z3
  4. generate-code    → Scaffolds TypeScript/Python with runtime contract checks
  5. (runs tests)     → Validates generated code against the spec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For existing codebases, there's a reverse workflow (&lt;code&gt;extract-spec&lt;/code&gt; → &lt;code&gt;refine-spec&lt;/code&gt; → &lt;code&gt;reconcile-code&lt;/code&gt;) that extracts provisional specs from code and refines them through dialogue.&lt;/p&gt;

&lt;p&gt;You don't need to know VDM-SL. The conversation is in natural language — the plugin handles the translation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/plugin marketplace add kotaroyamame/formal-agent-contracts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or clone and install locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kotaroyamame/formal-agent-contracts.git
&lt;span class="nb"&gt;cd &lt;/span&gt;formal-agent-contracts
/plugin &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prerequisites: Java 11+ (for VDMJ) and optionally Z3 (&lt;code&gt;pip install z3-solver&lt;/code&gt;) for automated proving.&lt;/p&gt;

&lt;p&gt;The full evaluation data, protocol, and scoring scripts are under &lt;code&gt;eval/&lt;/code&gt; in the repo. If you run your own evaluation or replicate these results, I'd love to hear about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/kotaroyamame/formal-agent-contracts" rel="noopener noreferrer"&gt;github.com/kotaroyamame/formal-agent-contracts&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Detailed evaluation report&lt;/strong&gt;: &lt;a href="https://iid.systems/formal-agent-contracts/evaluation/" rel="noopener noreferrer"&gt;iid.systems/formal-agent-contracts/evaluation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>formalmethods</category>
      <category>testing</category>
    </item>
    <item>
      <title>Formal Agent Contracts: Bring Mathematical Rigor to Multi-Agent Development with a Claude Code Plugin</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Wed, 01 Apr 2026 07:54:14 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/formal-agent-contracts-bring-mathematical-rigor-to-multi-agent-development-with-a-claude-code-33k4</link>
      <guid>https://dev.to/kotaroyamame/formal-agent-contracts-bring-mathematical-rigor-to-multi-agent-development-with-a-claude-code-33k4</guid>
      <description>&lt;p&gt;When multiple AI agents collaborate on a system, something always goes wrong at the boundaries. Agent A assumes one format, Agent B expects another, and the bug surfaces three layers deep at 2 AM. We've all been there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal Agent Contracts&lt;/strong&gt; is a Claude Code plugin that attacks this problem head-on. It lets you define precise, machine-verifiable contracts between agents using VDM-SL (Vienna Development Method – Specification Language), then automatically verify, prove, and generate code from those contracts. The twist: you don't need to know VDM-SL at all. Claude handles the formalism; you describe what your agents do in plain English.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does It Actually Do?
&lt;/h2&gt;

&lt;p&gt;The plugin provides six skills that form a pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;define-contract&lt;/strong&gt; — Describe your agent's role in natural language. Claude converts it into a VDM-SL formal specification with types, preconditions, postconditions, and invariants.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;verify-spec&lt;/strong&gt; — Run VDMJ (the VDM-SL reference toolchain) to syntax-check and type-check your spec, then auto-generate proof obligations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;smt-verify&lt;/strong&gt; — Convert proof obligations to SMT-LIB and solve them with Z3. Get back: proved, counterexample found, or unknown.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;generate-code&lt;/strong&gt; — Produce TypeScript or Python scaffolds from your spec, complete with runtime contract checks that throw clear errors on violations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;integrated-workflow&lt;/strong&gt; — Run the full pipeline (define → verify → prove → generate → test) in one session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;formal-methods-guide&lt;/strong&gt; — Ask Claude to explain any VDM-SL concept along the way.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Java 11+&lt;/strong&gt; is required for VDMJ (the VDM-SL checker). &lt;strong&gt;Python 3.8+&lt;/strong&gt; is needed if you want Z3-based automated proving.&lt;/p&gt;

&lt;p&gt;Install VDMJ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the JAR from GitHub releases&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.vdmj
curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ~/.vdmj/vdmj.jar &lt;span class="se"&gt;\&lt;/span&gt;
  https://github.com/nickbattle/vdmj/releases/latest/download/vdmj-4.6.0-SNAPSHOT.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install Z3 (optional, for SMT proving):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;z3-solver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install the Plugin
&lt;/h3&gt;

&lt;p&gt;In a Claude Code session (terminal or Cowork), run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/plugin &lt;span class="nb"&gt;install &lt;/span&gt;formal-agent-contracts@iid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The six skills are now available in your session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Walkthrough: Building a Task Manager Agent
&lt;/h2&gt;

&lt;p&gt;Let's build a real example from scratch. We'll define a task management agent with status transitions, verify the spec catches invalid transitions, and generate code with runtime contract enforcement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Define the Contract
&lt;/h3&gt;

&lt;p&gt;Tell Claude what your agent does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Define a task manager agent. Tasks have an ID, title, status
     (Todo / InProgress / Done), and priority (Low / Medium / High).
     Done tasks cannot transition back to any other status.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude invokes the &lt;strong&gt;define-contract&lt;/strong&gt; skill and produces a VDM-SL specification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;types&lt;/span&gt;
  &lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Todo&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InProgress&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;     &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;nat1&lt;/span&gt;
          &lt;span class="n"&gt;title&lt;/span&gt;  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;seq1&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;char&lt;/span&gt;
          &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Status&lt;/span&gt;
  &lt;span class="n"&gt;inv&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;operations&lt;/span&gt;
  &lt;span class="n"&gt;ChangeStatus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="o"&gt;==&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;ChangeStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;pre&lt;/span&gt;  &lt;span class="n"&gt;taskId&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;dom&lt;/span&gt; &lt;span class="n"&gt;board&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt;
         &lt;span class="n"&gt;ValidTransition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;board&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="n"&gt;board&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;pre&lt;/code&gt; (precondition) says: the task must exist &lt;em&gt;and&lt;/em&gt; the transition must be valid. The &lt;code&gt;post&lt;/code&gt; (postcondition) guarantees the status is actually updated. The &lt;code&gt;inv&lt;/code&gt; (invariant) constrains title length at all times. These three constructs form the &lt;strong&gt;contract&lt;/strong&gt; — the rules every caller and implementation must respect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Verify the Spec
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Verify this spec.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude runs VDMJ and reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Syntax check: PASSED
✅ Type check: PASSED
📋 Proof Obligations generated: 38
   - CreateTask satisfies TaskBoard invariant
   - ChangeStatus preserves title across transition
   - DeleteTask reduces board cardinality by exactly 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Proof obligations (POs) are questions automatically derived from your contracts. For instance, "After &lt;code&gt;ChangeStatus&lt;/code&gt;, is &lt;code&gt;board(taskId).title&lt;/code&gt; still the same as before?" You didn't write this check — the tool inferred it from your &lt;code&gt;post&lt;/code&gt; condition.&lt;/p&gt;
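&lt;p&gt;To see why that inferred check matters, the "preserves title" PO can be read as an executable property. This sketch (types and names are illustrative, and it samples inputs rather than proving anything) shows the property that the Z3 step later tries to prove for &lt;em&gt;all&lt;/em&gt; inputs:&lt;/p&gt;

```typescript
// The PO "ChangeStatus preserves title", read as an executable property.
// Illustrative sketch: a real proof covers all inputs; this checks one call.
type Task = { id: number; title: string; status: string };
type Board = { [id: number]: Task };

function changeStatus(board: Board, id: number, status: string): Board {
  const updated = { ...board };
  updated[id] = { ...board[id], status };
  return updated;
}

// Property: after a status change, the task's title is unchanged.
function titlePreserved(board: Board, id: number, status: string): boolean {
  const after = changeStatus(board, id, status);
  return after[id].title === board[id].title;
}
```

&lt;p&gt;A runtime check like this can only sample inputs; the SMT step in the next phase is what turns the same property into a proof.&lt;/p&gt;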

&lt;h3&gt;
  
  
  Step 3: Prove Obligations with Z3
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Prove the POs with Z3.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude converts each PO to SMT-LIB and runs Z3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PO #12 (ChangeStatus preserves title): ✅ Proved
PO #15 (DeleteTask cardinality):       ✅ Proved
PO #23 (CreateTask invariant):         ✅ Proved
...
37/38 proved, 1 unknown (requires manual review)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Generate Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Generate TypeScript from this spec.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude produces implementation code with &lt;strong&gt;runtime contract checks&lt;/strong&gt; baked in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;changeStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// --- Pre-conditions (from VDM-SL spec) ---&lt;/span&gt;
  &lt;span class="nf"&gt;checkPre&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s2"&gt;`taskId &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; not in dom board`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;checkPre&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;validTransition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s2"&gt;`Invalid transition: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; → &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// --- Operation body ---&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;updated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;newStatus&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;updated&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// --- Post-conditions (from VDM-SL spec) ---&lt;/span&gt;
  &lt;span class="nf"&gt;checkPost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;`status must be &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
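&lt;p&gt;The &lt;code&gt;checkPre&lt;/code&gt; / &lt;code&gt;checkPost&lt;/code&gt; helpers themselves are tiny. A minimal version (my sketch; the plugin's actual implementation may differ) is just a guarded throw with a labelled error:&lt;/p&gt;

```typescript
// Minimal contract-check helpers in the spirit of the generated code.
// Illustrative sketch only; the plugin's real helpers may differ.
class ContractError extends Error {
  constructor(kind: string, message: string) {
    super(kind + " failed: " + message);
    this.name = "ContractError";
  }
}

function checkPre(condition: boolean, message: string): void {
  if (!condition) throw new ContractError("Pre-condition", message);
}

function checkPost(condition: boolean, message: string): void {
  if (!condition) throw new ContractError("Post-condition", message);
}
```
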



&lt;p&gt;Try violating the contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;changeStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// → ContractError: Pre-condition failed:&lt;/span&gt;
&lt;span class="c1"&gt;//   Invalid transition: Done → Todo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No silent bugs. No debugging session at 2 AM.&lt;/p&gt;
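&lt;p&gt;For context, here's a minimal sketch of what the contract helpers behind these checks could look like. The &lt;code&gt;ContractError&lt;/code&gt; and &lt;code&gt;checkPost&lt;/code&gt; names come from the example above; the implementation (and the transition table) is my assumption, not the plugin's actual generated code:&lt;/p&gt;

```typescript
// ContractError and checkPost are the names used in the example above;
// this implementation is an assumption about what such helpers look like.
class ContractError extends Error {
  constructor(kind: string, message: string) {
    super(`${kind} failed: ${message}`);
    this.name = "ContractError";
  }
}

function checkPre(condition: boolean, message: string): void {
  if (!condition) throw new ContractError("Pre-condition", message);
}

function checkPost(condition: boolean, message: string): void {
  if (!condition) throw new ContractError("Post-condition", message);
}

// Hypothetical transition table for the task statuses above.
const validTransitions: Record<string, string[]> = {
  Todo: ["InProgress"],
  InProgress: ["Done", "Todo"],
  Done: [], // Done is terminal: no way back
};

function assertTransition(from: string, to: string): void {
  checkPre(
    (validTransitions[from] ?? []).includes(to),
    `Invalid transition: ${from} -> ${to}`
  );
}
```

&lt;p&gt;The point is that a violated contract fails loudly at the call site instead of corrupting state silently.&lt;/p&gt;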

&lt;h2&gt;
  
  
  One-Shot: The Integrated Workflow
&lt;/h2&gt;

&lt;p&gt;If you want the full pipeline in one go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Run the integrated workflow for a task management agent.
     Tasks have ID, title, status (Todo/InProgress/Done),
     priority. Done tasks can't go back.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude orchestrates all phases — define, verify, prove, generate, test — handling errors and retries automatically, and produces a session report at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Formal Contracts Instead of Tests?
&lt;/h2&gt;

&lt;p&gt;Tests are &lt;em&gt;inductive&lt;/em&gt;: you check a finite number of cases and hope they cover enough. Formal contracts are &lt;em&gt;deductive&lt;/em&gt;: you prove properties hold for &lt;em&gt;all&lt;/em&gt; possible inputs.&lt;/p&gt;

&lt;p&gt;Consider this scenario without contracts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This silently succeeds — the bug surfaces later, somewhere else&lt;/span&gt;
&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Was "Done" — should be forbidden!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With contracts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;changeStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// → ContractError: Invalid transition: Done → Todo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spec also doubles as &lt;strong&gt;living documentation&lt;/strong&gt;. It precisely describes what each agent does, what it expects, and what it guarantees — and it never drifts out of sync with the code, because the code is generated from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plugin install&lt;/strong&gt;: &lt;code&gt;/plugin install formal-agent-contracts@iid&lt;/code&gt; in Claude Code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source&lt;/strong&gt;: &lt;a href="https://github.com/kotaroyamame/formal-agent-contracts" rel="noopener noreferrer"&gt;github.com/kotaroyamame/formal-agent-contracts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full task manager example&lt;/strong&gt;: included in the plugin at &lt;code&gt;examples/task-manager/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research paper&lt;/strong&gt;: &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;Formal-Spec-Driven Development&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://iid.systems" rel="noopener noreferrer"&gt;IID Systems&lt;/a&gt;. Licensed under MIT.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>formalmethods</category>
      <category>agents</category>
    </item>
    <item>
      <title>GitHub Spec Kit Is 80% Right — Here's the Missing 20% That Would Make It Transformative</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Sat, 28 Mar 2026 04:47:44 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/github-spec-kit-is-80-right-heres-the-missing-20-that-would-make-it-transformative-2bi6</link>
      <guid>https://dev.to/kotaroyamame/github-spec-kit-is-80-right-heres-the-missing-20-that-would-make-it-transformative-2bi6</guid>
      <description>&lt;h2&gt;
  
  
  I Love Spec Kit. And That's Why I Want to Push It Further.
&lt;/h2&gt;

&lt;p&gt;GitHub's &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;Spec Kit&lt;/a&gt; is, in my assessment, the most intellectually honest attempt at AI-driven development to date. While most tools focus on &lt;em&gt;faster code generation&lt;/em&gt;, Spec Kit asks a more fundamental question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if intent — not code — was the source of truth?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is exactly the right question. The Constitution → Specify → Plan → Tasks → Implement workflow is well-designed. The steerable gates where humans can intervene are smart. Support for 25+ agents (Claude Code, Copilot, Gemini CLI, Cursor, etc.) shows pragmatic thinking about ecosystem diversity. The 40+ community extensions demonstrate real traction.&lt;/p&gt;

&lt;p&gt;I've been working on a &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;formal specification-driven development framework&lt;/a&gt; that starts from the same premise — &lt;strong&gt;specifications should drive development, not the other way around&lt;/strong&gt;. But after months of research and implementation, I believe Spec Kit has a structural gap that, if filled, would make it dramatically more powerful.&lt;/p&gt;

&lt;p&gt;That gap is &lt;strong&gt;formal verifiability of specifications themselves&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Structured Ambiguity
&lt;/h2&gt;

&lt;p&gt;Spec Kit specifications are written in structured Markdown. This is a massive improvement over the alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MetaGPT passes natural language PRDs between agents → ambiguity accumulates at each handoff&lt;/li&gt;
&lt;li&gt;ChatDev relies on dialogue-based consensus → non-deterministic and non-reproducible&lt;/li&gt;
&lt;li&gt;Devin/SWE-Agent infers specs from code → circular reasoning (code &lt;em&gt;is&lt;/em&gt; the spec)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spec Kit avoids all of these pitfalls by making specs explicit, structured, and human-editable. But here's the uncomfortable truth:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured natural language reduces ambiguity. It does not eliminate it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider a Spec Kit specification like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## User Story: Add to Cart&lt;/span&gt;
The user can add a product to their shopping cart.
The cart should reflect the updated quantity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is clear to a human reader. But it leaves critical questions unanswered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can the user add 0 items? Negative quantities?&lt;/li&gt;
&lt;li&gt;What happens when stock is insufficient?&lt;/li&gt;
&lt;li&gt;If the product is already in the cart, does the quantity replace or accumulate?&lt;/li&gt;
&lt;li&gt;Is there a maximum cart size?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't edge cases. They're &lt;strong&gt;boundary conditions&lt;/strong&gt; — the exact places where modules interact and where bugs hide. A human spec author might catch some of them. But the point of a specification system is that &lt;strong&gt;the system itself should make missing boundaries visible&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Formal Specifications Add
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev/tree/main/templates/vdm-sl" rel="noopener noreferrer"&gt;VDM-SL&lt;/a&gt; (Vienna Development Method - Specification Language), the same requirement looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AddToCart: CustomerId * ProductId * nat1 ==&amp;gt; ()
AddToCart(cid, pid, qty) ==
  carts(cid) := if pid in set dom carts(cid)
                then carts(cid) ++ {pid |-&amp;gt; carts(cid)(pid) + qty}
                else carts(cid) ++ {pid |-&amp;gt; qty}
pre  pid in set dom inventory
     and inventory(pid) &amp;gt;= qty
     and qty &amp;gt; 0
     and CardinalityItems(carts(cid)) &amp;lt; MAX_CART_SIZE
post pid in set dom carts(cid)
     and carts(cid)(pid) = (if pid in set dom carts~(cid)
                            then carts~(cid)(pid) + qty  -- accumulate, not replace
                            else qty)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;nat1&lt;/code&gt; as the type of &lt;code&gt;qty&lt;/code&gt;&lt;/strong&gt; — zero and negative quantities are structurally impossible. Not "tested against," not "documented as invalid" — &lt;em&gt;impossible at the type level&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pre&lt;/code&gt; conditions&lt;/strong&gt; — stock sufficiency and cart size limits are &lt;strong&gt;explicit preconditions&lt;/strong&gt;. An AI agent reading this spec cannot "forget" about them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;post&lt;/code&gt; conditions&lt;/strong&gt; — the behavior is unambiguous: quantities accumulate (not replace). Any agent implementing this module knows exactly what the expected behavior is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine-verifiable&lt;/strong&gt; — if another module's postcondition guarantees &lt;code&gt;inventory(pid) &amp;gt;= qty&lt;/code&gt;, we can mechanically verify that it satisfies this precondition. No human review needed for interface consistency.&lt;/li&gt;
&lt;/ol&gt;
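&lt;p&gt;To make point 4 concrete, here's an illustrative sketch — not how VDM tooling actually works (real tools discharge &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; as a proof obligation). All names here are hypothetical; this merely tests the implication over sample states, which can falsify it but never prove it:&lt;/p&gt;

```typescript
// Hypothetical state shape and predicates; real VDM tooling discharges
// A.post => B.pre as a proof obligation rather than sampling states.
interface CartState {
  stock: number;     // inventory(pid)
  qty: number;       // requested quantity
  cartSize: number;  // current number of distinct items in the cart
}

const MAX_CART_SIZE = 50;

// What an upstream operation's postcondition guarantees about the state.
const upstreamPost = (s: CartState) => s.stock >= s.qty && s.qty > 0;

// What AddToCart's precondition requires.
const addToCartPre = (s: CartState) =>
  s.stock >= s.qty && s.qty > 0 && s.cartSize < MAX_CART_SIZE;

// Implication check over samples: can falsify post => pre, never prove it.
function findCounterexample(states: CartState[]): CartState | undefined {
  return states.find(s => upstreamPost(s) && !addToCartPre(s));
}
```

&lt;p&gt;Note how the cart-size conjunct surfaces a gap: a state with a full cart satisfies the upstream postcondition but violates &lt;code&gt;AddToCart&lt;/code&gt;'s precondition, so the composition is not verified.&lt;/p&gt;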

&lt;h2&gt;
  
  
  The Three Gaps Formal Specs Would Fill in Spec Kit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gap 1: Boundary Conditions Are Invisible in Natural Language
&lt;/h3&gt;

&lt;p&gt;In Spec Kit today, boundary conditions depend on the spec author remembering to include them. This is a human-reliability problem — the very thing we're trying to engineer out of the system.&lt;/p&gt;

&lt;p&gt;Formal specifications force boundary conditions into the open: types and precondition clauses are where such constraints must live, so their absence is a visible hole rather than an unstated assumption. A VDM-SL spec with &lt;code&gt;qty: nat&lt;/code&gt; but no precondition on stock levels is &lt;strong&gt;visibly incomplete&lt;/strong&gt; — the type checker can flag it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 2: Cross-Module Consistency Is Unverifiable
&lt;/h3&gt;

&lt;p&gt;This is the critical scaling problem. Spec Kit writes specs per task (≈ per module), but there's no mechanism to verify that &lt;strong&gt;Task A's output satisfies Task B's input requirements&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With 3 modules, you have 3 pairwise interfaces to check — manageable by human review. With 10 modules, you have 45. With 20 modules, 190. The interfaces grow quadratically; human review does not scale with them.&lt;/p&gt;

&lt;p&gt;Formal specifications solve this with &lt;strong&gt;compositional verification&lt;/strong&gt;: if Order module's &lt;code&gt;ConfirmOrder&lt;/code&gt; operation has a postcondition, and Inventory module's &lt;code&gt;ReserveStock&lt;/code&gt; has a precondition, you can mechanically check whether &lt;code&gt;ConfirmOrder.post ⇒ ReserveStock.pre&lt;/code&gt;. This is the &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; pattern — and it scales linearly with module count, not quadratically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Module A (Order)                Module B (Inventory)
┌──────────────────┐            ┌──────────────────┐
│  ConfirmOrder()  │            │  ReserveStock()  │
│  post:           │──verify────│  pre:            │
│   order.status   │            │   productId ∈    │
│    = &amp;lt;CONFIRMED&amp;gt; │   A.post   │    dom inventory │
│   ∧ stock        │    ⇒       │   ∧ quantity &amp;gt; 0 │
│    reserved      │   B.pre    │   ∧ available ≥  │
│                  │            │     quantity     │
└──────────────────┘            └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Gap 3: Specs Can't Validate Themselves
&lt;/h3&gt;

&lt;p&gt;Spec Kit's philosophy is "intent is the source of truth." But there's a meta-problem: &lt;strong&gt;how do you verify that the spec itself is internally consistent?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Natural language specs can contain contradictions that only surface during implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Users must complete checkout within 30 minutes
&lt;span class="p"&gt;-&lt;/span&gt; Users can save their cart and resume later
&lt;span class="p"&gt;-&lt;/span&gt; Inventory is reserved at checkout start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Are these contradictory? If a user saves their cart at minute 29 and resumes 3 hours later, is inventory still reserved? A human reader might catch this. A consistency check on formal specifications &lt;em&gt;will&lt;/em&gt; catch it — the invariant on reserved-inventory duration would conflict with the "resume later" postcondition.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Proposal: Spec Kit + Formal Specs
&lt;/h2&gt;

&lt;p&gt;I'm not proposing to replace Spec Kit's natural language layer. I'm proposing to &lt;strong&gt;add a formal verification layer beneath it&lt;/strong&gt;. Here's how it could work within Spec Kit's existing workflow:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Constitution (unchanged)
&lt;/h3&gt;

&lt;p&gt;Project principles, values, guidelines — these remain in natural language. They're governance, not computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Specify (enhanced)
&lt;/h3&gt;

&lt;p&gt;The human writes intent in natural language (as today). Then, an AI agent &lt;strong&gt;generates a VDM-SL formalization&lt;/strong&gt; of the spec and explains it back:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Your spec says users can add products to their cart. I've formalized this with these constraints: quantities must be positive integers, stock must be sufficient, and cart has a maximum of 50 items. The quantity behavior is accumulative — adding 3 of a product that already has 2 in the cart gives 5, not 3. Does this match your intent?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The human validates meaning. The machine validates consistency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Plan (enhanced)
&lt;/h3&gt;

&lt;p&gt;The technical plan now includes &lt;strong&gt;interface contracts&lt;/strong&gt; between modules: which module's postconditions feed into which module's preconditions. These contracts are in VDM-SL, mechanically verifiable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Tasks (enhanced)
&lt;/h3&gt;

&lt;p&gt;Each task carries not just a natural language description, but a &lt;strong&gt;formal module specification&lt;/strong&gt; — the types, state, operations, pre/post conditions. The AI agent implementing the task has an unambiguous contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Implement (unchanged)
&lt;/h3&gt;

&lt;p&gt;Agents implement against formal specs rather than natural language descriptions. Claude Code, Copilot, Gemini CLI — any of the 25+ supported agents can read VDM-SL and generate conforming code.&lt;/p&gt;

&lt;h3&gt;
  
  
  New Phase: Verify
&lt;/h3&gt;

&lt;p&gt;A dedicated verification step checks &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; across all module boundaries. This happens at the &lt;strong&gt;spec level&lt;/strong&gt;, before any code runs. Integration problems are caught before implementation, not after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Multi-Agent Development
&lt;/h2&gt;

&lt;p&gt;Single-agent development (one developer + one AI) can get by with natural language specs. The human catches ambiguities in real time. But the industry is clearly moving toward &lt;strong&gt;multi-agent development&lt;/strong&gt; — multiple AI agents building different modules in parallel.&lt;/p&gt;

&lt;p&gt;In multi-agent scenarios, natural language ambiguity becomes a scaling crisis:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modules&lt;/th&gt;
&lt;th&gt;Pairwise Interfaces&lt;/th&gt;
&lt;th&gt;Human Review Feasibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Manageable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;Difficult&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;190&lt;/td&gt;
&lt;td&gt;Impractical&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Formal interface contracts are the only known mechanism that scales: each contract is verified independently, so verification effort grows linearly with module count (O(n)), not quadratically (O(n²)).&lt;/p&gt;
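&lt;p&gt;The table's numbers are just &lt;code&gt;n(n-1)/2&lt;/code&gt; versus &lt;code&gt;n&lt;/code&gt;:&lt;/p&gt;

```typescript
// Pairwise interfaces grow quadratically; one contract per module,
// each verified independently, grows linearly with module count.
const pairwiseInterfaces = (n: number): number => (n * (n - 1)) / 2;

for (const n of [3, 5, 10, 20]) {
  console.log(`${n} modules: ${pairwiseInterfaces(n)} pairwise interfaces`);
}
// 3 → 3, 5 → 10, 10 → 45, 20 → 190, matching the table above
```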

&lt;h2&gt;
  
  
  What I'm Not Saying
&lt;/h2&gt;

&lt;p&gt;Let me be clear about what this proposal is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not "Spec Kit is bad."&lt;/strong&gt; Spec Kit is the best-designed spec-driven framework available. The philosophy is right. The UX is right. The community extensions are impressive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not "everyone should learn VDM-SL."&lt;/strong&gt; The AI reads and writes the formal notation. The human verifies meaning in natural language. No formal methods expertise required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not "formal specs replace tests."&lt;/strong&gt; Tests still matter for integration testing, performance testing, and E2E validation. Formal specs replace tests as the &lt;em&gt;center&lt;/em&gt; of correctness — not as the entire verification strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not "this works for everything."&lt;/strong&gt; UI/UX, performance tuning, and quick bug fixes don't benefit from formal specifications. This is for &lt;strong&gt;multi-module business logic systems&lt;/strong&gt; where correctness at module boundaries is critical.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Open Source Connection
&lt;/h2&gt;

&lt;p&gt;We've built a complete framework around this idea:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;&lt;code&gt;formal-spec-driven-dev&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Apache 2.0 licensed&lt;/p&gt;

&lt;p&gt;It includes VDM-SL templates, AI prompt templates for all 4 development phases, multi-agent orchestration configs, and a working e-commerce example (Order, Inventory, Payment modules with formal interface contracts).&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev/blob/master/docs/comparison.md" rel="noopener noreferrer"&gt;comparison analysis&lt;/a&gt; covers the structural critiques of MetaGPT, ChatDev, Devin, Spec Kit, and Claude Code in detail, with a unified framework for understanding each approach's "source of truth" and its logical limitations.&lt;/p&gt;

&lt;p&gt;If you're using Spec Kit today and thinking about how to scale it to multi-agent or multi-module projects, I'd love to hear from you. The combination of Spec Kit's excellent developer experience and formal verification's mathematical guarantees could be exactly what the ecosystem needs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Hikaru Ando — &lt;a href="https://iid.systems" rel="noopener noreferrer"&gt;IID Systems&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;GitHub: formal-spec-driven-dev&lt;/a&gt; | Apache 2.0&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;GitHub Spec Kit&lt;/a&gt; — The toolkit this article proposes to enhance&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/" rel="noopener noreferrer"&gt;Spec-Driven Development with AI (GitHub Blog)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.microsoft.com/blog/spec-driven-development-spec-kit" rel="noopener noreferrer"&gt;Diving Into Spec-Driven Development (Microsoft Developer Blog)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;formal-spec-driven-dev&lt;/a&gt; — Our formal specification framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev/blob/master/docs/comparison.md" rel="noopener noreferrer"&gt;Comparison of AI-Driven Development Architectures&lt;/a&gt; — Structural critique of each approach&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>github</category>
      <category>softwaredevelopment</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Stop Writing Tests First — Write Formal Specs. Let AI Agent Teams Build Your System.</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Fri, 27 Mar 2026 17:03:42 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/stop-writing-tests-first-write-formal-specs-let-ai-agent-teams-build-your-system-31h0</link>
      <guid>https://dev.to/kotaroyamame/stop-writing-tests-first-write-formal-specs-let-ai-agent-teams-build-your-system-31h0</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Nobody's Talking About
&lt;/h2&gt;

&lt;p&gt;Everyone's excited about AI-powered coding. GitHub Copilot autocompletes your functions. Claude and GPT-4 generate entire modules from prompts. But here's the uncomfortable truth: &lt;strong&gt;we're still thinking about AI as a single tool&lt;/strong&gt;, when the real paradigm shift is &lt;strong&gt;multiple AI agents working as a team&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that raises a question nobody's answering well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When three AI agents are building three modules in parallel, what prevents them from producing code that doesn't fit together?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Natural language specs? Too ambiguous. One agent interprets "the order must be valid" differently from another. Test-driven development? Tests are inductive — they verify specific cases, not the contract between modules. You can have 100% coverage on each module and still watch the system crash at integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What If Agents Had Blueprints?
&lt;/h2&gt;

&lt;p&gt;Think about how a construction project works. You don't hand three subcontractors a paragraph of prose and hope for the best. You give them &lt;strong&gt;engineering drawings&lt;/strong&gt; — precise, unambiguous documents that define every interface: where the plumbing connects to the electrical, what load the foundation bears, which walls are structural.&lt;/p&gt;

&lt;p&gt;Formal specifications are the engineering drawings of software. And a method called &lt;strong&gt;VDM-SL&lt;/strong&gt; (Vienna Development Method - Specification Language), originally from IBM in the 1970s, turns out to be nearly ideal for AI agents to read, write, and reason about.&lt;/p&gt;

&lt;p&gt;Here's why it matters for multi-agent development:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Module A (Order)                Module B (Inventory)
┌──────────────────┐            ┌──────────────────┐
│                  │            │                  │
│  PlaceOrder()    │            │  ReserveStock()  │
│  post:           │──contract──│  pre:            │
│   order.status   │            │   productId ∈    │
│    = &amp;lt;CONFIRMED&amp;gt; │   A.post   │    dom inventory │
│   ∧ stock        │    ⇒       │   ∧ quantity &amp;gt; 0 │
│    reserved      │   B.pre    │   ∧ available ≥  │
│                  │            │     quantity     │
└──────────────────┘            └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A's postcondition implies B's precondition.&lt;/strong&gt; This is mechanically verifiable. No ambiguity. No "alignment meetings" between agents. No integration surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;Here's what a VDM-SL specification for an order module actually looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module Order

types
  OrderId = nat1;
  ProductId = nat1;

  LineItem :: productId : ProductId
             quantity : nat1
             unitPrice : nat;

  OrderStatus = &amp;lt;PENDING&amp;gt; | &amp;lt;CONFIRMED&amp;gt; | &amp;lt;COMPLETED&amp;gt; | &amp;lt;CANCELLED&amp;gt;;

  Order :: orderId    : OrderId
           customerId : nat1
           lineItems  : seq of LineItem
           status     : OrderStatus
           totalAmount : nat;

state OrderStore of
  orders    : map OrderId to Order
  nextOrderId : nat1
inv mk_OrderStore(orders, nextOrderId) ==
  nextOrderId &amp;gt; 0
  and forall id in set dom orders &amp;amp; orders(id).orderId = id

operations

CreateOrder: nat1 * seq of LineItem ==&amp;gt; OrderId
CreateOrder(customerId, items) ==
  let id = nextOrderId in (
    orders := orders ++ {id |-&amp;gt; mk_Order(id, customerId, items,
                          &amp;lt;PENDING&amp;gt;, SumItems(items))};
    nextOrderId := nextOrderId + 1;
    return id
  )
pre items &amp;lt;&amp;gt; []
post RESULT = nextOrderId~
     and RESULT in set dom orders
     and orders(RESULT).status = &amp;lt;PENDING&amp;gt;;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what this gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Types with invariants&lt;/strong&gt; — &lt;code&gt;nat1&lt;/code&gt; means positive integers only. No null, no zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State invariant&lt;/strong&gt; — the &lt;code&gt;inv&lt;/code&gt; clause is always true. Every order ID in the map matches the order's own ID.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-conditions&lt;/strong&gt; — &lt;code&gt;items &amp;lt;&amp;gt; []&lt;/code&gt; means you can't create an empty order. This is a &lt;em&gt;structural guarantee&lt;/em&gt;, not a test case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-conditions&lt;/strong&gt; — after &lt;code&gt;CreateOrder&lt;/code&gt;, the order exists in the store with status &lt;code&gt;PENDING&lt;/code&gt;. Any agent reading this knows exactly what to expect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An AI agent building the inventory module doesn't need to see the Order module's implementation. It only needs the &lt;strong&gt;interface contract&lt;/strong&gt; — the types and operation signatures with pre/post conditions. That's a fraction of the context window.&lt;/p&gt;
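&lt;p&gt;As a rough illustration of what an agent might produce from this contract, here's a hand-translated TypeScript sketch (my assumption, not tool output) that turns &lt;code&gt;CreateOrder&lt;/code&gt;'s pre/post conditions into runtime checks:&lt;/p&gt;

```typescript
// Hand-translated sketch of CreateOrder's contract as runtime checks.
// Types mirror the VDM-SL spec above; nothing here is generated output.
interface LineItem { productId: number; quantity: number; unitPrice: number; }
type Status = "PENDING" | "CONFIRMED" | "COMPLETED" | "CANCELLED";
interface Order {
  orderId: number; customerId: number;
  lineItems: LineItem[]; status: Status; totalAmount: number;
}

const orders = new Map<number, Order>();
let nextOrderId = 1;

// SumItems from the spec, assumed to total quantity * unitPrice.
const sumItems = (items: LineItem[]) =>
  items.reduce((t, i) => t + i.quantity * i.unitPrice, 0);

function createOrder(customerId: number, items: LineItem[]): number {
  // pre: items <> []
  if (items.length === 0) throw new Error("Pre-condition failed: items <> []");
  const id = nextOrderId;
  orders.set(id, { orderId: id, customerId, lineItems: items,
                   status: "PENDING", totalAmount: sumItems(items) });
  nextOrderId += 1;
  // post: RESULT in set dom orders and orders(RESULT).status = <PENDING>
  if (orders.get(id)?.status !== "PENDING")
    throw new Error("Post-condition failed: status must be PENDING");
  return id;
}
```

&lt;p&gt;The translation is mechanical precisely because the spec left nothing to interpret: the empty-order rule and the &lt;code&gt;PENDING&lt;/code&gt; guarantee are both checkable at runtime.&lt;/p&gt;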

&lt;h2&gt;
  
  
  The Multi-Agent Architecture
&lt;/h2&gt;

&lt;p&gt;Here's the workflow we propose:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Human + Architect AI (1-3 days)
&lt;/h3&gt;

&lt;p&gt;The human (domain expert) and an architect AI collaborate to define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System-wide VDM-SL specification&lt;/li&gt;
&lt;li&gt;Module decomposition&lt;/li&gt;
&lt;li&gt;Interface contracts between modules (the &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; relationships)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The human doesn't need to read VDM-SL.&lt;/strong&gt; The AI explains the spec in natural language: "This says a customer can't place an order unless they have at least one item, and after the order is placed, inventory is reserved. Does that match your business rules?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2-4: Parallel Module Agents
&lt;/h3&gt;

&lt;p&gt;Each module gets its own AI agent. Agent A builds Order. Agent B builds Inventory. Agent C builds Payment. &lt;strong&gt;In parallel.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each agent's context contains only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its own module spec (~200-500 lines of VDM-SL)&lt;/li&gt;
&lt;li&gt;Interface contracts of dependent modules (~50-100 lines each)&lt;/li&gt;
&lt;li&gt;Cross-cutting decisions from Phase 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No agent needs another agent's implementation code. This fits comfortably in current context windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration Verification
&lt;/h3&gt;

&lt;p&gt;A dedicated integration agent checks &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; across all module boundaries. It never reads implementation code — only interface specs. If Order's &lt;code&gt;ConfirmOrder&lt;/code&gt; postcondition guarantees &lt;code&gt;stock_reserved = true&lt;/code&gt;, and Inventory's &lt;code&gt;ShipOrder&lt;/code&gt; precondition requires &lt;code&gt;stock_reserved = true&lt;/code&gt;, the composition is verified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────┐
│            Phase 1: Architecture                  │
│     Human + Architect AI                          │
│     → System spec + module decomposition          │
│     → Interface contracts (A.post ⇒ B.pre)        │
└────────────┬─────────────┬───────────────┬───────┘
             │             │               │
             ▼             ▼               ▼
     ┌──────────┐  ┌──────────┐   ┌──────────┐
     │ Agent A  │  │ Agent B  │   │ Agent C  │
     │ Order    │  │Inventory │   │ Payment  │
     │ Module   │  │ Module   │   │ Module   │
     └─────┬────┘  └────┬─────┘   └────┬─────┘
           │             │               │
           ▼             ▼               ▼
     ┌──────────────────────────────────────────┐
     │      Integration Verification Agent       │
     │   Checks A.post ⇒ B.pre across all       │
     │   module boundaries (spec-level only)     │
     └──────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
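&lt;p&gt;One way to picture the integration agent's job: treat each operation's contract as data. The representation below is hypothetical and deliberately simplified — it checks that every named requirement appears literally among the upstream guarantees, whereas a real verifier proves the logical implication:&lt;/p&gt;

```typescript
// Hypothetical spec-level representation: each operation lists the named
// propositions its postcondition guarantees and its precondition requires.
// The check reads only these contracts — never implementation code.
interface OpContract { guarantees: string[]; requires: string[]; }

const confirmOrder: OpContract = {
  guarantees: ["order.status = CONFIRMED", "stock_reserved = true"],
  requires: [],
};

const shipOrder: OpContract = {
  guarantees: ["order.status = COMPLETED"],
  requires: ["stock_reserved = true"],
};

// A.post => B.pre, read as: everything B requires, A guarantees.
// Returns the unmet requirements (empty array = composition verified).
function composes(a: OpContract, b: OpContract): string[] {
  return b.requires.filter(r => !a.guarantees.includes(r));
}
```

&lt;p&gt;Here &lt;code&gt;composes(confirmOrder, shipOrder)&lt;/code&gt; comes back empty, so the boundary is verified; drop &lt;code&gt;stock_reserved = true&lt;/code&gt; from the guarantees and the unmet requirement is reported before any code exists.&lt;/p&gt;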



&lt;h2&gt;
  
  
  Why Not Just Use Natural Language Specs?
&lt;/h2&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Ambiguity compounds quadratically.&lt;/strong&gt; With 2 modules and natural language specs, the risk of misinterpretation is manageable. With 10 modules and 45 pairwise interactions, it compounds at every boundary. Formal specs eliminate this at the root — &lt;code&gt;nat1&lt;/code&gt; means one thing, everywhere, always.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Verification becomes mechanical.&lt;/strong&gt; You can't machine-check whether "the order should be valid" in one module agrees with "valid orders have at least one item" in another. But you &lt;em&gt;can&lt;/em&gt; machine-check whether &lt;code&gt;Order.ConfirmOrder.post ⇒ Inventory.ReserveStock.pre&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Context windows stay small.&lt;/strong&gt; Each agent needs only its module spec + dependency interfaces — not the full natural-language requirements document that keeps growing. A 10-module system might have 200 pages of natural language specs but only 50 lines of interface contract per module boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why TDD Doesn't Solve This
&lt;/h2&gt;

&lt;p&gt;This is the part that might get me some angry comments, but hear me out.&lt;/p&gt;

&lt;p&gt;Dijkstra said it in 1969:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Program testing can be used to show the presence of bugs, but never to show their absence."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;TDD is &lt;strong&gt;inductive reasoning&lt;/strong&gt;: you verify specific inputs and hope they generalize. Formal specification is &lt;strong&gt;deductive reasoning&lt;/strong&gt;: you state what must always be true, then derive implementations that satisfy those conditions.&lt;/p&gt;

&lt;p&gt;For a single developer, TDD is a useful discipline. For multi-agent AI development, it fails structurally:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;TDD&lt;/th&gt;
&lt;th&gt;Formal Specs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Inductive (specific → general)&lt;/td&gt;
&lt;td&gt;Deductive (general → specific)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-agent contract&lt;/td&gt;
&lt;td&gt;Implicit in shared tests&lt;/td&gt;
&lt;td&gt;Explicit in interface specs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification&lt;/td&gt;
&lt;td&gt;Run tests, hope for the best&lt;/td&gt;
&lt;td&gt;Mechanical proof of composition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context per agent&lt;/td&gt;
&lt;td&gt;Needs test suite + code + shared state&lt;/td&gt;
&lt;td&gt;Needs only interface contracts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scales with agents&lt;/td&gt;
&lt;td&gt;Coordination overhead grows&lt;/td&gt;
&lt;td&gt;Contracts remain verifiable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: we're not saying "never write tests." Tests still have a role — integration tests for external systems, performance tests, E2E UI tests. What we're saying is: &lt;strong&gt;tests are no longer the center.&lt;/strong&gt; Formal specifications are.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Human Role: Not Obsolete, Elevated
&lt;/h2&gt;

&lt;p&gt;This paradigm doesn't remove humans. It redefines what humans do:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; Write code, write tests, review code, debug integration failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain expert&lt;/strong&gt; — "This is how our business works. These are the rules."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture decision maker&lt;/strong&gt; — "Split the system into these modules. This module depends on that one."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality judge&lt;/strong&gt; — "The AI says the spec means X. Does X match what we actually need?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to read VDM-SL. You don't need to write code. You need &lt;strong&gt;deep domain knowledge&lt;/strong&gt; and &lt;strong&gt;the ability to ask sharp questions&lt;/strong&gt; when the AI explains its specifications back to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Open-Source Framework
&lt;/h2&gt;

&lt;p&gt;We've packaged this entire methodology as an open-source project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;&lt;code&gt;formal-spec-driven-dev&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Apache 2.0 licensed&lt;/p&gt;

&lt;p&gt;What's included:&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;Foundational paper&lt;/strong&gt; (Japanese + English) — the full theoretical argument, ~30 min read&lt;/p&gt;

&lt;p&gt;📋 &lt;strong&gt;VDM-SL templates&lt;/strong&gt; — module template, interface contract template. Copy, customize, go.&lt;/p&gt;

&lt;p&gt;🤖 &lt;strong&gt;AI prompt templates for all 4 phases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phase 1: Specification dialogue (elicit requirements → VDM-SL)&lt;/li&gt;
&lt;li&gt;Phase 2: Technical design (VDM-SL → architecture decisions)&lt;/li&gt;
&lt;li&gt;Phase 3: Implementation (VDM-SL → production code)&lt;/li&gt;
&lt;li&gt;Phase 4: Verification (code ↔ spec cross-check)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔧 &lt;strong&gt;Multi-agent orchestration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;agent-config.yaml&lt;/code&gt; — role definitions, context requirements, outputs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;workflow.md&lt;/code&gt; — step-by-step guide to running multi-agent development today&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📦 &lt;strong&gt;Working example:&lt;/strong&gt; E-commerce order system with 3 modules (Order, Inventory, Payment), complete with VDM-SL specs, integration verification, and cross-module contracts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It in 30 Minutes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repo&lt;/li&gt;
&lt;li&gt;Open &lt;code&gt;templates/prompts/phase1-specification.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Copy the system prompt into Claude or GPT-4&lt;/li&gt;
&lt;li&gt;Start describing a module from your own project&lt;/li&gt;
&lt;li&gt;Watch the AI produce a VDM-SL spec, then explain it back to you in plain language&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No VDM-SL expertise required. The AI reads and writes the formal notation. You verify the &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Need from the Community
&lt;/h2&gt;

&lt;p&gt;This is early-stage. The methodology is sound, but the tooling ecosystem needs work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain templates&lt;/strong&gt; — Healthcare, fintech, logistics, IoT. Each domain has its own patterns. We need VDM-SL templates for common domain models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration tooling&lt;/strong&gt; — Better ways to coordinate multiple AI agents with shared specifications. Integration with existing agent frameworks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Case studies&lt;/strong&gt; — Real teams trying this on real projects. What worked? What broke? Where are the rough edges?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification automation&lt;/strong&gt; — Tools that automatically check &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; across module boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of this resonates, check out the repo and open an issue. PRs welcome. Case study reports are especially valuable — even "I tried this and it didn't work because..." helps enormously.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The era of single-agent AI coding is already giving way to multi-agent development. The unsolved problem is coordination. Natural language specs create ambiguity that grows combinatorially with the number of agent interactions. Tests verify specific cases but can't guarantee compositional correctness.&lt;/p&gt;

&lt;p&gt;Formal specifications — specifically, VDM-SL with pre/post conditions and invariants — provide what multi-agent development needs: &lt;strong&gt;unambiguous, mechanically verifiable contracts between agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The human role doesn't disappear. It elevates. You become the domain expert and architect who ensures the AI team builds the right thing. The specs ensure they build it correctly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Hikaru Ando — IID Systems&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;GitHub: formal-spec-driven-dev&lt;/a&gt; | Apache 2.0&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is also available in &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev/blob/main/docs/ja/paper.md" rel="noopener noreferrer"&gt;Japanese (日本語)&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The End of Test-Driven Development: Best Practices for AI Agent-Driven Development with Formal Methods</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Fri, 27 Mar 2026 08:54:13 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/the-end-of-test-driven-development-best-practices-for-ai-agent-driven-development-with-formal-4ma5</link>
      <guid>https://dev.to/kotaroyamame/the-end-of-test-driven-development-best-practices-for-ai-agent-driven-development-with-formal-4ma5</guid>
      <description>&lt;h1&gt;
  
  
  AI Agent-Driven Development with Formal Methods — The Human Role in the Era of Multi-Agent Coordination
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;— Toward a Near Future Where AI Agent Teams Autonomously Build Systems Using Formal Specifications as Contracts —&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction: Entering the Era of AI Agent-Driven Development
&lt;/h2&gt;

&lt;p&gt;The rapid evolution of large language models (LLMs) is fundamentally challenging how software is built. AI tools such as GitHub Copilot, Claude, and GPT-4 already demonstrate practical capability in code generation, review, and refactoring. But the trajectory points beyond "using AI as a tool"—toward a new paradigm in which &lt;strong&gt;multiple AI agents coordinate like a development team, autonomously constructing systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This raises a fundamental question: when multiple AI agents coordinate, what serves as the "team's common language"? Vague natural-language specifications cannot eliminate interpretive discrepancies between agents. Applying Test-Driven Development (TDD) doesn't help either—tests are grounded in inductive reasoning and cannot function as a mechanism for inter-agent specification agreement.&lt;/p&gt;

&lt;p&gt;What this article proposes is a multi-agent development paradigm in which formal methods—specifically VDM (Vienna Development Method)—serve as the backbone of specification, with formal specifications functioning as &lt;strong&gt;rigorous contracts between agents&lt;/strong&gt;. Think of it like construction: the human serves as domain expert and architect, communicating the overall design intent, while multiple AI agents work in parallel like a construction crew—one handling the foundation, another the structural frame, another electrical, another plumbing. Each agent understands its boundaries through the formal specification—the "blueprint"—enabling unambiguous coordination.&lt;/p&gt;

&lt;p&gt;This article first identifies the fundamental limitations of TDD (Chapter 1), then argues why formal methods are ideally suited as "contracts" between AI agents (Chapters 2–5), presents a concrete multi-agent coordination architecture (Section 7.3), and discusses &lt;strong&gt;how the human role is redefined&lt;/strong&gt; in this new paradigm—as domain expert and architecture-level decision maker.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 1: Why Test-Driven Development Is Fundamentally Flawed
&lt;/h2&gt;

&lt;p&gt;Before proceeding, a clarification: this article critiques the &lt;em&gt;paradigm&lt;/em&gt; of TDD—the idea that tests should be the central driver of design and quality assurance—not the act of testing itself. The role testing continues to play is addressed in Section 7.2.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 The Fundamental Limit of Tests: The Trap of Inductive Reasoning
&lt;/h3&gt;

&lt;p&gt;The core problem with TDD is that tests are grounded in &lt;strong&gt;inductive reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A test asks: "For this finite set of inputs, does the program produce the expected output?" But the input space of virtually any real program is infinite or astronomically large. Edsger W. Dijkstra put it plainly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Program testing can be used to show the presence of bugs, but never to show their absence."&lt;br&gt;
— Edsger W. Dijkstra, 1969&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not merely a pithy observation—it is a logically precise statement. Even if every test case &lt;code&gt;{t₁, t₂, ..., tₙ}&lt;/code&gt; passes, there is no guarantee that the program behaves correctly for some untested input &lt;code&gt;tₙ₊₁&lt;/code&gt;. This is the problem of incomplete induction.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 The Mathematics of Incompleteness
&lt;/h3&gt;

&lt;p&gt;Consider an integer addition function &lt;code&gt;add(a, b)&lt;/code&gt;. If both &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are 32-bit integers, the number of possible input pairs is &lt;code&gt;2³² × 2³² = 2⁶⁴ ≈ 1.8 × 10¹⁹&lt;/code&gt;. Running one test per nanosecond, full coverage would take approximately 585 years.&lt;/p&gt;
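&lt;p&gt;The arithmetic is easy to confirm with a short script (one test per nanosecond, Julian years):&lt;/p&gt;

```python
# Back-of-the-envelope check: exhaustively testing add(a, b) over all
# 32-bit integer pairs at one test per nanosecond.
pairs = 2 ** 64                       # 2^32 * 2^32 input pairs (~1.8e19)
seconds = pairs / 1e9                 # one test per nanosecond
years = seconds / (365.25 * 24 * 3600)
print(f"{pairs:.1e} pairs, {years:.0f} years")
```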

&lt;p&gt;In real systems, inputs are far more complex than integer pairs. API requests, database states, timing, concurrency ordering—the state space is effectively infinite. Every TDD test suite is sampling an infinitesimally small region of that space.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 The Counterargument That TDD Is a Design Method
&lt;/h3&gt;

&lt;p&gt;TDD advocates often reframe it: "TDD is not a testing method—it is a design method." Writing tests first forces you to design testable interfaces, the argument goes.&lt;/p&gt;

&lt;p&gt;But this argument has a structural problem. When test-writability becomes the criterion for good design, the tail wags the dog. A testable interface is not necessarily a good interface. Design should be derived from the &lt;strong&gt;logical structure of requirements&lt;/strong&gt;, not reverse-engineered from the convenience of tests.&lt;/p&gt;

&lt;p&gt;In formal methods, the specification itself drives design. By explicitly stating invariants, pre-conditions, and post-conditions, interface correctness is guaranteed at the specification level. There is no need to entrust design to the accidental selection of test cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 The Illusion of Coverage
&lt;/h3&gt;

&lt;p&gt;100% code coverage is often cited as a quality metric. But coverage measures which &lt;em&gt;lines of code were executed&lt;/em&gt;, not which &lt;em&gt;parts of the specification were verified&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Consider this function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test &lt;code&gt;divide(10, 2) == 5.0&lt;/code&gt; achieves 100% line coverage. But the case &lt;code&gt;b = 0&lt;/code&gt; is never tested. Coverage is fundamentally insufficient as a measure of specification completeness.&lt;/p&gt;

&lt;p&gt;A formal specification writes it differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;divide(a: int, b: int) -&amp;gt; float
  pre: b ≠ 0
  post: result * b = a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pre-condition &lt;code&gt;b ≠ 0&lt;/code&gt; resolves the division-by-zero problem at the specification level. No test case can be forgotten because the constraint is structural, not incidental.&lt;/p&gt;
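&lt;p&gt;As a sketch of how that contract can carry into code, the pre- and post-condition become explicit runtime checks — an illustration of the idea, not prescribed tooling:&lt;/p&gt;

```python
def divide(a: int, b: int) -> float:
    # pre: b != 0 -- the caller's obligation, stated at the boundary
    assert b != 0, "pre-condition violated: b must be non-zero"
    result = a / b
    # post: result * b = a (up to floating-point rounding)
    assert abs(result * b - a) < 1e-9, "post-condition violated"
    return result
```

&lt;p&gt;&lt;code&gt;divide(10, 2)&lt;/code&gt; returns &lt;code&gt;5.0&lt;/code&gt;; &lt;code&gt;divide(1, 0)&lt;/code&gt; fails loudly at the pre-condition instead of surfacing as a &lt;code&gt;ZeroDivisionError&lt;/code&gt; somewhere downstream.&lt;/p&gt;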




&lt;h2&gt;
  
  
  Chapter 2: What Are Formal Methods? A Focus on VDM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 The Core Idea
&lt;/h3&gt;

&lt;p&gt;Formal methods are a family of techniques that use mathematical notation and inference rules to describe software specifications and reason about their correctness.&lt;/p&gt;

&lt;p&gt;Major formal methods include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VDM (Vienna Development Method):&lt;/strong&gt; Developed at IBM's Vienna lab in the 1970s. Features the model-oriented specification language VDM-SL. Extensive industrial application history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z Notation:&lt;/strong&gt; From Oxford University. Specification based on set theory and schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B Method:&lt;/strong&gt; Used in systems such as the Paris Métro automatic train operation. High affinity with automated proof.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alloy:&lt;/strong&gt; A lightweight formal method from MIT. Supports automatic verification through model checking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLA+:&lt;/strong&gt; Developed by Leslie Lamport. Particularly strong for distributed systems. Adopted extensively by Amazon.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article focuses on VDM for three reasons: VDM-SL is well-suited for AI generation and manipulation; it integrates well with mechanical consistency checking via tools like Overture Tool; and it has a strong industrial track record.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Core Elements of VDM-SL
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Type Definitions:&lt;/strong&gt; VDM-SL defines the structure of data as types. Beyond primitive types (&lt;code&gt;nat&lt;/code&gt;, &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;bool&lt;/code&gt;, &lt;code&gt;char&lt;/code&gt;), it supports set types (&lt;code&gt;set of T&lt;/code&gt;), sequence types (&lt;code&gt;seq of T&lt;/code&gt;), map types (&lt;code&gt;map T1 to T2&lt;/code&gt;), product types, and union types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invariants:&lt;/strong&gt; Conditions that a type or system state must always satisfy, expressed in predicate logic. This ensures that invalid states cannot exist at the specification level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operations:&lt;/strong&gt; State-changing operations defined as pairs of pre-conditions and post-conditions. "What must hold before this operation is called?" and "What must hold after it completes?" are made explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functions:&lt;/strong&gt; Pure computations without side effects. In implicit specifications, only the relationship between inputs and outputs is stated as a predicate—no concrete algorithm is specified.&lt;/p&gt;
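&lt;p&gt;For readers who think in code, the four elements map loosely onto constructs in a mainstream language. The sketch below is an analogy, not a formal translation — VDM-SL invariants hold universally, whereas these checks fire only at runtime:&lt;/p&gt;

```python
from dataclasses import dataclass

# Type definition with an invariant (cf. Password = seq of char, inv 8..128)
@dataclass(frozen=True)
class Password:
    value: str
    def __post_init__(self):
        # Invariant checked at construction: an invalid Password cannot exist
        assert 8 <= len(self.value) <= 128, "invariant violated"

# Operation: a state change framed by a pre-condition and a post-condition
class Counter:
    def __init__(self) -> None:
        self.n = 0
    def increment(self, by: int) -> None:
        assert by >= 1                  # pre-condition
        old = self.n
        self.n += by
        assert self.n == old + by       # post-condition

# Function: a pure computation; an implicit spec would state only the
# input/output relation, not this particular algorithm
def double(x: int) -> int:
    return 2 * x
```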

&lt;h3&gt;
  
  
  2.3 Deductive vs. Inductive Reasoning
&lt;/h3&gt;

&lt;p&gt;The decisive difference between TDD and formal methods lies in the direction of reasoning.&lt;/p&gt;

&lt;p&gt;TDD (inductive) infers general correctness from a finite set of specific test cases. Formal specification (deductive) defines an unambiguous criterion for correctness, against which any candidate implementation can be checked.&lt;/p&gt;

&lt;p&gt;An important distinction must be made here. Formal specification and formal verification are separate processes. Writing a VDM-SL specification guarantees the &lt;strong&gt;internal consistency&lt;/strong&gt; (freedom from contradiction) and &lt;strong&gt;completeness&lt;/strong&gt; (all required operations are defined) of the specification itself. Verifying that an implementation &lt;em&gt;conforms&lt;/em&gt; to the specification requires separate means—theorem provers, property-based testing, etc.—addressed in Chapter 4, Phase 4.&lt;/p&gt;

&lt;p&gt;Nevertheless, inductive reasoning has a fundamental ceiling: no accumulation of test cases reaches certainty. The formal specification approach at minimum defines unambiguously what "correct" means, enabling verification against a clear standard. This is a qualitative difference from TDD.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 3: VDM-SL in Practice — A Concrete Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Example: A User Management System
&lt;/h3&gt;

&lt;p&gt;A note before the code: this article argues that humans do not need to read formal specifications. So why show VDM-SL here at all? For the same reason a building client benefits from knowing that structural calculations exist and what role they play—even without reading them. The following example illustrates how rigorous, and how complete, the artifact generated by AI actually is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Formal specification of a user management system

types
  UserId = nat1
  inv uid == uid &amp;lt;= 999999;

  Email = seq1 of char
  inv email ==
    exists i in set inds email &amp;amp; email(i) = '@'
    and len email &amp;lt;= 254;

  Password = seq of char
  inv pw == len pw &amp;gt;= 8 and len pw &amp;lt;= 128;

  Role = &amp;lt;Admin&amp;gt; | &amp;lt;Editor&amp;gt; | &amp;lt;Viewer&amp;gt;;

  User :: id       : UserId
          email    : Email
          name     : seq1 of char
          role     : Role
          active   : bool
  inv u == len u.name &amp;lt;= 100;

state UserSystem of
  users    : map UserId to User
  nextId   : UserId
  emails   : set of Email
inv mk_UserSystem(users, nextId, emails) ==
  -- All user IDs are less than nextId
  (forall uid in set dom users &amp;amp; uid &amp;lt; nextId)
  -- emails matches the set of all registered email addresses
  and emails = {users(uid).email | uid in set dom users}
  -- Each user's id field matches its key in the map
  and (forall uid in set dom users &amp;amp; users(uid).id = uid)
init s == s = mk_UserSystem({|-&amp;gt;}, 1, {})
end

operations

  RegisterUser(email: Email, name: seq1 of char, role: Role) uid: UserId
    ext wr users  : map UserId to User
        wr nextId : UserId
        wr emails : set of Email
    pre email not in set emails   -- email uniqueness
        and nextId &amp;lt;= 999999      -- ID ceiling
    post let newUser = mk_User(nextId~, email, name, role, true) in
         uid = nextId~
         and users = users~ munion {nextId~ |-&amp;gt; newUser}
         and nextId = nextId~ + 1
         and emails = emails~ union {email};

  DeactivateUser(uid: UserId)
    ext wr users : map UserId to User
    pre uid in set dom users
        and users(uid).active = true
    post users = users~ ++
         {uid |-&amp;gt; mu(users~(uid), active |-&amp;gt; false)};

  ChangeRole(uid: UserId, newRole: Role)
    ext wr users : map UserId to User
    pre uid in set dom users
        and users(uid).active = true
    post users = users~ ++
         {uid |-&amp;gt; mu(users~(uid), role |-&amp;gt; newRole)};

  FindUserByEmail(email: Email) result: [UserId]
    ext rd users : map UserId to User
    post if exists uid in set dom users &amp;amp; users(uid).email = email
         then result &amp;lt;&amp;gt; nil
              and result in set dom users
              and users(result).email = email
         else result = nil;

functions

  ActiveUserCount: map UserId to User -&amp;gt; nat
  ActiveUserCount(users) ==
    card {uid | uid in set dom users &amp;amp; users(uid).active};

  HasAdmin: map UserId to User -&amp;gt; bool
  HasAdmin(users) ==
    exists uid in set dom users &amp;amp;
      users(uid).role = &amp;lt;Admin&amp;gt; and users(uid).active;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 What This Specification Directly Guarantees
&lt;/h3&gt;

&lt;p&gt;Reading the specification carefully reveals that the following properties are &lt;strong&gt;logically guaranteed&lt;/strong&gt;—not probabilistically suggested by test cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Email uniqueness:&lt;/strong&gt; The pre-condition &lt;code&gt;email not in set emails&lt;/code&gt; in &lt;code&gt;RegisterUser&lt;/code&gt; makes duplicate registration with the same address logically impossible at the specification level. Guaranteeing this with TDD would require testing every possible registration order and concurrency pattern—an intractable task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. State consistency:&lt;/strong&gt; The system invariant &lt;code&gt;inv mk_UserSystem(...)&lt;/code&gt; ensures that the &lt;code&gt;emails&lt;/code&gt; set and &lt;code&gt;users&lt;/code&gt; map remain permanently synchronized, and that ID integrity is always maintained. That post-conditions preserve this invariant can be confirmed deductively from the specification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Type constraints:&lt;/strong&gt; &lt;code&gt;UserId&lt;/code&gt; is a natural number between 1 and 999999; &lt;code&gt;Email&lt;/code&gt; is a string of 1–254 characters containing &lt;code&gt;@&lt;/code&gt;; &lt;code&gt;Password&lt;/code&gt; is 8–128 characters. These constraints are built into the type definitions.&lt;/p&gt;
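&lt;p&gt;A deliberately simplified Python mirror makes these guarantees tangible. The class below is an illustration of the contract, not generated output from the workflow:&lt;/p&gt;

```python
class UserSystem:
    """Toy Python mirror of the VDM-SL UserSystem state (illustrative)."""

    def __init__(self) -> None:
        self.users = {}        # map UserId to user record
        self.next_id = 1       # nextId
        self.emails = set()    # set of registered emails

    def _invariant(self) -> bool:
        # Mirrors inv mk_UserSystem(...): ID ceiling, email sync, key = id
        return (all(uid < self.next_id for uid in self.users)
                and self.emails == {u["email"] for u in self.users.values()}
                and all(u["id"] == uid for uid, u in self.users.items()))

    def register_user(self, email: str, name: str, role: str) -> int:
        # Pre-conditions, exactly as in the spec
        assert email not in self.emails, "pre: duplicate email"
        assert self.next_id <= 999999, "pre: ID ceiling reached"
        uid = self.next_id
        self.users[uid] = {"id": uid, "email": email, "name": name,
                           "role": role, "active": True}
        self.next_id += 1
        self.emails.add(email)
        assert self._invariant(), "state invariant broken"
        return uid
```

&lt;p&gt;Registering a duplicate email fails at the pre-condition; every successful registration leaves the invariant intact.&lt;/p&gt;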

&lt;h3&gt;
  
  
  3.3 Design Challenges Arising from Specification Contracts
&lt;/h3&gt;

&lt;p&gt;Section 3.2 addressed properties that the specification directly guarantees. But formal specifications serve a second, equally important function: because their contracts are explicit, they &lt;strong&gt;surface design challenges&lt;/strong&gt; that would otherwise remain invisible until implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-conditions are responsibility-boundary contracts.&lt;/strong&gt; For example, the pre-condition &lt;code&gt;users(uid).active = true&lt;/code&gt; in &lt;code&gt;ChangeRole&lt;/code&gt; declares: "changing the role of an inactive user is outside this operation's scope." Invalid inputs do not magically disappear—the contract states: "if this condition does not hold, this operation's behavior is not guaranteed."&lt;/p&gt;

&lt;p&gt;This explicit contract gives rise to two concrete design challenges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design challenge 1: Compositional verification across modules.&lt;/strong&gt; If module A calls &lt;code&gt;RegisterUser&lt;/code&gt; (post-condition: &lt;code&gt;active = true&lt;/code&gt;) and module B immediately calls &lt;code&gt;ChangeRole&lt;/code&gt;, A's post-condition logically implies B's pre-condition (A.post ⇒ B.pre). This can be formally confirmed—proving that the module composition is free of contradiction at the specification level. This is a guarantee that testing struggles to achieve. This verification is performed automatically by AI during Phase 2 (design).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design challenge 2: Defensive placement.&lt;/strong&gt; When data that violates a pre-condition reaches the system, the question becomes: where and how to defend? Should the validation layer sit at the API gateway? At the entry point of each module? Should a data quarantine (sanitization) module be inserted, and at which architectural layer? These are design questions, not specification questions—but they only surface as concrete agenda items &lt;em&gt;because&lt;/em&gt; the pre-conditions are explicit. With ambiguous natural-language specifications, there is a real risk that these defensive design discussions never occur before implementation begins.&lt;/p&gt;

&lt;p&gt;These design challenges are resolved concretely in Phase 2 (AI-driven design) of the workflow described in Chapter 4. Formal specifications simultaneously deliver two forms of value: &lt;strong&gt;correctness guarantees&lt;/strong&gt; (Section 3.2) and &lt;strong&gt;design challenge clarification&lt;/strong&gt; (this section).&lt;/p&gt;
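&lt;p&gt;Design challenge 1 can be illustrated in miniature. The predicates below are simplified stand-ins for the two contracts, and the sampled states stand in for what a real checker would reason about symbolically:&lt;/p&gt;

```python
# Toy check of A.post => B.pre at a module boundary.
# Simplified stand-ins for RegisterUser.post and ChangeRole.pre.

def register_user_post(user: dict) -> bool:
    # RegisterUser's post-condition: the new user exists and is active
    return user["id"] >= 1 and user["active"]

def change_role_pre(user: dict) -> bool:
    # ChangeRole's pre-condition: the target user exists and is active
    return user["id"] >= 1 and user["active"]

def composition_ok(states) -> bool:
    # Every state A's post-condition admits must satisfy B's pre-condition
    return all(change_role_pre(u) for u in states if register_user_post(u))

samples = [{"id": 1, "active": True}, {"id": 2, "active": False}]
print(composition_ok(samples))
```

&lt;p&gt;If ChangeRole's pre-condition were later strengthened — say, to exclude certain roles — the check would fail, and the mismatch would surface at design time rather than in production.&lt;/p&gt;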




&lt;h2&gt;
  
  
  Chapter 4: The AI-Driven Development Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 The Proposed Workflow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Client and Architect — Dialogue-Driven Formal Specification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This phase mirrors the relationship between a building client and an architect. The client (human) does not read structural calculations. They converse with the architect (AI), confirming that their requirements are correctly captured.&lt;/p&gt;

&lt;p&gt;The human communicates business requirements in natural language. AI generates VDM-SL. The human never reads the formal specification directly. AI explains the specification's content and implications in natural language; the human judges whether the explanation matches their intent.&lt;/p&gt;

&lt;p&gt;A concrete example of this dialogue: the human says, "I want to prevent duplicate email addresses at registration." AI responds: "I've defined the pre-condition &lt;code&gt;email not in set emails&lt;/code&gt;. This makes re-registration with an already-registered address logically impossible. However, should email addresses from deactivated accounts be reusable? I need your decision on that." The human decides based on their domain knowledge.&lt;/p&gt;

&lt;p&gt;In this phase, the human is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Providing business domain knowledge&lt;/li&gt;
&lt;li&gt;Hearing AI's explanations and judging specification validity — "Is that actually the intended behavior?"&lt;/li&gt;
&lt;li&gt;Asking about edge cases — "What happens when a deactivated user tries to re-register?"&lt;/li&gt;
&lt;li&gt;Final approval of the specification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI is responsible for: converting natural-language requirements into VDM-SL; checking internal consistency; explaining the specification and its implications clearly; and surfacing potential edge cases and contradictions.&lt;/p&gt;

&lt;p&gt;The key point: the formal specification is an artifact that AI reads and writes. Humans need not understand VDM-SL notation. Why use VDM-SL rather than a natural-language specification document? This question is addressed in Section 5.3.&lt;/p&gt;

&lt;p&gt;Here is another concrete example—a Phase 1 dialogue for the seemingly simple requirement of "sorting numbers":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Human:&lt;/strong&gt; I want to sort input numbers in ascending order.&lt;br&gt;
&lt;strong&gt;AI:&lt;/strong&gt; So for &lt;code&gt;{3, 2, 7}&lt;/code&gt;, the output should be &lt;code&gt;[2, 3, 7]&lt;/code&gt;?&lt;br&gt;
&lt;strong&gt;Human:&lt;/strong&gt; Yes.&lt;br&gt;
&lt;strong&gt;AI:&lt;/strong&gt; Just to confirm—would &lt;code&gt;[1, 2, 3]&lt;/code&gt; be acceptable for &lt;code&gt;{3, 2, 7}&lt;/code&gt;? That is, is any ascending sequence fine, or must the output contain only the input elements?&lt;br&gt;
&lt;strong&gt;Human:&lt;/strong&gt; It must contain all elements from the input.&lt;br&gt;
&lt;strong&gt;AI:&lt;/strong&gt; Then for &lt;code&gt;{3, 3, 1, 8}&lt;/code&gt;, would &lt;code&gt;[1, 3, 8]&lt;/code&gt; be correct? Should duplicates be collapsed?&lt;br&gt;
&lt;strong&gt;Human:&lt;/strong&gt; No—just rearrange the original values. Duplicates remain as they are.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After three rounds of dialogue, AI generates the following VDM-SL specification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;functions

sort[@T]: seq of @T * (@T * @T -&amp;gt; bool) -&amp;gt; seq of @T
sort(l, less) == is not yet specified
post bagOf[@T](l) = bagOf[@T](RESULT)              -- Permutation
     and forall i in set {1, ..., len RESULT - 1} &amp;amp;
         not less(RESULT(i+1), RESULT(i))           -- Sorted
measure len l;

bagOf[@T]: seq of @T -&amp;gt; map @T to nat
bagOf(s) ==
  { i |-&amp;gt; occurs[@T](i, s) | i in set elems s }
post dom RESULT = elems s and sizeOfBag[@T](RESULT) = len s;

sizeOfBag[@T]: map @T to nat -&amp;gt; nat
sizeOfBag(b) ==
  -- Total number of elements in the bag
  if b = {|-&amp;gt;} then 0
  else let e in set dom b in b(e) + sizeOfBag[@T]({e} &amp;lt;-: b)
measure card dom b;

occurs[@T]: @T * seq of @T -&amp;gt; nat
occurs(e, s) ==
  if s = [] then 0
  else (if e = hd s then 1 else 0) + occurs[@T](e, tl s)
measure len s;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note what happened: three rounds of misunderstanding in natural language were resolved by two post-conditions in VDM-SL. &lt;code&gt;bagOf(l) = bagOf(RESULT)&lt;/code&gt; means "the output is a permutation of the input (same elements, same counts)." The ordering condition means "no adjacent pair is out of order." This specification is satisfied by bubble sort, quicksort, and merge sort alike. &lt;strong&gt;Algorithm selection is a design decision, not a specification concern.&lt;/strong&gt;&lt;/p&gt;
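&lt;p&gt;The two post-conditions translate directly into an executable property check. Below, Python's &lt;code&gt;Counter&lt;/code&gt; plays the role of &lt;code&gt;bagOf&lt;/code&gt; — a sketch of the spec as an oracle, not the framework's verifier:&lt;/p&gt;

```python
from collections import Counter

def satisfies_sort_spec(inp, out) -> bool:
    """The two post-conditions as an executable property."""
    is_permutation = Counter(inp) == Counter(out)   # bagOf(l) = bagOf(RESULT)
    is_sorted = all(out[i] <= out[i + 1]            # no adjacent pair out of order
                    for i in range(len(out) - 1))
    return is_permutation and is_sorted

def bubble_sort(xs):
    # One of many algorithms that satisfy the same specification
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

data = [3, 3, 1, 8]
assert satisfies_sort_spec(data, sorted(data))       # built-in Timsort conforms
assert satisfies_sort_spec(data, bubble_sort(data))  # so does bubble sort
assert not satisfies_sort_spec(data, [1, 3, 8])      # collapsed duplicate: rejected
```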

&lt;p&gt;&lt;strong&gt;Phase 2: AI-Driven Technology Selection and Architecture — Integrating Non-Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The formal specification from Phase 1 defines &lt;em&gt;what&lt;/em&gt; to compute, but not at what scale, speed, or cost. Phase 2 integrates &lt;strong&gt;non-functional requirements&lt;/strong&gt; provided separately by the human, combining them with the formal specification to make technical design decisions.&lt;/p&gt;

&lt;p&gt;For the sorting specification above, suppose the human states: "Data scale is approximately n = 10³⁰," "Compute resources and budget are limited," and "Stable sorting is not required." Combining these constraints with the specification, AI determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programming language (weighing type safety, performance requirements, ecosystem)&lt;/li&gt;
&lt;li&gt;Algorithm selection (n = 10³⁰ requires external sorting; stability not required favors quicksort variants)&lt;/li&gt;
&lt;li&gt;Frameworks and libraries&lt;/li&gt;
&lt;li&gt;Architecture (whether distributed processing is needed, microservices vs. monolith, database selection, etc.)&lt;/li&gt;
&lt;li&gt;Infrastructure (server specifications, cloud provider, container orchestration, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key structural insight: the formal specification provides the &lt;strong&gt;correctness criterion&lt;/strong&gt; while non-functional requirements provide the &lt;strong&gt;design constraints&lt;/strong&gt;, and the two are independent. Bubble sort satisfies the VDM-SL specification but is O(n²)—unworkable for n = 10³⁰—and is rejected based on non-functional requirements. The separation of formal specification and non-functional requirements prevents correctness and efficiency from becoming entangled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: AI Code Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI generates code corresponding to the VDM-SL specification. The key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VDM-SL pre-conditions are implemented as runtime validation or assertions&lt;/li&gt;
&lt;li&gt;Post-conditions serve as correctness criteria for the generated code&lt;/li&gt;
&lt;li&gt;Invariants become design constraints on data structures&lt;/li&gt;
&lt;li&gt;Type definitions are mapped as directly as possible to the implementation language's type system&lt;/li&gt;
&lt;/ul&gt;
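&lt;p&gt;As one illustration of the first two bullets, generated code might carry the pre-condition as runtime validation and re-check the post-condition after execution. A hedged Python sketch (the &lt;code&gt;contract&lt;/code&gt; decorator and all names are hypothetical, not the output of any actual tool):&lt;/p&gt;

```python
from collections import Counter
from functools import wraps

def contract(pre=None, post=None):
    """Attach a VDM-style pre-condition and post-condition as runtime checks."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args):
            if pre is not None:
                assert pre(*args), f"pre-condition of {fn.__name__} violated"
            result = fn(*args)
            if post is not None:
                assert post(result, *args), f"post-condition of {fn.__name__} violated"
            return result
        return wrapper
    return deco

@contract(
    # pre: inputs must be mutually orderable (here: all ints)
    pre=lambda xs: all(isinstance(x, int) for x in xs),
    # post: permutation of the input, and no adjacent pair out of order
    post=lambda r, xs: Counter(r) == Counter(xs)
    and all(r[i] <= r[i + 1] for i in range(len(r) - 1)),
)
def sort_impl(xs):
    return sorted(xs)

print(sort_impl([3, 1, 2]))  # [1, 2, 3]
```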

&lt;p&gt;A notable aspect of this phase is the design decision involved in translating VDM-SL's abstract types into concrete data structures. VDM-SL is intentionally abstract about implementation details. For example, &lt;code&gt;seq of T&lt;/code&gt; (an ordered collection) might become an &lt;code&gt;Array&lt;/code&gt; if random access dominates, a &lt;code&gt;LinkedList&lt;/code&gt; if head insertions and deletions are frequent, or an &lt;code&gt;ArrayList&lt;/code&gt; or &lt;code&gt;Deque&lt;/code&gt; if both are needed. Similarly, &lt;code&gt;set of T&lt;/code&gt; could map to a &lt;code&gt;HashSet&lt;/code&gt;, &lt;code&gt;TreeSet&lt;/code&gt;, or &lt;code&gt;BitSet&lt;/code&gt;; &lt;code&gt;map T1 to T2&lt;/code&gt; could become a &lt;code&gt;HashMap&lt;/code&gt;, &lt;code&gt;TreeMap&lt;/code&gt;, or &lt;code&gt;ConcurrentHashMap&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These decisions cannot be derived from type definitions alone. AI makes them by synthesizing the operation frequency patterns in the specification (is lookup dominant, or insertion?), the non-functional requirements established in Phase 2 (concurrency, memory constraints, latency requirements), and the characteristics of the selected language and frameworks. In other words, translating from the specification's "What" to the implementation's "How" is not a trivial mapping but an optimization process coupled with Phase 2's architectural decisions.&lt;/p&gt;
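&lt;p&gt;As a toy illustration of this optimization step, the choice could be driven by an operation profile extracted from the specification. A Python sketch (the profile fields and the decision rule are invented for illustration):&lt;/p&gt;

```python
from collections import deque

def concrete_seq_type(head_ops: int, random_access_ops: int):
    """Pick a concrete structure for an abstract 'seq of T' from an operation profile."""
    # deque: O(1) insertion/removal at the head; list: O(1) random access
    return deque if head_ops > random_access_ops else list

# A head-insertion-heavy profile selects deque; a lookup-heavy one selects list.
print(concrete_seq_type(900, 10).__name__)    # deque
print(concrete_seq_type(5, 10_000).__name__)  # list
```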

&lt;p&gt;&lt;strong&gt;Phase 4: Specification Conformance Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The generated code is verified against the formal specification. This is the first point at which testing appears—but its role is fundamentally different from TDD. Tests are property-based tests derived automatically from the specification. No human writes individual test cases by hand.&lt;/p&gt;

&lt;p&gt;Because the specification is precisely defined, test generation enjoys enormous flexibility: random tests that generate inputs satisfying pre-conditions and verify post-conditions, boundary-value tests that probe the edges of invariant constraints, and systematic tests that cover all state-transition paths can all be generated automatically from the specification. For the sorting example, the two post-conditions—&lt;code&gt;bagOf(l) = bagOf(RESULT)&lt;/code&gt; and the ordering condition—generate tests with empty sequences, single elements, all-identical elements, reverse-ordered inputs, and massive sequences, all without human intervention.&lt;/p&gt;
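&lt;p&gt;The generation step itself can be sketched in a few lines: the boundary cases fall out of the structure of the specification, and random cases fill in the rest. (A hedged Python sketch; &lt;code&gt;satisfies_spec&lt;/code&gt; inlines the two post-conditions for integers, and the built-in &lt;code&gt;sorted&lt;/code&gt; stands in for the generated implementation.)&lt;/p&gt;

```python
import random
from collections import Counter

def satisfies_spec(inp, out):
    # the two post-conditions: permutation + no adjacent pair out of order
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    return Counter(inp) == Counter(out) and ordered

def generated_inputs():
    yield []                             # empty sequence
    yield [42]                           # single element
    yield [7] * 100                      # all-identical elements
    yield list(range(500, 0, -1))        # reverse-ordered input
    for _ in range(100):                 # random cases within the pre-condition
        yield [random.randint(-10, 10) for _ in range(random.randint(0, 50))]

implementation_under_test = sorted       # stand-in for the generated code

for xs in generated_inputs():
    assert satisfies_spec(xs, implementation_under_test(list(xs)))
print("all generated cases satisfy the spec")
```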

&lt;p&gt;&lt;strong&gt;Phases 2–4 Are an Autonomous AI Cycle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A critical point deserves emphasis: the cycle of Phase 2 (design) → Phase 3 (implementation) → Phase 4 (verification) is &lt;strong&gt;executed entirely by AI, autonomously&lt;/strong&gt;. If Phase 4 detects a specification violation, AI either fixes the implementation or revisits design decisions (algorithm selection, data structure choices) and re-implements. No human intervenes in this feedback loop.&lt;/p&gt;

&lt;p&gt;This is possible precisely because the specification is formally defined. If the specification were ambiguous, AI could not autonomously determine whether a defect is a specification problem or an implementation problem. But a formal specification provides an unambiguous criterion for correctness, enabling AI to automatically judge "this does not satisfy the specification → correction is needed." Once the human confirms the specification in Phase 1, AI can be entrusted with the entire process until a working system is delivered.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 What Humans Need to Know
&lt;/h3&gt;

&lt;p&gt;The skills required of humans in this paradigm are fundamentally different from traditional development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required skills:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain knowledge:&lt;/strong&gt; Deep understanding of the business domain being developed. This is the irreplaceable human contribution—the one AI cannot substitute. Decisions like "should deactivated users' email addresses be reusable?" can only be made by someone who understands the business context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical dialogue ability:&lt;/strong&gt; The ability to identify logical contradictions or missing requirements through natural-language conversation when AI explains the specification. Reading formal notation is not required, but being able to reason in plain language about structures like "if A then B" or "for every X, Y holds" is valuable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ability to articulate requirements:&lt;/strong&gt; The ability to make implicit domain knowledge explicit in a form that AI can process. This is the same skill as traditional requirements definition, but the bar for clarity rises when conversing with AI—ambiguity cannot be left unresolved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skills that become unnecessary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading or writing formal method notation (AI generates the specification and explains it in plain language)&lt;/li&gt;
&lt;li&gt;Proficiency in specific programming languages (AI selects the appropriate language and writes the code)&lt;/li&gt;
&lt;li&gt;Detailed knowledge of frameworks and libraries (AI selects the optimal ones)&lt;/li&gt;
&lt;li&gt;Test case design (automatically derived from the specification)&lt;/li&gt;
&lt;li&gt;Detailed infrastructure design (AI derives this from non-functional requirements in the specification)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.3 Comparison with Traditional Development
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;TDD&lt;/th&gt;
&lt;th&gt;Formal Methods + AI-Driven&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correctness guarantee&lt;/td&gt;
&lt;td&gt;Inductive (finite test cases)&lt;/td&gt;
&lt;td&gt;Deductive (logical specification)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specification ambiguity&lt;/td&gt;
&lt;td&gt;Tests constitute an implicit spec&lt;/td&gt;
&lt;td&gt;Specification is explicit and precise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design driver&lt;/td&gt;
&lt;td&gt;Test writability&lt;/td&gt;
&lt;td&gt;Logical structure of requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human role&lt;/td&gt;
&lt;td&gt;Coder and tester&lt;/td&gt;
&lt;td&gt;Client (domain expert + decision-maker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI utilization&lt;/td&gt;
&lt;td&gt;Code completion&lt;/td&gt;
&lt;td&gt;Design → implementation → verification, end-to-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Test count grows explosively&lt;/td&gt;
&lt;td&gt;Specification scales with problem complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug detection timing&lt;/td&gt;
&lt;td&gt;After implementation (at test runtime)&lt;/td&gt;
&lt;td&gt;During specification (detected as logical contradiction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Chapter 5: Why Formal Methods—Why Now?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 The Three Walls That Blocked Formal Methods Before LLMs
&lt;/h3&gt;

&lt;p&gt;Formal methods have existed since the 1970s yet never achieved broad industrial adoption. Three walls stood in the way.&lt;/p&gt;

&lt;p&gt;First, writing formal specifications required advanced mathematical training. People capable of writing VDM-SL or Z notation were scarce, and demanding that from an entire team was unrealistic.&lt;/p&gt;

&lt;p&gt;Second, a large gap existed between formal specifications and working code. Even a beautiful specification required manual effort to implement, and bugs crept in at that translation step.&lt;/p&gt;

&lt;p&gt;Third, the return on investment was opaque. The time cost of writing specifications was high; the payoff was unclear; adoption was confined to safety-critical domains such as aerospace, rail, and medical devices.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Why LLMs Are a Game Changer
&lt;/h3&gt;

&lt;p&gt;LLMs dissolve all three walls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 1 dissolved:&lt;/strong&gt; Humans no longer need to write or read formal specifications. Business requirements are expressed in natural language; AI converts them to VDM-SL. AI explains the specification in natural language; humans evaluate the explanation. A building client grasps the design through dialogue with the architect—without reading a single structural calculation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 2 dissolved:&lt;/strong&gt; AI performs the translation from formal specification to code. The gap between specification and implementation shrinks dramatically. AI generates code with an understanding of pre-conditions, post-conditions, and invariants—reducing the risk of bugs entering at the translation step. That said, current LLMs can hallucinate when handling complex specifications, so conformance verification (Phase 4) cannot be skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 3 dissolved:&lt;/strong&gt; AI automation has slashed the cost of specification. What once took weeks can now be accomplished in hours or days through AI dialogue. The ROI equation has changed fundamentally.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.3 Rebutting "If Only AI Reads It, Why Bother with Formal Methods?"
&lt;/h3&gt;

&lt;p&gt;A natural objection arises: if humans don't read the formal specification, why write it in VDM-SL at all? Couldn't AI manage the specification in natural language?&lt;/p&gt;

&lt;p&gt;The answer is clear: &lt;strong&gt;formal notation functions as a discipline of thought for AI, not just for humans.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, formal notation structurally eliminates ambiguity. Writing "a user has one email address" in natural language leaves open whether this means exactly one, at least one, or at most one. Defining &lt;code&gt;email: Email&lt;/code&gt; in VDM-SL settles it by notation alone. When AI manages specifications in natural language, it risks introducing exactly this kind of ambiguity. Formal notation eliminates that risk by construction.&lt;/p&gt;

&lt;p&gt;Second, formal specifications are mechanically verifiable intermediate artifacts. VDM-SL specifications can be type-checked and consistency-checked using tools such as Overture Tool. Natural-language specification documents cannot. Even when the AI that wrote the specification and the AI that implements it are different sessions or different models, the formal specification binds both as a precise contract.&lt;/p&gt;

&lt;p&gt;Third, there is the question of auditability. Formal specifications can be verified after the fact by other AI systems, other tools, or future verification systems—regardless of whether any human reads them. For example, a security audit can mechanically confirm from the formal specification whether all operations are accessible only to authenticated users. A compliance review can automatically verify whether a personal data deletion operation propagates to all related tables. Performing equivalent mechanical verification on natural-language specifications is fundamentally difficult due to inherent ambiguity. This auditability value grows as project scale increases.&lt;/p&gt;

&lt;p&gt;In short, formal methods exist not "for humans" but "for logical rigor." Writing in formal notation improves AI's reasoning precision even when no human will ever read the result. This is the same principle by which mathematicians reach more accurate conclusions using symbols than intuition alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.4 The Transition Strategy Before AGI
&lt;/h3&gt;

&lt;p&gt;Current LLMs are not AGI. They lack the ability to autonomously understand business requirements and design optimal systems from scratch. But they already demonstrate practical capability in understanding formal specifications and generating conformant code.&lt;/p&gt;

&lt;p&gt;This profile—strong at specification understanding and implementation, limited at requirements definition—is ideal for combination with formal methods. Humans convey the business "What" in natural language; AI translates it into a formal specification, verifies it through dialogue with the human, and then implements the technical "How."&lt;/p&gt;

&lt;p&gt;When AGI arrives, AI may be able to handle even the "What." Until then, formal methods combined with AI-driven development is the most rational approach available.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 6: A Practical Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Individual Level
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Practice logical articulation (1–2 weeks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start by articulating familiar business rules precisely in natural language. For an e-commerce order process: "Items with zero inventory cannot be ordered." "A user can have at most one order in processing at a time." Write these as explicit conditional statements. Understanding concepts like sets, maps, and universal/existential quantification at the natural-language level is sufficient. There is no need to learn VDM-SL notation.&lt;/p&gt;
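&lt;p&gt;For instance, the two order-process rules above, once written as explicit conditionals, correspond directly to checkable predicates. (A Python sketch; the data shapes are invented for illustration:)&lt;/p&gt;

```python
def can_order(inventory: dict, item: str) -> bool:
    """If an item's inventory is zero, then it cannot be ordered."""
    return inventory.get(item, 0) > 0

def can_start_order(orders_in_processing: list, user: str) -> bool:
    """For every user, at most one order may be in processing at a time."""
    return sum(1 for o in orders_in_processing if o == user) < 1

print(can_order({"mug": 0, "pen": 3}, "mug"))  # False: zero inventory
print(can_start_order(["alice"], "alice"))     # False: already one in processing
```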

&lt;p&gt;&lt;strong&gt;Step 2: Practice specification dialogue with AI (1–2 weeks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ask AI to "write a formal VDM-SL specification for a user management feature." As AI explains the generated specification, practice identifying gaps and contradictions. The goal is to develop the ability to refine a specification by asking questions like "What happens to a deactivated user's data?" or "Can an administrator delete their own account?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Run the full cycle on a small project (2–4 weeks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Execute the entire pipeline—from specification dialogue through code generation to reviewing the deliverable. Start with a small API with basic CRUD operations. The essential experience is the feedback loop: use the generated artifact, find behaviors that differ from what was agreed in the specification dialogue, and return that feedback to AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.2 Team and Organizational Level
&lt;/h3&gt;

&lt;p&gt;For organizational adoption, begin with a single small new service as a pilot project built entirely with formal methods + AI-driven development. Do not attempt to migrate existing large systems immediately. Demonstrate value first, then scale.&lt;/p&gt;

&lt;p&gt;The review process changes fundamentally. Traditional code review is replaced by &lt;strong&gt;specification dialogue review&lt;/strong&gt;. Share the AI specification-dialogue logs; convene domain-knowledgeable humans to discuss whether the specification correctly reflects the requirements. Delegate implementation details to AI; focus human attention at the specification level.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 7: Honest Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Where Formal Methods Are Not Enough
&lt;/h3&gt;

&lt;p&gt;Formal methods are not universal. Complementary approaches remain necessary in several areas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI/UX specification:&lt;/strong&gt; The "usability" and "aesthetics" of user interfaces are difficult to formalize. Prototyping and user testing remain valuable. However, the business logic beneath the UI—state transitions, validation rules, and so on—is fully formalizable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance characteristics:&lt;/strong&gt; VDM-SL describes &lt;em&gt;what&lt;/em&gt; is computed, not &lt;em&gt;how fast&lt;/em&gt;. Performance requirements must be handled separately as non-functional requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External system integration:&lt;/strong&gt; The actual behavior of third-party APIs cannot be fully captured in a formal specification. Integration testing remains valuable at system boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.2 Tests Do Not Disappear Entirely
&lt;/h3&gt;

&lt;p&gt;The article's title is provocative, but it does not reject testing wholesale. What it rejects is the TDD &lt;em&gt;paradigm&lt;/em&gt;—the idea that tests should be the center of design and quality assurance.&lt;/p&gt;

&lt;p&gt;Testing persists in this approach in the following roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Property-based tests automatically derived from the specification (conformance verification)&lt;/li&gt;
&lt;li&gt;Integration tests for external system boundaries&lt;/li&gt;
&lt;li&gt;Performance and load tests&lt;/li&gt;
&lt;li&gt;End-to-end / acceptance tests for UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shift is that testing moves from center stage to a supporting role.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.3 Multi-Agent Coordination and Applicable Scale — Scaling Through Formal Specifications
&lt;/h3&gt;

&lt;p&gt;Can a single AI agent autonomously build an entire large-scale system? No—context window constraints prevent it from grasping a system of tens of thousands of lines at once. But this is the wrong question. In human development teams, no single engineer holds the entire system in their head either. Team development works because interfaces between modules have been agreed upon.&lt;/p&gt;

&lt;p&gt;Formal specifications provide precisely this "interface agreement that enables team development"—rigorously. And this is the key that makes autonomous construction of medium-to-large systems possible even with current AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Agent Architecture with Formal Specifications as Contracts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The structure is as follows.&lt;/p&gt;

&lt;p&gt;In Phase 1 (human + architect AI), the formal specification for the entire system is defined. The core activity here is module decomposition and the definition of inter-module interface specifications—the pre-conditions and post-conditions of operations each module exposes. This set of specifications becomes the "contract" for all agents.&lt;/p&gt;

&lt;p&gt;In Phases 2–4, independent AI agents work on their assigned modules &lt;strong&gt;in parallel&lt;/strong&gt;, each executing design, implementation, and verification autonomously. Agent A handles the inventory management module, Agent B handles order processing, Agent C handles payment. Each agent's context needs only "its own module's specification" plus "the interface specifications of modules it depends on (pre/post-conditions only)"—implementation details of other modules are entirely unnecessary. This fits comfortably within current context windows.&lt;/p&gt;

&lt;p&gt;For integration verification, a dedicated integration agent cross-checks the published interface specifications of all modules, mechanically verifying the A.post ⇒ B.pre compositional consistency. This agent need not examine implementation code at all—it specializes in specification-level consistency checking.&lt;/p&gt;
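&lt;p&gt;Fully proving A.post ⇒ B.pre requires a theorem prover or SMT solver, but the shape of the check the integration agent performs can be sketched by sampling. (Hypothetical Python; the two conditions are invented stand-ins for two modules' contracts:)&lt;/p&gt;

```python
# Module A (inventory): post-condition on the quantity it emits
a_post = lambda qty: qty >= 0             # reserved quantity is non-negative
# Module B (order processing): pre-condition on the quantity it accepts
b_pre = lambda qty: 0 <= qty <= 1000      # accepts 0..1000 only

def implies_on_samples(post, pre, samples):
    """Check post(x) => pre(x) on sample values; a counterexample disproves composition."""
    for x in samples:
        if post(x) and not pre(x):
            return False, x               # A may emit a value B rejects
    return True, None

ok, counterexample = implies_on_samples(a_post, b_pre, range(-5, 2001))
print((ok, counterexample))  # (False, 1001): the composition A -> B is unsafe as specified
```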

&lt;p&gt;&lt;strong&gt;Why this structure is impossible with natural-language specifications but possible with formal specifications&lt;/strong&gt; comes down to three reasons. First, each agent can independently determine whether its implementation satisfies its spec. Because the correctness criterion is unambiguous, no "alignment meetings" with other agents are needed. Second, inter-module consistency verification can be performed mechanically. Implication relationships between post-conditions and pre-conditions are tool-verifiable. Third, the context each agent must hold is minimized. No agent needs to know the internals of other modules—only their interface specifications.&lt;/p&gt;

&lt;p&gt;This is the same principle by which API documentation and interface definitions (OpenAPI, Protocol Buffers, etc.) enable division of labor in human teams. However, natural-language API documentation leaves ambiguities such as "only success responses documented, error cases undefined." Formal specifications structurally eliminate this ambiguity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remaining Challenges and the Human Role&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Several challenges remain at this point: unifying cross-cutting concerns (authentication/authorization, logging, error handling patterns), managing shared data model consistency (database schemas), and automating agent orchestration (execution order and dependency control). These are not problems of fundamental impossibility but engineering challenges—their resolution is a matter of time.&lt;/p&gt;

&lt;p&gt;For now, these cross-cutting design decisions can be finalized during Phase 1 by the architect AI and human, then included in each agent's instructions. The human's role is not only that of domain expert deciding "what to build," but also that of &lt;strong&gt;architecture-level decision maker&lt;/strong&gt; in multi-agent development. In the building analogy, this is the client who not only decides the floor plan but also consults with the architect on "in what order to commission foundation, structure, electrical, and plumbing, and to which contractors."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling Outlook&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The combination of multi-agent coordination and formal specifications makes medium-scale systems (tens of thousands of lines) theoretically achievable even with current AI models; the maturation of the orchestration layer is the key to practical deployment. Furthermore, as context windows expand, inter-agent coordination protocols become standardized, and long-term memory improves, large-scale systems (over one hundred thousand lines) come into view. Crucially, across all of these technological advances, formal specifications function as "rigorous contracts" between agents. With natural-language specifications, coordination ambiguity grows exponentially as the number of agents increases; with formal specifications, module boundary consistency remains mechanically verifiable regardless of agent count—making the approach inherently scalable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: The Human Role in AI Agent-Driven Development
&lt;/h2&gt;

&lt;p&gt;The multi-agent paradigm of formal methods + AI-driven development fundamentally redefines the human role in software development. The developer who was "a person who writes code" becomes a &lt;strong&gt;domain expert and architecture-level decision maker&lt;/strong&gt;—"a person who decides what to build, designs the overall system structure, and evaluates the output of the AI agent team."&lt;/p&gt;

&lt;p&gt;This is not a devaluation of human contribution—it is an &lt;strong&gt;elevation in abstraction level&lt;/strong&gt;. It is the same evolution as the transition from assembly language to high-level languages, from manual memory management to garbage collection. Humans become free to concentrate on more fundamental questions: "What should be built?", "Why?", and "How should the system be decomposed, and which agents should own which responsibilities?"&lt;/p&gt;

&lt;p&gt;What is required is deep understanding of one's own business domain, structural thinking about system architecture, and the ability to engage in logical dialogue with AI. Reading or writing formal notation is not required—that is the AI agents' responsibility. The human's role is to exercise domain expertise when the architect AI explains in natural language: "Is this module decomposition appropriate for the requirements?" The quality of that judgment is the core competence of humans in the multi-agent development era.&lt;/p&gt;

&lt;p&gt;Formal methods function as a discipline of each AI agent's reasoning, as rigorous contracts between agents, as a means of mechanical verification at module boundaries, and as an auditable record for the future. Humans need not read them—but writing formally carries decisive value, especially in contexts where multiple agents coordinate. This is the central claim of this article.&lt;/p&gt;

&lt;p&gt;The era of TDD's dominance is drawing to a close. We are witnessing the dawn of multi-agent AI-driven development, where formal specifications serve as "contracts" between agents. And the role humans must play in this new paradigm is not writing code, but providing the &lt;strong&gt;decision-making and oversight&lt;/strong&gt; that ensures the AI agent team builds the right thing, correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Dijkstra, E.W. (1969). &lt;em&gt;Notes on Structured Programming&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Jones, C.B. (1990). &lt;em&gt;Systematic Software Development using VDM&lt;/em&gt;. Prentice Hall.&lt;/li&gt;
&lt;li&gt;Fitzgerald, J. &amp;amp; Larsen, P.G. (2009). &lt;em&gt;Modelling Systems: Practical Tools and Techniques in Software Development&lt;/em&gt;. Cambridge University Press.&lt;/li&gt;
&lt;li&gt;Lamport, L. (2002). &lt;em&gt;Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers&lt;/em&gt;. Addison-Wesley.&lt;/li&gt;
&lt;li&gt;Newcombe, C. et al. (2015). "How Amazon Web Services Uses Formal Methods." &lt;em&gt;Communications of the ACM&lt;/em&gt;, 58(4).&lt;/li&gt;
&lt;li&gt;Jackson, D. (2012). &lt;em&gt;Software Abstractions: Logic, Language, and Analysis&lt;/em&gt;. MIT Press.&lt;/li&gt;
&lt;li&gt;Overture Tool Project: &lt;a href="https://www.overturetool.org/" rel="noopener noreferrer"&gt;https://www.overturetool.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Larsen, P.G. et al. (2010). "Industrial Applications of VDM."&lt;/li&gt;
&lt;li&gt;Bicarregui, J. et al. (2009). "Proof and Model Checking for Protocol Design." &lt;em&gt;Formal Aspects of Computing&lt;/em&gt;, 21(1-2).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
