<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kotaro Andy</title>
    <description>The latest articles on DEV Community by Kotaro Andy (@kotaroyamame).</description>
    <link>https://dev.to/kotaroyamame</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3845884%2F7a434bac-98bb-47cd-9078-9d3fb4dfa39a.png</url>
      <title>DEV Community: Kotaro Andy</title>
      <link>https://dev.to/kotaroyamame</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kotaroyamame"/>
    <language>en</language>
    <item>
      <title>I Built a Claude Code Plugin for Formal Specifications and Ran a 30-Trial Evaluation. Here's What I Found.</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Wed, 01 Apr 2026 16:34:57 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/i-built-a-claude-code-plugin-for-formal-specifications-and-ran-a-30-trial-evaluation-heres-what-i-2g02</link>
      <guid>https://dev.to/kotaroyamame/i-built-a-claude-code-plugin-for-formal-specifications-and-ran-a-30-trial-evaluation-heres-what-i-2g02</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: LLMs Write Plausible Code That Misses Edge Cases
&lt;/h2&gt;

&lt;p&gt;If you've used Claude Code (or any LLM-powered coding assistant) for non-trivial tasks, you've probably noticed a pattern: the generated code looks correct, passes the obvious test cases, and then breaks on edge cases that were never specified in the prompt.&lt;/p&gt;

&lt;p&gt;This isn't a model quality issue — it's a specification problem. When you tell an LLM "build a bank transfer system," there are dozens of implicit rules the model has to guess at: What happens if the source and destination accounts are the same? Is a transfer of zero valid? Is the transfer atomic?&lt;/p&gt;
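&lt;p&gt;To make that concrete, here is a minimal sketch (the names and rules are my own illustration, not the plugin's output) of what those implicit rules look like once they are forced into explicit guard clauses:&lt;/p&gt;

```typescript
// Illustrative only: the implicit "bank transfer" rules made explicit.
// The interface and function names are hypothetical, not from the plugin.
interface Account {
  id: string;
  balance: number;
}

function transfer(from: Account, to: Account, amount: number): void {
  if (from.id === to.id) {
    throw new Error("source and destination accounts must differ");
  }
  if (!(amount > 0)) {
    throw new Error("transfer amount must be positive");
  }
  if (amount > from.balance) {
    throw new Error("insufficient funds");
  }
  // Atomicity: mutate state only after every precondition has passed.
  from.balance -= amount;
  to.balance += amount;
}
```

&lt;p&gt;Nothing here is sophisticated; the point is that each guard answers a question the prompt never asked.&lt;/p&gt;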

&lt;p&gt;I built a Claude Code plugin called &lt;strong&gt;Formal Agent Contracts&lt;/strong&gt; to address this. It guides the LLM through defining business rules in &lt;a href="https://en.wikipedia.org/wiki/Vienna_Development_Method" rel="noopener noreferrer"&gt;VDM-SL&lt;/a&gt; (a formal specification language) before writing any implementation code. The idea is that the act of formalizing constraints forces the model to confront ambiguities upfront rather than silently making assumptions.&lt;/p&gt;

&lt;p&gt;The real question was: does this actually improve output quality in a measurable way?&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;p&gt;I designed a controlled evaluation comparing two conditions: 3 benchmark tasks × 5 trials per task per condition, for 2 × 3 × 5 = 30 total runs (15 per condition).&lt;/p&gt;

&lt;h3&gt;
  
  
  Conditions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control (A)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code only, no formal methods&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"{task spec}. Implement in TypeScript with tests."&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Treatment (B)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code + Formal Agent Contracts plugin&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"Run the integrated workflow for {task spec}. Generate TypeScript."&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both groups receive &lt;strong&gt;identical task specifications&lt;/strong&gt; — the treatment group gets no extra information, just a different workflow that starts with spec formalization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Tasks
&lt;/h3&gt;

&lt;p&gt;The tasks increase in complexity, and each carries a set of deliberately hidden "traps" — edge cases not explicitly stated in the prompt that a thorough implementation should handle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Trap Count&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bank Account&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Single agent. CRUD + transfers. Traps: transfer atomicity, zero-balance withdrawal, negative initial balance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Library System&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;3 agents (Catalog, Member, Loan). Traps: inventory count consistency across agents, immediate re-borrow after overdue return, double extension prevention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Online Auction&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;State machine + time constraints + concurrency. Traps: cascading bid extensions, simultaneous bid priority, payment timeout boundaries, re-listing after cancellation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Four metrics, each scored per-trial:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M1 — Contract Violation Detection Rate&lt;/strong&gt;: For each hidden trap, score 0–3 based on whether the code handles it and whether a test covers it. &lt;code&gt;CVDR = Σ(scores) / (traps × 3) × 100%&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M2 — Specification Coverage&lt;/strong&gt;: What fraction of business rules have explicit validation in code (runtime checks, assertions, type constraints)? &lt;code&gt;SC = Σ(rule_scores) / (rules × 3) × 100%&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M4 — Specification Explicitness&lt;/strong&gt;: Are business rules captured in a machine-verifiable form (formal spec = 3, code assertion = 2, comment = 1, implicit = 0)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M6 — Test Effectiveness&lt;/strong&gt;: Heuristic analysis of test code quality — edge case coverage, boundary value testing, negative test cases, test density.&lt;/p&gt;

&lt;p&gt;Scoring was automated via keyword-matching heuristics against gold-standard trap definitions and business rule extractions. All raw scores are in the repo.&lt;/p&gt;
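&lt;p&gt;To make the scoring concrete, here is a toy version of that keyword-matching heuristic. The real scorer is in the repo's &lt;code&gt;eval/&lt;/code&gt; directory; the 2-points-for-code, 1-point-for-tests split below is my simplification of the 0–3 rubric:&lt;/p&gt;

```typescript
// Toy sketch of keyword-based trap scoring (illustrative only; the real
// scorer lives in the repo's eval/ directory).
interface Trap {
  name: string;
  keywords: string[]; // textual evidence that code or tests address the trap
}

// Score one trap 0-3: 2 points if any keyword appears in the implementation,
// 1 more if one also appears in the tests (a simplification of the rubric).
function scoreTrap(trap: Trap, code: string, tests: string): number {
  const inCode = trap.keywords.some((k) => code.includes(k));
  const inTests = trap.keywords.some((k) => tests.includes(k));
  let s = 0;
  if (inCode) s += 2;
  if (inTests) s += 1;
  return s;
}

// CVDR = sum(scores) / (traps * 3) * 100%
function cvdr(traps: Trap[], code: string, tests: string): number {
  const total = traps.reduce((acc, t) => acc + scoreTrap(t, code, tests), 0);
  return (total / (traps.length * 3)) * 100;
}
```

&lt;p&gt;The obvious weakness, discussed under threats to validity below, is that keyword presence says nothing about whether the handling is actually correct.&lt;/p&gt;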

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Aggregate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Treatment&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;Cliff's δ&lt;/th&gt;
&lt;th&gt;Effect Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M1: Violation Detection&lt;/td&gt;
&lt;td&gt;52.0% ± 33.6&lt;/td&gt;
&lt;td&gt;63.9% ± 22.5&lt;/td&gt;
&lt;td&gt;+11.8pp&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M2: Spec Coverage&lt;/td&gt;
&lt;td&gt;39.1% ± 21.2&lt;/td&gt;
&lt;td&gt;81.9% ± 26.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+42.8pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;large&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M4: Spec Explicitness&lt;/td&gt;
&lt;td&gt;8.9% ± 15.3&lt;/td&gt;
&lt;td&gt;100.0% ± 0.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+91.1pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;large&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M6: Test Effectiveness&lt;/td&gt;
&lt;td&gt;66.7% ± 23.4&lt;/td&gt;
&lt;td&gt;87.0% ± 8.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+20.3pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;large&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three out of four metrics show &lt;strong&gt;large effect sizes&lt;/strong&gt; (Cliff's δ ≥ 0.474). For context, a Cliff's δ of 0.74 on M2 means that in 87% of pairwise comparisons between control and treatment runs, the treatment run had higher spec coverage.&lt;/p&gt;
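&lt;p&gt;Cliff's δ is simple to compute: it is the fraction of pairwise comparisons the treatment wins minus the fraction it loses. A minimal implementation:&lt;/p&gt;

```typescript
// Cliff's delta: (#pairs where x beats y minus #pairs where y beats x),
// divided by the total number of pairs.
function cliffsDelta(treatment: number[], control: number[]): number {
  let wins = 0;
  let losses = 0;
  for (const x of treatment) {
    for (const y of control) {
      if (x > y) wins += 1;
      if (y > x) losses += 1;
    }
  }
  return (wins - losses) / (treatment.length * control.length);
}
```

&lt;p&gt;With no ties, the treatment's win rate over all pairs is (1 + δ) / 2, which is where the 87% figure for δ = 0.74 comes from.&lt;/p&gt;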

&lt;h3&gt;
  
  
  The Complexity Scaling Effect
&lt;/h3&gt;

&lt;p&gt;This was the most interesting finding. The plugin's impact scales with task complexity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;M1 Δ&lt;/th&gt;
&lt;th&gt;M2 Δ&lt;/th&gt;
&lt;th&gt;M6 Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Low (Bank)&lt;/td&gt;
&lt;td&gt;-3.3pp&lt;/td&gt;
&lt;td&gt;+40.0pp&lt;/td&gt;
&lt;td&gt;+13.5pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium (Library)&lt;/td&gt;
&lt;td&gt;+15.0pp&lt;/td&gt;
&lt;td&gt;+62.2pp&lt;/td&gt;
&lt;td&gt;+17.5pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High (Auction)&lt;/td&gt;
&lt;td&gt;+23.8pp&lt;/td&gt;
&lt;td&gt;+26.2pp&lt;/td&gt;
&lt;td&gt;+30.0pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the low-complexity task, the control group already does well on M1 — simple business rules are within the LLM's "intuition." But once you introduce multi-agent coordination (medium) or state machines with temporal constraints (high), the gap widens sharply.&lt;/p&gt;

&lt;p&gt;The auction task is where it gets stark. The control group averaged 11.4% on violation detection vs. 35.2% for treatment. Edge cases like cascading bid extension limits, simultaneous bid resolution, and payment timeout boundaries were almost universally missed without formal spec guidance.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Treatment Group Actually Does Differently
&lt;/h3&gt;

&lt;p&gt;Looking at the generated artifacts, the mechanism is clear. The treatment group produces three files per run (&lt;code&gt;.vdmsl&lt;/code&gt; + &lt;code&gt;.ts&lt;/code&gt; + &lt;code&gt;.test.ts&lt;/code&gt;) vs. two for control (&lt;code&gt;.ts&lt;/code&gt; + &lt;code&gt;.test.ts&lt;/code&gt;). The VDM-SL spec acts as an intermediate artifact that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forces explicit enumeration&lt;/strong&gt; of invariants, pre-conditions, and post-conditions before code generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surfaces implicit requirements&lt;/strong&gt; through the formalization process — when you try to write &lt;code&gt;inv balance &amp;gt;= 0&lt;/code&gt; you immediately ask "what about the initial balance?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates more targeted tests&lt;/strong&gt; because the test cases are derived from spec violations rather than happy-path scenarios&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, in the auction task, a typical control run might test &lt;code&gt;bid &amp;gt; currentPrice&lt;/code&gt; but not &lt;code&gt;bid &amp;gt;= currentPrice + minIncrement(currentPrice)&lt;/code&gt;. The treatment run, having defined &lt;code&gt;minBidIncrement&lt;/code&gt; as a VDM-SL function with explicit boundary conditions, naturally produces tests for the 1% minimum and the ¥100 floor.&lt;/p&gt;
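&lt;p&gt;As a hedged reconstruction (the exact rule lives in the benchmark's gold spec, so treat the numbers as illustrative), the increment rule and its boundary look like this:&lt;/p&gt;

```typescript
// Hypothetical reconstruction of the auction task's increment rule:
// minimum increment is 1% of the current price, with a 100-yen floor.
function minBidIncrement(currentPrice: number): number {
  return Math.max(Math.ceil(currentPrice * 0.01), 100);
}

// The spec-derived validity check that treatment runs tended to test.
function isValidBid(bid: number, currentPrice: number): boolean {
  return bid >= currentPrice + minBidIncrement(currentPrice);
}
```

&lt;p&gt;The off-by-one at &lt;code&gt;currentPrice + minBidIncrement(currentPrice)&lt;/code&gt; is exactly the kind of boundary a happy-path test suite never probes.&lt;/p&gt;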

&lt;h2&gt;
  
  
  Threats to Validity
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about the limitations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internal validity&lt;/strong&gt;: The same Claude instance generated both conditions. I introduced deliberate stylistic variation between runs, but the fundamental issue remains — this isn't a human-subjects experiment. The treatment group's improvement could partly reflect the structured workflow itself rather than the formal specification per se.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoring methodology&lt;/strong&gt;: M1 and M2 use heuristic keyword matching, not actual code execution. A function named &lt;code&gt;validateBalance&lt;/code&gt; scores the same whether it's correctly implemented or not. M6 was originally designed to run gold-standard tests via Jest, but filesystem permission constraints forced a fallback to heuristic test code analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M4 is structurally biased&lt;/strong&gt;: The treatment group produces VDM-SL files by definition, which automatically earns the maximum explicitness score of 3. The +91.1pp delta on M4 is real but partially tautological. The more meaningful signal comes from M2 and M6, which measure actual code behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scale&lt;/strong&gt;: Three tasks, five trials per condition. The effect sizes are large enough to be meaningful at n=15 per group, but replication with more tasks and different LLMs would strengthen the conclusions.&lt;/p&gt;

&lt;p&gt;All raw data (30 directories of generated code, scores.json, the evaluation script, and gold-standard test suites) are in the repository for independent verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works in Practice
&lt;/h2&gt;

&lt;p&gt;The plugin bundles 10 skills for Claude Code. The typical workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "Run the integrated workflow for this auction system spec."

Claude (via plugin):
  1. define-contract  → Generates VDM-SL from your natural language description
  2. verify-spec      → Runs VDMJ syntax/type checking + proof obligation generation
  3. smt-verify       → Converts POs to SMT-LIB, proves with Z3
  4. generate-code    → Scaffolds TypeScript/Python with runtime contract checks
  5. (runs tests)     → Validates generated code against the spec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For existing codebases, there's a reverse workflow (&lt;code&gt;extract-spec&lt;/code&gt; → &lt;code&gt;refine-spec&lt;/code&gt; → &lt;code&gt;reconcile-code&lt;/code&gt;) that extracts provisional specs from code and refines them through dialogue.&lt;/p&gt;

&lt;p&gt;You don't need to know VDM-SL. The conversation is in natural language — the plugin handles the translation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/plugin marketplace add kotaroyamame/formal-agent-contracts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or clone and install locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kotaroyamame/formal-agent-contracts.git
&lt;span class="nb"&gt;cd &lt;/span&gt;formal-agent-contracts
/plugin &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prerequisites: Java 11+ (for VDMJ) and optionally Z3 (&lt;code&gt;pip install z3-solver&lt;/code&gt;) for automated proving.&lt;/p&gt;

&lt;p&gt;The full evaluation data, protocol, and scoring scripts are under &lt;code&gt;eval/&lt;/code&gt; in the repo. If you run your own evaluation or replicate these results, I'd love to hear about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/kotaroyamame/formal-agent-contracts" rel="noopener noreferrer"&gt;github.com/kotaroyamame/formal-agent-contracts&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Detailed evaluation report&lt;/strong&gt;: &lt;a href="https://iid.systems/formal-agent-contracts/evaluation/" rel="noopener noreferrer"&gt;iid.systems/formal-agent-contracts/evaluation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>formalmethods</category>
      <category>testing</category>
    </item>
    <item>
      <title>Formal Agent Contracts: Bring Mathematical Rigor to Multi-Agent Development with a Claude Code Plugin</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Wed, 01 Apr 2026 07:54:14 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/formal-agent-contracts-bring-mathematical-rigor-to-multi-agent-development-with-a-claude-code-33k4</link>
      <guid>https://dev.to/kotaroyamame/formal-agent-contracts-bring-mathematical-rigor-to-multi-agent-development-with-a-claude-code-33k4</guid>
      <description>&lt;p&gt;When multiple AI agents collaborate on a system, something always goes wrong at the boundaries. Agent A assumes one format, Agent B expects another, and the bug surfaces three layers deep at 2 AM. We've all been there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal Agent Contracts&lt;/strong&gt; is a Claude Code plugin that attacks this problem head-on. It lets you define precise, machine-verifiable contracts between agents using VDM-SL (Vienna Development Method – Specification Language), then automatically verify, prove, and generate code from those contracts. The twist: you don't need to know VDM-SL at all. Claude handles the formalism; you describe what your agents do in plain English.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does It Actually Do?
&lt;/h2&gt;

&lt;p&gt;The plugin provides six skills that form a pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;define-contract&lt;/strong&gt; — Describe your agent's role in natural language. Claude converts it into a VDM-SL formal specification with types, preconditions, postconditions, and invariants.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;verify-spec&lt;/strong&gt; — Run VDMJ (the VDM-SL reference toolchain) to syntax-check and type-check your spec, then auto-generate proof obligations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;smt-verify&lt;/strong&gt; — Convert proof obligations to SMT-LIB and solve them with Z3. Get back: proved, counterexample found, or unknown.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;generate-code&lt;/strong&gt; — Produce TypeScript or Python scaffolds from your spec, complete with runtime contract checks that throw clear errors on violations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;integrated-workflow&lt;/strong&gt; — Run the full pipeline (define → verify → prove → generate → test) in one session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;formal-methods-guide&lt;/strong&gt; — Ask Claude to explain any VDM-SL concept along the way.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Java 11+&lt;/strong&gt; is required for VDMJ (the VDM-SL checker). &lt;strong&gt;Python 3.8+&lt;/strong&gt; is needed if you want Z3-based automated proving.&lt;/p&gt;

&lt;p&gt;Install VDMJ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the JAR from GitHub releases&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.vdmj
curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ~/.vdmj/vdmj.jar &lt;span class="se"&gt;\&lt;/span&gt;
  https://github.com/nickbattle/vdmj/releases/latest/download/vdmj-4.6.0-SNAPSHOT.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install Z3 (optional, for SMT proving):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;z3-solver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install the Plugin
&lt;/h3&gt;

&lt;p&gt;In a Claude Code session (terminal or Cowork), run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/plugin &lt;span class="nb"&gt;install &lt;/span&gt;formal-agent-contracts@iid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The six skills are now available in your session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Walkthrough: Building a Task Manager Agent
&lt;/h2&gt;

&lt;p&gt;Let's build a real example from scratch. We'll define a task management agent with status transitions, verify the spec catches invalid transitions, and generate code with runtime contract enforcement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Define the Contract
&lt;/h3&gt;

&lt;p&gt;Tell Claude what your agent does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Define a task manager agent. Tasks have an ID, title, status
     (Todo / InProgress / Done), and priority (Low / Medium / High).
     Done tasks cannot transition back to any other status.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude invokes the &lt;strong&gt;define-contract&lt;/strong&gt; skill and produces a VDM-SL specification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;types&lt;/span&gt;
  &lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Todo&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InProgress&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;     &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;nat1&lt;/span&gt;
          &lt;span class="n"&gt;title&lt;/span&gt;  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;seq1&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;char&lt;/span&gt;
          &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Status&lt;/span&gt;
  &lt;span class="n"&gt;inv&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;operations&lt;/span&gt;
  &lt;span class="n"&gt;ChangeStatus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="o"&gt;==&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;ChangeStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;pre&lt;/span&gt;  &lt;span class="n"&gt;taskId&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;dom&lt;/span&gt; &lt;span class="n"&gt;board&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt;
         &lt;span class="n"&gt;ValidTransition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;board&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="n"&gt;board&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;pre&lt;/code&gt; (precondition) says: the task must exist &lt;em&gt;and&lt;/em&gt; the transition must be valid. The &lt;code&gt;post&lt;/code&gt; (postcondition) guarantees the status is actually updated. The &lt;code&gt;inv&lt;/code&gt; (invariant) constrains title length at all times. These three constructs form the &lt;strong&gt;contract&lt;/strong&gt; — the rules every caller and implementation must respect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Verify the Spec
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Verify this spec.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude runs VDMJ and reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Syntax check: PASSED
✅ Type check: PASSED
📋 Proof Obligations generated: 38
   - CreateTask satisfies TaskBoard invariant
   - ChangeStatus preserves title across transition
   - DeleteTask reduces board cardinality by exactly 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Proof obligations (POs) are questions automatically derived from your contracts. For instance, "After &lt;code&gt;ChangeStatus&lt;/code&gt;, is &lt;code&gt;board(taskId).title&lt;/code&gt; still the same as before?" You didn't write this check — the tool inferred it from your &lt;code&gt;post&lt;/code&gt; condition.&lt;/p&gt;
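&lt;p&gt;To see why that inferred check matters, the "preserves title" PO can be read as an executable property. This sketch (types and names are illustrative, and it samples inputs rather than proving anything) shows the property that the Z3 step later tries to prove for &lt;em&gt;all&lt;/em&gt; inputs:&lt;/p&gt;

```typescript
// The PO "ChangeStatus preserves title", read as an executable property.
// Illustrative sketch: a real proof covers all inputs; this checks one call.
type Task = { id: number; title: string; status: string };
type Board = { [id: number]: Task };

function changeStatus(board: Board, id: number, status: string): Board {
  const updated = { ...board };
  updated[id] = { ...board[id], status };
  return updated;
}

// Property: after a status change, the task's title is unchanged.
function titlePreserved(board: Board, id: number, status: string): boolean {
  const after = changeStatus(board, id, status);
  return after[id].title === board[id].title;
}
```

&lt;p&gt;A runtime check like this can only sample inputs; the SMT step in the next phase is what turns the same property into a proof.&lt;/p&gt;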

&lt;h3&gt;
  
  
  Step 3: Prove Obligations with Z3
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Prove the POs with Z3.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude converts each PO to SMT-LIB and runs Z3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PO #12 (ChangeStatus preserves title): ✅ Proved
PO #15 (DeleteTask cardinality):       ✅ Proved
PO #23 (CreateTask invariant):         ✅ Proved
...
37/38 proved, 1 unknown (requires manual review)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Generate Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Generate TypeScript from this spec.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude produces implementation code with &lt;strong&gt;runtime contract checks&lt;/strong&gt; baked in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;changeStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// --- Pre-conditions (from VDM-SL spec) ---&lt;/span&gt;
  &lt;span class="nf"&gt;checkPre&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s2"&gt;`taskId &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; not in dom board`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;checkPre&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;validTransition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s2"&gt;`Invalid transition: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; → &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// --- Operation body ---&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;updated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;newStatus&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;updated&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// --- Post-conditions (from VDM-SL spec) ---&lt;/span&gt;
  &lt;span class="nf"&gt;checkPost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;`status must be &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;newStatus&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
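&lt;p&gt;The &lt;code&gt;checkPre&lt;/code&gt; / &lt;code&gt;checkPost&lt;/code&gt; helpers themselves are tiny. A minimal version (my sketch; the plugin's actual implementation may differ) is just a guarded throw with a labelled error:&lt;/p&gt;

```typescript
// Minimal contract-check helpers in the spirit of the generated code.
// Illustrative sketch only; the plugin's real helpers may differ.
class ContractError extends Error {
  constructor(kind: string, message: string) {
    super(kind + " failed: " + message);
    this.name = "ContractError";
  }
}

function checkPre(condition: boolean, message: string): void {
  if (!condition) throw new ContractError("Pre-condition", message);
}

function checkPost(condition: boolean, message: string): void {
  if (!condition) throw new ContractError("Post-condition", message);
}
```
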



&lt;p&gt;Try violating the contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;changeStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// → ContractError: Pre-condition failed:&lt;/span&gt;
&lt;span class="c1"&gt;//   Invalid transition: Done → Todo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No silent bugs. No debugging session at 2 AM.&lt;/p&gt;
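&lt;p&gt;For context, here's a minimal sketch of what the contract helpers behind these checks could look like. The &lt;code&gt;ContractError&lt;/code&gt; and &lt;code&gt;checkPost&lt;/code&gt; names come from the example above; the implementation (and the transition table) is my assumption, not the plugin's actual generated code:&lt;/p&gt;

```typescript
// ContractError and checkPost are the names used in the example above;
// this implementation is an assumption about what such helpers look like.
class ContractError extends Error {
  constructor(kind: string, message: string) {
    super(`${kind} failed: ${message}`);
    this.name = "ContractError";
  }
}

function checkPre(condition: boolean, message: string): void {
  if (!condition) throw new ContractError("Pre-condition", message);
}

function checkPost(condition: boolean, message: string): void {
  if (!condition) throw new ContractError("Post-condition", message);
}

// Hypothetical transition table for the task statuses above.
const validTransitions: Record<string, string[]> = {
  Todo: ["InProgress"],
  InProgress: ["Done", "Todo"],
  Done: [], // Done is terminal: no way back
};

function assertTransition(from: string, to: string): void {
  checkPre(
    (validTransitions[from] ?? []).includes(to),
    `Invalid transition: ${from} -> ${to}`
  );
}
```

&lt;p&gt;The point is that a violated contract fails loudly at the call site instead of corrupting state silently.&lt;/p&gt;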

&lt;h2&gt;
  
  
  One-Shot: The Integrated Workflow
&lt;/h2&gt;

&lt;p&gt;If you want the full pipeline in one go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Run the integrated workflow for a task management agent.
     Tasks have ID, title, status (Todo/InProgress/Done),
     priority. Done tasks can't go back.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude orchestrates all phases — define, verify, prove, generate, test — handling errors and retries automatically, and produces a session report at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Formal Contracts Instead of Tests?
&lt;/h2&gt;

&lt;p&gt;Tests are &lt;em&gt;inductive&lt;/em&gt;: you check a finite number of cases and hope they cover enough. Formal contracts are &lt;em&gt;deductive&lt;/em&gt;: you prove properties hold for &lt;em&gt;all&lt;/em&gt; possible inputs.&lt;/p&gt;

&lt;p&gt;Consider this scenario without contracts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This silently succeeds — the bug surfaces later, somewhere else&lt;/span&gt;
&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Was "Done" — should be forbidden!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With contracts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;changeStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// → ContractError: Invalid transition: Done → Todo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spec also doubles as &lt;strong&gt;living documentation&lt;/strong&gt;. It precisely describes what each agent does, what it expects, and what it guarantees — and it never drifts out of sync with the code, because the code is generated from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plugin install&lt;/strong&gt;: &lt;code&gt;/plugin install formal-agent-contracts@iid&lt;/code&gt; in Claude Code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source&lt;/strong&gt;: &lt;a href="https://github.com/kotaroyamame/formal-agent-contracts" rel="noopener noreferrer"&gt;github.com/kotaroyamame/formal-agent-contracts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full task manager example&lt;/strong&gt;: included in the plugin at &lt;code&gt;examples/task-manager/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research paper&lt;/strong&gt;: &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;Formal-Spec-Driven Development&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://iid.systems" rel="noopener noreferrer"&gt;IID Systems&lt;/a&gt;. Licensed under MIT.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>formalmethods</category>
      <category>agents</category>
    </item>
    <item>
      <title>GitHub Spec Kit Is 80% Right — Here's the Missing 20% That Would Make It Transformative</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Sat, 28 Mar 2026 04:47:44 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/github-spec-kit-is-80-right-heres-the-missing-20-that-would-make-it-transformative-2bi6</link>
      <guid>https://dev.to/kotaroyamame/github-spec-kit-is-80-right-heres-the-missing-20-that-would-make-it-transformative-2bi6</guid>
      <description>&lt;h2&gt;
  
  
  I Love Spec Kit. And That's Why I Want to Push It Further.
&lt;/h2&gt;

&lt;p&gt;GitHub's &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;Spec Kit&lt;/a&gt; is, in my assessment, the most intellectually honest attempt at AI-driven development to date. While most tools focus on &lt;em&gt;faster code generation&lt;/em&gt;, Spec Kit asks a more fundamental question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if intent — not code — was the source of truth?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is exactly the right question. The Constitution → Specify → Plan → Tasks → Implement workflow is well-designed. The steerable gates where humans can intervene are smart. Support for 25+ agents (Claude Code, Copilot, Gemini CLI, Cursor, etc.) shows pragmatic thinking about ecosystem diversity. The 40+ community extensions demonstrate real traction.&lt;/p&gt;

&lt;p&gt;I've been working on a &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;formal specification-driven development framework&lt;/a&gt; that starts from the same premise — &lt;strong&gt;specifications should drive development, not the other way around&lt;/strong&gt;. But after months of research and implementation, I believe Spec Kit has a structural gap that, if filled, would make it dramatically more powerful.&lt;/p&gt;

&lt;p&gt;That gap is &lt;strong&gt;formal verifiability of specifications themselves&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Structured Ambiguity
&lt;/h2&gt;

&lt;p&gt;Spec Kit specifications are written in structured Markdown. This is a massive improvement over the alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MetaGPT passes natural language PRDs between agents → ambiguity accumulates at each handoff&lt;/li&gt;
&lt;li&gt;ChatDev relies on dialogue-based consensus → non-deterministic and non-reproducible&lt;/li&gt;
&lt;li&gt;Devin/SWE-Agent infers specs from code → circular reasoning (code &lt;em&gt;is&lt;/em&gt; the spec)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spec Kit avoids all of these pitfalls by making specs explicit, structured, and human-editable. But here's the uncomfortable truth:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured natural language reduces ambiguity. It does not eliminate it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider a Spec Kit specification like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## User Story: Add to Cart&lt;/span&gt;
The user can add a product to their shopping cart.
The cart should reflect the updated quantity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is clear to a human reader. But it leaves critical questions unanswered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can the user add 0 items? Negative quantities?&lt;/li&gt;
&lt;li&gt;What happens when stock is insufficient?&lt;/li&gt;
&lt;li&gt;If the product is already in the cart, does the quantity replace or accumulate?&lt;/li&gt;
&lt;li&gt;Is there a maximum cart size?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't edge cases. They're &lt;strong&gt;boundary conditions&lt;/strong&gt; — the exact places where modules interact and where bugs hide. A human spec author might catch some of them. But the point of a specification system is that &lt;strong&gt;the system itself should make missing boundaries visible&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Formal Specifications Add
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev/tree/main/templates/vdm-sl" rel="noopener noreferrer"&gt;VDM-SL&lt;/a&gt; (Vienna Development Method - Specification Language), the same requirement looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AddToCart: CustomerId * ProductId * nat1 ==&amp;gt; ()
AddToCart(cid, pid, qty) ==
  carts(cid) := if pid in set dom carts(cid)
                then carts(cid) ++ {pid |-&amp;gt; carts(cid)(pid) + qty}
                else carts(cid) ++ {pid |-&amp;gt; qty}
pre  pid in set dom inventory
     and inventory(pid) &amp;gt;= qty
     and qty &amp;gt; 0
     and CardinalityItems(carts(cid)) &amp;lt; MAX_CART_SIZE
post pid in set dom carts(cid)
     and carts(cid)(pid) = (if pid in set dom carts~(cid)
                            then carts~(cid)(pid) + qty  -- accumulate, not replace
                            else qty)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;nat1&lt;/code&gt; as the type of &lt;code&gt;qty&lt;/code&gt;&lt;/strong&gt; — zero and negative quantities are structurally impossible. Not "tested against," not "documented as invalid" — &lt;em&gt;impossible at the type level&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pre&lt;/code&gt; conditions&lt;/strong&gt; — stock sufficiency and cart size limits are &lt;strong&gt;explicit preconditions&lt;/strong&gt;. An AI agent reading this spec cannot "forget" about them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;post&lt;/code&gt; conditions&lt;/strong&gt; — the behavior is unambiguous: quantities accumulate (not replace). Any agent implementing this module knows exactly what the expected behavior is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine-verifiable&lt;/strong&gt; — if another module's postcondition guarantees &lt;code&gt;inventory(pid) &amp;gt;= qty&lt;/code&gt;, we can mechanically verify that it satisfies this precondition. No human review needed for interface consistency.&lt;/li&gt;
&lt;/ol&gt;
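&lt;p&gt;To make point 4 concrete, here's an illustrative sketch — not how VDM tooling actually works (real tools discharge &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; as a proof obligation). All names here are hypothetical; this merely tests the implication over sample states, which can falsify it but never prove it:&lt;/p&gt;

```typescript
// Hypothetical state shape and predicates; real VDM tooling discharges
// A.post => B.pre as a proof obligation rather than sampling states.
interface CartState {
  stock: number;     // inventory(pid)
  qty: number;       // requested quantity
  cartSize: number;  // current number of distinct items in the cart
}

const MAX_CART_SIZE = 50;

// What an upstream operation's postcondition guarantees about the state.
const upstreamPost = (s: CartState) => s.stock >= s.qty && s.qty > 0;

// What AddToCart's precondition requires.
const addToCartPre = (s: CartState) =>
  s.stock >= s.qty && s.qty > 0 && s.cartSize < MAX_CART_SIZE;

// Implication check over samples: can falsify post => pre, never prove it.
function findCounterexample(states: CartState[]): CartState | undefined {
  return states.find(s => upstreamPost(s) && !addToCartPre(s));
}
```

&lt;p&gt;Note how the cart-size conjunct surfaces a gap: a state with a full cart satisfies the upstream postcondition but violates &lt;code&gt;AddToCart&lt;/code&gt;'s precondition, so the composition is not verified.&lt;/p&gt;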

&lt;h2&gt;
  
  
  The Three Gaps Formal Specs Would Fill in Spec Kit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gap 1: Boundary Conditions Are Invisible in Natural Language
&lt;/h3&gt;

&lt;p&gt;In Spec Kit today, boundary conditions depend on the spec author remembering to include them. This is a human-reliability problem — the very thing we're trying to engineer out of the system.&lt;/p&gt;

&lt;p&gt;Formal specifications force boundary conditions into the open: types and precondition clauses are where such constraints must live, so their absence is a visible hole rather than an unstated assumption. A VDM-SL spec with &lt;code&gt;qty: nat&lt;/code&gt; but no precondition on stock levels is &lt;strong&gt;visibly incomplete&lt;/strong&gt; — the type checker can flag it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 2: Cross-Module Consistency Is Unverifiable
&lt;/h3&gt;

&lt;p&gt;This is the critical scaling problem. Spec Kit writes specs per task (≈ per module), but there's no mechanism to verify that &lt;strong&gt;Task A's output satisfies Task B's input requirements&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With 3 modules, you have 3 pairwise interfaces to check — manageable by human review. With 10 modules, you have 45. With 20 modules, 190. The interfaces grow quadratically; human review does not scale with them.&lt;/p&gt;

&lt;p&gt;Formal specifications solve this with &lt;strong&gt;compositional verification&lt;/strong&gt;: if Order module's &lt;code&gt;ConfirmOrder&lt;/code&gt; operation has a postcondition, and Inventory module's &lt;code&gt;ReserveStock&lt;/code&gt; has a precondition, you can mechanically check whether &lt;code&gt;ConfirmOrder.post ⇒ ReserveStock.pre&lt;/code&gt;. This is the &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; pattern — and it scales linearly with module count, not quadratically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Module A (Order)                Module B (Inventory)
┌──────────────────┐            ┌──────────────────┐
│  ConfirmOrder()  │            │  ReserveStock()  │
│  post:           │──verify────│  pre:            │
│   order.status   │            │   productId ∈    │
│    = &amp;lt;CONFIRMED&amp;gt; │   A.post   │    dom inventory │
│   ∧ stock        │    ⇒       │   ∧ quantity &amp;gt; 0 │
│    reserved      │   B.pre    │   ∧ available ≥  │
│                  │            │     quantity     │
└──────────────────┘            └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Gap 3: Specs Can't Validate Themselves
&lt;/h3&gt;

&lt;p&gt;Spec Kit's philosophy is "intent is the source of truth." But there's a meta-problem: &lt;strong&gt;how do you verify that the spec itself is internally consistent?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Natural language specs can contain contradictions that only surface during implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Users must complete checkout within 30 minutes
&lt;span class="p"&gt;-&lt;/span&gt; Users can save their cart and resume later
&lt;span class="p"&gt;-&lt;/span&gt; Inventory is reserved at checkout start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Are these contradictory? If a user saves their cart at minute 29 and resumes 3 hours later, is inventory still reserved? A human reader might catch this. A consistency check on formal specifications &lt;em&gt;will&lt;/em&gt; catch it — the invariant on reserved-inventory duration would conflict with the "resume later" postcondition.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Proposal: Spec Kit + Formal Specs
&lt;/h2&gt;

&lt;p&gt;I'm not proposing to replace Spec Kit's natural language layer. I'm proposing to &lt;strong&gt;add a formal verification layer beneath it&lt;/strong&gt;. Here's how it could work within Spec Kit's existing workflow:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Constitution (unchanged)
&lt;/h3&gt;

&lt;p&gt;Project principles, values, guidelines — these remain in natural language. They're governance, not computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Specify (enhanced)
&lt;/h3&gt;

&lt;p&gt;The human writes intent in natural language (as today). Then, an AI agent &lt;strong&gt;generates a VDM-SL formalization&lt;/strong&gt; of the spec and explains it back:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Your spec says users can add products to their cart. I've formalized this with these constraints: quantities must be positive integers, stock must be sufficient, and cart has a maximum of 50 items. The quantity behavior is accumulative — adding 3 of a product that already has 2 in the cart gives 5, not 3. Does this match your intent?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The human validates meaning. The machine validates consistency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Plan (enhanced)
&lt;/h3&gt;

&lt;p&gt;The technical plan now includes &lt;strong&gt;interface contracts&lt;/strong&gt; between modules: which module's postconditions feed into which module's preconditions. These contracts are in VDM-SL, mechanically verifiable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Tasks (enhanced)
&lt;/h3&gt;

&lt;p&gt;Each task carries not just a natural language description, but a &lt;strong&gt;formal module specification&lt;/strong&gt; — the types, state, operations, pre/post conditions. The AI agent implementing the task has an unambiguous contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase: Implement (unchanged)
&lt;/h3&gt;

&lt;p&gt;Agents implement against formal specs rather than natural language descriptions. Claude Code, Copilot, Gemini CLI — any of the 25+ supported agents can read VDM-SL and generate conforming code.&lt;/p&gt;

&lt;h3&gt;
  
  
  New Phase: Verify
&lt;/h3&gt;

&lt;p&gt;A dedicated verification step checks &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; across all module boundaries. This happens at the &lt;strong&gt;spec level&lt;/strong&gt;, before any code runs. Integration problems are caught before implementation, not after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Multi-Agent Development
&lt;/h2&gt;

&lt;p&gt;Single-agent development (one developer + one AI) can get by with natural language specs. The human catches ambiguities in real time. But the industry is clearly moving toward &lt;strong&gt;multi-agent development&lt;/strong&gt; — multiple AI agents building different modules in parallel.&lt;/p&gt;

&lt;p&gt;In multi-agent scenarios, natural language ambiguity becomes a scaling crisis:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modules&lt;/th&gt;
&lt;th&gt;Pairwise Interfaces&lt;/th&gt;
&lt;th&gt;Human Review Feasibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Manageable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;Difficult&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;190&lt;/td&gt;
&lt;td&gt;Impractical&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Formal interface contracts are the only known mechanism that scales: each contract is verified independently, so verification effort grows linearly with module count (O(n)), not quadratically (O(n²)).&lt;/p&gt;
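&lt;p&gt;The table's numbers are just &lt;code&gt;n(n-1)/2&lt;/code&gt; versus &lt;code&gt;n&lt;/code&gt;:&lt;/p&gt;

```typescript
// Pairwise interfaces grow quadratically; one contract per module,
// each verified independently, grows linearly with module count.
const pairwiseInterfaces = (n: number): number => (n * (n - 1)) / 2;

for (const n of [3, 5, 10, 20]) {
  console.log(`${n} modules: ${pairwiseInterfaces(n)} pairwise interfaces`);
}
// 3 → 3, 5 → 10, 10 → 45, 20 → 190, matching the table above
```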

&lt;h2&gt;
  
  
  What I'm Not Saying
&lt;/h2&gt;

&lt;p&gt;Let me be clear about what this proposal is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not "Spec Kit is bad."&lt;/strong&gt; Spec Kit is the best-designed spec-driven framework available. The philosophy is right. The UX is right. The community extensions are impressive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not "everyone should learn VDM-SL."&lt;/strong&gt; The AI reads and writes the formal notation. The human verifies meaning in natural language. No formal methods expertise required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not "formal specs replace tests."&lt;/strong&gt; Tests still matter for integration testing, performance testing, and E2E validation. Formal specs replace tests as the &lt;em&gt;center&lt;/em&gt; of correctness — not as the entire verification strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not "this works for everything."&lt;/strong&gt; UI/UX, performance tuning, and quick bug fixes don't benefit from formal specifications. This is for &lt;strong&gt;multi-module business logic systems&lt;/strong&gt; where correctness at module boundaries is critical.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Open Source Connection
&lt;/h2&gt;

&lt;p&gt;We've built a complete framework around this idea:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;&lt;code&gt;formal-spec-driven-dev&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Apache 2.0 licensed&lt;/p&gt;

&lt;p&gt;It includes VDM-SL templates, AI prompt templates for all 4 development phases, multi-agent orchestration configs, and a working e-commerce example (Order, Inventory, Payment modules with formal interface contracts).&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev/blob/master/docs/comparison.md" rel="noopener noreferrer"&gt;comparison analysis&lt;/a&gt; covers the structural critiques of MetaGPT, ChatDev, Devin, Spec Kit, and Claude Code in detail, with a unified framework for understanding each approach's "source of truth" and its logical limitations.&lt;/p&gt;

&lt;p&gt;If you're using Spec Kit today and thinking about how to scale it to multi-agent or multi-module projects, I'd love to hear from you. The combination of Spec Kit's excellent developer experience and formal verification's mathematical guarantees could be exactly what the ecosystem needs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Hikaru Ando — &lt;a href="https://iid.systems" rel="noopener noreferrer"&gt;IID Systems&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;GitHub: formal-spec-driven-dev&lt;/a&gt; | Apache 2.0&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;GitHub Spec Kit&lt;/a&gt; — The toolkit this article proposes to enhance&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/" rel="noopener noreferrer"&gt;Spec-Driven Development with AI (GitHub Blog)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.microsoft.com/blog/spec-driven-development-spec-kit" rel="noopener noreferrer"&gt;Diving Into Spec-Driven Development (Microsoft Developer Blog)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;formal-spec-driven-dev&lt;/a&gt; — Our formal specification framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev/blob/master/docs/comparison.md" rel="noopener noreferrer"&gt;Comparison of AI-Driven Development Architectures&lt;/a&gt; — Structural critique of each approach&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>github</category>
      <category>softwaredevelopment</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Stop Writing Tests First — Write Formal Specs. Let AI Agent Teams Build Your System.</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Fri, 27 Mar 2026 17:03:42 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/stop-writing-tests-first-write-formal-specs-let-ai-agent-teams-build-your-system-31h0</link>
      <guid>https://dev.to/kotaroyamame/stop-writing-tests-first-write-formal-specs-let-ai-agent-teams-build-your-system-31h0</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Nobody's Talking About
&lt;/h2&gt;

&lt;p&gt;Everyone's excited about AI-powered coding. GitHub Copilot autocompletes your functions. Claude and GPT-4 generate entire modules from prompts. But here's the uncomfortable truth: &lt;strong&gt;we're still thinking about AI as a single tool&lt;/strong&gt;, when the real paradigm shift is &lt;strong&gt;multiple AI agents working as a team&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that raises a question nobody's answering well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When three AI agents are building three modules in parallel, what prevents them from producing code that doesn't fit together?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Natural language specs? Too ambiguous. One agent interprets "the order must be valid" differently from another. Test-driven development? Tests are inductive — they verify specific cases, not the contract between modules. You can have 100% coverage on each module and still watch the system crash at integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What If Agents Had Blueprints?
&lt;/h2&gt;

&lt;p&gt;Think about how a construction project works. You don't hand three subcontractors a paragraph of prose and hope for the best. You give them &lt;strong&gt;engineering drawings&lt;/strong&gt; — precise, unambiguous documents that define every interface: where the plumbing connects to the electrical, what load the foundation bears, which walls are structural.&lt;/p&gt;

&lt;p&gt;Formal specifications are the engineering drawings of software. And a method called &lt;strong&gt;VDM-SL&lt;/strong&gt; (Vienna Development Method - Specification Language), originally from IBM in the 1970s, turns out to be nearly ideal for AI agents to read, write, and reason about.&lt;/p&gt;

&lt;p&gt;Here's why it matters for multi-agent development:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Module A (Order)                Module B (Inventory)
┌──────────────────┐            ┌──────────────────┐
│                  │            │                  │
│  PlaceOrder()    │            │  ReserveStock()  │
│  post:           │──contract──│  pre:            │
│   order.status   │            │   productId ∈    │
│    = &amp;lt;CONFIRMED&amp;gt; │   A.post   │    dom inventory │
│   ∧ stock        │    ⇒       │   ∧ quantity &amp;gt; 0 │
│    reserved      │   B.pre    │   ∧ available ≥  │
│                  │            │     quantity     │
└──────────────────┘            └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A's postcondition implies B's precondition.&lt;/strong&gt; This is mechanically verifiable. No ambiguity. No "alignment meetings" between agents. No integration surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;Here's what a VDM-SL specification for an order module actually looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module Order

types
  OrderId = nat1;
  ProductId = nat1;

  LineItem :: productId : ProductId
             quantity : nat1
             unitPrice : nat;

  OrderStatus = &amp;lt;PENDING&amp;gt; | &amp;lt;CONFIRMED&amp;gt; | &amp;lt;COMPLETED&amp;gt; | &amp;lt;CANCELLED&amp;gt;;

  Order :: orderId    : OrderId
           customerId : nat1
           lineItems  : seq of LineItem
           status     : OrderStatus
           totalAmount : nat;

state OrderStore of
  orders    : map OrderId to Order
  nextOrderId : nat1
inv mk_OrderStore(orders, nextOrderId) ==
  nextOrderId &amp;gt; 0
  and forall id in set dom orders &amp;amp; orders(id).orderId = id

operations

CreateOrder: nat1 * seq of LineItem ==&amp;gt; OrderId
CreateOrder(customerId, items) ==
  let id = nextOrderId in (
    orders := orders ++ {id |-&amp;gt; mk_Order(id, customerId, items,
                          &amp;lt;PENDING&amp;gt;, SumItems(items))};
    nextOrderId := nextOrderId + 1;
    return id
  )
pre items &amp;lt;&amp;gt; []
post RESULT = nextOrderId~
     and RESULT in set dom orders
     and orders(RESULT).status = &amp;lt;PENDING&amp;gt;;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what this gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Types with invariants&lt;/strong&gt; — &lt;code&gt;nat1&lt;/code&gt; means positive integers only. No null, no zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State invariant&lt;/strong&gt; — the &lt;code&gt;inv&lt;/code&gt; clause is always true. Every order ID in the map matches the order's own ID.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-conditions&lt;/strong&gt; — &lt;code&gt;items &amp;lt;&amp;gt; []&lt;/code&gt; means you can't create an empty order. This is a &lt;em&gt;structural guarantee&lt;/em&gt;, not a test case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-conditions&lt;/strong&gt; — after &lt;code&gt;CreateOrder&lt;/code&gt;, the order exists in the store with status &lt;code&gt;PENDING&lt;/code&gt;. Any agent reading this knows exactly what to expect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An AI agent building the inventory module doesn't need to see the Order module's implementation. It only needs the &lt;strong&gt;interface contract&lt;/strong&gt; — the types and operation signatures with pre/post conditions. That's a fraction of the context window.&lt;/p&gt;
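&lt;p&gt;As a rough illustration of what an agent might produce from this contract, here's a hand-translated TypeScript sketch (my assumption, not tool output) that turns &lt;code&gt;CreateOrder&lt;/code&gt;'s pre/post conditions into runtime checks:&lt;/p&gt;

```typescript
// Hand-translated sketch of CreateOrder's contract as runtime checks.
// Types mirror the VDM-SL spec above; nothing here is generated output.
interface LineItem { productId: number; quantity: number; unitPrice: number; }
type Status = "PENDING" | "CONFIRMED" | "COMPLETED" | "CANCELLED";
interface Order {
  orderId: number; customerId: number;
  lineItems: LineItem[]; status: Status; totalAmount: number;
}

const orders = new Map<number, Order>();
let nextOrderId = 1;

// SumItems from the spec, assumed to total quantity * unitPrice.
const sumItems = (items: LineItem[]) =>
  items.reduce((t, i) => t + i.quantity * i.unitPrice, 0);

function createOrder(customerId: number, items: LineItem[]): number {
  // pre: items <> []
  if (items.length === 0) throw new Error("Pre-condition failed: items <> []");
  const id = nextOrderId;
  orders.set(id, { orderId: id, customerId, lineItems: items,
                   status: "PENDING", totalAmount: sumItems(items) });
  nextOrderId += 1;
  // post: RESULT in set dom orders and orders(RESULT).status = <PENDING>
  if (orders.get(id)?.status !== "PENDING")
    throw new Error("Post-condition failed: status must be PENDING");
  return id;
}
```

&lt;p&gt;The translation is mechanical precisely because the spec left nothing to interpret: the empty-order rule and the &lt;code&gt;PENDING&lt;/code&gt; guarantee are both checkable at runtime.&lt;/p&gt;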

&lt;h2&gt;
  
  
  The Multi-Agent Architecture
&lt;/h2&gt;

&lt;p&gt;Here's the workflow we propose:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Human + Architect AI (1-3 days)
&lt;/h3&gt;

&lt;p&gt;The human (domain expert) and an architect AI collaborate to define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System-wide VDM-SL specification&lt;/li&gt;
&lt;li&gt;Module decomposition&lt;/li&gt;
&lt;li&gt;Interface contracts between modules (the &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; relationships)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The human doesn't need to read VDM-SL.&lt;/strong&gt; The AI explains the spec in natural language: "This says a customer can't place an order unless they have at least one item, and after the order is placed, inventory is reserved. Does that match your business rules?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2-4: Parallel Module Agents
&lt;/h3&gt;

&lt;p&gt;Each module gets its own AI agent. Agent A builds Order. Agent B builds Inventory. Agent C builds Payment. &lt;strong&gt;In parallel.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each agent's context contains only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its own module spec (~200-500 lines of VDM-SL)&lt;/li&gt;
&lt;li&gt;Interface contracts of dependent modules (~50-100 lines each)&lt;/li&gt;
&lt;li&gt;Cross-cutting decisions from Phase 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No agent needs another agent's implementation code. This fits comfortably in current context windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration Verification
&lt;/h3&gt;

&lt;p&gt;A dedicated integration agent checks &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; across all module boundaries. It never reads implementation code — only interface specs. If Order's &lt;code&gt;ConfirmOrder&lt;/code&gt; postcondition guarantees &lt;code&gt;stock_reserved = true&lt;/code&gt;, and Inventory's &lt;code&gt;ShipOrder&lt;/code&gt; precondition requires &lt;code&gt;stock_reserved = true&lt;/code&gt;, the composition is verified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────┐
│            Phase 1: Architecture                  │
│     Human + Architect AI                          │
│     → System spec + module decomposition          │
│     → Interface contracts (A.post ⇒ B.pre)        │
└────────────┬─────────────┬───────────────┬───────┘
             │             │               │
             ▼             ▼               ▼
     ┌──────────┐  ┌──────────┐   ┌──────────┐
     │ Agent A  │  │ Agent B  │   │ Agent C  │
     │ Order    │  │Inventory │   │ Payment  │
     │ Module   │  │ Module   │   │ Module   │
     └─────┬────┘  └────┬─────┘   └────┬─────┘
           │             │               │
           ▼             ▼               ▼
     ┌──────────────────────────────────────────┐
     │      Integration Verification Agent       │
     │   Checks A.post ⇒ B.pre across all       │
     │   module boundaries (spec-level only)     │
     └──────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
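&lt;p&gt;One way to picture the integration agent's job: treat each operation's contract as data. The representation below is hypothetical and deliberately simplified — it checks that every named requirement appears literally among the upstream guarantees, whereas a real verifier proves the logical implication:&lt;/p&gt;

```typescript
// Hypothetical spec-level representation: each operation lists the named
// propositions its postcondition guarantees and its precondition requires.
// The check reads only these contracts — never implementation code.
interface OpContract { guarantees: string[]; requires: string[]; }

const confirmOrder: OpContract = {
  guarantees: ["order.status = CONFIRMED", "stock_reserved = true"],
  requires: [],
};

const shipOrder: OpContract = {
  guarantees: ["order.status = COMPLETED"],
  requires: ["stock_reserved = true"],
};

// A.post => B.pre, read as: everything B requires, A guarantees.
// Returns the unmet requirements (empty array = composition verified).
function composes(a: OpContract, b: OpContract): string[] {
  return b.requires.filter(r => !a.guarantees.includes(r));
}
```

&lt;p&gt;Here &lt;code&gt;composes(confirmOrder, shipOrder)&lt;/code&gt; comes back empty, so the boundary is verified; drop &lt;code&gt;stock_reserved = true&lt;/code&gt; from the guarantees and the unmet requirement is reported before any code exists.&lt;/p&gt;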



&lt;h2&gt;
  
  
  Why Not Just Use Natural Language Specs?
&lt;/h2&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Ambiguity compounds quadratically.&lt;/strong&gt; With 2 modules and natural language specs, the risk of misinterpretation is manageable. With 10 modules and 45 pairwise interactions, it compounds at every boundary. Formal specs eliminate this at the root — &lt;code&gt;nat1&lt;/code&gt; means one thing, everywhere, always.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Verification becomes mechanical.&lt;/strong&gt; You can't machine-check whether "the order should be valid" in one module agrees with "valid orders have at least one item" in another. But you &lt;em&gt;can&lt;/em&gt; machine-check whether &lt;code&gt;Order.ConfirmOrder.post ⇒ Inventory.ReserveStock.pre&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Context windows stay small.&lt;/strong&gt; Each agent needs only its module spec + dependency interfaces — not the full natural-language requirements document that keeps growing. A 10-module system might have 200 pages of natural language specs but only 50 lines of interface contract per module boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why TDD Doesn't Solve This
&lt;/h2&gt;

&lt;p&gt;This is the part that might get me some angry comments, but hear me out.&lt;/p&gt;

&lt;p&gt;Dijkstra said it in 1969:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Program testing can be used to show the presence of bugs, but never to show their absence."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;TDD is &lt;strong&gt;inductive reasoning&lt;/strong&gt;: you verify specific inputs and hope they generalize. Formal specification is &lt;strong&gt;deductive reasoning&lt;/strong&gt;: you state what must always be true, then derive implementations that satisfy those conditions.&lt;/p&gt;

&lt;p&gt;For a single developer, TDD is a useful discipline. For multi-agent AI development, it fails structurally:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;TDD&lt;/th&gt;
&lt;th&gt;Formal Specs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Inductive (specific → general)&lt;/td&gt;
&lt;td&gt;Deductive (general → specific)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-agent contract&lt;/td&gt;
&lt;td&gt;Implicit in shared tests&lt;/td&gt;
&lt;td&gt;Explicit in interface specs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification&lt;/td&gt;
&lt;td&gt;Run tests, hope for the best&lt;/td&gt;
&lt;td&gt;Mechanical proof of composition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context per agent&lt;/td&gt;
&lt;td&gt;Needs test suite + code + shared state&lt;/td&gt;
&lt;td&gt;Needs only interface contracts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scales with agents&lt;/td&gt;
&lt;td&gt;Coordination overhead grows&lt;/td&gt;
&lt;td&gt;Contracts remain verifiable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: we're not saying "never write tests." Tests still have a role — integration tests for external systems, performance tests, E2E UI tests. What we're saying is: &lt;strong&gt;tests are no longer the center.&lt;/strong&gt; Formal specifications are.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Human Role: Not Obsolete, Elevated
&lt;/h2&gt;

&lt;p&gt;This paradigm doesn't remove humans. It redefines what humans do:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; Write code, write tests, review code, debug integration failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain expert&lt;/strong&gt; — "This is how our business works. These are the rules."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture decision maker&lt;/strong&gt; — "Split the system into these modules. This module depends on that one."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality judge&lt;/strong&gt; — "The AI says the spec means X. Does X match what we actually need?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to read VDM-SL. You don't need to write code. You need &lt;strong&gt;deep domain knowledge&lt;/strong&gt; and &lt;strong&gt;the ability to ask sharp questions&lt;/strong&gt; when the AI explains its specifications back to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Open-Source Framework
&lt;/h2&gt;

&lt;p&gt;We've packaged this entire methodology as an open-source project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;&lt;code&gt;formal-spec-driven-dev&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Apache 2.0 licensed&lt;/p&gt;

&lt;p&gt;What's included:&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;Foundational paper&lt;/strong&gt; (Japanese + English) — the full theoretical argument, ~30 min read&lt;/p&gt;

&lt;p&gt;📋 &lt;strong&gt;VDM-SL templates&lt;/strong&gt; — module template, interface contract template. Copy, customize, go.&lt;/p&gt;

&lt;p&gt;🤖 &lt;strong&gt;AI prompt templates for all 4 phases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phase 1: Specification dialogue (elicit requirements → VDM-SL)&lt;/li&gt;
&lt;li&gt;Phase 2: Technical design (VDM-SL → architecture decisions)&lt;/li&gt;
&lt;li&gt;Phase 3: Implementation (VDM-SL → production code)&lt;/li&gt;
&lt;li&gt;Phase 4: Verification (code ↔ spec cross-check)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔧 &lt;strong&gt;Multi-agent orchestration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;agent-config.yaml&lt;/code&gt; — role definitions, context requirements, outputs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;workflow.md&lt;/code&gt; — step-by-step guide to running multi-agent development today&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📦 &lt;strong&gt;Working example:&lt;/strong&gt; E-commerce order system with 3 modules (Order, Inventory, Payment), complete with VDM-SL specs, integration verification, and cross-module contracts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It in 30 Minutes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repo&lt;/li&gt;
&lt;li&gt;Open &lt;code&gt;templates/prompts/phase1-specification.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Copy the system prompt into Claude or GPT-4&lt;/li&gt;
&lt;li&gt;Start describing a module from your own project&lt;/li&gt;
&lt;li&gt;Watch the AI produce a VDM-SL spec, then explain it back to you in plain language&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No VDM-SL expertise required. The AI reads and writes the formal notation. You verify the &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Need from the Community
&lt;/h2&gt;

&lt;p&gt;This is early-stage. The methodology is sound, but the tooling ecosystem needs work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain templates&lt;/strong&gt; — Healthcare, fintech, logistics, IoT. Each domain has its own patterns. We need VDM-SL templates for common domain models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration tooling&lt;/strong&gt; — Better ways to coordinate multiple AI agents with shared specifications. Integration with existing agent frameworks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Case studies&lt;/strong&gt; — Real teams trying this on real projects. What worked? What broke? Where are the rough edges?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification automation&lt;/strong&gt; — Tools that automatically check &lt;code&gt;A.post ⇒ B.pre&lt;/code&gt; across module boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of this resonates, check out the repo and open an issue. PRs welcome. Case study reports are especially valuable — even "I tried this and it didn't work because..." helps enormously.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The era of single-agent AI coding is already giving way to multi-agent development. The unsolved problem is coordination. Natural language specs create ambiguity that grows combinatorially with the number of agent interactions. Tests verify specific cases but can't guarantee compositional correctness.&lt;/p&gt;

&lt;p&gt;Formal specifications — specifically, VDM-SL with pre/post conditions and invariants — provide what multi-agent development needs: &lt;strong&gt;unambiguous, mechanically verifiable contracts between agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The human role doesn't disappear. It elevates. You become the domain expert and architect who ensures the AI team builds the right thing. The specs ensure they build it correctly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Hikaru Ando — IID Systems&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev" rel="noopener noreferrer"&gt;GitHub: formal-spec-driven-dev&lt;/a&gt; | Apache 2.0&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is also available in &lt;a href="https://github.com/kotaroyamame/formal-spec-driven-dev/blob/main/docs/ja/paper.md" rel="noopener noreferrer"&gt;Japanese (日本語)&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The End of Test-Driven Development: Best Practices for AI Agent-Driven Development with Formal Methods</title>
      <dc:creator>Kotaro Andy</dc:creator>
      <pubDate>Fri, 27 Mar 2026 08:54:13 +0000</pubDate>
      <link>https://dev.to/kotaroyamame/the-end-of-test-driven-development-best-practices-for-ai-agent-driven-development-with-formal-4ma5</link>
      <guid>https://dev.to/kotaroyamame/the-end-of-test-driven-development-best-practices-for-ai-agent-driven-development-with-formal-4ma5</guid>
      <description>&lt;h1&gt;
  
  
  AI Agent-Driven Development with Formal Methods — The Human Role in the Era of Multi-Agent Coordination
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;— Toward a Near Future Where AI Agent Teams Autonomously Build Systems Using Formal Specifications as Contracts —&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction: Entering the Era of AI Agent-Driven Development
&lt;/h2&gt;

&lt;p&gt;The rapid evolution of large language models (LLMs) is fundamentally challenging how software is built. AI tools such as GitHub Copilot, Claude, and GPT-4 already demonstrate practical capability in code generation, review, and refactoring. But the trajectory points beyond "using AI as a tool"—toward a new paradigm in which &lt;strong&gt;multiple AI agents coordinate like a development team, autonomously constructing systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This raises a fundamental question: when multiple AI agents coordinate, what serves as the "team's common language"? Vague natural-language specifications cannot eliminate interpretive discrepancies between agents. Applying Test-Driven Development (TDD) doesn't help either—tests are grounded in inductive reasoning and cannot function as a mechanism for inter-agent specification agreement.&lt;/p&gt;

&lt;p&gt;What this article proposes is a multi-agent development paradigm in which formal methods—specifically VDM (Vienna Development Method)—serve as the backbone of specification, with formal specifications functioning as &lt;strong&gt;rigorous contracts between agents&lt;/strong&gt;. Think of it like construction: the human serves as domain expert and architect, communicating the overall design intent, while multiple AI agents work in parallel like a construction crew—one handling the foundation, another the structural frame, another electrical, another plumbing. Each agent understands its boundaries through the formal specification—the "blueprint"—enabling unambiguous coordination.&lt;/p&gt;

&lt;p&gt;This article first identifies the fundamental limitations of TDD (Chapter 1), then argues why formal methods are ideally suited as "contracts" between AI agents (Chapters 2–5), presents a concrete multi-agent coordination architecture (Section 7.3), and discusses &lt;strong&gt;how the human role is redefined&lt;/strong&gt; in this new paradigm—as domain expert and architecture-level decision maker.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 1: Why Test-Driven Development Is Fundamentally Flawed
&lt;/h2&gt;

&lt;p&gt;Before proceeding, a clarification: this article critiques the &lt;em&gt;paradigm&lt;/em&gt; of TDD—the idea that tests should be the central driver of design and quality assurance—not the act of testing itself. The role testing continues to play is addressed in Section 7.2.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 The Fundamental Limit of Tests: The Trap of Inductive Reasoning
&lt;/h3&gt;

&lt;p&gt;The core problem with TDD is that tests are grounded in &lt;strong&gt;inductive reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A test asks: "For this finite set of inputs, does the program produce the expected output?" But the input space of virtually any real program is infinite or astronomically large. Edsger W. Dijkstra put it plainly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Program testing can be used to show the presence of bugs, but never to show their absence."&lt;br&gt;
— Edsger W. Dijkstra, 1969&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not merely a pithy observation—it is a logically precise statement. Even if every test case &lt;code&gt;{t₁, t₂, ..., tₙ}&lt;/code&gt; passes, there is no guarantee that the program behaves correctly for some untested input &lt;code&gt;tₙ₊₁&lt;/code&gt;. This is the problem of incomplete induction.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 The Mathematics of Incompleteness
&lt;/h3&gt;

&lt;p&gt;Consider an integer addition function &lt;code&gt;add(a, b)&lt;/code&gt;. If both &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are 32-bit integers, the number of possible input pairs is &lt;code&gt;2³² × 2³² = 2⁶⁴ ≈ 1.8 × 10¹⁹&lt;/code&gt;. Running one test per nanosecond, full coverage would take approximately 585 years.&lt;/p&gt;
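&lt;p&gt;The arithmetic is easy to confirm with a short script (one test per nanosecond, Julian years):&lt;/p&gt;

```python
# Back-of-the-envelope check: exhaustively testing add(a, b) over all
# 32-bit integer pairs at one test per nanosecond.
pairs = 2 ** 64                       # 2^32 * 2^32 input pairs (~1.8e19)
seconds = pairs / 1e9                 # one test per nanosecond
years = seconds / (365.25 * 24 * 3600)
print(f"{pairs:.1e} pairs, {years:.0f} years")
```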

&lt;p&gt;In real systems, inputs are far more complex than integer pairs. API requests, database states, timing, concurrency ordering—the state space is effectively infinite. Every TDD test suite is sampling an infinitesimally small region of that space.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 The Counterargument That TDD Is a Design Method
&lt;/h3&gt;

&lt;p&gt;TDD advocates often reframe it: "TDD is not a testing method—it is a design method." Writing tests first forces you to design testable interfaces, the argument goes.&lt;/p&gt;

&lt;p&gt;But this argument has a structural problem. When test-writability becomes the criterion for good design, the tail wags the dog. A testable interface is not necessarily a good interface. Design should be derived from the &lt;strong&gt;logical structure of requirements&lt;/strong&gt;, not reverse-engineered from the convenience of tests.&lt;/p&gt;

&lt;p&gt;In formal methods, the specification itself drives design. By explicitly stating invariants, pre-conditions, and post-conditions, interface correctness is guaranteed at the specification level. There is no need to entrust design to the accidental selection of test cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 The Illusion of Coverage
&lt;/h3&gt;

&lt;p&gt;100% code coverage is often cited as a quality metric. But coverage measures which &lt;em&gt;lines of code were executed&lt;/em&gt;, not which &lt;em&gt;parts of the specification were verified&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Consider this function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test &lt;code&gt;divide(10, 2) == 5.0&lt;/code&gt; achieves 100% line coverage. But the case &lt;code&gt;b = 0&lt;/code&gt; is never tested. Coverage is fundamentally insufficient as a measure of specification completeness.&lt;/p&gt;

&lt;p&gt;A formal specification writes it differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;divide(a: int, b: int) -&amp;gt; float
  pre: b ≠ 0
  post: result * b = a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pre-condition &lt;code&gt;b ≠ 0&lt;/code&gt; resolves the division-by-zero problem at the specification level. No test case can be forgotten because the constraint is structural, not incidental.&lt;/p&gt;
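&lt;p&gt;As a sketch of how that contract can carry into code, the pre- and post-condition become explicit runtime checks — an illustration of the idea, not prescribed tooling:&lt;/p&gt;

```python
def divide(a: int, b: int) -> float:
    # pre: b != 0 -- the caller's obligation, stated at the boundary
    assert b != 0, "pre-condition violated: b must be non-zero"
    result = a / b
    # post: result * b = a (up to floating-point rounding)
    assert abs(result * b - a) < 1e-9, "post-condition violated"
    return result
```

&lt;p&gt;&lt;code&gt;divide(10, 2)&lt;/code&gt; returns &lt;code&gt;5.0&lt;/code&gt;; &lt;code&gt;divide(1, 0)&lt;/code&gt; fails loudly at the pre-condition instead of surfacing as a &lt;code&gt;ZeroDivisionError&lt;/code&gt; somewhere downstream.&lt;/p&gt;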




&lt;h2&gt;
  
  
  Chapter 2: What Are Formal Methods? A Focus on VDM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 The Core Idea
&lt;/h3&gt;

&lt;p&gt;Formal methods are a family of techniques that use mathematical notation and inference rules to describe software specifications and reason about their correctness.&lt;/p&gt;

&lt;p&gt;Major formal methods include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VDM (Vienna Development Method):&lt;/strong&gt; Developed at IBM's Vienna lab in the 1970s. Features the model-oriented specification language VDM-SL. Extensive industrial application history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z Notation:&lt;/strong&gt; From Oxford University. Specification based on set theory and schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B Method:&lt;/strong&gt; Used in systems such as the Paris Métro automatic train operation. High affinity with automated proof.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alloy:&lt;/strong&gt; A lightweight formal method from MIT. Supports automatic verification through model checking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLA+:&lt;/strong&gt; Developed by Leslie Lamport. Particularly strong for distributed systems. Adopted extensively by Amazon.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article focuses on VDM for three reasons: VDM-SL is well-suited for AI generation and manipulation; it integrates well with mechanical consistency checking via tools like Overture Tool; and it has a strong industrial track record.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Core Elements of VDM-SL
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Type Definitions:&lt;/strong&gt; VDM-SL defines the structure of data as types. Beyond primitive types (&lt;code&gt;nat&lt;/code&gt;, &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;bool&lt;/code&gt;, &lt;code&gt;char&lt;/code&gt;), it supports set types (&lt;code&gt;set of T&lt;/code&gt;), sequence types (&lt;code&gt;seq of T&lt;/code&gt;), map types (&lt;code&gt;map T1 to T2&lt;/code&gt;), product types, and union types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invariants:&lt;/strong&gt; Conditions that a type or system state must always satisfy, expressed in predicate logic. This ensures that invalid states cannot exist at the specification level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operations:&lt;/strong&gt; State-changing operations defined as pairs of pre-conditions and post-conditions. "What must hold before this operation is called?" and "What must hold after it completes?" are made explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functions:&lt;/strong&gt; Pure computations without side effects. In implicit specifications, only the relationship between inputs and outputs is stated as a predicate—no concrete algorithm is specified.&lt;/p&gt;
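&lt;p&gt;For readers who think in code, the four elements map loosely onto constructs in a mainstream language. The sketch below is an analogy, not a formal translation — VDM-SL invariants hold universally, whereas these checks fire only at runtime:&lt;/p&gt;

```python
from dataclasses import dataclass

# Type definition with an invariant (cf. Password = seq of char, inv 8..128)
@dataclass(frozen=True)
class Password:
    value: str
    def __post_init__(self):
        # Invariant checked at construction: an invalid Password cannot exist
        assert 8 <= len(self.value) <= 128, "invariant violated"

# Operation: a state change framed by a pre-condition and a post-condition
class Counter:
    def __init__(self) -> None:
        self.n = 0
    def increment(self, by: int) -> None:
        assert by >= 1                  # pre-condition
        old = self.n
        self.n += by
        assert self.n == old + by       # post-condition

# Function: a pure computation; an implicit spec would state only the
# input/output relation, not this particular algorithm
def double(x: int) -> int:
    return 2 * x
```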

&lt;h3&gt;
  
  
  2.3 Deductive vs. Inductive Reasoning
&lt;/h3&gt;

&lt;p&gt;The decisive difference between TDD and formal methods lies in the direction of reasoning.&lt;/p&gt;

&lt;p&gt;TDD (inductive) infers general correctness from a finite set of specific test cases. Formal specification (deductive) defines an unambiguous criterion for correctness, against which any candidate implementation can be checked.&lt;/p&gt;

&lt;p&gt;An important distinction must be made here. Formal specification and formal verification are separate processes. Writing a VDM-SL specification guarantees the &lt;strong&gt;internal consistency&lt;/strong&gt; (freedom from contradiction) and &lt;strong&gt;completeness&lt;/strong&gt; (all required operations are defined) of the specification itself. Verifying that an implementation &lt;em&gt;conforms&lt;/em&gt; to the specification requires separate means—theorem provers, property-based testing, etc.—addressed in Chapter 4, Phase 4.&lt;/p&gt;

&lt;p&gt;Nevertheless, inductive reasoning has a fundamental ceiling: no accumulation of test cases reaches certainty. The formal specification approach at minimum defines unambiguously what "correct" means, enabling verification against a clear standard. This is a qualitative difference from TDD.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 3: VDM-SL in Practice — A Concrete Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Example: A User Management System
&lt;/h3&gt;

&lt;p&gt;A note before the code: this article argues that humans do not need to read formal specifications. So why show VDM-SL here at all? For the same reason a building client benefits from knowing that structural calculations exist and what role they play—even without reading them. The following example illustrates how rigorous, and how complete, the artifact generated by AI actually is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Formal specification of a user management system

types
  UserId = nat1
  inv uid == uid &amp;lt;= 999999;

  Email = seq1 of char
  inv email ==
    exists i in set inds email &amp;amp; email(i) = '@'
    and len email &amp;lt;= 254;

  Password = seq of char
  inv pw == len pw &amp;gt;= 8 and len pw &amp;lt;= 128;

  Role = &amp;lt;Admin&amp;gt; | &amp;lt;Editor&amp;gt; | &amp;lt;Viewer&amp;gt;;

  User :: id       : UserId
          email    : Email
          name     : seq1 of char
          role     : Role
          active   : bool
  inv u == len u.name &amp;lt;= 100;

state UserSystem of
  users    : map UserId to User
  nextId   : UserId
  emails   : set of Email
inv mk_UserSystem(users, nextId, emails) ==
  -- All user IDs are less than nextId
  (forall uid in set dom users &amp;amp; uid &amp;lt; nextId)
  -- emails matches the set of all registered email addresses
  and emails = {users(uid).email | uid in set dom users}
  -- Each user's id field matches its key in the map
  and (forall uid in set dom users &amp;amp; users(uid).id = uid)
init s == s = mk_UserSystem({|-&amp;gt;}, 1, {})
end

operations

  RegisterUser(email: Email, name: seq1 of char, role: Role) uid: UserId
    ext wr users  : map UserId to User
        wr nextId : UserId
        wr emails : set of Email
    pre email not in set emails   -- email uniqueness
        and nextId &amp;lt;= 999999      -- ID ceiling
    post let newUser = mk_User(nextId~, email, name, role, true) in
         uid = nextId~
         and users = users~ munion {nextId~ |-&amp;gt; newUser}
         and nextId = nextId~ + 1
         and emails = emails~ union {email};

  DeactivateUser(uid: UserId)
    ext wr users : map UserId to User
    pre uid in set dom users
        and users(uid).active = true
    post users = users~ ++
         {uid |-&amp;gt; mu(users~(uid), active |-&amp;gt; false)};

  ChangeRole(uid: UserId, newRole: Role)
    ext wr users : map UserId to User
    pre uid in set dom users
        and users(uid).active = true
    post users = users~ ++
         {uid |-&amp;gt; mu(users~(uid), role |-&amp;gt; newRole)};

  FindUserByEmail(email: Email) result: [UserId]
    ext rd users : map UserId to User
    post if exists uid in set dom users &amp;amp; users(uid).email = email
         then result &amp;lt;&amp;gt; nil
              and result in set dom users
              and users(result).email = email
         else result = nil;

functions

  ActiveUserCount: map UserId to User -&amp;gt; nat
  ActiveUserCount(users) ==
    card {uid | uid in set dom users &amp;amp; users(uid).active};

  HasAdmin: map UserId to User -&amp;gt; bool
  HasAdmin(users) ==
    exists uid in set dom users &amp;amp;
      users(uid).role = &amp;lt;Admin&amp;gt; and users(uid).active;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 What This Specification Directly Guarantees
&lt;/h3&gt;

&lt;p&gt;Reading the specification carefully reveals that the following properties are &lt;strong&gt;logically guaranteed&lt;/strong&gt;—not probabilistically suggested by test cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Email uniqueness:&lt;/strong&gt; The pre-condition &lt;code&gt;email not in set emails&lt;/code&gt; in &lt;code&gt;RegisterUser&lt;/code&gt; makes duplicate registration with the same address logically impossible at the specification level. Guaranteeing this with TDD would require testing every possible registration order and concurrency pattern—an intractable task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. State consistency:&lt;/strong&gt; The system invariant &lt;code&gt;inv mk_UserSystem(...)&lt;/code&gt; ensures that the &lt;code&gt;emails&lt;/code&gt; set and &lt;code&gt;users&lt;/code&gt; map remain permanently synchronized, and that ID integrity is always maintained. That post-conditions preserve this invariant can be confirmed deductively from the specification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Type constraints:&lt;/strong&gt; &lt;code&gt;UserId&lt;/code&gt; is a natural number between 1 and 999999; &lt;code&gt;Email&lt;/code&gt; is a string of 1–254 characters containing &lt;code&gt;@&lt;/code&gt;; &lt;code&gt;Password&lt;/code&gt; is 8–128 characters. These constraints are built into the type definitions.&lt;/p&gt;
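&lt;p&gt;A deliberately simplified Python mirror makes these guarantees tangible. The class below is an illustration of the contract, not generated output from the workflow:&lt;/p&gt;

```python
class UserSystem:
    """Toy Python mirror of the VDM-SL UserSystem state (illustrative)."""

    def __init__(self) -> None:
        self.users = {}        # map UserId to user record
        self.next_id = 1       # nextId
        self.emails = set()    # set of registered emails

    def _invariant(self) -> bool:
        # Mirrors inv mk_UserSystem(...): ID ceiling, email sync, key = id
        return (all(uid < self.next_id for uid in self.users)
                and self.emails == {u["email"] for u in self.users.values()}
                and all(u["id"] == uid for uid, u in self.users.items()))

    def register_user(self, email: str, name: str, role: str) -> int:
        # Pre-conditions, exactly as in the spec
        assert email not in self.emails, "pre: duplicate email"
        assert self.next_id <= 999999, "pre: ID ceiling reached"
        uid = self.next_id
        self.users[uid] = {"id": uid, "email": email, "name": name,
                           "role": role, "active": True}
        self.next_id += 1
        self.emails.add(email)
        assert self._invariant(), "state invariant broken"
        return uid
```

&lt;p&gt;Registering a duplicate email fails at the pre-condition; every successful registration leaves the invariant intact.&lt;/p&gt;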

&lt;h3&gt;
  
  
  3.3 Design Challenges Arising from Specification Contracts
&lt;/h3&gt;

&lt;p&gt;Section 3.2 addressed properties that the specification directly guarantees. But formal specifications serve a second, equally important function: because their contracts are explicit, they &lt;strong&gt;surface design challenges&lt;/strong&gt; that would otherwise remain invisible until implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-conditions are responsibility-boundary contracts.&lt;/strong&gt; For example, the pre-condition &lt;code&gt;users(uid).active = true&lt;/code&gt; in &lt;code&gt;ChangeRole&lt;/code&gt; declares: "changing the role of an inactive user is outside this operation's scope." Invalid inputs do not magically disappear—the contract states: "if this condition does not hold, this operation's behavior is not guaranteed."&lt;/p&gt;

&lt;p&gt;This explicit contract gives rise to two concrete design challenges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design challenge 1: Compositional verification across modules.&lt;/strong&gt; If module A calls &lt;code&gt;RegisterUser&lt;/code&gt; (post-condition: &lt;code&gt;active = true&lt;/code&gt;) and module B immediately calls &lt;code&gt;ChangeRole&lt;/code&gt;, A's post-condition logically implies B's pre-condition (A.post ⇒ B.pre). This can be formally confirmed—proving that the module composition is free of contradiction at the specification level. This is a guarantee that testing struggles to achieve. This verification is performed automatically by AI during Phase 2 (design).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design challenge 2: Defensive placement.&lt;/strong&gt; When data that violates a pre-condition reaches the system, the question becomes: where and how to defend? Should the validation layer sit at the API gateway? At the entry point of each module? Should a data quarantine (sanitization) module be inserted, and at which architectural layer? These are design questions, not specification questions—but they only surface as concrete agenda items &lt;em&gt;because&lt;/em&gt; the pre-conditions are explicit. With ambiguous natural-language specifications, there is a real risk that these defensive design discussions never occur before implementation begins.&lt;/p&gt;

&lt;p&gt;These design challenges are resolved concretely in Phase 2 (AI-driven design) of the workflow described in Chapter 4. Formal specifications simultaneously deliver two forms of value: &lt;strong&gt;correctness guarantees&lt;/strong&gt; (Section 3.2) and &lt;strong&gt;design challenge clarification&lt;/strong&gt; (this section).&lt;/p&gt;
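&lt;p&gt;Design challenge 1 can be illustrated in miniature. The predicates below are simplified stand-ins for the two contracts, and the sampled states stand in for what a real checker would reason about symbolically:&lt;/p&gt;

```python
# Toy check of A.post => B.pre at a module boundary.
# Simplified stand-ins for RegisterUser.post and ChangeRole.pre.

def register_user_post(user: dict) -> bool:
    # RegisterUser's post-condition: the new user exists and is active
    return user["id"] >= 1 and user["active"]

def change_role_pre(user: dict) -> bool:
    # ChangeRole's pre-condition: the target user exists and is active
    return user["id"] >= 1 and user["active"]

def composition_ok(states) -> bool:
    # Every state A's post-condition admits must satisfy B's pre-condition
    return all(change_role_pre(u) for u in states if register_user_post(u))

samples = [{"id": 1, "active": True}, {"id": 2, "active": False}]
print(composition_ok(samples))
```

&lt;p&gt;If ChangeRole's pre-condition were later strengthened — say, to exclude certain roles — the check would fail, and the mismatch would surface at design time rather than in production.&lt;/p&gt;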




&lt;h2&gt;
  
  
  Chapter 4: The AI-Driven Development Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 The Proposed Workflow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Client and Architect — Dialogue-Driven Formal Specification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This phase mirrors the relationship between a building client and an architect. The client (human) does not read structural calculations. They converse with the architect (AI), confirming that their requirements are correctly captured.&lt;/p&gt;

&lt;p&gt;The human communicates business requirements in natural language. AI generates VDM-SL. The human never reads the formal specification directly. AI explains the specification's content and implications in natural language; the human judges whether the explanation matches their intent.&lt;/p&gt;

&lt;p&gt;A concrete example of this dialogue: the human says, "I want to prevent duplicate email addresses at registration." AI responds: "I've defined the pre-condition &lt;code&gt;email not in set emails&lt;/code&gt;. This makes re-registration with an already-registered address logically impossible. However, should email addresses from deactivated accounts be reusable? I need your decision on that." The human decides based on their domain knowledge.&lt;/p&gt;

&lt;p&gt;In this phase, the human is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Providing business domain knowledge&lt;/li&gt;
&lt;li&gt;Hearing AI's explanations and judging specification validity — "Is that actually the intended behavior?"&lt;/li&gt;
&lt;li&gt;Asking about edge cases — "What happens when a deactivated user tries to re-register?"&lt;/li&gt;
&lt;li&gt;Final approval of the specification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI is responsible for: converting natural-language requirements into VDM-SL; checking internal consistency; explaining the specification and its implications clearly; and surfacing potential edge cases and contradictions.&lt;/p&gt;

&lt;p&gt;The key point: the formal specification is an artifact that AI reads and writes. Humans need not understand VDM-SL notation. Why use VDM-SL rather than a natural-language specification document? This question is addressed in Section 5.3.&lt;/p&gt;

&lt;p&gt;Here is another concrete example—a Phase 1 dialogue for the seemingly simple requirement of "sorting numbers":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Human:&lt;/strong&gt; I want to sort input numbers in ascending order.&lt;br&gt;
&lt;strong&gt;AI:&lt;/strong&gt; So for &lt;code&gt;{3, 2, 7}&lt;/code&gt;, the output should be &lt;code&gt;[2, 3, 7]&lt;/code&gt;?&lt;br&gt;
&lt;strong&gt;Human:&lt;/strong&gt; Yes.&lt;br&gt;
&lt;strong&gt;AI:&lt;/strong&gt; Just to confirm—would &lt;code&gt;[1, 2, 3]&lt;/code&gt; be acceptable for &lt;code&gt;{3, 2, 7}&lt;/code&gt;? That is, is any ascending sequence fine, or must the output contain only the input elements?&lt;br&gt;
&lt;strong&gt;Human:&lt;/strong&gt; It must contain all elements from the input.&lt;br&gt;
&lt;strong&gt;AI:&lt;/strong&gt; Then for &lt;code&gt;{3, 3, 1, 8}&lt;/code&gt;, would &lt;code&gt;[1, 3, 8]&lt;/code&gt; be correct? Should duplicates be collapsed?&lt;br&gt;
&lt;strong&gt;Human:&lt;/strong&gt; No—just rearrange the original values. Duplicates remain as they are.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After three rounds of dialogue, AI generates the following VDM-SL specification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;functions

sort[@T]: seq of @T * (@T * @T -&amp;gt; bool) -&amp;gt; seq of @T
sort(l, less) == is not yet specified
post bagOf[@T](l) = bagOf[@T](RESULT)              -- Permutation
     and forall i in set {1, ..., len RESULT - 1} &amp;amp;
         not less(RESULT(i+1), RESULT(i))           -- Sorted
measure len l;

bagOf[@T]: seq of @T -&amp;gt; map @T to nat
bagOf(s) ==
  { i |-&amp;gt; occurs[@T](i, s) | i in set elems s }
post dom RESULT = elems s and sizeOfBag[@T](RESULT) = len s;

sizeOfBag[@T]: map @T to nat -&amp;gt; nat
sizeOfBag(b) ==
  -- Total number of elements in the bag
  if b = {|-&amp;gt;} then 0
  else let e in set dom b in b(e) + sizeOfBag[@T]({e} &amp;lt;-: b)
measure card dom b;

occurs[@T]: @T * seq of @T -&amp;gt; nat
occurs(e, s) ==
  if s = [] then 0
  else (if e = hd s then 1 else 0) + occurs[@T](e, tl s)
measure len s;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note what happened: three rounds of misunderstanding in natural language were resolved by two post-conditions in VDM-SL. &lt;code&gt;bagOf(l) = bagOf(RESULT)&lt;/code&gt; means "the output is a permutation of the input (same elements, same counts)." The ordering condition means "no adjacent pair is out of order." This specification is satisfied by bubble sort, quicksort, and merge sort alike. &lt;strong&gt;Algorithm selection is a design decision, not a specification concern.&lt;/strong&gt;&lt;/p&gt;
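&lt;p&gt;The two post-conditions translate directly into an executable property check. Below, Python's &lt;code&gt;Counter&lt;/code&gt; plays the role of &lt;code&gt;bagOf&lt;/code&gt; — a sketch of the spec as an oracle, not the framework's verifier:&lt;/p&gt;

```python
from collections import Counter

def satisfies_sort_spec(inp, out) -> bool:
    """The two post-conditions as an executable property."""
    is_permutation = Counter(inp) == Counter(out)   # bagOf(l) = bagOf(RESULT)
    is_sorted = all(out[i] <= out[i + 1]            # no adjacent pair out of order
                    for i in range(len(out) - 1))
    return is_permutation and is_sorted

def bubble_sort(xs):
    # One of many algorithms that satisfy the same specification
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

data = [3, 3, 1, 8]
assert satisfies_sort_spec(data, sorted(data))       # built-in Timsort conforms
assert satisfies_sort_spec(data, bubble_sort(data))  # so does bubble sort
assert not satisfies_sort_spec(data, [1, 3, 8])      # collapsed duplicate: rejected
```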

&lt;p&gt;&lt;strong&gt;Phase 2: AI-Driven Technology Selection and Architecture — Integrating Non-Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The formal specification from Phase 1 defines &lt;em&gt;what&lt;/em&gt; to compute, but not at what scale, speed, or cost. Phase 2 integrates &lt;strong&gt;non-functional requirements&lt;/strong&gt; provided separately by the human, combining them with the formal specification to make technical design decisions.&lt;/p&gt;

&lt;p&gt;For the sorting specification above, suppose the human states: "Data scale is approximately n = 10³⁰," "Compute resources and budget are limited," and "Stable sorting is not required." Combining these constraints with the specification, AI determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programming language (weighing type safety, performance requirements, ecosystem)&lt;/li&gt;
&lt;li&gt;Algorithm selection (n = 10³⁰ requires external sorting; stability not required favors quicksort variants)&lt;/li&gt;
&lt;li&gt;Frameworks and libraries&lt;/li&gt;
&lt;li&gt;Architecture (whether distributed processing is needed, microservices vs. monolith, database selection, etc.)&lt;/li&gt;
&lt;li&gt;Infrastructure (server specifications, cloud provider, container orchestration, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key structural insight: the formal specification provides the &lt;strong&gt;correctness criterion&lt;/strong&gt; while non-functional requirements provide the &lt;strong&gt;design constraints&lt;/strong&gt;, and the two are independent. Bubble sort satisfies the VDM-SL specification but is O(n²)—unworkable for n = 10³⁰—and is rejected based on non-functional requirements. The separation of formal specification and non-functional requirements prevents correctness and efficiency from becoming entangled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: AI Code Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI generates code corresponding to the VDM-SL specification. The key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VDM-SL pre-conditions are implemented as runtime validation or assertions&lt;/li&gt;
&lt;li&gt;Post-conditions serve as correctness criteria for the generated code&lt;/li&gt;
&lt;li&gt;Invariants become design constraints on data structures&lt;/li&gt;
&lt;li&gt;Type definitions are mapped as directly as possible to the implementation language's type system&lt;/li&gt;
&lt;/ul&gt;
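&lt;p&gt;As one illustration of the first two bullets, generated code might carry the pre-condition as runtime validation and re-check the post-condition after execution. A hedged Python sketch (the &lt;code&gt;contract&lt;/code&gt; decorator and all names are hypothetical, not the output of any actual tool):&lt;/p&gt;

```python
from collections import Counter
from functools import wraps

def contract(pre=None, post=None):
    """Attach a VDM-style pre-condition and post-condition as runtime checks."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args):
            if pre is not None:
                assert pre(*args), f"pre-condition of {fn.__name__} violated"
            result = fn(*args)
            if post is not None:
                assert post(result, *args), f"post-condition of {fn.__name__} violated"
            return result
        return wrapper
    return deco

@contract(
    # pre: inputs must be mutually orderable (here: all ints)
    pre=lambda xs: all(isinstance(x, int) for x in xs),
    # post: permutation of the input, and no adjacent pair out of order
    post=lambda r, xs: Counter(r) == Counter(xs)
    and all(r[i] <= r[i + 1] for i in range(len(r) - 1)),
)
def sort_impl(xs):
    return sorted(xs)

print(sort_impl([3, 1, 2]))  # [1, 2, 3]
```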

&lt;p&gt;A notable aspect of this phase is the design decision involved in translating VDM-SL's abstract types into concrete data structures. VDM-SL is intentionally abstract about implementation details. For example, &lt;code&gt;seq of T&lt;/code&gt; (an ordered collection) might become an &lt;code&gt;Array&lt;/code&gt; if random access dominates, a &lt;code&gt;LinkedList&lt;/code&gt; if head insertions and deletions are frequent, or an &lt;code&gt;ArrayList&lt;/code&gt; or &lt;code&gt;Deque&lt;/code&gt; if both are needed. Similarly, &lt;code&gt;set of T&lt;/code&gt; could map to a &lt;code&gt;HashSet&lt;/code&gt;, &lt;code&gt;TreeSet&lt;/code&gt;, or &lt;code&gt;BitSet&lt;/code&gt;; &lt;code&gt;map T1 to T2&lt;/code&gt; could become a &lt;code&gt;HashMap&lt;/code&gt;, &lt;code&gt;TreeMap&lt;/code&gt;, or &lt;code&gt;ConcurrentHashMap&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These decisions cannot be derived from type definitions alone. AI makes them by synthesizing the operation frequency patterns in the specification (is lookup dominant, or insertion?), the non-functional requirements established in Phase 2 (concurrency, memory constraints, latency requirements), and the characteristics of the selected language and frameworks. In other words, translating from the specification's "What" to the implementation's "How" is not a trivial mapping but an optimization process coupled with Phase 2's architectural decisions.&lt;/p&gt;
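&lt;p&gt;As a toy illustration of this optimization step, the choice could be driven by an operation profile extracted from the specification. A Python sketch (the profile fields and the decision rule are invented for illustration):&lt;/p&gt;

```python
from collections import deque

def concrete_seq_type(head_ops: int, random_access_ops: int):
    """Pick a concrete structure for an abstract 'seq of T' from an operation profile."""
    # deque: O(1) insertion/removal at the head; list: O(1) random access
    return deque if head_ops > random_access_ops else list

# A head-insertion-heavy profile selects deque; a lookup-heavy one selects list.
print(concrete_seq_type(900, 10).__name__)    # deque
print(concrete_seq_type(5, 10_000).__name__)  # list
```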

&lt;p&gt;&lt;strong&gt;Phase 4: Specification Conformance Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The generated code is verified against the formal specification. This is the first point at which testing appears—but its role is fundamentally different from TDD. Tests are property-based tests derived automatically from the specification. No human writes individual test cases by hand.&lt;/p&gt;

&lt;p&gt;Because the specification is precisely defined, test generation enjoys enormous flexibility: random tests that generate inputs satisfying pre-conditions and verify post-conditions, boundary-value tests that probe the edges of invariant constraints, and systematic tests that cover all state-transition paths can all be generated automatically from the specification. For the sorting example, the two post-conditions—&lt;code&gt;bagOf(l) = bagOf(RESULT)&lt;/code&gt; and the ordering condition—generate tests with empty sequences, single elements, all-identical elements, reverse-ordered inputs, and massive sequences, all without human intervention.&lt;/p&gt;
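&lt;p&gt;The generation step itself can be sketched in a few lines: the boundary cases fall out of the structure of the specification, and random cases fill in the rest. (A hedged Python sketch; &lt;code&gt;satisfies_spec&lt;/code&gt; inlines the two post-conditions for integers, and the built-in &lt;code&gt;sorted&lt;/code&gt; stands in for the generated implementation.)&lt;/p&gt;

```python
import random
from collections import Counter

def satisfies_spec(inp, out):
    # the two post-conditions: permutation + no adjacent pair out of order
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    return Counter(inp) == Counter(out) and ordered

def generated_inputs():
    yield []                             # empty sequence
    yield [42]                           # single element
    yield [7] * 100                      # all-identical elements
    yield list(range(500, 0, -1))        # reverse-ordered input
    for _ in range(100):                 # random cases within the pre-condition
        yield [random.randint(-10, 10) for _ in range(random.randint(0, 50))]

implementation_under_test = sorted       # stand-in for the generated code

for xs in generated_inputs():
    assert satisfies_spec(xs, implementation_under_test(list(xs)))
print("all generated cases satisfy the spec")
```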

&lt;p&gt;&lt;strong&gt;Phases 2–4 Are an Autonomous AI Cycle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A critical point deserves emphasis: the cycle of Phase 2 (design) → Phase 3 (implementation) → Phase 4 (verification) is &lt;strong&gt;executed entirely by AI, autonomously&lt;/strong&gt;. If Phase 4 detects a specification violation, AI either fixes the implementation or revisits design decisions (algorithm selection, data structure choices) and re-implements. No human intervenes in this feedback loop.&lt;/p&gt;

&lt;p&gt;This is possible precisely because the specification is formally defined. If the specification were ambiguous, AI could not autonomously determine whether a defect is a specification problem or an implementation problem. But a formal specification provides an unambiguous criterion for correctness, enabling AI to automatically judge "this does not satisfy the specification → correction is needed." Once the human confirms the specification in Phase 1, AI can be entrusted with the entire process until a working system is delivered.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 What Humans Need to Know
&lt;/h3&gt;

&lt;p&gt;The skills required of humans in this paradigm are fundamentally different from traditional development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required skills:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain knowledge:&lt;/strong&gt; Deep understanding of the business domain being developed. This is the irreplaceable human contribution—the one AI cannot substitute. Decisions like "should deactivated users' email addresses be reusable?" can only be made by someone who understands the business context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical dialogue ability:&lt;/strong&gt; The ability to identify logical contradictions or missing requirements through natural-language conversation when AI explains the specification. Reading formal notation is not required, but being able to reason in plain language about structures like "if A then B" or "for every X, Y holds" is valuable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ability to articulate requirements:&lt;/strong&gt; The ability to make implicit domain knowledge explicit in a form that AI can process. This is the same skill as traditional requirements definition, but the bar for clarity rises when conversing with AI—ambiguity cannot be left unresolved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skills that become unnecessary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading or writing formal method notation (AI generates the specification and explains it in plain language)&lt;/li&gt;
&lt;li&gt;Proficiency in specific programming languages (AI selects the appropriate language and writes the code)&lt;/li&gt;
&lt;li&gt;Detailed knowledge of frameworks and libraries (AI selects the optimal ones)&lt;/li&gt;
&lt;li&gt;Test case design (automatically derived from the specification)&lt;/li&gt;
&lt;li&gt;Detailed infrastructure design (AI derives this from non-functional requirements in the specification)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.3 Comparison with Traditional Development
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;TDD&lt;/th&gt;
&lt;th&gt;Formal Methods + AI-Driven&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correctness guarantee&lt;/td&gt;
&lt;td&gt;Inductive (finite test cases)&lt;/td&gt;
&lt;td&gt;Deductive (logical specification)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specification ambiguity&lt;/td&gt;
&lt;td&gt;Tests constitute an implicit spec&lt;/td&gt;
&lt;td&gt;Specification is explicit and precise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design driver&lt;/td&gt;
&lt;td&gt;Test writability&lt;/td&gt;
&lt;td&gt;Logical structure of requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human role&lt;/td&gt;
&lt;td&gt;Coder and tester&lt;/td&gt;
&lt;td&gt;Client (domain expert + decision-maker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI utilization&lt;/td&gt;
&lt;td&gt;Code completion&lt;/td&gt;
&lt;td&gt;Design → implementation → verification, end-to-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Test count grows explosively&lt;/td&gt;
&lt;td&gt;Specification scales with problem complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug detection timing&lt;/td&gt;
&lt;td&gt;After implementation (at test runtime)&lt;/td&gt;
&lt;td&gt;During specification (detected as logical contradiction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Chapter 5: Why Formal Methods—Why Now?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 The Three Walls That Blocked Formal Methods Before LLMs
&lt;/h3&gt;

&lt;p&gt;Formal methods have existed since the 1970s yet never achieved broad industrial adoption. Three walls stood in the way.&lt;/p&gt;

&lt;p&gt;First, writing formal specifications required advanced mathematical training. People capable of writing VDM-SL or Z notation were scarce, and demanding that from an entire team was unrealistic.&lt;/p&gt;

&lt;p&gt;Second, a large gap existed between formal specifications and working code. Even a beautiful specification required manual effort to implement, and bugs crept in at that translation step.&lt;/p&gt;

&lt;p&gt;Third, the return on investment was opaque. The time cost of writing specifications was high; the payoff was unclear; adoption was confined to safety-critical domains such as aerospace, rail, and medical devices.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Why LLMs Are a Game Changer
&lt;/h3&gt;

&lt;p&gt;LLMs dissolve all three walls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 1 dissolved:&lt;/strong&gt; Humans no longer need to write or read formal specifications. Business requirements are expressed in natural language; AI converts them to VDM-SL. AI explains the specification in natural language; humans evaluate the explanation. A building client grasps the design through dialogue with the architect—without reading a single structural calculation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 2 dissolved:&lt;/strong&gt; AI performs the translation from formal specification to code. The gap between specification and implementation shrinks dramatically. AI generates code with an understanding of pre-conditions, post-conditions, and invariants—reducing the risk of bugs entering at the translation step. That said, current LLMs can hallucinate when handling complex specifications, so conformance verification (Phase 4) cannot be skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 3 dissolved:&lt;/strong&gt; AI automation has slashed the cost of specification. What once took weeks can now be accomplished in hours or days through AI dialogue. The ROI equation has changed fundamentally.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.3 Rebutting "If Only AI Reads It, Why Bother with Formal Methods?"
&lt;/h3&gt;

&lt;p&gt;A natural objection arises: if humans don't read the formal specification, why write it in VDM-SL at all? Couldn't AI manage the specification in natural language?&lt;/p&gt;

&lt;p&gt;The answer is clear: &lt;strong&gt;formal notation functions as a discipline of thought for AI, not just for humans.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, formal notation structurally eliminates ambiguity. Writing "a user has one email address" in natural language leaves open whether this means exactly one, at least one, or at most one. Defining &lt;code&gt;email: Email&lt;/code&gt; in VDM-SL settles it by notation alone. When AI manages specifications in natural language, it risks introducing exactly this kind of ambiguity. Formal notation eliminates that risk by construction.&lt;/p&gt;

&lt;p&gt;Second, formal specifications are mechanically verifiable intermediate artifacts. VDM-SL specifications can be type-checked and consistency-checked using tools such as Overture Tool. Natural-language specification documents cannot. Even when the AI that wrote the specification and the AI that implements it are different sessions or different models, the formal specification binds both as a precise contract.&lt;/p&gt;

&lt;p&gt;Third, there is the question of auditability. Formal specifications can be verified after the fact by other AI systems, other tools, or future verification systems—regardless of whether any human reads them. For example, a security audit can mechanically confirm from the formal specification whether all operations are accessible only to authenticated users. A compliance review can automatically verify whether a personal data deletion operation propagates to all related tables. Performing equivalent mechanical verification on natural-language specifications is fundamentally difficult due to inherent ambiguity. This auditability value grows as project scale increases.&lt;/p&gt;

&lt;p&gt;In short, formal methods exist not "for humans" but "for logical rigor." Writing in formal notation improves AI's reasoning precision even when no human will ever read the result. This is the same principle by which mathematicians reach more accurate conclusions using symbols than intuition alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.4 The Transition Strategy Before AGI
&lt;/h3&gt;

&lt;p&gt;Current LLMs are not AGI. They lack the ability to autonomously understand business requirements and design optimal systems from scratch. But they already demonstrate practical capability in understanding formal specifications and generating conformant code.&lt;/p&gt;

&lt;p&gt;This profile—strong at specification understanding and implementation, limited at requirements definition—is ideal for combination with formal methods. Humans convey the business "What" in natural language; AI translates it into a formal specification, verifies it through dialogue with the human, and then implements the technical "How."&lt;/p&gt;

&lt;p&gt;When AGI arrives, AI may be able to handle even the "What." Until then, formal methods combined with AI-driven development is the most rational approach available.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 6: A Practical Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Individual Level
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Practice logical articulation (1–2 weeks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start by articulating familiar business rules precisely in natural language. For an e-commerce order process: "Items with zero inventory cannot be ordered." "A user can have at most one order in processing at a time." Write these as explicit conditional statements. Understanding concepts like sets, maps, and universal/existential quantification at the natural-language level is sufficient. There is no need to learn VDM-SL notation.&lt;/p&gt;
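&lt;p&gt;For instance, the two order-process rules above, once written as explicit conditionals, correspond directly to checkable predicates. (A Python sketch; the data shapes are invented for illustration:)&lt;/p&gt;

```python
def can_order(inventory: dict, item: str) -> bool:
    """If an item's inventory is zero, then it cannot be ordered."""
    return inventory.get(item, 0) > 0

def can_start_order(orders_in_processing: list, user: str) -> bool:
    """For every user, at most one order may be in processing at a time."""
    return sum(1 for o in orders_in_processing if o == user) < 1

print(can_order({"mug": 0, "pen": 3}, "mug"))  # False: zero inventory
print(can_start_order(["alice"], "alice"))     # False: already one in processing
```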

&lt;p&gt;&lt;strong&gt;Step 2: Practice specification dialogue with AI (1–2 weeks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ask AI to "write a formal VDM-SL specification for a user management feature." As AI explains the generated specification, practice identifying gaps and contradictions. The goal is to develop the ability to refine a specification by asking questions like "What happens to a deactivated user's data?" or "Can an administrator delete their own account?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Run the full cycle on a small project (2–4 weeks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Execute the entire pipeline—from specification dialogue through code generation to reviewing the deliverable. Start with a small API with basic CRUD operations. The essential experience is the feedback loop: use the generated artifact, find behaviors that differ from what was agreed in the specification dialogue, and return that feedback to AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.2 Team and Organizational Level
&lt;/h3&gt;

&lt;p&gt;For organizational adoption, begin with a single small new service as a pilot project built entirely with formal methods + AI-driven development. Do not attempt to migrate existing large systems immediately. Demonstrate value first, then scale.&lt;/p&gt;

&lt;p&gt;The review process changes fundamentally. Traditional code review is replaced by &lt;strong&gt;specification dialogue review&lt;/strong&gt;. Share the AI specification-dialogue logs; convene domain-knowledgeable humans to discuss whether the specification correctly reflects the requirements. Delegate implementation details to AI; focus human attention at the specification level.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 7: Honest Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Where Formal Methods Are Not Enough
&lt;/h3&gt;

&lt;p&gt;Formal methods are not universal. Complementary approaches remain necessary in several areas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI/UX specification:&lt;/strong&gt; The "usability" and "aesthetics" of user interfaces are difficult to formalize. Prototyping and user testing remain valuable. However, the business logic beneath the UI—state transitions, validation rules, and so on—is fully formalizable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance characteristics:&lt;/strong&gt; VDM-SL describes &lt;em&gt;what&lt;/em&gt; is computed, not &lt;em&gt;how fast&lt;/em&gt;. Performance requirements must be handled separately as non-functional requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External system integration:&lt;/strong&gt; The actual behavior of third-party APIs cannot be fully captured in a formal specification. Integration testing remains valuable at system boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.2 Tests Do Not Disappear Entirely
&lt;/h3&gt;

&lt;p&gt;The article's title is provocative, but it does not reject testing wholesale. What it rejects is the TDD &lt;em&gt;paradigm&lt;/em&gt;—the idea that tests should be the center of design and quality assurance.&lt;/p&gt;

&lt;p&gt;Testing persists in this approach in the following roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Property-based tests automatically derived from the specification (conformance verification)&lt;/li&gt;
&lt;li&gt;Integration tests for external system boundaries&lt;/li&gt;
&lt;li&gt;Performance and load tests&lt;/li&gt;
&lt;li&gt;End-to-end / acceptance tests for UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shift is that testing moves from center stage to a supporting role.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.3 Multi-Agent Coordination and Applicable Scale — Scaling Through Formal Specifications
&lt;/h3&gt;

&lt;p&gt;Can a single AI agent autonomously build an entire large-scale system? No—context window constraints prevent it from grasping a system of tens of thousands of lines at once. But this is the wrong question. In human development teams, no single engineer holds the entire system in their head either. Team development works because interfaces between modules have been agreed upon.&lt;/p&gt;

&lt;p&gt;Formal specifications provide precisely this "interface agreement that enables team development"—rigorously. And this is the key that makes autonomous construction of medium-to-large systems possible even with current AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Agent Architecture with Formal Specifications as Contracts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The structure is as follows.&lt;/p&gt;

&lt;p&gt;In Phase 1 (human + architect AI), the formal specification for the entire system is defined. The core activity here is module decomposition and the definition of inter-module interface specifications—the pre-conditions and post-conditions of operations each module exposes. This set of specifications becomes the "contract" for all agents.&lt;/p&gt;

&lt;p&gt;In Phases 2–4, independent AI agents work on their assigned modules &lt;strong&gt;in parallel&lt;/strong&gt;, each executing design, implementation, and verification autonomously. Agent A handles the inventory management module, Agent B handles order processing, Agent C handles payment. Each agent's context needs only "its own module's specification" plus "the interface specifications of modules it depends on (pre/post-conditions only)"—implementation details of other modules are entirely unnecessary. This fits comfortably within current context windows.&lt;/p&gt;

&lt;p&gt;For integration verification, a dedicated integration agent cross-checks the published interface specifications of all modules, mechanically verifying the A.post ⇒ B.pre compositional consistency. This agent need not examine implementation code at all—it specializes in specification-level consistency checking.&lt;/p&gt;
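&lt;p&gt;Fully proving A.post ⇒ B.pre requires a theorem prover or SMT solver, but the shape of the check the integration agent performs can be sketched by sampling. (Hypothetical Python; the two conditions are invented stand-ins for two modules' contracts:)&lt;/p&gt;

```python
# Module A (inventory): post-condition on the quantity it emits
a_post = lambda qty: qty >= 0             # reserved quantity is non-negative
# Module B (order processing): pre-condition on the quantity it accepts
b_pre = lambda qty: 0 <= qty <= 1000      # accepts 0..1000 only

def implies_on_samples(post, pre, samples):
    """Check post(x) => pre(x) on sample values; a counterexample disproves composition."""
    for x in samples:
        if post(x) and not pre(x):
            return False, x               # A may emit a value B rejects
    return True, None

ok, counterexample = implies_on_samples(a_post, b_pre, range(-5, 2001))
print((ok, counterexample))  # (False, 1001): the composition A -> B is unsafe as specified
```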

&lt;p&gt;&lt;strong&gt;Why this structure is impossible with natural-language specifications but possible with formal specifications&lt;/strong&gt; comes down to three reasons. First, each agent can independently determine whether its implementation satisfies its spec. Because the correctness criterion is unambiguous, no "alignment meetings" with other agents are needed. Second, inter-module consistency verification can be performed mechanically. Implication relationships between post-conditions and pre-conditions are tool-verifiable. Third, the context each agent must hold is minimized. No agent needs to know the internals of other modules—only their interface specifications.&lt;/p&gt;

&lt;p&gt;This is the same principle by which API documentation and interface definitions (OpenAPI, Protocol Buffers, etc.) enable division of labor in human teams. However, natural-language API documentation leaves ambiguities such as "only success responses documented, error cases undefined." Formal specifications structurally eliminate this ambiguity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remaining Challenges and the Human Role&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Several challenges remain at this point: unifying cross-cutting concerns (authentication/authorization, logging, error handling patterns), managing shared data model consistency (database schemas), and automating agent orchestration (execution order and dependency control). These are not problems of fundamental impossibility but engineering challenges—their resolution is a matter of time.&lt;/p&gt;

&lt;p&gt;For now, these cross-cutting design decisions can be finalized during Phase 1 by the architect AI and human, then included in each agent's instructions. The human's role is not only that of domain expert deciding "what to build," but also that of &lt;strong&gt;architecture-level decision maker&lt;/strong&gt; in multi-agent development. In the building analogy, this is the client who not only decides the floor plan but also consults with the architect on "in what order to commission foundation, structure, electrical, and plumbing, and to which contractors."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling Outlook&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The combination of multi-agent coordination and formal specifications makes medium-scale systems (tens of thousands of lines) theoretically achievable even with current AI models; the maturation of the orchestration layer is the key to practical deployment. Furthermore, as context windows expand, inter-agent coordination protocols become standardized, and long-term memory improves, large-scale systems (over one hundred thousand lines) come into view. Crucially, across all of these technological advances, formal specifications function as "rigorous contracts" between agents. With natural-language specifications, coordination ambiguity grows exponentially as the number of agents increases; with formal specifications, module boundary consistency remains mechanically verifiable regardless of agent count—making the approach inherently scalable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: The Human Role in AI Agent-Driven Development
&lt;/h2&gt;

&lt;p&gt;The multi-agent paradigm of formal methods + AI-driven development fundamentally redefines the human role in software development. The developer who was "a person who writes code" becomes a &lt;strong&gt;domain expert and architecture-level decision maker&lt;/strong&gt;—"a person who decides what to build, designs the overall system structure, and evaluates the output of the AI agent team."&lt;/p&gt;

&lt;p&gt;This is not a devaluation of human contribution—it is an &lt;strong&gt;elevation in abstraction level&lt;/strong&gt;. It is the same evolution as the transition from assembly language to high-level languages, from manual memory management to garbage collection. Humans become free to concentrate on more fundamental questions: "What should be built?", "Why?", and "How should the system be decomposed, and which agents should own which responsibilities?"&lt;/p&gt;

&lt;p&gt;What is required is deep understanding of one's own business domain, structural thinking about system architecture, and the ability to engage in logical dialogue with AI. Reading or writing formal notation is not required—that is the AI agents' responsibility. The human's role is to exercise domain expertise when the architect AI explains in natural language: "Is this module decomposition appropriate for the requirements?" The quality of that judgment is the core competence of humans in the multi-agent development era.&lt;/p&gt;

&lt;p&gt;Formal methods function as a discipline of each AI agent's reasoning, as rigorous contracts between agents, as a means of mechanical verification at module boundaries, and as an auditable record for the future. Humans need not read them—but writing formally carries decisive value, especially in contexts where multiple agents coordinate. This is the central claim of this article.&lt;/p&gt;

&lt;p&gt;The era of TDD's dominance is drawing to a close. We are witnessing the dawn of multi-agent AI-driven development, where formal specifications serve as "contracts" between agents. And the role humans must play in this new paradigm is not writing code, but providing the &lt;strong&gt;decision-making and oversight&lt;/strong&gt; that ensures the AI agent team builds the right thing, correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Dijkstra, E.W. (1969). &lt;em&gt;Notes on Structured Programming&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Jones, C.B. (1990). &lt;em&gt;Systematic Software Development using VDM&lt;/em&gt;. Prentice Hall.&lt;/li&gt;
&lt;li&gt;Fitzgerald, J. &amp;amp; Larsen, P.G. (2009). &lt;em&gt;Modelling Systems: Practical Tools and Techniques in Software Development&lt;/em&gt;. Cambridge University Press.&lt;/li&gt;
&lt;li&gt;Lamport, L. (2002). &lt;em&gt;Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers&lt;/em&gt;. Addison-Wesley.&lt;/li&gt;
&lt;li&gt;Newcombe, C. et al. (2015). "How Amazon Web Services Uses Formal Methods." &lt;em&gt;Communications of the ACM&lt;/em&gt;, 58(4).&lt;/li&gt;
&lt;li&gt;Jackson, D. (2012). &lt;em&gt;Software Abstractions: Logic, Language, and Analysis&lt;/em&gt;. MIT Press.&lt;/li&gt;
&lt;li&gt;Overture Tool Project: &lt;a href="https://www.overturetool.org/" rel="noopener noreferrer"&gt;https://www.overturetool.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Larsen, P.G. et al. (2010). "Industrial Applications of VDM."&lt;/li&gt;
&lt;li&gt;Bicarregui, J. et al. (2009). "Proof and Model Checking for Protocol Design." &lt;em&gt;Formal Aspects of Computing&lt;/em&gt;, 21(1-2).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
