DEV Community

Kotaro Andy
Kotaro Andy

Posted on

GitHub Spec Kit Is 80% Right — Here's the Missing 20% That Would Make It Transformative

I Love Spec Kit. And That's Why I Want to Push It Further.

GitHub's Spec Kit is, in my assessment, the most intellectually honest attempt at AI-driven development to date. While most tools focus on faster code generation, Spec Kit asks a more fundamental question:

What if intent — not code — was the source of truth?

This is exactly the right question. The Constitution → Specify → Plan → Tasks → Implement workflow is well-designed. The steerable gates where humans can intervene are smart. The 25+ agent support (Claude Code, Copilot, Gemini CLI, Cursor, etc.) shows pragmatic thinking about ecosystem diversity. The 40+ community extensions demonstrate real traction.

I've been working on a formal specification-driven development framework that starts from the same premise — specifications should drive development, not the other way around. But after months of research and implementation, I believe Spec Kit has a structural gap that, if filled, would make it dramatically more powerful.

That gap is formal verifiability of specifications themselves.

The Problem: Structured Ambiguity

Spec Kit specifications are written in structured Markdown. This is a massive improvement over the alternatives:

  • MetaGPT passes natural language PRDs between agents → ambiguity accumulates at each handoff
  • ChatDev relies on dialogue-based consensus → non-deterministic and non-reproducible
  • Devin/SWE-Agent infers specs from code → circular reasoning (code is the spec)

Spec Kit avoids all of these pitfalls by making specs explicit, structured, and human-editable. But here's the uncomfortable truth:

Structured natural language reduces ambiguity. It does not eliminate it.

Consider a Spec Kit specification like this:

## User Story: Add to Cart
The user can add a product to their shopping cart.
The cart should reflect the updated quantity.
Enter fullscreen mode Exit fullscreen mode

This is clear to a human reader. But it leaves critical questions unanswered:

  • Can the user add 0 items? Negative quantities?
  • What happens when stock is insufficient?
  • If the product is already in the cart, does the quantity replace or accumulate?
  • Is there a maximum cart size?

These aren't edge cases. They're boundary conditions — the exact places where modules interact and where bugs hide. A human spec author might catch some of them. But the point of a specification system is that the system itself should make missing boundaries visible.

What Formal Specifications Add

In VDM-SL (Vienna Development Method - Specification Language), the same requirement looks like this:

AddToCart: CustomerId * ProductId * nat1 ==> ()
AddToCart(cid, pid, qty) ==
  carts(cid) := if pid in set dom carts(cid)
                then carts(cid) ++ {pid |-> carts(cid)(pid) + qty}
                else carts(cid) ++ {pid |-> qty}
pre  pid in set dom inventory
     and inventory(pid) >= qty
     and qty > 0
     and CardinalityItems(carts(cid)) < MAX_CART_SIZE
post pid in set dom carts(cid)
     and carts(cid)(pid) = carts~(cid)(pid) + qty  -- accumulate, not replace
Enter fullscreen mode Exit fullscreen mode

Notice what happens:

  1. nat1 as the type of qty — zero and negative quantities are structurally impossible. Not "tested against," not "documented as invalid" — impossible at the type level.
  2. pre conditions — stock sufficiency and cart size limits are explicit preconditions. An AI agent reading this spec cannot "forget" about them.
  3. post conditions — the behavior is unambiguous: quantities accumulate (not replace). Any agent implementing this module knows exactly what the expected behavior is.
  4. Machine-verifiable — if another module's postcondition guarantees inventory(pid) >= qty, we can mechanically verify that it satisfies this precondition. No human review needed for interface consistency.

The Three Gaps Formal Specs Would Fill in Spec Kit

Gap 1: Boundary Conditions Are Invisible in Natural Language

In Spec Kit today, boundary conditions depend on the spec author remembering to include them. This is a human-reliability problem — the very thing we're trying to engineer out of the system.

Formal specifications force boundary conditions to be explicit because the type system and preconditions won't compile without them. A VDM-SL spec with qty: nat but no precondition on stock levels is visibly incomplete — the type checker can flag it.

Gap 2: Cross-Module Consistency Is Unverifiable

This is the critical scaling problem. Spec Kit writes specs per task (≈ per module), but there's no mechanism to verify that Task A's output satisfies Task B's input requirements.

With 3 modules, you have 3 pairwise interfaces to check — manageable by human review. With 10 modules, you have 45. With 20 modules, 190. Human review does not scale linearly.

Formal specifications solve this with compositional verification: if Order module's ConfirmOrder operation has a postcondition, and Inventory module's ReserveStock has a precondition, you can mechanically check whether ConfirmOrder.post ⇒ ReserveStock.pre. This is the A.post ⇒ B.pre pattern — and it scales linearly with module count, not quadratically.

Module A (Order)                Module B (Inventory)
┌─────────────────┐            ┌─────────────────┐
│  ConfirmOrder()  │            │  ReserveStock()  │
│  post:           │──verify───│  pre:            │
│   order.status   │            │   productId ∈    │
│    = <CONFIRMED> │   A.post   │    dom inventory │
│   ∧ stock        │    ⇒       │   ∧ quantity > 0 │
│    reserved      │   B.pre    │   ∧ available ≥  │
│                  │            │     quantity     │
└─────────────────┘            └─────────────────┘
Enter fullscreen mode Exit fullscreen mode

Gap 3: Specs Can't Validate Themselves

Spec Kit's philosophy is "intent is the source of truth." But there's a meta-problem: how do you verify that the spec itself is internally consistent?

Natural language specs can contain contradictions that only surface during implementation:

- Users must complete checkout within 30 minutes
- Users can save their cart and resume later
- Inventory is reserved at checkout start
Enter fullscreen mode Exit fullscreen mode

Are these contradictory? If a user saves their cart at minute 29, resumes 3 hours later, is inventory still reserved? A human reader might catch this. A type-checker on formal specifications will catch it — the invariant on reserved inventory duration would conflict with the "resume later" postcondition.

A Concrete Proposal: Spec Kit + Formal Specs

I'm not proposing to replace Spec Kit's natural language layer. I'm proposing to add a formal verification layer beneath it. Here's how it could work within Spec Kit's existing workflow:

Phase: Constitution (unchanged)

Project principles, values, guidelines — these remain in natural language. They're governance, not computation.

Phase: Specify (enhanced)

The human writes intent in natural language (as today). Then, an AI agent generates a VDM-SL formalization of the spec and explains it back:

"Your spec says users can add products to their cart. I've formalized this with these constraints: quantities must be positive integers, stock must be sufficient, and cart has a maximum of 50 items. The quantity behavior is accumulative — adding 3 of a product that already has 2 in the cart gives 5, not 3. Does this match your intent?"

The human validates meaning. The machine validates consistency.

Phase: Plan (enhanced)

The technical plan now includes interface contracts between modules: which module's postconditions feed into which module's preconditions. These contracts are in VDM-SL, mechanically verifiable.

Phase: Tasks (enhanced)

Each task carries not just a natural language description, but a formal module specification — the types, state, operations, pre/post conditions. The AI agent implementing the task has an unambiguous contract.

Phase: Implement (unchanged)

Agents implement against formal specs rather than natural language descriptions. Claude Code, Copilot, Gemini CLI — any of the 25+ supported agents can read VDM-SL and generate conforming code.

New Phase: Verify

A dedicated verification step checks A.post ⇒ B.pre across all module boundaries. This happens at the spec level, before any code runs. Integration problems are caught before implementation, not after.

Why This Matters for Multi-Agent Development

Single-agent development (one developer + one AI) can get by with natural language specs. The human catches ambiguities in real time. But the industry is clearly moving toward multi-agent development — multiple AI agents building different modules in parallel.

In multi-agent scenarios, natural language ambiguity becomes a scaling crisis:

Modules Pairwise Interfaces Human Review Feasibility
3 3 Easy
5 10 Manageable
10 45 Difficult
20 190 Impractical

Formal interface contracts are the only known mechanism that scales: each contract is verified independently, so verification effort grows linearly with module count (O(n)), not quadratically (O(n²)).

What I'm Not Saying

Let me be clear about what this proposal is not:

  • Not "Spec Kit is bad." Spec Kit is the best-designed spec-driven framework available. The philosophy is right. The UX is right. The community extensions are impressive.
  • Not "everyone should learn VDM-SL." The AI reads and writes the formal notation. The human verifies meaning in natural language. No formal methods expertise required.
  • Not "formal specs replace tests." Tests still matter for integration testing, performance testing, and E2E validation. Formal specs replace tests as the center of correctness — not as the entire verification strategy.
  • Not "this works for everything." UI/UX, performance tuning, and quick bug fixes don't benefit from formal specifications. This is for multi-module business logic systems where correctness at module boundaries is critical.

The Open Source Connection

We've built a complete framework around this idea:

formal-spec-driven-dev — Apache 2.0 licensed

It includes VDM-SL templates, AI prompt templates for all 4 development phases, multi-agent orchestration configs, and a working e-commerce example (Order, Inventory, Payment modules with formal interface contracts).

Our comparison analysis covers the structural critiques of MetaGPT, ChatDev, Devin, Spec Kit, and Claude Code in detail, with a unified framework for understanding each approach's "source of truth" and its logical limitations.

If you're using Spec Kit today and thinking about how to scale it to multi-agent or multi-module projects, I'd love to hear from you. The combination of Spec Kit's excellent developer experience and formal verification's mathematical guarantees could be exactly what the ecosystem needs.


Hikaru Ando — IID Systems
GitHub: formal-spec-driven-dev | Apache 2.0


References:

Top comments (0)