DEV Community

Bala Paranj

The Industry Needs an Open Reasoning Spec. Seven Papers Explain What Goes In It.

The API ecosystem had a coordination problem. Every API was described differently — prose documentation, custom schemas, tribal knowledge. Then a standard format emerged. One format. Every tool reads it. Code generators, documentation engines, test frameworks, mock servers, SDK builders — all consume the same specification. The ecosystem unified around one artifact.

The AI era has the same coordination problem — but for a different artifact. Not "how does this API behave?" but "what properties must this system preserve?" "What behavioral contracts must hold?" "Why was this boundary placed here?" "What invariants must every change respect?"

Seven foundational CS papers — Parnas, Naur, Brooks, Knuth, Dijkstra, Liskov, Lehman — converge on the same conclusion: the value of software isn't in the code. It's in the properties, contracts, boundaries, and rationale that make the code safe to modify. AI generates the code. Nothing standardizes the properties. The code is generated at scale. The properties are scattered across READMEs, ADRs, Slack threads, and people's heads.

The industry needs an Open Reasoning Spec — a machine-readable standard for properties, contracts, and rationale that AI agents consume, enforcement engines verify, and humans read.

What exists today (and why it's insufficient)

Several standards describe parts of the problem:

Standard        What it describes              What's missing
────────        ─────────────────              ──────────────
OpenAPI         API shapes (endpoints, types)  Behavioral contracts (idempotency,
                                               ordering, invariants)

JSON Schema     Data shapes (fields, types)    Semantic constraints (this field
                                               means admin access, not just boolean)

OSCAL           Compliance controls            Mechanical predicates (how to CHECK
                                               the control, not just describe it)

STIX            Threat intelligence            Property definitions (what must be
                                               TRUE, not just what threats exist)

OPA/Rego        Policy rules                   Rationale (WHY this policy exists,
                                               what decision it implements)

CEL             Predicate expressions          Context (what the predicate means
                                               in domain terms, who validated it)

ADRs            Decision rationale             Enforcement link (the ADR says WHY
                                               but nothing connects it to a CHECK)

Each standard covers one layer. None connects them. The API shape (OpenAPI) doesn't link to the behavioral contract (Liskov). The compliance control (OSCAL) doesn't link to the mechanical predicate (CEL). The decision rationale (ADR) doesn't link to the enforcement gate (CI check). The standards exist in silos — exactly like API descriptions before OpenAPI.

What the seven papers require

Each foundational paper identifies a specific element that must be in the specification:

| Paper | What must be specified | Current gap |
|---|---|---|
| Parnas (1972) | Module boundaries: what's inside, what's outside, what the interface promises | Boundaries exist in code (packages, modules) but aren't declared in a consumable spec |
| Naur (1985) | Domain theory: how the software maps to the real world | Lives in people's heads. No format captures it. |
| Brooks (1986) | Essential complexity: which parts are irreducibly hard | Not distinguished from accidental complexity in any specification |
| Knuth (1984) | Narrative: WHY each decision was made | ADRs exist but aren't machine-readable or linked to enforcement |
| Dijkstra (1972) | Verifiable properties: what must be TRUE, stated simply enough to prove | Properties scattered across tests, types, linter rules, and comments |
| Liskov (1994) | Behavioral contracts: promises beyond the type signature | No standard format. Lives in comments, test names, tribal knowledge. |
| Lehman (1980) | Structural maintenance rules: what's allowed to grow, what must be consolidated | No standard. Implicit in code review culture. |

No existing standard captures all seven. Most capture zero or one. The reasoning specification must capture all seven in one artifact — because they're interdependent. The boundary (Parnas) is meaningless without the contract it enforces (Liskov). The contract is unverifiable without the property definition (Dijkstra). The property is unmaintainable without the rationale (Knuth). The rationale is lost without the boundary that preserves it (Parnas).

The shape of an Open Reasoning Spec

An Open Reasoning Spec (working name — the community will name it) is a machine-readable document that describes what must be TRUE about a system, WHY it must be true, and HOW to check it. Three sections, mapping to the three layers:

Section 1: Boundaries (Layer 1 — Parnas)

boundaries:
  - id: boundary.auth.module
    description: "Authentication module: all auth logic lives here"
    interface:
      exports:
        - function: Authenticate(credentials) → (session, error)
        - function: ValidateSession(token) → (claims, error)
      imports_allowed:
        - crypto/
        - database/sessions
      imports_forbidden:
        - business/  # auth must not depend on business logic
        - api/       # auth must not depend on API layer
    enforcement:
      tool: depguard
      config_path: .depguard.yml
      ci_gate: true

The boundary declares: what the module exports (the interface), what it's allowed to import (dependencies), and what it's forbidden to import (architectural boundaries). The enforcement section links the boundary to the tool that checks it. An agent reading this spec KNOWS the boundary before generating code. A CI gate reading this spec ENFORCES the boundary on every commit.
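As a rough illustration of what "enforcing the boundary" means mechanically, the sketch below checks a list of import paths against the `imports_forbidden` prefixes from the example. It is not how depguard works internally; the dictionary layout and the `forbidden_imports` helper are assumptions made for this example.

```python
# Hypothetical in-memory form of the boundary entry shown above.
BOUNDARY = {
    "id": "boundary.auth.module",
    "imports_allowed": ["crypto/", "database/sessions"],
    "imports_forbidden": ["business/", "api/"],
}


def forbidden_imports(boundary: dict, imports: list[str]) -> list[str]:
    """Return the imports that fall under any forbidden prefix."""
    forbidden = tuple(boundary["imports_forbidden"])
    # str.startswith accepts a tuple and matches if any prefix matches.
    return [path for path in imports if path.startswith(forbidden)]


violations = forbidden_imports(
    BOUNDARY, ["crypto/sha256", "database/sessions", "business/orders"]
)
print(violations)  # ['business/orders']
```

A CI gate built this way would fail the commit whenever the returned list is non-empty.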

Section 2: Properties and contracts (Layer 2 — Dijkstra + Liskov)

properties:
  - id: prop.auth.idempotent
    description: "Authenticate is idempotent: same credentials produce the same session"
    scope: boundary.auth.module
    type: behavioral_contract
    predicate: |
      for all (c: Credentials):
        Authenticate(c) == Authenticate(c)
    verification:
      method: property_test
      tool: gopter
      test_path: auth/auth_property_test.go
    mitre_attack: T1078.004  # relevant threat technique

  - id: prop.iam.no_admin_escalation
    description: "No principal can escalate to admin through any role chain"
    scope: global
    type: safety_invariant
    predicate: |
      for all (p: Principal, r: Role):
        can_assume(p, r) AND is_admin(r) implies is_admin(p)
    verification:
      method: cel_predicate
      expression: "properties.identity.escalation.*.present != true"
      tool: stave
    validated_against:
      - vendor: datadog
        lab: iam-002-to-admin
        result: detected
      - vendor: bishopfox
        lab: privesc4-CreateAccessKey
        result: detected

Each property declares what must be true (the predicate) and how to check it (the verification method and tool). Where relevant, it also records what threat it addresses (MITRE mapping) and which independent oracle validated it (lab traceability). The predicate is machine-readable: an enforcement engine can evaluate it. The description is human-readable: a developer can see what is being asserted, and the linked rationale in Section 3 explains WHY.
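To make the `prop.iam.no_admin_escalation` predicate concrete, here is a minimal sketch that evaluates it over a small, made-up role graph. It treats `can_assume` as reachability through any chain of role assumptions. The data, names, and graph encoding are all hypothetical and are not taken from Stave.

```python
from collections import deque

# Hypothetical data: direct "may assume" edges, and which roles are admin.
ASSUME_EDGES = {
    "alice": ["deployer"],
    "deployer": ["ci-admin"],
    "bob": ["reader"],
}
ADMIN_ROLES = {"ci-admin"}


def reachable_roles(principal, edges):
    """All roles reachable from principal through any chain of assumptions."""
    seen, queue = set(), deque(edges.get(principal, []))
    while queue:
        role = queue.popleft()
        if role not in seen:
            seen.add(role)
            queue.extend(edges.get(role, []))
    return seen


def escalation_violations(principals, edges, admin_roles, admin_principals):
    """Pairs (p, r) where can_assume(p, r) and is_admin(r) but not is_admin(p)."""
    return sorted(
        (p, r)
        for p in principals
        for r in reachable_roles(p, edges) & admin_roles
        if p not in admin_principals
    )


found = escalation_violations(["alice", "bob"], ASSUME_EDGES, ADMIN_ROLES, set())
print(found)  # [('alice', 'ci-admin')]
```

The property holds exactly when the returned list is empty; here `alice` reaches an admin role in two hops, so the check reports her.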

Section 3: Rationale (Layer 3 — Knuth + Naur)

rationale:
  - id: rationale.auth.idempotency
    decision: "Authenticate must be idempotent"
    date: 2024-03-15
    author: engineering-team
    context: |
      Incident INC-2024-03-12: retry middleware called Authenticate
      twice for the same request. The non-idempotent implementation
      created two sessions. The user received two session cookies.
      Subsequent requests alternated between sessions, producing
      intermittent authorization failures.
    consequences:
      - Sessions are keyed by credential hash, not by call sequence
      - Duplicate calls return the existing session, not a new one
    properties_enforced:
      - prop.auth.idempotent
    supersedes: null
    temporal_validity: active

The rationale records WHY the decision was made (the incident), WHAT consequences it has (implementation constraints), WHICH properties enforce it (linked to Section 2), and WHETHER it's still active (temporal validity). When the decision is superseded, the temporal_validity changes and the linked properties are flagged for review.
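The two consequences listed in the rationale can be sketched in a few lines. The `SessionStore` class below is a hypothetical stand-in, not the real auth module: it skips credential verification entirely and uses a plain SHA-256 digest only to derive a lookup key. The randomised loop is a standard-library approximation of the property test that a tool such as gopter would run.

```python
import hashlib
import random
import string


class SessionStore:
    """Hypothetical store that keys sessions by credential hash."""

    def __init__(self):
        self._sessions = {}

    def authenticate(self, username: str, password: str) -> str:
        key = hashlib.sha256(f"{username}\x00{password}".encode()).hexdigest()
        # Duplicate calls return the existing session instead of creating one.
        if key not in self._sessions:
            self._sessions[key] = f"session-{len(self._sessions) + 1}"
        return self._sessions[key]

    def session_count(self) -> int:
        return len(self._sessions)


def check_idempotent(store, trials=200, seed=0):
    """Randomised check of: for all c, Authenticate(c) == Authenticate(c)."""
    rng = random.Random(seed)
    for _ in range(trials):
        user = "".join(rng.choices(string.ascii_lowercase, k=8))
        pw = "".join(rng.choices(string.ascii_letters, k=12))
        if store.authenticate(user, pw) != store.authenticate(user, pw):
            return False
    return True


store = SessionStore()
first = store.authenticate("alice", "s3cret")
second = store.authenticate("alice", "s3cret")  # simulated retry
print(first == second, store.session_count())  # True 1
```

A store keyed by call sequence would fail the same check on the first retry, which is the failure mode the incident describes.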

This IS the context graph ThoughtWorks describes — but with enforcement links. The rationale connects to the property. The property connects to the verification. The verification connects to the CI gate. The chain is: WHY → WHAT → HOW → CHECKED.

Adapted, not copied

The seven papers identified the properties. The Open Reasoning Spec doesn't implement them as the authors originally described — because the authors wrote for human developers, not for AI-assisted development. Each idea was adapted for the new context:

| Paper | Original idea | Problem with original in AI era | Adaptation in the spec |
|---|---|---|---|
| Knuth (1984) | Rationale as comments interwoven with code | AI overwrites comments on regeneration. Comments go stale silently. Comments are coupled to code that changes. | Rationale in a SEPARATE artifact (Section 3) that the AI reads but doesn't modify. Linked to enforcement so staleness is detectable. |
| Dijkstra (1972) | Simplicity for human reasoning about correctness | AI generates code too complex for humans to reason about. Simplicity can't be enforced by asking. | Properties as MECHANICAL CHECKS (Section 2) that verify correctness without requiring the human to reason about the implementation. The check replaces the reasoning. |
| Liskov (1994) | Behavioral contracts as documentation and convention | Documentation is ignored. Convention is violated by AI that doesn't know the convention. | Contracts as EXECUTABLE PREDICATES (Section 2) that CI evaluates on every change. The contract is enforced, not documented. |
| Parnas (1972) | Module boundaries as design decisions in the developer's head | AI doesn't hold design decisions. The developer who prompted the AI didn't make the boundary decision. | Boundaries as DECLARED STRUCTURE (Section 1) with enforcement links. The boundary is in the spec, not in someone's head. |
| Naur (1985) | Theory built through the act of writing code | The act of writing was replaced by prompting. Theory is never built. | Theory EXTERNALIZED into the spec: boundaries + properties + rationale = the theory in durable form. The spec IS the theory, persisted independently of the people who authored it. |
| Lehman (1980) | Structural maintenance through human discipline | AI generates faster than humans can maintain. Discipline doesn't scale. | Maintenance rules as AUTOMATED CHECKS: deletion metrics, dependency constraints, complexity thresholds in Section 2. The maintenance is mechanical, not disciplinary. |
| Brooks (1986) | Essential complexity recognized through the struggle of implementation | AI eliminates the struggle. Essential complexity is hidden under perfect-looking output. | Essential complexity NAMED EXPLICITLY as properties in Section 2. The spec forces the author to state what's hard: which properties are domain-specific, which constraints are irreducible. Naming it makes it visible. |

The common adaptation across all seven: what the original paper located in the developer's MIND (theory, understanding, discipline, convention, struggle), the spec locates in a DURABLE, ENFORCEABLE ARTIFACT. The mind is fragile — people leave, forget, get overridden by AI. The artifact persists, is version-controlled, is mechanically enforced, and survives personnel changes.

The specific innovation for the AI era: enforcement replaces documentation. Every previous attempt to capture properties, contracts, and rationale produced DOCUMENTATION — comments, ADRs, convention guides, design docs. Documentation goes stale because nothing checks whether it's still accurate. The Open Reasoning Spec connects every rationale entry to a mechanical check. When the check fails, the rationale is reviewed. When the rationale is superseded, the check is updated. The connection between WHY and WHAT is maintained mechanically, not by human memory.
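One way to keep the WHY-to-WHAT link mechanical is a small consistency check over the spec itself. The sketch below flags properties whose rationale is no longer active, rationale entries that point at unknown properties, and properties with no rationale at all. The `superseded` value and the second rationale entry are assumptions added for illustration; the examples above only show `active`.

```python
RATIONALE = [
    {"id": "rationale.auth.idempotency",
     "properties_enforced": ["prop.auth.idempotent"],
     "temporal_validity": "active"},
    {"id": "rationale.auth.legacy_tokens",          # hypothetical entry
     "properties_enforced": ["prop.auth.token_ttl"],
     "temporal_validity": "superseded"},             # assumed value
]
PROPERTY_IDS = {
    "prop.auth.idempotent",
    "prop.auth.token_ttl",                           # hypothetical property
    "prop.iam.no_admin_escalation",
}


def review_queue(rationale, property_ids):
    """Return (property_id, reason) pairs that need human review."""
    flagged, linked = [], set()
    for entry in rationale:
        for prop in entry["properties_enforced"]:
            linked.add(prop)
            if prop not in property_ids:
                flagged.append((prop, f"{entry['id']} links to an unknown property"))
            elif entry["temporal_validity"] != "active":
                flagged.append((prop, f"{entry['id']} is {entry['temporal_validity']}"))
    # Properties that no rationale entry explains are also worth a look.
    for prop in sorted(property_ids - linked):
        flagged.append((prop, "no rationale entry links to this property"))
    return flagged


queue = review_queue(RATIONALE, PROPERTY_IDS)
for prop, reason in queue:
    print(f"{prop}: {reason}")
```

Run in CI, a non-empty queue turns stale documentation into a visible failure rather than a silent drift.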

This is the difference between "writing better comments" (Knuth's original) and "declaring enforceable properties with linked rationale" (the adaptation). The idea is the same — capture WHY. The mechanism is different — enforce it instead of commenting it. The enforcement is what makes it survive in a world where AI generates and regenerates code continuously.

What this enables

For AI agents

The agent reads the spec before generating code. It knows: the auth module can't import from business/. The Authenticate function must be idempotent. The boundary is enforced by depguard. The agent generates code WITHIN these constraints — not because it was prompted to, but because the constraints are declared in a machine-readable format the agent consumes as context.

The agent doesn't need the full codebase in context. It needs the reasoning spec. The spec is smaller than the code (hundreds of lines vs. thousands). It contains the information the seven papers say matters: boundaries, properties, contracts, rationale. The code is the implementation. The spec is the theory.

For enforcement engines

The CI gate reads the spec and runs every verification: depguard checks boundary imports, gopter runs property tests, Stave evaluates CEL predicates, Z3 checks satisfiability. Each property has a declared verification method. The gate runs them all. The exit code is pass or fail. No human triage. No alert queue.
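A gate of this kind reduces to a dispatch loop over each property's `verification.method`. In the sketch below the checker functions are stubs that return fixed results; a real gate would invoke the named tools, and none of these function names come from an existing project.

```python
from typing import Callable


def run_property_test(prop: dict) -> bool:
    """Stub: a real gate would run the property test at verification.test_path."""
    return True


def run_cel_predicate(prop: dict) -> bool:
    """Stub: pretend the predicate evaluation found a violation."""
    return False


# Hypothetical registry from verification.method to a checker.
CHECKERS: dict[str, Callable[[dict], bool]] = {
    "property_test": run_property_test,
    "cel_predicate": run_cel_predicate,
}


def run_gate(properties: list[dict]) -> tuple[int, list[str]]:
    """Run every declared verification; return (exit_code, failed_property_ids)."""
    failed = []
    for prop in properties:
        checker = CHECKERS.get(prop["verification"]["method"])
        # An unknown method fails closed rather than being skipped silently.
        if checker is None or not checker(prop):
            failed.append(prop["id"])
    return (1 if failed else 0), failed


PROPERTIES = [
    {"id": "prop.auth.idempotent",
     "verification": {"method": "property_test", "tool": "gopter"}},
    {"id": "prop.iam.no_admin_escalation",
     "verification": {"method": "cel_predicate", "tool": "stave"}},
]
exit_code, failed_ids = run_gate(PROPERTIES)
print(exit_code, failed_ids)  # 1 ['prop.iam.no_admin_escalation']
```

Failing closed on an unrecognised method is a design choice: a property that cannot be checked should block the merge, not pass by default.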

When a property fails, the gate produces: WHICH property failed, WHAT the predicate says, WHY the property exists (linked rationale), and WHAT to do about it (remediation, here taken from the consequences recorded in the linked rationale). The developer sees: "prop.auth.idempotent FAILED because Authenticate created a new session for duplicate credentials. This property exists because of INC-2024-03-12 (see rationale). Fix: key sessions by credential hash."
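Producing that message is a join between the failed property and the rationale entries that list it under `properties_enforced`. The sketch below uses the rationale's `consequences` as the fix hint, which is an assumption, since the property entries shown earlier carry no dedicated remediation field. The report layout is also made up for this example.

```python
PROPERTIES = [{
    "id": "prop.auth.idempotent",
    "predicate": "for all (c: Credentials): Authenticate(c) == Authenticate(c)",
}]
RATIONALE = [{
    "id": "rationale.auth.idempotency",
    "decision": "Authenticate must be idempotent",
    "context": "Incident INC-2024-03-12: retry middleware called Authenticate twice.",
    "consequences": ["Sessions are keyed by credential hash, not by call sequence"],
    "properties_enforced": ["prop.auth.idempotent"],
}]


def failure_report(prop_id, properties, rationale):
    """Join a failed property to its predicate and every linked rationale entry."""
    prop = next(p for p in properties if p["id"] == prop_id)
    lines = [f"{prop_id} FAILED", f"  predicate: {prop['predicate']}"]
    for entry in rationale:
        if prop_id in entry["properties_enforced"]:
            lines.append(f"  why: {entry['decision']} (see {entry['id']})")
            lines.append(f"  context: {entry['context']}")
            lines.extend(f"  fix hint: {c}" for c in entry["consequences"])
    return "\n".join(lines)


report = failure_report("prop.auth.idempotent", PROPERTIES, RATIONALE)
print(report)
```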

For humans

The developer reads the spec and understands the system at the boundary level — Parnas's information hiding. They don't need to read the implementation. They read the interfaces, the properties, and the rationale. The spec IS Naur's theory, externalized into a durable artifact. The theory survives personnel changes because it's in the spec, not in someone's head.

For interoperability

Different tools consume different sections. The linter reads boundaries. The property tester reads contracts. The compliance engine reads properties with MITRE mappings. The AI agent reads all three. The Open Reasoning Spec is the lingua franca between tools — the same way API description standards unified the API tooling ecosystem.

Tool                Consumes from spec
────                ──────────────────
depguard            boundaries.imports_forbidden
gopter              properties.predicate (property tests)
Stave               properties.predicate (CEL expressions)
Z3                  properties.predicate (SMT-LIB export)
Soufflé             properties.predicate (Datalog rules)
CI gate             all verification sections
AI coding agent     boundaries + properties + rationale
Human developer     boundaries + rationale (readable sections)
Compliance auditor  properties + validated_against (evidence)

One spec. Nine consumers. Each reads the section it needs. The ecosystem builds around the spec the same way the API ecosystem built around OpenAPI.
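The "each consumer reads its own section" idea can be sketched as a simple projection over the spec. The consumer names and the mapping below are illustrative and only loosely follow the table; a real integration would translate each view into the tool's own input format.

```python
# Hypothetical mapping from consumer to the spec sections it reads.
SECTIONS_BY_CONSUMER = {
    "linter": ["boundaries"],
    "property_tester": ["properties"],
    "ai_agent": ["boundaries", "properties", "rationale"],
    "human_developer": ["boundaries", "rationale"],
}

SPEC = {
    "boundaries": [{"id": "boundary.auth.module"}],
    "properties": [{"id": "prop.auth.idempotent"}],
    "rationale": [{"id": "rationale.auth.idempotency"}],
}


def view_for(consumer: str, spec: dict) -> dict:
    """Return only the spec sections the named consumer reads."""
    return {name: spec.get(name, []) for name in SECTIONS_BY_CONSUMER[consumer]}


print(sorted(view_for("human_developer", SPEC)))  # ['boundaries', 'rationale']
```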

What must be true about the standard

For an Open Reasoning Spec to succeed, it needs five properties that successful standards demonstrate:

1. Machine-readable AND human-readable. YAML or JSON with clear field names. The developer reads it. The CI gate parses it. Same document.

2. Incrementally adoptable. A team can start with one boundary definition and one property. They don't need to spec the entire system on day one. Each section is independently useful.

3. Tool-agnostic. The spec describes WHAT to check, not WHICH TOOL checks it. The verification section names a tool, but that is only the current binding; the predicate itself is tool-independent. A property that says "this function is idempotent" can be checked by gopter, QuickCheck, Hypothesis, or any property testing framework.

4. Extensible. New property types, new verification methods, new rationale formats can be added without breaking existing specs. The same way mature standards add capabilities without breaking existing consumers.

5. Versionable. The spec lives in version control alongside the code. Changes to the spec are reviewed in PRs. The spec evolves with the system. Temporal validity on rationale entries handles superseded decisions.
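Two of these properties, incremental adoption and extensibility, translate directly into how a validator treats missing sections and unknown fields. The sketch below is one possible reading: absent sections are accepted, unrecognised keys are ignored, and only a missing or duplicated `id` is an error. The rules and the `x-` prefixed keys are assumptions, not part of any published schema.

```python
KNOWN_SECTIONS = ("boundaries", "properties", "rationale")


def validate(spec: dict) -> list[str]:
    """Return a list of error strings; an empty list means the spec is acceptable."""
    errors, ids = [], []
    for section in KNOWN_SECTIONS:
        # A missing section is fine: teams can start with a single entry.
        for i, entry in enumerate(spec.get(section, [])):
            if "id" not in entry:
                errors.append(f"{section}[{i}] is missing 'id'")
            else:
                ids.append(entry["id"])
    # Unknown top-level keys and unknown entry fields are deliberately ignored.
    for dup in sorted({x for x in ids if ids.count(x) > 1}):
        errors.append(f"duplicate id: {dup}")
    return errors


minimal = {
    "boundaries": [{"id": "boundary.auth.module"}],
    "properties": [{"id": "prop.auth.idempotent", "x-owner": "team-auth"}],
    "x-notes": "fields this validator does not know about",
}
print(validate(minimal))  # []
```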

The precedent

API description standards didn't invent API descriptions. Multiple competing formats existed. The standard that succeeded unified the fragments into one format that the ecosystem adopted. The key wasn't inventing something new — it was standardizing what already existed into one interoperable format.

The reasoning spec fragments already exist: ADRs describe rationale. Type signatures describe boundaries. Property tests describe contracts. Linter configs describe structural rules. OSCAL describes compliance controls. CEL and OPA/Rego describe policy predicates.

The standard that unifies them — connecting rationale to property to enforcement to verification in one document — doesn't exist yet. The seven foundational papers describe what it must contain. The three-layer model describes its structure. The ecosystem (AI agents, enforcement engines, compliance tools, human developers) describes its consumers.

Every standard emerges when the coordination problem becomes acute enough. The reasoning coordination problem is becoming acute now — because AI generates code at scale without consuming the properties, contracts, and rationale that make the code safe to modify. The Open Reasoning Spec will emerge when the cost of NOT having it exceeds the cost of creating it. That point is approaching fast.

From Wild West to engineering discipline

Right now AI-assisted development is in its Wild West phase. Every team picks different tools. Every tool solves a different fragment. No shared vocabulary describes what "correct AI-assisted output" means. Companies evaluate tools by token throughput and generation speed — the equivalent of evaluating a construction company by how fast they pour concrete, not by whether the building stands.

The waste is measurable. Tokens spent generating code that fails review. Tokens spent regenerating code that fails CI. Tokens spent debugging AI-generated code that nobody understands. Tokens spent on feedback flywheels that improve surface quality without mechanical verification. The industry is spending billions on code generation and near-zero on code verification. The ratio is inverted.

An Open Reasoning Spec changes this in three ways:

1. Companies know what to evaluate. Instead of "which AI coding tool generates code fastest?" the question becomes "which tool consumes the reasoning spec and produces output that satisfies the declared properties?" The spec is the evaluation criteria. A tool that generates code that violates the spec's properties is failing — regardless of how fast it generates. A tool that generates less code but satisfies every property is succeeding. The spec makes "correct" measurable.

2. Tooling and ecosystem unify around the standard. Today: fragmented solutions that don't interoperate. The linter checks boundaries but doesn't know about behavioral contracts. The property tester checks contracts but doesn't know about rationale. The AI agent generates code but doesn't know about either. With a standard: every tool reads the same spec. The linter reads boundaries from Section 1. The property tester reads contracts from Section 2. The AI agent reads all three sections before generating. The compliance auditor reads the validated_against entries. One artifact coordinates the entire toolchain. The ecosystem builds around it instead of fragmenting without it.

3. Enforcement becomes automated. The spec declares properties. The CI gate checks them. The check is mechanical — deterministic, independent of the model, runs at machine speed. Every token spent generating code that violates a declared property is a wasted token. The enforcement gate catches it before the code reaches review, before it reaches staging, before it reaches production. The waste moves from "discovered in production" to "caught at generation time." The cost of a violation drops from "incident + remediation + post-mortem" to "regenerate."

This is how every engineering discipline matured. Civil engineering had its Wild West — buildings collapsed, bridges failed, each builder used different standards. Then building codes emerged. The codes declared properties: load-bearing capacity, wind resistance, seismic tolerance. The inspection process verified the properties mechanically. Builders who met the code shipped. Builders who didn't, didn't. The code was the standard. The inspection was the enforcement. The ecosystem (architects, engineers, inspectors, material suppliers) unified around the code.

Software engineering is the last major engineering discipline without a standard for declaring and enforcing the properties that matter. The Open Reasoning Spec is that standard. The seven foundational papers describe what it must contain. The three-layer model describes its structure. Six safety-critical domains proved the resolution works. The only question is when the industry adopts it — not whether.

The Wild West ends when enforcement becomes automated. Not when AI gets better at generating code. Not when feedback flywheels improve prompt quality. Not when context graphs capture more rationale. Enforcement. Automated. Mechanical. Deterministic. Independent of the model. That is the transition from code generation to engineering.


Stave implements a domain-specific version of the Open Reasoning Spec for cloud security: the observation contract (JSON Schema Draft 2020-12) defines boundaries, the control YAML defines properties with CEL predicates, the defect/infection/failure model captures rationale, and the SIR export enables multi-engine verification (SMT-LIB for Z3, Datalog for Soufflé, facts for Prolog). Every property is lab-validated against independent expert oracles. The format is specific to cloud security. The structure (boundaries + properties + rationale, machine-readable, incrementally adoptable, tool-agnostic) is the structure the open standard needs. Stave is licensed under Apache 2.0.
