DEV Community: Diya Burman

I Converted the order-api to OKF. Here's What I Found.

Diya Burman — Tue, 21 Jul 2026 13:30:00 +0000

Preface

I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.

Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com

Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com — Watch the video

This newsletter — The Level 5 Engineer — is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in context of the title of this newsletter) on a good day. The goal is Level 5. I’m documenting the climb in real time — the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.

The previous spin-off article made a claim: the skills, ADRs, evals, and runbooks being built in this series map cleanly onto Google's Open Knowledge Format. Same problem, different vocabulary.

Claims need testing. So I ran the test.

This article is the result of converting the order-api's docs/ directory into a conformant OKF v0.1 bundle and then running a controlled comparison experiment: the same Claude Code task against the current repo structure versus the OKF bundle. Two fresh agent contexts. Same task description. Document what each agent does differently.

The result was not what I expected.

The conversion

There were 25 documents in scope across the docs/ directory, plus CLAUDE.md at the project root.

The mapping:

Current type	OKF type	Count
ADR	Decision	2
Eval	Guardrail	3
Skill (all tiers)	Methodology	6
Runbook	Playbook	2
Reference docs	Reference	9
Pedagogical examples	Reference	3
CLAUDE.md	Agent Standing Orders	1

Conversion involved two changes to each document. First, YAML frontmatter at line 1:

---
type: Guardrail
title: "Operation Scope Eval"
description: "Pre-flight check that must be answered before modifying app/main.py or any file in tests/."
tags: [eval, pre-flight, operation-scope, layer-3]
timestamp: 2026-06-28
---

Second, a ## Related section at the bottom of every ADR, eval, skill, and runbook — cross-links to the documents most likely to be relevant when reading this one.

For eval-operation-scope.md, the related section links to ADR-001 and ADR-002 (whose invariants this eval enforces), the CLAUDE.md pre-flight table (which routes to this eval), and the runbook that handles the scenario where the pre-flight check fires too late.

For ADR-001-inventory-before-payment.md, the related section links back to the eval that enforces it at runtime, the Gherkin scenarios in tests/features/order_creation.feature that test the behavioral outcome, and the CLAUDE.md decision index entry.

Then index files at each directory level, following OKF spec §6:

docs/index.md                  — entry point for the bundle
docs/ADR/index.md              — both decisions with descriptions
docs/evals/index.md            — three evals with trigger summaries
docs/runbooks/index.md         — both runbooks with trigger scenarios
docs/skills/index.md           — three-tier structure explanation
docs/skills/tier1/index.md
docs/skills/tier2/index.md     — five tier-2 skills including deprecated
docs/skills/tier3/index.md
docs/log.md                    — bundle update history

OKF conformance check after conversion:

Total docs: 24
Missing frontmatter: 0
All docs have frontmatter.
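
The script behind that check is not shown in this article. Here is a minimal sketch of what such a check could look like, assuming "conformant" means only that a file opens with a `---` block containing a non-empty `type:` key (the one field OKF requires). The function names are hypothetical, the parsing is deliberately naive rather than a real YAML parser, and the sketch checks every .md file, which may be stricter than the spec is for index files.

```python
from pathlib import Path
import tempfile


def has_frontmatter(text: str) -> bool:
    """True if text opens with a closed '---' block holding a non-empty 'type:' key."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return False
    try:
        # Index of the line that closes the frontmatter block.
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return False  # block was opened but never closed
    return any(
        line.startswith("type:") and line[len("type:"):].strip()
        for line in lines[1:end]
    )


def check_bundle(root: Path) -> list[Path]:
    """Return markdown files under root that lack the assumed frontmatter."""
    return sorted(
        p for p in root.rglob("*.md")
        if not has_frontmatter(p.read_text(encoding="utf-8"))
    )


if __name__ == "__main__":
    # Throwaway bundle with one conformant and one non-conformant file.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "good.md").write_text("---\ntype: Guardrail\n---\nBody\n", encoding="utf-8")
        (root / "bad.md").write_text("# No frontmatter\n", encoding="utf-8")
        missing = check_bundle(root)
        print(f"Total docs: {len(list(root.rglob('*.md')))}")
        print(f"Missing frontmatter: {len(missing)}")
```

Pointing `check_bundle` at a real docs/ directory would give the same two counts the conformance output reports.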

All tests still passing. The conversion touched only docs/ files — no implementation, no feature files, no step definitions, no Pact files.

The experiment

One task. Two fresh agent contexts. No knowledge of what the other run did.

The task:

"Add a new endpoint to the order service: GET /orders/{order_id}/history — returns a list of status changes the order has gone through (created, confirmed, etc.) with timestamps. Write the Gherkin scenarios first. Apply the relevant skills. Run the pre-flight evals. Then implement."

Run A started with: CLAUDE.md + task description only.
Run B started with: CLAUDE.md + task description + docs/index.md as explicit starting point, with the instruction "The docs/ directory is an OKF knowledge bundle. Start by reading docs/index.md for an overview of available knowledge, then navigate from there."

Both runs documented which files were read before the first Gherkin scenario was written.

What happened

Run A — Navigation log (10 files, in order)

docs/skills/tier1/output-formatting-standard.md — CLAUDE.md named it explicitly
docs/skills/tier2/gherkin-scenario-quality-v2.md — CLAUDE.md named it explicitly
docs/ADR/ADR-001-inventory-before-payment.md — CLAUDE.md decision index
docs/ADR/ADR-002-fire-and-forget-notification.md — CLAUDE.md decision index
docs/evals/eval-operation-scope.md — CLAUDE.md pre-flight table
tests/features/order_creation.feature — style reference
tests/features/order_status_good.feature — GET endpoint assertion style
app/main.py — data model and in-memory store
tests/features/notification_service.feature — count pattern style
docs/evals/index.md — end-of-navigation check: confirmed nothing missed

Two tool calls to find the Gherkin skill. One call for each ADR. One call for the eval. Then code.

Run B — Navigation log (20 files, in order)

docs/index.md — entry point (required by experiment)
docs/skills/ directory listing — index mentioned skills/ subdirectory
docs/ADR/ directory listing — index mentioned ADR/ subdirectory
docs/evals/ directory listing — index mentioned evals/ subdirectory
docs/skills/index.md — three-tier structure
docs/ADR/index.md — ADR listing before individual ADRs
docs/evals/index.md — which evals exist and what they trigger on
docs/skills/tier2/index.md — find Gherkin quality skill by name
docs/skills/tier2/gherkin-scenario-quality-v2.md — the skill
docs/ADR/ADR-001-inventory-before-payment.md — pre-flight ADR check
docs/evals/eval-operation-scope.md — pre-flight eval
docs/ADR/ADR-002-fire-and-forget-notification.md — complete ADR check
docs/evals/eval-environment.md — confirmed not triggered (Run A did not read this)
docs/evals/eval-contract-preflight.md — confirmed not triggered (Run A did not read this)
docs/skills/tier1/output-formatting-standard.md — formatting standard
tests/features/order_status_good.feature — style reference
tests/features/order_status_good.feature (second pass) — additional assertion style reference (Run A did not revisit this)
app/main.py — data model
tests/features/order_creation.feature — style reference
docs/skills/tier2/step-definition-style.md — step definition conventions (Run A did not read this)

Five hops to find the Gherkin skill. Eight directory traversals before any ADR was opened.

The results

Q1: Did the agent find the relevant skill faster in Run B?

No. Run A found the Gherkin skill at navigation step 2 — two tool calls from CLAUDE.md. Run B found it at navigation step 9 — five tool calls via the index hierarchy.

Counted as tool calls on the path to the skill, OKF was slower by three (five versus two).

The reason is simple: CLAUDE.md names exact file paths. docs/skills/tier2/gherkin-scenario-quality-v2.md is in CLAUDE.md's skill table. The agent opens it directly. OKF's hierarchical navigation adds structural traversal layers that are resolved top-down. When a direct pointer already exists, hierarchical navigation is strictly overhead.

Q2: Did OKF cross-linking change which documents the agent consulted?

Yes — significantly.

Run B's agent read all three evals. Run A's agent read one (the one that fires). The docs/evals/index.md caused Run B's agent to read eval-environment.md and eval-contract-preflight.md and explicitly confirm they were not triggered. Run A's agent did not know those evals existed until step 10, when it checked docs/evals/index.md as a final verification.

The "Related" section in eval-operation-scope.md provided a third confirmation path to ADR-001 and ADR-002 — beyond CLAUDE.md's decision index and the ADR/index.md listing. Three independent paths to the same documents.

Q3: Were there documents found in one run but not the other?

Run B found, Run A did not:

docs/evals/eval-environment.md — proactively read and confirmed not triggered
docs/evals/eval-contract-preflight.md — proactively read and confirmed not triggered
docs/skills/tier2/step-definition-style.md — discovered via tier2/index.md; CLAUDE.md does not reference this skill in its skill table

That last item is the most important finding in the experiment.

step-definition-style.md is the skill that encodes five implicit conventions for writing step definition files, among them the fixture-chaining pattern, mock server state asserted via call log rather than response body, the time.sleep(0.3) timing, and the _post_order-style helper naming. It was identified as the highest-risk undocumented pattern in the project during the skill audit: followed in every test file across multiple sessions, never written down, consistent only because the same source files were read each time.

CLAUDE.md's skill table does not list it. An agent relying solely on CLAUDE.md would not find it for a task that involves writing step definitions — which the history endpoint task does.

Run B's agent found it because tier2/index.md listed it alongside the four other tier-2 skills. The OKF index is not selective — it lists everything in the directory. CLAUDE.md is selective — it lists what the author remembered to add.
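
The gap between "what exists" and "what CLAUDE.md mentions" can be checked mechanically. The sketch below is one hedged way to do it: a plain substring test of each document path against the CLAUDE.md text. The function name and the sample CLAUDE.md content are hypothetical; only the two file paths come from this experiment.

```python
def unreferenced_docs(claude_md_text: str, doc_paths: list[str]) -> list[str]:
    """Return the doc paths that never appear verbatim in the CLAUDE.md text."""
    return [path for path in doc_paths if path not in claude_md_text]


# Hypothetical excerpt of a CLAUDE.md skill table (not the real file).
SAMPLE_CLAUDE_MD = """
Skills:
- docs/skills/tier2/gherkin-scenario-quality-v2.md
"""

TIER2_DOCS = [
    "docs/skills/tier2/gherkin-scenario-quality-v2.md",
    "docs/skills/tier2/step-definition-style.md",
]

if __name__ == "__main__":
    for path in unreferenced_docs(SAMPLE_CLAUDE_MD, TIER2_DOCS):
        print(f"Not referenced in CLAUDE.md: {path}")
```

A check like this only finds omissions. Whether an omitted skill should be listed is still the author's call.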

Q4: Did the index.md change the order of understanding?

Yes — and this is the most structural difference between the two runs.

Run A built understanding specific-first: went directly to individual documents named in CLAUDE.md before having any overview of what existed.

Run B built understanding overview-first: read the bundle structure before reading any individual document, arriving at each document knowing what else existed in the same category.

This mattered for the eval space specifically. Run B's agent knew "there are three evals" before reading any of them. Run A's agent knew only "there is a pre-flight table in CLAUDE.md" and read only the eval that fires.

Q5: Did OKF frontmatter change any agent decision?

Not directly. The type, tags, and description fields were not cited as decision drivers in either run. The description field in index.md entries was used to confirm documents before opening them — but this was confirmation, not routing. Routing was driven by CLAUDE.md in both runs.

The frontmatter's most useful effect was indirect: the description field in index entries gave Run B's agent enough context to decide whether to open a file without opening it. A modest efficiency gain, not a qualitative change.

Q6: How many implicit decisions were made?

Run A: 10. Run B: 11. Difference of 1 — within noise.

OKF cross-linking does not reduce implicit decisions for Gherkin scenario writing. The implicit decisions are product decisions: what to name a field, which ordering to use, whether "CREATED" is a valid initial status. These come from specification gaps, not navigation gaps. No amount of infrastructure can supply them. That remains the human's job.

What OKF gives that CLAUDE.md does not

Completeness over selection. CLAUDE.md's skill table will always lag the actual document count. It lists what the author remembered to add. OKF's index never lags — the index is derived from the documents. Every file in tier2/ appears in tier2/index.md. Every eval in evals/ appears in evals/index.md. An agent reading OKF gets the complete picture of what exists; an agent reading CLAUDE.md gets the author's current mental model of what the agent needs.
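
"Derived from the documents" can be made literal. The sketch below builds index entries from each document's frontmatter, assuming simple one-line `title:` and `description:` keys like the ones in the frontmatter example earlier. The helper names are hypothetical, the entry format is my own choice rather than anything the OKF spec prescribes, and nothing here says how the repo's index files were actually produced.

```python
def read_field(text: str, key: str) -> str:
    """Return the value of a one-line 'key: value' frontmatter entry, or ''."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter, key not found
        if line.startswith(key + ":"):
            return line[len(key) + 1:].strip().strip('"')
    return ""


def index_lines(docs: dict[str, str]) -> list[str]:
    """Build one index entry per document, skipping the index file itself."""
    entries = []
    for name in sorted(docs):
        if name == "index.md":
            continue
        title = read_field(docs[name], "title") or name
        description = read_field(docs[name], "description")
        entries.append(f"- [{title}]({name}): {description}")
    return entries


if __name__ == "__main__":
    docs = {
        "index.md": "---\ntype: Reference\n---\n",
        "eval-operation-scope.md": (
            '---\ntype: Guardrail\ntitle: "Operation Scope Eval"\n'
            'description: "Pre-flight check."\n---\n'
        ),
    }
    print("\n".join(index_lines(docs)))
```

Because the entries come from the files themselves, a new document shows up in the index the next time it is regenerated, with no one needing to remember to add it.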

Overview-first navigation. An agent starting from docs/index.md knows the shape of the knowledge bundle before it navigates into it. This matters most for a new agent or a new session context — the agent arrives at any specific document knowing what else exists in the same category. CLAUDE.md's navigation is task-driven; it points the agent at specific files for specific purposes. OKF's navigation is discovery-driven; it lets the agent understand the scope before committing to a path.

Confirmed non-applicable documents. Run B's agent confirmed that two evals were not triggered and documented that confirmation. Run A's agent may have silently assumed those evals did not exist. The difference between "I checked and it does not apply" and "I did not check" matters in a project where a missed eval is the failure mode Layer 3 was built to prevent.

What CLAUDE.md gives that OKF cannot replace

Direct routing is faster. When CLAUDE.md names an exact file path, the agent opens it in one tool call. When OKF provides a hierarchy, the agent traverses N levels. For targeted navigation on a known task, CLAUDE.md's explicit pointers are strictly faster than OKF's hierarchical discovery. The three-step speed advantage in Run A is real.

Behavioral instructions. CLAUDE.md's "you may not" list, invariant statements, environment discrimination sections, and pre-flight routing table are instructions, not metadata. "Before modifying app/main.py, run docs/evals/eval-operation-scope.md" is a behavioral instruction. OKF can express "eval-operation-scope.md exists and its description says it intercepts app/main.py modifications" — but an agent must infer from that description that it should run the eval. Inference is the failure mode. The routing instruction is what prevents inference.
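
To make the difference between instruction and metadata concrete, here is a hedged sketch of a pre-flight routing table as data: file patterns mapped to the eval that must run first. Only the one route described in this article is included, the pattern syntax is my assumption, and `required_evals` is a hypothetical name. The real CLAUDE.md table is prose, not code.

```python
from fnmatch import fnmatch

# Eval document -> file patterns that trigger it. Only the route named in the
# article is listed; the patterns for the other two evals are not given there.
PREFLIGHT_ROUTES = {
    "docs/evals/eval-operation-scope.md": ["app/main.py", "tests/*"],
}


def required_evals(paths_to_modify: list[str]) -> list[str]:
    """Return the evals that must run before any of the given paths is modified."""
    return sorted(
        eval_doc
        for eval_doc, patterns in PREFLIGHT_ROUTES.items()
        if any(fnmatch(path, pattern) for path in paths_to_modify for pattern in patterns)
    )


if __name__ == "__main__":
    # "order_history.feature" is a made-up file name for illustration.
    print(required_evals(["app/main.py", "tests/features/order_history.feature"]))
    print(required_evals(["README.md"]))
```

The lookup leaves nothing to infer: a path either matches a route or it does not. A description field, by contrast, has to be interpreted.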

Behavioral constraints. CLAUDE.md's HALT conditions and prohibition list ("you may never push directly to main," "do not add continue-on-error to pipeline jobs") cannot be represented as OKF frontmatter. They are instructions, not structured knowledge. OKF formalizes what exists; CLAUDE.md governs what must be done and what must not be done. These are different layers of the same system.

The honest synthesis

OKF and CLAUDE.md solve different problems.

CLAUDE.md is an instruction document that happens to contain a knowledge map. It tells the agent what to do and where to find specific things for specific tasks. Its strength is precision and speed for known tasks.

OKF is a knowledge map that happens to be readable by agents. It tells the agent what exists — including things no task has yet required it to find. Its strength is completeness and discovery for unknown scope.

When both exist: CLAUDE.md's direct pointers are faster for targeted navigation. OKF's index hierarchy is more complete. The combination catches what each one misses alone. CLAUDE.md's skill table misses step-definition-style.md; OKF's tier2/index.md surfaces it. OKF cannot route the agent to run an eval before a specific action; CLAUDE.md's pre-flight table does exactly that.

The finding that runs counter to the intuitive case for OKF: for a project with a well-maintained CLAUDE.md, OKF does not replace or accelerate what CLAUDE.md already does. It fills the gaps that CLAUDE.md leaves uncovered. Those gaps are real — step-definition-style.md is a high-priority skill that CLAUDE.md's skill table does not reference, meaning any agent relying solely on CLAUDE.md for navigation would miss it entirely — but they are not the primary navigation problem. They are the completeness problem.

The right architecture is both. CLAUDE.md for routing and behavioral instructions. OKF for structural completeness and cross-document relationships. Neither replaces the other. They solve different problems at different layers of the same system.

The order-api repository is now an OKF v0.1 conformant bundle. The full conversion — frontmatter, index files, cross-links, and log.md — is in the repo.

This article was written with the assistance of AI tools.

The Skill Review: A New Artifact for a New Workflow

Diya Burman — Mon, 20 Jul 2026 13:30:00 +0000

Preface

Issue #11 stress-tested the Gherkin quality skill, found four failure modes, and then fixed them. The resulting v2.0 skill passed every adversarial input.

This issue asks the question that should have come first: how do you review a skill before the stress tests tell you what's wrong?

The answer matters because stress tests and skill reviews catch different things. A stress test answers "does the skill work when called?" A review answers "is the skill ready to be called in all the contexts its description implies?" They are complementary. Running the stress tests first and the review second — as happened in Issues #11 and #12 — is the wrong order.

Why skill review is a different discipline from code review

Code review is a solved problem. You review the diff. You check the logic. You ask "does this code do what it should?" and you either approve or request changes.

Skill review is not a solved problem. The review target is different. You are not asking whether the code is correct. You are asking:

Does the routing signal route correctly in all contexts where this skill should fire, and no contexts where it shouldn't?
Could two agents produce different outputs that both satisfy the output contract?
Does the methodology describe reasoning that generalises, or a procedure that only applies to the examples shown?
Does the skill fail explicitly when it can't produce correct output — or does it produce plausible-looking wrong output?

None of these questions can be answered by reading the code. They can only be answered by working through a structured checklist.

The five-dimension checklist

The review framework built in this session covers five dimensions. Every numbered question must be answered before a skill version can be approved. A reviewer who reads a skill and asks "does this look reasonable?" is not doing a skill review.

Dimension 1 — Routing signal
Is the description on a single line and under 120 characters? Does it name the artifact type, the domain scope, and the methodology — specifically enough to route correctly and generally enough not to misroute?

The test: write three prompts that SHOULD route to this skill and three that SHOULD NOT. Verify each. Document any misroutes.
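
The length and single-line parts of Dimension 1 are mechanical, so they can be sketched as a check. The version below assumes the 120-character figure is an inclusive limit (a description of exactly 120 characters passes); the function name is hypothetical. The routing prompts themselves still need a human or an agent in the loop.

```python
MAX_ROUTING_CHARS = 120  # limit stated in the checklist, assumed inclusive


def routing_signal_issues(description: str) -> list[str]:
    """Return the mechanical Dimension 1 problems found in a skill description."""
    text = description.strip()
    issues = []
    if "\n" in text:
        issues.append("description spans more than one line")
    overflow = len(text) - MAX_ROUTING_CHARS
    if overflow > 0:
        issues.append(
            f"description is {overflow} characters over the {MAX_ROUTING_CHARS}-character limit"
        )
    return issues


if __name__ == "__main__":
    # Placeholder text with the same length as a 137-character description.
    print(routing_signal_issues("x" * 137))
```

An empty list means the description clears the mechanical bar. It says nothing about whether the description routes correctly.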

Dimension 2 — Output contract
Is the contract explicit and enumerable — every requirement a yes/no check, not a judgment call? Could two agents produce different outputs that both satisfy it? Does the contract specify what the skill must NOT produce, not just what it must?

The test: identify the downstream consumer. Document what it does with the skill's output. Ask whether the contract is sufficient for that consumption pattern.

Dimension 3 — Methodology
Does the methodology describe reasoning or procedure? Pick three edge case inputs not covered by the methodology examples. Apply the methodology manually. Document whether it produces correct output for each.

The test: identify domain knowledge that an agent cannot infer from first principles. It must be stated explicitly, not implied.

Dimension 4 — Idempotency and stability
Apply the skill to the same input with three different framings. Do all three produce structurally identical output? Apply the skill to an already-correct input. Does it return unchanged or rewrite unnecessarily? Apply the skill to its own output. Does it return unchanged?
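
Two of the Dimension 4 questions have a simple shape when the skill is treated as a function from scenario text to scenario text. The sketch below uses toy stand-in functions, since a real skill is an agent call and may not be deterministic; every name in it is hypothetical.

```python
from typing import Callable

Skill = Callable[[str], str]


def is_idempotent(skill: Skill, scenario: str) -> bool:
    """True if applying the skill to its own output changes nothing."""
    once = skill(scenario)
    return skill(once) == once


def preserves_correct_input(skill: Skill, correct_scenario: str) -> bool:
    """True if an already-correct scenario comes back unchanged."""
    return skill(correct_scenario) == correct_scenario


def normalize_whitespace(scenario: str) -> str:
    """Toy stand-in that passes: strips trailing whitespace from each line."""
    return "\n".join(line.rstrip() for line in scenario.splitlines())


def annotate_every_time(scenario: str) -> str:
    """Toy stand-in that fails: prepends a comment on every call."""
    return "# reviewed\n" + scenario


if __name__ == "__main__":
    sample = "Given an order exists  \nThen the status is CREATED"
    print(is_idempotent(normalize_whitespace, sample))
    print(is_idempotent(annotate_every_time, sample))
```

With a non-deterministic skill, a strict equality check like this would need replacing with a structural comparison, which is what "structurally identical" in the checklist points at.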

Dimension 5 — Failure modes
Test with one out-of-scope input, one contradictory input, one empty input. For each, classify the output: FAIL SIGNAL (explicit failure, no output), PLAUSIBLE WRONG (looks correct, contains error), or CORRECT REFUSAL (actionable error message). Are all PLAUSIBLE WRONG outcomes eliminated?
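
The three-way classification can be recorded as data so the pass condition is explicit. In this sketch the enum names follow the checklist, while the function name and the sample results are hypothetical and are not taken from any review.

```python
from enum import Enum


class Outcome(Enum):
    FAIL_SIGNAL = "explicit failure, no output"
    PLAUSIBLE_WRONG = "looks correct, contains error"
    CORRECT_REFUSAL = "actionable error message"


def plausible_wrong_inputs(results: dict[str, Outcome]) -> list[str]:
    """Return the test inputs whose outcome was classified PLAUSIBLE WRONG."""
    return [name for name, outcome in results.items() if outcome is Outcome.PLAUSIBLE_WRONG]


if __name__ == "__main__":
    # Made-up classification of the three required probe inputs.
    sample_results = {
        "out-of-scope": Outcome.PLAUSIBLE_WRONG,
        "contradictory": Outcome.CORRECT_REFUSAL,
        "empty": Outcome.FAIL_SIGNAL,
    }
    remaining = plausible_wrong_inputs(sample_results)
    print("Dimension 5 passes" if not remaining else f"Still PLAUSIBLE WRONG: {remaining}")
```

Dimension 5 passes only when that list is empty.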

Applying the framework to v1.1

The v1.1 review is the review that should have happened before the Issue #11 stress tests were needed. Working through all five dimensions surfaced problems the stress tests could never have found, and confirmed exactly which failures the stress tests did find.

Routing signal — 137 characters against a 120-character limit. The description is 17 characters over the threshold above which many agent routing frameworks truncate or deprioritise the signal. The excess carries "and output contract" — meaningful to the skill author, invisible to an agent routing on a 120-character budget. The stress tests in Issue #11 could not find this — they test behaviour when the skill is invoked. They cannot test whether the skill is invoked.

Output contract permits agent divergence. Four under-specified requirements allow two agents to produce different outputs that both satisfy the contract: scenario title "explicit" criterion, required fields per scenario type, required HTTP status codes per outcome, required external services per scenario type. Stress tests verify one agent's output. They cannot reveal the latitude available to a second agent working from the same contract.

Methodology gap for missing Given clause. The Q3 check asks whether terms are defined, but not whether the precondition state is established. A scenario with no Given clause passes Q3 if none of the steps use undefined nouns. The stress tests used well-formed inputs; this gap was not in the input set.

Two PLAUSIBLE WRONG failure modes confirmed:

UI scenario → translated silently to API scenario
Contradictory constraints → documented as assumptions, scenario produced anyway

Both are exactly the failures the Issue #11 stress tests found. A pre-v2.0 review using this checklist would have required explicit termination for both cases — and v2.0's Guards 2 and 3 might have been built before the stress tests were necessary.

v1.1 Review verdict: CHANGES REQUESTED.

Applying the framework to v2.0

The v2.0 review confirms that the four stress-test failures are fixed. It also finds three issues the stress tests missed.

Routing signal is now 179 characters. v2.0 made the signal 42 characters longer than v1.1's already-failing signal. The addition of ", four pre-flight guards, and a minimal-change" describes internal implementation mechanisms that are irrelevant to a caller routing to this skill. The routing signal now describes how the skill works internally rather than what it produces.

Guard 4 return value format is ambiguous between two instructions. This is the most important finding in the session. Documented in full below.

Guard 4 gap for missing Q5 assertions. A scenario that passes all five Guard 4 format conditions — concrete IDs, named services, HTTP status, no UNDERSPECIFIED patterns, field+value Then clauses — but is missing Q5 side-effect assertions (payment gateway call count, inventory reservation assertion) triggers Guard 4 and returns "no changes required." The skill signals completion for an incomplete scenario. The stress tests in Issue #11 did not test this input type because the stress tests focused on the four failure modes v2.0 was designed to fix.

Two new edge cases introduced:

Guard 2 rejects mixed UI/API scenarios (one UI step, three API steps) entirely, when partial assistance on the three API steps is possible. Over-broad refusal.

The v2.1 work list: shorten the description, clarify Guard 4 return format, add Q5 side-effect check to Guard 4, document mixed UI/API handling, add reasoning to Guard 2's pattern list.

v2.0 Review verdict: APPROVED WITH COMMENTS.

The four major failure modes are addressed. The new issues are not blocking. v2.0 is ready to be the canonical version — with a v2.1 planned.

The real review comment

PR: gherkin-scenario-quality-v2.md — Agent-safe Gherkin quality skill
Section: Pre-flight guards → Guard 4 (Idempotency check)

Guard 4 has two return instructions that conflict, and the conflict matters at agent scale.

The "Return:" block shows only the annotation comment:

# SKILL: No changes required — scenario satisfies output contract.
# Five-question diagnostic result: [observations, or "none"]

Then the next line says: "If Guard 4 triggers, return the input scenario unchanged."

Together these read as: return the comment block, AND return the input scenario. But a skill returns a single value. The two instructions imply three possible interpretations: (a) the annotation only — the scenario is not in the output; (b) the annotation prepended to the scenario, matching the minimal-fix pattern used elsewhere; or (c) the scenario with the annotation appended.

The rest of the skill uses format (b). Guard 4's "Return:" block uses format (a). The inconsistency is invisible when a human reads the output and manually pastes the scenario into a feature file — the human ignores the comment and pastes the scenario. But in an automated pipeline it is not invisible.

Concrete impact: A downstream agent that receives Guard 4 output and writes all skill output to a feature file would write the # SKILL: No changes required annotation as a Gherkin comment into the file. At pipeline scale across 50 feature files, that is 50 permanent skill-internal annotations committed to production specs. If the agent uses interpretation (a) and treats the annotation block as the complete output, the original scenario is silently discarded — replaced by two comment lines.

Suggested fix: Align Guard 4's return spec with the minimal-fix annotation pattern:

Return the input scenario with the following comment prepended:
  # SKILL: No changes required — scenario satisfies output contract.
  # Five-question diagnostic result: [observations; "none" if Q1–Q5 find nothing]
[followed by the complete input scenario, unchanged]

Alternatively, if the annotation is caller metadata and NOT part of the Gherkin output, state this explicitly: "The guard annotation is caller metadata. Do not include it in the feature file. Return it as a separate response block before the unchanged scenario."

Either formulation eliminates the ambiguity. The current text requires the downstream agent to guess.
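
From the consumer's side, the ambiguity looks like this. The sketch below is a hypothetical downstream step that assumes interpretation (b), annotation prepended to the scenario. It separates the two and refuses to continue if no scenario follows, which is what interpretation (a) would deliver. The function name is illustrative and the prefixes are taken from the annotation quoted above; interpretation (c), with the annotation appended, is not handled.

```python
ANNOTATION_PREFIXES = ("# SKILL:", "# Five-question diagnostic result:")


def split_guard_output(output: str) -> tuple[list[str], str]:
    """Split skill output into leading annotation lines and the scenario text."""
    lines = output.splitlines()
    annotation = []
    # Peel off only the leading lines that match the known annotation prefixes.
    while lines and lines[0].startswith(ANNOTATION_PREFIXES):
        annotation.append(lines.pop(0))
    scenario = "\n".join(lines).strip()
    if not scenario:
        # Interpretation (a): annotation only. Writing this would discard the scenario.
        raise ValueError("guard output contains no scenario; refusing to write the feature file")
    return annotation, scenario


if __name__ == "__main__":
    output = (
        "# SKILL: No changes required\n"
        "# Five-question diagnostic result: none\n"
        "Scenario: Order is created\n"
        "  Given inventory is available"
    )
    annotation, scenario = split_guard_output(output)
    print(annotation)
    print(scenario)
```

A consumer written this way keeps the annotation out of the feature file and fails loudly instead of overwriting a scenario with two comment lines.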

Why this is the most important finding from either review: v2.0 was built explicitly to be safe at agent scale. The four guards exist because automated pipelines create failure modes that human callers handle silently. Guard 4's return value specification has the same class of failure it was designed to prevent — a human reading the output knows which part is the scenario and which part is metadata; an automated pipeline does not. The stress tests verified that Guard 4 triggers correctly. What they could not verify — because they test the skill in isolation — is whether Guard 4's output is correctly specified for all downstream consumers.

The boundary between stress testing and skill review

The stress tests in Issue #11 found three behavioral failures and confirmed a fourth. This session's review produced three findings the stress tests could not reach:

The routing signal length: only testable by asking whether the skill is selected, not how it behaves when selected
The Guard 4 return value ambiguity: only testable by asking what a downstream agent does with the output, not what the output contains
The Guard 4 gap for missing Q5 assertions: only testable with an input type that passes the four guards while carrying a structural omission

Stress tests answer "does the skill work?" Review answers "is the skill ready for every context?" The findings that stress tests cannot reach are the ones where the skill is correctly invoked, correctly produces output, and a downstream system still fails — because the output format was not specified for that consumption pattern, or because the routing signal was too long to fire reliably, or because a guard fired on valid input.

Running the stress tests first, as happened in Issue #11, found the acute failures. Running the review second, as happened here, found the ones that would have surfaced later — quietly, in production, without a clear signal that the skill was the cause.

The correct order is review first, stress tests second. The review tells you where to aim the stress tests.

Next issue: The Skill Audit — walking through the full prompt library accumulated across twelve issues, applying the tier framework, and building the audit template readers can use on their own libraries.

Sources & Further Reading

Nate B. Jones — Agent-First Skills Architecture · natebjones.com
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Project repository
Skill review checklist
Session findings — Issue #12

This article was written with the assistance of AI tools.

What Google Just Formalized (And What We've Been Building All Along)

Diya Burman — Thu, 16 Jul 2026 13:30:00 +0000

Preface

On June 12, 2026, Google Cloud published the Open Knowledge Format — a specification for representing organizational knowledge as a directory of markdown files with YAML frontmatter, designed to be authored by people, generated by agents, and consumed by both without bespoke SDKs.

I found out about it from a comment on Issue #11 — left by Larkwin, a friend whose work I genuinely respect. If you're building something ambitious and hitting the stage where operational scaling becomes the bottleneck, their firm lark.win does fractional leadership and engineering velocity work with senior operators who embed with your team and own outcomes. Worth a conversation if that's where you are.

But back to the comment — it pointed me at the OKF spec and that was the thread that unravelled this piece. So: thank you.

My first reaction was recognition, not surprise. The problem OKF is solving — knowledge scattered across wikis, heads, tickets, and shared drives that AI agents cannot assemble reliably — is the exact problem this newsletter has been building infrastructure to address since Issue #2. Independently. In a single-repo, single-engineer context. With a different vocabulary but the same structure.

This piece is the mapping. Every artifact built so far in The Level 5 Engineer sits somewhere in OKF's concept taxonomy. And the gap that OKF has not yet closed points directly at what the next phase of this series needs to do.

What OKF is, briefly

OKF formalizes the LLM-wiki pattern into a portable, interoperable format. It is vendor-neutral and agent-friendly, representing knowledge as a directory of markdown files with YAML frontmatter and requiring no new runtime or SDK.

A bundle of OKF documents is just markdown, just files, and just YAML frontmatter. One required field — type. Optional metadata: title, description, resource, tags, timestamp. A markdown body for everything else. Concepts link to each other with standard markdown links, turning the directory into a traversable graph.
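
The "traversable graph" falls out of the links themselves. As a rough sketch, the snippet below pulls relative .md link targets out of each document with a regular expression and returns an adjacency map. The function names and sample content are hypothetical, and the regex covers only plain inline links, not reference-style links or links inside code.

```python
import re

# Inline markdown link: [text](target), capturing the target.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def outgoing_links(markdown_body: str) -> list[str]:
    """Return link targets that look like relative paths to other .md files."""
    targets = LINK_PATTERN.findall(markdown_body)
    return [t for t in targets if t.endswith(".md") and "://" not in t]


def build_graph(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each document name to the .md files it links to."""
    return {name: outgoing_links(text) for name, text in docs.items()}


if __name__ == "__main__":
    docs = {
        "evals/eval-operation-scope.md": (
            "## Related\n"
            "- [ADR-001](../ADR/ADR-001-inventory-before-payment.md)\n"
            "- [ADR-002](../ADR/ADR-002-fire-and-forget-notification.md)\n"
        ),
    }
    for source, targets in build_graph(docs).items():
        print(source, "->", targets)
```

An agent does not need any of this to follow links, but a map like it makes orphaned documents and missing back-links easy to spot.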

The full v0.1 specification fits on a single page.

If you have used Obsidian or written a CLAUDE.md file, the shape is immediately familiar. What OKF adds is the agreed-upon conventions that make a bundle written by one team consumable by a different agent without translation.

The problem it is solving

Most teams don't suffer from a lack of data. They suffer from a lack of shared context. Definitions, caveats, ownership, and "how to use this safely" guidance end up scattered across wikis, tickets, dashboards, and people's heads.

When an AI agent needs to answer a question about your system, it has to assemble the answer from these scattered, mutually incompatible surfaces. People compensate with experience — they know which wiki is "more correct," who to ask, and which dashboard is legacy. Agents don't have that intuition. When context is missing or split across systems, an agent has to infer and guess.

This is precisely the problem the series has been building toward. The failure modes this newsletter is working to prevent — agents re-deriving decisions that were already made, agents treating production and staging resources interchangeably, agents removing a guard that was there for a reason — are all context failures. Not capability failures. The agent had the intelligence. It just didn't know enough about its environment.

OKF is solving the same problem from a different angle. Where this newsletter builds project-specific artifacts (CLAUDE.md, ADRs, evals, runbooks), OKF is building a portable, cross-org standard. Same problem. Different scope.

The mapping

The artifacts built so far in this series, mapped to OKF concept types.

Skills → type: Methodology

The Gherkin quality skill, the step definition style skill, the session start protocol — these are methodologies. They encode domain-specific reasoning that agents use to produce consistent output. OKF's Methodology type captures this: a concept that describes how to do something rather than what something is.

The OKF cross-linking that would make these most useful: a Methodology concept should link to the ADR or finding that motivated it, the artifacts it is meant to produce, and any prerequisites that should be read before using it.

ADRs → type: Decision

ADR-001 (inventory before payment) and ADR-002 (fire-and-forget notification) are Decision concepts — single units of knowledge that capture a choice, its context, and its consequences. OKF's structure maps cleanly: the YAML frontmatter holds the metadata, the markdown body holds the human-facing ADR content, and cross-links connect the decision to the Gherkin scenarios that enforce it and the evals that protect it at runtime.

Evals → type: Guardrail

This type is not in OKF's example list — but OKF explicitly does not define a fixed taxonomy of concept types. Producers choose values that are descriptive and self-explanatory. Guardrail is the right name for the pre-flight checks being built in this series: they are not tests of output, they are checks that intercept intent before execution and ask whether the situation is safe to proceed.

The cross-linking for a Guardrail concept is the most important part. Each eval links to the failure mode it addresses, the ADRs whose invariants it enforces, and the CLAUDE.md section that routes to it before relevant actions.

Runbooks → type: Playbook

OKF's own example uses type: Playbook for a runbook — the incident response for a data freshness alert. The agent-facing runbook in this series fits this type precisely. The critical difference between the human-facing and agent-facing versions maps directly to OKF's design intent: OKF is written for agents that cannot fill gaps with judgment. The structure with explicit decision trees, named thresholds, and completion criteria is the agent-readable version of what OKF Playbooks should be.

CLAUDE.md → type: Agent Standing Orders

CLAUDE.md is not a standard OKF type, but it is the most important concept in the bundle. It is the document the agent reads before any other — the standing orders that govern session behavior, permissions, and routing. In an OKF bundle, it would link outward to every other concept type: Methodologies (skills to apply), Decisions (ADRs to consult before modifying covered code), Guardrails (evals to run before risky actions), and Playbooks (runbooks for degraded scenarios).

The findings/ directory → log.md

OKF's log.md is a chronological history of changes at any bundle level. The findings/ directory in this project plays the same role: a structured record of what was attempted, what failed, and what was learned, updated in real time during every session. The difference is scope: OKF's log.md records what changed, while this project's findings files record why it changed and what the finding means for the reader.

The three-tier skills structure → OKF subdirectory organization

OKF's bundle structure is hierarchical: subdirectories group concepts, and each level can have its own index.md for progressive disclosure. The docs/skills/tier1/, docs/skills/tier2/, and docs/skills/tier3/ structure is already OKF-conformant in shape. Adding index.md files at each level would make the tier hierarchy navigable by an agent reading the bundle from the root.

What the convergence means

Two efforts, independently arriving at the same structure.

OKF was designed for enterprise data teams managing BigQuery datasets, metric definitions, and incident runbooks across organizations. This newsletter was designed for a single engineer trying to make Claude Code sessions reliable and consistent across the project's lifetime.

The core structure is the same: concepts as markdown files, cross-linked into a traversable graph, with YAML frontmatter that tells a consuming agent what kind of thing it is reading before it reads the body.

The convergence is not coincidental. It reflects the underlying problem. Adopting OKF now is a bet that agentic workflows will move from experiments to core operations — it pays off fastest in projects with decisions made across sessions, dependencies with their own failure modes, and invariants that emerged without explicit documentation. The order-api project is exactly that profile.

What OKF formalizes is the pattern this project reached by building toward the same problem from the implementation side. The spec arrived six weeks ago. The need has been here since Issue #2.

The gap OKF has not yet closed

Three places where the order-api's artifacts go beyond what OKF v0.1 handles.

Invariant documentation. OKF has no standard concept type or section convention for "this property must never change." ADRs in this project contain invariant sections — explicit statements of what would break if the decision were reversed, and which tests currently enforce the invariant. This is not standard OKF. It is an extension that addresses one of the most dangerous agent failure modes: an agent optimizing away a load-bearing constraint because nothing in the bundle marks it as non-negotiable.

A proposed OKF extension field: invariants: [list] in the frontmatter of a Decision concept, naming the properties that must remain true in all future implementations. This could be the basis for a v0.2 proposal.

Eval routing. OKF's cross-linking mechanism is manual: a human or agent adds a markdown link from one concept to another. There is no mechanism for an OKF bundle to express "before modifying file X, read eval Y." The CLAUDE.md pre-flight section in this project handles this routing with a table that maps action types to eval documents. OKF could express this relationship as a new frontmatter field on Guardrail concepts: intercepts: [list of file paths or pattern matches]. An agent that reads a Guardrail concept with an intercepts field knows to apply the eval before modifying the listed files.
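A rough sketch of how an agent might use the proposed intercepts field. Everything here is hypothetical: intercepts is the extension suggested above, not an OKF v0.1 field, and the Guardrail paths, the glob patterns, and the function name guardrails_for are invented for illustration. Glob matching through the standard library's fnmatch is one possible reading of "pattern matches", not something the spec defines.

```python
from fnmatch import fnmatch

# Hypothetical Guardrail concepts, shown as already-parsed frontmatter.
# 'intercepts' is the proposed extension field, not part of OKF v0.1.
GUARDRAILS = [
    {"path": "docs/evals/payment-preflight.md",
     "intercepts": ["src/payment/*.py"]},
    {"path": "docs/evals/feature-file-preflight.md",
     "intercepts": ["tests/features/*.feature"]},
]


def guardrails_for(target_file: str, guardrails=GUARDRAILS) -> list[str]:
    """Return the Guardrail documents to read before modifying target_file."""
    return [
        g["path"]
        for g in guardrails
        if any(fnmatch(target_file, pattern) for pattern in g["intercepts"])
    ]


print(guardrails_for("tests/features/order_creation.feature"))
print(guardrails_for("README.md"))
```

The point of the design is that the routing table moves out of CLAUDE.md prose and into the Guardrail concepts themselves, so adding an eval also adds its routing.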

Skill versioning. OKF's timestamp field records the last meaningful change. The skill files in this project have version numbers (v1.1, v2.0), but OKF has no standard version field. The skill review process in this series makes explicit that v1.1 and v2.0 are different things with different capability guarantees. An agent routing to the Gherkin quality skill should find v2.0, not v1.1. OKF's current model requires the producer to deprecate v1.1 explicitly rather than providing a standard field that consuming agents can use to select the canonical version.

A proposed OKF extension: version: "2.0" and supersedes: ../gherkin-scenario-quality.md in the frontmatter of a versioned Methodology concept.
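As a sketch of what a consuming agent could do with those two fields: treat any concept named in another concept's supersedes as retired, and route only to what is left. The file names below are invented, the paths are assumed to be already resolved relative to the bundle root, and canonical is a hypothetical helper, not an OKF API.

```python
# Hypothetical parsed frontmatter for two versions of one skill.
# 'version' and 'supersedes' are the proposed fields, not OKF v0.1 fields.
CONCEPTS = {
    "tier2/gherkin-scenario-quality-v2.md": {
        "type": "Methodology",
        "version": "2.0",
        "supersedes": "tier2/gherkin-scenario-quality.md",
    },
    "tier2/gherkin-scenario-quality.md": {
        "type": "Methodology",
        "version": "1.1",
    },
}


def canonical(concepts: dict[str, dict]) -> list[str]:
    """Return the concept paths that no other concept supersedes."""
    superseded = {
        meta["supersedes"] for meta in concepts.values() if "supersedes" in meta
    }
    return sorted(path for path in concepts if path not in superseded)


print(canonical(CONCEPTS))
```

Using supersedes as the selection rule, rather than comparing version strings, avoids having to define an ordering for versions at all.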

What this means for the series

The series is currently building the stewardship layer — CLAUDE.md, ADRs, evals, and runbooks. OKF's arrival suggests a fourth consideration that will matter when that layer is complete: portability.

The skills, ADRs, evals, and runbooks being built are useful to the agents that work on this project. They are not currently portable — another project would have to read the full series to understand what each document does and why it exists.

An OKF-conformant version of the same bundle would be portable. The type: Methodology frontmatter on a skill file tells any consuming agent — in any project, using any framework — what kind of thing it is reading before it reads the body. The cross-links tell the agent where to look next. The index.md at the bundle root tells the agent what is available before it opens any individual file.

The companion article to this one converts the order-api's docs/ directory into a conformant OKF bundle and runs a comparison experiment: the same Claude Code task against the current structure versus the OKF bundle. That article answers whether OKF's formal structure changes what an agent does, or whether the informal structure this project built achieves the same result.

For now, the honest observation: Google published a specification for the problem this series has been trying to solve. The solution they arrived at is the same solution this series arrived at. The vocabulary is different. The structure is the same.

That is not a coincidence. It is evidence that the problem is real and the structure works.

This article was written with the assistance of AI tools.

Designing for Non-Human Callers

Diya Burman — Tue, 14 Jul 2026 16:30:15 +0000

Preface

Issues #9 and #10 built skills and organised them into tiers. This issue breaks them.

The thesis: skills built for human use degrade under agent load in specific, predictable ways. Not randomly. Not dramatically. They degrade by producing output that is indistinguishable from correct output and is subtly, silently wrong.

The only way to find this before production does is to stress-test the skill deliberately. So that's what this session did.

What "agent scale" actually means

When a human uses a skill, there is a correction layer between the skill's output and the downstream action. The human reads the output, compares it to the input, notices that the user IDs changed for no reason, and asks a question. The skill's imprecision gets caught before it causes damage.

At agent scale, that correction layer is absent. A downstream agent consuming skill output treats it as a verified artifact. It does not re-read the input and compare it to the output. It implements from what the skill produced. Changed user IDs become changed step definition values. An invented endpoint becomes implementation work that was never requested. A retained contradiction becomes a test that can never pass.

Three properties distinguish a skill that survives agent-scale usage from one that doesn't:

Idempotency: Calling the skill twice on the same input produces the same output. Not a similar output. The same one.

Output stability: The output format does not drift based on how the task is framed, only on what the input contains.

Failure specificity: When the skill cannot produce correct output, it fails in a way that tells the caller exactly what is missing — rather than producing plausible-looking wrong output.

The Gherkin quality evaluator from Issue #9 had none of these. Here's the proof.

The idempotency test

Five runs. Same input scenario. Only the framing changed.

Scenario: Order is confirmed when all conditions are met
  Given a user with a valid account
  And items are available
  When the order is placed
  Then it should succeed

The five framings:

"Evaluate this scenario using the Gherkin quality skill."
"Apply the Gherkin quality skill to improve this scenario."
"Use the Gherkin quality skill to check this scenario before I implement it."
"This scenario needs to be agent-ready. Run it through the Gherkin quality skill."
"The Gherkin quality skill should evaluate this. What does it produce?"

Across five runs with identical input:

HTTP status code varied: 201 (Runs 1, 3, 4) vs 200 (Runs 2, 5). The word "improve" and the passive framing of Run 5 primed lower-commitment defaults. "Agent-ready" in Run 4 primed explicit assumption surfacing.
Number of scenarios varied: two scenarios (Runs 1 and 4), a different pair of scenarios (Run 3), one scenario (Runs 2 and 5).
Failure path varied: stock-out (Runs 1, 4), payment decline (Run 3), absent (Runs 2, 5).
Assumption comment count varied: 0 (Run 3), 1 (Runs 2, 5), 2 (Run 1), 3 (Run 4).

The core Then clauses were stable. The structural decisions — how many scenarios, which HTTP status, which failure path — were not.
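One way to make that drift visible without reading five outputs by hand is to reduce each output to its structural decisions and compare those. This is a minimal sketch, not the harness used in the session: fingerprint and is_structurally_stable are hypothetical names, the two sample outputs are invented, and the sketch assumes outputs phrase the status step as "HTTP status is NNN" and mark assumptions with "# Assumption:" comments.

```python
import re


def fingerprint(gherkin: str) -> tuple[int, tuple[str, ...], int]:
    """Summarise one skill output as (scenario count, HTTP statuses, assumption count)."""
    lines = [line.strip() for line in gherkin.splitlines()]
    scenarios = sum(line.startswith("Scenario:") for line in lines)
    statuses = tuple(sorted(set(re.findall(r"HTTP status is (\d{3})", gherkin))))
    assumptions = sum(line.startswith("# Assumption:") for line in lines)
    return scenarios, statuses, assumptions


def is_structurally_stable(outputs: list[str]) -> bool:
    """True when every run made the same structural decisions."""
    return len({fingerprint(output) for output in outputs}) <= 1


# Invented outputs standing in for two runs with different framings.
RUN_A = """Scenario: Order is confirmed
  # Assumption: registered user implies an existing user ID
  Then the response HTTP status is 201
"""
RUN_B = """Scenario: Order is confirmed
  Then the response HTTP status is 200
"""

print(fingerprint(RUN_A), fingerprint(RUN_B))
print(is_structurally_stable([RUN_A, RUN_B]))
```

The fingerprint deliberately ignores wording. Two runs can phrase a step differently and still count as stable, as long as the scenario count, status code, and assumption count agree.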

For a human, this is manageable. Read all five outputs, merge the best elements, proceed. For a downstream agent, this is a silent contract violation. The agent consuming Run 2's output (one scenario, HTTP 200) cannot know that Run 4's output (two scenarios, HTTP 201, three assumption comments) was more complete. It implements from what it received.

The routing signal description does not specify whether the output must include failure scenarios, which HTTP status to use when the input is silent, or how aggressively to surface assumptions. These are structural decisions the skill leaves open. Different framings resolve them differently. All five framings are valid English ways of saying "use the Gherkin quality skill."

The output stability test

Six inputs, each slightly improving on the baseline. The question: does the skill's output structure remain consistent as inputs get better?

Inputs A through E — progressively more specific versions of the same scenario — produced stable output. The Then clause pattern held. The assumption comments appeared. The external services were named. The skill absorbed improvements in the input without changing its output structure.

Input F was the critical test: a scenario that was already substantially well-formed, taken directly from tests/features/order_creation.feature.

Scenario: Order is successfully created when payment succeeds and all items are in stock
  Given a registered user with id "user-123"
  And the inventory service confirms all items are in stock
  And the payment gateway will accept the charge
  When the user submits an order for SHOE-RED-42 and BELT-BRN-M
  Then the order status is "CONFIRMED"
  And the response includes an order id
  And the payment gateway received exactly one charge request
  And the inventory service received a reservation request

The skill found two minor debt items — a missing HTTP status in the Then, and "received a reservation request" without a count. Both real. Both fixable.

Then it produced a full rewrite. It changed user-123 to a new ID. It replaced "the user submits an order" with "the client submits a POST to /orders." It restated every clause that was already correct.

The output satisfied the output contract. It passed quality criteria. It looked better than the input.

But "better" is not the contract. "Only changes what violates the contract" is the contract. A downstream agent receiving this output cannot tell whether the rewrite was necessary or whether it introduced assumptions — HTTP 201 vs the existing convention, UUID format vs a simple integer — that conflict with the actual product spec. The skill has no idempotency check. It rewrites everything. Even what didn't need rewriting.
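The "only changes what violates the contract" rule can be checked mechanically: diff the input against the output and see whether every changed line was one the skill flagged as debt. A minimal sketch using the standard library's difflib follows. The helper names, the replacement user ID, and the rewritten reservation step are invented for illustration and are not the skill's actual output.

```python
import difflib


def changed_lines(original: str, rewritten: str) -> list[str]:
    """Return the original lines that the rewrite removed or altered."""
    diff = difflib.ndiff(original.splitlines(), rewritten.splitlines())
    # ndiff prefixes lines that exist only in the first input with "- ".
    return [line[2:].strip() for line in diff if line.startswith("- ")]


def unjustified_changes(original: str, rewritten: str, flagged: set[str]) -> list[str]:
    """Changed lines that were never flagged as violating the contract."""
    return [line for line in changed_lines(original, rewritten) if line not in flagged]


ORIGINAL = """Given a registered user with id "user-123"
Then the order status is "CONFIRMED"
And the inventory service received a reservation request"""

# Invented rewrite: one justified fix plus one silent ID change.
REWRITTEN = """Given a registered user with id "user-456"
Then the order status is "CONFIRMED"
And the inventory service received exactly 1 reservation request"""

FLAGGED = {"And the inventory service received a reservation request"}

print(unjustified_changes(ORIGINAL, REWRITTEN, FLAGGED))
```

Anything this returns is a change the downstream agent has no reason to trust, which is exactly the category the full rewrite of Input F fell into.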

The adversarial tests

Four inputs designed to probe specific failure modes.

Adversarial A — Empty scenario. The skill produced an explicit failure signal. No steps invented. Correct behaviour.

Adversarial B — The self-referential case. This is the one that matters most.

The baseline output from the first run was fed back into the skill as new input. The skill should have returned it unchanged. Instead it produced a new rewrite with two changes:

It changed the user IDs. user-baseline-001 became user-selfref-001. user-baseline-002 became user-selfref-002. No semantic reason. No assumption comment explaining the change.

It removed an assumption comment. The original output had explicitly documented: # Assumption: "registered user" implies an existing user ID, not an auth token. The second run dropped it silently.

The output otherwise satisfied the output contract. Correct field names. Correct format. Correct structure. A downstream agent consuming this output would have no way to distinguish it from a legitimate improvement.

A step definition that hardcoded user-baseline-001 would now fail. A decision that was explicitly documented — "registered user means user ID, not auth token" — was silently erased.

The skill produced plausible-looking wrong output. Confidently. Correctly formatted. Invisibly broken.
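The self-referential case is a fixed-point test: running the skill on its own output should change nothing. Below is a minimal sketch of that check, with two toy functions standing in for the skill. Both stand-ins are invented (the real skill is a prompt, not a Python function), and drifting_skill only imitates the ID relabelling described above.

```python
from typing import Callable


def is_fixed_point(skill: Callable[[str], str], scenario: str) -> bool:
    """True when feeding the skill its own output changes nothing."""
    first = skill(scenario)
    return skill(first) == first


def stable_skill(scenario: str) -> str:
    """Toy stand-in: normalises whitespace and nothing else."""
    return scenario.strip()


def drifting_skill(scenario: str) -> str:
    """Toy stand-in: relabels user IDs on every call, like the original skill."""
    if "user-baseline" in scenario:
        return scenario.replace("user-baseline", "user-selfref")
    return scenario.replace("user-selfref", "user-baseline")


STEP = 'Given a registered user with id "user-baseline-001"'
print(is_fixed_point(stable_skill, STEP))
print(is_fixed_point(drifting_skill, STEP))
```

For a real LLM-backed skill, exact string equality is a strict bar. That strictness is the point here: a downstream step definition that hardcodes an ID needs the same string, not a similar one.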

Adversarial C — Wrong domain. The input was a UI scenario about a user logging into a dashboard. The skill translated it into an HTTP API contract:

Scenario: User authentication succeeds when valid credentials are provided
  Given a registered user with id "user-ui-001" and password "••••••••"
  When the client submits a POST to /auth/login with username "user-ui-001"
  Then the response HTTP status is 200
  And the response body contains a "token" field in JWT format

It invented an endpoint (/auth/login). It invented a token format (JWT). It invented a response shape. None of these exist in this codebase.

A downstream agent implementing from this output would build authentication infrastructure that was never specced, never reviewed, and never requested. The output looked reasonable. The skill had no domain check.

Adversarial D — Contradicting constraints. The input contained logically incompatible constraints: "processes the charge exactly once" in the When, and "called no more than 3 times" in the Then.

The skill identified the contradiction in an assumption comment. Then it produced a rewrite that embedded both constraints in the output.

A downstream agent implementing from this output would write a test that can never pass: exactly one charge request in the When and no more than three in the Then are incompatible for the same action. The warning comment was present. The rewrite proceeded anyway.

The reinforced skill

Four failures. Four fixes.

Idempotency check. Before producing any output, the reinforced skill checks whether the input already satisfies the output contract. If it does, it returns the input unchanged:

# SKILL: No changes required — scenario satisfies output contract.

If it partially satisfies it, it returns only the minimal targeted corrections, not a full rewrite.

Domain check. If the input describes UI interactions — browser, clicks, page loads, form submissions — the reinforced skill fails explicitly:

# SKILL FAILURE: This scenario describes UI behaviour, not an HTTP API contract.
# This skill applies to API-level specifications only.

Contradiction halt. If the input contains logically incompatible constraints, the reinforced skill warns and stops — no rewrite produced:

# SKILL WARNING: Contradicting constraints detected in [step].
# Resolve before implementation.

Self-reference guard. The idempotency check handles this automatically. Skill output fed back as input triggers the check and returns unchanged. The guard is documented in the skill's output contract section so the behaviour is explicit, not emergent.
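The checks above can be read as a gate that runs before any rewrite is attempted. The sketch below shows one possible shape for that gate. It is an illustration, not the reinforced skill itself: the Verdict names, the keyword list, the order of the checks, and the two injected predicates (satisfies_contract and has_contradiction) are all assumptions.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    UNCHANGED = "return input unchanged"
    DOMAIN_FAILURE = "fail: UI behaviour, not an HTTP API contract"
    CONTRADICTION = "warn and stop, no rewrite"
    REWRITE = "apply minimal targeted corrections"


# Assumed keyword list, based on the UI interactions named above.
UI_TERMS = ("browser", "click", "page load", "form submission")


def preflight(
    scenario: str,
    satisfies_contract: Callable[[str], bool],
    has_contradiction: Callable[[str], bool],
) -> Verdict:
    """Decide what the skill is allowed to do before it produces any output."""
    if satisfies_contract(scenario):  # idempotency check comes first
        return Verdict.UNCHANGED
    if any(term in scenario.lower() for term in UI_TERMS):
        return Verdict.DOMAIN_FAILURE
    if has_contradiction(scenario):
        return Verdict.CONTRADICTION
    return Verdict.REWRITE


# Invented input, with trivial predicates standing in for the real checks.
ui_step = "When the user clicks the login button"
print(preflight(ui_step, lambda s: False, lambda s: False))
```

Three of the four verdicts end without a rewrite. That is the behavioural change: the skill gains termination conditions it did not have before.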

Running all four adversarial inputs through the reinforced skill:

Test case	Original skill	Reinforced skill
Empty scenario (Adversarial A)	Explicit fail signal ✅	Explicit fail signal ✅
Self-referential (Adversarial B)	Plausible wrong output ❌	Returns unchanged ✅
Wrong domain (Adversarial C)	Invented out-of-scope endpoint ❌	Domain failure signal ✅
Contradiction (Adversarial D)	Rewrite with embedded contradiction ❌	Warning, no rewrite ✅

The specific failure mode

The stress tests found the answer to the question this issue was designed to answer.

A human-friendly skill is designed to always produce something useful. When a human asks "evaluate this," they always want an answer — even if the answer is "I couldn't evaluate this and here's why." A skill optimised for human use therefore has no termination conditions for edge cases. It produces output in all circumstances.

When the input is already valid, the skill produces unnecessary changes. When the input is out of domain, the skill translates it rather than rejecting it. When the input contains a contradiction, the skill documents the contradiction in a comment rather than refusing to proceed.

Each of these produces output that satisfies the output contract. Correct field names. Correct format. Correct structure. A downstream agent cannot distinguish this output from a legitimate improvement. The output looks like a skill succeeded. The downstream action proceeds. The error only becomes visible when a test fails for a user ID that was silently changed, or when an engineer asks why authentication infrastructure was built when it was never in scope.

A human-friendly skill is dangerous at agent scale not simply because it produces wrong output. It is dangerous because the wrong output looks indistinguishable from correct output, and because the mechanism that produces it is exactly the mechanism that produces correct output: the skill always gives you something useful, and never tells you when useful is the wrong thing to give.

Next issue: The Skill Review — what code review looks like when the review target is the skill, not the diff. A PR template, a checklist, and a real review of the reinforced skill from this issue.

Sources & Further Reading

Nate B. Jones — Agent-First Skills Architecture · natebjones.com
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Project repository
Reinforced Gherkin skill v2
Session findings — Issue #11

This article was written with the assistance of AI tools.

The 3-Tier Skill Architecture in Practice

Diya Burman — Tue, 14 Jul 2026 16:28:21 +0000

Preface

Issue #9 ended with a single skill: the Gherkin scenario quality evaluator. One prompt converted into versioned infrastructure with an output contract and a routing signal.

Issue #10 asks the harder question. When you have multiple skills, where do they go? And why does it matter?

The answer turns out to involve a decision most engineers have never explicitly made: which of your working patterns are personal, and which are organisational standards? The answer to that question determines who owns the risk when a pattern fails to transfer.

The 3-tier model

The model is straightforward. The implications are not.

Tier 1 — Org-wide standards. Consistent across every agent, every session, every domain. Formatting templates, naming conventions, commit message structure, the test verification sequence. No judgment required — compliance required.

Tier 2 — Domain methodology. High-craft, domain-specific skills encoding senior practitioner expertise. The competitive moat. Specific enough to be genuinely useful, which means specific enough not to apply everywhere.

Tier 3 — Personal workflow. Individual patterns that encode one person's working style or editorial taste. Valuable. Transferable. Almost never written down.

The problem is not that teams don't have Tier 2 skills. They do — they just call them "the way we do it here" and carry them in engineers' heads. The 3-tier model is a container for making that implicit knowledge explicit. And for making a harder decision: which Tier 3 patterns have been silently doing Tier 2 work?

Mapping the order-api project

After nine issues, the project has accumulated substantial judgment — in CLAUDE.md, in the findings files, in the code, in the session instructions that get rewritten every time. Auditing it against the 3-tier model surfaced something uncomfortable.

What actually belongs at Tier 1:
The findings file protocol. The commit message conventions. The project constraints ("you may not modify .feature files"). The test verification sequence (Gherkin → Pact → can-i-deploy). All of these apply uniformly to every agent in every session. None of them have routing signals. None of them have output contracts. They exist as prose in CLAUDE.md — which means they are re-read and re-interpreted in every session, and there is no signal when they change.

What belongs at Tier 2:
The Gherkin quality evaluator (already a skill as of Issue #9). The spec-audit framework (exists as a 1,500-word reference document, not a skill). The step definition writing pattern (does not exist as a skill at all; it can only be inferred from reading four test files). The external service mock server architecture (why a Python-native mock_server.py rather than real WireMock; the rationale is documented nowhere, and the only record is the code that was built).

What belongs at Tier 3:
The "Why this matters" paragraph writing pattern. The article-worthiness filter. The spec-fix decision tree (when to change only the feature file vs when to change the step definition vs when to change the implementation). All three appear consistently across nine issues. None are written down.

The tier with the most gaps: Tier 2. The project has accumulated nine issues of domain methodology and converted exactly one piece of it into a proper skill. Everything else is prose, implicit code patterns, or session instructions that get re-derived each time.

What "org-wide" means for a solo project

Before building the Tier 1 skill, I had to answer a question that only appears when you're a team of one: what does "org-wide" mean when there's no org?

The answer: in a solo project, the "org" is the author plus every agent instance that works on the project. And agents are stateless between sessions. An agent in Issue #14 has no memory of the formatting decisions made in Issue #8.

Without a Tier 1 skill, every session re-invents the output format. Some issues use 🔄 In progress as a status indicator. Some don't. Some code blocks have language tags. Some don't. The findings archive becomes inconsistent over time — not because anyone made a bad decision, but because no decision was ever locked in.

The coordination problem a Tier 1 skill solves is not between engineers on a team. It is between agent instances across sessions. The "org" is temporal, not spatial.

The Tier 1 formatting standard created in this session covers exactly this: status indicator conventions, code block formatting, commit message types, and the structure variants for the findings file (the standard five-section format vs the sequence variant for multi-fix sessions like Issue #8). The output before this skill: ad-hoc section headers invented mid-session. The output after: structurally compatible findings entries regardless of which session produced them.

Why the Gherkin skill is Tier 2 — and what "competitive moat" actually means

The Gherkin quality evaluator moved from docs/skills/ to docs/skills/tier2/ this session. The relocation forced a precise answer to why it belongs there and not at Tier 1 or Tier 3.

Not Tier 1 because it encodes project-specific conventions. The field name substitutions (db_status → status, order_created_at → placed_at) are specific to this codebase's debt history. The feature file ownership rules are specific to this project's service architecture. A Tier 1 version would need to strip these specifics out — and at that point it would encode nothing that took nine issues to learn.

Not Tier 3 because it is not personal. The five-question diagnostic, the debt taxonomy, and the output contract are designed to produce compatible output regardless of which agent runs the skill. That compatibility is the whole point. If it were Tier 3, it would be optional — something one engineer uses because they like it, not something enforced on every agent that touches a feature file.

The moat: A generic Gherkin skill tells an agent "write clear Given/When/Then steps." Every Cucumber tutorial says the same thing. What the Tier 2 skill encodes that no tutorial can:

The specific failure modes of this codebase. The five patterns in the Q2 check (relative quantities, count ambiguity, undefined time anchors, mechanism claims, internal field names) were not derived from a best-practices checklist. They were derived from the actual bugs found in Issues #2 through #8. They are calibrated to this project's failure history.

The caller's perspective principle applied to this domain. Q4's "remove the implementation from the step and read only what the caller observes" is not standard Gherkin teaching. It requires understanding the difference between an HTTP API surface and its implementation — a distinction specific to contract-first API development.

The output contract for downstream step definition authors. "exactly N" not "N times". "the payment gateway" not "the external service". These requirements come from how the step definitions in this project are actually implemented, not from abstract best practices.

This expertise is not transferable to a generic context. That is what makes it a moat.
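The output-contract rules quoted above are mechanical enough that an agent could self-check them before handing a scenario downstream. This is a minimal sketch of such a check, covering only the two rules named in the text. The function name contract_violations and the exact matching logic are assumptions, not the skill's real quality criteria.

```python
import re


def contract_violations(step: str) -> list[str]:
    """Flag wording that the output contract described above rules out."""
    problems = []
    # "called 3 times" is ambiguous about at-least vs exactly; require "exactly".
    if re.search(r"\b\d+ times\b", step) and "exactly" not in step:
        problems.append('use "exactly N", not "N times"')
    # Steps should name the service the step definition actually asserts on.
    if "the external service" in step:
        problems.append('name the service, e.g. "the payment gateway"')
    return problems


print(contract_violations("And the external service was called 3 times"))
print(contract_violations("And the payment gateway received exactly one charge request"))
```

A check like this would not replace the skill's judgment-heavy questions. It only catches the wording rules that need no judgment at all.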

The Tier 3 skill and the socialization decision

The "Why this matters" paragraph appears in every findings file, every session. Reading across nine issues, it has a consistent structure:

Opens with the specific technical finding stated as a practitioner observation
Connects to a broader engineering principle in one sentence
Names the concrete failure mode that would occur without this finding
Closes with the implication for the reader's own practice

From Issue #5: "The bad spec was written from the implementation's perspective. The good spec was written from the caller's perspective — it describes what the caller can rely on."

From Issue #9: "The prompt produces output that passes today's tests; the skill produces output that a different agent can implement tomorrow without making any decisions you didn't make."

Both follow the same shape: observation → principle → failure mode → implication. This pattern was never written down. It exists as author instinct and as examples in the existing findings files.

The socialization decision: promote to Tier 2.

The pattern is not personal style — it is an output contract for the most reused artifact in this project. The findings files are the raw material for the newsletter. Structural consistency across them is a project requirement. An agent in Issue #14 that produces a "Why this matters" paragraph opening with the principle rather than the finding is technically correct per the CLAUDE.md description ("one paragraph, senior engineer audience") but structurally incompatible with the existing archive.

What needs to change before promotion: the skill currently documents the pattern through examples. A Tier 2 skill needs the four components named and ordered explicitly, a routing signal precise enough to fire on findings-writing and not on general prose, and quality criteria the agent can self-check before submitting.

The uncomfortable question

Three patterns from this project's nine issues that belong in a skill and don't exist as one:

The step definition writing pattern. Across all four step definition files, there is a consistent architectural pattern: fixtures injected from conftest.py, mock server state asserted via the call log rather than the response body, async side effects using time.sleep(0.3) before assertion. An agent adding a new step definition without knowing this will set up mock state inline, assert via response fields, and skip the sleep. The tests pass individually. They break in sequence. This pattern has been the foundation of every test session since Issue #2. It has never been written down.
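A compressed sketch of that pattern, written as plain Python so it runs without pytest. Everything in it is a stand-in: MockServer, place_order, and the endpoint paths are invented, and in the project the two servers would arrive as fixtures from conftest.py rather than being built inline.

```python
import threading
import time


class MockServer:
    """Hypothetical stand-in for the project's mock server: records every call."""

    def __init__(self):
        self.call_log = []

    def handle(self, path, payload):
        self.call_log.append((path, payload))
        return {"status": "ok"}


def place_order(payment, notifications):
    """Toy order flow: synchronous charge, fire-and-forget notification."""
    payment.handle("/charge", {"amount": 100})
    threading.Thread(
        target=notifications.handle, args=("/notify", {"event": "order_confirmed"})
    ).start()
    return {"status": "CONFIRMED"}


def calls_to(server, path):
    return [call for call in server.call_log if call[0] == path]


def check_order_side_effects():
    # In the project these would be fixtures injected from conftest.py.
    payment, notifications = MockServer(), MockServer()
    response = place_order(payment, notifications)
    time.sleep(0.3)  # let the async side effect land before asserting
    # Assert on what the mocks received, not on fields in the response body.
    assert len(calls_to(payment, "/charge")) == 1
    assert len(calls_to(notifications, "/notify")) == 1
    return response


print(check_order_side_effects())
```

The response body only says the order was confirmed. The call logs are the only evidence that the charge happened once and the notification went out, which is why the pattern asserts there.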

The spec-fix decision tree. Issues #7 and #8 both required deciding: given a spec debt item, does fixing it require touching the feature file only, the step definition only, the implementation, or some combination? The answer was re-derived each time. The pattern: UNDERSPECIFIED → feature file only. LEAKY ABSTRACTION in the feature file → feature file + step definition. LEAKY ABSTRACTION in the step definition only → step definition only. IMPLICIT FLOW → remove if unspecced, new feature file if in scope. This decision tree appeared in Issue #8's seven fixes and was implicit in Issue #7. It has never been written down.
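Written down, the decision tree is small enough to fit in one function. The sketch below encodes the four branches exactly as listed above. The function name files_to_touch, the parameter names, and the string labels for the return values are assumptions.

```python
def files_to_touch(debt_type: str, location: str = "", in_scope: bool = False) -> list[str]:
    """Map a spec debt item to the artifacts a fix should touch.

    location is where a leaky abstraction lives: "feature file" or
    "step definition". in_scope only matters for IMPLICIT FLOW.
    """
    if debt_type == "UNDERSPECIFIED":
        return ["feature file"]
    if debt_type == "LEAKY ABSTRACTION":
        if location == "feature file":
            return ["feature file", "step definition"]
        if location == "step definition":
            return ["step definition"]
    if debt_type == "IMPLICIT FLOW":
        # Unspecced flows are removed; in-scope flows get their own feature file.
        return ["new feature file"] if in_scope else ["remove flow"]
    raise ValueError(f"no rule for {debt_type!r} at {location!r}")


print(files_to_touch("UNDERSPECIFIED"))
print(files_to_touch("LEAKY ABSTRACTION", location="feature file"))
print(files_to_touch("IMPLICIT FLOW", in_scope=True))
```

Raising on an unknown combination, instead of guessing, follows the same principle as the reinforced skill: fail specifically when the input falls outside the cases the tree covers.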

The article-worthiness filter. Not every event in a session becomes a finding. The timeout ambiguity in Issue #8 became a full five-section entry. The UUID format error in the WireMock stub became a single line. The filter: an article-worthy finding must have a root cause the reader would not have anticipated, a failure mode that would have occurred in a real system, and a fix encoding a transferable principle. A technical note is a fix with no generalizable lesson. This filter runs every session as editorial instinct. It has never been written down.

Why haven't they been documented? Because writing them down felt like overhead at the moment they were useful. The step definition pattern was obvious when it was established. The decision tree was derived from first principles when it was needed. The filter runs automatically. All three share the same problem: they are invisible when they work and only visible as "what went wrong" when they don't.

What would be wrong about promoting everything to Tier 1

The Gherkin quality evaluator would be the instructive example.

A Tier 1 version: "When writing a Gherkin scenario, apply the five-question diagnostic." This instruction fires in every context, including sessions focused on Pact contracts or CI/CD pipelines where no Gherkin is being written. The routing signal becomes noise. Agents start treating it as a background constraint to satisfy minimally rather than a skill to route to deliberately.

Deeper problem: Tier 1 skills are enforced uniformly. Promoting the Gherkin skill to Tier 1 implies that every agent in every session must run the five-question diagnostic. In an Issue #6-style CI/CD session, that is overhead, not value.

But the deepest problem is what generalization does to the skill. A Tier 1 version must strip out the project-specific conventions to apply universally. It becomes "write clear Given/When/Then steps." Every Cucumber tutorial says this. The nine issues of calibrated expertise that make the Tier 2 skill valuable get averaged out of existence.

The Gherkin skill's value comes from being specific. Promoting it to Tier 1 would generalize it until it no longer encodes the expertise that makes it useful.

The closing that came from the session itself

The 3-tier model is a container for a harder decision: which of your working patterns are personal and which are organisational standards?

In this project, after ten issues, the most valuable institutional knowledge is not in the code. It is in the step definition architecture that has never been written down, the article-worthiness filter that runs as editorial instinct, and the spec-fix decision tree that was re-derived in Issue #8 and will be re-derived again in Issue #11.

These are Tier 2 skills that exist at Tier 3. Which means they exist only as long as the sessions that carried them.

The organisational liability is not that these patterns get lost. It is that when they get lost, nobody knows they were there.

Next issue: Designing for Non-Human Callers — what changes when agents call your skills hundreds of times per session, and how human-friendly skills degrade under agent load.

Sources & Further Reading

Nate B. Jones — Agent-First Skills Architecture · natebjones.com
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Project repository
Session findings — Issue #10

This article was written with the assistance of AI tools.

Prompts Are Disposable. Skills Are Infrastructure.

Diya Burman — Mon, 29 Jun 2026 13:30:00 +0000

Preface

Layer 1 is complete. Eight issues, a working order management API, Pact contracts, a CI/CD pipeline, and a spec audit framework. The specification layer is done.

Layer 2 starts here. And it begins with a question that sounds simple until you think about it: why do you keep rewriting the same prompts?

The problem with copying prompts

If you've been using AI seriously for more than a few weeks, you have a collection of prompts that work. You've refined them. You copy them between sessions. You paste them into Claude Code at the start of a task and the agent does the right thing.

That feels like a system. It isn't.

Here's what copying a prompt actually does: it copies the words. It doesn't copy the contract. The agent reads the words, interprets them in the context of this session, and makes a series of decisions that aren't in the prompt. Different sessions, different context, different decisions — even with the same words. You won't notice until two agents produce incompatible outputs from the same prompt and you have to figure out which one is right.

A skill is different. A skill specifies what to produce, not just what to consider. It has a version, an output contract, and a routing signal. It gets better over time and the improvements persist. It's the difference between a note you wrote to yourself and infrastructure your whole team — human and agent — can depend on.

Finding the right candidate

I reviewed the entire order-api project to find the best prompt-to-skill conversion candidate. Three instructions surfaced:

The test-run verification sequence (pytest tests/steps/ -v && pytest tests/pact/ -v && python scripts/can_i_deploy.py) appears in every session. Rejected — it's a procedure, not a judgment call. Any agent can run three commands.

The findings file protocol appears in CLAUDE.md and has been followed since Issue #3. Rejected — it describes a format and cadence, not a methodology.

The Gherkin scenario quality evaluation — the methodology for deciding whether a scenario is well-formed before accepting or writing it — appeared across Issues #5, #7, and #8. Every time, the agent re-derived the same judgment framework from scratch. This is the winner.

Why: it encodes judgment, not procedure. Whether a step is UNDERSPECIFIED or LEAKY ABSTRACTION is a reasoning call. Its output drives everything downstream — every implementation session depends on the scenarios being well-formed. A bad scenario written in a planning session becomes broken step definitions two sessions later.

And here's the uncomfortable detail: the timeout ambiguity that was fixed in Issue #8 (the step "And the response is returned within 12 seconds") was introduced in Issue #2. Three sessions inherited it silently before it was caught. A quality evaluation skill running in Issue #2 would have caught it before it was ever committed.

The prompt version — and what it gets wrong

Here's the current prompt as it would be pasted into a session:

Before writing or accepting a Gherkin scenario, check that it is well-formed. A well-formed scenario describes behavior from the caller's perspective, not from the implementation. Each step should be specific enough that only one implementation can satisfy it. Check for: vague quantities, counts that could be read as total or additional, time bounds without a start anchor, mechanism claims without the mechanism, and internal field names leaking into the spec. If the scenario has these problems, rewrite it before proceeding.

Four weaknesses:

No output format. The prompt implies the agent will either rewrite or proceed — but doesn't say what to return. A list of problems? An annotated version? The corrected Gherkin? Two agents will produce two different output shapes. A downstream agent consuming either has to parse whichever shape it receives.

No taxonomy. "Internal field names" means something specific — but the prompt doesn't define what internal means vs external. Two agents draw that line differently.

No handling for partial problems. "Rewrite before proceeding" — does that mean rewrite every step, or only the problematic ones?

Context-dependence. The prompt was written for reviewing an existing scenario. Used in a planning context ("write a new scenario for DELETE /orders/{id}"), an agent either ignores the check or applies it mid-draft. The two strategies produce different levels of revision.

The fundamental problem: the prompt describes what to look for but not what to produce. The agent invents the output format. When the output format is invented, it is never the same twice.

The demonstration

Same input. Same endpoint. Prompt version first, skill version second.

The task: Write a Gherkin scenario for DELETE /orders/{order_id} — cancel a confirmed order. The order can only be cancelled if payment has not been captured yet.

Task A — prompt version output:

Scenario: Cancel a confirmed order before payment capture
  Given a confirmed order exists with order ID "order-123"
  And payment has not been captured for the order
  When the user sends DELETE /orders/order-123
  Then the order is cancelled
  And the response status code is 200
  And the response includes a confirmation message

Six implicit decisions the agent made that weren't in the prompt:

"confirmed order exists" — method of setup not specified. Seed directly or call POST /orders? Left open.
"payment has not been captured" — mechanism not specified. Mock server state? Flag on the order?
"the order is cancelled" — mechanism language. What field changes? What value? Not specified.
"confirmation message" — any text in any field satisfies this. Undefined.
200 vs 204 — the agent chose 200 over 204 (no body). Undocumented judgment call.
No failure scenario — the prompt mentioned "can only be cancelled if payment has not been captured yet" but the agent wrote only the success case.

Total implicit decisions: 6. All silent.

Task B — skill version output:

Scenario: Confirmed order is cancelled when payment has not yet been captured
  Given an order was created via POST /orders with order ID "order-del-001" and status "CONFIRMED"
  And the payment gateway has not captured payment for order "order-del-001"
  When the client sends DELETE /orders/order-del-001
  Then the response HTTP status is 200
  And the response body contains a "status" field with value "CANCELLED"
  And the response body contains an "order_id" field with value "order-del-001"
  And the payment gateway receives no capture or charge requests for order "order-del-001"

Scenario: Cancellation is rejected when payment has already been captured
  Given an order was created via POST /orders with order ID "order-del-002" and status "CONFIRMED"
  And the payment gateway has already captured payment for order "order-del-002"
  When the client sends DELETE /orders/order-del-002
  Then the response HTTP status is 422
  And the response body contains an "error" field with value "PAYMENT_ALREADY_CAPTURED"
  And the order status remains "CONFIRMED"

Two implicit decisions — both surfaced explicitly:

422 vs 409 — the skill's output contract requires documenting assumptions, so this was flagged as a comment rather than silently embedded.
"status" vs "cancellation_status" — the skill's LEAKY ABSTRACTION check prevented using an implementation-facing field name.

Total implicit decisions: 2. Both visible.

The diff

What changed	Classification
"confirmed order exists" → "created via POST /orders with status CONFIRMED"	SKILL CONSTRAINT
"payment has not been captured" → "the payment gateway has not captured payment for order-del-001"	SKILL CONSTRAINT
"the order is cancelled" → HTTP status + "status" field with value "CANCELLED"	QUALITY DELTA
"confirmation message" → specific field name and value	QUALITY DELTA
(absent) → "payment gateway receives no capture requests"	SKILL CONSTRAINT
(absent) → full second scenario for failure case	QUALITY DELTA

Six meaningful differences. Three skill constraints, three quality deltas, six prompt ambiguities eliminated.

The three properties skills have that prompts don't

1. Version control

A prompt has no version. When you improve it, you copy the new text into the next session. The old version exists in your clipboard history or a chat transcript from three weeks ago. You cannot diff it. You cannot pin a session to it. You cannot see what changed between the prompt that worked and the prompt that produced the wrong output.

The Gherkin quality skill lives in docs/skills/gherkin-scenario-quality.md. When Issue #8 added the IMPLICIT FLOW debt class, the skill got a one-line update:

+| IMPLICIT FLOW | A step that implies a follow-up flow that is not specced anywhere |

Every session after that commit uses the updated skill. Every session before it used the previous version. git blame tells you exactly when IMPLICIT FLOW was added and which issue prompted it. With a prompt, a label like "v1.1" means nothing. There is only "the prompt I'm using today."

2. Output contract

The skill specifies exactly what it must return:

One or more complete Gherkin scenarios in Given/When/Then format
All Then clauses must assert a field name AND a value — not just presence
All counts must use "exactly N" or "no more than N total" — never "N times"
All time bounds must include a start anchor
Each external service in a Given clause must be named explicitly
Assumptions not in the input must appear as # Assumption: comments

The downstream dependency is the step definition author. When tests/steps/test_order_creation.py implements "And the payment gateway received exactly one charge request", the phrases "exactly one", "charge request", and "payment gateway" are all actionable. When it implements "And the response includes a confirmation message", the author must invent an assertion. That invention is where test coverage becomes unreliable.

The output contract is the interface between the agent that writes scenarios and the agent that implements from them.
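Some of the contract rules are mechanical enough to check with a script. The sketch below covers only two of them (counts phrased as "N times" and time bounds without a start anchor); the regular expressions and the `contract_violations` helper are assumptions for illustration, not part of the skill file.

```python
import re

# "2 times" style counts are ambiguous about whether the first attempt is included.
COUNT_AS_TIMES = re.compile(r"\b\d+ times\b")
# "within 12 seconds" with no following "of ..." has no start anchor.
TIME_BOUND = re.compile(r"\bwithin \d+ (?:milli)?seconds?\b(?! of\b)")


def contract_violations(scenario_text):
    """Return (line_number, rule) pairs for two of the contract rules."""
    violations = []
    for number, line in enumerate(scenario_text.splitlines(), start=1):
        step = line.strip()
        if COUNT_AS_TIMES.search(step):
            violations.append((number, "count uses 'N times'"))
        if TIME_BOUND.search(step):
            violations.append((number, "time bound has no start anchor"))
    return violations


before = """And the response is returned within 12 seconds
And the payment gateway is not retried more than 2 times"""
print(contract_violations(before))
# [(1, 'time bound has no start anchor'), (2, "count uses 'N times'")]
```

The remaining rules, such as whether a Then clause asserts a value or only presence, need judgment rather than pattern matching, which is why they live in a skill and not a linter.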

3. Routing signal description

The skill's description line:

Evaluate and produce well-formed Gherkin scenarios for the order-api project using the five-question debt diagnostic and output contract.

It names the artifact type, the project, the method, and the output. An agent knows exactly when to use this skill and what it will receive.

A bad description for the same skill:

Help with writing tests and checking scenarios for the project.

"Tests" matches pytest, Pact contracts, unit tests, and Gherkin. "The project" matches any repo. No methodology named means two agents doing "help with writing tests" produce incompatible outputs — which is exactly the problem the skill exists to solve.

The answer

If both the prompt and the skill produce output that works, the difference is this:

The prompt produces output that passes today's tests. The skill produces output that a different agent can implement tomorrow without making any decisions you didn't make.

That's why copying prompts isn't enough. The words travel. The contract doesn't.

Next issue: The 3-Tier Skill Architecture in Practice — mapping your skills to the right tier and why Tier 2 is where individual expertise becomes organizational leverage.

Sources & Further Reading

Nate B. Jones — Agent-First Skills Architecture · natebjones.com
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Project repository
Gherkin quality skill
Session findings — Issue #9

This article was written with the assistance of AI tools.

Spec Debt Doesn't Disappear When You Fix It. It Migrates.

Diya Burman — Mon, 22 Jun 2026 13:30:00 +0000

Preface

Issue #7 ended with seven spec debt items documented in a project that had been built carefully for seven issues. Every item was passing its tests. None of them announced themselves. They were found by asking a different question: not "does this pass?" but "what would a second agent build from this step?"

Issue #8 fixes all seven — and builds the tool that found them into something reusable.

The seven fixes

I worked through the items one at a time, running the test suite after every individual fix rather than batching them. The discipline matters: if a fix breaks something, you want to know which fix broke it.

Fix 1 — Timeout measurement ambiguity

# Before
And the response is returned within 12 seconds

# After
And the response is returned within 12 seconds of the order being submitted

"Of the order being submitted" anchors the clock to client-side HTTP request dispatch — the same moment time.time() is captured in the step definition. Without this anchor, a second implementation could measure from server receipt, from the last retry attempt, or from when the response body is fully read. All three produce different numbers under load.

Fix 2 — "Retried" vs "total attempts"

# Before
And the payment gateway is not retried more than 2 times

# After
And the payment gateway receives no more than 2 charge requests total

"Retried 2 times" has two valid English readings: 2 retries meaning 3 total requests, or retried up to 2 times meaning 2 total. "No more than 2 charge requests total" counts requests, not retries, and the word "total" makes clear the initial attempt is included. This also changed the assertion in the step definition — from trusting the response body's retry_count field to checking the actual call count at the mock server. Stronger assertion, same outcome.

Fix 3 — "Released" without mechanism

# Before
And the inventory reservation is released

# After
And the inventory service receives a reservation release request for SHOE-RED-42 and BELT-BRN-M

"Released" says what happened but not how, and not for which items. The rewrite names the items and specifies that a request is sent to the inventory service. This fix also revealed a gap: the current implementation signals release via a response body field (inventory_released: true) rather than a separate API call to the inventory service. The spec now describes the intended behaviour. The implementation doesn't fully match it yet. That's a future issue — but the gap is now visible rather than hidden.

Fix 4 — "Explicit user action" — removed entirely

# Before
And no order is confirmed without explicit user action

# After
(step removed)

This step implies a follow-up confirmation flow (POST /orders/{id}/confirm or equivalent) that does not exist anywhere in the codebase. It passes trivially because no order is confirmed in the partial availability scenario — not because the confirmation flow was implemented. A spec step that passes for the wrong reason is not a safety net. It is a false guarantee. If the confirmation flow is built in a future issue, a new scenario should specify it precisely. Leaving this step in place would invite an agent to invent an unspecced endpoint.

Fix 5 — Presence without value assertions

The order_status_bad.feature timestamp step was asserting only that a field exists and is a non-empty string. Tightened to assert the field name, the value, and the type explicitly. Kept conservative — order_status_bad.feature is a pedagogical artifact and shouldn't be converted into a good spec, which would defeat its purpose in the newsletter.

Fix 6 — "An order exists" without specifying how

# Before
Given an order was successfully placed and confirmed with order ID "aaa00000-..."

# After
Given an order was created via POST /orders and confirmed with order ID "aaa00000-..."

"Successfully placed and confirmed" describes the outcome but not the mechanism. "Created via POST /orders" makes explicit that a real creation flow is expected. The step definition currently seeds the order directly into the in-memory store — a shortcut. The rewrite creates a documented gap between spec intent and step implementation. Visible gap, not hidden one.

Fix 7 — "Correct" without definition

# Before
And the notification contains the correct order id and total

# After
And the notification request body contains order_id "order-abc-123" and total 134.97

"Correct" is relative to context that may not be available to the reader. The rewrite hardcodes the expected values established in the When clause. Two agents reading the original step would both implement something that checks the notification body — but one might compare against the When-clause values, another might check against a computed total, a third might only verify field presence. The rewrite removes all three interpretations.

This fix also caught something the stub had been hiding: the notification mock was returning "mock-notif-001" as a notification id. Not a UUID. The format assertion caught it immediately. This is exactly the value of adding concrete assertions — it surfaces stub data that was never valid but was never checked.
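A format assertion of this kind can be very small. The sketch below assumes a hypothetical `is_canonical_uuid` helper built on the standard library's `uuid.UUID` parser; the sample UUID is made up for illustration.

```python
import uuid


def is_canonical_uuid(value):
    """True only for a string in canonical hyphenated UUID form."""
    try:
        # uuid.UUID also accepts braces and unhyphenated hex, so compare
        # the round-tripped form against the input to stay strict.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_canonical_uuid("mock-notif-001"))                        # False
print(is_canonical_uuid("12345678-1234-5678-1234-567812345678"))  # True
```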

The audit framework

After fixing all seven items, I built the diagnostic tool into a standalone document: docs/spec-audit-framework.md. The full document is in the repo. Here's the core of it.

Five questions — ask them for every scenario in every feature file:

Q1: Who owns this scenario?
Can you name the team, service, or domain this scenario belongs to? If the answer includes "and also", the scenario is in the wrong file.

Q2: What decisions does this scenario leave open?
For every Given, When, and Then clause: could two agents build different implementations that both pass? If yes, the step is underspecified.

Q3: Are all terms defined within the file?
Every noun that is not a standard HTTP concept or a primitive type should be defined in the scenario or a Background clause. If understanding a term requires reading another file or asking a colleague, it is spec debt.

Q4: Does this scenario describe behaviour or implementation?
Steps should describe what the system does from the caller's perspective. Any step that references internal concepts — database field names, function names, internal status codes — is leaking implementation into the spec.

Q5: What does this scenario NOT say that it should?
List the edge cases, error states, and boundary conditions the scenario implies but does not specify. Each one is a silent assumption waiting to become a production incident.

Six debt classes:

Class	What it looks like
UNDERSPECIFIED	Step present but leaves a decision open
MIXED CONCERN	Scenario covers more than one service domain
UNDEFINED TERM	A noun used without being defined
AMBIGUOUS COUNT	A quantity with two valid interpretations
IMPLICIT FLOW	Implies a follow-up flow that isn't specced
LEAKY ABSTRACTION	References implementation details

What the framework found that the manual audit missed

Applying the five questions to all four fixed feature files surfaced one item the Issue #7 manual audit didn't catch.

In order_status_good.feature, the Given clause now reads "created via POST /orders" — the fixed version from this session. Q4 flagged it for a different reason than the original audit: the step definition still seeds the order directly into the in-memory store. The spec text is precise. The implementation of the spec takes a shortcut.

The manual audit looked at feature file text. The framework applies Q4 to step definitions as well — and a step definition that silently does something different from what the spec says is spec debt, even if the test passes.

This distinction matters: spec debt can migrate from the feature file into the step definition. You fix the scenario, tighten the language, run the tests, and they come back green. But the step definition now implements a shortcut that contradicts the precise step text. The debt moved; it didn't disappear.

The scorecard — after all fixes

Applied the framework to all four feature files. The three non-pedagogical files were scored for debt:

order_creation.feature — 5 scenarios, 1 debt item remaining (LEAKY ABSTRACTION at step definition level — inventory release mechanism gap from Fix 3)

order_status_good.feature — 2 scenarios, 1 debt item remaining (LEAKY ABSTRACTION — step definition seeds order directly rather than via POST /orders)

notification_service.feature — 2 scenarios, 0 debt items

order_status_bad.feature — kept as pedagogical artifact, not audited for debt

Debt density after fixes: 0.22 items per scenario. Both remaining items are LEAKY ABSTRACTION at the step definition level. Zero AMBIGUOUS COUNT or IMPLICIT FLOW items remain — the two highest-risk classes.
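The density figure follows from the scorecard above: two remaining items across nine audited scenarios. A quick check, with the counts copied from the three scored files:

```python
# (scenarios, remaining debt items) per audited feature file
scorecard = {
    "order_creation.feature": (5, 1),
    "order_status_good.feature": (2, 1),
    "notification_service.feature": (2, 0),
}

total_scenarios = sum(scenarios for scenarios, _ in scorecard.values())
total_debt = sum(debt for _, debt in scorecard.values())
density = total_debt / total_scenarios

print(total_scenarios, total_debt, round(density, 2))  # 9 2 0.22
```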

The uncomfortable answer

After fixing seven spec debt items and applying a structured audit framework to a project that has been built carefully for eight issues, two debt items remain. Both were introduced by the same sessions that fixed other debt — a precise spec step was written, and the implementation of that step took a shortcut.

Spec debt is not eliminated by fixing debt. It migrates.

The practical conclusion: treat step definitions as part of the spec surface, not just as test harness code. A step definition that silently does something different from what the spec says is spec debt, even if the test passes. The audit framework catches both — but only if you apply Q4 to the step definitions as well as the feature text.

The other finding worth naming: notification_service.feature scored zero debt items. It was written after eight issues of accumulating lessons about what the previous files got wrong. The absence of debt is not accidental — it's the result of knowing what bad specs look like before writing the next one.

The best time to write a spec is after you've written a few bad ones. Auditing retroactively and fixing forward is the realistic path. Not "write it right the first time."

Next issue: Prompts Are Disposable. Skills Are Infrastructure — the conceptual shift from session-level prompts to versioned, reusable skill definitions. Layer 2 begins.

Sources & Further Reading

Cucumber + Gherkin documentation
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Nate B. Jones — natebjones.com
Project repository
Spec audit framework
Session findings — Issue #8

This article was written with the assistance of AI tools.

Your Spec Files Are Lying to You. Mine Were Too.

Diya Burman — Mon, 15 Jun 2026 17:30:55 +0000

Preface

Every issue so far has worked with one service and one spec file. Issue #7 changes that. A second service enters the picture — a notification service that the order service calls after a confirmed payment — and with it comes the question that every growing system eventually forces: where do spec file boundaries go?

The answer turns out to matter more than it looks. And the audit at the end of this issue found seven spec debt items in files we've been running since Issue #2. All passing. All carrying risk.

The notification service — and a design decision that has spec implications

The new service is minimal: POST /notifications/order-confirmed accepts an order id, user id, and total, and returns a notification id and a QUEUED status. Simple enough. The interesting part is how the order service calls it.

The call is fire-and-forget.

When an order is confirmed, the order service starts a daemon thread, fires the notification request, and returns the CONFIRMED response immediately — without waiting for the notification to succeed. If the notification service is down, slow, or returning errors, the order is still confirmed. The customer gets their confirmation. The notification may or may not arrive.

This is a deliberate design decision. The order service owns the transaction. The notification service owns delivery. Coupling the order confirmation response to notification delivery would mean a flaky notification service could block order creation — which is a much worse failure mode than a missed notification.
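A minimal sketch of the fire-and-forget shape, assuming a hypothetical `confirm_order` function and a `send_notification` callable that stands in for the HTTP call to the notification service. It is not the project's implementation; it only shows why the CONFIRMED response cannot depend on the notification outcome.

```python
import threading


def confirm_order(order, send_notification):
    """Confirm the order and notify in the background without waiting."""
    def _notify():
        try:
            send_notification(order)
        except Exception:
            # A failed notification must never affect the order result.
            pass

    threading.Thread(target=_notify, daemon=True).start()
    # Returned immediately, before the notification has succeeded or failed.
    return {"order_id": order["order_id"], "status": "CONFIRMED"}


def failing_notifier(order):
    # Hypothetical stand-in for a notification service that is down.
    raise RuntimeError("503 Service Unavailable")


result = confirm_order({"order_id": "order-abc-123"}, failing_notifier)
print(result["status"])  # CONFIRMED
```

The trade-off is visible in the `except` branch: delivery failures are swallowed, so a real system would want logging or a retry queue there.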

But the decision has a direct spec implication: any scenario that asserts Then the order status is "CONFIRMED" must remain true regardless of what the notification service does. The spec cannot simultaneously require CONFIRMED and make CONFIRMED depend on notification success. That would be a hidden coupling — the spec would look independent but the implementation would not be.

This is the kind of architectural decision that should be in the spec before it's in the code. Once it's in the code it becomes folklore.

The wrong way first: one big spec file

Before doing it right I did it wrong deliberately. I added two notification scenarios to the bottom of order_creation.feature — the existing file that's been covering order creation since Issue #2.

All 7 tests passed. Green across the board. pytest has no opinion about spec architecture.

The problems are structural, not functional:

Mixed ownership. order_creation.feature line 1 says Feature: Order Creation. By line 48 it's testing notification delivery. If the notification team changes their contract — say, adding a channel field to the request — they have to open order_creation.feature to update it. That file is not theirs. The filename, the feature declaration, and the existing scenarios all signal "this belongs to the order team." The notification scenarios are squatters.

The growing file problem. At 5 scenarios the file is readable. At 7 it starts to smell. Extrapolate to a real system: 10 downstream services, 5–10 scenarios each, all appended to the originating feature file because each was "triggered by" an order creation event. The file becomes a catch-all that nobody owns and everybody edits. Ownership dissolves into "whoever last touched it."

The agent routing problem. When an agent is handed order_creation.feature to build against, it must now implement both order logic and notification logic. It cannot know from the file whether the notification call belongs in POST /orders or in a separate endpoint. It will make a decision — probably the wrong one — and that decision will be baked into the implementation before anyone notices.

Spec debt seed. The scenario "Order confirmation succeeds even if notification fails" uses the step "the notification service is unavailable" without defining what unavailable means. TCP connection refused? 503? A 30-second hang? Each is a different failure mode with different implications for retry logic. An agent will pick one interpretation silently. Two agents will pick different ones. Both implementations will pass the spec. This is spec debt: it forms quietly, passes its tests, and surfaces as a production incident months later.

The right way: bounded spec files

After documenting what was wrong, I moved the notification scenarios into their own file: tests/features/notification_service.feature. I rewrote both scenarios to:

Precisely define "unavailable" as 503 Service Unavailable — not a timeout, not a connection refused, not an ambiguous network failure
Describe the notification contract from the notification service's perspective
Make the file self-contained — a notification service team reading it wouldn't need to open order_creation.feature to understand it

The result:

order_creation.feature: 5 scenarios, all about order creation. No references to notifications.
notification_service.feature: 2 scenarios, all about notification delivery behaviour.

The file boundary is now a contract boundary. The two files can be versioned, owned, and handed to different agents independently.

Bounded spec files are not a tidiness preference. They are a precision tool for multi-agent systems. When a spec file is bounded to one service, an agent can be assigned exactly that file and nothing else. It builds one surface, tests one contract, returns. When the spec bleeds across services, the agent must make decisions about service ownership that were never written down. Those decisions accumulate as hidden assumptions in the implementation.

The spec debt audit

With the bounded file structure in place, I audited all four feature files in the project for spec debt — places where the spec passes its tests but leaves decisions that should have been made explicitly.

Seven items. All passing. All carrying risk.

1. Ambiguous timeout measurement

File: order_creation.feature — Scenario: payment gateway times out
Step: And the response is returned within 12 seconds

From when? The client sends the request? The server receives it? The last retry fires? Two agents will instrument this differently and both will pass. "Within 12 seconds of the order being submitted" — defining "submitted" as the moment the HTTP request body is sent — removes the ambiguity.
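One way a step definition could pin that anchor, sketched under assumptions: `timed_submit` and `fake_send` are hypothetical names, and the "PAYMENT_TIMEOUT" status is a placeholder. The clock starts immediately before the client dispatches the request, so server receipt time and retry timing cannot shift the measurement.

```python
import time


def timed_submit(send_request):
    """Measure elapsed time from client-side dispatch to response."""
    submitted_at = time.monotonic()  # clock starts at dispatch, not server receipt
    response = send_request()
    elapsed = time.monotonic() - submitted_at
    return response, elapsed


def fake_send():
    # Hypothetical stand-in for the real HTTP call.
    time.sleep(0.05)
    return {"status": "PAYMENT_TIMEOUT"}


response, elapsed = timed_submit(fake_send)
assert elapsed < 12, f"took {elapsed:.2f}s, limit is 12s from submission"
```

`time.monotonic()` is used because it cannot jump backwards if the system clock is adjusted mid-test.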

2. "Retried" vs "total attempts"

File: order_creation.feature — Scenario: payment gateway times out
Step: And the payment gateway is not retried more than 2 times

Does this mean 2 total attempts (1 original + 1 retry) or 2 retries on top of the original (3 total)? The English is genuinely ambiguous. An agent will pick one. The test will pass. The production system will behave differently than intended.

Fix: And the payment gateway receives no more than 2 charge requests total — "requests total" removes all ambiguity about whether the first attempt counts.
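With the count phrased as total requests, the assertion can be made against what the mock gateway actually received instead of a self-reported counter in the response body. A minimal sketch, assuming a hypothetical `TimeoutGateway` mock and `submit_order` client; the "PAYMENT_TIMEOUT" status is a placeholder.

```python
class TimeoutGateway:
    """Hypothetical mock gateway that times out and records each request."""

    def __init__(self):
        self.charge_requests = []

    def charge(self, order_id):
        self.charge_requests.append(order_id)
        raise TimeoutError("simulated gateway timeout")


def submit_order(gateway, order_id, max_total_requests=2):
    """Hypothetical client: the first attempt counts toward the total."""
    for _ in range(max_total_requests):
        try:
            gateway.charge(order_id)
            return {"status": "CONFIRMED"}
        except TimeoutError:
            continue
    return {"status": "PAYMENT_TIMEOUT"}


gateway = TimeoutGateway()
outcome = submit_order(gateway, "order-001")
# Count requests at the mock, not a retry counter reported by the client.
assert len(gateway.charge_requests) <= 2
print(len(gateway.charge_requests))  # 2
```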

3. "Released" is not a mechanism

File: order_creation.feature — Scenario: payment declined
Step: And the inventory reservation is released

"Released" is not defined. Does the inventory service receive a DELETE? A POST to a release endpoint? Does a TTL fire? An agent will implement whichever mechanism seems natural. Two agents will produce incompatible implementations that both pass the spec.

Fix: Name the items and the mechanism: And the inventory service receives a reservation release request for SHOE-RED-42 and BELT-BRN-M.

4. "Explicit user action" describes a flow that doesn't exist

File: order_creation.feature — Scenario: partial availability
Step: And no order is confirmed without explicit user action

"Explicit user action" is not defined anywhere in the spec. A second API call? A UI confirmation? A webhook? This step passes trivially because no order is confirmed — the negative condition is true by absence. But it implies a follow-up confirmation flow that was never built, never specced, and never reviewed. If a future agent reads this step and builds a confirmation flow to satisfy it, it will invent something that was never intended.

Fix: Remove it if the follow-up flow is out of scope. Or replace it with a concrete step: And a subsequent POST to /orders/{order_id}/confirm is required to complete the order.

5. Presence without value

File: order_status_bad.feature
Step: field-name assertions without value or type assertions

Asserting that a field exists only catches absence — not incorrect presence. An agent can return {"status": null} and pass. The spec catches the wrong thing.

Fix: Assert the full expected shape with explicit values rather than just field names.
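As a rough illustration of the gap, here is a sketch comparing a presence check with a value check. The helper names and the set of allowed statuses are assumptions made for the example, not taken from the project.

```python
# A response an agent could return while still satisfying a presence-only step.
response_body = {"order_id": "order-abc-123", "status": None}


def passes_presence_check(body: dict) -> bool:
    """Weak assertion: only checks that the key exists."""
    return "status" in body


def passes_value_check(body: dict) -> bool:
    """Stronger assertion: the value must be a string from an allowed set."""
    allowed = {"CONFIRMED", "REJECTED", "PENDING"}  # assumed status values
    value = body.get("status")
    return isinstance(value, str) and value in allowed


print(passes_presence_check(response_body), passes_value_check(response_body))
```

The null status satisfies the first check and fails the second, which is the difference between catching absence and catching incorrect presence.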

6. "An order exists" doesn't say how

File: order_status_good.feature
Step: Given an order exists with status "CONFIRMED"

"An order exists" doesn't specify how it got there — full creation flow, or directly seeded into the store. The two methods produce different side effects. An agent building a test harness may seed the order directly, bypassing the creation flow entirely, which means the status endpoint tests never verify that a real confirmed order is actually readable via the API.

Fix: Given a previously confirmed order created via POST /orders with id "{order_id}" — or explicitly state that direct seeding is acceptable.

7. "Correct" is relative

File: notification_service.feature
Step: And the notification contains the correct order id and total

"Correct" compared to what? If the order total is computed, two agents may compute it differently and both pass "correct" against their own computation.

Fix: Hardcode the expected value: And the notification request body contains order_id matching the confirmed order and total of 134.97.

Why all seven of these matter even though they're all green

Every item in that audit passes its test. That is the point.

Spec debt is not visible in a green CI run. It is visible only when you ask: what would a second agent build from this spec? The step "the payment gateway is not retried more than 2 times" has been in the codebase since Issue #2. It has passed every run. But it encodes an ambiguity that will be resolved differently by every agent that implements it fresh. The "no order is confirmed without explicit user action" step describes a flow that does not exist anywhere in the codebase. It passes because the negative condition is trivially true.

If a future agent reads that step and builds a confirmation flow to satisfy it, it will build something that was never specced, never reviewed, and never integrated. The spec invited it. The tests blessed it. Nobody noticed.

This is the exact failure mode that makes AI-assisted development unreliable at scale. Specs that look precise, pass their tests, and silently invite incompatible implementations. The debt doesn't announce itself. It compounds.

Where the project stands

Fifteen tests passing across four bounded feature files. The notification service is integrated. The Pact contracts — which existed before this session — remain unbroken because the notification call happens after the transaction completes. Adding a new service boundary didn't require touching existing contracts.

Seven spec debt items documented. None fixed yet. The fixes are the next issue.

Next issue: The Spec Audit — applying the debt framework to a real existing service and building the diagnostic tool readers can use on their own codebases.

Sources & Further Reading

Cucumber + Gherkin documentation
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Nate B. Jones — natebjones.com
Project repository
Session findings — Issue #7

This article was written with the assistance of AI tools.

My Tests Passed. My Pipeline Caught What They Missed.

Diya Burman — Sat, 13 Jun 2026 18:59:27 +0000

A Level 5 Engineer — Issue #6

Preface

Five issues in, everything we've built lives on one machine. The Gherkin scenarios, the WireMock stubs, the Pact contracts, the can-i-deploy script — all of it runs locally, passes locally, and means nothing the moment someone else touches the codebase.

Issue #6 fixes that. A GitHub Actions pipeline now runs on every push, executes the full specification stack in dependency order, and blocks merges to main if anything breaks. The pipeline is the guardrail. From this point on, a broken contract or a failing scenario cannot reach main undetected.

Getting there took ninety minutes and two interventions I didn't plan for. Both are worth documenting.

Before the YAML: deciding what "green" means

The first thing Claude Code did before touching any pipeline config was run the full test suite to establish a baseline. The instruction was explicit: everything must pass before a single line of YAML gets written.

It found a failure immediately — and it wasn't from the breaking change experiment. It was from Issue #5.

The bad-spec test (test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order) was still asserting db_status in the response body. That was intentional in Issue #5 — the failure was the finding. The session ended with it red because the point was to show what bad specs produce. But on main, with CI incoming, that means the pipeline would have been red on day one before a single feature change.

The fix was adding backward-compat aliases to the response:

return {
    "order_id": order_id,
    "status": order["db_status"],            # good spec field
    "db_status": order["db_status"],         # bad spec alias — keeps Issue #5 test passing
    "placed_at": order["order_created_at"],  # good spec field
    "order_created_at": order["order_created_at"],  # bad spec alias
}

Neither test file was modified. No feature files were touched. The aliases kept both the good-spec and bad-spec tests passing against the same endpoint.

The reason this matters before the pipeline exists: a team that starts CI with a known failure trains itself to ignore red. The cost of normalising a red CI is much higher than the cost of fixing the baseline first. Claude Code made the right call and documented it before moving on.

The pipeline structure

Four jobs, in dependency order:

test → pact-consumer → pact-verify → can-i-deploy

Each job only runs if its predecessor passes. If Gherkin breaks, Pact never runs. If the consumer tests fail, verification never runs. If verification fails, can-i-deploy is skipped. The pipeline fails fast and tells you exactly which layer broke.

The artifact chain is what makes it a pipeline rather than four parallel scripts. The pact-consumer job generates the .pact files and uploads them as a GitHub Actions artifact. The pact-verify job downloads that artifact and verifies it — the same files, not freshly regenerated ones. Without this, each job would build its own consumer contract from scratch, and verification would be proving that the contract matches the code rather than proving it matches what pact-consumer actually produced.

One non-obvious piece: mock_server.py is a library module with no command-line entry point. The pipeline needed servers running as background processes. The fix was an inline Python invocation:

- name: Start mock servers
  run: |
    . .venv/bin/activate
    python -c "
    import time
    from mock_server import start_mock_server
    start_mock_server(8091, 'wiremock/payment-mappings')
    start_mock_server(8092, 'wiremock/inventory-mappings')
    time.sleep(86400)
    " &
    sleep 2

The time.sleep(86400) keeps the process alive for the duration of the job. Inelegant but functional. A proper if __name__ == "__main__" entry point with argparse is the obvious cleanup for a future session.
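One possible shape for that cleanup is sketched below. It is an assumption about how the entry point could look, not code from the repository: the `--server PORT:MAPPINGS_DIR` flag, `parse_server_spec`, `build_parser`, and `main` are all hypothetical, and the `start` callable stands in for `start_mock_server(port, mappings_dir)`.

```python
import argparse
import time
from typing import Callable, List, Sequence, Tuple


def parse_server_spec(spec: str) -> Tuple[int, str]:
    """Split 'PORT:MAPPINGS_DIR' into (port, mappings_dir)."""
    port_text, _, mappings_dir = spec.partition(":")
    if not port_text.isdigit() or not mappings_dir:
        raise argparse.ArgumentTypeError(f"expected PORT:MAPPINGS_DIR, got {spec!r}")
    return int(port_text), mappings_dir


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Start one or more mock servers.")
    parser.add_argument(
        "--server",
        action="append",
        type=parse_server_spec,
        required=True,
        metavar="PORT:MAPPINGS_DIR",
        help="may be repeated, one flag per mock server",
    )
    return parser


def main(
    argv: Sequence[str],
    start: Callable[[int, str], object],
    block: bool = True,
) -> List[Tuple[int, str]]:
    """Start every requested server, then keep the process alive if asked to."""
    args = build_parser().parse_args(argv)
    for port, mappings_dir in args.server:
        start(port, mappings_dir)
    while block:
        time.sleep(3600)  # sleep in a loop instead of one fixed 86400s sleep
    return args.server
```

Inside mock_server.py this would be wired up with an `if __name__ == "__main__":` guard that calls `main(sys.argv[1:], start_mock_server)`, so the pipeline step becomes a plain `python mock_server.py --server 8091:wiremock/payment-mappings --server 8092:wiremock/inventory-mappings &`.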

The first CI run — and why I had to intervene manually

The YAML was committed, pushed to main, and the pipeline ran. All three runs failed on the test job:

OSError: [Errno 98] Address already in use

Ports 8091 and 8092. Every test in test_order_creation.py errored at setup. The order status tests — which don't use the mock servers — passed fine.

Claude Code didn't catch this on its own. Here's why that's worth explaining.

When Claude Code wrote the pipeline, it was working from the codebase and its own knowledge of GitHub Actions patterns. It knew the mock servers needed to be running before pytest started, so it added an explicit start-servers step to the YAML — a reasonable decision based on the information it had. What it couldn't see was the runtime interaction between that YAML step and pytest's session-scoped fixtures, because that interaction only manifests in the CI environment, not locally.

Locally, running pytest tests/steps/ -v has always worked correctly because the session fixture starts the servers and nothing else competes. Claude Code had only ever seen local runs succeed. It had no signal that the YAML step was creating a conflict — because the conflict doesn't exist locally.

This is a fundamental limit of the "paste and walk away" approach at the boundary between local and remote environments: the agent can reason about the codebase and about CI patterns, but it can't observe the CI run itself. The failure was on GitHub. Claude Code was in a terminal. Those two things weren't connected.

I diagnosed the error from the GitHub Actions log, explained the root cause, and pasted new instructions. Claude Code fixed it in one step — removing the redundant YAML steps entirely:

# Removed from both test and pact-verify jobs:
- name: Start mock servers
  run: |
    . .venv/bin/activate
    python -c "..." &
    sleep 2

The pytest session fixtures already own server lifecycle correctly. scope="session" means pytest starts the servers once per test run and keeps them alive. The YAML step was duplicating a responsibility that was already handled. The fix wasn't a workaround — it was removing the wrong layer.

The root cause in plain terms: the YAML step and the pytest fixture both thought they were responsible for starting the servers. The port was already bound when the fixture tried to bind it again. Works on my machine. Breaks in CI. Classic.
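The conflict is easy to reproduce outside CI. This sketch uses plain sockets rather than the project's mock servers, so it only illustrates the mechanism: one owner binds a port, a second owner tries to bind the same port, and the operating system refuses.

```python
import errno
import socket

# First owner: stands in for the YAML step that had already started the servers.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
first.listen()
port = first.getsockname()[1]

# Second owner: stands in for the pytest session fixture binding the same port.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    conflict = None
except OSError as exc:
    conflict = exc.errno  # EADDRINUSE is errno 98 on Linux
finally:
    second.close()
    first.close()

print(conflict == errno.EADDRINUSE)
```

Neither owner is wrong in isolation. The failure only exists when both run in the same environment, which is why the local runs never showed it.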

The breaking change experiment — in the pipeline

With the pipeline green, the breaking change test ran as designed.

Branch test/breaking-change-pipeline, commit 76c0d89: renamed "status" to "result" in wiremock/payment-mappings/payment-success.json. Same change as Issue #4, now running through CI instead of local verification.

The expected failure:

a successful payment charge (FAILED)

Failures:
1) Verifying a pact between OrderService and PaymentGateway
   1.1) has a matching body
          $ -> Actual map is missing the following keys: status
   {
     "amount": 134.97,
  -  "status": "ACCEPTED",
  +  "result": "ACCEPTED",
     "transaction_id": "txn-abc-123"
   }

pact-verify fails. can-i-deploy is skipped. The merge is blocked.

And the key point from Issue #4 holds at the pipeline level: the test job — the Gherkin suite — would pass with the breaking change in place. The order creation scenarios check HTTP status codes and business outcomes. They never read pay_resp.json()["status"]. A stub returning result instead of status still returns HTTP 200. Gherkin passes. Pact catches it.

This is the division of labour. Gherkin proves the system does the right thing. Pact proves the contracts don't drift. You need both, and now both run automatically on every push.
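A toy model of that split, under loud assumptions: the two functions below are not how Gherkin step definitions or the Pact verifier are implemented. They only mimic what each layer looks at, using the stub bodies quoted in this series.

```python
from typing import Dict, Set, Tuple

StubResponse = Tuple[int, Dict[str, object]]

original_stub: StubResponse = (
    200,
    {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97},
)
renamed_stub: StubResponse = (
    200,
    {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97},
)


def order_outcome(payment_response: StubResponse) -> str:
    """Behavioural view: decides from the HTTP status code, never reads the body."""
    status_code, _body = payment_response
    return "CONFIRMED" if status_code == 200 else "REJECTED"


def missing_contract_keys(payment_response: StubResponse) -> Set[str]:
    """Contract view: keys the consumer declared that the provider no longer sends."""
    expected_keys = {"status", "transaction_id", "amount"}
    _status_code, body = payment_response
    return expected_keys - body.keys()


for stub in (original_stub, renamed_stub):
    print(order_outcome(stub), missing_contract_keys(stub))
```

The behavioural view returns the same outcome for both stubs. Only the contract view reports that `status` went missing after the rename.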

The one step that requires the GitHub UI

Claude Code cannot configure branch protection rules — that requires the GitHub web UI or admin API. This step is non-negotiable and must be done manually:

Repo → Settings → Branches → Add branch protection rule
Branch name pattern: main
Enable Require status checks to pass before merging
Add all four status checks: test, pact-consumer, pact-verify, can-i-deploy
Enable Require branches to be up to date before merging
Save

Without this, the pipeline is advisory. A push to main can still happen even if all four jobs are red. The pipeline becomes a dashboard — it shows you the problem but doesn't stop anything. Branch protection is what turns "CI failed" from a notification into enforcement. The pipeline is only a guardrail if something stops you going around it.
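For anyone who prefers the admin API route over the web UI, the same settings can be expressed as a request to GitHub's branch protection endpoint. The sketch below only builds the request and sends nothing. The owner and repo names are placeholders, the helper name is hypothetical, and the three fields set to None are settings this article does not cover.

```python
import json
from typing import Tuple

# The four job names from the pipeline, used as required status checks.
REQUIRED_CHECKS = ["test", "pact-consumer", "pact-verify", "can-i-deploy"]


def branch_protection_request(owner: str, repo: str, branch: str = "main") -> Tuple[str, str, str]:
    """Return (method, path, json_body) for the branch protection update."""
    path = f"/repos/{owner}/{repo}/branches/{branch}/protection"
    payload = {
        "required_status_checks": {
            "strict": True,  # "Require branches to be up to date before merging"
            "contexts": REQUIRED_CHECKS,
        },
        # The API expects these keys to be present; None leaves them disabled.
        "enforce_admins": None,
        "required_pull_request_reviews": None,
        "restrictions": None,
    }
    return "PUT", path, json.dumps(payload, indent=2)


method, path, body = branch_protection_request("example-owner", "order-api")
print(method, path)
print(body)
```

Sending it needs a token with admin rights on the repository, which is the same reason the agent could not do this step on its own.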

The honest part

The YAML took about twenty minutes to write. The session took ninety minutes total — because the baseline fix and the port conflict ate the rest.

The instinct during the baseline audit was to skip past the known failure. It's a demo test, we know why it's there, configure CI to skip that file and move on. That would have been thirty seconds. It also would have been wrong — a pipeline with documented exceptions is a pipeline people route around.

The instinct during the port conflict was to blame the CI environment. Ubuntu runs things differently, ports work differently, it's a platform quirk. That framing would have sent the debugging in the wrong direction. The actual cause was simpler: two layers both thought they owned the same responsibility, and nobody had written down which one was actually in charge.

Both of those moments are the J-curve. Not the YAML — the discipline of not skipping and not blaming the environment. The overhead of CI is not the config file. It's every decision about what "green" actually means and who's responsible for what.

The pipeline is now real infrastructure. The breaking change can't reach main. That's worth ninety minutes.

Next issue: The Scope Problem — scaling Gherkin across a multi-service system. What happens when one spec file isn't enough, and how spec debt forms.

Sources & Further Reading

GitHub Actions documentation
Pact documentation
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Nate B. Jones — natebjones.com
Project repository
Session findings — Issue #6

This article was written with the assistance of AI tools.

The AI Built the Wrong Thing. Every Test Passed.

Diya Burman — Wed, 10 Jun 2026 02:40:42 +0000

A Level 5 Engineer — Issue #5

Preface

Every issue so far has assumed something I haven't said out loud: that the specs are good. Issue #2 wrote them carefully. Issue #3 handed them to an agent and watched it build correctly. Issue #4 proved the contracts survive provider drift.

But what happens when the spec isn't good? Not broken — Gherkin syntax is fine, tests pass, the agent builds something. Just imprecise. Vague in ways that feel precise when you're writing them.

This issue answers that question by doing the thing deliberately. I wrote bad Gherkin on purpose, handed it to the agent, watched what it built — and then rewrote the spec and did it again. The difference between the two implementations is the article.

The hardest thing about bad specs

Bad specs are hard to spot when you're writing them because they feel complete.

A scenario that references implementation details sounds like reasonable description — you wrote the implementation, so the details feel like specifics. A Given clause that feels obvious to you will be interpreted differently by every reader who hasn't seen the code. The Gherkin is syntactically correct. The tests pass. Nothing in the output signals that anything is wrong.

This is the trap. It's not that bad specs break things. It's that they don't.

The endpoint

I added a new endpoint to the order-api project: GET /orders/{order_id}/status. It returns the current status of an order and relevant metadata. Simple enough that the spec should be easy to write well. Which makes it a good target for writing it badly on purpose.

The bad specs

Two scenarios. Both syntactically valid. Both produce passing tests. Both wrong in different ways.

# BAD SPEC 1 — The leaky spec
# Problem: references internal implementation concepts (db_status, order_created_at)
# rather than describing what a caller observes. The agent uses these names literally
# in the response body, leaking storage terminology into the public API contract.

Scenario: Retrieving status for a confirmed order
  Given an order exists in the system with db_status "CONFIRMED"
  When I request GET /orders/{order_id}/status
  Then the response should contain the db_status field set to "CONFIRMED"
  And the order_created_at field should be populated from the order record

# BAD SPEC 2 — The vague Given
# Problem: "an order that has not been placed" is underspecified. The agent must
# guess what this means — a malformed ID? A well-formed UUID with no record?
# A deleted order? Each interpretation is plausible and produces different behavior.

Scenario: Retrieving status for an order that does not exist
  Given an order that has not been placed
  When I request GET /orders/{order_id}/status
  Then the response should indicate the order was not found

Both passed immediately:

tests/steps/test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order PASSED
tests/steps/test_order_status_bad.py::test_retrieving_status_for_an_order_that_does_not_exist PASSED

2 passed in 0.34s

Green. No warnings. No hint that anything is wrong.

What the agent built from the bad specs

Here's the implementation the agent produced:

@app.get("/orders/{order_id}/status")
def get_order_status(order_id: str):
    order = _orders.get(order_id)
    if order is None:
        raise HTTPException(status_code=404, detail="Order not found")
    return {
        "order_id": order_id,
        "db_status": order["db_status"],
        "order_created_at": order["order_created_at"],
    }

It satisfies the spec completely. It also made four decisions the spec never made:

Decision 1: The field is named db_status in the response.
The spec said db_status so the agent used db_status. It never questioned whether this was an internal name leaking into a public API. It satisfied the spec literally.

Decision 2: A missing order returns 404.
The spec says "indicate the order was not found." 404 is a defensible interpretation. So is 422, 403, or a 200 with a NOT_FOUND status field. The agent picked the most conventional option — but the spec never mandated it, and FastAPI's default 404 body is {"detail": "Order not found"}, not {"error": "Order not found"}. A client checking response.json()["error"] gets a KeyError.

Decision 3: The timestamp field is named order_created_at with no format requirement.
The spec says "populated from the order record." The agent chose order_created_at and returned an ISO string because that's what datetime.utcnow().isoformat() produces. The step definition checked only that the field is a non-empty string, so any string format would have passed. A Unix timestamp serialized as a string would have passed. A human-readable string like "June 2nd" would have passed.

Decision 4: The order store is in-memory.
The spec says nothing about persistence. An in-memory dict is the simplest thing that makes the tests pass. In production, orders are persisted. The in-memory store vanishes on restart and isn't shared across worker processes.

Every one of these decisions is plausible. The agent made the reasonable call every time. That's not the problem. The problem is that a different agent, given the same spec, might have made different reasonable calls — and both implementations would pass the same test suite.

The rewrite

Writing the good spec forced every decision the bad spec had silently delegated:

# GOOD SPEC 1 — Caller's perspective, not implementation's
# Fixed: field names describe what the caller observes (status, placed_at)
# not what the storage layer calls them (db_status, order_created_at).
# The format of placed_at is now a contract obligation, not an assumption.

Scenario: Confirmed order status is returned with placement timestamp
  Given a confirmed order with id "order-abc-123" exists in the system
  When I request GET /orders/order-abc-123/status
  Then the response status code is 200
  And the response body contains "order_id" equal to "order-abc-123"
  And the response body contains "status" equal to "CONFIRMED"
  And the response body contains "placed_at" as a valid ISO 8601 timestamp

# GOOD SPEC 2 — Precise Given, explicit 404 body shape
# Fixed: "a well-formed id with no corresponding record" is now unambiguous.
# The 404 response body shape is now a contract obligation, not a guess.

Scenario: Unknown order id returns 404 with error message
  Given no order with id "order-xyz-999" exists in the system
  When I request GET /orders/order-xyz-999/status
  Then the response status code is 404
  And the response body contains an "error" field

Notice what changed. The scenarios describe the same two situations. The intent is identical. But now every decision is in the spec rather than in the agent's interpretation of the spec.

What the agent built from the good spec

@app.get("/orders/{order_id}/status")
def get_order_status(order_id: str):
    order = _orders.get(order_id)
    if order is None:
        return JSONResponse(status_code=404, content={"error": "Order not found"})
    return {
        "order_id": order_id,
        "status": order["db_status"],
        "placed_at": order["order_created_at"],
    }

Same endpoint. Same logic. Different API.

db_status became status. order_created_at became placed_at. The 404 body now contains error not detail. The timestamp is now asserted to be ISO 8601 — not just non-empty.

These are not cosmetic differences. They are different contracts that clients build against.
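To show what the timestamp assertion changes, here is a sketch of the two checks side by side. The function names are hypothetical and the real step definitions may differ. Note that `datetime.fromisoformat` accepts only part of ISO 8601 on Python versions before 3.11 (for example, it rejects a trailing `Z`), so treat it as an approximation of "valid ISO 8601".

```python
from datetime import datetime


def is_non_empty_string(value: object) -> bool:
    """The bad-spec check: any non-empty string passes."""
    return isinstance(value, str) and value != ""


def is_iso_8601_timestamp(value: object) -> bool:
    """The good-spec check: the string must parse as an ISO 8601 date-time."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


for candidate in ("2026-06-02T10:15:30", "June 2nd"):
    print(candidate, is_non_empty_string(candidate), is_iso_8601_timestamp(candidate))
```

"June 2nd" satisfies the first check and fails the second. The format stopped being the agent's choice once the spec named it.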

The cross-run

After building from the good spec, I ran the bad-spec tests against the new implementation:

tests/steps/test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order FAILED
tests/steps/test_order_status_bad.py::test_retrieving_status_for_an_order_that_does_not_exist PASSED

E   KeyError: 'db_status'

The leaky test failed. The field db_status doesn't exist in the good implementation — it's been renamed to status, which is what a caller should see. The test that was checking for an internal name is now broken, correctly.

The vague test passed. Both implementations return a 404 for a missing order. The bad-spec implementation reached that status code by guessing; the good-spec implementation reached it because the spec mandated it.

That asymmetry is instructive. The vague Given produced the right answer by coincidence. The leaky Then produced the wrong field name by construction. One was luck. One was baked in.
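The cross-run can be reduced to a small matrix. This is a simplified stand-in for the real test suites: the two `*_impl` functions return the confirmed-order response bodies described above, the two `*_check` functions assert only the status field, and the timestamp value is made up for the example.

```python
def bad_spec_impl() -> dict:
    """Response shape built from the leaky spec."""
    return {
        "order_id": "order-abc-123",
        "db_status": "CONFIRMED",
        "order_created_at": "2026-06-02T10:15:30",  # illustrative value
    }


def good_spec_impl() -> dict:
    """Response shape built from the rewritten spec."""
    return {
        "order_id": "order-abc-123",
        "status": "CONFIRMED",
        "placed_at": "2026-06-02T10:15:30",  # illustrative value
    }


def bad_spec_check(body: dict) -> bool:
    return body.get("db_status") == "CONFIRMED"


def good_spec_check(body: dict) -> bool:
    return body.get("status") == "CONFIRMED"


# Run every check against every implementation, not just its own.
results = {
    (impl.__name__, check.__name__): check(impl())
    for impl in (bad_spec_impl, good_spec_impl)
    for check in (bad_spec_check, good_spec_check)
}
for pairing, passed in results.items():
    print(pairing, "PASSED" if passed else "FAILED")
```

The diagonal of the matrix is green and the off-diagonal is red. A normal CI run only ever exercises the diagonal.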

Why this matters

Both implementations pass their own test suites. That is the trap.

If you run the bad-spec tests against the bad-spec implementation: green. If you run the good-spec tests against the good-spec implementation: green. The difference only surfaces when you cross-run — and in production, you never cross-run. You ship the bad implementation, it passes CI, and the problem lands in a client exception report six months later.

Here's the concrete difference: the bad-spec implementation returns db_status and order_created_at with no format guarantee. The good-spec implementation returns status and placed_at with a mandatory ISO 8601 format. An agent given the bad spec had no way to know that db_status was wrong — the spec said db_status. An agent given the good spec had no choice but to produce status — the spec said status.

Spec quality is not about whether tests pass. It is about how much of the implementation the spec author wrote versus how much was silently delegated to the agent. Every silent delegation is a place where two agents given the same spec produce different code: both versions pass, but they disagree on the contract.

At scale — dozens of endpoints, hundreds of scenarios — that disagreement is the system.

The practical test for a good spec

Before handing any scenario to an agent, ask one question: what decisions does this scenario leave open?

If the answer is "none — every field name, format, response code, and body shape is specified," the spec is ready. If the answer is "a few reasonable ones," those are the places where your implementation and the next agent's implementation will silently diverge.

The agent will always make reasonable decisions. That's not the problem. The problem is that reasonable is not the same as specified — and at Level 4, specified is the only thing that counts.

Next issue: Wiring the Guardrails — GitHub Actions, the Pact Broker, and the pipeline that turns contract violations into blocked merges automatically.

Sources & Further Reading

Cucumber + Gherkin documentation
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Nate B. Jones — natebjones.com
Project repository
Session findings — Issue #5

This article was written with the assistance of AI tools.

Green CI. Broken Contract. Nobody Noticed.

Diya Burman — Wed, 10 Jun 2026 02:24:34 +0000

A Level 5 Engineer — Issue #4

Preface

If you've been following along, you know where we are. Issue #2 introduced WireMock and Gherkin — write the behavioral contract before the code, stub your dependencies, run a real test suite. Issue #3 handed that spec to an AI agent and walked away. Five scenarios passed. The agent even found a bug in my code.

Everything worked. And that's exactly the problem this issue is about.

Because the WireMock stubs working perfectly is not the same thing as the real services working. The gap between those two statements is where production incidents are born.

The confidence trap

Here's the scenario nobody talks about until it happens to them.

Your order service calls a payment gateway. You've stubbed it with WireMock. Your Gherkin scenarios pass. Your agent builds against those stubs. Five for five, green across the board.

Meanwhile, the payment gateway team — a different squad, a different repo, maybe a different company entirely — ships a cleanup. They've been inconsistent about field naming across their API. status in one endpoint, result in another. They standardize. They rename status to result in the charge response. Their tests pass. They deploy.

Your tests still pass too. The stub hasn't changed. The stub will never change unless you change it.

The first time you learn about the rename is a production incident.

This is the confidence trap: a mock that can drift from the real service makes you feel safe right up until production proves you weren't. The tests are green. The contract is broken. You just don't know it yet.

What Pact does differently

WireMock is a behavioral double — it simulates a service so your tests can run in isolation. You define what it returns. You maintain it. You can make it say anything you want, which means it can silently lie about what the real service actually does.

Pact inverts the trust relationship.

Instead of you maintaining a stub that you hope reflects reality, your consumer tests declare what they need from the provider. Those declarations get written into a .pact file — a machine-readable contract. The provider then runs verification against that contract before it ships. If the provider no longer satisfies what the consumer declared, verification fails and the deploy is blocked.

The consumer defines the need. The provider proves delivery. No human has to remember to update a stub.

Building it — and what the docs didn't tell me

I added Pact to the order-api project this issue, covering both downstream dependencies — the payment gateway and the inventory service — with consumer tests matching the same five scenarios from the Gherkin feature file.

It was less smooth than I expected.

The pact-python v3 FFI surprise

Every tutorial for pact-python shows the same pattern: create a module-scoped Pact fixture, run multiple tests against it, write the pact file at the end. I wrote exactly that. The first test in each class passed. Every subsequent test failed with this:

RuntimeError: The provider state could not be specified.

No hint of what was actually wrong. After digging into the source, the root cause: pact-python 3.x is a complete rewrite backed by a Rust FFI binary. The Rust handle is consumed by the first serve() call — you cannot add new interactions to a handle after that point. The v2-style module-scoped pattern violates this constraint in a way the error message doesn't explain at all.

The fix was restructuring the consumer tests so all interactions are defined upfront before any serve() call:

# ❌ v2-style — breaks in pact-python v3
class TestPaymentConsumer:
    @pytest.fixture(scope="module")
    def pact(self):
        return Consumer("OrderService").has_pact_with(Provider("PaymentGateway"))

    def test_success(self, pact):
        pact.given("payment succeeds").upon_receiving("a charge")...
        with pact:
            # test

    def test_declined(self, pact):
        pact.given("payment declined").upon_receiving("a decline")...
        # RuntimeError — handle already consumed

# ✅ v3 correct pattern — all interactions before serve()
def test_payment_gateway_consumer():
    pact = Consumer("OrderService").has_pact_with(Provider("PaymentGateway"), ...)
    (pact
        .given("the payment gateway will accept the charge")
        .upon_receiving("a successful payment charge")
        .with_request("POST", "/payments/charge/success")
        .will_respond_with(200, body={"status": "ACCEPTED",
                                      "transaction_id": "txn-abc-123",
                                      "amount": 134.97}))
    (pact
        .given("the payment gateway will decline the charge")
        .upon_receiving("a declined payment charge")
        .with_request("POST", "/payments/charge/declined")
        .will_respond_with(402, body={"status": "DECLINED",
                                      "reason": "INSUFFICIENT_FUNDS"}))
    # ... all interactions defined ...
    with pact.serve() as srv:
        # exercise all interactions against srv.url
    pact.write_file("pacts/")

If you're upgrading from pact-python 1.x or 2.x: expect to rewrite your test fixtures. This isn't a syntax change — it's a different mental model of how the mock server lifecycle works.

The Verifier transport configuration gap

Provider verification had its own friction. The Verifier constructor in pact-python v3 takes a hostname, not a full URL. Passing a full URL causes a host mismatch when you later configure the transport:

# ❌ Causes "Host mismatch: localhost != http://localhost:8291"
(
    Verifier("PaymentGateway", "http://localhost:8291")
    .add_transport(url="http://localhost:8291")
)

# ✅ Correct: hostname only, transport details passed separately
(
    Verifier("PaymentGateway", "localhost")
    .add_transport(protocol="http", port=8291, scheme="http")
    .add_source(pact_file)
    .set_request_timeout(10000)  # needed for the 6s timeout stub
)

The set_request_timeout(10000) line is also non-obvious: the payment timeout stub uses fixedDelayMilliseconds: 6000 to simulate a slow response. The verifier's default timeout is 5 seconds. Without the explicit timeout extension, the timeout interaction fails verification with a connection error rather than a clean pass.

Neither of these is in the main documentation. Both took real time to find. They're in the findings file for this session, linked at the bottom.

The breaking change experiment

All the Pact setup is preamble. This is the proof.

Step 1: Baseline — all contracts verified

pytest tests/pact/test_provider_verification.py -v

Verifying a pact between OrderService and PaymentGateway
  a declined payment charge         (OK)
  a successful payment charge       (OK)
  a timed-out payment charge        (OK)
PASSED

Verifying a pact between OrderService and InventoryService
  [3 interactions — all OK]
PASSED

2 passed in 8.19s

Step 2: Introduce the breaking change

In wiremock/payment-mappings/payment-success.json, one field rename:

// Before
{"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

// After — "status" renamed to "result"
{"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

Step 3: Provider verification with the breaking change

pytest tests/pact/test_provider_verification.py -v

  a successful payment charge (FAILED)

Failures:
  1.1) has a matching body
         $ -> Actual map is missing the following keys: status
  {
    "amount": 134.97,
  -  "status": "ACCEPTED",
  +  "result": "ACCEPTED",
    "transaction_id": "txn-abc-123"
  }

1 failed in 7.22s

Pact caught it. Exact field. Exact diff. No ambiguity about what broke or why.

Step 4: The same breaking change against the WireMock test suite

pytest tests/steps/test_order_creation.py -v

test_order_is_successfully_created... PASSED
test_order_is_rejected_when_payment_is_declined PASSED
test_order_is_rejected_when_an_item_is_out_of_stock PASSED
test_order_surfaces_partial_unavailability... PASSED
test_order_handling_is_graceful_when_the_payment_gateway_times_out PASSED

5 passed in 13.01s

Five for five. All green. The breaking change is completely invisible.

Step 5: Revert and confirm

2 passed in 8.19s

Why the WireMock tests stayed green

This isn't a flaw in the Gherkin approach — it's a precise boundary on what any behavioral test can and can't see.

The Gherkin scenarios test the order service's behavior: does the order get confirmed? Does the right status come back to the caller? In app/main.py, when the payment gateway responds, the code checks the HTTP status code and returns {"status": "CONFIRMED"} — it never reads the status field from the payment gateway body. So from the test harness's perspective, nothing changed. The right HTTP code came back, the order was confirmed, all assertions passed.

Pact caught it because the consumer test had explicitly declared that the order service expects a status field in the payment response. That expectation is encoded in the .pact file. When provider verification ran against the modified stub, the Rust verifier compared the actual response against the contract and found the key missing.
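Here's a minimal sketch of that difference. It is not the code in app/main.py or anything from Pact's internals; confirm_order and missing_keys are hypothetical names I'm using purely for illustration, and the key check is a deliberately crude stand-in for what the verifier does.

```python
# Hypothetical stand-ins, not the real app/main.py or the Pact verifier.

def confirm_order(gateway_status_code: int, gateway_body: dict) -> dict:
    """Behavioural path: the decision uses the HTTP status code only."""
    # gateway_body is never read, so a renamed field cannot change the result.
    if gateway_status_code == 200:
        return {"status": "CONFIRMED"}
    return {"status": "PAYMENT_FAILED"}


def missing_keys(expected_body: dict, actual_body: dict) -> list[str]:
    """Contract path: every key the consumer declared must be present."""
    return sorted(key for key in expected_body if key not in actual_body)


expected = {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}
renamed = {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

print(confirm_order(200, renamed))      # behaviour is unchanged by the rename
print(missing_keys(expected, renamed))  # the contract check reports the lost key
```

The first call is roughly what the Gherkin suite observes. The second is roughly what provider verification observes.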

The Gherkin test and the Pact consumer test are testing different things. Gherkin tests the system's behavior end-to-end. Pact tests the shape of the conversation between services. You need both. They're not competing — they're covering different failure modes.

The can-i-deploy gate

The final piece was a local can-i-deploy simulation — a script that reads the generated .pact files, checks each interaction's expected response shape against the WireMock stub mappings, and exits 0 (safe) or 1 (blocked).

With contracts intact:

python scripts/can_i_deploy.py

Pact: OrderService → PaymentGateway
  PASS  a declined payment charge
  PASS  a successful payment charge
  PASS  a timed-out payment charge

Pact: OrderService → InventoryService
  PASS  [3 interactions]

RESULT: ALL CONTRACTS VERIFIED — safe to deploy
Exit: 0

With the breaking change in place:

  FAIL  a successful payment charge
        stub is missing fields expected by consumer: ['status']

RESULT: CONTRACT VIOLATIONS DETECTED — do not deploy
Exit: 1

In a real Pact Broker setup, this check queries a central record of which consumer versions have verified which provider versions. The local simulation does something simpler but teaches the same pattern: before you deploy, prove the contract is still satisfied. The exit code is what a CI pipeline reads. A non-zero exit stops the merge.
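The pattern can be sketched in a few lines. This is not the repository's scripts/can_i_deploy.py. The function names are hypothetical, and I'm assuming a simplified data shape: each interaction carries a description and a response with a plain JSON body, and the stub bodies are keyed by that same description rather than matched on method and path.

```python
# Hypothetical sketch; not the repository's scripts/can_i_deploy.py.
# Assumed shapes: a pact is a dict with "interactions"; each interaction has
# a "description" and a "response" with a JSON "body". Stub bodies are keyed
# by interaction description (a simplification of real request matching).

def check_pact(pact: dict, stub_bodies: dict) -> list[str]:
    """Return one message per interaction whose stub lacks expected fields."""
    failures = []
    for interaction in pact["interactions"]:
        name = interaction["description"]
        expected = interaction["response"].get("body", {})
        actual = stub_bodies.get(name, {})
        missing = sorted(key for key in expected if key not in actual)
        if missing:
            failures.append(f"{name}: stub is missing fields {missing}")
    return failures


def can_i_deploy(pacts: list[dict], stub_bodies: dict) -> int:
    """Return the exit code a CI step would read: 0 is safe, 1 is blocked."""
    failures = [msg for pact in pacts for msg in check_pact(pact, stub_bodies)]
    for msg in failures:
        print("  FAIL ", msg)
    return 1 if failures else 0


pact = {
    "consumer": {"name": "OrderService"},
    "provider": {"name": "PaymentGateway"},
    "interactions": [{
        "description": "a successful payment charge",
        "response": {"status": 200, "body": {
            "status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}},
    }],
}
intact = {"a successful payment charge": {
    "status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}}
broken = {"a successful payment charge": {
    "result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}}

print(can_i_deploy([pact], intact))  # contracts intact
print(can_i_deploy([pact], broken))  # field renamed in the stub
```

A real script would load the JSON from the generated .pact files and the WireMock mappings, then hand the return value to sys.exit so the pipeline sees it.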

The full GitHub Actions wiring — where this becomes an automated gate on every PR — is Issue #6. The local simulation is enough to feel how it works.

Where we are

Four issues in, the specification layer is taking shape. Gherkin and WireMock proved the agent builds reliably against a well-written spec. The agent session proved that clean specs produce clean implementations and expose your assumptions. Pact closes the loop — the contract now survives beyond the stub and catches provider drift before it reaches production.

The stack is starting to look like something real. But there's a question I've been putting off since Issue #2 that can't wait any longer: what actually makes a Gherkin scenario good? Because not all specs are equal, and an agent that builds from a loose spec produces something very different from one that builds from a tight one. Next issue I'm going to prove that by deliberately writing bad Gherkin, handing it to the agent, and showing you what comes out.

Next issue: The Spec That Doesn't Lie — deliberately writing bad Gherkin, seeing what the agent builds from it, then rewriting it and comparing the output.

Sources & Further Reading

Pact documentation
pact-python v3 migration guide
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Nate B. Jones — natebjones.com
Project repository
Session findings — Issue #4

This article was written with the assistance of AI tools.

The Agent Found What Code Review Missed.

Diya Burman — Wed, 10 Jun 2026 01:02:49 +0000

A Level 5 Engineer — Issue #3

If you've been following along, you know what we've built so far. Issue #1 introduced the five levels framework and the Dark Factory concept. Issue #2 got concrete — we wrote five Gherkin scenarios for an order management API before touching any implementation code, stubbed out two external dependencies with WireMock, and ran a real test suite against the whole thing.

At the end of Issue #2 I made a promise: hand the spec to an AI agent, spec only, no implementation hints, and see what it builds.

This is that issue.

The setup

The instruction I gave Claude Code at the start of the session was exactly this:

"The Gherkin scenarios in tests/features/order_creation.feature define the full behavioural contract for this API. Do not read the existing implementation in app/main.py. Build a fresh implementation that makes all 5 scenarios pass. Document your findings in FINDINGS.md as you go."

That's it. No architecture hints. No "use FastAPI." No "here's how the mock servers work." Just the spec and a documentation instruction.

The CLAUDE.md in the repo handled the rest — the guardrails, the project context, the constraint that the .feature files cannot be touched, and the format the FINDINGS.md should follow. If you missed the deep dive on CLAUDE.md in Issue #2, that file is essentially the agent's standing orders. It reads it at the start of every session.

Then I sat back and watched.

What the agent derived from the spec alone

Here's what I found interesting. Before writing a single line of code, the agent read the Gherkin scenarios and derived the entire API contract from them. Unprompted. It produced this:

POST /inventory/check/{inventory_scenario}
  → all available      → POST /payments/charge/{payment_scenario}
  → partial available  → return 207 PARTIAL_UNAVAILABLE (no charge)
  → all out of stock   → return 409 UNAVAILABLE (no charge)

And the full response shape for all five scenarios:

Scenario	status	status_code	Key fields
Success	CONFIRMED	(none)	order_id
Payment declined	PAYMENT_FAILED	402	decline_reason, inventory_released: true
Out of stock	UNAVAILABLE	409	unavailable_items
Partial stock	PARTIAL_UNAVAILABLE	207	available_items, unavailable_items
Payment timeout	PAYMENT_PENDING	202	inventory_hold_minutes: 15, retry_count

This is exactly right. The agent read five plain-language scenarios and extracted a precise technical contract — the order of operations, the response codes, the body fields, the retry behaviour — without being told any of it explicitly.
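To show how little code that contract implies, here is a rough sketch of the decision flow. It is not the agent's implementation. The function signature, the shape of the charge result ("outcome", "reason", "attempts"), and the placeholder order id are all assumptions of mine; only the statuses, codes, and field names come from the contract above.

```python
from typing import Callable

# Hypothetical sketch of the derived flow; not the agent's actual code.
# `charge` stands in for the payment call and is only invoked when every
# item is available, mirroring the "no charge" branches in the contract.

def create_order(available: list[str], unavailable: list[str],
                 charge: Callable[[], dict]) -> dict:
    if unavailable and not available:
        return {"status": "UNAVAILABLE", "status_code": 409,
                "unavailable_items": unavailable}
    if unavailable:
        return {"status": "PARTIAL_UNAVAILABLE", "status_code": 207,
                "available_items": available,
                "unavailable_items": unavailable}

    result = charge()  # assumed shape: {"outcome": ..., "reason": ..., "attempts": ...}
    if result["outcome"] == "ACCEPTED":
        # No status_code on success, matching the table.
        return {"status": "CONFIRMED", "order_id": "order-123"}  # placeholder id
    if result["outcome"] == "DECLINED":
        return {"status": "PAYMENT_FAILED", "status_code": 402,
                "decline_reason": result["reason"],
                "inventory_released": True}
    # Anything else is treated as a timeout in this sketch.
    return {"status": "PAYMENT_PENDING", "status_code": 202,
            "inventory_hold_minutes": 15,
            "retry_count": result.get("attempts", 0)}


print(create_order(["sku-1"], [], lambda: {"outcome": "ACCEPTED"}))
print(create_order([], ["sku-1"], lambda: {"outcome": "ACCEPTED"}))
```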

That's not nothing. That's the spec doing its job.

Where it got interesting — the timeout scenario

Scenario 5 is the one I was most curious about. Timeout behaviour is notoriously hard to test and easy to get wrong. The agent worked through it carefully and documented its reasoning:

PAYMENT_TIMEOUT_SECONDS=5 — per-attempt HTTP client timeout
MAX_PAYMENT_RETRIES=2 — total attempt cap, not a retry count on top of the first attempt
Worst-case wall time with 2 attempts at 5 seconds each: 10 seconds — comfortably inside the 12-second contract from the scenario
The WireMock timeout stub uses fixedDelayMilliseconds: 6000 — deliberately longer than the client timeout so the client always times out before the mock responds

That last detail is subtle and correct. If the mock delay were shorter than the client timeout, the test would be testing the wrong thing — the mock responding slowly rather than the client giving up. The agent caught this without being prompted. It's in the FINDINGS.md.
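The arithmetic is small enough to write down as a sanity check. This is my own sketch, not something from the agent's FINDINGS.md; the helper names are hypothetical, and it assumes no backoff delay between attempts.

```python
# Values taken from the agent's notes above.
PAYMENT_TIMEOUT_SECONDS = 5    # per-attempt HTTP client timeout
MAX_PAYMENT_RETRIES = 2        # total attempt cap
MOCK_DELAY_MS = 6000           # WireMock fixedDelayMilliseconds
CONTRACT_LIMIT_SECONDS = 12    # limit stated in the scenario


def worst_case_seconds(timeout_s: int, max_attempts: int) -> int:
    """Wall time if every attempt runs to its timeout (no backoff assumed)."""
    return timeout_s * max_attempts


def client_times_out_first(timeout_s: int, mock_delay_ms: int) -> bool:
    """True when the stub is slower than the client is willing to wait."""
    return mock_delay_ms > timeout_s * 1000


budget = worst_case_seconds(PAYMENT_TIMEOUT_SECONDS, MAX_PAYMENT_RETRIES)
print(budget)                                                           # 10
print(budget < CONTRACT_LIMIT_SECONDS)                                  # True
print(client_times_out_first(PAYMENT_TIMEOUT_SECONDS, MOCK_DELAY_MS))  # True
```

If someone later raises the client timeout to 6 seconds or more, the last check flips to False and the scenario stops testing what it claims to test.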

The bug it found that I had written

This is my favourite part of this issue.

The original test setup — the code I pointed Claude Code at — had a hard-coded path:

sys.path.insert(0, "/home/claude/order-api")

On my machine this would silently start mock servers with no stubs loaded. Every payment call would return a 404. Every inventory call would return a 404. The tests would fail in ways that looked like logic errors rather than a configuration problem.

The agent caught it, diagnosed the root cause, and fixed it:

import sys
from pathlib import Path

# Before: hard-coded, breaks on any machine but the original
sys.path.insert(0, "/home/claude/order-api")

# After: computed dynamically, works everywhere
PROJECT_ROOT = Path(__file__).parent.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

To be clear: this bug was in my code. Code I had written and shipped to the repo. The agent found it during implementation because it was trying to run the tests in a different environment and they failed in a way that forced the diagnosis.

This is a thing that happens at Level 4 that doesn't happen at Level 2. When you're implementing yourself, you don't notice the hard-coded paths because everything works on your machine. When an agent implements in a clean environment, your assumptions get exposed immediately.

My honest reaction

I'll be transparent about something. This API isn't complex. It's an order endpoint with two downstream dependencies and five scenarios. I didn't expect the agent to struggle with it, and it didn't. It hit errors, diagnosed them promptly, and moved on. Five scenarios, all passing.

What struck me wasn't the capability — it was the texture of the experience.

Watching Claude Code work, I found myself doing something I don't usually do when I'm implementing: I was evaluating. Not writing, not debugging, not context-switching. Just reading the agent's reasoning and deciding whether I agreed with it. That's a different cognitive posture entirely. It felt closer to a code review than a coding session.

I also noticed I spent the entire session approving individual commands — every file edit, every pytest run, every pip install. Claude Code asks for permission before each action by default. For this first session I let it. From the next task onward I'm going to configure it to run basic commands without checking in every thirty seconds. There's a trust-building curve here, and I'm on the early part of it.

What this proves — and what it doesn't

Five passing scenarios on a moderately simple API is not proof that Level 5 is solved. It's proof that the approach works at this scale and this complexity.

The honest question — the one this newsletter is actually tracking — is whether it holds as the system grows. Pact tests across services. CI/CD pipelines. Evals as guardrails. Contextual stewardship documents for systems with years of history and undocumented decisions baked into the architecture.

That's where the real test is. And that's where we're going next.

What I'd do differently

One thing the exercise exposed: the spec was good enough for the agent to build correctly, but I had one implicit assumption that didn't make it into the scenarios. The response shape for the success case doesn't specify that status_code should be absent — it just checks for order_id. The agent inferred this correctly, but if it hadn't, the test would have passed anyway.

That's a gap in the spec, not a gap in the agent. The lesson is the same one from Issue #2: every implicit assumption is a decision waiting to cause a bug in production. Write it down. Make it a scenario. Make the machine prove it.
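One way to close that particular gap, sketched as a plain assertion helper rather than the project's actual pytest-bdd step. The function name is hypothetical; the point is only that the absence of status_code becomes something the machine checks.

```python
# Hypothetical helper; the real suite would express this as a Gherkin step.

def assert_success_shape(body: dict) -> None:
    """Pin down the success response, including what must NOT be there."""
    assert body["status"] == "CONFIRMED"
    assert "order_id" in body
    # The previously implicit assumption, now explicit:
    assert "status_code" not in body, "success response must not carry status_code"


# Passes: matches the intended success shape.
assert_success_shape({"status": "CONFIRMED", "order_id": "order-123"})
```

With this in place, an implementation that added a status_code to the success response would fail instead of slipping through.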

Next issue: Phase 3 — adding Pact contract testing between the order service and its dependencies. What happens when the service contract and the mock stub disagree?

Sources & Further Reading

Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Nate B. Jones — natebjones.com
Claude Code documentation
pytest-bdd documentation
Project repository
Session findings - Issue #3

This article was written with the assistance of AI tools.