<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Diya Burman</title>
    <description>The latest articles on DEV Community by Diya Burman (@diyaburman).</description>
    <link>https://dev.to/diyaburman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F93964%2Fa85c0e0d-f413-4c6e-b6a0-b26ddf9b739d.jpeg</url>
      <title>DEV Community: Diya Burman</title>
      <link>https://dev.to/diyaburman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/diyaburman"/>
    <language>en</language>
    <item>
      <title>Prompts Are Disposable. Skills Are Infrastructure.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Mon, 29 Jun 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/diyaburman/prompts-are-disposable-skills-are-infrastructure-575p</link>
      <guid>https://dev.to/diyaburman/prompts-are-disposable-skills-are-infrastructure-575p</guid>
      <description>&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 on a good day (the levels are the ones in this newsletter’s title). The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;Layer 1 is complete. Eight issues, a working order management API, Pact contracts, a CI/CD pipeline, and a spec audit framework. The specification layer is done.&lt;/p&gt;

&lt;p&gt;Layer 2 starts here. And it begins with a question that sounds simple until you think about it: why do you keep rewriting the same prompts?&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with copying prompts
&lt;/h2&gt;

&lt;p&gt;If you've been using AI seriously for more than a few weeks, you have a collection of prompts that work. You've refined them. You copy them between sessions. You paste them into Claude Code at the start of a task and the agent does the right thing.&lt;/p&gt;

&lt;p&gt;That feels like a system. It isn't.&lt;/p&gt;

&lt;p&gt;Here's what copying a prompt actually does: it copies the words. It doesn't copy the contract. The agent reads the words, interprets them in the context of this session, and makes a series of decisions that aren't in the prompt. Different sessions, different context, different decisions — even with the same words. You won't notice until two agents produce incompatible outputs from the same prompt and you have to figure out which one is right.&lt;/p&gt;

&lt;p&gt;A skill is different. A skill specifies what to produce, not just what to consider. It has a version, an output contract, and a routing signal. It gets better over time and the improvements persist. It's the difference between a note you wrote to yourself and infrastructure your whole team — human and agent — can depend on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding the right candidate
&lt;/h2&gt;

&lt;p&gt;I reviewed the entire order-api project to find the best prompt-to-skill conversion candidate. Three instructions surfaced:&lt;/p&gt;

&lt;p&gt;The test-run verification sequence (&lt;code&gt;pytest tests/steps/ -v &amp;amp;&amp;amp; pytest tests/pact/ -v &amp;amp;&amp;amp; python scripts/can_i_deploy.py&lt;/code&gt;) appears in every session. Rejected — it's a procedure, not a judgment call. Any agent can run three commands.&lt;/p&gt;

&lt;p&gt;The findings file protocol appears in CLAUDE.md and has been followed since Issue #3. Rejected — it describes a format and cadence, not a methodology.&lt;/p&gt;

&lt;p&gt;The Gherkin scenario quality evaluation — the methodology for deciding whether a scenario is well-formed before accepting or writing it — appeared across Issues #5, #7, and #8. Every time, the agent re-derived the same judgment framework from scratch. This is the winner.&lt;/p&gt;

&lt;p&gt;Why: it encodes judgment, not procedure. Whether a step is UNDERSPECIFIED or LEAKY ABSTRACTION is a reasoning call. Its output drives everything downstream — every implementation session depends on the scenarios being well-formed. A bad scenario written in a planning session becomes broken step definitions two sessions later.&lt;/p&gt;

&lt;p&gt;And here's the uncomfortable detail: the timeout ambiguity that was fixed in Issue #8 (&lt;code&gt;And the response is returned within 12 seconds&lt;/code&gt;) was introduced in Issue #2. Three sessions inherited it silently before it was caught. A quality evaluation skill running in Issue #2 would likely have flagged it before it was ever committed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The prompt version — and what it gets wrong
&lt;/h2&gt;

&lt;p&gt;Here's the current prompt as it would be pasted into a session:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before writing or accepting a Gherkin scenario, check that it is well-formed. A well-formed scenario describes behavior from the caller's perspective, not from the implementation. Each step should be specific enough that only one implementation can satisfy it. Check for: vague quantities, counts that could be read as total or additional, time bounds without a start anchor, mechanism claims without the mechanism, and internal field names leaking into the spec. If the scenario has these problems, rewrite it before proceeding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Four weaknesses:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No output format.&lt;/strong&gt; The prompt implies the agent will either rewrite or proceed — but doesn't say what to return. A list of problems? An annotated version? The corrected Gherkin? Two agents will produce two different output shapes. A downstream agent consuming either has to parse whichever shape it receives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No taxonomy.&lt;/strong&gt; "Internal field names" means something specific — but the prompt doesn't define what internal means vs external. Two agents draw that line differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No handling for partial problems.&lt;/strong&gt; "Rewrite before proceeding" — does that mean rewrite every step, or only the problematic ones?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context-dependence.&lt;/strong&gt; The prompt was written for reviewing an existing scenario. Used in a planning context ("write a new scenario for DELETE /orders/{id}"), an agent either ignores the check or applies it mid-draft. The two strategies produce different levels of revision.&lt;/p&gt;

&lt;p&gt;The fundamental problem: the prompt describes what to look for but not what to produce. The agent invents the output format. When the output format is invented, nothing guarantees it will be the same twice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The demonstration
&lt;/h2&gt;

&lt;p&gt;Same input. Same endpoint. Prompt version first, skill version second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task:&lt;/strong&gt; Write a Gherkin scenario for &lt;code&gt;DELETE /orders/{order_id}&lt;/code&gt; — cancel a confirmed order. The order can only be cancelled if payment has not been captured yet.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Task A — prompt version output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Cancel a confirmed order before payment capture
  &lt;span class="nf"&gt;Given &lt;/span&gt;a confirmed order exists with order ID &lt;span class="s"&gt;"order-123"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;payment has not been captured for the order
  &lt;span class="nf"&gt;When &lt;/span&gt;the user sends DELETE /orders/order-123
  &lt;span class="nf"&gt;Then &lt;/span&gt;the order is cancelled
  &lt;span class="nf"&gt;And &lt;/span&gt;the response status code is 200
  &lt;span class="nf"&gt;And &lt;/span&gt;the response includes a confirmation message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six implicit decisions the agent made that weren't in the prompt:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;"confirmed order exists"&lt;/em&gt; — method of setup not specified. Seed directly or call POST /orders? Left open.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"payment has not been captured"&lt;/em&gt; — mechanism not specified. Mock server state? Flag on the order?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"the order is cancelled"&lt;/em&gt; — mechanism language. What field changes? What value? Not specified.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"confirmation message"&lt;/em&gt; — any text in any field satisfies this. Undefined.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;200 vs 204&lt;/em&gt; — the agent chose 200 over 204 (no body). Undocumented judgment call.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;No failure scenario&lt;/em&gt;: the task stated that the order "can only be cancelled if payment has not been captured yet", but the agent wrote only the success case.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total implicit decisions: 6. All silent.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Task B — skill version output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Confirmed order is cancelled when payment has not yet been captured
  &lt;span class="nf"&gt;Given &lt;/span&gt;an order was created via POST /orders with order ID &lt;span class="s"&gt;"order-del-001"&lt;/span&gt; and status &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway has not captured payment for order &lt;span class="s"&gt;"order-del-001"&lt;/span&gt;
  &lt;span class="nf"&gt;When &lt;/span&gt;the client sends DELETE /orders/order-del-001
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response HTTP status is 200
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains a &lt;span class="s"&gt;"status"&lt;/span&gt; field with value &lt;span class="s"&gt;"CANCELLED"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains an &lt;span class="s"&gt;"order_id"&lt;/span&gt; field with value &lt;span class="s"&gt;"order-del-001"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway receives no capture or charge requests for order &lt;span class="s"&gt;"order-del-001"&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Cancellation is rejected when payment has already been captured
  &lt;span class="nf"&gt;Given &lt;/span&gt;an order was created via POST /orders with order ID &lt;span class="s"&gt;"order-del-002"&lt;/span&gt; and status &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway has already captured payment for order &lt;span class="s"&gt;"order-del-002"&lt;/span&gt;
  &lt;span class="nf"&gt;When &lt;/span&gt;the client sends DELETE /orders/order-del-002
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response HTTP status is 422
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains an &lt;span class="s"&gt;"error"&lt;/span&gt; field with value &lt;span class="s"&gt;"PAYMENT_ALREADY_CAPTURED"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the order status remains &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two implicit decisions — both surfaced explicitly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;422 vs 409&lt;/em&gt; — the skill's output contract requires documenting assumptions, so this was flagged as a comment rather than silently embedded.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"status" vs "cancellation_status"&lt;/em&gt; — the skill's LEAKY ABSTRACTION check prevented using an implementation-facing field name.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total implicit decisions: 2. Both visible.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The diff
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;th&gt;Classification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"confirmed order exists" → "created via POST /orders with status CONFIRMED"&lt;/td&gt;
&lt;td&gt;SKILL CONSTRAINT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"payment has not been captured" → "the payment gateway has not captured payment for order-del-001"&lt;/td&gt;
&lt;td&gt;SKILL CONSTRAINT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"the order is cancelled" → HTTP status + "status" field with value "CANCELLED"&lt;/td&gt;
&lt;td&gt;QUALITY DELTA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"confirmation message" → specific field name and value&lt;/td&gt;
&lt;td&gt;QUALITY DELTA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;em&gt;(absent)&lt;/em&gt; → "the payment gateway receives no capture or charge requests for order-del-001"&lt;/td&gt;
&lt;td&gt;SKILL CONSTRAINT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;em&gt;(absent)&lt;/em&gt; → full second scenario for failure case&lt;/td&gt;
&lt;td&gt;QUALITY DELTA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Six meaningful differences. Three skill constraints, three quality deltas, six prompt ambiguities eliminated.&lt;/p&gt;




&lt;h2&gt;
  
  
  The three properties skills have that prompts don't
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Version control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A prompt has no version. When you improve it, you copy the new text into the next session. The old version exists in your clipboard history or a chat transcript from three weeks ago. You cannot diff it. You cannot pin a session to it. You cannot see what changed between the prompt that worked and the prompt that produced the wrong output.&lt;/p&gt;

&lt;p&gt;The Gherkin quality skill lives in &lt;code&gt;docs/skills/gherkin-scenario-quality.md&lt;/code&gt;. When Issue #8 added the IMPLICIT FLOW debt class, the skill got a one-line update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gi"&gt;+| IMPLICIT FLOW | A step that implies a follow-up flow that is not specced anywhere |
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every session after that commit uses the updated skill. Every session before it used the previous version. &lt;code&gt;git blame&lt;/code&gt; tells you exactly when IMPLICIT FLOW was added and which issue prompted it. With a prompt, a label like "v1.1" means nothing. There is only "the prompt I'm using today."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Output contract&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The skill specifies exactly what it must return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One or more complete Gherkin scenarios in Given/When/Then format&lt;/li&gt;
&lt;li&gt;All Then clauses must assert a field name AND a value — not just presence&lt;/li&gt;
&lt;li&gt;All counts must use "exactly N" or "no more than N total" — never "N times"&lt;/li&gt;
&lt;li&gt;All time bounds must include a start anchor&lt;/li&gt;
&lt;li&gt;Each external service in a Given clause must be named explicitly&lt;/li&gt;
&lt;li&gt;Assumptions not in the input must appear as &lt;code&gt;# Assumption:&lt;/code&gt; comments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The downstream dependency is the step definition author. When &lt;code&gt;tests/steps/test_order_creation.py&lt;/code&gt; implements &lt;code&gt;And the payment gateway received exactly one charge request&lt;/code&gt; — "exactly one", "charge request", "payment gateway" are all actionable. When it implements "And the response includes a confirmation message" — the author must invent an assertion. That invention is where test coverage becomes unreliable.&lt;/p&gt;

&lt;p&gt;The output contract is the interface between the agent that writes scenarios and the agent that implements from them.&lt;/p&gt;
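
&lt;p&gt;Two of the contract rules above (counts and time bounds) are mechanical enough to check with a script. The sketch below is one possible way to do that, not the repository's code: the function name &lt;code&gt;lint_step&lt;/code&gt;, the regular expressions, and the message wording are all assumptions made for illustration.&lt;/p&gt;

```python
import re

# Hypothetical lint rules derived from two bullets of the output contract.
# The patterns are illustrative assumptions, not the project's real checks.
AMBIGUOUS_COUNT = re.compile(r"\b\d+ times\b")
TIME_BOUND = re.compile(r"\bwithin \d+ (?:milli)?seconds\b")
START_ANCHOR = re.compile(r"\bwithin \d+ (?:milli)?seconds of\b")


def lint_step(step):
    """Return a list of contract violations found in one Gherkin step."""
    problems = []
    # "N times" can mean N retries or N total requests.
    if AMBIGUOUS_COUNT.search(step):
        problems.append("count is ambiguous")
    # "within N seconds" must say what the clock starts from.
    if TIME_BOUND.search(step) and not START_ANCHOR.search(step):
        problems.append("time bound has no start anchor")
    return problems


for step in (
    "And the response is returned within 12 seconds",
    "And the response is returned within 12 seconds of the order being submitted",
    "And the payment gateway is not retried more than 2 times",
    "And the payment gateway receives no more than 2 charge requests total",
):
    print(step, lint_step(step))
```

&lt;p&gt;A check like this only covers the syntactic rules. Whether a step leaks an internal field name or leaves a decision open is still a judgment call, which is why the skill exists.&lt;/p&gt;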

&lt;p&gt;&lt;strong&gt;3. Routing signal description&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The skill's description line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Evaluate and produce well-formed Gherkin scenarios for the order-api project using the five-question debt diagnostic and output contract.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It names the artifact type, the project, the method, and the output. An agent knows exactly when to use this skill and what it will receive.&lt;/p&gt;

&lt;p&gt;A bad description for the same skill:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Help with writing tests and checking scenarios for the project.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"Tests" matches pytest, Pact contracts, unit tests, and Gherkin. "The project" matches any repo. No methodology named means two agents doing "help with writing tests" produce incompatible outputs — which is exactly the problem the skill exists to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  The answer
&lt;/h2&gt;

&lt;p&gt;If both the prompt and the skill produce output that works, the difference is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The prompt produces output that passes today's tests. The skill produces output that a different agent can implement tomorrow without making any decisions you didn't make.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's why copying prompts isn't enough. The words travel. The contract doesn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: The 3-Tier Skill Architecture in Practice — mapping your skills to the right tier and why Tier 2 is where individual expertise becomes organizational leverage.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Agent-First Skills Architecture&lt;/a&gt; · &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/docs/skills/gherkin-scenario-quality.md" rel="noopener noreferrer"&gt;Gherkin quality skill&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-09-skills-infrastructure.md" rel="noopener noreferrer"&gt;Session findings — Issue #9&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>agents</category>
      <category>testing</category>
    </item>
    <item>
      <title>Spec Debt Doesn't Disappear When You Fix It. It Migrates.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Mon, 22 Jun 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/diyaburman/spec-debt-doesnt-disappear-when-you-fix-it-it-migrates-d25</link>
      <guid>https://dev.to/diyaburman/spec-debt-doesnt-disappear-when-you-fix-it-it-migrates-d25</guid>
      <description>




&lt;p&gt;Issue #7 ended with seven spec debt items documented in a project that had been built carefully for seven issues. Every item was passing its tests. None of them announced themselves. They were found by asking a different question: not "does this pass?" but "what would a second agent build from this step?"&lt;/p&gt;

&lt;p&gt;Issue #8 fixes all seven — and builds the tool that found them into something reusable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The seven fixes
&lt;/h2&gt;

&lt;p&gt;I worked through each item one at a time, running the test suite after every individual fix rather than batching them. The discipline matters: if a fix breaks something, you want to know which fix broke it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Fix 1 — Timeout measurement ambiguity&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;the response is returned within 12 seconds

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;the response is returned within 12 seconds of the order being submitted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Of the order being submitted" anchors the clock to client-side HTTP request dispatch — the same moment &lt;code&gt;time.time()&lt;/code&gt; is captured in the step definition. Without this anchor, a second implementation could measure from server receipt, from the last retry attempt, or from when the response body is fully read. All three produce different numbers under load.&lt;/p&gt;
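
&lt;p&gt;To make the anchor concrete, here is a minimal sketch of a measurement that starts at client-side dispatch. It is not the project's step definition: &lt;code&gt;timed_call&lt;/code&gt; and &lt;code&gt;fake_order_submission&lt;/code&gt; are hypothetical names, and the sketch uses &lt;code&gt;time.monotonic()&lt;/code&gt; rather than the &lt;code&gt;time.time()&lt;/code&gt; mentioned above, because a monotonic clock is not affected by system clock adjustments.&lt;/p&gt;

```python
import operator
import time

LIMIT_SECONDS = 12  # bound taken from the scenario step


def timed_call(send_request):
    """Measure from client-side dispatch until the call returns.

    send_request is a hypothetical zero-argument callable that performs
    the HTTP request and returns the response.
    """
    started = time.monotonic()  # start anchor: the client dispatches here
    response = send_request()
    elapsed = time.monotonic() - started
    return response, elapsed


def fake_order_submission():
    # Stand-in for POST /orders so the sketch runs without a server.
    time.sleep(0.01)
    return {"status": "CONFIRMED"}


response, elapsed = timed_call(fake_order_submission)
# operator.le(a, b) is the function form of "a is at most b".
within_limit = operator.le(elapsed, LIMIT_SECONDS)
print(within_limit)
```

&lt;p&gt;Moving the first &lt;code&gt;time.monotonic()&lt;/code&gt; call to any other point (server receipt, last retry) would measure a different interval, which is the ambiguity the rewritten step closes.&lt;/p&gt;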




&lt;p&gt;&lt;strong&gt;Fix 2 — "Retried" vs "total attempts"&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway is not retried more than 2 times

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway receives no more than 2 charge requests total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Retried 2 times" has two valid English readings: 2 retries meaning 3 total requests, or retried up to 2 times meaning 2 total. "No more than 2 charge requests total" counts requests, not retries, and the word "total" makes clear the initial attempt is included. This also changed the assertion in the step definition — from trusting the response body's &lt;code&gt;retry_count&lt;/code&gt; field to checking the actual call count at the mock server. Stronger assertion, same outcome.&lt;/p&gt;
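
&lt;p&gt;The difference between trusting a reported retry count and counting requests where they arrive can be sketched as follows. &lt;code&gt;FakePaymentGateway&lt;/code&gt; and &lt;code&gt;charge_with_limit&lt;/code&gt; are hypothetical stand-ins written for this example; the project's mock server and retry logic may look quite different.&lt;/p&gt;

```python
class FakePaymentGateway:
    """Hypothetical in-process stand-in for the payment gateway mock."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.charge_requests = 0

    def charge(self, order_id):
        self.charge_requests += 1  # count every request, including the first
        if self.failures_left != 0:
            self.failures_left -= 1
            raise ConnectionError("gateway unavailable")
        return {"order_id": order_id, "charged": True}


def charge_with_limit(gateway, order_id, max_requests_total):
    """Try to charge, sending no more than max_requests_total requests."""
    for _ in range(max_requests_total):
        try:
            return gateway.charge(order_id)
        except ConnectionError:
            continue
    return None


gateway = FakePaymentGateway(failures_before_success=5)
result = charge_with_limit(gateway, "order-abc-123", max_requests_total=2)

# The assertion reads the count at the gateway, not a field in the response.
assert gateway.charge_requests in (1, 2)
```

&lt;p&gt;Because the counter lives on the gateway side, the check still holds if the API under test reports its own retry count incorrectly.&lt;/p&gt;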




&lt;p&gt;&lt;strong&gt;Fix 3 — "Released" without mechanism&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;the inventory reservation is released

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;the inventory service receives a reservation release request for SHOE-RED-42 and BELT-BRN-M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Released" says what happened but not how, and not for which items. The rewrite names the items and specifies that a request is sent to the inventory service. This fix also revealed a gap: the current implementation signals release via a response body field (&lt;code&gt;inventory_released: true&lt;/code&gt;) rather than a separate API call to the inventory service. The spec now describes the intended behaviour. The implementation doesn't fully match it yet. That's a future issue — but the gap is now visible rather than hidden.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Fix 4 — "Explicit user action" — removed entirely&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;no order is confirmed without explicit user action

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="err"&gt;(step&lt;/span&gt; &lt;span class="err"&gt;removed)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step implies a follow-up confirmation flow (&lt;code&gt;POST /orders/{id}/confirm&lt;/code&gt; or equivalent) that does not exist anywhere in the codebase. It passes trivially because no order is confirmed in the partial availability scenario — not because the confirmation flow was implemented. A spec step that passes for the wrong reason is not a safety net. It is a false guarantee. If the confirmation flow is built in a future issue, a new scenario should specify it precisely. Leaving this step in place would invite an agent to invent an unspecced endpoint.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Fix 5 — Presence without value assertions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;order_status_bad.feature&lt;/code&gt; timestamp step was asserting only that a field exists and is a non-empty string. Tightened to assert the field name, the value, and the type explicitly. Kept conservative — &lt;code&gt;order_status_bad.feature&lt;/code&gt; is a pedagogical artifact and shouldn't be converted into a good spec, which would defeat its purpose in the newsletter.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Fix 6 — "An order exists" without specifying how&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nf"&gt;Given &lt;/span&gt;an order was successfully placed and confirmed with order ID &lt;span class="s"&gt;"aaa00000-..."&lt;/span&gt;

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="nf"&gt;Given &lt;/span&gt;an order was created via POST /orders and confirmed with order ID &lt;span class="s"&gt;"aaa00000-..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Successfully placed and confirmed" describes the outcome but not the mechanism. "Created via POST /orders" makes explicit that a real creation flow is expected. The step definition currently seeds the order directly into the in-memory store — a shortcut. The rewrite creates a documented gap between spec intent and step implementation. Visible gap, not hidden one.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Fix 7 — "Correct" without definition&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;the notification contains the correct order id and total

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;the notification request body contains order_id &lt;span class="s"&gt;"order-abc-123"&lt;/span&gt; and total 134.97
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Correct" is relative to context that may not be available to the reader. The rewrite hardcodes the expected values established in the When clause. Agents reading the original step would all implement something that checks the notification body, but one might compare against the When-clause values, another might check against a computed total, and a third might only verify field presence. The rewrite leaves only one reading: compare against the stated values.&lt;/p&gt;

&lt;p&gt;This fix also caught something the stub had been hiding: the notification mock was returning &lt;code&gt;"mock-notif-001"&lt;/code&gt; as a notification id. Not a UUID. The format assertion caught it immediately. This is exactly the value of adding concrete assertions — it surfaces stub data that was never valid but was never checked.&lt;/p&gt;
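
&lt;p&gt;A format check of the kind described above can be very small. The sketch below is an assumption about how such a check might look, using only the standard library; &lt;code&gt;is_uuid&lt;/code&gt; is a hypothetical helper, and the notification dictionary simply mirrors the values from the rewritten step.&lt;/p&gt;

```python
import uuid


def is_uuid(value):
    """Return True when value parses as a UUID string."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


# Values taken from the rewritten step; the dictionary shape is illustrative.
notification = {"order_id": "order-abc-123", "total": 134.97}

# Concrete value assertions instead of "correct".
assert notification["order_id"] == "order-abc-123"
assert notification["total"] == 134.97

# The stub id from the mock fails the format check; a real UUID passes.
assert not is_uuid("mock-notif-001")
assert is_uuid(str(uuid.uuid4()))
```

&lt;p&gt;Note that &lt;code&gt;uuid.UUID&lt;/code&gt; also accepts 32 hexadecimal characters without hyphens, so a stricter spec would need a stricter check.&lt;/p&gt;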




&lt;h2&gt;
  
  
  The audit framework
&lt;/h2&gt;

&lt;p&gt;After fixing all seven items, I built the diagnostic tool into a standalone document: &lt;code&gt;docs/spec-audit-framework.md&lt;/code&gt;. The full document is in the repo. Here's the core of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Five questions — ask them for every scenario in every feature file:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q1: Who owns this scenario?&lt;/strong&gt;&lt;br&gt;
Can you name the team, service, or domain this scenario belongs to? If the answer includes "and also", the scenario is in the wrong file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2: What decisions does this scenario leave open?&lt;/strong&gt;&lt;br&gt;
For every Given, When, and Then clause: could two agents build different implementations that both pass? If yes, the step is underspecified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3: Are all terms defined within the file?&lt;/strong&gt;&lt;br&gt;
Every noun that is not a standard HTTP concept or a primitive type should be defined in the scenario or a Background clause. If understanding a term requires reading another file or asking a colleague, it is spec debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4: Does this scenario describe behaviour or implementation?&lt;/strong&gt;&lt;br&gt;
Steps should describe what the system does from the caller's perspective. Any step that references internal concepts — database field names, function names, internal status codes — is leaking implementation into the spec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5: What does this scenario NOT say that it should?&lt;/strong&gt;&lt;br&gt;
List the edge cases, error states, and boundary conditions the scenario implies but does not specify. Each one is a silent assumption waiting to become a production incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Six debt classes:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;What it looks like&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNDERSPECIFIED&lt;/td&gt;
&lt;td&gt;Step present but leaves a decision open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MIXED CONCERN&lt;/td&gt;
&lt;td&gt;Scenario covers more than one service domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNDEFINED TERM&lt;/td&gt;
&lt;td&gt;A noun used without being defined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMBIGUOUS COUNT&lt;/td&gt;
&lt;td&gt;A quantity with two valid interpretations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IMPLICIT FLOW&lt;/td&gt;
&lt;td&gt;Implies a follow-up flow that isn't specced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LEAKY ABSTRACTION&lt;/td&gt;
&lt;td&gt;References implementation details&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What the framework found that the manual audit missed
&lt;/h2&gt;

&lt;p&gt;Applying the five questions to all four fixed feature files surfaced one item the Issue #7 manual audit didn't catch.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;order_status_good.feature&lt;/code&gt;, the Given clause now reads "created via POST /orders" — the fixed version from this session. Q4 flagged it for a different reason than the original audit: the step definition still seeds the order directly into the in-memory store. The spec text is precise. The implementation of the spec takes a shortcut.&lt;/p&gt;

&lt;p&gt;The manual audit looked at feature file text. The framework applies Q4 to step definitions as well — and a step definition that silently does something different from what the spec says is spec debt, even if the test passes.&lt;/p&gt;

&lt;p&gt;This distinction matters: &lt;strong&gt;spec debt can migrate from the feature file into the step definition.&lt;/strong&gt; You fix the scenario, tighten the language, and run the tests: green. But the step definition now implements a shortcut that contradicts the precise step text. The debt moved; it didn't disappear.&lt;/p&gt;
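&lt;p&gt;To make the migration concrete, here is a rough sketch of the two ways a Given step could put a confirmed order in place. Everything in it is hypothetical: &lt;code&gt;FakeOrderApi&lt;/code&gt;, its &lt;code&gt;events&lt;/code&gt; list, and both step helpers are stand-ins invented for illustration, not code from the repository.&lt;/p&gt;

```python
class FakeOrderApi:
    """Hypothetical in-memory stand-in for the order service."""

    def __init__(self):
        self.store = {}
        self.events = []  # side effects a real creation flow would trigger

    def post_orders(self, order_id, items):
        # The full creation flow: record side effects, then persist.
        self.events.append(("inventory_reserved", order_id))
        self.events.append(("payment_charged", order_id))
        self.store[order_id] = {"status": "CONFIRMED", "items": items}
        return {"order_id": order_id, "status": "CONFIRMED"}

    def get_status(self, order_id):
        return self.store[order_id]["status"]


def given_order_seeded_directly(api, order_id):
    # Shortcut: writes to the store and skips the creation flow.
    api.store[order_id] = {"status": "CONFIRMED", "items": []}


def given_order_created_via_post(api, order_id):
    # Matches the step text "created via POST /orders".
    api.post_orders(order_id, items=["SHOE-RED-42"])


seeded, created = FakeOrderApi(), FakeOrderApi()
given_order_seeded_directly(seeded, "order-1")
given_order_created_via_post(created, "order-1")
```

&lt;p&gt;Both setups return &lt;code&gt;CONFIRMED&lt;/code&gt; from the status lookup, so a status-only assertion cannot tell them apart. Only the second one exercises the creation flow that the step text promises.&lt;/p&gt;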




&lt;h2&gt;
  
  
  The scorecard — after all fixes
&lt;/h2&gt;

&lt;p&gt;Applied the framework to the project's four feature files (three audited, one kept as a pedagogical artifact):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;order_creation.feature&lt;/code&gt; — 5 scenarios, 1 debt item remaining (LEAKY ABSTRACTION at step definition level — inventory release mechanism gap from Fix 3)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;order_status_good.feature&lt;/code&gt; — 2 scenarios, 1 debt item remaining (LEAKY ABSTRACTION — step definition seeds order directly rather than via POST /orders)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;notification_service.feature&lt;/code&gt; — 2 scenarios, 0 debt items&lt;/p&gt;

&lt;p&gt;&lt;code&gt;order_status_bad.feature&lt;/code&gt; — kept as pedagogical artifact, not audited for debt&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debt density after fixes: 0.22 items per scenario&lt;/strong&gt; (2 items across the 9 audited scenarios). Both remaining items are LEAKY ABSTRACTION at the step definition level. No AMBIGUOUS COUNT or IMPLICIT FLOW items remain, and those are the two highest-risk classes.&lt;/p&gt;
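&lt;p&gt;The density figure is simple arithmetic over the scorecard. A small sketch, using only the counts listed above (the dictionary layout is just one convenient way to hold them):&lt;/p&gt;

```python
# Scenario and remaining-debt counts copied from the scorecard above.
# order_status_bad.feature is excluded because it was not audited.
scorecard = {
    "order_creation.feature": {"scenarios": 5, "debt_items": 1},
    "order_status_good.feature": {"scenarios": 2, "debt_items": 1},
    "notification_service.feature": {"scenarios": 2, "debt_items": 0},
}

total_scenarios = sum(row["scenarios"] for row in scorecard.values())
total_debt = sum(row["debt_items"] for row in scorecard.values())
debt_density = round(total_debt / total_scenarios, 2)
```

&lt;p&gt;Tracking the same number after each audit gives a rough trend line, though a single density value says nothing about which debt class the remaining items belong to.&lt;/p&gt;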




&lt;h2&gt;
  
  
  The uncomfortable answer
&lt;/h2&gt;

&lt;p&gt;Even after I fixed seven spec debt items and applied a structured audit framework to a project built carefully over eight issues, two debt items remain. Both were introduced by the same sessions that fixed other debt: a precise spec step was written, and the implementation of that step took a shortcut.&lt;/p&gt;

&lt;p&gt;Spec debt is not eliminated by fixing debt. It migrates.&lt;/p&gt;

&lt;p&gt;The practical conclusion: treat step definitions as part of the spec surface, not just as test harness code. A step definition that silently does something different from what the spec says is spec debt, even if the test passes. The audit framework catches both — but only if you apply Q4 to the step definitions as well as the feature text.&lt;/p&gt;

&lt;p&gt;The other finding worth naming: &lt;code&gt;notification_service.feature&lt;/code&gt; scored zero debt items. It was written after eight issues of accumulating lessons about what the previous files got wrong. The absence of debt is not accidental — it's the result of knowing what bad specs look like before writing the next one.&lt;/p&gt;

&lt;p&gt;The best time to write a spec is after you've written a few bad ones. Auditing retroactively and fixing forward is the realistic path. Not "write it right the first time."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: Prompts Are Disposable. Skills Are Infrastructure — the conceptual shift from session-level prompts to versioned, reusable skill definitions. Layer 2 begins.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cucumber.io/docs/gherkin/" rel="noopener noreferrer"&gt;Cucumber + Gherkin documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/docs/spec-audit-framework.md" rel="noopener noreferrer"&gt;Spec audit framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-08-spec-audit.md" rel="noopener noreferrer"&gt;Session findings — Issue #8&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>agents</category>
      <category>testing</category>
    </item>
    <item>
      <title>Your Spec Files Are Lying to You. Mine Were Too.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Mon, 15 Jun 2026 17:30:55 +0000</pubDate>
      <link>https://dev.to/diyaburman/your-spec-files-are-lying-to-you-mine-were-too-1nie</link>
      <guid>https://dev.to/diyaburman/your-spec-files-are-lying-to-you-mine-were-too-1nie</guid>
      <description>&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in the context of the title of this newsletter) on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;Every issue so far has worked with one service and one spec file. Issue #7 changes that. A second service enters the picture — a notification service that the order service calls after a confirmed payment — and with it comes the question that every growing system eventually forces: where do spec file boundaries go?&lt;/p&gt;

&lt;p&gt;The answer turns out to matter more than it looks. And the audit at the end of this issue found seven spec debt items in files we've been running since Issue #2. All passing. All carrying risk.&lt;/p&gt;




&lt;h2&gt;
  
  
  The notification service — and a design decision that has spec implications
&lt;/h2&gt;

&lt;p&gt;The new service is minimal: &lt;code&gt;POST /notifications/order-confirmed&lt;/code&gt; accepts an order id, user id, and total, and returns a notification id and a &lt;code&gt;QUEUED&lt;/code&gt; status. Simple enough. The interesting part is how the order service calls it.&lt;/p&gt;

&lt;p&gt;The call is fire-and-forget.&lt;/p&gt;

&lt;p&gt;When an order is confirmed, the order service starts a daemon thread, fires the notification request, and returns the &lt;code&gt;CONFIRMED&lt;/code&gt; response immediately — without waiting for the notification to succeed. If the notification service is down, slow, or returning errors, the order is still confirmed. The customer gets their confirmation. The notification may or may not arrive.&lt;/p&gt;

&lt;p&gt;This is a deliberate design decision. The order service owns the transaction. The notification service owns delivery. Coupling the order confirmation response to notification delivery would mean a flaky notification service could block order creation — which is a much worse failure mode than a missed notification.&lt;/p&gt;
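&lt;p&gt;The fire-and-forget pattern described above might look roughly like this. It is a sketch under assumptions, not the service's real code: &lt;code&gt;send_notification&lt;/code&gt; is a placeholder that always fails, and &lt;code&gt;confirm_order&lt;/code&gt;, &lt;code&gt;notify_in_background&lt;/code&gt;, and the &lt;code&gt;failures&lt;/code&gt; list are names invented for the example.&lt;/p&gt;

```python
import threading


def send_notification(order_id, user_id, total):
    # Hypothetical stand-in for the HTTP call to
    # POST /notifications/order-confirmed; here it always fails.
    raise ConnectionError("notification service unavailable")


def notify_in_background(order_id, user_id, total, failures):
    try:
        send_notification(order_id, user_id, total)
    except Exception as exc:
        # Record the failure instead of propagating it: the order
        # response must not depend on the outcome of this call.
        failures.append(str(exc))


def confirm_order(order_id, user_id, total, failures):
    worker = threading.Thread(
        target=notify_in_background,
        args=(order_id, user_id, total, failures),
        daemon=True,  # does not keep the process alive on shutdown
    )
    worker.start()
    # Return immediately; the result never depends on the notification.
    return {"order_id": order_id, "status": "CONFIRMED"}, worker


failures = []
response, worker = confirm_order("order-1", "user-1", 134.97, failures)
worker.join(timeout=1)  # only so this example can inspect the failure
```

&lt;p&gt;The response is &lt;code&gt;CONFIRMED&lt;/code&gt; even though the notification call raised. The worker thread is returned here purely so the example can wait for it; a real handler would not hand it back to the caller.&lt;/p&gt;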

&lt;p&gt;But the decision has a direct spec implication: any scenario that asserts &lt;code&gt;Then the order status is "CONFIRMED"&lt;/code&gt; must remain true regardless of what the notification service does. The spec cannot simultaneously require &lt;code&gt;CONFIRMED&lt;/code&gt; and make &lt;code&gt;CONFIRMED&lt;/code&gt; depend on notification success. That would be a hidden coupling — the spec would look independent but the implementation would not be.&lt;/p&gt;

&lt;p&gt;This is the kind of architectural decision that should be in the spec before it's in the code. Once it's in the code it becomes folklore.&lt;/p&gt;




&lt;h2&gt;
  
  
  The wrong way first: one big spec file
&lt;/h2&gt;

&lt;p&gt;Before doing it right I did it wrong deliberately. I added two notification scenarios to the bottom of &lt;code&gt;order_creation.feature&lt;/code&gt; — the existing file that's been covering order creation since Issue #2.&lt;/p&gt;

&lt;p&gt;All 7 tests passed. Green across the board. &lt;code&gt;pytest&lt;/code&gt; has no opinion about spec architecture.&lt;/p&gt;

&lt;p&gt;The problems are structural, not functional:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixed ownership.&lt;/strong&gt; &lt;code&gt;order_creation.feature&lt;/code&gt; line 1 says &lt;code&gt;Feature: Order Creation&lt;/code&gt;. By line 48 it's testing notification delivery. If the notification team changes their contract — say, adding a &lt;code&gt;channel&lt;/code&gt; field to the request — they have to open &lt;code&gt;order_creation.feature&lt;/code&gt; to update it. That file is not theirs. The filename, the feature declaration, and the existing scenarios all signal "this belongs to the order team." The notification scenarios are squatters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The growing file problem.&lt;/strong&gt; At 5 scenarios the file is readable. At 7 it starts to smell. Extrapolate to a real system: 10 downstream services, 5–10 scenarios each, all appended to the originating feature file because each was "triggered by" an order creation event. The file becomes a catch-all that nobody owns and everybody edits. Ownership dissolves into "whoever last touched it."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent routing problem.&lt;/strong&gt; When an agent is handed &lt;code&gt;order_creation.feature&lt;/code&gt; to build against, it must now implement both order logic and notification logic. It cannot know from the file whether the notification call belongs in &lt;code&gt;POST /orders&lt;/code&gt; or in a separate endpoint. It will make a decision — probably the wrong one — and that decision will be baked into the implementation before anyone notices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec debt seed.&lt;/strong&gt; The scenario "Order confirmation succeeds even if notification fails" uses the step &lt;code&gt;"the notification service is unavailable"&lt;/code&gt; without defining what unavailable means. TCP connection refused? 503? A 30-second hang? Each is a different failure mode with different implications for retry logic. An agent will pick one interpretation silently. Two agents will pick different ones. Both implementations will pass the spec. This is spec debt: it forms quietly, passes its tests, and surfaces as a production incident months later.&lt;/p&gt;




&lt;h2&gt;
  
  
  The right way: bounded spec files
&lt;/h2&gt;

&lt;p&gt;After documenting what was wrong, I moved the notification scenarios into their own file: &lt;code&gt;tests/features/notification_service.feature&lt;/code&gt;. Rewrote both scenarios to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Precisely define "unavailable" as &lt;code&gt;503 Service Unavailable&lt;/code&gt;: not a timeout, not a refused connection, not an ambiguous network failure&lt;/li&gt;
&lt;li&gt;Describe the notification contract from the notification service's perspective&lt;/li&gt;
&lt;li&gt;Make the file self-contained — a notification service team reading it wouldn't need to open &lt;code&gt;order_creation.feature&lt;/code&gt; to understand it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;order_creation.feature&lt;/code&gt;: 5 scenarios, all about order creation. No references to notifications.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;notification_service.feature&lt;/code&gt;: 2 scenarios, all about notification delivery behaviour.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file boundary is now a contract boundary. They can be versioned, owned, and handed to different agents independently.&lt;/p&gt;

&lt;p&gt;Bounded spec files are not a tidiness preference. They are a precision tool for multi-agent systems. When a spec file is bounded to one service, an agent can be assigned exactly that file and nothing else. It builds one surface, tests one contract, returns. When the spec bleeds across services, the agent must make decisions about service ownership that were never written down. Those decisions accumulate as hidden assumptions in the implementation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The spec debt audit
&lt;/h2&gt;

&lt;p&gt;With the bounded file structure in place, I audited all four feature files in the project for spec debt — places where the spec passes its tests but leaves decisions that should have been made explicitly.&lt;/p&gt;

&lt;p&gt;Seven items. All passing. All carrying risk.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;1. Ambiguous timeout measurement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;File: &lt;code&gt;order_creation.feature&lt;/code&gt; — Scenario: payment gateway times out&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Step: &lt;code&gt;And the response is returned within 12 seconds&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From when? The client sends the request? The server receives it? The last retry fires? Two agents will instrument this differently and both will pass. "Within 12 seconds of the order being submitted" — defining "submitted" as the moment the HTTP request body is sent — removes the ambiguity.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;2. "Retried" vs "total attempts"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;File: &lt;code&gt;order_creation.feature&lt;/code&gt; — Scenario: payment gateway times out&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Step: &lt;code&gt;And the payment gateway is not retried more than 2 times&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Does this mean 2 total attempts (1 original + 1 retry) or 2 retries on top of the original (3 total)? The English is genuinely ambiguous. An agent will pick one. The test will pass. The production system will behave differently than intended.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;And the payment gateway receives no more than 2 charge requests total&lt;/code&gt; — "requests total" removes all ambiguity about whether the first attempt counts.&lt;/p&gt;
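&lt;p&gt;One way to see what "requests total" pins down is a stub gateway that counts every call. The sketch below is illustrative only: &lt;code&gt;TimingOutGateway&lt;/code&gt;, &lt;code&gt;charge_with_limit&lt;/code&gt;, and the &lt;code&gt;max_total_requests&lt;/code&gt; parameter are hypothetical names, not the project's implementation.&lt;/p&gt;

```python
class TimingOutGateway:
    """Hypothetical payment gateway stub that always times out."""

    def __init__(self):
        self.charge_requests = 0

    def charge(self, amount):
        self.charge_requests += 1
        raise TimeoutError("gateway timed out")


def charge_with_limit(gateway, amount, max_total_requests=2):
    # The limit counts every request, including the first one.
    for _ in range(max_total_requests):
        try:
            return gateway.charge(amount)
        except TimeoutError:
            continue
    return None  # caller treats None as "payment not completed"


gateway = TimingOutGateway()
result = charge_with_limit(gateway, 134.97)
```

&lt;p&gt;With the counter on the stub, the step can assert on the number of requests received, which leaves no room for the "1 original + 2 retries" reading.&lt;/p&gt;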




&lt;p&gt;&lt;strong&gt;3. "Released" is not a mechanism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;File: &lt;code&gt;order_creation.feature&lt;/code&gt; — Scenario: payment declined&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Step: &lt;code&gt;And the inventory reservation is released&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Released" is not defined. Does the inventory service receive a DELETE? A POST to a release endpoint? Does a TTL fire? An agent will implement whichever mechanism seems natural. Two agents will produce incompatible implementations that both pass the spec.&lt;/p&gt;

&lt;p&gt;Fix: Name the items and the mechanism: &lt;code&gt;And the inventory service receives a reservation release request for SHOE-RED-42 and BELT-BRN-M&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;4. "Explicit user action" describes a flow that doesn't exist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;File: &lt;code&gt;order_creation.feature&lt;/code&gt; — Scenario: partial availability&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Step: &lt;code&gt;And no order is confirmed without explicit user action&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Explicit user action" is not defined anywhere in the spec. A second API call? A UI confirmation? A webhook? This step passes trivially because no order is confirmed — the negative condition is true by absence. But it implies a follow-up confirmation flow that was never built, never specced, and never reviewed. If a future agent reads this step and builds a confirmation flow to satisfy it, it will invent something that was never intended.&lt;/p&gt;

&lt;p&gt;Fix: Remove it if the follow-up flow is out of scope. Or replace it with a concrete step: &lt;code&gt;And a subsequent POST to /orders/{order_id}/confirm is required to complete the order&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;5. Presence without value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;File: &lt;code&gt;order_status_bad.feature&lt;/code&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Step: field-name assertions without value or type assertions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Asserting that a field exists only catches absence — not incorrect presence. An agent can return &lt;code&gt;{"status": null}&lt;/code&gt; and pass. The spec catches the wrong thing.&lt;/p&gt;

&lt;p&gt;Fix: Assert the full expected shape with explicit values rather than just field names.&lt;/p&gt;
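&lt;p&gt;The gap between a presence check and a value check is easy to demonstrate. The response below is a made-up example of the failure described above, and the expected shape is an assumption for illustration:&lt;/p&gt;

```python
response = {"order_id": "order-1", "status": None}  # hypothetical bad response

# Presence-only check: passes even though the value is unusable.
presence_ok = "status" in response

# Value check: pins the exact value the scenario expects.
value_ok = response.get("status") == "CONFIRMED"

# Full-shape check: every expected field must match exactly.
expected = {"order_id": "order-1", "status": "CONFIRMED"}
shape_ok = response == expected
```

&lt;p&gt;Only the presence check passes. The other two fail, which is what the spec should have required in the first place.&lt;/p&gt;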




&lt;p&gt;&lt;strong&gt;6. "An order exists" doesn't say how&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;File: &lt;code&gt;order_status_good.feature&lt;/code&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Step: &lt;code&gt;Given an order exists with status "CONFIRMED"&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"An order exists" doesn't specify how it got there — full creation flow, or directly seeded into the store. The two methods produce different side effects. An agent building a test harness may seed the order directly, bypassing the creation flow entirely, which means the status endpoint tests never verify that a real confirmed order is actually readable via the API.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;Given a previously confirmed order created via POST /orders with id "{order_id}"&lt;/code&gt; — or explicitly state that direct seeding is acceptable.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;7. "Correct" is relative&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;File: &lt;code&gt;notification_service.feature&lt;/code&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Step: &lt;code&gt;And the notification contains the correct order id and total&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Correct" compared to what? If the order total is computed, two agents may compute it differently and both pass "correct" against their own computation.&lt;/p&gt;

&lt;p&gt;Fix: Hardcode the expected value: &lt;code&gt;And the notification request body contains order_id matching the confirmed order and total of 134.97&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why all seven of these matter even though they're all green
&lt;/h2&gt;

&lt;p&gt;Every item in that audit passes its test. That is the point.&lt;/p&gt;

&lt;p&gt;Spec debt is not visible in a green CI run. It is visible only when you ask: &lt;em&gt;what would a second agent build from this spec?&lt;/em&gt; The step "the payment gateway is not retried more than 2 times" has been in the codebase since Issue #2. It has passed every run. But it encodes an ambiguity that will be resolved differently by every agent that implements it fresh. The "no order is confirmed without explicit user action" step describes a flow that does not exist anywhere in the codebase. It passes because the negative condition is trivially true.&lt;/p&gt;

&lt;p&gt;If a future agent reads that step and builds a confirmation flow to satisfy it, it will build something that was never specced, never reviewed, and never integrated. The spec invited it. The tests blessed it. Nobody noticed.&lt;/p&gt;

&lt;p&gt;This is the exact failure mode that makes AI-assisted development unreliable at scale. Specs that look precise, pass their tests, and silently invite incompatible implementations. The debt doesn't announce itself. It compounds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the project stands
&lt;/h2&gt;

&lt;p&gt;Fifteen tests passing across four bounded feature files. The notification service is integrated. The Pact contracts — which existed before this session — remain unbroken because the notification call happens after the transaction completes. Adding a new service boundary didn't require touching existing contracts.&lt;/p&gt;

&lt;p&gt;Seven spec debt items documented. None fixed yet. The fixes are the next issue.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: The Spec Audit — applying the debt framework to a real existing service and building the diagnostic tool readers can use on their own codebases.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cucumber.io/docs/gherkin/" rel="noopener noreferrer"&gt;Cucumber + Gherkin documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-07-scope-problem.md" rel="noopener noreferrer"&gt;Session findings — Issue #7&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>agents</category>
      <category>testing</category>
    </item>
    <item>
      <title>My Tests Passed. My Pipeline Caught What They Missed.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Sat, 13 Jun 2026 18:59:27 +0000</pubDate>
      <link>https://dev.to/diyaburman/wiring-the-guardrails-19i</link>
      <guid>https://dev.to/diyaburman/wiring-the-guardrails-19i</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #6&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in the context of the title of this newsletter) on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;Five issues in, everything we've built lives on one machine. The Gherkin scenarios, the WireMock stubs, the Pact contracts, the can-i-deploy script — all of it runs locally, passes locally, and means nothing the moment someone else touches the codebase.&lt;/p&gt;

&lt;p&gt;Issue #6 fixes that. A GitHub Actions pipeline now runs on every push, executes the full specification stack in dependency order, and blocks merges to main if anything breaks. The pipeline is the guardrail. From this point on, a broken contract or a failing scenario cannot reach main undetected.&lt;/p&gt;

&lt;p&gt;Getting there took ninety minutes and two interventions I didn't plan for. Both are worth documenting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before the YAML: deciding what "green" means
&lt;/h2&gt;

&lt;p&gt;The first thing Claude Code did before touching any pipeline config was run the full test suite to establish a baseline. The instruction was explicit: everything must pass before a single line of YAML gets written.&lt;/p&gt;

&lt;p&gt;It found a failure immediately — and it wasn't from the breaking change experiment. It was from Issue #5.&lt;/p&gt;

&lt;p&gt;The bad-spec test (&lt;code&gt;test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order&lt;/code&gt;) was still asserting &lt;code&gt;db_status&lt;/code&gt; in the response body. That was intentional in Issue #5 — the failure was the finding. The session ended with it red because the point was to show what bad specs produce. But on main, with CI incoming, that means the pipeline would have been red on day one before a single feature change.&lt;/p&gt;

&lt;p&gt;The fix was adding backward-compat aliases to the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;            &lt;span class="c1"&gt;# good spec field
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;         &lt;span class="c1"&gt;# bad spec alias — keeps Issue #5 test passing
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;placed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# good spec field
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# bad spec alias
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Neither test file was modified. No feature files were touched. The aliases kept both the good-spec and bad-spec tests passing against the same endpoint.&lt;/p&gt;

&lt;p&gt;The reason this matters before the pipeline exists: a team that starts CI with a known failure trains itself to ignore red. The cost of normalising a red CI is much higher than the cost of fixing the baseline first. Claude Code made the right call and documented it before moving on.&lt;/p&gt;




&lt;h2&gt;
  
  
  The pipeline structure
&lt;/h2&gt;

&lt;p&gt;Four jobs, in dependency order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test → pact-consumer → pact-verify → can-i-deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each job only runs if its predecessor passes. If Gherkin breaks, Pact never runs. If the consumer tests fail, verification never runs. If verification fails, can-i-deploy is skipped. The pipeline fails fast and tells you exactly which layer broke.&lt;/p&gt;

&lt;p&gt;The artifact chain is what makes it a pipeline rather than four independent scripts. The &lt;code&gt;pact-consumer&lt;/code&gt; job generates the &lt;code&gt;.pact&lt;/code&gt; files and uploads them as a GitHub Actions artifact. The &lt;code&gt;pact-verify&lt;/code&gt; job downloads that artifact and verifies it: the same files, not freshly regenerated ones. Without this, each job would build its own consumer contract from scratch, and verification would check a contract rebuilt from the current code rather than the one &lt;code&gt;pact-consumer&lt;/code&gt; actually produced.&lt;/p&gt;

&lt;p&gt;One non-obvious piece: &lt;code&gt;mock_server.py&lt;/code&gt; is a library module with no command-line entry point. The pipeline needed servers running as background processes. The fix was an inline Python invocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Start mock servers&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;. .venv/bin/activate&lt;/span&gt;
    &lt;span class="s"&gt;python -c "&lt;/span&gt;
    &lt;span class="s"&gt;import time&lt;/span&gt;
    &lt;span class="s"&gt;from mock_server import start_mock_server&lt;/span&gt;
    &lt;span class="s"&gt;start_mock_server(8091, 'wiremock/payment-mappings')&lt;/span&gt;
    &lt;span class="s"&gt;start_mock_server(8092, 'wiremock/inventory-mappings')&lt;/span&gt;
    &lt;span class="s"&gt;time.sleep(86400)&lt;/span&gt;
    &lt;span class="s"&gt;" &amp;amp;&lt;/span&gt;
    &lt;span class="s"&gt;sleep 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;time.sleep(86400)&lt;/code&gt; keeps the process alive for the duration of the job. Inelegant but functional. A proper &lt;code&gt;if __name__ == "__main__"&lt;/code&gt; entry point with argparse is the obvious cleanup for a future session.&lt;/p&gt;
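
&lt;p&gt;A minimal sketch of what that cleanup might look like, assuming &lt;code&gt;start_mock_server(port, mappings_dir)&lt;/code&gt; keeps the signature used in the YAML step above. The &lt;code&gt;--server&lt;/code&gt; flag, &lt;code&gt;parse_server_spec&lt;/code&gt;, and the &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;block&lt;/code&gt; parameters are hypothetical names of mine, not code from the repository.&lt;/p&gt;

```python
import argparse
import threading


def parse_server_spec(spec):
    """Split 'PORT:MAPPINGS_DIR' into (port, mappings_dir)."""
    port_text, sep, mappings_dir = spec.partition(":")
    if not sep or not port_text.isdigit() or not mappings_dir:
        raise argparse.ArgumentTypeError(
            f"expected PORT:MAPPINGS_DIR, got {spec!r}"
        )
    return int(port_text), mappings_dir


def build_parser():
    parser = argparse.ArgumentParser(description="Start one or more mock servers.")
    parser.add_argument(
        "--server",
        dest="servers",
        action="append",
        type=parse_server_spec,
        required=True,
        metavar="PORT:MAPPINGS_DIR",
        help="repeatable, e.g. --server 8091:wiremock/payment-mappings",
    )
    return parser


def main(argv, start, block=True):
    """Parse arguments, start each server, then optionally block.

    `start` is the callable that launches one server. In the real module
    it would presumably be start_mock_server (an assumption).
    """
    args = build_parser().parse_args(argv)
    for port, mappings_dir in args.servers:
        start(port, mappings_dir)
    if block:
        # Wait indefinitely instead of sleeping for a fixed 86400 seconds.
        threading.Event().wait()
    return args.servers


# Inside mock_server.py the guard might read:
# if __name__ == "__main__":
#     import sys
#     main(sys.argv[1:], start_mock_server)
```

&lt;p&gt;With something like this in place, the inline &lt;code&gt;python -c&lt;/code&gt; string could become a single command with one &lt;code&gt;--server&lt;/code&gt; flag per mock.&lt;/p&gt;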




&lt;h2&gt;
  
  
  The first CI run — and why I had to intervene manually
&lt;/h2&gt;

&lt;p&gt;The YAML was committed, pushed to main, and the pipeline ran. All three runs failed on the &lt;code&gt;test&lt;/code&gt; job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OSError: [Errno 98] Address already in use
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ports 8091 and 8092. Every test in &lt;code&gt;test_order_creation.py&lt;/code&gt; errored at setup. The order status tests — which don't use the mock servers — passed fine.&lt;/p&gt;

&lt;p&gt;Claude Code didn't catch this on its own. Here's why that's worth explaining.&lt;/p&gt;

&lt;p&gt;When Claude Code wrote the pipeline, it was working from the codebase and its own knowledge of GitHub Actions patterns. It knew the mock servers needed to be running before pytest started, so it added an explicit start-servers step to the YAML — a reasonable decision based on the information it had. What it couldn't see was the runtime interaction between that YAML step and pytest's session-scoped fixtures, because that interaction only manifests in the CI environment, not locally.&lt;/p&gt;

&lt;p&gt;Locally, running &lt;code&gt;pytest tests/steps/ -v&lt;/code&gt; has always worked correctly because the session fixture starts the servers and nothing else competes. Claude Code had only ever seen local runs succeed. It had no signal that the YAML step was creating a conflict — because the conflict doesn't exist locally.&lt;/p&gt;

&lt;p&gt;This is a fundamental limit of the "paste and walk away" approach at the boundary between local and remote environments: the agent can reason about the codebase and about CI patterns, but it can't observe the CI run itself. The failure was on GitHub. Claude Code was in a terminal. Those two things weren't connected.&lt;/p&gt;

&lt;p&gt;I diagnosed the error from the GitHub Actions log, explained the root cause, and pasted new instructions. Claude Code fixed it in one step — removing the redundant YAML steps entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Removed from both test and pact-verify jobs:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Start mock servers&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;. .venv/bin/activate&lt;/span&gt;
    &lt;span class="s"&gt;python -c "..." &amp;amp;&lt;/span&gt;
    &lt;span class="s"&gt;sleep 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pytest session fixtures already own server lifecycle correctly. &lt;code&gt;scope="session"&lt;/code&gt; means pytest starts the servers once per test run and keeps them alive. The YAML step was duplicating a responsibility that was already handled. The fix wasn't a workaround — it was removing the wrong layer.&lt;/p&gt;

&lt;p&gt;The root cause in plain terms: the YAML step and the pytest fixture both thought they were responsible for starting the servers. The port was already bound when the fixture tried to bind it again. Works on my machine. Breaks in CI. Classic.&lt;/p&gt;
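
&lt;p&gt;The failure is easy to reproduce outside CI. This sketch uses only the standard library: it starts one listener, then tries to start a second on the same port, which is roughly what the YAML step and the pytest fixture did to each other. The helper names are illustrative, and the port is picked by the OS rather than hard-coded to 8091.&lt;/p&gt;

```python
import errno
import socket


def bind_listener(port):
    """Bind and listen on 127.0.0.1:port, as a server start-up would."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError:
        sock.close()
        raise
    sock.listen()
    return sock


def second_bind_errno():
    """Start one listener, then try to start another on the same port.

    Returns the errno of the second attempt, or None if it succeeded.
    """
    first = bind_listener(0)  # port 0 asks the OS for any free port
    port = first.getsockname()[1]
    try:
        second = bind_listener(port)
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return None
    finally:
        first.close()


print(second_bind_errno() == errno.EADDRINUSE)
```

&lt;p&gt;Whichever layer binds first wins, and the second one gets &lt;code&gt;EADDRINUSE&lt;/code&gt;. The fix is to give the port exactly one owner.&lt;/p&gt;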




&lt;h2&gt;
  
  
  The breaking change experiment — in the pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pme03rx2l6f1uwkhb5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pme03rx2l6f1uwkhb5f.png" alt="All four jobs green — 1m 34s. SAFE TO DEPLOY." width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the pipeline green, the breaking change test ran as designed.&lt;/p&gt;

&lt;p&gt;Branch &lt;code&gt;test/breaking-change-pipeline&lt;/code&gt;, commit &lt;code&gt;76c0d89&lt;/code&gt;: renamed &lt;code&gt;"status"&lt;/code&gt; to &lt;code&gt;"result"&lt;/code&gt; in &lt;code&gt;wiremock/payment-mappings/payment-success.json&lt;/code&gt;. Same change as Issue #4, now running through CI instead of local verification.&lt;/p&gt;

&lt;p&gt;The expected failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="p"&gt;a successful payment charge (FAILED)
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;Failures:
1) Verifying a pact between OrderService and PaymentGateway
&lt;/span&gt;   1.1) has a matching body
          $ -&amp;gt; Actual map is missing the following keys: status
   {
     "amount": 134.97,
  -  "status": "ACCEPTED",
  +  "result": "ACCEPTED",
     "transaction_id": "txn-abc-123"
   }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pact-verify&lt;/code&gt; fails. &lt;code&gt;can-i-deploy&lt;/code&gt; is skipped. The merge is blocked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssmd6jo21xf0fw95613k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssmd6jo21xf0fw95613k.png" alt="pact-verify catches the broken contract. can-i-deploy never runs." width="799" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the key point from Issue #4 holds at the pipeline level: the &lt;code&gt;test&lt;/code&gt; job — the Gherkin suite — would pass with the breaking change in place. The order creation scenarios check HTTP status codes and business outcomes. They never read &lt;code&gt;pay_resp.json()["status"]&lt;/code&gt;. A stub returning &lt;code&gt;result&lt;/code&gt; instead of &lt;code&gt;status&lt;/code&gt; still returns HTTP 200. Gherkin passes. Pact catches it.&lt;/p&gt;

&lt;p&gt;This is the division of labour. Gherkin proves the system does the right thing. Pact proves the contracts don't drift. You need both, and now both run automatically on every push.&lt;/p&gt;
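
&lt;p&gt;Here is a stripped-down sketch of why the two layers disagree about the same response. It does not use pytest-bdd or Pact. Both check functions are hypothetical stand-ins for what each layer inspects, and the payloads mirror the diff above.&lt;/p&gt;

```python
def gherkin_style_check(http_status, body):
    """Outcome-level check: only the HTTP status is inspected.

    `body` is accepted but deliberately ignored, like the order
    creation scenarios that never read the payment response fields.
    """
    return http_status == 200


def contract_style_check(body, required_keys=("amount", "status", "transaction_id")):
    """Contract-level check: list the keys the consumer expects but cannot find."""
    return [key for key in required_keys if key not in body]


original = {"amount": 134.97, "status": "ACCEPTED", "transaction_id": "txn-abc-123"}
renamed = {"amount": 134.97, "result": "ACCEPTED", "transaction_id": "txn-abc-123"}

print(gherkin_style_check(200, renamed))   # the outcome-level check still passes
print(contract_style_check(renamed))       # the contract-level check reports the missing key
```

&lt;p&gt;The first check cannot see the rename because it never reads the body. The second exists only to read the body.&lt;/p&gt;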




&lt;h2&gt;
  
  
  The one step that requires the GitHub UI
&lt;/h2&gt;

&lt;p&gt;Claude Code cannot configure branch protection rules — that requires the GitHub web UI or admin API. This step is non-negotiable and must be done manually:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Repo → &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Branches&lt;/strong&gt; → &lt;strong&gt;Add branch protection rule&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Branch name pattern: &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;Require status checks to pass before merging&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add all four status checks: &lt;code&gt;test&lt;/code&gt;, &lt;code&gt;pact-consumer&lt;/code&gt;, &lt;code&gt;pact-verify&lt;/code&gt;, &lt;code&gt;can-i-deploy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;Require branches to be up to date before merging&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Save&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without this, the pipeline is advisory. A push to main can still happen even if all four jobs are red. The pipeline becomes a dashboard — it shows you the problem but doesn't stop anything. Branch protection is what turns "CI failed" from a notification into enforcement. The pipeline is only a guardrail if something stops you going around it.&lt;/p&gt;
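
&lt;p&gt;If you would rather script this than click through it, the same settings can be sent to GitHub's REST endpoint &lt;code&gt;PUT /repos/{owner}/{repo}/branches/{branch}/protection&lt;/code&gt; by someone with admin access. The sketch below only builds the request body and sends nothing. The &lt;code&gt;enforce_admins&lt;/code&gt; value and the two &lt;code&gt;None&lt;/code&gt; fields are my assumptions, not settings from this project, so check them against the current API reference before using them.&lt;/p&gt;

```python
import json


def branch_protection_payload(checks, require_up_to_date=True):
    """Build a request body for the branch protection endpoint.

    Only required_status_checks reflects the steps listed above.
    The other three fields are assumptions for illustration.
    """
    return {
        "required_status_checks": {
            # "strict" corresponds to requiring branches to be up to date.
            "strict": require_up_to_date,
            "contexts": list(checks),
        },
        "enforce_admins": True,                 # assumption
        "required_pull_request_reviews": None,  # assumption: no review rule
        "restrictions": None,                   # assumption: no push restrictions
    }


payload = branch_protection_payload(
    ["test", "pact-consumer", "pact-verify", "can-i-deploy"]
)
print(json.dumps(payload, indent=2))
```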




&lt;h2&gt;
  
  
  The honest part
&lt;/h2&gt;

&lt;p&gt;The YAML took about twenty minutes to write. The session took ninety minutes total — because the baseline fix and the port conflict ate the rest.&lt;/p&gt;

&lt;p&gt;The instinct during the baseline audit was to skip past the known failure. It's a demo test, we know why it's there, configure CI to skip that file and move on. That would have been thirty seconds. It also would have been wrong — a pipeline with documented exceptions is a pipeline people route around.&lt;/p&gt;

&lt;p&gt;The instinct during the port conflict was to blame the CI environment. Ubuntu runs things differently, ports work differently, it's a platform quirk. That framing would have sent the debugging in the wrong direction. The actual cause was simpler: two layers both thought they owned the same responsibility, and nobody had written down which one was actually in charge.&lt;/p&gt;

&lt;p&gt;Both of those moments are the J-curve. Not the YAML — the discipline of not skipping and not blaming the environment. The overhead of CI is not the config file. It's every decision about what "green" actually means and who's responsible for what.&lt;/p&gt;

&lt;p&gt;The pipeline is now real infrastructure. The breaking change can't reach main. That's worth ninety minutes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: The Scope Problem — scaling Gherkin across a multi-service system. What happens when one spec file isn't enough, and how spec debt forms.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/actions" rel="noopener noreferrer"&gt;GitHub Actions documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.pact.io" rel="noopener noreferrer"&gt;Pact documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-06-cicd-guardrails.md" rel="noopener noreferrer"&gt;Session findings — Issue #6&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>The AI Built the Wrong Thing. Every Test Passed.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 02:40:42 +0000</pubDate>
      <link>https://dev.to/diyaburman/the-spec-that-doesnt-lie-5a00</link>
      <guid>https://dev.to/diyaburman/the-spec-that-doesnt-lie-5a00</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #5&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 on a good day (the levels are the ones the newsletter’s title refers to). The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;Every issue so far has assumed something I haven't said out loud: that the specs are good. Issue #2 wrote them carefully. Issue #3 handed them to an agent and watched it build correctly. Issue #4 proved the contracts survive provider drift.&lt;/p&gt;

&lt;p&gt;But what happens when the spec isn't good? Not broken — Gherkin syntax is fine, tests pass, the agent builds something. Just imprecise. Vague in ways that feel precise when you're writing them.&lt;/p&gt;

&lt;p&gt;This issue answers that question by doing the thing deliberately. I wrote bad Gherkin on purpose, handed it to the agent, watched what it built — and then rewrote the spec and did it again. The difference between the two implementations is the article.&lt;/p&gt;




&lt;h2&gt;
  
  
  The hardest thing about bad specs
&lt;/h2&gt;

&lt;p&gt;Bad specs are hard to spot when you're writing them because they feel complete.&lt;/p&gt;

&lt;p&gt;A scenario that references implementation details sounds like reasonable description — you wrote the implementation, so the details feel like specifics. A Given clause that feels obvious to you will be interpreted differently by every reader who hasn't seen the code. The Gherkin is syntactically correct. The tests pass. Nothing in the output signals that anything is wrong.&lt;/p&gt;

&lt;p&gt;This is the trap. It's not that bad specs break things. It's that they don't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The endpoint
&lt;/h2&gt;

&lt;p&gt;I added a new endpoint to the order-api project: &lt;code&gt;GET /orders/{order_id}/status&lt;/code&gt;. It returns the current status of an order and relevant metadata. Simple enough that the spec should be easy to write well. Which makes it a good target for writing it badly on purpose.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bad specs
&lt;/h2&gt;

&lt;p&gt;Two scenarios. Both syntactically valid. Both produce passing tests. Both wrong in different ways.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# BAD SPEC 1 — The leaky spec&lt;/span&gt;
&lt;span class="c"&gt;# Problem: references internal implementation concepts (db_status, order_created_at)&lt;/span&gt;
&lt;span class="c"&gt;# rather than describing what a caller observes. The agent uses these names literally&lt;/span&gt;
&lt;span class="c"&gt;# in the response body, leaking storage terminology into the public API contract.&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Retrieving status for a confirmed order
  &lt;span class="nf"&gt;Given &lt;/span&gt;an order exists in the system with db_status &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
  &lt;span class="nf"&gt;When &lt;/span&gt;I request GET /orders/{order_id}/status
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response should contain the db_status field set to &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the order_created_at field should be populated from the order record

&lt;span class="c"&gt;# BAD SPEC 2 — The vague Given&lt;/span&gt;
&lt;span class="c"&gt;# Problem: "an order that has not been placed" is underspecified. The agent must&lt;/span&gt;
&lt;span class="c"&gt;# guess what this means — a malformed ID? A well-formed UUID with no record?&lt;/span&gt;
&lt;span class="c"&gt;# A deleted order? Each interpretation is plausible and produces different behavior.&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Retrieving status for an order that does not exist
  &lt;span class="nf"&gt;Given &lt;/span&gt;an order that has not been placed
  &lt;span class="nf"&gt;When &lt;/span&gt;I request GET /orders/{order_id}/status
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response should indicate the order was not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both passed immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tests/steps/test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order PASSED
tests/steps/test_order_status_bad.py::test_retrieving_status_for_an_order_that_does_not_exist PASSED

2 passed in 0.34s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Green. No warnings. No hint that anything is wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the agent built from the bad specs
&lt;/h2&gt;

&lt;p&gt;Here's the implementation the agent produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/orders/{order_id}/status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_order_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It satisfies the spec completely. It also made four decisions the spec never made:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: The field is named &lt;code&gt;db_status&lt;/code&gt; in the response.&lt;/strong&gt;&lt;br&gt;
The spec said &lt;code&gt;db_status&lt;/code&gt; so the agent used &lt;code&gt;db_status&lt;/code&gt;. It never questioned whether this was an internal name leaking into a public API. It satisfied the spec literally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: A missing order returns 404.&lt;/strong&gt;&lt;br&gt;
The spec says "indicate the order was not found." 404 is a defensible interpretation. So are 422, 403, or a 200 with a &lt;code&gt;NOT_FOUND&lt;/code&gt; status field. The agent picked the most conventional option, but the spec never mandated it. The body shape was left open too: FastAPI serialises an &lt;code&gt;HTTPException&lt;/code&gt; as &lt;code&gt;{"detail": "Order not found"}&lt;/code&gt;, not &lt;code&gt;{"error": "Order not found"}&lt;/code&gt;, so a client reading &lt;code&gt;response.json()["error"]&lt;/code&gt; gets a &lt;code&gt;KeyError&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 3: The timestamp field is named &lt;code&gt;order_created_at&lt;/code&gt; with no format requirement.&lt;/strong&gt;&lt;br&gt;
The spec says "populated from the order record." The agent took the name &lt;code&gt;order_created_at&lt;/code&gt; straight from the spec and returned an ISO string because that's what &lt;code&gt;datetime.utcnow().isoformat()&lt;/code&gt; produces. The step definition checked only that the field is a non-empty string, so any string format would have passed. A Unix timestamp rendered as a string would have passed. A human-readable string like "June 2nd" would have passed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 4: The order store is in-memory.&lt;/strong&gt;&lt;br&gt;
The spec says nothing about persistence. An in-memory dict is the simplest thing that makes the tests pass. In production, orders are persisted. The in-memory store vanishes on restart and isn't shared across worker processes.&lt;/p&gt;

&lt;p&gt;Every one of these decisions is plausible. The agent made the reasonable call every time. That's not the problem. The problem is that a different agent, given the same spec, might have made different reasonable calls — and both implementations would pass the same test suite.&lt;/p&gt;
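
&lt;p&gt;To see how two reasonable readings can both go green, here is a framework-free sketch. The two handlers and the check are hypothetical. They stand in for two agents' implementations and for a step definition that only looks at the status code, which is my assumption about what "indicate the order was not found" was reduced to.&lt;/p&gt;

```python
def implementation_a(orders, order_id):
    """One reasonable reading: 404 with a 'detail' key."""
    if order_id not in orders:
        return 404, {"detail": "Order not found"}
    return 200, {"order_id": order_id, "db_status": orders[order_id]}


def implementation_b(orders, order_id):
    """Another reasonable reading: 404 with an 'error' key."""
    if order_id not in orders:
        return 404, {"error": "Order not found"}
    return 200, {"order_id": order_id, "db_status": orders[order_id]}


def vague_not_found_check(handler):
    """Hypothetical step for 'should indicate the order was not found'."""
    status_code, _body = handler({}, "order-xyz-999")
    return status_code == 404


for handler in (implementation_a, implementation_b):
    print(handler.__name__, vague_not_found_check(handler))
```

&lt;p&gt;Both handlers satisfy the check, yet a client written against one breaks on the other. The test suite has no opinion on which is correct.&lt;/p&gt;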


&lt;h2&gt;
  
  
  The rewrite
&lt;/h2&gt;

&lt;p&gt;Writing the good spec forced every decision the bad spec had silently delegated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# GOOD SPEC 1 — Caller's perspective, not implementation's&lt;/span&gt;
&lt;span class="c"&gt;# Fixed: field names describe what the caller observes (status, placed_at)&lt;/span&gt;
&lt;span class="c"&gt;# not what the storage layer calls them (db_status, order_created_at).&lt;/span&gt;
&lt;span class="c"&gt;# The format of placed_at is now a contract obligation, not an assumption.&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Confirmed order status is returned with placement timestamp
  &lt;span class="nf"&gt;Given &lt;/span&gt;a confirmed order with id &lt;span class="s"&gt;"order-abc-123"&lt;/span&gt; exists in the system
  &lt;span class="nf"&gt;When &lt;/span&gt;I request GET /orders/order-abc-123/status
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response status code is 200
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains &lt;span class="s"&gt;"order_id"&lt;/span&gt; equal to &lt;span class="s"&gt;"order-abc-123"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains &lt;span class="s"&gt;"status"&lt;/span&gt; equal to &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains &lt;span class="s"&gt;"placed_at"&lt;/span&gt; as a valid ISO 8601 timestamp

&lt;span class="c"&gt;# GOOD SPEC 2 — Precise Given, explicit 404 body shape&lt;/span&gt;
&lt;span class="c"&gt;# Fixed: the Given names a specific id that has no corresponding record, so the case is unambiguous.&lt;/span&gt;
&lt;span class="c"&gt;# The 404 response body shape is now a contract obligation, not a guess.&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Unknown order id returns 404 with error message
  &lt;span class="nf"&gt;Given &lt;/span&gt;no order with id &lt;span class="s"&gt;"order-xyz-999"&lt;/span&gt; exists in the system
  &lt;span class="nf"&gt;When &lt;/span&gt;I request GET /orders/order-xyz-999/status
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response status code is 404
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains an &lt;span class="s"&gt;"error"&lt;/span&gt; field
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what changed. The scenarios describe the same two situations. The intent is identical. But now every decision is in the spec rather than in the agent's interpretation of the spec.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the agent built from the good spec
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/orders/{order_id}/status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_order_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;placed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same endpoint. Same logic. Different API.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;db_status&lt;/code&gt; became &lt;code&gt;status&lt;/code&gt;. &lt;code&gt;order_created_at&lt;/code&gt; became &lt;code&gt;placed_at&lt;/code&gt;. The 404 body now contains &lt;code&gt;error&lt;/code&gt; not &lt;code&gt;detail&lt;/code&gt;. The timestamp is now asserted to be ISO 8601 — not just non-empty.&lt;/p&gt;

&lt;p&gt;These are not cosmetic differences. They are different contracts that clients build against.&lt;/p&gt;
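
&lt;p&gt;The timestamp assertion is the easiest of these to show in code. This is a sketch of the two checks side by side, not the project's actual step definitions. The function names are mine, and it leans on &lt;code&gt;datetime.fromisoformat&lt;/code&gt;, which accepts only a subset of ISO 8601 on Python versions before 3.11.&lt;/p&gt;

```python
from datetime import datetime


def is_non_empty_string(value):
    """The bad-spec check: any non-empty string is accepted."""
    return isinstance(value, str) and value != ""


def is_iso_8601(value):
    """The good-spec check: the string must parse as an ISO 8601 timestamp."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


for candidate in ("2026-06-02T10:15:30", "June 2nd"):
    print(candidate, is_non_empty_string(candidate), is_iso_8601(candidate))
```

&lt;p&gt;"June 2nd" gets through the first check and is rejected by the second. That gap is the difference between an assumption and a contract obligation.&lt;/p&gt;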




&lt;h2&gt;
  
  
  The cross-run
&lt;/h2&gt;

&lt;p&gt;After building from the good spec, I ran the bad-spec tests against the new implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tests/steps/test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order FAILED
tests/steps/test_order_status_bad.py::test_retrieving_status_for_an_order_that_does_not_exist PASSED

E   KeyError: 'db_status'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The leaky test failed. The field &lt;code&gt;db_status&lt;/code&gt; doesn't exist in the good implementation — it's been renamed to &lt;code&gt;status&lt;/code&gt;, which is what a caller should see. The test that was checking for an internal name is now broken, correctly.&lt;/p&gt;

&lt;p&gt;The vague test passed. Both implementations return a 404 for a missing order. The good implementation reaches the same result, but this time because the spec states it explicitly.&lt;/p&gt;

&lt;p&gt;That asymmetry is instructive. The vague Given produced the right answer by coincidence. The leaky Then produced the wrong field name by construction. One was luck. One was baked in.&lt;/p&gt;
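&lt;p&gt;A minimal sketch of that cross-run, reduced to plain functions. This is not the project's code: the handler names, the in-memory &lt;code&gt;ORDERS&lt;/code&gt; store, and the error message text are assumptions made for illustration. Only the field names and the 404 body keys come from the two implementations described above.&lt;/p&gt;

```python
# Hypothetical miniature of the two implementations; all names are illustrative.
ORDERS = {
    "ord-1": {"db_status": "CONFIRMED", "order_created_at": "2026-06-01T10:00:00+00:00"},
}


def get_order_bad(order_id):
    """Bad-spec build: internal column names leak into the response."""
    order = ORDERS.get(order_id)
    if order is None:
        return 404, {"detail": "Order not found"}
    return 200, {
        "db_status": order["db_status"],
        "order_created_at": order["order_created_at"],
    }


def get_order_good(order_id):
    """Good-spec build: caller-facing names and an explicit error body."""
    order = ORDERS.get(order_id)
    if order is None:
        return 404, {"error": "Order not found"}
    return 200, {
        "status": order["db_status"],
        "placed_at": order["order_created_at"],
    }


def bad_test_confirmed(handler):
    code, body = handler("ord-1")
    assert code == 200
    assert body["db_status"] == "CONFIRMED"  # leaky Then: asserts an internal name


def bad_test_missing(handler):
    code, _ = handler("no-such-order")
    assert code == 404  # vague: only the status code is checked


def cross_run(handler):
    """Run the bad-spec tests against any handler and collect the outcomes."""
    results = {}
    for test in (bad_test_confirmed, bad_test_missing):
        try:
            test(handler)
            results[test.__name__] = "PASSED"
        except (KeyError, AssertionError) as exc:
            results[test.__name__] = "FAILED: " + type(exc).__name__
    return results
```

&lt;p&gt;Calling &lt;code&gt;cross_run(get_order_good)&lt;/code&gt; reports the leaky test as failed with a &lt;code&gt;KeyError&lt;/code&gt; and the vague test as passed, while &lt;code&gt;cross_run(get_order_bad)&lt;/code&gt; is green on both.&lt;/p&gt;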




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Both implementations pass their own test suites. That is the trap.&lt;/p&gt;

&lt;p&gt;If you run the bad-spec tests against the bad-spec implementation: green. If you run the good-spec tests against the good-spec implementation: green. The difference only surfaces when you cross-run — and in production, you never cross-run. You ship the bad implementation, it passes CI, and the problem lands in a client exception report six months later.&lt;/p&gt;

&lt;p&gt;Here's the concrete difference: the bad-spec implementation returns &lt;code&gt;db_status&lt;/code&gt; and &lt;code&gt;order_created_at&lt;/code&gt; with no format guarantee. The good-spec implementation returns &lt;code&gt;status&lt;/code&gt; and &lt;code&gt;placed_at&lt;/code&gt; with a mandatory ISO 8601 format. An agent given the bad spec had no way to know that &lt;code&gt;db_status&lt;/code&gt; was wrong — the spec said &lt;code&gt;db_status&lt;/code&gt;. An agent given the good spec had no choice but to produce &lt;code&gt;status&lt;/code&gt; — the spec said &lt;code&gt;status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Spec quality is not about whether tests pass. It is about how much of the implementation the spec author wrote versus how much was silently delegated to the agent. Every silent delegation is a place where two agents given the same spec produce different code: both versions pass, but they disagree on the contract.&lt;/p&gt;

&lt;p&gt;At scale — dozens of endpoints, hundreds of scenarios — that disagreement is the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The practical test for a good spec
&lt;/h2&gt;

&lt;p&gt;Before handing any scenario to an agent, ask one question: &lt;em&gt;what decisions does this scenario leave open?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "none — every field name, format, response code, and body shape is specified," the spec is ready. If the answer is "a few reasonable ones," those are the places where your implementation and the next agent's implementation will silently diverge.&lt;/p&gt;

&lt;p&gt;The agent will always make reasonable decisions. That's not the problem. The problem is that reasonable is not the same as specified — and at Level 4, specified is the only thing that counts.&lt;/p&gt;
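&lt;p&gt;One way to make "specified" concrete is to turn the closed decisions into a check that leaves nothing to interpretation. The sketch below is an assumption about how that could look, not code from the project: &lt;code&gt;EXPECTED_KEYS&lt;/code&gt; and &lt;code&gt;check_order_body&lt;/code&gt; are hypothetical names, and &lt;code&gt;datetime.fromisoformat&lt;/code&gt; accepts only a subset of ISO 8601 on older Python versions.&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical contract: the exact keys a caller is promised, nothing more.
EXPECTED_KEYS = {"status", "placed_at"}


def check_order_body(body):
    """Return a list of contract violations; an empty list means the body conforms."""
    problems = []
    if set(body) != EXPECTED_KEYS:
        problems.append("unexpected key set: " + str(sorted(body)))
    placed_at = body.get("placed_at")
    try:
        # Parsing is a stricter check than "non-empty string".
        datetime.fromisoformat(placed_at)
    except (TypeError, ValueError):
        problems.append("placed_at is not ISO 8601: " + repr(placed_at))
    return problems
```

&lt;p&gt;A body with &lt;code&gt;db_status&lt;/code&gt; and &lt;code&gt;order_created_at&lt;/code&gt; produces two violations here, one for the key set and one for the missing timestamp, which is the kind of disagreement a status-code-only assertion never surfaces.&lt;/p&gt;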




&lt;p&gt;&lt;em&gt;Next issue: Wiring the Guardrails — GitHub Actions, the Pact Broker, and the pipeline that turns contract violations into blocked merges automatically.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cucumber.io/docs/gherkin/" rel="noopener noreferrer"&gt;Cucumber + Gherkin documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-05-the-spec-that-doesnt-lie.md" rel="noopener noreferrer"&gt;Session findings — Issue #5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>Green CI. Broken Contract. Nobody Noticed.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 02:24:34 +0000</pubDate>
      <link>https://dev.to/diyaburman/how-pact-contract-testing-catches-breaking-changes-that-wiremock-misses-3ge6</link>
      <guid>https://dev.to/diyaburman/how-pact-contract-testing-catches-breaking-changes-that-wiremock-misses-3ge6</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #4&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in the sense of this newsletter’s title) on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;If you've been following along, you know where we are. &lt;a href="https://dev.to/diyaburman/the-bottleneck-moved-did-you-notice-5beb"&gt;Issue #2&lt;/a&gt; introduced WireMock and Gherkin — write the behavioral contract before the code, stub your dependencies, run a real test suite. &lt;a href="https://dev.to/diyaburman/i-gave-the-agent-the-spec-and-walked-away-heres-what-it-built-jja"&gt;Issue #3&lt;/a&gt; handed that spec to an AI agent and walked away. Five scenarios passed. The agent even found a bug in my code.&lt;/p&gt;

&lt;p&gt;Everything worked. And that's exactly the problem this issue is about.&lt;/p&gt;

&lt;p&gt;Because the WireMock stubs working perfectly is not the same thing as the real services working. The gap between those two statements is where production incidents are born.&lt;/p&gt;




&lt;h2&gt;
  
  
  The confidence trap
&lt;/h2&gt;

&lt;p&gt;Here's the scenario nobody talks about until it happens to them.&lt;/p&gt;

&lt;p&gt;Your order service calls a payment gateway. You've stubbed it with WireMock. Your Gherkin scenarios pass. Your agent builds against those stubs. Five for five, green across the board.&lt;/p&gt;

&lt;p&gt;Meanwhile, the payment gateway team — a different squad, a different repo, maybe a different company entirely — ships a cleanup. They've been inconsistent about field naming across their API. &lt;code&gt;status&lt;/code&gt; in one endpoint, &lt;code&gt;result&lt;/code&gt; in another. They standardize. They rename &lt;code&gt;status&lt;/code&gt; to &lt;code&gt;result&lt;/code&gt; in the charge response. Their tests pass. They deploy.&lt;/p&gt;

&lt;p&gt;Your tests still pass too. The stub hasn't changed. The stub will never change unless you change it.&lt;/p&gt;

&lt;p&gt;The first time you learn about the rename is a production incident.&lt;/p&gt;

&lt;p&gt;This is the confidence trap: a mock that can drift from the real service makes you feel safe right up until production proves you weren't. The tests are green. The contract is broken. You just don't know it yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Pact does differently
&lt;/h2&gt;

&lt;p&gt;WireMock is a &lt;em&gt;behavioral double&lt;/em&gt; — it simulates a service so your tests can run in isolation. You define what it returns. You maintain it. You can make it say anything you want, which means it can silently lie about what the real service actually does.&lt;/p&gt;

&lt;p&gt;Pact inverts the trust relationship.&lt;/p&gt;

&lt;p&gt;Instead of you maintaining a stub that you hope reflects reality, your consumer tests &lt;em&gt;declare what they need&lt;/em&gt; from the provider. Those declarations get written into a &lt;code&gt;.pact&lt;/code&gt; file — a machine-readable contract. The provider then runs verification against that contract before it ships. If the provider no longer satisfies what the consumer declared, verification fails and the deploy is blocked.&lt;/p&gt;

&lt;p&gt;The consumer defines the need. The provider proves delivery. No human has to remember to update a stub.&lt;/p&gt;
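&lt;p&gt;The inversion can be sketched without any library at all. The code below is a hand-rolled illustration of the consumer-declares, provider-verifies idea and is not pact-python's API: the &lt;code&gt;contract&lt;/code&gt; layout, &lt;code&gt;verify&lt;/code&gt;, and the two &lt;code&gt;provider_*&lt;/code&gt; functions are all hypothetical, and it only compares top-level keys, which is far less than a real Pact verifier does.&lt;/p&gt;

```python
# Consumer side: declare what the order service needs from the charge response.
contract = {
    "consumer": "OrderService",
    "provider": "PaymentGateway",
    "interactions": [
        {
            "description": "a successful payment charge",
            "response": {
                "status": 200,
                "body": {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97},
            },
        }
    ],
}


def verify(contract, provider_call):
    """Provider side: replay each interaction and report keys the provider dropped."""
    failures = []
    for interaction in contract["interactions"]:
        expected = interaction["response"]
        status, body = provider_call(interaction["description"])
        missing = sorted(set(expected["body"]) - set(body))
        if status != expected["status"] or missing:
            failures.append((interaction["description"], missing))
    return failures


def provider_before(description):
    """Hypothetical provider before the cleanup."""
    return 200, {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}


def provider_after(description):
    """Hypothetical provider after renaming status to result."""
    return 200, {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}
```

&lt;p&gt;The point of the design is where the check runs: &lt;code&gt;verify(contract, provider_after)&lt;/code&gt; fails in the provider's pipeline, before the rename ships, instead of relying on the consumer to notice that its stub went stale.&lt;/p&gt;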




&lt;h2&gt;
  
  
  Building it — and what the docs didn't tell me
&lt;/h2&gt;

&lt;p&gt;I added Pact to the order-api project this issue, covering both downstream dependencies — the payment gateway and the inventory service — with consumer tests matching the same five scenarios from the Gherkin feature file.&lt;/p&gt;

&lt;p&gt;It was less smooth than I expected.&lt;/p&gt;

&lt;h3&gt;
  
  
  The pact-python v3 FFI surprise
&lt;/h3&gt;

&lt;p&gt;Every tutorial for pact-python shows the same pattern: create a module-scoped Pact fixture, run multiple tests against it, write the pact file at the end. I wrote exactly that. The first test in each class passed. Every subsequent test failed with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: The provider state could not be specified.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No hint of what was actually wrong. After digging into the source, the root cause: &lt;code&gt;pact-python&lt;/code&gt; 3.x is a complete rewrite backed by a Rust FFI binary. The Rust handle is &lt;em&gt;consumed&lt;/em&gt; by the first &lt;code&gt;serve()&lt;/code&gt; call — you cannot add new interactions to a handle after that point. The v2-style module-scoped pattern violates this constraint in a way the error message doesn't explain at all.&lt;/p&gt;

&lt;p&gt;The fix was restructuring the consumer tests so all interactions are defined upfront before any &lt;code&gt;serve()&lt;/code&gt; call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ v2-style — breaks in pact-python v3
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestPaymentConsumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;module&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OrderService&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;has_pact_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentGateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment succeeds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upon_receiving&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# test
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_declined&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment declined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upon_receiving&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a decline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="c1"&gt;# RuntimeError — handle already consumed
&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ v3 correct pattern — all interactions before serve()
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_payment_gateway_consumer&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;pact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OrderService&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;has_pact_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentGateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pact&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the payment gateway will accept the charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upon_receiving&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a successful payment charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/payments/charge/success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;will_respond_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACCEPTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transaction_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;txn-abc-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;134.97&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pact&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the payment gateway will decline the charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upon_receiving&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a declined payment charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/payments/charge/declined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;will_respond_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;402&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DECLINED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSUFFICIENT_FUNDS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="c1"&gt;# ... all interactions defined ...
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;srv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# exercise all interactions against srv.url
&lt;/span&gt;    &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pacts/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're upgrading from pact-python 1.x or 2.x: expect to rewrite your test fixtures. This isn't a syntax change — it's a different mental model of how the mock server lifecycle works.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Verifier transport configuration gap
&lt;/h3&gt;

&lt;p&gt;Provider verification had its own friction. The &lt;code&gt;Verifier&lt;/code&gt; constructor in pact-python v3 takes a hostname, not a full URL. Passing a full URL causes a silent host mismatch when you later configure the transport:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Causes "Host mismatch: localhost != http://localhost:8291"
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Verifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentGateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8291&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_transport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8291&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Correct (outer parentheses let the chained calls span several lines)
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Verifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentGateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_transport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8291&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pact_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_request_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# needed for the 6s timeout stub
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;set_request_timeout(10000)&lt;/code&gt; line is also non-obvious: the payment timeout stub uses &lt;code&gt;fixedDelayMilliseconds: 6000&lt;/code&gt; to simulate a slow response. The verifier's default timeout is 5 seconds. Without the explicit timeout extension, the timeout interaction fails verification with a connection error rather than a clean pass.&lt;/p&gt;

&lt;p&gt;Neither of these is in the main documentation. Both took real time to find. They're in the findings file for this session, linked at the bottom.&lt;/p&gt;




&lt;h2&gt;
  
  
  The breaking change experiment
&lt;/h2&gt;

&lt;p&gt;All the Pact setup is preamble. This is the proof.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Baseline — all contracts verified&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;pytest tests/pact/test_provider_verification.py -v

Verifying a pact between OrderService and PaymentGateway
  a declined payment charge         (OK)
  a successful payment charge       (OK)
  a timed-out payment charge        (OK)
PASSED

Verifying a pact between OrderService and InventoryService
  [3 interactions — all OK]
PASSED

2 passed in 8.19s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Introduce the breaking change&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;wiremock/payment-mappings/payment-success.json&lt;/code&gt;, one field rename:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Before&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ACCEPTED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"txn-abc-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;134.97&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;After&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;renamed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"result"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ACCEPTED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"txn-abc-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;134.97&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Provider verification with the breaking change&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="p"&gt;pytest tests/pact/test_provider_verification.py -v
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;  a successful payment charge (FAILED)
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;Failures:
&lt;/span&gt;  1.1) has a matching body
         $ -&amp;gt; Actual map is missing the following keys: status
  {
    "amount": 134.97,
  -  "status": "ACCEPTED",
  +  "result": "ACCEPTED",
    "transaction_id": "txn-abc-123"
  }
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;1 failed in 7.22s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pact caught it. Exact field. Exact diff. No ambiguity about what broke or why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: The same breaking change against the WireMock test suite&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;pytest tests/steps/test_order_creation.py -v

test_order_is_successfully_created... PASSED
test_order_is_rejected_when_payment_is_declined PASSED
test_order_is_rejected_when_an_item_is_out_of_stock PASSED
test_order_surfaces_partial_unavailability... PASSED
test_order_handling_is_graceful_when_the_payment_gateway_times_out PASSED

5 passed in 13.01s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five for five. All green. The breaking change is completely invisible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Revert and confirm&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2 passed in 8.19s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why the WireMock tests stayed green
&lt;/h2&gt;

&lt;p&gt;This isn't a flaw in the Gherkin approach — it's a precise boundary on what any behavioral test can and can't see.&lt;/p&gt;

&lt;p&gt;The Gherkin scenarios test the order service's &lt;em&gt;behavior&lt;/em&gt;: does the order get confirmed? Does the right status come back to the caller? In &lt;code&gt;app/main.py&lt;/code&gt;, when the payment gateway responds, the code checks the HTTP status code and returns &lt;code&gt;{"status": "CONFIRMED"}&lt;/code&gt; — it never reads the &lt;code&gt;status&lt;/code&gt; field from the payment gateway body. So from the test harness's perspective, nothing changed. The right HTTP code came back, the order was confirmed, all assertions passed.&lt;/p&gt;

&lt;p&gt;Pact caught it because the consumer test had explicitly declared that the order service &lt;em&gt;expects&lt;/em&gt; a &lt;code&gt;status&lt;/code&gt; field in the payment response. That expectation is encoded in the &lt;code&gt;.pact&lt;/code&gt; file. When provider verification ran against the modified stub, the Rust verifier compared the actual response against the contract and found the key missing.&lt;/p&gt;

&lt;p&gt;The Gherkin test and the Pact consumer test are testing different things. Gherkin tests the system's behavior end-to-end. Pact tests the shape of the conversation between services. You need both. They're not competing — they're covering different failure modes.&lt;/p&gt;
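&lt;p&gt;The difference between the two checks fits in a few lines. This is a hypothetical stand-in for the handler logic described above, not the contents of &lt;code&gt;app/main.py&lt;/code&gt;: the function names and the &lt;code&gt;REJECTED&lt;/code&gt; value are assumptions. What it preserves is the one property that matters, namely that the handler decides on the HTTP code and never reads the gateway body.&lt;/p&gt;

```python
def create_order(gateway_status_code, gateway_body):
    """Hypothetical handler: the outcome depends on the HTTP code alone."""
    if gateway_status_code == 200:
        return {"status": "CONFIRMED"}  # gateway_body is never inspected
    return {"status": "REJECTED"}


def behavioral_check(gateway_body):
    """What a Gherkin-style scenario asserts: the caller-visible outcome."""
    return create_order(200, gateway_body) == {"status": "CONFIRMED"}


def shape_check(gateway_body, expected_keys=("status", "transaction_id", "amount")):
    """What a contract check asserts: every key the consumer declared is present."""
    return all(key in gateway_body for key in expected_keys)


before = {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}
after = {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}
```

&lt;p&gt;&lt;code&gt;behavioral_check&lt;/code&gt; is true for both &lt;code&gt;before&lt;/code&gt; and &lt;code&gt;after&lt;/code&gt;, while &lt;code&gt;shape_check&lt;/code&gt; is true only for &lt;code&gt;before&lt;/code&gt;. Neither check is wrong. They are looking at different things.&lt;/p&gt;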




&lt;h2&gt;
  
  
  The can-i-deploy gate
&lt;/h2&gt;

&lt;p&gt;The final piece was a local &lt;code&gt;can-i-deploy&lt;/code&gt; simulation — a script that reads the generated &lt;code&gt;.pact&lt;/code&gt; files, checks each interaction's expected response shape against the WireMock stub mappings, and exits 0 (safe) or 1 (blocked).&lt;/p&gt;

&lt;p&gt;With contracts intact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;python scripts/can_i_deploy.py

Pact: OrderService → PaymentGateway
  PASS  a declined payment charge
  PASS  a successful payment charge
  PASS  a timed-out payment charge

Pact: OrderService → InventoryService
  PASS  [3 interactions]

RESULT: ALL CONTRACTS VERIFIED — safe to deploy
Exit: 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the breaking change in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;  FAIL  a successful payment charge
        stub is missing fields expected by consumer: ['status']

RESULT: CONTRACT VIOLATIONS DETECTED — do not deploy
Exit: 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a real Pact Broker setup, this check queries a central record of which consumer versions have verified which provider versions. The local simulation does something simpler but teaches the same pattern: before you deploy, prove the contract is still satisfied. The exit code is what a CI pipeline reads. A non-zero exit stops the merge.&lt;/p&gt;
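&lt;p&gt;The shape of that check can be sketched in a few lines. This is not the repository's &lt;code&gt;scripts/can_i_deploy.py&lt;/code&gt;. It assumes a simplified pact structure and a WireMock mapping that uses &lt;code&gt;jsonBody&lt;/code&gt;, and the path, field values, and function names are made up for illustration. Data is held in memory here, where a real script would load the &lt;code&gt;.pact&lt;/code&gt; and mapping files from disk.&lt;/p&gt;

```python
# Simplified, assumed data shapes: a real .pact file and WireMock mapping
# carry more fields than shown here.


def missing_fields(expected_body: dict, stub_body: dict) -> list:
    """Top-level keys the consumer expects that the stub does not return."""
    return sorted(key for key in expected_body if key not in stub_body)


def check_pact(pact: dict, stubs_by_path: dict) -> int:
    """Return 0 when every interaction is satisfied by its stub, else 1."""
    exit_code = 0
    print(f"Pact: {pact['consumer']} -> {pact['provider']}")
    for interaction in pact["interactions"]:
        expected = interaction["response"]["body"]
        stub = stubs_by_path.get(interaction["request"]["path"], {})
        stub_body = stub.get("response", {}).get("jsonBody", {})
        missing = missing_fields(expected, stub_body)
        if missing:
            exit_code = 1
            print(f"  FAIL  {interaction['description']}")
            print(f"        stub is missing fields expected by consumer: {missing}")
        else:
            print(f"  PASS  {interaction['description']}")
    return exit_code


pact = {
    "consumer": "OrderService",
    "provider": "PaymentGateway",
    "interactions": [
        {
            "description": "a successful payment charge",
            "request": {"method": "POST", "path": "/payments/charge/success"},
            "response": {"status": 200, "body": {"status": "APPROVED"}},
        }
    ],
}

intact_stubs = {
    "/payments/charge/success": {"response": {"jsonBody": {"status": "APPROVED"}}}
}
broken_stubs = {
    "/payments/charge/success": {"response": {"jsonBody": {"result": "APPROVED"}}}
}

intact_exit = check_pact(pact, intact_stubs)
broken_exit = check_pact(pact, broken_stubs)
# A real script would finish with sys.exit(exit_code) so CI can read the result.
```

&lt;p&gt;The sketch only compares top-level keys. Nested bodies and type matching are out of scope for it, and they are part of what the real Pact verifier handles.&lt;/p&gt;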

&lt;p&gt;The full GitHub Actions wiring — where this becomes an automated gate on every PR — is Issue #6. The local simulation is enough to feel how it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we are
&lt;/h2&gt;

&lt;p&gt;Four issues in, the specification layer is taking shape. Gherkin and WireMock gave the agent a well-written spec to build against. The agent session proved that clean specs produce clean implementations and expose your assumptions. Pact closes the loop: the contract now survives beyond the stub and catches provider drift before it reaches production.&lt;/p&gt;

&lt;p&gt;The stack is starting to look like something real. But there's a question I've been putting off since Issue #2 that can't wait any longer: what actually makes a Gherkin scenario &lt;em&gt;good&lt;/em&gt;? Because not all specs are equal, and an agent that builds from a loose spec produces something very different from one that builds from a tight one. Next issue I'm going to prove that by deliberately writing bad Gherkin, handing it to the agent, and showing you what comes out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: The Spec That Doesn't Lie — deliberately writing bad Gherkin, seeing what the agent builds from it, then rewriting it and comparing the output.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.pact.io" rel="noopener noreferrer"&gt;Pact documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.pact.io/implementation_guides/python" rel="noopener noreferrer"&gt;pact-python v3 migration guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/yourusername/order-api/blob/main/findings/issue-04-pact-contract-testing.md" rel="noopener noreferrer"&gt;Session findings — Issue #4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>The Agent Found What Code Review Missed.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 01:02:49 +0000</pubDate>
      <link>https://dev.to/diyaburman/i-gave-the-agent-the-spec-and-walked-away-heres-what-it-built-jja</link>
      <guid>https://dev.to/diyaburman/i-gave-the-agent-the-spec-and-walked-away-heres-what-it-built-jja</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #3&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in the context of this newsletter’s title) on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;If you've been following along, you know what we've built so far. &lt;a href="https://dev.to/diyaburman/the-level-5-engineer-the-map-i-didnt-know-i-needed-5b5"&gt;Issue #1&lt;/a&gt; introduced the five levels framework and the Dark Factory concept. &lt;a href="https://dev.to/diyaburman/the-bottleneck-moved-did-you-notice-5beb"&gt;Issue #2&lt;/a&gt; got concrete — we wrote five Gherkin scenarios for an order management API before touching any implementation code, stubbed out two external dependencies with WireMock, and ran a real test suite against the whole thing.&lt;/p&gt;

&lt;p&gt;At the end of Issue #2 I made a promise: hand the spec to an AI agent, spec only, no implementation hints, and see what it builds.&lt;/p&gt;

&lt;p&gt;This is that issue.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The instruction I gave Claude Code at the start of the session was exactly this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The Gherkin scenarios in &lt;code&gt;tests/features/order_creation.feature&lt;/code&gt; define the full behavioural contract for this API. Do not read the existing implementation in &lt;code&gt;app/main.py&lt;/code&gt;. Build a fresh implementation that makes all 5 scenarios pass. Document your findings in &lt;code&gt;FINDINGS.md&lt;/code&gt; as you go."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25xjzuwf4rqp71o099yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25xjzuwf4rqp71o099yc.png" alt="screenshot" width="800" height="852"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcokijs484m3r0ub7wmzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcokijs484m3r0ub7wmzt.png" alt="screenshot" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it. No architecture hints. No "use FastAPI." No "here's how the mock servers work." Just the spec and a documentation instruction.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CLAUDE.md&lt;/code&gt; in the repo handled the rest — the guardrails, the project context, the constraint that the &lt;code&gt;.feature&lt;/code&gt; files cannot be touched, and the format the &lt;code&gt;FINDINGS.md&lt;/code&gt; should follow. If you missed the deep dive on &lt;code&gt;CLAUDE.md&lt;/code&gt; in Issue #2, that file is essentially the agent's standing orders. It reads it at the start of every session.&lt;/p&gt;

&lt;p&gt;Then I sat back and watched.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the agent derived from the spec alone
&lt;/h2&gt;

&lt;p&gt;Here's what I found interesting. Before writing a single line of code, the agent read the Gherkin scenarios and derived the entire API contract from them. Unprompted. It produced this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /inventory/check/{inventory_scenario}
  → all available      → POST /payments/charge/{payment_scenario}
  → partial available  → return 207 PARTIAL_UNAVAILABLE (no charge)
  → all out of stock   → return 409 UNAVAILABLE (no charge)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the full response shape for all five scenarios:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;status&lt;/th&gt;
&lt;th&gt;status_code&lt;/th&gt;
&lt;th&gt;Key fields&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;td&gt;CONFIRMED&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;code&gt;order_id&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment declined&lt;/td&gt;
&lt;td&gt;PAYMENT_FAILED&lt;/td&gt;
&lt;td&gt;402&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;decline_reason&lt;/code&gt;, &lt;code&gt;inventory_released: true&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out of stock&lt;/td&gt;
&lt;td&gt;UNAVAILABLE&lt;/td&gt;
&lt;td&gt;409&lt;/td&gt;
&lt;td&gt;&lt;code&gt;unavailable_items&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partial stock&lt;/td&gt;
&lt;td&gt;PARTIAL_UNAVAILABLE&lt;/td&gt;
&lt;td&gt;207&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;available_items&lt;/code&gt;, &lt;code&gt;unavailable_items&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment timeout&lt;/td&gt;
&lt;td&gt;PAYMENT_PENDING&lt;/td&gt;
&lt;td&gt;202&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;inventory_hold_minutes: 15&lt;/code&gt;, &lt;code&gt;retry_count&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is exactly right. The agent read five plain-language scenarios and extracted a precise technical contract — the order of operations, the response codes, the body fields, the retry behaviour — without being told any of it explicitly.&lt;/p&gt;

&lt;p&gt;That's not nothing. That's the spec doing its job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where it got interesting — the timeout scenario
&lt;/h2&gt;

&lt;p&gt;Scenario 5 is the one I was most curious about. Timeout behaviour is notoriously hard to test and easy to get wrong. The agent worked through it carefully and documented its reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PAYMENT_TIMEOUT_SECONDS=5&lt;/code&gt; — per-attempt HTTP client timeout&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAX_PAYMENT_RETRIES=2&lt;/code&gt; — total attempt cap, not a retry count on top of the first attempt&lt;/li&gt;
&lt;li&gt;Worst-case wall time with 2 attempts at 5 seconds each: 10 seconds — comfortably inside the 12-second contract from the scenario&lt;/li&gt;
&lt;li&gt;The WireMock timeout stub uses &lt;code&gt;fixedDelayMilliseconds: 6000&lt;/code&gt; — deliberately longer than the client timeout so the client always times out before the mock responds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last detail is subtle and correct. If the mock delay were shorter than the client timeout, the test would be testing the wrong thing — the mock responding slowly rather than the client giving up. The agent caught this without being prompted. It's in the FINDINGS.md.&lt;/p&gt;
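&lt;p&gt;The arithmetic behind those bullets is small enough to write out. The sketch below reuses the values from the list. The variable names other than &lt;code&gt;PAYMENT_TIMEOUT_SECONDS&lt;/code&gt; and &lt;code&gt;MAX_PAYMENT_RETRIES&lt;/code&gt; are my own, and it assumes no backoff between attempts and ignores the time spent on the inventory call.&lt;/p&gt;

```python
PAYMENT_TIMEOUT_SECONDS = 5      # per-attempt HTTP client timeout
MAX_PAYMENT_RETRIES = 2          # total attempts, not retries after the first
CONTRACT_BUDGET_SECONDS = 12     # "the response is returned within 12 seconds"
MOCK_DELAY_MILLISECONDS = 6000   # WireMock fixedDelayMilliseconds

# Worst case: every attempt runs until the client gives up.
worst_case_seconds = PAYMENT_TIMEOUT_SECONDS * MAX_PAYMENT_RETRIES
headroom_seconds = CONTRACT_BUDGET_SECONDS - worst_case_seconds

# The stub must stay silent longer than the client is willing to wait,
# otherwise the test exercises a slow response instead of a timeout.
client_gives_up_first = MOCK_DELAY_MILLISECONDS > PAYMENT_TIMEOUT_SECONDS * 1000

print(worst_case_seconds)     # 10
print(headroom_seconds)       # 2
print(client_gives_up_first)  # True
```

&lt;p&gt;Two seconds of headroom is not much. Any backoff between attempts or a slow inventory call would eat into it, which is a reason to keep the 12-second assertion in the scenario.&lt;/p&gt;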

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj12ykckkfpgj974d9m8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj12ykckkfpgj974d9m8.png" alt="screenshot" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug it found that I had written
&lt;/h2&gt;

&lt;p&gt;This is my favourite part of this issue.&lt;/p&gt;

&lt;p&gt;The original test setup — the code I pointed Claude Code at — had a hard-coded path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/claude/order-api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On my machine this would silently start mock servers with no stubs loaded. Every payment call would return a 404. Every inventory call would return a 404. The tests would fail in ways that looked like logic errors rather than a configuration problem.&lt;/p&gt;

&lt;p&gt;The agent caught it, diagnosed the root cause, and fixed it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrvhf3xhyz1n8j0i828o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrvhf3xhyz1n8j0i828o.png" alt="screenshot" width="800" height="855"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — hard-coded, breaks on any machine but the original
&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/claude/order-api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After — computed dynamically, works everywhere
&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;
&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ROOT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be clear: this bug was in &lt;em&gt;my&lt;/em&gt; code. Code I had written and shipped to the repo. The agent found it during implementation because it was trying to run the tests in a different environment and they failed in a way that forced the diagnosis.&lt;/p&gt;

&lt;p&gt;This is a thing that happens at Level 4 that doesn't happen at Level 2. When you're implementing yourself, you don't notice the hard-coded paths because everything works on your machine. When an agent implements in a clean environment, your assumptions get exposed immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  My honest reaction
&lt;/h2&gt;

&lt;p&gt;I'll be transparent about something. This API isn't complex. It's an order endpoint with two downstream dependencies and five scenarios. I didn't expect the agent to struggle with it, and it didn't. It hit errors, diagnosed them promptly, and moved on. Five scenarios, all passing.&lt;/p&gt;

&lt;p&gt;What struck me wasn't the capability — it was the &lt;em&gt;texture&lt;/em&gt; of the experience.&lt;/p&gt;

&lt;p&gt;Watching Claude Code work, I found myself doing something I don't usually do when I'm implementing: I was evaluating. Not writing, not debugging, not context-switching. Just reading the agent's reasoning and deciding whether I agreed with it. That's a different cognitive posture entirely. It felt closer to a code review than a coding session.&lt;/p&gt;

&lt;p&gt;I also noticed I spent the entire session approving individual commands — every file edit, every &lt;code&gt;pytest&lt;/code&gt; run, every &lt;code&gt;pip install&lt;/code&gt;. Claude Code asks for permission before each action by default. For this first session I let it. From the next task onward I'm going to configure it to run basic commands without checking in every thirty seconds. There's a trust-building curve here, and I'm on the early part of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this proves — and what it doesn't
&lt;/h2&gt;

&lt;p&gt;Five passing scenarios on a moderately simple API is not proof that Level 5 is solved. It's proof that the approach works at this scale and this complexity.&lt;/p&gt;

&lt;p&gt;The honest question — the one this newsletter is actually tracking — is whether it holds as the system grows. Pact tests across services. CI/CD pipelines. Evals as guardrails. Contextual stewardship documents for systems with years of history and undocumented decisions baked into the architecture.&lt;/p&gt;

&lt;p&gt;That's where the real test is. And that's where we're going next.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;One thing the exercise exposed: the spec was good enough for the agent to build correctly, but I had one implicit assumption that didn't make it into the scenarios. The success scenario never asserts that &lt;code&gt;status_code&lt;/code&gt; is absent from the response; it only checks for &lt;code&gt;order_id&lt;/code&gt;. The agent inferred this correctly, but if it had included the field anyway, the test would still have passed.&lt;/p&gt;

&lt;p&gt;That's a gap in the spec, not a gap in the agent. The lesson is the same one from Issue #2: every implicit assumption is a decision waiting to cause a bug in production. Write it down. Make it a scenario. Make the machine prove it.&lt;/p&gt;
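&lt;p&gt;As a rough illustration of what writing it down could look like, here is a sketch of a loose and a strict check for the success response. The helper names and the sample body are hypothetical. In the project itself this would more naturally be one extra &lt;code&gt;Then&lt;/code&gt; step in the feature file with a matching step definition.&lt;/p&gt;

```python
def loose_success_check(body: dict) -> bool:
    """What the current scenario effectively verifies."""
    return body.get("status") == "CONFIRMED" and "order_id" in body


def strict_success_check(body: dict) -> bool:
    """Adds the implicit assumption: no status_code on success."""
    return loose_success_check(body) and "status_code" not in body


# Hypothetical response from an implementation that guessed differently.
stray_body = {"status": "CONFIRMED", "order_id": "ord-001", "status_code": 200}

print(loose_success_check(stray_body))   # True: the gap lets it through
print(strict_success_check(stray_body))  # False: the assumption is now enforced
```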




&lt;p&gt;&lt;em&gt;Next issue: Phase 3 — adding Pact contract testing between the order service and its dependencies. What happens when the service contract and the mock stub disagree?&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.claude.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Claude Code documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytest-bdd.readthedocs.io" rel="noopener noreferrer"&gt;pytest-bdd documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-03-agent-fresh-implementation.md" rel="noopener noreferrer"&gt;Session findings - Issue #3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>Senior Developers Using AI Are Getting Slower. The Data Says So.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 00:37:33 +0000</pubDate>
      <link>https://dev.to/diyaburman/the-bottleneck-moved-did-you-notice-5beb</link>
      <guid>https://dev.to/diyaburman/the-bottleneck-moved-did-you-notice-5beb</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #2&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Preface
&lt;/h3&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in the context of this newsletter’s title) on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;If you read &lt;a href="https://dev.to/diyaburman/the-level-5-engineer-the-map-i-didnt-know-i-needed-5b5"&gt;Issue 1&lt;/a&gt;, you walked away with the map. Six levels, a plateau most engineers never escape, and a Dark Factory that a handful of teams are quietly running in production. If you missed it, go read it first — this one builds directly on it.&lt;/p&gt;

&lt;p&gt;This issue is about the single most important shift that happens when you try to move from Level 3 to Level 4. Not the tools. Not the mindset. The &lt;em&gt;bottleneck&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Because it moved. And most of us didn't notice.&lt;/p&gt;




&lt;h2&gt;
  
  
  When speed stops being the problem
&lt;/h2&gt;

&lt;p&gt;For most of our careers, the bottleneck in software development was implementation speed. You had the idea, you had the design, you had the ticket — the constraint was how fast fingers could turn it into working code. That's the world we optimized for. That's why we measured velocity. That's why standups exist. That's why "10x engineer" was ever a phrase people said out loud without embarrassment.&lt;/p&gt;

&lt;p&gt;AI blew that bottleneck wide open.&lt;/p&gt;

&lt;p&gt;At Level 2, implementation stops being the constraint almost overnight. You're pairing with an agent and the code just... appears. Features that used to take days take hours. Hours take minutes. It feels like the problem is solved.&lt;/p&gt;

&lt;p&gt;Except you haven't solved it. You've just exposed the one that was hiding behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The new bottleneck is specification quality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent can build anything you can describe precisely enough. The operative word is &lt;em&gt;precisely&lt;/em&gt;. The moment you try to hand off a vague, half-formed idea — the kind a human developer would fill in with reasonable assumptions and a quick Slack message — the agent either hallucinates something plausible-looking that isn't what you wanted, or it freezes, or worse, it confidently builds the wrong thing all the way to completion.&lt;/p&gt;

&lt;p&gt;The constraint is no longer your ability to implement. It's your ability to &lt;em&gt;specify&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a bad spec actually looks like
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth — most "requirements" we write as engineers are not specifications. They are vibes dressed up in Jira tickets.&lt;/p&gt;

&lt;p&gt;"Add pagination to the users endpoint." That's not a spec. How many results per page? Is the default configurable? What happens when the page number exceeds the total — empty array or 404? What's the sort order? Cursor-based or offset-based? What happens to existing API consumers who aren't sending page parameters yet?&lt;/p&gt;

&lt;p&gt;A human developer asks those questions in standup or figures them out from context. An agent working autonomously at Level 4 cannot do that. It will make a choice — silently, confidently, and consistently wrong in a way you won't catch until production.&lt;/p&gt;

&lt;p&gt;This is why Dan Shapiro's insight about specification quality isn't just a productivity tip. It's a prerequisite for moving up the ladder at all. You cannot reach Level 4 with Level 2 specs. The system won't let you.&lt;/p&gt;




&lt;h2&gt;
  
  
  So I built one. Here's what happened.
&lt;/h2&gt;

&lt;p&gt;I wanted to do something concrete this issue rather than just theorize. So I picked a real-world-shaped scenario — an e-commerce order management API with two external dependencies — and built it end to end with &lt;strong&gt;WireMock simulating the dependencies&lt;/strong&gt; and &lt;strong&gt;Gherkin scenarios written before the code&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The full project is on &lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; so you can clone it and run the exact same setup on your machine. Everything below is reproducible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The scenario
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;POST /orders&lt;/code&gt; endpoint that talks to two external services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;payment gateway&lt;/strong&gt; (think Stripe) that can succeed, decline, or time out&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;inventory service&lt;/strong&gt; that can confirm stock, report out-of-stock, or report partial availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Realistic enough to be relatable. Scoped enough to finish in an afternoon. The kind of integration complexity every backend engineer deals with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Write the spec first. Actually first.
&lt;/h3&gt;

&lt;p&gt;Here are the five Gherkin scenarios I wrote &lt;em&gt;before&lt;/em&gt; a single line of implementation code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order Creation

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order is successfully created when payment succeeds and all items are in stock
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-123"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service confirms all items are in stock
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway will accept the charge
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M
    &lt;span class="nf"&gt;Then &lt;/span&gt;the order status is &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the response includes an order id
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway received exactly one charge request
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service received a reservation request

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order is rejected when payment is declined
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-456"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service confirms all items are in stock
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway will decline the charge
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M
    &lt;span class="nf"&gt;Then &lt;/span&gt;the order status is &lt;span class="s"&gt;"PAYMENT_FAILED"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the response status code is 402
    &lt;span class="nf"&gt;And &lt;/span&gt;the response includes the decline reason &lt;span class="s"&gt;"INSUFFICIENT_FUNDS"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory reservation is released
    &lt;span class="nf"&gt;And &lt;/span&gt;no order id is issued

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order is rejected when an item is out of stock
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-789"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service reports SHOE-RED-42 is out of stock
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42
    &lt;span class="nf"&gt;Then &lt;/span&gt;the order status is &lt;span class="s"&gt;"UNAVAILABLE"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the response status code is 409
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway is never called

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order surfaces partial unavailability without auto-confirming
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-321"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service reports SHOE-RED-42 as available but BELT-BRN-M as unavailable
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M
    &lt;span class="nf"&gt;Then &lt;/span&gt;the order status is &lt;span class="s"&gt;"PARTIAL_UNAVAILABLE"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the response status code is 207
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway is never called
    &lt;span class="nf"&gt;And &lt;/span&gt;no order is confirmed without explicit user action

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order handling is graceful when the payment gateway times out
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-654"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service confirms all items are in stock
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway will not respond within the timeout window
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M
    &lt;span class="nf"&gt;Then &lt;/span&gt;the response is returned within 12 seconds
    &lt;span class="nf"&gt;And &lt;/span&gt;the order status is &lt;span class="s"&gt;"PAYMENT_PENDING"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory is held for 15 minutes
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway is not retried more than 2 times
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what these are. Not implementation documents. Not pseudocode. They're a &lt;strong&gt;behavioural contract&lt;/strong&gt; — plain-language descriptions of exactly what the system should do in specific situations, written in a format any teammate, PM, or yes — agent — can read.&lt;/p&gt;

&lt;p&gt;The discipline of writing them &lt;em&gt;first&lt;/em&gt; forced me to make decisions I would normally have postponed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do we check inventory before charging, or charge first? (Inventory first. The third scenario locks this in.)&lt;/li&gt;
&lt;li&gt;What happens during partial availability — auto-fulfill what's available, or ask the user? (Ask. Encoded in scenario 4.)&lt;/li&gt;
&lt;li&gt;What's the timeout SLA on the payment gateway? (5 seconds, with a max of 2 retries. Scenario 5 makes this testable.)&lt;/li&gt;
&lt;li&gt;What's the response code for partial availability? (207 Multi-Status.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these scenarios, every one of those decisions would have been made silently by whoever wrote the code first.&lt;/p&gt;
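
&lt;p&gt;To make those decisions concrete, here is a minimal sketch of the pre-payment check they imply. It is not the code in &lt;code&gt;app/main.py&lt;/code&gt;: the function name &lt;code&gt;decide_before_payment&lt;/code&gt;, the shape of &lt;code&gt;stock&lt;/code&gt;, and the &lt;code&gt;READY_FOR_PAYMENT&lt;/code&gt; label are assumptions for illustration. Only the order statuses and response codes come from the scenarios.&lt;/p&gt;

```python
# Hypothetical sketch: names and structure are illustrative only.

def decide_before_payment(stock):
    """Map per-SKU availability to (order_status, http_status, call_payment).

    `stock` is a dict of SKU to bool, where True means the item is in stock.
    """
    available = [sku for sku, in_stock in stock.items() if in_stock]
    if not available:
        # Nothing can ship: reject without touching the payment gateway.
        return ("UNAVAILABLE", 409, False)
    if len(available) != len(stock):
        # Some items are missing: surface it and wait for the user to decide.
        return ("PARTIAL_UNAVAILABLE", 207, False)
    # Everything is in stock: only now is it safe to attempt the charge.
    return ("READY_FOR_PAYMENT", None, True)


print(decide_before_payment({"SHOE-RED-42": True, "BELT-BRN-M": False}))
```

&lt;p&gt;The third element of the tuple is the point: in both unavailable branches the caller is told not to contact the payment gateway at all, which is what the "never called" assertions check.&lt;/p&gt;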

&lt;h3&gt;
  
  
  Step 2 — Stub the dependencies with WireMock
&lt;/h3&gt;

&lt;p&gt;Before writing any tests you can actually run, the external services need to be simulated. This is what Dan Shapiro calls a &lt;strong&gt;digital twin universe&lt;/strong&gt; — a fully simulated version of your dependencies that behaves like the real thing without the real thing's unpredictability, cost, or rate limits.&lt;/p&gt;

&lt;p&gt;WireMock is a widely used tool for this. A WireMock stub is just a JSON file describing how a service should respond:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/payments/charge/success"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ACCEPTED&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;transaction_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;txn-abc-123&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;amount&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 134.97}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the payment timeout scenario, WireMock has a built-in &lt;code&gt;fixedDelayMilliseconds&lt;/code&gt; parameter. One line and the mock takes 6 seconds to respond:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/payments/charge/timeout"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;TIMEOUT&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fixedDelayMilliseconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That tiny config line is what makes scenario 5 testable. Without it, you cannot exercise timeout behaviour in a local environment without disabling network connectivity at the OS level — which I have done in the past, and it is exactly as miserable as it sounds.&lt;/p&gt;
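
&lt;p&gt;The retry half of scenario 5 can be sketched without any network at all. This is a hypothetical helper, not necessarily how the project's API implements it: &lt;code&gt;charge_with_retries&lt;/code&gt;, &lt;code&gt;MAX_RETRIES&lt;/code&gt;, and the &lt;code&gt;CONFIRMED&lt;/code&gt; status are assumed names, and the per-attempt timeout is left to whatever &lt;code&gt;charge&lt;/code&gt; callable you pass in.&lt;/p&gt;

```python
# Hypothetical sketch of a bounded retry policy for a slow payment gateway.

MAX_RETRIES = 2  # scenario 5: the gateway is not retried more than 2 times


def charge_with_retries(charge, max_retries=MAX_RETRIES):
    """Call `charge()` once, then retry on timeout up to `max_retries` times.

    `charge` is any zero-argument callable that raises TimeoutError when the
    gateway does not answer in time. Returns (order_status, attempts_made).
    """
    attempts = 0
    for attempts in range(1, max_retries + 2):  # first call plus the retries
        try:
            charge()
        except TimeoutError:
            continue  # too slow: try again while attempts remain
        return ("CONFIRMED", attempts)
    # Every attempt timed out: park the order instead of failing it outright.
    return ("PAYMENT_PENDING", attempts)


def always_slow():
    # Stand-in for a gateway stub that never answers within the timeout.
    raise TimeoutError("gateway did not respond in time")


print(charge_with_retries(always_slow))
```

&lt;p&gt;Passing the gateway call in as a callable keeps the policy testable on its own; the delayed stub is then only needed to prove that the real HTTP client actually raises on a slow response.&lt;/p&gt;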

&lt;h3&gt;
  
  
  Step 3 — Wire the scenarios to real assertions
&lt;/h3&gt;

&lt;p&gt;Gherkin by itself is just text. To turn it into an executable test suite I used &lt;strong&gt;pytest-bdd&lt;/strong&gt;, which lets each &lt;code&gt;Given/When/Then&lt;/code&gt; line map to a Python function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the payment gateway will decline the charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_fixture&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pay_declined&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;declined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_fixture&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;submit_two&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment_scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inventory_scenario&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SHOE-RED-42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;89.99&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BELT-BRN-M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;44.98&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_PORT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payment_scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventory_scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inventory_scenario&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the payment gateway is never called&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;payment_not_called&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;payment_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Expected no payment calls, got: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;payment_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last assertion, &lt;code&gt;the payment gateway is never called&lt;/code&gt;, is awkward to verify at the HTTP boundary with ordinary unit tests but &lt;em&gt;trivial&lt;/em&gt; with WireMock. WireMock records every request it receives, so you assert against that log directly.&lt;/p&gt;
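
&lt;p&gt;The step above relies on a &lt;code&gt;payment_log&lt;/code&gt; object with an &lt;code&gt;all()&lt;/code&gt; method. The class behind it is not shown here, so the following is only a guess at a minimal shape: the &lt;code&gt;record&lt;/code&gt; and &lt;code&gt;reset&lt;/code&gt; methods and the stored fields are assumptions.&lt;/p&gt;

```python
import threading


class MockCallLog:
    """Hypothetical per-server request log; one instance per mocked service."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = []

    def record(self, method, path):
        # Called from the mock server's thread for every incoming request.
        with self._lock:
            self._calls.append({"method": method, "path": path})

    def all(self):
        # Return a copy so a test cannot mutate the log by accident.
        with self._lock:
            return list(self._calls)

    def reset(self):
        # Clear between scenarios so one test's calls do not leak into the next.
        with self._lock:
            self._calls.clear()


payment_log = MockCallLog()
inventory_log = MockCallLog()
inventory_log.record("GET", "/inventory/SHOE-RED-42")

# Mirrors the Then step: an empty list is falsy, so this assertion passes.
assert not payment_log.all()
```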

&lt;h3&gt;
  
  
  Step 4 — Run the suite
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pytest tests/steps/test_order_creation.py &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="o"&gt;=============================&lt;/span&gt; &lt;span class="nb"&gt;test &lt;/span&gt;session starts &lt;span class="o"&gt;==============================&lt;/span&gt;
collected 5 items

tests/steps/test_order_creation.py::test_order_is_successfully_created_when_payment_succeeds_and_all_items_are_in_stock PASSED
tests/steps/test_order_creation.py::test_order_is_rejected_when_payment_is_declined PASSED
tests/steps/test_order_creation.py::test_order_is_rejected_when_an_item_is_out_of_stock PASSED
tests/steps/test_order_creation.py::test_order_surfaces_partial_unavailability_without_autoconfirming PASSED
tests/steps/test_order_creation.py::test_order_handling_is_graceful_when_the_payment_gateway_times_out PASSED

&lt;span class="o"&gt;==============================&lt;/span&gt; 5 passed &lt;span class="k"&gt;in &lt;/span&gt;13.53s &lt;span class="o"&gt;==============================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five for five. But getting there was educational.&lt;/p&gt;




&lt;h2&gt;
  
  
  What broke along the way
&lt;/h2&gt;

&lt;p&gt;I want to be honest about the failures because &lt;em&gt;that's&lt;/em&gt; where the actual learning happened. The five tests didn't pass on the first run. They didn't pass on the second run either.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure 1 — The "no stub matched" silent success
&lt;/h3&gt;

&lt;p&gt;When a request comes in that no WireMock stub knows how to handle, the default behaviour is to return a 404. My API code did this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PAYMENT_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/payments/charge/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="n"&gt;payment_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment service error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 404 is not an exception in &lt;code&gt;httpx&lt;/code&gt;. It's just a response. So the API would happily call &lt;code&gt;pay.json()&lt;/code&gt;, get &lt;code&gt;{"error": "No stub matched"}&lt;/code&gt;, and treat the entire interaction as a success — issuing an order id and confirming the order &lt;em&gt;even though no real payment had been processed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is genuinely dangerous. A misconfigured mock would have made all my tests pass while hiding that the real service path was broken. Lesson: &lt;strong&gt;always explicitly check the response status from a mock&lt;/strong&gt;. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PAYMENT_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/payments/charge/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment scenario not found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
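    &lt;span class="n"&gt;payment_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# transport errors only, so the 404 error raised above is not swallowed
&lt;/span&gt;    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment service error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;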
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
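
&lt;p&gt;The same lesson can be pushed one step further by validating the payload as well as the status. This is a sketch under assumptions, not the project's code: &lt;code&gt;PaymentGatewayError&lt;/code&gt; and &lt;code&gt;parse_payment_response&lt;/code&gt; are hypothetical names, and it assumes every legitimate gateway reply carries a &lt;code&gt;status&lt;/code&gt; field, as the stubs in this article do.&lt;/p&gt;

```python
import json


class PaymentGatewayError(Exception):
    """Hypothetical error type for any gateway reply that cannot be trusted."""


def parse_payment_response(status_code, body):
    """Return the decoded payload, or raise if the reply is not a real answer."""
    if status_code == 404:
        # An unmatched stub (or a wrong route) must never look like success.
        raise PaymentGatewayError("no stub or route matched the charge request")
    payload = json.loads(body)
    if not isinstance(payload, dict) or "status" not in payload:
        # Catches bodies such as {"error": "No stub matched"} on any status code.
        raise PaymentGatewayError(f"unexpected gateway payload: {payload}")
    return payload


print(parse_payment_response(200, '{"status": "ACCEPTED"}'))
```

&lt;p&gt;Keeping this as a pure function of status code and body means it can be tested without starting either mock server.&lt;/p&gt;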



&lt;h3&gt;
  
  
  Failure 2 — The shared call log bug
&lt;/h3&gt;

&lt;p&gt;I started with one &lt;code&gt;MockServer&lt;/code&gt; class that held a single class-level call log. Both the payment and inventory mocks recorded into the same list. When the test asserted "the payment gateway received exactly one charge request," the inventory call was in the log but no payment call was — because of failure 1 — and the assertion was looking at the combined log.&lt;/p&gt;

&lt;p&gt;The fix was conceptually small but architecturally important — each mock server instance gets its own call log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_mock_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mappings_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;HTTPServer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MockCallLog&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;stubs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mappings_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MockCallLog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                  &lt;span class="c1"&gt;# ← per-instance log
&lt;/span&gt;    &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stubs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HTTPServer&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serve_forever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mirrors how real WireMock is typically run: one instance per service, each with its own request log. The bug was a direct consequence of cutting that corner.&lt;/p&gt;
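
&lt;p&gt;The &lt;code&gt;make_handler&lt;/code&gt; helper is not shown above, so here is one way it could look, using only the standard library. Treat it as a sketch: the exact-match rule on method and URL, the plain list used as the log, and the demo at the bottom are all assumptions, and real WireMock supports far richer matching.&lt;/p&gt;

```python
import http.client
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(stubs, log):
    """Build a handler class bound to one server's stubs and its own call log."""

    class StubHandler(BaseHTTPRequestHandler):
        def _serve(self):
            # Drain any request body so the client is not cut off mid-send.
            self.rfile.read(int(self.headers.get("Content-Length", 0)))
            log.append((self.command, self.path))  # record before matching
            for stub in stubs:
                wanted = stub["request"]
                if wanted["method"] == self.command and wanted["url"] == self.path:
                    self._reply(stub["response"])
                    return
            # Same default as WireMock: an unmatched request gets a 404.
            self._reply({"status": 404, "body": '{"error": "No stub matched"}'})

        def _reply(self, response):
            time.sleep(response.get("fixedDelayMilliseconds", 0) / 1000)
            payload = response.get("body", "").encode("utf-8")
            self.send_response(response["status"])
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        do_GET = _serve
        do_POST = _serve

        def log_message(self, *args):
            pass  # keep test output quiet

    return StubHandler


def call(port, method, path):
    """Send one request to the mock and return (status, decoded JSON body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, path, body="{}" if method == "POST" else None)
    resp = conn.getresponse()
    result = (resp.status, json.loads(resp.read()))
    conn.close()
    return result


stubs = [{"request": {"method": "POST", "url": "/payments/charge/success"},
          "response": {"status": 200, "body": '{"status": "ACCEPTED"}'}}]
payment_log = []  # this server's log only; another server would get its own list
server = HTTPServer(("127.0.0.1", 0), make_handler(stubs, payment_log))  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

ok_status, ok_body = call(server.server_port, "POST", "/payments/charge/success")
missing_status, missing_body = call(server.server_port, "GET", "/payments/unknown")
server.shutdown()
server.server_close()
```

&lt;p&gt;Because the log is captured by the closure rather than stored on a shared class, two servers built this way cannot write into each other's history.&lt;/p&gt;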

&lt;h3&gt;
  
  
  Failure 3 — The fixture wiring gap
&lt;/h3&gt;

&lt;p&gt;Scenarios 3 and 4 don't define a payment scenario in their Given clauses, because the payment gateway should never be called in those cases. But pytest-bdd was still expecting the &lt;code&gt;payment_scenario&lt;/code&gt; fixture — and erroring out before the test even ran.&lt;/p&gt;

&lt;p&gt;This is a subtle distinction worth naming. The Gherkin spec was correct. It said exactly what it should say. The error was in the &lt;em&gt;test wiring&lt;/em&gt; that connected the spec to the assertions. The fix was a default fixture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;payment_scenario&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Default — overridden by specific Given steps.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spec stays clean. The wiring handles the case where a scenario doesn't care about a particular setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this exercise actually proved to me
&lt;/h2&gt;

&lt;p&gt;A few things that are now visceral rather than abstract:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specs that the AI cannot see during the build are uniquely powerful.&lt;/strong&gt; My scenarios live in &lt;code&gt;tests/features/order_creation.feature&lt;/code&gt;. The implementation lives in &lt;code&gt;app/main.py&lt;/code&gt;. When I asked an agent to modify the API, I could give it the implementation only. The spec stayed external. The agent had to make the test pass against behaviour it couldn't reverse-engineer from the assertions themselves. This is the part that genuinely changes things at Level 4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WireMock's 404-on-no-match is a feature, not a bug.&lt;/strong&gt; It exposes integration mistakes that would otherwise hide forever. The first time I saw a test silently succeed because of the 404 passthrough I was annoyed. Now I think it should be louder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing the scenarios first changed what I built.&lt;/strong&gt; Scenario 4 — partial availability — would not have existed if I'd written the code first. I would have implemented "all available or fail" and shipped it. Writing the spec first made me confront the question. The answer became part of the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try this yourself
&lt;/h2&gt;

&lt;p&gt;Everything above is in a &lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/" rel="noopener noreferrer"&gt;project&lt;/a&gt; you can clone, run, and break. Five scenarios, two mock services, one API. Total setup time: under fifteen minutes if you have Python and pip installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &amp;lt;repo-url&amp;gt; order-api
&lt;span class="nb"&gt;cd &lt;/span&gt;order-api
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn httpx pytest pytest-bdd requests
pytest tests/steps/test_order_creation.py &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to use real WireMock instead of the Python-based mock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download WireMock standalone&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; wiremock.jar &lt;span class="se"&gt;\&lt;/span&gt;
  https://repo1.maven.org/maven2/org/wiremock/wiremock-standalone/3.3.1/wiremock-standalone-3.3.1.jar

&lt;span class="c"&gt;# Run two instances — the JSON mappings work as-is&lt;/span&gt;
java &lt;span class="nt"&gt;-jar&lt;/span&gt; wiremock.jar &lt;span class="nt"&gt;--port&lt;/span&gt; 8081 &lt;span class="nt"&gt;--root-dir&lt;/span&gt; wiremock/payment-mappings &amp;amp;
java &lt;span class="nt"&gt;-jar&lt;/span&gt; wiremock.jar &lt;span class="nt"&gt;--port&lt;/span&gt; 8082 &lt;span class="nt"&gt;--root-dir&lt;/span&gt; wiremock/inventory-mappings &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The WireMock mapping JSON files I wrote work in real WireMock with zero changes. That was deliberate. The Python mock is for getting started fast. The real WireMock is for when you want to scale this pattern across an actual service mesh.&lt;/p&gt;
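
&lt;p&gt;With real WireMock, the "never called" check moves from an in-process log to WireMock's request journal, which the admin API exposes at &lt;code&gt;/__admin/requests&lt;/code&gt;. A minimal sketch, assuming the ports used above; &lt;code&gt;fetch_journal&lt;/code&gt; and &lt;code&gt;count_calls&lt;/code&gt; are hypothetical helper names, and the sample journal is hand-written to show the fields the code reads.&lt;/p&gt;

```python
import json
import urllib.request


def fetch_journal(admin_base):
    """Fetch the request journal from a running WireMock, e.g. http://localhost:8081."""
    with urllib.request.urlopen(f"{admin_base}/__admin/requests") as resp:
        return json.loads(resp.read())


def count_calls(journal, method, url_prefix):
    """Count journal entries with this method whose URL starts with url_prefix."""
    return sum(
        1
        for entry in journal.get("requests", [])
        if entry["request"]["method"] == method
        and entry["request"]["url"].startswith(url_prefix)
    )


# Hand-written sample in the journal's shape; a live test would instead use
# fetch_journal("http://localhost:8082") against the inventory instance.
sample_journal = {
    "requests": [
        {"request": {"method": "GET", "url": "/inventory/SHOE-RED-42"}},
    ]
}

payment_calls = count_calls(sample_journal, "POST", "/payments/charge")
inventory_calls = count_calls(sample_journal, "GET", "/inventory")
print(payment_calls, inventory_calls)
```

&lt;p&gt;Since each WireMock instance keeps its own journal, querying port 8081 for payment calls and port 8082 for inventory calls gives the same isolation the per-instance Python logs provide.&lt;/p&gt;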




&lt;p&gt;&lt;em&gt;Next issue: I take this same setup and hand it to an AI agent. Spec only — no implementation hints. We see what it builds, what it gets wrong, and how the spec acts as a guardrail.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cucumber.io/docs/gherkin/" rel="noopener noreferrer"&gt;Cucumber + Gherkin documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wiremock.org/docs/" rel="noopener noreferrer"&gt;WireMock documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytest-bdd.readthedocs.io" rel="noopener noreferrer"&gt;pytest-bdd documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-02-wiremock-gherkin.md" rel="noopener noreferrer"&gt;Session findings - Issue #2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>There Are 5 Levels of AI Coding. Most Engineers Are Stuck on Level 2.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Tue, 09 Jun 2026 19:12:34 +0000</pubDate>
      <link>https://dev.to/diyaburman/the-level-5-engineer-the-map-i-didnt-know-i-needed-5b5</link>
      <guid>https://dev.to/diyaburman/the-level-5-engineer-the-map-i-didnt-know-i-needed-5b5</guid>
      <description>&lt;p&gt;&lt;em&gt;The Level 5 Engineer Newsletter — Issue #1&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Preface
&lt;/h3&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Before you panic and close this tab — yes, I see you hovering!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m not naively chasing the thing that puts me out of a job, and yes, I’m aware of the irony. The tech world is in full meltdown mode about AI taking developer jobs right now, and I think most of that noise misses the point entirely.&lt;/p&gt;

&lt;p&gt;A Dark Factory still needs someone who understands the system deeply enough to define what it should build, catch what it shouldn’t touch, and course-correct when it goes sideways. That’s not a developer writing code anymore — that’s a steward. The role doesn’t disappear; it transforms. From worker bee to architect. From implementer to the person who holds the mental model of the entire system and encodes that judgment into infrastructure that the agents can operate safely within. That shift is actually what this newsletter is about — and it deserves its own deep dive, which is coming. For now, just know: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Level 5 is the destination, not the cliff edge.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  So. The Five Levels.
&lt;/h3&gt;

&lt;p&gt;Dan Shapiro borrowed the structure from the NHTSA’s autonomous driving classification — five levels from “human does everything” to “machine does everything, humans not required.” Applied to software development, it maps out like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2h5vw3zfmbpph9qsirw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2h5vw3zfmbpph9qsirw.png" alt="The Five Levels" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Level 0 — Spicy Autocomplete. You’re still writing every character. Maybe you use AI as a search engine or accept the occasional tab suggestion. The code is yours. You’re also losing ground every day to the people who aren’t at Level 0.&lt;/p&gt;

&lt;p&gt;Level 1 — The Coding Intern. You’re delegating discrete tasks. “Write a unit test for this.” “Add a docstring here.” You’re seeing some speedup. Your job is essentially unchanged. YOU are still the bottleneck.&lt;/p&gt;

&lt;p&gt;Level 2 — Junior Developer. This is where it starts to feel like something. You’re pairing with AI like a colleague. Flow state. Productivity you haven’t felt in years. You hand off the boring stuff and focus on the interesting parts. Here’s the trap though — Level 2 feels like you’re done. You’re not done. Most people who think they’re “using AI” are living here permanently and calling it Level 5.&lt;/p&gt;

&lt;p&gt;Level 3 — Developer as Manager. You’re not writing much code anymore. Your AI agent has multiple tabs running at all times. Your life is diffs. You review everything at the PR level. For a lot of people, this actually feels worse than Level 2 — more overhead, less flow. And this is where almost everyone tops out. Not because they can’t go further. Because Level 3 feels like the ceiling.&lt;/p&gt;

&lt;p&gt;Level 4 — Developer as Product Manager. The code is a black box. You write specs. You argue about the specs. You set up the right tools, define the right constraints, and then you leave for 12 hours. You come back and check if the tests passed. Dan Shapiro says he’s here. I believe him because of how he describes it — it doesn’t sound glamorous, it sounds like a different kind of hard work.&lt;/p&gt;

&lt;p&gt;Level 5 — The Dark Factory. Named after Fanuc’s robot factory, which is staffed entirely by robots: lights off, no humans needed or welcome. At Level 5, you’re not really running a software process anymore. You have a system that turns specs into software, autonomously. A handful of teams are doing this today. Small teams, fewer than five people, shipping production software with no human-written or human-reviewed code. A prominent example is Anthropic’s Claude Code team, who say that Claude Code wrote most of Claude Code.&lt;/p&gt;





&lt;h3&gt;
  
  
  The Part That Actually Stings
&lt;/h3&gt;

&lt;p&gt;Here’s what Nate B. Jones added to this framework, which made me sit with it for a while.&lt;/p&gt;

&lt;p&gt;There’s a rigorous study (METR’s 2025 randomized controlled trial) showing that experienced open-source developers using AI tools took 19% longer on tasks, even though they had predicted a 24% speedup and still believed afterward that they had been about 20% faster. The gap between perceived productivity and actual productivity even has a name: the AI confidence gap. You feel faster. The clock disagrees.&lt;/p&gt;

&lt;p&gt;The teams that are pulling away aren’t using more AI tools. They’re using AI differently. The bottleneck has moved. It’s no longer about how fast you can implement. It’s about how precisely you can specify.&lt;/p&gt;

&lt;p&gt;The Dark Factory teams aren’t superhuman coders. They’ve built infrastructure around judgment — external behavioural scenarios that the AI cannot see during the build process, digital twin environments that simulate production dependencies safely, testing architectures designed specifically so the AI can’t reverse-engineer the passing criteria.&lt;/p&gt;

&lt;p&gt;The rest of the industry has plateaued at Level 3, reviewing diffs and measuring velocity in story points on a Jira board that hasn’t been groomed since Q2.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why I’m Writing This
&lt;/h3&gt;

&lt;p&gt;I’ve been a software engineer for nearly a decade. I’ve done the serverless architectures, the MLOps pipelines, the distributed systems. I’ve been to re:Invent three times (ooooh, fancy schmancy). I know the stack.&lt;/p&gt;

&lt;p&gt;And I watched these videos and realized I was at Level 2. Maybe Level 3 on a good week. And I had been calling it “using AI effectively.”&lt;/p&gt;

&lt;p&gt;This newsletter is the documentation of that climb — starting with understanding what the levels actually mean and why most of us have been misreading where we stand. Or at least that’s where I’m starting. I wouldn’t be surprised if my own understanding shifts incrementally as I make progress. That’s kind of the point.&lt;/p&gt;

&lt;p&gt;The focus is the practice, not the theory: the tools, the habits, the org-level thinking, the moments where something clicks. I’ll be honest when I’m stuck and honest when something works. No performance, no hype.&lt;/p&gt;

&lt;p&gt;The climb starts with understanding the map. Now you have it.&lt;/p&gt;




&lt;p&gt;Next issue: The specification quality problem — why the bottleneck has shifted and what “writing a good spec” actually means in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources &amp;amp; Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)&lt;/a&gt; · &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; · &lt;a href="https://natesnewsletter.substack.com/" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
