<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Diya Burman</title>
    <description>The latest articles on DEV Community by Diya Burman (@diyaburman).</description>
    <link>https://dev.to/diyaburman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F93964%2Fa85c0e0d-f413-4c6e-b6a0-b26ddf9b739d.jpeg</url>
      <title>DEV Community: Diya Burman</title>
      <link>https://dev.to/diyaburman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/diyaburman"/>
    <language>en</language>
    <item>
      <title>Wiring the Guardrails</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 03:04:02 +0000</pubDate>
      <link>https://dev.to/diyaburman/wiring-the-guardrails-2oli</link>
      <guid>https://dev.to/diyaburman/wiring-the-guardrails-2oli</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #6&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 on the five-level ladder this newsletter takes its name from, and that is on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;Five issues in, everything we've built lives on one machine. The Gherkin scenarios, the WireMock stubs, the Pact contracts, the can-i-deploy script — all of it runs locally, passes locally, and means nothing the moment someone else touches the codebase.&lt;/p&gt;

&lt;p&gt;Issue #6 fixes that. A GitHub Actions pipeline now runs on every push, executes the full specification stack in dependency order, and blocks merges to main if anything breaks. The pipeline is the guardrail. From this point on, a broken contract or a failing scenario cannot reach main undetected.&lt;/p&gt;

&lt;p&gt;Getting there took ninety minutes and two interventions I didn't plan for. Both are worth documenting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before the YAML: deciding what "green" means
&lt;/h2&gt;

&lt;p&gt;The first thing Claude Code did before touching any pipeline config was run the full test suite to establish a baseline. The instruction was explicit: everything must pass before a single line of YAML gets written.&lt;/p&gt;

&lt;p&gt;It found a failure immediately — and it wasn't from the breaking change experiment. It was from Issue #5.&lt;/p&gt;

&lt;p&gt;The bad-spec test (&lt;code&gt;test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order&lt;/code&gt;) was still asserting &lt;code&gt;db_status&lt;/code&gt; in the response body. That was intentional in Issue #5 — the failure was the finding. The session ended with it red because the point was to show what bad specs produce. But on main, with CI incoming, that means the pipeline would have been red on day one before a single feature change.&lt;/p&gt;

&lt;p&gt;The fix was adding backward-compat aliases to the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;            &lt;span class="c1"&gt;# good spec field
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;         &lt;span class="c1"&gt;# bad spec alias — keeps Issue #5 test passing
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;placed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# good spec field
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# bad spec alias
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Neither test file was modified. No feature files were touched. The aliases kept both the good-spec and bad-spec tests passing against the same endpoint.&lt;/p&gt;

&lt;p&gt;The reason this matters before the pipeline exists: a team that starts CI with a known failure trains itself to ignore red. The cost of normalising a red CI is much higher than the cost of fixing the baseline first. Claude Code made the right call and documented it before moving on.&lt;/p&gt;
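
&lt;p&gt;As a minimal sketch of why the aliases satisfy both suites at once, the response can be built by one function and checked the way each test file would check it. The helper name &lt;code&gt;build_status_response&lt;/code&gt; and the sample order record are assumptions for illustration; the article only shows the returned dict.&lt;/p&gt;

```python
# Hypothetical helper: the article shows only the returned dict, not a function.
def build_status_response(order_id, order):
    """Return the status payload with public field names plus legacy aliases."""
    return {
        "order_id": order_id,
        "status": order["db_status"],                   # good-spec field
        "db_status": order["db_status"],                # bad-spec alias
        "placed_at": order["order_created_at"],         # good-spec field
        "order_created_at": order["order_created_at"],  # bad-spec alias
    }


# Sample record; the values are invented for illustration.
order = {"db_status": "CONFIRMED", "order_created_at": "2026-06-02T10:15:00"}
body = build_status_response("order-abc-123", order)

# What the bad-spec test from Issue #5 looks for.
assert body["db_status"] == "CONFIRMED"
# What the good-spec test looks for.
assert body["status"] == "CONFIRMED"
assert body["placed_at"] == body["order_created_at"]
```

&lt;p&gt;The trade-off is visible in the dict itself: the storage names stay in the public response for as long as the bad-spec test exists.&lt;/p&gt;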




&lt;h2&gt;
  
  
  The pipeline structure
&lt;/h2&gt;

&lt;p&gt;Four jobs, in dependency order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test → pact-consumer → pact-verify → can-i-deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each job only runs if its predecessor passes. If Gherkin breaks, Pact never runs. If the consumer tests fail, verification never runs. If verification fails, can-i-deploy is skipped. The pipeline fails fast and tells you exactly which layer broke.&lt;/p&gt;

&lt;p&gt;The artifact chain is what makes it a pipeline rather than four parallel scripts. The &lt;code&gt;pact-consumer&lt;/code&gt; job generates the &lt;code&gt;.pact&lt;/code&gt; files and uploads them as a GitHub Actions artifact. The &lt;code&gt;pact-verify&lt;/code&gt; job downloads that artifact and verifies exactly those files, not freshly regenerated ones. Without this hand-off, each job would build its own consumer contract from scratch, and verification would run against whatever it had just regenerated rather than against the contract &lt;code&gt;pact-consumer&lt;/code&gt; actually produced.&lt;/p&gt;
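
&lt;p&gt;The fail-fast ordering can be sketched in a few lines of Python. The job names are the ones from the pipeline; the &lt;code&gt;run_pipeline&lt;/code&gt; function is a hypothetical stand-in for what GitHub Actions does with a chain of job dependencies, not part of the project.&lt;/p&gt;

```python
# Job names come from the article; the runner is a hypothetical simulation of
# a strictly sequential dependency chain.
JOBS = ["test", "pact-consumer", "pact-verify", "can-i-deploy"]


def run_pipeline(job_results):
    """Run jobs in order; once one fails, every later job is skipped."""
    outcome = {}
    blocked = False
    for job in JOBS:
        if blocked:
            outcome[job] = "skipped"
        elif job_results[job]:
            outcome[job] = "passed"
        else:
            outcome[job] = "failed"
            blocked = True
    return outcome


# A broken contract: the first two jobs pass, verification fails.
broken_contract = run_pipeline({
    "test": True,
    "pact-consumer": True,
    "pact-verify": False,
    "can-i-deploy": True,
})
print(broken_contract)
```

&lt;p&gt;The first non-passing entry in the result names the layer that broke, which is the property the pipeline relies on.&lt;/p&gt;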

&lt;p&gt;One non-obvious piece: &lt;code&gt;mock_server.py&lt;/code&gt; is a library module with no command-line entry point. The pipeline needed servers running as background processes. The fix was an inline Python invocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Start mock servers&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;. .venv/bin/activate&lt;/span&gt;
    &lt;span class="s"&gt;python -c "&lt;/span&gt;
    &lt;span class="s"&gt;import time&lt;/span&gt;
    &lt;span class="s"&gt;from mock_server import start_mock_server&lt;/span&gt;
    &lt;span class="s"&gt;start_mock_server(8091, 'wiremock/payment-mappings')&lt;/span&gt;
    &lt;span class="s"&gt;start_mock_server(8092, 'wiremock/inventory-mappings')&lt;/span&gt;
    &lt;span class="s"&gt;time.sleep(86400)&lt;/span&gt;
    &lt;span class="s"&gt;" &amp;amp;&lt;/span&gt;
    &lt;span class="s"&gt;sleep 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;time.sleep(86400)&lt;/code&gt; keeps the process alive for the duration of the job. Inelegant but functional. A proper &lt;code&gt;if __name__ == "__main__"&lt;/code&gt; entry point with argparse is the obvious cleanup for a future session.&lt;/p&gt;
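
&lt;p&gt;A rough sketch of what that entry point could look like, under some assumptions: the &lt;code&gt;--server PORT:MAPPINGS_DIR&lt;/code&gt; flag is a made-up interface, and the server starter is passed in as a parameter so the sketch runs without the real module. In the repository it would presumably be &lt;code&gt;start_mock_server&lt;/code&gt;, whose &lt;code&gt;(port, mappings_dir)&lt;/code&gt; call shape is taken from the YAML step above.&lt;/p&gt;

```python
import argparse


def parse_server(spec):
    """Turn 'PORT:MAPPINGS_DIR' into a (port, mappings_dir) tuple."""
    port, sep, mappings_dir = spec.partition(":")
    if not sep or not port.isdigit() or not mappings_dir:
        raise argparse.ArgumentTypeError(
            "expected PORT:MAPPINGS_DIR, got " + repr(spec))
    return int(port), mappings_dir


def build_parser():
    parser = argparse.ArgumentParser(description="Start one or more mock servers.")
    # --server is a hypothetical flag; repeat it once per server.
    parser.add_argument("--server", action="append", type=parse_server,
                        required=True, metavar="PORT:MAPPINGS_DIR")
    return parser


def main(argv, start_server, wait_forever):
    """Start every requested server, then block until the process is stopped."""
    args = build_parser().parse_args(argv)
    for port, mappings_dir in args.server:
        start_server(port, mappings_dir)
    wait_forever()


# Dry run: record the calls instead of starting servers, and do not block.
started = []
main(
    ["--server", "8091:wiremock/payment-mappings",
     "--server", "8092:wiremock/inventory-mappings"],
    start_server=lambda port, mappings_dir: started.append((port, mappings_dir)),
    wait_forever=lambda: None,
)
print(started)
```

&lt;p&gt;Inside the real module, the &lt;code&gt;if __name__ == "__main__"&lt;/code&gt; block would call &lt;code&gt;main&lt;/code&gt; with &lt;code&gt;sys.argv[1:]&lt;/code&gt;, the real starter, and something that blocks, such as &lt;code&gt;threading.Event().wait&lt;/code&gt;, in place of the 86400-second sleep.&lt;/p&gt;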




&lt;h2&gt;
  
  
  The first CI run — and why I had to intervene manually
&lt;/h2&gt;

&lt;p&gt;The YAML was committed, pushed to main, and the pipeline ran. All three runs failed on the &lt;code&gt;test&lt;/code&gt; job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;OSError: [Errno 98] Address already in use
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ports 8091 and 8092. Every test in &lt;code&gt;test_order_creation.py&lt;/code&gt; errored at setup. The order status tests — which don't use the mock servers — passed fine.&lt;/p&gt;

&lt;p&gt;Claude Code didn't catch this on its own. Here's why that's worth explaining.&lt;/p&gt;

&lt;p&gt;When Claude Code wrote the pipeline, it was working from the codebase and its own knowledge of GitHub Actions patterns. It knew the mock servers needed to be running before pytest started, so it added an explicit start-servers step to the YAML — a reasonable decision based on the information it had. What it couldn't see was the runtime interaction between that YAML step and pytest's session-scoped fixtures, because that interaction only manifests in the CI environment, not locally.&lt;/p&gt;

&lt;p&gt;Locally, running &lt;code&gt;pytest tests/steps/ -v&lt;/code&gt; has always worked correctly because the session fixture starts the servers and nothing else competes. Claude Code had only ever seen local runs succeed. It had no signal that the YAML step was creating a conflict — because the conflict doesn't exist locally.&lt;/p&gt;

&lt;p&gt;This is a fundamental limit of the "paste and walk away" approach at the boundary between local and remote environments: the agent can reason about the codebase and about CI patterns, but it can't observe the CI run itself. The failure was on GitHub. Claude Code was in a terminal. Those two things weren't connected.&lt;/p&gt;

&lt;p&gt;I diagnosed the error from the GitHub Actions log, explained the root cause, and pasted new instructions. Claude Code fixed it in one step — removing the redundant YAML steps entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Removed from both test and pact-verify jobs:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Start mock servers&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;. .venv/bin/activate&lt;/span&gt;
    &lt;span class="s"&gt;python -c "..." &amp;amp;&lt;/span&gt;
    &lt;span class="s"&gt;sleep 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pytest session fixtures already own server lifecycle correctly. &lt;code&gt;scope="session"&lt;/code&gt; means pytest starts the servers once per test run and keeps them alive. The YAML step was duplicating a responsibility that was already handled. The fix wasn't a workaround — it was removing the wrong layer.&lt;/p&gt;

&lt;p&gt;The root cause in plain terms: the YAML step and the pytest fixture both thought they were responsible for starting the servers. The port was already bound when the fixture tried to bind it again. Works on my machine. Breaks in CI. Classic.&lt;/p&gt;
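
&lt;p&gt;The failure mode is easy to reproduce outside CI. The sketch below is not the project's fixture code; it binds a port once, then tries to bind it a second time, which is roughly what happened when the fixture ran after the YAML step. It uses an OS-assigned port rather than 8091 so it does not collide with anything already running.&lt;/p&gt;

```python
import errno
import socket


def second_bind_fails():
    """Bind a port once, then attempt a second bind on the same port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # the duplicate owner
        except OSError as exc:
            # Same error family as the CI log: address already in use.
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()


print(second_bind_fails())
```

&lt;p&gt;Nothing about this depends on the CI platform. Whichever layer binds second loses, on any machine.&lt;/p&gt;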




&lt;h2&gt;
  
  
  The breaking change experiment — in the pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hcbs5naigu35jq49rsh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hcbs5naigu35jq49rsh.png" alt="All four jobs green. 1m 34s. SAFE TO DEPLOY." width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the pipeline green, the breaking change test ran as designed.&lt;/p&gt;

&lt;p&gt;Branch &lt;code&gt;test/breaking-change-pipeline&lt;/code&gt;, commit &lt;code&gt;76c0d89&lt;/code&gt;: renamed &lt;code&gt;"status"&lt;/code&gt; to &lt;code&gt;"result"&lt;/code&gt; in &lt;code&gt;wiremock/payment-mappings/payment-success.json&lt;/code&gt;. Same change as Issue #4, now running through CI instead of local verification.&lt;/p&gt;

&lt;p&gt;The expected failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="p"&gt;a successful payment charge (FAILED)
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;Failures:
1) Verifying a pact between OrderService and PaymentGateway
&lt;/span&gt;   1.1) has a matching body
          $ -&amp;gt; Actual map is missing the following keys: status
   {
     "amount": 134.97,
  -  "status": "ACCEPTED",
  +  "result": "ACCEPTED",
     "transaction_id": "txn-abc-123"
   }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pact-verify&lt;/code&gt; fails. &lt;code&gt;can-i-deploy&lt;/code&gt; is skipped. The merge is blocked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp59n3ed1uf7lvpqvnbh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp59n3ed1uf7lvpqvnbh9.png" alt="test and pact-consumer pass. pact-verify catches the broken contract. can-i-deploy never runs." width="799" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the key point from Issue #4 holds at the pipeline level: the &lt;code&gt;test&lt;/code&gt; job — the Gherkin suite — would pass with the breaking change in place. The order creation scenarios check HTTP status codes and business outcomes. They never read &lt;code&gt;pay_resp.json()["status"]&lt;/code&gt;. A stub returning &lt;code&gt;result&lt;/code&gt; instead of &lt;code&gt;status&lt;/code&gt; still returns HTTP 200. Gherkin passes. Pact catches it.&lt;/p&gt;

&lt;p&gt;This is the division of labour. Gherkin proves the system does the right thing. Pact proves the contracts don't drift. You need both, and now both run automatically on every push.&lt;/p&gt;
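
&lt;p&gt;A simplified illustration of that split, using the stub body from the failure output above. The two check functions are hypothetical stand-ins: the first mimics a scenario step that only looks at the HTTP status code, the second mimics the key comparison that the verifier reported. Neither is how pytest-bdd or Pact is implemented internally.&lt;/p&gt;

```python
# Stub bodies before and after the rename; values are from the Pact output.
original_stub = {"amount": 134.97, "status": "ACCEPTED", "transaction_id": "txn-abc-123"}
renamed_stub = {"amount": 134.97, "result": "ACCEPTED", "transaction_id": "txn-abc-123"}


def gherkin_style_check(http_status, body):
    """Mimics a step that asserts on the status code and never reads the body."""
    return http_status == 200


def contract_style_check(body, expected_keys):
    """Mimics the body comparison: list every expected key the response lacks."""
    return sorted(key for key in expected_keys if key not in body)


expected_keys = set(original_stub)
scenario_passes = gherkin_style_check(200, renamed_stub)
missing_keys = contract_style_check(renamed_stub, expected_keys)

print(scenario_passes)  # True
print(missing_keys)     # ['status']
```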




&lt;h2&gt;
  
  
  The one step that requires the GitHub UI
&lt;/h2&gt;

&lt;p&gt;Claude Code cannot configure branch protection rules — that requires the GitHub web UI or admin API. This step is non-negotiable and must be done manually:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Repo → &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Branches&lt;/strong&gt; → &lt;strong&gt;Add branch protection rule&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Branch name pattern: &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;Require status checks to pass before merging&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add all four status checks: &lt;code&gt;test&lt;/code&gt;, &lt;code&gt;pact-consumer&lt;/code&gt;, &lt;code&gt;pact-verify&lt;/code&gt;, &lt;code&gt;can-i-deploy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;Require branches to be up to date before merging&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Save&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without this, the pipeline is advisory. A push to main can still happen even if all four jobs are red. The pipeline becomes a dashboard — it shows you the problem but doesn't stop anything. Branch protection is what turns "CI failed" from a notification into enforcement. The pipeline is only a guardrail if something stops you going around it.&lt;/p&gt;
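
&lt;p&gt;For anyone taking the admin API route instead of the UI, the request could be shaped roughly like this. It is a sketch of my understanding of GitHub's branch protection endpoint, so check the field names against the current REST documentation before relying on it. &lt;code&gt;OWNER&lt;/code&gt; and &lt;code&gt;REPO&lt;/code&gt; are placeholders, the &lt;code&gt;enforce_admins&lt;/code&gt; choice is an assumption, and nothing is actually sent.&lt;/p&gt;

```python
import json

# The four status checks named in the steps above.
CHECKS = ["test", "pact-consumer", "pact-verify", "can-i-deploy"]


def protection_request(owner, repo, branch="main"):
    """Build the URL and JSON body for a branch protection update (not sent)."""
    url = "https://api.github.com/repos/{}/{}/branches/{}/protection".format(
        owner, repo, branch)
    body = {
        "required_status_checks": {
            "strict": True,      # require branches to be up to date before merging
            "contexts": CHECKS,  # checks that must pass before merging
        },
        "enforce_admins": True,  # assumption: admins are held to the same rule
        "required_pull_request_reviews": None,
        "restrictions": None,
    }
    return url, body


url, body = protection_request("OWNER", "REPO")
print("PUT", url)
print(json.dumps(body, indent=2))
```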




&lt;h2&gt;
  
  
  The honest part
&lt;/h2&gt;

&lt;p&gt;The YAML took about twenty minutes to write. The session took ninety minutes total — because the baseline fix and the port conflict ate the rest.&lt;/p&gt;

&lt;p&gt;The instinct during the baseline audit was to skip past the known failure. It's a demo test, we know why it's there, configure CI to skip that file and move on. That would have been thirty seconds. It also would have been wrong — a pipeline with documented exceptions is a pipeline people route around.&lt;/p&gt;

&lt;p&gt;The instinct during the port conflict was to blame the CI environment. Ubuntu runs things differently, ports work differently, it's a platform quirk. That framing would have sent the debugging in the wrong direction. The actual cause was simpler: two layers both thought they owned the same responsibility, and nobody had written down which one was actually in charge.&lt;/p&gt;

&lt;p&gt;Both of those moments are the J-curve. Not the YAML — the discipline of not skipping and not blaming the environment. The overhead of CI is not the config file. It's every decision about what "green" actually means and who's responsible for what.&lt;/p&gt;

&lt;p&gt;The pipeline is now real infrastructure. The breaking change can't reach main. That's worth ninety minutes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: The Scope Problem — scaling Gherkin across a multi-service system. What happens when one spec file isn't enough, and how spec debt forms.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/actions" rel="noopener noreferrer"&gt;GitHub Actions documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.pact.io" rel="noopener noreferrer"&gt;Pact documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-06-cicd-guardrails.md" rel="noopener noreferrer"&gt;Session findings — Issue #6&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>devops</category>
      <category>ai</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Spec That Doesn't Lie</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 02:40:42 +0000</pubDate>
      <link>https://dev.to/diyaburman/the-spec-that-doesnt-lie-5a00</link>
      <guid>https://dev.to/diyaburman/the-spec-that-doesnt-lie-5a00</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #5&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 on the five-level ladder this newsletter takes its name from, and that is on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;Every issue so far has assumed something I haven't said out loud: that the specs are good. Issue #2 wrote them carefully. Issue #3 handed them to an agent and watched it build correctly. Issue #4 proved the contracts survive provider drift.&lt;/p&gt;

&lt;p&gt;But what happens when the spec isn't good? Not broken — Gherkin syntax is fine, tests pass, the agent builds something. Just imprecise. Vague in ways that feel precise when you're writing them.&lt;/p&gt;

&lt;p&gt;This issue answers that question by doing the thing deliberately. I wrote bad Gherkin on purpose, handed it to the agent, watched what it built — and then rewrote the spec and did it again. The difference between the two implementations is the article.&lt;/p&gt;




&lt;h2&gt;
  
  
  The hardest thing about bad specs
&lt;/h2&gt;

&lt;p&gt;Bad specs are hard to spot when you're writing them because they feel complete.&lt;/p&gt;

&lt;p&gt;A scenario that references implementation details sounds like a reasonable description: you wrote the implementation, so the details feel like specifics. A Given clause that feels obvious to you will be interpreted differently by every reader who hasn't seen the code. The Gherkin is syntactically correct. The tests pass. Nothing in the output signals that anything is wrong.&lt;/p&gt;

&lt;p&gt;This is the trap. It's not that bad specs break things. It's that they don't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The endpoint
&lt;/h2&gt;

&lt;p&gt;I added a new endpoint to the order-api project: &lt;code&gt;GET /orders/{order_id}/status&lt;/code&gt;. It returns the current status of an order and relevant metadata. Simple enough that the spec should be easy to write well. Which makes it a good target for writing it badly on purpose.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bad specs
&lt;/h2&gt;

&lt;p&gt;Two scenarios. Both syntactically valid. Both produce passing tests. Both wrong in different ways.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# BAD SPEC 1 — The leaky spec&lt;/span&gt;
&lt;span class="c"&gt;# Problem: references internal implementation concepts (db_status, order_created_at)&lt;/span&gt;
&lt;span class="c"&gt;# rather than describing what a caller observes. The agent uses these names literally&lt;/span&gt;
&lt;span class="c"&gt;# in the response body, leaking storage terminology into the public API contract.&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Retrieving status for a confirmed order
  &lt;span class="nf"&gt;Given &lt;/span&gt;an order exists in the system with db_status &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
  &lt;span class="nf"&gt;When &lt;/span&gt;I request GET /orders/{order_id}/status
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response should contain the db_status field set to &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the order_created_at field should be populated from the order record

&lt;span class="c"&gt;# BAD SPEC 2 — The vague Given&lt;/span&gt;
&lt;span class="c"&gt;# Problem: "an order that has not been placed" is underspecified. The agent must&lt;/span&gt;
&lt;span class="c"&gt;# guess what this means — a malformed ID? A well-formed UUID with no record?&lt;/span&gt;
&lt;span class="c"&gt;# A deleted order? Each interpretation is plausible and produces different behavior.&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Retrieving status for an order that does not exist
  &lt;span class="nf"&gt;Given &lt;/span&gt;an order that has not been placed
  &lt;span class="nf"&gt;When &lt;/span&gt;I request GET /orders/{order_id}/status
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response should indicate the order was not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both passed immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tests/steps/test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order PASSED
tests/steps/test_order_status_bad.py::test_retrieving_status_for_an_order_that_does_not_exist PASSED

2 passed in 0.34s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Green. No warnings. No hint that anything is wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the agent built from the bad specs
&lt;/h2&gt;

&lt;p&gt;Here's the implementation the agent produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/orders/{order_id}/status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_order_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It satisfies the spec completely. It also made four decisions the spec never made:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: The field is named &lt;code&gt;db_status&lt;/code&gt; in the response.&lt;/strong&gt;&lt;br&gt;
The spec said &lt;code&gt;db_status&lt;/code&gt; so the agent used &lt;code&gt;db_status&lt;/code&gt;. It never questioned whether this was an internal name leaking into a public API. It satisfied the spec literally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: A missing order returns 404.&lt;/strong&gt;&lt;br&gt;
The spec says "indicate the order was not found." 404 is a defensible interpretation. So is 422, 403, or a 200 with a &lt;code&gt;NOT_FOUND&lt;/code&gt; status field. The agent picked the most conventional option — but the spec never mandated it, and FastAPI's default 404 body is &lt;code&gt;{"detail": "Order not found"}&lt;/code&gt;, not &lt;code&gt;{"error": "Order not found"}&lt;/code&gt;. A client checking &lt;code&gt;response.json()["error"]&lt;/code&gt; gets a KeyError.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 3: The timestamp field is named &lt;code&gt;order_created_at&lt;/code&gt; with no format requirement.&lt;/strong&gt;&lt;br&gt;
The spec says "populated from the order record." The agent chose &lt;code&gt;order_created_at&lt;/code&gt; and returned an ISO string because that's what &lt;code&gt;datetime.utcnow().isoformat()&lt;/code&gt; produces. The step definition checked only that the field is a non-empty string, so any string format would have passed. A Unix timestamp serialized as a string would have passed. A human-readable string like "June 2nd" would have passed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 4: The order store is in-memory.&lt;/strong&gt;&lt;br&gt;
The spec says nothing about persistence. An in-memory dict is the simplest thing that makes the tests pass. In production, orders are persisted. The in-memory store vanishes on restart and isn't shared across worker processes.&lt;/p&gt;

&lt;p&gt;Every one of these decisions is plausible. The agent made the reasonable call every time. That's not the problem. The problem is that a different agent, given the same spec, might have made different reasonable calls — and both implementations would pass the same test suite.&lt;/p&gt;
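&lt;p&gt;To make that concrete, here is a minimal sketch in plain Python rather than FastAPI. The names &lt;code&gt;agent_a&lt;/code&gt;, &lt;code&gt;agent_b&lt;/code&gt;, and &lt;code&gt;passes_bad_spec&lt;/code&gt; are hypothetical stand-ins I made up for illustration, not code from the repository, and the checks are my approximation of what the loose spec asserts. Both handlers pass, yet they disagree on the timestamp format and on the 404 body.&lt;/p&gt;

```python
# Hypothetical stand-ins for two agent-built handlers (not the repository code).
# Each returns (status_code, body) for a lookup in a tiny in-memory store.
ORDERS = {"order-abc-123": {"db_status": "CONFIRMED"}}


def agent_a(order_id):
    order = ORDERS.get(order_id)
    if order is None:
        return 404, {"detail": "Order not found"}
    return 200, {
        "order_id": order_id,
        "db_status": order["db_status"],
        "order_created_at": "2026-06-02T10:15:30",
    }


def agent_b(order_id):
    order = ORDERS.get(order_id)
    if order is None:
        return 404, {"error": "Order not found"}
    return 200, {
        "order_id": order_id,
        "db_status": order["db_status"],
        "order_created_at": "June 2nd",
    }


def passes_bad_spec(handler):
    """Apply only the checks the loose spec actually makes."""
    status, body = handler("order-abc-123")
    found_ok = (
        status == 200
        and body["db_status"] == "CONFIRMED"
        and isinstance(body["order_created_at"], str)
        and body["order_created_at"] != ""
    )
    missing_status, _ = handler("order-xyz-999")
    return found_ok and missing_status == 404


# Both handlers satisfy the loose spec...
print(passes_bad_spec(agent_a), passes_bad_spec(agent_b))
# ...but their 404 bodies are not the same contract.
print(agent_a("order-xyz-999")[1] == agent_b("order-xyz-999")[1])
```

&lt;p&gt;A client written against one handler would break against the other, and nothing in the loose checks would say so.&lt;/p&gt;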


&lt;h2&gt;
  
  
  The rewrite
&lt;/h2&gt;

&lt;p&gt;Writing the good spec forced every decision the bad spec had silently delegated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# GOOD SPEC 1 — Caller's perspective, not implementation's&lt;/span&gt;
&lt;span class="c"&gt;# Fixed: field names describe what the caller observes (status, placed_at)&lt;/span&gt;
&lt;span class="c"&gt;# not what the storage layer calls them (db_status, order_created_at).&lt;/span&gt;
&lt;span class="c"&gt;# The format of placed_at is now a contract obligation, not an assumption.&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Confirmed order status is returned with placement timestamp
  &lt;span class="nf"&gt;Given &lt;/span&gt;a confirmed order with id &lt;span class="s"&gt;"order-abc-123"&lt;/span&gt; exists in the system
  &lt;span class="nf"&gt;When &lt;/span&gt;I request GET /orders/order-abc-123/status
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response status code is 200
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains &lt;span class="s"&gt;"order_id"&lt;/span&gt; equal to &lt;span class="s"&gt;"order-abc-123"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains &lt;span class="s"&gt;"status"&lt;/span&gt; equal to &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains &lt;span class="s"&gt;"placed_at"&lt;/span&gt; as a valid ISO 8601 timestamp

&lt;span class="c"&gt;# GOOD SPEC 2 — Precise Given, explicit 404 body shape&lt;/span&gt;
&lt;span class="c"&gt;# Fixed: the Given now names a specific order id with no corresponding record.&lt;/span&gt;
&lt;span class="c"&gt;# The 404 response body shape is now a contract obligation, not a guess.&lt;/span&gt;

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Unknown order id returns 404 with error message
  &lt;span class="nf"&gt;Given &lt;/span&gt;no order with id &lt;span class="s"&gt;"order-xyz-999"&lt;/span&gt; exists in the system
  &lt;span class="nf"&gt;When &lt;/span&gt;I request GET /orders/order-xyz-999/status
  &lt;span class="nf"&gt;Then &lt;/span&gt;the response status code is 404
  &lt;span class="nf"&gt;And &lt;/span&gt;the response body contains an &lt;span class="s"&gt;"error"&lt;/span&gt; field
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what changed. The scenarios describe the same two situations. The intent is identical. But now every decision is in the spec rather than in the agent's interpretation of the spec.&lt;/p&gt;
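&lt;p&gt;The timestamp step is the easiest place to see the difference between the two specs. Below is a rough sketch of the two kinds of assertion. The helper names &lt;code&gt;is_nonempty_string&lt;/code&gt; and &lt;code&gt;is_iso8601&lt;/code&gt; are hypothetical, and I'm assuming &lt;code&gt;datetime.fromisoformat&lt;/code&gt; is a good enough stand-in for "valid ISO 8601". It is an approximation: the exact set of strings it accepts varies between Python versions.&lt;/p&gt;

```python
from datetime import datetime


def is_nonempty_string(value):
    """The loose check: any non-empty string passes."""
    return isinstance(value, str) and value != ""


def is_iso8601(value):
    """The stricter check: the string must parse as an ISO-style timestamp."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


# Print each sample with the result of the loose and the strict check.
for sample in ["2026-06-02T10:15:30+00:00", "June 2nd"]:
    print(sample, is_nonempty_string(sample), is_iso8601(sample))
```

&lt;p&gt;Both samples satisfy the loose check. Only the first one satisfies the strict check, which is the one the good spec demands.&lt;/p&gt;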




&lt;h2&gt;
  
  
  What the agent built from the good spec
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/orders/{order_id}/status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_order_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;placed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same endpoint. Same logic. Different API.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;db_status&lt;/code&gt; became &lt;code&gt;status&lt;/code&gt;. &lt;code&gt;order_created_at&lt;/code&gt; became &lt;code&gt;placed_at&lt;/code&gt;. The 404 body now contains &lt;code&gt;error&lt;/code&gt; not &lt;code&gt;detail&lt;/code&gt;. The timestamp is now asserted to be ISO 8601 — not just non-empty.&lt;/p&gt;

&lt;p&gt;These are not cosmetic differences. They are different contracts that clients build against.&lt;/p&gt;




&lt;h2&gt;
  
  
  The cross-run
&lt;/h2&gt;

&lt;p&gt;After building from the good spec, I ran the bad-spec tests against the new implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tests/steps/test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order FAILED
tests/steps/test_order_status_bad.py::test_retrieving_status_for_an_order_that_does_not_exist PASSED

E   KeyError: 'db_status'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The leaky test failed. The field &lt;code&gt;db_status&lt;/code&gt; doesn't exist in the good implementation — it's been renamed to &lt;code&gt;status&lt;/code&gt;, which is what a caller should see. The test that was checking for an internal name is now broken, correctly.&lt;/p&gt;

&lt;p&gt;The vague test passed. Both implementations return a 404 for a missing order. The difference is why: the bad implementation returns it because the agent chose to, the good implementation because the spec requires it.&lt;/p&gt;

&lt;p&gt;That asymmetry is instructive. The vague Given produced the right answer by coincidence. The leaky Then produced the wrong field name by construction. One was luck. One was baked in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Both implementations pass their own test suites. That is the trap.&lt;/p&gt;

&lt;p&gt;If you run the bad-spec tests against the bad-spec implementation: green. If you run the good-spec tests against the good-spec implementation: green. The difference only surfaces when you cross-run — and in production, you never cross-run. You ship the bad implementation, it passes CI, and the problem lands in a client exception report six months later.&lt;/p&gt;
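&lt;p&gt;Here is a small sketch of that cross-run as a two-by-two matrix. Everything in it is a simplified, hypothetical stand-in: the response bodies are hard-coded dicts and each "suite" is reduced to a single field check, so this is not the real pytest-bdd suite from the repository.&lt;/p&gt;

```python
# Hard-coded response bodies standing in for the two implementations.
def bad_impl():
    return {"order_id": "order-abc-123", "db_status": "CONFIRMED",
            "order_created_at": "2026-06-02T10:15:30"}


def good_impl():
    return {"order_id": "order-abc-123", "status": "CONFIRMED",
            "placed_at": "2026-06-02T10:15:30"}


# Each suite is reduced to the one field lookup that distinguishes the specs.
def bad_suite(body):
    return body["db_status"] == "CONFIRMED"


def good_suite(body):
    return body["status"] == "CONFIRMED"


def run(suite, impl):
    """Run one suite against one implementation and report the outcome."""
    try:
        return "PASS" if suite(impl()) else "FAIL"
    except KeyError as exc:
        return f"FAIL (KeyError: {exc})"


for suite in (bad_suite, good_suite):
    for impl in (bad_impl, good_impl):
        print(suite.__name__, "vs", impl.__name__, run(suite, impl))
```

&lt;p&gt;The diagonal is green. The off-diagonal cells are the ones CI never runs.&lt;/p&gt;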

&lt;p&gt;Here's the concrete difference: the bad-spec implementation returns &lt;code&gt;db_status&lt;/code&gt; and &lt;code&gt;order_created_at&lt;/code&gt; with no format guarantee. The good-spec implementation returns &lt;code&gt;status&lt;/code&gt; and &lt;code&gt;placed_at&lt;/code&gt; with a mandatory ISO 8601 format. An agent given the bad spec had no way to know that &lt;code&gt;db_status&lt;/code&gt; was wrong — the spec said &lt;code&gt;db_status&lt;/code&gt;. An agent given the good spec had no choice but to produce &lt;code&gt;status&lt;/code&gt; — the spec said &lt;code&gt;status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Spec quality is not about whether tests pass. It is about how much of the implementation the spec author wrote versus how much was silently delegated to the agent. Every silent delegation is a place where two agents given the same spec can produce different implementations that both pass but disagree on the contract.&lt;/p&gt;

&lt;p&gt;At scale — dozens of endpoints, hundreds of scenarios — that disagreement is the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The practical test for a good spec
&lt;/h2&gt;

&lt;p&gt;Before handing any scenario to an agent, ask one question: &lt;em&gt;what decisions does this scenario leave open?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "none — every field name, format, response code, and body shape is specified," the spec is ready. If the answer is "a few reasonable ones," those are the places where your implementation and the next agent's implementation will silently diverge.&lt;/p&gt;

&lt;p&gt;The agent will always make reasonable decisions. That's not the problem. The problem is that reasonable is not the same as specified — and at Level 4, specified is the only thing that counts.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: Wiring the Guardrails — GitHub Actions, the Pact Broker, and the pipeline that turns contract violations into blocked merges automatically.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cucumber.io/docs/gherkin/" rel="noopener noreferrer"&gt;Cucumber + Gherkin documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-05-the-spec-that-doesnt-lie.md" rel="noopener noreferrer"&gt;Session findings — Issue #5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>softwareengineering</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Contract That Survives the Agent</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 02:24:34 +0000</pubDate>
      <link>https://dev.to/diyaburman/how-pact-contract-testing-catches-breaking-changes-that-wiremock-misses-3ge6</link>
      <guid>https://dev.to/diyaburman/how-pact-contract-testing-catches-breaking-changes-that-wiremock-misses-3ge6</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #4&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in the context of the title of this newsletter) on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;If you've been following along, you know where we are. &lt;a href="https://dev.to/diyaburman/the-bottleneck-moved-did-you-notice-5beb"&gt;Issue #2&lt;/a&gt; introduced WireMock and Gherkin — write the behavioral contract before the code, stub your dependencies, run a real test suite. &lt;a href="https://dev.to/diyaburman/i-gave-the-agent-the-spec-and-walked-away-heres-what-it-built-jja"&gt;Issue #3&lt;/a&gt; handed that spec to an AI agent and walked away. Five scenarios passed. The agent even found a bug in my code.&lt;/p&gt;

&lt;p&gt;Everything worked. And that's exactly the problem this issue is about.&lt;/p&gt;

&lt;p&gt;Because the WireMock stubs working perfectly is not the same thing as the real services working. The gap between those two statements is where production incidents are born.&lt;/p&gt;




&lt;h2&gt;
  
  
  The confidence trap
&lt;/h2&gt;

&lt;p&gt;Here's the scenario nobody talks about until it happens to them.&lt;/p&gt;

&lt;p&gt;Your order service calls a payment gateway. You've stubbed it with WireMock. Your Gherkin scenarios pass. Your agent builds against those stubs. Five for five, green across the board.&lt;/p&gt;

&lt;p&gt;Meanwhile, the payment gateway team — a different squad, a different repo, maybe a different company entirely — ships a cleanup. They've been inconsistent about field naming across their API. &lt;code&gt;status&lt;/code&gt; in one endpoint, &lt;code&gt;result&lt;/code&gt; in another. They standardize. They rename &lt;code&gt;status&lt;/code&gt; to &lt;code&gt;result&lt;/code&gt; in the charge response. Their tests pass. They deploy.&lt;/p&gt;

&lt;p&gt;Your tests still pass too. The stub hasn't changed. The stub will never change unless you change it.&lt;/p&gt;

&lt;p&gt;The first time you learn about the rename is a production incident.&lt;/p&gt;

&lt;p&gt;This is the confidence trap: a mock that can drift from the real service makes you feel safe right up until production proves you weren't. The tests are green. The contract is broken. You just don't know it yet.&lt;/p&gt;
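&lt;p&gt;Stripped of the HTTP layer, the trap looks something like the sketch below. The function names are hypothetical and the responses are hard-coded dicts, so treat it as an illustration of drift rather than the project's actual client code.&lt;/p&gt;

```python
def stub_charge():
    # Frozen copy of the gateway response from the day the stub was written.
    return {"status": "ACCEPTED", "transaction_id": "txn-abc-123"}


def real_charge_after_rename():
    # What the gateway returns once its team renames status to result.
    return {"result": "ACCEPTED", "transaction_id": "txn-abc-123"}


def order_is_paid(charge):
    """Consumer logic: reads the status field from the gateway response."""
    return charge()["status"] == "ACCEPTED"


# Against the stub the consumer logic works, so the test suite stays green.
print("against stub:", order_is_paid(stub_charge))

# Against the renamed provider the same logic blows up.
try:
    order_is_paid(real_charge_after_rename)
except KeyError as exc:
    print("against real service: KeyError", exc)
```

&lt;p&gt;Nothing in the first call can ever detect the rename, because the stub is the only thing it talks to.&lt;/p&gt;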




&lt;h2&gt;
  
  
  What Pact does differently
&lt;/h2&gt;

&lt;p&gt;WireMock is a &lt;em&gt;behavioral double&lt;/em&gt; — it simulates a service so your tests can run in isolation. You define what it returns. You maintain it. You can make it say anything you want, which means it can silently lie about what the real service actually does.&lt;/p&gt;

&lt;p&gt;Pact inverts the trust relationship.&lt;/p&gt;

&lt;p&gt;Instead of you maintaining a stub that you hope reflects reality, your consumer tests &lt;em&gt;declare what they need&lt;/em&gt; from the provider. Those declarations get written into a pact file, a machine-readable JSON contract. The provider then runs verification against that contract before it ships. If the provider no longer satisfies what the consumer declared, verification fails and the deploy is blocked.&lt;/p&gt;

&lt;p&gt;The consumer defines the need. The provider proves delivery. No human has to remember to update a stub.&lt;/p&gt;
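&lt;p&gt;As a toy model of that inversion, here is a sketch in plain Python. To be clear about the assumptions: this is not the Pact API and not the real pact file format. &lt;code&gt;CONSUMER_EXPECTATION&lt;/code&gt; and &lt;code&gt;verify_provider&lt;/code&gt; are names I invented to show the shape of the idea, which is that the consumer writes down what it needs and the provider is checked against it.&lt;/p&gt;

```python
# What the consumer declares it needs from the charge endpoint (toy format).
CONSUMER_EXPECTATION = {
    "request": {"method": "POST", "path": "/payments/charge/success"},
    "response": {"status": 200, "required_fields": ["status", "transaction_id"]},
}


def verify_provider(expectation, provider_response):
    """Return the list of ways the provider fails the consumer's declaration."""
    problems = []
    expected = expectation["response"]
    if provider_response["status"] != expected["status"]:
        problems.append("unexpected status code")
    for field in expected["required_fields"]:
        if field not in provider_response["body"]:
            problems.append(f"missing field: {field}")
    return problems


before_rename = {
    "status": 200,
    "body": {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97},
}
after_rename = {
    "status": 200,
    "body": {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97},
}

print(verify_provider(CONSUMER_EXPECTATION, before_rename))
print(verify_provider(CONSUMER_EXPECTATION, after_rename))
```

&lt;p&gt;The first check comes back empty. The second reports the missing field on the provider's side, before anything ships.&lt;/p&gt;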




&lt;h2&gt;
  
  
  Building it — and what the docs didn't tell me
&lt;/h2&gt;

&lt;p&gt;I added Pact to the order-api project this issue, covering both downstream dependencies — the payment gateway and the inventory service — with consumer tests matching the same five scenarios from the Gherkin feature file.&lt;/p&gt;

&lt;p&gt;It was less smooth than I expected.&lt;/p&gt;

&lt;h3&gt;
  
  
  The pact-python v3 FFI surprise
&lt;/h3&gt;

&lt;p&gt;Every tutorial for pact-python shows the same pattern: create a module-scoped Pact fixture, run multiple tests against it, write the pact file at the end. I wrote exactly that. The first test in each class passed. Every subsequent test failed with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: The provider state could not be specified.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No hint of what was actually wrong. After digging into the source, the root cause: &lt;code&gt;pact-python&lt;/code&gt; 3.x is a complete rewrite backed by a Rust FFI binary. The Rust handle is &lt;em&gt;consumed&lt;/em&gt; by the first &lt;code&gt;serve()&lt;/code&gt; call — you cannot add new interactions to a handle after that point. The v2-style module-scoped pattern violates this constraint in a way the error message doesn't explain at all.&lt;/p&gt;

&lt;p&gt;The fix was restructuring the consumer tests so all interactions are defined upfront before any &lt;code&gt;serve()&lt;/code&gt; call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ v2-style — breaks in pact-python v3
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestPaymentConsumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;module&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OrderService&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;has_pact_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentGateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment succeeds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upon_receiving&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# test
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_declined&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment declined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upon_receiving&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a decline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="c1"&gt;# RuntimeError — handle already consumed
&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ v3 correct pattern — all interactions before serve()
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_payment_gateway_consumer&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;pact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OrderService&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;has_pact_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentGateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pact&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the payment gateway will accept the charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upon_receiving&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a successful payment charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/payments/charge/success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;will_respond_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACCEPTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transaction_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;txn-abc-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;134.97&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pact&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the payment gateway will decline the charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upon_receiving&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a declined payment charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/payments/charge/declined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;will_respond_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;402&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DECLINED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSUFFICIENT_FUNDS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="c1"&gt;# ... all interactions defined ...
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;srv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# exercise all interactions against srv.url
&lt;/span&gt;    &lt;span class="n"&gt;pact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pacts/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're upgrading from pact-python 1.x or 2.x: expect to rewrite your test fixtures. This isn't a syntax change — it's a different mental model of how the mock server lifecycle works.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Verifier transport configuration gap
&lt;/h3&gt;

&lt;p&gt;Provider verification had its own friction. The &lt;code&gt;Verifier&lt;/code&gt; constructor in pact-python v3 takes a hostname, not a full URL. Passing a full URL causes a silent host mismatch when you later configure the transport:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Causes "Host mismatch: localhost != http://localhost:8291"
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;Verifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentGateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8291&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_transport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8291&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;Verifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentGateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_transport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8291&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pact_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_request_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# needed for the 6s timeout stub
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;set_request_timeout(10000)&lt;/code&gt; line is also non-obvious: the payment timeout stub uses &lt;code&gt;fixedDelayMilliseconds: 6000&lt;/code&gt; to simulate a slow response, and the verifier's default timeout is 5 seconds. Without the explicit extension, the verifier gives up a second before the stub answers, so the timeout interaction fails verification with a connection error instead of passing cleanly.&lt;/p&gt;
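
&lt;p&gt;One way to keep those two numbers from drifting apart is to derive the verifier timeout from the stub mappings instead of hard-coding it. The sketch below is a hypothetical helper, not part of the project: the function name, the 4-second margin, and the inline mapping dicts are assumptions for illustration. It relies only on WireMock's &lt;code&gt;fixedDelayMilliseconds&lt;/code&gt; field.&lt;/p&gt;

```python
# Hypothetical helper: pick a verifier timeout that outlasts the slowest stub.
# The mappings are inlined here; a real project would load them from the
# WireMock mapping files on disk.

def required_timeout_ms(mappings, margin_ms=4000):
    """Return the longest stub delay plus a safety margin, in milliseconds."""
    delays = [
        mapping.get("response", {}).get("fixedDelayMilliseconds", 0)
        for mapping in mappings
    ]
    return max(delays, default=0) + margin_ms


payment_mappings = [
    {"response": {}},                                  # stub with no delay
    {"response": {"fixedDelayMilliseconds": 6000}},    # the timeout stub
]

timeout_ms = required_timeout_ms(payment_mappings)
print(timeout_ms)  # 10000
```

&lt;p&gt;The result could then be handed to &lt;code&gt;set_request_timeout&lt;/code&gt;, so a longer stub delay would raise the verifier timeout along with it.&lt;/p&gt;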

&lt;p&gt;Neither of these is in the main documentation. Both took real time to find. They're in the findings file for this session, linked at the bottom.&lt;/p&gt;




&lt;h2&gt;
  
  
  The breaking change experiment
&lt;/h2&gt;

&lt;p&gt;All the Pact setup is preamble. This is the proof.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Baseline — all contracts verified&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;pytest tests/pact/test_provider_verification.py -v

Verifying a pact between OrderService and PaymentGateway
  a declined payment charge         (OK)
  a successful payment charge       (OK)
  a timed-out payment charge        (OK)
PASSED

Verifying a pact between OrderService and InventoryService
  [3 interactions — all OK]
PASSED

2 passed in 8.19s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Introduce the breaking change&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;wiremock/payment-mappings/payment-success.json&lt;/code&gt;, one field rename:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Before&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ACCEPTED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"txn-abc-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;134.97&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;After&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;renamed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"result"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ACCEPTED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"txn-abc-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;134.97&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Provider verification with the breaking change&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="p"&gt;pytest tests/pact/test_provider_verification.py -v
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;  a successful payment charge (FAILED)
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;Failures:
&lt;/span&gt;  1.1) has a matching body
         $ -&amp;gt; Actual map is missing the following keys: status
  {
    "amount": 134.97,
  -  "status": "ACCEPTED",
  +  "result": "ACCEPTED",
    "transaction_id": "txn-abc-123"
  }
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;1 failed in 7.22s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pact caught it. Exact field. Exact diff. No ambiguity about what broke or why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: The same breaking change against the WireMock test suite&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;pytest tests/steps/test_order_creation.py -v

test_order_is_successfully_created... PASSED
test_order_is_rejected_when_payment_is_declined PASSED
test_order_is_rejected_when_an_item_is_out_of_stock PASSED
test_order_surfaces_partial_unavailability... PASSED
test_order_handling_is_graceful_when_the_payment_gateway_times_out PASSED

5 passed in 13.01s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five for five. All green. The breaking change is completely invisible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Revert and confirm&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2 passed in 8.19s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why the WireMock tests stayed green
&lt;/h2&gt;

&lt;p&gt;This isn't a flaw in the Gherkin approach — it's a precise boundary on what any behavioral test can and can't see.&lt;/p&gt;

&lt;p&gt;The Gherkin scenarios test the order service's &lt;em&gt;behavior&lt;/em&gt;: does the order get confirmed? Does the right status come back to the caller? In &lt;code&gt;app/main.py&lt;/code&gt;, when the payment gateway responds, the code checks the HTTP status code and returns &lt;code&gt;{"status": "CONFIRMED"}&lt;/code&gt; — it never reads the &lt;code&gt;status&lt;/code&gt; field from the payment gateway body. So from the test harness's perspective, nothing changed. The right HTTP code came back, the order was confirmed, all assertions passed.&lt;/p&gt;

&lt;p&gt;Pact caught it because the consumer test had explicitly declared that the order service &lt;em&gt;expects&lt;/em&gt; a &lt;code&gt;status&lt;/code&gt; field in the payment response. That expectation is encoded in the &lt;code&gt;.pact&lt;/code&gt; file. When provider verification ran against the modified stub, the Rust verifier compared the actual response against the contract and found the key missing.&lt;/p&gt;

&lt;p&gt;The Gherkin test and the Pact consumer test are testing different things. Gherkin tests the system's behavior end-to-end. Pact tests the shape of the conversation between services. You need both. They're not competing — they're covering different failure modes.&lt;/p&gt;
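
&lt;p&gt;The difference can be reduced to a few lines. This is a simplified sketch, not the project's code: &lt;code&gt;confirm_order&lt;/code&gt; and &lt;code&gt;missing_keys&lt;/code&gt; are hypothetical stand-ins for the order service and the verifier, the 200 status for a successful charge is an assumption, and the real Pact verifier does far more than compare top-level keys.&lt;/p&gt;

```python
# Simplified illustration: a behavioural check and a contract check
# looking at the same renamed payment response.

def confirm_order(http_status, payment_body):
    """Stand-in for the order service: decides on the HTTP status only."""
    # payment_body is deliberately ignored, mirroring the behaviour described above.
    if http_status == 200:
        return {"status": "CONFIRMED"}
    return {"status": "PAYMENT_FAILED"}


def missing_keys(expected_body, actual_body):
    """Stand-in for the contract check: keys the consumer expects but did not get."""
    return sorted(set(expected_body) - set(actual_body))


contract_body = {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}
renamed_body = {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

# Behavioural view: the order is still confirmed, so the scenario passes.
print(confirm_order(200, renamed_body))           # {'status': 'CONFIRMED'}

# Contract view: the expected "status" key is gone, so verification fails.
print(missing_keys(contract_body, renamed_body))  # ['status']
```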




&lt;h2&gt;
  
  
  The can-i-deploy gate
&lt;/h2&gt;

&lt;p&gt;The final piece was a local &lt;code&gt;can-i-deploy&lt;/code&gt; simulation — a script that reads the generated &lt;code&gt;.pact&lt;/code&gt; files, checks each interaction's expected response shape against the WireMock stub mappings, and exits 0 (safe) or 1 (blocked).&lt;/p&gt;

&lt;p&gt;With contracts intact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;python scripts/can_i_deploy.py

Pact: OrderService → PaymentGateway
  PASS  a declined payment charge
  PASS  a successful payment charge
  PASS  a timed-out payment charge

Pact: OrderService → InventoryService
  PASS  [3 interactions]

RESULT: ALL CONTRACTS VERIFIED — safe to deploy
Exit: 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the breaking change in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;  FAIL  a successful payment charge
        stub is missing fields expected by consumer: ['status']

RESULT: CONTRACT VIOLATIONS DETECTED — do not deploy
Exit: 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a real Pact Broker setup, this check queries a central record of which consumer versions have verified which provider versions. The local simulation does something simpler but teaches the same pattern: before you deploy, prove the contract is still satisfied. The exit code is what a CI pipeline reads. A non-zero exit stops the merge.&lt;/p&gt;
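
&lt;p&gt;The shape of such a gate is small. The following is a minimal sketch under assumptions, not the repository's &lt;code&gt;scripts/can_i_deploy.py&lt;/code&gt;: it compares only top-level response keys, takes the pact interactions and stub bodies as in-memory dicts rather than reading files, and every function name is hypothetical.&lt;/p&gt;

```python
# Minimal sketch of a local can-i-deploy style gate.
# Assumptions: interactions and stub bodies are already loaded into dicts,
# and only top-level response keys are compared.

def check_interaction(interaction, stub_body):
    """Return the fields the consumer expects that the stub does not provide."""
    expected = interaction["response"].get("body", {})
    return sorted(set(expected) - set(stub_body))


def can_i_deploy(interactions, stub_bodies):
    """Print PASS/FAIL per interaction and return a process exit code."""
    exit_code = 0
    for interaction in interactions:
        name = interaction["description"]
        missing = check_interaction(interaction, stub_bodies.get(name, {}))
        if missing:
            print(f"  FAIL  {name}")
            print(f"        stub is missing fields expected by consumer: {missing}")
            exit_code = 1
        else:
            print(f"  PASS  {name}")
    return exit_code


interactions = [
    {
        "description": "a successful payment charge",
        "response": {"body": {"status": "ACCEPTED", "transaction_id": "txn-abc-123"}},
    }
]
broken_stubs = {
    "a successful payment charge": {"result": "ACCEPTED", "transaction_id": "txn-abc-123"}
}

exit_code = can_i_deploy(interactions, broken_stubs)
print("Exit:", exit_code)  # Exit: 1
```

&lt;p&gt;In a script, the returned value would be passed to &lt;code&gt;sys.exit&lt;/code&gt; so that the CI job sees the non-zero status.&lt;/p&gt;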

&lt;p&gt;The full GitHub Actions wiring — where this becomes an automated gate on every PR — is Issue #6. The local simulation is enough to feel how it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we are
&lt;/h2&gt;

&lt;p&gt;Four issues in, the specification layer is taking shape. Gherkin and WireMock proved the agent builds reliably against a well-written spec. The agent session proved that clean specs produce clean implementations and expose your assumptions. Pact closes the loop — the contract now survives beyond the stub and catches provider drift before it reaches production.&lt;/p&gt;

&lt;p&gt;The stack is starting to look like something real. But there's a question I've been putting off since Issue #2 that can't wait any longer: what actually makes a Gherkin scenario &lt;em&gt;good&lt;/em&gt;? Because not all specs are equal, and an agent that builds from a loose spec produces something very different from one that builds from a tight one. Next issue I'm going to prove that by deliberately writing bad Gherkin, handing it to the agent, and showing you what comes out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: The Spec That Doesn't Lie — deliberately writing bad Gherkin, seeing what the agent builds from it, then rewriting it and comparing the output.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.pact.io" rel="noopener noreferrer"&gt;Pact documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.pact.io/implementation_guides/python" rel="noopener noreferrer"&gt;pact-python v3 migration guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-04-pact-contract-testing.md" rel="noopener noreferrer"&gt;Session findings - Issue #4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>devops</category>
      <category>ai</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>I Gave the Agent the Spec and Walked Away. Here's What It Built.</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 01:02:49 +0000</pubDate>
      <link>https://dev.to/diyaburman/i-gave-the-agent-the-spec-and-walked-away-heres-what-it-built-jja</link>
      <guid>https://dev.to/diyaburman/i-gave-the-agent-the-spec-and-walked-away-heres-what-it-built-jja</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #3&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in the context of this newsletter’s title) on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;If you've been following along, you know what we've built so far. &lt;a href="https://dev.to/diyaburman/the-level-5-engineer-the-map-i-didnt-know-i-needed-5b5"&gt;Issue #1&lt;/a&gt; introduced the five levels framework and the Dark Factory concept. &lt;a href="https://dev.to/diyaburman/the-bottleneck-moved-did-you-notice-5beb"&gt;Issue #2&lt;/a&gt; got concrete — we wrote five Gherkin scenarios for an order management API before touching any implementation code, stubbed out two external dependencies with WireMock, and ran a real test suite against the whole thing.&lt;/p&gt;

&lt;p&gt;At the end of Issue #2 I made a promise: hand the spec to an AI agent, spec only, no implementation hints, and see what it builds.&lt;/p&gt;

&lt;p&gt;This is that issue.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The instruction I gave Claude Code at the start of the session was exactly this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The Gherkin scenarios in &lt;code&gt;tests/features/order_creation.feature&lt;/code&gt; define the full behavioural contract for this API. Do not read the existing implementation in &lt;code&gt;app/main.py&lt;/code&gt;. Build a fresh implementation that makes all 5 scenarios pass. Document your findings in &lt;code&gt;FINDINGS.md&lt;/code&gt; as you go."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25xjzuwf4rqp71o099yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25xjzuwf4rqp71o099yc.png" alt="screenshot" width="800" height="852"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcokijs484m3r0ub7wmzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcokijs484m3r0ub7wmzt.png" alt="screenshot" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it. No architecture hints. No "use FastAPI." No "here's how the mock servers work." Just the spec and a documentation instruction.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CLAUDE.md&lt;/code&gt; in the repo handled the rest — the guardrails, the project context, the constraint that the &lt;code&gt;.feature&lt;/code&gt; files cannot be touched, and the format the &lt;code&gt;FINDINGS.md&lt;/code&gt; should follow. If you missed the deep dive on &lt;code&gt;CLAUDE.md&lt;/code&gt; in Issue #2, that file is essentially the agent's standing orders. It reads it at the start of every session.&lt;/p&gt;

&lt;p&gt;Then I sat back and watched.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the agent derived from the spec alone
&lt;/h2&gt;

&lt;p&gt;Here's what I found interesting. Before writing a single line of code, the agent read the Gherkin scenarios and derived the entire API contract from them. Unprompted. It produced this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /inventory/check/{inventory_scenario}
  → all available      → POST /payments/charge/{payment_scenario}
  → partial available  → return 207 PARTIAL_UNAVAILABLE (no charge)
  → all out of stock   → return 409 UNAVAILABLE (no charge)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the full response shape for all five scenarios:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;status&lt;/th&gt;
&lt;th&gt;status_code&lt;/th&gt;
&lt;th&gt;Key fields&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;td&gt;CONFIRMED&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;code&gt;order_id&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment declined&lt;/td&gt;
&lt;td&gt;PAYMENT_FAILED&lt;/td&gt;
&lt;td&gt;402&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;decline_reason&lt;/code&gt;, &lt;code&gt;inventory_released: true&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out of stock&lt;/td&gt;
&lt;td&gt;UNAVAILABLE&lt;/td&gt;
&lt;td&gt;409&lt;/td&gt;
&lt;td&gt;&lt;code&gt;unavailable_items&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partial stock&lt;/td&gt;
&lt;td&gt;PARTIAL_UNAVAILABLE&lt;/td&gt;
&lt;td&gt;207&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;available_items&lt;/code&gt;, &lt;code&gt;unavailable_items&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment timeout&lt;/td&gt;
&lt;td&gt;PAYMENT_PENDING&lt;/td&gt;
&lt;td&gt;202&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;inventory_hold_minutes: 15&lt;/code&gt;, &lt;code&gt;retry_count&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is exactly right. The agent read five plain-language scenarios and extracted a precise technical contract — the order of operations, the response codes, the body fields, the retry behaviour — without being told any of it explicitly.&lt;/p&gt;
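
&lt;p&gt;Read as code, that derived contract is a short decision function. The sketch below is an illustration built only from the flow and table above. The function name and the string labels for the upstream outcomes are assumptions, the key fields are left out for brevity, and it is not the agent's implementation.&lt;/p&gt;

```python
# Illustrative decision function for the derived contract.
# Inventory is checked first; payment is attempted only if everything is available.

def decide_order(inventory, payment=None):
    """Map upstream outcomes to the status / status_code pairs in the table.

    inventory: "available", "partial", or "out_of_stock" (hypothetical labels)
    payment:   "accepted", "declined", or "timeout" (hypothetical labels)
    """
    if inventory == "partial":
        return {"status": "PARTIAL_UNAVAILABLE", "status_code": 207}
    if inventory == "out_of_stock":
        return {"status": "UNAVAILABLE", "status_code": 409}
    # All items available: now the payment outcome decides.
    if payment == "declined":
        return {"status": "PAYMENT_FAILED", "status_code": 402}
    if payment == "timeout":
        return {"status": "PAYMENT_PENDING", "status_code": 202}
    return {"status": "CONFIRMED"}  # success carries no status_code


print(decide_order("available", "accepted"))  # {'status': 'CONFIRMED'}
print(decide_order("partial"))  # {'status': 'PARTIAL_UNAVAILABLE', 'status_code': 207}
```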

&lt;p&gt;That's not nothing. That's the spec doing its job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where it got interesting — the timeout scenario
&lt;/h2&gt;

&lt;p&gt;Scenario 5 is the one I was most curious about. Timeout behaviour is notoriously hard to test and easy to get wrong. The agent worked through it carefully and documented its reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PAYMENT_TIMEOUT_SECONDS=5&lt;/code&gt; — per-attempt HTTP client timeout&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAX_PAYMENT_RETRIES=2&lt;/code&gt; — total attempt cap, not a retry count on top of the first attempt&lt;/li&gt;
&lt;li&gt;Worst-case wall time with 2 attempts at 5 seconds each: 10 seconds — comfortably inside the 12-second contract from the scenario&lt;/li&gt;
&lt;li&gt;The WireMock timeout stub uses &lt;code&gt;fixedDelayMilliseconds: 6000&lt;/code&gt; — deliberately longer than the client timeout so the client always times out before the mock responds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last detail is subtle and correct. If the mock delay were shorter than the client timeout, the test would be testing the wrong thing — the mock responding slowly rather than the client giving up. The agent caught this without being prompted. It's in the FINDINGS.md.&lt;/p&gt;
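
&lt;p&gt;The arithmetic and the attempt cap can be sketched as follows. This is an illustration under assumptions, not the agent's code: &lt;code&gt;charge_with_attempt_cap&lt;/code&gt; and the fake gateway call are hypothetical, the &lt;code&gt;attempts&lt;/code&gt; key is invented for the example, and the timeout is simulated by raising &lt;code&gt;TimeoutError&lt;/code&gt; instead of waiting on a real HTTP client.&lt;/p&gt;

```python
# Sketch of a total-attempt cap with a per-attempt timeout.
PAYMENT_TIMEOUT_SECONDS = 5
MAX_PAYMENT_RETRIES = 2      # total attempts, not retries after the first
CONTRACT_SECONDS = 12        # upper bound stated in the scenario


def worst_case_seconds(attempts, per_attempt_timeout):
    """Every attempt runs to its timeout before the client gives up."""
    return attempts * per_attempt_timeout


def charge_with_attempt_cap(call_gateway, max_attempts):
    """Try the gateway up to max_attempts times; report PAYMENT_PENDING if all time out."""
    for attempt in range(1, max_attempts + 1):
        try:
            call_gateway()
        except TimeoutError:
            continue  # this attempt timed out; try again if the cap allows
        return {"status": "CONFIRMED", "attempts": attempt}
    return {"status": "PAYMENT_PENDING", "attempts": max_attempts}


def always_times_out():
    # Stands in for a stub whose 6000 ms delay exceeds the 5 s client timeout.
    raise TimeoutError("gateway did not respond in time")


worst_case = worst_case_seconds(MAX_PAYMENT_RETRIES, PAYMENT_TIMEOUT_SECONDS)
print(worst_case)                     # 10
print(CONTRACT_SECONDS - worst_case)  # 2 seconds of headroom
print(charge_with_attempt_cap(always_times_out, MAX_PAYMENT_RETRIES))
# {'status': 'PAYMENT_PENDING', 'attempts': 2}
```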

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj12ykckkfpgj974d9m8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj12ykckkfpgj974d9m8.png" alt="screenshot" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug it found that I had written
&lt;/h2&gt;

&lt;p&gt;This is my favourite part of this issue.&lt;/p&gt;

&lt;p&gt;The original test setup — the code I pointed Claude Code at — had a hard-coded path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/claude/order-api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On my machine this would silently start mock servers with no stubs loaded. Every payment call would return a 404. Every inventory call would return a 404. The tests would fail in ways that looked like logic errors rather than a configuration problem.&lt;/p&gt;

&lt;p&gt;The agent caught it, diagnosed the root cause, and fixed it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrvhf3xhyz1n8j0i828o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrvhf3xhyz1n8j0i828o.png" alt="screenshot" width="800" height="855"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — hard-coded, breaks on any machine but the original
&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/claude/order-api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After — computed dynamically, works everywhere
&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;
&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ROOT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be clear: this bug was in &lt;em&gt;my&lt;/em&gt; code. Code I had written and shipped to the repo. The agent found it during implementation because it was running the tests in a different environment, and they failed in a way that forced the diagnosis.&lt;/p&gt;

&lt;p&gt;This is a thing that happens at Level 4 that doesn't happen at Level 2. When you're implementing yourself, you don't notice the hard-coded paths because everything works on your machine. When an agent implements in a clean environment, your assumptions get exposed immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  My honest reaction
&lt;/h2&gt;

&lt;p&gt;I'll be transparent about something. This API isn't complex. It's an order endpoint with two downstream dependencies and five scenarios. I didn't expect the agent to struggle with it, and it didn't. It hit errors, diagnosed them promptly, and moved on. Five scenarios, all passing.&lt;/p&gt;

&lt;p&gt;What struck me wasn't the capability — it was the &lt;em&gt;texture&lt;/em&gt; of the experience.&lt;/p&gt;

&lt;p&gt;Watching Claude Code work, I found myself doing something I don't usually do when I'm implementing: I was evaluating. Not writing, not debugging, not context-switching. Just reading the agent's reasoning and deciding whether I agreed with it. That's a different cognitive posture entirely. It felt closer to a code review than a coding session.&lt;/p&gt;

&lt;p&gt;I also noticed I spent the entire session approving individual commands — every file edit, every &lt;code&gt;pytest&lt;/code&gt; run, every &lt;code&gt;pip install&lt;/code&gt;. Claude Code asks for permission before each action by default. For this first session I let it. From the next task onward I'm going to configure it to run basic commands without checking in every thirty seconds. There's a trust-building curve here, and I'm on the early part of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this proves — and what it doesn't
&lt;/h2&gt;

&lt;p&gt;Five passing scenarios on a moderately simple API is not proof that Level 5 is solved. It's proof that the approach works at this scale and this complexity.&lt;/p&gt;

&lt;p&gt;The honest question — the one this newsletter is actually tracking — is whether it holds as the system grows. Pact tests across services. CI/CD pipelines. Evals as guardrails. Contextual stewardship documents for systems with years of history and undocumented decisions baked into the architecture.&lt;/p&gt;

&lt;p&gt;That's where the real test is. And that's where we're going next.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;One thing the exercise exposed: the spec was good enough for the agent to build correctly, but I had one implicit assumption that didn't make it into the scenarios. The success scenario doesn't specify that &lt;code&gt;status_code&lt;/code&gt; should be absent; it only checks for &lt;code&gt;order_id&lt;/code&gt;. The agent inferred this correctly, but had it included the field anyway, the test would still have passed.&lt;/p&gt;

&lt;p&gt;That's a gap in the spec, not a gap in the agent. The lesson is the same one from Issue #2: every implicit assumption is a decision waiting to cause a bug in production. Write it down. Make it a scenario. Make the machine prove it.&lt;/p&gt;
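
&lt;p&gt;Making that assumption explicit could look like the check below. It is a sketch only: the helper name and the sample bodies are hypothetical, and in the project the equivalent would more naturally live as an extra Gherkin step than as a bare function.&lt;/p&gt;

```python
# Hypothetical check that pins down the success response shape,
# including the field that must NOT be present.

def check_success_response(body):
    """Return a list of problems; an empty list means the shape is as intended."""
    problems = []
    if body.get("status") != "CONFIRMED":
        problems.append("status must be CONFIRMED")
    if "order_id" not in body:
        problems.append("order_id is required")
    if "status_code" in body:
        problems.append("status_code must be absent on success")
    return problems


print(check_success_response({"status": "CONFIRMED", "order_id": "ord-1"}))
# []
print(check_success_response({"status": "CONFIRMED", "order_id": "ord-1", "status_code": 200}))
# ['status_code must be absent on success']
```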




&lt;p&gt;&lt;em&gt;Next issue: Phase 3 — adding Pact contract testing between the order service and its dependencies. What happens when the service contract and the mock stub disagree?&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.claude.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Claude Code documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytest-bdd.readthedocs.io" rel="noopener noreferrer"&gt;pytest-bdd documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api" rel="noopener noreferrer"&gt;Project repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-03-agent-fresh-implementation.md" rel="noopener noreferrer"&gt;Session findings - Issue #3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Bottleneck Moved. Did You Notice?</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Wed, 10 Jun 2026 00:37:33 +0000</pubDate>
      <link>https://dev.to/diyaburman/the-bottleneck-moved-did-you-notice-5beb</link>
      <guid>https://dev.to/diyaburman/the-bottleneck-moved-did-you-notice-5beb</guid>
      <description>&lt;p&gt;&lt;em&gt;A Level 5 Engineer — Issue #2&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Preface
&lt;/h3&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (in the context of this newsletter’s title) on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;p&gt;If you read &lt;a href="https://dev.to/diyaburman/the-level-5-engineer-the-map-i-didnt-know-i-needed-5b5"&gt;Issue 1&lt;/a&gt;, you walked away with the map. Six levels, a plateau most engineers never escape, and a Dark Factory that a handful of teams are quietly running in production. If you missed it, go read it first — this one builds directly on it.&lt;/p&gt;

&lt;p&gt;This issue is about the single most important shift that happens when you try to move from Level 3 to Level 4. Not the tools. Not the mindset. The &lt;em&gt;bottleneck&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Because it moved. And most of us didn't notice.&lt;/p&gt;




&lt;h2&gt;
  
  
  When speed stops being the problem
&lt;/h2&gt;

&lt;p&gt;For most of our careers, the bottleneck in software development was implementation speed. You had the idea, you had the design, you had the ticket — the constraint was how fast fingers could turn it into working code. That's the world we optimized for. That's why we measured velocity. That's why standups exist. That's why "10x engineer" was ever a phrase people said out loud without embarrassment.&lt;/p&gt;

&lt;p&gt;AI blew that bottleneck wide open.&lt;/p&gt;

&lt;p&gt;At Level 2, implementation stops being the constraint almost overnight. You're pairing with an agent and the code just... appears. Features that used to take days take hours. Hours take minutes. It feels like the problem is solved.&lt;/p&gt;

&lt;p&gt;Except you haven't solved it. You've just exposed the one that was hiding behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The new bottleneck is specification quality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent can build anything you can describe precisely enough. The operative word is &lt;em&gt;precisely&lt;/em&gt;. The moment you try to hand off a vague, half-formed idea (the kind a human developer would fill in with reasonable assumptions and a quick Slack message), the agent hallucinates something plausible-looking that isn't what you wanted, or it freezes, or, worse, it confidently builds the wrong thing all the way to completion.&lt;/p&gt;

&lt;p&gt;The constraint is no longer your ability to implement. It's your ability to &lt;em&gt;specify&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a bad spec actually looks like
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth — most "requirements" we write as engineers are not specifications. They are vibes dressed up in Jira tickets.&lt;/p&gt;

&lt;p&gt;"Add pagination to the users endpoint." That's not a spec. How many results per page? Is the default configurable? What happens when the page number exceeds the total — empty array or 404? What's the sort order? Cursor-based or offset-based? What happens to existing API consumers who aren't sending page parameters yet?&lt;/p&gt;

&lt;p&gt;A human developer asks those questions in standup or figures them out from context. An agent working autonomously at Level 4 cannot do that. It will make a choice silently and confidently, and when that choice is wrong you may not catch it until production.&lt;/p&gt;

&lt;p&gt;This is why Dan Shapiro's insight about specification quality isn't just a productivity tip. It's a prerequisite for moving up the ladder at all. You cannot reach Level 4 with Level 2 specs. The system won't let you.&lt;/p&gt;
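&lt;p&gt;To make the pagination example concrete, here is a minimal sketch of what "answering the questions" could look like once it is written down. Everything in it is hypothetical: the names &lt;code&gt;PaginationSpec&lt;/code&gt; and &lt;code&gt;paginate&lt;/code&gt;, the default of 20 results per page, the cap of 100, offset-based paging, and returning an empty list past the last page are assumptions I picked for illustration, not the right answers for your API.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaginationSpec:
    """Hypothetical answers to the questions a one-line ticket leaves open."""
    default_page_size: int = 20   # assumed default when the client sends nothing
    max_page_size: int = 100      # assumed upper bound on a requested page size
    sort_key: str = "id"          # assumed stable sort order


def paginate(users, spec, page=None, page_size=None):
    """Offset-based paging; a page past the end yields an empty list, not an error."""
    # Existing consumers that send no parameters get the first page.
    page = 1 if page is None else max(page, 1)
    size = spec.default_page_size if page_size is None else page_size
    size = min(max(size, 1), spec.max_page_size)

    ordered = sorted(users, key=lambda u: u[spec.sort_key])
    start = (page - 1) * size
    return ordered[start:start + size]


users = [{"id": i} for i in range(1, 46)]   # 45 made-up users
spec = PaginationSpec()

first = paginate(users, spec)               # no parameters sent
last = paginate(users, spec, page=3)        # partial final page
beyond = paginate(users, spec, page=4)      # past the end
```

&lt;p&gt;The point is not these particular values. It is that every field and every branch is a decision someone had to make out loud.&lt;/p&gt;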




&lt;h2&gt;
  
  
  So I built one. Here's what happened.
&lt;/h2&gt;

&lt;p&gt;I wanted to do something concrete this issue rather than just theorize. So I picked a real-world-shaped scenario — an e-commerce order management API with two external dependencies — and built it end to end with &lt;strong&gt;WireMock simulating the dependencies&lt;/strong&gt; and &lt;strong&gt;Gherkin scenarios written before the code&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The full project is on &lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; so you can clone it and run the exact same setup on your machine. Everything below is reproducible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The scenario
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;POST /orders&lt;/code&gt; endpoint that talks to two external services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;payment gateway&lt;/strong&gt; (think Stripe) that can succeed, decline, or time out&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;inventory service&lt;/strong&gt; that can confirm stock, report out-of-stock, or report partial availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Realistic enough to be relatable. Scoped enough to finish in an afternoon. The kind of integration complexity every backend engineer deals with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Write the spec first. Actually first.
&lt;/h3&gt;

&lt;p&gt;Here are the five Gherkin scenarios I wrote &lt;em&gt;before&lt;/em&gt; a single line of implementation code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order Creation

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order is successfully created when payment succeeds and all items are in stock
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-123"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service confirms all items are in stock
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway will accept the charge
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M
    &lt;span class="nf"&gt;Then &lt;/span&gt;the order status is &lt;span class="s"&gt;"CONFIRMED"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the response includes an order id
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway received exactly one charge request
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service received a reservation request

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order is rejected when payment is declined
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-456"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service confirms all items are in stock
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway will decline the charge
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M
    &lt;span class="nf"&gt;Then &lt;/span&gt;the order status is &lt;span class="s"&gt;"PAYMENT_FAILED"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the response status code is 402
    &lt;span class="nf"&gt;And &lt;/span&gt;the response includes the decline reason &lt;span class="s"&gt;"INSUFFICIENT_FUNDS"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory reservation is released
    &lt;span class="nf"&gt;And &lt;/span&gt;no order id is issued

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order is rejected when an item is out of stock
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-789"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service reports SHOE-RED-42 is out of stock
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42
    &lt;span class="nf"&gt;Then &lt;/span&gt;the order status is &lt;span class="s"&gt;"UNAVAILABLE"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the response status code is 409
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway is never called

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order surfaces partial unavailability without auto-confirming
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-321"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service reports SHOE-RED-42 as available but BELT-BRN-M as unavailable
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M
    &lt;span class="nf"&gt;Then &lt;/span&gt;the order status is &lt;span class="s"&gt;"PARTIAL_UNAVAILABLE"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the response status code is 207
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway is never called
    &lt;span class="nf"&gt;And &lt;/span&gt;no order is confirmed without explicit user action

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Order handling is graceful when the payment gateway times out
    &lt;span class="nf"&gt;Given &lt;/span&gt;a registered user with id &lt;span class="s"&gt;"user-654"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory service confirms all items are in stock
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway will not respond within the timeout window
    &lt;span class="nf"&gt;When &lt;/span&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M
    &lt;span class="nf"&gt;Then &lt;/span&gt;the response is returned within 12 seconds
    &lt;span class="nf"&gt;And &lt;/span&gt;the order status is &lt;span class="s"&gt;"PAYMENT_PENDING"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the inventory is held for 15 minutes
    &lt;span class="nf"&gt;And &lt;/span&gt;the payment gateway is not retried more than 2 times
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what these are. Not implementation documents. Not pseudocode. They're a &lt;strong&gt;behavioural contract&lt;/strong&gt; — plain-language descriptions of exactly what the system should do in specific situations, written in a format any teammate, PM, or yes — agent — can read.&lt;/p&gt;

&lt;p&gt;The discipline of writing them &lt;em&gt;first&lt;/em&gt; forced me to make decisions I would normally have postponed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do we check inventory before charging, or charge first? (Inventory first. Scenarios 3 and 4 lock this in: in both, the payment gateway is never called.)&lt;/li&gt;
&lt;li&gt;What happens during partial availability — auto-fulfill what's available, or ask the user? (Ask. Encoded in scenario 4.)&lt;/li&gt;
&lt;li&gt;What's the timeout SLA on the payment gateway? (5 seconds, with a max of 2 retries. Scenario 5 makes this testable.)&lt;/li&gt;
&lt;li&gt;What's the response code for partial availability? (207 Multi-Status.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these scenarios, every one of those decisions would have been made silently by whoever wrote the code first.&lt;/p&gt;
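&lt;p&gt;The decisions above can be sketched as one small function. This is a rough illustration rather than the code in the repository: &lt;code&gt;decide_order&lt;/code&gt;, the two enums, and the &lt;code&gt;charge&lt;/code&gt; callable are hypothetical names, and the 201 and 202 status codes are my assumptions because the scenarios do not pin them down. The 402, 409 and 207 codes come straight from the scenarios.&lt;/p&gt;

```python
from enum import Enum


class Inventory(Enum):
    IN_STOCK = "in_stock"
    OUT_OF_STOCK = "out_of_stock"
    PARTIAL = "partial"


class Payment(Enum):
    ACCEPTED = "accepted"
    DECLINED = "declined"
    TIMED_OUT = "timed_out"


def decide_order(inventory, charge):
    """Return (order_status, http_status) for one order attempt.

    `charge` is a zero-argument callable returning a Payment value. It is only
    invoked when every item is in stock, which encodes "inventory first".
    """
    if inventory is Inventory.OUT_OF_STOCK:
        return "UNAVAILABLE", 409
    if inventory is Inventory.PARTIAL:
        return "PARTIAL_UNAVAILABLE", 207

    payment = charge()
    if payment is Payment.ACCEPTED:
        return "CONFIRMED", 201          # assumed code, not in the scenarios
    if payment is Payment.DECLINED:
        return "PAYMENT_FAILED", 402
    return "PAYMENT_PENDING", 202        # assumed code, not in the scenarios


charge_calls = []


def declining_gateway():
    """Fake gateway that records the call and declines."""
    charge_calls.append("charge")
    return Payment.DECLINED


partial_result = decide_order(Inventory.PARTIAL, declining_gateway)
calls_after_partial = len(charge_calls)   # stays 0: the gateway is never reached
declined_result = decide_order(Inventory.IN_STOCK, declining_gateway)
```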

&lt;h3&gt;
  
  
  Step 2 — Stub the dependencies with WireMock
&lt;/h3&gt;

&lt;p&gt;Before writing any tests you can actually run, the external services need to be simulated. This is what Dan Shapiro calls a &lt;strong&gt;digital twin universe&lt;/strong&gt; — a fully simulated version of your dependencies that behaves like the real thing without the real thing's unpredictability, cost, or rate limits.&lt;/p&gt;

&lt;p&gt;WireMock is a widely used tool for this. A WireMock stub is just a JSON file describing how a service should respond:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/payments/charge/success"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ACCEPTED&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;transaction_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;txn-abc-123&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;amount&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 134.97}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
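&lt;p&gt;The hand-escaped &lt;code&gt;body&lt;/code&gt; string in that stub is easy to get wrong. One option, sketched here under the assumption that you generate mapping files from Python, is to let &lt;code&gt;json.dumps&lt;/code&gt; do the escaping. The helper name &lt;code&gt;build_stub&lt;/code&gt; is hypothetical; the field names mirror the stub above.&lt;/p&gt;

```python
import json


def build_stub(method, url, status, payload):
    """Build a stub mapping dict in the same shape as the JSON file above."""
    return {
        "request": {"method": method, "url": url},
        # The body is a JSON document stored as a string, so serialise it here
        # instead of escaping every quote by hand.
        "response": {"status": status, "body": json.dumps(payload)},
    }


success = build_stub(
    "POST",
    "/payments/charge/success",
    200,
    {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97},
)

# This string is what would be written to a file in the mappings directory.
mapping_json = json.dumps(success, indent=2)
```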



&lt;p&gt;For the payment timeout scenario, WireMock has a built-in &lt;code&gt;fixedDelayMilliseconds&lt;/code&gt; parameter. One line and the mock takes 6 seconds to respond:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/payments/charge/timeout"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;TIMEOUT&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fixedDelayMilliseconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That tiny config line is what makes scenario 5 testable. Without it, exercising timeout behaviour in a local environment tends to mean something drastic like disabling network connectivity at the OS level, which I have done in the past, and it is exactly as miserable as it sounds.&lt;/p&gt;
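&lt;p&gt;On the API side, scenario 5 boils down to a bounded retry loop. The sketch below is an assumption about the shape of that loop, not the repository's code: &lt;code&gt;charge_with_retries&lt;/code&gt;, &lt;code&gt;worst_case_seconds&lt;/code&gt; and the fake gateway are hypothetical names. It also shows a constraint hiding in the spec. With a 5-second timeout, one retry waits at most 10 seconds and fits under the 12-second bound, while two retries could wait 15 seconds. That is why the default here is one retry, even though the scenario allows up to two.&lt;/p&gt;

```python
def worst_case_seconds(timeout_s, max_retries):
    """Upper bound on waiting time: every attempt runs to its full timeout."""
    return timeout_s * (1 + max_retries)


def charge_with_retries(charge, timeout_s=5.0, max_retries=1):
    """Call `charge(timeout_s)`, retrying on TimeoutError up to `max_retries` times.

    Returns (status, attempts). `charge` is a hypothetical callable standing in
    for the HTTP call to the payment gateway.
    """
    attempts = 0
    for _ in range(1 + max_retries):      # one initial try plus the retries
        attempts += 1
        try:
            result = charge(timeout_s)
        except TimeoutError:
            continue                      # too slow: try again if any tries remain
        return result, attempts
    # Every attempt timed out: report a pending payment instead of failing hard.
    return "PAYMENT_PENDING", attempts


def always_slow(timeout_s):
    """Fake gateway that acts like the 6-second stub against a 5-second limit."""
    raise TimeoutError(f"no response within {timeout_s}s")


status, attempts = charge_with_retries(always_slow)
one_retry_budget = worst_case_seconds(5.0, 1)    # 10.0, inside a 12-second bound
two_retry_budget = worst_case_seconds(5.0, 2)    # 15.0, outside it
```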

&lt;h3&gt;
  
  
  Step 3 — Wire the scenarios to real assertions
&lt;/h3&gt;

&lt;p&gt;Gherkin by itself is just text. To turn it into an executable test suite I used &lt;strong&gt;pytest-bdd&lt;/strong&gt;, which lets each &lt;code&gt;Given/When/Then&lt;/code&gt; line map to a Python function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the payment gateway will decline the charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_fixture&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pay_declined&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;declined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the user submits an order for SHOE-RED-42 and BELT-BRN-M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_fixture&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;submit_two&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment_scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inventory_scenario&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SHOE-RED-42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;89.99&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BELT-BRN-M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;44.98&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_PORT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payment_scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventory_scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inventory_scenario&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the payment gateway is never called&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;payment_not_called&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;payment_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Expected no payment calls, got: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;payment_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last assertion, &lt;code&gt;the payment gateway is never called&lt;/code&gt;, is the kind of thing that's awkward to verify across a real HTTP boundary with traditional unit tests but &lt;em&gt;trivial&lt;/em&gt; with WireMock. WireMock records every request it receives, so you assert against that log directly.&lt;/p&gt;
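&lt;p&gt;For readers wondering what sits behind &lt;code&gt;payment_log.all()&lt;/code&gt;, here is a minimal sketch of a call log of that shape. It is an assumption about the idea, not a copy of the project's class: the &lt;code&gt;record&lt;/code&gt; and &lt;code&gt;count&lt;/code&gt; methods and the inventory path are hypothetical.&lt;/p&gt;

```python
import threading


class MockCallLog:
    """Hypothetical per-server request log that step functions can assert on."""

    def __init__(self):
        self._calls = []                 # one list per log instance
        self._lock = threading.Lock()    # mock servers run on background threads

    def record(self, method, path):
        with self._lock:
            self._calls.append((method, path))

    def all(self):
        with self._lock:
            return list(self._calls)     # copy, so callers cannot mutate the log

    def count(self, method, path):
        return self.all().count((method, path))


payment_log = MockCallLog()
inventory_log = MockCallLog()

# Simulate the out-of-stock scenario: only the inventory mock is hit.
inventory_log.record("POST", "/inventory/check")

assert not payment_log.all(), f"Expected no payment calls, got: {payment_log.all()}"
inventory_hits = inventory_log.count("POST", "/inventory/check")
```

&lt;p&gt;The same &lt;code&gt;count&lt;/code&gt; helper covers the "exactly one charge request" step from the first scenario.&lt;/p&gt;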

&lt;h3&gt;
  
  
  Step 4 — Run the suite
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pytest tests/steps/test_order_creation.py &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="o"&gt;=============================&lt;/span&gt; &lt;span class="nb"&gt;test &lt;/span&gt;session starts &lt;span class="o"&gt;==============================&lt;/span&gt;
collected 5 items

tests/steps/test_order_creation.py::test_order_is_successfully_created_when_payment_succeeds_and_all_items_are_in_stock PASSED
tests/steps/test_order_creation.py::test_order_is_rejected_when_payment_is_declined PASSED
tests/steps/test_order_creation.py::test_order_is_rejected_when_an_item_is_out_of_stock PASSED
tests/steps/test_order_creation.py::test_order_surfaces_partial_unavailability_without_autoconfirming PASSED
tests/steps/test_order_creation.py::test_order_handling_is_graceful_when_the_payment_gateway_times_out PASSED

&lt;span class="o"&gt;==============================&lt;/span&gt; 5 passed &lt;span class="k"&gt;in &lt;/span&gt;13.53s &lt;span class="o"&gt;==============================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five for five. But getting there was educational.&lt;/p&gt;




&lt;h2&gt;
  
  
  What broke along the way
&lt;/h2&gt;

&lt;p&gt;I want to be honest about the failures because &lt;em&gt;that's&lt;/em&gt; where the actual learning happened. The five tests didn't pass on the first run. They didn't pass on the second run either.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure 1 — The "no stub matched" silent success
&lt;/h3&gt;

&lt;p&gt;When a request comes in that no WireMock stub knows how to handle, the default behaviour is to return a 404. My API code did this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PAYMENT_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/payments/charge/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="n"&gt;payment_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment service error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 404 is not an exception in &lt;code&gt;httpx&lt;/code&gt;. It's just a response. So the API would happily call &lt;code&gt;pay.json()&lt;/code&gt;, get &lt;code&gt;{"error": "No stub matched"}&lt;/code&gt;, and treat the entire interaction as a success — issuing an order id and confirming the order &lt;em&gt;even though no real payment had been processed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is genuinely dangerous. A misconfigured mock would have made all my tests pass while hiding that the real service path was broken. Lesson: &lt;strong&gt;always explicitly check the response status from a mock&lt;/strong&gt;. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PAYMENT_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/payments/charge/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment scenario not found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
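    &lt;span class="n"&gt;payment_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt;  &lt;span class="c1"&gt;# keep the specific error raised above; the broad handler below would replace its message&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment service error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;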
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
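&lt;p&gt;Checking for 404 closes this particular hole, but the lesson generalizes: accept only the statuses you expect and reject everything else. A minimal sketch of that idea follows. It assumes a plain helper that receives the status code and raw body; &lt;code&gt;parse_payment_response&lt;/code&gt;, &lt;code&gt;UpstreamError&lt;/code&gt; and the default of treating only 200 as valid are hypothetical choices of mine.&lt;/p&gt;

```python
import json


class UpstreamError(Exception):
    """Raised when a dependency's answer should not be trusted."""


def parse_payment_response(status_code, body_text, expected_statuses=(200,)):
    """Decode the payment body, but only for an explicitly expected status."""
    if status_code not in expected_statuses:
        raise UpstreamError(f"Unexpected payment status {status_code}: {body_text}")
    payload = json.loads(body_text)
    # A well-formed JSON body is not enough; it must look like a payment result.
    if "status" not in payload:
        raise UpstreamError(f"Payment body has no 'status' field: {body_text}")
    return payload


ok = parse_payment_response(200, '{"status": "ACCEPTED", "transaction_id": "txn-abc-123"}')

try:
    parse_payment_response(404, '{"error": "No stub matched"}')
    unmatched_rejected = False
except UpstreamError:
    unmatched_rejected = True
```

&lt;p&gt;With an allow-list like this, a missing stub fails loudly no matter which unexpected status it produces.&lt;/p&gt;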



&lt;h3&gt;
  
  
  Failure 2 — The shared call log bug
&lt;/h3&gt;

&lt;p&gt;I started with one &lt;code&gt;MockServer&lt;/code&gt; class that held a single class-level call log, so both the payment and inventory mocks recorded into the same list. When the test asserted "the payment gateway received exactly one charge request," it was checking that combined log, which contained the inventory call but (because of failure 1) no payment call.&lt;/p&gt;

&lt;p&gt;The fix was conceptually small but architecturally important — each mock server instance gets its own call log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_mock_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mappings_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;HTTPServer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MockCallLog&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;stubs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mappings_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MockCallLog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                  &lt;span class="c1"&gt;# ← per-instance log
&lt;/span&gt;    &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stubs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HTTPServer&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serve_forever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mirrors how real WireMock is used in practice — you run separate WireMock instances per service, each with its own request log. The bug was a direct consequence of cutting that corner.&lt;/p&gt;
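
&lt;p&gt;To make the isolation concrete, here is a minimal sketch of two mock servers that each own their log. The names &lt;code&gt;MockCallLog&lt;/code&gt; and &lt;code&gt;make_handler&lt;/code&gt; are borrowed from the snippet above, but their bodies here are my assumption rather than the project's actual code. The helpers &lt;code&gt;start_server&lt;/code&gt; and &lt;code&gt;post&lt;/code&gt; are hypothetical, stub matching is left out, &lt;code&gt;make_handler&lt;/code&gt; takes only the log, and port 0 (let the OS pick a free port) stands in for the fixed ports.&lt;/p&gt;

```python
import http.client
import threading
from dataclasses import dataclass, field
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class MockCallLog:
    """Hypothetical per-server request log (illustrative, not the repo's class)."""
    calls: list = field(default_factory=list)

    def record(self, method: str, path: str) -> None:
        self.calls.append((method, path))

    def count(self, method: str, path: str) -> int:
        return self.calls.count((method, path))


def make_handler(log: MockCallLog):
    """Build a handler class bound to ONE log via closure, not a class attribute."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Drain the request body, then record the call before replying.
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)
            log.record("POST", self.path)
            body = b"{}"
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep test output quiet

    return Handler


def start_server(log: MockCallLog) -> HTTPServer:
    # Port 0 asks the OS for any free port (an assumption for this sketch).
    server = HTTPServer(("127.0.0.1", 0), make_handler(log))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post(server: HTTPServer, path: str) -> int:
    """Send an empty JSON POST to the given mock and return the status code."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    conn.request("POST", path, body=b"{}")
    response = conn.getresponse()
    response.read()
    conn.close()
    return response.status


payment_log, inventory_log = MockCallLog(), MockCallLog()
payment, inventory = start_server(payment_log), start_server(inventory_log)

post(inventory, "/reserve")  # only the inventory service is called

assert inventory_log.count("POST", "/reserve") == 1
assert payment_log.calls == []  # the payment log never sees inventory traffic

payment.shutdown()
inventory.shutdown()
```

&lt;p&gt;The design choice that matters is the closure: each handler class captures the log it was built with, so an assertion about the payment gateway can only ever see payment traffic.&lt;/p&gt;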

&lt;h3&gt;
  
  
  Failure 3 — The fixture wiring gap
&lt;/h3&gt;

&lt;p&gt;Scenarios 3 and 4 don't define a payment scenario in their Given clauses, because the payment gateway should never be called in those cases. But pytest-bdd was still expecting the &lt;code&gt;payment_scenario&lt;/code&gt; fixture — and erroring out before the test even ran.&lt;/p&gt;

&lt;p&gt;This is a subtle distinction worth naming. The Gherkin spec was correct. It said exactly what it should say. The error was in the &lt;em&gt;test wiring&lt;/em&gt; that connected the spec to the assertions. The fix was a default fixture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;payment_scenario&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Default — overridden by specific Given steps.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spec stays clean. The wiring handles the case where a scenario doesn't care about a particular setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this exercise actually proved to me
&lt;/h2&gt;

&lt;p&gt;A few things that are now visceral rather than abstract:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specs that the AI cannot see during the build are uniquely powerful.&lt;/strong&gt; My scenarios live in &lt;code&gt;tests/features/order_creation.feature&lt;/code&gt;. The implementation lives in &lt;code&gt;app/main.py&lt;/code&gt;. When I asked an agent to modify the API, I could give it the implementation only. The spec stayed external. The agent had to make the test pass against behaviour it couldn't reverse-engineer from the assertions themselves. This is the part that genuinely changes things at Level 4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WireMock's 404-on-no-match is a feature, not a bug.&lt;/strong&gt; It exposes integration mistakes that would otherwise hide forever. The first time I saw a test silently succeed because of the 404 passthrough I was annoyed. Now I think it should be louder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing the scenarios first changed what I built.&lt;/strong&gt; Scenario 4 — partial availability — would not have existed if I'd written the code first. I would have implemented "all available or fail" and shipped it. Writing the spec first made me confront the question. The answer became part of the system.&lt;/p&gt;
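
&lt;p&gt;The 404-on-no-match point is easier to see in code. The sketch below is a simplified illustration, not WireMock's real matcher: the stubs follow the shape of WireMock's JSON mappings (&lt;code&gt;request&lt;/code&gt; and &lt;code&gt;response&lt;/code&gt; keys), while the URLs, the response bodies, and the names &lt;code&gt;respond&lt;/code&gt;, &lt;code&gt;charge&lt;/code&gt; and &lt;code&gt;UnexpectedMockResponse&lt;/code&gt; are all made up for the example. It also shows one possible way to make the 404 "louder" on the calling side.&lt;/p&gt;

```python
# Stubs in the shape of WireMock mapping files; the contents are illustrative.
STUBS = [
    {
        "request": {"method": "POST", "urlPath": "/charge"},
        "response": {"status": 200, "jsonBody": {"status": "approved"}},
    },
    {
        "request": {"method": "POST", "urlPath": "/reserve"},
        "response": {"status": 200, "jsonBody": {"reserved": True}},
    },
]


def respond(stubs, method, path):
    """Return (status, body) for the first matching stub, else a 404."""
    for stub in stubs:
        wanted = stub["request"]
        if wanted["method"] == method and wanted["urlPath"] == path:
            reply = stub["response"]
            return reply["status"], reply.get("jsonBody", {})
    # No stub matched: answer 404 instead of inventing a success.
    return 404, {"error": f"no stub mapping for {method} {path}"}


class UnexpectedMockResponse(Exception):
    """Raised when the caller hits a path that no stub covers."""


def charge(stubs, path="/charge"):
    """Hypothetical client call that refuses to swallow a 404 from the mock."""
    status, body = respond(stubs, "POST", path)
    if status == 404:
        raise UnexpectedMockResponse(body["error"])
    return body


# A correct path matches its stub; a typo in the path surfaces as a 404.
assert respond(STUBS, "POST", "/charge") == (200, {"status": "approved"})
assert respond(STUBS, "POST", "/charges")[0] == 404
```

&lt;p&gt;If the calling code quietly treats that 404 as "nothing to do", the test can pass for the wrong reason. Raising, as the hypothetical &lt;code&gt;charge&lt;/code&gt; helper does, turns the mismatch into a failure you cannot miss.&lt;/p&gt;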




&lt;h2&gt;
  
  
  Try this yourself
&lt;/h2&gt;

&lt;p&gt;Everything above is in a &lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/" rel="noopener noreferrer"&gt;project&lt;/a&gt; you can clone, run, and break. Five scenarios, two mock services, one API. Total setup time: under fifteen minutes if you have Python and pip installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &amp;lt;repo-url&amp;gt; order-api
&lt;span class="nb"&gt;cd &lt;/span&gt;order-api
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn httpx pytest pytest-bdd requests
pytest tests/steps/test_order_creation.py &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to use real WireMock instead of the Python-based mock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download WireMock standalone&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; wiremock.jar &lt;span class="se"&gt;\&lt;/span&gt;
  https://repo1.maven.org/maven2/org/wiremock/wiremock-standalone/3.3.1/wiremock-standalone-3.3.1.jar

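&lt;span class="c"&gt;# Run two instances — the JSON mappings work as-is&lt;/span&gt;
&lt;span class="c"&gt;# (--root-dir expects the mapping files under a mappings/ subdirectory of each path)&lt;/span&gt;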
java &lt;span class="nt"&gt;-jar&lt;/span&gt; wiremock.jar &lt;span class="nt"&gt;--port&lt;/span&gt; 8081 &lt;span class="nt"&gt;--root-dir&lt;/span&gt; wiremock/payment-mappings &amp;amp;
java &lt;span class="nt"&gt;-jar&lt;/span&gt; wiremock.jar &lt;span class="nt"&gt;--port&lt;/span&gt; 8082 &lt;span class="nt"&gt;--root-dir&lt;/span&gt; wiremock/inventory-mappings &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The WireMock mapping JSON files I wrote work in real WireMock with zero changes. That was deliberate. The Python mock is for getting started fast. The real WireMock is for when you want to scale this pattern across an actual service mesh.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next issue: I take this same setup and hand it to an AI agent. Spec only — no implementation hints. We see what it builds, what it gets wrong, and how the spec acts as a guardrail.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources &amp;amp; Further Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://www.natebjones.com" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cucumber.io/docs/gherkin/" rel="noopener noreferrer"&gt;Cucumber + Gherkin documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wiremock.org/docs/" rel="noopener noreferrer"&gt;WireMock documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytest-bdd.readthedocs.io" rel="noopener noreferrer"&gt;pytest-bdd documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kafka0nkoffee/lvl5engineer-order-api/blob/main/findings/issue-02-wiremock-gherkin.md" rel="noopener noreferrer"&gt;Session findings - Issue #2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Level 5 Engineer - The Map I Didn't Know I Needed</title>
      <dc:creator>Diya Burman</dc:creator>
      <pubDate>Tue, 09 Jun 2026 19:12:34 +0000</pubDate>
      <link>https://dev.to/diyaburman/the-level-5-engineer-the-map-i-didnt-know-i-needed-5b5</link>
      <guid>https://dev.to/diyaburman/the-level-5-engineer-the-map-i-didnt-know-i-needed-5b5</guid>
      <description>&lt;p&gt;&lt;em&gt;The Level 5 Engineer Newsletter — Issue #1&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Preface
&lt;/h3&gt;

&lt;p&gt;I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.&lt;/p&gt;

&lt;p&gt;Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/" rel="noopener noreferrer"&gt;danshapiro.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This newsletter — The Level 5 Engineer — is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 (on the scale this newsletter is named after) on a good day. The goal is Level 5. I’m documenting the climb in real time — the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Before you panic and close this tab — yes, I see you hovering!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m not naively chasing the thing that puts me out of a job — I’m aware of the irony. The tech world is in full meltdown mode about AI taking developer jobs right now, and, I think, most of that noise misses the point entirely. &lt;/p&gt;

&lt;p&gt;A Dark Factory still needs someone who understands the system deeply enough to define what it should build, catch what it shouldn’t touch, and course-correct when it goes sideways. That’s not a developer writing code anymore — that’s a steward. The role doesn’t disappear; it transforms. From worker bee to architect. From implementer to the person who holds the mental model of the entire system and encodes that judgment into infrastructure that the agents can operate safely within. That shift is actually what this newsletter is about — and it deserves its own deep dive, which is coming. For now, just know: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Level 5 is the destination, not the cliff edge.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  So. The Five Levels.
&lt;/h3&gt;

&lt;p&gt;Dan Shapiro borrowed the structure from the NHTSA’s autonomous driving classification — five levels from “human does everything” to “machine does everything, humans not required.” Applied to software development, it maps out like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2h5vw3zfmbpph9qsirw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2h5vw3zfmbpph9qsirw.png" alt="The Five Levels" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Level 0 — Spicy Autocomplete. You’re still writing every character. Maybe you use AI as a search engine or accept the occasional tab suggestion. The code is yours. You’re also losing ground every day to the people who aren’t at Level 0.&lt;/p&gt;

&lt;p&gt;Level 1 — The Coding Intern. You’re delegating discrete tasks. “Write a unit test for this.” “Add a docstring here.” You’re seeing some speedup. Your job is essentially unchanged. YOU are still the bottleneck.&lt;/p&gt;

&lt;p&gt;Level 2 — Junior Developer. This is where it starts to feel like something. You’re pairing with AI like a colleague. Flow state. Productivity you haven’t felt in years. You hand off the boring stuff and focus on the interesting parts. Here’s the trap though — Level 2 feels like you’re done. You’re not done. Most people who think they’re “using AI” are living here permanently and calling it Level 5.&lt;/p&gt;

&lt;p&gt;Level 3 — Developer as Manager. You’re not writing much code anymore. You have multiple AI agent tabs running at all times. Your life is diffs. You review everything at the PR level. For a lot of people, this actually feels worse than Level 2 — more overhead, less flow. And this is where almost everyone tops out. Not because they can’t go further. Because Level 3 feels like the ceiling.&lt;/p&gt;

&lt;p&gt;Level 4 — Developer as Product Manager. The code is a black box. You write specs. You argue about the specs. You set up the right tools, define the right constraints, and then you leave for 12 hours. You come back and check if the tests passed. Dan Shapiro says he’s here. I believe him because of how he describes it — it doesn’t sound glamorous, it sounds like a different kind of hard work.&lt;/p&gt;

&lt;p&gt;Level 5 — The Dark Factory. Named after Fanuc’s robot factory — staffed entirely by robots, lights off, no humans needed or welcome. At Level 5, you’re not really running a software process anymore. You have a system that turns specs into software, autonomously. A handful of teams are doing this today. Small teams, fewer than five people, shipping production software with no human-written or human-reviewed code. A prominent example would be Anthropic’s Claude Code team, who say that Claude Code wrote most of Claude Code.&lt;/p&gt;





&lt;h3&gt;
  
  
  The Part That Actually Stings
&lt;/h3&gt;

&lt;p&gt;Here’s what Nate B. Jones added to this framework, which made me sit with it for a while.&lt;/p&gt;

&lt;p&gt;There’s a rigorous study showing that experienced developers using AI tools took 19% longer on tasks. They had predicted a 24% speedup going in, and even afterwards they still believed AI had made them about 20% faster. The gap between perceived productivity and actual productivity even has a name — the AI confidence gap. You feel faster. The clock disagrees.&lt;/p&gt;

&lt;p&gt;The teams that are pulling away aren’t using more AI tools. They’re using AI differently. The bottleneck has moved. It’s no longer about how fast you can implement. It’s about how precisely you can specify.&lt;/p&gt;

&lt;p&gt;The Dark Factory teams aren’t superhuman coders. They’ve built infrastructure around judgment — external behavioural scenarios that the AI cannot see during the build process, digital twin environments that simulate production dependencies safely, testing architectures designed specifically so the AI can’t reverse-engineer the passing criteria.&lt;/p&gt;

&lt;p&gt;The rest of the industry is plateaued at Level 3, reviewing diffs, and measuring velocity in story points on a Jira board that hasn’t been groomed since Q2.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why I’m Writing This
&lt;/h3&gt;

&lt;p&gt;I’ve been a software engineer for nearly a decade. I’ve done the serverless architectures, the MLOps pipelines, the distributed systems. I’ve been to AWS re:Invent (3 times. Ooooh, fancy schmancy). I know the stack.&lt;/p&gt;

&lt;p&gt;And I watched these videos and realized I was at Level 2. Maybe Level 3 on a good week. And I had been calling it “using AI effectively.”&lt;/p&gt;

&lt;p&gt;This newsletter is the documentation of that climb — starting with understanding what the levels actually mean and why most of us have been misreading where we stand. Or at least that’s where I’m starting. I wouldn’t be surprised if my own understanding shifts incrementally as I make progress. That’s kind of the point.&lt;/p&gt;

&lt;p&gt;Not the theory — the practice. The tools, the habits, the org-level thinking, the moments where something clicks. I’ll be honest when I’m stuck and honest when something works. No performance, no hype.&lt;/p&gt;

&lt;p&gt;The climb starts with understanding the map. Now you have it.&lt;/p&gt;




&lt;p&gt;Next issue: The specification quality problem — why the bottleneck has shifted and what “writing a good spec” actually means in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources &amp;amp; Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Dan Shapiro — &lt;a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory" rel="noopener noreferrer"&gt;The Five Levels: from Spicy Autocomplete to the Dark Factory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Nate B. Jones — &lt;a href="https://youtu.be/bDcgHzCBgmQ" rel="noopener noreferrer"&gt;The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)&lt;/a&gt; · &lt;a href="https://www.natebjones.com/" rel="noopener noreferrer"&gt;natebjones.com&lt;/a&gt; · &lt;a href="https://natesnewsletter.substack.com/" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;em&gt;This article was written with the assistance of AI tools.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>testing</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
