Diya Burman

Posted on Jun 10 • Edited on Jun 13 • Originally published at level5engineer.substack.com

Green CI. Broken Contract. Nobody Noticed.

#ai #softwareengineering #agents #testing

A Level 5 Engineer — Issue #4

Preface

I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.

Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com

Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com — Watch the video

This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 of that ladder on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.

If you've been following along, you know where we are. Issue #2 introduced WireMock and Gherkin — write the behavioral contract before the code, stub your dependencies, run a real test suite. Issue #3 handed that spec to an AI agent and walked away. Five scenarios passed. The agent even found a bug in my code.

Everything worked. And that's exactly the problem this issue is about.

Because the WireMock stubs working perfectly is not the same thing as the real services working. The gap between those two statements is where production incidents are born.

The confidence trap

Here's the scenario nobody talks about until it happens to them.

Your order service calls a payment gateway. You've stubbed it with WireMock. Your Gherkin scenarios pass. Your agent builds against those stubs. Five for five, green across the board.

Meanwhile, the payment gateway team — a different squad, a different repo, maybe a different company entirely — ships a cleanup. They've been inconsistent about field naming across their API. status in one endpoint, result in another. They standardize. They rename status to result in the charge response. Their tests pass. They deploy.

Your tests still pass too. The stub hasn't changed. The stub will never change unless you change it.

The first time you learn about the rename is a production incident.

This is the confidence trap: a mock that can drift from the real service makes you feel safe right up until production proves you weren't. The tests are green. The contract is broken. You just don't know it yet.

What Pact does differently

WireMock is a behavioral double — it simulates a service so your tests can run in isolation. You define what it returns. You maintain it. You can make it say anything you want, which means it can silently lie about what the real service actually does.

Pact inverts the trust relationship.

Instead of you maintaining a stub that you hope reflects reality, your consumer tests declare what they need from the provider. Those declarations get written into a .pact file — a machine-readable contract. The provider then runs verification against that contract before it ships. If the provider no longer satisfies what the consumer declared, verification fails and the deploy is blocked.

The consumer defines the need. The provider proves delivery. No human has to remember to update a stub.
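To make the inversion concrete, here is a toy model in plain Python. It does not use the Pact library: record_contract, verify_provider, and the two fake providers are names invented for this sketch, and the check only looks at top-level keys, where real Pact matching goes much deeper.

```python
import json

def record_contract(consumer, provider, interactions):
    """Consumer side: serialise what the consumer needs into a contract document."""
    return json.dumps({
        "consumer": consumer,
        "provider": provider,
        "interactions": interactions,
    })

def verify_provider(contract_json, call_provider):
    """Provider side: replay each request, report declared keys that are missing."""
    contract = json.loads(contract_json)
    failures = []
    for interaction in contract["interactions"]:
        actual = call_provider(interaction["request"])
        missing = sorted(set(interaction["response_body"]) - set(actual))
        if missing:
            failures.append((interaction["description"], missing))
    return failures

# The consumer declares its need once, in its own test suite.
contract = record_contract("OrderService", "PaymentGateway", [{
    "description": "a successful payment charge",
    "request": {"method": "POST", "path": "/payments/charge/success"},
    "response_body": {"status": "ACCEPTED", "transaction_id": "txn-abc-123"},
}])

# Two hypothetical provider builds: before and after the field rename.
def provider_before_rename(request):
    return {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

def provider_after_rename(request):
    return {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

print(verify_provider(contract, provider_before_rename))  # no failures
print(verify_provider(contract, provider_after_rename))   # reports the missing key
```

Notice that the consumer side never changes between the two runs. The failure shows up on the provider's side, at verification time, which is the whole point.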

Building it — and what the docs didn't tell me

I added Pact to the order-api project this issue, covering both downstream dependencies — the payment gateway and the inventory service — with consumer tests matching the same five scenarios from the Gherkin feature file.

It was less smooth than I expected.

The pact-python v3 FFI surprise

Every tutorial for pact-python shows the same pattern: create a module-scoped Pact fixture, run multiple tests against it, write the pact file at the end. I wrote exactly that. The first test in each class passed. Every subsequent test failed with this:

RuntimeError: The provider state could not be specified.

No hint of what was actually wrong. After digging into the source, the root cause: pact-python 3.x is a complete rewrite backed by a Rust FFI binary. The Rust handle is consumed by the first serve() call — you cannot add new interactions to a handle after that point. The v2-style module-scoped pattern violates this constraint in a way the error message doesn't explain at all.

The fix was restructuring the consumer tests so all interactions are defined upfront before any serve() call:

# ❌ v2-style — breaks in pact-python v3
class TestPaymentConsumer:
    @pytest.fixture(scope="module")
    def pact(self):
        return Consumer("OrderService").has_pact_with(Provider("PaymentGateway"))

    def test_success(self, pact):
        pact.given("payment succeeds").upon_receiving("a charge")...
        with pact:
            # test

    def test_declined(self, pact):
        pact.given("payment declined").upon_receiving("a decline")...
        # RuntimeError — handle already consumed

# ✅ v3 correct pattern — all interactions before serve()
def test_payment_gateway_consumer():
    pact = Consumer("OrderService").has_pact_with(Provider("PaymentGateway"), ...)
    (pact
        .given("the payment gateway will accept the charge")
        .upon_receiving("a successful payment charge")
        .with_request("POST", "/payments/charge/success")
        .will_respond_with(200, body={"status": "ACCEPTED",
                                      "transaction_id": "txn-abc-123",
                                      "amount": 134.97}))
    (pact
        .given("the payment gateway will decline the charge")
        .upon_receiving("a declined payment charge")
        .with_request("POST", "/payments/charge/declined")
        .will_respond_with(402, body={"status": "DECLINED",
                                      "reason": "INSUFFICIENT_FUNDS"}))
    # ... all interactions defined ...
    with pact.serve() as srv:
        # exercise all interactions against srv.url
        ...
    pact.write_file("pacts/")

If you're upgrading from pact-python 1.x or 2.x: expect to rewrite your test fixtures. This isn't a syntax change — it's a different mental model of how the mock server lifecycle works.
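One way to hold the new mental model is a handle that locks once it has been served. The class below is a plain-Python illustration of the constraint as I described it above, not pact-python's implementation; OneShotPact, its methods, and its error message are all made up for the sketch.

```python
class OneShotPact:
    """Toy model: interactions may be added only until the first serve()."""

    def __init__(self):
        self.interactions = []
        self._served = False

    def add_interaction(self, description):
        if self._served:
            # Models the constraint only; this is not the library's real message.
            raise RuntimeError("handle already consumed by serve()")
        self.interactions.append(description)
        return self

    def serve(self):
        # After this point the set of interactions is frozen.
        self._served = True
        return list(self.interactions)


pact = OneShotPact()
pact.add_interaction("a successful payment charge")
pact.add_interaction("a declined payment charge")
served = pact.serve()  # both interactions were registered up front

try:
    pact.add_interaction("a timed-out payment charge")  # too late
    late_add_error = None
except RuntimeError as exc:
    late_add_error = str(exc)

print(served)
print(late_add_error)
```

A module-scoped fixture shared across tests is exactly the "add after serve" case: the second test tries to register an interaction on a handle the first test already served.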

The Verifier transport configuration gap

Provider verification had its own friction. The Verifier constructor in pact-python v3 takes a hostname, not a full URL. Passing a full URL causes a host mismatch when you later configure the transport:

# ❌ Causes "Host mismatch: localhost != http://localhost:8291"
(Verifier("PaymentGateway", "http://localhost:8291")
    .add_transport(url="http://localhost:8291"))

# ✅ Correct
(Verifier("PaymentGateway", "localhost")
    .add_transport(protocol="http", port=8291, scheme="http")
    .add_source(pact_file)
    .set_request_timeout(10000))  # needed for the 6s timeout stub

The set_request_timeout(10000) line is also non-obvious: the payment timeout stub uses fixedDelayMilliseconds: 6000 to simulate a slow response. The verifier's default timeout is 5 seconds. Without the explicit timeout extension, the timeout interaction fails verification with a connection error rather than a clean pass.
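The arithmetic is easy to reproduce without Pact. This sketch scales every number down by ten (a 0.6 s delay against 0.5 s and 1.0 s timeouts) and uses a standard-library thread pool as a stand-in for the verifier's HTTP client; slow_provider and call_with_timeout are hypothetical helpers, not anything from pact-python or WireMock.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def slow_provider(delay_s):
    """Stand-in for a stub with a fixed delay: wait, then answer."""
    time.sleep(delay_s)
    return "response"

def call_with_timeout(delay_s, timeout_s):
    """Return the response, or 'timed out' if the caller gives up first."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_provider, delay_s)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return "timed out"

DELAY = 0.6  # scaled-down 6000 ms stub delay

result_default = call_with_timeout(DELAY, 0.5)   # scaled-down 5 s default
result_extended = call_with_timeout(DELAY, 1.0)  # scaled-down 10 s override

print(result_default)
print(result_extended)
```

With the shorter timeout the caller gives up before the answer arrives, so the slow response never gets compared against the contract at all. That matches what I saw: an error instead of a verdict.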

Neither of these is in the main documentation. Both took real time to find. They're in the findings file for this session, linked at the bottom.

The breaking change experiment

All the Pact setup is preamble. This is the proof.

Step 1: Baseline — all contracts verified

pytest tests/pact/test_provider_verification.py -v

Verifying a pact between OrderService and PaymentGateway
  a declined payment charge         (OK)
  a successful payment charge       (OK)
  a timed-out payment charge        (OK)
PASSED

Verifying a pact between OrderService and InventoryService
  [3 interactions — all OK]
PASSED

2 passed in 8.19s

Step 2: Introduce the breaking change

In wiremock/payment-mappings/payment-success.json, one field rename:

// Before
{"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

// After — "status" renamed to "result"
{"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

Step 3: Provider verification with the breaking change

pytest tests/pact/test_provider_verification.py -v

  a successful payment charge (FAILED)

Failures:
  1.1) has a matching body
         $ -> Actual map is missing the following keys: status
  {
    "amount": 134.97,
  -  "status": "ACCEPTED",
  +  "result": "ACCEPTED",
    "transaction_id": "txn-abc-123"
  }

1 failed in 7.22s

Pact caught it. Exact field. Exact diff. No ambiguity about what broke or why.

Step 4: The same breaking change against the WireMock test suite

pytest tests/steps/test_order_creation.py -v

test_order_is_successfully_created... PASSED
test_order_is_rejected_when_payment_is_declined PASSED
test_order_is_rejected_when_an_item_is_out_of_stock PASSED
test_order_surfaces_partial_unavailability... PASSED
test_order_handling_is_graceful_when_the_payment_gateway_times_out PASSED

5 passed in 13.01s

Five for five. All green. The breaking change is completely invisible.

Step 5: Revert and confirm

2 passed in 8.19s

Why the WireMock tests stayed green

This isn't a flaw in the Gherkin approach — it's a precise boundary on what any behavioral test can and can't see.

The Gherkin scenarios test the order service's behavior: does the order get confirmed? Does the right status come back to the caller? In app/main.py, when the payment gateway responds, the code checks the HTTP status code and returns {"status": "CONFIRMED"} — it never reads the status field from the payment gateway body. So from the test harness's perspective, nothing changed. The right HTTP code came back, the order was confirmed, all assertions passed.

Pact caught it because the consumer test had explicitly declared that the order service expects a status field in the payment response. That expectation is encoded in the .pact file. When provider verification ran against the modified stub, the Rust verifier compared the actual response against the contract and found the key missing.

The Gherkin test and the Pact consumer test are testing different things. Gherkin tests the system's behavior end-to-end. Pact tests the shape of the conversation between services. You need both. They're not competing — they're covering different failure modes.
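The two views fit side by side in a few lines. This is a simplified stand-in for the logic described above, not the actual app/main.py; confirm_order and missing_contract_keys are hypothetical names, and the REJECTED branch is a placeholder for whatever the real service does on failure.

```python
def confirm_order(http_status, payment_body):
    """Behavioural view: decide on the HTTP status code alone, ignoring the body."""
    if http_status == 200:
        return {"status": "CONFIRMED"}
    return {"status": "REJECTED"}  # placeholder branch for this sketch

def missing_contract_keys(expected_body, actual_body):
    """Contract view: which keys the consumer declared are absent from the response."""
    return sorted(set(expected_body) - set(actual_body))

# What the consumer test declared, and what the provider sends after the rename.
declared = {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}
renamed = {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}

print(confirm_order(200, renamed))               # behaviour is unchanged
print(missing_contract_keys(declared, renamed))  # shape violation is visible
```

Same response, two different questions. The behavioural check asks "did the order get confirmed?" and the answer is still yes. The contract check asks "is this the shape we agreed on?" and the answer is no.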

The can-i-deploy gate

The final piece was a local can-i-deploy simulation — a script that reads the generated .pact files, checks each interaction's expected response shape against the WireMock stub mappings, and exits 0 (safe) or 1 (blocked).

With contracts intact:

python scripts/can_i_deploy.py

Pact: OrderService → PaymentGateway
  PASS  a declined payment charge
  PASS  a successful payment charge
  PASS  a timed-out payment charge

Pact: OrderService → InventoryService
  PASS  [3 interactions]

RESULT: ALL CONTRACTS VERIFIED — safe to deploy
Exit: 0

With the breaking change in place:

  FAIL  a successful payment charge
        stub is missing fields expected by consumer: ['status']

RESULT: CONTRACT VIOLATIONS DETECTED — do not deploy
Exit: 1

In a real Pact Broker setup, this check queries a central record of which consumer versions have verified which provider versions. The local simulation does something simpler but teaches the same pattern: before you deploy, prove the contract is still satisfied. The exit code is what a CI pipeline reads. A non-zero exit stops the merge.
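A gate like this fits in a few dozen lines. The sketch below is a simplified illustration rather than the script in the repo. It assumes each pact interaction carries a plain JSON object at response.body, that each WireMock mapping keeps its payload under response.jsonBody and its route under request.urlPath, and that the two are paired by path. All of those are assumptions, and the function names are invented.

```python
def check_contracts(pact_doc, stub_mappings):
    """Return (description, missing_keys) for every interaction the stubs fail."""
    stubs_by_path = {
        mapping["request"]["urlPath"]: mapping["response"].get("jsonBody", {})
        for mapping in stub_mappings
    }
    violations = []
    for interaction in pact_doc["interactions"]:
        expected = interaction["response"].get("body", {})
        actual = stubs_by_path.get(interaction["request"]["path"], {})
        missing = sorted(set(expected) - set(actual))
        if missing:
            violations.append((interaction["description"], missing))
    return violations

def gate(pact_doc, stub_mappings):
    """Print a verdict and return the exit code a CI job would read."""
    violations = check_contracts(pact_doc, stub_mappings)
    for description, missing in violations:
        print(f"FAIL  {description}: stub is missing {missing}")
    print("RESULT:", "blocked" if violations else "safe to deploy")
    return 1 if violations else 0

# In-memory stand-ins for a .pact file and two versions of a stub mapping.
pact_doc = {
    "consumer": {"name": "OrderService"},
    "provider": {"name": "PaymentGateway"},
    "interactions": [{
        "description": "a successful payment charge",
        "request": {"method": "POST", "path": "/payments/charge/success"},
        "response": {"status": 200, "body": {
            "status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97}},
    }],
}

def make_stub(body):
    return [{"request": {"method": "POST", "urlPath": "/payments/charge/success"},
             "response": {"status": 200, "jsonBody": body}}]

intact_stub = make_stub(
    {"status": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97})
broken_stub = make_stub(
    {"result": "ACCEPTED", "transaction_id": "txn-abc-123", "amount": 134.97})

exit_intact = gate(pact_doc, intact_stub)
exit_broken = gate(pact_doc, broken_stub)
# A real script would load the files from disk and finish with sys.exit(code).
```

The return value is the only thing the pipeline cares about. Everything printed above it is for the human who has to work out what broke.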

The full GitHub Actions wiring — where this becomes an automated gate on every PR — is Issue #6. The local simulation is enough to feel how it works.

Where we are

Four issues in, the specification layer is taking shape. Gherkin and WireMock proved the agent builds reliably against a well-written spec. The agent session proved that clean specs produce clean implementations and expose your assumptions. Pact closes the loop — the contract now survives beyond the stub and catches provider drift before it reaches production.

The stack is starting to look like something real. But there's a question I've been putting off since Issue #2 that can't wait any longer: what actually makes a Gherkin scenario good? Because not all specs are equal, and an agent that builds from a loose spec produces something very different from one that builds from a tight one. Next issue I'm going to prove that by deliberately writing bad Gherkin, handing it to the agent, and showing you what comes out.

Next issue: The Spec That Doesn't Lie — deliberately writing bad Gherkin, seeing what the agent builds from it, then rewriting it and comparing the output.

Sources & Further Reading