Diya Burman

Posted on Jun 10

Wiring the Guardrails

#testing #devops #ai #softwareengineering

The Level 5 Engineer: Issue #6

Preface

I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.

Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com

Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com — Watch the video

This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 of that five-level ladder on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.

Five issues in, everything we've built lives on one machine. The Gherkin scenarios, the WireMock stubs, the Pact contracts, the can-i-deploy script — all of it runs locally, passes locally, and means nothing the moment someone else touches the codebase.

Issue #6 fixes that. A GitHub Actions pipeline now runs on every push, executes the full specification stack in dependency order, and blocks merges to main if anything breaks. The pipeline is the guardrail. From this point on, a broken contract or a failing scenario cannot reach main undetected.

Getting there took ninety minutes and two interventions I didn't plan for. Both are worth documenting.

Before the YAML: deciding what "green" means

The first thing Claude Code did before touching any pipeline config was run the full test suite to establish a baseline. The instruction was explicit: everything must pass before a single line of YAML gets written.

It found a failure immediately — and it wasn't from the breaking change experiment. It was from Issue #5.

The bad-spec test (test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order) was still asserting db_status in the response body. That was intentional in Issue #5: the failure was the finding, and the session ended with the test red to show what bad specs produce. Left on main with CI about to arrive, though, it would have made the pipeline red on day one, before a single feature change.

The fix was adding backward-compatible aliases to the response:

return {
    "order_id": order_id,
    "status": order["db_status"],            # good spec field
    "db_status": order["db_status"],         # bad spec alias — keeps Issue #5 test passing
    "placed_at": order["order_created_at"],  # good spec field
    "order_created_at": order["order_created_at"],  # bad spec alias
}

Neither test file was modified. No feature files were touched. The aliases kept both the good-spec and bad-spec tests passing against the same endpoint.
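
To make the alias idea concrete, here is a minimal sketch of one response body satisfying both styles of assertion. The function name build_status_response and the sample order row are invented for illustration; only the five response fields come from the snippet above.

```python
def build_status_response(order_id: str, order: dict) -> dict:
    """Return the order-status body with both good-spec and bad-spec field names."""
    return {
        "order_id": order_id,
        "status": order["db_status"],                   # good-spec field
        "db_status": order["db_status"],                # bad-spec alias
        "placed_at": order["order_created_at"],         # good-spec field
        "order_created_at": order["order_created_at"],  # bad-spec alias
    }


# Hypothetical order row, made up for this example.
order = {"db_status": "CONFIRMED", "order_created_at": "2024-01-01T12:00:00Z"}
body = build_status_response("order-123", order)

# The kind of assertion a good-spec test makes.
assert body["status"] == "CONFIRMED"
assert "placed_at" in body

# The kind of assertion the bad-spec test from Issue #5 makes.
assert body["db_status"] == "CONFIRMED"
assert "order_created_at" in body
```

The trade-off is a slightly noisier response body in exchange for a green baseline without editing either test file.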

The reason this matters before the pipeline exists: a team that starts CI with a known failure trains itself to ignore red. The cost of normalising a red CI is much higher than the cost of fixing the baseline first. Claude Code made the right call and documented it before moving on.

The pipeline structure

Four jobs, in dependency order:

test → pact-consumer → pact-verify → can-i-deploy

Each job only runs if its predecessor passes. If Gherkin breaks, Pact never runs. If the consumer tests fail, verification never runs. If verification fails, can-i-deploy is skipped. The pipeline fails fast and tells you exactly which layer broke.

The artifact chain is what makes it a pipeline rather than four parallel scripts. The pact-consumer job generates the .pact files and uploads them as a GitHub Actions artifact. The pact-verify job downloads that artifact and verifies it: the same files, not freshly regenerated ones. Without this, each job would build its own consumer contract from scratch, and verification would only prove that the provider matches a contract regenerated on the spot, not the one pact-consumer actually produced.
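
The fail-fast ordering and the artifact handoff can be modelled in a few lines of Python. This is a toy model, not GitHub Actions itself: run_chain, the job functions, and the simplified "contract" (a list of expected response keys rather than a real .pact file) are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List, Tuple

Artifacts = Dict[str, bytes]
Job = Callable[[Artifacts], bool]


def run_chain(jobs: List[Tuple[str, Job]]) -> Dict[str, str]:
    """Run jobs in order; once one fails, mark every later job as skipped."""
    artifacts: Artifacts = {}
    results: Dict[str, str] = {}
    blocked = False
    for name, job in jobs:
        if blocked:
            results[name] = "skipped"
            continue
        passed = job(artifacts)
        results[name] = "passed" if passed else "failed"
        blocked = not passed
    return results


def gherkin_tests(artifacts: Artifacts) -> bool:
    return True  # stand-in for the Gherkin suite


def pact_consumer(artifacts: Artifacts) -> bool:
    # Generate the contract once and "upload" it as an artifact.
    contract = {"response_keys": ["amount", "status", "transaction_id"]}
    artifacts["pacts"] = json.dumps(contract).encode()
    return True


def make_pact_verify(provider_keys: List[str]) -> Job:
    def pact_verify(artifacts: Artifacts) -> bool:
        # "Download" the uploaded artifact instead of regenerating the contract.
        contract = json.loads(artifacts["pacts"])
        return set(contract["response_keys"]) <= set(provider_keys)
    return pact_verify


def can_i_deploy(artifacts: Artifacts) -> bool:
    return True  # stand-in for the can-i-deploy script


def pipeline(provider_keys: List[str]) -> Dict[str, str]:
    return run_chain([
        ("test", gherkin_tests),
        ("pact-consumer", pact_consumer),
        ("pact-verify", make_pact_verify(provider_keys)),
        ("can-i-deploy", can_i_deploy),
    ])


print(pipeline(["amount", "status", "transaction_id"]))
print(pipeline(["amount", "result", "transaction_id"]))
```

The point of the model is that pact_verify never builds its own contract. It can only judge the bytes that pact_consumer left behind.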

One non-obvious piece: mock_server.py is a library module with no command-line entry point. The pipeline needed servers running as background processes. The fix was an inline Python invocation:

- name: Start mock servers
  run: |
    . .venv/bin/activate
    python -c "
    import time
    from mock_server import start_mock_server
    start_mock_server(8091, 'wiremock/payment-mappings')
    start_mock_server(8092, 'wiremock/inventory-mappings')
    time.sleep(86400)
    " &
    sleep 2

The time.sleep(86400) keeps the process alive for the duration of the job. Inelegant but functional. A proper if __name__ == "__main__" entry point with argparse is the obvious cleanup for a future session.
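
For reference, that cleanup might look roughly like the sketch below. It assumes start_mock_server(port, mappings_dir) returns without blocking, as the inline invocation implies. The --server flag, its PORT:MAPPINGS_DIR format, and the helper names are hypothetical and not part of the current mock_server.py.

```python
import argparse
import time
from typing import Callable, List, Optional, Tuple


def parse_server_spec(spec: str) -> Tuple[int, str]:
    """Split 'PORT:MAPPINGS_DIR' into (port, mappings_dir)."""
    port_text, sep, mappings_dir = spec.partition(":")
    if not sep or not mappings_dir:
        raise argparse.ArgumentTypeError(f"expected PORT:MAPPINGS_DIR, got {spec!r}")
    try:
        port = int(port_text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid port in {spec!r}")
    return port, mappings_dir


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Start stub servers and keep them running.")
    parser.add_argument(
        "--server",
        action="append",          # allow the flag to be repeated, one per server
        type=parse_server_spec,
        required=True,
        metavar="PORT:MAPPINGS_DIR",
        help="stub server to start; repeat for multiple servers",
    )
    return parser


def main(argv: Optional[List[str]] = None,
         start: Optional[Callable[[int, str], object]] = None) -> None:
    args = build_parser().parse_args(argv)
    if start is None:
        # The article's library function, imported lazily.
        from mock_server import start_mock_server as start
    for port, mappings_dir in args.server:
        start(port, mappings_dir)
    try:
        while True:               # keep the process alive until interrupted
            time.sleep(3600)
    except KeyboardInterrupt:
        pass
```

With the usual if __name__ == "__main__": main() guard at the bottom of the module, the invocation would become something like python mock_server.py --server 8091:wiremock/payment-mappings --server 8092:wiremock/inventory-mappings.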

The first CI run — and why I had to intervene manually

The YAML was committed, pushed to main, and the pipeline ran. All three runs failed on the test job:

OSError: [Errno 98] Address already in use

Ports 8091 and 8092. Every test in test_order_creation.py errored at setup. The order status tests — which don't use the mock servers — passed fine.

Claude Code didn't catch this on its own. Here's why that's worth explaining.

When Claude Code wrote the pipeline, it was working from the codebase and its own knowledge of GitHub Actions patterns. It knew the mock servers needed to be running before pytest started, so it added an explicit start-servers step to the YAML — a reasonable decision based on the information it had. What it couldn't see was the runtime interaction between that YAML step and pytest's session-scoped fixtures, because that interaction only manifests in the CI environment, not locally.

Locally, running pytest tests/steps/ -v has always worked correctly because the session fixture starts the servers and nothing else competes. Claude Code had only ever seen local runs succeed. It had no signal that the YAML step was creating a conflict — because the conflict doesn't exist locally.

This is a fundamental limit of the "paste and walk away" approach at the boundary between local and remote environments: the agent can reason about the codebase and about CI patterns, but it can't observe the CI run itself. The failure was on GitHub. Claude Code was in a terminal. Those two things weren't connected.

I diagnosed the error from the GitHub Actions log, explained the root cause, and pasted new instructions. Claude Code fixed it in one step — removing the redundant YAML steps entirely:

# Removed from both test and pact-verify jobs:
- name: Start mock servers
  run: |
    . .venv/bin/activate
    python -c "..." &
    sleep 2

The pytest session fixtures already own the server lifecycle. scope="session" means pytest starts the servers once per test run and keeps them alive for the whole run. The YAML step was duplicating a responsibility that was already handled. The fix wasn't a workaround; it was removing the wrong layer.

The root cause in plain terms: the YAML step and the pytest fixture both thought they were responsible for starting the servers. The port was already bound when the fixture tried to bind it again. Works on my machine. Breaks in CI. Classic.
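
The double-ownership failure is easy to reproduce in isolation with plain sockets. The sketch below is not the project's code: try_bind is a made-up helper, and the two bind attempts merely stand in for the YAML step and the pytest fixture.

```python
import errno
import socket


def try_bind(port: int) -> socket.socket:
    """Bind and listen on 127.0.0.1:port, raising OSError if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


# First owner: stands in for the YAML "Start mock servers" step.
first = try_bind(0)  # port 0 asks the OS for any free port
port = first.getsockname()[1]

# Second owner: stands in for the session-scoped pytest fixture.
conflict = None
try:
    second = try_bind(port)
    second.close()
except OSError as exc:
    conflict = exc.errno  # EADDRINUSE ("Address already in use")
finally:
    first.close()

print("second bind failed with EADDRINUSE:", conflict == errno.EADDRINUSE)
```

Whichever layer binds first wins. The second one fails at setup, which is why every test that depended on the fixture errored before running.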

The breaking change experiment — in the pipeline

With the pipeline green, the breaking change test ran as designed.

Branch test/breaking-change-pipeline, commit 76c0d89: renamed "status" to "result" in wiremock/payment-mappings/payment-success.json. Same change as Issue #4, now running through CI instead of local verification.

The expected failure:

a successful payment charge (FAILED)

Failures:
1) Verifying a pact between OrderService and PaymentGateway
   1.1) has a matching body
          $ -> Actual map is missing the following keys: status
   {
     "amount": 134.97,
  -  "status": "ACCEPTED",
  +  "result": "ACCEPTED",
     "transaction_id": "txn-abc-123"
   }

pact-verify fails. can-i-deploy is skipped. The merge is blocked.

And the key point from Issue #4 holds at the pipeline level: the test job — the Gherkin suite — would pass with the breaking change in place. The order creation scenarios check HTTP status codes and business outcomes. They never read pay_resp.json()["status"]. A stub returning result instead of status still returns HTTP 200. Gherkin passes. Pact catches it.
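
A minimal sketch of why the two layers disagree about the same stub response. The two check functions are simplified stand-ins, not the real Gherkin steps or the Pact verifier; the response body is the one from the failure output above.

```python
from typing import Set


def gherkin_style_check(http_status: int) -> bool:
    """Outcome-level check: only the HTTP status code matters."""
    return http_status == 200


def pact_style_missing_keys(body: dict, expected_keys: Set[str]) -> Set[str]:
    """Contract-level check: return the contract keys absent from the actual body."""
    return expected_keys - set(body)


# Stub response after the breaking change: "status" renamed to "result".
http_status = 200
body = {"amount": 134.97, "result": "ACCEPTED", "transaction_id": "txn-abc-123"}
contract_keys = {"amount", "status", "transaction_id"}

gherkin_passes = gherkin_style_check(http_status)
missing = pact_style_missing_keys(body, contract_keys)

print("Gherkin-style check passes:", gherkin_passes)
print("Contract keys missing:", sorted(missing))
```

The outcome check has no reason to look inside the body, so only the contract check can notice the rename.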

This is the division of labour. Gherkin proves the system does the right thing. Pact proves the contracts don't drift. You need both, and now both run automatically on every push.

The one step that requires the GitHub UI

Claude Code cannot configure branch protection rules from the working tree alone; that requires repository admin access through the GitHub web UI or the admin API. This step is non-negotiable and, in this setup, must be done manually:

1. Repo → Settings → Branches → Add branch protection rule
2. Branch name pattern: main
3. Enable "Require status checks to pass before merging"
4. Add all four status checks: test, pact-consumer, pact-verify, can-i-deploy
5. Enable "Require branches to be up to date before merging"
6. Save

Without this, the pipeline is advisory. A push to main can still happen even if all four jobs are red. The pipeline becomes a dashboard — it shows you the problem but doesn't stop anything. Branch protection is what turns "CI failed" from a notification into enforcement. The pipeline is only a guardrail if something stops you going around it.
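
For anyone who prefers the admin API route over the web UI, the same rule can be expressed as a request to GitHub's REST endpoint for branch protection. The sketch below only builds the request and does not send it. The owner, repo, and token values are placeholders, and the payload reflects my reading of the API rather than something tested in this session, so check it against the current GitHub docs before relying on it.

```python
import json
import urllib.request

REQUIRED_CHECKS = ["test", "pact-consumer", "pact-verify", "can-i-deploy"]


def build_protection_request(owner: str, repo: str, token: str,
                             branch: str = "main") -> urllib.request.Request:
    """Build (but do not send) a PUT request that sets branch protection."""
    payload = {
        "required_status_checks": {
            "strict": True,               # "Require branches to be up to date before merging"
            "contexts": REQUIRED_CHECKS,  # the four pipeline jobs
        },
        "enforce_admins": False,          # assumption: not covered by the UI steps above
        "required_pull_request_reviews": None,
        "restrictions": None,
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}/protection",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; a real call needs a token with admin rights on the repo.
request = build_protection_request("example-owner", "example-repo", "TOKEN")
print(request.get_method(), request.full_url)
```

Sending it would be a single urllib.request.urlopen(request) call, which is deliberately left out here.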

The honest part

The YAML took about twenty minutes to write. The session took ninety minutes total — because the baseline fix and the port conflict ate the rest.

The instinct during the baseline audit was to skip past the known failure. It's a demo test, we know why it's there, configure CI to skip that file and move on. That would have been thirty seconds. It also would have been wrong — a pipeline with documented exceptions is a pipeline people route around.

The instinct during the port conflict was to blame the CI environment. Ubuntu runs things differently, ports work differently, it's a platform quirk. That framing would have sent the debugging in the wrong direction. The actual cause was simpler: two layers both thought they owned the same responsibility, and nobody had written down which one was actually in charge.

Both of those moments are the J-curve. Not the YAML — the discipline of not skipping and not blaming the environment. The overhead of CI is not the config file. It's every decision about what "green" actually means and who's responsible for what.

The pipeline is now real infrastructure. The breaking change can't reach main. That's worth ninety minutes.

Next issue: The Scope Problem — scaling Gherkin across a multi-service system. What happens when one spec file isn't enough, and how spec debt forms.

Sources & Further Reading