Diya Burman

Posted on Jun 10 • Edited on Jun 14 • Originally published at level5engineer.substack.com

Senior Developers Using AI Are Getting Slower. The Data Says So.

#ai #softwareengineering #agents #testing

A Level 5 Engineer — Issue #2

Preface

I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.

Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com

Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com — Watch the video

This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 on a good day (the levels are the ones in this newsletter’s title). The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.

If you read Issue 1, you walked away with the map. Six levels, a plateau most engineers never escape, and a Dark Factory that a handful of teams are quietly running in production. If you missed it, go read it first — this one builds directly on it.

This issue is about the single most important shift that happens when you try to move from Level 3 to Level 4. Not the tools. Not the mindset. The bottleneck.

Because it moved. And most of us didn't notice.

When speed stops being the problem

For most of our careers, the bottleneck in software development was implementation speed. You had the idea, you had the design, you had the ticket — the constraint was how fast fingers could turn it into working code. That's the world we optimized for. That's why we measured velocity. That's why standups exist. That's why "10x engineer" was ever a phrase people said out loud without embarrassment.

AI blew that bottleneck wide open.

At Level 2, implementation stops being the constraint almost overnight. You're pairing with an agent and the code just... appears. Features that used to take days take hours. Hours take minutes. It feels like the problem is solved.

Except you haven't solved it. You've just exposed the one that was hiding behind it.

The new bottleneck is specification quality.

The agent can build anything you can describe precisely enough. The operative word is precisely. The moment you try to hand off a vague, half-formed idea — the kind a human developer would fill in with reasonable assumptions and a quick Slack message — the agent either hallucinates something plausible-looking that isn't what you wanted, or it freezes, or worse, it confidently builds the wrong thing all the way to completion.

The constraint is no longer your ability to implement. It's your ability to specify.

What a bad spec actually looks like

Here's the uncomfortable truth — most "requirements" we write as engineers are not specifications. They are vibes dressed up in Jira tickets.

"Add pagination to the users endpoint." That's not a spec. How many results per page? Is the default configurable? What happens when the page number exceeds the total — empty array or 404? What's the sort order? Cursor-based or offset-based? What happens to existing API consumers who aren't sending page parameters yet?

A human developer asks those questions in standup or figures them out from context. An agent working autonomously at Level 4 cannot do that. It will make a choice silently and confidently, and if that choice is wrong you may not find out until production.
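
To make the difference concrete, here is a minimal sketch of what answering those pagination questions could look like once the answers are written down as executable behaviour. The function name paginate, the Page type, and every default (20 per page, a cap of 100, ascending id order, offset-based paging, an empty list past the last page) are assumptions picked for illustration, not decisions from a real API.

```python
from dataclasses import dataclass

# Every constant below is an assumed answer to one of the open questions,
# chosen only for illustration.
DEFAULT_PAGE_SIZE = 20   # used when the caller sends no page_size
MAX_PAGE_SIZE = 100      # larger requests are clamped, not rejected


@dataclass(frozen=True)
class Page:
    items: list
    page: int
    page_size: int
    total: int


def paginate(users, page=1, page_size=DEFAULT_PAGE_SIZE):
    """Offset-based pagination over users sorted by ascending id."""
    if page < 1:
        raise ValueError("page must be >= 1")
    size = max(1, min(page_size, MAX_PAGE_SIZE))
    ordered = sorted(users, key=lambda u: u["id"])
    start = (page - 1) * size
    # A page past the end yields an empty list rather than an error.
    return Page(ordered[start:start + size], page, size, len(ordered))


users = [{"id": i} for i in range(1, 46)]
first = paginate(users)              # existing consumer sending no parameters
beyond = paginate(users, page=99)    # page number past the total
```

Whether these are the right answers matters less than the fact that each one is now explicit and testable instead of being decided silently.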

This is why Dan Shapiro's insight about specification quality isn't just a productivity tip. It's a prerequisite for moving up the ladder at all. You cannot reach Level 4 with Level 2 specs. The system won't let you.

So I built one. Here's what happened.

I wanted to do something concrete this issue rather than just theorize. So I picked a real-world-shaped scenario — an e-commerce order management API with two external dependencies — and built it end to end with WireMock simulating the dependencies and Gherkin scenarios written before the code.

The full project is on GitHub so you can clone it and run the exact same setup on your machine. Everything below is reproducible.

The scenario

A POST /orders endpoint that talks to two external services:

A payment gateway (think Stripe) that can succeed, decline, or time out
An inventory service that can confirm stock, report out-of-stock, or report partial availability

Realistic enough to be relatable. Scoped enough to finish in an afternoon. The kind of integration complexity every backend engineer deals with.

Step 1 — Write the spec first. Actually first.

Here are the five Gherkin scenarios I wrote before a single line of implementation code:

Feature: Order Creation

  Scenario: Order is successfully created when payment succeeds and all items are in stock
    Given a registered user with id "user-123"
    And the inventory service confirms all items are in stock
    And the payment gateway will accept the charge
    When the user submits an order for SHOE-RED-42 and BELT-BRN-M
    Then the order status is "CONFIRMED"
    And the response includes an order id
    And the payment gateway received exactly one charge request
    And the inventory service received a reservation request

  Scenario: Order is rejected when payment is declined
    Given a registered user with id "user-456"
    And the inventory service confirms all items are in stock
    And the payment gateway will decline the charge
    When the user submits an order for SHOE-RED-42 and BELT-BRN-M
    Then the order status is "PAYMENT_FAILED"
    And the response status code is 402
    And the response includes the decline reason "INSUFFICIENT_FUNDS"
    And the inventory reservation is released
    And no order id is issued

  Scenario: Order is rejected when an item is out of stock
    Given a registered user with id "user-789"
    And the inventory service reports SHOE-RED-42 is out of stock
    When the user submits an order for SHOE-RED-42
    Then the order status is "UNAVAILABLE"
    And the response status code is 409
    And the payment gateway is never called

  Scenario: Order surfaces partial unavailability without auto-confirming
    Given a registered user with id "user-321"
    And the inventory service reports SHOE-RED-42 as available but BELT-BRN-M as unavailable
    When the user submits an order for SHOE-RED-42 and BELT-BRN-M
    Then the order status is "PARTIAL_UNAVAILABLE"
    And the response status code is 207
    And the payment gateway is never called
    And no order is confirmed without explicit user action

  Scenario: Order handling is graceful when the payment gateway times out
    Given a registered user with id "user-654"
    And the inventory service confirms all items are in stock
    And the payment gateway will not respond within the timeout window
    When the user submits an order for SHOE-RED-42 and BELT-BRN-M
    Then the response is returned within 12 seconds
    And the order status is "PAYMENT_PENDING"
    And the inventory is held for 15 minutes
    And the payment gateway is not retried more than 2 times

Notice what these are. Not implementation documents. Not pseudocode. They're a behavioural contract — plain-language descriptions of exactly what the system should do in specific situations, written in a format any teammate, PM, or yes — agent — can read.

The discipline of writing them first forced me to make decisions I would normally have postponed:

Do we check inventory before charging, or charge first? (Inventory first. Scenarios 3 and 4 lock this in: the payment gateway is never called when stock is short.)
What happens during partial availability — auto-fulfill what's available, or ask the user? (Ask. Encoded in scenario 4.)
What's the timeout SLA on the payment gateway? (5 seconds, with a max of 2 retries. Scenario 5 makes this testable.)
What's the response code for partial availability? (207 Multi-Status.)

Without these scenarios, every one of those decisions would have been made silently by whoever wrote the code first.
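
As a rough illustration of how those decisions turn into control flow, here is a sketch of the ordering the scenarios pin down. It is not the code from my repo: create_order, in_stock, and charge are hypothetical names, the 201 on success is an assumption (the scenarios never state it), and treating "every item missing" as UNAVAILABLE versus "some items missing" as PARTIAL_UNAVAILABLE is my reading of scenarios 3 and 4. The timeout path is left out to keep it short.

```python
def create_order(skus, in_stock, charge):
    """Return (status, http_code); inventory is consulted before payment."""
    missing = [sku for sku in skus if not in_stock(sku)]
    if missing:
        # Payment is never attempted when anything is unavailable.
        if len(missing) == len(skus):
            return "UNAVAILABLE", 409
        return "PARTIAL_UNAVAILABLE", 207
    if charge(skus) == "ACCEPTED":
        return "CONFIRMED", 201   # 201 is an assumption; the scenarios do not fix it
    return "PAYMENT_FAILED", 402


charge_calls = []


def fake_charge(skus):
    """Stand-in payment gateway that records each call and always accepts."""
    charge_calls.append(list(skus))
    return "ACCEPTED"


stock = {"SHOE-RED-42": True, "BELT-BRN-M": False}
partial = create_order(["SHOE-RED-42", "BELT-BRN-M"], stock.get, fake_charge)
none_left = create_order(["BELT-BRN-M"], stock.get, fake_charge)
calls_before_success = len(charge_calls)
confirmed = create_order(["SHOE-RED-42"], stock.get, fake_charge)
```

The recorded charge_calls list is what lets "the payment gateway is never called" be checked directly rather than inferred.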

Step 2 — Stub the dependencies with WireMock

Before writing any tests you can actually run, the external services need to be simulated. This is what Dan Shapiro calls a digital twin universe — a fully simulated version of your dependencies that behaves like the real thing without the real thing's unpredictability, cost, or rate limits.

WireMock is a widely used tool for this. A WireMock stub is just a JSON file describing how a service should respond:

{
  "request": {
    "method": "POST",
    "url": "/payments/charge/success"
  },
  "response": {
    "status": 200,
    "body": "{\"status\": \"ACCEPTED\", \"transaction_id\": \"txn-abc-123\", \"amount\": 134.97}"
  }
}
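
If it helps to see what "describing how a service should respond" means mechanically, here is a small sketch of matching a request against stubs shaped like the one above. This is not WireMock's actual matching logic, which supports far more (URL patterns, headers, body matchers, priorities); match_stub is a hypothetical helper that only does exact method and URL comparison.

```python
import json

# One stub in the same shape as the mapping file above.
stubs = [{
    "request": {"method": "POST", "url": "/payments/charge/success"},
    "response": {
        "status": 200,
        "body": json.dumps({"status": "ACCEPTED",
                            "transaction_id": "txn-abc-123",
                            "amount": 134.97}),
    },
}]


def match_stub(stubs, method, url):
    """Return (status, body) from the first stub whose method and url match exactly."""
    for stub in stubs:
        request = stub["request"]
        if request["method"] == method and request["url"] == url:
            response = stub["response"]
            return response["status"], response.get("body", "")
    # Unmatched requests get a 404, which is also what WireMock does by default.
    return 404, json.dumps({"error": "No stub matched"})


ok_status, ok_body = match_stub(stubs, "POST", "/payments/charge/success")
miss_status, miss_body = match_stub(stubs, "POST", "/payments/charge/typo")
```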

For the payment timeout scenario, WireMock has a built-in fixedDelayMilliseconds parameter. One line and the mock takes 6 seconds to respond:

{
  "request": {
    "method": "POST",
    "url": "/payments/charge/timeout"
  },
  "response": {
    "status": 504,
    "body": "{\"status\": \"TIMEOUT\"}",
    "fixedDelayMilliseconds": 6000
  }
}

That tiny config line is what makes scenario 5 testable. Without it, exercising timeout behaviour locally usually means something like disabling network connectivity at the OS level, which I have done in the past, and it is exactly as miserable as it sounds.
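
A quick way to sanity-check the numbers in scenario 5 is to simulate the retry loop on a fake clock. The sketch below assumes each attempt burns the full 5-second timeout and that there is no backoff between attempts; charge_with_retries and always_times_out are hypothetical names, not code from the project. Under those assumptions one retry costs 10 seconds in the worst case, which fits inside the 12-second bound, while two retries would cost 15 seconds and would not.

```python
def charge_with_retries(attempt, timeout_s, max_retries):
    """Call attempt(timeout_s) until it succeeds or retries run out.

    Returns (result, attempts_made); result is None if every attempt timed out.
    """
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        try:
            return attempt(timeout_s), attempts
        except TimeoutError:
            continue
    return None, attempts


elapsed = 0.0


def always_times_out(timeout_s):
    """Stand-in for a gateway slower than the timeout.

    Advances a fake clock by the full timeout instead of sleeping.
    """
    global elapsed
    elapsed += timeout_s
    raise TimeoutError


result, attempts = charge_with_retries(always_times_out, timeout_s=5.0, max_retries=1)
```

The two limits in the scenario interact, so it is worth checking them together rather than one at a time.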

Step 3 — Wire the scenarios to real assertions

Gherkin by itself is just text. To turn it into an executable test suite I used pytest-bdd, which lets each Given/When/Then line map to a Python function:

import requests
from pytest_bdd import given, when, then

# API_PORT and payment_log are assumed to be defined elsewhere in the test code.

# Given step: publishes "declined" as the payment_scenario fixture.
@given("the payment gateway will decline the charge", target_fixture="payment_scenario")
def pay_declined():
    return "declined"

# When step: posts the order and exposes the result as the "response" fixture.
@when("the user submits an order for SHOE-RED-42 and BELT-BRN-M", target_fixture="response")
def submit_two(user_id, payment_scenario, inventory_scenario):
    items = [
        {"sku": "SHOE-RED-42", "quantity": 1, "unit_price": 89.99},
        {"sku": "BELT-BRN-M",  "quantity": 1, "unit_price": 44.98},
    ]
    r = requests.post(f"http://localhost:{API_PORT}/orders",
                      json={"user_id": user_id, "items": items,
                            "payment_scenario": payment_scenario,
                            "inventory_scenario": inventory_scenario})
    return {"response": r}

# Then step: the payment mock's call log must be empty.
@then("the payment gateway is never called")
def payment_not_called():
    assert not payment_log.all(), f"Expected no payment calls, got: {payment_log.all()}"

That last assertion, "the payment gateway is never called", is the kind of thing that's hard to verify from outside the process but trivial with WireMock. WireMock records every call it receives. You assert against that log directly.
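
For a sense of what such a log needs to offer, here is a minimal sketch of a recording object with the all() method used in the assertion above. The CallLog class, its count() helper, and the inventory path are hypothetical; the lock is there because a mock server typically handles requests on a different thread from the test.

```python
import threading


class CallLog:
    """Hypothetical request log for a single mock server."""

    def __init__(self):
        self._calls = []
        self._lock = threading.Lock()

    def record(self, method, path):
        with self._lock:
            self._calls.append((method, path))

    def all(self):
        # Return a copy so callers cannot mutate the log by accident.
        with self._lock:
            return list(self._calls)

    def count(self, method, path_prefix):
        return sum(1 for m, p in self.all()
                   if m == method and p.startswith(path_prefix))


payment_log = CallLog()
inventory_log = CallLog()
inventory_log.record("POST", "/inventory/check/out_of_stock")
```

With this shape, "never called" is an empty all() and "exactly one charge request" is a count() of 1.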

Step 4 — Run the suite

$ pytest tests/steps/test_order_creation.py -v
============================= test session starts ==============================
collected 5 items

tests/steps/test_order_creation.py::test_order_is_successfully_created_when_payment_succeeds_and_all_items_are_in_stock PASSED
tests/steps/test_order_creation.py::test_order_is_rejected_when_payment_is_declined PASSED
tests/steps/test_order_creation.py::test_order_is_rejected_when_an_item_is_out_of_stock PASSED
tests/steps/test_order_creation.py::test_order_surfaces_partial_unavailability_without_autoconfirming PASSED
tests/steps/test_order_creation.py::test_order_handling_is_graceful_when_the_payment_gateway_times_out PASSED

============================== 5 passed in 13.53s ==============================

Five for five. But getting there was educational.

What broke along the way

I want to be honest about the failures because that's where the actual learning happened. The five tests didn't pass on the first run. They didn't pass on the second run either.

Failure 1 — The "no stub matched" silent success

When a request comes in that no WireMock stub knows how to handle, the default behaviour is to return a 404. My API code did this:

try:
    pay = httpx.post(f"{PAYMENT_URL}/payments/charge/{scenario}", ...)
    payment_result = pay.json()
except Exception:
    raise HTTPException(503, "Payment service error")

A 404 is not an exception in httpx. It's just a response. So the API would happily call pay.json(), get {"error": "No stub matched"}, and treat the entire interaction as a success — issuing an order id and confirming the order even though no real payment had been processed.

This is genuinely dangerous. A misconfigured mock would have made all my tests pass while hiding that the real service path was broken. Lesson: always check the response status explicitly, whether the call goes to a mock or to the real service. The fix:

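try:
    pay = httpx.post(f"{PAYMENT_URL}/payments/charge/{scenario}", ...)
    if pay.status_code == 404:
        # No stub matched: fail loudly instead of parsing the error body as a result
        raise HTTPException(503, f"Payment scenario not found: {scenario}")
    payment_result = pay.json()
except HTTPException:
    raise  # keep the specific message; don't let the generic handler swallow it
except Exception:
    raise HTTPException(503, "Payment service error")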
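
A slightly broader version of the same idea is to validate the whole reply before trusting it, not just the 404 case. The sketch below is a hypothetical helper, not the project's code: parse_payment_reply and UpstreamError are made-up names, and it assumes the gateway reports declines with a 2xx status and a "status" field in the body. If your gateway signals declines with a 4xx, those codes would need to be allowed through explicitly.

```python
import json


class UpstreamError(Exception):
    """Raised when a dependency's reply cannot be trusted as a result."""


def parse_payment_reply(status_code, body):
    """Return the parsed payload, or raise if the reply is not a usable result."""
    if not 200 <= status_code < 300:
        raise UpstreamError(f"payment service returned HTTP {status_code}")
    payload = json.loads(body)
    if "status" not in payload:
        raise UpstreamError("payment reply has no 'status' field")
    return payload


accepted = parse_payment_reply(
    200, '{"status": "ACCEPTED", "transaction_id": "txn-abc-123"}')

try:
    parse_payment_reply(404, '{"error": "No stub matched"}')
    unmatched_error = None
except UpstreamError as exc:
    unmatched_error = str(exc)
```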

Failure 2 — The shared call log bug

I started with one MockServer class that held a single class-level call log. Both the payment and inventory mocks recorded into the same list. When the test asserted "the payment gateway received exactly one charge request," the inventory call was in the log but no payment call was — because of failure 1 — and the assertion was looking at the combined log.

The fix was conceptually small but architecturally important — each mock server instance gets its own call log:

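import json
import threading
from http.server import HTTPServer
from pathlib import Path

# MockCallLog and make_handler are the project's own helpers (not shown here).
def start_mock_server(port: int, mappings_dir: str) -> tuple[HTTPServer, MockCallLog]:
    stubs = [json.loads(f.read_text()) for f in Path(mappings_dir).glob("*.json")]
    log = MockCallLog()                  # per-instance log, not shared across servers
    handler = make_handler(stubs, log)
    server = HTTPServer(("localhost", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, log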

This mirrors how real WireMock is usually run: a separate instance per service, each with its own request log. The bug was a direct consequence of cutting that corner.
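
The underlying Python pitfall is easy to reproduce in isolation. In this sketch (both class names are hypothetical, and the path is made up), a list declared on the class is shared by every instance, while a list created in __init__ belongs to one instance only.

```python
class SharedLogServer:
    calls = []                    # class attribute: one list shared by every instance

    def record(self, path):
        self.calls.append(path)


class OwnLogServer:
    def __init__(self):
        self.calls = []           # instance attribute: one list per server

    def record(self, path):
        self.calls.append(path)


shared_payment, shared_inventory = SharedLogServer(), SharedLogServer()
shared_inventory.record("/inventory/reserve")

own_payment, own_inventory = OwnLogServer(), OwnLogServer()
own_inventory.record("/inventory/reserve")
```

After this runs, shared_payment.calls contains the inventory request even though the payment server never received it, while own_payment.calls stays empty.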

Failure 3 — The fixture wiring gap

Scenarios 3 and 4 don't define a payment scenario in their Given clauses, because the payment gateway should never be called in those cases. But pytest-bdd was still expecting the payment_scenario fixture — and erroring out before the test even ran.

This is a subtle distinction worth naming. The Gherkin spec was correct. It said exactly what it should say. The error was in the test wiring that connected the spec to the assertions. The fix was a default fixture:

import pytest

@pytest.fixture
def payment_scenario():
    """Default, overridden by Given steps that set target_fixture="payment_scenario"."""
    return "success"

The spec stays clean. The wiring handles the case where a scenario doesn't care about a particular setup.

What this exercise actually proved to me

A few things that are now visceral rather than abstract:

Specs that the AI cannot see during the build are uniquely powerful. My scenarios live in tests/features/order_creation.feature. The implementation lives in app/main.py. When I asked an agent to modify the API, I could give it the implementation only. The spec stayed external. The agent had to make the test pass against behaviour it couldn't reverse-engineer from the assertions themselves. This is the part that genuinely changes things at Level 4.

WireMock's 404-on-no-match is a feature, not a bug. It exposes integration mistakes that would otherwise hide forever. The first time I saw a test silently succeed because of the 404 passthrough I was annoyed. Now I think it should be louder.

Writing the scenarios first changed what I built. Scenario 4 — partial availability — would not have existed if I'd written the code first. I would have implemented "all available or fail" and shipped it. Writing the spec first made me confront the question. The answer became part of the system.

Try this yourself

Everything above is in a project you can clone, run, and break. Five scenarios, two mock services, one API. Total setup time: under fifteen minutes if you have Python and pip installed.

git clone <repo-url> order-api
cd order-api
pip install fastapi uvicorn httpx pytest pytest-bdd requests
pytest tests/steps/test_order_creation.py -v

If you want to use real WireMock instead of the Python-based mock:

# Download WireMock standalone
curl -L -o wiremock.jar \
  https://repo1.maven.org/maven2/org/wiremock/wiremock-standalone/3.3.1/wiremock-standalone-3.3.1.jar

# Run two instances — the JSON mappings work as-is
java -jar wiremock.jar --port 8081 --root-dir wiremock/payment-mappings &
java -jar wiremock.jar --port 8082 --root-dir wiremock/inventory-mappings &

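The WireMock mapping JSON files I wrote work in real WireMock with zero changes. That was deliberate. One thing to check: standalone WireMock loads stubs from a mappings folder inside --root-dir, so make sure the JSON files sit there. The Python mock is for getting started fast. The real WireMock is for when you want to scale this pattern across an actual service mesh.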

Next issue: I take this same setup and hand it to an AI agent. Spec only — no implementation hints. We see what it builds, what it gets wrong, and how the spec acts as a guardrail.

Sources & Further Reading

Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Nate B. Jones — natebjones.com
Cucumber + Gherkin documentation
WireMock documentation
pytest-bdd documentation
Session findings - Issue #2

This article was written with the assistance of AI tools.
