Diya Burman

Posted on Jun 10 • Edited on Jun 14 • Originally published at level5engineer.substack.com

The AI Built the Wrong Thing. Every Test Passed.

#ai #softwareengineering #agents #testing

A Level 5 Engineer — Issue #5

Preface

I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.

Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com

Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com — Watch the video

This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 of the five levels on a good day. The goal is Level 5, hence the title. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.

Every issue so far has assumed something I haven't said out loud: that the specs are good. Issue #2 wrote them carefully. Issue #3 handed them to an agent and watched it build correctly. Issue #4 proved the contracts survive provider drift.

But what happens when the spec isn't good? Not broken — Gherkin syntax is fine, tests pass, the agent builds something. Just imprecise. Vague in ways that feel precise when you're writing them.

This issue answers that question by doing the thing deliberately. I wrote bad Gherkin on purpose, handed it to the agent, watched what it built — and then rewrote the spec and did it again. The difference between the two implementations is the article.

The hardest thing about bad specs

Bad specs are hard to spot when you're writing them because they feel complete.

A scenario that references implementation details sounds like a reasonable description: you wrote the implementation, so the details feel like specifics. A Given clause that feels obvious to you will be interpreted differently by every reader who hasn't seen the code. The Gherkin is syntactically correct. The tests pass. Nothing in the output signals that anything is wrong.

This is the trap. It's not that bad specs break things. It's that they don't.

The endpoint

I added a new endpoint to the order-api project: GET /orders/{order_id}/status. It returns the current status of an order and relevant metadata. Simple enough that the spec should be easy to write well. Which makes it a good target for writing it badly on purpose.

The bad specs

Two scenarios. Both syntactically valid. Both produce passing tests. Both wrong in different ways.

# BAD SPEC 1 — The leaky spec
# Problem: references internal implementation concepts (db_status, order_created_at)
# rather than describing what a caller observes. The agent uses these names literally
# in the response body, leaking storage terminology into the public API contract.

Scenario: Retrieving status for a confirmed order
  Given an order exists in the system with db_status "CONFIRMED"
  When I request GET /orders/{order_id}/status
  Then the response should contain the db_status field set to "CONFIRMED"
  And the order_created_at field should be populated from the order record

# BAD SPEC 2 — The vague Given
# Problem: "an order that has not been placed" is underspecified. The agent must
# guess what this means — a malformed ID? A well-formed UUID with no record?
# A deleted order? Each interpretation is plausible and produces different behavior.

Scenario: Retrieving status for an order that does not exist
  Given an order that has not been placed
  When I request GET /orders/{order_id}/status
  Then the response should indicate the order was not found

Both passed immediately:

tests/steps/test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order PASSED
tests/steps/test_order_status_bad.py::test_retrieving_status_for_an_order_that_does_not_exist PASSED

2 passed in 0.34s

Green. No warnings. No hint that anything is wrong.

What the agent built from the bad specs

Here's the implementation the agent produced:

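# Module setup the excerpt relies on
from fastapi import FastAPI, HTTPException

app = FastAPI()
_orders: dict = {}  # in-memory order store, keyed by order id

@app.get("/orders/{order_id}/status")
def get_order_status(order_id: str):
    order = _orders.get(order_id)
    if order is None:
        # HTTPException serializes as {"detail": "Order not found"}
        raise HTTPException(status_code=404, detail="Order not found")
    # Field names are taken literally from the spec's wording
    return {
        "order_id": order_id,
        "db_status": order["db_status"],
        "order_created_at": order["order_created_at"],
    }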

It satisfies the spec completely. It also made four decisions the spec never made:

Decision 1: The field is named db_status in the response.
The spec said db_status so the agent used db_status. It never questioned whether this was an internal name leaking into a public API. It satisfied the spec literally.

Decision 2: A missing order returns 404.
The spec says "indicate the order was not found." 404 is a defensible interpretation. So are 422, 403, or a 200 with a NOT_FOUND status field. The agent picked the most conventional option, but the spec never mandated it. The body shape was never specified either: FastAPI's HTTPException serializes as {"detail": "Order not found"}, not {"error": "Order not found"}, so a client reading response.json()["error"] gets a KeyError.

Decision 3: The timestamp field is named order_created_at with no format requirement.
The spec says "populated from the order record." The agent chose order_created_at and returned an ISO string because that's what datetime.utcnow().isoformat() produces. The step definition checked only that the field is a non-empty string, so any string format would have passed. A Unix timestamp serialized as a string would have passed. A human-readable string like "June 2nd" would have passed.

Decision 4: The order store is in-memory.
The spec says nothing about persistence. An in-memory dict is the simplest thing that makes the tests pass. In production, orders are persisted. The in-memory store vanishes on restart and isn't shared across worker processes.

Every one of these decisions is plausible. The agent made the reasonable call every time. That's not the problem. The problem is that a different agent, given the same spec, might have made different reasonable calls — and both implementations would pass the same test suite.
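Decision 3 is the easiest of the four to see in code. The sketch below is a minimal illustration, assuming the bad spec's step did roughly what is described above (accept any non-empty string). The helper names weak_timestamp_check and strict_timestamp_check are hypothetical and do not come from the project. The stricter version leans on the standard library's datetime.fromisoformat, which accepts a narrower set of ISO 8601 forms on Python versions before 3.11.

```python
from datetime import datetime


def weak_timestamp_check(value) -> bool:
    """Hypothetical stand-in for the bad spec's step: any non-empty string passes."""
    return isinstance(value, str) and value != ""


def strict_timestamp_check(value) -> bool:
    """Passes only if the value is a string that parses as an ISO 8601 date-time."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


# Compare the two checks on an ISO string and on a human-readable string.
for value in ["2024-06-02T10:15:00", "June 2nd"]:
    weak = weak_timestamp_check(value)
    strict = strict_timestamp_check(value)
    print(f"{value!r}: weak={weak}, strict={strict}")
```

Only the strict check turns the timestamp format into something a test can actually fail on.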

The rewrite

Writing the good spec forced every decision the bad spec had silently delegated:

# GOOD SPEC 1 — Caller's perspective, not implementation's
# Fixed: field names describe what the caller observes (status, placed_at)
# not what the storage layer calls them (db_status, order_created_at).
# The format of placed_at is now a contract obligation, not an assumption.

Scenario: Confirmed order status is returned with placement timestamp
  Given a confirmed order with id "order-abc-123" exists in the system
  When I request GET /orders/order-abc-123/status
  Then the response status code is 200
  And the response body contains "order_id" equal to "order-abc-123"
  And the response body contains "status" equal to "CONFIRMED"
  And the response body contains "placed_at" as a valid ISO 8601 timestamp

# GOOD SPEC 2 — Precise Given, explicit 404 body shape
# Fixed: the Given names a concrete, well-formed order id with no corresponding record.
# The 404 response body shape is now a contract obligation, not a guess.

Scenario: Unknown order id returns 404 with error message
  Given no order with id "order-xyz-999" exists in the system
  When I request GET /orders/order-xyz-999/status
  Then the response status code is 404
  And the response body contains an "error" field

Notice what changed. The scenarios describe the same two situations. The intent is identical. But now every caller-visible decision (status code, field names, timestamp format, error body shape) is in the spec rather than in the agent's interpretation of the spec.

What the agent built from the good spec

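# Module setup the excerpt relies on
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
_orders: dict = {}  # in-memory order store, keyed by order id

@app.get("/orders/{order_id}/status")
def get_order_status(order_id: str):
    order = _orders.get(order_id)
    if order is None:
        # Explicit body shape: the spec requires an "error" field
        return JSONResponse(status_code=404, content={"error": "Order not found"})
    # Internal storage names are mapped to the public field names
    return {
        "order_id": order_id,
        "status": order["db_status"],
        "placed_at": order["order_created_at"],
    }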

Same endpoint. Same logic. Different API.

db_status became status. order_created_at became placed_at. The 404 body now contains error not detail. The timestamp is now asserted to be ISO 8601 — not just non-empty.

These are not cosmetic differences. They are different contracts that clients build against.
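To make "different contracts" concrete, here is a small sketch of a client written against the good-spec 404 body and then pointed at the bad-spec one. The read_error helper is hypothetical, invented for illustration, and the two response tuples are hand-written stand-ins for what each implementation returns rather than captured output.

```python
def read_error(status_code: int, body: dict) -> str:
    """Hypothetical client helper written against the good-spec contract."""
    if status_code == 404:
        return body["error"]  # the good spec guarantees this key exists
    return ""


# Stand-ins for the two implementations' 404 responses (status code, JSON body).
good_impl_404 = (404, {"error": "Order not found"})
bad_impl_404 = (404, {"detail": "Order not found"})  # HTTPException body shape

print(read_error(*good_impl_404))

try:
    read_error(*bad_impl_404)
except KeyError as exc:
    print(f"client broke on missing key: {exc}")
```

Same status code, same message text, and the client still fails, because the key it depends on was never part of the bad spec.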

The cross-run

After building from the good spec, I ran the bad-spec tests against the new implementation:

tests/steps/test_order_status_bad.py::test_retrieving_status_for_a_confirmed_order FAILED
tests/steps/test_order_status_bad.py::test_retrieving_status_for_an_order_that_does_not_exist PASSED

E   KeyError: 'db_status'

The leaky test failed. The field db_status doesn't exist in the good implementation — it's been renamed to status, which is what a caller should see. The test that was checking for an internal name is now broken, correctly.

The vague test passed. Both implementations return a 404 for a missing order. The bad implementation got there because the agent guessed; the good implementation got there because the spec said so.

That asymmetry is instructive. The vague Given produced the right answer by coincidence. The leaky Then produced the wrong field name by construction. One was luck. One was baked in.
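The cross-run can be reproduced in miniature without a web framework. This is a sketch under stated assumptions: bad_impl and good_impl are plain functions that mimic the two response shapes (order_id is left out for brevity), the four check functions are hypothetical reductions of the step definitions, and the vague step is assumed to check only the status code.

```python
ORDER = {"db_status": "CONFIRMED", "order_created_at": "2024-06-02T10:15:00"}


def bad_impl(order):
    """Mimics the bad-spec response shapes; returns (status code, body)."""
    if order is None:
        return 404, {"detail": "Order not found"}
    return 200, {"db_status": order["db_status"],
                 "order_created_at": order["order_created_at"]}


def good_impl(order):
    """Mimics the good-spec response shapes; returns (status code, body)."""
    if order is None:
        return 404, {"error": "Order not found"}
    return 200, {"status": order["db_status"],
                 "placed_at": order["order_created_at"]}


def bad_confirmed(impl):
    _, body = impl(ORDER)
    assert body["db_status"] == "CONFIRMED"  # leaky: asserts an internal name


def bad_missing(impl):
    status, _ = impl(None)
    assert status == 404  # assumption: the vague step only checked the code


def good_confirmed(impl):
    status, body = impl(ORDER)
    assert status == 200
    assert body["status"] == "CONFIRMED"
    assert "placed_at" in body


def good_missing(impl):
    status, body = impl(None)
    assert status == 404
    assert "error" in body


def run(check, impl) -> str:
    """Run one check against one implementation and report the outcome."""
    try:
        check(impl)
    except (AssertionError, KeyError) as exc:
        return f"FAILED ({type(exc).__name__})"
    return "PASSED"


# Full matrix: every check against every implementation.
for impl in (bad_impl, good_impl):
    for check in (bad_confirmed, bad_missing, good_confirmed, good_missing):
        print(f"{check.__name__} vs {impl.__name__}: {run(check, impl)}")
```

Each suite is green against its own implementation; the disagreements only appear in the off-diagonal cells of the matrix.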

Why this matters

Both implementations pass their own test suites. That is the trap.

If you run the bad-spec tests against the bad-spec implementation: green. If you run the good-spec tests against the good-spec implementation: green. The difference only surfaces when you cross-run — and in production, you never cross-run. You ship the bad implementation, it passes CI, and the problem lands in a client exception report six months later.

Here's the concrete difference: the bad-spec implementation returns db_status and order_created_at with no format guarantee. The good-spec implementation returns status and placed_at with a mandatory ISO 8601 format. An agent given the bad spec had no way to know that db_status was wrong — the spec said db_status. An agent given the good spec had no choice but to produce status — the spec said status.

Spec quality is not about whether tests pass. It is about how much of the implementation the spec author wrote versus how much was silently delegated to the agent. Every silent delegation is a place where two agents given the same spec produce different code: both versions pass, but they disagree on the contract.

At scale — dozens of endpoints, hundreds of scenarios — that disagreement is the system.

The practical test for a good spec

Before handing any scenario to an agent, ask one question: what decisions does this scenario leave open?

If the answer is "none — every field name, format, response code, and body shape is specified," the spec is ready. If the answer is "a few reasonable ones," those are the places where your implementation and the next agent's implementation will silently diverge.
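Part of that question can be asked mechanically. The sketch below is a toy heuristic, not a Gherkin parser and not something from the order-api project: the open_decisions function, the phrase list, and the db_ prefix rule are all assumptions chosen to match the bad scenarios in this issue. It will miss plenty, but it shows the shape of the check.

```python
import re

# Phrases lifted from the bad scenarios above; the list is illustrative only.
VAGUE_PHRASES = ("should indicate", "populated from", "has not been placed")


def open_decisions(scenario: str) -> list:
    """Toy heuristic: list decisions a scenario appears to leave to the agent."""
    findings = []
    if not re.search(r"status code is \d{3}", scenario):
        findings.append("status code not stated")
    for phrase in VAGUE_PHRASES:
        if phrase in scenario:
            findings.append(f"vague phrase: {phrase!r}")
    # Assumed convention: a db_ prefix marks a storage-layer name.
    for name in sorted(set(re.findall(r"\bdb_\w+", scenario))):
        findings.append(f"internal-looking name: {name}")
    return findings


BAD_SCENARIO = """
Given an order exists in the system with db_status "CONFIRMED"
When I request GET /orders/{order_id}/status
Then the response should contain the db_status field set to "CONFIRMED"
And the order_created_at field should be populated from the order record
"""

GOOD_SCENARIO = """
Given a confirmed order with id "order-abc-123" exists in the system
When I request GET /orders/order-abc-123/status
Then the response status code is 200
And the response body contains "order_id" equal to "order-abc-123"
And the response body contains "status" equal to "CONFIRMED"
And the response body contains "placed_at" as a valid ISO 8601 timestamp
"""

for label, text in (("bad", BAD_SCENARIO), ("good", GOOD_SCENARIO)):
    print(label, open_decisions(text))
```

An empty list does not prove a scenario is fully specified. A non-empty one is a cheap prompt to go back and make the decision yourself.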

The agent will always make reasonable decisions. That's not the problem. The problem is that reasonable is not the same as specified — and at Level 4, specified is the only thing that counts.

Next issue: Wiring the Guardrails — GitHub Actions, the Pact Broker, and the pipeline that turns contract violations into blocked merges automatically.

Sources & Further Reading