Diya Burman

Posted on Jun 10 • Edited on Jun 13 • Originally published at level5engineer.substack.com

The Agent Found What Code Review Missed.

#ai #softwareengineering #agents #testing

A Level 5 Engineer — Issue #3

I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.

Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com

Nate B. Jones: AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com

This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 of the five-level ladder on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.

If you've been following along, you know what we've built so far. Issue #1 introduced the five levels framework and the Dark Factory concept. Issue #2 got concrete — we wrote five Gherkin scenarios for an order management API before touching any implementation code, stubbed out two external dependencies with WireMock, and ran a real test suite against the whole thing.

At the end of Issue #2 I made a promise: hand the spec to an AI agent, spec only, no implementation hints, and see what it builds.

This is that issue.

The setup

The instruction I gave Claude Code at the start of the session was exactly this:

"The Gherkin scenarios in tests/features/order_creation.feature define the full behavioural contract for this API. Do not read the existing implementation in app/main.py. Build a fresh implementation that makes all 5 scenarios pass. Document your findings in FINDINGS.md as you go."

That's it. No architecture hints. No "use FastAPI." No "here's how the mock servers work." Just the spec and a documentation instruction.

The CLAUDE.md in the repo handled the rest — the guardrails, the project context, the constraint that the .feature files cannot be touched, and the format the FINDINGS.md should follow. If you missed the deep dive on CLAUDE.md in Issue #2, that file is essentially the agent's standing orders. It reads it at the start of every session.

Then I sat back and watched.

What the agent derived from the spec alone

Here's what I found interesting. Before writing a single line of code, the agent read the Gherkin scenarios and derived the entire API contract from them. Unprompted. It produced this:

POST /inventory/check/{inventory_scenario}
  → all available      → POST /payments/charge/{payment_scenario}
  → partial available  → return 207 PARTIAL_UNAVAILABLE (no charge)
  → all out of stock   → return 409 UNAVAILABLE (no charge)

And the full response shape for all five scenarios:

Scenario	status	status_code	Key fields
Success	CONFIRMED	—	`order_id`
Payment declined	PAYMENT_FAILED	402	`decline_reason`, `inventory_released: true`
Out of stock	UNAVAILABLE	409	`unavailable_items`
Partial stock	PARTIAL_UNAVAILABLE	207	`available_items`, `unavailable_items`
Payment timeout	PAYMENT_PENDING	202	`inventory_hold_minutes: 15`, `retry_count`

This is exactly right. The agent read five plain-language scenarios and extracted a precise technical contract — the order of operations, the response codes, the body fields, the retry behaviour — without being told any of it explicitly.
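For readers who think better in code, here is a minimal sketch of that derived flow. It is my own illustration, not the agent's implementation: the enum names, `resolve_order`, and the `charge` callable are all hypothetical, and only the status strings and codes come from the table above.

```python
from enum import Enum


class Inventory(Enum):
    ALL_AVAILABLE = "all_available"
    PARTIAL = "partial"
    OUT_OF_STOCK = "out_of_stock"


class Payment(Enum):
    APPROVED = "approved"
    DECLINED = "declined"
    TIMEOUT = "timeout"


def resolve_order(inventory, charge):
    """Return (status, status_code) following the derived order of operations.

    `charge` is a hypothetical zero-argument callable standing in for the
    payment call. It is only invoked when every item is available, which is
    the "no charge" rule from the flow.
    """
    if inventory is Inventory.OUT_OF_STOCK:
        return ("UNAVAILABLE", 409)
    if inventory is Inventory.PARTIAL:
        return ("PARTIAL_UNAVAILABLE", 207)

    outcome = charge()
    if outcome is Payment.DECLINED:
        return ("PAYMENT_FAILED", 402)
    if outcome is Payment.TIMEOUT:
        return ("PAYMENT_PENDING", 202)
    # Success carries no status_code in the table, modelled here as None.
    return ("CONFIRMED", None)
```

The point of the sketch is the ordering: inventory is checked first, and the payment call is never reached for the two stock-failure cases.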

That's not nothing. That's the spec doing its job.

Where it got interesting — the timeout scenario

Scenario 5 is the one I was most curious about. Timeout behaviour is notoriously hard to test and easy to get wrong. The agent worked through it carefully and documented its reasoning:

PAYMENT_TIMEOUT_SECONDS=5 — per-attempt HTTP client timeout
MAX_PAYMENT_RETRIES=2 — total attempt cap, not a retry count on top of the first attempt
Worst-case wall time with 2 attempts at 5 seconds each: 10 seconds — comfortably inside the 12-second contract from the scenario
The WireMock timeout stub uses fixedDelayMilliseconds: 6000 — deliberately longer than the client timeout so the client always times out before the mock responds

That last detail is subtle and correct. If the mock delay were shorter than the client timeout, the test would be testing the wrong thing — the mock responding slowly rather than the client giving up. The agent caught this without being prompted. It's in the FINDINGS.md.
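The timing reasoning can be written down as a couple of checks plus a retry loop. This is a rough sketch under stated assumptions, not the code the agent wrote: `charge_with_retries` and `attempt_charge` are hypothetical names, and `TimeoutError` stands in for whatever timeout exception the real HTTP client raises. The four constants are the values quoted from the findings and the scenario.

```python
PAYMENT_TIMEOUT_SECONDS = 5   # per-attempt client timeout
MAX_PAYMENT_RETRIES = 2       # total attempts, not retries after the first
MOCK_DELAY_MS = 6000          # WireMock fixedDelayMilliseconds on the timeout stub
CONTRACT_SECONDS = 12         # overall bound taken from the scenario


def worst_case_seconds(timeout_s, max_attempts):
    """Upper bound on wall time if every attempt runs to its timeout."""
    return timeout_s * max_attempts


# Every attempt timing out must still fit inside the contract.
assert worst_case_seconds(PAYMENT_TIMEOUT_SECONDS, MAX_PAYMENT_RETRIES) <= CONTRACT_SECONDS
# The mock must respond later than the client is willing to wait.
assert MOCK_DELAY_MS > PAYMENT_TIMEOUT_SECONDS * 1000


def charge_with_retries(attempt_charge,
                        max_attempts=MAX_PAYMENT_RETRIES,
                        timeout_s=PAYMENT_TIMEOUT_SECONDS):
    """Call attempt_charge(timeout_s) at most max_attempts times.

    Returns (result, attempts_made). result is None when every attempt
    timed out, which is the case the PAYMENT_PENDING scenario covers.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_charge(timeout_s), attempt
        except TimeoutError:
            continue
    return None, max_attempts
```

Keeping the two invariants as assertions next to the constants means a later change to the timeout or the stub delay fails loudly instead of quietly testing the wrong thing.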

The bug it found that I had written

This is my favourite part of this issue.

The original test setup — the code I pointed Claude Code at — had a hard-coded path:

sys.path.insert(0, "/home/claude/order-api")

On my machine this would silently start mock servers with no stubs loaded. Every payment call would return a 404. Every inventory call would return a 404. The tests would fail in ways that looked like logic errors rather than a configuration problem.

The agent caught it, diagnosed the root cause, and fixed it:

import sys
from pathlib import Path

# Before: hard-coded, breaks on any machine but the original
sys.path.insert(0, "/home/claude/order-api")

# After: computed from this file's location, so it follows the repo wherever it is checked out
PROJECT_ROOT = Path(__file__).parent.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

To be clear: this bug was in my code. Code I had written and shipped to the repo. The agent found it during implementation because it was trying to run the tests in a different environment and they failed in a way that forced the diagnosis.

This is a thing that happens at Level 4 that doesn't happen at Level 2. When you're implementing yourself, you don't notice the hard-coded paths because everything works on your machine. When an agent implements in a clean environment, your assumptions get exposed immediately.
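One way to make this class of problem show up as a configuration error rather than a wall of 404s is a fail-fast check before the mock servers start. The sketch below is an assumption on my part, not something from the repo: `require_stub_files` is a hypothetical helper, and it assumes the stubs live as `*.json` files in a single directory.

```python
from pathlib import Path


def require_stub_files(mappings_dir):
    """Return the sorted stub files in mappings_dir, or raise with a clear message.

    Hypothetical guard: meant to run before the mock servers are started,
    so a wrong path fails immediately instead of serving 404s.
    """
    mappings_dir = Path(mappings_dir)
    if not mappings_dir.is_dir():
        raise FileNotFoundError(f"Stub directory not found: {mappings_dir}")

    stubs = sorted(mappings_dir.glob("*.json"))
    if not stubs:
        raise FileNotFoundError(f"No stub files (*.json) in {mappings_dir}")
    return stubs
```

With a guard like this, a bad path produces one obvious error at startup, which is much easier to diagnose than five scenarios failing for what look like logic reasons.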

My honest reaction

I'll be transparent about something. This API isn't complex. It's an order endpoint with two downstream dependencies and five scenarios. I didn't expect the agent to struggle with it, and it didn't. It hit errors, diagnosed them promptly, and moved on. Five scenarios, all passing.

What struck me wasn't the capability — it was the texture of the experience.

Watching Claude Code work, I found myself doing something I don't usually do when I'm implementing: I was evaluating. Not writing, not debugging, not context-switching. Just reading the agent's reasoning and deciding whether I agreed with it. That's a different cognitive posture entirely. It felt closer to a code review than a coding session.

I also noticed I spent the entire session approving individual commands — every file edit, every pytest run, every pip install. Claude Code asks for permission before each action by default. For this first session I let it. From the next task onward I'm going to configure it to run basic commands without checking in every thirty seconds. There's a trust-building curve here, and I'm on the early part of it.

What this proves — and what it doesn't

Five passing scenarios on a moderately simple API is not proof that Level 5 is solved. It's proof that the approach works at this scale and this complexity.

The honest question — the one this newsletter is actually tracking — is whether it holds as the system grows. Pact tests across services. CI/CD pipelines. Evals as guardrails. Contextual stewardship documents for systems with years of history and undocumented decisions baked into the architecture.

That's where the real test is. And that's where we're going next.

What I'd do differently

One thing the exercise exposed: the spec was good enough for the agent to build correctly, but I had one implicit assumption that didn't make it into the scenarios. The response shape for the success case doesn't specify that status_code should be absent — it just checks for order_id. The agent inferred this correctly, but if it hadn't, the test would have passed anyway.

That's a gap in the spec, not a gap in the agent. The lesson is the same one from Issue #2: every implicit assumption is a decision waiting to cause a bug in production. Write it down. Make it a scenario. Make the machine prove it.
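Here is roughly what closing that gap could look like on the assertion side. Treat it as a sketch: `check_success_response` is a hypothetical helper rather than a step definition from the repo, and it assumes the response body has already been parsed into a dict.

```python
def check_success_response(body):
    """Validate the success-case body, including the previously implicit rule."""
    assert body.get("status") == "CONFIRMED", "status must be CONFIRMED"
    assert "order_id" in body, "order_id must be present"
    # The implicit assumption, now explicit: success carries no status_code.
    assert "status_code" not in body, "status_code must be absent on success"
```

A body like `{"status": "CONFIRMED", "order_id": "abc", "status_code": 200}` passes a check that only looks for `order_id`, but fails this one.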

Next issue: Phase 3 — adding Pact contract testing between the order service and its dependencies. What happens when the service contract and the mock stub disagree?

Sources & Further Reading