I run an AI-engineering research lab that studies what it actually takes to work with Claude Code on hard technical surfaces, not from Claude Code. Two surfaces run in parallel: a learning protocol where Claude Opus is the coaching partner, and a QA-automation pipeline where Claude Code + MCP ship sprint reporting, Jira pulls, and Slack digests on a real work loop. Both surfaces stress-test the same operator pattern: spec-first, sub-agent orchestration, eval on agent output, foundational-fluency check.
The operator pattern is codified in a public .claude/ framework — 15 rule files (four carry WHY + Retire-when audit metadata so rules decay cleanly as models improve), 21 skills, concentric-loop pedagogy pinned per-node to named practitioners. Repo: github.com/aman-bhandari/claude-code-agent-skills-framework.
The lab protocol is the more unusual surface because I deliberately run the failure modes a shallow user would produce — random questions, tool-reflex over understanding, accepting an exercise without loading the mechanism into memory — and verify the protocol catches them. If it catches the failure, it earns its keep. If it doesn't, the protocol changes.
A recent session caught a failure mode worth writing up.
## The diagnostic
The lab had worked through a block of Python execution-model material — namespaces as dicts, closures as cells, the CPython compile pipeline, mutation semantics. The coaching protocol says: before building anything on top of that material, run a cold retention grill. No scrollback, no retry, 8 minutes per concept, two questions each — one mechanism, one "what breaks."
I ran the grill on 19 concepts. Six came back clean. Three were outright BREAKs (quality 0-1). Three were PARTIAL. The full distribution is in the session log; the point for this post is the shape of the failures, not the numbers.
## The diagnosis I got wrong
My first read on the data was: these concepts are threads I haven't connected to each other. The learning has no brain because it has no interconnection.
That framing feels intuitive. The partner rejected it on the spot.
The actual shape: the recent sessions went deep on the execution model (how Python runs). The breaks were on stdlib mechanism (dict internals, list memory layout, string interning, JSON deserialization). Two different layers of the stack, not two disconnected threads. The edges between them already exist in the graph. I just hadn't walked them with code in my hands.
"Threads not connected" is the wrong diagnosis because it implies the fix is more abstract thinking. "Layers not walked" is the right diagnosis because it implies the fix is more code.
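What "walking the layer with code in my hands" looks like in practice is small and cheap — the stdlib tools the grill exposed (dict internals, string interning) can be observed directly in a REPL rather than recalled. A minimal sketch, using only `sys` (the specific observations are mine, not the session log's):

```python
import sys

# Walk the dict-internals layer: CPython resizes the hash table in
# jumps, not per-item, and sys.getsizeof makes the jumps visible.
d = {}
sizes = []
for i in range(16):
    d[i] = i
    sizes.append(sys.getsizeof(d))
print(sizes)  # runs of equal sizes, then a jump at each resize

# Walk the string-interning layer: identifier-like literals in the
# same code object are interned at compile time; runtime-built
# strings of the same value are distinct objects (in CPython).
a = "hello"
b = "hello"
c = "".join(["he", "llo"])
print(a is b)           # literals share one interned object
print(a == c, a is c)   # equal value, separate object
```

The point of the exercise is not the specific numbers, which vary by CPython version — it is that the answer comes from the interpreter, not from memory.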
## The failure reflex this was hiding
The cold-grill data was cheap to produce. The expensive finding was what I said after seeing the results:
> If I just start doing exercises now, I will look up the solution from here and there, complete the exercise, move to next. This is the exact moment when I fail every time.
I named my own failure reflex before the partner named it for me. In agentic-engineering terms, this is the Claude-Operator failure mode: accepting the tool's output without reading it carefully, shipping the exercise without loading the mechanism into memory, and calling it progress. It is exactly the pattern Karpathy flagged when he reframed "vibe coding" earlier this year — fine for throwaway work, a skill-atrophy risk for anything you are supposed to own.
The protocol's job at this moment is to stop me from pattern-matching my way through the next exercise. Not by withholding help. By changing the shape of the work.
## The pivot
The fix has a name: build from scratch to understand. It is not mine — it is Karpathy's nanoGPT / micrograd / nanochat thesis applied to stdlib. Before trusting the 40,000-line version, build the 100-line version with your own hands.
I committed to it as a mandate at the end of the session. Seven build exercises, dependency-ordered: MyDict → MyList → MyIterator → MyDecorator → MyContextManager → MyLogger → MyJSONParser. Each has a one-sentence scope, a RED test file committed first, a WORKSPACE.md with five pre-build prediction questions, and a TOOL-IN-HAND.md with ten specific observations (using dis, sys.getsizeof, time.perf_counter, mypy --strict) that the build has to produce evidence for.
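To make the "100-line version" concrete: the post fixes only the exercise name MyDict; everything below — separate chaining, the resize threshold, the method surface — is my own sketch of what such a build target might look like, not the mandate's actual scaffold.

```python
# A deliberately tiny stand-in for CPython's 40,000-line dict:
# a separate-chaining hash table in ~30 lines.

class MyDict:
    def __init__(self, nbuckets=8):
        self._buckets = [[] for _ in range(nbuckets)]
        self._len = 0

    def _bucket(self, key):
        # Hash the key into one of the buckets.
        return self._buckets[hash(key) % len(self._buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing entry
                return
        bucket.append((key, value))
        self._len += 1
        if self._len > 2 * len(self._buckets):
            self._resize()  # keep chains short as the table fills

    def __getitem__(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def __len__(self):
        return self._len

    def _resize(self):
        old = [pair for bucket in self._buckets for pair in bucket]
        self._buckets = [[] for _ in range(2 * len(self._buckets))]
        for k, v in old:
            self._bucket(k).append((k, v))


d = MyDict()
for i in range(100):
    d[str(i)] = i
assert len(d) == 100 and d["42"] == 42
```

Building even this toy forces the questions the cold grill exposed: why resize at a load factor, what a hash collision costs, why key equality and key hashing are separate contracts.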
The operational constraint I added:
> While building, we will go deeper into OS, networking, computer architecture, mathematics, hardware. But every time there should be a tool in this engineer's hands.
That constraint is what separates build-from-scratch from "build and read the code." A tool in hand — strace, dis, perf_counter, the REPL — is what forces the mechanism into memory. Without it, the build becomes another shallow exercise and the reflex wins.
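As one example of the tool-in-hand move applied to earlier session material ("closures as cells"): instead of recalling the claim, make the interpreter exhibit the cell. The function names are mine; `dis` and `__closure__` are standard.

```python
import dis

def make_counter():
    n = 0
    def bump():
        nonlocal n   # forces n into a closure cell shared with make_counter
        n += 1
        return n
    return bump

bump = make_counter()

# The closure is a tuple of cell objects; the captured n lives inside one.
print(bump.__closure__)
print(bump.__closure__[0].cell_contents)  # 0 before any call

bump()
print(bump.__closure__[0].cell_contents)  # 1 after one call

# The bytecode marks cell access with the *_DEREF opcodes.
dis.dis(bump)
```

Seeing `LOAD_DEREF` / `STORE_DEREF` in the disassembly is the evidence the TOOL-IN-HAND.md file asks the build to produce — an observation, not an assertion.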
## The rule I am building the protocol to catch
Inside the session, the pattern "let's plan now and build next time" landed twice. Both times it was the same failure reflex wearing executive-function clothing: deferring the uncomfortable part (code that fails in the REPL) behind the comfortable part (more plan documents). The partner named it explicitly. I locked a three-condition mandate — scoped plan file, RED scaffold committed in the same session, the next session opens with code, not plan revision.
The plan file itself carries a retire-when clause with a Bransford transfer test (cold cross-exercise question, no scrollback). The plan prescribes how to measure its own success and when to stop running it. That is the difference between a plan and a falsifiable instrument.
If planning-as-avoidance lands a third time, I codify it as planning-build-ratio.md with this session's date as the triggering incident. That is the rule-obsolescence audit pattern I wrote about in the previous post — every rule carries a WHY (what default behavior it corrects) and a Retire-when (the observable condition under which it is no longer needed). The protocol decays cleanly or it does not earn its presence.
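For readers who have not seen the earlier post: the shape such a rule file might take is roughly the following. Only the filename and the two metadata fields (WHY, Retire-when) are fixed by this post; the body text is an illustrative reconstruction, not the actual rule.

```markdown
<!-- planning-build-ratio.md (hypothetical contents) -->

## WHY
Corrects the default of deferring failing code behind another plan
document. Triggering incident: the session dated in this post.

## Rule
A planning pass may not end without a RED scaffold committed in the
same session. The next session opens with code, not plan revision.

## Retire-when
Three consecutive build sessions open with code unprompted, and a cold
cross-exercise transfer question passes with no scrollback.
```

The Retire-when clause is what keeps the rule set from accreting: the rule names the observable condition under which it deletes itself.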
## What this changed in the system
Five things moved after the diagnosis:
- The spaced-review deck grew from 17 cards to 28 — 11 load-bearing nodes from recent sessions were added. A gap I only caught because I asked, mid-grill, whether the deck had been updated. Silent rot in a retention system is what happens when nobody does the boring update.
- The pedagogy mode flipped from theory-heavy / exercise-light to Karpathy build-from-scratch. Not codified as a rule yet — observation window is 2-3 sessions.
- The tool-in-hand constraint got surfaced in my own words, independently of the systems-thinking rule that already prescribes it. Walking into a rule from the other side is Bransford transfer evidence. The concentric loop closed.
- The Claude-Operator failure reflex surfaced in my own words. That gets it from the rule file into my cache.
- The next session opens with code on `mydict.py`. Not plan revision. Not "one more question first." If that contract breaks, the pattern has won and it gets named out loud.
## The same pattern on the work surface
The QA-automation surface is where the same operator pattern ships against a production-shaped workload. claude-code-mcp-qa-automation is 16 Claude Code skills plus a Python implementation: 8 modules, a 7-table SQLite trending store, flag-gated config-driven execution, and a sub-agent fan-out coordinator using ThreadPoolExecutor as a structurally-identical stand-in for Claude Code's own Agent tool. The pipeline runs a full-loop demo end-to-end — two fixture boards, inline-CSS self-contained HTML reports, byte-identical output under the same flags. Repo: github.com/aman-bhandari/claude-code-mcp-qa-automation.
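The structural analogy between `ThreadPoolExecutor` and agent fan-out is worth seeing in code. A minimal sketch — the task shape, board names, and error policy below are my own assumptions, not the repo's actual coordinator API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_board_report(board: str) -> dict:
    # Stand-in for one sub-agent's unit of work (a Jira pull,
    # a digest build, etc.).
    return {"board": board, "status": "ok"}

def fan_out(boards: list[str], max_workers: int = 4) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_board_report, b): b for b in boards}
        for fut in as_completed(futures):
            board = futures[fut]
            try:
                results.append(fut.result())
            except Exception as exc:
                # One board failing must not sink the whole run.
                results.append({"board": board, "status": f"error: {exc}"})
    # Sort so output stays deterministic regardless of completion order --
    # the same property that makes byte-identical reports possible.
    return sorted(results, key=lambda r: r["board"])

print(fan_out(["BOARD-A", "BOARD-B"]))
```

The coordinator's two obligations — isolate per-task failure, then restore a deterministic order — are exactly the obligations a Claude Code Agent-tool fan-out has, which is what makes the stand-in structurally honest.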
Two details worth calling out because they translate directly to any Claude-Code-plus-MCP operator surface:
- **Skills as invocation contracts, not code.** The 16 skill files under `.claude/skills/` are pure markdown. Each one names its inputs, the work it delegates, and the failure modes it distinguishes. The Python implementation can be swapped without touching the skill surface. This is how you keep review authority — the contracts are what a human reads; the implementation is what gets re-typed by an agent under the contract.
- **Flag-gated, config-driven execution.** Every behavior that could be on or off lives in `config/flags.yaml` with global + board-scoped overrides. No inline `if FEATURE_FOO:` toggles in the Python. Regression debugging starts with flipping a flag and re-running, not a code spelunk.
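The global-plus-override shape described above might look like this — the key names are illustrative, not the repo's actual flags; only the filename and the global/board-scoped structure come from the post:

```yaml
# config/flags.yaml (illustrative keys)
global:
  trend_store: true
  slack_digest: false

boards:
  BOARD-A:
    slack_digest: true   # board-scoped override beats the global default
```

Resolution is then a single merge — board scope over global — so every behavioral difference between two runs is diffable as config, not hunted through code.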
Different surfaces, same operator move: specs human-owned, execution agent-owned, every behavior falsifiable.
## Why I am writing this up
The sessions are private. The protocol is public. The value of a coaching protocol is whether it produces this kind of mid-session diagnosis — not whether it feels good in the moment. The cold-grill results would have been a morale hit if read as "six concepts failed." Read as "the retention test revealed a layer the build was about to paper over," they are a systems signal.
If you are running Claude Code as an operator on anything hard — coaching surface, QA surface, reporting surface, any surface — the questions worth asking are: what failure modes am I simulating on purpose, and what protocol is catching them? If the answer to either is nothing, the partnership is still in vibe-coding mode. The fix is not less AI. It is a build-from-scratch constraint with a tool in your hand and a retire-when clause on every rule.
Aman Bhandari. Operator of an AI-engineering research lab running Claude Opus as the coaching partner, plus a QA-automation surface shipping against a real sprint workload. Public artifacts: claude-code-agent-skills-framework and claude-code-mcp-qa-automation. github.com/aman-bhandari.