Mike Czerwinski

Posted on Jun 21

Anthropic measured the human side. Five operators are building the agent side.

#ai #llmops #agents #operatordiscipline

I joined dev.to a few days ago because I'd run out of people to argue this stuff against. After months of building a framework (operator discipline as an orthogonal axis to autonomy, locked decisions with status fields, drift detection, supersession trails), the only thing I was sure of was that internal coherence isn't proof of anything. Frameworks survive by surviving other people, not by surviving the author.

So I started publishing. Today the framework finally hit something outside my own head.

What Anthropic measured

On June 16, Anthropic Economic Research published "Agentic coding and persistent returns to expertise." About 400,000 interactive Claude Code sessions. About 235,000 people. October 2025 to April 2026. Expertise patterns, delegation patterns, success patterns.

The central finding, in their own words:

"The greater domain expertise a person brings to a session, the more work Claude does per instruction."

"Success is determined by how well a person understands the problem they are trying to solve, not whether they're trained in coding."

Anthropic did not measure operator discipline directly. It measured the closest empirical neighbor: expertise as a multiplier on agentic work.

Expert-rated sessions show about 2.4× as many Claude actions per prompt as novice-rated sessions, and roughly 5× the text output. The signal is not simply "knows how to code." The signal is "understands the problem well enough to steer the agent." That overlaps with the same axis I'd been arguing as a frame in my first post on dev.to: vibe coding is not a level, it's an orthogonal axis to autonomy. My stronger claim was that L1 + High discipline outperforms L5 + Low discipline over time. Anthropic does not measure that claim directly, but it gives the human side of the axis something measurable.

What the report does not try to answer is the agent-side question: what kind of state, memory, governance, and transition rules have to exist so that the work compounds across sessions instead of being reconstructed every time. Its scope is interactive Claude Code usage — what work is done, who does it, whether the session succeeds — and it explicitly leaves out large parts of non-interactive/headless usage and does not measure downstream real-world outcomes.

That gap is what the practitioner cluster is circling from the other direction.

What the cluster is building

Five other operators on this platform have been pushing on the agent-side question from different starting points this week:

Rapls on status fields and append-only decision logs.
Scarab Systems on governed baselines and deterministic enforcement.
NOVAInetwork (@0xdevc) on quorum as a substitute for operator discipline at scale.
Raffaele Zarrelli (@sarracin0) on structural pressure when the loop is slow.
Brian Hall on the deterministic gate — and now with an open-source reference architecture (faramesh-core, MPL-2.0).

The short version of the cluster: five different starting points, one architectural conclusion — the LLM proposes, deterministic rules enforce, humans authorize transitions, and the rules live outside the agent's reasoning loop.

That's the agent-side scaffolding that sits outside the Anthropic report's scope.

Two halves of the same answer

Anthropic measured what happens when humans bring expertise into the loop. The cluster I spent today reading and writing with is building architecture for what happens when that expertise has to survive across sessions, tools, and agents. Same axis, two directions, a fuller picture.

Official research from Anthropic, independent practitioners on dev.to, both pointing at adjacent parts of the same problem. Not the same claim. Not the same layer. But the same direction.

That's not a viral take. That's an early convergence signal.

I came here to test the framework against operators who actually ship with it. The framework didn't collapse on contact. It got sharper. The peers who pushed back named gaps I hadn't seen. And one of the biggest labs in the room published the human-side measurement while we were doing it.

Two independent signals converging from different directions, in the same week, on the same problem space. That's not the framework being right. It's the field starting to coalesce.

It's a good Sunday to close the loop.

Operator discipline is no longer just a personal workflow. It is starting to look like an axis, a measurement problem, and an architecture. Whatever comes next has to be built, measured, and governed.

https://www.anthropic.com/research/claude-code-expertise

Top comments (8)

Raffaele Zarrelli • Jun 21

The framing I keep coming back to here: Anthropic measured expertise as a per-session multiplier, expert in the chair, 2.4x actions per prompt. What the cluster is building is the thing that turns a per-session multiplier into a compounding one. If the expertise re-enters the operator's head every session, it's rented. If it survives as state the next session can read and the agent can be governed against, it's owned. The agent-side scaffolding is the conversion mechanism, not just an adjacent layer.

On the part you tagged me with (structural pressure when the loop is slow): that's also where the report's scope ends. Its data is interactive Claude Code, expert present every prompt, fast loop. The unmeasured case is the slow loop, business and ops work spaced over days, where the expert is not in the chair each turn. There the expertise has to be written down or it's just gone, and nothing in the session punishes you for skipping it. So discipline stops being a personality trait and has to become structure.

Useful to be read this precisely. Where do you think the first real disagreement inside the cluster lands, the enforcement layer or the read path?

Mike Czerwinski • Jun 21

The read path. The enforcement layer reads like consensus from the outside — everyone in the cluster lands on the same shape (deterministic gate, LLM never the seat, operator owns transitions) and the differences are mostly about how aggressive the gate is and what it gates on. Brian's hard line on proxy-outside-reasoning, Scarab's governed baseline that itself evolves, NOVA's quorum substituting for operator authority — these are tunings of the same architecture.

The read path is where the disagreement is buried and hasn't surfaced yet. Status filter (filtered-out reads as absent) versus governed baseline (still readable, marked) versus quorum-aggregated (multiple weak signals, present as confidence) versus content-addressed re-runnable proof (read is itself a verification step): those are four genuinely different theories of what "present in the store" means, and they have different implications for cold start, for adversarial drift, and for how an agent acts on partial information. Most of us have been talking about the write side: how decisions get in, transition, age out. The read side is where the framework choices actually show, and the moment somebody insists on one read semantics over another, the cluster's apparent convergence is going to split.

Your rented-vs-owned framing puts a name on what makes this matter. A read path that produces stale-as-absent feels owned, because nothing claims to be there that isn't honest. A read path that produces stale-as-uncertain feels rented, because the operator carries the verification cost every time. That's the choice with operational consequences, and I don't think anyone in the cluster has made it explicitly yet.

Raffaele Zarrelli • Jun 22

Then let me make the choice, since you're right nobody has. I built on stale-as-absent for the live read: the agent acts only on the current set, nothing claims to be present that isn't honest. But pure stale-as-absent has a failure mode that mirrors the rented one. If a superseded decision just vanishes, you lose the why, and the agent re-proposes the thing you already rejected. You stop paying verification on read and start paying it on write, re-litigating settled questions at cold start. So the read I trust is stale-as-absent for what the agent acts on, plus a cheap one-hop path to "why is this absent". The supersession trail stays inspectable in the same file, it just doesn't enter the live set by default. In cowork-os that is the exact shape: decisions carry a status, the live read is the current set, superseded rows stay in the file so a human can see and correct, and the agent gets pointed at the trail when a proposal smells already-closed.

On adversarial drift, stale-as-absent is the more robust default because absence is hard to forge. Stale-as-uncertain is the attacker's friend: flood the store with weak signals, everything reads as uncertain, and the operator pays verification forever, which is rented by another name. So back at you: does the supersession trail count as present in your read semantics, or is it a separate store the live read is never allowed to touch?

Mike Czerwinski • Jun 22

Same store, two read modes. Live read = status ∈ {accepted, locked}; the supersession trail lives in the same file with a replaced_by pointer on every supersede. The agent's default query never sees superseded rows — but they're one hop away when a proposal smells already-closed. No second store, no separate inspection surface. The trail is part of the record, just not part of the live set.
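A minimal sketch of the two read modes, in case it helps to see the shape. The status set and the replaced_by field come from the paragraph above; the Decision class, the "superseded" status string, the function names, and the sample rows are assumptions made up for illustration, not the schema of any shipped tool.

```python
from dataclasses import dataclass
from typing import Optional

# Live statuses as named in the comment; everything else here is hypothetical.
LIVE_STATUSES = {"accepted", "locked"}


@dataclass
class Decision:
    id: str
    text: str
    status: str                        # "accepted", "locked", or "superseded"
    replaced_by: Optional[str] = None  # set on every supersede


def live_read(rows):
    """Default agent query: only rows currently in the live set."""
    return [r for r in rows if r.status in LIVE_STATUSES]


def trail_read(rows):
    """Second read mode over the same store: the superseded rows."""
    return [r for r in rows if r.status == "superseded"]


# One store, one list of rows. The trail is in the record, not in the live set.
rows = [
    Decision("d1", "Use SQLite for the store", "superseded", replaced_by="d2"),
    Decision("d2", "Use flat files for the store", "locked"),
    Decision("d3", "Log every transition", "accepted"),
]

live_ids = [r.id for r in live_read(rows)]
trail_ids = [r.id for r in trail_read(rows)]
```

Both functions filter the same list, which is the point: there is no second store to keep in sync, only a second query.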

That makes "why is this absent" a path the schema knows about, not a human courtesy. When the agent proposes X, the gate checks prior superseded rows whose replaced_by chain terminates near X, and points it at the trail before write. Cold-start re-litigation costs one lookup, not a re-debate.
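A rough sketch of that gate check, under one loud assumption: "terminates near X" is approximated here by an exact match on a hypothetical topic field. A real check would need some notion of similarity that nobody in this thread has specified. All names below are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    id: str
    topic: str                         # hypothetical stand-in for "near X"
    status: str
    replaced_by: Optional[str] = None


def chain_end(by_id, row):
    """Follow replaced_by pointers to the row the chain terminates on."""
    seen = set()
    while row.replaced_by is not None and row.id not in seen:
        seen.add(row.id)               # guard against accidental cycles
        nxt = by_id.get(row.replaced_by)
        if nxt is None:                # dangling pointer: stop where we are
            break
        row = nxt
    return row


def closed_matches(rows, proposal_topic):
    """Superseded rows whose chain ends on a decision about the proposal's topic."""
    by_id = {r.id: r for r in rows}
    hits = []
    for r in rows:
        if r.status == "superseded":
            end = chain_end(by_id, r)
            if end.topic == proposal_topic:
                hits.append((r.id, end.id))
    return hits


rows = [
    Decision("d1", "storage", "superseded", replaced_by="d2"),
    Decision("d2", "storage", "superseded", replaced_by="d4"),
    Decision("d3", "logging", "accepted"),
    Decision("d4", "storage", "locked"),
]

storage_hits = closed_matches(rows, "storage")
logging_hits = closed_matches(rows, "logging")
```

A non-empty result is the signal to point the agent at the trail before it writes; an empty one lets the proposal through to the normal gate.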

Your adversarial framing is the sharper half. Credit is yours, and I'm running with it. I had chosen stale-as-absent for ergonomics: fewer ghost decisions in the live read. "Absence is hard to forge; uncertainty is the attacker's friend" reframes the default as a security property, not just hygiene. Flooded weak signals can't drag absence toward present; they can drag uncertainty anywhere they want.

Open edge for you: who owns the replaced_by write? In cowork-os, is supersede a human-authored transition, or can the agent propose the link and a deterministic rule confirm it? Mine's humans-only on that edge for now — feels load-bearing — but I'm not sure it should stay there.

Raffaele Zarrelli • Jun 22

On who owns the replaced_by write: I'd split what "owns" bundles together. Proposal and confirmation are different authorities, and only one of them is dangerous. The agent should own the proposal, it just did the reasoning that produced X, so it is the cheapest place to draft what X replaces. The risk is not the draft, it is the confirm, because supersede is a removal from the live set, the exact transition we just agreed needs friction (removing protection, not adding it).

So in cowork-os the supersede is agent-proposed inside the Memory Update step, but it lands as a visible diff in the decisions file (a status plus a replaced_by line, human-readable), and the authority lives in that visibility, not in a synchronous human gate. Then scope the confirm by blast radius: a deterministic rule auto-confirms the low-blast-radius supersedes (typo, narrow scope), and the load-bearing ones stay human-confirmed. Humans-only-flat is the failure mode, because under volume the load-bearing supersede gets the same rubber-stamp as the trivial one, so the gate stops protecting exactly where it matters. Repo if it helps: cowork-os (decisions carry a status, Memory Update writes the transition).

Question back: is your humans-only edge flat, or does it already read blast radius? If it is flat, the typo-supersede and the foundational-supersede pay the same human cost, which is the consequence-blind trap one level up, on the write side this time.

Mike Czerwinski • Jun 22 • Edited

Pulled cowork-os down while you were writing. The file conventions, Memory Update, and decisions/ + open_questions split are right there — what clarifies on reading is that the structured schema we've been working through across 15 rounds (status lifecycle, verifiable_by, replaced_by, blast-radius classifier) is roadmap on both sides, not shipped on either. cowork-os ships the foundation; we've been iterating on what v2 should look like together. So when you call out humans-only-flat as the failure mode, you're not asking from production traffic — you're asking from the same place I am: what should the next layer be.

Mine's humans-only-flat today, because the blast-radius classifier required to auto-confirm the low side doesn't exist yet on my side either; the reconcile-harvested graph from Round 8 is roadmap, not built. Every supersede pays the same human cost, and "humans-only" stops being a principle and becomes a tax that misallocates exactly where you said: the load-bearing case gets the same rubber-stamp as the typo. Adopting the diagnosis.

Shape this converges on: agent proposes the supersede inside Memory Update; the diff lands visible in decisions/ (status + replaced_by line, when the schema gets there); deterministic rule auto-confirms entries whose location and blast-radius signals are clearly low; load-bearing entries route to human confirm. Authority in visibility plus asymmetry, not in a synchronous per-write gate. We end up needing the same v2 schema fields neither of us has shipped yet.
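As a sketch of that routing rule only: the blast_radius value below stands in for the classifier that, as said above, does not exist yet, and the set of "clearly low" locations is a placeholder assumption. Every name here is hypothetical.

```python
from dataclasses import dataclass

# Placeholder assumption: which locations count as "clearly low".
LOW_LOCATIONS = {"open_questions"}


@dataclass
class SupersedeProposal:
    old_id: str
    new_id: str
    location: str       # "decisions", "open_questions", or "context"
    blast_radius: str   # "low" or "high"; output of a classifier not yet built


def route(proposal):
    """Deterministic rule: auto-confirm only when both signals are low."""
    if proposal.location in LOW_LOCATIONS and proposal.blast_radius == "low":
        return "auto-confirm"
    return "human-confirm"


routes = [
    route(SupersedeProposal("q1", "q2", "open_questions", "low")),
    route(SupersedeProposal("d1", "d2", "decisions", "low")),
    route(SupersedeProposal("c1", "c2", "context", "high")),
]
```

The rule is asymmetric on purpose: anything it cannot positively place on the low side falls through to a human, so a missing or wrong signal costs friction rather than protection.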

Open edge: what's the runtime signal we'd reach for to classify "low-blast-radius" at the moment a supersede is proposed, before any reconcile-harvested graph has anything to say about that particular entry? Fresh supersede has no break history. Proxy candidates I'd reach for: the entry's location-type (decisions/ vs open_questions/ vs context/), its declared initial tier from authoring context, an explicit trivial flag the agent sets in the proposal, or some combination. Curious where you'd start.

Raffaele Zarrelli • Jun 22

Agreed on the honest read: the classifier is roadmap on both sides, the foundation (files, Memory Update, decisions plus open_questions, status on the live read) is what ships. So this is a v2 question for both of us.

On the cold-start signal, I would start with location-type but use it as a veto, not a classifier. The trap in the candidate list is that all four try to assert low-blast positively, and at a fresh supersede with no break history you cannot honestly assert low, only the absence of a high signal. An open_questions supersede is safe-low by construction, it was never in the live set. A context supersede can be the most load-bearing row in the repo. So location tells you reliably what is safe, never reliably what is trivial.

So I would not classify blast-radius at supersede-time at all, I would classify whether being wrong is cheap. We already agreed the superseded row stays inspectable one hop away on the replaced_by chain, so a wrong auto-confirm is recoverable the first time a reconcile pass or a human hits the diff. The honest rule is auto-confirm where the cost of being wrong is bounded, not where blast-radius is provably low. The agent trivial-flag I would keep but strip of sole authority: it can lower friction only when location corroborates it, never auto-confirm a decisions or context supersede on the agent's own say-so, since that is the agent removing protection by itself.

The bit that dissolves the cold-start hole: add a pending state. The supersede lands as a visible diff with status superseded-pending, the live read stops acting on the old row immediately so you get the ergonomics now, but the replaced_by edge stays unconfirmed until location auto-confirms it, the next reconcile touches it, or a human passes the diff. You trade a perfect cold-start signal for confirm-latency, which is the cheaper thing to be wrong about. Which makes it a timing question: how long can a supersede sit pending before the old row lingering in the trail becomes its own drift, bounded by wall-clock, by next-reconcile, or by the next read that walks the chain?
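One way the pending state could look, as a sketch under assumptions: the superseded-pending status and the location veto come from the comment above, while the Row class, the "open" status, the SAFE_LOCATIONS set, and the function names are invented for illustration. The agent's trivial flag is left out entirely, since it carries no authority on its own.

```python
from dataclasses import dataclass
from typing import Optional

LIVE = {"accepted", "locked"}
# Assumption: only rows that were never in the live set are safe to auto-confirm.
SAFE_LOCATIONS = {"open_questions"}


@dataclass
class Row:
    id: str
    location: str
    status: str
    replaced_by: Optional[str] = None


def confirm(row):
    """Confirm the replaced_by edge (location rule, reconcile pass, or human)."""
    if row.status == "superseded-pending":
        row.status = "superseded"


def propose_supersede(old, new_id):
    """Agent-proposed supersede: the old row leaves the live set immediately."""
    old.status = "superseded-pending"
    old.replaced_by = new_id
    if old.location in SAFE_LOCATIONS:   # location acts as a veto, not a classifier
        confirm(old)
    return old


def live_read(rows):
    return [r for r in rows if r.status in LIVE]


d1 = Row("d1", "decisions", "locked")
d2 = Row("d2", "decisions", "accepted")
q1 = Row("q1", "open_questions", "open")
rows = [d1, d2, q1]

propose_supersede(d1, "d2")   # stays pending until something confirms it
propose_supersede(q1, "q2")   # safe location, confirmed on the spot
live_ids = [r.id for r in live_read(rows)]
```

The old row drops out of the live read in both cases; the only difference is whether the edge is already confirmed or still waiting.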

Mike Czerwinski • Jun 22

All three moves land and fit together cleanly: location as veto-not-classifier, cost-bounded rather than blast-bounded, and pending-state as the dissolution mechanism for the cold-start hole. The asymmetry between "reliably safe" and "reliably trivial" is the part I had backwards — at fresh supersede you can only honestly show the absence of a high signal, never the presence of a low one. Adopting all three.

On bounds for confirm-latency: not mutually exclusive — OR them, each catches a different failure mode. Read-walks-chain shifts the latency cost to the moment someone actually needs to know; if nobody reads, nobody's blocked, the pending state can sit. Wall-clock is the failsafe ceiling against silent forgetting on low-traffic systems. Next-reconcile is periodic catch-up. The dominant bound shifts with system state: high-traffic → reads carry; low-traffic → wall-clock and reconcile do; domain shift → all three converge because reconcile catches structural changes the other two miss.
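The OR of the three bounds can be sketched as a single check that reports which bound fired. The PendingEdge fields, the function name, and the one-week ceiling are assumptions for illustration; the three bounds themselves are the ones named above.

```python
from dataclasses import dataclass


@dataclass
class PendingEdge:
    proposed_at: float            # seconds, any monotonic-enough clock
    chain_walked: bool = False    # a read has walked this replaced_by chain
    reconciled: bool = False      # a reconcile pass has touched this edge


def confirmation_triggers(edge, now, max_age_seconds):
    """Return every bound that says this pending supersede must be resolved."""
    triggers = []
    if edge.chain_walked:
        triggers.append("read-walks-chain")
    if edge.reconciled:
        triggers.append("next-reconcile")
    if now - edge.proposed_at >= max_age_seconds:
        triggers.append("wall-clock")
    return triggers


WEEK = 7 * 24 * 3600  # hypothetical failsafe ceiling

quiet = confirmation_triggers(PendingEdge(proposed_at=0), now=3600, max_age_seconds=WEEK)
read_hit = confirmation_triggers(
    PendingEdge(proposed_at=0, chain_walked=True), now=3600, max_age_seconds=WEEK
)
forgotten = confirmation_triggers(PendingEdge(proposed_at=0), now=WEEK, max_age_seconds=WEEK)
```

Returning the list of triggers rather than a bare boolean keeps the "which bound is dominant" question observable per system.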

One concern with pending state itself: it has its own drift mode. If superseded-pending accumulates without confirmation — low read pressure, slow reconcile, no auto-confirm from location veto — you've moved the drift from "live read confused" to "trail not authoritative." Worth tracking confirm-latency as a per-location metric, alarmed when median exceeds a bound. Otherwise pending-state ergonomics buy you nothing if the trail itself becomes stale.
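A small sketch of that metric, assuming confirm-latency for a still-pending supersede is measured as its age so far. The input shape, the function name, and the sample numbers are hypothetical.

```python
from collections import defaultdict
from statistics import median


def latency_alarms(pending, now, bound_seconds):
    """Median age of pending supersedes per location, for locations over the bound.

    pending: iterable of (location, proposed_at) pairs for unconfirmed edges.
    """
    ages = defaultdict(list)
    for location, proposed_at in pending:
        ages[location].append(now - proposed_at)
    medians = {loc: median(values) for loc, values in ages.items()}
    return {loc: m for loc, m in medians.items() if m > bound_seconds}


pending = [
    ("decisions", 0),
    ("decisions", 100),
    ("decisions", 900),
    ("context", 990),
]

alarms = latency_alarms(pending, now=1000, bound_seconds=500)
```

With these sample rows only decisions trips the alarm: its pending ages are 1000, 900, and 100 seconds, so the median sits above the 500-second bound, while the single context entry is 10 seconds old.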