<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mike Czerwinski</title>
    <description>The latest articles on DEV Community by Mike Czerwinski (@jugeni).</description>
    <link>https://dev.to/jugeni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3993038%2Fc272b6b5-4050-4cb9-9527-a044b0d7265f.png</url>
      <title>DEV Community: Mike Czerwinski</title>
      <link>https://dev.to/jugeni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jugeni"/>
    <language>en</language>
    <item>
      <title>Anthropic measured the human side. Five operators are building the agent side.</title>
      <dc:creator>Mike Czerwinski</dc:creator>
      <pubDate>Sun, 21 Jun 2026 20:45:56 +0000</pubDate>
      <link>https://dev.to/jugeni/anthropic-measured-the-human-side-five-operators-are-building-the-agent-side-17a0</link>
      <guid>https://dev.to/jugeni/anthropic-measured-the-human-side-five-operators-are-building-the-agent-side-17a0</guid>
      <description>&lt;p&gt;I joined dev.to a few days ago because I'd run out of paths to argue this stuff against. Months of building a framework — operator discipline as an orthogonal axis to autonomy, locked decisions with status fields, drift detection, supersession trails — and the only thing I was sure of was that internal coherence isn't proof of anything. Frameworks survive by surviving other people, not by surviving the author.&lt;/p&gt;

&lt;p&gt;So I started publishing. Today the framework finally hit something outside my own head.&lt;/p&gt;




&lt;h2&gt;What Anthropic measured&lt;/h2&gt;

&lt;p&gt;On June 16, Anthropic Economic Research published "Agentic coding and persistent returns to expertise." About 400,000 interactive Claude Code sessions. About 235,000 people. October 2025 to April 2026. Expertise patterns, delegation patterns, success patterns.&lt;/p&gt;

&lt;p&gt;The central finding, in their own words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The greater domain expertise a person brings to a session, the more work Claude does per instruction."&lt;/p&gt;

&lt;p&gt;"Success is determined by how well a person understands the problem they are trying to solve, not whether they're trained in coding."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anthropic did not measure operator discipline directly. It measured the closest empirical neighbor: expertise as a multiplier on agentic work.&lt;/p&gt;

&lt;p&gt;Expert-rated sessions show about 2.4× as many Claude actions per prompt as novice-rated sessions, and roughly 5× the text output. The signal is not simply "knows how to code." The signal is "understands the problem well enough to steer the agent." That overlaps with the axis I argued for in my &lt;a href="https://dev.to/jugeni/vibe-coding-is-not-a-level-its-an-axis-12gb"&gt;first post on dev.to&lt;/a&gt;: vibe coding is not a level, and operator discipline is an axis orthogonal to autonomy. My stronger claim was that L1 + High discipline outperforms L5 + Low discipline over time. Anthropic does not measure that claim directly, but it gives the human side of the axis something measurable.&lt;/p&gt;

&lt;p&gt;What the report does not try to answer is the agent-side question: what kind of state, memory, governance, and transition rules have to exist so that the work compounds across sessions instead of being reconstructed every time. Its scope is interactive Claude Code usage — what work is done, who does it, whether the session succeeds — and it explicitly leaves out large parts of non-interactive/headless usage and does not measure downstream real-world outcomes.&lt;/p&gt;

&lt;p&gt;That gap is what the practitioner cluster is circling from the other direction.&lt;/p&gt;

&lt;h2&gt;What the cluster is building&lt;/h2&gt;

&lt;p&gt;Five other operators on this platform have been pushing on the agent-side question from different starting points this week:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rapls"&gt;Rapls&lt;/a&gt; on status fields and append-only decision logs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/scarab-systems"&gt;Scarab Systems&lt;/a&gt; on governed baselines and deterministic enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/0xdevc"&gt;NOVAInetwork (@0xdevc)&lt;/a&gt; on quorum as a substitute for operator discipline at scale.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/sarracin0"&gt;Raffaele Zarrelli (@sarracin0)&lt;/a&gt; on structural pressure when the loop is slow.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/brianrhall"&gt;Brian Hall&lt;/a&gt; on the deterministic gate — and now with an open-source reference architecture (&lt;code&gt;faramesh-core&lt;/code&gt;, MPL-2.0).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The short version of the cluster: five different starting points, one architectural conclusion — the LLM proposes, deterministic rules enforce, humans authorize transitions, and the rules live outside the agent's reasoning loop.&lt;/p&gt;

&lt;p&gt;That's the agent-side scaffolding that sits outside the Anthropic report's scope.&lt;/p&gt;
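&lt;p&gt;A rough Python sketch of that shared conclusion, under stated assumptions. It is not the API of &lt;code&gt;faramesh-core&lt;/code&gt; or of any operator listed above. The names &lt;code&gt;Proposal&lt;/code&gt;, &lt;code&gt;ALLOWED_TRANSITIONS&lt;/code&gt;, and &lt;code&gt;gate&lt;/code&gt; are hypothetical, and the rule table borrows the &lt;code&gt;proposed → accepted → locked&lt;/code&gt; lifecycle purely as an example. The point it illustrates: the model only produces a proposal, and a plain function outside the model's reasoning loop decides whether it applies.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """What the LLM is allowed to emit: a requested transition, nothing more."""
    decision_id: str
    from_status: str
    to_status: str


# Deterministic rule table (hypothetical): the only transitions that exist.
ALLOWED_TRANSITIONS = {("proposed", "accepted"), ("accepted", "locked")}


def gate(proposal: Proposal, current_status: str, human_approved: bool) -> tuple:
    """Return (applied, reason). No model output is consulted in here."""
    if proposal.from_status != current_status:
        return (False, "stale proposal: status has changed")
    if (proposal.from_status, proposal.to_status) not in ALLOWED_TRANSITIONS:
        return (False, "transition not in rule table")
    if not human_approved:
        return (False, "waiting for human authorization")
    return (True, "applied")
```

&lt;p&gt;In this sketch the checks run in a fixed order: state consistency first, the rule table second, human authorization last. A proposal that fails the rules never reaches a human for sign-off.&lt;/p&gt;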

&lt;h2&gt;Two halves of the same answer&lt;/h2&gt;

&lt;p&gt;Anthropic measured what happens when humans bring expertise into the loop. The cluster I spent today reading and writing with is building architecture for what happens when that expertise has to survive across sessions, tools, and agents. Same axis, two directions, a fuller picture.&lt;/p&gt;

&lt;p&gt;Official research from Anthropic, independent practitioners on dev.to, both pointing at adjacent parts of the same problem. Not the same claim. Not the same layer. But the same direction.&lt;/p&gt;

&lt;p&gt;That's not a viral take. That's an early convergence signal.&lt;/p&gt;




&lt;p&gt;I came here to test the framework against operators who actually ship. The framework didn't collapse on contact. It got sharper. The peers who pushed back named gaps I hadn't seen. And one of the biggest labs in the field published the human-side measurement while we were doing it.&lt;/p&gt;

&lt;p&gt;Two independent signals converging from different directions, in the same week, on the same problem space. That's not the framework being right. It's the field starting to coalesce.&lt;/p&gt;

&lt;p&gt;It's a good Sunday to close the loop.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Operator discipline is no longer just a personal workflow. It is starting to look like an axis, a measurement problem, and an architecture. Whatever comes next has to be built, measured, and governed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/research/claude-code-expertise" rel="noopener noreferrer"&gt;https://www.anthropic.com/research/claude-code-expertise&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llmops</category>
      <category>agents</category>
      <category>operatordiscipline</category>
    </item>
    <item>
      <title>Vibe coding is not a level. It's an axis.</title>
      <dc:creator>Mike Czerwinski</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:48:45 +0000</pubDate>
      <link>https://dev.to/jugeni/vibe-coding-is-not-a-level-its-an-axis-12gb</link>
      <guid>https://dev.to/jugeni/vibe-coding-is-not-a-level-its-an-axis-12gb</guid>
      <description>&lt;p&gt;Karpathy gave us vibe coding: "see stuff, say stuff, run stuff, copy and paste stuff, and it mostly works." Since then, the industry has kept trying to turn it into a tidy autonomy ladder — Level 0, Level 1, all the way up to fully autonomous development.&lt;/p&gt;

&lt;p&gt;That ladder is useful. It is also incomplete.&lt;/p&gt;

&lt;p&gt;It measures one thing: how much of the &lt;em&gt;building&lt;/em&gt; you delegate to AI.&lt;/p&gt;

&lt;p&gt;But two people can delegate the same amount and get radically different outcomes. One compounds. The other accumulates entropy. Same autonomy level. Different operating system.&lt;/p&gt;

&lt;p&gt;That's the missing axis: &lt;strong&gt;operator discipline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By operator discipline I mean one thing: how much of your work survives the session boundary as inspectable state.&lt;/p&gt;




&lt;h2&gt;What the vertical axis measures&lt;/h2&gt;

&lt;p&gt;The autonomy ladder — inspired by Karpathy, reinforced by recent writing on AI-assisted development, and repeated in a dozen industry variants — measures one vertical: how much of the work you direct the model to own, and how fluent you are at directing that delegation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L0: no AI&lt;/li&gt;
&lt;li&gt;L1: AI as autocomplete&lt;/li&gt;
&lt;li&gt;L2: intent-driven (you specify the what, AI fills the how)&lt;/li&gt;
&lt;li&gt;L3: collaborative pair-programming&lt;/li&gt;
&lt;li&gt;L4: semi-autonomous (AI executes multi-step tasks, you review)&lt;/li&gt;
&lt;li&gt;L5: fully autonomous (AI owns the loop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each level is a rung on a skill ladder &lt;em&gt;inside one domain&lt;/em&gt;: building software. You climb by getting better at prompts, decomposition, code review at speed, and tolerance for non-determinism.&lt;/p&gt;

&lt;p&gt;This is real and worth measuring. It's just not the only axis.&lt;/p&gt;




&lt;h2&gt;The horizontal axis most maps underweight&lt;/h2&gt;

&lt;p&gt;Here's the question the vertical can't answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Two developers are both at Level 4. One ships features that compound — the codebase gets cleaner, their operating context gets sharper, their next prompt does more with less. The other ships features that decay — the codebase grows entropy, their trust in the model degrades, every new prompt is a fresh negotiation.&lt;/p&gt;

&lt;p&gt;Same vibe coding level. Different outcomes. What's the difference?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's not skill at building. It's &lt;strong&gt;how the person relates to the tool over time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Some maps name fragments of this — trust, verification, code review burden, the "perception–action gap" between knowing AI code can be wrong and being able to actually catch it. Those are real and worth reading. But they tend to live as caveats inside the autonomy story, not as a second axis with its own structure.&lt;/p&gt;

&lt;p&gt;So let me try to draw the axis directly.&lt;/p&gt;

&lt;p&gt;A small concrete example, since the abstraction needs one. For about three months I kept re-explaining the same architecture decision to the model every few sessions. Each time it would respectfully suggest the alternative I'd already rejected. Each time I'd argue it down again. The work felt fine in any single session. Over a month it was exhausting.&lt;/p&gt;

&lt;p&gt;Then I started writing those decisions down in a separate store, with a status field. &lt;code&gt;proposed → accepted → locked&lt;/code&gt;. Once a decision is &lt;code&gt;locked&lt;/code&gt;, the model is told not to relitigate it without an explicit unlock.&lt;/p&gt;

&lt;p&gt;The relitigation stopped. The work got calmer. The codebase started moving in one direction instead of wobbling.&lt;/p&gt;

&lt;p&gt;Nothing about my vibe coding level changed. What changed was that a decision became a piece of state instead of a thing I had to defend live.&lt;/p&gt;

&lt;p&gt;That's the axis. Not "are you good at prompting" — &lt;em&gt;how much of your context is a state machine, vs. how much is reconstructed from scratch each session.&lt;/em&gt;&lt;/p&gt;
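&lt;p&gt;A minimal Python sketch of the status field described above, offered as an illustration rather than the actual store. The names &lt;code&gt;Decision&lt;/code&gt;, &lt;code&gt;advance&lt;/code&gt;, &lt;code&gt;unlock&lt;/code&gt;, and &lt;code&gt;may_relitigate&lt;/code&gt; are hypothetical. Sending an unlocked decision back to &lt;code&gt;accepted&lt;/code&gt; is also an assumption, since the text only requires that the unlock be explicit.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    LOCKED = "locked"


# Forward transitions from the post: proposed -> accepted -> locked.
NEXT_STATUS = {
    Status.PROPOSED: Status.ACCEPTED,
    Status.ACCEPTED: Status.LOCKED,
}


@dataclass
class Decision:
    title: str
    status: Status = Status.PROPOSED
    history: list = field(default_factory=list)  # trail of status changes

    def advance(self) -> None:
        """Move one step forward; a locked decision has nowhere to advance."""
        if self.status not in NEXT_STATUS:
            raise ValueError(f"{self.title!r} is locked; unlock it explicitly first")
        nxt = NEXT_STATUS[self.status]
        self.history.append((self.status.value, nxt.value))
        self.status = nxt

    def unlock(self, reason: str) -> None:
        """Assumed behaviour: an explicit unlock needs a reason and reopens to accepted."""
        if self.status is not Status.LOCKED:
            raise ValueError("only locked decisions can be unlocked")
        if not reason.strip():
            raise ValueError("an unlock needs a reason")
        self.history.append((self.status.value, Status.ACCEPTED.value, reason))
        self.status = Status.ACCEPTED

    def may_relitigate(self) -> bool:
        """The question is open for debate only while the decision is not locked."""
        return self.status is not Status.LOCKED
```

&lt;p&gt;The design choice worth noticing is that &lt;code&gt;may_relitigate&lt;/code&gt; reads state instead of asking the model. Whether a question is open becomes a lookup, not something argued out again in each session.&lt;/p&gt;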




&lt;h2&gt;The 2×6 matrix&lt;/h2&gt;

&lt;p&gt;If autonomy is L0–L5 and operator discipline is Low/High, you get twelve cells. The diagonal that matters isn't "low everything → high everything." It's the cross-axis claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;L1 + High operator discipline &amp;gt; L5 + Low operator discipline&lt;/strong&gt; over any time horizon longer than a sprint.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three sample cells:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L3 + Low&lt;/strong&gt;: fast and brittle. Codebase entropy rising. Trust in the model is high in any given session and degrades across sessions because nothing about wrongness ever feeds back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3 + High&lt;/strong&gt;: fast and stable. Trust calibrated by sampling. Wrongness feeds back into the persistent context as a constraint, so the &lt;em&gt;next&lt;/em&gt; session starts from a better prior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L5 + Low&lt;/strong&gt;: maximum velocity into maximum mess. This is the failure mode every honest writeup of autonomous agents eventually admits to — locally-sensible decisions that miss global constraints, with no substrate to catch the drift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The claim is that the second axis dominates the first over time. I think it's right. It's testable. If you've watched two equally fluent AI users diverge over six months, you've already seen the pattern.&lt;/p&gt;




&lt;h2&gt;What operator discipline actually is&lt;/h2&gt;

&lt;p&gt;I'll describe what I personally run — not as the right answer, but so you have something concrete to disagree with.&lt;/p&gt;

&lt;p&gt;A persona file the model loads each session: identity, communication preferences, hard rules, things that previously caused friction. Updated when a session reveals a new edge case.&lt;/p&gt;

&lt;p&gt;Three append-only stores. Decisions have a lifecycle (&lt;code&gt;proposed → accepted → locked&lt;/code&gt;). Threads are active workstreams, each with current step, blocker, and next action. Notes are atomic facts with source-anchoring — every fact carries provenance: which email, which call, which file, which line.&lt;/p&gt;

&lt;p&gt;A capture habit. Decisions go into the store the same turn they happen, not as a post-session recap. Recaps drift. Live captures don't.&lt;/p&gt;

&lt;p&gt;Locked decisions stop the death-by-second-guessing loop. Source-anchoring removes one easy path to hallucination — the model is less likely to confidently restate a "fact" when the workflow forces provenance into view.&lt;/p&gt;
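&lt;p&gt;To make source-anchoring concrete, here is a small Python sketch of an append-only notes store. It is a guess at the shape, not the real implementation. The JSON Lines file format and the names &lt;code&gt;Note&lt;/code&gt;, &lt;code&gt;append_note&lt;/code&gt;, and &lt;code&gt;load_notes&lt;/code&gt; are all assumptions made for illustration.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class Note:
    fact: str
    source: str  # provenance: which email, call, file, or line the fact came from


def append_note(store: Path, note: Note) -> None:
    """Append one note as a JSON line; refuse facts that carry no provenance."""
    if not note.source.strip():
        raise ValueError("every note needs a source")
    # Mode "a" only ever adds to the end, so earlier lines are never rewritten.
    with store.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(note)) + "\n")


def load_notes(store: Path) -> list:
    """Read the whole log back, oldest first."""
    if not store.exists():
        return []
    with store.open(encoding="utf-8") as fh:
        return [Note(**json.loads(line)) for line in fh if line.strip()]
```

&lt;p&gt;Rejecting a note with an empty &lt;code&gt;source&lt;/code&gt; at write time is what forces provenance into view. A correction would be appended as a new note rather than edited into an old one.&lt;/p&gt;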

&lt;p&gt;None of this is novel architecture. The novelty is that it's &lt;em&gt;written down and enforced&lt;/em&gt;, not implied. It's a state machine, not a prompt trick.&lt;/p&gt;

&lt;p&gt;Whatever your autonomy level, you can be high or low on this. That's the axis.&lt;/p&gt;




&lt;h2&gt;What I'm not claiming&lt;/h2&gt;

&lt;p&gt;Discipline doesn't beat fluency. They multiply. An L1 user with high discipline still moves slower than an L4 user with high discipline.&lt;/p&gt;

&lt;p&gt;The autonomy ladder isn't wrong. It's real and worth climbing.&lt;/p&gt;

&lt;p&gt;What I am claiming: the map has two axes, and most of the public conversation has been about one of them. If "more AI" hasn't translated into "more leverage" for you, the answer might not be a smarter model. It might be the axis you weren't measuring.&lt;/p&gt;




&lt;p&gt;What does &lt;em&gt;your&lt;/em&gt; operator discipline look like? What's captured as state, what's reconstructed every session? Curious to hear concrete setups in the comments — especially ones that disagree with mine.&lt;/p&gt;

&lt;p&gt;— Mike&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
