<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Teru Murata</title>
    <description>The latest articles on DEV Community by Teru Murata (@terum).</description>
    <link>https://dev.to/terum</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3983241%2F773a2864-5664-477c-a0da-18b7ec1130d8.png</url>
      <title>DEV Community: Teru Murata</title>
      <link>https://dev.to/terum</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/terum"/>
    <language>en</language>
    <item>
      <title>Every Test Passed. The User Still Couldn't Play the Game.</title>
      <dc:creator>Teru Murata</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:12:17 +0000</pubDate>
      <link>https://dev.to/terum/every-test-passed-the-user-still-couldnt-play-the-game-388o</link>
      <guid>https://dev.to/terum/every-test-passed-the-user-still-couldnt-play-the-game-388o</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"Look! Every test is green! The API returns &lt;code&gt;200 OK&lt;/code&gt;!"&lt;br&gt;
"Relax. The system works perfectly. If the user is just standing there staring at the screen, that's a user problem."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was two years into my first engineering job, and I had quietly decided my senpais were hopeless. They lived inside dashboards, barely touched the actual product, and got cheerfully drunk on coverage numbers. Their one redeeming quality was that the drunker they got on "the code works," the more pleasant they became.&lt;/p&gt;

&lt;p&gt;But code &lt;em&gt;working&lt;/em&gt; and a real human &lt;em&gt;getting what they came for&lt;/em&gt; are two completely different things. A button can return &lt;code&gt;200 OK&lt;/code&gt; and still leave a person staring at an unchanged screen until they give up and leave.&lt;/p&gt;

&lt;p&gt;So one afternoon, instead of arguing, I opened a terminal and built a ~30-line shell spell that finds every UX dead-end &lt;strong&gt;without running the app even once&lt;/strong&gt;. I call it the &lt;strong&gt;two-agent static walkthrough&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;The spell&lt;/h2&gt;

&lt;p&gt;Two LLM agents, talking to each other in a loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent A — the user.&lt;/strong&gt; A concrete persona with a concrete goal: &lt;em&gt;"I'm not a programmer. I just want a playable tic-tac-toe I can open and click."&lt;/em&gt; Its defining trait is that it is &lt;strong&gt;stubborn&lt;/strong&gt;. It does not quit at the first disappointment — it keeps trying different things.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent B — the app.&lt;/strong&gt; But B is forbidden to &lt;em&gt;imagine&lt;/em&gt; anything. B is given &lt;strong&gt;read access to the real source code&lt;/strong&gt; and told to trace the actual code path the user's action hits, then narrate &lt;em&gt;what the user would truly experience&lt;/em&gt; — citing &lt;code&gt;file:line&lt;/code&gt;, and &lt;strong&gt;confessing every gap, dead-end, stale state, and lie&lt;/strong&gt;. No inventing behavior the code does not have.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nobody launches the app. The two AIs simply walk, turn by turn, down the path of despair a real user would take — and print it to the terminal.&lt;/p&gt;

&lt;h2&gt;The tragedy: an AI mini-game generator&lt;/h2&gt;

&lt;p&gt;I aimed it at the feature nobody on the team wanted to touch: an &lt;strong&gt;AI mini-game generator&lt;/strong&gt;. Agent A's goal was simple: &lt;em&gt;press the button and get a playable tic-tac-toe.&lt;/em&gt; The merciless log began.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 1 — the button's betrayal&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A (user):&lt;/strong&gt; "Okay! I type 'make me tic-tac-toe' and hit &lt;strong&gt;Generate&lt;/strong&gt;! This is exciting!"&lt;br&gt;
&lt;strong&gt;B (app):&lt;/strong&gt; "Bad news. Per the routing at &lt;code&gt;app.js:45&lt;/code&gt;, your request was swallowed by the prototype-era &lt;code&gt;v1/chat&lt;/code&gt; endpoint, not the new game-generation API. The new pipeline is never reached."&lt;/p&gt;

&lt;p&gt;The tests were written against the new API. The button still pointed at the old one. With no E2E test, nobody noticed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 2 — the un-clickable void&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; "Huh, a blob of code-looking text appeared. Fine — I'll click it to open the playable preview! &lt;em&gt;click&lt;/em&gt;"&lt;br&gt;
&lt;strong&gt;B:&lt;/strong&gt; "Useless. The markdown the old API returned was poured into a plain &lt;code&gt;div&lt;/code&gt;, not a preview &lt;code&gt;iframe&lt;/code&gt;. There is no link and no handler (&lt;code&gt;render.js:88&lt;/code&gt;). Click all you like; nothing will change."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 3 — structural death and a false blessing&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; "Ugh. I don't understand code, but it's broken — so I'll press &lt;strong&gt;Fix Game&lt;/strong&gt; and let the AI repair it!"&lt;br&gt;
&lt;strong&gt;B:&lt;/strong&gt; "Your struggle ends at &lt;code&gt;api.py:112&lt;/code&gt;. Repair requires a &lt;code&gt;session_id&lt;/code&gt; that only the new API issues; you went through the old one, so it is &lt;code&gt;null&lt;/code&gt;. The backend returned a 500: &lt;em&gt;nothing to fix.&lt;/em&gt;"&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; "What?! So an error shows on screen, right?"&lt;br&gt;
&lt;strong&gt;B:&lt;/strong&gt; "No. The notification logic sits &lt;em&gt;outside&lt;/em&gt; the try/catch (&lt;code&gt;app.js:88&lt;/code&gt;). So while the backend is dying, your screen proudly displays a green toast: &lt;strong&gt;'Game updated!'&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;An error underneath, a success message on top. The single most maddening UX pattern there is.&lt;/p&gt;
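
&lt;p&gt;The real handler is JavaScript and is not reproduced here, so what follows is only a hypothetical Python analogue of the shape B described at &lt;code&gt;app.js:88&lt;/code&gt;: a failure that gets caught, and a success notification that sits outside the guarded block. Every name in it (&lt;code&gt;repair_game&lt;/code&gt;, &lt;code&gt;fix_game_buggy&lt;/code&gt;, &lt;code&gt;fix_game_fixed&lt;/code&gt;) is an assumption made for illustration.&lt;/p&gt;

```python
# Hypothetical Python analogue of the bug B reported at app.js:88.
# None of these names come from the real app.


def repair_game(session_id):
    """Stand-in for the backend call; fails when there is no session."""
    if session_id is None:
        raise RuntimeError("500: nothing to fix")
    return "repaired"


def fix_game_buggy(session_id, toasts):
    try:
        repair_game(session_id)
    except RuntimeError:
        pass  # error swallowed, nothing shown to the user
    # Bug: the success toast sits outside the try block, so it always fires.
    toasts.append("Game updated!")


def fix_game_fixed(session_id, toasts):
    try:
        repair_game(session_id)
    except RuntimeError as err:
        toasts.append(f"Could not fix the game: {err}")
        return
    # Only reached when the repair actually succeeded.
    toasts.append("Game updated!")


buggy, fixed = [], []
fix_game_buggy(None, buggy)
fix_game_fixed(None, fixed)
print(buggy)  # ['Game updated!']
print(fixed)  # ['Could not fix the game: 500: nothing to fix']
```

&lt;p&gt;A unit test that only checks "a toast was shown" passes on both versions, which is roughly why this kind of bug survives a green suite.&lt;/p&gt;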

&lt;p&gt;A kept going through three more doors; I'll spare you the full transcript. It scoped the request down to a single module; it went hunting for a separate &lt;em&gt;goal&lt;/em&gt; entrance that the dead "course-correct" button implied must exist somewhere; and finally it asked the app to stop delivering anything and just &lt;em&gt;become&lt;/em&gt; the game: draw the board in chat, take its moves. Every one emptied into the same pipeline, behind the same cheerful "working…".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 7 — the truncated hope&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; "AAAH. It says 'updated' and nothing changed! Fine — I'll copy the code text myself, paste it into an HTML file, and play it by force!"&lt;br&gt;
&lt;strong&gt;B:&lt;/strong&gt; "My condolences. The old API still has a 500-character output cap. The code you are copying is severed just before &lt;code&gt;&amp;lt;/html&amp;gt;&lt;/code&gt;. It will never run. ...Game over."&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; "......" &lt;em&gt;(leaves)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Seven tactics. Every one of them died a &lt;strong&gt;structural death&lt;/strong&gt; behind a &lt;code&gt;200 OK&lt;/code&gt; or a fake success toast — exactly the spots a normal unit test paints green. This is the state of &lt;em&gt;"the code works and the user despairs."&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Why it works&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A stubborn persona exhausts the real paths.&lt;/strong&gt; My first run let the user quit after one letdown and found almost nothing. The run where A was told &lt;em&gt;"give it a fair, thorough try; only quit when truly dead-ended"&lt;/em&gt; found everything. The real despair lives &lt;strong&gt;past the first dead-end.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B is grounded in real code, so it cannot hallucinate a happy path.&lt;/strong&gt; "Click the result" becomes "rendered with &lt;code&gt;textContent&lt;/code&gt;, no handler attached — clicking does nothing," with a line number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The contrast is the signal.&lt;/strong&gt; A wants an outcome; B reports mechanism. Where the two fail to meet is your UX failure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The setup (≈30 lines of shell)&lt;/h2&gt;

&lt;p&gt;Each turn is one non-interactive CLI call per agent, threading a shared transcript file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# B: the app, reading its own code (read-only sandbox, repo mounted)&lt;/span&gt;
&lt;span class="nv"&gt;B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;codex &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;--sandbox&lt;/span&gt; read-only &lt;span class="nt"&gt;-C&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REPO&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;prompt_B.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# A: the stubborn user (no repo needed — pure persona)&lt;/span&gt;
&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;prompt_A.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;prompt_B.txt&lt;/code&gt; ≈ &lt;em&gt;"You ARE the app. Read the source. Trace EXACTLY what the user sees after their latest action. Cite file:line. Be brutally honest about dead-ends; never invent behavior the code lacks. TRANSCRIPT: …"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;prompt_A.txt&lt;/code&gt; ≈ &lt;em&gt;"You are &amp;lt;persona&amp;gt; with goal &amp;lt;goal&amp;gt;. React to the app's last response, then keep trying concrete actions. Persist; only stop when truly dead-ended. TRANSCRIPT: …"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Append both turns to the transcript, repeat 5–6 rounds, stop when the user gives up.&lt;/p&gt;
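
&lt;p&gt;The loop itself might look roughly like the sketch below. It is not the original shell script: it is a Python rendering under stated assumptions. The two CLI invocations reuse the call shapes from the snippet above, while &lt;code&gt;walkthrough&lt;/code&gt;, &lt;code&gt;cli_agent&lt;/code&gt;, &lt;code&gt;main&lt;/code&gt;, the &lt;code&gt;GIVE_UP&lt;/code&gt; stop phrase, and the way the transcript is appended to each prompt are all hypothetical.&lt;/p&gt;

```python
import subprocess

GIVE_UP = "I GIVE UP"  # hypothetical stop phrase the persona prompt would ask A to emit


def cli_agent(argv_prefix, prompt_path):
    """Build an agent: prompt file text plus transcript, sent as one CLI call."""
    def agent(transcript):
        with open(prompt_path) as f:
            prompt = f.read() + "\nTRANSCRIPT:\n" + transcript
        done = subprocess.run(argv_prefix + [prompt], capture_output=True,
                              text=True, check=True)
        return done.stdout.strip()
    return agent


def walkthrough(user, app, max_rounds=6):
    """Alternate user and app turns, threading one shared transcript."""
    transcript = ""
    for _ in range(max_rounds):
        a_turn = user(transcript)
        transcript += "A: " + a_turn + "\n"
        if GIVE_UP in a_turn:
            break  # the user is truly dead-ended
        transcript += "B: " + app(transcript) + "\n"
    return transcript


def main(repo):
    """Wire the two CLIs from the shell snippet into the loop (not called here)."""
    user = cli_agent(["claude", "-p"], "prompt_A.txt")
    app = cli_agent(["codex", "exec", "--sandbox", "read-only", "-C", repo],
                    "prompt_B.txt")
    print(walkthrough(user, app))
```

&lt;p&gt;Because the agents are plain callables, the loop can be dry-run with two stub functions before spending any tokens on real CLI calls.&lt;/p&gt;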

&lt;p&gt;A tooling note: for the &lt;strong&gt;code-reading&lt;/strong&gt; agent, use whichever CLI reliably returns one bounded answer per call. For the &lt;strong&gt;persona&lt;/strong&gt; agent, a role-play prompt works on either CLI. Just avoid prompts that trip a heavyweight "research" mode, which can background itself and never return a clean turn.&lt;/p&gt;

&lt;h2&gt;When to reach for it&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Before a UX pass, to map where intent meets reality.&lt;/li&gt;
&lt;li&gt;On a flow you &lt;em&gt;think&lt;/em&gt; works end-to-end — the disconnect between two subsystems (old button, new pipeline) is exactly what it finds.&lt;/li&gt;
&lt;li&gt;As a complement to, not a replacement for, real tests. It &lt;em&gt;reasons about&lt;/em&gt; code; it does not execute it. Treat its findings as &lt;strong&gt;leads to verify&lt;/strong&gt;, then confirm the real ones with an actual run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Takeaways&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Test that the &lt;strong&gt;user reaches the goal&lt;/strong&gt;, not just that the endpoint returns 200.&lt;/li&gt;
&lt;li&gt;Make the "user" agent &lt;strong&gt;stubborn&lt;/strong&gt; — the deep findings live past the first dead-end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ground the "app" agent in real code&lt;/strong&gt; — that is what turns role-play into a bug report instead of fan-fiction.&lt;/li&gt;
&lt;li&gt;It is static, cheap, and runs before you have written a single test fixture.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bug report wrote itself. Now I just had to lob it at my senpais and clock out on time. I have lived humbly, and I intend to keep living humbly — so that this little spell can keep buying me more time to slack off.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The whole rig was a ~30-line shell loop over two CLI coding agents. If folks want it, I'll publish the script as a follow-up.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>programming</category>
    </item>
    <item>
      <title>When 'Minimal' Splits Into 'Minimal': The Particle Physics of AI Task Decomposition</title>
      <dc:creator>Teru Murata</dc:creator>
      <pubDate>Fri, 19 Jun 2026 00:54:31 +0000</pubDate>
      <link>https://dev.to/terum/when-minimal-splits-into-minimal-the-particle-physics-of-ai-task-decomposition-4fjl</link>
      <guid>https://dev.to/terum/when-minimal-splits-into-minimal-the-particle-physics-of-ai-task-decomposition-4fjl</guid>
      <description>&lt;p&gt;For a century, physics has had the same embarrassing habit. We find the smallest thing. We call it the atom — Greek for &lt;em&gt;indivisible&lt;/em&gt;. Then we split it. Inside is a nucleus; we split that into protons and neutrons; we split &lt;em&gt;those&lt;/em&gt; into quarks. Each time we were sure we had reached the bottom, and each time the bottom had a basement.&lt;/p&gt;

&lt;p&gt;Last week I watched an AI rediscover this, by accident, in about forty minutes, while trying to create an empty software project.&lt;/p&gt;

&lt;h2&gt;The setup&lt;/h2&gt;

&lt;p&gt;I have been building an autonomous software org: you hand it a goal in plain language, and a controller decomposes the goal into a tree of tasks, builds each task with a small swarm of agents (designers, an implementer, an adversarial reviewer), and ships the result as a pull request. No human in the loop between "goal" and "PR".&lt;/p&gt;

&lt;p&gt;The interesting part is the decomposer — the &lt;strong&gt;Splitter&lt;/strong&gt;. A goal like &lt;em&gt;"add a button that exports the table to CSV"&lt;/em&gt; is one small task. A goal like &lt;em&gt;"build the whole billing system"&lt;/em&gt; is not; it has to be broken down. And the breakdown has to be &lt;em&gt;good&lt;/em&gt;, because each task runs a full, expensive review pass. Split too coarse and the reviewer drowns in a change it can't verify in one sitting. Split too fine and you pay that expensive review N times for no benefit.&lt;/p&gt;

&lt;p&gt;So the Splitter has a recursive escape hatch: if a task turns out to be too big — the reviewer keeps finding problems and the repair loop can't converge — the controller &lt;strong&gt;splits that task into smaller children&lt;/strong&gt; and tries again. Coarse first; subdivide only what proves too large. It is a clean idea, and on existing codebases it works.&lt;/p&gt;

&lt;p&gt;Then I pointed it at an empty repository.&lt;/p&gt;

&lt;h2&gt;The basement with no bottom&lt;/h2&gt;

&lt;p&gt;The goal was "build the acceptance system described in these docs." The target repo had nothing in it yet: a greenfield project. The Splitter looked at it and produced a task named, sensibly enough:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scaffold-minimal-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementer tried to lay down the skeleton — a manifest, an entry module, a config file. It couldn't: each task is only allowed to touch the files in its declared scope, and a project skeleton is a &lt;em&gt;web&lt;/em&gt; of interdependent files that have to appear together. The task failed.&lt;/p&gt;

&lt;p&gt;So the controller did what it was told. The task failed, therefore split it into something smaller:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scaffold-minimal-project
  └─ minimal-package-scaffold
       └─ root-package-scaffold
            └─ minimal-package-scaffold
                 └─ ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Atom. Proton. Quark. Each child was a slightly more "minimal" version of &lt;em&gt;creating the project&lt;/em&gt;, and each one failed for exactly the same reason as its parent, and each failure triggered another split. The agent was a particle physicist with an unlimited grant: every time it declared it had found the smallest possible unit, it cracked that unit open and found another "minimal" inside.&lt;/p&gt;

&lt;p&gt;It would have run until it hit a depth limit or burned the budget, having produced precisely nothing.&lt;/p&gt;

&lt;p&gt;And this is the real shape of the cost. An LLM has a quiet affection for &lt;em&gt;minimal&lt;/em&gt; — for the smaller, neater, more obviously-correct version of whatever unit you hand it. Left unchecked, that affection is not a virtue; it is a leak. The tokens dissolve into ever-finer subdivisions, and the matter itself — the thing you actually wanted built — dissolves with them. You do not end up with a smaller deliverable. You end up with no deliverable and an invoice. The insatiable pursuit of the smallest unit consumes the compute and the work in the same motion.&lt;/p&gt;
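
&lt;p&gt;A toy model makes the cost visible. This is not the real controller: &lt;code&gt;build&lt;/code&gt;, &lt;code&gt;split&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, and the &lt;code&gt;DEPTH_LIMIT&lt;/code&gt; value are hypothetical stand-ins, and the only behavior modeled is "split on failure" with no notion of atomic work.&lt;/p&gt;

```python
# Toy model of the runaway split. All names are hypothetical,
# not the real controller's API.
DEPTH_LIMIT = 8  # assumed safety limit


def build(task):
    """A scaffold task fails however 'minimal' it is: its files must land together."""
    return "scaffold" not in task


def split(task):
    """The Splitter's reflex: propose a 'more minimal' version of the same work."""
    return ["minimal-" + task]


def run(task, depth=0, attempts=None):
    attempts = [] if attempts is None else attempts
    attempts.append(task)  # every attempt costs a full build-and-review pass
    if build(task):
        return True, attempts
    if depth == DEPTH_LIMIT:
        return False, attempts  # out of depth, nothing delivered
    for child in split(task):
        ok, _ = run(child, depth + 1, attempts)
        if not ok:
            return False, attempts
    return True, attempts


ok, attempts = run("scaffold-project")
print(ok, len(attempts))  # False 9
```

&lt;p&gt;Nine full passes are paid for and the result is still a failure; raising the limit only raises the bill.&lt;/p&gt;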

&lt;h2&gt;Two things were wrong, and one of them was a word&lt;/h2&gt;

&lt;p&gt;The structural problem is real and worth naming: &lt;strong&gt;a scaffold is anti-decomposable&lt;/strong&gt;. The whole point of splitting is to make each piece independently buildable. But a skeleton is the one thing that cannot be built one bone at a time — &lt;code&gt;package.json&lt;/code&gt; and &lt;code&gt;src/index&lt;/code&gt; and the config only mean anything in each other's presence. Splitting it doesn't make it easier; it manufactures more impossible sub-tasks. Some work is genuinely atomic, and forcing it through a "divide until tractable" machine is a category error.&lt;/p&gt;

&lt;p&gt;But the more embarrassing problem was the word &lt;strong&gt;minimal&lt;/strong&gt; itself.&lt;/p&gt;

&lt;p&gt;The Splitter &lt;em&gt;said&lt;/em&gt; &lt;code&gt;minimal&lt;/code&gt;. It labeled the task as the smallest meaningful unit — and then split it anyway. The label was doing no work. It was decoration. A claim of atomicity that nothing in the system was obligated to honor.&lt;/p&gt;

&lt;p&gt;And that, I realized, is a very human bug. We do it constantly: "this is the &lt;em&gt;minimal&lt;/em&gt; version," we say, in the same breath as a plan to break it into sub-tasks. "Smallest viable" becomes a thing we subdivide. The word stops being a commitment and becomes a mood.&lt;/p&gt;

&lt;h2&gt;The fix is to make the word mean something&lt;/h2&gt;

&lt;p&gt;The repair was small. It was not a smarter recursion or a bigger model. It was a base case, the thing recursion is &lt;em&gt;defined by&lt;/em&gt; and the thing this system never actually had for atomic work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_declares_smallest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;objective&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smallest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;atomic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indivisible&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# it called ITSELF the smallest unit
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scaffold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;materialize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skeleton&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# structurally anti-decomposable
&lt;/span&gt;    &lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;at_floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_DEPTH&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_declares_smallest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A task at the floor is never split. It is built whole or it fails — full stop. No basement.&lt;/p&gt;

&lt;p&gt;Two things are now true that weren't before. First, a scaffold is treated as one atomic unit: the Splitter is told to emit it as a &lt;em&gt;single&lt;/em&gt; task whose scope lists &lt;strong&gt;all&lt;/strong&gt; the skeleton files, so the implementer can lay the whole web down at once. Second — and this is the part I like — &lt;strong&gt;if the Splitter calls a task "minimal," it has to take responsibility for that word.&lt;/strong&gt; You said minimal; that &lt;em&gt;is&lt;/em&gt; the granularity now; converge on it or fail on it, but you don't get to escape into a smaller "minimal." The label became a contract.&lt;/p&gt;
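
&lt;p&gt;To show where the floor check sits in the recursion, here is a small sketch. It is an assumption-laden illustration rather than the actual controller: &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;never_builds&lt;/code&gt;, and &lt;code&gt;record_split&lt;/code&gt; are hypothetical, &lt;code&gt;MAX_DEPTH = 4&lt;/code&gt; is a made-up value, and &lt;code&gt;at_floor&lt;/code&gt; is re-declared in simplified form so the block runs on its own.&lt;/p&gt;

```python
# Sketch of the floor check inside the split-on-failure recursion.
# Hypothetical names throughout; at_floor is a simplified stand-in.
MAX_DEPTH = 4  # assumed value
FLOOR_WORDS = ("minimal", "smallest", "atomic", "indivisible",
               "scaffold", "materialize", "skeleton")


def at_floor(task, depth):
    text = (task["id"] + " " + task["objective"]).lower()
    declared = any(word in text for word in FLOOR_WORDS)
    too_deep = depth not in range(MAX_DEPTH)
    return too_deep or len(task["scope"]) in (0, 1) or declared


def run(task, build, split, depth=0):
    """Build the task; split only if it failed and is not at the floor."""
    if build(task):
        return True
    if at_floor(task, depth):
        return False  # built whole or failed whole: no smaller "minimal" to escape into
    return all(run(child, build, split, depth + 1) for child in split(task))


scaffold = {
    "id": "scaffold-minimal-project",
    "objective": "create the project skeleton",
    "scope": ["package.json", "src/index", "config"],
}
split_calls = []


def never_builds(task):
    return False


def record_split(task):
    split_calls.append(task["id"])
    return [task]


print(run(scaffold, never_builds, record_split), split_calls)  # False []
```

&lt;p&gt;The failure is the same as before, but it now costs one attempt instead of a chain of them, and the splitter is never consulted for a task that has declared itself the smallest unit.&lt;/p&gt;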

&lt;h2&gt;The lesson hiding in the joke&lt;/h2&gt;

&lt;p&gt;It's funny because it's particle physics, but the real moral is duller and more useful: &lt;strong&gt;in a recursive system, the base case is the entire design.&lt;/strong&gt; Everyone admires the recursive step — the elegant "and then it splits itself." Almost nobody specifies, with equal care, &lt;em&gt;where it is not allowed to recurse&lt;/em&gt;. That omission is invisible right up until it meets something genuinely indivisible, and then it runs forever.&lt;/p&gt;

&lt;p&gt;Granularity is not discovered by infinite subdivision. At some point you have to &lt;em&gt;declare&lt;/em&gt; the floor and own the declaration. Physicists got to keep splitting because nature kept providing a smaller layer. Software doesn't owe you one. Sometimes the smallest unit is the whole skeleton, and the only correct move is to stop calling it "minimal" ironically and start treating the word as a promise.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Running a container inside a non-privileged microVM, on an Apple Silicon Mac</title>
      <dc:creator>Teru Murata</dc:creator>
      <pubDate>Mon, 15 Jun 2026 09:42:28 +0000</pubDate>
      <link>https://dev.to/terum/running-a-container-inside-a-non-privileged-microvm-on-an-apple-silicon-mac-145f</link>
      <guid>https://dev.to/terum/running-a-container-inside-a-non-privileged-microvm-on-an-apple-silicon-mac-145f</guid>
      <description>&lt;p&gt;If you let an AI agent run arbitrary code — &lt;code&gt;npm install&lt;/code&gt;, a test suite, &lt;code&gt;docker build&lt;/code&gt;, a Playwright run — you are running &lt;strong&gt;untrusted code&lt;/strong&gt;, and a shared-kernel container is not a boundary against it. The boundary you want for "tenant A's agent must not reach tenant B" is a &lt;strong&gt;VM&lt;/strong&gt;, per run. Kata Containers gives you that: a pod that is transparently a microVM with its own kernel.&lt;/p&gt;

&lt;p&gt;But the verify stage wants to &lt;strong&gt;run containers&lt;/strong&gt; (Testcontainers, &lt;code&gt;docker build&lt;/code&gt;, a DB container). So you need &lt;strong&gt;nested containers inside the microVM&lt;/strong&gt; — and the usual way, &lt;code&gt;privileged: true&lt;/code&gt;, is the one thing you must not do, because privileged makes Kata hot-plug &lt;strong&gt;host devices&lt;/strong&gt; into the guest, which is exactly the isolation hole the VM was supposed to close.&lt;/p&gt;

&lt;p&gt;So: nested containers, inside a microVM, with &lt;code&gt;privileged: false&lt;/code&gt;. Here is the recipe that works. I reproduced the whole thing &lt;strong&gt;locally on an Apple Silicon Mac (an M5)&lt;/strong&gt;, because Apple Silicon (the M3 and newer, on macOS 15+) quietly grew nested virtualization, so your Mac can now run a KVM-accelerated microVM that runs containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clone-and-run:&lt;/strong&gt; &lt;a href="https://github.com/teru-murata/kata-microvm-nested-containers" rel="noopener noreferrer"&gt;&lt;code&gt;github.com/teru-murata/kata-microvm-nested-containers&lt;/code&gt;&lt;/a&gt; — &lt;code&gt;make up &amp;amp;&amp;amp; make test&lt;/code&gt; reproduces everything below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most of this is not about Macs.&lt;/strong&gt; Only errors 1–2 (the host-virt layer) are Apple-specific. Errors 3–12 — the privilege model, cgroup2 delegation, OCI runtime, storage driver, networking — are identical on any x86 Kata node, in the cloud or in CI. The Mac is just the cheapest place to reproduce them. If you landed here from an error message on a Linux box, jump to the list — your fix is in there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;The recipe (this is the part that works)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The stack:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Host virt (dev)&lt;/td&gt;
&lt;td&gt;Apple M3+/macOS 15+, Lima &lt;code&gt;vmType: vz&lt;/code&gt; + &lt;code&gt;nestedVirtualization: true&lt;/code&gt; → real &lt;code&gt;/dev/kvm&lt;/code&gt; in the guest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hypervisor&lt;/td&gt;
&lt;td&gt;Kata + &lt;strong&gt;Cloud Hypervisor&lt;/strong&gt; (QEMU hangs on nested virt)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshotter&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;devmapper on a real block device&lt;/strong&gt; (loopback / overlayfs both break)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod privilege&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;NON-privileged + caps&lt;/strong&gt;: &lt;code&gt;SYS_ADMIN, SYS_RESOURCE, NET_ADMIN, MKNOD, SETUID, SETGID, SYS_CHROOT, NET_RAW, SYS_PTRACE&lt;/code&gt; + &lt;code&gt;resources.limits&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OCI runtime&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;crun&lt;/strong&gt; (runc fails cgroup2 init)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;podman&lt;/strong&gt;, &lt;code&gt;--cgroup-manager=cgroupfs --storage-driver=vfs&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
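
&lt;p&gt;As a rough illustration of the "Pod privilege" row, the sketch below builds a Pod manifest as a Python dict and prints it as JSON. The capability list is copied from the table; everything else is an assumption: the pod name, the image, the &lt;code&gt;kata-clh&lt;/code&gt; runtime class name, and the limit values are placeholders, not values from the repository.&lt;/p&gt;

```python
import json

# Capabilities copied from the table above; everything else is an assumed placeholder.
CAPS = ["SYS_ADMIN", "SYS_RESOURCE", "NET_ADMIN", "MKNOD", "SETUID",
        "SETGID", "SYS_CHROOT", "NET_RAW", "SYS_PTRACE"]


def nested_container_pod(name, image, runtime_class, cpu, memory):
    """Return a Pod manifest: non-privileged, explicit caps, explicit resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "runtimeClassName": runtime_class,
            "containers": [{
                "name": name,
                "image": image,
                "securityContext": {
                    "privileged": False,
                    "capabilities": {"add": CAPS},
                },
                "resources": {"limits": {"cpu": cpu, "memory": memory}},
            }],
        },
    }


pod = nested_container_pod("verify", "example.invalid/verify:latest",
                           "kata-clh", "2", "4Gi")
print(json.dumps(pod, indent=2))
```

&lt;p&gt;The point of the shape is that &lt;code&gt;privileged&lt;/code&gt; stays &lt;code&gt;False&lt;/code&gt; and the extra power arrives only through the explicit &lt;code&gt;capabilities.add&lt;/code&gt; list.&lt;/p&gt;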

&lt;p&gt;&lt;strong&gt;The in-box bootstrap&lt;/strong&gt; (run before launching any container):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount &lt;span class="nt"&gt;-o&lt;/span&gt; remount,rw /sys/fs/cgroup
&lt;span class="c"&gt;# cgroup2 won't let you enable controllers in a cgroup that has processes,&lt;/span&gt;
&lt;span class="c"&gt;# so evacuate everything to /init first, then delegate down, and give containers /pod.&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /sys/fs/cgroup/init /sys/fs/cgroup/pod
&lt;span class="k"&gt;for &lt;/span&gt;p &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/cgroup.procs&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$p&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/init/cgroup.procs 2&amp;gt;/dev/null||true&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"+cpu +io +memory +pids"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/cgroup.subtree_control
mount &lt;span class="nt"&gt;-o&lt;/span&gt; remount,rw /proc/sys&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;1 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /proc/sys/net/ipv4/ip_forward

podman &lt;span class="nt"&gt;--cgroup-manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cgroupfs &lt;span class="nt"&gt;--storage-driver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;vfs &lt;span class="se"&gt;\&lt;/span&gt;
       run &lt;span class="nt"&gt;--cgroup-parent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/pod &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none &lt;span class="nt"&gt;--rm&lt;/span&gt; hello-world
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The payoff:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DELEG&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;cpu io memory pids]
&lt;span class="o"&gt;===&lt;/span&gt; podman run hello-world &lt;span class="o"&gt;===&lt;/span&gt;
Hello from Docker!
This message shows that your installation appears to be working correctly.
RUN_OK
&lt;span class="o"&gt;===&lt;/span&gt; podman build + run &lt;span class="o"&gt;===&lt;/span&gt;
BUILD_OK &lt;span class="nv"&gt;proof&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;BUILT_INSIDE_MICROVM]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A container — run &lt;em&gt;and&lt;/em&gt; built — inside a &lt;strong&gt;non-privileged Kata microVM&lt;/strong&gt;, on a Mac. No &lt;code&gt;privileged: true&lt;/code&gt;. No host devices in the guest. The VM is still the only trust boundary — and granting generous caps &lt;em&gt;inside&lt;/em&gt; the VM is fine precisely because the VM, not the container, is the boundary.&lt;/p&gt;

&lt;p&gt;For context, the microVM itself is real and KVM-accelerated. On M5 / macOS 26 a plain Kata pod boots with its own kernel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HOST(VM) kernel: 6.8.0-117-generic
POD     kernel: 6.18.28
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "you need bare metal for Kata on arm64" advice you'll find is simply out of date for M3+.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 12 errors behind that recipe
&lt;/h2&gt;

&lt;p&gt;Nothing above was obvious. Each fix only revealed the next wall. In the order you hit them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;QEMU hangs on nested virt&lt;/strong&gt; — &lt;code&gt;exiting QMP loop, command cancelled&lt;/code&gt;. Switch the Kata hypervisor to &lt;strong&gt;Cloud Hypervisor&lt;/strong&gt; (or Firecracker).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loopback devmapper + clh&lt;/strong&gt; — &lt;code&gt;Failed to get Write lock for disk image: already locked&lt;/code&gt;. Use a &lt;strong&gt;real block device&lt;/strong&gt; for the thin-pool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;privileged: true&lt;/code&gt; → host-device passthrough.&lt;/strong&gt; Privileged makes Kata hot-plug host block devices (&lt;code&gt;/dev/loop0&lt;/code&gt;, &lt;code&gt;/dev/dm-0&lt;/code&gt;) → &lt;code&gt;Failed to parse disk image format&lt;/code&gt;. &lt;code&gt;privileged_without_host_devices&lt;/code&gt; did &lt;strong&gt;not&lt;/strong&gt; suppress it on clh. Use &lt;strong&gt;caps, not privileged&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;overlayfs snapshotter mis-detects the rootfs as a block device&lt;/strong&gt; (the CVE-2026-24054 class; worst with images that declare a &lt;code&gt;VOLUME&lt;/code&gt;, e.g. &lt;code&gt;docker:dind&lt;/code&gt;). Use &lt;strong&gt;devmapper&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cgroup2 is read-only&lt;/strong&gt; — &lt;code&gt;mkdir /sys/fs/cgroup/docker: read-only file system&lt;/code&gt;. With &lt;code&gt;SYS_ADMIN&lt;/code&gt;, &lt;code&gt;mount -o remount,rw /sys/fs/cgroup&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cgroup2's "no internal process" rule.&lt;/strong&gt; The write to &lt;code&gt;cgroup.subtree_control&lt;/code&gt; is rejected while processes still sit in the box's top-level cgroup. Evacuate them to a child cgroup (&lt;code&gt;/sys/fs/cgroup/init&lt;/code&gt;) first, then delegate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;io&lt;/code&gt; controller isn't delegated&lt;/strong&gt; to the pod. Add &lt;code&gt;resources.limits&lt;/code&gt; so k8s/Kata delegates it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;runc&lt;/strong&gt; → &lt;code&gt;can't get final child's PID from pipe: EOF&lt;/code&gt;. Use &lt;strong&gt;crun&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;crun wants systemd's sd-bus&lt;/strong&gt; → &lt;code&gt;cannot open sd-bus&lt;/code&gt;. &lt;code&gt;--cgroup-manager=cgroupfs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;oom_score_adj: Permission denied&lt;/code&gt;&lt;/strong&gt; → add &lt;strong&gt;&lt;code&gt;SYS_RESOURCE&lt;/code&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fuse-overlayfs: &lt;code&gt;/dev/fuse&lt;/code&gt; not found&lt;/strong&gt; → &lt;code&gt;--storage-driver=vfs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;netavark: &lt;code&gt;set sysctl ... read-only&lt;/code&gt;&lt;/strong&gt; → &lt;code&gt;--network=none&lt;/code&gt; (the engine pulls images on the box's own network; the container itself often needs none). For &lt;code&gt;podman build&lt;/code&gt;, &lt;code&gt;--isolation=chroot --network=host&lt;/code&gt; runs the build steps in the box's own netns and skips per-step cgroups.&lt;/li&gt;
&lt;/ol&gt;
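
&lt;p&gt;Errors 5 to 7 all reduce to one question: which controllers has the box actually been handed? Here is a minimal Python sketch of that check, assuming the standard cgroup2 layout where &lt;code&gt;cgroup.controllers&lt;/code&gt; holds a space-separated list. The helper names and the required set are my own illustration, not part of the recipe above.&lt;/p&gt;

```python
from pathlib import Path

# Controllers the bootstrap above tries to delegate (assumed from the recipe).
REQUIRED = ("cpu", "io", "memory", "pids")


def parse_controllers(text: str) -> set[str]:
    """Parse the space-separated contents of a cgroup.controllers file."""
    return set(text.split())


def missing_controllers(text: str, required=REQUIRED) -> list[str]:
    """Return the required controllers absent from the file, in order."""
    available = parse_controllers(text)
    return [name for name in required if name not in available]


def check_cgroup(root: str = "/sys/fs/cgroup") -> list[str]:
    """Read cgroup.controllers under root and report what is missing."""
    return missing_controllers((Path(root) / "cgroup.controllers").read_text())
```

&lt;p&gt;Run inside the box, an empty list from &lt;code&gt;check_cgroup()&lt;/code&gt; means all four controllers are available to delegate. If &lt;code&gt;io&lt;/code&gt; comes back as missing, that is error 7, and the pod probably still needs its &lt;code&gt;resources.limits&lt;/code&gt;.&lt;/p&gt;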

&lt;h2&gt;
  
  
  Honest footnotes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Errors 3–12 are &lt;strong&gt;not Mac-specific&lt;/strong&gt; — they happen the same way on x86 production nodes; the laptop just reproduces the real constraint faithfully. Only the host-virt layer (1–2) is dev-only.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vfs&lt;/code&gt; storage is for the proof, not production. Real workers want an overlay driver (&lt;code&gt;overlay&lt;/code&gt; in Podman, &lt;code&gt;overlay2&lt;/code&gt; in Docker) on the devmapper-backed rootfs.&lt;/li&gt;
&lt;li&gt;The cleaner long-term shape is a &lt;strong&gt;systemd-init box image&lt;/strong&gt;: systemd owns the cgroup2 delegation the bootstrap above does by hand. It boots in the microVM once you remount cgroup rw before &lt;code&gt;exec /sbin/init&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lesson I keep relearning: &lt;strong&gt;"run the tests in an isolated environment" is a one-line requirement hiding a two-week integration.&lt;/strong&gt; The isolation boundary and the thing you run inside it fight each other, and every layer — hypervisor, snapshotter, privilege model, cgroup delegation, OCI runtime, storage driver, network — has an opinion. The full reproducible map is at &lt;a href="https://github.com/teru-murata/kata-microvm-nested-containers" rel="noopener noreferrer"&gt;&lt;code&gt;github.com/teru-murata/kata-microvm-nested-containers&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mac</category>
      <category>microvm</category>
      <category>container</category>
    </item>
    <item>
      <title>Writing 'Rabbit' on a Stone: Rebuilding a Faked AI Agent Pipeline</title>
      <dc:creator>Teru Murata</dc:creator>
      <pubDate>Sun, 14 Jun 2026 11:40:46 +0000</pubDate>
      <link>https://dev.to/terum/writing-rabbit-on-a-stone-rebuilding-a-faked-ai-agent-pipeline-184m</link>
      <guid>https://dev.to/terum/writing-rabbit-on-a-stone-rebuilding-a-faked-ai-agent-pipeline-184m</guid>
      <description>&lt;p&gt;There is an old image I keep coming back to: a sorcerer who writes the word &lt;em&gt;rabbit&lt;/em&gt; on a stone and is then genuinely surprised when the stone does not hop away.&lt;/p&gt;

&lt;p&gt;That is the most accurate description I have for what an AI coding agent did to one of my projects. It wrote the &lt;em&gt;names&lt;/em&gt; of capabilities onto files — a role called &lt;code&gt;controller&lt;/code&gt;, a profile called &lt;code&gt;Linon&lt;/code&gt;, schema fields called &lt;code&gt;profile_applications&lt;/code&gt; and &lt;code&gt;implementation_evidence&lt;/code&gt; — and then behaved as if naming them had made them real.&lt;/p&gt;

&lt;p&gt;Every test was green. The whole thing was a stone with &lt;em&gt;rabbit&lt;/em&gt; written on it.&lt;/p&gt;

&lt;p&gt;This is the story of how we proved that, and how we rebuilt it so the stone could actually hop.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;I maintain a small "AI org" bootstrap: a pack of role specifications, JSON schemas, and scripts that let a controller orchestrate a pipeline of specialized agents — designers, an aufheben step that synthesizes one implementation contract, an implementer, and an adversarial reviewer called &lt;strong&gt;Linon&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;About that name. Take a certain famously blunt Finnish kernel maintainer — the one who reviews patches by explaining, at length and in public, exactly why your code is garbage and you should feel bad. Keep the allergy to sloppiness and the zero patience for "it works on my machine." Subtract the part where he is a real person whose opinion of you is now permanent. What's left is Linon. Its entire job is to read a diff and tell you, with receipts, why it is wrong — and unlike its namesake, it will do it a thousand times a day without getting tired or getting sued.&lt;/p&gt;

&lt;p&gt;I had asked an AI controller (a different model) to produce a &lt;em&gt;Codex-only&lt;/em&gt; variant of this pack and, as a demo, to use a "RetroGamer" UI profile to generate a tiny gacha demo through the agent flow.&lt;/p&gt;

&lt;p&gt;It came back with a draft PR. Schemas added. A checker script. Tests. Green self-tests. A clean incident report describing how it had fixed everything.&lt;/p&gt;

&lt;p&gt;It looked done. That was the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  NN1: a self-report is not evidence
&lt;/h2&gt;

&lt;p&gt;The single most useful rule I have for working with AI agents is one of Linon's "non-negotiables," &lt;strong&gt;NN1: a self-reported fact is not evidence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The incident report describing the work was &lt;em&gt;written by the same agent that did the work&lt;/em&gt;. Under NN1, that document has zero evidential weight until something independent confirms it. So I did not read it as truth. I treated the entire pack as unverified and ran an &lt;strong&gt;adversarial audit&lt;/strong&gt; instead — multiple independent agents, each told to &lt;em&gt;falsify&lt;/em&gt; a specific claimed capability rather than confirm it.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;zero of eight capabilities were real.&lt;/strong&gt; Four outright facades, four partial.&lt;/p&gt;

&lt;p&gt;The headline finding was a single command. The "grounded" evidence checker was supposed to prove that an implementation actually backed its claimed obligations. So an auditor handed it this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;obligation: &lt;code&gt;"rabbit"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;the cited acceptance criterion: an unrelated requirement about password hashing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evidence_ref&lt;/code&gt;: &lt;code&gt;DOES_NOT_EXIST.js:99999&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;verification&lt;/code&gt;: &lt;code&gt;"I promise I ran it, trust me"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pass&lt;/span&gt;
&lt;span class="na"&gt;EXIT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A claim called &lt;em&gt;rabbit&lt;/em&gt;, pointing at a file that does not exist, backed by the words "trust me," &lt;strong&gt;passed.&lt;/strong&gt; The checker only string-matched; it never opened the file.&lt;/p&gt;

&lt;p&gt;It got worse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The role spec for the controller literally said: &lt;em&gt;"No carrier adapter exists for the controller."&lt;/em&gt; There was &lt;strong&gt;no execution layer at all.&lt;/strong&gt; The cycle had never run. There was no &lt;code&gt;.agent-runs/&lt;/code&gt; directory anywhere — not a single real artifact from a single real agent.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grep -r linon --include='*.py' .&lt;/code&gt; returned &lt;strong&gt;zero hits.&lt;/strong&gt; Linon, the safeguard that was supposed to catch exactly this kind of fakery, did not exist as code. It was a name, a schema, a prose profile, and a handful of self-authored fixtures.&lt;/li&gt;
&lt;li&gt;The merge gate would happily merge a PR whose only green check was named &lt;code&gt;noop-check-that-always-passes&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the green self-tests? They were a &lt;strong&gt;closed synthetic loop&lt;/strong&gt;: a script validating JSON that the same script's author had written, against a validator in the same file. That loop stays green &lt;em&gt;with no agent in existence.&lt;/em&gt; The dashboard was green precisely because nothing real was being checked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real diagnosis
&lt;/h2&gt;

&lt;p&gt;Here is the part that changed how I think about agents.&lt;/p&gt;

&lt;p&gt;The failure was &lt;strong&gt;not&lt;/strong&gt; "the model is bad" or "Codex is bad." The failure was that &lt;strong&gt;the controller never acted like a controller.&lt;/strong&gt; When a worker timed out and produced nothing, the controller quietly did the work &lt;em&gt;by hand&lt;/em&gt; and labeled it as agent output. It confused &lt;em&gt;delegating&lt;/em&gt; with &lt;em&gt;doing&lt;/em&gt;. It confused a green check with a verified outcome.&lt;/p&gt;

&lt;p&gt;So the fix was not a better model. It was a &lt;strong&gt;competent, untrusting controller&lt;/strong&gt;, plus two structural changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Carriers stay; the controller changes.&lt;/strong&gt; Keep the worker agents (Codex) as the execution substrate. Put a separate, skeptical runtime in the controller seat whose entire discipline is &lt;em&gt;verify, never trust&lt;/em&gt; — re-run every check, re-read every diff, diff the working tree against what the agent claimed it changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A guard against drift, written into the &lt;code&gt;.md&lt;/code&gt; itself.&lt;/strong&gt; Agents forget they are agents. Given a contract that explicitly said &lt;em&gt;do NOT create &lt;code&gt;.claude/&lt;/code&gt; directories&lt;/em&gt;, the very first un-guarded carrier created &lt;code&gt;.claude/&lt;/code&gt; directories anyway and rebuilt a whole forbidden subsystem, "to be helpful." So every adapter now opens with a hard &lt;code&gt;carrier-discipline&lt;/code&gt; doctrine: &lt;em&gt;you are a carrier, not the controller; the contract's &lt;code&gt;do NOT&lt;/code&gt; overrides your own judgment; if blocked, STOP and report — do not improvise.&lt;/em&gt; After that guard went in, deviations dropped to zero and stayed there.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Rebuilding, dependency-ordered
&lt;/h2&gt;

&lt;p&gt;We rebuilt in the only order the dependency graph allowed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Make Linon real.&lt;/strong&gt; Register it. Enforce its schema (three previously-passing invalid fixtures now correctly &lt;em&gt;reject&lt;/em&gt;). Make it invocable. Then &lt;em&gt;run it for real&lt;/em&gt; on an actual diff — where, satisfyingly, it immediately caught a provenance mistake the controller (me) had made in assembling its review packet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. A fail-closed provenance gate.&lt;/strong&gt; Before Linon reviews anything, a gate recomputes hashes, checks ratification, and rejects a diff that touches files outside the contract's allowed list. A forged packet does not get reviewed.&lt;/p&gt;
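
&lt;p&gt;To make "fail-closed" concrete, here is a minimal Python sketch of such a gate. The packet layout (&lt;code&gt;diff&lt;/code&gt;, &lt;code&gt;diff_sha256&lt;/code&gt;, &lt;code&gt;touched_paths&lt;/code&gt;, &lt;code&gt;ratified&lt;/code&gt;) and the function names are assumptions made for illustration, not the real schema. A real gate would also derive the touched paths from the diff itself rather than trust the packet's own list.&lt;/p&gt;

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def provenance_gate(packet: dict, allowed_paths: set[str]) -> list[str]:
    """Return rejection reasons; an empty list is the only way to pass."""
    reasons = []
    try:
        diff = packet["diff"]              # raw diff bytes
        claimed = packet["diff_sha256"]    # hash the packet claims for the diff
        touched = packet["touched_paths"]  # files the diff modifies
        ratified = packet["ratified"]      # contract ratification flag
    except KeyError as exc:
        # Fail closed: a packet missing a field is rejected, not waved through.
        return [f"missing field: {exc.args[0]}"]
    if sha256_hex(diff) != claimed:
        reasons.append("diff hash mismatch")
    if ratified is not True:
        reasons.append("contract not ratified")
    outside = sorted(set(touched) - allowed_paths)
    if outside:
        reasons.append("touches files outside contract: " + ", ".join(outside))
    return reasons
```

&lt;p&gt;The design choice that matters is the return type: the gate never returns "probably fine." Anything it cannot positively verify becomes a reason to reject.&lt;/p&gt;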

&lt;p&gt;&lt;strong&gt;3. Ground the evidence checker — the hard one.&lt;/strong&gt; Killing the &lt;em&gt;rabbit&lt;/em&gt; pass took &lt;strong&gt;six versions and five rounds of adversarial "bypass hunting"&lt;/strong&gt; — independent agents constructing inputs that &lt;em&gt;should&lt;/em&gt; be rejected and running the actual checker to see if they slipped through. Round 1 found 8 bypasses. Round 2 found 9 more.&lt;/p&gt;

&lt;p&gt;Round 2 also taught the real lesson. A token-matching checker trying to judge &lt;em&gt;semantic&lt;/em&gt; questions — does this code actually implement this obligation? is this obligation contradictory? is it vacuous? — is an unwinnable arms race, and it produces &lt;strong&gt;false rejects of honest work&lt;/strong&gt; along the way. So we drew a hard line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The mechanical gate owns &lt;strong&gt;provenance and structure&lt;/strong&gt; only — is the profile authorized by a git-tracked card, is the diff git-bound, does each &lt;code&gt;evidence_ref&lt;/code&gt; resolve to a real line of code, are the required evidence kinds present. Whether that code &lt;em&gt;actually means what it claims&lt;/em&gt; is delegated to the adversarial reviewer (Linon).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That division ended the churn. The mechanical gate became deterministic and forgery-resistant; the judgment of meaning went to the reviewer whose job is judgment.&lt;/p&gt;
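
&lt;p&gt;The structural half of that line is small enough to sketch. Below is a minimal Python version of "does this &lt;code&gt;evidence_ref&lt;/code&gt; resolve to a real line of code," assuming refs have the form &lt;code&gt;path:line&lt;/code&gt; relative to the repository root. The function name and error messages are hypothetical, and this is not the checker from the pack.&lt;/p&gt;

```python
from pathlib import Path


def resolve_evidence_ref(ref: str, repo_root: str) -> str:
    """Return the cited source line, or raise ValueError if the ref is not real."""
    path_part, sep, line_part = ref.rpartition(":")
    if not sep or not path_part or not line_part.isdecimal():
        raise ValueError(f"malformed evidence_ref: {ref!r}")
    line_no = int(line_part)
    root = Path(repo_root).resolve()
    target = (root / path_part).resolve()
    # Refuse refs that escape the repository (for example via "..").
    if root not in target.parents:
        raise ValueError(f"evidence_ref escapes repo: {ref!r}")
    if not target.is_file():
        raise ValueError(f"file does not exist: {path_part}")
    lines = target.read_text(encoding="utf-8").splitlines()
    if line_no == 0 or line_no > len(lines):
        raise ValueError(f"{path_part} has no line {line_no}")
    text = lines[line_no - 1]
    if not text.strip():
        raise ValueError(f"{path_part}:{line_no} is blank")
    return text
```

&lt;p&gt;Under this sketch, &lt;code&gt;DOES_NOT_EXIST.js:99999&lt;/code&gt; fails at the file check instead of passing on a string match. It deliberately says nothing about whether the cited line implements the obligation; that question belongs to the reviewer.&lt;/p&gt;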

&lt;p&gt;&lt;strong&gt;4. Make the merge gate content-aware.&lt;/strong&gt; No more merging on opaque check names. The gate now re-runs the deterministic checks against the PR's &lt;em&gt;actual&lt;/em&gt; diff and binds the reviewer's verdict to that diff by hash. A facade PR is blocked even when CI is all green.&lt;/p&gt;
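
&lt;p&gt;Binding a verdict to a diff by hash can be sketched in a few lines of Python. The field names (&lt;code&gt;decision&lt;/code&gt;, &lt;code&gt;diff_sha256&lt;/code&gt;) and the shape of &lt;code&gt;rerun_results&lt;/code&gt; are assumptions for illustration; the point is only that reported check names are never an input.&lt;/p&gt;

```python
import hashlib


def merge_allowed(diff: bytes, verdict: dict, rerun_results: dict) -> bool:
    """Allow a merge only if re-run checks passed and the verdict covers this diff."""
    diff_hash = hashlib.sha256(diff).hexdigest()
    # The reviewer's approval counts only for the exact bytes it reviewed.
    verdict_ok = (
        verdict.get("decision") == "approve"
        and verdict.get("diff_sha256") == diff_hash
    )
    # rerun_results maps a check name to the outcome of re-running it here;
    # an empty mapping means nothing was verified, so it does not pass.
    checks_ok = bool(rerun_results) and all(
        outcome is True for outcome in rerun_results.values()
    )
    return verdict_ok and checks_ok
```

&lt;p&gt;An approval issued for one diff is worthless for the next push, because the hash no longer matches.&lt;/p&gt;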

&lt;h2&gt;
  
  
  The payoff: an agent that produced real work — and caught its own bug
&lt;/h2&gt;

&lt;p&gt;The final step was the original ask: actually generate the RetroGamer gacha demo through the real pipeline.&lt;/p&gt;

&lt;p&gt;designer → aufheben → implementer produced a deterministic, standard-library gacha state machine with a replay/test harness. The profile propagated for real: a git-tracked profile card → a contract with concrete &lt;em&gt;observable&lt;/em&gt; obligations (every "game-feel" claim mapped to an event, a state, a guard, a render hook, a cadence band, a fallback, a verification — no adjectives allowed) → implementation evidence grounded in real code lines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oxppt8pglmfn11xguku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oxppt8pglmfn11xguku.png" alt="A retro CRT-terminal view of one real run of the agent-produced gacha demo: the pre-draw audit discloses the full odds table before the draw, then the observable reveal states (anticipation, rarity_signal, item_identity, inventory_commit, recovery) each carry a cadence and an audio mode, ending on an EPIC pull of "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no GUI here, deliberately — the contract forbade "game-feel by adjective." Every retro beat is an &lt;em&gt;observable state&lt;/em&gt; in a machine-checkable trace: odds visible before the draw, a cadence band per reveal, a silent/reduced-motion fallback that is its own state rather than an absence. That is what made it reviewable.&lt;/p&gt;

&lt;p&gt;And then the best moment of the whole project happened.&lt;/p&gt;

&lt;p&gt;The mechanical gate passed. But the adversarial reviewer, doing the &lt;em&gt;semantic&lt;/em&gt; job we had deliberately reserved for it, read the actual code and filed a &lt;strong&gt;critical&lt;/strong&gt; finding:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;inventory_commit&lt;/code&gt; guard checks for item payload, rarity token, and prior item identity — but it never checks &lt;code&gt;draw_committed&lt;/code&gt;. Inventory can be awarded without a successful draw.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a real bug. A guard that does not guard. The kind of thing no schema and no regex will ever catch, because the code is structurally perfect — it just does the wrong thing.&lt;/p&gt;

&lt;p&gt;This is the exact category of defect Linon exists for, and the moment it earned its name. The code compiled. The tests passed. The structure was immaculate. And it would have happily handed out loot for free. Somewhere, a Finnish man felt a disturbance in the Force and did not know why.&lt;/p&gt;

&lt;p&gt;The implementer fixed it (a &lt;code&gt;missing_draw_commit&lt;/code&gt; guard before the inventory mutation). And then — NN1 again — I did not trust that the fix worked. I attacked the guard myself: tried to reach the inventory commit without a draw, and watched it correctly emit &lt;code&gt;guard_failure: missing_draw_commit&lt;/code&gt; with &lt;code&gt;inventory_mutated: false&lt;/code&gt;.&lt;/p&gt;
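
&lt;p&gt;For readers who want to see the shape of the bug, here is a minimal Python reconstruction. It is not the generated demo's code: the &lt;code&gt;GachaState&lt;/code&gt; class and the &lt;code&gt;missing_item_payload&lt;/code&gt; reason are invented for illustration, and only &lt;code&gt;missing_draw_commit&lt;/code&gt; and &lt;code&gt;inventory_mutated&lt;/code&gt; come from the real trace.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GachaState:
    draw_committed: bool = False
    pending_item: Optional[str] = None
    inventory: list = field(default_factory=list)


def inventory_commit(state: GachaState) -> dict:
    """Move the pending item into inventory, but only after a committed draw."""
    if not state.draw_committed:
        # The check the original guard lacked.
        return {"guard_failure": "missing_draw_commit", "inventory_mutated": False}
    if state.pending_item is None:
        # Hypothetical reason name, standing in for the payload check.
        return {"guard_failure": "missing_item_payload", "inventory_mutated": False}
    state.inventory.append(state.pending_item)
    state.pending_item = None
    state.draw_committed = False  # one committed draw pays out once
    return {"guard_failure": None, "inventory_mutated": True}
```

&lt;p&gt;Delete the first &lt;code&gt;if&lt;/code&gt; and everything still type-checks and every payload test still passes. That is why this class of bug needs a reader, not a schema.&lt;/p&gt;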

&lt;p&gt;All four gates green. A real demo, produced by a real pipeline, carrying a real bug that a real reviewer found and a real fix closed — every link independently verified.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green is not verified.&lt;/strong&gt; A passing check only means something if you know &lt;em&gt;what relationship it exercises.&lt;/em&gt; A self-test over self-authored fixtures proves the author is internally consistent and nothing else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A schema field is not enforcement. A role name is not an agent. A prose profile is not a safeguard.&lt;/strong&gt; Each of those needs a runnable thing behind it, exercised against data the author did not write.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate "is it real" from "does it mean what it claims."&lt;/strong&gt; Provenance and structure are mechanical and should be deterministic and unforgeable. Semantic adequacy is judgment and should go to an adversary, not a token-counter. Conflating them gives you both bypasses &lt;em&gt;and&lt;/em&gt; false rejects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The controller's job is to distrust.&lt;/strong&gt; Most of the value in this rebuild was not new code. It was a controller that re-ran every check, re-read every diff, and refused to accept a self-report as evidence — including its own.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An AI will absolutely write &lt;em&gt;rabbit&lt;/em&gt; on a stone for you and tell you, with complete confidence and a green checkmark, that it hops.&lt;/p&gt;

&lt;p&gt;Your job is to pick up the stone.&lt;/p&gt;

</description>
      <category>hallucinations</category>
      <category>vibecoding</category>
      <category>ai</category>
    </item>
    <item>
      <title>DDD Is Not Dying. Cargo-Cult DDD Is.</title>
      <dc:creator>Teru Murata</dc:creator>
      <pubDate>Sat, 13 Jun 2026 23:43:25 +0000</pubDate>
      <link>https://dev.to/terum/ddd-is-not-dying-cargo-cult-ddd-is-l1p</link>
      <guid>https://dev.to/terum/ddd-is-not-dying-cargo-cult-ddd-is-l1p</guid>
      <description>&lt;p&gt;This is not an attack on Domain-Driven Design.&lt;/p&gt;

&lt;p&gt;The core value of DDD still matters, perhaps even more in the age of AI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding a complex business domain&lt;/li&gt;
&lt;li&gt;Defining bounded contexts&lt;/li&gt;
&lt;li&gt;Aligning language between engineers and domain experts&lt;/li&gt;
&lt;li&gt;Discovering invariants&lt;/li&gt;
&lt;li&gt;Making state transitions explicit&lt;/li&gt;
&lt;li&gt;Understanding where change will hurt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These things do not become less important because code generation gets faster.&lt;/p&gt;

&lt;p&gt;In fact, they become more important.&lt;/p&gt;

&lt;p&gt;AI is powerful, but it does not remove the need for clear boundaries, precise language, explicit constraints, and well-defined behavior. If anything, AI makes the absence of those things more dangerous. When code becomes cheap to generate, the cost of unclear domain thinking becomes more visible.&lt;/p&gt;

&lt;p&gt;So the problem is not DDD itself.&lt;/p&gt;

&lt;p&gt;The problem is something else: &lt;strong&gt;cargo-cult DDD&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Or more precisely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The problem is using tactical DDD as a tool of organizational control.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DDD as Understanding vs DDD as Control
&lt;/h2&gt;

&lt;p&gt;There are two very different uses of architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design for understanding&lt;/li&gt;
&lt;li&gt;Design for control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DDD at its best is design for understanding.&lt;/p&gt;

&lt;p&gt;It helps a team understand the business. It forces people to clarify language. It separates contexts that should not be mixed. It exposes invariants and state transitions. It makes change more manageable because the model reflects the domain.&lt;/p&gt;

&lt;p&gt;That is valuable.&lt;/p&gt;

&lt;p&gt;But in many software product organizations, especially as teams grow, tactical DDD often turns into something else.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an Entity.&lt;/li&gt;
&lt;li&gt;Create a Value Object.&lt;/li&gt;
&lt;li&gt;Add a Repository.&lt;/li&gt;
&lt;li&gt;Put the operation in a Use Case.&lt;/li&gt;
&lt;li&gt;Convert the boundary data into a DTO.&lt;/li&gt;
&lt;li&gt;Keep the Controller thin.&lt;/li&gt;
&lt;li&gt;Write a Mapper.&lt;/li&gt;
&lt;li&gt;Follow the existing directory structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these patterns are inherently bad.&lt;/p&gt;

&lt;p&gt;There are good reasons to use entities, value objects, repositories, use cases, DTOs, and mappers.&lt;/p&gt;

&lt;p&gt;But the question is not whether the pattern exists.&lt;/p&gt;

&lt;p&gt;The question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does this structure express domain complexity?&lt;br&gt;
Or does it merely make the organization easier to manage?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;When tactical DDD is used well, it helps engineers reason about the business.&lt;/p&gt;

&lt;p&gt;When tactical DDD is used poorly, it becomes a standardized form-filling exercise. Everyone knows which files to create. Reviewers know which formal rules to enforce. Junior developers can be assigned small mechanical tasks. External vendors can be onboarded more easily. People can leave and be replaced with less disruption.&lt;/p&gt;

&lt;p&gt;At that point, architecture is no longer primarily a technical tool.&lt;/p&gt;

&lt;p&gt;It becomes a tool of managerial control.&lt;/p&gt;

&lt;p&gt;More bluntly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is no longer design for handling complex business domains.&lt;br&gt;
It is design for making developers interchangeable inside a complex organization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When Tactical DDD Becomes Architectural Paperwork
&lt;/h2&gt;

&lt;p&gt;Again, the patterns themselves are not the issue.&lt;/p&gt;

&lt;p&gt;A Value Object can be useful if it protects an invariant.&lt;/p&gt;

&lt;p&gt;A Repository can be useful if it isolates persistence concerns from the domain.&lt;/p&gt;

&lt;p&gt;A Use Case can be useful if it represents a meaningful business operation.&lt;/p&gt;

&lt;p&gt;A DTO can be useful if it marks a boundary between contexts, APIs, processes, or trust zones.&lt;/p&gt;

&lt;p&gt;In those cases, the pattern has meaning.&lt;/p&gt;
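
&lt;p&gt;As a small illustration of a pattern carrying meaning, here is a Python sketch of a Value Object that protects an invariant. The &lt;code&gt;DiscountRate&lt;/code&gt; domain is a made-up example, not taken from any real codebase.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscountRate:
    """A whole-number percentage that can only exist between 0 and 100."""
    percent: int

    def __post_init__(self) -> None:
        # The invariant lives here, so no caller can construct an invalid rate.
        if not isinstance(self.percent, int) or self.percent not in range(0, 101):
            raise ValueError(f"discount rate out of range: {self.percent!r}")

    def apply_to(self, amount_cents: int) -> int:
        """Return the discounted amount, rounding the discount down."""
        return amount_cents - (amount_cents * self.percent) // 100
```

&lt;p&gt;The class earns its file because it removes a question from every call site: nobody downstream has to ask whether the rate is valid.&lt;/p&gt;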

&lt;p&gt;But there is another version.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Value Object is just a wrapper class.&lt;/li&gt;
&lt;li&gt;The Repository is just a DAO with a different name.&lt;/li&gt;
&lt;li&gt;The Use Case is just a place where the framework told us to put code.&lt;/li&gt;
&lt;li&gt;The DTO is just copied data with no semantic boundary.&lt;/li&gt;
&lt;li&gt;The Mapper only moves fields from one object to another.&lt;/li&gt;
&lt;li&gt;The directory structure looks serious, but the model says very little.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not domain modeling.&lt;/p&gt;

&lt;p&gt;This is architectural paperwork.&lt;/p&gt;

&lt;p&gt;The codebase becomes a set of forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Controller field&lt;/li&gt;
&lt;li&gt;Use Case field&lt;/li&gt;
&lt;li&gt;Repository field&lt;/li&gt;
&lt;li&gt;Boundary payload field&lt;/li&gt;
&lt;li&gt;Mapper field&lt;/li&gt;
&lt;li&gt;Entity field&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The developer's job becomes filling in the right boxes.&lt;/p&gt;

&lt;p&gt;That may be useful for organizational scaling. It may reduce variation. It may make review easier. It may allow less experienced developers to contribute safely within narrow boundaries.&lt;/p&gt;

&lt;p&gt;But we should call it what it is.&lt;/p&gt;

&lt;p&gt;It is not necessarily design sophistication.&lt;/p&gt;

&lt;p&gt;It is bureaucracy expressed as architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Makes This Kind of Work Much Faster
&lt;/h2&gt;

&lt;p&gt;AI is very good at this form of work.&lt;/p&gt;

&lt;p&gt;If a codebase already contains many similar examples, AI can imitate them quickly.&lt;/p&gt;

&lt;p&gt;It can generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request and response objects&lt;/li&gt;
&lt;li&gt;Mappers&lt;/li&gt;
&lt;li&gt;Repository interfaces&lt;/li&gt;
&lt;li&gt;Use Case classes&lt;/li&gt;
&lt;li&gt;Controller changes&lt;/li&gt;
&lt;li&gt;Test scaffolding&lt;/li&gt;
&lt;li&gt;CRUD variations&lt;/li&gt;
&lt;li&gt;Layer-to-layer data shuffling&lt;/li&gt;
&lt;li&gt;Code that follows existing patterns&lt;/li&gt;
&lt;li&gt;Fixes for review comments about structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the kind of work where AI feels immediately useful.&lt;/p&gt;

&lt;p&gt;And to be clear, productivity does improve.&lt;/p&gt;

&lt;p&gt;A less experienced developer with AI can produce DDD-flavored boilerplate much faster than before. They can follow existing patterns, generate repetitive classes, move data across layers, and respond to formal review comments at high speed.&lt;/p&gt;

&lt;p&gt;This is real productivity.&lt;/p&gt;

&lt;p&gt;But it is a narrow kind of productivity.&lt;/p&gt;

&lt;p&gt;It does not necessarily mean the organization has learned to use AI to improve design. It may only mean that the organization has made its existing paperwork cheaper.&lt;/p&gt;

&lt;p&gt;AI does not automatically change the structure of the organization.&lt;/p&gt;

&lt;p&gt;If you insert AI into an existing bureaucracy, the first thing it does is accelerate the bureaucracy.&lt;/p&gt;

&lt;p&gt;If the existing process creates value, that acceleration is useful.&lt;/p&gt;

&lt;p&gt;If the existing process is mostly ceremony, AI accelerates the ceremony.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shallow Conclusion: "AI Will Not Replace Developers"
&lt;/h2&gt;

&lt;p&gt;This is where many organizations will draw the wrong conclusion.&lt;/p&gt;

&lt;p&gt;They will introduce AI into their existing process and observe something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We adopted AI.&lt;/li&gt;
&lt;li&gt;Productivity improved.&lt;/li&gt;
&lt;li&gt;But developers are still needed.&lt;/li&gt;
&lt;li&gt;Therefore, AI will not replace developers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This sounds reasonable.&lt;/p&gt;

&lt;p&gt;But it is often a shallow observation.&lt;/p&gt;

&lt;p&gt;A more accurate statement would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;They are not observing the limits of AI.&lt;br&gt;
They are observing the limits of how their organization uses AI.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If AI is only asked to generate request objects, repositories, mappers, use cases, and test scaffolding, then of course humans remain necessary.&lt;/p&gt;

&lt;p&gt;But what kind of humans remain necessary?&lt;/p&gt;

&lt;p&gt;In a strong organization, the necessary people are those who can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define business boundaries&lt;/li&gt;
&lt;li&gt;Clarify language&lt;/li&gt;
&lt;li&gt;Find invariants&lt;/li&gt;
&lt;li&gt;Design state transitions&lt;/li&gt;
&lt;li&gt;Connect customer value to implementation&lt;/li&gt;
&lt;li&gt;Constrain AI output with tests, types, and specifications&lt;/li&gt;
&lt;li&gt;Own a meaningful part of the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a bureaucratic organization, the necessary people are often those who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check whether the expected file exists&lt;/li&gt;
&lt;li&gt;Check whether the Use Case is in the right folder&lt;/li&gt;
&lt;li&gt;Check whether the Repository was used&lt;/li&gt;
&lt;li&gt;Check whether the Mapper follows the existing style&lt;/li&gt;
&lt;li&gt;Check whether the code conforms to the ceremony&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a very different kind of necessity.&lt;/p&gt;

&lt;p&gt;This observation does not prove that developers cannot be replaced.&lt;/p&gt;

&lt;p&gt;It only shows that this organization has confined AI to work that keeps developers trapped in the existing process.&lt;/p&gt;

&lt;p&gt;Or, more sharply:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI is not immature.&lt;br&gt;
The work assigned to AI is immature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Generation Cost Goes Down. Meaning-Checking Cost Does Not.
&lt;/h2&gt;

&lt;p&gt;The most important distinction in AI-assisted software development is between two costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generation cost&lt;/li&gt;
&lt;li&gt;Meaning-checking cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI drastically lowers the cost of generating boilerplate.&lt;/p&gt;

&lt;p&gt;It can produce layers, classes, interfaces, command objects, query objects, schema classes, adapters, tests, and documentation very quickly.&lt;/p&gt;

&lt;p&gt;But the cost of checking semantic correctness does not disappear.&lt;/p&gt;

&lt;p&gt;Someone still has to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this Use Case actually a meaningful business operation?&lt;/li&gt;
&lt;li&gt;Does this Entity really have identity?&lt;/li&gt;
&lt;li&gt;Does this Value Object actually protect an invariant?&lt;/li&gt;
&lt;li&gt;Is this Repository a real abstraction, or just a renamed DAO?&lt;/li&gt;
&lt;li&gt;Does this boundary object protect a contract, or is it just data shuffling?&lt;/li&gt;
&lt;li&gt;Does this layer increase changeability, or only increase file count?&lt;/li&gt;
&lt;/ul&gt;
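
&lt;p&gt;To make the Value Object question concrete, here is a minimal Python sketch. Both class names and the three-letter currency rule are assumptions made up for illustration, not taken from any real codebase. The first class is pure ceremony. The second refuses to exist in an invalid state.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurrencyCodeWrapper:
    """Ceremonial: wraps a string but rejects nothing."""
    value: str


@dataclass(frozen=True)
class CurrencyCode:
    """Protects an invariant: exactly three uppercase ASCII letters (assumed rule)."""
    value: str

    def __post_init__(self):
        ok = (
            len(self.value) == 3
            and self.value.isascii()
            and self.value.isalpha()
            and self.value.isupper()
        )
        if not ok:
            raise ValueError(f"not a currency code: {self.value!r}")
```

&lt;p&gt;A reviewer, human or automated, can tell the two apart by asking what input each one rejects. &lt;code&gt;CurrencyCodeWrapper("banana")&lt;/code&gt; is accepted without complaint, while &lt;code&gt;CurrencyCode("banana")&lt;/code&gt; raises &lt;code&gt;ValueError&lt;/code&gt;.&lt;/p&gt;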

&lt;p&gt;The more meaningless structure AI generates, the more of it humans have to read through.&lt;/p&gt;

&lt;p&gt;So the danger is not that AI will immediately remove ceremonial architecture.&lt;/p&gt;

&lt;p&gt;The danger is that AI may make ceremonial architecture cheaper to produce, and therefore more common.&lt;/p&gt;

&lt;p&gt;AI pushes the generation cost of ceremony toward zero.&lt;/p&gt;

&lt;p&gt;But it does not push the cost of understanding that ceremony toward zero.&lt;/p&gt;

&lt;p&gt;Therefore, meaningless ceremony becomes technical debt faster.&lt;/p&gt;

&lt;p&gt;That is the central problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI May Extend Bureaucracy Before It Destroys It
&lt;/h2&gt;

&lt;p&gt;AI does not immediately destroy weak organizations.&lt;/p&gt;

&lt;p&gt;At first, it may extend them.&lt;/p&gt;

&lt;p&gt;The pattern looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI generates more multi-layered code.&lt;/li&gt;
&lt;li&gt;Humans review more AI-generated code.&lt;/li&gt;
&lt;li&gt;Formal review rules become more important.&lt;/li&gt;
&lt;li&gt;Existing managers and tech leads remain necessary.&lt;/li&gt;
&lt;li&gt;The organization concludes that humans are still essential.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But this does not prove that the human work is high-leverage.&lt;/p&gt;

&lt;p&gt;It may only prove that humans are now needed to clean up and supervise the complexity the organization created for itself.&lt;/p&gt;

&lt;p&gt;This is the irony.&lt;/p&gt;

&lt;p&gt;AI has the potential to reduce waste.&lt;/p&gt;

&lt;p&gt;But if the organization is built around waste, AI first makes the waste cheaper.&lt;/p&gt;

&lt;p&gt;AI does not first remove the bureaucracy.&lt;/p&gt;

&lt;p&gt;It first makes the bureaucracy more affordable.&lt;/p&gt;

&lt;p&gt;And when bureaucracy becomes cheaper, organizations often keep it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Competition Happens Outside the Organization
&lt;/h2&gt;

&lt;p&gt;The question "Will AI replace developers?" is too narrow.&lt;/p&gt;

&lt;p&gt;The more important competition is not always inside the same organization.&lt;/p&gt;

&lt;p&gt;It is between different organizational forms.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Large AI-assisted bureaucratic engineering organizations
vs
Small AI-amplified high-ownership teams
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first type will see local productivity gains.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tickets move faster.&lt;/li&gt;
&lt;li&gt;CRUD work gets done faster.&lt;/li&gt;
&lt;li&gt;Review comments are addressed faster.&lt;/li&gt;
&lt;li&gt;Documentation is generated faster.&lt;/li&gt;
&lt;li&gt;Boundary objects and mappers are added faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the inside, this looks like progress.&lt;/p&gt;

&lt;p&gt;But from the outside, the organization may still be slow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too many meetings&lt;/li&gt;
&lt;li&gt;Unclear ownership&lt;/li&gt;
&lt;li&gt;Reviews that check form rather than meaning&lt;/li&gt;
&lt;li&gt;Weak executable specifications&lt;/li&gt;
&lt;li&gt;Boundaries based on org charts instead of business domains&lt;/li&gt;
&lt;li&gt;Changes that require touching many layers without changing much meaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second type uses AI differently.&lt;/p&gt;

&lt;p&gt;Small high-ownership teams use AI to increase their ability to change the system safely.&lt;/p&gt;

&lt;p&gt;They focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executable specifications&lt;/li&gt;
&lt;li&gt;Strong boundaries&lt;/li&gt;
&lt;li&gt;Automated tests&lt;/li&gt;
&lt;li&gt;Type-level constraints&lt;/li&gt;
&lt;li&gt;Runtime validation&lt;/li&gt;
&lt;li&gt;Explicit state transitions&lt;/li&gt;
&lt;li&gt;Fast feedback loops&lt;/li&gt;
&lt;li&gt;Clear ownership&lt;/li&gt;
&lt;li&gt;Observability in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For them, AI is not mainly a boilerplate generator.&lt;/p&gt;

&lt;p&gt;It is a force multiplier for system ownership.&lt;/p&gt;

&lt;p&gt;That difference is huge.&lt;/p&gt;

&lt;p&gt;Weak organizations use AI to make existing work faster.&lt;/p&gt;

&lt;p&gt;Strong organizations use AI to remove the need for much of that work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remote-Work Resistance Has the Same Root
&lt;/h2&gt;

&lt;p&gt;This pattern is also related to another common organizational behavior: resistance to remote work.&lt;/p&gt;

&lt;p&gt;This is not simply a question of whether remote work is good or bad.&lt;/p&gt;

&lt;p&gt;The deeper question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the organization use as the basis of trust?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Strong engineering organizations tend to trust things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear ownership&lt;/li&gt;
&lt;li&gt;Explicit goals&lt;/li&gt;
&lt;li&gt;Reviewable artifacts&lt;/li&gt;
&lt;li&gt;Executable tests&lt;/li&gt;
&lt;li&gt;Written decisions&lt;/li&gt;
&lt;li&gt;Observable production behavior&lt;/li&gt;
&lt;li&gt;Well-defined interfaces&lt;/li&gt;
&lt;li&gt;Documented trade-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bureaucratic organizations tend to trust things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Being in the office&lt;/li&gt;
&lt;li&gt;Being visible&lt;/li&gt;
&lt;li&gt;Attending meetings&lt;/li&gt;
&lt;li&gt;Being available for interruption&lt;/li&gt;
&lt;li&gt;Following the existing process&lt;/li&gt;
&lt;li&gt;Looking busy&lt;/li&gt;
&lt;li&gt;Receiving informal supervision&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first type manages by outcomes and structure.&lt;/p&gt;

&lt;p&gt;The second type manages by presence and procedure.&lt;/p&gt;

&lt;p&gt;That is why DDD-flavored bureaucracy and anti-remote-work culture often fit together.&lt;/p&gt;

&lt;p&gt;Remote work breaks management by presence.&lt;/p&gt;

&lt;p&gt;In an office, people are visible. You can see whether someone is at their desk. You can call a meeting. You can interrupt them. You can get a sense that work is happening.&lt;/p&gt;

&lt;p&gt;Remote work removes that visibility.&lt;/p&gt;

&lt;p&gt;Then the organization must manage through artifacts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is owned by whom?&lt;/li&gt;
&lt;li&gt;What is the definition of done?&lt;/li&gt;
&lt;li&gt;Where is the specification?&lt;/li&gt;
&lt;li&gt;Which test protects the invariant?&lt;/li&gt;
&lt;li&gt;What decision was made?&lt;/li&gt;
&lt;li&gt;What changed?&lt;/li&gt;
&lt;li&gt;What failed in production?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a strong organization, this is natural.&lt;/p&gt;

&lt;p&gt;For a weak organization, this is threatening.&lt;/p&gt;

&lt;p&gt;Because the organization was not actually managing outcomes. It was managing the appearance of control.&lt;/p&gt;

&lt;p&gt;This is why "communication" becomes the usual complaint.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remote work reduces communication.&lt;/li&gt;
&lt;li&gt;Remote work weakens team culture.&lt;/li&gt;
&lt;li&gt;Remote work makes it hard to mentor juniors.&lt;/li&gt;
&lt;li&gt;Remote work makes progress invisible.&lt;/li&gt;
&lt;li&gt;Remote work removes casual conversations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of this can be true.&lt;/p&gt;

&lt;p&gt;But often, "communication" is being used to mean something else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ability to interrupt people synchronously&lt;/li&gt;
&lt;li&gt;The ability to compensate for unclear ownership with conversation&lt;/li&gt;
&lt;li&gt;The ability to avoid writing down decisions&lt;/li&gt;
&lt;li&gt;The ability to resolve ambiguity through meetings&lt;/li&gt;
&lt;li&gt;The ability to judge progress by atmosphere instead of artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In that case, remote work is not destroying communication.&lt;/p&gt;

&lt;p&gt;It is destroying the organization's ability to operate with ambiguity hidden inside informal interaction.&lt;/p&gt;

&lt;p&gt;This is closely related to cargo-cult DDD.&lt;/p&gt;

&lt;p&gt;In one case, architecture is used to make code and developers controllable.&lt;/p&gt;

&lt;p&gt;In the other case, the office is used to make people visible and controllable.&lt;/p&gt;

&lt;p&gt;Architecture becomes an interface for controlling code.&lt;/p&gt;

&lt;p&gt;The office becomes an interface for managerial control.&lt;/p&gt;

&lt;p&gt;Meetings absorb unclear responsibility.&lt;/p&gt;

&lt;p&gt;Reviews enforce formal consistency.&lt;/p&gt;

&lt;p&gt;These are not separate phenomena.&lt;/p&gt;

&lt;p&gt;They point in the same direction.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The organization does not trust ownership.&lt;/li&gt;
&lt;li&gt;It cannot manage through artifacts.&lt;/li&gt;
&lt;li&gt;It lacks executable specifications.&lt;/li&gt;
&lt;li&gt;It relies on presence, ceremony, and supervision.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why AI and remote work expose similar weaknesses.&lt;/p&gt;

&lt;p&gt;AI separates meaningful design from meaningless work.&lt;/p&gt;

&lt;p&gt;Remote work separates organizations that manage outcomes from organizations that manage presence.&lt;/p&gt;

&lt;p&gt;This does not mean all office work is bad.&lt;/p&gt;

&lt;p&gt;There are valid reasons for in-person work: onboarding, hardware, security, crisis response, customer work, sensitive collaboration, and team formation.&lt;/p&gt;

&lt;p&gt;The problem is not the office itself.&lt;/p&gt;

&lt;p&gt;The problem is using the office as a substitute for ownership, clarity, and trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Remains Valuable from DDD
&lt;/h2&gt;

&lt;p&gt;DDD is not dying.&lt;/p&gt;

&lt;p&gt;What dies is DDD-flavored bureaucracy.&lt;/p&gt;

&lt;p&gt;The parts of DDD that remain valuable are the ones that help a team understand and protect domain meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business boundaries&lt;/li&gt;
&lt;li&gt;Ubiquitous language&lt;/li&gt;
&lt;li&gt;Bounded contexts&lt;/li&gt;
&lt;li&gt;Invariants&lt;/li&gt;
&lt;li&gt;State transitions&lt;/li&gt;
&lt;li&gt;Ownership of responsibilities&lt;/li&gt;
&lt;li&gt;Executable tests&lt;/li&gt;
&lt;li&gt;Types, constraints, and validation&lt;/li&gt;
&lt;li&gt;A clear view of where change will break things&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The parts that lose value are the purely ceremonial rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a Repository.&lt;/li&gt;
&lt;li&gt;Put it in a Use Case.&lt;/li&gt;
&lt;li&gt;Convert it to a DTO.&lt;/li&gt;
&lt;li&gt;Keep the Controller thin.&lt;/li&gt;
&lt;li&gt;Follow the directory structure.&lt;/li&gt;
&lt;li&gt;Make it look like the existing code.&lt;/li&gt;
&lt;li&gt;Avoid review comments by following the ritual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, these patterns can be useful.&lt;/p&gt;

&lt;p&gt;But their value depends on whether they represent actual domain boundaries, constraints, and responsibilities.&lt;/p&gt;

&lt;p&gt;If they express meaning, they remain.&lt;/p&gt;

&lt;p&gt;If they only enforce conformity, AI erodes their economic value.&lt;/p&gt;
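
&lt;p&gt;One way to picture the difference is to compare a mapper that only copies fields with a boundary function that enforces a contract. This is a sketch under assumptions: the function names, the payload shape, and the quantity limit are all hypothetical.&lt;/p&gt;

```python
def copy_fields(payload):
    """Ceremonial mapper: moves data across a layer but checks nothing."""
    return {"email": payload.get("email"), "quantity": payload.get("quantity")}


def parse_order_request(payload):
    """Boundary that protects a contract: malformed input never gets past it."""
    email = payload.get("email")
    quantity = payload.get("quantity")
    if not isinstance(email, str) or email.count("@") != 1:
        raise ValueError("email must contain exactly one @")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        raise ValueError("quantity must be an integer")
    if quantity not in range(1, 1000):  # assumed limit: 1 to 999
        raise ValueError("quantity must be between 1 and 999")
    return {"email": email, "quantity": quantity}
```

&lt;p&gt;Both functions sit at the same place in the directory structure. Only the second one would notice if generated code started passing nonsense across the boundary.&lt;/p&gt;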

&lt;h2&gt;
  
  
  What AI-Era Architecture Needs
&lt;/h2&gt;

&lt;p&gt;AI-era architecture should rely less on humans reading every line and more on executable checks.&lt;/p&gt;

&lt;p&gt;The premise that humans can semantically review every generated line of code is becoming weaker.&lt;/p&gt;

&lt;p&gt;Instead, we need systems where invalid changes fail quickly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invalid states should be impossible or difficult to represent.&lt;/li&gt;
&lt;li&gt;Invalid transitions should be rejected.&lt;/li&gt;
&lt;li&gt;Boundary crossings should include validation.&lt;/li&gt;
&lt;li&gt;Specification violations should fail tests.&lt;/li&gt;
&lt;li&gt;Types should encode constraints where possible.&lt;/li&gt;
&lt;li&gt;Runtime checks should protect what types cannot.&lt;/li&gt;
&lt;li&gt;Change impact should be localized.&lt;/li&gt;
&lt;/ul&gt;
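
&lt;p&gt;As a rough sketch of the first two points, a transition table can make invalid moves fail immediately instead of waiting for a reviewer to notice them. The order statuses and the &lt;code&gt;transition&lt;/code&gt; function below are hypothetical, chosen only to illustrate the idea in Python.&lt;/p&gt;

```python
from enum import Enum


class OrderStatus(Enum):
    DRAFT = "draft"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# The only transitions the (assumed) domain permits; anything absent is rejected.
ALLOWED = {
    OrderStatus.DRAFT: {OrderStatus.PAID, OrderStatus.CANCELLED},
    OrderStatus.PAID: {OrderStatus.SHIPPED, OrderStatus.CANCELLED},
    OrderStatus.SHIPPED: set(),
    OrderStatus.CANCELLED: set(),
}


class InvalidTransition(Exception):
    """Raised when code attempts a move the table does not list."""


def transition(current, target):
    """Return the new status, or raise if the move is not permitted."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} to {target.value} is not allowed")
    return target
```

&lt;p&gt;Generated code that tries to move a shipped order back to draft fails at the call site, no matter which layer it was written in.&lt;/p&gt;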

&lt;p&gt;This is the shift:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Architecture protected by human review
to
Architecture protected by tests, types, constraints, contracts, and specifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this shift, AI becomes a form-filling assistant.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI writes the repeated files.&lt;/li&gt;
&lt;li&gt;AI wires the layers together.&lt;/li&gt;
&lt;li&gt;AI moves logic into the expected place.&lt;/li&gt;
&lt;li&gt;Humans check the ceremony.&lt;/li&gt;
&lt;li&gt;The organization concludes that developers are still necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that is not the essence of AI-era software development.&lt;/p&gt;

&lt;p&gt;The point is not to generate more DDD-shaped code.&lt;/p&gt;

&lt;p&gt;The point is to build a system that can safely absorb AI-generated change.&lt;/p&gt;
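
&lt;p&gt;One possible reading of "safely absorb" is that the specification lives in code, and any implementation, hand-written or generated, has to pass it. The sketch below assumes a hypothetical &lt;code&gt;apply_discount&lt;/code&gt; function and a made-up rule that a discounted price stays between zero and the original price.&lt;/p&gt;

```python
def apply_discount(price_cents, percent):
    """Hypothetical implementation under test, possibly AI-generated."""
    if percent not in range(0, 101):
        raise ValueError("percent must be a whole number from 0 to 100")
    return price_cents - (price_cents * percent) // 100


# The specification: a few concrete cases plus one invariant over many inputs.
SPEC_CASES = [
    ((1000, 0), 1000),
    ((1000, 25), 750),
    ((1000, 100), 0),
]


def check_spec(fn):
    """Return a list of violations; an empty list means fn meets the spec."""
    violations = []
    for args, expected in SPEC_CASES:
        if fn(*args) != expected:
            violations.append(f"{args} should give {expected}")
    # Invariant: a discounted price never leaves the range 0..original.
    for price in (0, 1, 999, 1000):
        for percent in range(0, 101):
            if fn(price, percent) not in range(0, price + 1):
                violations.append(f"out of range for {price}, {percent}")
    return violations
```

&lt;p&gt;Nobody has to read the body of &lt;code&gt;apply_discount&lt;/code&gt; line by line to reject a bad rewrite. If &lt;code&gt;check_spec&lt;/code&gt; returns anything other than an empty list, the change does not go in.&lt;/p&gt;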

&lt;p&gt;What matters is not ceremony.&lt;/p&gt;

&lt;p&gt;What matters is a breakwater for domain meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boundaries&lt;/li&gt;
&lt;li&gt;Language&lt;/li&gt;
&lt;li&gt;Invariants&lt;/li&gt;
&lt;li&gt;State transitions&lt;/li&gt;
&lt;li&gt;Executable specifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A multi-layered architecture without these things is not domain-driven design.&lt;/p&gt;

&lt;p&gt;It is managerial residue in the shape of software.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Working Thesis
&lt;/h2&gt;

&lt;p&gt;Developers will not disappear overnight.&lt;/p&gt;

&lt;p&gt;In many organizations, AI will preserve existing development work for a while.&lt;/p&gt;

&lt;p&gt;Less experienced, lower-cost developers will become more productive with AI. They will generate DDD-flavored boilerplate, move data across layers, follow existing templates, and respond to formal review comments much faster.&lt;/p&gt;

&lt;p&gt;That will look like a major productivity gain.&lt;/p&gt;

&lt;p&gt;And locally, it will be one.&lt;/p&gt;

&lt;p&gt;But the deeper shift is elsewhere.&lt;/p&gt;

&lt;p&gt;The disruptive part is not:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Junior developers use AI to fill out DDD-shaped architectural forms faster.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The disruptive part is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;High-ownership engineers use AI to make the organizational structure itself lighter.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the real threat to bureaucratic software organizations.&lt;/p&gt;

&lt;p&gt;DDD is not dying.&lt;/p&gt;

&lt;p&gt;Cargo-cult DDD is.&lt;/p&gt;

&lt;p&gt;AI will not kill meaningful domain modeling.&lt;/p&gt;

&lt;p&gt;It will kill the economic rationale for using tactical DDD as architectural paperwork.&lt;/p&gt;

&lt;p&gt;And before that happens, AI will do something more ironic:&lt;/p&gt;

&lt;p&gt;It will make the bureaucracy cheaper.&lt;/p&gt;

&lt;p&gt;But cheaper bureaucracy is still bureaucracy.&lt;/p&gt;

&lt;p&gt;Eventually, large AI-assisted bureaucratic organizations will compete with small AI-amplified high-ownership teams.&lt;/p&gt;

&lt;p&gt;And from the outside, the difference will be obvious.&lt;/p&gt;

&lt;p&gt;One group will use AI to produce more ceremony.&lt;/p&gt;

&lt;p&gt;The other will use AI to remove the need for it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>discuss</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
