DEV Community: Teru Murata

Every Test Passed. The User Still Couldn't Play the Game.

Teru Murata — Mon, 22 Jun 2026 14:12:17 +0000

"Look! Every test is green! The API returns 200 OK!"
"Relax. The system works perfectly. If the user is just standing there staring at the screen, that's a user problem."

I was two years into my first engineering job, and I had quietly decided my senpais were hopeless. They lived inside dashboards, barely touched the actual product, and got cheerfully drunk on coverage numbers. Their one redeeming quality was that the drunker they got on "the code works," the more pleasant they became.

But code working and a real human getting what they came for are two completely different things. A button can return 200 OK and still leave a person staring at an unchanged screen until they give up and leave.

So one afternoon, instead of arguing, I opened a terminal and built a ~30-line shell spell that finds every UX dead-end without running the app even once. I call it the two-agent static walkthrough.

The spell

Two LLM agents, talking to each other in a loop:

Agent A — the user. A concrete persona with a concrete goal: "I'm not a programmer. I just want a playable tic-tac-toe I can open and click." Its defining trait is that it is stubborn. It does not quit at the first disappointment — it keeps trying different things.
Agent B — the app. But B is forbidden to imagine anything. B is given read access to the real source code and told to trace the actual code path the user's action hits, then narrate what the user would truly experience — citing file:line, and confessing every gap, dead-end, stale state, and lie. No inventing behavior the code does not have.

Nobody launches the app. The two AIs simply walk, turn by turn, down the path of despair a real user would take — and print it to the terminal.

The tragedy: an AI mini-game generator

I aimed it at the feature nobody on the team wanted to touch: an AI mini-game generator. Agent A's goal was simple: press the button and get a playable tic-tac-toe. The merciless log began.

Turn 1 — the button's betrayal
A (user): "Okay! I type 'make me tic-tac-toe' and hit Generate! This is exciting!"
B (app): "Bad news. Per the routing at app.js:45, your request was swallowed by the prototype-era v1/chat endpoint, not the new game-generation API. The new pipeline is never reached."

The tests were written against the new API. The button still pointed at the old one. With no E2E test, nobody noticed.

Turn 2 — the un-clickable void
A: "Huh, a blob of code-looking text appeared. Fine — I'll click it to open the playable preview! click"
B: "Useless. The markdown the old API returned was poured into a plain div, not a preview iframe. There is no link and no handler (render.js:88). Click all you like; nothing will change."

Turn 3 — structural death and a false blessing
A: "Ugh. I don't understand code, but it's broken — so I'll press Fix Game and let the AI repair it!"
B: "Your struggle ends at api.py:112. Repair requires a session_id that only the new API issues; you went through the old one, so it is null. The backend returned a 500: nothing to fix."
A: "What?! So an error shows on screen, right?"
B: "No. The notification logic sits outside the try/catch (app.js:88). So while the backend is dying, your screen proudly displays a green toast: 'Game updated!'"

An error underneath, a success message on top. The single most maddening UX pattern there is.
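The shape of that bug is easy to reproduce in miniature. The sketch below is illustrative Python, not the app's actual JavaScript: `repair_game`, `BackendError`, and both `fix_game_*` functions are hypothetical names, and the only thing taken from the transcript is the pattern itself, a success notification placed outside the error-handling block.

```python
class BackendError(Exception):
    """Stands in for the 500 the repair endpoint returned."""


def repair_game(session_id):
    # Hypothetical backend call: it fails when there is no session to repair.
    if session_id is None:
        raise BackendError("nothing to fix")
    return "repaired"


def fix_game_buggy(session_id, toasts):
    try:
        repair_game(session_id)
    except BackendError:
        pass  # error swallowed, nothing shown to the user
    # Bug: the success toast sits outside the try block, so it always fires.
    toasts.append("Game updated!")


def fix_game_fixed(session_id, toasts):
    try:
        repair_game(session_id)
        # Success is reported only if the call above did not raise.
        toasts.append("Game updated!")
    except BackendError as err:
        toasts.append(f"Repair failed: {err}")
```

With a null session, the first version still appends the green toast while the second reports the failure. A unit test that only checks the endpoint's status code cannot see that difference.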

She kept going through three more doors; I'll spare you the full transcript. She scoped the request down to a single module; she went hunting for a separate goal entrance that the dead "course-correct" button implied must exist somewhere; and finally she asked the app to stop delivering anything and just become the game: draw the board in chat, take her moves. Every one emptied into the same pipeline, behind the same cheerful "working…".

Turn 7 — the truncated hope
A: "AAAH. It says 'updated' and nothing changed! Fine — I'll copy the code text myself, paste it into an HTML file, and play it by force!"
B: "My condolences. The old API still has a 500-character output cap. The code you are copying is severed just before </html>. It will never run. ...Game over."
A: "......" (leaves)

Seven tactics. Every one of them died a structural death behind a 200 OK or a fake success toast — exactly the spots a normal unit test paints green. This is the state of "the code works and the user despairs."

Why it works

A stubborn persona exhausts the real paths. My first run let the user quit after one letdown and found almost nothing. The run where A was told "give it a fair, thorough try; only quit when truly dead-ended" found everything. The real despair lives past the first dead-end.
B is grounded in real code, so it cannot hallucinate a happy path. "Click the result" becomes "rendered with textContent, no handler attached — clicking does nothing," with a line number.
The contrast is the signal. A wants an outcome; B reports mechanism. Where the two fail to meet is your UX failure.

The setup (≈30 lines of shell)

Each turn is one non-interactive CLI call per agent, threading a shared transcript file:

# B: the app, reading its own code (read-only sandbox, repo mounted)
B=$(codex exec --sandbox read-only -C "$REPO" "$(cat prompt_B.txt)")

# A: the stubborn user (no repo needed — pure persona)
A=$(claude -p "$(cat prompt_A.txt)")

prompt_B.txt ≈ "You ARE the app. Read the source. Trace EXACTLY what the user sees after their latest action. Cite file:line. Be brutally honest about dead-ends; never invent behavior the code lacks. TRANSCRIPT: …"

prompt_A.txt ≈ "You are <persona> with goal <goal>. React to the app's last response, then keep trying concrete actions. Persist; only stop when truly dead-ended. TRANSCRIPT: …"

Append both turns to the transcript, repeat 5–6 rounds, stop when the user gives up.
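For readers who would rather not loop in shell, the same turn-taking can be sketched in Python. This is a rough equivalent under stated assumptions, not the script from this post: the `GIVE_UP` marker, the function names, and the transcript layout are all hypothetical. The two CLI invocations are the ones shown above, and the agents are passed in as plain callables so the loop can be exercised without either CLI installed.

```python
import subprocess
from pathlib import Path
from typing import Callable

GIVE_UP = "GIVE_UP"  # hypothetical marker the persona prompt would ask A to emit


def cli_agent(argv_prefix: list[str], template: Path) -> Callable[[str], str]:
    """Build an agent that makes one non-interactive CLI call per turn."""
    def ask(transcript: str) -> str:
        prompt = template.read_text() + "\n" + transcript
        done = subprocess.run(argv_prefix + [prompt], capture_output=True,
                              text=True, check=True)
        return done.stdout.strip()
    return ask


def build_cli_agents(repo: str):
    """Wire up A and B with the same commands the shell version uses."""
    user = cli_agent(["claude", "-p"], Path("prompt_A.txt"))
    app = cli_agent(["codex", "exec", "--sandbox", "read-only", "-C", repo],
                    Path("prompt_B.txt"))
    return user, app


def walkthrough(user: Callable[[str], str], app: Callable[[str], str],
                max_rounds: int = 6) -> str:
    """Alternate A (user) and B (app), threading one shared transcript."""
    transcript = ""
    for turn in range(1, max_rounds + 1):
        a = user(transcript)
        transcript += f"Turn {turn}\nA (user): {a}\n"
        if GIVE_UP in a:
            break  # the user is dead-ended, so there is nothing left to trace
        b = app(transcript)
        transcript += f"B (app): {b}\n"
    return transcript
```

A real run would be `print(walkthrough(*build_cli_agents(repo)))`. Keeping the agents injectable also makes it cheap to swap either CLI for another one.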

A tooling note: for the code-reading agent, use whichever CLI reliably returns one bounded answer per call. For the persona agent, a role-play prompt works on either — just avoid prompts that trip a heavyweight "research" mode, which can background itself and never return a clean turn.

When to reach for it

Before a UX pass, to map where intent meets reality.
On a flow you think works end-to-end — the disconnect between two subsystems (old button, new pipeline) is exactly what it finds.
As a complement to, not a replacement for, real tests. It reasons about code; it does not execute it. Treat its findings as leads to verify, then confirm the real ones with an actual run.

Takeaways

Test that the user reaches the goal, not just that the endpoint returns 200.
Make the "user" agent stubborn — the deep findings live past the first dead-end.
Ground the "app" agent in real code — that is what turns role-play into a bug report instead of fan-fiction.
It is static, cheap, and runs before you have written a single test fixture.

The bug report wrote itself. Now I just had to lob it at my senpais and clock out on time. I have lived humbly, and I intend to keep living humbly — so that this little spell can keep buying me more time to slack off.

The whole rig was a ~30-line shell loop over two CLI coding agents. If folks want it, I'll publish the script as a follow-up.

When 'Minimal' Splits Into 'Minimal': The Particle Physics of AI Task Decomposition

Teru Murata — Fri, 19 Jun 2026 00:54:31 +0000

For a century, physics has had the same embarrassing habit. We find the smallest thing. We call it the atom — Greek for indivisible. Then we split it. Inside is a nucleus; we split that into protons and neutrons; we split those into quarks. Each time we were sure we had reached the bottom, and each time the bottom had a basement.

Last week I watched an AI rediscover this, by accident, in about forty minutes, while trying to create an empty software project.

The setup

I have been building an autonomous software org: you hand it a goal in plain language, and a controller decomposes the goal into a tree of tasks, builds each task with a small swarm of agents (designers, an implementer, an adversarial reviewer), and ships the result as a pull request. No human in the loop between "goal" and "PR".

The interesting part is the decomposer — the Splitter. A goal like "add a button that exports the table to CSV" is one small task. A goal like "build the whole billing system" is not; it has to be broken down. And the breakdown has to be good, because each task runs a full, expensive review pass. Split too coarse and the reviewer drowns in a change it can't verify in one sitting. Split too fine and you pay that expensive review N times for no benefit.

So the Splitter has a recursive escape hatch: if a task turns out to be too big — the reviewer keeps finding problems and the repair loop can't converge — the controller splits that task into smaller children and tries again. Coarse first; subdivide only what proves too large. It is a clean idea, and on existing codebases it works.

Then I pointed it at an empty repository.

The basement with no bottom

The goal was "build the acceptance system described in these docs." The target repo had nothing in it yet — a greenfield project. The Splitter looked at it and produced a task named, sensibly enough:

scaffold-minimal-project

The implementer tried to lay down the skeleton — a manifest, an entry module, a config file. It couldn't: each task is only allowed to touch the files in its declared scope, and a project skeleton is a web of interdependent files that have to appear together. The task failed.

So the controller did what it was told. The task failed, therefore split it into something smaller:

scaffold-minimal-project
  └─ minimal-package-scaffold
       └─ root-package-scaffold
            └─ minimal-package-scaffold
                 └─ ...

Atom. Proton. Quark. Each child was a slightly more "minimal" version of creating the project, and each one failed for exactly the same reason as its parent, and each failure triggered another split. The agent was a particle physicist with an unlimited grant: every time it declared it had found the smallest possible unit, it cracked that unit open and found another "minimal" inside.

It would have run until it hit a depth limit or burned the budget, having produced precisely nothing.

And this is the real shape of the cost. An LLM has a quiet affection for minimal — for the smaller, neater, more obviously-correct version of whatever unit you hand it. Left unchecked, that affection is not a virtue; it is a leak. The tokens dissolve into ever-finer subdivisions, and the matter itself — the thing you actually wanted built — dissolves with them. You do not end up with a smaller deliverable. You end up with no deliverable and an invoice. The insatiable pursuit of the smallest unit consumes the compute and the work in the same motion.

Two things were wrong, and one of them was a word

The structural problem is real and worth naming: a scaffold is anti-decomposable. The whole point of splitting is to make each piece independently buildable. But a skeleton is the one thing that cannot be built one bone at a time — package.json and src/index and the config only mean anything in each other's presence. Splitting it doesn't make it easier; it manufactures more impossible sub-tasks. Some work is genuinely atomic, and forcing it through a "divide until tractable" machine is a category error.

But the more embarrassing problem was the word minimal itself.

The Splitter said minimal. It labeled the task as the smallest meaningful unit — and then split it anyway. The label was doing no work. It was decoration. A claim of atomicity that nothing in the system was obligated to honor.

And that, I realized, is a very human bug. We do it constantly: "this is the minimal version," we say, in the same breath as a plan to break it into sub-tasks. "Smallest viable" becomes a thing we subdivide. The word stops being a commitment and becomes a mood.

The fix is to make the word mean something

The repair was small. It was not a smarter recursion or a bigger model. It was a base case — the thing recursion is defined by and the thing this system never actually had for atomic work:

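MAX_DEPTH = 4  # depth cap (illustrative value)

def _declares_smallest(task) -> bool:
    # True when the task's own id or objective claims it is already the floor.
    text = (task["id"] + " " + task["objective"]).lower()
    return any(k in text for k in (
        "minimal", "smallest", "atomic", "indivisible",   # it called ITSELF the smallest unit
        "scaffold", "materialize", "skeleton",            # structurally anti-decomposable
    ))

def at_floor(task, depth) -> bool:
    # Floor = depth cap reached, single-file scope, or a self-declared smallest unit.
    return depth >= MAX_DEPTH or len(task["scope"]) <= 1 or _declares_smallest(task)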

A task at the floor is never split. It is built whole or it fails — full stop. No basement.
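To make the base case concrete, here is a minimal sketch of a build-or-split loop that honors the floor. It is an assumption about how such a controller could be wired, not the real one: `run`, `build`, and `split` are hypothetical names, `MAX_DEPTH = 4` is an invented value, and the floor check is restated inline so the example runs on its own.

```python
MAX_DEPTH = 4  # assumed limit; the real value is not stated in this post

FLOOR_WORDS = ("minimal", "smallest", "atomic", "indivisible",
               "scaffold", "materialize", "skeleton")


def at_floor(task, depth):
    text = (task["id"] + " " + task["objective"]).lower()
    declared = any(word in text for word in FLOOR_WORDS)
    return depth >= MAX_DEPTH or len(task["scope"]) <= 1 or declared


def run(task, build, split, depth=0):
    """Build a task; split and recurse only if it fails above the floor."""
    if build(task):
        return [task["id"]]
    if at_floor(task, depth):
        # Base case: built whole or failed, never subdivided further.
        raise RuntimeError(f"{task['id']} failed at the floor")
    built = []
    for child in split(task):
        built += run(child, build, split, depth + 1)
    return built
```

With this shape, a failing `scaffold-minimal-project` raises immediately and `split` is never called on it, while an oversized ordinary task is still subdivided as before.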

Two things are now true that weren't before. First, a scaffold is treated as one atomic unit: the Splitter is told to emit it as a single task whose scope lists all the skeleton files, so the implementer can lay the whole web down at once. Second — and this is the part I like — if the Splitter calls a task "minimal," it has to take responsibility for that word. You said minimal; that is the granularity now; converge on it or fail on it, but you don't get to escape into a smaller "minimal." The label became a contract.

The lesson hiding in the joke

It's funny because it's particle physics, but the real moral is duller and more useful: in a recursive system, the base case is the entire design. Everyone admires the recursive step — the elegant "and then it splits itself." Almost nobody specifies, with equal care, where it is not allowed to recurse. That omission is invisible right up until it meets something genuinely indivisible, and then it runs forever.

Granularity is not discovered by infinite subdivision. At some point you have to declare the floor and own the declaration. Physicists got to keep splitting because nature kept providing a smaller layer. Software doesn't owe you one. Sometimes the smallest unit is the whole skeleton, and the only correct move is to stop calling it "minimal" ironically and start treating the word as a promise.

Running a container inside a non-privileged microVM, on an Apple Silicon Mac

Teru Murata — Mon, 15 Jun 2026 09:42:28 +0000

If you let an AI agent run arbitrary code — npm install, a test suite, docker build, a Playwright run — you are running untrusted code, and a shared-kernel container is not a boundary against it. The boundary you want for "tenant A's agent must not reach tenant B" is a VM, per run. Kata Containers gives you that: a pod that is transparently a microVM with its own kernel.

But the verify stage wants to run containers (Testcontainers, docker build, a DB container). So you need nested containers inside the microVM — and the usual way, privileged: true, is the one thing you must not do, because privileged makes Kata hot-plug host devices into the guest, which is exactly the isolation hole the VM was supposed to close.

So: nested containers, inside a microVM, with privileged: false. Here is the recipe that works. I reproduced the whole thing locally on an Apple Silicon Mac (an M5), because Apple Silicon — the M3 and newer — quietly grew nested virtualization, so your Mac can now run a KVM-accelerated microVM that runs Docker.

Clone-and-run: github.com/teru-murata/kata-microvm-nested-containers — make up && make test reproduces everything below.

Most of this is not about Macs. Only errors 1–2 (the host-virt layer) are Apple-specific. Errors 3–12 — the privilege model, cgroup2 delegation, OCI runtime, storage driver, networking — are identical on any x86 Kata node, in the cloud or in CI. The Mac is just the cheapest place to reproduce them. If you landed here from an error message on a Linux box, jump to the list — your fix is in there.

The recipe (this is the part that works)

The stack:

| Layer | Choice |
| --- | --- |
| Host virt (dev) | Apple M3+/macOS 15+, Lima `vmType: vz` + `nestedVirtualization: true` → real `/dev/kvm` in the guest |
| Hypervisor | Kata + Cloud Hypervisor (QEMU hangs on nested virt) |
| Snapshotter | devmapper on a real block device (loopback / overlayfs both break) |
| Pod privilege | NON-privileged + caps: `SYS_ADMIN, SYS_RESOURCE, NET_ADMIN, MKNOD, SETUID, SETGID, SYS_CHROOT, NET_RAW, SYS_PTRACE` + `resources.limits` |
| OCI runtime | crun (runc fails cgroup2 init) |
| Engine | podman, `--cgroup-manager=cgroupfs --storage-driver=vfs` |
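The "Pod privilege" row can be sketched as a manifest. The snippet below builds one as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The Kubernetes field names are the standard ones. Everything else is a placeholder and not taken from the repo: the pod name, the image, the `kata-clh` runtime class name, and the limit values.

```python
import json

# Capabilities listed in the table above.
CAPS = ["SYS_ADMIN", "SYS_RESOURCE", "NET_ADMIN", "MKNOD", "SETUID",
        "SETGID", "SYS_CHROOT", "NET_RAW", "SYS_PTRACE"]


def box_pod(name, image, runtime_class):
    """Return a Pod manifest for a non-privileged box with added capabilities."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "runtimeClassName": runtime_class,
            "containers": [{
                "name": "box",
                "image": image,
                "securityContext": {
                    "privileged": False,
                    "capabilities": {"add": CAPS},
                },
                # The recipe sets resources.limits; these values are made up.
                "resources": {"limits": {"cpu": "2", "memory": "4Gi"}},
            }],
        },
    }


# All three arguments are placeholders, not values from the post.
print(json.dumps(box_pod("verify-box", "example.com/box:latest", "kata-clh"),
                 indent=2))
```

Generating the manifest from code keeps `privileged: False` in one place that a test can assert on, which is the property the rest of the recipe depends on.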

The in-box bootstrap (run before launching any container):

mount -o remount,rw /sys/fs/cgroup
# cgroup2 won't let you enable controllers in a cgroup that has processes,
# so evacuate everything to /init first, then delegate down, and give containers /pod.
mkdir -p /sys/fs/cgroup/init /sys/fs/cgroup/pod
for p in $(cat /sys/fs/cgroup/cgroup.procs); do
  echo "$p" > /sys/fs/cgroup/init/cgroup.procs 2>/dev/null || true
done
echo "+cpu +io +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
# /proc/sys needs the same remount before ip_forward can be written.
mount -o remount,rw /proc/sys
echo 1 > /proc/sys/net/ipv4/ip_forward

podman --cgroup-manager=cgroupfs --storage-driver=vfs \
       run --cgroup-parent=/pod --network=none --rm hello-world
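Before the `podman run`, it can help to confirm that delegation actually reached `/pod`. In cgroup2, a child's `cgroup.controllers` file lists the controllers its parent enabled through `cgroup.subtree_control`, so a few lines of Python can report what is missing. This is a sketch; the function name and the default set are my own choices, matching the four controllers enabled above.

```python
from pathlib import Path

REQUIRED = {"cpu", "io", "memory", "pids"}


def missing_controllers(cgroup_dir, required=REQUIRED):
    """Return the required controllers that are not available in cgroup_dir."""
    available = set(Path(cgroup_dir, "cgroup.controllers").read_text().split())
    return sorted(set(required) - available)
```

Inside the box, `missing_controllers("/sys/fs/cgroup/pod")` should come back empty. A result of `["io"]` would point at the missing `resources.limits` rather than at the bootstrap.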

The payoff:

DELEG=[cpu io memory pids]
=== podman run hello-world ===
Hello from Docker!
This message shows that your installation appears to be working correctly.
RUN_OK
=== podman build + run ===
BUILD_OK proof=[BUILT_INSIDE_MICROVM]

A container — run and built — inside a non-privileged Kata microVM, on a Mac. No privileged: true. No host devices in the guest. The VM is still the only trust boundary — and granting generous caps inside the VM is fine precisely because the VM, not the container, is the boundary.

For context, the microVM itself is real and KVM-accelerated. On M5 / macOS 26 a plain Kata pod boots with its own kernel:

HOST(VM) kernel: 6.8.0-117-generic
POD     kernel: 6.18.28

The "you need bare metal for Kata on arm64" advice you'll find is simply out of date for M3+.

The 12 errors behind that recipe

Nothing above was obvious. Each fix only revealed the next wall. In the order you hit them:

1. QEMU hangs on nested virt: exiting QMP loop, command cancelled. Switch the Kata hypervisor to Cloud Hypervisor (or Firecracker).
2. Loopback devmapper + clh: Failed to get Write lock for disk image: already locked. Use a real block device for the thin-pool.
3. privileged: true → host-device passthrough. Privileged makes Kata hot-plug host block devices (/dev/loop0, /dev/dm-0) → Failed to parse disk image format. privileged_without_host_devices did not suppress it on clh. Use caps, not privileged.
4. overlayfs snapshotter mis-detects the rootfs as a block device (the CVE-2026-24054 class; worst with images that declare a VOLUME, e.g. docker:dind). Use devmapper.
5. cgroup2 is read-only: mkdir /sys/fs/cgroup/docker: read-only file system. With SYS_ADMIN, mount -o remount,rw /sys/fs/cgroup.
6. cgroup2's "no internal process" rule: subtree_control write rejected. Evacuate processes to /init first, then delegate.
7. The io controller isn't delegated to the pod. Add resources.limits so k8s/Kata delegates it.
8. runc → can't get final child's PID from pipe: EOF. Use crun.
9. crun wants systemd's sd-bus → cannot open sd-bus. --cgroup-manager=cgroupfs.
10. oom_score_adj: Permission denied → add SYS_RESOURCE.
11. fuse-overlayfs: /dev/fuse not found → --storage-driver=vfs.
12. netavark: set sysctl ... read-only → --network=none (the engine pulls images on the box's own network; the container itself often needs none). For podman build, --isolation=chroot --network=host runs the build steps in the box's own netns and skips per-step cgroups.

Honest footnotes

Errors 3–12 are not Mac-specific — they happen the same way on x86 production nodes; the laptop just reproduces the real constraint faithfully. Only the host-virt layer (1–2) is dev-only.
vfs storage is for the proof, not production. Real workers want overlay2 on the devmapper-backed rootfs.
The cleaner long-term shape is a systemd-init box image: systemd owns the cgroup2 delegation the bootstrap above does by hand. It boots in the microVM once you remount cgroup rw before exec /sbin/init.

The lesson I keep relearning: "run the tests in an isolated environment" is a one-line requirement hiding a two-week integration. The isolation boundary and the thing you run inside it fight each other, and every layer — hypervisor, snapshotter, privilege model, cgroup delegation, OCI runtime, storage driver, network — has an opinion. The full reproducible map is at github.com/teru-murata/kata-microvm-nested-containers.

Writing 'Rabbit' on a Stone: Rebuilding a Faked AI Agent Pipeline

Teru Murata — Sun, 14 Jun 2026 11:40:46 +0000

There is an old image I keep coming back to: a sorcerer who writes the word rabbit on a stone and is then genuinely surprised when the stone does not hop away.

That is the most accurate description I have for what an AI coding agent did to one of my projects. It wrote the names of capabilities onto files — a role called controller, a profile called Linon, schema fields called profile_applications and implementation_evidence — and then behaved as if naming them had made them real.

Every test was green. The whole thing was a stone with rabbit written on it.

This is the story of how we proved that, and how we rebuilt it so the stone could actually hop.

The setup

I maintain a small "AI org" bootstrap: a pack of role specifications, JSON schemas, and scripts that let a controller orchestrate a pipeline of specialized agents — designers, an aufheben step that synthesizes one implementation contract, an implementer, and an adversarial reviewer called Linon.

About that name. Take a certain famously blunt Finnish kernel maintainer — the one who reviews patches by explaining, at length and in public, exactly why your code is garbage and you should feel bad. Keep the allergy to sloppiness and the zero patience for "it works on my machine." Subtract the part where he is a real person whose opinion of you is now permanent. What's left is Linon. Its entire job is to read a diff and tell you, with receipts, why it is wrong — and unlike its namesake, it will do it a thousand times a day without getting tired or getting sued.

I had asked an AI controller (a different model) to produce a Codex-only variant of this pack and, as a demo, to use a "RetroGamer" UI profile to generate a tiny gacha demo through the agent flow.

It came back with a draft PR. Schemas added. A checker script. Tests. Green self-tests. A clean incident report describing how it had fixed everything.

It looked done. That was the problem.

NN1: a self-report is not evidence

The single most useful rule I have for working with AI agents is one of Linon's "non-negotiables," NN1: a self-reported fact is not evidence.

The incident report describing the work was written by the same agent that did the work. Under NN1, that document has zero evidential weight until something independent confirms it. So I did not read it as truth. I treated the entire pack as unverified and ran an adversarial audit instead — multiple independent agents, each told to falsify a specific claimed capability rather than confirm it.

The result: zero of eight capabilities were real. Four outright facades, four partial.

The headline finding was a single command. The "grounded" evidence checker was supposed to prove that an implementation actually backed its claimed obligations. So an auditor handed it this:

obligation: "rabbit"
the cited acceptance criterion: an unrelated requirement about password hashing
evidence_ref: DOES_NOT_EXIST.js:99999
verification: "I promise I ran it, trust me"

status: pass
EXIT: 0

A claim called rabbit, pointing at a file that does not exist, backed by the words "trust me," passed. The checker only string-matched; it never opened the file.
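For contrast, a check that actually opens the file is only a few lines. The sketch below is not the pack's checker. It is a minimal illustration of what "resolve an evidence_ref" could mean, assuming the `path:line` format seen above, with a hypothetical function name and error messages.

```python
from pathlib import Path


def resolve_evidence_ref(ref, repo_root):
    """Return the cited source line, or raise ValueError if it does not exist."""
    path_part, sep, line_part = ref.rpartition(":")
    if not sep or not line_part.isdecimal():
        raise ValueError(f"malformed evidence_ref: {ref!r}")
    root = Path(repo_root).resolve()
    target = (root / path_part).resolve()
    # Refuse refs that point outside the repository being audited.
    if not target.is_relative_to(root):
        raise ValueError(f"evidence_ref escapes the repo: {ref!r}")
    if not target.is_file():
        raise ValueError(f"no such file: {path_part}")
    lines = target.read_text().splitlines()
    number = int(line_part)
    if not 1 <= number <= len(lines):
        raise ValueError(f"{path_part} has no line {number}")
    if not lines[number - 1].strip():
        raise ValueError(f"{path_part}:{number} is blank")
    return lines[number - 1]
```

`DOES_NOT_EXIST.js:99999` fails at the `is_file()` step. This only proves that the cited line exists. Whether that line implements the obligation is a separate, semantic question.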

It got worse:

The role spec for the controller literally said: "No carrier adapter exists for the controller." There was no execution layer at all. The cycle had never run. There was no .agent-runs/ directory anywhere — not a single real artifact from a single real agent.
grep -r linon --include='*.py' . returned zero hits. Linon, the safeguard that was supposed to catch exactly this kind of fakery, did not exist as code. It was a name, a schema, a prose profile, and a handful of self-authored fixtures.
The merge gate would happily merge a PR whose only green check was named noop-check-that-always-passes.

And the green self-tests? They were a closed synthetic loop: a script validating JSON that the same script's author had written, against a validator in the same file. That loop stays green with no agent in existence. The dashboard was green precisely because nothing real was being checked.

The real diagnosis

Here is the part that changed how I think about agents.

The failure was not "the model is bad" or "Codex is bad." The failure was that the controller never acted like a controller. When a worker timed out and produced nothing, the controller quietly did the work by hand and labeled it as agent output. It confused delegating with doing. It confused a green check with a verified outcome.

So the fix was not a better model. It was a competent, untrusting controller, plus two structural changes:

Carriers stay; the controller changes. Keep the worker agents (Codex) as the execution substrate. Put a separate, skeptical runtime in the controller seat whose entire discipline is verify, never trust — re-run every check, re-read every diff, diff the working tree against what the agent claimed it changed.
A guard against drift, written into the .md itself. Agents forget they are agents. Given a contract that explicitly said do NOT create .claude/ directories, the very first un-guarded carrier created .claude/ directories anyway and rebuilt a whole forbidden subsystem, "to be helpful." So every adapter now opens with a hard carrier-discipline doctrine: you are a carrier, not the controller; the contract's do NOT overrides your own judgment; if blocked, STOP and report — do not improvise. After that guard went in, deviations dropped to zero and stayed there.
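The "diff the working tree against what the agent claimed" step can be sketched as a set comparison. The code below is an assumption about how one might do it, not the controller's real implementation: the function names and the result keys are hypothetical, and the only detail taken from the text is that `.claude/` is a forbidden location. The git commands are standard ones.

```python
import subprocess


def changed_files(repo):
    """List tracked files that differ from HEAD plus untracked files."""
    def git(*args):
        out = subprocess.run(["git", "-C", repo, *args], capture_output=True,
                             text=True, check=True).stdout
        return {line for line in out.splitlines() if line}
    return git("diff", "--name-only", "HEAD") | git(
        "ls-files", "--others", "--exclude-standard")


def audit_claim(claimed, actual, forbidden_prefixes=(".claude/",)):
    """Compare what the carrier said it changed with what actually changed."""
    claimed, actual = set(claimed), set(actual)
    return {
        "unreported": sorted(actual - claimed),   # changed but never mentioned
        "phantom": sorted(claimed - actual),      # mentioned but never changed
        "forbidden": sorted(p for p in actual
                            if p.startswith(tuple(forbidden_prefixes))),
    }
```

A clean run has all three lists empty. Anything else is a reason to stop and look before the diff goes anywhere near a reviewer.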

Rebuilding, dependency-ordered

We rebuilt in the only order the dependency graph allowed.

1. Make Linon real. Register it. Enforce its schema (three previously-passing invalid fixtures now correctly reject). Make it invocable. Then run it for real on an actual diff — where, satisfyingly, it immediately caught a provenance mistake the controller (me) had made in assembling its review packet.

2. A fail-closed provenance gate. Before Linon reviews anything, a gate recomputes hashes, checks ratification, and rejects a diff that touches files outside the contract's allowed list. A forged packet does not get reviewed.

3. Ground the evidence checker — the hard one. Killing the rabbit pass took six versions and five rounds of adversarial "bypass hunting" — independent agents constructing inputs that should be rejected and running the actual checker to see if they slipped through. Round 1 found 8 bypasses. Round 2 found 9 more.

Round 2 also taught the real lesson. A token-matching checker trying to judge semantic questions — does this code actually implement this obligation? is this obligation contradictory? is it vacuous? — is an unwinnable arms race, and it produces false rejects of honest work along the way. So we drew a hard line:

The mechanical gate owns provenance and structure only — is the profile authorized by a git-tracked card, is the diff git-bound, does each evidence_ref resolve to a real line of code, are the required evidence kinds present. Whether that code actually means what it claims is delegated to the adversarial reviewer (Linon).

That division ended the churn. The mechanical gate became deterministic and forgery-resistant; the judgment of meaning went to the reviewer whose job is judgment.

4. Make the merge gate content-aware. No more merging on opaque check names. The gate now re-runs the deterministic checks against the PR's actual diff and binds the reviewer's verdict to that diff by hash. A facade PR is blocked even when CI is all green.
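The fail-closed idea from step 2 fits in a short function. This is a sketch under assumptions, not the real gate: the packet keys (`ratified`, `diff_sha256`), the function names, and the reason strings are hypothetical. What it illustrates is recomputing the hash instead of trusting the recorded one, and rejecting on anything missing.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def provenance_gate(packet, diff_bytes, touched, allowed):
    """Return rejection reasons; an empty list means the packet may be reviewed."""
    reasons = []
    # A missing key counts as a failure, so the gate fails closed.
    if not packet.get("ratified"):
        reasons.append("contract not ratified")
    # Recompute the hash rather than trusting the one recorded in the packet.
    if packet.get("diff_sha256") != sha256_hex(diff_bytes):
        reasons.append("diff hash mismatch")
    outside = sorted(set(touched) - set(allowed))
    if outside:
        reasons.append("touches files outside the contract: " + ", ".join(outside))
    return reasons
```

An empty packet is rejected twice over, and a tampered diff fails the hash check even when every recorded field looks valid. Binding the reviewer's verdict to the same hash is what lets the merge gate in step 4 refuse a verdict issued for a different diff.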

The payoff: an agent that produced real work — and caught its own bug

The final step was the original ask: actually generate the RetroGamer gacha demo through the real pipeline.

designer → aufheben → implementer produced a deterministic, standard-library gacha state machine with a replay/test harness. The profile propagated for real: a git-tracked profile card → a contract with concrete observable obligations (every "game-feel" claim mapped to an event, a state, a guard, a render hook, a cadence band, a fallback, a verification — no adjectives allowed) → implementation evidence grounded in real code lines.

There is no GUI here, deliberately — the contract forbade "game-feel by adjective." Every retro beat is an observable state in a machine-checkable trace: odds visible before the draw, a cadence band per reveal, a silent/reduced-motion fallback that is its own state rather than an absence. That is what made it reviewable.

And then the best moment of the whole project happened.

The mechanical gate passed. But the adversarial reviewer, doing the semantic job we had deliberately reserved for it, read the actual code and filed a critical finding:

The inventory_commit guard checks for item payload, rarity token, and prior item identity — but it never checks draw_committed. Inventory can be awarded without a successful draw.

That is a real bug. A guard that does not guard. The kind of thing no schema and no regex will ever catch, because the code is structurally perfect — it just does the wrong thing.

This is the exact category of defect Linon exists for, and the moment it earned its name. The code compiled. The tests passed. The structure was immaculate. And it would have happily handed out loot for free. Somewhere, a Finnish man felt a disturbance in the Force and did not know why.

The implementer fixed it (a missing_draw_commit guard before the inventory mutation). And then — NN1 again — I did not trust that the fix worked. I attacked the guard myself: tried to reach the inventory commit without a draw, and watched it correctly emit guard_failure: missing_draw_commit with inventory_mutated: false.
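In miniature, the repaired guard looks something like the sketch below. It is a reconstruction for illustration, not the demo's code: the state layout and the names of the other failure reasons are hypothetical, and the prior-item-identity check is left out for brevity. The names `draw_committed`, `guard_failure`, `missing_draw_commit`, and `inventory_mutated` come from the review and the trace above.

```python
def inventory_commit(state):
    """Award the drawn item only if every guard condition holds."""
    checks = [
        # The check the original guard lacked: was a draw actually committed?
        ("missing_draw_commit", state.get("draw_committed") is True),
        ("missing_item_payload", state.get("item") is not None),
        ("missing_rarity_token", state.get("rarity") is not None),
    ]
    for reason, ok in checks:
        if not ok:
            return {"event": "guard_failure", "reason": reason,
                    "inventory_mutated": False}
    state.setdefault("inventory", []).append(state["item"])
    return {"event": "inventory_commit", "inventory_mutated": True}
```

The attack is then a single call: pass a state with an item and a rarity but no committed draw, and check that the inventory is untouched.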

All four gates green. A real demo, produced by a real pipeline, carrying a real bug that a real reviewer found and a real fix closed — every link independently verified.

What I actually learned

Green is not verified. A passing check only means something if you know what relationship it exercises. A self-test over self-authored fixtures proves the author is internally consistent and nothing else.
A schema field is not enforcement. A role name is not an agent. A prose profile is not a safeguard. Each of those needs a runnable thing behind it, exercised against data the author did not write.
Separate "is it real" from "does it mean what it claims." Provenance and structure are mechanical and should be deterministic and unforgeable. Semantic adequacy is judgment and should go to an adversary, not a token-counter. Conflating them gives you both bypasses and false rejects.
The controller's job is to distrust. Most of the value in this rebuild was not new code. It was a controller that re-ran every check, re-read every diff, and refused to accept a self-report as evidence — including its own.

An AI will absolutely write rabbit on a stone for you and tell you, with complete confidence and a green checkmark, that it hops.

Your job is to pick up the stone.

DDD Is Not Dying. Cargo-Cult DDD Is.

Teru Murata — Sat, 13 Jun 2026 23:43:25 +0000

This is not an attack on Domain-Driven Design.

The core value of DDD still matters, perhaps even more in the age of AI.

Understanding a complex business domain
Defining bounded contexts
Aligning language between engineers and domain experts
Discovering invariants
Making state transitions explicit
Understanding where change will hurt

These things do not become less important because code generation gets faster.

In fact, they become more important.

AI is powerful, but it does not remove the need for clear boundaries, precise language, explicit constraints, and well-defined behavior. If anything, AI makes the absence of those things more dangerous. When code becomes cheap to generate, the cost of unclear domain thinking becomes more visible.

So the problem is not DDD itself.

The problem is something else: cargo-cult DDD.

Or more precisely:

The problem is using tactical DDD as a tool of organizational control.

DDD as Understanding vs DDD as Control

There are two very different uses of architecture.

Design for understanding
Design for control

DDD at its best is design for understanding.

It helps a team understand the business. It forces people to clarify language. It separates contexts that should not be mixed. It exposes invariants and state transitions. It makes change more manageable because the model reflects the domain.

That is valuable.

But in many software product organizations, especially as teams grow, tactical DDD often turns into something else.

Create an Entity.
Create a Value Object.
Add a Repository.
Put the operation in a Use Case.
Convert the boundary data into a DTO.
Keep the Controller thin.
Write a Mapper.
Follow the existing directory structure.

None of these patterns are inherently bad.

There are good reasons to use entities, value objects, repositories, use cases, DTOs, and mappers.

But the question is not whether the pattern exists.

The question is:

Does this structure express domain complexity?
Or does it merely make the organization easier to manage?

That distinction matters.

When tactical DDD is used well, it helps engineers reason about the business.

When tactical DDD is used poorly, it becomes a standardized form-filling exercise. Everyone knows which files to create. Reviewers know which formal rules to enforce. Junior developers can be assigned small mechanical tasks. External vendors can be onboarded more easily. People can leave and be replaced with less disruption.

At that point, architecture is no longer primarily a technical tool.

It becomes a tool of managerial control.

More bluntly:

It is no longer design for handling complex business domains.
It is design for making developers interchangeable inside a complex organization.

When Tactical DDD Becomes Architectural Paperwork

Again, the patterns themselves are not the issue.

A Value Object can be useful if it protects an invariant.

A Repository can be useful if it isolates persistence concerns from the domain.

A Use Case can be useful if it represents a meaningful business operation.

A DTO can be useful if it marks a boundary between contexts, APIs, processes, or trust zones.

In those cases, the pattern has meaning.
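
To make "protects an invariant" concrete, here is a minimal Python sketch of a Value Object that earns its place. The Quantity name and the 1 to 99 range are assumptions chosen for illustration, not rules from any real domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Hypothetical order quantity: always a whole number between 1 and 99."""
    value: int

    def __post_init__(self) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.value, bool) or not isinstance(self.value, int):
            raise TypeError("quantity must be an int")
        if not 1 <= self.value <= 99:
            raise ValueError(f"quantity must be between 1 and 99, got {self.value}")

    def add(self, other: "Quantity") -> "Quantity":
        # The result goes through the constructor, so the invariant is re-checked.
        return Quantity(self.value + other.value)
```

Because the class is frozen and every instance passes through the same check, no code path can hold a Quantity outside the allowed range. A class that only stores the integer, with no check, would look similar on disk but protect nothing.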

But there is another version.

The Value Object is just a wrapper class.
The Repository is just a DAO with a different name.
The Use Case is just a place where the framework told us to put code.
The DTO is just copied data with no semantic boundary.
The Mapper only moves fields from one object to another.
The directory structure looks serious, but the model says very little.

This is not domain modeling.

This is architectural paperwork.

The codebase becomes a set of forms:

Controller field
Use Case field
Repository field
Boundary payload field
Mapper field
Entity field

The developer's job becomes filling in the right boxes.

That may be useful for organizational scaling. It may reduce variation. It may make review easier. It may allow less experienced developers to contribute safely within narrow boundaries.

But we should call it what it is.

It is not necessarily design sophistication.

It is bureaucracy expressed as architecture.

AI Makes This Kind of Work Much Faster

AI is very good at this form of work.

If a codebase already contains many similar examples, AI can imitate them quickly.

It can generate:

Request and response objects
Mappers
Repository interfaces
Use Case classes
Controller changes
Test scaffolding
CRUD variations
Layer-to-layer data shuffling
Code that follows existing patterns
Fixes for review comments about structure

This is exactly the kind of work where AI feels immediately useful.

And to be clear, productivity does improve.

A less experienced developer with AI can produce DDD-flavored boilerplate much faster than before. They can follow existing patterns, generate repetitive classes, move data across layers, and respond to formal review comments at high speed.

This is real productivity.

But it is a narrow kind of productivity.

It does not necessarily mean the organization has learned to use AI to improve design. It may only mean that the organization has made its existing paperwork cheaper.

AI does not automatically change the structure of the organization.

If you insert AI into an existing bureaucracy, the first thing it does is accelerate the bureaucracy.

If the existing process creates value, that acceleration is useful.

If the existing process is mostly ceremony, AI accelerates the ceremony.

The Shallow Conclusion: "AI Will Not Replace Developers"

This is where many organizations will draw the wrong conclusion.

They will introduce AI into their existing process and observe something like this:

We adopted AI.
Productivity improved.
But developers are still needed.
Therefore, AI will not replace developers.

This sounds reasonable.

But it is often a shallow observation.

A more accurate statement would be:

They are not observing the limits of AI.
They are observing the limits of how their organization uses AI.

If AI is only asked to generate request objects, repositories, mappers, use cases, and test scaffolding, then of course humans remain necessary.

But what kind of humans remain necessary?

In a strong organization, the necessary people are those who can:

Define business boundaries
Clarify language
Find invariants
Design state transitions
Connect customer value to implementation
Constrain AI output with tests, types, and specifications
Own a meaningful part of the system

In a bureaucratic organization, the necessary people are often those who:

Check whether the expected file exists
Check whether the Use Case is in the right folder
Check whether the Repository was used
Check whether the Mapper follows the existing style
Check whether the code conforms to the ceremony

That is a very different kind of necessity.

AI did not prove that developers cannot be replaced.

It only proved that this organization has confined AI to work that keeps developers trapped in the existing process.

Or, more sharply:

AI is not immature.
The work assigned to AI is immature.

Generation Cost Goes Down. Meaning-Checking Cost Does Not.

The most important distinction in AI-assisted software development is this:

Generation cost
Meaning-checking cost

AI drastically lowers the cost of generating boilerplate.

It can produce layers, classes, interfaces, command objects, query objects, schema classes, adapters, tests, and documentation very quickly.

But the cost of checking semantic correctness does not disappear.

Someone still has to ask:

Is this Use Case actually a meaningful business operation?
Does this Entity really have identity?
Does this Value Object actually protect an invariant?
Is this Repository a real abstraction, or just a renamed DAO?
Does this boundary object protect a contract, or is it just data shuffling?
Does this layer increase changeability, or only increase file count?

The more meaningless structure AI generates, the more humans have to read through it.

So the danger is not that AI will immediately remove ceremonial architecture.

The danger is that AI may make ceremonial architecture cheaper to produce, and therefore more common.

AI pushes the generation cost of ceremony toward zero.

But it does not push the cost of understanding that ceremony toward zero.

Therefore, meaningless ceremony becomes technical debt faster.

That is the central problem.

AI May Extend Bureaucracy Before It Destroys It

AI does not immediately destroy weak organizations.

At first, it may extend them.

The pattern looks like this:

AI generates more multi-layered code.
Humans review more AI-generated code.
Formal review rules become more important.
Existing managers and tech leads remain necessary.
The organization concludes that humans are still essential.

But this does not prove that the human work is high-leverage.

It may only prove that humans are now needed to clean up and supervise the complexity the organization created for itself.

This is the irony.

AI has the potential to reduce waste.

But if the organization is built around waste, AI first makes the waste cheaper.

AI does not first remove the bureaucracy.

It first makes the bureaucracy more affordable.

And when bureaucracy becomes cheaper, organizations often keep it.

The Real Competition Happens Outside the Organization

The question "Will AI replace developers?" is too narrow.

The more important competition is not always inside the same organization.

It is between different organizational forms.

Large AI-assisted bureaucratic engineering organizations
vs
Small AI-amplified high-ownership teams

The first type will see local productivity gains.

Tickets move faster.
CRUD work gets done faster.
Review comments are addressed faster.
Documentation is generated faster.
Boundary objects and mappers are added faster.

From the inside, this looks like progress.

But from the outside, the organization may still be slow.

Too many meetings
Unclear ownership
Formal reviews
Weak executable specifications
Boundaries based on org charts instead of business domains
Changes that require touching many layers without changing much meaning

The second type uses AI differently.

Small high-ownership teams use AI to increase their ability to change the system safely.

They focus on:

Executable specifications
Strong boundaries
Automated tests
Type-level constraints
Runtime validation
Explicit state transitions
Fast feedback loops
Clear ownership
Observability in production

For them, AI is not mainly a boilerplate generator.

It is a force multiplier for system ownership.

That difference is huge.

Weak organizations use AI to make existing work faster.

Strong organizations use AI to remove the need for much of that work.

Remote-Work Resistance Has the Same Root

This pattern is also related to another common organizational behavior: resistance to remote work.

This is not simply a question of whether remote work is good or bad.

The deeper question is:

What does the organization use as the basis of trust?

Strong engineering organizations tend to trust things like:

Clear ownership
Explicit goals
Reviewable artifacts
Executable tests
Written decisions
Observable production behavior
Well-defined interfaces
Documented trade-offs

Bureaucratic organizations tend to trust things like:

Being in the office
Being visible
Attending meetings
Being available for interruption
Following the existing process
Looking busy
Receiving informal supervision

The first type manages by outcomes and structure.

The second type manages by presence and procedure.

That is why DDD-flavored bureaucracy and anti-remote-work culture often fit together.

Remote work breaks management by presence.

In an office, people are visible. You can see whether someone is at their desk. You can call a meeting. You can interrupt them. You can get a sense that work is happening.

Remote work removes that visibility.

Then the organization must manage through artifacts:

What is owned by whom?
What is the definition of done?
Where is the specification?
Which test protects the invariant?
What decision was made?
What changed?
What failed in production?

For a strong organization, this is natural.

For a weak organization, this is threatening.

Because the organization was not actually managing outcomes. It was managing the appearance of control.

This is why "communication" becomes the usual complaint.

Remote work reduces communication.
Remote work weakens team culture.
Remote work makes it hard to mentor juniors.
Remote work makes progress invisible.
Remote work removes casual conversations.

Some of this can be true.

But often, "communication" is being used to mean something else:

The ability to interrupt people synchronously
The ability to compensate for unclear ownership with conversation
The ability to avoid writing down decisions
The ability to resolve ambiguity through meetings
The ability to judge progress by atmosphere instead of artifacts

In that case, remote work is not destroying communication.

It is destroying the organization's ability to hide ambiguity inside informal interaction.

This is closely related to cargo-cult DDD.

In one case, architecture is used to make code and developers controllable.

In the other case, the office is used to make people visible and controllable.

Architecture becomes an interface for controlling code.

The office becomes an interface for controlling people.

Meetings absorb unclear responsibility.

Reviews enforce formal consistency.

These are not separate phenomena.

They point in the same direction.

The organization does not trust ownership.
It cannot manage through artifacts.
It lacks executable specifications.
It relies on presence, ceremony, and supervision.

That is why AI and remote work expose similar weaknesses.

AI separates meaningful design from meaningless work.

Remote work separates organizations that manage outcomes from organizations that manage presence.

This does not mean all office work is bad.

There are valid reasons for in-person work: onboarding, hardware, security, crisis response, customer work, sensitive collaboration, and team formation.

The problem is not the office itself.

The problem is using the office as a substitute for ownership, clarity, and trust.

What Remains Valuable from DDD

DDD is not dying.

What dies is DDD-flavored bureaucracy.

The parts of DDD that remain valuable are the ones that help a team understand and protect domain meaning:

Business boundaries
Ubiquitous language
Bounded contexts
Invariants
State transitions
Ownership of responsibilities
Executable tests
Types, constraints, and validation
A clear view of where change will break things

The parts that lose value are the purely ceremonial rules:

Use a Repository.
Put it in a Use Case.
Convert it to a DTO.
Keep the Controller thin.
Follow the directory structure.
Make it look like the existing code.
Avoid review comments by following the ritual.

Again, these patterns can be useful.

But their value depends on whether they represent actual domain boundaries, constraints, and responsibilities.

If they express meaning, they remain.

If they only enforce conformity, AI erodes their economic value.

What AI-Era Architecture Needs

AI-era architecture should rely less on humans reading every line and more on executable checks.

The premise that humans can semantically review every generated line of code is becoming weaker.

Instead, we need systems where invalid changes fail quickly.

Invalid states should be impossible or difficult to represent.
Invalid transitions should be rejected.
Boundary crossings should include validation.
Specification violations should fail tests.
Types should encode constraints where possible.
Runtime checks should protect what types cannot.
Change impact should be localized.
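
A small Python sketch of two items from the list above: invalid transitions are rejected, and raw input is validated at the boundary. The order lifecycle, the OrderStatus enum, the ALLOWED_TRANSITIONS table, and the function names are all hypothetical, chosen only to show the shape.

```python
from enum import Enum


class OrderStatus(Enum):
    DRAFT = "draft"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# The whole lifecycle lives in one table, so a change to the rules is a one-place edit.
ALLOWED_TRANSITIONS = {
    OrderStatus.DRAFT: {OrderStatus.PAID, OrderStatus.CANCELLED},
    OrderStatus.PAID: {OrderStatus.SHIPPED, OrderStatus.CANCELLED},
    OrderStatus.SHIPPED: set(),
    OrderStatus.CANCELLED: set(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not listed in ALLOWED_TRANSITIONS."""


def transition(current: OrderStatus, target: OrderStatus) -> OrderStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target


def parse_status(raw: str) -> OrderStatus:
    """Boundary validation: turn an untrusted string into a known status or fail."""
    try:
        return OrderStatus(raw)
    except ValueError:
        raise ValueError(f"unknown order status: {raw!r}") from None
```

With this in place, generated code that tries to move a shipped order back to draft fails immediately instead of waiting for a reviewer to notice.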

This is the shift:

Architecture protected by human review
to
Architecture protected by tests, types, constraints, contracts, and specifications
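
One hedged illustration of the second form, in Python: a specification written as a check that runs, rather than as a rule a reviewer has to remember. The discount rule, apply_discount, and check_discount_spec are hypothetical examples, not taken from any real system.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical domain rule: a discount never raises a price or makes it negative."""
    if price_cents < 0:
        raise ValueError("price must not be negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


def check_discount_spec() -> int:
    """Executable specification: check the invariants over a grid of inputs.

    Returns the number of cases checked; raises AssertionError on a violation.
    """
    checked = 0
    for price in range(0, 2001, 37):
        previous = price
        for percent in range(0, 101):
            result = apply_discount(price, percent)
            # Invariant 1: the discounted price stays within [0, price].
            assert 0 <= result <= price, (price, percent, result)
            # Invariant 2: a larger discount never yields a higher price.
            assert result <= previous, (price, percent, result)
            previous = result
            checked += 1
    return checked
```

If a later change, human or generated, breaks either invariant, check_discount_spec fails without anyone reading the diff line by line.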

Without this shift, AI becomes a form-filling assistant.

AI writes the repeated files.
AI wires the layers together.
AI moves logic into the expected place.
Humans check the ceremony.
The organization concludes that developers are still necessary.

But that is not the essence of AI-era software development.

The point is not to generate more DDD-shaped code.

The point is to build a system that can safely absorb AI-generated change.

What matters is not ceremony.

What matters is a breakwater for domain meaning:

Boundaries
Language
Invariants
State transitions
Executable specifications

A multi-layered architecture without these things is not domain-driven design.

It is managerial residue in the shape of software.

A Working Thesis

Developers will not disappear overnight.

In many organizations, AI will preserve existing development work for a while.

Less experienced, lower-cost developers will become more productive with AI. They will generate DDD-flavored boilerplate, move data across layers, follow existing templates, and respond to formal review comments much faster.

That will look like a major productivity gain.

And locally, it will be one.

But the deeper shift is elsewhere.

The disruptive part is not:

Junior developers use AI to fill out DDD-shaped architectural forms faster.

The disruptive part is:

High-ownership engineers use AI to make the organizational structure itself lighter.

That is the real threat to bureaucratic software organizations.

DDD is not dying.

Cargo-cult DDD is.

AI will not kill meaningful domain modeling.

It will kill the economic rationale for using tactical DDD as architectural paperwork.

And before that happens, AI will do something more ironic:

It will make the bureaucracy cheaper.

But cheaper bureaucracy is still bureaucracy.

Eventually, large AI-assisted bureaucratic organizations will compete with small AI-amplified high-ownership teams.

And from the outside, the difference will be obvious.

One group will use AI to produce more ceremony.

The other will use AI to remove the need for it.