TL;DR
Most multi-agent Claude Code stacks are missing the layer that decides how to think about the problem before any agent touches it. This is what that layer looks like in practice. Cynefin classification as the first routing decision. Specs as conjunctions of sub-hypotheses with explicit Bayesian priors. Gates as a scheduled rotation of cognitive tools rather than checkpoints. Theory of Constraints as runtime state. Falsifiability as the default for every claim.
The missing layer
Most multi-agent Claude Code is a planner LLM, N subagents, and a prayer. The piece that's almost always missing is the decision-making layer that picks how to think about the problem before any agent touches it.
If you're running multi-agent workflows and you can't answer "what regime is this work in" before the planner generates a task tree, the rest is hope.
Cynefin classifies before anything else runs
The first move on every project should be Cynefin classification. Clear, Complicated, Complex, or Chaotic. The classification picks the decomposition strategy.
- Clear gets Functional decomposition. Known correct answer, structure by feature boundary.
- Complicated gets Layer-based decomposition. Experts can analyze, structure by abstraction.
- Complex gets Risk-based decomposition. Highest uncertainty first, time-boxed probes, no end-date estimation at all.
- Chaotic gets Act-Sense-Respond. Stop the bleeding first, analyze second.
The planner doesn't get to freestyle a task tree. The epistemic regime constrains it.
Specs are conjunctions of sub-hypotheses with Bayesian priors
A spec written as a list of features is a wish. A spec written as a conjunction of sub-hypotheses with priors is a falsifiable plan.
P(project succeeds) = P(H1) × P(H2) × ... × P(Hn)
If the joint prior falls below 20%, no infrastructure work happens until a Phase 0 Validate-or-Kill sprint resolves the existential risks first. Kill conditions are written into the spec during pre-mortem, before motivation arrives.
A belief tracker (Prior, After Gate 0, After Gate 1, ...) shows whether new evidence is actually moving probabilities or just generating activity. Sunk cost gets engineered out at the spec level, not at gate-3 review when the team is too invested to abandon ship.
Gates are a rotation of cognitive tools, not checkpoints
Most stacks call sprint review a "gate." That's a checkpoint. A real gate is a scheduled rotation of cognitive frameworks, timed to where each one has leverage.
At phase start:
- Cynefin classify. Which regime are we in.
- Polya. What is the problem actually asking.
- Inversion / pre-mortem. How does this fail.
- Decompose. Per the regime-picked strategy.
- Theory of Constraints. What is the bottleneck right now.
- Fermi. Size with order-of-magnitude estimates.
During the phase:
7. Bayesian execution. Update priors as evidence arrives.
At gate review:
8. 8-bias audit + Five Whys
Three biases tripped triggers a mandatory pause and reframe from scratch.
The Gate Decision Matrix is four-valued: Go, Iterate, Pivot, Kill. "Fix it" isn't a legal exit. Kill is first-class, and the conditions that justify it are pre-registered as direct output of the pre-mortem.
Theory of Constraints is runtime state, not a slogan
The current bottleneck gets written to disk and updated at every gate. Non-bottleneck agents wait rather than pile up WIP.
Idle is diagnostic, not failure.
Most multi-agent stacks treat unassigned agents as wasted capacity and pile on. That's the opposite of the right move. Subordinating non-bottleneck work to the actual constraint is what makes the system converge.
Everything is falsifiable
Every mutating tool call writes a record to an append-only audit log:
{
"task_id": "T42",
"agent_id": "backend-architect",
"diff_hash": "sha256:e4f5...",
"caller_kind": "subagent",
"outcome": "success"
}
The log is bit-identical across the bash and Python implementations so attribution can't drift between languages.
A calibration miss counter makes self-review falsifiable. When the executor self-reports NONE or LOW severity and an independent adversarial reviewer says HIGH or CRITICAL, that's a miss. Over 50% miss rate over the last 4 paired reviews triggers a retraining advisory.
Self-review without a paired second voice isn't review. It's hope.
The system shape is the point
Building agentic systems is mostly the work of making decisions falsifiable.
- Classify before you plan.
- Hypothesize with priors.
- Gate with cognitive tools rotated by where they have leverage.
- Attribute every mutation to a hash.
- Let Kill be a first-class outcome of any gate.
- Pick decomposition strategy by epistemic regime, not by what the planner LLM feels like generating.
Each piece replaces a specific failure mode. The planner LLM generated a wrong-shaped tree. The team shipped infrastructure for a project whose joint prior was 8%. Three gates passed because the bias the team had been tripping the whole time never made the checklist. The agent self-reported success on a feature that didn't actually work.
What you can apply today
You don't need to rebuild your stack to start. Pick the cheapest move with the highest leverage.
- Add a Cynefin classification field to the top of your spec template. Use it before any agent touches the work.
- Write your next spec as a conjunction of sub-hypotheses with priors. Compute the joint. If it's under 20%, design a cheap Phase 0 before any infrastructure investment.
- Pair every reviewable task with an independent adversarial reviewer. Count the disagreements. That number is the only honest measure of self-review quality you have.
The question
If you're running multi-agent Claude Code and you can't tell me what your routing decision was before planning began, what's your stop condition when the joint probability of all your assumptions falls through the floor?
Top comments (0)