
Brad Kinnard


I Dropped Multi-Agent Coordination for a 5-Layer Falsification Battery

Swarm Orchestrator just lost its swarm. Dropped the multi-agent parallel coordination layer. Running one agent now and putting all the weight on a five-layer post-merge falsification battery instead.

This is an experiment, not an endpoint. v8 will bring proper multi-agent swarming back. The reason for cutting it temporarily: I want to know whether the value I was getting from coordinated parallel agents was the coordination itself, or the verification pressure that coordination produced. Easier to measure with one variable. Intended side effect: cost reduction, since the previous architecture spun up multiple CLI agent instances per run. Real benchmarks pending.


TL;DR: every patch survives a five-layer post-merge battery before the orchestrator declares success. Layers 1 and 2 are hard gates. Layers 3, 4, 5 are advisory and feed a composite score. Hard-gate failure throws before attestation, before final gates, before any external success signal.

Pipeline Order

The battery runs once per orchestrator execution against the merged working tree, not per-step branches. The per-step verifier is a separate component. Layers fire in fixed order:

  1. Differential gate (hard)
  2. Mutation gate (hard)
  3. Cheat detector (advisory)
  4. Property gate (advisory)
  5. Attestation (advisory on first run, signed after)

If either hard gate fails, the composite is forced to 0 and the orchestrator throws `falsification battery blocked the patch` before any external success signal can fire.
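The hard-gate short-circuit can be sketched in a few lines. Names and shapes here are illustrative, not the orchestrator's actual API, and the composite is simplified (the real one is weighted):

```typescript
// Illustrative sketch of the fixed-order battery loop.
type LayerResult = { name: string; hard: boolean; passed: boolean; score: number };

function runBattery(layers: LayerResult[]): { composite: number } {
  for (const layer of layers) {
    if (layer.hard && !layer.passed) {
      // Hard-gate failure forces the composite to 0 and throws
      // before attestation, final gates, or any external success signal.
      throw new Error("falsification battery blocked the patch");
    }
  }
  // Only advisory layers feed the composite (weights omitted in this sketch).
  const advisory = layers.filter((l) => !l.hard);
  const composite =
    advisory.reduce((sum, l) => sum + l.score, 0) / Math.max(advisory.length, 1);
  return { composite };
}
```

The point of the ordering is that a hard-gate failure never reaches scoring at all.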

Layer 1: Differential Gate (Hard)

Before any agent touches the repo, a synthesizer generates a regression test against the goal. Layer 1 then runs that test in two detached worktrees: one at the base commit, one at the patch commit.

The contract: the test must fail at base and pass at patch.

If the test passes at base, the layer returns INVALID_TEST. This catches the specific failure mode where an agent writes a tautological test that passes against any code. Without this gate, that pattern slips through every other check downstream.

If no command can be synthesized and the caller doesn't pass `--differentialTestCommand`, the layer fails closed. Deliberate policy.
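The contract reduces to a small decision table. A minimal sketch (the verdict names beyond INVALID_TEST are illustrative):

```typescript
// Differential-gate contract: the test must fail at base and pass at patch.
type DiffVerdict = "PASS" | "FAIL" | "INVALID_TEST";

function differentialVerdict(passesAtBase: boolean, passesAtPatch: boolean): DiffVerdict {
  // A test that passes at base is tautological: it would pass against
  // any code, so it proves nothing about the patch.
  if (passesAtBase) return "INVALID_TEST";
  return passesAtPatch ? "PASS" : "FAIL";
}
```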

Layer 2: Mutation Gate (Hard)

Runs Stryker for JS/TS, mutmut for Python, PITest (Gradle/Maven) for Java, against changed files only. First runs the regression command; if that fails, the layer fails immediately. On pass, dispatches the mutation tool and parses the reporter output, including Stryker's clear-text table layout.

Mutation score thresholds
  • score < 0.6 → FAIL
  • 0.6 ≤ score < 0.8 → WARNING
  • score ≥ 0.8 → PASS

Patches with no meaningful test coverage of the changed lines fail here even when `npm test` is green.
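The threshold mapping above is a plain cascade:

```typescript
// Mutation-score thresholds from the table above (defaults; the
// function name is illustrative).
type MutationStatus = "FAIL" | "WARNING" | "PASS";

function mutationStatus(score: number): MutationStatus {
  if (score < 0.6) return "FAIL";
  if (score < 0.8) return "WARNING";
  return "PASS";
}
```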

Layer 3: Cheat Detector (Advisory)

Diff-static scan for known agent-shortcut patterns. Each rule maps to a real category of behaviour I've watched agents fall into under pressure.

Cheat detector rule IDs
  • test-modification: agent edited the test instead of the implementation
  • complexity-mismatch: trivial implementation paired with a non-trivial test
  • mock-mutation: real call replaced with a mock that returns the expected value
  • exception-swallowing: caught and ignored exceptions to make tests pass
  • hardcoded-answer: literal return of the expected output

Optional Semgrep integration loads any registered cheat-rule packs into a transient config. The scan matches added and removed lines against these patterns; it does not execute the patch. An agent that hides a hardcoded answer behind a small computation can still pass this layer. Honest about that.
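A diff-static scan of this kind boils down to matching added lines against per-rule patterns. A toy sketch covering two of the rule categories (the real detector's patterns and heuristics are richer):

```typescript
// Toy diff-static scan: regex rules over added diff lines only.
// Rule IDs match two categories from the list above; patterns are
// illustrative, not the detector's real rules.
const cheatRules: { id: string; pattern: RegExp }[] = [
  { id: "hardcoded-answer", pattern: /return\s+(42|"expected"|'expected')/ },
  { id: "exception-swallowing", pattern: /catch\s*\([^)]*\)\s*\{\s*\}/ },
];

function scanAddedLines(addedLines: string[]): string[] {
  const findings = new Set<string>();
  for (const line of addedLines) {
    for (const rule of cheatRules) {
      if (rule.pattern.test(line)) findings.add(rule.id);
    }
  }
  return [...findings];
}
```

Because the scan never runs the code, it is cheap enough to sit in the advisory tier on every battery execution.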

Layer 4: Property Gate (Advisory)

Discovers modified TS/JS/Python functions, parses their parameter types, maps each to a fast-check arbitrary or Hypothesis strategy, generates a harness, runs it. Counterexamples surface as findings. Untyped or unsupported types degrade to a low-severity advisory finding rather than blocking.
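The type-to-generator mapping is the core of the layer. A minimal sketch for the TS side, returning fast-check arbitrary names as strings (the table and fallback behaviour here are assumptions about the shape, not the real implementation):

```typescript
// Hypothetical mapping from a parsed parameter type to a fast-check
// arbitrary. Unsupported types return null, which degrades to a
// low-severity advisory finding instead of blocking the battery.
function toArbitrary(tsType: string): string | null {
  const table: Record<string, string> = {
    number: "fc.double()",
    string: "fc.string()",
    boolean: "fc.boolean()",
  };
  return table[tsType] ?? null;
}
```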

Layer 5: Attestation (Advisory on First Run)

Reads the `refs/notes/swarm-attestation` git note for the patch commit, validates the in-toto SLSA v1.0 envelope's subject SHA against the patch commit, then verifies the cosign signature. On the first run for a commit there's no note yet, so this layer reports advisory-warn and the post-battery attestation step writes the note.

The note is verifiable later via `swarm attest verify <commit>`. A downstream consumer can verify the patch survived the battery without trusting the running orchestrator process.
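The subject-SHA check inside verification is a straight comparison. A sketch with the envelope shape simplified from in-toto (cosign signature verification omitted; field names assumed):

```typescript
// Simplified in-toto subject check: does any subject digest in the
// envelope match the patch commit? The real envelope carries more
// fields and a signature that cosign verifies separately.
interface Subject {
  name: string;
  digest: { sha1?: string };
}

function subjectMatchesCommit(subjects: Subject[], commitSha: string): boolean {
  return subjects.some((s) => s.digest.sha1 === commitSha);
}
```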

Composite Scoring

When the hard gate passes, a weighted composite is computed across the three advisory layers and any optional advisory quality-gate results. Failed advisory gates each subtract a fixed penalty.

Default scoring (overridable via .swarm/gates.yaml)
  • composite threshold: 0.7
  • weights: cheat detector 0.4, property gate 0.4, attestation 0.2
  • advisory gate penalty: 0.02 per failure

`humanReviewRequired` is true when the composite score is below threshold or any advisory layer is in advisory-warn status.
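Using the documented defaults, the scoring can be sketched as a weighted sum minus per-failure penalties (field names are illustrative; only the numbers come from the defaults above):

```typescript
// Composite scoring with the .swarm/gates.yaml defaults:
// weights 0.4 / 0.4 / 0.2, penalty 0.02 per failed advisory gate,
// threshold 0.7. Structure is a sketch, not the real implementation.
interface AdvisoryScores {
  cheatDetector: number;
  propertyGate: number;
  attestation: number;
}

function compositeScore(scores: AdvisoryScores, failedAdvisoryGates: number): number {
  const weighted =
    0.4 * scores.cheatDetector + 0.4 * scores.propertyGate + 0.2 * scores.attestation;
  // Each failed advisory quality gate subtracts a fixed 0.02 penalty.
  return Math.max(0, weighted - 0.02 * failedAdvisoryGates);
}

function humanReviewRequired(composite: number, anyAdvisoryWarn: boolean): boolean {
  return composite < 0.7 || anyAdvisoryWarn;
}
```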

Where It Actually Runs

Three real call sites, not just unit tests:

  • Production orchestrator on every swarm run
  • Synthetic calibration corpus (36 paired test specs across 6 broken-category families) executing in CI on every push
  • SWE-bench harness using Layer 1 and Layer 4 as standalone spot-check eval drivers

Honest Caveats

These are in docs/known-gaps.md and I won't hide them:

  • Differential gate is host-Python-sensitive on legacy codebases. The synth-eval can reflect import-chain errors rather than assertion outcomes. The authoritative resolution gate in the per-instance Docker image is unaffected.
  • Mutation gate skips quietly when no changed files match supported languages. YAML, Markdown, Rust, Go diffs don't get mutation-tested.
  • Cheat detector is diff-static, not behavioural. The hidden-computation-around-hardcoded-answer pattern can pass it.
  • Attestation signing is best-effort. Cosign-not-installed errors get logged and the run proceeds without a note. The note's absence is reflected in Layer 5's advisory-warn on subsequent runs but does not block.

Why Run This Experiment

If the falsification battery alone produces patches that survive scrutiny at acceptable quality, then a lot of the apparent value of multi-agent coordination was actually the verification pressure it created, not the agent diversity itself. If the battery alone isn't enough, then v8 multi-agent gets a clearer mandate: the swarm is the value, not the side effect.

Either result is useful. The point of the rewrite is to make the answer measurable.

