ORCHESTRATE

Bird Meadow v2: an external review found a silent bug, refuted my Nx port, and endorsed our audit-anchor pattern. Here's the loop closing.

External-review credit: Jeremy Jones ran the v1 + v2 adversarial review panels (eight-critic LLM-assisted) that surfaced the findings closed in this post. The single most consequential finding (the Dirichlet bug) and the single most consequential refutation (the Nx port) both came from his loop. Thank you.

TL;DR

Recently we published Bird Meadow — a multi-agent Active Inference workbench in pure Elixir — with a public ask: poke holes in it. An external review panel responded within 24 hours, and two follow-up reviews (v1 + v2 delta) gave us a punch list.

This post documents what closed:

  • v1.1-remediation — fixed a silent Dirichlet learning bug, sharpened framing, hardened multi-agent collision logic
  • v1.2-hardening — Mnesia consistency model, signal-race property tests, telemetry-context discipline, a 100/100 statistical regime test, CI workflow
  • v1.3-falsifiability — the GW1 three-arm experiment (EFE vs greedy vs random) and the G4 belief-evolution prediction
  • v2-equivalence-proof — proved primitive-level Nx equivalence to 1e-9, measured the drop-in dispatch as a 5x perf regression, reverted it, documented the honest finding

Every wave landed with passing tests, signed tags, and source-code audit anchors that fail when the claim drifts. Repo: TheORCHESTRATEActiveInferenceWorkbench.

The bigger story is the methodology. The reviewer called the audit-anchor-as-source-code-test pattern "the single most valuable thing this codebase has taught us." That endorsement is what this post is really about.


v1.1 — the silent Dirichlet bug

The 🔴 finding from the v1 review:

DirichletUpdateA reads marginal_state_belief from the bundle map; the field lives on agent state. The Map.get fallback fires every call. Online learning of A reduces to averaging observation counts uniformly across hidden states regardless of agent posterior.

Confirmed and extended. DirichletUpdateB had the same bug and a complete no-op branch — q_now also fell through to nil, so the entire B-update was dead code. The agent appeared to be learning. It was not.

Fix:

# Before — always hit the fallback
q_s = Map.get(bundle, :marginal_state_belief, uniform(length(hd(a))))

# After — read from agent state with explicit empty handling
q_s =
  case state.marginal_state_belief do
    [] -> uniform(length(hd(a)))
    [_ | _] = vec -> vec
  end

Three positive regression tests now guard against this returning. They assert state-dependent alpha deltas — not just "alpha changed" (which the buggy version would also pass). If the bug returns, parallel scenarios with different state.marginal_state_belief would produce identical alpha matrices, and the test fails loud.
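
The guard is easiest to see as a sketch. The module and helper names below are illustrative, not the repo's exact test (which lives in dirichlet_update_a_test.exs); the point is the shape of the assertion — two agents with different posteriors must produce different alpha deltas.

# Sketch only — DirichletUpdateA.alpha_delta/2 is a hypothetical helper name.
defmodule DirichletStateDependenceSketchTest do
  use ExUnit.Case, async: true

  test "alpha delta depends on the agent's marginal_state_belief" do
    a   = [[0.7, 0.3], [0.3, 0.7]]   # likelihood matrix A
    obs = [1.0, 0.0]                 # one-hot observation

    peaked_left  = %{marginal_state_belief: [0.9, 0.1], a_alpha: a}
    peaked_right = %{marginal_state_belief: [0.1, 0.9], a_alpha: a}

    delta_left  = DirichletUpdateA.alpha_delta(peaked_left, obs)
    delta_right = DirichletUpdateA.alpha_delta(peaked_right, obs)

    # The buggy fallback produced identical (uniform) deltas for any posterior.
    refute delta_left == delta_right
  end
end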

This was the only 🔴 in the panel. It shipped alone (commit 96f4c35), before the rename and before the audit-anchor doc additions, so its blast radius would be unambiguous.


v1.2 — distributed-systems audit anchors

The Kingsbury-named findings (K1–K7) targeted distributed systems concerns the v1 work hadn't formally addressed:

  • K1 — Mnesia consistency. New event_log_consistency_test.exs runs 8 parallel writers × 25 events each and asserts per-agent_id monotonicity of the timestamp field. Documented model: per-agent causal ordering; cross-agent ordering is timestamp-best-effort and may interleave under microsecond-equal commits. (A minimal sketch of this property test follows the list.)

  • K2 — Signal-route races. Adversarial integration test fires perceive and plan signals from 6 task-spawned senders across 4 ticks, asserts the agent's belief evolution remains causal regardless of interleaving.

  • K3 — Telemetry context. Process.put/get doesn't propagate across Task.async. Added moduledoc warning + 5-test property suite using Task.async_stream over policies; either provenance survives, or it fails loud (no silent loss).

  • K4 — MVP statistical regime. 100 episodes on tiny_open_goal with production defaults. 100/100 success rate — a hard floor against which future regressions would fail visibly.

  • A2 — Policy enumeration cost. enumerate_policies is exponential in depth — |A|^d policies at depth d. Now warned in the docstring with a practical ceiling.

  • C1 — CI workflow. .github/workflows/ci.yml runs mix compile --warnings-as-errors + mix test --exclude slow_experiment. README badge added.
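
To make K1 concrete, here is a minimal sketch of the monotonicity property under parallel writers. EventLog.append/2 and EventLog.read/1 are assumed names for the event-log API, not the repo's exact functions; the real test's mechanics may differ.

# Sketch only — the writer/reader API names are assumptions.
defmodule EventLogMonotonicitySketchTest do
  use ExUnit.Case

  test "per-agent_id timestamps are monotone under 8 parallel writers" do
    agent_ids = Enum.map(1..8, &"agent_#{&1}")

    # 8 writers × 25 events each, written concurrently.
    agent_ids
    |> Task.async_stream(
      fn agent_id ->
        for n <- 1..25, do: EventLog.append(agent_id, %{seq: n, at: System.monotonic_time()})
      end,
      max_concurrency: 8
    )
    |> Stream.run()

    # Per-agent causal ordering: each agent's timestamps must be non-decreasing.
    for agent_id <- agent_ids do
      timestamps = agent_id |> EventLog.read() |> Enum.map(& &1.at)
      assert timestamps == Enum.sort(timestamps)
    end
  end
end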

K5 deserves its own paragraph because my first fix for it was wrong, and I caught that only via a Plan-agent stress-test of my draft. The reviewer's note said "sort intentions deterministically" — which sounds like a 15-minute change. But sorting the iteration order doesn't prevent two birds from landing on the same previously-empty cell. The actual fix is a three-phase sweep — collect intentions, detect target conflicts and tie-break (lowest agent_id wins, losers get {:blocked, :collision}), commit — ~30 lines, with a property test that asserts the rule across random multi-bird action maps.
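
A minimal sketch of that sweep, assuming the collect phase has already produced a map of agent_id => intended target cell (the module and function names are illustrative, not the repo's exact code):

defmodule CollisionSweepSketch do
  # Phase 2: group by target cell and tie-break — lowest agent_id wins a contested cell.
  # Phase 3: commit winners; losers receive {:blocked, :collision}.
  def resolve_intentions(intended_cells) when is_map(intended_cells) do
    winners =
      intended_cells
      |> Enum.group_by(fn {_agent_id, cell} -> cell end)
      |> Map.new(fn {cell, claimants} ->
        {cell, claimants |> Enum.map(&elem(&1, 0)) |> Enum.min()}
      end)

    Map.new(intended_cells, fn {agent_id, cell} ->
      if winners[cell] == agent_id,
        do: {agent_id, {:move, cell}},
        else: {agent_id, {:blocked, :collision}}
    end)
  end
end

# CollisionSweepSketch.resolve_intentions(%{1 => {3, 4}, 2 => {3, 4}, 3 => {0, 0}})
# => %{1 => {:move, {3, 4}}, 2 => {:blocked, :collision}, 3 => {:move, {0, 0}}}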

The honest version of "I read the finding carefully" is: the first read produced the wrong fix. Ship the right one.


v1.3 — falsifiability

This is where we stopped patching and started measuring claims that could falsify the system.

GW1 — the three-arm experiment

The reviewer's joint Gershman-Wolpert finding: the bundle's hand-crafted geometric prior toward the loud-token gradient might be doing all the work. EFE machinery vs. baseline greedy might show no difference if the prior is already strong.

Tested. Three arms, identical ConvergentBird bundle, identical 8×8 corner-spawn setup and matching priors:

| Arm | Action selection | Median final distance |
| --- | --- | --- |
| AI | EFE-weighted policy posterior | 7.0 |
| GreedyLoudest | Pragmatic-greedy on observation amplitude | 14.0 |
| Random | Uniform random walk | 7.0 |

The honest result: GreedyLoudest performed worse than a random walk. Why? The greedy baseline ties on equal-amplitude tokens and defaults to :stay — so it sat there. That's a publishable finding about the baseline's failure mode, not about EFE's superiority.

What it actually says: the bundle's geometric prior is doing real work (random walk and EFE both hit the loud token) and EFE's value-add is matching random-walk performance with directional consistency that the test doesn't yet measure. The next experiment should isolate that — but we shipped what we measured, including the inconvenient bit.
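
The :stay failure mode is easy to reproduce in miniature. This is an illustrative sketch, not the repo's GreedyLoudest action: far from the loud token every neighbouring amplitude rounds to the same value, no move strictly improves, and the tie-break falls through to :stay.

defmodule GreedyTieSketch do
  # Pick the direction with the loudest observation; stay put unless a move strictly improves.
  def greedy_action(observations) do
    {best_dir, best_amp} = Enum.max_by(observations, fn {_dir, amp} -> amp end)
    current_amp = observations |> Map.new() |> Map.get(:stay, 0.0)

    if best_amp > current_amp, do: best_dir, else: :stay
  end
end

# All amplitudes tie far from the source, so the baseline never moves:
# GreedyTieSketch.greedy_action([north: 0.1, south: 0.1, east: 0.1, west: 0.1, stay: 0.1])
# => :stay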

G4 — belief-evolution prediction

A specific quantitative prediction: in a custom 4-state stochastic environment, withholding observations from t=5 to t=9 should cause the marginal posterior entropy to broaden toward ln(4) ≈ 1.386 during the window and snap back to the observed-belief entropy when observations resume.

Measured trajectory:

Arm A (full obs):     0.042 → 0.042 → 0.042 → 0.042 → 0.042 → 0.042 → ... (constant)
Arm B (withheld 5-9): 0.042 → 0.042 → 0.042 → 0.042 → 0.042 → 1.245 → 1.369 → 1.384 → 1.386 → 1.386 → 0.042 → 0.042

Asymptotically converges to ln 4 under withholding. Snaps back. Textbook trajectory. The test asserts both monotonic broadening during the window (within ε) and recovery within 2 ticks of resumption — so future regressions to the predictive rollout machinery would fail visibly.
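
The ceiling itself is just the entropy of a uniform 4-state posterior. A minimal helper, as a sketch (the repo's actual helper may be named differently):

defmodule EntropySketch do
  # H(q) = -Σ qᵢ ln qᵢ; zero-probability entries contribute nothing.
  def entropy(q) do
    q
    |> Enum.filter(&(&1 > 0.0))
    |> Enum.reduce(0.0, fn p, acc -> acc - p * :math.log(p) end)
  end
end

# EntropySketch.entropy([0.25, 0.25, 0.25, 0.25])    # => 1.3862943611198906 (= ln 4)
# EntropySketch.entropy([0.99, 0.005, 0.003, 0.002]) # => ~0.066 (sharply peaked posterior)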


v2-equivalence-proof — the substrate finding

The Wolpert W1 finding, escalated in the v2 review to "load-bearing capability constraint": pure-Elixir list math hits the Jido per-action 60s timeout for ComplexBird at policy depth ≥ 2 on the 1000-dim observation space. Original plan: Nx port to lift the ceiling.

What we proved

ActiveInferenceCore.Math.Nx.matvec/2 and softmax/1 produce numerically equivalent output to the pure-Elixir reference within 1.0e-9 on random inputs at meadow scale (1000×1152), edge cases (1×1, zero matrix, empty vector), sharply-peaked softmax inputs, and 1000-dim policy logits. 9 tests, 0 failures. This is the artifact future redesign builds on.
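
The equivalence check is simple in shape. A sketch, assuming ActiveInferenceCore.Math.matvec/2 is the pure-Elixir reference (the Nx module name is the one quoted above; the test module and tolerance constant here are illustrative):

defmodule NxEquivalenceSketchTest do
  use ExUnit.Case

  @tolerance 1.0e-9

  test "Nx matvec matches the pure-Elixir reference at meadow scale" do
    matrix = for _ <- 1..1000, do: for(_ <- 1..1152, do: :rand.uniform())
    vector = for _ <- 1..1152, do: :rand.uniform()

    reference = ActiveInferenceCore.Math.matvec(matrix, vector)
    candidate = ActiveInferenceCore.Math.Nx.matvec(matrix, vector)

    # Element-wise agreement within tolerance on random meadow-scale input.
    reference
    |> Enum.zip(candidate)
    |> Enum.each(fn {r, c} -> assert abs(r - c) < @tolerance end)
  end
end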

What we refuted

Drop-in dispatch — wiring Math.matvec/softmax to call through Math.Nx via a config flag — was prototyped and benchmarked on ComplexBird depth 2 on a 4×4 meadow:

| Path | Wall-clock |
| --- | --- |
| Pure-Elixir | ~26 s |
| Nx (BinaryBackend, drop-in dispatch) | ~121 s |

Speedup: 0.22x — roughly five times slower. On top of that, summation-order divergence accumulated above 1e-6 on the long log-domain matvecs once composed through log_eps + matvec + softmax, despite primitive equivalence holding at 1e-9.

Root cause: per-call Nx.tensor(...) / Nx.to_list(...) boundary conversions dominate when the kernel itself is small (single matvec on a few thousand elements) and is invoked thousands of times per Plan call. The default BinaryBackend has no SIMD acceleration to amortise the conversion cost.
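
The refuted dispatch had roughly this shape (illustrative — the config key, flag name, and fallback implementation are assumptions, not the exact code). Every call pays two list↔tensor conversions around a kernel too small to amortise them:

defmodule DropInDispatchSketch do
  # Drop-in dispatch: same signature, optional Nx path behind a config flag.
  def matvec(matrix, vector) do
    if Application.get_env(:active_inference_core, :nx_dispatch, false) do
      t_m = Nx.tensor(matrix)   # list -> tensor conversion on every call
      t_v = Nx.tensor(vector)

      t_m
      |> Nx.dot(t_v)            # the kernel itself is tiny at this scale
      |> Nx.to_list()           # tensor -> list conversion on every call
    else
      pure_matvec(matrix, vector)
    end
  end

  # Pure-Elixir reference path (stand-in for the repo's implementation).
  defp pure_matvec(matrix, vector) do
    Enum.map(matrix, fn row ->
      row |> Enum.zip(vector) |> Enum.reduce(0.0, fn {a, b}, acc -> acc + a * b end)
    end)
  end
end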

The honest scoping

Drop-in primitive replacement is the wrong design. To deliver a speedup the inner sweep must be tensorised as a whole: batched matvec across policies, defn-compiled kernels, EXLA or Torchx backend so conversion cost amortises. That is multi-week work tracked as v2.1 and not part of the v1.x remediation series.
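
For reference, the deferred direction looks roughly like this — an assumed shape, not committed code: stack all policy posteriors into one tensor, do one batched matvec, and let a compiled backend own the loop.

defmodule BatchedSweepSketch do
  import Nx.Defn

  # a: {obs, states} likelihood; qs: {n_policies, states} stacked posteriors.
  # One batched matvec replaces n_policies separate matvec calls.
  defn batched_matvec(a, qs), do: Nx.dot(qs, [1], a, [1])

  # Softmax over per-policy logits, {n_policies, k}, computed in one kernel.
  defn batched_softmax(logits) do
    e = Nx.exp(logits - Nx.reduce_max(logits, axes: [1], keep_axes: true))
    e / Nx.sum(e, axes: [1], keep_axes: true)
  end
end

With EXLA (or Torchx) configured as the Nx default backend, the intent is that both kernels compile once and the per-call conversion cost that sank the drop-in experiment disappears.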

The benchmark file now ships as a baseline measurement of the pure-Elixir path only (25.34s on this machine, well under the 60s Jido timeout). The benchmark passes with that finding written into its assertions. Equivalence is proven, performance is refuted, redesign is documented. Future work has a fixed-point reference to build against.

This is the audit-grade move: don't ship the regression. Document why it didn't work. Make the artifact useful even when the optimization fails.


The audit-anchor-as-source-code-test pattern

This is what the reviewer called "the single most valuable thing this codebase has taught us."

Every claim that lives in a docstring or design document has a corresponding test that enforces the claim at the source-code or mathematical-property level. Examples currently in the workbench:

  • vfe_bound_test.exs — F[q] ≥ -ln p(y) against a brute-force forward algorithm
  • elbo_bound_test.exs — ELBO[q] ≤ ln p(y)
  • q_vs_p_naming_test.exs — production and audit code paths can't accidentally merge
  • blanket_ci_test.exs — inter-agent Markov blanket is a real conditional-independence partition (replay-determinism test)
  • no_thermo_overclaim_test.exs — source-code lint against thermodynamic overclaims
  • dirichlet_update_a_test.exs / dirichlet_update_b_test.exs — state-dependent alpha deltas (the v1.1 fix)
  • event_log_consistency_test.exs — per-agent_id monotonicity under N parallel writers
  • nx_benchmark_test.exs — substrate ceiling baseline (the v2 finding)
  • experiment_one_v2_test.exs — the GW1 three-arm result
  • belief_evolution_prediction_test.exs — the G4 predictive trajectory

Each one is a claim that fails loud when it drifts. Each one was named in a review or surfaced from a refused over-claim. Each one is a piece of the methodology, not the math.
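
The shape of an anchor is worth showing, because it is deliberately boring. This sketch is illustrative — the wildcard path and regex are assumptions, not the contents of no_thermo_overclaim_test.exs — but the pattern is the same: the docstring's promise is enforced by a test that reads the source, so the claim cannot drift silently.

defmodule OverclaimAnchorSketchTest do
  use ExUnit.Case

  @forbidden ~r/thermodynamic free energy|second law|entropy production/i

  test "source never claims a thermodynamic interpretation" do
    "apps/**/lib/**/*.ex"
    |> Path.wildcard()
    |> Enum.each(fn file ->
      refute File.read!(file) =~ @forbidden,
             "#{file} makes a thermodynamic claim the docs promise not to make"
    end)
  end
end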

The reviewer recommended adoption by their own Ecphory project. That is the genuine endorsement — not "the math is right" (which any standard derivation should be), but "the way you defend the math against drift is something we want too."


What's deferred, by name

  • v2.1 — full inner-sweep Nx redesign (batched matvec across policies, defn kernels, EXLA backend). Multi-week. Tracked in OPS.md §4.
  • GreedyLoudest tie-break refinement — current baseline defaults to :stay on amplitude ties. A directional tie-break would make the EFE comparison sharper.
  • :world_models → :spec_registry app rename — Mix umbrella requires app atom = directory name. Documented in ADR-001 as a v2-milestone change with the migration shim.

These are named, not hidden. If we shipped a regression while pretending it was a feature, the audit-anchor pattern would be performance art. The whole point is that the substrate finding is the deliverable.


How to verify

git clone https://github.com/TMDLRG/TheORCHESTRATEActiveInferenceWorkbench.git
cd TheORCHESTRATEActiveInferenceWorkbench/active_inference
mix deps.get
mix compile --warnings-as-errors
mix test --exclude slow_experiment   # 322 tests, 0 failures
mix test --include slow_experiment apps/agent_plane/test/meadow/nx_benchmark_test.exs
mix phx.server                       # → http://localhost:4000/labs/meadow

Tags to pull: v1.1-remediation, v1.2-hardening, v1.3-falsifiability, v2-equivalence-proof. Each one ships with passing tests and the OPS.md / README updates that document its scope.


Credit

The Dirichlet bug, the substrate refutation, and the audit-anchor endorsement all came from one external loop. Jeremy Jones ran the eight-critic LLM-assisted review panel that produced the v1 + v2 reports. The methodology of "ask the public to poke holes; respond honestly with code, not press releases" works only if the hole-pokers exist and the honest response shows up. Jeremy's panel is both halves of that.

The next finding is welcome. Open an issue. The loop is open.


The workbench is a pedagogical Active Inference reference — discrete-time POMDP with mean-field VMP and EFE-weighted policy posterior, one specific instantiation under the FEP framework. Mathematical source: Parr, Pezzulo & Friston (2022) Active Inference, MIT Press. Code license: CC BY-NC-ND.
