We pre-registered, ran, and verified the macro ablation: information per joule, measured

#llm #localllm #opensource #ai

Maker disclosure: I build Macrokit (Apache-2.0, fully open). This is the data, not a pitch — links and the raw runs at the end.

The multi-model benchmark answered: can off-the-shelf local models do real GitHub-maintainer work? (Yes — four of them, 74–82.5% on a pre-registered 100-task corpus.) It didn't answer the more interesting question: why is moving the reasoning to design-time the efficient move, not just a trick? So we ran a direct test — the macro ablation.

Pre-registered and frozen. We committed the whole protocol (the two conditions, the trajectory→intent decode rule, the metric, and the prediction) before running a single MACRO-OFF trial. The git timestamp on bench/MACRO_ABLATION_PREREGISTRATION.md is the audit trail, and the file has had no post-hoc edits. Both conditions use the same committed 100-task corpus, the same router and tool-calling machinery, and temperature 0; the only thing that changes is the tool set:

MACRO-OFF (reason it live) — the model is given low-level primitives only and must compose the multi-step workflow itself at runtime.
MACRO-ON (the macro) — the workflow is encoded once at design time; at runtime the model only perceives intent and dispatches it in a single routing call.

For each we measure I(X;Y) in nats — the mutual information between the correct intent and the intent the model actually produced — and the compute it spent (per-task wall-clock latency). That gives value-density = I(X;Y) per second of compute (the theory's value per joule, under roughly constant power). The headline is the MACRO-ON ÷ MACRO-OFF density ratio.
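To make the metric concrete, here is a minimal sketch of how I(X;Y) in nats and value-density could be computed from a routing confusion matrix. It is an illustration under stated assumptions, not the harness code: the function names `mutual_information_nats` and `value_density`, the use of mean latency as the compute figure, and the toy matrices are all hypothetical and are not taken from the committed runs.

```python
import math


def mutual_information_nats(confusion):
    """I(X;Y) in nats from a count matrix.

    Rows are the correct intent (X), columns are the intent the model
    produced (Y). Natural log gives nats.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        return 0.0
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    mi = 0.0
    for i, row in enumerate(confusion):
        for j, n in enumerate(row):
            if n > 0:  # empty cells contribute nothing (0 * log 0 := 0)
                # p(x,y) * log( p(x,y) / (p(x) * p(y)) ), written in counts
                mi += (n / total) * math.log(n * total / (row_sums[i] * col_sums[j]))
    return mi


def value_density(confusion, mean_latency_s):
    """Information per second of compute (assumed proxy for per joule)."""
    return mutual_information_nats(confusion) / mean_latency_s


# Toy 3-intent matrices and latencies: made-up numbers, NOT results from the runs.
macro_on = [[30, 2, 1], [3, 29, 2], [1, 2, 30]]
macro_off = [[24, 6, 3], [7, 21, 6], [4, 6, 23]]

ratio = value_density(macro_on, 0.8) / value_density(macro_off, 2.4)
print(f"MACRO-ON / MACRO-OFF density ratio (toy data): {ratio:.2f}")
```

Two sanity checks are useful when re-running something like this: a perfectly diagonal two-intent matrix should give ln 2 nats, and a matrix where the output is independent of the correct intent should give zero.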

The result. Encoding the workflow as a design-time macro delivered 2.0–5.1× the information per second of compute for the 1.5B/3B/8B models. That is the per-joule win, with compute measured independently of the routing decisions. It also raised I(X;Y) 1.24–1.62× for every model that routes (4 of 5 on the ladder); that number is secondary support because I(X;Y) is computed from the same routing confusion matrix as intent accuracy (see honest scope). The durable claim is the compute efficiency: the macro spends far less runtime compute while preserving task-relevant information, so it does more useful work per second.

We report the negative too. The 7B inverted on wall-clock I/sec: its MACRO-ON ÷ MACRO-OFF density ratio was 0.72×, i.e., MACRO-OFF delivered more information per second on the 7B. Cause: run-level latency noise. The information gap between conditions is real, but the 7B's MACRO-OFF traces were unusually fast in this run, which shrank MACRO-OFF's latency and so inflated its density. The 7B's information still favors MACRO-ON; the inversion is in the compute normalizer, and it's disclosed, not buried. (Weak planners also chained ~1 call/item rather than full multi-step sequences, which moderates the per-call compute gap; a stronger planner would likely widen it.) The Mistral result is excluded from the 1.24–1.62× range because Mistral produced near-zero I(X;Y) in both conditions. That traces to a tool-call plumbing issue that makes the ratio meaningless; it is not a rigged exclusion.
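The way an information gain and a sub-1× density ratio can coexist is easier to see if the density ratio is split into two factors: (I_on / I_off) × (t_off / t_on). The sketch below does that split. The function names and every number in it are hypothetical, chosen only to show the arithmetic; they are not the 7B's measured values.

```python
def density_ratio(i_on, t_on, i_off, t_off):
    """MACRO-ON density divided by MACRO-OFF density."""
    return (i_on / t_on) / (i_off / t_off)


def decompose(i_on, t_on, i_off, t_off):
    """Split the density ratio into an information factor and a latency factor."""
    info_factor = i_on / i_off      # > 1 when MACRO-ON carries more information
    latency_factor = t_off / t_on   # < 1 when MACRO-OFF ran faster
    return info_factor, latency_factor


# Illustrative values only (nats and seconds), NOT taken from the runs.
i_on, t_on = 1.30, 1.00
i_off, t_off = 1.00, 0.55

info_factor, latency_factor = decompose(i_on, t_on, i_off, t_off)
ratio = density_ratio(i_on, t_on, i_off, t_off)

# The ratio drops below 1 whenever t_off / t_on < 1 / info_factor.
break_even = 1.0 / info_factor
print(f"info factor {info_factor:.2f}, latency factor {latency_factor:.2f}, ratio {ratio:.2f}")
print(f"inversion happens when t_off / t_on falls below {break_even:.2f}")
```

With this framing, an inverted density ratio says nothing by itself about the information factor; it only says the latency factor was small enough to outweigh it in that run.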

Why this is a different claim from the benchmark. The benchmark shows weak models score well on a narrow task. The ablation shows the mechanism: design-time encoding raises value per joule, directly measured. That's exactly the prediction in WHY_IT_WORKS.md (that a macro raises I(X;Y) per joule), drawn from A Mathematical Theory of Value (Qian, 2026). Macrokit's result supports a prediction of that theory. The theory is a standalone preprint; it doesn't depend on Macrokit, and Macrokit doesn't depend on it being the final word.

Honest scope. This is a demonstration, not a law: one task family (github-maintainer), five local models, one institution. One limitation on the information numbers: I(X;Y) is computed from the same routing confusion matrix as intent accuracy, so a raw I(X;Y) lift is partly definitional. The per-joule / per-compute result (2.0–5.1× for the 1.5B/3B/8B models) is the robust half, because compute is measured independently. A follow-up experiment using independently scored task value (rather than routing accuracy) is in flight and will close this loop. The harness and raw runs are committed and open; re-run it on your own models and push back where it breaks.