Maker disclosure: I build Macrokit (Apache-2.0, fully open). This is the data, not a pitch — links and the raw runs at the end.
The multi-model benchmark answered: can off-the-shelf local models do real GitHub-maintainer work? (Yes — four of them, 74–82.5% on a pre-registered 100-task corpus.) It didn't answer the more interesting question: why is moving the reasoning to design-time the efficient move, not just a trick? So we ran a direct test — the macro ablation.
Pre-registered and frozen. We committed the whole protocol (the two conditions, the trajectory→intent decode rule, the metric, and the prediction) before running a single MACRO-OFF trial, and made no post-hoc edits. The git timestamp on bench/MACRO_ABLATION_PREREGISTRATION.md is the audit trail. Both conditions use the same committed 100-task corpus, the same router and tool-calling machinery, and temperature 0; the only thing that changes is the tool set:
- MACRO-OFF (reason it live) — the model is given low-level primitives only and must compose the multi-step workflow itself at runtime.
- MACRO-ON (the macro) — the workflow is encoded once at design time; at runtime the model only perceives intent and dispatches it in a single routing call.
For each we measure I(X;Y) in nats — the mutual information between the correct intent and the intent the model actually produced — and the compute it spent (per-task wall-clock latency). That gives value-density = I(X;Y) per second of compute (the theory's value per joule, under roughly constant power). The headline is the MACRO-ON ÷ MACRO-OFF density ratio.
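To make the metric concrete, here is a minimal sketch of how I(X;Y) in nats and the value-density ratio could be computed from per-task records. It is an illustration, not the Macrokit harness: the function names, the plug-in (empirical-frequency) estimator, and the choice to normalize by mean per-task latency are all assumptions, and the toy intents and latencies are made up.

```python
import math
from collections import Counter

def mutual_information_nats(pairs):
    """Plug-in estimate of I(X;Y) in nats.

    pairs: list of (correct_intent, produced_intent) tuples, one per task.
    Assumption: empirical frequencies stand in for the true distribution.
    """
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * ln( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def value_density(pairs, latencies_s):
    """I(X;Y) per second. Assumption: normalize by mean per-task latency."""
    mean_latency = sum(latencies_s) / len(latencies_s)
    return mutual_information_nats(pairs) / mean_latency

def density_ratio(on_pairs, on_lat, off_pairs, off_lat):
    """MACRO-ON density divided by MACRO-OFF density."""
    return value_density(on_pairs, on_lat) / value_density(off_pairs, off_lat)

# Toy data only (hypothetical intents and latencies, not from the runs).
on_pairs = [("label", "label"), ("label", "label"),
            ("triage", "triage"), ("triage", "triage")]
off_pairs = [("label", "label"), ("label", "triage"),
             ("triage", "triage"), ("triage", "triage")]
on_lat = [1.0, 1.0, 1.0, 1.0]
off_lat = [2.0, 2.0, 2.0, 2.0]

print(density_ratio(on_pairs, on_lat, off_pairs, off_lat))
```

A ratio above 1 means MACRO-ON delivered more information per second than MACRO-OFF; a ratio below 1 means the reverse. A plug-in estimate on 100 tasks is biased upward for sparse intent distributions, so the real harness may well correct for that.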
The result. Encoding the workflow as a design-time macro raised I(X;Y) by 1.24–1.62× for every model that routes (4 of 5 on the ladder), and delivered 2.0–5.1× the information per second of compute for the 1.5B/3B/8B models. On those three, the macro preserves more of the task-relevant information and spends less runtime compute; that density ratio is the win, measured directly.
We report the negative too. The 7B inverted on wall-clock I/sec: its MACRO-ON ÷ MACRO-OFF density ratio was 0.72×, meaning MACRO-OFF delivered more information per second than MACRO-ON on the 7B. Cause: run-level latency noise. The information gap between conditions is real, but the 7B's MACRO-OFF traces were unusually fast in this run, which shrank MACRO-OFF's denominator and inflated its density. The 7B's information still favors MACRO-ON; the inversion is in the compute normalizer, and it's disclosed, not buried. (Weak planners also chained ~1 call/item rather than full multi-step sequences, which moderates the per-call compute gap; a stronger planner would likely widen it.) The Mistral result is excluded from the 1.24–1.62× range because Mistral produced near-zero I(X;Y) in both conditions, a tool-call plumbing issue that makes the ratio meaningless; it was not dropped to flatter the numbers.
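The inversion is easier to read once the density ratio is split into its two factors. The sketch below assumes density is I(X;Y) divided by mean per-task latency, so the ratio factors exactly into an information ratio and a latency ratio; it also assumes the 7B's information ratio falls inside the reported 1.24–1.62× range, which the post implies but does not state per model. The function names are hypothetical.

```python
def density_ratio_from_factors(info_ratio, latency_ratio_on_over_off):
    """Density ratio (ON / OFF) = information ratio / latency ratio.

    info_ratio: I_on / I_off
    latency_ratio_on_over_off: mean latency ON / mean latency OFF
    Assumption: density = I(X;Y) / mean latency in both conditions.
    """
    return info_ratio / latency_ratio_on_over_off

def implied_latency_ratio(info_ratio, density_ratio):
    """Solve the same identity for the latency ratio (ON / OFF)."""
    return info_ratio / density_ratio

# Bounds of the reported information range, combined with the 7B's 0.72x.
# Assumption: the 7B's own information ratio lies somewhere in this range.
low = implied_latency_ratio(1.24, 0.72)
high = implied_latency_ratio(1.62, 0.72)
print(f"implied MACRO-ON / MACRO-OFF latency on the 7B: {low:.2f}x to {high:.2f}x")
```

Under those assumptions, a 0.72× density ratio alongside an information gain can only happen if MACRO-ON's measured latency was well above MACRO-OFF's in that run, which is what locating the inversion in the compute normalizer means.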
Why this is a different claim than the benchmark. The benchmark shows weak models score well on a narrow task. The ablation shows the mechanism: design-time encoding raises value per joule, measured here as information per second of compute under roughly constant power. That's exactly the prediction in WHY_IT_WORKS.md, that a macro raises I(X;Y) per joule, drawn from A Mathematical Theory of Value (Qian, 2026). Macrokit's result supports that prediction; it does not prove the theory. The theory is a standalone preprint; it doesn't depend on Macrokit, and Macrokit doesn't depend on it being the final word.
Honest scope. This is a demonstration, not a law — one task family (github-maintainer), five local models, one institution. It's direct, falsifiable evidence for the mechanism. The harness and raw runs are committed and open; re-run it on your own models and push back where it breaks.
- The ablation + the why: https://macrokit.dev
- Code, benchmark & raw runs (Apache-2.0): https://github.com/macrokit/core
- Keyless in-browser demo: https://studio.macrokit.dev
- Theory (standalone preprint): A Mathematical Theory of Value, Qian 2026 — https://doi.org/10.5281/zenodo.20487042
— Cheng Qian