<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cheng Qian</title>
    <description>The latest articles on DEV Community by Cheng Qian (@spriterock).</description>
    <link>https://dev.to/spriterock</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3961073%2F4b486726-e053-4467-909b-c75960e7dbdd.png</url>
      <title>DEV Community: Cheng Qian</title>
      <link>https://dev.to/spriterock</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/spriterock"/>
    <language>en</language>
    <item>
      <title>Wire your AI agent into Macrokit's MCP server — and find out which workflows you should have encoded</title>
      <dc:creator>Cheng Qian</dc:creator>
      <pubDate>Fri, 19 Jun 2026 14:25:42 +0000</pubDate>
      <link>https://dev.to/spriterock/wire-your-ai-agent-into-macrokits-mcp-server-and-find-out-which-workflows-you-should-have-encoded-4l2f</link>
      <guid>https://dev.to/spriterock/wire-your-ai-agent-into-macrokits-mcp-server-and-find-out-which-workflows-you-should-have-encoded-4l2f</guid>
      <description>&lt;p&gt;Most LLM agents accumulate tools the same way codebases accumulate dependencies: organically, without consolidation, until you have 40 helpers and no macros. Macrokit's public MCP server gives your existing agent (Claude Code, Cursor, anything that speaks MCP) a way to wire in, do its normal work, and then get told which repeated workflows it should have encoded.&lt;/p&gt;

&lt;p&gt;That last part is the piece I haven't seen another tool do.&lt;/p&gt;




&lt;h2&gt;What you're wiring in&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;macrokit mcp ./demo&lt;/code&gt;, your agent sees six tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_macros&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shows the macro registry — what's already encoded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;run_macro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Executes a named macro with typed arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gh_list_issues&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lists open GitHub issues (primitive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gh_list_pulls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lists open pull requests (primitive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gh_list_pull_files&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Files changed in a PR (primitive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gh_suggest_labels_dryrun&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suggests labels without writing (dry-run primitive)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first two are runtime tools. The last four are primitives — the building blocks macros call. Your agent can call them directly; everything it calls is recorded to &lt;code&gt;.macrokit/sessions&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;Set it up&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Install the CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @macrokit/cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 — Scaffold a project:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;macrokit init demo &lt;span class="nt"&gt;--vertical&lt;/span&gt; github
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creates &lt;code&gt;demo/&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;macrokit.json&lt;/code&gt; — project config&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;macros/summarize_open_issues.ts&lt;/code&gt; and &lt;code&gt;triage_newest_pull.ts&lt;/code&gt; — two working macros&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;primitives/&lt;/code&gt; — the four GitHub primitives&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fixtures/&lt;/code&gt; — recorded test inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Wire it into Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add macrokit &lt;span class="nt"&gt;--&lt;/span&gt; macrokit mcp ./demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Start a session (&lt;code&gt;claude&lt;/code&gt;) and your agent now sees all six tools. It will use them naturally as it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 — After a session, run the gate:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;macrokit gate .macrokit/sessions &lt;span class="nt"&gt;--macros&lt;/span&gt; macros
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gate reads the session log. Any user turn where the agent made three or more distinct tool calls that weren't routed through a macro gets flagged as "a workflow without a macro" — and the gate suggests what to encode: a name, an argument schema, and a stub handler. You review, fill in the handler, and the macro library grows at the rate the agent uses the system.&lt;/p&gt;
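&lt;p&gt;To make the flagging rule concrete, here is a minimal sketch of it, not Macrokit's actual implementation. The &lt;code&gt;ToolCall&lt;/code&gt; record shape, the &lt;code&gt;viaMacro&lt;/code&gt; field, and the function name &lt;code&gt;flagUnencodedTurns&lt;/code&gt; are all assumptions made for illustration; the real session log schema may look different.&lt;/p&gt;

```typescript
// Hypothetical shape of one recorded tool call (an assumption, not the real log schema).
type ToolCall = { turn: number; user: string; tool: string; viaMacro: boolean };
type TurnSummary = { turn: number; user: string; tools: string[] };

// Flag every user turn with `threshold` or more distinct tool calls
// that were not routed through a macro.
function flagUnencodedTurns(calls: ToolCall[], threshold = 3): TurnSummary[] {
  const turns: TurnSummary[] = [];
  for (const call of calls) {
    if (call.viaMacro) continue; // already encoded, nothing to suggest
    let entry = turns.find((t) => t.turn === call.turn);
    if (entry === undefined) {
      entry = { turn: call.turn, user: call.user, tools: [] };
      turns.push(entry);
    }
    // Count each tool once per turn: the rule is about distinct calls.
    if (entry.tools.indexOf(call.tool) === -1) entry.tools.push(call.tool);
  }
  return turns.filter((t) => t.tools.length >= threshold);
}
```

&lt;p&gt;Under these assumptions, a turn made of three separate primitive calls is flagged, while a turn that went through &lt;code&gt;run_macro&lt;/code&gt; passes.&lt;/p&gt;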




&lt;h2&gt;What the gate output looks like&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;macrokit gate: 1 turn(s) ran a multi-step workflow without a macro — encode each as one macro before merging.

Session: .macrokit/sessions/mcp-2026-06-19T12-06-47-529Z.jsonl
Turn 1 — user: label the newest PR based on its changed files
3 tool call(s) — 3 un-encoded:
    - gh_list_pulls
    - gh_list_pull_files
    - gh_suggest_labels_dryrun
Suggested macro: label_newest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineMacro&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@macrokit/authoring&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;label_newest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defineMacro&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;label_newest&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;label the newest PR based on its changed files&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="c1"&gt;// TODO: extract the arguments this workflow needs from the user request&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// This workflow currently happens as several router-driven calls:&lt;/span&gt;
    &lt;span class="c1"&gt;//   - gh_list_pulls&lt;/span&gt;
    &lt;span class="c1"&gt;//   - gh_list_pull_files&lt;/span&gt;
    &lt;span class="c1"&gt;//   - gh_suggest_labels_dryrun&lt;/span&gt;
    &lt;span class="c1"&gt;// Encode the sequence here so the router dispatches it as ONE macro.&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gate &lt;strong&gt;prints&lt;/strong&gt; this stub (it doesn't write a file — that's your call). You copy it into &lt;code&gt;macros/&lt;/code&gt;, fill in the handler logic (or have a strong model do it), and the next time the agent handles a labeling request it routes through the macro in one call instead of three.&lt;/p&gt;

&lt;p&gt;This is the distillation loop: the agent works, the gate surfaces recurrence, you encode it. Strong model encodes once; weak or local model runs it forever at near-zero cost.&lt;/p&gt;




&lt;h2&gt;Honest scope&lt;/h2&gt;

&lt;p&gt;The MCP server handles &lt;strong&gt;record + run + gate&lt;/strong&gt;. It does not auto-distill macros when it detects recurrence — you get a suggestion, not an automatic encoding. The review step is intentional: a macro is code that runs deterministically, and you want a developer to own it before it runs unsupervised.&lt;/p&gt;

&lt;p&gt;Auto-distillation (where the system proposes &lt;em&gt;and&lt;/em&gt; encodes without a review step) is a separate capability, not in the public server today.&lt;/p&gt;




&lt;h2&gt;The broader argument&lt;/h2&gt;

&lt;p&gt;The on-ramp is intentionally small: one &lt;code&gt;init&lt;/code&gt;, one &lt;code&gt;mcp add&lt;/code&gt;, and a post-session &lt;code&gt;gate&lt;/code&gt;. But the compounding effect is what matters. Wire this in, work normally for a week, and run the gate. You'll see exactly which multi-step sequences your agent repeats — and the gate will have already proposed macros for them.&lt;/p&gt;

&lt;p&gt;That's the claim the pre-registered benchmark validates: once a workflow is encoded, a 7B local model routes to it at 94.5% accuracy. Not as impressive as a frontier model reasoning it live — but it costs fractions of a cent, runs offline, and never hallucinates the tool name.&lt;/p&gt;

&lt;p&gt;Everything is open. Apache 2.0. I'm the maker (Cheng Qian).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code + benchmark:&lt;/strong&gt; &lt;a href="https://github.com/macrokit/core" rel="noopener noreferrer"&gt;https://github.com/macrokit/core&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://macrokit.dev" rel="noopener noreferrer"&gt;https://macrokit.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm genuinely after feedback on where the routing breaks. If you wire this in and hit a case the gate misses or flags wrong, open an issue — that's the data the methodology needs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
    </item>
    <item>
      <title>We pre-registered, ran, and verified the macro ablation: information per joule, measured</title>
      <dc:creator>Cheng Qian</dc:creator>
      <pubDate>Tue, 02 Jun 2026 09:20:46 +0000</pubDate>
      <link>https://dev.to/spriterock/we-pre-registered-ran-and-verified-the-macro-ablation-information-per-joule-measured-4p2k</link>
      <guid>https://dev.to/spriterock/we-pre-registered-ran-and-verified-the-macro-ablation-information-per-joule-measured-4p2k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Maker disclosure:&lt;/strong&gt; I build Macrokit (Apache-2.0, fully open). This is the data, not a pitch — links and the raw runs at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href="https://macrokit.dev" rel="noopener noreferrer"&gt;multi-model benchmark&lt;/a&gt; answered: &lt;em&gt;can&lt;/em&gt; off-the-shelf local models do real GitHub-maintainer work? (Yes — four of them, 74–82.5% on a pre-registered 100-task corpus.) It didn't answer the more interesting question: &lt;strong&gt;why is moving the reasoning to design-time the efficient move, not just a trick?&lt;/strong&gt; So we ran a direct test — the &lt;strong&gt;macro ablation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-registered and frozen.&lt;/strong&gt; We committed the whole protocol (the two conditions, the trajectory→intent decode rule, the metric, and the prediction) &lt;em&gt;before&lt;/em&gt; running a single MACRO-OFF trial, and made no post-hoc edits. The git timestamp on &lt;code&gt;bench/MACRO_ABLATION_PREREGISTRATION.md&lt;/code&gt; is the audit trail. Both conditions use the same committed 100-task corpus, the same router and tool-calling machinery, and temperature 0; the &lt;em&gt;only&lt;/em&gt; thing that changes is the tool set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MACRO-OFF (reason it live)&lt;/strong&gt; — the model is given low-level primitives only and must compose the multi-step workflow itself at runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MACRO-ON (the macro)&lt;/strong&gt; — the workflow is encoded &lt;em&gt;once&lt;/em&gt; at design time; at runtime the model only perceives intent and dispatches it in a single routing call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each we measure &lt;strong&gt;&lt;code&gt;I(X;Y)&lt;/code&gt; in nats&lt;/strong&gt; — the mutual information between the correct intent and the intent the model actually produced — and the compute it spent (per-task wall-clock latency). That gives &lt;strong&gt;value-density = &lt;code&gt;I(X;Y)&lt;/code&gt; per second of compute&lt;/strong&gt; (the theory's &lt;em&gt;value per joule&lt;/em&gt;, under roughly constant power). The headline is the &lt;strong&gt;MACRO-ON ÷ MACRO-OFF density ratio&lt;/strong&gt;.&lt;/p&gt;
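&lt;p&gt;For readers who want the arithmetic, the sketch below computes &lt;code&gt;I(X;Y)&lt;/code&gt; in nats from a confusion matrix of counts using the standard mutual-information formula, then divides by latency. It is an illustration, not the benchmark harness: the function names are hypothetical, and dividing by mean per-task seconds is an assumption about how the normalization works.&lt;/p&gt;

```typescript
// counts[i][j] = number of tasks whose correct intent was i
// and whose produced intent was j.
function mutualInformationNats(counts: number[][]): number {
  let total = 0;
  const rowSums: number[] = counts.map(() => 0);
  const colSums: number[] = [];
  counts.forEach((row, i) => {
    row.forEach((n, j) => {
      total += n;
      rowSums[i] += n;
      colSums[j] = (colSums[j] ?? 0) + n;
    });
  });
  let mi = 0;
  counts.forEach((row, i) => {
    row.forEach((n, j) => {
      if (n > 0) {
        // p(x,y) * ln( p(x,y) / (p(x) * p(y)) ), written in terms of counts
        mi += (n / total) * Math.log((n * total) / (rowSums[i] * colSums[j]));
      }
    });
  });
  return mi;
}

// Assumed normalization: information per second of per-task wall-clock latency.
function valueDensity(counts: number[][], meanSecondsPerTask: number): number {
  return mutualInformationNats(counts) / meanSecondsPerTask;
}

// The headline number is then a ratio of two densities.
function densityRatio(macroOn: number, macroOff: number): number {
  return macroOn / macroOff;
}
```

&lt;p&gt;As a sanity check on the formula, a perfectly diagonal two-intent matrix with balanced classes yields ln 2 ≈ 0.693 nats, and a matrix whose rows are identical yields 0.&lt;/p&gt;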

&lt;p&gt;&lt;strong&gt;The result.&lt;/strong&gt; Encoding the workflow as a design-time macro delivered &lt;strong&gt;2.0–5.1× the information-per-second of compute for the 1.5B/3B/8B models&lt;/strong&gt;. That is the per-joule win, with compute measured independently of the routing decisions. It also raised &lt;code&gt;I(X;Y)&lt;/code&gt; &lt;strong&gt;1.24–1.62× for every model that routes&lt;/strong&gt; (4 of 5 on the ladder); that number is secondary support because &lt;code&gt;I(X;Y)&lt;/code&gt; is computed from the same routing confusion matrix as intent accuracy (see honest scope). The durable claim is the compute efficiency: the macro spends far less runtime compute while preserving task-relevant information, which means more useful work per second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We report the negative too.&lt;/strong&gt; The 7B inverted on wall-clock I/sec: its MACRO-ON ÷ MACRO-OFF density ratio was 0.72×, i.e. MACRO-OFF delivered more information per second on the 7B. Cause: run-level latency noise. The information gap between conditions is real, but the 7B's MACRO-OFF traces were unusually fast in this run, which shrank MACRO-OFF's compute denominator and inflated its density. The 7B's &lt;em&gt;information&lt;/em&gt; still favors MACRO-ON; the inversion is in the compute normalizer, and it's disclosed, not buried. (Weak planners also chained ~1 call/item rather than full multi-step sequences, which moderates the per-call compute gap; a stronger planner would likely widen it.) The Mistral result is excluded from the 1.24–1.62× range because Mistral produced near-zero &lt;code&gt;I(X;Y)&lt;/code&gt; in &lt;em&gt;both&lt;/em&gt; conditions. That was a tool-call plumbing issue that makes the ratio meaningless, not a rigged exclusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is a different claim than the benchmark.&lt;/strong&gt; The benchmark shows weak models &lt;em&gt;score well&lt;/em&gt; on a narrow task. The ablation shows &lt;strong&gt;the mechanism&lt;/strong&gt;: design-time encoding raises value per joule, directly measured. That's exactly the prediction in &lt;a href="https://github.com/macrokit/core/blob/main/docs/WHY_IT_WORKS.md" rel="noopener noreferrer"&gt;&lt;code&gt;WHY_IT_WORKS.md&lt;/code&gt;&lt;/a&gt; — that a macro raises &lt;code&gt;I(X;Y)&lt;/code&gt; per joule — drawn from &lt;em&gt;A Mathematical Theory of Value&lt;/em&gt; (Qian, 2026). &lt;strong&gt;Macrokit's result validates a prediction of that theory.&lt;/strong&gt; The theory is a standalone preprint; it doesn't depend on Macrokit, and Macrokit doesn't depend on it being the final word.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honest scope.&lt;/strong&gt; This is a &lt;strong&gt;demonstration, not a law&lt;/strong&gt; — one task family (&lt;code&gt;github-maintainer&lt;/code&gt;), five local models, one institution. One honest limitation on the information numbers: &lt;code&gt;I(X;Y)&lt;/code&gt; is computed from the same routing confusion matrix as intent accuracy, so a raw &lt;code&gt;I(X;Y)&lt;/code&gt; lift is partly definitional. The &lt;strong&gt;per-joule / per-compute result&lt;/strong&gt; (2.0–5.1× for the 1.5B/3B/8B models) is the robust half — compute is measured independently. A follow-up experiment using independently-scored task &lt;em&gt;value&lt;/em&gt; (rather than routing accuracy) is in flight and will close this loop. The harness and raw runs are committed and open; re-run it on your own models and push back where it breaks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ablation + the why: &lt;a href="https://macrokit.dev" rel="noopener noreferrer"&gt;https://macrokit.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Code, benchmark &amp;amp; raw runs (Apache-2.0): &lt;a href="https://github.com/macrokit/core" rel="noopener noreferrer"&gt;https://github.com/macrokit/core&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Keyless in-browser demo: &lt;a href="https://studio.macrokit.dev" rel="noopener noreferrer"&gt;https://studio.macrokit.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Theory (standalone preprint): &lt;em&gt;A Mathematical Theory of Value&lt;/em&gt;, Qian 2026 — &lt;a href="https://doi.org/10.5281/zenodo.20487041" rel="noopener noreferrer"&gt;https://doi.org/10.5281/zenodo.20487041&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;— Cheng Qian&lt;/p&gt;

</description>
      <category>llm</category>
      <category>localllm</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>We ported how brains manage the cost of thinking to LLM systems</title>
      <dc:creator>Cheng Qian</dc:creator>
      <pubDate>Sun, 31 May 2026 11:15:19 +0000</pubDate>
      <link>https://dev.to/spriterock/a-tiny-local-model-doing-real-github-maintainer-work-in-your-browser-and-the-pattern-behind-it-4lme</link>
      <guid>https://dev.to/spriterock/a-tiny-local-model-doing-real-github-maintainer-work-in-your-browser-and-the-pattern-behind-it-4lme</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Maker disclosure:&lt;/strong&gt; I build Macrokit (Apache-2.0, fully open). This is the idea, not a pitch — there's nothing to buy. Links at the end; the demo is keyless and runs entirely in your browser, so you can verify every claim in your own network tab.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Open one link and a ~0.5–7B model running &lt;strong&gt;in your browser&lt;/strong&gt; — no signup, no API key, no server, nothing installed — does GitHub-maintainer work you'd assume needs a frontier model: triaging the newest PR on a public repo, proposing labels, summarizing open issues. Open your network tab while it runs and the only outbound traffic is the model weights downloading once and public GitHub reads. No inference server. No key, mine or yours.&lt;/p&gt;

&lt;p&gt;That demo isn't a trick, and it isn't "weak models are secretly as smart as GPT-4." It's a structural choice about &lt;em&gt;where the thinking happens&lt;/em&gt; — and the cleanest way to explain it is that &lt;strong&gt;we ported how brains manage the cost of thinking to LLM systems.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Intelligence is expensive, so brains don't think twice&lt;/h2&gt;

&lt;p&gt;Thinking burns energy. The brain is ~2% of body mass and ~20% of resting metabolic cost, and deliberate reasoning is the most expensive thing it does. Evolution's answer wasn't "think faster." It was to &lt;strong&gt;think a thing through once, and then stop thinking about it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the dual-process picture Kahneman popularized as System 1 and System 2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System 2 — deliberation:&lt;/strong&gt; slow, effortful, expensive, flexible. It's what you use for genuinely novel problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System 1 — automaticity:&lt;/strong&gt; fast, effortless, cheap, reflexive. It's what carries the overwhelming majority of your day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole trick of an efficient mind is &lt;strong&gt;having both&lt;/strong&gt;, and routing almost everything to the cheap one. You deliberated hard the first ten times you drove a car; now you do it while holding a conversation. The cheap reflex carries ~95% of the load; the expensive mind is held in reserve for when the world surprises you.&lt;/p&gt;

&lt;p&gt;Macrokit is that architecture, ported to LLM systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The strong model is System 2&lt;/strong&gt; — slow, expensive, for the novel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A &lt;em&gt;macro&lt;/em&gt; is System 1&lt;/strong&gt; — fast, cheap, deterministic, for the routine.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;A fast cheap reflex and a slow expensive mind, with the reflex carrying the load.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;The sharpened version: macros are &lt;em&gt;compiled deliberation&lt;/em&gt;, not instinct&lt;/h2&gt;

&lt;p&gt;It's tempting to call a macro an "instinct," but that's wrong, and the correct version is more interesting. Pure instinct is &lt;strong&gt;innate&lt;/strong&gt; — genetic, like a spider's web or a suckling reflex. No macro is born; every macro is &lt;em&gt;learned&lt;/em&gt;. The right analog is &lt;strong&gt;habit and acquired expertise&lt;/strong&gt;, and the key is &lt;em&gt;how it forms&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A behavior starts as effortful System-2 deliberation. Repeated enough, the brain &lt;strong&gt;chunks&lt;/strong&gt; it and physically migrates it from the prefrontal cortex (slow, costly) to the &lt;strong&gt;basal ganglia&lt;/strong&gt; (fast, cheap). The skill stops being something you reason through and becomes something you &lt;em&gt;run&lt;/em&gt;. &lt;strong&gt;Deliberation compiles itself into reflex.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the deepest framing of what Macrokit does: &lt;strong&gt;intelligence compiling itself into reflex through repetition.&lt;/strong&gt; A strong model reasons a workflow out step by step exactly once, and that reasoning is compiled down into a deterministic artifact a weak model can just &lt;em&gt;run&lt;/em&gt;, no reasoning required.&lt;/p&gt;

&lt;h2&gt;The pattern maps end-to-end&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Macrokit&lt;/th&gt;
&lt;th&gt;Cognition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Macro / weak-model routing&lt;/td&gt;
&lt;td&gt;System 1 — fast, automatic, cheap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strong model&lt;/td&gt;
&lt;td&gt;System 2 — slow, deliberate, expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distillation gate (encode on recurrence)&lt;/td&gt;
&lt;td&gt;Neural chunking (cortex → basal ganglia)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graduation %&lt;/td&gt;
&lt;td&gt;The novice → expert curve&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bail-out detector&lt;/td&gt;
&lt;td&gt;"Wait — this isn't working, think harder"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encode once, run cheap forever&lt;/td&gt;
&lt;td&gt;The brain's energy budget — the &lt;em&gt;same&lt;/em&gt; argument&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The distillation gate is the piece I think is genuinely novel, and it's the artificial version of &lt;strong&gt;neural chunking&lt;/strong&gt; — same trigger (repetition), same reason (the cost of thinking). More on it below.&lt;/p&gt;

&lt;h2&gt;The failure mode is predicted, not hidden&lt;/h2&gt;

&lt;p&gt;Here's where the analogy earns its keep instead of just sounding nice: it &lt;strong&gt;predicts the failure mode&lt;/strong&gt; rather than papering over it.&lt;/p&gt;

&lt;p&gt;Habits are fast &lt;em&gt;but brittle&lt;/em&gt;. They misfire when the environment shifts — the moth that navigated by moonlight for a million years flies into the flame; the experienced driver's reflexes betray them on the opposite side of the road. The reflex is only safe in the world it was compiled for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Macros rot exactly the same way.&lt;/strong&gt; A macro over a third-party surface breaks when that surface changes underneath it. The cognitive frame doesn't excuse this — it &lt;em&gt;anticipates&lt;/em&gt; it, and prescribes the same fix biology uses: when the automatic path fails, &lt;strong&gt;re-deliberate&lt;/strong&gt;. That's the &lt;strong&gt;bail-out detector&lt;/strong&gt; — "this isn't working, think harder" — kicking the system from autopilot back into System 2. Loud, typed failures, caught in CI, not a "self-healing" hand-wave.&lt;/p&gt;

&lt;h2&gt;Where the analogy honestly breaks&lt;/h2&gt;

&lt;p&gt;Never over-romanticize this. Three places it doesn't hold, and you should know them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Innate vs. learned.&lt;/strong&gt; No macro is genetic. (If you squint: the SDK's &lt;em&gt;primitives&lt;/em&gt; are the innate reflexes the system is born with; macros are the habits learned from them.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metacognition.&lt;/strong&gt; A human can introspect and &lt;em&gt;choose&lt;/em&gt; to override a habit. Macrokit's "which system fires" is a confidence-gated router — far cruder than real metacognitive control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generality.&lt;/strong&gt; Some instincts transfer broadly; a macro is narrow and parameterized — a &lt;em&gt;specific&lt;/em&gt; skill, not a broad drive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a real structural correspondence under a shared cost constraint, not a claim of biological fidelity. That's the honest version.&lt;/p&gt;




&lt;p&gt;The rest of this is the engineering and the evidence. The cognition frame is the spine; everything below is the proof that the spine is load-bearing.&lt;/p&gt;

&lt;h2&gt;The two-phase split, concretely&lt;/h2&gt;

&lt;p&gt;Most production LLM workflows aren't novel reasoning problems. They're the &lt;strong&gt;same shape of work, with different arguments, run thousands of times&lt;/strong&gt; — fetch this, extract that, score it, label it. The hard part isn't deciding what to do once you understand the request; it's &lt;em&gt;reasoning your way there step by step&lt;/em&gt;, which is exactly what weak models can't do reliably.&lt;/p&gt;

&lt;p&gt;So don't make weak models reason better. &lt;strong&gt;Remove the runtime reasoning requirement.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any workflow a strong model can solve by reasoning step-by-step on a known surface can be encoded once as a deterministic, parameterized sequence of tool calls. After that, &lt;em&gt;executing&lt;/em&gt; it only requires intent classification — a one-shot routing problem small models handle fine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That encoded sequence is a &lt;strong&gt;macro&lt;/strong&gt;, and it splits the work in two — design-time deliberation, runtime reflex:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Design time (offline, rare — System 2):&lt;/strong&gt; a strong model, supervised by a developer using the coding agent you already use, solves the workflow once and writes the macro. Versioned, reviewable, deterministic. Costs ~$0.50 of inference and happens once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime (online, constant — System 1):&lt;/strong&gt; a weak/local model gets a request, classifies which macro it maps to, and calls it with extracted arguments. The macro runs as ordinary tested code. The model never plans the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost asymmetry is the whole point — the same argument the brain's energy budget makes. Encode once with a frontier model; execute thousands of times with a model that costs ~1/100th–1/1000th as much and runs on a laptop. The capability gap between models stops mattering &lt;em&gt;for that workflow&lt;/em&gt;.&lt;/p&gt;
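&lt;p&gt;The runtime half can be sketched in a few lines. This is a toy, assuming a keyword-overlap scorer in place of the weak model so it runs offline; the &lt;code&gt;Macro&lt;/code&gt; type, &lt;code&gt;score&lt;/code&gt;, &lt;code&gt;route&lt;/code&gt;, the &lt;code&gt;minScore&lt;/code&gt; threshold, and the intent strings are hypothetical, not Macrokit's API.&lt;/p&gt;

```typescript
// Hypothetical, simplified macro shape for this sketch only.
type Macro = {
  name: string;
  intent: string;
  handler: (args: { [key: string]: string }) => string;
};

// Stand-in for the weak model: fraction of request words found in the intent.
function score(request: string, intent: string): number {
  const words = request.toLowerCase().split(/\W+/).filter((w) => w.length > 0);
  const intentWords = intent.toLowerCase().split(/\W+/);
  const hits = words.filter((w) => intentWords.indexOf(w) !== -1).length;
  return hits / Math.max(words.length, 1);
}

// One-shot routing: pick the best-matching macro and dispatch it,
// or escalate when nothing matches confidently enough.
function route(request: string, macros: Macro[], minScore = 0.5): string {
  let best: Macro | undefined;
  let bestScore = 0;
  for (const m of macros) {
    const s = score(request, m.intent);
    if (s > bestScore) {
      best = m;
      bestScore = s;
    }
  }
  if (best === undefined || minScore > bestScore) return "escalate: no macro matched";
  return best.handler({ request });
}

// Macro names come from the scaffolded demo; handlers are placeholders.
const exampleMacros: Macro[] = [
  { name: "summarize_open_issues", intent: "summarize open issues", handler: () => "ran summarize_open_issues" },
  { name: "triage_newest_pull", intent: "triage the newest pull request", handler: () => "ran triage_newest_pull" },
];
```

&lt;p&gt;The point of the shape: the only decision made at runtime is which macro to call. The workflow itself lives in the handler, as ordinary code.&lt;/p&gt;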

&lt;h2&gt;What a macro actually is&lt;/h2&gt;

&lt;p&gt;Not a prompt template, not a cache. A parameterized program with five parts: an intent spec the router matches against, a typed argument schema, a deterministic handler (the real tool-call sequence, in code), a structured failure contract, and test fixtures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;defineMacro&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;triage_arxiv_paper&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarize and classify an arXiv paper by its ID or URL&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;paperId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;relevance&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;novelty&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;method&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;relevance&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;paperId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;classifier&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arxiv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;paperId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pdf&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arxiv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchPdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;paperId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1-3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;classifier&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;paperId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;oneLine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="cm"&gt;/* recorded request → expected-output fixtures */&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At runtime the model sees &lt;em&gt;"triage 2401.12345, I care whether the method is new"&lt;/em&gt; and emits exactly one call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"triage_arxiv_paper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"paperId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2401.12345"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"classifier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"method"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It doesn't decide to fetch, then extract, then score. That sequence was compiled offline — the reflex already exists. The model only routes. (Composition is also a macro — if a workflow is "run A, then B, then C," that's &lt;em&gt;one&lt;/em&gt; macro &lt;code&gt;run_full_pipeline&lt;/code&gt;, not three router turns. Three router turns is reasoning-at-runtime by the back door.)&lt;/p&gt;
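
&lt;p&gt;A minimal sketch of what "the model only routes" can look like, assuming a hypothetical in-memory registry rather than Macrokit's actual runtime. The names &lt;code&gt;registry&lt;/code&gt;, &lt;code&gt;dispatch&lt;/code&gt;, and the three step functions are illustrative only; the point is that a composed workflow is a single registry entry, so the router spends one turn on it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch, not Macrokit's runtime: one routed call, one lookup.
type Args = { [key: string]: string };
type Handler = { (args: Args): string };
type RouterCall = { tool: string; args: Args };

// Stand-ins for the offline-compiled steps A, B, and C.
function stepA(input: string): string { return input + ":fetched"; }
function stepB(input: string): string { return input + ":extracted"; }
function stepC(input: string): string { return input + ":scored"; }

const registry: { [name: string]: Handler } = {
  // Composition lives inside one macro, so the router never sequences steps.
  run_full_pipeline: function (args: Args): string {
    return stepC(stepB(stepA(args["id"])));
  },
};

// Execute the single call the model emitted, or fail loudly if no macro matches.
function dispatch(call: RouterCall): string {
  const handler = registry[call.tool];
  if (handler === undefined) {
    throw new Error("no macro named " + call.tool);
  }
  return handler(call.args);
}

console.log(dispatch({ tool: "run_full_pipeline", args: { id: "2401.12345" } }));
// prints "2401.12345:fetched:extracted:scored"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;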

&lt;h2&gt;
  
  
  The distillation gate = neural chunking, as a CLI
&lt;/h2&gt;

&lt;p&gt;A macro library is only useful if it's complete enough for the workflows people actually run. Most tool collections grow organically and rot — every session adds a one-off helper, no session consolidates. The brain has the opposite reflex: repeat something enough and it &lt;em&gt;gets chunked&lt;/em&gt; into automaticity, on purpose, because re-deliberating it every time is too expensive.&lt;/p&gt;

&lt;p&gt;Macrokit makes that reflex a discipline, enforced by tooling:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Every session that touches a workflow with no existing macro must encode one before it ends.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;code&gt;macrokit gate&lt;/code&gt; command reads the session log and fails the build when it sees raw tool calls for a workflow that has no macro yet, meaning repetition that should have been chunked but wasn't:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ macrokit gate
Session 2026-05-24T14:02Z used 4 raw tool calls for an unmacro'd workflow:
  → fetch_user_profile(id=…)
  → list_user_open_issues(id=…)
  → label_issues(ids=…, label="needs-triage")
  → notify_assignees(issue_ids=…)
Encode this before ending the session.
Suggested: triage_open_issues_for_user(user_id, label="needs-triage")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
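
&lt;p&gt;The check behind that output can be sketched in a few lines. This is an assumption-heavy illustration, not the shipped implementation: the &lt;code&gt;LogEntry&lt;/code&gt; shape and the &lt;code&gt;viaMacro&lt;/code&gt; field are hypothetical, since the real session-log format isn't shown in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical log shape: each tool call records which macro (if any) issued it.
type LogEntry = { tool: string; viaMacro: string | null };

// Collect the tool calls that ran outside any macro.
function findRawCalls(log: LogEntry[]): string[] {
  const raw: string[] = [];
  for (const entry of log) {
    if (entry.viaMacro === null) {
      raw.push(entry.tool);
    }
  }
  return raw;
}

// Gate verdict: 0 lets the build pass, 1 fails it.
function gateExitCode(log: LogEntry[]): number {
  return findRawCalls(log).length === 0 ? 0 : 1;
}

const session: LogEntry[] = [
  { tool: "fetch_user_profile", viaMacro: null },
  { tool: "list_user_open_issues", viaMacro: null },
  { tool: "label_issues", viaMacro: null },
  { tool: "notify_assignees", viaMacro: null },
  { tool: "arxiv.fetchPdf", viaMacro: "triage_arxiv_paper" },
];

console.log(findRawCalls(session).join(", "));
console.log("exit code:", gateExitCode(session)); // exit code: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In CI, that number would be handed back as the process exit status, so any non-zero value fails the job.&lt;/p&gt;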



&lt;p&gt;The runtime is just engineering. The gate is the cultural piece — it's chunking, triggered by repetition, for the same reason brains chunk: the cost of thinking. Wire it into CI and your library compounds at the rate you use the system, instead of becoming a graveyard of helpers. That's why this pattern compounds where agent frameworks and RPA libraries haven't — and &lt;strong&gt;graduation %&lt;/strong&gt;, the share of traffic the cheap reflex now carries, is just the novice→expert curve made into a metric.&lt;/p&gt;
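
&lt;p&gt;As arithmetic, graduation % is a plain ratio. A tiny sketch, assuming it is defined as macro-handled requests over all requests; the function name and the counts are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Assumed definition: share of all requests that were served by a macro.
function graduationPercent(macroHandled: number, total: number): number {
  if (total === 0) {
    return 0; // no traffic yet, so nothing has graduated
  }
  return (macroHandled * 100) / total;
}

// Illustrative counts only: 170 of 200 requests went through a macro.
console.log(graduationPercent(170, 200)); // 85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;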

&lt;h2&gt;
  
  
  The honesty beat: four local models clear the bar — and the one that doesn't
&lt;/h2&gt;

&lt;p&gt;I pre-registered a 100-task intent-routing benchmark, temperature 0, no cloud and no key. Four off-the-shelf local models straight from &lt;code&gt;ollama pull&lt;/code&gt; — Qwen 2.5 1.5B / 3B / 7B and Llama 3.1 8B — score &lt;strong&gt;74–82.5%&lt;/strong&gt;. The same 7B tuned as the production reference (llama.cpp, Q4_K_M) reaches &lt;strong&gt;94.5%&lt;/strong&gt; with zero structural failures; most of that gap is serving, quantization, and sampling config, not raw model capability. No frontier rows — deliberately.&lt;/p&gt;

&lt;p&gt;The benchmark also ships the model that flunks: &lt;strong&gt;Mistral 7B v0.3 scored 14%&lt;/strong&gt;. It narrated tool calls in prose ("the &lt;code&gt;triage_pull_request&lt;/code&gt; macro will be called with…") instead of emitting structured ones, and the bail-out detector caught that on 24 tasks rather than scoring a hallucinated success. Publishing the row that fails is the point — a bar you can't fail isn't a bar.&lt;/p&gt;
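
&lt;p&gt;For a sense of what a bail-out check can look like, here is a minimal sketch. It assumes the model is expected to reply with a JSON object carrying a string &lt;code&gt;tool&lt;/code&gt; field; the &lt;code&gt;classifyReply&lt;/code&gt; function and &lt;code&gt;Verdict&lt;/code&gt; type are hypothetical, and the benchmark's own detector may use different rules.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical detector: a reply only counts if it parses as a structured call.
type Verdict =
  | { kind: "call"; tool: string }
  | { kind: "bail_out"; reason: string };

function classifyReply(reply: string): Verdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(reply);
  } catch {
    // Prose that merely describes a call lands here.
    return { kind: "bail_out", reason: "not JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { kind: "bail_out", reason: "not an object" };
  }
  const tool = (parsed as { tool?: unknown }).tool;
  if (typeof tool !== "string") {
    return { kind: "bail_out", reason: "missing tool name" };
  }
  return { kind: "call", tool: tool };
}

const structured = '{"tool":"triage_pull_request","args":{}}';
const narrated = "The triage_pull_request macro will be called with the PR number.";

console.log(classifyReply(structured).kind); // call
console.log(classifyReply(narrated).kind);   // bail_out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Scoring a bail-out as a failure, rather than trying to read intent out of the prose, is what keeps a narrated call from counting as a hallucinated success.&lt;/p&gt;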

&lt;p&gt;And the methodology earned its keep before any of that. The production 7B's &lt;em&gt;first&lt;/em&gt; pre-registered run scored &lt;strong&gt;53.5%&lt;/strong&gt;. That was not a model problem but a bug in my own SDK: zod schemas don't carry JSON Schema by default, so the router fell back to a permissive &lt;code&gt;{"type": "object"}&lt;/code&gt; and the model never saw the real argument names. A ~12-line fix (converting zod to JSON Schema at &lt;code&gt;defineMacro()&lt;/code&gt; time) took the same model on the same corpus to 94.5%. The failed run is shipped right next to the fixed one. Pre-registration is &lt;em&gt;for&lt;/em&gt; catching exactly that kind of silent config miss; re-run the harness yourself.&lt;/p&gt;
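
&lt;p&gt;To make that failure concrete, here is a sketch that contrasts the permissive fallback with a schema that actually names the arguments. It deliberately avoids zod: the &lt;code&gt;Field&lt;/code&gt; descriptor and &lt;code&gt;toJsonSchema&lt;/code&gt; are hypothetical stand-ins for the real conversion, not the fix that shipped.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical field descriptors standing in for a zod object schema.
type Field = { name: string; type: "string" | "number"; choices?: string[] };
type Property = { type: string; enum?: string[] };
type JsonSchema = { type: "object"; properties?: { [name: string]: Property } };

// The permissive fallback: valid JSON Schema, but it names no arguments.
const fallback: JsonSchema = { type: "object" };

// Build an explicit schema so the model can see the real argument names.
function toJsonSchema(fields: Field[]): JsonSchema {
  const properties: { [name: string]: Property } = {};
  for (const field of fields) {
    properties[field.name] = field.choices === undefined
      ? { type: field.type }
      : { type: field.type, enum: field.choices };
  }
  return { type: "object", properties: properties };
}

const schema = toJsonSchema([
  { name: "paperId", type: "string" },
  { name: "classifier", type: "string", choices: ["relevance", "novelty", "method"] },
]);

console.log(JSON.stringify(fallback)); // {"type":"object"}
console.log(JSON.stringify(schema));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both objects validate the same correct call. Only the second tells the model that &lt;code&gt;paperId&lt;/code&gt; and &lt;code&gt;classifier&lt;/code&gt; exist, which is the information the router was missing in the 53.5% run.&lt;/p&gt;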

&lt;h2&gt;
  
  
  Where it does &lt;em&gt;not&lt;/em&gt; help (so you don't misapply it)
&lt;/h2&gt;

&lt;p&gt;This falls right out of the System-1/System-2 split: automaticity is for the routine, deliberation is for the novel. So —&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Genuinely novel reasoning&lt;/strong&gt; — "write a reply to this angry customer," "price a brand-new category." That's System-2 work; route it to a frontier model. Hybrid routing (local handles the routine 80–90%, frontier handles the novel remainder) is opt-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflows that change every time&lt;/strong&gt; — exploratory research, open-ended debugging. Nothing to chunk; the pattern is pure overhead. The value comes from the ratio of executions to encodings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surfaces that change underneath you&lt;/strong&gt; — the brittle-habit failure mode above. Mitigation is loud, typed failures caught in CI plus a DOM/action-menu abstraction, not a "self-healing" claim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models that can't reliably emit tool calls&lt;/strong&gt; — it's less about size than structured-output discipline. Qwen 2.5 scales cleanly from 1.5B (74%) up, but a &lt;em&gt;7B&lt;/em&gt; that narrates calls in prose instead of emitting them (Mistral 7B v0.3, 14% here) won't clear the bar until its tool-calling is fixed. The bail-out detector catches those rather than scoring hallucinated successes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What it is not
&lt;/h2&gt;

&lt;p&gt;Not an agent framework: those compete on &lt;em&gt;better&lt;/em&gt; runtime reasoning (more System 2), while this &lt;em&gt;eliminates&lt;/em&gt; it. It's "an agent that routes," not "an agent that thinks." Not a model (bring your own; OpenAI-compatible endpoints and Ollama work out of the box). Not RPA (macros are semantic tool calls, not recorded pixels). Not a fine-tuning pipeline. Not no-code (authoring needs a developer plus a strong model).&lt;/p&gt;

&lt;p&gt;This isn't a fad or a hack. It's the &lt;strong&gt;universal architecture of efficient intelligence under a cost budget&lt;/strong&gt; — the same answer brains, animals, and now LLM systems all converge on, because the cost asymmetry between deliberating and executing isn't going away. I've run the pattern in production for about a year inside an unrelated operations tool serving users with no practical frontier-API access, and pulled the vertical-agnostic core out into Macrokit. The interesting question to me isn't whether weak models can match frontier — it's how much of your real workload is repetitive enough that you never need to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it / read it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Demo&lt;/strong&gt; (keyless, in-browser, public repos — open the network tab): &lt;a href="https://studio.macrokit.dev" rel="noopener noreferrer"&gt;https://studio.macrokit.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The pattern + the benchmark, including the failed first run:&lt;/strong&gt; &lt;a href="https://macrokit.dev" rel="noopener noreferrer"&gt;https://macrokit.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SDK (Apache-2.0, TypeScript):&lt;/strong&gt; &lt;a href="https://github.com/macrokit/core" rel="noopener noreferrer"&gt;https://github.com/macrokit/core&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd genuinely like to hear where this breaks on real work — that's the useful feedback. Pushback welcome.&lt;/p&gt;

&lt;p&gt;— Cheng Qian&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>llm</category>
      <category>localllm</category>
    </item>
  </channel>
</rss>
