DEV Community: Cheng Qian

Wire your AI agent into Macrokit's MCP server — and find out which workflows you should have encoded

Cheng Qian — Fri, 19 Jun 2026 14:25:42 +0000

Most LLM agents accumulate tools the same way codebases accumulate dependencies: organically, without consolidation, until you have 40 helpers and no macros. Macrokit's public MCP server gives your existing agent (Claude Code, Cursor, anything that speaks MCP) a way to wire in, do its normal work, and then get told which repeated workflows it should have encoded.

That last part is the thing no other tool does.

What you're wiring in

When you run macrokit mcp ./demo, your agent sees six tools:

| Tool | What it does |
| --- | --- |
| `list_macros` | Shows the macro registry: what's already encoded |
| `run_macro` | Executes a named macro with typed arguments |
| `gh_list_issues` | Lists open GitHub issues (primitive) |
| `gh_list_pulls` | Lists open pull requests (primitive) |
| `gh_list_pull_files` | Files changed in a PR (primitive) |
| `gh_suggest_labels_dryrun` | Suggests labels without writing (dry-run primitive) |

The first two are runtime tools. The last four are primitives — the building blocks macros call. Your agent can call them directly; everything it calls is recorded to .macrokit/sessions.

Set it up

Step 1 — Install the CLI:

npm install -g @macrokit/cli

Step 2 — Scaffold a project:

macrokit init demo --vertical github

Creates demo/:

macrokit.json — project config
macros/summarize_open_issues.ts and triage_newest_pull.ts — two working macros
primitives/ — the four GitHub primitives
fixtures/ — recorded test inputs

Step 3 — Wire it into Claude Code:

claude mcp add macrokit -- macrokit mcp ./demo

That's it. Start a session (claude) and your agent now sees all six tools. It will use them naturally as it works.

Step 4 — After a session, run the gate:

macrokit gate .macrokit/sessions --macros macros

The gate reads the session log. Any user turn where the agent made three or more distinct tool calls that weren't routed through a macro gets flagged as "a workflow without a macro" — and the gate suggests what to encode: a name, an argument schema, and a stub handler. You review, fill in the handler, and the macro library grows at the rate the agent uses the system.
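The flagging rule above can be sketched in a few lines. This is a minimal illustration, not Macrokit's implementation: the `ToolCall` and `Turn` shapes, the `viaMacro` flag, and the `flagTurns` name are assumptions made for the example, and the real session log format may differ.

```typescript
// Hypothetical shapes for one recorded session; the real .jsonl schema may differ.
interface ToolCall {
  tool: string;
  viaMacro: boolean; // assumed flag: true when the call was routed through a macro
}

interface Turn {
  user: string;
  calls: ToolCall[];
}

// A turn is flagged when it made at least `threshold` distinct tool calls
// that were not routed through a macro.
function flagTurns(turns: Turn[], threshold = 3): Turn[] {
  return turns.filter((turn) => {
    const raw = new Set(
      turn.calls.filter((c) => !c.viaMacro).map((c) => c.tool),
    );
    return raw.size >= threshold;
  });
}

const session: Turn[] = [
  {
    user: "label the newest PR based on its changed files",
    calls: [
      { tool: "gh_list_pulls", viaMacro: false },
      { tool: "gh_list_pull_files", viaMacro: false },
      { tool: "gh_suggest_labels_dryrun", viaMacro: false },
    ],
  },
  {
    user: "summarize open issues",
    calls: [{ tool: "run_macro", viaMacro: true }],
  },
];

// Only the first turn is flagged: three distinct raw calls, none via a macro.
const flagged = flagTurns(session);
```

Counting distinct tool names (rather than total calls) means a turn that calls the same primitive in a loop is not flagged on its own, which matches the "three or more distinct tool calls" wording.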

What the gate output looks like

macrokit gate: 1 turn(s) ran a multi-step workflow without a macro — encode each as one macro before merging.

Session: .macrokit/sessions/mcp-2026-06-19T12-06-47-529Z.jsonl
Turn 1 — user: label the newest PR based on its changed files
3 tool call(s) — 3 un-encoded:
    - gh_list_pulls
    - gh_list_pull_files
    - gh_suggest_labels_dryrun
Suggested macro: label_newest

import { defineMacro } from "@macrokit/authoring";
import { z } from "zod";

export const label_newest = defineMacro({
  name: "label_newest",
  intent: "label the newest PR based on its changed files",
  schema: z.object({
    // TODO: extract the arguments this workflow needs from the user request
  }),
  handler: async (args, ctx) => {
    // This workflow currently happens as several router-driven calls:
    //   - gh_list_pulls
    //   - gh_list_pull_files
    //   - gh_suggest_labels_dryrun
    // Encode the sequence here so the router dispatches it as ONE macro.
    return {};
  },
});

The gate prints this stub (it doesn't write a file — that's your call). You copy it into macros/, fill in the handler logic (or have a strong model do it), and the next time the agent handles a labeling request it routes through the macro in one call instead of three.
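For illustration, here is one way the filled-in logic might look. It is a sketch under stated assumptions: the `GitHubTools` interface, its method names, and the `repo` argument are hypothetical stand-ins for however your project exposes the three primitives, and `defineMacro` is left out so the sequence can run on its own.

```typescript
// Hypothetical stand-ins for the three primitives; real signatures may differ.
interface Pull {
  number: number;
  createdAt: string; // ISO 8601 timestamp in UTC
}

interface GitHubTools {
  listPulls(repo: string): Promise<Pull[]>;
  listPullFiles(repo: string, pull: number): Promise<string[]>;
  suggestLabelsDryRun(repo: string, pull: number, files: string[]): Promise<string[]>;
}

interface LabelNewestResult {
  pull: number | null;
  labels: string[];
}

// The three router-driven calls, encoded as one deterministic sequence.
async function labelNewest(
  args: { repo: string },
  tools: GitHubTools,
): Promise<LabelNewestResult> {
  const pulls = await tools.listPulls(args.repo);
  if (pulls.length === 0) return { pull: null, labels: [] };

  // UTC ISO 8601 timestamps of the same shape compare correctly as strings.
  const newest = pulls.reduce((a, b) => (a.createdAt >= b.createdAt ? a : b));
  const files = await tools.listPullFiles(args.repo, newest.number);
  const labels = await tools.suggestLabelsDryRun(args.repo, newest.number, files);
  return { pull: newest.number, labels };
}

// In-memory fake, used only to exercise the sketch.
const fakeTools: GitHubTools = {
  listPulls: async () => [
    { number: 7, createdAt: "2026-06-01T10:00:00Z" },
    { number: 9, createdAt: "2026-06-18T08:30:00Z" },
  ],
  listPullFiles: async () => ["docs/README.md"],
  suggestLabelsDryRun: async (_repo, _pull, files) =>
    files.some((f) => f.startsWith("docs/")) ? ["documentation"] : [],
};
```

Passing the tools in as a parameter keeps the sequence testable against recorded fixtures without touching the network.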

This is the distillation loop: the agent works, the gate surfaces recurrence, you encode it. Strong model encodes once; weak or local model runs it forever at near-zero cost.

Honest scope

The MCP server handles record + run + gate. It does not auto-distill macros when it detects recurrence — you get a suggestion, not an automatic encoding. The review step is intentional: a macro is code that runs deterministically, and you want a developer to own it before it runs unsupervised.

Auto-distillation (where the system proposes and encodes without a review step) is a separate capability, not in the public server today.

The broader argument

The on-ramp is intentionally small: one init, one mcp add, and a post-session gate. But the compounding effect is what matters. Wire this in, work normally for a week, and run the gate. You'll see exactly which multi-step sequences your agent repeats — and the gate will have already proposed macros for them.

That's the claim the pre-registered benchmark validates: once a workflow is encoded, a 7B local model routes to it at 94.5% accuracy. Not as impressive as a frontier model reasoning it live — but it costs fractions of a cent, runs offline, and never hallucinates the tool name.

Everything is open. Apache 2.0. I'm the maker (Cheng Qian).

Code + benchmark: https://github.com/macrokit/core
Docs: https://macrokit.dev

I'm genuinely after feedback on where the routing breaks. If you wire this in and hit a case the gate misses or flags wrong, open an issue — that's the data the methodology needs.

We pre-registered, ran, and verified the macro ablation: information per joule, measured

Cheng Qian — Tue, 02 Jun 2026 09:20:46 +0000

Maker disclosure: I build Macrokit (Apache-2.0, fully open). This is the data, not a pitch — links and the raw runs at the end.

The multi-model benchmark answered: can off-the-shelf local models do real GitHub-maintainer work? (Yes — four of them, 74–82.5% on a pre-registered 100-task corpus.) It didn't answer the more interesting question: why is moving the reasoning to design-time the efficient move, not just a trick? So we ran a direct test — the macro ablation.

Pre-registered and frozen. We committed the whole protocol (the two conditions, the trajectory→intent decode rule, the metric, and the prediction) before running a single MACRO-OFF trial. The git timestamp on bench/MACRO_ABLATION_PREREGISTRATION.md is the audit trail, and there were no post-hoc edits. Same committed 100-task corpus, same router and tool-calling machinery, temperature 0; the only thing that changes is the tool set:

MACRO-OFF (reason it live) — the model is given low-level primitives only and must compose the multi-step workflow itself at runtime.
MACRO-ON (the macro) — the workflow is encoded once at design time; at runtime the model only perceives intent and dispatches it in a single routing call.

For each we measure I(X;Y) in nats — the mutual information between the correct intent and the intent the model actually produced — and the compute it spent (per-task wall-clock latency). That gives value-density = I(X;Y) per second of compute (the theory's value per joule, under roughly constant power). The headline is the MACRO-ON ÷ MACRO-OFF density ratio.
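As a worked sketch of the metric: mutual information can be computed directly from a confusion matrix of counts, where rows are the correct intent and columns are the intent the model produced, using I(X;Y) = Σ p(x,y) ln[p(x,y) / (p(x) p(y))]. The function names and the toy matrices below are illustrative assumptions, not taken from the benchmark harness.

```typescript
// counts[i][j] = number of tasks whose correct intent was i and produced intent was j.
function mutualInformationNats(counts: number[][]): number {
  const rowSums = counts.map((row) => row.reduce((s, n) => s + n, 0));
  const width = counts.reduce((w, row) => Math.max(w, row.length), 0);
  const colSums: number[] = [];
  for (let j = 0; j < width; j++) {
    colSums.push(counts.reduce((s, row) => s + (row[j] ?? 0), 0));
  }
  const total = rowSums.reduce((s, n) => s + n, 0);
  if (total === 0) return 0;

  let info = 0;
  counts.forEach((row, i) => {
    row.forEach((n, j) => {
      if (n === 0) return; // an empty cell contributes nothing to the sum
      const pxy = n / total;
      const px = (rowSums[i] ?? 0) / total;
      const py = (colSums[j] ?? 0) / total;
      info += pxy * Math.log(pxy / (px * py)); // natural log, so the unit is nats
    });
  });
  return info;
}

// Value density: information per second of compute.
function valueDensity(infoNats: number, meanLatencySeconds: number): number {
  return infoNats / meanLatencySeconds;
}

// Headline number: MACRO-ON density divided by MACRO-OFF density.
function densityRatio(on: number, off: number): number {
  return on / off;
}

// Two intents routed perfectly carry ln(2) nats.
const perfect = mutualInformationNats([
  [50, 0],
  [0, 50],
]);

// A router whose output is independent of the correct intent carries none.
const blind = mutualInformationNats([
  [25, 25],
  [25, 25],
]);
```

Because the density is a quotient, a ratio below 1 can come from either term: less information in MACRO-ON, or unusually low latency in MACRO-OFF.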

The result. Encoding the workflow as a design-time macro delivered 2.0–5.1× the information per second of compute for the 1.5B/3B/8B models. That is the per-joule win, with compute measured independently of the routing decisions. It also raised I(X;Y) 1.24–1.62× for every model that routes (4 of 5 on the ladder); that number is secondary support because I(X;Y) is computed from the same routing confusion matrix as intent accuracy (see honest scope). The durable claim is the compute efficiency: the macro spends far less runtime compute while preserving task-relevant information, which means more useful work per second.

We report the negative too. The 7B inverted on wall-clock I/sec: its MACRO-ON ÷ MACRO-OFF density ratio was 0.72×, which favors MACRO-OFF (on the 7B, MACRO-OFF delivered more information per second). Cause: run-level latency noise. The information gap between conditions is real, but the 7B's MACRO-OFF traces were unusually fast in this run, compressing the denominator of MACRO-OFF's density. The 7B's information still favors MACRO-ON; the inversion is in the compute normalizer, and it's disclosed, not buried. (Weak planners also chained ~1 call/item rather than full multi-step sequences, which moderates the per-call compute gap; a stronger planner would likely widen it.) The Mistral result is excluded from the 1.24–1.62× range because Mistral produced near-zero I(X;Y) in both conditions: a tool-call plumbing issue that makes the ratio meaningless, not a rigged exclusion.

Why this is a different claim than the benchmark. The benchmark shows weak models score well on a narrow task. The ablation shows the mechanism: design-time encoding raises value per joule, directly measured. That's exactly the prediction in WHY_IT_WORKS.md — that a macro raises I(X;Y) per joule — drawn from A Mathematical Theory of Value (Qian, 2026). Macrokit's result validates a prediction of that theory. The theory is a standalone preprint; it doesn't depend on Macrokit, and Macrokit doesn't depend on it being the final word.

Honest scope. This is a demonstration, not a law — one task family (github-maintainer), five local models, one institution. One honest limitation on the information numbers: I(X;Y) is computed from the same routing confusion matrix as intent accuracy, so a raw I(X;Y) lift is partly definitional. The per-joule / per-compute result (2.0–5.1× for the 1.5B/3B/8B models) is the robust half — compute is measured independently. A follow-up experiment using independently-scored task value (rather than routing accuracy) is in flight and will close this loop. The harness and raw runs are committed and open; re-run it on your own models and push back where it breaks.

The ablation + the why: https://macrokit.dev
Code, benchmark & raw runs (Apache-2.0): https://github.com/macrokit/core
Keyless in-browser demo: https://studio.macrokit.dev
Theory (standalone preprint): A Mathematical Theory of Value, Qian 2026 — https://doi.org/10.5281/zenodo.20487041

— Cheng Qian

We ported how brains manage the cost of thinking to LLM systems

Cheng Qian — Sun, 31 May 2026 11:15:19 +0000

Maker disclosure: I build Macrokit (Apache-2.0, fully open). This is the idea, not a pitch — there's nothing to buy. Links at the end; the demo is keyless and runs entirely in your browser, so you can verify every claim in your own network tab.

Open one link and a ~0.5–7B model running in your browser — no signup, no API key, no server, nothing installed — does GitHub-maintainer work you'd assume needs a frontier model: triaging the newest PR on a public repo, proposing labels, summarizing open issues. Open your network tab while it runs and the only outbound traffic is the model weights downloading once and public GitHub reads. No inference server. No key, mine or yours.

That demo isn't a trick, and it isn't "weak models are secretly as smart as GPT-4." It's a structural choice about where the thinking happens — and the cleanest way to explain it is that we ported how brains manage the cost of thinking to LLM systems.

Intelligence is expensive, so brains don't think twice

Thinking burns energy. The brain is ~2% of body mass and ~20% of resting metabolic cost, and deliberate reasoning is the most expensive thing it does. Evolution's answer wasn't "think faster." It was to think a thing through once, and then stop thinking about it.

This is the dual-process picture Kahneman popularized as System 1 and System 2:

System 2 — deliberation: slow, effortful, expensive, flexible. It's what you use for genuinely novel problems.
System 1 — automaticity: fast, effortless, cheap, reflexive. It's what carries the overwhelming majority of your day.

The whole trick of an efficient mind is having both, and routing almost everything to the cheap one. You deliberated hard the first ten times you drove a car; now you do it while holding a conversation. The cheap reflex carries ~95% of the load; the expensive mind is held in reserve for when the world surprises you.

Macrokit is that architecture, ported to LLM systems:

The strong model is System 2 — slow, expensive, for the novel.
A macro is System 1 — fast, cheap, deterministic, for the routine.

A fast cheap reflex and a slow expensive mind, with the reflex carrying the load.

The sharpened version: macros are compiled deliberation, not instinct

It's tempting to call a macro an "instinct," but that's wrong, and the correct version is more interesting. Pure instinct is innate — genetic, like a spider's web or a suckling reflex. No macro is born; every macro is learned. The right analog is habit and acquired expertise, and the key is how it forms.

A behavior starts as effortful System-2 deliberation. Repeated enough, the brain chunks it and physically migrates it from the prefrontal cortex (slow, costly) to the basal ganglia (fast, cheap). The skill stops being something you reason through and becomes something you run. Deliberation compiles itself into reflex.

That's the deepest framing of what Macrokit does: intelligence compiling itself into instinct through repetition. A strong model reasons a workflow out step by step exactly once, and that reasoning is compiled down into a deterministic artifact a weak model can just run — no reasoning required.

The pattern maps end-to-end

| Macrokit | Cognition |
| --- | --- |
| Macro / weak-model routing | System 1: fast, automatic, cheap |
| Strong model | System 2: slow, deliberate, expensive |
| Distillation gate (encode on recurrence) | Neural chunking (cortex → basal ganglia) |
| Graduation % | The novice → expert curve |
| Bail-out detector | "Wait, this isn't working, think harder" |
| Encode once, run cheap forever | The brain's energy budget (the same argument) |

The distillation gate is the piece I think is genuinely novel, and it's the artificial version of neural chunking — same trigger (repetition), same reason (the cost of thinking). More on it below.

The failure mode is predicted, not hidden

Here's where the analogy earns its keep instead of just sounding nice: it predicts the failure mode rather than papering over it.

Habits are fast but brittle. They misfire when the environment shifts — the moth that navigated by moonlight for a million years flies into the flame; the experienced driver's reflexes betray them on the opposite side of the road. The reflex is only safe in the world it was compiled for.

Macros rot exactly the same way. A macro over a third-party surface breaks when that surface changes underneath it. The cognitive frame doesn't excuse this — it anticipates it, and prescribes the same fix biology uses: when the automatic path fails, re-deliberate. That's the bail-out detector — "this isn't working, think harder" — kicking the system from autopilot back into System 2. Loud, typed failures, caught in CI, not a "self-healing" hand-wave.

Where the analogy honestly breaks

I don't want to over-romanticize this. The analogy fails in three places, and you should know them:

Innate vs. learned. No macro is genetic. (If you squint: the SDK's primitives are the innate reflexes the system is born with; macros are the habits learned from them.)
Metacognition. A human can introspect and choose to override a habit. Macrokit's "which system fires" is a confidence-gated router — far cruder than real metacognitive control.
Generality. Some instincts transfer broadly; a macro is narrow and parameterized — a specific skill, not a broad drive.

It's a real structural correspondence under a shared cost constraint, not a claim of biological fidelity. That's the honest version.

The rest of this is the engineering and the evidence. The cognition frame is the spine; everything below is the proof that the spine is load-bearing.

The two-phase split, concretely

Most production LLM workflows aren't novel reasoning problems. They're the same shape of work, with different arguments, run thousands of times — fetch this, extract that, score it, label it. The hard part isn't deciding what to do once you understand the request; it's reasoning your way there step by step, which is exactly what weak models can't do reliably.

So don't make weak models reason better. Remove the runtime reasoning requirement.

Any workflow a strong model can solve by reasoning step-by-step on a known surface can be encoded once as a deterministic, parameterized sequence of tool calls. After that, executing it only requires intent classification — a one-shot routing problem small models handle fine.

That encoded sequence is a macro, and it splits the work in two — design-time deliberation, runtime reflex:

Design time (offline, rare — System 2): a strong model, supervised by a developer using the coding agent you already use, solves the workflow once and writes the macro. Versioned, reviewable, deterministic. Costs ~$0.50 of inference and happens once.
Runtime (online, constant — System 1): a weak/local model gets a request, classifies which macro it maps to, and calls it with extracted arguments. The macro runs as ordinary tested code. The model never plans the workflow.

The cost asymmetry is the whole point — the same argument the brain's energy budget makes. Encode once with a frontier model; execute thousands of times with a model that costs ~1/100th–1/1000th as much and runs on a laptop. The capability gap between models stops mattering for that workflow.
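The asymmetry can be put in numbers. In the sketch below, the one-time encoding cost of $0.50 comes from the text above; the per-run frontier cost of $0.01 is an assumed figure chosen for illustration, and the local cost is taken as 1/100th of it, the conservative end of the stated range.

```typescript
// Number of executions after which encoding once is cheaper than
// paying a frontier model to reason through the workflow every time.
function breakEvenRuns(
  encodeCost: number,
  frontierPerRun: number,
  localPerRun: number,
): number {
  const savedPerRun = frontierPerRun - localPerRun;
  if (savedPerRun <= 0) return Infinity; // no saving per run, so it never pays off
  return Math.ceil(encodeCost / savedPerRun);
}

const frontierPerRun = 0.01; // assumed figure, not from the benchmark
const localPerRun = frontierPerRun / 100;
const runs = breakEvenRuns(0.5, frontierPerRun, localPerRun); // 51
```

Under these assumptions the macro pays for itself after 51 executions; a workflow run thousands of times is far past that point, while a workflow run a handful of times is not.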

What a macro actually is

Not a prompt template, not a cache. A parameterized program with five parts: an intent spec the router matches against, a typed argument schema, a deterministic handler (the real tool-call sequence, in code), a structured failure contract, and test fixtures.

import { defineMacro } from "@macrokit/authoring";
import { z } from "zod";

defineMacro({
  name: "triage_arxiv_paper",
  intent: "summarize and classify an arXiv paper by its ID or URL",
  schema: z.object({
    paperId: z.string(),
    classifier: z.enum(["relevance", "novelty", "method"]).default("relevance"),
  }),
  handler: async ({ paperId, classifier }, ctx) => {
    const meta = await ctx.tools.arxiv.fetchMetadata(paperId);
    const pdf  = await ctx.tools.arxiv.fetchPdf(paperId);
    const text = await ctx.tools.pdf.extract(pdf, { pages: "1-3" });
    const score = await ctx.tools.classify(text, { dimension: classifier });
    return { paperId, title: meta.title, score, oneLine: meta.summary };
  },
  tests: [/* recorded request → expected-output fixtures */],
});

At runtime the model sees "triage 2401.12345, I care whether the method is new" and emits exactly one call:

{ "tool": "triage_arxiv_paper", "args": { "paperId": "2401.12345", "classifier": "method" } }

It doesn't decide to fetch, then extract, then score. That sequence was compiled offline — the reflex already exists. The model only routes. (Composition is also a macro — if a workflow is "run A, then B, then C," that's one macro run_full_pipeline, not three router turns. Three router turns is reasoning-at-runtime by the back door.)
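A minimal sketch of what "the model only routes" can look like in code: the model's output is a single JSON tool call, and the runtime looks the macro up in a registry, checks the arguments, and runs the handler. The `Macro` shape, the `validate` hook, and the `dispatch` function are assumptions for illustration, not the SDK's API.

```typescript
type Args = Record<string, unknown>;

interface Macro {
  name: string;
  validate(args: Args): Args; // throws when arguments are missing or mistyped
  handler(args: Args): Promise<unknown>;
}

// Parse the model's one tool call and run the matching macro.
async function dispatch(
  registry: Map<string, Macro>,
  modelOutput: string,
): Promise<unknown> {
  const call = JSON.parse(modelOutput) as { tool?: unknown; args?: unknown };
  if (typeof call.tool !== "string") {
    throw new Error("model output is not a tool call");
  }
  const macro = registry.get(call.tool);
  if (!macro) throw new Error(`unknown macro: ${call.tool}`);
  const args = (
    typeof call.args === "object" && call.args !== null ? call.args : {}
  ) as Args;
  return macro.handler(macro.validate(args));
}

const triage: Macro = {
  name: "triage_arxiv_paper",
  validate(args) {
    if (typeof args["paperId"] !== "string") {
      throw new Error("paperId must be a string");
    }
    return args;
  },
  // Stub: a real handler would run the fetch, extract, classify sequence.
  handler: async (args) => ({
    paperId: args["paperId"],
    classifier: args["classifier"] ?? "relevance",
  }),
};

const registry = new Map<string, Macro>([[triage.name, triage]]);
const result = dispatch(
  registry,
  '{ "tool": "triage_arxiv_paper", "args": { "paperId": "2401.12345", "classifier": "method" } }',
);
```

The model contributes one name and one argument object; everything after the lookup is ordinary code that can be unit-tested.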

The distillation gate = neural chunking, as a CLI

A macro library is only useful if it's complete enough for the workflows people actually run. Most tool collections grow organically and rot — every session adds a one-off helper, no session consolidates. The brain has the opposite reflex: repeat something enough and it gets chunked into automaticity, on purpose, because re-deliberating it every time is too expensive.

Macrokit makes that reflex a discipline, enforced by tooling:

Every session that touches a workflow with no existing macro must encode one before it ends.

A CLI reads the session log and fails the build when it sees raw tool calls for an un-encoded workflow — repetition that should have been chunked but wasn't:

$ macrokit gate
Session 2026-05-24T14:02Z used 4 raw tool calls for an unmacro'd workflow:
  → fetch_user_profile(id=…)
  → list_user_open_issues(id=…)
  → label_issues(ids=…, label="needs-triage")
  → notify_assignees(issue_ids=…)
Encode this before ending the session.
Suggested: triage_open_issues_for_user(user_id, label="needs-triage")

The runtime is just engineering. The gate is the cultural piece — it's chunking, triggered by repetition, for the same reason brains chunk: the cost of thinking. Wire it into CI and your library compounds at the rate you use the system, instead of becoming a graveyard of helpers. That's why this pattern compounds where agent frameworks and RPA libraries haven't — and graduation %, the share of traffic the cheap reflex now carries, is just the novice→expert curve made into a metric.
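Graduation % as described here is a simple ratio. A minimal sketch, assuming each recorded turn can be reduced to a boolean saying whether it was handled by a macro (that reduction and the function name are assumptions for the example):

```typescript
// Share of turns handled by a macro, as a percentage.
function graduationPercent(turnsViaMacro: boolean[]): number {
  if (turnsViaMacro.length === 0) return 0;
  const viaMacro = turnsViaMacro.filter(Boolean).length;
  return (100 * viaMacro) / turnsViaMacro.length;
}

// Three of four turns went through a macro.
const graduation = graduationPercent([true, true, false, true]); // 75
```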

The honesty beat: four local models clear the bar — and the one that doesn't

I pre-registered a 100-task intent-routing benchmark, temperature 0, no cloud and no key. Four off-the-shelf local models straight from ollama pull — Qwen 2.5 1.5B / 3B / 7B and Llama 3.1 8B — score 74–82.5%. The same 7B tuned as the production reference (llama.cpp, Q4_K_M) reaches 94.5% with zero structural failures; most of that gap is serving, quantization, and sampling config, not raw model capability. No frontier rows — deliberately.

The benchmark also ships the model that flunks: Mistral 7B v0.3 scored 14%. It narrated tool calls in prose ("the triage_pull_request macro will be called with…") instead of emitting structured ones, and the bail-out detector caught that on 24 tasks rather than scoring a hallucinated success. Publishing the row that fails is the point — a bar you can't fail isn't a bar.
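The distinction between a narrated call and a structured one is easy to check mechanically. The sketch below is an illustration of that idea, not the actual bail-out detector: the `RouteResult` type, the `classifyOutput` name, and the `{ tool, args }` output shape are assumptions.

```typescript
type RouteResult =
  | { kind: "call"; tool: string; args: Record<string, unknown> }
  | { kind: "bail_out"; reason: string };

// Accept only a well-formed call to a known tool; everything else bails out
// instead of being scored as a success.
function classifyOutput(raw: string, knownTools: Set<string>): RouteResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { kind: "bail_out", reason: "output is not JSON (narrated call?)" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { kind: "bail_out", reason: "output is not a tool-call object" };
  }
  const { tool, args } = parsed as { tool?: unknown; args?: unknown };
  if (typeof tool !== "string" || !knownTools.has(tool)) {
    return { kind: "bail_out", reason: "unknown or missing tool name" };
  }
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return { kind: "bail_out", reason: "args is not an object" };
  }
  return { kind: "call", tool, args: args as Record<string, unknown> };
}

const known = new Set(["triage_pull_request"]);

// Prose that mentions the macro is still a bail-out.
const narrated = classifyOutput(
  "the triage_pull_request macro will be called with…",
  known,
);

// A structured call to a known tool passes.
const structured = classifyOutput(
  '{"tool":"triage_pull_request","args":{"pull":42}}',
  known,
);
```

Returning a typed `bail_out` value rather than throwing keeps the failure countable, which is what lets a benchmark report it as its own row.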

And the methodology earned its keep before any of that. The production 7B's first pre-registered run scored 53.5% — not a model problem, a bug in my own SDK: zod schemas don't carry JSON Schema by default, so the router fell back to a permissive {type: object} and the model never saw the real argument names. A ~12-line fix (zod → JSON Schema at defineMacro() time) took the same model on the same corpus to 94.5%. The failed run is shipped right next to the fixed one. Pre-registration is for catching exactly that kind of silent config miss — re-run the harness yourself.
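To make that failure concrete, here is a dependency-free sketch of the difference between the permissive fallback and a schema that actually names its arguments. The `FieldSpec` type and `toJsonSchema` helper are hypothetical stand-ins for the zod conversion; this is not the SDK's fix, only an illustration of what the model does and does not get to see.

```typescript
// Hypothetical, simplified field description standing in for a zod schema.
type FieldSpec = { type: "string" | "number" | "boolean"; required?: boolean };

interface ObjectSchema {
  type: "object";
  properties?: Record<string, { type: string }>;
  required?: string[];
}

// What the router advertised before the fix: no argument names at all.
const permissiveFallback: ObjectSchema = { type: "object" };

// Convert field descriptions into a JSON Schema object the model can read.
function toJsonSchema(fields: Record<string, FieldSpec>): ObjectSchema {
  const properties: Record<string, { type: string }> = {};
  const required: string[] = [];
  for (const name of Object.keys(fields)) {
    const spec = fields[name];
    if (!spec) continue;
    properties[name] = { type: spec.type };
    if (spec.required !== false) required.push(name); // required unless opted out
  }
  return { type: "object", properties, required };
}

const advertised = toJsonSchema({
  paperId: { type: "string" },
  classifier: { type: "string", required: false },
});
```

With `permissiveFallback` the model has to guess that the argument is called `paperId`; with `advertised` the name and type are in the tool definition it is given.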

Where it does not help (so you don't misapply it)

This falls right out of the System-1/System-2 split: automaticity is for the routine, deliberation is for the novel. So —

Genuinely novel reasoning — "write a reply to this angry customer," "price a brand-new category." That's System-2 work; route it to a frontier model. Hybrid routing (local handles the routine 80–90%, frontier handles the novel remainder) is opt-in.
Workflows that change every time — exploratory research, open-ended debugging. Nothing to chunk; the pattern is pure overhead. The value comes from the ratio of executions to encodings.
Surfaces that change underneath you — the brittle-habit failure mode above. Mitigation is loud, typed failures caught in CI plus a DOM/action-menu abstraction, not a "self-healing" claim.
Models that can't reliably emit tool calls — it's less about size than structured-output discipline. Qwen 2.5 scales cleanly from 1.5B (74%) up, but a 7B that narrates calls in prose instead of emitting them (Mistral 7B v0.3, 14% here) won't clear the bar until its tool-calling is fixed. The bail-out detector catches those rather than scoring hallucinated successes.
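The split in this list can be sketched as a confidence-gated routing decision. This is a minimal illustration only: the `Classification` shape, the 0.8 threshold, and the three-way `Route` result are assumptions for the example, not Macrokit's router.

```typescript
interface Classification {
  macro: string | null; // best-matching macro, or null when nothing matched
  confidence: number; // 0..1, however the router scores its match
}

type Route =
  | { target: "local"; macro: string }
  | { target: "frontier" }
  | { target: "reject"; reason: string };

// Routine requests run locally as a macro; the remainder goes to a frontier
// model only when hybrid routing has been switched on.
function route(
  c: Classification,
  hybridEnabled: boolean,
  threshold = 0.8, // assumed cut-off, chosen for illustration
): Route {
  if (c.macro !== null && c.confidence >= threshold) {
    return { target: "local", macro: c.macro };
  }
  if (hybridEnabled) return { target: "frontier" };
  return {
    target: "reject",
    reason: "no confident macro match and hybrid routing is off",
  };
}

const routine = route({ macro: "summarize_open_issues", confidence: 0.93 }, true);
const novel = route({ macro: null, confidence: 0.2 }, true);
const strict = route({ macro: null, confidence: 0.2 }, false);
```

Keeping the reject branch explicit means a request with no good match fails loudly when hybrid routing is off, instead of being forced into the nearest macro.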

What it is not

Not an agent framework. Those compete on better runtime reasoning (more System 2); this eliminates it: "an agent that routes," not "an agent that thinks."
Not a model. BYO: OpenAI-compatible + Ollama out of the box.
Not RPA. Macros are semantic tool calls, not recorded pixels.
Not a fine-tuning pipeline.
Not no-code. Authoring needs a developer + strong model.

This isn't a fad or a hack. It's the universal architecture of efficient intelligence under a cost budget — the same answer brains, animals, and now LLM systems all converge on, because the cost asymmetry between deliberating and executing isn't going away. I've run the pattern in production for about a year inside an unrelated operations tool serving users with no practical frontier-API access, and pulled the vertical-agnostic core out into Macrokit. The interesting question to me isn't whether weak models can match frontier — it's how much of your real workload is repetitive enough that you never need to find out.

Try it / read it

Demo (keyless, in-browser, public repos — open the network tab): https://studio.macrokit.dev
The pattern + the benchmark, including the failed first run: https://macrokit.dev
SDK (Apache-2.0, TypeScript): https://github.com/macrokit/core

I'd genuinely like to hear where this breaks on real work — that's the useful feedback. Pushback welcome.

— Cheng Qian