Maker disclosure: I build Macrokit (Apache-2.0, fully open). This is the idea, not a pitch — there's nothing to buy. Links at the end; the demo is keyless and runs entirely in your browser, so you can verify every claim in your own network tab.
Open one link and a ~0.5–7B model running in your browser — no signup, no API key, no server, nothing installed — does GitHub-maintainer work you'd assume needs a frontier model: triaging the newest PR on a public repo, proposing labels, summarizing open issues. Open your network tab while it runs and the only outbound traffic is the model weights downloading once and public GitHub reads. No inference server. No key, mine or yours.
That demo isn't a trick, and it isn't "weak models are secretly as smart as GPT-4." It's a structural choice about where the thinking happens — and the cleanest way to explain it is that we ported how brains manage the cost of thinking to LLM systems.
Intelligence is expensive, so brains don't think twice
Thinking burns energy. The brain is ~2% of body mass and ~20% of resting metabolic cost, and deliberate reasoning is the most expensive thing it does. Evolution's answer wasn't "think faster." It was to think a thing through once, and then stop thinking about it.
This is the dual-process picture Kahneman popularized as System 1 and System 2:
- System 2 — deliberation: slow, effortful, expensive, flexible. It's what you use for genuinely novel problems.
- System 1 — automaticity: fast, effortless, cheap, reflexive. It's what carries the overwhelming majority of your day.
The whole trick of an efficient mind is having both, and routing almost everything to the cheap one. You deliberated hard the first ten times you drove a car; now you do it while holding a conversation. The cheap reflex carries ~95% of the load; the expensive mind is held in reserve for when the world surprises you.
Macrokit is that architecture, ported to LLM systems:
- The strong model is System 2 — slow, expensive, for the novel.
- A macro is System 1 — fast, cheap, deterministic, for the routine.
A fast cheap reflex and a slow expensive mind, with the reflex carrying the load.
The sharpened version: macros are compiled deliberation, not instinct
It's tempting to call a macro an "instinct," but that's wrong, and the correct version is more interesting. Pure instinct is innate — genetic, like a spider's web or a suckling reflex. No macro is born; every macro is learned. The right analog is habit and acquired expertise, and the key is how it forms.
A behavior starts as effortful System-2 deliberation. Repeated enough, the brain chunks it, and control of it shifts from the prefrontal cortex (slow, costly) toward the basal ganglia (fast, cheap). The skill stops being something you reason through and becomes something you run. Deliberation compiles itself into reflex.
That's the deepest framing of what Macrokit does: intelligence compiling itself into instinct through repetition. A strong model reasons a workflow out step by step exactly once, and that reasoning is compiled down into a deterministic artifact a weak model can just run — no reasoning required.
The pattern maps end-to-end
| Macrokit | Cognition |
|---|---|
| Macro / weak-model routing | System 1 — fast, automatic, cheap |
| Strong model | System 2 — slow, deliberate, expensive |
| Distillation gate (encode on recurrence) | Neural chunking (cortex → basal ganglia) |
| Graduation % | The novice → expert curve |
| Bail-out detector | "Wait — this isn't working, think harder" |
| Encode once, run cheap forever | The brain's energy budget — the same argument |
The distillation gate is the piece I think is genuinely novel, and it's the artificial version of neural chunking — same trigger (repetition), same reason (the cost of thinking). More on it below.
The failure mode is predicted, not hidden
Here's where the analogy earns its keep instead of just sounding nice: it predicts the failure mode rather than papering over it.
Habits are fast but brittle. They misfire when the environment shifts — the moth that navigated by moonlight for a million years flies into the flame; the experienced driver's reflexes betray them on the opposite side of the road. The reflex is only safe in the world it was compiled for.
Macros rot exactly the same way. A macro over a third-party surface breaks when that surface changes underneath it. The cognitive frame doesn't excuse this — it anticipates it, and prescribes the same fix biology uses: when the automatic path fails, re-deliberate. That's the bail-out detector — "this isn't working, think harder" — kicking the system from autopilot back into System 2. Loud, typed failures, caught in CI, not a "self-healing" hand-wave.
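A minimal sketch of what "loud, typed failure, then re-deliberate" could look like. Every name here (`MacroFailure`, `runWithBailOut`, `countLabels`) is hypothetical and chosen for illustration; this is not the Macrokit API, and the strong-model call is replaced by a stub.

```typescript
// Hypothetical types for illustration; not the Macrokit API.
type MacroFailure =
  | { kind: "surface_changed"; detail: string } // the third-party surface drifted
  | { kind: "invalid_args"; detail: string };   // the router extracted unusable arguments

type MacroResult<T> =
  | { status: "ok"; value: T }
  | { status: "failed"; failure: MacroFailure };

type Outcome<T> = { value: T; path: "system1" | "system2" };

// Try the cheap path first. On a typed failure, hand the request to the
// expensive path instead of returning a guessed answer.
function runWithBailOut<T>(
  reflex: () => MacroResult<T>,
  deliberate: (why: MacroFailure) => T,
): Outcome<T> {
  const result = reflex();
  if (result.status === "ok") {
    return { value: result.value, path: "system1" };
  }
  return { value: deliberate(result.failure), path: "system2" };
}

// Example macro: counts labels, and fails loudly if the upstream response
// no longer carries the field it was compiled against.
function countLabels(response: { labels?: string[] }): MacroResult<number> {
  if (!Array.isArray(response.labels)) {
    return {
      status: "failed",
      failure: { kind: "surface_changed", detail: "response has no 'labels' array" },
    };
  }
  return { status: "ok", value: response.labels.length };
}

// Stand-in for re-deliberation; a real system would call a strong model here.
const reasons: string[] = [];
const rethink = (why: MacroFailure): number => {
  reasons.push(why.kind);
  return 0;
};

const healthy = runWithBailOut(() => countLabels({ labels: ["bug", "docs"] }), rethink);
const drifted = runWithBailOut(() => countLabels({}), rethink);
```

The design point is that the failure is a value with a named kind, so a CI fixture can assert on `surface_changed` rather than on a log message.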
Where the analogy honestly breaks
I don't want to over-romanticize this. There are three places where the analogy doesn't hold, and you should know them:
- Innate vs. learned. No macro is genetic. (If you squint: the SDK's primitives are the innate reflexes the system is born with; macros are the habits learned from them.)
- Metacognition. A human can introspect and choose to override a habit. Macrokit's "which system fires" is a confidence-gated router — far cruder than real metacognitive control.
- Generality. Some instincts transfer broadly; a macro is narrow and parameterized — a specific skill, not a broad drive.
It's a real structural correspondence under a shared cost constraint, not a claim of biological fidelity. That's the honest version.
The rest of this is the engineering and the evidence. The cognition frame is the spine; everything below is the proof that the spine is load-bearing.
The two-phase split, concretely
Most production LLM workflows aren't novel reasoning problems. They're the same shape of work, with different arguments, run thousands of times — fetch this, extract that, score it, label it. The hard part isn't deciding what to do once you understand the request; it's reasoning your way there step by step, which is exactly what weak models can't do reliably.
So don't make weak models reason better. Remove the runtime reasoning requirement.
Any workflow a strong model can solve by reasoning step-by-step on a known surface can be encoded once as a deterministic, parameterized sequence of tool calls. After that, executing it only requires intent classification — a one-shot routing problem small models handle fine.
That encoded sequence is a macro, and it splits the work in two — design-time deliberation, runtime reflex:
- Design time (offline, rare — System 2): a strong model, supervised by a developer using the coding agent you already use, solves the workflow once and writes the macro. Versioned, reviewable, deterministic. Costs ~$0.50 of inference and happens once.
- Runtime (online, constant — System 1): a weak/local model gets a request, classifies which macro it maps to, and calls it with extracted arguments. The macro runs as ordinary tested code. The model never plans the workflow.
The cost asymmetry is the whole point — the same argument the brain's energy budget makes. Encode once with a frontier model; execute thousands of times with a model that costs ~1/100th–1/1000th as much and runs on a laptop. The capability gap between models stops mattering for that workflow.
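To make the asymmetry concrete, here is a rough break-even calculation. The $0.50 encoding cost and the 1/100 price ratio come from the text above; the $0.02 per-run frontier cost is an assumed placeholder, not a measurement, and `breakEvenRuns` is a hypothetical helper.

```typescript
// Number of executions after which "encode once, run cheap" has paid for itself.
function breakEvenRuns(
  encodeCostUsd: number,
  strongPerRunUsd: number,
  weakPerRunUsd: number,
): number {
  const savedPerRun = strongPerRunUsd - weakPerRunUsd;
  if (savedPerRun <= 0) return Infinity; // the cheap path is not actually cheaper
  return Math.ceil(encodeCostUsd / savedPerRun);
}

// Assumed figure for illustration only: a frontier model costing $0.02 per run.
const strongPerRun = 0.02;
const weakPerRun = strongPerRun / 100; // the conservative end of the 1/100 to 1/1000 range
const runsToPayBack = breakEvenRuns(0.5, strongPerRun, weakPerRun); // 26 under these assumptions
```

Under those placeholder prices the macro pays for itself after a few dozen runs, which is why the ratio of executions to encodings matters more than either price on its own.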
What a macro actually is
Not a prompt template, not a cache. A parameterized program with five parts: an intent spec the router matches against, a typed argument schema, a deterministic handler (the real tool-call sequence, in code), a structured failure contract, and test fixtures.
```typescript
import { z } from "zod";
// defineMacro comes from the Macrokit SDK.

defineMacro({
  name: "triage_arxiv_paper",
  // Intent spec: the text the router matches requests against.
  intent: "summarize and classify an arXiv paper by its ID or URL",
  // Typed argument schema.
  schema: z.object({
    paperId: z.string(),
    classifier: z.enum(["relevance", "novelty", "method"]).default("relevance"),
  }),
  // Deterministic handler: the tool-call sequence, fixed at design time.
  handler: async ({ paperId, classifier }, ctx) => {
    const meta = await ctx.tools.arxiv.fetchMetadata(paperId);
    const pdf = await ctx.tools.arxiv.fetchPdf(paperId);
    const text = await ctx.tools.pdf.extract(pdf, { pages: "1-3" });
    const score = await ctx.tools.classify(text, { dimension: classifier });
    return { paperId, title: meta.title, score, oneLine: meta.summary };
  },
  tests: [/* recorded request → expected-output fixtures */],
});
```
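The five parts can also be written down as a type, which makes the failure contract and the fixtures harder to forget. This is a dependency-free sketch under stated assumptions: `MacroSketch`, `failingFixtures`, and the `normalize_arxiv_id` macro are hypothetical, a plain parse function stands in for the zod schema, and the handler is synchronous only to keep the example short.

```typescript
// Hypothetical shape; the real SDK types may differ.
interface MacroSketch<Args, Out> {
  name: string;
  intent: string;                             // 1. what the router matches against
  parseArgs: (raw: unknown) => Args;          // 2. typed argument schema
  handler: (args: Args) => Out;               // 3. deterministic steps, in code
  failures: string[];                         // 4. failure kinds callers must handle
  tests: { input: unknown; expected: Out }[]; // 5. recorded fixtures
}

// Replays every fixture and returns the indexes of the ones that fail.
function failingFixtures<A, O>(macro: MacroSketch<A, O>): number[] {
  const failing: number[] = [];
  macro.tests.forEach((fixture, index) => {
    try {
      const actual = macro.handler(macro.parseArgs(fixture.input));
      if (JSON.stringify(actual) !== JSON.stringify(fixture.expected)) failing.push(index);
    } catch {
      failing.push(index);
    }
  });
  return failing;
}

const normalizeArxivId: MacroSketch<{ ref: string }, { paperId: string }> = {
  name: "normalize_arxiv_id",
  intent: "turn an arXiv ID or URL into a bare paper ID",
  parseArgs: (raw) => {
    const ref = (raw as { ref?: unknown } | null)?.ref;
    if (typeof ref !== "string" || ref.length === 0) {
      throw new Error("invalid_args: ref must be a non-empty string");
    }
    return { ref };
  },
  handler: ({ ref }) => {
    const match = /(\d{4}\.\d{4,5})/.exec(ref);
    if (match === null) throw new Error("not_an_arxiv_ref: " + ref);
    return { paperId: match[1] };
  },
  failures: ["invalid_args", "not_an_arxiv_ref"],
  tests: [
    { input: { ref: "2401.12345" }, expected: { paperId: "2401.12345" } },
    { input: { ref: "https://arxiv.org/abs/2401.12345" }, expected: { paperId: "2401.12345" } },
  ],
};

const broken = failingFixtures(normalizeArxivId); // empty when every fixture replays cleanly
```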
At runtime the model sees "triage 2401.12345, I care whether the method is new" and emits exactly one call:
```json
{ "tool": "triage_arxiv_paper", "args": { "paperId": "2401.12345", "classifier": "method" } }
```
It doesn't decide to fetch, then extract, then score. That sequence was compiled offline — the reflex already exists. The model only routes. (Composition is also a macro — if a workflow is "run A, then B, then C," that's one macro run_full_pipeline, not three router turns. Three router turns is reasoning-at-runtime by the back door.)
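A small sketch of both points: the runtime does one registry lookup per request, and a composed workflow is itself one entry in that registry. The registry, `dispatch`, and the three toy steps are hypothetical stand-ins, and handlers are synchronous here only for brevity.

```typescript
type ToolCall = { tool: string; args: Record<string, unknown> };
type Handler = (args: Record<string, unknown>) => unknown;

// Three toy steps standing in for "A, then B, then C".
const stepA = (text: string): string => text.trim();
const stepB = (text: string): string[] => text.split(/\s+/);
const stepC = (words: string[]): number => words.length;

// The composition lives in code, so the model never sequences the steps.
const macros = new Map<string, Handler>([
  ["run_full_pipeline", (args) => stepC(stepB(stepA(String(args.text))))],
]);

// One model output, one lookup, one handler call.
function dispatch(call: ToolCall): unknown {
  const handler = macros.get(call.tool);
  if (handler === undefined) throw new Error(`no_macro: ${call.tool}`);
  return handler(call.args);
}

const modelOutput = '{ "tool": "run_full_pipeline", "args": { "text": "  three small words " } }';
const result = dispatch(JSON.parse(modelOutput) as ToolCall); // 3
```

If the same pipeline were exposed as three separate macros, the model would have to emit three calls in the right order, which is exactly the runtime planning the pattern is trying to remove.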
The distillation gate = neural chunking, as a CLI
A macro library is only useful if it's complete enough for the workflows people actually run. Most tool collections grow organically and rot — every session adds a one-off helper, no session consolidates. The brain has the opposite reflex: repeat something enough and it gets chunked into automaticity, on purpose, because re-deliberating it every time is too expensive.
Macrokit makes that reflex a discipline, enforced by tooling:
Every session that touches a workflow with no existing macro must encode one before it ends.
A CLI reads the session log and fails the build when it sees raw tool calls for an un-encoded workflow — repetition that should have been chunked but wasn't:
```text
$ macrokit gate
Session 2026-05-24T14:02Z used 4 raw tool calls for an unmacro'd workflow:
  → fetch_user_profile(id=…)
  → list_user_open_issues(id=…)
  → label_issues(ids=…, label="needs-triage")
  → notify_assignees(issue_ids=…)
Encode this before ending the session.
Suggested: triage_open_issues_for_user(user_id, label="needs-triage")
```
The runtime is just engineering. The gate is the cultural piece — it's chunking, triggered by repetition, for the same reason brains chunk: the cost of thinking. Wire it into CI and your library compounds at the rate you use the system, instead of becoming a graveyard of helpers. That's why this pattern compounds where agent frameworks and RPA libraries haven't — and graduation %, the share of traffic the cheap reflex now carries, is just the novice→expert curve made into a metric.
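The core check behind such a gate can be sketched in a few lines. The session-log shape below is an assumption (the real log format is not shown here), the rule is simplified to "any tool call made outside a macro fails the gate", and graduation % is approximated per tool call rather than per request.

```typescript
// Hypothetical session-log shape: one entry per tool call, with `viaMacro`
// set when the call was issued from inside a macro handler.
interface LogEntry {
  tool: string;
  viaMacro?: string;
}

interface GateReport {
  pass: boolean;
  rawCalls: string[]; // tool calls made outside any macro
}

function gate(sessionLog: LogEntry[]): GateReport {
  const rawCalls = sessionLog
    .filter((entry) => entry.viaMacro === undefined)
    .map((entry) => entry.tool);
  return { pass: rawCalls.length === 0, rawCalls };
}

// Graduation %, simplified here to the share of tool calls that ran inside a macro.
function graduationPercent(sessionLog: LogEntry[]): number {
  if (sessionLog.length === 0) return 0;
  const viaMacro = sessionLog.filter((entry) => entry.viaMacro !== undefined).length;
  return (100 * viaMacro) / sessionLog.length;
}

const tools = ["fetch_user_profile", "list_user_open_issues", "label_issues", "notify_assignees"];
const beforeEncoding: LogEntry[] = tools.map((tool) => ({ tool }));
const afterEncoding: LogEntry[] = tools.map((tool) => ({
  tool,
  viaMacro: "triage_open_issues_for_user",
}));

const failing = gate(beforeEncoding); // pass: false, four raw calls listed
const passing = gate(afterEncoding);  // pass: true
```

In CI, a wrapper would turn `pass: false` into a nonzero exit code so the build fails.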
The honesty beat: four local models clear the bar — and the one that doesn't
I pre-registered a 100-task intent-routing benchmark, temperature 0, no cloud and no key. Four off-the-shelf local models straight from ollama pull — Qwen 2.5 1.5B / 3B / 7B and Llama 3.1 8B — score 74–82.5%. The same 7B tuned as the production reference (llama.cpp, Q4_K_M) reaches 94.5% with zero structural failures; most of that gap is serving, quantization, and sampling config, not raw model capability. No frontier rows — deliberately.
The benchmark also ships the model that flunks: Mistral 7B v0.3 scored 14%. It narrated tool calls in prose ("the triage_pull_request macro will be called with…") instead of emitting structured ones, and the bail-out detector caught that on 24 tasks rather than scoring a hallucinated success. Publishing the row that fails is the point — a bar you can't fail isn't a bar.
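One way a detector can tell a narrated call from a real one is to refuse anything that does not parse into the expected structure. This is a sketch under assumptions, not the benchmark's actual detector: `parseModelOutput` and its reason codes are hypothetical names.

```typescript
type ParsedOutput =
  | { kind: "call"; tool: string; args: Record<string, unknown> }
  | { kind: "bail_out"; reason: "not_json" | "wrong_shape" | "unknown_macro" };

function parseModelOutput(raw: string, knownMacros: string[]): ParsedOutput {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    // Prose that merely mentions a macro name lands here.
    return { kind: "bail_out", reason: "not_json" };
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return { kind: "bail_out", reason: "wrong_shape" };
  }
  const { tool, args } = value as { tool?: unknown; args?: unknown };
  if (typeof tool !== "string" || typeof args !== "object" || args === null) {
    return { kind: "bail_out", reason: "wrong_shape" };
  }
  if (knownMacros.indexOf(tool) === -1) {
    return { kind: "bail_out", reason: "unknown_macro" };
  }
  return { kind: "call", tool, args: args as Record<string, unknown> };
}

const known = ["triage_arxiv_paper", "triage_pull_request"];
const narrated = parseModelOutput("the triage_pull_request macro will be called with…", known);
const structured = parseModelOutput(
  '{ "tool": "triage_arxiv_paper", "args": { "paperId": "2401.12345", "classifier": "method" } }',
  known,
);
```

The narrated output names a valid macro and is still rejected, so it is scored as a failure rather than a success.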
And the methodology earned its keep before any of that. The production 7B's first pre-registered run scored 53.5% — not a model problem, a bug in my own SDK: zod schemas don't carry JSON Schema by default, so the router fell back to a permissive {type: object} and the model never saw the real argument names. A ~12-line fix (zod → JSON Schema at defineMacro() time) took the same model on the same corpus to 94.5%. The failed run is shipped right next to the fixed one. Pre-registration is for catching exactly that kind of silent config miss — re-run the harness yourself.
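The difference between the two runs is easiest to see by putting the two schemas side by side. The sketch below uses a tiny hand-rolled field spec as a stand-in for zod so it has no dependencies; `FieldSpec` and `toJsonSchema` are hypothetical and are not the SDK's actual fix.

```typescript
// Hypothetical mini-schema standing in for zod so the sketch has no dependencies.
type FieldSpec =
  | { type: "string" }
  | { type: "enum"; values: string[]; fallback?: string };

interface JsonSchemaProperty {
  type: "string";
  enum?: string[];
  default?: string;
}

interface JsonSchemaObject {
  type: "object";
  properties?: Record<string, JsonSchemaProperty>;
  required?: string[];
}

// What the router fell back to in the failed run: valid, but it tells the
// model nothing about argument names.
const permissive: JsonSchemaObject = { type: "object" };

function toJsonSchema(fields: Record<string, FieldSpec>): JsonSchemaObject {
  const properties: Record<string, JsonSchemaProperty> = {};
  const required: string[] = [];
  for (const name of Object.keys(fields)) {
    const spec = fields[name];
    const property: JsonSchemaProperty = { type: "string" };
    if (spec.type === "enum") {
      property.enum = spec.values;
      if (spec.fallback !== undefined) property.default = spec.fallback;
    }
    // A field with a default may be omitted by the model; everything else is required.
    if (property.default === undefined) required.push(name);
    properties[name] = property;
  }
  return { type: "object", properties, required };
}

const explicit = toJsonSchema({
  paperId: { type: "string" },
  classifier: { type: "enum", values: ["relevance", "novelty", "method"], fallback: "relevance" },
});
```

With `permissive`, a model has to guess that the argument is called `paperId`; with `explicit`, the name, the allowed classifier values, and the default are all in the tool definition it sees.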
Where it does not help (so you don't misapply it)
This falls right out of the System-1/System-2 split: automaticity is for the routine, deliberation is for the novel. So —
- Genuinely novel reasoning — "write a reply to this angry customer," "price a brand-new category." That's System-2 work; route it to a frontier model. Hybrid routing (local handles the routine 80–90%, frontier handles the novel remainder) is opt-in.
- Workflows that change every time — exploratory research, open-ended debugging. Nothing to chunk; the pattern is pure overhead. The value comes from the ratio of executions to encodings.
- Surfaces that change underneath you — the brittle-habit failure mode above. Mitigation is loud, typed failures caught in CI plus a DOM/action-menu abstraction, not a "self-healing" claim.
- Models that can't reliably emit tool calls — it's less about size than structured-output discipline. Qwen 2.5 scales cleanly from 1.5B (74%) up, but a 7B that narrates calls in prose instead of emitting them (Mistral 7B v0.3, 14% here) won't clear the bar until its tool-calling is fixed. The bail-out detector catches those rather than scoring hallucinated successes.
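The opt-in hybrid routing in the first bullet reduces to a small decision function. This is a sketch under assumptions: the `RouterDecision` shape, the 0.8 threshold, and the `frontierAllowed` flag are hypothetical, and how the router estimates confidence is left open.

```typescript
interface RouterDecision {
  macro: string;      // a macro name, or "no_macro" when nothing fits
  confidence: number; // 0..1, however the router estimates it
}

type Route =
  | { target: "local"; macro: string }
  | { target: "frontier" }
  | { target: "human" };

interface Deployment {
  threshold: number;        // assumed cut-off; tune per workload
  frontierAllowed: boolean; // false for deployments that cannot call out
}

function chooseRoute(decision: RouterDecision, deployment: Deployment): Route {
  const unsure = decision.macro === "no_macro" || decision.confidence < deployment.threshold;
  if (!unsure) return { target: "local", macro: decision.macro };
  // Escalate to a frontier model when allowed, otherwise to a person.
  return deployment.frontierAllowed ? { target: "frontier" } : { target: "human" };
}

const offline: Deployment = { threshold: 0.8, frontierAllowed: false };
const online: Deployment = { threshold: 0.8, frontierAllowed: true };

const routine = chooseRoute({ macro: "triage_pull_request", confidence: 0.95 }, offline);
const novel = chooseRoute({ macro: "no_macro", confidence: 0.9 }, offline);
const shaky = chooseRoute({ macro: "close_stale_issues", confidence: 0.4 }, online);
```

Note that `no_macro` escalates even at high confidence: being sure that nothing fits is a reason to hand off, not a reason to force a route.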
What it is not
- Not an agent framework. Those compete on better runtime reasoning (more System 2); this eliminates it: "an agent that routes," not "an agent that thinks."
- Not a model. BYO: OpenAI-compatible + Ollama out of the box.
- Not RPA. Macros are semantic tool calls, not recorded pixels.
- Not a fine-tuning pipeline.
- Not no-code. Authoring needs a developer + strong model.
This isn't a fad or a hack. It's the universal architecture of efficient intelligence under a cost budget — the same answer brains, animals, and now LLM systems all converge on, because the cost asymmetry between deliberating and executing isn't going away. I've run the pattern in production for about a year inside an unrelated operations tool serving users with no practical frontier-API access, and pulled the vertical-agnostic core out into Macrokit. The interesting question to me isn't whether weak models can match frontier — it's how much of your real workload is repetitive enough that you never need to find out.
Try it / read it
- Demo (keyless, in-browser, public repos — open the network tab): https://studio.macrokit.dev
- The pattern + the benchmark, including the failed first run: https://macrokit.dev
- SDK (Apache-2.0, TypeScript): https://github.com/macrokit/core
I'd genuinely like to hear where this breaks on real work — that's the useful feedback. Pushback welcome.
— Cheng Qian
Top comments (2)
A tiny local model doing real maintainer work is a great proof point for the pattern that matters most: capability is mostly about scoping the task, not scaling the model. A small model fails at "be a general assistant" but succeeds at "triage this issue against these labels" or "check this PR for these specific things," because the narrow task fits within what a small model reliably does, and the structure around it (clear inputs, tight output contract, the maintainer rules encoded) carries the weight the parameters don't. Same lesson as routing: most real work is a pile of small well-defined tasks, and the cheapest model that clears each bar wins. You don't need frontier intelligence to apply a label policy. The in-your-browser part adds two underrated wins: privacy (the code/issues never leave the machine) and zero marginal cost (no per-call bill, so you can run it constantly without watching a meter). The pattern behind it, as you put it, is the real takeaway: scope tightly, structure the task, run small and local where you can. Shrinking the task to fit a small model beats throwing GPT at everything. That right-size-the-model-to-the-scoped-task instinct is core to how I think about cost in Moonshift. Where did the small model hit its ceiling (the tasks that needed broader judgment), and did you escalate those to a bigger model or just leave them to a human?
Great framing — "scope the task to fit the model" is exactly it, and the cognition lens says the same thing: a small model is fine at System-1 work (recognize the request, route it) and falls over the moment a task needs System-2 deliberation. So the ceiling isn't a capability cliff, it's wherever genuine novelty or judgment starts.
Concretely, in the 100-task run the misses clustered in exactly the judgment-heavy buckets: suggest_reviewers (83%) and close_stale_issues (88%), where "which reviewer / is this really stale" is a call, not a lookup. The routine triage buckets were 100%. The most important number to me was the no_macro bucket (91%): the model correctly recognizing "nothing here fits, don't force a route." That recognition is the whole ballgame: it's the bail-out detector, the artificial version of "wait, this isn't working, think harder."
On escalation: it's opt-in hybrid routing. When confidence is low or no_macro fires, you hand off, either to a frontier model if the deployment allows it or to a human if it doesn't (a lot of mine can't call out at all, so it's human-in-the-loop by default). The small model's job is to know its own edge, not to fake past it.
Curious how you draw that line in practice: escalate on a confidence threshold, or decide by task type up front?