Maker disclosure: I build Macrokit (Apache-2.0, fully open). This post is the pattern, not a pitch — there's nothing to buy. Links at the end; the demo is keyless and runs entirely in your browser, so you can verify every claim here in your own network tab.
Open one link, and a ~0.5–7B model running in your browser — no signup, no API key, no server, nothing installed — does GitHub-maintainer work you'd assume needs a frontier model: triaging the newest PR on a public repo, proposing labels, summarizing open issues. Open your network tab while it runs and you'll see the only outbound traffic is the model weights downloading once and public GitHub reads. No inference server. No key, mine or yours.
That demo isn't a trick, and it isn't "weak models are secretly as smart as GPT." It's a structural choice about where the reasoning happens. Here's the whole idea.
The forced choice everyone starts with
If you deploy an LLM app today you pick a side:
- Frontier API models plan multi-step workflows, recover from errors, parse messy surfaces. They're also expensive per call, require a network hop to a US-controlled API, and are a non-starter under data-residency rules, air-gaps, or capped budgets.
- Weak / local models (3B–14B open weights, on-device via Ollama/llama.cpp/MLX) are cheap, private, locally controllable — and fall over on multi-step reasoning. They drift into prose when they should call a tool. They hallucinate tool names. They loop. They quit at the first error.
The default reaction is to push on the weak side: train small models to reason better, wrap them in an agent framework that scaffolds the loop, eat the latency and brittleness. I think that's the wrong lever for most production work.
The observation that changes the lever
Most production LLM workflows aren't novel reasoning problems. They're the same shape of work, with different arguments, run thousands of times: fetch this, extract that, score it, label it. Working out which known workflow a request maps to is the easy part. The hard part is planning and carrying out the steps one at a time at runtime, which is exactly what weak models can't do reliably.
So don't make weak models reason better. Remove the runtime reasoning requirement.
Any workflow a strong model can solve by reasoning step-by-step on a known surface can be encoded once as a deterministic, parameterized sequence of tool calls. After that, executing it only requires intent classification — a one-shot routing problem small models handle fine.
That encoded sequence is a macro. This splits the work in two:
- Design time (offline, rare): a strong model — supervised by a developer, using the coding agent you already use — solves the workflow once and writes the macro. Versioned, reviewable, deterministic. Costs ~$0.50 of inference and happens once.
- Runtime (online, constant): a weak/local model gets a request, classifies which macro it maps to, and calls it with extracted arguments. The macro runs as ordinary tested code. The model never plans the workflow.
The cost asymmetry is the point: encode once with a frontier model, execute thousands of times with a model that costs ~1/100th–1/1000th as much and runs on a laptop. The capability gap between models stops mattering for that workflow.
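To make the asymmetry concrete, here is a minimal arithmetic sketch. Only the ~$0.50 encoding cost comes from this post; the per-run prices are assumptions picked to match the "1/100th" end of the range, and the function names are invented for illustration.

```typescript
// All per-run dollar figures below are illustrative assumptions, not measurements.
interface CostModel {
  encodeOnceUsd: number;     // one-time design-time cost of writing the macro
  frontierPerRunUsd: number; // cost of letting a frontier model plan each run
  localPerRunUsd: number;    // cost of a routed run on a small/local model
}

// Total spend after `runs` executions under each strategy.
function totalCost(m: CostModel, runs: number): { frontier: number; macro: number } {
  return {
    frontier: m.frontierPerRunUsd * runs,
    macro: m.encodeOnceUsd + m.localPerRunUsd * runs,
  };
}

// Smallest number of runs at which the macro strategy is strictly cheaper.
function breakEvenRuns(m: CostModel): number {
  const savedPerRun = m.frontierPerRunUsd - m.localPerRunUsd;
  if (savedPerRun <= 0) return Infinity; // encoding never pays for itself
  return Math.floor(m.encodeOnceUsd / savedPerRun) + 1;
}

const example: CostModel = {
  encodeOnceUsd: 0.5,      // the ~$0.50 figure from the post
  frontierPerRunUsd: 0.05, // assumed
  localPerRunUsd: 0.0005,  // assumed: 1/100th of the frontier cost
};
```

With these assumed numbers, `breakEvenRuns(example)` is 11. The exact value matters less than the shape: the encoding cost is fixed, so the saving grows linearly with every additional run.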
What a macro actually is
Not a prompt template, not a cache. A parameterized program with five parts: an intent spec the router matches against, a typed argument schema, a deterministic handler (the real tool-call sequence, in code), a structured failure contract, and test fixtures.
    import { z } from "zod";
    // defineMacro comes from the Macrokit SDK (import omitted here).

    defineMacro({
      name: "triage_arxiv_paper",
      // Natural-language intent the router matches requests against.
      intent: "summarize and classify an arXiv paper by its ID or URL",
      // Typed argument schema; classifier falls back to "relevance".
      schema: z.object({
        paperId: z.string(),
        classifier: z.enum(["relevance", "novelty", "method"]).default("relevance"),
      }),
      // Deterministic handler: the tool-call sequence, fixed at design time.
      handler: async ({ paperId, classifier }, ctx) => {
        const meta = await ctx.tools.arxiv.fetchMetadata(paperId);
        const pdf = await ctx.tools.arxiv.fetchPdf(paperId);
        const text = await ctx.tools.pdf.extract(pdf, { pages: "1-3" });
        const score = await ctx.tools.classify(text, { dimension: classifier });
        return { paperId, title: meta.title, score, oneLine: meta.summary };
      },
      tests: [/* recorded request → expected-output fixtures */],
    });
At runtime the model sees "triage 2401.12345, I care whether the method is new" and emits exactly one call:
    { "tool": "triage_arxiv_paper", "args": { "paperId": "2401.12345", "classifier": "method" } }
It doesn't decide to fetch, then extract, then score. That sequence was encoded offline. The model only routes. (Composition is also a macro — if a workflow is "run A, then B, then C," that's one macro run_full_pipeline, not three router turns. Three router turns is reasoning-at-runtime by the back door.)
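What happens to that single call on the runtime side can be sketched roughly as below. This is not Macrokit's actual dispatcher: the `SketchMacro`, `MacroResult`, and `dispatch` names are hypothetical, the handler is a synchronous stub so the example stays dependency-free, and validation is hand-written instead of using zod. It only illustrates the idea that one routed call is checked against the schema, run as ordinary code, and either succeeds or returns a structured failure.

```typescript
// Hypothetical shapes for illustration; Macrokit's real types may differ.
type MacroResult =
  | { ok: true; value: unknown }
  | { ok: false; code: "unknown_macro" | "bad_args" | "handler_error"; detail: string };

interface SketchMacro {
  // Returns validated args, or null when the raw args do not fit the schema.
  parse(raw: unknown): Record<string, unknown> | null;
  // Synchronous to keep the sketch short; a real handler would be async.
  run(args: Record<string, unknown>): unknown;
}

interface RouterCall {
  tool: string;
  args: unknown;
}

const DIMENSIONS = ["relevance", "novelty", "method"];

const registry: Record<string, SketchMacro> = {
  triage_arxiv_paper: {
    parse(raw) {
      if (typeof raw !== "object" || raw === null) return null;
      const input = raw as Record<string, unknown>;
      const paperId = input["paperId"];
      const classifier = input["classifier"] === undefined ? "relevance" : input["classifier"];
      if (typeof paperId !== "string") return null;
      if (typeof classifier !== "string" || DIMENSIONS.indexOf(classifier) === -1) return null;
      return { paperId, classifier };
    },
    run(args) {
      // Stub: the real handler would fetch, extract, and classify here.
      return { paperId: args["paperId"], dimension: args["classifier"] };
    },
  },
};

// Execute exactly one routed call; every failure comes back as typed data.
function dispatch(call: RouterCall): MacroResult {
  if (!Object.prototype.hasOwnProperty.call(registry, call.tool)) {
    return { ok: false, code: "unknown_macro", detail: call.tool };
  }
  const macro = registry[call.tool];
  const args = macro.parse(call.args);
  if (args === null) {
    return { ok: false, code: "bad_args", detail: "args did not match the schema" };
  }
  try {
    return { ok: true, value: macro.run(args) };
  } catch (err) {
    return { ok: false, code: "handler_error", detail: String(err) };
  }
}
```

The design choice worth noting is that a hallucinated tool name or a wrong argument name never reaches the handler. It surfaces as a `code` the caller can branch on, rather than as a loop or a silent retry.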
The part I think is actually novel: the distillation gate
A macro library is only useful if it's complete enough for the workflows people actually run. Most tool collections grow organically and rot — every session adds a one-off helper, no session consolidates.
The fix is a discipline, enforced by tooling:
Every session that touches a workflow with no existing macro must encode one before it ends.
A CLI reads the session log and fails the build when it sees raw tool calls for an un-encoded workflow:
    $ macrokit gate
    Session 2026-05-24T14:02Z used 4 raw tool calls for an unmacro'd workflow:
      → fetch_user_profile(id=…)
      → list_user_open_issues(id=…)
      → label_issues(ids=…, label="needs-triage")
      → notify_assignees(issue_ids=…)
    Encode this before ending the session.
    Suggested: triage_open_issues_for_user(user_id, label="needs-triage")
The runtime is just engineering. The gate is the cultural piece — wire it into CI and your library compounds at the rate you use the system, instead of becoming a graveyard of helpers. That's the bet for why this pattern compounds where agent frameworks and RPA libraries haven't.
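The core check behind such a gate can be sketched in a few lines. The log entry shape here (`workflowId`, `tool`, `viaMacro`) is an assumption made for the example, since the real session log format isn't shown in this post, and the function names are hypothetical. The sketch treats any raw call outside a macro as a finding, which is the simplest possible policy.

```typescript
// Hypothetical log entry; the post does not show Macrokit's session log format.
interface LogEntry {
  workflowId: string; // groups the calls that served one request
  tool: string;       // name of the tool or macro that was called
  viaMacro: boolean;  // true when the call ran as, or inside, a macro
}

interface GateFinding {
  workflowId: string;
  rawTools: string[];
}

// Collect raw (non-macro) tool calls, grouped by workflow, in log order.
// The linear scan per entry is fine for session-sized logs.
function findUnencoded(log: LogEntry[]): GateFinding[] {
  const findings: GateFinding[] = [];
  for (const entry of log) {
    if (entry.viaMacro) continue;
    let finding = findings.filter((f) => f.workflowId === entry.workflowId)[0];
    if (!finding) {
      finding = { workflowId: entry.workflowId, rawTools: [] };
      findings.push(finding);
    }
    finding.rawTools.push(entry.tool);
  }
  return findings;
}

// Exit code for CI: non-zero when any workflow still relies on raw calls.
function gateExitCode(log: LogEntry[]): number {
  return findUnencoded(log).length === 0 ? 0 : 1;
}

// Example session: one un-encoded workflow (w1) and one macro call (w2).
const sessionLog: LogEntry[] = [
  { workflowId: "w1", tool: "fetch_user_profile", viaMacro: false },
  { workflowId: "w1", tool: "list_user_open_issues", viaMacro: false },
  { workflowId: "w1", tool: "label_issues", viaMacro: false },
  { workflowId: "w1", tool: "notify_assignees", viaMacro: false },
  { workflowId: "w2", tool: "triage_arxiv_paper", viaMacro: true },
];
```

Here `gateExitCode(sessionLog)` returns 1 because workflow `w1` made four raw calls. A real gate would presumably also need a way to suggest a macro name and to allow deliberate exceptions. Both are left out of the sketch.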
The honesty beat: 53.5% → 94.5%
I pre-registered a 100-task intent-routing benchmark before running it (Qwen 2.5 7B, 4-bit, on a 16GB MacBook). First run: 53.5%. I published that number. The model picked the right macro almost every time but got argument names wrong, emitting {repo_owner, repo_name, pr_number} where the schema said {owner, repo, number}. Root cause was a bug in my own SDK, not the model: a zod schema is not JSON Schema and has to be converted explicitly, and the SDK never did that, so the router fell back to a permissive {type: object} and the model never saw the real arg names. It was guessing.
The fix was ~12 lines (convert zod → JSON Schema at defineMacro() time). Same model, same corpus: 94.5%, zero structural failures. The methodology is more convincing than the headline — pre-registration is for catching exactly that kind of silent config miss. The failed run is shipped right next to the fixed one; re-run the harness yourself.
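The failure mode is easy to reproduce without any SDK. The sketch below is not the actual fix. It uses hand-written JSON-Schema-like objects instead of a real zod conversion, and `ObjectSchema` and `structuralErrors` are names invented for the example. It shows why a permissive `{type: object}` is a silent miss: it gives the model no argument names to copy and gives a checker nothing to flag.

```typescript
// Minimal JSON-Schema-like shape, just enough for this sketch.
interface ObjectSchema {
  type: "object";
  properties?: Record<string, { type: string }>;
  required?: string[];
}

// What the router was given before the fix: no argument names at all.
const permissive: ObjectSchema = { type: "object" };

// What it should be given: the macro's real argument names.
const explicit: ObjectSchema = {
  type: "object",
  properties: {
    owner: { type: "string" },
    repo: { type: "string" },
    number: { type: "number" },
  },
  required: ["owner", "repo", "number"],
};

interface StructuralErrors {
  unexpected: string[]; // names the model emitted that the schema lacks
  missing: string[];    // required names the model left out
}

function structuralErrors(schema: ObjectSchema, args: Record<string, unknown>): StructuralErrors {
  const emitted = Object.keys(args);
  const required = schema.required || [];
  const missing = required.filter((name) => emitted.indexOf(name) === -1);
  // Without declared properties any name is accepted, so nothing is flagged.
  if (!schema.properties) return { unexpected: [], missing };
  const known = Object.keys(schema.properties);
  const unexpected = emitted.filter((name) => known.indexOf(name) === -1);
  return { unexpected, missing };
}

// The guessed argument names from the failed run (values are placeholders).
const guessed = { repo_owner: "macrokit", repo_name: "core", pr_number: 1 };
```

Against `permissive`, the guessed arguments produce no errors at all. Against `explicit`, all three names come back as unexpected and all three required names as missing. That difference is the reason to convert the schema once at definition time rather than trusting a fallback.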
Where it does not help (so you don't misapply it)
- Genuinely novel reasoning — "write a reply to this angry customer," "price a brand-new category." Route those to a frontier model. Hybrid routing (local handles the routine 80–90%, frontier handles the novel remainder) is opt-in.
- Workflows that change every time — exploratory research, open-ended debugging. Nothing to encode; the pattern is pure overhead. The value comes from the ratio of executions to encodings.
- Surfaces that change underneath you — a macro over a third-party UI breaks when the UI changes. The mitigation is loud, typed failures caught in CI plus a DOM/action-menu abstraction, not a "self-healing" claim.
- Very small models — ~7B instruct-tuned is roughly the serious-deployment floor today; below that the failure-detector fires more than routing succeeds. An empirical bound that keeps dropping.
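The opt-in hybrid routing mentioned in the first bullet could look something like the following. Everything here is an assumption for illustration: the post does not specify a confidence score, a threshold, or these type names, and a real router may signal "no match" differently.

```typescript
// Hypothetical router output: the chosen macro (or null) plus a confidence score.
interface RouteDecision {
  macro: string | null;
  confidence: number; // assumed to be in [0, 1]
}

interface RoutingOptions {
  hybrid: boolean;       // opt-in: allow escalation to a frontier model
  minConfidence: number; // assumed threshold, to be tuned per deployment
}

type Target =
  | { kind: "local_macro"; macro: string }
  | { kind: "frontier" }
  | { kind: "reject"; reason: string };

// Decide where a request goes: a known macro, a frontier model, or nowhere.
function chooseTarget(d: RouteDecision, knownMacros: string[], opts: RoutingOptions): Target {
  if (
    d.macro !== null &&
    knownMacros.indexOf(d.macro) !== -1 &&
    d.confidence >= opts.minConfidence
  ) {
    return { kind: "local_macro", macro: d.macro };
  }
  // No confident match: escalate only when hybrid routing was enabled.
  if (opts.hybrid) return { kind: "frontier" };
  return { kind: "reject", reason: "no macro matched and hybrid routing is off" };
}
```

Keeping `reject` as an explicit outcome matters for deployments with no frontier access at all: a request outside the library fails loudly instead of being answered badly by the small model.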
What it is not
Not an agent framework (those compete on better runtime reasoning; this eliminates it — "an agent that routes," not "an agent that thinks"). Not a model (BYO — OpenAI-compatible + Ollama out of the box). Not RPA (macros are semantic tool calls, not recorded pixels). Not a fine-tuning pipeline. Not no-code (authoring needs a developer + strong model).
I've run this pattern in production for about a year inside an unrelated operations tool serving users with no practical frontier-API access, and pulled the vertical-agnostic core out into Macrokit. The interesting question to me isn't whether weak models can match frontier — it's how much of your real workload is repetitive enough that you never need to find out.
Try it / read it
- Demo (keyless, in-browser, public repos — open the network tab): https://studio.macrokit.dev
- The pattern + the benchmark, including the failed first run: https://macrokit.dev
- SDK (Apache-2.0, TypeScript): https://github.com/macrokit/core
I'd genuinely like to hear where this breaks on real work — that's the useful feedback. Pushback welcome.
— Cheng Qian