Most LLM agents accumulate tools the same way codebases accumulate dependencies: organically, without consolidation, until you have 40 helpers and no macros. Macrokit's public MCP server gives your existing agent (Claude Code, Cursor, anything that speaks MCP) a way to wire in, do its normal work, and then get told which repeated workflows it should have encoded.
That last part is the thing no other tool does.
What you're wiring in
When you run macrokit mcp ./demo, your agent sees six tools:
| Tool | What it does |
|---|---|
| list_macros | Shows the macro registry (what's already encoded) |
| run_macro | Executes a named macro with typed arguments |
| gh_list_issues | Lists open GitHub issues (primitive) |
| gh_list_pulls | Lists open pull requests (primitive) |
| gh_list_pull_files | Files changed in a PR (primitive) |
| gh_suggest_labels_dryrun | Suggests labels without writing (dry-run primitive) |
The first two are runtime tools. The last four are primitives — the building blocks macros call. Your agent can call them directly; everything it calls is recorded to .macrokit/sessions.
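To make the two paths concrete, here is a rough sketch of the requests an MCP client sends for a direct primitive call versus a macro call. The `tools/call` envelope follows the MCP specification; the argument names (`repo`, `name`, `args`) are assumptions for illustration, not the server's documented schema.

```typescript
// Shape of an MCP `tools/call` request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Build a tools/call request for the given tool name and arguments.
function toolCall(
  name: string,
  args: Record<string, unknown> = {},
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Calling a primitive directly. `repo` is a hypothetical argument name.
const direct = toolCall("gh_list_issues", { repo: "octocat/hello-world" });

// Routing through a macro. `name` and `args` are assumed argument names.
const viaMacro = toolCall("run_macro", {
  name: "summarize_open_issues",
  args: {},
});

console.log(JSON.stringify(direct));
console.log(JSON.stringify(viaMacro));
```

In practice your agent's MCP client builds these for you; the point is only that a macro call is a single request where the primitive path may take several.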
Set it up
Step 1 — Install the CLI:
npm install -g @macrokit/cli
Step 2 — Scaffold a project:
macrokit init demo --vertical github
Creates demo/:
- macrokit.json: project config
- macros/summarize_open_issues.ts and triage_newest_pull.ts: two working macros
- primitives/: the four GitHub primitives
- fixtures/: recorded test inputs
Step 3 — Wire it into Claude Code:
claude mcp add macrokit -- macrokit mcp ./demo
That's it. Start a session (claude) and your agent now sees all six tools. It will use them naturally as it works.
Step 4 — After a session, run the gate:
macrokit gate .macrokit/sessions --macros macros
The gate reads the session log. Any user turn where the agent made three or more distinct tool calls that weren't routed through a macro gets flagged as "a workflow without a macro" — and the gate suggests what to encode: a name, an argument schema, and a stub handler. You review, fill in the handler, and the macro library grows at the rate the agent uses the system.
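The flagging rule described above can be sketched in a few lines. This is a minimal illustration, not the gate's actual implementation: the record shape (`turn`, `user`, `tool`, `viaMacro`) is hypothetical, and the real `.macrokit/sessions` JSONL schema may differ.

```typescript
// Hypothetical session record: one entry per tool call.
interface ToolCallRecord {
  turn: number;      // index of the user turn
  user: string;      // the user's request for that turn
  tool: string;      // tool name the agent called
  viaMacro: boolean; // true when the call was routed through a macro
}

interface FlaggedTurn {
  turn: number;
  user: string;
  unencoded: string[]; // distinct tools called outside any macro
}

const MIN_DISTINCT_CALLS = 3;

// Flag every turn with three or more distinct un-encoded tool calls.
function flagTurns(records: ToolCallRecord[]): FlaggedTurn[] {
  const byTurn: { [turn: number]: FlaggedTurn } = {};
  for (const r of records) {
    if (r.viaMacro) continue; // already encoded, nothing to suggest
    let entry = byTurn[r.turn];
    if (!entry) {
      entry = { turn: r.turn, user: r.user, unencoded: [] };
      byTurn[r.turn] = entry;
    }
    if (entry.unencoded.indexOf(r.tool) === -1) entry.unencoded.push(r.tool);
  }
  return Object.keys(byTurn)
    .map((k) => byTurn[Number(k)])
    .filter((t) => t.unencoded.length >= MIN_DISTINCT_CALLS);
}

const request = "label the newest PR based on its changed files";
const flagged = flagTurns([
  { turn: 1, user: request, tool: "gh_list_pulls", viaMacro: false },
  { turn: 1, user: request, tool: "gh_list_pull_files", viaMacro: false },
  { turn: 1, user: request, tool: "gh_suggest_labels_dryrun", viaMacro: false },
  { turn: 2, user: "summarize open issues", tool: "run_macro", viaMacro: true },
]);

console.log(flagged);
```

Here turn 1 is flagged with its three primitives, while turn 2 is skipped because it already went through a macro.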
What the gate output looks like
macrokit gate: 1 turn(s) ran a multi-step workflow without a macro — encode each as one macro before merging.
Session: .macrokit/sessions/mcp-2026-06-19T12-06-47-529Z.jsonl
Turn 1 — user: label the newest PR based on its changed files
3 tool call(s) — 3 un-encoded:
- gh_list_pulls
- gh_list_pull_files
- gh_suggest_labels_dryrun
Suggested macro: label_newest
import { defineMacro } from "@macrokit/authoring";
import { z } from "zod";
export const label_newest = defineMacro({
  name: "label_newest",
  intent: "label the newest PR based on its changed files",
  schema: z.object({
    // TODO: extract the arguments this workflow needs from the user request
  }),
  handler: async (args, ctx) => {
    // This workflow currently happens as several router-driven calls:
    //   - gh_list_pulls
    //   - gh_list_pull_files
    //   - gh_suggest_labels_dryrun
    // Encode the sequence here so the router dispatches it as ONE macro.
    return {};
  },
});
The gate prints this stub (it doesn't write a file — that's your call). You copy it into macros/, fill in the handler logic (or have a strong model do it), and the next time the agent handles a labeling request it routes through the macro in one call instead of three.
This is the distillation loop: the agent works, the gate surfaces recurrence, you encode it. A strong model encodes the workflow once; a weak or local model then runs it indefinitely at near-zero cost.
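For a sense of what "fill in the handler" might involve, here is a self-contained sketch that chains the three primitives from the stub. It makes loud assumptions: `ctx.call` is a hypothetical way to invoke primitives by name (the real `ctx` from @macrokit/authoring may look different), and the field names `number`, `created_at`, and `files` are invented for illustration. An in-memory fake stands in for GitHub so the sketch runs on its own.

```typescript
// Hypothetical context shape: assumes a handler can invoke primitives by name.
interface MacroContext {
  call(tool: string, args: Record<string, unknown>): Promise<unknown>;
}

interface Pull { number: number; created_at: string }
interface LabelResult { pull: number | null; labels: string[] }

// The three router-driven calls from the stub, encoded as one sequence.
async function labelNewest(
  _args: Record<string, never>,
  ctx: MacroContext,
): Promise<LabelResult> {
  const pulls = (await ctx.call("gh_list_pulls", {})) as Pull[];
  if (pulls.length === 0) return { pull: null, labels: [] };

  // Pick the most recently created PR (field names are assumed).
  const newest = pulls
    .slice()
    .sort((a, b) => Date.parse(b.created_at) - Date.parse(a.created_at))[0];

  const files = (await ctx.call("gh_list_pull_files", {
    number: newest.number,
  })) as string[];
  const labels = (await ctx.call("gh_suggest_labels_dryrun", {
    number: newest.number,
    files,
  })) as string[];

  return { pull: newest.number, labels };
}

// In-memory stand-in for the primitives so the sketch runs without GitHub.
const calls: string[] = [];
const fakeCtx: MacroContext = {
  async call(tool, args) {
    calls.push(tool);
    if (tool === "gh_list_pulls") {
      return [
        { number: 3, created_at: "2026-06-01T09:00:00Z" },
        { number: 7, created_at: "2026-06-18T15:30:00Z" },
      ];
    }
    if (tool === "gh_list_pull_files") return ["docs/setup.md"];
    if (tool === "gh_suggest_labels_dryrun") {
      const files = args.files as string[];
      return files.some((f) => f.indexOf("docs/") === 0) ? ["documentation"] : [];
    }
    throw new Error("unknown tool: " + tool);
  },
};

labelNewest({}, fakeCtx).then((result) => console.log(result, calls));
```

The same fake-context trick is a reasonable way to test a handler against recorded inputs before letting it run unsupervised.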
Honest scope
The MCP server handles record + run + gate. It does not auto-distill macros when it detects recurrence — you get a suggestion, not an automatic encoding. The review step is intentional: a macro is code that runs deterministically, and you want a developer to own it before it runs unsupervised.
Auto-distillation (where the system proposes and encodes without a review step) is a separate capability, not in the public server today.
The broader argument
The on-ramp is intentionally small: one init, one mcp add, and a post-session gate. But the compounding effect is what matters. Wire this in, work normally for a week, and run the gate. You'll see exactly which multi-step sequences your agent repeats — and the gate will have already proposed macros for them.
That's the claim the pre-registered benchmark validates: once a workflow is encoded, a 7B local model routes to it at 94.5% accuracy. That is less impressive than a frontier model reasoning through the task live, but it costs fractions of a cent, runs offline, and never hallucinates the tool name.
Everything is open. Apache 2.0. I'm the maker (Cheng Qian).
Code + benchmark: https://github.com/macrokit/core
Docs: https://macrokit.dev
I'm genuinely after feedback on where the routing breaks. If you wire this in and hit a case the gate misses or flags wrong, open an issue — that's the data the methodology needs.