Wire your AI agent into Macrokit's MCP server — and find out which workflows you should have encoded

#ai #llm #opensource #mcp

Most LLM agents accumulate tools the same way codebases accumulate dependencies: organically, without consolidation, until you have 40 helpers and no macros. Macrokit's public MCP server gives your existing agent (Claude Code, Cursor, anything that speaks MCP) a way to wire in, do its normal work, and then get told which repeated workflows it should have encoded.

That last part is what I haven't seen any other tool do.

What you're wiring in

When you run `macrokit mcp ./demo`, your agent sees six tools:

Tool	What it does
`list_macros`	Shows the macro registry — what's already encoded
`run_macro`	Executes a named macro with typed arguments
`gh_list_issues`	Lists open GitHub issues (primitive)
`gh_list_pulls`	Lists open pull requests (primitive)
`gh_list_pull_files`	Files changed in a PR (primitive)
`gh_suggest_labels_dryrun`	Suggests labels without writing (dry-run primitive)

The first two are runtime tools. The last four are primitives, the building blocks that macros call. Your agent can call them directly; everything it calls is recorded to `.macrokit/sessions`.
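To make the two runtime tools concrete, here is a minimal sketch of what a registry behind `list_macros` and `run_macro` could look like. It is not Macrokit's implementation: the `Registry` class, the `Macro` interface, and the `validate` hook are hypothetical names invented for illustration, and a hand-written validator stands in for a real argument schema.

```typescript
// Hypothetical types for illustration only; not Macrokit's actual API.
type Args = Record<string, unknown>;

interface Macro {
  name: string;
  intent: string;
  validate(args: Args): string[]; // list of problems; empty means the args are valid
  handler(args: Args): unknown;
}

class Registry {
  private macros = new Map<string, Macro>();

  register(macro: Macro): void {
    this.macros.set(macro.name, macro);
  }

  // Rough counterpart of list_macros: what is already encoded.
  list(): { name: string; intent: string }[] {
    return Array.from(this.macros.values()).map((m) => ({
      name: m.name,
      intent: m.intent,
    }));
  }

  // Rough counterpart of run_macro: look up by name, check args, then execute.
  run(name: string, args: Args): unknown {
    const macro = this.macros.get(name);
    if (!macro) {
      throw new Error(`unknown macro: ${name}`);
    }
    const problems = macro.validate(args);
    if (problems.length > 0) {
      throw new Error(`invalid arguments for ${name}: ${problems.join("; ")}`);
    }
    return macro.handler(args);
  }
}
```

The point of the sketch is the shape: the agent picks a name from a closed list and supplies arguments, and anything that fails validation never reaches the handler.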

Set it up

Step 1 — Install the CLI:

npm install -g @macrokit/cli

Step 2 — Scaffold a project:

macrokit init demo --vertical github

Creates demo/:

macrokit.json — project config
macros/summarize_open_issues.ts and triage_newest_pull.ts — two working macros
primitives/ — the four GitHub primitives
fixtures/ — recorded test inputs

Step 3 — Wire it into Claude Code:

claude mcp add macrokit -- macrokit mcp ./demo

That's it. Start a session (claude) and your agent now sees all six tools. It will use them naturally as it works.

Step 4 — After a session, run the gate:

macrokit gate .macrokit/sessions --macros macros

The gate reads the session log. Any user turn where the agent made three or more distinct tool calls that weren't routed through a macro gets flagged as "a workflow without a macro" — and the gate suggests what to encode: a name, an argument schema, and a stub handler. You review, fill in the handler, and the macro library grows at the rate the agent uses the system.
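The flagging rule is simple enough to sketch in a few lines. This illustrates the rule as described above, not the gate's actual code: the `Turn` and `ToolCall` shapes, the `viaMacro` field, and the choice to ignore the two runtime tools are all assumptions, and the real session log format may differ.

```typescript
// Hypothetical record shapes; the real session log format may differ.
interface ToolCall {
  tool: string;      // e.g. "gh_list_pulls" or "run_macro"
  viaMacro: boolean; // true when the call happened inside a macro
}

interface Turn {
  user: string; // the user request that opened the turn
  calls: ToolCall[];
}

interface FlaggedTurn {
  user: string;
  unencoded: string[]; // distinct tools called outside any macro, in first-seen order
}

// Assumption: the two runtime tools never count toward the threshold.
const RUNTIME_TOOLS = new Set(["list_macros", "run_macro"]);
const MIN_DISTINCT_CALLS = 3;

function flagTurns(turns: Turn[]): FlaggedTurn[] {
  const flagged: FlaggedTurn[] = [];
  for (const turn of turns) {
    // A Set collapses repeated calls to the same tool into one entry.
    const distinct = new Set<string>();
    for (const call of turn.calls) {
      if (!call.viaMacro && !RUNTIME_TOOLS.has(call.tool)) {
        distinct.add(call.tool);
      }
    }
    if (distinct.size >= MIN_DISTINCT_CALLS) {
      flagged.push({ user: turn.user, unencoded: Array.from(distinct) });
    }
  }
  return flagged;
}
```

Counting distinct tools rather than raw calls means a turn that calls the same primitive three times is not flagged under this reading of the rule.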

What the gate output looks like

macrokit gate: 1 turn(s) ran a multi-step workflow without a macro — encode each as one macro before merging.

Session: .macrokit/sessions/mcp-2026-06-19T12-06-47-529Z.jsonl
Turn 1 — user: label the newest PR based on its changed files
3 tool call(s) — 3 un-encoded:
    - gh_list_pulls
    - gh_list_pull_files
    - gh_suggest_labels_dryrun
Suggested macro: label_newest

import { defineMacro } from "@macrokit/authoring";
import { z } from "zod";

export const label_newest = defineMacro({
  name: "label_newest",
  intent: "label the newest PR based on its changed files",
  schema: z.object({
    // TODO: extract the arguments this workflow needs from the user request
  }),
  handler: async (args, ctx) => {
    // This workflow currently happens as several router-driven calls:
    //   - gh_list_pulls
    //   - gh_list_pull_files
    //   - gh_suggest_labels_dryrun
    // Encode the sequence here so the router dispatches it as ONE macro.
    return {};
  },
});

The gate prints this stub (it doesn't write a file — that's your call). You copy it into macros/, fill in the handler logic (or have a strong model do it), and the next time the agent handles a labeling request it routes through the macro in one call instead of three.
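For a sense of what filling in the handler might involve, here is a sketch of the three calls chained into one function. Everything about the primitives' signatures is assumed: the `Primitives` interface, the `Pull` shape, and the argument lists are hypothetical stand-ins, so adapt them to whatever the scaffolded primitives actually export. It is written as a plain function rather than inside `defineMacro` so that it runs without the Macrokit packages.

```typescript
// Hypothetical shapes: the real primitives' inputs and outputs may differ.
interface Pull {
  number: number;
  createdAt: string; // assumed to be an ISO 8601 UTC timestamp
}

interface Primitives {
  gh_list_pulls(): Promise<Pull[]>;
  gh_list_pull_files(pullNumber: number): Promise<string[]>;
  gh_suggest_labels_dryrun(pullNumber: number, files: string[]): Promise<string[]>;
}

interface LabelResult {
  pull: number | null; // null when there are no open PRs
  labels: string[];
}

// One fixed sequence instead of three separate agent-driven calls.
async function labelNewest(p: Primitives): Promise<LabelResult> {
  const pulls = await p.gh_list_pulls();
  if (pulls.length === 0) {
    return { pull: null, labels: [] };
  }

  // Same-format ISO 8601 UTC timestamps compare correctly as plain strings.
  const newest = pulls.reduce((a, b) => (b.createdAt > a.createdAt ? b : a));

  const files = await p.gh_list_pull_files(newest.number);
  const labels = await p.gh_suggest_labels_dryrun(newest.number, files);
  return { pull: newest.number, labels };
}
```

Passing the primitives in as a parameter keeps the sequence easy to exercise against stubs or recorded fixtures before it runs against a real repository.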

This is the distillation loop: the agent works, the gate surfaces recurrence, you encode it. A strong model encodes the workflow once; a weak or local model then runs it indefinitely at near-zero cost.

Honest scope

The MCP server handles record + run + gate. It does not auto-distill macros when it detects recurrence — you get a suggestion, not an automatic encoding. The review step is intentional: a macro is code that runs deterministically, and you want a developer to own it before it runs unsupervised.

Auto-distillation (where the system proposes and encodes without a review step) is a separate capability, not in the public server today.

The broader argument

The on-ramp is intentionally small: one init, one mcp add, and a post-session gate. But the compounding effect is what matters. Wire this in, work normally for a week, and run the gate. You'll see exactly which multi-step sequences your agent repeats — and the gate will have already proposed macros for them.

That's the claim the pre-registered benchmark validates: once a workflow is encoded, a 7B local model routes to it at 94.5% accuracy. That's not as impressive as a frontier model reasoning through it live, but it costs fractions of a cent, runs offline, and never hallucinates the tool name.

Everything is open. Apache 2.0. I'm the maker (Cheng Qian).

Code + benchmark: https://github.com/macrokit/core
Docs: https://macrokit.dev

I'm genuinely after feedback on where the routing breaks. If you wire this in and hit a case the gate misses or flags incorrectly, open an issue. That's the data the methodology needs.