If you're building a wrapper around Claude Code — spawning claude CLI as a subprocess for automation, bots, or multi-agent orchestration — you might be burning through your token quota much faster than expected. Here's why, and a concrete fix.
The Problem
When your wrapper spawns a claude CLI subprocess, each process starts fresh. That process inherits your entire global configuration:

- `~/CLAUDE.md` (your project instructions)
- All enabled plugins and their skills
- Every MCP server's tool descriptions
- User-level settings from `~/.claude/settings.json`
Every single turn of every subprocess re-injects all of this. In our case (building MAMA, a memory plugin with hooks + MCP server), a single subprocess turn consumed ~50K tokens before doing any actual work.
Run `/context` in a fresh session to see for yourself — MCP tool descriptions alone can eat 10-20K tokens.
The Numbers
Before isolation:

- Subprocess turn 1: ~50K tokens (system prompt + plugins + MCP tools)
- Subprocess turn 5: ~250K tokens cumulative

After isolation:

- Subprocess turn 1: ~5K tokens
- Subprocess turn 5: ~25K tokens cumulative
That's a 10x reduction.
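The arithmetic is simple: the per-turn config payload is re-sent every turn, so overhead grows linearly with turn count. A quick sketch (function name is ours, for illustration):

```js
// Back-of-envelope model: the fixed config payload (CLAUDE.md, plugin
// skills, MCP tool descriptions) is re-sent on every subprocess turn,
// so cumulative overhead grows linearly with the number of turns.
function cumulativeOverhead(tokensPerTurn, turns) {
  return tokensPerTurn * turns;
}

console.log(cumulativeOverhead(50_000, 5)); // before isolation: 250000
console.log(cumulativeOverhead(5_000, 5));  // after isolation: 25000
```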
The Fix: 4-Layer Subprocess Isolation
We solved this by isolating each subprocess from the user's global settings:
Layer 1: Scoped Working Directory
```js
// Set cwd to a scoped workspace, NOT os.homedir()
// This prevents ~/CLAUDE.md from being auto-loaded
cwd: path.join(os.homedir(), '.mama', 'workspace')
```
Layer 2: Git Boundary
```js
// Create a .git/HEAD to block upward CLAUDE.md traversal
const gitDir = path.join(workspaceDir, '.git');
fs.mkdirSync(gitDir, { recursive: true });
fs.writeFileSync(path.join(gitDir, 'HEAD'), 'ref: refs/heads/main\n');
```
Layer 3: Empty Plugin Directory
```js
// Point --plugin-dir to an empty directory
'--plugin-dir', path.join(os.homedir(), '.mama', '.empty-plugins')
```
Layer 4: Setting Sources
```js
// Exclude user-level settings (which contain enabledPlugins)
'--setting-sources', 'project,local'
```
Why Each Layer Matters
| Layer | What it blocks | Without it |
|---|---|---|
| Scoped cwd | `~/CLAUDE.md` auto-load | ~5K tokens/turn of instructions |
| `.git/HEAD` | Upward CLAUDE.md traversal | Claude Code walks up to `~` and finds it |
| `--plugin-dir` | Global plugin skills | Plugins inject skills every turn |
| `--setting-sources` | `enabledPlugins` list | settings.json re-enables plugins |
Why Wrap the CLI Instead of Using the API Directly?
You might wonder: why not just call the Anthropic API and skip all this CLI overhead?
Because Claude Code CLI gives you a full agentic runtime for free:
- Built-in tools — file read/write, bash execution, glob, grep — all wired up and ready
- Agentic loop — tool calls → execution → response, handled automatically
- MCP support — connect any MCP server and the CLI manages the protocol
- Session persistence — resume conversations across process restarts
- Permission model — sandboxed tool execution with user approval flow
Building all of this on the raw API means reimplementing thousands of lines of tool execution, file I/O, and safety checks. The CLI already did that work.
The tradeoff: each subprocess inherits global config and burns tokens. That's what the 4-layer isolation fixes — you get the full CLI runtime without the bloat.
One-Shot vs Persistent Process
Pattern A: One-shot with resume
```sh
claude -p "<prompt>" \
  --append-system-prompt "<identity>" \
  --resume <session-id>
```
Each call re-sends full history + system prompt. After 10 turns the system prompt has been sent 10 times.
Pattern B: Persistent stream-json (our approach)
```sh
claude --print \
  --input-format stream-json \
  --output-format stream-json \
  --session-id <id>
```
Process stays alive. System prompt sent once. Messages go through stdin.
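In this mode, each turn is one JSON object per stdin line. A minimal encoding sketch, assuming a `{type, message}` shape (verify the exact stream-json schema against your CLI version; `encodeTurn` is our name):

```js
// Encode one conversational turn as a stream-json stdin line.
// (The message shape is an assumption about the CLI's stream-json
// input format; check your claude CLI version's documentation.)
function encodeTurn(text) {
  return JSON.stringify({
    type: 'user',
    message: { role: 'user', content: text },
  }) + '\n';
}

// Usage with a long-lived subprocess (system prompt paid once, at startup):
//   const { spawn } = require('child_process');
//   const child = spawn('claude', ['--print',
//     '--input-format', 'stream-json',
//     '--output-format', 'stream-json',
//     '--session-id', sessionId]);
//   child.stdin.write(encodeTurn('Summarize the latest test failures'));
```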
Both patterns need the 4-layer isolation.
Try It Yourself
1. Open Claude Code with your usual setup
2. Run `/context` — note the total token count
3. Imagine that multiplied by every subprocess turn
Links
- PR with the full implementation
- MAMA project — Memory-Augmented MCP Assistant
- Related HN discussion
Top comments (12)
50K tokens per subagent turn is painful. The root cause — each subprocess loads the full system prompt plus conversation history — is a known issue with most agent frameworks, not just Claude Code.
The fix you describe (context windowing + summarization) is the standard approach, but there's a tradeoff: if you summarize too aggressively, the agent loses important context. I've found that keeping the last 3-5 tool call/response pairs intact and summarizing everything older than that hits a good balance.
What's your summarization strategy?
Thanks! To clarify — this isn't about context windowing or summarization. The problem is repeated injection.
When you spawn a CLI subprocess, the system prompt (CLAUDE.md, plugin skills, MCP tool descriptions) gets injected on the first turn — that's fine, the agent needs it. But without isolation, that same config gets re-injected every turn because the CLI re-reads global settings each time. Turn 5 = 5x the same system prompt loaded.
The 4-layer isolation ensures the subprocess only loads what you explicitly provide via `--system-prompt` on the first turn, and doesn't pick up global config repeatedly. Combined with a persistent process (stream-json mode), the agent keeps its context in one continuous session — no re-injection, no summarization needed. Claude Code handles its own compaction internally.

One thing worth clarifying: this isn't an inherent limitation of LLM agents. It's a side effect of how the ecosystem is designed — partly by architecture, partly by incentive.
The technical part: The API is stateless — every HTTP request needs full context. But a CLI process can be stateful. In stream-json mode, the process stays alive and holds the conversation in memory. New messages go through stdin; the agent already knows its system prompt, tools, and history. No re-injection needed.
The incentive part: Providers design stateless APIs because it simplifies their infrastructure — no server-side session management. The side effect? Clients re-send system prompts + tool definitions every turn, which means more billable tokens. The "fix" they offer is prompt caching (90% discount on cache hits), but that still assumes you're re-sending everything — it just costs less. There's no push toward persistent sessions because the current design already works in their favor.
That's why most people accept "every turn re-injects everything" as a law of nature. It's not — it's just the default path of least resistance that happens to align with the provider's revenue model.
This hits a real problem that doesn't get talked about enough. The MCP tool description overhead is particularly nasty in multi-agent setups — you might have 20+ tools registered for the full system, but any given subagent only needs 3-4 of them for its specific task. Loading all the descriptions anyway burns context budget before the first useful token gets written.
The git boundary trick for blocking upward CLAUDE.md traversal is clever. I've hit the same issue from a different angle — a CLAUDE.md that made sense for interactive dev work was completely wrong context for a tightly scoped automation task. Separating the working directory solves both problems at once.
One thing I'd add: if you're connecting to a real MCP server rather than using Claude's built-in tools, selectively filtering which tools you expose to each subprocess can make another significant dent in overhead. Less a CLI trick, more an MCP server design choice — expose exactly what the task needs, nothing more. Combined with your 4-layer isolation this gets you to a point where each subprocess starts genuinely lean.
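As a sketch of that server-side filtering idea (the tool list and function name here are illustrative, not from any real MCP SDK):

```js
// Expose only the tools a given subagent actually needs, so unused
// tool descriptions never enter its context. Names are illustrative.
function visibleTools(allTools, allowedNames) {
  return allTools.filter((tool) => allowedNames.includes(tool.name));
}

const allTools = [
  { name: 'search', description: 'Search stored memories' },
  { name: 'save', description: 'Persist a new memory' },
  { name: 'delete', description: 'Remove a memory' },
  { name: 'export', description: 'Dump all memories' },
];

// A read-only subagent sees only what its task needs:
const exposed = visibleTools(allTools, ['search']);
console.log(exposed.map((t) => t.name)); // [ 'search' ]
```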
Great point — this is the next layer of optimization I haven't tackled yet.
Right now MAMA uses a tier-based permission system: Tier 1 agents get full tool access, Tier 2/3 are restricted to read-only tools. But the restriction is soft — enforced via prompt instructions, while the full tool descriptions from each MCP server still get injected into every agent's system prompt. So even an agent that only needs `search` and `save` from the MAMA MCP server still pays the token cost of every tool it exposes.

Your suggestion to filter at the MCP server level makes sense — if a single MCP server exposes 20 tools, each subagent should only see the 3-4 it actually needs. I'm planning to add an `allowed_tools` field per agent config so the system prompt only includes relevant tool descriptions.

Clean distinction: "don't re-inject what's already loaded" vs. "don't inject what's not needed in the first place." Thanks for surfacing this.
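A hypothetical shape for that per-agent config (field names are a sketch, not shipped MAMA configuration):

```json
{
  "agents": {
    "researcher": { "tier": 2, "allowed_tools": ["search"] },
    "curator": { "tier": 1, "allowed_tools": ["search", "save"] }
  }
}
```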