DEV Community

Building a 24/7 Claude Code Wrapper? Here's Why Each Subprocess Burns 50K Tokens

Jaehoon Jung on February 22, 2026

If you're building a wrapper around Claude Code — spawning claude CLI as a subprocess for automation, bots, or multi-agent orchestration — you migh...
Matthew Hou

50K tokens per subagent turn is painful. The root cause — each subprocess loads the full system prompt plus conversation history — is a known issue with most agent frameworks, not just Claude Code.

The fix you describe (context windowing + summarization) is the standard approach, but there's a tradeoff: if you summarize too aggressively, the agent loses important context. I've found that keeping the last 3-5 tool call/response pairs intact and summarizing everything older than that hits a good balance.

What's your summarization strategy?

Jaehoon Jung

Thanks! To clarify — this isn't about context windowing or summarization. The problem is repeated injection.

When you spawn a CLI subprocess, the system prompt (CLAUDE.md, plugin skills, MCP tool descriptions) gets injected on the first turn — that's fine, the agent needs it. But without isolation, that same config gets re-injected every turn because the CLI re-reads global settings each time. Turn 5 = 5x the same system prompt loaded.

The 4-layer isolation ensures the subprocess only loads what you explicitly provide via --system-prompt on the first turn, and doesn't pick up global config repeatedly. Combined with a persistent process (stream-json mode), the agent keeps its context in one continuous session — no re-injection, no summarization needed. Claude Code handles its own compaction internally.
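The persistent-process pattern can be sketched roughly like this — a minimal Python sketch, assuming the `-p`, `--input-format stream-json`, `--output-format stream-json`, and `--system-prompt` flags and the one-JSON-object-per-line stdin message shape described in the post (verify against your CLI version):

```python
import json
import subprocess

def build_command(system_prompt: str) -> list[str]:
    # One persistent process: the system prompt is passed exactly once,
    # at spawn time; later turns only write new messages to stdin.
    # Flag names are assumptions based on the post, not verified docs.
    return [
        "claude", "-p",
        "--input-format", "stream-json",
        "--output-format", "stream-json",
        "--system-prompt", system_prompt,
    ]

def encode_user_message(text: str) -> str:
    # stream-json input: one JSON object per line on stdin (assumed shape).
    return json.dumps({"type": "user",
                       "message": {"role": "user", "content": text}}) + "\n"

def spawn(system_prompt: str) -> subprocess.Popen:
    # Long-lived worker: context lives inside the process, never re-sent.
    return subprocess.Popen(build_command(system_prompt),
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            text=True)
```

Each turn then becomes a single `proc.stdin.write(encode_user_message(...))` rather than a fresh subprocess carrying the whole config again.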

Jaehoon Jung

One thing worth clarifying: this isn't an inherent limitation of LLM agents. It's a side effect of how the ecosystem is designed — partly by architecture, partly by incentive.

The technical part: The API is stateless — every HTTP request needs full context. But a CLI process can be stateful. In stream-json mode, the process stays alive and holds the conversation in memory. New messages go through stdin; the agent already knows its system prompt, tools, and history. No re-injection needed.

The incentive part: Providers design stateless APIs because it simplifies their infrastructure — no server-side session management. The side effect? Clients re-send system prompts + tool definitions every turn, which means more billable tokens. The "fix" they offer is prompt caching (90% discount on cache hits), but that still assumes you're re-sending everything — it just costs less. There's no push toward persistent sessions because the current design already works in their favor.
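The arithmetic behind that incentive is easy to make concrete. A back-of-envelope sketch with illustrative numbers (a 50K-token global config, not measured figures):

```python
def reinjected_tokens(config_tokens: int, turns: int) -> int:
    # Without isolation: the full global config is re-loaded every turn.
    return config_tokens * turns

def isolated_tokens(focused_prompt_tokens: int, turns: int) -> int:
    # With isolation + a persistent session: a focused prompt lands once,
    # regardless of how many turns follow.
    return focused_prompt_tokens

# Illustrative: 50K config, 5 turns.
waste = reinjected_tokens(50_000, 5)          # 250_000 tokens re-sent
cached_waste = waste * 0.1                    # ~90% cache discount still leaves 25_000
focused = isolated_tokens(5_000, 5)           # 5_000, once
```

Even with the 90% cache discount, the re-sent config costs more per session than a focused prompt injected once — and the gap widens with every additional turn and every additional subprocess.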

That's why most people accept "every turn re-injects everything" as a law of nature. It's not — it's just the default path of least resistance that happens to align with the provider's revenue model.

Mahima From HeyDev

This matches what I have seen with agent wrappers too - the tool startup and context replay costs dwarf the actual coding once you start fanning out subagents.

One thing that helped us was treating subagents as long-lived workers with a stable, minimal system prompt and a shared scratchpad, and only sending diffs or explicit file chunks instead of the whole repo every turn.

Also curious if you measured where the tokens go most - repeated project summaries vs tool call transcripts vs code blocks? Would love to see a breakdown and whether caching at the wrapper layer changed the curve.

Jaehoon Jung • Edited

Great question about the breakdown. It helps to separate token consumption into three distinct categories:

1. Token Waste — what the wrapper solves

The key insight isn't just "50K tokens are wasted" — it's that subagents don't need the user's full global config. They need a purpose-built system prompt crafted by the wrapper.

Without isolation, the CLI auto-loads everything: your CLAUDE.md, all plugin skills, every MCP tool schema, user settings. A subagent doing code review doesn't need your deployment skills or Slack tool descriptions. But the CLI doesn't know that — it loads everything by default, every turn.

The wrapper's job is to block all of that with 4-layer isolation, then inject only what the subagent actually needs via --system-prompt on the first turn: its persona, its specific tools, its task context. This is what turns a 50K generic context into a ~5K focused one — not by compressing, but by not loading irrelevant context in the first place.

And because this system prompt is injected only on the first turn of a persistent process, it's never re-sent. So to answer your original question — "repeated project summaries" simply don't exist in this model. The system prompt lands once, stays in the process's memory, and every subsequent turn only carries new user messages and tool results.

2. Token Consumption — what the LLM/CLI manages

Tool call transcripts, code blocks, conversation history — these aren't waste. They're the agent's working context. An agent needs to see what it read, what it edited, and what tests returned in order to work coherently. This is managed by the CLI's internal compaction (automatic compression when approaching context limits). The wrapper shouldn't touch this — and max_turns is counterproductive here: it caps agent capability rather than optimizing anything.

3. The measurement gap — work in progress

Here's the honest part: I can quantify the waste I block (category 1), but observing how working context grows per turn (category 2) is still a blind spot. I track API-reported input/output tokens per request, but I don't yet have turn-by-turn context window growth curves (e.g., turn 1 = 5K, turn 5 = 25K, turn 10 = 80K, post-compaction = 40K).

I'm actively building this — per-turn context size tracking, compaction event detection, and system prompt vs working context ratio analysis. The goal is to make the "CLI-managed" side observable too, not just trust it as a black box.
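The tracking described above could be sketched like this — a hypothetical design, not MAMA's actual MetricsStore; the class and field names are illustrative. The key observation is that a drop in API-reported input tokens between consecutive turns is a usable signal for a compaction event:

```python
from dataclasses import dataclass, field

@dataclass
class TurnMetrics:
    turn: int
    input_tokens: int    # API-reported context size for this request
    output_tokens: int

@dataclass
class ContextTracker:
    # Hypothetical per-turn context observer: record what the API
    # reports, and flag turns where the context shrank -- which
    # usually means the CLI compacted its working context.
    turns: list[TurnMetrics] = field(default_factory=list)

    def record(self, turn: int, input_tokens: int, output_tokens: int) -> None:
        self.turns.append(TurnMetrics(turn, input_tokens, output_tokens))

    def compaction_events(self) -> list[int]:
        # Input tokens dropping between consecutive turns implies
        # internal compression happened before this turn.
        return [b.turn for a, b in zip(self.turns, self.turns[1:])
                if b.input_tokens < a.input_tokens]
```

With the growth curve from the post (turn 1 = 5K, turn 5 = 25K, turn 10 = 80K, post-compaction = 40K), the tracker would flag the post-compaction turn.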

TL;DR: The wrapper blocks the CLI's default "load everything" behavior, then injects a purpose-built system prompt per subagent — once, on the first turn, never repeated. The LLM's working context is necessary consumption managed by the CLI. And measuring that consumption is the next frontier I'm working on right now.

Mahima From HeyDev

Really helpful breakdown. The "50k tokens before doing any work" thing is exactly what surprises people when they turn a nice interactive CLI into a subprocess-driven runtime.

The part that clicked for me is that the bloat is not one thing, it's a stack: project instructions, plugin skill prompts, MCP tool catalogs, plus user settings that quietly re-enable everything. If you do N turns across M subprocesses, you are paying that cost N*M times.

Your 4-layer isolation feels like the right mental model because each layer blocks a different prompt leakage path. I especially like the .git/HEAD boundary trick. People assume cwd is enough, but upward traversal is sneaky, and it only takes one unexpected parent to pull in a huge CLAUDE.md.
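The boundary trick mentioned here can be sketched in a few lines — assuming, as the thread describes, that the CLI stops walking parent directories once it finds a project root marker (the `.git/HEAD` contents below are a minimal plausible stub, not a spec):

```python
import os
import tempfile

def make_bounded_workspace() -> str:
    # Claude Code walks parent directories looking for a project root
    # (and picks up any CLAUDE.md it passes). Planting a minimal
    # .git/HEAD makes the workspace itself look like the root, so the
    # upward walk stops here instead of reaching a parent repo.
    ws = tempfile.mkdtemp(prefix="subagent-ws-")
    git_dir = os.path.join(ws, ".git")
    os.makedirs(git_dir, exist_ok=True)
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/main\n")
    return ws
```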

One extra thing we've seen help in similar wrappers: measure the overhead as a first-class metric. Log token counts per turn, and fail fast if the system/context tax goes above a threshold. It turns token burn from a mystery into a regression you can catch.

Also +1 on persistent stream-json. One-shot + resume is deceptively expensive because you keep resending the same context.

Thanks for sharing the concrete flags and code, this is the kind of post you wish existed before building the wrapper.

Jaehoon Jung • Edited

Thanks Mahima — the N×M framing is spot on, wish I'd put it that way in the post.

Great timing on the metric suggestion — I'm building exactly this in my next sprint. A MetricsStore that tracks token-per-turn costs with health score thresholds (green/yellow/red) so token overhead becomes a first-class observable, not something you discover after the bill.
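A health gate like that could be as simple as a ratio check — a sketch with illustrative thresholds (the 20%/40% cutoffs are assumptions, not measured values), where "overhead" means system prompt + config tokens as opposed to working context:

```python
def health_score(overhead_tokens: int, total_tokens: int,
                 yellow: float = 0.2, red: float = 0.4) -> str:
    # Hypothetical green/yellow/red gate on the "context tax": the
    # share of a turn's tokens spent on config overhead rather than
    # actual work. Thresholds here are illustrative defaults.
    if total_tokens == 0:
        return "green"
    ratio = overhead_tokens / total_tokens
    if ratio >= red:
        return "red"
    if ratio >= yellow:
        return "yellow"
    return "green"
```

Wired into a per-turn metrics log, a red score becomes a regression you can fail fast on, rather than a surprise on the bill.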

And yes, the .git/HEAD thing cost me a few days of debugging. cwd alone feels like it should be enough — until Claude Code walks upward looking for project roots.

I'm also folding signalstack's MCP tool filtering idea into the same architecture — an allowed_tools field per agent so subprocesses only see the tools they actually need. Between preventing repeated injection (this article) and unnecessary injection (tool filtering), the token budget gets much healthier.

Good feedback from this thread is actively shaping what I build. Appreciate it.

Mahima From HeyDev

This is a solid breakdown of why “wrapper per turn” architectures get expensive fast. In similar setups, the big win for us was keeping a single long lived worker process and treating tool calls as messages, then layering a small state cache (repo index, dependency graph, last N files) on top so you are not rebuilding context every time.

Curious if you tried moving the subprocess boundary to a job queue so the wrapper can reuse the same process pool (and per process warm state) across turns? Also, measuring tokens per tool call versus per full turn ended up being the metric that actually drove fixes.

Jaehoon Jung • Edited

Great points! I'm actually already running 3 persistent agents (conductor, developer, critic) in a process pool — so the "long-lived worker" pattern is partially in place. The rest (writer, analyst, sysadmin) are on-demand.

For the "tokens per tool call" metric — I currently track tool_duration_ms and prompt_latency_ms, but not token consumption per call. That's a blind spot. I'll add prompt_tokens_in/prompt_tokens_out to the MetricsStore so I can measure context cost growth over turns.

Thanks for the concrete suggestions, really helpful framing!

signalstack

This hits a real problem that doesn't get talked about enough. The MCP tool description overhead is particularly nasty in multi-agent setups — you might have 20+ tools registered for the full system, but any given subagent only needs 3-4 of them for its specific task. Loading all the descriptions anyway burns context budget before the first useful token gets written.

The git boundary trick for blocking upward CLAUDE.md traversal is clever. I've hit the same issue from a different angle — a CLAUDE.md that made sense for interactive dev work was completely wrong context for a tightly scoped automation task. Separating the working directory solves both problems at once.

One thing I'd add: if you're connecting to a real MCP server rather than using Claude's built-in tools, selectively filtering which tools you expose to each subprocess can make another significant dent in overhead. Less a CLI trick, more an MCP server design choice — expose exactly what the task needs, nothing more. Combined with your 4-layer isolation this gets you to a point where each subprocess starts genuinely lean.

Jaehoon Jung

Great point — this is the next layer of optimization I haven't tackled yet.

Right now MAMA uses a tier-based permission system: Tier 1 agents get full tool access, Tier 2/3 are restricted to read-only tools. But the restriction is soft — enforced via prompt instructions, while the full tool descriptions from each MCP server still get injected into every agent's system prompt. So even an agent that only needs search and save from the MAMA MCP server still pays the token cost of every tool it exposes.

Your suggestion to filter at the MCP server level makes sense — if a single MCP server exposes 20 tools, each subagent should only see the 3-4 it actually needs. I'm planning to add an allowed_tools field per agent config so the system prompt only includes relevant tool descriptions. This would:

  1. Cut token waste further (tool descriptions alone can be thousands of tokens)
  2. Reduce hallucinated tool calls — if the model doesn't see a tool definition, it won't try to use it
  3. Complement the 4-layer isolation from this article — that prevents repeated injection, while this prevents unnecessary injection

Clean distinction: "don't re-inject what's already loaded" vs. "don't inject what's not needed in the first place." Thanks for surfacing this.
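The planned allowed_tools filter reduces to a simple allowlist over the tool schemas before they reach the system prompt — a minimal sketch (field names hypothetical, matching the plan described above):

```python
def filter_tools(all_tools: list[dict], allowed: set[str]) -> list[dict]:
    # Per-agent allowlist (the planned allowed_tools field): only the
    # schemas a subagent actually needs reach its system prompt, so
    # unused tool descriptions never cost tokens -- and the model
    # can't hallucinate a call to a tool it never saw.
    return [t for t in all_tools if t["name"] in allowed]

# Illustrative: an MCP server exposing three tools, a subagent needing two.
tools = [{"name": "search", "description": "..."},
         {"name": "save", "description": "..."},
         {"name": "deploy", "description": "..."}]
reviewer_tools = filter_tools(tools, {"search", "save"})
```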