Harrison Guo

Posted on • Originally published at harrisonsec.com

Claude Code Deep Dive Part 3: The 5-Level Compression Pipeline Behind 200K Tokens

This is Part 3 of our Claude Code Architecture Deep Dive series. Part 1: 5 Hidden Features | Part 2: The 1,421-Line While Loop | Part 4: Memory Tradeoffs

Why Context Engineering Is the Real Moat

Every AI agent has the same fundamental constraint: a fixed-size context window. Claude's is up to 200K tokens. That sounds like a lot — until you realize a real coding session easily generates 10x that. Dozens of file reads, hundreds of tool calls, thousands of lines of output.

The model's decision quality depends entirely on what it sees. Get the keep-versus-drop tradeoff wrong, and it forgets which files it just edited, re-reads content it already saw, or contradicts its own earlier decisions.

Think of the context window as an office desk. Limited surface area. You need the most important documents within arm's reach, everything else filed in drawers — retrievable, but not cluttering your workspace.

Claude Code's context engineering is that filing system. And it's far more sophisticated than most people expect. In Part 2, we covered the 4-stage compression overview as part of the loop's survival mechanism. Here, we zoom into the internal engineering — revealing a 5th level most sessions never trigger, a dual-path algorithm that adapts to cache state, and a security blind spot in the summarizer.

The compression pipeline alone lives in src/services/compact/ — over 3,960 lines of TypeScript across 5 files.

The 5-Level Compression Pipeline

The design philosophy is progressive compression: cheapest first, heaviest last. Each level is more expensive than the previous one — consuming more compute or discarding more context detail.


Most conversations never reach Level 5. That's the point.

Level 1 — Tool Result Budget (Zero Cost)

Problem: A single FileReadTool call on a 10,000-line file dumps the entire thing into context. A BashTool running find returns thousands of paths.

Solution: When a tool result exceeds 50,000 characters (DEFAULT_MAX_RESULT_SIZE_CHARS), Claude Code doesn't truncate it — it persists the full output to disk and keeps only a 2KB preview in context:

<persisted-output>
Output too large (2.3 MB). Full output saved to:
/tmp/.claude/session-xxx/tool-results/toolu_abc123.txt

Preview (first 2.0 KB):
[first 2000 bytes of content]
...
</persisted-output>

Why persist instead of truncate? Truncation means permanent loss. If the model later needs line 500 of that output — maybe that's where the bug is — it can use the Read tool to access the full file from disk. The 2KB preview gives enough context to decide whether that's necessary.
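The pattern is straightforward to replicate. Here is a minimal sketch — the 50,000-character threshold and 2 KB preview come from the article, but the function name `handleToolResult` and the temp-directory layout are illustrative, not Claude Code's actual API:

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000;
const PREVIEW_BYTES = 2_000;

// Sketch: persist oversized tool output to disk, keep only a preview
// in context. Small results pass through untouched.
function handleToolResult(toolUseId: string, output: string): string {
  if (output.length <= DEFAULT_MAX_RESULT_SIZE_CHARS) return output;

  // Persist the full output so the model can Read it later if needed.
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "tool-results-"));
  const file = path.join(dir, `${toolUseId}.txt`);
  fs.writeFileSync(file, output);

  const sizeMb = (output.length / 1_048_576).toFixed(1);
  return [
    "<persisted-output>",
    `Output too large (${sizeMb} MB). Full output saved to:`,
    file,
    "",
    `Preview (first 2.0 KB):`,
    output.slice(0, PREVIEW_BYTES),
    "...",
    "</persisted-output>",
  ].join("\n");
}
```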

Level 2 — History Snip

Think of History Snip as garbage collection for stale conversation scaffolding. If the session contains repetitive assistant wrappers, redundant bookkeeping, or older spans that no longer affect the next decision, this layer can cut them before heavier compression starts.

Its real importance is accounting correctness. It feeds snipTokensFreed into the autocompact threshold calculation. Without that correction, the last assistant message's usage data still reflects the pre-snip context size, so autocompact can fire even after tokens were already freed.
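A toy sketch of that correction — the 200K limit and ~87% autocompact trigger appear elsewhere in this article, while `shouldAutocompact` and its parameter names are illustrative:

```typescript
const CONTEXT_LIMIT = 200_000;
const AUTOCOMPACT_THRESHOLD = 0.87; // ~87%, per the article

// The last assistant message's usage still reflects the pre-snip size,
// so subtract what History Snip already freed before comparing.
function shouldAutocompact(lastReportedTokens: number, snipTokensFreed: number): boolean {
  const effectiveTokens = lastReportedTokens - snipTokensFreed;
  return effectiveTokens / CONTEXT_LIMIT >= AUTOCOMPACT_THRESHOLD;
}

// A session at 180K reported tokens that just freed 15K:
// the naive check fires, the corrected check does not.
shouldAutocompact(180_000, 0);      // true
shouldAutocompact(180_000, 15_000); // false
```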

Level 3 — Microcompact (The Dual-Path Design)

This is where it gets clever. Microcompact cleans up old tool results that are no longer useful — that file you read 30 minutes ago is probably irrelevant now, but it's still eating thousands of tokens.

The twist: Microcompact has two completely different code paths, selected based on cache state.

Path A — Cache Cold (Time-Based)

When the user was away long enough for the prompt cache to expire (default 5-minute TTL), the cache is already dead. Rebuilding is inevitable. So Microcompact goes ahead and directly modifies message content:

// microCompact.ts — cold path
return { ...block, content: '[Old tool result content cleared]' }

Simple, brutal, effective. Keep only the N most recent compactable tool results, replace everything else with a placeholder.
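A sketch of that sweep, with an illustrative type and keep-count:

```typescript
interface ToolResultBlock { toolUseId: string; content: string; }

const KEEP_RECENT = 3; // illustrative; the real keep-count is an internal threshold

// Cold path: keep the N most recent compactable tool results,
// replace everything older with a placeholder.
function coldPathCompact(blocks: ToolResultBlock[]): ToolResultBlock[] {
  const cutoff = blocks.length - KEEP_RECENT;
  return blocks.map((block, i) =>
    i < cutoff
      ? { ...block, content: "[Old tool result content cleared]" }
      : block
  );
}
```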

Path B — Cache Hot (Cache-Editing)

When the user is actively chatting and the prompt cache is warm — holding 100K+ tokens of cached prefix — directly modifying messages would invalidate the entire cache. That's a massive cost hit.

Instead, the hot path uses an API-level mechanism called cache_edits:

  1. Tag tool result blocks with cache_reference: tool_use_id
  2. Construct cache_edits blocks telling the server to delete those references in-place
  3. Server-side deletion preserves cache warmth — no client re-upload needed

The messages themselves are returned unchanged. The edit happens at the API layer, invisible to the local conversation state.
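Conceptually, the hot path reduces to building delete instructions instead of mutating messages. The article names the mechanism (cache_edits, cache_reference) but not its wire schema, so the shapes below are illustrative guesses:

```typescript
interface CacheEdit { type: "delete"; cache_reference: string; }

// One server-side delete instruction per stale tool result. The server
// removes the referenced blocks in place, keeping the cached prefix warm;
// the local message array is never touched.
function buildCacheEdits(staleToolUseIds: string[]): CacheEdit[] {
  return staleToolUseIds.map((id) => ({ type: "delete", cache_reference: id }));
}

// These edits piggyback on the next API request — zero extra calls.
const edits = buildCacheEdits(["toolu_old_1", "toolu_old_2"]);
```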

               Time-Based (Cold)             Cache-Edit (Hot)
Trigger        Time gap exceeds threshold    Tool count exceeds threshold
Operation      Direct message modification   cache_edits API blocks
Cache Impact   Cache rebuilds anyway         Preserves 100K+ cached prefix
API Calls      Zero                          Zero (edits piggyback on next request)

The two paths are mutually exclusive. Time-based takes priority — if the cache is already cold, using cache_edits is pointless.

Level 4 — Context Collapse (Non-Destructive)

Think of this as a database View — the underlying table (message array) stays unchanged, but queries (API requests) see a filtered, summarized projection.

Context Collapse triggers at ~90% utilization. Unlike autocompact, it's reversible — original messages are never deleted, and the collapse can be rolled back if needed. The summaries live in a separate collapse store, and projectView() overlays them onto the original messages at query time.

Critical interaction: when Context Collapse is active, Autocompact is suppressed. Both compete for the same token space — autocompact at ~87%, collapse at ~90% — and autocompact would destroy the fine-grained context that collapse is trying to preserve.
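A minimal sketch of the view-style projection — only projectView is named in the article; the message and store shapes are illustrative:

```typescript
interface Message { id: string; content: string; }

// Originals stay intact in the message array; summaries live in a
// separate collapse store; the projection is computed at query time.
function projectView(messages: Message[], collapseStore: Map<string, string>): Message[] {
  return messages.map((m) => {
    const summary = collapseStore.get(m.id);
    return summary ? { ...m, content: summary } : m;
  });
}

// Rolling back is just clearing the store — nothing was ever deleted.
```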

Level 5 — Autocompact (The Last Resort)

When everything else fails to keep tokens under control, the system forks a child agent to summarize the entire conversation. This is expensive and irreversible.

The compression prompt uses a two-phase Chain-of-Thought Scratchpad technique:

  1. <analysis> block — the model walks through every message chronologically: user intent, approaches taken, key decisions, filenames, code snippets, errors, fixes
  2. <summary> block — a structured summary with 9 standardized sections (Primary Request, Key Technical Concepts, Files and Code, Errors and Fixes, Problem Solving, All User Messages, Pending Tasks, Current Work, Optional Next Step)

The critical design: formatCompactSummary() strips the <analysis> block and keeps only the <summary>. Chain-of-thought reasoning improves summary quality dramatically, but the reasoning itself would waste tokens if kept in context. Discard the work, keep the conclusion.
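A sketch of that stripping step — formatCompactSummary is named in the article, but the regex-based body here is illustrative:

```typescript
// "Discard the work, keep the conclusion": drop the <analysis>
// scratchpad, keep only the <summary> block.
function formatCompactSummary(raw: string): string {
  const match = raw.match(/<summary>([\s\S]*?)<\/summary>/);
  if (match) return match[1].trim();
  // Fallback if tags are malformed: at least remove the analysis block.
  return raw.replace(/<analysis>[\s\S]*?<\/analysis>/, "").trim();
}
```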

Post-Compression Recovery

Autocompact's biggest risk: the model "forgets" files it just edited. The system automatically runs runPostCompactCleanup():

  • Restore last 5 recently-read files (≤5K tokens each)
  • Restore all activated skills (≤25K tokens total)
  • Re-announce deferred tools, agent lists, MCP directives
  • Reset Context Collapse state
  • Restore Plan mode state if active

Without this recovery step, the model would start re-reading files it just edited — or worse, make contradictory changes.
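The file-restoration step from the list above might look like this sketch — the figures (5 files, ≤5K tokens each) come from the article; the function and types are illustrative:

```typescript
interface ReadFile { path: string; tokens: number; }

// Re-inject the most recently read files after autocompact, skipping
// any file that would blow the per-file token budget.
function filesToRestore(recentReads: ReadFile[], maxFiles = 5, maxTokensEach = 5_000): ReadFile[] {
  return recentReads
    .filter((f) => f.tokens <= maxTokensEach)
    .slice(-maxFiles); // most recent reads are at the end of the list
}
```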

The Circuit Breaker Story

On March 10, 2026, Anthropic's telemetry showed 1,279 sessions with 50+ consecutive autocompact failures. The worst session hit 3,272 consecutive failures. Globally, this wasted approximately 250,000 API calls per day.

In Part 2, we mentioned the circuit breaker as a single boolean (hasAttemptedReactiveCompact). Here's the production story behind it.

The fix was three lines:

const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3

After 3 consecutive failures, stop trying. The context is irrecoverably over-limit — burning more API calls won't help. This is a textbook circuit breaker: detect a failure loop, break it early, fail gracefully.
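As a pattern, the breaker is a few lines of state. The limit of 3 comes from the article; the class structure is an illustrative sketch, not Claude Code's implementation:

```typescript
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

class AutocompactBreaker {
  private failures = 0;

  // Stop attempting once the failure loop is detected.
  canAttempt(): boolean {
    return this.failures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
  }

  // Any success resets the counter; consecutive failures accumulate.
  recordResult(success: boolean): void {
    this.failures = success ? 0 : this.failures + 1;
  }
}
```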

Three adjacent systems make this pipeline viable in production: accurate token estimation, prompt-cache boundaries, and the summarizer's security assumptions.

Token Estimation Without API Calls

Most agents estimate context size by counting tokens on the client. This typically has 30%+ error — enough to trigger compression too early or too late.

Claude Code uses a smarter approach. Think of it as a morning weigh-in: you step on the scale at 75kg, then eat lunch. You don't need the scale again — estimating 75.5kg is good enough.

The "scale" is the usage data returned by every API response — server-side precise token counts. The "lunch" is the few messages added since then.

function tokenCountWithEstimation(messages: { usage?: { totalTokens: number }; content: string }[]): number {
  // Anchor on the most recent message with server-reported usage
  let anchor = messages.length - 1
  while (anchor >= 0 && !messages[anchor].usage) anchor--
  const base = anchor >= 0 ? messages[anchor].usage!.totalTokens : 0
  // Estimate only the delta (~4 chars/token — a rough, illustrative heuristic)
  const delta = messages.slice(anchor + 1).reduce((s, m) => s + Math.ceil(m.content.length / 4), 0)
  return base + delta // result: <5% error vs 30%+ from pure client estimation
}

This eliminates the need for tokenizer API calls while maintaining accuracy that's good enough for compression timing decisions.

The Prompt Cache Architecture

Claude Code's system prompt can be 50-100K tokens. Without caching, every API call would re-process this from scratch.

The key innovation: SYSTEM_PROMPT_DYNAMIC_BOUNDARY — a sentinel string that splits the system prompt into static and dynamic halves.

  • Before the boundary: core instructions, tool descriptions, security rules — identical for ALL users globally → cached with scope: 'global'
  • After the boundary: MCP tool instructions, output preferences, language settings — varies per user → not cached globally

This means millions of Claude Code users share the same cached system prompt prefix. One cache hit saves compute for everyone. But change one byte before the boundary, and the global cache breaks for all users.

To protect this, Claude Code implements sticky-on latching for beta headers: once a header is sent in a session, it persists for all subsequent requests — even if the feature flag is turned off mid-session. Flexibility sacrificed for cache stability.
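Sticky-on latching is a small amount of session state. An illustrative sketch (the class and method names are not from the source):

```typescript
// Once a beta header has been sent in a session, keep sending it —
// even if its feature flag flips off mid-session. A changed header set
// would shift the request prefix and break the prompt cache.
class BetaHeaderLatch {
  private sent = new Set<string>();

  headersFor(enabledFlags: string[]): string[] {
    enabledFlags.forEach((f) => this.sent.add(f));
    return [...this.sent].sort();
  }
}
```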

The Security Blind Spot

Here's something the compression pipeline gets wrong: it treats all content equally.

The autocompact summarizer processes user instructions and tool results through the same pipeline. If an attacker plants malicious instructions inside a project file — and the model reads that file — those instructions survive compression. They become part of the summary, indistinguishable from legitimate context.

The <analysis> scratchpad that makes summaries so good also faithfully preserves injected instructions. There's no classification step that distinguishes "user said this" from "this was in a file the model read."

Additionally, truncateHeadForPTLRetry() reveals another edge: when the conversation is so long that the compression request itself triggers a Prompt-Too-Long error, the system recursively drops the oldest turns to make the compression fit. An attacker could craft inputs that survive this truncation — instructions placed strategically in the middle of conversations, not at the edges.

Three Designs Worth Stealing

If you're building your own agent, these patterns transfer directly:

  1. Progressive compression (cheapest first) — Don't jump to expensive summarization. Try zero-cost approaches first. Most sessions will never need the heavy option.

  2. Cache-aware dual paths — Let infrastructure state drive algorithm selection. When cache is cold, optimize for simplicity. When cache is hot, optimize for preservation. Same goal, different strategies.

  3. Circuit breakers on automated recovery — Never let a fix become a new failure mode. If compression fails 3 times, it will fail a 4th time. Stop. The ~250,000 API calls wasted per day before this fix landed are a cautionary tale for any self-healing system.


Next: Part 4: Memory — First-Principles Tradeoffs in Agent Persistence — why Anthropic chose Markdown files over vector databases, and when that's the wrong call.

Previous: Part 2: The 1,421-Line While Loop | Part 1: 5 Hidden Features
