DEV Community

Charlie Li

How Claude Code Manages 200K Tokens Without Losing Its Mind

Every AI agent builder eventually hits the same wall: your agent works great for 5 minutes, then starts forgetting things, hallucinating, or just... losing the plot.

I reverse-engineered Claude Code's source code (v2.1.88, recovered via source maps) and spent weeks documenting how it actually handles this. The context management system is far more sophisticated than I expected from a CLI tool.

Here are the patterns that actually matter.

The Core Problem

Claude Sonnet has a 200K token context window. Sounds huge, right? Here's reality:

  • System prompt: 5-15K tokens
  • Project config + memory: 5-20K tokens
  • Each conversation turn: 1-10K tokens

After just 50-100 turns of an active coding session, you're hitting the wall. Without active management, your agent either crashes or loses critical context.

Claude Code solves this with a gradient compaction system — three strategies at different granularities, triggered at different thresholds.

Pattern 1: Static/Dynamic Prompt Partitioning

The Claude API has a Prompt Cache: if two requests share the same system prompt prefix, the second request reuses the cached prefix instead of reprocessing it. But the cache is strict: a single character difference means a cache miss.

The problem: system prompts need dynamic info (git status, tool listings, session state), but dynamic info changes constantly, breaking the cache.

Claude Code's solution — partition the system prompt:

┌──────────────────────────────────────┐
│      Static Zone (Global Cache)      │
│                                      │
│  Identity intro                      │
│  Tool usage rules                    │
│  Coding style guidelines             │
│  Operational caution principles      │
│  Tone and style                      │
├────────── DYNAMIC BOUNDARY ──────────┤
│     Dynamic Zone (Session Cache)     │
│                                      │
│  Memory content                      │
│  Environment info                    │
│  MCP tool descriptions               │
│  Session-specific guidance           │
└──────────────────────────────────────┘

Everything above the boundary is identical across all users and sessions — it gets global caching. Everything below is session-specific.

Takeaway for your agent: Separate your system prompt into stable and volatile sections. Put stable content first. This alone can cut your API costs significantly.
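A minimal sketch of the partitioning idea, assuming the Anthropic Messages API shape where `system` can be an array of text blocks and `cache_control` marks a cache breakpoint (the section contents and session fields here are invented, not from Claude Code's source):

```javascript
// Sketch only — section text and session fields are invented.
const STATIC_SECTIONS = [
  "You are a coding assistant.",        // identity intro
  "Follow the tool usage rules below.", // identical across all sessions
];

function dynamicSections(session) {
  return [
    `Working directory: ${session.cwd}`,
    `Git status: ${session.gitStatus}`,
  ];
}

function buildSystemPrompt(session) {
  return [
    // Static zone first: byte-identical across users and sessions,
    // so the provider's prefix cache can reuse it globally.
    {
      type: "text",
      text: STATIC_SECTIONS.join("\n\n"),
      cache_control: { type: "ephemeral" }, // breakpoint at the boundary
    },
    // Dynamic zone last: changes per session, so at best it gets a
    // session-scoped hit — and a miss here never invalidates the
    // static prefix above it.
    { type: "text", text: dynamicSections(session).join("\n") },
  ];
}
```

Ordering is the whole trick: because prompt caching is prefix-based, putting the volatile block last means its churn can never invalidate the cached static prefix.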

Pattern 2: The DANGEROUS_ Prefix Convention

This is a naming pattern, not a technical feature, but it's brilliant:

// Cache-safe section (computed once per session)
function systemPromptSection(name, compute) {
  return { name, compute, cacheBreak: false }
}

// Cache-breaking section (MUST provide a reason)
function DANGEROUS_uncachedSystemPromptSection(
  name, compute, _reason
) {
  return { name, compute, cacheBreak: true }
}

The _reason parameter isn't used at runtime. It exists purely for code review — forcing developers to justify why this section needs to break the cache.

Every time someone writes DANGEROUS_uncachedSystemPromptSection, it's a red flag in code review: "This will increase latency and cost. Is it worth it?"
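A hypothetical call site shows how the convention reads in practice (the helper is repeated from the snippet above so this runs standalone; the section name and reason string are invented):

```javascript
// Repeated from the snippet above so the example is self-contained.
function DANGEROUS_uncachedSystemPromptSection(name, compute, _reason) {
  return { name, compute, cacheBreak: true };
}

// Hypothetical call site: the prefix plus the mandatory reason make the
// cache-breaking cost impossible to miss in review.
const gitStatusSection = DANGEROUS_uncachedSystemPromptSection(
  "git-status",
  () => "git status: clean",
  "git state changes every turn; a stale status would mislead edits"
);
```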

Takeaway: In your agent codebase, make expensive operations visually obvious. Don't bury cost decisions in abstraction layers.

Pattern 3: Gradient Compaction (The Flood Control System)

When context fills up, Claude Code doesn't just truncate. It applies a five-stage compaction pipeline, with the stages applied in order, cheapest first:

Each loop iteration:

  ① Tool result trimming    (hard limit per result)
  ② Snip trimming           (discard oldest turns)
  ③ Micro-compaction        (clear old tool results)
  ④ Context folding         (collapse sections)
  ⑤ Auto-compaction         (AI-generated summary)
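The staged idea can be sketched as a cheapest-first loop. This is a toy model, not Claude Code's actual implementation: only the first two stages are stubbed, and the message shapes and limits are invented.

```javascript
// Toy model of gradient compaction. Messages carry a token count here;
// real code would estimate it.
const estimateTokens = (msgs) => msgs.reduce((n, m) => n + m.tokens, 0);

// ① Cap each tool result at a hard per-result limit.
const trimToolResults = (msgs) =>
  msgs.map((m) =>
    m.kind === "tool" ? { ...m, tokens: Math.min(m.tokens, 1000) } : m
  );

// ② Discard the oldest turn.
const snipOldestTurn = (msgs) => msgs.slice(1);

// Stages ③–⑤ would follow the same shape, each more aggressive.
const strategies = [trimToolResults, snipOldestTurn];

function compact(msgs, budget) {
  for (const strategy of strategies) {
    if (estimateTokens(msgs) <= budget) break; // cheapest fix wins
    msgs = strategy(msgs);
  }
  return msgs;
}
```

The ordering is the point: a session that only needs stage ① never pays for stage ⑤'s full API call.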

Micro-compaction: The lightest touch

Old tool results are like margin notes from 50 pages ago — you need to know "I annotated something" but not the exact contents. Micro-compaction replaces old tool outputs with a one-line placeholder:

[Result cleared — 4,230 tokens recovered]

But not all tools qualify. Only tools producing "large text output with short information shelf life" are eligible — file reads, search results, bash output. Tool results from recent turns are always preserved.
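A sketch of that eligibility rule, with invented tool names, a made-up recency window, and a simplified message shape (none of this is the source's actual code):

```javascript
// Only tools whose output goes stale fast are eligible for clearing.
const EPHEMERAL_TOOLS = new Set(["read_file", "search", "bash"]);
const KEEP_RECENT = 5; // never touch results from the last N messages

function microCompact(messages) {
  const cutoff = messages.length - KEEP_RECENT;
  return messages.map((m, i) => {
    const eligible =
      i < cutoff && m.kind === "tool_result" && EPHEMERAL_TOOLS.has(m.tool);
    if (!eligible) return m; // recent or non-ephemeral: keep as-is
    return {
      ...m,
      content: `[Result cleared — ${m.tokens} tokens recovered]`,
      tokens: 12, // the placeholder itself costs almost nothing
    };
  });
}
```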

Auto-compaction: The nuclear option

When micro-compaction isn't enough, Claude Code calls the AI itself to generate a summary of the conversation so far. The threshold:

Effective window: 200K - 20K (summary reserve) = 180K
Auto-compact at: 180K - 13K buffer = 167K tokens

The summary replaces older messages, keeping the most recent context intact. It's expensive (a full API call), but it keeps the session alive indefinitely.
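The same arithmetic as code (the constant and function names are invented; the numbers are the ones above):

```javascript
const CONTEXT_WINDOW = 200_000;
const SUMMARY_RESERVE = 20_000; // room kept free for the generated summary
const BUFFER = 13_000;          // safety margin below the effective window

// Effective window: 200K - 20K = 180K; trigger at 180K - 13K = 167K.
const AUTO_COMPACT_THRESHOLD = CONTEXT_WINDOW - SUMMARY_RESERVE - BUFFER;

const shouldAutoCompact = (estimatedTokens) =>
  estimatedTokens >= AUTO_COMPACT_THRESHOLD;
```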

Takeaway: Don't build a single compaction strategy. Build a gradient — cheap operations first, expensive ones as a last resort.

Pattern 4: Estimation Over Precision

Token counting is expensive. Counting every message precisely on every loop iteration would be a bottleneck.

Claude Code's trick: find the most recent assistant message that has API usage data (which includes exact token counts from the API), take that number, then add a rough estimate for subsequent messages.

// Fast: use API's exact count as anchor, estimate the rest
// Slow: count every message token precisely

This is much faster than exact counting and "close enough" for triggering compaction thresholds.
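A sketch of the anchor-plus-estimate approach. The ~4-characters-per-token heuristic and the message shape are my assumptions, not the source's actual code:

```javascript
// Walk backwards to the newest assistant message carrying exact API
// usage, then estimate everything after it at ~4 characters per token.
function estimateContextTokens(messages) {
  let anchorIdx = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "assistant" && messages[i].usage) {
      anchorIdx = i;
      break;
    }
  }
  const anchor =
    anchorIdx >= 0
      ? messages[anchorIdx].usage.input_tokens +
        messages[anchorIdx].usage.output_tokens
      : 0;
  // Rough estimate for messages after the anchor (all of them if none).
  const tail = messages
    .slice(anchorIdx + 1)
    .reduce((n, m) => n + Math.ceil(m.content.length / 4), 0);
  return anchor + tail;
}
```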

Takeaway: Not everything needs to be precise. If you're using a value to trigger a threshold, an estimate that's within 5% is functionally identical to exact counting.

The Bigger Picture

What impressed me most isn't any single pattern — it's the overall philosophy: graceful degradation over hard failures.

When context fills up, the agent doesn't crash. It doesn't just truncate from the top. It applies increasingly aggressive strategies, preserving the most recent and most relevant information at each stage.

This is the difference between a demo agent that works for 10 turns and a production agent that handles 200+ turn sessions.


These patterns come from my deep-dive into Claude Code's actual source code. I documented the complete architecture — from the core loop to tool execution, permissions, context management, multi-agent coordination, and the MCP protocol — in a book:

📘 Claude Code from the Inside Out (English) — $9.99
📕 深入浅出 Claude Code (Chinese edition) — $9.99

Based on v2.1.88 source code analysis. 12 chapters, real code, real architecture decisions.

If you're building AI agents and want to learn from a production system that handles millions of sessions, this is the most detailed documentation available.
