The Problem: Token Limits Kill Long-Running Agents
If you've built anything with an LLM API — a chatbot, a coding agent, an n8n workflow, anything with a loop — you've hit this wall. Your agent is 45 minutes into a complex task, 100,000+ tokens deep. Every single turn, you're re-sending the entire conversation history.
The token count doesn't just grow — it compounds. Then one of two things happens:
- Hard crash: You hit the 200K wall and the API returns an error. Your agent dies mid-task.
- Silent degradation: The model loses track of decisions it made 30 turns ago and starts contradicting itself.
The workaround most developers use:
```javascript
// The old hack
messages = messages.slice(-10);
```
You're throwing away critical decisions, tool call results, user preferences — all gone. There's no intelligence in this truncation.
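To make the loss concrete, here is a minimal runnable sketch. The message objects are illustrative; only the `slice(-10)` behavior matches the hack above:

```javascript
// Demo of what naive truncation discards. The "decision" message and the
// twelve filler turns are made up for illustration.
const history = [
  { role: "user", content: "Use PostgreSQL, not MySQL" }, // a key decision
  ...Array.from({ length: 12 }, (_, i) => ({
    role: "assistant",
    content: `step ${i}`,
  })),
];

let messages = history.slice(-10);

// The decision from turn 1 is gone: the model can no longer see it.
const hasDecision = messages.some((m) => m.content.includes("PostgreSQL"));
console.log(hasDecision); // false
```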
How Context Compaction Works
Context Compaction is a server-side API that handles this elegantly:
- You set a token threshold (e.g., 100,000 tokens)
- When your conversation approaches that limit, the API itself generates a structured summary
- It drops all raw messages before that summary
- It continues seamlessly from the compacted context
This cycles — when it hits the threshold again, it compacts again. Anthropic calls this "effectively infinite conversations."
You're not building a RAG pipeline. You're not setting up a vector database. You configure a trigger threshold and the API handles the rest.
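The cycle described above can be simulated client-side to see the shape of the behavior. Everything here is an illustrative stand-in: the 4-characters-per-token heuristic and the summary placeholder are assumptions, since the real API generates the summary server-side:

```javascript
// Client-side simulation of the compaction cycle. The token estimate and
// the summary text are stand-ins for what the server actually produces.
function estimateTokens(messages) {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4); // rough heuristic, not the real tokenizer
}

function compactIfNeeded(messages, threshold) {
  if (estimateTokens(messages) < threshold) return messages;
  // Summarize everything so far, then drop the raw messages it replaces.
  const summary = {
    role: "user",
    content: `[summary of ${messages.length} earlier messages]`,
  };
  return [summary];
}
```

Below the threshold nothing changes; above it, the history collapses to a single summary message and the conversation continues from there until the threshold is hit again.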
It works across the Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry. The compaction itself adds no extra charge — you actually save money.
Why This Works Now
Three things converged:
1. Model Quality
Opus 4.6 scores 76% on MRC-R v2 benchmark (needle-in-haystack across 1M tokens). Previous best was 18.5% — a 4x improvement. The summaries are actually high quality now.
2. Runtime Stability
Between Feb 10-19, Anthropic fixed unbounded WeakMap memory growth, O(n²) message accumulation bugs, and multiple memory leaks. The runtime no longer crashes under sustained load.
3. No Competition
OpenAI has no equivalent. Google Gemini gives you 1M tokens but no automatic summarization when you exceed it. This is Claude-only.
The Benchmark
| Metric | Without Compaction | With Compaction |
|---|---|---|
| Input tokens | 200,400 | 82,000 |
| Token reduction | — | 58.6% |
| Max conversation length | ~200K tokens | 10 million+ tokens |
Implementation: Three Levels
Level 1: Basic (30 seconds)
```javascript
const response = await anthropic.messages.create({
  model: "claude-opus-4-6-20260205",
  max_tokens: 8096,
  betas: ["compact-2026-12"],
  context_management: {
    enabled_tools: [{
      type: "compact",
      trigger: {
        type: "input_tokens",
        threshold: 100000
      }
    }]
  },
  messages: conversationHistory
});
```
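Since every request needs the same configuration, a small wrapper keeps call sites clean. The parameter names below mirror the snippet above; `withCompaction` itself is a hypothetical helper, not part of the SDK:

```javascript
// Hypothetical helper: merges the compaction settings shown above into an
// existing request-params object, so call sites stay unchanged.
function withCompaction(params, threshold = 100000) {
  return {
    ...params,
    betas: [...(params.betas ?? []), "compact-2026-12"],
    context_management: {
      enabled_tools: [
        { type: "compact", trigger: { type: "input_tokens", threshold } },
      ],
    },
  };
}

// Usage (sketch):
// const response = await anthropic.messages.create(
//   withCompaction({ model: "claude-opus-4-6-20260205", max_tokens: 8096, messages: conversationHistory })
// );
```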
Level 2: Long-Running Agent
```javascript
context_management: {
  enabled_tools: [{
    type: "compact",
    trigger: {
      type: "input_tokens",
      threshold: 100000
    },
    pause_after_compaction: true
  }]
}
```
Level 3: Custom Instructions
```javascript
context_management: {
  enabled_tools: [{
    type: "compact",
    trigger: {
      type: "input_tokens",
      threshold: 100000
    },
    instruction: "Focus on preserving code snippets, variable names, and technical decisions."
  }]
}
```
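One judgment call all three levels share is the threshold value itself. A hypothetical rule of thumb, assuming a 200K context window and a 50% headroom factor (both are assumptions on my part, not API guidance):

```javascript
// Hypothetical rule of thumb: stay well under the context window, after
// reserving room for the next turn's output.
function pickThreshold(contextWindow, maxOutputTokens, fraction = 0.5) {
  return Math.floor((contextWindow - maxOutputTokens) * fraction);
}

// With a 200K window and 8,096 output tokens, fraction 0.5 gives 95,952,
// close to the 100,000-token example used throughout this article.
```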
Key Facts
- Server-side API — no client-side logic needed
- 58.6% token reduction in benchmarks
- Conversations up to 10 million tokens
- Works across Claude API, Bedrock, Vertex, Foundry
- ZDR eligible for enterprise
- One parameter change
- Beta header: `compact-2026-12`
Related
- Stop Paying for Opus: Claude Sonnet 4.6 Changes Everything
- Claude Code Tutorial for Beginners (2026)
Originally published at ayyaztech.com