DEV Community

Ayyaz Zafar

Posted on • Originally published at ayyaztech.com

Claude's Context Compaction API: Infinite Conversations with One Parameter


The Problem: Token Limits Kill Long-Running Agents

If you've built anything with an LLM API — a chatbot, a coding agent, an n8n workflow, anything with a loop — you've hit this wall. Your agent is 45 minutes into a complex task, 100,000+ tokens deep. Every single turn, you're re-sending the entire conversation history.

The token count doesn't just grow — it compounds. Then one of two things happens:

  • Hard crash: You hit the 200K wall and the API returns an error. Your agent dies mid-task.
  • Silent degradation: The model loses track of decisions it made 30 turns ago and starts contradicting itself.

The workaround most developers use:

// The old hack: keep only the last 10 messages, discard everything else
messages = messages.slice(-10);

You're throwing away critical decisions, tool call results, user preferences — all gone. There's no intelligence in this truncation.
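
To make the loss concrete, here is a minimal sketch (the history contents are illustrative) of what blind truncation throws away:

```javascript
// Illustrative only: a history where an early turn contains a key decision,
// followed by a run of routine turns.
const history = [
  { role: "user", content: "Use PostgreSQL, not MySQL, for this project." }, // turn 1: key decision
  ...Array.from({ length: 20 }, (_, i) => ({ role: "user", content: `routine turn ${i}` })),
  { role: "user", content: "Write the database migration." },
];

// The old hack keeps only the last 10 messages...
const truncated = history.slice(-10);

// ...so the PostgreSQL decision is gone, and the model is free to pick MySQL.
const decisionSurvived = truncated.some((m) => m.content.includes("PostgreSQL"));
console.log(decisionSurvived); // false
```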

How Context Compaction Works

Context Compaction is a server-side API that handles this elegantly:

  1. You set a token threshold (e.g., 100,000 tokens)
  2. When your conversation approaches that limit, the API itself generates a structured summary
  3. It drops all raw messages before that summary
  4. It continues seamlessly from the compacted context

This cycles — when it hits the threshold again, it compacts again. Anthropic calls this "effectively infinite conversations."
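
The cycle can be approximated client-side to build intuition. This is a rough sketch with a stand-in summarizer and a crude token estimate; in the real API the summarization runs server-side on Claude and none of this code is yours to write:

```javascript
// Rough client-side approximation of the server-side compaction cycle.
const THRESHOLD = 100_000; // trigger threshold in tokens

// Crude estimate: roughly 4 characters per token.
const estimateTokens = (messages) =>
  Math.ceil(messages.reduce((n, m) => n + m.content.length, 0) / 4);

function maybeCompact(messages, summarize) {
  // 1-2. Under the threshold: pass the history through untouched.
  if (estimateTokens(messages) < THRESHOLD) return messages;
  // 3. Generate a structured summary of everything so far...
  const summary = summarize(messages);
  // 4. ...drop the raw messages and continue from the compacted context.
  return [{ role: "user", content: `[Summary of earlier conversation]\n${summary}` }];
}
```

Each pass through `maybeCompact` mirrors one trigger-summarize-drop cycle; the real API repeats it every time the threshold is approached again.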

You're not building a RAG pipeline. You're not setting up a vector database. You configure a trigger threshold and the API handles the rest.

It works across the Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry. The compaction itself adds no extra charge, and because every subsequent turn re-sends a smaller context, you actually save money on input tokens.

Why This Works Now

Three things converged:

1. Model Quality

Opus 4.6 scores 76% on the MRC-R v2 benchmark (needle-in-a-haystack retrieval across 1M tokens). The previous best was 18.5%, roughly a 4x improvement. The summaries are actually high quality now.

2. Runtime Stability

Between February 10 and 19, Anthropic fixed unbounded WeakMap memory growth, an O(n²) message-accumulation bug, and multiple memory leaks. The runtime no longer crashes under sustained load.

3. No Competition

OpenAI has no equivalent. Google Gemini gives you 1M tokens but no automatic summarization when you exceed it. This is Claude-only.

The Benchmark

| Metric | Without Compaction | With Compaction |
| --- | --- | --- |
| Input tokens | 200,400 | 82,000 |
| Token reduction | | 58.6% |
| Max conversation | ~200K tokens | 10 million+ tokens |

Implementation: Three Levels

Level 1: Basic (30 seconds)

const response = await anthropic.messages.create({
  model: "claude-opus-4-6-20260205",
  max_tokens: 8096,
  betas: ["compact-2026-12"],
  context_management: {
    enabled_tools: [{
      type: "compact",
      trigger: {
        type: "input_tokens",
        threshold: 100000
      }
    }]
  },
  messages: conversationHistory
});
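
Wiring this into a chat loop is just appending turns; the server handles compaction transparently. Here is a minimal wrapper around the call above (the client is injected so the sketch stays self-contained; any object with a `messages.create` method works):

```javascript
// Minimal chat wrapper around the Level 1 call. `client` is any object
// exposing messages.create(params), e.g. an Anthropic SDK instance.
class CompactingChat {
  constructor(client, model = "claude-opus-4-6-20260205") {
    this.client = client;
    this.model = model;
    this.history = [];
  }

  async send(userText) {
    this.history.push({ role: "user", content: userText });
    const response = await this.client.messages.create({
      model: this.model,
      max_tokens: 8096,
      betas: ["compact-2026-12"],
      context_management: {
        enabled_tools: [{
          type: "compact",
          trigger: { type: "input_tokens", threshold: 100000 },
        }],
      },
      messages: this.history, // the server compacts; the client just appends
    });
    const reply = response.content[0].text;
    this.history.push({ role: "assistant", content: reply });
    return reply;
  }
}
```

Note there is no truncation logic anywhere in the loop: the client's only job is to keep appending turns.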

Level 2: Long-Running Agent

context_management: {
  enabled_tools: [{
    type: "compact",
    trigger: {
      type: "input_tokens",
      threshold: 100000
    },
    pause_after_compaction: true
  }]
}
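
With the pause enabled, your agent loop needs a checkpoint hook. The exact response shape for a paused compaction is beta and may differ; the `"compaction_pause"` stop reason and both callbacks below are hypothetical names used purely for illustration:

```javascript
// Hypothetical: "compaction_pause" and the onCompaction/onText hooks are
// illustrative names, not confirmed API surface.
async function runAgentTurn(response, { onCompaction, onText }) {
  if (response.stop_reason === "compaction_pause") {
    // Checkpoint: persist agent state, log the summary, or audit it
    // before resuming the long-running task.
    await onCompaction(response);
    return { paused: true };
  }
  onText(response.content[0].text);
  return { paused: false };
}
```

The point of the pause is exactly this seam: a place to snapshot external state so a multi-hour agent can recover if anything goes wrong after compaction.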

Level 3: Custom Instructions

context_management: {
  enabled_tools: [{
    type: "compact",
    trigger: {
      type: "input_tokens",
      threshold: 100000
    },
    instruction: "Focus on preserving code snippets, variable names, and technical decisions."
  }]
}
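
Since all three levels vary the same object, it can be convenient to build the `context_management` block from one helper. This is a convenience sketch, not part of any SDK:

```javascript
// Convenience helper (not part of the SDK): builds the context_management
// block used in Levels 1-3 from a small options object.
function buildCompactionConfig({ threshold = 100000, pauseAfter = false, instruction } = {}) {
  const tool = {
    type: "compact",
    trigger: { type: "input_tokens", threshold },
  };
  if (pauseAfter) tool.pause_after_compaction = true; // Level 2
  if (instruction) tool.instruction = instruction;    // Level 3
  return { enabled_tools: [tool] };
}
```

For example, `buildCompactionConfig({ instruction: "Preserve code snippets" })` yields the Level 3 shape, while calling it with no arguments yields Level 1.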

Key Facts

  • Server-side API — no client-side logic needed
  • 58.6% token reduction in benchmarks
  • Conversations up to 10 million tokens
  • Works across Claude API, Bedrock, Vertex, Foundry
  • ZDR (Zero Data Retention) eligible for enterprise
  • One parameter change
  • Beta header: compact-2026-12

