The Problem: Token Limits Kill Long-Running Agents
If you've built anything with an LLM API — a chatbot, a coding agent, an n8n workflow, anything with a loop — you've hit this wall. Your agent is 45 minutes into a complex task, 100,000+ tokens deep. Every single turn, you're re-sending the entire conversation history.
The token count doesn't just grow — it compounds. Then one of two things happens:
- Hard crash: You hit the 200K wall and the API returns an error. Your agent dies mid-task.
- Silent degradation: The model loses track of decisions it made 30 turns ago and starts contradicting itself.
The workaround most developers use:
```javascript
// The old hack
messages = messages.slice(-10);
```
You're throwing away critical decisions, tool call results, user preferences — all gone. There's no intelligence in this truncation.
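To make the loss concrete, here is a minimal runnable sketch. The message objects are illustrative; only the `slice(-10)` behavior matches the hack above:

```javascript
// Demo of what naive truncation discards. The "decision" message and the
// twelve filler turns are made up for illustration.
const history = [
  { role: "user", content: "Use PostgreSQL, not MySQL" }, // a key decision
  ...Array.from({ length: 12 }, (_, i) => ({
    role: "assistant",
    content: `step ${i}`,
  })),
];

let messages = history.slice(-10);

// The decision from turn 1 is gone: the model can no longer see it.
const hasDecision = messages.some((m) => m.content.includes("PostgreSQL"));
console.log(hasDecision); // false
```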
How Context Compaction Works
Context Compaction is a server-side API that handles this elegantly:
- You set a token threshold (e.g., 100,000 tokens)
- When your conversation approaches that limit, the API itself generates a structured summary
- It drops all raw messages before that summary
- It continues seamlessly from the compacted context
This cycles — when it hits the threshold again, it compacts again. Anthropic calls this "effectively infinite conversations."
You're not building a RAG pipeline. You're not setting up a vector database. You configure a trigger threshold and the API handles the rest.
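The cycle described above can be simulated client-side to see the shape of the behavior. Everything here is an illustrative stand-in: the 4-characters-per-token heuristic and the summary placeholder are assumptions, since the real API generates the summary server-side:

```javascript
// Client-side simulation of the compaction cycle. The token estimate and
// the summary text are stand-ins for what the server actually produces.
function estimateTokens(messages) {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4); // rough heuristic, not the real tokenizer
}

function compactIfNeeded(messages, threshold) {
  if (estimateTokens(messages) < threshold) return messages;
  // Summarize everything so far, then drop the raw messages it replaces.
  const summary = {
    role: "user",
    content: `[summary of ${messages.length} earlier messages]`,
  };
  return [summary];
}
```

Below the threshold nothing changes; above it, the history collapses to a single summary message and the conversation continues from there until the threshold is hit again.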
It works across the Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry. The compaction itself adds no extra charge — you actually save money.
Why This Works Now
Three things converged:
1. Model Quality
Opus 4.6 scores 76% on MRC-R v2 benchmark (needle-in-haystack across 1M tokens). Previous best was 18.5% — a 4x improvement. The summaries are actually high quality now.
2. Runtime Stability
Between Feb 10-19, Anthropic fixed unbounded WeakMap memory growth, O(n²) message accumulation bugs, and multiple memory leaks. The runtime no longer crashes under sustained load.
3. No Competition
OpenAI has no equivalent. Google Gemini gives you 1M tokens but no automatic summarization when you exceed it. This is Claude-only.
The Benchmark
| Metric | Without Compaction | With Compaction |
|---|---|---|
| Input tokens | 200,400 | 82,000 |
| Token reduction | — | 58.6% |
| Max conversation length | ~200K tokens | 10 million+ tokens |
Implementation: Three Levels
Level 1: Basic (30 seconds)
```javascript
const response = await anthropic.messages.create({
  model: "claude-opus-4-6-20260205",
  max_tokens: 8096,
  betas: ["compact-2026-12"],
  context_management: {
    enabled_tools: [{
      type: "compact",
      trigger: {
        type: "input_tokens",
        threshold: 100000
      }
    }]
  },
  messages: conversationHistory
});
```
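Since every request needs the same configuration, a small wrapper keeps call sites clean. The parameter names below mirror the snippet above; `withCompaction` itself is a hypothetical helper, not part of the SDK:

```javascript
// Hypothetical helper: merges the compaction settings shown above into an
// existing request-params object, so call sites stay unchanged.
function withCompaction(params, threshold = 100000) {
  return {
    ...params,
    betas: [...(params.betas ?? []), "compact-2026-12"],
    context_management: {
      enabled_tools: [
        { type: "compact", trigger: { type: "input_tokens", threshold } },
      ],
    },
  };
}

// Usage (sketch):
// const response = await anthropic.messages.create(
//   withCompaction({ model: "claude-opus-4-6-20260205", max_tokens: 8096, messages: conversationHistory })
// );
```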
Level 2: Long-Running Agent
```javascript
context_management: {
  enabled_tools: [{
    type: "compact",
    trigger: {
      type: "input_tokens",
      threshold: 100000
    },
    pause_after_compaction: true
  }]
}
```
Level 3: Custom Instructions
```javascript
context_management: {
  enabled_tools: [{
    type: "compact",
    trigger: {
      type: "input_tokens",
      threshold: 100000
    },
    instruction: "Focus on preserving code snippets, variable names, and technical decisions."
  }]
}
```
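One judgment call all three levels share is the threshold value itself. A hypothetical rule of thumb, assuming a 200K context window and a 50% headroom factor (both are assumptions on my part, not API guidance):

```javascript
// Hypothetical rule of thumb: stay well under the context window, after
// reserving room for the next turn's output.
function pickThreshold(contextWindow, maxOutputTokens, fraction = 0.5) {
  return Math.floor((contextWindow - maxOutputTokens) * fraction);
}

// With a 200K window and 8,096 output tokens, fraction 0.5 gives 95,952,
// close to the 100,000-token example used throughout this article.
```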
Key Facts
- Server-side API — no client-side logic needed
- 58.6% token reduction in benchmarks
- Conversations up to 10 million tokens
- Works across Claude API, Bedrock, Vertex, Foundry
- ZDR eligible for enterprise
- One parameter change
- Beta header: `compact-2026-12`
Related
- Stop Paying for Opus: Claude Sonnet 4.6 Changes Everything
- Claude Code Tutorial for Beginners (2026)
Originally published at ayyaztech.com