DEV Community

Vilva Athiban P B

Memory management in Claude Code: Context Pipeline

If you are building an AI agent that needs to survive longer sessions, context management becomes a systems problem very quickly.

At first, the implementation looks simple. You append messages, keep tool outputs around, maybe summarize occasionally, and keep going. But as sessions grow longer, the same problems start showing up: prompts get too large, tool outputs dominate the context, cache reuse drops, and retries make the situation worse.

That is why Claude Code’s memory design is interesting. The useful part is not one summarization trick. It is the layered control system around context.

This post covers the first half of that system:

  • how the model-facing prompt is built on each turn
  • why tool-result budgeting happens early
  • where microcompact fits in
  • how auto-compact is triggered safely

This is Part 1 of a two-part series. In Part 2 (coming soon), I will look at session memory, full compaction, invariant protection, cleanup, and bounded recovery.


Two-part series

Part 1: Context Pipeline

Part 2: Session Memory and Safe Compaction (Coming soon)

Why context becomes a systems problem

A long-running coding agent keeps collecting context from different places:

  • user and assistant messages
  • tool calls
  • large tool outputs
  • file attachments
  • memory files and instruction overlays
  • subagent and MCP state

The mistake is to treat all of this as one flat history buffer.

It is not.

A recent plan, an active file, and a large terminal output from twelve turns ago should not be treated the same way. If the system has only one strategy — keep appending and summarize later — it will eventually hit one of these failure modes:

  • prompt-too-long loops
  • broken tool_use / tool_result structure
  • duplicate reinjection of context
  • poor prompt-cache reuse
  • brittle resume behavior

So the real problem is not just “how do I summarize history?”

The real problem is:

How do I keep the prompt useful over long sessions without breaking message invariants, hurting cache stability, or making recovery worse?

That is the part Claude Code gets right.


Context pipeline in Claude Code

1) Start with a layered pipeline

The implementation uses a layered memory pipeline.

Instead of jumping directly to summarization, it reduces context in stages:

  1. keep only the relevant slice of history
  2. shrink large tool results
  3. do lightweight cleanup
  4. compact only when cheaper steps are not enough

That ordering matters. Summarization is the expensive path, so it should be the fallback, not the default.

A simplified version of the per-turn flow looks like this:

  1. slice messages after the last compact boundary
  2. apply tool-result budget
  3. optionally snip history
  4. run microcompact
  5. optionally apply context-collapse projection
  6. check auto-compact
  7. call the model
  8. use bounded recovery if the call still fails

This is the shape that makes the whole system stable.


2) The per-turn context pipeline

Here is a generic TypeScript version of that control flow.

type QuerySource = "user" | "compact" | "session_memory";

interface Message {
  id: string;
  role: "user" | "assistant";
  kind?: "text" | "tool_use" | "tool_result" | "boundary" | "summary";
  content: string;
  toolUseId?: string;
  assistantMessageId?: string;
  tokenCount?: number;
  timestampMs?: number;
}

interface ModelInfo {
  contextWindow: number;
  maxOutputTokens: number;
}

interface RuntimeState {
  autoCompactFailures: number;
  userConfig: { autoCompactEnabled: boolean };
  env: {
    disableCompact: boolean;
    disableAutoCompact: boolean;
  };
  toolBudgetState: {
    seenIds: Set<string>;
    replacements: Map<string, string>;
  };
}

async function runQueryTurn(
  history: Message[],
  model: ModelInfo,
  source: QuerySource,
  state: RuntimeState
): Promise<Message[]> {
  let msgs = getMessagesAfterCompactBoundary(history);

  msgs = await applyToolResultBudget(msgs, state.toolBudgetState);
  msgs = maybeHistorySnip(msgs);
  msgs = microcompact(msgs);
  msgs = maybeContextCollapseProjection(msgs);

  if (shouldAutoCompact(msgs, model, source, state)) {
    const compacted =
      (await trySessionMemoryCompaction(msgs, state, model)) ??
      (await fullCompaction(msgs, state, model));

    if (compacted) {
      msgs = buildPostCompactMessages(compacted);
      runPostCompactCleanup(state, source);
    } else {
      state.autoCompactFailures += 1;
    }
  }

  try {
    const response = await callModel(msgs, model);
    return appendToHistory(history, msgs, response);
  } catch (err) {
    return retryWithRecoveryLadder(err, msgs, history, model, state);
  }
}

The important thing here is not the exact function names. It is the order.

The model does not receive raw history. It receives a transformed view of the history. And that transformed view is built in stages.

That is also why observability should mirror this pipeline. If your /context command looks at raw history but the model sees a compacted version of it, your debugging numbers will not match reality.
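
One way to keep those numbers honest is to compute context statistics from the same transformed view the model receives. Here is a minimal sketch; `estimateTokens` and the whole-pipeline `transform` parameter are illustrative stand-ins, not Claude Code's actual API:

```typescript
interface Msg {
  kind?: string;
  content: string;
}

// Rough token estimate: ~4 characters per token (an assumption for illustration).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Report both views: raw history versus the transformed view the model sees.
// `transform` stands in for the full per-turn pipeline described above.
function contextReport(
  history: Msg[],
  transform: (msgs: Msg[]) => Msg[]
): { rawTokens: number; modelTokens: number } {
  const total = (msgs: Msg[]) =>
    msgs.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return {
    rawTokens: total(history),
    modelTokens: total(transform(history)),
  };
}

// Example: a transform that truncates old tool results.
const history: Msg[] = [
  { kind: "text", content: "plan the refactor" },
  { kind: "tool_result", content: "x".repeat(4000) },
];
const report = contextReport(history, (msgs) =>
  msgs.map((m) =>
    m.kind === "tool_result" ? { ...m, content: "[truncated]" } : m
  )
);
// report.rawTokens counts the full payload; report.modelTokens does not.
```

A /context-style command built on the model-facing view will agree with what the API actually bills and caches.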


3) Budget tool results before anything expensive

The cheapest and most effective reduction is usually tool output.

In long coding sessions, tool results are often the largest token consumers in the prompt:

  • terminal output
  • search results
  • file diffs
  • generated logs
  • structured JSON returned by tools

The system usually does not need the full payload forever. It needs enough inline context for the model to stay grounded, and a durable reference to the full content outside the prompt.

This is where tool-result budgeting comes in.

A good implementation should do three things:

  1. persist the full output somewhere durable
  2. replace the in-context payload with a stable preview
  3. replay the same replacement text on later turns

That third point matters for prompt caching. If the replacement text changes from turn to turn, the cache key changes too. So once a tool result has been replaced, the decision should be frozen by tool_use_id.

Here is a generic TypeScript version:

interface ToolStorage {
  store(toolUseId: string, content: string): Promise<string>;
}

interface BudgetState {
  seenIds: Set<string>;
  replacements: Map<string, string>;
}

const MAX_TOOL_RESULT_BYTES_PER_MESSAGE = 64_000;

function buildPreview(ref: string, original: string): string {
  return [
    `Tool result stored externally: ${ref}`,
    ``,
    `Preview:`,
    original.slice(0, 500),
    ``,
    `[truncated for context budget]`,
  ].join("\n");
}

async function applyToolResultBudget(
  messages: Message[],
  state: BudgetState,
  storage?: ToolStorage
): Promise<Message[]> {
  return Promise.all(
    messages.map(async (msg) => {
      if (msg.kind !== "tool_result" || !msg.toolUseId) return msg;

      const bytes = Buffer.byteLength(msg.content, "utf8");
      if (bytes <= MAX_TOOL_RESULT_BYTES_PER_MESSAGE) return msg;

      if (state.seenIds.has(msg.toolUseId)) {
        const replacement = state.replacements.get(msg.toolUseId);
        return replacement ? { ...msg, content: replacement } : msg;
      }

      const ref = storage
        ? await storage.store(msg.toolUseId, msg.content)
        : `tool-result://${msg.toolUseId}`;

      const replacement = buildPreview(ref, msg.content);

      state.seenIds.add(msg.toolUseId);
      state.replacements.set(msg.toolUseId, replacement);

      return { ...msg, content: replacement };
    })
  );
}

What this example shows is simple: large tool outputs are converted into small, stable prompt payloads before the system reaches full compaction.

That is a practical optimization, and it pays off early.
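
The replay property is worth testing directly: once a tool_use_id is in the replacement map, later turns must get byte-identical text back. Here is a self-contained sketch of just that freezing behavior, reduced from the fuller version above (the limit and preview format are illustrative):

```typescript
interface BudgetMemo {
  seenIds: Set<string>;
  replacements: Map<string, string>;
}

const LIMIT_CHARS = 1_000;

// Replace an oversized payload once, then replay the exact same text forever,
// keyed by tool_use_id, so the prompt bytes (and cache keys) stay stable.
function budgetOnce(id: string, content: string, memo: BudgetMemo): string {
  if (content.length <= LIMIT_CHARS) return content;

  if (memo.seenIds.has(id)) {
    // Frozen decision: reuse the stored replacement verbatim.
    return memo.replacements.get(id) ?? content;
  }

  const replacement = `[tool result ${id} truncated]\n${content.slice(0, 200)}`;
  memo.seenIds.add(id);
  memo.replacements.set(id, replacement);
  return replacement;
}

const memo: BudgetMemo = { seenIds: new Set(), replacements: new Map() };
const big = "a".repeat(5_000);
const first = budgetOnce("tool-1", big, memo);
const second = budgetOnce("tool-1", big, memo);
// first === second: later turns replay byte-identical replacement text.
```

If the replacement text were regenerated each turn (say, with a timestamp in the preview), every turn would invalidate the cached prefix from that point on.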

Why budget at the API message level?

One useful production detail from Claude Code is that budgeting is applied at the merged API-message level, not just on raw internal message objects.

Why does that matter?

Because tool outputs can arrive in parallel. If you budget each internal fragment separately, several large results can slip through simply because they are split across message objects. Budgeting at the merged message level closes that gap.

That is the kind of detail that usually only becomes obvious once you run real sessions at scale.
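
A small sketch makes the gap concrete. Assume fragments of one merged API message share an `assistantMessageId` (an illustrative field name, not necessarily the real one); budgeting must sum sizes across the group before comparing against the limit:

```typescript
interface Fragment {
  assistantMessageId: string; // fragments of one merged API message share this (assumed field)
  content: string;
}

const BUDGET_BYTES = 1_000;

// Sum sizes across all fragments of the same API message before comparing
// against the budget, so parallel tool outputs cannot slip through.
// (ASCII length stands in for UTF-8 byte length here; real code would use
// something like Buffer.byteLength.)
function overBudgetMessageIds(fragments: Fragment[]): Set<string> {
  const totals = new Map<string, number>();
  for (const f of fragments) {
    totals.set(
      f.assistantMessageId,
      (totals.get(f.assistantMessageId) ?? 0) + f.content.length
    );
  }

  const over = new Set<string>();
  for (const [id, bytes] of totals) {
    if (bytes > BUDGET_BYTES) over.add(id);
  }
  return over;
}

// Three 400-byte fragments: each is individually under the 1000-byte budget,
// but the merged message (1200 bytes) is not.
const fragments: Fragment[] = [
  { assistantMessageId: "msg-1", content: "x".repeat(400) },
  { assistantMessageId: "msg-1", content: "y".repeat(400) },
  { assistantMessageId: "msg-1", content: "z".repeat(400) },
];
const flagged = overBudgetMessageIds(fragments);
// flagged contains "msg-1", even though no single fragment exceeds the budget.
```

Per-fragment budgeting would pass all three fragments untouched; merged-level budgeting flags the message.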


4) Use microcompact for cheap cleanup

So what happens after tool-result budgeting?

You still need a lightweight cleanup path before full compaction. That is what microcompact is for.

Claude Code uses two useful ideas here.

Cached microcompact

One path tracks compactable tool_result blocks and edits them in a cache-aware way instead of eagerly rewriting everything locally.

The point is not only to remove tokens. It is also to preserve cache reuse as much as possible.
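
One way to sketch cache-aware editing is to never touch messages inside the cached prefix: rewrite compactable tool results only past the last cache breakpoint, so the bytes the prompt cache keys on stay identical. This is a simplified assumption about the mechanism, with `cachedPrefixLength` standing in for however the runtime tracks that breakpoint:

```typescript
interface Msg {
  kind?: string;
  content: string;
}

// Rewrite compactable tool results, but never touch anything inside the
// cached prefix: those bytes are what the prompt cache keys on.
function cacheAwareMicrocompact(
  messages: Msg[],
  cachedPrefixLength: number
): Msg[] {
  return messages.map((msg, index) => {
    if (index < cachedPrefixLength) return msg; // preserve the cached prefix byte-for-byte
    if (msg.kind !== "tool_result") return msg;
    return { ...msg, content: "[microcompacted tool result]" };
  });
}

const msgs: Msg[] = [
  { kind: "tool_result", content: "cached output" },
  { kind: "tool_result", content: "fresh output" },
];
const out = cacheAwareMicrocompact(msgs, 1);
// out[0] is untouched; only out[1] is rewritten.
```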

Time-based microcompact

There is also a time-based path for idle sessions.

If enough time has passed since the last assistant message, older tool results become cheaper to remove because the cache may already be stale. In that case, the system can keep only the most recent few tool results and clear the rest.

This is a useful production idea because not all context pressure is token pressure. Some of it is cache-lifecycle pressure.

Here is a minimal version of a time-based microcompact pass:

type Message = {
  id: string;
  role: string;
  kind?: string;
  content: string;
  timestamp_ms?: number;
};

const IDLE_THRESHOLD_MS = 60 * 60 * 1000;
const KEEP_RECENT_TOOL_RESULTS = 5;

export function microcompact(
  messages: Message[],
  nowMs: number = Date.now()
): Message[] {
  // Find last assistant message (from end)
  const lastAssistant = [...messages]
    .reverse()
    .find((m) => m.role === "assistant");

  if (!lastAssistant || !lastAssistant.timestamp_ms) {
    return messages;
  }

  const idleTooLong =
    nowMs - lastAssistant.timestamp_ms > IDLE_THRESHOLD_MS;

  if (!idleTooLong) {
    return messages;
  }

  // Get all tool_result messages
  const toolResults = messages.filter(
    (m) => m.kind === "tool_result"
  );

  // Keep only last N tool results
  const keepIds = new Set(
    toolResults.slice(-KEEP_RECENT_TOOL_RESULTS).map((m) => m.id)
  );

  // Build compacted messages
  return messages.map((msg) => {
    if (msg.kind === "tool_result" && !keepIds.has(msg.id)) {
      return {
        ...msg,
        content:
          "[microcompacted: old tool result removed after idle period]",
      };
    }
    return msg;
  });
}

This example is intentionally simple. It is not trying to summarize anything. It is just removing stale low-value content with very low latency.

That is exactly where microcompact fits.


5) Auto-compact is a policy decision

Eventually the lightweight paths are not enough. At that point the system needs to decide whether to run a full compaction.

The common mistake is to trigger compaction too late:

if used_tokens > context_window:
    summarize()

That looks fine in a toy implementation. In production it is usually too late.

Claude Code uses a more realistic threshold model.

First, it computes an effective context window:

effectiveContext = contextWindow - min(modelMaxOutput, 20k reserve)

This makes room for the response instead of pretending the whole context window is available for input.

Then it applies an additional buffer:

autoCompactThreshold = effectiveContext - 13k buffer

That extra buffer matters. Compaction itself needs space. Recovery paths also need space. Waiting until the last possible turn is usually a bad policy.

Here is a simple version of the same logic:

type Message = {
  token_count: number;
};

type Model = {
  context_window: number;
  max_output_tokens: number;
};

type State = {
  disable_compact: boolean;
  disable_auto_compact: boolean;
  auto_compact_enabled: boolean;
  auto_compact_failures: number;
};

export function effectiveContextWindow(
  contextWindow: number,
  modelMaxOutput: number
): number {
  const reserve = Math.min(modelMaxOutput, 20_000);
  return contextWindow - reserve;
}

export function autoCompactThreshold(
  contextWindow: number,
  modelMaxOutput: number
): number {
  return effectiveContextWindow(contextWindow, modelMaxOutput) - 13_000;
}

export function shouldAutoCompact(
  messages: Message[],
  model: Model,
  querySource: string,
  state: State
): boolean {
  if (state.disable_compact || state.disable_auto_compact) {
    return false;
  }

  if (!state.auto_compact_enabled) {
    return false;
  }

  if (querySource === "compact" || querySource === "session_memory") {
    return false;
  }

  if (state.auto_compact_failures >= 3) {
    return false;
  }

  const totalTokens = messages.reduce(
    (sum, m) => sum + (m.token_count || 0),
    0
  );

  const threshold = autoCompactThreshold(
    model.context_window,
    model.max_output_tokens
  );

  return totalTokens >= threshold;
}

What this example shows is that compaction is a policy decision, not just a token-count check.

The policy is guarded by:

  • environment flags
  • user config
  • query source
  • a circuit breaker for repeated failures

That is the right way to think about it. Once compaction becomes part of the runtime, it needs the same safety rails as any other production control path.
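
Plugging concrete numbers into the threshold math above makes the headroom visible. For a hypothetical model with a 200k context window and 32k max output tokens:

```typescript
// Hypothetical model limits, for illustration only.
const contextWindow = 200_000;
const maxOutputTokens = 32_000;

// Reserve room for the response: min(model max output, 20k reserve).
const reserve = Math.min(maxOutputTokens, 20_000);      // 20_000
const effectiveContext = contextWindow - reserve;       // 180_000

// Extra 13k buffer so compaction and recovery still have room to run.
const autoCompactThreshold = effectiveContext - 13_000; // 167_000

// A session at 170k tokens crosses the threshold while headroom remains,
// instead of compacting only once the window is already full.
const usedTokens = 170_000;
const shouldCompact = usedTokens >= autoCompactThreshold; // true
```

A naive `usedTokens > contextWindow` check would not fire until 200k, by which point there is no room left for the response, let alone the compaction pass itself.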


6) Why this layering works

At this point the overall structure should be clear:

  • tool-result budgeting handles the biggest cheap wins
  • microcompact handles lightweight cleanup
  • auto-compact is the expensive fallback

This is what layered degradation means in practice.

Instead of going from “everything is in the prompt” to “summarize history,” the system has intermediate steps that are cheaper, safer, and easier to reason about.

That matters for two reasons.

First, it keeps latency lower in the common case.

Second, it reduces the chances of structural mistakes because the system does not compact more than it needs to.


7) In Part 2

Part 1 focused on the normal turn pipeline and the cheap reductions that happen before summarization.

In Part 2, I will look at the second half of the architecture:

  • session memory as a separate durable memory plane
  • why full compaction should be treated as a transaction
  • how structural invariants are preserved
  • how bounded recovery avoids retry spirals
  • why post-compact cleanup and observability matter

That is where the design goes from prompt management to stable long-session behavior.


Follow Vilva Athiban for more production-grade AI and agent engineering content

Read next: Session Memory and Safe Compaction (coming soon)
