Vilva Athiban P B

Memory management in Claude Code: Session Memory and Safe Compaction

This is Part 2 of the series. If you have not read the first post yet, start with Part 1.

In Part 1, I looked at the first half of the memory pipeline:

  • slicing history after the last compact boundary
  • budgeting large tool results
  • using microcompact for cheap cleanup
  • triggering auto-compact with a policy instead of a last-second token check

That already solves a big part of the problem.

But once those cheaper layers are no longer enough, the system needs to do something more expensive without losing important working context, breaking message structure, or getting stuck in recovery loops.

This is where Claude Code’s design becomes especially useful. It treats compaction as a transaction, and it keeps a separate session-memory plane so overflow handling does not always start from scratch.

In this post, I will go through that second half:

  • how session memory is maintained
  • how full compaction works
  • why invariant protection matters more than it sounds
  • how recovery paths are kept bounded
  • what needs to be cleaned up after compaction

Two-part series

Part 1: Context Pipeline

Part 2: Session Memory and Safe Compaction

Session management in Claude Code

1) Keep session memory in parallel

A useful idea in Claude Code is that durable memory extraction does not begin at the moment the prompt overflows.

Instead, the system maintains a separate session-memory file in parallel and updates it only when it is actually worth doing.

That matters because summarization during an overflow path is expensive and time-sensitive. If you already have a distilled session memory available, compaction becomes cheaper and more predictable.

The extraction path is gated by thresholds such as:

  • initialize only after around 10k tokens
  • require at least 5k token growth between updates
  • require enough activity, such as 3 tool calls

The exact numbers are less important than the pattern. Session memory is updated opportunistically, not continuously.

Here is a TS sketch:

type Message = {
  id: string;
  token_count: number;
};

type SessionMemoryState = {
  content: string;
  last_update_tokens: number;
  last_summarized_message_id: string | null;
  extraction_started_at: number | null; // seconds since epoch
  extraction_task: Promise<void> | null;
};

const INIT_THRESHOLD_TOKENS = 10_000;
const GROWTH_THRESHOLD_TOKENS = 5_000;
const TOOL_CALL_THRESHOLD = 3;
const WAIT_TIMEOUT_SECONDS = 15;
const STALE_EXTRACTION_SECONDS = 60;

export function createSessionMemoryState(): SessionMemoryState {
  return {
    content: "",
    last_update_tokens: 0,
    last_summarized_message_id: null,
    extraction_started_at: null,
    extraction_task: null,
  };
}

export function shouldExtractSessionMemory(
  totalTokens: number,
  newTokensSinceLast: number,
  toolCalls: number
): boolean {
  if (totalTokens < INIT_THRESHOLD_TOKENS) {
    return false;
  }

  if (newTokensSinceLast < GROWTH_THRESHOLD_TOKENS) {
    return false;
  }

  if (toolCalls < TOOL_CALL_THRESHOLD) {
    return false;
  }

  return true;
}

function nowSeconds(): number {
  return Date.now() / 1000;
}

function withTimeout<T>(promise: Promise<T>, timeoutSeconds: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timeoutId = setTimeout(() => {
      reject(new Error("Timeout"));
    }, timeoutSeconds * 1000);

    promise.then(
      (value) => {
        clearTimeout(timeoutId);
        resolve(value);
      },
      (error) => {
        clearTimeout(timeoutId);
        reject(error);
      }
    );
  });
}

export async function waitForInflightExtraction(
  state: SessionMemoryState
): Promise<void> {
  const task = state.extraction_task;

  if (!task) {
    return;
  }

  if (state.extraction_started_at !== null) {
    const age = nowSeconds() - state.extraction_started_at;
    if (age > STALE_EXTRACTION_SECONDS) {
      return;
    }
  }

  try {
    await withTimeout(task, WAIT_TIMEOUT_SECONDS);
  } catch {
    return;
  }
}

// Replace this with your actual durable-memory summarizer
function summarizeForDurableMemory(messages: Message[]): string {
  return `Summarized ${messages.length} messages`;
}

export async function updateSessionMemory(
  state: SessionMemoryState,
  messages: Message[]
): Promise<void> {
  const run = async (): Promise<void> => {
    state.content = summarizeForDurableMemory(messages);
    state.last_update_tokens = messages.reduce(
      (sum, m) => sum + (m.token_count || 0),
      0
    );

    if (messages.length > 0) {
      state.last_summarized_message_id = messages[messages.length - 1].id;
    }
  };

  state.extraction_started_at = nowSeconds();
  state.extraction_task = run();

  try {
    await state.extraction_task;
  } finally {
    // Mark the extraction as finished so later callers do not block on it
    state.extraction_task = null;
    state.extraction_started_at = null;
  }
}

What this example shows is the separation of concerns.

  • background extraction handles durable knowledge distillation
  • compaction handles overflow when it happens

That split is one of the strongest parts of the design.
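
In practice, the gate runs once per turn and the extraction itself stays off the hot path. A minimal usage sketch, assuming the caller tracks totalTokens and toolCallsThisWindow itself:

const memory = createSessionMemoryState();

function maybeExtract(
  totalTokens: number,
  toolCallsThisWindow: number,
  messages: Message[]
) {
  const growth = totalTokens - memory.last_update_tokens;
  if (shouldExtractSessionMemory(totalTokens, growth, toolCallsThisWindow)) {
    // Fire-and-forget: the turn does not block on summarization
    void updateSessionMemory(memory, messages);
  }
}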

Why wait for in-flight extraction?

Before using session memory for compaction, Claude Code waits for a currently running extraction to finish, but only up to a limit.

That is a good practical choice.

It gives the system a chance to use a fresher memory snapshot without risking deadlock. If the extraction appears stale, it skips waiting and continues.

This is a small detail, but it shows the right mindset: use better context when available, but keep the runtime moving.
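
To make that concrete, here is a minimal sketch of a consumer. trySessionMemoryCompaction is a hypothetical name for illustration, not Claude Code's actual function:

// Hypothetical consumer: wait briefly for an in-flight extraction,
// then use the distilled snapshot if one exists.
export async function trySessionMemoryCompaction(
  state: SessionMemoryState
): Promise<string | null> {
  await waitForInflightExtraction(state); // bounded wait, never a deadlock
  return state.content.length > 0 ? state.content : null;
}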


2) Treat compaction as a transaction

Once the system decides to compact, the easy thing to say is “summarize history.”

That is not enough.

A better way to think about compaction is as a transaction with clear phases:

  1. prepare the compact input
  2. generate the summary
  3. clear state that is no longer valid
  4. restore critical operational context
  5. continue in the same turn

This is the part that makes the design production-grade.

A summary by itself rarely preserves everything the runtime needs. After compaction, the agent may still need a small amount of high-value operational context, such as:

  • currently relevant files
  • plan or plan mode state
  • invoked skills
  • deferred-tool deltas
  • agent listing deltas
  • MCP instruction deltas

That is why Claude Code rehydrates operational context after summarization, with strict budgets.

From the visible source snapshot, example restore budgets include:

  • max restored files: 5
  • max tokens per file: 5k
  • total file restore budget: 50k
  • max tokens per skill: 5k
  • total skills budget: 25k

Those limits matter. Rehydration without budgets will just recreate the original problem.

Here is a TypeScript sketch:

interface Message {
  id: string;
  role: "user" | "assistant";
  kind?: "text" | "tool_use" | "tool_result" | "boundary" | "summary";
  content: string;
  attachments?: string[];
}

interface ModelInfo {
  contextWindow: number;
  maxOutputTokens: number;
}

interface CompactionResult {
  summary: string;
  boundaryMessage: Message;
  restored: Message[];
}

// Assume these exist in your pipeline
declare function stripHeavyAttachments(msgs: Message[]): Message[];
declare function summarizeConversation(
  msgs: Message[],
  model: ModelInfo
): Promise<string>;
declare function isPromptTooLong(err: unknown): boolean;
declare function truncateOldestApiRounds(msgs: Message[]): Message[];
declare function clearReadCaches(): void;
declare function clearNestedMemoryPathCaches(): void;
declare function rehydrateCriticalContext(
  msgs: Message[],
  budgets: {
    maxFiles: number;
    maxTokensPerFile: number;
    totalFileBudget: number;
    maxTokensPerSkill: number;
    totalSkillBudget: number;
  }
): Message[];

async function fullCompaction(
  messages: Message[],
  model: ModelInfo
): Promise<CompactionResult | null> {
  const prepped = stripHeavyAttachments(messages);

  let summary: string | null = null;
  let working = [...prepped];

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      summary = await summarizeConversation(working, model);
      break;
    } catch (err) {
      if (!isPromptTooLong(err)) throw err;
      working = truncateOldestApiRounds(working);
    }
  }

  if (!summary) return null;

  clearReadCaches();
  clearNestedMemoryPathCaches();

  const restored = rehydrateCriticalContext(messages, {
    maxFiles: 5,
    maxTokensPerFile: 5_000,
    totalFileBudget: 50_000,
    maxTokensPerSkill: 5_000,
    totalSkillBudget: 25_000,
  });

  return {
    summary,
    boundaryMessage: {
      id: crypto.randomUUID(),
      role: "assistant",
      kind: "boundary",
      content: "[compact-boundary]",
    },
    restored,
  };
}

function buildPostCompactMessages(compacted: CompactionResult): Message[] {
  return [
    compacted.boundaryMessage,
    {
      id: crypto.randomUUID(),
      role: "assistant",
      kind: "summary",
      content: compacted.summary,
    },
    ...compacted.restored,
  ];
}

This code is useful for two reasons.

First, it makes compaction bounded. If the compact prompt is still too long, the system truncates oldest API rounds and retries only a limited number of times.

Second, it makes the restore step explicit. The compacted state is not just “smaller history.” It is “smaller history plus the operational context needed to continue.”

That is a much more reliable model.
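
For illustration, here is one way the truncation step from the sketch above could work. The round heuristic below is my assumption, not the actual Claude Code logic:

// Hypothetical: drop the oldest user->assistant round so the compact
// prompt shrinks on every retry and the loop terminates.
function truncateOldestApiRounds(messages: Message[]): Message[] {
  let usersSeen = 0;
  for (let i = 0; i < messages.length; i++) {
    if (messages[i].role === "user") {
      usersSeen += 1;
      // Everything before the second user message is the oldest round
      if (usersSeen === 2) {
        return messages.slice(i);
      }
    }
  }
  // Fewer than two rounds: drop the older half as a last resort
  return messages.slice(Math.floor(messages.length / 2));
}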


3) Protect structure before saving tokens

This is probably the most important practical point in the whole design.

When context compression fails, the first failure is often structural, not semantic.

In other words, the problem is not always that the summary forgot something. The problem is that the resulting message set is invalid or inconsistent.

There are two invariants that matter a lot.

Tool-use and tool-result integrity

If a compaction boundary cuts through a tool interaction, the model can end up seeing a tool_result without the matching tool_use, or vice versa.

That breaks the structure of the conversation.

Assistant chunk integrity

A single assistant message may be represented as multiple chunks internally, such as thinking text and tool-use blocks that share the same assistant message ID. If compaction splits them incorrectly, the merged prompt becomes inconsistent.

Here is a TS sketch that adjusts a keep-start index so these invariants stay intact:

type Message = {
  kind?: string;
  tool_use_id?: string | null;
  assistant_message_id?: string | null;
};

export function adjustKeepStartIndex(
  messages: Message[],
  candidateIdx: number
): number {
  let idx = Math.max(0, candidateIdx);

  // Keep assistant chunk groups together
  if (idx < messages.length) {
    const currentGroup = messages[idx].assistant_message_id;

    while (
      idx > 0 &&
      currentGroup &&
      messages[idx - 1].assistant_message_id === currentGroup
    ) {
      idx -= 1;
    }
  }

  // Collect tool_result IDs after idx
  const toolResultIdsAfter = new Set<string>();
  for (let i = idx; i < messages.length; i++) {
    const m = messages[i];
    if (m.kind === "tool_result" && m.tool_use_id) {
      toolResultIdsAfter.add(m.tool_use_id);
    }
  }

  // Adjust boundary to avoid breaking invariants
  while (idx > 0) {
    const prev = messages[idx - 1];

    // Avoid orphaning tool results
    if (
      prev.kind === "tool_use" &&
      prev.tool_use_id &&
      toolResultIdsAfter.has(prev.tool_use_id)
    ) {
      idx -= 1;
      continue;
    }

    // Keep assistant chunks together
    const prevGroup = prev.assistant_message_id;
    if (
      prevGroup &&
      idx < messages.length &&
      messages[idx].assistant_message_id === prevGroup
    ) {
      idx -= 1;
      continue;
    }

    break;
  }

  return idx;
}

What this example shows is that compaction boundaries cannot be chosen by token count alone. They also need to respect structure.

That is why invariant protection should be treated as a first-class part of memory design, not as cleanup after the fact.
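
A tiny worked example with hypothetical messages shows the walk-back:

const history: Message[] = [
  { kind: "text" },
  { kind: "tool_use", tool_use_id: "t1", assistant_message_id: "a1" },
  { kind: "tool_result", tool_use_id: "t1" },
];

// A token-based cut at index 2 would keep the tool_result but drop its
// matching tool_use, so the boundary walks back to index 1.
adjustKeepStartIndex(history, 2); // => 1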


4) Recovery must stay bounded

Even with a good compaction pipeline, model calls can still fail.

Some examples are:

  • prompt-too-long
  • media-size failures
  • max-output-tokens

A common mistake is to handle these with open-ended retries. That usually makes incidents worse.

A better design is to use a bounded recovery ladder:

  1. try the cheapest valid recovery
  2. escalate in a limited way
  3. stop after a hard ceiling
  4. track repeated failures and trip a circuit breaker

Here is a TypeScript example:

// Assume these exist in your pipeline; fullCompaction and
// buildPostCompactMessages are the section 2 sketches
declare function isMediaTooLarge(err: unknown): boolean;
declare function isMaxOutputTokens(err: unknown): boolean;
declare function callModel(msgs: Message[], model: ModelInfo): Promise<unknown>;
declare function appendToHistory(
  history: Message[],
  msgs: Message[],
  response: unknown
): Message[];

async function retryWithRecoveryLadder(
  err: unknown,
  msgs: Message[],
  history: Message[],
  model: ModelInfo,
  state: { autoCompactFailures: number }
): Promise<Message[]> {
  // Step 1: prompt overflow -> compact once and retry
  if (isPromptTooLong(err)) {
    const compacted = await fullCompaction(msgs, model);
    if (compacted) {
      const retryMsgs = buildPostCompactMessages(compacted);
      const response = await callModel(retryMsgs, model);
      return appendToHistory(history, retryMsgs, response);
    }
  }

  // Step 2: oversized media -> strip heavy attachments and retry
  if (isMediaTooLarge(err)) {
    const stripped = stripHeavyAttachments(msgs);
    const response = await callModel(stripped, model);
    return appendToHistory(history, stripped, response);
  }

  // Step 3: truncated output -> bump the output budget once, with a hard cap
  if (isMaxOutputTokens(err)) {
    const bumpedModel = {
      ...model,
      maxOutputTokens: Math.min(model.maxOutputTokens * 2, 8192),
    };
    const response = await callModel(msgs, bumpedModel);
    return appendToHistory(history, msgs, response);
  }

  // Unrecoverable here: count it toward the circuit breaker and rethrow
  state.autoCompactFailures += 1;
  throw err;
}

The idea is simple. Recovery paths should help the system escape bad states, not keep it stuck in them.

That is why the circuit breaker on repeated compaction failures matters too. If compaction keeps failing, the runtime should stop trying the same thing over and over.
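
A minimal sketch of such a breaker, with an assumed threshold:

// Illustrative circuit breaker: after repeated auto-compact failures,
// stop retrying the same recovery and surface the error instead.
const MAX_AUTO_COMPACT_FAILURES = 3; // assumed limit, not Claude Code's value

function canAttemptAutoCompact(state: { autoCompactFailures: number }): boolean {
  return state.autoCompactFailures < MAX_AUTO_COMPACT_FAILURES;
}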


5) Cleanup is part of correctness

Compaction changes the transcript, but it also invalidates derived state.

That means a compaction flow is not complete until the system clears or resets the state that no longer matches the new prompt.

In the visible Claude Code paths, post-compact cleanup resets things like:

  • microcompact state
  • system prompt sections
  • session message cache
  • approvals
  • some user and memory caches for main-thread compactions

At the same time, it intentionally preserves some invoked-skill content so future compactions can still use it.

That is a good reminder that cleanup is not just cache invalidation for performance. It is part of correctness.

Here is a small TypeScript sketch:

// Assume these cache/reset hooks exist in your runtime
declare function resetMicrocompactState(): void;
declare function resetSystemPromptSections(): void;
declare function clearSessionMessageCache(): void;
declare function clearApprovals(): void;
declare function clearUserCaches(): void;
declare function clearMemoryCaches(): void;

function runPostCompactCleanup(source: "user" | "compact" | "session_memory") {
  resetMicrocompactState();
  resetSystemPromptSections();
  clearSessionMessageCache();
  clearApprovals();

  if (source !== "compact") {
    clearUserCaches();
    clearMemoryCaches();
  }

  // Preserve whichever skill cache your runtime wants available
  // for future compaction passes.
}

If this step is missing, the agent can continue with ghost state that refers to a history shape that no longer exists.

That kind of bug is subtle and hard to debug later.


6) Measure the transformed prompt, not raw history

One detail I especially like in this design is that context diagnostics mirror the same transformations used before the model call.

That means a /context command should not look at raw stored history. It should look at:

  • messages after the compact boundary
  • with tool-result budgeting applied
  • with microcompact applied
  • with any context-collapse projection applied

That way, the numbers shown to operators are the numbers that actually matter.

Here is a TS sketch:

type Message = {
  token_count?: number;
};

type Model = {
  context_window: number;
  max_output_tokens: number;
};

type State = unknown; // adjust based on your actual state shape

// Assume these exist in your pipeline
declare function getMessagesAfterCompactBoundary(history: Message[]): Message[];
declare function simulateToolResultBudget(msgs: Message[], state: State): Message[];
declare function maybeHistorySnip(msgs: Message[]): Message[];
declare function microcompact(msgs: Message[]): Message[];
declare function maybeContextCollapseProjection(msgs: Message[]): Message[];

export function contextDiagnostics(
  history: Message[],
  model: Model,
  state: State
) {
  let msgs = getMessagesAfterCompactBoundary(history);
  msgs = simulateToolResultBudget(msgs, state);
  msgs = maybeHistorySnip(msgs);
  msgs = microcompact(msgs);
  msgs = maybeContextCollapseProjection(msgs);

  const totalTokens = msgs.reduce(
    (sum, m) => sum + (m.token_count || 0),
    0
  );

  // Reserve output headroom (capped at 20k) out of the context window
  const effectiveContext =
    model.context_window - Math.min(model.max_output_tokens, 20_000);

  // Fixed safety margin below the effective window for the trigger
  const threshold = effectiveContext - 13_000;

  return {
    input_tokens: totalTokens,
    auto_compact_threshold: threshold,
    remaining_before_compact: Math.max(0, threshold - totalTokens),
    message_count: msgs.length,
  };
}

This is a small implementation choice, but it improves debugging immediately because the reported state matches the model-facing prompt.
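
A quick usage sketch, assuming history, model, and state come from your runtime:

const diag = contextDiagnostics(history, model, state);
console.log(
  `${diag.input_tokens} input tokens across ${diag.message_count} messages, ` +
    `${diag.remaining_before_compact} tokens left before auto-compact`
);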


7) The reusable blueprint

At a high level, the full control plane looks like this:

async function runQueryTurn(
  history: Message[],
  model: ModelInfo,
  state: { autoCompactFailures: number }
): Promise<Message[]> {
  let msgs = messagesAfterLastCompactBoundary(history);
  msgs = enforceToolResultBudget(msgs);
  msgs = optionalHistorySnip(msgs);
  msgs = microcompact(msgs);
  msgs = optionalContextCollapseProjection(msgs);

  if (shouldAutoCompact(msgs)) {
    // Cheap path first: reuse session memory, else pay for full compaction
    const compacted =
      (await trySessionMemoryCompaction(msgs)) ??
      (await fullCompaction(msgs, model));

    if (compacted) {
      msgs = buildPostCompactMessages(compacted);
      runPostCompactCleanup("compact");
    }
  }

  try {
    const response = await callModel(msgs, model);
    return append(
      history,
      msgs,
      response,
      getToolResults(response),
      getAttachments(response)
    );
  } catch (err) {
    // Bounded recovery: the ladder returns the updated history itself
    return retryWithRecoveryLadder(err, msgs, history, model, state);
  }
}

This is why the most transferable lesson from Claude Code is architectural.

Memory efficiency in agents is not one algorithm. It is a coordinated system made of:

  • staged context reduction
  • stable replacement decisions
  • asynchronous durable memory extraction
  • transaction-style compaction
  • invariant-preserving boundaries
  • bounded recovery
  • observability based on the real prompt view

That is what makes long sessions manageable.


Closing

If you only implement summarization, your agent may work for a while.

If you implement a layered memory control plane — cheap reductions first, durable memory in parallel, compaction as a transaction, and bounded recovery — the agent has a much better chance of staying stable over long sessions.

That is the real lesson here.

Not better summaries.

Better memory architecture.

Read Part 1: Context Pipeline
