You're deep into a conversation with Claude when a message pops up: "Compacting our conversation so we can keep chatting." What does that actually mean? What's being compacted, and how?
If you're a user, understanding this will explain most of the strange behavior you've noticed in long conversations. If you're building AI applications, this is the most important engineering problem you're not thinking about yet.
The context window is a budget
Every LLM has a context window: the total amount of text it can "see" at once. Think of it like RAM for the conversation. GPT-4o has a 128K token window. The models powering Claude support up to 1 million tokens. Sounds enormous, right?
Here's the catch: your conversation history is only one of many things competing for that space.
Before your messages even enter the picture, the window is already partially claimed. The system prompt eats several thousand tokens with instructions, persona definitions, and behavioral guidelines. Tool definitions take up real space, especially when the assistant has access to many tools with detailed parameter schemas. If the system uses retrieval-augmented generation, the retrieved documents claim their share. Extracted memories from prior turns need room. And a chunk has to be reserved for the model's response.
After all of that, the budget remaining for your actual conversation history might be a fraction of the total window. This is the token budget problem, and every production AI system has to solve it.
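To make the arithmetic concrete, here is a minimal budget sketch. The component sizes are illustrative assumptions, not measured values from any particular system:

```python
# Hypothetical overhead for a 128K-token window; every number here
# is an illustrative assumption, not a measurement.
CONTEXT_WINDOW = 128_000

fixed_costs = {
    "system_prompt": 3_000,
    "tool_definitions": 5_000,
    "retrieved_documents": 20_000,
    "extracted_memories": 2_000,
    "response_reservation": 8_000,
}

def history_budget(window: int, costs: dict) -> int:
    """Tokens left for conversation history after fixed overhead."""
    return window - sum(costs.values())

print(history_budget(CONTEXT_WINDOW, fixed_costs))  # 90000
```

Even with modest overhead, nearly a third of the window is spoken for before the first user message arrives; heavier RAG payloads eat far more.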
What happens when you hit the limit
The simplest approach is truncation: keep the most recent messages, drop everything older. It's fast, predictable, and fits in a dozen lines of code.
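Here is roughly what those dozen lines look like. `count_tokens` stands in for whatever tokenizer you use (e.g. tiktoken for OpenAI models):

```python
def truncate_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages that fit under max_tokens.

    count_tokens is any function mapping text to a token count.
    """
    kept, total = [], 0
    for msg in reversed(messages):  # walk newest to oldest
        cost = count_tokens(msg["content"])
        if total + cost > max_tokens:
            break  # everything older than this is dropped
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```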
But think about what gets lost. You told a coding assistant "we use Postgres and deploy on AWS" in message 3. You shared your API schema in message 8. You mentioned your team uses TypeScript everywhere in message 12. By message 50, the conversation has grown past what the context window can hold. The system quietly drops the oldest messages to make room. All of those early decisions are gone. The assistant is now suggesting MongoDB queries, recommending GCP tooling, and generating Python.
It looks like amnesia because it is amnesia.
For customer support bots handling password resets or FAQ lookups, where each question is essentially self-contained, simple truncation works fine. The moment conversations build on earlier context, though, you need something better. And most interesting AI applications involve exactly that kind of conversation.
Strategy 1: Compressing history with summarization
Rather than dropping old messages entirely, you can compress them. A thirty-message conversation about insurance verification, symptom discussion, and appointment preferences might condense to a two-hundred-token summary that preserves what matters: the patient has a penicillin allergy, prefers morning appointments, and is coming in for knee pain. The verbatim back-and-forth is gone, but the essential facts survive.
This is what Claude's "compacting" step does. A separate LLM call takes your older conversation and compresses it into a summary that captures the essential narrative. You lose the exact wording but keep the decisions, the preferences, and the important facts.
The summarization prompt matters more than you might expect. A generic "summarize this conversation" instruction produces generic summaries that may omit critical details. For a healthcare application, the prompt should explicitly ask to preserve medical conditions, allergies, and safety concerns. For a coding assistant, it should emphasize technology choices, architectural decisions, and requirements. The prompt shapes what survives compression.
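A minimal compaction sketch, assuming `llm_call` is a placeholder for any LLM completion function and the prompt is tuned for a coding assistant:

```python
# Domain-specific prompt: tells the summarizer what must survive.
SUMMARY_PROMPT = """Summarize the conversation below for a coding assistant.
Preserve: technology choices, architectural decisions, explicit requirements,
and any constraints the user stated. Omit greetings and small talk.

Conversation:
{transcript}
"""

def compact(messages, keep_recent, llm_call):
    """Replace older messages with a single summary message.

    llm_call is a stand-in for any provider's completion API.
    """
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages  # nothing old enough to compress yet
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = llm_call(SUMMARY_PROMPT.format(transcript=transcript))
    return [{"role": "system",
             "content": f"Summary of earlier turns: {summary}"}] + recent
```

In production you would also decide *when* to trigger this (e.g. when history exceeds a token threshold) and cache the summary so you are not re-summarizing on every turn.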
Strategy 2: Model-managed memory
Summarization compresses the conversation, but it's still lossy. Critical facts can get diluted across summaries, especially as conversations grow very long and summaries get re-summarized.
A more reliable approach for truly important information: let the model explicitly extract and store facts as structured data, separate from the conversation itself.
The idea, inspired by UC Berkeley's MemGPT paper, treats the LLM like an operating system with the context window as "RAM" and external storage as "disk." Give the model a save_memory tool, and it can decide what's worth remembering permanently.
Imagine a patient mentions a penicillin allergy early in their healthcare intake conversation. Fifty messages later, when the conversation has moved through insurance verification, appointment scheduling, and symptom discussion, will that allergy still be accessible? With simple summarization, it might survive or it might not, depending on how the summary was generated and how many times it's been re-compressed.
With model-managed memory, the assistant explicitly saves {"allergies": ["penicillin"]} the moment the patient mentions it. That fact persists as structured data, available on every future turn regardless of how long the conversation becomes. When the patient later asks, "Can you prescribe amoxicillin for my infection?", the model sees the allergy in its context and can respond appropriately: amoxicillin is a penicillin-family antibiotic.
This approach shines whenever specific facts need to persist reliably across many turns, independent of whether the original messages still exist in context.
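A sketch of what sits behind that pattern: a tool schema the model can call, and a handler that persists the fact. The schema shape follows the JSON format most LLM APIs accept, but the exact wrapper differs per provider, and `memory_store` here is just an in-memory dict standing in for real storage:

```python
import json

# Hypothetical tool definition exposed to the model.
SAVE_MEMORY_TOOL = {
    "name": "save_memory",
    "description": "Persist a critical fact about the user, e.g. an allergy.",
    "input_schema": {
        "type": "object",
        "properties": {
            "key": {"type": "string"},
            "value": {"type": "string"},
        },
        "required": ["key", "value"],
    },
}

memory_store = {}  # stand-in for a database keyed by user/session

def save_memory(key: str, value: str) -> str:
    """Runs when the model invokes the tool; the return value
    is sent back to the model as the tool result."""
    memory_store.setdefault(key, []).append(value)
    return f"Saved {key}={value}"

# The model decides to call this the moment the allergy is mentioned:
save_memory("allergies", "penicillin")
print(json.dumps(memory_store))  # {"allergies": ["penicillin"]}
```

On every subsequent turn, the contents of `memory_store` are injected into the context, so the fact survives no matter how many messages have been truncated or summarized away.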
Strategy 3: Hierarchical memory
Production systems rarely use a single strategy. Instead, they combine approaches into a hierarchy where each layer serves a different purpose and tolerates different levels of compression.
Tier 1: Structured memories. Critical facts stored as key-value pairs. The penicillin allergy, the preferred appointment time, the tech stack. These sit at the very top of the context, where LLM attention is strongest. They never get dropped.
Tier 2: Rolling summaries. Older conversation history compressed into a few hundred tokens that capture the essential narrative. You lose the verbatim exchange but keep the decisions, the preferences, and the reasoning. As conversations grow, this summary gets periodically regenerated to incorporate newly "old" messages.
Tier 3: Recent messages. The last several exchanges stay word-for-word. When you ask a follow-up about something you just discussed, the model needs to see exactly what was said, not a summary of it.
How you allocate your token budget across these tiers depends on the application. A coding assistant where the last few exchanges contain the code under discussion will weight heavily toward recent messages. A healthcare assistant tracking many critical facts will allocate more to structured memories. The balance matters, and getting it right is a design problem specific to your use case.
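Assembling the three tiers into a single context is then straightforward. A minimal sketch, with structured memories first because that is where attention is strongest:

```python
def build_context(memories, summary, recent_messages):
    """Stack the three tiers: structured facts, rolling summary,
    then verbatim recent turns."""
    parts = []
    if memories:  # Tier 1: never dropped
        facts = "; ".join(f"{k}: {v}" for k, v in memories.items())
        parts.append({"role": "system", "content": f"Known facts: {facts}"})
    if summary:   # Tier 2: compressed older history
        parts.append({"role": "system",
                      "content": f"Earlier conversation: {summary}"})
    return parts + recent_messages  # Tier 3: word-for-word
```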
Why this matters
If you're a user, understanding this system explains the behavior you've noticed. When an assistant "forgets" something you said earlier, that message was likely truncated or compressed away. When it contradicts itself over a long conversation, earlier reasoning is no longer in context. Short, focused conversations produce better results because everything fits. Explicitly restating important context helps because you're doing the model's memory management for it.
If you're building an AI application, this is the problem that separates demos from production systems. Nobody handles context management for you when you're building on LLM APIs. You need to decide what gets summarized, when compaction triggers, what facts are critical enough to extract into structured memory, and how to allocate your token budget across these tiers.
These are engineering problems, not model problems. The model doesn't know how to manage its own context. You do.
And when users start referencing previous conversations? That's a whole other problem, and a whole other chapter.
We build the full Session Service that handles all of this, with working Python code and a full system design, in Chapter 4 of our book "Designing AI Systems: A Guide to Building Production-Ready Platforms" on Manning. The first four chapters are live now on MEAP.


