Conversation Summarization: Smart Context Management for Long Chats
Finite context windows. Growing conversations. Escalating token costs.
If you're building AI applications with multi-turn conversations, you've faced this problem. Every message adds to the context. Soon you're paying to resend the entire conversation history with every new request. Response quality degrades as the model loses focus. And eventually, you hit the context window limit.
Strategic summarization solves this by condensing older messages while preserving critical information.
When Summarization Becomes Essential
Summarization isn't always necessary. You should consider it when:
- Turn count exceeds 15-20 exchanges — token costs of full history dominate
- Tool outputs inflate context — a single API response can inject thousands of tokens
- Multi-session continuity is required — persistence needs manageable storage costs
- Response quality degrades — models lose focus in very large prompts
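The triggers above can be folded into a simple gating heuristic. This is an illustrative sketch, not a NeuroLink API; the thresholds are assumptions to tune for your workload:

```typescript
// Illustrative heuristic combining the triggers above.
// Thresholds are assumptions; tune them for your workload.
interface ConversationStats {
  turnCount: number;
  historyTokens: number; // estimated tokens in the full history
  contextWindow: number; // model's context window, in tokens
  multiSession: boolean; // conversation persists across sessions
}

function shouldSummarize(s: ConversationStats): boolean {
  return (
    s.turnCount > 15 ||
    s.historyTokens > 0.7 * s.contextWindow ||
    s.multiSession
  );
}
```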
Pattern 1: LLM-Powered Condensation
The simplest approach uses an LLM to condense older messages into concise paragraphs:
import { NeuroLink } from "@juspay/neurolink";
const neurolink = new NeuroLink({
conversationMemory: {
enabled: true,
enableSummarization: true,
summarizationOptions: {
// Trigger summarization at this message count
messageThreshold: 10,
// Use a cheap model for summarization
model: "claude-3-5-haiku",
// Preserve the most recent N messages verbatim
preserveRecentMessages: 5,
},
},
});
Key design features:
Important-message preservation: Flag messages that must survive summarization:
await neurolink.generate({
input: { text: "Confirm: wire $10,000 to account ending 4521" },
metadata: {
preserveInSummary: true, // Won't be condensed
importance: "critical",
},
});
Half-and-half splitting: Only the older half of messages gets condensed, avoiding awkward gaps in the conversation flow.
Cheap model usage: Always use your most economical model (Haiku, GPT-4o-mini, Gemini Flash) for summarization. The cost savings compound over thousands of requests.
Pattern 2: Hierarchical Summaries
For long-running conversations (hundreds of turns), maintain multiple summary levels:
interface HierarchicalContext {
// Detailed summary of recent messages
detailedSummary: string;
// Brief summary of older conversation
briefSummary: string;
// Full text of most recent exchanges
recentMessages: Message[];
}
Progressive summarization:
- Level 1: Recent messages (last 5-10 turns) in full
- Level 2: Detailed summary of messages 10-50
- Level 3: Brief summary of messages 50+
This approach maintains context relevance at different token budgets.
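One way to assemble the three levels, as a minimal sketch: `buildHierarchicalContext` is hypothetical, and the `summarize` callback stands in for the LLM call from Pattern 1 so the level-assembly logic stays testable without a provider:

```typescript
interface Message {
  role: string;
  content: string;
}

interface HierarchicalContext {
  detailedSummary: string;
  briefSummary: string;
  recentMessages: Message[];
}

// `summarize` stands in for an LLM call (see Pattern 1); it is
// injected so the assembly logic can be tested without a provider.
function buildHierarchicalContext(
  messages: Message[],
  summarize: (msgs: Message[], style: "detailed" | "brief") => string,
): HierarchicalContext {
  const recent = messages.slice(-10);      // Level 1: last 10 turns, verbatim
  const middle = messages.slice(-50, -10); // Level 2: turns 10-50 back
  const oldest = messages.slice(0, -50);   // Level 3: everything older
  return {
    detailedSummary: middle.length ? summarize(middle, "detailed") : "",
    briefSummary: oldest.length ? summarize(oldest, "brief") : "",
    recentMessages: recent,
  };
}
```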
Pattern 3: Sliding Window Plus Summary Hybrid
Combine sliding window pruning with LLM summarization based on context window fullness:
const contextStrategy = {
// 0-70% usage: no action
healthy: { action: "none", threshold: 0.7 },
// 70-90% usage: summarize older messages
warning: { action: "summarize", threshold: 0.9 },
// 90-100% usage: sliding window pruning
critical: { action: "sliding-window", threshold: 1.0 },
// Over 100%: aggressive truncation
overflow: { action: "truncate", keep: 0.5 },
};
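A selector over those tiers might look like this (the function name is illustrative); it maps utilization, i.e. tokens used divided by the context window, to the action for that band:

```typescript
type ContextAction = "none" | "summarize" | "sliding-window" | "truncate";

// Map context-window utilization (tokensUsed / contextWindow)
// to the action tiers defined above.
function selectAction(usage: number): ContextAction {
  if (usage <= 0.7) return "none";           // healthy
  if (usage <= 0.9) return "summarize";      // warning
  if (usage <= 1.0) return "sliding-window"; // critical
  return "truncate";                         // overflow
}
```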
Pattern 4: Four-Stage Compaction Pipeline (Production-Grade)
NeuroLink implements a sophisticated pipeline that applies progressively more expensive optimizations only when necessary:
const neurolink = new NeuroLink({
conversationMemory: {
enabled: true,
contextCompaction: {
enabled: true,
// Start compaction at 80% of context window
threshold: 0.8,
// Stage 1: Remove old tool outputs
enablePruning: true,
// Stage 2: Deduplicate file reads
enableDeduplication: true,
// Stage 3: LLM summarization
enableSummarization: true,
// Stage 4: Sliding window truncation
enableSlidingWindow: true,
},
},
});
Stage 1: Tool Output Pruning
Replace old tool outputs with compact stubs:
BEFORE: Full JSON API response (2000 tokens)
AFTER: [Tool Result: weather API returned 73°F, sunny]
This stage costs nothing—no LLM calls required.
Stage 2: File Read Deduplication
If the AI read the same file multiple times, keep only the most recent:
BEFORE: Full file content repeated 3 times
AFTER: [Referenced: README.md (read 3x, showing latest)]
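Deduplication can be sketched in the same spirit; `dedupeFileReads` is a hypothetical helper that keeps only the latest read of each file and turns earlier reads into reference stubs:

```typescript
interface Message {
  role: string;
  content: string;
  filePath?: string; // set when the message is a file-read result
}

// Keep only the latest read of each file; earlier reads become
// short reference stubs noting how many times the file was read.
function dedupeFileReads(messages: Message[]): Message[] {
  const counts = new Map<string, number>();
  const lastIndex = new Map<string, number>();
  messages.forEach((m, i) => {
    if (m.filePath) {
      counts.set(m.filePath, (counts.get(m.filePath) ?? 0) + 1);
      lastIndex.set(m.filePath, i);
    }
  });
  return messages.map((m, i) => {
    if (m.filePath && lastIndex.get(m.filePath) !== i) {
      const n = counts.get(m.filePath);
      return { ...m, content: `[Referenced: ${m.filePath} (read ${n}x, showing latest)]` };
    }
    return m;
  });
}
```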
Stage 3: LLM Summarization
When cheaper options are insufficient, invoke the fast model to condense older messages.
Stage 4: Sliding Window Truncation
Last resort: remove oldest messages entirely while preserving critical flagged content.
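Truncation-with-preservation can be sketched as follows; `slidingWindowTruncate` is illustrative, not the library's implementation:

```typescript
interface Message {
  content: string;
  preserveInSummary?: boolean; // flagged critical content
}

// Drop the oldest messages until under budget, but never drop
// messages flagged as critical; original ordering is preserved.
function slidingWindowTruncate(messages: Message[], maxMessages: number): Message[] {
  const critical = messages.filter((m) => m.preserveInSummary);
  const rest = messages.filter((m) => !m.preserveInSummary);
  const room = Math.max(maxMessages - critical.length, 0);
  const keptRest = rest.slice(-room); // keep only the newest non-critical
  return messages.filter((m) => m.preserveInSummary || keptRest.includes(m));
}
```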
Token Budget Strategies
Set context limits based on your use case:
| Use Case | Recommended Limit | Rationale |
|---|---|---|
| Chatbot | 4K-8K tokens | Fast responses, cost control |
| Code assistant | 16K-32K tokens | Larger code context needed |
| Document analysis | 32K-100K tokens | Processing large documents |
| Long-form writing | 8K-16K tokens | Balance context and cost |
The BudgetChecker estimates tokens across five categories:
interface TokenBudget {
systemPrompt: number;
toolDefinitions: number;
conversationHistory: number;
currentUserPrompt: number;
fileAttachments: number;
}
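The BudgetChecker's internals aren't shown here, but summing the categories and checking them against a threshold might look like this sketch. The ~4-characters-per-token estimate is a common heuristic, not an exact count; use your model's tokenizer for precision:

```typescript
interface TokenBudget {
  systemPrompt: number;
  toolDefinitions: number;
  conversationHistory: number;
  currentUserPrompt: number;
  fileAttachments: number;
}

// Rough estimate: ~4 characters per token is a common heuristic
// for English text; exact counts require the model's tokenizer.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function totalBudget(b: TokenBudget): number {
  return (
    b.systemPrompt + b.toolDefinitions + b.conversationHistory +
    b.currentUserPrompt + b.fileAttachments
  );
}

// True when usage crosses the compaction threshold (80% by default).
function overThreshold(b: TokenBudget, contextWindow: number, threshold = 0.8): boolean {
  return totalBudget(b) > threshold * contextWindow;
}
```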
Monitoring and Observability
Track these metrics to tune your summarization strategy:
// Compaction effectiveness
{
"context.tokensBefore": 14520,
"context.tokensAfter": 4200,
"context.stage": "stage3-summarization",
"context.duration": 245
}
// Budget utilization
{
"context.budgetUsage": 0.87,
"context.threshold": 0.80,
"context.action": "trigger-compaction"
}
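A small helper can assemble these metric payloads from before/after token counts; the shape mirrors the events above, and the added compression ratio is my own convenience field, not something the SDK emits:

```typescript
// Build a compaction-effectiveness event from raw measurements.
// `compressionRatio` is an added convenience field (assumption).
function compactionMetrics(
  tokensBefore: number,
  tokensAfter: number,
  stage: string,
  startedAt: number,
) {
  return {
    "context.tokensBefore": tokensBefore,
    "context.tokensAfter": tokensAfter,
    "context.compressionRatio": tokensAfter / tokensBefore,
    "context.stage": stage,
    "context.duration": Date.now() - startedAt,
  };
}
```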
Key Principle
Not all context is equally valuable.
Old tool outputs, duplicate file reads, and verbatim conversation history from 50 turns ago contribute far less than a well-crafted summary. By reducing context strategically rather than uniformly, you maintain model focus while controlling costs across indefinitely long conversations.
NeuroLink — The Universal AI SDK for TypeScript
- GitHub: github.com/juspay/neurolink
- Install: npm install @juspay/neurolink
- Docs: docs.neurolink.ink
- Blog: blog.neurolink.ink — 150+ technical articles