Conversation Summarization: Smart Context Management for Long Chats
Finite context windows. Growing conversations. Escalating token costs.
If you're building AI applications with multi-turn conversations, you've faced this problem. Every message adds to the context. Soon you're paying to resend the entire conversation history with every new request. Response quality degrades as the model loses focus. And eventually, you hit the context window limit.
Strategic summarization solves this by condensing older messages while preserving critical information.
When Summarization Becomes Essential
Summarization isn't always necessary. You should consider it when:
- Turn count exceeds 15-20 exchanges — token costs of full history dominate
- Tool outputs inflate context — a single API response can inject thousands of tokens
- Multi-session continuity is required — persistence needs manageable storage costs
- Response quality degrades — models lose focus in very large prompts
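The triggers above can be folded into a simple gating heuristic. This is an illustrative sketch, not a NeuroLink API; the thresholds are assumptions to tune for your workload:

```typescript
// Illustrative heuristic combining the triggers above.
// Thresholds are assumptions; tune them for your workload.
interface ConversationStats {
  turnCount: number;
  historyTokens: number; // estimated tokens in the full history
  contextWindow: number; // model's context window, in tokens
  multiSession: boolean; // conversation persists across sessions
}

function shouldSummarize(s: ConversationStats): boolean {
  return (
    s.turnCount > 15 ||
    s.historyTokens > 0.7 * s.contextWindow ||
    s.multiSession
  );
}
```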
Pattern 1: LLM-Powered Condensation
The simplest approach uses an LLM to condense older messages into concise paragraphs:
import { NeuroLink } from "@juspay/neurolink";
const neurolink = new NeuroLink({
conversationMemory: {
enabled: true,
enableSummarization: true,
summarizationOptions: {
// Trigger summarization at this message count
messageThreshold: 10,
// Use a cheap model for summarization
model: "claude-3-5-haiku",
// Preserve the most recent N messages verbatim
preserveRecentMessages: 5,
},
},
});
Key design features:
Important-message preservation: Flag messages that must survive summarization:
await neurolink.generate({
input: { text: "Confirm: wire $10,000 to account ending 4521" },
metadata: {
preserveInSummary: true, // Won't be condensed
importance: "critical",
},
});
Half-and-half splitting: Only the older half of messages gets condensed, avoiding awkward gaps in the conversation flow.
Cheap model usage: Always use your most economical model (Haiku, GPT-4o-mini, Gemini Flash) for summarization. The cost savings compound over thousands of requests.
Pattern 2: Hierarchical Summaries
For long-running conversations (hundreds of turns), maintain multiple summary levels:
interface HierarchicalContext {
// Detailed summary of recent messages
detailedSummary: string;
// Brief summary of older conversation
briefSummary: string;
// Full text of most recent exchanges
recentMessages: Message[];
}
Progressive summarization:
- Level 1: Recent messages (last 5-10 turns) in full
- Level 2: Detailed summary of messages 10-50
- Level 3: Brief summary of messages 50+
This approach maintains context relevance at different token budgets.
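One way to assemble the three levels, as a minimal sketch: `buildHierarchicalContext` is hypothetical, and the `summarize` callback stands in for the LLM call from Pattern 1 so the level-assembly logic stays testable without a provider:

```typescript
interface Message {
  role: string;
  content: string;
}

interface HierarchicalContext {
  detailedSummary: string;
  briefSummary: string;
  recentMessages: Message[];
}

// `summarize` stands in for an LLM call (see Pattern 1); it is
// injected so the assembly logic can be tested without a provider.
function buildHierarchicalContext(
  messages: Message[],
  summarize: (msgs: Message[], style: "detailed" | "brief") => string,
): HierarchicalContext {
  const recent = messages.slice(-10);      // Level 1: last 10 turns, verbatim
  const middle = messages.slice(-50, -10); // Level 2: turns 10-50 back
  const oldest = messages.slice(0, -50);   // Level 3: everything older
  return {
    detailedSummary: middle.length ? summarize(middle, "detailed") : "",
    briefSummary: oldest.length ? summarize(oldest, "brief") : "",
    recentMessages: recent,
  };
}
```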
Pattern 3: Sliding Window Plus Summary Hybrid
Combine sliding window pruning with LLM summarization based on context window fullness:
const contextStrategy = {
// 0-70% usage: no action
healthy: { action: "none", threshold: 0.7 },
// 70-90% usage: summarize older messages
warning: { action: "summarize", threshold: 0.9 },
// 90-100% usage: sliding window pruning
critical: { action: "sliding-window", threshold: 1.0 },
// Over 100%: aggressive truncation
overflow: { action: "truncate", keep: 0.5 },
};
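A selector over those tiers might look like this (the function name is illustrative); it maps utilization, i.e. tokens used divided by the context window, to the action for that band:

```typescript
type ContextAction = "none" | "summarize" | "sliding-window" | "truncate";

// Map context-window utilization (tokensUsed / contextWindow)
// to the action tiers defined above.
function selectAction(usage: number): ContextAction {
  if (usage <= 0.7) return "none";           // healthy
  if (usage <= 0.9) return "summarize";      // warning
  if (usage <= 1.0) return "sliding-window"; // critical
  return "truncate";                         // overflow
}
```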
Pattern 4: Four-Stage Compaction Pipeline (Production-Grade)
NeuroLink implements a sophisticated pipeline that applies progressively more expensive optimizations only when necessary:
const neurolink = new NeuroLink({
conversationMemory: {
enabled: true,
contextCompaction: {
enabled: true,
// Start compaction at 80% of context window
threshold: 0.8,
// Stage 1: Remove old tool outputs
enablePruning: true,
// Stage 2: Deduplicate file reads
enableDeduplication: true,
// Stage 3: LLM summarization
enableSummarization: true,
// Stage 4: Sliding window truncation
enableSlidingWindow: true,
},
},
});
Stage 1: Tool Output Pruning
Replace old tool outputs with compact stubs:
BEFORE: Full JSON API response (2000 tokens)
AFTER: [Tool Result: weather API returned 73°F, sunny]
This stage costs nothing—no LLM calls required.
Stage 2: File Read Deduplication
If the AI read the same file multiple times, keep only the most recent:
BEFORE: Full file content repeated 3 times
AFTER: [Referenced: README.md (read 3x, showing latest)]
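Deduplication can be sketched in the same spirit; `dedupeFileReads` is a hypothetical helper that keeps only the latest read of each file and turns earlier reads into reference stubs:

```typescript
interface Message {
  role: string;
  content: string;
  filePath?: string; // set when the message is a file-read result
}

// Keep only the latest read of each file; earlier reads become
// short reference stubs noting how many times the file was read.
function dedupeFileReads(messages: Message[]): Message[] {
  const counts = new Map<string, number>();
  const lastIndex = new Map<string, number>();
  messages.forEach((m, i) => {
    if (m.filePath) {
      counts.set(m.filePath, (counts.get(m.filePath) ?? 0) + 1);
      lastIndex.set(m.filePath, i);
    }
  });
  return messages.map((m, i) => {
    if (m.filePath && lastIndex.get(m.filePath) !== i) {
      const n = counts.get(m.filePath);
      return { ...m, content: `[Referenced: ${m.filePath} (read ${n}x, showing latest)]` };
    }
    return m;
  });
}
```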
Stage 3: LLM Summarization
When cheaper options are insufficient, invoke the fast model to condense older messages.
Stage 4: Sliding Window Truncation
Last resort: remove oldest messages entirely while preserving critical flagged content.
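Truncation-with-preservation can be sketched as follows; `slidingWindowTruncate` is illustrative, not the library's implementation:

```typescript
interface Message {
  content: string;
  preserveInSummary?: boolean; // flagged critical content
}

// Drop the oldest messages until under budget, but never drop
// messages flagged as critical; original ordering is preserved.
function slidingWindowTruncate(messages: Message[], maxMessages: number): Message[] {
  const critical = messages.filter((m) => m.preserveInSummary);
  const rest = messages.filter((m) => !m.preserveInSummary);
  const room = Math.max(maxMessages - critical.length, 0);
  const keptRest = rest.slice(-room); // keep only the newest non-critical
  return messages.filter((m) => m.preserveInSummary || keptRest.includes(m));
}
```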
Token Budget Strategies
Set context limits based on your use case:
| Use Case | Recommended Limit | Rationale |
|---|---|---|
| Chatbot | 4K-8K tokens | Fast responses, cost control |
| Code assistant | 16K-32K tokens | Larger code context needed |
| Document analysis | 32K-100K tokens | Processing large documents |
| Long-form writing | 8K-16K tokens | Balance context and cost |
The BudgetChecker estimates tokens across five categories:
interface TokenBudget {
systemPrompt: number;
toolDefinitions: number;
conversationHistory: number;
currentUserPrompt: number;
fileAttachments: number;
}
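The BudgetChecker's internals aren't shown here, but summing the categories and checking them against a threshold might look like this sketch. The ~4-characters-per-token estimate is a common heuristic, not an exact count; use your model's tokenizer for precision:

```typescript
interface TokenBudget {
  systemPrompt: number;
  toolDefinitions: number;
  conversationHistory: number;
  currentUserPrompt: number;
  fileAttachments: number;
}

// Rough estimate: ~4 characters per token is a common heuristic
// for English text; exact counts require the model's tokenizer.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function totalBudget(b: TokenBudget): number {
  return (
    b.systemPrompt + b.toolDefinitions + b.conversationHistory +
    b.currentUserPrompt + b.fileAttachments
  );
}

// True when usage crosses the compaction threshold (80% by default).
function overThreshold(b: TokenBudget, contextWindow: number, threshold = 0.8): boolean {
  return totalBudget(b) > threshold * contextWindow;
}
```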
Monitoring and Observability
Track these metrics to tune your summarization strategy:
// Compaction effectiveness
{
"context.tokensBefore": 14520,
"context.tokensAfter": 4200,
"context.stage": "stage3-summarization",
"context.duration": 245
}
// Budget utilization
{
"context.budgetUsage": 0.87,
"context.threshold": 0.80,
"context.action": "trigger-compaction"
}
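A small helper can assemble these metric payloads from before/after token counts; the shape mirrors the events above, and the added compression ratio is my own convenience field, not something the SDK emits:

```typescript
// Build a compaction-effectiveness event from raw measurements.
// `compressionRatio` is an added convenience field (assumption).
function compactionMetrics(
  tokensBefore: number,
  tokensAfter: number,
  stage: string,
  startedAt: number,
) {
  return {
    "context.tokensBefore": tokensBefore,
    "context.tokensAfter": tokensAfter,
    "context.compressionRatio": tokensAfter / tokensBefore,
    "context.stage": stage,
    "context.duration": Date.now() - startedAt,
  };
}
```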
Key Principle
Not all context is equally valuable.
Old tool outputs, duplicate file reads, and verbatim conversation history from 50 turns ago contribute far less than a well-crafted summary. By reducing context strategically rather than uniformly, you maintain model focus while controlling costs across indefinitely long conversations.
NeuroLink — The Universal AI SDK for TypeScript
- GitHub: github.com/juspay/neurolink
- Install: npm install @juspay/neurolink
- Docs: docs.neurolink.ink
- Blog: blog.neurolink.ink — 150+ technical articles