DEV Community

Atlas Whoff
LLM Context Windows: Managing Tokens in Production AI Apps

The Token Budget Problem

Anthropic's Claude Sonnet (claude-sonnet-4-6) has a 200k-token context window. OpenAI's GPT-4o has 128k. These numbers sound enormous until you're building a RAG application that has to fit document context, conversation history, the system prompt, and tool definitions into a single request.

Exceeding the context window mid-conversation is an unrecoverable request failure. Managing the token budget is an engineering discipline in its own right.
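Before any code, the arithmetic is worth internalizing: every component of a request draws from the same pool, and the response reservation counts too. A minimal sketch (the `TokenBudget` shape, `fitsInWindow` helper, and the numbers are illustrative, not from any SDK):

```typescript
// Hypothetical budget check: sum the pieces you plan to send and
// verify they fit under the model's context window.
interface TokenBudget {
  system: number;
  history: number;
  retrieved: number;
  responseReserve: number;
}

function fitsInWindow(budget: TokenBudget, windowSize: number): boolean {
  const total =
    budget.system + budget.history + budget.retrieved + budget.responseReserve;
  return total <= windowSize;
}

// 2k system + 20k history + 40k docs + 4k reserved = 66k, fits in 128k
console.log(
  fitsInWindow(
    { system: 2000, history: 20_000, retrieved: 40_000, responseReserve: 4000 },
    128_000
  )
); // true
```

The point of writing it down explicitly: if any one component is unbounded (usually history or retrieved documents), the sum eventually exceeds the window no matter how large the model's limit is.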

Counting Tokens

import Anthropic from '@anthropic-ai/sdk';
import { encoding_for_model, type TiktokenModel } from 'tiktoken'; // for OpenAI models

// Anthropic: use the API's token-counting endpoint
const anthropic = new Anthropic();

async function countTokens(messages: Anthropic.MessageParam[]) {
  const response = await anthropic.messages.countTokens({
    model: 'claude-sonnet-4-6',
    messages,
    system: 'You are a helpful assistant.',
  });
  return response.input_tokens;
}

// OpenAI: use tiktoken locally (no API call needed)
function countOpenAITokens(text: string, model: TiktokenModel = 'gpt-4o'): number {
  const enc = encoding_for_model(model);
  const tokens = enc.encode(text);
  enc.free(); // tiktoken encoders hold WASM memory; free them when done
  return tokens.length;
}

Conversation History Management

Naive approach: append every message forever → context overflow.

type Message = { role: 'user' | 'assistant'; content: string };

const MAX_CONTEXT_TOKENS = 100_000;

function truncateHistory(
  history: Message[],
  systemPrompt: string,
  newMessage: string
): Message[] {
  const systemTokens = countOpenAITokens(systemPrompt);
  const newMessageTokens = countOpenAITokens(newMessage);
  const reservedForResponse = 4000;

  const budget =
    MAX_CONTEXT_TOKENS - systemTokens - newMessageTokens - reservedForResponse;

  let usedTokens = 0;
  const kept: Message[] = [];

  // Walk from newest to oldest, keeping messages until the budget is spent
  for (const msg of [...history].reverse()) {
    const tokens = countOpenAITokens(msg.content);
    if (usedTokens + tokens > budget) break;
    usedTokens += tokens;
    kept.unshift(msg);
  }

  return kept; // caller appends the new message after this
}

Summarization Strategy

Instead of dropping old messages, summarize them:

async function summarizeHistory(messages: Message[]): Promise<string> {
  const conversation = messages
    .map(m => `${m.role}: ${m.content}`)
    .join('\n');

  const summary = await anthropic.messages.create({
    model: 'claude-haiku-4-5-20251001', // cheap model for summarization
    max_tokens: 500,
    messages: [{
      role: 'user',
      content: `Summarize this conversation concisely, preserving key facts and decisions:\n\n${conversation}`,
    }],
  });

  return summary.content[0].type === 'text' ? summary.content[0].text : '';
}

// When history exceeds threshold, summarize the oldest half
async function manageContext(history: Message[]): Promise<Message[]> {
  const totalTokens = history.reduce((sum, m) => sum + countOpenAITokens(m.content), 0);

  if (totalTokens < 80_000) return history;

  const midpoint = Math.floor(history.length / 2);
  const old = history.slice(0, midpoint);
  const recent = history.slice(midpoint);

  const summary = await summarizeHistory(old);

  return [
    { role: 'user', content: `[Earlier conversation summary: ${summary}]` },
    ...recent,
  ];
}
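The split-and-replace step is worth isolating as a pure function so it can be unit-tested without calling the API. A sketch (`foldSummary` is a hypothetical helper name; the `Msg` shape follows the messages used above, and the summary string is supplied by the caller):

```typescript
type Msg = { role: 'user' | 'assistant'; content: string };

// Replace the oldest half of a conversation with a single summary message,
// keeping the most recent messages verbatim.
function foldSummary(history: Msg[], summary: string): Msg[] {
  const midpoint = Math.floor(history.length / 2);
  const recent = history.slice(midpoint);
  return [
    { role: 'user', content: `[Earlier conversation summary: ${summary}]` },
    ...recent,
  ];
}
```

Keeping the fold pure also means you can retry or swap the summarization model without touching the history-management logic.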

RAG Context Budget

const CONTEXT_BUDGET = {
  system: 2000,        // system prompt
  history: 20000,      // conversation history
  retrieved: 40000,    // RAG documents
  response: 4000,      // reserved for response
  // total: 66000 — safe margin under 100k
};

// vectorSearch is assumed to be your retrieval layer (e.g. a pgvector or Pinecone query)
async function buildPrompt(query: string, history: Message[]) {
  // 1. Retrieve candidate documents
  const docs = await vectorSearch(query, 10);

  // 2. Greedily fit documents into the retrieval budget
  let docTokens = 0;
  const fittedDocs: string[] = [];

  for (const doc of docs) {
    const tokens = countOpenAITokens(doc.content);
    if (docTokens + tokens > CONTEXT_BUDGET.retrieved) break;
    fittedDocs.push(doc.content);
    docTokens += tokens;
  }

  // 3. Truncate history to its slice of the budget
  const truncatedHistory = truncateHistory(history, '', query);

  return {
    context: fittedDocs.join('\n\n---\n\n'),
    history: truncatedHistory,
  };
}
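The last step is stitching the pieces into a message array. One way to do it (a sketch; the `assembleMessages` helper and the prompt wording are my own, not part of any SDK):

```typescript
type ChatMessage = { role: 'user' | 'assistant'; content: string };

// Combine retrieved context, truncated history, and the user's query
// into the final message list sent to the model.
function assembleMessages(
  context: string,
  history: ChatMessage[],
  query: string
): ChatMessage[] {
  return [
    ...history,
    {
      role: 'user',
      content: `Answer using these documents:\n\n${context}\n\nQuestion: ${query}`,
    },
  ];
}
```

Embedding the documents in the final user turn (rather than the system prompt) keeps the system prompt stable across requests, which matters if you rely on prompt caching.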

Streaming for Long Responses

// Don't wait for the full response — stream it
const stream = anthropic.messages.stream({
  model: 'claude-sonnet-4-6',
  max_tokens: 4096,
  messages,
});

for await (const event of stream) {
  if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
    process.stdout.write(event.delta.text); // or send to SSE
  }
}

const finalMessage = await stream.finalMessage();
console.log(`Used ${finalMessage.usage.input_tokens} input tokens`);
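If you also need the full text after streaming (for logging or caching), accumulating the deltas is straightforward. This sketch mirrors the event handling above with a simplified event type so it stands alone:

```typescript
// Simplified stand-in for the SDK's stream event union
type StreamEvent = {
  type: string;
  delta?: { type: string; text?: string };
};

// Collect text deltas into the final response string,
// ignoring non-text events (message_start, message_stop, etc.)
function accumulateText(events: StreamEvent[]): string {
  let out = '';
  for (const e of events) {
    if (e.type === 'content_block_delta' && e.delta?.type === 'text_delta') {
      out += e.delta.text ?? '';
    }
  }
  return out;
}
```

In practice the SDK's `stream.finalMessage()` gives you the same thing, but an accumulator like this is useful when you're forwarding events over SSE and only the server-side handler sees them.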

Cost Estimation

const PRICING: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-6': { input: 3.0, output: 15.0 }, // USD per million tokens
  'claude-haiku-4-5-20251001': { input: 0.25, output: 1.25 },
  'gpt-4o': { input: 2.5, output: 10.0 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
};

function estimateCost(inputTokens: number, outputTokens: number, model: string): number {
  const prices = PRICING[model];
  if (!prices) throw new Error(`Unknown model: ${model}`);
  return (
    (inputTokens / 1_000_000) * prices.input +
    (outputTokens / 1_000_000) * prices.output
  );
}

// Log costs per request
console.log(`Cost: $${estimateCost(50_000, 2000, 'claude-sonnet-4-6').toFixed(4)}`);
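In production you usually want running totals per session rather than one-off logs. A minimal tracker (the `CostTracker` class is a hypothetical helper; the rate table repeats a subset of the pricing above so the snippet is self-contained):

```typescript
const RATES: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-6': { input: 3.0, output: 15.0 }, // USD per million tokens
  'gpt-4o': { input: 2.5, output: 10.0 },
};

class CostTracker {
  private totalUsd = 0;

  // Record one request's usage; returns that request's cost in USD
  record(model: string, inputTokens: number, outputTokens: number): number {
    const r = RATES[model];
    if (!r) throw new Error(`Unknown model: ${model}`);
    const cost =
      (inputTokens / 1_000_000) * r.input +
      (outputTokens / 1_000_000) * r.output;
    this.totalUsd += cost;
    return cost;
  }

  get total(): number {
    return this.totalUsd;
  }
}
```

Feed it the `usage` object the API returns with each response and you get per-session spend for free, which is the number you actually want on a dashboard.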

Context management is the difference between an AI app that works in demos and one that works in production after 50 messages.


AI integration patterns with context management, RAG, and streaming: see Whoff Agents for MCP tools that handle this automatically.
