LLM Context Windows: What They Are and How to Work Within Them
Context windows are the fundamental constraint of working with LLMs. Understanding them unlocks more effective prompting and architecture.
What Is a Context Window?
The context window is the total amount of text an LLM can 'see' at once — both input and output. Everything outside the window is invisible to the model.
Current limits:
- Claude Sonnet (claude-sonnet-4-6): 200K tokens (~150K words)
- GPT-4o: 128K tokens
- Gemini 1.5 Pro: 1M tokens
One token ≈ 0.75 English words. Code tokenizes more densely, so budget more tokens per character than you would for prose.
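That rule of thumb can be turned into a quick estimator. A minimal sketch, assuming the common ~4-characters-per-token heuristic (an approximation; real tokenizers vary by model and language):

```typescript
// Rough token estimate using the ~4 characters/token heuristic.
// For tight budgets, use a real tokenizer for your model instead.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const prose = 'The quick brown fox jumps over the lazy dog.';
console.log(estimateTokens(prose)); // → 11 tokens for a 9-word sentence
```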
Why Context Size Matters
Too small: model can't see all relevant information → incomplete or inconsistent outputs.
Too large: cost increases linearly with context. A 200K token prompt costs ~100x a 2K prompt.
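To make the cost scaling concrete, here is a back-of-the-envelope calculator. The $3-per-million-input-tokens price is a placeholder assumption; check your provider's actual rates:

```typescript
// Illustrative input-token cost model. PRICE_PER_MTOK is a hypothetical
// rate in USD per million tokens, not any provider's real pricing.
const PRICE_PER_MTOK = 3.0;
const costUSD = (tokens: number) => (tokens / 1_000_000) * PRICE_PER_MTOK;

console.log(costUSD(200_000) / costUSD(2_000)); // ≈100x cost ratio
```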
Chunking for Long Documents
```typescript
function chunkText(text: string, maxTokens = 4000): string[] {
  // Rough estimate: 1 token ≈ 4 characters
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  // Split on paragraph boundaries when possible
  const paragraphs = text.split('\n\n');
  let currentChunk = '';
  for (const para of paragraphs) {
    // +2 accounts for the '\n\n' joiner added below
    if (currentChunk && currentChunk.length + para.length + 2 > maxChars) {
      chunks.push(currentChunk.trim());
      currentChunk = para;
    } else {
      currentChunk = currentChunk ? currentChunk + '\n\n' + para : para;
    }
  }
  if (currentChunk) chunks.push(currentChunk.trim());
  return chunks;
}
```
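A common refinement is to overlap adjacent chunks so that text near a boundary appears in both, which helps retrieval quality. A minimal sliding-window sketch (character-based; the sizes are illustrative):

```typescript
// Sliding-window chunking with overlap: the tail of each chunk is
// repeated at the head of the next, so boundary content isn't lost.
function chunkWithOverlap(
  text: string,
  maxChars = 16000,
  overlapChars = 1000
): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlapChars; // step back to create the overlap
  }
  return chunks;
}
```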
RAG: Retrieval-Augmented Generation
Instead of stuffing everything into context, retrieve only what's relevant:
```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// `embed` and `vectorSearch` are your own embedding and vector-store helpers.
async function answerWithRAG(question: string): Promise<string> {
  // 1. Embed the question
  const questionEmbedding = await embed(question);
  // 2. Find relevant chunks from your knowledge base
  const relevantChunks = await vectorSearch(questionEmbedding, { topK: 5 });
  // 3. Build context from retrieved chunks only
  const context = relevantChunks.map(c => c.text).join('\n\n');
  // 4. Answer using the retrieved context
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Context:\n${context}\n\nQuestion: ${question}`,
    }],
  });
  return response.content[0].type === 'text' ? response.content[0].text : '';
}
```
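The `vectorSearch` step above can be backed by a real vector database, but for small corpora an in-memory cosine-similarity scan works. A minimal stand-in sketch (the `Chunk` shape and the local search are illustrative, not a library API):

```typescript
// Minimal in-memory vector search over pre-embedded chunks,
// ranked by cosine similarity to the query embedding.
interface Chunk {
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function vectorSearchLocal(query: number[], chunks: Chunk[], topK = 5): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, topK);
}
```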
Conversation Memory Management
```typescript
type Role = 'system' | 'user' | 'assistant';
interface Message {
  role: Role;
  content: string;
}

// `estimateTokens` is a rough counter (e.g. chars / 4); swap in a real
// tokenizer for tighter budgets.
function manageConversationMemory(
  messages: Message[],
  maxTokens = 100000
): Message[] {
  // Keep the system message plus as many recent messages as fit
  const systemMsg = messages.find(m => m.role === 'system');
  const conversation = messages.filter(m => m.role !== 'system');
  let tokenCount = estimateTokens(systemMsg?.content ?? '');
  const kept: Message[] = [];
  // Walk backwards, keeping the most recent messages
  for (let i = conversation.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(conversation[i].content);
    if (tokenCount + msgTokens > maxTokens) break;
    tokenCount += msgTokens;
    kept.unshift(conversation[i]);
  }
  return systemMsg ? [systemMsg, ...kept] : kept;
}
```
MCP and Context Efficiency
MCP tools reduce context waste — instead of pasting documentation into the prompt, the model calls a tool to fetch exactly what it needs. The Workflow Automator MCP gives agents tools to fetch live data, keeping context lean and responses accurate.