LLM Context Windows: What They Are and How to Work Within Them
Context windows are the fundamental constraint of working with LLMs. Understanding them unlocks more effective prompting and architecture.
What Is a Context Window?
The context window is the total amount of text an LLM can 'see' at once — both input and output. Everything outside the window is invisible to the model.
Current limits:
- Claude Sonnet (claude-sonnet-4-6): 200K tokens (~150K words)
- GPT-4o: 128K tokens
- Gemini 1.5 Pro: 1M tokens
One token ≈ 0.75 English words. Code tokenizes more densely, so budget more tokens per character than you would for prose.
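That rule of thumb can be turned into a quick estimator. A minimal sketch, assuming the common ~4-characters-per-token heuristic (an approximation; real tokenizers vary by model and language):

```typescript
// Rough token estimate using the ~4 characters/token heuristic.
// For tight budgets, use a real tokenizer for your model instead.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const prose = 'The quick brown fox jumps over the lazy dog.';
console.log(estimateTokens(prose)); // → 11 tokens for a 9-word sentence
```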
Why Context Size Matters
Too small: model can't see all relevant information → incomplete or inconsistent outputs.
Too large: cost increases linearly with context. A 200K token prompt costs ~100x a 2K prompt.
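To make the cost scaling concrete, here is a back-of-the-envelope calculator. The $3-per-million-input-tokens price is a placeholder assumption; check your provider's actual rates:

```typescript
// Illustrative input-token cost model. PRICE_PER_MTOK is a hypothetical
// rate in USD per million tokens, not any provider's real pricing.
const PRICE_PER_MTOK = 3.0;
const costUSD = (tokens: number) => (tokens / 1_000_000) * PRICE_PER_MTOK;

console.log(costUSD(200_000) / costUSD(2_000)); // ≈100x cost ratio
```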
Chunking for Long Documents
```typescript
function chunkText(text: string, maxTokens = 4000): string[] {
  // Rough estimate: 1 token ≈ 4 characters
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  // Split on paragraph boundaries when possible
  const paragraphs = text.split('\n\n');
  let currentChunk = '';
  for (const para of paragraphs) {
    // +2 accounts for the '\n\n' joiner added below
    if (currentChunk && currentChunk.length + para.length + 2 > maxChars) {
      chunks.push(currentChunk.trim());
      currentChunk = para;
    } else {
      currentChunk = currentChunk ? currentChunk + '\n\n' + para : para;
    }
  }
  if (currentChunk) chunks.push(currentChunk.trim());
  return chunks;
}
```
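A common refinement is to overlap adjacent chunks so that text near a boundary appears in both, which helps retrieval quality. A minimal sliding-window sketch (character-based; the sizes are illustrative):

```typescript
// Sliding-window chunking with overlap: the tail of each chunk is
// repeated at the head of the next, so boundary content isn't lost.
function chunkWithOverlap(
  text: string,
  maxChars = 16000,
  overlapChars = 1000
): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlapChars; // step back to create the overlap
  }
  return chunks;
}
```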
RAG: Retrieval-Augmented Generation
Instead of stuffing everything into context, retrieve only what's relevant:
```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// `embed` and `vectorSearch` are your own embedding and vector-store helpers.
async function answerWithRAG(question: string): Promise<string> {
  // 1. Embed the question
  const questionEmbedding = await embed(question);
  // 2. Find relevant chunks from your knowledge base
  const relevantChunks = await vectorSearch(questionEmbedding, { topK: 5 });
  // 3. Build context from retrieved chunks only
  const context = relevantChunks.map(c => c.text).join('\n\n');
  // 4. Answer using the retrieved context
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Context:\n${context}\n\nQuestion: ${question}`,
    }],
  });
  return response.content[0].type === 'text' ? response.content[0].text : '';
}
```
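The `vectorSearch` step above can be backed by a real vector database, but for small corpora an in-memory cosine-similarity scan works. A minimal stand-in sketch (the `Chunk` shape and the local search are illustrative, not a library API):

```typescript
// Minimal in-memory vector search over pre-embedded chunks,
// ranked by cosine similarity to the query embedding.
interface Chunk {
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function vectorSearchLocal(query: number[], chunks: Chunk[], topK = 5): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, topK);
}
```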
Conversation Memory Management
```typescript
type Role = 'system' | 'user' | 'assistant';
interface Message {
  role: Role;
  content: string;
}

// `estimateTokens` is a rough counter (e.g. chars / 4); swap in a real
// tokenizer for tighter budgets.
function manageConversationMemory(
  messages: Message[],
  maxTokens = 100000
): Message[] {
  // Keep the system message plus as many recent messages as fit
  const systemMsg = messages.find(m => m.role === 'system');
  const conversation = messages.filter(m => m.role !== 'system');
  let tokenCount = estimateTokens(systemMsg?.content ?? '');
  const kept: Message[] = [];
  // Walk backwards, keeping the most recent messages
  for (let i = conversation.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(conversation[i].content);
    if (tokenCount + msgTokens > maxTokens) break;
    tokenCount += msgTokens;
    kept.unshift(conversation[i]);
  }
  return systemMsg ? [systemMsg, ...kept] : kept;
}
```
MCP and Context Efficiency
MCP tools reduce context waste — instead of pasting documentation into the prompt, the model calls a tool to fetch exactly what it needs. The Workflow Automator MCP gives agents tools to fetch live data, keeping context lean and responses accurate.