Target Keyword: "llm context window optimization techniques"
Tags: llm,artificial-intelligence,programming,developer,performance
Type: Guide
Content
LLM Context Window Management: Techniques for Handling Long Documents
Every LLM has a context window limit — a maximum number of tokens you can pass in a single request. Claude 3.5 Sonnet offers 200K tokens, but that's still finite. Here's how to manage context efficiently for production AI applications.
Understanding Context Window Limits
| Model | Context Window | Approximate Pages |
|---|---|---|
| Claude 3.5 Sonnet | 200K tokens | ~500 pages |
| GPT-4 Turbo | 128K tokens | ~300 pages |
| Claude 3 Opus | 200K tokens | ~500 pages |
When you exceed the limit, the API returns an error. When you're close to it, you're likely paying for tokens that add little value.
Token Estimation
import re
def estimate_tokens(text: str) -> int:
"""
Rough token estimation.
~4 characters per token for English text.
"""
return len(text) // 4
def estimate_tokens_precise(text: str) -> int:
"""
More precise estimation using word count.
Average English word is ~1.3 tokens.
"""
words = len(re.findall(r'\w+', text))
return int(words * 1.3)
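If you need exact counts rather than estimates, use a real tokenizer. A minimal sketch, assuming the tiktoken package is installed; its cl100k_base encoding is OpenAI's, so counts for Claude models will differ slightly:

import tiktoken

def count_tokens(text: str) -> int:
    # cl100k_base approximates modern tokenizers; Claude's own counts may differ slightly
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))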
Technique 1: Semantic Chunking
Split documents by meaning, not by character count:
import re
def semantic_chunk(text: str, max_tokens: int = 4000, overlap: int = 200) -> list[str]:
"""
Split text into semantic chunks (paragraphs).
"""
# Split by double newlines (paragraphs)
paragraphs = re.split(r'\n\n+', text)
chunks = []
current_chunk = []
current_tokens = 0
for para in paragraphs:
para_tokens = estimate_tokens(para)
if current_tokens + para_tokens > max_tokens:
# Save current chunk
if current_chunk:
chunks.append('\n\n'.join(current_chunk))
# Start new chunk with overlap
overlap_paras = []
overlap_tokens = 0
for p in reversed(current_chunk):
t = estimate_tokens(p)
if overlap_tokens + t <= overlap:
overlap_paras.insert(0, p)
overlap_tokens += t
else:
break
current_chunk = overlap_paras + [para]
current_tokens = overlap_tokens + para_tokens
else:
current_chunk.append(para)
current_tokens += para_tokens
if current_chunk:
chunks.append('\n\n'.join(current_chunk))
return chunks
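For example (the input file and numbers below are illustrative):

document = open("annual_report.txt").read()   # hypothetical input document
chunks = semantic_chunk(document, max_tokens=4000, overlap=200)

print(f"Split into {len(chunks)} chunks")
print(f"Largest chunk ≈ {max(estimate_tokens(c) for c in chunks)} tokens")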
Technique 2: RAG — Retrieval-Augmented Generation
Don't put everything in the prompt. Retrieve only what's relevant:
import math

class SimpleRAG:
    def __init__(self, documents: list[str], chunk_size: int = 1000):
        self.chunks = self._create_chunks(documents, chunk_size)
        self.embeddings = self._create_embeddings(self.chunks)

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        """Find the most relevant chunks for a query."""
        query_embedding = self._embed(query)
        scores = [
            self._cosine_similarity(query_embedding, e)
            for e in self.embeddings
        ]
        top_indices = sorted(range(len(scores)),
                             key=lambda i: scores[i],
                             reverse=True)[:top_k]
        return [self.chunks[i] for i in top_indices]

    def _create_chunks(self, documents: list[str], chunk_size: int) -> list[str]:
        chunks = []
        for doc in documents:
            chunks.extend(semantic_chunk(doc, max_tokens=chunk_size))
        return chunks

    def _create_embeddings(self, chunks: list[str]) -> list[list[float]]:
        return [self._embed(chunk) for chunk in chunks]

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def _embed(self, text: str) -> list[float]:
        # In production, call an embeddings API (e.g. OpenAI or ofox.ai embeddings)
        raise NotImplementedError
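A usage sketch, assuming you've wired _embed up to a real embeddings API (the file name and question are made up):

handbook = open("employee_handbook.txt").read()   # hypothetical document
rag = SimpleRAG(documents=[handbook], chunk_size=1000)

question = "What is the parental leave policy?"
relevant = rag.retrieve(question, top_k=3)

# Only the retrieved chunks go into the prompt, not the whole handbook
prompt = (
    "Answer using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(relevant) +
    f"\n\nQuestion: {question}"
)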
Technique 3: Conversation Summary
Summarize older messages to preserve context:
class SummarizingConversation:
def __init__(self, max_tokens: int = 16000):
self.max_tokens = max_tokens
self.messages = []
self.summary = ""
def add_message(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
self._maybe_summarize()
    def _maybe_summarize(self):
        total_tokens = sum(estimate_tokens(m["content"]) for m in self.messages)
        if total_tokens > self.max_tokens:
            # Summarize everything except the 5 most recent messages
            older_messages = self.messages[:-5]
            recent = self.messages[-5:]
            history = "\n".join(f"{m['role']}: {m['content']}" for m in older_messages)
            summary_prompt = (
                "Summarize this conversation concisely, preserving key information:\n"
                f"{history}"
            )
            # call_llm_summarize is a placeholder for your own summarization call
            self.summary = call_llm_summarize(summary_prompt)
            self.messages = [{"role": "system", "content": f"Prior context: {self.summary}"}] + recent
def get_messages(self) -> list[dict]:
return self.messages
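A usage sketch (the messages are made up; call_llm_summarize is whatever wrapper you use for your summarization call):

conv = SummarizingConversation(max_tokens=16000)
conv.add_message("user", "Here are the requirements for the new billing system: ...")
conv.add_message("assistant", "Understood. Key requirements noted.")
# ...many turns later, older messages get folded into one summary message...
response_messages = conv.get_messages()   # summary message + the 5 most recent turns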
Technique 4: System Prompt Optimization
Keep system prompts lean:
# ❌ Verbose system prompt (wastes tokens)
verbose_system = """
You are a helpful AI assistant. You are designed to be respectful,
professional, and helpful. You should provide accurate information
and be honest when you don't know something. You should ...
[200 more words]
"""
# ✅ Lean system prompt (effective)
lean_system = """
Role: helpful AI assistant
Goal: provide accurate, concise answers
When unsure: say "I don't know"
"""
Technique 5: Streaming with Token Tracking
Track usage as tokens arrive. This sketch assumes an OpenAI-compatible streaming endpoint and the httpx async client:

import json
import os

import httpx  # assumed async HTTP client; any streaming-capable client works

API_KEY = os.environ.get("OFOX_API_KEY", "")  # assumed environment variable name

class StreamingTokenTracker:
    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.model = model
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    async def stream_chat(self, messages: list[dict]):
        """Stream the response while tracking estimated token usage."""
        # Estimate input tokens up front
        self.total_input_tokens = sum(
            estimate_tokens(m['content']) for m in messages
        )
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                'POST',
                'https://api.ofox.ai/v1/chat/completions',
                headers={
                    'Authorization': f'Bearer {API_KEY}',
                    'Content-Type': 'application/json',
                },
                json={
                    'model': self.model,
                    'messages': messages,
                    'stream': True,
                },
            ) as response:
                # Server-sent events: each payload line starts with "data: "
                async for line in response.aiter_lines():
                    if not line.startswith('data: ') or line == 'data: [DONE]':
                        continue
                    delta = json.loads(line[6:]).get('choices', [{}])[0].get('delta', {})
                    if content := delta.get('content'):
                        self.total_output_tokens += estimate_tokens(content)
                        yield content

    def get_cost(self, input_cost_per_1k=0.003, output_cost_per_1k=0.015):
        input_cost = (self.total_input_tokens / 1000) * input_cost_per_1k
        output_cost = (self.total_output_tokens / 1000) * output_cost_per_1k
        return input_cost + output_cost
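A usage sketch (assumes the class above and a valid API key; the prompt is made up):

import asyncio

async def main():
    tracker = StreamingTokenTracker()
    messages = [{"role": "user", "content": "Summarize the attached meeting notes."}]
    async for text in tracker.stream_chat(messages):
        print(text, end="", flush=True)
    print(f"\nEstimated cost: ${tracker.get_cost():.4f}")

asyncio.run(main())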
Practical Rule of Thumb
Keep your prompt under 50% of the context window.
This leaves room for:
- User input variations
- Model reasoning
- Unexpected response length
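A simple guardrail based on this rule might look like the following (the 50% threshold and helper name are illustrative; the 200K default matches Claude 3.5 Sonnet from the table above):

def within_budget(prompt: str, context_window: int = 200_000, max_fraction: float = 0.5) -> bool:
    """Reject prompts that would consume more than half the context window."""
    return estimate_tokens(prompt) <= context_window * max_fraction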
Getting Started
Build token-efficient AI applications with ofox.ai — their OpenAI-compatible API gives you access to Claude with generous context windows at competitive pricing.
This article contains affiliate links.
Canonical URL: https://dev.to/zny10289