LLM Context Window Management: Techniques for Handling Long Documents

Every LLM has a context window limit — a maximum number of tokens it can handle in a single request, prompt and response combined. Claude 3.5 Sonnet offers 200K tokens, but that's still finite. Here's how to manage context efficiently for production AI applications.

Understanding Context Window Limits

| Model | Context Window | Approximate Pages |
| --- | --- | --- |
| Claude 3.5 Sonnet | 200K tokens | ~500 pages |
| GPT-4 Turbo | 128K tokens | ~300 pages |
| Claude 3 Opus | 200K tokens | ~500 pages |

Exceed the limit and the API rejects the request. Get close to it and you're usually paying for filler tokens that add no value to the answer.

Token Estimation

import re

def estimate_tokens(text: str) -> int:
    """
    Rough token estimation.
    ~4 characters per token for English text.
    """
    return len(text) // 4

def estimate_tokens_precise(text: str) -> int:
    """
    More precise estimation using word count.
    Average English word is ~1.3 tokens.
    """
    words = len(re.findall(r'\w+', text))
    return int(words * 1.3)
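These heuristics are approximations. For OpenAI-family models you can compare them against tiktoken's exact count (Anthropic uses a different tokenizer, so treat this only as a sanity check; tiktoken is an extra dependency, not part of the snippets above):

# pip install tiktoken
import tiktoken

text = "Context windows are finite, so budget your tokens before every call."

enc = tiktoken.get_encoding("cl100k_base")

print(f"chars / 4:        {estimate_tokens(text)}")
print(f"words * 1.3:      {estimate_tokens_precise(text)}")
print(f"tiktoken (exact): {len(enc.encode(text))}")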

Technique 1: Semantic Chunking

Split documents by meaning, not by character count:

import re

def semantic_chunk(text: str, max_tokens: int = 4000, overlap: int = 200) -> list[str]:
    """
    Split text into semantic chunks (paragraphs).
    """
    # Split by double newlines (paragraphs)
    paragraphs = re.split(r'\n\n+', text)

    chunks = []
    current_chunk = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = estimate_tokens(para)

        if current_tokens + para_tokens > max_tokens:
            # Save current chunk
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))

            # Start new chunk with overlap
            overlap_paras = []
            overlap_tokens = 0
            for p in reversed(current_chunk):
                t = estimate_tokens(p)
                if overlap_tokens + t <= overlap:
                    overlap_paras.insert(0, p)
                    overlap_tokens += t
                else:
                    break

            current_chunk = overlap_paras + [para]
            current_tokens = overlap_tokens + para_tokens
        else:
            current_chunk.append(para)
            current_tokens += para_tokens

    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks
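A quick usage sketch (report.txt stands in for whatever long document you're splitting):

with open("report.txt") as f:
    document = f.read()

chunks = semantic_chunk(document, max_tokens=4000, overlap=200)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: ~{estimate_tokens(chunk)} tokens")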

Technique 2: RAG — Retrieval-Augmented Generation

Don't put everything in the prompt. Retrieve only what's relevant:

import math

class SimpleRAG:
    def __init__(self, documents: list[str], chunk_size: int = 1000):
        self.chunks = self._create_chunks(documents, chunk_size)
        self.embeddings = self._create_embeddings(self.chunks)

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        """Find the most relevant chunks for a query."""
        query_embedding = self._embed(query)

        scores = [
            self._cosine_similarity(query_embedding, e)
            for e in self.embeddings
        ]

        top_indices = sorted(range(len(scores)),
                             key=lambda i: scores[i],
                             reverse=True)[:top_k]

        return [self.chunks[i] for i in top_indices]

    def _create_chunks(self, documents: list[str], chunk_size: int) -> list[str]:
        chunks = []
        for doc in documents:
            chunks.extend(semantic_chunk(doc, max_tokens=chunk_size))
        return chunks

    def _create_embeddings(self, chunks: list[str]) -> list[list[float]]:
        return [self._embed(chunk) for chunk in chunks]

    @staticmethod
    def _cosine_similarity(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def _embed(self, text: str) -> list[float]:
        # In production, call an embedding API (OpenAI, ofox.ai, etc.)
        # and return the vector for `text`.
        raise NotImplementedError("wire this to your embedding provider")
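A usage sketch, assuming _embed has been wired to a real embedding endpoint (handbook.txt is just a placeholder document):

rag = SimpleRAG(documents=[open("handbook.txt").read()])

question = "What is the refund policy?"
context = rag.retrieve(question, top_k=3)

prompt = "Answer using only the context below.\n\n"
prompt += "\n\n---\n\n".join(context)
prompt += f"\n\nQuestion: {question}"
# Only the top-k chunks enter the context window, not the whole handbook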

Technique 3: Conversation Summary

Summarize older messages to preserve context:

class SummarizingConversation:
    def __init__(self, max_tokens: int = 16000):
        self.max_tokens = max_tokens
        self.messages = []
        self.summary = ""

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._maybe_summarize()

    def _maybe_summarize(self):
        total_tokens = sum(estimate_tokens(m["content"]) for m in self.messages)

        if total_tokens > self.max_tokens:
            # Summarize older messages
            older_messages = self.messages[:-5]  # Keep last 5
            recent = self.messages[-5:]

            history = "\n".join(
                f"{m['role']}: {m['content']}" for m in older_messages
            )
            summary_prompt = (
                "Summarize this conversation concisely, preserving key information:\n\n"
                + history
            )

            # Call the LLM to produce the summary (call_llm_summarize is
            # pseudocode -- wire it to your provider)
            self.summary = call_llm_summarize(summary_prompt)
            self.messages = [{"role": "system", "content": f"Prior context: {self.summary}"}] + recent

    def get_messages(self) -> list[dict]:
        return self.messages
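Usage, assuming call_llm_summarize has been wired to a real summarization call:

conv = SummarizingConversation(max_tokens=16000)

conv.add_message("user", "Here are our full project requirements: ...")
conv.add_message("assistant", "Understood. The key requirements are: ...")
# ...many turns later, older messages get folded into a single summary

messages = conv.get_messages()
# messages[0] becomes {"role": "system", "content": "Prior context: ..."}
# followed by the 5 most recent turns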

Technique 4: System Prompt Optimization

Keep system prompts lean:

# ❌ Verbose system prompt (wastes tokens)
verbose_system = """
You are a helpful AI assistant. You are designed to be respectful, 
professional, and helpful. You should provide accurate information 
and be honest when you don't know something. You should ...
[200 more words]
"""

# ✅ Lean system prompt (effective)
lean_system = """
Role: helpful AI assistant
Goal: provide accurate, concise answers
When unsure: say "I don't know"
"""
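You can put a number on the difference with the estimator from earlier; the 1M-requests-per-month figure below is just an illustration:

saved = estimate_tokens(verbose_system) - estimate_tokens(lean_system)
print(f"tokens saved per request: ~{saved}")

# At $0.003 per 1K input tokens and 1M requests per month:
monthly_savings = (saved / 1000) * 0.003 * 1_000_000
print(f"approximate monthly savings: ${monthly_savings:,.2f}")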

Technique 5: Streaming with Token Tracking

import json
import httpx  # pip install httpx

class StreamingTokenTracker:
    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.model = model
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    async def stream_chat(self, messages: list[dict]):
        """Stream the response while tracking approximate token usage."""
        # Count input tokens up front
        self.total_input_tokens += sum(
            estimate_tokens(m["content"]) for m in messages
        )

        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                "POST",
                "https://api.ofox.ai/v1/chat/completions",
                headers={
                    "Authorization": f"Bearer {API_KEY}",  # your API key
                    "Content-Type": "application/json",
                },
                json={
                    "model": self.model,
                    "messages": messages,
                    "stream": True,
                },
            ) as response:
                # The API streams server-sent events: lines prefixed with "data: "
                async for line in response.aiter_lines():
                    if not line.startswith("data: ") or line == "data: [DONE]":
                        continue
                    delta = json.loads(line[6:]).get("choices", [{}])[0].get("delta", {})
                    if content := delta.get("content"):
                        self.total_output_tokens += estimate_tokens(content)
                        yield content

    def get_cost(self, input_cost_per_1k=0.003, output_cost_per_1k=0.015):
        input_cost = (self.total_input_tokens / 1000) * input_cost_per_1k
        output_cost = (self.total_output_tokens / 1000) * output_cost_per_1k
        return input_cost + output_cost
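Driving the tracker from an asyncio entry point (API_KEY is assumed to be defined elsewhere):

import asyncio

async def main():
    tracker = StreamingTokenTracker()
    messages = [{"role": "user", "content": "Explain context windows in two sentences."}]

    async for token in tracker.stream_chat(messages):
        print(token, end="", flush=True)

    print(f"\n\nestimated cost: ${tracker.get_cost():.4f}")

asyncio.run(main())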

Practical Rule of Thumb

Keep your prompt under 50% of the context window (a minimal check is sketched after this list).
This leaves room for:
- User input variations
- Model reasoning
- Unexpected response length
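A minimal version of that check (the 200K default is Claude 3.5 Sonnet's window; swap in your model's limit, and build_prompt is a placeholder for however you assemble the prompt):

def within_budget(prompt: str, context_window: int = 200_000, budget: float = 0.5) -> bool:
    """True if the prompt uses at most `budget` of the context window."""
    return estimate_tokens(prompt) <= context_window * budget

prompt = build_prompt()  # however you assemble it
if not within_budget(prompt):
    # Fall back to a technique above: chunk, retrieve selectively, or summarize
    raise ValueError("Prompt exceeds 50% of the context window")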

Getting Started

Build token-efficient AI applications with ofox.ai — their OpenAI-compatible API gives you access to Claude with generous context windows at competitive pricing.

👉 Get started with ofox.ai


This article contains affiliate links.


