
SATINATH MONDAL

Tokens, Context Windows, and Why They Matter: The Complete Guide

Last week, I spent six hours debugging why my AI application was randomly cutting off responses. The culprit? I didn't understand tokens and context windows. That mistake cost me time and money, and it frustrated my users.

If you're building with LLMs, understanding these concepts isn't optional—it's fundamental. Here's everything you need to know.

What You'll Learn

  • What tokens really are and why "words" is the wrong mental model
  • How tokenization works under the hood with real examples
  • Context window limits and their practical implications
  • Proven strategies for handling documents longer than context limits
  • Cost optimization through smart token management
  • Production-ready code for all scenarios

Prerequisites: Basic Python knowledge. No AI/ML background needed.


What Are Tokens? (It's Not What You Think)

The wrong mental model: "1 token = 1 word"

The reality: Tokens are subword units that LLMs use to process text. One word can be multiple tokens, or multiple words can be one token.

Real Examples

from tiktoken import encoding_for_model

# Get GPT-4's tokenizer
encoding = encoding_for_model("gpt-4")

# Example 1: Simple word
text1 = "hello"
tokens1 = encoding.encode(text1)
print(f"'{text1}' -> {len(tokens1)} token(s): {tokens1}")
# Output: 'hello' -> 1 token(s): [15339]

# Example 2: Same word with capitalization
text2 = "Hello"
tokens2 = encoding.encode(text2)
print(f"'{text2}' -> {len(tokens2)} token(s): {tokens2}")
# Output: 'Hello' -> 1 token(s): [9906]  # Different token!

# Example 3: Complex word
text3 = "tokenization"
tokens3 = encoding.encode(text3)
print(f"'{text3}' -> {len(tokens3)} token(s): {tokens3}")
# Output: 'tokenization' -> 2 token(s): [30001, 2065]

# Example 4: Technical term
text4 = "ChatGPT"
tokens4 = encoding.encode(text4)
print(f"'{text4}' -> {len(tokens4)} token(s): {tokens4}")
# Output: 'ChatGPT' -> 3 token(s): [13828, 38, 6465]

# Example 5: Numbers
text5 = "12345"
tokens5 = encoding.encode(text5)
print(f"'{text5}' -> {len(tokens5)} token(s): {tokens5}")
# Output: '12345' -> 2 token(s): [4513, 1774]

Why This Matters

1. Cost: LLM pricing is per token, not per word

# Latest pricing (2025) - GPT-4o
INPUT_PRICE_PER_1K_TOKENS = 0.0025   # 75% cheaper than GPT-4 Turbo
OUTPUT_PRICE_PER_1K_TOKENS = 0.01    # 67% cheaper than GPT-4 Turbo

def calculate_cost(input_text: str, output_text: str) -> float:
    """Calculate actual API cost"""
    encoding = encoding_for_model("gpt-4o")  # match the GPT-4o pricing above

    input_tokens = len(encoding.encode(input_text))
    output_tokens = len(encoding.encode(output_text))

    input_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K_TOKENS
    output_cost = (output_tokens / 1000) * OUTPUT_PRICE_PER_1K_TOKENS

    total_cost = input_cost + output_cost

    print(f"Input: {input_tokens} tokens (${input_cost:.4f})")
    print(f"Output: {output_tokens} tokens (${output_cost:.4f})")
    print(f"Total: ${total_cost:.4f}")

    return total_cost

# Example
prompt = "Explain quantum computing in simple terms."
response = "Quantum computing uses quantum mechanics principles like superposition and entanglement to process information differently than classical computers..."

cost = calculate_cost(prompt, response)
# Approximate output at the GPT-4o prices above:
# Input: 7 tokens ($0.0000)
# Output: 23 tokens ($0.0002)
# Total: $0.0002

2. Context Limits: Models have maximum token limits

def check_context_limit(text: str, model: str = "gpt-4-turbo") -> dict:
    """Check if text fits in model's context window"""
    # tiktoken only ships OpenAI tokenizers; fall back to cl100k_base as a
    # rough approximation for the non-OpenAI models listed below
    try:
        encoding = encoding_for_model(model)
    except KeyError:
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)

    # Context limits (as of 2025)
    limits = {
        "gpt-4-turbo": 128000,
        "gpt-4o": 128000,
        "gpt-4o-mini": 128000,
        "claude-3-5-sonnet": 200000,
        "claude-3-5-haiku": 200000,
        "gemini-2.0-flash": 1000000,
        "gemini-1.5-pro": 2000000,
        "llama-3.3-70b": 128000,
        "qwen-2.5-72b": 128000
    }

    limit = limits.get(model, 4096)
    token_count = len(tokens)
    remaining = limit - token_count
    percentage = (token_count / limit) * 100

    return {
        "model": model,
        "token_count": token_count,
        "limit": limit,
        "remaining": remaining,
        "percentage_used": f"{percentage:.1f}%",
        "fits": token_count <= limit
    }

# Example with long document
with open("long_document.txt", "r") as f:
    document = f.read()

result = check_context_limit(document, "gpt-4-turbo")
print(result)
# {
#   'model': 'gpt-4-turbo',
#   'token_count': 15234,
#   'limit': 128000,
#   'remaining': 112766,
#   'percentage_used': '11.9%',
#   'fits': True
# }

3. Response Quality: More tokens ≠ better responses

def analyze_token_efficiency(text: str) -> dict:
    """Analyze token usage efficiency"""
    encoding = encoding_for_model("gpt-4")

    tokens = encoding.encode(text)
    words = text.split()
    chars = len(text)

    return {
        "characters": chars,
        "words": len(words),
        "tokens": len(tokens),
        "chars_per_token": chars / len(tokens),
        "words_per_token": len(words) / len(tokens),
        "efficiency_score": len(words) / len(tokens)  # Higher is better
    }

# Compare different writing styles
verbose = "I would like to take this opportunity to express my sincere gratitude"
concise = "Thank you very much"

print("Verbose:", analyze_token_efficiency(verbose))
# {'tokens': 15, 'words': 12, 'efficiency_score': 0.8}

print("Concise:", analyze_token_efficiency(concise))
# {'tokens': 4, 'words': 4, 'efficiency_score': 1.0}

How Tokenization Actually Works

The BPE Algorithm (Byte-Pair Encoding)

Most modern LLMs (GPT-4, Claude, Llama) use BPE or variants. Here's a simplified explanation:

Step 1: Start with characters

Text: "tokenization"
Initial: ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']

Step 2: Apply merge rules learned from the training corpus

(The merge table is built offline by repeatedly merging the most frequent
adjacent pair across a huge corpus; at encoding time those merges are simply
replayed on your text. The merges below are illustrative.)

Merge 1: 't' + 'o' = 'to'
Result: ['to', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']

Merge 2: 'i' + 'o' = 'io'
Result: ['to', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'io', 'n']

... (continues until no more learned merges apply)

Step 3: Map to token IDs

Final tokens: ['token', 'ization']
Token IDs: [30001, 2065]
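To make the merge loop concrete, here's a toy BPE "trainer" over a four-word corpus I made up for illustration. It only demonstrates the learning step: real tokenizers like tiktoken's operate on bytes, learn from enormous corpora, and ship the finished merge table with the model.

from collections import Counter

def learn_bpe_merges(corpus, num_merges: int = 6):
    """Learn BPE merge rules from a tiny corpus (toy illustration only)."""
    words = [list(w) for w in corpus]  # start from individual characters
    merges = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break

        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats anymore; stop merging
        merges.append((a, b))

        # Apply the new merge everywhere it occurs
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1

        print(f"Merged '{a}'+'{b}' ({count}x) -> e.g. {words[0]}")

    return merges

learn_bpe_merges(["tokenization", "token", "tokens", "organization"])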

Visualizing Tokenization

import tiktoken

def visualize_tokens(text: str, model: str = "gpt-4"):
    """Show how text gets tokenized"""
    encoding = encoding_for_model(model)
    tokens = encoding.encode(text)

    print(f"Original text: '{text}'")
    print(f"Token count: {len(tokens)}\n")

    # Decode each token to show the breakdown
    for i, token in enumerate(tokens):
        decoded = encoding.decode([token])
        print(f"Token {i+1}: {token:6d} -> '{decoded}'")

# Example 1: Common phrase
visualize_tokens("Hello, world!")
# Original text: 'Hello, world!'
# Token count: 4
# Token 1:   9906 -> 'Hello'
# Token 2:     11 -> ','
# Token 3:    995 -> ' world'
# Token 4:     0 -> '!'

# Example 2: Code
visualize_tokens("def hello_world():")
# Original text: 'def hello_world():'
# Token count: 6
# Token 1:   1326 -> 'def'
# Token 2:  25748 -> ' hello'
# Token 3:    729 -> '_'
# Token 4:   5430 -> 'world'
# Token 5:   3419 -> '()'
# Token 6:      8 -> ':'

# Example 3: Multilingual
visualize_tokens("Hello 你好 Bonjour")
# Original text: 'Hello 你好 Bonjour'
# Token count: 6
# Token 1:   9906 -> 'Hello'
# Token 2:  57668 -> ' 你'
# Token 3:  53901 -> '好'
# Token 4:   7911 -> ' Bon'
# Token 5:   1558 -> 'j'
# Token 6:   414 -> 'our'

Why Different Models Have Different Tokenizers

def compare_tokenizers(text: str):
    """Compare tokenization across different models"""
    models = ["gpt-4o", "gpt-4-turbo", "gpt-4o-mini"]

    results = {}
    for model in models:
        try:
            encoding = encoding_for_model(model)
            tokens = encoding.encode(text)
            results[model] = {
                "token_count": len(tokens),
                "tokens": tokens[:5]  # First 5 tokens
            }
        except Exception as e:
            results[model] = {"error": str(e)}

    return results

text = "Artificial intelligence and machine learning are transforming technology."
comparison = compare_tokenizers(text)

for model, data in comparison.items():
    print(f"{model}: {data['token_count']} tokens")
# gpt-4o: 12 tokens
# gpt-4-turbo: 12 tokens
# gpt-4o-mini: 12 tokens
# Note: gpt-4o and gpt-4o-mini share the o200k_base tokenizer, while
# gpt-4-turbo uses cl100k_base, so counts can differ on other inputs.

Key Insight: Always use the correct tokenizer for your model. Never assume token counts from one model apply to another.
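To see the insight in action, here's a minimal sketch comparing the same sentence under two tiktoken encodings: cl100k_base (GPT-4 / GPT-4 Turbo) and o200k_base (the GPT-4o family). The counts are often close, but nothing guarantees they match.

import tiktoken

text = "Artificial intelligence and machine learning are transforming technology."

for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    # Same text, different vocabulary -> potentially different token counts
    print(f"{name}: {len(enc.encode(text))} tokens")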


The Context Window Problem

Understanding Context Windows

The context window is the maximum number of tokens an LLM can process in a single request (input + output combined).

Current Limits (2025):

| Model                 | Context Window | Equivalent Pages* |
|----------------------|---------------|-------------------|
| GPT-4o               | 128K tokens   | ~300 pages       |
| GPT-4o-mini          | 128K tokens   | ~300 pages       |
| Claude 3.5 Sonnet    | 200K tokens   | ~500 pages       |
| Claude 3.5 Haiku     | 200K tokens   | ~500 pages       |
| Gemini 2.0 Flash     | 1M tokens     | ~2,500 pages     |
| Gemini 1.5 Pro       | 2M tokens     | ~5,000 pages     |
| Llama 3.3 70B        | 128K tokens   | ~300 pages       |
| Qwen 2.5 72B         | 128K tokens   | ~300 pages       |
| DeepSeek V3          | 128K tokens   | ~300 pages       |

*Approximate: 1 page ≈ 400-450 tokens (see the rough sketch below)

**Latest Developments (Late 2024/Early 2025)**:
- Gemini 1.5 Pro now supports up to 2M tokens (longest available)
- GPT-4o and GPT-4o-mini offer better price/performance than GPT-4 Turbo
- Claude 3.5 models provide best-in-class context handling
- Open source models (Llama 3.3, Qwen 2.5) now match proprietary context windows
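The page estimate above can double as a quick capacity check. Below is a back-of-the-envelope sketch (my own helper, assuming roughly 425 tokens per page and a reserved output budget); for anything that matters, count real tokens with the model's tokenizer instead.

CONTEXT_LIMITS = {
    "gpt-4o": 128_000,
    "claude-3-5-sonnet": 200_000,
    "gemini-1.5-pro": 2_000_000,
}
TOKENS_PER_PAGE = 425  # rough midpoint of the 400-450 tokens/page estimate above

def pages_that_fit(model: str, reserved_for_output: int = 4000) -> int:
    """Approximate how many pages of input fit after reserving output tokens."""
    return (CONTEXT_LIMITS[model] - reserved_for_output) // TOKENS_PER_PAGE

print(pages_that_fit("gpt-4o"))          # ~291 pages
print(pages_that_fit("gemini-1.5-pro"))  # ~4696 pages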

Real-World Impact

def analyze_context_usage(prompt: str, expected_output_tokens: int = 1000):
    """Analyze if your prompt fits with expected output"""
    encoding = encoding_for_model("gpt-4-turbo")

    prompt_tokens = len(encoding.encode(prompt))
    total_needed = prompt_tokens + expected_output_tokens
    context_limit = 128000

    result = {
        "prompt_tokens": prompt_tokens,
        "expected_output_tokens": expected_output_tokens,
        "total_tokens_needed": total_needed,
        "context_limit": context_limit,
        "remaining_tokens": context_limit - total_needed,
        "will_fit": total_needed <= context_limit
    }

    # Warning zones
    usage_percent = (total_needed / context_limit) * 100

    if usage_percent > 90:
        result["warning"] = "CRITICAL: >90% context used"
    elif usage_percent > 70:
        result["warning"] = "WARNING: >70% context used"
    elif usage_percent > 50:
        result["warning"] = "CAUTION: >50% context used"
    else:
        result["warning"] = "OK: Healthy context usage"

    return result

# Example: Analyzing a large document
large_document = """
[Imagine a 50-page technical document here...]
""" * 1000  # Simulate large document

analysis = analyze_context_usage(large_document, expected_output_tokens=2000)
print(analysis)

What Happens When You Exceed Context?

from openai import OpenAI, APIError

client = OpenAI()

def safe_api_call(prompt: str, model: str = "gpt-4-turbo"):
    """Handle context length errors gracefully"""
    # Fall back to cl100k_base for models tiktoken doesn't know about
    try:
        encoding = encoding_for_model(model)
    except KeyError:
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = len(encoding.encode(prompt))

    # Model limits (2025)
    limits = {
        "gpt-4o": 128000,
        "gpt-4o-mini": 128000,
        "gpt-4-turbo": 128000,
        "claude-3-5-sonnet": 200000,
        "gemini-2.0-flash": 1000000,
        "gemini-1.5-pro": 2000000
    }

    limit = limits.get(model, 4096)
    max_response_tokens = 4096  # Reserve space for response

    # Check if prompt fits
    if prompt_tokens + max_response_tokens > limit:
        return {
            "error": "Context length exceeded",
            "prompt_tokens": prompt_tokens,
            "limit": limit,
            "recommendation": "Use chunking or RAG strategy"
        }

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=min(max_response_tokens, limit - prompt_tokens)
        )
        return {
            "success": True,
            "response": response.choices[0].message.content,
            "tokens_used": response.usage.total_tokens
        }
    except APIError as e:
        return {
            "error": str(e),
            "prompt_tokens": prompt_tokens
        }

Strategy 1: Chunking and Summarization

When to use: Documents that exceed context window but need full coverage.

Basic Chunking

def chunk_text(text: str, max_tokens: int = 4000, model: str = "gpt-4-turbo"):
    """Split text into chunks that fit in context window"""
    encoding = encoding_for_model(model)
    tokens = encoding.encode(text)

    chunks = []
    current_chunk = []
    current_length = 0

    # Decode token by token to maintain boundaries
    for token in tokens:
        current_chunk.append(token)
        current_length += 1

        if current_length >= max_tokens:
            # Decode chunk back to text (don't shadow the function name)
            chunks.append(encoding.decode(current_chunk))

            current_chunk = []
            current_length = 0

    # Add remaining tokens
    if current_chunk:
        chunks.append(encoding.decode(current_chunk))

    return chunks

# Example
long_document = "..." # Your long document
chunks = chunk_text(long_document, max_tokens=3000)

print(f"Document split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {len(chunk)} characters")

Smart Chunking (Preserve Meaning)

def smart_chunk(text: str, max_tokens: int = 4000):
    """Chunk text at natural boundaries, carrying the last sentence over as overlap"""
    encoding = encoding_for_model("gpt-4-turbo")

    # Split on paragraphs first
    paragraphs = text.split('\n\n')

    chunks = []
    current_chunk = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = len(encoding.encode(para))

        # If single paragraph exceeds limit, split it
        if para_tokens > max_tokens:
            # Split on sentences
            sentences = para.split('. ')
            for sentence in sentences:
                sent_tokens = len(encoding.encode(sentence))

                if current_tokens + sent_tokens > max_tokens:
                    # Save current chunk
                    chunks.append('\n\n'.join(current_chunk))

                    # Start new chunk with overlap
                    if len(current_chunk) > 0:
                        # Keep last sentence for context
                        current_chunk = [current_chunk[-1], sentence]
                        current_tokens = sent_tokens + len(encoding.encode(current_chunk[-2]))
                    else:
                        current_chunk = [sentence]
                        current_tokens = sent_tokens
                else:
                    current_chunk.append(sentence)
                    current_tokens += sent_tokens
        else:
            if current_tokens + para_tokens > max_tokens:
                # Save current chunk
                chunks.append('\n\n'.join(current_chunk))
                current_chunk = [para]
                current_tokens = para_tokens
            else:
                current_chunk.append(para)
                current_tokens += para_tokens

    # Add final chunk
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks

Progressive Summarization

from openai import OpenAI

client = OpenAI()

def progressive_summarization(text: str, target_tokens: int = 2000):
    """Summarize long documents progressively"""
    encoding = encoding_for_model("gpt-4-turbo")
    current_text = text

    while len(encoding.encode(current_text)) > target_tokens:
        # Chunk current text
        chunks = smart_chunk(current_text, max_tokens=6000)

        # Summarize each chunk
        summaries = []
        for i, chunk in enumerate(chunks):
            print(f"Summarizing chunk {i+1}/{len(chunks)}...")

            response = client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{
                    "role": "user",
                    "content": f"Summarize this section, preserving key details:\n\n{chunk}"
                }],
                max_tokens=1000
            )
            summaries.append(response.choices[0].message.content)

        # Combine summaries
        current_text = "\n\n".join(summaries)
        current_tokens = len(encoding.encode(current_text))
        print(f"Combined summary: {current_tokens} tokens")

    return current_text

# Example usage
large_doc = """
[Your 100-page document here...]
"""

summary = progressive_summarization(large_doc, target_tokens=3000)
print(f"Final summary: {len(summary)} characters")

Strategy 2: Retrieval-Augmented Generation (RAG)

When to use: Need to answer questions about large documents without processing everything.

Complete RAG Implementation

from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
import tiktoken

client = OpenAI()
pc = Pinecone(api_key="your-api-key")

# Create index
index_name = "documents"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=3072,  # text-embedding-3-large
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)

def chunk_and_embed_document(document: str, chunk_size: int = 1000):
    """Split document and create embeddings"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")

    # Smart chunking
    chunks = smart_chunk(document, max_tokens=chunk_size)

    # Create embeddings for each chunk
    vectors = []
    for i, chunk in enumerate(chunks):
        # Generate embedding
        embedding_response = client.embeddings.create(
            input=chunk,
            model="text-embedding-3-large"
        )
        embedding = embedding_response.data[0].embedding

        # Prepare vector
        vectors.append({
            "id": f"chunk_{i}",
            "values": embedding,
            "metadata": {
                "text": chunk,
                "chunk_index": i,
                "token_count": len(encoding.encode(chunk))
            }
        })

        print(f"Embedded chunk {i+1}/{len(chunks)}")

    # Upsert to Pinecone
    index.upsert(vectors=vectors)

    return len(chunks)

def rag_query(question: str, top_k: int = 3, max_context_tokens: int = 6000):
    """Answer question using RAG"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")

    # 1. Generate query embedding
    query_embedding = client.embeddings.create(
        input=question,
        model="text-embedding-3-large"
    ).data[0].embedding

    # 2. Search for relevant chunks
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # 3. Build context from relevant chunks
    context_parts = []
    total_tokens = 0

    for match in results.matches:
        chunk_text = match.metadata["text"]
        chunk_tokens = len(encoding.encode(chunk_text))

        if total_tokens + chunk_tokens <= max_context_tokens:
            context_parts.append(chunk_text)
            total_tokens += chunk_tokens
        else:
            break

    context = "\n\n---\n\n".join(context_parts)

    # 4. Generate answer
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [m.metadata["chunk_index"] for m in results.matches],
        "context_tokens": total_tokens
    }

# Example usage
document = """
[Your large document content here - could be 1000 pages]
"""

# Index the document
num_chunks = chunk_and_embed_document(document)
print(f"Indexed {num_chunks} chunks")

# Query the document
result = rag_query("What are the main findings?")
print(f"Answer: {result['answer']}")
print(f"Used {result['context_tokens']} tokens from {len(result['sources'])} chunks")

RAG with Token Budget Management

def adaptive_rag_query(
    question: str,
    max_context_tokens: int = 6000,
    min_chunks: int = 2,
    max_chunks: int = 10
):
    """RAG with adaptive token budget"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")

    # Search with more chunks than needed
    query_embedding = client.embeddings.create(
        input=question,
        model="text-embedding-3-large"
    ).data[0].embedding

    results = index.query(
        vector=query_embedding,
        top_k=max_chunks,
        include_metadata=True
    )

    # Greedily add chunks until budget exhausted
    selected_chunks = []
    total_tokens = 0

    for match in results.matches:
        chunk_text = match.metadata["text"]
        chunk_tokens = len(encoding.encode(chunk_text))

        if len(selected_chunks) < min_chunks:
            # Always include minimum chunks
            selected_chunks.append({
                "text": chunk_text,
                "score": match.score,
                "tokens": chunk_tokens
            })
            total_tokens += chunk_tokens
        elif total_tokens + chunk_tokens <= max_context_tokens:
            # Add if within budget
            selected_chunks.append({
                "text": chunk_text,
                "score": match.score,
                "tokens": chunk_tokens
            })
            total_tokens += chunk_tokens
        else:
            break

    # Build context
    context = "\n\n---\n\n".join([c["text"] for c in selected_chunks])

    # Calculate remaining tokens for response
    system_prompt = "You are a helpful assistant. Answer based on the context."
    question_tokens = len(encoding.encode(question))
    context_tokens = len(encoding.encode(context))
    system_tokens = len(encoding.encode(system_prompt))

    total_input_tokens = system_tokens + context_tokens + question_tokens
    max_output_tokens = min(4000, 128000 - total_input_tokens - 100)  # 100 token buffer

    # Generate response
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        max_tokens=max_output_tokens,
        temperature=0
    )

    return {
        "answer": response.choices[0].message.content,
        "chunks_used": len(selected_chunks),
        "input_tokens": total_input_tokens,
        "output_tokens": response.usage.completion_tokens,
        "total_cost": calculate_cost_from_tokens(
            total_input_tokens,
            response.usage.completion_tokens
        )
    }

def calculate_cost_from_tokens(input_tokens: int, output_tokens: int, model: str = "gpt-4o"):
    """Calculate cost for latest models (2025 pricing)"""
    # Updated pricing as of 2025
    pricing = {
        "gpt-4o": {"input": 0.0025, "output": 0.01},
        "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
        "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
        "claude-3-5-haiku": {"input": 0.0008, "output": 0.004},
        "gemini-2.0-flash": {"input": 0.0001, "output": 0.0004},
        "gemini-1.5-pro": {"input": 0.00125, "output": 0.005},
    }

    model_pricing = pricing.get(model, pricing["gpt-4o"])
    input_cost = (input_tokens / 1000) * model_pricing["input"]
    output_cost = (output_tokens / 1000) * model_pricing["output"]
    return input_cost + output_cost

Strategy 3: Map-Reduce Pattern

When to use: Need to process entire document and combine results (e.g., extracting all names, summarizing each section).

Implementation

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Callable

def map_reduce_llm(
    document: str,
    map_function: Callable[[str], str],
    reduce_function: Callable[[List[str]], str],
    chunk_size: int = 4000,
    max_workers: int = 5
) -> str:
    """
    Apply map-reduce pattern to large documents

    Args:
        document: Full document text
        map_function: Function to apply to each chunk (chunk -> result)
        reduce_function: Function to combine results (results -> final)
        chunk_size: Max tokens per chunk
        max_workers: Parallel processing limit
    """
    # Step 1: Split document into chunks
    chunks = smart_chunk(document, max_tokens=chunk_size)
    print(f"Split into {len(chunks)} chunks")

    # Step 2: Map - Process each chunk in parallel (results keep chunk order)
    map_results = [""] * len(chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_chunk = {
            executor.submit(map_function, chunk): i
            for i, chunk in enumerate(chunks)
        }

        for future in as_completed(future_to_chunk):
            chunk_idx = future_to_chunk[future]
            try:
                map_results[chunk_idx] = future.result()
                print(f"Processed chunk {chunk_idx + 1}/{len(chunks)}")
            except Exception as e:
                print(f"Error processing chunk {chunk_idx}: {e}")

    # Step 3: Reduce - Combine results
    final_result = reduce_function(map_results)

    return final_result

# Example 1: Summarization
def summarize_chunk(chunk: str) -> str:
    """Map function: Summarize each chunk"""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": f"Summarize this section concisely:\n\n{chunk}"
        }],
        max_tokens=500,
        temperature=0
    )
    return response.choices[0].message.content

def combine_summaries(summaries: List[str]) -> str:
    """Reduce function: Combine summaries into final summary"""
    combined = "\n\n".join(summaries)

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": f"Combine these summaries into one coherent summary:\n\n{combined}"
        }],
        max_tokens=1000,
        temperature=0
    )
    return response.choices[0].message.content

# Usage
long_document = """..."""  # Your 100-page document
final_summary = map_reduce_llm(
    document=long_document,
    map_function=summarize_chunk,
    reduce_function=combine_summaries
)

# Example 2: Extract entities
def extract_entities_from_chunk(chunk: str) -> str:
    """Extract all person names and organizations"""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": f"""Extract all person names and organizations from this text.
Return as JSON: {{"people": [...], "organizations": [...]}}

Text:
{chunk}"""
        }],
        max_tokens=500,
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

def merge_entities(entity_lists: List[str]) -> str:
    """Merge and deduplicate entities"""
    import json

    all_people = set()
    all_orgs = set()

    for entities_json in entity_lists:
        try:
            entities = json.loads(entities_json)
            all_people.update(entities.get("people", []))
            all_orgs.update(entities.get("organizations", []))
        except (json.JSONDecodeError, AttributeError):
            continue

    return json.dumps({
        "people": sorted(list(all_people)),
        "organizations": sorted(list(all_orgs)),
        "total_people": len(all_people),
        "total_organizations": len(all_orgs)
    }, indent=2)

# Usage
entities = map_reduce_llm(
    document=long_document,
    map_function=extract_entities_from_chunk,
    reduce_function=merge_entities
)
print(entities)

Strategy 4: Streaming and Windowing

When to use: Processing real-time data or chat conversations that grow over time.

Sliding Window for Chat History

from collections import deque
from typing import List, Dict

class ConversationManager:
    """Manage chat history with token limits"""

    def __init__(self, max_context_tokens: int = 6000, model: str = "gpt-4-turbo"):
        self.max_context_tokens = max_context_tokens
        self.encoding = tiktoken.encoding_for_model(model)
        self.messages = deque()
        self.system_message = {
            "role": "system",
            "content": "You are a helpful assistant."
        }

    def add_message(self, role: str, content: str):
        """Add message to history"""
        self.messages.append({"role": role, "content": content})
        self._trim_to_fit()

    def _count_tokens(self, messages: List[Dict]) -> int:
        """Count total tokens in messages"""
        # Rough approximation: 4 tokens per message overhead
        total = 0
        for msg in messages:
            total += len(self.encoding.encode(msg["content"]))
            total += 4  # Message formatting overhead
        return total

    def _trim_to_fit(self):
        """Remove oldest messages to fit context window"""
        # Always keep system message
        current_messages = [self.system_message] + list(self.messages)
        current_tokens = self._count_tokens(current_messages)

        # Remove oldest messages (except last 2) until within limit
        while current_tokens > self.max_context_tokens and len(self.messages) > 2:
            removed = self.messages.popleft()
            current_messages = [self.system_message] + list(self.messages)
            current_tokens = self._count_tokens(current_messages)
            print(f"Removed message to fit context: {removed['content'][:50]}...")

    def get_messages(self) -> List[Dict]:
        """Get current message history for API call"""
        return [self.system_message] + list(self.messages)

    def get_context_info(self) -> Dict:
        """Get current context usage stats"""
        messages = self.get_messages()
        tokens = self._count_tokens(messages)

        return {
            "message_count": len(self.messages),
            "total_tokens": tokens,
            "max_tokens": self.max_context_tokens,
            "usage_percent": (tokens / self.max_context_tokens) * 100,
            "remaining_tokens": self.max_context_tokens - tokens
        }

# Usage example
conversation = ConversationManager(max_context_tokens=8000)

# Simulate long conversation
for i in range(20):
    user_msg = f"This is user message {i} with some content..."
    conversation.add_message("user", user_msg)

    # Get response from API
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=conversation.get_messages(),
        max_tokens=500
    )

    assistant_msg = response.choices[0].message.content
    conversation.add_message("assistant", assistant_msg)

    # Check context usage
    info = conversation.get_context_info()
    print(f"Turn {i+1}: {info['usage_percent']:.1f}% context used")

Summarization-Based Window

class SummarizingConversationManager:
    """Chat manager that summarizes old messages"""

    def __init__(self, max_tokens: int = 6000, summary_threshold: int = 4000):
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.encoding = tiktoken.encoding_for_model("gpt-4-turbo")
        self.conversation_summary = ""
        self.recent_messages = deque(maxlen=10)  # Keep last 10 messages

    def add_message(self, role: str, content: str):
        """Add message and manage context"""
        self.recent_messages.append({"role": role, "content": content})

        # Check if we need to summarize
        recent_tokens = self._count_tokens(list(self.recent_messages))
        summary_tokens = len(self.encoding.encode(self.conversation_summary))
        total_tokens = recent_tokens + summary_tokens

        if total_tokens > self.summary_threshold:
            self._create_summary()

    def _count_tokens(self, messages: List[Dict]) -> int:
        """Count tokens in messages"""
        return sum(len(self.encoding.encode(m["content"])) for m in messages)

    def _create_summary(self):
        """Summarize older messages"""
        # Take oldest 5 messages to summarize
        to_summarize = list(self.recent_messages)[:5]

        if not to_summarize:
            return

        # Create conversation text
        conversation_text = "\n".join([
            f"{msg['role']}: {msg['content']}"
            for msg in to_summarize
        ])

        # Generate summary
        prompt = f"""Summarize this conversation, preserving key points and context:

{conversation_text}

Previous summary (if any):
{self.conversation_summary}

Combined summary:"""

        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=500
        )

        self.conversation_summary = response.choices[0].message.content

        # Remove summarized messages
        for _ in range(min(5, len(self.recent_messages))):
            self.recent_messages.popleft()

        print(f"Created summary: {self.conversation_summary[:100]}...")

    def get_messages(self) -> List[Dict]:
        """Get messages for API call"""
        messages = []

        # Add summary as system context if exists
        if self.conversation_summary:
            messages.append({
                "role": "system",
                "content": f"Conversation summary: {self.conversation_summary}"
            })

        # Add recent messages
        messages.extend(list(self.recent_messages))

        return messages

# Usage
conv = SummarizingConversationManager()

for i in range(50):
    conv.add_message("user", f"Question {i}: What about topic {i}?")

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=conv.get_messages()
    )

    conv.add_message("assistant", response.choices[0].message.content)

Cost Optimization Techniques

1. Token Caching

import hashlib
import json
from datetime import datetime, timedelta

class TokenAwareCache:
    """Cache LLM responses with token awareness"""

    def __init__(self, cache_file: str = "llm_cache.json"):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self) -> dict:
        """Load cache from disk"""
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return {}

    def _save_cache(self):
        """Save cache to disk"""
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def _get_cache_key(self, prompt: str, model: str) -> str:
        """Generate cache key"""
        content = f"{model}:{prompt}"
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, prompt: str, model: str = "gpt-4-turbo", ttl_hours: int = 24):
        """Get cached response if available"""
        key = self._get_cache_key(prompt, model)

        if key in self.cache:
            entry = self.cache[key]
            cached_time = datetime.fromisoformat(entry['timestamp'])

            # Check if cache is still valid
            if datetime.now() - cached_time < timedelta(hours=ttl_hours):
                print(f"Cache hit! Saved {entry['tokens']} tokens")
                return entry['response']

        return None

    def set(self, prompt: str, response: str, tokens: int, model: str = "gpt-4-turbo"):
        """Cache response"""
        key = self._get_cache_key(prompt, model)

        self.cache[key] = {
            'response': response,
            'tokens': tokens,
            'timestamp': datetime.now().isoformat(),
            'model': model
        }

        self._save_cache()

    def get_stats(self) -> dict:
        """Get cache statistics"""
        total_entries = len(self.cache)
        total_tokens_saved = sum(entry['tokens'] for entry in self.cache.values())
        # Using GPT-4o average pricing (2025)
        estimated_savings = (total_tokens_saved / 1000) * 0.00625  # avg of input/output

        return {
            'total_entries': total_entries,
            'total_tokens_saved': total_tokens_saved,
            'estimated_cost_saved': f"${estimated_savings:.4f}"
        }

# Usage
cache = TokenAwareCache()

def cached_completion(prompt: str, model: str = "gpt-4-turbo"):
    """LLM call with caching"""
    # Check cache
    cached_response = cache.get(prompt, model)
    if cached_response:
        return cached_response

    # Make API call
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    result = response.choices[0].message.content
    tokens = response.usage.total_tokens

    # Cache result
    cache.set(prompt, result, tokens, model)

    return result

# Test
for i in range(5):
    response = cached_completion("What is the capital of France?")
    print(response)

print(cache.get_stats())
# {'total_entries': 1, 'total_tokens_saved': 250, 'estimated_cost_saved': '$0.0016'}

2. Prompt Compression

def compress_prompt(prompt: str, max_tokens: int = 2000) -> str:
    """Compress prompt while preserving meaning"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")
    current_tokens = len(encoding.encode(prompt))

    if current_tokens <= max_tokens:
        return prompt

    # Calculate compression ratio needed
    ratio = max_tokens / current_tokens

    # Use cheaper model to compress
    compression_prompt = f"""Compress this text to approximately {int(ratio * 100)}% of its length
while preserving all key information and meaning:

{prompt}

Compressed version:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Most cost-effective for compression
        messages=[{"role": "user", "content": compression_prompt}],
        max_tokens=max_tokens
    )

    compressed = response.choices[0].message.content
    compressed_tokens = len(encoding.encode(compressed))

    print(f"Compressed from {current_tokens} to {compressed_tokens} tokens")
    print(f"Compression ratio: {(compressed_tokens/current_tokens)*100:.1f}%")

    return compressed

# Example
long_prompt = """
[Very long prompt with lots of context and examples...]
""" * 100

compressed = compress_prompt(long_prompt, max_tokens=1000)

3. Smart Model Selection

def route_to_model(prompt: str, task_type: str = "general") -> str:
    """Route to appropriate model based on complexity (2025 models)"""
    encoding = tiktoken.encoding_for_model("gpt-4o")
    tokens = len(encoding.encode(prompt))

    # Define routing logic with latest models: check explicit task types first,
    # then fall back to token-count heuristics
    if task_type == "reasoning":
        model = "o1-preview"  # Best for complex reasoning (Dec 2024)
        print(f"Routing to o1-preview (reasoning task, {tokens} tokens)")
    elif task_type == "simple" or tokens < 100:
        model = "gpt-4o-mini"  # Most cost-effective
        print(f"Routing to GPT-4o-mini (simple task, {tokens} tokens)")
    elif task_type == "complex" or tokens > 2000:
        model = "gpt-4o"  # Best quality-to-cost ratio
        print(f"Routing to GPT-4o (complex task, {tokens} tokens)")
    else:
        model = "gpt-4o-mini"
        print(f"Routing to GPT-4o-mini (moderate task, {tokens} tokens)")

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    cost = calculate_cost_from_tokens(
        response.usage.prompt_tokens,
        response.usage.completion_tokens
    )

    print(f"Cost: ${cost:.4f}")

    return response.choices[0].message.content

Production Best Practices

1. Token Budget Tracking

class TokenBudgetManager:
    """Track and enforce token budgets"""

    def __init__(self, daily_budget_tokens: int = 1000000):
        self.daily_budget = daily_budget_tokens
        self.usage = {}  # date -> token count

    def can_make_request(self, estimated_tokens: int) -> bool:
        """Check if request fits in budget"""
        today = datetime.now().date().isoformat()
        current_usage = self.usage.get(today, 0)

        return (current_usage + estimated_tokens) <= self.daily_budget

    def record_usage(self, tokens: int):
        """Record token usage"""
        today = datetime.now().date().isoformat()
        self.usage[today] = self.usage.get(today, 0) + tokens

    def get_usage_stats(self) -> dict:
        """Get usage statistics"""
        today = datetime.now().date().isoformat()
        today_usage = self.usage.get(today, 0)

        return {
            'today_usage': today_usage,
            'daily_budget': self.daily_budget,
            'remaining': self.daily_budget - today_usage,
            'usage_percent': (today_usage / self.daily_budget) * 100
        }

# Usage
budget_manager = TokenBudgetManager(daily_budget_tokens=100000)

def make_tracked_request(prompt: str):
    """Make request with budget tracking"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")
    estimated_tokens = len(encoding.encode(prompt)) + 1000  # Estimate response

    if not budget_manager.can_make_request(estimated_tokens):
        raise Exception("Daily token budget exceeded!")

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

    actual_tokens = response.usage.total_tokens
    budget_manager.record_usage(actual_tokens)

    stats = budget_manager.get_usage_stats()
    print(f"Budget: {stats['usage_percent']:.1f}% used")

    return response.choices[0].message.content

2. Error Handling for Context Limits

def safe_completion_with_fallback(prompt: str, max_retries: int = 3):
    """Handle context length errors with fallback strategies"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")
    current_prompt = prompt

    for attempt in range(max_retries):
        try:
            tokens = len(encoding.encode(current_prompt))
            print(f"Attempt {attempt + 1}: {tokens} tokens")

            response = client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{"role": "user", "content": current_prompt}],
                max_tokens=4000
            )

            return response.choices[0].message.content

        except Exception as e:
            if "context_length_exceeded" in str(e).lower():
                print(f"Context length exceeded. Compressing...")

                # Strategy 1: Compress prompt
                current_prompt = compress_prompt(
                    current_prompt,
                    max_tokens=int(len(encoding.encode(current_prompt)) * 0.7)
                )

            elif "maximum context length" in str(e).lower():
                print(f"Maximum context exceeded. Switching to chunking...")

                # Strategy 2: Use chunking
                chunks = smart_chunk(current_prompt, max_tokens=6000)
                summaries = [
                    client.chat.completions.create(
                        model="gpt-4-turbo",
                        messages=[{"role": "user", "content": f"Summarize this section:\n\n{chunk}"}]
                    ).choices[0].message.content
                    for chunk in chunks
                ]
                return "\n\n".join(summaries)
            else:
                raise

    raise Exception(f"Failed after {max_retries} attempts")

Common Pitfalls

❌ Pitfall 1: Counting Words Instead of Tokens

# WRONG
def wrong_token_estimate(text: str) -> int:
    """This doesn't work!"""
    return len(text.split())  # Words ≠ tokens

# RIGHT
def correct_token_count(text: str, model: str = "gpt-4-turbo") -> int:
    """Always use the actual tokenizer"""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example showing the difference
text = "ChatGPT tokenization example"
print(f"Words: {len(text.split())}")  # 3 words
print(f"Tokens: {correct_token_count(text)}")  # 5 tokens

❌ Pitfall 2: Not Reserving Space for Response

# WRONG
def wrong_max_tokens(prompt: str):
    """This might fail!"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")
    prompt_tokens = len(encoding.encode(prompt))

    # Using all remaining context for the response
    max_tokens = 128000 - prompt_tokens  # Exceeds the model's output cap and leaves no margin!

    return max_tokens

# RIGHT
def correct_max_tokens(prompt: str, safety_margin: int = 100):
    """Always leave a safety margin"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")
    prompt_tokens = len(encoding.encode(prompt))

    # Reserve space for system message and safety
    available = 128000 - prompt_tokens - safety_margin

    # Cap at reasonable maximum
    max_response = min(4000, available)

    return max(1, max_response)  # Never return 0 or negative

❌ Pitfall 3: Ignoring Token Overhead

# Messages have formatting overhead
def calculate_message_tokens(messages: List[Dict]) -> int:
    """Account for message formatting overhead"""
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")

    total_tokens = 0

    for message in messages:
        # Content tokens
        total_tokens += len(encoding.encode(message["content"]))

        # Message formatting overhead (~4 tokens per message)
        total_tokens += 4

        # Role token
        total_tokens += len(encoding.encode(message["role"]))

    # Conversation formatting
    total_tokens += 2

    return total_tokens

Quick Reference

Token Counting Cheat Sheet

import tiktoken

# Initialize tokenizer
enc = tiktoken.encoding_for_model("gpt-4-turbo")

# Count tokens
text = "Your text here"
token_count = len(enc.encode(text))

# Decode tokens
tokens = enc.encode(text)
decoded = enc.decode(tokens)

# Estimate cost (GPT-4o - 2025 pricing)
input_cost = (token_count / 1000) * 0.0025
output_cost = (token_count / 1000) * 0.01

Context Window Limits (2025)

| Model             | Input + Output | Recommended Max Input | Release Date |
|-------------------|----------------|-----------------------|--------------|
| Gemini 1.5 Pro    | 2M             | 1.95M                 | Dec 2024     |
| Gemini 2.0 Flash  | 1M             | 990K                  | Dec 2024     |
| Claude 3.5 Sonnet | 200K           | 195K                  | Oct 2024     |
| Claude 3.5 Haiku  | 200K           | 195K                  | Nov 2024     |
| GPT-4o            | 128K           | 125K                  | May 2024     |
| GPT-4o-mini       | 128K           | 125K                  | Jul 2024     |
| Llama 3.3 70B     | 128K           | 125K                  | Dec 2024     |
| Qwen 2.5 72B      | 128K           | 125K                  | Nov 2024     |
| DeepSeek V3       | 128K           | 125K                  | Dec 2024     |

Key Updates:

  • Gemini 1.5 Pro leads with 2M tokens (doubled from 1M)
  • Claude 3.5 models offer best accuracy at 200K context
  • GPT-4o/mini replaced GPT-4 Turbo as primary models
  • Open source caught up: Llama 3.3, Qwen 2.5, DeepSeek V3 all support 128K

When to Use Each Strategy

**Chunking + Summarization**
✅ Full document coverage needed
✅ Sequential processing acceptable
❌ Slow for Q&A
❌ May lose context between chunks

**RAG (Retrieval-Augmented Generation)**
✅ Q&A over large documents
✅ Fast queries
✅ Scalable
❌ Requires vector database
❌ Setup complexity

**Map-Reduce**
✅ Parallel processing
✅ Extracting structured data
✅ Aggregating results
❌ Higher API costs
❌ Complex implementation

**Windowing**
✅ Chat/conversation applications
✅ Real-time processing
✅ Simple implementation
❌ Loses old context
❌ Not suitable for long documents
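If you want this decision guide as code, here's a rough heuristic sketch; the task labels and thresholds are my own assumptions, not hard rules, so adapt them to your workload.

def choose_strategy(task: str, doc_tokens: int, context_limit: int = 128_000) -> str:
    """Map a task to one of the four strategies above (rough heuristic)."""
    if task == "chat":
        return "windowing"                     # conversation grows over time
    if doc_tokens + 4_000 <= context_limit:
        return "single call"                   # it already fits; no special strategy
    if task == "qa":
        return "rag"                           # answer questions without reading everything
    if task in ("extract", "aggregate"):
        return "map-reduce"                    # parallel per-chunk work, then combine
    return "chunking + summarization"          # full coverage, processed sequentially

print(choose_strategy("qa", doc_tokens=500_000))         # rag
print(choose_strategy("summarize", doc_tokens=300_000))  # chunking + summarization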

Key Takeaways

Understanding Tokens:

  • ✅ Always use the model's tokenizer for counting
  • ✅ Different models have different tokenizers
  • ✅ Tokens ≠ words (especially for code, numbers, special characters)
  • ✅ Account for message formatting overhead

Managing Context Windows:

  • ✅ Reserve space for responses (don't use full context)
  • ✅ Monitor token usage in production
  • ✅ Implement fallback strategies
  • ✅ Use appropriate strategy for your use case

Cost Optimization:

  • ✅ Cache frequent queries
  • ✅ Compress prompts when possible
  • ✅ Route to cheaper models for simple tasks
  • ✅ Track token budgets

Production Best Practices:

  • ✅ Implement error handling for context limits
  • ✅ Monitor token usage and costs
  • ✅ Test with real-world data sizes
  • ✅ Plan for scaling

Resources

Further Reading:

  • "Attention Is All You Need" (Transformer paper)
  • "BERT: Pre-training of Deep Bidirectional Transformers"

Questions about token management? Drop them in the comments.

Found this helpful? Follow for more LLM fundamentals.


Part of the "LLM Fundamentals" series. Next: "Embeddings Explained: From Text to Vectors" - coming next week.

#ai #llm #tokens #tutorial #beginners
