Last week, I spent six hours debugging why my AI application was randomly cutting off responses. The culprit? I didn't understand tokens and context windows. That mistake cost me time and money, and it frustrated my users.
If you're building with LLMs, understanding these concepts isn't optional—it's fundamental. Here's everything you need to know.
What You'll Learn
- What tokens really are and why "words" is the wrong mental model
- How tokenization works under the hood with real examples
- Context window limits and their practical implications
- Proven strategies for handling documents longer than context limits
- Cost optimization through smart token management
- Production-ready code for all scenarios
Prerequisites: Basic Python knowledge. No AI/ML background needed.
Table of Contents
- What Are Tokens? (It's Not What You Think)
- How Tokenization Actually Works
- The Context Window Problem
- Strategy 1: Chunking and Summarization
- Strategy 2: Retrieval-Augmented Generation (RAG)
- Strategy 3: Map-Reduce Pattern
- Strategy 4: Streaming and Windowing
- Cost Optimization Techniques
- Production Best Practices
- Common Pitfalls
- Quick Reference
What Are Tokens? (It's Not What You Think)
The wrong mental model: "1 token = 1 word"
The reality: Tokens are subword units that LLMs use to process text. One word can be multiple tokens, or multiple words can be one token.
Real Examples
from tiktoken import encoding_for_model
# Get GPT-4's tokenizer
encoding = encoding_for_model("gpt-4")
# Example 1: Simple word
text1 = "hello"
tokens1 = encoding.encode(text1)
print(f"'{text1}' -> {len(tokens1)} token(s): {tokens1}")
# Output: 'hello' -> 1 token(s): [15339]
# Example 2: Same word with capitalization
text2 = "Hello"
tokens2 = encoding.encode(text2)
print(f"'{text2}' -> {len(tokens2)} token(s): {tokens2}")
# Output: 'Hello' -> 1 token(s): [9906] # Different token!
# Example 3: Complex word
text3 = "tokenization"
tokens3 = encoding.encode(text3)
print(f"'{text3}' -> {len(tokens3)} token(s): {tokens3}")
# Output: 'tokenization' -> 2 token(s): [30001, 2065]
# Example 4: Technical term
text4 = "ChatGPT"
tokens4 = encoding.encode(text4)
print(f"'{text4}' -> {len(tokens4)} token(s): {tokens4}")
# Output: 'ChatGPT' -> 3 token(s): [13828, 38, 6465]  # One "word", three tokens!
# Example 5: Numbers
text5 = "12345"
tokens5 = encoding.encode(text5)
print(f"'{text5}' -> {len(tokens5)} token(s): {tokens5}")
# Output: '12345' -> 2 token(s): [4513, 1774]
Why This Matters
1. Cost: LLM pricing is per token, not per word
# Latest pricing (2025) - GPT-4o
INPUT_PRICE_PER_1K_TOKENS = 0.0025 # 75% cheaper than GPT-4 Turbo
OUTPUT_PRICE_PER_1K_TOKENS = 0.01 # 67% cheaper than GPT-4 Turbo
def calculate_cost(input_text: str, output_text: str) -> float:
"""Calculate actual API cost"""
encoding = encoding_for_model("gpt-4-turbo")
input_tokens = len(encoding.encode(input_text))
output_tokens = len(encoding.encode(output_text))
input_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K_TOKENS
output_cost = (output_tokens / 1000) * OUTPUT_PRICE_PER_1K_TOKENS
total_cost = input_cost + output_cost
print(f"Input: {input_tokens} tokens (${input_cost:.4f})")
print(f"Output: {output_tokens} tokens (${output_cost:.4f})")
print(f"Total: ${total_cost:.4f}")
return total_cost
# Example
prompt = "Explain quantum computing in simple terms."
response = "Quantum computing uses quantum mechanics principles like superposition and entanglement to process information differently than classical computers..."
cost = calculate_cost(prompt, response)
# Input: ~7 tokens ($0.0000)
# Output: ~23 tokens ($0.0002)
# Total: ~$0.0002
2. Context Limits: Models have maximum token limits
def check_context_limit(text: str, model: str = "gpt-4-turbo") -> dict:
"""Check if text fits in model's context window"""
    # tiktoken only ships OpenAI tokenizers; approximate other vendors with cl100k_base
    try:
        encoding = encoding_for_model(model)
    except KeyError:
        encoding = encoding_for_model("gpt-4")
    tokens = encoding.encode(text)
# Context limits (as of 2025)
limits = {
"gpt-4-turbo": 128000,
"gpt-4o": 128000,
"gpt-4o-mini": 128000,
"claude-3-5-sonnet": 200000,
"claude-3-5-haiku": 200000,
"gemini-2.0-flash": 1000000,
"gemini-1.5-pro": 2000000,
"llama-3.3-70b": 128000,
"qwen-2.5-72b": 128000
}
limit = limits.get(model, 4096)
token_count = len(tokens)
remaining = limit - token_count
percentage = (token_count / limit) * 100
return {
"model": model,
"token_count": token_count,
"limit": limit,
"remaining": remaining,
"percentage_used": f"{percentage:.1f}%",
"fits": token_count <= limit
}
# Example with long document
with open("long_document.txt", "r") as f:
document = f.read()
result = check_context_limit(document, "gpt-4-turbo")
print(result)
# {
# 'model': 'gpt-4-turbo',
# 'token_count': 15234,
# 'limit': 128000,
# 'remaining': 112766,
# 'percentage_used': '11.9%',
# 'fits': True
# }
3. Prompt Efficiency: More tokens ≠ better results — verbose prompts cost more without improving responses
def analyze_token_efficiency(text: str) -> dict:
"""Analyze token usage efficiency"""
encoding = encoding_for_model("gpt-4")
tokens = encoding.encode(text)
words = text.split()
chars = len(text)
return {
"characters": chars,
"words": len(words),
"tokens": len(tokens),
"chars_per_token": chars / len(tokens),
"words_per_token": len(words) / len(tokens),
"efficiency_score": len(words) / len(tokens) # Higher is better
}
# Compare different writing styles
verbose = "I would like to take this opportunity to express my sincere gratitude"
concise = "Thank you very much"
print("Verbose:", analyze_token_efficiency(verbose))
# {'tokens': 15, 'words': 12, 'efficiency_score': 0.8}
print("Concise:", analyze_token_efficiency(concise))
# {'tokens': 4, 'words': 4, 'efficiency_score': 1.0}
How Tokenization Actually Works
The BPE Algorithm (Byte-Pair Encoding)
Most modern LLMs (GPT-4, Claude, Llama) use BPE or variants. Here's a simplified explanation:
Step 1: Start with characters
Text: "tokenization"
Initial: ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
Step 2: Repeatedly merge the most frequent adjacent pairs (merge rules are learned once from a large training corpus, then applied to new text)
Merge: 't' + 'o' -> 'to', then 'to' + 'k' -> 'tok', ... eventually 'token'
Merge: 'i' + 'z' -> 'iz', 'a' + 't' -> 'at', ... eventually 'ization'
... (continues until no learned merge rule applies — see the toy sketch below)
Step 3: Map to token IDs
Final tokens: ['token', 'ization']
Token IDs: [30001, 2065]
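To make the merge loop concrete, here's a minimal toy sketch of BPE-style merging. It's an illustration only: real tokenizers like tiktoken operate on bytes and apply merge rules learned from a huge corpus, not pair frequencies within a single word.
from collections import Counter
from typing import List

def toy_bpe(word: str, num_merges: int = 4) -> List[str]:
    """Greedily merge the most frequent adjacent symbol pair, num_merges times."""
    symbols = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # pick the most frequent pair
        merged, i = [], 0
        while i < len(symbols):
            # Merge every occurrence of the chosen pair into a single symbol
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(toy_bpe("tokenization"))
# With this naive tie-breaking you get something like ['token', 'i', 'z', 'a', 't', 'i', 'o', 'n'];
# a real tokenizer applies corpus-learned merge ranks and lands on ['token', 'ization'].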
Visualizing Tokenization
import tiktoken
def visualize_tokens(text: str, model: str = "gpt-4"):
"""Show how text gets tokenized"""
    encoding = tiktoken.encoding_for_model(model)  # qualified call, since this snippet imports the module itself
tokens = encoding.encode(text)
print(f"Original text: '{text}'")
print(f"Token count: {len(tokens)}\n")
# Decode each token to show the breakdown
for i, token in enumerate(tokens):
decoded = encoding.decode([token])
print(f"Token {i+1}: {token:6d} -> '{decoded}'")
# Example 1: Common phrase
visualize_tokens("Hello, world!")
# Original text: 'Hello, world!'
# Token count: 4
# Token 1: 9906 -> 'Hello'
# Token 2: 11 -> ','
# Token 3: 995 -> ' world'
# Token 4: 0 -> '!'
# Example 2: Code
visualize_tokens("def hello_world():")
# Original text: 'def hello_world():'
# Token count: 6
# Token 1: 1326 -> 'def'
# Token 2: 25748 -> ' hello'
# Token 3: 729 -> '_'
# Token 4: 5430 -> 'world'
# Token 5: 3419 -> '()'
# Token 6: 8 -> ':'
# Example 3: Multilingual
visualize_tokens("Hello 你好 Bonjour")
# Original text: 'Hello 你好 Bonjour'
# Token count: 6
# Token 1: 9906 -> 'Hello'
# Token 2: 57668 -> ' 你'
# Token 3: 53901 -> '好'
# Token 4: 7911 -> ' Bon'
# Token 5: 1558 -> 'j'
# Token 6: 414 -> 'our'
Why Different Models Have Different Tokenizers
def compare_tokenizers(text: str):
"""Compare tokenization across different models"""
models = ["gpt-4o", "gpt-4-turbo", "gpt-4o-mini"]
results = {}
for model in models:
try:
encoding = encoding_for_model(model)
tokens = encoding.encode(text)
results[model] = {
"token_count": len(tokens),
"tokens": tokens[:5] # First 5 tokens
}
except Exception as e:
results[model] = {"error": str(e)}
return results
text = "Artificial intelligence and machine learning are transforming technology."
comparison = compare_tokenizers(text)
for model, data in comparison.items():
print(f"{model}: {data['token_count']} tokens")
# gpt-4o: 12 tokens       (o200k_base tokenizer)
# gpt-4-turbo: 12 tokens  (cl100k_base tokenizer)
# gpt-4o-mini: 12 tokens  (shares o200k_base with gpt-4o)
# The counts happen to match on this sentence, but the gpt-4o family uses a different
# vocabulary than gpt-4-turbo, so counts can diverge on code, numbers, or non-English text.
Key Insight: Always use the correct tokenizer for your model. Never assume token counts from one model apply to another.
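For example, GPT-4 and GPT-4 Turbo use the cl100k_base vocabulary while GPT-4o and GPT-4o-mini use o200k_base, so counting with the wrong one gives you the wrong number. A quick check (assuming a tiktoken version recent enough to know the GPT-4o models):
import tiktoken

text = "Artificial intelligence and machine learning are transforming technology."
for model in ["gpt-4-turbo", "gpt-4o"]:
    enc = tiktoken.encoding_for_model(model)
    print(f"{model}: {enc.name}, {len(enc.encode(text))} tokens")
# Plain English often tokenizes similarly under both vocabularies, but code, numbers,
# and non-English text can differ noticeably — always count with the tokenizer of the
# model you will actually call.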
The Context Window Problem
Understanding Context Windows
The context window is the maximum number of tokens an LLM can process in a single request (input + output combined).
Current Limits (2025):
| Model | Context Window | Equivalent Pages* |
|----------------------|---------------|-------------------|
| GPT-4o | 128K tokens | ~300 pages |
| GPT-4o-mini | 128K tokens | ~300 pages |
| Claude 3.5 Sonnet | 200K tokens | ~500 pages |
| Claude 3.5 Haiku | 200K tokens | ~500 pages |
| Gemini 2.0 Flash | 1M tokens | ~2,500 pages |
| Gemini 1.5 Pro | 2M tokens | ~5,000 pages |
| Llama 3.3 70B | 128K tokens | ~300 pages |
| Qwen 2.5 72B | 128K tokens | ~300 pages |
| DeepSeek V3 | 128K tokens | ~300 pages |
*Approximate: 1 page ≈ 400-450 tokens
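The page column is simple arithmetic; here's a rough helper, assuming ~425 tokens per page (the midpoint of that range — actual density varies with formatting and language):
TOKENS_PER_PAGE = 425  # rough assumption; prose density varies

def tokens_to_pages(tokens: int) -> int:
    return round(tokens / TOKENS_PER_PAGE)

print(tokens_to_pages(128_000))    # ~301 pages for a 128K context
print(tokens_to_pages(2_000_000))  # ~4706 pages for a 2M context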
**Latest Developments (Late 2024/Early 2025)**:
- Gemini 1.5 Pro now supports up to 2M tokens (longest available)
- GPT-4o and GPT-4o-mini offer better price/performance than GPT-4 Turbo
- Claude 3.5 models combine a 200K-token window with strong long-context recall
- Open source models (Llama 3.3, Qwen 2.5) now match proprietary context windows
Real-World Impact
def analyze_context_usage(prompt: str, expected_output_tokens: int = 1000):
"""Analyze if your prompt fits with expected output"""
encoding = encoding_for_model("gpt-4-turbo")
prompt_tokens = len(encoding.encode(prompt))
total_needed = prompt_tokens + expected_output_tokens
context_limit = 128000
result = {
"prompt_tokens": prompt_tokens,
"expected_output_tokens": expected_output_tokens,
"total_tokens_needed": total_needed,
"context_limit": context_limit,
"remaining_tokens": context_limit - total_needed,
"will_fit": total_needed <= context_limit
}
# Warning zones
usage_percent = (total_needed / context_limit) * 100
if usage_percent > 90:
result["warning"] = "CRITICAL: >90% context used"
elif usage_percent > 70:
result["warning"] = "WARNING: >70% context used"
elif usage_percent > 50:
result["warning"] = "CAUTION: >50% context used"
else:
result["warning"] = "OK: Healthy context usage"
return result
# Example: Analyzing a large document
large_document = """
[Imagine a 50-page technical document here...]
""" * 1000 # Simulate large document
analysis = analyze_context_usage(large_document, expected_output_tokens=2000)
print(analysis)
What Happens When You Exceed Context?
from openai import OpenAI, APIError
client = OpenAI()
def safe_api_call(prompt: str, model: str = "gpt-4-turbo"):
"""Handle context length errors gracefully"""
    # tiktoken only knows OpenAI models; approximate other vendors with cl100k_base
    try:
        encoding = encoding_for_model(model)
    except KeyError:
        encoding = encoding_for_model("gpt-4")
    prompt_tokens = len(encoding.encode(prompt))
# Model limits (2025)
limits = {
"gpt-4o": 128000,
"gpt-4o-mini": 128000,
"gpt-4-turbo": 128000,
"claude-3-5-sonnet": 200000,
"gemini-2.0-flash": 1000000,
"gemini-1.5-pro": 2000000
}
limit = limits.get(model, 4096)
max_response_tokens = 4096 # Reserve space for response
# Check if prompt fits
if prompt_tokens + max_response_tokens > limit:
return {
"error": "Context length exceeded",
"prompt_tokens": prompt_tokens,
"limit": limit,
"recommendation": "Use chunking or RAG strategy"
}
try:
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=min(max_response_tokens, limit - prompt_tokens)
)
return {
"success": True,
"response": response.choices[0].message.content,
"tokens_used": response.usage.total_tokens
}
except APIError as e:
return {
"error": str(e),
"prompt_tokens": prompt_tokens
}
Strategy 1: Chunking and Summarization
When to use: Documents that exceed context window but need full coverage.
Basic Chunking
def chunk_text(text: str, max_tokens: int = 4000, model: str = "gpt-4-turbo"):
"""Split text into chunks that fit in context window"""
encoding = encoding_for_model(model)
tokens = encoding.encode(text)
chunks = []
current_chunk = []
current_length = 0
# Decode token by token to maintain boundaries
for token in tokens:
current_chunk.append(token)
current_length += 1
if current_length >= max_tokens:
# Decode chunk back to text
            chunk_str = encoding.decode(current_chunk)  # renamed to avoid shadowing the function
            chunks.append(chunk_str)
current_chunk = []
current_length = 0
# Add remaining tokens
if current_chunk:
chunks.append(encoding.decode(current_chunk))
return chunks
# Example
long_document = "..." # Your long document
chunks = chunk_text(long_document, max_tokens=3000)
print(f"Document split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {len(chunk)} characters")
Smart Chunking (Preserve Meaning)
def smart_chunk(text: str, max_tokens: int = 4000):
    """Chunk text at natural boundaries, carrying the last sentence forward as overlap"""
encoding = encoding_for_model("gpt-4-turbo")
# Split on paragraphs first
paragraphs = text.split('\n\n')
chunks = []
current_chunk = []
current_tokens = 0
for para in paragraphs:
para_tokens = len(encoding.encode(para))
# If single paragraph exceeds limit, split it
if para_tokens > max_tokens:
# Split on sentences
sentences = para.split('. ')
for sentence in sentences:
sent_tokens = len(encoding.encode(sentence))
if current_tokens + sent_tokens > max_tokens:
# Save current chunk
chunks.append('\n\n'.join(current_chunk))
# Start new chunk with overlap
if len(current_chunk) > 0:
# Keep last sentence for context
current_chunk = [current_chunk[-1], sentence]
current_tokens = sent_tokens + len(encoding.encode(current_chunk[-2]))
else:
current_chunk = [sentence]
current_tokens = sent_tokens
else:
current_chunk.append(sentence)
current_tokens += sent_tokens
else:
if current_tokens + para_tokens > max_tokens:
# Save current chunk
chunks.append('\n\n'.join(current_chunk))
current_chunk = [para]
current_tokens = para_tokens
else:
current_chunk.append(para)
current_tokens += para_tokens
# Add final chunk
if current_chunk:
chunks.append('\n\n'.join(current_chunk))
return chunks
Progressive Summarization
from openai import OpenAI
client = OpenAI()
def progressive_summarization(text: str, target_tokens: int = 2000):
"""Summarize long documents progressively"""
encoding = encoding_for_model("gpt-4-turbo")
current_text = text
while len(encoding.encode(current_text)) > target_tokens:
# Chunk current text
chunks = smart_chunk(current_text, max_tokens=6000)
# Summarize each chunk
summaries = []
for i, chunk in enumerate(chunks):
print(f"Summarizing chunk {i+1}/{len(chunks)}...")
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{
"role": "user",
"content": f"Summarize this section, preserving key details:\n\n{chunk}"
}],
max_tokens=1000
)
summaries.append(response.choices[0].message.content)
# Combine summaries
current_text = "\n\n".join(summaries)
current_tokens = len(encoding.encode(current_text))
print(f"Combined summary: {current_tokens} tokens")
return current_text
# Example usage
large_doc = """
[Your 100-page document here...]
"""
summary = progressive_summarization(large_doc, target_tokens=3000)
print(f"Final summary: {len(summary)} characters")
Strategy 2: Retrieval-Augmented Generation (RAG)
When to use: Need to answer questions about large documents without processing everything.
Complete RAG Implementation
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
import tiktoken
client = OpenAI()
pc = Pinecone(api_key="your-api-key")
# Create index
index_name = "documents"
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=3072, # text-embedding-3-large
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index(index_name)
def chunk_and_embed_document(document: str, chunk_size: int = 1000):
"""Split document and create embeddings"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
# Smart chunking
chunks = smart_chunk(document, max_tokens=chunk_size)
# Create embeddings for each chunk
vectors = []
for i, chunk in enumerate(chunks):
# Generate embedding
embedding_response = client.embeddings.create(
input=chunk,
model="text-embedding-3-large"
)
embedding = embedding_response.data[0].embedding
# Prepare vector
vectors.append({
"id": f"chunk_{i}",
"values": embedding,
"metadata": {
"text": chunk,
"chunk_index": i,
"token_count": len(encoding.encode(chunk))
}
})
print(f"Embedded chunk {i+1}/{len(chunks)}")
# Upsert to Pinecone
index.upsert(vectors=vectors)
return len(chunks)
def rag_query(question: str, top_k: int = 3, max_context_tokens: int = 6000):
"""Answer question using RAG"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
# 1. Generate query embedding
query_embedding = client.embeddings.create(
input=question,
model="text-embedding-3-large"
).data[0].embedding
# 2. Search for relevant chunks
results = index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True
)
# 3. Build context from relevant chunks
context_parts = []
total_tokens = 0
for match in results.matches:
chunk_text = match.metadata["text"]
chunk_tokens = len(encoding.encode(chunk_text))
if total_tokens + chunk_tokens <= max_context_tokens:
context_parts.append(chunk_text)
total_tokens += chunk_tokens
else:
break
context = "\n\n---\n\n".join(context_parts)
# 4. Generate answer
prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {question}
Answer:"""
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
return {
"answer": response.choices[0].message.content,
"sources": [m.metadata["chunk_index"] for m in results.matches],
"context_tokens": total_tokens
}
# Example usage
document = """
[Your large document content here - could be 1000 pages]
"""
# Index the document
num_chunks = chunk_and_embed_document(document)
print(f"Indexed {num_chunks} chunks")
# Query the document
result = rag_query("What are the main findings?")
print(f"Answer: {result['answer']}")
print(f"Used {result['context_tokens']} tokens from {len(result['sources'])} chunks")
RAG with Token Budget Management
def adaptive_rag_query(
question: str,
max_context_tokens: int = 6000,
min_chunks: int = 2,
max_chunks: int = 10
):
"""RAG with adaptive token budget"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
# Search with more chunks than needed
query_embedding = client.embeddings.create(
input=question,
model="text-embedding-3-large"
).data[0].embedding
results = index.query(
vector=query_embedding,
top_k=max_chunks,
include_metadata=True
)
# Greedily add chunks until budget exhausted
selected_chunks = []
total_tokens = 0
for match in results.matches:
chunk_text = match.metadata["text"]
chunk_tokens = len(encoding.encode(chunk_text))
if len(selected_chunks) < min_chunks:
# Always include minimum chunks
selected_chunks.append({
"text": chunk_text,
"score": match.score,
"tokens": chunk_tokens
})
total_tokens += chunk_tokens
elif total_tokens + chunk_tokens <= max_context_tokens:
# Add if within budget
selected_chunks.append({
"text": chunk_text,
"score": match.score,
"tokens": chunk_tokens
})
total_tokens += chunk_tokens
else:
break
# Build context
context = "\n\n---\n\n".join([c["text"] for c in selected_chunks])
# Calculate remaining tokens for response
system_prompt = "You are a helpful assistant. Answer based on the context."
question_tokens = len(encoding.encode(question))
context_tokens = len(encoding.encode(context))
system_tokens = len(encoding.encode(system_prompt))
total_input_tokens = system_tokens + context_tokens + question_tokens
max_output_tokens = min(4000, 128000 - total_input_tokens - 100) # 100 token buffer
# Generate response
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
],
max_tokens=max_output_tokens,
temperature=0
)
return {
"answer": response.choices[0].message.content,
"chunks_used": len(selected_chunks),
"input_tokens": total_input_tokens,
"output_tokens": response.usage.completion_tokens,
"total_cost": calculate_cost_from_tokens(
total_input_tokens,
response.usage.completion_tokens
)
}
def calculate_cost_from_tokens(input_tokens: int, output_tokens: int, model: str = "gpt-4o"):
"""Calculate cost for latest models (2025 pricing)"""
# Updated pricing as of 2025
pricing = {
"gpt-4o": {"input": 0.0025, "output": 0.01},
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
"claude-3-5-haiku": {"input": 0.0008, "output": 0.004},
"gemini-2.0-flash": {"input": 0.0001, "output": 0.0004},
"gemini-1.5-pro": {"input": 0.00125, "output": 0.005},
}
model_pricing = pricing.get(model, pricing["gpt-4o"])
input_cost = (input_tokens / 1000) * model_pricing["input"]
output_cost = (output_tokens / 1000) * model_pricing["output"]
return input_cost + output_cost
Strategy 3: Map-Reduce Pattern
When to use: Need to process entire document and combine results (e.g., extracting all names, summarizing each section).
Implementation
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Callable
def map_reduce_llm(
document: str,
map_function: Callable[[str], str],
reduce_function: Callable[[List[str]], str],
chunk_size: int = 4000,
max_workers: int = 5
) -> str:
"""
Apply map-reduce pattern to large documents
Args:
document: Full document text
map_function: Function to apply to each chunk (chunk -> result)
reduce_function: Function to combine results (results -> final)
chunk_size: Max tokens per chunk
max_workers: Parallel processing limit
"""
# Step 1: Split document into chunks
chunks = smart_chunk(document, max_tokens=chunk_size)
print(f"Split into {len(chunks)} chunks")
# Step 2: Map - Process each chunk in parallel
    map_results = [""] * len(chunks)  # pre-size so results keep chunk order (as_completed yields out of order)
with ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_chunk = {
executor.submit(map_function, chunk): i
for i, chunk in enumerate(chunks)
}
for future in as_completed(future_to_chunk):
chunk_idx = future_to_chunk[future]
try:
result = future.result()
                map_results[chunk_idx] = result  # store by index to preserve document order
print(f"Processed chunk {chunk_idx + 1}/{len(chunks)}")
except Exception as e:
print(f"Error processing chunk {chunk_idx}: {e}")
map_results.append("")
# Step 3: Reduce - Combine results
final_result = reduce_function(map_results)
return final_result
# Example 1: Summarization
def summarize_chunk(chunk: str) -> str:
"""Map function: Summarize each chunk"""
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{
"role": "user",
"content": f"Summarize this section concisely:\n\n{chunk}"
}],
max_tokens=500,
temperature=0
)
return response.choices[0].message.content
def combine_summaries(summaries: List[str]) -> str:
"""Reduce function: Combine summaries into final summary"""
combined = "\n\n".join(summaries)
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{
"role": "user",
"content": f"Combine these summaries into one coherent summary:\n\n{combined}"
}],
max_tokens=1000,
temperature=0
)
return response.choices[0].message.content
# Usage
long_document = """...""" # Your 100-page document
final_summary = map_reduce_llm(
document=long_document,
map_function=summarize_chunk,
reduce_function=combine_summaries
)
# Example 2: Extract entities
def extract_entities_from_chunk(chunk: str) -> str:
"""Extract all person names and organizations"""
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{
"role": "user",
"content": f"""Extract all person names and organizations from this text.
Return as JSON: {{"people": [...], "organizations": [...]}}
Text:
{chunk}"""
}],
max_tokens=500,
response_format={"type": "json_object"}
)
return response.choices[0].message.content
def merge_entities(entity_lists: List[str]) -> str:
"""Merge and deduplicate entities"""
import json
all_people = set()
all_orgs = set()
for entities_json in entity_lists:
try:
entities = json.loads(entities_json)
all_people.update(entities.get("people", []))
all_orgs.update(entities.get("organizations", []))
        except (json.JSONDecodeError, AttributeError):  # skip chunks that didn't return valid JSON
continue
return json.dumps({
"people": sorted(list(all_people)),
"organizations": sorted(list(all_orgs)),
"total_people": len(all_people),
"total_organizations": len(all_orgs)
}, indent=2)
# Usage
entities = map_reduce_llm(
document=long_document,
map_function=extract_entities_from_chunk,
reduce_function=merge_entities
)
print(entities)
Strategy 4: Streaming and Windowing
When to use: Processing real-time data or chat conversations that grow over time.
Sliding Window for Chat History
from collections import deque
from typing import List, Dict
class ConversationManager:
"""Manage chat history with token limits"""
def __init__(self, max_context_tokens: int = 6000, model: str = "gpt-4-turbo"):
self.max_context_tokens = max_context_tokens
self.encoding = tiktoken.encoding_for_model(model)
self.messages = deque()
self.system_message = {
"role": "system",
"content": "You are a helpful assistant."
}
def add_message(self, role: str, content: str):
"""Add message to history"""
self.messages.append({"role": role, "content": content})
self._trim_to_fit()
def _count_tokens(self, messages: List[Dict]) -> int:
"""Count total tokens in messages"""
# Rough approximation: 4 tokens per message overhead
total = 0
for msg in messages:
total += len(self.encoding.encode(msg["content"]))
total += 4 # Message formatting overhead
return total
def _trim_to_fit(self):
"""Remove oldest messages to fit context window"""
# Always keep system message
current_messages = [self.system_message] + list(self.messages)
current_tokens = self._count_tokens(current_messages)
# Remove oldest messages (except last 2) until within limit
while current_tokens > self.max_context_tokens and len(self.messages) > 2:
removed = self.messages.popleft()
current_messages = [self.system_message] + list(self.messages)
current_tokens = self._count_tokens(current_messages)
print(f"Removed message to fit context: {removed['content'][:50]}...")
def get_messages(self) -> List[Dict]:
"""Get current message history for API call"""
return [self.system_message] + list(self.messages)
def get_context_info(self) -> Dict:
"""Get current context usage stats"""
messages = self.get_messages()
tokens = self._count_tokens(messages)
return {
"message_count": len(self.messages),
"total_tokens": tokens,
"max_tokens": self.max_context_tokens,
"usage_percent": (tokens / self.max_context_tokens) * 100,
"remaining_tokens": self.max_context_tokens - tokens
}
# Usage example
conversation = ConversationManager(max_context_tokens=8000)
# Simulate long conversation
for i in range(20):
user_msg = f"This is user message {i} with some content..."
conversation.add_message("user", user_msg)
# Get response from API
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=conversation.get_messages(),
max_tokens=500
)
assistant_msg = response.choices[0].message.content
conversation.add_message("assistant", assistant_msg)
# Check context usage
info = conversation.get_context_info()
print(f"Turn {i+1}: {info['usage_percent']:.1f}% context used")
Summarization-Based Window
class SummarizingConversationManager:
"""Chat manager that summarizes old messages"""
def __init__(self, max_tokens: int = 6000, summary_threshold: int = 4000):
self.max_tokens = max_tokens
self.summary_threshold = summary_threshold
self.encoding = tiktoken.encoding_for_model("gpt-4-turbo")
self.conversation_summary = ""
self.recent_messages = deque(maxlen=10) # Keep last 10 messages
def add_message(self, role: str, content: str):
"""Add message and manage context"""
self.recent_messages.append({"role": role, "content": content})
# Check if we need to summarize
recent_tokens = self._count_tokens(list(self.recent_messages))
summary_tokens = len(self.encoding.encode(self.conversation_summary))
total_tokens = recent_tokens + summary_tokens
if total_tokens > self.summary_threshold:
self._create_summary()
def _count_tokens(self, messages: List[Dict]) -> int:
"""Count tokens in messages"""
return sum(len(self.encoding.encode(m["content"])) for m in messages)
def _create_summary(self):
"""Summarize older messages"""
# Take oldest 5 messages to summarize
to_summarize = list(self.recent_messages)[:5]
if not to_summarize:
return
# Create conversation text
conversation_text = "\n".join([
f"{msg['role']}: {msg['content']}"
for msg in to_summarize
])
# Generate summary
prompt = f"""Summarize this conversation, preserving key points and context:
{conversation_text}
Previous summary (if any):
{self.conversation_summary}
Combined summary:"""
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{"role": "user", "content": prompt}],
max_tokens=500
)
self.conversation_summary = response.choices[0].message.content
# Remove summarized messages
for _ in range(min(5, len(self.recent_messages))):
self.recent_messages.popleft()
print(f"Created summary: {self.conversation_summary[:100]}...")
def get_messages(self) -> List[Dict]:
"""Get messages for API call"""
messages = []
# Add summary as system context if exists
if self.conversation_summary:
messages.append({
"role": "system",
"content": f"Conversation summary: {self.conversation_summary}"
})
# Add recent messages
messages.extend(list(self.recent_messages))
return messages
# Usage
conv = SummarizingConversationManager()
for i in range(50):
conv.add_message("user", f"Question {i}: What about topic {i}?")
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=conv.get_messages()
)
conv.add_message("assistant", response.choices[0].message.content)
Cost Optimization Techniques
1. Token Caching
import hashlib
import json
from datetime import datetime, timedelta
class TokenAwareCache:
"""Cache LLM responses with token awareness"""
def __init__(self, cache_file: str = "llm_cache.json"):
self.cache_file = cache_file
self.cache = self._load_cache()
def _load_cache(self) -> dict:
"""Load cache from disk"""
try:
with open(self.cache_file, 'r') as f:
return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
return {}
def _save_cache(self):
"""Save cache to disk"""
with open(self.cache_file, 'w') as f:
json.dump(self.cache, f)
def _get_cache_key(self, prompt: str, model: str) -> str:
"""Generate cache key"""
content = f"{model}:{prompt}"
return hashlib.md5(content.encode()).hexdigest()
def get(self, prompt: str, model: str = "gpt-4-turbo", ttl_hours: int = 24):
"""Get cached response if available"""
key = self._get_cache_key(prompt, model)
if key in self.cache:
entry = self.cache[key]
cached_time = datetime.fromisoformat(entry['timestamp'])
# Check if cache is still valid
if datetime.now() - cached_time < timedelta(hours=ttl_hours):
print(f"Cache hit! Saved {entry['tokens']} tokens")
return entry['response']
return None
def set(self, prompt: str, response: str, tokens: int, model: str = "gpt-4-turbo"):
"""Cache response"""
key = self._get_cache_key(prompt, model)
self.cache[key] = {
'response': response,
'tokens': tokens,
'timestamp': datetime.now().isoformat(),
'model': model
}
self._save_cache()
def get_stats(self) -> dict:
"""Get cache statistics"""
total_entries = len(self.cache)
total_tokens_saved = sum(entry['tokens'] for entry in self.cache.values())
# Using GPT-4o average pricing (2025)
estimated_savings = (total_tokens_saved / 1000) * 0.00625 # avg of input/output
return {
'total_entries': total_entries,
'total_tokens_saved': total_tokens_saved,
'estimated_cost_saved': f"${estimated_savings:.4f}"
}
# Usage
cache = TokenAwareCache()
def cached_completion(prompt: str, model: str = "gpt-4-turbo"):
"""LLM call with caching"""
# Check cache
cached_response = cache.get(prompt, model)
if cached_response:
return cached_response
# Make API call
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}]
)
result = response.choices[0].message.content
tokens = response.usage.total_tokens
# Cache result
cache.set(prompt, result, tokens, model)
return result
# Test
for i in range(5):
response = cached_completion("What is the capital of France?")
print(response)
print(cache.get_stats())
# {'total_entries': 1, 'total_tokens_saved': 250, 'estimated_cost_saved': '$0.0016'}
2. Prompt Compression
def compress_prompt(prompt: str, max_tokens: int = 2000) -> str:
"""Compress prompt while preserving meaning"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
current_tokens = len(encoding.encode(prompt))
if current_tokens <= max_tokens:
return prompt
# Calculate compression ratio needed
ratio = max_tokens / current_tokens
# Use cheaper model to compress
compression_prompt = f"""Compress this text to approximately {int(ratio * 100)}% of its length
while preserving all key information and meaning:
{prompt}
Compressed version:"""
response = client.chat.completions.create(
model="gpt-4o-mini", # Most cost-effective for compression
messages=[{"role": "user", "content": compression_prompt}],
max_tokens=max_tokens
)
compressed = response.choices[0].message.content
compressed_tokens = len(encoding.encode(compressed))
print(f"Compressed from {current_tokens} to {compressed_tokens} tokens")
print(f"Compression ratio: {(compressed_tokens/current_tokens)*100:.1f}%")
return compressed
# Example
long_prompt = """
[Very long prompt with lots of context and examples...]
""" * 100
compressed = compress_prompt(long_prompt, max_tokens=1000)
3. Smart Model Selection
def route_to_model(prompt: str, task_type: str = "general") -> str:
"""Route to appropriate model based on complexity (2025 models)"""
encoding = tiktoken.encoding_for_model("gpt-4o")
tokens = len(encoding.encode(prompt))
# Define routing logic with latest models
if task_type == "simple" or tokens < 100:
model = "gpt-4o-mini" # Most cost-effective
print(f"Routing to GPT-4o-mini (simple task, {tokens} tokens)")
elif task_type == "complex" or tokens > 2000:
model = "gpt-4o" # Best quality-to-cost ratio
print(f"Routing to GPT-4o (complex task, {tokens} tokens)")
elif task_type == "reasoning":
model = "o1-preview" # Best for complex reasoning (Dec 2024)
print(f"Routing to o1-preview (reasoning task, {tokens} tokens)")
else:
model = "gpt-4o-mini"
print(f"Routing to GPT-4o-mini (moderate task, {tokens} tokens)")
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}]
)
    cost = calculate_cost_from_tokens(
        response.usage.prompt_tokens,
        response.usage.completion_tokens,
        model=model  # price with the model that was actually used
    )
print(f"Cost: ${cost:.4f}")
return response.choices[0].message.content
Production Best Practices
1. Token Budget Tracking
class TokenBudgetManager:
"""Track and enforce token budgets"""
def __init__(self, daily_budget_tokens: int = 1000000):
self.daily_budget = daily_budget_tokens
self.usage = {} # date -> token count
def can_make_request(self, estimated_tokens: int) -> bool:
"""Check if request fits in budget"""
today = datetime.now().date().isoformat()
current_usage = self.usage.get(today, 0)
return (current_usage + estimated_tokens) <= self.daily_budget
def record_usage(self, tokens: int):
"""Record token usage"""
today = datetime.now().date().isoformat()
self.usage[today] = self.usage.get(today, 0) + tokens
def get_usage_stats(self) -> dict:
"""Get usage statistics"""
today = datetime.now().date().isoformat()
today_usage = self.usage.get(today, 0)
return {
'today_usage': today_usage,
'daily_budget': self.daily_budget,
'remaining': self.daily_budget - today_usage,
'usage_percent': (today_usage / self.daily_budget) * 100
}
# Usage
budget_manager = TokenBudgetManager(daily_budget_tokens=100000)
def make_tracked_request(prompt: str):
"""Make request with budget tracking"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
estimated_tokens = len(encoding.encode(prompt)) + 1000 # Estimate response
if not budget_manager.can_make_request(estimated_tokens):
raise Exception("Daily token budget exceeded!")
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{"role": "user", "content": prompt}]
)
actual_tokens = response.usage.total_tokens
budget_manager.record_usage(actual_tokens)
stats = budget_manager.get_usage_stats()
print(f"Budget: {stats['usage_percent']:.1f}% used")
return response.choices[0].message.content
2. Error Handling for Context Limits
def safe_completion_with_fallback(prompt: str, max_retries: int = 3):
"""Handle context length errors with fallback strategies"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
current_prompt = prompt
for attempt in range(max_retries):
try:
tokens = len(encoding.encode(current_prompt))
print(f"Attempt {attempt + 1}: {tokens} tokens")
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{"role": "user", "content": current_prompt}],
max_tokens=4000
)
return response.choices[0].message.content
except Exception as e:
if "context_length_exceeded" in str(e).lower():
print(f"Context length exceeded. Compressing...")
# Strategy 1: Compress prompt
current_prompt = compress_prompt(
current_prompt,
max_tokens=int(len(encoding.encode(current_prompt)) * 0.7)
)
elif "maximum context length" in str(e).lower():
print(f"Maximum context exceeded. Switching to chunking...")
# Strategy 2: Use chunking
chunks = smart_chunk(current_prompt, max_tokens=6000)
                summaries = [
                    client.chat.completions.create(
                        model="gpt-4-turbo",
                        messages=[{"role": "user", "content": f"Summarize this section:\n\n{chunk}"}]
                    ).choices[0].message.content
                    for chunk in chunks
                ]
return "\n\n".join(summaries)
else:
raise
raise Exception(f"Failed after {max_retries} attempts")
Common Pitfalls
❌ Pitfall 1: Counting Words Instead of Tokens
# WRONG
def wrong_token_estimate(text: str) -> int:
"""This doesn't work!"""
return len(text.split()) # Words ≠ tokens
# RIGHT
def correct_token_count(text: str, model: str = "gpt-4-turbo") -> int:
"""Always use the actual tokenizer"""
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
# Example showing the difference
text = "ChatGPT tokenization example"
print(f"Words: {len(text.split())}") # 3 words
print(f"Tokens: {correct_token_count(text)}") # 5 tokens
❌ Pitfall 2: Not Reserving Space for Response
# WRONG
def wrong_max_tokens(prompt: str):
"""This might fail!"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
prompt_tokens = len(encoding.encode(prompt))
# Using all remaining context for response
max_tokens = 128000 - prompt_tokens # Can exceed limits!
return max_tokens
# RIGHT
def correct_max_tokens(prompt: str, safety_margin: int = 100):
"""Always leave a safety margin"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
prompt_tokens = len(encoding.encode(prompt))
# Reserve space for system message and safety
available = 128000 - prompt_tokens - safety_margin
# Cap at reasonable maximum
max_response = min(4000, available)
return max(1, max_response) # Never return 0 or negative
❌ Pitfall 3: Ignoring Token Overhead
# Messages have formatting overhead
def calculate_message_tokens(messages: List[Dict]) -> int:
"""Account for message formatting overhead"""
encoding = tiktoken.encoding_for_model("gpt-4-turbo")
total_tokens = 0
for message in messages:
# Content tokens
total_tokens += len(encoding.encode(message["content"]))
# Message formatting overhead (~4 tokens per message)
total_tokens += 4
# Role token
total_tokens += len(encoding.encode(message["role"]))
# Conversation formatting
total_tokens += 2
return total_tokens
Quick Reference
Token Counting Cheat Sheet
import tiktoken
# Initialize tokenizer
enc = tiktoken.encoding_for_model("gpt-4-turbo")
# Count tokens
text = "Your text here"
token_count = len(enc.encode(text))
# Decode tokens
tokens = enc.encode(text)
decoded = enc.decode(tokens)
# Estimate cost (GPT-4o, 2025 pricing) — apply the rate that matches the token type
input_cost = (token_count / 1000) * 0.0025   # if these are prompt (input) tokens
output_cost = (token_count / 1000) * 0.01    # if these are completion (output) tokens
Context Window Limits (2025)
| Model | Input + Output | Recommended Max Input | Release Date |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | 1.95M | Dec 2024 |
| Gemini 2.0 Flash | 1M | 990K | Dec 2024 |
| Claude 3.5 Sonnet | 200K | 195K | Oct 2024 |
| Claude 3.5 Haiku | 200K | 195K | Nov 2024 |
| GPT-4o | 128K | 125K | May 2024 |
| GPT-4o-mini | 128K | 125K | Jul 2024 |
| Llama 3.3 70B | 128K | 125K | Dec 2024 |
| Qwen 2.5 72B | 128K | 125K | Nov 2024 |
| DeepSeek V3 | 128K | 125K | Dec 2024 |
Key Updates:
- Gemini 1.5 Pro leads with 2M tokens (doubled from 1M)
- Claude 3.5 models offer best accuracy at 200K context
- GPT-4o/mini replaced GPT-4 Turbo as primary models
- Open source caught up: Llama 3.3, Qwen 2.5, DeepSeek V3 all support 128K
When to Use Each Strategy
**Chunking + Summarization**
✅ Full document coverage needed
✅ Sequential processing acceptable
❌ Slow for Q&A
❌ May lose context between chunks
**RAG (Retrieval-Augmented Generation)**
✅ Q&A over large documents
✅ Fast queries
✅ Scalable
❌ Requires vector database
❌ Setup complexity
**Map-Reduce**
✅ Parallel processing
✅ Extracting structured data
✅ Aggregating results
❌ Higher API costs
❌ Complex implementation
**Windowing**
✅ Chat/conversation applications
✅ Real-time processing
✅ Simple implementation
❌ Loses old context
❌ Not suitable for long documents
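If you want this decision table in code, here's a rough router; the thresholds and task labels are illustrative assumptions, not hard rules:
def choose_strategy(doc_tokens: int, task: str, context_limit: int = 128_000) -> str:
    """Map task type + document size to one of the strategies above (heuristic only)."""
    if task == "chat":
        return "windowing"                     # growing conversations, real-time
    if doc_tokens <= context_limit // 2:
        return "single call"                   # it fits comfortably — no strategy needed
    if task == "qa":
        return "RAG"                           # targeted questions over a large corpus
    if task in ("extract", "aggregate"):
        return "map-reduce"                    # parallel per-chunk work, then merge
    return "chunking + summarization"          # full coverage, sequential

print(choose_strategy(500_000, "qa"))          # RAG
print(choose_strategy(300_000, "summarize"))   # chunking + summarization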
Key Takeaways
Understanding Tokens:
- ✅ Always use the model's tokenizer for counting
- ✅ Different models have different tokenizers
- ✅ Tokens ≠ words (especially for code, numbers, special characters)
- ✅ Account for message formatting overhead
Managing Context Windows:
- ✅ Reserve space for responses (don't use full context)
- ✅ Monitor token usage in production
- ✅ Implement fallback strategies
- ✅ Use appropriate strategy for your use case
Cost Optimization:
- ✅ Cache frequent queries
- ✅ Compress prompts when possible
- ✅ Route to cheaper models for simple tasks
- ✅ Track token budgets
Production Best Practices:
- ✅ Implement error handling for context limits
- ✅ Monitor token usage and costs
- ✅ Test with real-world data sizes
- ✅ Plan for scaling
Resources
Tools:
- tiktoken - OpenAI's tokenizer
- OpenAI Tokenizer - Visual tokenizer
- Hugging Face Tokenizers - For other models
Further Reading:
- "Attention Is All You Need" (Transformer paper)
- "BERT: Pre-training of Deep Bidirectional Transformers"
Questions about token management? Drop them in the comments.
Found this helpful? Follow for more LLM fundamentals.
Part of the "LLM Fundamentals" series. Next: "Embeddings Explained: From Text to Vectors" - coming next week.