You build something with GPT-4o. The model supports 128,000 tokens. You think: that's enough for a full novel. Then, four or five conversation turns in, the model starts forgetting things that were said earlier. Eight turns in, you hit an error. You check the token count — you've used over 100,000 tokens, and you've typed maybe 400 words.
This isn't a bug. It's the predictable consequence of not accounting for where those tokens actually go. A context window isn't blank space waiting to be filled with your words. By the time the first user message arrives, it is already partially consumed — by system instructions, by tool definitions, by retrieved documents, by the tokens the model itself generated in earlier turns. In a production AI agent, 30–60% of the context window is gone before a user types anything.
What follows is a precise accounting of where those tokens go — the four layers that consume the window before users say anything, why the effective limit is substantially lower than the advertised one, what happens to response quality as the window approaches capacity, and which engineering patterns actually manage it at production scale.
Part 1: The Problem
1. The Illusion of Abundance
GPT-4o supports 128K tokens. Claude 3.5 supports 200K. Gemini 1.5 Pro has been demonstrated at a million tokens — roughly 750,000 words, about ten average novels. The numbers sound absurdly generous. How could you possibly run out?
Start with a calibration exercise. What is 128,000 tokens, actually?
In English prose, one token is roughly four characters — about three-quarters of a word. A 1,000-word article runs to around 1,300 tokens, so 128K tokens can hold close to 96,000 words of clean text. That genuinely is a lot.
But text in an LLM application is rarely clean English prose. It is JSON payloads from tool calls. It is API responses full of structured data. It is code. It is URLs. It is conversation history with speaker labels, timestamps, and formatting. All of these serialize into tokens at rates much higher than 4 characters per token.
Then there is the question of performance. The advertised number represents a technical limit — the longest sequence the model can physically process. It does not represent the length at which the model operates at peak accuracy. Research has repeatedly found a significant gap between the two. Long-context benchmarks like RULER (2024) and HELMET (2024) found that in adversarial multi-document tasks, most frontier LLMs showed accuracy drops well before 32K tokens — GPT-4o fell from near-perfect baseline scores to the high-60s percentage range at 32K in some configurations. The technical limit says 128K. The accuracy cliff arrives much earlier.
The Effective Limit Is Not the Advertised Limit
Models claiming 200K context windows show measurable quality degradation around 130K tokens in practice. Treating the advertised number as your operating budget is how production systems quietly degrade without triggering any explicit error.
Cost is the third angle. Every token in the context is a token billed. At an input price of around $2.50 per million tokens (GPT-4o's list price is assumed here; check current rates), a full 128K-token prompt costs roughly $0.32 per call. Agents often make dozens of calls per session, each with the full accumulated context, so a single long session can run to several dollars. The monthly bill from a badly managed context window can surprise you well before any error appears in the logs.
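To make the billing concrete, here is a minimal sketch of per-call and per-session input cost when each call re-sends the accumulated context. The price constant and the 30-call session are assumptions chosen for illustration, not quoted figures.

```python
# Rough input-cost estimate for an agent session.
# PRICE_PER_MILLION_INPUT is an assumed placeholder, not a quoted price.
PRICE_PER_MILLION_INPUT = 2.50  # USD per 1M input tokens (assumption)


def call_cost(prompt_tokens, price_per_million=PRICE_PER_MILLION_INPUT):
    """Input cost in USD for a single API call."""
    return prompt_tokens / 1_000_000 * price_per_million


def session_cost(prompt_tokens_per_call, price_per_million=PRICE_PER_MILLION_INPUT):
    """Total input cost in USD across every call in one session."""
    return sum(call_cost(t, price_per_million) for t in prompt_tokens_per_call)


# One call that fills a 128K-token window.
full_window = call_cost(128_000)

# Hypothetical session: 30 calls, each carrying 100K tokens of context.
session = session_cost([100_000] * 30)

print(f"single 128K call: ${full_window:.2f}")
print(f"30-call session:  ${session:.2f}")
```

Output tokens are billed separately and usually at a higher rate, so this is a lower bound on the real bill.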
2. How Tokens Are Counted — and Why the Count Surprises You
An LLM does not read text. It reads a sequence of integers. Before any word reaches the model, it passes through a tokenizer that converts characters into integer IDs from a vocabulary of roughly 50,000–200,000 entries. The tokenizer used by GPT-4 and GPT-3.5 Turbo is called cl100k_base; it has about 100,000 vocabulary entries. GPT-4o and OpenAI's newer models use o200k_base, with about 200,000.
The vocabulary is built using BPE — Byte Pair Encoding. The name comes from the construction: you start with individual characters, then repeatedly merge the pair of adjacent symbols that appears most often in your training corpus, replacing each occurrence of that pair with a new combined token. Do this enough times and common English words end up as single tokens. The algorithm learns what to merge entirely from what was common in the training text — mostly English prose on the internet. That's why "the", "is", "running" each become a single token, while "tokenization" becomes ["token", "ization"] — less common as a whole word, so BPE never fully merged it. Characters and raw bytes are the fallback for anything the vocabulary doesn't cover. The consequence is simple: anything that wasn't well-represented in training data — JSON brackets, URL slashes, code indentation — never got merged aggressively, so those sequences remain expensive in tokens relative to the characters they contain.
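The merge loop described above can be sketched in a few lines. This is a toy character-level trainer on a made-up three-word corpus, not the byte-level implementation real tokenizers use; the function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties go to the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn `num_merges` merges; return the merge list and the segmented words."""
    words = dict(Counter(tuple(w) for w in corpus))  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Toy corpus: "the" is frequent, "token" is rare.
corpus = ["the"] * 5 + ["then"] * 2 + ["token"]
merges, words = train_bpe(corpus, num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "the" has collapsed into one symbol, while the rare word "token" is still five separate characters. That asymmetry is the same reason uncommon strings stay expensive in a real vocabulary.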
The rule-of-thumb of 1 token ≈ 4 characters holds for clean English prose — decent enough for napkin estimates. It falls apart under several conditions that appear constantly in real applications:
Numbers tokenize unexpectedly. In older BPE vocabularies, a frequent number such as "2023" could earn a single token while a rare one such as "19847" was split into irregular fragments. cl100k_base and o200k_base instead pre-split digit runs into groups of at most three, so "2023" and "19847" each cost two tokens, and any punctuation breaks the run. The price "USD 1,234,567.89" produces approximately 10–12 tokens, because the commas, period, digit groups, and currency code each claim separate tokens.
URLs are disproportionately expensive. A URL like https://api.example.com/v2/users/12345 looks compact — 38 characters, which by the prose rule should be about 9–10 tokens. In practice it is closer to 15–20 tokens. Slashes, dots, hyphens, underscores, and alphanumeric path segments each claim their own tokens or merge into small fragments, because URLs are structurally uncommon in prose.
JSON and structured data often cost more tokens than the same facts in plain text, approaching 2x for verbose payloads. Consider:
Plain text: The user's name is Alice, she is 28 years old, and her account is active.
JSON: {"user": {"name": "Alice", "age": 28, "status": "active"}}
The plain text version: approximately 18 tokens. The JSON version: approximately 22 tokens — and this is a trivially small object. Real API responses with deeply nested keys, repeated field names, and verbose formatting can be far more expensive. Every brace, colon, and comma is a token or part of a token. A 500-word JSON payload can use 800+ tokens.
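One cheap mitigation is to serialize JSON compactly before it enters the prompt. The sketch below compares character counts for pretty-printed and compact output of the same object. Character count is only a proxy here: runs of whitespace often merge into few tokens, so the token saving should be confirmed with the real tokenizer.

```python
import json

record = {"user": {"name": "Alice", "age": 28, "status": "active"}}

# Pretty-printed output, as many APIs and debug logs return it.
pretty = json.dumps(record, indent=2)

# Compact output: no indentation, no spaces after separators.
compact = json.dumps(record, separators=(",", ":"))

print(f"pretty:  {len(pretty)} characters")
print(f"compact: {len(compact)} characters")
print(compact)
```

Both strings parse back to the same object, so nothing the model needs is lost.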
Code tokenizes inefficiently in some languages. Research found that Python uses roughly 46% more tokens than equivalent Haskell to express the same computational idea. Python's indentation-based structure requires whitespace tokens, and Python's identifiers were less densely represented in the pre-GPT-4 training corpora.
Analogy: The Luggage Weight Problem
Think of the context window as checked baggage with a weight limit. The scale does not care how useful each item is. Plain prose is the clothing: a lot of use per kilogram. JSON, URLs, and code are heavy packaging around the same contents: braces, quotes, slashes, and indentation add weight without adding meaning, yet they count toward the same limit.
Part 2: The Consumers
3. The Four Layers That Eat Your Context Window
Every LLM API call is a full context payload assembled from four distinct layers. Most developers think about only one: the user's current message. The other three arrive already loaded — silent costs that accumulate before the user types anything.
Layer 1: The System Prompt
The system prompt is the foundational layer. It is always present, on every API call. A minimal system prompt — "You are a helpful assistant" — costs about 7 tokens. But real production system prompts are not minimal.
A typical customer-facing chatbot system prompt contains: the model's persona and tone guidelines, a list of topics it should and should not address, instructions about response format, domain-specific knowledge, legal disclaimers, and formatting instructions. Measured in practice, these range from 800 to 2,500 tokens. They are charged on every single API call. A 1,500-token system prompt running 1,000 calls per day costs you 1.5 million input tokens per day before a user says anything.
Layer 2: Tool Schemas
When you give an LLM access to external tools, you must describe each tool to the model in the context window. These descriptions are written in JSON and can be verbose. A single moderately documented tool schema costs roughly 200 tokens. An agent with five tools carries around 1,000 tokens of tool descriptions on every call, before any user input. The JSON structure alone — all those braces, colons, and quoted keys — is part of why the token cost is higher than reading the description would suggest.
Layer 3: Retrieved Context (RAG)
Many production LLM applications retrieve relevant documents from a database and inject them as supporting material. A typical RAG retrieval returns 3–8 document chunks, each 300–600 tokens. Three chunks at 400 tokens each: 1,200 tokens. Eight chunks at 500 tokens each: 4,000 tokens. In a research assistant with a generous retrieval budget, you might inject 8,000–12,000 tokens of context per query.
The Hidden Fixed Cost
System prompt + tool schemas is your fixed cost floor. It doesn't change turn-to-turn. It can easily reach 2,000–4,000 tokens in a real agent — charged on every single API call in your fleet.
Layer 4: Conversation History
The model has no persistent memory. You create the illusion of memory by re-sending the full conversation history on every API call. Every turn appends two new entries (a user message and a model response) to a history that is re-sent in its entirety. Model responses can be long — a detailed answer with a code snippet might be 600–800 tokens. After ten exchanges, the conversation history alone can be 8,000–12,000 tokens.
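Because the whole history is re-sent each time, the total input billed over a conversation grows faster than the history itself. A minimal sketch, assuming a constant fixed overhead and constant average message sizes (all numbers are illustrative):

```python
def prompt_tokens_at_turn(turn, fixed, user_avg, reply_avg):
    """Prompt size for the call made at `turn` (1-based).

    The prompt holds the fixed overhead, every earlier exchange,
    and the new user message.
    """
    prior_history = (turn - 1) * (user_avg + reply_avg)
    return fixed + prior_history + user_avg


def total_billed_input(turns, fixed, user_avg, reply_avg):
    """Sum of prompt tokens over all calls in a conversation of `turns` turns."""
    return sum(
        prompt_tokens_at_turn(t, fixed, user_avg, reply_avg)
        for t in range(1, turns + 1)
    )


# Illustrative numbers: 2,000 fixed tokens, 60-token questions, 700-token answers.
first_call = prompt_tokens_at_turn(1, fixed=2_000, user_avg=60, reply_avg=700)
tenth_call = prompt_tokens_at_turn(10, fixed=2_000, user_avg=60, reply_avg=700)
billed = total_billed_input(10, fixed=2_000, user_avg=60, reply_avg=700)

print(first_call, tenth_call, billed)
```

The tenth call is more than four times the size of the first, and the ten calls together bill far more input than the final prompt alone contains.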
4. Context Creep — Watching the Window Fill
The process by which a context window fills over a conversation has a name in production systems: context creep. Consider a realistic customer support agent: 1,200-token system prompt, three tool schemas totaling 600 tokens, RAG retrieval returning two chunks (~800 tokens per turn), user messages averaging 60 tokens, model responses averaging 350 tokens.
Context budget:
Fixed overhead: 1,200 + 600 = 1,800 tokens
Per-turn RAG: 800 tokens
Per-turn history growth: 60 (user) + 350 (model) = 410 tokens
Turns until 80% of 128K:
(1,800 + n × 800 + n × 410) ≥ 102,400
n × 1,210 ≥ 100,600
n ≈ 84 turns
If model reply averages 800 tokens instead:
Per-turn growth: 60 + 800 = 860
n × 1,660 ≥ 100,600
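n ≈ 61 turns
Change the model reply length to 800 tokens (a detailed-answer agent) and the window hits 80% around turn 61 rather than 84. This model assumes each turn's retrieved chunks stay in the re-sent history; if you discard them after each call, the window fills more slowly. Either way, quality degradation begins before you hit the hard limit.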
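The same arithmetic as a small helper, so you can plug in your own measurements. It assumes linear growth per turn and that retrieved chunks remain in the history, as in the calculation above; the function name is illustrative.

```python
import math


def turns_until(threshold_tokens, fixed, per_turn_growth):
    """Smallest whole number of turns at which the prompt reaches the threshold."""
    return math.ceil((threshold_tokens - fixed) / per_turn_growth)


WINDOW = 128_000
THRESHOLD = WINDOW * 80 // 100   # 80% of the window, in integer tokens
FIXED = 1_200 + 600              # system prompt + tool schemas
RAG_PER_TURN = 800
USER_AVG = 60

# Support agent with short replies.
short_replies = turns_until(THRESHOLD, FIXED, RAG_PER_TURN + USER_AVG + 350)

# Detailed-answer agent with long replies.
long_replies = turns_until(THRESHOLD, FIXED, RAG_PER_TURN + USER_AVG + 800)

print(short_replies, long_replies)
```

Rounding up gives the first turn at which the threshold is actually crossed.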
Part 3: The Physics
5. KV Cache Memory — Why Context Has a Physical Cost
The context window limit is not an arbitrary policy. It is enforced by physics — GPU memory.
The transformer's attention mechanism works by comparing every token in the context with every other token. For each token, the model creates a query ("what am I looking for?"), and every other token offers a key ("what do I contain?"). A third vector — the value — carries the actual information that gets passed when attention is high. Assembled across all tokens, these become the matrices Q, K, and V:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
The QKᵀ product is an n × n matrix where n is the sequence length. Doubling n quadruples this computation.
There are two distinct computational phases in LLM inference. Prefill processes the entire input prompt at once — O(n²) per attention layer. Implementations like FlashAttention reduce the memory bandwidth pressure dramatically via tiled computation, but the asymptotic complexity doesn't change. Decode generates one token at a time, attending only to the current token against the cached history — O(n) per step with the KV cache. Without caching, decode would also be O(n²). The KV cache converts decode from O(n²) to O(n) at the cost of memory.
KV Cache Memory Formula (Multi-Head Attention):
KV_memory = 2 × n_layers × n_heads × d_head × seq_len × bytes_per_param
For a 7B-parameter model with standard MHA (32 layers, 32 heads, head_dim 128) at bfloat16 (2 bytes):
KV_memory per token ≈ 2 × 32 × 32 × 128 × 1 × 2 = 524,288 bytes ≈ 0.5 MB
At 128K context: 0.5 MB × 128,000 = 64 GB of KV cache alone — more than the model weights at bfloat16 (~14 GB).
Note on GQA and MLA: Most modern open-weight models (Llama 3 and Mistral, for example) use Grouped-Query Attention (GQA), which reduces the KV cache by sharing key-value heads across groups of query heads; the architectures of closed models such as GPT-4o are not published. A model with 32 query heads and 8 KV heads (4× reduction) brings the per-token cache from ~0.5 MB to ~0.125 MB, or about 16 GB at 128K context. That is still the dominant memory consumer at long contexts. DeepSeek-class models use Multi-head Latent Attention (MLA), which compresses the K and V projections into a low-rank latent space before storing them, achieving 5–10× memory reduction over standard MHA.
A 70B MHA model (80 layers, 64 heads, head_dim 128, bfloat16) runs to roughly 2.5 MB per token: 2 × 80 × 64 × 128 × 2 bytes = 2,621,440 bytes. At 128K context that's ~320 GB — which is why providers either cap context length aggressively for large models, or charge steeply for long-context calls. GQA with 8 KV heads drops it to ~40 GB, still substantial.
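The formula is easy to wrap in a helper for your own architecture. This sketch treats 128K as 128 × 1024 tokens and reports binary gigabytes (GiB), which reproduces the round figures quoted above; the layer and head counts are the illustrative ones from the text, not a specific released model.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token.

    The leading factor of 2 covers one key vector and one value vector
    per layer per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Total KV cache for one sequence, in GiB (2**30 bytes)."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    return per_token * seq_len / 2**30


CONTEXT = 128 * 1024  # "128K" taken as 131,072 tokens

mha_7b = kv_cache_gib(CONTEXT, n_layers=32, n_kv_heads=32, head_dim=128)
gqa_7b = kv_cache_gib(CONTEXT, n_layers=32, n_kv_heads=8, head_dim=128)
mha_70b = kv_cache_gib(CONTEXT, n_layers=80, n_kv_heads=64, head_dim=128)
gqa_70b = kv_cache_gib(CONTEXT, n_layers=80, n_kv_heads=8, head_dim=128)

print(mha_7b, gqa_7b, mha_70b, gqa_70b)
```

Note that only the number of KV heads enters the formula. Under GQA the query head count stays the same, which is why the cache shrinks without shrinking the model.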
Prompt caching (available from OpenAI, Anthropic, Google) caches the computed KV activations for repeated prompt prefixes. Subsequent calls beginning with the same prefix pay substantially less for those cached tokens (discounts range from roughly 50% to 90% depending on provider and model) and benefit from lower latency because the prefill phase for the cached portion is skipped. A stable system prompt is an ideal caching candidate. One practical constraint: both OpenAI and Anthropic require a minimum prefix length, 1,024 tokens or more depending on the model, before caching activates. A 200-token system prompt won't benefit, which is another reason to consolidate instructions into one substantial block rather than spreading them across multiple small messages.
KV cache quantization is an active area of production optimization: storing the K and V tensors in lower-precision formats (int8 or int4) instead of 16-bit cuts KV cache memory by 2–4× with modest accuracy penalties. Research like KVQuant explores 3-bit and lower precision while targeting 10M-token contexts on a multi-GPU system.
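To show where the memory saving and the accuracy penalty both come from, here is a toy symmetric int8 quantizer for a single vector. It is a simplified illustration only: production schemes (KVQuant included) use finer-grained scales and other refinements, and the sample values are made up.

```python
def quantize_int8(values):
    """Symmetric per-vector int8 quantization. Returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Each value becomes an integer code in [-127, 127].
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# A made-up slice of a key vector.
key_vector = [0.12, -1.5, 0.73, 2.0, -0.004]

codes, scale = quantize_int8(key_vector)
restored = dequantize(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print(codes)
print(f"scale={scale:.5f}, worst absolute error={worst_error:.5f}")
```

Each code fits in one byte instead of two for bfloat16, which is the 2× saving. The rounding error is bounded by half the scale, so a single outlier in the vector inflates the error for every other entry; that is one reason finer-grained scales are used in practice.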
6. Lost in the Middle — Why Performance Collapses Before You Hit the Limit
Memory is the first constraint. Attention quality is the second — and it bites you even when your window is half-empty.
In 2023, researchers at Stanford and UC Berkeley published "Lost in the Middle." They gave LLMs a task requiring them to find a specific document from a set of twenty documents, all injected into the context window. The position of the relevant document was varied systematically.
When the relevant document was first or last, models retrieved it accurately. When it was in the middle positions, accuracy dropped sharply, by more than 20 percentage points for some models. Newer models such as Claude 3.5 and GPT-4o have partially mitigated this bias through long-context fine-tuning. "Partially" is doing a lot of work there: independent evaluations continue to find meaningful position-dependent performance gaps in all current models, even at lengths well within their advertised limits.
Analogy: The Lecture Hall Effect
Students reliably remember a lecture's opening and closing. What happened in the middle of hour one is murky. LLMs have an analogous concentration pattern: strong attention to the beginning and end of the context, with a trough in the middle.
Part of the mechanism is structural. RoPE (Rotary Position Embedding), used in most modern architectures, encodes position as a rotation applied to query and key vectors. A property of this rotation is that the similarity score between two vectors tends to decrease as the distance between their positions increases. At short contexts, the decay is a feature. At long contexts, it becomes a bug: tokens in the middle of a 100K-token window are tens of thousands of positions away from where the model is currently generating, so their similarity scores are systematically suppressed. Distance decay alone would penalize the start of the context even more, so it accounts for the recency peak rather than the primacy peak; the strong attention to the first tokens is usually attributed to separate effects of causal attention, which tends to concentrate weight on initial positions.
A separate effect, context dilution, compounds this: longer surrounding irrelevant context degrades performance even when the relevant content is guaranteed present. The model's attention distributes across noise, reducing effective attention for the signal — like finding one red marble in a bag of ten thousand, even knowing it's there.
A Subtle RAG Bug
If your RAG system retrieves 8 documents and inserts them in the middle of a long conversation history, the most relevant chunks may be in the attention trough. The model generates a response, you see no error, but the answer doesn't reflect those documents. The failure is silent.
Part 4: Solutions
7. Token Budget Math — Calculating Your Real Available Space
Every LLM application needs an explicit token budget with five zones:
| Zone | Typical token range | Fixed or variable? |
|---|---|---|
| System Prompt | 500–2,500 | Fixed per application |
| Tool Schemas | 200–400 per tool | Fixed per agent |
| RAG Context | 0–12,000 | Variable per turn |
| Conversation History | 0 → grows | Grows each turn |
| Generation Reserve | 500–2,000 | Reserved explicitly |
The generation reserve must be reserved explicitly — if your prompt consumes the entire window, the model either generates nothing or truncates its response.
A worked example. Customer support agent, GPT-4o (128K):
Total window: 128,000 tokens
System prompt: -1,400 tokens (measured)
Tool schemas (4 tools): -800 tokens (measured)
Generation reserve: -1,500 tokens (set by us)
─────────────────────────────────────────
Available for dynamic: 124,300 tokens
Of that:
RAG budget: 20,000 tokens (ceiling: 5 chunks × up to 4,000 each)
History budget: ~104,300 tokens (fills over time)
─────────────────────────────────────────
Turns until 80% full:
80% of 128K = 102,400 prompt tokens
Fixed overhead = 1,400 + 800 = 2,200
Per-turn RAG = 800
Per-turn growth = user avg (60) + model avg (350) = 410
Turns until (2,200 + n × 800 + n × 410) ≥ 102,400
n × 1,210 ≥ 100,200
n ≈ 83 turns
83 turns sounds comfortable. But this assumes constant 350-token model replies. A user who triggers several detailed answers can double the history growth rate to 820 tokens per turn, which raises total per-turn growth to 1,620 and cuts that to ~62 turns before the 80% threshold.
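The zone arithmetic above can live in code rather than in a spreadsheet. A minimal sketch using the worked example's numbers; the class and method names are illustrative, not from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Token budget for one model call, split into the zones described above."""
    window: int
    system_prompt: int
    tool_schemas: int
    generation_reserve: int
    rag_cap: int

    @property
    def dynamic(self):
        """Tokens left once fixed costs and the generation reserve are removed."""
        return self.window - self.system_prompt - self.tool_schemas - self.generation_reserve

    @property
    def history(self):
        """Tokens available for conversation history and the new user message."""
        return self.dynamic - self.rag_cap

    def fits(self, rag_tokens, history_tokens, user_tokens):
        """True if this call's variable zones stay within their limits."""
        return rag_tokens <= self.rag_cap and history_tokens + user_tokens <= self.history


budget = ContextBudget(
    window=128_000,
    system_prompt=1_400,
    tool_schemas=800,
    generation_reserve=1_500,
    rag_cap=20_000,
)

print(budget.dynamic, budget.history)
print(budget.fits(rag_tokens=4_000, history_tokens=50_000, user_tokens=60))
```

Keeping the check per zone, rather than one total, makes it obvious which zone to trim when a call does not fit.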
Measure, Don't Estimate
The system prompt and tool schema token counts must be measured with the actual tokenizer, not estimated from character counts. Log prompt_tokens and completion_tokens from every API response. The distribution of prompt_tokens over time is your context growth curve.
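A small counting helper makes the measurement repeatable. The sketch below takes the tokenizer's encode function as a parameter so it runs without dependencies. In a real system you would pass the encoder that matches your model (for OpenAI models, the encode method of a tiktoken encoding such as o200k_base). The whitespace encoder and the per-message overhead of 3 tokens are stand-in assumptions for the demo only.

```python
def count_message_tokens(messages, encode, per_message_overhead=3):
    """Count tokens across chat messages using the supplied `encode` function.

    `per_message_overhead` approximates the role and delimiter tokens that
    chat formats add around each message (an assumption; verify it against
    the prompt_tokens value your API reports).
    """
    return sum(len(encode(m["content"])) + per_message_overhead for m in messages)


def whitespace_encode(text):
    """Stand-in encoder for the demo. NOT a real tokenizer."""
    return text.split()


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Where is my order"},
]

estimated = count_message_tokens(messages, whitespace_encode)
print(estimated)
```

Comparing this local count with the prompt_tokens figure returned by the API is a quick way to catch a wrong encoder or a wrong overhead constant.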
8. Four Strategies for Managing Context Window Limits
Strategy 1: Sliding Window
Keep only the most recent turns of conversation verbatim. In production, truncate by token count, not turn count — a 5-turn history could range from 500 to 8,000 tokens depending on response lengths.
```python
# Turn-count version: simple, good enough for prototyping.
# `history` is a list of {"role", "content"} dicts, so 20 entries is about 10 exchanges.
MAX_HISTORY_TURNS = 20

def build_messages(system_prompt, history, new_message, rag_chunks):
    trimmed_history = history[-MAX_HISTORY_TURNS:]
    messages = [{"role": "system", "content": system_prompt}]
    if rag_chunks:
        context_block = "\n\n".join(rag_chunks)
        messages.append({"role": "system", "content": f"Context:\n{context_block}"})
    messages.extend(trimmed_history)
    messages.append({"role": "user", "content": new_message})
    return messages
```

```python
# Production version: truncate by token count, not turn count.
import tiktoken

_ENCODING = tiktoken.get_encoding("o200k_base")  # encoding for GPT-4o-family models

def count_tokens(text):
    return len(_ENCODING.encode(text))

# Budget for the whole prompt (system + RAG + history + new message).
# Upper bound for a 128K window: 128000 - 800 (tool schemas) - 1500 (reserve) = 125700.
# Tool schemas are not counted below, so keep their measured size out of this number.
PROMPT_TOKEN_BUDGET = 40_000  # adjust for your application

def build_messages_token_bounded(system_prompt, history, new_message, rag_chunks):
    fixed_tokens = count_tokens(system_prompt) + sum(count_tokens(c) for c in rag_chunks)
    new_msg_tokens = count_tokens(new_message)
    # Whatever the fixed parts and the new message leave over goes to history.
    remaining = PROMPT_TOKEN_BUDGET - fixed_tokens - new_msg_tokens

    # Walk history from newest to oldest and keep the contiguous run that fits.
    trimmed_rev = []
    for turn in reversed(history):
        turn_tokens = count_tokens(turn["content"])
        if remaining - turn_tokens < 0:
            break
        trimmed_rev.append(turn)
        remaining -= turn_tokens
    trimmed = list(reversed(trimmed_rev))

    messages = [{"role": "system", "content": system_prompt}]
    if rag_chunks:
        messages.append({"role": "system", "content": "Context:\n" + "\n\n".join(rag_chunks)})
    messages.extend(trimmed)
    messages.append({"role": "user", "content": new_message})
    return messages
```
The drawback of the sliding window is abrupt forgetting: when turn 1 drops, any fact established there is simply gone. For short-lived task-completion agents, this is fine. For long-running conversational assistants, it creates visible gaps.
Strategy 2: Hierarchical Summarization
Keep recent turns verbatim; compress older turns into a rolling summary.
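```python
def format_turns(turns):
    """Render message dicts as 'role: content' lines for the summarizer prompt."""
    return "\n".join(f"{t['role']}: {t['content']}" for t in turns)


async def maybe_compress_history(history, summary, llm, buffer_size=10):
    # `llm` is a placeholder for any client exposing an async
    # `complete(prompt) -> str` method; it is not a specific SDK.
    verbatim_turns = history[-buffer_size:]
    turns_to_summarize = history[:-buffer_size]
    if not turns_to_summarize:
        return history, summary  # nothing old enough to compress yet
    new_summary = await llm.complete(
        f"Existing summary: {summary}\n\n"
        f"New exchanges to incorporate:\n{format_turns(turns_to_summarize)}\n\n"
        "Update the summary to include these exchanges. "
        "Preserve all concrete facts, decisions, and commitments. "
        "Drop conversational filler. Be dense. Max ~400 tokens."
    )
    return verbatim_turns, new_summary
```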
Cap the summary at 200–400 tokens. Run summarization asynchronously — don't make the user wait for the compression cycle.
Strategy 3: Token Compression (LLMLingua)
Use a compression model to identify and remove low-entropy tokens from prompts, achieving 2–3× compression with minor accuracy loss. The most effective targets are verbose system prompts, RAG context chunks, and few-shot examples.
Never apply compression to the current user message: compressing user input changes its meaning before the model sees it. Test in your specific domain before relying on compression for tasks where precision matters (legal, medical, code).
Strategy 4: Embedding-based Retrieval Over History
Store each conversation turn as a dense vector. At each new turn, embed the current user message and retrieve the most relevant prior turns by similarity. Concretely: as each turn completes, embed the user + assistant text and store it in a vector store alongside the full text. On the next user message, embed it, search for top-k similar turns, inject those into context. Keep only 2–3 verbatim recent turns for coherence.
The effect: only the conversation history relevant to the current question enters the context window. A user asking "what was the budget we discussed?" triggers retrieval of those turns — even if they happened fifty exchanges ago. This requires an embedding model, a vector store, and a retrieval call per user message (adding roughly 50–150ms round-trip with a managed API, under 10ms with a self-hosted model).
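The store-embed-search loop can be sketched without any external service. In this toy version a bag-of-words vector stands in for a real embedding model and a Python list stands in for the vector store; both are assumptions made so the example runs on its own, and the class name is illustrative.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class HistoryIndex:
    """In-memory stand-in for a vector store of past conversation turns."""

    def __init__(self):
        self._turns = []  # list of (vector, full_text)

    def add(self, user_text, assistant_text):
        text = f"user: {user_text}\nassistant: {assistant_text}"
        self._turns.append((embed(text), text))

    def search(self, query, k=2):
        """Return the k stored turns most similar to the query."""
        q = embed(query)
        ranked = sorted(self._turns, key=lambda t: cosine(q, t[0]), reverse=True)
        return [text for _, text in ranked[:k]]


index = HistoryIndex()
index.add("What is our budget for the project", "The budget we discussed is 40000 dollars")
index.add("Which framework should we use", "FastAPI fits the service")
index.add("When is the launch", "The launch is planned for March")

hits = index.search("what was the budget we discussed", k=1)
print(hits[0])
```

Swapping in a real embedding model changes only `embed`; swapping in a vector database changes only `HistoryIndex`. The retrieved turns are then injected alongside the two or three verbatim recent turns.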
The four strategies are not mutually exclusive. Production systems often combine them: a sliding window of 5–8 verbatim turns + rolling summary + retrieval from older history covers all distance scales simultaneously.
9. The Practical Playbook
Short task-completion agents (under 20 turns): Use a sliding window of 10–15 turns. Reserve optimization effort for fixed-cost reduction: audit your system prompt for redundant language, consider dynamic tool registration (load only the tools relevant to the current turn).
Long-running conversational assistants: Implement hierarchical summarization with 8–12 verbatim turns. Cap summaries at 400 tokens. Run asynchronously. Periodically audit system prompt size — prompt creep through edits is real. A prompt that started at 600 tokens can quietly grow to 3,000 across six months of product changes.
Document-heavy research assistants (heavy RAG): Limit retrieval to 3–5 top chunks. Apply token compression to chunks before injection. Sort retrieved chunks so the most relevant appears last in the injected block — adjacent to the user question, within the recency attention peak.
Production agents with many tools: Use dynamic tool registration. A routing classifier (even a keyword matcher) identifies which tools are needed before the main model call and includes only those schemas — reducing 2,000 tokens of tool overhead to ~400 on most turns.
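A keyword router of the kind mentioned above might look like the following. The tool names, keyword sets, and the simplified schema dictionaries are all hypothetical; a real agent would hold its provider's full tool-schema format in their place.

```python
import re

# Hypothetical tool registry. The dicts are simplified placeholders,
# not any provider's actual tool-schema format.
TOOL_SCHEMAS = {
    "get_order_status": {"name": "get_order_status", "description": "Look up an order"},
    "issue_refund": {"name": "issue_refund", "description": "Refund a payment"},
    "search_kb": {"name": "search_kb", "description": "Search the help center"},
}

# Hypothetical routing keywords per tool.
TOOL_KEYWORDS = {
    "get_order_status": {"order", "shipping", "tracking", "delivery"},
    "issue_refund": {"refund", "chargeback", "money"},
    "search_kb": {"how", "guide", "help"},
}


def select_tools(user_message, keywords=TOOL_KEYWORDS, schemas=TOOL_SCHEMAS):
    """Return only the schemas whose keywords appear in the user message."""
    words = set(re.findall(r"[a-z]+", user_message.lower()))
    return [schemas[name] for name, kws in keywords.items() if words & kws]


selected = select_tools("Where is my order?")
print([tool["name"] for tool in selected])
```

A keyword matcher will miss paraphrases, so a common safeguard is to fall back to the full tool set when nothing matches, or to let the model ask for a tool by name.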
Context ordering (exploit the attention curve): Instead of the common default (system → RAG → history → user), use: system → recent history (most-recent last) → RAG chunks (most relevant last, adjacent to the user message) → current user message. The most relevant content sits at the end of the context, within the recency attention peak. Older history, the least relevant content, occupies the lower-attention middle.
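The ordering can be enforced in one assembly function. This sketch assumes each retrieved chunk arrives with a relevance score from the retriever, and it reuses the message-dict shape from the earlier snippets. Some APIs accept only a single system prompt, in which case the context block would need to be prepended to the user message instead.

```python
def assemble_context(system_prompt, history, scored_chunks, user_message):
    """Build messages as: system, history, RAG (most relevant last), user.

    `scored_chunks` is a list of (relevance_score, text) pairs; higher
    scores mean more relevant (an assumption about the retriever's output).
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # oldest first, most recent last

    # Ascending sort puts the most relevant chunk next to the user message.
    ordered = sorted(scored_chunks, key=lambda chunk: chunk[0])
    if ordered:
        context = "\n\n".join(text for _, text in ordered)
        messages.append({"role": "system", "content": f"Context:\n{context}"})

    messages.append({"role": "user", "content": user_message})
    return messages


history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
chunks = [(0.9, "refund policy"), (0.2, "shipping times"), (0.5, "warranty terms")]

messages = assemble_context("You are a support agent.", history, chunks, "Can I get a refund?")
for m in messages:
    print(m["role"], "|", m["content"].replace("\n", " / "))
```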
What to monitor:
- prompt_tokens / context_limit: alert above 70%, act above 80%
- Token count by zone per call: when the total grows, know which zone is responsible
- Quality signals segmented by context utilization: you may find degradation starts at 60% in your application
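The first two checks are simple enough to run on every call. A minimal sketch using the 70% and 80% thresholds from the list; the zone names and token counts are illustrative.

```python
def utilization_status(prompt_tokens, context_limit, alert_at=0.70, act_at=0.80):
    """Classify context utilization as 'ok', 'alert', or 'act'."""
    ratio = prompt_tokens / context_limit
    if ratio > act_at:
        return "act"
    if ratio > alert_at:
        return "alert"
    return "ok"


def largest_zone(zone_tokens):
    """Name of the zone contributing the most tokens to this call."""
    return max(zone_tokens, key=zone_tokens.get)


# Illustrative per-zone counts for one call.
zones = {"system": 1_400, "tools": 800, "rag": 4_000, "history": 92_000}
total = sum(zones.values())

status = utilization_status(total, context_limit=128_000)
culprit = largest_zone(zones)
print(total, status, culprit)
```

Emitting the status and the largest zone as metrics on each call gives you the growth curve and tells you where to trim when it climbs.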
Conclusion: The Window Is a System Resource
A context window isn't a document store you fill until it overflows. It's a compute and memory resource with hard physical limits, a quality curve that degrades well before those limits, and an inference cost that grows with every token you put in it.
In a typical agent, the window is 30–60% consumed before the first user message lands. The fix isn't a bigger context window, though headroom helps. It's building a real budget: measure each zone with an actual tokenizer, set hard limits per zone, implement a context manager that enforces those limits on every call, and track utilization in production dashboards the same way you'd track memory or CPU.
The attention degradation problem — "lost in the middle" — adds a second dimension: even when your window is not full, quality depends on where in the window the important information sits. The primacy bias and recency bias are real, measurable effects that application design can exploit or fall victim to.
The four strategies aren't competitors — most production systems end up combining them. Sliding window for the recent turns, rolling summary for the older ones, compression for the RAG chunks, and retrieval for anything that needs to survive beyond the window. Start with the simplest thing that doesn't break your use case, and add layers as your traffic and conversation length grow.
Context engineering doesn't have the glamour of prompt engineering, but it's where most production LLM failures actually live. Missed retrievals, incoherent multi-turn conversations, bloated inference bills — these trace back to context mismanagement more often than they trace back to the wrong model. It fails silently, which is exactly why it's easy to ignore until you can't.
References
Research Papers
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" — Stanford / Berkeley / Samaya AI, 2023. The original paper quantifying the U-shaped attention bias across context positions.
- "Found in the Middle: Calibrating Positional Attention Bias", 2024. Proposes an attention calibration method for the lost-in-the-middle problem, reporting gains of up to 15 percentage points.
- "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval" — 2025. Demonstrates context dilution: longer irrelevant context degrades performance even when the relevant content is guaranteed present.
- "KVQuant: Towards 10M Context Length LLM Inference with KV Cache Quantization", 2024. Explores low-bit quantization of the KV cache, including per-channel quantization of keys, to enable extreme context lengths of up to 10M tokens on a multi-GPU system.
Technical References
- OpenAI — Managing Conversation State — Official docs on conversation history management and token counting.
- Anthropic — Context Window Documentation — Claude context limits, caching strategies, and best practices.
- LLMLingua — Prompt Compression — Microsoft Research open-source project for token-level prompt compression.
- KV Cache Memory: Calculating GPU Requirements for LLM Inference — Interactive calculator for KV cache memory requirements given model architecture parameters.
Background Reading
- The Hidden Costs of Context: Managing Token Budgets in Production LLM Systems — TianPan.co, 2025. Production-focused survey of context management challenges.
- Context Window Management for LLM Apps: Developer Guide — Redis, 2025. Practical implementation patterns for context management in production.
- The Complete Guide to Text Embeddings, Vector Databases & LLMs — Swapnanil Saha, 2026. Deep background on tokenization, BPE, transformer attention, and RAG pipelines referenced throughout this post.