Tokens, Chunks, and Context Windows (Explained Simply)
When we build LLM applications, we often think in terms of:
- documents
- conversations
- prompts
But large language models don’t see any of that.
LLMs see only tokens — arranged in a limited context window.
Understanding this mental model is essential for:
- RAG systems
- agentic workflows
- cost control
- avoiding “lost in the middle” failures
This article explains how LLMs actually process context — without assuming any framework knowledge.
1. LLMs Don’t Read Text — They Read Tokens
LLMs operate on tokens, not words or characters.
A token might be:
- a word
- part of a word
- punctuation
- whitespace
Example:
"Observability matters."
→ ["Observ", "ability", " matters", "."]
Why this matters
- Token count ≠ word count
- Cost is calculated per token
- Context limits are token-based
Rule of thumb:
1 token ≈ 0.75 words (English, approximate)
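
You can check token counts yourself. Here is a minimal sketch using OpenAI's tiktoken library (the cl100k_base encoding is an assumption; exact splits vary by model and tokenizer):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models; other models tokenize differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "Observability matters."
token_ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print([enc.decode([t]) for t in token_ids])  # show the individual token pieces
```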
2. The Context Window: The Model’s Short-Term Memory
Each LLM has a fixed context window.
Examples (approximate):
- GPT-4 family: 8k–128k tokens (depending on variant)
- Claude models: 100k–200k tokens
- Open-weight models: often 4k–32k
Everything the model “knows” for a response must fit inside this window:
- system prompt
- user input
- retrieved documents
- tool outputs
- conversation history
When the window is full, something must be dropped.
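
A rough budget check makes this concrete. The window size, reserve, and helper below are illustrative assumptions, not a real API:

```python
# Illustrative budget check -- all numbers here are placeholders.
CONTEXT_WINDOW = 8_000        # model-dependent hard limit, in tokens
RESERVED_FOR_OUTPUT = 1_000   # leave room for the model's answer

def rough_token_count(text: str) -> int:
    # Uses the rule of thumb above: roughly 0.75 words per token.
    return int(len(text.split()) / 0.75)

parts = {
    "system_prompt": "You are a helpful assistant. Follow these rules...",
    "conversation_history": "...earlier turns...",
    "retrieved_chunks": "...chunk 1... ...chunk 2...",
    "user_query": "How do context windows work?",
}

used = sum(rough_token_count(text) for text in parts.values())
budget = CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

if used > budget:
    print(f"Over budget by {used - budget} tokens: drop, truncate, or summarize something.")
else:
    print(f"{budget - used} tokens of headroom left.")
```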
3. Why “More Context” Often Makes Answers Worse
A common instinct:
“Let’s just add more documents.”
This often degrades output quality.
Why?
- Attention is distributed across all tokens
- Relevant facts compete with noise
- Important details get buried
This is known as the “lost in the middle” problem:
- Tokens near the beginning and end get more attention
- Middle chunks are easiest to ignore
More context ≠ better context
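
One common mitigation is to reorder retrieved chunks so the strongest matches sit at the start and end of the prompt. A minimal sketch (the reordering heuristic is an assumption, not any specific library's behavior):

```python
def reorder_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Place the highest-ranked chunks at the start and end of the prompt,
    pushing weaker matches toward the middle (where they are most likely
    to be overlooked anyway)."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# chunks ranked most-relevant first
ranked = ["chunk A (best)", "chunk B", "chunk C", "chunk D", "chunk E (weakest)"]
print(reorder_for_attention(ranked))
# -> ['chunk A (best)', 'chunk C', 'chunk E (weakest)', 'chunk D', 'chunk B']
```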
4. Chunks: The Units of Retrieval (Not Documents)
In RAG systems:
- You don’t retrieve documents
- You retrieve chunks
Each chunk:
- Consumes context window space
- Competes for attention
- Adds cost
Poor chunking leads to:
- Partial facts
- Broken reasoning
- Hallucinations due to missing details
This is why chunk size and overlap are architectural decisions, not tuning knobs.
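
To make those knobs concrete, here is a naive fixed-size chunker with overlap. It is a sketch: character-based splitting is the simplest possible strategy, and the defaults are placeholders, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size, character-based chunker with overlap.
    Real pipelines usually split on sentence or paragraph boundaries instead."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```

Note the design trade-off: overlap spends extra tokens to buy continuity, so facts that straddle a chunk boundary are not cut in half.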
5. Why Chunk Size Is a Trade-Off (Not a Constant)
Small chunks:
- Higher recall
- Lower semantic completeness
Large chunks:
- Better local context
- Fewer chunks fit in the window
There is no universal best size.
Good systems adapt chunking based on:
- document type
- query intent
- downstream reasoning needs
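
As a sketch of what "adapting chunking" can look like in practice, here is a hypothetical per-document-type profile table. All names and numbers are illustrative assumptions:

```python
# Illustrative defaults only -- the right numbers depend on your corpus, model, and queries.
CHUNKING_PROFILES = {
    "faq":      {"chunk_size": 300,  "overlap": 0},    # short, self-contained answers
    "api_docs": {"chunk_size": 600,  "overlap": 80},   # keep signatures with their descriptions
    "legal":    {"chunk_size": 1200, "overlap": 200},  # clauses need surrounding context
    "chat_log": {"chunk_size": 400,  "overlap": 100},  # preserve conversational flow
}

def chunking_params(doc_type: str) -> dict:
    return CHUNKING_PROFILES.get(doc_type, {"chunk_size": 500, "overlap": 50})
```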
6. Context Is a Budget (Spend It Wisely)
Think of the context window as a budget:
| Context consumer | Typical token cost |
|---|---|
| System prompt | Fixed |
| User query | Variable |
| Retrieved chunks | High |
| Tool outputs | Often expensive |
| Conversation history | Grows fast |
If you don’t control this budget:
- Latency increases
- Costs spike
- Answer quality drops
This is why production systems:
- Limit retrieved chunks
- Summarize aggressively
- Prune conversation history
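
For example, pruning conversation history can be as simple as dropping the oldest turns until the budget fits. A minimal sketch (prune_history and the token counter are hypothetical helpers):

```python
from typing import Callable

def prune_history(messages: list[str], max_tokens: int,
                  count_tokens: Callable[[str], int]) -> list[str]:
    """Drop the oldest turns until the history fits the token budget.
    (Production systems often summarize dropped turns instead of discarding them.)"""
    pruned = list(messages)
    while pruned and sum(count_tokens(m) for m in pruned) > max_tokens:
        pruned.pop(0)  # oldest turn goes first
    return pruned

history = ["turn 1: ...", "turn 2: ...", "turn 3: ..."]
rough_count = lambda text: int(len(text.split()) / 0.75)
print(prune_history(history, max_tokens=8, count_tokens=rough_count))
```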
7. Why This Matters Before You Touch RAG or Agents
Before you reach for:
- vector databases
- LangChain
- agents
- graphs
You must understand:
- what the model can actually see
- how attention is distributed
- why context placement matters
Without this, you’re debugging symptoms — not causes.
8. Key Takeaways
- LLMs see tokens, not text
- Context windows are hard limits
- More context can hurt accuracy
- Chunking is a first-class design decision
- Treat context like a scarce resource
What’s Next
In the next article, I’ll break down:
Document Ingestion Pipelines: Loaders, Splitters, and Embeddings — Why Most RAG Systems Fail Here
Because once you understand how LLMs see context, ingestion decisions start to make a lot more sense.
