Tokens, Chunks, and Context Windows (Explained Simply)
When we build LLM applications, we often think in terms of:
- documents
- conversations
- prompts
But large language models don’t see any of that.
LLMs see only tokens — arranged in a limited context window.
Understanding this mental model is essential for:
- RAG systems
- agentic workflows
- cost control
- avoiding “lost in the middle” failures
This article explains how LLMs actually process context — without assuming any framework knowledge.
1. LLMs Don’t Read Text — They Read Tokens
LLMs operate on tokens, not words or characters.
A token might be:
- a word
- part of a word
- punctuation
- whitespace
Example:
"Observability matters."
→ ["Observ", "ability", " matters", "."]
Why this matters
- Token count ≠ word count
- Cost is calculated per token
- Context limits are token-based
Rule of thumb:
1 token ≈ 0.75 words (English, approximate)
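
You can check token counts yourself. Here is a minimal sketch using OpenAI's tiktoken library (the cl100k_base encoding is an assumption; exact splits vary by model and tokenizer):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models; other models tokenize differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "Observability matters."
token_ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print([enc.decode([t]) for t in token_ids])  # show the individual token pieces
```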
2. The Context Window: The Model’s Short-Term Memory
Each LLM has a fixed context window.
Examples (approximate):
- GPT-4 family: 8k–128k tokens (depending on variant)
- Claude models: 100k–200k tokens
- Open-weight models: often 4k–32k
Everything the model “knows” for a response must fit inside this window:
- system prompt
- user input
- retrieved documents
- tool outputs
- conversation history
When the window is full, something must be dropped.
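
A rough budget check makes this concrete. The window size, reserve, and helper below are illustrative assumptions, not a real API:

```python
# Illustrative budget check -- all numbers here are placeholders.
CONTEXT_WINDOW = 8_000        # model-dependent hard limit, in tokens
RESERVED_FOR_OUTPUT = 1_000   # leave room for the model's answer

def rough_token_count(text: str) -> int:
    # Uses the rule of thumb above: roughly 0.75 words per token.
    return int(len(text.split()) / 0.75)

parts = {
    "system_prompt": "You are a helpful assistant. Follow these rules...",
    "conversation_history": "...earlier turns...",
    "retrieved_chunks": "...chunk 1... ...chunk 2...",
    "user_query": "How do context windows work?",
}

used = sum(rough_token_count(text) for text in parts.values())
budget = CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

if used > budget:
    print(f"Over budget by {used - budget} tokens: drop, truncate, or summarize something.")
else:
    print(f"{budget - used} tokens of headroom left.")
```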
3. Why “More Context” Often Makes Answers Worse
A common instinct:
“Let’s just add more documents.”
This often degrades output quality.
Why?
- Attention is distributed across all tokens
- Relevant facts compete with noise
- Important details get buried
This is known as the “lost in the middle” problem:
- Tokens near the beginning and end get more attention
- Middle chunks are easiest to ignore
More context ≠ better context
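
One common mitigation is to reorder retrieved chunks so the strongest matches sit at the start and end of the prompt. A minimal sketch (the reordering heuristic is an assumption, not any specific library's behavior):

```python
def reorder_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Place the highest-ranked chunks at the start and end of the prompt,
    pushing weaker matches toward the middle (where they are most likely
    to be overlooked anyway)."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# chunks ranked most-relevant first
ranked = ["chunk A (best)", "chunk B", "chunk C", "chunk D", "chunk E (weakest)"]
print(reorder_for_attention(ranked))
# -> ['chunk A (best)', 'chunk C', 'chunk E (weakest)', 'chunk D', 'chunk B']
```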
4. Chunks: The Units of Retrieval (Not Documents)
In RAG systems:
- You don’t retrieve documents
- You retrieve chunks
Each chunk:
- Consumes context window space
- Competes for attention
- Adds cost
Poor chunking leads to:
- Partial facts
- Broken reasoning
- Hallucinations due to missing details
This is why chunk size and overlap are architectural decisions, not tuning knobs.
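
To make those knobs concrete, here is a naive fixed-size chunker with overlap. It is a sketch: character-based splitting is the simplest possible strategy, and the defaults are placeholders, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size, character-based chunker with overlap.
    Real pipelines usually split on sentence or paragraph boundaries instead."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```

Note the design trade-off: overlap spends extra tokens to buy continuity, so facts that straddle a chunk boundary are not cut in half.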
5. Why Chunk Size Is a Trade-Off (Not a Constant)
Small chunks:
- Higher recall
- Lower semantic completeness
Large chunks:
- Better local context
- Fewer chunks fit in the window
There is no universal best size.
Good systems adapt chunking based on:
- document type
- query intent
- downstream reasoning needs
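
As a sketch of what "adapting chunking" can look like in practice, here is a hypothetical per-document-type profile table. All names and numbers are illustrative assumptions:

```python
# Illustrative defaults only -- the right numbers depend on your corpus, model, and queries.
CHUNKING_PROFILES = {
    "faq":      {"chunk_size": 300,  "overlap": 0},    # short, self-contained answers
    "api_docs": {"chunk_size": 600,  "overlap": 80},   # keep signatures with their descriptions
    "legal":    {"chunk_size": 1200, "overlap": 200},  # clauses need surrounding context
    "chat_log": {"chunk_size": 400,  "overlap": 100},  # preserve conversational flow
}

def chunking_params(doc_type: str) -> dict:
    return CHUNKING_PROFILES.get(doc_type, {"chunk_size": 500, "overlap": 50})
```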
6. Context Is a Budget (Spend It Wisely)
Think of the context window as a budget:
| Context consumer | Typical token cost |
|---|---|
| System prompt | Fixed |
| User query | Variable |
| Retrieved chunks | High |
| Tool outputs | Often expensive |
| Conversation history | Grows fast |
If you don’t control this budget:
- Latency increases
- Costs spike
- Answer quality drops
This is why production systems:
- Limit retrieved chunks
- Summarize aggressively
- Prune conversation history
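
For example, pruning conversation history can be as simple as dropping the oldest turns until the budget fits. A minimal sketch (prune_history and the token counter are hypothetical helpers):

```python
from typing import Callable

def prune_history(messages: list[str], max_tokens: int,
                  count_tokens: Callable[[str], int]) -> list[str]:
    """Drop the oldest turns until the history fits the token budget.
    (Production systems often summarize dropped turns instead of discarding them.)"""
    pruned = list(messages)
    while pruned and sum(count_tokens(m) for m in pruned) > max_tokens:
        pruned.pop(0)  # oldest turn goes first
    return pruned

history = ["turn 1: ...", "turn 2: ...", "turn 3: ..."]
rough_count = lambda text: int(len(text.split()) / 0.75)
print(prune_history(history, max_tokens=8, count_tokens=rough_count))
```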
7. Why This Matters Before You Touch RAG or Agents
Before you reach for:
- vector databases
- LangChain
- agents
- graphs
You must understand:
- what the model can actually see
- how attention is distributed
- why context placement matters
Without this, you’re debugging symptoms — not causes.
8. Key Takeaways
- LLMs see tokens, not text
- Context windows are hard limits
- More context can hurt accuracy
- Chunking is a first-class design decision
- Treat context like a scarce resource
What’s Next
In the next article, I’ll break down:
Document Ingestion Pipelines: Loaders, Splitters, and Embeddings — Why Most RAG Systems Fail Here
Because once you understand how LLMs see context, ingestion decisions start to make a lot more sense.
