
Parth Sarthi Sharma

How LLMs Actually “See” Context (Tokens, Chunks, Windows)

[Diagram: how LLMs actually process information]

Tokens, Chunks, and Context Windows (Explained Simply)

When we build LLM applications, we often think in terms of:

  • documents
  • conversations
  • prompts

But large language models don’t see any of that.

LLMs see only tokens — arranged in a limited context window.

Understanding this mental model is essential for:

  • RAG systems
  • agentic workflows
  • cost control
  • avoiding “lost in the middle” failures

This article explains how LLMs actually process context — without assuming any framework knowledge.

1. LLMs Don’t Read Text — They Read Tokens

LLMs operate on tokens, not words or characters.

A token might be:

  • a word
  • part of a word
  • punctuation
  • whitespace

Example:
"Observability matters."
→ ["Observ", "ability", " matters", "."]

Why this matters

  • Token count ≠ word count
  • Cost is calculated per token
  • Context limits are token-based

Rule of thumb:
1 token ≈ 0.75 words (English, approximate)
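
You can check token counts yourself. Here is a minimal sketch using OpenAI's tiktoken library (the cl100k_base encoding is an assumption; other model families use different tokenizers, so the exact split will differ from the example above):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by several OpenAI models; other models differ.
enc = tiktoken.get_encoding("cl100k_base")

text = "Observability matters."
token_ids = enc.encode(text)

print(len(token_ids))                        # token count, not word count
print([enc.decode([t]) for t in token_ids])  # the individual token strings
```

Run it on your own prompts and documents: the token count is what you are billed for and what counts against the context window.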

2. The Context Window: The Model’s Short-Term Memory

Each LLM has a fixed context window.

Examples (approximate):

  • GPT-4: 8k–128k tokens
  • Claude: 100k+ tokens
  • Open models: often 4k–32k

Everything the model “knows” for a response must fit inside this window:

  • system prompt
  • user input
  • retrieved documents
  • tool outputs
  • conversation history

When the window is full, something must be dropped.
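
What "dropping" looks like in practice, as a rough sketch: keep the system prompt, then keep the most recent messages until a token budget runs out. The count_tokens helper and the 8,000-token limit are illustrative assumptions, not any provider's API.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def fit_to_window(system_prompt: str, history: list[str], max_tokens: int = 8000) -> list[str]:
    """Keep the system prompt, then add messages newest-first until the
    token budget is spent. Everything older is silently dropped."""
    budget = max_tokens - count_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(history):   # newest message first
        cost = count_tokens(message)
        if cost > budget:
            break                       # window is full: drop the rest
        kept.append(message)
        budget -= cost
    return [system_prompt] + list(reversed(kept))
```

Naive truncation like this is exactly why long conversations "forget" their earlier turns.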

3. Why “More Context” Often Makes Answers Worse

A common instinct:

“Let’s just add more documents.”

This often degrades output quality.

Why?

  • Attention is distributed across all tokens
  • Relevant facts compete with noise
  • Important details get buried

This is known as the “lost in the middle” problem:

  • Tokens near the beginning and end get more attention
  • Middle chunks are easiest to ignore

More context ≠ better context
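
One common mitigation is to reorder retrieved chunks so the highest-ranked ones sit at the edges of the prompt, where attention tends to be strongest. A rough sketch (illustrative, not any specific library's API):

```python
def reorder_for_attention(scored_chunks: list[tuple[str, float]]) -> list[str]:
    """Alternate high-scoring chunks between the front and the back of the
    prompt so the least relevant ones end up in the middle."""
    ranked = sorted(scored_chunks, key=lambda c: c[1], reverse=True)
    front: list[str] = []
    back: list[str] = []
    for i, (chunk, _score) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

chunks = [("a", 0.9), ("b", 0.8), ("c", 0.6), ("d", 0.5), ("e", 0.2)]
print(reorder_for_attention(chunks))  # ['a', 'c', 'e', 'd', 'b']: least relevant in the middle
```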

4. Chunks: The Units of Retrieval (Not Documents)

In RAG systems:

  • You don’t retrieve documents
  • You retrieve chunks

Each chunk:

  • Consumes context window space
  • Competes for attention
  • Adds cost

Poor chunking leads to:

  • Partial facts
  • Broken reasoning
  • Hallucinations due to missing details

This is why chunk size and overlap are architectural decisions, not tuning knobs.
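
For reference, here is a minimal character-based splitter with overlap. The sizes are illustrative; production splitters usually work on tokens or on document structure (headings, paragraphs, code blocks):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks. The overlap means a fact that
    straddles a boundary still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The overlap parameter is the knob that prevents partial facts; the chunk size is the knob that decides how many chunks fit in the window.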

5. Why Chunk Size Is a Trade-Off (Not a Constant)

Small chunks:

  • Higher recall
  • Lower semantic completeness

Large chunks:

  • Better local context
  • Fewer chunks fit in the window

There is no universal best size.

Good systems adapt chunking based on:

  • document type
  • query intent
  • downstream reasoning needs
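
In practice, that adaptation can start as something as simple as a lookup of chunk parameters per document type. The categories and numbers below are assumptions for illustration, not recommendations:

```python
# Illustrative defaults only; tune against your own retrieval quality metrics.
CHUNK_PROFILES = {
    "faq":       {"chunk_size": 300,  "overlap": 30},   # short, self-contained answers
    "prose":     {"chunk_size": 800,  "overlap": 100},  # narrative documents
    "reference": {"chunk_size": 1500, "overlap": 200},  # long technical sections
}

def chunking_params(doc_type: str) -> dict:
    return CHUNK_PROFILES.get(doc_type, CHUNK_PROFILES["prose"])

# Usage with the chunk_text sketch above:
# chunks = chunk_text(document, **chunking_params("faq"))
```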

6. Context Is a Budget (Spend It Wisely)

Think of the context window as a budget:

  • System prompt: fixed cost
  • User query: variable
  • Retrieved chunks: high
  • Tool outputs: often expensive
  • Conversation history: grows fast

If you don’t control this budget:

  • Latency increases
  • Costs spike
  • Answer quality drops

This is why production systems:

  • Limit retrieved chunks
  • Summarize aggressively
  • Prune conversation history
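
Here is a rough sketch of that budget enforcement before a model call. The 8,000-token window, the 25% history cap, and the helper names are illustrative assumptions:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def build_context(system_prompt: str, query: str, ranked_chunks: list[str],
                  history: list[str], max_tokens: int = 8000):
    """Assemble a prompt under a hard token budget: fixed consumers first,
    then history and retrieved chunks trimmed to whatever remains."""
    budget = max_tokens - count_tokens(system_prompt) - count_tokens(query)

    # Cap conversation history at roughly a quarter of the window.
    history_budget = max_tokens // 4
    kept_history: list[str] = []
    for message in reversed(history):               # newest first
        cost = count_tokens(message)
        if cost > history_budget or cost > budget:
            break
        kept_history.append(message)
        history_budget -= cost
        budget -= cost

    # Spend what is left on retrieved chunks, best-ranked first.
    kept_chunks: list[str] = []
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break
        kept_chunks.append(chunk)
        budget -= cost

    return system_prompt, list(reversed(kept_history)), kept_chunks, query
```

Summarization and pruning slot in wherever a consumer would otherwise blow past its share of the budget.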

7. Why This Matters Before You Touch RAG or Agents

Before:

  • vector databases
  • LangChain
  • agents
  • graphs

You must understand:

  • what the model can actually see
  • how attention is distributed
  • why context placement matters

Without this, you’re debugging symptoms — not causes.

8. Key Takeaways

  • LLMs see tokens, not text
  • Context windows are hard limits
  • More context can hurt accuracy
  • Chunking is a first-class design decision
  • Treat context like a scarce resource

What’s Next

In the next article, I’ll break down:

Document Ingestion Pipelines: Loaders, Splitters, and Embeddings — Why Most RAG Systems Fail Here

Because once you understand how LLMs see context, ingestion decisions start to make a lot more sense.
