DEV Community

Context First AI
You're Not Reading Words, You're Reading Chunks: Tokens and Context Windows Explained.

AI models don't read words — they read subword chunks called tokens. Every model also has a context window: a hard limit on how much text it can hold in attention at once. Understanding both changes how you write prompts, how you estimate costs, and why AI occasionally behaves in ways that otherwise seem inexplicable.

This is Part 2 of a five-part series from the Vectors pillar of Context First AI. Built for anyone starting their AI journey — developer or not. No prior knowledge assumed beyond Part 1.

Full series:

  • Part 1 — The Autocomplete That Ate the World
  • Part 2 — You're Not Reading Words, You're Reading Chunks
  • Part 3 — Meaning Has a Shape
  • Part 4 — You're Not Writing Prompts, You're Writing Instructions for a Very Particular Mind
  • Part 5 — What to Do When the Model Doesn't Know Enough

The Session That Went Wrong

It started during a long research session — a product team working through a complex competitive analysis with an AI assistant, building up context across dozens of exchanges.

Somewhere around the fortieth message, the model stopped referencing things mentioned early on. Key constraints from the first few prompts. Background context that had shaped everything since. Gone.

The thing nobody had warned them about: the model hadn't forgotten. It had run out of room.

That's the context window. And it's the second of two mechanics — alongside tokenisation — that sit beneath every single AI interaction, shaping what the model can see, what it can process, and what quietly disappears.

What Is a Token?

Before the model reads anything, your text goes through tokenisation — the process of breaking input into the discrete units the model actually processes.

Those units are tokens: subword chunks that can be full words, partial words, or individual punctuation marks. The model never sees raw text. It sees a sequence of token IDs, each corresponding to a chunk in its learned vocabulary.

You can see this directly with OpenAI's tiktoken library:

import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")

texts = [
    "cat",
    "tokenisation",
    "GDPR",
    "Supercalifragilisticexpialidocious"
]

for text in texts:
    tokens = encoder.encode(text)
    decoded = [encoder.decode([t]) for t in tokens]
    print(f"'{text}'  {len(tokens)} token(s): {decoded}")

Running this produces something like:

'cat'  1 token(s): ['cat']
'tokenisation'  3 token(s): ['token', 'is', 'ation']
'GDPR'  2 token(s): ['GD', 'PR']
'Supercalifragilisticexpialidocious'  10 token(s): [...]

The pattern is consistent: common, short English words tend to be single tokens. Unusual words, technical acronyms, and long compound terms fragment into multiple pieces.

Why Tokenisation Matters in Practice

Token count ≠ word count

The most immediate practical consequence is cost. AI APIs charge per token, not per word. The rough conversion for English prose is:

def estimate_tokens(word_count: int, content_type: str = "prose") -> dict:
    """
    Rough token estimation by content type.
    These are approximations — always measure directly for production use.
    """
    multipliers = {
        "prose":       1.3,   # Standard English writing
        "technical":   1.4,   # Dense technical content, acronyms
        "code":        1.5,   # Code tends to tokenise less efficiently
        "multilingual": 1.7,  # Non-English content often fragments more
    }

    multiplier = multipliers.get(content_type, 1.3)
    estimated = int(word_count * multiplier)

    return {
        "word_count": word_count,
        "content_type": content_type,
        "estimated_tokens": estimated,
        "note": "Measure with tiktoken for precision before scaling"
    }

# Examples
print(estimate_tokens(1000, "prose"))       # ~1,300 tokens
print(estimate_tokens(1000, "technical"))   # ~1,400 tokens
print(estimate_tokens(1000, "code"))        # ~1,500 tokens

At small scale this difference is negligible. At high throughput — thousands of API calls per day — it compounds quickly.
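To make the compounding concrete, here's a back-of-envelope sketch of how a per-token price turns into a monthly bill at volume. The $5-per-million figure and call volumes are illustrative assumptions, not current prices; always check your provider's price sheet.

```python
def monthly_cost(calls_per_day: int, tokens_per_call: int,
                 usd_per_million_tokens: float = 5.00) -> float:
    """Back-of-envelope monthly input spend. The $5/M default is an
    illustrative placeholder; verify against current provider pricing."""
    tokens_per_month = calls_per_day * tokens_per_call * 30
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

# The same 1,000-word prompt, estimated as prose (~1,300 tokens)
# versus code-heavy content (~1,500 tokens), at 5,000 calls/day:
print(monthly_cost(5_000, 1_300))  # 975.0
print(monthly_cost(5_000, 1_500))  # 1125.0
```

A 200-token difference per call, invisible in a single request, is a few hundred dollars a month at this volume.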

Fragmented words affect output quality

When a model sees a rare technical term split into five tokens, it has less of a "whole concept" signal to work with than when it sees a familiar word as a single token. This is why models occasionally mishandle very specific regulatory language, uncommon proper nouns, or technical acronyms from specialist domains.

The practical mitigation is to define unfamiliar terms explicitly before using them:

# Less reliable for niche terminology
prompt_naive = "Does this clause comply with DPDPA obligations?"

# More reliable — define before using
prompt_with_context = """
Context: DPDPA refers to India's Digital Personal Data Protection Act 2023,
which governs the processing of digital personal data of Indian residents.

Given this definition, does the following clause comply with DPDPA obligations?
[clause text]
"""

What Is a Context Window?

Every model has a maximum number of tokens it can process in a single forward pass — input plus output combined. This is the context window.

Think of it as a desk. Everything on the desk is visible and usable. Everything not on the desk doesn't exist for this conversation. When the desk fills up, something has to fall off — typically the oldest content.
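Because input and output share the same window, it helps to budget explicitly before filling the desk. A minimal sketch, where the window size, output reservation, and per-message overhead are all illustrative (real limits vary by model):

```python
def context_budget(window: int, expected_output: int, overhead: int = 500) -> int:
    """Input tokens remaining once the expected reply and a rough
    allowance for message formatting overhead are reserved.
    All figures here are illustrative; check your model's documented limit."""
    return window - expected_output - overhead

# e.g. a 128k-token window with 4k reserved for the reply:
print(context_budget(128_000, 4_000))  # 123500
```

Anything beyond that budget has to come off the desk before the request is sent, which is what the context-management code below does.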

import tiktoken
import openai

client = openai.OpenAI()

def chat_with_context_management(
    messages: list[dict],
    model: str = "gpt-4o",
    max_context_tokens: int = 100_000,
    reserve_for_output: int = 2_000
) -> str:
    """
    Simple context management: trim oldest messages if approaching limit.
    Production implementations should use more sophisticated strategies.
    """
    encoder = tiktoken.encoding_for_model(model)

    def count_tokens(msgs):
        return sum(len(encoder.encode(m["content"])) for m in msgs)

    usable_limit = max_context_tokens - reserve_for_output

    # Preserve system message (index 0), trim from oldest user/assistant pairs
    system = [messages[0]] if messages[0]["role"] == "system" else []
    conversation = messages[len(system):]

    while count_tokens(system + conversation) > usable_limit and len(conversation) > 1:
        conversation = conversation[2:]  # Drop oldest user+assistant pair

    trimmed_messages = system + conversation

    response = client.chat.completions.create(
        model=model,
        messages=trimmed_messages
    )

    return response.choices[0].message.content

This is a simplified illustration. Production systems need more nuanced strategies — summarisation of dropped context, semantic retrieval of relevant history, or explicit memory management layers. But the underlying constraint is the same regardless of how you handle it.
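As one illustration of the summarisation strategy, here's a sketch that folds dropped messages into a running summary instead of discarding them outright. Two loud assumptions: the `summarise` callable is a stand-in (in practice it would be another LLM call with a "summarise these messages" prompt), and the word-count heuristic stands in for a real tokeniser like tiktoken so the sketch stays self-contained.

```python
from typing import Callable

def rough_tokens(msgs: list[dict]) -> int:
    # Crude stand-in for a real tokeniser (e.g. tiktoken):
    # whitespace word count scaled by the ~1.3 tokens/word prose heuristic.
    return int(sum(len(m["content"].split()) for m in msgs) * 1.3)

def trim_with_summary(
    messages: list[dict],
    summarise: Callable[[list[str]], str],
    max_tokens: int,
) -> list[dict]:
    """Fold the oldest messages into a summary message rather than
    dropping them. Simplification: the inserted summary's own tokens
    are not re-checked against the budget."""
    messages = list(messages)  # Work on a copy; leave the caller's list intact
    dropped: list[dict] = []
    while rough_tokens(messages) > max_tokens and len(messages) > 1:
        dropped.append(messages.pop(0))
    if dropped:
        summary = summarise([m["content"] for m in dropped])
        messages.insert(0, {
            "role": "system",
            "content": f"Summary of earlier conversation: {summary}",
        })
    return messages
```

Any function from a list of strings to a string will do for `summarise` during testing; the point is that early constraints survive in compressed form instead of silently falling off the desk.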

The Lost in the Middle Problem

Larger context windows have enabled new use cases — full document analysis, long research sessions, multi-document reasoning. But a larger ceiling hasn't eliminated positional bias.

Research has shown that models attend more strongly to content at the beginning and end of long contexts than to content buried in the middle. The implication for prompt design is direct:

def structure_prompt_for_attention(
    task_instruction: str,
    background_context: str,
    primary_document: str,
    output_format: str
) -> str:
    """
    Structure prompt to put high-attention content at boundaries.
    Task instruction and output format bookend the context.
    Background goes in the middle where precision matters less.
    """
    return f"""TASK: {task_instruction}

OUTPUT FORMAT: {output_format}

BACKGROUND CONTEXT:
{background_context}

PRIMARY DOCUMENT TO ANALYSE:
{primary_document}

Reminder: {task_instruction}
Respond in the format specified above."""


The task instruction appears twice — at the top and again as a reminder before the model generates output. The format specification is also near the top. The background context, which requires less precision, sits in the middle where attention is naturally lower.

This is not a workaround for a bug. It is prompt design that accounts for how attention actually distributes across a long context.

Practical Token Efficiency

One of the highest-leverage early habits for anyone calling LLM APIs is prompt auditing — measuring actual token consumption before scaling.

import tiktoken

def audit_prompt(system_prompt: str, user_message: str, model: str = "gpt-4o") -> dict:
    """
    Audit token usage before sending — useful during development.
    """
    encoder = tiktoken.encoding_for_model(model)

    system_tokens = len(encoder.encode(system_prompt))
    user_tokens = len(encoder.encode(user_message))
    total_input = system_tokens + user_tokens

    # Approximate cost at GPT-4o input pricing (verify current pricing)
    cost_per_million = 5.00  # USD — check current pricing at platform.openai.com
    estimated_cost = (total_input / 1_000_000) * cost_per_million

    return {
        "system_tokens": system_tokens,
        "user_tokens": user_tokens,
        "total_input_tokens": total_input,
        "estimated_cost_per_call_usd": round(estimated_cost, 6),
        "note": "Output tokens billed separately. Verify pricing at platform.openai.com"
    }

Usage

result = audit_prompt(
    system_prompt="You are a helpful assistant specialising in contract review.",
    user_message="Review the following contract and identify any unusual indemnity clauses: [contract text]"
)
print(result)

Running this during development — before committing to a prompt structure — surfaces inefficiencies early, when they're cheap to fix.

What Comes Next

Tokenisation prepares the text. The context window determines what the model can see. But neither of these explains how the model derives meaning from those chunks.

That's Part 3. We'll look at embeddings — how models represent concepts as positions in geometric space, and why that representation is the key to understanding semantic search, retrieval, and why AI can find relevant information even when you don't use the exact right words.

See you there.

Created with AI assistance. Originally published at [Context First AI](https://contextfirst.ai)
