Jimin Lee

Understanding Context Window Size in LLMs

Note: This article was originally written in May 2024. Even though I’ve updated parts of it, some parts may feel a bit dated by today’s standards. However, most of the key ideas about LLMs remain just as relevant today.

Why Context Window Size Matters in LLMs

When you look at a new LLM release these days, there’s one number that always makes the headlines: context window size. Accuracy, cost, latency—those all matter, but context length has become a bragging right of its own. Google even announced that Gemini can handle up to 1 million tokens in a single context window.

But what exactly is a context window? Why does it matter so much, and why is it such a hard technical problem? Let’s dive in.


What’s a Context Window, Anyway?

In simple terms, the context window is the maximum amount of input text an LLM can process at once. If you’re using ChatGPT, it’s the length of your prompt—the question you ask or the instructions you give.

Since LLMs don’t process raw characters but tokens, we measure context size in tokens. Roughly speaking, in English, 1 token ≈ 2 to 4 characters (though this varies depending on the tokenizer).
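If you want to see this in action, here's a minimal sketch using the tiktoken library (the tokenizer behind OpenAI's models); the sample sentence and the choice of the cl100k_base encoding are just for illustration, and other models ship their own tokenizers with different character-per-token ratios.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "The context window is measured in tokens, not characters."
tokens = enc.encode(text)

print(f"{len(text)} characters -> {len(tokens)} tokens")
print(tokens[:10])          # the first few token IDs
print(enc.decode(tokens))   # round-trips back to the original text
```

Run it on your own text to get a feel for how much the "characters per token" rule of thumb drifts between languages and writing styles.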

Here’s how popular models stack up:

Model        Context Window
GPT-3.5      4K
GPT-4        8K
GPT-4-32K    32K
LLaMA 2      4K
Gemini       32K
Gemini 1.5   1M

The bigger the context window, the more information you can stuff into a single prompt.


Why Bigger Is Better

Imagine you’re building a document summarizer with an LLM. You’d probably use a prompt like this:

You are a summarizer. Summarize the following document faithfully, without adding outside information, and preserve the original language.

(original text)

The catch: how much of that document can you actually fit inside the prompt?

  • A full-length novel is ~100K tokens (Harry Potter and the Sorcerer’s Stone is in this ballpark).
  • The Lord of the Rings trilogy? About 750K tokens.
  • Add HTML tags or PDF metadata, and the input size balloons further.
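
To make those numbers concrete, here's a rough token-budget check you might run before calling a summarization API. The 4K limit, the output reservation, and the novel.txt path are placeholder assumptions for this sketch, not anything a specific provider prescribes.

```python
import tiktoken

CONTEXT_WINDOW = 4096       # e.g., a 4K model; adjust to your model's limit
RESERVED_FOR_OUTPUT = 512   # leave room for the summary itself

enc = tiktoken.get_encoding("cl100k_base")

prompt_template = (
    "You are a summarizer. Summarize the following document faithfully, "
    "without adding outside information, and preserve the original language.\n\n"
)

def fits_in_context(document: str) -> bool:
    """Check whether prompt + document fit inside the context window."""
    total = len(enc.encode(prompt_template)) + len(enc.encode(document))
    return total <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

document = open("novel.txt", encoding="utf-8").read()  # hypothetical input file
print("fits" if fits_in_context(document) else "too long: needs chunking or a bigger window")
```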

Now factor in multimodal input: modern LLMs also accept images, audio, even video. Google claims Gemini 1.5’s 1M token window can fit an hour of video or 11 hours of audio in one shot.

In other words: the bigger the window, the richer and more complex the tasks you can handle.


Why It’s Hard to Scale Up

So if everyone wants bigger windows, why don’t all models just support it? Turns out, it’s not as easy as flipping a switch.

From here on, we’ll dive into the inner workings of the Transformer itself—so a bit of background knowledge is helpful.

If you’re new to Transformers, I recommend checking out this blog post or one of the many excellent Transformer explainers available online before continuing.

1. It’s Not About Parameters

Model size (embedding dimensions, number of layers, heads, etc.) doesn’t directly set the context window. Architecturally, Transformers can, in theory, handle arbitrarily long inputs.

But in practice, models are only trained up to a certain input length. GPT-3.5, for instance, is “guaranteed” to work well up to 4K tokens. Sure, you could feed it 8K—but performance would degrade. That’s why APIs enforce strict limits: not because the model literally can’t run longer, but because providers don’t want you seeing garbage outputs.
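
One way to convince yourself of this: a self-attention layer has exactly the same weights whether it sees 128 tokens or 4,096. A toy PyTorch sketch (the dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Same layer, same weights -- only the sequence length changes.
for seq_len in (128, 1024, 4096):
    x = torch.randn(1, seq_len, d_model)
    out, _ = attn(x, x, x, need_weights=False)
    n_params = sum(p.numel() for p in attn.parameters())
    print(f"seq_len={seq_len:>5}  output={tuple(out.shape)}  params={n_params}")
```

The layer runs at every length; what changes is whether the model has ever been trained at that length.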

2. Training Data

If you only ever read 10-page short stories, you might become an excellent short-story writer. But if someone asked you to write a 10-volume epic, you’d probably stumble.

LLMs are the same. To handle long inputs, they need to train on long sequences. But high-quality long-form data is hard to find. Simply concatenating short snippets doesn’t cut it—you end up with noise, not real long-form structure.

3. Compute Cost

Transformers use attention mechanisms with quadratic complexity: doubling the input length makes computation ~4× slower.

  • Training time at least doubles (think 6 months → 12 months or more).
  • Inference latency at least doubles (3s → 6s or more).
  • Serving cost rises, so API prices go up.

This is why you see so many innovations like FlashAttention, Multi-Query Attention (MQA), and Grouped Query Attention (GQA)—they make long contexts computationally feasible.
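
Some back-of-the-envelope arithmetic shows why: the raw attention score matrix is n × n per head, so doubling n quadruples both the FLOPs and the memory for that step. The head count and 2-byte precision below are illustrative assumptions:

```python
def attention_matrix_bytes(seq_len, n_heads=32, bytes_per_value=2):
    """Memory for the raw attention score matrices (one n x n matrix per head)."""
    return seq_len * seq_len * n_heads * bytes_per_value

for n in (4_096, 8_192, 131_072):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>7}: ~{gib:,.1f} GiB of attention scores per layer")
```

That blow-up is precisely what techniques like FlashAttention sidestep by never materializing the full score matrix at once.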

4. Positional Encoding

Transformers need to know not just what tokens are, but where they are. Early models used absolute positional encodings (token #1 always has vector A, token #2 always vector B, etc.).

The problem? If you trained only up to 4,096 tokens, the model has never seen what token #8,192 looks like. It generates a positional vector—but it’s unfamiliar, so quality drops.
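
To see the failure mode, take the classic sinusoidal encoding from the original Transformer paper: the formula happily produces a vector for any position, but positions past the training length are ones the model has simply never been exposed to. A minimal sketch:

```python
import math
import torch

def sinusoidal_pe(position: int, d_model: int = 8) -> torch.Tensor:
    """Absolute sinusoidal positional encoding for a single position."""
    pe = torch.zeros(d_model)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe[0::2] = torch.sin(position * div)
    pe[1::2] = torch.cos(position * div)
    return pe

# The math happily produces a vector for position 8191...
print(sinusoidal_pe(8191))
# ...but a model trained only on positions 0-4095 has never seen it,
# so quality degrades even though nothing literally crashes.
```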

Two clever fixes emerged:

ALiBi (Attention with Linear Biases)

Instead of encoding absolute positions, ALiBi drops positional embeddings entirely and nudges each attention score with a bias based on the relative distance between tokens. Whether “friend” and “played” are tokens #2 and #3 or #402 and #403 doesn’t matter; the model only learns that they sit right next to each other.

This makes models much more robust to longer inputs. The catch: you need to train with ALiBi from scratch.
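
Mechanically, ALiBi adds no positional embeddings at all; it just subtracts a penalty from each attention score that grows with the distance between query and key. A rough single-head sketch of the causal case (the slope value here is an illustrative stand-in for ALiBi's per-head slopes):

```python
import torch

def alibi_scores(q, k, slope=0.25):
    """Attention weights with an ALiBi-style linear distance penalty (causal)."""
    seq_len, d = q.shape
    scores = q @ k.T / d**0.5                        # (seq_len, seq_len)

    pos = torch.arange(seq_len)
    distance = pos.unsqueeze(1) - pos.unsqueeze(0)   # distance[i, j] = i - j
    scores = scores - slope * distance.clamp(min=0)  # penalize far-away earlier keys

    causal_mask = distance < 0                       # j > i: future tokens
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

q = torch.randn(6, 16)
k = torch.randn(6, 16)
print(alibi_scores(q, k))
```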

Positional Interpolation

For existing models, there’s a hack: squeeze more positions into the same encoding range. A model trained on positions [0, 1, 2, 3] can take in 8 tokens by rescaling their positions to [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5], so every position it sees still falls inside the range it was trained on.

It’s not as strong as ALiBi, but it lets you extend context windows without retraining from scratch.
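
In code, the trick boils down to rescaling position indices before they reach the positional encoding, so an 8K-token input reuses the 0–4K range the model already knows. A minimal sketch (the published Position Interpolation method applies this scaling inside RoPE rather than to raw indices, so treat this as the idea, not the exact recipe):

```python
import torch

TRAINED_LEN = 4096
TARGET_LEN = 8192

# Plain positions would run past anything the model has seen...
positions = torch.arange(TARGET_LEN, dtype=torch.float32)

# ...so squeeze them back into the familiar [0, TRAINED_LEN) range.
scaled_positions = positions * (TRAINED_LEN / TARGET_LEN)

print(positions[:6].tolist())         # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(scaled_positions[:6].tolist())  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
print(scaled_positions[-1].item())    # 4095.5 -- still inside the trained range
```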


The Context Arms Race

So why do labs keep pushing context limits?

  • User demand: more serious use cases (long docs, multimodal input) need longer windows.
  • Performance gains: giving models more input often makes them smarter.
  • Marketing: “1M tokens!” sounds impressive on a launch slide.

Of course, offering giant windows is expensive. That’s why providers tier their APIs: a 4K context is cheap, a 128K or 1M context costs much more.


Final Thoughts

Context window size isn’t just a random number in a model spec—it hides a lot of deep engineering trade-offs in data, compute, and architecture.

Next time you see a flashy LLM release boasting “Now with 1M tokens!”, you’ll know what’s behind the claim: months of data curation, expensive training, and clever math tricks to keep the whole thing running fast enough to be usable.
