This article was originally published on aifoss.dev
TL;DR: Context window size has a direct, linear cost in VRAM — and most tasks don't need as much as you think. An 8k window handles the vast majority of chat, coding, and single-document use cases. 32k is the practical ceiling for mid-range GPUs. 128k is technically possible on high-VRAM cards but comes with real quality degradation at scale. Choosing the right window size is a better optimization than buying a bigger GPU.
| | 8k context | 32k context | 128k context |
|---|---|---|---|
| Best for | Chat, code completion, single files | Multi-turn research, long code review, medium docs | Full codebase analysis, book-length docs |
| VRAM overhead (Llama 3.1 8B Q4_K_M) | +~1 GB | +~4 GB | +~16 GB |
| Generation speed (RTX 3090) | ~45 tok/s | ~35–40 tok/s | ~10–20 tok/s (partial CPU offload likely) |
| The catch | Truncates long inputs silently | Needs 12–16 GB GPU for 13B+ models | Quality degrades mid-context; expensive |
Honest take: For most developers running local LLMs, 16k–32k is the right default: enough for real work, manageable on a 12–16 GB GPU. Reserve 128k for specific use cases where you've confirmed the model actually uses that context well.
What the Context Window Actually Is
A context window is the total number of tokens a model processes in a single forward pass — system prompt, chat history, injected documents, and the new input combined. Everything outside this window doesn't exist to the model.
Token count matters more than word count. In English, 1 token is roughly 0.75 words; code tokenizes less compactly, often closer to 0.5 words per token. Some practical reference points:
- 8k tokens ≈ 6,000 words ≈ a typical short story or 200–300 lines of code
- 32k tokens ≈ 24,000 words ≈ a medium research paper or a 500–800 line code file
- 128k tokens ≈ 96,000 words ≈ a short novel or a large multi-file codebase
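To check a document against these reference points, a rough estimator is enough. The sketch below uses only the ratios quoted above; `estimate_tokens` and `fits` are hypothetical helper names, and the 1,024-token reply reserve is an assumption. Real counts depend on the model's tokenizer, so treat the result as a ballpark.

```python
# Rough heuristics from the article: ~0.75 words per token for English prose,
# ~0.5 words per token for code. Real counts depend on the model's tokenizer.
WORDS_PER_TOKEN = {"prose": 0.75, "code": 0.5}


def estimate_tokens(text: str, kind: str = "prose") -> int:
    """Estimate the token count from a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN[kind])


def fits(text: str, num_ctx: int, kind: str = "prose", reserve: int = 1024) -> bool:
    """True if the text likely fits, leaving `reserve` tokens for the reply."""
    return estimate_tokens(text, kind) + reserve <= num_ctx


doc = "word " * 6000                # stand-in for a 6,000-word document
print(estimate_tokens(doc))         # 8000
print(fits(doc, 8192))              # False: no room left for a reply
print(fits(doc, 16384))             # True
```

Note that a 6,000-word document "fits" 8k only if you leave no room for the answer, which is why the reserve matters.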
The thing most people don't realize: the model doesn't remember anything beyond the current context window. A 5-hour chat session that overflows 8k doesn't cause a graceful summary — it silently drops the oldest messages. Knowing your token budget prevents that from biting you mid-conversation.
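One way to keep that from happening silently is to trim history on the client and report what was dropped. This is a minimal sketch, not how Ollama or any particular runtime truncates internally. `trim_history` is a hypothetical helper, the message format (dicts with `role` and `content`) is an assumption, and `count_tokens` is whatever estimator you supply.

```python
def trim_history(messages, num_ctx, count_tokens, reserve=512):
    """Keep the system prompt plus the newest messages that fit in num_ctx.

    Returns (kept_messages, number_of_dropped_messages).
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Tokens left after the system prompt and the space reserved for the reply.
    budget = num_ctx - reserve - sum(count_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):          # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break                           # everything older is dropped too
        kept.append(message)
        budget -= cost

    dropped = len(rest) - len(kept)
    return system + kept[::-1], dropped     # restore chronological order
```

Calling it before each request and logging a non-zero `dropped` count turns a silent truncation into a visible one.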
The KV Cache: Why Context Eats VRAM
Every token you add to the context window consumes GPU memory, not through the model weights, which are fixed, but through the KV cache (key-value cache). Transformers compute attention over every previous token at each layer; the KV cache stores each token's key and value vectors so the model doesn't recompute them on every generation step.
The memory cost scales linearly with context length:
KV cache size = 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_bytes
For Llama 3.1 8B (32 layers, 8 KV heads, 128 head dim, float16):
| Context | KV cache | Model weights (Q4_K_M) | Total VRAM |
|---|---|---|---|
| 2k (Ollama default) | ~0.25 GB | ~4.7 GB | ~5.0 GB |
| 8k | ~1.0 GB | ~4.7 GB | ~5.7 GB |
| 16k | ~2.0 GB | ~4.7 GB | ~6.7 GB |
| 32k | ~4.0 GB | ~4.7 GB | ~8.7 GB |
| 64k | ~8.0 GB | ~4.7 GB | ~12.7 GB |
| 128k | ~16.0 GB | ~4.7 GB | ~20.7 GB |
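The table follows directly from the formula. Here is a short sketch that reproduces it, assuming the Llama 3.1 8B figures given above, float16 cache entries (2 bytes), and binary gigabytes (GiB) for the "GB" column. The 4.7 figure for the weights is taken from the table rather than computed.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size: keys and values (the factor 2) for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Architecture figures quoted in the article for Llama 3.1 8B.
LLAMA_31_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)
WEIGHTS_GIB = 4.7  # Q4_K_M weights, taken from the table

for ctx in (2048, 8192, 16384, 32768, 65536, 131072):
    kv = kv_cache_bytes(seq_len=ctx, **LLAMA_31_8B) / GIB
    print(f"{ctx:>7} tokens: KV {kv:5.2f} GiB, total ~{kv + WEIGHTS_GIB:5.2f} GiB")
```

With these numbers each token costs 128 KiB of cache, so every 8k of context adds 1 GiB.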
This is why an 8 GB GPU can run Llama 3.1 8B at 8k context fine (5.7 GB total) but runs out of memory trying to push 32k. The model weights didn't change — the KV cache ate your VRAM.
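You can also run the arithmetic in reverse: given a card, how much context fits? The sketch below assumes the 128 KiB-per-token figure that the formula gives for Llama 3.1 8B at float16. `max_context` is a hypothetical helper, and the 0.5 GiB of runtime overhead is a guess, so measure on your own setup before relying on it.

```python
GIB = 1024 ** 3

# 2 * 32 layers * 8 KV heads * 128 head dim * 2 bytes (float16)
BYTES_PER_TOKEN_LLAMA_31_8B = 2 * 32 * 8 * 128 * 2


def max_context(vram_gib, weights_gib, bytes_per_token, overhead_gib=0.5):
    """Largest context (in tokens) whose KV cache fits in the leftover VRAM.

    overhead_gib is an assumed allowance for compute buffers and the
    desktop/driver footprint; it is not a measured value.
    """
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    return max(0, int(free_bytes // bytes_per_token))


for vram in (8, 12, 16, 24):
    tokens = max_context(vram, 4.7, BYTES_PER_TOKEN_LLAMA_31_8B)
    print(f"{vram} GiB card: roughly {tokens:,} tokens of context")
```

Under these assumptions an 8 GB card lands between 16k and 32k, which matches the behavior described above.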
For larger models, the KV cache grows proportionally. A 32B model has more layers and wider attention, so its KV cache at 32k context can exceed 12 GB by itself. An RTX 4090 with 24 GB handles a 32B model at 8k context, but hits the wall around 32k. For 128k with a 32B model, you're looking at NVIDIA A100-class hardware or multi-GPU setups.
Flash attention cuts this cost substantially. Setting OLLAMA_FLASH_ATTENTION=1 can reduce attention-related VRAM usage by 30–50% on Ampere and newer GPUs (RTX 30-series and later). Combined with KV cache quantization (available in recent llama.cpp builds), you can roughly double the effective context window before running out of memory, fitting a 128k run on hardware that would normally top out at 64k.
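To see where the "roughly double" comes from, compare cache sizes per storage type. This sketch assumes the llama.cpp block layouts for `q8_0` and `q4_0` (32 values plus one fp16 scale per block, so 34 and 18 bytes) and the Llama 3.1 8B shape used earlier. In Ollama the cache type is chosen with the `OLLAMA_KV_CACHE_TYPE` environment variable and needs flash attention enabled; check the documentation for your version, since defaults and supported types may differ.

```python
# Bytes per stored element. q8_0 and q4_0 pack 32 values plus one fp16 scale
# into a block: 34 bytes and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(seq_len, cache_type="f16",
                 num_layers=32, num_kv_heads=8, head_dim=128):
    """KV cache size in GiB for a given context length and cache type."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024 ** 3


for cache_type in ("f16", "q8_0", "q4_0"):
    print(f"{cache_type}: {kv_cache_gib(131072, cache_type):.2f} GiB at 128k")
# f16: 16.00 GiB at 128k
# q8_0: 8.50 GiB at 128k
# q4_0: 4.50 GiB at 128k
```

An 8-bit cache at 128k costs about what a float16 cache costs at 64k. Lower-precision caches can affect output quality, so it is worth testing on your own prompts.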
The 8k Sweet Spot
For the majority of local LLM use cases, 8k context is genuinely enough:
- Chat conversations: even long sessions rarely exceed 4k tokens of actual meaningful exchange before the early context stops being relevant anyway
- Code completion and review: most individual files are under 5k tokens; reviewing a single function or class is typically 1k–3k tokens
- Single document Q&A: a 5-page PDF, a README, a blog post — all comfortably within 8k
- RAG pipelines: if you're using retrieval, you're injecting only the top 3–10 chunks into context, not the full document set. 8k is enough for the retrieved context plus the system prompt
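The RAG case is easy to sanity-check with a token budget. In the sketch below every number is a placeholder rather than a measurement: a 300-token system prompt, a 100-token question, 1,024 tokens reserved for the answer, and 512-token chunks. `rag_chunk_budget` is a hypothetical helper.

```python
def rag_chunk_budget(num_ctx=8192, system_tokens=300, question_tokens=100,
                     reply_reserve=1024, chunk_tokens=512):
    """How many retrieved chunks fit after fixed costs are subtracted."""
    available = num_ctx - system_tokens - question_tokens - reply_reserve
    return max(0, available // chunk_tokens)


print(rag_chunk_budget())                    # 13 chunks of 512 tokens
print(rag_chunk_budget(chunk_tokens=1000))   # 6 chunks of 1,000 tokens
```

Even with generous chunk sizes, the top 3–10 chunks fit inside 8k under these assumptions.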
The hidden advantage: at 8k, you stay comfortably on GPU with smaller cards. A mid-range RTX 4070 Ti Super (16 GB) runs a 13B Q4_K_M model at 8k context with headroom to spare, and generation stays above 40 tokens/second.
Ollama's default context is set low (2048 tokens in most versions) to avoid unexpected out-of-memory errors. Bumping to 8192 is the first configuration change that actually improves usability without meaningful VRAM cost.
```
# Permanently in a Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 8192
```

```
# One-off, inside an interactive session started with: ollama run llama3.1:8b
/set parameter num_ctx 8192
```

```shell
# Via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "options": {"num_ctx": 8192},
  "prompt": "Summarize this code..."
}'
```
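The same REST call can be made from Python with only the standard library. This is a sketch that assumes a local Ollama server on the default port; `build_request` and `generate` are hypothetical helper names. Setting `"stream": false` asks for a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(prompt, model="llama3.1:8b", num_ctx=8192):
    """Build a POST request that sets the context window for this call."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON object instead of chunks
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Summarize this code...", num_ctx=8192)` returns the model's reply as a string.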
When 32k Makes Sense
There are specific workflows where 32k pays off:
Multi-turn research conversations: you're asking follow-up questions about a paper and want the model to hold the full prior exchange in context — not a 3-message summary of it. At 8k, a long research session overwrites its own early context. At 32k, you can go 20–40 exchanges deep before anything drops.
Long code file review: a 600-line Python file with docstrings and comments is ~6k–10k tokens. You want the full file in context while you ask questions about it, not chunks. 32k gives you room for the file plus a few rounds of Q&A.
Document analysis across multiple pages: a 30-page technical specification or contract runs 15k–25k tokens. 8k forces chunking; 32k lets you ask about relationships between different sections in a single pass.
Agentic coding loops (Aider, Cline): these tools send the full file list, relevant files, and conversation history in each request. Context grows fast. A 32k window allows multi-file edits without hitting the ceiling mid-session.
The hardware minimum for comfortable 32k use with a 13B model: 16 GB VRAM. With flash attention enabled, you can push 32k on a 12 GB card, but generation speed drops as the KV cache grows.
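Since every step up in window size costs VRAM, it makes sense to pick the smallest window that covers the job. Here is a sketch of that decision, with the tier list and the 1,024-token reply reserve as assumptions; `pick_num_ctx` is a hypothetical helper.

```python
TIERS = (8192, 16384, 32768, 131072)


def pick_num_ctx(input_tokens, history_tokens=0, reply_reserve=1024, tiers=TIERS):
    """Return the smallest tier that holds input, history, and the reply.

    Returns None when nothing fits, which is the signal to chunk or use RAG.
    """
    needed = input_tokens + history_tokens + reply_reserve
    for tier in tiers:
        if needed <= tier:
            return tier
    return None


print(pick_num_ctx(3000))                 # 8192: a single function or class
print(pick_num_ctx(10000, 4000))          # 16384: a long file plus some Q&A
print(pick_num_ctx(25000, 4000))          # 32768: a 30-page spec plus Q&A
print(pick_num_ctx(200000))               # None: chunk it instead
```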
128k: Worth It or Hype?
128k context support is now advertised by most frontier local models — Llama 3.1 8B and 70B, Qwen2.5 7B through 72B, Gemma 3 12B, Mistral models. The capability is real. But three problems limit its practical utility.
Problem 1: VRAM requirements are prohibitive. As the table above shows, running Llama 3.1 8B at 128k context needs ~21 GB of VRAM for that model alone. The 8B model is the small option. Running a 70B model at 128k context requires 80+ GB of VRAM — which means multi-GPU or cloud. For cloud GPU rental, RunPod has H100 80GB instances that handle this, but you're now spending money.
Problem 2: Generation speed collapses. Even if you have the VRAM, processing 100k+ tokens in the KV cache during generation creates severe latency. At 128k context, even an RTX 3090 drops to 10–20 tokens/second for an 8B model — compared to 45+ tokens/second at 8k context. For interactive use, this is pai