The Million-Token Context Window Changes What You Put In It

#aiengineering #contextengineering #llmops #developerproductivity

The 1M-token context window is here. Opus 4.8, 4.7, and 4.6, plus Sonnet 4.6, now carry the full million-token context on the Claude API, Amazon Bedrock, and Vertex AI — no surcharge, generally available. A single request can hold up to 600 images or PDF pages. The reflex, the second the number lands, is to point the tool at the whole repo and let it rip.

That is the wrong move. The operators who get real value out of the bigger window treat it as a curated working set, not a junk drawer. What you load decides what the model reasons about, and a million tokens of noise still produces noisy output.

The cleanest statement of this comes from the vendor's own documentation. Anthropic's docs say the 1M window's retrieval gains "depend on what's in context, not just how much fits" (context windows). Read that sentence twice. The platform that just handed you a million tokens is telling you, in the same breath, that capacity and effective use are different things.

Capacity Is Not the Same as Attention

Here is the mistake the dump-the-repo reflex makes. It assumes the model reads a million tokens the way a database reads a million rows — uniformly, with equal fidelity from the first byte to the last. It does not.

Chroma's "Context Rot" research evaluated 18 frontier models — Claude 4, Gemini 2.5, Qwen3, and a dozen others — and found that performance "grows increasingly unreliable as input length grows" (Context Rot). Models do not process long context uniformly. They degrade, and they degrade "even on simple tasks." A task the model nails at 10K tokens gets less reliable at 800K, holding everything else constant.

The type of filler matters too. Chroma found that locally-cancelling operations hurt more than neutral print statements, and topically-related distractors degrade answers non-uniformly. Translate that to a codebase: loading the wrong 800K does not just waste space. It actively poisons the model's reasoning over the right 200K. The half-relevant module, the deprecated helper, the three abandoned migration scripts that look load-bearing — those are not free passengers. They are distractors with a vote.

So the question is never "will it fit." With a million tokens, almost everything fits. The question is "what does loading this do to the model's attention on the part that matters."

The Smallest High-Signal Set

Anthropic's applied AI team frames the discipline directly. Good context engineering means finding "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome" (effective context engineering). Context is "a finite resource with diminishing marginal returns." Every token you load spends from a shared attention budget.

That last point is the one to internalize. The attention budget is shared. The 50 lines of code that actually contain the bug compete for the model's focus with every other token in the window. Pad the window with the rest of the repo and you have not given the model more help. You have given the 50 lines more competition.

This reframes the whole job. The operator's task is not "assemble everything that could conceivably be relevant." It is "curate the smallest set that makes the answer likely." Those are opposite instincts. The first is collection. The second is editing. The bigger window rewards editors and punishes collectors, because the collector's reflex scales straight into the rot.

In practice the curated set looks like intent, not coverage. The file with the bug, its direct callers, the test that fails, the relevant interface, the one config that governs the behavior. Maybe 200K of the right tokens, assembled because you decided each one earns its place. Not a million tokens assembled because the window happened to be that big.

What the Headroom Is Actually For

So if the answer is "feed it the right 200K," what is the other 800K for? It is not for padding. It is for the cases that genuinely need it — the codebase-wide refactor that legitimately touches forty files, the migration that has to reason across a sprawling schema, the incident where the relevant signal really is distributed across a large surface. Those exist. The headroom is there to serve them when they show up, deliberately, not to be the default fill level for every request.

Anthropic's tooling makes the "manage it, don't fill it" stance concrete. The platform ships server-side compaction, context editing that clears stale tool results and thinking blocks, and "context awareness" — models that track their own remaining token budget rather than guessing how many tokens remain (context windows). That is an architecture for spending a finite budget on purpose. None of it would make sense if the design intent were "load everything and let the window sort it out."

The pattern across all three primary sources is the same. The vendor gives you a million tokens, builds the tooling to help you spend them carefully, and states in the docs that what you put in decides what you get out. The empirical research from outside the vendor confirms the failure mode the docs are warning about: more input, less reliability, even on simple tasks.

The Discipline the Bigger Window Demands

The 1M window is a real capability gain. The forty-file refactor that used to require chunking and stitching can now happen in one pass. That is worth having. But the gain only materializes for operators who bring intent to what they load.

Treat the window as a working set you curate. Feed it the right 200K because you decided each token earns its place. Reach for the headroom when the task genuinely spans a large surface, and reach for the compaction and context-editing tools to keep the working set clean as the session runs long. The discipline is the same one good engineers already apply to a code review: the question is not how much you can put in front of someone, it is what they actually need to see to make the call.

The number on the box went up by 5x. The skill that turns it into output did not change. If anything, the bigger window raises the price of getting it wrong — a million tokens of noise is a much louder distraction than 200K of it ever was. Spend the budget like it is scarce, because for the model's attention, it still is.

Top comments (1)

Max Quimby • Jun 6

"A million tokens of noise still produces noisy output" is the line, and the Context Rot framing is the right mental model — the failure isn't "it didn't fit," it's that the half-relevant module got a vote. That matches what we see: deleting irrelevant files from context often improves answers more than adding the "right" one, because you're removing competing attention, not just saving space.

The trap I'd flag now that the window is effectively free: prompt caching makes a big stable prefix cheap in dollars, so the discipline that used to be enforced by cost ("I can't afford to load the whole repo") is gone. The bill stops punishing you for junk long before the accuracy does. The budget you actually have to manage flipped from tokens to attention — and that one's invisible on the dashboard.

Question on the curation step itself: do you select the working set by hand, by retrieval/embedding, or do you let an agent pick what to load? We've found agent-selected context is great right up until it confidently pulls the deprecated helper that looks load-bearing — the exact distractor class you're describing. What's worked to keep the set high-signal as the codebase moves under you?