Shinsuke Matsuda
The 2M Token Trap: Why "Context Stuffing" Kills Reasoning

Why more context often makes LLMs worse—and what to do instead


1. Introduction

The Context Window Arms Race

The expansion of context windows has been staggering:

  • Early 2023: GPT-4 launches with 32K tokens
  • November 2023: GPT-4 Turbo extends to 128K
  • February 2024: Gemini 1.5 hits 1M—later expanding to 2M
  • March 2024: Claude 3 reaches 200K

In just two years, context capacity grew from 32K to 2M tokens—a 62× increase.

The developer intuition was immediate and seemingly logical:

“If everything fits, just put everything in.”

The Paradox: More Context, Worse Results

Practitioners are discovering a counterintuitive pattern:

the more context you provide, the worse the model performs.

Common symptoms include:

  • Passing an entire codebase → misunderstood design intent
  • Including exhaustive logs → critical errors overlooked
  • Providing comprehensive documentation → unfocused responses

This phenomenon has a name in the research literature:

“Lost in the Middle” (Liu et al., 2023).

Information placed in the middle of long contexts is systematically neglected.

The uncomfortable truth is this:

A context window is not just storage capacity. It is cognitive load.

This article explores why Context Stuffing fails, what Anthropic’s Claude Code reveals about effective context management, and how to shift from Prompt Engineering to Context Engineering—the discipline of architectural curation for AI systems.


2. Why “More Context” Doesn’t Mean “Better Understanding”

Capacity vs. Capability

We must distinguish between two fundamentally different concepts:

  • Capacity: How much data fits in memory (e.g. 200K, 2M tokens)
  • Capability: The ability to prioritize, connect, and reason over that data

Just because a model can ingest 2 million tokens does not mean it can pay attention to them equally.

Providing a 2M-token context to an LLM is like handing a new developer 10,000 pages of documentation on day one and expecting them to fix a bug in five minutes.

They won’t understand the system—they will immediately drown in it.

Attention Dilution and “Lost in the Middle”

This limitation is rooted in the self-attention mechanism.

As token count increases, attention distributions flatten, signal-to-noise ratios drop, and relevant information gets buried.

Liu et al. (2023) demonstrated that information placed in the middle of long contexts is systematically neglected—even when explicitly relevant—while content at the beginning and end receives disproportionate attention.
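
A toy sketch of the mechanism (a simplified softmax model, not the Liu et al. setup): as more tokens compete for the same probability mass, the weight on any single relevant token shrinks.

```python
import numpy as np

def weight_on_relevant_token(n_tokens: int, relevant_score: float = 2.0) -> float:
    """Softmax weight one clearly relevant token receives when it competes with
    (n_tokens - 1) neutral tokens. A toy model, not a real transformer."""
    scores = np.zeros(n_tokens)
    scores[0] = relevant_score            # one token the model "should" attend to
    weights = np.exp(scores - scores.max())
    return float(weights[0] / weights.sum())

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> weight on the relevant token: {weight_on_relevant_token(n):.6f}")
```

In this toy model the relevant token's share falls roughly as 1/n: more capacity, less attention per item.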

In short:

Context expansion increases what can be accessed, not what can be understood.

Real-World Symptoms

In practice, adding information often degrades accuracy:

  • Entire codebases → architectural misinterpretation
  • Exhaustive logs → critical signals buried
  • Comprehensive docs → answers drift off-topic

These are not failures of model intelligence.

They are failures of information structure and prioritization—problems no amount of context capacity can solve.


3. The 75% Rule: Lessons from Claude Code

The Problem: Quality Degradation in Long Sessions

The strongest evidence against Context Stuffing comes from Claude Code, Anthropic’s terminal-based coding agent with a 200K context window.

In early 2025, shortly after its release, users reported recurring issues:

  • Code quality degraded over long sessions
  • Earlier design decisions were forgotten
  • Auto-compact sometimes failed, causing infinite loops

At the time, Claude Code routinely used over 90% of its available context.

The Solution: Auto-Compact at 75%

In September 2025, Anthropic implemented a counterintuitive fix:

Trigger auto-compact when context usage reaches 75%.

This meant:

  • ~150K tokens used for storage
  • ~50K tokens deliberately left empty

What looked like waste turned out to be the key to dramatic quality improvements.
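
As a rough sketch of the idea (the helpers below are hypothetical stand-ins, not Anthropic's actual implementation), the mechanism is simple: watch usage, and once it crosses 75% of the window, replace old turns with a summary.

```python
# Hypothetical sketch of a 75%-threshold auto-compact loop.
# `count_tokens` and `summarize` stand in for a real tokenizer and LLM call.

CONTEXT_WINDOW = 200_000
COMPACT_THRESHOLD = 0.75            # compact once 75% of the window is used

def count_tokens(text: str) -> int:
    return len(text) // 4           # rough heuristic: ~4 characters per token

def summarize(messages: list[str]) -> str:
    # Placeholder: in practice, an LLM call that compresses the transcript
    # into decisions, open tasks, and key file references.
    return "SUMMARY: " + " | ".join(m[:40] for m in messages)

def maybe_compact(messages: list[str]) -> list[str]:
    used = sum(count_tokens(m) for m in messages)
    if used < CONTEXT_WINDOW * COMPACT_THRESHOLD:
        return messages                       # plenty of headroom, do nothing
    recent = messages[-5:]                    # keep the newest turns verbatim
    summary = summarize(messages[:-5])        # compress everything older
    return [summary] + recent
```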

Why It Works: Inference Space

Several hypotheses explain why this works:

  1. Context Compression — Low-relevance information is removed
  2. Information Restructuring — Summaries reorganize scattered data
  3. Preserving Room for Reasoning — Empty space enables generation

As one developer put it:

“That free context space isn’t wasted—it’s where reasoning happens.”

This mirrors computer memory behavior:

Keeping RAM at 95% doesn’t mean the last 5% is wasted; the system needs that slack to keep working. Push it to 100%, and everything grinds to a halt.

Takeaway

Filling context to capacity degrades output quality.

Effective context management requires headroom—space reserved for reasoning, not just retrieval.


4. The Three Principles of Context Engineering

The era of prompt wording tweaks is ending.

As Hamel Husain observed:

“AI Engineering is Context Engineering.”

The critical skill is no longer what you say to the model, but what you put in front of it—and what you deliberately leave out.

Principle 1: Isolation

Do not dump the monolith.

Borrow Bounded Contexts from Domain-Driven Design.

Provide the smallest effective context for the task.

Example: Add OAuth2 authentication

Needed:

  • User model
  • SessionController
  • routes.rb
  • Relevant auth middleware

Not needed:

  • Billing module
  • CSS styles
  • Unrelated APIs
  • Other test fixtures

Ask:

What is the minimum context required to solve this problem?
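
A sketch of what that looks like in practice (the paths and helper are illustrative, not from a real project): build the prompt from an explicit allow-list instead of globbing the whole repository.

```python
from pathlib import Path

# Smallest effective context for "add OAuth2 authentication":
# an explicit allow-list, not the entire codebase.
RELEVANT_FILES = [
    "app/models/user.rb",
    "app/controllers/session_controller.rb",
    "config/routes.rb",
    "app/middleware/auth.rb",
]

def build_context(repo_root: str, files: list[str]) -> str:
    """Concatenate only the files the task actually needs."""
    parts = []
    for rel in files:
        path = Path(repo_root) / rel
        parts.append(f"### {rel}\n{path.read_text()}")
    return "\n\n".join(parts)

# context = build_context(".", RELEVANT_FILES)   # this goes into the prompt
```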

Principle 2: Chaining

Pass artifacts, not histories.

Break workflows into stages:

Plan → Execute → Reflect

Each stage receives only the previous stage’s output—not the entire conversation history.

This keeps context fresh and signal-dense.

Ask:

Can this be decomposed into stages that pass summaries instead of transcripts?
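
A minimal sketch of artifact-passing (the `llm` function is a stand-in for whatever client you use): each stage sees only the previous stage's artifact, never the accumulated transcript.

```python
def llm(prompt: str) -> str:
    """Stand-in for a real model client; replace with an actual API call."""
    return f"[model output for: {prompt[:50]}...]"

def plan(task: str) -> str:
    return llm(f"Write a short implementation plan for: {task}")

def execute(plan_text: str) -> str:
    # Sees only the plan artifact, not the conversation that produced it.
    return llm(f"Implement the following plan:\n{plan_text}")

def reflect(diff: str) -> str:
    # Sees only the resulting change, not the planning discussion.
    return llm(f"Review this change and list risks or missing tests:\n{diff}")

def run(task: str) -> str:
    p = plan(task)        # artifact 1: the plan
    d = execute(p)        # artifact 2: the change
    return reflect(d)     # artifact 3: the review
```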

Principle 3: Headroom

Never run a model at 100% capacity.

Adopt the 75% Rule.

Token limits usually cover input + output. Stuffing 195K tokens into a 200K window leaves almost no room for reasoning.
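
A small budget check makes the rule concrete (the numbers follow the 75% guideline above; this is a working convention, not an API requirement):

```python
CONTEXT_WINDOW = 200_000
HEADROOM_RATIO = 0.25              # leave ~25% free: output plus room to reason

def max_input_tokens(window: int = CONTEXT_WINDOW,
                     headroom: float = HEADROOM_RATIO) -> int:
    return int(window * (1 - headroom))

def check_budget(input_tokens: int) -> None:
    limit = max_input_tokens()
    if input_tokens > limit:
        raise ValueError(
            f"{input_tokens:,} input tokens exceeds the {limit:,}-token budget; "
            "isolate or compact the context before sending."
        )

check_budget(150_000)    # fine: 50K tokens of headroom remain
# check_budget(195_000)  # would raise: almost no room left to think
```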

Ask:

Have I left enough space for the model to think—not just respond?


Treat the context window as a scarce cognitive resource, not infinite storage.


5. Why 200K Is the Sweet Spot

Despite 2M-token models, 200K is the practical sweet spot for Context Engineering.

Cognitive Scale

150K tokens (75% of 200K) is on the order of 110,000 words, roughly one technical book: about the largest coherent “project state” both humans and LLMs can manage. Beyond that, you need chapters, summaries, and architecture.

Cost and Latency

Self-attention compute scales as O(n²) in sequence length: doubling the context roughly quadruples the attention work, since (2n)² = 4n².

Longer contexts also mean more tokens billed and higher latency per request.

200K balances performance, latency, and cost.

Methodological Discipline

200K forces curation.

Exceeding it is a code smell: unclear boundaries, oversized tasks, or stuffing instead of structuring.

Anthropic offers 1M tokens—but behind premium tiers.

The implicit message:

1M is for special cases. 200K is the default for a reason.

The constraint is not a limitation—it is the design principle.


6. Conclusion: From Prompt Engineering to Context Engineering

The context window arms race delivered a 62× increase in capacity.

But capacity was never the bottleneck.

The bottleneck is—and always has been—curation.

The shift is fundamental:

| Prompt Engineering | Context Engineering |
| --- | --- |
| “How do I phrase this?” | “What should the model see?” |
| Optimizing words | Architecting information |
| Single-shot prompts | Multi-stage pipelines |
| Filling capacity | Preserving headroom |

Three Questions to Ask Before Every Task

  1. Am I stuffing context just because I can?

    Relevant beats exhaustive.

  2. Is this context isolated to the real problem?

    If you can’t state the boundary, you haven’t found it.

  3. Have I left room for the model to think?

    Output quality requires input restraint.

The era of prompt engineering rewarded clever wording.

The era of context engineering rewards architectural judgment.

The question is no longer:

What should I say to the model?

The question is:

What world should the model see?


7. References

  • Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
