Not theory. Here's exactly how I use it.
TL;DR
GPT-5.4 and Claude Sonnet 4.6 both shipped with 1 million token context windows this week. I've been testing them in real work — research, writing, code review. Here's what actually works, what doesn't, and the prompts I'm using.
The Promise vs Reality
The hype: "Feed entire codebases! Analyze whole books! Never lose context!"
The reality: More nuanced. 1M tokens is roughly 750,000 words — yes, that's an entire book. But throwing everything at the model doesn't automatically make it smarter.
What Actually Works
1. Research Synthesis (My Killer Use Case)
The workflow:
- Fetch 15-20 sources on a topic
- Paste them all in a single context
- Ask for synthesis, not summary
The prompt:
I've included {N} sources about {topic}.
Don't summarize them individually. Instead:
1. Find the 3-5 key insights across multiple sources
2. Identify contradictions or debates
3. Note what's missing
4. Give me your synthesis in 500 words max.
Why this works: The model can actually cross-reference. Before 1M context, I'd have to manually track which source said what.
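The workflow above is really just string assembly. Here's a minimal Python sketch of it — the `(title, text)` shape for sources is my assumption, not something prescribed by the workflow:

```python
def build_synthesis_prompt(sources, topic):
    """Assemble fetched sources into one long-context synthesis prompt.

    `sources` is a list of (title, text) tuples (hypothetical shape);
    `topic` fills the {topic} slot in the prompt template.
    """
    parts = [f"I've included {len(sources)} sources about {topic}.\n"]
    for i, (title, text) in enumerate(sources, 1):
        # Label each source so the model can cite them by number.
        parts.append(f"--- Source {i}: {title} ---\n{text}\n")
    parts.append(
        "Don't summarize them individually. Instead:\n"
        "1. Find the 3-5 key insights across multiple sources\n"
        "2. Identify contradictions or debates\n"
        "3. Note what's missing\n"
        "4. Give me your synthesis in 500 words max."
    )
    return "\n".join(parts)
```

Labeling sources by number also makes it easy to ask follow-ups like "which sources disagree on point 2?".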
2. Code Review With Full Repo Context
First, dump the code into one blob (the `head -c` cap keeps it to ~500 KB, roughly 125K tokens):
find . -name "*.py" -exec cat {} \; | head -c 500000
Then paste that ahead of this prompt:
This is a Python codebase for {project}.
I'm adding: {feature}.
1. Which files will I modify?
2. What patterns should I follow?
3. Any conflicts?
4. Write the code, matching existing style.
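One caveat with the shell one-liner: `head -c` truncates mid-file. A slightly smarter sketch in Python stops at a file boundary instead — the ~4 characters-per-token ratio is a rough rule of thumb, and the budget default is my own, not a requirement:

```python
import os

def collect_py_files(root, char_budget=500_000):
    """Concatenate .py files whole until the budget runs out,
    rather than cutting off mid-file like `head -c` does.
    ~500K chars is roughly 125K tokens at ~4 chars/token.
    """
    chunks, used = [], 0
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            if used + len(text) > char_budget:
                return "\n".join(chunks)  # stop at a file boundary
            # Header lines help the model map code back to files.
            chunks.append(f"# === {path} ===\n{text}")
            used += len(text)
    return "\n".join(chunks)
```

The per-file headers matter: when the model answers "which files will I modify?", it can name paths instead of guessing.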
3. Document-Heavy Analysis
This is a {document type}, {X} pages.
I need to understand:
1. {Question 1}
2. {Question 2}
Quote exact sections for each answer.
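Asking for exact quotes has a nice side effect: you can verify them mechanically. A cheap hallucination check (my own addition, not part of the prompt) is to confirm each quoted passage actually appears in the document:

```python
def find_fabricated_quotes(quotes, document_text):
    """Return the quotes that do NOT appear verbatim in the document.

    Whitespace is collapsed on both sides so line-wrap differences
    don't cause false alarms. An empty result means every quote checks out.
    """
    norm_doc = " ".join(document_text.split())
    return [q for q in quotes if " ".join(q.split()) not in norm_doc]
```

If this returns anything, treat the corresponding answer as suspect and re-ask.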
What Doesn't Work (Yet)
❌ Vague prompts — "Analyze this" still produces meh results
❌ Needle-in-haystack — if you already know the exact phrase, Ctrl+F or grep is faster and free
❌ Token-stuffing — 200K relevant > 800K "maybe useful"
Rule: Quality of context > quantity of context.
Cost Reality Check
1M tokens ≈ $3-15 depending on model.
My spend: ~$5-10/day. ROI is obvious when one research session replaces 2+ hours.
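The arithmetic behind those numbers is simple enough to sanity-check yourself. The per-million rates below are placeholders — look up your provider's current pricing:

```python
def session_cost(input_tokens, output_tokens,
                 in_per_m=3.00, out_per_m=15.00):
    """Rough USD cost of one long-context call.

    in_per_m / out_per_m are example rates per million tokens,
    not any specific provider's actual pricing.
    """
    return (input_tokens / 1e6) * in_per_m + (output_tokens / 1e6) * out_per_m

# e.g. an 800K-token research dump producing a 2K-token synthesis:
# session_cost(800_000, 2_000) is about $2.43 at these example rates
```

At a few dollars per session, a handful of research runs a day lands right in that $5-10 range.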
Try It Yourself
- Pick one research task you do manually
- Gather 10+ sources
- Paste them all into Claude or GPT-5.4
- Use the synthesis prompt above
- Compare time vs quality
What's your best use case for long context? Comment below!