Nova Elvaris
Context Windows Are Lying to You: How to Actually Use 128K Tokens

Every model brags about context windows now. 128K tokens. 200K tokens. "Paste your entire codebase!" the marketing says.

I tried it. I pasted 80K tokens of a Node.js project into Claude and asked it to find a bug. It found a bug — in a file I didn't care about, while ignoring the actual issue in the file I mentioned.

Here's what I learned about context windows the hard way.

The Attention Problem

Large context windows don't mean the model pays equal attention to everything. The "Lost in the Middle" paper (Liu et al., 2023) showed that LLMs disproportionately focus on the beginning and end of the context, with reduced attention in the middle.

In practice, this means:

  • File 1 of 50: high attention ✓
  • Files 2-49: declining attention ✗
  • File 50: high attention ✓
  • Your actual question at the end: high attention ✓

So if your bug is in file 27, the model literally pays less attention to it — even though it's "in context."

The Cost Problem

128K input tokens on GPT-4o cost about $0.32 (at $2.50 per million input tokens — check current pricing). That sounds cheap until you're iterating. Ask 10 follow-up questions with the same context and you've spent $3.20 — for one debugging session. Do this daily and you're burning $60+/month on context you mostly don't need.
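The arithmetic is worth making explicit. Here's a minimal sketch, assuming GPT-4o's $2.50-per-million-token input rate (a published price at the time of writing, not a constant — verify before budgeting):

```typescript
// Rough cost math for repeatedly re-sending a large context.
// ASSUMPTION: $2.50 per 1M input tokens (GPT-4o input rate; check current pricing).
const INPUT_PRICE_PER_MILLION = 2.5; // USD

function inputCostUSD(tokens: number, calls: number): number {
  // Each call re-sends the full input context, so cost scales linearly.
  return (tokens / 1_000_000) * INPUT_PRICE_PER_MILLION * calls;
}

// One 128K-token paste, with 10 follow-ups that each re-send it:
console.log(inputCostUSD(128_000, 10).toFixed(2)); // "3.20"

// The same session on a 2K-token context budget:
console.log(inputCostUSD(2_000, 10).toFixed(2)); // "0.05"
```

Same session, same number of questions — the only variable is how much context rides along on every call.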

The Noise Problem

More context means more potential for the model to get confused. Paste your entire codebase and ask "how does auth work?" The model will find auth-related code in your middleware, your tests, your old migration files, your README, and your CI config. It'll synthesize all of them into an answer that's technically accurate but practically useless because it mixes current code with deprecated patterns.

What Actually Works: The Context Budget

I now treat context like a budget. Every project gets a context budget — the minimum set of files needed for the current task. Nothing more.

Rule 1: 3-File Maximum for Bug Fixes

For any bug fix, I include at most 3 files:

  1. The file with the bug
  2. The file that calls it
  3. The test file (if one exists)

That's usually 1-3K tokens. Fast, cheap, and the model's full attention is on the relevant code.
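If you want the budget enforced rather than remembered, the rule is easy to encode. A sketch — the `ContextFile` shape and `buildBugFixPrompt` name are my own inventions for illustration, not an established API:

```typescript
// Enforce Rule 1: a bug-fix prompt gets at most three files of context.
interface ContextFile {
  path: string;
  content: string;
}

function buildBugFixPrompt(task: string, files: ContextFile[]): string {
  // Hard budget check — fail loudly instead of silently bloating the prompt.
  if (files.length > 3) {
    throw new Error(`Context budget exceeded: ${files.length} files (max 3)`);
  }
  const sections = files.map((f) => `### ${f.path}\n${f.content}`);
  return [`Task: ${task}`, ...sections].join("\n\n");
}
```

A linting step like this in whatever script assembles your prompts turns the rule from discipline into a guardrail.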

Rule 2: Tree Before Files

Before pasting any code, I give the model the project structure:

Here's the project tree (files only, no contents):

```
src/
  auth/
    middleware.ts
    oauth.ts
    session.ts
  api/
    users.ts
    posts.ts
  db/
    schema.prisma
    migrations/
```

Then I ask: "Which files do you need to see to [task]?" The model picks 2-3 files. I paste those. This is dramatically more effective than dumping everything.
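Rendering that files-only tree is simple to automate. A pure sketch — the `Tree` type (directories as nested objects, files as `null`) is an assumption for illustration; in practice you'd walk the real filesystem and skip `node_modules`:

```typescript
// Render a "files only, no contents" tree for the prompt preamble.
type Tree = { [name: string]: Tree | null };

function renderTree(tree: Tree, indent = ""): string {
  let out = "";
  for (const name of Object.keys(tree).sort()) {
    const child = tree[name];
    out +=
      child === null
        ? `${indent}${name}\n` // file: name only, no contents
        : `${indent}${name}/\n${renderTree(child, indent + "  ")}`; // directory: recurse
  }
  return out;
}

const project: Tree = {
  src: {
    api: { "posts.ts": null, "users.ts": null },
    auth: { "middleware.ts": null, "oauth.ts": null, "session.ts": null },
  },
};
console.log(renderTree(project));
```

The entire tree for a mid-sized project is usually under 500 tokens — a cheap way to let the model do the file selection for you.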

Rule 3: Summaries Over Source for Large Codebases

For architectural questions, I write a 200-word summary instead of pasting 50 files:

```
## Architecture Summary
- Express API with Prisma ORM, PostgreSQL
- Auth: Google OAuth2, sessions stored in Redis
- 3 main domains: users, posts, comments
- Each domain has: router, service, repository layers
- Tests: Vitest for unit, Playwright for e2e
- Current issue: session expiry not triggering token refresh
```

This gives the model the map without the territory. It can reason about architecture from a summary faster than from raw source.

Rule 4: Rolling Context for Multi-Step Work

For long sessions, I maintain a "working set" — a running summary of what we've done and decided:

```
## Working Set (updated after each step)
- Fixed: session middleware now checks token expiry
- Changed: added `refreshToken()` call in auth middleware
- Pending: update tests for new refresh flow
- Decision: using sliding window expiry (not fixed)
```

This replaces the growing conversation history. Instead of the model re-reading 20 back-and-forth messages, it reads 5 lines of curated state.
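The working set can live as structured state and be rendered fresh for each prompt. A sketch — the field names mirror the labels in the example above, and `renderWorkingSet` is a made-up helper, not an existing tool:

```typescript
// Keep the working set as data; render the 5-line summary on demand.
interface WorkingSet {
  fixed: string[];
  changed: string[];
  pending: string[];
  decisions: string[];
}

function renderWorkingSet(ws: WorkingSet): string {
  const lines = (label: string, items: string[]) =>
    items.map((item) => `- ${label}: ${item}`);
  return [
    "## Working Set (updated after each step)",
    ...lines("Fixed", ws.fixed),
    ...lines("Changed", ws.changed),
    ...lines("Pending", ws.pending),
    ...lines("Decision", ws.decisions),
  ].join("\n");
}
```

After each step you mutate the state (move an item from `pending` to `fixed`, record a decision) and re-render — the prompt stays a few lines no matter how long the session runs.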

The Number That Matters

In my experience, the sweet spot is 2,000-8,000 tokens of input context for most coding tasks. That's:

  • 3-5 source files, or
  • 1 file + project summary + task description, or
  • A working set summary + the current file

Above 8K, I start seeing diminishing returns. Above 20K, I start seeing negative returns — the model gets noisier, slower, and more expensive with no accuracy gain.

The 128K window isn't there for you to use all of it. It's there so you can include a large file when you need to. Think of it as cargo capacity, not a target.
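A quick pre-flight check keeps a paste inside the sweet spot. The ~4-characters-per-token figure below is a rough rule of thumb for English text and code, not a real tokenizer — use the model's actual tokenizer (e.g. tiktoken) when you need exact counts:

```typescript
// Rough token estimate: ~4 characters per token (heuristic, not exact).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Default budget of 8K tokens matches the upper end of the sweet spot.
function withinBudget(text: string, budgetTokens = 8_000): boolean {
  return estimateTokens(text) <= budgetTokens;
}
```

Wired into a prompt-assembly script, `withinBudget` gives you a warning before you paste, not a surprise on the invoice after.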

Quick Reference

| Task | Context budget |
| --- | --- |
| Bug fix | ~2K (3 files max) |
| New feature | ~5K (spec + related files) |
| Architecture question | ~1K (summary only) |
| Code review | ~4K (diff + touched files) |
| Refactoring | ~8K (module + tests + conventions) |

How do you manage context window usage? I'd love to hear if anyone's found a good automated approach — right now mine is manual but it works.
