I was burning through tokens like they were free. Then I started measuring. My average coding session used ~80K tokens, and most of it was wasted context the model didn't need.
After a month of experimenting, I cut that to ~35K tokens per session with zero loss in output quality. Here are the five tricks that made the difference.
## 1. The File Summary Header

Instead of pasting an entire file into context, I prepend a 3-line summary:

```typescript
// FILE: src/auth/middleware.ts (147 lines)
// PURPOSE: Express middleware for JWT validation + role-based access
// EXPORTS: authMiddleware, requireRole, extractUser
```

Then I only include the specific function I need help with. The model gets enough context to understand the architecture without reading 147 lines of boilerplate.

Token savings: ~60% per file reference.
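If you reference files often, the header can be generated instead of hand-written. Here's a minimal sketch; `summaryHeader` and its regex are my own rough heuristic, not a real parser, and the purpose line still comes from you:

```typescript
// Build a 3-line summary header for a source file. Line count and the
// export list are derived from the source; `purpose` is written by hand.
// The regex is a rough heuristic: it will miss re-exports and
// `export default` declarations.
function summaryHeader(filePath: string, source: string, purpose: string): string {
  const lineCount = source.split("\n").length;
  const exports = [...source.matchAll(
    /export\s+(?:async\s+)?(?:function|const|class)\s+(\w+)/g,
  )].map((m) => m[1]);
  return [
    `// FILE: ${filePath} (${lineCount} lines)`,
    `// PURPOSE: ${purpose}`,
    `// EXPORTS: ${exports.join(", ")}`,
  ].join("\n");
}
```

Passing the source as a string (rather than reading the file inside the helper) keeps it easy to wire into whatever editor tooling you already have.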
## 2. The Dependency Stub

When the model needs to understand how a function interacts with other modules, don't paste the full dependency — paste a stub:

```typescript
// STUB: database.ts
interface DB {
  query<T>(sql: string, params: any[]): Promise<T[]>;
  transaction(fn: (tx: Transaction) => Promise<void>): Promise<void>;
}
```

The model only needs the interface contract, not the 500-line implementation with connection pooling and retry logic.

Token savings: ~80% per dependency.
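Stubs can also be generated mechanically. A sketch under the assumption that the module's exported functions are declared on single lines; `stubModule` is a hypothetical helper, and a regex is no substitute for the TypeScript compiler API if you need this to be robust:

```typescript
// Derive a rough stub from a module's source: keep exported function
// signatures, drop the bodies. Only catches single-line declarations.
function stubModule(name: string, source: string): string {
  const signatures = source
    .split("\n")
    .filter((line) => /^\s*export\s+(?:async\s+)?function\s+\w+/.test(line))
    .map((line) => line.replace(/\s*\{.*$/, ";").trim());
  return [`// STUB: ${name}`, ...signatures].join("\n");
}
```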
## 3. The Rolling Context Window

For multi-turn sessions, I reset context every 3-4 turns with a summary:

```
Context reset. Here's where we are:

- We're building a rate limiter for the /api/upload endpoint
- We've decided on a sliding window algorithm with Redis
- The function signature is: rateLimit(userId: string, windowMs: number, maxRequests: number)
- Current blocker: handling Redis connection failures gracefully

Continue from here.
```

This prevents the "context sludge" problem where the model drags along 20 turns of outdated conversation.

Token savings: ~40% on sessions longer than 5 turns.
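The reset can be semi-automated. Here's a sketch of a session wrapper, assuming you supply the recap yourself (by hand, or via a cheap model call); `RollingSession` and its shape are my invention, not an API from any library:

```typescript
type Turn = { role: "user" | "assistant"; text: string };

// Keep the most recent turn verbatim and fold older turns into a
// caller-supplied summary once the window is exceeded.
class RollingSession {
  private turns: Turn[] = [];
  constructor(
    private maxTurns: number,
    private summarize: (turns: Turn[]) => string,
  ) {}

  add(turn: Turn): void {
    this.turns.push(turn);
  }

  // Returns the context to send: raw turns while under the limit,
  // otherwise a reset summary plus the latest turn.
  context(): string {
    if (this.turns.length <= this.maxTurns) {
      return this.turns.map((t) => `${t.role}: ${t.text}`).join("\n");
    }
    const recap = this.summarize(this.turns.slice(0, -1));
    const last = this.turns[this.turns.length - 1];
    return `Context reset. Here's where we are:\n${recap}\nContinue from here.\n${last.role}: ${last.text}`;
  }
}
```

The interesting design choice is that `summarize` is a callback: the hard part (what to keep in the recap) stays under your control instead of being buried in the wrapper.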
## 4. The Negative Context Declaration

Tell the model explicitly what to ignore:

```
Focus only on the error handling logic in processPayment().
Ignore: logging, metrics, the retry wrapper, input validation.
These are tested and working — don't modify or comment on them.
```

Without this, the model will "helpfully" refactor your logging, suggest improvements to your validation, and burn tokens on things you didn't ask about.

Token savings: ~30% on modification tasks.
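If you use this pattern on every modification task, the declaration is worth templating so the ignore list stays structured. A trivial sketch; the function name and shape are my own, not from any tool:

```typescript
// Render a negative-context declaration from structured lists, so the
// "ignore" set is explicit and easy to reuse across prompts.
function negativeContext(focus: string, ignore: string[]): string {
  return [
    `Focus only on ${focus}.`,
    `Ignore: ${ignore.join(", ")}.`,
    `These are tested and working; don't modify or comment on them.`,
  ].join("\n");
}
```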
## 5. The Output Budget

Constrain the response format upfront:

```
Return ONLY:
1. The modified function (no surrounding code)
2. A 2-line summary of what changed
3. One potential edge case to test

Do NOT include: explanations of existing code, import statements,
or alternative approaches.
```

I started doing this after noticing that roughly 40% of a typical AI response was explanation I didn't need. The code was fine — the commentary was the waste.

Token savings: ~40% on output tokens.
## The Combined Effect
Using all five together on a typical refactoring task:
| Technique | Before | After |
|---|---|---|
| File references | 12K tokens | 4K tokens |
| Dependencies | 8K tokens | 2K tokens |
| Conversation history | 25K tokens | 15K tokens |
| Unfocused responses | 15K tokens | 8K tokens |
| Verbose output | 20K tokens | 6K tokens |
| Total | 80K | 35K |
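
The totals check out, and the overall reduction works out to roughly 56%. A quick sanity check with the numbers copied from the table:

```typescript
// Before/after token counts from the table, in thousands of tokens.
const rows: Array<[label: string, before: number, after: number]> = [
  ["File references", 12, 4],
  ["Dependencies", 8, 2],
  ["Conversation history", 25, 15],
  ["Unfocused responses", 15, 8],
  ["Verbose output", 20, 6],
];

const totalBefore = rows.reduce((sum, [, before]) => sum + before, 0); // 80
const totalAfter = rows.reduce((sum, [, , after]) => sum + after, 0);  // 35
const reduction = Math.round(((totalBefore - totalAfter) / totalBefore) * 100); // 56
```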
That's not just cheaper — it's faster. Smaller context means faster inference, fewer hallucinations, and more focused output.
## The Meta-Lesson
Context windows are not "how much the model can read." They're a budget. Every token you spend on unnecessary context is a token not available for reasoning about your actual problem.
Treat your context window like RAM: measure it, manage it, and stop assuming more is better.
Start with trick #1 (file summary headers) — it's the easiest to adopt and has the highest payoff. Then layer in the others as they feel natural.
Your wallet and your response quality will both thank you.