I was burning through tokens like they were free. Then I started measuring. My average coding session used ~80K tokens, and most of it was wasted context the model didn't need.
After a month of experimenting, I cut that to ~35K tokens per session with zero loss in output quality. Here are the five tricks that made the difference.
## 1. The File Summary Header

Instead of pasting an entire file into context, I prepend a 3-line summary:

```typescript
// FILE: src/auth/middleware.ts (147 lines)
// PURPOSE: Express middleware for JWT validation + role-based access
// EXPORTS: authMiddleware, requireRole, extractUser
```

Then I only include the specific function I need help with. The model gets enough context to understand the architecture without reading 147 lines of boilerplate.

Token savings: ~60% per file reference.
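If you reference files often, the header can be generated instead of hand-written. Here's a minimal sketch; `summaryHeader` and its regex are my own rough heuristic, not a real parser, and the purpose line still comes from you:

```typescript
// Build a 3-line summary header for a source file. Line count and the
// export list are derived from the source; `purpose` is written by hand.
// The regex is a rough heuristic: it will miss re-exports and
// `export default` declarations.
function summaryHeader(filePath: string, source: string, purpose: string): string {
  const lineCount = source.split("\n").length;
  const exports = [...source.matchAll(
    /export\s+(?:async\s+)?(?:function|const|class)\s+(\w+)/g,
  )].map((m) => m[1]);
  return [
    `// FILE: ${filePath} (${lineCount} lines)`,
    `// PURPOSE: ${purpose}`,
    `// EXPORTS: ${exports.join(", ")}`,
  ].join("\n");
}
```

Passing the source as a string (rather than reading the file inside the helper) keeps it easy to wire into whatever editor tooling you already have.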
## 2. The Dependency Stub

When the model needs to understand how a function interacts with other modules, don't paste the full dependency — paste a stub:

```typescript
// STUB: database.ts
interface DB {
  query<T>(sql: string, params: any[]): Promise<T[]>;
  transaction(fn: (tx: Transaction) => Promise<void>): Promise<void>;
}
```

The model only needs the interface contract, not the 500-line implementation with connection pooling and retry logic.

Token savings: ~80% per dependency.
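Stubs can also be generated mechanically. A sketch under the assumption that the module's exported functions are declared on single lines; `stubModule` is a hypothetical helper, and a regex is no substitute for the TypeScript compiler API if you need this to be robust:

```typescript
// Derive a rough stub from a module's source: keep exported function
// signatures, drop the bodies. Only catches single-line declarations.
function stubModule(name: string, source: string): string {
  const signatures = source
    .split("\n")
    .filter((line) => /^\s*export\s+(?:async\s+)?function\s+\w+/.test(line))
    .map((line) => line.replace(/\s*\{.*$/, ";").trim());
  return [`// STUB: ${name}`, ...signatures].join("\n");
}
```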
## 3. The Rolling Context Window

For multi-turn sessions, I reset context every 3-4 turns with a summary:

```
Context reset. Here's where we are:

- We're building a rate limiter for the /api/upload endpoint
- We've decided on a sliding window algorithm with Redis
- The function signature is: rateLimit(userId: string, windowMs: number, maxRequests: number)
- Current blocker: handling Redis connection failures gracefully

Continue from here.
```

This prevents the "context sludge" problem where the model drags along 20 turns of outdated conversation.

Token savings: ~40% on sessions longer than 5 turns.
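The reset can be semi-automated. Here's a sketch of a session wrapper, assuming you supply the recap yourself (by hand, or via a cheap model call); `RollingSession` and its shape are my invention, not an API from any library:

```typescript
type Turn = { role: "user" | "assistant"; text: string };

// Keep the most recent turn verbatim and fold older turns into a
// caller-supplied summary once the window is exceeded.
class RollingSession {
  private turns: Turn[] = [];
  constructor(
    private maxTurns: number,
    private summarize: (turns: Turn[]) => string,
  ) {}

  add(turn: Turn): void {
    this.turns.push(turn);
  }

  // Returns the context to send: raw turns while under the limit,
  // otherwise a reset summary plus the latest turn.
  context(): string {
    if (this.turns.length <= this.maxTurns) {
      return this.turns.map((t) => `${t.role}: ${t.text}`).join("\n");
    }
    const recap = this.summarize(this.turns.slice(0, -1));
    const last = this.turns[this.turns.length - 1];
    return `Context reset. Here's where we are:\n${recap}\nContinue from here.\n${last.role}: ${last.text}`;
  }
}
```

The interesting design choice is that `summarize` is a callback: the hard part (what to keep in the recap) stays under your control instead of being buried in the wrapper.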
## 4. The Negative Context Declaration

Tell the model explicitly what to ignore:

```
Focus only on the error handling logic in processPayment().
Ignore: logging, metrics, the retry wrapper, input validation.
These are tested and working — don't modify or comment on them.
```

Without this, the model will "helpfully" refactor your logging, suggest improvements to your validation, and burn tokens on things you didn't ask about.

Token savings: ~30% on modification tasks.
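If you use this pattern on every modification task, the declaration is worth templating so the ignore list stays structured. A trivial sketch; the function name and shape are my own, not from any tool:

```typescript
// Render a negative-context declaration from structured lists, so the
// "ignore" set is explicit and easy to reuse across prompts.
function negativeContext(focus: string, ignore: string[]): string {
  return [
    `Focus only on ${focus}.`,
    `Ignore: ${ignore.join(", ")}.`,
    `These are tested and working; don't modify or comment on them.`,
  ].join("\n");
}
```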
## 5. The Output Budget

Constrain the response format upfront:

```
Return ONLY:
1. The modified function (no surrounding code)
2. A 2-line summary of what changed
3. One potential edge case to test

Do NOT include: explanations of existing code, import statements,
or alternative approaches.
```

I started doing this after noticing that roughly 40% of a typical AI response was explanation I didn't need. The code was fine — the commentary was the waste.

Token savings: ~40% on output tokens.
## The Combined Effect
Using all five together on a typical refactoring task:
| Technique | Before | After |
|---|---|---|
| File references | 12K tokens | 4K tokens |
| Dependencies | 8K tokens | 2K tokens |
| Conversation history | 25K tokens | 15K tokens |
| Unfocused responses | 15K tokens | 8K tokens |
| Verbose output | 20K tokens | 6K tokens |
| Total | 80K | 35K |
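
The totals check out, and the overall reduction works out to roughly 56%. A quick sanity check with the numbers copied from the table:

```typescript
// Before/after token counts from the table, in thousands of tokens.
const rows: Array<[label: string, before: number, after: number]> = [
  ["File references", 12, 4],
  ["Dependencies", 8, 2],
  ["Conversation history", 25, 15],
  ["Unfocused responses", 15, 8],
  ["Verbose output", 20, 6],
];

const totalBefore = rows.reduce((sum, [, before]) => sum + before, 0); // 80
const totalAfter = rows.reduce((sum, [, , after]) => sum + after, 0);  // 35
const reduction = Math.round(((totalBefore - totalAfter) / totalBefore) * 100); // 56
```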
That's not just cheaper — it's faster. Smaller context means faster inference, fewer hallucinations, and more focused output.
## The Meta-Lesson
Context windows are not "how much the model can read." They're a budget. Every token you spend on unnecessary context is a token not available for reasoning about your actual problem.
Treat your context window like RAM: measure it, manage it, and stop assuming more is better.
Start with trick #1 (file summary headers) — it's the easiest to adopt and has the highest payoff. Then layer in the others as they feel natural.
Your wallet and your response quality will both thank you.