DEV Community

Dibyanshu kumar


How I Stopped Losing Work to Context Window Overflow in Claude Code

If you use Claude Code for long coding sessions, you've probably experienced this: you're 40 minutes in, deep in a complex refactor, and the model starts forgetting things. It repeats itself. It loses track of which files it has already edited. Then the session just dies: context window full, conversation over, work lost.

I got tired of it and built a proxy to fix it.

The Problem

LLM coding tools like Claude Code send everything — system prompts, tool definitions, project context, and your entire conversation history — in every API request. As the conversation grows, the payload approaches the model's context limit silently. There's no progress bar. No warning. The tool doesn't tell you "hey, you're at 80%, maybe wrap up."

When it finally overflows, you lose the session. Whatever the model was working on, whatever context it had built up — gone. You start a new conversation from scratch.

What I Tried First

Manual summarization — I'd try to remember to ask the model to write a summary before context ran out. But I'd forget, or misjudge how much room was left.

Shorter sessions — Breaking work into tiny chunks defeats the purpose of having an AI coding assistant handle complex, multi-step tasks.

Prompt caching — I built an entire cache optimization layer with volatility-based decomposition. Six layers, hash-based change detection, provider-specific cache hints. It was elegant in theory. In practice, it didn't meaningfully reduce costs or prevent overflows. I disabled it.

What Actually Worked

I built a local HTTP proxy called Prefixion that sits between Claude Code and the Anthropic API:

Claude Code → Prefixion (localhost:8080) → api.anthropic.com

It doesn't modify your prompts for caching. It doesn't try to be clever. It does two things well:

1. Context Window Warnings

Every request passes through the proxy. Prefixion estimates token usage from the payload size and tracks where you are relative to the model's context limit.
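To make the estimation step concrete, here's a minimal sketch of how a proxy could turn payload size into a usage fraction. It is not Prefixion's actual code: the 4-characters-per-token ratio is a rough heuristic I'm assuming for illustration, the 200,000-token limit is a placeholder, and the function names are hypothetical.

```python
import json

# Assumptions for illustration only: roughly 4 characters per token is a
# coarse heuristic, and 200_000 is a placeholder context limit.
CHARS_PER_TOKEN = 4
CONTEXT_LIMIT_TOKENS = 200_000


def estimate_tokens(payload: dict) -> int:
    """Estimate the token count from the size of the serialized request body."""
    serialized = json.dumps(payload, ensure_ascii=False)
    return len(serialized) // CHARS_PER_TOKEN


def context_usage(payload: dict, limit: int = CONTEXT_LIMIT_TOKENS) -> float:
    """Return the estimated fraction of the context window already in use."""
    return estimate_tokens(payload) / limit
```

An estimate like this will drift from the real tokenizer, which is one reason to pick conservative thresholds rather than waiting until the last few percent.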

When you cross a threshold, it injects a warning directly into the conversation — appended to your last message so the model sees it as an urgent instruction:

At 70% — a gentle alert:

"This conversation has used 72% of its context window. Write a conversation summary and suggest starting a new conversation."

At 80% — a firm warning:

"STOP. BEFORE responding to the user, write a conversation summary. Tell the user to start a new conversation."

At 90% — an emergency stop:

"STOP ALL WORK IMMEDIATELY. Do not make any more tool calls. Write a conversation summary. This conversation must end now."

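The warnings escalate per conversation, and each level fires at most once, so you never see the same warning twice. And because they're injected into the user message (not the system prompt), they don't break any existing cache prefixes: the system prompt and the earlier turns stay exactly as they were.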
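The escalation and injection logic can be sketched roughly like this. The thresholds are the ones described above, but the message texts are abbreviated, the names (`pick_warning`, `inject_warning`, `_fired`) are hypothetical, and the in-memory dict is a stand-in for whatever state the real proxy keeps. The message shape assumes the Anthropic Messages API, where `content` is either a string or a list of content blocks.

```python
import copy

# Thresholds from the post, highest first. Texts are abbreviated.
LEVELS = [
    (0.90, "STOP ALL WORK IMMEDIATELY. Write a conversation summary."),
    (0.80, "STOP. BEFORE responding to the user, write a conversation summary."),
    (0.70, "This conversation has used {pct}% of its context window. "
           "Write a conversation summary and suggest starting a new conversation."),
]

# Hypothetical in-memory state: conversation id -> highest threshold fired.
_fired: dict[str, float] = {}


def pick_warning(conversation_id: str, usage: float):
    """Return the warning for the current usage, or None if it already fired."""
    for threshold, text in LEVELS:
        if usage >= threshold:
            if _fired.get(conversation_id, 0.0) >= threshold:
                return None  # this level (or a higher one) was already sent
            _fired[conversation_id] = threshold
            return text.format(pct=round(usage * 100))
    return None


def inject_warning(payload: dict, warning: str) -> dict:
    """Append the warning to the last user message, leaving earlier turns alone."""
    patched = copy.deepcopy(payload)  # never mutate the original request
    for message in reversed(patched.get("messages", [])):
        if message.get("role") != "user":
            continue
        content = message.get("content")
        if isinstance(content, str):
            message["content"] = content + "\n\n" + warning
        elif isinstance(content, list):
            content.append({"type": "text", "text": warning})
        break
    return patched
```

Working on a deep copy keeps the original payload available, which matters if something goes wrong and the untouched request has to be forwarded instead.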

The result: the model writes a summary file — what was accomplished, current status, open items, key files modified — before the session dies. When you start a new conversation, you have full context to pick up where you left off.

2. Everything Gets Tracked

Every turn is logged to a local SQLite database with:

  • Input/output token counts
  • Cache read/write tokens (from the API response)
  • Calculated cost in USD
  • Guard events that fired (which warnings triggered, when)

There's a web dashboard where you can browse conversations, see per-turn token breakdowns, and check guard efficiency metrics. It's useful for understanding how your sessions actually behave — which ones cost the most, where context fills up fastest, how often you hit the wall.
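For a sense of what the logging side might look like, here's a small sketch using Python's built-in `sqlite3` module. The table layout and function names are my assumptions for illustration, not Prefixion's real schema. The `usage` keys follow the field names in the Anthropic API response, and the cost is passed in rather than computed so the example doesn't hard-code any prices.

```python
import sqlite3

# Hypothetical schema; the real tables may look different.
SCHEMA = """
CREATE TABLE IF NOT EXISTS turns (
    id INTEGER PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cache_read_tokens INTEGER NOT NULL DEFAULT 0,
    cache_write_tokens INTEGER NOT NULL DEFAULT 0,
    cost_usd REAL NOT NULL,
    guard_event TEXT,  -- e.g. 'warn_70'; NULL when no warning fired
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_turn(conn, conversation_id, usage, cost_usd, guard_event=None):
    """Record one turn; `usage` is the usage object from the API response."""
    conn.execute(
        "INSERT INTO turns (conversation_id, input_tokens, output_tokens, "
        "cache_read_tokens, cache_write_tokens, cost_usd, guard_event) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            conversation_id,
            usage.get("input_tokens", 0),
            usage.get("output_tokens", 0),
            usage.get("cache_read_input_tokens", 0),
            usage.get("cache_creation_input_tokens", 0),
            cost_usd,
            guard_event,
        ),
    )
    conn.commit()


def cost_by_conversation(conn):
    """Return (conversation_id, total_cost) rows, most expensive first."""
    return conn.execute(
        "SELECT conversation_id, SUM(cost_usd) AS total FROM turns "
        "GROUP BY conversation_id ORDER BY total DESC"
    ).fetchall()
```

A query like `cost_by_conversation` is the kind of thing a dashboard view could sit on top of: one row per conversation, sorted by what it cost.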

How It's Set Up

Point Claude Code at http://localhost:8080 as the API base URL and start the proxy. That's it. Auth headers pass through untouched. Streaming works. If the proxy's own processing fails for any reason, it forwards the original request unmodified: the "do no harm" principle.
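The "do no harm" fallback is simple to express in code. This is a sketch of the idea rather than the actual implementation, and `transform_or_passthrough` is a hypothetical name: any error while parsing or modifying the request body means the original bytes get forwarded untouched.

```python
import json
from typing import Callable


def transform_or_passthrough(
    raw_body: bytes, transform: Callable[[dict], dict]
) -> bytes:
    """Apply `transform` to the JSON body, falling back to the original bytes."""
    try:
        payload = json.loads(raw_body)
        return json.dumps(transform(payload)).encode("utf-8")
    except Exception:
        # Fail open: a bug in the proxy must never block the user's request.
        return raw_body
```

Catching every exception is normally bad practice, but here it is the point: the proxy is optional tooling, so a broken guard should degrade to a plain pass-through.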

What I Learned

The real problem isn't cost — it's session reliability. I started this project trying to optimize prompt caching and reduce API bills. That turned out to be the wrong problem. The thing that actually hurt was losing work. A $2 session that crashes is worse than a $4 session that finishes.

Warnings need to be injected, not displayed. A notification in a sidebar doesn't help. The model needs to see the warning as an instruction it can act on. Injecting it into the conversation is crude but effective.

LLM tools will probably build these features natively. Context awareness, session handoff — these should be built into Claude Code and Cursor and Aider. Until they are, a proxy is a clean way to add them without forking anything.

Should You Build One?

Honestly — probably not. If you're a casual user, shorter sessions and manual summaries work fine. If you're a power user running 60-minute sessions on complex codebases, the context overflow problem is real and a proxy like this helps.

But the ideas are what matter more than the code:

  • Monitor context usage and intervene before it's too late
  • Inject warnings as model instructions, not UI notifications
  • Always write a summary before a session ends, not after

These are patterns any tool can implement. The proxy approach is just one way to do it.


— DK
