Building a Context Budget: A Practical Token Allocation Framework

#contextbudget #contextengineer #ainative #token

Open the /context command in Claude Code and you'll see something most teams have never seen for their own product: a precise breakdown of where every token in the session went. System prompt, eight percent. Tools and skills, fourteen percent. Conversation history, sixty-one percent. Free space, seventeen percent. The command exists because someone decided a context window needed a statement, the same way a company needs a P&L.

Most teams building on top of these models have no equivalent. They know their token limit. They have almost no idea what's spending it.

That's not a tooling gap. It's a discipline gap. The teams getting hurt by context rot are rarely the ones with a context window that's too small. They're the ones who never asked where each token was going, or whether it had earned its place there.

This is the practice most teams skip, and the one worth building before you touch a single percentage.

What a Token Budget Actually Is

Most people use "token budget" to mean the ceiling: the advertised window size on the model's spec sheet. That's not a budget. It's a credit limit. A credit limit tells you the most you could spend. It says nothing about where the money is actually going.

A context window is capacity. The budget is the layer above it: the deliberate decision about what fills that capacity, in what proportion, reviewed by someone whose job it is to ask whether each slice still deserves the space. Capacity and governance are different problems, and conflating them is how teams end up with a 1-million-token window and the same quality complaints they had at 50,000.

The reason this matters operationally is that the categories competing for that capacity are in zero-sum competition with each other. Pull ten retrieved documents at 1,500 tokens apiece into a RAG pipeline and you've spent 15,000 tokens before the model has read the actual question. Every one of those tokens came out of the same fixed pool that conversation history, tool output, and your system instructions are also drawing from. Add more of one category and something else gets less. There's no way around that math, only choices about how to make it on purpose instead of by accident.
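To make the zero-sum arithmetic concrete, here is a minimal sketch in Python. The 200,000-token window and the figures for system, history, and tools are assumptions chosen for illustration. Only the retrieval line (ten documents at 1,500 tokens each) comes from the example above.

```python
# Assumed window size and per-category spend, for illustration only.
WINDOW = 200_000

def remaining(window: int, spent: dict[str, int]) -> int:
    """Tokens left in the shared pool after every category has drawn from it."""
    return window - sum(spent.values())

spent = {
    "system": 4_000,           # assumed
    "retrieval": 10 * 1_500,   # ten retrieved documents at 1,500 tokens apiece
    "history": 120_000,        # assumed
    "tools": 30_000,           # assumed
}

free_before = remaining(WINDOW, spent)

# "Pull a few extra chunks to be safe": five more documents.
spent["retrieval"] += 5 * 1_500
free_after = remaining(WINDOW, spent)

print(free_before, free_after)
```

Nothing in the pool grew when the extra five documents were added. The 7,500 tokens came straight out of the free space that every other category was also counting on.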

The Ceiling Is Not the Budget

A bigger window doesn't fix a budgeting problem. It postpones the moment the problem becomes visible, and it usually makes the eventual mess larger.

When Claude Code moved to a default 1-million-token context window, several practitioners who'd been getting strong results on the 200,000-token version reported the opposite of an upgrade. More room didn't mean better reasoning. It meant more space for stale tool output, abandoned approaches, and old file reads to accumulate without anyone noticing, because nothing forced a cleanup the way a tighter ceiling used to. One developer found his sessions actually improved after deliberately shrinking back down, treating the extra headroom the way you'd treat a hard drive with ten times the space: a place where clutter survives longer, not a reason to stop sorting it.

The practical version of this lesson: window size is RAM, not storage. Treat it like a constrained resource regardless of how large the number on the spec sheet is. A bigger number doesn't earn you the right to be lazier about what loads.
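One way to treat the window as a constrained resource is to set a working budget for a category that is well below the advertised ceiling and enforce it in code. The sketch below trims conversation history to such a budget, keeping the newest messages first. The function name, the budget value, and the word-count token estimate are all hypothetical stand-ins. A real system would use the model's own tokenizer and would probably summarize rather than simply drop old turns.

```python
from typing import Callable

def trim_history(
    messages: list[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Keep the most recent messages that fit within budget_tokens.

    Walks backward from the newest message and stops at the first one
    that would exceed the budget, so the kept slice stays contiguous.
    """
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

# Crude stand-in for a tokenizer: one token per whitespace-separated word.
def word_count(text: str) -> int:
    return len(text.split())

history = ["a b c", "d e", "f g h i", "j"]
trimmed = trim_history(history, budget_tokens=5, count_tokens=word_count)
print(trimmed)
```

The point of the sketch is that the budget is a number the team chose, not the number on the spec sheet.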

Nobody Owns the Four Categories

Every context window, regardless of product, splits into roughly four things competing for the same fixed space: system instructions, retrieved knowledge, conversation or session history, and tool output. In most teams, every one of those four grows on its own, because no single person is accountable for the total.

System instructions grow by accretion. Someone adds a clarifying line after a bad output, another engineer adds a guardrail after an incident, and six months later the system prompt is a record of every edge case anyone has ever hit, most of which will never recur. Retrieved knowledge grows because "pull a few extra chunks to be safe" feels free and almost never gets measured against whether the extra chunks changed the answer. Conversation history grows because trimming it feels like throwing away memory, even when most of what's being kept is no longer relevant to the current step. Tool output is the worst offender precisely because it's the least visible: a single page snapshot, a pulled list of records, or a raw API response can carry far more raw text than the model needs in that form, and unless something intercepts it, all of it sits in context anyway.
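"Something intercepts it" can be as simple as a cap applied to tool output before it is appended to context. The following is a minimal sketch, not any product's actual mechanism. The helper names are hypothetical, and the four-characters-per-token estimate is a rough assumption that a real tokenizer should replace.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, an assumption; use a real tokenizer in practice

def estimate_tokens(text: str) -> int:
    """Approximate token count by character length, rounded up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN

def clip_tool_output(text: str, max_tokens: int) -> str:
    """Return text unchanged if it fits, else its head plus a truncation note.

    The note itself adds a few tokens on top of max_tokens.
    """
    total = estimate_tokens(text)
    if total <= max_tokens:
        return text
    keep_chars = max_tokens * CHARS_PER_TOKEN
    omitted = total - max_tokens
    return text[:keep_chars] + f"\n[truncated: ~{omitted} tokens omitted]"

raw_response = "x" * 4_000  # stands in for a raw API response or page snapshot
clipped = clip_tool_output(raw_response, max_tokens=100)
print(estimate_tokens(raw_response), estimate_tokens(clipped))
```

Keeping only the head is the bluntest possible policy. The design choice that matters is that the decision happens at the boundary, where someone can see and tune it, rather than after the text is already sitting in context.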

A recent review of agent memory architectures put the underlying issue plainly: the context window is the scarcest shared resource any agent system has, and how it gets allocated is a coordination problem that no single piece of the system can solve by itself. That's the part most teams miss. Budgeting context isn't an engineering task you assign to one function. It's a cross-cutting one, and cross-cutting problems are exactly the ones that go unowned by default.

Budgeting Is a Practice, Not a Percentage

The honest fix here isn't a universal split. Anyone offering "40 percent retrieval, 30 percent history, 20 percent tools, 10 percent system" as a template that fits every product is selling something that won't survive contact with a real codebase or a real support queue. A coding agent's ideal allocation looks nothing like a customer-support bot's, and a fixed-percentage template that worked for one will misallocate the other.

What actually transfers across products is the practice, not the numbers. Name the four categories explicitly. Give one person, not a committee, the job of asking whether each category still earns its share. Review it on a cadence, the same way a team reviews a cloud bill line by line rather than just checking they're under the annual cap. A cap tells you nothing about which service is burning the budget. A line-item review does.
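A line-item review can be sketched in a few lines of Python. Everything here is illustrative: the class and function names are hypothetical, the 200,000-token window is assumed, the token counts mirror the example percentages from the opening, and the caps are placeholders a team would set for its own product, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    category: str
    tokens: int
    share_cap: float  # agreed maximum share of the window; product-specific

def review(window: int, items: list[LineItem]) -> list[str]:
    """Return one report line per category, flagging any over its agreed cap."""
    report = []
    for item in items:
        share = item.tokens / window
        flag = "OVER" if share > item.share_cap else "ok"
        report.append(
            f"{item.category}: {share:.0%} of window "
            f"(cap {item.share_cap:.0%}) {flag}"
        )
    return report

WINDOW = 200_000  # assumed
items = [
    LineItem("system", 16_000, 0.10),    # hypothetical cap
    LineItem("tools", 28_000, 0.20),     # hypothetical cap
    LineItem("history", 122_000, 0.50),  # hypothetical cap
]

report = review(WINDOW, items)
for line in report:
    print(line)
```

Checking only the total would pass this session, since 83 percent of the window is still under the ceiling. The per-category view is what surfaces history running over its share.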

Some teams are already doing pieces of this without naming it. Cursor's .cursorignore file is a budgeting decision made before the fact: entire categories of files are told they will never compete for context at all, rather than being added and then managed once they're already taking up space. When Cursor's agent needs to search broadly across a codebase, it can spawn a separate subagent with its own context window just for that search, so raw results never spend tokens out of the main conversation's budget. That's a team deciding, explicitly, that one category of work deserves its own ledger rather than sharing the main one.

Claude Code's /context breakdown is the other half of the same idea: a dashboard that exists specifically so someone can see the split before a session runs long enough to degrade. The dashboard isn't the discipline. Running it before every long session is.

Why This Doesn't Show Up on a Dashboard

The cost of skipping this practice doesn't throw an error. It shows up as a slow, undramatic decline. A support agent starts needing three exchanges where one used to do. A coding agent begins re-deciding things it already decided an hour earlier in the same session. Nobody gets paged, because nothing crashed. The decline gets blamed on the model, because the model is the part of the system anyone can name. The actual cause, an unaudited and unowned allocation of tokens, doesn't show up on any dashboard a team is currently watching.

This is the same root issue that's run through this series in different clothes. Wrong build order showed up as eval infrastructure nobody trusted. Here it shows up as context nobody owns. Both are organizational problems wearing technical costumes. The fix in both cases is the same shape: assign the question to a person, on a schedule, before the system grows large enough that no one can audit it by hand anymore.

A context window doesn't tell you when it's full of the wrong things. It just gets quietly worse and keeps answering anyway.

A token budget you never check isn't a budget. It's just a limit you haven't hit yet.