Every RAG app and agent eventually hits the same wall: you have more stuff than fits in the model's context window — a system prompt, chat history, retrieved documents, tool output — and a fixed token budget.
The usual "fix" is to truncate the whole blob at the end, which chops off whatever happened to come last: sometimes a doc, sometimes half your system prompt. What gets dropped depends on position, not on importance, so you drop the wrong things.
I got tired of rewriting that logic in every project, so I built contextcram — a tiny, zero-dependency library that treats this as a prioritized packing problem.
The idea
Give each piece of context a priority and a strategy for what should happen if it doesn't fit. Set a token budget. contextcram assembles the largest in-budget context that keeps the important parts.
```bash
pip install contextcram
```

```python
from contextcram import Packer

# system_prompt, chat_history, retrieved_docs and tool_output are your own data
ctx = (
    Packer(budget=8000)
    .add(system_prompt, priority="required")                   # never dropped
    .add(chat_history, priority="high", strategy="trim")       # drop oldest turns
    .add(retrieved_docs, priority="medium", strategy="drop")   # all-or-nothing
    .add(tool_output, priority="low", strategy="truncate")     # cut to fit
    .fit()
)

print(ctx.text)           # the assembled, in-budget context
print(ctx.used_tokens)    # e.g. 7840
print(ctx.dropped_names)  # what didn't make the cut
```
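To make the "prioritized packing" idea concrete, here is a rough standalone sketch of a greedy, priority-ordered packer. It is not contextcram's implementation: the `Item` class, the `RANK` table, the `pack` function, and the word-count "tokenizer" are all hypothetical names made up for illustration, and it only handles whole items (no trimming or truncation, no special treatment of required items).

```python
from dataclasses import dataclass

# Hypothetical priority ranks for this sketch; lower values are packed first.
RANK = {"required": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Item:
    name: str
    text: str
    priority: str


def count_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


def pack(items: list[Item], budget: int) -> tuple[list[str], list[str]]:
    """Return (kept_names, dropped_names), visiting items in priority order."""
    kept: set[str] = set()
    used = 0
    # sorted() is stable, so items of equal priority keep their insertion order.
    for item in sorted(items, key=lambda i: RANK[i.priority]):
        cost = count_tokens(item.text)
        if used + cost <= budget:
            kept.add(item.name)
            used += cost
    # Report both lists in the caller's original order.
    kept_names = [i.name for i in items if i.name in kept]
    dropped_names = [i.name for i in items if i.name not in kept]
    return kept_names, dropped_names


items = [
    Item("tool_output", "x " * 50, "low"),
    Item("system", "You are a helpful assistant.", "required"),
    Item("docs", "y " * 30, "medium"),
]
kept, dropped = pack(items, budget=40)
print(kept)     # ['system', 'docs']
print(dropped)  # ['tool_output']
```

Compare this with naive end-truncation of the same three pieces in their original order: the 50-token tool output would fill the whole budget and the system prompt would be cut entirely.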
Strategies
When an optional item doesn't fully fit, its strategy decides what happens:
| Strategy | Behavior |
|---|---|
| `drop` | Include it whole, or not at all |
| `truncate` | Cut from the end, keep the head (default) |
| `truncate_head` | Cut from the start, keep the tail |
| `trim` | For lists (e.g. messages): drop oldest first |
`required` items are always kept; if they alone blow the budget, you get a clear `BudgetExceeded` error instead of a silently mangled prompt.
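If it helps to see the strategies as plain code, the sketch below mimics each one on a list of tokens or messages. Treat it as an illustration under assumptions, not the library's internals: `fit_tokens`, `trim_messages`, and `check_required` are hypothetical helpers, tokens are just whitespace-separated words, and the `BudgetExceeded` class here is a local stand-in rather than the one contextcram exports.

```python
class BudgetExceeded(Exception):
    """Local stand-in for the error described above; not imported from contextcram."""


def fit_tokens(tokens: list[str], room: int, strategy: str = "truncate") -> list[str]:
    """Shrink one optional item (as a token list) to at most `room` tokens."""
    if len(tokens) <= room:
        return tokens                        # already fits, nothing to do
    if strategy == "drop":
        return []                            # all-or-nothing
    if strategy == "truncate":
        return tokens[:room]                 # cut from the end, keep the head
    if strategy == "truncate_head":
        return tokens[len(tokens) - room:]   # cut from the start, keep the tail
    raise ValueError(f"unknown strategy: {strategy}")


def trim_messages(messages: list[str], room: int) -> list[str]:
    """Drop the oldest messages until the remainder fits in `room` tokens."""
    kept = list(messages)
    while kept and sum(len(m.split()) for m in kept) > room:
        kept.pop(0)                          # oldest message goes first
    return kept


def check_required(required_tokens: int, budget: int) -> None:
    """Fail loudly when the required items alone do not fit."""
    if required_tokens > budget:
        raise BudgetExceeded(
            f"required items need {required_tokens} tokens, budget is {budget}"
        )


words = "a b c d e f".split()
print(fit_tokens(words, 4, "truncate"))       # ['a', 'b', 'c', 'd']
print(fit_tokens(words, 4, "truncate_head"))  # ['c', 'd', 'e', 'f']
print(fit_tokens(words, 4, "drop"))           # []

chat = ["user: hi there", "bot: hello", "user: what is RAG?"]
print(trim_messages(chat, 6))  # ['bot: hello', 'user: what is RAG?']
```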
The part I actually wanted: model-aware budgets + room to answer
Two recurring annoyances solved in one line:
```python
from contextcram import Packer

# Budget pulled from the model; hold back 2k tokens for the reply
packer = Packer(model="gpt-4o", reserve=2000)

print(packer.full_budget)  # 128000
print(packer.budget)       # 126000  <- what you actually pack into
```
`reserve=` kills the classic "the prompt fit, but there's no room left for the model to answer" bug. Tie it to your `max_tokens` and you can't get it wrong.
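The arithmetic behind a reserve is simple enough to sketch by hand. The snippet below is a standalone illustration, not contextcram code: `CONTEXT_WINDOWS` and `packing_budget` are hypothetical names, and the single window size in the table is an assumption you should check against your provider's documentation.

```python
# Assumed context-window sizes for this sketch; verify against your provider's docs.
CONTEXT_WINDOWS = {"gpt-4o": 128_000}


def packing_budget(model: str, reserve: int) -> int:
    """Tokens left for the prompt after holding back `reserve` for the reply."""
    full = CONTEXT_WINDOWS[model]
    if not 0 <= reserve < full:
        raise ValueError(f"reserve must be between 0 and {full - 1}, got {reserve}")
    return full - reserve


# Use one constant for both the reply limit and the reserve so they cannot drift apart.
MAX_TOKENS = 2000
budget = packing_budget("gpt-4o", reserve=MAX_TOKENS)
print(budget)  # 126000
```

Sharing one constant between the reserve and the reply limit is the point: if you raise the reply limit later, the packing budget shrinks with it.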
Real-world: with LangChain
```python
from contextcram import Packer
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Assumes SYSTEM_PROMPT, question, retriever and memory already exist in your app
llm = ChatOpenAI(model="gpt-4o")

docs = [d.page_content for d in retriever.invoke(question)]
history = [f"{m.type}: {m.content}" for m in memory.messages]

ctx = (
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add(history, priority="high", strategy="trim")
    .add("\n\n".join(docs), priority="medium", strategy="drop")
    .fit()
)

# The packed context (system prompt, history, docs) goes in as one system message
response = llm.invoke([SystemMessage(ctx.text), HumanMessage(question)])
```
Need exact token counts? Pass `tokenizer=tiktoken_tokenizer("gpt-4o")`, or wrap any tokenizer (Hugging Face, llama.cpp) with a one-line `CallableTokenizer`. The default is a fast characters-per-token heuristic, so there are no required dependencies.
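For a feel of what a characters-per-token heuristic looks like, and how any encoder can be turned into a counter, here is a small stdlib-only sketch. The names `approx_tokens` and `counter_from_encoder` are hypothetical, and the 4-characters-per-token ratio is a common rule of thumb for English text that I'm assuming here; the ratio contextcram actually uses may differ.

```python
import math
from typing import Callable


def approx_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate a token count from character length alone (assumed ratio)."""
    return math.ceil(len(text) / chars_per_token)


def counter_from_encoder(encode: Callable[[str], list]) -> Callable[[str], int]:
    """Turn any encode(text) -> list-of-tokens function into a token counter."""
    return lambda text: len(encode(text))


# str.split stands in for a real tokenizer's encode() method in this sketch.
count_words = counter_from_encoder(str.split)

print(approx_tokens("hello world!"))  # 3
print(count_words("hello world!"))    # 2
```

The heuristic is cheap but approximate, so it pairs well with a generous reserve; switch to a real tokenizer when you pack close to the limit.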
How is this different from Priompt / Prompt Poet?
Honest answer: the concept isn't new. Priompt (and its Python port) and Character.AI's Prompt Poet do priority-based context assembly too — and they're more powerful (component models, cache-aware truncation, templating).
contextcram deliberately trades features for simplicity and zero dependencies:
- Pure stdlib: no Jinja2, no YAML, no heavy SDK.
- A 3-line API: `Packer(...).add(...).fit()`.
- Framework-agnostic: LangChain, LlamaIndex, raw SDKs, or nothing.
If you want the smallest possible helper that does one thing — fit prioritized pieces into a budget — this is it.
Try it
```bash
pip install contextcram
```
It's MIT, fully typed (mypy --strict), tested across Python 3.10–3.13. I'd genuinely love feedback on the API and the default strategies — open an issue or drop a comment.