Wael Rahhal

Your LLM prompt doesn't fit? Pack it by priority (zero dependencies)

Every RAG app and agent eventually hits the same wall: you have more stuff than fits in the model's context window — a system prompt, chat history, retrieved documents, tool output — and a fixed token budget.

The usual "fix" is to truncate the whole blob at the end. Which means you chop off whatever happened to be last, whether or not it matters: sometimes a doc, sometimes half your system prompt. You drop the wrong things.

I got tired of rewriting that logic in every project, so I built contextcram — a tiny, zero-dependency library that treats this as a prioritized packing problem.

The idea

Give each piece of context a priority and a strategy for what should happen if it doesn't fit. Set a token budget. contextcram assembles the largest in-budget context that keeps the important parts.

pip install contextcram

Then add each piece with a priority and a strategy, and call fit():

from contextcram import Packer

ctx = (
    Packer(budget=8000)
    .add(system_prompt, priority="required")                 # never dropped
    .add(chat_history, priority="high", strategy="trim")     # drop oldest turns
    .add(retrieved_docs, priority="medium", strategy="drop") # all-or-nothing
    .add(tool_output, priority="low", strategy="truncate")   # cut to fit
    .fit()
)

print(ctx.text)          # the assembled, in-budget context
print(ctx.used_tokens)   # e.g. 7840
print(ctx.dropped_names) # what didn't make the cut
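
To make the packing idea concrete, here is a minimal stdlib sketch of priority-ordered greedy packing. It is not contextcram's source: `Item`, `pack`, `estimate_tokens`, and the 4-characters-per-token ratio are all assumptions made for illustration, and it ignores strategies and the required-item error entirely.

```python
from dataclasses import dataclass

# Hypothetical names throughout; this is an illustration, not the library's code.
PRIORITY_RANK = {"required": 0, "high": 1, "medium": 2, "low": 3}


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about 4 characters per token."""
    return (len(text) + 3) // 4


@dataclass
class Item:
    name: str
    text: str
    priority: str


def pack(items: list[Item], budget: int) -> tuple[list[Item], list[str]]:
    """Keep items in priority order while they fit; report the rest by name."""
    # sorted() is stable, so items with equal priority keep insertion order.
    ranked = sorted(enumerate(items), key=lambda pair: PRIORITY_RANK[pair[1].priority])
    kept_positions: list[int] = []
    dropped: list[str] = []
    remaining = budget
    for position, item in ranked:
        cost = estimate_tokens(item.text)
        if cost <= remaining:
            kept_positions.append(position)
            remaining -= cost
        else:
            dropped.append(item.name)
    # Emit the survivors in the order the caller added them.
    kept = [items[p] for p in sorted(kept_positions)]
    return kept, dropped


items = [
    Item("system", "S" * 400, "required"),   # 100 tokens
    Item("docs", "D" * 4000, "medium"),      # 1000 tokens
    Item("history", "H" * 2000, "high"),     # 500 tokens
    Item("tool", "T" * 800, "low"),          # 200 tokens
]
kept, dropped = pack(items, budget=800)
print([i.name for i in kept])  # ['system', 'history', 'tool']
print(dropped)                 # ['docs']
```

Note the design choice in this sketch: a small low-priority item can still get in after a larger medium-priority one was skipped. Whether a real packer should allow that is a judgment call.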

Strategies

When an optional item doesn't fully fit, its strategy decides what happens:

Strategy         Behavior
drop             Include it whole, or not at all
truncate         Cut from the end, keep the head (default)
truncate_head    Cut from the start, keep the tail
trim             For lists (e.g. messages): drop oldest first

required items are always kept; if they alone blow the budget, you get a clear BudgetExceeded error instead of a silently mangled prompt.
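
As a rough illustration of what each strategy means, the four behaviors and the required-item check could be sketched like this. Every name here (`fit_drop`, `fit_truncate`, `fit_truncate_head`, `fit_trim`, `check_required`, `BudgetExceededError`) is hypothetical, and the 4-characters-per-token estimate is an assumption; the library's real internals may differ.

```python
from typing import Optional


class BudgetExceededError(Exception):
    """Hypothetical stand-in for the BudgetExceeded error described above."""


def count(text: str) -> int:
    # Assumed heuristic: roughly 4 characters per token.
    return (len(text) + 3) // 4


def fit_drop(text: str, room: int) -> Optional[str]:
    """All-or-nothing: return the text only if it fits whole."""
    return text if count(text) <= room else None


def fit_truncate(text: str, room: int) -> str:
    """Keep the head, cut from the end."""
    return text[: max(room, 0) * 4]


def fit_truncate_head(text: str, room: int) -> str:
    """Keep the tail, cut from the start."""
    return text[-room * 4:] if room > 0 else ""


def fit_trim(messages: list[str], room: int) -> list[str]:
    """Drop the oldest messages until the remainder fits."""
    kept = list(messages)
    while kept and sum(count(m) for m in kept) > room:
        kept.pop(0)  # index 0 is the oldest turn
    return kept


def check_required(required: list[str], budget: int) -> int:
    """Return the room left after required items, or fail loudly."""
    used = sum(count(t) for t in required)
    if used > budget:
        raise BudgetExceededError(
            f"required items need {used} tokens, budget is {budget}"
        )
    return budget - used


print(fit_truncate("abcdefghij", 2))       # abcdefgh
print(fit_truncate_head("abcdefghij", 2))  # cdefghij
```

Cutting on character boundaries is the simplest thing that works with a length-based estimate; with a real tokenizer you would slice the token list instead.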

The part I actually wanted: model-aware budgets + room to answer

Two recurring annoyances solved in one line:

from contextcram import Packer

# Budget pulled from the model; hold back 2k tokens for the reply
packer = Packer(model="gpt-4o", reserve=2000)
print(packer.full_budget)  # 128000
print(packer.budget)       # 126000  <- what you actually pack into

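reserve= kills the classic "the prompt fit, but there's no room left for the model to answer" bug. Set it to at least your max_tokens so the reply's worst case is already subtracted from the budget. With the default heuristic tokenizer the counts are estimates, so leave a little margin.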
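
The arithmetic behind this is just subtraction. A small sketch, assuming a hand-written lookup table (`CONTEXT_WINDOWS` and `packing_budget` are hypothetical names, and only the gpt-4o figure comes from the example above):

```python
# Hypothetical lookup table; a real library would ship a fuller one.
CONTEXT_WINDOWS = {"gpt-4o": 128_000}


def packing_budget(model: str, reserve: int) -> int:
    """Tokens left for the prompt after holding back `reserve` for the reply."""
    full = CONTEXT_WINDOWS[model]
    if not 0 <= reserve < full:
        raise ValueError(f"reserve must be between 0 and {full - 1}")
    return full - reserve


MAX_TOKENS = 2000  # the same value you would pass as the model's reply limit
budget = packing_budget("gpt-4o", reserve=MAX_TOKENS)
print(budget)  # 126000
```

Deriving both numbers from one constant is what keeps the prompt budget and the reply limit from drifting apart.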

Real-world: with LangChain

from contextcram import Packer
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Assumes retriever, memory, question, and SYSTEM_PROMPT are defined elsewhere in your app.
llm = ChatOpenAI(model="gpt-4o")
docs = [d.page_content for d in retriever.invoke(question)]
history = [f"{m.type}: {m.content}" for m in memory.messages]

ctx = (
    # reserve has to cover the reply and the question, which is sent outside ctx
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add(history, priority="high", strategy="trim")
    .add("\n\n".join(docs), priority="medium", strategy="drop")
    .fit()
)

response = llm.invoke([SystemMessage(ctx.text), HumanMessage(question)])

Need exact token counts? Pass tokenizer=tiktoken_tokenizer("gpt-4o"), or wrap any tokenizer (Hugging Face, llama.cpp) with a one-line CallableTokenizer. The default is a fast characters-per-token heuristic so there are no required dependencies.
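
For a feel of how a pluggable token counter can work, here is a stdlib-only sketch. `TokenCounter`, `HeuristicCounter`, and `EncodeCounter` are hypothetical names, not contextcram's classes, and the default ratio of 4 characters per token is an assumption. The whitespace splitter is only a stand-in so the example runs without extra packages; in practice you would pass something like a tiktoken or Hugging Face `encode` method.

```python
import math
from typing import Callable, Protocol, Sequence


class TokenCounter(Protocol):
    def count(self, text: str) -> int: ...


class HeuristicCounter:
    """Estimate tokens from text length; the ratio is an assumed default."""

    def __init__(self, chars_per_token: float = 4.0) -> None:
        self.chars_per_token = chars_per_token

    def count(self, text: str) -> int:
        return math.ceil(len(text) / self.chars_per_token)


class EncodeCounter:
    """Adapt any `encode(text) -> sequence of tokens` callable to count()."""

    def __init__(self, encode: Callable[[str], Sequence[object]]) -> None:
        self._encode = encode

    def count(self, text: str) -> int:
        return len(self._encode(text))


def total_tokens(counter: TokenCounter, parts: list[str]) -> int:
    """Sum token counts with whichever counter was plugged in."""
    return sum(counter.count(p) for p in parts)


fast = HeuristicCounter()
by_words = EncodeCounter(str.split)  # stand-in tokenizer: whitespace splitting

print(fast.count("hello world!"))                 # 3
print(by_words.count("hello brave new world"))    # 4
print(total_tokens(fast, ["abcd", "abcdefgh"]))   # 3
```

The heuristic is fast and dependency-free but approximate, so a real tokenizer is the safer choice when you are packing close to the limit.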

How is this different from Priompt / Prompt Poet?

Honest answer: the concept isn't new. Priompt (and its Python port) and Character.AI's Prompt Poet do priority-based context assembly too — and they're more powerful (component models, cache-aware truncation, templating).

contextcram deliberately trades features for simplicity and zero dependencies:

  • Pure stdlib — no Jinja2, no YAML, no heavy SDK.
  • A 3-line API: Packer(...).add(...).fit().
  • Framework-agnostic — LangChain, LlamaIndex, raw SDKs, or nothing.

If you want the smallest possible helper that does one thing — fit prioritized pieces into a budget — this is it.

Try it

pip install contextcram

It's MIT-licensed, fully typed (passes mypy --strict), and tested on Python 3.10–3.13. I'd genuinely love feedback on the API and the default strategies: open an issue or drop a comment.
