Compact LLM chat history without LangChain (zero dependencies)

Wael Rahhal — Tue, 16 Jun 2026 19:20:53 +0000

Long conversations eventually overflow the model's context window. Both common fixes hurt: drop old turns and you lose context; keep everything and the request won't fit.

The middle ground is summarize the old turns, keep the recent ones verbatim — but today that usually means adopting LangChain's memory classes or a hosted "memory platform." If you just want the building block, that's a lot of baggage.

So I built chatcram — a tiny, zero-dependency library that does exactly that, on plain message dicts, with a summarizer you provide.

Install

pip install chatcram

Use it

from chatcram import Compactor

def summarize(transcript: str) -> str:
    # your LLM call here — any str -> str function
    return my_client.complete(f"Summarize this conversation:\n{transcript}")

compactor = Compactor(budget=4000, summarize=summarize, keep_recent=1500)

result = compactor.compact(messages)   # list of {"role", "content"} dicts
messages = result.messages             # ready to send to the model
print(result.summarized, result.used_tokens)

What comes back:

system messages, kept verbatim, at the front
one summary message for the older middle (via your summarizer)
the most recent turns, kept verbatim (up to keep_recent tokens)

If the history already fits budget, it's returned unchanged.

Why bring-your-own summarizer?

It keeps chatcram dependency-free and provider-agnostic. summarize is any str -> str callable — GPT, Claude, a local model, even a non-LLM heuristic. No hidden API calls, no lock-in.

Pairs with contextcram

chatcram handles the history; contextcram handles the whole prompt. Compact the conversation, then pack it (plus system prompt + retrieved docs) into your token budget:

from chatcram import Compactor
from contextcram import Packer

history = Compactor(budget=3000, summarize=summarize).compact(messages).messages

ctx = (
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add([f"{m['role']}: {m['content']}" for m in history], priority="high", strategy="trim")
    .add(retrieved_docs, priority="medium", strategy="drop")
    .fit()
)

Two tiny, zero-dependency pieces that together cover most context-management needs — without a framework.

How is this different from LangChain memory / mem0?

Honest answer: the idea isn't new. LangChain's ConversationSummaryBufferMemory and platforms like mem0/Zep do this (and more). chatcram trades features for being standalone, dependency-free, and framework-agnostic — a building block, not a platform.

Try it

pip install chatcram

⭐ Repo: https://github.com/Waelr1985/chatcram
📦 PyPI: https://pypi.org/project/chatcram/

MIT, fully typed (mypy --strict), tested on Python 3.10–3.13. Feedback welcome — especially on the default keep/summarize split.

Your LLM prompt doesn't fit? Pack it by priority (zero dependencies)

Wael Rahhal — Mon, 15 Jun 2026 23:09:00 +0000

Every RAG app and agent eventually hits the same wall: you have more stuff than fits in the model's context window — a system prompt, chat history, retrieved documents, tool output — and a fixed token budget.

The usual "fix" is to truncate the whole blob at the end. Which means you randomly chop off whatever happened to be last: sometimes a doc, sometimes half your system prompt. You drop the wrong things.

I got tired of rewriting that logic in every project, so I built contextcram — a tiny, zero-dependency library that treats this as a prioritized packing problem.

The idea

Give each piece of context a priority and a strategy for what should happen if it doesn't fit. Set a token budget. contextcram assembles the largest in-budget context that keeps the important parts.

pip install contextcram

from contextcram import Packer

ctx = (
    Packer(budget=8000)
    .add(system_prompt, priority="required")                 # never dropped
    .add(chat_history, priority="high", strategy="trim")     # drop oldest turns
    .add(retrieved_docs, priority="medium", strategy="drop") # all-or-nothing
    .add(tool_output, priority="low", strategy="truncate")   # cut to fit
    .fit()
)

print(ctx.text)          # the assembled, in-budget context
print(ctx.used_tokens)   # e.g. 7840
print(ctx.dropped_names) # what didn't make the cut

Strategies

When an optional item doesn't fully fit, its strategy decides what happens:

Strategy	Behavior
`drop`	Include it whole, or not at all
`truncate`	Cut from the end, keep the head (default)
`truncate_head`	Cut from the start, keep the tail
`trim`	For lists (e.g. messages): drop oldest first

required items are always kept; if they alone blow the budget, you get a clear BudgetExceeded error instead of a silently mangled prompt.

The part I actually wanted: model-aware budgets + room to answer

Two recurring annoyances solved in one line:

from contextcram import Packer

# Budget pulled from the model; hold back 2k tokens for the reply
packer = Packer(model="gpt-4o", reserve=2000)
print(packer.full_budget)  # 128000
print(packer.budget)       # 126000  <- what you actually pack into

reserve= kills the classic "the prompt fit, but there's no room left for the model to answer" bug. Tie it to your max_tokens and you can't get it wrong.

Real-world: with LangChain

from contextcram import Packer
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o")
docs = [d.page_content for d in retriever.invoke(question)]
history = [f"{m.type}: {m.content}" for m in memory.messages]

ctx = (
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add(history, priority="high", strategy="trim")
    .add("\n\n".join(docs), priority="medium", strategy="drop")
    .fit()
)

response = llm.invoke([SystemMessage(ctx.text), HumanMessage(question)])

Need exact token counts? Pass tokenizer=tiktoken_tokenizer("gpt-4o"), or wrap any tokenizer (Hugging Face, llama.cpp) with a one-line CallableTokenizer. The default is a fast characters-per-token heuristic so there are no required dependencies.

How is this different from Priompt / Prompt Poet?

Honest answer: the concept isn't new. Priompt (and its Python port) and Character.AI's Prompt Poet do priority-based context assembly too — and they're more powerful (component models, cache-aware truncation, templating).

contextcram deliberately trades features for simplicity and zero dependencies:

Pure stdlib — no Jinja2, no YAML, no heavy SDK.
A 3-line API: Packer(...).add(...).fit().
Framework-agnostic — LangChain, LlamaIndex, raw SDKs, or nothing.

If you want the smallest possible helper that does one thing — fit prioritized pieces into a budget — this is it.

Try it

pip install contextcram

⭐ Repo: https://github.com/Waelr1985/contextcram
📦 PyPI: https://pypi.org/project/contextcram/

It's MIT, fully typed (mypy --strict), tested across Python 3.10–3.13. I'd genuinely love feedback on the API and the default strategies — open an issue or drop a comment.

DEV Community: Wael Rahhal

Compact LLM chat history without LangChain (zero dependencies)

Install

Use it

Why bring-your-own summarizer?

Pairs with contextcram

How is this different from LangChain memory / mem0?

Try it

Your LLM prompt doesn't fit? Pack it by priority (zero dependencies)

The idea

Strategies

The part I actually wanted: model-aware budgets + room to answer

Real-world: with LangChain

How is this different from Priompt / Prompt Poet?

Try it