Long conversations eventually overflow the model's context window. Both common fixes hurt: drop old turns and you lose context; keep everything and the request won't fit.
The middle ground is summarize the old turns, keep the recent ones verbatim — but today that usually means adopting LangChain's memory classes or a hosted "memory platform." If you just want the building block, that's a lot of baggage.
So I built chatcram — a tiny, zero-dependency library that does exactly that, on plain message dicts, with a summarizer you provide.
Install
pip install chatcram
Use it
from chatcram import Compactor
def summarize(transcript: str) -> str:
# your LLM call here — any str -> str function
return my_client.complete(f"Summarize this conversation:\n{transcript}")
compactor = Compactor(budget=4000, summarize=summarize, keep_recent=1500)
result = compactor.compact(messages) # list of {"role", "content"} dicts
messages = result.messages # ready to send to the model
print(result.summarized, result.used_tokens)
What comes back:
- system messages, kept verbatim, at the front
- one summary message for the older middle (via your summarizer)
- the most recent turns, kept verbatim (up to
keep_recenttokens)
If the history already fits budget, it's returned unchanged.
Why bring-your-own summarizer?
It keeps chatcram dependency-free and provider-agnostic. summarize is any str -> str callable — GPT, Claude, a local model, even a non-LLM heuristic. No hidden API calls, no lock-in.
Pairs with contextcram
chatcram handles the history; contextcram handles the whole prompt. Compact the conversation, then pack it (plus system prompt + retrieved docs) into your token budget:
from chatcram import Compactor
from contextcram import Packer
history = Compactor(budget=3000, summarize=summarize).compact(messages).messages
ctx = (
Packer(model="gpt-4o", reserve=1500)
.add(SYSTEM_PROMPT, priority="required")
.add([f"{m['role']}: {m['content']}" for m in history], priority="high", strategy="trim")
.add(retrieved_docs, priority="medium", strategy="drop")
.fit()
)
Two tiny, zero-dependency pieces that together cover most context-management needs — without a framework.
How is this different from LangChain memory / mem0?
Honest answer: the idea isn't new. LangChain's ConversationSummaryBufferMemory and platforms like mem0/Zep do this (and more). chatcram trades features for being standalone, dependency-free, and framework-agnostic — a building block, not a platform.
Try it
pip install chatcram
MIT, fully typed (mypy --strict), tested on Python 3.10–3.13. Feedback welcome — especially on the default keep/summarize split.
Top comments (0)