Compact LLM chat history without LangChain (zero dependencies)

#ai #opensource #llm #python

Long conversations eventually overflow the model's context window. Both common fixes hurt: drop old turns and you lose context; keep everything and the request won't fit.

The middle ground is summarize the old turns, keep the recent ones verbatim — but today that usually means adopting LangChain's memory classes or a hosted "memory platform." If you just want the building block, that's a lot of baggage.

So I built chatcram — a tiny, zero-dependency library that does exactly that, on plain message dicts, with a summarizer you provide.

Install

pip install chatcram

Use it

from chatcram import Compactor

def summarize(transcript: str) -> str:
    # your LLM call here — any str -> str function
    return my_client.complete(f"Summarize this conversation:\n{transcript}")

compactor = Compactor(budget=4000, summarize=summarize, keep_recent=1500)

result = compactor.compact(messages)   # list of {"role", "content"} dicts
messages = result.messages             # ready to send to the model
print(result.summarized, result.used_tokens)

What comes back:

system messages, kept verbatim, at the front
one summary message for the older middle (via your summarizer)
the most recent turns, kept verbatim (up to keep_recent tokens)

If the history already fits budget, it's returned unchanged.

Why bring-your-own summarizer?

It keeps chatcram dependency-free and provider-agnostic. summarize is any str -> str callable — GPT, Claude, a local model, even a non-LLM heuristic. No hidden API calls, no lock-in.

Pairs with contextcram

chatcram handles the history; contextcram handles the whole prompt. Compact the conversation, then pack it (plus system prompt + retrieved docs) into your token budget:

from chatcram import Compactor
from contextcram import Packer

history = Compactor(budget=3000, summarize=summarize).compact(messages).messages

ctx = (
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add([f"{m['role']}: {m['content']}" for m in history], priority="high", strategy="trim")
    .add(retrieved_docs, priority="medium", strategy="drop")
    .fit()
)

Two tiny, zero-dependency pieces that together cover most context-management needs — without a framework.

How is this different from LangChain memory / mem0?

Honest answer: the idea isn't new. LangChain's ConversationSummaryBufferMemory and platforms like mem0/Zep do this (and more). chatcram trades features for being standalone, dependency-free, and framework-agnostic — a building block, not a platform.