DEV Community

Cover image for Compact LLM chat history without LangChain (zero dependencies)
Wael Rahhal
Wael Rahhal

Posted on

Compact LLM chat history without LangChain (zero dependencies)

Long conversations eventually overflow the model's context window. Both common fixes hurt: drop old turns and you lose context; keep everything and the request won't fit.

The middle ground is summarize the old turns, keep the recent ones verbatim — but today that usually means adopting LangChain's memory classes or a hosted "memory platform." If you just want the building block, that's a lot of baggage.

So I built chatcram — a tiny, zero-dependency library that does exactly that, on plain message dicts, with a summarizer you provide.

Install

pip install chatcram
Enter fullscreen mode Exit fullscreen mode

Use it

from chatcram import Compactor

def summarize(transcript: str) -> str:
    # your LLM call here — any str -> str function
    return my_client.complete(f"Summarize this conversation:\n{transcript}")

compactor = Compactor(budget=4000, summarize=summarize, keep_recent=1500)

result = compactor.compact(messages)   # list of {"role", "content"} dicts
messages = result.messages             # ready to send to the model
print(result.summarized, result.used_tokens)
Enter fullscreen mode Exit fullscreen mode

What comes back:

  • system messages, kept verbatim, at the front
  • one summary message for the older middle (via your summarizer)
  • the most recent turns, kept verbatim (up to keep_recent tokens)

If the history already fits budget, it's returned unchanged.

Why bring-your-own summarizer?

It keeps chatcram dependency-free and provider-agnostic. summarize is any str -> str callable — GPT, Claude, a local model, even a non-LLM heuristic. No hidden API calls, no lock-in.

Pairs with contextcram

chatcram handles the history; contextcram handles the whole prompt. Compact the conversation, then pack it (plus system prompt + retrieved docs) into your token budget:

from chatcram import Compactor
from contextcram import Packer

history = Compactor(budget=3000, summarize=summarize).compact(messages).messages

ctx = (
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add([f"{m['role']}: {m['content']}" for m in history], priority="high", strategy="trim")
    .add(retrieved_docs, priority="medium", strategy="drop")
    .fit()
)
Enter fullscreen mode Exit fullscreen mode

Two tiny, zero-dependency pieces that together cover most context-management needs — without a framework.

How is this different from LangChain memory / mem0?

Honest answer: the idea isn't new. LangChain's ConversationSummaryBufferMemory and platforms like mem0/Zep do this (and more). chatcram trades features for being standalone, dependency-free, and framework-agnostic — a building block, not a platform.

Try it

pip install chatcram
Enter fullscreen mode Exit fullscreen mode

MIT, fully typed (mypy --strict), tested on Python 3.10–3.13. Feedback welcome — especially on the default keep/summarize split.

Top comments (0)