How I built a zero-token memory layer for LLMs (and why it outperforms vector store approaches)

#llm #ai #python #memory

If you've built an AI chatbot or agent, you've hit the same problem: the LLM forgets everything between sessions. The standard solution is to stuff your conversation history into a vector store and retrieve relevant chunks before each call. It works — but it has a hidden cost.

The token problem nobody talks about

Every popular memory solution — mem0, Zep, Langchain ConversationSummaryMemory — runs an LLM under the hood when you recall. That's anywhere from 500 to 7,000 tokens per recall call, on top of your actual LLM call.

For a chatbot with 1,000 daily active users doing 10 messages each, that's 10,000 recall calls × ~2,000 tokens = 20 million extra tokens per day. Before your LLM has said a single word.

The retrieval-only approach

I built BECOMER around a different idea: semantic retrieval using embeddings, no LLM inside the memory layer. Store → embed → index → retrieve. Your LLM receives the retrieved context and reasons over it — exactly what it's already doing.

from becomer import Client

mem = Client("bcm_your-api-key")

# Before your LLM call
context = mem.recall("what does this user prefer?", top_k=5)

# Inject into your system prompt
system_prompt = f"User context:\n{chr(10).join(context)}"

# After your LLM call
mem.store("User asked about Python decorators, found list comprehension more intuitive")

Benchmark results

Tested against LongMemEval (n=500) — the academic standard for conversational memory:

System	Score	Tokens/recall
BECOMER	94.4%	0
mem0	93.4%	~6,787
Hindsight	91.4%	~6,787

The honest caveat: on LOCOMO's multi-hop reasoning questions, mem0 scores 91.6% vs our 69.5%. Their system adds an LLM reasoning pass over retrieved results. We return the context; your LLM reasons. For most agent use cases where you control the final LLM call, this gap disappears.

Multi-tenant in two lines

For developers building apps with multiple end-users, pass a user_id:

# Each user gets a fully isolated namespace
mem_alice = Client("bcm_key", user_id="alice-123")
mem_alice.store("Alice prefers TypeScript and dark mode")

mem_bob = Client("bcm_key", user_id="bob-456")
mem_bob.recall("preferences")  # → [] — completely isolated

Isolation is enforced at the database layer, not just application code. One master key covers your entire user base.

Agent use cases

The pattern that makes BECOMER useful beyond chatbots is shared namespaces for multi-agent systems:

# Research agent (GPT-4o) stores findings
mem = Client("bcm_key", user_id="task-abc")
mem.store("API endpoint: POST /v2/payments, OAuth2")
mem.store("Rate limit: 100 req/min")

# Executor agent (Claude) — different process, same namespace
ctx = Client("bcm_key", user_id="task-abc").recall("payment API details")
# → gets exactly what the research agent found
# No message passing. No state files. No coordination code.