DEV Community

Becomer.net

How I built a zero-token memory layer for LLMs (and why it outperforms vector store approaches)

If you've built an AI chatbot or agent, you've hit the same problem: the LLM forgets everything between sessions. The standard solution is to stuff your conversation history into a vector store and retrieve relevant chunks before each call. It works — but it has a hidden cost.

The token problem nobody talks about

Every popular memory solution (mem0, Zep, LangChain's ConversationSummaryMemory) runs an LLM under the hood when you recall. That adds anywhere from 500 to 7,000 tokens per recall call, on top of your actual LLM call.

For a chatbot with 1,000 daily active users sending 10 messages each, with one recall per message, that's 10,000 recall calls × ~2,000 tokens = 20 million extra tokens per day. Before your LLM has said a single word.
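
To make that arithmetic easy to rerun with your own traffic numbers, here is a small sketch. The function name and the low/high scenarios are mine; the inputs are the figures quoted above (500 to 7,000 tokens per recall, with ~2,000 as the working estimate).

```python
def extra_tokens_per_day(users: int, messages_per_user: int, tokens_per_recall: int) -> int:
    """Tokens spent inside the memory layer per day, assuming one recall per message."""
    recall_calls = users * messages_per_user
    return recall_calls * tokens_per_recall


# Working estimate from the paragraph above: ~2,000 tokens per recall.
typical = extra_tokens_per_day(1_000, 10, 2_000)

# Bounds of the quoted 500 to 7,000 token range.
low = extra_tokens_per_day(1_000, 10, 500)
high = extra_tokens_per_day(1_000, 10, 7_000)

print(f"typical: {typical:,} tokens/day")
print(f"range:   {low:,} to {high:,} tokens/day")
```

Multiply the result by your provider's per-token price to turn it into a daily cost.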

The retrieval-only approach

I built BECOMER around a different idea: semantic retrieval using embeddings, no LLM inside the memory layer. Store → embed → index → retrieve. Your LLM receives the retrieved context and reasons over it — exactly what it's already doing.

from becomer import Client

mem = Client("bcm_your-api-key")

# Before your LLM call
context = mem.recall("what does this user prefer?", top_k=5)

# Inject into your system prompt (recall returns a list of strings)
context_block = "\n".join(context)
system_prompt = f"User context:\n{context_block}"

# After your LLM call
mem.store("User asked about Python decorators, found list comprehension more intuitive")
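
If it helps to see the store → embed → index → retrieve shape without any service behind it, here is a toy, self-contained sketch. It is not BECOMER's implementation: `ToyMemory` is a hypothetical class, and the "embedding" is just a word-count vector standing in for a real embedding model. The point is only that recall is a similarity lookup, with no LLM call anywhere in the path.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would use a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemory:
    """Hypothetical in-process memory: a flat list scanned on every recall."""

    def __init__(self) -> None:
        self._index: list[tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        # Embed once at write time and keep the vector next to the text.
        self._index.append((embed(text), text))

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        # Score every stored item against the query and keep the best matches.
        q = embed(query)
        scored = [(cosine(q, vec), text) for vec, text in self._index]
        scored = [(score, text) for score, text in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


mem = ToyMemory()
mem.store("User prefers TypeScript and dark mode")
mem.store("User asked about Python decorators")
mem.store("Rate limit is 100 requests per minute")

top = mem.recall("python decorators", top_k=1)
print(top)
```

A production system would swap the word counts for dense embeddings and the linear scan for an approximate nearest-neighbor index, but the token cost of recall stays at zero either way.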

Benchmark results

Tested against LongMemEval (n=500) — the academic standard for conversational memory:

| System    | Score | Tokens/recall |
|-----------|-------|---------------|
| BECOMER   | 94.4% | 0             |
| mem0      | 93.4% | ~6,787        |
| Hindsight | 91.4% | ~6,787        |

The honest caveat: on LoCoMo's multi-hop reasoning questions, mem0 scores 91.6% vs our 69.5%. Their system adds an LLM reasoning pass over retrieved results. We return the context; your LLM reasons. For most agent use cases, where you control the final LLM call, that reasoning pass happens in your own model, so I expect this gap to largely close.

Multi-tenant in two lines

For developers building apps with multiple end-users, pass a user_id:

# Each user gets a fully isolated namespace
mem_alice = Client("bcm_key", user_id="alice-123")
mem_alice.store("Alice prefers TypeScript and dark mode")

mem_bob = Client("bcm_key", user_id="bob-456")
mem_bob.recall("preferences")  # → [] — completely isolated

Isolation is enforced at the database layer, not just application code. One master key covers your entire user base.
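
For readers wondering what per-user scoping can look like in practice, here is a minimal sketch using SQLite from the standard library. This is an illustration of the general pattern, not how BECOMER implements it: `NamespacedStore`, the `memories` table, and the keyword `LIKE` match are all hypothetical. Every statement carries the `user_id` bound at construction time, so one user's handle cannot read another user's rows.

```python
import sqlite3


class NamespacedStore:
    """Hypothetical store where every SQL statement is scoped by user_id."""

    def __init__(self, conn: sqlite3.Connection, user_id: str) -> None:
        self._conn = conn
        self._user_id = user_id
        conn.execute(
            "CREATE TABLE IF NOT EXISTS memories "
            "(user_id TEXT NOT NULL, text TEXT NOT NULL)"
        )

    def store(self, text: str) -> None:
        # The namespace is written with every row.
        self._conn.execute(
            "INSERT INTO memories (user_id, text) VALUES (?, ?)",
            (self._user_id, text),
        )

    def recall(self, keyword: str) -> list[str]:
        # The WHERE clause always filters on the bound user_id.
        rows = self._conn.execute(
            "SELECT text FROM memories WHERE user_id = ? AND text LIKE ?",
            (self._user_id, f"%{keyword}%"),
        )
        return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
alice = NamespacedStore(conn, "alice-123")
bob = NamespacedStore(conn, "bob-456")

alice.store("Alice prefers TypeScript and dark mode")
alice_hits = alice.recall("prefers")
bob_hits = bob.recall("prefers")
print(alice_hits, bob_hits)
```

Stricter setups push the filter into the database itself, for example with PostgreSQL row-level security policies, so that a query missing the filter still returns nothing from other tenants.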

Agent use cases

The pattern that makes BECOMER useful beyond chatbots is shared namespaces for multi-agent systems:

# Research agent (GPT-4o) stores findings
mem = Client("bcm_key", user_id="task-abc")
mem.store("API endpoint: POST /v2/payments, OAuth2")
mem.store("Rate limit: 100 req/min")

# Executor agent (Claude) — different process, same namespace
ctx = Client("bcm_key", user_id="task-abc").recall("payment API details")
# → gets exactly what the research agent found
# No message passing. No state files. No coordination code.

Self-improving systems work the same way: store every attempt with its outcome, recall what worked before the next run.
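
Here is one way that loop could be sketched. The `Memory` protocol mirrors the `store` and `recall` calls used earlier in this post; everything else (`record_attempt`, `successful_approaches`, the `[SUCCESS]`/`[FAILURE]` tagging convention, and the `FakeMemory` stand-in that lets the example run offline) is hypothetical.

```python
from typing import Protocol


class Memory(Protocol):
    """Minimal interface matching the store/recall calls shown in this post."""

    def store(self, text: str) -> None: ...
    def recall(self, query: str, top_k: int = 5) -> list[str]: ...


def record_attempt(mem: Memory, task: str, approach: str, succeeded: bool) -> None:
    # Tag the outcome in the text itself so it survives retrieval.
    outcome = "SUCCESS" if succeeded else "FAILURE"
    mem.store(f"[{outcome}] task={task} approach={approach}")


def successful_approaches(mem: Memory, task: str, top_k: int = 5) -> list[str]:
    # Retrieval is similarity-based, so failures can come back too; filter them out.
    hits = mem.recall(f"successful approaches for {task}", top_k=top_k)
    return [hit for hit in hits if hit.startswith("[SUCCESS]")]


class FakeMemory:
    """In-memory stand-in (word overlap instead of embeddings) for offline runs."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def store(self, text: str) -> None:
        self._items.append(text)

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        words = set(query.lower().split())
        hits = [
            item for item in self._items
            if words & set(item.lower().replace("=", " ").split())
        ]
        return hits[:top_k]


mem = FakeMemory()
record_attempt(mem, "csv-import", "stream rows in batches", succeeded=True)
record_attempt(mem, "csv-import", "load whole file in memory", succeeded=False)

worked = successful_approaches(mem, "csv-import")
print(worked)
```

Because the helpers only depend on `store` and `recall`, swapping `FakeMemory` for a real client should not require changing them.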

What's available today

  • REST API
  • Python SDK: pip install becomer
  • JS/Node SDK: npm install @becomerpackage/sdk (zero deps, TypeScript types)
  • MCP: works with Claude Desktop and Cursor; set BECOMER_API_KEY and go
  • Framework adapters: LangChain, LlamaIndex, LangGraph, CrewAI, AutoGen

Free tier: 1,000 calls/month. Pro: $12/month.

https://becomer.net — full docs, benchmarks, and free API key.

I'm curious how others are handling the token cost problem for memory. What approaches have you found that work at scale?