If you've built an AI chatbot or agent, you've hit the same problem: the LLM forgets everything between sessions. The standard solution is to stuff your conversation history into a vector store and retrieve relevant chunks before each call. It works — but it has a hidden cost.
The token problem nobody talks about
Every popular memory solution (mem0, Zep, LangChain's ConversationSummaryMemory) runs an LLM under the hood when you recall. That's anywhere from 500 to 7,000 tokens per recall call, on top of your actual LLM call.
For a chatbot with 1,000 daily active users doing 10 messages each, that's 10,000 recall calls × ~2,000 tokens = 20 million extra tokens per day. Before your LLM has said a single word.
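To make that arithmetic easy to rerun with your own traffic, here is a small sketch. The user count, message count, and ~2,000-token average are the illustrative figures from the paragraph above; the low and high bounds reuse the 500 to 7,000 range quoted earlier. The variable names are mine, not part of any SDK.

```python
# Back-of-the-envelope estimate of memory-layer token overhead.
# All inputs are illustrative figures, not measurements.
daily_active_users = 1_000
messages_per_user = 10
tokens_per_recall = 2_000  # assumed average per recall call

recall_calls_per_day = daily_active_users * messages_per_user
extra_tokens_per_day = recall_calls_per_day * tokens_per_recall

# Same calculation at both ends of the quoted 500 to 7,000 range.
low_estimate = recall_calls_per_day * 500
high_estimate = recall_calls_per_day * 7_000

print(f"{recall_calls_per_day:,} recall calls/day")
print(f"{extra_tokens_per_day:,} extra tokens/day at the assumed average")
print(f"{low_estimate:,} to {high_estimate:,} extra tokens/day across the range")
```

Multiply the result by your provider's per-token price to turn it into a daily cost.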
The retrieval-only approach
I built BECOMER around a different idea: semantic retrieval using embeddings, no LLM inside the memory layer. Store → embed → index → retrieve. Your LLM receives the retrieved context and reasons over it — exactly what it's already doing.
from becomer import Client
mem = Client("bcm_your-api-key")
# Before your LLM call
context = mem.recall("what does this user prefer?", top_k=5)
# Inject into your system prompt
system_prompt = f"User context:\n{chr(10).join(context)}"
# After your LLM call
mem.store("User asked about Python decorators, found list comprehension more intuitive")
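If you want to see the store → embed → index → retrieve idea without any service involved, here is a toy sketch. It is not BECOMER's implementation: `ToyMemory`, `embed`, and `cosine` are hypothetical names, and the "embedding" is just a bag-of-words count vector standing in for a real embedding model. The point is only that recall is a similarity search with no LLM call inside it.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemory:
    """Hypothetical in-memory store: embed on write, rank by similarity on read."""

    def __init__(self) -> None:
        self._index: list[tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        # Embed once at write time and keep the vector next to the text.
        self._index.append((embed(text), text))

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        # Score every stored item against the query; no LLM is involved.
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), text) for vec, text in self._index]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


memory = ToyMemory()
memory.store("User prefers TypeScript and dark mode")
memory.store("User asked about Python decorators")
memory.store("Rate limit is 100 requests per minute")

results = memory.recall("python decorators", top_k=1)
print(results)
```

A real system would swap `embed` for a learned embedding model and the linear scan for a vector index, but the shape of the recall path stays the same.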
Benchmark results
Tested against LongMemEval (n=500), an academic benchmark for long-term conversational memory:
| System | Score | Tokens/recall |
|---|---|---|
| BECOMER | 94.4% | 0 |
| mem0 | 93.4% | ~6,787 |
| Hindsight | 91.4% | ~6,787 |
The honest caveat: on LoCoMo's multi-hop reasoning questions, mem0 scores 91.6% vs our 69.5%. Their system adds an LLM reasoning pass over retrieved results. We return the context; your LLM reasons. For most agent use cases where you control the final LLM call, your own model performs that reasoning pass, so this gap disappears.
Multi-tenant in two lines
For developers building apps with multiple end-users, pass a user_id:
# Each user gets a fully isolated namespace
mem_alice = Client("bcm_key", user_id="alice-123")
mem_alice.store("Alice prefers TypeScript and dark mode")
mem_bob = Client("bcm_key", user_id="bob-456")
mem_bob.recall("preferences") # → [] — completely isolated
Isolation is enforced at the database layer, not just application code. One master key covers your entire user base.
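The post doesn't go into how that isolation is implemented, so treat the following as an illustration of the general idea only: every row carries its owner's `user_id`, and every read is scoped to it. `NamespacedStore` and the `memories` table are hypothetical, and SQLite is used just to keep the sketch self-contained. A production setup would more likely rely on something like row-level security in the database itself.

```python
import sqlite3


class NamespacedStore:
    """Hypothetical store where every row is tagged with the owning user_id."""

    def __init__(self, conn: sqlite3.Connection, user_id: str) -> None:
        self._conn = conn
        self._user_id = user_id
        conn.execute(
            "CREATE TABLE IF NOT EXISTS memories ("
            "user_id TEXT NOT NULL, text TEXT NOT NULL)"
        )

    def store(self, text: str) -> None:
        # The namespace is fixed at construction; callers cannot override it.
        self._conn.execute(
            "INSERT INTO memories (user_id, text) VALUES (?, ?)",
            (self._user_id, text),
        )

    def list_memories(self) -> list[str]:
        # Every read is filtered by this instance's user_id.
        rows = self._conn.execute(
            "SELECT text FROM memories WHERE user_id = ? ORDER BY rowid",
            (self._user_id,),
        )
        return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
alice = NamespacedStore(conn, "alice-123")
bob = NamespacedStore(conn, "bob-456")

alice.store("Alice prefers TypeScript and dark mode")

alice_rows = alice.list_memories()
bob_rows = bob.list_memories()
print(alice_rows)
print(bob_rows)
```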
Agent use cases
The pattern that makes BECOMER useful beyond chatbots is shared namespaces for multi-agent systems:
# Research agent (GPT-4o) stores findings
mem = Client("bcm_key", user_id="task-abc")
mem.store("API endpoint: POST /v2/payments, OAuth2")
mem.store("Rate limit: 100 req/min")
# Executor agent (Claude) — different process, same namespace
ctx = Client("bcm_key", user_id="task-abc").recall("payment API details")
# → gets exactly what the research agent found
# No message passing. No state files. No coordination code.
Self-improving systems work the same way: store every attempt with its outcome, recall what worked before the next run.
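Here is a minimal sketch of that loop. `AttemptLog` is a hypothetical in-memory stand-in so the example runs without an API key; with a real memory client you would presumably encode the outcome in the stored text and recall it by query instead.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    strategy: str
    succeeded: bool


class AttemptLog:
    """Hypothetical in-memory stand-in for a memory client."""

    def __init__(self) -> None:
        self._attempts: list[Attempt] = []

    def store(self, strategy: str, succeeded: bool) -> None:
        # Record every attempt, including failures.
        self._attempts.append(Attempt(strategy, succeeded))

    def recall_successes(self) -> list[str]:
        """Return strategies that succeeded, most recent first, without duplicates."""
        seen: set[str] = set()
        ordered: list[str] = []
        for attempt in reversed(self._attempts):
            if attempt.succeeded and attempt.strategy not in seen:
                seen.add(attempt.strategy)
                ordered.append(attempt.strategy)
        return ordered


def choose_strategy(log: AttemptLog, default: str) -> str:
    """Reuse the latest strategy that worked, else fall back to the default."""
    successes = log.recall_successes()
    return successes[0] if successes else default


log = AttemptLog()

# First run: nothing has worked yet, so the default is used.
first_choice = choose_strategy(log, default="retry with backoff")
log.store("retry with backoff", succeeded=False)
log.store("batch requests", succeeded=True)

# Next run: the recalled success replaces the default.
next_choice = choose_strategy(log, default="retry with backoff")
print(first_choice, "->", next_choice)
```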
What's available today
- REST API
- Python SDK: `pip install becomer`
- JS/Node SDK: `npm install @becomerpackage/sdk` (zero deps, TypeScript types)
- MCP: works with Claude Desktop and Cursor, set `BECOMER_API_KEY` and go
- Framework adapters: LangChain, LlamaIndex, LangGraph, CrewAI, AutoGen
Free tier: 1,000 calls/month. Pro: $12/month.
https://becomer.net — full docs, benchmarks, and free API key.
I'm curious how others are handling the token cost problem for memory. What approaches have you found that work at scale?
