<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Becomer.net</title>
    <description>The latest articles on DEV Community by Becomer.net (@becomernet).</description>
    <link>https://dev.to/becomernet</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3962854%2F7c35a451-535f-4e02-b2fe-88d9e3214a6b.png</url>
      <title>DEV Community: Becomer.net</title>
      <link>https://dev.to/becomernet</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/becomernet"/>
    <language>en</language>
    <item>
      <title>How I built a zero-token memory layer for LLMs (and why it outperforms vector store approaches)</title>
      <dc:creator>Becomer.net</dc:creator>
      <pubDate>Mon, 01 Jun 2026 14:35:50 +0000</pubDate>
      <link>https://dev.to/becomernet/how-i-built-a-zero-token-memory-layer-for-llms-and-why-it-outperforms-vector-store-approaches-3n0g</link>
      <guid>https://dev.to/becomernet/how-i-built-a-zero-token-memory-layer-for-llms-and-why-it-outperforms-vector-store-approaches-3n0g</guid>
      <description>&lt;p&gt;If you've built an AI chatbot or agent, you've hit the same problem: the LLM forgets everything between sessions. The standard solution is to stuff your conversation history into a vector store and retrieve relevant chunks before each call. It works — but it has a hidden cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The token problem nobody talks about&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every popular memory solution — mem0, Zep, Langchain ConversationSummaryMemory — runs an LLM under the hood when you recall. That's anywhere from 500 to 7,000 tokens per recall call, on top of your actual LLM call.&lt;/p&gt;

&lt;p&gt;For a chatbot with 1,000 daily active users doing 10 messages each, that's 10,000 recall calls × ~2,000 tokens = 20 million extra tokens per day. Before your LLM has said a single word.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The retrieval-only approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built BECOMER around a different idea: semantic retrieval using embeddings, no LLM inside the memory layer. Store → embed → index → retrieve. Your LLM receives the retrieved context and reasons over it — exactly what it's already doing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;becomer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bcm_your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Before your LLM call
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what does this user prefer?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inject into your system prompt
&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;chr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# After your LLM call
&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User asked about Python decorators, found list comprehension more intuitive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benchmark results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tested against LongMemEval (n=500) — the academic standard for conversational memory:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Tokens/recall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BECOMER&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mem0&lt;/td&gt;
&lt;td&gt;93.4%&lt;/td&gt;
&lt;td&gt;~6,787&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hindsight&lt;/td&gt;
&lt;td&gt;91.4%&lt;/td&gt;
&lt;td&gt;~6,787&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest caveat: on LOCOMO's multi-hop reasoning questions, mem0 scores 91.6% vs our 69.5%. Their system adds an LLM reasoning pass over retrieved results. We return the context; your LLM reasons. For most agent use cases where you control the final LLM call, this gap disappears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tenant in two lines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For developers building apps with multiple end-users, pass a &lt;code&gt;user_id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Each user gets a fully isolated namespace
&lt;/span&gt;&lt;span class="n"&gt;mem_alice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bcm_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mem_alice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice prefers TypeScript and dark mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mem_bob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bcm_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bob-456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mem_bob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preferences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# → [] — completely isolated
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Isolation is enforced at the database layer, not just application code. One master key covers your entire user base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent use cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pattern that makes BECOMER useful beyond chatbots is shared namespaces for multi-agent systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Research agent (GPT-4o) stores findings
&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bcm_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task-abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API endpoint: POST /v2/payments, OAuth2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate limit: 100 req/min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Executor agent (Claude) — different process, same namespace
&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bcm_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task-abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment API details&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → gets exactly what the research agent found
# No message passing. No state files. No coordination code.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Self-improving systems work the same way: store every attempt with its outcome, recall what worked before the next run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's available today&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;REST API&lt;/li&gt;
&lt;li&gt;Python SDK: &lt;code&gt;pip install becomer&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;JS/Node SDK: &lt;code&gt;npm install @becomerpackage/sdk&lt;/code&gt; (zero deps, TypeScript types)&lt;/li&gt;
&lt;li&gt;MCP: works with Claude Desktop and Cursor, set &lt;code&gt;BECOMER_API_KEY&lt;/code&gt; and go&lt;/li&gt;
&lt;li&gt;Framework adapters: LangChain, LlamaIndex, LangGraph, CrewAI, AutoGen&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Free tier: 1,000 calls/month. Pro: $12/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://becomer.net" rel="noopener noreferrer"&gt;https://becomer.net&lt;/a&gt;&lt;/strong&gt; — full docs, benchmarks, and free API key.&lt;/p&gt;

&lt;p&gt;I'm curious how others are handling the token cost problem for memory. What approaches have you found that work at scale?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqxlakqq31x4bxy0nv1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqxlakqq31x4bxy0nv1b.png" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>python</category>
      <category>memory</category>
    </item>
  </channel>
</rss>
