<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: aarjay singh</title>
    <description>The latest articles on DEV Community by aarjay singh (@aarjay_singh_0f76e7ca03bf).</description>
    <link>https://dev.to/aarjay_singh_0f76e7ca03bf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879574%2F6d71ff63-3bf0-4889-ba62-dc8a74183906.jpg</url>
      <title>DEV Community: aarjay singh</title>
      <link>https://dev.to/aarjay_singh_0f76e7ca03bf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aarjay_singh_0f76e7ca03bf"/>
    <language>en</language>
    <item>
      <title>Why I stopped putting LLMs in my agent memory retrieval path</title>
      <dc:creator>aarjay singh</dc:creator>
      <pubDate>Wed, 15 Apr 2026 03:03:41 +0000</pubDate>
      <link>https://dev.to/aarjay_singh_0f76e7ca03bf/why-i-stopped-putting-llms-in-my-agent-memory-retrieval-path-4bia</link>
      <guid>https://dev.to/aarjay_singh_0f76e7ca03bf/why-i-stopped-putting-llms-in-my-agent-memory-retrieval-path-4bia</guid>
<description>&lt;p&gt;Every agent pipeline I've touched in the last eighteen months reinvents memory, and most do it badly.&lt;/p&gt;

&lt;p&gt;Planner decisions never reach the executor. Giant prompts get passed between agents as "context." Tokens burn on stale data. An LLM call sits in the retrieval path, so the same query returns different ranked results on different runs — which makes the system impossible to reason about and impossible to unit-test.&lt;/p&gt;

&lt;p&gt;The fix, once I'd seen it happen enough times, is boring: treat memory as &lt;strong&gt;infrastructure&lt;/strong&gt;, not as prompt engineering.&lt;/p&gt;

&lt;p&gt;That's what Memwright is.&lt;/p&gt;

&lt;h2&gt;The problem with LLM-in-the-loop retrieval&lt;/h2&gt;

&lt;p&gt;A lot of "agentic memory" libraries shell out to an LLM during recall — to rewrite the query, to re-rank results, to summarize retrieved chunks. That sounds smart. In production, it's a liability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-determinism.&lt;/strong&gt; Same inputs, different outputs. Debugging a misranked memory becomes an archaeology expedition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; You just added 500–2000ms to every recall, in the critical path of every agent step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost.&lt;/strong&gt; Every retrieval is a paid call. In a planner/executor loop that fires 50 times, that's 50× the token bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Untestability.&lt;/strong&gt; You can't write a unit test that says "given these ten memories and this query, the top three results should be X, Y, Z."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So Memwright's first rule: &lt;strong&gt;zero LLM in the critical path.&lt;/strong&gt; Embeddings are computed once at write time, locally, using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;. Retrieval is pure math and graph traversal.&lt;/p&gt;
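&lt;p&gt;The idea is easy to sketch outside the library. Vectors are computed once at write time, and recall is plain cosine arithmetic, so the same query always produces the same ranking. This is a toy illustration, not Memwright's internals: the tiny fixed-vocabulary &lt;code&gt;embed&lt;/code&gt; below is a stand-in for the real 384-D model.&lt;/p&gt;

```python
import math

# Stand-in embedder: a fixed-vocabulary word-count vector. The real system
# uses all-MiniLM-L6-v2 (384-D); this keeps the sketch dependency-free.
VOCAB = ["rust", "hot", "path", "planner", "staging", "deployed"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

store: list[tuple[str, list[float]]] = []

def add(text: str) -> None:
    store.append((text, embed(text)))  # embedding happens ONCE, at write time

def recall(query: str, k: int = 3) -> list[str]:
    q = embed(query)  # no LLM call anywhere in this path
    ranked = sorted(store, key=lambda m: cosine(q, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

add("planner chose rust for the hot path")
add("executor deployed to staging")
```

&lt;p&gt;Because nothing stochastic runs at recall time, a fixture-plus-assertion unit test over the ranking is straightforward.&lt;/p&gt;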

&lt;h2&gt;The 5-layer pipeline&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query
  ↓
[1] Tag Match      — SQLite FTS, exact + fuzzy token hits
  ↓
[2] Graph Expansion — NetworkX BFS, depth 2 from matched entities
  ↓
[3] Vector Search  — ChromaDB cosine similarity on 384-D embeddings
  ↓
[4] Fusion + Rank  — Reciprocal Rank Fusion (k=60) + PageRank + confidence decay
  ↓
[5] Diversity      — MMR (λ=0.7) + greedy token-budget pack
  ↓
Top-K memories, fits your prompt budget, deterministic, unit-testable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
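&lt;p&gt;Layer 4's fusion step is worth seeing concretely. This is a generic Reciprocal Rank Fusion sketch (not Memwright's source) using the k=60 constant from the diagram: each upstream ranking votes 1/(k + rank) for its members, so an item that sits high in any list floats up, and the output is deterministic.&lt;/p&gt;

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking contributes 1/(k + rank) per item (rank is 1-based);
    # summed scores are sorted descending, with the id as a stable tiebreak.
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda m: (-scores[m], m))

tag_hits    = ["m3", "m1", "m7"]  # layer 1 output
graph_hits  = ["m1", "m3"]        # layer 2 output
vector_hits = ["m1", "m9", "m3"]  # layer 3 output
fused = rrf_fuse([tag_hits, graph_hits, vector_hits])
```

&lt;p&gt;Here m1 wins despite m3 topping the tag list, because two of the three layers rank m1 first.&lt;/p&gt;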



&lt;p&gt;Each layer is a pure function. You can unit-test the whole thing by feeding it fixtures and asserting on rankings. I did. The test suite is 607 cases and runs without Docker or API keys.&lt;/p&gt;
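&lt;p&gt;Layer 5's diversity pass can be sketched the same way. This is textbook Maximal Marginal Relevance with the λ=0.7 from the diagram, not the library's code: each pick trades relevance to the query against similarity to what's already selected, so two near-duplicate memories don't both spend your token budget. The relevance and similarity numbers below are made up for illustration.&lt;/p&gt;

```python
def mmr(candidates, relevance, sim, lam=0.7, k=2):
    # Greedy MMR: repeatedly pick argmax of lam * relevance - (1 - lam) * redundancy,
    # where redundancy is the max similarity to anything already selected.
    selected: list[str] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

rel = {"a": 0.90, "b": 0.50, "c": 0.30}   # relevance to the query
pairs = {frozenset({"a", "b"}): 0.98,     # a and b are near-duplicates
         frozenset({"a", "c"}): 0.05,
         frozenset({"b", "c"}): 0.10}
sim = lambda x, y: pairs[frozenset({x, y})]
picked = mmr(["a", "b", "c"], rel, sim)
```

&lt;p&gt;Pure relevance would return a and b; MMR swaps in c because b is nearly a duplicate of a.&lt;/p&gt;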

&lt;h2&gt;Multi-agent primitives are first-class&lt;/h2&gt;

&lt;p&gt;Most memory libraries treat the agent as a singleton. Real pipelines have an orchestrator coordinating a planner that dispatches to executors that report to reviewers. They should not all share a single memory pool.&lt;/p&gt;

&lt;p&gt;Memwright models this directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Six RBAC roles&lt;/strong&gt;: &lt;code&gt;ORCHESTRATOR&lt;/code&gt;, &lt;code&gt;PLANNER&lt;/code&gt;, &lt;code&gt;EXECUTOR&lt;/code&gt;, &lt;code&gt;RESEARCHER&lt;/code&gt;, &lt;code&gt;REVIEWER&lt;/code&gt;, &lt;code&gt;MONITOR&lt;/code&gt;. Each has different read/write permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Namespace isolation&lt;/strong&gt; enforced at the row level, not in application code. A tenant column is on every table, every query filters on it, no escape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provenance chain&lt;/strong&gt;: every memory carries &lt;code&gt;source_id&lt;/code&gt;, content hash, ingest timestamp, and the agent role that wrote it. You can reconstruct who told the system what.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-agent token budgets and write quotas&lt;/strong&gt;. A runaway executor cannot fill the memory with junk.&lt;/li&gt;
&lt;/ul&gt;
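&lt;p&gt;The write-gating idea reduces to a small permission check before anything reaches storage. The role names come from the library; the scope matrix below is a hypothetical example for illustration, not Memwright's actual policy.&lt;/p&gt;

```python
from enum import Enum, auto

class Role(Enum):
    ORCHESTRATOR = auto()
    PLANNER = auto()
    EXECUTOR = auto()
    RESEARCHER = auto()
    REVIEWER = auto()
    MONITOR = auto()

# Hypothetical scope matrix -- illustrative only; the real mapping
# is whatever your deployment configures.
WRITE_SCOPES = {
    Role.ORCHESTRATOR: {"decisions", "tasks", "observations"},
    Role.PLANNER:      {"decisions", "tasks"},
    Role.EXECUTOR:     {"observations"},
    Role.RESEARCHER:   {"observations"},
    Role.REVIEWER:     {"reviews"},
    Role.MONITOR:      set(),  # read-only
}

def can_write(role: Role, scope: str) -> bool:
    # Checked before every write; pair this with a per-role write quota
    # to stop a runaway executor from flooding the store.
    return scope in WRITE_SCOPES[role]
```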

&lt;h2&gt;Temporal correctness&lt;/h2&gt;

&lt;p&gt;Memwright never overwrites. It &lt;strong&gt;supersedes&lt;/strong&gt;. Every fact has a validity window (&lt;code&gt;valid_from&lt;/code&gt;, &lt;code&gt;valid_to&lt;/code&gt;). Newer contradicting facts don't delete older ones — they close them. &lt;code&gt;recall(as_of=...)&lt;/code&gt; replays the past.&lt;/p&gt;

&lt;p&gt;This matters for audit ("what did the desk know on March 12?"), for debugging ("why did the planner make that call yesterday?"), and for any regulated domain.&lt;/p&gt;

&lt;h2&gt;Same API, six backends&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentMemory&lt;/span&gt;
&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# SQLite + ChromaDB + NetworkX, zero config
&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Planner decided to use Rust for the hot path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what did we pick for the hot path?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the local mode. Identical code runs against:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres (pgvector + Apache AGE)&lt;/li&gt;
&lt;li&gt;ArangoDB (doc + vector + graph in one engine)&lt;/li&gt;
&lt;li&gt;AWS ECS + ArangoDB&lt;/li&gt;
&lt;li&gt;Azure Cosmos DB DiskANN&lt;/li&gt;
&lt;li&gt;GCP AlloyDB (pgvector + ScaNN + AGE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reference Terraform ships in the repo. Swap backends with one config line.&lt;/p&gt;
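&lt;p&gt;One common way to implement that kind of swap is scheme-sniffing on the store target; the sketch below shows the pattern generically and is not a quote of Memwright's config syntax, so check the repo for the real form.&lt;/p&gt;

```python
def pick_backend(target: str) -> str:
    # Generic scheme-based dispatch: the store target's URL scheme selects
    # the backend, and a bare path falls through to the local stack.
    schemes = {
        "postgresql://": "postgres (pgvector + Apache AGE)",
        "arangodb://":   "arangodb",
        "cosmosdb://":   "cosmos-db-diskann",
        "alloydb://":    "alloydb (pgvector + ScaNN + AGE)",
    }
    for prefix, backend in schemes.items():
        if target.startswith(prefix):
            return backend
    return "local (SQLite + ChromaDB + NetworkX)"
```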

&lt;h2&gt;Benchmarks&lt;/h2&gt;

&lt;p&gt;LOCOMO v2 (long-context memory benchmark): &lt;strong&gt;81.2%&lt;/strong&gt;. For reference: OpenAI memory 52.9%, Mem0 66.9%, Letta 74%, Zep ~75%, MemMachine 84.9%. Honestly competitive, not state-of-the-art, and I know exactly why the gap exists (embedding model — next release bumps it).&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;memwright
memwright api &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Claude Code users: type &lt;code&gt;install agent memory&lt;/code&gt; in Claude Code and the MCP server interviews you, installs itself, wires hooks, runs a health check.&lt;/p&gt;

&lt;p&gt;MIT. Repo: &lt;a href="https://github.com/bolnet/agent-memory" rel="noopener noreferrer"&gt;github.com/bolnet/agent-memory&lt;/a&gt;. Discussion on HN today: &lt;a href="https://news.ycombinator.com/item?id=47773981" rel="noopener noreferrer"&gt;news.ycombinator.com/item?id=47773981&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'd love to hear what breaks for you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
