<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yevhenii</title>
    <description>The latest articles on DEV Community by Yevhenii (@ggqandv).</description>
    <link>https://dev.to/ggqandv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3882986%2F27bfa994-55bf-47ae-9e38-11cb0aaebee7.jpeg</url>
      <title>DEV Community: Yevhenii</title>
      <link>https://dev.to/ggqandv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ggqandv"/>
    <language>en</language>
    <item>
      <title>I built a local memory layer for LLM agents – here's why and how</title>
      <dc:creator>Yevhenii</dc:creator>
      <pubDate>Thu, 16 Apr 2026 18:31:54 +0000</pubDate>
      <link>https://dev.to/ggqandv/i-built-a-local-memory-layer-for-llm-agents-heres-why-and-how-105d</link>
      <guid>https://dev.to/ggqandv/i-built-a-local-memory-layer-for-llm-agents-heres-why-and-how-105d</guid>
      <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;LLM agents are brilliant in the moment and amnesiac by design.&lt;br&gt;
You explain your stack, your constraints, your decisions — then open a new chat and do it all again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mnemostroma&lt;/strong&gt; is my attempt to fix that without changing how you work.&lt;/p&gt;

&lt;p&gt;It's a local daemon that sits between you and your agents. It watches the conversation I/O silently, decides what's worth keeping, compresses it into structured memory, and surfaces it back when it's relevant. You never call "save". You never write a prompt to recall something. The agent just... knows.&lt;/p&gt;

&lt;p&gt;What's unusual about the design:&lt;br&gt;
The agent only reads memory — it never writes it. All observation, classification, and storage happen in a separate pipeline running in the background. This turned out to be a surprisingly important constraint: it means the memory layer is completely decoupled from the agent's behavior and can't be "confused" by the model into storing garbage.&lt;/p&gt;

&lt;p&gt;Under the hood:&lt;br&gt;
Dual-stream async pipeline (Observer + Content), RAM-first index, SQLite WAL persistence. Five memory layers with gradual decay — important decisions stay, low-value noise fades. Semantic retrieval via numpy matmul over ONNX INT8 embeddings, ~20 ms. No torch. No transformers. No cloud. No Docker. ~420 MB RAM baseline.&lt;/p&gt;
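
&lt;p&gt;To make the retrieval claim concrete, here is a hedged sketch of what "numpy matmul over ONNX INT8 embeddings" can look like. The names and shapes (embed_dim, index, top_k) are illustrative assumptions, not Mnemostroma's actual internals:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 384  # a typical small-embedding width; an assumption, not confirmed

# RAM-first index: one int8-quantized embedding row per stored memory
index = rng.integers(-127, 128, size=(1000, embed_dim)).astype(np.int8)

def top_k(query_i8, index_i8, k=5):
    # Upcast to int32 so the dot products cannot overflow int8
    scores = index_i8.astype(np.int32) @ query_i8.astype(np.int32)
    # Normalize by vector norms to approximate cosine similarity
    norms = np.linalg.norm(index_i8.astype(np.float32), axis=1)
    qnorm = np.linalg.norm(query_i8.astype(np.float32))
    scores = scores / (norms * qnorm + 1e-9)
    # argpartition keeps the scan linear; only the top k get fully sorted
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

query = rng.integers(-127, 128, size=embed_dim).astype(np.int8)
hits = top_k(query, index)
```

&lt;p&gt;A brute-force matmul over a few thousand rows comfortably fits a ~20 ms budget on a laptop CPU, which plausibly explains why no approximate-nearest-neighbor index is needed here.&lt;/p&gt;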

&lt;p&gt;Try it today:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install "git+https://github.com/GG-QandV/mnemostroma.git"
mnemostroma setup   # downloads ~300 MB ONNX models, generates TLS cert
mnemostroma on
mnemostroma status&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Connects to Claude Desktop, Claude Code, Cursor, Windsurf, Zed and anything else that speaks MCP. There's also a passthrough proxy mode for Claude Code — you launch your IDE through a wrapper, the Observer starts capturing without touching your workflow.&lt;/p&gt;
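
&lt;p&gt;For MCP clients that use a JSON config (Claude Desktop's claude_desktop_config.json, for example), registration would look roughly like this. The command and args below are placeholders I'm assuming; check the repo's README for the exact invocation:&lt;/p&gt;

```json
{
  "mcpServers": {
    "mnemostroma": {
      "command": "mnemostroma",
      "args": ["mcp"]
    }
  }
}
```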

&lt;p&gt;Status: v1.8.1 beta. 400+ tests passing. Not on PyPI yet (git install only). API surface is stabilizing; breaking changes are unlikely but possible.&lt;/p&gt;

&lt;p&gt;Privacy: everything lives in ~/.mnemostroma as plain SQLite. Local-only logging subsystem for latency/diagnostics — can be disabled or wiped anytime. Nothing leaves your machine.&lt;/p&gt;
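
&lt;p&gt;Since the store is plain SQLite, you can audit it with nothing but the Python standard library. A minimal sketch, assuming only that ~/.mnemostroma holds one or more SQLite files (the filename below is a placeholder, not documented by the project):&lt;/p&gt;

```python
import os
import sqlite3

def list_tables(db_path):
    """Read-only audit helper: enumerate table names via sqlite_master."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
        return [row[0] for row in rows]
    finally:
        con.close()

# Placeholder path and filename; point this at whatever ~/.mnemostroma contains.
memory_db = os.path.expanduser("~/.mnemostroma/memory.db")
```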

&lt;p&gt;A few things I'm genuinely unsure about and would love input on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ~420 MB (400–650 MB) RAM footprint for a background daemon — dealbreaker for you, or fine?&lt;/li&gt;
&lt;li&gt;The "agent reads, Observer writes" split — does this feel right, or would you want the agent to be able to annotate its own memory?&lt;/li&gt;
&lt;li&gt;Which integration matters most to you: VS Code, Cursor, a standalone CLI, something else?&lt;/li&gt;
&lt;li&gt;What's your biggest fear about persistent agent memory — wrong recalls? Stale decisions? Privacy?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm in the thread. Happy to go deep on architecture, share internals, or hear "this is over-engineered and here's why."&lt;/p&gt;

&lt;p&gt;If you run it and something breaks — tell me. There's detailed local telemetry and I'd rather tune against real usage than synthetic tests.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/GG-QandV/mnemostroma" rel="noopener noreferrer"&gt;https://github.com/GG-QandV/mnemostroma&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>mnemostroma</category>
    </item>
  </channel>
</rss>
