<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yevhenii</title>
    <description>The latest articles on DEV Community by Yevhenii (@ggqandv).</description>
    <link>https://dev.to/ggqandv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3882986%2F27bfa994-55bf-47ae-9e38-11cb0aaebee7.jpeg</url>
      <title>DEV Community: Yevhenii</title>
      <link>https://dev.to/ggqandv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ggqandv"/>
    <language>en</language>
    <item>
      <title>I built a local memory layer for LLM agents – here's why and how</title>
      <dc:creator>Yevhenii</dc:creator>
      <pubDate>Thu, 16 Apr 2026 18:31:54 +0000</pubDate>
      <link>https://dev.to/ggqandv/i-built-a-local-memory-layer-for-llm-agents-heres-why-and-how-105d</link>
      <guid>https://dev.to/ggqandv/i-built-a-local-memory-layer-for-llm-agents-heres-why-and-how-105d</guid>
      <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;LLM agents are brilliant in the moment and amnesiac by design.&lt;br&gt;
You explain your stack, your constraints, your decisions — then open a new chat and do it all again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mnemostroma&lt;/strong&gt; is my attempt to fix that without changing how you work.&lt;/p&gt;

&lt;p&gt;It's a local daemon that sits between you and your agents. It watches the conversation I/O silently, decides what's worth keeping, compresses it into structured memory, and surfaces it back when it's relevant. You never call "save". You never write a prompt to recall something. The agent just... knows.&lt;/p&gt;

&lt;p&gt;What's unusual about the design:&lt;br&gt;
The agent only reads memory — it never writes it. All observation, classification, and storage happen in a separate pipeline running in the background. This turned out to be a surprisingly important constraint: it means the memory layer is completely decoupled from the agent's behavior and can't be "confused" by the model into storing garbage.&lt;/p&gt;

&lt;p&gt;Under the hood:&lt;br&gt;
Dual-stream async pipeline (Observer + Content), RAM-first index, SQLite WAL persistence. Five memory layers with gradual decay — important decisions stay, low-value noise fades. Semantic retrieval via numpy matmul over ONNX INT8 embeddings, ~20 ms. No torch. No transformers. No cloud. No Docker. ~420 MB RAM baseline.&lt;/p&gt;
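
&lt;p&gt;To make the retrieval claim concrete, here is a hedged sketch of what "numpy matmul over ONNX INT8 embeddings" can look like. The names and shapes (embed_dim, index, top_k) are illustrative assumptions, not Mnemostroma's actual internals:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 384  # a typical small-embedding width; an assumption, not confirmed

# RAM-first index: one int8-quantized embedding row per stored memory
index = rng.integers(-127, 128, size=(1000, embed_dim)).astype(np.int8)

def top_k(query_i8, index_i8, k=5):
    # Upcast to int32 so the dot products cannot overflow int8
    scores = index_i8.astype(np.int32) @ query_i8.astype(np.int32)
    # Normalize by vector norms to approximate cosine similarity
    norms = np.linalg.norm(index_i8.astype(np.float32), axis=1)
    qnorm = np.linalg.norm(query_i8.astype(np.float32))
    scores = scores / (norms * qnorm + 1e-9)
    # argpartition keeps the scan linear; only the top k get fully sorted
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

query = rng.integers(-127, 128, size=embed_dim).astype(np.int8)
hits = top_k(query, index)
```

&lt;p&gt;A brute-force matmul over a few thousand rows comfortably fits a ~20 ms budget on a laptop CPU, which plausibly explains why no approximate-nearest-neighbor index is needed here.&lt;/p&gt;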

&lt;p&gt;Try it today:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install "git+https://github.com/GG-QandV/mnemostroma.git"
mnemostroma setup   # downloads ~300 MB ONNX models, generates TLS cert
mnemostroma on
mnemostroma status&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Connects to Claude Desktop, Claude Code, Cursor, Windsurf, Zed and anything else that speaks MCP. There's also a passthrough proxy mode for Claude Code — you launch your IDE through a wrapper, the Observer starts capturing without touching your workflow.&lt;/p&gt;
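
&lt;p&gt;For MCP clients that use a JSON config (Claude Desktop's claude_desktop_config.json, for example), registration would look roughly like this. The command and args below are placeholders I'm assuming; check the repo's README for the exact invocation:&lt;/p&gt;

```json
{
  "mcpServers": {
    "mnemostroma": {
      "command": "mnemostroma",
      "args": ["mcp"]
    }
  }
}
```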

&lt;p&gt;Status: v1.8.1 beta. 400+ tests passing. Not on PyPI yet (git install only). API surface is stabilizing; breaking changes are unlikely but possible.&lt;/p&gt;

&lt;p&gt;Privacy: everything lives in ~/.mnemostroma as plain SQLite. Local-only logging subsystem for latency/diagnostics — can be disabled or wiped anytime. Nothing leaves your machine.&lt;/p&gt;
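
&lt;p&gt;Since the store is plain SQLite, you can audit it with nothing but the Python standard library. A minimal sketch, assuming only that ~/.mnemostroma holds one or more SQLite files (the filename below is a placeholder, not documented by the project):&lt;/p&gt;

```python
import os
import sqlite3

def list_tables(db_path):
    """Read-only audit helper: enumerate table names via sqlite_master."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
        return [row[0] for row in rows]
    finally:
        con.close()

# Placeholder path and filename; point this at whatever ~/.mnemostroma contains.
memory_db = os.path.expanduser("~/.mnemostroma/memory.db")
```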

&lt;p&gt;A few things I'm genuinely unsure about and would love input on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ~420 MB (400–650 MB) RAM footprint for a background daemon — dealbreaker for you, or fine?&lt;/li&gt;
&lt;li&gt;The "agent reads, Observer writes" split — does this feel right, or would you want the agent to be able to annotate its own memory?&lt;/li&gt;
&lt;li&gt;Which integration matters most to you: VS Code, Cursor, a standalone CLI, something else?&lt;/li&gt;
&lt;li&gt;What's your biggest fear about persistent agent memory — wrong recalls? Stale decisions? Privacy?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm in the thread. Happy to go deep on architecture, share internals, or hear "this is over-engineered and here's why."&lt;/p&gt;

&lt;p&gt;If you run it and something breaks — tell me. There's detailed local telemetry and I'd rather tune against real usage than synthetic tests.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/GG-QandV/mnemostroma" rel="noopener noreferrer"&gt;https://github.com/GG-QandV/mnemostroma&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>mnemostroma</category>
    </item>
  </channel>
</rss>
