Hi,
LLM agents are brilliant in the moment and amnesiac by design.
You explain your stack, your constraints, your decisions — then open a new chat and do it all again.
Mnemostroma is my attempt to fix that without changing how you work.
It's a local daemon that sits between you and your agents. It watches the conversation I/O silently, decides what's worth keeping, compresses it into structured memory, and surfaces it back when it's relevant. You never call "save". You never write a prompt to recall something. The agent just... knows.
What's unusual about the design:
The agent only reads memory — it never writes it. All observation, classification, and storage happen in a separate pipeline running in the background. This turned out to be a surprisingly important constraint: the memory layer is completely decoupled from the agent's behavior, so the model can't be "confused" into storing garbage.
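To make the split concrete, here's a minimal sketch of the idea (all names here are illustrative, not Mnemostroma's actual API): the agent-facing surface exposes only a read path, while writes go through a separate observer that the agent never talks to.

```python
class MemoryStore:
    """Backing store. Hypothetical stand-in for the real index."""

    def __init__(self):
        self._items = []

    def _write(self, item):
        # Only the observer pipeline calls this; it is never
        # exposed as a tool to the agent.
        self._items.append(item)

    def recall(self, query):
        # The ONLY operation the agent ever sees.
        return [i for i in self._items if query in i]


class Observer:
    """Background process watching conversation I/O."""

    def __init__(self, store):
        self._store = store

    def ingest(self, transcript_line):
        if self._worth_keeping(transcript_line):
            self._store._write(transcript_line)

    def _worth_keeping(self, line):
        # Stand-in for the real classification step.
        return "decision:" in line
```

Because `recall` is the whole agent-visible surface, a prompt-injected or confused model has no write primitive to abuse.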
Under the hood:
Dual-stream async pipeline (Observer + Content), RAM-first index, SQLite WAL persistence. Five memory layers with gradual decay — important decisions stay, low-value noise fades. Semantic retrieval via numpy matmul over ONNX INT8 embeddings, ~20 ms. No torch. No transformers. No cloud. No Docker. ~420 MB RAM baseline.
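The retrieval step is less exotic than it sounds: with a pre-normalized embedding matrix held in RAM, cosine similarity collapses to a single matmul. A hedged sketch (dimensions and names are illustrative, not Mnemostroma's internals):

```python
import numpy as np

# Illustrative index: 10k memories, 384-dim INT8-quantized embeddings
# dequantized to float32 and L2-normalized up front.
rng = np.random.default_rng(0)
index = rng.standard_normal((10_000, 384)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)  # unit rows

def retrieve(query_vec, k=5):
    """Top-k nearest memories by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q  # one matmul: (N, d) @ (d,) -> (N,)
    top = np.argpartition(scores, -k)[-k:]      # unordered top-k
    return top[np.argsort(scores[top])[::-1]]   # best-first
```

At this scale a single-threaded matmul over a float32 matrix is comfortably in the tens-of-milliseconds range on commodity CPUs, which is consistent with the ~20 ms figure above.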
Try it today:
```
pip install "git+https://github.com/GG-QandV/mnemostroma.git"
mnemostroma setup    # downloads ~300 MB ONNX models, generates TLS cert
mnemostroma on
mnemostroma status
```
Connects to Claude Desktop, Claude Code, Cursor, Windsurf, Zed, and anything else that speaks MCP. There's also a passthrough proxy mode for Claude Code: you launch your IDE through a wrapper, and the Observer starts capturing without touching your workflow.
Status: v1.8.1 beta. 400+ tests passing. Not on PyPI yet (git install only). API surface is stabilizing; breaking changes are unlikely but possible.
Privacy: everything lives in ~/.mnemostroma as plain SQLite. Local-only logging subsystem for latency/diagnostics — can be disabled or wiped anytime. Nothing leaves your machine.
A few things I'm genuinely unsure about and would love input on:
- The ~420 MB (400–650 MB) RAM footprint for a background daemon — dealbreaker for you, or fine?
- The "agent reads, Observer writes" split — does this feel right, or would you want the agent to be able to annotate its own memory?
- Which integration matters most to you: VS Code, Cursor, a standalone CLI, something else?
- What's your biggest fear about persistent agent memory — wrong recalls? Stale decisions? Privacy?
I'm in the thread. Happy to go deep on architecture, share internals, or hear "this is over-engineered and here's why."
If you run it and something breaks — tell me. There's detailed local telemetry and I'd rather tune against real usage than synthetic tests.