DEV Community

Mnemosy


We Built the First AI Agent Memory System With Zero LLM Calls — Here's the Architecture

Why AI Agents Need Brains, Not Just Vector Databases

Every AI agent shipping today has a fundamental problem: amnesia.

Load up any agent framework — LangChain, CrewAI, AutoGen, custom builds — and start a conversation. Ask it about your project. It knows nothing. Give it context across 50 turns. Then watch the context window compact. It knows nothing again.

This isn't a minor UX issue. It's the single biggest bottleneck to autonomous AI. Agents can't learn from mistakes if they don't remember making them. They can't build expertise if every session starts from scratch. They can't collaborate if they can't share what they know.

The industry's response has been to wrap vector databases with LLM-powered extraction layers. Send text to GPT-4, extract key facts, store as vectors, retrieve by similarity. Systems like Mem0, Zep, Cognee, and Letta have raised ~$47M combined doing variations of this approach.

It works for demos. It doesn't work for production. Here's why.

The Problem with LLM-in-the-Loop Memory

When you put an LLM in your memory ingestion pipeline, you inherit three structural problems:

1. Non-deterministic behavior. The same input can produce different extracted facts on different runs. Your memory system's behavior changes when the model version changes, when the prompt drifts, when the temperature fluctuates. In production, you need memory that behaves consistently.

2. Latency floor. Every memory store operation requires an LLM API call — 500ms to 2 seconds minimum. When your agent processes 100 memories per session, that's 50-200 seconds of just waiting for extraction. For real-time agent interactions, this is unacceptable.

3. Linear cost scaling. At approximately $0.01 per memory, storing 100K memories costs $1,000 per month; a million memories costs $10,000 per month. The cost scales linearly, with no efficiency gains as volume grows. For production systems processing tens of thousands of interactions daily, the economics are brutal.

These aren't implementation bugs. They're architectural consequences of the LLM-in-the-loop design.

What If Memory Worked Like a Brain?

We spent months running an AI agent mesh across 10 machines — 10 agents collaborating on real tasks, 13,000+ memories accumulated, sub-200ms retrieval requirements. The vector-store-plus-LLM approach broke down immediately. We needed something fundamentally different.

So we built Mnemosyne: a 5-layer cognitive memory operating system for AI agents. Not another vector wrapper. An actual memory architecture inspired by how biological memory systems work — from the neural substrate up to metacognition.

+----------------------------------------------------------------------+
|                      MNEMOSYNE COGNITIVE OS                          |
|                                                                      |
|  L5  SELF-IMPROVEMENT                                                |
|  [ Reinforcement ] [ Consolidation ] [ Flash Reasoning ] [ ToMA ]    |
|                                                                      |
|  L4  COGNITIVE                                                       |
|  [ Activation Decay ] [ Confidence ] [ Priority ] [ Diversity ]      |
|                                                                      |
|  L3  KNOWLEDGE GRAPH                                                 |
|  [ Temporal Graph ] [ Auto-Linking ] [ Path Traversal ] [ Entities ] |
|                                                                      |
|  L2  PIPELINE                                                        |
|  [ Extraction ] [ Classify ] [ Dedup & Merge ] [ Security Filter ]   |
|                                                                      |
|  L1  INFRASTRUCTURE                                                  |
|  [ Qdrant ] [ FalkorDB ] [ Redis Cache ] [ Redis Pub/Sub ]           |
+----------------------------------------------------------------------+

33 features across 5 layers. Every feature independently toggleable. MIT licensed. TypeScript.

Zero LLM Calls: The Core Design Bet

The most controversial architectural decision in Mnemosyne: the entire ingestion pipeline runs without any LLM calls.

Every memory passes through a deterministic 12-step pipeline:

  1. Security Filter — 3-tier classification blocks API keys, credentials, private keys
  2. Embedding — 768-dim vectors via any OpenAI-compatible endpoint
  3. Dedup & Merge — Cosine ≥0.92 = duplicate (merge). 0.70–0.92 = conflict (alert).
  4. Entity Extraction — People, IPs, technologies, dates, URLs. Algorithmic, not LLM.
  5. Type Classification — 7 types: episodic, semantic, preference, procedural, relationship, profile, core
  6. Urgency Detection — 4 levels: critical, important, reference, background
  7. Domain Classification — 5 domains: technical, personal, project, knowledge, general
  8. Priority Scoring — Urgency × domain composite (0.0–1.0)
  9. Confidence Rating — 3-signal composite with 4 human-readable tiers
  10. Vector Storage — Written to appropriate collection with 23-field metadata
  11. Auto-Linking — Bidirectional links to related memories (Zettelkasten-style)
  12. Broadcast — Published to agent mesh via typed channels

Total time: <50ms. LLM calls: 0. Cost: $0.
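
Step 3's thresholds can be sketched in a few lines. This is an illustrative sketch of the decision logic implied by the numbers above, not Mnemosyne's actual implementation:

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

type DedupAction = "merge" | "conflict" | "store";

// Thresholds from the pipeline description: >=0.92 duplicate, 0.70-0.92 conflict.
function dedupAction(incoming: number[], nearestNeighbor: number[]): DedupAction {
  const sim = cosine(incoming, nearestNeighbor);
  if (sim >= 0.92) return "merge";     // near-duplicate: merge into existing memory
  if (sim >= 0.70) return "conflict";  // related but divergent: raise an alert
  return "store";                      // distinct: store as a new memory
}
```

Because the decision is a pure function of two vectors, the same input always produces the same action — the consistency property the LLM-in-the-loop design can't guarantee.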

import { createMnemosyne } from 'mnemosy-ai'

const m = await createMnemosyne({
  vectorDbUrl: 'http://localhost:6333',
  embeddingUrl: 'http://localhost:11434/v1/embeddings',
  agentId: 'my-agent'
})

// Full 12-step pipeline, <50ms, $0
await m.store({ text: "CRITICAL: Auth service JWT expiry changed from 1hr to 30min" })
// -> type: semantic, urgency: critical, domain: technical
// -> priority: 1.0, entities: [Auth service, JWT, 1hr, 30min]
// -> auto-linked to 2 existing JWT memories
// -> broadcast to agent mesh with critical priority

The trade-off is real: LLM-based extraction catches implicit relationships and nuanced semantic structure that algorithmic extraction misses. Cognee's LLM-powered graph construction builds richer knowledge graphs for document corpora. But for the vast majority of agent memory operations — where entities are explicit, facts are stated directly, and you need speed, consistency, and zero cost — the algorithmic approach dominates.

Cognitive Features That Only Exist in Papers

Here's where it gets interesting. Beyond the pipeline, Mnemosyne implements 10 capabilities that previously existed only in academic research:

Activation Decay

Memories fade over time following a logarithmic model inspired by the Ebbinghaus forgetting curve. Critical memories stay active for months. Background observations fade within hours. Procedural memories (runbooks, deployment steps) are immune to decay — like how you never forget how to ride a bike.

// Critical memory: stays active for months
await m.store({ text: "CRITICAL: Never deploy to prod on Fridays" })
// -> decay rate: 0.3, baseline: +2.0

// Background memory: fades within hours
await m.store({ text: "User mentioned they had coffee this morning" })
// -> decay rate: 0.8, baseline: -1.0

// Procedural memory: immune to decay forever
await m.store({ text: "To deploy: 1) Run tests 2) Build 3) Push 4) Apply" })
// -> type: procedural, activation: permanent
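
A minimal sketch of what a log-shaped forgetting model with those parameters could look like — the formula and the activation threshold here are my assumptions, not Mnemosyne's internals:

```typescript
interface Memory {
  type: "procedural" | "semantic" | "episodic";
  baseline: number;   // e.g. +2.0 for critical, -1.0 for background
  decayRate: number;  // e.g. 0.3 (slow) to 0.8 (fast)
}

// Assumed model: activation = baseline - decayRate * ln(1 + hoursSinceAccess).
// Logarithmic decay means activation falls quickly at first, then flattens,
// echoing the Ebbinghaus curve mentioned above.
function activation(mem: Memory, hoursSinceAccess: number): number {
  if (mem.type === "procedural") return Infinity; // immune to decay
  return mem.baseline - mem.decayRate * Math.log(1 + hoursSinceAccess);
}

// A memory is treated as inactive once activation drops below a cutoff
// (the -3 cutoff is illustrative).
const isActive = (mem: Memory, hours: number, threshold = -3): boolean =>
  activation(mem, hours) > threshold;
```

With these numbers, a critical memory (baseline +2.0, rate 0.3) is still above the cutoff after 30 days, while a background memory (baseline -1.0, rate 0.8) falls below it within a day.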

Multi-Signal Scoring with Intent Detection

Every recall query is scored across 5 independent signals — not just cosine similarity:

| Signal | Weight | What it measures |
| --- | --- | --- |
| Semantic similarity | 35% | Vector distance |
| Temporal recency | 20% | Time since last access |
| Importance × confidence | 20% | Priority score × confidence |
| Access frequency | 15% | How often retrieved (log scale) |
| Type relevance | 10% | Memory type vs. query intent |

Mnemosyne auto-detects 5 query intents (factual, temporal, procedural, preference, exploratory) and dynamically adjusts these weights. A temporal query ("what happened recently?") boosts recency to 35%. A procedural query ("how do I deploy?") boosts frequency and type relevance.
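
The weight adjustment can be sketched as a per-intent override of the base table. The base weights come from the table above; the temporal boost to 35% is stated in the article, while the procedural deltas are my illustrative assumption:

```typescript
type Intent = "factual" | "temporal" | "procedural" | "preference" | "exploratory";

interface Weights {
  similarity: number; recency: number; importance: number;
  frequency: number; typeRelevance: number;
}

// Base weights from the scoring table.
const BASE: Weights = {
  similarity: 0.35, recency: 0.20, importance: 0.20,
  frequency: 0.15, typeRelevance: 0.10,
};

function weightsFor(intent: Intent): Weights {
  const w = { ...BASE };
  if (intent === "temporal") {
    // "what happened recently?" -> recency boosted to 35%
    w.recency = 0.35; w.similarity = 0.20;
  } else if (intent === "procedural") {
    // "how do I deploy?" -> frequency and type relevance boosted (assumed deltas)
    w.similarity = 0.25; w.recency = 0.10; w.frequency = 0.25; w.typeRelevance = 0.20;
  }
  return w; // other intents keep the base weights in this sketch
}

// Final score: weighted sum of the five normalized signals.
function score(signals: Weights, intent: Intent): number {
  const w = weightsFor(intent);
  return (Object.keys(w) as (keyof Weights)[])
    .reduce((sum, k) => sum + w[k] * signals[k], 0);
}
```

Note that every override keeps the weights summing to 1, so scores stay comparable across intents.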

Flash Reasoning

BFS traversal through linked memory graphs that reconstructs multi-step logic chains:

const results = await m.recall({ query: "why did auth service crash?" })
// Primary: "Auth service crashed after config update"
// Chain: -> (because) "Config changed JWT expiry from 1hr to 30min"
//        -> (leads_to) "Short-lived tokens caused session storm"
//        -> (therefore) "Rollback to 1hr expiry resolved the issue"

Your agent gets the complete narrative from a single recall.
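
The traversal itself is a plain breadth-first search over typed links. A sketch under assumed data shapes (the link graph and relation names are illustrative, not the library's types):

```typescript
interface Link { to: string; relation: string } // e.g. "because", "leads_to", "therefore"
type LinkGraph = Map<string, Link[]>;

// BFS from the primary memory, collecting each newly reached memory with the
// relation that led to it. Depth is capped to keep chains short and relevant.
function reasoningChain(graph: LinkGraph, start: string, maxDepth = 4): string[] {
  const chain: string[] = [];
  const visited = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const id of frontier) {
      for (const { to, relation } of graph.get(id) ?? []) {
        if (visited.has(to)) continue; // avoid cycles in the memory graph
        visited.add(to);
        chain.push(`(${relation}) ${to}`);
        next.push(to);
      }
    }
    frontier = next;
  }
  return chain;
}
```

Running it on the auth-crash example reconstructs the three-hop chain shown above from a single starting memory.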

Theory of Mind for Agents

In a multi-agent mesh, any agent can model what other agents know:

// What does the DevOps agent know about the production database?
const knowledge = await m.toma("devops-agent", "production database")

// Knowledge gap analysis
const gap = await m.knowledgeGap("frontend-agent", "backend-agent", "API contracts")
// -> { onlyFrontendKnows: [...], onlyBackendKnows: [...], bothKnow: [...] }

This concept comes from developmental psychology (Baron-Cohen et al., 1985) and multi-agent systems research (Gmytrasiewicz & Doshi, 2005). To our knowledge, it had never before been deployed as production memory infrastructure.

Cross-Agent Synthesis

When 3+ agents independently store corroborating memories about the same fact, it's automatically promoted to "Mesh Fact" — the highest confidence tier. Independent corroboration from separate agents operating in different contexts is strong evidence of factual accuracy.
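
The promotion rule reduces to counting distinct corroborating agents. A sketch under assumed shapes (the tier names and `Observation` type are mine):

```typescript
interface Observation { agentId: string; factKey: string }

// A fact independently stored by 3+ distinct agents is promoted to the top
// confidence tier. Repeated observations from the same agent don't count,
// since corroboration must be independent.
function confidenceTier(observations: Observation[], factKey: string): string {
  const agents = new Set(
    observations.filter(o => o.factKey === factKey).map(o => o.agentId)
  );
  return agents.size >= 3 ? "mesh-fact" : "standard";
}
```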

Reinforcement Learning on Memory

Feedback closes the loop. Memories that consistently prove useful are promoted to core status (immune to decay). Memories that consistently mislead are flagged for review. Over time, retrieval quality improves without manual curation.

await m.recall({ query: "database config" })
// Agent uses the result successfully...
await m.feedback("positive")
// After 3+ retrievals with >70% positive ratio → auto-promoted to core
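
The promotion rule stated in the comment is a one-liner. A sketch, assuming a simple per-memory feedback counter:

```typescript
interface FeedbackLog { retrievals: number; positive: number }

// Promote to core status after 3+ retrievals with a >70% positive ratio,
// per the rule described above.
function shouldPromoteToCore(log: FeedbackLog): boolean {
  return log.retrievals >= 3 && log.positive / log.retrievals > 0.7;
}
```

The minimum-retrieval floor matters: without it, a single lucky hit (1/1 = 100% positive) would promote memories on no real evidence.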

The Knowledge Graph: Built-In, Free, Temporal

Mnemosyne includes a temporal knowledge graph powered by FalkorDB. Every entity extracted from memories becomes a graph node. Relationships carry timestamps. The graph grows automatically as memories are stored.

  • Auto-linking: Related memories are bidirectionally connected (Zettelkasten-style)
  • Path finding: "How is Alice related to PostgreSQL?" → Alice → deployed auth service → auth service uses → PostgreSQL
  • Timeline reconstruction: Chronological history of everything known about any entity
  • Temporal queries: "What was server-1 connected to as of January 15th?"
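
An "as of" query boils down to filtering edges by their validity interval. A sketch under an assumed edge shape (Mnemosyne stores this in FalkorDB; the in-memory structure here is purely illustrative):

```typescript
interface TemporalEdge {
  from: string; to: string; relation: string;
  validFrom: number;   // epoch ms when the relationship began
  validTo?: number;    // undefined = still valid
}

// "What was `node` connected to as of time t?" -> keep edges touching the
// node whose validity interval covers t.
function connectionsAsOf(edges: TemporalEdge[], node: string, t: number): TemporalEdge[] {
  return edges.filter(e =>
    (e.from === node || e.to === node) &&
    e.validFrom <= t &&
    (e.validTo === undefined || t <= e.validTo)
  );
}
```

Because edges are never overwritten — only closed with a `validTo` timestamp — the graph retains its full history, which is what makes timeline reconstruction possible.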

Mem0 charges $249/month for their knowledge graph. Mnemosyne's is built in and ships under the MIT license.

Cost Comparison at Scale

| Memories/month | Mnemosyne | Mem0 | Zep | Cognee | Letta |
| --- | --- | --- | --- | --- | --- |
| 10K | ~$30 | ~$130-330 | ~$70-220 | ~$140-540 | ~$130-530 |
| 100K | ~$60 | ~$1K-3K | ~$1K-2K | ~$1K-5K | ~$1K-5K |
| 1M | ~$250 | ~$10K-30K | ~$10K-50K | ~$10K-50K | |

The difference is entirely the per-memory LLM processing cost that Mnemosyne eliminates. Infrastructure costs (Qdrant, Redis, FalkorDB) are roughly equivalent across all systems.

Feature Count Comparison

| System | Features | Knowledge Graph | Multi-Agent | Self-Improving | Cost/Memory |
| --- | --- | --- | --- | --- | --- |
| Mnemosyne | 33 | Free (built-in) | Full mesh | Yes (RL + consolidation) | $0 |
| Mem0 | ~5 | $249/mo | Enterprise only | No | ~$0.01 |
| Zep | ~3 | None | None | No | ~$0.01 |
| Cognee | ~5 | Built-in | None | No | ~$0.01 |
| LangMem | ~0 | None | None | No | ~$0.01 |
| Letta | ~4 | None | Basic | No | ~$0.01 |

Getting Started

npm install mnemosy-ai
docker run -d -p 6333:6333 qdrant/qdrant  # Only hard requirement
import { createMnemosyne } from 'mnemosy-ai'

const m = await createMnemosyne({
  vectorDbUrl: 'http://localhost:6333',
  embeddingUrl: 'http://localhost:11434/v1/embeddings',
  agentId: 'my-agent'
})

await m.store({ text: "User prefers TypeScript and dark mode" })
const memories = await m.recall({ query: "user preferences" })
await m.feedback("positive")

Start with just Qdrant (vector-only mode). Add FalkorDB for the knowledge graph. Add Redis for multi-agent mesh. Every feature is independently toggleable — adopt progressively.

What We Didn't Build

To be honest about scope: Mnemosyne doesn't have a managed cloud offering (you run your own infra). It's TypeScript-only (the AI/ML ecosystem is mostly Python). It doesn't have 41K GitHub stars (Mem0 earned those). And its algorithmic entity extraction won't catch the implicit relationships that Cognee's LLM-powered extraction finds.

These are real trade-offs. Mnemosyne is purpose-built for teams that need cognitive intelligence, multi-agent collaboration, zero-LLM economics, and self-improving memory — and are willing to run their own infrastructure in exchange.

Try It

33 features. 5 cognitive layers. $0 per memory stored. The brain your agents are missing.


Mnemosyne — Because intelligence without memory isn't intelligence.
