<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mnemosy </title>
    <description>The latest articles on DEV Community by Mnemosy  (@mnemosybrain).</description>
    <link>https://dev.to/mnemosybrain</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3790381%2F1b47276f-b7c4-47ab-9c2f-d445d18b8c67.png</url>
      <title>DEV Community: Mnemosy </title>
      <link>https://dev.to/mnemosybrain</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mnemosybrain"/>
    <language>en</language>
    <item>
      <title>We Built the First AI Agent Memory System With Zero LLM Calls — Here's the Architecture</title>
      <dc:creator>Mnemosy </dc:creator>
      <pubDate>Tue, 24 Feb 2026 22:04:50 +0000</pubDate>
      <link>https://dev.to/mnemosybrain/we-built-the-first-ai-agent-memory-system-with-zero-llm-calls-heres-the-architecture-5hgc</link>
      <guid>https://dev.to/mnemosybrain/we-built-the-first-ai-agent-memory-system-with-zero-llm-calls-heres-the-architecture-5hgc</guid>
      <description>&lt;h1&gt;
  
  
  Why AI Agents Need Brains, Not Just Vector Databases
&lt;/h1&gt;

&lt;p&gt;Every AI agent shipping today has a fundamental problem: amnesia.&lt;/p&gt;

&lt;p&gt;Load up any agent framework — LangChain, CrewAI, AutoGen, custom builds — and start a conversation. Ask it about your project. It knows nothing. Give it context across 50 turns. Then watch the context window compact. It knows nothing again.&lt;/p&gt;

&lt;p&gt;This isn't a minor UX issue. It's the single biggest bottleneck to autonomous AI. Agents can't learn from mistakes if they don't remember making them. They can't build expertise if every session starts from scratch. They can't collaborate if they can't share what they know.&lt;/p&gt;

&lt;p&gt;The industry's response has been to wrap vector databases with LLM-powered extraction layers. Send text to GPT-4, extract key facts, store as vectors, retrieve by similarity. Systems like Mem0, Zep, Cognee, and Letta have raised ~$47M combined doing variations of this approach.&lt;/p&gt;

&lt;p&gt;It works for demos. It doesn't work for production. Here's why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with LLM-in-the-Loop Memory
&lt;/h2&gt;

&lt;p&gt;When you put an LLM in your memory ingestion pipeline, you inherit three structural problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Non-deterministic behavior.&lt;/strong&gt; The same input can produce different extracted facts on different runs. Your memory system's behavior changes when the model version changes, when the prompt drifts, when the temperature fluctuates. In production, you need memory that behaves consistently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Latency floor.&lt;/strong&gt; Every memory store operation requires an LLM API call — 500ms to 2 seconds minimum. When your agent processes 100 memories per session, that's 50-200 seconds of just waiting for extraction. For real-time agent interactions, this is unacceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Linear cost scaling.&lt;/strong&gt; At approximately $0.01 per memory, storing 100K memories costs $1,000. A million memories costs $10,000. Per month. This scales linearly with no efficiency gains. For production systems processing tens of thousands of interactions daily, the economics are brutal.&lt;/p&gt;

&lt;p&gt;These aren't implementation bugs. They're architectural consequences of the LLM-in-the-loop design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What If Memory Worked Like a Brain?
&lt;/h2&gt;

&lt;p&gt;We spent months running a 10-machine AI agent mesh — 10 agents collaborating on real tasks, 13,000+ memories accumulated, sub-200ms retrieval requirements. The vector-store-plus-LLM approach broke down immediately. We needed something fundamentally different.&lt;/p&gt;

&lt;p&gt;So we built &lt;strong&gt;Mnemosyne&lt;/strong&gt;: a 5-layer cognitive memory operating system for AI agents. Not another vector wrapper. An actual memory architecture inspired by how biological memory systems work — from the neural substrate up to metacognition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------------------------------------------------+
|                      MNEMOSYNE COGNITIVE OS                          |
|                                                                      |
|  L5  SELF-IMPROVEMENT                                                |
|  [ Reinforcement ] [ Consolidation ] [ Flash Reasoning ] [ ToMA ]    |
|                                                                      |
|  L4  COGNITIVE                                                       |
|  [ Activation Decay ] [ Confidence ] [ Priority ] [ Diversity ]      |
|                                                                      |
|  L3  KNOWLEDGE GRAPH                                                 |
|  [ Temporal Graph ] [ Auto-Linking ] [ Path Traversal ] [ Entities ] |
|                                                                      |
|  L2  PIPELINE                                                        |
|  [ Extraction ] [ Classify ] [ Dedup &amp;amp; Merge ] [ Security Filter ]   |
|                                                                      |
|  L1  INFRASTRUCTURE                                                  |
|  [ Qdrant ] [ FalkorDB ] [ Redis Cache ] [ Redis Pub/Sub ]          |
+----------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;33 features across 5 layers. Every feature independently toggleable. MIT licensed. TypeScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero LLM Calls: The Core Design Bet
&lt;/h2&gt;

&lt;p&gt;The most controversial architectural decision in Mnemosyne: &lt;strong&gt;the entire ingestion pipeline runs without any LLM calls.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every memory passes through a deterministic 12-step pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Security Filter&lt;/strong&gt; — 3-tier classification blocks API keys, credentials, private keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt; — 768-dim vectors via any OpenAI-compatible endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup &amp;amp; Merge&lt;/strong&gt; — Cosine ≥0.92 = duplicate (merge). 0.70–0.92 = conflict (alert).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity Extraction&lt;/strong&gt; — People, IPs, technologies, dates, URLs. Algorithmic, not LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type Classification&lt;/strong&gt; — 7 types: episodic, semantic, preference, procedural, relationship, profile, core&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Urgency Detection&lt;/strong&gt; — 4 levels: critical, important, reference, background&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Classification&lt;/strong&gt; — 5 domains: technical, personal, project, knowledge, general&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority Scoring&lt;/strong&gt; — Urgency × domain composite (0.0–1.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence Rating&lt;/strong&gt; — 3-signal composite with 4 human-readable tiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Storage&lt;/strong&gt; — Written to appropriate collection with 23-field metadata&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Linking&lt;/strong&gt; — Bidirectional links to related memories (Zettelkasten-style)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broadcast&lt;/strong&gt; — Published to agent mesh via typed channels&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total time: &amp;lt;50ms. LLM calls: 0. Cost: $0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createMnemosyne&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mnemosy-ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createMnemosyne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;vectorDbUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;embeddingUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:11434/v1/embeddings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;agentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Full 12-step pipeline, &amp;lt;50ms, $0&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CRITICAL: Auth service JWT expiry changed from 1hr to 30min&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; type: semantic, urgency: critical, domain: technical&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; priority: 1.0, entities: [Auth service, JWT, 1hr, 30min]&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; auto-linked to 2 existing JWT memories&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; broadcast to agent mesh with critical priority&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade-off is real: LLM-based extraction catches implicit relationships and nuanced semantic structure that algorithmic extraction misses. Cognee's LLM-powered graph construction builds richer knowledge graphs for document corpora. But for the vast majority of agent memory operations — where entities are explicit, facts are stated directly, and you need speed, consistency, and zero cost — the algorithmic approach dominates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cognitive Features That Previously Existed Only in Papers
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. Beyond the pipeline, Mnemosyne implements 10 capabilities that previously existed only in academic research:&lt;/p&gt;

&lt;h3&gt;
  
  
  Activation Decay
&lt;/h3&gt;

&lt;p&gt;Memories fade over time following a logarithmic model inspired by the Ebbinghaus forgetting curve. Critical memories stay active for months. Background observations fade within hours. Procedural memories (runbooks, deployment steps) are &lt;strong&gt;immune to decay&lt;/strong&gt; — like how you never forget how to ride a bike.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Critical memory: stays active for months&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CRITICAL: Never deploy to prod on Fridays&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; decay rate: 0.3, baseline: +2.0&lt;/span&gt;

&lt;span class="c1"&gt;// Background memory: fades within hours&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;User mentioned they had coffee this morning&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; decay rate: 0.8, baseline: -1.0&lt;/span&gt;

&lt;span class="c1"&gt;// Procedural memory: immune to decay forever&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;To deploy: 1) Run tests 2) Build 3) Push 4) Apply&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; type: procedural, activation: permanent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
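&lt;p&gt;The exact decay formula isn't spelled out above, but the numbers in the comments are consistent with a simple logarithmic curve. A minimal sketch — the formula, the initial strength of 2.0, and the zero-activation cutoff are illustrative assumptions; the rates and baselines come from the example:&lt;/p&gt;

```typescript
// Sketch of the activation-decay behavior described above. The formula,
// the initial strength of 2.0, and the zero cutoff are assumptions; the
// decay rates and baselines come from the example comments.
interface DecayParams { rate: number; baseline: number; }

const DECAY: { [urgency: string]: DecayParams } = {
  critical:   { rate: 0.3, baseline: 2.0 },  // stays active for months
  background: { rate: 0.8, baseline: -1.0 }, // fades within hours
};

// Logarithmic, Ebbinghaus-style curve: strength falls off with ln(1 + hours).
function activation(urgency: string, hours: number, procedural = false): number {
  if (procedural) return 1.0;                // procedural memories never decay
  const p = DECAY[urgency];
  return 2.0 + p.baseline - p.rate * Math.log(1 + hours);
}

// A memory counts as active while its activation stays above zero.
function isActive(urgency: string, hours: number): boolean {
  return activation(urgency, hours) > 0;
}
```

&lt;p&gt;Under these assumed parameters, a background memory crosses zero within a few hours while a critical one stays positive for months — matching the behavior in the example above.&lt;/p&gt;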



&lt;h3&gt;
  
  
  Multi-Signal Scoring with Intent Detection
&lt;/h3&gt;

&lt;p&gt;Every recall query is scored across 5 independent signals — not just cosine similarity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic Similarity&lt;/td&gt;
&lt;td&gt;35%&lt;/td&gt;
&lt;td&gt;Vector distance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporal Recency&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Time since last access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Importance × Confidence&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Priority score × confidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access Frequency&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;How often retrieved (log scale)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type Relevance&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;Memory type vs. query intent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mnemosyne auto-detects 5 query intents (factual, temporal, procedural, preference, exploratory) and &lt;strong&gt;dynamically adjusts these weights&lt;/strong&gt;. A temporal query ("what happened recently?") boosts recency to 35%. A procedural query ("how do I deploy?") boosts frequency and type relevance.&lt;/p&gt;
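&lt;p&gt;The weighted composite can be sketched directly from the table. Only the "recency to 35%" adjustment is stated above; the other intent re-weightings are illustrative assumptions, chosen so each weight set still sums to 1.0:&lt;/p&gt;

```typescript
// Sketch of the 5-signal composite from the table above. Signal values
// are assumed pre-normalized to [0, 1].
interface Signals {
  similarity: number;
  recency: number;
  importance: number;
  frequency: number;
  typeRelevance: number;
}

const BASE_WEIGHTS: Signals = {
  similarity: 0.35, recency: 0.20, importance: 0.20,
  frequency: 0.15, typeRelevance: 0.10,
};

function weightsFor(intent: string): Signals {
  const w = { ...BASE_WEIGHTS };
  if (intent === "temporal") {
    w.recency = 0.35;       // "what happened recently?" boosts recency
    w.similarity = 0.20;
  }
  if (intent === "procedural") {
    w.frequency = 0.25;     // "how do I deploy?" boosts frequency...
    w.typeRelevance = 0.20; // ...and memory-type relevance (assumed values)
    w.similarity = 0.25;
    w.recency = 0.10;
  }
  return w;
}

function score(s: Signals, intent: string): number {
  const w = weightsFor(intent);
  return w.similarity * s.similarity + w.recency * s.recency +
    w.importance * s.importance + w.frequency * s.frequency +
    w.typeRelevance * s.typeRelevance;
}
```

&lt;p&gt;With these weights, the same memory ranks differently depending on the detected intent — a recently accessed memory outranks a semantically closer one under a temporal query.&lt;/p&gt;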

&lt;h3&gt;
  
  
  Flash Reasoning
&lt;/h3&gt;

&lt;p&gt;BFS traversal through linked memory graphs that reconstructs multi-step logic chains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;why did auth service crash?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// Primary: "Auth service crashed after config update"&lt;/span&gt;
&lt;span class="c1"&gt;// Chain: -&amp;gt; (because) "Config changed JWT expiry from 1hr to 30min"&lt;/span&gt;
&lt;span class="c1"&gt;//        -&amp;gt; (leads_to) "Short-lived tokens caused session storm"&lt;/span&gt;
&lt;span class="c1"&gt;//        -&amp;gt; (therefore) "Rollback to 1hr expiry resolved the issue"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your agent gets the complete narrative from a single recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Theory of Mind for Agents
&lt;/h3&gt;

&lt;p&gt;In a multi-agent mesh, any agent can model what other agents know:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What does the DevOps agent know about the production database?&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;knowledge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;devops-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;production database&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Knowledge gap analysis&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;knowledgeGap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;frontend-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;backend-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;API contracts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; { onlyFrontendKnows: [...], onlyBackendKnows: [...], bothKnow: [...] }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This concept comes from developmental psychology (Baron-Cohen, 1985) and multi-agent systems research (Gmytrasiewicz &amp;amp; Doshi, 2005). Until now, it had never shipped as production infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Agent Synthesis
&lt;/h3&gt;

&lt;p&gt;When 3+ agents independently store corroborating memories about the same fact, it's automatically promoted to "Mesh Fact" — the highest confidence tier. Independent corroboration from separate agents operating in different contexts is strong evidence of factual accuracy.&lt;/p&gt;
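&lt;p&gt;A minimal sketch of that promotion rule — counting distinct corroborating agents by embedding similarity. Reusing the pipeline's 0.92 dedup threshold for "corroborates" is an assumption; the post doesn't specify the corroboration threshold:&lt;/p&gt;

```typescript
// Sketch of Mesh Fact promotion: count distinct agents whose stored
// embeddings corroborate a candidate fact.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  a.forEach((v, i) => {
    dot += v * b[i];
    na += v * v;
    nb += b[i] * b[i];
  });
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface MeshMemory { agentId: string; embedding: number[]; }

// Promote only when 3 or more *distinct* agents independently stored it;
// one agent repeating itself is not corroboration.
function isMeshFact(candidate: number[], mesh: MeshMemory[]): boolean {
  const agents = new Set();
  for (const m of mesh) {
    if (cosine(candidate, m.embedding) >= 0.92) agents.add(m.agentId);
  }
  return agents.size >= 3;
}
```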

&lt;h3&gt;
  
  
  Reinforcement Learning on Memory
&lt;/h3&gt;

&lt;p&gt;Feedback closes the loop. Memories that consistently prove useful are promoted to core status (immune to decay). Memories that consistently mislead are flagged for review. Over time, retrieval quality improves without manual curation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;database config&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// Agent uses the result successfully...&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// After 3+ retrievals with &amp;gt;70% positive ratio → auto-promoted to core&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
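&lt;p&gt;The promotion rule in the comment above — at least 3 retrievals with a positive-feedback ratio above 70% — reduces to a small predicate. The stats shape is an illustrative assumption:&lt;/p&gt;

```typescript
// Sketch of the core-promotion rule: 3 or more retrievals with a
// positive-feedback ratio above 70%. Thresholds come from the example;
// the FeedbackStats shape is assumed.
interface FeedbackStats { retrievals: number; positive: number; }

function shouldPromoteToCore(stats: FeedbackStats): boolean {
  if (stats.retrievals >= 3) {
    if (stats.positive / stats.retrievals > 0.7) return true;
  }
  return false;
}
```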



&lt;h2&gt;
  
  
  The Knowledge Graph: Built-In, Free, Temporal
&lt;/h2&gt;

&lt;p&gt;Mnemosyne includes a temporal knowledge graph powered by FalkorDB. Every entity extracted from memories becomes a graph node. Relationships carry timestamps. The graph grows automatically as memories are stored.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-linking&lt;/strong&gt;: Related memories are bidirectionally connected (Zettelkasten-style)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path finding&lt;/strong&gt;: "How is Alice related to PostgreSQL?" → Alice → deployed auth service → auth service uses → PostgreSQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline reconstruction&lt;/strong&gt;: Chronological history of everything known about any entity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal queries&lt;/strong&gt;: "What was server-1 connected to as of January 15th?"&lt;/li&gt;
&lt;/ul&gt;
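&lt;p&gt;The temporal "as of" query above can be sketched as a timestamp filter over the edge list. The edge shape and field names below are illustrative assumptions; the post only states that relationships carry timestamps:&lt;/p&gt;

```typescript
// Sketch of a temporal "as of" query: keep only edges whose validity
// started on or before the cutoff date.
interface Edge { from: string; to: string; relation: string; validFrom: string; }

function connectedAsOf(edges: Edge[], entity: string, asOf: string): string[] {
  const cutoff = Date.parse(asOf);
  return edges
    .filter((e) => e.from === entity)
    .filter((e) => !(Date.parse(e.validFrom) > cutoff)) // validFrom on or before cutoff
    .map((e) => e.to);
}
```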

&lt;p&gt;Mem0 charges $249/month for their knowledge graph. Mnemosyne's is built in and ships free under the MIT license.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Comparison at Scale
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memories/month&lt;/th&gt;
&lt;th&gt;Mnemosyne&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;Zep&lt;/th&gt;
&lt;th&gt;Cognee&lt;/th&gt;
&lt;th&gt;Letta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10K&lt;/td&gt;
&lt;td&gt;~$30&lt;/td&gt;
&lt;td&gt;~$130-330&lt;/td&gt;
&lt;td&gt;~$70-220&lt;/td&gt;
&lt;td&gt;~$140-540&lt;/td&gt;
&lt;td&gt;~$130-530&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;~$60&lt;/td&gt;
&lt;td&gt;~$1K-3K&lt;/td&gt;
&lt;td&gt;~$1K-2K&lt;/td&gt;
&lt;td&gt;~$1K-5K&lt;/td&gt;
&lt;td&gt;~$1K-5K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;~$250&lt;/td&gt;
&lt;td&gt;~$10K-30K&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;~$10K-50K&lt;/td&gt;
&lt;td&gt;~$10K-50K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference is entirely the per-memory LLM processing cost that Mnemosyne eliminates. Infrastructure costs (Qdrant, Redis, FalkorDB) are roughly equivalent across all systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Count Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;Knowledge Graph&lt;/th&gt;
&lt;th&gt;Multi-Agent&lt;/th&gt;
&lt;th&gt;Self-Improving&lt;/th&gt;
&lt;th&gt;Cost/Memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mnemosyne&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;33&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (built-in)&lt;/td&gt;
&lt;td&gt;Full mesh&lt;/td&gt;
&lt;td&gt;Yes (RL + consolidation)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mem0&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;$249/mo&lt;/td&gt;
&lt;td&gt;Enterprise only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zep&lt;/td&gt;
&lt;td&gt;~3&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cognee&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangMem&lt;/td&gt;
&lt;td&gt;~0&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Letta&lt;/td&gt;
&lt;td&gt;~4&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;mnemosy-ai
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 6333:6333 qdrant/qdrant  &lt;span class="c"&gt;# Only hard requirement&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createMnemosyne&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mnemosy-ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createMnemosyne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;vectorDbUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;embeddingUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:11434/v1/embeddings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;agentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;User prefers TypeScript and dark mode&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user preferences&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with just Qdrant (vector-only mode). Add FalkorDB for the knowledge graph. Add Redis for multi-agent mesh. Every feature is independently toggleable — adopt progressively.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Didn't Build
&lt;/h2&gt;

&lt;p&gt;To be honest about scope: Mnemosyne doesn't have a managed cloud offering (you run your own infra). It's TypeScript-only (the AI/ML ecosystem is mostly Python). It doesn't have 41K GitHub stars (Mem0 earned those). And its algorithmic entity extraction won't catch the implicit relationships that Cognee's LLM-powered extraction finds.&lt;/p&gt;

&lt;p&gt;These are real trade-offs. Mnemosyne is purpose-built for teams that need cognitive intelligence, multi-agent collaboration, zero-LLM economics, and self-improving memory — and are willing to run their own infrastructure in exchange.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/28naem-del/mnemosyne" rel="noopener noreferrer"&gt;github.com/28naem-del/mnemosyne&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm&lt;/strong&gt;: &lt;code&gt;npm install mnemosy-ai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website&lt;/strong&gt;: &lt;a href="https://mnemosy.ai" rel="noopener noreferrer"&gt;mnemosy.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discord&lt;/strong&gt;: &lt;a href="https://discord.gg/Sp6ZXD3X" rel="noopener noreferrer"&gt;discord.gg/Sp6ZXD3X&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;33 features. 5 cognitive layers. $0 per memory stored. The brain your agents are missing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Mnemosyne — Because intelligence without memory isn't intelligence.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>We Built the First AI Agent Memory System With Zero LLM Calls — Here's the Architecture</title>
      <dc:creator>Mnemosy </dc:creator>
      <pubDate>Tue, 24 Feb 2026 21:56:23 +0000</pubDate>
      <link>https://dev.to/mnemosybrain/we-built-the-first-ai-agent-memory-system-with-zero-llm-calls-heres-the-architecture-feb</link>
      <guid>https://dev.to/mnemosybrain/we-built-the-first-ai-agent-memory-system-with-zero-llm-calls-heres-the-architecture-feb</guid>
      <description>&lt;h1&gt;
  
  
  We Built the First AI Agent Memory System With Zero LLM Calls
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Every AI memory system on the market makes the same architectural choice: send your text to an LLM for extraction before storing it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mem0 calls GPT-4o. Zep makes multiple async LLM calls. Cognee uses LLMs for knowledge extraction. Letta's entire memory engine &lt;em&gt;is&lt;/em&gt; an LLM.&lt;/p&gt;

&lt;p&gt;That means every single &lt;code&gt;memory.store()&lt;/code&gt; costs ~$0.01, takes 500ms-2s, and produces non-deterministic results. At 100K memories/month, you're paying $1,000-3,000 just to &lt;em&gt;remember things&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We asked: what if you didn't need an LLM at all?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The result is &lt;a href="https://github.com/28naem-del/mnemosyne" rel="noopener noreferrer"&gt;Mnemosyne&lt;/a&gt; — the first cognitive memory OS for AI agents with &lt;strong&gt;zero LLM calls&lt;/strong&gt; in the entire ingestion pipeline. 33 features, 5 cognitive layers, $0 per memory stored. MIT licensed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Table Nobody Wants You to See
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;LLM Required?&lt;/th&gt;
&lt;th&gt;Cost per memory&lt;/th&gt;
&lt;th&gt;100K memories/mo&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mnemosyne&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~$60&lt;/strong&gt; (infra only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mem0&lt;/td&gt;
&lt;td&gt;Yes (GPT-4o)&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;td&gt;$1,000-3,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zep&lt;/td&gt;
&lt;td&gt;Yes (multiple calls)&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;td&gt;$1,000-2,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cognee&lt;/td&gt;
&lt;td&gt;Yes (extraction)&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;td&gt;$1,000-5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Letta/MemGPT&lt;/td&gt;
&lt;td&gt;Yes (core engine)&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;td&gt;$1,000-5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't a criticism of these projects — Mem0 has 41K stars and popularized this entire space. But the LLM-in-the-loop architecture has fundamental trade-offs that nobody talks about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Problems With LLM-Powered Memory
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Non-deterministic behavior
&lt;/h3&gt;

&lt;p&gt;The same input can produce different extracted facts on different runs. Your memory system's behavior changes when the model updates. In production, you need memory that behaves consistently.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Latency floor
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;store()&lt;/code&gt; requires an LLM API call — 500ms to 2 seconds minimum. When your agent processes 100 memories per session, that's 50-200 seconds of just waiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Linear cost scaling
&lt;/h3&gt;

&lt;p&gt;At $0.01 per memory, 1M memories = $10,000. Per month. With no efficiency gains at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  How We Eliminated Every LLM Call
&lt;/h2&gt;

&lt;p&gt;Mnemosyne's 12-step ingestion pipeline is 100% algorithmic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Security Filter&lt;/strong&gt; — blocks API keys, credentials, secrets (regex patterns)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt; — local vectors via Ollama (nomic-embed-text)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup &amp;amp; Merge&lt;/strong&gt; — cosine similarity ≥0.92 = duplicate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity Extraction&lt;/strong&gt; — people, IPs, technologies, dates (pattern matching)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type Classification&lt;/strong&gt; — 7 types: episodic, semantic, procedural, preference, relationship, profile, core&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Urgency Detection&lt;/strong&gt; — critical / important / reference / background&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Classification&lt;/strong&gt; — technical / personal / project / knowledge / general&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority Scoring&lt;/strong&gt; — urgency × domain composite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence Rating&lt;/strong&gt; — 3-signal composite with human-readable tiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Storage&lt;/strong&gt; — 23-field metadata per memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Linking&lt;/strong&gt; — bidirectional links to related memories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mesh Broadcast&lt;/strong&gt; — published to agent network&lt;/li&gt;
&lt;/ol&gt;
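&lt;p&gt;Every step above is plain computation. As a rough sketch (not the library's actual internals), the dedup step (step 3) reduces to a cosine-similarity check of the new embedding against existing ones, using the 0.92 threshold from the list; function and variable names here are illustrative:&lt;/p&gt;

```typescript
// Sketch of the dedup step (step 3): two memories whose embedding
// vectors have cosine similarity of at least 0.92 are treated as
// duplicates and merged instead of being stored twice.
const DUP_THRESHOLD = 0.92;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns true if the candidate embedding is close enough to any
// stored embedding to count as a duplicate.
function isDuplicate(candidate: number[], existing: number[][]): boolean {
  return existing.some(v => cosineSimilarity(candidate, v) >= DUP_THRESHOLD);
}
```

&lt;p&gt;Because this is pure arithmetic, it is deterministic and runs in microseconds, which is exactly what makes a zero-LLM pipeline viable.&lt;/p&gt;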

&lt;p&gt;&lt;strong&gt;Total time: &amp;lt;50ms. LLM calls: 0. Cost: $0.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The trade-off is real: LLM-based extraction catches implicit relationships that rule-based extractors miss. But for the vast majority of agent memory — where entities are explicit and speed matters — the algorithmic approach dominates.&lt;/p&gt;

&lt;h2&gt;
  
  
  But It's Not Just a Vector Wrapper
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Mnemosyne implements &lt;strong&gt;10 cognitive capabilities&lt;/strong&gt; that previously only existed in research papers:&lt;/p&gt;

&lt;h3&gt;
  
  
  🧠 Activation Decay
&lt;/h3&gt;

&lt;p&gt;Memories fade over time following the Ebbinghaus forgetting curve. Critical memories survive months. Background observations fade in hours. Procedural memories (like runbooks) &lt;strong&gt;never decay&lt;/strong&gt; — just as you never forget how to ride a bike.&lt;/p&gt;
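&lt;p&gt;A minimal sketch of what such a decay rule could look like, assuming an exponential Ebbinghaus-style curve R = e&lt;sup&gt;-t/S&lt;/sup&gt; where the stability S depends on urgency. The constants below are illustrative, not Mnemosyne's real parameters:&lt;/p&gt;

```typescript
// Hypothetical activation-decay sketch: retention R = exp(-t / S),
// where S (stability) depends on urgency, and procedural memories
// are exempt from decay entirely.
type Urgency = "critical" | "important" | "reference" | "background";

// Illustrative stability constants, in hours.
const STABILITY_HOURS: Record<Urgency, number> = {
  critical: 24 * 90,   // survives months
  important: 24 * 14,
  reference: 24 * 3,
  background: 6,       // fades within hours
};

// Activation in [0, 1] given a memory's urgency and age.
function activation(urgency: Urgency, ageHours: number, isProcedural: boolean): number {
  if (isProcedural) return 1; // runbooks never decay
  return Math.exp(-ageHours / STABILITY_HOURS[urgency]);
}
```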

&lt;h3&gt;
  
  
  ⚡ Flash Reasoning
&lt;/h3&gt;

&lt;p&gt;Query "why did auth crash?" and get the full chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Config changed → JWT expiry shortened → Session storm → Rollback fixed it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One recall. Complete narrative. BFS through linked memory graphs.&lt;/p&gt;
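&lt;p&gt;Conceptually, that chain is a breadth-first search over bidirectional memory links. A minimal sketch, using an illustrative data shape rather than the real schema:&lt;/p&gt;

```typescript
// Flash-reasoning sketch: BFS over a linked memory graph, starting
// from one memory and collecting the causal chain in visit order.
type MemoryId = string;

// Illustrative link graph for the auth-crash example above.
const links: Record<MemoryId, MemoryId[]> = {
  "config-changed": ["jwt-expiry-shortened"],
  "jwt-expiry-shortened": ["session-storm"],
  "session-storm": ["rollback-fixed-it"],
  "rollback-fixed-it": [],
};

function flashChain(start: MemoryId): MemoryId[] {
  const chain: MemoryId[] = [];
  const seen = new Set<MemoryId>();
  const queue: MemoryId[] = [start];
  while (queue.length > 0) {
    const id = queue.shift()!;
    if (seen.has(id)) continue;
    seen.add(id);
    chain.push(id);
    for (const next of links[id] ?? []) queue.push(next);
  }
  return chain;
}
```

&lt;p&gt;The &lt;code&gt;seen&lt;/code&gt; set matters: real memory graphs have cycles (links are bidirectional), and BFS without it would loop forever.&lt;/p&gt;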

&lt;h3&gt;
  
  
  🤝 Theory of Mind for Agents
&lt;/h3&gt;

&lt;p&gt;Agent A can ask "what does Agent B know about the database?" without talking to Agent B. Modeled after developmental psychology research (Baron-Cohen, 1985).&lt;/p&gt;

&lt;h3&gt;
  
  
  📈 Reinforcement Learning
&lt;/h3&gt;

&lt;p&gt;Memories that consistently help → auto-promoted to permanent. Bad memories → flagged. Your memory system gets smarter through use, not manual curation.&lt;/p&gt;
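&lt;p&gt;One plausible shape for that loop — thresholds, field names, and the scoring rule are assumptions for illustration, not the actual implementation:&lt;/p&gt;

```typescript
// Hypothetical reinforcement sketch: positive feedback raises a
// memory's score until it is promoted to permanent; negative feedback
// lowers it until the memory is flagged for review.
interface ScoredMemory {
  score: number;
  permanent: boolean;
  flagged: boolean;
}

const PROMOTE_AT = 5;  // illustrative promotion threshold
const FLAG_AT = -3;    // illustrative flagging threshold

function applyFeedback(m: ScoredMemory, signal: "positive" | "negative"): ScoredMemory {
  const score = m.score + (signal === "positive" ? 1 : -1);
  return {
    score,
    permanent: m.permanent || score >= PROMOTE_AT, // promotion is sticky
    flagged: !m.permanent && score <= FLAG_AT,
  };
}
```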

&lt;h3&gt;
  
  
  🔗 Knowledge Graph (Built-in, Free)
&lt;/h3&gt;

&lt;p&gt;Temporal entity graph with auto-linking, path finding, and timeline reconstruction. Mem0 charges $249/month for its knowledge graph. Ours ships under the MIT license.&lt;/p&gt;

&lt;h3&gt;
  
  
  🌐 Multi-Agent Mesh
&lt;/h3&gt;

&lt;p&gt;When 3+ agents independently confirm the same fact, it's automatically promoted to "Mesh Fact" — the highest confidence tier. Real distributed consensus for AI knowledge.&lt;/p&gt;
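&lt;p&gt;The promotion rule itself is simple quorum counting over distinct agents. A hedged sketch (tier names mirror the article; the logic is illustrative):&lt;/p&gt;

```typescript
// Mesh-fact promotion sketch: a fact independently confirmed by three
// or more distinct agents is promoted to the highest confidence tier.
const MESH_QUORUM = 3;

function confidenceTier(confirmingAgents: Set<string>): "mesh-fact" | "unverified" {
  // A Set deduplicates agent IDs, so repeated confirmations from the
  // same agent cannot fake a quorum.
  return confirmingAgents.size >= MESH_QUORUM ? "mesh-fact" : "unverified";
}
```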

&lt;h2&gt;
  
  
  33 Features, 5 Layers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L5  SELF-IMPROVEMENT
    Reinforcement · Consolidation · Flash Reasoning · ToMA · Synthesis

L4  COGNITIVE
    Activation Decay · Confidence · Priority · Diversity Reranking

L3  KNOWLEDGE GRAPH
    Temporal Graph · Auto-Linking · Path Traversal · Entity Extraction

L2  PIPELINE
    Security Filter · Classify · Dedup · Merge · 12-step Ingestion

L1  INFRASTRUCTURE
    Qdrant · FalkorDB · Redis Cache · Redis Pub/Sub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every feature is independently toggleable. Start with just Qdrant, progressively enable as needed.&lt;/p&gt;
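&lt;p&gt;In sketch form: &lt;code&gt;vectorDbUrl&lt;/code&gt; matches the quick-start config key, while &lt;code&gt;graphDbUrl&lt;/code&gt;, &lt;code&gt;redisUrl&lt;/code&gt;, and the layer names are assumptions for illustration, not the library's real config surface:&lt;/p&gt;

```typescript
// Illustrative progressive-enablement sketch: only the vector store is
// mandatory; each optional backend unlocks an additional layer.
interface MnemosyneConfig {
  vectorDbUrl: string;   // required: Qdrant
  graphDbUrl?: string;   // optional: FalkorDB knowledge graph
  redisUrl?: string;     // optional: cache plus pub/sub mesh
}

function enabledLayers(cfg: MnemosyneConfig): string[] {
  const layers = ["pipeline"]; // always on once the vector store is up
  if (cfg.graphDbUrl) layers.push("knowledge-graph");
  if (cfg.redisUrl) layers.push("mesh");
  return layers;
}
```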

&lt;h2&gt;
  
  
  Running in Production Right Now
&lt;/h2&gt;

&lt;p&gt;This isn't a demo. It's running on a 10-machine AI agent mesh:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;13,000+ memories&lt;/strong&gt; stored&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;50ms&lt;/strong&gt; ingestion (full 12-step pipeline)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;200ms&lt;/strong&gt; recall (multi-signal ranked, graph-enriched)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt;60%&lt;/strong&gt; cache hit rate in conversational workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 agents&lt;/strong&gt; collaborating with shared memory&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start (2 minutes)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;mnemosy-ai
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 6333:6333 qdrant/qdrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createMnemosyne&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mnemosy-ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createMnemosyne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;vectorDbUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;embeddingUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:11434/v1/embeddings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;agentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Full 12-step pipeline, &amp;lt;50ms, $0&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;User prefers dark mode and TypeScript&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Multi-signal ranked recall&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user preferences&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Memories learn from feedback&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only hard requirement: Qdrant. Redis and FalkorDB are optional power-ups.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;We're not claiming Mnemosyne is better at everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mem0&lt;/strong&gt; has 41K stars, a great community, and a production-hardened cloud offering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognee&lt;/strong&gt; builds richer knowledge graphs via LLM extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Letta's&lt;/strong&gt; LLM-directed memory management is genuinely innovative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mnemosyne fills a specific niche: if you need cognitive intelligence + multi-agent collaboration + zero-LLM economics + self-improving memory — all in one open-source system — it's currently the only option.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bet
&lt;/h2&gt;

&lt;p&gt;We're betting that the future of AI memory is &lt;strong&gt;deterministic, local-first, and free at the point of storage&lt;/strong&gt;. That cognitive capabilities don't require sending every memory through GPT-4. That you can build a brain without renting someone else's.&lt;/p&gt;

&lt;p&gt;13,000 memories and counting. Zero LLM calls. The math speaks for itself.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/28naem-del/mnemosyne" rel="noopener noreferrer"&gt;github.com/28naem-del/mnemosyne&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;code&gt;npm install mnemosy-ai&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://mnemosy.ai" rel="noopener noreferrer"&gt;mnemosy.ai&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Discord:&lt;/strong&gt; &lt;a href="https://discord.gg/Sp6ZXD3X" rel="noopener noreferrer"&gt;discord.gg/Sp6ZXD3X&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mnemosyne — Because intelligence without memory isn't intelligence.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
