DEV Community

Mnemosy


We Built the First AI Agent Memory System With Zero LLM Calls — Here's the Architecture

Why AI Agents Need Brains, Not Just Vector Databases

Every AI agent shipping today has a fundamental problem: amnesia.

Load up any agent framework — LangChain, CrewAI, AutoGen, custom builds — and start a conversation. Ask it about your project. It knows nothing. Give it context across 50 turns. Then watch the context window compact. It knows nothing again.

This isn't a minor UX issue. It's the single biggest bottleneck to autonomous AI. Agents can't learn from mistakes if they don't remember making them. They can't build expertise if every session starts from scratch. They can't collaborate if they can't share what they know.

The industry's response has been to wrap vector databases with LLM-powered extraction layers. Send text to GPT-4, extract key facts, store as vectors, retrieve by similarity. Systems like Mem0, Zep, Cognee, and Letta have raised ~$47M combined doing variations of this approach.

It works for demos. It doesn't work for production. Here's why.

The Problem with LLM-in-the-Loop Memory

When you put an LLM in your memory ingestion pipeline, you inherit three structural problems:

1. Non-deterministic behavior. The same input can produce different extracted facts on different runs. Your memory system's behavior changes when the model version changes, when the prompt drifts, when the temperature fluctuates. In production, you need memory that behaves consistently.

2. Latency floor. Every memory store operation requires an LLM API call — 500ms to 2 seconds minimum. When your agent processes 100 memories per session, that's 50-200 seconds of just waiting for extraction. For real-time agent interactions, this is unacceptable.

3. Linear cost scaling. At approximately $0.01 per memory, storing 100K memories costs $1,000 per month; a million memories costs $10,000 per month. The cost scales linearly, with no efficiency gains as volume grows. For production systems processing tens of thousands of interactions daily, the economics are brutal.

These aren't implementation bugs. They're architectural consequences of the LLM-in-the-loop design.

What If Memory Worked Like a Brain?

We spent months running an AI agent mesh across 10 machines — 10 agents collaborating on real tasks, 13,000+ memories accumulated, sub-200ms retrieval requirements. The vector-store-plus-LLM approach broke down immediately. We needed something fundamentally different.

So we built Mnemosyne: a 5-layer cognitive memory operating system for AI agents. Not another vector wrapper. An actual memory architecture inspired by how biological memory systems work — from the neural substrate up to metacognition.

+----------------------------------------------------------------------+
|                      MNEMOSYNE COGNITIVE OS                          |
|                                                                      |
|  L5  SELF-IMPROVEMENT                                                |
|  [ Reinforcement ] [ Consolidation ] [ Flash Reasoning ] [ ToMA ]    |
|                                                                      |
|  L4  COGNITIVE                                                       |
|  [ Activation Decay ] [ Confidence ] [ Priority ] [ Diversity ]      |
|                                                                      |
|  L3  KNOWLEDGE GRAPH                                                 |
|  [ Temporal Graph ] [ Auto-Linking ] [ Path Traversal ] [ Entities ] |
|                                                                      |
|  L2  PIPELINE                                                        |
|  [ Extraction ] [ Classify ] [ Dedup & Merge ] [ Security Filter ]   |
|                                                                      |
|  L1  INFRASTRUCTURE                                                  |
|  [ Qdrant ] [ FalkorDB ] [ Redis Cache ] [ Redis Pub/Sub ]           |
+----------------------------------------------------------------------+

33 features across 5 layers. Every feature independently toggleable. MIT licensed. TypeScript.

Zero LLM Calls: The Core Design Bet

The most controversial architectural decision in Mnemosyne: the entire ingestion pipeline runs without any LLM calls.

Every memory passes through a deterministic 12-step pipeline:

  1. Security Filter — 3-tier classification blocks API keys, credentials, private keys
  2. Embedding — 768-dim vectors via any OpenAI-compatible endpoint
  3. Dedup & Merge — Cosine ≥0.92 = duplicate (merge). 0.70–0.92 = conflict (alert).
  4. Entity Extraction — People, IPs, technologies, dates, URLs. Algorithmic, not LLM.
  5. Type Classification — 7 types: episodic, semantic, preference, procedural, relationship, profile, core
  6. Urgency Detection — 4 levels: critical, important, reference, background
  7. Domain Classification — 5 domains: technical, personal, project, knowledge, general
  8. Priority Scoring — Urgency × domain composite (0.0–1.0)
  9. Confidence Rating — 3-signal composite with 4 human-readable tiers
  10. Vector Storage — Written to appropriate collection with 23-field metadata
  11. Auto-Linking — Bidirectional links to related memories (Zettelkasten-style)
  12. Broadcast — Published to agent mesh via typed channels

Total time: <50ms. LLM calls: 0. Cost: $0.
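
Step 3's thresholds can be sketched in a few lines. This is an illustrative sketch of the decision logic implied by the numbers above, not Mnemosyne's actual implementation:

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

type DedupAction = "merge" | "conflict" | "store";

// Thresholds from the pipeline description: >=0.92 duplicate, 0.70-0.92 conflict.
function dedupAction(incoming: number[], nearestNeighbor: number[]): DedupAction {
  const sim = cosine(incoming, nearestNeighbor);
  if (sim >= 0.92) return "merge";     // near-duplicate: merge into existing memory
  if (sim >= 0.70) return "conflict";  // related but divergent: raise an alert
  return "store";                      // distinct: store as a new memory
}
```

Because the decision is a pure function of two vectors, the same input always produces the same action — the consistency property the LLM-in-the-loop design can't guarantee.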

import { createMnemosyne } from 'mnemosy-ai'

const m = await createMnemosyne({
  vectorDbUrl: 'http://localhost:6333',
  embeddingUrl: 'http://localhost:11434/v1/embeddings',
  agentId: 'my-agent'
})

// Full 12-step pipeline, <50ms, $0
await m.store({ text: "CRITICAL: Auth service JWT expiry changed from 1hr to 30min" })
// -> type: semantic, urgency: critical, domain: technical
// -> priority: 1.0, entities: [Auth service, JWT, 1hr, 30min]
// -> auto-linked to 2 existing JWT memories
// -> broadcast to agent mesh with critical priority

The trade-off is real: LLM-based extraction catches implicit relationships and nuanced semantic structure that algorithmic extraction misses. Cognee's LLM-powered graph construction builds richer knowledge graphs for document corpora. But for the vast majority of agent memory operations — where entities are explicit, facts are stated directly, and you need speed, consistency, and zero cost — the algorithmic approach dominates.

Cognitive Features That Only Exist in Papers

Here's where it gets interesting. Beyond the pipeline, Mnemosyne implements 10 capabilities that previously existed only in academic research:

Activation Decay

Memories fade over time following a logarithmic model inspired by the Ebbinghaus forgetting curve. Critical memories stay active for months. Background observations fade within hours. Procedural memories (runbooks, deployment steps) are immune to decay — like how you never forget how to ride a bike.

// Critical memory: stays active for months
await m.store({ text: "CRITICAL: Never deploy to prod on Fridays" })
// -> decay rate: 0.3, baseline: +2.0

// Background memory: fades within hours
await m.store({ text: "User mentioned they had coffee this morning" })
// -> decay rate: 0.8, baseline: -1.0

// Procedural memory: immune to decay forever
await m.store({ text: "To deploy: 1) Run tests 2) Build 3) Push 4) Apply" })
// -> type: procedural, activation: permanent
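
A minimal sketch of what a log-shaped forgetting model with those parameters could look like — the formula and the activation threshold here are my assumptions, not Mnemosyne's internals:

```typescript
interface Memory {
  type: "procedural" | "semantic" | "episodic";
  baseline: number;   // e.g. +2.0 for critical, -1.0 for background
  decayRate: number;  // e.g. 0.3 (slow) to 0.8 (fast)
}

// Assumed model: activation = baseline - decayRate * ln(1 + hoursSinceAccess).
// Logarithmic decay means activation falls quickly at first, then flattens,
// echoing the Ebbinghaus curve mentioned above.
function activation(mem: Memory, hoursSinceAccess: number): number {
  if (mem.type === "procedural") return Infinity; // immune to decay
  return mem.baseline - mem.decayRate * Math.log(1 + hoursSinceAccess);
}

// A memory is treated as inactive once activation drops below a cutoff
// (the -3 cutoff is illustrative).
const isActive = (mem: Memory, hours: number, threshold = -3): boolean =>
  activation(mem, hours) > threshold;
```

With these numbers, a critical memory (baseline +2.0, rate 0.3) is still above the cutoff after 30 days, while a background memory (baseline -1.0, rate 0.8) falls below it within a day.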

Multi-Signal Scoring with Intent Detection

Every recall query is scored across 5 independent signals — not just cosine similarity:

| Signal | Weight | What it measures |
| --- | --- | --- |
| Semantic similarity | 35% | Vector distance |
| Temporal recency | 20% | Time since last access |
| Importance × confidence | 20% | Priority score × confidence |
| Access frequency | 15% | How often retrieved (log scale) |
| Type relevance | 10% | Memory type vs. query intent |

Mnemosyne auto-detects 5 query intents (factual, temporal, procedural, preference, exploratory) and dynamically adjusts these weights. A temporal query ("what happened recently?") boosts recency to 35%. A procedural query ("how do I deploy?") boosts frequency and type relevance.
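
The weight adjustment can be sketched as a per-intent override of the base table. The base weights come from the table above; the temporal boost to 35% is stated in the article, while the procedural deltas are my illustrative assumption:

```typescript
type Intent = "factual" | "temporal" | "procedural" | "preference" | "exploratory";

interface Weights {
  similarity: number; recency: number; importance: number;
  frequency: number; typeRelevance: number;
}

// Base weights from the scoring table.
const BASE: Weights = {
  similarity: 0.35, recency: 0.20, importance: 0.20,
  frequency: 0.15, typeRelevance: 0.10,
};

function weightsFor(intent: Intent): Weights {
  const w = { ...BASE };
  if (intent === "temporal") {
    // "what happened recently?" -> recency boosted to 35%
    w.recency = 0.35; w.similarity = 0.20;
  } else if (intent === "procedural") {
    // "how do I deploy?" -> frequency and type relevance boosted (assumed deltas)
    w.similarity = 0.25; w.recency = 0.10; w.frequency = 0.25; w.typeRelevance = 0.20;
  }
  return w; // other intents keep the base weights in this sketch
}

// Final score: weighted sum of the five normalized signals.
function score(signals: Weights, intent: Intent): number {
  const w = weightsFor(intent);
  return (Object.keys(w) as (keyof Weights)[])
    .reduce((sum, k) => sum + w[k] * signals[k], 0);
}
```

Note that every override keeps the weights summing to 1, so scores stay comparable across intents.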

Flash Reasoning

BFS traversal through linked memory graphs that reconstructs multi-step logic chains:

const results = await m.recall({ query: "why did auth service crash?" })
// Primary: "Auth service crashed after config update"
// Chain: -> (because) "Config changed JWT expiry from 1hr to 30min"
//        -> (leads_to) "Short-lived tokens caused session storm"
//        -> (therefore) "Rollback to 1hr expiry resolved the issue"

Your agent gets the complete narrative from a single recall.
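
The traversal itself is a plain breadth-first search over typed links. A sketch under assumed data shapes (the link graph and relation names are illustrative, not the library's types):

```typescript
interface Link { to: string; relation: string } // e.g. "because", "leads_to", "therefore"
type LinkGraph = Map<string, Link[]>;

// BFS from the primary memory, collecting each newly reached memory with the
// relation that led to it. Depth is capped to keep chains short and relevant.
function reasoningChain(graph: LinkGraph, start: string, maxDepth = 4): string[] {
  const chain: string[] = [];
  const visited = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const id of frontier) {
      for (const { to, relation } of graph.get(id) ?? []) {
        if (visited.has(to)) continue; // avoid cycles in the memory graph
        visited.add(to);
        chain.push(`(${relation}) ${to}`);
        next.push(to);
      }
    }
    frontier = next;
  }
  return chain;
}
```

Running it on the auth-crash example reconstructs the three-hop chain shown above from a single starting memory.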

Theory of Mind for Agents

In a multi-agent mesh, any agent can model what other agents know:

// What does the DevOps agent know about the production database?
const knowledge = await m.toma("devops-agent", "production database")

// Knowledge gap analysis
const gap = await m.knowledgeGap("frontend-agent", "backend-agent", "API contracts")
// -> { onlyFrontendKnows: [...], onlyBackendKnows: [...], bothKnow: [...] }

This concept comes from developmental psychology (Baron-Cohen et al., 1985) and multi-agent systems research (Gmytrasiewicz & Doshi, 2005). To our knowledge, it had never before been deployed as production memory infrastructure.

Cross-Agent Synthesis

When 3+ agents independently store corroborating memories about the same fact, it's automatically promoted to "Mesh Fact" — the highest confidence tier. Independent corroboration from separate agents operating in different contexts is strong evidence of factual accuracy.
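
The promotion rule reduces to counting distinct corroborating agents. A sketch under assumed shapes (the tier names and `Observation` type are mine):

```typescript
interface Observation { agentId: string; factKey: string }

// A fact independently stored by 3+ distinct agents is promoted to the top
// confidence tier. Repeated observations from the same agent don't count,
// since corroboration must be independent.
function confidenceTier(observations: Observation[], factKey: string): string {
  const agents = new Set(
    observations.filter(o => o.factKey === factKey).map(o => o.agentId)
  );
  return agents.size >= 3 ? "mesh-fact" : "standard";
}
```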

Reinforcement Learning on Memory

Feedback closes the loop. Memories that consistently prove useful are promoted to core status (immune to decay). Memories that consistently mislead are flagged for review. Over time, retrieval quality improves without manual curation.

await m.recall({ query: "database config" })
// Agent uses the result successfully...
await m.feedback("positive")
// After 3+ retrievals with >70% positive ratio → auto-promoted to core
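
The promotion rule stated in the comment is a one-liner. A sketch, assuming a simple per-memory feedback counter:

```typescript
interface FeedbackLog { retrievals: number; positive: number }

// Promote to core status after 3+ retrievals with a >70% positive ratio,
// per the rule described above.
function shouldPromoteToCore(log: FeedbackLog): boolean {
  return log.retrievals >= 3 && log.positive / log.retrievals > 0.7;
}
```

The minimum-retrieval floor matters: without it, a single lucky hit (1/1 = 100% positive) would promote memories on no real evidence.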

The Knowledge Graph: Built-In, Free, Temporal

Mnemosyne includes a temporal knowledge graph powered by FalkorDB. Every entity extracted from memories becomes a graph node. Relationships carry timestamps. The graph grows automatically as memories are stored.

  • Auto-linking: Related memories are bidirectionally connected (Zettelkasten-style)
  • Path finding: "How is Alice related to PostgreSQL?" → Alice → deployed auth service → auth service uses → PostgreSQL
  • Timeline reconstruction: Chronological history of everything known about any entity
  • Temporal queries: "What was server-1 connected to as of January 15th?"
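
An "as of" query boils down to filtering edges by their validity interval. A sketch under an assumed edge shape (Mnemosyne stores this in FalkorDB; the in-memory structure here is purely illustrative):

```typescript
interface TemporalEdge {
  from: string; to: string; relation: string;
  validFrom: number;   // epoch ms when the relationship began
  validTo?: number;    // undefined = still valid
}

// "What was `node` connected to as of time t?" -> keep edges touching the
// node whose validity interval covers t.
function connectionsAsOf(edges: TemporalEdge[], node: string, t: number): TemporalEdge[] {
  return edges.filter(e =>
    (e.from === node || e.to === node) &&
    e.validFrom <= t &&
    (e.validTo === undefined || t <= e.validTo)
  );
}
```

Because edges are never overwritten — only closed with a `validTo` timestamp — the graph retains its full history, which is what makes timeline reconstruction possible.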

Mem0 charges $249/month for their knowledge graph. Mnemosyne's is built in and ships under the MIT license.

Cost Comparison at Scale

| Memories/month | Mnemosyne | Mem0 | Zep | Cognee | Letta |
| --- | --- | --- | --- | --- | --- |
| 10K | ~$30 | ~$130-330 | ~$70-220 | ~$140-540 | ~$130-530 |
| 100K | ~$60 | ~$1K-3K | ~$1K-2K | ~$1K-5K | ~$1K-5K |
| 1M | ~$250 | ~$10K-30K | ~$10K-50K | ~$10K-50K | |

The difference is entirely the per-memory LLM processing cost that Mnemosyne eliminates. Infrastructure costs (Qdrant, Redis, FalkorDB) are roughly equivalent across all systems.

Feature Count Comparison

| System | Features | Knowledge Graph | Multi-Agent | Self-Improving | Cost/Memory |
| --- | --- | --- | --- | --- | --- |
| Mnemosyne | 33 | Free (built-in) | Full mesh | Yes (RL + consolidation) | $0 |
| Mem0 | ~5 | $249/mo | Enterprise only | No | ~$0.01 |
| Zep | ~3 | None | None | No | ~$0.01 |
| Cognee | ~5 | Built-in | None | No | ~$0.01 |
| LangMem | ~0 | None | None | No | ~$0.01 |
| Letta | ~4 | None | Basic | No | ~$0.01 |

Getting Started

npm install mnemosy-ai
docker run -d -p 6333:6333 qdrant/qdrant  # Only hard requirement
import { createMnemosyne } from 'mnemosy-ai'

const m = await createMnemosyne({
  vectorDbUrl: 'http://localhost:6333',
  embeddingUrl: 'http://localhost:11434/v1/embeddings',
  agentId: 'my-agent'
})

await m.store({ text: "User prefers TypeScript and dark mode" })
const memories = await m.recall({ query: "user preferences" })
await m.feedback("positive")

Start with just Qdrant (vector-only mode). Add FalkorDB for the knowledge graph. Add Redis for multi-agent mesh. Every feature is independently toggleable — adopt progressively.

What We Didn't Build

To be honest about scope: Mnemosyne doesn't have a managed cloud offering (you run your own infra). It's TypeScript-only (the AI/ML ecosystem is mostly Python). It doesn't have 41K GitHub stars (Mem0 earned those). And its algorithmic entity extraction won't catch the implicit relationships that Cognee's LLM-powered extraction finds.

These are real trade-offs. Mnemosyne is purpose-built for teams that need cognitive intelligence, multi-agent collaboration, zero-LLM economics, and self-improving memory — and are willing to run their own infrastructure in exchange.

Try It

33 features. 5 cognitive layers. $0 per memory stored. The brain your agents are missing.


Mnemosyne — Because intelligence without memory isn't intelligence.
