I built a structured memory system for AI called Alma. This post explains the architecture, not the marketing.
The problem, technically
Current AI memory implementations (claude.md, .cursorrules, ChatGPT Memory) share these limitations:
- No schema. All data is unstructured text. No types, no fields, no queryable metadata.
- No weighting. Every piece of information has equal priority in the context window.
- No automatic extraction. The user manually maintains the memory.
- No deduplication. Similar information accumulates without merging.
- No separation of concerns. Identity, style preferences, and session context are mixed.
The architecture
Alma has three data layers and an assembly engine:
┌─────────────────────────────────────────────┐
│              Context Assembler              │
│  (dynamic token budget, relevance scoring)  │
├──────────┬──────────┬──────────┬────────────┤
│   Soul   │ Memories │ Episodes │ Procedures │
│  Engine  │          │          │            │
│ 13 blocks│ Weighted │ Summaries│ Behavioral │
│ Identity │ facts    │ w/ topics│ patterns   │
│ Style    │ w/ score │ outcomes │ auto-      │
│ Context  │ category │ search   │ extracted  │
└──────────┴──────────┴──────────┴────────────┘
           ↑ Background Processor ↑
           (async, every N messages)
Layer 1: Memories
Schema:
interface Memory {
  id: string;
  content: string;
  category: 'preference' | 'fact' | 'decision' | 'project' | 'general';
  importance: number;          // 0-1, determines context priority
  source: 'manual' | 'extracted' | 'extension' | 'api' | 'consolidated';
  access_count: number;        // incremented on retrieval
  reinforcement_count: number; // incremented on dedup match
  embedding: Float32Array;     // for semantic search
  created_at: string;
  last_accessed_at: string;
}
Deduplication uses Jaccard similarity on keyword sets with a 60% threshold and a 3-keyword minimum. Above the threshold, the existing memory is reinforced (its reinforcement_count is incremented) instead of a new record being created.
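The dedup check can be sketched in a few lines. The tokenizer below (lowercase words of 3+ letters) is an assumption for illustration, not Alma's actual keyword extractor:

```typescript
// Naive keyword tokenizer (assumption): lowercase words of 3+ letters.
function keywords(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z]{3,}/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  const intersection = [...a].filter((k) => b.has(k)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

// True when a candidate should reinforce an existing memory
// instead of creating a new record.
function isDuplicate(existing: string, candidate: string): boolean {
  const a = keywords(existing);
  const b = keywords(candidate);
  // 3-keyword minimum guards against spurious matches on short strings.
  if (a.size < 3 || b.size < 3) return false;
  return jaccard(a, b) >= 0.6;
}
```

With this tokenizer, "prefers dark mode editor themes" vs. "prefers dark mode editor colors" shares 4 of 6 keywords (≈0.67), so the second reinforces the first rather than duplicating it.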
Search is hybrid: keyword (SQL FTS5) + semantic (cosine similarity on Cloudflare Vectorize embeddings). Results merged and re-ranked by a weighted score:
const WEIGHTS = {
  relevance: 0.40,  // Cosine similarity to current query
  importance: 0.30, // 0.0-1.0, extracted or user-assigned
  recency: 0.20,    // Exponential decay, 7-day half-life
  frequency: 0.10,  // Logarithmic scale of access count
};
Layer 2: Episodes
interface Episode {
  id: string;
  conversation_id: string;
  summary: string;
  topics: string[];
  outcome: string;
  message_count: number;
  embedding: Float32Array;
}
Auto-generated at conversation end. Searchable by topic, outcome, or semantic similarity.
Layer 3: Procedures
interface Procedure {
  id: string;
  content: string; // "Checks error handling first in code reviews"
  category: string;
  trigger: string; // When this pattern activates
  source: 'extracted' | 'manual';
}
Extracted by the background processor analyzing conversation patterns. These represent behavioral habits, not explicit preferences.
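A minimal sketch of trigger matching, using plain keyword overlap on the trigger field. This is an assumption for illustration (the simplified signature omits storage lookup, and the real matcher may use embeddings instead):

```typescript
type ProcedureLike = { content: string; trigger: string };

// Returns the procedures whose trigger words all appear in the message.
function matchTriggers(procedures: ProcedureLike[], message: string): ProcedureLike[] {
  const msg = message.toLowerCase();
  return procedures.filter((p) =>
    p.trigger.toLowerCase().split(/\s+/).every((word) => msg.includes(word))
  );
}
```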
Soul Engine: 13 blocks
type SoulSection = 'identity' | 'style' | 'context';

type BlockKey =
  | 'identity' | 'worldview' | 'tensions' | 'rules'
  | 'style_guide' | 'anti_patterns' | 'communication' | 'examples'
  | 'user_profile' | 'active_context' | 'learned_patterns'
  | 'scratchpad' | 'custom';

interface SoulBlock {
  key: BlockKey;
  section: SoulSection;
  content: string;
  char_limit: number;
  priority: number;
  truncation: 'head' | 'tail'; // head = keep newest, tail = keep oldest
}
Identity blocks use tail truncation (preserve oldest = core values stable). Context blocks use head truncation (trim oldest = keep fresh data). This simple mechanism creates different temporal behaviors without complex logic.
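The mechanism fits in one function, assuming block content grows by appending the newest entries at the end (a sketch, not Alma's actual code):

```typescript
// Head truncation trims the oldest text (keep newest);
// tail truncation trims the newest text (keep oldest).
function truncate(content: string, charLimit: number, mode: 'head' | 'tail'): string {
  if (content.length <= charLimit) return content;
  return mode === 'head'
    ? content.slice(content.length - charLimit) // drop the head, keep newest
    : content.slice(0, charLimit);              // drop the tail, keep oldest
}
```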
Context Assembler
async function assembleContext(userId: string, message: string): Promise<string> {
  // 1. Soul Engine — always included, highest priority
  const soul = await renderSoulBlocks(userId);

  // 2. Relevant memories — scored by semantic similarity to current message
  const memories = await searchMemories(userId, message, { mode: 'hybrid' });

  // 3. Recent episodes — for conversation continuity
  const episodes = await getRecentEpisodes(userId);

  // 4. Matching procedures — behavioral patterns
  const procedures = await matchProcedures(userId, message);

  // 5. Dynamic token budget — sections compete for space
  return buildPrompt({ soul, memories, episodes, procedures }, TOKEN_BUDGET);
}
Each section has a priority. If total tokens exceed the budget, lower-priority sections get truncated first. The Soul Engine is always preserved in full.
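The priority-ordered budgeting could look like the sketch below. The simplified signature (a flat section list), the character-counting shortcut, and the 4-chars-per-token heuristic are all assumptions, not Alma's implementation:

```typescript
interface Section {
  name: string;
  text: string;
  priority: number; // higher = preserved longer
}

function buildBudgetedPrompt(sections: Section[], tokenBudget: number): string {
  const charBudget = tokenBudget * 4; // rough chars-per-token heuristic (assumption)
  // Highest priority first, so low-priority sections absorb the cuts.
  const ordered = [...sections].sort((a, b) => b.priority - a.priority);
  let remaining = charBudget;
  const kept: string[] = [];
  for (const s of ordered) {
    if (remaining <= 0) break;
    const text = s.text.slice(0, remaining); // truncate to whatever budget is left
    kept.push(text);
    remaining -= text.length;
  }
  return kept.join('\n');
}
```

Because the Soul Engine carries the highest priority, it is consumed first and is never the section that gets cut.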
Background Processor
Fires asynchronously via ctx.waitUntil() every N messages:
- Sends recent conversation to Claude Haiku for analysis
- Receives structured JSON with extracted memories, episodes, procedures
- Deduplicates memories against existing store
- Updates relevant soul blocks (active_context, learned_patterns, user_profile)
- Stores episode summary
Zero impact on conversation latency.
Infrastructure
Entirely Cloudflare:
- Workers — API, SSE streaming, background processing
- D1 — SQLite database (56 migrations)
- Vectorize — Embedding storage and similarity search
- R2 — File uploads (images, documents)
- KV — Configuration cache
- Durable Objects — Atomic budget tracking (single-threaded counters)
No AWS. No external databases. Cold start under 5ms.
Numbers
- 1,690 passing tests across 102 files
- 56 database migrations
- 180 REST API endpoints
- 15 fully localized languages
- 6 agent tools in chat + 21 MCP tools + 9 MCP resources
Try it
Web app: alma.olivares.ai
Free tier: 500 memories, Claude Haiku, automatic learning. No credit card.
Built by Francisco @ Olivares.AI