I built a structured memory system for AI called Alma. This post explains the architecture, not the marketing.
The problem, technically
Current AI memory implementations (claude.md, .cursorrules, ChatGPT Memory) share these limitations:
- No schema. All data is unstructured text. No types, no fields, no queryable metadata.
- No weighting. Every piece of information has equal priority in the context window.
- No automatic extraction. The user manually maintains the memory.
- No deduplication. Similar information accumulates without merging.
- No separation of concerns. Identity, style preferences, and session context are mixed.
The architecture
Alma has three data layers and an assembly engine:
┌─────────────────────────────────────────────┐
│              Context Assembler              │
│  (dynamic token budget, relevance scoring)  │
├──────────┬──────────┬──────────┬────────────┤
│   Soul   │ Memories │ Episodes │ Procedures │
│  Engine  │          │          │            │
│ 13 blocks│ Weighted │ Summaries│ Behavioral │
│ Identity │ facts    │ w/ topics│ patterns   │
│ Style    │ w/ score │ outcomes │ auto-      │
│ Context  │ category │ search   │ extracted  │
└──────────┴──────────┴──────────┴────────────┘
           ↑ Background Processor ↑
           (async, every N messages)
Layer 1: Memories
Schema:
interface Memory {
  id: string;
  content: string;
  category: 'preference' | 'fact' | 'decision' | 'project' | 'general';
  importance: number;          // 0-1, determines context priority
  source: 'manual' | 'extracted' | 'extension' | 'api' | 'consolidated';
  access_count: number;        // incremented on retrieval
  reinforcement_count: number; // incremented on dedup match
  embedding: Float32Array;     // for semantic search
  created_at: string;
  last_accessed_at: string;
}
Deduplication uses Jaccard similarity on keyword sets with a 60% threshold and a 3-keyword minimum. Above the threshold, the existing memory is reinforced (its reinforcement_count is incremented) instead of a new record being created.
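The dedup check can be sketched in a few lines. The tokenizer below (lowercase words of 3+ letters) is an assumption for illustration, not Alma's actual keyword extractor:

```typescript
// Naive keyword tokenizer (assumption): lowercase words of 3+ letters.
function keywords(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z]{3,}/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  const intersection = [...a].filter((k) => b.has(k)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

// True when a candidate should reinforce an existing memory
// instead of creating a new record.
function isDuplicate(existing: string, candidate: string): boolean {
  const a = keywords(existing);
  const b = keywords(candidate);
  // 3-keyword minimum guards against spurious matches on short strings.
  if (a.size < 3 || b.size < 3) return false;
  return jaccard(a, b) >= 0.6;
}
```

With this tokenizer, "prefers dark mode editor themes" vs. "prefers dark mode editor colors" shares 4 of 6 keywords (≈0.67), so the second reinforces the first rather than duplicating it.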
Search is hybrid: keyword (SQL FTS5) + semantic (cosine similarity on Cloudflare Vectorize embeddings). Results merged and re-ranked by a weighted score:
const WEIGHTS = {
  relevance: 0.40,  // Cosine similarity to current query
  importance: 0.30, // 0.0-1.0, extracted or user-assigned
  recency: 0.20,    // Exponential decay, 7-day half-life
  frequency: 0.10,  // Logarithmic scale of access count
};
Layer 2: Episodes
interface Episode {
  id: string;
  conversation_id: string;
  summary: string;
  topics: string[];
  outcome: string;
  message_count: number;
  embedding: Float32Array;
}
Auto-generated at conversation end. Searchable by topic, outcome, or semantic similarity.
Layer 3: Procedures
interface Procedure {
  id: string;
  content: string; // "Checks error handling first in code reviews"
  category: string;
  trigger: string; // When this pattern activates
  source: 'extracted' | 'manual';
}
Extracted by the background processor analyzing conversation patterns. These represent behavioral habits, not explicit preferences.
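A minimal sketch of trigger matching, using plain keyword overlap on the trigger field. This is an assumption for illustration (the simplified signature omits storage lookup, and the real matcher may use embeddings instead):

```typescript
type ProcedureLike = { content: string; trigger: string };

// Returns the procedures whose trigger words all appear in the message.
function matchTriggers(procedures: ProcedureLike[], message: string): ProcedureLike[] {
  const msg = message.toLowerCase();
  return procedures.filter((p) =>
    p.trigger.toLowerCase().split(/\s+/).every((word) => msg.includes(word))
  );
}
```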
Soul Engine: 13 blocks
type SoulSection = 'identity' | 'style' | 'context';

type BlockKey =
  | 'identity' | 'worldview' | 'tensions' | 'rules'
  | 'style_guide' | 'anti_patterns' | 'communication' | 'examples'
  | 'user_profile' | 'active_context' | 'learned_patterns'
  | 'scratchpad' | 'custom';

interface SoulBlock {
  key: BlockKey;
  section: SoulSection;
  content: string;
  char_limit: number;
  priority: number;
  truncation: 'head' | 'tail'; // head = keep newest, tail = keep oldest
}
Identity blocks use tail truncation (preserve oldest = core values stable). Context blocks use head truncation (trim oldest = keep fresh data). This simple mechanism creates different temporal behaviors without complex logic.
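The mechanism fits in one function, assuming block content grows by appending the newest entries at the end (a sketch, not Alma's actual code):

```typescript
// Head truncation trims the oldest text (keep newest);
// tail truncation trims the newest text (keep oldest).
function truncate(content: string, charLimit: number, mode: 'head' | 'tail'): string {
  if (content.length <= charLimit) return content;
  return mode === 'head'
    ? content.slice(content.length - charLimit) // drop the head, keep newest
    : content.slice(0, charLimit);              // drop the tail, keep oldest
}
```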
Context Assembler
async function assembleContext(userId: string, message: string): Promise<string> {
  // 1. Soul Engine — always included, highest priority
  const soul = await renderSoulBlocks(userId);

  // 2. Relevant memories — scored by semantic similarity to current message
  const memories = await searchMemories(userId, message, { mode: 'hybrid' });

  // 3. Recent episodes — for conversation continuity
  const episodes = await getRecentEpisodes(userId);

  // 4. Matching procedures — behavioral patterns
  const procedures = await matchProcedures(userId, message);

  // 5. Dynamic token budget — sections compete for space
  return buildPrompt({ soul, memories, episodes, procedures }, TOKEN_BUDGET);
}
Each section has a priority. If total tokens exceed the budget, lower-priority sections get truncated first. The Soul Engine is always preserved in full.
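The priority-ordered budgeting could look like the sketch below. The simplified signature (a flat section list), the character-counting shortcut, and the 4-chars-per-token heuristic are all assumptions, not Alma's implementation:

```typescript
interface Section {
  name: string;
  text: string;
  priority: number; // higher = preserved longer
}

function buildBudgetedPrompt(sections: Section[], tokenBudget: number): string {
  const charBudget = tokenBudget * 4; // rough chars-per-token heuristic (assumption)
  // Highest priority first, so low-priority sections absorb the cuts.
  const ordered = [...sections].sort((a, b) => b.priority - a.priority);
  let remaining = charBudget;
  const kept: string[] = [];
  for (const s of ordered) {
    if (remaining <= 0) break;
    const text = s.text.slice(0, remaining); // truncate to whatever budget is left
    kept.push(text);
    remaining -= text.length;
  }
  return kept.join('\n');
}
```

Because the Soul Engine carries the highest priority, it is consumed first and is never the section that gets cut.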
Background Processor
Fires asynchronously via ctx.waitUntil() every N messages:
- Sends recent conversation to Claude Haiku for analysis
- Receives structured JSON with extracted memories, episodes, procedures
- Deduplicates memories against existing store
- Updates relevant soul blocks (active_context, learned_patterns, user_profile)
- Stores episode summary
Zero impact on conversation latency.
Infrastructure
Entirely Cloudflare:
- Workers — API, SSE streaming, background processing
- D1 — SQLite database (56 migrations)
- Vectorize — Embedding storage and similarity search
- R2 — File uploads (images, documents)
- KV — Configuration cache
- Durable Objects — Atomic budget tracking (single-threaded counters)
No AWS. No external databases. Cold start under 5ms.
Numbers
- 1,690 passing tests across 102 files
- 56 database migrations
- 180 REST API endpoints
- 15 fully localized languages
- 6 agent tools in chat + 21 MCP tools + 9 MCP resources
Try it
Web app: alma.olivares.ai
Free tier: 500 memories, Claude Haiku, automatic learning. No credit card.
Built by Francisco @ Olivares.AI