Every AI app eventually hits the same problem: the model needs context, but you can't dump everything into the system prompt. Token budgets are finite. Not all information is equally relevant. And the naive approach — "just send the last N messages" — falls apart the moment your user has 200 memories and a 4,000 token budget.
I've been running a memory scoring and context assembly system in production for months. This is how it works, with actual code.
The pipeline
The system has four stages:
- Extract structured memories from conversation
- Deduplicate against existing memories
- Score and rank by relevance, importance, recency, and frequency
- Assemble a token-budgeted system prompt
Each stage has specific engineering decisions that took a while to get right.
Stage 1: Extraction
After every 5 assistant messages, a background processor fires asynchronously via ctx.waitUntil(). It takes the last 20 messages and asks the cheapest available model to extract structured data:
```typescript
interface ExtractedMemory {
  content: string;
  category: 'preference' | 'fact' | 'decision' | 'project';
  importance: number; // 0.0 to 1.0
}

interface ExtractedEpisode {
  summary: string;
  topics: string[];
  outcome: string | null;
}
```
The extraction prompt has specific rules:
- Importance: 0.9+ for critical info, 0.5-0.8 for useful, below 0.5 for minor
- Keep memories concise (one sentence each)
- Extract 0-10 memories (only what's genuinely worth remembering)
The "0-10 memories" range matters. Early versions didn't cap extraction and the system generated noise — trivial facts diluting important ones. Capping at 10 per extraction cycle and requiring importance thresholds cleaned this up.
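That cleanup can be sketched as a small validation pass over the model's output, assuming a clamp-and-cap policy (the sanitizeExtraction name is illustrative, and the ExtractedMemory shape is redeclared so the snippet stands alone):

```typescript
type ExtractedMemory = {
  content: string;
  category: 'preference' | 'fact' | 'decision' | 'project';
  importance: number; // 0.0 to 1.0
};

// Hypothetical post-extraction validation: drop empty memories,
// clamp importance into [0, 1], keep only the 10 most important.
function sanitizeExtraction(raw: ExtractedMemory[]): ExtractedMemory[] {
  return raw
    .filter((m) => m.content.trim().length > 0)
    .map((m) => ({ ...m, importance: Math.min(1, Math.max(0, m.importance)) }))
    .sort((a, b) => b.importance - a.importance)
    .slice(0, 10); // hard cap: at most 10 memories per extraction cycle
}
```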
The episode summary is also structured — not "you talked for 45 minutes" but { summary: "Debugged auth middleware", topics: ["authentication", "middleware"], outcome: "Root cause was missing await" }. This makes episodes searchable by topic without embedding the full transcript.
One critical detail: this runs fire-and-forget. The user never waits. On Cloudflare Workers, that means every background promise needs both ctx.waitUntil() AND .catch():
```typescript
const backgroundWork = processor.process(conversationId, messages, llm)
  .catch(err => console.error('Background processing failed:', err));

ctx.waitUntil(backgroundWork);
```
Miss that .catch() on Workers with compatibility dates of 2024-10 or later and an unhandled rejection can silently kill the Worker. That single line is the difference between a logged background failure and a crash on every chat request.
Stage 2: Deduplication
Without dedup, you get the same preference stored 30 times: "User prefers TypeScript" appearing in every extraction cycle.
The approach: Jaccard similarity on extracted keywords with a 60% threshold and a 3-keyword minimum.
Why 60%? Tested extensively:
- 40% merges distinct memories ("prefers TypeScript" conflates with "prefers functional patterns")
- 80% lets obvious duplicates through
- 60% with 3-keyword minimum catches real duplicates while preserving distinct-but-related memories
When a duplicate is detected, the existing memory's access_count increments. Frequently confirmed facts naturally rise in rankings without creating noise.
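A minimal sketch of that check, assuming keywords are already extracted per memory (e.g. lowercased content words); the function names are illustrative:

```typescript
// Jaccard similarity: |intersection| / |union| over keyword sets.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  const intersection = [...setA].filter((k) => setB.has(k)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union;
}

function isDuplicate(newKeywords: string[], existingKeywords: string[]): boolean {
  // 3-keyword minimum: tiny keyword sets make similarity unreliable
  if (newKeywords.length < 3 || existingKeywords.length < 3) return false;
  return jaccard(newKeywords, existingKeywords) >= 0.6; // 60% threshold
}
```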
Stage 3: Scoring
This is where it gets interesting. Every memory gets a composite score:
```typescript
const DEFAULT_WEIGHTS = {
  relevance: 0.40,  // cosine similarity to current query
  importance: 0.30, // extracted weight (0-1)
  recency: 0.20,    // exponential decay, 7-day half-life
  frequency: 0.10,  // log-scaled access count
};
```
The recency function uses exponential decay:
```typescript
function recencyScore(accessedAt: string): number {
  const accessed = new Date(accessedAt).getTime();
  const hoursAgo = (Date.now() - accessed) / (1000 * 60 * 60);
  const halfLifeHours = 7 * 24; // 7 days
  return Math.exp((-Math.LN2 * hoursAgo) / halfLifeHours);
}
```
A memory accessed today scores 1.0. One week ago: 0.5. Two weeks: 0.25. This means stale memories don't disappear — they just yield to fresher ones when the budget is tight.
Frequency uses logarithmic scaling so high-access memories don't dominate:
```typescript
function frequencyScore(accessCount: number): number {
  if (accessCount <= 0) return 0;
  return Math.min(1, Math.log10(accessCount + 1) / 2);
}
```
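Putting the four dimensions together, a sketch of the composite score as a plain weighted sum (the Memory shape and compositeScore name are illustrative; relevance is assumed precomputed per query, and the recency/frequency math is inlined from the functions above):

```typescript
interface Memory {
  relevance: number;   // cosine similarity to the current query, 0-1
  importance: number;  // extracted weight, 0-1
  accessedAt: string;  // ISO timestamp of last access
  accessCount: number;
}

// Weighted sum of the four dimensions: 0.40 / 0.30 / 0.20 / 0.10.
function compositeScore(m: Memory, now = Date.now()): number {
  const hoursAgo = (now - new Date(m.accessedAt).getTime()) / 3_600_000;
  const recency = Math.exp((-Math.LN2 * hoursAgo) / (7 * 24)); // 7-day half-life
  const frequency =
    m.accessCount <= 0 ? 0 : Math.min(1, Math.log10(m.accessCount + 1) / 2);
  return 0.4 * m.relevance + 0.3 * m.importance + 0.2 * recency + 0.1 * frequency;
}
```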
Why these weights?
Relevance at 0.40 because even a maximally important memory about cooking is useless when you're debugging auth. Semantic relevance is the primary filter.
Importance at 0.30 because not all memories are equal. "User is migrating to PostgreSQL this quarter" (0.9) should outrank "User mentioned coffee" (0.3), even if the coffee mention is more recent.
Recency at 0.20 because conversations have temporal context. What you discussed yesterday is more likely relevant than what you discussed a month ago — but not always.
Frequency at 0.10 as a tiebreaker. Memories that keep surfacing in different conversations are probably important, but this shouldn't override direct relevance.
The confidence dimension
Each memory also has a confidence score that's separate from importance:
- 1.0 — user explicitly stated this
- 0.7 — AI inferred this from conversation context
Confidence feeds into retrieval quality. A high-confidence preference (the user said "I always use TypeScript") should surface over a high-importance but low-confidence inference ("probably prefers dark mode based on theme discussion").
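One way to fold that in is a confidence multiplier at ranking time. This exact formula is an assumption for illustration, not the system's documented math:

```typescript
type ScoredMemory = { content: string; score: number; confidence: number };

// Assumed: confidence scales the composite score, so an explicit
// statement (1.0) outranks a slightly higher-scored inference (0.7).
function rankWithConfidence(memories: ScoredMemory[]): ScoredMemory[] {
  return [...memories].sort(
    (a, b) => b.score * b.confidence - a.score * a.confidence
  );
}
```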
Stage 4: Context Assembly
The Context Assembler takes scored memories and builds a token-budgeted system prompt:
```typescript
interface AssembledContext {
  systemPrompt: string;
  metadata: {
    soulTokens: number;
    memoriesIncluded: number;
    memoriesTokens: number;
    episodesIncluded: number;
    proceduresIncluded: number;
    totalTokens: number;
    topMemoryScores: Array<{ content: string; score: number }>;
  };
}
```
The assembly order is strict:
- Soul blocks first (identity, style, context) — always included, non-negotiable
- Scored memories — ranked, filling up to 50% of remaining token budget
- Recent episodes — latest conversation summaries
- Relevant procedures — behavioral patterns matching the current query
Everything is wrapped in XML sections for structured parsing:
```xml
<alma_soul>
  <identity>...</identity>
  <anti_patterns>...</anti_patterns>
</alma_soul>

<alma_memories>
  <memory importance="0.9" category="project">Migrating auth to PostgreSQL</memory>
  <memory importance="0.7" category="preference">Prefers concise code reviews</memory>
</alma_memories>

<alma_episodes>
  <episode topics="auth,middleware">Debugged auth middleware...</episode>
</alma_episodes>
```
XML-safe truncation is critical — you never cut mid-tag. If a memory doesn't fit within the remaining budget, skip it entirely rather than corrupting the XML structure.
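That skip-don't-truncate policy can be sketched as a simple packing loop. Assumptions here: a rough chars/4 token estimate, and illustrative names (assembleMemories, the Scored shape); the production code presumably uses a real tokenizer:

```typescript
interface Scored {
  content: string;
  importance: number;
  category: string;
  score: number; // composite score; input is assumed pre-sorted descending
}

// Crude but fast token estimate; swap in a real tokenizer in production.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

function escapeXml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function assembleMemories(ranked: Scored[], budget: number): string {
  const lines: string[] = [];
  let used = estimateTokens('<alma_memories></alma_memories>');
  for (const m of ranked) {
    const tag = `<memory importance="${m.importance}" category="${m.category}">${escapeXml(m.content)}</memory>`;
    const cost = estimateTokens(tag);
    if (used + cost > budget) continue; // skip entirely -- never cut mid-tag
    lines.push(tag);
    used += cost;
  }
  return `<alma_memories>\n${lines.join('\n')}\n</alma_memories>`;
}
```

Because a memory is either emitted whole or not at all, the output is well-formed XML at every budget.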
Why XML over JSON?
Tested both. XML with labeled attributes gives the model clearer section boundaries. JSON works fine for structured data but the model is more likely to reference XML-tagged content naturally in responses. The importance and category attributes are visible to the model, which helps it prioritize.
What I got wrong
First version had no scoring. Just retrieved the N most recent memories. This breaks immediately — a critical project decision from last week gets buried under trivial facts from today.
Second version over-weighted recency. Everything decayed too fast. Important long-term preferences disappeared within two weeks.
Third version didn't deduplicate. After a month of use, the same preferences appeared 40+ times, eating token budget with redundant information.
The current scoring weights are version four. They've been stable in production for months, but they're still configurable per user — different use cases might need different balances.
Numbers
- Extraction latency: 0ms user-facing (background processing)
- Scoring: <5ms for 500 memories
- Context assembly: <10ms including soul prompt rendering
- D1 reads: 1-5ms, writes: 5-15ms
- Total overhead per message: near-zero for the user, ~2-4 seconds background
The system is Alma — alma.olivares.ai. It wraps this pipeline in a web app, MCP server (21 tools for Claude Desktop/Cursor/Windsurf), VSCode extension, and REST API. Free tier available.
But the scoring architecture applies to any AI system that needs to manage context at scale. The core insight: memory without ranking is just a pile of text. Ranking without token budgeting overflows the context window. Both without extraction means the user maintains everything by hand. You need all four stages.