Abhishek Chauhan

Posted on May 18

Production Agent Memory: Compaction, Decay, and the Observation Engine

#ai #agentmemoryarchitecture #episodicmemoryagents #productivity

Most guides on agent memory stop at storage. Pick a vector store, embed your documents, retrieve the top-k. That works for RAG. It does not work for agents that run continuously across weeks and months, accumulating behavioral history about real users making real decisions.

Production agent memory is a different problem. The questions aren't just "what do I store?" and "how do I retrieve it?" They are:

How does the agent learn that a user always makes the same correction, without being explicitly told?
How do you give the agent three months of behavioral history without flooding the context window?
What happens when a retrieved memory is wrong — not just irrelevant, but actively contradicted by the user?
When should an observed pattern become a permanent rule?

This post builds a complete architecture for answering those questions — from taxonomy to scoring formula to the nightly maintenance job that keeps it all clean.

The Three Failure Modes

Before any architecture decision, name the failure modes you're designing against:

Too much — flooding the context with everything you know. The model gets slow, expensive, and loses precision. Ironically, more memory makes the agent worse.

Too little — injecting nothing. The agent repeats mistakes, ignores learned rules, asks the user to re-explain preferences they stated last week.

Wrong type — injecting stale, contradicted, or irrelevant memories. Worse than nothing: the agent acts on false information confidently.

Every design decision in this post traces back to avoiding one of these three.

The Four Memory Types

Not all memory is the same. A production system needs four distinct types, each with a different storage backend, lifecycle, and injection strategy.

Working Memory

The live context for the current task only. Exists for the duration of one agent run. Discarded when the task ends.

What it holds: the current task payload, intermediate reasoning steps, partial tool call results, running approval state.

Key constraint: 4,000 token maximum, enforced before every LLM call. If working memory exceeds this, compress intermediate steps to a summary using a cheap model call first. Never silently truncate — truncation loses state. Compress explicitly.
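A minimal sketch of that compress-before-call check. It assumes a rough four-characters-per-token estimate and a caller-supplied `summarise` function standing in for the cheap model call. The `Step` type, the helper names, and the choice to reserve half the budget for verbatim recent steps are illustrative assumptions, not part of the design described here.

```typescript
// Hypothetical shapes and names, for illustration only.
type Step = { kind: 'reasoning' | 'tool_result' | 'summary'; text: string }

const WORKING_MEMORY_LIMIT = 4_000 // tokens

// Crude stand-in for a real tokenizer: roughly 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

const totalTokens = (steps: Step[]): number =>
  steps.reduce((sum, step) => sum + estimateTokens(step.text), 0)

function enforceWorkingMemoryBudget(
  steps: Step[],
  summarise: (texts: string[]) => string,
  limit: number = WORKING_MEMORY_LIMIT
): Step[] {
  if (totalTokens(steps) <= limit) return steps

  // Walk backwards, keeping recent steps verbatim while they fit in half
  // the budget; the other half is left for the summary (an assumption).
  let kept = 0
  let used = 0
  for (const step of [...steps].reverse()) {
    const cost = estimateTokens(step.text)
    if (used + cost > limit / 2) break
    used += cost
    kept += 1
  }

  const older = steps.slice(0, steps.length - kept)
  const recent = steps.slice(steps.length - kept)
  // Explicit compression: older steps become one summary step, not a silent cut.
  return [{ kind: 'summary', text: summarise(older.map(s => s.text)) }, ...recent]
}
```

In a real agent `summarise` would be asynchronous and the token count would come from the model's own tokenizer; the control flow stays the same.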

Storage: in-process object. No database write during execution — only persisted when the task reaches terminal state (done, failed, rolled back).

Episodic Memory

A timestamped log of specific past events. The raw ground truth. "On Friday at 15:04, the agent drafted a reply to a supplier and the user modified it before sending."

What it holds:

Every completed agent action (who, what, when, outcome)
Every approval decision (approved / rejected / modified, with the modification text)
Every exception the agent surfaced and how it was resolved
Every user correction — agent did X, user changed it to Y

Storage: relational rows in an episodes table (append-only, never modified) plus a vector embedding per episode for semantic retrieval.

Retrieval: hybrid — BM25 full-text search for exact matches on names, amounts, dates, combined with cosine similarity on the embedding for conceptually related events. Results merged, rescored with the decay function, top-k injected.
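The merge step can be sketched as follows. This is one plausible way to combine the two result lists, assuming each search returns `{ id, score }` pairs; the `Hit` and `MergedHit` types and the max-based normalisation are assumptions made for the example.

```typescript
type Hit = { id: string; score: number }
type MergedHit = { id: string; semanticScore: number; keywordScore: number }

// Divide by the list maximum so unbounded BM25 scores and cosine
// similarities land on a comparable 0..1 scale.
function normalise(hits: Hit[]): Map<string, number> {
  const max = Math.max(0, ...hits.map(h => h.score))
  return new Map(
    hits.map((h): [string, number] => [h.id, max > 0 ? h.score / max : 0])
  )
}

// Union of both result lists; an episode missing from one list scores 0 there.
function mergeHybrid(keywordHits: Hit[], semanticHits: Hit[]): MergedHit[] {
  const keyword = normalise(keywordHits)
  const semantic = normalise(semanticHits)
  const ids = new Set([...keyword.keys(), ...semantic.keys()])
  return [...ids].map(id => ({
    id,
    semanticScore: semantic.get(id) ?? 0,
    keywordScore: keyword.get(id) ?? 0,
  }))
}
```

Each merged hit then goes through the decay-aware rescoring step before the list is cut to the top k.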

Semantic Memory

Facts independent of any specific event. Stable knowledge about the user, their contacts, their company, their preferences. Changes slowly.

What it holds:

User profile: name, company, tone preference, industry, CRM type
Contact profiles: name, relationship, tone preference, known quirks, routing notes
Company rules: payment terms per supplier, invoice thresholds, routing rules ("technical emails → forward to Lara")
Agent configuration: which capabilities are enabled, global off-limits (contacts never auto-replied to, folders never touched)

This type does not need vector search. Direct SQL lookups are faster, cheaper, and more precise for structured facts. Before every agent call, the orchestrator fetches the relevant semantic facts for that agent type and injects them as a structured block in the system prompt. Deterministic, synchronous, never misses.

Procedural Memory

Learned workflows — rules of thumb the agent has inferred from repeated user corrections and confirmed behavioral patterns. Not facts, not events, but how to behave.

What it holds (example rules, in plain language):

"Never use semicolons — the user always removes them"
"Emails from Lara: never archive automatically"
"Quotes above €10,000: don't prepare a draft, the user always rewrites them"
"Friday afternoon after 14:30: defer everything to Monday"
"Pelletteria Veneto SRL: always flag, never route autonomously"

Storage: a procedures table. Each row has: agent, rule text (plain language, injected verbatim), source observation ID (foreign key to the observation that triggered promotion), promoted timestamp, confirmation count, last applied timestamp.

No vector search needed here either. All active procedures for the current agent are fetched in full and injected at the start of every system prompt. The conservative promotion threshold keeps the list short (typically 3-8 rules per agent), so procedures should not overflow context.

The Observation Engine

This is the component most memory systems don't have — and its absence is why agents that "remember" things still feel dumb.

The observation engine is the mechanism that detects behavioral patterns from raw episodic data and promotes them into procedural rules. It's the bridge between "things that happened" and "rules that govern future behavior."

Sources of Raw Signals

Every agent in the system feeds signals into a queue. The signals are not interpretations — they're raw behavioral data:

Signal	What it captures
User modifies a draft before sending	Which part changed, how many times this pattern has repeated
User rejects an approval	What was rejected, the rejection reason if given
User routes something differently than predicted	Where the agent sent it, where the user moved it
Agent surfaces an exception	New contact with no known rule, amount outside threshold
User correction	Agent did X, user explicitly changed it to Y

These signals go into a queue. The observation engine processes them nightly.

Pattern Detection

The nightly job (runs at 02:14) reads the last 30 days of the episodes table and prompts an LLM to identify genuine repeated patterns. The prompt enforces strict constraints:

You are the observation engine. Analyse user behavioral data and identify
genuine, repeated patterns. DO NOT invent. DO NOT generalise from a single event.
An observation is only valid if it has at least 3 consistent occurrences.

Required format for each observation:
<observation>
{
  "category": "writing-style | rhythm | people | tools | decisions",
  "agent": "email | accounting | crm | relay | files | system",
  "quote": "direct statement, max 25 words",
  "evidence": "concise phrase with supporting numbers, max 40 words",
  "occurrences": 4,
  "confidence": "low | medium | high | very-high",
  "promotion_candidate": true | false
}
</observation>

Rules:
- confidence = high only if occurrences >= 5 AND pattern is 80%+ consistent
- promotion_candidate = true only if the observation implies a clear action rule
- Maximum 3 new observations per run

The 3-observation cap per run prevents the system from flooding the observations table. Patterns that are genuine will recur and be detected in subsequent nightly runs, accumulating evidence over time.
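Since the prompt's constraints are only requests to the model, it is worth re-checking them in code before anything is written. A minimal sketch, assuming the model returns the `<observation>` blocks exactly as specified in the prompt; `parseObservations` and its helpers are hypothetical names.

```typescript
type Confidence = 'low' | 'medium' | 'high' | 'very-high'

interface ObservationCandidate {
  category: string
  agent: string
  quote: string
  evidence: string
  occurrences: number
  confidence: Confidence
  promotion_candidate: boolean
}

const MIN_OCCURRENCES = 3 // "at least 3 consistent occurrences"
const MAX_PER_RUN = 3     // "Maximum 3 new observations per run"

const isConfidence = (v: string): v is Confidence =>
  v === 'low' || v === 'medium' || v === 'high' || v === 'very-high'

function parseObservations(llmOutput: string): ObservationCandidate[] {
  const valid: ObservationCandidate[] = []
  for (const match of llmOutput.matchAll(/<observation>([\s\S]*?)<\/observation>/g)) {
    let parsed: unknown
    try {
      parsed = JSON.parse(match[1] ?? '')
    } catch {
      continue // malformed JSON: drop the block rather than guess
    }
    if (typeof parsed !== 'object' || parsed === null) continue
    const { category, agent, quote, evidence, occurrences, confidence, promotion_candidate } =
      parsed as Record<string, unknown>
    if (typeof category !== 'string' || typeof agent !== 'string') continue
    if (typeof quote !== 'string' || typeof evidence !== 'string') continue
    if (typeof occurrences !== 'number' || occurrences < MIN_OCCURRENCES) continue
    if (typeof confidence !== 'string' || !isConfidence(confidence)) continue
    if (typeof promotion_candidate !== 'boolean') continue
    valid.push({ category, agent, quote, evidence, occurrences, confidence, promotion_candidate })
  }
  // Enforce the per-run cap even if the model ignored it.
  return valid.slice(0, MAX_PER_RUN)
}
```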

Deduplication Before Insert

Before any new observation is written, it's checked against existing ones:

async function isDuplicate(newObs: ObservationCandidate): Promise<boolean> {
  const embedding = await embedDocument(newObs.quote)
  const results = await db.execute(sql`
    SELECT o.id, o.quote,
      vec_distance_cosine(ov.embedding, ${embedding}) as dist
    FROM observations_vec ov
    JOIN observations o ON o.id = ov.id
    WHERE o.agent    = ${newObs.agent}
      AND o.category = ${newObs.category}
      AND o.status   = 'active'
      AND vec_distance_cosine(ov.embedding, ${embedding}) < 0.15
    LIMIT 1
  `)
  return results.rows.length > 0
}

If a near-duplicate exists and the new one has higher occurrences, the existing row's evidence is updated and occurrence count incremented — not replaced. Observations grow stronger over time, they don't get duplicated.
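The insert-or-strengthen decision can be sketched as a small pure function. The types and the name `decideUpsert` are hypothetical, and the sketch assumes "incremented" means the stored count is raised to the candidate's higher count.

```typescript
interface StoredObservation {
  id: string
  quote: string
  evidence: string
  occurrences: number
}

interface Candidate {
  quote: string
  evidence: string
  occurrences: number
}

type UpsertDecision =
  | { action: 'insert'; row: Candidate }
  | { action: 'strengthen'; row: StoredObservation }
  | { action: 'skip' }

function decideUpsert(
  candidate: Candidate,
  nearDuplicate: StoredObservation | null
): UpsertDecision {
  // No near-duplicate found: this is a new observation.
  if (nearDuplicate === null) return { action: 'insert', row: candidate }
  // The existing row already has equal or stronger evidence: nothing to do.
  if (candidate.occurrences <= nearDuplicate.occurrences) return { action: 'skip' }
  // Keep the existing id and quote, refresh evidence and occurrence count.
  return {
    action: 'strengthen',
    row: {
      ...nearDuplicate,
      evidence: candidate.evidence,
      occurrences: candidate.occurrences,
    },
  }
}
```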

Confidence Thresholds

low          → 3-4 occurrences,  pattern < 70% consistent
medium       → 4-5 occurrences,  pattern 70-80% consistent
high         → 5+  occurrences,  pattern > 80% consistent
very-high    → 8+  occurrences,  pattern > 90% consistent, no contradictions
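The bands above overlap at 4 and 5 occurrences, so any code version has to pick a reading. The sketch below is one possible interpretation: check the strictest band first and fall through. `classifyConfidence` is a hypothetical helper, and treating exactly 80% as "high" follows the prompt's "80%+" wording rather than the table's "> 80%".

```typescript
type Confidence = 'low' | 'medium' | 'high' | 'very-high'

// consistency is a fraction in 0..1; contradictions is a count of
// episodes that go against the pattern after it was first seen.
function classifyConfidence(
  occurrences: number,
  consistency: number,
  contradictions: number
): Confidence | null {
  if (occurrences < 3) return null // below the minimum for any observation
  if (occurrences >= 8 && consistency > 0.9 && contradictions === 0) return 'very-high'
  if (occurrences >= 5 && consistency >= 0.8) return 'high'
  if (occurrences >= 4 && consistency >= 0.7) return 'medium'
  return 'low'
}
```

Under this reading a pattern seen ten times but only 60% consistent stays at low confidence: volume alone never raises the band.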

Promotion to Procedure

A procedural rule is promoted when:

Confidence = high or very-high
promotion_candidate = true
User has not marked it "wrong" (see feedback loop below)
User has not dismissed it within 48 hours of creation

The promotion threshold is deliberately conservative. A rule injected into every agent call for weeks shapes behavior continuously. False positives are more damaging than false negatives.
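The four conditions can be sketched as a single predicate. This reads the 48-hour condition as a waiting period: an observation is only promoted once it has been visible for 48 hours without being dismissed. That reading, the `dismissed` status value, and the `isPromotable` name are assumptions for the example.

```typescript
type Confidence = 'low' | 'medium' | 'high' | 'very-high'

interface PromotableObservation {
  confidence: Confidence
  promotionCandidate: boolean
  status: 'active' | 'rejected' | 'dismissed' | 'merged'
  createdAt: number // epoch milliseconds
}

const DISMISS_WINDOW_MS = 48 * 3_600_000

function isPromotable(obs: PromotableObservation, now: number): boolean {
  const confident = obs.confidence === 'high' || obs.confidence === 'very-high'
  // 'rejected' covers "marked wrong"; 'dismissed' covers the 48-hour dismissal.
  const notRefused = obs.status === 'active'
  const windowElapsed = now - obs.createdAt >= DISMISS_WINDOW_MS
  return confident && obs.promotionCandidate && notRefused && windowElapsed
}
```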

The Feedback Loop

Every observation is shown to the user with two buttons: "You're right" and "You're wrong."

async function handleFeedback(
  obsId: string,
  feedback: 'correct' | 'wrong'
) {
  if (feedback === 'correct') {
    await db.update(observations)
      .set({ confidenceBoost: sql`confidence_boost + 0.2` })
      .where(eq(observations.id, obsId))

    await maybePromoteToProcedure(obsId)

  } else {
    // Set status to rejected — excluded from all future retrieval
    await db.update(observations)
      .set({ status: 'rejected' })
      .where(eq(observations.id, obsId))

    // Delete embedding — rejected observations never surface in retrieval
    await deleteEmbedding('observation', obsId)

    // Demote any procedure that was promoted from this observation
    await db.update(procedures)
      .set({ status: 'demoted' })
      .where(eq(procedures.sourceObservationId, obsId))

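    // Write a counter-signal so the nightly job doesn't regenerate
    // the same wrong observation next run. Load the rejected quote first,
    // since only obsId is in scope here.
    const [obs] = await db.select({ quote: observations.quote })
      .from(observations)
      .where(eq(observations.id, obsId))
    await db.insert(agentSignals).values({
      signalType: 'observation_rejected',
      payload: JSON.stringify({ rejectedQuote: obs?.quote }),
    })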
  }
}

The counter-signal write is the part most implementations miss. Without it, the nightly pattern detection job will see the same 30 days of data, detect the same pattern, and reinsert the same observation you just rejected. The counter-signal closes the loop.
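On the reading side, the nightly job has to consult those counter-signals before inserting. A minimal sketch, assuming the rejected quotes have already been embedded and that a cosine similarity above 0.85 (mirroring the 0.15 distance used for deduplication) counts as "the same observation". `dropKnownRejections` is a hypothetical helper.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

// Remove any candidate that is a near-duplicate of a rejected observation.
function dropKnownRejections<T extends { embedding: number[] }>(
  candidates: T[],
  rejectedEmbeddings: number[][],
  threshold = 0.85
): T[] {
  return candidates.filter(
    candidate =>
      !rejectedEmbeddings.some(
        rejected => cosineSimilarity(candidate.embedding, rejected) > threshold
      )
  )
}
```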

The No-Delete Principle

This is the most important design decision in the entire architecture, and the one teams get wrong most often.

Do not delete episodic memory rows when their decay score falls below a threshold.

The reasoning matters. Decay score measures recency of access — how long since this episode was retrieved. It does not measure behavioral importance.

Consider an episode from 95 days ago: "User rejected the Magnani draft three times, never wanted assertive language with this contact." This episode scores low today because it hasn't been retrieved recently. The moment the agent receives a new email from Magnani, that old episode is the most critical thing in memory. If you deleted it on decay grounds, the agent has permanent amnesia about a behaviorally defining pattern — and will repeat the exact mistake the user corrected three times.

The correct model: low decay score means retrieve this less often, not destroy it.

Episodes are compacted — compressed into summary records that cost far less to store, embed, and retrieve — but the raw rows are never deleted by an automated job.

The only paths to actual hard deletion:

Explicit user GDPR erasure request
User marks an observation as wrong (the observation's embedding is deleted and its source episodes are flagged as contradicted)
Admin-level seat deletion

Everything else is compaction.

The Compaction Pipeline

Compaction is a lossless-to-lossy compression pipeline that preserves behavioral signal in progressively smaller form. It's how you give an agent three months of behavioral history without overflowing the context window.

Tier 0 — Raw episodes
  Individual action records. Full detail. Embedded individually.
  "Mon 09:11, agent drafted reply to Bertelli re: Q3 order.
  User approved without modification."
  → Used for: recent retrieval (< 30 days), audit, rollback.

Tier 1 — Weekly compaction (triggered at 30 days)
  Groups of 5-15 raw episodes from the same agent + contact cluster,
  spanning one week, summarised into a single compact record.
  "Week of 3-9 June: agent handled 8 interactions with Bertelli.
  6 approved without edits (orders, shipping confirmations).
  1 modified: removed semicolon, shortened opening paragraph.
  1 routed to manager (quote request). Pattern confirmed: direct,
  no preamble, fast approval rate."
  Source raw episodes → status='superseded_raw', embeddings deleted.
  Compact record → status='active', fresh embedding from summary.
  Storage ratio: ~8:1.

Tier 2 — Monthly compaction (triggered at 90 days)
  Groups of tier-1 compact records from the same agent + contact + month,
  summarised into a period record.
  "June 2026 — email agent × Bertelli: 32 interactions.
  Approval rate 94%. Established pattern: direct opener, no semicolons,
  route quotes to manager. No exceptions. Tone: formal-concise, confirmed."
  Source tier-1 records → status='superseded_tier1', embeddings deleted.
  Compact record → status='active', fresh embedding.
  Storage ratio from raw: ~32:1.

Tier 2 records never compact further.
They are permanent behavioral summaries.
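The grouping step behind tier-1 compaction can be sketched like this. It assumes each raw episode carries a single `contact` field (the real rows may keep contacts inside a JSON entities column) and that weeks start on Monday in UTC. All names here are hypothetical, and the 5-15 group size is not enforced in this sketch.

```typescript
interface RawEpisode {
  id: string
  agent: string
  contact: string
  createdAt: number // epoch milliseconds
}

const DAY_MS = 86_400_000

// Monday 00:00 UTC of the week containing `ts`.
function weekStart(ts: number): number {
  const d = new Date(ts)
  const daysSinceMonday = (d.getUTCDay() + 6) % 7
  return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate() - daysSinceMonday)
}

// Buckets raw episodes older than 30 days by agent, contact, and week.
function groupForTier1(episodes: RawEpisode[], now: number): Map<string, RawEpisode[]> {
  const cutoff = now - 30 * DAY_MS
  const groups = new Map<string, RawEpisode[]>()
  for (const episode of episodes) {
    if (episode.createdAt >= cutoff) continue // still recent: stays raw
    const key = `${episode.agent}|${episode.contact}|${weekStart(episode.createdAt)}`
    const group = groups.get(key) ?? []
    group.push(episode)
    groups.set(key, group)
  }
  return groups
}
```

Each resulting group would then be summarised into one compact record, with the source rows marked superseded and their embeddings removed.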

The schema needs to track tier, status, and compact group ID on every episode row:

export const episodes = sqliteTable('episodes', {
  id:             text('id').primaryKey().$defaultFn(() => crypto.randomUUID()),
  agent:          text('agent').notNull(),
  eventType:      text('event_type').notNull(),
  // 'action_completed' | 'approval_approved' | 'approval_rejected' |
  // 'approval_modified' | 'exception_raised' | 'user_correction' | 'compact'

  summary:        text('summary'),          // raw episode: plain text, max 100 chars
  outcome:        text('outcome'),          // done | approved | rejected | modified | exception
  entities:       text('entities'),         // JSON: [{type: 'contact', name: 'John Smith'}]

  compactSummary: text('compact_summary'),  // compact record: multi-sentence narrative
  compactionTier: integer('compaction_tier').notNull().default(0),
  compactGroupId: text('compact_group_id'), // ID of the compact record covering this row

  status: text('status').notNull().default('raw'),
  // 'raw'              → live raw episode, has embedding
  // 'active'           → live compact record (tier-1 or tier-2), has embedding
  // 'superseded_raw'   → absorbed into tier-1, row kept, embedding gone
  // 'superseded_tier1' → absorbed into tier-2, row kept, embedding gone

  lastAccessedAt: integer('last_accessed_at'), // updated on each retrieval — feeds decay calc
  createdAt:      integer('created_at').notNull().$defaultFn(() => Date.now()),
})

The episodes_vec virtual table holds embeddings only for retrieval-eligible rows — status='raw' and status='active'. Superseded rows have no embedding. This means the semantic search naturally covers the full timeline at the right level of granularity: recent events as individual rows, older events as compact summaries. No extra filtering needed.

Tier-Aware Decay Scoring

The retrieval score governs which memories float to the top when assembling the context window. It is computed dynamically at retrieval time — not pre-computed, not stored.

retrieval_score(episode, query) =
    cosine_similarity(embed(query), episode.embedding)
  × recency_weight(episode.last_accessed_at, tier)
  × importance_weight(episode.compaction_tier)

recency_weight(t, tier) = e^(−λ × days_since_last_access)

  λ = 0.04  for raw episodes (tier-0)    → 17-day half-life
  λ = 0.015 for tier-1 compact           → 46-day half-life
  λ = 0.005 for tier-2 compact           → ~139-day half-life
  λ = 0     for procedures               → no decay (active rules don't fade)
  λ = 0.02  for observations             → 35-day half-life

importance_weight:
  raw episode (tier-0)   → 1.0  (full signal)
  tier-1 compact         → 1.2  (confirmed repeated patterns — boosted)
  tier-2 compact         → 1.1  (period summaries)

Three things to notice in these numbers:

Tier-1 compact records get a higher importance weight (1.2) than raw episodes (1.0). This is intentional. A weekly summary exists because 5-15 individual events were similar enough to summarise together. That repetition is itself a signal — these patterns proved their worth. They should rank higher than a single raw event of equivalent semantic similarity.

Tier-2 records decay slower than tier-1 (λ=0.005 vs 0.015) because monthly period summaries represent stable, long-running patterns. A summary describing a full month of consistent behavior should remain relevant for much longer than a summary of a single week's activity.

Procedures have λ=0. A learned rule like "never use semicolons" doesn't become less applicable just because it hasn't been triggered recently. Decay doesn't touch rules.
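The half-lives quoted with each λ follow directly from the exponential: the weight drops to 0.5 when λ × days = ln 2, so half-life = ln 2 / λ. A short sketch to check the numbers; the function names are illustrative.

```typescript
// Days until the recency weight falls to 0.5. λ = 0 means no decay.
const halfLifeDays = (lambda: number): number =>
  lambda === 0 ? Infinity : Math.LN2 / lambda

// Same formula as recency_weight above: e^(−λ × days_since_last_access).
const recencyWeight = (lambda: number, daysSinceAccess: number): number =>
  Math.exp(-lambda * daysSinceAccess)

// Combined recency × importance for two records with equal similarity,
// both last accessed 30 days ago.
const rawAfter30 = recencyWeight(0.04, 30) * 1.0
const tier1After30 = recencyWeight(0.015, 30) * 1.2
```

For an equal similarity score, the raw episode ends up with a multiplier of about 0.30 and the tier-1 summary with about 0.77, which is the intended effect: confirmed patterns outlast one-off events.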

The Full Scoring Implementation

function scoreEpisode(
  row: EpisodeRow,
  semanticScore: number,
  keywordScore: number,
  now: number
): number {
  const λ = row.compaction_tier === 0 ? 0.04
          : row.compaction_tier === 1 ? 0.015
          :                             0.005

  const daysSinceAccess =
    (now - (row.last_accessed_at ?? row.created_at)) / 86_400_000

  const recency = Math.exp(-λ * daysSinceAccess)

  const importanceWeight =
    row.compaction_tier === 0 ? 1.0
  : row.compaction_tier === 1 ? 1.2
  :                             1.1

  // Semantic similarity weighted higher than keyword match
  return (semanticScore * 0.7 + keywordScore * 0.3) * recency * importanceWeight
}

Hybrid retrieval (BM25 keyword search merged with semantic similarity) is worth the implementation complexity. Contact names, amounts, and dates don't embed well: "Marco Bertelli" and "Bertelli" produce different vectors, but BM25 matches both on the shared token "Bertelli". For memory systems grounded in real-world entities, keyword recall fills the gaps that pure vector similarity misses.

The Context Budget

One of the most underspecified parts of agent memory design is how much of the context window each memory type should occupy. Without explicit budgets, whichever retrieval path returns the most text wins — which is almost never the right outcome.

Here's a concrete token budget for a 32,000-token context window:

Slot	Tokens	Content
System prompt base	800	Agent persona, core instructions, behavioral pact
Semantic facts	600	User profile + relevant contact profiles + company rules
Active procedures	400	All active rules for this agent (typically 3-8 rules)
Retrieved episodic memories	1,200	Top-5 most relevant past events, scored and formatted
Retrieved observations	600	Top-3 most relevant observations for this task type
Current task / working memory	4,000	The actual task payload
Tool call history (this session)	2,000	Tool calls and results so far
Response buffer	2,000	Reserved for model output
Total	~11,600	Leaves substantial headroom for larger payloads

The key insight: semantic facts and procedures are the cheapest and most reliable memory. 400 tokens of active procedural rules — plain-language behavioral constraints injected verbatim — have more impact on agent behavior than 1,200 tokens of retrieved episodic memories. Procedures are pre-validated, zero retrieval error, zero semantic ambiguity. Don't underallocate them to make room for more episodes.
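The budget table translates naturally into a constant plus a small guard per slot. A sketch under stated assumptions: the slot names are made up for the example, token counts use a rough four-characters-per-token estimate, and items arrive already ranked best-first.

```typescript
// Token caps per slot, copied from the table above.
const CONTEXT_BUDGET = {
  systemPromptBase: 800,
  semanticFacts: 600,
  activeProcedures: 400,
  episodicMemories: 1_200,
  observations: 600,
  workingMemory: 4_000,
  toolCallHistory: 2_000,
  responseBuffer: 2_000,
} as const

type Slot = keyof typeof CONTEXT_BUDGET

// Crude stand-in for a real tokenizer: roughly 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

// Keep whole items, best first, until the next one would exceed the cap.
function fitToSlot(slot: Slot, rankedItems: string[]): string[] {
  const kept: string[] = []
  let used = 0
  for (const item of rankedItems) {
    const cost = estimateTokens(item)
    if (used + cost > CONTEXT_BUDGET[slot]) break
    kept.push(item)
    used += cost
  }
  return kept
}

const totalBudget = Object.values(CONTEXT_BUDGET).reduce<number>((sum, cap) => sum + cap, 0)
```

Dropping whole low-ranked items is a selection step, not truncation: nothing is cut mid-record, and the ranking decides what stays.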

Overflow Handling

When the current task payload exceeds its budget (a long email thread, a large invoice batch):

Extract only the last 3 exchanges from the thread
Summarise older exchanges in 3 sentences using a cheap model call
Append the full text as a reference block the agent can query via tool if needed

Never silently truncate. Truncation removes content without the agent knowing it's missing.
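Those three steps can be sketched as one function that a caller runs whenever a thread is over budget. The `Exchange` shape, the `summarise` callback (standing in for the cheap model call), and the `storeFullThread` callback (returning a handle the agent can pass to a lookup tool) are all assumptions for the example.

```typescript
interface Exchange {
  from: string
  body: string
}

interface PreparedThread {
  olderSummary: string | null // null when nothing had to be summarised
  recent: Exchange[]          // the last exchanges, verbatim
  fullThreadRef: string       // handle for retrieving the full text via a tool
}

function prepareThread(
  thread: Exchange[],
  summarise: (older: Exchange[]) => string,
  storeFullThread: (thread: Exchange[]) => string,
  keepLast = 3
): PreparedThread {
  const cut = Math.max(0, thread.length - keepLast)
  const older = thread.slice(0, cut)
  const recent = thread.slice(cut)
  return {
    // The agent is told older content exists, in summarised form.
    olderSummary: older.length > 0 ? summarise(older) : null,
    recent,
    fullThreadRef: storeFullThread(thread),
  }
}
```

The important property is that the agent always sees a marker for what was left out, either a summary or a reference, so missing content is never invisible.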

What Injection Actually Looks Like

Concrete example: the email agent handles an incoming email from a known contact.

Semantic facts injected:

User: Maria Rossi, Nico Rossi Ltd, fashion sector, formal-concise tone, Italian
Contact: Marco Bertelli (Bertelli & Co, client): formal tone, no exclamation marks,
  reliable payments, primary contact for autumn/winter orders
Rules: never send without approval · emails containing 'urgent': high priority

Procedures injected (for the email agent):

- Never use semicolons — the user always removes them
- With technical clients: direct, no opening pleasantries, get to the point in the first line
- Quotes above €10,000: don't prepare a draft, the user always rewrites them
- Friday after 14:30: defer to Monday, don't prepare a response

Episodes injected (top 5 for this contact — mixed tiers):

[2 days ago] Drafted reply re: Q3 order. Outcome: approved and sent.

[1 week ago] Drafted reply re: samples. Outcome: modified by user
(removed semicolon, shortened central paragraph).

[2 weeks ago] Incoming email: quote request. Outcome: forwarded to
manager with tag "quote-bertelli".

[5 weeks ago · weekly summary] Week of May 2-8: 6 interactions with
Bertelli. 5 approved without edits (orders, shipping confirmations).
1 modified: removed semicolon, shortened opener. No exceptions.
Pattern confirmed: direct tone, brief openings, fast approval rate.

[3 months ago · monthly summary] March 2026 — email agent × Bertelli:
24 interactions, 96% approval rate. Consolidated style: formal-concise,
no semicolons, quotes always forwarded to manager. No significant
exceptions in the month.

Observations injected:

"With this contact you consistently use formal, concise tone.
No exclamation marks in 18 emails."
  high confidence · 18 occurrences

"Emails from this contact containing quote requests were always
marked 'to-do' — not responded to the same day."
  medium confidence · 4 occurrences

Total context used: ~1,800 tokens for all memory + ~400 tokens for the actual email.

The agent has three months of behavioral history about this specific contact — recent events at full fidelity, older patterns as compact summaries — without the context window growing unboundedly. This is what the compaction pipeline earns you.
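Assembling those four sections into one prompt block is mostly string formatting. A minimal sketch, assuming each memory type has already been fetched, scored, and trimmed to its budget; the `MemoryInjection` shape and `renderMemoryBlock` name are hypothetical.

```typescript
interface MemoryInjection {
  semanticFacts: string[]
  procedures: string[]
  episodes: { label: string; text: string }[]
  observations: { quote: string; confidence: string; occurrences: number }[]
}

function renderMemoryBlock(memory: MemoryInjection): string {
  const sections: string[] = []
  if (memory.semanticFacts.length > 0) {
    sections.push(['Semantic facts:', ...memory.semanticFacts].join('\n'))
  }
  if (memory.procedures.length > 0) {
    // Rule text is injected verbatim, one bullet per rule.
    sections.push(['Procedures:', ...memory.procedures.map(rule => `- ${rule}`)].join('\n'))
  }
  if (memory.episodes.length > 0) {
    sections.push(['Episodes:', ...memory.episodes.map(e => `[${e.label}] ${e.text}`)].join('\n'))
  }
  if (memory.observations.length > 0) {
    const lines = memory.observations.map(
      o => `"${o.quote}"\n  ${o.confidence} confidence · ${o.occurrences} occurrences`
    )
    sections.push(['Observations:', ...lines].join('\n'))
  }
  // Empty sections are skipped entirely rather than rendered as bare headers.
  return sections.join('\n\n')
}
```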

The Nightly Maintenance Job

All compaction, pattern detection, and promotion happen in a single nightly job. It runs at a quiet time (02:14, offset from round hours to avoid resource contention with other scheduled jobs):

async function nightlyMemoryMaintenance() {
  const now = Date.now()
  const day30ago = now - 30 * 86_400_000
  const day90ago = now - 90 * 86_400_000

  // Step 1: Tier-1 compaction
  // Find raw episodes older than 30 days, grouped by agent × contact × week
  // Summarise into weekly compact records using a cheap model
  // Source rows → status='superseded_raw', embeddings deleted
  await runTier1Compaction(day30ago)

  // Step 2: Tier-2 compaction
  // Find tier-1 compact records older than 90 days, grouped by agent × contact × month
  // Summarise into monthly period records
  // Source tier-1 records → status='superseded_tier1', embeddings deleted
  await runTier2Compaction(day90ago)

  // Step 3: Observation consolidation
  // Merge observations with cosine similarity > 0.85 in the same agent + category
  // Winner keeps the richer evidence, loser → status='merged', embedding deleted
  await consolidateObservations()

  // Step 4: Pattern detection on last 30 days of active episodes
  // Reads raw + tier-1 compact, outputs up to 3 new observations
  await runPatternDetection()

  // Step 5: Promotion check
  // Promote observations that meet the confidence + candidate threshold to procedures
  await checkPromotionCandidates()

  // Step 6: GDPR deletion requests
  // Process any queued erasure requests — all tiers, all memory types
  await processDeletionRequests()

  // Note: no decay score refresh step.
  // Decay is computed dynamically from last_accessed_at at retrieval time.
  // Pre-computing and storing it would add complexity for no benefit.
}

The ordering matters. Compaction runs before pattern detection so the pattern detector sees the already-compacted timeline — it reads compact summaries for older data, not thousands of raw rows. This keeps the pattern detection prompt short and cheap.

Choosing an Embedding Model for Behavioral Memory

For most RAG applications, text-embedding-3-small is the right default. For behavioral memory systems with multilingual content, you need to think harder about one specific capability: negation handling.

Consider these two memories:

"Never archive emails from Lara"
"Archive emails from Lara"

A static embedding model — one that averages token embeddings without running attention over the full sentence — will produce nearly identical vectors for these. The negation ("never") is a single low-frequency token whose embedding gets averaged away. In a document retrieval system, this is annoying. In a behavioral memory system where a wrong rule gets injected verbatim into every agent call, this is a correctness failure.

Contextual encoder models (XLM-RoBERTa family, E5 family) run full attention over the input. They produce meaningfully different embeddings for negated vs non-negated rules because the attention mechanism encodes the relationship between "never" and the rest of the sentence.

For local deployment (no data leaving the device), intfloat/multilingual-e5-small in q8 ONNX quantization is a strong choice:

384 dimensions, 117M parameters
~30MB on disk, ~90MB loaded
12-20ms warm inference on CPU
Strong multilingual quality including Italian, German, Spanish, French
Ships a pre-built quantized ONNX via Transformers.js — no compilation step

The E5 model requires prefixes on its inputs — passage: for documents being stored, query: for queries at retrieval time. This is a training requirement, not optional:

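import { pipeline } from '@huggingface/transformers'

// Build the feature-extraction pipeline once and reuse it. The model id is an
// assumption: an ONNX export of intfloat/multilingual-e5-small. dtype 'q8'
// selects the quantized weights (Transformers.js v3 option).
const extractor = await pipeline('feature-extraction', 'Xenova/multilingual-e5-small', { dtype: 'q8' })

// For storing a document (episode summary, observation, procedure rule)
const docEmbedding = await extractor(`passage: ${text}`, {
  pooling: 'mean', normalize: true
})

// For a retrieval query (task description, current agent context)
const queryEmbedding = await extractor(`query: ${text}`, {
  pooling: 'mean', normalize: true
})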

Omitting the prefix degrades retrieval quality measurably. The asymmetric prefixes are how the model was trained — passage: for longer, self-contained documents; query: for shorter, lookup-intent strings.

Run the model in a worker thread to keep the main process event loop unblocked. Embedding inference on CPU takes 12-20ms — tolerable in a background context, but a source of latency jitter if it blocks the main thread during agent execution.

GDPR: Memory Systems Have a Compliance Problem

Behavioral agent memory is not just vector embeddings. It's observations about how a person writes, when they work, how they make decisions. Under GDPR, this is personal data. Under the EU AI Act (most provisions apply from August 2026), agents that make consequential decisions using this data may be high-risk systems subject to documentation and traceability requirements.

The memory architecture choices that matter for compliance:

Namespace every memory type by user from day one. Not as an afterthought. If your behavioral data lives in a flat, unnested store, you cannot answer an Article 17 erasure request without a full table scan and potential collateral deletion. With a user_id column on every memory table, erasure becomes one indexed statement per table, such as DELETE FROM episodes WHERE user_id = ?, with no collateral damage to other users' rows.

Store behavioral signals, not content. The episode table should store "user modified draft, removed semicolon from third paragraph" — not the email text itself. Content stays in working memory and is discarded at session end. Behavioral patterns are what matter for memory; content is the medium through which they were expressed. This distinction dramatically reduces your Article 35 DPIA scope.

The deletion pipeline must cover all tiers. When a user requests erasure, you must delete: raw episodes, tier-1 compact records, tier-2 compact records, observations, procedures, embeddings in episodes_vec and observations_vec, and the source signal queue entries. A spec-level deletion path that covers only the main table and misses the vector tables is a compliance failure.
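One way to make the "all tiers" requirement hard to forget is to generate the erasure statements from a single list. The sketch below only builds the SQL; it does not execute anything. It assumes every base table has a `user_id` column, that the vector tables share an `id` with their base rows, and that the signal queue lives in a table called `agent_signals`. All of those names are assumptions.

```typescript
interface Statement {
  sql: string
  params: string[]
}

function buildErasureStatements(userId: string): Statement[] {
  const byOwner = (table: string): Statement => ({
    sql: `DELETE FROM ${table} WHERE user_id = ?`,
    params: [userId],
  })
  // Vector rows are removed via their base table, so they must go first.
  const vectorsOf = (vecTable: string, baseTable: string): Statement => ({
    sql: `DELETE FROM ${vecTable} WHERE id IN (SELECT id FROM ${baseTable} WHERE user_id = ?)`,
    params: [userId],
  })
  return [
    vectorsOf('episodes_vec', 'episodes'),
    vectorsOf('observations_vec', 'observations'),
    byOwner('procedures'),    // before observations: procedures reference them
    byOwner('observations'),
    byOwner('episodes'),      // one table holds raw, tier-1, and tier-2 rows
    byOwner('agent_signals'),
  ]
}
```

Running the list inside a single transaction keeps a failed erasure from leaving embeddings or summaries behind.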

Encrypt exported memory files. If you implement a memory export feature (for backup or portability), use AES-256-GCM with a scrypt-derived key from a user-supplied passphrase. The derived key should exist only in memory for the duration of the operation — never written to disk. A stolen backup file should reveal nothing without the passphrase.
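A minimal sketch of that export encryption using Node's built-in `node:crypto` module. The `EncryptedExport` shape and function names are assumptions, and scrypt runs with Node's default cost parameters, which you may want to raise for real backups.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from 'node:crypto'

interface EncryptedExport {
  salt: Buffer       // random per file, stored alongside the ciphertext
  iv: Buffer         // 96-bit nonce, as recommended for GCM
  authTag: Buffer    // detects tampering and wrong passphrases
  ciphertext: Buffer
}

function encryptExport(plaintext: string, passphrase: string): EncryptedExport {
  const salt = randomBytes(16)
  const iv = randomBytes(12)
  const key = scryptSync(passphrase, salt, 32) // 32 bytes = AES-256 key
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()])
  const authTag = cipher.getAuthTag()
  key.fill(0) // best-effort wipe: the key is never written anywhere
  return { salt, iv, authTag, ciphertext }
}

function decryptExport(file: EncryptedExport, passphrase: string): string {
  const key = scryptSync(passphrase, file.salt, 32)
  const decipher = createDecipheriv('aes-256-gcm', key, file.iv)
  decipher.setAuthTag(file.authTag)
  // final() throws if the passphrase is wrong or the data was modified.
  const plaintext = Buffer.concat([decipher.update(file.ciphertext), decipher.final()])
  key.fill(0)
  return plaintext.toString('utf8')
}
```

The salt, IV, and auth tag are not secrets and can be written into the export file header; only the passphrase has to stay with the user.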

Production Checklist

Before shipping a behavioral memory system:

No-delete rule enforced: decay score never triggers row deletion — only compaction
Compaction nightly job: tier-1 at 30 days, tier-2 at 90 days, never deletes source rows
Feedback loop complete: "wrong" feedback cascades to observation rejection + embedding deletion + procedure demotion + counter-signal write
Counter-signal on rejection: nightly pattern detection reads rejected quotes before inserting, skips near-duplicates of known rejections
Hybrid retrieval: BM25 + cosine similarity merged and rescored — not pure vector search
Tier-aware decay: different λ per compaction tier, λ=0 for procedures
Context budget enforced: explicit token caps per memory type, overflow handled by compression not truncation
Procedures injected in full: never vector-searched, always fetched entirely and prepended to system prompt
Semantic facts via direct SQL: no vector search for structured relational lookups
Embedding model handles negation: contextual encoder (E5 family), not static averaging
Embedding prefix discipline: passage: for documents, query: for retrieval queries
Worker thread for inference: never block the main event loop on embedding calls
GDPR deletion covers all tiers: raw episodes + compact records + embeddings + observations + procedures + signal queue
Behavioral signal only: episode table stores metadata and outcomes, not content
