DEV Community: Abhishek Chauhan

Production Agent Memory: Compaction, Decay, and the Observation Engine

Abhishek Chauhan — Mon, 18 May 2026 04:58:57 +0000

Most guides on agent memory stop at storage. Pick a vector store, embed your documents, retrieve the top-k. That works for RAG. It does not work for agents that run continuously across weeks and months, accumulating behavioral history about real users making real decisions.

Production agent memory is a different problem. The questions aren't just "what do I store?" and "how do I retrieve it?" They are:

How does the agent learn that a user always makes the same correction, without being explicitly told?
How do you give the agent three months of behavioral history without flooding the context window?
What happens when a retrieved memory is wrong — not just irrelevant, but actively contradicted by the user?
When should an observed pattern become a permanent rule?

This post builds a complete architecture for answering those questions — from taxonomy to scoring formula to the nightly maintenance job that keeps it all clean.

The Three Failure Modes

Before any architecture decision, name the failure modes you're designing against:

Too much — flooding the context with everything you know. The model gets slow, expensive, and loses precision. Ironically, more memory makes the agent worse.

Too little — injecting nothing. The agent repeats mistakes, ignores learned rules, asks the user to re-explain preferences they stated last week.

Wrong type — injecting stale, contradicted, or irrelevant memories. Worse than nothing: the agent acts on false information confidently.

Every design decision in this post traces back to avoiding one of these three.

The Four Memory Types

Not all memory is the same. A production system needs four distinct types, each with a different storage backend, lifecycle, and injection strategy.

Working Memory

The live context for the current task only. Exists for the duration of one agent run. Discarded when the task ends.

What it holds: the current task payload, intermediate reasoning steps, partial tool call results, running approval state.

Key constraint: 4,000-token maximum, enforced before every LLM call. If working memory exceeds this, first compress intermediate steps to a summary using a cheap model call. Never silently truncate, because truncation loses state. Compress explicitly.

Storage: in-process object. No database write during execution — only persisted when the task reaches terminal state (done, failed, rolled back).

Episodic Memory

A timestamped log of specific past events. The raw ground truth. "On Friday at 15:04, the agent drafted a reply to a supplier and the user modified it before sending."

What it holds:

Every completed agent action (who, what, when, outcome)
Every approval decision (approved / rejected / modified, with the modification text)
Every exception the agent surfaced and how it was resolved
Every user correction — agent did X, user changed it to Y

Storage: relational rows in an episodes table (append-only, never modified) plus a vector embedding per episode for semantic retrieval.

Retrieval: hybrid — BM25 full-text search for exact matches on names, amounts, dates, combined with cosine similarity on the embedding for conceptually related events. Results merged, rescored with the decay function, top-k injected.
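The merge step can be sketched as follows. This is a minimal illustration, not the post's implementation: the `Hit` type, the min-max normalisation, and the function names are assumptions. The 0.7 / 0.3 weights mirror the scoring function shown later in the post. Decay rescoring is left out and would be applied to the merged list afterwards.

```typescript
// Hypothetical shape: one scored result from either retrieval path.
type Hit = { id: string; score: number }

// Scale scores into [0, 1] so BM25 and cosine values are comparable.
// Assumption: min-max normalisation; other schemes (e.g. rank fusion) also work.
function minMaxNormalize(hits: Hit[]): Map<string, number> {
  const out = new Map<string, number>()
  if (hits.length === 0) return out
  const scores = hits.map(h => h.score)
  const min = Math.min(...scores)
  const max = Math.max(...scores)
  hits.forEach(h => {
    out.set(h.id, max === min ? 1 : (h.score - min) / (max - min))
  })
  return out
}

// Merge keyword and semantic hits by id; an id missing from one path scores 0 there.
function mergeHybrid(
  bm25: Hit[],
  semantic: Hit[],
  topK: number,
  wSemantic = 0.7,
  wKeyword = 0.3,
): Hit[] {
  const kw = minMaxNormalize(bm25)
  const sem = minMaxNormalize(semantic)
  const ids = new Set<string>(Array.from(kw.keys()).concat(Array.from(sem.keys())))
  const merged: Hit[] = []
  ids.forEach(id => {
    merged.push({
      id,
      score: wSemantic * (sem.get(id) ?? 0) + wKeyword * (kw.get(id) ?? 0),
    })
  })
  return merged.sort((a, b) => b.score - a.score).slice(0, topK)
}
```

An episode found by both paths naturally outranks one found by only one of them, which is the behaviour you want for queries that mention a contact by name.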

Semantic Memory

Facts independent of any specific event. Stable knowledge about the user, their contacts, their company, their preferences. Changes slowly.

What it holds:

User profile: name, company, tone preference, industry, CRM type
Contact profiles: name, relationship, tone preference, known quirks, routing notes
Company rules: payment terms per supplier, invoice thresholds, routing rules ("technical emails → forward to Lara")
Agent configuration: which capabilities are enabled, global off-limits (contacts never auto-replied to, folders never touched)

This type does not need vector search. Direct SQL lookups are faster, cheaper, and more precise for structured facts. Before every agent call, the orchestrator fetches the relevant semantic facts for that agent type and injects them as a structured block in the system prompt. Deterministic, synchronous, never misses.

Procedural Memory

Learned workflows — rules of thumb the agent has inferred from repeated user corrections and confirmed behavioral patterns. Not facts, not events, but how to behave.

What it holds (example rules, in plain language):

"Never use semicolons — the user always removes them"
"Emails from Lara: never archive automatically"
"Quotes above €10,000: don't prepare a draft, the user always rewrites them"
"Friday afternoon after 14:30: defer everything to Monday"
"Pelletteria Veneto SRL: always flag, never route autonomously"

Storage: a procedures table. Each row has: agent, rule text (plain language, injected verbatim), source observation ID (foreign key to the observation that triggered promotion), promoted timestamp, confirmation count, last applied timestamp.

No vector search needed here either. All active procedures for the current agent are fetched in full and injected at the start of every system prompt. The conservative promotion threshold keeps the rule set small (typically 3-8 rules per agent), so procedures should not overflow the context.

The Observation Engine

This is the component most memory systems don't have — and its absence is why agents that "remember" things still feel dumb.

The observation engine is the mechanism that detects behavioral patterns from raw episodic data and promotes them into procedural rules. It's the bridge between "things that happened" and "rules that govern future behavior."

Sources of Raw Signals

Every agent in the system feeds signals into a queue. The signals are not interpretations — they're raw behavioral data:

Signal	What it captures
User modifies a draft before sending	Which part changed, how many times this pattern has repeated
User rejects an approval	What was rejected, the rejection reason if given
User routes something differently than predicted	Where the agent sent it, where the user moved it
Agent surfaces an exception	New contact with no known rule, amount outside threshold
User correction	Agent did X, user explicitly changed it to Y

These signals go into a queue. The observation engine processes them nightly.

Pattern Detection

The nightly job (runs at 02:14) reads the last 30 days of the episodes table and prompts an LLM to identify genuine repeated patterns. The prompt enforces strict constraints:

You are the observation engine. Analyse user behavioral data and identify
genuine, repeated patterns. DO NOT invent. DO NOT generalise from a single event.
An observation is only valid if it has at least 3 consistent occurrences.

Required format for each observation:
<observation>
{
  "category": "writing-style | rhythm | people | tools | decisions",
  "agent": "email | accounting | crm | relay | files | system",
  "quote": "direct statement, max 25 words",
  "evidence": "concise phrase with supporting numbers, max 40 words",
  "occurrences": 4,
  "confidence": "low | medium | high | very-high",
  "promotion_candidate": true | false
}
</observation>

Rules:
- confidence = high only if occurrences >= 5 AND pattern is 80%+ consistent
- promotion_candidate = true only if the observation implies a clear action rule
- Maximum 3 new observations per run

The 3-observation cap per run prevents the system from flooding the observations table. Patterns that are genuine will recur and be detected in subsequent nightly runs, accumulating evidence over time.
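Prompt constraints are only requests, so it is worth re-checking them in code before anything is written. The sketch below is one way to do that, assuming the model's reply arrives as a single string. The function name and the constants are hypothetical. Only a few fields are validated here, and a full schema check is left out.

```typescript
type Confidence = 'low' | 'medium' | 'high' | 'very-high'

interface ObservationCandidate {
  category: string
  agent: string
  quote: string
  evidence: string
  occurrences: number
  confidence: Confidence
  promotion_candidate: boolean
}

const MIN_OCCURRENCES = 3 // from the prompt: at least 3 consistent occurrences
const MAX_PER_RUN = 3     // from the prompt: maximum 3 new observations per run
const MAX_QUOTE_WORDS = 25

function parseObservations(llmOutput: string): ObservationCandidate[] {
  const blocks = llmOutput.match(/<observation>([\s\S]*?)<\/observation>/g) ?? []
  const valid: ObservationCandidate[] = []
  for (const block of blocks) {
    const json = block.replace(/<\/?observation>/g, '')
    let parsed: unknown
    try {
      parsed = JSON.parse(json)
    } catch {
      continue // skip malformed blocks instead of failing the whole run
    }
    const o = parsed as Partial<ObservationCandidate>
    if (typeof o.quote !== 'string' || typeof o.occurrences !== 'number') continue
    if (o.occurrences < MIN_OCCURRENCES) continue
    if (o.quote.trim().split(/\s+/).length > MAX_QUOTE_WORDS) continue
    // Assumption: remaining fields are trusted here; validate them fully in real code.
    valid.push(o as ObservationCandidate)
  }
  return valid.slice(0, MAX_PER_RUN)
}
```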

Deduplication Before Insert

Before any new observation is written, it's checked against existing ones:

async function isDuplicate(newObs: ObservationCandidate): Promise<boolean> {
  const embedding = await embedDocument(newObs.quote)
  const results = await db.execute(sql`
    SELECT o.id, o.quote,
      vec_distance_cosine(ov.embedding, ${embedding}) as dist
    FROM observations_vec ov
    JOIN observations o ON o.id = ov.id
    WHERE o.agent    = ${newObs.agent}
      AND o.category = ${newObs.category}
      AND o.status   = 'active'
      AND vec_distance_cosine(ov.embedding, ${embedding}) < 0.15
    LIMIT 1
  `)
  return results.rows.length > 0
}

If a near-duplicate exists and the new one has a higher occurrence count, the existing row's evidence and occurrence count are updated in place rather than a second row being inserted. Observations grow stronger over time; they don't get duplicated.

Confidence Thresholds

low          → 3-4 occurrences,  pattern < 70% consistent
medium       → 4-5 occurrences,  pattern 70-80% consistent
high         → 5+  occurrences,  pattern > 80% consistent
very-high    → 8+  occurrences,  pattern > 90% consistent, no contradictions
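The occurrence ranges above overlap (4 occurrences could be low or medium, 5 could be medium or high), so an implementation has to pick a tie-break. A minimal sketch, assuming the tiers are checked from strictest to loosest and that both the occurrence and consistency conditions must hold. The function name and the `consistency` input (a fraction between 0 and 1) are hypothetical.

```typescript
type Confidence = 'low' | 'medium' | 'high' | 'very-high'

// Returns null when the pattern has too few occurrences to be an observation at all.
function classifyConfidence(
  occurrences: number,
  consistency: number,   // share of occurrences that agree with the pattern, 0..1
  contradictions: number,
): Confidence | null {
  if (occurrences < 3) return null
  if (occurrences >= 8 && consistency > 0.9 && contradictions === 0) return 'very-high'
  // The prompt says "80%+", so exactly 80% is treated as high here (an assumption).
  if (occurrences >= 5 && consistency >= 0.8) return 'high'
  if (occurrences >= 4 && consistency >= 0.7) return 'medium'
  return 'low'
}
```

Under this reading, a pattern with many occurrences but weak consistency still falls through to low, which matches the conservative intent of the thresholds.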

Promotion to Procedure

An observation is promoted to a procedural rule when all of the following hold:

Confidence = high or very-high
promotion_candidate = true
User has not marked it "wrong" (see feedback loop below)
User has not dismissed it within 48 hours of creation

The promotion threshold is deliberately conservative. A rule injected into every agent call for weeks shapes behavior continuously. False positives are more damaging than false negatives.
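The four conditions can be expressed as a single predicate. This sketch makes two assumptions the post does not spell out: that the 48-hour rule means waiting out the dismissal window before promoting, and that an explicit "You're right" from the user skips that wait. The row shape and field names are hypothetical.

```typescript
interface ObservationRow {
  confidence: 'low' | 'medium' | 'high' | 'very-high'
  promotionCandidate: boolean
  status: 'active' | 'rejected' | 'dismissed' // hypothetical status values
  userConfirmed: boolean                      // user clicked "You're right"
  createdAt: number                           // epoch milliseconds
}

const DISMISSAL_WINDOW_MS = 48 * 3_600_000

function isPromotable(obs: ObservationRow, now: number): boolean {
  const confident = obs.confidence === 'high' || obs.confidence === 'very-high'
  // 'rejected' covers "marked wrong"; 'dismissed' covers the 48-hour dismissal.
  const notVetoed = obs.status === 'active'
  const windowElapsed = now - obs.createdAt >= DISMISSAL_WINDOW_MS
  return confident && obs.promotionCandidate && notVetoed && (windowElapsed || obs.userConfirmed)
}
```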

The Feedback Loop

Every observation is shown to the user with two buttons: "You're right" and "You're wrong."

async function handleFeedback(
  obsId: string,
  feedback: 'correct' | 'wrong'
): Promise<void> {
  // Load the observation first: the rejection branch needs its quote
  const [obs] = await db.select()
    .from(observations)
    .where(eq(observations.id, obsId))
  if (!obs) throw new Error(`Observation ${obsId} not found`)

  if (feedback === 'correct') {
    await db.update(observations)
      .set({ confidenceBoost: sql`confidence_boost + 0.2` })
      .where(eq(observations.id, obsId))

    await maybePromoteToProcedure(obsId)

  } else {
    // Set status to rejected: excluded from all future retrieval
    await db.update(observations)
      .set({ status: 'rejected' })
      .where(eq(observations.id, obsId))

    // Delete embedding: rejected observations never surface in retrieval
    await deleteEmbedding('observation', obsId)

    // Demote any procedure that was promoted from this observation
    await db.update(procedures)
      .set({ status: 'demoted' })
      .where(eq(procedures.sourceObservationId, obsId))

    // Write a counter-signal so the nightly job doesn't regenerate
    // the same wrong observation next run
    await db.insert(agentSignals).values({
      signalType: 'observation_rejected',
      payload: JSON.stringify({ rejectedQuote: obs.quote }),
    })
  }
}

The counter-signal write is the part most implementations miss. Without it, the nightly pattern detection job will see the same 30 days of data, detect the same pattern, and reinsert the same observation you just rejected. The counter-signal closes the loop.
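On the reading side, the nightly job has to compare each new candidate against the rejected quotes before inserting. A minimal sketch of that filter, assuming the rejected quotes are re-embedded from the counter-signal payload (their stored embeddings were deleted) and reusing the 0.15 cosine-distance threshold from the deduplication query. The function names are hypothetical.

```typescript
type Vec = number[]

// Cosine distance = 1 - cosine similarity. Assumes non-zero vectors of equal length.
function cosineDistance(a: Vec, b: Vec): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Keep only candidates that are not near-duplicates of a known rejection.
function dropRejected<T extends { embedding: Vec }>(
  candidates: T[],
  rejectedEmbeddings: Vec[],
  maxDistance = 0.15,
): T[] {
  return candidates.filter(
    c => !rejectedEmbeddings.some(r => cosineDistance(c.embedding, r) < maxDistance),
  )
}
```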

The No-Delete Principle

This is the most important design decision in the entire architecture, and the one teams get wrong most often.

Do not delete episodic memory rows when their decay score falls below a threshold.

The reasoning matters. Decay score measures recency of access — how long since this episode was retrieved. It does not measure behavioral importance.

Consider an episode from 95 days ago: "User rejected the Magnani draft three times, never wanted assertive language with this contact." This episode scores low today because it hasn't been retrieved recently. The moment the agent receives a new email from Magnani, that old episode is the most critical thing in memory. If you deleted it on decay grounds, the agent has permanent amnesia about a behaviorally defining pattern — and will repeat the exact mistake the user corrected three times.

The correct model: low decay score means retrieve this less often, not destroy it.

Episodes are compacted — compressed into summary records that cost far less to store, embed, and retrieve — but the raw rows are never deleted by an automated job.

The only paths to actual hard deletion:

Explicit user GDPR erasure request
User marks an observation as wrong (its embedding is deleted and its source episodes are flagged as contradicted)
Admin-level seat deletion

Everything else is compaction.

The Compaction Pipeline

Compaction is a lossless-to-lossy compression pipeline that preserves behavioral signal in progressively smaller form. It's how you give an agent three months of behavioral history without overflowing the context window.

Tier 0 — Raw episodes
  Individual action records. Full detail. Embedded individually.
  "Mon 09:11, agent drafted reply to Bertelli re: Q3 order.
  User approved without modification."
  → Used for: recent retrieval (< 30 days), audit, rollback.

Tier 1 — Weekly compaction (triggered at 30 days)
  Groups of 5-15 raw episodes from the same agent + contact cluster,
  spanning one week, summarised into a single compact record.
  "Week of 3-9 June: agent handled 8 interactions with Bertelli.
  6 approved without edits (orders, shipping confirmations).
  1 modified: removed semicolon, shortened opening paragraph.
  1 routed to manager (quote request). Pattern confirmed: direct,
  no preamble, fast approval rate."
  Source raw episodes → status='superseded_raw', embeddings deleted.
  Compact record → status='active', fresh embedding from summary.
  Storage ratio: ~8:1.

Tier 2 — Monthly compaction (triggered at 90 days)
  Groups of tier-1 compact records from the same agent + contact + month,
  summarised into a period record.
  "June 2026 — email agent × Bertelli: 32 interactions.
  Approval rate 94%. Established pattern: direct opener, no semicolons,
  route quotes to manager. No exceptions. Tone: formal-concise, confirmed."
  Source tier-1 records → status='superseded_tier1', embeddings deleted.
  Compact record → status='active', fresh embedding.
  Storage ratio from raw: ~32:1.

Tier 2 records never compact further.
They are permanent behavioral summaries.
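The grouping step behind tier-1 compaction can be sketched like this. It is an illustration under stated assumptions: the `RawEpisode` shape and its `contact` field are hypothetical, and weeks are fixed 7-day windows counted from the Unix epoch rather than calendar weeks. The 5-15 group size from the description is not enforced here.

```typescript
interface RawEpisode {
  id: string
  agent: string
  contact: string   // hypothetical: the post stores entities as JSON
  createdAt: number // epoch milliseconds
}

const DAY_MS = 86_400_000
const WEEK_MS = 7 * DAY_MS

// Bucket raw episodes older than 30 days by agent, contact, and week.
function groupForTier1(episodes: RawEpisode[], now: number): Map<string, RawEpisode[]> {
  const cutoff = now - 30 * DAY_MS
  const groups = new Map<string, RawEpisode[]>()
  for (const ep of episodes) {
    if (ep.createdAt > cutoff) continue // still recent: stays at tier 0
    const week = Math.floor(ep.createdAt / WEEK_MS)
    const key = `${ep.agent}|${ep.contact}|${week}`
    const bucket = groups.get(key) ?? []
    bucket.push(ep)
    groups.set(key, bucket)
  }
  return groups
}
```

Each bucket would then be summarised into one compact record, with the source rows marked superseded_raw and their embeddings removed.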

The schema needs to track tier, status, and compact group ID on every episode row:

export const episodes = sqliteTable('episodes', {
  id:             text('id').primaryKey().$defaultFn(() => crypto.randomUUID()),
  agent:          text('agent').notNull(),
  eventType:      text('event_type').notNull(),
  // 'action_completed' | 'approval_approved' | 'approval_rejected' |
  // 'approval_modified' | 'exception_raised' | 'user_correction' | 'compact'

  summary:        text('summary'),          // raw episode: plain text, max 100 chars
  outcome:        text('outcome'),          // done | approved | rejected | modified | exception
  entities:       text('entities'),         // JSON: [{type: 'contact', name: 'John Smith'}]

  compactSummary: text('compact_summary'),  // compact record: multi-sentence narrative
  compactionTier: integer('compaction_tier').notNull().default(0),
  compactGroupId: text('compact_group_id'), // ID of the compact record covering this row

  status: text('status').notNull().default('raw'),
  // 'raw'              → live raw episode, has embedding
  // 'active'           → live compact record (tier-1 or tier-2), has embedding
  // 'superseded_raw'   → absorbed into tier-1, row kept, embedding gone
  // 'superseded_tier1' → absorbed into tier-2, row kept, embedding gone

  lastAccessedAt: integer('last_accessed_at'), // updated on each retrieval — feeds decay calc
  createdAt:      integer('created_at').notNull().$defaultFn(() => Date.now()),
})

The episodes_vec virtual table holds embeddings only for retrieval-eligible rows — status='raw' and status='active'. Superseded rows have no embedding. This means the semantic search naturally covers the full timeline at the right level of granularity: recent events as individual rows, older events as compact summaries. No extra filtering needed.

Tier-Aware Decay Scoring

The retrieval score governs which memories float to the top when assembling the context window. It is computed dynamically at retrieval time — not pre-computed, not stored.

retrieval_score(episode, query) =
    cosine_similarity(embed(query), episode.embedding)
  × recency_weight(episode.last_accessed_at, tier)
  × importance_weight(episode.compaction_tier)

recency_weight(t, tier) = e^(−λ × days_since_last_access)

  λ = 0.04  for raw episodes (tier-0)    → 17-day half-life
  λ = 0.015 for tier-1 compact           → 46-day half-life
  λ = 0.005 for tier-2 compact           → 139-day half-life
  λ = 0     for procedures               → no decay (active rules don't fade)
  λ = 0.02  for observations             → 35-day half-life

importance_weight:
  raw episode (tier-0)   → 1.0  (full signal)
  tier-1 compact         → 1.2  (confirmed repeated patterns — boosted)
  tier-2 compact         → 1.1  (period summaries)

Three things to notice in these numbers:

Tier-1 compact records get a higher importance weight (1.2) than raw episodes (1.0). This is intentional. A weekly summary exists because 5-15 individual events were similar enough to summarise together. That repetition is itself a signal — these patterns proved their worth. They should rank higher than a single raw event of equivalent semantic similarity.

Tier-2 records decay slower than tier-1 (λ=0.005 vs 0.015) because monthly period summaries represent stable, long-running patterns. A summary describing a full month of consistent behavior should remain relevant for much longer than a summary of a single week's activity.

Procedures have λ=0. A learned rule like "never use semicolons" doesn't become less applicable just because it hasn't been triggered recently. Decay doesn't touch rules.
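The half-lives quoted for each λ follow from the exponential form: the weight halves when λ × days = ln 2, so half-life = ln 2 / λ. A short sketch to check the figures (the helper names are illustrative, not from the post):

```typescript
// Decay constants as listed in the scoring table.
const LAMBDA = { raw: 0.04, tier1: 0.015, tier2: 0.005, observation: 0.02 }

// Days until the recency weight drops to 0.5.
const halfLifeDays = (lambda: number): number => Math.LN2 / lambda

// Recency weight after a given number of days without access.
const recencyWeight = (lambda: number, days: number): number => Math.exp(-lambda * days)

const halfLives = {
  raw: halfLifeDays(LAMBDA.raw),                 // roughly 17.3 days
  tier1: halfLifeDays(LAMBDA.tier1),             // roughly 46.2 days
  tier2: halfLifeDays(LAMBDA.tier2),             // roughly 138.6 days
  observation: halfLifeDays(LAMBDA.observation), // roughly 34.7 days
}
```

For the 95-day-old Magnani episode discussed earlier, an untouched raw row would carry a recency weight of about 0.02, which is why it ranks low until a strongly matching query arrives.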

The Full Scoring Implementation

function scoreEpisode(
  row: EpisodeRow,
  semanticScore: number,
  keywordScore: number,
  now: number
): number {
  const λ = row.compaction_tier === 0 ? 0.04
          : row.compaction_tier === 1 ? 0.015
          :                             0.005

  const daysSinceAccess =
    (now - (row.last_accessed_at ?? row.created_at)) / 86_400_000

  const recency = Math.exp(-λ * daysSinceAccess)

  const importanceWeight =
    row.compaction_tier === 0 ? 1.0
  : row.compaction_tier === 1 ? 1.2
  :                             1.1

  // Semantic similarity weighted higher than keyword match
  return (semanticScore * 0.7 + keywordScore * 0.3) * recency * importanceWeight
}

Hybrid retrieval — BM25 keyword search merged with semantic similarity — is worth the implementation complexity. Contact names, amounts, and dates don't embed well: "Marco Bertelli" and "Bertelli" produce different vectors but BM25 catches both as exact matches. For memory systems grounded in real-world entities, keyword recall fills the gaps that pure vector similarity misses.

The Context Budget

One of the most underspecified parts of agent memory design is how much of the context window each memory type should occupy. Without explicit budgets, whichever retrieval path returns the most text wins — which is almost never the right outcome.

Here's a concrete token budget for a 32,000-token context window:

Slot	Tokens	Content
System prompt base	800	Agent persona, core instructions, behavioral pact
Semantic facts	600	User profile + relevant contact profiles + company rules
Active procedures	400	All active rules for this agent (typically 3-8 rules)
Retrieved episodic memories	1,200	Top-5 most relevant past events, scored and formatted
Retrieved observations	600	Top-3 most relevant observations for this task type
Current task / working memory	4,000	The actual task payload
Tool call history (this session)	2,000	Tool calls and results so far
Response buffer	2,000	Reserved for model output
Total	~11,600	Leaves substantial headroom for larger payloads

The key insight: semantic facts and procedures are the cheapest and most reliable memory. 400 tokens of active procedural rules — plain-language behavioral constraints injected verbatim — have more impact on agent behavior than 1,200 tokens of retrieved episodic memories. Procedures are pre-validated, zero retrieval error, zero semantic ambiguity. Don't underallocate them to make room for more episodes.
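One way to make the budget enforceable is to encode the table as data and check actual usage against it before each call. A minimal sketch; the slot names and the helper are hypothetical, while the numbers are the ones from the table above.

```typescript
type Slot =
  | 'systemPromptBase'
  | 'semanticFacts'
  | 'activeProcedures'
  | 'episodicMemories'
  | 'observations'
  | 'workingMemory'
  | 'toolCallHistory'
  | 'responseBuffer'

const CONTEXT_BUDGET: Record<Slot, number> = {
  systemPromptBase: 800,
  semanticFacts: 600,
  activeProcedures: 400,
  episodicMemories: 1_200,
  observations: 600,
  workingMemory: 4_000,
  toolCallHistory: 2_000,
  responseBuffer: 2_000,
}

const SLOTS = Object.keys(CONTEXT_BUDGET) as Slot[]

// Sum of all slot caps; matches the ~11,600 total in the table.
const totalBudget = SLOTS.reduce((sum, slot) => sum + CONTEXT_BUDGET[slot], 0)

// Returns the slots whose measured token usage exceeds their cap.
function overBudgetSlots(usage: Partial<Record<Slot, number>>): Slot[] {
  return SLOTS.filter(slot => (usage[slot] ?? 0) > CONTEXT_BUDGET[slot])
}
```

Any slot returned by `overBudgetSlots` should trigger compression for that slot rather than borrowing tokens from another one.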

Overflow Handling

When the current task payload exceeds its budget (a long email thread, a large invoice batch):

Extract only the last 3 exchanges from the thread
Summarise older exchanges in 3 sentences using a cheap model call
Append the full text as a reference block the agent can query via tool if needed

Never silently truncate. Truncation removes content without the agent knowing it's missing.
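A sketch of the three steps for an email thread, assuming the summary text and a reference ID come from elsewhere (a cheap model call and a storage layer, neither shown). The types and function names are hypothetical. The explicit marker line is what tells the agent that older content exists.

```typescript
interface Exchange {
  from: string
  body: string
}

// Step 1: keep the last N exchanges verbatim, hand the rest to the summariser.
function splitThread(thread: Exchange[], keepLast = 3): { older: Exchange[]; recent: Exchange[] } {
  const cut = Math.max(0, thread.length - keepLast)
  return { older: thread.slice(0, cut), recent: thread.slice(cut) }
}

// Steps 2 and 3: render the payload with an explicit note about what was compressed.
function renderTaskPayload(
  recent: Exchange[],
  olderCount: number,
  olderSummary: string,
  referenceId: string,
): string {
  const lines: string[] = []
  if (olderCount > 0) {
    lines.push(`[${olderCount} earlier exchanges summarised; full text available as reference ${referenceId}]`)
    lines.push(olderSummary)
  }
  for (const ex of recent) lines.push(`${ex.from}: ${ex.body}`)
  return lines.join('\n')
}
```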

What Injection Actually Looks Like

Concrete example: the email agent handles an incoming email from a known contact.

Semantic facts injected:

User: Maria Rossi, Nico Rossi Ltd, fashion sector, formal-concise tone, Italian
Contact: Marco Bertelli (Bertelli & Co, client): formal tone, no exclamation marks,
  reliable payments, primary contact for autumn/winter orders
Rules: never send without approval · emails containing 'urgent': high priority

Procedures injected (for the email agent):

- Never use semicolons — the user always removes them
- With technical clients: direct, no opening pleasantries, get to the point in the first line
- Quotes above €10,000: don't prepare a draft, the user always rewrites them
- Friday after 14:30: defer to Monday, don't prepare a response

Episodes injected (top 5 for this contact — mixed tiers):

[2 days ago] Drafted reply re: Q3 order. Outcome: approved and sent.

[1 week ago] Drafted reply re: samples. Outcome: modified by user
(removed semicolon, shortened central paragraph).

[2 weeks ago] Incoming email: quote request. Outcome: forwarded to
manager with tag "quote-bertelli".

[5 weeks ago · weekly summary] Week of May 2-8: 6 interactions with
Bertelli. 5 approved without edits (orders, shipping confirmations).
1 modified: removed semicolon, shortened opener. No exceptions.
Pattern confirmed: direct tone, brief openings, fast approval rate.

[3 months ago · monthly summary] March 2026 — email agent × Bertelli:
24 interactions, 96% approval rate. Consolidated style: formal-concise,
no semicolons, quotes always forwarded to manager. No significant
exceptions in the month.

Observations injected:

"With this contact you consistently use formal, concise tone.
No exclamation marks in 18 emails."
  high confidence · 18 occurrences

"Emails from this contact containing quote requests were always
marked 'to-do' — not responded to the same day."
  medium confidence · 4 occurrences

Total context used: ~1,800 tokens for all memory + ~400 tokens for the actual email.

The agent has three months of behavioral history about this specific contact — recent events at full fidelity, older patterns as compact summaries — without the context window growing unboundedly. This is what the compaction pipeline earns you.

The Nightly Maintenance Job

All compaction, pattern detection, and promotion happens in a single nightly job. It runs at a quiet time (02:14 — offset from round hours to avoid resource contention with other scheduled jobs):

async function nightlyMemoryMaintenance() {
  const now = Date.now()
  const day30ago = now - 30 * 86_400_000
  const day90ago = now - 90 * 86_400_000

  // Step 1: Tier-1 compaction
  // Find raw episodes older than 30 days, grouped by agent × contact × week
  // Summarise into weekly compact records using a cheap model
  // Source rows → status='superseded_raw', embeddings deleted
  await runTier1Compaction(day30ago)

  // Step 2: Tier-2 compaction
  // Find tier-1 compact records older than 90 days, grouped by agent × contact × month
  // Summarise into monthly period records
  // Source tier-1 records → status='superseded_tier1', embeddings deleted
  await runTier2Compaction(day90ago)

  // Step 3: Observation consolidation
  // Merge observations with cosine similarity > 0.85 in the same agent + category
  // Winner keeps the richer evidence, loser → status='merged', embedding deleted
  await consolidateObservations()

  // Step 4: Pattern detection on last 30 days of active episodes
  // Reads raw + tier-1 compact, outputs up to 3 new observations
  await runPatternDetection()

  // Step 5: Promotion check
  // Promote observations that meet the confidence + candidate threshold to procedures
  await checkPromotionCandidates()

  // Step 6: GDPR deletion requests
  // Process any queued erasure requests — all tiers, all memory types
  await processDeletionRequests()

  // Note: no decay score refresh step.
  // Decay is computed dynamically from last_accessed_at at retrieval time.
  // Pre-computing and storing it would add complexity for no benefit.
}

The ordering matters. Compaction runs before pattern detection so the pattern detector sees the already-compacted timeline — it reads compact summaries for older data, not thousands of raw rows. This keeps the pattern detection prompt short and cheap.

Choosing an Embedding Model for Behavioral Memory

For most RAG applications, text-embedding-3-small is the right default. For behavioral memory systems with multilingual content, you need to think harder about one specific capability: negation handling.

Consider these two memories:

"Never archive emails from Lara"
"Archive emails from Lara"

A static embedding model — one that averages token embeddings without running attention over the full sentence — will produce nearly identical vectors for these. The negation ("never") is a single low-frequency token whose embedding gets averaged away. In a document retrieval system, this is annoying. In a behavioral memory system where a wrong rule gets injected verbatim into every agent call, this is a correctness failure.

Contextual encoder models (XLM-RoBERTa family, E5 family) run full attention over the input. They produce meaningfully different embeddings for negated vs non-negated rules because the attention mechanism encodes the relationship between "never" and the rest of the sentence.

For local deployment (no data leaving the device), intfloat/multilingual-e5-small in q8 ONNX quantization is a strong choice:

384 dimensions, 117M parameters
~30MB on disk, ~90MB loaded
12-20ms warm inference on CPU
Strong multilingual quality including Italian, German, Spanish, French
Ships a pre-built quantized ONNX via Transformers.js — no compilation step

The E5 model requires prefixes on its inputs — passage: for documents being stored, query: for queries at retrieval time. This is a training requirement, not optional:

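import { pipeline } from '@xenova/transformers'

// Assumption: the Xenova ONNX export of intfloat/multilingual-e5-small
const extractor = await pipeline('feature-extraction', 'Xenova/multilingual-e5-small')

// For storing a document (episode summary, observation, procedure rule)
const docEmbedding = await extractor(`passage: ${text}`, {
  pooling: 'mean', normalize: true
})

// For a retrieval query (task description, current agent context)
const queryEmbedding = await extractor(`query: ${text}`, {
  pooling: 'mean', normalize: true
})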

Omitting the prefix degrades retrieval quality measurably. The asymmetric prefixes are how the model was trained — passage: for longer, self-contained documents; query: for shorter, lookup-intent strings.

Run the model in a worker thread to keep the main process event loop unblocked. Embedding inference on CPU takes 12-20ms — tolerable in a background context, but a source of latency jitter if it blocks the main thread during agent execution.

GDPR: Memory Systems Have a Compliance Problem

Behavioral agent memory is not just vector embeddings. It's observations about how a person writes, when they work, how they make decisions. Under GDPR, this is personal data. Under the EU AI Act (most provisions apply from August 2026, with some high-risk obligations following in August 2027), agents that make consequential decisions using this data may be high-risk systems subject to documentation and traceability requirements.

The memory architecture choices that matter for compliance:

Namespace every memory type by user from day one. Not as an afterthought. If your behavioral data lives in a flat, unnested store, you cannot answer an Article 17 erasure request without a full table scan and potential collateral deletion. With an indexed user_id column on every table and foreign keys declared ON DELETE CASCADE, erasure becomes a single targeted statement: DELETE FROM episodes WHERE user_id = ? cascades cleanly.

Store behavioral signals, not content. The episode table should store "user modified draft, removed semicolon from third paragraph" — not the email text itself. Content stays in working memory and is discarded at session end. Behavioral patterns are what matter for memory; content is the medium through which they were expressed. This distinction dramatically reduces your Article 35 DPIA scope.

The deletion pipeline must cover all tiers. When a user requests erasure, you must delete: raw episodes, tier-1 compact records, tier-2 compact records, observations, procedures, embeddings in episodes_vec and observations_vec, and the source signal queue entries. A spec-level deletion path that covers only the main table and misses the vector tables is a compliance failure.
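A deletion path is easier to audit when the full list of statements lives in one place. The sketch below assumes every base table carries a user_id column, that the vector tables share IDs with their base tables (as in the deduplication query), and that the signal queue is a table named agent_signals. All three are assumptions rather than details confirmed by the schema shown earlier.

```typescript
// Ordered so that dependent rows go first: vector rows before their base rows,
// procedures before the observations they reference.
function erasureStatements(): string[] {
  return [
    'DELETE FROM episodes_vec WHERE id IN (SELECT id FROM episodes WHERE user_id = ?)',
    'DELETE FROM observations_vec WHERE id IN (SELECT id FROM observations WHERE user_id = ?)',
    'DELETE FROM procedures WHERE user_id = ?',
    'DELETE FROM observations WHERE user_id = ?',
    'DELETE FROM episodes WHERE user_id = ?', // covers raw, tier-1, and tier-2 rows
    'DELETE FROM agent_signals WHERE user_id = ?',
  ]
}
```

Running the statements inside one transaction keeps a failed erasure from leaving embeddings behind without their source rows, or the reverse.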

Encrypt exported memory files. If you implement a memory export feature (for backup or portability), use AES-256-GCM with a scrypt-derived key from a user-supplied passphrase. The derived key should exist only in memory for the duration of the operation — never written to disk. A stolen backup file should reveal nothing without the passphrase.
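A minimal sketch of that export scheme using Node's built-in crypto module. The blob layout (salt, IV, auth tag, ciphertext) and the function names are assumptions, and scrypt runs with Node's default cost parameters, which you may want to raise.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from 'node:crypto'

const SALT_LEN = 16
const IV_LEN = 12  // recommended nonce size for GCM
const TAG_LEN = 16

function encryptExport(plaintext: string, passphrase: string): Buffer {
  const salt = randomBytes(SALT_LEN)
  const iv = randomBytes(IV_LEN)
  const key = scryptSync(passphrase, salt, 32) // 32 bytes = AES-256 key
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()])
  const tag = cipher.getAuthTag()
  key.fill(0) // best-effort wipe of the derived key from memory
  return Buffer.concat([salt, iv, tag, ciphertext])
}

function decryptExport(blob: Buffer, passphrase: string): string {
  const salt = blob.subarray(0, SALT_LEN)
  const iv = blob.subarray(SALT_LEN, SALT_LEN + IV_LEN)
  const tag = blob.subarray(SALT_LEN + IV_LEN, SALT_LEN + IV_LEN + TAG_LEN)
  const ciphertext = blob.subarray(SALT_LEN + IV_LEN + TAG_LEN)
  const key = scryptSync(passphrase, salt, 32)
  const decipher = createDecipheriv('aes-256-gcm', key, iv)
  decipher.setAuthTag(tag)
  // final() throws if the passphrase is wrong or the blob was tampered with
  const plain = Buffer.concat([decipher.update(ciphertext), decipher.final()])
  key.fill(0)
  return plain.toString('utf8')
}
```

Because GCM is authenticated, a wrong passphrase fails loudly at decryption instead of producing garbage output.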

Production Checklist

Before shipping a behavioral memory system:

No-delete rule enforced: decay score never triggers row deletion — only compaction
Compaction nightly job: tier-1 at 30 days, tier-2 at 90 days, never deletes source rows
Feedback loop complete: "wrong" feedback cascades to observation rejection + embedding deletion + procedure demotion + counter-signal write
Counter-signal on rejection: nightly pattern detection reads rejected quotes before inserting, skips near-duplicates of known rejections
Hybrid retrieval: BM25 + cosine similarity merged and rescored — not pure vector search
Tier-aware decay: different λ per compaction tier, λ=0 for procedures
Context budget enforced: explicit token caps per memory type, overflow handled by compression not truncation
Procedures injected in full: never vector-searched, always fetched entirely and prepended to system prompt
Semantic facts via direct SQL: no vector search for structured relational lookups
Embedding model handles negation: contextual encoder (E5 family), not static averaging
Embedding prefix discipline: passage: for documents, query: for retrieval queries
Worker thread for inference: never block the main event loop on embedding calls
GDPR deletion covers all tiers: raw episodes + compact records + embeddings + observations + procedures + signal queue
Behavioral signal only: episode table stores metadata and outcomes, not content


Your AI Agent Is Confidently Lying — And It's Your Memory System's Fault

Abhishek Chauhan — Mon, 06 Apr 2026 20:40:32 +0000

Last month, an AI agent I built told a user "As a Senior Engineer at Google, you should consider..."

The user had been promoted to Staff Engineer three months earlier. The agent had no idea. No error. No warning. Just a confident, wrong answer served from stale memory.

That's when I realized: the biggest risk in AI agents isn't hallucination — it's stale memory served with high confidence.

The Problem Nobody Talks About

AI agents using memory systems (Mem0, Zep, Letta, LangMem) store facts about users, companies, and decisions. Things like:

"John works as Senior Engineer at Google"
"Pro plan costs $99/month"
"Sarah reports to Mike in Engineering"

These facts get stored once and served forever. No expiration. No re-verification. No staleness check.

Here's what makes it dangerous: memory systems decay facts by access frequency or TTL timers. But a frequently-retrieved memory about a user's job title is highly relevant until the moment it's wrong — at which point it becomes confidently wrong rather than just outdated.

An agent without memory would ask "What do you do?" again. Slightly annoying, but honest. An agent with stale memory states the wrong answer as established fact. That's worse.

How Big Is This Problem?

I ran a simple experiment. I stored 24 real-world facts in Mem0 — job titles, pricing, company info, policies, technical details. Then I checked each one against its original source after simulating 90 days:

Pricing facts — 55% had changed
Policy facts — 45% had changed
Job titles — 15% had changed
Addresses — 5% had changed

More than a third of stored facts were wrong within 3 months. And agents were retrieving them hundreds of times without knowing.

What I Built: MemGuard

I built an open-source platform that sits beside your memory system (doesn't replace it) and continuously validates whether stored facts are still true.

Think of it as Datadog for agent memory — it monitors, validates, and alerts, but doesn't own the data.

How It Works

1. Connect — MemGuard plugs into your existing memory system. Native connectors for Mem0, Zep, Letta, LangMem, or any REST API.

2. Validate — Five strategies, from simple to AI-powered:

Strategy          How                                              Needs LLM?
Source-Linked     Re-fetch original source URL, compare values     No
Cross-Reference   Check against 2-3 independent sources            No
Temporal Pattern  Statistical staleness prediction per fact-type   No
Semantic Drift    LLM detects contradictions in recent context     Yes
Causal Chain      Find dependent facts that break together         Yes

3. Score — Every memory gets a composite trust score (0-100%) based on source reliability, freshness, cross-reference agreement, dependency health, historical accuracy, and retrieval importance.

4. Quarantine — Facts below 30% trust are automatically quarantined so agents stop using them. Facts below 50% are flagged for review.

5. Alert — Dashboard, webhooks, or MCP tools so agents can call validate_memory() before acting on stored facts.
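
Steps 2 and 4 can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not MemGuard's actual code: the `Memory` class, `source_linked_check`, `triage`, the injected `fetch` callable, and the example URL and price are all hypothetical. Only the 30% and 50% thresholds come from the steps above.

```python
from dataclasses import dataclass
from typing import Callable

QUARANTINE_BELOW = 30.0  # percent, from step 4
REVIEW_BELOW = 50.0      # percent, from step 4


@dataclass
class Memory:
    memory_id: str
    content: str
    source_url: str
    trust: float  # composite trust score, 0-100


def source_linked_check(memory: Memory, fetch: Callable[[str], str]) -> bool:
    """Source-Linked strategy: re-fetch the source and compare values."""
    current = fetch(memory.source_url)
    return current.strip().lower() == memory.content.strip().lower()


def triage(memory: Memory) -> str:
    """Map a trust score to the action described in step 4."""
    if memory.trust < QUARANTINE_BELOW:
        return "quarantined"
    if memory.trust < REVIEW_BELOW:
        return "flagged_for_review"
    return "active"


def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP re-fetch; the new price is made up for the demo.
    return {"https://example.com/pricing": "Pro plan costs $129/month"}[url]


price = Memory("m1", "Pro plan costs $99/month",
               "https://example.com/pricing", trust=24.0)
still_true = source_linked_check(price, fake_fetch)  # False: source changed
status = triage(price)                               # "quarantined"
```

Injecting `fetch` keeps the comparison logic testable without network access; a real connector would also need to normalize formats before comparing.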

The Trust Score

This is the core of MemGuard. Each memory's trust score is a weighted combination of:

Trust = 0.20 x source_reliability
      + 0.25 x freshness (exponential decay by fact-type)
      + 0.20 x cross_reference_agreement  
      + 0.10 x dependency_health
      + 0.15 x historical_accuracy
      + 0.10 x retrieval_importance

The key insight: retrieval frequency increases urgency, not trust. A stale memory retrieved 100 times/day is more dangerous than one retrieved once/month. High retrieval + low trust = highest risk.
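
As a rough sketch of the formula above, the weighted sum fits in a small Python function. The weights are the ones listed; everything else is an assumption for illustration: each component is taken to be a value between 0.0 and 1.0, the sample component values are invented, and `urgency` is a hypothetical ranking heuristic for "high retrieval + low trust", not a documented MemGuard function.

```python
WEIGHTS = {
    "source_reliability": 0.20,
    "freshness": 0.25,
    "cross_reference_agreement": 0.20,
    "dependency_health": 0.10,
    "historical_accuracy": 0.15,
    "retrieval_importance": 0.10,
}


def trust_score(components: dict[str, float]) -> float:
    """Weighted sum of components, each expected in the range 0.0-1.0."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def urgency(trust: float, retrievals_per_day: float) -> float:
    """Assumed heuristic: heavily used, low-trust memories rank first."""
    return retrievals_per_day * (1.0 - trust)


# Invented component values for a job-title memory that has gone stale.
job_title = {
    "source_reliability": 0.8,
    "freshness": 0.2,
    "cross_reference_agreement": 0.5,
    "dependency_health": 1.0,
    "historical_accuracy": 0.7,
    "retrieval_importance": 0.9,
}
score = trust_score(job_title)                     # about 0.605
hot = urgency(score, retrievals_per_day=100)       # retrieved 100 times/day
cold = urgency(score, retrievals_per_day=1 / 30)   # retrieved once/month
```

With the same trust score, the memory retrieved 100 times a day ranks far ahead of the one retrieved once a month, which is the ordering the paragraph above argues for.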

MCP Integration — Agents Validate Before Acting

MemGuard exposes an MCP server so agents can self-check before using memories:

# Agent's internal flow (get_memory, mcp, and respond are placeholders)
memory = get_memory("user_job_title")

# Before acting on it, validate
result = mcp.call("validate_memory", {"memory_id": memory.id})

# trust_score is compared as a fraction here: 0.7 means 70% trust
if result.trust_score > 0.7:
    # Safe to use
    respond(f"As a {memory.content}...")
else:
    # Don't trust it, ask the user instead
    respond("Can you confirm your current role?")

Four MCP tools available:

validate_memory — check a specific fact before using it
get_memory_health — overall health metrics
report_stale_memory — agent reports suspected staleness
get_trusted_memories — retrieve only high-trust facts

Quick Start

Clone the repository and start the stack:

git clone https://github.com/ac12644/MemGuard.git
cd MemGuard
docker-compose up

Dashboard at localhost:3000. API docs at localhost:8001/docs.

Then: Add Connector -> Pick Mem0/Zep/Letta -> Enter API key -> Sync -> Run Validation.

Tech Stack

Backend: Python 3.12, FastAPI, SQLAlchemy 2.0, Celery
Database: PostgreSQL 16, Redis 7
Dashboard: React 18, Tailwind CSS, Vite, Recharts
LLM: Anthropic Claude (optional — core works without it)
MCP: Python MCP SDK for agent integration
Deploy: Docker Compose, Caddy for auto-TLS in production

What I Learned Building This

1. Fact-type matters more than age. Pricing changes every quarter. Addresses change every decade. A blanket TTL is useless — you need per-category staleness curves.

2. The most dangerous memories are the most useful ones. High-retrieval memories are the ones agents rely on most. When they go stale, the blast radius is massive.

3. Agents should validate, not just retrieve. The MCP integration changes the agent's behavior from "retrieve and trust" to "retrieve, validate, then decide." That single change prevents most stale-memory errors.

4. You don't need LLM for most validation. Source re-fetch and temporal patterns catch 80% of staleness without any LLM cost. Save the AI-powered strategies for edge cases.
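
Point 1 can be made concrete with a per-category decay curve. The sketch below assumes exponential decay with a half-life per fact type. The pricing and address half-lives follow the "every quarter" and "every decade" figures above; the policy and job-title values, and the names `HALF_LIFE_DAYS` and `freshness`, are illustrative assumptions rather than MemGuard's settings.

```python
import math

# Assumed half-lives in days per fact type; illustrative values only.
HALF_LIFE_DAYS = {
    "pricing": 90.0,     # "changes every quarter"
    "policy": 120.0,     # assumption
    "job_title": 365.0,  # assumption
    "address": 3650.0,   # "changes every decade"
}


def freshness(fact_type: str, age_days: float) -> float:
    """Exponential decay: 1.0 when just verified, 0.5 after one half-life."""
    half_life = HALF_LIFE_DAYS[fact_type]
    return math.exp(-math.log(2) * age_days / half_life)


# The same 90-day age means very different things per category.
pricing_90 = freshness("pricing", 90)  # 0.5
address_90 = freshness("address", 90)  # still above 0.98
```

A blanket 90-day TTL would treat both facts identically; the per-type curve halves the pricing fact's freshness while leaving the address almost untouched.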

Open Source — Apache 2.0

The full project is on GitHub:

ac12644 / MemGuard

AI Agent Memory Validation Platform — continuously verify whether facts stored in AI agent memory systems (Mem0, Zep, Letta, LangMem) are still true. Like Datadog for agent memory.




5 connectors (Mem0, Zep, Letta, LangMem, Generic REST)
5 validation strategies
40 API endpoints
Dashboard with onboarding
MCP server for agent integration
Production-ready with Caddy TLS + automated backups

Contributions welcome. If you're building AI agents with memory systems, I'd love to hear what validation strategies matter most for your use cases.

If your agent has ever confidently told a user something that was true six months ago but not today — that's the problem MemGuard solves.

I Built a Multi-Agent Starter Kit with LangGraph — 6 Patterns, 5 Providers, One Command

Abhishek Chauhan — Sun, 05 Apr 2026 14:09:08 +0000

If you've built more than one LangGraph project, you know the drill. Supervisor setup. Provider config. Handoff tools. Persistence. Streaming endpoint. Same boilerplate, different repo.

So I stopped rewriting it and packaged the whole thing.

LangGraph Starter Kit

npx create-langgraph-app

Interactive CLI. Pick your provider, pick your patterns, get a project that runs.

Or clone the full kit with everything included.

6 Patterns

Each one is a standalone app you can use, modify, or delete:

Supervisor — central coordinator routes tasks to worker agents
Swarm — agents hand off to each other with transfer tools, no central brain
Human-in-the-Loop — graph pauses for approval before destructive actions
Structured Output — typed JSON responses validated by Zod
Research Agent — web search + scraping, supervisor coordinates a researcher and writer
RAG — in-memory vector store, semantic retrieval, no external DB

5 Providers

LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...

Two lines. Done.

OpenAI, Anthropic, Google, Groq, Ollama (local). Each has a sensible default model. Override with LLM_MODEL if you want.

Extending It

export function createMyApp() {
  const agent = makeAgent({
    name: "my_agent",
    llm,
    tools: [/* your tools */],
    system: "You are a helpful assistant.",
  });

  return makeSupervisor({
    agents: [agent],
    llm,
    outputMode: "last_message",
    supervisorName: "my_supervisor",
  });
}

Also Ships With

MCP tool integration (stdio + HTTP)
SSE streaming on every endpoint
LangGraph Studio config
LangSmith tracing (one env var)
Docker Compose with Postgres
25+ tests, GitHub Actions CI
Railway + Render deploy configs

Get Started

npx create-langgraph-app

Or:

git clone https://github.com/ac12644/langgraph-starter-kit.git
cd langgraph-starter-kit
npm install && cp .env.example .env
npm run dev

ac12644 / langgraph-starter-kit

Boilerplate for building multi-agent AI systems with LangGraph. Includes Swarm and Supervisor patterns, memory, tools, and HTTP API out of the box.


Why This Exists

Building multi-agent systems with LangGraph means writing the same boilerplate over and over — setting up supervisors, wiring handoff tools, configuring providers, adding persistence. This starter kit gives you all of that out of the box so you can focus on your agent logic, not infrastructure.

npx create-langgraph-app

What you get:

Pick your LLM provider (OpenAI, Anthropic, Google, Groq, or local Ollama)
Choose which agent patterns you need
Get a ready-to-run project with tests, types, and a Fastify server

Or clone the full kit with all 6 patterns included.

Architecture

[Architecture diagram, truncated in the source. Recoverable top level: LangGraph Starter Kit branches into CLI Demo, HTTP Server, and LangGraph …]


Apache 2.0. PRs welcome.

What are you building with LangGraph? Curious what patterns people are reaching for.
