Jack M

Posted on Jun 12

AI Agent Memory Store: Stop Long-Running Agents From Forgetting the Job

#agents #ai #architecture #llm

An AI agent can look brilliant for ten minutes and look lost after ten steps.

It starts with a clean plan. Then the agent reads docs, calls tools, rewrites files, summarizes a customer ticket, checks a policy, and tries to continue. Somewhere in that loop, it forgets why a decision was made. It repeats a tool call. It trusts an old fact. It pulls the wrong tenant preference. The output still sounds confident, but the job has drifted.

That is not only a model problem. It is a memory design problem.

If you are building production AI workflows, you need more than a bigger context window. You need an AI agent memory store: a controlled system for deciding what the agent remembers, what it forgets, what it retrieves, and what it is allowed to use.

Why Agent Memory Is Suddenly a Production Problem

Recent AI tooling trends point in the same direction: agents are getting longer-lived, more tool-heavy, and more expensive to run. Developers are asking how to run agents reliably in production, not just how to build impressive demos.

A simple chatbot can survive with a single prompt and recent messages. A production agent cannot.

It may need to remember:

The user's goal
Decisions made earlier in the workflow
Tool results that should not be recomputed
Customer preferences
Tenant-specific rules
Failed attempts
Approval history
Source snapshots
Known risks
What not to do again

The catch: remembering everything is dangerous.

Too much memory creates token bloat, stale context, privacy risk, cross-tenant leakage, and weird behavior where the agent follows old assumptions instead of the current task. Too little memory makes the agent repeat work and lose the thread.

The goal is not infinite memory. The goal is useful, scoped, auditable memory.

The Common Mistake: Treating the Context Window as Memory

The context window is not memory. It is the agent's working surface.

Think of it like a whiteboard. It is useful while the task is active, but it is not a database, audit log, preference store, or policy engine. If you keep stuffing everything into the prompt, you eventually hit four problems:

Cost climbs because every turn carries old tokens.
Attention degrades because important instructions compete with noise.
Stale facts survive because old summaries are treated like truth.
Debugging gets messy because you cannot tell where a bad memory came from.

A memory store fixes this by separating storage from retrieval. The agent does not automatically see everything. It receives only the memory that is relevant, fresh, permitted, and useful for the current step.
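To make the cost point concrete, here is a back-of-the-envelope sketch. The turn count, tokens per turn, and memory budget are made-up illustration values, and the function names are hypothetical; real numbers depend on your model and workload.

```typescript
// Hypothetical numbers for illustration only.
// Compares total input tokens when every turn re-sends the full history
// versus when each turn carries a fixed, budgeted slice of retrieved memory.

function totalTokensFullHistory(turns: number, tokensPerTurn: number): number {
  // Turn n carries the n-1 previous turns plus itself: 1 + 2 + ... + turns.
  return (tokensPerTurn * turns * (turns + 1)) / 2;
}

function totalTokensWithBudget(
  turns: number,
  tokensPerTurn: number,
  memoryBudget: number
): number {
  // Each turn carries only its own content plus a capped memory slice.
  return turns * (tokensPerTurn + memoryBudget);
}

// 40 turns at 500 tokens each.
const fullHistory = totalTokensFullHistory(40, 500);  // 500 * 40 * 41 / 2 = 410000
const budgeted = totalTokensWithBudget(40, 500, 800); // 40 * (500 + 800) = 52000
```

The full-history total grows with the square of the turn count, while the budgeted total grows linearly. That gap is why a longer context window alone does not solve the cost problem.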

A Practical Memory Architecture

A production memory system usually needs four layers.

Layer	Purpose	Example
Working memory	Current task state	"The user wants a refund workflow summary."
Episodic memory	Timeline of events	"At 10:03, the agent called `get_invoice` and found invoice INV-42."
Semantic memory	Stable facts	"Acme prefers PDF exports and uses Stripe."
Procedural memory	Reusable process	"For billing disputes, check invoice, payment status, refund policy, then draft response."

You do not need to build all four on day one. But you should know which type of memory each record belongs to, because each type has different rules.

Working memory is short-lived. Episodic memory is audit-heavy. Semantic memory needs verification. Procedural memory should be versioned like code.

Mix them together and the agent becomes hard to control.
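One way to keep the types apart is a small rules table keyed by memory kind. This is a minimal sketch: the `KindRules` fields and every value in `KIND_RULES` are assumptions chosen to mirror the sentence above, not recommended defaults.

```typescript
type MemoryKind = "working" | "episodic" | "semantic" | "procedural";

// Hypothetical per-kind rules; tune the values for your own product.
type KindRules = {
  defaultTtlDays: number | null; // null = no automatic expiry
  appendOnly: boolean;           // true = never edited in place (audit trail)
  requiresVerification: boolean; // true = needs a checked source before use
  versioned: boolean;            // true = changes create a new version
};

const KIND_RULES: Record<MemoryKind, KindRules> = {
  working:    { defaultTtlDays: 1,    appendOnly: false, requiresVerification: false, versioned: false },
  episodic:   { defaultTtlDays: null, appendOnly: true,  requiresVerification: false, versioned: false },
  semantic:   { defaultTtlDays: 180,  appendOnly: false, requiresVerification: true,  versioned: false },
  procedural: { defaultTtlDays: null, appendOnly: false, requiresVerification: true,  versioned: true  },
};

// Look up the rules that apply to a record of the given kind.
function rulesFor(kind: MemoryKind): KindRules {
  return KIND_RULES[kind];
}
```

Because the table is typed as `Record<MemoryKind, KindRules>`, adding a new kind without rules becomes a compile error rather than a silent gap.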

Memory Object Design

Do not store memory as random text blobs. Store it as an object with enough metadata to filter, rank, expire, and audit it.

Here is a simple TypeScript shape:

type MemoryKind = "working" | "episodic" | "semantic" | "procedural";

type MemoryVisibility = "user" | "tenant" | "workspace" | "system";

type AgentMemory = {
  id: string;
  tenantId: string;
  userId?: string;
  agentId: string;
  workflowId?: string;

  kind: MemoryKind;
  visibility: MemoryVisibility;

  summary: string;
  content: string;
  source: {
    type: "user_message" | "tool_result" | "document" | "human_note" | "system_event";
    ref?: string;
  };

  confidence: number; // 0 to 1
  importance: number; // 0 to 1
  createdAt: string;
  expiresAt?: string;
  lastUsedAt?: string;

  tags: string[];
  policy: {
    containsPii: boolean;
    allowInPrompt: boolean;
    requireCitation: boolean;
    requireFreshnessCheck: boolean;
  };
};

The metadata matters more than it looks.

If a memory contains personal data, you may need to mask it. If it came from a tool result, you may need to cite it. If it is old, you may need to revalidate it. If it belongs to one tenant, it must never be retrieved for another.
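As a rough sketch of how those flags could be applied just before a memory enters the prompt, consider the function below. `PromptMemory` is a trimmed, hypothetical version of the shape above, and `maskPii`, `renderForPrompt`, the email-only masking rule, and the `maxAgeDays` parameter are all assumptions for illustration.

```typescript
type PromptMemory = {
  summary: string;
  createdAt: string; // ISO timestamp
  source: { type: string; ref?: string };
  policy: {
    containsPii: boolean;
    allowInPrompt: boolean;
    requireCitation: boolean;
    requireFreshnessCheck: boolean;
  };
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical masking rule: redact anything that looks like an email address.
function maskPii(text: string): string {
  return text.replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[redacted-email]");
}

// Returns the line to place in the prompt, or null if the memory must stay out.
function renderForPrompt(
  memory: PromptMemory,
  now: Date,
  maxAgeDays: number
): string | null {
  if (!memory.policy.allowInPrompt) return null;

  const ageDays = (now.getTime() - new Date(memory.createdAt).getTime()) / MS_PER_DAY;
  if (memory.policy.requireFreshnessCheck && ageDays > maxAgeDays) {
    return null; // too old: the caller should revalidate against the source first
  }

  let line = memory.policy.containsPii ? maskPii(memory.summary) : memory.summary;
  if (memory.policy.requireCitation) {
    line += ` [source: ${memory.source.ref ?? memory.source.type}]`;
  }
  return line;
}
```

A real system would need a broader PII detector than a single regular expression; the point here is only that each policy flag maps to a concrete action.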

What Should Become Memory?

A good memory store is selective. Most agent context should not become long-term memory.

Use this rule: save memory only when future behavior should change because of it.

Good candidates:

A user preference that was clearly stated
A verified business rule
A workflow decision with a reason
A failed approach that should not be repeated
A stable integration detail
A human approval or rejection
A reusable troubleshooting pattern

Bad candidates:

Raw chat filler
One-off guesses
Unverified model conclusions
Temporary drafts
Sensitive data without a retention reason
Tool outputs that can be fetched cheaply again
Anything from another tenant or workspace

For example, this is weak memory:

User seemed annoyed about invoices.

This is better:

User requested that billing exports include invoice ID, payment status, and refund eligibility. Source: message msg_123. Applies to workspace ws_9. Confidence: 0.92.

The second version can safely shape future behavior.

Memory Write Pipeline

Never let the agent write directly to long-term memory without checks. Add a write pipeline.

A simple flow:

Agent proposes a memory.
Classifier labels kind, sensitivity, tenant scope, and confidence.
Policy layer rejects unsafe or low-value entries.
Deduplication checks for existing similar memories.
Human approval is required for sensitive or global memory.
Memory is stored with source, timestamp, and expiry.

Example pseudo-code:

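// Pseudo-code: classifyMemory, findSimilarMemory, mergeMemory,
// createApprovalRequest, and saveMemory are assumed helpers.
async function proposeMemory(input: ProposedMemory) {
  // Classify: label kind, sensitivity, tenant scope, and confidence.
  const classified = await classifyMemory(input);

  // Policy: reject low-value entries and PII with no reason to keep it.
  // retentionReason is assumed to be set by the classifier.
  if (classified.confidence < 0.75) return { saved: false, reason: "low_confidence" };
  if (classified.policy.containsPii && !classified.retentionReason) {
    return { saved: false, reason: "pii_without_retention_reason" };
  }

  // Deduplicate within the same tenant, user, and kind.
  const duplicate = await findSimilarMemory({
    tenantId: input.tenantId,
    userId: input.userId,
    text: classified.summary,
    kind: classified.kind
  });

  if (duplicate) {
    return mergeMemory(duplicate.id, classified);
  }

  // Approval: sensitive, global, or very high-importance memories wait for a human.
  const needsApproval =
    classified.visibility === "system" ||
    classified.policy.containsPii ||
    classified.importance > 0.9;
  if (needsApproval) {
    return createApprovalRequest(classified);
  }

  // Store with source, timestamp, and expiry.
  return saveMemory(classified);
}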

This may feel slow at first, but it prevents memory rot. A bad memory can be worse than no memory because it quietly influences future outputs.
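The deduplication step is the easiest one to leave vague, so here is a minimal sketch of it. It swaps in word-overlap (Jaccard) similarity as a stand-in for embedding similarity so the example stays self-contained; `StoredSummary`, `findSimilar`, and the 0.8 threshold are hypothetical.

```typescript
type StoredSummary = { id: string; tenantId: string; kind: string; summary: string };

// Lowercased word set; a simple stand-in for a real embedding.
function tokenSet(text: string): Set<string> {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words);
}

// Jaccard similarity: |A intersect B| / |A union B|, in the range 0 to 1.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach(token => {
    if (b.has(token)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Returns the first existing memory in the same tenant and kind that is
// at least `threshold` similar to the candidate, or undefined if none is.
function findSimilar(
  existing: StoredSummary[],
  candidate: { tenantId: string; kind: string; summary: string },
  threshold = 0.8 // hypothetical cut-off; tune against real duplicates
): StoredSummary | undefined {
  const candidateTokens = tokenSet(candidate.summary);
  return existing
    .filter(m => m.tenantId === candidate.tenantId && m.kind === candidate.kind)
    .find(m => jaccard(tokenSet(m.summary), candidateTokens) >= threshold);
}
```

Note that the tenant and kind filter runs before any similarity check, so a near-identical memory in another tenant is never treated as a duplicate.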

Memory Retrieval Pipeline

Retrieval is where many systems fail. They store useful memories, then dump the top vector matches into the prompt.

That is not enough.

A safer retrieval pipeline should check:

Is the memory in the same tenant?
Is it allowed for this user or workflow?
Is it fresh enough?
Does the current task actually need it?
Is it a verified fact or only a past guess?
Does it need a citation?
Is it worth the token cost?

A better retrieval function might look like this:

type RetrievalRequest = {
  tenantId: string;
  userId?: string;
  agentId: string;
  workflowId?: string;
  task: string;
  maxTokens: number;
  allowedKinds: MemoryKind[];
};

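async function retrieveMemory(req: RetrievalRequest) {
  // Search: tenantId is part of the query, so other tenants' rows are never candidates.
  const candidates = await vectorSearch({
    tenantId: req.tenantId,
    query: req.task,
    kinds: req.allowedKinds,
    limit: 30
  });

  // Filter: prompt policy, access, freshness, then usefulness for this task.
  const filtered = candidates
    .filter(m => m.policy.allowInPrompt)
    .filter(m => hasAccess(req, m))
    .filter(m => !isExpired(m))
    .filter(m => isUsefulForTask(req.task, m));

  // Rank: the weights sum to 1, with relevance counting the most.
  const ranked = rankByUtility(filtered, {
    relevance: 0.45,
    recency: 0.2,
    confidence: 0.2,
    importance: 0.15
  });

  // Budget: keep only what fits in the prompt allowance.
  return fitWithinTokenBudget(ranked, req.maxTokens);
}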

Notice the order: search, filter, rank, budget. Do not skip the filter step.
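The rank and budget steps can be sketched as follows. This is one possible reading of `rankByUtility` and `fitWithinTokenBudget`, not their definitive implementation: the `Scored` shape (with precomputed 0 to 1 scores), the weighted-sum formula, the greedy fill, and the 4-characters-per-token estimate are all assumptions.

```typescript
type Scored = {
  id: string;
  summary: string;
  relevance: number;  // 0 to 1, e.g. vector similarity
  recency: number;    // 0 to 1, where 1 = just created or used
  confidence: number; // 0 to 1
  importance: number; // 0 to 1
};

type Weights = { relevance: number; recency: number; confidence: number; importance: number };

// Weighted sum of the four signals.
function utility(m: Scored, w: Weights): number {
  return m.relevance * w.relevance + m.recency * w.recency +
    m.confidence * w.confidence + m.importance * w.importance;
}

// Highest utility first; does not mutate the input array.
function rankByUtility(memories: Scored[], w: Weights): Scored[] {
  return memories.slice().sort((a, b) => utility(b, w) - utility(a, w));
}

// Rough heuristic: about 4 characters per token. Use a real tokenizer in production.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedy fill: walk the ranked list and keep each memory that still fits.
function fitWithinTokenBudget(ranked: Scored[], maxTokens: number): Scored[] {
  const kept: Scored[] = [];
  let used = 0;
  for (const m of ranked) {
    const cost = estimateTokens(m.summary);
    if (used + cost > maxTokens) continue; // too large: skip, try the next one
    kept.push(m);
    used += cost;
  }
  return kept;
}
```

The greedy fill skips a memory that does not fit and keeps trying smaller ones. If you would rather stop at the first memory that overflows the budget, replace `continue` with `break`.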

Add Decay Before Memory Becomes Junk

Memory stores get worse over time unless you design forgetting.

Forgetting is not a bug. It is a feature.

Use decay rules such as:

Working memory expires when the workflow ends.
Tool-result memories expire when source data changes.
User preferences expire after a long period of non-use.
Low-confidence memories expire faster.
Sensitive memories require a retention reason and shorter TTL.
Procedural memories stay only if versioned and reviewed.
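The time-based rules in this list could be encoded when a memory is written, as in the sketch below. Every TTL value here is a hypothetical placeholder, and the sketch covers only time-based expiry; event-driven rules such as "expire when source data changes" need a separate invalidation hook.

```typescript
type DecayInput = {
  kind: "working" | "episodic" | "semantic" | "procedural";
  confidence: number;  // 0 to 1
  containsPii: boolean;
  createdAt: string;   // ISO timestamp
};

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical TTLs in days; null means "no automatic expiry".
function ttlDays(m: DecayInput): number | null {
  if (m.kind === "working") return 1;       // stand-in for "expires when the workflow ends"
  if (m.kind === "procedural") return null; // kept alive by versioning and review instead
  let days = m.kind === "episodic" ? 365 : 180;
  if (m.confidence < 0.8) days = days / 2;       // low-confidence memories expire faster
  if (m.containsPii) days = Math.min(days, 30);  // sensitive memories get a shorter TTL
  return days;
}

// Returns an ISO expiry timestamp, or undefined when the memory has no TTL.
function computeExpiresAt(m: DecayInput): string | undefined {
  const days = ttlDays(m);
  if (days === null) return undefined;
  return new Date(new Date(m.createdAt).getTime() + days * DAY_MS).toISOString();
}
```

Setting `expiresAt` at write time keeps the cleanup job simple: it only has to compare timestamps, not re-derive policy.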

You can calculate memory priority like this:

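function memoryPriority(memory: AgentMemory) {
  const ageDays = daysSince(memory.createdAt);
  // Boost only memories used recently; the 30-day window is an assumed default.
  const usedRecently =
    memory.lastUsedAt !== undefined && daysSince(memory.lastUsedAt) <= 30;
  const recencyBoost = usedRecently ? 0.15 : 0;
  // Penalty grows with age and caps at 0.4 once the memory is 72 days old.
  const agePenalty = Math.min(ageDays / 180, 0.4);

  return memory.importance * 0.4 +
    memory.confidence * 0.3 +
    recencyBoost -
    agePenalty;
}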

Then run a daily or weekly cleanup job:

async function pruneMemories(tenantId: string) {
  const memories = await listMemories(tenantId);

  for (const memory of memories) {
    if (isExpired(memory)) await archiveMemory(memory.id, "expired");
    else if (memoryPriority(memory) < 0.25) await archiveMemory(memory.id, "low_priority");
  }
}

Archiving is better than deleting when audit matters. For privacy-sensitive data, deletion may be required. Make that a policy decision, not an agent decision.

Multi-Tenant Memory Rules

If your product serves multiple customers, memory isolation is non-negotiable.

Minimum rules:

Every memory row must include tenantId.
Retrieval queries must require tenantId.
Vector indexes should support tenant filters.
No global memory should include tenant-specific data.
Admin tools must show why a memory was retrieved.
Tests must prove cross-tenant memories cannot leak.

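A common mistake is storing embeddings in one shared vector index and trusting application code to filter after retrieval. That can work if implemented carefully, but it has two weaknesses: other tenants' rows are loaded into application memory before they are dropped, and a top-k search can fill up with another tenant's matches, leaving few or no results after filtering. Pre-filtering by tenant is safer when your database supports it.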

Bad retrieval:

const results = await vectorSearch(query);
return results.filter(r => r.tenantId === tenantId);

Better retrieval:

const results = await vectorSearch(query, {
  filter: { tenantId }
});

The difference is boring until it prevents a privacy incident.
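A tiny in-memory sketch shows one way the two approaches diverge. The `index` array stands in for a shared vector index, `score` plays the role of similarity, and the tenant names and helper functions are hypothetical.

```typescript
type Row = { id: string; tenantId: string; score: number };

// Tiny in-memory stand-in for a shared vector index.
const index: Row[] = [
  { id: "a1", tenantId: "acme", score: 0.95 },
  { id: "a2", tenantId: "acme", score: 0.9 },
  { id: "a3", tenantId: "acme", score: 0.85 },
  { id: "g1", tenantId: "globex", score: 0.6 },
];

// Highest score first, truncated to k rows.
function topK(rows: Row[], k: number): Row[] {
  return rows.slice().sort((x, y) => y.score - x.score).slice(0, k);
}

// Post-filter: search everything, then drop other tenants.
function searchPostFilter(tenantId: string, k: number): Row[] {
  return topK(index, k).filter(r => r.tenantId === tenantId);
}

// Pre-filter: restrict to the tenant first, then take the top k.
function searchPreFilter(tenantId: string, k: number): Row[] {
  return topK(index.filter(r => r.tenantId === tenantId), k);
}

const postFiltered = searchPostFilter("globex", 3); // [] because acme rows filled the top 3
const preFiltered = searchPreFilter("globex", 3);   // [g1]
```

In the post-filter version, three of another tenant's rows pass through application code before being discarded, and the tenant that asked gets nothing back. A leakage test can assert both properties: results only ever carry the caller's `tenantId`, and a tenant's own memories are still found when other tenants have stronger matches.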

Where to Store Agent Memory

You have several options:

Storage option	Good for	Watch out for
Postgres	Structured memory, audit logs, tenant filters	Needs vector extension or separate vector store
Vector database	Semantic retrieval	Weak metadata discipline can create messy retrieval
Document store	Flexible memory records	Harder relational auditing
Object storage	Source snapshots and raw artifacts	Not ideal for direct retrieval
Redis	Short-lived working memory	Not a long-term audit store

A practical starting stack:

Redis for active working memory
Postgres for memory metadata and audit receipts
pgvector or a vector database for semantic search
Object storage for large source snapshots

A Simple Build Plan

If you are adding memory to an existing AI product, build in this order.

1. Start with episodic memory

Log what happened: tool calls, approvals, errors, source references, and decisions. This gives you debugging value without letting the agent reuse memories automatically.
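An append-only log is enough to start. The sketch below keeps events in memory for brevity; `EpisodicEvent`, `EpisodicLog`, and the event types are hypothetical names, and a real system would persist the rows (for example in Postgres) instead of an array.

```typescript
type EpisodicEvent = {
  seq: number;        // monotonically increasing position in the log
  at: string;         // ISO timestamp supplied by the caller
  workflowId: string;
  type: "tool_call" | "approval" | "error" | "decision";
  detail: string;
  sourceRef?: string; // e.g. a message or invoice ID
};

class EpisodicLog {
  private events: EpisodicEvent[] = [];

  // Appends an event and assigns the next sequence number. There is
  // deliberately no update or delete method: the log is an audit trail.
  append(event: Omit<EpisodicEvent, "seq">): EpisodicEvent {
    const stored: EpisodicEvent = { ...event, seq: this.events.length + 1 };
    this.events.push(stored);
    return stored;
  }

  // Returns the timeline for one workflow, oldest first.
  forWorkflow(workflowId: string): readonly EpisodicEvent[] {
    return this.events.filter(e => e.workflowId === workflowId);
  }
}
```

Nothing here feeds the prompt. The log exists so that a human, or a later retrieval layer, can answer "what happened and in what order".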

2. Add working memory

Track current workflow state outside the prompt. Store goals, completed steps, open questions, and blockers. Use it to resume long-running tasks.
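A minimal sketch of that state, with a helper that turns it into a short resume note for the prompt, might look like this. `WorkflowState`, `nextStep`, `resumeSummary`, and the step names in the usage note are hypothetical.

```typescript
type StepStatus = "pending" | "done" | "blocked";

type WorkflowState = {
  workflowId: string;
  goal: string;
  steps: { name: string; status: StepStatus }[];
  openQuestions: string[];
};

// Returns the next step to run, or null when the workflow is finished or blocked.
function nextStep(state: WorkflowState): string | null {
  for (const step of state.steps) {
    if (step.status === "blocked") return null;
    if (step.status === "pending") return step.name;
  }
  return null;
}

// Compact summary to place in the prompt when a workflow resumes.
function resumeSummary(state: WorkflowState): string {
  const done = state.steps.filter(s => s.status === "done").map(s => s.name);
  const next = nextStep(state);
  const questions = state.openQuestions;
  return [
    `Goal: ${state.goal}`,
    `Completed: ${done.length > 0 ? done.join(", ") : "none"}`,
    `Next: ${next ?? "nothing runnable"}`,
    `Open questions: ${questions.length > 0 ? questions.join("; ") : "none"}`,
  ].join("\n");
}
```

For a billing dispute with `check_invoice` done and `check_payment_status` pending, the agent resumes with a four-line note instead of the full transcript of everything it did before the pause.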

3. Add controlled semantic memory

Save stable user or tenant facts only after classification and deduplication. Keep confidence and source metadata.

4. Add retrieval gates

Before memory enters the prompt, check tenant, user, freshness, sensitivity, relevance, and token budget.

5. Add memory review UI

Let humans inspect, correct, archive, and approve important memories. This is especially useful for customer-facing workflows.

6. Add evaluations

Create tests for stale memory, cross-tenant leakage, prompt injection in stored memory, bad retrieval ranking, and over-retrieval.

Evaluation Cases You Should Run

Memory needs tests just like prompts and tools.

Try these cases:

A user changes their preference. Does the old memory stop winning?
A customer document is updated. Does stale memory require refresh?
A malicious web page tries to get stored as memory. Is it rejected?
Two tenants use similar company names. Are memories isolated?
A low-confidence summary conflicts with a verified tool result. Which wins?
A workflow resumes after 24 hours. Does the agent recover the correct state?
The memory budget is cut in half. Does the agent still include the best facts?

The most important evaluation is simple: can the agent explain which memories affected its answer?

If not, your memory system is not production-ready.
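Two of the cases above, the changed preference and the low-confidence summary that conflicts with a verified tool result, come down to an explicit precedence rule. The sketch below shows one possible rule; the `Claim` shape, the `verified` flag, and the ordering (verified first, then newer, then higher confidence) are assumptions, not the only reasonable policy.

```typescript
type Claim = {
  id: string;
  value: string;
  verified: boolean;  // hypothetical flag: true when backed by a checked source
  confidence: number; // 0 to 1
  observedAt: string; // ISO timestamp
};

// Hypothetical precedence: verified beats unverified, then newer beats older,
// then higher confidence breaks any remaining tie (ties go to the first claim).
function resolveConflict(a: Claim, b: Claim): Claim {
  if (a.verified !== b.verified) return a.verified ? a : b;
  const timeA = new Date(a.observedAt).getTime();
  const timeB = new Date(b.observedAt).getTime();
  if (timeA !== timeB) return timeA > timeB ? a : b;
  return a.confidence >= b.confidence ? a : b;
}
```

Keeping this rule in code rather than asking the model to reconcile conflicting memories in-context makes the outcome testable: each evaluation case becomes a pair of claims and an expected winner.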

Final Checklist

Before shipping an AI agent memory store, confirm:

[ ] Every memory has tenant scope
[ ] Memory type is explicit
[ ] Sensitive memory has policy metadata
[ ] Writes go through classification and deduplication
[ ] Retrieval filters before ranking
[ ] Stale memories expire or require refresh
[ ] Prompt memory has a token budget
[ ] Memory use is logged in receipts
[ ] Humans can inspect and correct important memories
[ ] Tests cover leakage, stale facts, and prompt injection

A memory store should make agents calmer, not weirder. If the agent becomes more confident but less traceable, the system is moving in the wrong direction.

FAQ

What is an AI agent memory store?

An AI agent memory store is a system that saves, filters, retrieves, and audits information an agent may need across steps or sessions. It can include workflow state, past events, stable facts, user preferences, and reusable procedures.

Is a vector database enough for agent memory?

No. A vector database can help with semantic search, but memory also needs metadata, tenant filters, expiry rules, sensitivity labels, confidence scores, and audit logs. Retrieval quality depends on policy as much as similarity search.

What should an AI agent remember?

It should remember information that should change future behavior: verified preferences, workflow decisions, failed attempts, approvals, stable business rules, and reusable procedures. It should not remember random chat filler, unverified guesses, or sensitive data without a clear retention reason.

How do you prevent stale memory from hurting answers?

Use expiry dates, freshness checks, source references, confidence scores, and revalidation rules. When a memory conflicts with a newer verified source, the newer source should win. Stale memory should be archived or marked as requiring refresh.

How much memory should be added to a prompt?

Usually less than you think. Start with a small budget, such as 3 to 7 high-value memories or a few hundred tokens. Track whether retrieved memory improves task success, reduces repeated tool calls, or causes stale-answer incidents.

How do you stop cross-tenant memory leaks?

Require tenant IDs on every memory object, pre-filter retrieval by tenant, test similar-name tenant cases, avoid global memory that contains customer data, and log memory receipts so you can prove which records were retrieved and used.

Top comments (3)

Max Quimby • Jun 12

The "context window is a whiteboard, not a database" framing is the part I wish more people internalized — we lost a lot of time early on treating running summaries as ground truth and then debugging behavior that traced back to a stale fact nobody could source.

The thing that bit us hardest in production wasn't retrieval, it was write discipline. Two rules ended up mattering more than any embedding choice: (1) dedup before you write, or the same "fact" accretes five slightly-different copies and the agent confidently picks the wrong one, and (2) stamp every memory with provenance plus a timestamp, so you can expire it instead of trusting it forever. Your point about auditable memory is exactly this.

One thing I'd love your take on: how are you handling contradiction? Retrieval surfaces the relevant memory, but when a fresh tool result disagrees with a stored one, something has to decide which wins. We ended up needing an explicit recency/authority rule rather than letting the model reconcile it in-context. Do you scope that into the store, or leave it to the agent?

Mehmet Can Farsak • Jun 12

Great write-up on agent memory architecture. That 'job has drifted' problem you described hits home — I've seen the same thing with agents jumping from ideation to execution without a clear boundary. Built Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) to tackle this from the hooks side — it uses PreToolUse hooks to enforce mode boundaries so the agent stays in thinking mode instead of drifting into action mode.

Mehmet Can Farsak • Jun 13

Solid point about the drift problem — agents losing the thread after a few tool calls. I ran into a related issue where agents would abandon brainstorming and jump straight to coding when you just wanted ideation. Put together Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) that uses hooks to enforce "thinking mode" vs "action mode" at the infrastructure level. Three modes (divergent, actionable, academic) help keep the agent in the right headspace instead of prematurely reaching for tools.