Kotcherla Murali Krishna

Memory Systems for AI Agents: The Complete Developer Guide

How modern AI agents remember, reason, and learn — and how to build them right

Introduction

Every time you close a chat window, the AI forgets you exist.

This is not a bug — it is architecture. Large language models are, by default, stateless. They receive a prompt, generate a response, and discard everything. But the next generation of AI agents needs to do far more: track long-running tasks, learn from past interactions, coordinate across sessions, and reason over accumulated knowledge.

Memory is the missing layer that transforms a chatbot into an agent.

This guide breaks down every major memory system used in AI agents today — what they are, how they work, when to use each, and how to combine them into production-ready architectures.

Why Memory Matters

Consider the difference between these two interactions:

Without memory:

User: “What did we decide about the database schema last Tuesday?”
Agent: “I don’t have access to previous conversations.”

With memory:

User: “What did we decide about the database schema last Tuesday?”
Agent: “You decided to normalize the user table into three relations and defer the indexing strategy to after the first load test.”

The gap is not intelligence — it is persistence. Memory gives agents:

  • Continuity across sessions and workflows
  • Personalization based on accumulated user context
  • Efficiency by avoiding repeated reasoning over the same facts
  • Autonomy to pursue multi-step goals without human re-prompting at every step

The Four Types of Agent Memory

AI agent memory maps loosely onto cognitive science. Researchers typically distinguish four types, each serving a different purpose.

1. Sensory / Working Memory (In-Context)

What it is: Everything currently inside the model’s context window — the “working desk” of the agent.

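How it works: The transformer's attention mechanism operates over all tokens in the context window at once. Apart from the model's trained weights, this is the only memory that directly influences outputs; every other memory type has to be retrieved into context before it can have an effect.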

Characteristics:

  • Fast — zero retrieval latency
  • Limited — bounded by context window size (4K to 2M tokens depending on model)
  • Volatile — completely lost when the session ends
  • Ordered — the model can reason over temporal sequences within context

When to use it:

  • Current task state, tool outputs, user messages in the active session
  • Recently retrieved facts that need active reasoning
  • Intermediate reasoning steps (chain-of-thought scratchpads)

Implementation:

messages = [
    {"role": "user", "content": "Step 1 result: ..."},
    {"role": "assistant", "content": "Proceeding to step 2..."},
    {"role": "user", "content": "Step 2 result: ..."},
]
# The Anthropic Messages API takes the system prompt as a top-level
# parameter (not a message role) and requires max_tokens.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=system_prompt,
    messages=messages,
)

Key insight: Context window management is itself a memory problem. When context fills up, agents must decide what to summarize, compress, or evict — turning working memory into a policy decision.
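
One simple eviction policy is to keep only the most recent turns that fit a token budget. The sketch below is illustrative rather than definitive: `estimate_tokens` and `trim_to_budget` are hypothetical helpers, and the four-characters-per-token estimate is an assumption that a real tokenizer should replace.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept: list[dict] = []
    used = 0
    # Walk backwards so the newest turns are considered first.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 400},       # ~100 tokens
]
trimmed = trim_to_budget(history, max_tokens=250)
print([m["content"][0] for m in trimmed])
```

Dropping the oldest turns is the bluntest option. A production policy would more likely summarize evicted turns before discarding them, and would make sure the trimmed list still starts with a user turn if the target API requires it.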

2. Episodic Memory (Conversation & Event History)

What it is: Records of specific past events — conversations, actions taken, outcomes observed.

How it works: Episodes are stored externally (database, vector store, log files) and retrieved selectively into context when relevant.

Characteristics:

  • Persistent across sessions
  • Structured around time and causality (“what happened, when, and what followed”)
  • Indexed for retrieval by recency, relevance, or both
  • Can store raw transcripts or summarized episode representations

When to use it:

  • User conversation history (“You mentioned last week that…”)
  • Agent action logs for debugging and auditing
  • Workflow checkpointing for long-running tasks
  • Learning from past successes and failures

Implementation pattern:

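from datetime import datetime, timedelta, timezone

# Store episode (db is assumed to be a pymongo database handle)
episode = {
    "session_id": "abc123",
    "user_id": current_user,  # needed so the query below can filter by user
    "timestamp": datetime(2025, 5, 22, 14, 30, tzinfo=timezone.utc),
    "user_input": "Analyze the Q2 sales data",
    "agent_actions": ["read_csv", "compute_summary", "generate_chart"],
    "outcome": "success",
    "summary": "User requested Q2 analysis. Identified 18% YoY growth in APAC region."
}
db.episodes.insert_one(episode)

# Retrieve relevant episodes
thirty_days_ago = datetime.now(timezone.utc) - timedelta(days=30)
past_episodes = db.episodes.find({
    "user_id": current_user,
    "timestamp": {"$gte": thirty_days_ago}
}).sort("timestamp", -1).limit(5)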

Design consideration: Store both raw episodes and distilled summaries. Raw episodes support audit trails and replay; summaries support fast context injection.
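
To make the raw-plus-summary split concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout and the helper names (`store_episode`, `recent_summaries`, `raw_transcript`) are assumptions made for illustration, not a prescribed schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE episodes (
        id INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,
        created_at TEXT NOT NULL,  -- ISO 8601 in UTC, so text order is time order
        raw TEXT NOT NULL,         -- full transcript as JSON, for audit and replay
        summary TEXT NOT NULL      -- short distilled form, for context injection
    )
    """
)


def store_episode(user_id: str, created_at: str, transcript: list, summary: str) -> None:
    conn.execute(
        "INSERT INTO episodes (user_id, created_at, raw, summary) VALUES (?, ?, ?, ?)",
        (user_id, created_at, json.dumps(transcript), summary),
    )


def recent_summaries(user_id: str, limit: int = 5) -> list[str]:
    """Cheap path: only the summaries, newest first."""
    rows = conn.execute(
        "SELECT summary FROM episodes WHERE user_id = ? "
        "ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [row[0] for row in rows]


def raw_transcript(episode_id: int):
    """Expensive path: the full transcript, fetched only when auditing or replaying."""
    row = conn.execute("SELECT raw FROM episodes WHERE id = ?", (episode_id,)).fetchone()
    return json.loads(row[0]) if row else None


store_episode("u1", "2025-05-21T10:00:00Z", [{"role": "user", "content": "hi"}], "Greeted the agent.")
store_episode("u1", "2025-05-22T14:30:00Z", [{"role": "user", "content": "Analyze Q2"}], "Requested Q2 analysis.")
print(recent_summaries("u1"))
```

Only `recent_summaries` runs on the hot path when assembling context; the raw column is read on demand.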

3. Semantic Memory (Knowledge & Facts)

What it is: General knowledge, facts, concepts, and domain expertise — decoupled from specific events.

How it works: Information is embedded into a vector space and stored in a vector database. At query time, semantically similar content is retrieved and injected into context (Retrieval-Augmented Generation, or RAG).

Characteristics:

  • Persistent and shareable across users and sessions
  • Retrieved by semantic similarity, not exact match
  • Scales to millions of documents
  • Requires embedding model + vector store infrastructure

When to use it:

  • Company knowledge bases, documentation, FAQs
  • Domain-specific corpora (legal, medical, financial)
  • Product catalogs, policy documents
  • Any knowledge too large to fit in context

Implementation with semantic search:

from sentence_transformers import SentenceTransformer
import chromadb

# Index documents
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()  # in-memory client; data is lost when the process exits
collection = client.get_or_create_collection("knowledge_base")

# load_documents is a placeholder for your own loader; each item needs .id and .text
documents = load_documents("./docs/")
embeddings = model.encode([doc.text for doc in documents])
collection.add(
    embeddings=embeddings.tolist(),
    documents=[doc.text for doc in documents],
    ids=[doc.id for doc in documents]
)

# Retrieve at query time
query_embedding = model.encode([user_query])
results = collection.query(query_embeddings=query_embedding.tolist(), n_results=5)

# Inject into context
context = "\n\n".join(results["documents"][0])
augmented_prompt = f"Use the following context:\n{context}\n\nUser query: {user_query}"

Advanced pattern — Hybrid Retrieval:

Pure vector search can miss exact keyword matches such as identifiers, error codes, or product names. Combine BM25 (keyword) with dense retrieval (semantic) for higher recall:

from rank_bm25 import BM25Okapi
from sklearn.preprocessing import normalize
import numpy as np

def hybrid_retrieve(query, documents, embeddings, alpha=0.5, top_k=5):
    # `model` is the SentenceTransformer that produced `embeddings`
    # BM25 scores (lowercased so keyword matching is case-insensitive)
    tokenized = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(query.lower().split())

    # Dense scores: cosine similarity between the query and each document
    query_emb = model.encode([query])
    dense_scores = np.dot(normalize(embeddings), normalize(query_emb).T).flatten()

    # Combine: L2-normalize BM25 so both signals are on a comparable scale
    combined = alpha * normalize([bm25_scores])[0] + (1 - alpha) * dense_scores
    top_indices = np.argsort(combined)[::-1][:top_k]
    return [documents[i] for i in top_indices]

4. Procedural Memory (Skills & Workflows)

What it is: Encoded knowledge of how to do things — reusable procedures, tool-use patterns, and learned behavioral strategies.

How it works: Can be represented as prompts (few-shot examples), code (tool implementations), structured workflows (graphs, state machines), or fine-tuned model weights.

Characteristics:

  • Represents capability, not facts
  • Typically stable — updated less frequently than episodic or semantic memory
  • Can be invoked on demand (“use the data_analysis skill”)
  • Encodes both successful and corrected failure patterns

When to use it:

  • Standardized multi-step workflows (data pipelines, report generation)
  • Tool usage patterns and API call sequences
  • Domain-specific reasoning strategies
  • Safety and compliance guardrails as behavioral constraints

Implementation — Prompt-based procedural memory:

PROCEDURES = {
    "data_analysis": """
        When analyzing data:
        1. First, describe the dataset schema and shape.
        2. Check for missing values and outliers.
        3. Compute descriptive statistics.
        4. Identify trends and correlations.
        5. Summarize key findings in plain language.
        Always cite the row count and column names in your response.
    """,
    "code_review": """
        When reviewing code:
        1. Check for correctness first (does it do what it claims?).
        2. Identify security vulnerabilities (injection, auth, secrets).
        3. Assess performance implications.
        4. Comment on readability and maintainability.
        5. Suggest concrete improvements with code examples.
    """
}

def inject_procedure(task_type: str, base_prompt: str) -> str:
    procedure = PROCEDURES.get(task_type, "")
    return f"{procedure}\n\n{base_prompt}" if procedure else base_prompt

Memory Storage Backends

Choosing the right storage layer is as important as choosing the right memory type.

[Figure: Memory storage backends]

Practical recommendation: Start with PostgreSQL + pgvector. It handles relational structure (episodes, user data) and vector search in one system, avoiding the operational overhead of a separate vector database until scale demands it.

Memory Architecture Patterns

Pattern 1: The Memory Stack

The simplest production pattern layers all four memory types, with each feeding into context at retrieval time.

[Figure: The Memory Stack]

Memory write-back is critical and often overlooked. After each interaction, the agent should update:

  • Episodic store with a summary of what happened
  • Semantic store if new facts were established
  • Procedural store if a new workflow pattern emerged

Pattern 2: Memory-Augmented ReAct

ReAct (Reason + Act) agents interleave reasoning steps with tool calls. Memory becomes a first-class tool:

Thought: I need to check if we’ve handled this type of request before.

Action: memory_search(query="database migration rollback", type="episodic")

Observation: Found 3 similar episodes. In 2/3 cases, the solution was…

Thought: Based on past experience, I should first verify the backup exists.

Action: check_backup(database="prod_db")

This makes memory transparent, auditable, and controllable.
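
The trace above can be backed by an ordinary tool function. The following is a minimal sketch, not a full ReAct loop: the `MemoryRecord` class, the `MEMORY` list, the `TOOLS` registry, and `run_action` are all hypothetical, and the keyword-overlap scoring stands in for the embedding search a real system would use. The parameter is named `type` only to match the trace.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    type: str  # "episodic", "semantic", or "procedural"
    text: str


# Hypothetical in-memory store standing in for a real database.
MEMORY = [
    MemoryRecord("episodic", "Database migration rollback failed because no backup existed."),
    MemoryRecord("episodic", "Database migration rollback succeeded after verifying the backup."),
    MemoryRecord("semantic", "Rollback scripts live alongside each migration."),
]


def memory_search(query: str, type: str, limit: int = 3) -> list[str]:
    """Naive keyword-overlap search restricted to one memory type."""
    terms = set(query.lower().split())
    scored = []
    for record in MEMORY:
        if record.type != type:
            continue
        overlap = len(terms & set(record.text.lower().split()))
        if overlap:
            scored.append((overlap, record.text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:limit]]


# Memory sits in the same registry as every other tool the agent can call.
TOOLS = {"memory_search": memory_search}


def run_action(name: str, **kwargs) -> str:
    """Dispatch a tool call and format the result as a ReAct observation."""
    results = TOOLS[name](**kwargs)
    return f"Observation: Found {len(results)} similar episodes. " + " | ".join(results)


observation = run_action("memory_search", query="database migration rollback", type="episodic")
print(observation)
```

Because every lookup goes through `run_action`, each memory access shows up in the agent's trace and can be logged or rate-limited like any other tool call.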

Pattern 3: Hierarchical Summarization

For very long-running agents (days to weeks), raw episode storage becomes unmanageable. Use hierarchical summarization:

Raw episodes (last 24h) → Daily summary

Daily summaries (last 7d) → Weekly summary

Weekly summaries → Persistent user profile

This mirrors how humans consolidate memories during sleep — detail fades, patterns persist.

from datetime import datetime, timedelta

async def consolidate_memory(user_id: str):
    # Fetch the last 24 hours of raw episodes
    yesterday = datetime.now() - timedelta(days=1)
    episodes = await db.get_episodes(user_id, since=yesterday)
    if not episodes:
        return  # nothing to consolidate; avoid summarizing an empty prompt

    # Summarize via LLM
    raw_text = "\n".join([e["summary"] for e in episodes])
    daily_summary = await llm.summarize(
        f"Summarize these agent interactions into key facts and outcomes:\n{raw_text}"
    )

    # Store consolidated summary
    await db.store_daily_summary(user_id, daily_summary, date=yesterday)

    # Optionally archive older raw episodes (already summarized by earlier runs)
    await db.archive_episodes(user_id, before=yesterday)

Memory Retrieval Strategies

Recency-Weighted Retrieval

Recent information is usually more relevant. Apply time decay to retrieval scores:

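import math
from datetime import datetime, timezone

def time_decay_score(base_score: float, created_at: datetime, half_life_days: float = 7) -> float:
    # created_at must be timezone-aware (UTC). Fractional days avoid the
    # step changes that whole-day truncation (.days) would introduce.
    days_elapsed = (datetime.now(timezone.utc) - created_at).total_seconds() / 86400
    decay = math.exp(-math.log(2) * days_elapsed / half_life_days)
    return base_score * decay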

Importance-Based Retention

Not all memories are equal. Score importance at write time:

IMPORTANCE_SIGNALS = {
    "user_correction": 1.0, # "No, that's wrong — it should be..."
    "explicit_preference": 0.9, # "I always prefer..."
    "task_success": 0.7, # Completed goals
    "factual_statement": 0.5, # Stated facts
    "casual_mention": 0.2, # Passing references
}

def score_importance(episode: dict) -> float:
    signal = episode.get("signal_type", "casual_mention")
    return IMPORTANCE_SIGNALS.get(signal, 0.3)
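
Recency and importance are most useful when combined with the retrieval similarity into one ranking score. The sketch below multiplies the three factors, which is one possible combination and an assumption of this example rather than an established formula. `retention_score` and the `IMPORTANCE` table are hypothetical, and `now` is passed in explicitly so the arithmetic is reproducible.

```python
from datetime import datetime, timedelta, timezone

# Subset of the importance weights used above, repeated so the example runs alone.
IMPORTANCE = {"user_correction": 1.0, "casual_mention": 0.2}


def retention_score(
    similarity: float,
    importance: float,
    created_at: datetime,
    now: datetime,
    half_life_days: float = 7.0,
) -> float:
    """Similarity x importance x exponential time decay (halves every half-life)."""
    age_days = (now - created_at).total_seconds() / 86400
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * importance * decay


now = datetime(2025, 6, 1, tzinfo=timezone.utc)

# A user correction from a week ago: 0.8 * 1.0 * 0.5
week_old_correction = retention_score(
    0.8, IMPORTANCE["user_correction"], now - timedelta(days=7), now
)
# A casual mention from just now: 0.8 * 0.2 * 1.0
fresh_mention = retention_score(0.8, IMPORTANCE["casual_mention"], now, now)

print(week_old_correction > fresh_mention)
```

With these weights, a week-old correction still outranks a casual mention made moments ago, which is the behavior importance scoring is meant to produce.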

Contextual Compression

Before injecting retrieved memories into context, compress them to fit within token budgets:

async def compress_memories(memories: list[str], max_tokens: int = 800) -> str:
    combined = "\n".join(memories)
    if count_tokens(combined) <= max_tokens:
        return combined

    compressed = await llm.complete(
        f"Compress the following memory entries to under {max_tokens} tokens, "
        f"preserving only the most relevant facts:\n\n{combined}"
    )
    return compressed

Common Pitfalls and How to Avoid Them

Pitfall 1: Memory Hallucination Propagation

If the agent writes a hallucinated fact to memory, it reinforces itself across future sessions. The agent becomes increasingly confident in a falsehood.

Fix: Apply confidence thresholds at write time. Only persist memories that pass a factual grounding check, or flag them with uncertainty metadata.

async def write_memory_safe(fact: str, source: str) -> bool:
    grounding_check = await llm.complete(
        f"Is the following statement grounded in verifiable evidence? "
        f"Reply only YES or NO.\n\nStatement: {fact}\nSource: {source}"
    )
    # Match the start of the reply: a substring test would also accept
    # answers such as "NO, although YES in some cases"
    grounded = grounding_check.strip().upper().startswith("YES")
    # Ungrounded facts are still stored, but flagged so retrieval can down-weight them
    confidence = "high" if grounded else "uncertain"
    await memory_store.write(fact, source=source, confidence=confidence)
    return grounded

Pitfall 2: Context Flooding

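Retrieving too many memories degrades performance. Long-context evaluations have found that models use information less reliably when it sits in the middle of a long prompt (the “lost in the middle” effect), and padding the context with marginally relevant content makes the relevant parts harder to find.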

Fix: Enforce hard limits on retrieved memory tokens (e.g., max 20% of context window), and rank retrieval results before injection.
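
A hard cap of this kind can be enforced with a small selection step between retrieval and prompt assembly. This is a minimal sketch under stated assumptions: `select_within_budget` is a hypothetical helper, candidates arrive as `(score, text)` pairs, and the default token counter is a rough four-characters-per-token estimate.

```python
def select_within_budget(
    candidates: list[tuple[float, str]],
    context_window: int,
    max_fraction: float = 0.2,
    count_tokens=lambda text: max(1, len(text) // 4),  # rough estimate (assumption)
) -> list[str]:
    """Rank by score, then greedily keep memories until the token budget is spent."""
    budget = int(context_window * max_fraction)
    selected: list[str] = []
    used = 0
    for score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # too large for what is left; a smaller, lower-ranked item may still fit
        selected.append(text)
        used += cost
    return selected


candidates = [
    (0.5, "z" * 200),  # ~50 tokens
    (0.9, "x" * 400),  # ~100 tokens
    (0.8, "y" * 600),  # ~150 tokens
]
# 20% of a 1000-token window leaves a 200-token memory budget.
chosen = select_within_budget(candidates, context_window=1000)
print([text[0] for text in chosen])
```

Here the second-ranked memory is skipped because it would exceed the budget, while the smaller third-ranked one still fits.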

Pitfall 3: No Memory Eviction Policy

Without eviction, stores grow unbounded. Old, irrelevant memories add noise and increase costs.

Fix: Implement TTL (time-to-live) for episodic memories and importance-based pruning for semantic stores.

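-- PostgreSQL has no built-in TTL: store an expiry time and delete expired rows on a schedule
CREATE INDEX idx_expires_at ON episodes(expires_at);

-- Set TTL on insert
INSERT INTO episodes (content, importance, expires_at)
VALUES ($1, $2, NOW() + INTERVAL '30 days' * $2); -- importance scales retention

-- Run periodically (for example from cron or pg_cron)
DELETE FROM episodes WHERE expires_at < NOW();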

Pitfall 4: Memory Without Privacy Controls

In multi-user systems, memory isolation failures can leak one user’s data to another.

Fix: Namespace all memory keys by user ID. Apply row-level security in the database. Audit access patterns.

# Always scope queries to the requesting user
async def retrieve_memory(query: str, user_id: str) -> list:
    results = await vector_store.query(
        embedding=embed(query),
        filter={"user_id": user_id}, # Hard filter, not just ranking
        top_k=5
    )
    return results
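
The difference between a hard filter and a ranking signal is easiest to see in a toy store where other users' records are never candidates at all. The `NamespacedMemory` class below is a hypothetical in-memory stand-in for a real vector store, with keyword overlap in place of embedding similarity.

```python
class NamespacedMemory:
    """Toy store that partitions memories by user ID before any ranking happens."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[str]] = {}

    def write(self, user_id: str, text: str) -> None:
        self._by_user.setdefault(user_id, []).append(text)

    def search(self, user_id: str, query: str, top_k: int = 5) -> list[str]:
        terms = set(query.lower().split())
        # Isolation step: only the requesting user's records are considered.
        own = self._by_user.get(user_id, [])
        scored = [(len(terms & set(text.lower().split())), text) for text in own]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in matches[:top_k]]


store = NamespacedMemory()
store.write("alice", "Preferred deployment region is eu-west")
store.write("bob", "Preferred deployment region is us-east")

print(store.search("alice", "deployment region"))
print(store.search("carol", "deployment region"))
```

However well Bob's record matches the query, it cannot appear in Alice's results, because it is excluded before scoring rather than merely ranked lower.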

Putting It All Together: A Reference Architecture

class AgentMemorySystem:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.episodic_db = PostgreSQLStore()
        self.semantic_db = VectorStore(collection=f"user_{user_id}")
        self.procedures = ProcedureRegistry()
        self.working_memory = [] # In-context state

    async def retrieve_context(self, query: str, max_tokens: int = 2000) -> str:
        """Retrieve and assemble relevant memory for the current query."""

        # 1. Fetch recent episodes (recency bias)
        episodes = await self.episodic_db.recent(self.user_id, n=5)

        # 2. Semantic search for relevant knowledge
        knowledge = await self.semantic_db.search(query, top_k=5)

        # 3. Select relevant procedure
        procedure = self.procedures.select(query)

        # 4. Assemble and compress to fit token budget
        context_parts = [
            f"Relevant past interactions:\n{self._format(episodes)}",
            f"Relevant knowledge:\n{self._format(knowledge)}",
            f"Behavioral guidelines:\n{procedure}",
        ]

        return await compress_to_budget("\n\n".join(context_parts), max_tokens)

    async def write_back(self, interaction: dict):
        """Persist memory after each interaction."""
        summary = await self._summarize(interaction)
        importance = score_importance(interaction)

        await self.episodic_db.insert({
            "user_id": self.user_id,
            "summary": summary,
            "importance": importance,
            "expires_at": compute_expiry(importance)
        })

        # Extract any new standalone facts for semantic store
        facts = await self._extract_facts(interaction)
        for fact in facts:
            await self.semantic_db.upsert(fact, metadata={"user_id": self.user_id})
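
The class above calls a `compute_expiry` helper that is not defined in this guide. One possible version, sketched here as an assumption, mirrors the SQL rule from Pitfall 3: a base retention of 30 days scaled by importance. The `base_days` and `now` parameters are hypothetical additions that make the helper configurable and testable.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_expiry(
    importance: float,
    base_days: float = 30.0,
    now: Optional[datetime] = None,
) -> datetime:
    """Return the expiry timestamp: now + base_days scaled by importance."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now + timedelta(days=base_days * importance)


fixed = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(compute_expiry(0.5, now=fixed))  # 15 days after the fixed timestamp
```

Under this rule an importance of 0 expires immediately, so a real system may want a minimum retention floor.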

Key Takeaways

  1. Memory is not one thing. Working, episodic, semantic, and procedural memory serve different roles and require different implementations.
  2. Write-back is as important as retrieval. Memory systems that only read from the past but never learn from the present are incomplete.
  3. Start simple, scale deliberately. In-context state → external episodic store → vector RAG → consolidated profiles. Add layers as complexity demands.
  4. Design for eviction from day one. Unlimited memory growth is a cost and accuracy problem, not just a storage problem.
  5. Privacy is an architecture decision. Namespace, isolate, and audit memory access from the beginning — retrofitting privacy controls is expensive.
  6. Transparency beats opacity. Make memory retrieval visible to users. The ability to inspect, correct, and delete agent memories builds trust and improves system accuracy.

If this guide helped you build better agents, consider following for more deep dives on AI systems engineering, LLM inference optimization, and production agent architecture.

Tags: Artificial Intelligence · Machine Learning · LLM · AI Agents · Software Engineering