Agentic Memory Systems — From Chaotic Context to Learned Control
Your agent just failed a customer support escalation because it couldn't remember that this same user had already explained their billing issue twice in previous sessions. The context window filled up with tool calls and intermediate reasoning, and the critical historical context got evicted. This isn't a rare edge case—it's the default failure mode for any agent that runs longer than a single conversation turn. The 2024-era solutions of naive RAG retrieval and sliding window compression treat memory as passive storage, but production agents need something fundamentally different: the ability to decide what to remember.
The research wave from early 2026 has crystallized around a compelling answer. Papers on agentic memory architectures and benchmarks like MemoryArena have demonstrated that treating memory operations as learnable actions—not hardcoded heuristics—recovers 15-25% accuracy on multi-session tasks where even the best models were failing. This shift from "memory as database" to "memory as learned skill" represents the most significant architectural evolution in agent design since tool use became standard.
This article breaks down the four-memory-type architecture emerging as the production standard and shows you how to implement learned memory policies in LangGraph with the new LangChain + MongoDB integration.
The Four-Memory-Type Architecture for Agents
The cognitive science literature has long distinguished between different memory systems in humans, and this taxonomy turns out to be remarkably useful for agent design. The survey on memory mechanisms for autonomous agents identifies four distinct memory types that map directly to different operational needs in production systems.
Working memory is what the agent is thinking right now—the live reasoning context held in the current LLM call. It's bounded by your context window (128K tokens for Claude, up to 2M with Google's models), and everything flows through it. The critical insight is that working memory isn't just the user's message; it's the curated subset of all other memory types that's been loaded for this specific reasoning step.
Episodic memory stores timestamped records of specific interactions and events. When a user asks "what did we discuss last week about the API migration?" the answer lives in episodic memory. Each episode captures not just what was said, but the outcome—did the user seem satisfied? Did the suggested solution work? This outcome tracking is what enables learning.
Semantic memory contains consolidated facts and rules extracted from episodes. If a customer support agent handles fifty return requests, the episodes are individual conversations, but the semantic memory extracts "customers mentioning 'damaged in shipping' are eligible for express replacement without requiring photos." This generalization is what prevents agents from repeatedly discovering the same patterns.
Procedural memory stores action sequences and workflows as reusable routines. When an agent learns that processing a refund requires checking order status, then verifying payment method, then initiating the return, this becomes procedural knowledge that can be invoked without re-reasoning from first principles.
The interaction patterns matter as much as the types themselves. Episodic memory consolidates into semantic memory through generalization—after enough similar episodes, a pattern becomes a fact. Procedural and semantic memory load into working memory during task execution, providing the context needed for reasoning. The architectural taxonomies emerging in the literature consistently show this hierarchical flow: episodes → facts → working context.
| Memory Type | Persistence | Update Frequency | Typical Backend |
|---|---|---|---|
| Working | Single call | Every token | LLM context |
| Episodic | Long-term | Per interaction | Document store, MongoDB |
| Semantic | Long-term | Periodic consolidation | Vector store, graph DB |
| Procedural | Long-term | Rare refinement | Code/config, document store |
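The hierarchical flow described above — episodes consolidate into facts, and long-term stores load into working context — can be sketched as a tiny routing map. This is an illustrative sketch, not any library's API; the names are ours:

```python
from enum import Enum
from typing import Optional

class MemoryType(Enum):
    WORKING = "working"        # live LLM context, single call
    EPISODIC = "episodic"      # timestamped interaction records
    SEMANTIC = "semantic"      # consolidated facts and rules
    PROCEDURAL = "procedural"  # reusable action sequences

# Long-term stores that feed working memory during task execution.
LOADS_INTO_WORKING = {MemoryType.EPISODIC, MemoryType.SEMANTIC, MemoryType.PROCEDURAL}

# Generalization path: enough similar episodes become a semantic fact.
CONSOLIDATES_INTO = {MemoryType.EPISODIC: MemoryType.SEMANTIC}

def consolidation_target(source: MemoryType) -> Optional[MemoryType]:
    """Return the memory type a source generalizes into, if any."""
    return CONSOLIDATES_INTO.get(source)
```

The asymmetry is the point: everything can load *into* working memory, but only episodic memory consolidates *upward* — working memory itself is never persisted wholesale.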
From Passive Storage to Learned Memory Policies
The traditional approach to agent memory is entirely heuristic. Summarize every N turns. Retrieve the top-K similar chunks. Compress anything older than M messages. These rules are easy to implement and easy to reason about, but they fail at the edges where production systems actually live.
Over-summarization loses critical detail. A summary that says "user discussed billing issues" isn't useful when the specific detail was that the user's card was charged twice on March 3rd for transaction ID 4829. Under-retrieval causes agents to repeat mistakes or ask users to re-explain problems they've already described. The heuristics don't know what matters for the current task.
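To make the over-summarization failure concrete, here is a deliberately naive sliding-window policy (a sketch, not production code): it keeps the last few turns verbatim and collapses everything older into a topic-level summary — which is exactly where the transaction-level detail disappears.

```python
def naive_compress(messages: list[str], keep_last: int = 4) -> list[str]:
    """Sliding-window heuristic: keep the last N turns verbatim,
    collapse everything older into a single topic-level summary."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    # A topic-level summary drops specifics like dates and transaction IDs.
    summary = f"[summary of {len(older)} earlier turns: user discussed billing issues]"
    return [summary] + recent

history = [
    "My card was charged twice on March 3rd, transaction ID 4829.",
    "Support: we'll look into it.",
    "Any update on the duplicate charge?",
    "Support: still investigating.",
    "I was told this would be resolved last week.",
    "Support: apologies for the delay.",
]
compressed = naive_compress(history)
# The transaction ID survives only if it happens to fall inside the
# window — here it does not, so the agent can no longer reference it.
assert "4829" not in " ".join(compressed)
```

The heuristic has no way to know that "4829" matters more than the apology in turn six; a learned policy trained on task outcomes can.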
The breakthrough in the agentic memory research is treating memory operations—store, retrieve, consolidate, forget—as actions in a reinforcement learning framework. Instead of hardcoding "summarize every 10 turns," you train the agent to decide when summarization helps and when it hurts. The training signal comes from downstream task success: did remembering this detail lead to a correct answer? Did consolidating those episodes produce a useful generalization?
The agent memory paper list catalogs the rapid evolution of these techniques. Step-wise policy gradient methods like GRPO allow fine-grained credit assignment—which specific memory decision contributed to the final outcome? This is fundamentally different from end-to-end training because memory decisions have delayed effects; storing something now might only prove useful three sessions later.
Benchmark results from MemoryArena illustrate the gap. Models that achieve near-perfect scores on single-session long-context tasks (LoCoMo-style benchmarks) drop to 40-60% accuracy on multi-session tasks with interdependencies. The context window is long enough, but the agent can't figure out what to load from history. Learned memory policies recover 15-25% of this accuracy gap—not by expanding context, but by making smarter decisions about what goes into it.
The operational gotcha is that learned policies require task-specific fine-tuning. An off-the-shelf model won't magically know what to remember for your customer support workflow versus your code review assistant. Until you've collected enough trajectories to train on, you need explicit memory scaffolding—which brings us to implementation.
Hands-On: Code Walkthrough
We'll build a LangGraph agent that maintains episodic and semantic memory across sessions using MongoDB as the backend. This architecture leverages the LangChain + MongoDB integration announced for production agent deployments. The goal is a working memory system you can deploy today with heuristic policies, structured for easy upgrade to learned policies later.
```python
from datetime import datetime, timedelta
from typing import Optional
from pydantic import BaseModel, Field
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.mongodb import MongoDBSaver
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from pymongo import MongoClient
import re
import uuid


# Step 1: Define memory schemas with Pydantic
# These schemas determine what we track in each memory type

class Episode(BaseModel):
    """A single interaction event with full context and outcome."""
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = Field(default_factory=datetime.utcnow)
    user_id: str
    summary: str                    # Natural language summary of what happened
    entities: list[str]             # Extracted entities (names, IDs, topics)
    user_message: str               # The original user input
    agent_response: str             # What the agent said
    outcome: Optional[str] = None   # success, failure, unknown
    outcome_signal: Optional[float] = None  # Numeric reward for RL training


class SemanticFact(BaseModel):
    """A consolidated fact extracted from one or more episodes."""
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    statement: str                  # The actual fact, e.g., "User prefers email over phone"
    confidence: float = Field(ge=0.0, le=1.0)  # How certain we are
    source_episode_ids: list[str]   # Provenance for debugging
    created_at: datetime = Field(default_factory=datetime.utcnow)
    last_used: Optional[datetime] = None  # For LRU-style eviction
    use_count: int = 0              # Track utility for learned policies


class MemoryState(BaseModel):
    """The state object passed through the LangGraph nodes."""
    messages: list                  # Conversation history (working memory)
    user_id: str
    current_episode: Optional[Episode] = None
    retrieved_episodes: list[Episode] = Field(default_factory=list)
    retrieved_facts: list[SemanticFact] = Field(default_factory=list)
    memory_action: Optional[str] = None  # What the controller decided
    consolidation_pending: bool = False


# Step 2: Memory storage layer using MongoDB
# Separate collections for episodes and semantic facts

class MemoryStore:
    def __init__(self, mongo_uri: str, db_name: str = "agent_memory"):
        self.client = MongoClient(mongo_uri)
        self.db = self.client[db_name]
        self.episodes = self.db["episodes"]
        self.semantic_facts = self.db["semantic_facts"]
        # Create indexes for efficient queries
        self.episodes.create_index([("user_id", 1), ("timestamp", -1)])
        self.episodes.create_index([("entities", 1)])
        self.semantic_facts.create_index([("user_id", 1), ("confidence", -1)])

    def store_episode(self, user_id: str, episode: Episode) -> str:
        """Persist an episode to MongoDB."""
        doc = episode.model_dump()
        doc["user_id"] = user_id
        self.episodes.insert_one(doc)
        return episode.id

    def query_episodes(
        self,
        user_id: str,
        entities: Optional[list[str]] = None,
        limit: int = 5,
        days_back: int = 30,
    ) -> list[Episode]:
        """Retrieve relevant episodes for a user."""
        query = {
            "user_id": user_id,
            "timestamp": {"$gte": datetime.utcnow() - timedelta(days=days_back)},
        }
        if entities:
            query["entities"] = {"$in": entities}
        cursor = self.episodes.find(query).sort("timestamp", -1).limit(limit)
        return [Episode(**doc) for doc in cursor]

    def count_recent_episodes(self, user_id: str, entity: str) -> int:
        """Count episodes mentioning an entity—used as the consolidation trigger."""
        return self.episodes.count_documents({
            "user_id": user_id,
            "entities": entity,
            "timestamp": {"$gte": datetime.utcnow() - timedelta(days=7)},
        })

    def store_semantic_fact(self, user_id: str, fact: SemanticFact) -> str:
        """Persist a semantic fact to MongoDB."""
        doc = fact.model_dump()
        doc["user_id"] = user_id
        self.semantic_facts.insert_one(doc)
        return fact.id

    def query_semantic_facts(
        self,
        user_id: str,
        min_confidence: float = 0.7,
        limit: int = 10,
    ) -> list[SemanticFact]:
        """Retrieve high-confidence facts for context loading."""
        cursor = self.semantic_facts.find({
            "user_id": user_id,
            "confidence": {"$gte": min_confidence},
        }).sort("confidence", -1).limit(limit)
        return [SemanticFact(**doc) for doc in cursor]


# Step 3: Define the memory controller node
# This is where the heuristic policy lives—replace it with a learned policy later

def memory_controller(state: MemoryState, store: MemoryStore) -> MemoryState:
    """
    Decide which memory operation to perform based on the current state.

    Heuristic policy (v1):
    - Always retrieve relevant episodes and facts before reasoning
    - Store an episode after each user turn
    - Trigger consolidation when 5+ episodes share an entity
    """
    # Extract entities from the last user message (simplified)
    last_message = state.messages[-1].content if state.messages else ""
    # In production, use NER or LLM extraction here
    entities = extract_entities_simple(last_message)

    # Retrieve relevant context
    state.retrieved_episodes = store.query_episodes(
        state.user_id, entities=entities, limit=3
    )
    state.retrieved_facts = store.query_semantic_facts(
        state.user_id, min_confidence=0.7, limit=5
    )

    # Check whether consolidation should trigger
    for entity in entities:
        if store.count_recent_episodes(state.user_id, entity) >= 5:
            state.consolidation_pending = True
            state.memory_action = f"consolidate_{entity}"
            break
    else:
        state.memory_action = "retrieve_only"
    return state


def extract_entities_simple(text: str) -> list[str]:
    """Placeholder entity extraction—use spaCy or LLM extraction in production."""
    # Very simplified: extract capitalized words
    words = re.findall(r"\b[A-Z][a-z]+\b", text)
    return list(set(words))[:5]


# Step 4: Build the reasoning node that uses memory context

def reasoning_node(state: MemoryState, llm: ChatAnthropic) -> MemoryState:
    """
    Main reasoning with memory-augmented context.
    Loads retrieved episodes and facts into working memory.
    """
    # Build the system prompt with memory context
    memory_context = build_memory_context(
        state.retrieved_episodes, state.retrieved_facts
    )
    system_msg = SystemMessage(content=f"""You are a helpful assistant with access to memory of past interactions.

RELEVANT HISTORY:
{memory_context}

Use this context to provide personalized, consistent responses.
Reference past interactions when relevant.""")

    # Call the LLM with the augmented context
    messages = [system_msg] + state.messages
    response = llm.invoke(messages)

    # Create an episode record for this interaction
    state.current_episode = Episode(
        user_id=state.user_id,
        summary=f"User asked about: {state.messages[-1].content[:100]}",
        entities=extract_entities_simple(state.messages[-1].content),
        user_message=state.messages[-1].content,
        agent_response=response.content,
    )

    # Add the response to the conversation
    state.messages.append(AIMessage(content=response.content))
    return state


def build_memory_context(episodes: list[Episode], facts: list[SemanticFact]) -> str:
    """Format retrieved memories for inclusion in the prompt."""
    parts = []
    if facts:
        parts.append("KNOWN FACTS ABOUT THIS USER:")
        for fact in facts:
            parts.append(f"- {fact.statement} (confidence: {fact.confidence:.0%})")
    if episodes:
        parts.append("\nRECENT RELEVANT INTERACTIONS:")
        for ep in episodes:
            date_str = ep.timestamp.strftime("%Y-%m-%d")
            parts.append(f"- [{date_str}] {ep.summary}")
    return "\n".join(parts) if parts else "No relevant history found."


# Step 5: Consolidation node—extracts semantic facts from episodes

def consolidation_node(
    state: MemoryState, store: MemoryStore, llm: ChatAnthropic
) -> MemoryState:
    """
    Consolidate multiple episodes into semantic facts.
    This is where episodic → semantic generalization happens.
    """
    if not state.consolidation_pending:
        return state

    # Get the episodes to consolidate
    entity = state.memory_action.replace("consolidate_", "")
    episodes = store.query_episodes(state.user_id, entities=[entity], limit=10)
    if len(episodes) < 3:
        state.consolidation_pending = False
        return state

    # Use the LLM to extract generalizable facts
    episode_summaries = "\n".join(
        f"- {ep.timestamp.strftime('%Y-%m-%d')}: {ep.summary}" for ep in episodes
    )
    extraction_prompt = f"""Based on these past interactions, extract 1-3 general facts about the user's preferences, patterns, or needs:

INTERACTIONS:
{episode_summaries}

Output each fact on its own line, starting with "FACT: "
Only include facts that appear consistently across multiple interactions."""
    response = llm.invoke([HumanMessage(content=extraction_prompt)])

    # Parse and store the extracted facts
    for line in response.content.split("\n"):
        if line.strip().startswith("FACT:"):
            fact = SemanticFact(
                statement=line.replace("FACT:", "").strip(),
                confidence=0.8,  # Adjust based on episode count
                source_episode_ids=[ep.id for ep in episodes],
            )
            store.store_semantic_fact(state.user_id, fact)

    state.consolidation_pending = False
    return state


# Step 6: Wire everything into a LangGraph StateGraph

def build_memory_agent(mongo_uri: str, anthropic_api_key: str):
    """Construct the full memory-enabled agent graph."""
    # Initialize components
    store = MemoryStore(mongo_uri)
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        api_key=anthropic_api_key,
    )
    checkpointer = MongoDBSaver.from_conn_string(mongo_uri)

    # Define the graph
    graph = StateGraph(MemoryState)

    # Add nodes with bound dependencies
    graph.add_node("memory_controller", lambda s: memory_controller(s, store))
    graph.add_node("reasoning", lambda s: reasoning_node(s, llm))
    graph.add_node("store_episode", lambda s: store_and_return(s, store))
    graph.add_node("consolidation", lambda s: consolidation_node(s, store, llm))

    # Define the edges
    graph.add_edge(START, "memory_controller")
    graph.add_edge("memory_controller", "reasoning")
    graph.add_edge("reasoning", "store_episode")

    # Conditional edge: consolidate if pending, else end
    graph.add_conditional_edges(
        "store_episode",
        lambda s: "consolidation" if s.consolidation_pending else "end",
        {"consolidation": "consolidation", "end": END},
    )
    graph.add_edge("consolidation", END)

    return graph.compile(checkpointer=checkpointer)


def store_and_return(state: MemoryState, store: MemoryStore) -> MemoryState:
    """Persist the current episode and return the state."""
    if state.current_episode:
        store.store_episode(state.user_id, state.current_episode)
    return state


# Usage example with LangSmith tracing
from langsmith import traceable

@traceable(name="memory_agent_turn", tags=["memory_enabled"])
def run_agent_turn(agent, user_id: str, message: str, thread_id: str):
    """Execute a single turn with full tracing."""
    config = {
        "configurable": {"thread_id": thread_id},
        "metadata": {"user_id": user_id},
    }
    initial_state = MemoryState(
        messages=[HumanMessage(content=message)],
        user_id=user_id,
    )
    # LangGraph returns the final state as a dict of values,
    # even when the state schema is a Pydantic model.
    result = agent.invoke(initial_state, config)
    return result["messages"][-1].content
```
The extension point for learned policies is the memory_controller function. Replace the heuristic rules with a fine-tuned classifier that takes the current state and predicts the optimal memory action. The GRPO training approach mentioned in the research uses trajectories where you label which memory decisions led to successful task completion.
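One way to structure that swap (a sketch; the `MemoryPolicy` protocol and class names are ours, not from LangGraph or the papers) is to put the heuristic rules and the learned classifier behind the same interface, so the controller node never changes when you upgrade:

```python
from typing import Callable, Protocol

MEMORY_ACTIONS = ["retrieve_only", "store", "consolidate", "forget"]

class MemoryPolicy(Protocol):
    def decide(self, features: dict) -> str:
        """Map state features to one of MEMORY_ACTIONS."""
        ...

class HeuristicPolicy:
    """The v1 rules from memory_controller, behind the policy interface."""
    def decide(self, features: dict) -> str:
        if features.get("episodes_sharing_entity", 0) >= 5:
            return "consolidate"
        return "retrieve_only"

class LearnedPolicy:
    """Placeholder for a fine-tuned classifier (e.g. trained with GRPO on
    labeled trajectories). `model` is any callable returning action scores."""
    def __init__(self, model: Callable[[dict], dict]):
        self.model = model

    def decide(self, features: dict) -> str:
        scores = self.model(features)  # assumed: dict of action -> score
        return max(scores, key=scores.get)

# Swapping policies becomes a one-line change at construction time;
# the graph topology and the MongoDB layer are untouched.
policy: MemoryPolicy = HeuristicPolicy()
```

The features dict is where the trajectory data pays off: entity counts, time since last retrieval, and past hit rates are all cheap to compute and are plausible inputs for the classifier.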
Benchmarking Your Agent's Memory: MemoryArena and MemBench
Standard benchmarks for long-context models miss the critical challenge in production agents. LoCoMo, LongBench, and similar evaluations test single-session performance—can the model find a needle in a haystack within one context window? But your production agent runs across dozens of sessions over weeks or months. The survey on memory evaluation identifies this gap as a primary reason deployed agents underperform their benchmark scores.
MemoryArena addresses this with four domains specifically designed for multi-session evaluation: customer support (returning users with ongoing issues), project management (tasks that span days with status updates), personal assistant (preference learning over time), and collaborative coding (incremental feature development). Tasks span 5-20 sessions with explicit interdependencies—session 7 might require information from session 2 that wasn't relevant in sessions 3-6.
The agentic AI architectures survey highlights five dimensions for memory evaluation that you should track in your own systems:
- Retention accuracy: Does the agent remember critical facts after they leave the context window?
- Retrieval precision: When memory is loaded, is it actually relevant to the current query?
- Consolidation quality: Do extracted semantic facts accurately generalize from episodes?
- Interference resistance: Does learning new information corrupt existing memories?
- Forgetting appropriateness: Does the agent correctly discard outdated or superseded information?
For practical measurement in LangSmith, instrument your agent with these metrics:
- Memory hit rate: Of retrieved memories, what percentage appeared in the final response or reasoning trace? Track this with metadata tags on your traces.
- Consolidation ratio: Episodes created vs. semantic facts extracted. A ratio of 5:1 (5 episodes per fact) suggests healthy generalization; 2:1 might indicate overfitting to specific instances.
- Memory bloat: Total storage growth per active user per week. Unbounded growth signals missing TTL policies or over-storage.
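All three metrics reduce to arithmetic over plain trace records. A sketch (field names and the substring-match heuristic are illustrative, not a LangSmith schema):

```python
def memory_hit_rate(retrieved: list[str], response: str) -> float:
    """Fraction of retrieved memory snippets whose content actually
    surfaced in the final response (crude substring matching)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for snippet in retrieved if snippet.lower() in response.lower())
    return hits / len(retrieved)

def consolidation_ratio(num_episodes: int, num_facts: int) -> float:
    """Episodes per extracted fact; around 5:1 suggests healthy generalization."""
    return num_episodes / max(num_facts, 1)

def weekly_bloat_mb(bytes_start: int, bytes_end: int, active_users: int) -> float:
    """Storage growth per active user per week, in MB."""
    return (bytes_end - bytes_start) / max(active_users, 1) / 1_048_576
```

In practice you would run these over exported traces nightly and alert when the hit rate drops (retrieval is fetching noise) or the bloat trend bends upward (TTL policies have stopped working).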
To create your own MemoryArena-style evaluation, export multi-session conversation logs from your production system, annotate them with ground-truth "should remember" and "should retrieve" labels, and compare agent performance with memory enabled versus disabled. The Awesome AI Agents collection includes several evaluation harnesses you can adapt.
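A minimal shape for those annotated records (a sketch; the field names are ours) makes the key property explicit — some turns depend on information from much earlier sessions:

```python
from dataclasses import dataclass, field

@dataclass
class SessionTurn:
    session_id: int
    user_message: str
    # Ground-truth annotations added during export:
    should_remember: list[str] = field(default_factory=list)
    should_retrieve: list[str] = field(default_factory=list)  # facts from earlier sessions

@dataclass
class MultiSessionTask:
    turns: list[SessionTurn]

    def interdependent_turns(self) -> list[SessionTurn]:
        """Turns that need information from an earlier session —
        exactly the cases single-session benchmarks never exercise."""
        return [t for t in self.turns if t.should_retrieve]

task = MultiSessionTask(turns=[
    SessionTurn(1, "My order #77 arrived damaged.",
                should_remember=["order #77 damaged"]),
    SessionTurn(7, "Any update on my damaged order?",
                should_retrieve=["order #77 damaged"]),
])
```

Scoring is then a matter of checking whether the agent's retrievals on interdependent turns cover the `should_retrieve` labels, with memory enabled versus disabled.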
What This Means for Your Stack
Immediate adoption (this week): Add structured episodic logging to your existing agents. Even without learned policies, queryable history improves debugging when something goes wrong and increases user trust when the agent demonstrates continuity. The code above gives you a working MongoDB-backed episodic store you can deploy today.
Medium-term investment (next quarter): Implement the four-memory-type separation. Use MongoDB or PostgreSQL with JSON columns for episodic storage—the LangChain + MongoDB partnership provides native integration. Add a vector store (Pinecone, Weaviate, or MongoDB Atlas Vector Search) for semantic retrieval. The investment pays off in personalization quality and reduced user friction.
Advanced path (6+ months): Fine-tune a memory controller on your domain using GRPO or DPO. This requires collecting trajectories with labeled outcomes—which memory decisions led to task success? The emerging frameworks like Auton provide scaffolding for this training loop, but expect to invest in custom data collection infrastructure.
One critical architecture decision: should memory consolidation run inline (during the conversation) or as a background job? Inline consolidation adds latency—100-500ms for an LLM call to extract facts—but keeps memory fresh. Background batch processing adds staleness (facts extracted hours after the relevant episodes) but maintains conversation responsiveness. For most applications, background consolidation with aggressive episode retrieval is the right trade-off.
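The background variant can be as simple as a queue between the agent and a scheduled worker. A sketch (the queue and the drain helper are illustrative; in production you would likely use Celery, a cron job, or a MongoDB change stream instead of an in-process queue):

```python
import queue

consolidation_queue: queue.Queue = queue.Queue()

def request_consolidation(user_id: str, entity: str) -> None:
    """Called inline by the agent: an O(1) enqueue, so no LLM
    latency lands on the conversation's hot path."""
    consolidation_queue.put((user_id, entity))

def drain_consolidations(consolidate_fn) -> int:
    """Run from a scheduler or daemon thread: processes all pending
    requests off the conversation path. The trade-off is staleness —
    facts may lag their source episodes by minutes or hours."""
    handled = 0
    while True:
        try:
            user_id, entity = consolidation_queue.get_nowait()
        except queue.Empty:
            return handled
        consolidate_fn(user_id, entity)  # the expensive LLM extraction call
        handled += 1

# Inline path stays fast; the slow extraction happens whenever the
# scheduler next drains the queue.
request_consolidation("user_42", "billing")
```

The `consolidate_fn` slot is where the walkthrough's consolidation logic plugs in; everything on the request side stays sub-millisecond.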
Operational considerations you'll hit in production: memory storage grows unboundedly without intervention. Implement TTL policies (archive episodes older than 90 days to cold storage), user-scoped isolation (critical for multi-tenant systems), and GDPR-compliant deletion hooks (when a user requests data deletion, you need to cascade through episodes, facts, and any derived embeddings).
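MongoDB's native TTL indexes cover episode expiry, and the GDPR hook has to cascade through every derived artifact. A sketch against the walkthrough's collections (`db` is any pymongo-style database handle; the 90-day window is an assumption):

```python
def install_retention_policies(db) -> None:
    """Native MongoDB TTL index: documents are hard-deleted roughly
    90 days after their `timestamp`. Archive episodes to cold storage
    first if they feed a training pipeline."""
    db["episodes"].create_index("timestamp", expireAfterSeconds=90 * 24 * 3600)

def delete_user_data(db, user_id: str) -> dict:
    """GDPR deletion hook: cascade through episodes, derived semantic
    facts, and (not shown here) any embeddings keyed to these documents."""
    episodes = db["episodes"].delete_many({"user_id": user_id})
    facts = db["semantic_facts"].delete_many({"user_id": user_id})
    return {"episodes": episodes.deleted_count, "facts": facts.deleted_count}
```

Note that TTL deletion is destructive: MongoDB removes expired documents outright, so the archive step must run before expiry, not after.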
The agentic engineering practices emerging in production teams emphasize that memory systems are infrastructure, not features. Budget for them accordingly—monitoring, alerting on memory bloat, and regular audits of consolidation quality.
What to Build This Week
Project: Memory-Enabled Support Agent with Consolidation Dashboard
Build a customer support agent that remembers user preferences and issue history across sessions, with a Streamlit dashboard showing memory operations in real-time.
- Deploy the MongoDB-backed memory system from the code walkthrough
- Create a simple support chat interface (Gradio or Streamlit)
- Simulate 10 multi-turn conversations with 3 different "users," each discussing a recurring topic (billing, technical issues, feature requests)
- Build a dashboard that displays:
- Episode timeline per user
- Extracted semantic facts with source episode links
- Memory hit rate per conversation (did retrieved memories appear in responses?)
- Consolidation triggers (when and why facts were extracted)
- Run a comparison: disable memory retrieval for half the conversations and measure how often the agent asks users to repeat information they've already provided
The success metric is demonstrating that memory-enabled conversations require fewer clarifying questions and produce more personalized responses. Post your results with LangSmith trace links—the community needs more real-world data on memory system performance.
Sources
- Announcing the LangChain + MongoDB Partnership: The AI Agent Stack That Runs On The Database You Already Trust
- Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Integration
- The paper list of "Memory in the Age of AI Agents: A Survey"
- AI Trends 2026: Test-Time Reasoning and the Rise of Reflective Agents
- Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Future Directions
- The Auton Agentic AI Framework: A Declarative Architecture
- Awesome AI Agents for 2026
- How Swarms of AI Agents Are Redefining Software Engineering
*This is part of the **Agentic Engineering Weekly** series — a deep-dive every Monday into the frameworks, patterns, and techniques shaping the next generation of AI systems.*
Follow the Agentic Engineering Weekly series on Dev.to to catch every edition.
Building something agentic? Drop a comment — I'd love to feature reader projects.