WonderLab

Agent Series (6): Memory Management — Teaching Your Agent to Remember What Matters

Memory: Turning an Agent from a "Tool" into an "Assistant"

An Agent without memory starts from scratch every conversation. You tell it your name, your job, your preferred learning style — next time, it has no idea who you are.

That's not a bug. It's a missing architectural layer.

LLMs are stateless by design. Every call is independent. If you want an Agent to remember things, you must explicitly store, manage, and retrieve memory at the architecture level. That's exactly what memory management solves.

This article breaks Agent memory down into four dimensions: a taxonomy of memory types, three context management strategies, LangGraph's two memory primitives (checkpointer and store), and an auto-compression scheme for arbitrarily long conversations.


Four Memory Types: From Cognitive Science to Engineering

Borrowing from cognitive science's model of human memory, Agent memory naturally decomposes into four layers — each with a corresponding LangGraph implementation:

┌──────────────────────────────────────────────────────────────┐
│                     Memory Hierarchy                          │
├──────────────────────┬───────────────────────────────────────┤
│ Sensory Memory       │ In-flight messages for the current    │
│                      │ turn. Lifetime: one LLM call.         │
├──────────────────────┼───────────────────────────────────────┤
│ Working Memory       │ Recent message history (last K turns) │
│                      │ Implementation: inject messages list   │
│                      │ into prompt                           │
├──────────────────────┼───────────────────────────────────────┤
│ Episodic Memory      │ Vectorized / summarized history       │
│                      │ Implementation: summary compression + │
│                      │ VectorStore retrieval                 │
├──────────────────────┼───────────────────────────────────────┤
│ Semantic Memory      │ Long-term user preferences and facts  │
│                      │ Implementation: LangGraph store       │
│                      │ (KV Store)                            │
└──────────────────────┴───────────────────────────────────────┘

Sensory Memory: The Current Turn

The most transient layer. The input and output of one LLM call — used and discarded:

q = "What is len([1, 2, 3])?"
answer = llm.invoke([HumanMessage(q)])
# answer.content → "len([1, 2, 3]) equals 3."
# Once invoke() returns, the model itself retains nothing about this exchange

There's nothing to "manage" here — sensory memory is just the LLM call itself.

Working Memory: Bounded Conversation History

Prepend the last few turns of conversation into the prompt. The effect is immediate and obvious:

history = [
    HumanMessage("My name is Li Lei, I'm a Python engineer"),
    AIMessage("Hello, Li Lei! Nice to meet you."),
    HumanMessage("I've been learning LangGraph lately"),
    AIMessage("LangGraph is powerful — great for building stateful Agents."),
]
test_q = "What's my name again?"

Measured output:

With history → "Yes, you told me your name is Li Lei, and you're a Python engineer..."
Without history → "I'm sorry, I cannot recall the name you told me earlier, as an AI
                   I don't have persistent memory to store personal data..."

The gap is stark. The drawback: token cost grows linearly with conversation length, so you need truncation or summarization to keep it bounded.
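
A minimal sketch of a bounded working-memory buffer, assuming messages are plain `(role, text)` tuples rather than LangChain message objects so it runs with the standard library alone. The `WorkingMemory` class and its `max_turns` parameter are hypothetical names for illustration.

```python
from collections import deque


class WorkingMemory:
    """Keep only the most recent `max_turns` user/assistant exchanges."""

    def __init__(self, max_turns: int = 2):
        # Each turn contributes two entries: one human message, one AI reply
        self._messages = deque(maxlen=max_turns * 2)

    def add_turn(self, user_text: str, assistant_text: str) -> None:
        self._messages.append(("human", user_text))
        self._messages.append(("ai", assistant_text))

    def as_prompt(self, question: str) -> list:
        # Recent history first, then the new question
        return list(self._messages) + [("human", question)]


memory = WorkingMemory(max_turns=2)
memory.add_turn("My name is Li Lei, I'm a Python engineer", "Hello, Li Lei!")
memory.add_turn("I've been learning LangGraph lately", "LangGraph is powerful.")
memory.add_turn("I also use Go at work", "Go pairs well with Python services.")
prompt = memory.as_prompt("What's my name again?")
```

With a window of two turns, the first turn (the one containing the name) has already been evicted by the time the question is asked. That is the cost of a bounded window, and the reason the later layers exist.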

Episodic Memory: Compressed History Snippets

When conversation history grows long, stuffing everything into the prompt becomes expensive. Episodic memory's approach: compress first, then store:

long_history = history * 4  # 16 messages
summary = llm.invoke([
    SystemMessage("Compress the following conversation into under 60 words, preserving key facts"),
    HumanMessage(str([m.content for m in long_history])),
])
# → "Li Lei, Python engineer, learning LangGraph enthusiastically,
#    praised it as powerful for stateful Agents."

16 messages compressed into one short summary. On the next turn, send the summary instead of the raw history and token cost drops dramatically.

Semantic Memory: Cross-Session User Facts

The most persistent layer. Survives across conversations, specifically for storing long-term facts about the user (name, role, preferences):

# Store user profile in KV store — readable in any future session
user_profile = {
    "name": "Li Lei",
    "role": "Python engineer",
    "interests": ["LangGraph", "Agent development"],
    "level": "intermediate",
}
# With this, the Agent gives personalized answers
# "What should I learn next?" → recommend LangGraph advanced patterns,
# not Python basics
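
One way the stored profile might reach the model is by rendering it into the system prompt. The sketch below assumes that approach; `profile_to_system_prompt` and the `[Known user facts]` label are hypothetical, not part of any library.

```python
def profile_to_system_prompt(profile: dict) -> str:
    """Render stored user facts as a section of the system prompt."""
    lines = ["You are a helpful assistant.", "", "[Known user facts]"]
    for field, value in profile.items():
        # Lists such as interests are flattened to a comma-separated string
        rendered = ", ".join(value) if isinstance(value, list) else str(value)
        lines.append(f"- {field}: {rendered}")
    return "\n".join(lines)


user_profile = {
    "name": "Li Lei",
    "role": "Python engineer",
    "interests": ["LangGraph", "Agent development"],
    "level": "intermediate",
}
system_prompt = profile_to_system_prompt(user_profile)
```

The resulting string would be passed as the system message ahead of the user's question, so the model can tailor its answer to the stored level and interests.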

Three Context Management Strategies: Truncation / Summarization / Retrieval

When conversation history keeps growing and the context window can't keep up, you have three options:

Strategy 1: Truncation

The simplest approach — keep only the last N messages, discard the rest:

# Keep only the last 4 messages
truncated = history[-4:]
resp = llm.invoke(truncated + [HumanMessage(test_q)])

Tested with an 8-topic conversation history (16 messages), truncated to the last 4, then asked "What is a Python list?":

Earliest visible message after truncation: "Explain Python decorators" (topic 7; "lists" was topic 1)
Answer: "A Python list is a built-in data structure..." (answered from LLM's own knowledge)
⚠ Lost the fact that we "covered" lists — LLM answers from general knowledge only

Best for: Scenarios where historical continuity doesn't matter, or pure Q&A Agents where past context is irrelevant.
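
Truncating by message count ignores how long each message is. A variant that may fit better in practice is truncating to a token budget. The sketch below assumes a crude one-token-per-word estimate (real tokenizers differ) and uses plain strings for messages; `estimate_tokens` and `truncate_to_budget` are hypothetical helpers.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly one token per whitespace-separated word
    return len(text.split())


def truncate_to_budget(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose combined estimate fits in `budget`."""
    kept = []
    used = 0
    for text in reversed(messages):       # walk from newest to oldest
        cost = estimate_tokens(text)
        if used + cost > budget:
            break                         # stop at the first message that does not fit
        kept.append(text)
        used += cost
    kept.reverse()                        # restore chronological order
    return kept


history_texts = [
    "Explain Python lists",
    "Lists are mutable ordered sequences",
    "Explain Python decorators",
    "Decorators wrap functions",
]
recent = truncate_to_budget(history_texts, budget=8)
# recent → ["Explain Python decorators", "Decorators wrap functions"]
```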

Strategy 2: Summarization

Use the LLM to compress long history into a summary paragraph, then use the summary instead of raw history for subsequent turns:

summary_resp = llm.invoke([
    SystemMessage("Compress conversation history into a summary (≤80 words), keep all topic names"),
    HumanMessage("\n".join([f"{m.type}: {m.content}" for m in history])),
])
# → "Python lists are mutable and ordered, tuples are immutable and memory-efficient,
#    dicts map keys to values, sets store unique elements,
#    functions encapsulate logic, classes enable OOP,
#    decorators wrap functions, generators enable lazy evaluation."
# 16 messages → 66-word summary

With the same "What is a Python list?" question, the summarization approach answers with awareness that "we've discussed this" — not just generic knowledge.
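
A sketch of how the summary might replace older history while the newest messages stay verbatim. The summarizer is injected as a callable so the example runs without an LLM; `fake_summarize` is a stand-in and `summarize_and_trim` is a hypothetical helper, not a LangChain API.

```python
from typing import Callable


def summarize_and_trim(
    messages: list[str],
    summarize: Callable[[str], str],
    keep_recent: int = 2,
) -> list[str]:
    """Replace everything except the newest `keep_recent` messages with one summary."""
    if len(messages) <= keep_recent:
        return list(messages)             # nothing old enough to compress
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(old))
    return [f"[Summary of earlier conversation] {summary}"] + recent


def fake_summarize(text: str) -> str:
    # Stand-in for an LLM call: just report how many lines were compressed
    return f"{len(text.splitlines())} earlier messages about Python basics"


history_texts = [f"message {i}" for i in range(1, 7)]
context = summarize_and_trim(history_texts, fake_summarize, keep_recent=2)
```

In a real Agent, `summarize` would wrap the `llm.invoke` call shown above.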

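Strategy comparison (retrieval is covered in the next subsection):

┌───────────────┬────────────┬──────────────────────┬────────────────────────┬──────────────────────────────────────┐
│ Strategy      │ Token Cost │ Information Retained │ Complexity             │ Best For                             │
├───────────────┼────────────┼──────────────────────┼────────────────────────┼──────────────────────────────────────┤
│ Truncation    │ Lowest     │ Recent turns only    │ Trivial                │ Q&A, stateless tasks                 │
│ Summarization │ Low        │ Global thread        │ Medium                 │ Teaching, consulting, long sessions  │
│ Retrieval     │ Lowest     │ Precisely relevant   │ High (needs vector DB) │ Knowledge bases, multi-domain Agents │
└───────────────┴────────────┴──────────────────────┴────────────────────────┴──────────────────────────────────────┘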

Strategy 3: Retrieval

Pull only the history snippets semantically relevant to the current question — the most effective and most complex approach:

# Simplified demo: keyword filter (production uses vector similarity)
relevant = [m for m in history if "list" in m.content.lower()]
# 16 messages → 3 relevant ones
resp = llm.invoke(relevant + [HumanMessage(test_q)])

Best for: Knowledge-base Agents, personal assistants with extensive user histories.
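
To show the shape of similarity-based retrieval without a vector database, the sketch below ranks messages by cosine similarity over bag-of-words counts. This is an assumption made for illustration: a production system would use embedding vectors from a model, and `to_vector`, `cosine`, and `retrieve` are hypothetical helpers.

```python
import math
import re
from collections import Counter


def to_vector(text: str) -> Counter:
    # Bag-of-words counts stand in for a real embedding vector
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, messages: list[str], k: int = 2) -> list[str]:
    """Return the k messages most similar to the query, best match first."""
    query_vec = to_vector(query)
    ranked = sorted(messages, key=lambda m: cosine(query_vec, to_vector(m)), reverse=True)
    return ranked[:k]


history_texts = [
    "A Python list is a mutable ordered sequence",
    "Decorators wrap functions to add behavior",
    "Generators enable lazy evaluation",
    "Use list append to add an item to a list",
]
top = retrieve("What is a Python list?", history_texts, k=2)
```

Only the two list-related messages are selected; the decorator and generator messages share no terms with the query and score zero.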


LangGraph checkpointer: Within-Session State Persistence

LangGraph's MemorySaver (checkpointer) uses thread_id to distinguish sessions and automatically accumulates conversation history within a session:

from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

checkpointer = MemorySaver()
agent = create_react_agent(model=llm, tools=[get_weather], checkpointer=checkpointer)

# Same thread_id = same session, state persists across turns
config_a = {"configurable": {"thread_id": "weather_001"}}

Cross-Turn Reference in Practice

Three consecutive weather queries in the same session:

[Turn 1] User:  What's the weather in Beijing today?
         Agent: Beijing today: sunny, 25°C, NE wind level 3, good air quality.
         (Messages in state: 4)

[Turn 2] User:  What about Shanghai?   ← "What about" has no explicit referent
         Agent: Shanghai today: overcast, 22°C, SE wind level 2, light smog.
         (Messages in state: 8)

[Turn 3] User:  Which city is better for going out today?  ← needs both prior results
         Agent: Given Shanghai's smog, Beijing is the better choice today.
         (Messages in state: 10)

Turns 2 and 3 both depend on prior history — checkpointer handles cross-turn context automatically.

Session Isolation

Different thread_id values are completely independent:

[New session — thread_id: weather_002]
User:  What city did I just ask about?
Agent: You asked "what city," but no specific city name was provided.
       If you'd like to check the weather somewhere, please tell me the city.
→ New thread_id has no history — has no idea what was asked before

Session Continuation

Return to the same thread_id later — history is still there:

[Session A continued — same thread_id]
User:  Compare those two cities with Shenzhen
Agent: Shenzhen today: showers, 27°C, SW wind level 2, thunderstorm warning.
       (Provided three-city comparison: Beijing / Shanghai / Shenzhen)
→ Remembered the previous Beijing and Shanghai queries

MemorySaver is in-memory — data is lost on process restart. For production, use SqliteSaver (local file) or PostgresSaver (database).
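
The session behavior above can be pictured with a toy model: one message list per thread_id, held in a dictionary. This is a conceptual sketch only, not how LangGraph's checkpointers are implemented; `ToyCheckpointer` is a hypothetical class.

```python
from collections import defaultdict


class ToyCheckpointer:
    """Illustrative stand-in: one message list per thread_id, held in process memory."""

    def __init__(self):
        self._threads = defaultdict(list)

    def append(self, thread_id: str, role: str, text: str) -> None:
        self._threads[thread_id].append((role, text))

    def history(self, thread_id: str) -> list:
        # An unknown thread_id simply has no history yet
        return list(self._threads.get(thread_id, []))


saver = ToyCheckpointer()
saver.append("weather_001", "human", "What's the weather in Beijing today?")
saver.append("weather_001", "ai", "Beijing today: sunny, 25°C.")
saver.append("weather_001", "human", "What about Shanghai?")

same_session = saver.history("weather_001")   # three messages
new_session = saver.history("weather_002")    # empty list
```

Because the dictionary lives in process memory, everything disappears on restart, which is the same limitation that motivates a file-backed or database-backed saver in production.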


LangGraph InMemoryStore: Cross-Session Long-Term Memory

The checkpointer handles memory within a single session. Cross-session long-term memory requires the store:

checkpointer  →  keyed by thread_id, valid for that session's lifetime
store         →  keyed by namespace (here one that includes user_id), persists across sessions

Core API

from langgraph.store.memory import InMemoryStore

store = InMemoryStore()
user_id = "user_123"   # placeholder identifier for illustration

# Write: (namespace, key, value)
store.put(("user_facts", user_id), "profile", {"fact": "Li Lei, backend engineer"})

# Read: search all entries under a namespace
facts = store.search(("user_facts", user_id))
for item in facts:
    print(item.value["fact"])

# Precise read by key (returns None if the key does not exist)
item = store.get(("user_facts", user_id), "profile")

Cross-Session Memory in Practice

In Session A, the Agent automatically extracts and stores user information from the conversation:

[Session A] User said three things → auto-extracted and stored:
  • Li Lei, backend engineer
  • Python, Go, LangGraph, Agent development
  • Prefers hands-on practice over reading docs

In a completely new Session B (new thread_id, same user_id), asking "Do you know me?":

[Session B — brand new thread_id]
User:  Hi, do you know me?
Agent: Hello! Based on your profile, yes I do. You're Li Lei, a backend engineer
       skilled in Python, Go, LangGraph, and Agent development.
       You prefer hands-on practice over reading documentation.
       How can I help you today?
→ Brand new thread_id, but store data persists across sessions

Because the namespace includes user_id, each user's data lives under its own namespace: a search for one user does not return another user's entries.
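
The isolation property follows directly from keying on the namespace tuple. The toy store below mimics the shape of the put/get/search calls shown earlier, as an illustration only: `ToyStore` is hypothetical, it returns plain dicts rather than LangGraph's item objects, and the second user is invented for the example.

```python
class ToyStore:
    """Illustrative stand-in for a namespaced key-value store."""

    def __init__(self):
        self._data = {}   # maps (namespace tuple, key) -> value dict

    def put(self, namespace: tuple, key: str, value: dict) -> None:
        self._data[(namespace, key)] = value

    def get(self, namespace: tuple, key: str):
        return self._data.get((namespace, key))

    def search(self, namespace: tuple) -> list:
        # Return every value stored under exactly this namespace
        return [v for (ns, _), v in self._data.items() if ns == namespace]


toy_store = ToyStore()
toy_store.put(("user_facts", "user_a"), "profile", {"fact": "Li Lei, backend engineer"})
toy_store.put(("user_facts", "user_a"), "style", {"fact": "Prefers hands-on practice"})
toy_store.put(("user_facts", "user_b"), "profile", {"fact": "Hypothetical second user"})

facts_a = [item["fact"] for item in toy_store.search(("user_facts", "user_a"))]
```

Searching under `("user_facts", "user_a")` never sees the entry written for `user_b`, because the two tuples are different keys.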

checkpointer vs store

# Short-term memory: checkpointer — bound to thread_id, valid within session
app = graph.compile(checkpointer=MemorySaver())
result = app.invoke(input, config={"configurable": {"thread_id": "abc"}})

# Long-term memory: store — bound to user_id, valid across sessions
store = InMemoryStore()
app = graph.compile(store=store, checkpointer=MemorySaver())
# Operate on store inside a node (a node that declares a `store` parameter receives the compiled store)
store.put(("user_facts", user_id), key, {"fact": "..."})
stored = store.search(("user_facts", user_id))

In production, swap InMemoryStore for PostgresStore or RedisStore to get real persistence — the architecture stays identical.


Auto-Summarization: RemoveMessage + Summary Rotation

When message count exceeds a threshold, trigger automatic compression. This is the core mechanism that keeps an Agent coherent over very long conversations without unbounded context growth.
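
The threshold check itself can be sketched as a small routing function that returns the name of the next step. The threshold value and the labels "compress" and "end" are assumptions for illustration; in LangGraph, a function of this kind would typically be wired in with `add_conditional_edges`.

```python
THRESHOLD = 8   # assumed value, matching the measured run later in this article


def route_after_chat(state: dict) -> str:
    """Return the name of the next step based on how many messages are in state."""
    if len(state["messages"]) > THRESHOLD:
        return "compress"
    return "end"


short_state = {"messages": ["m"] * 8}    # at the threshold: no compression yet
long_state = {"messages": ["m"] * 10}    # over the threshold: compress
```

Note the strict `>`: with a threshold of 8, compression first fires at 10 messages, since each turn adds two.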

Graph Structure

[chat node]
    │
    ├─ message count ≤ threshold → END
    └─ message count > threshold → [compress node] → END

The compress Node: Deleting Old Messages with RemoveMessage

from langchain_core.messages import HumanMessage, RemoveMessage, SystemMessage

def compress_node(state: SummaryState) -> dict:
    messages = state["messages"]
    to_compress = messages[:-2]   # keep the 2 most recent, compress the rest

    # Fold the previous summary and the old messages into one new summary
    existing_summary = state.get("summary") or ""
    old_messages_text = "\n".join(f"{m.type}: {m.content}" for m in to_compress)
    new_summary = llm.invoke([
        SystemMessage("Compress the following into a summary under 120 words"),
        HumanMessage(f"Previous summary:\n{existing_summary}\n\nNew messages:\n{old_messages_text}"),
    ]).content

    # RemoveMessage: tells the add_messages reducer to delete these messages
    remove_ops = [RemoveMessage(id=m.id) for m in to_compress]
    return {"messages": remove_ops, "summary": new_summary}

RemoveMessage is LangGraph's dedicated message-deletion operator. When the add_messages reducer encounters it, it removes the matching message ID from state — rather than appending.
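
A simplified model of that reducer behavior, written with plain dataclasses so it runs without LangGraph. It is a sketch of the semantics described above (delete on a removal marker, replace on a matching ID, otherwise append), not the library's actual implementation; `Msg`, `Remove`, and `toy_add_messages` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    id: str
    content: str


@dataclass
class Remove:
    id: str   # id of the message to delete


def toy_add_messages(current: list, updates: list) -> list:
    """Simplified reducer: delete on Remove, replace on matching id, else append."""
    result = list(current)
    for update in updates:
        if isinstance(update, Remove):
            result = [m for m in result if m.id != update.id]
        elif any(m.id == update.id for m in result):
            result = [update if m.id == update.id else m for m in result]
        else:
            result.append(update)
    return result


state = [Msg("1", "hi"), Msg("2", "hello"), Msg("3", "bye")]
after_removal = toy_add_messages(state, [Remove("1"), Remove("2")])
after_append = toy_add_messages(state, [Msg("4", "again")])
after_shorter_list = toy_add_messages(state, state[-1:])
```

The last call shows why returning a shorter list does not shrink state under this kind of reducer: the update is merged into the existing list, so all three messages remain.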

Measured Results

11 turns of conversation, compression threshold at 8 messages:

[Turns 1–4]  Message count:  2/4/6/8  | Summary: ○ none

  [Compression triggered] 10 messages → compress 8, keep 2
  [New summary]  Python list common methods include search, sort, add, remove.
                 dict.get() avoids KeyError, returns default value.
                 *args accepts variable positional args, **kwargs accepts keyword args...

[Turn 5]    Message count:  2  | Summary: ✓ compressed  ← restarts from 2 after compression

  [Compression triggered again] 10 messages → compress 8, keep 2

[Turn 11]  Final summary:
           "Based on our discussion, here's a summary of Python concepts covered:
            1. Python list comprehensions...
            2. Set comprehensions...   ← inherited from Turn 1 via summary chain
            3. Lambda functions..."

Key result: across all 11 turns, state held only 2–8 messages after each turn, yet the summary chain still carried turn-1 material into the turn-11 summary. Summaries are lossy, so what survives is the key facts rather than every detail.
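
The message counts in that run can be reproduced with a few lines of arithmetic. The sketch assumes each turn adds exactly two messages (one user, one assistant) and that compression keeps the two most recent; `simulate` is a hypothetical helper and involves no LLM.

```python
def simulate(turns: int, threshold: int = 8, keep: int = 2):
    """Track the active message count per turn under compress-when-over-threshold."""
    count = 0
    compressions = 0
    counts_after_turn = []
    for _ in range(turns):
        count += 2                  # one user message plus one assistant reply
        if count > threshold:
            count = keep            # older messages are folded into the summary
            compressions += 1
        counts_after_turn.append(count)
    return counts_after_turn, compressions


counts, compressions = simulate(turns=11)
# counts       → [2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6]
# compressions → 2
```

Compression fires at turns 5 and 9, when the count briefly reaches 10 before dropping back to 2.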

State Design

from typing import Annotated, Optional, TypedDict

from langchain_core.messages import SystemMessage
from langgraph.graph.message import add_messages

class SummaryState(TypedDict):
    messages: Annotated[list, add_messages]  # add_messages handles RemoveMessage
    summary: Optional[str]                   # accumulated history summary, injected into system prompt

def chat_node(state: SummaryState) -> dict:
    # llm is the chat model instance used throughout this article
    summary = state.get("summary") or ""
    system_parts = ["You are a helpful assistant."]
    if summary:
        system_parts.append(f"\n\n[Conversation History Summary]\n{summary}")
    resp = llm.invoke([SystemMessage("".join(system_parts))] + state["messages"])
    return {"messages": [resp]}

Memory Management Design Checklist

Everything to consider when building a complete Agent memory system:

Short-term memory (checkpointer)

  • [ ] Choose the right checkpointer backend (MemorySaver for dev, SqliteSaver/PostgresSaver for prod)
  • [ ] Assign each session a unique thread_id (track user_id separately for long-term memory)
  • [ ] Set a history truncation threshold to prevent unbounded token growth

Long-term memory (store)

  • [ ] Organize user data by namespace: (type, user_id) e.g., ("user_facts", uid)
  • [ ] Apply confidence filtering during extraction — avoid storing meaningless noise
  • [ ] Replace InMemoryStore with PostgresStore / RedisStore in production

Context compression

  • [ ] Determine compression threshold (8–20 messages is a reasonable starting range)
  • [ ] Make the summary prompt explicit about what to preserve (topic names, key decisions, user preferences)
  • [ ] Test the summary chain: does the Turn-N summary still carry information from Turn 1?
  • [ ] Use RemoveMessage rather than returning a shortened messages list (the add_messages reducer merges updates into state by ID, so a shorter list deletes nothing)

Memory read/write timing

  • [ ] Read memory: at the start of chat_node, inject into system prompt
  • [ ] Write memory: at the end of chat_node, extract new user information
  • [ ] Avoid writing on every turn (set confidence or content-length filters)
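
A write filter along the lines of the last item might look like the sketch below. The thresholds, the `should_store` helper, and the sample candidates are all hypothetical; the confidence score is assumed to come from whatever extraction step produces the facts.

```python
def should_store(fact: str, confidence: float, known_facts: set,
                 min_confidence: float = 0.7, min_length: int = 8) -> bool:
    """Decide whether an extracted fact is worth writing to long-term memory."""
    cleaned = fact.strip()
    if len(cleaned) < min_length:
        return False                 # too short to be a useful fact
    if confidence < min_confidence:
        return False                 # the extractor was not sure enough
    if cleaned.lower() in known_facts:
        return False                 # already stored, skip the duplicate write
    return True


known = {"li lei, backend engineer"}
candidates = [
    ("Li Lei, backend engineer", 0.95),                      # duplicate
    ("ok", 0.99),                                            # too short
    ("Might like Rust", 0.40),                               # low confidence
    ("Prefers hands-on practice over reading docs", 0.90),   # passes all checks
]
to_write = [fact for fact, conf in candidates if should_store(fact, conf, known)]
# to_write → ["Prefers hands-on practice over reading docs"]
```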

Summary

Five core takeaways:

  1. Four memory types, each with a distinct role: Sensory memory is the LLM call itself; working memory is conversation history; episodic memory is compressed summaries; semantic memory is a cross-session KV store
  2. checkpointer manages sessions, store manages users: thread_id is the session dimension, user_id is the user dimension — keep them separate
  3. Summary compression is the key to long conversations: RemoveMessage + summary injection keeps token cost bounded while carrying the key facts forward through the summary chain
  4. Session isolation is non-negotiable: Different thread_id histories never bleed; different user_id long-term memories never bleed
  5. InMemoryStore → PostgresStore is a one-line swap: Architecture stays constant, backend is pluggable

Up next: Knowledge Base Integration — the real difference between RAG-as-a-tool and a RAG pipeline, multi-knowledge-base routing, and how an Agent decides when to retrieve, what to retrieve, and how many times.

