DEV Community

Cover image for Agent Series (15): Advanced Agent Memory — Short-term, Long-term, Compression
WonderLab
WonderLab

Posted on

Agent Series (15): Advanced Agent Memory — Short-term, Long-term, Compression

Memory Isn't Just "Store the Chat Log"

Dumping conversation history into the prompt is the crudest form of memory. Real systems have more complex needs:

  • The user mentioned their city in turn 3; the Agent should know where to look when they ask about weather in turn 10
  • The user told the system last week which product plan they use — a new session shouldn't ask again
  • After 20 turns, the context window is almost full — how do you compress without losing critical information?

These are three different problems requiring three different mechanisms.


Three-Layer Memory Architecture

Short-term    Within session    MemorySaver checkpointer    Multi-turn Q&A
Long-term     Cross-session     Persistent KV / vector DB   Personalization
Compression   Within session    Summary replacement         Long-conversation token guard
Enter fullscreen mode Exit fullscreen mode

Demo 1: Short-term Memory — MemorySaver

LangGraph's MemorySaver is the lightest short-term memory implementation. It binds conversation history to a thread_id and automatically injects it on subsequent calls.

from langgraph.checkpoint.memory import MemorySaver
from langchain_core.runnables import RunnableConfig

checkpointer = MemorySaver()
stateful_agent = create_react_agent(
    model=llm,
    tools=[get_weather, calculator, get_product_info],
    checkpointer=checkpointer,
)

THREAD_A: RunnableConfig = {"configurable": {"thread_id": "thread-alice"}}

# Turn 1: introduce name and city
r1 = stateful_agent.invoke(
    {"messages": [HumanMessage("Hi, I'm Alice. I live in Beijing.")]},
    config=THREAD_A,
)

# Turn 2: same thread_id — full history is automatically attached
r2 = stateful_agent.invoke(
    {"messages": [HumanMessage("What's the weather like where I live today?")]},
    config=THREAD_A,
)
Enter fullscreen mode Exit fullscreen mode

Real benchmark results:

Thread A Turn 1: Hello Alice! How can I assist you today?

Thread A Turn 2: Sure, I can help you with that. I will need to know the 
                 city you are in. Could you please provide me with the 
                 name of your city?
                 Tools used: []

Thread B (no context): I can help with that. Could you please provide 
                       your city name?
                       Tools used: []
Enter fullscreen mode Exit fullscreen mode

Thread A and Thread B gave identical responses.

This is an important finding: MemorySaver's infrastructure worked correctly — Thread A's second call carried the full two-message history while Thread B only had one. But GLM-4-Flash didn't connect "I live in Beijing" (turn 1) to "where I live" (turn 2). This is a model capability issue, not a MemorySaver issue.

With the same prompt, GPT-4 or Claude would likely query Beijing weather directly. A weaker model may need more explicit input ("What's the weather in Beijing?") to trigger the tool call.

Short-term memory has two layers:

  1. Infrastructure layer: MemorySaver ensures history is passed along ✓
  2. Model layer: Whether the LLM can extract and use context from history ← depends on model capability

Demo 2: Long-term Memory — Cross-session Fact Store

The core idea for cross-session memory: use an LLM to extract key facts from conversation, store them persistently, and inject them into the system prompt for the next session.

Session 1 — extract and save:

# Simulated persistent store (production: replace with database or vector store)
LONG_TERM_STORE: dict[str, dict[str, str]] = {}

def extract_facts(conversation: str) -> dict[str, str]:
    resp = llm.invoke([
        SystemMessage(
            "Extract key facts about the user. "
            'Return ONLY JSON: {"city": "...", "plan": "..."}'
        ),
        HumanMessage(f"Conversation:\n{conversation}"),
    ])
    # parse JSON response...
Enter fullscreen mode Exit fullscreen mode

Session 1 conversation:

User: I'm Alice. I'm based in Shanghai and my team uses WonderBot Pro.
User: We mainly use the API for data processing — about 50,000 calls a month.
Enter fullscreen mode Exit fullscreen mode

Extracted and saved:

{'name': 'alice', 'city': 'shanghai', 'team': 'wonderbot pro', 'api_calls': '50000'}
Enter fullscreen mode Exit fullscreen mode

Session 2 — inject and use:

stored = load_user_facts("user-alice")
facts_text = "; ".join(f"{k}={v}" for k, v in stored.items())

personalized_prompt = (
    "You are a helpful assistant. "
    f"Known facts about this user: {facts_text}. "
    "Use these facts to personalize your responses without asking the user to repeat themselves."
)

personalized_agent = create_react_agent(model=llm, tools=TOOLS, prompt=personalized_prompt)
Enter fullscreen mode Exit fullscreen mode

Session 2 result:

User: What's the weather like in my city today?
Agent: The current weather in Shanghai is 22 degrees Celsius with cloudy conditions.
Tools used: ['get_weather']
Enter fullscreen mode Exit fullscreen mode

The Agent queried Shanghai directly, no clarifying question asked. That's because city=shanghai was already in the system prompt — the model read an explicit fact, not an inference from history.

This is why long-term memory is more reliable than short-term for cross-session use: facts in explicit KV format don't require the model to reason backward through conversation history.


Demo 3: History Compression

As conversations grow, token consumption and response latency scale linearly. The compression strategy: set a token threshold, then replace history with a summary when exceeded.

COMPRESSION_THRESHOLD = 250   # tokens

def summarize_messages(messages: list) -> str:
    history_text = "\n".join(
        f"{'User' if isinstance(m, HumanMessage) else 'Agent'}: {str(m.content)[:150]}"
        for m in messages
        if isinstance(m, (HumanMessage, AIMessage)) and not getattr(m, "tool_calls", None)
    )
    resp = llm.invoke([
        SystemMessage(
            "Summarize this conversation in 2-3 sentences. "
            "Preserve all key facts: names, cities, numbers, product names."
        ),
        HumanMessage(f"Conversation:\n{history_text}"),
    ])
    return str(resp.content)

# After each turn, check token count
if total_tokens > COMPRESSION_THRESHOLD:
    summary = summarize_messages(messages)
    messages = [SystemMessage(f"Conversation summary so far: {summary}")]
Enter fullscreen mode Exit fullscreen mode

Demo 3 real results: 5 turns (Bob in Shenzhen, evaluating WonderBot Pro, 8 developers, annual cost 299×12=3588) stayed at 198 tokens total — below the 250 threshold, so compression was never triggered.

Final verification:

User: Quickly summarize: who am I, what city, and what's the annual API cost?
Agent: You are Bob, from Shenzhen, and the annual API cost for WonderBot Pro 
       for 8 developers is $3,588.
Enter fullscreen mode Exit fullscreen mode

All 10 messages were retained. The Agent recalled every key fact correctly. Compression is a safety valve, not a per-turn operation — when conversation stays concise, the raw history is more precise than a summary. Recommended threshold: 2000–4000 tokens.


Comparison of Three Patterns

Pattern Demo 1 Result Demo 2 Result Key Difference
Short-term Infrastructure correct; GLM-4-Flash failed to use implicit context Depends on model reasoning ability
Long-term Directly called get_weather(Shanghai), zero clarifying questions Explicit KV injection, no inference needed
Compression Safety valve, triggered on demand

Core conclusion:

  • Need reliable cross-turn context → use long-term memory explicit injection, not just MemorySaver
  • MemorySaver's value is session isolation (separate thread_ids don't bleed into each other) and automatic history passing — it does not solve "can the model reason about implicit context"
  • Compression is a production necessity, but don't set the threshold too low

Design Checklist

Short-term Memory (MemorySaver)

  • [ ] Assign a distinct thread_id per user / per conversation
  • [ ] Use user ID for thread_id, not a random string — keeps multi-turn conversations continuous
  • [ ] Don't rely on MemorySaver to solve "model fails to infer implicit information"
  • [ ] Production: use a persistent checkpointer (SqliteSaver, PostgresSaver), not MemorySaver

Long-term Memory

  • [ ] Use an LLM to extract facts — don't hand-write parsing rules
  • [ ] Inject facts in explicit KV format into the system prompt; explicit beats implicit
  • [ ] Define an update policy: overwrite stale facts instead of appending indefinitely
  • [ ] Production: structured DB for factual memory, vector store for semantic memory

History Compression

  • [ ] Threshold: 2000–4000 tokens (too low = frequent compression = precision loss)
  • [ ] Summarization prompt must say "preserve specific numbers and names" — LLMs over-abstract otherwise
  • [ ] After compression, verify that key facts still appear in the summary
  • [ ] Never trigger compression mid-tool-call — it breaks the execution context

Summary

Five core takeaways:

  1. MemorySaver is infrastructure, not a silver bullet: it ensures history is passed along, but whether the model can use implicit context depends on model capability
  2. Explicit injection is more reliable than implicit inference: putting city=shanghai in the system prompt is more stable than hoping the model will deduce it from 10 messages of history
  3. Use all three layers in combination: short-term (session isolation) + long-term (cross-session personalization) + compression (token guard)
  4. Fact extraction is the key step for long-term memory: LLM extraction + JSON parsing converts unstructured conversation into structured, injectable facts
  5. Compression is a safety valve, not a per-turn operation: raw history is more precise when conversation is short; only compress when the threshold is exceeded

Up next: Agent Tool Design — how to design high-quality tools: tool granularity, error handling, retry strategies, and when to use one tool vs. splitting into multiple.


References


Check out PrimeSkills — a curated marketplace of AI agents and skills that have been validated in real-world, enterprise-grade workflows. No fluff, just what actually works.

Find more useful knowledge and interesting products on my Homepage

Top comments (0)