Memory Isn't Just "Store the Chat Log"
Dumping conversation history into the prompt is the crudest form of memory. Real systems have more complex needs:
- The user mentioned their city in turn 3; the Agent should know where to look when they ask about weather in turn 10
- The user told the system last week which product plan they use — a new session shouldn't ask again
- After 20 turns, the context window is almost full — how do you compress without losing critical information?
These are three different problems requiring three different mechanisms.
Three-Layer Memory Architecture
| Layer | Scope | Implementation | Use case |
|---|---|---|---|
| Short-term | Within session | MemorySaver checkpointer | Multi-turn Q&A |
| Long-term | Cross-session | Persistent KV / vector DB | Personalization |
| Compression | Within session | Summary replacement | Long-conversation token guard |
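To make the division of labor concrete, here is a minimal sketch of how the three layers might feed a single model call. The names MemoryContext and build_model_input are hypothetical and messages are plain (role, text) tuples rather than LangChain message objects; it is an illustration of the idea, not the code used in the demos.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryContext:
    """Hypothetical container for what each memory layer contributes to one call."""
    facts: dict[str, str] = field(default_factory=dict)          # long-term layer
    summary: str = ""                                             # compression layer
    recent: list[tuple[str, str]] = field(default_factory=list)  # short-term layer


def build_model_input(ctx: MemoryContext, user_text: str) -> list[tuple[str, str]]:
    """Assemble (role, text) pairs: system prompt, then recent history, then the new turn."""
    system_parts = ["You are a helpful assistant."]
    if ctx.facts:
        facts_text = "; ".join(f"{k}={v}" for k, v in ctx.facts.items())
        system_parts.append(f"Known facts about this user: {facts_text}.")
    if ctx.summary:
        system_parts.append(f"Conversation summary so far: {ctx.summary}")
    return [("system", " ".join(system_parts)), *ctx.recent, ("user", user_text)]
```

Each layer only adds to the input when it has something to contribute, so a brand-new user with a short conversation gets a plain system prompt plus raw history.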
Demo 1: Short-term Memory — MemorySaver
LangGraph's MemorySaver is the lightest short-term memory implementation. It binds conversation history to a thread_id and automatically injects it on subsequent calls.
from langchain_core.messages import HumanMessage
from langchain_core.runnables import RunnableConfig
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

# llm and the three tools are assumed to be defined earlier in the series
checkpointer = MemorySaver()
stateful_agent = create_react_agent(
    model=llm,
    tools=[get_weather, calculator, get_product_info],
    checkpointer=checkpointer,
)

THREAD_A: RunnableConfig = {"configurable": {"thread_id": "thread-alice"}}

# Turn 1: introduce name and city
r1 = stateful_agent.invoke(
    {"messages": [HumanMessage("Hi, I'm Alice. I live in Beijing.")]},
    config=THREAD_A,
)

# Turn 2: same thread_id, so the saved history is attached automatically
r2 = stateful_agent.invoke(
    {"messages": [HumanMessage("What's the weather like where I live today?")]},
    config=THREAD_A,
)
Real benchmark results:
Thread A Turn 1: Hello Alice! How can I assist you today?
Thread A Turn 2: Sure, I can help you with that. I will need to know the
city you are in. Could you please provide me with the
name of your city?
Tools used: []
Thread B (no context): I can help with that. Could you please provide
your city name?
Tools used: []
Thread A and Thread B gave effectively the same response: both asked for the city, and neither called a tool.
This is an important finding. MemorySaver's infrastructure worked correctly: Thread A's second call carried the turn-1 exchange along with the new question, while Thread B started from the new question alone. But GLM-4-Flash didn't connect "I live in Beijing" (turn 1) to "where I live" (turn 2). This is a model capability issue, not a MemorySaver issue.
With the same prompt, GPT-4 or Claude would likely query Beijing weather directly. A weaker model may need more explicit input ("What's the weather in Beijing?") to trigger the tool call.
Short-term memory has two layers:
- Infrastructure layer: MemorySaver ensures history is passed along ✓
- Model layer: Whether the LLM can extract and use context from history ← depends on model capability
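The infrastructure layer can be pictured as a message list keyed by thread_id. The sketch below is a toy stand-in written for illustration (ToyCheckpointer, invoke, and count_model are hypothetical names, not MemorySaver's implementation); it uses a fake model that only reports how many messages it received, which separates "was the history passed?" from "did the model use it?".

```python
from collections import defaultdict


class ToyCheckpointer:
    """Illustrative stand-in: keeps one message list per thread_id, in process memory."""

    def __init__(self) -> None:
        self._threads: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def history(self, thread_id: str) -> list[tuple[str, str]]:
        # Return a copy so callers cannot mutate stored state by accident
        return list(self._threads[thread_id])

    def append(self, thread_id: str, role: str, text: str) -> None:
        self._threads[thread_id].append((role, text))


def invoke(cp: ToyCheckpointer, thread_id: str, user_text: str, model) -> str:
    """Add the new turn to the thread, call the model on the whole thread, save the reply."""
    cp.append(thread_id, "user", user_text)
    reply = model(cp.history(thread_id))  # the model sees everything stored for this thread
    cp.append(thread_id, "assistant", reply)
    return reply


def count_model(messages):
    """Fake model: reports only how many messages it was given."""
    return f"saw {len(messages)} message(s)"


cp = ToyCheckpointer()
invoke(cp, "thread-alice", "Hi, I'm Alice. I live in Beijing.", count_model)
reply_a = invoke(cp, "thread-alice", "What's the weather like where I live today?", count_model)
reply_b = invoke(cp, "thread-bob", "What's the weather like where I live today?", count_model)
print(reply_a)  # saw 3 message(s)
print(reply_b)  # saw 1 message(s)
```

The two threads never see each other's messages, and the second call on thread-alice receives the earlier exchange. Whether a real model then acts on "I live in Beijing" is a separate question that this layer cannot answer.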
Demo 2: Long-term Memory — Cross-session Fact Store
The core idea for cross-session memory: use an LLM to extract key facts from conversation, store them persistently, and inject them into the system prompt for the next session.
Session 1 — extract and save:
from langchain_core.messages import HumanMessage, SystemMessage

# Simulated persistent store (production: replace with database or vector store)
LONG_TERM_STORE: dict[str, dict[str, str]] = {}

def extract_facts(conversation: str) -> dict[str, str]:
    # Ask the model for a JSON object of user facts
    resp = llm.invoke([
        SystemMessage(
            "Extract key facts about the user. "
            'Return ONLY JSON: {"city": "...", "plan": "..."}'
        ),
        HumanMessage(f"Conversation:\n{conversation}"),
    ])
    # parse JSON response...
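The parsing step is elided in the snippet above. One possible way to handle it, sketched under assumptions: parse_facts is a hypothetical helper, and lowercasing keys and values is a guess chosen to resemble the sample output, not something the original code is known to do.

```python
import json
import re


def parse_facts(raw: str) -> dict[str, str]:
    """Pull the first JSON object out of an LLM reply and normalize it to str -> str."""
    # Models often wrap JSON in prose or a Markdown fence, so search for the braces
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    # Drop empty or placeholder values; lowercase the rest (an assumption, see lead-in)
    return {
        str(k).strip().lower(): str(v).strip().lower()
        for k, v in data.items()
        if v not in (None, "", "...")
    }
```

Returning an empty dict on any failure keeps a malformed reply from overwriting facts that were already stored.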
Session 1 conversation:
User: I'm Alice. I'm based in Shanghai and my team uses WonderBot Pro.
User: We mainly use the API for data processing — about 50,000 calls a month.
Extracted and saved:
{'name': 'alice', 'city': 'shanghai', 'team': 'wonderbot pro', 'api_calls': '50000'}
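Session 2 calls load_user_facts, which is not shown in this article. A minimal sketch of what the save and load pair could look like on top of the simulated LONG_TERM_STORE; save_user_facts is a hypothetical name, and the Beijing-to-Shanghai update is made-up data to show newer facts overwriting stale ones.

```python
# Same simulated store as in the article; a real system would use a database
LONG_TERM_STORE: dict[str, dict[str, str]] = {}


def save_user_facts(user_id: str, new_facts: dict[str, str]) -> None:
    """Merge new facts into the user's record; newer values overwrite stale ones."""
    LONG_TERM_STORE.setdefault(user_id, {}).update(new_facts)


def load_user_facts(user_id: str) -> dict[str, str]:
    """Return a copy of the stored facts, or an empty dict for an unknown user."""
    return dict(LONG_TERM_STORE.get(user_id, {}))


save_user_facts("user-alice", {"name": "alice", "city": "beijing"})
save_user_facts("user-alice", {"city": "shanghai"})  # overwrites the stale city
print(load_user_facts("user-alice"))  # {'name': 'alice', 'city': 'shanghai'}
```

Overwriting by key keeps the injected prompt short and avoids feeding the model two conflicting values for the same fact.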
Session 2 — inject and use:
stored = load_user_facts("user-alice")
facts_text = "; ".join(f"{k}={v}" for k, v in stored.items())
personalized_prompt = (
    "You are a helpful assistant. "
    f"Known facts about this user: {facts_text}. "
    "Use these facts to personalize your responses without asking the user to repeat themselves."
)
personalized_agent = create_react_agent(model=llm, tools=TOOLS, prompt=personalized_prompt)
Session 2 result:
User: What's the weather like in my city today?
Agent: The current weather in Shanghai is 22 degrees Celsius with cloudy conditions.
Tools used: ['get_weather']
The Agent queried Shanghai directly, no clarifying question asked. That's because city=shanghai was already in the system prompt — the model read an explicit fact, not an inference from history.
This is why long-term memory is more reliable than short-term for cross-session use: facts in explicit KV format don't require the model to reason backward through conversation history.
Demo 3: History Compression
As a conversation grows, the history resent on every call grows with it, and token consumption and response latency rise roughly in proportion. The compression strategy: set a token threshold, then replace the history with a summary once it is exceeded.
COMPRESSION_THRESHOLD = 250  # tokens

def summarize_messages(messages: list) -> str:
    # Keep only plain user/agent text: skip tool-call messages, cap each message at 150 chars
    history_text = "\n".join(
        f"{'User' if isinstance(m, HumanMessage) else 'Agent'}: {str(m.content)[:150]}"
        for m in messages
        if isinstance(m, (HumanMessage, AIMessage)) and not getattr(m, "tool_calls", None)
    )
    resp = llm.invoke([
        SystemMessage(
            "Summarize this conversation in 2-3 sentences. "
            "Preserve all key facts: names, cities, numbers, product names."
        ),
        HumanMessage(f"Conversation:\n{history_text}"),
    ])
    return str(resp.content)

# After each turn, check token count (total_tokens is tracked by the calling loop)
if total_tokens > COMPRESSION_THRESHOLD:
    summary = summarize_messages(messages)
    messages = [SystemMessage(f"Conversation summary so far: {summary}")]
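The snippet above does not show where total_tokens comes from. As a rough sketch, the trigger can be wrapped in one function with a pluggable summarizer. Everything here is an assumption for illustration: estimate_tokens uses a crude 4-characters-per-token heuristic instead of a real tokenizer, maybe_compress is a hypothetical name, and messages are plain (role, text) tuples.

```python
from typing import Callable

Message = tuple[str, str]  # (role, text)

COMPRESSION_THRESHOLD = 250  # tokens, same value as the demo


def estimate_tokens(messages: list[Message]) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return sum(len(text) for _, text in messages) // 4


def maybe_compress(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    threshold: int = COMPRESSION_THRESHOLD,
) -> list[Message]:
    """Return messages unchanged at or under the threshold, else a single summary message."""
    if estimate_tokens(messages) <= threshold:
        return messages
    summary = summarize(messages)
    return [("system", f"Conversation summary so far: {summary}")]
```

Passing the summarizer in as a callable keeps the threshold logic testable without an LLM; in the demo, summarize_messages would play that role.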
Demo 3 real results: 5 turns (Bob in Shenzhen, evaluating WonderBot Pro, 8 developers, annual cost 299×12=3588) stayed at 198 tokens total — below the 250 threshold, so compression was never triggered.
Final verification:
User: Quickly summarize: who am I, what city, and what's the annual API cost?
Agent: You are Bob, from Shenzhen, and the annual API cost for WonderBot Pro
for 8 developers is $3,588.
All 10 messages were retained. The Agent recalled every key fact correctly. Compression is a safety valve, not a per-turn operation — when conversation stays concise, the raw history is more precise than a summary. Recommended threshold: 2000–4000 tokens.
Comparison of Three Patterns
| Pattern | Demo | Result | Key Difference |
|---|---|---|---|
| Short-term | Demo 1 | Infrastructure correct; GLM-4-Flash failed to use implicit context | Depends on model reasoning ability |
| Long-term | Demo 2 | Directly called get_weather(Shanghai), zero clarifying questions | Explicit KV injection, no inference needed |
| Compression | Demo 3 | Never triggered (198 tokens, below the 250 threshold); all key facts recalled from raw history | Safety valve, triggered on demand |
Core conclusion:
- Need reliable cross-turn context → use long-term memory explicit injection, not just MemorySaver
- MemorySaver's value is session isolation (separate thread_ids don't bleed into each other) and automatic history passing — it does not solve "can the model reason about implicit context"
- Compression is a production necessity, but don't set the threshold too low
Design Checklist
Short-term Memory (MemorySaver)
- [ ] Assign a distinct thread_id per user / per conversation
- [ ] Use user ID for thread_id, not a random string (keeps multi-turn conversations continuous)
- [ ] Don't rely on MemorySaver to solve "model fails to infer implicit information"
- [ ] Production: use a persistent checkpointer (SqliteSaver, PostgresSaver), not MemorySaver
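The last item matters because an in-memory store loses every thread when the process restarts. The sketch below shows the idea with Python's built-in sqlite3 module; SqliteThreadStore is a hypothetical toy class and does not reflect the API or schema of LangGraph's SqliteSaver.

```python
import sqlite3


class SqliteThreadStore:
    """Toy persistent history: one row per message, keyed by thread_id."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "thread_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL)"
        )
        self._conn.commit()

    def append(self, thread_id: str, role: str, content: str) -> None:
        self._conn.execute(
            "INSERT INTO messages (thread_id, role, content) VALUES (?, ?, ?)",
            (thread_id, role, content),
        )
        self._conn.commit()

    def history(self, thread_id: str) -> list[tuple[str, str]]:
        # ORDER BY id preserves the order in which messages were appended
        rows = self._conn.execute(
            "SELECT role, content FROM messages WHERE thread_id = ? ORDER BY id",
            (thread_id,),
        )
        return [(role, content) for role, content in rows]

    def close(self) -> None:
        self._conn.close()
```

Opening a second SqliteThreadStore on the same file after closing the first returns the same history, which is the property a persistent checkpointer provides and MemorySaver does not.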
Long-term Memory
- [ ] Use an LLM to extract facts — don't hand-write parsing rules
- [ ] Inject facts in explicit KV format into the system prompt; explicit beats implicit
- [ ] Define an update policy: overwrite stale facts instead of appending indefinitely
- [ ] Production: structured DB for factual memory, vector store for semantic memory
History Compression
- [ ] Threshold: 2000–4000 tokens (too low = frequent compression = precision loss)
- [ ] Summarization prompt must say "preserve specific numbers and names" — LLMs over-abstract otherwise
- [ ] After compression, verify that key facts still appear in the summary
- [ ] Never trigger compression mid-tool-call — it breaks the execution context
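The last two checks can be automated. A simplified sketch, assuming messages are plain dicts with role and optional tool_calls keys (a hypothetical shape, not LangChain's message classes); missing_facts and safe_to_compress are illustrative names.

```python
def missing_facts(summary: str, key_facts: list[str]) -> list[str]:
    """Return the key facts that do not appear (case-insensitively) in the summary."""
    lowered = summary.lower()
    return [fact for fact in key_facts if fact.lower() not in lowered]


def safe_to_compress(messages: list[dict]) -> bool:
    """True unless the last message is an assistant tool request still awaiting its result."""
    if not messages:
        return True
    last = messages[-1]
    return not (last.get("role") == "assistant" and last.get("tool_calls"))


summary = "Bob in Shenzhen is evaluating WonderBot Pro for 8 developers."
print(missing_facts(summary, ["Bob", "Shenzhen", "3588"]))  # ['3588']
```

A non-empty result from missing_facts is a signal to keep the raw history or re-run the summary. The tool-call check only covers the simplest case, a pending request at the end of the list.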
Summary
Five core takeaways:
- MemorySaver is infrastructure, not a silver bullet: it ensures history is passed along, but whether the model can use implicit context depends on model capability
- Explicit injection is more reliable than implicit inference: putting city=shanghai in the system prompt is more stable than hoping the model will deduce it from 10 messages of history
- Use all three layers in combination: short-term (session isolation) + long-term (cross-session personalization) + compression (token guard)
- Fact extraction is the key step for long-term memory: LLM extraction + JSON parsing converts unstructured conversation into structured, injectable facts
- Compression is a safety valve, not a per-turn operation: raw history is more precise when conversation is short; only compress when the threshold is exceeded
Up next: Agent Tool Design — how to design high-quality tools: tool granularity, error handling, retry strategies, and when to use one tool vs. splitting into multiple.