1.1 The Stateless Nature of LLMs
Let's start with a truth that seems counterintuitive when you're chatting with Claude or ChatGPT:
Large Language Models have no memory.
Every single time you send a message, the model starts completely fresh. It has no idea who you are, what you discussed before, or what preferences you have. It's like talking to someone with total amnesia: every conversation begins at zero.
But the conversation feels continuous, right?
The trick: Your entire conversation history is sent with every message.
When you send "Can you explain that differently?", what actually reaches the model is:
[System prompt: You are Claude, made by Anthropic...]
[Message 1: User asked about Python decorators]
[Message 2: Claude explained decorators with examples]
[Message 3: User says "Can you explain that differently?"]
The model reads everything, generates a response, and then forgets everything. The next time you send a message, the whole history is sent again.
This is what we mean by stateless: the model itself stores nothing between calls. All "memory" is an illusion created by passing context back and forth.
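The loop above can be sketched in a few lines. This is a minimal illustration, not a real SDK call: `call_model` is a stand-in for any chat-completion API. The point is that the *client* keeps the history and re-sends all of it every time.

```python
# The "illusion of memory": the client, not the model, stores the
# conversation, and the full history is re-sent on every call.

def call_model(messages):
    # Placeholder for a real LLM API call that would receive `messages`.
    return f"(reply based on {len(messages)} messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def send(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # the ENTIRE history goes out each time
    history.append({"role": "assistant", "content": reply})
    return reply

send("Explain Python decorators.")
send("Can you explain that differently?")
# The second call re-sends all three earlier messages plus the new one.
```

Notice that nothing persists on the "model" side; if you cleared `history`, the assistant would have no idea decorators were ever discussed.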
1.2 Context Windows: The Illusion of Memory
The "context window" is the maximum amount of text a model can process in a single call. Think of it as the model's working memory: everything it can "see" at once.
Context window sizes (as of late 2025):
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Claude Opus 4 | 200K tokens |
| Gemini 1.5 Pro | 2M tokens |
A token is roughly ¾ of a word. So 200K tokens ≈ 150,000 words ≈ a 500-page book.
This sounds huge. So what's the problem?
Three issues:
Issue 1: Cost
Every token you send costs money. If you're building an application with 1,000 users, each sending 10 messages per day, and you're stuffing 50K tokens of history into each call...
1,000 users × 10 messages × 50K tokens = 500M input tokens/day
At Claude's pricing ($3/1M input tokens for Sonnet), that's $1,500/day just on input tokens. And that's before the model generates any output.
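The back-of-envelope math is easy to reproduce (the $3/1M figure is the Sonnet input price quoted above; plug in your own numbers):

```python
# Daily input-token cost for a context-stuffing chat application.
users = 1_000
messages_per_user_per_day = 10
tokens_per_call = 50_000                # history stuffed into each call
price_per_million_input = 3.00          # USD, per the Sonnet pricing above

input_tokens_per_day = users * messages_per_user_per_day * tokens_per_call
daily_cost = input_tokens_per_day / 1_000_000 * price_per_million_input

print(f"{input_tokens_per_day:,} input tokens/day")  # 500,000,000 input tokens/day
print(f"${daily_cost:,.0f}/day")                     # $1,500/day
```

That is $45,000/month before a single output token is generated, which is why trimming context pays for itself quickly.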
Issue 2: Latency
More tokens = slower responses. The model has to process everything you send before generating the first word of its response. With 100K tokens of context, you might wait 5-10 seconds before seeing any output.
Issue 3: The "Lost in the Middle" Problem
Research has shown that LLMs pay more attention to the beginning and end of their context window, and less attention to the middle. If you stuff a 200K context window full of conversation history, the model might miss important details from 3 hours ago that are buried in the middle.
[Beginning - High Attention]
...
[Middle - Lower Attention] ← Important detail about user's project here
...
[End - High Attention]
This is why "just use a bigger context window" isn't a complete solution.
1.3 The Forgetting Problem: What Happens After 128K Tokens?
Let's make this concrete.
Imagine you're building a personal assistant that helps a user over weeks or months. They discuss:
- Their job (software engineer at a fintech startup)
- Their preferences (likes concise answers, hates bullet points)
- Their projects (building a recommendation engine)
- Their schedule (busy Mondays, prefers async communication)
- Hundreds of small details mentioned in passing
After a few weeks of daily use, you have millions of tokens of conversation history.
What do you do?
Option A: Truncate (Delete Old Messages)
Just keep the most recent N messages. Simple, but brutal.
Day 1: User mentions they're allergic to shellfish
Day 2-30: Various conversations
Day 31: User asks for dinner recommendations
Assistant: "How about this great lobster restaurant?" 💀
The model forgot because you deleted the context where the allergy was mentioned.
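Option A as code is a one-liner sliding window, which is exactly why it's so tempting and so dangerous. A minimal sketch:

```python
# Truncation: keep only the last N messages. Anything older than the
# window -- like the Day-1 allergy mention -- is silently lost.

def truncate(history, max_messages=4):
    return history[-max_messages:]

history = [
    {"role": "user", "content": "I'm allergic to shellfish."},   # Day 1
    {"role": "user", "content": "chat..."},                      # Days 2-30
    {"role": "user", "content": "more chat..."},
    {"role": "user", "content": "even more chat..."},
    {"role": "user", "content": "Any dinner recommendations?"},  # Day 31
]

window = truncate(history)
# The allergy message is no longer in what the model sees:
assert all("shellfish" not in m["content"] for m in window)
```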
Option B: Summarize
Periodically compress old conversations into summaries.
Original (5000 tokens):
- Long conversation about user's job search
- Details about companies they applied to
- Specific concerns about salary negotiation
Summary (200 tokens):
"User is job searching in tech, has applied to several companies,
concerned about salary negotiation."
Better, but you lose nuance. Which companies? What were the specific concerns? Summaries are lossy compression.
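Option B can be sketched as folding everything older than the last few messages into a single summary message. Here `summarize` is a stub; a real system would call an LLM to write the summary, and would lose nuance exactly as described above.

```python
# Summarization: compress old messages into one summary message,
# keeping only the most recent turns verbatim.

def compress(history, keep_recent=2, summarize=None):
    # Stub summarizer; a real implementation would call an LLM here.
    summarize = summarize or (
        lambda msgs: f"{len(msgs)} earlier messages (details lost)"
    )
    old, recent = history[:-keep_recent], history[-keep_recent:]
    if not old:
        return history
    summary = {"role": "system", "content": "Summary: " + summarize(old)}
    return [summary] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
compact = compress(history)
# 10 messages become 3: one summary plus the 2 most recent.
```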
Option C: Extract and Store
Pull out key facts and store them separately:
Facts extracted:
- User works at: TechCorp (software engineer)
- User preference: concise answers
- User allergy: shellfish
- User project: recommendation engine
This is the foundation of what memory systems like mem0 do. But now you need a system to:
- Decide what's worth extracting
- Store it somewhere
- Retrieve relevant facts for each new conversation
- Handle conflicts (user changed jobs, old fact is now wrong)
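A toy version of the extract-and-store pattern makes the four requirements concrete. Facts are keyed, so a newer value overwrites an older one, which is the simplest possible answer to the conflict problem. Real systems like mem0 layer LLM-driven extraction and semantic retrieval on top of this idea; this sketch hard-codes both.

```python
# A keyed fact store: extraction and retrieval are simplified to
# explicit keys; conflict handling is "last write wins".

class FactStore:
    def __init__(self):
        self.facts = {}

    def store(self, key, value):
        self.facts[key] = value  # last write wins: naive conflict handling

    def retrieve(self, keys):
        # A real system would use semantic search, not exact key lookup.
        return {k: v for k, v in self.facts.items() if k in keys}

memory = FactStore()
memory.store("employer", "TechCorp")
memory.store("allergy", "shellfish")
memory.store("employer", "a fintech startup")  # job change replaces the old fact

memory.retrieve({"employer", "allergy"})
# → {'employer': 'a fintech startup', 'allergy': 'shellfish'}
```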
This is the memory problem. And it's why a whole category of tools exists to solve it.
1.4 Human Memory vs Machine Memory: A Conceptual Framework
To build good AI memory systems, it helps to understand how human memory actually works. Not because we should copy it exactly, but because it reveals what kinds of memory matter.
Human Memory Types
Sensory Memory (milliseconds)
Raw input from the senses. Mostly irrelevant for AI; the closest analogue is the raw stream of tokens before they're processed.
Short-Term / Working Memory (seconds to minutes)
What you're actively thinking about right now. Limited capacity — humans can hold about 7±2 items.
For LLMs: This is the context window. What the model can "see" in a single call.
Long-Term Memory — This is where it gets interesting:
| Type | What It Stores | Human Example | AI Equivalent |
|---|---|---|---|
| Episodic | Specific events | "Last Tuesday's meeting" | Conversation logs |
| Semantic | Facts & knowledge | "Paris is in France" | Extracted facts, knowledge bases |
| Procedural | How to do things | Riding a bike | Fine-tuned behaviors, tool usage patterns |
The Key Insight
Humans don't remember everything. We:
- Consolidate — Important things move from short-term to long-term
- Forget — Unimportant things decay
- Reconstruct — We don't replay memories perfectly; we rebuild them from fragments
- Associate — Memories connect to each other (one memory triggers another)
Good AI memory systems need similar properties:
- Not everything should be stored (selective extraction)
- Old irrelevant memories should fade (decay/relevance scoring)
- Retrieval should be associative, not just keyword-based (semantic search)
- Memory should be reconstructible from fragments (summarization + facts)
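One common way to implement the "fade" property is exponential decay with a half-life: a memory's relevance halves every `half_life_days` since it was last accessed. The 30-day half-life below is an illustrative choice, not a standard.

```python
# Decay scoring: relevance halves every half_life_days of inactivity.

def decayed_score(base_score, days_since_access, half_life_days=30):
    return base_score * 0.5 ** (days_since_access / half_life_days)

decayed_score(1.0, 0)    # 1.0   -- accessed today
decayed_score(1.0, 30)   # 0.5   -- one half-life old
decayed_score(1.0, 90)   # 0.125 -- three half-lives old
```

Combined with a relevance threshold at retrieval time, this gives you forgetting without ever deleting data.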
The Gap
Here's what current LLM products (Claude's memory, ChatGPT's memory) give you:
User Preferences ✓ (semantic memory)
Key Facts ✓ (semantic memory)
Conversation Recall ✓ (episodic memory, limited)
Here's what they don't handle well:
Multi-agent shared memory ✗
Memory scoping (who knows what) ✗
Memory validation (is this fact still true?) ✗
Procedural memory for agents ✗
Memory across applications ✗
This gap is exactly where developer-facing memory tools (mem0, Supermemory, Aegis Memory) come in.
Module 1 Summary
| Concept | Key Takeaway |
|---|---|
| Stateless LLMs | Models remember nothing; context is re-sent every call |
| Context Windows | Limited size, costly, slow, attention problems |
| The Forgetting Problem | Can't keep everything; need selective storage & retrieval |
| Memory Types | Episodic (events), Semantic (facts), Procedural (skills) |
| The Gap | Product memory ≠ Agent/Developer memory |
What's Next
In Part 2, we'll answer a question that trips up most developers:
What's the difference between episodic and semantic memory, and why does it matter for your agent?
We'll build a complete taxonomy mapping human memory research to AI implementation.
You'll learn why LLMs can fake most memory types but struggle with one critical category —
the same one that multi-agent systems need most.