1.1 The Stateless Nature of LLMs
Let's start with a truth that seems counterintuitive when you're chatting with Claude or ChatGPT:
Large Language Models have no memory.
Every single time you send a message, the model starts completely fresh. It has no idea who you are, what you discussed before, or what preferences you have. It's like talking to someone with total amnesia: every conversation begins at zero.
But the conversation feels continuous, right?
The trick: Your entire conversation history is sent with every message.
When you send "Can you explain that differently?", what actually reaches the model is:
[System prompt: You are Claude, made by Anthropic...]
[Message 1: User asked about Python decorators]
[Message 2: Claude explained decorators with examples]
[Message 3: User says "Can you explain that differently?"]
The model reads everything, generates a response, and then forgets everything. The next time you send a message, the whole history is sent again.
This is what we mean by stateless: the model itself stores nothing between calls. All "memory" is an illusion created by passing context back and forth.
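The loop above can be sketched in a few lines. This is a minimal illustration, not a real SDK call: `call_model` is a stand-in for any chat-completion API. The point is that the *client* keeps the history and re-sends all of it every time.

```python
# The "illusion of memory": the client, not the model, stores the
# conversation, and the full history is re-sent on every call.

def call_model(messages):
    # Placeholder for a real LLM API call that would receive `messages`.
    return f"(reply based on {len(messages)} messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def send(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # the ENTIRE history goes out each time
    history.append({"role": "assistant", "content": reply})
    return reply

send("Explain Python decorators.")
send("Can you explain that differently?")
# The second call re-sends all three earlier messages plus the new one.
```

Notice that nothing persists on the "model" side; if you cleared `history`, the assistant would have no idea decorators were ever discussed.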
1.2 Context Windows: The Illusion of Memory
The "context window" is the maximum amount of text a model can process in a single call. Think of it as the model's working memory: everything it can "see" at once.
Context window sizes (as of late 2025):
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Claude Opus 4 | 200K tokens |
| Gemini 1.5 Pro | 2M tokens |
A token is roughly ¾ of a word. So 200K tokens ≈ 150,000 words ≈ a 500-page book.
This sounds huge. So what's the problem?
Three issues:
Issue 1: Cost
Every token you send costs money. If you're building an application with 1,000 users, each sending 10 messages per day, and you're stuffing 50K tokens of history into each call...
1,000 users × 10 messages × 50K tokens = 500M input tokens/day
At Claude's pricing ($3/1M input tokens for Sonnet), that's $1,500/day just on input tokens. And that's before the model generates any output.
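The back-of-envelope math is easy to reproduce (the $3/1M figure is the Sonnet input price quoted above; plug in your own numbers):

```python
# Daily input-token cost for a context-stuffing chat application.
users = 1_000
messages_per_user_per_day = 10
tokens_per_call = 50_000                # history stuffed into each call
price_per_million_input = 3.00          # USD, per the Sonnet pricing above

input_tokens_per_day = users * messages_per_user_per_day * tokens_per_call
daily_cost = input_tokens_per_day / 1_000_000 * price_per_million_input

print(f"{input_tokens_per_day:,} input tokens/day")  # 500,000,000 input tokens/day
print(f"${daily_cost:,.0f}/day")                     # $1,500/day
```

That is $45,000/month before a single output token is generated, which is why trimming context pays for itself quickly.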
Issue 2: Latency
More tokens = slower responses. The model has to process everything you send before generating the first word of its response. With 100K tokens of context, you might wait 5-10 seconds before seeing any output.
Issue 3: The "Lost in the Middle" Problem
Research has shown that LLMs pay more attention to the beginning and end of their context window, and less attention to the middle. If you stuff a 200K context window full of conversation history, the model might miss important details from 3 hours ago that are buried in the middle.
[Beginning - High Attention]
...
[Middle - Lower Attention] ← Important detail about user's project here
...
[End - High Attention]
This is why "just use a bigger context window" isn't a complete solution.
1.3 The Forgetting Problem: What Happens After 128K Tokens?
Let's make this concrete.
Imagine you're building a personal assistant that helps a user over weeks or months. They discuss:
- Their job (software engineer at a fintech startup)
- Their preferences (likes concise answers, hates bullet points)
- Their projects (building a recommendation engine)
- Their schedule (busy Mondays, prefers async communication)
- Hundreds of small details mentioned in passing
After a few weeks of daily use, you have millions of tokens of conversation history.
What do you do?
Option A: Truncate (Delete Old Messages)
Just keep the most recent N messages. Simple, but brutal.
Day 1: User mentions they're allergic to shellfish
Day 2-30: Various conversations
Day 31: User asks for dinner recommendations
Assistant: "How about this great lobster restaurant?" 💀
The model forgot because you deleted the context where the allergy was mentioned.
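Option A as code is a one-liner sliding window, which is exactly why it's so tempting and so dangerous. A minimal sketch:

```python
# Truncation: keep only the last N messages. Anything older than the
# window -- like the Day-1 allergy mention -- is silently lost.

def truncate(history, max_messages=4):
    return history[-max_messages:]

history = [
    {"role": "user", "content": "I'm allergic to shellfish."},   # Day 1
    {"role": "user", "content": "chat..."},                      # Days 2-30
    {"role": "user", "content": "more chat..."},
    {"role": "user", "content": "even more chat..."},
    {"role": "user", "content": "Any dinner recommendations?"},  # Day 31
]

window = truncate(history)
# The allergy message is no longer in what the model sees:
assert all("shellfish" not in m["content"] for m in window)
```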
Option B: Summarize
Periodically compress old conversations into summaries.
Original (5000 tokens):
- Long conversation about user's job search
- Details about companies they applied to
- Specific concerns about salary negotiation
Summary (200 tokens):
"User is job searching in tech, has applied to several companies,
concerned about salary negotiation."
Better, but you lose nuance. Which companies? What were the specific concerns? Summaries are lossy compression.
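Option B can be sketched as folding everything older than the last few messages into a single summary message. Here `summarize` is a stub; a real system would call an LLM to write the summary, and would lose nuance exactly as described above.

```python
# Summarization: compress old messages into one summary message,
# keeping only the most recent turns verbatim.

def compress(history, keep_recent=2, summarize=None):
    # Stub summarizer; a real implementation would call an LLM here.
    summarize = summarize or (
        lambda msgs: f"{len(msgs)} earlier messages (details lost)"
    )
    old, recent = history[:-keep_recent], history[-keep_recent:]
    if not old:
        return history
    summary = {"role": "system", "content": "Summary: " + summarize(old)}
    return [summary] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
compact = compress(history)
# 10 messages become 3: one summary plus the 2 most recent.
```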
Option C: Extract and Store
Pull out key facts and store them separately:
Facts extracted:
- User works at: TechCorp (software engineer)
- User preference: concise answers
- User allergy: shellfish
- User project: recommendation engine
This is the foundation of what memory systems like mem0 do. But now you need a system to:
- Decide what's worth extracting
- Store it somewhere
- Retrieve relevant facts for each new conversation
- Handle conflicts (user changed jobs, old fact is now wrong)
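A toy version of the extract-and-store pattern makes the four requirements concrete. Facts are keyed, so a newer value overwrites an older one, which is the simplest possible answer to the conflict problem. Real systems like mem0 layer LLM-driven extraction and semantic retrieval on top of this idea; this sketch hard-codes both.

```python
# A keyed fact store: extraction and retrieval are simplified to
# explicit keys; conflict handling is "last write wins".

class FactStore:
    def __init__(self):
        self.facts = {}

    def store(self, key, value):
        self.facts[key] = value  # last write wins: naive conflict handling

    def retrieve(self, keys):
        # A real system would use semantic search, not exact key lookup.
        return {k: v for k, v in self.facts.items() if k in keys}

memory = FactStore()
memory.store("employer", "TechCorp")
memory.store("allergy", "shellfish")
memory.store("employer", "a fintech startup")  # job change replaces the old fact

memory.retrieve({"employer", "allergy"})
# → {'employer': 'a fintech startup', 'allergy': 'shellfish'}
```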
This is the memory problem. And it's why a whole category of tools exists to solve it.
1.4 Human Memory vs Machine Memory: A Conceptual Framework
To build good AI memory systems, it helps to understand how human memory actually works. Not because we should copy it exactly, but because it reveals what kinds of memory matter.
Human Memory Types
Sensory Memory (milliseconds)
Raw input from the senses. Mostly irrelevant for AI; the closest analogue is the raw stream of tokens before they're processed.
Short-Term / Working Memory (seconds to minutes)
What you're actively thinking about right now. Limited capacity — humans can hold about 7±2 items.
For LLMs: This is the context window. What the model can "see" in a single call.
Long-Term Memory — This is where it gets interesting:
| Type | What It Stores | Human Example | AI Equivalent |
|---|---|---|---|
| Episodic | Specific events | "Last Tuesday's meeting" | Conversation logs |
| Semantic | Facts & knowledge | "Paris is in France" | Extracted facts, knowledge bases |
| Procedural | How to do things | Riding a bike | Fine-tuned behaviors, tool usage patterns |
The Key Insight
Humans don't remember everything. We:
- Consolidate — Important things move from short-term to long-term
- Forget — Unimportant things decay
- Reconstruct — We don't replay memories perfectly; we rebuild them from fragments
- Associate — Memories connect to each other (one memory triggers another)
Good AI memory systems need similar properties:
- Not everything should be stored (selective extraction)
- Old irrelevant memories should fade (decay/relevance scoring)
- Retrieval should be associative, not just keyword-based (semantic search)
- Memory should be reconstructible from fragments (summarization + facts)
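One common way to implement the "fade" property is exponential decay with a half-life: a memory's relevance halves every `half_life_days` since it was last accessed. The 30-day half-life below is an illustrative choice, not a standard.

```python
# Decay scoring: relevance halves every half_life_days of inactivity.

def decayed_score(base_score, days_since_access, half_life_days=30):
    return base_score * 0.5 ** (days_since_access / half_life_days)

decayed_score(1.0, 0)    # 1.0   -- accessed today
decayed_score(1.0, 30)   # 0.5   -- one half-life old
decayed_score(1.0, 90)   # 0.125 -- three half-lives old
```

Combined with a relevance threshold at retrieval time, this gives you forgetting without ever deleting data.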
The Gap
Here's what current LLM products (Claude's memory, ChatGPT's memory) give you:
User Preferences ✓ (semantic memory)
Key Facts ✓ (semantic memory)
Conversation Recall ✓ (episodic memory, limited)
Here's what they don't handle well:
Multi-agent shared memory ✗
Memory scoping (who knows what) ✗
Memory validation (is this fact still true?) ✗
Procedural memory for agents ✗
Memory across applications ✗
This gap is exactly where developer-facing memory tools (mem0, Supermemory, Aegis Memory) come in.
Module 1 Summary
| Concept | Key Takeaway |
|---|---|
| Stateless LLMs | Models remember nothing; context is re-sent every call |
| Context Windows | Limited size, costly, slow, attention problems |
| The Forgetting Problem | Can't keep everything; need selective storage & retrieval |
| Memory Types | Episodic (events), Semantic (facts), Procedural (skills) |
| The Gap | Product memory ≠ Agent/Developer memory |
What's Next
In Part 2, we'll answer a question that trips up most developers:
What's the difference between episodic and semantic memory, and why does it matter for your agent?
We'll build a complete taxonomy mapping human memory research to AI implementation.
You'll learn why LLMs can fake most memory types but struggle with one critical category —
the same one that multi-agent systems need most.