Your agent doesn't have a memory problem. It has seven of them, and most teams have built two.
Start from the thing everyone skips past: the model itself remembers nothing. At the API boundary an LLM is a stateless function. The output depends only on the input you send (plus whatever sampling randomness you allow), and no state is carried between calls. Whatever feels like memory when you talk to ChatGPT is a layer wrapped around the model, re-sending the relevant history on every single request. The model is not remembering your last message. Something else is handing it back to the model, every turn, and paying for the tokens each time.
That layer is where almost all the engineering lives, and almost all of it collapses into two patterns: a conversation history that keeps growing until you truncate it, and a vector database you call RAG. Those are two of seven distinct things an agent can remember. The catch is that they're the two that don't make the agent any smarter over time. The type that does, the one that turns yesterday's mistakes into tomorrow's rules, is the least-built component in the entire stack.
This is part one of three. Here I'll lay out the seven types and argue about which ones actually earn the name. The taxonomy isn't mine: it comes from cognitive science by way of the CoALA paper (Sumers et al., Princeton 2023) and Tulving's 1972 split between episodic and semantic memory. What follows is the engineer's version, with opinions about what to build and what to ignore.
The only axis that matters: how long it lives
Forget the seven labels for a second. There's one organizing question, and it's temporal. Does this memory live inside the context window for a single turn, or does it persist outside the model across sessions?
Short-term is the context window. It's fast, it's right there, and it evaporates when the session ends. Long-term is everything you write to an external store and read back later. Two of the seven straddle the middle: their store is long-lived, but what they hand the model is used for exactly one turn and then thrown away. One type doesn't sit in any store at all; it's frozen into the model's weights.
Here's the full set, then I'll argue about it.
| # | Type | Lives | In one line |
|---|---|---|---|
| 1 | Working | Short | Everything in the context window right now |
| 2 | Semantic | Long | Facts, preferences, domain knowledge: the know-what |
| 3 | Episodic | Long | Logged past events: what worked, what failed |
| 4 | Procedural | Long | Skills, workflows, tool patterns: the know-how |
| 5 | Retrieval | Both | Knowledge pulled in by similarity search (RAG) |
| 6 | Parametric | Long | Knowledge baked into the model's weights |
| 7 | Prospective | Both | Future intentions, scheduled to fire later |
Read past the table and you'll notice two of these don't behave like memory at all. That's the interesting part.
1. Working memory: where you pay the bill
Working memory is the context window: system prompt, conversation history, tool outputs, retrieved chunks, and the model's own running reasoning. It's the only memory the model can directly see. Every other type on this list exists to load itself into here at the right moment.
Three facts about it decide your whole architecture. It's bounded, so you evict or summarize at the limit. It vanishes at session end. And you re-pay for it every turn, because the entire window is re-sent on each call.
That last one is where the money goes, and it's just math. A 50-turn conversation that keeps full history re-sends turn 1 fifty times. The early context isn't expensive once; it's expensive once per remaining turn. This is why "just use a bigger context window" is a credit card, not a solution. It works until the bill arrives, and the cumulative bill grows roughly with the square of the number of turns.
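To put rough numbers on that, here is a small sketch. It assumes every turn adds the same number of tokens and ignores the system prompt; the 200-token figure is made up for illustration, and `total_tokens_sent` is a hypothetical helper, not part of any library.

```python
def total_tokens_sent(turns: int, tokens_per_turn: int) -> int:
    """Cumulative input tokens when the full history is re-sent each turn.

    On turn t the window holds t turns of history, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


# Hypothetical figure: every turn adds 200 tokens to the history.
for turns in (10, 50, 100):
    print(turns, total_tokens_sent(turns, 200))
# 10 -> 11000, 50 -> 255000, 100 -> 1010000
```

Going from 50 turns to 100 doubles the conversation and roughly quadruples the tokens you pay for.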
There's no magic in the code, just a list that goes back in full every turn:
# The whole window is re-sent on every turn. That is the bill.
history.append(user_msg)
reply = llm.invoke([system_prompt, *history])
history.append(reply)
Frameworks paper over this with persisted state. A LangGraph checkpointer, for example, saves the thread so the history survives a restart. Persistence is not the same as paying less, though: the window still ships in full on every call. What you actually control is what goes into that list, which is why eviction and summarization are the real levers.
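A minimal sketch of the eviction lever, under loud assumptions: `count_tokens` is a crude word count standing in for a real tokenizer, and `fit_to_budget` is a hypothetical helper that keeps the system prompt plus the newest messages that fit. A summarizing version would replace the dropped prefix with a short summary instead of discarding it.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def fit_to_budget(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the newest messages that fit in the budget."""
    remaining = budget - count_tokens(system_prompt)
    kept: list[str] = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return [system_prompt, *kept]


history = [
    "first question here",
    "a long early answer " * 5,
    "latest question",
    "latest answer",
]
window = fit_to_budget("You are a claims assistant.", history, budget=12)
print(window)
# ['You are a claims assistant.', 'latest question', 'latest answer']
```

Dropping the oldest turns first is the simplest policy, not necessarily the right one. What it buys you is a hard ceiling on what ships per call.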
Working memory is the convergence point for everything else here. Get it wrong and no clever retrieval scheme downstream will save you.
2, 3, 4: the three long-term stores worth separating
These three are the heart of the system, and the useful move is to keep them distinct, because they answer different questions.
Semantic memory is the know-what: stable facts and preferences, decoupled from when you learned them. A user is on the Enterprise plan and prefers email over phone. A client is an NRI with a moderate risk profile who won't touch structured products. You apply that in every conversation without re-deriving it. You build it from a structured store for the clean fields plus a vector store for the fuzzy recall, and you update profiles incrementally.
The failure mode here isn't retrieval. It's truth. At one user, stale facts are an annoyance. At enterprise scale, with multiple agents writing to the same profile, the hard problem is source-of-truth conflict: two agents hold contradictory versions of the same fact and nothing arbitrates. That's a data-governance problem wearing a memory costume, and it's the thing that actually bites in production.
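One way to make arbitration explicit is to store provenance with every fact and decide conflicts with a written rule. The sketch below is an assumption, not a standard: the `Fact` fields, the `SOURCE_PRIORITY` ranking, and the "trusted source first, newer write second" policy are all hypothetical, and the dates are invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str              # e.g. "contact_preference"
    value: str
    source: str           # which agent or system wrote it
    observed_at: datetime


# Hypothetical trust ranking: higher wins when two sources disagree.
SOURCE_PRIORITY = {"crm_sync": 2, "support_agent": 1, "sales_agent": 1}


def arbitrate(current: Optional[Fact], incoming: Fact) -> Fact:
    """Pick the fact to keep: the more trusted source wins, then the newer write."""
    if current is None:
        return incoming
    current_rank = (SOURCE_PRIORITY.get(current.source, 0), current.observed_at)
    incoming_rank = (SOURCE_PRIORITY.get(incoming.source, 0), incoming.observed_at)
    return incoming if incoming_rank > current_rank else current


profile: dict[str, Fact] = {}
writes = [
    Fact("contact_preference", "email", "crm_sync", datetime(2025, 1, 10)),
    Fact("contact_preference", "phone", "sales_agent", datetime(2025, 3, 2)),
]
for fact in writes:
    profile[fact.key] = arbitrate(profile.get(fact.key), fact)

print(profile["contact_preference"].value)  # email
```

The later write loses here because it came from a less trusted source. Whether that is the right call is a governance decision; the point is that it's written down somewhere instead of being decided by whichever agent wrote last.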
Episodic memory is the log of specific past events: full runs, the decision the agent made, and whether it worked. A fraud agent records each case (the pattern it saw, what it recommended, and whether it turned out to be real fraud or a false positive), then pulls the close matches when a similar signature shows up. This is case-based reasoning. It's how an agent stops repeating the same mistake.
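A sketch of that record-then-recall shape, with stated assumptions: signatures are plain sets of signal names, Jaccard overlap stands in for embedding similarity, and `Episode` and `EpisodeLog` are hypothetical names. The cases are invented.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    signature: frozenset[str]   # signals seen in the case
    recommendation: str         # what the agent advised
    was_fraud: bool             # ground truth learned afterwards


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Set overlap in [0, 1]; a stand-in for embedding similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class EpisodeLog:
    def __init__(self) -> None:
        self._episodes: list[Episode] = []

    def record(self, episode: Episode) -> None:
        self._episodes.append(episode)  # append-only: past cases are never edited

    def find(self, signature: frozenset[str], k: int = 3) -> list[Episode]:
        ranked = sorted(
            self._episodes,
            key=lambda e: jaccard(e.signature, signature),
            reverse=True,
        )
        return ranked[:k]


log = EpisodeLog()
log.record(Episode(frozenset({"new_device", "foreign_ip", "large_amount"}), "block", True))
log.record(Episode(frozenset({"new_device", "small_amount"}), "block", False))
log.record(Episode(frozenset({"known_device", "large_amount"}), "allow", False))

closest = log.find(frozenset({"new_device", "foreign_ip"}), k=1)[0]
print(closest.recommendation, closest.was_fraud)  # block True
```

Storing the outcome next to the decision is the part that matters. Without `was_fraud`, the log can tell you what the agent did but not whether it should do it again.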
Procedural memory is the know-how: the workflows, tool-use patterns, and rules for how to do things. A claims agent's flow is procedural: validate policy, assess the damage photos, check fraud signals, compute the payout band, route for approval above a threshold. Some of this lives in a tuned system prompt acting as the meta-controller, and some lives in an external store of decision rules. For rules, a structured key lookup beats fuzzy search every time. You don't want "approval threshold" retrieved by cosine similarity; you want it looked up exactly.
# Rules are an exact lookup, not a similarity hit.
PAYOUT_BANDS = {"auto": 50_000, "home": 200_000}
threshold = PAYOUT_BANDS[claim.type] # correct every time
# not this, where "close" quietly becomes "wrong":
threshold = vectors.search(f"{claim.type} approval threshold")[0]
The clean split, and the reason to keep them apart: procedural says how, semantic says what the policy is, episodic says what happened. Collapse them into one "memory" blob and you lose the ability to update one without corrupting the others.
The one nobody builds
Here's the part worth circling. The link between episodic and semantic memory is where learning actually happens. An agent hits the same situation a dozen times, you abstract the pattern into a rule, and you promote that rule from the episode log into semantic memory. After that, the agent doesn't reason from twelve analogies; it applies one fact.
The loop itself is small to write, which is what makes skipping it so telling:
similar = episodes.find(case.signature, k=20)
if len(similar) >= 12 and they_agree(similar):
    rule = abstract(similar)  # "auto claims with signal Y are usually false positives"
    semantic.upsert(rule)     # promote the pattern into a durable fact
    episodes.mark_consolidated(similar)
You don't have to hand-roll all of it. LangMem and Mem0 both extract and consolidate memories from past runs instead of leaving you a raw log, and Letta (formerly MemGPT) lets the agent rewrite its own memory between turns. The tooling is catching up to the idea. Most teams still haven't wired it in.
That loop, consolidation, is the most valuable stage in the whole system and the least implemented. Almost everyone logs episodes. Almost nobody closes the loop back into durable rules. It's the difference between an agent with a diary and an agent that learns, and it's the subject of part two.
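For something you can actually run, here is one possible filling-in of the loop above. Everything beyond the original five lines is an assumption: exact signature matching replaces similarity search, "agreement" means at least 90% of cases share an outcome, the semantic store is a plain dict, and `consolidate` is a hypothetical name.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Episode:
    signature: str            # e.g. "auto:signal_Y"
    outcome: str              # e.g. "false_positive" or "fraud"
    consolidated: bool = False


MIN_CASES = 12        # same threshold as the loop above
MIN_AGREEMENT = 0.9   # hypothetical: share of cases that must share an outcome


def consolidate(episodes: list[Episode], signature: str, semantic: dict[str, str]) -> bool:
    """Promote a repeated outcome for one signature into the semantic store."""
    similar = [e for e in episodes if e.signature == signature and not e.consolidated]
    if len(similar) < MIN_CASES:
        return False
    outcome, count = Counter(e.outcome for e in similar).most_common(1)[0]
    if count / len(similar) < MIN_AGREEMENT:
        return False  # the cases disagree, so there is no rule to promote yet
    semantic[signature] = outcome   # upsert the durable rule
    for e in similar:
        e.consolidated = True       # do not count these cases twice
    return True


episodes = [Episode("auto:signal_Y", "false_positive") for _ in range(12)]
episodes.append(Episode("auto:signal_Y", "fraud"))  # one dissenting case
semantic: dict[str, str] = {}

promoted = consolidate(episodes, "auto:signal_Y", semantic)
print(promoted, semantic)  # True {'auto:signal_Y': 'false_positive'}
```

Twelve of thirteen cases agree, which clears the 90% bar, so the pattern is promoted. The agreement threshold is the knob to be careful with: set it too low and one noisy batch becomes policy.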
5. Retrieval is not a memory type
This is where I'll take the unpopular position. RAG isn't a kind of memory. It's a delivery mechanism.
The mechanics are familiar: embed the query, run a similarity search over a vector store, inject the top-k chunks into the context, and let the model answer grounded in them. A compliance bot keeps RBI and SEBI rules in a vector store and pulls only the few passages a KYC question needs. Useful, yes. But notice what's stored there: semantic facts and episodic events. Retrieval is how those get from the store into working memory. It's the pipe, not the water.
This matters because conflating the two is exactly why so many agents are "just a vector DB" and stop there. Once you see retrieval as plumbing, the real questions surface: what's worth storing, how do you keep it consistent, and when is similarity the wrong access pattern entirely? It usually is for rules and exact lookups, where similarity will happily hand you the wrong threshold because it reads close to the right one.
And while we're here: similarity is not relevance. The chunk that scores highest against your query embedding is the one that's closest in vector space, which is not the same as the one that answers the question. The gap between those two is where most "the RAG is hallucinating" bug reports actually come from.
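Here is the whole pipe in miniature, and the gap along with it. This is a toy under stated assumptions: `embed` is a bag-of-words count rather than a real embedding model, and the two chunks are invented to mirror the payout thresholds used earlier. Real embeddings are far better than word counts, but the failure has the same shape.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. A real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "approval threshold for home claims is 200000",
    "auto payouts above 50000 are routed to a manager",
]
query = "approval threshold for auto claims"

top = retrieve(query, chunks, k=1)
# Inject the top-k chunks into the context the model will see.
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"
print(top[0])  # approval threshold for home claims is 200000
```

The home-claims chunk wins because it shares four words with the query. The auto chunk, which actually answers the question, shares one. The model then answers faithfully from the wrong context, and the bug report says "hallucination".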
6. Parametric memory: the knowledge the model is
Parametric memory is everything baked into the weights at training time: grammar, arithmetic, broad world knowledge, the fact that Paris is the capital of France. The model doesn't consult this. It is this. No retrieval step, always available, instant.
Which is exactly why it's a trap for anything that changes. The weights are frozen between training runs. They're opaque, they can be confidently wrong, and they know nothing past the cutoff. So the design boundary is sharp: general reasoning and language go to parametric, and anything volatile, proprietary, or recent goes to an external store.
Concretely: knowing what "loan-to-value" means is parametric, and you should trust the model for it. The current repo rate, or this specific bank's product rules, must come from a store you control. Bake in the stuff that doesn't move. Retrieve the stuff that does. Confusing the two is how an agent ends up quoting last year's interest rate with total confidence.
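One way to enforce that line in code is to make volatile facts come from a store with a date attached, and to fail loudly when the store is empty or stale. A sketch, assuming hypothetical names (`StoredFact`, `volatile_fact`, `StaleFactError`) and a placeholder value rather than a real rate:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StoredFact:
    value: str
    as_of: date


class StaleFactError(LookupError):
    """Raised when a volatile fact is missing or too old to trust."""


def volatile_fact(store: dict[str, StoredFact], key: str, today: date, max_age_days: int) -> StoredFact:
    """Return a fact from the store you control, or fail loudly.

    There is deliberately no fallback to the model here.
    """
    fact = store.get(key)
    if fact is None:
        raise StaleFactError(f"{key!r} is not in the store")
    if today - fact.as_of > timedelta(days=max_age_days):
        raise StaleFactError(f"{key!r} was last updated on {fact.as_of}")
    return fact


# Placeholder value: in practice this comes from a rates feed you own.
store = {"repo_rate": StoredFact("PLACEHOLDER", date(2025, 6, 1))}

fact = volatile_fact(store, "repo_rate", today=date(2025, 6, 10), max_age_days=30)
context_line = f"repo_rate = {fact.value} (as of {fact.as_of})"
print(context_line)  # repo_rate = PLACEHOLDER (as of 2025-06-01)
```

The missing fallback is the design choice. An error that stops the turn is cheaper than a confident answer built on whatever the weights happened to remember.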
7. Prospective memory: the one you can't fake with a vector DB
The last type is the one that's architecturally different from everything above. Prospective memory is remembering to act later: intentions the agent formed but hasn't run yet. "Send the portfolio review on the 1st of every month." "When this client's SIP fails, alert the relationship manager." "Review this FD maturity in September," decided back in March.
You cannot build this with similarity search, and that's the whole point. There's no query to embed. It's a task queue, a scheduler, and goal trackers that fire on a trigger, either a clock or an event. If your agent needs to do something next Monday, no amount of retrieval tuning gets you there. You need a thing that wakes up on Monday.
# A scheduler, not a vector store.
scheduler.add_job(send_portfolio_review, "cron", day=1, args=[client.id])
scheduler.add_job(review_fd_maturity, "date", run_date="2026-09-01", args=[fd.id])
events.on("sip_failed", lambda e: alert_rm(e.client_id))
APScheduler covers the in-process case, and a persistent job store lets its jobs outlive a restart. Reach for Celery or Temporal when the intention has to survive a crash mid-run or be spread across workers. Either way the trigger is a clock or an event, never a query.
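If you want to see the shape without any dependency, here is a standard-library sketch. It is not how APScheduler or Temporal work internally; `Intentions` is a hypothetical in-memory class, nothing here is persisted, and the caller is assumed to drive `tick` from a real clock.

```python
import heapq
from datetime import datetime
from typing import Callable


class Intentions:
    """Pending intentions that fire on a clock or on a named event."""

    def __init__(self) -> None:
        self._timed: list[tuple[datetime, int, Callable[[], None]]] = []
        self._on_event: dict[str, list[Callable[[dict], None]]] = {}
        self._seq = 0  # tie-breaker so the heap never compares callables

    def at(self, when: datetime, action: Callable[[], None]) -> None:
        heapq.heappush(self._timed, (when, self._seq, action))
        self._seq += 1

    def on(self, event: str, action: Callable[[dict], None]) -> None:
        self._on_event.setdefault(event, []).append(action)

    def tick(self, now: datetime) -> None:
        # Run every timed intention that is due; later ones stay queued.
        while self._timed and self._timed[0][0] <= now:
            _, _, action = heapq.heappop(self._timed)
            action()

    def emit(self, event: str, payload: dict) -> None:
        for action in self._on_event.get(event, []):
            action(payload)


fired: list[str] = []
intentions = Intentions()
intentions.at(datetime(2026, 9, 1), lambda: fired.append("review FD maturity"))
intentions.on("sip_failed", lambda e: fired.append(f"alert RM for {e['client_id']}"))

intentions.tick(datetime(2026, 3, 15))                 # March: nothing is due yet
intentions.emit("sip_failed", {"client_id": "c-42"})   # event trigger fires
intentions.tick(datetime(2026, 9, 1))                  # September: the review fires
print(fired)  # ['alert RM for c-42', 'review FD maturity']
```

Notice there is no query anywhere in it. The March decision sits in a heap until September arrives, which is exactly what a vector store can't do.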
This is the type that separates a chatbot from an agent with a horizon. Skip it and your agent can only ever react to the message in front of it. Build it and the agent can schedule its own future, which is most of what "agentic" actually means.
Where each type lives in real tooling
If you want the shortcut from taxonomy to dependencies, this is roughly where each type lands today.
| Type | What it needs | Tools to reach for |
|---|---|---|
| Working | Persisted thread state, eviction and summary | LangGraph checkpointers, your framework's thread state |
| Semantic | Structured store plus vector recall | Postgres + pgvector/Qdrant/Pinecone; LangMem, Mem0, Zep |
| Episodic | Append-only event log plus a similarity index | Postgres or Mongo + a vector index; Zep, Mem0 |
| Procedural | Exact key lookup plus a tuned system prompt | Redis or Postgres; LangMem for rule and prompt tuning |
| Retrieval | Similarity search over the stores above | pgvector, Qdrant, Weaviate, Pinecone, Chroma; LlamaIndex |
| Parametric | The weights, plus an optional fine-tune | the base model; LoRA/PEFT if you must bake it in |
| Prospective | A scheduler or queue with triggers | APScheduler, Celery, Temporal, cron |
None of these tools is load-bearing on its own. The architecture is the seven boxes and how they hand knowledge to one another. The libraries just fill the boxes.
The bottom line: what to actually build
If you're starting an agent and want to know where to spend, here's the order I'd put it in.
Start with working memory and treat it as a budget, not a convenience. It's the one cost that compounds, so settle your eviction and summarization strategy before you build anything clever on top. Everything else loads in here, and you pay for it per turn.
Keep semantic, episodic, and procedural in separate stores: one for facts, one for events, one for rules and workflows. The day you can update what the agent knows without touching what it did, you've built something you can maintain.
Treat retrieval as plumbing. It serves the three stores above, nothing more. If your memory system is only a vector DB, you've built the pipe and forgotten the water.
Draw the parametric line hard. General reasoning comes from the weights, everything volatile comes from a store you own. Never let a frozen model be your source of truth for anything that has a date on it.
Add prospective memory the day your agent needs a future. A scheduler, not a vector store. It's the cheapest type to describe and the one most agents are missing.
And if you only do one of these well, do the consolidation loop: the promotion of repeated episodes into durable rules. Logging what happened is table stakes. Turning it into something the agent applies next time is the part almost nobody finishes, and it's the whole reason to build memory instead of just a bigger prompt.
That handoff, how the stores read, write, and promote knowledge among one another, is the difference between an agent that answers and one that gets better at its job. Most agents shipping today are a context window and a vector DB with a good README. The seven-type version is more work, and it's the version that learns. How those stores talk to each other is part two.
Top comments (1)
Two things I'd extend the table on, because the loop you flag at the end isn't just "underbuilt" — it's structurally harder than the rest:
The link between episodic and semantic needs lineage in both directions, not just upsert. If consolidation promotes A → rule(v1), and another batch later refines it to rule(v2), anything still pointing at A has to resolve forward to v2 — otherwise the loop creates drift: stale middles become quietly authoritative because nothing pulls them along. Same disease as "stale-as-absent" for episodes, applied to consolidated rules.
The other thing: they_agree(similar) is one path, not n paths. If the twelve similar episodes were all retrieved via the same embedding, the agreement is shared-lineage agreement — not independent. To know the consolidation loop is actually doing what its arity claims, you need to shove a known-bad episode in periodically and watch whether they_agree catches it or absorbs it. The loop needs its own fault-injection harness, not just episode logging.
Looking forward to part two on how the stores actually talk to each other.