Gabriel Anhaia

Building Agent Memory: Episodic vs Semantic Stores


A user tells your agent on Tuesday that they prefer metric units. Friday, different session, the agent quotes a distance in miles. The user files a ticket. You check the logs and the Tuesday conversation is right there, with a clean assistant response saying "got it, metric." But the Friday session opened with a fresh messages[] array and no idea that Tuesday ever happened.

That gap is what people mean when they say "agent memory," and it is almost always misdiagnosed. The team patches it by ramming the entire user history into the system prompt. Cost goes up, latency goes up, and the agent still misses the preference half the time. The relevant turn is buried on page eight of a thirty-page transcript. The fix is to stop treating memory as one thing.

Production agent memory has two layers, and conflating them is the bug. Episodic memory is what happened in this conversation, ten minutes ago, and it lives in a buffer that gets summarised when it grows. Semantic memory is what you know about this user across every conversation they have ever had, and it lives in a store you query when their words ask for it. They have different lifetimes and live on different read and write paths. What follows is the layering, in about 100 lines of Python on Postgres with pgvector.

The two layers, defined narrowly

Episodic memory is per-conversation. It is the windowed transcript of the current session, plus a running summary of older turns once the window is full. The agent reads it on every turn so it knows what was said two messages ago. It is written to constantly, summarised periodically, and discarded when the conversation ends — or kept under a session id if the product needs resumption.

Semantic memory is per-user. It is the set of stable facts about the account: their name, their preferences, their prior orders, the things they have told you that should still be true tomorrow. The agent queries it on each turn (or at session start, depending on the read-budget tradeoff). It is written to rarely, and only when something the user said is worth keeping past the current session. Conflicts are normal. The user changes their mind, contradicts a prior statement, and you need a write path that can update an existing fact instead of stacking duplicates.

Two libraries are worth naming as references, both still maintained at the time of writing: mem0 takes the bolt-on-to-your-existing-agent path, and Letta (from the team behind MemGPT) takes the agent-runtime path where memory is part of the platform. Read their docs first. The 100 lines below borrow the layering, not the dependency.

The schema

Three tables. Episodic is a conversation buffer plus a one-row summary slot per session; semantic is a fact store with embeddings.

create extension if not exists vector;

create table episodic_turn (
    id          bigserial primary key,
    session_id  text not null,
    role        text not null,
    content     text not null,
    created_at  timestamptz default now()
);

create index on episodic_turn (session_id, id);

create table session_summary (
    session_id  text primary key,
    summary     text not null,
    updated_at  timestamptz default now()
);

create table semantic_fact (
    id          bigserial primary key,
    user_id     text not null,
    fact        text not null,
    embedding   vector(1536),
    updated_at  timestamptz default now()
);

create index on semantic_fact
    using ivfflat (embedding vector_cosine_ops);
create index on semantic_fact (user_id);

Two things to notice. The episodic table is keyed by session_id, not user_id. A user can have many sessions; a session has exactly one user. The semantic table is keyed by user_id and carries an embedding so you can pull the top-k relevant facts for a query instead of dumping every fact into the prompt.

Episodic: a buffer with a summariser

The contract: the agent calls load_episodic(session_id) at the start of each turn and gets back the last N raw turns plus a running summary of everything older. After each turn, it calls append_turn(...). When the buffer crosses a threshold, the oldest half gets compressed into the summary slot.

import psycopg
from anthropic import Anthropic

WINDOW = 12
client = Anthropic()
MODEL = "claude-sonnet-4-5"

def append_turn(conn, session_id, role, content):
    conn.execute(
        "insert into episodic_turn "
        "(session_id, role, content) "
        "values (%s, %s, %s)",
        (session_id, role, content),
    )
    conn.commit()

def load_episodic(conn, session_id):
    rows = conn.execute(
        "select role, content from episodic_turn "
        "where session_id = %s "
        "order by id desc limit %s",
        (session_id, WINDOW),
    ).fetchall()
    rows.reverse()
    summary = _maybe_summarise(conn, session_id)
    return summary, rows
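
A quick usage sketch, with a made-up session id:

summary, recent = load_episodic(conn, "sess-123")
# summary is "" until the session crosses WINDOW * 2 turns
# recent holds up to WINDOW (role, content) tuples, oldest first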

The summariser runs only when the table for this session has more than WINDOW * 2 rows. It pulls everything older than the last WINDOW turns, folds in any prior summary, asks the model for a tight third-person summary, upserts it into the session_summary row, and deletes the compressed turns. Production code would keep the deleted rows in cold storage for audit; the version here drops them to keep the example readable.

def _maybe_summarise(conn, session_id):
    prior = conn.execute(
        "select summary from session_summary "
        "where session_id = %s",
        (session_id,),
    ).fetchone()
    prior = prior[0] if prior else ""
    count = conn.execute(
        "select count(*) from episodic_turn "
        "where session_id = %s",
        (session_id,),
    ).fetchone()[0]
    if count <= WINDOW * 2:
        return prior
    old = conn.execute(
        "select role, content from episodic_turn "
        "where session_id = %s order by id "
        "limit %s",
        (session_id, count - WINDOW),
    ).fetchall()
    transcript = "\n".join(
        f"{r}: {c}" for r, c in old
    )
    if prior:
        transcript = "Earlier summary: " + prior + "\n\n" + transcript
    resp = client.messages.create(
        model=MODEL, max_tokens=400,
        messages=[{
            "role": "user",
            "content": (
                "Summarise this conversation in "
                "3 sentences, third person:\n\n"
                + transcript
            ),
        }],
    )
    text = resp.content[0].text
    # persist the summary before deleting turns, or it is lost
    # as soon as the row count drops back under the threshold
    conn.execute(
        "insert into session_summary (session_id, summary) "
        "values (%s, %s) on conflict (session_id) do update "
        "set summary = excluded.summary, updated_at = now()",
        (session_id, text),
    )
    conn.execute(
        "delete from episodic_turn "
        "where session_id = %s and id <= "
        "(select max(id) from "
        "(select id from episodic_turn "
        "where session_id = %s order by id "
        "limit %s) sub)",
        (session_id, session_id, count - WINDOW),
    )
    conn.commit()
    return text

That is the entire episodic layer. A buffer, a window, a summariser that fires when the buffer doubles. No vector search, no clever ranking. Recent turns are usually the relevant ones for episodic, so chronological is the right read pattern in most session shapes.

Semantic: vector store with a write path that updates

Semantic is harder because writes are the interesting part. You do not append every user utterance to the fact table; that gives you a noisy store full of "ok," "thanks," and "what about the other one." You write only when a turn carries a stable fact about the user. The cleanest way to do that is a small classifier prompt that runs after each user turn and returns either a fact or null.

def extract_fact(user_text):
    resp = client.messages.create(
        model=MODEL, max_tokens=120,
        messages=[{
            "role": "user",
            "content": (
                "Extract one stable user "
                "preference or fact from the "
                "message, or reply NONE. "
                "Examples: 'prefers metric units', "
                "'is vegetarian', "
                "'works at Acme Corp'. "
                "Message: " + user_text
            ),
        }],
    )
    out = resp.content[0].text.strip()
    return None if out == "NONE" else out
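
Illustrative calls — the extracted phrasings are what you would hope for, not guaranteed model behaviour:

extract_fact("btw I'm vegetarian, skip the ham")  # -> "is vegetarian"
extract_fact("ok thanks")                         # -> None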

This adds a model call per user turn. In production, batch the extractor or run it async out of the response path so it does not sit on the user's latency budget.
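
A minimal sketch of the out-of-path variant, using a plain thread. conn_factory is a hypothetical callable that opens a fresh connection (the request's connection may be gone by the time the extractor returns), and upsert_fact is the write path defined just below:

import threading

def extract_and_store_async(conn_factory, user_id, user_text):
    # run the extractor and the semantic write off the response path
    def work():
        fact = extract_fact(user_text)
        if fact:
            with conn_factory() as conn:
                upsert_fact(conn, user_id, fact)
    threading.Thread(target=work, daemon=True).start()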

Then the write path. For every extracted fact, embed it, search for a near-duplicate in the user's existing facts, and either update the matching row or insert a new one. This is the conflict-resolution pattern: the user said "metric" on Tuesday and "imperial" on Friday. You do not want both rows. You want the Friday row, with the Tuesday id.

from openai import OpenAI

# Anthropic has no embeddings endpoint; text-embedding-3-small is
# OpenAI's. Any provider works if the dimension matches vector(1536).
openai_client = OpenAI()

def embed(text):
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    # pgvector's text input format, so %s::vector casts bind cleanly
    return "[" + ",".join(map(str, resp.data[0].embedding)) + "]"

DUP_THRESHOLD = 0.30

def upsert_fact(conn, user_id, fact):
    vec = embed(fact)
    row = conn.execute(
        "select id, embedding <=> %s::vector "
        "as dist from semantic_fact "
        "where user_id = %s "
        "order by dist asc limit 1",
        (vec, user_id),
    ).fetchone()
    if row and row[1] < DUP_THRESHOLD:
        conn.execute(
            "update semantic_fact set fact = %s, "
            "embedding = %s::vector, updated_at = now() "
            "where id = %s",
            (fact, vec, row[0]),
        )
    else:
        conn.execute(
            "insert into semantic_fact "
            "(user_id, fact, embedding) "
            "values (%s, %s, %s::vector)",
            (user_id, fact, vec),
        )
    conn.commit()

The 0.30 threshold is cosine distance, not similarity. Under pgvector's <=> operator, a smaller number means closer. You will want to tune this against real data; too low and contradictions stack as separate facts, too high and unrelated facts get clobbered. Log the matched pairs for a week and pick the cutoff from the histogram.
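
One way to read that histogram — a minimal sketch, assuming you log the nearest-neighbour distance from upsert_fact somewhere queryable, with a guessed bucket width:

from collections import Counter

def distance_histogram(dists, width=0.05):
    # bucket logged nearest-neighbour distances; the valley between
    # the "same fact" cluster and the "new fact" cluster is the cutoff
    return Counter(round((d // width) * width, 2) for d in dists)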

The read path is the standard top-k:

TOP_K = 5

def load_semantic(conn, user_id, query):
    vec = embed(query)
    rows = conn.execute(
        "select fact from semantic_fact "
        "where user_id = %s "
        "order by embedding <=> %s::vector "
        "limit %s",
        (user_id, vec, TOP_K),
    ).fetchall()
    return [r[0] for r in rows]

Pass the user's most recent message as query. The top five facts come back ranked by relevance to whatever the user is currently asking about. Five is a starting point; the right number is whatever fits in your prompt budget without crowding the system instructions.

Wiring it into the agent loop

The agent turn looks like this:

def agent_turn(conn, user_id, session_id, user_text):
    summary, recent = load_episodic(conn, session_id)
    facts = load_semantic(conn, user_id, user_text)
    system = (
        "You are an assistant. "
        "Known about user: " + "; ".join(facts) +
        ". Earlier in this chat: " + summary
    )
    messages = [
        {"role": r, "content": c} for r, c in recent
    ]
    messages.append(
        {"role": "user", "content": user_text}
    )
    resp = client.messages.create(
        model=MODEL, max_tokens=1024,
        system=system, messages=messages,
    )
    answer = resp.content[0].text
    append_turn(conn, session_id, "user", user_text)
    append_turn(conn, session_id, "assistant", answer)
    fact = extract_fact(user_text)
    if fact:
        upsert_fact(conn, user_id, fact)
    return answer

A fixed order: load both layers, build the prompt, call the model, append the turn, run the extractor and write to semantic if anything came back. The whole thing is around 100 lines of Python including the schema. Production code adds error handling, an embedding cache (so you do not pay to embed the same query twice), and a backoff for the summariser when it is already running for this session. The layering above is the part that matters.
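
For the embedding cache specifically, even an in-process memoiser covers the repeat-query case; a production version would be Redis keyed on a hash of the text. A sketch:

from functools import lru_cache

@lru_cache(maxsize=4096)
def embed_cached(text):
    # drop-in for embed(): identical text stops costing a second API call
    return embed(text)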

Where teams get this wrong

Three patterns that surface in agent codebases.

The first is writing every turn to the semantic store. The result is a fact table full of conversational fluff that crowds out real preferences when you read top-k. The fact extractor is the gate; if it returns NONE, you write nothing.

The second is using only the semantic store and skipping episodic. The agent forgets what was said two messages ago because embedding search does not privilege recency: it pulls up whatever is closest in vector space, which is often something the user said three weeks ago rather than the turn you actually need.

The third is treating contradictions as new facts. The user said metric on Tuesday and imperial on Friday; the table now has two rows; the read path returns both at the top of the list, and the model picks one at random. The upsert_fact pattern with a duplicate threshold is the fix, and the threshold is the parameter you tune.

Once you have the layering, the rest is refactor work. Swap pgvector for a managed vector DB. Wire in mem0 or Letta for the parts that need a shared platform. None of that is a rewrite.


If this was useful

The AI Agents Pocket Guide covers the patterns this post leans on: the episodic/semantic split, the fact-extraction gate, conflict resolution on writes. It also goes into the read-budget math when you start juggling tools, memory, and a long system prompt at the same time. Worth a look if you are past "make it answer once" and onto "make it remember across sessions."

AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs
