Decay-aware agent memory in one exact Postgres query

#postgres #ai #machinelearning #opensource

Most "agent memory" is just a vector search. You embed what the agent said, store it, and at recall time you do a nearest-neighbor lookup. It works, until you notice that a note from three weeks ago ranks exactly the same as one from three minutes ago. My assistant would confidently resurface a preference I had changed months earlier.

That is not memory. It is a filing cabinet with good search.

I wanted recall to rank by similarity x importance x recency: a fresh, important memory should beat a slightly-more-similar but stale one, and trivial old memories should fade. This post is about the one idea that made that cheap and exact, and it ended up as a small Postgres extension called pgmemai.

The obvious approach, and why it falls short

The naive version is "over-fetch by similarity, then re-rank":

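-- Step 1: over-fetch a candidate pool by cosine distance alone
SELECT *,
       (1 - (embedding <=> :q)) * importance
         * exp(-:lambda * extract(epoch FROM now() - created_at) / 86400.0) AS score  -- age in days
FROM memories
ORDER BY embedding <=> :q      -- nearest by cosine
LIMIT 500;                      -- grab a big candidate pool
-- Step 2: re-sort the 500 rows by score in app code, take top 10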

The problem: the memory that should win on importance and recency is often not in the similarity-top-K at all. So you have to fetch a large candidate pool to even have a chance of seeing it, and you still miss high-importance or recent-but-moderately-similar memories that fell outside the pool. You are fighting your own index.
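
To make the failure concrete, here is a toy Python sketch of the over-fetch pattern next to an exact ranking. The three memories, the pool size of 2, and every function name are made up for illustration, and ages are assumed to be in days.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def full_score(query, mem, lam, now_day):
    """similarity * importance * recency, with ages measured in days."""
    age = now_day - mem["created_day"]
    return cosine(query, mem["embedding"]) * mem["importance"] * math.exp(-lam * age)

def exact_top_k(query, memories, lam, now_day, k):
    """Rank every memory by the full objective (brute force)."""
    ranked = sorted(memories, key=lambda m: full_score(query, m, lam, now_day), reverse=True)
    return [m["id"] for m in ranked[:k]]

def overfetch_then_rerank(query, memories, lam, now_day, pool_size, k):
    """Take the pool_size most similar memories, then re-rank only that pool."""
    pool = sorted(memories, key=lambda m: cosine(query, m["embedding"]), reverse=True)[:pool_size]
    return exact_top_k(query, pool, lam, now_day, k)

# Hypothetical data: two stale, low-importance near-duplicates and one fresh,
# important memory that is only moderately similar to the query.
MEMORIES = [
    {"id": "stale_a", "embedding": (1.0, 0.05), "importance": 0.2, "created_day": 0.0},
    {"id": "stale_b", "embedding": (1.0, 0.10), "importance": 0.2, "created_day": 0.0},
    {"id": "fresh",   "embedding": (1.0, 0.60), "importance": 1.0, "created_day": 99.0},
]
QUERY = (1.0, 0.0)

pool_answer = overfetch_then_rerank(QUERY, MEMORIES, lam=0.05, now_day=100.0, pool_size=2, k=1)
exact_answer = exact_top_k(QUERY, MEMORIES, lam=0.05, now_day=100.0, k=1)
print(pool_answer, exact_answer)  # ['stale_a'] ['fresh']
```

The fresh memory never enters the pool, so no amount of re-ranking can recover it. A bigger pool only moves the cliff.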

The trick: fold the objective into the vector

The score I want is:

score = cos(query, embedding) * importance * exp(-lambda * (now - created_at))

Watch what happens if I bake importance and recency into the stored vector at insert time:

embedding_wd = unit(embedding) * importance * exp(lambda * created_at)

Now take the inner product of a normalized query with that folded vector:

unit(query) . embedding_wd
  = cos(query, embedding) * importance * exp(lambda * created_at)

Compare that to the score I actually want. They differ only by a factor of exp(-lambda * now). And exp(-lambda * now) is the same constant for every row in a given query, so it does not change the top-K ordering. It just scales everything.

Two facts make this hold:

exp(-lambda * now) is a per-query constant, so it drops out of the ranking.
created_at is immutable, so exp(lambda * created_at) is computed once at insert and never needs updating.

So a single plain inner-product nearest-neighbor search over embedding_wd ranks rows by the full similarity x importance x recency objective, exactly. No re-ranking pass. No background job re-scoring rows as time passes. No special time-aware index.
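
If you want to check the algebra numerically, a small Python sketch like the one below should do it. It uses random synthetic rows, plain double-precision floats, and hypothetical helper names (unit, fold, full_score); it is not pgmemai code, and time is assumed to be measured in days.

```python
import math
import random

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fold(embedding, importance, created_day, lam):
    """Insert-time vector: unit(embedding) * importance * exp(lam * created_day)."""
    w = importance * math.exp(lam * created_day)
    return [x * w for x in unit(embedding)]

def full_score(query, embedding, importance, created_day, lam, now_day):
    """The target objective: cosine * importance * exp(-lam * age)."""
    cos = dot(unit(query), unit(embedding))
    return cos * importance * math.exp(-lam * (now_day - created_day))

def top_k(values, k):
    """Indices of the k largest values, best first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

random.seed(0)
LAM, NOW_DAY, DIM = 0.05, 365.0, 8
# Each row: (embedding, importance, created_day)
rows = [
    ([random.gauss(0, 1) for _ in range(DIM)], random.uniform(0.1, 1.0), random.uniform(0.0, NOW_DAY))
    for _ in range(200)
]
query = [random.gauss(0, 1) for _ in range(DIM)]

q = unit(query)
folded_ip = [dot(q, fold(e, imp, day, LAM)) for e, imp, day in rows]
scores = [full_score(query, e, imp, day, LAM, NOW_DAY) for e, imp, day in rows]

# Same top-10 either way, and the two differ only by the per-query constant.
same_top_10 = top_k(folded_ip, 10) == top_k(scores, 10)
constant = math.exp(-LAM * NOW_DAY)
max_gap = max(abs(ip * constant - s) for ip, s in zip(folded_ip, scores))
print(same_top_10, max_gap)
```

The ranking by folded inner product needs no knowledge of NOW_DAY at all, which is the point: the current time only enters through a constant that never has to be computed.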

What it looks like in Postgres

It is built on pgvector. A BEFORE INSERT trigger computes the folded vector:

-- inside a BEFORE INSERT trigger:
w := NEW.importance * exp(lambda * epoch_day(NEW.created_at));
NEW.embedding_wd := l2_normalize(NEW.embedding) * w;   -- scale the unit vector by w

The folded column gets an HNSW index with inner-product ops:

CREATE INDEX ON memories USING hnsw (embedding_wd vector_ip_ops);

And recall is one indexed top-K. pgvector's <#> operator returns the negative inner product, so the default ascending ORDER BY puts the largest inner product first:

SELECT id, content
FROM memories
WHERE agent_id = :agent AND superseded_at IS NULL
ORDER BY embedding_wd <#> l2_normalize(:query)
LIMIT :k;

That is the whole hot path. One index scan.
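
For reading the query, it can help to see the same logic without an index. The Python sketch below is an in-memory stand-in, assuming rows are dicts with the column names from the query above; the recall function here is hypothetical and is not the extension's recall().

```python
import math

def l2_normalize(v):
    """Scale v to unit length, like the l2_normalize call in the query."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def neg_inner_product(a, b):
    """What pgvector's <#> returns: the inner product with its sign flipped."""
    return -sum(x * y for x, y in zip(a, b))

def recall(memories, agent_id, query, k):
    """Filter like the WHERE clause, sort ascending by <#>, keep the first k."""
    q = l2_normalize(query)
    live = [
        m for m in memories
        if m["agent_id"] == agent_id and m["superseded_at"] is None
    ]
    live.sort(key=lambda m: neg_inner_product(m["embedding_wd"], q))
    return [m["id"] for m in live[:k]]

# Hypothetical rows; embedding_wd is the already-folded vector.
ROWS = [
    {"id": 1, "agent_id": "a", "superseded_at": None,         "embedding_wd": [2.0, 0.0]},
    {"id": 2, "agent_id": "a", "superseded_at": "2024-01-01", "embedding_wd": [5.0, 0.0]},
    {"id": 3, "agent_id": "b", "superseded_at": None,         "embedding_wd": [9.0, 0.0]},
    {"id": 4, "agent_id": "a", "superseded_at": None,         "embedding_wd": [0.5, 3.0]},
]

print(recall(ROWS, "a", [1.0, 0.0], k=2))  # [1, 4]
print(recall(ROWS, "a", [0.0, 2.0], k=2))  # [4, 1]
```

Rows 2 and 3 have the largest folded vectors but never appear: one is superseded and the other belongs to a different agent.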

The one gotcha: overflow

exp(lambda * created_at) grows over time, so left alone it would eventually overflow the single-precision floats that pgvector stores. The fix is a periodic re_center() that multiplies every folded vector by a single positive constant to pull the exponent back down. Because it is a global positive scale, it does not change inner-product ordering, so recall is unchanged. It is a no-op until lambda * (now - t_ref) > 40, where t_ref is the reference time the exponent is currently measured from. That is years away for typical lambda, and it runs during maintenance.
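
Some back-of-the-envelope numbers for the overflow argument, as a hedged Python sketch. It assumes lambda is expressed per day and models re-centering as one multiplication by exp(-lambda * shift); the function names are illustrative and this is not the extension's re_center() implementation.

```python
import math

# Largest finite IEEE 754 single-precision value (pgvector stores 4-byte floats).
FLOAT32_MAX = 3.4028234663852886e38

def days_until_recenter(lam_per_day, threshold=40.0):
    """Days before lambda * (now - t_ref) exceeds the threshold."""
    return threshold / lam_per_day

def re_center(folded_vectors, lam, shift_days):
    """Multiply every folded vector by the same positive constant."""
    scale = math.exp(-lam * shift_days)
    return [[x * scale for x in v] for v in folded_vectors]

def ranking(query, vectors):
    """Row indices ordered by inner product with query, largest first."""
    return sorted(
        range(len(vectors)),
        key=lambda i: sum(q * x for q, x in zip(query, vectors[i])),
        reverse=True,
    )

# exp(40) is about 2.4e17, far below the float32 limit of roughly exp(88.7).
headroom_exponent = math.log(FLOAT32_MAX)

# Hypothetical folded vectors whose weights have grown to around exp(40).
folded = [[3e17, 1e17], [1e17, 4e17], [2e17, 2e17]]
query = [0.6, 0.8]

before = ranking(query, folded)
after = ranking(query, re_center(folded, lam=0.05, shift_days=800.0))
print(days_until_recenter(0.05), before, after)
```

With lambda = 0.05 per day the threshold is reached after about 800 days, and the ranking is [1, 2, 0] both before and after the rescale.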

Does it actually return the right memories?

I measured recall@10 against an exact brute-force computation of the same objective. A value of 1.000 means HNSW returned the same top-10 as the exact answer; it is a statement about index approximation, not "perfect memory":

memories	ef_search=40	ef_search=100	ef_search=200
100k	1.000	1.000	1.000
1M	0.945	0.995	1.000

ef_search is the standard HNSW recall/latency knob. The result is the same 1.000 on real all-MiniLM-L6-v2 embeddings, not just synthetic clusters. Latency is about 13 ms per call at 100k memories on a debug build. The benchmark scripts are in the repo if you want to run your own data through them.
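
For anyone reproducing the numbers on their own data, recall@k can be computed along these lines. This is a sketch that assumes the usual set-overlap definition (order within the top-k is ignored); the repo's benchmark scripts may compute it differently, and the function names are mine.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate top-k also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact_ids must not be empty")
    hits = len(exact_top.intersection(approx_ids[:k]))
    return hits / len(exact_top)

def mean_recall_at_k(pairs, k):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    values = [recall_at_k(approx, exact, k) for approx, exact in pairs]
    return sum(values) / len(values)

# Two hypothetical queries: the first matches exactly (in a different order),
# the second misses one of the three true neighbours.
pairs = [
    ([1, 2, 3], [3, 2, 1]),
    ([1, 2, 9], [1, 2, 3]),
]
print(mean_recall_at_k(pairs, k=3))
```

Here approx_ids would come from the indexed query and exact_ids from a sequential-scan ranking of the same objective.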

The rest of the system

Recall is the interesting part, but a memory store needs more to be usable:

Lifecycle: memories are range-partitioned by created_at (immutable membership, so no row movement), with roll-up of old partitions and an opt-in expire(retention_days).
Supersession: give a changing fact a stable mem_key. A new value retires the old one for recall but keeps it for a time-travel audit(agent, as_of) query ("what did the agent know on date X?").
Forgetting: memories whose activation importance * exp(-lambda * age) drops below a floor are evicted.
SDKs: Python and TypeScript, plus drop-in LangChain, CrewAI, and AutoGen adapters.
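
The forgetting rule is simple enough to sketch in a few lines of Python. The floor of 0.01, the per-day lambda, and the function names below are assumptions for illustration, not pgmemai defaults.

```python
import math

def activation(importance, age_days, lam):
    """importance * exp(-lam * age), the quantity compared against the floor."""
    return importance * math.exp(-lam * age_days)

def days_until_evicted(importance, lam, floor):
    """Age at which activation crosses the floor: ln(importance / floor) / lam."""
    if importance <= floor:
        return 0.0
    return math.log(importance / floor) / lam

def evictable(memories, now_day, lam, floor):
    """Ids of memories whose activation has dropped below the floor."""
    return [
        m["id"] for m in memories
        if activation(m["importance"], now_day - m["created_day"], lam) < floor
    ]

# Hypothetical memories, with created_day in days since some origin.
MEMS = [
    {"id": "old_trivial",   "importance": 0.1, "created_day": 0.0},
    {"id": "old_important", "importance": 1.0, "created_day": 20.0},
    {"id": "new_trivial",   "importance": 0.1, "created_day": 95.0},
]

print(evictable(MEMS, now_day=100.0, lam=0.05, floor=0.01))  # ['old_trivial']
print(days_until_evicted(1.0, lam=0.05, floor=0.01))         # about 92 days
```

Only the memory that is both old and unimportant falls below the floor; importance buys extra lifetime logarithmically, at ln(importance / floor) / lambda days.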

Honest limitations

It is pre-1.0, so minor versions may change the schema.
lambda (the decay rate) is fixed per store because it is baked into the index. That is the whole trick, but it means you choose a decay rate up front.
recall() writes a little on every call (it bumps an access counter for reinforcement), so it is not a pure read. I think it should be optional, and that is on the list.

Try it

It is Apache-2.0 and runs in the Postgres you already have:

cd extension && make install
psql -d mydb -c "CREATE EXTENSION pgmemai CASCADE;"
psql -d mydb -c "SELECT pgmemai.create_store(1536, 0.05);"

Repo: github.com/pg-amjad/pgmemai

I would genuinely love feedback on the approach and the math, and especially to hear where the decay-fold breaks on a case I have not hit. How are you handling agent memory today?