Most "agent memory" is just a vector search. You embed what the agent said, store it, and at recall time you do a nearest-neighbor lookup. It works, until you notice that a note from three weeks ago ranks exactly the same as one from three minutes ago. My assistant would confidently resurface a preference I had changed months earlier.
That is not memory. It is a filing cabinet with good search.
I wanted recall to rank by similarity x importance x recency: a fresh, important memory should beat a slightly-more-similar but stale one, and trivial old memories should fade. This post is about the one idea that made that cheap and exact, and it ended up as a small Postgres extension called pgmemai.
The obvious approach, and why it falls short
The naive version is "over-fetch by similarity, then re-rank":
SELECT *,
(1 - (embedding <=> :q)) * importance * exp(-:lambda * age_days) AS score
FROM memories
ORDER BY embedding <=> :q -- nearest by cosine
LIMIT 500 -- grab a big candidate pool
-- ... then re-sort by score in app code, take top 10
The problem: the memory that should win on importance and recency is often not in the similarity-top-K at all. So you have to fetch a large candidate pool to even have a chance of seeing it, and you still miss high-importance or recent-but-moderately-similar memories that fell outside the pool. You are fighting your own index.
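To make that failure concrete, here is a toy sketch in plain Python. The four memories, their similarity values, and the decay rate of 0.05 per day are made-up illustration values, not measurements; the point is only that a candidate pool chosen by similarity can exclude the row that wins on the full score.

```python
import math

LAMBDA = 0.05  # assumed decay rate, per day (illustrative)

# (name, cosine similarity to the query, importance, age in days): made-up values
memories = [
    ("stale_a", 0.95, 0.2, 90),
    ("stale_b", 0.93, 0.3, 60),
    ("stale_c", 0.91, 0.2, 45),
    ("fresh_important", 0.80, 0.9, 1),
]


def full_score(sim, importance, age_days):
    """similarity * importance * recency, the objective from the post."""
    return sim * importance * math.exp(-LAMBDA * age_days)


def over_fetch_then_rerank(rows, pool_size, k):
    """Naive approach: take the pool_size most similar rows, then re-rank them."""
    pool = sorted(rows, key=lambda r: r[1], reverse=True)[:pool_size]
    reranked = sorted(pool, key=lambda r: full_score(*r[1:]), reverse=True)
    return [r[0] for r in reranked[:k]]


def exact_top_k(rows, k):
    """Ground truth: rank every row by the full score."""
    ranked = sorted(rows, key=lambda r: full_score(*r[1:]), reverse=True)
    return [r[0] for r in ranked[:k]]


print(exact_top_k(memories, 1))                # the row that should win
print(over_fetch_then_rerank(memories, 3, 1))  # pool of 3 never sees it
```

With a pool of 3 the fresh, important memory is never a candidate, so no amount of re-ranking can recover it; only widening the pool to include it fixes the answer.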
The trick: fold the objective into the vector
The score I want is:
score = cos(query, embedding) * importance * exp(-lambda * (now - created_at))
Watch what happens if I bake importance and recency into the stored vector at insert time:
embedding_wd = unit(embedding) * importance * exp(lambda * created_at)
Now take the inner product of a normalized query with that folded vector:
unit(query) . embedding_wd
= cos(query, embedding) * importance * exp(lambda * created_at)
Compare that to the score I actually want. They differ only by a factor of exp(-lambda * now). And exp(-lambda * now) is the same constant for every row in a given query, so it does not change the top-K ordering. It just scales everything.
Two facts make this hold:
- exp(-lambda * now) is a per-query constant, so it drops out of the ranking.
- created_at is immutable, so exp(lambda * created_at) is computed once at insert and never needs updating.
So a single plain inner-product nearest-neighbor search over embedding_wd ranks rows by the full similarity x importance x recency objective, exactly. No re-ranking pass. No background job re-scoring rows as time passes. No special time-aware index.
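The equivalence is easy to check numerically. The sketch below is plain Python, not pgmemai code: the random rows, the helper names (fold, target_score, make_rows), and the decay rate are all assumptions for illustration. It folds importance and recency into each stored vector, then compares the top-K by inner product against the top-K by the full score at two different values of "now".

```python
import math
import random

LAMBDA = 0.05  # assumed decay rate, per day


def unit(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fold(embedding, importance, created_day):
    """Insert-time fold: unit(embedding) * importance * exp(lambda * created_at)."""
    w = importance * math.exp(LAMBDA * created_day)
    return [x * w for x in unit(embedding)]


def target_score(query, row, now_day):
    """The score we actually want: cos * importance * exp(-lambda * age)."""
    cos = dot(unit(query), unit(row["embedding"]))
    age = now_day - row["created_day"]
    return cos * row["importance"] * math.exp(-LAMBDA * age)


def make_rows(n, dim, seed=0):
    """Hypothetical random memories, with the folded vector stored per row."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        emb = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        importance = rng.uniform(0.1, 1.0)
        created_day = rng.uniform(0.0, 100.0)
        rows.append({
            "id": i,
            "embedding": emb,
            "importance": importance,
            "created_day": created_day,
            "embedding_wd": fold(emb, importance, created_day),
        })
    return rows


def top_k_by_fold(rows, query, k):
    """Rank by plain inner product against the folded vector (no 'now' needed)."""
    q = unit(query)
    ranked = sorted(rows, key=lambda r: dot(q, r["embedding_wd"]), reverse=True)
    return [r["id"] for r in ranked[:k]]


def top_k_by_target(rows, query, now_day, k):
    """Rank by the full similarity * importance * recency score."""
    ranked = sorted(rows, key=lambda r: target_score(query, r, now_day), reverse=True)
    return [r["id"] for r in ranked[:k]]
```

Note that top_k_by_fold takes no timestamp at all: the same stored vectors give the right ordering whether the query runs today or a year from now.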
What it looks like in Postgres
It is built on pgvector. A BEFORE INSERT trigger computes the folded vector:
-- inside a BEFORE INSERT trigger:
w := NEW.importance * exp(lambda * epoch_day(NEW.created_at));
NEW.embedding_wd := l2_normalize(NEW.embedding) * w; -- scale the unit vector by w
The folded column gets an HNSW index with inner-product ops:
CREATE INDEX ON memories USING hnsw (embedding_wd vector_ip_ops);
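And recall is one indexed top-K (<#> is pgvector's inner-product operator; it returns the negative inner product, so the default ascending ORDER BY puts the largest inner products first):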
SELECT id, content
FROM memories
WHERE agent_id = :agent AND superseded_at IS NULL
ORDER BY embedding_wd <#> l2_normalize(:query)
LIMIT :k;
That is the whole hot path. One index scan.
The one gotcha: overflow
exp(lambda * created_at) grows over time, so left alone it would eventually overflow a float. The fix is a periodic re_center() that multiplies every folded vector by a single constant to pull the exponent back down. Because it is a global scale, it does not change inner-product ordering, so recall is unchanged. It is a no-op until lambda * (now - t_ref) > 40, where t_ref is the reference time the stored exponents are measured from. That threshold is years away for typical lambda, and the rescale runs during maintenance.
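A minimal sketch of the re-centering idea, in plain Python rather than the extension's actual code. It assumes the exponent is stored relative to a reference day t_ref and that re-centering moves t_ref to "now"; the function names and that convention are assumptions, and pgmemai's real re_center() may differ in detail. Only the threshold of 40 comes from the text above.

```python
import math

LAMBDA = 0.05      # assumed decay rate, per day
THRESHOLD = 40.0   # from the post: re-center once lambda * (now - t_ref) > 40


def fold(unit_embedding, importance, created_day, t_ref):
    """Fold with the exponent measured from t_ref (an assumed convention)."""
    w = importance * math.exp(LAMBDA * (created_day - t_ref))
    return [x * w for x in unit_embedding]


def re_center(folded_vectors, t_ref, now_day):
    """Rescale every folded vector by one constant; returns (vectors, new t_ref)."""
    if LAMBDA * (now_day - t_ref) <= THRESHOLD:
        return [list(v) for v in folded_vectors], t_ref  # below threshold: no-op
    scale = math.exp(-LAMBDA * (now_day - t_ref))
    return [[x * scale for x in v] for v in folded_vectors], now_day


def ranking(folded_vectors, unit_query):
    """Row indices ordered by descending inner product with the query."""
    scores = [sum(q * x for q, x in zip(unit_query, v)) for v in folded_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Under this convention, rows inserted after a re-centering have to be folded against the new t_ref so that old and new rows stay on the same scale.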
Does it actually return the right memories?
I measured recall@10 against an exact brute-force computation of the same objective. So 1.000 means HNSW returned the same top-10 as the exact answer; it is a statement about index approximation, not "perfect memory":
| memories | ef_search=40 | ef_search=100 | ef_search=200 |
|---|---|---|---|
| 100k | 1.000 | 1.000 | 1.000 |
| 1M | 0.945 | 0.995 | 1.000 |
ef_search is the standard HNSW recall/latency knob. Same 1.000 on real all-MiniLM-L6-v2 embeddings, not just synthetic clusters. Latency is about 13 ms per call at 100k on a debug build. The benchmark scripts are in the repo if you want to run your own data through them.
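For reference, recall@k in this sense can be computed as below. This is a generic sketch, not the repo's benchmark script; the function names are hypothetical, and it assumes each query yields one list of ids from the index and one from the brute-force ranking.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    hits = sum(1 for i in approx_ids[:k] if i in exact)
    return hits / len(exact)


def mean_recall_at_k(pairs, k=10):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    pairs = list(pairs)
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)
```

Order within the top-k is ignored here: a query scores 1.0 as long as the index returns the same set of k ids as the exact computation.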
The rest of the system
Recall is the interesting part, but a memory store needs more to be usable:
- Lifecycle: memories are range-partitioned by created_at (immutable membership, so no row movement), with roll-up of old partitions and an opt-in expire(retention_days).
- Supersession: give a changing fact a stable mem_key. A new value retires the old one for recall but keeps it for a time-travel audit(agent, as_of) query ("what did the agent know on date X?").
- Forgetting: memories whose activation importance * exp(-lambda * age) drops below a floor are evicted.
- SDKs: Python and TypeScript, plus drop-in LangChain, CrewAI, and AutoGen adapters.
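The forgetting rule can be sketched in a few lines of Python. The floor value, the row layout, and the function names are assumptions for illustration; the real eviction logic may also account for reinforcement from the access counter, which this sketch ignores.

```python
import math


def activation(importance, age_days, lam):
    """Activation from the post: importance * exp(-lambda * age)."""
    return importance * math.exp(-lam * age_days)


def days_until_evicted(importance, lam, floor):
    """Age at which activation reaches the floor; 0 if already at or below it."""
    if importance <= floor:
        return 0.0
    return math.log(importance / floor) / lam


def select_evictions(rows, now_day, lam, floor):
    """Ids of rows whose activation has dropped below the floor."""
    return [
        r["id"]
        for r in rows
        if activation(r["importance"], now_day - r["created_day"], lam) < floor
    ]
```

Solving importance * exp(-lambda * age) = floor for age gives ln(importance / floor) / lambda, which is what days_until_evicted returns; it is a quick way to see how long a memory of a given importance survives under a chosen decay rate and floor.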
Honest limitations
- It is pre-1.0, so minor versions may change the schema.
- lambda (the decay rate) is fixed per store because it is baked into the index. That is the whole trick, but it means you choose a decay rate up front.
- recall() writes a little on every call (it bumps an access counter for reinforcement), so it is not a pure read. I think it should be optional, and that is on the list.
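Since lambda has to be chosen up front, it can help to think of it as a half-life. The sketch below assumes lambda is expressed per day, which matches the age_days and epoch_day names used earlier in this post but is worth confirming for your store; the helper names are hypothetical.

```python
import math


def half_life_days(lam):
    """Days for the recency factor exp(-lambda * age) to fall to one half."""
    return math.log(2) / lam


def lambda_for_half_life(days):
    """Decay rate that halves a memory's recency weight every `days` days."""
    return math.log(2) / days


def recency_weight(age_days, lam):
    """The recency factor applied to a memory of the given age."""
    return math.exp(-lam * age_days)


print(half_life_days(0.05))        # roughly two weeks at lambda = 0.05
print(lambda_for_half_life(30.0))  # rate for a 30-day half-life
```

Working backwards from "how old should a memory be before it counts half as much" is usually easier than picking a raw decay rate.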
Try it
It is Apache-2.0 and runs in the Postgres you already have:
cd extension && make install
psql -d mydb -c "CREATE EXTENSION pgmemai CASCADE;"
psql -d mydb -c "SELECT pgmemai.create_store(1536, 0.05);"
Repo: github.com/pg-amjad/pgmemai
I would genuinely love feedback on the approach and the math, and especially to hear where the decay-fold breaks on a case I have not hit. How are you handling agent memory today?