Alex Spinov

Posted on • Originally published at blog.spinov.online

Your AI Agent's Memory Has No Expiry Date: I Scored Freshness on a Real Corpus

My agent confidently quoted a price from 40 days ago. The retrieval was perfect. The fact was dead.

The chunk it pulled said "Pro plan is $29/mo." High similarity to the question, top of the ranking, grammatical, on-topic. Everything a retriever is built to reward. The only problem: the plan had moved to $39 weeks earlier, and the $29 chunk had been sitting in memory the whole time, looking exactly as trustworthy as the day it was written. Worse, the $39 chunk was right there in the corpus too, and the retriever scored the two within 0.002 of each other. A margin that thin isn't a real preference; re-embed the same corpus on a newer model version or a different batch and it can swap sign. Which fact you serve ends up riding on noise nobody designed.

That is the failure I want to fix today. Not bad retrieval. Stale retrieval that looks like good retrieval, decided by a tie-break nobody designed.

Quick answer: A memory or RAG chunk that was correct when stored can quietly go stale, and similarity search is blind to age. When the stale chunk and the fresh chunk are near-duplicates, they score almost identically, so which one lands top-1 turns on a margin too thin to mean anything: a sliver of embedder noise that moves with the model version or the batch. Whichever edges ahead today, naive top-k serves it as the truth. A freshness gate tags every chunk with its age and a TTL based on how volatile that kind of fact is, then down-ranks, blocks, or refuses, before the model reads anything. That turns a fragile similarity ordering into a deterministic rule. Below is a small, zero-network gate with its real, deterministic output across two queries.

This is for anyone running an agent on top of their own corpus: a RAG pipeline, an MCP memory tool, a while loop that stuffs retrieved chunks into context. If you have ever watched your model be confidently wrong and couldn't tell why, this is one of the whys. The fact was right once. Nobody checked whether it still was.

The artifact first

Here is what the gate prints. Same corpus, same ranker, two queries, each run twice (naive vs gated). The cand column is "did this chunk clear the relevance floor for this query":

=== query: 'what does the pro plan cost'   now=day 1000 ===

[naive top-k]    rank by similarity only
id   age  ttl   sim   fresh  cand  verdict
c1    40     3  0.903  0.00   y     STALE_BLOCK
c2     1     3  0.901  0.67   y     FRESH
c5     5     7  0.840  0.29   y     STALE_WARN
c3     5  3650  0.710  1.00   y     FRESH
c6    43     7  0.220  0.00   -     STALE_BLOCK
-> injects: c1 "Pro plan is $29/mo" (sim 0.903)

[freshness-gated] blocked if stale
id   age  ttl   sim   fresh  cand  verdict
c1    40     3  0.903  0.00   y     STALE_BLOCK
c2     1     3  0.901  0.67   y     FRESH
c5     5     7  0.840  0.29   y     STALE_WARN
c3     5  3650  0.710  1.00   y     FRESH
c6    43     7  0.220  0.00   -     STALE_BLOCK
-> injects: c2 "Pro plan is $39/mo" (sim 0.901)

=== query: 'is the $29 summer promo still active'   now=day 1000 ===

[naive top-k]    rank by similarity only
id   age  ttl   sim   fresh  cand  verdict
c1    40     3  0.880  0.00   y     STALE_BLOCK
c6    43     7  0.830  0.00   y     STALE_BLOCK
c2     1     3  0.410  0.67   -     FRESH
c3     5  3650  0.320  1.00   -     FRESH
c5     5     7  0.200  0.29   -     STALE_WARN
-> injects: c1 "Pro plan is $29/mo" (sim 0.880)

[freshness-gated] blocked if stale
id   age  ttl   sim   fresh  cand  verdict
c1    40     3  0.880  0.00   y     STALE_BLOCK
c6    43     7  0.830  0.00   y     STALE_BLOCK
c2     1     3  0.410  0.67   -     FRESH
c3     5  3650  0.320  1.00   -     FRESH
c5     5     7  0.200  0.29   -     STALE_WARN
-> REFUSE: every on-topic chunk is stale, no fresh answer to give


Look at the four injects/REFUSE lines. That is the whole article.

Query 1, naive top-k, injects c1, the $29 fossil. Not because $29 is "more relevant" than $39, but because c1 scored 0.903 and c2 scored 0.901. Two near-duplicate price strings, 0.002 apart. With these exact scores the sort is deterministic: c1 does land on top every run, and I'm not hiding that. The dishonest part is treating that 0.002 as a decision. It isn't. Re-embed this corpus on a newer model build or a different batch and the order can flip, because nothing about the two vectors actually encodes which price is current. The retriever has no opinion about freshness; it hands back whichever near-duplicate edged ahead, and naive top-k takes the top. That fragile edge is the lottery, not a fact about $29.

Query 1, gated, injects c2, the live $39 price, every time. Same ranker, same similarity floor. The only thing the gated pass adds is age: it STALE_BLOCKs c1 (a 40-day-old price against a 3-day TTL) and falls through to the freshest candidate underneath. The coin flip is gone. The verdict is a rule, not a tie-break.

Query 2 is the case naive never shows you. Someone asks whether the old $29 promo is still running. The only on-topic chunks (c1, c6) are both stale; the fresh chunks (c2, c3) are off-topic and sit below the relevance floor, so they never qualify. Naive serves the stale c1 anyway, "yes, $29." The gated pass has nothing fresh AND relevant to offer, so it REFUSEs instead of confidently lying. A missing answer the agent admits to beats a fossil it's sure about.

Why this isn't the usual "data was wrong" story

I want to draw a hard line before going further, because this is easy to confuse with a different bug.

This is not data that was wrong when you collected it. The $29 chunk was correct. On the day it was stored, the Pro plan really was $29/mo. There was no lie at the source, no poisoned page, no parsing error. It was a true fact.

It just rotted.

So the line is: some bugs are about data that was wrong when you got it. This one is about data that was right, and went stale. Validity checks, schema canaries, source-trust scoring all look at the moment of collection. None of them look at the gap between stored_at and now. That gap is the entire problem here, and similarity search is blind to it.

A chunk's embedding does not age. "Pro plan is $29/mo" sits at the same point in vector space forever, and "Pro plan is $39/mo" sits so close to it that the two score within 0.002 of each other against the price query. The price moved in the real world; neither vector did. So the retriever cannot tell the fossil from the current fact. It hands back whichever one edged ahead by that hair and calls it the best match. With no age signal, "best match" between two near-duplicates rests on a margin smaller than the noise between model versions, and over enough re-embeds and queries that margin eventually points at the fossil.
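To make the "margin inside the noise" point concrete, here is a small simulation sketch. It assumes a uniform jitter of ±0.005 on each similarity score as a stand-in for re-embedding on a different model build or batch. That figure is invented for illustration, not measured, and the names NOISE and top1 are hypothetical. The age filter is a stripped-down stand-in for the gate: it only blocks chunks at or past the 3-day price TTL.

```python
import random

TTL_PRICE = 3   # days, the price TTL shown in the table above
NOW = 1000      # same fixed "today" as the printed output
NOISE = 0.005   # ASSUMPTION: illustrative jitter per score, not a measured value

chunks = [
    {"id": "c1", "stored_on": 960, "sim": 0.903},  # $29, stale
    {"id": "c2", "stored_on": 999, "sim": 0.901},  # $39, live
]


def top1(scored, gated):
    """Return the id that would be injected for one set of jittered scores."""
    pool = scored
    if gated:
        # Keep only chunks that are still younger than the price TTL.
        pool = [c for c in scored if NOW - c["stored_on"] < TTL_PRICE]
    return max(pool, key=lambda c: c["sim"])["id"] if pool else None


rng = random.Random(0)  # fixed seed so the run is repeatable
naive_ids, gated_ids = set(), set()
for _ in range(1000):
    jittered = [{**c, "sim": c["sim"] + rng.uniform(-NOISE, NOISE)} for c in chunks]
    naive_ids.add(top1(jittered, gated=False))
    gated_ids.add(top1(jittered, gated=True))

print("naive top-1 ids seen:", sorted(naive_ids))
print("gated top-1 ids seen:", sorted(gated_ids))
```

Under that assumed jitter, the similarity-only pass ends up serving both ids across the trials, while the age filter only ever serves c2. How large the real jitter is for your embedder is something to measure, not take from this sketch.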

Stale memory is worse than no memory

Here's the uncomfortable part. An agent with no memory of the price asks. It calls a tool, hits the source, gets $39. Slow, but correct.

An agent with a stale memory of the price does not ask. Why would it? It has a high-confidence chunk sitting right there at similarity 0.903, a hair above the live one. It skips the lookup, injects $29, and reasons forward: quotes the customer, drafts the invoice, picks the wrong tier in a comparison. Every downstream step inherits the rot, and each one looks just as confident as if the number were right.

Empty memory makes an agent slow. Stale memory makes it confidently wrong, which is the expensive kind of wrong. The whole reason you added memory was to skip the lookup. That shortcut is exactly what turns one dead fact into a chain of dead reasoning. A freshness gate gives the shortcut a tripwire: trust the cached fact when it's fresh, fall back to the live lookup when it isn't.
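A minimal sketch of that tripwire, assuming a hypothetical live_lookup callable that stands in for whatever tool call hits the real source, and borrowing the per-class TTL values the gate uses later in this post. The function names are mine, not part of the full script.

```python
# Sketch only: `live_lookup` is a hypothetical callable, not part of the
# article's script. TTL values (in days) are the ones the gate uses.
TTL = {"price": 3, "availability": 7, "schedule": 30, "reference": 3650}


def is_expired(chunk, now):
    """True once the chunk's age has reached the TTL of its class."""
    return now - chunk["stored_on"] >= TTL[chunk["cls"]]


def recall_or_lookup(chunk, now, live_lookup):
    """Use the cached fact while it is inside its TTL, else pay for the lookup."""
    if chunk is not None and not is_expired(chunk, now):
        return chunk["text"], "memory"
    return live_lookup(), "live"


def fake_source():
    """Placeholder for a real tool call that returns the current fact."""
    return "Pro plan is $39/mo"


stale = {"text": "Pro plan is $29/mo", "stored_on": 960, "cls": "price"}
fresh = {"text": "Pro plan is $39/mo", "stored_on": 999, "cls": "price"}

print(recall_or_lookup(stale, 1000, fake_source))  # falls back to the source
print(recall_or_lookup(fresh, 1000, fake_source))  # served from memory
```

The second element of the return value records which path answered, so a caller can tell a cached fact from a fresh lookup.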

The mechanism: age against a per-class TTL

The gate does one cheap thing at retrieval time. For each candidate chunk it computes a freshness score:

def freshness_score(chunk, now):
    age = now - chunk["stored_on"]
    ttl = TTL[chunk["cls"]]
    score = 1.0 - age / ttl
    return age, ttl, max(0.0, min(1.0, score))

It is plain stdlib Python. Age is now - stored_on. The score is the fraction of the chunk's time-to-live that is left, clamped to [0, 1]. A new chunk scores near 1.0; a chunk at or past its TTL scores 0.0. Then a verdict:

  • FRESH (score ≥ 0.5): inject normally.
  • STALE_WARN (0 < score < 0.5): keep it, but multiply its rank key by the score so a fresher chunk can overtake it.
  • STALE_BLOCK (score 0): never inject. Fall through or refuse.

All three verdicts actually fire in the output above, which is the point of running it instead of describing it. c5 ("Pro plan includes 5 seats", an availability fact 5 days old against a 7-day TTL) lands at score 0.29, so it's a STALE_WARN: in the gated pass its rank key becomes 0.840 * 0.29 = 0.24. With c1 blocked, c5 would otherwise be second by similarity; the down-rank puts it behind both fresh chunks, c2 (0.901) and c3 (0.710). It isn't dropped, just demoted. The printed rows are listed in similarity order in both passes, so the demotion lives in the rank key, not in the row order. That is the STALE_WARN branch executing on a real chunk, not a claim about one.
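The arithmetic behind that demotion can be checked in a few lines. This is a sketch of the gated ordering for query 1 only, using the similarity, age, and TTL values from the printed table; the rank_key helper is a hypothetical name for the same rule.

```python
# Query 1 candidates once STALE_BLOCK has removed c1: (id, similarity, freshness).
# Freshness is 1 - age/ttl from the table above, left unrounded.
candidates = [
    ("c2", 0.901, 1 - 1 / 3),     # price, 1 day old, TTL 3         -> FRESH
    ("c5", 0.840, 1 - 5 / 7),     # availability, 5 days old, TTL 7 -> STALE_WARN
    ("c3", 0.710, 1 - 5 / 3650),  # reference, 5 days old, TTL 3650 -> FRESH
]


def rank_key(sim, fresh):
    """FRESH keeps its raw similarity; STALE_WARN is scaled by its freshness."""
    return sim if fresh >= 0.5 else sim * fresh


ordered = sorted(candidates, key=lambda c: rank_key(c[1], c[2]), reverse=True)
for cid, sim, fresh in ordered:
    print(f"{cid}  sim={sim:.3f}  fresh={fresh:.2f}  key={rank_key(sim, fresh):.3f}")
```

The resulting order is c2, c3, c5: the warned chunk keeps its place in the pool but loses to anything fresh with a similarity above 0.24.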

The interesting part is TTL[chunk["cls"]]. Freshness is not raw age. A price and a historical fact age at completely different rates, so they get different TTLs:

TTL = {"price": 3, "availability": 7, "schedule": 30, "reference": 3650}

Watch what that does in the output. Chunk c3, "Pro plan billing is monthly", is 5 days old. Older than c2. But it's a reference fact with a 3650-day TTL, so its freshness is 1.00 and it stays FRESH. c5 is also 5 days old, but it's an availability fact with a 7-day TTL, so it's a STALE_WARN. And c1, 40 days old, is a price with a 3-day TTL, so it's flatly STALE_BLOCK. c3 and c5 have the same age and get different verdicts, because the half-life of the fact is what's being measured, not the calendar.
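To see the class effect in isolation, this sketch holds age fixed at 5 days and runs it through all four TTLs from the table. It reuses the scoring and verdict rules described above; the freshness helper takes an age directly instead of a chunk, which is my simplification.

```python
TTL = {"price": 3, "availability": 7, "schedule": 30, "reference": 3650}  # days


def freshness(age_days, cls):
    """Share of the class TTL still remaining, clamped to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - age_days / TTL[cls]))


def verdict(score):
    """Same thresholds as the gate: >= 0.5 fresh, > 0 warn, 0 block."""
    if score >= 0.5:
        return "FRESH"
    return "STALE_WARN" if score > 0.0 else "STALE_BLOCK"


# One age, four classes: only the TTL differs between rows.
for cls in TTL:
    score = freshness(5, cls)
    print(f"{cls:<13} age=5  fresh={score:.2f}  {verdict(score)}")
```

A 5-day-old price is blocked, a 5-day-old availability fact is warned, and schedule and reference facts of the same age stay fresh.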

One more knob: SIM_FLOOR. A chunk has to clear it to be a candidate at all (the cand column). That floor is what makes REFUSE possible. In query 2, the only on-topic chunks are stale, and the fresh chunks fall below the floor, so the gated pass has nothing both fresh and relevant to serve. It declines rather than reaching past the floor for an off-topic-but-fresh chunk, or under the block for a stale-but-relevant one.

That is the answer to the first obvious objection: what about evergreen facts? They are fine. Evergreen means a long TTL, which means the gate leaves them alone. The same age that kills a price barely scratches a definition. Freshness is age measured against the half-life of that kind of fact, not a clock.

Where the TTLs come from (and where they don't)

This is the part I refuse to fake, because a freshness score is only as honest as the numbers behind it.

I did not measure "facts decay at X% per day." Nobody can; it depends entirely on what the fact is about. What I do have is real volatility data. Across 2,190 production runs on our own scrapers, the same sources change at wildly different rates. Price and stock fields churn between runs constantly. Reference and historical fields barely move; in one batch of 12 records I re-checked, 5 had changed since the previous run and 7 were byte-for-byte identical.

So the TTLs in this gate are modeled on observed source churn, not a measured decay curve. They are config. The honest framing is: "I have watched which classes of facts go stale fast and which don't, and I encoded that as TTLs you should recalibrate for your domain." If you scrape a stock exchange, your price TTL is minutes, not days. If you index legal statutes, your reference TTL is years.
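If you want a starting point for recalibration, one possible approach (my assumption, not how the TTLs above were chosen) is to log the gaps between observed changes for each class and set the TTL at a low quantile of those gaps. The suggest_ttl helper, the 0.25 quantile, and the sample gaps below are all hypothetical placeholders.

```python
import math


def suggest_ttl(change_gaps_days, quantile=0.25):
    """Pick a TTL (days) at or below most observed gaps between changes.

    change_gaps_days: days between consecutive observed changes for one class.
    quantile: how far up the sorted gaps to pick from (ASSUMPTION; a lower
    value means a stricter, shorter TTL).
    """
    if not change_gaps_days:
        raise ValueError("need at least one observed gap")
    gaps = sorted(change_gaps_days)
    index = math.floor(quantile * (len(gaps) - 1))
    return max(1, gaps[index])  # never suggest a TTL under one day


# Placeholder gaps, invented for illustration only. Not measured data.
example_gaps = {
    "price": [2, 3, 3, 4, 6, 9],
    "reference": [900, 2400, 3650],
}
for cls, gaps in example_gaps.items():
    print(cls, suggest_ttl(gaps))
```

The output of a helper like this is still a judgment call in disguise: the quantile is the knob, and it belongs in config next to the TTLs.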

That is also the honest limit of this whole approach. The class-to-TTL mapping is a judgment call. The gate does not discover it for you. It gives you a place to put the judgment, and then it applies that judgment uniformly and visibly, which is more than similarity search does.

Where this is wrong

A few objections deserve real answers, not a hand-wave.

"Just re-index the corpus more often." Sure, if you can. But re-embedding is periodic and expensive, and it answers a different question. Re-indexing keeps the corpus current. The gate answers "can I trust this specific chunk at this specific retrieval, right now?" Those are orthogonal. You can run a nightly re-index and still serve a 40-day-old price at 9am because the source moved at 8:55. The gate is the cheap guard at read time; re-indexing is the slow refresh. Use both.

"TTL-by-type is arbitrary." Partly true. The class boundaries and the numbers are a design decision, and a wrong TTL gives a wrong verdict. I'd rather have a wrong-but-visible TTL I can tune than an invisible assumption that every retrieved fact is eternally current, which is what plain top-k quietly assumes.

"The similarity scores are basically tied; maybe the retriever just ranks c2 first anyway." Sometimes it will, and that is exactly the problem. c1 and c2 are near-duplicate price strings, so they score within 0.002 of each other. With one fixed set of scores that ordering is stable: on this corpus c1 wins every run, and I'm not pretending otherwise. But 0.002 is well inside the noise floor between embedder versions and batches: re-embed and the order can invert, because the gap encodes nothing about which price is live. On this corpus the dead chunk edged it out; re-run on a newer model and the live one might, until the day it doesn't. Relying on which near-duplicate edges ahead is not a freshness strategy, it's luck. The gate replaces the coin flip with a rule: blocked if past TTL, full stop. It is the only component in the path that even knows c1 is older, and that is the single fact that makes the outcome deterministic instead of lucky.

What changes Monday

Three moves, in order of effort.

Stamp every chunk with stored_at and a source lineage when you write it to memory. If you're not already doing this, it's the best hour you'll spend all week, because you cannot reason about freshness you never recorded.

Tag each chunk with a volatility class. Start with four buckets like the ones above. You don't need a taxonomy; you need "does this kind of fact change in hours, days, or years."

Run the gate between retrieval and the model. Block STALE_BLOCK, down-rank STALE_WARN, and decide explicitly what happens when everything is stale. Refusing ("I don't have a current figure") beats injecting a confident fossil. A wrong answer your agent is sure about costs more than a missing one it admits to.
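The first two moves can be sketched as a write-time record. This is one possible shape, not the format the full script uses (the script stores stored_on as a day number and has no source field). MemoryChunk, remember, and the example.com URL are hypothetical.

```python
import time
from dataclasses import asdict, dataclass

VOLATILITY_CLASSES = {"price", "availability", "schedule", "reference"}


@dataclass(frozen=True)
class MemoryChunk:
    text: str
    cls: str          # volatility class; picks the TTL at read time
    source: str       # lineage: where the fact was pulled from
    stored_at: float  # Unix seconds at write time


def remember(text, cls, source, now=None):
    """Build a stamped chunk; reject classes the gate has no TTL for."""
    if cls not in VOLATILITY_CLASSES:
        raise ValueError(f"unknown volatility class: {cls!r}")
    stamp = time.time() if now is None else now
    return MemoryChunk(text=text, cls=cls, source=source, stored_at=stamp)


# `now` is passed explicitly here only to keep the example repeatable.
chunk = remember(
    "Pro plan is $39/mo",
    "price",
    "https://example.com/pricing",  # placeholder URL
    now=1_700_000_000.0,
)
print(asdict(chunk))
```

Rejecting unknown classes at write time means a chunk can never reach the gate without a TTL to score it against.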

One thing the gate buys you for free: the verdict column is an audit trail. When a customer says "your bot quoted the old price," you don't guess. You look at the retrieval log, see c1 ... STALE_BLOCK or STALE_WARN, and know exactly which fossil got served and how old it was. The lineage field (source) tells you where it came from so you can go re-pull it. Plain top-k gives you none of that; it just hands over the top vector and forgets it ever ranked the others. Debuggability is the quiet second win here, and on a real agent it might matter more than the block itself.
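A sketch of what one line of that retrieval log could look like, assuming you serialize every ranked row along with the final pick. The audit_record helper and its field names are hypothetical, not part of the full script.

```python
import json


def audit_record(query, rows, injected_id):
    """One JSON line per retrieval: every ranked chunk, its verdict, the pick."""
    return json.dumps(
        {
            "query": query,
            "injected": injected_id,  # None means the gate refused
            "chunks": [
                {
                    "id": r["id"],
                    "age": r["age"],
                    "ttl": r["ttl"],
                    "sim": r["sim"],
                    "verdict": r["verdict"],
                    "source": r.get("source"),  # lineage, if it was recorded
                }
                for r in rows
            ],
        },
        sort_keys=True,
    )


# Two rows copied from the query 1 table above.
rows = [
    {"id": "c1", "age": 40, "ttl": 3, "sim": 0.903, "verdict": "STALE_BLOCK"},
    {"id": "c2", "age": 1, "ttl": 3, "sim": 0.901, "verdict": "FRESH"},
]
line = audit_record("what does the pro plan cost", rows, "c2")
print(line)
```

Logging the rows that lost, not just the one that won, is what lets you answer "why was this served" after the fact.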

Related, if you also fetch live pages inside the agent: a 200 OK body is not automatically usable content either. I wrote a separate gate for that, but freshness is the storage-side twin of the same idea. That gate guards what you just fetched; this one guards what you stored months ago. Trust the timestamp, not the vibe.

Here's the full script. Stdlib only, zero network, deterministic. NOW and the similarity scores are hardcoded so two runs print byte-identical stdout; I ran it twice and the output md5 matched (58aa51a486481c8bc20ffb6d4ef80ccd). Drop in your own corpus and TTLs:

"""freshness_gate.py: a retrieval freshness gate for agent memory.

Stdlib only. Zero network. Deterministic: NOW and similarity are hardcoded,
so two runs print byte-identical stdout (stable under md5).

Idea: a memory/RAG chunk was TRUE when stored, then quietly went stale.
Its embedding never ages, so it keeps the same similarity to the query as
the day it was written. When a stale chunk and a fresh chunk are near-
duplicates (two prices for the same plan), their cosine similarity is
near-equal, and which one lands top-1 is a tie-break lottery: insertion
order, sort stability, a hair of embedder noise. Sooner or later naive
top-k serves the fossil. The gate makes age a first-class signal and
removes the lottery deterministically: it scores each chunk against the
TTL of its volatility class, and down-ranks, BLOCKs, or REFUSEs before
the chunk ever reaches the model.

TTLs are modeled on real volatility we observed across 2,190 production runs
(price/stock fields churned run-to-run; reference facts barely moved). They are
config, not measured decay rates. Calibrate per domain.
"""

# Fixed "today" as an absolute day number, so age is deterministic.
# A chunk's age in days is NOW_DAY - stored_on.
NOW_DAY = 1000

# TTL per volatility class, in days. Modeled on observed source churn, not a
# decay rate. "price" moves fast; "reference" is near-evergreen.
TTL = {"price": 3, "availability": 7, "schedule": 30, "reference": 3650}

# A chunk has to clear this similarity to be a candidate at all. Below it,
# the chunk is off-topic for the query and never gets injected. This is what
# lets the gate REFUSE: if every on-topic chunk is stale, there is nothing
# fresh AND relevant left, so we decline instead of serving an off-topic
# fresh chunk or a stale on-topic one.
SIM_FLOOR = 0.50

# Corpus. Each chunk: id, text, stored_on (absolute day), volatility class,
# and the retriever's cosine similarity per query (hardcoded: stands in for
# the embedding model). Note c1 and c2 are near-duplicate price strings, so
# their similarity to the price query is near-equal (0.903 vs 0.901): the
# embedder cannot tell the fresh one from the fossil.
QUERIES = {
    "q1": "what does the pro plan cost",
    "q2": "is the $29 summer promo still active",
}
CORPUS = [
    {"id": "c1", "text": "Pro plan is $29/mo",            "stored_on": 960, "cls": "price",        "sim": {"q1": 0.903, "q2": 0.88}},
    {"id": "c2", "text": "Pro plan is $39/mo",            "stored_on": 999, "cls": "price",        "sim": {"q1": 0.901, "q2": 0.41}},
    {"id": "c3", "text": "Pro plan billing is monthly",   "stored_on": 995, "cls": "reference",    "sim": {"q1": 0.710, "q2": 0.32}},
    {"id": "c5", "text": "Pro plan includes 5 seats",     "stored_on": 995, "cls": "availability", "sim": {"q1": 0.840, "q2": 0.20}},
    {"id": "c6", "text": "Summer promo: 20% off Pro",     "stored_on": 957, "cls": "availability", "sim": {"q1": 0.22,  "q2": 0.83}},
]


def freshness_score(chunk, now):
    """Age vs the TTL of the chunk's class. 1.0 = brand new, 0.0 = >= TTL old."""
    age = now - chunk["stored_on"]
    ttl = TTL[chunk["cls"]]
    score = 1.0 - age / ttl
    return age, ttl, max(0.0, min(1.0, score))


def verdict(score):
    if score >= 0.5:
        return "FRESH"
    if score > 0.0:
        return "STALE_WARN"
    return "STALE_BLOCK"


def rank(chunks, query_key, now, gated):
    """One ranker for both passes. Same similarity signal, same SIM_FLOOR.
    gated=True adds exactly one thing: age. FRESH passes through, STALE_WARN
    is down-ranked by its freshness score, STALE_BLOCK is dropped."""
    rows = []
    for c in chunks:
        sim = c["sim"][query_key]
        age, ttl, score = freshness_score(c, now)
        v = verdict(score)
        candidate = sim >= SIM_FLOOR          # off-topic chunks never inject
        keep = candidate
        rank_key = sim
        if gated and candidate:
            if v == "STALE_BLOCK":
                keep = False                  # never inject a blocked fact
            elif v == "STALE_WARN":
                rank_key = sim * score        # down-rank, don't drop
        rows.append({**c, "sim_q": sim, "age": age, "ttl": ttl, "score": score,
                     "v": v, "cand": candidate, "keep": keep, "key": rank_key})
    kept = [r for r in rows if r["keep"]]
    kept.sort(key=lambda r: r["key"], reverse=True)
    return rows, (kept[0] if kept else None)


def show(query_key, gated):
    rows, top = rank(CORPUS, query_key, NOW_DAY, gated)
    tag = "[freshness-gated]" if gated else "[naive top-k]   "
    print(f"{tag} {'blocked if stale' if gated else 'rank by similarity only'}")
    print("id   age  ttl   sim   fresh  cand  verdict")
    for r in sorted(rows, key=lambda r: r["sim_q"], reverse=True):
        print(f"{r['id']:<4} {r['age']:>3}  {r['ttl']:>4}  {r['sim_q']:.3f}  "
              f"{r['score']:.2f}   {'y' if r['cand'] else '-'}     {r['v']}")
    if top:
        print(f"-> injects: {top['id']} \"{top['text']}\" (sim {top['sim_q']:.3f})")
    else:
        print("-> REFUSE: every on-topic chunk is stale, no fresh answer to give")
    print()


if __name__ == "__main__":
    for qk, qtext in QUERIES.items():
        print(f"=== query: {qtext!r}   now=day {NOW_DAY} ===\n")
        show(qk, gated=False)
        show(qk, gated=True)

The gate is deliberately dumb. No model call, no embedding, no clever decay math. Just age against a TTL you set, applied where it matters: before the fact reaches the model, not after the model has already believed it.

What's the shortest-lived fact your agent has ever quoted back to you with full confidence? I'm collecting volatility classes and would love a TTL you've had to set absurdly low. Drop it in the comments. 👇


Follow for more numbers from production agent runs. AI disclosure: drafted with AI assistance, but every line of code here was actually run, and the stdout above is its real, unedited output.
