Charles Wu for OceanBase User Group

Posted on Jun 28

How Agent Memory Forgets: An Engineering Walkthrough

#agents #ai #opensource #softwareengineering

Time coats memory in dust. Access is the only cloth that wipes it clean. When the dust grows thick and no one asks, that is forgetting.

Photo by Markus Winkler on Unsplash

In the previous post, we looked at forgetting from a cognitive-science angle — synaptic plasticity, the Ebbinghaus forgetting curve [1], spaced repetition, and desirable difficulty. Fascinating theory.

This piece takes a different angle. We follow a single message through an agent memory system (using PowerMem [2] as our running example) from write to eviction, and watch how forgetting is engineered at every step. In short: decay starts the moment a message is written, but what actually decides a memory’s fate is whether it gets accessed again.

1. Importance Scoring: Is This Worth Remembering?

Picture this message landing in the system:

Next Friday at 3:00 PM — Q2 requirements review with the product team in Conference Room B.

Once the message enters memory, the first question is not how to store it, but whether it deserves storage. If every piece of information is written and retrieved at equal weight, two problems compound as volume grows: retrieval signal-to-noise keeps dropping, and storage cost becomes unbounded.

Shannon’s information theory frames the same idea: high-probability events carry almost no information and are poor candidates for long-term retention; low-probability but critical events carry enormous information and should be persisted.

Importance scoring is the filter. Each item gets a score that drives decay speed, review cadence, and eviction priority downstream.

So how is importance assessed?

1.1 The six-dimension model

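In PowerMem, scoring a single message is more involved than it looks. The system uses a six-dimension model: six axes, weighted and summed. The weights, as they appear in the worked example below:

relevance: 0.30
novelty: 0.20
emotional_impact: 0.15
actionable: 0.15
factual: 0.10
personal: 0.10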

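For our meeting message: highly relevant to “Q2 work” (relevance ≈ 0.8), concrete time and place (factual ≈ 0.8), attendance is non-optional (actionable ≈ 0.9), emotionally neutral (emotional_impact ≈ 0.2), with novelty ≈ 0.5 and personal ≈ 0.6 rounding out the six. Weighted:

0.3×0.8 + 0.2×0.5 + 0.15×0.2 + 0.15×0.9 + 0.1×0.8 + 0.1×0.6 = 0.645

The weighted sum comes to about 0.65. The LLM path described in Section 1.2 returns its own overall score for this message, 0.72, and that is the value used in the rest of this walkthrough.

Importance score: 0.72.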

The six-dimension model is the theoretical backbone of importance scoring.

1.2 Dual-path scoring: LLM vs. rule engine

PowerMem runs two execution paths so the system always returns a score. Path one uses the six-dimension model; path two is a rule engine that kicks in when path one is unavailable — graceful degradation.

Path one: LLM deep scoring (preferred)

When an LLM is available, the system asks it to analyze all six dimensions and return structured JSON, for example:

{
    "importance_score": 0.72,
    "reasoning": "Meeting schedule imposes a hard time constraint and clear action requirement",
    "criteria_scores": {
        "relevance": 0.8,
        "novelty": 0.5,
        "emotional_impact": 0.2,
        "actionable": 0.9,
        "factual": 0.8,
        "personal": 0.6
    }
}

PowerMem reads importance_score from the response.

Because LLM output is not perfectly predictable, parsing uses a three-level fallback: try JSON first; if that fails, regex-extract a numeric score; if that fails too, default to 0.5.
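The three-level fallback can be sketched in a few lines. This is a minimal illustration, not PowerMem's actual parser: the function name parse_importance, the regex, and the clamp to [0, 1] are assumptions made for the example.

```python
import json
import re

DEFAULT_SCORE = 0.5  # the default named in the text


def _clamp(value: float) -> float:
    # Clamping to [0, 1] is an assumption, not something the article states.
    return max(0.0, min(1.0, value))


def parse_importance(raw: str) -> float:
    """Extract an importance score from raw LLM output, with three fallbacks."""
    # Level 1: the response is valid JSON that carries importance_score.
    try:
        data = json.loads(raw)
        return _clamp(float(data["importance_score"]))
    except (ValueError, KeyError, TypeError):
        pass

    # Level 2: pull the first number out of free text.
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match:
        return _clamp(float(match.group()))

    # Level 3: nothing usable, so return the neutral default.
    return DEFAULT_SCORE
```

A truncated or chatty response still yields a number, and a response with no number at all lands on 0.5 rather than raising.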

Engineering note: The per-dimension scores in the JSON are not fed back into the weighted formula. They exist to structure the model’s reasoning (chain-of-thought style) so the final score is more stable.

Path two: Rule engine (fallback)

When the LLM is down, the rule engine takes over. Precision drops, but the system keeps running.

Rules accumulate score from quantifiable signals:

Content length > 100 characters: +0.1; > 50 characters: +0.05
Keyword hit: +0.1 each
Contains ? or !: +0.05 each
Metadata priority high / medium: +0.2 / +0.1
Score capped at 1.0

This is graceful degradation in production: one external dependency failing should not stall the entire memory layer.
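A rough sketch of the rule list above, under several assumptions: the keyword list is hypothetical (the article does not say which keywords PowerMem matches), the starting score is a parameter because the base value is not given, the two length tiers are treated as mutually exclusive, and "each" for ? and ! is read as "each mark that appears at least once".

```python
from typing import Iterable, Optional

# Hypothetical keyword list, for illustration only.
KEYWORDS = ("deadline", "meeting", "review", "urgent")


def rule_based_score(content: str,
                     priority: Optional[str] = None,
                     keywords: Iterable[str] = KEYWORDS,
                     base: float = 0.0) -> float:
    score = base  # the starting value is an assumption

    # Length bonus: only the highest matching tier applies.
    if len(content) > 100:
        score += 0.1
    elif len(content) > 50:
        score += 0.05

    # +0.1 per keyword that appears (case-insensitive substring match).
    lowered = content.lower()
    score += 0.1 * sum(1 for kw in keywords if kw in lowered)

    # +0.05 for each of '?' and '!' that is present.
    score += 0.05 * sum(1 for mark in "?!" if mark in content)

    # Metadata priority bonus.
    score += {"high": 0.2, "medium": 0.1}.get(priority or "", 0.0)

    return min(score, 1.0)  # cap from the rule list
```

Every signal here is cheap to compute and needs no network call, which is the point of a fallback path.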

2. Classification and Parameter Initialization

2.1 Three-layer memory model

Once importance is set, the next step is classification: which memory layer does this message belong to?

If you read the previous post, you’ll recall how biological memory layers work: information enters the hippocampus (short-term buffer) with limited capacity, consolidates into the neocortex (long-term storage), and only items that are repeatedly activated, richly linked to existing knowledge, or emotionally salient earn priority transfer.

PowerMem maps that to three layers: working, short_term, and long_term.

Classification thresholds:

score ≥ 0.8 → long_term
score ≥ 0.6 → short_term
otherwise → working

if score >= self._algo.long_term_threshold:   # 0.8
    return "long_term"
if score >= self._algo.short_term_threshold:  # 0.6
    return "short_term"
return "working"

Our meeting message scores 0.72 — above 0.6, below 0.8 — so it lands in short_term.

Higher layers decay more slowly; memories live longer.

2.2 Forgetting parameter initialization

Classification answers where a memory lives. The sharper questions are: how fast should it fade? When should it be reviewed for consolidation?

After classification, PowerMem builds a full lifecycle metadata profile — a card tracking strength, decay parameters, review schedule, and management flags.

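The profile splits into two blocks: continuous forgetting parameters (retention, decay rate, review schedule) and discrete lifecycle flags.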

Below we walk through each parameter for the meeting example (importance = 0.72, short_term).

2.2.1 Initial retention

Initial retention captures how “solid” a memory is at birth. More important content should start with higher retention; low-importance noise should be fragile from day one and yield bandwidth to stronger competitors.

In PowerMem:

initial_retention = self.initial_retention * importance_score
# Meeting example: 1.0 × 0.72 = 0.72

Two fields are written:

initial_retention — snapshot at creation (“how firmly it was encoded”)
current_retention — live effective retention

They match at creation; only current_retention changes as decay and review proceed.

2.2.2 Decay rate by layer

working / short_term / long_term mirror working memory, hippocampus, and neocortex: the closer a layer is to long-term storage, the more slowly it forgets per unit time. Each layer gets its own decay coefficient:

{
    "working":    0.5,  # smallest S — fades fastest
    "short_term": 1.5,
    "long_term":  2.0,  # largest S — fades slowest
}

For our short_term meeting message:

0.1 (global base decay) × 1.5 (short_term coefficient) = 0.15

By comparison:

working: 0.1 × 0.5 = 0.05
long_term: 0.1 × 2.0 = 0.20

Larger coefficient → slower forgetting (within this parameterization).
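The per-layer arithmetic is small enough to check by hand, but a sketch makes the direction of the effect explicit. The function name layer_decay_rate and the fallback coefficient of 1.0 for unknown layers are assumptions for this example.

```python
BASE_DECAY_RATE = 0.1  # global base decay from the text

LAYER_COEFFICIENTS = {
    "working": 0.5,
    "short_term": 1.5,
    "long_term": 2.0,
}


def layer_decay_rate(memory_type: str, base: float = BASE_DECAY_RATE) -> float:
    """Per-layer rate = global base rate x layer coefficient."""
    # Using 1.0 for an unknown layer is an assumption.
    return base * LAYER_COEFFICIENTS.get(memory_type, 1.0)


for layer in LAYER_COEFFICIENTS:
    rate = layer_decay_rate(layer)
    # 24 * rate is the characteristic timescale in hours used by the decay formula.
    print(f"{layer:<10} rate={rate:.2f}  timescale={24 * rate:.1f} h")
```

The timescales come out at 1.2, 3.6, and 4.8 hours for working, short_term, and long_term, which is why a larger coefficient means a longer life.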

2.2.3 Review scheduling

Spaced repetition’s core rule: review often at first, then stretch intervals. Hit the window where the memory is almost gone but still recoverable — not so early that review is wasted, not so late that it’s already lost.

PowerMem schedules review timestamps at creation, not on demand.

Step 1: Baseline intervals

Five global baseline intervals (hours): [1, 6, 24, 72, 168] — roughly 1 h, 6 h, 1 day, 3 days, 7 days. Shared by all memories.

Step 2: Compress by importance

Baseline intervals treat all memories equally, but a credential reminder should be nudged more often than small talk. Each interval is compressed:

adjusted_interval = interval × (1 - importance_score × adjustment_factor)

interval — baseline (e.g. 1 h, 6 h, 24 h)
importance_score — 0.72 for our meeting
adjustment_factor — default 0.3 (compression strength)
Floor: 0.5 hours

For the meeting (importance = 0.72):

1 - 0.72 × 0.3 = 0.784

Each baseline interval becomes 78.4% of its original length: 1 h → 0.78 h; 6 h → 4.70 h; 24 h → 18.82 h; 72 h → 56.45 h; 168 h → 131.71 h.

For a low-importance message (importance = 0.3):

1 - 0.3 × 0.3 = 0.91

Intervals shrink only slightly, so reviews drift later: 1 h → 0.91 h; 6 h → 5.46 h; 24 h → 21.84 h; 72 h → 65.52 h; 168 h → 152.88 h.

Higher importance → earlier review windows → more chances to re-consolidate.
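The compression step can be sketched as follows. The constants come from the text; the function names are made up for the example, and measuring every interval from creation time (rather than from the previous review) is an assumption.

```python
from datetime import datetime, timedelta

BASE_INTERVALS_HOURS = [1, 6, 24, 72, 168]
ADJUSTMENT_FACTOR = 0.3
MIN_INTERVAL_HOURS = 0.5


def adjusted_intervals(importance: float) -> list:
    """Compress each baseline interval by importance, with a 0.5 h floor."""
    scale = 1 - importance * ADJUSTMENT_FACTOR
    return [max(MIN_INTERVAL_HOURS, hours * scale) for hours in BASE_INTERVALS_HOURS]


def review_schedule(created_at: datetime, importance: float) -> list:
    """Turn the adjusted intervals into review timestamps."""
    # Assumption: each interval is an offset from creation time.
    return [created_at + timedelta(hours=h) for h in adjusted_intervals(importance)]


for when in review_schedule(datetime(2025, 6, 27, 15, 0), importance=0.72):
    print(when.isoformat())
```

With the default factor of 0.3 the scale never drops below 0.7, so the 0.5 hour floor only matters if the adjustment factor is raised.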

After computing five timestamps, PowerMem stores the full schedule and initializes the companion fields. Together they answer three questions:

next_review — when to review
review_count + last_reviewed — how much review has happened
reinforcement_factor — how much each review restores

When next_review arrives and a review completes: review_count increments, last_reviewed updates, current_retention rises by reinforcement_factor, next_review advances.

That closes the engineering loop of retrieval → reconsolidation.
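A minimal sketch of that update step. The ReviewState class is hypothetical, the reinforcement_factor value of 0.3 is a placeholder (the article does not give the default), and treating the boost as additive with a cap at 1.0 is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ReviewState:  # hypothetical container for the review fields
    current_retention: float
    review_schedule: List[datetime]
    reinforcement_factor: float = 0.3  # placeholder value, not PowerMem's default
    review_count: int = 0
    last_reviewed: Optional[datetime] = None

    @property
    def next_review(self) -> Optional[datetime]:
        # The next pending slot in the precomputed schedule, if any remain.
        if self.review_count < len(self.review_schedule):
            return self.review_schedule[self.review_count]
        return None

    def complete_review(self, now: datetime) -> None:
        self.review_count += 1      # also advances next_review
        self.last_reviewed = now
        # Additive boost capped at 1.0; both choices are assumptions.
        self.current_retention = min(
            1.0, self.current_retention + self.reinforcement_factor
        )
```

Because the schedule is fixed at creation, advancing next_review is just a matter of moving an index forward.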

2.2.4 Lifecycle state machine

Beyond continuous numbers, memories need discrete lifecycle flags:

Should this memory be promoted to a higher layer?
Should it be evicted?
Should it leave the active retrieval pool?

At creation, PowerMem initializes:

{
    "should_promote": false,
    "should_forget": false,
    "should_archive": false,
    "is_active": true
}

should_promote — e.g. working → short_term when a “scratch” memory is accessed repeatedly; slower decay upstairs.
should_forget — decay factor drops below threshold (0.3), or zero accesses in 7 days (silent forgetting).
should_archive — move out of active search. Archive ≠ delete; data remains, but skips routine retrieval.
is_active — participates in normal read/search paths.

An access_count counter is also initialized; promotion logic can require ≥ 3 accesses, among other rules.

2.2.5 Persisting the metadata profile

When all parameters are computed, the system packs them into a structured dict, merges into metadata, and writes alongside the message body.

The clock starts ticking here.

3. Decay Calculation

Time passes. Retention drifts downward.

Recall the Ebbinghaus form: R(t) = e^(-λt) — exponential decay.

3.1 PowerMem’s decay formula

In code:

rate = self.decay_rate if decay_rate is None else decay_rate
decay_factor = math.exp(-hours_elapsed / (24 * rate))

The denominator 24 × rate (hours) is the memory’s characteristic decay timescale. Call it S (Strength): S = 24 × rate. Larger S → slower decay → longer life.

Cleaner form:

decay_factor = e^(-t / S)

Elegant property: after elapsed time t = S, retention falls to e^(-1) ≈ 37% of its prior value — regardless of S. Know S in hours, and you know the forgetting rhythm.

For our short_term meeting memory, rate = 0.15:

S = 24 × 0.15 = 3.6 hours

Roughly every 3.6 hours, retention drops to ~37% of what it was.

One last engineering detail: PowerMem includes a fallback path. The caller first computes and passes a memory-type-specific decay rate. If none is provided, the system falls back to the global default decay rate. This ensures decay calculation still works even when legacy metadata lacks type information.
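Here is a standalone sketch of the formula together with the fallback. It is not PowerMem's method: the function name decay_factor and the explicit now argument are assumptions that make the example testable without a class.

```python
import math
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_DECAY_RATE = 0.1  # global base rate used when no per-type rate is passed


def decay_factor(created_at: datetime,
                 now: datetime,
                 decay_rate: Optional[float] = None) -> float:
    """Return e^(-t / S) with S = 24 * rate hours."""
    # Fall back to the global default when the caller supplies no rate.
    rate = DEFAULT_DECAY_RATE if decay_rate is None else decay_rate
    hours_elapsed = (now - created_at).total_seconds() / 3600
    return math.exp(-hours_elapsed / (24 * rate))


created = datetime(2025, 6, 27, 15, 0)
after_one_s = created + timedelta(hours=3.6)
print(round(decay_factor(created, after_one_s, decay_rate=0.15), 3))
```

After exactly one timescale (3.6 hours for the short_term meeting memory) the factor is e^(-1), about 0.368.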

3.2 A telling engineering trade-off

PowerMem’s formula is equivalent to the classic Ebbinghaus write-up — different notation. Classic: R(t) = e^(-λt) where λ is the decay constant and λ = 1/S.

Back-solving classic Ebbinghaus lab data gives λ ≈ 0.821 per hour. PowerMem’s default config (decay rate 0.1, so S = 2.4 hours) implies λ = 1/2.4 ≈ 0.417 per hour, about half.

This is intentional, not a bug. Ebbinghaus used meaningless syllables — the fastest decay humans show. PowerMem stores semantically linked agent memory; gentler decay matches that reality.

Tune aggressiveness via INTELLIGENT_MEMORY_DECAY_RATE in .env (0.1 in the examples above).

Lower decay_rate → more aggressive forgetting.

3.3 Decay timeline for the meeting memory

With S = 3.6 hours, the decay factor e^(-t/3.6) falls below the 0.3 threshold once t > 3.6 × ln(1/0.3) ≈ 4.3 hours. That does not mean immediate deletion: eviction runs when the memory is accessed.
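The crossing time comes from solving e^(-t/S) = 0.3 for t. The sketch below tabulates the decay factor for the meeting memory at a few sample times; the helper names are made up for the example.

```python
import math

FORGET_THRESHOLD = 0.3
S_HOURS = 24 * 0.15  # characteristic timescale of the short_term meeting memory


def decay_factor(hours_elapsed: float, s_hours: float = S_HOURS) -> float:
    return math.exp(-hours_elapsed / s_hours)


def hours_until(threshold: float, s_hours: float = S_HOURS) -> float:
    # Solve e^(-t / S) = threshold for t.
    return s_hours * math.log(1 / threshold)


for t in (0, 1, 2, 3.6, 4.3, 6):
    flag = "below threshold" if decay_factor(t) < FORGET_THRESHOLD else ""
    print(f"t={t:>4} h  decay_factor={decay_factor(t):.3f}  {flag}")

print(f"crosses {FORGET_THRESHOLD} at about {hours_until(FORGET_THRESHOLD):.2f} h")
```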

4. Access-Triggered Lifecycle

4.1 Defer forget decisions until access

Decay accrues continuously with elapsed time and is computed from the creation timestamp when needed; forget / promote / archive decisions execute on access.

Cognitive background: retrieving a consolidated memory temporarily reopens it to plasticity; reconsolidation strengthens the trace. Engineering translation: memory fate should not be time-only — re-evaluate on every touch.

Access is the strongest feedback signal. Frequent access → retain or promote. Long silence → forget, even if it once looked important. Lazy evaluation avoids batch scans; no cron job walking every row.

4.2 Four checkpoints

On Memory.get() or Memory.search(), four stages run in order.

Stage 1: Forget

rate = self._resolve_decay_rate(memory)
decay_factor = self.calculate_decay(created_at, decay_rate=rate)
if decay_factor < self.working_threshold:  # 0.3
    return True

if access_count == 0 and time_elapsed > timedelta(days=7):
    return True

Either condition triggers forget. The second is blunt: never accessed in seven days is itself a strong forget signal. The caller performs deletion.

Stage 2: Promote

Any one condition promotes:

if access_count >= 3:
    return True

if time_elapsed > timedelta(hours=24):
    return True

if importance >= self.short_term_threshold:  # 0.6
    return True

Effect: working → short_term or short_term → long_term. The memory moves to a layer with a larger decay coefficient, so it fades more slowly and lives longer.

Our meeting memory (importance = 0.72 ≥ 0.6) qualifies for promotion on first access. If still short_term, it upgrades to long_term — sticky note → durable knowledge. That mirrors biological consolidation.

Stage 3: Archive

if time_elapsed > timedelta(days=30):
    return True

if importance < self.working_threshold:  # 0.3
    return True

Archived memories are not deleted; they leave the active pool but remain reachable via archive APIs.

Stage 4: Periodic reprocessing

Whenever the access count hits a multiple of 5 (the 5th, 10th, 15th access, and so on), or when the memory type changes, the system recomputes all Ebbinghaus metadata. Parameters track evolving access patterns — spaced repetition stabilized in code.
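The four checkpoints can be pulled together into one standalone sketch. The conditions and thresholds are the ones quoted above; the MemoryRecord class, the function name, and returning all four flags at once (instead of stopping at the first hit) are assumptions. The "memory type changed" trigger for reprocessing is left out for brevity.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta

FORGET_THRESHOLD = 0.3    # working_threshold in the snippets above
PROMOTE_IMPORTANCE = 0.6  # short_term_threshold


@dataclass
class MemoryRecord:  # hypothetical container, not a PowerMem class
    created_at: datetime
    importance: float
    decay_rate: float
    access_count: int = 0


def evaluate_on_access(mem: MemoryRecord, now: datetime) -> dict:
    elapsed = now - mem.created_at
    hours = elapsed.total_seconds() / 3600
    decay = math.exp(-hours / (24 * mem.decay_rate))

    return {
        # Stage 1: decayed below threshold, or never accessed in seven days.
        "should_forget": decay < FORGET_THRESHOLD
        or (mem.access_count == 0 and elapsed > timedelta(days=7)),
        # Stage 2: any one condition promotes.
        "should_promote": mem.access_count >= 3
        or elapsed > timedelta(hours=24)
        or mem.importance >= PROMOTE_IMPORTANCE,
        # Stage 3: old or unimportant memories leave the active pool.
        "should_archive": elapsed > timedelta(days=30)
        or mem.importance < FORGET_THRESHOLD,
        # Stage 4: recompute metadata on every fifth access.
        "should_reprocess": mem.access_count > 0 and mem.access_count % 5 == 0,
    }
```

For the meeting memory accessed one hour after creation, only should_promote is set. The same memory first touched five hours in has already decayed below 0.3 and is flagged for forgetting.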

5. Search Weighting

5.1 Interference theory in retrieval

The hard part of memory is not storage — it’s retrieval. As volume grows, cross-interference explodes.

Search for “Q2 review.” Pure semantic ranking might surface a three-month-old meeting note at the top and bury yesterday’s schedule change. Best semantic match ≠ what the user needs right now.

PowerMem injects time into ranking.

5.2 Ranking formula

final_score = relevance_score × decay_factor

relevance_score — keyword / semantic match
decay_factor — time decay from Section 3

This is a cross-rank: stale-but-perfect matches lose to fresher moderate matches.

Recency has veto power. No matter how well it matches, a nearly decayed memory sinks.

Search is also access: each hit triggers Memory.get() — batch lifecycle management for free.
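A toy example of the ranking formula. The relevance scores and decay factors are made-up numbers chosen to show the effect, not values from PowerMem.

```python
def rank(candidates):
    """candidates: iterable of (label, relevance_score, decay_factor) tuples."""
    scored = [(label, relevance * decay) for label, relevance, decay in candidates]
    # Highest final_score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


results = rank([
    ("three-month-old meeting note", 0.95, 0.05),  # near-perfect match, almost decayed
    ("yesterday's schedule change", 0.60, 0.80),   # moderate match, still fresh
])

for label, final_score in results:
    print(f"{final_score:.3f}  {label}")
```

The fresher note wins by roughly a factor of ten even though its raw relevance is lower, which is the veto effect described above.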

6. Global Optimization

6.1 Why global passes matter

Everything above is per-memory, online. At scale you still need periodic housekeeping — duplicates, redundancy, fragmentation. Analog: sleep-dependent consolidation — replay, transfer, dedupe, merge, strengthen important links.

PowerMem offers three complementary strategies.

6.2 Three optimization strategies

(1) Exact deduplication

Content-hash exact match. Maintain hash → [memories]; keep earliest per group, delete rest. Batch cap: 10,000 records. Identical duplicates → one row.
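A sketch of the hash-group approach. SHA-256 and the dict-based record shape are assumptions; the article only says "content hash".

```python
import hashlib
from collections import defaultdict

BATCH_CAP = 10_000  # batch cap from the text


def exact_dedup(memories, batch_cap=BATCH_CAP):
    """memories: list of dicts with 'id', 'content', 'created_at'.

    Returns (kept_records, deleted_ids).
    """
    groups = defaultdict(list)
    for mem in memories[:batch_cap]:
        digest = hashlib.sha256(mem["content"].encode("utf-8")).hexdigest()
        groups[digest].append(mem)

    kept, deleted = [], []
    for group in groups.values():
        group.sort(key=lambda m: m["created_at"])  # earliest first
        kept.append(group[0])
        deleted.extend(m["id"] for m in group[1:])
    return kept, deleted
```

Grouping by digest keeps the pass linear in the number of records, since no pairwise comparison is needed for exact matches.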

(2) Semantic deduplication

Embedding cosine similarity finds near-duplicates with different wording. O(N×M) pairwise compare; default threshold 0.95. Delete newer duplicate; keep earliest.

Example:

“Q2 review moved to next Wednesday”
“Q2 review pushed to Wednesday next week”

Same meaning, different surface form — semantic dedup catches it.
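A pure-Python sketch of threshold-based semantic dedup. The embeddings here are toy two-dimensional vectors, the record shape is hypothetical, and using >= for the threshold comparison is an assumption.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_dedup(memories, threshold=0.95):
    """memories: list of dicts with 'id', 'created_at', 'embedding'.

    Walks records oldest first and drops any record that is too similar
    to one already kept, so the earliest copy always survives.
    """
    kept, deleted = [], []
    for mem in sorted(memories, key=lambda m: m["created_at"]):
        if any(cosine(mem["embedding"], k["embedding"]) >= threshold for k in kept):
            deleted.append(mem["id"])
        else:
            kept.append(mem)
    return kept, deleted
```

Each new record is compared against every kept record, which is where the pairwise cost comes from.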

(3) Memory compression

For similar-but-not-identical clusters, an LLM summarizes into one synthetic memory:

Greedy clustering — group pairs above threshold (default 0.85)
LLM summary — templated prompt → one replacement memory per cluster
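The two steps might look like this in outline. The seed-based greedy grouping is one plausible reading of "greedy clustering", and the summarize argument is a stand-in for the LLM call; neither is taken from PowerMem's source.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_clusters(embeddings, threshold=0.85):
    """Return clusters as lists of indices; singletons are dropped."""
    unassigned = list(range(len(embeddings)))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        cluster = [seed]
        # Pull in every remaining item that is close enough to the seed.
        for idx in unassigned[:]:
            if cosine(embeddings[seed], embeddings[idx]) >= threshold:
                cluster.append(idx)
                unassigned.remove(idx)
        if len(cluster) > 1:
            clusters.append(cluster)
    return clusters


def compress(memories, embeddings, summarize, threshold=0.85):
    """summarize: callable taking a list of texts and returning one text."""
    return [
        summarize([memories[i] for i in cluster])
        for cluster in greedy_clusters(embeddings, threshold)
    ]
```

In a real pipeline summarize would wrap a templated LLM prompt; for a quick check, something like lambda group: " | ".join(group) is enough to see which memories were merged.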

Together with per-item decay, this spans micro (row-level fade) → macro (batch compress) — a full memory quality system.

Summary

Quite a bit of ground. Our meeting message’s path through PowerMem:

"Q2 requirements review, Friday 3 PM, Conference Room B" enters the system
  ↓
Importance scoring → "How important?" → 0.72
  ↓
Classification → "Which layer?" → short_term
  ↓
Parameter init → "Set the decay clock" → decay rate + review schedule locked in
  ↓
Time passes → retention falls
  ↓
Accessed → "Still alive? Promote?" → passes; promoted to long_term
  ↓
Searched → "Where in results?" → recency × relevance
  ↓
Global optimization → "Duplicates? Compress?" → dedup and merge

The through-line: use finite storage for the highest-value signal, and keep retrieval SNR high.

Six mechanisms, six jobs:

Importance scoring — what to remember
Classification — how long to remember
Decay — when to evict
Access triggers — dynamic adjustment
Search weighting — how to find it
Global optimization — how to stay lean

What we find most interesting about PowerMem: forgetting is not a binary “delete or not” afterthought. Decay begins at write time — but decay is a continuous weight, not deletion. What actually decides fate is whether the memory is touched again. That’s uncomfortably close to how human memory behaves.

Forgetting is not a patch on top of memory systems. It is a core design dimension across the entire lifecycle.

Based on PowerMem v1.1.1 source analysis; code references reflect the actual project.
