DEV Community

Sarthak Rawat

Why AI Assistants Have an Amnesia Problem. And How We're Fixing It

A deep dive into HGVM: the memory architecture that makes AI agents remember like humans, forget like humans, and learn like humans.


There is a moment every frequent AI user has experienced. You spent twenty minutes in a conversation explaining your project, your constraints, your preferences, your stack. The assistant was helpful, precise, contextually aware. Then you closed the tab.

The next day, you open a new conversation. The assistant greets you like a stranger.

You explain everything again.

This is not a minor inconvenience. It is a fundamental architectural failure, and it is not the only one. On the other side of the spectrum are systems that remember everything forever: every offhand comment, every temporary preference, every piece of stale context from six months ago. These systems don't forget, but they accumulate noise until their memory becomes a liability. They surface outdated information as confidently as current facts. They become sycophantic, anchored to what you said before rather than what is true now.

Neither model is how memory should work. Neither model is how human memory works.

We built HGVM (Hierarchical Graph-Vector Memory) to fix this. This post explains what it is, how it works, and why we think it represents a meaningfully different approach to AI agent memory.


The Human Memory Analogy

Human memory is not a database. It does not store everything with equal fidelity and retrieve it on demand. It is a dynamic, adaptive system that continuously organizes, compresses, reinforces, and discards information based on relevance, recency, and repeated exposure.

When you learn something new, your brain does not simply file it. It connects it to existing structures. It consolidates related fragments into coherent abstractions during sleep. It strengthens memories that are repeatedly accessed and allows others to fade. And crucially, it reflects: it notices patterns across experiences and crystallizes them into durable knowledge that shapes future reasoning.

HGVM is built on this intuition. Memory should be a managed resource, not a growing pile. It should be structured hierarchically, retrieved selectively, compressed into abstractions over time, and actively pruned when it is no longer relevant. Four behaviors, working together.


The Four Pillars of HGVM

1. Hierarchical Structure

Most existing memory systems store memories as a flat collection of text chunks with embeddings. When you query, you run a vector similarity search across all of them and return the top results. This works adequately at small scale, but it fails in two important ways as the memory grows.

First, it causes context leakage: semantically similar but contextually irrelevant memories surface because the retrieval system has no way to distinguish between "similar in meaning" and "relevant to this situation." A question about your Python project might surface memories about a Python script you wrote for a completely different purpose two months ago.

Second, it is expensive. The cost of an exhaustive vector search grows linearly with the number of stored memories. The more you remember, the slower and noisier retrieval becomes.

HGVM organizes memory into a strict four-layer hierarchy:

Domain → Category → Topic → Episode

A Domain is the broadest grouping: Work, Health, Personal, Finance. A Category is a sub-domain: ProjectX, Morning Runs, Family. A Topic is a coherent theme within a category: Backend API Design, Sprint Planning, Dad's Hospital Visit. An Episode is a single atomic memory item attached to a topic.
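
The four layers can be pictured as nested containers. The following is a minimal sketch, assuming plain in-memory dataclasses; the class names, the `MemoryTree` helper, and the example episode text are hypothetical and are not taken from the HGVM schema.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    text: str
    memory_class: str = "observation"  # one of the five classes described later


@dataclass
class Topic:
    name: str
    episodes: list = field(default_factory=list)


@dataclass
class Category:
    name: str
    topics: dict = field(default_factory=dict)


@dataclass
class Domain:
    name: str
    categories: dict = field(default_factory=dict)


class MemoryTree:
    """Hypothetical container enforcing Domain -> Category -> Topic -> Episode."""

    def __init__(self):
        self.domains = {}

    def add_episode(self, domain, category, topic, episode):
        # Create each missing layer on the way down, then attach the episode.
        d = self.domains.setdefault(domain, Domain(domain))
        c = d.categories.setdefault(category, Category(category))
        t = c.topics.setdefault(topic, Topic(topic))
        t.episodes.append(episode)
        return t


tree = MemoryTree()
tree.add_episode("Work", "ProjectX", "Backend API Design",
                 Episode("Illustrative episode text"))
```

An episode can only be reached through its topic, so every memory carries its full path as context.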

[Figure: Core Architecture Diagram]

When a query arrives, the system first routes at the Domain and Category level (a cheap semantic match against summaries) and then runs vector search only within the relevant sub-graph. The candidate set is small. Retrieval is precise. Context leakage is dramatically reduced because irrelevant branches are never even considered.
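
The route-then-search idea can be sketched in a few lines. This is an illustration under stated assumptions, not the HGVM retrieval pipeline: embeddings are toy two-dimensional lists, each branch is a hypothetical dict holding one summary vector and its episodes, and routing picks a single best branch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def routed_search(query_vec, branches, top_k=3):
    """Pick the branch whose summary best matches, then rank only its episodes."""
    best = max(branches, key=lambda b: cosine(query_vec, b["summary_vec"]))
    ranked = sorted(best["episodes"],
                    key=lambda e: cosine(query_vec, e[1]),
                    reverse=True)
    return best["path"], [text for text, _ in ranked[:top_k]]


# Toy data: two Domain/Category branches with made-up vectors.
branches = [
    {"path": "Work/ProjectX", "summary_vec": [1.0, 0.0],
     "episodes": [("sprint note", [0.6, 0.8]), ("api note", [1.0, 0.0])]},
    {"path": "Personal/Scripts", "summary_vec": [0.0, 1.0],
     "episodes": [("old script note", [0.1, 0.9])]},
]
path, hits = routed_search([0.9, 0.1], branches, top_k=2)
```

Episodes under `Personal/Scripts` are never scored for this query, which is the mechanism that limits context leakage.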

This is the structural foundation everything else builds on.


2. Active Forgetting

Forgetting is not a failure mode. It is a feature.

The reason most AI memory systems remember everything forever is that forgetting feels dangerous: what if you delete something important? The result is a system that accumulates noise indefinitely, where the signal-to-noise ratio of the memory store degrades over time and retrieval quality degrades with it.

HGVM takes a different position: forgetting should be tiered by memory type, not uniform. Not all memories deserve the same treatment.

We define five memory classes, each with its own decay behavior:

| Memory Class | What It Stores | Decay Rate |
|---|---|---|
| permanent | Explicit identity facts ("I am vegetarian", "never forget this") | Never decays |
| semantic | Consolidated summaries and learned patterns | Very slow |
| preference | Stable user preferences | Very slow |
| observation | Raw observations and task outcomes | Medium |
| task_state | Active work state and in-progress details | Fast |

[Figure: Memory Class Spectrum]

A permanent memory (something the user explicitly told the system to remember) never decays. A preference such as "I prefer TypeScript" fades very slowly, remaining useful for months. A task_state memory such as "currently blocked on the auth middleware" fades quickly, becoming stale within days because task context changes fast.

The decay formula itself is borrowed from cognitive science. We use an exponential decay function where the strength of each memory degrades over time since last access:

new_strength = current_strength × exp(−λ × days_since_access)

Each memory class has its own λ value. When strength falls below a minimum threshold, the memory is soft-deleted: marked as invalid but preserved in the graph for auditability. Nothing is ever hard-deleted during forgetting.
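
A small sketch of the decay step, using the formula above. The λ values and the minimum threshold here are hypothetical placeholders chosen only to match the "never / very slow / medium / fast" ordering in the table; they are not HGVM's tuned parameters.

```python
import math

# Hypothetical per-day decay rates, one per memory class.
DECAY_LAMBDA = {
    "permanent": 0.0,
    "semantic": 0.005,
    "preference": 0.005,
    "observation": 0.05,
    "task_state": 0.3,
}
MIN_STRENGTH = 0.1  # hypothetical soft-delete threshold


def decay(strength, memory_class, days_since_access):
    """new_strength = current_strength * exp(-lambda * days_since_access)"""
    return strength * math.exp(-DECAY_LAMBDA[memory_class] * days_since_access)


def apply_forgetting(memory, days_since_access):
    """Decay one memory dict in place; soft-delete it if it falls below threshold."""
    memory["strength"] = decay(memory["strength"], memory["memory_class"],
                               days_since_access)
    if memory["strength"] < MIN_STRENGTH:
        memory["valid"] = False  # marked invalid, never removed from the store
    return memory
```

Under exponential decay the half-life of a memory is ln(2)/λ, so the placeholder rate of 0.3 for `task_state` corresponds to a half-life of roughly 2.3 days, while a rate of 0 leaves `permanent` memories untouched.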

Retrieval is reinforced too: every time a memory is returned in a query, its strength is boosted. Memories that are repeatedly useful stay strong. Memories that are never relevant fade gracefully.
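
The post does not specify how large the boost is, so the following is only one possible shape for reinforcement: a fixed additive bump capped at a maximum, plus a refreshed access timestamp so the decay clock restarts. `BOOST` and `MAX_STRENGTH` are assumptions.

```python
from datetime import datetime, timezone

MAX_STRENGTH = 1.0  # assumed upper bound on strength
BOOST = 0.2         # assumed additive boost per retrieval


def reinforce(memory, now=None):
    """Strengthen a memory that was just returned by a query."""
    now = now or datetime.now(timezone.utc)
    memory["strength"] = min(MAX_STRENGTH, memory["strength"] + BOOST)
    memory["last_accessed"] = now  # resets days_since_access for the next decay run
    return memory
```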

[Figure: Forgetting Curve Comparison]

This is the forgetting curve most existing systems ignore. HGVM embraces it as a design principle.


3. Consolidation

A topic with fifty raw episode nodes is not useful. Fifty individually embedded fragments retrieved piecemeal create noisy, redundant context. At some point, accumulated raw memories need to be compressed into something denser and more useful.

This is consolidation.

HGVM runs a scheduled consolidation pipeline that works as follows: when a topic accumulates enough raw episodes, the system clusters them by semantic similarity, extracts atomic facts from each cluster, generates a concise summary from those facts, and then runs a batched verification pass to confirm the summary is faithful to the source material.
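
The clustering step can be illustrated with a deliberately simple stand-in. The post does not say which clustering algorithm HGVM uses; the sketch below swaps in a greedy threshold method over cosine similarity, with toy vectors and a hypothetical `threshold` parameter.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_episodes(episodes, threshold=0.8):
    """Greedy clustering: join the first cluster whose seed vector is similar enough.

    episodes is a list of (text, vector) pairs; returns a list of text lists.
    """
    clusters = []  # each cluster keeps the vector of its first member as the seed
    for text, vec in episodes:
        for cluster in clusters:
            if cosine(vec, cluster["seed"]) >= threshold:
                cluster["items"].append(text)
                break
        else:
            clusters.append({"seed": vec, "items": [text]})
    return [cluster["items"] for cluster in clusters]


groups = cluster_episodes([("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])])
```

Each resulting group would then be passed to fact extraction and summarization, which require an LLM and are not shown here.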

The verification step is non-negotiable. We check for three failure conditions: missing facts (something in the source that did not make it into the summary), altered facts (something that changed in meaning), and contradicted facts (something that directly conflicts with the source). If any of these three lists is non-empty, the consolidation is rejected and logged. No summary episode is created until verification passes completely.

When verification passes, the system creates a new semantic episode with subtype consolidation_summary, linked to the source episodes via SUMMARIZES relationships. Critically, the source episodes are never deleted. Consolidation adds compressed knowledge — it does not destroy provenance.
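
The all-or-nothing gate described above can be sketched as follows. The `VerificationReport` type, the `commit_summary` function, and the list-based `store` and `rejections` arguments are hypothetical; only the three failure categories, the `consolidation_summary` subtype, and the SUMMARIZES link come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class VerificationReport:
    missing: list = field(default_factory=list)
    altered: list = field(default_factory=list)
    contradicted: list = field(default_factory=list)

    @property
    def passed(self):
        # Verification passes only when all three lists are empty.
        return not (self.missing or self.altered or self.contradicted)


def commit_summary(summary, source_ids, report, store, rejections):
    """Persist a summary episode only if verification passed; otherwise log it."""
    if not report.passed:
        rejections.append({"summary": summary, "report": report})
        return None
    episode = {
        "memory_class": "semantic",
        "subtype": "consolidation_summary",
        "text": summary,
        "summarizes": list(source_ids),  # stands in for SUMMARIZES edges
    }
    store.append(episode)  # source episodes are left untouched
    return episode
```

A single altered fact is enough to block the write, so a partially faithful summary never reaches the graph.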

[Figure: Retrieval Pipeline Flow]

The retrieval system then naturally prefers the summary for most queries because it is denser, more representative, and semantically richer, while the raw episodes remain available for queries that need granular detail.

[Figure: Consolidation Before/After]

This mirrors how human memory consolidates during sleep: the raw experiences remain somewhere in the system, but what you access day-to-day is the compressed, organized version.


4. Reflection

Consolidation compresses. Reflection learns.

Reflection is the most novel component of HGVM. While consolidation creates summaries within a single topic, reflection operates across topics and sessions, looking for higher-order patterns in behavior, preference, and reasoning.

Imagine a user who across ten sessions has: rejected Java twice, chosen Python for three consecutive projects, consistently used FastAPI over Django, and complained about verbose enterprise frameworks. No single episode says "this user values developer productivity." But the pattern across episodes does.

Reflection detects this. It fetches recent valid episodes from the non-reflective memory classes (permanent, preference, task_state, and observation), groups them into coherent evidence bundles, and generates candidate higher-order abstractions. These are persisted as semantic episodes with subtype reflection_pattern, linked to their supporting evidence.

The result: the next time the user asks for a project scaffold recommendation, the system does not just recall that they used FastAPI before. It retrieves the reflection, "this user prioritizes developer productivity over ecosystem conservatism", and reasons from that durable insight.

[Figure: Reflection Explained]

There is one hard constraint we enforce at the architecture level: reflection memories can never be used as evidence for future reflections. This prevents a feedback loop where the system reflects on its own abstractions, compounding distortions over time. Reflection operates only on primary evidence. This constraint is enforced at the database query level, not just the prompt level.
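
In HGVM this constraint lives in the database query. As an in-memory illustration of the same rule, the hypothetical filter below admits only valid episodes from the four non-reflective classes and explicitly excludes anything tagged `reflection_pattern`; the dict layout is an assumption.

```python
NON_REFLECTIVE_CLASSES = {"permanent", "preference", "task_state", "observation"}


def select_reflection_evidence(episodes):
    """Return only primary evidence: valid, non-reflective episodes."""
    return [
        e for e in episodes
        if e["valid"]
        and e["memory_class"] in NON_REFLECTIVE_CLASSES
        and e.get("subtype") != "reflection_pattern"
    ]


episodes = [
    {"id": 1, "memory_class": "preference", "valid": True},
    {"id": 2, "memory_class": "semantic", "subtype": "reflection_pattern", "valid": True},
    {"id": 3, "memory_class": "observation", "valid": False},
    {"id": 4, "memory_class": "task_state", "valid": True},
]
evidence_ids = [e["id"] for e in select_reflection_evidence(episodes)]
```

Because the whole semantic class is excluded, consolidation summaries are filtered out along with earlier reflections, so the reflection step only ever sees raw evidence.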


How HGVM Compares to What Exists

The agent memory landscape has become more sophisticated in the last two years. It is worth being honest about where existing systems land and where HGVM goes further.

MemGPT / Letta introduced the idea of paging between active context and external memory using function calls, and Letta added sleep-time consolidation agents. HGVM builds directly on these ideas but adds hierarchical routing, tiered forgetting, and the reflection pipeline.

Mem0 offers a clean multi-signal retrieval system (semantic + keyword + entity) with good API design. It does not offer hierarchical memory organization, tiered decay, or reflection.

Zep / Graphiti provide excellent graph-based memory with temporal reasoning and the bitemporal model (facts carry both a valid time and a transaction time). HGVM adopts the bitemporal approach for contradiction handling and uses Graphiti's valid_at / invalid_at pattern for soft deletion.
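
To make the bitemporal idea concrete, here is a minimal sketch. The `valid_at` and `invalid_at` names follow the pattern mentioned above; the `created_at` field as the transaction-time marker, the `Fact` class, and the `supersede` helper are assumptions for illustration and are not Graphiti's or HGVM's API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    text: str
    valid_at: datetime                     # valid time: when the fact became true
    created_at: datetime                   # transaction time: when it was recorded
    invalid_at: Optional[datetime] = None  # set on soft deletion or contradiction

    def is_valid(self, at):
        """True if the fact holds at the given moment in valid time."""
        return self.valid_at <= at and (self.invalid_at is None or at < self.invalid_at)


def supersede(old, new):
    """Resolve a contradiction by closing the old fact's validity window."""
    old.invalid_at = new.valid_at  # old fact is kept for audit, not deleted


t1, t2 = datetime(2024, 1, 1), datetime(2024, 6, 1)
old = Fact("prefers Java", valid_at=t1, created_at=t1)
new = Fact("prefers Python", valid_at=t2, created_at=t2)
supersede(old, new)
```

Queries about the past can still see the old fact, while queries about the present see only the new one.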

ChatGPT Memory is a static injection system: remembered facts are prepended to context. This causes cross-domain leakage (facts from unrelated contexts surface inappropriately) and sycophancy (the system defers to stored preferences even when they are stale or wrong). PersistBench has documented both failure modes.

HGVM is not a critique of any of these systems. Each solved a real problem. HGVM synthesizes the best ideas (graph structure, temporal modeling, consolidation, multi-signal retrieval) and adds forgetting as a first-class feature and reflection as a novel capability.


The Multi-Agent Layer

HGVM is not just a memory system. It is a memory system designed for an agent society.

Four specialized LLM agents share the same memory graph and collaborate on tasks:

  • Planner: memory-first orchestration. Queries memory before planning anything. Owns task-state memory. Decides what gets persisted after each turn.
  • Executor: task execution. Writes observation memories. Lower trust than Critic for factual claims.
  • Critic: factual review and contradiction correction. Highest trust for permanent and preference memories. Can invalidate incorrect stored facts.
  • Reflection: scheduled background agent. Generates semantic patterns from non-reflective evidence. Does not participate in the synchronous chat loop.

[Figure: Agent Society]

A critical design decision: no agent writes directly to the memory graph. All persistent memory writes go through a single MemoryManager service. This enforces a trust hierarchy; when Planner and Critic disagree about a stored fact, the Critic's version wins for permanent and preference memories. When write conflicts occur simultaneously, both versions are preserved with provenance tags until a resolution signal arrives.
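
A rough sketch of a single write path with these two rules. Everything below is hypothetical apart from the rules themselves: the class shape, the dict-based versions, and the `superseded` flag are illustrative, and the resolution signal for preserved conflicts is not modeled.

```python
CRITIC_AUTHORITATIVE = {"permanent", "preference"}  # classes where the Critic wins


class MemoryManager:
    """Hypothetical single entry point for all persistent memory writes."""

    def __init__(self):
        self.versions = {}  # fact key -> list of version dicts with provenance

    def write(self, agent, key, value, memory_class):
        new = {"agent": agent, "value": value,
               "memory_class": memory_class, "superseded": False}
        existing = self.versions.setdefault(key, [])
        if memory_class in CRITIC_AUTHORITATIVE:
            critic_holds = any(v["agent"] == "critic" and not v["superseded"]
                               for v in existing)
            if agent == "critic":
                for v in existing:
                    v["superseded"] = True  # Critic overrides earlier versions
            elif critic_holds:
                new["superseded"] = True    # lower-trust write cannot override
        # Other classes: keep every version active until a resolution arrives.
        existing.append(new)
        return new

    def current(self, key):
        return [v["value"] for v in self.versions.get(key, [])
                if not v["superseded"]]


mm = MemoryManager()
mm.write("planner", "diet", "vegan", "preference")
mm.write("critic", "diet", "vegetarian", "preference")
mm.write("planner", "status", "blocked", "observation")
mm.write("executor", "status", "unblocked", "observation")
```

Superseded versions stay in the list with their `agent` tag, so the history of who wrote what remains inspectable.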

This means the memory is not just accurate; it is accountable. Every episode knows which agent created it, when it was created, and whether it was ever superseded.


What the System Looks Like

The HGVM dashboard is built to make memory formation visible in real time. The Memory Graph page shows the full hierarchy as an interactive graph on a dark background: Domain hexagons in teal, Category rectangles, Topic circles with summary previews, and Episode nodes color-coded by strength (bright cyan for strong memories, yellow for medium, orange for fading).

New episodes appear with a fade-in animation. Strength color transitions animate as decay runs. When consolidation fires on a topic, a brief pulse effect signals the compression happening. Reflection outputs appear as double-ring nodes, visually distinct from raw and summary episodes.

A live Memory Activity Feed alongside the chat interface shows every memory operation in real time: green for ADD, blue for QUERY, red for INVALIDATE, purple for CONSOLIDATE, teal for REFLECT, grey for FORGET. You can watch the system's memory evolve turn by turn.

An Analytics page tracks memory growth by class over time, token savings from compression, consolidation verification health, reflection utilization rates, retrieval configuration, and topic reuse effectiveness.

The goal is to make memory legible, not just to researchers, but to anyone using the system.


The Hard Problems We Have Not Solved

We want to be clear about what remains genuinely difficult, because intellectual honesty matters more than clean marketing.

Decay parameter tuning. The λ values for each memory class are educated starting points informed by cognitive science and ablation experiments. The right values for a specific deployment domain (medical, coding, creative writing) likely differ. Universal optimal values do not exist yet.

Machine unlearning. GDPR's right to be forgotten is straightforward for the graph: a cascade deletion removes all of a user's data atomically. It is not straightforward for the LLM itself. If an LLM has been exposed to a user's memories in-context during inference, traces of that exposure may persist in ways that are not cleanly deletable. Machine unlearning for in-context exposure is still an open research problem.

Semantic conflict resolution at scale. When two agents write contradictory facts at the same timestamp, the trust hierarchy handles it. But there are edge cases (domain-dependent conflicts, partial contradictions, nuanced updates that partially overlap with stored facts) where the right resolution depends on context that no static rule captures. We use a conservative approach (preserve both, resolve later), but this is a heuristic, not a solution.

Reflection quality at low evidence density. When a user is new and has generated few episodes, reflection has little to work with. Thin-evidence reflections risk being generic or incorrect. We suppress reflections with insufficient supporting evidence, but the threshold is calibrated manually rather than learned.
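
The suppression rule amounts to a simple gate on evidence count. In the sketch below the threshold value, the function name, and the candidate dict layout are all assumptions; the post only states that the real threshold is calibrated manually.

```python
MIN_SUPPORTING_EPISODES = 3  # hypothetical, manually chosen threshold


def keep_reflection(candidate):
    """Keep a candidate reflection only if enough distinct episodes support it."""
    return len(set(candidate["evidence_ids"])) >= MIN_SUPPORTING_EPISODES


candidates = [
    {"text": "well-supported pattern", "evidence_ids": [1, 2, 3, 4]},
    {"text": "thin pattern", "evidence_ids": [5, 5]},
]
kept = [c["text"] for c in candidates if keep_reflection(c)]
```

Counting distinct IDs keeps a single episode cited twice from passing as two pieces of evidence.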

These are real limitations. We name them because they are the honest frontier of this problem space, and because solving them is where the interesting research goes next.


What We Are Building Toward

HGVM v2.0 is being built for the Global AI Hackathon Series with Qwen Cloud on a tight deadline, but the ideas behind it are not hackathon ideas. The question of how AI agents should manage long-term memory is one of the most practically important open questions in applied AI right now.

Every production AI assistant deployment faces this problem. Every enterprise deploying agents across thousands of users faces this problem. Every personal assistant that fails to remember what you told it last week fails because of this problem.

We think the right direction involves all four pillars working together: hierarchical structure for precision, tiered forgetting for noise control, consolidation for compression, and reflection for higher-order learning. Not as research curiosities but as production-grade engineering.

The full architecture, schema, pipeline specifications, and implementation plan are documented internally and will be shared progressively as the system matures. The dashboard will be publicly demoed when it is stable.

If you are working on agent memory, long-term personalization, or multi-agent coordination, we would genuinely like to hear what you think. The hard problems above are not going to be solved alone.


HGVM is being built as part of a focused 30-day engineering sprint. Architecture is frozen. Implementation is active.

If this resonates with work you are doing, reach out.

