김이더

To Teach AI How to Remember, First Teach It How to Forget

Code on GitHub. Paper on arXiv.
More posts at radarlog.kr.


I once asked ChatGPT about a conversation we had three days earlier. It had no idea. Tried Claude too. Same thing. Once the conversation ends, the memory vanishes.

But humans remember conversations from three days ago. Well, the important ones. You forget what you ate for lunch yesterday, but you remember your friend saying they're switching jobs. Memories that get recalled often stick around. Memories that never get pulled out fade naturally.

MemoryBank transplants this exact principle into LLMs.

What Is MemoryBank

It's a long-term memory mechanism for LLMs, built by Wanjun Zhong and colleagues at Sun Yat-sen University. The paper was accepted at AAAI 2024, and the full code is open-sourced on GitHub. 419 stars. MIT license.

The core idea is simple. In 1885, German psychologist Hermann Ebbinghaus discovered the forgetting curve — a mathematical model of how memory decays over time. MemoryBank applies this to AI memory systems.

R = e^(−t/S)

R is memory retention, t is elapsed time, S is memory strength. When you first learn something, S starts at 1. Over time, R drops sharply. Ebbinghaus found that 42% of new information is forgotten within 20 minutes, and 67% after a day. He tested this on himself with nonsense syllables.

Here's the key insight.

When a memory gets recalled even once, S increases by 1 and t resets to 0. That memory survives longer. Frequently recalled memories become progressively harder to forget, while untouched memories decay quickly.
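To get a feel for the numbers, here's a quick sketch of the curve (time in arbitrary units, following the paper's simplified model where S starts at 1 and each recall adds 1):

```python
import math

def retention(t, S):
    """R = e^(-t/S): retention after elapsed time t at strength S."""
    return math.exp(-t / S)

# A fresh memory (S = 1) decays fast over 1, 2, 3 time units:
print([round(retention(t, 1), 3) for t in (1, 2, 3)])  # [0.368, 0.135, 0.05]

# After a single recall (S = 2), the same elapsed times decay far slower:
print([round(retention(t, 2), 3) for t in (1, 2, 3)])  # [0.607, 0.368, 0.223]
```

One recall nearly doubles retention at every point on the curve, which is exactly the "extends the timeout window" behavior described below.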

If you've worked on game servers, this feels like session timeout logic. If a user doesn't connect, the session expires. Every connection resets the timer. Except MemoryBank doesn't just reset the timer — it extends the timeout window itself. Each reconnection makes the session last even longer.

Three Pillars

MemoryBank's architecture splits into three components: Memory Storage, Memory Retrieval, and Memory Updating. If you're a game developer, think ECS pattern — data component, read system, update system. Clean separation of concerns.

Memory Storage saves raw conversations with timestamps. But it doesn't just pile up chat logs. It uses LLM calls to generate daily event summaries, global event summaries, and user personality profiles — all maintained hierarchically. "The user talked about career doubts last Monday" sits alongside "The user is introverted and growth-oriented."

In Unreal Engine terms, it's like the SaveGame system. You keep the raw data intact while serializing key states separately. Later, you can reconstruct context from summaries alone without loading everything.

Memory Retrieval uses FAISS-based vector search. Every conversation turn and event summary is encoded into a vector by an encoder model (MiniLM for English, Text2vec for Chinese). When a new message comes in, the current context gets vectorized and matched against the FAISS index to find the most relevant memories. The whole pipeline is built on LangChain.

# When a new message arrives, embed the current context
query_vector = encoder.encode([current_context])  # shape (1, dim)

# Search FAISS for the top-k most relevant memories
# (FAISS returns distances and indices, not the memories themselves)
distances, indices = faiss_index.search(query_vector, 5)
relevant_memories = [memory_store[i] for i in indices[0]]

# Inject retrieved memories + user profile + event summary into the prompt
prompt = build_prompt(relevant_memories, user_portrait, event_summary)
response = llm.generate(prompt)

The beauty here is that you don't have to cram the entire conversation history into the context window. Even Claude's 200K token window fills up fast in long conversations. MemoryBank cherry-picks only the relevant memories for the prompt, so token efficiency is much better.

Memory Updating is the heart of this whole thing. It applies the forgetting curve formula to every memory piece, calculating retention R. When R drops below a threshold, that memory gets removed or weakened. Memories recalled during conversations get their S bumped up and t reset, so they survive.

import math

THRESHOLD = 0.1  # retention below this marks the memory as forgotten

def calculate_retention(t, S):
    """Memory retention R = e^(-t/S)."""
    return math.exp(-t / S)

def update_memory(memory_item, recalled=False):
    """Update a memory piece; t is time elapsed since the last recall."""
    if recalled:
        memory_item['S'] += 1   # recall increases memory strength
        memory_item['t'] = 0    # and resets elapsed time

    R = calculate_retention(memory_item['t'], memory_item['S'])

    if R < THRESHOLD:  # retention has fallen below the threshold
        memory_item['status'] = 'forgotten'

    return memory_item

It's simple. That simplicity is the point. The authors explicitly state this is "an exploratory and highly simplified memory updating model." Real human memory is far more complex, but for LLM memory purposes, this level of simplification is effective enough.

SiliconFriend — Memory Is the Prerequisite for Empathy

SiliconFriend is the chatbot built on top of MemoryBank. It's not just memory bolted on — they also fine-tuned it with 38k psychological counseling dialogues using LoRA. Rank 16, 3 epochs on a single A100.

Why psychological data? Because memory and empathy can't be separated. To ask "You mentioned thinking about switching jobs last time — how did that go?", two things are needed: remembering the job talk, and empathizing naturally when bringing it up. MemoryBank handles the former. The psychological fine-tuning handles the latter.

The experiments make this clear. Base ChatGLM gives textbook comfort when you say "I'm having a rough time." SiliconFriend adjusts its response based on personality profiles built from past conversations. Cautious approach for introverted users, more active engagement for extroverted ones.

The evaluation setup is solid too. ChatGPT role-played 15 virtual users with different personalities, generating 10 days of conversation history. From that history, 194 memory probing questions were created to measure recall accuracy. ChatGPT-based SiliconFriend scored highest, while open-source models (ChatGLM, BELLE) were still competitive on retrieval accuracy — just weaker on response naturalness, reflecting their base model capabilities.

How Is This Different from Current AI Memory

ChatGPT, Claude, Claude Code. All three have their own memory systems. But the approaches are fundamentally different.

ChatGPT pre-computes conversation summaries and injects them into every chat. It's automatic. No user effort needed. But summaries lose nuance through compression.

Claude takes the opposite approach. Memory tools are on-demand. You can search past conversations, but Claude has to decide "I should search now" for it to work. If it doesn't think to look, context stays buried. The trade-off: when it does search, it pulls from raw conversations, so depth is there.

Claude Code uses CLAUDE.md files. Write project context in markdown, and it auto-loads at session start. Transparent and editable, but performance degrades as files grow. There's a 200-line index limit too.

None of them have a forgetting mechanism. That's the critical difference from MemoryBank.

ChatGPT accumulates summaries indefinitely. Claude's conversation history keeps growing. Claude Code's CLAUDE.md gets bloated unless you manually prune it. If nothing ever gets forgotten, memory ironically becomes useless. When everything has equal weight, the truly important memories get harder to find quickly.

MemoryBank introduces "forgetting" into this picture. Old, never-recalled memories naturally disappear. Only frequently recalled memories get reinforced. The result: your memory store contains only what actually matters. This is also a performance optimization — smaller FAISS index, better retrieval accuracy.

How to Plug It Into Your Project

You can use MemoryBank directly, or just borrow the core ideas. Two paths.

Path 1: Use the repo as-is. Clone from GitHub, run pip install -r requirement.txt, set up your OpenAI API key. The ChatGPT-based SiliconFriend is the easiest to get running. Put your API key in SiliconFriend-ChatGPT/launch.sh and run it. --language=en for English, --language=cn for Chinese.

If you want open-source models, you need to set up ChatGLM or BELLE as the base, then download LoRA checkpoints. Requires an A100 80GB environment. A bit heavy for personal projects.

Path 2: Transplant the core mechanism into your own code. This is the practical route. You need three things.

First, a storage layer that saves conversations with timestamps. JSON works fine. At the end of each conversation, call an LLM API to generate daily summaries and personality summaries.

memory_storage = {
    "conversations": [
        {
            "timestamp": "2026-04-13T14:30:00",
            "role": "user",
            "content": "I'm stuck on this UE5 Slate widget layout",
            "S": 1,  # Memory strength (initial)
            "t": 0   # Elapsed time
        }
    ],
    "daily_summaries": {},
    "user_portrait": "Game dev, UE5 C++ specialist, introverted, problem-solving oriented"
}

Second, embedding + vector search. Use sentence-transformers for embedding, FAISS for indexing. With LangChain, this pipeline takes a few lines.
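To keep the idea concrete without heavy dependencies, here's the retrieval step sketched in pure Python. A toy count-vector over a hypothetical fixed vocabulary stands in for MiniLM embeddings, and brute-force cosine similarity stands in for the FAISS index; in a real build you'd swap in sentence-transformers and faiss as described above.

```python
import math

# Toy stand-ins: VOCAB and embed() replace a real encoder (e.g. MiniLM),
# and the sorted() call replaces a FAISS nearest-neighbor search.
VOCAB = ["job", "switch", "ue5", "widget"]

def embed(text):
    """Count-vector over a fixed toy vocabulary (prefix match per word)."""
    words = text.lower().split()
    return [float(sum(w.startswith(v) for w in words)) for v in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

memories = [
    "Stuck on a UE5 widget layout",
    "Thinking about a job switch",
]
query = "How did the job switch go?"
ranked = sorted(memories, key=lambda m: cosine(embed(query), embed(m)),
                reverse=True)
print(ranked[0])  # the job-switch memory is the closest match
```

The structure mirrors the real pipeline exactly: embed everything once, embed the query, rank by similarity, take the top k.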

Third, forgetting curve-based updates. Once a day, or before each conversation starts, sweep through all memories and calculate R. Remove anything below a threshold (say, 0.1). During conversations, bump S and reset t for any retrieved memory.
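A minimal sketch of that sweep, assuming each memory dict carries its strength S and a last_recalled timestamp (the field names and the 0.1 threshold are illustrative choices, not from the paper):

```python
import math
from datetime import datetime, timedelta

THRESHOLD = 0.1  # hypothetical cutoff; tune to your decay timescale

def sweep(memories, now):
    """Daily sweep: keep only memories whose retention is above THRESHOLD."""
    kept = []
    for m in memories:
        t = (now - m['last_recalled']) / timedelta(days=1)  # elapsed days
        if math.exp(-t / m['S']) >= THRESHOLD:
            kept.append(m)
    return kept

def on_recall(memory, now):
    """A memory retrieved during conversation gets stronger; its clock resets."""
    memory['S'] += 1
    memory['last_recalled'] = now

now = datetime(2026, 4, 13)
memories = [
    {'content': 'job switch talk', 'S': 3, 'last_recalled': now - timedelta(days=4)},
    {'content': 'lunch order',     'S': 1, 'last_recalled': now - timedelta(days=4)},
]
survivors = sweep(memories, now)
print([m['content'] for m in survivors])  # only 'job switch talk' survives
```

Both memories are four days old, but the often-recalled one (S=3, retention ≈ 0.26) clears the bar while the one-off (S=1, retention ≈ 0.02) gets swept away.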

Combining these three gives you long-term memory for any chatbot or AI agent. Especially effective for services with repeated user interactions — AI tutors, AI coaches, customer support bots, game NPCs.

Imagine plugging this into game NPCs. A player tells an NPC about their adventures multiple times — the NPC remembers longer each time. A passing conversation the player had once? The NPC forgets it. Pretty natural behavior.
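The NPC scenario fits in a few lines. This is a toy sketch (class and field names are mine, with days as the time unit): repetition bumps S, a daily tick ages everything, and low-retention facts drop out.

```python
import math

class NPCMemory:
    """Toy NPC memory with forgetting-curve decay (time unit: days)."""
    def __init__(self, threshold=0.1):
        self.threshold = threshold
        self.items = {}  # fact -> {'S': strength, 'age': days since last told}

    def hear(self, fact):
        entry = self.items.setdefault(fact, {'S': 0, 'age': 0})
        entry['S'] += 1   # repetition strengthens the memory
        entry['age'] = 0  # and resets its clock

    def pass_day(self):
        for entry in self.items.values():
            entry['age'] += 1
        # Forget anything whose retention e^(-t/S) fell below the threshold
        self.items = {f: e for f, e in self.items.items()
                      if math.exp(-e['age'] / e['S']) >= self.threshold}

npc = NPCMemory()
npc.hear("player slew the cave troll")        # mentioned once in passing
for _ in range(3):
    npc.hear("player seeks the lost amulet")  # retold three times
for _ in range(5):
    npc.pass_day()
print(list(npc.items))  # only the oft-repeated fact survives
```

After five in-game days the one-off remark is gone (retention e^-5 ≈ 0.007) while the retold quest (S=3, retention ≈ 0.19) is still remembered.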

Limitations and Caveats

MemoryBank is validated research accepted at AAAI 2024, but it has clear limitations.

Incrementing S as a simple integer doesn't reflect reality. Human memory strength is influenced by emotional significance, sleep, stress, and other variables. MemoryBank ignores all of these and decides S based solely on recall count. The authors explicitly acknowledge this.

Another thing. Memory summarization and personality profiling require LLM API calls. Calling the summarization API after every conversation means costs accumulate as conversations increase. For production deployment, you'd need to adjust summarization frequency or offload the summary layer to a local lightweight model.

Finally, the base repo hasn't seen major updates since its May 2023 release. The ChatGLM 6B and BELLE 7B base models are dated at this point. But the architecture itself is model-agnostic. You can plug it into Claude, GPT-4o, Gemma, Llama — anything. The point isn't the model. It's the memory mechanism.


In the next post, I'll break down the R = e^(−t/S) formula mathematically. What happens to the curve when S goes from 1 to 5, where to set the threshold, and a simulation to see it all in action.

"Perfect memory isn't memory at all. Only memory that knows how to forget is real."

Top comments (1)

Ali Muwwakkil

It's interesting how the challenge isn't just teaching AI to remember, but also ensuring it forgets strategically. In our experience with enterprise teams, we found that implementing AI memory architectures like RAG (Retrieval-Augmented Generation) can drastically improve performance by focusing on contextually relevant data and discarding unnecessary information. This approach helps to align AI outputs more closely with real-world needs, optimizing both accuracy and efficiency. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)