How we modeled human memory consolidation to build a robust memory pipeline for a 20B-parameter local LLM agent — and achieved 100% on a 53-question test suite.
1. Problem Definition
We built a local AI agent (Androi) that needs to remember user facts across conversations.
When a user says "My name is Namhyun" or "My hobby is hiking," the agent must recall and use these facts in future sessions.
The initial implementation was simple:
User message → LLM extracts key|value → Stored directly in LTM (permanent)
Out of 53 end-to-end tests, 8 failed, and memory system issues were the root cause for most.
Key symptoms:
- BMI calculation couldn't find height data
- Salary information wasn't used for price comparisons
- After correcting a hobby, the agent recalled the old value
- Scheduler tool name mismatches (unrelated bug, also fixed)
2. Root Cause Analysis
2-1. LTM Pollution
Every extracted fact went directly to LTM, including transient junk:
weather|Seoul ← one-time search request
time|8AM ← scheduling parameter
weather_alert|cancelled ← task status
When LTM exceeded 30 entries, semantic filtering kicked in — but these junk entries diluted the signal, causing truly important memories (height, weight, salary) to be filtered out.
2-2. Single-Character Key Filter Bug
extract_and_remember had a len(key) < 2 filter that dropped valid single-character Korean keys.
"키" (height in Korean) is 1 character → extracted but never saved. Only "몸무게" (weight, 3 chars) survived.
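The fix is small. Below is a hedged sketch (the function name `should_store` is illustrative, not the actual Androi code) showing why the old guard rejected single-character CJK keys and how dropping the length check fixes it:

```python
# Illustrative sketch of the key-filter fix. The original guard,
# `if len(key) < 2: return False`, silently rejected "키" (height)
# while accepting "몸무게" (weight, 3 chars).
def should_store(key: str) -> bool:
    # After the fix: only empty or whitespace-only keys are rejected,
    # so single-character Korean keys pass through.
    return bool(key.strip())
```

The underlying issue is that a "too short to be meaningful" heuristic tuned on English does not transfer to languages where one codepoint is a full word.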
2-3. auto_retrieve Injection Threshold
When LTM exceeded 15 entries, the system switched to semantic matching. With 20 entries stored, a query like "Calculate my BMI" couldn't match "weight: 72kg" above the 0.3 cosine similarity threshold.
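The retrieval policy can be sketched as follows (function and field names are illustrative, not the real Androi API): below the entry limit the full LTM is injected into context, and only above it does the 0.3-cosine filter take over.

```python
import math

FULL_INJECTION_LIMIT = 30  # raised from 15 in the fix

def cosine(a, b):
    # Plain cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_memories(query_vec, ltm, threshold=0.3):
    # Small store: inject everything, so "weight: 72kg" is always
    # visible to a "Calculate my BMI" query regardless of similarity.
    if len(ltm) <= FULL_INJECTION_LIMIT:
        return ltm
    # Large store: fall back to semantic filtering.
    return [m for m in ltm if cosine(query_vec, m["vec"]) >= threshold]
```

With the limit at 15, a 20-entry store forced semantic matching and the BMI query lost the weight fact; raising the limit to 30 keeps small stores on the always-inject path.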
2-4. Architectural Gap
The intended design was STM→MTM→LTM natural promotion, but extract_and_remember bypassed MTM entirely and wrote directly to LTM. The promotion logic (_try_promote_to_ltm) was effectively dead code.
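The gap can be shown in a few lines. This is a minimal sketch with illustrative names (only `extract_and_remember` and `_try_promote_to_ltm` come from the actual codebase), contrasting the bypass with the intended routing:

```python
# Before (bug): extraction wrote straight to permanent storage,
# so the MTM promotion path never ran.
def extract_and_remember_old(key, value, mtm, ltm):
    ltm[key] = value  # transient facts became permanent

# After (intended design): facts enter MTM first and wait there
# until the promotion logic moves them up.
def extract_and_remember_fixed(key, value, mtm, ltm):
    mtm[key] = {"value": value, "recalls": 0}
```

With the old path, `_try_promote_to_ltm` had nothing to promote because nothing ever lived in MTM long enough to be observed.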
3. Solution Design
3-1. Immediate Bug Fixes
| Issue | Fix |
|---|---|
| Scheduler tool name mismatch | schedule_task → create_task (3 tools) |
| web_search chosen over find_contact | Added "Do NOT use web_search for contacts" to tool description |
| len(key) < 2 filter | Removed minimum key length requirement |
| auto_retrieve threshold | Increased 15 → 30 |
3-2. Architecture Refactoring — Priority-Based MTM→LTM Pipeline
Inspired by human memory consolidation:
- Hippocampus = MTM — short-term memories consolidate to cortex through repetition
- Ebbinghaus Forgetting Curve — unused memories naturally decay
- Emotional significance — blood type, allergies are permanently stored after a single mention
Implementation:
Priority Classification
| Priority | Criteria | Examples | Storage |
|---|---|---|---|
| HIGH | Immutable core identity | Name, birthday, blood type, allergies | Direct to LTM |
| MID | Mutable preferences/state | Hobby, job, residence, salary | MTM → promotion queue |
| LOW | Transient/one-off info | Weather, news, timestamps | Not extracted |
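The table above can be sketched as a classifier. Note this is a hedged illustration: in the real system the classification happens inside the LLM extraction prompt, not in a hard-coded Python lookup, and the key sets below are examples drawn from the table, not an exhaustive list.

```python
# Illustrative priority classifier mirroring the table above.
HIGH_KEYS = {"name", "birthday", "blood_type", "allergy"}
MID_KEYS = {"hobby", "job", "residence", "salary", "height", "weight"}

def classify(key: str) -> str:
    if key in HIGH_KEYS:
        return "HIGH"   # immutable core identity: direct to LTM
    if key in MID_KEYS:
        return "MID"    # mutable preference/state: MTM first
    return "LOW"        # transient: not extracted at all
```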
Promotion & Expiry Flow
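The flow combines recall-count promotion (hippocampus→cortex consolidation) with time-based expiry (the forgetting curve). A minimal sketch, assuming a 72-hour MTM lifetime and promotion after 3 recalls — both numbers are placeholders, not the system's actual constants:

```python
MTM_TTL_S = 72 * 3600   # assumption: drop MTM facts unused for 72h
PROMOTE_AFTER = 3       # assumption: promote after 3 recalls

def tick(mtm: dict, ltm: dict, now: float) -> None:
    """One pass of the promotion/expiry loop over MTM entries."""
    for key, entry in list(mtm.items()):
        if entry["recalls"] >= PROMOTE_AFTER:
            ltm[key] = entry["value"]   # consolidated to "cortex"
            del mtm[key]
        elif now - entry["last_recall"] > MTM_TTL_S:
            del mtm[key]                # Ebbinghaus-style decay
```

Repeatedly recalled facts (a hobby mentioned across several sessions) graduate to LTM; facts mentioned once and never touched again quietly expire instead of polluting permanent storage.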
4. Testing & Validation
4-1. Pass Rate Progression
| Round | Pass Rate | Key Change |
|---|---|---|
| Baseline | 45/53 (85%) | Original code |
| Round 1 | 48/53 (91%) | Scheduler tool name fix |
| Round 2 | 49/53 (92%) | find_contact description |
| Round 3 | 51/53 (96%) | LTM threshold + system prompt |
| Round 4 | 53/53 (100%) | Key filter fix + multi-chain guidance |
| Final (post-refactor) | 53/53 (100%) | MTM→LTM architecture, no regressions |
4-2. Test Coverage (53 tests)
| Category | Count | Coverage |
|---|---|---|
| A. Memory System | 10 | CRUD, cross-session, update, delete, implicit save |
| B. Web Search | 6 | Weather, exchange rate, news, restaurants, price, population |
| C. Calculation + Memory | 5 | Take-home pay, BMI, compound interest, unit conversion |
| D. Calendar | 6 | CRUD, memory-linked scheduling, event modification |
| E. File + Code | 5 | File CRUD, Python execution, result persistence |
| F. Email | 3 | Inbox, send, search |
| G. Contacts | 3 | Add, search, delete |
| H. Scheduler | 3 | Register, list, cancel |
| I. Multi-Chain | 7 | 3+ tool chains, corrections, weather→calendar |
| J. Edge Cases | 5 | Hallucination prevention, tool re-call prevention, response speed |
5. Results
Final Scorecard
A. Memory System ████████████████████ 10/10 (100%)
B. Web Search ████████████████████ 6/6 (100%)
C. Calc + Memory ████████████████████ 5/5 (100%)
D. Calendar ████████████████████ 6/6 (100%)
E. File + Code ████████████████████ 5/5 (100%)
F. Email ████████████████████ 3/3 (100%)
G. Contacts ████████████████████ 3/3 (100%)
H. Scheduler ████████████████████ 3/3 (100%)
I. Multi-Chain ████████████████████ 7/7 (100%)
J. Edge Cases ████████████████████ 5/5 (100%)
────────────────────────────────────────────────────
Total 53/53 (100%)
Key Takeaways
LLM-based extraction needs guardrails. Without priority classification, LLMs will extract "weather|Seoul" alongside "name|John" and pollute permanent memory.
Human memory models work for AI agents. The hippocampus→cortex consolidation pattern (MTM→LTM promotion through repeated access) naturally filters for truly important information.
Single-character CJK key pitfall. A len(key) < 2 filter designed for English silently drops valid Korean keys like "키" (height) and "나이" (age). Multilingual systems need careful attention.
Semantic matching has blind spots. "Calculate my BMI" doesn't semantically match "weight: 72kg" above a 0.3 cosine threshold. For small memory sets, full injection is more reliable than semantic filtering.



Top comments (2)
The LTM pollution problem you described is one of those things that seems obvious in hindsight but hits everyone building agent memory systems. That transient→permanent bypass is a classic failure mode.
What struck me about your fix is the priority classification — treating name/birthday/blood type as direct-to-LTM vs hobby/job going through MTM first. That's the right call. There's essentially a semantic permanence spectrum, and one-size-fits-all storage always ends up corrupted by ephemeral context.
The semantic matching blind spot is underrated too. "Calculate my BMI" not matching "weight: 72kg" above 0.3 cosine is a great concrete example of why hybrid retrieval often beats pure vector search for agent memory. Sometimes exact key lookup beats cosine similarity for structured fact recall.
One thing I'd think about for the next iteration: what happens when MID-priority facts conflict? User says hobby is hiking, then later says hiking was "just a phase, I do pottery now." Does the MTM→LTM promotion pick up the update, or does the old LTM entry create a conflict? Correction handling at promotion time seems like where the next round of edge cases lives.
Wow, thank you for your interest and the insightful thoughts. I'm not sure this fully answers your question, but here is how my system handles it.
In my system, MTM entries carry recall-count tags used for MTM→LTM promotion, and a memory is dropped after K hours (or days) if it is not recalled within that window (LTM entries expire too, but on a much longer deadline). Also, when I retrieve memories from MTM and LTM, I order them chronologically and instruct the LLM via the system prompt to interpret them that way (recent memories appear last).
I briefly checked whether this works when there are multiple conflicting memories in LTM, and it does.
So I believe that even if a memory is promoted from MTM to LTM and multiple versions exist, the system can handle it. (Though it hasn't been fully tested yet! 😊)
I will test it more thoroughly soon and post an update. Anyway, thank you for sharing your thoughts!
I also need to think more about the semantic search point you raised. (For now I just increased the number of entries retrieved from MTM/LTM.)