Kim Namhyun
Designing a 3-Tier Memory System for a Local AI Agent — STM / MTM / LTM

How we modeled human memory consolidation to build a robust memory pipeline for a 20B-parameter local LLM agent — and achieved 100% on a 53-question test suite.


1. Problem Definition

We built a local AI agent (Androi) that needs to remember user facts across conversations.

When a user says "My name is Namhyun" or "My hobby is hiking," the agent must recall and use these facts in future sessions.

The initial implementation was simple:

User message → LLM extracts key|value → Stored directly in LTM (permanent)
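That flow can be sketched in a few lines (a minimal illustration only; `extract_facts` and `remember` are stand-in names for the actual Androi extraction call, which the post does not show):

```python
# Minimal sketch of the original (flawed) pipeline: every extracted
# fact goes straight into permanent LTM storage, with no MTM stage
# and no priority filter. Names here are illustrative.

ltm: dict[str, str] = {}  # permanent long-term memory

def extract_facts(message: str) -> dict[str, str]:
    # Stand-in for the LLM extraction step, which returns key|value pairs.
    # A real run would also emit transient junk like {"weather": "Seoul"}.
    facts = {}
    if "name is" in message:
        facts["name"] = message.split("name is")[-1].strip().rstrip(".")
    return facts

def remember(message: str) -> None:
    # The bug: direct LTM writes, so junk and identity facts mix.
    ltm.update(extract_facts(message))

remember("My name is Namhyun")
print(ltm)  # {'name': 'Namhyun'}
```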

Out of 53 end-to-end tests, 8 failed, and memory system issues were the root cause for most.

Key symptoms:

  • BMI calculation couldn't find height data
  • Salary information wasn't used for price comparisons
  • After correcting a hobby, the agent recalled the old value
  • Scheduler tool name mismatches (unrelated bug, also fixed)

2. Root Cause Analysis

2-1. LTM Pollution

Every extracted fact went directly to LTM, including transient junk:

weather|Seoul       ← one-time search request
time|8AM            ← scheduling parameter  
weather_alert|cancelled  ← task status

When LTM exceeded 30 entries, semantic filtering kicked in — but these junk entries diluted the signal, causing truly important memories (height, weight, salary) to be filtered out.

2-2. Single-Character Key Filter Bug

extract_and_remember had a len(key) < 2 filter that dropped valid single-character Korean keys.

"키" (height in Korean) is 1 character → extracted but never saved. Only "몸무게" (weight, 3 chars) survived.
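The bug and its fix can be shown side by side (function names are illustrative; the fix rejects only empty or whitespace-only keys instead of length-filtering):

```python
def is_valid_key_buggy(key: str) -> bool:
    # Original filter: drops any key shorter than 2 characters,
    # silently discarding valid 1-character Korean keys.
    return len(key) >= 2

def is_valid_key_fixed(key: str) -> bool:
    # Fix: reject only empty/whitespace keys, regardless of length.
    return bool(key.strip())

assert not is_valid_key_buggy("키")    # "height" -- wrongly dropped
assert is_valid_key_buggy("몸무게")     # "weight" -- 3 chars, passes
assert is_valid_key_fixed("키")         # now saved
assert not is_valid_key_fixed("  ")     # whitespace still rejected
```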

2-3. auto_retrieve Injection Threshold

When LTM exceeded 15 entries, the system switched to semantic matching. With 20 entries stored, a query like "Calculate my BMI" couldn't match "weight: 72kg" above the 0.3 cosine similarity threshold.
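The switch between full injection and semantic filtering might look like this (a sketch under assumed names; `similarity_fn` stands in for whatever embedding comparison the system uses):

```python
LTM_INJECT_LIMIT = 30  # raised from 15; at or below this, inject everything

def auto_retrieve(query, ltm, similarity_fn, threshold=0.3):
    # Small memory sets: full injection is cheap and never misses.
    if len(ltm) <= LTM_INJECT_LIMIT:
        return dict(ltm)
    # Large sets: fall back to semantic filtering, which can miss
    # pairs like "Calculate my BMI" vs "weight: 72kg".
    return {k: v for k, v in ltm.items()
            if similarity_fn(query, f"{k}: {v}") >= threshold}
```

With 20 entries stored and a limit of 15, the original code took the semantic branch and dropped the weight entry; raising the limit to 30 keeps full injection active for typical memory sizes.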

2-4. Architectural Gap

The intended design was STM→MTM→LTM natural promotion, but extract_and_remember bypassed MTM entirely and wrote directly to LTM. The promotion logic (_try_promote_to_ltm) was effectively dead code.


3. Solution Design

3-1. Immediate Bug Fixes

| Issue | Fix |
|---|---|
| Scheduler tool name mismatch | `schedule_task` → `create_task` (3 tools) |
| `web_search` chosen over `find_contact` | Added "Do NOT use web_search for contacts" to the tool description |
| `len(key) < 2` filter | Removed the minimum key length requirement |
| `auto_retrieve` injection threshold | Increased 15 → 30 |

3-2. Architecture Refactoring — Priority-Based MTM→LTM Pipeline

Inspired by human memory consolidation:

  • Hippocampus = MTM — short-term memories consolidate to cortex through repetition
  • Ebbinghaus Forgetting Curve — unused memories naturally decay
  • Emotional significance — blood type, allergies are permanently stored after a single mention

Implementation:

Priority Classification

| Priority | Criteria | Examples | Storage |
|---|---|---|---|
| HIGH | Immutable core identity | Name, birthday, blood type, allergies | Direct to LTM |
| MID | Mutable preferences/state | Hobby, job, residence, salary | MTM → promotion queue |
| LOW | Transient/one-off info | Weather, news, timestamps | Not extracted |
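A minimal classifier along these lines might look like the following (the keyword sets are illustrative; the real system presumably has the LLM classify, guided by similar category definitions in the prompt):

```python
from enum import Enum

class Priority(Enum):
    HIGH = "high"   # immutable identity -> written directly to LTM
    MID = "mid"     # mutable state -> MTM, promoted on repeated access
    LOW = "low"     # transient -> never extracted at all

# Illustrative keyword sets, not the production rules.
HIGH_KEYS = {"name", "birthday", "blood_type", "allergy"}
LOW_KEYS = {"weather", "time", "news", "weather_alert"}

def classify(key: str) -> Priority:
    if key in HIGH_KEYS:
        return Priority.HIGH
    if key in LOW_KEYS:
        return Priority.LOW
    return Priority.MID  # hobby, job, salary, residence, ...

assert classify("name") is Priority.HIGH
assert classify("hobby") is Priority.MID
assert classify("weather") is Priority.LOW
```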

Promotion & Expiry Flow
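The flow diagram did not survive extraction, so here is a compact sketch of what an MTM→LTM promotion loop with Ebbinghaus-style expiry could look like. The thresholds, field names, and function names are assumptions, not the actual Androi code:

```python
import time
from dataclasses import dataclass, field

PROMOTE_AFTER = 3    # accesses before MTM -> LTM promotion (assumed)
MTM_TTL = 72 * 3600  # drop unused MTM entries after 72h (assumed)

@dataclass
class Memory:
    key: str
    value: str
    access_count: int = 0
    last_access: float = field(default_factory=time.time)

def touch(m: Memory) -> None:
    # Called whenever a memory is recalled; repetition drives promotion.
    m.access_count += 1
    m.last_access = time.time()

def consolidate(mtm: dict[str, Memory], ltm: dict[str, str], now=None) -> None:
    now = now or time.time()
    for key in list(mtm):
        m = mtm[key]
        if m.access_count >= PROMOTE_AFTER:
            ltm[key] = m.value   # consolidated: hippocampus -> cortex
            del mtm[key]
        elif now - m.last_access > MTM_TTL:
            del mtm[key]         # forgotten: Ebbinghaus-style decay
```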


4. Testing & Validation

4-1. Pass Rate Progression

| Round | Pass Rate | Key Change |
|---|---|---|
| Baseline | 45/53 (85%) | Original code |
| Round 1 | 48/53 (91%) | Scheduler tool name fix |
| Round 2 | 49/53 (92%) | `find_contact` description |
| Round 3 | 51/53 (96%) | LTM threshold + system prompt |
| Round 4 | 53/53 (100%) | Key filter fix + multi-chain guidance |
| Final (post-refactor) | 53/53 (100%) | MTM→LTM architecture, no regressions |

4-2. Test Coverage (53 tests)

| Category | Count | Coverage |
|---|---|---|
| A. Memory System | 10 | CRUD, cross-session, update, delete, implicit save |
| B. Web Search | 6 | Weather, exchange rate, news, restaurants, price, population |
| C. Calculation + Memory | 5 | Take-home pay, BMI, compound interest, unit conversion |
| D. Calendar | 6 | CRUD, memory-linked scheduling, event modification |
| E. File + Code | 5 | File CRUD, Python execution, result persistence |
| F. Email | 3 | Inbox, send, search |
| G. Contacts | 3 | Add, search, delete |
| H. Scheduler | 3 | Register, list, cancel |
| I. Multi-Chain | 7 | 3+ tool chains, corrections, weather→calendar |
| J. Edge Cases | 5 | Hallucination prevention, tool re-call prevention, response speed |

5. Results

Final Scorecard

A. Memory System    ████████████████████ 10/10 (100%)
B. Web Search       ████████████████████ 6/6   (100%)
C. Calc + Memory    ████████████████████ 5/5   (100%)
D. Calendar         ████████████████████ 6/6   (100%)
E. File + Code      ████████████████████ 5/5   (100%)
F. Email            ████████████████████ 3/3   (100%)
G. Contacts         ████████████████████ 3/3   (100%)
H. Scheduler        ████████████████████ 3/3   (100%)
I. Multi-Chain      ████████████████████ 7/7   (100%)
J. Edge Cases       ████████████████████ 5/5   (100%)
────────────────────────────────────────────────────
Total               53/53 (100%)

Key Takeaways

  1. LLM-based extraction needs guardrails. Without priority classification, LLMs will extract "weather|Seoul" alongside "name|John" and pollute permanent memory.

  2. Human memory models work for AI agents. The hippocampus→cortex consolidation pattern (MTM→LTM promotion through repeated access) naturally filters for truly important information.

  3. Single-character CJK key pitfall. A len(key) < 2 filter designed for English silently drops valid Korean keys like "키" (height) and "나이" (age). Multilingual systems need length-agnostic validation rules.

  4. Semantic matching has blind spots. "Calculate my BMI" doesn't semantically match "weight: 72kg" above a 0.3 cosine threshold. For small memory sets, full injection is more reliable than semantic filtering.
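As the comment thread notes, exact key lookup sometimes beats cosine similarity for structured fact recall. One hedged sketch of that hybrid idea (not the approach the post actually shipped, which was full injection for small sets):

```python
def hybrid_recall(query: str, memory: dict[str, str],
                  semantic_fn, threshold: float = 0.3) -> dict[str, str]:
    # 1) Substring key match: "what's my weight" hits the "weight" key
    #    directly -- cheap and precise when the key appears in the query.
    hits = {k: v for k, v in memory.items() if k.lower() in query.lower()}
    if hits:
        return hits
    # 2) Fallback: semantic similarity over "key: value" strings,
    #    which is where "Calculate my BMI" vs "weight: 72kg" can fail.
    return {k: v for k, v in memory.items()
            if semantic_fn(query, f"{k}: {v}") >= threshold}
```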

Top comments (2)

signalstack

The LTM pollution problem you described is one of those things that seems obvious in hindsight but hits everyone building agent memory systems. That transient→permanent bypass is a classic failure mode.

What struck me about your fix is the priority classification — treating name/birthday/blood type as direct-to-LTM vs hobby/job going through MTM first. That's the right call. There's essentially a semantic permanence spectrum, and one-size-fits-all storage always ends up corrupted by ephemeral context.

The semantic matching blind spot is underrated too. "Calculate my BMI" not matching "weight: 72kg" above 0.3 cosine is a great concrete example of why hybrid retrieval often beats pure vector search for agent memory. Sometimes exact key lookup beats cosine similarity for structured fact recall.

One thing I'd think about for the next iteration: what happens when MID-priority facts conflict? User says hobby is hiking, then later says hiking was "just a phase, I do pottery now." Does the MTM→LTM promotion pick up the update, or does the old LTM entry create a conflict? Correction handling at promotion time seems like where the next round of edge cases lives.

Kim Namhyun • Edited

Wow, thank you for your interest and insightful thoughts. I'm not sure if this fully answers your question, but I'll share how my system handles it.

In my system, MTM entries carry access-count tags used for MTM→LTM promotion, and a memory is dropped after K hours (or days) if it isn't recalled within that window (LTM entries expire too, but with a much longer deadline). Also, when I retrieve memories from MTM and LTM, I order them chronologically and instruct the LLM via the system prompt that the most recent memories appear last.
I briefly checked whether this works when there are multiple conflicting memories in LTM, and it does.

So I believe that even if a memory is promoted from MTM to LTM and multiple versions exist, the system can handle it. (Though it hasn't been fully tested yet! 😊)
I will test it more thoroughly soon and post an update. Anyway, thank you for sharing your thoughts!

I also need to think about the semantic search point you mentioned. (For now, I just increased the number of entries retrieved from MTM/LTM.)