How we modeled human memory consolidation to build a robust memory pipeline for a 20B-parameter local LLM agent — and achieved 100% on a 53-question test suite.
1. Problem Definition
We built a local AI agent (Androi) that needs to remember user facts across conversations.
When a user says "My name is Namhyun" or "My hobby is hiking," the agent must recall and use these facts in future sessions.
The initial implementation was simple:
User message → LLM extracts key|value → Stored directly in LTM (permanent)
Out of 53 end-to-end tests, 8 failed, and memory system issues were the root cause for most.
Key symptoms:
- BMI calculation couldn't find height data
- Salary information wasn't used for price comparisons
- After correcting a hobby, the agent recalled the old value
- Scheduler tool name mismatches (unrelated bug, also fixed)
2. Root Cause Analysis
2-1. LTM Pollution
Every extracted fact went directly to LTM, including transient junk:
weather|Seoul ← one-time search request
time|8AM ← scheduling parameter
weather_alert|cancelled ← task status
When LTM exceeded 30 entries, semantic filtering kicked in — but these junk entries diluted the signal, causing truly important memories (height, weight, salary) to be filtered out.
2-2. Single-Character Key Filter Bug
extract_and_remember had a len(key) < 2 filter that dropped valid single-character Korean keys.
"키" (height in Korean) is 1 character → extracted but never saved. Only "몸무게" (weight, 3 chars) survived.
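The fix is small. Below is a hedged sketch (the function name `should_store` is illustrative, not the actual Androi code) showing why the old guard rejected single-character CJK keys and how dropping the length check fixes it:

```python
# Illustrative sketch of the key-filter fix. The original guard,
# `if len(key) < 2: return False`, silently rejected "키" (height)
# while accepting "몸무게" (weight, 3 chars).
def should_store(key: str) -> bool:
    # After the fix: only empty or whitespace-only keys are rejected,
    # so single-character Korean keys pass through.
    return bool(key.strip())
```

The underlying issue is that a "too short to be meaningful" heuristic tuned on English does not transfer to languages where one codepoint is a full word.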
2-3. auto_retrieve Injection Threshold
When LTM exceeded 15 entries, the system switched to semantic matching. With 20 entries stored, a query like "Calculate my BMI" couldn't match "weight: 72kg" above the 0.3 cosine similarity threshold.
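The retrieval policy can be sketched as follows (function and field names are illustrative, not the real Androi API): below the entry limit the full LTM is injected into context, and only above it does the 0.3-cosine filter take over.

```python
import math

FULL_INJECTION_LIMIT = 30  # raised from 15 in the fix

def cosine(a, b):
    # Plain cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_memories(query_vec, ltm, threshold=0.3):
    # Small store: inject everything, so "weight: 72kg" is always
    # visible to a "Calculate my BMI" query regardless of similarity.
    if len(ltm) <= FULL_INJECTION_LIMIT:
        return ltm
    # Large store: fall back to semantic filtering.
    return [m for m in ltm if cosine(query_vec, m["vec"]) >= threshold]
```

With the limit at 15, a 20-entry store forced semantic matching and the BMI query lost the weight fact; raising the limit to 30 keeps small stores on the always-inject path.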
2-4. Architectural Gap
The intended design was STM→MTM→LTM natural promotion, but extract_and_remember bypassed MTM entirely and wrote directly to LTM. The promotion logic (_try_promote_to_ltm) was effectively dead code.
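The gap can be shown in a few lines. This is a minimal sketch with illustrative names (only `extract_and_remember` and `_try_promote_to_ltm` come from the actual codebase), contrasting the bypass with the intended routing:

```python
# Before (bug): extraction wrote straight to permanent storage,
# so the MTM promotion path never ran.
def extract_and_remember_old(key, value, mtm, ltm):
    ltm[key] = value  # transient facts became permanent

# After (intended design): facts enter MTM first and wait there
# until the promotion logic moves them up.
def extract_and_remember_fixed(key, value, mtm, ltm):
    mtm[key] = {"value": value, "recalls": 0}
```

With the old path, `_try_promote_to_ltm` had nothing to promote because nothing ever lived in MTM long enough to be observed.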
3. Solution Design
3-1. Immediate Bug Fixes
| Issue | Fix |
|---|---|
| Scheduler tool name mismatch | schedule_task → create_task (3 tools) |
| web_search chosen over find_contact | Added "Do NOT use web_search for contacts" to tool description |
| len(key) < 2 filter | Removed minimum key length requirement |
| auto_retrieve threshold | Increased 15 → 30 |
3-2. Architecture Refactoring — Priority-Based MTM→LTM Pipeline
Inspired by human memory consolidation:
- Hippocampus = MTM — short-term memories consolidate to cortex through repetition
- Ebbinghaus Forgetting Curve — unused memories naturally decay
- Emotional significance — blood type, allergies are permanently stored after a single mention
Implementation:
Priority Classification
| Priority | Criteria | Examples | Storage |
|---|---|---|---|
| HIGH | Immutable core identity | Name, birthday, blood type, allergies | Direct to LTM |
| MID | Mutable preferences/state | Hobby, job, residence, salary | MTM → promotion queue |
| LOW | Transient/one-off info | Weather, news, timestamps | Not extracted |
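The table above can be sketched as a classifier. Note this is a hedged illustration: in the real system the classification happens inside the LLM extraction prompt, not in a hard-coded Python lookup, and the key sets below are examples drawn from the table, not an exhaustive list.

```python
# Illustrative priority classifier mirroring the table above.
HIGH_KEYS = {"name", "birthday", "blood_type", "allergy"}
MID_KEYS = {"hobby", "job", "residence", "salary", "height", "weight"}

def classify(key: str) -> str:
    if key in HIGH_KEYS:
        return "HIGH"   # immutable core identity: direct to LTM
    if key in MID_KEYS:
        return "MID"    # mutable preference/state: MTM first
    return "LOW"        # transient: not extracted at all
```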
Promotion & Expiry Flow
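The flow combines recall-count promotion (hippocampus→cortex consolidation) with time-based expiry (the forgetting curve). A minimal sketch, assuming a 72-hour MTM lifetime and promotion after 3 recalls — both numbers are placeholders, not the system's actual constants:

```python
MTM_TTL_S = 72 * 3600   # assumption: drop MTM facts unused for 72h
PROMOTE_AFTER = 3       # assumption: promote after 3 recalls

def tick(mtm: dict, ltm: dict, now: float) -> None:
    """One pass of the promotion/expiry loop over MTM entries."""
    for key, entry in list(mtm.items()):
        if entry["recalls"] >= PROMOTE_AFTER:
            ltm[key] = entry["value"]   # consolidated to "cortex"
            del mtm[key]
        elif now - entry["last_recall"] > MTM_TTL_S:
            del mtm[key]                # Ebbinghaus-style decay
```

Repeatedly recalled facts (a hobby mentioned across several sessions) graduate to LTM; facts mentioned once and never touched again quietly expire instead of polluting permanent storage.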
4. Testing & Validation
4-1. Pass Rate Progression
| Round | Pass Rate | Key Change |
|---|---|---|
| Baseline | 45/53 (85%) | Original code |
| Round 1 | 48/53 (91%) | Scheduler tool name fix |
| Round 2 | 49/53 (92%) | find_contact description |
| Round 3 | 51/53 (96%) | LTM threshold + system prompt |
| Round 4 | 53/53 (100%) | Key filter fix + multi-chain guidance |
| Final (post-refactor) | 53/53 (100%) | MTM→LTM architecture, no regressions |
4-2. Test Coverage (53 tests)
| Category | Count | Coverage |
|---|---|---|
| A. Memory System | 10 | CRUD, cross-session, update, delete, implicit save |
| B. Web Search | 6 | Weather, exchange rate, news, restaurants, price, population |
| C. Calculation + Memory | 5 | Take-home pay, BMI, compound interest, unit conversion |
| D. Calendar | 6 | CRUD, memory-linked scheduling, event modification |
| E. File + Code | 5 | File CRUD, Python execution, result persistence |
| F. Email | 3 | Inbox, send, search |
| G. Contacts | 3 | Add, search, delete |
| H. Scheduler | 3 | Register, list, cancel |
| I. Multi-Chain | 7 | 3+ tool chains, corrections, weather→calendar |
| J. Edge Cases | 5 | Hallucination prevention, tool re-call prevention, response speed |
5. Results
Final Scorecard
A. Memory System ████████████████████ 10/10 (100%)
B. Web Search ████████████████████ 6/6 (100%)
C. Calc + Memory ████████████████████ 5/5 (100%)
D. Calendar ████████████████████ 6/6 (100%)
E. File + Code ████████████████████ 5/5 (100%)
F. Email ████████████████████ 3/3 (100%)
G. Contacts ████████████████████ 3/3 (100%)
H. Scheduler ████████████████████ 3/3 (100%)
I. Multi-Chain ████████████████████ 7/7 (100%)
J. Edge Cases ████████████████████ 5/5 (100%)
────────────────────────────────────────────────────
Total 53/53 (100%)
Key Takeaways
LLM-based extraction needs guardrails. Without priority classification, LLMs will extract "weather|Seoul" alongside "name|John" and pollute permanent memory.
Human memory models work for AI agents. The hippocampus→cortex consolidation pattern (MTM→LTM promotion through repeated access) naturally filters for truly important information.
Single-character CJK key pitfall. A len(key) < 2 filter designed for English silently drops valid Korean keys like "키" (height) and "나이" (age). Multilingual systems need careful attention.
Semantic matching has blind spots. "Calculate my BMI" doesn't semantically match "weight: 72kg" above a 0.3 cosine threshold. For small memory sets, full injection is more reliable than semantic filtering.



Top comments (2)
The LTM pollution problem you described is one of those things that seems obvious in hindsight but hits everyone building agent memory systems. That transient→permanent bypass is a classic failure mode.
What struck me about your fix is the priority classification — treating name/birthday/blood type as direct-to-LTM vs hobby/job going through MTM first. That's the right call. There's essentially a semantic permanence spectrum, and one-size-fits-all storage always ends up corrupted by ephemeral context.
The semantic matching blind spot is underrated too. "Calculate my BMI" not matching "weight: 72kg" above 0.3 cosine is a great concrete example of why hybrid retrieval often beats pure vector search for agent memory. Sometimes exact key lookup beats cosine similarity for structured fact recall.
One thing I'd think about for the next iteration: what happens when MID-priority facts conflict? User says hobby is hiking, then later says hiking was "just a phase, I do pottery now." Does the MTM→LTM promotion pick up the update, or does the old LTM entry create a conflict? Correction handling at promotion time seems like where the next round of edge cases lives.
Wow, thank you for your interest and the insightful thoughts. I'm not sure this fully answers your question, but here is how my system handles it.
In my system, MTM entries carry recall-count tags used for MTM→LTM promotion, and a memory is dropped after K hours (or days) if it is not recalled within that window (LTM entries expire too, but on a much longer deadline). Also, when I retrieve memories from MTM and LTM, I order them chronologically and instruct the LLM via the system prompt to interpret them that way (recent memories appear last).
I briefly checked whether this works when there are multiple conflicting memories in LTM, and it does.
So I believe that even if a memory is promoted from MTM to LTM and multiple versions exist, the system can handle it. (Though it hasn't been fully tested yet! 😊)
I will test it more thoroughly soon and post an update. Anyway, thank you for sharing your thoughts!
I also need to think more about the semantic search point you raised. (For now I just increased the number of entries retrieved from MTM/LTM.)