Kim Namhyun
Debugging a Local LLM Agent's Memory System 🧠

I built a 20-test suite to verify my AI agent's memory system, discovered fundamental architecture flaws, decomposed the pipeline from 2 stages to 3, hunted down a hidden code path bug, and reached 94% (18/19) pass rate. Here's the full story.


Context: Xoul — A Fully Local AI Assistant

Xoul is a personal AI agent running entirely on a local QEMU VM. It uses Ollama to run a 20B parameter LLM with tool calling (web search, calculations, file management, etc.) and a persistent memory system.

The memory system has three tiers:

| Tier | Purpose | Storage |
| --- | --- | --- |
| STM (Short-Term) | Current conversation turn history | SQLite `stm` table |
| MTM (Mid-Term) | Conversation summaries, time-limited | SQLite `mtm` table |
| LTM (Long-Term) | User info (name, birthday, etc.), permanent | SQLite `ltm` + vector embeddings |
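A minimal sketch of how the three tiers might map onto SQLite. The table and column names beyond `stm`/`mtm`/`ltm` are my guesses, not Xoul's actual schema:

```python
import sqlite3

def init_memory_db(path=":memory:"):
    """Create the three memory tiers. Column names are illustrative."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        -- STM: raw conversation turns for the current session
        CREATE TABLE IF NOT EXISTS stm (
            session_id TEXT, role TEXT, content TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- MTM: rolling summaries with an expiry
        CREATE TABLE IF NOT EXISTS mtm (
            session_id TEXT, summary TEXT, expires_at TEXT
        );
        -- LTM: permanent key/value facts; embeddings live in a vector index
        CREATE TABLE IF NOT EXISTS ltm (
            key TEXT PRIMARY KEY, value TEXT, category TEXT
        );
    """)
    return conn

conn = init_memory_db()
conn.execute("INSERT INTO ltm (key, value, category) VALUES (?, ?, ?)",
             ("name", "Namhyun", "profile"))
```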

Today's goal: verify that this memory system actually works end-to-end.


Step 1: Building a 20-Test Suite

I created 20 E2E prompt-based tests covering all memory operations:

```
P01: "My name is Namhyun, I'm a teacher. Remember this." → store
P02: Same session: "What's my job?" → instant recall
P03: New session: "What do you know about me?" → cross-session recall
P04: Remember 3 birthdays at once
P05: Selective recall (wife's birthday only)
P06: Memory update (blue → green)
...
P20: Response speed under 30 seconds
```
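The suite itself is little more than prompts plus substring checks. A stripped-down version of a harness for tests like these, with the `send` callable that actually talks to the agent left to the caller (all names here are illustrative, not Xoul's real test code):

```python
def run_suite(send, tests):
    """Run prompt-based E2E tests.

    send(prompt, session) -> response text.
    tests: list of (test_id, prompt, session, expected_substring) tuples.
    """
    results = {}
    for test_id, prompt, session, expected in tests:
        reply = send(prompt, session)
        results[test_id] = expected.lower() in reply.lower()
    print(f"{sum(results.values())}/{len(results)} passed")
    return results

# Example run against a fake agent that always knows the job:
fake = lambda prompt, session: "You said you are a teacher."
suite = [
    ("P01", "My name is Namhyun, I'm a teacher. Remember this.", "s1", "teacher"),
    ("P02", "What's my job?", "s1", "teacher"),
]
results = run_suite(fake, suite)
```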

Initial Result: 13/20 (65%) 😰

Key failure causes:

  • English keys (user_name) vs Korean queries (이름이 뭐야?, "What's my name?") — semantic gap
  • forget tool not registered in tool definitions
  • auto_retrieve only injecting category='profile' items (effectively 0 items)
  • LLM calling remember only once for 3 birthdays
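The `auto_retrieve` fix was to stop filtering on `category='profile'` and inject every LTM item into the prompt. Roughly like this (schema and function shape are my reconstruction, not Xoul's actual code):

```python
import sqlite3

def auto_retrieve(conn):
    """Return all LTM facts as lines for prompt injection.

    The buggy version appended "WHERE category = 'profile'", which
    matched almost nothing because most facts were stored without
    that category — effectively injecting 0 items.
    """
    rows = conn.execute("SELECT key, value FROM ltm").fetchall()
    return "\n".join(f"{k}: {v}" for k, v in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ltm (key TEXT, value TEXT, category TEXT)")
conn.execute("INSERT INTO ltm VALUES ('name', 'Namhyun', NULL)")
print(auto_retrieve(conn))
```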

Step 2: The Original 2-Stage Architecture's Limits (Plan → Agent)

The existing architecture had two stages: a Plan step that decided what to do, followed by an Agent step that executed tools and produced the response.

The problem: LLMs tend to call a tool once and handle the rest as text. Ask one to remember 3 birthdays and it calls remember for the first, then just narrates the other two (exactly what P04 exposed). This is a fundamental tool-calling pattern issue that's difficult to fix with prompting alone.


The 3-Stage Architecture: Plan → Agent → Memory

The solution: add a dedicated extraction stage after the main LLM response, where a model extracts facts from the user message and auto-saves them.

Key Changes

  1. Removed remember tool from system prompt — LLM just responds "Got it!" as text
  2. Added extract_and_remember() function — extracts Key|Value pairs from user messages
  3. Inserted extraction stage into all response paths in server.py

The extraction prompt:

```
Extract information as Key|Value from the input. If none, say 'none'.

Ex) Input: My name is Namhyun, email is abc@test.com
Output:
    name|Namhyun
    email|abc@test.com

---
Input: {user_message}
Extract:
```
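Parsing the model's output is the easy half. A minimal sketch of what `extract_and_remember()` does, with the LLM call abstracted behind a `generate` callable and persistence behind `save` (the real function also writes to the `ltm` table; these signatures are my assumption):

```python
EXTRACT_PROMPT = """Extract information as Key|Value from the input. If none, say 'none'.

Ex) Input: My name is Namhyun, email is abc@test.com
Output:
    name|Namhyun
    email|abc@test.com

---
Input: {user_message}
Extract:"""

def parse_pairs(raw):
    """Turn 'key|value' lines into tuples; ignore malformed lines."""
    pairs = []
    for line in raw.strip().splitlines():
        if "|" not in line:
            continue
        key, _, value = line.partition("|")
        if key.strip() and value.strip():
            pairs.append((key.strip(), value.strip()))
    return pairs

def extract_and_remember(user_message, generate, save):
    raw = generate(EXTRACT_PROMPT.format(user_message=user_message))
    if raw.strip().lower() == "none":
        return []
    pairs = parse_pairs(raw)
    for key, value in pairs:
        save(key, value)
    return pairs
```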

Struggle 1: The Small Model (0.6B) Can't Follow Instructions

Initially, I used qwen3:0.6b (600M params) as the dedicated extractor. Fast and light, right?

```
📝 0.6b IN: My name is Namhyun, I'm a 4th grade teacher
📝 0.6b OUT: Key-Value: name=Namhyun    ← Wrong format! Used = instead of |
```
| Issue | Symptom |
| --- | --- |
| Format mismatch | Output `Key=Value` or `Key: Value` instead of `Key\|Value` |
| Hallucination | Generated information not in the input |
| Empty response | With `think: True`, thinking tokens consumed the budget → empty `response` field |

Solution: Use the Main LLM in Non-Think Mode

Using the main LLM (20B) for extraction with thinking disabled:

  • Input tokens are minimal (user message + prompt ≈ 200 tokens)
  • Output is short (a few Key|Value lines)
  • Time cost: ~5 seconds — perfectly acceptable
```
📝 20B IN: My name is Namhyun, I'm a 4th grade teacher
📝 20B OUT: name|Namhyun
            job|4th grade teacher
📝 Saved: [['name', 'Namhyun'], ['job', '4th grade teacher']]
```

Perfect extraction! Correct format, zero hallucination.
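Disabling thinking comes down to one field in the Ollama request. A sketch of the payload I'd send to `/api/generate`, assuming Ollama's `think` option for thinking-capable models; the model name and `num_predict` value are placeholders:

```python
def build_extract_request(user_message, model="main-20b"):
    """Build an Ollama /api/generate payload with thinking disabled.

    "think": False is the key part — it keeps the 20B model from
    spending its token budget on reasoning before the answer.
    """
    prompt = (
        "Extract information as Key|Value from the input. "
        "If none, say 'none'.\n\nInput: " + user_message + "\nExtract:"
    )
    return {
        "model": model,          # placeholder name, not Xoul's actual model tag
        "prompt": prompt,
        "stream": False,
        "think": False,          # non-think mode
        "options": {"num_predict": 128},  # extraction output is short
    }

payload = build_extract_request("My name is Namhyun")
```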


Struggle 2: The Hidden Code Path (The Real War Story)

Even after switching to 20B, some messages silently skipped extraction. Server logs showed:

```
# Extraction RUNS for these (questions):
[extract_and_remember] INPUT: Show me all my memories → none ✅
[extract_and_remember] INPUT: Delete favorite color   → none ✅

# Extraction MISSING for these (storage requests):
# ... literally no log output! 😱
```

The irony: extraction only worked for questions (which have nothing to store), but failed for messages that actually needed storage.

Root Cause: 3 Response Paths, Only 1 Had Extraction

extract_and_remember() only existed in the No-Tool Path.

Simple messages like "Remember this" took the Planning Shortcut — a fast path that generated a response without entering the tool execution loop. This path yielded the final response without ever calling extraction.

The messages that needed memory storage the most were the ones that never got it.

The Fix: Extract on Every Path

```python
# Planning Shortcut (line 724) — Added extraction here!
if plan_text.strip() and not plan_tool_results:
    session.compact_after_turn()
    memory_extract = None
    try:
        memory_extract = extract_and_remember(user_message)
    except Exception as e:
        print(f"SERVER ERROR: {e}")
    yield {"type": "final", ..., "memory_extract": memory_extract}
    return
```

After adding extraction to all 3 response paths, every message now gets memory extraction.
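The lesson generalizes: if a handler yields its final event in N places, extraction must run in N places. A toy simulation of the idea, the kind of check that would have caught this bug early (path names mirror the post; none of this is Xoul's real server code):

```python
def handle(user_message, path, extract):
    """Toy server with three final-response paths, all calling extract."""
    if path == "planning_shortcut":
        # fast path: plan text is the final answer, no tool loop
        return {"type": "final", "response": "Got it!",
                "memory_extract": extract(user_message)}
    if path == "tool_loop":
        return {"type": "final", "response": "(tool result)",
                "memory_extract": extract(user_message)}
    # no-tool path: before the fix, the only path with extraction
    return {"type": "final", "response": "(chat)",
            "memory_extract": extract(user_message)}

calls = []
def extract(msg):
    calls.append(msg)
    return [("name", "Namhyun")]

for path in ("planning_shortcut", "tool_loop", "no_tool"):
    result = handle("My name is Namhyun", path, extract)
    assert result["memory_extract"], f"extraction skipped on {path}"
```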


Final Result: 18/19 (94%) 🎉

```
✅ P01 — Explicit memory request       ✅ P11 — Memory correction
✅ P02 — Instant recall                ✅ P13 — Long content (recipe)
✅ P03 — Cross-session recall          ✅ P14 — Semantic association
✅ P04 — Multiple facts (3 birthdays)  ✅ P15 — Number memory (phone)
✅ P05 — Selective recall              ✅ P16 — Multi-attribute (9/9 hits!)
❌ P06 — Memory update                 ✅ P17 — Context-aware advice
✅ P07 — Implicit personal info        ✅ P18 — STM turn accumulation
✅ P08 — recall tool                   ✅ P19 — LTM integrity (14 items)
✅ P09 — forget tool                   ✅ P20 — Response speed (4.7s)
✅ P10 — Memory + calculation combo
```

Test Result Progression

| Round | Pass Rate | Key Change |
| --- | --- | --- |
| Round 1 | 13/20 (65%) | Initial state |
| Round 2 | 17/20 (85%) | Korean keys + forget tool + full auto_retrieve |
| Round 3 | 18/20 (90%) | 0.6b extraction (backup mode) |
| Round 4 | 9/19 (47%) | 3-stage separation first attempt — 0.6b empty response |
| Round 5 | 13/19 (68%) | 20B extraction + dict return fix |
| Round 6 | 18/19 (94%) | Planning Shortcut path fix |

Lessons Learned

  1. Without tests, you know nothing — The 20-test suite was the only reason I found the Planning Shortcut bug. Without it, extraction would have silently failed forever.

  2. Small models have real limits — 0.6B models struggle to follow formatting instructions consistently. If input is short, a larger model can be fast enough.

  3. Trace every code path in your server — I had yield final in 3 places. Extraction was in only 1. The bug lived in the gap.

  4. Silent errors are the worst — a bare `except: pass` silently swallowed critical errors and made debugging nearly impossible. Always log your exceptions.

  5. Same model, different role — Using the same 20B model for both conversation and extraction works because the extraction task has minimal tokens, keeping latency low (~5s).

Tomorrow

  • Test again with the 0.6B model (with optimized prompts)
  • Fix P06 (memory update) — improve extraction prompt for "change X to Y" requests
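For P06, beyond the prompt change, one option is making the save step an upsert so a re-extracted key simply overwrites the old value (sketch only; the two-column `ltm` schema here is assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ltm (key TEXT PRIMARY KEY, value TEXT)")

def remember(key, value):
    """Insert-or-overwrite: if "change my favorite color to green"
    re-extracts favorite_color, the upsert replaces the old value."""
    conn.execute(
        "INSERT INTO ltm (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )

remember("favorite_color", "blue")
remember("favorite_color", "green")   # P06: blue → green
```

This requires SQLite ≥ 3.24 for the `ON CONFLICT ... DO UPDATE` syntax.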

This post is part of my dev log for the Xoul project — a fully local AI agent. Drop a comment if you're interested in local AI systems!

#AI #LLM #LocalAI #MemorySystem #Ollama #DevLog
