Kim Namhyun
Debugging a Local LLM Agent's Memory System 🧠

I built a 20-test suite to verify my AI agent's memory system, discovered fundamental architecture flaws, decomposed the pipeline from 2 stages to 3, hunted down a hidden code path bug, and reached 94% (18/19) pass rate. Here's the full story.


Context: Xoul — A Fully Local AI Assistant

Xoul is a personal AI agent running entirely on a local QEMU VM. It uses Ollama to run a 20B parameter LLM with tool calling (web search, calculations, file management, etc.) and a persistent memory system.

The memory system has three tiers:

| Tier | Purpose | Storage |
| --- | --- | --- |
| STM (Short-Term) | Current conversation turn history | SQLite `stm` table |
| MTM (Mid-Term) | Conversation summaries, time-limited | SQLite `mtm` table |
| LTM (Long-Term) | User info (name, birthday, etc.), permanent | SQLite `ltm` + vector embeddings |
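A minimal sketch of how the three tiers might map onto SQLite. The table and column names beyond `stm`/`mtm`/`ltm` are my guesses, not Xoul's actual schema:

```python
import sqlite3

def init_memory_db(path=":memory:"):
    """Create the three memory tiers. Column names are illustrative."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        -- STM: raw conversation turns for the current session
        CREATE TABLE IF NOT EXISTS stm (
            session_id TEXT, role TEXT, content TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- MTM: rolling summaries with an expiry
        CREATE TABLE IF NOT EXISTS mtm (
            session_id TEXT, summary TEXT, expires_at TEXT
        );
        -- LTM: permanent key/value facts; embeddings live in a vector index
        CREATE TABLE IF NOT EXISTS ltm (
            key TEXT PRIMARY KEY, value TEXT, category TEXT
        );
    """)
    return conn

conn = init_memory_db()
conn.execute("INSERT INTO ltm (key, value, category) VALUES (?, ?, ?)",
             ("name", "Namhyun", "profile"))
```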

Today's goal: verify that this memory system actually works end-to-end.


Step 1: Building a 20-Test Suite

I created 20 E2E prompt-based tests covering all memory operations:

```
P01: "My name is Namhyun, I'm a teacher. Remember this." → store
P02: Same session: "What's my job?" → instant recall
P03: New session: "What do you know about me?" → cross-session recall
P04: Remember 3 birthdays at once
P05: Selective recall (wife's birthday only)
P06: Memory update (blue → green)
...
P20: Response speed under 30 seconds
```
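The suite itself is little more than prompts plus substring checks. A stripped-down version of a harness for tests like these, with the `send` callable that actually talks to the agent left to the caller (all names here are illustrative, not Xoul's real test code):

```python
def run_suite(send, tests):
    """Run prompt-based E2E tests.

    send(prompt, session) -> response text.
    tests: list of (test_id, prompt, session, expected_substring) tuples.
    """
    results = {}
    for test_id, prompt, session, expected in tests:
        reply = send(prompt, session)
        results[test_id] = expected.lower() in reply.lower()
    print(f"{sum(results.values())}/{len(results)} passed")
    return results

# Example run against a fake agent that always knows the job:
fake = lambda prompt, session: "You said you are a teacher."
suite = [
    ("P01", "My name is Namhyun, I'm a teacher. Remember this.", "s1", "teacher"),
    ("P02", "What's my job?", "s1", "teacher"),
]
results = run_suite(fake, suite)
```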

Initial Result: 13/20 (65%) 😰

Key failure causes:

  • English keys (user_name) vs Korean queries (이름이 뭐야?, "What's my name?") — semantic gap
  • forget tool not registered in tool definitions
  • auto_retrieve only injecting category='profile' items (effectively 0 items)
  • LLM calling remember only once for 3 birthdays
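The `auto_retrieve` fix was to stop filtering on `category='profile'` and inject every LTM item into the prompt. Roughly like this (schema and function shape are my reconstruction, not Xoul's actual code):

```python
import sqlite3

def auto_retrieve(conn):
    """Return all LTM facts as lines for prompt injection.

    The buggy version appended "WHERE category = 'profile'", which
    matched almost nothing because most facts were stored without
    that category — effectively injecting 0 items.
    """
    rows = conn.execute("SELECT key, value FROM ltm").fetchall()
    return "\n".join(f"{k}: {v}" for k, v in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ltm (key TEXT, value TEXT, category TEXT)")
conn.execute("INSERT INTO ltm VALUES ('name', 'Namhyun', NULL)")
print(auto_retrieve(conn))
```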

Step 2: The Original 2-Stage Architecture's Limits (Plan → Agent)

The existing architecture had two stages: a Plan step that decided what to do, followed by an Agent step that executed tools and produced the response.

The problem: LLMs tend to call a tool once and handle the rest as text. Ask one to remember 3 birthdays and it calls remember for the first, then just narrates the other two (exactly what P04 exposed). This is a fundamental tool-calling pattern issue that's difficult to fix with prompting alone.


The 3-Stage Architecture: Plan → Agent → Memory

The solution: add a dedicated extraction stage after the main LLM response, where a model extracts facts from the user message and auto-saves them.

Key Changes

  1. Removed remember tool from system prompt — LLM just responds "Got it!" as text
  2. Added extract_and_remember() function — extracts Key|Value pairs from user messages
  3. Inserted extraction stage into all response paths in server.py

The extraction prompt:

```
Extract information as Key|Value from the input. If none, say 'none'.

Ex) Input: My name is Namhyun, email is abc@test.com
Output:
    name|Namhyun
    email|abc@test.com

---
Input: {user_message}
Extract:
```
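Parsing the model's output is the easy half. A minimal sketch of what `extract_and_remember()` does, with the LLM call abstracted behind a `generate` callable and persistence behind `save` (the real function also writes to the `ltm` table; these signatures are my assumption):

```python
EXTRACT_PROMPT = """Extract information as Key|Value from the input. If none, say 'none'.

Ex) Input: My name is Namhyun, email is abc@test.com
Output:
    name|Namhyun
    email|abc@test.com

---
Input: {user_message}
Extract:"""

def parse_pairs(raw):
    """Turn 'key|value' lines into tuples; ignore malformed lines."""
    pairs = []
    for line in raw.strip().splitlines():
        if "|" not in line:
            continue
        key, _, value = line.partition("|")
        if key.strip() and value.strip():
            pairs.append((key.strip(), value.strip()))
    return pairs

def extract_and_remember(user_message, generate, save):
    raw = generate(EXTRACT_PROMPT.format(user_message=user_message))
    if raw.strip().lower() == "none":
        return []
    pairs = parse_pairs(raw)
    for key, value in pairs:
        save(key, value)
    return pairs
```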

Struggle 1: The Small Model (0.6B) Can't Follow Instructions

Initially, I used qwen3:0.6b (600M params) as the dedicated extractor. Fast and light, right?

```
📝 0.6b IN: My name is Namhyun, I'm a 4th grade teacher
📝 0.6b OUT: Key-Value: name=Namhyun    ← Wrong format! Used = instead of |
```
| Issue | Symptom |
| --- | --- |
| Format mismatch | Output `Key=Value` or `Key: Value` instead of `Key\|Value` |
| Hallucination | Generated information not in the input |
| Empty response | With `think: True`, thinking tokens consumed the budget → empty `response` field |

Solution: Use the Main LLM in Non-Think Mode

Using the main LLM (20B) for extraction with thinking disabled:

  • Input tokens are minimal (user message + prompt ≈ 200 tokens)
  • Output is short (a few Key|Value lines)
  • Time cost: ~5 seconds — perfectly acceptable
```
📝 20B IN: My name is Namhyun, I'm a 4th grade teacher
📝 20B OUT: name|Namhyun
            job|4th grade teacher
📝 Saved: [['name', 'Namhyun'], ['job', '4th grade teacher']]
```

Perfect extraction! Correct format, zero hallucination.
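Disabling thinking comes down to one field in the Ollama request. A sketch of the payload I'd send to `/api/generate`, assuming Ollama's `think` option for thinking-capable models; the model name and `num_predict` value are placeholders:

```python
def build_extract_request(user_message, model="main-20b"):
    """Build an Ollama /api/generate payload with thinking disabled.

    "think": False is the key part — it keeps the 20B model from
    spending its token budget on reasoning before the answer.
    """
    prompt = (
        "Extract information as Key|Value from the input. "
        "If none, say 'none'.\n\nInput: " + user_message + "\nExtract:"
    )
    return {
        "model": model,          # placeholder name, not Xoul's actual model tag
        "prompt": prompt,
        "stream": False,
        "think": False,          # non-think mode
        "options": {"num_predict": 128},  # extraction output is short
    }

payload = build_extract_request("My name is Namhyun")
```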


Struggle 2: The Hidden Code Path (The Real War Story)

Even after switching to 20B, some messages silently skipped extraction. Server logs showed:

```
# Extraction RUNS for these (questions):
[extract_and_remember] INPUT: Show me all my memories → none ✅
[extract_and_remember] INPUT: Delete favorite color   → none ✅

# Extraction MISSING for these (storage requests):
# ... literally no log output! 😱
```

The irony: extraction only worked for questions (which have nothing to store), but failed for messages that actually needed storage.

Root Cause: 3 Response Paths, Only 1 Had Extraction

extract_and_remember() only existed in the No-Tool Path.

Simple messages like "Remember this" took the Planning Shortcut — a fast path that generated a response without entering the tool execution loop. This path yielded the final response without ever calling extraction.

The messages that needed memory storage the most were the ones that never got it.

The Fix: Extract on Every Path

```python
# Planning Shortcut (line 724) — Added extraction here!
if plan_text.strip() and not plan_tool_results:
    session.compact_after_turn()
    memory_extract = None
    try:
        memory_extract = extract_and_remember(user_message)
    except Exception as e:
        print(f"SERVER ERROR: {e}")
    yield {"type": "final", ..., "memory_extract": memory_extract}
    return
```

After adding extraction to all 3 response paths, every message now gets memory extraction.
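The lesson generalizes: if a handler yields its final event in N places, extraction must run in N places. A toy simulation of the idea, the kind of check that would have caught this bug early (path names mirror the post; none of this is Xoul's real server code):

```python
def handle(user_message, path, extract):
    """Toy server with three final-response paths, all calling extract."""
    if path == "planning_shortcut":
        # fast path: plan text is the final answer, no tool loop
        return {"type": "final", "response": "Got it!",
                "memory_extract": extract(user_message)}
    if path == "tool_loop":
        return {"type": "final", "response": "(tool result)",
                "memory_extract": extract(user_message)}
    # no-tool path: before the fix, the only path with extraction
    return {"type": "final", "response": "(chat)",
            "memory_extract": extract(user_message)}

calls = []
def extract(msg):
    calls.append(msg)
    return [("name", "Namhyun")]

for path in ("planning_shortcut", "tool_loop", "no_tool"):
    result = handle("My name is Namhyun", path, extract)
    assert result["memory_extract"], f"extraction skipped on {path}"
```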


Final Result: 18/19 (94%) 🎉

```
✅ P01 — Explicit memory request       ✅ P11 — Memory correction
✅ P02 — Instant recall                ✅ P13 — Long content (recipe)
✅ P03 — Cross-session recall          ✅ P14 — Semantic association
✅ P04 — Multiple facts (3 birthdays)  ✅ P15 — Number memory (phone)
✅ P05 — Selective recall              ✅ P16 — Multi-attribute (9/9 hits!)
❌ P06 — Memory update                 ✅ P17 — Context-aware advice
✅ P07 — Implicit personal info        ✅ P18 — STM turn accumulation
✅ P08 — recall tool                   ✅ P19 — LTM integrity (14 items)
✅ P09 — forget tool                   ✅ P20 — Response speed (4.7s)
✅ P10 — Memory + calculation combo
```

Test Result Progression

| Round | Pass Rate | Key Change |
| --- | --- | --- |
| Round 1 | 13/20 (65%) | Initial state |
| Round 2 | 17/20 (85%) | Korean keys + forget tool + full auto_retrieve |
| Round 3 | 18/20 (90%) | 0.6b extraction (backup mode) |
| Round 4 | 9/19 (47%) | 3-stage separation first attempt — 0.6b empty response |
| Round 5 | 13/19 (68%) | 20B extraction + dict return fix |
| Round 6 | 18/19 (94%) | Planning Shortcut path fix |

Lessons Learned

  1. Without tests, you know nothing — The 20-test suite was the only reason I found the Planning Shortcut bug. Without it, extraction would have silently failed forever.

  2. Small models have real limits — 0.6B models struggle to follow formatting instructions consistently. If input is short, a larger model can be fast enough.

  3. Trace every code path in your server — I had yield final in 3 places. Extraction was in only 1. The bug lived in the gap.

  4. Silent errors are the worst — a bare `except: pass` silently swallowed critical errors and made debugging nearly impossible. Always log your exceptions.

  5. Same model, different role — Using the same 20B model for both conversation and extraction works because the extraction task has minimal tokens, keeping latency low (~5s).

Tomorrow

  • Test again with the 0.6B model (with optimized prompts)
  • Fix P06 (memory update) — improve extraction prompt for "change X to Y" requests
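For P06, beyond the prompt change, one option is making the save step an upsert so a re-extracted key simply overwrites the old value (sketch only; the two-column `ltm` schema here is assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ltm (key TEXT PRIMARY KEY, value TEXT)")

def remember(key, value):
    """Insert-or-overwrite: if "change my favorite color to green"
    re-extracts favorite_color, the upsert replaces the old value."""
    conn.execute(
        "INSERT INTO ltm (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )

remember("favorite_color", "blue")
remember("favorite_color", "green")   # P06: blue → green
```

This requires SQLite ≥ 3.24 for the `ON CONFLICT ... DO UPDATE` syntax.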

This post is part of my dev log for the Xoul project — a fully local AI agent. Drop a comment if you're interested in local AI systems!

#AI #LLM #LocalAI #MemorySystem #Ollama #DevLog
