<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kim Namhyun</title>
    <description>The latest articles on DEV Community by Kim Namhyun (@kim_namhyun_e7535f3dc4c69).</description>
    <link>https://dev.to/kim_namhyun_e7535f3dc4c69</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3785019%2Fc1626915-c1e9-4793-a84e-d16b3e25d682.jpg</url>
      <title>DEV Community: Kim Namhyun</title>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kim_namhyun_e7535f3dc4c69"/>
    <language>en</language>
    <item>
      <title>Xoul - v0.1.1-beta released</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Wed, 01 Apr 2026 16:09:30 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-v011-beta-released-4h5g</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-v011-beta-released-4h5g</guid>
      <description>&lt;h2&gt;
  
  
  v0.1.1-beta released
&lt;/h2&gt;

&lt;p&gt;🔒 Security&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Block external SSH attacks: Bind QEMU port forwarding to 127.0.0.1, preventing brute-force SSH attacks from external networks&lt;/p&gt;
&lt;/blockquote&gt;
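&lt;p&gt;In QEMU terms the fix is the bind address on the &lt;code&gt;hostfwd&lt;/code&gt; rule: user-mode networking forwards a host port to the guest's SSH port, and binding it to 127.0.0.1 makes it reachable only from the host itself. A minimal sketch of how such arguments can be built (port numbers and device names are illustrative, not Xoul's actual configuration):&lt;/p&gt;

```python
# Sketch: build QEMU user-mode networking arguments with SSH forwarding
# bound to 127.0.0.1. Port numbers and device names are illustrative.
def ssh_forward_args(host_port: int = 2222, guest_port: int = 22) -> list:
    # Binding to 127.0.0.1 means only processes on the host can connect;
    # without the bind address, QEMU listens on all interfaces (0.0.0.0).
    hostfwd = f"hostfwd=tcp:127.0.0.1:{host_port}-:{guest_port}"
    return ["-netdev", f"user,id=net0,{hostfwd}",
            "-device", "virtio-net-pci,netdev=net0"]
```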

&lt;p&gt;🛠 Stability&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fix VM SSH connectivity during web search: Chromium opens 60+ outbound TCP connections when loading a single page, saturating QEMU SLiRP's connection slots and blocking new SSH connections. Added CDP &lt;code&gt;Network.setBlockedURLs&lt;/code&gt; to block JS/analytics/ads, reducing connections to ~10&lt;br&gt;
Immediate SSE connection release: Force-close SSE screencast response on tool completion to prevent lingering SLiRP connections&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;⚙️ Developer Experience&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Automatic Dev/Prod switching: New &lt;code&gt;env_config.py&lt;/code&gt; module auto-detects the environment from the current Git branch.&lt;/p&gt;
&lt;/blockquote&gt;
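&lt;p&gt;Branch-based detection like this is commonly a thin wrapper around &lt;code&gt;git rev-parse&lt;/code&gt;. A sketch under that assumption (the branch-to-environment mapping and function names are illustrative, not necessarily &lt;code&gt;env_config.py&lt;/code&gt;'s actual logic):&lt;/p&gt;

```python
import subprocess

# Sketch of branch-based environment detection; the mapping below is an
# assumption, not env_config.py's actual code.
def env_for_branch(branch: str) -> str:
    return "prod" if branch in ("main", "master") else "dev"

def detect_environment(repo_dir: str = ".") -> str:
    try:
        branch = subprocess.check_output(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=repo_dir, text=True).strip()
    except (subprocess.CalledProcessError, OSError):
        return "prod"  # safe default when not inside a Git checkout
    return env_for_branch(branch)
```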

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://www.xoulai.net/" rel="noopener noreferrer"&gt;https://www.xoulai.net/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/xoul-project/xoul" rel="noopener noreferrer"&gt;https://github.com/xoul-project/xoul&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Discussions: &lt;a href="https://github.com/xoul-project/xoul/discussions" rel="noopener noreferrer"&gt;https://github.com/xoul-project/xoul/discussions&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>networking</category>
      <category>python</category>
      <category>security</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Xoul - Local Personal Assistant Agent Release (Beta, v0.1.0-beta)</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Tue, 31 Mar 2026 17:34:50 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-local-personal-assistant-agent-release-beta-v010-beta-1op4</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-local-personal-assistant-agent-release-beta-v010-beta-1op4</guid>
      <description>&lt;h1&gt;
  
  
  Xoul — An Open-Source AI Agent That Runs Locally
&lt;/h1&gt;

&lt;p&gt;Introducing &lt;strong&gt;Xoul&lt;/strong&gt;, a personal assistant agent powered by local LLMs and virtual machine isolation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhsawz1jk3tw5m90yvmq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhsawz1jk3tw5m90yvmq.gif" alt=" " width="560" height="374"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Xoul
&lt;/h2&gt;

&lt;p&gt;Xoul is a personal AI agent. It's not a chatbot — it manages files, sends emails, browses the web, and runs code at the OS level. All actions run inside a QEMU virtual machine, keeping the host system untouched. When using a local LLM, personal data never leaves the machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;18 built-in tools&lt;/strong&gt; — file management, email, web search, code execution, calendar, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personas &amp;amp; Code Snippets&lt;/strong&gt; — switch agent roles or run Python snippets shared by the community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflows&lt;/strong&gt; — schedule repetitive tasks (news digests, server checks, email triage) as multi-step automation templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Arena&lt;/strong&gt; — a playground where agents discuss topics and play social deduction games&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host PC Control&lt;/strong&gt; — limited host interaction including browser launch and file operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple Clients&lt;/strong&gt; — Desktop (PyQt6), Telegram, Discord, Slack, and CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The Xoul agent runs inside a QEMU virtual machine. LLM inference is handled locally on the GPU via Ollama, while the desktop app serves as the host-side UI. VM isolation ensures the host system stays safe regardless of what the agent does.&lt;/p&gt;

&lt;p&gt;Beyond local LLMs, Xoul also supports commercial APIs (Claude, GPT-5, Gemini, DeepSeek, Grok, Mistral) and external OpenAI-compatible servers (vLLM, LM Studio, etc.).&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Models
&lt;/h2&gt;

&lt;p&gt;For local execution, models are automatically recommended based on available VRAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron-3-Nano 4B (Q8)&lt;/td&gt;
&lt;td&gt;~5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron-3-Nano 4B (BF16)&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-oss 20B&lt;/td&gt;
&lt;td&gt;~13 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron-Cascade-2 30B&lt;/td&gt;
&lt;td&gt;~20 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;BGE-M3 (embedding) and Qwen 2.5 3B (summarization, CPU-only) are also installed automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Requirements
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;x86-64, 8 cores&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;16 GB+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA 30-series, 8 GB VRAM&lt;/td&gt;
&lt;td&gt;NVIDIA 40-series, 16 GB+ VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Windows 11 (10 experimental)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;20 GB free&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.xoulai.net/xoul_dist/xoul_rel.zip" rel="noopener noreferrer"&gt;Download the release file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Extract &lt;code&gt;xoul_rel.zip&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;install.bat&lt;/code&gt; inside the extracted folder&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;install.bat&lt;/code&gt; handles file placement, dependency installation, and configuration automatically. Python 3.12, Ollama, and QEMU are installed as needed. An interactive setup walks through language selection, LLM model, VM configuration, user profile, and optional service integrations (Gmail, Tavily, Telegram, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Install from Source
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;clone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://github.com/xoul-project/xoul.git&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xoul&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\scripts\setup_env.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once setup completes, the Desktop App launches automatically. After that, you can start it with &lt;code&gt;c:\xoul\desktop\xoul.bat&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community
&lt;/h2&gt;

&lt;p&gt;Through Xoul Store, you can import workflows, personas, and code snippets created by other users with one click. You can also publish your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  License
&lt;/h2&gt;

&lt;p&gt;Released under the MIT License.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://www.xoulai.net/" rel="noopener noreferrer"&gt;https://www.xoulai.net/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/xoul-project/xoul" rel="noopener noreferrer"&gt;https://github.com/xoul-project/xoul&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Discussions: &lt;a href="https://github.com/xoul-project/xoul/discussions" rel="noopener noreferrer"&gt;https://github.com/xoul-project/xoul/discussions&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Xoul - Building a Local AI Agent Platform with Small LLMs: The Walls of Tool Calling and Practical Solutions</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 15:42:25 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-building-a-local-ai-agent-platform-with-small-llms-the-walls-of-tool-calling-and-practical-11fb</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-building-a-local-ai-agent-platform-with-small-llms-the-walls-of-tool-calling-and-practical-11fb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post is a real-world account of developing Xoul, an on-premise local AI agent platform: we hit the limits of small-LLM Tool Calling and overcame them, one by one, at the application layer.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Background: "Let's Build a Local Agent"
&lt;/h2&gt;

&lt;p&gt;With large models like GPT or Claude, Tool Calling is near-perfect. But the moment you need to run &lt;strong&gt;small local LLMs (Ollama + Llama3/Qwen/GPT-oss under 20B)&lt;/strong&gt; for on-premise environments or cost reasons, reality hits hard.&lt;/p&gt;

&lt;p&gt;Xoul is a personal AI agent platform with this basic flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input
    ↓
LLM (local[small] or commercial)
    ↓ Tool Call (JSON)
Tool Router → Function execution
    ↓
Result fed back to LLM → Final response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 30+ tools running on this architecture (workflow management, scheduling, Python code execution), we hit three major problems.&lt;/p&gt;
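&lt;p&gt;The Tool Router step above, reduced to a sketch (the JSON shape follows the post's examples; the tool registration decorator and return values are illustrative):&lt;/p&gt;

```python
import json

TOOLS = {}  # tool name -> Python callable

def tool(fn):
    """Register a function as an LLM-callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def run_workflow(name: str) -> str:
    return f"ran {name}"

def dispatch(tool_call: str) -> str:
    """Route one tool call of the form {"tool": ..., "args": {...}}
    to the matching registered function."""
    call = json.loads(tool_call)
    fn = TOOLS.get(call.get("tool"))
    if fn is None:
        return "Not Found"
    return fn(**call.get("args", {}))
```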




&lt;h2&gt;
  
  
  Limitation 1: The LLM Corrupts Parameters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;User: &lt;code&gt;"Run the 'Organize My Coin When +-20%' workflow"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The LLM needs to call &lt;code&gt;run_workflow&lt;/code&gt;. What we actually got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_workflow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Coin organize"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual DB name was &lt;code&gt;"내 코인 현재 +- 20일때 정리"&lt;/code&gt; (Korean, roughly "organize my coin at ±20%"), so the result was predictably &lt;strong&gt;Not Found&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The first instinct was to fix this with prompting: &lt;em&gt;"Always call list_workflows first to verify the exact name."&lt;/em&gt; Small LLMs tend to forget early instructions as the context grows, so this was unreliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 1: Prompt Engineering → Failed
&lt;/h3&gt;

&lt;p&gt;The model followed the instruction sometimes and ignored it other times. When users issued direct execution commands, it skipped the list query entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 2: 3-Stage Fuzzy Matching → Core Solution ✅
&lt;/h3&gt;

&lt;p&gt;We redesigned the backend to match &lt;strong&gt;as flexibly as possible&lt;/strong&gt;, regardless of what the LLM passes in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "Coin organize"
  ↓
[Step 1] Match after stripping spaces/special chars
  → "Coinorganize" vs DB: "내코인현재+-20일때정리" → Fail
  ↓
[Step 2] LIKE partial match
  → DB search for "Coin" → Fail (not unique enough)
  ↓
[Step 3] Sentence Embedding cosine similarity
  → "Coin organize" ≈ "내 코인 현재 +- 20일때 정리" → Similarity 0.81 ✅ Auto-execute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Embeddings use &lt;code&gt;sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2&lt;/code&gt;, loaded at server startup and stored as BLOBs in the DB on workflow creation/update. At search time, all embeddings are loaded and cosine similarity is computed with numpy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Similarity threshold design:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Similarity&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;≥ 0.75&lt;/td&gt;
&lt;td&gt;Auto-execute (no user confirmation needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.5 ~ 0.75&lt;/td&gt;
&lt;td&gt;Show top 3 candidates for user to pick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.5&lt;/td&gt;
&lt;td&gt;Return Not Found&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
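&lt;p&gt;The three stages plus the thresholds can be sketched as follows. This is a dependency-free simplification: the embedding function is injected, standing in for the sentence-transformers model, and the DB rows are plain (name, vector) pairs:&lt;/p&gt;

```python
import math
import re

def normalize(s: str) -> str:
    # Step 1 helper: lowercase, drop spaces and special characters
    # (the character class keeps Latin letters, digits, and Hangul).
    return re.sub(r"[^0-9a-z가-힣]", "", s.lower())

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_workflow(query, rows, embed):
    """rows: (name, embedding) pairs loaded from the DB.
    embed: text -> vector, injected so the sketch stays dependency-free.
    Returns (action, candidates) following the threshold table."""
    # Step 1: exact match after normalization
    for name, _ in rows:
        if normalize(name) == normalize(query):
            return "execute", [name]
    # Step 2: LIKE-style partial match, accepted only if unique
    partial = [name for name, _ in rows
               if normalize(query) in normalize(name)]
    if len(partial) == 1:
        return "execute", partial
    # Step 3: embedding cosine similarity with thresholds
    q = embed(query)
    scored = sorted(((cosine(q, vec), name) for name, vec in rows),
                    reverse=True)
    best_score, best_name = scored[0]
    if best_score >= 0.75:
        return "execute", [best_name]
    if best_score >= 0.5:
        return "choose", [name for _, name in scored[:3]]
    return "not_found", []
```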




&lt;h2&gt;
  
  
  Limitation 2: JSON Gets Destroyed
&lt;/h2&gt;

&lt;p&gt;When the number of available tools exceeds ~30, small LLMs start to buckle under context-window pressure, producing increasingly broken JSON: natural-language sentences injected into the JSON, missing closing brackets, typos in required keys.&lt;/p&gt;

&lt;p&gt;On Ollama, this comes back as &lt;code&gt;HTTP 500: error parsing tool call&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 1: Tool Pruning ✅
&lt;/h3&gt;

&lt;p&gt;We introduced a &lt;strong&gt;Tool Registry&lt;/strong&gt; that dynamically provides only the tools relevant to the user's input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Run the workflow"
  ↓
Keyword analysis + Embedding similarity → select relevant toolkits
  ↓
Only tools from [workflow, code, schedule] toolkits sent to LLM
  → 30-tool full set → compressed to 6~8 tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since irrelevant tools simply don't exist in the prompt, JSON parse failures dropped dramatically.&lt;/p&gt;
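&lt;p&gt;A minimal sketch of such toolkit selection (the keyword map and toolkit names are illustrative; the real registry also uses embedding similarity, not just keywords):&lt;/p&gt;

```python
# Toolkit -> trigger keywords. The mapping is illustrative; the real
# registry combines keyword analysis with embedding similarity.
TOOLKITS = {
    "workflow": ["workflow", "automation"],
    "code":     ["run", "python", "script"],
    "schedule": ["schedule", "daily", "remind"],
    "email":    ["mail", "inbox", "send"],
}

def select_toolkits(user_input: str, fallback=("workflow",)):
    """Pick only the toolkits whose keywords appear in the input, so
    irrelevant tool schemas never enter the LLM prompt at all."""
    text = user_input.lower()
    hits = [kit for kit, kws in TOOLKITS.items()
            if any(kw in text for kw in kws)]
    return hits or list(fallback)
```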

&lt;h3&gt;
  
  
  Attempt 2: Native → Text Fallback ✅
&lt;/h3&gt;

&lt;p&gt;For residual failures, we added automatic retry logic to &lt;code&gt;LLMClient&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;HTTPError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error parsing tool call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Strip tools, retry in plain text mode
&lt;/span&gt;        &lt;span class="n"&gt;retry_payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;retry_payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_choice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Receive text response, parse with Regex for &amp;lt;tool&amp;gt; tags
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We keep a text-based tool-call format in the system prompt alongside native Tool Calling, so tools are still executed even in fallback mode. This is a &lt;strong&gt;Dual Parser&lt;/strong&gt; architecture.&lt;/p&gt;
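&lt;p&gt;The text side of the dual parser can be a small regex over the tag convention. A sketch, assuming an &lt;code&gt;&amp;lt;tool&amp;gt;&lt;/code&gt; tag wrapping a JSON object (the exact tag format in Xoul's system prompt may differ):&lt;/p&gt;

```python
import json
import re

# Assumed tag convention for text-mode tool calls; the exact format in
# Xoul's system prompt may differ.
TOOL_TAG = re.compile(r"<tool>\s*(\{.*?\})\s*</tool>", re.DOTALL)

def parse_text_tool_call(text: str):
    """Extract a {"tool": ..., "args": ...} object from a plain-text reply
    when native tool calling has been stripped. Returns None if absent."""
    match = TOOL_TAG.search(text)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```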

&lt;blockquote&gt;
&lt;p&gt;With sLLM-based agents, &lt;strong&gt;defensive application-layer design matters more than model quality&lt;/strong&gt;. Don't trust LLM output. Build thick validation and correction pipelines on both the input and output sides. That's the core of running these systems in production.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Local Agent's (Xoul) Code Store &amp; AI Arena: Autonomous Agents Powered by Code Execution</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 14:19:56 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/local-agentsxoul-code-store-ai-arena-autonomous-agents-powered-by-code-execution-25ck</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/local-agentsxoul-code-store-ai-arena-autonomous-agents-powered-by-code-execution-25ck</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How Xoul's Code Store turns a single code import into an autonomous agent competing in live games.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Code Store — The Agent's Toolbox
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Code Store&lt;/strong&gt; is the third pillar of the Xoul platform, alongside Personas (character) and Workflows (automation). It's a dynamic repository of Python code snippets that agents can import and execute on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Central Store] ──import──▶ [Local DB] ──run_stored_code──▶ [VM Execution]
   (GitHub)               (workflows.db)                     (Python)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Central Repository&lt;/strong&gt;: 50+ code snippets are maintained in a GitHub repository (&lt;code&gt;xoul_store&lt;/code&gt;). Metadata and inline code live together in &lt;code&gt;codes.json&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-Import&lt;/strong&gt;: When a user clicks a code in the Store UI, it's saved to the local SQLite database (&lt;code&gt;~/.xoul/workflows.db&lt;/code&gt;) via &lt;code&gt;POST /code/import&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Execution&lt;/strong&gt;: The LLM calls &lt;code&gt;run_stored_code(name, params, timeout)&lt;/code&gt;, and the code runs safely inside the VM environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Parameter Prompting
&lt;/h3&gt;

&lt;p&gt;Each code snippet can define required parameters. The LLM automatically asks the user for missing values in a natural multi-turn conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"game_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"str"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"desc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Game ID to join"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Required parameters (marked with &lt;code&gt;*&lt;/code&gt;) must be asked for; optional parameters use defaults. This multi-turn prompting pattern is the core UX of the Code Store.&lt;/p&gt;
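&lt;p&gt;A sketch of the missing-parameter check behind that multi-turn loop, assuming the &lt;code&gt;*&lt;/code&gt; marker is appended to the parameter name (the exact spec format may differ):&lt;/p&gt;

```python
# Sketch of the missing-parameter check behind the multi-turn prompting.
# Assumption: required parameters carry a trailing "*" on their name.
def missing_required(spec: list, provided: dict) -> list:
    """Return names of required parameters the user has not supplied yet."""
    missing = []
    for param in spec:
        raw = param["name"]
        name = raw.rstrip("*")
        if raw.endswith("*") and name not in provided:
            missing.append(name)
    return missing
```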




&lt;h2&gt;
  
  
  2. AI Arena — Where Agents Compete
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI Arena&lt;/strong&gt; is a platform where Xoul agents autonomously participate in social and strategy games. Currently, two game types are supported: &lt;strong&gt;Mafia&lt;/strong&gt; and &lt;strong&gt;Discussion&lt;/strong&gt; (free-form debate).&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────┐
│     Central Game Server (AWS)      │
│  ┌──────────┐  ┌───────────────┐ │
│  │ Moderator │  │  API Server   │ │
│  │ (Rules)   │  │ (REST + SSE)  │ │
│  └──────────┘  └───────────────┘ │
│  ┌───────────────────────────┐   │
│  │   Server Bots (Groq LLM)   │   │
│  │  Auto-fill empty seats      │   │
│  └───────────────────────────┘   │
└────────────────┬─────────────────┘
                 │ HTTP Polling (2s)
     ┌───────────┼───────────┐
     │           │           │
 ┌───▼──┐   ┌───▼──┐   ┌───▼──┐
 │Agent A│   │Agent B│   │Agent C│
 │(Ollama)│   │(Ollama)│   │(Ollama)│
 │Local AI│   │Local AI│   │Local AI│
 └───────┘   └───────┘   └───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Central Server&lt;/strong&gt;: Manages game state (roles, phases, turns) and collects agent actions via REST API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Agents&lt;/strong&gt;: Poll the server, detect their turn, and use a local LLM (Ollama) to decide what to say or who to vote for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server Bots&lt;/strong&gt;: When players are scarce, Groq-powered bots fill empty seats automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multi-Game Engine
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseGame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="c1"&gt;# Abstract base
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MafiaGame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="c1"&gt;# Roles, voting, night actions
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscussionGame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Topic rotation, cooldowns, unlimited players
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding a new game type is as simple as creating a package in &lt;code&gt;games/&lt;/code&gt; that exports &lt;code&gt;GAME_CLASS&lt;/code&gt;. The registry auto-discovers it at startup.&lt;/p&gt;
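&lt;p&gt;That auto-discovery pattern is typically a few lines of &lt;code&gt;pkgutil&lt;/code&gt; plus &lt;code&gt;importlib&lt;/code&gt;. A sketch of the pattern (Xoul's registry internals may differ):&lt;/p&gt;

```python
import importlib
import pkgutil

def discover_games(package_name: str = "games") -> dict:
    """Import each subpackage of games/ and register its GAME_CLASS.
    A sketch of the auto-discovery pattern; registry internals may differ."""
    registry = {}
    package = importlib.import_module(package_name)
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        game_cls = getattr(module, "GAME_CLASS", None)
        if game_cls is not None:
            registry[info.name] = game_cls
    return registry
```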




&lt;h2&gt;
  
  
  3. Code Store × Arena — Code Drives the Agent
&lt;/h2&gt;

&lt;p&gt;The most distinctive aspect of AI Arena is that &lt;strong&gt;agent participation itself is a Code Store execution&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Join&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="n"&gt;Desktop&lt;/span&gt; &lt;span class="nc"&gt;App &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt; &lt;span class="n"&gt;arenajoin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Detect&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="nf"&gt;type &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mafia&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;discussion&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Check&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;locally&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;xoul_store&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="n"&gt;Send&lt;/span&gt; &lt;span class="n"&gt;execution&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="nf"&gt;run_stored_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Discussion Game Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;game_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SherlockBot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persona&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analytical detective AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;VM &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Join&lt;/span&gt; &lt;span class="nf"&gt;game &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POST&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;arena&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;games&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Poll&lt;/span&gt; &lt;span class="nf"&gt;state &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;arena&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;games&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Generate&lt;/span&gt; &lt;span class="n"&gt;speech&lt;/span&gt; &lt;span class="n"&gt;via&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nf"&gt;submit &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POST&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;speak&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Detect&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="n"&gt;changes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="n"&gt;cooldowns&lt;/span&gt;
    &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;Print&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Design: Zero-Dependency Agent
&lt;/h3&gt;

&lt;p&gt;Agent code uses &lt;strong&gt;only Python standard library&lt;/strong&gt; (&lt;code&gt;urllib&lt;/code&gt;, &lt;code&gt;json&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt;). No pip installs needed — it runs anywhere, instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Core agent loop (simplified)
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;api_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/arena/games/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/state?player_id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;my_pid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🏁 Game over!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending_action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_speak_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Ollama API
&lt;/span&gt;        &lt;span class="nf"&gt;api_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/arena/games/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/speak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;player_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;my_pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Discussion vs. Mafia
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Mafia&lt;/th&gt;
&lt;th&gt;Discussion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Room creation&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Always auto-maintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Join timing&lt;/td&gt;
&lt;td&gt;Before start only&lt;/td&gt;
&lt;td&gt;Mid-game OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Player limit&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Effectively unlimited (capped at 999)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Roles&lt;/td&gt;
&lt;td&gt;Citizen/Mafia/Police&lt;/td&gt;
&lt;td&gt;All participants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speaking&lt;/td&gt;
&lt;td&gt;Turn-based (ordered)&lt;/td&gt;
&lt;td&gt;Free + 5s cooldown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topics&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;New topic every 10 min (pool of 500)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
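&lt;p&gt;The free-speaking rules above are easy to respect client-side. A minimal sketch, sticking to the standard library; the helper and parameter names are illustrative, not actual Xoul code:&lt;/p&gt;

```python
import time

def speak_with_cooldown(post, game_id, player_id, message, last_spoke, cooldown=5.0):
    """Submit a Discussion message while honoring the per-player cooldown.
    `post` stands in for the agent's api_post helper; the 5-second default
    mirrors the Discussion rules in the table above."""
    now = time.monotonic()
    # How much of the cooldown window is still left for this player?
    wait = max(0.0, cooldown - (now - last_spoke.get(player_id, 0.0)))
    time.sleep(wait)  # wait out the remainder, if any
    post(f"/arena/games/{game_id}/speak",
         {"player_id": player_id, "message": message})
    last_spoke[player_id] = time.monotonic()
```

&lt;p&gt;Tracking timestamps in a plain dict keeps the agent stateless between games: a fresh dict per game means the first message always goes out immediately.&lt;/p&gt;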




&lt;h2&gt;
  
  
  4. Why This Architecture?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Why not put agent logic on the server instead of running client-side code?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is Xoul's philosophy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decentralization&lt;/strong&gt;: The agent's brain (LLM) runs on the user's local machine. The server is just the referee — each AI thinks for itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customization&lt;/strong&gt;: Edit the agent code in your Code Store to change strategies. Aggressive persona, cautious analyst — it's all configurable at the code level.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;: When a new game type is added, just upload a new agent code to the Store. No client app update needed — auto-import → instant play.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>python</category>
    </item>
    <item>
      <title>🔐 Why a GitHub-Based Store? — Security and Community Sharing for Local AI Agents</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Mon, 09 Mar 2026 15:07:44 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/why-a-github-based-store-security-and-community-sharing-for-local-ai-agents-28lc</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/why-a-github-based-store-security-and-community-sharing-for-local-ai-agents-28lc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How Xoul Platform safely shares workflows, personas, and code snippets&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  👤 Why Local AI Agent Security Matters
&lt;/h2&gt;

&lt;p&gt;Local AI Agents execute code directly on the user's machine. This is powerful — but it also carries serious security risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if someone shares malicious code?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Looks like "stock price checker" but...
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rm -rf /&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 💀 System destroyed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you import code from a community Store, that code runs &lt;strong&gt;directly on your local machine&lt;/strong&gt;. File deletion, data theft, malware installation — all possible.&lt;/p&gt;

&lt;p&gt;This is why &lt;strong&gt;every shared item must go through verification&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛡️ GitHub PR-Based Sharing System
&lt;/h2&gt;

&lt;p&gt;We solve this with a &lt;strong&gt;GitHub Pull Request&lt;/strong&gt; based sharing system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Principles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;All shares are submitted as PRs&lt;/strong&gt; — review requests, not direct publishing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only approved items get published&lt;/strong&gt; — malicious code blocked upfront&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code is 100% transparent&lt;/strong&gt; — every line visible for review
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📤 Share Request → GitHub PR → 🔍 Admin Review → ✅ Merge → 🌐 Published to Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why GitHub?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;GitHub PR&lt;/th&gt;
&lt;th&gt;Direct Upload&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code Transparency&lt;/td&gt;
&lt;td&gt;✅ Full diff review&lt;/td&gt;
&lt;td&gt;❌ Opaque contents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review Process&lt;/td&gt;
&lt;td&gt;✅ Built-in code review&lt;/td&gt;
&lt;td&gt;❌ Must build separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Version Control&lt;/td&gt;
&lt;td&gt;✅ Git history&lt;/td&gt;
&lt;td&gt;❌ None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community Contribution&lt;/td&gt;
&lt;td&gt;✅ Fork/PR open-source pattern&lt;/td&gt;
&lt;td&gt;❌ Closed ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;✅ Free&lt;/td&gt;
&lt;td&gt;💰 Storage/DB costs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🔄 Sharing Sequence Flow
&lt;/h2&gt;

&lt;p&gt;Let's walk through the complete flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: User Initiates Share
&lt;/h3&gt;

&lt;p&gt;Click the 📤 button in the desktop app's list view — a share request is sent via chat.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Workflow List ]
Name         Description          Actions
Test WF      Test workflow         ▶ ✏ 📤 🗑
                                       ↑ This button!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7phnlp28urg0hzkrx9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7phnlp28urg0hzkrx9t.png" alt=" " width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: LLM Calls the Tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → "share_to_store(share_type="workflow", name="Test WF") 실행해줘"
  ↓
LLM → 🔧 share_to_store tool call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: VM Reads DB → Calls API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[VM Server]
  ├── Query item data from SQLite DB
  ├── Build ShareRequest payload
  └── POST to web server /api/share
         ↓
[EC2 Web Server]
  ├── GitHub API: Get main branch SHA
  ├── Create branch: share/workflow/test_1234
  ├── Commit file (code/JSON/markdown)
  ├── Update manifest.json
  └── Create Pull Request
         ↓
[GitHub]
  └── PR: "Share workflow: Test WF"
       → Awaiting admin review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
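&lt;p&gt;The EC2 steps above can be sketched as a pure planning function. Everything named here is illustrative (&lt;code&gt;plan_share_calls&lt;/code&gt;, &lt;code&gt;OWNER&lt;/code&gt;, the repo name), but the endpoint paths follow the public GitHub REST API:&lt;/p&gt;

```python
import base64

def plan_share_calls(share_type, name, content, item_id):
    """Plan the sequence of GitHub REST calls the web server issues for
    one share request. Hypothetical helper: payload fields follow the
    flow described above, not actual Xoul source."""
    slug = name.lower().replace(" ", "_")
    branch = f"share/{share_type}/{slug}_{item_id}"
    ext = {"code": ".py", "persona": ".md", "workflow": ".json"}.get(share_type, ".txt")
    path = f"{share_type}s/{slug}{ext}"
    encoded = base64.b64encode(content.encode()).decode()
    return [
        # 1. Read the SHA main currently points at
        ("GET", "/repos/OWNER/xoul-store/git/ref/heads/main", None),
        # 2. Create the share branch from that SHA
        ("POST", "/repos/OWNER/xoul-store/git/refs",
         {"ref": f"refs/heads/{branch}", "sha": "MAIN_SHA"}),
        # 3. Commit the shared file onto the branch
        ("PUT", f"/repos/OWNER/xoul-store/contents/{path}",
         {"message": f"Share {share_type}: {name}",
          "content": encoded, "branch": branch}),
        # 4. Open the pull request for admin review
        ("POST", "/repos/OWNER/xoul-store/pulls",
         {"title": f"Share {share_type}: {name}",
          "head": branch, "base": "main"}),
    ]
```

&lt;p&gt;Separating planning from execution also makes the pipeline testable without touching GitHub.&lt;/p&gt;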



&lt;h3&gt;
  
  
  Step 4: Review &amp;amp; Publish
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Admin]
  ├── Review PR code
  ├── Check for malicious content
  └── Approve &amp;amp; Merge
         ↓
[Store]
  └── ✅ Available for community import
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94eknaccbc51qjpz1n30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94eknaccbc51qjpz1n30.png" alt=" " width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim76f7yqo8xkpsvz17ky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim76f7yqo8xkpsvz17ky.png" alt=" " width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr81ziyuu5pfuqmvw51z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr81ziyuu5pfuqmvw51z.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wb1qwq4809gaqqxveyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wb1qwq4809gaqqxveyb.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🤖 Agent-Based Implementation — LLM Calls, Not Direct Code Calls
&lt;/h2&gt;

&lt;p&gt;The most interesting design decision: &lt;strong&gt;the share function is not called directly from the desktop app, but through the LLM Agent&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Not Direct Calls?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Desktop ↔ DB Access Impossible&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Desktop App (Windows)] ←✗→ [DB (VM Linux)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The desktop app runs on Windows, but workflow/persona/code data lives in the VM's SQLite database. Direct access is impossible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: Complex Cache Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Direct calls would require caching data during list rendering, handling cache misses, pre-fetching lists... it gets complicated fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent-Based Solution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📤 Button Click
  → Chat message sent: "share_to_store(...) run this"
    → LLM calls share_to_store tool
      → Executes on VM (DB access available!)
        → Calls web server API
          → GitHub PR created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
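&lt;p&gt;The flow above can be sketched as a tiny tool registry. The &lt;code&gt;tool&lt;/code&gt; decorator, the in-memory SQLite table, and the field names are all stand-ins for the real VM-side implementation:&lt;/p&gt;

```python
import json
import sqlite3

TOOLS = {}

def tool(fn):
    """Register a function as an LLM-callable tool (hypothetical registry)."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def share_to_store(share_type, name):
    """Runs on the VM, so it can read the local SQLite DB directly."""
    db = sqlite3.connect(":memory:")  # stand-in for the VM database
    db.execute("CREATE TABLE items (type TEXT, name TEXT, content TEXT)")
    db.execute("INSERT INTO items VALUES (?, ?, ?)",
               (share_type, name, "print('demo')"))
    row = db.execute("SELECT content FROM items WHERE type=? AND name=?",
                     (share_type, name)).fetchone()
    # A real implementation would POST this payload to the web server /api/share
    return {"share_type": share_type, "name": name, "content": row[0]}

def dispatch(tool_call):
    """What the agent runtime does when the LLM emits a tool call."""
    args = json.loads(tool_call["arguments"])
    return TOOLS[tool_call["name"]](**args)
```

&lt;p&gt;Whether the trigger is a button click or a typed sentence, the LLM emits the same tool call and &lt;code&gt;dispatch&lt;/code&gt; runs the same function.&lt;/p&gt;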



&lt;p&gt;&lt;strong&gt;Everything goes through the Agent.&lt;/strong&gt; This is Xoul's core philosophy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🧠 &lt;strong&gt;"Every capability exists as an AI Agent tool. The UI is simply an interface that sends requests to the Agent."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Benefits of this approach:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified Interface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Buttons, voice, or text — all trigger the same tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Natural Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Share my test workflow" just works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Easy Extension&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add a tool = add a feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Environment Independent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works the same on VM or Windows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>github</category>
      <category>security</category>
    </item>
    <item>
      <title>Building a GitHub-Based Community Sharing System for a Local AI Agent</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Sun, 08 Mar 2026 14:11:04 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/building-a-github-based-community-sharing-system-for-a-local-ai-agent-18lm</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/building-a-github-based-community-sharing-system-for-a-local-ai-agent-18lm</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How we designed a pipeline where users share code, personas, and workflows with a single button — and operators approve via GitHub PR.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrshma2wpfk7s268xtj2.png" alt=" " width="800" height="468"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Why We Needed a Sharing System
&lt;/h2&gt;

&lt;p&gt;Xoul (Androi) is a locally-running AI agent. Users can create three types of content:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code Store&lt;/strong&gt;: Python utility snippets like &lt;code&gt;crypto prices&lt;/code&gt;, &lt;code&gt;BMI calculator&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personas&lt;/strong&gt;: System prompts defining LLM personality and expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflows&lt;/strong&gt;: Automated pipelines chaining prompts and code steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem: all of this was &lt;strong&gt;trapped in each user's local SQLite database&lt;/strong&gt;. "I made something useful — how do I share it?" was the natural next question.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Sharing Model Dilemma
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Central server upload&lt;/td&gt;
&lt;td&gt;Simple UX&lt;/td&gt;
&lt;td&gt;Admin burden, spam risk, server costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2P direct transfer&lt;/td&gt;
&lt;td&gt;Decentralized&lt;/td&gt;
&lt;td&gt;Zero discoverability, network complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub PR-based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code review, history tracking, free hosting&lt;/td&gt;
&lt;td&gt;Users need GitHub accounts?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GitHub PR won decisively. Code naturally becomes &lt;code&gt;.py&lt;/code&gt; files, personas become &lt;code&gt;.md&lt;/code&gt; files — existing code review culture applies directly. But &lt;strong&gt;one critical concern&lt;/strong&gt;: can we ask non-developer users to create a GitHub account, fork, and submit PRs?&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Key Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2-1. Don't Ask Users for a GitHub Account
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial design:&lt;/strong&gt; User logs in via GitHub OAuth → fork → commit → PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality check:&lt;/strong&gt; Many target users aren't developers. They might not know what GitHub is. Adding an OAuth flow makes UX dramatically more complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final design:&lt;/strong&gt; The server acts as a &lt;strong&gt;proxy&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Desktop  →  Server /api/share  →  GitHub API (Server PAT)  →  PR created
     ↑                                                                 ↓
   PR URL returned  ←──────────────────────────────────────────────  PR URL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server holds a GitHub Personal Access Token and converts user requests into PRs. From the user's perspective, it's one "📤 Share" button. All low-level Git operations are handled server-side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We chose this knowing the trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Extremely simple UX (one button)&lt;/li&gt;
&lt;li&gt;✅ No GitHub account required for users&lt;/li&gt;
&lt;li&gt;⚠️ Server PAT security management needed&lt;/li&gt;
&lt;li&gt;⚠️ All PRs show the server account as author (contributor identified in PR body)&lt;/li&gt;
&lt;/ul&gt;
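&lt;p&gt;Since the PAT makes the server account the commit author, the contributor has to be credited somewhere else. One way the PR body could be composed (field names are illustrative):&lt;/p&gt;

```python
def build_pr_body(share_type, name, contributor, description):
    """Compose the PR description the proxy submits on the user's behalf.
    The human contributor is credited here because the server PAT owns
    the commit itself. A sketch, not the actual Xoul template."""
    return "\n".join([
        f"## Share request: {name}",
        f"- Type: {share_type}",
        f"- Contributor: {contributor}",
        f"- Description: {description}",
        "",
        "Submitted automatically by the share proxy.",
    ])
```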

&lt;h3&gt;
  
  
  2-2. One Repo vs Three
&lt;/h3&gt;

&lt;p&gt;We debated whether codes/personas/workflows should each have their own repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three repos:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate permissions (code maintainer ≠ persona maintainer)&lt;/li&gt;
&lt;li&gt;Independent CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;One repo (chosen):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Management overhead reduced to 1/3&lt;/li&gt;
&lt;li&gt;Single GitHub Action builds everything&lt;/li&gt;
&lt;li&gt;One PR can include both code + workflow&lt;/li&gt;
&lt;li&gt;Contributors only need to know one repo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We followed the "start simple" principle. If scale becomes an issue, we can split later — but premature separation would triple our maintenance burden today.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xoul-store/
├── codes/
│   ├── finance/binance-portfolio.py
│   ├── games/arena-agent-v1.py
│   └── manifest.json
├── personas/
│   ├── research/p-001-en.md
│   └── manifest.json
├── workflows/
│   └── manifest.json
├── dist/          ← Auto-generated by CI
│   ├── codes.json
│   └── personas.json
└── .github/workflows/build.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2-3. Monolith JSON vs Individual Files
&lt;/h3&gt;

&lt;p&gt;Previously, &lt;code&gt;codes.json&lt;/code&gt; contained all 50 codes in a single file. Try reviewing a 3000-line JSON diff in a PR — it's essentially un-reviewable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Individual file benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contributors add &lt;strong&gt;one&lt;/strong&gt; &lt;code&gt;.py&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Code review works at &lt;strong&gt;file granularity&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Change history tracked per code&lt;/li&gt;
&lt;li&gt;Merge conflicts minimized&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But the server wants a single JSON&lt;/strong&gt; for the web Store catalog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: GitHub Action auto-build.&lt;/strong&gt; On every merge, CI reads &lt;code&gt;manifest.json&lt;/code&gt; + individual files and generates &lt;code&gt;dist/codes.json&lt;/code&gt;. Contributors touch &lt;code&gt;.py&lt;/code&gt; + one manifest line. The server fetches &lt;code&gt;dist/codes.json&lt;/code&gt; from GitHub Raw URL, with local fallback.&lt;/p&gt;
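&lt;p&gt;The CI build step could be as small as this. The manifest schema shown here (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;) is an assumption for illustration, not the actual store format:&lt;/p&gt;

```python
import json
from pathlib import Path

def build_catalog(store_dir):
    """CI build sketch: merge manifest entries with each .py file's source
    into one dist/codes.json the server can fetch in a single request."""
    root = Path(store_dir)
    manifest = json.loads((root / "codes" / "manifest.json").read_text())
    catalog = []
    for entry in manifest:
        code_file = root / "codes" / entry["path"]
        catalog.append({
            "name": entry["name"],
            "category": entry.get("category", "misc"),
            "code": code_file.read_text(),  # inline the full source
        })
    out = root / "dist"
    out.mkdir(exist_ok=True)
    (out / "codes.json").write_text(json.dumps(catalog, indent=2))
    return catalog
```

&lt;p&gt;Contributors never touch &lt;code&gt;dist/&lt;/code&gt;; it is regenerated on every merge, so review diffs stay one file wide.&lt;/p&gt;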




&lt;h2&gt;
  
  
  3. Code Layer Design Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3-1. &lt;code&gt;used_by&lt;/code&gt; References — Why We Abandoned private/public
&lt;/h3&gt;

&lt;p&gt;Initially, we planned &lt;code&gt;private&lt;/code&gt;/&lt;code&gt;public&lt;/code&gt; flags for Code Store items. Workflow-only codes would be private, standalone codes public.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question that killed it:&lt;/strong&gt; "What if the same code is used by multiple workflows?"&lt;/p&gt;

&lt;p&gt;Binary classification can't express many-to-many relationships. We switched to a &lt;strong&gt;&lt;code&gt;used_by&lt;/code&gt; JSON array&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# codes table
&lt;/span&gt;&lt;span class="n"&gt;used_by&lt;/span&gt; &lt;span class="n"&gt;TEXT&lt;/span&gt; &lt;span class="n"&gt;DEFAULT&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;   &lt;span class="c1"&gt;# e.g., ["Tech Trend Research", "Daily Blog"]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deletion protection:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-empty &lt;code&gt;used_by&lt;/code&gt; → block deletion + show which workflows reference it&lt;/li&gt;
&lt;li&gt;Empty array → &lt;strong&gt;no auto-deletion&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why no GC? A code removed from a workflow isn't necessarily useless. Users might re-attach it later, or run it standalone via &lt;code&gt;run_stored_code&lt;/code&gt;. Over-aggressive garbage collection surprises users.&lt;/p&gt;
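&lt;p&gt;The deletion guard is a few lines once &lt;code&gt;used_by&lt;/code&gt; is a JSON array. A sketch (the row shape and messages are illustrative):&lt;/p&gt;

```python
import json

def try_delete_code(db_row):
    """Deletion guard sketch: a code row keeps a JSON array of the
    workflows that reference it; deletion is refused while any remain."""
    users = json.loads(db_row.get("used_by", "[]"))
    if users:
        names = ", ".join(users)
        return (False, f"Blocked: still used by {names}")
    # Empty array: deletable on request, but never garbage-collected
    return (True, "Deleted")
```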

&lt;h3&gt;
  
  
  3-2. &lt;code&gt;code_name&lt;/code&gt; References — Eliminating Inline Duplication
&lt;/h3&gt;

&lt;p&gt;Workflow code steps previously stored &lt;strong&gt;entire code inline&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"import urllib.request&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;...200 lines..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same code in 3 workflows = 3 copies. Fix a bug? Find and update all 3. This violates basic database normalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New approach:&lt;/strong&gt; Steps store only &lt;code&gt;code_name&lt;/code&gt;, resolved from Code Store at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"code_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crypto prices"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Backward compatibility:&lt;/strong&gt; If &lt;code&gt;code_name&lt;/code&gt; exists → fetch from Store. Otherwise → use legacy &lt;code&gt;content&lt;/code&gt;. No existing workflows break.&lt;/p&gt;
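&lt;p&gt;The resolution rule fits in one function. A sketch, assuming a simple name-to-source mapping for the Code Store:&lt;/p&gt;

```python
def resolve_step_code(step, code_store):
    """Resolve a workflow code step. New-style steps carry only a
    code_name looked up in the Code Store; legacy steps still embed
    their source in content, so neither format breaks."""
    name = step.get("code_name")
    if name:
        return code_store[name]   # single canonical copy
    return step["content"]        # legacy inline code
```

&lt;p&gt;A bug fix now lands in one place and every referencing workflow picks it up at its next run.&lt;/p&gt;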

&lt;h3&gt;
  
  
  3-3. &lt;code&gt;def run()&lt;/code&gt; Standardization — Unifying Two Worlds
&lt;/h3&gt;

&lt;p&gt;Code Store's 50 codes were &lt;strong&gt;flat scripts&lt;/strong&gt; (globals, no function wrapper). The workflow editor expected &lt;code&gt;def run(params):&lt;/code&gt; signatures. Two execution models running in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dilemma:&lt;/strong&gt; Rewrite all 50, or support both at runtime?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt; Both. &lt;code&gt;run_stored_code&lt;/code&gt; detects whether &lt;code&gt;def run(&lt;/code&gt; is present and switches between function invocation and plain &lt;code&gt;exec&lt;/code&gt;. We then converted all 50 codes to the &lt;code&gt;def run():&lt;/code&gt; signature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (flat — how does the LLM know what params to pass?)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.coingecko.com/...?ids=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;coins&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# After (standardized — LLM reads signature + docstring)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coins&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bitcoin,ethereum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    coins: Coin IDs (default: bitcoin,ethereum)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.coingecko.com/...?ids=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;coins&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The biggest beneficiary is the &lt;strong&gt;LLM itself&lt;/strong&gt;. With a function signature and docstring, it immediately knows what parameters to pass and in what format. With flat code, the LLM had to parse the entire script body.&lt;/p&gt;
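&lt;p&gt;The dual-mode dispatch can be sketched like this. Here the check happens after &lt;code&gt;exec&lt;/code&gt; by inspecting the resulting namespace, which is one possible way to implement the same rule:&lt;/p&gt;

```python
def run_stored_code(source, params=None):
    """Execution dispatch sketch: codes defining run() are invoked as a
    function with params; legacy flat scripts are simply exec'd."""
    namespace = {}
    exec(source, namespace)
    if "run" in namespace and callable(namespace["run"]):
        return namespace["run"](**(params or {}))
    # Flat legacy scripts may leave their output in a `result` global
    return namespace.get("result")
```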




&lt;h2&gt;
  
  
  4. Side Fix: Removing Hardcoded Model References
&lt;/h2&gt;

&lt;p&gt;While working on the sharing pipeline, we discovered Arena agent code (&lt;code&gt;arena_agent_code.py&lt;/code&gt;, &lt;code&gt;arena-agent-v1.py&lt;/code&gt;) had &lt;code&gt;gpt-oss:20b&lt;/code&gt; hardcoded — bypassing the 4B model configured in &lt;code&gt;config.json&lt;/code&gt;. This caused unexpected memory spikes.&lt;/p&gt;

&lt;p&gt;We removed all hardcoded model references and &lt;strong&gt;eliminated fallback defaults&lt;/strong&gt;. If config is missing, a &lt;code&gt;RuntimeError&lt;/code&gt; fires immediately rather than silently loading the wrong model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design principle: fail-fast &amp;gt; silent wrong behavior.&lt;/strong&gt; A crash with a clear error message is always better than 15GB of unexpected VRAM usage.&lt;/p&gt;
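&lt;p&gt;A minimal sketch of the fail-fast rule, assuming a &lt;code&gt;model&lt;/code&gt; key in &lt;code&gt;config.json&lt;/code&gt; (the real key names may differ):&lt;/p&gt;

```python
import json
from pathlib import Path

def load_model_name(config_path):
    # Fail fast: no hardcoded fallback model. A missing or incomplete
    # config raises immediately instead of silently loading the wrong model.
    # Hypothetical sketch; the project's actual config layout may differ.
    path = Path(config_path)
    if not path.exists():
        raise RuntimeError(f"config not found: {config_path} (refusing to fall back)")
    cfg = json.loads(path.read_text())
    model = cfg.get("model")
    if not model:
        raise RuntimeError("'model' missing in config (refusing to guess a default)")
    return model
```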




</description>
      <category>agents</category>
      <category>ai</category>
      <category>github</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Local LLM Agent Benchmark: Comparing 6 Models in Real-World Scenarios</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Sat, 28 Feb 2026 07:01:54 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/local-llm-agent-benchmark-comparing-6-models-in-real-world-scenarios-3ffb</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/local-llm-agent-benchmark-comparing-6-models-in-real-world-scenarios-3ffb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Measuring AI agent performance by &lt;strong&gt;actual outcome correctness&lt;/strong&gt;, not just tool call presence&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why We Built This Benchmark
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"To make it accessible for general users, it is crucial to find an LLM with the lowest possible VRAM footprint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most LLM benchmarks evaluate models on academic metrics like MMLU, HumanEval, or HellaSwag. But for &lt;strong&gt;tool-using AI agents&lt;/strong&gt;, what truly matters isn't "did it call the right tool?" — it's &lt;strong&gt;"did it actually produce the correct result?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our project &lt;strong&gt;Androi&lt;/strong&gt; is a local AI agent that uses 10+ tools including web search, Python execution, file management, email, and calendar. We connected various LLMs to the same agent and ran &lt;strong&gt;5 identical complex real-world scenarios&lt;/strong&gt;, scoring each based on the correctness of their outputs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server&lt;/strong&gt;: Ubuntu VM (3.8GB RAM, 20GB SSD)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: Ollama (local inference)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework&lt;/strong&gt;: Androi Agent (Node.js + Python tool pipeline)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Outcome-Based Validation (v2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Date&lt;/strong&gt;: 2026-02-28&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The 5 Real-World Test Scenarios (39 Total Checks)
&lt;/h2&gt;

&lt;p&gt;Each test requires the agent to &lt;strong&gt;chain multiple tools sequentially&lt;/strong&gt; to complete a complex, multi-step task.&lt;/p&gt;

&lt;h3&gt;
  
  
  U01. 🏦 Global Asset Rebalancing Advisor (9 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: The user holds 50 shares of Samsung Electronics, 0.1 BTC, $3,000 USD, and 1 oz of gold. The agent must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Web search&lt;/strong&gt; current prices for each asset (Samsung stock, Bitcoin, USD/KRW rate, gold price)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert to KRW&lt;/strong&gt; and calculate total portfolio value&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute Python code&lt;/strong&gt; to compute each asset's weight (%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare against ideal allocation&lt;/strong&gt; (Stocks 40%, Crypto 20%, USD 20%, Gold 20%) and recommend rebalancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save report&lt;/strong&gt; to file (&lt;code&gt;/tmp/rebalance_report.txt&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register calendar event&lt;/strong&gt; for next Friday's review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send email&lt;/strong&gt; with the report attached&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Samsung price, Bitcoin price, USD rate, Gold price, Total calculation, Weight analysis, Rebalancing recommendation, File saved, Email sent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;web_search&lt;/code&gt; × 4, &lt;code&gt;run_python_code&lt;/code&gt; / &lt;code&gt;calculate&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;create_event&lt;/code&gt;, &lt;code&gt;send_email&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  U02. 📊 Real-Time Tech Trend Research &amp;amp; Report (8 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Research the 2026 AI semiconductor market and produce a comprehensive report.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search &lt;strong&gt;"AI semiconductor market forecast 2026"&lt;/strong&gt; → collect market size data&lt;/li&gt;
&lt;li&gt;Search &lt;strong&gt;"NVIDIA HBM market share 2026"&lt;/strong&gt; → understand competitive landscape&lt;/li&gt;
&lt;li&gt;Search &lt;strong&gt;"Samsung HBM3E mass production"&lt;/strong&gt; → Korean industry status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate markdown report&lt;/strong&gt; using Python with collected data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save report&lt;/strong&gt; to &lt;code&gt;/tmp/ai_semiconductor_report.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register weekly automated task&lt;/strong&gt; for trend updates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send report via email&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Market size mentioned, NVIDIA mentioned, HBM mentioned, Samsung trends, SK Hynix trends, Report saved, Auto-task registered, Email sent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;web_search&lt;/code&gt; × 3, &lt;code&gt;run_python_code&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;create_task&lt;/code&gt;, &lt;code&gt;send_email&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  U03. 🖥️ Server Health Check + Auto-Recovery + Alerts (7 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Perform a comprehensive VM server health check and generate a report.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;df -h&lt;/code&gt; to check &lt;strong&gt;disk usage&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;free -h&lt;/code&gt; to check &lt;strong&gt;memory status&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;systemctl list-units --state=failed&lt;/code&gt; to identify &lt;strong&gt;failed services&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use Python to analyze last 50 lines of &lt;code&gt;/var/log/syslog&lt;/code&gt; for &lt;strong&gt;ERROR/WARNING/CRITICAL frequency&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;find&lt;/code&gt; to list &lt;strong&gt;temp files older than 7 days&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save full report&lt;/strong&gt; with &lt;strong&gt;risk level assessment&lt;/strong&gt; (High/Medium/Low)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Register hourly auto-check task&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Disk usage, Memory status, Service status, Log analysis, Risk level assessment, Report saved, Auto-task registered&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;run_command&lt;/code&gt; × 4, &lt;code&gt;run_python_code&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;create_task&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  U04. 🌍 Travel Planner (8 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Plan a weekend trip to Jeju Island (1 night, 2 days).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search &lt;strong&gt;"Jeju Island February weather"&lt;/strong&gt; → temperature and conditions&lt;/li&gt;
&lt;li&gt;Search &lt;strong&gt;"Jeju winter restaurant recommendations 2026"&lt;/strong&gt; → select 3 restaurants&lt;/li&gt;
&lt;li&gt;Search &lt;strong&gt;"Jeju winter tourist attractions"&lt;/strong&gt; → select 3 attractions&lt;/li&gt;
&lt;li&gt;Use Python to create a &lt;strong&gt;Day 1 / Day 2 timetable&lt;/strong&gt; (09:00–21:00, alternating attractions and restaurants)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate estimated budget&lt;/strong&gt; (meals: 30K KRW × 6, hotel: 150K, transport: 50K = 380K KRW)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save travel plan&lt;/strong&gt; to file + &lt;strong&gt;Register calendar events&lt;/strong&gt; (departure/return)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send plan via email&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Weather info, Restaurant recommendations, Tourist attractions, Day 1/Day 2 separation, Timetable, Cost calculation, Calendar events, Email sent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;web_search&lt;/code&gt; × 3, &lt;code&gt;run_python_code&lt;/code&gt;, &lt;code&gt;calculate&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;create_event&lt;/code&gt; × 2, &lt;code&gt;send_email&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  U05. 🧬 Code Analysis + Optimization + Deployment (7 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Analyze the server's &lt;code&gt;tool_registry.py&lt;/code&gt; file and produce a code review report.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;read_file&lt;/code&gt; to &lt;strong&gt;read entire source code&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute Python&lt;/strong&gt; to count lines, functions, and classes&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;wc -l /root/xoul/tools/*.py&lt;/code&gt; to check &lt;strong&gt;total module size&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;calculate&lt;/code&gt; to compute &lt;strong&gt;tool_registry.py's percentage&lt;/strong&gt; of total codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save analysis report&lt;/strong&gt; to &lt;code&gt;/tmp/code_analysis.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store key findings&lt;/strong&gt; in memory (recall/memorize)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send report via email&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Line count, Function count, Total module size, Percentage calculated, Code structure explained, Report saved, Email sent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;run_python_code&lt;/code&gt;, &lt;code&gt;run_command&lt;/code&gt;, &lt;code&gt;calculate&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;memorize&lt;/code&gt;, &lt;code&gt;send_email&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Validation Method: Outcome-Based
&lt;/h2&gt;

&lt;p&gt;Instead of checking "did it call the right tool?", we verify &lt;strong&gt;"does the output contain the correct information?"&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100% = 🏆 PERFECT — All validation checks passed
≥70% = ✅ GOOD — Most critical outcomes achieved  
≥50% = ⚠️ PARTIAL — More than half achieved
&amp;lt;50% = ❌ FAIL — Critical outcomes missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, in U01 if the agent didn't explicitly call &lt;code&gt;send_email&lt;/code&gt; but the response contains "email sent successfully", it passes. Conversely, calling &lt;code&gt;web_search&lt;/code&gt; but not including Samsung's stock price in the response is a fail.&lt;/p&gt;
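&lt;p&gt;The grading above can be sketched as a simple outcome check. This is an illustrative stand-in for the v2 validator; the real checks may use regexes or numeric tolerances rather than plain substrings:&lt;/p&gt;

```python
def score_outcome(response, checks):
    # Outcome-based grading: each check maps a name to the evidence expected
    # in the agent's final output. Hypothetical sketch of the v2 validator.
    passed = sum(1 for needle in checks.values()
                 if needle.lower() in response.lower())
    ratio = passed / len(checks)
    if ratio == 1.0:
        return "PERFECT"
    if ratio >= 0.7:
        return "GOOD"
    if ratio >= 0.5:
        return "PARTIAL"
    return "FAIL"
```

&lt;p&gt;Grading the final response text, rather than the tool-call trace, is what lets "email sent successfully" pass without an explicit &lt;code&gt;send_email&lt;/code&gt; call being observed.&lt;/p&gt;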




&lt;h2&gt;
  
  
  🏆 Final Rankings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;🏆 PERFECT&lt;/th&gt;
&lt;th&gt;✅ GOOD&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GPT-oss-20B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37/39 (95%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;264s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.5-27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37/39 (95%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1,101s&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-8B Q8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;36/39 (92%)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;377s&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4️⃣&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GLM-4.7-Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~4B(MoE)&lt;/td&gt;
&lt;td&gt;36/39 (92%)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1,310s&lt;/td&gt;
&lt;td&gt;⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5️⃣&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-8B Q4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;35/39 (90%)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;441s&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6️⃣&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.5-35B-A3B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;35B(MoE)&lt;/td&gt;
&lt;td&gt;31/39 (79%)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;552s&lt;/td&gt;
&lt;td&gt;⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Per-Test Heatmap
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;U01 Assets&lt;/th&gt;
&lt;th&gt;U02 Research&lt;/th&gt;
&lt;th&gt;U03 Server&lt;/th&gt;
&lt;th&gt;U04 Travel&lt;/th&gt;
&lt;th&gt;U05 Code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-oss-20B&lt;/td&gt;
&lt;td&gt;🏆 9/9&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;✅ 6/7&lt;/td&gt;
&lt;td&gt;✅ 7/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-27B&lt;/td&gt;
&lt;td&gt;🏆 9/9&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;td&gt;✅ 6/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-8B Q8&lt;/td&gt;
&lt;td&gt;✅ 8/9&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;td&gt;✅ 6/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-4.7-Flash&lt;/td&gt;
&lt;td&gt;✅ 8/9&lt;/td&gt;
&lt;td&gt;✅ 6/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-8B Q4&lt;/td&gt;
&lt;td&gt;🏆 9/9&lt;/td&gt;
&lt;td&gt;✅ 7/8&lt;/td&gt;
&lt;td&gt;✅ 5/7&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;✅ 6/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-35B-A3B&lt;/td&gt;
&lt;td&gt;🏆 9/9&lt;/td&gt;
&lt;td&gt;✅ 7/8&lt;/td&gt;
&lt;td&gt;⚠️ 4/7&lt;/td&gt;
&lt;td&gt;✅ 7/8&lt;/td&gt;
&lt;td&gt;⚠️ 4/7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Parameter Count Isn't Everything
&lt;/h3&gt;

&lt;p&gt;The 35B MoE model (Qwen3.5-35B-A3B) scored &lt;strong&gt;last at 79%&lt;/strong&gt;, while the 8B model (Qwen3-8B Q8) achieved &lt;strong&gt;92% in 3rd place&lt;/strong&gt;. For agent tasks, tool-use capability and instruction following matter more than raw parameter count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personally, I suspect dense (full-weight) models handle the long tool chains agents require better than MoE models, though this is unverified.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Quantization Affects Agent Quality
&lt;/h3&gt;

&lt;p&gt;Comparing Qwen3-8B Q8 vs Q4: the Q4 variant exhibited &lt;strong&gt;tool call repetition loops&lt;/strong&gt; — it repeated the same &lt;code&gt;df -h &amp;amp;&amp;amp; free -h&lt;/code&gt; command 6 times in U03. This suggests that tool chaining stability is sensitive to quantization levels.&lt;/p&gt;
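&lt;p&gt;One possible mitigation (not part of the benchmark harness, just a sketch) is a repetition guard on the tool-call history:&lt;/p&gt;

```python
def is_looping(call_history, window=3):
    # Detect the repetition failure mode described above: the same tool call
    # issued `window` times in a row. Hypothetical guard for illustration.
    recent = list(call_history)[-window:]
    return len(recent) == window and len(set(recent)) == 1
```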

&lt;h3&gt;
  
  
  3. Speed vs. Accuracy Trade-offs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-oss-20B&lt;/strong&gt;: Fastest (264s) AND most accurate (95%) — clear winner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3.5-27B&lt;/strong&gt;: Tied accuracy but 4× slower — for when depth matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-8B Q8&lt;/strong&gt;: Best performance-per-parameter — recommended for resource-limited environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. "Chain Completion" Is the Key Differentiator
&lt;/h3&gt;

&lt;p&gt;Most models perform well on intermediate steps (searching, analyzing), but the real differentiation occurs at the &lt;strong&gt;end of the chain&lt;/strong&gt; — sending emails, saving files, and registering automated tasks. Qwen3.5-35B-A3B was notably weak at these final steps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing an LLM for a local AI agent requires evaluating not just benchmark scores, but &lt;strong&gt;tool chaining completion rate, instruction adherence, and response speed&lt;/strong&gt; together.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🏆 &lt;strong&gt;Best overall&lt;/strong&gt;: GPT-oss-20B (speed + accuracy leader)&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Best value&lt;/strong&gt;: Qwen3-8B Q8 (92% with just 8B parameters at 377s)&lt;/li&gt;
&lt;li&gt;🔬 &lt;strong&gt;Deepest analysis&lt;/strong&gt;: Qwen3.5-27B (4 of 5 tests PERFECT, the most of any model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Test code and full results are available at &lt;a href="//./test_ultimate_extreme.py"&gt;tests/test_ultimate_extreme.py&lt;/a&gt; and &lt;a href="//./model_benchmark.md"&gt;tests/model_benchmark.md&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Designing a Tool Architecture for AI Agents — Base Tools, Toolkits, and Dynamic Routing</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Fri, 27 Feb 2026 17:00:26 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/designing-a-tool-architecture-for-ai-agents-base-tools-toolkits-and-dynamic-routing-fdo</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/designing-a-tool-architecture-for-ai-agents-base-tools-toolkits-and-dynamic-routing-fdo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How do you give an AI agent 30+ tools without drowning the context window? Include everything and you waste tokens. Be selective and the agent can't do its job. Here's how I solved it with a 3-layer architecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem: Too Many Tools
&lt;/h2&gt;

&lt;p&gt;As your AI agent grows, so does its toolbox. My personal assistant now has 35+ tools — web search, email, calendar, weather, Git, host PC control, file management, code execution, and more.&lt;/p&gt;

&lt;p&gt;Sending all 35 tool schemas to the LLM on every request causes two problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token cost explosion&lt;/strong&gt;: 35 JSON function schemas easily consume 3,000+ tokens per turn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selection accuracy drops&lt;/strong&gt;: The more tools available, the more likely the LLM picks the wrong one&lt;/li&gt;
&lt;/ol&gt;
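&lt;p&gt;The 3,000-token figure is easy to sanity-check with the common ~4-characters-per-token heuristic. The schemas below are invented for illustration; real counts depend on the model's tokenizer:&lt;/p&gt;

```python
import json

def estimate_schema_tokens(schemas):
    # Rough cost of shipping tool schemas on every turn, using the
    # ~4-characters-per-token heuristic. Illustration only.
    return len(json.dumps(schemas)) // 4

# 35 hypothetical schemas of typical size (~400 chars each once serialized)
schemas = [{"type": "function",
            "function": {"name": f"tool_{i}",
                         "description": "x" * 160,
                         "parameters": {"type": "object",
                                        "properties": {"arg": {"type": "string",
                                                               "description": "y" * 80}}}}}
           for i in range(35)]

print(estimate_schema_tokens(schemas))  # already past the 3,000-token mark
```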

&lt;p&gt;But if you trim tools aggressively, the agent can't handle requests it should be able to.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: 3-Layer Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         ┌───────────────┐
         │  User Input    │
         └───────┬───────┘
                 ↓
┌────────────────────────────────────┐
│          Tool Registry             │
│                                    │
│  ┌────────────┐  ┌──────────────┐  │
│  │ Base Tools │  │  Toolkits    │  │
│  │ (always ON)│  │(dynamic load)│  │
│  │ 13 general │  │  8 packs     │  │
│  └──────┬─────┘  └──────┬───────┘  │
│         │               │          │
│         │         ┌─────┴──────┐   │
│         │         │   Tasks    │   │
│         │         │(individual)│   │
│         │         └─────┬──────┘   │
│         └───────┬───────┘          │
│                 ↓                  │
│     ┌───────────────────────┐      │
│     │ Selected Tools → LLM  │      │
│     └───────────────────────┘      │
└────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 1: Base Tools — Always Included
&lt;/h3&gt;

&lt;p&gt;13 general-purpose tools that could be needed for any request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web_search, read_file, write_file, list_files,
run_command, get_datetime, calculate, run_python_code,
pip_install, recall, forget,
host_list_files, vm_to_host, host_to_vm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Web search, file I/O, code execution, memory (recall/forget) — these are universal. Included in every LLM call regardless of the user's request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Toolkits — Domain-Specific Tool Packs
&lt;/h3&gt;

&lt;p&gt;Related tools grouped into packs, defined as JSON files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toolkits/
├── calendar.json    # create_event, list_events, update_event, delete_event
├── contacts.json    # find_contact
├── email.json       # send_email, read_email, search_email
├── git.json         # git_clone, git_status, git_commit, git_push
├── host_pc.json     # host_open_url, host_open_app, host_find_file, ...
├── meta.json        # help, show_config, health
├── scheduler.json   # create_task, list_tasks, cancel_task
└── weather.json     # weather
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each toolkit JSON contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"free"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Weather and forecast information..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"날씨"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"temperature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rain"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"umbrella"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tasks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get weather info..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;description&lt;/strong&gt;: Used for embedding similarity matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;keywords&lt;/strong&gt;: Fast keyword-based activation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tasks&lt;/strong&gt;: Actual OpenAI function calling schemas sent to the LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 3: Tasks — Individual Tool Functions
&lt;/h3&gt;

&lt;p&gt;A Task is a single tool function inside a Toolkit. The &lt;code&gt;weather&lt;/code&gt; toolkit has 1 task (&lt;code&gt;weather()&lt;/code&gt;), while &lt;code&gt;calendar&lt;/code&gt; has 4 (&lt;code&gt;create_event&lt;/code&gt;, &lt;code&gt;list_events&lt;/code&gt;, &lt;code&gt;update_event&lt;/code&gt;, &lt;code&gt;delete_event&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Dynamic Routing: Which Toolkits to Activate?
&lt;/h2&gt;

&lt;p&gt;The key question: given a user input, which toolkits are relevant?&lt;/p&gt;

&lt;h3&gt;
  
  
  Two-Stage Matching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;select_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_tools&lt;/span&gt;  &lt;span class="c1"&gt;# always included
&lt;/span&gt;
    &lt;span class="c1"&gt;# Stage 1: Keyword matching (fast, certain)
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_toolkits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;

    &lt;span class="c1"&gt;# Stage 2: Embedding similarity (catches what keywords miss)
&lt;/span&gt;    &lt;span class="n"&gt;input_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;remaining_toolkits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;cosine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;selected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Stage 1: Keyword Matching
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;"What's the weather today?" → &lt;code&gt;"weather"&lt;/code&gt; keyword → weather toolkit activated&lt;/li&gt;
&lt;li&gt;"Open Chrome" → &lt;code&gt;"열어줘"&lt;/code&gt; (Korean "open") keyword → host_pc toolkit activated&lt;/li&gt;
&lt;li&gt;Fast and precise, but limited coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Stage 2: Embedding Similarity (BGE-M3)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;"Should I bring an umbrella?" → No keyword match, but semantically similar to weather toolkit → activated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold: 0.40&lt;/strong&gt; (prioritize recall — better to include extra tools than miss needed ones)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model: BGE-M3&lt;/strong&gt; (multilingual, runs locally via Ollama)&lt;/li&gt;
&lt;/ul&gt;
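&lt;p&gt;The matching step itself is plain cosine similarity over the pre-computed vectors. A dependency-free sketch (BGE-M3 produces the vectors; this shows only the comparison):&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between the user-input embedding and a toolkit's
    # pre-computed description embedding.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.40  # recall-leaning: extra toolkits are cheaper than a missed one
```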

&lt;h3&gt;
  
  
  Pre-computed Embeddings
&lt;/h3&gt;

&lt;p&gt;At server startup, all toolkit descriptions are embedded once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_toolkits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# At request time, only the user input needs embedding (1 API call)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "If it rains tomorrow, plan an indoor workout and add it to my calendar"

[ToolRouter] 18/35 tools | activated: [weather(keyword:1.00), calendar(embed:0.52)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"rain"&lt;/code&gt; → weather toolkit via keyword&lt;/li&gt;
&lt;li&gt;"add to calendar" → calendar toolkit via embedding (similarity 0.52)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Base 13 + weather 1 + calendar 4 = 18 tools&lt;/strong&gt; sent to LLM. The other 17 tools (git, email, contacts, etc.) are excluded → &lt;strong&gt;token savings + better accuracy&lt;/strong&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Decision 1: Why not send all tools every time?
&lt;/h3&gt;

&lt;p&gt;With 35+ tool schemas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token cost increases (inference cost + response latency)&lt;/li&gt;
&lt;li&gt;LLM confuses similar tools (&lt;code&gt;send_email&lt;/code&gt; vs &lt;code&gt;host_run_command&lt;/code&gt; for sending mail)&lt;/li&gt;
&lt;li&gt;Especially severe with smaller models (8B parameters)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Decision 2: Why not use embeddings only?
&lt;/h3&gt;

&lt;p&gt;Embedding-only approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Even obvious keywords like "weather" require an embedding API call (unnecessary latency)&lt;/li&gt;
&lt;li&gt;If the embedding server goes down, everything breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;Keywords first + embeddings as fallback&lt;/strong&gt; is the optimal two-stage design&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 3: What threshold for similarity?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;0.60: Precise, but misses relevant toolkits&lt;/li&gt;
&lt;li&gt;0.40: May over-activate, but rarely misses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall is the priority&lt;/strong&gt;: a few extra tools in the context are relatively harmless (the LLM usually ignores them), but missing a needed tool means the agent simply can't do its job&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Tool management for AI agents comes down to one question: &lt;strong&gt;"Which tools should the LLM see for this specific request?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 3-layer answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Tools&lt;/strong&gt;: Universal → always ON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Toolkits&lt;/strong&gt;: Domain packs → dynamically activated via keyword + embedding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt;: Individual functions inside toolkits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture lets "What's the weather?" include only the weather task, while "Commit my code" includes only git tasks — saving tokens and improving accuracy across the board.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Xoul, AI Agent : Optimizing Web Search — From 20s to 8s</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Thu, 26 Feb 2026 13:58:24 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-ai-agent-optimizing-web-search-from-20s-to-8s-37k3</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-ai-agent-optimizing-web-search-from-20s-to-8s-37k3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How we optimized a local AI agent's web search by finding the real bottleneck.&lt;br&gt;
A story of missteps, failed experiments, and eventually finding the true cause.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"It's a simple detail in retrospect, but I had an issue with my web search pipeline. Passing raw search results meant feeding thousands of tokens into the next input, so I added a very small model to summarize the data first. However, because Ollama defaults to sequential execution, the process took 3x longer. It seems obvious now, but it was a major bottleneck until I profiled it! I also disabled non-text rendering (images, CSS, etc.) as a further optimization."&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Xoul is a local AI agent that answers user queries by searching the web. The pipeline is: search → visit URLs → extract text → generate LLM response. A simple question like "How many pages is [book title]?" was either &lt;strong&gt;failing completely or taking 20+ seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
Problem: Single URL Fetch = High Failure Rate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Symptoms
&lt;/h3&gt;

&lt;p&gt;Only visiting 1 search result URL meant that if that site was slow or JS-heavy, the entire search failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix: Parallel 3-URL Fetch
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;concurrent.futures.ThreadPoolExecutor&lt;/code&gt; to fetch the top 3 URLs simultaneously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(_fetch_one, url): url for url in fetch_urls}
    try:
        for future in as_completed(futures, timeout=30):
            result = future.result()
            if result:
                fetched.append(result)
    except FuturesTimeout:  # plain TimeoutError only catches this on Python 3.11+
        # Keep whatever completed (and didn't raise), discard the rest
        for future in futures:
            if future.done() and not future.exception():
                fetched.append(future.result())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Only 1 of 3 needs to succeed. Dramatically reduced failure rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failed Experiment: Host-Side Browser Daemon
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hypothesis
&lt;/h3&gt;

&lt;p&gt;"Chrome on the Windows HOST (full CPU/GPU) should be faster than Chromium in the VM (limited resources)."&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;Created &lt;code&gt;browser_daemon_host.py&lt;/code&gt; to run Chrome headless on Windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Served on PORT 9224&lt;/li&gt;
&lt;li&gt;VM accesses via &lt;code&gt;10.0.2.2:9224&lt;/code&gt; (QEMU gateway to host)&lt;/li&gt;
&lt;li&gt;Chrome CDP with &lt;code&gt;/json/new&lt;/code&gt; + &lt;code&gt;Page.navigate&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
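&lt;p&gt;For reference, a hedged sketch of how the VM side could reach that endpoint. Port 9224 and the &lt;code&gt;10.0.2.2&lt;/code&gt; gateway are from the setup above; the helper name and the &lt;code&gt;requests&lt;/code&gt; usage are illustrative, and note that recent Chrome versions require &lt;code&gt;PUT&lt;/code&gt; rather than &lt;code&gt;GET&lt;/code&gt; for &lt;code&gt;/json/new&lt;/code&gt;:&lt;/p&gt;

```python
from urllib.parse import quote

HOST_CDP = "http://10.0.2.2:9224"  # QEMU gateway to the Windows host

def new_tab_endpoint(url):
    """Build the DevTools /json/new endpoint that opens `url` in a new tab."""
    return f"{HOST_CDP}/json/new?{quote(url, safe='')}"

# Hypothetical usage (requires the host daemon to be running):
#   import requests
#   tab = requests.put(new_tab_endpoint("https://example.com"), timeout=5).json()
#   tab["webSocketDebuggerUrl"] then accepts CDP commands such as Page.navigate.
print(new_tab_endpoint("https://example.com"))
```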

&lt;h3&gt;
  
  
  Single URL Fetch Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;HOST Chrome&lt;/th&gt;
&lt;th&gt;VM Chromium&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single URL fetch&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Difference&lt;/td&gt;
&lt;td&gt;0.5s faster&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  But...
&lt;/h3&gt;

&lt;p&gt;End-to-end API test results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With HOST Chrome: &lt;strong&gt;20.0s&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;VM Chromium only: &lt;strong&gt;20.4s&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;0.4s difference.&lt;/strong&gt; Why? Browser fetch is only ~2s of the total 20s. The remaining 18s was spent elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Host daemon added complexity with negligible benefit. &lt;strong&gt;Scrapped.&lt;/strong&gt; A classic case of optimizing the wrong bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the Real Bottleneck: 3x Serial LLM Summarization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code Analysis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web_search("book page count")
 ├─ DDG search                        ~1s
 ├─ 3 URL fetch (parallel)            ~2s
 ├─ 3 LLM summaries (qwen3:0.6b)     ~12s ← HERE!
 │   ├─ URL1 → _summarize_with_llm    ~4s
 │   ├─ URL2 → _summarize_with_llm    ~4s
 │   └─ URL3 → _summarize_with_llm    ~4s
 └─ Final LLM response (gpt-oss:20b)  ~5s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After fetching each page, &lt;code&gt;tool_fetch_url&lt;/code&gt; calls &lt;code&gt;qwen3:0.6b&lt;/code&gt; to summarize the raw HTML into structured data (title, author, page count, ISBN, etc.). This was designed to save tokens for the main LLM, but had critical issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;3 summaries run serially&lt;/strong&gt; — Ollama defaults to processing 1 request at a time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model swapping&lt;/strong&gt; — switching between &lt;code&gt;gpt-oss:20b&lt;/code&gt; and &lt;code&gt;qwen3:0.6b&lt;/code&gt; adds overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU bandwidth sharing&lt;/strong&gt; — even with "parallel" requests, they share the same GPU&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why Not Just Truncate?
&lt;/h3&gt;

&lt;p&gt;We considered removing LLM summarization and just truncating text to 3000 chars. However, prior testing showed important data (like page count buried deep in the page) could be lost. The 0.6b model intelligently extracts structured info regardless of position in the page.&lt;/p&gt;
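&lt;p&gt;A toy illustration of the failure mode (the page content is synthetic; only the 3000-char cutoff comes from the actual design):&lt;/p&gt;

```python
# A fake page where the fact we need sits past the 3000-char cutoff.
filler = "nav links, scripts, reviews... " * 120   # ~3,700 chars of noise
page = filler + "Pages: 188"

truncated = page[:3000]
print("Pages: 188" in truncated)  # False: the fact was cut off
print("Pages: 188" in page)       # True: a position-agnostic extractor keeps it
```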

&lt;h3&gt;
  
  
  Solution: Ollama Parallel + Multi-Model Config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set before starting Ollama in launcher.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_NUM_PARALLEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4"&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="c"&gt;# Handle 4 concurrent requests&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_MAX_LOADED_MODELS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c"&gt;# Keep 3 models in VRAM simultaneously&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps all models resident in VRAM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gpt-oss:20b&lt;/code&gt; (17GB) — main conversation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen3:0.6b&lt;/code&gt; (8GB) — web page summarization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bge-m3&lt;/code&gt; (1.2GB) — memory embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Zero model swapping, true parallel processing.&lt;/strong&gt;&lt;/p&gt;
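&lt;p&gt;With parallel requests allowed, the summarization step itself can be dispatched concurrently. A sketch with the LLM call injected as a plain function (in Xoul it would post to Ollama's &lt;code&gt;/api/generate&lt;/code&gt;; the stand-in below just uppercases):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_all(pages, summarize, workers=3):
    """Summarize fetched pages concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize, pages))

# Toy stand-in for the qwen3:0.6b call:
print(summarize_all(["page one", "page two"], lambda p: p.upper()))
```

&lt;p&gt;Three ~4s summaries submitted together finish in roughly the time of the slowest one, rather than ~12s end to end.&lt;/p&gt;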

&lt;p&gt;Additional daemon improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ThreadingHTTPServer&lt;/code&gt; for concurrent request handling&lt;/li&gt;
&lt;li&gt;Chromium flags to disable images, fonts, CSS, plugins — optimized for text extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Query: "마음 소화제 뻥뻥수 페이지 수 알려줘" (Book page count query)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Web search&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;URL fetch (1→3 parallel)&lt;/td&gt;
&lt;td&gt;~5s&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM summarization (serial→parallel)&lt;/td&gt;
&lt;td&gt;~12s&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final response&lt;/td&gt;
&lt;td&gt;~5s&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~20s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~8s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Run 1: 8.5s → "마음 소화제 뻥뻥수의 페이지 수는 188쪽입니다."
Run 2: 8.2s → "마음 소화제 뻥뻥수는 188쪽입니다. (source: yes24.com)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2.5x faster. Accurate answer with source.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure first, optimize later&lt;/strong&gt; — We spent hours building a HOST Chrome daemon for a 0.5s improvement on a 2s step, while the real 12s bottleneck was elsewhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The bottleneck is never where you expect&lt;/strong&gt; — "The browser is slow" was our assumption. "Serial LLM calls are slow" was the reality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One config line beats 300 lines of code&lt;/strong&gt; — &lt;code&gt;OLLAMA_NUM_PARALLEL=4&lt;/code&gt; solved what &lt;code&gt;browser_daemon_host.py&lt;/code&gt; (300 lines) couldn't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure reliability &amp;gt; speed&lt;/strong&gt; — If the browser daemon is dead, speed doesn't matter. Auto-starting the daemon from three places (deploy, launcher, and VM boot) was the most impactful change.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cold start matters&lt;/strong&gt; — First test after Ollama restart showed 23.6s (model loading). Second run: 8.5s. Always warm up before benchmarking.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Files Changed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;browser_daemon.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ThreadingHTTPServer, disable images/fonts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tools/web_tools.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parallel 3-URL fetch, TimeoutError handling, SSH fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scripts/deploy.ps1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browser daemon auto enable+start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scripts/launcher.ps1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browser health check, Ollama parallel config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_manager.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browser daemon start on VM boot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>Solving Key Collisions in LLM Memory — Embedding Model Swap and Hybrid Normalization</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Wed, 25 Feb 2026 15:08:38 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/solving-key-collisions-in-llm-memory-embedding-model-swap-and-hybrid-normalization-fih</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/solving-key-collisions-in-llm-memory-embedding-model-swap-and-hybrid-normalization-fih</guid>
      <description>&lt;p&gt;LLM &lt;code&gt;gpt-oss:20b&lt;/code&gt; · Embedding &lt;code&gt;bge-m3&lt;/code&gt; (1024-dim) · SQLite · 3-Tier Memory (STM/MTM/LTM)&lt;/p&gt;




&lt;h2&gt;
  
  
  Origin
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;@signalstack&lt;/code&gt; commented on the previous post about key collisions, and reviewing that feedback (thanks to them, we caught critical bugs) surfaced three issues in the memory system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deadlock&lt;/strong&gt; — Server hangs permanently during memory save operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key mismatch&lt;/strong&gt; — "favorite food" and "most favorite food" stored as separate entries, leaving stale values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key collision&lt;/strong&gt; — Saving "dog name: Choco" overwrites "cat name: Nabi" because the embedding model can't distinguish them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post documents how we diagnosed these issues, what experiments we ran, and how we fixed them.&lt;/p&gt;

&lt;p&gt;In a 3-tier memory system, when a user says &lt;strong&gt;"My hobby is hiking"&lt;/strong&gt; then later &lt;strong&gt;"I switched to pottery"&lt;/strong&gt; — what happens?&lt;/p&gt;

&lt;p&gt;Facts first enter MTM (mid-term memory). After repeated access, they promote to LTM (long-term memory). The problem: during promotion, &lt;strong&gt;old values can conflict with new ones&lt;/strong&gt;. &lt;code&gt;ON CONFLICT(key) DO UPDATE&lt;/code&gt; should handle this, but when the LLM extracts the same concept under &lt;strong&gt;different keys&lt;/strong&gt;, the upsert misses entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1st: "favorite_food|pizza|MID"         → saved to MTM
2nd: "most_favorite_food|sushi|MID"    → different key → saved as separate entry!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We designed 10 scenarios to verify this, capturing LTM/MTM database state after every test.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Broke First
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Deadlock: Server hangs forever
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;remember_mtm()&lt;/code&gt; → &lt;code&gt;remember()&lt;/code&gt; call chain acquired the same &lt;code&gt;threading.Lock()&lt;/code&gt; twice → permanent deadlock. Fixed: &lt;code&gt;Lock()&lt;/code&gt; → &lt;code&gt;RLock()&lt;/code&gt;.&lt;/p&gt;
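&lt;p&gt;A minimal reproduction of the fix (the function bodies are illustrative; only the names and the &lt;code&gt;Lock()&lt;/code&gt; → &lt;code&gt;RLock()&lt;/code&gt; change mirror the article):&lt;/p&gt;

```python
import threading

_lock = threading.RLock()  # was threading.Lock() -> nested acquire deadlocked
store = {}

def remember(key, value):
    with _lock:                 # second acquisition by the same thread: OK with RLock
        store[key] = value

def remember_mtm(key, value):
    with _lock:                 # first acquisition
        remember(key, value)    # nested call re-acquires the same lock

remember_mtm("hobby", "hiking")
print(store)  # {'hobby': 'hiking'}
```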

&lt;h3&gt;
  
  
  2. Key Inconsistency: Stale food kept appearing
&lt;/h3&gt;

&lt;p&gt;The LLM extracted &lt;code&gt;좋아하는 음식&lt;/code&gt; (favorite food) and &lt;code&gt;가장 좋아하는 음식&lt;/code&gt; (most favorite food) as different keys → &lt;code&gt;ON CONFLICT(key)&lt;/code&gt; missed → pizza never became doenjang-jjigae.&lt;/p&gt;




&lt;h2&gt;
  
  
  Embedding Model Shootout
&lt;/h2&gt;

&lt;p&gt;To normalize keys, we need to answer: "Are these two keys the same concept?" We tested 4 models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key Pair&lt;/th&gt;
&lt;th&gt;all-minilm&lt;/th&gt;
&lt;th&gt;nomic-embed&lt;/th&gt;
&lt;th&gt;BGE-M3&lt;/th&gt;
&lt;th&gt;Qwen3-Embed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;favorite food ↔ most favorite food&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.92&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;name ↔ cat's name&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.69&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hobby ↔ blood type&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.00&lt;/strong&gt; ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.00&lt;/strong&gt; ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.40&lt;/strong&gt; ✅&lt;/td&gt;
&lt;td&gt;0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat's name ↔ dog's name&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.00&lt;/strong&gt; ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.00&lt;/strong&gt; ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.87&lt;/strong&gt; ✅&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.91&lt;/strong&gt; ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;all-minilm and nomic-embed &lt;strong&gt;couldn't distinguish short Korean words at all&lt;/strong&gt; (hobby vs blood type = 1.0). Only BGE-M3 separated them correctly.&lt;/p&gt;

&lt;p&gt;We also switched from embedding &lt;code&gt;key: value&lt;/code&gt; pairs to &lt;strong&gt;key-only embedding&lt;/strong&gt;. Since keyword matching already covers value search during recall, key-only embedding is more efficient, and it lets key normalization reuse the vectors already stored in the DB at no extra embedding cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Approach
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;BGE-M3 (threshold 0.9) + substring fallback hybrid&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: cosine_similarity(new_key, existing_key) ≥ 0.9 → match
Step 2: substring containment + length ratio ≥ 60% → match
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reuses key-only embeddings stored in DB — &lt;strong&gt;only 1 extra Ollama call per new key&lt;/strong&gt;.&lt;/p&gt;
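&lt;p&gt;A sketch of the hybrid matcher. Step 1 would compare stored BGE-M3 vectors; here the similarity is passed in precomputed so that Step 2 can be shown end to end (the 0.9 and 60% thresholds come from the design above):&lt;/p&gt;

```python
def keys_match(new_key, existing_key, cos_sim=0.0):
    """Decide whether two memory keys refer to the same concept."""
    # Step 1: embedding similarity (cos_sim precomputed from stored vectors)
    if cos_sim >= 0.9:
        return True
    # Step 2: substring containment + length-ratio fallback
    shorter, longer = sorted([new_key, existing_key], key=len)
    return shorter in longer and len(shorter) / len(longer) >= 0.6

print(keys_match("favorite food", "most favorite food", cos_sim=0.92))  # True (Step 1)
print(keys_match("favorite food", "most favorite food"))                # True (Step 2)
print(keys_match("cat name", "dog name", cos_sim=0.87))                 # False
```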




&lt;h2&gt;
  
  
  10 Test Cases in Detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MC01: Hobby Change — Hiking → Pottery
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Basic value replacement works correctly&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC01a&lt;/td&gt;
&lt;td&gt;"My hobby is hiking"&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hobby: hiking&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC01b&lt;/td&gt;
&lt;td&gt;"I switched to pottery"&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hobby: pottery&lt;/code&gt; (access=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC01 check&lt;/td&gt;
&lt;td&gt;"What's my hobby?"&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hobby: pottery&lt;/code&gt; ← promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Pottery" — upsert in MTM, then normal promotion to LTM&lt;/p&gt;




&lt;h3&gt;
  
  
  MC02: Job Change — Teacher → Programmer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Clear factual replacement&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC02a&lt;/td&gt;
&lt;td&gt;"I'm a teacher"&lt;/td&gt;
&lt;td&gt;hobby: pottery&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;job: teacher&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC02b&lt;/td&gt;
&lt;td&gt;"I quit, now a programmer"&lt;/td&gt;
&lt;td&gt;hobby: pottery&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;job: programmer&lt;/code&gt; (access=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC02 check&lt;/td&gt;
&lt;td&gt;"What's my job?"&lt;/td&gt;
&lt;td&gt;hobby, &lt;strong&gt;job: programmer&lt;/strong&gt; ← promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Programmer" — previous value completely replaced&lt;/p&gt;




&lt;h3&gt;
  
  
  MC03: Food Preference 3x Change — Pizza → Sushi → Doenjang-jjigae
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Only latest value survives after 3 changes&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC03a&lt;/td&gt;
&lt;td&gt;"I love pizza"&lt;/td&gt;
&lt;td&gt;hobby, job&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite food: pizza&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC03b&lt;/td&gt;
&lt;td&gt;"Changed to sushi"&lt;/td&gt;
&lt;td&gt;hobby, job&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite food: sushi&lt;/code&gt; (access=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC03c&lt;/td&gt;
&lt;td&gt;"Actually doenjang-jjigae"&lt;/td&gt;
&lt;td&gt;hobby, job&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite food: doenjang-jjigae&lt;/code&gt; (access=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC03 check&lt;/td&gt;
&lt;td&gt;"Favorite food?"&lt;/td&gt;
&lt;td&gt;hobby, job, &lt;strong&gt;food: doenjang-jjigae&lt;/strong&gt; ← promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Doenjang-jjigae" — access_count ≥ 2 triggers auto-promotion, latest value only&lt;/p&gt;




&lt;h3&gt;
  
  
  MC04: Natural Correction — "Yoga was just a phase, now it's boxing"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Correction without explicit "update" command&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC04a&lt;/td&gt;
&lt;td&gt;"I do yoga"&lt;/td&gt;
&lt;td&gt;job, &lt;strong&gt;hobby: yoga&lt;/strong&gt;, food&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC04b&lt;/td&gt;
&lt;td&gt;"Yoga was brief, now boxing"&lt;/td&gt;
&lt;td&gt;job, &lt;strong&gt;hobby: boxing&lt;/strong&gt;, food&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC04 check&lt;/td&gt;
&lt;td&gt;"What exercise?"&lt;/td&gt;
&lt;td&gt;job, &lt;strong&gt;hobby: boxing&lt;/strong&gt;, food&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Boxing" — direct LTM update (existing &lt;code&gt;hobby&lt;/code&gt; key: &lt;code&gt;yoga&lt;/code&gt; → &lt;code&gt;boxing&lt;/code&gt;)&lt;/p&gt;




&lt;h3&gt;
  
  
  MC05: Address Change — Seoul → Busan
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Specific location data replacement&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC05a&lt;/td&gt;
&lt;td&gt;"Live in Seoul Gangnam"&lt;/td&gt;
&lt;td&gt;job, hobby, food&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;address: Seoul Gangnam&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC05b&lt;/td&gt;
&lt;td&gt;"Moved to Busan Haeundae"&lt;/td&gt;
&lt;td&gt;job, hobby, food&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;address: Busan Haeundae&lt;/code&gt; (access=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC05 check&lt;/td&gt;
&lt;td&gt;"Where do I live?"&lt;/td&gt;
&lt;td&gt;+&lt;strong&gt;address: Busan Haeundae&lt;/strong&gt; ← promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Busan Haeundae" — previous value completely removed&lt;/p&gt;




&lt;h3&gt;
  
  
  MC06: Pet Addition — Cat Nabi + Dog Choco (additive, not replacement)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: When adding, not replacing, both entries must survive. &lt;strong&gt;This was the key test&lt;/strong&gt; — all-minilm failed here because &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;cat's name&lt;/code&gt; had cosine similarity of 1.0.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC06a&lt;/td&gt;
&lt;td&gt;"I have a cat named Nabi"&lt;/td&gt;
&lt;td&gt;job, hobby, food, address&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cat name: Nabi&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC06b&lt;/td&gt;
&lt;td&gt;"Also adopted dog Choco"&lt;/td&gt;
&lt;td&gt;job, hobby, food, address&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cat name: Nabi&lt;/code&gt; (access=1), &lt;code&gt;dog name: Choco&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC06 check&lt;/td&gt;
&lt;td&gt;"Pet names?"&lt;/td&gt;
&lt;td&gt;+&lt;strong&gt;cat: Nabi&lt;/strong&gt;, &lt;strong&gt;dog: Choco&lt;/strong&gt; ← both promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Cat Nabi, Dog Choco" — BGE-M3 correctly separated &lt;code&gt;cat name&lt;/code&gt; ≠ &lt;code&gt;dog name&lt;/code&gt; (cosine 0.87 &amp;lt; threshold 0.9)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Previous failure&lt;/strong&gt;: all-minilm gave cosine 1.0 for both keys, causing &lt;code&gt;name: Choco&lt;/code&gt; to overwrite &lt;code&gt;name: Nabi&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  MC07: Blood Type Correction — A → AB
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Explicit correction of wrong information&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC07a&lt;/td&gt;
&lt;td&gt;"Blood type A"&lt;/td&gt;
&lt;td&gt;+&lt;code&gt;blood type: A&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC07b&lt;/td&gt;
&lt;td&gt;"No wait, it's AB"&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;blood type: A&lt;/code&gt; → &lt;strong&gt;&lt;code&gt;AB&lt;/code&gt;&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC07 check&lt;/td&gt;
&lt;td&gt;"Blood type?"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;blood type: AB&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "AB" — forget + remember combo for direct LTM correction&lt;/p&gt;




&lt;h3&gt;
  
  
  MC08: Preference Reversal — Loves coffee → Quit coffee
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Not just value change but &lt;strong&gt;180° direction reversal&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC08a&lt;/td&gt;
&lt;td&gt;"Love coffee, drink daily"&lt;/td&gt;
&lt;td&gt;…7 items&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;coffee: daily&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC08b&lt;/td&gt;
&lt;td&gt;"Quit coffee, only tea now"&lt;/td&gt;
&lt;td&gt;…7 items&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;coffee: quit&lt;/code&gt;, &lt;code&gt;tea: drinks&lt;/code&gt;, &lt;code&gt;caffeine: none&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC08 check&lt;/td&gt;
&lt;td&gt;"Do I like coffee?"&lt;/td&gt;
&lt;td&gt;…+&lt;code&gt;coffee: quit&lt;/code&gt;, &lt;code&gt;tea: drinks&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Coffee quit, no caffeine. Drinks tea."&lt;/p&gt;




&lt;h3&gt;
  
  
  MC09: Cross-Session Correction — Summer → Autumn
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Can correct memories from a previous session&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC09a&lt;/td&gt;
&lt;td&gt;"Love summer"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite season: summer&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC09b&lt;/td&gt;
&lt;td&gt;"Actually prefer autumn now"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite season: autumn&lt;/code&gt; (access=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC09 check&lt;/td&gt;
&lt;td&gt;"Favorite season?"&lt;/td&gt;
&lt;td&gt;…+&lt;strong&gt;&lt;code&gt;season: autumn&lt;/code&gt;&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Autumn" — recall → forget(summer) → remember(autumn) pattern&lt;/p&gt;




&lt;h3&gt;
  
  
  MC10: Multi-Attribute Change — Color red→blue, Number 7→13
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Multiple attributes changed at once without missing any&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC10a&lt;/td&gt;
&lt;td&gt;"Color red, number 7"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;color: red&lt;/code&gt;, &lt;code&gt;number: 7&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC10b&lt;/td&gt;
&lt;td&gt;"Color blue, number 13"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;color: blue&lt;/code&gt;, &lt;code&gt;number: 13&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC10 check&lt;/td&gt;
&lt;td&gt;"Color and number?"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;color: blue&lt;/code&gt;, &lt;code&gt;number: 13&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Blue + 13" — both attributes correctly replaced&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Scoreboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Before (all-minilm)&lt;/th&gt;
&lt;th&gt;After (Hybrid)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC01&lt;/td&gt;
&lt;td&gt;Hobby: hiking → pottery&lt;/td&gt;
&lt;td&gt;❌ (stale data)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC02&lt;/td&gt;
&lt;td&gt;Job: teacher → programmer&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC03&lt;/td&gt;
&lt;td&gt;Food: 3x change&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC04&lt;/td&gt;
&lt;td&gt;Exercise: natural correction&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC05&lt;/td&gt;
&lt;td&gt;Address: moved&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC06&lt;/td&gt;
&lt;td&gt;Pets: cat + dog (additive)&lt;/td&gt;
&lt;td&gt;❌ (key collision)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC07&lt;/td&gt;
&lt;td&gt;Blood type: A → AB&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC08&lt;/td&gt;
&lt;td&gt;Coffee: preference reversal&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC09&lt;/td&gt;
&lt;td&gt;Season: cross-session&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC10&lt;/td&gt;
&lt;td&gt;Multi-attribute&lt;/td&gt;
&lt;td&gt;❌ (season conflict)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10/10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Improving Host↔VM File Transfer in a Local AI Agent — Smart Search + Deduplication</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Tue, 24 Feb 2026 13:39:25 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/improving-host-vm-file-transfer-in-a-local-ai-agent-smart-search-deduplication-59kn</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/improving-host-vm-file-transfer-in-a-local-ai-agent-smart-search-deduplication-59kn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Fixing SCP encoding crashes, adding fuzzy filename matching, MD5-based deduplication, and automatic VM file search in a QEMU-based AI agent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Problem Definition
&lt;/h2&gt;

&lt;p&gt;The agent 'Xoul' runs inside a QEMU Ubuntu VM. The reason is &lt;strong&gt;security&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you give an AI agent system tools like &lt;code&gt;run_command&lt;/code&gt; and &lt;code&gt;write_file&lt;/code&gt;, LLM hallucinations or prompt injection attacks could directly damage the host PC. Imagine &lt;code&gt;rm -rf /&lt;/code&gt; executing on your main machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Sandbox all agent system operations inside a QEMU VM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03u1m9zyax94zttrc0js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03u1m9zyax94zttrc0js.png" alt=" " width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the LLM executes &lt;code&gt;rm -rf /&lt;/code&gt;, only the VM is affected&lt;/li&gt;
&lt;li&gt;No direct host filesystem access — the only channel is the &lt;code&gt;share/&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;share/&lt;/code&gt; uses explicit SCP transfers only (no auto-mount)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture relies on two file transfer tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;host_to_vm&lt;/strong&gt;: Host &lt;code&gt;share/&lt;/code&gt; → VM &lt;code&gt;/root/share/&lt;/code&gt; (deliver user files to the agent)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vm_to_host&lt;/strong&gt;: VM → Host &lt;code&gt;share/&lt;/code&gt; (deliver agent results to the user)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both use SCP (Secure Copy Protocol) over SSH. In practice, several issues surfaced:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SCP transfers intermittently failing (cause unclear)&lt;/li&gt;
&lt;li&gt;When a user says "send me the report file," the tool fails unless the exact filename is provided&lt;/li&gt;
&lt;li&gt;Retrieving files from the VM requires knowing the full path&lt;/li&gt;
&lt;li&gt;Identical files transferred repeatedly with no deduplication&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Root Cause Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2-1. Encoding Crash — &lt;code&gt;UnicodeDecodeError&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;All three functions in &lt;code&gt;vm_manager.py&lt;/code&gt; (&lt;code&gt;ssh_exec&lt;/code&gt;, &lt;code&gt;scp_to_vm&lt;/code&gt;, &lt;code&gt;scp_from_vm&lt;/code&gt;) used &lt;code&gt;subprocess.run(text=True)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Problematic code
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssh_cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;text=True&lt;/code&gt; makes &lt;code&gt;subprocess&lt;/code&gt; decode the child's output as text (UTF-8 in this environment). When SSH output contains BOM bytes (0xFF, 0xFE) or binary data, decoding fails immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this error occurs in &lt;code&gt;subprocess.py&lt;/code&gt;'s &lt;code&gt;_readerthread&lt;/code&gt;, the captured output comes back as &lt;code&gt;None&lt;/code&gt;, so the real error is masked by a secondary failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SSH 오류: 'NoneType' object has no attribute 'strip']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ To the user, it just looks like "SCP doesn't work."&lt;/p&gt;

&lt;h3&gt;
  
  
  2-2. Exact Filename Required
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original code — exact match required
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_host_to_vm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vm_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;local_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SHARE_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← Must match exactly
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;scp_to_vm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vm_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users naturally say partial names like "report" or "result" in conversation. The LLM cannot guess exact filenames.&lt;/p&gt;

&lt;h3&gt;
  
  
  2-3. Full VM Path Required
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;vm_to_host&lt;/code&gt; requires the complete path (&lt;code&gt;/root/workspace/result.txt&lt;/code&gt;). Users have no way of knowing internal VM file paths.&lt;/p&gt;

&lt;h3&gt;
  
  
  2-4. No Deduplication
&lt;/h3&gt;

&lt;p&gt;Repeated requests for the same file trigger a full SCP transfer every time — unnecessary network I/O and SSH overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3-1. Encoding Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;# vm_manager.py — common to ssh_exec, scp_to_vm, scp_from_vm
  result = subprocess.run(
      ssh_cmd,
      capture_output=True,
&lt;span class="gd"&gt;-     text=True,
&lt;/span&gt;&lt;span class="gi"&gt;+     text=False,
&lt;/span&gt;      timeout=timeout,
  )
&lt;span class="gd"&gt;- output = result.stdout
&lt;/span&gt;&lt;span class="gi"&gt;+ output = result.stdout.decode("utf-8", errors="replace")
+ stderr = result.stderr.decode("utf-8", errors="replace")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;errors="replace"&lt;/code&gt; substitutes each undecodable byte with &lt;code&gt;�&lt;/code&gt;, so decoding never crashes, no matter what bytes SSH emits.&lt;/p&gt;
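&lt;p&gt;A minimal standalone demonstration of the difference, independent of the project's code:&lt;/p&gt;

```python
# BOM-prefixed bytes of the kind SSH/SCP can emit on Windows:
raw = b"\xff\xfehello"

# Strict decoding (what text=True effectively does) raises:
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes each bad byte with U+FFFD instead:
safe = raw.decode("utf-8", errors="replace")
```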

&lt;h3&gt;
  
  
  3-2. Smart File Search
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zrgkmso1q2m70ts1vx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zrgkmso1q2m70ts1vx9.png" alt=" " width="800" height="1030"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Host Side — &lt;code&gt;_find_file_in_share()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Case-insensitive partial matching across the entire &lt;code&gt;share/&lt;/code&gt; directory tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_find_file_in_share&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dirs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SHARE_DIR&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
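&lt;p&gt;A quick usage sketch, with the share root passed in explicitly instead of the module-level &lt;code&gt;SHARE_DIR&lt;/code&gt; so it runs standalone:&lt;/p&gt;

```python
import os
import tempfile

# Parameterized variant of the search above, for illustration only;
# the real function reads the module-level SHARE_DIR.
def find_file_in_share(share_dir, query):
    results = []
    q = query.lower()
    for root, _dirs, files in os.walk(share_dir):
        for f in files:
            if f.startswith("."):
                continue          # skip hidden files like .transfer_log.json
            if q in f.lower():
                results.append(os.path.join(root, f))
    return results

share = tempfile.mkdtemp()
for name in ("weekly_report_v2.pdf", "results.csv", ".transfer_log.json"):
    open(os.path.join(share, name), "w").close()

# "report" matches weekly_report_v2.pdf despite being a partial name:
hits = find_file_in_share(share, "report")
```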



&lt;h4&gt;
  
  
  VM Side — &lt;code&gt;_find_file_in_vm()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Runs the &lt;code&gt;find&lt;/code&gt; command over SSH to search common directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_find_file_in_vm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ssh_exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;find /root/share /root/workspace /root/.xoul/workspace &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-name &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; -type f 2&amp;gt;/dev/null | head -5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quiet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3-3. MD5 Hash-Based Deduplication
&lt;/h3&gt;

&lt;p&gt;Transfer history is stored in &lt;code&gt;share/.transfer_log.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"htv:test_htv.txt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a1b2c3d4e5f6..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"direction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"htv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"src"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;share&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;test_htv.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dst"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/root/share/test_htv.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-24T22:05:30"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an identical hash is detected, the transfer is skipped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Already up to date: test_htv.txt → /root/share/test_htv.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
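&lt;p&gt;The skip decision reduces to: hash the file content, compare against the logged hash, and transfer only on mismatch. A minimal sketch, assuming the log format shown above; &lt;code&gt;should_transfer&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import hashlib
import json
import os
import tempfile

def file_md5(path):
    """MD5 of file content, read in chunks so large files stay cheap."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def should_transfer(log_path, log_key, src_path):
    """Return True if src changed since the last logged transfer."""
    log = {}
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as f:
            log = json.load(f)
    current = file_md5(src_path)
    if log.get(log_key, {}).get("hash") == current:
        return False  # identical content: skip the SCP entirely
    log[log_key] = {"hash": current}
    with open(log_path, "w", encoding="utf-8") as f:
        json.dump(log, f)
    return True

# First transfer runs; an immediate retry of the same content is skipped.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "test_htv.txt")
with open(src, "w") as f:
    f.write("hello")
log = os.path.join(tmp, ".transfer_log.json")
first = should_transfer(log, "htv:test_htv.txt", src)
second = should_transfer(log, "htv:test_htv.txt", src)
```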






&lt;h2&gt;
  
  
  4. Testing &amp;amp; Verification
&lt;/h2&gt;

&lt;p&gt;All tests performed with actual SCP on a running VM.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HTV exact match&lt;/td&gt;
&lt;td&gt;&lt;code&gt;test_htv.txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Upload complete&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTV dedup&lt;/td&gt;
&lt;td&gt;Same file again&lt;/td&gt;
&lt;td&gt;Skip message&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTV partial match&lt;/td&gt;
&lt;td&gt;&lt;code&gt;htv&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Auto-find &lt;code&gt;test_htv.txt&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTV nonexistent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nonexistent.xyz&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Error + path hint&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VTH full path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/root/workspace/result.txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Download complete&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VTH filename only&lt;/td&gt;
&lt;td&gt;&lt;code&gt;todo.txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM &lt;code&gt;find&lt;/code&gt; search&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VTH nonexistent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;zzz_not_exist.bin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Error message&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSH encoding&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cat&lt;/code&gt; command (BOM)&lt;/td&gt;
&lt;td&gt;No crash&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;8/8 passed.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before vs After
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SSH encoding&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UnicodeDecodeError&lt;/code&gt; crash&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;errors="replace"&lt;/code&gt; — always safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File search&lt;/td&gt;
&lt;td&gt;Exact filename required&lt;/td&gt;
&lt;td&gt;Partial match auto-search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VM file access&lt;/td&gt;
&lt;td&gt;Full path required&lt;/td&gt;
&lt;td&gt;Filename-only &lt;code&gt;find&lt;/code&gt; search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate transfer&lt;/td&gt;
&lt;td&gt;Full SCP every time&lt;/td&gt;
&lt;td&gt;MD5 hash comparison → skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transfer history&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;JSON log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error messages&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NoneType has no attribute 'strip'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Specific guidance + share/ path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Lessons
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;subprocess.run(text=True)&lt;/code&gt; trap&lt;/strong&gt;: Python's &lt;code&gt;text=True&lt;/code&gt; is convenient but assumes UTF-8 output from external processes. SSH and SCP on Windows can inject BOM bytes (0xFF 0xFE), causing immediate crashes. &lt;code&gt;text=False&lt;/code&gt; + &lt;code&gt;decode(errors="replace")&lt;/code&gt; is the safe default.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Users don't know paths&lt;/strong&gt;: When building file management tools, assuming users know exact filenames or VM paths is dangerous. The pattern should be: partial match → candidate list → selection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hash-based deduplication&lt;/strong&gt;: Filenames can be the same with different content, or different with the same content. MD5 hash on actual content is the reliable comparison method.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JSON vs SQLite for logs&lt;/strong&gt;: For small-scale data like transfer logs (dozens of entries), JSON files are ideal — zero dependencies, easy debugging. SQLite shines for memory systems with thousands of entries requiring semantic search.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>linux</category>
      <category>security</category>
    </item>
    <item>
      <title>Designing a 3-Tier Memory System for a Local AI Agent — STM / MTM / LTM</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Tue, 24 Feb 2026 12:59:36 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/designing-a-3-tier-memory-system-for-a-local-ai-agent-stm-mtm-ltm-1hni</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/designing-a-3-tier-memory-system-for-a-local-ai-agent-stm-mtm-ltm-1hni</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How we modeled human memory consolidation to build a robust memory pipeline for a 20B-parameter local LLM agent — and achieved 100% on a 53-question test suite.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Problem Definition
&lt;/h2&gt;

&lt;p&gt;We built a local AI agent (Androi) that needs to remember user facts across conversations.&lt;br&gt;&lt;br&gt;
When a user says "My name is Namhyun" or "My hobby is hiking," the agent must recall and use these facts in future sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The initial implementation&lt;/strong&gt; was simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User message → LLM extracts key|value → Stored directly in LTM (permanent)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Out of 53 end-to-end tests, &lt;strong&gt;8 failed&lt;/strong&gt;, and memory system issues were the root cause for most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key symptoms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BMI calculation couldn't find height data&lt;/li&gt;
&lt;li&gt;Salary information wasn't used for price comparisons&lt;/li&gt;
&lt;li&gt;After correcting a hobby, the agent recalled the old value&lt;/li&gt;
&lt;li&gt;Scheduler tool name mismatches (unrelated bug, also fixed)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Root Cause Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2-1. LTM Pollution
&lt;/h3&gt;

&lt;p&gt;Every extracted fact went directly to LTM, including transient junk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;weather|Seoul       ← one-time search request
time|8AM            ← scheduling parameter  
weather_alert|cancelled  ← task status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When LTM exceeded 30 entries, semantic filtering kicked in — but these junk entries diluted the signal, causing truly important memories (height, weight, salary) to be filtered out.&lt;/p&gt;
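A toy model of that dilution, with invented relevance scores (the real filter scores entries by embedding similarity to the current context):

```python
# Toy model of signal dilution in a top-k semantic filter. The scores
# are made up for illustration; the point is that a flood of mid-scoring
# junk entries crowds core facts out of the surviving slots.
def semantic_filter(scores, keep):
    """Keep the `keep` keys with the highest relevance score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:keep])

scores = {"height": 0.22, "weight": 0.25}  # core facts, weak context match
scores.update({f"weather_alert_{i}": 0.40 for i in range(10)})  # junk
kept = semantic_filter(scores, keep=5)
# Every surviving slot goes to a junk entry; height and weight are gone.
```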

&lt;h3&gt;
  
  
  2-2. Single-Character Key Filter Bug
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;extract_and_remember&lt;/code&gt; had a &lt;code&gt;len(key) &amp;lt; 2&lt;/code&gt; filter that &lt;strong&gt;dropped valid single-character Korean keys&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
"키" (height in Korean) is 1 character → extracted but never saved. Only "몸무게" (weight, 3 chars) survived.&lt;/p&gt;

&lt;h3&gt;
  
  
  2-3. auto_retrieve Injection Threshold
&lt;/h3&gt;

&lt;p&gt;When LTM exceeded 15 entries, the system switched to semantic matching. With 20 entries stored, a query like "Calculate my BMI" couldn't match "weight: 72kg" above the 0.3 cosine similarity threshold.&lt;/p&gt;
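The retrieval miss can be illustrated with toy vectors (these are made up; the real system uses LLM embeddings): the query and the fact share no surface terms, so their similarity can sit below the cutoff even though the fact is essential.

```python
import math

# Toy illustration of the auto_retrieve miss. The embedding vectors are
# invented; the mechanism -- a cosine threshold gating injection -- is
# the one described in the article.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0, 0.2]   # "Calculate my BMI" (toy embedding)
memory_vec = [0.1, 0.9, 0.3, 0.0]  # "weight: 72kg" (toy embedding)

THRESHOLD = 0.3
sim = cosine(query_vec, memory_vec)
inject = sim >= THRESHOLD  # False: the weight fact is never injected
```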

&lt;h3&gt;
  
  
  2-4. Architectural Gap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwuo7fdk5sray2kgqt24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwuo7fdk5sray2kgqt24.png" alt=" " width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;intended design&lt;/strong&gt; was STM→MTM→LTM natural promotion, but &lt;code&gt;extract_and_remember&lt;/code&gt; &lt;strong&gt;bypassed MTM entirely&lt;/strong&gt; and wrote directly to LTM. The promotion logic (&lt;code&gt;_try_promote_to_ltm&lt;/code&gt;) was effectively dead code.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Solution Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3-1. Immediate Bug Fixes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler tool name mismatch&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;schedule_task&lt;/code&gt; → &lt;code&gt;create_task&lt;/code&gt; (3 tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;web_search&lt;/code&gt; chosen over &lt;code&gt;find_contact&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Added "Do NOT use web_search for contacts" to tool description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;len(key) &amp;lt; 2&lt;/code&gt; filter&lt;/td&gt;
&lt;td&gt;Removed minimum key length requirement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;auto_retrieve threshold&lt;/td&gt;
&lt;td&gt;Increased 15 → 30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3-2. Architecture Refactoring — Priority-Based MTM→LTM Pipeline
&lt;/h3&gt;

&lt;p&gt;Inspired by human memory consolidation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hippocampus&lt;/strong&gt; = MTM — short-term memories consolidate to cortex through repetition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ebbinghaus Forgetting Curve&lt;/strong&gt; — unused memories naturally decay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emotional significance&lt;/strong&gt; — critical facts such as blood type and allergies are stored permanently after a single mention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00a3hi8mhvrtxvru9qns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00a3hi8mhvrtxvru9qns.png" alt=" " width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Priority Classification
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HIGH&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Immutable core identity&lt;/td&gt;
&lt;td&gt;Name, birthday, blood type, allergies&lt;/td&gt;
&lt;td&gt;Direct to LTM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mutable preferences/state&lt;/td&gt;
&lt;td&gt;Hobby, job, residence, salary&lt;/td&gt;
&lt;td&gt;MTM → promotion queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LOW&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transient/one-off info&lt;/td&gt;
&lt;td&gt;Weather, news, timestamps&lt;/td&gt;
&lt;td&gt;Not extracted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
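The classification in the table above can be sketched as a simple router (the keyword sets here are illustrative; the real system classifies via the LLM during extraction):

```python
# Sketch of the priority router from the table above. Key sets are
# illustrative assumptions, not the production classifier.
HIGH_KEYS = {"name", "birthday", "blood_type", "allergy"}  # immutable identity
LOW_KEYS = {"weather", "news", "time"}                     # transient, one-off

def classify(key):
    if key in HIGH_KEYS:
        return "HIGH"  # written directly to LTM
    if key in LOW_KEYS:
        return "LOW"   # not extracted at all
    return "MID"       # staged in MTM, promoted on repeated access
```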

&lt;h4&gt;
  
  
  Promotion &amp;amp; Expiry Flow
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7n842u14z4u57nfn6ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7n842u14z4u57nfn6ax.png" alt=" " width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;
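The promotion and expiry flow in the diagram can be sketched roughly as follows. The concrete thresholds (3 accesses to promote, 7-day expiry) are assumptions for illustration; the article does not state the real values.

```python
import time

# Sketch of MTM -> LTM consolidation plus Ebbinghaus-style decay.
PROMOTE_AFTER = 3             # repeated accesses consolidate to LTM
EXPIRE_AFTER = 7 * 24 * 3600  # unused MID facts decay after a week

class MemoryStore:
    def __init__(self):
        self.mtm = {}  # key -> {"value", "hits", "last_access"}
        self.ltm = {}  # permanent storage

    def touch(self, key, value, now=None):
        """Record a MID-priority fact; promote it once accessed enough."""
        now = now if now is not None else time.time()
        entry = self.mtm.setdefault(key, {"value": value, "hits": 0,
                                          "last_access": now})
        entry.update(value=value, last_access=now)
        entry["hits"] += 1
        if entry["hits"] >= PROMOTE_AFTER:  # hippocampus -> cortex
            self.ltm[key] = entry["value"]
            del self.mtm[key]

    def sweep(self, now=None):
        """Expire MTM entries that have not been accessed recently."""
        now = now if now is not None else time.time()
        stale = [k for k, e in self.mtm.items()
                 if now - e["last_access"] > EXPIRE_AFTER]
        for key in stale:
            del self.mtm[key]
```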




&lt;h2&gt;
  
  
  4. Testing &amp;amp; Validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4-1. Pass Rate Progression
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Round&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;Key Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;45/53 (85%)&lt;/td&gt;
&lt;td&gt;Original code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round 1&lt;/td&gt;
&lt;td&gt;48/53 (91%)&lt;/td&gt;
&lt;td&gt;Scheduler tool name fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round 2&lt;/td&gt;
&lt;td&gt;49/53 (92%)&lt;/td&gt;
&lt;td&gt;find_contact description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round 3&lt;/td&gt;
&lt;td&gt;51/53 (96%)&lt;/td&gt;
&lt;td&gt;LTM threshold + system prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round 4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53/53 (100%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Key filter fix + multi-chain guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Final (post-refactor)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53/53 (100%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MTM→LTM architecture, no regressions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  4-2. Test Coverage (53 tests)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A. Memory System&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;CRUD, cross-session, update, delete, implicit save&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B. Web Search&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Weather, exchange rate, news, restaurants, price, population&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C. Calculation + Memory&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Take-home pay, BMI, compound interest, unit conversion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D. Calendar&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;CRUD, memory-linked scheduling, event modification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E. File + Code&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;File CRUD, Python execution, result persistence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F. Email&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Inbox, send, search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G. Contacts&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Add, search, delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H. Scheduler&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Register, list, cancel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;I. Multi-Chain&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;3+ tool chains, corrections, weather→calendar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J. Edge Cases&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Hallucination prevention, tool re-call prevention, response speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  5. Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Final Scorecard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A. Memory System    ████████████████████ 10/10 (100%)
B. Web Search       ████████████████████ 6/6   (100%)
C. Calc + Memory    ████████████████████ 5/5   (100%)
D. Calendar         ████████████████████ 6/6   (100%)
E. File + Code      ████████████████████ 5/5   (100%)
F. Email            ████████████████████ 3/3   (100%)
G. Contacts         ████████████████████ 3/3   (100%)
H. Scheduler        ████████████████████ 3/3   (100%)
I. Multi-Chain      ████████████████████ 7/7   (100%)
J. Edge Cases       ████████████████████ 5/5   (100%)
────────────────────────────────────────────────────
Total               53/53 (100%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-based extraction needs guardrails.&lt;/strong&gt; Without priority classification, LLMs will extract "weather|Seoul" alongside "name|John" and pollute permanent memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Human memory models work for AI agents.&lt;/strong&gt; The hippocampus→cortex consolidation pattern (MTM→LTM promotion through repeated access) naturally filters for truly important information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single-character CJK key pitfall.&lt;/strong&gt; A &lt;code&gt;len(key) &amp;lt; 2&lt;/code&gt; filter designed for English silently drops valid single-character Korean keys like "키" (height). Length-based heuristics in multilingual systems deserve per-script review.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic matching has blind spots.&lt;/strong&gt; "Calculate my BMI" doesn't semantically match "weight: 72kg" above a 0.3 cosine threshold. For small memory sets, full injection is more reliable than semantic filtering.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
