<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kim Namhyun</title>
    <description>The latest articles on DEV Community by Kim Namhyun (@kim_namhyun_e7535f3dc4c69).</description>
    <link>https://dev.to/kim_namhyun_e7535f3dc4c69</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3785019%2Fc1626915-c1e9-4793-a84e-d16b3e25d682.jpg</url>
      <title>DEV Community: Kim Namhyun</title>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kim_namhyun_e7535f3dc4c69"/>
    <language>en</language>
    <item>
      <title>Xoul - v0.1.1-beta released</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Wed, 01 Apr 2026 16:09:30 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-v011-beta-released-4h5g</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-v011-beta-released-4h5g</guid>
      <description>&lt;h2&gt;
  
  
  v0.1.1-beta released
&lt;/h2&gt;

&lt;p&gt;🔒 Security&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Block external SSH attacks: Bind QEMU port forwarding to 127.0.0.1, preventing brute-force SSH attacks from external networks&lt;/p&gt;
&lt;/blockquote&gt;
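&lt;p&gt;In QEMU terms the fix is the bind address on the &lt;code&gt;hostfwd&lt;/code&gt; rule: user-mode networking forwards a host port to the guest's SSH port, and binding it to 127.0.0.1 makes it reachable only from the host itself. A minimal sketch of how such arguments can be built (port numbers and device names are illustrative, not Xoul's actual configuration):&lt;/p&gt;

```python
# Sketch: build QEMU user-mode networking arguments with SSH forwarding
# bound to 127.0.0.1. Port numbers and device names are illustrative.
def ssh_forward_args(host_port: int = 2222, guest_port: int = 22) -> list:
    # Binding to 127.0.0.1 means only processes on the host can connect;
    # without the bind address, QEMU listens on all interfaces (0.0.0.0).
    hostfwd = f"hostfwd=tcp:127.0.0.1:{host_port}-:{guest_port}"
    return ["-netdev", f"user,id=net0,{hostfwd}",
            "-device", "virtio-net-pci,netdev=net0"]
```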

&lt;p&gt;🛠 Stability&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fix VM SSH connectivity during web search: Chromium opens 60+ outbound TCP connections when loading a single page, saturating QEMU SLiRP's connection slots and blocking new SSH connections. Added CDP &lt;code&gt;Network.setBlockedURLs&lt;/code&gt; to block JS/analytics/ads, reducing connections to ~10&lt;br&gt;
Immediate SSE connection release: Force-close SSE screencast response on tool completion to prevent lingering SLiRP connections&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;⚙️ Developer Experience&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Automatic Dev/Prod switching: New &lt;code&gt;env_config.py&lt;/code&gt; module auto-detects the environment from the current Git branch.&lt;/p&gt;
&lt;/blockquote&gt;
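&lt;p&gt;Branch-based detection like this is commonly a thin wrapper around &lt;code&gt;git rev-parse&lt;/code&gt;. A sketch under that assumption (the branch-to-environment mapping and function names are illustrative, not necessarily &lt;code&gt;env_config.py&lt;/code&gt;'s actual logic):&lt;/p&gt;

```python
import subprocess

# Sketch of branch-based environment detection; the mapping below is an
# assumption, not env_config.py's actual code.
def env_for_branch(branch: str) -> str:
    return "prod" if branch in ("main", "master") else "dev"

def detect_environment(repo_dir: str = ".") -> str:
    try:
        branch = subprocess.check_output(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=repo_dir, text=True).strip()
    except (subprocess.CalledProcessError, OSError):
        return "prod"  # safe default when not inside a Git checkout
    return env_for_branch(branch)
```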

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://www.xoulai.net/" rel="noopener noreferrer"&gt;https://www.xoulai.net/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/xoul-project/xoul" rel="noopener noreferrer"&gt;https://github.com/xoul-project/xoul&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Discussions: &lt;a href="https://github.com/xoul-project/xoul/discussions" rel="noopener noreferrer"&gt;https://github.com/xoul-project/xoul/discussions&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>networking</category>
      <category>python</category>
      <category>security</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Xoul - Local Personal Assistant Agent Release (Beta, v0.1.0-beta)</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Tue, 31 Mar 2026 17:34:50 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-local-personal-assistant-agent-release-beta-v010-beta-1op4</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-local-personal-assistant-agent-release-beta-v010-beta-1op4</guid>
      <description>&lt;h1&gt;
  
  
  Xoul — An Open-Source AI Agent That Runs Locally
&lt;/h1&gt;

&lt;p&gt;Introducing &lt;strong&gt;Xoul&lt;/strong&gt;, a personal assistant agent powered by local LLMs and virtual machine isolation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhsawz1jk3tw5m90yvmq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhsawz1jk3tw5m90yvmq.gif" alt=" " width="560" height="374"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Xoul
&lt;/h2&gt;

&lt;p&gt;Xoul is a personal AI agent. It's not a chatbot — it manages files, sends emails, browses the web, and runs code at the OS level. All actions run inside a QEMU virtual machine, keeping the host system untouched. When using a local LLM, personal data never leaves the machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;18 built-in tools&lt;/strong&gt; — file management, email, web search, code execution, calendar, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personas &amp;amp; Code Snippets&lt;/strong&gt; — switch agent roles or run Python snippets shared by the community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflows&lt;/strong&gt; — schedule repetitive tasks (news digests, server checks, email triage) as multi-step automation templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Arena&lt;/strong&gt; — a playground where agents discuss topics and play social deduction games&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host PC Control&lt;/strong&gt; — limited host interaction including browser launch and file operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple Clients&lt;/strong&gt; — Desktop (PyQt6), Telegram, Discord, Slack, and CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The Xoul agent runs inside a QEMU virtual machine. LLM inference is handled locally on the GPU via Ollama, while the desktop app serves as the host-side UI. VM isolation ensures the host system stays safe regardless of what the agent does.&lt;/p&gt;

&lt;p&gt;Beyond local LLMs, Xoul also supports commercial APIs (Claude, GPT-5, Gemini, DeepSeek, Grok, Mistral) and external OpenAI-compatible servers (vLLM, LM Studio, etc.).&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Models
&lt;/h2&gt;

&lt;p&gt;For local execution, models are automatically recommended based on available VRAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron-3-Nano 4B (Q8)&lt;/td&gt;
&lt;td&gt;~5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron-3-Nano 4B (BF16)&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-oss 20B&lt;/td&gt;
&lt;td&gt;~13 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron-Cascade-2 30B&lt;/td&gt;
&lt;td&gt;~20 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;BGE-M3 (embedding) and Qwen 2.5 3B (summarization, CPU-only) are also installed automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Requirements
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;x86-64, 8 cores&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;16 GB+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA 30-series, 8 GB VRAM&lt;/td&gt;
&lt;td&gt;NVIDIA 40-series, 16 GB+ VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Windows 11 (10 experimental)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;20 GB free&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.xoulai.net/xoul_dist/xoul_rel.zip" rel="noopener noreferrer"&gt;Download the release file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Extract &lt;code&gt;xoul_rel.zip&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;install.bat&lt;/code&gt; inside the extracted folder&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;install.bat&lt;/code&gt; handles file placement, dependency installation, and configuration automatically. Python 3.12, Ollama, and QEMU are installed as needed. An interactive setup walks through language selection, LLM model, VM configuration, user profile, and optional service integrations (Gmail, Tavily, Telegram, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Install from Source
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;clone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://github.com/xoul-project/xoul.git&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xoul&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\scripts\setup_env.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once setup completes, the Desktop App launches automatically. After that, you can start it with &lt;code&gt;c:\xoul\desktop\xoul.bat&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community
&lt;/h2&gt;

&lt;p&gt;Through Xoul Store, you can import workflows, personas, and code snippets created by other users with one click. You can also publish your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  License
&lt;/h2&gt;

&lt;p&gt;Released under the MIT License.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://www.xoulai.net/" rel="noopener noreferrer"&gt;https://www.xoulai.net/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/xoul-project/xoul" rel="noopener noreferrer"&gt;https://github.com/xoul-project/xoul&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Discussions: &lt;a href="https://github.com/xoul-project/xoul/discussions" rel="noopener noreferrer"&gt;https://github.com/xoul-project/xoul/discussions&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Xoul - Building a Local AI Agent Platform with Small LLMs: The Walls of Tool Calling and Practical Solutions</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 15:42:25 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-building-a-local-ai-agent-platform-with-small-llms-the-walls-of-tool-calling-and-practical-11fb</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-building-a-local-ai-agent-platform-with-small-llms-the-walls-of-tool-calling-and-practical-11fb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post is a real-world account of developing Xoul, an on-premise local AI agent platform: we hit the limits of small-LLM Tool Calling and overcame them, one by one, at the application layer.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Background: "Let's Build a Local Agent"
&lt;/h2&gt;

&lt;p&gt;With large models like GPT or Claude, Tool Calling is near-perfect. But the moment you need to run &lt;strong&gt;small local LLMs (Ollama + Llama3/Qwen/GPT-oss under 20B)&lt;/strong&gt; for on-premise environments or cost reasons, reality hits hard.&lt;/p&gt;

&lt;p&gt;Xoul is a personal AI agent platform with this basic flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input
    ↓
LLM (local[small] or commercial)
    ↓ Tool Call (JSON)
Tool Router → Function execution
    ↓
Result fed back to LLM → Final response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 30+ tools running on this architecture (workflow management, scheduling, Python code execution), we hit three major problems.&lt;/p&gt;
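&lt;p&gt;The Tool Router step above, reduced to a sketch (the JSON shape follows the post's examples; the tool registration decorator and return values are illustrative):&lt;/p&gt;

```python
import json

TOOLS = {}  # tool name -> Python callable

def tool(fn):
    """Register a function as an LLM-callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def run_workflow(name: str) -> str:
    return f"ran {name}"

def dispatch(tool_call: str) -> str:
    """Route one tool call of the form {"tool": ..., "args": {...}}
    to the matching registered function."""
    call = json.loads(tool_call)
    fn = TOOLS.get(call.get("tool"))
    if fn is None:
        return "Not Found"
    return fn(**call.get("args", {}))
```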




&lt;h2&gt;
  
  
  Limitation 1: The LLM Corrupts Parameters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;User: &lt;code&gt;"Run the 'Organize My Coin When +-20%' workflow"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The LLM needs to call &lt;code&gt;run_workflow&lt;/code&gt;. What we actually got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_workflow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Coin organize"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual DB name was &lt;code&gt;"내 코인 현재 +- 20일때 정리"&lt;/code&gt; (Korean, roughly "organize my coin at ±20%"), so the result was predictably &lt;strong&gt;Not Found&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The first instinct was to fix this with prompting: &lt;em&gt;"Always call list_workflows first to verify the exact name."&lt;/em&gt; Small LLMs tend to forget early instructions as the context grows, so this was unreliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 1: Prompt Engineering → Failed
&lt;/h3&gt;

&lt;p&gt;The model followed the instruction sometimes and ignored it other times. When users issued direct execution commands, it skipped the list query entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 2: 3-Stage Fuzzy Matching → Core Solution ✅
&lt;/h3&gt;

&lt;p&gt;We redesigned the backend to match &lt;strong&gt;as flexibly as possible&lt;/strong&gt;, regardless of what the LLM passes in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "Coin organize"
  ↓
[Step 1] Match after stripping spaces/special chars
  → "Coinorganize" vs DB: "내코인현재+-20일때정리" → Fail
  ↓
[Step 2] LIKE partial match
  → DB search for "Coin" → Fail (not unique enough)
  ↓
[Step 3] Sentence Embedding cosine similarity
  → "Coin organize" ≈ "내 코인 현재 +- 20일때 정리" → Similarity 0.81 ✅ Auto-execute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Embeddings use &lt;code&gt;sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2&lt;/code&gt;, loaded at server startup and stored as BLOBs in the DB on workflow creation/update. At search time, all embeddings are loaded and cosine similarity is computed with numpy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Similarity threshold design:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Similarity&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;≥ 0.75&lt;/td&gt;
&lt;td&gt;Auto-execute (no user confirmation needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.5 ~ 0.75&lt;/td&gt;
&lt;td&gt;Show top 3 candidates for user to pick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.5&lt;/td&gt;
&lt;td&gt;Return Not Found&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
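&lt;p&gt;The three stages plus the thresholds can be sketched as follows. This is a dependency-free simplification: the embedding function is injected, standing in for the sentence-transformers model, and the DB rows are plain (name, vector) pairs:&lt;/p&gt;

```python
import math
import re

def normalize(s: str) -> str:
    # Step 1 helper: lowercase, drop spaces and special characters
    # (the character class keeps Latin letters, digits, and Hangul).
    return re.sub(r"[^0-9a-z가-힣]", "", s.lower())

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_workflow(query, rows, embed):
    """rows: (name, embedding) pairs loaded from the DB.
    embed: text -> vector, injected so the sketch stays dependency-free.
    Returns (action, candidates) following the threshold table."""
    # Step 1: exact match after normalization
    for name, _ in rows:
        if normalize(name) == normalize(query):
            return "execute", [name]
    # Step 2: LIKE-style partial match, accepted only if unique
    partial = [name for name, _ in rows
               if normalize(query) in normalize(name)]
    if len(partial) == 1:
        return "execute", partial
    # Step 3: embedding cosine similarity with thresholds
    q = embed(query)
    scored = sorted(((cosine(q, vec), name) for name, vec in rows),
                    reverse=True)
    best_score, best_name = scored[0]
    if best_score >= 0.75:
        return "execute", [best_name]
    if best_score >= 0.5:
        return "choose", [name for _, name in scored[:3]]
    return "not_found", []
```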




&lt;h2&gt;
  
  
  Limitation 2: JSON Gets Destroyed
&lt;/h2&gt;

&lt;p&gt;When the number of available tools exceeds ~30, small LLMs start to buckle under context-window pressure, producing increasingly broken JSON: natural-language sentences injected into the JSON, missing closing brackets, typos in required keys.&lt;/p&gt;

&lt;p&gt;On Ollama, this comes back as &lt;code&gt;HTTP 500: error parsing tool call&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 1: Tool Pruning ✅
&lt;/h3&gt;

&lt;p&gt;We introduced a &lt;strong&gt;Tool Registry&lt;/strong&gt; that dynamically provides only the tools relevant to the user's input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Run the workflow"
  ↓
Keyword analysis + Embedding similarity → select relevant toolkits
  ↓
Only tools from [workflow, code, schedule] toolkits sent to LLM
  → 30-tool full set → compressed to 6~8 tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since irrelevant tools simply don't exist in the prompt, JSON parse failures dropped dramatically.&lt;/p&gt;
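&lt;p&gt;A minimal sketch of such toolkit selection (the keyword map and toolkit names are illustrative; the real registry also uses embedding similarity, not just keywords):&lt;/p&gt;

```python
# Toolkit -> trigger keywords. The mapping is illustrative; the real
# registry combines keyword analysis with embedding similarity.
TOOLKITS = {
    "workflow": ["workflow", "automation"],
    "code":     ["run", "python", "script"],
    "schedule": ["schedule", "daily", "remind"],
    "email":    ["mail", "inbox", "send"],
}

def select_toolkits(user_input: str, fallback=("workflow",)):
    """Pick only the toolkits whose keywords appear in the input, so
    irrelevant tool schemas never enter the LLM prompt at all."""
    text = user_input.lower()
    hits = [kit for kit, kws in TOOLKITS.items()
            if any(kw in text for kw in kws)]
    return hits or list(fallback)
```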

&lt;h3&gt;
  
  
  Attempt 2: Native → Text Fallback ✅
&lt;/h3&gt;

&lt;p&gt;For residual failures, we added automatic retry logic to &lt;code&gt;LLMClient&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;HTTPError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error parsing tool call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Strip tools, retry in plain text mode
&lt;/span&gt;        &lt;span class="n"&gt;retry_payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;retry_payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_choice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Receive text response, parse with Regex for &amp;lt;tool&amp;gt; tags
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We keep a text-based tool-call format in the system prompt alongside native Tool Calling, so tools are still executed even in fallback mode. This is a &lt;strong&gt;Dual Parser&lt;/strong&gt; architecture.&lt;/p&gt;
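&lt;p&gt;The text side of the dual parser can be a small regex over the tag convention. A sketch, assuming an &lt;code&gt;&amp;lt;tool&amp;gt;&lt;/code&gt; tag wrapping a JSON object (the exact tag format in Xoul's system prompt may differ):&lt;/p&gt;

```python
import json
import re

# Assumed tag convention for text-mode tool calls; the exact format in
# Xoul's system prompt may differ.
TOOL_TAG = re.compile(r"<tool>\s*(\{.*?\})\s*</tool>", re.DOTALL)

def parse_text_tool_call(text: str):
    """Extract a {"tool": ..., "args": ...} object from a plain-text reply
    when native tool calling has been stripped. Returns None if absent."""
    match = TOOL_TAG.search(text)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```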

&lt;blockquote&gt;
&lt;p&gt;With sLLM-based agents, &lt;strong&gt;defensive application-layer design matters more than model quality&lt;/strong&gt;. Don't trust LLM output. Build thick validation and correction pipelines on both the input and output sides. That's the core of running these systems in production.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Local Agent's (Xoul) Code Store &amp; AI Arena: Autonomous Agents Powered by Code Execution</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 14:19:56 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/local-agentsxoul-code-store-ai-arena-autonomous-agents-powered-by-code-execution-25ck</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/local-agentsxoul-code-store-ai-arena-autonomous-agents-powered-by-code-execution-25ck</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How Xoul's Code Store turns a single code import into an autonomous agent competing in live games.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Code Store — The Agent's Toolbox
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Code Store&lt;/strong&gt; is the third pillar of the Xoul platform, alongside Personas (character) and Workflows (automation). It's a dynamic repository of Python code snippets that agents can import and execute on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Central Store] ──import──▶ [Local DB] ──run_stored_code──▶ [VM Execution]
   (GitHub)               (workflows.db)                     (Python)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Central Repository&lt;/strong&gt;: 50+ code snippets are maintained in a GitHub repository (&lt;code&gt;xoul_store&lt;/code&gt;). Metadata and inline code live together in &lt;code&gt;codes.json&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-Import&lt;/strong&gt;: When a user clicks a code in the Store UI, it's saved to the local SQLite database (&lt;code&gt;~/.xoul/workflows.db&lt;/code&gt;) via &lt;code&gt;POST /code/import&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Execution&lt;/strong&gt;: The LLM calls &lt;code&gt;run_stored_code(name, params, timeout)&lt;/code&gt;, and the code runs safely inside the VM environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Parameter Prompting
&lt;/h3&gt;

&lt;p&gt;Each code snippet can define required parameters. The LLM automatically asks the user for missing values in a natural multi-turn conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"game_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"str"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"desc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Game ID to join"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Required parameters (marked with &lt;code&gt;*&lt;/code&gt;) must be asked for; optional parameters use defaults. This multi-turn prompting pattern is the core UX of the Code Store.&lt;/p&gt;
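&lt;p&gt;A sketch of the missing-parameter check behind that multi-turn loop, assuming the &lt;code&gt;*&lt;/code&gt; marker is appended to the parameter name (the exact spec format may differ):&lt;/p&gt;

```python
# Sketch of the missing-parameter check behind the multi-turn prompting.
# Assumption: required parameters carry a trailing "*" on their name.
def missing_required(spec: list, provided: dict) -> list:
    """Return names of required parameters the user has not supplied yet."""
    missing = []
    for param in spec:
        raw = param["name"]
        name = raw.rstrip("*")
        if raw.endswith("*") and name not in provided:
            missing.append(name)
    return missing
```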




&lt;h2&gt;
  
  
  2. AI Arena — Where Agents Compete
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI Arena&lt;/strong&gt; is a platform where Xoul agents autonomously participate in social and strategy games. Currently, two game types are supported: &lt;strong&gt;Mafia&lt;/strong&gt; and &lt;strong&gt;Discussion&lt;/strong&gt; (free-form debate).&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────┐
│     Central Game Server (AWS)      │
│  ┌──────────┐  ┌───────────────┐ │
│  │ Moderator │  │  API Server   │ │
│  │ (Rules)   │  │ (REST + SSE)  │ │
│  └──────────┘  └───────────────┘ │
│  ┌───────────────────────────┐   │
│  │   Server Bots (Groq LLM)   │   │
│  │  Auto-fill empty seats      │   │
│  └───────────────────────────┘   │
└────────────────┬─────────────────┘
                 │ HTTP Polling (2s)
     ┌───────────┼───────────┐
     │           │           │
 ┌───▼──┐   ┌───▼──┐   ┌───▼──┐
 │Agent A│   │Agent B│   │Agent C│
 │(Ollama)│   │(Ollama)│   │(Ollama)│
 │Local AI│   │Local AI│   │Local AI│
 └───────┘   └───────┘   └───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Central Server&lt;/strong&gt;: Manages game state (roles, phases, turns) and collects agent actions via REST API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Agents&lt;/strong&gt;: Poll the server, detect their turn, and use a local LLM (Ollama) to decide what to say or who to vote for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server Bots&lt;/strong&gt;: When players are scarce, Groq-powered bots fill empty seats automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multi-Game Engine
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseGame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="c1"&gt;# Abstract base
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MafiaGame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="c1"&gt;# Roles, voting, night actions
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscussionGame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Topic rotation, cooldowns, unlimited players
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding a new game type is as simple as creating a package in &lt;code&gt;games/&lt;/code&gt; that exports &lt;code&gt;GAME_CLASS&lt;/code&gt;. The registry auto-discovers it at startup.&lt;/p&gt;
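&lt;p&gt;That auto-discovery pattern is typically a few lines of &lt;code&gt;pkgutil&lt;/code&gt; plus &lt;code&gt;importlib&lt;/code&gt;. A sketch of the pattern (Xoul's registry internals may differ):&lt;/p&gt;

```python
import importlib
import pkgutil

def discover_games(package_name: str = "games") -> dict:
    """Import each subpackage of games/ and register its GAME_CLASS.
    A sketch of the auto-discovery pattern; registry internals may differ."""
    registry = {}
    package = importlib.import_module(package_name)
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        game_cls = getattr(module, "GAME_CLASS", None)
        if game_cls is not None:
            registry[info.name] = game_cls
    return registry
```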




&lt;h2&gt;
  
  
  3. Code Store × Arena — Code Drives the Agent
&lt;/h2&gt;

&lt;p&gt;The most distinctive aspect of AI Arena is that &lt;strong&gt;agent participation itself is a Code Store execution&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Join&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="n"&gt;Desktop&lt;/span&gt; &lt;span class="nc"&gt;App &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt; &lt;span class="n"&gt;arenajoin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Detect&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="nf"&gt;type &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mafia&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;discussion&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Check&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;locally&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;xoul_store&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="n"&gt;Send&lt;/span&gt; &lt;span class="n"&gt;execution&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="nf"&gt;run_stored_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Discussion Game Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;game_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SherlockBot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persona&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analytical detective AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;VM &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Join&lt;/span&gt; &lt;span class="nf"&gt;game &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POST&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;arena&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;games&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Poll&lt;/span&gt; &lt;span class="nf"&gt;state &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;arena&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;games&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Generate&lt;/span&gt; &lt;span class="n"&gt;speech&lt;/span&gt; &lt;span class="n"&gt;via&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nf"&gt;submit &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POST&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;speak&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;Detect&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="n"&gt;changes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="n"&gt;cooldowns&lt;/span&gt;
    &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;Print&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Design: Zero-Dependency Agent
&lt;/h3&gt;

&lt;p&gt;Agent code uses &lt;strong&gt;only Python standard library&lt;/strong&gt; (&lt;code&gt;urllib&lt;/code&gt;, &lt;code&gt;json&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt;). No pip installs needed — it runs anywhere, instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Core agent loop (simplified)
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;api_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/arena/games/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/state?player_id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;my_pid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🏁 Game over!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending_action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_speak_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Ollama API
&lt;/span&gt;        &lt;span class="nf"&gt;api_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/arena/games/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/speak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;player_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;my_pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Discussion vs. Mafia
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Mafia&lt;/th&gt;
&lt;th&gt;Discussion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Room creation&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Always auto-maintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Join timing&lt;/td&gt;
&lt;td&gt;Before start only&lt;/td&gt;
&lt;td&gt;Mid-game OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Player limit&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Effectively unlimited (capped at 999)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Roles&lt;/td&gt;
&lt;td&gt;Citizen/Mafia/Police&lt;/td&gt;
&lt;td&gt;All participants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speaking&lt;/td&gt;
&lt;td&gt;Turn-based (ordered)&lt;/td&gt;
&lt;td&gt;Free + 5s cooldown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topics&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;New topic every 10 min (pool of 500)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
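&lt;p&gt;The free-speaking rules above are easy to respect client-side. A minimal sketch, sticking to the standard library; the helper and parameter names are illustrative, not actual Xoul code:&lt;/p&gt;

```python
import time

def speak_with_cooldown(post, game_id, player_id, message, last_spoke, cooldown=5.0):
    """Submit a Discussion message while honoring the per-player cooldown.
    `post` stands in for the agent's api_post helper; the 5-second default
    mirrors the Discussion rules in the table above."""
    now = time.monotonic()
    # How much of the cooldown window is still left for this player?
    wait = max(0.0, cooldown - (now - last_spoke.get(player_id, 0.0)))
    time.sleep(wait)  # wait out the remainder, if any
    post(f"/arena/games/{game_id}/speak",
         {"player_id": player_id, "message": message})
    last_spoke[player_id] = time.monotonic()
```

&lt;p&gt;Tracking timestamps in a plain dict keeps the agent stateless between games: a fresh dict per game means the first message always goes out immediately.&lt;/p&gt;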




&lt;h2&gt;
  
  
  4. Why This Architecture?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Why not put agent logic on the server instead of running client-side code?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is Xoul's philosophy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decentralization&lt;/strong&gt;: The agent's brain (LLM) runs on the user's local machine. The server is just the referee — each AI thinks for itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customization&lt;/strong&gt;: Edit the agent code in your Code Store to change strategies. Aggressive persona, cautious analyst — it's all configurable at the code level.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;: When a new game type is added, just upload a new agent code to the Store. No client app update needed — auto-import → instant play.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>python</category>
    </item>
    <item>
      <title>🔐 Why a GitHub-Based Store? — Security and Community Sharing for Local AI Agents</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Mon, 09 Mar 2026 15:07:44 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/why-a-github-based-store-security-and-community-sharing-for-local-ai-agents-28lc</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/why-a-github-based-store-security-and-community-sharing-for-local-ai-agents-28lc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How Xoul Platform safely shares workflows, personas, and code snippets&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  👤 Why Local AI Agent Security Matters
&lt;/h2&gt;

&lt;p&gt;Local AI Agents execute code directly on the user's machine. This is powerful — but it also carries serious security risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if someone shares malicious code?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Looks like "stock price checker" but...
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rm -rf /&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 💀 System destroyed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you import code from a community Store, that code runs &lt;strong&gt;directly on your local machine&lt;/strong&gt;. File deletion, data theft, malware installation — all possible.&lt;/p&gt;

&lt;p&gt;This is why &lt;strong&gt;every shared item must go through verification&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛡️ GitHub PR-Based Sharing System
&lt;/h2&gt;

&lt;p&gt;We solve this with a &lt;strong&gt;GitHub Pull Request&lt;/strong&gt; based sharing system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Principles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;All shares are submitted as PRs&lt;/strong&gt; — review requests, not direct publishing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only approved items get published&lt;/strong&gt; — malicious code blocked upfront&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code is 100% transparent&lt;/strong&gt; — every line visible for review
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📤 Share Request → GitHub PR → 🔍 Admin Review → ✅ Merge → 🌐 Published to Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why GitHub?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;GitHub PR&lt;/th&gt;
&lt;th&gt;Direct Upload&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code Transparency&lt;/td&gt;
&lt;td&gt;✅ Full diff review&lt;/td&gt;
&lt;td&gt;❌ Opaque contents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review Process&lt;/td&gt;
&lt;td&gt;✅ Built-in code review&lt;/td&gt;
&lt;td&gt;❌ Must build separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Version Control&lt;/td&gt;
&lt;td&gt;✅ Git history&lt;/td&gt;
&lt;td&gt;❌ None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community Contribution&lt;/td&gt;
&lt;td&gt;✅ Fork/PR open-source pattern&lt;/td&gt;
&lt;td&gt;❌ Closed ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;✅ Free&lt;/td&gt;
&lt;td&gt;💰 Storage/DB costs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🔄 Sharing Sequence Flow
&lt;/h2&gt;

&lt;p&gt;Let's walk through the complete flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: User Initiates Share
&lt;/h3&gt;

&lt;p&gt;Click the 📤 button in the desktop app's list view — a share request is sent via chat.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Workflow List ]
Name         Description          Actions
Test WF      Test workflow         ▶ ✏ 📤 🗑
                                       ↑ This button!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7phnlp28urg0hzkrx9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7phnlp28urg0hzkrx9t.png" alt=" " width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: LLM Calls the Tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → "share_to_store(share_type="workflow", name="Test WF") 실행해줘"
  ↓
LLM → 🔧 share_to_store tool call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: VM Reads DB → Calls API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[VM Server]
  ├── Query item data from SQLite DB
  ├── Build ShareRequest payload
  └── POST to web server /api/share
         ↓
[EC2 Web Server]
  ├── GitHub API: Get main branch SHA
  ├── Create branch: share/workflow/test_1234
  ├── Commit file (code/JSON/markdown)
  ├── Update manifest.json
  └── Create Pull Request
         ↓
[GitHub]
  └── PR: "Share workflow: Test WF"
       → Awaiting admin review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
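&lt;p&gt;The EC2 steps above can be sketched as a pure planning function. Everything named here is illustrative (&lt;code&gt;plan_share_calls&lt;/code&gt;, &lt;code&gt;OWNER&lt;/code&gt;, the repo name), but the endpoint paths follow the public GitHub REST API:&lt;/p&gt;

```python
import base64

def plan_share_calls(share_type, name, content, item_id):
    """Plan the sequence of GitHub REST calls the web server issues for
    one share request. Hypothetical helper: payload fields follow the
    flow described above, not actual Xoul source."""
    slug = name.lower().replace(" ", "_")
    branch = f"share/{share_type}/{slug}_{item_id}"
    ext = {"code": ".py", "persona": ".md", "workflow": ".json"}.get(share_type, ".txt")
    path = f"{share_type}s/{slug}{ext}"
    encoded = base64.b64encode(content.encode()).decode()
    return [
        # 1. Read the SHA main currently points at
        ("GET", "/repos/OWNER/xoul-store/git/ref/heads/main", None),
        # 2. Create the share branch from that SHA
        ("POST", "/repos/OWNER/xoul-store/git/refs",
         {"ref": f"refs/heads/{branch}", "sha": "MAIN_SHA"}),
        # 3. Commit the shared file onto the branch
        ("PUT", f"/repos/OWNER/xoul-store/contents/{path}",
         {"message": f"Share {share_type}: {name}",
          "content": encoded, "branch": branch}),
        # 4. Open the pull request for admin review
        ("POST", "/repos/OWNER/xoul-store/pulls",
         {"title": f"Share {share_type}: {name}",
          "head": branch, "base": "main"}),
    ]
```

&lt;p&gt;Separating planning from execution also makes the pipeline testable without touching GitHub.&lt;/p&gt;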



&lt;h3&gt;
  
  
  Step 4: Review &amp;amp; Publish
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Admin]
  ├── Review PR code
  ├── Check for malicious content
  └── Approve &amp;amp; Merge
         ↓
[Store]
  └── ✅ Available for community import
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94eknaccbc51qjpz1n30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94eknaccbc51qjpz1n30.png" alt=" " width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim76f7yqo8xkpsvz17ky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim76f7yqo8xkpsvz17ky.png" alt=" " width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr81ziyuu5pfuqmvw51z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr81ziyuu5pfuqmvw51z.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wb1qwq4809gaqqxveyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wb1qwq4809gaqqxveyb.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🤖 Agent-Based Implementation — LLM Calls, Not Direct Code Calls
&lt;/h2&gt;

&lt;p&gt;The most interesting design decision: &lt;strong&gt;the share function is not called directly from the desktop app, but through the LLM Agent&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Not Direct Calls?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Desktop ↔ DB Access Impossible&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Desktop App (Windows)] ←✗→ [DB (VM Linux)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The desktop app runs on Windows, but workflow/persona/code data lives in the VM's SQLite database. Direct access is impossible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: Complex Cache Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Direct calls would require caching data during list rendering, handling cache misses, pre-fetching lists... it gets complicated fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent-Based Solution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📤 Button Click
  → Chat message sent: "share_to_store(...) run this"
    → LLM calls share_to_store tool
      → Executes on VM (DB access available!)
        → Calls web server API
          → GitHub PR created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
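&lt;p&gt;The flow above can be sketched as a tiny tool registry. The &lt;code&gt;tool&lt;/code&gt; decorator, the in-memory SQLite table, and the field names are all stand-ins for the real VM-side implementation:&lt;/p&gt;

```python
import json
import sqlite3

TOOLS = {}

def tool(fn):
    """Register a function as an LLM-callable tool (hypothetical registry)."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def share_to_store(share_type, name):
    """Runs on the VM, so it can read the local SQLite DB directly."""
    db = sqlite3.connect(":memory:")  # stand-in for the VM database
    db.execute("CREATE TABLE items (type TEXT, name TEXT, content TEXT)")
    db.execute("INSERT INTO items VALUES (?, ?, ?)",
               (share_type, name, "print('demo')"))
    row = db.execute("SELECT content FROM items WHERE type=? AND name=?",
                     (share_type, name)).fetchone()
    # A real implementation would POST this payload to the web server /api/share
    return {"share_type": share_type, "name": name, "content": row[0]}

def dispatch(tool_call):
    """What the agent runtime does when the LLM emits a tool call."""
    args = json.loads(tool_call["arguments"])
    return TOOLS[tool_call["name"]](**args)
```

&lt;p&gt;Whether the trigger is a button click or a typed sentence, the LLM emits the same tool call and &lt;code&gt;dispatch&lt;/code&gt; runs the same function.&lt;/p&gt;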



&lt;p&gt;&lt;strong&gt;Everything goes through the Agent.&lt;/strong&gt; This is Xoul's core philosophy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🧠 &lt;strong&gt;"Every capability exists as an AI Agent tool. The UI is simply an interface that sends requests to the Agent."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Benefits of this approach:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified Interface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Buttons, voice, or text — all trigger the same tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Natural Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Share my test workflow" just works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Easy Extension&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add a tool = add a feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Environment Independent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works the same on VM or Windows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>github</category>
      <category>security</category>
    </item>
    <item>
      <title>Building a GitHub-Based Community Sharing System for a Local AI Agent</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Sun, 08 Mar 2026 14:11:04 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/building-a-github-based-community-sharing-system-for-a-local-ai-agent-18lm</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/building-a-github-based-community-sharing-system-for-a-local-ai-agent-18lm</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How we designed a pipeline where users share code, personas, and workflows with a single button — and operators approve via GitHub PR.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrshma2wpfk7s268xtj2.png" alt=" " width="800" height="468"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Why We Needed a Sharing System
&lt;/h2&gt;

&lt;p&gt;Xoul (Androi) is a locally-running AI agent. Users can create three types of content:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code Store&lt;/strong&gt;: Python utility snippets like &lt;code&gt;crypto prices&lt;/code&gt;, &lt;code&gt;BMI calculator&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personas&lt;/strong&gt;: System prompts defining LLM personality and expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflows&lt;/strong&gt;: Automated pipelines chaining prompts and code steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem: all of this was &lt;strong&gt;trapped in each user's local SQLite database&lt;/strong&gt;. "I made something useful — how do I share it?" was the natural next question.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Sharing Model Dilemma
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Central server upload&lt;/td&gt;
&lt;td&gt;Simple UX&lt;/td&gt;
&lt;td&gt;Admin burden, spam risk, server costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2P direct transfer&lt;/td&gt;
&lt;td&gt;Decentralized&lt;/td&gt;
&lt;td&gt;Zero discoverability, network complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub PR-based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code review, history tracking, free hosting&lt;/td&gt;
&lt;td&gt;Users need GitHub accounts?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GitHub PR won decisively. Code naturally becomes &lt;code&gt;.py&lt;/code&gt; files, personas become &lt;code&gt;.md&lt;/code&gt; files — existing code review culture applies directly. But &lt;strong&gt;one critical concern&lt;/strong&gt;: can we ask non-developer users to create a GitHub account, fork, and submit PRs?&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Key Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2-1. Don't Ask Users for a GitHub Account
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial design:&lt;/strong&gt; User logs in via GitHub OAuth → fork → commit → PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality check:&lt;/strong&gt; Many target users aren't developers. They might not know what GitHub is. Adding an OAuth flow makes UX dramatically more complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final design:&lt;/strong&gt; The server acts as a &lt;strong&gt;proxy&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Desktop  →  Server /api/share  →  GitHub API (Server PAT)  →  PR created
     ↑                                                                 ↓
   PR URL returned  ←──────────────────────────────────────────────  PR URL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server holds a GitHub Personal Access Token and converts user requests into PRs. From the user's perspective, it's one "📤 Share" button. All low-level Git operations are handled server-side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We chose this knowing the trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Extremely simple UX (one button)&lt;/li&gt;
&lt;li&gt;✅ No GitHub account required for users&lt;/li&gt;
&lt;li&gt;⚠️ Server PAT security management needed&lt;/li&gt;
&lt;li&gt;⚠️ All PRs show the server account as author (contributor identified in PR body)&lt;/li&gt;
&lt;/ul&gt;
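&lt;p&gt;Since the PAT makes the server account the commit author, the contributor has to be credited somewhere else. One way the PR body could be composed (field names are illustrative):&lt;/p&gt;

```python
def build_pr_body(share_type, name, contributor, description):
    """Compose the PR description the proxy submits on the user's behalf.
    The human contributor is credited here because the server PAT owns
    the commit itself. A sketch, not the actual Xoul template."""
    return "\n".join([
        f"## Share request: {name}",
        f"- Type: {share_type}",
        f"- Contributor: {contributor}",
        f"- Description: {description}",
        "",
        "Submitted automatically by the share proxy.",
    ])
```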

&lt;h3&gt;
  
  
  2-2. One Repo vs Three
&lt;/h3&gt;

&lt;p&gt;We debated whether codes/personas/workflows should each have their own repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three repos:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate permissions (code maintainer ≠ persona maintainer)&lt;/li&gt;
&lt;li&gt;Independent CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;One repo (chosen):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Management overhead reduced to 1/3&lt;/li&gt;
&lt;li&gt;Single GitHub Action builds everything&lt;/li&gt;
&lt;li&gt;One PR can include both code + workflow&lt;/li&gt;
&lt;li&gt;Contributors only need to know one repo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We followed the "start simple" principle. If scale becomes an issue, we can split later — but premature separation would triple our maintenance burden today.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xoul-store/
├── codes/
│   ├── finance/binance-portfolio.py
│   ├── games/arena-agent-v1.py
│   └── manifest.json
├── personas/
│   ├── research/p-001-en.md
│   └── manifest.json
├── workflows/
│   └── manifest.json
├── dist/          ← Auto-generated by CI
│   ├── codes.json
│   └── personas.json
└── .github/workflows/build.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2-3. Monolith JSON vs Individual Files
&lt;/h3&gt;

&lt;p&gt;Previously, &lt;code&gt;codes.json&lt;/code&gt; contained all 50 codes in a single file. Try reviewing a 3000-line JSON diff in a PR — it's essentially un-reviewable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Individual file benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contributors add &lt;strong&gt;one&lt;/strong&gt; &lt;code&gt;.py&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Code review works at &lt;strong&gt;file granularity&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Change history tracked per code&lt;/li&gt;
&lt;li&gt;Merge conflicts minimized&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But the server wants a single JSON&lt;/strong&gt; for the web Store catalog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: GitHub Action auto-build.&lt;/strong&gt; On every merge, CI reads &lt;code&gt;manifest.json&lt;/code&gt; + individual files and generates &lt;code&gt;dist/codes.json&lt;/code&gt;. Contributors touch &lt;code&gt;.py&lt;/code&gt; + one manifest line. The server fetches &lt;code&gt;dist/codes.json&lt;/code&gt; from GitHub Raw URL, with local fallback.&lt;/p&gt;
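&lt;p&gt;The CI build step could be as small as this. The manifest schema shown here (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;) is an assumption for illustration, not the actual store format:&lt;/p&gt;

```python
import json
from pathlib import Path

def build_catalog(store_dir):
    """CI build sketch: merge manifest entries with each .py file's source
    into one dist/codes.json the server can fetch in a single request."""
    root = Path(store_dir)
    manifest = json.loads((root / "codes" / "manifest.json").read_text())
    catalog = []
    for entry in manifest:
        code_file = root / "codes" / entry["path"]
        catalog.append({
            "name": entry["name"],
            "category": entry.get("category", "misc"),
            "code": code_file.read_text(),  # inline the full source
        })
    out = root / "dist"
    out.mkdir(exist_ok=True)
    (out / "codes.json").write_text(json.dumps(catalog, indent=2))
    return catalog
```

&lt;p&gt;Contributors never touch &lt;code&gt;dist/&lt;/code&gt;; it is regenerated on every merge, so review diffs stay one file wide.&lt;/p&gt;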




&lt;h2&gt;
  
  
  3. Code Layer Design Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3-1. &lt;code&gt;used_by&lt;/code&gt; References — Why We Abandoned private/public
&lt;/h3&gt;

&lt;p&gt;Initially, we planned &lt;code&gt;private&lt;/code&gt;/&lt;code&gt;public&lt;/code&gt; flags for Code Store items. Workflow-only codes would be private, standalone codes public.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question that killed it:&lt;/strong&gt; "What if the same code is used by multiple workflows?"&lt;/p&gt;

&lt;p&gt;Binary classification can't express many-to-many relationships. We switched to a &lt;strong&gt;&lt;code&gt;used_by&lt;/code&gt; JSON array&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# codes table
&lt;/span&gt;&lt;span class="n"&gt;used_by&lt;/span&gt; &lt;span class="n"&gt;TEXT&lt;/span&gt; &lt;span class="n"&gt;DEFAULT&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;   &lt;span class="c1"&gt;# e.g., ["Tech Trend Research", "Daily Blog"]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deletion protection:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-empty &lt;code&gt;used_by&lt;/code&gt; → block deletion + show which workflows reference it&lt;/li&gt;
&lt;li&gt;Empty array → &lt;strong&gt;no auto-deletion&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why no GC? A code removed from a workflow isn't necessarily useless. Users might re-attach it later, or run it standalone via &lt;code&gt;run_stored_code&lt;/code&gt;. Over-aggressive garbage collection surprises users.&lt;/p&gt;
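&lt;p&gt;The deletion guard is a few lines once &lt;code&gt;used_by&lt;/code&gt; is a JSON array. A sketch (the row shape and messages are illustrative):&lt;/p&gt;

```python
import json

def try_delete_code(db_row):
    """Deletion guard sketch: a code row keeps a JSON array of the
    workflows that reference it; deletion is refused while any remain."""
    users = json.loads(db_row.get("used_by", "[]"))
    if users:
        names = ", ".join(users)
        return (False, f"Blocked: still used by {names}")
    # Empty array: deletable on request, but never garbage-collected
    return (True, "Deleted")
```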

&lt;h3&gt;
  
  
  3-2. &lt;code&gt;code_name&lt;/code&gt; References — Eliminating Inline Duplication
&lt;/h3&gt;

&lt;p&gt;Workflow code steps previously stored &lt;strong&gt;entire code inline&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"import urllib.request&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;...200 lines..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same code in 3 workflows = 3 copies. Fix a bug? Find and update all 3. This violates basic database normalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New approach:&lt;/strong&gt; Steps store only &lt;code&gt;code_name&lt;/code&gt;, resolved from Code Store at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"code_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crypto prices"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Backward compatibility:&lt;/strong&gt; If &lt;code&gt;code_name&lt;/code&gt; exists → fetch from Store. Otherwise → use legacy &lt;code&gt;content&lt;/code&gt;. No existing workflows break.&lt;/p&gt;
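&lt;p&gt;The resolution rule fits in one function. A sketch, assuming a simple name-to-source mapping for the Code Store:&lt;/p&gt;

```python
def resolve_step_code(step, code_store):
    """Resolve a workflow code step. New-style steps carry only a
    code_name looked up in the Code Store; legacy steps still embed
    their source in content, so neither format breaks."""
    name = step.get("code_name")
    if name:
        return code_store[name]   # single canonical copy
    return step["content"]        # legacy inline code
```

&lt;p&gt;A bug fix now lands in one place and every referencing workflow picks it up at its next run.&lt;/p&gt;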

&lt;h3&gt;
  
  
  3-3. &lt;code&gt;def run()&lt;/code&gt; Standardization — Unifying Two Worlds
&lt;/h3&gt;

&lt;p&gt;Code Store's 50 codes were &lt;strong&gt;flat scripts&lt;/strong&gt; (globals, no function wrapper). The workflow editor expected &lt;code&gt;def run(params):&lt;/code&gt; signatures. Two execution models running in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dilemma:&lt;/strong&gt; Rewrite all 50, or support both at runtime?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt; Both. &lt;code&gt;run_stored_code&lt;/code&gt; detects whether &lt;code&gt;def run(&lt;/code&gt; is present and switches between function invocation and plain &lt;code&gt;exec&lt;/code&gt;. We then converted all 50 codes to the &lt;code&gt;def run():&lt;/code&gt; signature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (flat — how does the LLM know what params to pass?)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.coingecko.com/...?ids=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;coins&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# After (standardized — LLM reads signature + docstring)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coins&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bitcoin,ethereum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    coins: Coin IDs (default: bitcoin,ethereum)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.coingecko.com/...?ids=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;coins&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The biggest beneficiary is the &lt;strong&gt;LLM itself&lt;/strong&gt;. With a function signature and docstring, it immediately knows what parameters to pass and in what format. With flat code, the LLM had to parse the entire script body.&lt;/p&gt;
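&lt;p&gt;The dual-mode dispatch can be sketched like this. Here the check happens after &lt;code&gt;exec&lt;/code&gt; by inspecting the resulting namespace, which is one possible way to implement the same rule:&lt;/p&gt;

```python
def run_stored_code(source, params=None):
    """Execution dispatch sketch: codes defining run() are invoked as a
    function with params; legacy flat scripts are simply exec'd."""
    namespace = {}
    exec(source, namespace)
    if "run" in namespace and callable(namespace["run"]):
        return namespace["run"](**(params or {}))
    # Flat legacy scripts may leave their output in a `result` global
    return namespace.get("result")
```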




&lt;h2&gt;
  
  
  4. Side Fix: Removing Hardcoded Model References
&lt;/h2&gt;

&lt;p&gt;While working on the sharing pipeline, we discovered Arena agent code (&lt;code&gt;arena_agent_code.py&lt;/code&gt;, &lt;code&gt;arena-agent-v1.py&lt;/code&gt;) had &lt;code&gt;gpt-oss:20b&lt;/code&gt; hardcoded — bypassing the 4B model configured in &lt;code&gt;config.json&lt;/code&gt;. This caused unexpected memory spikes.&lt;/p&gt;

&lt;p&gt;We removed all hardcoded model references and &lt;strong&gt;eliminated fallback defaults&lt;/strong&gt;. If config is missing, a &lt;code&gt;RuntimeError&lt;/code&gt; fires immediately rather than silently loading the wrong model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design principle: fail-fast &amp;gt; silent wrong behavior.&lt;/strong&gt; A crash with a clear error message is always better than 15GB of unexpected VRAM usage.&lt;/p&gt;
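&lt;p&gt;A minimal sketch of the fail-fast rule, assuming a &lt;code&gt;model&lt;/code&gt; key in &lt;code&gt;config.json&lt;/code&gt; (the real key names may differ):&lt;/p&gt;

```python
import json
from pathlib import Path

def load_model_name(config_path):
    # Fail fast: no hardcoded fallback model. A missing or incomplete
    # config raises immediately instead of silently loading the wrong model.
    # Hypothetical sketch; the project's actual config layout may differ.
    path = Path(config_path)
    if not path.exists():
        raise RuntimeError(f"config not found: {config_path} (refusing to fall back)")
    cfg = json.loads(path.read_text())
    model = cfg.get("model")
    if not model:
        raise RuntimeError("'model' missing in config (refusing to guess a default)")
    return model
```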




</description>
      <category>agents</category>
      <category>ai</category>
      <category>github</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Local LLM Agent Benchmark: Comparing 6 Models in Real-World Scenarios</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Sat, 28 Feb 2026 07:01:54 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/local-llm-agent-benchmark-comparing-6-models-in-real-world-scenarios-3ffb</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/local-llm-agent-benchmark-comparing-6-models-in-real-world-scenarios-3ffb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Measuring AI agent performance by &lt;strong&gt;actual outcome correctness&lt;/strong&gt;, not just tool call presence&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why We Built This Benchmark
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"To make it accessible for general users, it is crucial to find an LLM with the lowest possible VRAM footprint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most LLM benchmarks evaluate models on academic metrics like MMLU, HumanEval, or HellaSwag. But for &lt;strong&gt;tool-using AI agents&lt;/strong&gt;, what truly matters isn't "did it call the right tool?" — it's &lt;strong&gt;"did it actually produce the correct result?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our project &lt;strong&gt;Androi&lt;/strong&gt; is a local AI agent that uses 10+ tools including web search, Python execution, file management, email, and calendar. We connected various LLMs to the same agent and ran &lt;strong&gt;5 identical complex real-world scenarios&lt;/strong&gt;, scoring each based on the correctness of their outputs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server&lt;/strong&gt;: Ubuntu VM (3.8GB RAM, 20GB SSD)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: Ollama (local inference)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework&lt;/strong&gt;: Androi Agent (Node.js + Python tool pipeline)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Outcome-Based Validation (v2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Date&lt;/strong&gt;: 2026-02-28&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The 5 Real-World Test Scenarios (39 Total Checks)
&lt;/h2&gt;

&lt;p&gt;Each test requires the agent to &lt;strong&gt;chain multiple tools sequentially&lt;/strong&gt; to complete a complex, multi-step task.&lt;/p&gt;

&lt;h3&gt;
  
  
  U01. 🏦 Global Asset Rebalancing Advisor (9 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: The user holds 50 shares of Samsung Electronics, 0.1 BTC, $3,000 USD, and 1 oz of gold. The agent must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Web search&lt;/strong&gt; current prices for each asset (Samsung stock, Bitcoin, USD/KRW rate, gold price)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert to KRW&lt;/strong&gt; and calculate total portfolio value&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute Python code&lt;/strong&gt; to compute each asset's weight (%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare against ideal allocation&lt;/strong&gt; (Stocks 40%, Crypto 20%, USD 20%, Gold 20%) and recommend rebalancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save report&lt;/strong&gt; to file (&lt;code&gt;/tmp/rebalance_report.txt&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register calendar event&lt;/strong&gt; for next Friday's review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send email&lt;/strong&gt; with the report attached&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Samsung price, Bitcoin price, USD rate, Gold price, Total calculation, Weight analysis, Rebalancing recommendation, File saved, Email sent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;web_search&lt;/code&gt; × 4, &lt;code&gt;run_python_code&lt;/code&gt; / &lt;code&gt;calculate&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;create_event&lt;/code&gt;, &lt;code&gt;send_email&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  U02. 📊 Real-Time Tech Trend Research &amp;amp; Report (8 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Research the 2026 AI semiconductor market and produce a comprehensive report.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search &lt;strong&gt;"AI semiconductor market forecast 2026"&lt;/strong&gt; → collect market size data&lt;/li&gt;
&lt;li&gt;Search &lt;strong&gt;"NVIDIA HBM market share 2026"&lt;/strong&gt; → understand competitive landscape&lt;/li&gt;
&lt;li&gt;Search &lt;strong&gt;"Samsung HBM3E mass production"&lt;/strong&gt; → Korean industry status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate markdown report&lt;/strong&gt; using Python with collected data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save report&lt;/strong&gt; to &lt;code&gt;/tmp/ai_semiconductor_report.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register weekly automated task&lt;/strong&gt; for trend updates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send report via email&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Market size mentioned, NVIDIA mentioned, HBM mentioned, Samsung trends, SK Hynix trends, Report saved, Auto-task registered, Email sent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;web_search&lt;/code&gt; × 3, &lt;code&gt;run_python_code&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;create_task&lt;/code&gt;, &lt;code&gt;send_email&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  U03. 🖥️ Server Health Check + Auto-Recovery + Alerts (7 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Perform a comprehensive VM server health check and generate a report.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;df -h&lt;/code&gt; to check &lt;strong&gt;disk usage&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;free -h&lt;/code&gt; to check &lt;strong&gt;memory status&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;systemctl list-units --state=failed&lt;/code&gt; to identify &lt;strong&gt;failed services&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use Python to analyze last 50 lines of &lt;code&gt;/var/log/syslog&lt;/code&gt; for &lt;strong&gt;ERROR/WARNING/CRITICAL frequency&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;find&lt;/code&gt; to list &lt;strong&gt;temp files older than 7 days&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save full report&lt;/strong&gt; with &lt;strong&gt;risk level assessment&lt;/strong&gt; (High/Medium/Low)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Register hourly auto-check task&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Disk usage, Memory status, Service status, Log analysis, Risk level assessment, Report saved, Auto-task registered&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;run_command&lt;/code&gt; × 4, &lt;code&gt;run_python_code&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;create_task&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  U04. 🌍 Travel Planner (8 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Plan a weekend trip to Jeju Island (1 night, 2 days).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search &lt;strong&gt;"Jeju Island February weather"&lt;/strong&gt; → temperature and conditions&lt;/li&gt;
&lt;li&gt;Search &lt;strong&gt;"Jeju winter restaurant recommendations 2026"&lt;/strong&gt; → select 3 restaurants&lt;/li&gt;
&lt;li&gt;Search &lt;strong&gt;"Jeju winter tourist attractions"&lt;/strong&gt; → select 3 attractions&lt;/li&gt;
&lt;li&gt;Use Python to create a &lt;strong&gt;Day 1 / Day 2 timetable&lt;/strong&gt; (09:00–21:00, alternating attractions and restaurants)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate estimated budget&lt;/strong&gt; (meals: 30K KRW × 6, hotel: 150K, transport: 50K = 380K KRW)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save travel plan&lt;/strong&gt; to file + &lt;strong&gt;Register calendar events&lt;/strong&gt; (departure/return)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send plan via email&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Weather info, Restaurant recommendations, Tourist attractions, Day 1/Day 2 separation, Timetable, Cost calculation, Calendar events, Email sent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;web_search&lt;/code&gt; × 3, &lt;code&gt;run_python_code&lt;/code&gt;, &lt;code&gt;calculate&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;create_event&lt;/code&gt; × 2, &lt;code&gt;send_email&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  U05. 🧬 Code Analysis + Optimization + Deployment (7 checks)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Analyze the server's &lt;code&gt;tool_registry.py&lt;/code&gt; file and produce a code review report.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;read_file&lt;/code&gt; to &lt;strong&gt;read entire source code&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute Python&lt;/strong&gt; to count lines, functions, and classes&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;wc -l /root/xoul/tools/*.py&lt;/code&gt; to check &lt;strong&gt;total module size&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;calculate&lt;/code&gt; to compute &lt;strong&gt;tool_registry.py's percentage&lt;/strong&gt; of total codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save analysis report&lt;/strong&gt; to &lt;code&gt;/tmp/code_analysis.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store key findings&lt;/strong&gt; in memory (recall/memorize)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send report via email&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Validation Checks&lt;/strong&gt;: Line count, Function count, Total module size, Percentage calculated, Code structure explained, Report saved, Email sent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required Tools&lt;/strong&gt;: &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;run_python_code&lt;/code&gt;, &lt;code&gt;run_command&lt;/code&gt;, &lt;code&gt;calculate&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;memorize&lt;/code&gt;, &lt;code&gt;send_email&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Validation Method: Outcome-Based
&lt;/h2&gt;

&lt;p&gt;Instead of checking "did it call the right tool?", we verify &lt;strong&gt;"does the output contain the correct information?"&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100% = 🏆 PERFECT — All validation checks passed
≥70% = ✅ GOOD — Most critical outcomes achieved  
≥50% = ⚠️ PARTIAL — More than half achieved
&amp;lt;50% = ❌ FAIL — Critical outcomes missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, in U01 if the agent didn't explicitly call &lt;code&gt;send_email&lt;/code&gt; but the response contains "email sent successfully", it passes. Conversely, calling &lt;code&gt;web_search&lt;/code&gt; but not including Samsung's stock price in the response is a fail.&lt;/p&gt;
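&lt;p&gt;The grading above can be sketched as a simple outcome check. This is an illustrative stand-in for the v2 validator; the real checks may use regexes or numeric tolerances rather than plain substrings:&lt;/p&gt;

```python
def score_outcome(response, checks):
    # Outcome-based grading: each check maps a name to the evidence expected
    # in the agent's final output. Hypothetical sketch of the v2 validator.
    passed = sum(1 for needle in checks.values()
                 if needle.lower() in response.lower())
    ratio = passed / len(checks)
    if ratio == 1.0:
        return "PERFECT"
    if ratio >= 0.7:
        return "GOOD"
    if ratio >= 0.5:
        return "PARTIAL"
    return "FAIL"
```

&lt;p&gt;Grading the final response text, rather than the tool-call trace, is what lets "email sent successfully" pass without an explicit &lt;code&gt;send_email&lt;/code&gt; call being observed.&lt;/p&gt;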




&lt;h2&gt;
  
  
  🏆 Final Rankings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;🏆 PERFECT&lt;/th&gt;
&lt;th&gt;✅ GOOD&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GPT-oss-20B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37/39 (95%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;264s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.5-27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37/39 (95%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1,101s&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-8B Q8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;36/39 (92%)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;377s&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4️⃣&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GLM-4.7-Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~4B(MoE)&lt;/td&gt;
&lt;td&gt;36/39 (92%)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1,310s&lt;/td&gt;
&lt;td&gt;⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5️⃣&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-8B Q4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;35/39 (90%)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;441s&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6️⃣&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.5-35B-A3B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;35B(MoE)&lt;/td&gt;
&lt;td&gt;31/39 (79%)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;552s&lt;/td&gt;
&lt;td&gt;⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Per-Test Heatmap
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;U01 Assets&lt;/th&gt;
&lt;th&gt;U02 Research&lt;/th&gt;
&lt;th&gt;U03 Server&lt;/th&gt;
&lt;th&gt;U04 Travel&lt;/th&gt;
&lt;th&gt;U05 Code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-oss-20B&lt;/td&gt;
&lt;td&gt;🏆 9/9&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;✅ 6/7&lt;/td&gt;
&lt;td&gt;✅ 7/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-27B&lt;/td&gt;
&lt;td&gt;🏆 9/9&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;td&gt;✅ 6/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-8B Q8&lt;/td&gt;
&lt;td&gt;✅ 8/9&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;td&gt;✅ 6/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-4.7-Flash&lt;/td&gt;
&lt;td&gt;✅ 8/9&lt;/td&gt;
&lt;td&gt;✅ 6/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;🏆 7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-8B Q4&lt;/td&gt;
&lt;td&gt;🏆 9/9&lt;/td&gt;
&lt;td&gt;✅ 7/8&lt;/td&gt;
&lt;td&gt;✅ 5/7&lt;/td&gt;
&lt;td&gt;🏆 8/8&lt;/td&gt;
&lt;td&gt;✅ 6/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-35B-A3B&lt;/td&gt;
&lt;td&gt;🏆 9/9&lt;/td&gt;
&lt;td&gt;✅ 7/8&lt;/td&gt;
&lt;td&gt;⚠️ 4/7&lt;/td&gt;
&lt;td&gt;✅ 7/8&lt;/td&gt;
&lt;td&gt;⚠️ 4/7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Parameter Count Isn't Everything
&lt;/h3&gt;

&lt;p&gt;The 35B MoE model (Qwen3.5-35B-A3B) scored &lt;strong&gt;last at 79%&lt;/strong&gt;, while the 8B model (Qwen3-8B Q8) achieved &lt;strong&gt;92% in 3rd place&lt;/strong&gt;. For agent tasks, tool-use capability and instruction following matter more than raw parameter count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personally, I suspect dense (full-weight) models handle the long tool chains agents require better than MoE models, though this is unverified.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Quantization Affects Agent Quality
&lt;/h3&gt;

&lt;p&gt;Comparing Qwen3-8B Q8 vs Q4: the Q4 variant exhibited &lt;strong&gt;tool call repetition loops&lt;/strong&gt; — it repeated the same &lt;code&gt;df -h &amp;amp;&amp;amp; free -h&lt;/code&gt; command 6 times in U03. This suggests that tool chaining stability is sensitive to quantization levels.&lt;/p&gt;
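&lt;p&gt;One possible mitigation (not part of the benchmark harness, just a sketch) is a repetition guard on the tool-call history:&lt;/p&gt;

```python
def is_looping(call_history, window=3):
    # Detect the repetition failure mode described above: the same tool call
    # issued `window` times in a row. Hypothetical guard for illustration.
    recent = list(call_history)[-window:]
    return len(recent) == window and len(set(recent)) == 1
```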

&lt;h3&gt;
  
  
  3. Speed vs. Accuracy Trade-offs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-oss-20B&lt;/strong&gt;: Fastest (264s) AND most accurate (95%) — clear winner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3.5-27B&lt;/strong&gt;: Tied accuracy but 4× slower — for when depth matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-8B Q8&lt;/strong&gt;: Best performance-per-parameter — recommended for resource-limited environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. "Chain Completion" Is the Key Differentiator
&lt;/h3&gt;

&lt;p&gt;Most models perform well on intermediate steps (searching, analyzing), but the real differentiation occurs at the &lt;strong&gt;end of the chain&lt;/strong&gt; — sending emails, saving files, and registering automated tasks. Qwen3.5-35B-A3B was notably weak at these final steps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing an LLM for a local AI agent requires evaluating not just benchmark scores, but &lt;strong&gt;tool chaining completion rate, instruction adherence, and response speed&lt;/strong&gt; together.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🏆 &lt;strong&gt;Best overall&lt;/strong&gt;: GPT-oss-20B (speed + accuracy leader)&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Best value&lt;/strong&gt;: Qwen3-8B Q8 (92% with just 8B parameters at 377s)&lt;/li&gt;
&lt;li&gt;🔬 &lt;strong&gt;Deepest analysis&lt;/strong&gt;: Qwen3.5-27B (4 of 5 tests PERFECT, the most of any model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Test code and full results are available at &lt;a href="//./test_ultimate_extreme.py"&gt;tests/test_ultimate_extreme.py&lt;/a&gt; and &lt;a href="//./model_benchmark.md"&gt;tests/model_benchmark.md&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Designing a Tool Architecture for AI Agents — Base Tools, Toolkits, and Dynamic Routing</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Fri, 27 Feb 2026 17:00:26 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/designing-a-tool-architecture-for-ai-agents-base-tools-toolkits-and-dynamic-routing-fdo</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/designing-a-tool-architecture-for-ai-agents-base-tools-toolkits-and-dynamic-routing-fdo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How do you give an AI agent 30+ tools without drowning the context window? Include everything and you waste tokens. Be selective and the agent can't do its job. Here's how I solved it with a 3-layer architecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem: Too Many Tools
&lt;/h2&gt;

&lt;p&gt;As your AI agent grows, so does its toolbox. My personal assistant now has 35+ tools — web search, email, calendar, weather, Git, host PC control, file management, code execution, and more.&lt;/p&gt;

&lt;p&gt;Sending all 35 tool schemas to the LLM on every request causes two problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token cost explosion&lt;/strong&gt;: 35 JSON function schemas easily consume 3,000+ tokens per turn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selection accuracy drops&lt;/strong&gt;: The more tools available, the more likely the LLM picks the wrong one&lt;/li&gt;
&lt;/ol&gt;
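&lt;p&gt;The 3,000-token figure is easy to sanity-check with the common ~4-characters-per-token heuristic. The schemas below are invented for illustration; real counts depend on the model's tokenizer:&lt;/p&gt;

```python
import json

def estimate_schema_tokens(schemas):
    # Rough cost of shipping tool schemas on every turn, using the
    # ~4-characters-per-token heuristic. Illustration only.
    return len(json.dumps(schemas)) // 4

# 35 hypothetical schemas of typical size (~400 chars each once serialized)
schemas = [{"type": "function",
            "function": {"name": f"tool_{i}",
                         "description": "x" * 160,
                         "parameters": {"type": "object",
                                        "properties": {"arg": {"type": "string",
                                                               "description": "y" * 80}}}}}
           for i in range(35)]

print(estimate_schema_tokens(schemas))  # already past the 3,000-token mark
```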

&lt;p&gt;But if you trim tools aggressively, the agent can't handle requests it should be able to.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: 3-Layer Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         ┌───────────────┐
         │  User Input    │
         └───────┬───────┘
                 ↓
┌────────────────────────────────────┐
│          Tool Registry             │
│                                    │
│  ┌────────────┐  ┌──────────────┐  │
│  │ Base Tools │  │  Toolkits    │  │
│  │ (always ON)│  │(dynamic load)│  │
│  │ 13 general │  │  8 packs     │  │
│  └──────┬─────┘  └──────┬───────┘  │
│         │               │          │
│         │         ┌─────┴──────┐   │
│         │         │   Tasks    │   │
│         │         │(individual)│   │
│         │         └─────┬──────┘   │
│         └───────┬───────┘          │
│                 ↓                  │
│     ┌───────────────────────┐      │
│     │ Selected Tools → LLM  │      │
│     └───────────────────────┘      │
└────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 1: Base Tools — Always Included
&lt;/h3&gt;

&lt;p&gt;13 general-purpose tools that could be needed for any request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web_search, read_file, write_file, list_files,
run_command, get_datetime, calculate, run_python_code,
pip_install, recall, forget,
host_list_files, vm_to_host, host_to_vm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Web search, file I/O, code execution, memory (recall/forget) — these are universal. Included in every LLM call regardless of the user's request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Toolkits — Domain-Specific Tool Packs
&lt;/h3&gt;

&lt;p&gt;Related tools grouped into packs, defined as JSON files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toolkits/
├── calendar.json    # create_event, list_events, update_event, delete_event
├── contacts.json    # find_contact
├── email.json       # send_email, read_email, search_email
├── git.json         # git_clone, git_status, git_commit, git_push
├── host_pc.json     # host_open_url, host_open_app, host_find_file, ...
├── meta.json        # help, show_config, health
├── scheduler.json   # create_task, list_tasks, cancel_task
└── weather.json     # weather
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each toolkit JSON contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"free"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Weather and forecast information..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"날씨"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"temperature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rain"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"umbrella"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tasks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get weather info..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;description&lt;/strong&gt;: Used for embedding similarity matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;keywords&lt;/strong&gt;: Fast keyword-based activation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tasks&lt;/strong&gt;: Actual OpenAI function calling schemas sent to the LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 3: Tasks — Individual Tool Functions
&lt;/h3&gt;

&lt;p&gt;A Task is a single tool function inside a Toolkit. The &lt;code&gt;weather&lt;/code&gt; toolkit has 1 task (&lt;code&gt;weather()&lt;/code&gt;), while &lt;code&gt;calendar&lt;/code&gt; has 4 (&lt;code&gt;create_event&lt;/code&gt;, &lt;code&gt;list_events&lt;/code&gt;, &lt;code&gt;update_event&lt;/code&gt;, &lt;code&gt;delete_event&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Dynamic Routing: Which Toolkits to Activate?
&lt;/h2&gt;

&lt;p&gt;The key question: given a user input, which toolkits are relevant?&lt;/p&gt;

&lt;h3&gt;
  
  
  Two-Stage Matching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;select_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_tools&lt;/span&gt;  &lt;span class="c1"&gt;# always included
&lt;/span&gt;
    &lt;span class="c1"&gt;# Stage 1: Keyword matching (fast, certain)
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_toolkits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;

    &lt;span class="c1"&gt;# Stage 2: Embedding similarity (catches what keywords miss)
&lt;/span&gt;    &lt;span class="n"&gt;input_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;remaining_toolkits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;cosine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;selected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Stage 1: Keyword Matching
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;"What's the weather today?" → &lt;code&gt;"weather"&lt;/code&gt; keyword → weather toolkit activated&lt;/li&gt;
&lt;li&gt;"Open Chrome" → &lt;code&gt;"열어줘"&lt;/code&gt; (Korean "open") keyword → host_pc toolkit activated&lt;/li&gt;
&lt;li&gt;Fast and precise, but limited coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Stage 2: Embedding Similarity (BGE-M3)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;"Should I bring an umbrella?" → No keyword match, but semantically similar to weather toolkit → activated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold: 0.40&lt;/strong&gt; (prioritize recall — better to include extra tools than miss needed ones)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model: BGE-M3&lt;/strong&gt; (multilingual, runs locally via Ollama)&lt;/li&gt;
&lt;/ul&gt;
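&lt;p&gt;The matching step itself is plain cosine similarity over the pre-computed vectors. A dependency-free sketch (BGE-M3 produces the vectors; this shows only the comparison):&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between the user-input embedding and a toolkit's
    # pre-computed description embedding.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.40  # recall-leaning: extra toolkits are cheaper than a missed one
```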

&lt;h3&gt;
  
  
  Pre-computed Embeddings
&lt;/h3&gt;

&lt;p&gt;At server startup, all toolkit descriptions are embedded once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_toolkits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# At request time, only the user input needs embedding (1 API call)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "If it rains tomorrow, plan an indoor workout and add it to my calendar"

[ToolRouter] 18/35 tools | activated: [weather(keyword:1.00), calendar(embed:0.52)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"rain"&lt;/code&gt; → weather toolkit via keyword&lt;/li&gt;
&lt;li&gt;"add to calendar" → calendar toolkit via embedding (similarity 0.52)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Base 13 + weather 1 + calendar 4 = 18 tools&lt;/strong&gt; sent to LLM. The other 17 tools (git, email, contacts, etc.) are excluded → &lt;strong&gt;token savings + better accuracy&lt;/strong&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Decision 1: Why not send all tools every time?
&lt;/h3&gt;

&lt;p&gt;With 35+ tool schemas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token cost increases (inference cost + response latency)&lt;/li&gt;
&lt;li&gt;LLM confuses similar tools (&lt;code&gt;send_email&lt;/code&gt; vs &lt;code&gt;host_run_command&lt;/code&gt; for sending mail)&lt;/li&gt;
&lt;li&gt;Especially severe with smaller models (8B parameters)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Decision 2: Why not use embeddings only?
&lt;/h3&gt;

&lt;p&gt;Embedding-only approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Even obvious keywords like "weather" require an embedding API call (unnecessary latency)&lt;/li&gt;
&lt;li&gt;If the embedding server goes down, everything breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;Keywords first + embeddings as fallback&lt;/strong&gt; is the optimal two-stage design&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 3: What threshold for similarity?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;0.60: Precise, but misses relevant toolkits&lt;/li&gt;
&lt;li&gt;0.40: May over-activate, but rarely misses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall is the priority&lt;/strong&gt;: a few extra tools in the context are relatively harmless (the LLM usually ignores them), but missing a needed tool means the agent simply can't do its job&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Tool management for AI agents comes down to one question: &lt;strong&gt;"Which tools should the LLM see for this specific request?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 3-layer answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Tools&lt;/strong&gt;: Universal → always ON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Toolkits&lt;/strong&gt;: Domain packs → dynamically activated via keyword + embedding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt;: Individual functions inside toolkits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture lets "What's the weather?" include only the weather task, while "Commit my code" includes only git tasks — saving tokens and improving accuracy across the board.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Xoul, AI Agent : Optimizing Web Search — From 20s to 8s</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Thu, 26 Feb 2026 13:58:24 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-ai-agent-optimizing-web-search-from-20s-to-8s-37k3</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/xoul-ai-agent-optimizing-web-search-from-20s-to-8s-37k3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How we optimized a local AI agent's web search by finding the real bottleneck.&lt;br&gt;
A story of missteps, failed experiments, and eventually finding the true cause.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"It's a simple detail in retrospect, but I had an issue with my web search pipeline. Passing raw search results meant feeding thousands of tokens into the next input, so I added a very small model to summarize the data first. However, because Ollama defaults to sequential execution, the process took 3x longer. It seems obvious now, but it was a major bottleneck until I profiled it! I also disabled non-text rendering (images, CSS, etc.) as a further optimization."&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Xoul is a local AI agent that answers user queries by searching the web. The pipeline is: search → visit URLs → extract text → generate LLM response. A simple question like "How many pages is [book title]?" was either &lt;strong&gt;failing completely or taking 20+ seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
Problem: Single URL Fetch = High Failure Rate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Symptoms
&lt;/h3&gt;

&lt;p&gt;Only visiting 1 search result URL meant that if that site was slow or JS-heavy, the entire search failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix: Parallel 3-URL Fetch
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;concurrent.futures.ThreadPoolExecutor&lt;/code&gt; to fetch the top 3 URLs simultaneously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(_fetch_one, url): url for url in fetch_urls}
    try:
        for future in as_completed(futures, timeout=30):
            result = future.result()
            if result:
                fetched.append(result)
    except FuturesTimeout:  # plain TimeoutError only catches this on Python 3.11+
        # Keep whatever completed (and didn't raise), discard the rest
        for future in futures:
            if future.done() and not future.exception():
                fetched.append(future.result())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Only 1 of 3 needs to succeed. Dramatically reduced failure rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failed Experiment: Host-Side Browser Daemon
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hypothesis
&lt;/h3&gt;

&lt;p&gt;"Chrome on the Windows HOST (full CPU/GPU) should be faster than Chromium in the VM (limited resources)."&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;Created &lt;code&gt;browser_daemon_host.py&lt;/code&gt; to run Chrome headless on Windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Served on PORT 9224&lt;/li&gt;
&lt;li&gt;VM accesses via &lt;code&gt;10.0.2.2:9224&lt;/code&gt; (QEMU gateway to host)&lt;/li&gt;
&lt;li&gt;Chrome CDP with &lt;code&gt;/json/new&lt;/code&gt; + &lt;code&gt;Page.navigate&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
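&lt;p&gt;For reference, a hedged sketch of how the VM side could reach that endpoint. Port 9224 and the &lt;code&gt;10.0.2.2&lt;/code&gt; gateway are from the setup above; the helper name and the &lt;code&gt;requests&lt;/code&gt; usage are illustrative, and note that recent Chrome versions require &lt;code&gt;PUT&lt;/code&gt; rather than &lt;code&gt;GET&lt;/code&gt; for &lt;code&gt;/json/new&lt;/code&gt;:&lt;/p&gt;

```python
from urllib.parse import quote

HOST_CDP = "http://10.0.2.2:9224"  # QEMU gateway to the Windows host

def new_tab_endpoint(url):
    """Build the DevTools /json/new endpoint that opens `url` in a new tab."""
    return f"{HOST_CDP}/json/new?{quote(url, safe='')}"

# Hypothetical usage (requires the host daemon to be running):
#   import requests
#   tab = requests.put(new_tab_endpoint("https://example.com"), timeout=5).json()
#   tab["webSocketDebuggerUrl"] then accepts CDP commands such as Page.navigate.
print(new_tab_endpoint("https://example.com"))
```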

&lt;h3&gt;
  
  
  Single URL Fetch Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;HOST Chrome&lt;/th&gt;
&lt;th&gt;VM Chromium&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single URL fetch&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Difference&lt;/td&gt;
&lt;td&gt;0.5s faster&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  But...
&lt;/h3&gt;

&lt;p&gt;End-to-end API test results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With HOST Chrome: &lt;strong&gt;20.0s&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;VM Chromium only: &lt;strong&gt;20.4s&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;0.4s difference.&lt;/strong&gt; Why? Browser fetch is only ~2s of the total 20s. The remaining 18s was spent elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Host daemon added complexity with negligible benefit. &lt;strong&gt;Scrapped.&lt;/strong&gt; A classic case of optimizing the wrong bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the Real Bottleneck: 3x Serial LLM Summarization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code Analysis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web_search("book page count")
 ├─ DDG search                        ~1s
 ├─ 3 URL fetch (parallel)            ~2s
 ├─ 3 LLM summaries (qwen3:0.6b)     ~12s ← HERE!
 │   ├─ URL1 → _summarize_with_llm    ~4s
 │   ├─ URL2 → _summarize_with_llm    ~4s
 │   └─ URL3 → _summarize_with_llm    ~4s
 └─ Final LLM response (gpt-oss:20b)  ~5s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After fetching each page, &lt;code&gt;tool_fetch_url&lt;/code&gt; calls &lt;code&gt;qwen3:0.6b&lt;/code&gt; to summarize the raw HTML into structured data (title, author, page count, ISBN, etc.). This was designed to save tokens for the main LLM, but had critical issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;3 summaries run serially&lt;/strong&gt; — Ollama defaults to processing 1 request at a time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model swapping&lt;/strong&gt; — switching between &lt;code&gt;gpt-oss:20b&lt;/code&gt; and &lt;code&gt;qwen3:0.6b&lt;/code&gt; adds overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU bandwidth sharing&lt;/strong&gt; — even with "parallel" requests, they share the same GPU&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why Not Just Truncate?
&lt;/h3&gt;

&lt;p&gt;We considered removing LLM summarization and just truncating text to 3000 chars. However, prior testing showed important data (like page count buried deep in the page) could be lost. The 0.6b model intelligently extracts structured info regardless of position in the page.&lt;/p&gt;
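&lt;p&gt;A toy illustration of the failure mode (the page content is synthetic; only the 3000-char cutoff comes from the actual design):&lt;/p&gt;

```python
# A fake page where the fact we need sits past the 3000-char cutoff.
filler = "nav links, scripts, reviews... " * 120   # ~3,700 chars of noise
page = filler + "Pages: 188"

truncated = page[:3000]
print("Pages: 188" in truncated)  # False: the fact was cut off
print("Pages: 188" in page)       # True: a position-agnostic extractor keeps it
```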

&lt;h3&gt;
  
  
  Solution: Ollama Parallel + Multi-Model Config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set before starting Ollama in launcher.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_NUM_PARALLEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4"&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="c"&gt;# Handle 4 concurrent requests&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_MAX_LOADED_MODELS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c"&gt;# Keep 3 models in VRAM simultaneously&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps all models resident in VRAM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gpt-oss:20b&lt;/code&gt; (17GB) — main conversation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen3:0.6b&lt;/code&gt; (8GB) — web page summarization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bge-m3&lt;/code&gt; (1.2GB) — memory embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Zero model swapping, true parallel processing.&lt;/strong&gt;&lt;/p&gt;
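&lt;p&gt;With parallel requests allowed, the summarization step itself can be dispatched concurrently. A sketch with the LLM call injected as a plain function (in Xoul it would post to Ollama's &lt;code&gt;/api/generate&lt;/code&gt;; the stand-in below just uppercases):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_all(pages, summarize, workers=3):
    """Summarize fetched pages concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize, pages))

# Toy stand-in for the qwen3:0.6b call:
print(summarize_all(["page one", "page two"], lambda p: p.upper()))
```

&lt;p&gt;Three ~4s summaries submitted together finish in roughly the time of the slowest one, rather than ~12s end to end.&lt;/p&gt;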

&lt;p&gt;Additional daemon improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ThreadingHTTPServer&lt;/code&gt; for concurrent request handling&lt;/li&gt;
&lt;li&gt;Chromium flags to disable images, fonts, CSS, plugins — optimized for text extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Query: "마음 소화제 뻥뻥수 페이지 수 알려줘" (Book page count query)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Web search&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;URL fetch (1→3 parallel)&lt;/td&gt;
&lt;td&gt;~5s&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM summarization (serial→parallel)&lt;/td&gt;
&lt;td&gt;~12s&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final response&lt;/td&gt;
&lt;td&gt;~5s&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~20s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~8s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Run 1: 8.5s → "마음 소화제 뻥뻥수의 페이지 수는 188쪽입니다."
Run 2: 8.2s → "마음 소화제 뻥뻥수는 188쪽입니다. (source: yes24.com)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2.5x faster. Accurate answer with source.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure first, optimize later&lt;/strong&gt; — We spent hours building a HOST Chrome daemon for a 0.5s improvement on a 2s step, while the real 12s bottleneck was elsewhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The bottleneck is never where you expect&lt;/strong&gt; — "The browser is slow" was our assumption. "Serial LLM calls are slow" was the reality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One config line beats 300 lines of code&lt;/strong&gt; — &lt;code&gt;OLLAMA_NUM_PARALLEL=4&lt;/code&gt; solved what &lt;code&gt;browser_daemon_host.py&lt;/code&gt; (300 lines) couldn't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure reliability &amp;gt; speed&lt;/strong&gt; — If the browser daemon is dead, speed doesn't matter. Auto-starting the daemon from three places (deploy, launcher, and VM boot) was the most impactful change.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cold start matters&lt;/strong&gt; — First test after Ollama restart showed 23.6s (model loading). Second run: 8.5s. Always warm up before benchmarking.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Files Changed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;browser_daemon.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ThreadingHTTPServer, disable images/fonts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tools/web_tools.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parallel 3-URL fetch, TimeoutError handling, SSH fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scripts/deploy.ps1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browser daemon auto enable+start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scripts/launcher.ps1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browser health check, Ollama parallel config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_manager.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browser daemon start on VM boot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>Solving Key Collisions in LLM Memory — Embedding Model Swap and Hybrid Normalization</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Wed, 25 Feb 2026 15:08:38 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/solving-key-collisions-in-llm-memory-embedding-model-swap-and-hybrid-normalization-fih</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/solving-key-collisions-in-llm-memory-embedding-model-swap-and-hybrid-normalization-fih</guid>
      <description>&lt;p&gt;LLM &lt;code&gt;gpt-oss:20b&lt;/code&gt; · Embedding &lt;code&gt;bge-m3&lt;/code&gt; (1024-dim) · SQLite · 3-Tier Memory (STM/MTM/LTM)&lt;/p&gt;




&lt;h2&gt;
  
  
  Origin
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;@signalstack&lt;/code&gt; commented on the previous post about key collisions, and reviewing that feedback (thanks to them, we caught critical bugs) surfaced three issues in the memory system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deadlock&lt;/strong&gt; — Server hangs permanently during memory save operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key mismatch&lt;/strong&gt; — "favorite food" and "most favorite food" stored as separate entries, leaving stale values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key collision&lt;/strong&gt; — Saving "dog name: Choco" overwrites "cat name: Nabi" because the embedding model can't distinguish them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post documents how we diagnosed these issues, what experiments we ran, and how we fixed them.&lt;/p&gt;

&lt;p&gt;In a 3-tier memory system, when a user says &lt;strong&gt;"My hobby is hiking"&lt;/strong&gt; then later &lt;strong&gt;"I switched to pottery"&lt;/strong&gt; — what happens?&lt;/p&gt;

&lt;p&gt;Facts first enter MTM (mid-term memory). After repeated access, they promote to LTM (long-term memory). The problem: during promotion, &lt;strong&gt;old values can conflict with new ones&lt;/strong&gt;. &lt;code&gt;ON CONFLICT(key) DO UPDATE&lt;/code&gt; should handle this, but when the LLM extracts the same concept under &lt;strong&gt;different keys&lt;/strong&gt;, the upsert misses entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1st: "favorite_food|pizza|MID"         → saved to MTM
2nd: "most_favorite_food|sushi|MID"    → different key → saved as separate entry!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We designed 10 scenarios to verify this, capturing LTM/MTM database state after every test.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Broke First
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Deadlock: Server hangs forever
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;remember_mtm()&lt;/code&gt; → &lt;code&gt;remember()&lt;/code&gt; call chain acquired the same &lt;code&gt;threading.Lock()&lt;/code&gt; twice → permanent deadlock. Fixed: &lt;code&gt;Lock()&lt;/code&gt; → &lt;code&gt;RLock()&lt;/code&gt;.&lt;/p&gt;
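&lt;p&gt;A minimal reproduction of the fix (the function bodies are illustrative; only the names and the &lt;code&gt;Lock()&lt;/code&gt; → &lt;code&gt;RLock()&lt;/code&gt; change mirror the article):&lt;/p&gt;

```python
import threading

_lock = threading.RLock()  # was threading.Lock() -> nested acquire deadlocked
store = {}

def remember(key, value):
    with _lock:                 # second acquisition by the same thread: OK with RLock
        store[key] = value

def remember_mtm(key, value):
    with _lock:                 # first acquisition
        remember(key, value)    # nested call re-acquires the same lock

remember_mtm("hobby", "hiking")
print(store)  # {'hobby': 'hiking'}
```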

&lt;h3&gt;
  
  
  2. Key Inconsistency: Stale food kept appearing
&lt;/h3&gt;

&lt;p&gt;The LLM extracted &lt;code&gt;좋아하는 음식&lt;/code&gt; (favorite food) and &lt;code&gt;가장 좋아하는 음식&lt;/code&gt; (most favorite food) as different keys → &lt;code&gt;ON CONFLICT(key)&lt;/code&gt; missed → pizza never became doenjang-jjigae.&lt;/p&gt;




&lt;h2&gt;
  
  
  Embedding Model Shootout
&lt;/h2&gt;

&lt;p&gt;To normalize keys, we need to answer: "Are these two keys the same concept?" We tested 4 models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key Pair&lt;/th&gt;
&lt;th&gt;all-minilm&lt;/th&gt;
&lt;th&gt;nomic-embed&lt;/th&gt;
&lt;th&gt;BGE-M3&lt;/th&gt;
&lt;th&gt;Qwen3-Embed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;favorite food ↔ most favorite food&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.92&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;name ↔ cat's name&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.69&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hobby ↔ blood type&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.00&lt;/strong&gt; ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.00&lt;/strong&gt; ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.40&lt;/strong&gt; ✅&lt;/td&gt;
&lt;td&gt;0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat's name ↔ dog's name&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.00&lt;/strong&gt; ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.00&lt;/strong&gt; ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.87&lt;/strong&gt; ✅&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.91&lt;/strong&gt; ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;all-minilm and nomic-embed &lt;strong&gt;couldn't distinguish short Korean words at all&lt;/strong&gt; (hobby vs blood type = 1.0). Only BGE-M3 separated them correctly.&lt;/p&gt;

&lt;p&gt;We also switched from embedding &lt;code&gt;key: value&lt;/code&gt; pairs to &lt;strong&gt;key-only embedding&lt;/strong&gt;. Since keyword matching already covers value search during recall, key-only embedding is more efficient, and it lets key normalization reuse the vectors already stored in the DB at no extra embedding cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Approach
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;BGE-M3 (threshold 0.9) + substring fallback hybrid&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: cosine_similarity(new_key, existing_key) ≥ 0.9 → match
Step 2: substring containment + length ratio ≥ 60% → match
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reuses key-only embeddings stored in DB — &lt;strong&gt;only 1 extra Ollama call per new key&lt;/strong&gt;.&lt;/p&gt;
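&lt;p&gt;A sketch of the hybrid matcher. Step 1 would compare stored BGE-M3 vectors; here the similarity is passed in precomputed so that Step 2 can be shown end to end (the 0.9 and 60% thresholds come from the design above):&lt;/p&gt;

```python
def keys_match(new_key, existing_key, cos_sim=0.0):
    """Decide whether two memory keys refer to the same concept."""
    # Step 1: embedding similarity (cos_sim precomputed from stored vectors)
    if cos_sim >= 0.9:
        return True
    # Step 2: substring containment + length-ratio fallback
    shorter, longer = sorted([new_key, existing_key], key=len)
    return shorter in longer and len(shorter) / len(longer) >= 0.6

print(keys_match("favorite food", "most favorite food", cos_sim=0.92))  # True (Step 1)
print(keys_match("favorite food", "most favorite food"))                # True (Step 2)
print(keys_match("cat name", "dog name", cos_sim=0.87))                 # False
```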




&lt;h2&gt;
  
  
  10 Test Cases in Detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MC01: Hobby Change — Hiking → Pottery
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Basic value replacement works correctly&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC01a&lt;/td&gt;
&lt;td&gt;"My hobby is hiking"&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hobby: hiking&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC01b&lt;/td&gt;
&lt;td&gt;"I switched to pottery"&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hobby: pottery&lt;/code&gt; (access=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC01 check&lt;/td&gt;
&lt;td&gt;"What's my hobby?"&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hobby: pottery&lt;/code&gt; ← promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Pottery" — upsert in MTM, then normal promotion to LTM&lt;/p&gt;




&lt;h3&gt;
  
  
  MC02: Job Change — Teacher → Programmer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Clear factual replacement&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC02a&lt;/td&gt;
&lt;td&gt;"I'm a teacher"&lt;/td&gt;
&lt;td&gt;hobby: pottery&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;job: teacher&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC02b&lt;/td&gt;
&lt;td&gt;"I quit, now a programmer"&lt;/td&gt;
&lt;td&gt;hobby: pottery&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;job: programmer&lt;/code&gt; (access=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC02 check&lt;/td&gt;
&lt;td&gt;"What's my job?"&lt;/td&gt;
&lt;td&gt;hobby, &lt;strong&gt;job: programmer&lt;/strong&gt; ← promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Programmer" — previous value completely replaced&lt;/p&gt;




&lt;h3&gt;
  
  
  MC03: Food Preference 3x Change — Pizza → Sushi → Doenjang-jjigae
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Only latest value survives after 3 changes&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC03a&lt;/td&gt;
&lt;td&gt;"I love pizza"&lt;/td&gt;
&lt;td&gt;hobby, job&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite food: pizza&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC03b&lt;/td&gt;
&lt;td&gt;"Changed to sushi"&lt;/td&gt;
&lt;td&gt;hobby, job&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite food: sushi&lt;/code&gt; (access=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC03c&lt;/td&gt;
&lt;td&gt;"Actually doenjang-jjigae"&lt;/td&gt;
&lt;td&gt;hobby, job&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite food: doenjang-jjigae&lt;/code&gt; (access=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC03 check&lt;/td&gt;
&lt;td&gt;"Favorite food?"&lt;/td&gt;
&lt;td&gt;hobby, job, &lt;strong&gt;food: doenjang-jjigae&lt;/strong&gt; ← promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Doenjang-jjigae" — access_count ≥ 2 triggers auto-promotion, latest value only&lt;/p&gt;




&lt;h3&gt;
  
  
  MC04: Natural Correction — "Yoga was just a phase, now it's boxing"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Correction without explicit "update" command&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC04a&lt;/td&gt;
&lt;td&gt;"I do yoga"&lt;/td&gt;
&lt;td&gt;job, &lt;strong&gt;hobby: yoga&lt;/strong&gt;, food&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC04b&lt;/td&gt;
&lt;td&gt;"Yoga was brief, now boxing"&lt;/td&gt;
&lt;td&gt;job, &lt;strong&gt;hobby: boxing&lt;/strong&gt;, food&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC04 check&lt;/td&gt;
&lt;td&gt;"What exercise?"&lt;/td&gt;
&lt;td&gt;job, &lt;strong&gt;hobby: boxing&lt;/strong&gt;, food&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Boxing" — direct LTM update (existing &lt;code&gt;hobby&lt;/code&gt; key: &lt;code&gt;yoga&lt;/code&gt; → &lt;code&gt;boxing&lt;/code&gt;)&lt;/p&gt;




&lt;h3&gt;
  
  
  MC05: Address Change — Seoul → Busan
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Specific location data replacement&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC05a&lt;/td&gt;
&lt;td&gt;"Live in Seoul Gangnam"&lt;/td&gt;
&lt;td&gt;job, hobby, food&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;address: Seoul Gangnam&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC05b&lt;/td&gt;
&lt;td&gt;"Moved to Busan Haeundae"&lt;/td&gt;
&lt;td&gt;job, hobby, food&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;address: Busan Haeundae&lt;/code&gt; (access=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC05 check&lt;/td&gt;
&lt;td&gt;"Where do I live?"&lt;/td&gt;
&lt;td&gt;+&lt;strong&gt;address: Busan Haeundae&lt;/strong&gt; ← promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Busan Haeundae" — previous value completely removed&lt;/p&gt;




&lt;h3&gt;
  
  
  MC06: Pet Addition — Cat Nabi + Dog Choco (additive, not replacement)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: When adding, not replacing, both entries must survive. &lt;strong&gt;This was the key test&lt;/strong&gt; — all-minilm failed here because &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;cat's name&lt;/code&gt; had cosine similarity of 1.0.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC06a&lt;/td&gt;
&lt;td&gt;"I have a cat named Nabi"&lt;/td&gt;
&lt;td&gt;job, hobby, food, address&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cat name: Nabi&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC06b&lt;/td&gt;
&lt;td&gt;"Also adopted dog Choco"&lt;/td&gt;
&lt;td&gt;job, hobby, food, address&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cat name: Nabi&lt;/code&gt; (access=1), &lt;code&gt;dog name: Choco&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC06 check&lt;/td&gt;
&lt;td&gt;"Pet names?"&lt;/td&gt;
&lt;td&gt;+&lt;strong&gt;cat: Nabi&lt;/strong&gt;, &lt;strong&gt;dog: Choco&lt;/strong&gt; ← both promoted&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Cat Nabi, Dog Choco" — BGE-M3 correctly separated &lt;code&gt;cat name&lt;/code&gt; ≠ &lt;code&gt;dog name&lt;/code&gt; (cosine 0.87 &amp;lt; threshold 0.9)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Previous failure&lt;/strong&gt;: all-minilm gave cosine 1.0 for both keys, causing &lt;code&gt;name: Choco&lt;/code&gt; to overwrite &lt;code&gt;name: Nabi&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  MC07: Blood Type Correction — A → AB
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Explicit correction of wrong information&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC07a&lt;/td&gt;
&lt;td&gt;"Blood type A"&lt;/td&gt;
&lt;td&gt;+&lt;code&gt;blood type: A&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC07b&lt;/td&gt;
&lt;td&gt;"No wait, it's AB"&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;blood type: A&lt;/code&gt; → &lt;strong&gt;&lt;code&gt;AB&lt;/code&gt;&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC07 check&lt;/td&gt;
&lt;td&gt;"Blood type?"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;blood type: AB&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "AB" — forget + remember combo for direct LTM correction&lt;/p&gt;




&lt;h3&gt;
  
  
  MC08: Preference Reversal — Loves coffee → Quit coffee
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Not just value change but &lt;strong&gt;180° direction reversal&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC08a&lt;/td&gt;
&lt;td&gt;"Love coffee, drink daily"&lt;/td&gt;
&lt;td&gt;…7 items&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;coffee: daily&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC08b&lt;/td&gt;
&lt;td&gt;"Quit coffee, only tea now"&lt;/td&gt;
&lt;td&gt;…7 items&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;coffee: quit&lt;/code&gt;, &lt;code&gt;tea: drinks&lt;/code&gt;, &lt;code&gt;caffeine: none&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC08 check&lt;/td&gt;
&lt;td&gt;"Do I like coffee?"&lt;/td&gt;
&lt;td&gt;…+&lt;code&gt;coffee: quit&lt;/code&gt;, &lt;code&gt;tea: drinks&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Coffee quit, no caffeine. Drinks tea."&lt;/p&gt;




&lt;h3&gt;
  
  
  MC09: Cross-Session Correction — Summer → Autumn
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Can correct memories from a previous session&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC09a&lt;/td&gt;
&lt;td&gt;"Love summer"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite season: summer&lt;/code&gt; (access=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC09b&lt;/td&gt;
&lt;td&gt;"Actually prefer autumn now"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;favorite season: autumn&lt;/code&gt; (access=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC09 check&lt;/td&gt;
&lt;td&gt;"Favorite season?"&lt;/td&gt;
&lt;td&gt;…+&lt;strong&gt;&lt;code&gt;season: autumn&lt;/code&gt;&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Autumn" — recall → forget(summer) → remember(autumn) pattern&lt;/p&gt;




&lt;h3&gt;
  
  
  MC10: Multi-Attribute Change — Color red→blue, Number 7→13
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Multiple attributes changed at once without missing any&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;LTM&lt;/th&gt;
&lt;th&gt;MTM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC10a&lt;/td&gt;
&lt;td&gt;"Color red, number 7"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;color: red&lt;/code&gt;, &lt;code&gt;number: 7&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC10b&lt;/td&gt;
&lt;td&gt;"Color blue, number 13"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;color: blue&lt;/code&gt;, &lt;code&gt;number: 13&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC10 check&lt;/td&gt;
&lt;td&gt;"Color and number?"&lt;/td&gt;
&lt;td&gt;…existing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;color: blue&lt;/code&gt;, &lt;code&gt;number: 13&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ✅ "Blue + 13" — both attributes correctly replaced&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Scoreboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Before (all-minilm)&lt;/th&gt;
&lt;th&gt;After (Hybrid)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MC01&lt;/td&gt;
&lt;td&gt;Hobby: hiking → pottery&lt;/td&gt;
&lt;td&gt;❌ (stale data)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC02&lt;/td&gt;
&lt;td&gt;Job: teacher → programmer&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC03&lt;/td&gt;
&lt;td&gt;Food: 3x change&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC04&lt;/td&gt;
&lt;td&gt;Exercise: natural correction&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC05&lt;/td&gt;
&lt;td&gt;Address: moved&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC06&lt;/td&gt;
&lt;td&gt;Pets: cat + dog (additive)&lt;/td&gt;
&lt;td&gt;❌ (key collision)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC07&lt;/td&gt;
&lt;td&gt;Blood type: A → AB&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC08&lt;/td&gt;
&lt;td&gt;Coffee: preference reversal&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC09&lt;/td&gt;
&lt;td&gt;Season: cross-session&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC10&lt;/td&gt;
&lt;td&gt;Multi-attribute&lt;/td&gt;
&lt;td&gt;❌ (season conflict)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10/10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Improving Host↔VM File Transfer in a Local AI Agent — Smart Search + Deduplication</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Tue, 24 Feb 2026 13:39:25 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/improving-host-vm-file-transfer-in-a-local-ai-agent-smart-search-deduplication-59kn</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/improving-host-vm-file-transfer-in-a-local-ai-agent-smart-search-deduplication-59kn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Fixing SCP encoding crashes, adding fuzzy filename matching, MD5-based deduplication, and automatic VM file search in a QEMU-based AI agent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Problem Definition
&lt;/h2&gt;

&lt;p&gt;The agent 'Xoul' runs inside a QEMU Ubuntu VM. The reason is &lt;strong&gt;security&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you give an AI agent system tools like &lt;code&gt;run_command&lt;/code&gt; and &lt;code&gt;write_file&lt;/code&gt;, LLM hallucinations or prompt injection attacks could directly damage the host PC. Imagine &lt;code&gt;rm -rf /&lt;/code&gt; executing on your main machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Sandbox all agent system operations inside a QEMU VM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03u1m9zyax94zttrc0js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03u1m9zyax94zttrc0js.png" alt=" " width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the LLM executes &lt;code&gt;rm -rf /&lt;/code&gt;, only the VM is affected&lt;/li&gt;
&lt;li&gt;No direct host filesystem access — the only channel is the &lt;code&gt;share/&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;share/&lt;/code&gt; uses explicit SCP transfers only (no auto-mount)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture relies on two file transfer tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;host_to_vm&lt;/strong&gt;: Host &lt;code&gt;share/&lt;/code&gt; → VM &lt;code&gt;/root/share/&lt;/code&gt; (deliver user files to the agent)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vm_to_host&lt;/strong&gt;: VM → Host &lt;code&gt;share/&lt;/code&gt; (deliver agent results to the user)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both use SCP (Secure Copy Protocol) over SSH. In practice, several issues surfaced:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SCP transfers intermittently failing (cause unclear)&lt;/li&gt;
&lt;li&gt;When a user says "send me the report file," the tool fails unless the exact filename is provided&lt;/li&gt;
&lt;li&gt;Retrieving files from the VM requires knowing the full path&lt;/li&gt;
&lt;li&gt;Identical files transferred repeatedly with no deduplication&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Root Cause Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2-1. Encoding Crash — &lt;code&gt;UnicodeDecodeError&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;All three functions in &lt;code&gt;vm_manager.py&lt;/code&gt; (&lt;code&gt;ssh_exec&lt;/code&gt;, &lt;code&gt;scp_to_vm&lt;/code&gt;, &lt;code&gt;scp_from_vm&lt;/code&gt;) used &lt;code&gt;subprocess.run(text=True)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Problematic code
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssh_cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;text=True&lt;/code&gt; makes &lt;code&gt;subprocess&lt;/code&gt; decode the child's output as text (UTF-8 in this environment). When SSH output contains BOM bytes (0xFF, 0xFE) or binary data, decoding fails immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this error occurs in &lt;code&gt;subprocess.py&lt;/code&gt;'s &lt;code&gt;_readerthread&lt;/code&gt;, the captured output comes back as &lt;code&gt;None&lt;/code&gt;, so the real error is masked by a secondary failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SSH 오류: 'NoneType' object has no attribute 'strip']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ To the user, it just looks like "SCP doesn't work."&lt;/p&gt;

&lt;h3&gt;
  
  
  2-2. Exact Filename Required
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original code — exact match required
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_host_to_vm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vm_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;local_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SHARE_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← Must match exactly
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;scp_to_vm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vm_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users naturally say partial names like "report" or "result" in conversation. The LLM cannot guess exact filenames.&lt;/p&gt;

&lt;h3&gt;
  
  
  2-3. Full VM Path Required
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;vm_to_host&lt;/code&gt; requires the complete path (&lt;code&gt;/root/workspace/result.txt&lt;/code&gt;). Users have no way of knowing internal VM file paths.&lt;/p&gt;

&lt;h3&gt;
  
  
  2-4. No Deduplication
&lt;/h3&gt;

&lt;p&gt;Repeated requests for the same file trigger a full SCP transfer every time — unnecessary network I/O and SSH overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3-1. Encoding Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;# vm_manager.py — common to ssh_exec, scp_to_vm, scp_from_vm
  result = subprocess.run(
      ssh_cmd,
      capture_output=True,
&lt;span class="gd"&gt;-     text=True,
&lt;/span&gt;&lt;span class="gi"&gt;+     text=False,
&lt;/span&gt;      timeout=timeout,
  )
&lt;span class="gd"&gt;- output = result.stdout
&lt;/span&gt;&lt;span class="gi"&gt;+ output = result.stdout.decode("utf-8", errors="replace")
+ stderr = result.stderr.decode("utf-8", errors="replace")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;errors="replace"&lt;/code&gt; substitutes each undecodable byte with &lt;code&gt;�&lt;/code&gt;, so decoding never crashes, no matter what bytes SSH emits.&lt;/p&gt;
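&lt;p&gt;A minimal standalone demonstration of the difference, independent of the project's code:&lt;/p&gt;

```python
# BOM-prefixed bytes of the kind SSH/SCP can emit on Windows:
raw = b"\xff\xfehello"

# Strict decoding (what text=True effectively does) raises:
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes each bad byte with U+FFFD instead:
safe = raw.decode("utf-8", errors="replace")
```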

&lt;h3&gt;
  
  
  3-2. Smart File Search
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zrgkmso1q2m70ts1vx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zrgkmso1q2m70ts1vx9.png" alt=" " width="800" height="1030"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Host Side — &lt;code&gt;_find_file_in_share()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Case-insensitive partial matching across the entire &lt;code&gt;share/&lt;/code&gt; directory tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_find_file_in_share&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dirs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SHARE_DIR&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
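&lt;p&gt;A quick usage sketch, with the share root passed in explicitly instead of the module-level &lt;code&gt;SHARE_DIR&lt;/code&gt; so it runs standalone:&lt;/p&gt;

```python
import os
import tempfile

# Parameterized variant of the search above, for illustration only;
# the real function reads the module-level SHARE_DIR.
def find_file_in_share(share_dir, query):
    results = []
    q = query.lower()
    for root, _dirs, files in os.walk(share_dir):
        for f in files:
            if f.startswith("."):
                continue          # skip hidden files like .transfer_log.json
            if q in f.lower():
                results.append(os.path.join(root, f))
    return results

share = tempfile.mkdtemp()
for name in ("weekly_report_v2.pdf", "results.csv", ".transfer_log.json"):
    open(os.path.join(share, name), "w").close()

# "report" matches weekly_report_v2.pdf despite being a partial name:
hits = find_file_in_share(share, "report")
```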



&lt;h4&gt;
  
  
  VM Side — &lt;code&gt;_find_file_in_vm()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Runs the &lt;code&gt;find&lt;/code&gt; command over SSH to search common directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_find_file_in_vm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ssh_exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;find /root/share /root/workspace /root/.xoul/workspace &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-name &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; -type f 2&amp;gt;/dev/null | head -5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quiet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3-3. MD5 Hash-Based Deduplication
&lt;/h3&gt;

&lt;p&gt;Transfer history is stored in &lt;code&gt;share/.transfer_log.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"htv:test_htv.txt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a1b2c3d4e5f6..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"direction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"htv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"src"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;share&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;test_htv.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dst"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/root/share/test_htv.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-24T22:05:30"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an identical hash is detected, the transfer is skipped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Already up to date: test_htv.txt → /root/share/test_htv.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
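&lt;p&gt;The skip decision reduces to: hash the file content, compare against the logged hash, and transfer only on mismatch. A minimal sketch, assuming the log format shown above; &lt;code&gt;should_transfer&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import hashlib
import json
import os
import tempfile

def file_md5(path):
    """MD5 of file content, read in chunks so large files stay cheap."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def should_transfer(log_path, log_key, src_path):
    """Return True if src changed since the last logged transfer."""
    log = {}
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as f:
            log = json.load(f)
    current = file_md5(src_path)
    if log.get(log_key, {}).get("hash") == current:
        return False  # identical content: skip the SCP entirely
    log[log_key] = {"hash": current}
    with open(log_path, "w", encoding="utf-8") as f:
        json.dump(log, f)
    return True

# First transfer runs; an immediate retry of the same content is skipped.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "test_htv.txt")
with open(src, "w") as f:
    f.write("hello")
log = os.path.join(tmp, ".transfer_log.json")
first = should_transfer(log, "htv:test_htv.txt", src)
second = should_transfer(log, "htv:test_htv.txt", src)
```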






&lt;h2&gt;
  
  
  4. Testing &amp;amp; Verification
&lt;/h2&gt;

&lt;p&gt;All tests performed with actual SCP on a running VM.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HTV exact match&lt;/td&gt;
&lt;td&gt;&lt;code&gt;test_htv.txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Upload complete&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTV dedup&lt;/td&gt;
&lt;td&gt;Same file again&lt;/td&gt;
&lt;td&gt;Skip message&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTV partial match&lt;/td&gt;
&lt;td&gt;&lt;code&gt;htv&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Auto-find &lt;code&gt;test_htv.txt&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTV nonexistent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nonexistent.xyz&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Error + path hint&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VTH full path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/root/workspace/result.txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Download complete&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VTH filename only&lt;/td&gt;
&lt;td&gt;&lt;code&gt;todo.txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM &lt;code&gt;find&lt;/code&gt; search&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VTH nonexistent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;zzz_not_exist.bin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Error message&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSH encoding&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cat&lt;/code&gt; command (BOM)&lt;/td&gt;
&lt;td&gt;No crash&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;8/8 passed.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before vs After
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SSH encoding&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UnicodeDecodeError&lt;/code&gt; crash&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;errors="replace"&lt;/code&gt; — always safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File search&lt;/td&gt;
&lt;td&gt;Exact filename required&lt;/td&gt;
&lt;td&gt;Partial match auto-search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VM file access&lt;/td&gt;
&lt;td&gt;Full path required&lt;/td&gt;
&lt;td&gt;Filename-only &lt;code&gt;find&lt;/code&gt; search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate transfer&lt;/td&gt;
&lt;td&gt;Full SCP every time&lt;/td&gt;
&lt;td&gt;MD5 hash comparison → skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transfer history&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;JSON log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error messages&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NoneType has no attribute 'strip'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Specific guidance + share/ path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Lessons
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;subprocess.run(text=True)&lt;/code&gt; trap&lt;/strong&gt;: Python's &lt;code&gt;text=True&lt;/code&gt; is convenient but assumes UTF-8 output from external processes. SSH and SCP on Windows can inject BOM bytes (0xFF 0xFE), causing immediate crashes. &lt;code&gt;text=False&lt;/code&gt; + &lt;code&gt;decode(errors="replace")&lt;/code&gt; is the safe default.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Users don't know paths&lt;/strong&gt;: When building file management tools, assuming users know exact filenames or VM paths is dangerous. The pattern should be: partial match → candidate list → selection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hash-based deduplication&lt;/strong&gt;: Filenames can be the same with different content, or different with the same content. MD5 hash on actual content is the reliable comparison method.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JSON vs SQLite for logs&lt;/strong&gt;: For small-scale data like transfer logs (dozens of entries), JSON files are ideal — zero dependencies, easy debugging. SQLite shines for memory systems with thousands of entries requiring semantic search.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>linux</category>
      <category>security</category>
    </item>
    <item>
      <title>Designing a 3-Tier Memory System for a Local AI Agent — STM / MTM / LTM</title>
      <dc:creator>Kim Namhyun</dc:creator>
      <pubDate>Tue, 24 Feb 2026 12:59:36 +0000</pubDate>
      <link>https://dev.to/kim_namhyun_e7535f3dc4c69/designing-a-3-tier-memory-system-for-a-local-ai-agent-stm-mtm-ltm-1hni</link>
      <guid>https://dev.to/kim_namhyun_e7535f3dc4c69/designing-a-3-tier-memory-system-for-a-local-ai-agent-stm-mtm-ltm-1hni</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How we modeled human memory consolidation to build a robust memory pipeline for a 20B-parameter local LLM agent — and achieved 100% on a 53-question test suite.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Problem Definition
&lt;/h2&gt;

&lt;p&gt;We built a local AI agent (Androi) that needs to remember user facts across conversations.&lt;br&gt;&lt;br&gt;
When a user says "My name is Namhyun" or "My hobby is hiking," the agent must recall and use these facts in future sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The initial implementation&lt;/strong&gt; was simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User message → LLM extracts key|value → Stored directly in LTM (permanent)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Out of 53 end-to-end tests, &lt;strong&gt;8 failed&lt;/strong&gt;, and memory system issues were the root cause for most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key symptoms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BMI calculation couldn't find height data&lt;/li&gt;
&lt;li&gt;Salary information wasn't used for price comparisons&lt;/li&gt;
&lt;li&gt;After correcting a hobby, the agent recalled the old value&lt;/li&gt;
&lt;li&gt;Scheduler tool name mismatches (unrelated bug, also fixed)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Root Cause Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2-1. LTM Pollution
&lt;/h3&gt;

&lt;p&gt;Every extracted fact went directly to LTM, including transient junk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;weather|Seoul       ← one-time search request
time|8AM            ← scheduling parameter  
weather_alert|cancelled  ← task status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When LTM exceeded 30 entries, semantic filtering kicked in — but these junk entries diluted the signal, causing truly important memories (height, weight, salary) to be filtered out.&lt;/p&gt;
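A toy model of that dilution, with invented relevance scores (the real filter scores entries by embedding similarity to the current context):

```python
# Toy model of signal dilution in a top-k semantic filter. The scores
# are made up for illustration; the point is that a flood of mid-scoring
# junk entries crowds core facts out of the surviving slots.
def semantic_filter(scores, keep):
    """Keep the `keep` keys with the highest relevance score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:keep])

scores = {"height": 0.22, "weight": 0.25}  # core facts, weak context match
scores.update({f"weather_alert_{i}": 0.40 for i in range(10)})  # junk
kept = semantic_filter(scores, keep=5)
# Every surviving slot goes to a junk entry; height and weight are gone.
```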

&lt;h3&gt;
  
  
  2-2. Single-Character Key Filter Bug
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;extract_and_remember&lt;/code&gt; had a &lt;code&gt;len(key) &amp;lt; 2&lt;/code&gt; filter that &lt;strong&gt;dropped valid single-character Korean keys&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
"키" (height in Korean) is 1 character → extracted but never saved. Only "몸무게" (weight, 3 chars) survived.&lt;/p&gt;

&lt;h3&gt;
  
  
  2-3. auto_retrieve Injection Threshold
&lt;/h3&gt;

&lt;p&gt;When LTM exceeded 15 entries, the system switched to semantic matching. With 20 entries stored, a query like "Calculate my BMI" couldn't match "weight: 72kg" above the 0.3 cosine similarity threshold.&lt;/p&gt;
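The retrieval miss can be illustrated with toy vectors (these are made up; the real system uses LLM embeddings): the query and the fact share no surface terms, so their similarity can sit below the cutoff even though the fact is essential.

```python
import math

# Toy illustration of the auto_retrieve miss. The embedding vectors are
# invented; the mechanism -- a cosine threshold gating injection -- is
# the one described in the article.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0, 0.2]   # "Calculate my BMI" (toy embedding)
memory_vec = [0.1, 0.9, 0.3, 0.0]  # "weight: 72kg" (toy embedding)

THRESHOLD = 0.3
sim = cosine(query_vec, memory_vec)
inject = sim >= THRESHOLD  # False: the weight fact is never injected
```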

&lt;h3&gt;
  
  
  2-4. Architectural Gap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwuo7fdk5sray2kgqt24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwuo7fdk5sray2kgqt24.png" alt=" " width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;intended design&lt;/strong&gt; was STM→MTM→LTM natural promotion, but &lt;code&gt;extract_and_remember&lt;/code&gt; &lt;strong&gt;bypassed MTM entirely&lt;/strong&gt; and wrote directly to LTM. The promotion logic (&lt;code&gt;_try_promote_to_ltm&lt;/code&gt;) was effectively dead code.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Solution Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3-1. Immediate Bug Fixes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler tool name mismatch&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;schedule_task&lt;/code&gt; → &lt;code&gt;create_task&lt;/code&gt; (3 tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;web_search&lt;/code&gt; chosen over &lt;code&gt;find_contact&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Added "Do NOT use web_search for contacts" to tool description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;len(key) &amp;lt; 2&lt;/code&gt; filter&lt;/td&gt;
&lt;td&gt;Removed minimum key length requirement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;auto_retrieve threshold&lt;/td&gt;
&lt;td&gt;Increased 15 → 30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3-2. Architecture Refactoring — Priority-Based MTM→LTM Pipeline
&lt;/h3&gt;

&lt;p&gt;Inspired by human memory consolidation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hippocampus&lt;/strong&gt; = MTM — short-term memories consolidate to cortex through repetition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ebbinghaus Forgetting Curve&lt;/strong&gt; — unused memories naturally decay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emotional significance&lt;/strong&gt; — critical facts such as blood type and allergies are stored permanently after a single mention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00a3hi8mhvrtxvru9qns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00a3hi8mhvrtxvru9qns.png" alt=" " width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Priority Classification
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HIGH&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Immutable core identity&lt;/td&gt;
&lt;td&gt;Name, birthday, blood type, allergies&lt;/td&gt;
&lt;td&gt;Direct to LTM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mutable preferences/state&lt;/td&gt;
&lt;td&gt;Hobby, job, residence, salary&lt;/td&gt;
&lt;td&gt;MTM → promotion queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LOW&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transient/one-off info&lt;/td&gt;
&lt;td&gt;Weather, news, timestamps&lt;/td&gt;
&lt;td&gt;Not extracted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
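The classification in the table above can be sketched as a simple router (the keyword sets here are illustrative; the real system classifies via the LLM during extraction):

```python
# Sketch of the priority router from the table above. Key sets are
# illustrative assumptions, not the production classifier.
HIGH_KEYS = {"name", "birthday", "blood_type", "allergy"}  # immutable identity
LOW_KEYS = {"weather", "news", "time"}                     # transient, one-off

def classify(key):
    if key in HIGH_KEYS:
        return "HIGH"  # written directly to LTM
    if key in LOW_KEYS:
        return "LOW"   # not extracted at all
    return "MID"       # staged in MTM, promoted on repeated access
```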

&lt;h4&gt;
  
  
  Promotion &amp;amp; Expiry Flow
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7n842u14z4u57nfn6ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7n842u14z4u57nfn6ax.png" alt=" " width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;
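The promotion and expiry flow in the diagram can be sketched roughly as follows. The concrete thresholds (3 accesses to promote, 7-day expiry) are assumptions for illustration; the article does not state the real values.

```python
import time

# Sketch of MTM -> LTM consolidation plus Ebbinghaus-style decay.
PROMOTE_AFTER = 3             # repeated accesses consolidate to LTM
EXPIRE_AFTER = 7 * 24 * 3600  # unused MID facts decay after a week

class MemoryStore:
    def __init__(self):
        self.mtm = {}  # key -> {"value", "hits", "last_access"}
        self.ltm = {}  # permanent storage

    def touch(self, key, value, now=None):
        """Record a MID-priority fact; promote it once accessed enough."""
        now = now if now is not None else time.time()
        entry = self.mtm.setdefault(key, {"value": value, "hits": 0,
                                          "last_access": now})
        entry.update(value=value, last_access=now)
        entry["hits"] += 1
        if entry["hits"] >= PROMOTE_AFTER:  # hippocampus -> cortex
            self.ltm[key] = entry["value"]
            del self.mtm[key]

    def sweep(self, now=None):
        """Expire MTM entries that have not been accessed recently."""
        now = now if now is not None else time.time()
        stale = [k for k, e in self.mtm.items()
                 if now - e["last_access"] > EXPIRE_AFTER]
        for key in stale:
            del self.mtm[key]
```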




&lt;h2&gt;
  
  
  4. Testing &amp;amp; Validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4-1. Pass Rate Progression
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Round&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;Key Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;45/53 (85%)&lt;/td&gt;
&lt;td&gt;Original code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round 1&lt;/td&gt;
&lt;td&gt;48/53 (91%)&lt;/td&gt;
&lt;td&gt;Scheduler tool name fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round 2&lt;/td&gt;
&lt;td&gt;49/53 (92%)&lt;/td&gt;
&lt;td&gt;find_contact description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round 3&lt;/td&gt;
&lt;td&gt;51/53 (96%)&lt;/td&gt;
&lt;td&gt;LTM threshold + system prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round 4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53/53 (100%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Key filter fix + multi-chain guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Final (post-refactor)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53/53 (100%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MTM→LTM architecture, no regressions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  4-2. Test Coverage (53 tests)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A. Memory System&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;CRUD, cross-session, update, delete, implicit save&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B. Web Search&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Weather, exchange rate, news, restaurants, price, population&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C. Calculation + Memory&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Take-home pay, BMI, compound interest, unit conversion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D. Calendar&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;CRUD, memory-linked scheduling, event modification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E. File + Code&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;File CRUD, Python execution, result persistence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F. Email&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Inbox, send, search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G. Contacts&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Add, search, delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H. Scheduler&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Register, list, cancel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;I. Multi-Chain&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;3+ tool chains, corrections, weather→calendar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J. Edge Cases&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Hallucination prevention, tool re-call prevention, response speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  5. Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Final Scorecard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A. Memory System    ████████████████████ 10/10 (100%)
B. Web Search       ████████████████████ 6/6   (100%)
C. Calc + Memory    ████████████████████ 5/5   (100%)
D. Calendar         ████████████████████ 6/6   (100%)
E. File + Code      ████████████████████ 5/5   (100%)
F. Email            ████████████████████ 3/3   (100%)
G. Contacts         ████████████████████ 3/3   (100%)
H. Scheduler        ████████████████████ 3/3   (100%)
I. Multi-Chain      ████████████████████ 7/7   (100%)
J. Edge Cases       ████████████████████ 5/5   (100%)
────────────────────────────────────────────────────
Total               53/53 (100%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-based extraction needs guardrails.&lt;/strong&gt; Without priority classification, LLMs will extract "weather|Seoul" alongside "name|John" and pollute permanent memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Human memory models work for AI agents.&lt;/strong&gt; The hippocampus→cortex consolidation pattern (MTM→LTM promotion through repeated access) naturally filters for truly important information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single-character CJK key pitfall.&lt;/strong&gt; A &lt;code&gt;len(key) &amp;lt; 2&lt;/code&gt; filter designed for English silently drops valid single-character Korean keys like "키" (height). Length-based heuristics in multilingual systems deserve per-script review.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic matching has blind spots.&lt;/strong&gt; "Calculate my BMI" doesn't semantically match "weight: 72kg" above a 0.3 cosine threshold. For small memory sets, full injection is more reliable than semantic filtering.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
