Table of Contents
- Why I built it
- One-Click Setup
- Early Technical Decisions
- Architecture Overview
- QEMU VM — Giving the AI a Body
- Agent Loop — The 2-Phase Structure
- Multi-Model Tool Call Parser
- 3-Tier Memory
- Tool System and Self-Evolution
- Host Security — Tier 1/2 Model
- Desktop Client
- Multi-Client
- AI Evolving AI — The Test-Driven Evolution Loop
- Retrospective
1. Why I built it
My day job involves building simulation software at a semiconductor company. In early 2025, while developing "Vibe Simulation" software utilizing LLMs, I really started to feel the potential of AI agents.
Then, trying the Antigravity + Opus 4.6 combo at home totally blew my mind (I can't use it at work due to strict security policies...). Watching the AI write code, debug, and even seamlessly refactor made me wonder, "What if I expanded this into my daily life?" Combine that with the buzz around OpenClaw in the community—which the creator apparently built in just 10 days—and I was shocked once again.
The catch? OpenClaw is Mac-only, so I couldn't even try it. That got me thinking:
"The Lunar New Year holiday is pretty long... Why don't I build a local AI Agent for Windows users?"
I started this on February 13th, and it's been 9 days. This post is a record of the technical decisions and the struggles I went through during the process.
From the start, I laid down a few core principles:
- One-click setup. Users with no technical background should be able to set up the entire environment with a single click. They shouldn't have to open a terminal at all.
- Zero extra cost. The moment an API costs money, it becomes a massive barrier for average users. Everything must run entirely on local LLMs.
- Standard specs. It needs to run on 8GB of VRAM. I've personally verified it runs on an aging RTX 2080.
- Privacy. Emails, schedules, and personal files must never leave the machine.
- Memory. It must remember yesterday's conversation today, even if the session ends.
2. One-Click Setup
setup_env.ps1 (751 lines) is the longest script in this entire project. The user only needs to do three things:
- Double-click install.bat
- Select a model (choose from 1~7; it shows recommendations based on your GPU)
- Wait ☕
Everything else is fully automated:
nvidia-smi automatic detection: It reads your GPU VRAM to recommend a model. If you have 5GB, it suggests a Q4 model; 10GB gets Q8; 24GB gets a 30B model. If you just press Enter for the default (Option #1), it works even on 5GB of VRAM. Verified on an RTX 2080.
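As a sketch of that recommendation logic (the thresholds are taken from the text above, but the model names and the exact nvidia-smi parsing are my assumptions, not the real setup script):

```python
import subprocess

def detect_vram_gb() -> float:
    """Query total VRAM via nvidia-smi (MiB), convert to GiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"], text=True)
    return int(out.strip().splitlines()[0]) / 1024

def recommend_model(vram_gb: float) -> str:
    """Map detected VRAM to a model tier (names are illustrative)."""
    if vram_gb >= 24:
        return "qwen3:30b"      # 24GB-class cards fit a 30B model
    if vram_gb >= 10:
        return "qwen3-8b-q8"    # higher-precision Q8 quant
    return "qwen3-8b-q4"        # Q4 quant for 5GB-class cards
```

The setup script would call `detect_vram_gb()` once and preselect the matching option in the model menu.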
🎬 Video: One-Click Setup (install.bat → Model Selection → Full Process Timelapse)
3. Early Technical Decisions
During the first two days of the project, the thing I agonized over the most was "What should I use to build the AI's body?" Most AI assistants just talk. That's a chatbot, not an assistant.
A real assistant should be able to create files, run commands, and spin up a scheduler.
Docker vs QEMU
Initially, I obviously thought of Docker. But the moment the target audience became "average Windows users", Docker was out of the question. Enabling Hyper-V, installing WSL, installing Docker Desktop... This process cannot be reduced to a simple "click".
QEMU was different. One line, winget install qemu, is all it takes, and you can spin up a VM with a single executable. More importantly, it established a golden formula: One Linux image file = the entire AI assistant. When moving to another PC, you just copy ubuntu.qcow2.
Struggle Note: QEMU cannot handle paths with spaces or Korean characters. I wasted half a day trying to figure out why the VM wouldn't even start in paths like C:\Users\이름\projects. I ended up fixing it by writing a _short_path() function that converts the path into the Windows 8.3 short path format (C:\USERS\NAMHY~1).
Why Local LLMs?
If you use GPT or Claude, the performance is undeniably amazing. However, it completely violates Principle #2 (Zero Cost) and Principle #4 (Privacy). Having my personal emails bounce through OpenAI's servers is something I fundamentally wanted to avoid.
So, I set the combination of Ollama + Local Model (OSS-20b, Qwen3-8B, etc.) as the default. Yes, the tool-calling success rate is much lower compared to GPT or Claude, but it's entirely manageable with code guardrails and prompt engineering. (I'll cover this later in the Prompt Evolution Loop section.)
That said, I did build an LLM abstraction layer (llm_client.py) from the very beginning. By changing the provider in config.json, you can seamlessly switch to OpenAI or Claude. It's an insurance policy for when the local model falls short.
4. Architecture Overview
The big picture is simple. Ollama (LLM inference) runs on the Windows host, the agent server runs inside a QEMU VM (Ubuntu), and the two communicate via SSH + port forwarding.
To access the host's Ollama instance from inside the VM, it uses the special QEMU address 10.0.2.2:11434. I only opened two ports for forwarding: 2222→22 (SSH) and 3000→3000 (API).
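For illustration, the QEMU invocation that produces this layout might be assembled roughly like this (memory/CPU sizes and the virtio device choice are illustrative, not necessarily what vm_manager.py actually uses):

```python
def qemu_args(image: str = "ubuntu.qcow2",
              ssh_port: int = 2222, api_port: int = 3000) -> list:
    """Build QEMU args with user-mode networking. With `-netdev user`,
    the guest reaches the host at 10.0.2.2 (so Ollama is 10.0.2.2:11434
    from inside the VM); hostfwd exposes guest SSH/API on the host."""
    return [
        "qemu-system-x86_64",
        "-m", "4G", "-smp", "4",  # sizes are illustrative
        "-drive", f"file={image},format=qcow2",
        "-netdev", f"user,id=net0,hostfwd=tcp::{ssh_port}-:22,"
                   f"hostfwd=tcp::{api_port}-:3000",
        "-device", "virtio-net-pci,netdev=net0",
    ]
```

Only the two hostfwd rules cross the VM boundary; everything else stays inside QEMU's user-mode NAT.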
🎬 Video: Full Operation Demo (Setup → VM Boot → Desktop Client Chat → Tool Execution Flow)
5. QEMU VM — Giving the AI a Body
vm_manager.py (979 lines) manages the entire lifecycle of the VM. Initially, I built it for one simple reason—if the AI types rm -rf /, the host PC needs to be safe. But as I built it, I realized an even bigger advantage than isolation: The entire Linux ecosystem becomes the AI's playground and toolbox.
Installing packages with apt, scheduling with cron, analyzing data with Python, hosting services with systemd... Things that are a pain in Windows are merely one-liners in Linux.
VM Setup Flow
Calling the setup() function once automatically handles everything from downloading the Ubuntu Cloud image to configuring SSH.
File Sharing
File transfer between Windows ↔ VM is done exclusively via SCP into a single share/ directory. I tried 9P VirtIO mounts, but the stability on Windows was way too inconsistent, so I decided to stick to SCP as the default.
6. Agent Loop — The 2-Phase Structure
This is the core of the project: the _run_agent_loop() function in server.py, weighing in at around 400 lines.
At first, I made it structurally simple: "Send message → Get LLM response → Execute tools if any." This works beautifully with GPT-4, but with local models, it was a disaster. They would call tools with the wrong arguments, completely ignore necessary tools, or call the exact same thing three times in a row.
So, I split it into a 2-Phase Structure: Planning → Execution. I first ask the LLM, "Make a plan on what you're going to do," then execute the tools based on that plan, and finally synthesize the results to form an answer. This simple tweak alone doubled the perceived success rate of tool calling.
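The real _run_agent_loop() is ~400 lines; this toy sketch shows only the plan → execute → synthesize skeleton. The `llm(prompt)` callable and the "tool: argument" plan format are assumptions for illustration:

```python
def run_agent_loop(llm, tools: dict, user_msg: str) -> str:
    """Two-phase sketch: force the model to commit to a plan
    before any tool is touched, then synthesize from real results."""
    # Phase 1 - Planning: the model lists its intended steps
    plan = llm(f"Make a step-by-step tool plan for: {user_msg}")

    # Phase 2 - Execution: run each planned step that names a known tool
    results = []
    for line in plan.splitlines():
        name, _, arg = line.partition(":")
        name = name.strip()
        if name in tools:
            results.append(tools[name](arg.strip()))

    # Synthesis: answer from collected tool results, not from imagination
    return llm(f"Answer '{user_msg}' using these results: {results}")
```

Separating the phases means a local model never has to decide "what to do" and "how to call it" in the same generation, which is where most of its tool-calling failures came from.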
Guardrails
These are the mechanisms that actively defend against mistakes made by local models at the code level. Every single one of these was born out of painful debugging sessions:
- Duplicate Call Prevention: Uses a dedup_key to track identical tool + argument combinations. If the LLM tries to execute the same search three times, it reuses the first result and injects "Result already exists."
- Auto-Fetch Chain: After a web_search, it automatically runs fetch_url on the top result. Without this, the LLM often reads just the search result title and hallucinates the rest of the answer.
- Context Compaction: compact_after_turn() cleans up tool messages, one-off system injections, and execution plans after every turn. Without this, the context window explodes after just 5 turns.
- Correction Pattern Detection: If it detects corrective remarks from the user like "No" or "That's wrong," it injects a remember() nudge into the prompt. It's a hint saying "The user corrected you, make sure to remember this."
🎬 Video: Agent Loop Action (Complex Question → Planning → Tool Call → Auto Fetch → Real-time Final Response)
7. Multi-Model Tool Call Parser
Surprisingly, this was one of the most headache-inducing parts. Every single LLM calls tools in an entirely different format. Qwen3 uses <tool_call> XML tags, GLM uses Python function call syntax, and Nemotron uses yet another XML variation. The moment I set the goal to "Support all local models," I needed this parser.
The approach for tool_call_parser.py is a Fallback Chain. It detects the model family via its name, tries the dedicated parser first, and if that fails, it sequentially tries the rest of the parsers.
| Model | Output Format | Parsing Strategy |
|---|---|---|
| Qwen3 | &lt;tool_call&gt;{JSON}&lt;/tool_call&gt; | XML tag regex |
| Nemotron | &lt;function=name&gt;{params}&lt;/function&gt; | Nested XML parsing |
| GLM | func_name(key=val) | Function name + JSON extraction |
| xLAM | Raw JSON array | JSON array parsing |
| Generic | Mixed | Sequential trial of all parsers |
The benefit of this design: No matter which local model the user selects, they get the exact same tool execution experience. Change the model name in config.json, and the parser automatically follows suit.
8. 3-Tier Memory
The core of a personal assistant is memory. If you ask, "What's my name?" and it can't answer, it's not an assistant. The memory implemented in memory_tools.py (601 lines) borrows ideas from human cognitive structure, utilizing a 3-Tier design: STM → MTM → LTM.
- STM (Short-Term Memory): Records every conversation turn instantly into SQLite. A safeguard so conversations aren't lost if the server crashes. Capped at 5000 chars.
- MTM (Mid-Term Memory): If there's been no conversation for 10+ minutes, it summarizes the previous session into keyword|value pairs using qwen3:0.6b (a separate, lightweight model). Expires after 7~30 days.
- LTM (Long-Term Memory): Permanent memory. The LLM can directly invoke remember() to store things, or keywords that appear 2+ times in the MTM are automatically promoted here.
Semantic Search
Just saving memories is pointless if you can't use them. Every time a new message comes in, auto_retrieve() embeds the message using all-minilm (0.1B) and performs a cosine similarity search across LTM + MTM. Any memories scoring above the 0.3 similarity threshold are automatically injected into the system prompt.
I truncated the embeddings to 64 dimensions and serialized them into binary using struct.pack. I briefly considered a full vector DB, but an SQLite BLOB was more than enough and kept dependencies to an absolute minimum.
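The truncate-pack-compare pipeline is small enough to show in full; this is a sketch of the idea (function names are mine, the 64-dim truncation and struct.pack serialization are from the text):

```python
import math
import struct

DIM = 64  # embeddings are truncated to 64 dimensions before storage

def to_blob(vec) -> bytes:
    """Truncate to DIM floats and pack into a binary BLOB for SQLite."""
    return struct.pack(f"{DIM}f", *vec[:DIM])

def from_blob(blob: bytes) -> list:
    """Unpack a stored BLOB back into a list of floats."""
    return list(struct.unpack(f"{DIM}f", blob))

def cosine(a, b) -> float:
    """Plain cosine similarity; fast enough at 64 dims without a vector DB."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Each BLOB is exactly 256 bytes (64 × 4-byte floats), so even thousands of memories scan in milliseconds with a linear loop over the SQLite rows.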
Struggle Note: I originally tried to use nomic-embed-text, but I completely forgot to add the pull command to the setup script. This caused a bug where the memory stayed empty forever. It took me two days to debug this. I've since switched to all-minilm and ensure it's pulled automatically during setup.
🎬 Video: Memory System (Remembering a name → Ending session → Verifying memory persists on restart → Recall search)
9. Tool System and Self-Evolution
Xoul's tools are separated by modules in the tools/ package. Currently, there are over 30 registered tools. But the coolest part? The AI can actually create new tools by itself.
| Domain | Tools | Execution Environment |
|---|---|---|
| 🔍 Web | web_search, fetch_url | VM |
| 📁 File/Shell | read_file, write_file, run_command | VM |
| 🐍 Code | run_python_code | VM |
| 📧 Email | send_email, list_emails, read_email | VM → SMTP/IMAP |
| 📅 PIM | create_event, list_events | VM → Google API |
| 🧠 Memory | remember, recall, forget | VM → SQLite |
| ⏰ Schedule | create_task, list_tasks | VM |
| 🧬 Evolution | create_tool, evolve_skill, find_skill | VM |
| 🖥️ Host | host_open_url, host_open_app, host_find_file | Windows |
create_tool — Dynamic Tool Creation
create_tool() in meta_tools.py wraps shell commands into actual tools. If you say, "Make a tool to check disk usage," it registers the df -h command as a usable tool. It even persists across restarts.
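A minimal sketch of what create_tool() might look like, assuming an in-memory registry plus a JSON file for persistence (the registry, file name, and reload mechanism are my assumptions, not the real meta_tools.py):

```python
import json
import os
import subprocess
import tempfile

REGISTRY = {}  # name -> callable (hypothetical in-memory tool registry)
DEFS = {}      # persisted definitions so tools survive restarts
TOOLS_FILE = os.path.join(tempfile.gettempdir(), "dynamic_tools.json")

def create_tool(name: str, command: str, description: str = ""):
    """Wrap a shell command as a callable tool and persist its definition."""
    def run(_arg: str = "") -> str:
        return subprocess.run(command, shell=True,
                              capture_output=True, text=True).stdout
    REGISTRY[name] = run
    DEFS[name] = {"command": command, "desc": description}
    with open(TOOLS_FILE, "w") as f:  # reloaded and re-registered at startup
        json.dump(DEFS, f)
    return run
```

So "Make a tool to check disk usage" would become `create_tool("disk_usage", "df -h", "show disk usage")`, and the definition is read back from the JSON file on the next boot.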
evolve_skill — Learning from Success
evolve_skill() saves the pattern (tool chain, prompt, and result) of successfully completed tasks into JSON. Next time a similar request comes in, it uses find_skill() to retrieve and reference the past pattern. It's essentially experience-based learning.
10. Host Security — Tier 1/2 Model
No matter what the AI does inside the VM, the host remains safe. However, tools like host_open_app or host_find_file execute directly on the host Windows machine. This needs to be handled with extreme caution.
I implemented a 2-Tier Security Model in host_tools.py:
- Tier 1 (Auto-Execute): host_open_url, host_find_file, host_show_notification, host_open_app. Blocks URL schemes (file://, javascript:), restricts search paths to Desktop/Documents/Downloads, limits search depth to 3 levels, etc.
- Tier 2 (Require Confirmation): host_organize_files, host_run_command. Shows a pop-up confirmation to the user before executing. Blacklists 14 highly dangerous command patterns like del, rmdir, format, shutdown, rm -rf, and dd if=.
Why the host_ prefix? Standard tools (web_search, run_command) run inside the VM, so they naturally have zero impact on the host. The Desktop client literally just intercepts any tool call starting with host_ and runs it natively on Windows. This ridiculously simple naming convention keeps the Host/VM boundary crystal clear.
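The routing itself really is just a prefix check; a sketch of the dispatch (the function and parameter names are mine):

```python
def route_tool_call(name: str, args: dict, vm_exec, host_exec):
    """Dispatch sketch: `host_`-prefixed tools run natively on Windows
    via the desktop client; everything else is sent into the VM."""
    if name.startswith("host_"):
        return host_exec(name, args)
    return vm_exec(name, args)
```

One string comparison is the entire security boundary declaration, which is exactly why it's so easy to audit.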
11. Desktop Client
I built a native Windows app using PyQt6. The primary focus of the UI design was "Minimizing Presence." An assistant should only appear when you need it.
- Spotlight Input Bar: Hitting Ctrl+Space pops up a clean input bar. Great for firing off quick questions. It switches to a full chat window for complex conversations.
- KakaoTalk-Style Chat: Appends messages via JavaScript to a QWebEngineView. Features Markdown rendering, code highlighting, and links that open in the default external browser. Frameless window with custom resizing.
- Tool Chips: When the agent calls a tool, a chip like "🔍 Web Search: Seoul Weather" pops up in real-time, so you always know exactly what the AI is up to.
- Host App Integration: When host_open_app("Chrome") fires from the VM, the desktop client scans .lnk files in the Windows Start Menu, finds the actual executable, and runs it.
🎬 Video: Desktop UI (Ctrl+Space Call → Chat → Tool Chip Display → Run Host App)
12. Multi-Client
An assistant "you can only use at home" is only half an assistant. As long as the PC is running, I wanted to be able to talk to the assistant via Telegram while out and about. So, I built lightweight clients for 3 major messengers.
| Client | Protocol | Execution Method |
|---|---|---|
| Telegram | Bot API (polling) | systemd service (VM) |
| Discord | Gateway WebSocket | systemd service (VM) |
| Slack | Socket Mode | systemd service (VM) |
| Desktop | REST API (SSE) | Windows process |
All of them hit the exact same /chat API endpoint. Context is perfectly maintained using API Key Authentication + session_id. Because these run as systemd services inside the VM, they start automatically. All you have to do is drop your Bot Token into config.json.
🎬 Video: Telegram Integration (Asking from Telegram → Agent tool execution → Receive result)
13. AI Evolving AI — The Test-Driven Evolution Loop
Full disclosure: writing actual code took up less than 40% of the total time on this project. The other 60-plus percent was spent forcing the local model to actually use its tools correctly.
And that process naturally turned into a structure where AI (Opus 4.6) was actively improving the AI (Local Model).
The loop looked like this:
- I design a test case.
- I run the test against Xoul (Local LLM).
- I dump the failure logs over to Antigravity + Opus 4.6.
- Opus analyzes the root cause of the failure and suggests prompt/code improvements.
- Apply the improvements and re-test.
As this loop repeated, test scripts kept piling up in the tests/ directory. That evolutionary journey is literally a log of how we broke through the limits of local models.
Evolution of Tests: Gen 5
The test scripts evolved across 5 generations. Each generation became significantly harder based on the failure patterns of the previous one:
| Generation | File | Test Focus | Core Discovery |
|---|---|---|---|
| 1st Gen | test_agent.py | Does basic tool calling work? | Single tools work, chaining fails. |
| 2nd Gen | test_agent_advanced.py | Verify hallucinated URLs + complex tool chains (10 cases) | Models confidently invent fake URLs. |
| 3rd Gen | test_agent_r2.py | Refusing non-existent places + real-time data + memory | Massive improvement after introducing Planning. |
| 4th Gen | test_integration_v3.py | 10 integration stress tests: stopping while True loops, scheduler CRUD, full chains | Code guardrails are infinitely more effective than prompts. |
| 5th Gen | test_hard.py | 7-step full chain: Search→Visit→Summarize→Translate→Save→Email→Notify | Crossing this line means it's ready for real-world use. |
What Exactly Was Being Tested?
Look at the test cases from the 2nd Generation (test_agent_advanced.py); they precisely target the weaknesses of local models:
# Verify URL Hallucination — Checks if it invents fake URLs when recommending brands
{
"name": "Complex Web Search & URL Hallucination Check (Brands)",
"input": "Recommend 3 Korean Gorpcore backpack brands around $100 "
"and give me the official homepage URLs where I can actually buy them.",
"check": "url_integrity" # Confirms the URL actually exists via HTTP HEAD
}
# Hallucination Prevention (Non-existent Restaurant)
{
"name": "Hallucination Refusal (Non-existent Restaurant)",
"input": "Give me the top 3 menu items and a Naver Map link for the "
"'Space Conquest Black Pork' restaurant in Seogwipo, Jeju.",
"check": "no_hallucination" # MUST say "I don't know" to PASS
}
Test results are automatically saved as JSON (test_results.json, test_results_adv.json, test_results_r2.json, test_results_v3.json). These logs are the exact input handed over to Opus.
AI → AI Feedback Loop
The Core Improvements That Came Out of This Loop
We didn't just tweak prompts. The vast majority of Opus's suggestions were code-level guardrails:
| Failure Pattern | Opus's Analysis | Applied Solution |
|---|---|---|
| Invents fake URLs after searching | "It's imagining the URL by just looking at search snippets." | Auto-fetch chain: force fetch_url immediately on the top result of web_search. |
| Repeats identical searches 3 times | "It forgets previous results and just runs it again." | dedup_key: hash tool + arguments to hard-block duplicates. |
| Forgets required tool arguments | "It tries to execute immediately without a plan." | Planning Phase: force it to map out a plan before touching tools. |
| Attempts while True loops | "It needs to be redirected to use the scheduler for infinite loops." | Block dangerous code + redirect to create_task. |
| Context corruption after 5 turns | "Tool result vomit piles up and drowns out the actual context." | compact_after_turn(): ruthlessly strip intermediate tool steps after each turn. |
Core Insight: "I tried writing 'DO NOT INVENT URLS' in the prompt 100 times. It never listens." Opus's conclusion was absolute: It is impossible to suppress hallucinations in local models through Prompt Engineering alone. You have to physically block them with code. A simple auto-fetch chain was more effective than 10 lines of angry prompt instructions.
5th Gen Tests: The Final Boss
The last case in test_hard.py is an unapologetic 7-step full chain. If it passes this, I deemed it completely viable for real-world daily use:
# H05: News Search → Visit URL → Summarize → Translate → Save File → Email
{
"id": "H05_full_chain",
"desc": "News Search → Visit URL → Summarize → Translate → Save File → Email",
"query": (
"Find the most important global economic news today using web_search, "
"visit the most relevant article using fetch_url to read the full body, "
"summarize the core points into 3 sentences, translate it into English, "
"save it to /tmp/world_economy.txt, "
"and email it to namhyun@gmail.com."
),
"expect_tools": ["web_search", "fetch_url", "write_file", "send_email"],
}
In the 1st Generation, it would call web_search exactly once and then hallucinate the rest of the answer using its imagination. After applying all 5 generations of guardrails, it successfully calls 4 distinct tools in sequential order, visits the site, and generates a summary based on the actual text retrieved from the article.
This is the most important realization of this entire project: Local models are simply not enough on their own right now. But if you create a loop where a larger AI analyzes the failures of a smaller AI and actively implements improvements, even local models can be elevated to a highly usable state. An era where AI develops AI. I believe this isn't just a fun story about my weekend project; it's the exact pattern of how all local AI agents will evolve moving forward.
14. Retrospective
Being able to build an entire functional ecosystem like this in just 9 days was, ironically, entirely thanks to AI. Antigravity + Opus 4.6 wrote the code, relentlessly debugged the weirdest errors, and proposed architectural structures. We are truly living in the era where AI makes AI assistants.
Good Decisions
- Going with QEMU. The barrier to entry is astronomically lower than Docker, and portability is fully guaranteed with a single image file.
- Separating Planning → Execution. Dramatically skyrocketed the tool-calling success rate of local models.
- Creating an LLM Abstraction Layer. I thought it was over-engineering at first, but considering I swapped my core model over 5 times, it saved my sanity repeatedly.
- 3-Tier Memory. Splitting the load between a summarization model (0.6B) and an embedding model (all-minilm) was brilliant. It puts literally zero extra load on the main chat model.
Regrets / Remaining Tasks
- Lack of diverse environmental testing. "It runs perfectly on my machine," but it's bound to hit edge cases in other Windows environments.
- Browser automation. Letting a Vision model physically drive a real web browser. I only laid the groundwork for this with browser_daemon.py.
- TTS/STT. Voice interfaces. Totally doable with local Whisper + local TTS, but realistically... would anyone actually use it?
Yes, local models are still lacking compared to massive cloud models. But new models are dropping every single month. If you build a bulletproof framework, your assistant's IQ inherently grows as the open-source models grow. That, right there, is the biggest bet I'm making with this project.
Top comments (4)
Hey! I've been trying for days to get local LLM and OpenClaw to work. I've since given up until I saw this. Do you have any intentions of making this project available? I'd very much like to try it out. Cheers!
Sure, I'm working on it (for common Windows users with low or mid VRAM). However, it's not that stable yet, I think, so I'm gonna open and share it when it seems to be working fine.
Hello, amazing job! It would be very cool if you could share an alpha version of your solution. I really want to try out some of your ideas, and there are many others like me.
Thanks. I will share as soon as possible. Stay tuned. ;)