Kim Namhyun

Posted on Feb 24

Improving Host↔VM File Transfer in a Local AI Agent — Smart Search + Deduplication

#ai #security #linux #agents

Fixing SCP encoding crashes, adding fuzzy filename matching, MD5-based deduplication, and automatic VM file search in a QEMU-based AI agent.

1. Problem Definition

The agent, Xoul, runs inside a QEMU Ubuntu VM. The reason is security.

When you give an AI agent system tools like run_command and write_file, LLM hallucinations or prompt injection attacks could directly damage the host PC. Imagine rm -rf / executing on your main machine.

Solution: Sandbox all agent system operations inside a QEMU VM.

If the LLM executes rm -rf /, only the VM is affected
No direct host filesystem access — the only channel is the share/ folder
share/ uses explicit SCP transfers only (no auto-mount)

This architecture relies on two file transfer tools:

host_to_vm: Host share/ → VM /root/share/ (deliver user files to the agent)
vm_to_host: VM → Host share/ (deliver agent results to the user)

Both use SCP (Secure Copy Protocol) over SSH. In practice, several issues surfaced:

Symptoms:

SCP transfers intermittently failing (cause unclear)
When a user says "send me the report file," the tool fails unless the exact filename is provided
Retrieving files from VM requires knowing the full path
Identical files transferred repeatedly with no deduplication

2. Root Cause Analysis

2-1. Encoding Crash — `UnicodeDecodeError`

All three functions in vm_manager.py (ssh_exec, scp_to_vm, scp_from_vm) used subprocess.run(text=True):

# Problematic code
result = subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=timeout)

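text=True makes subprocess decode stdout and stderr strictly, using the locale's preferred encoding (UTF-8 in this setup, as the traceback shows). When SSH output contains BOM bytes (0xFF, 0xFE) or other binary data, decoding fails immediately: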

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0

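Because the exception is raised inside subprocess.py's _readerthread rather than in the calling code, the captured output ends up as None and the caller's .strip() call fails. The message that finally surfaces hides the real cause (오류 means "error"):

[SSH 오류: 'NoneType' object has no attribute 'strip']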

→ To the user, it just looks like "SCP doesn't work."
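The decode failure described above can be reproduced without SSH at all. The following is a minimal sketch, assuming the offending output starts with a UTF-16 LE BOM; the payload `report.txt` and the helper name `try_strict_decode` are made up for illustration and are not part of vm_manager.py.

```python
import codecs

# Bytes as a Windows tool might emit them: UTF-16 LE BOM (0xFF 0xFE) + payload.
# The payload is an arbitrary example, not real agent output.
raw = codecs.BOM_UTF16_LE + "report.txt".encode("utf-16-le")


def try_strict_decode(data: bytes) -> str:
    """Decode strictly as UTF-8, as text=True does when the encoding is UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Return the error text instead of raising so it can be inspected.
        return f"{type(exc).__name__}: {exc}"


message = try_strict_decode(raw)
print(message)
```

The printed message names byte 0xff at position 0, the same failure as in the traceback above.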

2-2. Exact Filename Required

# Original code — exact match required
def _host_to_vm(filename: str, vm_path: str = "") -> str:
    local_path = os.path.join(SHARE_DIR, filename)  # ← Must match exactly
    return scp_to_vm(local_path, vm_path)

Users naturally say partial names like "report" or "result" in conversation, and the LLM cannot reliably guess the exact filename from them.

2-3. Full VM Path Required

vm_to_host requires the complete path (/root/workspace/result.txt). Users have no way of knowing internal VM file paths.

2-4. No Deduplication

Repeated requests for the same file trigger a full SCP transfer every time, which wastes network I/O and adds SSH overhead.

3. Solution

3-1. Encoding Fix

# vm_manager.py — common to ssh_exec, scp_to_vm, scp_from_vm
  result = subprocess.run(
      ssh_cmd,
      capture_output=True,
-     text=True,
+     text=False,
      timeout=timeout,
  )
- output = result.stdout
+ output = result.stdout.decode("utf-8", errors="replace")
+ stderr = result.stderr.decode("utf-8", errors="replace")

errors="replace" substitutes each undecodable byte with U+FFFD (�), so decoding cannot raise no matter what bytes SSH outputs.
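The same idea can be wrapped in one helper so that every call site decodes the same way. This is a sketch under assumptions, not the code in vm_manager.py: `run_capture` is a hypothetical name, and a child Python process stands in for ssh so the example runs anywhere.

```python
import subprocess
import sys


def run_capture(cmd: list[str], timeout: float = 30.0) -> tuple[int, str, str]:
    """Run cmd, capture raw bytes, and decode them without ever raising."""
    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    # `or b""` guards against None so later .strip() calls stay safe.
    stdout = (result.stdout or b"").decode("utf-8", errors="replace")
    stderr = (result.stderr or b"").decode("utf-8", errors="replace")
    return result.returncode, stdout, stderr


# Stand-in for ssh: a child process that writes BOM bytes followed by "ok".
child_code = "import sys; sys.stdout.buffer.write(b'\\xff\\xfeok')"
rc, out, err = run_capture([sys.executable, "-c", child_code])
print(rc, repr(out))
```

Each of the two invalid bytes becomes one U+FFFD character and the readable part of the output survives, which is usually enough for status and error text.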

3-2. Smart File Search

Host Side — `_find_file_in_share()`

Case-insensitive partial matching across the entire share/ directory tree:

def _find_file_in_share(query: str) -> list:
    results = []
    query_lower = query.lower()
    for root, dirs, files in os.walk(SHARE_DIR):
        for f in files:
            if f.startswith("."):
                continue
            if query_lower in f.lower():
                results.append(os.path.join(root, f))
    return results
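A search like this can return zero, one, or several paths, so the caller still has to decide what to transfer. The sketch below shows one way to do that, assuming a single unambiguous match is transferred and anything else is reported back to the user. `find_in_dir` mirrors the function above but takes the directory as a parameter, and `resolve` is a hypothetical helper, not part of the original tool.

```python
import os
import tempfile
from typing import Optional


def find_in_dir(base: str, query: str) -> list[str]:
    """Case-insensitive substring match on file names, skipping dotfiles."""
    query_lower = query.lower()
    hits = []
    for root, _dirs, files in os.walk(base):
        for name in files:
            if not name.startswith(".") and query_lower in name.lower():
                hits.append(os.path.join(root, name))
    return sorted(hits)


def resolve(base: str, query: str) -> tuple[Optional[str], str]:
    """Return (path, message); path is set only when the match is unambiguous."""
    hits = find_in_dir(base, query)
    if not hits:
        return None, f"No file matching '{query}' under {base}"
    # Prefer a file whose full name equals the query over partial matches.
    exact = [h for h in hits if os.path.basename(h).lower() == query.lower()]
    if len(exact) == 1:
        return exact[0], "exact match"
    if len(hits) == 1:
        return hits[0], "single partial match"
    listing = "\n".join(os.path.relpath(h, base) for h in hits)
    return None, f"{len(hits)} candidates, please pick one:\n{listing}"


with tempfile.TemporaryDirectory() as base:
    for name in ("report_q1.txt", "report_q2.txt", "notes.md", ".transfer_log.json"):
        open(os.path.join(base, name), "w").close()
    path_notes, msg_notes = resolve(base, "NOTES")
    path_report, msg_report = resolve(base, "report")

print(msg_notes)
print(msg_report)
```

Returning the candidate list as text lets the LLM show it to the user and ask which file was meant, rather than silently picking the first hit.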

VM Side — `_find_file_in_vm()`

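Runs find over SSH to search the common working directories and returns the first hit:

import shlex

def _find_file_in_vm(query: str) -> str:
    # Quote the glob so quotes or shell metacharacters in the query
    # cannot break out of the remote command.
    pattern = shlex.quote(f"*{query}*")
    # Note: -name is case-sensitive, unlike the host-side search;
    # -iname would match case-insensitively.
    result = ssh_exec(
        f"find /root/share /root/workspace /root/.xoul/workspace "
        f"-name {pattern} -type f 2>/dev/null | head -5",
        timeout=10, quiet=True
    )
    # Keep non-empty lines only; the first match wins.
    lines = [line.strip() for line in result.strip().split("\n") if line.strip()]
    return lines[0] if lines else ""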

3-3. MD5 Hash-Based Deduplication

Transfer history is stored in share/.transfer_log.json:

{
  "htv:test_htv.txt": {
    "hash": "a1b2c3d4e5f6...",
    "direction": "htv",
    "src": "C:\\...\\share\\test_htv.txt",
    "dst": "/root/share/test_htv.txt",
    "timestamp": "2026-02-24T22:05:30"
  }
}

When an identical hash is detected, the transfer is skipped:

✅ Already up to date: test_htv.txt → /root/share/test_htv.txt
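The skip decision can be sketched as follows. This is an illustration under assumptions rather than the tool's actual code: the function names are hypothetical, the log entry copies the fields shown above, and the real SCP call is replaced by a callback so the example runs without a VM.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_log(log_path: str) -> dict:
    """Read the JSON log; a missing or corrupt log counts as empty."""
    try:
        with open(log_path, encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def transfer_once(src, dst, direction, log_path, do_transfer) -> bool:
    """Call do_transfer(src, dst) unless the log holds the same hash and dst."""
    log = load_log(log_path)
    key = f"{direction}:{os.path.basename(src)}"
    digest = file_md5(src)
    entry = log.get(key)
    if entry and entry.get("hash") == digest and entry.get("dst") == dst:
        return False  # identical content already sent to this destination
    do_transfer(src, dst)
    log[key] = {
        "hash": digest, "direction": direction, "src": src, "dst": dst,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
    with open(log_path, "w", encoding="utf-8") as fh:
        json.dump(log, fh, indent=2)
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test_htv.txt")
    log_path = os.path.join(tmp, ".transfer_log.json")
    calls = []  # records what the fake SCP callback was asked to send
    fake_scp = lambda s, d: calls.append((s, d))
    with open(src, "w", encoding="utf-8") as fh:
        fh.write("hello")
    first = transfer_once(src, "/root/share/test_htv.txt", "htv", log_path, fake_scp)
    second = transfer_once(src, "/root/share/test_htv.txt", "htv", log_path, fake_scp)
    with open(src, "w", encoding="utf-8") as fh:
        fh.write("changed")
    third = transfer_once(src, "/root/share/test_htv.txt", "htv", log_path, fake_scp)

print(first, second, third, len(calls))
```

One caveat with a log-only check: it records that a transfer happened, not that the destination file still exists, so a file deleted inside the VM would be skipped until its log entry is removed.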

4. Testing & Verification

All tests were performed with actual SCP transfers against a running VM. HTV stands for host_to_vm and VTH for vm_to_host.

| Test | Input | Expected | Result |
| --- | --- | --- | --- |
| HTV exact match | `test_htv.txt` | Upload complete | ✅ |
| HTV dedup | Same file again | Skip message | ✅ |
| HTV partial match | `htv` | Auto-find `test_htv.txt` | ✅ |
| HTV nonexistent | `nonexistent.xyz` | Error + path hint | ✅ |
| VTH full path | `/root/workspace/result.txt` | Download complete | ✅ |
| VTH filename only | `todo.txt` | VM `find` search | ✅ |
| VTH nonexistent | `zzz_not_exist.bin` | Error message | ✅ |
| SSH encoding | `cat` command (BOM) | No crash | ✅ |

8/8 passed.

5. Results

Before vs After

| Aspect | Before | After |
| --- | --- | --- |
| SSH encoding | `UnicodeDecodeError` crash | `errors="replace"`, no crash |
| File search | Exact filename required | Partial match auto-search |
| VM file access | Full path required | Filename-only `find` search |
| Duplicate transfer | Full SCP every time | MD5 hash comparison → skip |
| Transfer history | None | JSON log |
| Error messages | `NoneType has no attribute 'strip'` | Specific guidance + share/ path |

Key Lessons

The subprocess.run(text=True) trap: Python's text=True is convenient, but it decodes strictly with the locale's preferred encoding and raises on the first byte it cannot decode. SSH and SCP on Windows can inject BOM bytes (0xFF 0xFE), causing immediate crashes. text=False + decode(errors="replace") is the safe default.
Users don't know paths: When building file management tools, assuming users know exact filenames or VM paths is dangerous. The pattern should be: partial match → candidate list → selection.
Hash-based deduplication: Files can share a name but differ in content, or differ in name but share content. An MD5 hash of the actual content is the reliable way to compare them. It is adequate for catching accidental duplicates, though MD5 is not collision-resistant against deliberately crafted files.
JSON vs SQLite for logs: For small-scale data like transfer logs (dozens of entries), JSON files are ideal: zero dependencies and easy debugging. SQLite shines for memory systems with thousands of entries requiring semantic search.
