Fixing SCP encoding crashes, adding fuzzy filename matching, MD5-based deduplication, and automatic VM file search in a QEMU-based AI agent.
1. Problem Definition
The agent 'Xoul' runs inside a QEMU Ubuntu VM. The reason is security.
When you give an AI agent system tools like run_command and write_file, LLM hallucinations or prompt injection attacks could directly damage the host PC. Imagine rm -rf / executing on your main machine.
Solution: Sandbox all agent system operations inside a QEMU VM.
- If the LLM executes rm -rf /, only the VM is affected
- No direct host filesystem access; the only channel is the share/ folder
- share/ uses explicit SCP transfers only (no auto-mount)
This architecture relies on two file transfer tools:
- host_to_vm: Host share/ → VM /root/share/ (deliver user files to the agent)
- vm_to_host: VM → Host share/ (deliver agent results to the user)
Both use SCP (Secure Copy Protocol) over SSH. In practice, several issues surfaced:
Symptoms:
- SCP transfers intermittently failing (cause unclear)
- When a user says "send me the report file," the tool fails unless the exact filename is provided
- Retrieving files from VM requires knowing the full path
- Identical files transferred repeatedly with no deduplication
2. Root Cause Analysis
2-1. Encoding Crash — UnicodeDecodeError
All three functions in vm_manager.py (ssh_exec, scp_to_vm, scp_from_vm) used subprocess.run(text=True):
# Problematic code
result = subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=timeout)
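text=True makes subprocess decode stdout and stderr as text, using UTF-8 in this setup (by default Python picks the locale's preferred encoding). When SSH output contains BOM bytes (0xFF, 0xFE) or binary data, decoding crashes immediately: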
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0
Because the exception is raised inside subprocess.py's _readerthread, the real cause is hidden and the message that finally surfaces is a misleading one:
[SSH error: 'NoneType' object has no attribute 'strip']
→ To the user, it just looks like "SCP doesn't work."
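The decoding failure can be reproduced without SSH. The sketch below assumes the offending output starts with the bytes 0xFF 0xFE, which is the UTF-16 LE byte order mark; the sample payload and the helper name try_strict_decode are made up for illustration.

```python
# Hypothetical payload: UTF-16 LE text preceded by its BOM (0xFF 0xFE).
raw = b"\xff\xfe" + "hello".encode("utf-16-le")


def try_strict_decode(data: bytes) -> str:
    """Return the decoded text, or the error text if strict UTF-8 decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc}"


# Strict UTF-8 decoding rejects the very first byte.
print(try_strict_decode(raw))

# The same bytes are valid UTF-16; the BOM is consumed by the decoder.
print(raw.decode("utf-16"))
```

The bytes themselves are not corrupt. They are simply not UTF-8, which is why a strict decoder fails at position 0.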
2-2. Exact Filename Required
# Original code — exact match required
def _host_to_vm(filename: str, vm_path: str = "") -> str:
    local_path = os.path.join(SHARE_DIR, filename)  # ← Must match exactly
    return scp_to_vm(local_path, vm_path)
Users naturally say partial names like "report" or "result" in conversation. The LLM cannot guess exact filenames.
2-3. Full VM Path Required
vm_to_host requires the complete path (/root/workspace/result.txt). Users have no way of knowing internal VM file paths.
2-4. No Deduplication
Repeated requests for the same file trigger a full SCP transfer every time, which wastes network I/O and adds SSH overhead.
3. Solution
3-1. Encoding Fix
# vm_manager.py — common to ssh_exec, scp_to_vm, scp_from_vm
  result = subprocess.run(
      ssh_cmd,
      capture_output=True,
-     text=True,
+     text=False,
      timeout=timeout,
  )
- output = result.stdout
+ output = result.stdout.decode("utf-8", errors="replace")
+ stderr = result.stderr.decode("utf-8", errors="replace")
errors="replace" substitutes each undecodable byte with the replacement character U+FFFD (�), so decoding never raises, no matter what bytes SSH outputs.
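The same fix can be factored into a small helper so that all three functions share it. This is a minimal sketch, not the code in vm_manager.py: the names decode_output and run_decoded are hypothetical, and the child process is only a stand-in for SSH that writes two invalid bytes followed by "ok".

```python
import subprocess
import sys


def decode_output(raw):
    """Decode captured bytes as UTF-8, replacing undecodable bytes with U+FFFD."""
    if raw is None:  # the stream was not captured
        return ""
    return raw.decode("utf-8", errors="replace")


def run_decoded(cmd, timeout=30):
    """Run cmd, capture raw bytes, and return (returncode, stdout, stderr) as str."""
    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    return (
        result.returncode,
        decode_output(result.stdout),
        decode_output(result.stderr),
    )


if __name__ == "__main__":
    # Stand-in for SSH: a child Python process that emits non-UTF-8 bytes.
    child = [sys.executable, "-c",
             "import sys; sys.stdout.buffer.write(b'\\xff\\xfeok')"]
    code, out, err = run_decoded(child)
    print(code, repr(out))
```

Returning an empty string for a missing stream also means a later .strip() call has a string to work on, which is one way to avoid the misleading 'NoneType' message described earlier.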
3-2. Smart File Search
Host Side — _find_file_in_share()
Case-insensitive partial matching across the entire share/ directory tree:
def _find_file_in_share(query: str) -> list:
    # Collect every file under SHARE_DIR whose name contains the query
    results = []
    query_lower = query.lower()
    for root, dirs, files in os.walk(SHARE_DIR):
        for f in files:
            if f.startswith("."):  # skip hidden files such as the transfer log
                continue
            if query_lower in f.lower():
                results.append(os.path.join(root, f))
    return results
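One way to use such a search result is to try an exact match first, fall back to the partial match, and transfer only when exactly one candidate remains. The sketch below is an assumption about how the pieces could fit together, not the agent's actual code; find_candidates and resolve_file are hypothetical names, and the share directory is passed in as a parameter to keep the example self-contained.

```python
import os


def find_candidates(query, share_dir):
    """Case-insensitive partial match over share_dir, skipping dotfiles."""
    query_lower = query.lower()
    matches = []
    for root, _dirs, files in os.walk(share_dir):
        for name in files:
            if name.startswith("."):
                continue
            if query_lower in name.lower():
                matches.append(os.path.join(root, name))
    return sorted(matches)


def resolve_file(query, share_dir):
    """Return (path, candidates); path is set only when the match is unambiguous."""
    exact = os.path.join(share_dir, query)
    if os.path.isfile(exact):
        return exact, [exact]
    candidates = find_candidates(query, share_dir)
    if len(candidates) == 1:
        return candidates[0], candidates
    # Zero or several matches: let the caller report an error or ask the user.
    return None, candidates
```

When several files match, returning the candidate list lets the LLM present the options to the user instead of guessing.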
VM Side — _find_file_in_vm()
Uses SSH find command to search common directories:
def _find_file_in_vm(query: str) -> str:
    # Search the common VM directories and keep at most five matches
    result = ssh_exec(
        f"find /root/share /root/workspace /root/.xoul/workspace "
        f"-name '*{query}*' -type f 2>/dev/null | head -5",
        timeout=10, quiet=True
    )
    lines = [line.strip() for line in result.strip().split("\n") if line.strip()]
    # Return the first match, or an empty string when nothing was found
    return lines[0] if lines else ""
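Because the query is interpolated into a shell command, a filename containing a quote or a semicolon could change what the VM executes. A possible hardening, sketched here under the assumption that the command string reaches a POSIX shell on the VM unchanged, is to quote the pattern with shlex.quote. The names SEARCH_DIRS, build_find_command, and first_match are hypothetical.

```python
import shlex

# Directories taken from the find command shown above.
SEARCH_DIRS = ["/root/share", "/root/workspace", "/root/.xoul/workspace"]


def build_find_command(query, limit=5):
    """Build the VM-side find command with the user-supplied pattern shell-quoted."""
    pattern = shlex.quote(f"*{query}*")
    dirs = " ".join(shlex.quote(d) for d in SEARCH_DIRS)
    return f"find {dirs} -name {pattern} -type f 2>/dev/null | head -{int(limit)}"


def first_match(find_output):
    """Return the first non-empty line of find's output, or '' if there is none."""
    lines = [line.strip() for line in find_output.splitlines() if line.strip()]
    return lines[0] if lines else ""


print(build_find_command("report"))
```

The VM already contains the damage, so this is defense in depth rather than a required fix. It mainly keeps odd filenames from producing confusing search failures.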
3-3. MD5 Hash-Based Deduplication
Transfer history is stored in share/.transfer_log.json:
{
  "htv:test_htv.txt": {
    "hash": "a1b2c3d4e5f6...",
    "direction": "htv",
    "src": "C:\\...\\share\\test_htv.txt",
    "dst": "/root/share/test_htv.txt",
    "timestamp": "2026-02-24T22:05:30"
  }
}
When an identical hash is detected, the transfer is skipped:
✅ Already up to date: test_htv.txt → /root/share/test_htv.txt
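The check itself can be sketched in a few functions. This is an illustration under assumptions, not the project's implementation: the function names are hypothetical, the log key follows the direction:filename shape shown in the JSON above, and comparing the destination as well as the hash is an extra choice made for this sketch.

```python
import hashlib
import json
import os
from datetime import datetime


def md5_of_file(path, chunk_size=65536):
    """Hash the file contents in chunks so large files are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_log(log_path):
    """Load the transfer log, or return an empty log if the file does not exist."""
    if not os.path.exists(log_path):
        return {}
    with open(log_path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def is_duplicate(log, key, file_hash, dst):
    """True when the last logged transfer for key has the same hash and destination."""
    entry = log.get(key)
    return bool(entry) and entry.get("hash") == file_hash and entry.get("dst") == dst


def record_transfer(log_path, log, key, file_hash, direction, src, dst):
    """Store the transfer in the in-memory log and write it back to disk."""
    log[key] = {
        "hash": file_hash,
        "direction": direction,
        "src": src,
        "dst": dst,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
    with open(log_path, "w", encoding="utf-8") as fh:
        json.dump(log, fh, indent=2, ensure_ascii=False)
```

A caller would compute md5_of_file, skip the SCP call when is_duplicate returns True, and call record_transfer after a successful transfer. Note that a log hit only shows the file was sent before; it does not prove the copy still exists at the destination.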
4. Testing & Verification
All tests were performed with actual SCP transfers on a running VM. HTV denotes host_to_vm and VTH denotes vm_to_host.
| Test | Input | Expected | Result |
|---|---|---|---|
| HTV exact match | test_htv.txt | Upload complete | ✅ |
| HTV dedup | Same file again | Skip message | ✅ |
| HTV partial match | htv | Auto-find test_htv.txt | ✅ |
| HTV nonexistent | nonexistent.xyz | Error + path hint | ✅ |
| VTH full path | /root/workspace/result.txt | Download complete | ✅ |
| VTH filename only | todo.txt | VM find search | ✅ |
| VTH nonexistent | zzz_not_exist.bin | Error message | ✅ |
| SSH encoding | cat command (BOM) | No crash | ✅ |
8/8 passed.
5. Results
Before vs After
| Aspect | Before | After |
|---|---|---|
| SSH encoding | UnicodeDecodeError crash | errors="replace", always safe |
| File search | Exact filename required | Partial match auto-search |
| VM file access | Full path required | Filename-only find search |
| Duplicate transfer | Full SCP every time | MD5 hash comparison → skip |
| Transfer history | None | JSON log |
| Error messages | NoneType has no attribute 'strip' | Specific guidance + share/ path |
Key Lessons
- The subprocess.run(text=True) trap: Python's text=True is convenient, but it assumes the external process emits text that is valid in the expected encoding (UTF-8 here). SSH and SCP on Windows can inject BOM bytes (0xFF 0xFE), causing immediate crashes. text=False + decode(errors="replace") is the safe default.
- Users don't know paths: When building file management tools, assuming users know exact filenames or VM paths is dangerous. The pattern should be: partial match → candidate list → selection.
- Hash-based deduplication: Filenames can be the same with different content, or different with the same content. An MD5 hash of the actual content is the reliable comparison method.
- JSON vs SQLite for logs: For small-scale data like transfer logs (dozens of entries), JSON files are ideal: zero dependencies and easy debugging. SQLite shines for memory systems with thousands of entries requiring semantic search.

