This post is a real-world account of developing Xoul, an on-premise Local AI agent platform, where we hit the walls of small LLM Tool Calling limitations and overcame them one by one at the application layer.
Background: "Let's Build a Local Agent"
With large models like GPT or Claude, Tool Calling is near-perfect. But the moment you need to run small local LLMs (Ollama + Llama3/Qwen/Oss under 20B) for on-premise environments or cost reasons, reality hits hard.
Xoul is a personal AI agent platform with this basic flow:
User input
↓
LLM (local[small] or commercial)
↓ Tool Call (JSON)
Tool Router → Function execution
↓
Result fed back to LLM → Final response
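The loop above can be sketched in a few lines of Python. Note that `call_llm` and `TOOL_ROUTER` here are illustrative stand-ins, not Xoul's actual API:

```python
import json

# Stand-in for the real model client (local or commercial).
def call_llm(prompt: str) -> str:
    if prompt.startswith("Tool result:"):
        return "Done: " + prompt.split("\n")[0]
    return json.dumps({"tool": "run_workflow", "args": {"name": "Coin organize"}})

# Tool Router: maps tool names to functions.
TOOL_ROUTER = {
    "run_workflow": lambda args: f"executed workflow {args['name']!r}",
}

def agent_turn(user_input: str) -> str:
    reply = call_llm(user_input)
    try:
        call = json.loads(reply)           # tool call arrives as JSON
    except json.JSONDecodeError:
        return reply                        # plain answer, no tool needed
    result = TOOL_ROUTER[call["tool"]](call["args"])
    # Feed the tool result back to the LLM for the final response
    return call_llm(f"Tool result: {result}\nAnswer the user.")
```

Every failure mode described below lives somewhere in this loop: either the args are wrong, or the JSON does not parse at all.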
Running 30+ tools on this architecture — workflow management, scheduling, Python code execution — we hit three major problems.
Limitation 1: The LLM Corrupts Parameters
The Problem
User: "Run the 'Organize My Coin When +-20%' workflow"
The LLM needs to call run_workflow. What we actually got:
{ "tool": "run_workflow", "args": { "name": "Coin organize" } }
The actual DB name was "내 코인 현재 +- 20일때 정리" (roughly "organize my coins when they move +/-20%"), so the result was predictably Not Found.
The first instinct was to fix this with prompting: "Always call list_workflows first to verify the exact name."
Attempt 1: Prompt Engineering → Failed
The model followed the instruction sometimes and ignored it other times; small LLMs tend to forget early instructions as the context grows. When users issued direct execution commands, it skipped the list query entirely.
Attempt 2: 3-Stage Fuzzy Matching → Core Solution ✅
We redesigned the backend to match as flexibly as possible, regardless of what the LLM passes in.
Input: "Coin organize"
↓
[Step 1] Match after stripping spaces/special chars
→ "Coinorganize" vs DB: "내코인현재+-20일때정리" → Fail
↓
[Step 2] LIKE partial match
→ DB search for "Coin" → Fail (not unique enough)
↓
[Step 3] Sentence Embedding cosine similarity
→ "Coin organize" ≈ "내 코인 현재 +- 20일때 정리" → Similarity 0.81 ✅ Auto-execute
Embeddings come from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. The model is loaded at server startup, and each workflow's embedding is stored as a BLOB in the DB on creation/update. At search time, all embeddings are loaded and cosine similarity is computed with numpy.
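A minimal sketch of that storage and lookup path, with hand-written vectors standing in for `model.encode(...)` output so it runs without downloading the model:

```python
import sqlite3
import numpy as np

def to_blob(vec) -> bytes:
    # Serialize a float32 vector for BLOB storage
    return np.asarray(vec, dtype=np.float32).tobytes()

def from_blob(blob: bytes) -> np.ndarray:
    return np.frombuffer(blob, dtype=np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE workflows (name TEXT, embedding BLOB)")
# On workflow creation, the name's embedding is persisted next to it
db.execute("INSERT INTO workflows VALUES (?, ?)",
           ("내 코인 현재 +- 20일때 정리", to_blob([0.1, 0.9, 0.2])))

# At search time: load all embeddings, rank by cosine similarity
query_vec = np.array([0.1, 0.8, 0.3], dtype=np.float32)  # model.encode(query) in practice
rows = db.execute("SELECT name, embedding FROM workflows").fetchall()
best = max(rows, key=lambda r: cosine(query_vec, from_blob(r[1])))
```

Loading every embedding on each search is O(n), which is fine at personal-agent scale; a vector index only becomes worth it with far more workflows.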
Similarity threshold design:
| Similarity | Behavior |
|---|---|
| ≥ 0.75 | Auto-execute (no user confirmation needed) |
| 0.5 ~ 0.75 | Show top 3 candidates for user to pick |
| < 0.5 | Return Not Found |
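Put together, the three stages and the thresholds above can be sketched like this. The `similarity` callback is assumed to wrap the sentence-embedding model; function and constant names are illustrative:

```python
import re

AUTO_RUN, CONFIRM, NOT_FOUND = "auto_run", "confirm", "not_found"

def normalize(s: str) -> str:
    # Step 1 normalization: drop spaces and special characters
    return re.sub(r"[^0-9A-Za-z가-힣]", "", s).lower()

def resolve_workflow(query: str, names: list[str], similarity):
    """similarity(query, name) -> float, assumed to use sentence embeddings."""
    # Step 1: exact match after stripping spaces/special chars
    exact = [n for n in names if normalize(n) == normalize(query)]
    if len(exact) == 1:
        return AUTO_RUN, exact
    # Step 2: LIKE-style partial match, accepted only if unambiguous
    partial = [n for n in names if normalize(query) in normalize(n)]
    if len(partial) == 1:
        return AUTO_RUN, partial
    # Step 3: embedding cosine similarity with thresholds
    scored = sorted(names, key=lambda n: similarity(query, n), reverse=True)
    score = similarity(query, scored[0])
    if score >= 0.75:
        return AUTO_RUN, scored[:1]      # confident: run without confirmation
    if score >= 0.5:
        return CONFIRM, scored[:3]       # ambiguous: let the user pick
    return NOT_FOUND, []
```

The key property is that the LLM's corrupted argument never reaches the DB query directly; it is only ever treated as a search query.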
Limitation 2: JSON Gets Destroyed
When the number of available tools exceeds ~30, small LLMs start to buckle under context window pressure, producing gradually broken JSON — natural language sentences injected into JSON, missing closing brackets, typos in required keys.
On Ollama, this comes back as HTTP 500: error parsing tool call.
Attempt 1: Tool Pruning ✅
We introduced a Tool Registry that dynamically provides only the tools relevant to the user's input.
User: "Run the workflow"
↓
Keyword analysis + Embedding similarity → select relevant toolkits
↓
Only tools from [workflow, code, schedule] toolkits sent to LLM
→ 30-tool full set → compressed to 6~8 tools
Since irrelevant tools simply don't exist in the prompt, JSON parse failures dropped dramatically.
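A minimal sketch of that pruning, assuming a keyword table per toolkit. The toolkit names and keywords here are made up for illustration, and the real registry also scores embedding similarity between the input and each toolkit description:

```python
# Illustrative toolkit layout, not Xoul's actual configuration
TOOLKITS = {
    "workflow": ["run_workflow", "list_workflows", "create_workflow"],
    "code":     ["run_python", "read_file"],
    "schedule": ["add_schedule", "list_schedules"],
    "mail":     ["send_mail", "search_mail"],
}

KEYWORDS = {
    "workflow": ["workflow", "run", "automate"],
    "code":     ["python", "code", "script"],
    "schedule": ["schedule", "remind", "calendar"],
    "mail":     ["mail", "email", "inbox"],
}

def select_tools(user_input: str) -> list[str]:
    text = user_input.lower()
    # Keyword pass; embedding similarity would be a second scoring signal
    picked = [kit for kit, words in KEYWORDS.items()
              if any(w in text for w in words)]
    if not picked:                 # nothing matched: fall back to the full set
        picked = list(TOOLKITS)
    return [tool for kit in picked for tool in TOOLKITS[kit]]
```

The prompt the model sees now carries only the pruned list, so there are simply fewer schemas for it to mangle.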
Attempt 2: Native → Text Fallback ✅
For residual failures, we added automatic retry logic to LLMClient:
except HTTPError as e:
    body = e.read().decode()
    if e.code == 500 and "error parsing tool call" in body:
        # Strip tools, retry in plain text mode
        retry_payload.pop("tools", None)
        retry_payload.pop("tool_choice", None)
        # Receive text response, parse <tool> tags with a regex
        response = call(retry_payload)
We keep text-based tool call format alongside native Tool Calling in the system prompt, so even in fallback mode tools still get executed. This is a Dual Parser architecture.
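The text side of that dual parser can be as small as a regex plus a JSON parse. The `<tool>` tag format below is an assumed example, not Xoul's exact prompt contract:

```python
import json
import re

# Recover tool calls that the model emitted as tagged text instead of
# native tool-call JSON. The tag format is illustrative.
TOOL_TAG = re.compile(r"<tool>\s*(\{.*?\})\s*</tool>", re.DOTALL)

def parse_text_tool_calls(text: str) -> list[dict]:
    calls = []
    for raw in TOOL_TAG.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip a malformed block instead of failing the whole turn
    return calls
```

Because the system prompt always describes both formats, the same turn can succeed whether the model's tool call arrives natively or as text.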
With sLLM-based agents, defensive application-layer design matters more than model quality. Don't trust LLM output. Build thick validation and correction pipelines on both the input and output sides. That's the core of running these systems in production.