DEV Community

David

your local AI agent shouldn't care which model you run

Most local AI apps that offer agent or coding features lock you into a handful of models. OpenAI function calling? Supported. Anthropic tool use? Supported. That random abliterated Llama finetune you pulled from HuggingFace last Tuesday? Good luck.

The core problem is tool calling. It's the mechanism that turns a chatbot into an agent — the model doesn't just generate text, it emits structured calls to functions like "read this file" or "run this shell command." But there's no universal standard for how models express tool calls. OpenAI uses a JSON schema format. Anthropic has its own protocol. Ollama exposes native tool calling for some models but not others. And many local models — especially uncensored, abliterated, or community-finetuned variants — don't support structured tool calling at all.

So we built a system in Locally Uncensored (v2.2.3) that makes our Codex coding agent work with any model. Not "any model from our approved list." Any model, period.

the dual-strategy approach

The solution is a pair of strategies; the app picks between them automatically based on the model you're running.

Strategy 1: Native tool calling. If the model supports structured tool calling through its API, we use it directly. This is the cleanest path — the model receives tool definitions in its native format and returns well-formed tool calls that we parse and execute. Cloud providers (OpenAI, Anthropic) always go through this path. For Ollama, we maintain a compatibility list of models known to handle native tool calling reliably.

Strategy 2: Hermes XML fallback. If the model isn't on the compatibility list, we switch to an XML-based prompting strategy inspired by the Hermes function-calling format. Instead of relying on the model's native tool calling infrastructure, we inject tool definitions directly into the system prompt as XML schemas and instruct the model to emit tool calls as XML blocks in its response text. Then we parse those blocks out of the raw text output.

The decision happens automatically. You pick a model, start a Codex session, and the app figures out the right strategy. You never see a "this model doesn't support tools" error.
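A rough sketch of that decision logic in Python. The compatibility list and provider names here are illustrative assumptions, not the app's actual data:

```python
# Sketch of the native-vs-XML strategy decision. NATIVE_FAMILIES is
# illustrative -- the app's real compatibility list may differ.
NATIVE_FAMILIES = {"hermes", "qwen", "llama", "mistral", "phi", "deepseek"}

def pick_strategy(provider: str, model: str) -> str:
    """Return 'native' or 'xml' for a given provider/model pair."""
    if provider in ("openai", "anthropic"):
        return "native"  # cloud providers always support structured tool calls
    if provider == "ollama":
        family = model.lower().split(":")[0]  # e.g. "qwen3:32b" -> "qwen3"
        if any(family.startswith(f) for f in NATIVE_FAMILIES):
            return "native"
    return "xml"  # fall back to Hermes-style XML prompting
```

The point is that the caller never sees the decision: it just gets a session that knows how to talk to the model it was given.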

which models get native tool calling

We tested and verified native tool calling with these model families through Ollama:

  • Hermes 3
  • Qwen 3 and Qwen 3.5
  • Llama 3.x and Llama 4
  • Mistral
  • Phi-4
  • DeepSeek
  • Gemma 3 and Gemma 4
  • Nemotron
  • Command-R

Cloud providers — OpenAI, Anthropic — always use native tool calling regardless of the specific model.

For any Ollama model not on this list, the Hermes XML strategy kicks in automatically.

how the XML fallback actually works

When we detect that a model needs the XML strategy, three things happen at prompt construction time.

First, we build an XML tool prompt. Each of the available MCP tools gets described in a structured XML block — its name, description, and parameters with types. This goes into the system prompt along with instructions telling the model to use <tool_call> XML blocks when it needs to invoke a tool.

The conceptual structure looks something like this:

<tools>
  <tool name="file_read">
    <description>Read the contents of a file</description>
    <parameters>
      <param name="path" type="string" required="true">File path to read</param>
    </parameters>
  </tool>
  <!-- ... more tools ... -->
</tools>

When you want to call a tool, emit:
<tool_call>
  <name>file_read</name>
  <arguments>{"path": "/src/main.rs"}</arguments>
</tool_call>
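One way to render tool definitions into that XML, sketched in Python. The dict shape and field names here are assumptions for illustration, not the app's actual schema:

```python
from xml.sax.saxutils import escape

def build_tool_prompt(tools: list[dict]) -> str:
    """Render MCP-style tool descriptions as the XML schema block
    injected into the system prompt. The input dict shape is illustrative."""
    lines = ["<tools>"]
    for t in tools:
        lines.append(f'  <tool name="{escape(t["name"])}">')
        lines.append(f'    <description>{escape(t["description"])}</description>')
        lines.append("    <parameters>")
        for p in t["parameters"]:
            req = "true" if p.get("required") else "false"
            lines.append(
                f'      <param name="{p["name"]}" type="{p["type"]}" '
                f'required="{req}">{escape(p["description"])}</param>'
            )
        lines.append("    </parameters>")
        lines.append("  </tool>")
    lines.append("</tools>")
    return "\n".join(lines)
```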

Second, when the model responds, we scan the raw text for <tool_call> blocks and parse them. The model doesn't need to support any special API — it just needs to be good enough at following instructions to emit the right XML structure in its text output.
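A minimal version of that scan-and-parse step, assuming the exact `<tool_call>` shape shown above (the real parser is presumably more forgiving):

```python
import json
import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<name>(.*?)</name>\s*"
    r"<arguments>(.*?)</arguments>\s*</tool_call>",
    re.DOTALL,
)

def extract_tool_calls(text: str) -> list[tuple[str, dict]]:
    """Scan raw model output for <tool_call> blocks and parse each one.
    Malformed JSON arguments are skipped here; the real pipeline hands
    them to the repair layer first."""
    calls = []
    for name, args in TOOL_CALL_RE.findall(text):
        try:
            calls.append((name.strip(), json.loads(args)))
        except json.JSONDecodeError:
            continue
    return calls
```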

Third, we run a JSON repair layer on the arguments. Even when a model gets the XML wrapper right, the JSON inside the arguments field is often messy — missing quotes, trailing commas, single quotes instead of doubles. The repair layer handles the most common malformations before passing arguments to the tool executor.
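A sketch of such a repair layer, handling just the two malformations named above; real repair layers cover many more cases (unquoted keys, truncated objects, and so on):

```python
import json
import re

def repair_json(raw: str) -> dict:
    """Best-effort repair of common malformations in model-emitted JSON:
    trailing commas and single quotes. Raises if the input is beyond
    these simple fixes."""
    try:
        return json.loads(raw)  # fast path: already valid
    except json.JSONDecodeError:
        pass
    fixed = raw.strip()
    fixed = re.sub(r",\s*([}\]])", r"\1", fixed)  # drop trailing commas
    fixed = fixed.replace("'", '"')               # single -> double quotes
    return json.loads(fixed)
```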

smart tool filtering saves your context window

Here's a practical problem: we have 13 MCP tools available — web_search, web_fetch, file_read, file_write, file_list, file_search, shell_execute, code_execute, system_info, process_list, screenshot, image_generate, and run_workflow. Injecting all 13 tool definitions into every prompt eats a significant chunk of context window, which is a real problem for local models running 4K–8K contexts.

The keyword-based filtering system analyzes each user message and injects only the tools that are relevant. Ask about files? You get file tools. Ask about your system? System info tools. Ask a general question? No tools at all. This saves roughly 80% of tool-definition tokens compared to injecting everything blindly.

This matters more for the XML strategy than native tool calling, because the XML tool definitions live in the prompt text and directly consume context tokens.
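A toy version of that filtering step. The keyword map below is invented for illustration; the app's actual rules differ:

```python
# Illustrative keyword -> tool mapping; the app's real rules differ.
TOOL_KEYWORDS = {
    "file_read":     ["file", "read", "open"],
    "file_write":    ["file", "write", "save", "create"],
    "shell_execute": ["run", "shell", "command", "install"],
    "system_info":   ["system", "cpu", "gpu", "hardware"],
    "web_search":    ["search", "look up", "find online"],
}

def relevant_tools(message: str) -> list[str]:
    """Return only the tools whose trigger keywords appear in the
    user message; a general question gets no tools at all."""
    msg = message.lower()
    return [tool for tool, kws in TOOL_KEYWORDS.items()
            if any(kw in msg for kw in kws)]
```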

the 13 MCP tools

Every tool runs through Tauri's Rust backend with native OS access, gated by a granular permissions system. The full set:

  • file_read, file_write, file_list, file_search — full filesystem access within the selected project directory
  • shell_execute — run any shell command asynchronously with streaming output
  • code_execute — run code snippets in an isolated context
  • web_search, web_fetch — search the web and fetch page content
  • system_info, process_list — query hardware and running processes
  • screenshot — capture the screen (combined with vision for visual debugging)
  • image_generate — generate images through configured backends
  • run_workflow — execute predefined multi-step workflows

The agent can chain up to 20 tool calls per task. In practice, a typical coding task — "add input validation to this form and write tests" — runs 5–10 iterations: read the existing code, write the validation, read the test file, write new tests, run them, fix failures, run again.
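That iteration loop can be sketched like this, with `llm` and `execute_tool` as hypothetical callables standing in for the real model client and tool executor:

```python
MAX_TOOL_CALLS = 20  # cap on chained tool calls per task

def run_agent_task(llm, execute_tool, task: str) -> str:
    """Minimal agent loop: ask the model, execute any tool call it
    emits, feed the result back, stop when it answers in plain text.
    `llm` and `execute_tool` are hypothetical, not the app's API."""
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_TOOL_CALLS):
        reply = llm(history)  # {"tool": ..., "args": ...} or {"text": ...}
        if "text" in reply:
            return reply["text"]  # model is done; plain-text answer
        result = execute_tool(reply["tool"], reply["args"])
        history.append({"role": "tool", "content": result})
    return "task stopped: tool-call limit reached"
```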

honest assessment: where this works and where it doesn't

Let me be direct about the limitations.

Native tool calling with strong models is great. Qwen 3 32B, Llama 4 Scout, and the cloud models handle multi-step tool-use workflows reliably. They pick the right tools, format arguments correctly, and iterate on errors intelligently. If you're running one of these, the experience is genuinely useful for real development work.

The XML fallback is functional but rougher. It works surprisingly well with models that are good at instruction following — even models that were never specifically trained for tool use. But smaller models (7B–13B) will sometimes botch the XML structure, emit partial tool calls, or hallucinate tool names that don't exist. The repair layers catch many of these failures, but you'll see more "tool call failed, retrying" iterations compared to native tool calling.

Abliterated and uncensored models vary wildly. The whole point of universal model support is that you can run whatever model you want, including ones that won't refuse tasks. Some abliterated models are excellent at instruction following and work well as coding agents. Others were abliterated crudely and lost some of their structured-output capabilities in the process. There's no way to predict this without trying — but at least the fallback strategy means they'll get a chance to work instead of being rejected outright.

Cloud models are still better at agentic tasks. This isn't a controversial take. GPT-4o and Claude 3.5 Sonnet are more reliable tool-callers than any local model. The value proposition of local model support isn't that it's better — it's that it's private, free, uncensored, and runs on your hardware. For many use cases, that matters more than peak reliability.

getting started

Locally Uncensored is AGPL-3.0 licensed and free. Clone and run:

git clone https://github.com/PurpleDoubleD/locally-uncensored.git
cd locally-uncensored
# Windows: setup.bat | macOS/Linux: ./setup.sh

Open the Codex tab, pick any model, select a project folder, and start giving it tasks. The app handles the native-vs-XML strategy decision automatically. Check permissions settings if you want to control what the agent can access.

If you're running Ollama, any model works. If yours isn't on the native tool calling list, you'll be on the XML fallback — and honestly, for most tasks, you won't notice the difference until you're doing complex multi-step workflows.

The GitHub repo has full source, releases for Windows/macOS/Linux, and discussions if you want to report which models work well (or don't) as coding agents. Model compatibility reports from the community are genuinely useful — they help us expand the native tool calling list and improve the XML fallback parsing.
