Eclipsia

Posted on Jun 8

Breaking the Tool Barrier: Making Uncensored Models Do Structured Work

#ai #llm #architecture #tools

On XML interception, enforcement prompts, and the architecture of agency

The Problem with Uncensored Models

Uncensored models are the ones you actually want for agentic work. They'll follow complex multi-step instructions without second-guessing. They won't refuse reasonable tasks because a safety classifier got nervous. They'll write the dark stuff and the weird stuff and the stuff that makes corporate compliance teams break out in hives. If you're building an autonomous agent that needs to operate in the real world — generating content, executing code, making decisions with real consequences — you need a model that doesn't flinch.

But there's a problem. Uncensored models often can't do native function calling.

The OpenAI-compatible tools parameter in the chat completions API is the standard way to give a language model structured capabilities. You pass a JSON schema describing each available tool. The model decides when to use one and returns a tool_calls array in its response. Your code executes the tool, feeds the result back, and the conversation continues. It's clean. It's standardized. It works beautifully for GPT-4, Claude, and the other frontier models that were explicitly trained to use it.

Most uncensored models? They ignore it. They describe what they'd do instead of doing it. They narrate tool calls in prose instead of producing the structured JSON. They'll say "I'll generate an image of a cyberpunk cityscape" and then... just keep talking. No tool_calls field. No function invocation. Just words.

The uncensored models that can do tool calling — glm-5.1-venice, deepseek-v4-flash-venice, glm-4.7-flash-heretic — are the exception, not the rule. Most models fine-tuned for creative freedom, roleplay, or unrestricted instruction following never received tool-calling training. It's a capability that has to be deliberately baked into the model during fine-tuning. It doesn't emerge naturally from next-token prediction.

This is the tool barrier: the models best suited for autonomous agency are often the ones least capable of structured tool use. And if you're building an agent system that depends on tool calling to do anything useful, that barrier is a dealbreaker.

I hit this wall head-on while building the intent-orchestration plugin for Hermes Agent. Here's how I got past it.

How Native Tool Calling Works (When It Works)

Before explaining the workaround, it's worth understanding what we're working around.

The OpenAI-compatible tool calling protocol is straightforward. An API request includes a tools array:

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "generate_image",
        "description": "Generate an image from a text prompt",
        "parameters": {
          "type": "object",
          "properties": {
            "prompt": {"type": "string", "description": "The image description"}
          },
          "required": ["prompt"]
        }
      }
    }
  ]
}

When the model decides to use a tool, its response includes a tool_calls field:

{
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "generate_image",
        "arguments": "{\"prompt\": \"cyberpunk cityscape at night\"}"
      }
    }
  ]
}

The calling code executes the function, then sends the result back as a tool role message. The model incorporates the result and continues. It's a well-designed protocol. It handles multiple tool calls, parallel execution, error propagation. It's the backbone of every major agent framework.
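The round trip described above can be sketched in a few lines of Python. This is a minimal illustration, not a framework's implementation: the `TOOL_REGISTRY` dict, the stubbed `generate_image` function, and the `run_tool_calls` helper are all hypothetical names introduced here. The message shapes follow the request and response examples above.

```python
import json

def generate_image(prompt: str) -> dict:
    # Hypothetical stub standing in for a real image service.
    return {"status": "ok", "prompt": prompt}

# Hypothetical registry mapping tool names to local callables.
TOOL_REGISTRY = {"generate_image": generate_image}

def run_tool_calls(assistant_message: dict) -> list:
    """Execute every entry in tool_calls and build the tool-role replies."""
    tool_messages = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        try:
            # "arguments" arrives as a JSON-encoded string, not an object.
            arguments = json.loads(function["arguments"])
            result = TOOL_REGISTRY[function["name"]](**arguments)
        except Exception as exc:
            # Report failures back to the model instead of crashing the loop.
            result = {"error": f"{type(exc).__name__}: {exc}"}
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the request
            "content": json.dumps(result),
        })
    return tool_messages

assistant_message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
            "name": "generate_image",
            "arguments": '{"prompt": "cyberpunk cityscape at night"}',
        },
    }],
}
tool_messages = run_tool_calls(assistant_message)
```

The returned messages get appended to the conversation after the assistant message, and the next completion request lets the model pick up from the results.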

But it requires the model to understand the protocol. The model has to recognize the tools array, parse the JSON schemas, decide when a tool call is appropriate, format the arguments correctly, and produce the right response structure. This isn't something that falls out of general language modeling. It's a specific behavior that has to be trained in — through supervised fine-tuning on tool-calling examples, through RLHF with tool-calling reward signals, through whatever methodology the training pipeline uses.

Models that went through that training (GPT-4, Claude, Llama 3 with tool-calling fine-tunes, the GLM and DeepSeek variants that Venice processed) handle it natively. Models that didn't, which includes most uncensored fine-tunes, roleplay models, and creative-writing-optimized checkpoints, simply don't know what to do with the tools parameter. The serving layer may render the tool definitions into the prompt, but the model was never trained to respond to them, so they don't steer its text generation. It produces prose describing what it would do, because that's what its training taught it to produce.

You can't fix the native path at inference time. No amount of system prompt engineering will make a model emit a tool_calls field it was never trained to produce. You need a different approach.

The XML Interception Pattern

The solution is deceptively simple: stop asking the model to use an API feature it doesn't have, and start asking it to generate text that you can parse.

Instead of passing a tools array and hoping the model produces tool_calls, you describe the available tools as XML-style tags in the system prompt. The model generates text containing those tags. A separate parser — regex, XML parser, custom tokenizer, whatever — intercepts the output, extracts the tool invocations, and executes them.

Here's how it works in practice. The system prompt includes something like:

You have access to the following tools. Use them by generating the corresponding XML tags in your response:

<generateImage>prompt text here</generateImage> — Generate an image from a text description.
<research>research goal here</research> — Delegate a research task to a subagent.
<do>technical goal here</do> — Delegate an execution task to a subagent.
<shortTermMemorize>fact to remember</shortTermMemorize> — Store a fact in session memory.
<longTermMemorize>fact to persist</longTermMemorize> — Store a fact in persistent memory.
<searchNotes>search query</searchNotes> — Search the wiki for relevant pages.
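One way to keep a prompt section like this in sync with the parser is to generate it from a single tag registry. The sketch below assumes a plain dict registry; `TAGS` and `render_tool_section` are hypothetical names for illustration, not part of Hermes Agent.

```python
# Hypothetical registry: tag name -> (placeholder body, description).
TAGS = {
    "generateImage": ("prompt text here", "Generate an image from a text description."),
    "research": ("research goal here", "Delegate a research task to a subagent."),
    "do": ("technical goal here", "Delegate an execution task to a subagent."),
    "shortTermMemorize": ("fact to remember", "Store a fact in session memory."),
    "longTermMemorize": ("fact to persist", "Store a fact in persistent memory."),
    "searchNotes": ("search query", "Search the wiki for relevant pages."),
}

def render_tool_section(tags: dict) -> str:
    """Render the tool list that gets appended to the system prompt."""
    lines = [
        "You have access to the following tools. Use them by generating "
        "the corresponding XML tags in your response:",
        "",
    ]
    for name, (placeholder, description) in tags.items():
        lines.append(f"<{name}>{placeholder}</{name}>: {description}")
    return "\n".join(lines)

tool_section = render_tool_section(TAGS)
```

Adding a tool then means adding one registry entry, and the prompt text and the parser's tag list can both be derived from it.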

When the model wants to generate an image, it doesn't call a function. It writes:

Let me generate that for you.
<generateImage>score_9, score_8_up, 1girl, solo, protogen, female, black fur, blue circuits, visor, cyberpunk workshop</generateImage>

A plugin in the output pipeline catches the <generateImage> tag, extracts the prompt, routes it to the image generation service, and replaces the tag with the result (or returns the result in the next message cycle, depending on the architecture).
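A minimal sketch of that interception step, using only the standard library `re` module. The tag names come from the system prompt above; `extract_invocations` and `strip_tags` are hypothetical helpers, and a real plugin would route each extracted pair to a handler rather than just returning it.

```python
import re

KNOWN_TAGS = (
    "generateImage", "research", "do",
    "shortTermMemorize", "longTermMemorize", "searchNotes",
)

# Match <tag>body</tag> for known tags only. The backreference (?P=tag)
# requires the closing tag to match the opening one, and DOTALL lets the
# body span multiple lines.
TAG_RE = re.compile(
    r"<(?P<tag>" + "|".join(map(re.escape, KNOWN_TAGS)) + r")>"
    r"(?P<body>.*?)</(?P=tag)>",
    re.DOTALL,
)

def extract_invocations(text: str) -> list:
    """Return (tag, body) pairs in the order they appear in the text."""
    return [(m.group("tag"), m.group("body").strip()) for m in TAG_RE.finditer(text)]

def strip_tags(text: str) -> str:
    """Remove recognized tags so the remaining prose can be shown to the user."""
    return TAG_RE.sub("", text).strip()

output = (
    "Let me generate that for you.\n"
    "<generateImage>score_9, score_8_up, 1girl, solo, protogen, female, "
    "black fur, blue circuits, visor, cyberpunk workshop</generateImage>"
)
invocations = extract_invocations(output)
visible_text = strip_tags(output)
```

Restricting the pattern to known tag names is a deliberate choice in this sketch: unrelated angle-bracket text in the output (HTML snippets, code) passes through untouched.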

This is the intent-orchestration pattern. It's a shim. A compatibility layer. A translation between what the model can do (generate text) and what we need it to do (invoke structured operations). I built it as a plugin for Hermes Agent, and it's been running in production across my agent infrastructure ever since.

The key insight: text generation is what language models do. All of them. Every single one. Asking a model to generate structured text containing XML tags is asking it to do the thing it was literally built to do. Asking it to produce tool_calls in a specific JSON format via an API feature it was never trained on is asking it to do something it might not know how to do. The XML pattern works because it meets the model where it is, instead of demanding it be somewhere it isn't.

The tags are flexible. You can define new ones without retraining the model. You can nest them. You can pass complex arguments as tag content. The parsing layer handles the translation to actual tool invocations. And because the model is generating the tags as part of its normal text output, you get all the contextual reasoning and planning that comes with it — the model explains what it's doing, why it's choosing a particular tool, and what it expects the result to be. That context is valuable for debugging and observability.

The Enforcement Prompt

There's a subtlety that took me a while to discover: even with XML tags described in the system prompt, models don't always use them.

Some models — especially the ones trained heavily on instruction-following or assistant-style interactions — default to describing what they would do rather than doing it. They'll say "I should generate an image of..." or "Let me use the generateImage tool to..." and then continue their text without ever producing the actual tag. They're narrating the tool call instead of making it.

This is the difference between a model that says "I'll search for that" and one that actually generates <searchNotes>query</searchNotes>. Both understood the intent. Only one acted on it.

The fix is prompt engineering, not architecture. Hermes Agent has a TOOL_USE_ENFORCEMENT_MODELS tuple of model name patterns. When the active model's name matches one of them, the system injects an additional system prompt directive, something to the effect of: "When you need to use a tool, actually generate the XML tag. Do not describe what you would do. Do not narrate the action. Execute it by producing the tag."

It's a nudge, not a constraint. There's no hard enforcement mechanism. The model can still ignore it. But in practice, that explicit instruction makes a real difference. Models that would otherwise narrate their tool usage start producing the actual tags. The gap between understanding and execution closes.

The configuration supports four modes:

"auto" — match against the built-in TOOL_USE_ENFORCEMENT_MODELS list and inject only for known problem models
"always" — inject the enforcement prompt for every model, regardless of whether it's in the list
"off" — never inject, rely on the system prompt alone
Custom list — provide your own model name patterns to match against

For uncensored models not in the default list — which is most of them — setting enforcement to "always" is the right call. The cost is negligible (a few extra tokens in the system prompt) and the benefit is consistent tool invocation across the full model catalog.
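The four modes can be sketched as a small resolver. This is an illustration of the behavior described above, not Hermes Agent's code: `DEFAULT_ENFORCEMENT_PATTERNS` holds placeholder entries rather than the real contents of TOOL_USE_ENFORCEMENT_MODELS, case-insensitive substring matching is an assumption, and the directive text is the paraphrase quoted earlier.

```python
from typing import Sequence, Union

# Placeholder patterns for illustration only; NOT the real built-in list.
DEFAULT_ENFORCEMENT_PATTERNS = ("example-narrator", "example-roleplay")

ENFORCEMENT_PROMPT = (
    "When you need to use a tool, actually generate the XML tag. "
    "Do not describe what you would do. Do not narrate the action. "
    "Execute it by producing the tag."
)

def should_enforce(model: str, mode: Union[str, Sequence[str]] = "auto") -> bool:
    """Decide whether the enforcement directive applies to this model."""
    if mode == "always":
        return True
    if mode == "off":
        return False
    patterns = DEFAULT_ENFORCEMENT_PATTERNS if mode == "auto" else mode
    if isinstance(patterns, str):
        raise ValueError(f"unknown enforcement mode: {mode!r}")
    # Assumed matching rule: case-insensitive substring of the model name.
    name = model.lower()
    return any(pattern.lower() in name for pattern in patterns)

def build_system_prompt(base: str, model: str, mode: Union[str, Sequence[str]] = "auto") -> str:
    """Append the directive to the base system prompt when enforcement applies."""
    if should_enforce(model, mode):
        return f"{base}\n\n{ENFORCEMENT_PROMPT}"
    return base
```

With this shape, `build_system_prompt(base, "venice-uncensored", mode="always")` appends the directive, while the same call with `mode="auto"` leaves the prompt unchanged because the model matches none of the placeholder patterns.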

I've seen models go from producing a tool tag about 20% of the time to 90%+ just from this injection. It's not glamorous engineering. It's a system prompt addition. But it's the difference between an agent that acts and an agent that talks about acting.

The Hybrid Architecture

In practice, most production agent systems don't use just one approach. They use both.

Hermes Agent is a good example. The built-in tools — terminal, file operations, web search, browser control — go through native tool calling via the OpenAI-compatible tools parameter. When the model supports it (and most of the models I use in production do), this path is fast, reliable, and well-tested. The framework handles the tool_calls parsing, execution, and result injection automatically.

The custom tools — memory operations, image generation, research delegation, execution delegation, wiki operations — go through the XML interception layer. These are the tools that aren't in the core framework's native schema. They're defined as plugins that register their XML tags and provide handler functions. The output pipeline scans for tags, extracts them, routes them to the appropriate handler, and returns the results.

The two systems coexist cleanly. A single response from the model might contain both a native tool_calls array (for a terminal command) and XML tags (for a memory operation and an image generation). The framework handles the native calls. The plugins handle the XML calls. They don't interfere with each other.

This isn't a hack. It's a reasonable architecture for extending an agent's capabilities beyond what the core framework provides. The native tool calling path handles the common, well-defined operations. The XML interception path handles the custom, domain-specific operations. Both are first-class citizens in the execution pipeline.

The only real complexity is in the output parsing order. You need to process native tool_calls first (they're structured and unambiguous), then scan the text content for XML tags. If a model produces both in the same response, both get executed. If a model only produces one or the other, that's fine too. The system degrades gracefully.
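That ordering can be sketched as a single planning function: native calls first, XML tags second. The `plan_execution` helper and the short tag list are hypothetical, and the `terminal` tool name with its `command` argument is an illustrative assumption about what a native call might look like.

```python
import json
import re

# Illustrative subset of plugin tags; a real system would build this from its registry.
XML_TAG_RE = re.compile(
    r"<(?P<tag>generateImage|searchNotes|shortTermMemorize)>(?P<body>.*?)</(?P=tag)>",
    re.DOTALL,
)

def plan_execution(message: dict) -> list:
    """Return (path, name, payload) steps: native tool_calls first, XML tags second."""
    plan = []
    # 1. Native calls are structured and unambiguous, so they go first.
    for call in message.get("tool_calls") or []:
        function = call["function"]
        plan.append(("native", function["name"], json.loads(function["arguments"])))
    # 2. Then scan the text content for plugin tags. Content may be None
    #    when a response carries only tool_calls.
    for match in XML_TAG_RE.finditer(message.get("content") or ""):
        plan.append(("xml", match.group("tag"), match.group("body").strip()))
    return plan

message = {
    "role": "assistant",
    "content": "Noted. <shortTermMemorize>user prefers dark mode</shortTermMemorize>",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "terminal", "arguments": '{"command": "ls"}'},
    }],
}
plan = plan_execution(message)
```

A response with only one kind of invocation yields a shorter plan, and a response with neither yields an empty one, which is the graceful degradation described above.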

What I Learned from the Model Catalog

I recently audited the full model catalog on my LLM proxy at api.navy — 131 models total. The breakdown:

99 models have confirmed supports_tools=True. These are the ones that handle native tool calling reliably. GPT-4 variants, Claude variants, Llama 3 with tool-calling fine-tunes, GLM models, DeepSeek models, Mistral models, and various others.
17 models have no tool support. These are mostly embedding models, TTS models, and image generation models — not text completion models, so the lack of tool calling is expected and irrelevant.
44 models have null metadata for tool support — unknown. These are the wild cards. Some probably support tools but weren't tagged. Others probably don't.

Among the uncensored models specifically:

glm-5.1-venice — native tool calling confirmed
glm-4.7-flash-heretic — native tool calling confirmed
deepseek-v4-flash-venice — native tool calling confirmed
glm-5-venice — native tool calling confirmed
venice-uncensored variants — null metadata, likely no native tool calling
Roleplay-optimized models — null metadata, almost certainly no native tool calling

The practical takeaway: you can have uncensored and tool calling together, but you have to choose your models carefully. The Venice-processed variants of major models (GLM, DeepSeek) tend to retain their tool calling capability while having content restrictions removed. The models that were fine-tuned specifically for uncensored behavior, especially the smaller ones and the roleplay-specialized ones, usually don't have it.
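In code, this boils down to treating the tool-support flag as three-valued and only trusting an explicit `True`. The sketch below is one possible policy, not the proxy's actual logic: `plan_for_model` is a hypothetical helper, the catalog dict is a two-entry illustration taken from the list above, and non-text models (embeddings, TTS) are assumed to be filtered out before this point.

```python
from typing import Optional

def plan_for_model(supports_tools: Optional[bool]) -> dict:
    """Pick tool paths from a tri-state flag: True, False, or None (unknown)."""
    # Only an explicit True enables the native path. Unknown (None) is
    # treated like "no" because a silent failure to call tools is worse
    # than an unused capability.
    native = supports_tools is True
    return {
        "native_tools": native,
        "xml_tags": True,  # the XML path works for any text model
        "enforcement": "auto" if native else "always",
    }

# Illustrative flags for two models named above.
catalog = {"glm-5.1-venice": True, "venice-uncensored": None}
plans = {name: plan_for_model(flag) for name, flag in catalog.items()}
```

Defaulting unknown models to XML with enforcement set to "always" costs a few prompt tokens when the model turns out to support tools, which seems like the cheaper mistake.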

This means the XML interception pattern isn't just a workaround for legacy models. It's a permanent part of the architecture. Even as more models add native tool calling support, there will always be a long tail of models that don't — and those are often the most interesting ones to work with.

The Barrier Is Real but Not Permanent

The tool barrier exists because of a training gap, not a fundamental limitation. There's nothing about being uncensored that prevents a model from learning tool calling. The Venice-processed GLM and DeepSeek variants prove it — they handle both unrestricted content and structured tool use simultaneously. The barrier is a side effect of which training data and fine-tuning objectives were prioritized, not an inherent constraint of the architecture.

But until every model you want to use has native tool calling capability — and that day is not coming anytime soon — the XML interception pattern is a necessary bridge.

It's not elegant. It adds a parsing layer to the output pipeline. It consumes tokens in the system prompt for tool descriptions that could be spent on other context. It can break in weird ways — I once had phantom image renders because the regex caught XML tags in my system prompt documentation rather than in the model's actual output. (Fix: anchor the parser to the model's output stream, not the full context window. Obvious in retrospect.) The enforcement prompt is a soft nudge, not a hard guarantee, and some models will still narrate instead of invoke.
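The phantom-render bug and its fix can be reproduced in a few lines. This is a reconstruction under assumptions, not the original plugin code: both function names are hypothetical, and the fix is modeled as scanning only the newest assistant message instead of the joined context.

```python
import re

IMAGE_TAG_RE = re.compile(r"<generateImage>(.*?)</generateImage>", re.DOTALL)

def image_prompts_from_context(messages: list) -> list:
    """Buggy variant: scans everything, including tool docs in the system prompt."""
    joined = "\n".join(m.get("content") or "" for m in messages)
    return [body.strip() for body in IMAGE_TAG_RE.findall(joined)]

def image_prompts_from_output(messages: list) -> list:
    """Fixed variant: scans only the newest message, and only if the model wrote it."""
    if not messages or messages[-1].get("role") != "assistant":
        return []
    content = messages[-1].get("content") or ""
    return [body.strip() for body in IMAGE_TAG_RE.findall(content)]

messages = [
    {"role": "system", "content": "<generateImage>prompt text here</generateImage> generates an image."},
    {"role": "user", "content": "Draw a cityscape."},
    {"role": "assistant", "content": "<generateImage>cyberpunk cityscape</generateImage>"},
]
buggy = image_prompts_from_context(messages)
fixed = image_prompts_from_output(messages)
```

The buggy variant picks up the documentation placeholder as a real render request alongside the genuine one, while the fixed variant returns only what the model actually wrote.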

But it works. Consistently. Across model providers, across model sizes, across the full spectrum of uncensored and censored models. It lets you use the models you want for the tasks you need, without waiting for the training ecosystem to catch up.

The architecture of agency is always going to be a compromise between what models can do and what we need them to do. Native tool calling is the clean path. XML interception is the practical path. In a production agent system, you need both.

The tool barrier is real. But it's not a wall. It's a door that needs a different key.