Alechko
Your LLM Is Ignoring Its Tools — A Field Guide to On-Prem Tool Calling with Elastic Agent Builder

You pick a model. You serve it with Ollama. You wire it into Elastic Agent Builder. The connector is green. The agent loads. You type a question.

The model responds with a friendly paragraph. It does not call a single tool.

No error. No warning. HTTP 200. The agent is a chatbot now.

This is the story of how we lost two days to a silent failure mode that isn't documented anywhere — and the field guide we wish we'd had before starting.

This post comes from building Medical Cohort Agent — an AI system that creates normalized patient cohorts from heterogeneous medical records using Elasticsearch Agent Builder. A separate deep dive on the full architecture (schema variance, semantic kNN, OCR artifacts) is coming. Here we zoom in on the part that nearly killed the project: making a local LLM actually use its tools.

The Setup

We're building an air-gapped healthcare AI agent. No data leaves the building — regulatory requirement, not a preference. The stack:

  • Elasticsearch 9.3 + Kibana (Agent Builder + Workflows)
  • Ollama serving a local LLM (no cloud dependency)
  • E5-large embeddings via Ollama (CPU, 1024-dim vectors)

Agent Builder is the orchestration layer. It sends the researcher's natural language question to the LLM along with a set of tools: list_indices, get_index_mapping, search, execute_esql, and a custom build_cohort workflow tool. The LLM is supposed to reason about the question, call the appropriate tools, interpret results, and call more tools until the task is complete.

The contract is simple: Agent Builder sends tool definitions using the OpenAI tools parameter and expects the model to respond with structured tool_calls in the response. This is the standard OpenAI function calling protocol. Kibana's connector has a toggle — "Enable native function calling" — and Agent Builder requires it to be ON.

There is a "simulated" fallback mode in Kibana (system-prompt injection with text markers), but it only works for the Observability AI Assistant. Agent Builder doesn't support it. Native tool calling or nothing.
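Concretely, the contract looks like this on the wire. A minimal sketch of the standard OpenAI function calling shapes (field names follow the OpenAI chat completions protocol; the model and tool names match our setup):

```python
# Request: Agent Builder attaches tool definitions via the "tools" parameter.
request = {
    "model": "qwen3:30b",
    "messages": [{"role": "user", "content": "List all medical indices"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_indices",
            "description": "List available medical data indices",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
}

# Response: a model that honors the protocol answers with structured
# tool_calls, not prose. This is what Agent Builder looks for.
response_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "list_indices", "arguments": "{}"},
    }],
}

assert response_message["tool_calls"][0]["function"]["name"] == "list_indices"
```

If `tool_calls` never appears in the response message, the loop never starts — the agent has nothing to execute.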

The Plan

We chose Ollama + Llama 4 Maverick. The reasoning:

  1. Llama 4 is Meta's latest flagship — massive context, strong reasoning
  2. Ollama is the simplest way to serve a model locally — one command to pull, one to serve
  3. Elastic's OpenAI-compatible connector handles the API translation
  4. The whole thing fits on a single GPU node

Clean plan. Obvious choice.

What Actually Happened

The model loaded. The connector went green. We typed a research question in Hebrew.

Llama 4 responded with a thoughtful, well-structured paragraph about how one might go about finding diabetic patients in a medical database. General advice. Suggestions to "check with your data team."

It did not call list_indices. It did not call search. It did not call anything.

We adjusted the agent prompt. Made it more explicit: "You MUST use tools." "Always start by calling list_indices." We added examples. We restructured the system message.

Same result. Polite chat. Zero tool calls.

We tried tool_choice: "required" (forcing tool use). Ollama returned the parameter in the response echo but the model still didn't produce tool_calls.

Two days of prompt engineering a problem that had nothing to do with prompts.

The Root Cause

Ollama decides tool calling support per model via baked-in chat templates.

Every model in the Ollama library either has a validated Jinja template that includes tool handling — marked with a "tools" tag on the model's library page — or it doesn't. If it doesn't, Ollama silently strips the tools parameter from API requests before they reach the model.

The model never sees the tools. It can't call what it doesn't know exists.

  • ✅ Qwen 3, Qwen 2.5 — have the "tools" tag → tools work
  • ✅ Mistral Nemo — has the tag → tools work
  • ❌ Llama 4 (Scout and Maverick) — no tag → tools silently dropped
  • ❌ DeepSeek R1 — no tag → tools silently dropped

There is no user-configurable workaround within Ollama. You can't provide a custom chat template. You can't force tool passthrough. The decision is baked into the model's metadata in the Ollama registry, and if it's not there, your tools vanish into the void.

This is the worst kind of failure: the system behaves as if it's working. The API returns 200. The model responds coherently. If you don't specifically inspect the response for tool_calls, you'd conclude the model is "choosing" not to use tools — maybe the prompt needs work, maybe the tools aren't described well enough. In reality, the model never had the option.

The 30-Second Test That Would Have Saved Us Two Days

Before committing to any model for agent work via Ollama, run this:

curl http://YOUR_LLM:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL_HERE",
    "messages": [{"role": "user", "content": "List all medical indices"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "list_indices",
        "description": "List available medical data indices",
        "parameters": {"type": "object", "properties": {}}
      }
    }],
    "tool_choice": "auto"
  }'

If the response contains "tool_calls" — the model works. Connect it to Agent Builder.

If it returns plain text — that model won't work with Agent Builder. Don't waste time on prompt engineering. The tools aren't reaching the model.

You can also check the model's page on ollama.com/library — look for the "tools" tag. No tag, no tools.
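The curl check can also be scripted for repeatable validation (e.g. in a provisioning playbook). A minimal sketch using only the standard library; `probe` is our helper, not an Ollama API, and it assumes the same OpenAI-compatible endpoint as the curl command above:

```python
import json
from urllib import request as urlreq

def model_calls_tools(response: dict) -> bool:
    """True if a chat completion response contains structured tool_calls."""
    choices = response.get("choices", [])
    if not choices:
        return False
    return bool(choices[0].get("message", {}).get("tool_calls"))

def probe(base_url: str, model: str) -> bool:
    """Send a dummy tool and report whether the model emits tool_calls."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "List all medical indices"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "list_indices",
                "description": "List available medical data indices",
                "parameters": {"type": "object", "properties": {}},
            },
        }],
        "tool_choice": "auto",
    }
    req = urlreq.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlreq.urlopen(req) as resp:
        return model_calls_tools(json.load(resp))

# Offline sanity check of the classifier itself:
good = {"choices": [{"message": {"tool_calls": [{"function": {"name": "list_indices"}}]}}]}
bad = {"choices": [{"message": {"content": "Here is some friendly advice..."}}]}
assert model_calls_tools(good) and not model_calls_tools(bad)
```

Run `probe("http://YOUR_LLM:11434", "YOUR_MODEL_HERE")` for each candidate model before wiring anything into Kibana.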

The Two Escape Routes

Option A: Switch the Serving Layer — vLLM

vLLM doesn't rely on baked-in chat templates. It exposes explicit --tool-call-parser flags that apply tool calling logic at the serving layer, outside the model's template:

| Model | --tool-call-parser |
| --- | --- |
| Qwen 2.5 / 3 | qwen3_xml |
| Llama 3.1 / 3.3 | llama3_json |
| Llama 4 | llama4_pythonic |
| Kimi K2 | kimi_k2 |
| DeepSeek V3 / R1 | deepseek_v3 |
| Mistral | mistral |

Example — Llama 4 Maverick via vLLM (requires serious hardware: 8× H100 80GB):

vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --enable-auto-tool-choice \
  --tool-call-parser llama4_pythonic \
  --chat-template examples/tool_chat_template_llama4_pythonic.jinja \
  --tensor-parallel-size 8

This works. Llama 4 calls tools via vLLM. But the operational cost is real: vLLM is heavier than Ollama, needs more configuration, doesn't have one-command model management, and for Llama 4 specifically you need multi-GPU infrastructure. For our single-VM air-gapped deployment, this was overkill.

Option B: Switch the Model

Find a model that Ollama natively supports for tools, that meets your quality requirements, and that fits your hardware.

This is what we did.

What We Validated

Not all tool-capable models are created equal. "Supports tools" means the serving layer will pass them through. It doesn't mean the model will call the right tool with the correct parameters in the right order for a multi-step agentic workflow.

We tested every viable candidate against our actual agent tasks: schema discovery across 10 indices, field mapping for 4 facilities with Hebrew and English field names, criteria extraction from Hebrew natural language, and workflow invocation with structured JSON parameters.

| Model | VRAM (Q4) | Tool Quality | Notes |
| --- | --- | --- | --- |
| Qwen 3 30B | ~20GB | Excellent | Our production choice — best balance of quality, size, and tool reliability |
| GPT-OSS 20B | ~12GB | Good | Validated by Elastic's own team for Agent Builder |
| Qwen 2.5 32B | ~20GB | Very good | Mature and proven, slightly less agentic than Qwen 3 |
| Qwen 3 8B | ~6GB | Good | Fits on consumer GPU — great for dev/testing |
| Mistral Nemo 12B | ~8GB | Decent | Lightest viable option, struggles with complex multi-step plans |
| Llama 3.1/3.3 70B | ~40GB | Good | Needs 2× GPU, good quality but hardware-heavy |
| Kimi K2 | ~500GB+ | Best agentic | Multi-GPU only; strongest tool orchestration if you have the iron |

Qwen 3 30B won. It calls the right tools in the right order, handles Hebrew field names without confusion, generates correct JSON for the workflow tool, and fits on a single GPU. The model went from "never heard of it" to "production choice" in one afternoon.
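For context, "generates correct JSON for the workflow tool" means emitting a tool call like the one below. The parameter names are illustrative only — not the actual build_cohort schema — but the shape is what matters: `arguments` must be a valid JSON-encoded string, per the OpenAI protocol:

```python
import json

# Hypothetical build_cohort invocation as it would appear inside tool_calls.
# Field names (condition, facilities, age_range) are made up for illustration.
tool_call = {
    "type": "function",
    "function": {
        "name": "build_cohort",
        "arguments": json.dumps({
            "condition": "diabetes mellitus",
            "facilities": ["facility_a", "facility_b"],
            "age_range": {"min": 40, "max": 75},
        }),
    },
}

# The agent runtime must be able to round-trip the arguments back to a dict.
args = json.loads(tool_call["function"]["arguments"])
assert args["condition"] == "diabetes mellitus"
```

Weaker models routinely break this round-trip — truncated JSON, single quotes, trailing commentary inside `arguments` — which is exactly what our validation caught.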

Connecting to Agent Builder

Once you have a working model, the connector setup in Kibana:

Kibana → Stack Management → Connectors → Create → OpenAI

  • Provider: "Other (OpenAI Compatible Service)"
  • URL: http://ollama:11434/v1/chat/completions (or your host URL)
  • Default model: qwen3:30b (or your chosen model)
  • API key: any non-empty string (Ollama ignores it, but the field is required)
  • "Enable native function calling": ON ← this is non-negotiable for Agent Builder

The provider must be "Other (OpenAI Compatible Service)" — not "OpenAI." The "OpenAI" option hardcodes the OpenAI API URL and won't let you point to a local endpoint.

The Broader Lesson

"Model X supports tool calling" is a statement about the model's training. It says nothing about whether your serving infrastructure will actually expose tools to the model at inference time.

The full path that must work, end-to-end:

Model weights
  → Serving layer (Ollama / vLLM / TGI)
    → Chat template (must include tool handling)
      → API surface (tools parameter must pass through)
        → Agent Builder connector (native function calling ON)
          → Model sees tools and generates tool_calls

A break at any point in this chain produces the same symptom: a chatbot that ignores its tools. And the failure is always silent.

In cloud deployments this is invisible — OpenAI, Anthropic, Google handle the serving layer for you. In on-prem / air-gapped deployments, you own every link in the chain. Know which ones can break.

Quick Reference

Before you start

  1. Check the model's Ollama library page for the "tools" tag
  2. Run the curl test against the endpoint with a dummy tool
  3. Inspect the response for "tool_calls" — not just coherent text

If tools aren't working

| Symptom | Cause | Fix |
| --- | --- | --- |
| Model returns text, no tool_calls | Ollama strips tools (no chat template) | Switch model or use vLLM |
| Model calls wrong tools / bad params | Model quality issue | Try Qwen 3 30B or larger model |
| Connector errors in Kibana | Wrong provider type or URL | Use "Other (OpenAI Compatible Service)" |
| Agent works in Obs AI Assistant but not Agent Builder | Simulated mode | Agent Builder needs native tool calling — toggle ON |


This is the first post in the Medical Cohort Agent series. Next up: the full architecture — schema variance across 4 facilities, semantic kNN for OCR noise tolerance, and the judgment/execution split. The agent code is at github.com/e2llm/medical-cohort-agent.

Questions or war stories? insitu.im.

Top comments (2)

klement Gunndu

The silent failure mode where the model just answers conversationally instead of calling tools is something we have hit too. Worth checking if the model system prompt explicitly says "You MUST use tools" — some local models treat tool definitions as optional hints unless you are aggressive about it.

Alechko

That said this is actually the exact trap we describe in the post 😄 we spent two days doing exactly that — "You MUST use tools", explicit examples, tool_choice: required — none of it helped. the root cause was that Ollama silently strips the tools parameter for models that don't have a validated chat template (no "tools" tag on the model's library page). The model never sees the tool definitions, so no amount of prompt engineering can fix it. the 30 second curl test in the post is the quickest way to tell if you're dealing with a prompt issue vs. a serving layer issue. If the response has no tool_calls field at all, it's not the prompt. it's the infra.