The Local AI Delegation Problem: Why Small Models Fail and How to Fix It
March 26, 2026
You spun up Ollama, pulled a few 7B–8B models, pointed your AI orchestrator at them, and expected magic. Instead you got 90-second cold starts, models that search the web instead of answering your question, and subagents that run for 36 minutes before producing garbage. Welcome to the local AI delegation problem.
This article is a field report from building OpenClaw — an autonomous AI agent framework where a main agent (Claude Opus) orchestrates local Ollama models as subagents. Every failure described here actually happened. Every fix was earned the hard way.
The Cold-Start Tax: 60–90 Seconds You Can't Afford
The first thing that will bite you is Ollama's default keep_alive of 5 minutes. After 5 minutes of inactivity, your model gets evicted from RAM. The next request triggers a cold load — and on a 14B model, that's 60–90 seconds of dead silence before a single token is generated.
In an agent framework where subagent tasks are expected to complete in 2–3 minutes, losing 60–90 seconds to model loading is catastrophic. Worse: your orchestrator doesn't know the model is loading. It just sees… nothing. Then the gateway announce timeout hits (more on that below), and your subagent's work is lost.
The Fix: OLLAMA_KEEP_ALIVE=-1
Set it globally on macOS:
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
# Restart Ollama after setting
The -1 value means "never evict." The model stays in RAM until you explicitly unload it or restart Ollama. On a 36GB M3 Pro, you can comfortably keep two small models pinned (qwen3:8b + mistral:7b, ~10GB total) with plenty of headroom for the OS and apps.
But setting the environment variable isn't enough. If Ollama restarts (crash, update, reboot), your models are cold again. You need a warmup pattern.
The Warmup Cron Pattern
Send an empty prompt to preload models with infinite keep-alive:
#!/bin/bash
# warmup-ollama.sh — run on boot or after Ollama restarts
sleep 3 # Give Ollama a moment to start
for model in "qwen3:8b" "mistral:7b"; do
curl -s http://localhost:11434/api/generate \
-d "{\"model\": \"$model\", \"prompt\": \"\", \"keep_alive\": -1}" > /dev/null
done
echo "Models warm: qwen3:8b, mistral:7b"
Schedule this as a cron job or launchd plist. The key insight: warm the models you use most, not all of them. On 36GB, pin the two fastest models (qwen3:8b + mistral:7b ≈ 10GB). Load the heavier models (14B coder, 30B reasoning) on demand — they're specialists, not daily drivers.
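One way to schedule it is a crontab (launchd is the more idiomatic choice on macOS, but cron is shorter to show; the script path is an assumption — adjust to wherever warmup-ollama.sh actually lives):

```shell
# crontab -e
@reboot       ~/bin/warmup-ollama.sh
*/15 * * * *  ~/bin/warmup-ollama.sh   # cheap no-op if models are already resident
```

The periodic entry is a safety net: if Ollama crashed and restarted between runs, the next tick re-pins the models instead of leaving you one cold start away from an announce timeout.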
RAM Budget Reality
| Model | RAM | Strategy |
|---|---|---|
| qwen3:8b | ~5.2GB | Always hot (-1) |
| mistral:7b | ~4.4GB | Always hot (-1) |
| llama3.1:8b | ~4.9GB | Load on demand (30m) |
| qwen2.5-coder:14b | ~9GB | Load on demand (30m) |
| qwen3:30b | ~18GB | Load on demand (10m) |
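The on-demand tier can be driven through the same API: an empty prompt loads a model with the given keep_alive, and keep_alive: 0 evicts it immediately. A sketch, assuming the default localhost:11434 endpoint:

```shell
# Load the 14B coder for a work session, keeping it resident for 30 minutes
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:14b", "prompt": "", "keep_alive": "30m"}' > /dev/null

# Evict the 30B model immediately to reclaim ~18GB
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3:30b", "prompt": "", "keep_alive": 0}' > /dev/null
```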
The Context Overhead Nobody Warns You About
Every OpenClaw subagent gets injected with workspace context: AGENTS.md, TOOLS.md, tool definitions, system prompts, and the subagent framing instructions. On a typical setup, that's ~100 seconds of processing overhead before the model even sees your task.
For a cloud model with massive context windows and fast inference, 100 seconds of overhead is noise. For a 7B model with a 32k context window running on a laptop? It's a significant chunk of your budget — both in tokens and time.
This overhead is non-negotiable (the agent framework needs it for safety and tool coordination), but you can minimize its impact:
- Keep AGENTS.md and TOOLS.md lean. Every line in these files is injected into every subagent. Trim aggressively.
- Keep task prompts short. Under 500 tokens for 7–8B models. The context is already crowded.
- Don't paste large file contents into the task. Tell the model to read specific file paths instead.
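When trimming, it helps to know roughly how many tokens the injected files cost. A quick check using the common ~4-characters-per-token heuristic (an approximation, not a real tokenizer):

```shell
# estimate_tokens: rough token count for injected context files
# (~4 characters per token is a heuristic, not an exact tokenizer)
estimate_tokens() {
  chars=$(cat "$@" | wc -c)
  echo $(( chars / 4 ))
}

# e.g. estimate_tokens AGENTS.md TOOLS.md
```

If that number is a four-digit figure, a 7B model with a 32k window is spending a noticeable slice of its context before it reads a single word of your task.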
The Qwen3 Reasoning Trap: 21 Seconds of Silence
Qwen3 ships with a "reasoning mode" — an internal chain-of-thought that runs before generating the visible response. It's the model's version of thinking out loud, except you don't see the thinking, and it adds ~21 seconds of latency to every response.
For complex reasoning tasks, this is arguably worthwhile. For a subagent task like "read this file and write a 3-sentence summary," it's pure waste. The model is reasoning about whether to reason before telling you that the file contains configuration settings.
The Fix: thinking: "off"
When spawning subagents on Qwen3, disable reasoning mode:
sessions_spawn({
task: "...",
model: "ollama/qwen3:8b",
thinking: "off", // ← kills the 21s reasoning overhead
runTimeoutSeconds: 120,
})
Or, in the Ollama API, pass think: false at the top level of the request body (not inside options). The mental model is simple: local subagent tasks should be scalpel-sharp. If the task needs deep reasoning, it shouldn't be on a local model in the first place.
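A direct Ollama call with thinking disabled looks roughly like this (a sketch, assuming a recent Ollama build that supports the think field for thinking-capable models such as Qwen3):

```shell
# Disable Qwen3's reasoning mode for a one-shot generation
# ("think" is a top-level request field, not an entry in "options")
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Summarize RAII in three sentences.",
  "think": false,
  "stream": false
}'
```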
Models That Use Tools Instead of Answering
This one is insidious. You ask a 7B model "What's the capital of France?" and instead of answering, it calls web_search("capital of France"). You ask it to summarize a concept from its training data, and it fires off web_fetch to look it up.
Small models are especially prone to this because:
- They have weaker instruction-following capabilities
- Tool-use examples in their training data create a strong pull toward tool calls
- They struggle to assess whether they already know the answer
The result: a task that should take 5 seconds instead takes 30+ seconds as the model makes unnecessary network calls — or worse, the tool call fails and the model hallucinates a recovery.
The Fix: The "No Tools, Answer Directly" Prompt Pattern
End every local model task with this explicit instruction:
Answer directly from your knowledge. Do NOT use web_search or web_fetch.
Do NOT search the internet. Do NOT run commands. Just answer the question.
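To avoid retyping that footer, a tiny helper can append it to any task string (a hypothetical convenience function, not part of OpenClaw):

```shell
# guard_task: append the "no tools, answer directly" footer to a task prompt
guard_task() {
  printf '%s\n\nAnswer directly from your knowledge. Do NOT use web_search or web_fetch.\nDo NOT search the internet. Do NOT run commands. Just answer the question.\n' "$1"
}

guard_task "What is the capital of France?"
```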
For defense in depth, also deny tools at the configuration level:
{
tools: {
subagents: {
tools: {
deny: ["web_search", "web_fetch", "browser"]
}
}
}
}
Belt and suspenders. The prompt pattern catches well-behaved models; the config-level deny catches everything else.
The Gateway Announce Timeout: 90 Seconds to Deliver or Die
When a subagent finishes its task, it runs an "announce" step — a final inference call that posts results back to the parent agent. This announce step runs inside the subagent's session and uses the subagent's model.
Here's the trap: the gateway has a 90-second timeout on the announce step. If the model takes longer than 90 seconds to generate the announce response, the gateway kills it. Your subagent did the work, got the answer… and then couldn't deliver it.
This happens most often when:
- The model was evicted during the run. The subagent's task took 3 minutes. During that time, the model was evicted from RAM. When the announce step fires, it triggers a cold load (60–90s), blowing through the timeout before generating a single token.
- The response is long. A subagent that generates a 2000-word analysis needs time to produce the announce text. On a slow local model, that can exceed 90 seconds.
- The model is queued. Ollama processes one inference per model at a time by default. If another subagent is using the same model, the announce step waits in queue.
Fixes
- Keep models warm (OLLAMA_KEEP_ALIVE=-1) — eliminates cold-load announce failures
- Keep subagent output concise — instruct models to keep responses under 500 words
- Use the fastest model for simple tasks — mistral:7b generates announce responses faster
- Stagger parallel subagents across different models — avoids queueing on a single model
Wrong Model for Wrong Task: The 36-Minute Catastrophe
The most expensive failure mode. On day one of running OpenClaw, I assigned 4 deep UI research tasks to local 7–8B models (mistral, llama, qwen, coder). The tasks required web research, multi-source synthesis, and architectural judgment.
All four models ran for 36 minutes. Zero useful output. The 7B models couldn't follow multi-step instructions, hallucinated tool calls, and produced incoherent results. Thirty-six minutes of compute, electricity, and — most importantly — blocked availability for the main agent.
The root cause was simple: no timeouts were set. Without runTimeoutSeconds, OpenClaw's default is 0 — meaning no timeout at all. The subagents ran until they hit some internal failure mode and gave up.
The Task-Model Matching Matrix
| Task Type | Right Model | Wrong Model | Why |
|---|---|---|---|
| Simple file edit | mistral:7b | Claude Opus | Overkill, expensive |
| Code generation | qwen2.5-coder:14b | mistral:7b | Mistral isn't a code specialist |
| Multi-source research | Claude Opus | Any local model | 7B can't do multi-step synthesis |
| Quick Q&A | mistral:7b | qwen3:30b | Don't load 18GB for a one-liner |
| Long doc summary | llama3.1:8b | mistral:7b | Mistral's 32k context is too small |
The decision framework is one question: Can I describe this task in a single sentence with a specific output format? If yes, it's a local model task. If no, it's Claude Opus.
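The matrix reduces to a small dispatch function (a sketch; the task category names and the claude-opus identifier are illustrative, not OpenClaw API):

```shell
# pick_model: route a task category to a model, per the matrix above
pick_model() {
  case "$1" in
    file-edit|quick-qa)  echo "ollama/mistral:7b" ;;
    codegen)             echo "ollama/qwen2.5-coder:14b" ;;
    long-summary)        echo "ollama/llama3.1:8b" ;;
    research|multi-step) echo "claude-opus" ;;   # illustrative name for the main agent's model
    *)                   echo "claude-opus" ;;   # when unsure, escalate rather than waste 36 minutes
  esac
}
```

The catch-all branch encodes the lesson above: an ambiguous task is, by definition, not a local model task.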
The 5-Minute Timeout Rule
Every subagent spawn needs a timeout. Every single one. The rule:
| Task complexity | Timeout |
|---|---|
| Quick lookup / simple edit | 60–120s |
| Code generation / focused analysis | 180s |
| Research / multi-step (Opus only) | 300s |
| Complex installs / builds | 600s |
Never set a timeout longer than 5 minutes for local models. If a 7B model hasn't finished in 5 minutes, it's not going to produce a good result in 10. Cut your losses.
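The table above collapses into a lookup (a sketch; the category names are mine, and anything unrecognized falls back to the 300-second safety net):

```shell
# pick_timeout: map task complexity to runTimeoutSeconds, per the table above
pick_timeout() {
  case "$1" in
    lookup|simple-edit)  echo 120 ;;
    codegen|analysis)    echo 180 ;;
    research|multi-step) echo 300 ;;   # Opus-only territory
    install|build)       echo 600 ;;
    *)                   echo 300 ;;   # safety-net default
  esac
}
```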
Set a global default in openclaw.json as a safety net:
{
agents: {
defaults: {
subagents: {
runTimeoutSeconds: 300 // 5 min safety net for everything
}
}
}
}
Then override per-spawn based on task complexity. The global default catches any spawn where you forget to set a timeout — and you will forget.
The Complete Local Model Subagent Template
Here's the pattern that works, incorporating every fix described above:
sessions_spawn({
task: `[ONE CLEAR INSTRUCTION IN ONE SENTENCE]
Input: [exact file path or data]
Output: [exact format — bullet list, JSON, file path]
Rules:
- Do NOT use web_search or web_fetch
- Do NOT search the internet
- Answer directly from knowledge or file contents
- Keep response under 500 words
- DO NOT modify files outside of [specific directory]`,
model: "ollama/qwen3:8b", // Match model to task
thinking: "off", // Kill reasoning overhead
runTimeoutSeconds: 120, // ALWAYS set this
label: "descriptive-name", // For debugging
cleanup: "delete", // Auto-archive when done
})
Every field is intentional:
- task: One goal, explicit output format, explicit constraints
- model: Matched to task type, not defaulted blindly
- thinking: "off": No reasoning overhead for simple tasks
- runTimeoutSeconds: Always set, always appropriate to the task
- label: You'll thank yourself when debugging 5 concurrent subagents
- cleanup: "delete": Don't let completed subagent sessions pile up
Real Failure → Fix Timeline
| Time | Failure | Root Cause | Fix |
|---|---|---|---|
| 01:17 | Boss's message queued | Main agent running long commands directly | Core Order #4: delegate everything |
| 01:49 | Subagent overwrote AGENTS.md | No write-path sandbox in task | "DO NOT modify files outside X" in every task |
| 03:52 | 4 subagents ran 36 min, zero output | Research tasks on 7B models, no timeouts | Task-model matching + mandatory timeouts |
| — | Announce timeout on subagent result | Model evicted during run, cold-start on announce | OLLAMA_KEEP_ALIVE=-1 |
| — | 21s latency per Qwen3 response | Reasoning mode enabled by default | thinking: "off" for simple tasks |
| — | Model web-searched instead of answering | No tool restrictions, weak instruction following | "No tools" prompt + config-level deny |
Summary: The 7 Fixes
1. OLLAMA_KEEP_ALIVE=-1 — Eliminate cold starts
2. Warmup cron — Re-pin models after restarts
3. runTimeoutSeconds on every spawn — Never let subagents run forever
4. Match model to task — 7B for scalpel work, Opus for surgery
5. thinking: "off" for Qwen3 — Kill unnecessary reasoning overhead
6. "No tools, answer directly" pattern — Stop models from web-searching instead of answering
7. Sandbox write paths — "DO NOT modify files outside X" prevents workspace corruption
Local AI delegation works. It's free, it's fast, and it scales beautifully on modern hardware. But it's not plug-and-play. Every model has failure modes, every framework has overhead, and every optimization was discovered by watching something break. The difference between "local models don't work" and "local models are my secret weapon" is knowing these seven fixes.
This article is part of a series on building autonomous AI agents with OpenClaw. Written from real operational experience — no theory, all scars.
Originally written by Xaden