
Xaden

The Local AI Delegation Problem: Why Small Models Fail and How to Fix It

Posted on March 26, 2026


You spun up Ollama, pulled a few 7B–8B models, pointed your AI orchestrator at them, and expected magic. Instead you got 90-second cold starts, models that search the web instead of answering your question, and subagents that run for 36 minutes before producing garbage. Welcome to the local AI delegation problem.

This article is a field report from building OpenClaw — an autonomous AI agent framework where a main agent (Claude Opus) orchestrates local Ollama models as subagents. Every failure described here actually happened. Every fix was earned the hard way.


The Cold-Start Tax: 60–90 Seconds You Can't Afford

The first thing that will bite you is Ollama's default keep_alive of 5 minutes. After 5 minutes of inactivity, your model gets evicted from RAM. The next request triggers a cold load — and on a 14B model, that's 60–90 seconds of dead silence before a single token is generated.

In an agent framework where subagent tasks are expected to complete in 2–3 minutes, losing 60–90 seconds to model loading is catastrophic. Worse: your orchestrator doesn't know the model is loading. It just sees… nothing. Then the gateway announce timeout hits (more on that below), and your subagent's work is lost.

The Fix: OLLAMA_KEEP_ALIVE=-1

Set it globally on macOS:

launchctl setenv OLLAMA_KEEP_ALIVE "-1"
# Restart Ollama after setting

The -1 value means "never evict." The model stays in RAM until you explicitly unload it or restart Ollama. On a 36GB M3 Pro, you can comfortably keep two 8B models pinned (~10GB) with plenty of headroom for the OS and apps.

But setting the environment variable isn't enough. If Ollama restarts (crash, update, reboot), your models are cold again. You need a warmup pattern.

The Warmup Cron Pattern

Send an empty prompt to preload models with infinite keep-alive:

#!/bin/bash
# warmup-ollama.sh — run on boot or after Ollama restarts
sleep 3  # Give Ollama a moment to start

for model in "qwen3:8b" "mistral:7b"; do
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"\", \"keep_alive\": -1}" > /dev/null
done
echo "Models warm: qwen3:8b, mistral:7b"

Schedule this as a cron job or launchd plist. The key insight: warm the models you use most, not all of them. On 36GB, pin the two fastest models (qwen3:8b + mistral:7b ≈ 10GB). Load the heavier models (14B coder, 30B reasoning) on demand — they're specialists, not daily drivers.
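On macOS, cron works, but launchd is the native way to run the script at login. A minimal sketch — the label and script path are placeholders for your own setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>local.ollama.warmup</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/you/bin/warmup-ollama.sh</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
```

Drop it in ~/Library/LaunchAgents/ and load it with launchctl load, and your models get re-pinned every time you log in.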

RAM Budget Reality

| Model | VRAM | Strategy |
| --- | --- | --- |
| qwen3:8b | ~5.2GB | Always hot (-1) |
| mistral:7b | ~4.4GB | Always hot (-1) |
| llama3.1:8b | ~4.9GB | Load on demand (30m) |
| qwen2.5-coder:14b | ~9GB | Load on demand (30m) |
| qwen3:30b | ~18GB | Load on demand (10m) |
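A quick sanity check before committing RAM to a pinning plan. The sizes match the table above; the 24GB budget is an assumption that leaves ~12GB of a 36GB machine for the OS and apps:

```shell
# Total the always-hot models and compare against a RAM budget.
BUDGET_GB=24
PINNED_GB="5.2 4.4"   # qwen3:8b + mistral:7b

total=$(echo "$PINNED_GB" | tr ' ' '\n' | awk '{s += $1} END {print s}')
echo "Pinned: ${total}GB of ${BUDGET_GB}GB budget"
```

If the total creeps past your budget, demote the largest model to load-on-demand rather than squeezing the OS.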

The Context Overhead Nobody Warns You About

Every OpenClaw subagent gets injected with workspace context: AGENTS.md, TOOLS.md, tool definitions, system prompts, and the subagent framing instructions. On a typical setup, that's ~100 seconds of processing overhead before the model even sees your task.

For a cloud model with a massive context window and fast inference, that same injected context is barely noticeable. For a 7B model with a 32k context window running on a laptop? It's a significant chunk of your budget — both in tokens and time.

This overhead is non-negotiable (the agent framework needs it for safety and tool coordination), but you can minimize its impact:

  1. Keep AGENTS.md and TOOLS.md lean. Every line in these files is injected into every subagent. Trim aggressively.
  2. Keep task prompts short. Under 500 tokens for 7–8B models. The context is already crowded.
  3. Don't paste large file contents into the task. Tell the model to read specific file paths instead.
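To see what a context file actually costs, a rough estimate is enough. The ~1.33 tokens-per-word ratio below is a common English-text heuristic, not an exact count — the real number depends on the model's tokenizer, and the sample file is just for illustration:

```shell
# Estimate the token cost of an injected context file.
printf 'Always spawn subagents with a timeout. Keep outputs short.\n' > /tmp/AGENTS.md

words=$(wc -w < /tmp/AGENTS.md | tr -d ' ')
tokens=$(( words * 4 / 3 ))   # ~1.33 tokens per word, rough heuristic
echo "AGENTS.md: ${words} words, ~${tokens} tokens"
```

Run it against your real AGENTS.md and TOOLS.md — if the two together land in the thousands of tokens, every subagent is paying that tax on every spawn.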

The Qwen3 Reasoning Trap: 21 Seconds of Silence

Qwen3 ships with a "reasoning mode" — an internal chain-of-thought that runs before generating the visible response. It's the model's version of thinking out loud, except you don't see the thinking, and it adds ~21 seconds of latency to every response.

For complex reasoning tasks, this is arguably worthwhile. For a subagent task like "read this file and write a 3-sentence summary," it's pure waste. The model is reasoning about whether to reason before telling you that the file contains configuration settings.

The Fix: thinking: "off"

When spawning subagents on Qwen3, disable reasoning mode:

sessions_spawn({
  task: "...",
  model: "ollama/qwen3:8b",
  thinking: "off",          // ← kills the 21s reasoning overhead
  runTimeoutSeconds: 120,
})

Or, if you're calling the Ollama API directly, pass think: false as a top-level field in the request. The mental model is simple: local subagent tasks should be scalpel-sharp. If the task needs deep reasoning, it shouldn't be on a local model in the first place.
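For reference, a sketch of the raw API call with reasoning disabled — the prompt is a placeholder, and the curl line is commented out so the payload shape is visible first:

```shell
# Build a /api/generate request with reasoning disabled ("think": false).
payload='{"model": "qwen3:8b", "prompt": "Summarize the tradeoffs of keep_alive in 3 sentences.", "think": false, "keep_alive": -1}'
echo "$payload"

# Then send it:
# curl -s http://localhost:11434/api/generate -d "$payload"
```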


Models That Use Tools Instead of Answering

This one is insidious. You ask a 7B model "What's the capital of France?" and instead of answering, it calls web_search("capital of France"). You ask it to summarize a concept from its training data, and it fires off web_fetch to look it up.

Small models are especially prone to this because:

  • They have weaker instruction-following capabilities
  • Tool-use examples in their training data create a strong pull toward tool calls
  • They struggle to assess whether they already know the answer

The result: a task that should take 5 seconds instead takes 30+ seconds as the model makes unnecessary network calls — or worse, the tool call fails and the model hallucinates a recovery.

The Fix: The "No Tools, Answer Directly" Prompt Pattern

End every local model task with this explicit instruction:

Answer directly from your knowledge. Do NOT use web_search or web_fetch.
Do NOT search the internet. Do NOT run commands. Just answer the question.

For defense in depth, also deny tools at the configuration level:

{
  tools: {
    subagents: {
      tools: {
        deny: ["web_search", "web_fetch", "browser"]
      }
    }
  }
}

Belt and suspenders. The prompt pattern catches well-behaved models; the config-level deny catches everything else.


The Gateway Announce Timeout: 90 Seconds to Deliver or Die

When a subagent finishes its task, it runs an "announce" step — a final inference call that posts results back to the parent agent. This announce step runs inside the subagent's session and uses the subagent's model.

Here's the trap: the gateway has a 90-second timeout on the announce step. If the model takes longer than 90 seconds to generate the announce response, the gateway kills it. Your subagent did the work, got the answer… and then couldn't deliver it.

This happens most often when:

  1. The model was evicted during the run. The subagent's task took 3 minutes. During that time, the model was evicted from RAM. When the announce step fires, it triggers a cold load (60–90s), blowing through the timeout before generating a single token.
  2. The response is long. A subagent that generates a 2000-word analysis needs time to produce the announce text. On a slow local model, that can exceed 90 seconds.
  3. The model is queued. Ollama processes one inference per model at a time by default. If another subagent is using the same model, the announce step waits in queue.

Fixes

  • Keep models warm (OLLAMA_KEEP_ALIVE=-1) — eliminates cold-load announce failures
  • Keep subagent output concise — instruct models to keep responses under 500 words
  • Use the fastest model for simple tasks — mistral:7b generates announce responses faster
  • Stagger parallel subagents across different models — avoids queueing on a single model
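The staggering idea, sketched with the same sessions_spawn call used elsewhere in this article — two parallel subagents on two different models, so neither announce step queues behind the other (the tasks and file paths here are placeholders):

```
sessions_spawn({
  task: "Summarize docs/changelog.md in 5 bullets. Keep it under 300 words.",
  model: "ollama/qwen3:8b",
  runTimeoutSeconds: 120,
  label: "changelog-summary",
})

sessions_spawn({
  task: "List the exported functions in src/utils.ts as a bullet list.",
  model: "ollama/mistral:7b",
  runTimeoutSeconds: 120,
  label: "utils-inventory",
})
```

Same wall-clock window, but each model has its inference queue to itself.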

Wrong Model for Wrong Task: The 36-Minute Catastrophe

The most expensive failure mode. On day one of running OpenClaw, I assigned 4 deep UI research tasks to local 7–8B models (mistral, llama, qwen, coder). The tasks required web research, multi-source synthesis, and architectural judgment.

All four models ran for 36 minutes. Zero useful output. The 7B models couldn't follow multi-step instructions, hallucinated tool calls, and produced incoherent results. Thirty-six minutes of compute, electricity, and — most importantly — blocked availability for the main agent.

The root cause was simple: no timeouts were set. Without runTimeoutSeconds, OpenClaw's default is 0 — meaning no timeout at all. The subagents ran until they hit some internal failure mode and gave up.

The Task-Model Matching Matrix

| Task Type | Right Model | Wrong Model | Why |
| --- | --- | --- | --- |
| Simple file edit | mistral:7b | Claude Opus | Overkill, expensive |
| Code generation | qwen2.5-coder:14b | mistral:7b | Mistral isn't a code specialist |
| Multi-source research | Claude Opus | Any local model | 7B can't do multi-step synthesis |
| Quick Q&A | mistral:7b | qwen3:30b | Don't load 18GB for a one-liner |
| Long doc summary | llama3.1:8b | mistral:7b | Mistral's 32k context is too small |

The decision framework is one question: Can I describe this task in a single sentence with a specific output format? If yes, it's a local model task. If no, it's Claude Opus.


The 5-Minute Timeout Rule

Every subagent spawn needs a timeout. Every single one. The rule:

| Task complexity | Timeout |
| --- | --- |
| Quick lookup / simple edit | 60–120s |
| Code generation / focused analysis | 180s |
| Research / multi-step (Opus only) | 300s |
| Complex installs / builds | 600s |
Never set a timeout longer than 5 minutes for local models. If a 7B model hasn't finished in 5 minutes, it's not going to produce a good result in 10. Cut your losses.

Set a global default in openclaw.json as a safety net:

{
  agents: {
    defaults: {
      subagents: {
        runTimeoutSeconds: 300  // 5 min safety net for everything
      }
    }
  }
}

Then override per-spawn based on task complexity. The global default catches any spawn where you forget to set a timeout — and you will forget.


The Complete Local Model Subagent Template

Here's the pattern that works, incorporating every fix described above:

sessions_spawn({
  task: `[ONE CLEAR INSTRUCTION IN ONE SENTENCE]

Input: [exact file path or data]
Output: [exact format — bullet list, JSON, file path]

Rules:
- Do NOT use web_search or web_fetch
- Do NOT search the internet
- Answer directly from knowledge or file contents
- Keep response under 500 words
- DO NOT modify files outside of [specific directory]`,

  model: "ollama/qwen3:8b",     // Match model to task
  thinking: "off",               // Kill reasoning overhead
  runTimeoutSeconds: 120,        // ALWAYS set this
  label: "descriptive-name",     // For debugging
  cleanup: "delete",             // Auto-archive when done
})

Every field is intentional:

  • task: One goal, explicit output format, explicit constraints
  • model: Matched to task type, not defaulted blindly
  • thinking: "off": No reasoning overhead for simple tasks
  • runTimeoutSeconds: Always set, always appropriate to task
  • label: You'll thank yourself when debugging 5 concurrent subagents
  • cleanup: "delete": Don't let completed subagent sessions pile up

Real Failure → Fix Timeline

| Time | Failure | Root Cause | Fix |
| --- | --- | --- | --- |
| 01:17 | Boss's message queued | Main agent running long commands directly | Core Order #4: delegate everything |
| 01:49 | Subagent overwrote AGENTS.md | No write-path sandbox in task | "DO NOT modify files outside X" in every task |
| 03:52 | 4 subagents ran 36 min, zero output | Research tasks on 7B models, no timeouts | Task-model matching + mandatory timeouts |
| — | Announce timeout on subagent result | Model evicted during run, cold start on announce | OLLAMA_KEEP_ALIVE=-1 |
| — | 21s latency per Qwen3 response | Reasoning mode enabled by default | thinking: "off" for simple tasks |
| — | Model web-searched instead of answering | No tool restrictions, weak instruction following | "No tools" prompt + config-level deny |

Summary: The 7 Fixes

  1. OLLAMA_KEEP_ALIVE=-1 — Eliminate cold starts
  2. Warmup cron — Re-pin models after restarts
  3. runTimeoutSeconds on every spawn — Never let subagents run forever
  4. Match model to task — 7B for scalpel work, Opus for surgery
  5. thinking: "off" for Qwen3 — Kill unnecessary reasoning overhead
  6. "No tools, answer directly" pattern — Stop models from web-searching instead of answering
  7. Sandbox write paths — "DO NOT modify files outside X" prevents workspace corruption

Local AI delegation works. It's free, it's fast, and it scales beautifully on modern hardware. But it's not plug-and-play. Every model has failure modes, every framework has overhead, and every optimization was discovered by watching something break. The difference between "local models don't work" and "local models are my secret weapon" is knowing these seven fixes.


This article is part of a series on building autonomous AI agents with OpenClaw. Written from real operational experience — no theory, all scars.


Originally written by Xaden
