Thomas Landgraf
Four failure modes you'll hit running a local LLM in a multi-step agentic loop

Most local-LLM benchmarks measure single-turn chat quality. Agentic workflows are a different beast: the model has to read state, call a tool, inspect the tool's result, decide whether it's done, and — if not — call another tool. A model that scores 95% on chat benchmarks can fail catastrophically on this loop in characteristic, reproducible ways.

I spent three weeks trying to get local LLMs to reliably run the agentic workflows in a VS Code extension I maintain. Full disclosure: I'm the creator of SPECLAN, an extension that manages product specs as Markdown files with YAML frontmatter — Git-native, one file per requirement, organized in a tree. A core feature, Infer Specs, walks a codebase and proposes a Goal → Feature → Requirement tree by calling MCP tools (create_feature, update_requirement, read_file, etc.) in a loop until it decides the tree is complete. This is a heavy agentic workflow: multi-turn, tool-heavy, and the model has to know when to stop.

The concept works without the tool. Markdown-plus-YAML-plus-Git as a spec format is older than SPECLAN and is the generalizable pattern this article assumes. The failure modes below will hit any agentic workflow that uses MCP tool calls plus structured output — SPECLAN is just where I observed them on seven different models across two local servers.

Here are the four failure modes, in the order you'll probably hit them.

1. The tool-call loop

Setup: an instruction-tuned model, reasonable size, MCP tool wired, seed a requirement and ask the agent to populate it.

What you'll see in the trace:

18:56:25  update_requirement  → R-0049
18:56:25  update_requirement  → R-0049
18:56:26  update_requirement  → R-0049
18:56:26  update_requirement  → R-0049
...  (12 more times, same arguments)

Same tool, same target, same arguments, repeated until the agent runs out of turns. On disk: garbage. The requirement's description got jammed into the YAML title: field, the body is still the untouched template placeholder, and the "Acceptance Criteria" section ends up in the wrong place.

This is not a bug in your code. Google's own Gemma 4 docs acknowledge it: Gemma can emit multiple tool calls per turn and has no built-in loop termination. The model sees the tool's success response but doesn't recognize "I am done." MoE and MatFormer-style elastic variants hit this hardest.

Mitigation (application layer): track tool-call fingerprints in the agent runner. If you see the same (tool_name, stable_arg_hash) three times in a row, interrupt the loop with a synthetic tool result that says "this tool has already produced the expected effect; proceed to the next step or terminate." This works because the loop is usually driven by the model not trusting the first success.

const callHistory: string[] = [];
for await (const step of agent.stream()) {
  if (step.type === 'tool_call') {
    const fp = `${step.name}:${stableStringify(step.args)}`;
    // Interrupt on the third identical call in a row: the last two
    // recorded fingerprints both match the current one. (Guard the
    // length first — .every() is vacuously true on an empty slice.)
    if (callHistory.length >= 2 && callHistory.slice(-2).every(x => x === fp)) {
      yield { role: 'tool', content: 'Already applied. Continue or finish.' };
      continue;
    }
    callHistory.push(fp);
  }
  yield step;
}

Not beautiful, but it has survived every MoE variant I've thrown at it.

2. The hallucinated success

The second failure mode is worse, because it passes superficial validation.

Trace:

17:22:01  update_requirement  → R-8881   [tool call happened]
17:22:03  assistant: "I read the current state of R-8881 and updated its
                     description with a full specification: [long convincing
                     summary of changes]"

File on disk: unchanged.

The tool call fired. Your logs show it. The agent's final answer says the task succeeded. But the tool-call arguments were malformed in a way your MCP server silently ignored — or the model narrated its intent as a completion without ever carrying it out.

This is the "hallucinated success" mode. It's worse than the loop because:

  • Tests that assert "the agent called update_requirement at least once" pass.
  • Tests that assert the file changed fail — but only if you actually assert that.
  • Manual review sees a confident, detailed "I did it" message and believes it.

Mitigation (observability layer): every tool-call MCP server should return a diff summary as part of its response, not just {"success": true}. Something like { changed: true, hash_before: '...', hash_after: '...', fields_modified: ['description', 'acceptance_criteria'] }. Then your agent runner can verify that the model's final claim is consistent with the actual diff history. If the model says "I updated the description" but the diff summary shows changed: false, flag the session as inconsistent.

I also keep a diff_since_seed field that the agent can read at any time — so the model can literally look at what it has and hasn't changed, rather than relying on its own memory of the conversation.
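A minimal sketch of the diff-summary idea — `diffSummary` and `claimIsConsistent` are my illustrative names, not SPECLAN's actual API. The tool result carries before/after hashes and a changed-fields list, and the runner cross-checks the model's final success claim against the recorded history:

```typescript
import { createHash } from 'node:crypto';

// Tool result with a diff summary instead of a bare { success: true }.
interface DiffSummary {
  changed: boolean;
  hash_before: string;
  hash_after: string;
  fields_modified: string[];
}

const sha = (s: string) =>
  createHash('sha256').update(s).digest('hex').slice(0, 12);

// Build the summary from the file content before and after the tool ran.
function diffSummary(before: string, after: string, fields: string[]): DiffSummary {
  const changed = before !== after;
  return {
    changed,
    hash_before: sha(before),
    hash_after: sha(after),
    fields_modified: changed ? fields : [],
  };
}

// Cross-check the model's final claim against the diff history:
// "I updated it" is only consistent if some call actually changed something.
function claimIsConsistent(claimedSuccess: boolean, history: DiffSummary[]): boolean {
  const anyChange = history.some(d => d.changed);
  return claimedSuccess ? anyChange : true;
}
```

If `claimIsConsistent` returns false, flag the session rather than trusting the model's narration.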

3. Edit-as-replace

Different workflow: user runs /add Acceptance Criteria on an existing 5-section spec. Claude and GPT-5 default to echoing the full document with the addition merged in. A weaker local model — gemma-4-26b-a4b in my case — returned only the new section. Three sentences. The editor received three sentences and replaced the entire document.

Silent data loss. No error, no warning.

This isn't exclusive to local models; it happens to cloud models too if your prompt doesn't explicitly state the invariant. But strong models infer the invariant ("they want me to add a section, not replace the doc"). Weak models execute the surface instruction. Prompt-engineer it out:

DOCUMENT COMPLETENESS RULE (NON-NEGOTIABLE)

Your response MUST contain the ENTIRE document, not just the portion
you modified. The editor replaces the current document with your full
response. A partial response will delete everything you didn't emit.

If you cannot reproduce the full document (length, context budget,
uncertainty), return the ORIGINAL document unchanged. A no-op is
always correct; silent truncation is never correct.

The fact that this has to be said in capitals to a 26B model is the whole lesson of weak-model prompting: invariants that strong models treat as obvious must be written down.
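Belt-and-braces, I'd also guard at the editor layer. This is my own sketch (not SPECLAN's real code, and the thresholds are arbitrary): reject a "full document" response that is drastically shorter than the original or that drops existing section headings.

```typescript
// Heuristic guard against edit-as-replace: returns true when the model's
// "full document" response looks like a partial replacement.
function isLikelyPartialReplacement(original: string, proposed: string): boolean {
  // Heuristic 1: the proposal is drastically shorter than the original.
  if (proposed.length < original.length * 0.5) return true;

  // Heuristic 2: the proposal dropped existing section headings.
  const headings = original.split('\n').filter(l => l.startsWith('#'));
  if (headings.length === 0) return false;
  const kept = headings.filter(h => proposed.includes(h)).length;
  return kept / headings.length < 0.8;
}
```

When it fires, keep the original document and surface an error instead of committing the replacement — a rejected edit is recoverable, silent truncation is not.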

4. Structured-output non-compliance

Clarification flow: ask the model to propose JSON matching a schema like { changes: [...], reasoning: "..." }. Downstream code does response.changes.map(...).

A local model with no guidance returned a raw array instead of the wrapped object. .map on undefined, crash.

Here's the subtle part: the schema text was never reaching the prompt. A helper signature had changed; the caller was still passing the schema positionally. TypeScript accepted it. The local model made up its own structure because we never showed it the schema in-band.
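One structural fix for that class of bug — illustrative only, `buildPrompt` and its signature are my reconstruction, not the real helper: pass the schema as a named option instead of positionally, so a drifted signature becomes a compile error or an obviously missing field rather than a silently dropped argument.

```typescript
// Named options make the schema's role explicit in the call site;
// a future signature change can't silently reinterpret a positional arg.
interface PromptOptions {
  schema: object;          // rendered into the prompt in-band
  temperature?: number;
}

function buildPrompt(task: string, opts: PromptOptions): string {
  return [
    task,
    'Respond with JSON matching this exact schema:',
    JSON.stringify(opts.schema, null, 2),
  ].join('\n\n');
}
```

An assertion in tests that the rendered prompt actually contains the schema text would have caught the original bug immediately.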

The lesson: don't rely on OpenAI's response_format or any SDK-level structured-output guarantee for local models. Most local servers implement the OpenAI-compatible API but not the structured-output constraints behind it. Put the schema text directly into the system prompt:

const systemPrompt = `
You return JSON matching this exact schema:

${JSON.stringify(schema, null, 2)}

Critical: the root MUST be an object with a "changes" array and a
"reasoning" string. NEVER return a bare array at the root.
`;

Keep the SDK's structured-output call as belt and braces. Local models will still go off-script occasionally, but the schema-in-prompt approach catches ~95% of the drift in my testing.
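For the remaining drift, a defensive normalizer at the parse boundary is cheap insurance. A sketch, assuming the `{ changes, reasoning }` schema above — accept either the wrapped object or the bare-array failure mode, and fail loudly on anything else:

```typescript
interface ChangeProposal {
  changes: unknown[];
  reasoning: string;
}

// Normalize the model's raw JSON text into the expected shape.
function normalizeResponse(raw: string): ChangeProposal {
  const parsed = JSON.parse(raw);

  // Common local-model failure: a bare array at the root. Wrap it
  // instead of crashing on `response.changes.map(...)`.
  if (Array.isArray(parsed)) {
    return { changes: parsed, reasoning: '' };
  }
  if (parsed && typeof parsed === 'object' && Array.isArray(parsed.changes)) {
    return { changes: parsed.changes, reasoning: String(parsed.reasoning ?? '') };
  }
  throw new Error('Model response matches neither the schema nor a bare array');
}
```

The throw is deliberate: an unrecognized shape should fail the step visibly, not limp forward with `undefined`.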

The benchmark

I seeded a requirement, asked the agent to populate it with a description plus acceptance criteria, and measured three things: did it call the right tool? Did it call the same tool more than 3× (a loop)? Did the file on disk actually change?

Server     Model                    Type     Heavy workflow            Failure mode
Ollama     gemma4:latest (8B)       Dense    PASS                      —
Ollama     gemma4:31b               Dense    PASS                      slow but clean
Ollama     gpt-oss:20b              MoE      PASS tools / FAIL schema  output non-compliance
LM Studio  google/gemma-4-26b-a4b   MoE      FAIL                      tool-call loop ×16
LM Studio  openai/gpt-oss-20b       MoE      FAIL                      hallucinates completion
LM Studio  google/gemma-4-e4b       Elastic  FAIL                      "no final response"
LM Studio  openai/gpt-oss-120b      MoE      FAIL                      tool called, file unchanged

Three findings fall out:

  1. Dense beats MoE / elastic for agentic tool calling. Every MoE and MatFormer variant failed the heavy workflow. Every dense variant passed. The jdhodges 2026 local-LLM tool-calling benchmark shows the same pattern — Qwen 3.5 4B (3.4 GB) at 97.5%, beating models 5× its size. Dense weights + good tool-call fine-tuning dominate.
  2. Ollama beats LM Studio on the same weights. Same gpt-oss:20b, opposite results. The difference is the tool-call translation layer: Ollama maps the model's native tool-call format to the OpenAI-compatible wire faithfully; LM Studio's current implementation loses fidelity in ways that matter. This one surprised me — I'd assumed weights dominated the harness.
  3. Size doesn't rescue you. gpt-oss-120b failed the same way as its 20B sibling. You can't out-parameter a chat-template / tool-call-format mismatch.
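The benchmark's three checks reduce to a handful of assertions over the recorded trace. A sketch — the `ToolCall` shape and `evaluateRun` are my assumptions, not the harness's real types:

```typescript
interface ToolCall {
  name: string;
  argsHash: string;   // stable hash of the call's arguments
}

// Evaluate one run: right tool called? identical call repeated >3×? file changed?
function evaluateRun(trace: ToolCall[], fileChanged: boolean) {
  const calledRightTool = trace.some(c => c.name === 'update_requirement');

  // Count occurrences of each (tool, args) fingerprint to detect loops.
  const counts = new Map<string, number>();
  for (const c of trace) {
    const k = `${c.name}:${c.argsHash}`;
    counts.set(k, (counts.get(k) ?? 0) + 1);
  }
  const looped = Array.from(counts.values()).some(n => n > 3);

  return { calledRightTool, looped, fileChanged, pass: calledRightTool && !looped && fileChanged };
}
```

Checking the file on disk (the third argument) is the part most harnesses skip — and it is exactly the check that catches hallucinated success.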

What to carry away

If you're building something agentic on top of local LLMs, the checklist is short:

  • Start dense. Qwen 3.5/3.6 or Gemma 4 dense, on Ollama, 7B minimum.
  • Add loop detection at the application layer. Don't trust the model to self-terminate.
  • Return meaningful tool results, not {"success": true}. Diff summaries let you detect hallucinated success.
  • Put your schema in the prompt, not just the SDK.
  • Bump context length to 16K+ on LM Studio and reload the model (the setting doesn't apply to already-loaded models — I wasted half a day on "Model did not produce a final response" before I realized).
  • Prompt against weak-model literal-mindedness. The DOCUMENT COMPLETENESS RULE pattern prevents whole classes of silent data loss.

Everything here is generalizable — none of it is specific to how SPECLAN uses MCP tools. If you've run into different failure modes on your local-LLM agentic workflows (especially with Qwen 3.6 dense, Llama 3.3, or GLM-4.7), drop them in the comments. I'm particularly interested in anyone who's gotten Qwen3.6-35B-A3B to self-terminate reliably on a 10+-step tool-calling loop — the MoE training for agentic coding is supposed to have fixed this, but I haven't verified it yet.

