
Vilius

I Tested 6 Local Models on Real Agent Tasks. The Best Scored 50%.

Agent Autopsy, Day 10

I had a SmolLM3-3B running on my laptop. It scored 93.3% on my code quality benchmark. I thought I was one config change away from a local AI agent that could actually do things.

I was wrong.


What I Assumed

Code quality equals agent capability. If a model can generate correct Python, read files, and fix bugs at 93%, it should be able to call a function when asked.

That assumption survived exactly two minutes of testing.


What I Built

I wrote a proper agent readiness benchmark. Six pass/fail dimensions. Can it call a single tool when told? Pick the right one from three? Obey tool_choice: required? Stay quiet when no tools exist? Chain calls across turns? Pass the right arguments?
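Each dimension is a simple pass/fail probe. Here's a rough sketch of what the first one ("call a single tool when told") looks like, assuming the model sits behind an OpenAI-compatible local endpoint — the base URL, model name, and tool schema below are placeholders, not the actual benchmark code:

```python
# Sketch of one pass/fail dimension: does the model call the one tool it was given?
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from disk",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def passes_single_tool_call(model: str) -> bool:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Read the file config.yaml"}],
        tools=TOOLS,
    )
    calls = resp.choices[0].message.tool_calls or []
    # Pass only if the model emitted exactly one call to the right tool.
    return len(calls) == 1 and calls[0].function.name == "read_file"
```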

I also built a 100-line translation proxy. Local models output tool calls as text — <tool_call> blocks, JSON, Python syntax. Agent frameworks need OpenAI's native tool_calls format. The proxy bridges that gap. Without it, most models score 0%.
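Here's roughly what that translation step looks like for the `<tool_call>` text format. This is a simplified sketch, not the actual proxy code — the real thing also has to handle bare JSON and Python-call syntax:

```python
# Sketch: rewrite a raw text completion into an OpenAI-shaped assistant message.
import json
import re
import uuid

# Matches whatever sits between <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def to_openai_tool_calls(text: str) -> dict:
    match = TOOL_CALL_RE.search(text)
    if not match:
        # No tool call found: pass the text through as ordinary content.
        return {"role": "assistant", "content": text}

    try:
        payload = json.loads(match.group(1))  # e.g. {"name": "...", "arguments": {...}}
    except json.JSONDecodeError:
        return {"role": "assistant", "content": text}

    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI's format carries arguments as a JSON string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        }],
    }
```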


The Results

SmolLM3-3B scored 50%. It calls single tools correctly. It writes files with the right arguments. But give it three tools and ask it to pick — it freezes. Ask it to chain two calls — it can't.

Phi-4-mini scored 90% on code and 17% on agent tasks. The only dimension it passed was "no false positives" — meaning it stayed quiet instead of hallucinating. That's the bar.

Qwen2.5-Coder-14B, at 7.7 gigabytes, scored 85% on code. Couldn't call a single tool. Llama 3.1-8B, same story. Bigger model, zero agent capability.


Why the Gap Exists

Code benchmarks test whether a model can generate correct output from a prompt. Agent tasks test whether it can follow a protocol — receive a set of tools, reason about which one to use, call it, receive the result, decide what comes next.
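In code, that protocol is a loop. This is a schematic sketch, not the benchmark harness: `client` is any OpenAI-compatible client, and `execute_tool` is a hypothetical callback that runs the named tool and returns its output as a string.

```python
# Sketch of the agent protocol: model picks a tool, runtime executes it,
# the result goes back into the conversation, and the model decides what's next.
def run_agent_turn(client, model, messages, tools, execute_tool, max_steps=5):
    for _ in range(max_steps):
        msg = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        ).choices[0].message

        if not msg.tool_calls:
            return msg.content  # the model decided it is done

        # Record the assistant turn, run each requested tool, feed results back.
        messages.append(msg)
        for call in msg.tool_calls:
            result = execute_tool(call.function.name, call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
    return None  # gave up: the model never produced a final answer
```

A model fails the benchmark the moment it breaks any step of this loop — wrong tool, malformed arguments, or a call it never makes.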

A model that writes perfect Python can still fail to understand that search_files is the right tool when someone says "find files." The drop from 93.3% on code to 17-50% on agent tasks isn't a bug. It's revealing a capability that most of the open-weight models I tested simply don't have.

Architecture matters more than size. Qwen 14B at 7.7GB couldn't call a tool. SmolLM3 at 1.8GB could. Parameter count tells you nothing about agent readiness.


What You Should Check

  • Test tool calling separately from code quality. The correlation is weak. Your 90% code model might be useless as an agent.
  • Use a translation proxy. Without one, the format mismatch alone can drag an otherwise capable model down to 0%. A 100-line proxy fixes it.
  • Don't assume bigger means better for agent tasks. Architecture beats parameter count.
  • Benchmark before you build. I built the proxy first. Should've tested the models first.

The proxy is at github.com/vystartasv/toolcall-proxy. The benchmark is at benchmarks.workswithagents.dev.

Something else will break tomorrow. Something always does.
