Agent Autopsy, Day 10
I had a SmolLM3-3B running on my laptop. It scored 93.3% on my code quality benchmark. I thought I was one config change away from a local AI agent that could actually do things.
I was wrong.
What I Assumed
Code quality equals agent capability. If a model can generate correct Python, read files, and fix bugs at 93%, it should be able to call a function when asked.
That assumption survived exactly two minutes of testing.
What I Built
I wrote a proper agent readiness benchmark. Six pass/fail dimensions. Can it call a single tool when told? Pick the right one from three? Obey tool_choice: required? Stay quiet when no tools exist? Chain calls across turns? Pass the right arguments?
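The six dimensions can be sketched as a simple pass/fail scorecard. This is an illustrative reconstruction, not the actual benchmark code: the dimension names and prompts are my own labels for the checks described above, and the overall score is just the pass rate.

```python
# Hypothetical sketch of the six pass/fail dimensions. Names and
# example prompts are illustrative, not the real benchmark's.
BENCHMARK_DIMENSIONS = [
    ("single_tool",       "Call the one tool offered when told to."),
    ("tool_selection",    "Pick the right tool out of three."),
    ("forced_call",       "Obey tool_choice: required."),
    ("no_false_positive", "Stay quiet when no tools exist."),
    ("multi_turn_chain",  "Chain two calls across turns."),
    ("argument_fidelity", "Pass the right arguments."),
]

def score(results: dict) -> float:
    """Each dimension is binary; the overall score is the pass rate."""
    passed = sum(1 for name, _ in BENCHMARK_DIMENSIONS if results.get(name))
    return passed / len(BENCHMARK_DIMENSIONS)
```

Under this scoring, a model that passes three of six dimensions lands at exactly 50%.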
I also built a 100-line translation proxy. Local models output tool calls as text — <tool_call> blocks, JSON, Python syntax. Agent frameworks need OpenAI's native tool_calls format. The proxy bridges that gap. Without it, most models score 0%.
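The core of that translation step looks roughly like this. A minimal sketch, assuming the model emits `<tool_call>` blocks containing JSON (one of the three text formats mentioned above); the regex and function names are mine, not the actual proxy's:

```python
import json
import re
import uuid

# Matches <tool_call>{ ... }</tool_call> blocks in raw model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def translate_tool_calls(text: str) -> dict:
    """Convert text-format tool calls into an OpenAI-style assistant message."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        payload = json.loads(match.group(1))
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI's format wants arguments as a JSON *string*, not an object
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    if not calls:
        # No tool calls found: pass the text through as ordinary content
        return {"role": "assistant", "content": text}
    return {"role": "assistant", "content": None, "tool_calls": calls}
```

The subtle part is the last comment: frameworks expect `arguments` as a serialized JSON string, and getting that wrong fails silently in some clients.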
The Results
SmolLM3-3B scored 50%. It calls single tools correctly. It writes files with the right arguments. But give it three tools and ask it to pick — it freezes. Ask it to chain two calls — it can't.
Phi-4-mini scored 90% on code and 17% on agent tasks. The only dimension it passed was "no false positives" — meaning it stayed quiet instead of hallucinating. That's the bar.
Qwen2.5-Coder-14B, at 7.7 gigabytes, scored 85% on code. Couldn't call a single tool. Llama 3.1-8B, same story. Bigger model, zero agent capability.
Why the Gap Exists
Code benchmarks test whether a model can generate correct output from a prompt. Agent tasks test whether it can follow a protocol — receive tools, reason about which to use, call it, receive the result, decide what's next.
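That protocol is a loop, and the loop is what the failing models can't sustain. A minimal sketch, assuming an OpenAI-compatible `chat` callable and a hypothetical `search_files` tool; this is the shape of the protocol, not any specific framework's implementation:

```python
import json

def run_tool(name: str, arguments: dict) -> str:
    # Dispatch to local tool implementations (illustrative stub)
    tools = {"search_files": lambda query: f"found 3 files matching {query!r}"}
    return tools[name](**arguments)

def agent_loop(chat, messages: list, tools: list, max_steps: int = 5) -> str:
    for _ in range(max_steps):
        reply = chat(messages=messages, tools=tools)
        if not reply.get("tool_calls"):
            return reply["content"]  # model decided it's done
        messages.append(reply)
        for call in reply["tool_calls"]:
            result = run_tool(call["function"]["name"],
                              json.loads(call["function"]["arguments"]))
            # Feed the result back so the model can decide what's next
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
    return "max steps reached"
```

Every step is a decision point: call a tool, or answer. A model that aces one-shot code generation has never had to make that decision twice in a row.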
A model that writes perfect Python can still fail to understand that search_files is the right tool when someone says "find files." The 93.3% to 17-50% drop isn't a bug. It's revealing a capability that many open-weight models, even well past 3 billion parameters, simply don't have.
Architecture matters more than size. Qwen 14B at 7.7GB couldn't call a tool. SmolLM3 at 1.8GB could. Parameter count tells you nothing about agent readiness.
What You Should Check
- Test tool calling separately from code quality. The correlation is weak. Your 90% code model might be useless as an agent.
- Use a translation proxy. Without one, the format mismatch alone can pin an otherwise-capable model between 0% and 17%. A 100-line proxy fixes it.
- Don't assume bigger means better for agent tasks. Architecture beats parameter count.
- Benchmark before you build. I built the proxy first. Should've tested the models first.
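The first check above doesn't need a full benchmark. A hypothetical smoke test, assuming you already have a proxy-translated OpenAI-format reply in hand: offer one tool, give an unambiguous instruction, and check for a structured call with the right name.

```python
# Hypothetical smoke test: did the reply contain a structured call
# to the expected tool? Run this before trusting any code-quality score.
def can_call_single_tool(reply: dict, expected_tool: str) -> bool:
    calls = reply.get("tool_calls") or []
    return any(c["function"]["name"] == expected_tool for c in calls)
```

Two minutes of this would have saved me most of day 10.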
The proxy is at github.com/vystartasv/toolcall-proxy. The benchmark is at benchmarks.workswithagents.dev.
Something else will break tomorrow. Something always does.