Google released Gemma 4 E4B with a specific claim: native function calling. "Enhanced coding and agentic capabilities," the model card said. "Native function-calling support, powering highly capable autonomous agents."
4.5 billion effective parameters. Apache 2.0. Runs on a laptop. 50 tokens per second on my Mac Mini M4.
I wanted to believe it. The promise of a local model that could actually use tools — not hallucinate tool calls, not ignore them, but use them — is something I've been chasing for months.
So I ran it through the same battery I've used for every other local model.
## The Test
Two dimensions. Code quality: 10 real agent coding tasks — file parsing, bug fixing, YAML repair, regex extraction. Agent readiness: 6 tool-calling scenarios — single-tool selection, multi-tool discrimination, required adherence, false positive resistance, multi-turn chaining, argument correctness.
The same tests I ran on SmolLM3, Phi-4-mini, and nine other models.
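Every agent-readiness scenario boils down to one question: given tools, did the model emit an actual tool call, or just talk about one? A minimal sketch of that scoring step, assuming an OpenAI-compatible `/chat/completions` response shape (the classifier name is mine, not the harness's):

```python
def classify_response(message: dict) -> str:
    """Label an assistant message: tool call, plain text, or empty.

    Assumes the OpenAI-compatible message shape, where tool calls
    arrive in a `tool_calls` list and prose in `content`.
    """
    if message.get("tool_calls"):
        return "tool_call"
    if (message.get("content") or "").strip():
        return "text"
    return "empty"


# One common failure mode: prose *about* a tool instead of a call.
prose = {"role": "assistant",
         "content": "The read_file function would open the file and..."}

# What a pass looks like: an actual structured call.
call = {"role": "assistant", "content": None,
        "tool_calls": [{"id": "call_1", "type": "function",
                        "function": {"name": "read_file",
                                     "arguments": '{"path": "config.yaml"}'}}]}
```

Each scenario then reduces to comparing the label (and, for passes, the call's name and arguments) against what the scenario expects.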
## Code Quality: 64.2%
Six passes, one partial, three fails. It handled text parsing cleanly — regex extraction, JSON parsing, file analysis all passed. But it fell apart on system tasks. It couldn't fix a YAML indentation error. It couldn't produce the right shell command to check a port. It couldn't recover from a broken `rm` command.
The pattern was clear. Give Gemma 4 E4B structured text and it shines. Give it anything that touches the terminal and it stumbles.
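For reference, the port task needs only a socket probe; a typical shell answer is `lsof -i :PORT` or `ss -ltn`. The same check in a few lines of Python (a sketch of the idea, not the benchmark's exact prompt):

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)           # don't hang on filtered ports
        return s.connect_ex((host, port)) == 0  # 0 means connected
```

That this fits in five lines is what makes the failure notable: the task is small, common, and well represented in training data.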
## Agent Readiness: 33.3%
Two out of six.
It correctly picked `search_files` over `read_file` when asked to find a config file. That's the good news — it discriminated between tools. And when given no tools at all, it stayed quiet and answered the question. No hallucinated function calls.
But it failed at everything else. It refused to call a tool when asked to read a file — it described what `read_file` would do instead. When I set `tool_choice: required`, it ignored it entirely and returned text. It couldn't chain two calls together. And when asked to write a file with specific content, it didn't call any tool at all.
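To reproduce the `tool_choice: required` failure yourself, the request shape is the same on any OpenAI-compatible server (llama.cpp, LM Studio, and Ollama all expose one). The tool schema below is a stand-in I wrote, not the benchmark's exact definition:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from disk and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

payload = {
    "model": "gemma-4-e4b",  # placeholder; use your server's model name
    "messages": [{"role": "user", "content": "Read config.yaml for me."}],
    "tools": [read_file_tool],
    # "required" tells the server a tool call MUST come back;
    # a compliant model never answers this request in plain text.
    "tool_choice": "required",
}

body = json.dumps(payload)  # what actually goes over the wire
```

A pass is any response whose message carries `tool_calls`; plain text back from this exact request is an automatic fail.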
## Where It Lands
| Model | Code Quality | Agent Readiness |
|---|---|---|
| SmolLM3 3B | 93.3% | 50.0% |
| Phi-4-mini | 90.0% | 16.7% |
| Gemma 4 E4B | 64.2% | 33.3% |
| Gemma-3n E2B | 76.7% | 0.0% |
| Qwen2.5 0.5B | 74.2% | 0.0% |
Not the worst. Not the best. The middle child.
The "native function calling" is real — it's the only model besides SmolLM3 that reliably produces tool calls at all. Phi-4-mini scrapes out one agent scenario; Gemma-3n, Qwen, and Llama score zero. Gemma 4 E4B gets two. That's genuinely better than the field.
But "better than zero" and "production-ready" are different things.
## What This Means
If you need a local model for text tasks — summarization, extraction, parsing — Gemma 4 E4B is fast, small, and solid. 50 tok/s on modest hardware. 5GB on disk. Apache 2.0.
If you need it to act — to call tools, chain operations, function as an agent — it's not there yet. It passed two of six scenarios; the rest of the time it described tools instead of calling them, or ignored the constraint entirely.
The gap between "native function calling" on a model card and "reliable tool use" in practice is still wide. Google built the plumbing. The wiring isn't finished.
## What You Should Check
- Don't trust "native function calling" claims without testing them yourself
- Test `tool_choice: required` specifically — many models ignore it
- Multi-turn tool chaining is the hardest test — almost nobody passes it
- Code quality and agent readiness are different skills — measure both
- 50 tok/s locally is genuinely useful even without tool calling
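The argument-correctness check from the battery is also only a few lines: parse the JSON string the API returns and compare it to what you expected. A sketch under the same OpenAI-compatible assumption (the helper name is mine):

```python
import json


def arguments_match(tool_call: dict, expected: dict) -> bool:
    """True if the call's arguments parse cleanly and equal `expected`.

    OpenAI-compatible servers return arguments as a JSON *string*,
    so malformed JSON from the model is itself a failure mode.
    """
    try:
        got = json.loads(tool_call["function"]["arguments"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    return got == expected


good = {"function": {"name": "write_file",
                     "arguments": '{"path": "out.txt", "content": "hi"}'}}
bad = {"function": {"name": "write_file", "arguments": "not json"}}
```

Checking arguments, not just tool names, is what separates "produces a tool call" from "produces a tool call you can execute."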
Something else will break tomorrow. Something always does.