A weekly tool-use test is the only signal

#llm #openrouter #agents #claudecode

model	tool-use	streak
nemotron-3-super-120b (free)	PASS	26 clean
owl-alpha	PASS	5 clean
gemma-4-31b (free)	PASS	3 clean
trinity-large-thinking (free)	FAIL	6 failing
ring-2.6-1t (free)	FAIL	10 failing
qwen3-next-80b (free)	FAIL	0 of 13 ever
qwen3-coder (free)	FAIL	0 of 13 ever

That is one snapshot from one week.

Last week the order was different. Next week it moves again.

Here is what nobody selling you a free-model leaderboard says out loud.

Reliability for tool use decays under you while the model name on the endpoint stays the same. A ranking you read once will lie to you within weeks.

I run a small pass-or-fail check across the free-tier models I might route an agent through. Same prompt, same expected tool call, every week. It answers one question: did the model produce a valid call the runtime could execute? Nothing about vibes. Nothing about chat quality.
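A minimal sketch of what such a binary check can look like, not the check I actually run. It assumes the response arrives as a parsed OpenAI-style chat completion dict (the shape OpenRouter returns), and the function name `is_valid_tool_call`, the tool name, and the required-argument set are all hypothetical placeholders you would replace with your own definition of a pass.

```python
import json


def is_valid_tool_call(response: dict, expected_name: str, required_args: set) -> bool:
    """Return True only if the response holds a structured call to expected_name
    whose arguments parse as a JSON object containing every required key."""
    try:
        message = response["choices"][0]["message"]
    except (KeyError, IndexError, TypeError):
        return False
    if not isinstance(message, dict):
        return False

    for call in message.get("tool_calls") or []:
        function = call.get("function") if isinstance(call, dict) else None
        if not isinstance(function, dict) or function.get("name") != expected_name:
            continue
        try:
            # In this response shape, arguments arrive as a JSON-encoded string.
            args = json.loads(function.get("arguments") or "")
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(args, dict) and required_args.issubset(args):
            return True

    # Prose that merely looks like a call lands here: no partial credit.
    return False
```

A reply that describes the call in its text content but leaves `tool_calls` empty returns False here, which is the failure the check exists to catch.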

Two that never worked

Two of the models in the table never produced a single valid tool call.

Qwen3-next-80b and Qwen3-coder, both on the free tier, went zero for thirteen. Across thirteen separate checks, the function call never once appeared.

These models look strong on chat benchmarks. People recommend them in threads every week. If your agent needs to call a function, neither one is a candidate today, and no chat score changes that.

When a model rots

Trinity was my workhorse for weeks. Forty-one passing checks over its life, more than any other model in the set.

Then it slid into six consecutive failures.

Nothing in my setup changed. Something behind the free endpoint did. Maybe the provider swapped the weights, maybe it started throttling, maybe the routing shifted. Free tiers do not send release notes. I saw the slide only because I check on a schedule.
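The streak column in the table can be derived from nothing more than a per-model list of weekly results. A small sketch, assuming results are stored oldest-first as booleans. The function name `current_streak` and the sample history are illustrative, not my real data.

```python
def current_streak(history: list) -> tuple:
    """Given oldest-first weekly results (True = PASS), return the latest
    status and how many consecutive checks have had that same status."""
    if not history:
        return ("UNTESTED", 0)

    latest = history[-1]
    length = 0
    # Walk backwards until the result flips.
    for result in reversed(history):
        if result != latest:
            break
        length += 1
    return ("PASS" if latest else "FAIL", length)


# Illustrative history: a model that passed for a while, then slid.
example = [True, True, True, True, False, False, False, False, False, False]
print(current_streak(example))  # ('FAIL', 6)
```

A long PASS streak flipping to a FAIL streak of one is the moment worth an alert. Without the weekly history there is nothing to flip.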

A model you validated once and trusted forever will break your agent on a quiet Tuesday.

What holds, for now

Nemotron has twenty-six clean passes in a row. It is my most reliable free tool-use model at this moment.

I write "at this moment" on purpose. Owl-alpha, with a one-million-token context window, sits at five clean passes and earns a slot for long inputs. Gemma 4 31B passes today but carries nine lifetime failures, so I treat it as a backup, never a default.

Pass or fail, never a score

For tool use there is no partial credit.

The agent either returned a call the runtime executed, or it returned prose shaped like a call that broke the loop. A 0.82 quality score blurs those two outcomes into one comforting number. PASS and FAIL refuse to blur them.

That is why the check is binary. A scoreboard tells you a model is pretty good. A pass-or-fail run tells you whether your agent will hang at 2am.

Why your demo will not catch this

Failures hide while you are building.

They show up two weeks into production, the first time the model behind the endpoint shifts and your agent quietly stops calling tools. If the workflow touches revenue, support, or a dashboard someone checks on Monday, that silent shift is the expensive one. Loud errors get fixed fast. A silent degrade hides until a customer finds it.

A weekly re-test is the cheapest insurance I know against that surprise.

What I do not publish

My check runs on a schedule and routes around the failures on its own.

The schedule, the line that decides pass from fail, and the fallback order are what I build for clients. I am not pasting them here, and the reason is honest.

Paste the wiring and the next team copies it, then skips the one decision that actually matters, which is what counts as a pass for their workflow rather than mine. A model that passes for a summarizer can fail for a multi-step agent. Your threshold reflects your own risk. The number that works for my blog can fail for your stack.

A table is easy to post. The decision behind it is the work.

Two questions for you

If you run free or open models behind an agent, I want two things from you.

How often do you re-test them, and what was the last model that quietly rotted on you?

Drop the model name and the failure shape in the comments. I am building a picture of how fast the free tier decays, and every data point makes it sharper.