DEV Community

Abe Wheeler

How do you know budget models are smart enough for your MCP server?

We just shipped evals for sunpeak.ai

The #1 thing I hear from MCP server teams: “Our tools worked great with the latest models, but we had to start from scratch when we realized the free models couldn't use them at all.”

Budget models call tools differently: they misread ambiguous schemas, pass wrong arguments, and fail to chain tool calls. And you don’t find out until users complain.

sunpeak evals test your MCP server across every model that matters, in one command. Say you get 100% on GPT-4o and 40% on Gemini Flash. That 40% is a schema problem you’d never catch by testing manually in ChatGPT. Fix the tool architecture and descriptions, run the evals again, and watch the pass rate climb to 95%.
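To make "schema problem" concrete, here is a minimal Python sketch of one hypothetical tool described two ways. The tool name `search_orders`, its fields, and the `check_args` helper are all illustrative assumptions, not part of sunpeak or any real server. Only the `name` / `description` / `inputSchema` shape follows the MCP tool definition.

```python
# Hypothetical tool, described loosely: nothing tells a model what
# "date" should look like or which status values exist.
vague_tool = {
    "name": "search_orders",
    "description": "Search orders.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "date": {"type": "string"},
            "status": {"type": "string"},
        },
    },
}

# Same hypothetical tool with the ambiguity removed: the date format is
# spelled out, status is an enum, and both fields are required.
strict_tool = {
    "name": "search_orders",
    "description": "List orders placed on one calendar day, filtered by status.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "date": {
                "type": "string",
                "description": "Day in YYYY-MM-DD form, e.g. 2024-05-01.",
            },
            "status": {"type": "string", "enum": ["open", "shipped", "cancelled"]},
        },
        "required": ["date", "status"],
    },
}


def check_args(tool, args):
    """Return a list of problems with args; checks only required keys and enums."""
    schema = tool["inputSchema"]
    problems = [f"missing: {key}" for key in schema.get("required", []) if key not in args]
    for key, value in args.items():
        allowed = schema["properties"].get(key, {}).get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"bad value for {key}: {value!r}")
    return problems


# The kind of call a weaker model might plausibly make: a guessed status, no date.
guess = {"status": "pending"}
print(check_args(vague_tool, guess))   # the vague schema flags nothing
print(check_args(strict_tool, guess))  # the strict schema flags both issues
```

The vague schema accepts the guess without complaint, so the failure only shows up when the tool runs. The stricter version gives a model far less room to guess, which is presumably the kind of fix that moves a pass rate.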

Works with any MCP server. sunpeak connects over MCP, discovers your tools, and runs each eval case dozens of times per model so you get a real pass rate, not a single lucky result.
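Why repeat each case? A rough Python sketch of the idea, not sunpeak's implementation: `make_flaky_case` is a hypothetical stand-in for "ask a model to use the tool and check the result", simulated here with a seeded random number generator so the snippet runs without any model or server.

```python
import random


def pass_rate(run_case, trials):
    """Run one eval case `trials` times and return the fraction that passed."""
    passed = sum(1 for _ in range(trials) if run_case())
    return passed / trials


def make_flaky_case(success_probability, seed):
    """Hypothetical eval case that passes with the given probability."""
    rng = random.Random(seed)
    return lambda: rng.random() < success_probability


# A case that succeeds about 40% of the time can still pass a single run,
# so one green result says little. Fifty runs give a usable estimate.
case = make_flaky_case(success_probability=0.4, seed=0)
rate = pass_rate(case, trials=50)
print(f"pass rate over 50 trials: {rate:.0%}")
```

A single run of that case comes back green roughly four times in ten, which is exactly the "lucky result" a one-shot manual test can hand you.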

Put it in CI. Track reliability over time. Your MCP server isn’t production-ready until the cheapest model your users might connect it to can use it consistently.
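One way a CI gate on those numbers could look, sketched in Python. The 90% threshold, the model labels, and the `gate` and `failing_models` helpers are assumptions for illustration; how sunpeak itself reports results or sets exit codes is not shown here.

```python
# Hypothetical threshold; pick whatever "consistently" means for your users.
MIN_PASS_RATE = 0.90


def failing_models(pass_rates, minimum):
    """Return the models whose pass rate is below `minimum`, sorted by name."""
    return sorted(model for model, rate in pass_rates.items() if rate < minimum)


def gate(pass_rates, minimum=MIN_PASS_RATE):
    """Return a process exit code: 0 if every model clears the bar, else 1."""
    failures = failing_models(pass_rates, minimum)
    for model in failures:
        print(f"FAIL {model}: {pass_rates[model]:.0%} < {minimum:.0%}")
    return 1 if failures else 0


# Pass rates matching the figures quoted earlier in this post.
results = {"gpt-4o": 1.00, "gemini-flash": 0.40}
exit_code = gate(results)  # a CI script would end with sys.exit(exit_code)
```

The gate fails on the weakest model rather than on the average, which matches the point above: the bar is set by the cheapest model your users might connect.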

Top comments (1)

Maaz Ahmed

This is sharp — the "40% on Gemini Flash is a schema problem
you'd never catch manually" point is exactly the kind of thing
teams miss until users complain.

Curious where you land on the runtime side: sunpeak validates
that a tool's schema works across models at test time, but MCP
tool schemas can also drift after they're deployed and passing
your evals — a tool adds a field, escalates an effect, etc. Do
you see eval-time validation and runtime drift monitoring as
the same problem or two layers?

I've been building the runtime drift side (Interlock) and your
eval angle feels like the natural pre-deployment complement —
test it works, then watch it doesn't silently change. Nice work
shipping this.