We just shipped evals for sunpeak.ai
The #1 thing I hear from MCP server teams: “Our tools worked great with the latest models, but we had to start from scratch when we realized the free models couldn't use them at all.”
Budget models call tools differently: they misread ambiguous schemas, they pass wrong arguments, they can’t chain tool calls, and you don’t find out until users complain.
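To make the schema failure mode concrete, here’s a hedged sketch. The tool name, fields, and wording are invented for illustration, not taken from any real server: a loose `search_orders` definition that leaves a budget model guessing, next to a tightened one that leaves it nothing to guess wrong.

```ts
// Hypothetical MCP tool definitions, for illustration only.

// Ambiguous: a budget model has to guess the date format and invent
// status values, so it passes wrong arguments.
const searchOrdersLoose = {
  name: "search_orders",
  description: "Search orders",
  inputSchema: {
    type: "object",
    properties: {
      date: { type: "string" },   // "3/4/24"? "March 4"? ISO? The model guesses.
      status: { type: "string" }, // free text invites made-up values
    },
  },
};

// Explicit: formats and allowed values are spelled out.
const searchOrdersStrict = {
  name: "search_orders",
  description:
    "Search orders placed on a single calendar day, filtered by status.",
  inputSchema: {
    type: "object",
    properties: {
      date: {
        type: "string",
        description: "Order date in ISO 8601 format, e.g. 2024-03-04",
        pattern: "^\\d{4}-\\d{2}-\\d{2}$",
      },
      status: {
        type: "string",
        enum: ["pending", "shipped", "delivered", "cancelled"],
        description: "Exact order status; no other values are valid.",
      },
    },
    required: ["date", "status"],
  },
};
```

A frontier model papers over the loose version; a budget model doesn’t.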
sunpeak evals test your MCP server across every model that matters, in one command. You might see 100% on GPT-4o and 40% on Gemini Flash; that 40% is a schema problem you’d never catch testing manually on ChatGPT. Fix the tool’s schema and description, run the evals again, and watch it climb to 95%.
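As a hedged illustration of what gets tested (sunpeak’s real case format may differ), an eval case might pair a user prompt with the tool call a model should produce:

```ts
// Hypothetical eval case shape, for illustration only.
const evalCase = {
  prompt: "Which orders from March 4th, 2024 have shipped?",
  expect: {
    tool: "search_orders",
    arguments: { date: "2024-03-04", status: "shipped" },
  },
};
```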
Works with any MCP server. sunpeak connects over MCP, discovers your tools, and runs each eval case dozens of times per model so you get a real pass rate, not a single lucky result.
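The mechanics behind that are simple. A minimal sketch, assuming a hypothetical `runOnce()` that sends one eval case to one model and checks the resulting tool call; this is not sunpeak’s implementation:

```ts
// Repeat each case many times so one lucky (or unlucky) sample
// doesn't masquerade as the model's true reliability.
async function passRate(
  runOnce: () => Promise<boolean>,
  runs = 30,
): Promise<number> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (await runOnce()) passes++; // each run succeeds or fails independently
  }
  return passes / runs; // e.g. 24/30 = 0.8, a rate you can act on
}
```

A single run of a nondeterministic model tells you almost nothing; thirty runs separate flaky from broken.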
Put it in CI. Track reliability over time. Your MCP server isn’t production-ready until the cheapest model your users might connect it to can use it consistently.
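One way to wire that into CI, as a sketch: a small script that fails the build when any model’s pass rate drops below a threshold you choose. The model names, numbers, and threshold below are placeholders, not real sunpeak output.

```ts
// Hedged CI-gate sketch: nonzero exit code fails the pipeline.
const THRESHOLD = 0.9;

function ciGate(rates: Record<string, number>): void {
  for (const [model, rate] of Object.entries(rates)) {
    const ok = rate >= THRESHOLD;
    console.log(`${model}: ${(rate * 100).toFixed(0)}% ${ok ? "pass" : "FAIL"}`);
    if (!ok) process.exitCode = 1; // fails the CI job
  }
}

ciGate({ "gpt-4o": 1.0, "gemini-flash": 0.4 });
```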