In the previous article, I explained how we built the evaluation infrastructure for our AI agent: a hand-curated golden dataset, a 3-run minimum per config, and the discovery that 17% of items flip between identical runs. This article puts that infrastructure to use.
I'm building Mio, an app where you scan a product barcode and get the manufacturing country. The AI agent searches the web, reads pages, cross-references sources, and returns a country with a confidence level. I built the same agent pipeline for 5 providers: Gemini, Anthropic, OpenAI, xAI, and Mistral. Same prompt. Same tools. Same scoring.
Here's what happened when I ran them all against the same benchmark.
The four walls
This isn't a "which model is smartest" comparison. It's an elimination tournament. My agent runs inside a consumer app where real people scan products in a store and wait for an answer. That sets hard constraints.
Latency: under 10 seconds ideally, 15 seconds max. At 20-30 seconds, users put their phone back in their pocket.
Cost: under ~$0.01 per scan. At $0.02, the unit economics don't work at scale.
Accuracy: above ~60% country match. Below that, the app feels broken. Users scan 3 products, get 2 wrong answers, and uninstall.
False confidence: as low as possible. The agent saying "verified: Made in France" when the product is made in China is worse than saying "I don't know." One confident wrong answer destroys trust faster than ten honest unknowns.
If any single dimension is unacceptable, the model is out. Doesn't matter how good the other numbers are.
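In other words, the gate is a boolean AND, not a weighted score. A minimal sketch in Python, with hypothetical names and the thresholds from above (false confidence isn't a hard cutoff, so it's tracked but not gated):

```python
# Hypothetical sketch of the elimination gate: every wall must pass.
# Thresholds mirror the article; names are illustrative, not from the codebase.
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    accuracy: float        # fraction of correct country matches
    cost_per_trace: float  # USD per scan
    latency_s: float       # average seconds per scan
    false_confidence: int  # confident-but-wrong answers per run (minimized, not gated)

def passes_all_walls(r: BenchmarkResult,
                     min_accuracy: float = 0.60,
                     max_cost: float = 0.01,
                     max_latency: float = 15.0) -> bool:
    """A model survives only if NO single dimension is unacceptable."""
    return (r.accuracy >= min_accuracy
            and r.cost_per_trace <= max_cost
            and r.latency_s <= max_latency)

# Example: Haiku-like numbers fail on cost alone, despite decent accuracy.
haiku = BenchmarkResult(accuracy=0.676, cost_per_trace=0.019,
                        latency_s=17.4, false_confidence=7)
print(passes_all_walls(haiku))  # False
```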
The eliminations
Mistral: accuracy (50%)
Tested on our early eval dataset (10 items). Country match: 50%. Cost was the lowest of anything I tested ($0.0006/trace), latency was fine (10.5s). But 50% accuracy means the agent is basically guessing. Didn't proceed to the gold-curated benchmark.
GPT-5.1: accuracy (26.5%)
This was the most surprising result. GPT-5.1 is a strong model on public benchmarks. On our gold-curated dataset (34 items), it scored 26.5% country match. The model returned null/low confidence on almost everything. 20 out of 34 items were "other failures" where the agent never submitted an answer.
I need to be honest here: I'm not 100% sure this is the model's fault. Our OpenAI integration uses the Responses API, and the way tool results get passed back might not work as effectively as Gemini's native function calling. It's possible that a different integration approach would get better results. But at 26.5% on the only run I got, I didn't invest more time debugging it. The other providers worked out of the box.
GPT-4.1: accuracy (43%) + rate limits
Tested on our eval dataset (90 items): 43% country match, $0.014/trace, 17.9s latency. Already below the accuracy threshold. When I tried to run it on the gold-curated dataset at concurrency 20, it immediately hit OpenAI's 30K tokens-per-minute rate limit. Unusable for benchmarking, let alone production.
xAI Grok 4 Fast: latency (22-35s)
This one was interesting. Across multiple runs on 29-30 items, accuracy ranged from 40% to 72.4%. The best run (72.4%) was genuinely competitive. And the cost was among the lowest I tested, second only to Mistral, at around $0.001/trace.
But latency killed it. Every run came in between 22 and 35 seconds. At 33.6 seconds average on the best-accuracy run, a user would be staring at a loading screen for half a minute. In a grocery store. Not viable.
If xAI gets the latency down, Grok would be worth retesting. The accuracy signal was real.
Claude Haiku 4.5: cost ($0.019/trace)
This was the hardest elimination. Haiku got 67.6% accuracy on gold-curated (34 items), with 7 false confidence cases. Not far from Gemini 3 Flash (74.5%). On easy items, it hit 100%. Solid model.
But: $0.019 per trace. That's 4-5x what Gemini costs. And latency was 17.4 seconds on the gold-curated run, with some eval-dev runs hitting 20-29 seconds.
I tested Haiku extensively during early development (before the gold-curated benchmark existed). Multiple prompt versions, different configurations. The accuracy was consistently decent. The cost was consistently too high. At $0.019/scan, 10,000 daily users doing 3 scans each means $570/day just in LLM costs. Gemini at $0.004/scan brings that to $120/day for better accuracy.
Sometimes a good model just doesn't fit the economics.
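That unit-economics math, sketched as a back-of-envelope calculation:

```python
# Back-of-envelope daily LLM cost, using the article's numbers.
def daily_llm_cost(users: int, scans_per_user: int, cost_per_scan: float) -> float:
    return users * scans_per_user * cost_per_scan

print(round(daily_llm_cost(10_000, 3, 0.019), 2))  # 570.0  (Haiku)
print(round(daily_llm_cost(10_000, 3, 0.004), 2))  # 120.0  (Gemini)
```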
Gemini 2.5 Flash: accuracy (45.6%) + false confidence (10.5)
The predecessor to the models I ended up using. 45.6% accuracy with the highest false confidence of any Gemini model (10.5 average across 2 runs). Also 2x more non-deterministic than Flash Lite: 37% of items flipped between identical runs, compared to 17% for Flash Lite.
Bad accuracy, bad FC, unstable results. Out.
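The flip metric itself is simple: the fraction of items whose correctness verdict changed between two identical runs. A hypothetical sketch, assuming each run is a map from item ID to pass/fail:

```python
# Flip rate between two identical benchmark runs: the fraction of items
# whose correctness verdict differs. Names and item IDs are illustrative.
def flip_rate(run_a: dict[str, bool], run_b: dict[str, bool]) -> float:
    """run_a/run_b map item id -> whether the country matched."""
    common = run_a.keys() & run_b.keys()
    flips = sum(1 for item in common if run_a[item] != run_b[item])
    return flips / len(common)

run1 = {"ean:001": True, "ean:002": False, "ean:003": True}
run2 = {"ean:001": True, "ean:002": True, "ean:003": True}
print(flip_rate(run1, run2))  # one of three items flipped, ~0.33
```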
The survivors
Two Gemini models made it through all four walls.
Gemini 3.1 Flash Lite
54-60% accuracy (varies by run), FC around 4-7, latency 8.6s, cost ~$0.006/trace. This was my production model for a while. Low false confidence, decent cost, fast.
But as I described in the prompt engineering article, it was stuck at a local optimum. Every prompt change I tried made things worse. The model was too simple to follow nuanced rules. It worked, but it couldn't get better.
Gemini 3 Flash (the winner)
74.5% accuracy (average of 3 runs: 73.5%, 73.5%, 76.5%), FC 7.7, latency 13.5s, cost ~$0.004/trace. With parallel tool dispatch, the best single run hit 82.6%.
Gemini 3 Flash didn't win by being the best at any single dimension. Not the cheapest (Flash Lite was cheaper). Not the lowest FC (Flash Lite had lower FC at ~5). Not the fastest (Flash Lite at 8.6s beat it). But it had the best balance: highest accuracy by a wide margin, within acceptable bounds on everything else.
And unlike Flash Lite, it responded to prompt optimization. The anti-FC rules, the nudge and anti-looping tweaks, the parallel dispatch instruction, all of these worked on 3 Flash. The model was smart enough to follow nuanced instructions, which meant I could keep improving it.
What I learned
Public benchmarks don't predict agentic performance. GPT-5.1 ranks high on MMLU, HumanEval, and other standard benchmarks. It scored 26.5% on our task. Gemini 3 Flash ranks lower on most public benchmarks. It scored 74.5%. The gap is enormous. Agentic tool-use tasks (search, read, reason, decide) test something completely different from the typical "answer this question" benchmarks.
Most eliminations were about economics, not intelligence. Haiku at 67.6% would have been a perfectly good agent. Grok at 72.4% was competitive with Gemini. Both were eliminated on cost or latency, not accuracy. If you're building a backend service with no latency constraint and a generous budget, your winner might be completely different from mine.
Testing depth should match viability. I ran 100+ benchmarks on Gemini models and 1-5 on everything else. That sounds unfair. But it's the right approach. Once a model hits a disqualifying wall, spending more benchmark budget on it is a waste. I invested deeply where it mattered (the Gemini family, where prompt optimization was possible) and lightly where elimination was clear.
Same prompt ≠ same results. All five providers got the exact same system prompt and tool definitions. The accuracy range was 26.5% to 74.5%. The prompt was designed for Gemini (it's where I iterated), which probably gives Gemini an advantage. A prompt optimized for Haiku or GPT might close some of the gap. But the cost/latency constraints would still eliminate them for my use case.
The unified architecture paid for itself. Building the agent for 5 providers with the same interface was real engineering work. But it meant every comparison was apples-to-apples. Same prompt, same tools, same scoring, same dataset. No "well maybe the OpenAI version just has different tools." If a model underperformed, it was the model (or the API integration), not the setup.
The honest caveats
I want to be clear about what this benchmark does and doesn't show.
It shows how these models perform on my specific task (manufacturing country lookup via web search), with my specific prompt (optimized for Gemini), at my specific scale (consumer app, real-time, cost-sensitive). A different task, a different prompt, or different constraints could produce a completely different ranking.
The GPT-5.1 result in particular might not reflect the model's true capability. If I'd spent more time on the OpenAI integration, the results might improve. I made a pragmatic choice: other providers worked immediately, so I invested time there instead.
And the testing depth is uneven. A handful of runs on Haiku versus 20+ runs on Gemini 3 Flash means I have much more confidence in the Gemini numbers. The gold-curated Haiku result (67.6%) could be an unlucky run. Or a lucky one. With a single run on that dataset, I don't know.
What I do know: Gemini 3 Flash at $0.004/trace and 13.5s gives me 74.5% accuracy. That's the combination I can build a product on. For now.
Because this benchmark is a snapshot, not a verdict. Prices drop. Models improve. Latency gets optimized. And every week I spend building and iterating on this agent, I learn more about prompting, tool design, and what makes a model work well on agentic tasks. The prompt I have today, optimized through 108 runs on Gemini, is a much better starting point for retesting other providers than the generic prompt I started with. Haiku was eliminated on cost, but Anthropic's pricing changes regularly. Grok was eliminated on latency, but xAI is actively optimizing inference speed. GPT-5.1 might just need a different integration approach.
The elimination results are real, but they're not permanent. The benchmark infrastructure stays. The logs of what worked and what didn't stay. When the context changes, I'll rerun. That's the whole point of having the eval framework in place.
Next up: LLM-as-Judge: using Claude to review a Gemini agent. How I automated QA by having a smarter model review every agent trace, and the patterns it found that I never would have caught manually.
This is part of a series on building a production AI agent for Mio. Previous: Why your LLM agent needs a benchmark before it needs a prompt.



