<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ThomasP</title>
    <description>The latest articles on DEV Community by ThomasP (@thomas_p_fe0c76c7).</description>
    <link>https://dev.to/thomas_p_fe0c76c7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3829034%2F7e512833-6e7b-474b-81a7-b175b12fe4ed.jpg</url>
      <title>DEV Community: ThomasP</title>
      <link>https://dev.to/thomas_p_fe0c76c7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thomas_p_fe0c76c7"/>
    <language>en</language>
    <item>
      <title>LLM-as-Judge: using Claude to review a Gemini agent</title>
      <dc:creator>ThomasP</dc:creator>
      <pubDate>Wed, 08 Apr 2026 12:20:54 +0000</pubDate>
      <link>https://dev.to/thomas_p_fe0c76c7/llm-as-judge-using-claude-to-review-a-gemini-agent-4ofd</link>
      <guid>https://dev.to/thomas_p_fe0c76c7/llm-as-judge-using-claude-to-review-a-gemini-agent-4ofd</guid>
      <description>&lt;p&gt;In the &lt;a href="https://getmio.app/blog/benchmarking-7-llms" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I compared 7 models from 4 providers on the same agentic task. Gemini 3 Flash won on the balance of accuracy, cost, and latency. But winning the benchmark doesn't mean the agent is good. 74.5% accuracy means 1 in 4 products gets the wrong answer. And some of those wrong answers come with high confidence.&lt;/p&gt;

&lt;p&gt;The benchmark tells you &lt;em&gt;what&lt;/em&gt; fails. It doesn't tell you &lt;em&gt;why&lt;/em&gt;. For that, I needed something that could look at the agent's reasoning step by step and tell me where the logic broke down.&lt;/p&gt;

&lt;p&gt;So I built a judge.&lt;/p&gt;

&lt;h2&gt;The idea&lt;/h2&gt;

&lt;p&gt;The production agent runs on Gemini 3 Flash. It's fast and cheap, which is why it's in production. But it makes mistakes. Some of those mistakes share patterns that, if I could identify them, would tell me exactly what to fix in the prompt or the pipeline.&lt;/p&gt;

&lt;p&gt;Manually reviewing agent traces is possible but painful. Each trace has 3-6 tool calls, each with a search query, results, page content, and a reasoning step. Reviewing one product takes 10-15 minutes if you're being thorough. Reviewing 50 takes a week.&lt;/p&gt;

&lt;p&gt;The fix: use a smarter model (Claude Opus 4.6) to review the agent's work. Claude has more reasoning capacity than Gemini Flash. It can read an entire agent trace, spot logical errors, verify sources, and try alternative approaches the agent missed. A senior engineer reviewing a junior engineer's work, except the senior engineer is also an LLM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsspqqlt3qydycziq7zlu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsspqqlt3qydycziq7zlu.png" alt="3-phase pipeline diagram" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The 3-phase process&lt;/h2&gt;

&lt;p&gt;The judge follows a strict 3-phase process for every review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: trace analysis (no tools).&lt;/strong&gt; The judge reads the complete agent trace. Every search query, every result, every page read, every reasoning step. For each iteration it analyzes: what query was constructed and why, were the results relevant, what did the agent decide next, was the reasoning logical, were there obvious angles the agent didn't explore. This is pure analysis, no tools, just reading and thinking.&lt;/p&gt;
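&lt;p&gt;To make Phase 1 concrete, here's a minimal sketch of what "reading the trace" can look like as structured data. The trace shape, field names, and the repeated-query check are hypothetical illustrations, not the actual format:&lt;/p&gt;

```python
# Hypothetical trace: one dict per agent iteration.
trace = [
    {"tool": "web_search", "query": "brand X origin", "reasoning": "find origin"},
    {"tool": "web_search", "query": "brand X origin", "reasoning": "retry"},
]

def phase1_checklist(trace):
    """Pure analysis, no tools: per-iteration questions the judge answers."""
    findings = []
    seen = set()
    for i, step in enumerate(trace, 1):
        findings.append({
            "iteration": i,
            "tool": step["tool"],
            # Did the agent loop on the same query? One of the "obvious
            # angles the agent didn't explore" checks.
            "repeated_query": step["query"] in seen,
        })
        seen.add(step["query"])
    return findings
```

In the real system the judge is an LLM doing this reasoning in natural language; the point is only that Phase 1 consumes the trace iteration by iteration.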

&lt;p&gt;&lt;strong&gt;Phase 2: informed verification (web search + page reading).&lt;/strong&gt; Now the judge does its own research. It re-reads pages the agent cited to verify that they actually say what the agent claims. It tries alternative queries the agent missed. It searches in different languages if the product isn't French. It focuses on the weak points identified in Phase 1.&lt;/p&gt;

&lt;p&gt;This is the expensive phase. The judge is doing original research, not just rubber-stamping the agent's work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: comparative verdict.&lt;/strong&gt; The judge compares its findings with the agent's conclusion and produces a structured review. The verdict is one of: correct, incorrect, partially_correct, or uncertain. Each review includes 5 scores (country accuracy, reasoning quality, source reliability, confidence calibration, efficiency), issue tags from a taxonomy of 13 labels, and source-by-source verification (did the page say what the agent claimed?).&lt;/p&gt;

&lt;p&gt;The key rule: if the agent said "unknown" but the judge found the answer, that's "incorrect." The verdict is about whether the agent delivered the right answer, not whether it tried hard.&lt;/p&gt;
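&lt;p&gt;As a sketch, the review shape and the key verdict rule might look like this in code. Field and function names are illustrative, not the actual schema, and the partially_correct case (e.g. right region, wrong country) is omitted:&lt;/p&gt;

```python
from dataclasses import dataclass

VERDICTS = ("correct", "incorrect", "partially_correct", "uncertain")

@dataclass
class JudgeReview:
    verdict: str         # one of VERDICTS
    scores: dict         # country accuracy, reasoning quality, source
                         # reliability, confidence calibration, efficiency
    issue_tags: list     # subset of the 13-label taxonomy
    source_checks: list  # per source: did the page say what the agent claimed?

def resolve_verdict(agent_answer, judge_answer):
    """The key rule: 'unknown' counts as incorrect if the judge found the answer."""
    if agent_answer == "unknown":
        return "incorrect" if judge_answer else "uncertain"
    if judge_answer is None:
        return "uncertain"
    return "correct" if agent_answer == judge_answer else "incorrect"
```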

&lt;h2&gt;What the judge found&lt;/h2&gt;

&lt;p&gt;Across 75 production scans reviewed (20 in a first batch, 55 in a second), the average score is ~50/100. That sounds terrible, but there's an important caveat: I don't run the judge on easy wins. I specifically select cases that seem interesting: "probable" confidence results, scans where the GS1 prefix contradicts the found country, results that look surprising, or products where a user submitted a correction. The judge is a learning tool, not a representative sample.&lt;/p&gt;

&lt;p&gt;The value isn't in the aggregate score. It's in the patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagxm1dfs1bhujtf33dt6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagxm1dfs1bhujtf33dt6.png" alt="top patterns bar chart" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three findings stood out, each with a lesson that applies beyond my specific use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents take shortcuts.&lt;/strong&gt; The biggest pattern (28/55 scans in the second batch): the agent uses all of its tool calls on web searches and almost never reads the actual pages. It finds a search snippet saying "Made in France," treats it as fact, and moves on. But that snippet might be a navigation link, a category filter, or a statement about a different product. The answer was often &lt;em&gt;on the page&lt;/em&gt;, one click away.&lt;/p&gt;

&lt;p&gt;If you're building an agent with tools, check whether it's actually using them all. Ours had &lt;code&gt;read_webpage&lt;/code&gt; available but preferred to stay in the comfortable search-snippet loop.&lt;/p&gt;
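&lt;p&gt;This shortcut is easy to detect mechanically. A rough sketch of the check: &lt;code&gt;read_webpage&lt;/code&gt; is our tool, but the search tool name and the threshold here are illustrative:&lt;/p&gt;

```python
# Flag traces where the agent searched a lot but rarely read pages.
def page_read_ratio(tool_calls):
    searches = sum(1 for t in tool_calls if t == "web_search")
    reads = sum(1 for t in tool_calls if t == "read_webpage")
    total = searches + reads
    return reads / total if total else 0.0

def stuck_in_snippet_loop(tool_calls, min_ratio=0.25):
    """True when the trace never (or almost never) leaves search snippets."""
    return page_read_ratio(tool_calls) < min_ratio
```

A minimum page-read ratio like this can run as a cheap heuristic over every production trace, before any LLM judge gets involved.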

&lt;p&gt;&lt;strong&gt;Curated benchmarks have blind spots.&lt;/strong&gt; The agent never searched by barcode number directly (18/55 scans). It always searched by product name. But some products don't have a clean name in our database, and searching the EAN directly on retailers would have found structured origin fields immediately.&lt;/p&gt;

&lt;p&gt;This pattern was invisible in the benchmark. Every benchmark item had a clean name because I'd curated it that way. The benchmark tested "can the agent find origin for a known product." Production tested "can the agent handle whatever random barcode a user scans." Different question, different failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Patterns evolve, and you need to track them over time.&lt;/strong&gt; Between the first batch (20 reviews) and the second (55 reviews), snippet misinterpretation dropped from 35% to 18%. But wasted tool calls went &lt;em&gt;up&lt;/em&gt; from 15% to 25%. The agent was getting better at some things and worse at others. Without running the analysis twice, I would have missed both trends.&lt;/p&gt;

&lt;h2&gt;The cost problem (and an ugly but effective solution)&lt;/h2&gt;

&lt;p&gt;The judge runs Claude Opus 4.6 via the &lt;a href="https://docs.anthropic.com/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Anthropic API&lt;/a&gt;, with web search and page reading tools. Phase 2 alone can involve 4-8 tool calls. Each review costs between $0.40 and $0.70.&lt;/p&gt;

&lt;p&gt;For 50 products, that's $20-35. Not catastrophic, but too expensive for regular QA. I wanted to review every interesting production scan, not just a sample.&lt;/p&gt;

&lt;p&gt;My solution was pragmatic: I rebuilt the exact same judge as a slash command in &lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; (Anthropic's CLI tool). Same 3-phase process, same tools, same structured output. The difference is that the CLI version runs on my Claude Max subscription instead of the API. Marginal cost per review: $0.&lt;/p&gt;

&lt;p&gt;The API version still exists for automated use. But day-to-day, I run &lt;code&gt;/judge &amp;lt;EAN&amp;gt;&lt;/code&gt; from my terminal and get the same structured review without paying per call.&lt;/p&gt;
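&lt;p&gt;For context, a custom slash command in Claude Code is just a Markdown file in &lt;code&gt;.claude/commands/&lt;/code&gt;; the filename becomes the command and &lt;code&gt;$ARGUMENTS&lt;/code&gt; receives whatever you type after it. A heavily abbreviated sketch of what such a file could look like (the real prompt is far longer):&lt;/p&gt;

```markdown
File: .claude/commands/judge.md  (invoked as /judge plus an EAN)

Review the production agent trace for EAN $ARGUMENTS.

Phase 1: read the full trace. No tools. Analyze every iteration.
Phase 2: verify independently with web search and page reads.
Phase 3: output the structured review: verdict, 5 scores, issue tags,
source-by-source verification.
```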

&lt;p&gt;Is this elegant? No. Is it a long-term solution? Probably not. But it let me go from "I can afford to review 20 products a month" to "I can review every product I want." And that volume is what makes pattern analysis useful.&lt;/p&gt;

&lt;h2&gt;The analysis layer&lt;/h2&gt;

&lt;p&gt;Individual reviews are useful. Patterns across reviews are transformative.&lt;/p&gt;

&lt;p&gt;On top of the judge, I built an analysis command that reads the last N reviews and identifies recurring patterns: which issue tags appear most often, which failures cluster together, which recommendations keep coming up, which types of queries consistently fail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkkpedooxcklbd7xz552.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkkpedooxcklbd7xz552.png" alt="feedback loop diagram" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output is a prioritized report. Each pattern gets a frequency (X out of N reviews), an impact rating (does it cause wrong answers or just inefficiency?), and a scope (universal, market-specific, category-specific). The report ends with 3-5 ranked recommendations.&lt;/p&gt;
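&lt;p&gt;The aggregation step itself is simple; the value is in the structured reviews feeding it. A minimal sketch of the frequency ranking (tag names are invented for illustration):&lt;/p&gt;

```python
from collections import Counter

# Each structured review carries its issue tags from the taxonomy.
reviews = [
    {"issue_tags": ["no_page_reads", "snippet_misread"]},
    {"issue_tags": ["no_page_reads"]},
    {"issue_tags": ["no_ean_search", "no_page_reads"]},
]

def top_patterns(reviews):
    """Rank issue tags as 'X out of N reviews', most frequent first."""
    counts = Counter(tag for r in reviews for tag in r["issue_tags"])
    n = len(reviews)
    return [(tag, c, n) for tag, c in counts.most_common()]

patterns = top_patterns(reviews)
```

The impact and scope ratings in the real report come from the judge's text, not from counting; only the frequencies are mechanical.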

&lt;p&gt;This is where the judge system pays for itself. One review tells you "this product got the wrong answer because the agent trusted a misleading snippet." Seventy-five reviews tell you "the agent almost never reads pages, and imposing a minimum page-read ratio would address the root cause." The first is an anecdote. The second is a strategy.&lt;/p&gt;

&lt;p&gt;The benchmark and the judge complement each other. The benchmark measures aggregate performance and catches regressions. The judge explains &lt;em&gt;why&lt;/em&gt; things fail and surfaces patterns that curated test sets miss. I need both.&lt;/p&gt;

&lt;h2&gt;The feedback loop&lt;/h2&gt;

&lt;p&gt;The whole point of the judge is to feed findings back into the agent. Some recommendations translated directly into improvements. The EAN-first pattern became a prompt change. The snippet misinterpretation finding led to the anti-false-confidence (FC) rules I described in the &lt;a href="https://getmio.app/blog/prompt-engineering-llm-agent" rel="noopener noreferrer"&gt;prompt engineering article&lt;/a&gt; (the ones that failed on Flash Lite but worked on 3 Flash).&lt;/p&gt;

&lt;p&gt;Other recommendations didn't work in practice. The language adaptation suggestion (search in Italian for Italian products) added noise without improving accuracy on the benchmark. Sometimes the judge identifies a problem but the fix doesn't exist yet, or the model can't handle the added complexity.&lt;/p&gt;

&lt;p&gt;The judge doesn't replace human judgment about what to change. It tells you where to look.&lt;/p&gt;

&lt;h2&gt;Is this worth building?&lt;/h2&gt;

&lt;p&gt;Honestly, the judge system took real engineering effort. The 3-phase process, the structured review schema, the trace formatting, the CLI rebuild, the analysis layer.&lt;/p&gt;

&lt;p&gt;But looking back, the judge found the EAN-first pattern that no amount of benchmark staring would have revealed. It confirmed benchmark findings with production data. It gave me a structured vocabulary for agent failures (those 13 issue tags) that made it possible to track patterns over time.&lt;/p&gt;

&lt;p&gt;If you're building an agent that runs in production, you need some way to understand &lt;em&gt;why&lt;/em&gt; it fails, not just &lt;em&gt;how often&lt;/em&gt;. Manual review doesn't scale. A judge agent does.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: From 42% to 78%: the full iteration log of a production AI agent. 108 benchmark runs, 7 models, 6 prompt versions, 3 weeks. Every decision we made, and the timeline that connects it all.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of a series on building a production AI agent for &lt;a href="https://getmio.app" rel="noopener noreferrer"&gt;Mio&lt;/a&gt;. Previous: &lt;a href="https://getmio.app/blog/benchmarking-7-llms" rel="noopener noreferrer"&gt;Benchmarking 7 LLMs from 4 providers on the same agentic task&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>evaluation</category>
    </item>
    <item>
      <title>GPT-5.1 scored 26%. Gemini 3 Flash scored 74%. Same prompt, same tools.</title>
      <dc:creator>ThomasP</dc:creator>
      <pubDate>Sat, 28 Mar 2026 09:16:18 +0000</pubDate>
      <link>https://dev.to/thomas_p_fe0c76c7/gpt-51-scored-26-gemini-3-flash-scored-74-same-prompt-same-tools-33mi</link>
      <guid>https://dev.to/thomas_p_fe0c76c7/gpt-51-scored-26-gemini-3-flash-scored-74-same-prompt-same-tools-33mi</guid>
      <description>&lt;p&gt;In the &lt;a href="https://getmio.app/blog/benchmark-before-prompt" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I explained how we built the evaluation infrastructure for our AI agent: a hand-curated golden dataset, a 3-run minimum per config, and the discovery that 17% of items flip between identical runs. This article puts that infrastructure to use.&lt;/p&gt;

&lt;p&gt;I'm building &lt;a href="https://getmio.app" rel="noopener noreferrer"&gt;Mio&lt;/a&gt;, an app where you scan a product barcode and get the manufacturing country. The AI agent searches the web, reads pages, cross-references sources, and returns a country with a confidence level. I built the same agent pipeline for 5 providers: Gemini, Anthropic, OpenAI, xAI, and Mistral. Same prompt. Same tools. Same scoring.&lt;/p&gt;

&lt;p&gt;Here's what happened when I ran them all against the same benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8psfyxag9c73y5qky9dr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8psfyxag9c73y5qky9dr.png" alt="7 models, same prompt, same tools. Gemini 3 Flash wins at 74.5%. GPT-5.1, despite strong public benchmarks, scored 26.5%. Amber bars were eliminated on cost or latency, not accuracy." width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The four walls&lt;/h2&gt;

&lt;p&gt;This isn't a "which model is smartest" comparison. It's an elimination tournament. My agent runs inside a consumer app where real people scan products in a store and wait for an answer. That sets hard constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;: under 10 seconds ideally, 15 seconds max. At 20-30 seconds, users put their phone back in their pocket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: under ~$0.01 per scan. At $0.02, the unit economics don't work at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;: above ~60% country match. Below that, the app feels broken. Users scan 3 products, get 2 wrong answers, and uninstall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False confidence&lt;/strong&gt;: as low as possible. The agent saying "verified: Made in France" when the product is made in China is worse than saying "I don't know." One confident wrong answer destroys trust faster than ten honest unknowns.&lt;/p&gt;

&lt;p&gt;If any single dimension is unacceptable, the model is out. Doesn't matter how good the other numbers are.&lt;/p&gt;
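&lt;p&gt;In code, the elimination is a hard AND over the walls. A sketch with this article's thresholds (false confidence has no fixed cutoff, so it's judged separately and omitted here; the sample numbers are from the results below):&lt;/p&gt;

```python
# Each wall disqualifies on its own; strength elsewhere doesn't compensate.
def passes_walls(model):
    return (
        model["latency_s"] <= 15        # 15 seconds max
        and model["cost_usd"] <= 0.01   # ~$0.01 per scan
        and model["accuracy"] >= 0.60   # below ~60% the app feels broken
    )

gemini_3_flash = {"latency_s": 13.5, "cost_usd": 0.004, "accuracy": 0.745}
grok_4_fast = {"latency_s": 33.6, "cost_usd": 0.001, "accuracy": 0.724}
```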

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t21zszmgrhkfgod1lws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t21zszmgrhkfgod1lws.png" alt="The elimination funnel. 7 models enter, 2 survive. Each wall disqualifies on a single dimension." width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The eliminations&lt;/h2&gt;

&lt;h3&gt;Mistral: accuracy (50%)&lt;/h3&gt;

&lt;p&gt;Tested on our early eval dataset (10 items). Country match: 50%. Cost was the lowest of anything I tested ($0.0006/trace), latency was fine (10.5s). But 50% accuracy means the agent is basically guessing. Didn't proceed to the gold-curated benchmark.&lt;/p&gt;

&lt;h3&gt;GPT-5.1: accuracy (26.5%)&lt;/h3&gt;

&lt;p&gt;This was the most surprising result. GPT-5.1 is a strong model on public benchmarks. On our gold-curated dataset (34 items), it scored 26.5% country match. The model returned null/low confidence on almost everything. 20 out of 34 items were "other failures" where the agent never submitted an answer.&lt;/p&gt;

&lt;p&gt;I need to be honest here: I'm not 100% sure this is the model's fault. Our OpenAI integration uses the Responses API, and the way tool results get passed back might not work as effectively as Gemini's native function calling. It's possible that a different integration approach would get better results. But at 26.5% on the only run I got, I didn't invest more time debugging it. The other providers worked out of the box.&lt;/p&gt;

&lt;h3&gt;GPT-4.1: accuracy (43%) + rate limits&lt;/h3&gt;

&lt;p&gt;Tested on our eval dataset (90 items): 43% country match, $0.014/trace, 17.9s latency. Already below the accuracy threshold. When I tried to run it on the gold-curated dataset at concurrency 20, it immediately hit OpenAI's 30K tokens-per-minute rate limit. Unusable for benchmarking, let alone production.&lt;/p&gt;

&lt;h3&gt;xAI Grok 4 Fast: latency (22-35s)&lt;/h3&gt;

&lt;p&gt;This one was interesting. Across multiple runs on 29-30 items, accuracy ranged from 40% to 72.4%. The best run (72.4%) was genuinely competitive. And the cost, around $0.001/trace, was tied with Mistral for the lowest I tested.&lt;/p&gt;

&lt;p&gt;But latency killed it. Every run came in between 22 and 35 seconds. At 33.6 seconds average on the best-accuracy run, a user would be staring at a loading screen for half a minute. In a grocery store. Not viable.&lt;/p&gt;

&lt;p&gt;If xAI gets the latency down, Grok would be worth retesting. The accuracy signal was real.&lt;/p&gt;

&lt;h3&gt;Claude Haiku 4.5: cost ($0.019/trace)&lt;/h3&gt;

&lt;p&gt;This was the hardest elimination. Haiku got 67.6% accuracy on gold-curated (34 items), with 7 false confidence cases. Not far from Gemini 3 Flash (74.5%). On easy items, it hit 100%. Solid model.&lt;/p&gt;

&lt;p&gt;But: $0.019 per trace. That's 4-5x what Gemini costs. And latency was 17.4 seconds on the gold-curated run, with some eval-dev runs hitting 20-29 seconds.&lt;/p&gt;

&lt;p&gt;I tested Haiku extensively during early development (before the gold-curated benchmark existed). Multiple prompt versions, different configurations. The accuracy was consistently decent. The cost was consistently too high. At $0.019/scan, 10,000 daily users doing 3 scans each means $570/day just in LLM costs. Gemini at $0.004/scan brings that to $120/day for better accuracy.&lt;/p&gt;
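&lt;p&gt;The back-of-envelope math, spelled out:&lt;/p&gt;

```python
# Daily LLM cost = users x scans per user x cost per scan.
def daily_llm_cost(users, scans_per_user, cost_per_scan):
    return round(users * scans_per_user * cost_per_scan, 2)

haiku_daily = daily_llm_cost(10_000, 3, 0.019)   # 570.0 USD/day
gemini_daily = daily_llm_cost(10_000, 3, 0.004)  # 120.0 USD/day
```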

&lt;p&gt;Sometimes a good model just doesn't fit the economics.&lt;/p&gt;

&lt;h3&gt;Gemini 2.5 Flash: accuracy (45.6%) + false confidence (10.5)&lt;/h3&gt;

&lt;p&gt;The predecessor to the models I ended up using. 45.6% accuracy with the highest false confidence of any Gemini model (10.5 average across 2 runs). Also 2x more non-deterministic than Flash Lite: 37% of items flipped between identical runs, compared to 17% for Flash Lite.&lt;/p&gt;
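&lt;p&gt;"2x more non-deterministic" here means flip rate: the share of items whose answer changes between two identical runs of the same config. A sketch of how that's computed (the data shapes are illustrative):&lt;/p&gt;

```python
def flip_rate(run_a, run_b):
    """Share of shared items whose predicted country differs between runs."""
    shared = set(run_a).intersection(run_b)
    flips = sum(1 for item in shared if run_a[item] != run_b[item])
    return flips / len(shared)

# Two "identical" runs of the same config on the same items:
run1 = {"ean1": "FR", "ean2": "CN", "ean3": "unknown"}
run2 = {"ean1": "FR", "ean2": "DE", "ean3": "unknown"}
```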

&lt;p&gt;Bad accuracy, bad FC, unstable results. Out.&lt;/p&gt;

&lt;h2&gt;The survivors&lt;/h2&gt;

&lt;p&gt;Two Gemini models made it through all four walls.&lt;/p&gt;

&lt;h3&gt;Gemini 3.1 Flash Lite&lt;/h3&gt;

&lt;p&gt;54-60% accuracy (varies by run), FC around 4-7, latency 8.6s, cost ~$0.006/trace. This was my production model for a while. Low false confidence, decent cost, fast.&lt;/p&gt;

&lt;p&gt;But as I described in the &lt;a href="https://getmio.app/blog/prompt-engineering-llm-agent" rel="noopener noreferrer"&gt;prompt engineering article&lt;/a&gt;, it was stuck on a local optimum. Every prompt change I tried made things worse. The model was too simple to follow nuanced rules. It worked, but it couldn't get better.&lt;/p&gt;

&lt;h3&gt;Gemini 3 Flash (the winner)&lt;/h3&gt;

&lt;p&gt;74.5% accuracy (average of 3 runs: 73.5%, 73.5%, 76.5%), FC 7.7, latency 13.5s, cost ~$0.004/trace. With parallel tool dispatch, the best single run hit 82.6%.&lt;/p&gt;

&lt;p&gt;Gemini 3 Flash didn't win by being the best at any single dimension. Not the cheapest (Flash Lite was cheaper). Not the lowest FC (Flash Lite had lower FC at ~5). Not the fastest (Flash Lite at 8.6s beat it). But it had the best &lt;em&gt;balance&lt;/em&gt;: highest accuracy by a wide margin, within acceptable bounds on everything else.&lt;/p&gt;

&lt;p&gt;And unlike Flash Lite, it responded to prompt optimization. The anti-FC rules, the nudge and anti-looping tweaks, the parallel dispatch instruction, all of these worked on 3 Flash. The model was smart enough to follow nuanced instructions, which meant I could keep improving it.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Public benchmarks don't predict agentic performance.&lt;/strong&gt; GPT-5.1 ranks high on MMLU, HumanEval, and other standard benchmarks. It scored 26.5% on our task. Gemini 3 Flash ranks lower on most public benchmarks. It scored 74.5%. The gap is enormous. Agentic tool-use tasks (search, read, reason, decide) test something completely different from the typical "answer this question" benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iibwuwz7e0nvdzdu3pv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iibwuwz7e0nvdzdu3pv.png" alt="108 benchmark runs in Langfuse. Latency drops as I move to faster models, cost stabilizes around $0.004, accuracy climbs from ~45% to ~75%, and false confidence stays volatile." width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most eliminations were about economics, not intelligence.&lt;/strong&gt; Haiku at 67.6% would have been a perfectly good agent. Grok at 72.4% was competitive with Gemini. Both were eliminated on cost or latency, not accuracy. If you're building a backend service with no latency constraint and a generous budget, your winner might be completely different from mine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mnb66aztei9mdq1gef2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mnb66aztei9mdq1gef2.png" alt="Cost vs accuracy. The viable zone (top-left) requires both high accuracy and low cost. Haiku and Grok had the accuracy but failed on cost or latency. Only Gemini 3 Flash sits comfortably inside." width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing depth should match viability.&lt;/strong&gt; I ran 100+ benchmarks on Gemini models and 1-5 on everything else. That sounds unfair. But it's the right approach. Once a model hits a disqualifying wall, spending more benchmark budget on it is waste. I invested deeply where it mattered (the Gemini family where prompt optimization was possible) and lightly where elimination was clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same prompt ≠ same results.&lt;/strong&gt; All five providers got the exact same system prompt and tool definitions. The accuracy range was 26.5% to 74.5%. The prompt was designed for Gemini (it's where I iterated), which probably gives Gemini an advantage. A prompt optimized for Haiku or GPT might close some of the gap. But the cost/latency constraints would still eliminate them for my use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The unified architecture paid for itself.&lt;/strong&gt; Building the agent for 5 providers with the same interface was real engineering work. But it meant every comparison was apples-to-apples. Same prompt, same tools, same scoring, same dataset. No "well maybe the OpenAI version just has different tools." If a model underperformed, it was the model (or the API integration), not the setup.&lt;/p&gt;

&lt;h2&gt;The honest caveats&lt;/h2&gt;

&lt;p&gt;I want to be clear about what this benchmark does and doesn't show.&lt;/p&gt;

&lt;p&gt;It shows how these models perform on &lt;em&gt;my specific task&lt;/em&gt; (manufacturing country lookup via web search), with &lt;em&gt;my specific prompt&lt;/em&gt; (optimized for Gemini), at &lt;em&gt;my specific scale&lt;/em&gt; (consumer app, real-time, cost-sensitive). A different task, a different prompt, or different constraints could produce a completely different ranking.&lt;/p&gt;

&lt;p&gt;The GPT-5.1 result in particular might not reflect the model's true capability. If I'd spent more time on the OpenAI integration, the results might improve. I made a pragmatic choice: other providers worked immediately, so I invested time there instead.&lt;/p&gt;

&lt;p&gt;And the testing depth is uneven. 3 runs on Haiku versus 20+ runs on Gemini 3 Flash means I have much more confidence in the Gemini numbers. The Haiku result (67.6%) comes from a single gold-curated run. It could be an unlucky run. Or a lucky one. With one run on that dataset, I don't know.&lt;/p&gt;

&lt;p&gt;What I &lt;em&gt;do&lt;/em&gt; know: Gemini 3 Flash at $0.004/trace and 13.5s gives me 74.5% accuracy. That's the combination I can build a product on. For now.&lt;/p&gt;

&lt;p&gt;Because this benchmark is a snapshot, not a verdict. Prices drop. Models improve. Latency gets optimized. And every week I spend building and iterating on this agent, I learn more about prompting, tool design, and what makes a model work well on agentic tasks. The prompt I have today, optimized through 108 runs on Gemini, is a much better starting point for retesting other providers than the generic prompt I started with. Haiku was eliminated on cost, but Anthropic's pricing changes regularly. Grok was eliminated on latency, but xAI is actively optimizing inference speed. GPT-5.1 might just need a different integration approach.&lt;/p&gt;

&lt;p&gt;The elimination results are real, but they're not permanent. The benchmark infrastructure stays. The logs of what worked and what didn't stay. When the context changes, I'll rerun. That's the whole point of having the eval framework in place.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: LLM-as-Judge: using Claude to review a Gemini agent. How I automated QA by having a smarter model review every agent trace, and the patterns it found that I never would have caught manually.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of a series on building a production AI agent for &lt;a href="https://getmio.app" rel="noopener noreferrer"&gt;Mio&lt;/a&gt;. Previous: &lt;a href="https://getmio.app/blog/benchmark-before-prompt" rel="noopener noreferrer"&gt;Why your LLM agent needs a benchmark before it needs a prompt&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why your LLM agent needs a benchmark before it needs a prompt</title>
      <dc:creator>ThomasP</dc:creator>
      <pubDate>Fri, 27 Mar 2026 08:05:57 +0000</pubDate>
      <link>https://dev.to/thomas_p_fe0c76c7/why-your-llm-agent-needs-a-benchmark-before-it-needs-a-prompt-1mgn</link>
      <guid>https://dev.to/thomas_p_fe0c76c7/why-your-llm-agent-needs-a-benchmark-before-it-needs-a-prompt-1mgn</guid>
      <description>&lt;p&gt;In the &lt;a href="https://getmio.app/blog/prompt-engineering-llm-agent" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I showed how prompt optimization played out across three dimensions: prompt rules, tooling, and model capability. Some changes failed on one model and worked on another. Some only made sense once the tooling changed. Every claim in that article came with a specific number. "Accuracy dropped from 60% to 43%." "13 false confidence cases, worst run ever." "The same anti-FC rules went from catastrophic on Flash Lite to +10.8% on 3 Flash."&lt;/p&gt;

&lt;p&gt;How did I know any of that?&lt;/p&gt;

&lt;p&gt;Because before I wrote a single prompt engineering rule, I built the evaluation infrastructure to measure whether it worked. And the single most important thing that infrastructure revealed had nothing to do with prompts at all.&lt;/p&gt;

&lt;h2&gt;The most dangerous sentence in AI engineering&lt;/h2&gt;

&lt;p&gt;"It seems better on a few examples."&lt;/p&gt;

&lt;p&gt;I hear this constantly from devs building with LLMs. They change the prompt, try it on 3-4 inputs, the outputs look better, they ship it. I did the same thing at the start. I'd tweak a rule, test it on the 2-3 products I had memorized, see an improvement, and commit.&lt;/p&gt;

&lt;p&gt;This is how you ship broken changes without knowing.&lt;/p&gt;

&lt;p&gt;The anti-false-confidence rules I wrote about in the last article? On a smaller model, they fixed the exact items I was testing against but dropped accuracy from 60% to 43% on everything else. On a smarter model weeks later, the same rules improved accuracy by +10.8%. Without a benchmark, I would have either shipped the broken version or permanently discarded rules that later turned out to be valuable.&lt;/p&gt;

&lt;p&gt;You can't evaluate an AI agent by eyeballing outputs. You need a benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  The golden dataset
&lt;/h2&gt;

&lt;p&gt;A benchmark needs ground truth. For my agent (which finds manufacturing countries from product barcodes), that means: for each product, what's the actual country, and how confident should the system be?&lt;/p&gt;

&lt;p&gt;My first attempt: I grabbed random EANs from my 69 million product database and ran about 30 benchmarks against them. It was useful for comparing models and getting rough numbers. But I kept running into a problem: when a run showed a regression, I wasn't sure if the agent got worse or if my "ground truth" was wrong. The expected countries had been found by the agent itself in earlier runs, verified by me with quick manual searches. Some were solid. Others, I wasn't confident enough to bet on.&lt;/p&gt;

&lt;p&gt;I was building on sand. If your ground truth is shaky, your benchmark is noise. You can't tell an actual regression from a label error.&lt;/p&gt;

&lt;p&gt;That's what pushed me to build the golden dataset properly: each item hand-curated with a verified manufacturing country, a confidence level, and a difficulty rating.&lt;/p&gt;

&lt;p&gt;The difficulty ratings matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Easy&lt;/strong&gt;: a product database already has the manufacturing country, or the first web search finds it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium&lt;/strong&gt;: needs 2-3 searches, maybe reading a retailer page, cross-referencing sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard&lt;/strong&gt;: 3+ searches, white-label products, ambiguous evidence, or genuinely untraceable from public sources&lt;/li&gt;
&lt;/ul&gt;
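&lt;p&gt;For concreteness, here's roughly what one golden item looks like as a data structure. The field names and the EAN are illustrative, not the actual schema:&lt;/p&gt;

```typescript
// Sketch of one golden-dataset item. Field names and values are
// illustrative, not the real schema.
type Difficulty = "easy" | "medium" | "hard";

interface GoldenItem {
  ean: string;                 // barcode used as the lookup key
  expectedCountry: string;     // hand-verified ground truth
  expectedConfidence: "verified" | "likely" | "unknown";
  difficulty: Difficulty;      // drives the per-difficulty breakdown
}

const item: GoldenItem = {
  ean: "1234567890123",        // placeholder EAN
  expectedCountry: "FR",
  expectedConfidence: "verified",
  difficulty: "medium",
};
```

&lt;p&gt;The &lt;code&gt;difficulty&lt;/code&gt; field is what makes a per-difficulty breakdown possible later.&lt;/p&gt;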

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29cc0hejfko1jflbccnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29cc0hejfko1jflbccnh.png" alt="The golden dataset in Langfuse. Each item is hand-curated with a verified country, confidence level, and difficulty rating." width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset grew over time: 21, then 30, then 34, then 46, then 53, then 57 items. I kept adding harder cases as the easy ones stabilized. This matters. A benchmark that only tests easy cases will tell you everything is fine when it isn't.&lt;/p&gt;

&lt;p&gt;The curation pipeline works like this: a product gets scanned, the agent processes it, an LLM judge reviews the agent's trace, then I validate the review in an admin panel. If the ground truth is solid, I mark it as "gold" and it gets pushed to the benchmark dataset. It's slow. Each item takes serious verification work. But the whole point is that these labels are &lt;em&gt;trustworthy&lt;/em&gt;. A benchmark with wrong labels is worse than no benchmark at all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugrrvdzpczokt3z7i7kq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugrrvdzpczokt3z7i7kq.png" alt="Curation pipeline: Scan &amp;gt; Agent &amp;gt; LLM Judge &amp;gt; Admin validation &amp;gt; Golden Dataset. The feedback loop means benchmark results surface new items to curate." width="800" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The metric that actually matters
&lt;/h2&gt;

&lt;p&gt;I track 5 metrics on every benchmark run:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;country_match&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Right country?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;false_confidence&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Confidently wrong?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;confidence_met&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Right confidence level?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;latency_s&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How fast?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;cost&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How expensive?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
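&lt;p&gt;The first three metrics are simple aggregations over per-item results. A minimal sketch, with shapes simplified from the real types:&lt;/p&gt;

```typescript
// Minimal sketch of how run-level metrics aggregate per-item results.
// Shapes and names are illustrative.
interface ItemResult {
  correct: boolean;      // country_match for this item
  confident: boolean;    // agent claimed high confidence
  confidenceMet: boolean;
  latencyS: number;
  costUsd: number;
}

function runMetrics(results: ItemResult[]) {
  const n = results.length;
  const matches = results.filter((r) => r.correct).length;
  // Confidently wrong: the failure mode to minimize.
  const falseConfidence = results.filter(
    (r) => (r.confident ? !r.correct : false)
  ).length;
  return {
    countryMatch: matches / n,
    falseConfidence,
    confidenceMet: results.filter((r) => r.confidenceMet).length / n,
    avgLatencyS: results.reduce((s, r) => s + r.latencyS, 0) / n,
    totalCostUsd: results.reduce((s, r) => s + r.costUsd, 0),
  };
}
```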

&lt;p&gt;Most people would optimize for &lt;code&gt;country_match&lt;/code&gt;. Get the right answer more often. That's the obvious goal.&lt;/p&gt;

&lt;p&gt;It's not the right goal. Or at least, it's not the &lt;em&gt;only&lt;/em&gt; goal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False confidence is the metric I optimize against.&lt;/strong&gt; The agent says "verified: Made in France" and the product is actually made in China. That's the failure mode that destroys user trust. One confidently wrong answer does more damage than ten "I don't know" results.&lt;/p&gt;

&lt;p&gt;A concrete example from my data: the brand-level fallback I described in the last article got ~33% accuracy with 13 false confidence cases. A different config got 60% accuracy with 4 false confidence cases. The 60% config is &lt;em&gt;obviously&lt;/em&gt; better for users, even though neither has great raw accuracy. Because 13 confident wrong answers means the user stops trusting the app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But accuracy without cost/latency is meaningless.&lt;/strong&gt; This is a consumer app. Someone scans a product in a grocery store and waits. During early testing, I benchmarked models that hit 70%+ accuracy but cost $0.02 per scan and took 20-30 seconds. A model at $0.004 per scan and 10 seconds that hits 65% is more shippable. The benchmark tracks cost and latency on every run precisely so I can make that trade-off with actual numbers instead of gut feeling.&lt;/p&gt;

&lt;p&gt;The optimization target is a three-way balance: minimize false confidence, maximize accuracy, keep cost and latency within production-viable bounds. Not a single number. A region in a multi-dimensional space.&lt;/p&gt;
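&lt;p&gt;In code, that selection is a filter plus a sort. A sketch with made-up thresholds to show the shape of the decision, not the actual production gates:&lt;/p&gt;

```typescript
// Sketch: picking a shippable config from benchmark summaries.
// Thresholds are illustrative, not the real production gates.
interface ConfigSummary {
  name: string;
  accuracy: number;        // country_match, 0 to 1
  falseConfidence: number; // confidently-wrong count
  costUsd: number;         // per scan
  latencyS: number;        // per scan
}

function shippable(c: ConfigSummary): boolean {
  // Hard cost/latency gates first, then quality.
  if (c.costUsd > 0.005) return false;
  if (c.latencyS > 15) return false;
  if (c.falseConfidence > 8) return false;
  return c.accuracy >= 0.65;
}

function pickBest(configs: ConfigSummary[]): ConfigSummary | null {
  const viable = configs.filter(shippable);
  if (viable.length === 0) return null;
  // Among viable configs: minimize false confidence, then maximize accuracy.
  viable.sort((a, b) =>
    a.falseConfidence !== b.falseConfidence
      ? a.falseConfidence - b.falseConfidence
      : b.accuracy - a.accuracy
  );
  return viable[0];
}
```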

&lt;h2&gt;
  
  
  The day everything changed
&lt;/h2&gt;

&lt;p&gt;About a week in, I had a baseline: 60% accuracy, 4 false confidence cases, on a 30-item dataset. I'd tried several prompt changes and they all seemed to make things slightly worse (53%, 57%, 50%). But the deltas were small. 1-2 items. Maybe just bad luck?&lt;/p&gt;

&lt;p&gt;So I ran the exact same code a second time. Same model. Same prompt. Same config. Same dataset. Same everything.&lt;/p&gt;

&lt;p&gt;Country match came out identical: 18/30 (60%) both runs.&lt;/p&gt;

&lt;p&gt;But when I looked at the individual items, 5 out of 30 had flipped. Products that were correct in the first run were wrong in the second. Products that were wrong became correct. One item went from a correct answer to "unknown." Another went from "unknown" to the right answer with high confidence. False confidence went from 4 to 5.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;17% of items gave different results on identical runs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the most important finding of the entire project. It meant any delta of +/-1-2 items was noise, not signal. Several "neutral" iterations I'd already tested were probably indistinguishable from the baseline. I needed at least +/-4 items of difference on 30 items to have any confidence that a change was real.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nvx63bnquujv30j7o83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nvx63bnquujv30j7o83.png" alt="Two identical runs on 30 items. Same 60% accuracy, but 5 items flipped between correct and wrong." width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I confirmed this later on a larger 46-item dataset. Two identical runs: 71.7% and 78.3%. That's a 6.6 percentage point gap on the &lt;em&gt;same code&lt;/em&gt;. With 46 items, I needed a &amp;gt;=10pp difference to call something significant.&lt;/p&gt;
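&lt;p&gt;The flip check itself is a few lines; the hard part was thinking to run it at all. A sketch:&lt;/p&gt;

```typescript
// Sketch: diffing two runs of the same config to measure flip rate.
// A "flip" is an item whose correctness changed between runs.
function flipRate(runA: boolean[], runB: boolean[]): number {
  let flips = 0;
  runA.forEach((a, i) => {
    if (a !== runB[i]) flips++;
  });
  return flips / runA.length;
}

// Same headline score, different items: both runs below score 4/6,
// yet 2 of 6 items flipped.
const rate = flipRate(
  [true, true, true, false, false, true],
  [true, false, true, true, false, true],
);
```

&lt;p&gt;Run it on any two "identical" runs before trusting a small delta.&lt;/p&gt;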

&lt;p&gt;This one finding changed how I evaluate everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-run minimum
&lt;/h2&gt;

&lt;p&gt;After the variance discovery, I established a rule: minimum 3 runs per configuration. No exceptions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1a84lb1r6svn60x8pcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1a84lb1r6svn60x8pcm.png" alt="108 benchmark runs in Langfuse. Each row is a full evaluation against the golden dataset." width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The big comparative session where I tested multiple models and prompt versions ran 3 models x 4 prompt versions, 3 runs each. The config I ended up shipping (Gemini 3 Flash with prompt v4) showed: 73.5%, 73.5%, 76.5% across its 3 runs. False confidence: 7, 9, 7.&lt;/p&gt;

&lt;p&gt;That variance in FC (7 to 9 on identical code) is exactly why single runs are dangerous. If I'd only run it once and gotten the 9-FC run, I might have rejected a config that was actually the best option. If I'd only gotten the 7-FC run, I might have been overconfident about how good it was.&lt;/p&gt;

&lt;p&gt;3 runs gives you the range. The average tells you where you probably are. The spread tells you how much to trust it.&lt;/p&gt;
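&lt;p&gt;The aggregation is deliberately boring: mean for the estimate, min-to-max spread for the noise band. Using the shipped config's three runs from above:&lt;/p&gt;

```typescript
// Sketch: summarizing 3+ runs of one config as average and spread.
function summarize(accuracies: number[]) {
  const n = accuracies.length;
  const mean = accuracies.reduce((s, a) => s + a, 0) / n;
  const spread = Math.max(...accuracies) - Math.min(...accuracies);
  return { mean, spread };
}

// The shipped config's 3 runs from the text:
const flash3v4 = summarize([73.5, 73.5, 76.5]);
// mean 74.5, with 3 points of pure run-to-run noise
```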

&lt;h2&gt;
  
  
  Prompt versioning
&lt;/h2&gt;

&lt;p&gt;Every benchmark trace in &lt;a href="https://langfuse.com" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt; records two things in its metadata: the &lt;code&gt;PROMPT_VERSION&lt;/code&gt; constant and the git SHA. I bump the version on every prompt or pipeline change.&lt;/p&gt;

&lt;p&gt;This means I can always trace back. "This run used prompt v4 at commit &lt;code&gt;36c5871&lt;/code&gt;." If a result looks weird, I can check out that exact commit and re-run it. Full reproducibility.&lt;/p&gt;

&lt;p&gt;It sounds trivial. It's not. Without this, after 50+ runs, you lose track of what was tested when. "Was that the run with the anti-FC rules or without them?" is a question you never want to be asking.&lt;/p&gt;
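&lt;p&gt;The mechanics are just two values stamped into every trace's metadata. A sketch; the exact metadata call depends on your tracing SDK (Langfuse in my case):&lt;/p&gt;

```typescript
// Sketch: the two values attached to every benchmark trace.
// How they get into the trace depends on your tracing SDK.
import { execSync } from "node:child_process";

// Bumped manually on every prompt or pipeline change.
const PROMPT_VERSION = "v4";

function currentGitSha(): string {
  return execSync("git rev-parse --short HEAD").toString().trim();
}

// The object passed as trace metadata.
function runMetadata(gitSha: string) {
  return { promptVersion: PROMPT_VERSION, gitSha };
}
```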

&lt;h2&gt;
  
  
  Growing the dataset
&lt;/h2&gt;

&lt;p&gt;The dataset started at 21 items, grew through 57, and sits at 69 as I write this. Most of the recent additions are medium and hard cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zf2mbuof9g58al53qrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zf2mbuof9g58al53qrg.png" alt="Dataset growth over time. Easy items plateau early. Most new additions are medium and hard cases." width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is important and also painful, because it means raw accuracy numbers across different dataset sizes aren't directly comparable.&lt;/p&gt;

&lt;p&gt;My first baseline was 42% on 21 items. My production config hits around 78% on 46 items. But the 46-item dataset is &lt;em&gt;harder&lt;/em&gt; than the 21-item one. I deliberately added products that are difficult to trace: white-label goods, brands that manufacture in multiple countries, products where "Made in France" on a retailer page actually refers to something else entirely.&lt;/p&gt;

&lt;p&gt;A dataset that only gets easier over time is useless. If your benchmark accuracy goes up just because you're adding easy cases, you're measuring your dataset curation skills, not your agent.&lt;/p&gt;

&lt;p&gt;The flip side: a dataset that gets harder over time means you can't compare run #1 to run #108 directly. You need to compare runs on the &lt;em&gt;same&lt;/em&gt; dataset version. This is where the git SHA tracking pays for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I started over, two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with 30+ items from day one.&lt;/strong&gt; My first benchmark had 21 items. That's too few. At 21 items, the variance is so high that almost nothing is statistically distinguishable from anything else. I wasted several days testing changes that were probably just noise. 30 items is the minimum where you start seeing actual signals. 50+ is where you get comfortable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track difficulty breakdown from the start.&lt;/strong&gt; I added difficulty labels (easy/medium/hard) early, and it turned out to be one of the most useful dimensions. A change that improves easy items by 10% but destroys medium items is not a win, even if the overall number looks similar. The breakdown shows you where the change actually helps and where it hurts.&lt;/p&gt;
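&lt;p&gt;The breakdown is a one-function addition to the eval script. A sketch:&lt;/p&gt;

```typescript
// Sketch: per-difficulty accuracy breakdown. An overall number can
// hide a regression on medium or hard items.
type Difficulty = "easy" | "medium" | "hard";

interface Scored {
  difficulty: Difficulty;
  correct: boolean;
}

function breakdown(results: Scored[]) {
  const acc: { [d: string]: { total: number; correct: number } } = {};
  for (const r of results) {
    if (!acc[r.difficulty]) acc[r.difficulty] = { total: 0, correct: 0 };
    acc[r.difficulty].total++;
    if (r.correct) acc[r.difficulty].correct++;
  }
  return acc;
}
```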

&lt;h2&gt;
  
  
  Why this matters beyond my project
&lt;/h2&gt;

&lt;p&gt;If you're building an AI agent that does anything more complex than "turn this text into JSON," you need eval infrastructure before you need prompt engineering. The pattern is always the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You change something&lt;/li&gt;
&lt;li&gt;It looks better on 3 examples&lt;/li&gt;
&lt;li&gt;You ship it&lt;/li&gt;
&lt;li&gt;It's actually worse on 80% of cases you didn't test&lt;/li&gt;
&lt;li&gt;You don't find out until users complain&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The benchmark breaks this cycle. Instead of "it seems better," you get "accuracy went from 60% to 43%, with 5 new regressions on items that used to work." That's a sentence that changes your decision.&lt;/p&gt;

&lt;p&gt;And the variance discovery applies to every LLM-based system, not just mine. If you're evaluating prompts with single runs, you're probably shipping noise. Run it 3 times. Look at the spread. You might find that your "10% improvement" was within the margin of error all along.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: &lt;a href="https://getmio.app/blog/benchmarking-7-llms" rel="noopener noreferrer"&gt;Benchmarking 7 LLMs from 4 providers on the same task&lt;/a&gt;. Same prompt, same tools, same dataset. GPT-5.1 scored 26%. Gemini 3 Flash scored 74.5%. The results were not what I expected.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of a series on building a production AI agent for &lt;a href="https://getmio.app" rel="noopener noreferrer"&gt;Mio&lt;/a&gt;. Previous: &lt;a href="https://getmio.app/blog/prompt-engineering-llm-agent" rel="noopener noreferrer"&gt;The prompt engineering that didn't work (and what did)&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>testing</category>
    </item>
    <item>
      <title>The prompt engineering that didn't work (and what did)</title>
      <dc:creator>ThomasP</dc:creator>
      <pubDate>Mon, 23 Mar 2026 17:33:08 +0000</pubDate>
      <link>https://dev.to/thomas_p_fe0c76c7/the-prompt-engineering-that-didnt-work-and-what-did-4il3</link>
      <guid>https://dev.to/thomas_p_fe0c76c7/the-prompt-engineering-that-didnt-work-and-what-did-4il3</guid>
      <description>&lt;p&gt;In the &lt;a href="https://getmio.app/blog/why-finding-origin-is-ai-problem" rel="noopener noreferrer"&gt;first article of this series&lt;/a&gt;, I explained why finding a product's manufacturing country from its barcode is genuinely an AI problem. The data is scattered, misleading, and requires multi-step reasoning to untangle. I'm building &lt;a href="https://getmio.app" rel="noopener noreferrer"&gt;Mio&lt;/a&gt;, an app where you scan a barcode and get the manufacturing country, powered by an AI agent that searches the web, reads pages, and cross-references sources.&lt;/p&gt;

&lt;p&gt;This article is about what happened when I tried to make that agent better.&lt;/p&gt;

&lt;p&gt;Over three weeks, I ran 108 benchmarks against a hand-curated golden dataset. Tested 7 models from 4 providers. Iterated through 6 major prompt versions with dozens of sub-variants. And what I learned is this: optimization is three-dimensional, and a change that fails in one context can succeed in another.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scoreboard
&lt;/h2&gt;

&lt;p&gt;Before I get into specifics, here's the summary. Every line is a real benchmark run, measured against ground-truth labels. "False Confidence" (FC) is the worst failure mode: the agent says it's confident and it's &lt;em&gt;wrong&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On Gemini 3.1 Flash Lite&lt;/strong&gt; (the production model at the time):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;FC&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (v2)&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;PRODUCTION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anti-false-confidence rules&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;REVERTED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brand-level fallback search&lt;/td&gt;
&lt;td&gt;~33%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CATASTROPHIC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence calibration rules&lt;/td&gt;
&lt;td&gt;53%&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;REVERTED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thinking budget to 2048&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;REVERTED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temperature 1.0 (was 0)&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;REVERTED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Double search results&lt;/td&gt;
&lt;td&gt;53%&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;REVERTED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ambiguity guard rule&lt;/td&gt;
&lt;td&gt;57%&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;REVERTED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 changes at once&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;REVERTED&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;On Gemini 3 Flash&lt;/strong&gt; (after model switch):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;FC&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (v2, same prompt)&lt;/td&gt;
&lt;td&gt;57.8%&lt;/td&gt;
&lt;td&gt;8.7&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anti-FC rules (v3)&lt;/td&gt;
&lt;td&gt;68.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+10.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nudge + anti-looping (v4)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.7&lt;/td&gt;
&lt;td&gt;SHIPPED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Variant guard (v5)&lt;/td&gt;
&lt;td&gt;71.6%&lt;/td&gt;
&lt;td&gt;7.0&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blocklist (v6a)&lt;/td&gt;
&lt;td&gt;71.6%&lt;/td&gt;
&lt;td&gt;6.3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same anti-FC rules. Flash Lite: 60% -&amp;gt; 43%. Flash 3: 57.8% -&amp;gt; 68.6%. The rules weren't wrong. The model was too simple to follow them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What moved the needle across the whole project:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model switch (2.5 Flash -&amp;gt; 3.1 Flash Lite)&lt;/td&gt;
&lt;td&gt;+13.3% match, -3 FC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model switch (Flash Lite -&amp;gt; 3 Flash) + prompt recalibration&lt;/td&gt;
&lt;td&gt;+20% match vs original prod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel tool dispatch&lt;/td&gt;
&lt;td&gt;+8.2% match, -3.1 FC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The failures (and why they weren't really failures)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The brand-level fallback (worst run in project history)
&lt;/h3&gt;

&lt;p&gt;The idea seemed great: when the agent can't find where a &lt;em&gt;product&lt;/em&gt; is made, search for where the &lt;em&gt;brand&lt;/em&gt; manufactures. Unilever makes stuff in 50+ countries, but a small French brand probably has one factory.&lt;/p&gt;

&lt;p&gt;Result: 13 false confidence cases. The worst run I ever recorded. Not even close to second place.&lt;/p&gt;

&lt;p&gt;What happened: the model found "Brand X has a factory in Y" and immediately applied that to the specific product with full confidence. Every brand search returned &lt;em&gt;some&lt;/em&gt; country, and the model treated it as a verified answer.&lt;/p&gt;

&lt;p&gt;The lesson I keep coming back to: never add fallback strategies that give the model an alternative path to low-quality answers. It will use them eagerly to justify confident wrong answers. &lt;strong&gt;The model &lt;em&gt;wants&lt;/em&gt; to give you an answer.&lt;/strong&gt; Your job is to make the lazy path (giving up) easier than the wrong path (guessing from brand-level data).&lt;/p&gt;

&lt;p&gt;This one I haven't retried on a smarter model. It might work better on a model that can distinguish "brand manufactures in X" from "this specific product is made in X." But the failure was so spectacular that I moved on.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Anti-false-confidence rules (the plot twist)
&lt;/h3&gt;

&lt;p&gt;My agent had 4 items it was consistently wrong about, all with the same pattern: it trusted "Made in France" search snippets that actually referred to a different product on the same page. So I added 3 targeted rules. Things like "catalogue page snippets are unreliable, only trust product-specific pages" and "verified confidence requires reading the actual page, not just a search snippet."&lt;/p&gt;

&lt;p&gt;On Flash Lite, it was a disaster. Accuracy dropped from 60% to 43%. I tested 3 times: 43%, 53%, 50%. All worse. The model couldn't distinguish "this snippet from a catalogue page is unreliable" from "this snippet from a manufacturer's website is reliable." So it distrusted everything.&lt;/p&gt;

&lt;p&gt;I reverted and moved on. The rules were too nuanced for this model.&lt;/p&gt;

&lt;p&gt;Weeks later, I switched to Gemini 3 Flash and tested the same anti-FC rules. Accuracy went from 57.8% to 68.6%. False confidence dropped from 8.7 to 5.7. The exact same rules that broke Flash Lite were a massive improvement on a smarter model. 3 Flash could actually tell the difference between a catalogue page and a product page.&lt;/p&gt;

&lt;p&gt;This was the moment I understood that &lt;strong&gt;prompt optimization isn't one-dimensional&lt;/strong&gt;. A rule that fails on model A can succeed on model B. You don't discard the idea, you log it and revisit it when the context changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. More thinking, worse results
&lt;/h3&gt;

&lt;p&gt;Gemini has a "thinking budget" parameter that controls how many tokens the model can use for internal reasoning before responding. My baseline was 1024 tokens. I tried 2048. Then I tried the API's "medium" thinking level.&lt;/p&gt;

&lt;p&gt;Both were worse. 2048 tokens: -3 match, the model started overthinking simple items. Medium: -2 match, +1 false confidence, and average iterations jumped from 3.3 to 4.2. The model used the extra thinking budget to second-guess itself, not to make better decisions. It would find a clear "Made in Germany" statement, then spend 500 extra tokens wondering if it was &lt;em&gt;really&lt;/em&gt; Germany, and end up submitting "unknown, low confidence."&lt;/p&gt;

&lt;p&gt;1024 tokens forces concise, decisive reasoning. More thinking budget just means more hesitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy by thinking budget:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;thinkingBudget = 1024&lt;/code&gt;: &lt;strong&gt;60%&lt;/strong&gt; (optimal)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;thinkingLevel = "medium"&lt;/code&gt;: 53%&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;thinkingBudget = 2048&lt;/code&gt;: 50%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More thinking = more hesitation, not better decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Temperature 1.0 (going against the docs)
&lt;/h3&gt;

&lt;p&gt;Gemini's documentation explicitly says that temperature below 1.0 causes "looping or degraded performance, particularly in complex reasoning tasks." So I tested temperature 1.0 against my baseline of 0.&lt;/p&gt;

&lt;p&gt;Result: -7.8% accuracy, +1.9% false confidence. Every metric worse. Higher temperature means more randomness, which means more hallucination, which means more confidently wrong answers.&lt;/p&gt;

&lt;p&gt;Temperature 0 is the right choice for structured tool calling. The official docs are talking about open-ended generation, not agentic workflows where you need deterministic, focused behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. More search results = more noise
&lt;/h3&gt;

&lt;p&gt;I doubled the number of search results per query from 5 to 10. More data should help the model find the right answer, right?&lt;/p&gt;

&lt;p&gt;Accuracy dropped. The model couldn't separate signal from noise in 10 results. Contradictory snippets from different products on different sites confused it. 5 focused results outperformed 10 noisy ones.&lt;/p&gt;

&lt;p&gt;There's a context quality threshold, and it depends on the model. A smaller model saturates faster. This is another change I'd want to retest on a bigger model.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Multiple changes at once (the compounding problem)
&lt;/h3&gt;

&lt;p&gt;I once shipped 3 prompt changes together: a new search strategy, retailer-specific instructions, and multilingual query templates. Accuracy dropped from 60% to 47%. Medium-difficulty items went from 29% to 0%. Complete collapse.&lt;/p&gt;

&lt;p&gt;I couldn't tell which change caused the regression. Maybe all three. Maybe just one. Doesn't matter. I reverted the whole thing and never made that mistake again.&lt;/p&gt;

&lt;p&gt;One change at a time. Always. If you can't measure the individual impact, you can't learn from it.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. The EU trap (when the model just says no)
&lt;/h3&gt;

&lt;p&gt;I added a &lt;code&gt;region&lt;/code&gt; field to the agent's output so products labeled "Made in EU" would return "EU" instead of "unknown." Seemed like a small, clean addition.&lt;/p&gt;

&lt;p&gt;The model ignored it completely. Across 6 benchmark runs. I reinforced the instruction in the prompt. Still ignored. I added it to the tool description. Still ignored. Country match and false confidence stayed exactly the same whether the field existed or not.&lt;/p&gt;

&lt;p&gt;Prompt engineering cannot force a model to fill a field it doesn't understand the purpose of. After a week of trying, I moved the logic to post-processing code. Deterministic. Works every time.&lt;/p&gt;

&lt;p&gt;Sometimes the answer isn't a better prompt. It's code.&lt;/p&gt;
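&lt;p&gt;The replacement is a few deterministic lines. A simplified sketch of the post-processing idea; the matching rules here are illustrative:&lt;/p&gt;

```typescript
// Sketch: deterministic post-processing that replaced the ignored
// "region" output field. Matching rules are illustrative.
function deriveRegion(rawOrigin: string): string | null {
  const normalized = rawOrigin.trim().toLowerCase();
  if (normalized === "eu") return "EU";
  if (normalized.includes("made in eu")) return "EU";
  if (normalized.includes("european union")) return "EU";
  return null; // no region claim found
}
```

&lt;p&gt;No prompt, no tokens, no variance: the same input always gives the same output.&lt;/p&gt;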

&lt;h2&gt;
  
  
  What actually worked
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Switching models (the unlock for everything else)
&lt;/h3&gt;

&lt;p&gt;My first production model was Gemini 2.5 Flash. Switching to 3.1 Flash Lite gained +13.3% match and -3 false confidence. Better model, same prompt, instantly better. But Flash Lite turned out to be a ceiling, not a foundation. Every prompt change I tried on it made things worse.&lt;/p&gt;

&lt;p&gt;A note on why I stayed within the Gemini family: cost and latency. This is a consumer app where users scan products in a store and expect an answer in seconds. Before the gold-curated benchmark, I tested Haiku (Claude) extensively. It got comparable or slightly better accuracy on some runs, but at $0.01-0.02 per scan and 20-29 seconds latency. Gemini Flash was $0.002-0.004 per scan and 8-13 seconds. At 4-5x the cost and 2-3x the latency, Haiku wasn't viable for production, no matter how good the accuracy. GPT-4.1 had similar cost issues and wildly variable latency.&lt;/p&gt;

&lt;p&gt;The real unlock was switching to Gemini 3 Flash. On its first run with the same v2 prompt, it got the best raw accuracy I'd ever seen (66.7%) but also the worst false confidence (7 FC). It answered more questions, but also hallucinated more.&lt;/p&gt;

&lt;p&gt;Here's the thing though: once I was on 3 Flash, prompt engineering &lt;em&gt;started working again&lt;/em&gt;. The anti-FC rules that failed on Flash Lite? +10.8% on 3 Flash. The nudge and anti-looping rules? Pushed it to 74.5%. I tested 4 prompt variants on 3 Flash, running each 3 times to account for variance, and each showed meaningful differences.&lt;/p&gt;

&lt;p&gt;The model switch didn't just improve accuracy. It unlocked an entire dimension of optimization that was previously walled off.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Parallel tool dispatch (+8.2% accuracy)
&lt;/h3&gt;

&lt;p&gt;This was the change I expected to matter least.&lt;/p&gt;

&lt;p&gt;My agent was executing tool calls sequentially. Search, wait, read page, wait, search again, wait. I changed two things: the code now runs tool calls in parallel (&lt;code&gt;Promise.all&lt;/code&gt; instead of sequential &lt;code&gt;await&lt;/code&gt;), and I added one line to the prompt telling the model it &lt;em&gt;can&lt;/em&gt; batch multiple tool calls in a single turn.&lt;/p&gt;

&lt;p&gt;Result: +8.2% match, -3.1 false confidence. The best single improvement across the entire project. And it was remarkably stable: the worst run with this change (76.1%) still beat the average of the previous version (70.1%).&lt;/p&gt;

&lt;p&gt;The model started making two searches with different angles in the same turn instead of doing them sequentially. More coverage within the same tool budget. The improvement came from both sides: the prompt change (model batches more) and the infrastructure change (batched calls run in parallel).&lt;/p&gt;

&lt;p&gt;This is a good example of the three dimensions working together. The prompt told the model it &lt;em&gt;could&lt;/em&gt; batch. The tooling made batching &lt;em&gt;fast&lt;/em&gt;. And the model (3 Flash) was smart enough to actually &lt;em&gt;do&lt;/em&gt; it effectively.&lt;/p&gt;
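&lt;p&gt;The infrastructure side of that change is small. A simplified sketch, with a stand-in tool router:&lt;/p&gt;

```typescript
// Sketch of the dispatch change: run all tool calls from one model
// turn concurrently instead of awaiting each in sequence.
interface ToolCall {
  name: string;
  args: { [k: string]: string };
}

// Stand-in for the real tool router (web search, page fetch, ...).
async function executeTool(call: ToolCall) {
  return call.name + " done";
}

// Before: sequential awaits in a loop.
// After: one Promise.all per model turn.
async function dispatch(calls: ToolCall[]) {
  return Promise.all(calls.map((c) => executeTool(c)));
}
```

&lt;p&gt;The prompt line telling the model it can batch is what makes this matter; without batched calls, there's nothing to parallelize.&lt;/p&gt;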

&lt;h3&gt;
  
  
  3. The config that held
&lt;/h3&gt;

&lt;p&gt;Temperature 0 and thinking budget 1024. Not glamorous. But these were the baseline settings that every experiment on every model failed to beat. I tested temperature 1.0 (worse), thinking budget 2048 (worse), thinking level "medium" (worse). The benchmarks don't lie.&lt;/p&gt;
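&lt;p&gt;For reference, the settings that held, as a Gemini-style request config. The field names follow the GenAI SDK's shape; verify against the SDK version you use:&lt;/p&gt;

```typescript
// Sketch of the config that held: temperature 0, thinking budget 1024.
// Field names follow the Google GenAI SDK's shape; treat the exact
// structure as an assumption and check your SDK version.
const generationConfig = {
  temperature: 0,           // deterministic tool calling
  thinkingConfig: {
    thinkingBudget: 1024,   // 2048 and "medium" both tested worse
  },
};
```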

&lt;h2&gt;
  
  
  The real lesson: optimization is three-dimensional
&lt;/h2&gt;

&lt;p&gt;After 108 runs, I don't think about "prompt engineering" as a standalone activity anymore. The optimization space has three axes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt; (what you tell the model), &lt;strong&gt;Tooling&lt;/strong&gt; (what the model can do), and &lt;strong&gt;Model&lt;/strong&gt; (how capable the model is at following your instructions). And all three are constrained by a fourth dimension: &lt;strong&gt;cost and latency&lt;/strong&gt;. A brilliant model that costs $0.02/scan and takes 25 seconds isn't an option for a consumer app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy76v10d4hb1urecc6ol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy76v10d4hb1urecc6ol.png" alt="Three-axis optimization diagram: Prompt, Model, and Tooling axes with production config positioned at the intersection, constrained by cost and latency" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The anti-FC rules failed on Flash Lite and worked on 3 Flash. That's prompt x model. Parallel dispatch worked because of a prompt change AND an infrastructure change. That's prompt x tooling. And model switches unlocked prompt optimizations that were previously impossible. That's model x prompt. Staying within the Gemini family despite testing other providers? That's the cost/latency constraint eliminating otherwise viable options.&lt;/p&gt;

&lt;p&gt;You have to explore all three dimensions. And critically, you have to keep good logs. The anti-FC rules I "abandoned" on Flash Lite became one of my best improvements when I revisited them on 3 Flash weeks later. If I hadn't logged the iteration, I would have assumed "anti-FC rules don't work" and never tried them again.&lt;/p&gt;

&lt;p&gt;Don't discard an optimization because it failed in one context. Log it, understand &lt;em&gt;why&lt;/em&gt; it failed (model too simple? tooling bottleneck? wrong config?), and revisit it when the context changes.&lt;/p&gt;

&lt;p&gt;The seven consecutive failures on Flash Lite weren't wasted work. They were a map of what this model couldn't do, which made it obvious when to switch models, and which ideas to retry on the new one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need a benchmark to know any of this.&lt;/strong&gt; Most of these changes "felt" like improvements when I tested them on 3-4 examples. The anti-FC rules fixed exactly the items I was targeting. The brand-level pivot found correct answers for some products. Without measuring against 30+ items with ground-truth labels, I would have shipped broken changes and never known.&lt;/p&gt;

&lt;p&gt;That's actually the subject of the next article in this series.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next up: Why your LLM agent needs a benchmark before it needs a prompt. How we built the evaluation framework, why we measure false confidence instead of accuracy, and the day we discovered that 17% of items flip between identical runs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of a series on building a production AI agent for &lt;a href="https://getmio.app" rel="noopener noreferrer"&gt;Mio&lt;/a&gt;. Previous: &lt;a href="https://getmio.app/blog/why-finding-origin-is-ai-problem" rel="noopener noreferrer"&gt;Why finding where a product is made is an AI problem&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why finding where a product is made is an AI problem</title>
      <dc:creator>ThomasP</dc:creator>
      <pubDate>Tue, 17 Mar 2026 09:58:36 +0000</pubDate>
      <link>https://dev.to/thomas_p_fe0c76c7/why-finding-where-a-product-is-made-is-an-ai-problem-5d1e</link>
      <guid>https://dev.to/thomas_p_fe0c76c7/why-finding-where-a-product-is-made-is-an-ai-problem-5d1e</guid>
      <description>&lt;p&gt;&lt;em&gt;A barcode tells you where a product was registered. Not where it was made.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Pick up any product at the grocery store. Flip it over. See that barcode? The first three digits tell you which country it was &lt;em&gt;registered&lt;/em&gt; in. Starts with 300-379? France. 400-440? Germany. 890? India.&lt;/p&gt;

&lt;p&gt;Most people (including me, before I started working on this) assume that's where the product is manufactured.&lt;/p&gt;

&lt;p&gt;It's not. Not even close.&lt;/p&gt;

&lt;p&gt;A French brand can register barcodes in France and make everything in China. A German company can produce in Poland. That 3-digit GS1 prefix matches the actual manufacturing country about 40% of the time. Worse than a coin flip.&lt;/p&gt;
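The registration lookup itself is trivial, which is part of what makes it tempting. A sketch using only the ranges mentioned above (the full GS1 table is much longer; country codes are illustrative):

```typescript
// Registration country from the GS1 prefix. This is where the brand
// REGISTERED the barcode — explicitly NOT manufacturing origin.
function gs1RegistrationCountry(barcode: string): string | null {
  const prefix = parseInt(barcode.slice(0, 3), 10);
  if (prefix >= 300 && prefix <= 379) return "FR"; // registered in France
  if (prefix >= 400 && prefix <= 440) return "DE"; // registered in Germany
  if (prefix === 890) return "IN";                 // registered in India
  return null; // range unknown to this sketch
}
```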

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsqfxc8ynr5rjqln5unl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsqfxc8ynr5rjqln5unl.png" alt="The GS1 barcode prefix indicates where the brand is registered, not where the product is manufactured. It matches the actual manufacturing country about 40% of the time." width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm building &lt;a href="https://getmio.app" rel="noopener noreferrer"&gt;Mio&lt;/a&gt;, an app that lets you scan a barcode and find where a product is actually made. What I thought would be a fun database project turned into one of the most interesting AI engineering challenges I've worked on. Here's why.&lt;/p&gt;

&lt;p&gt;Fair warning: I'm a developer, not a data scientist. Some of what I'll share in this series (variance exists! you need more than 3 test cases!) will make ML engineers smile. But most devs building with LLMs right now don't have a stats background, and I suspect they're running into the same walls I did. If my trial-and-error saves someone a few days, that's enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The data exists. Good luck finding it.
&lt;/h2&gt;

&lt;p&gt;Here's the thing that makes this problem so sneaky: the manufacturing origin of most products &lt;em&gt;is&lt;/em&gt; available somewhere. It's buried in a retailer's product page. It's in an open database. It's on the packaging in 6pt font. It's implied by a quality label that legally requires a specific region.&lt;/p&gt;

&lt;p&gt;But "somewhere" is doing a lot of heavy lifting in that sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's fragmented.&lt;/strong&gt; One database has the product name but no origin. A retailer page has "Country of manufacture: Germany" buried in the specs tab nobody clicks. An open food database has a sanitary registration code that &lt;em&gt;implies&lt;/em&gt; a packaging location, which may or may not match where it was made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's inconsistent.&lt;/strong&gt; "Made in France." "Fabriqué en France." "Pays de fabrication : FR." "Lieu de production : Normandie." Same information, a dozen different formats across languages and conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's actively misleading.&lt;/strong&gt; And this is where it gets really fun. A retailer might display "Fabriqué en France" as a site-wide promotional banner, not a statement about the specific product you're looking at. Amazon might show "Country of Origin: China" for the &lt;em&gt;seller's account&lt;/em&gt;, not the product. A brand's website proudly states "French since 1921" while manufacturing in Italy through a parent company nobody's heard of.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjo2ktghtn8vmx2vwy2c2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjo2ktghtn8vmx2vwy2c2.png" alt="The same product's origin information scattered across five different sources, each with a different (and sometimes contradictory) fragment of data." width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not a database lookup problem. This is a reasoning problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The obvious approaches. We tried them all.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Just build a database."&lt;/strong&gt; We did. We integrated a product database covering 69 million items. It has names, brands, categories, labels, and for some products, manufacturing origin. When that field is populated, it's rock solid. Problem: it's populated for maybe 15-20% of products. The other 80% give you a name and brand, but no origin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Just scrape retailer sites."&lt;/strong&gt; Tried that too. Some retailers do list manufacturing origin in product specs. But not all products, not all retailers, and the HTML structure varies wildly. A static scraping pipeline breaks every time someone redesigns a product page. Which is constantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Isn't this regulated?"&lt;/strong&gt; In the EU, manufacturing country isn't mandatory on most product labels. Food is somewhat better covered, but even food has exceptions. And regulatory databases, when they exist, are rarely machine-readable.&lt;/p&gt;

&lt;p&gt;Each approach alone tops out at ~20-30% coverage. And none of them can tell you &lt;em&gt;how confident&lt;/em&gt; to be. A direct "Made in Germany" statement on the manufacturer's website is a completely different signal than inferring "probably Germany" because the brand is German and the barcode prefix is 400.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is actually a reasoning task
&lt;/h2&gt;

&lt;p&gt;Here's the insight that changed everything for us: finding where a product is manufactured is a multi-step reasoning task with uncertain evidence.&lt;/p&gt;

&lt;p&gt;A real example from our benchmark: a toothbrush with a French barcode, sold by a brand founded in France in 1921, now owned by an Italian conglomerate. First web search returns a retailer page showing "Fabriqué en France", but is that about this product, or a promotional banner for the retailer's French-made product line? A second result shows the parent company runs factories in Italy, Poland, and France. An open database has no manufacturing data but lists a sanitary code starting with "IT", suggesting Italian packaging.&lt;/p&gt;

&lt;p&gt;To actually figure this out, you need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search across multiple sources, in multiple languages&lt;/li&gt;
&lt;li&gt;Actually &lt;em&gt;read&lt;/em&gt; the pages, not just search snippets, to verify that "Made in X" refers to this specific product&lt;/li&gt;
&lt;li&gt;Cross-reference: does the sanitary code match the web sources? Does the corporate ownership explain the discrepancy?&lt;/li&gt;
&lt;li&gt;Calibrate confidence. Is this verified or an educated guess?&lt;/li&gt;
&lt;li&gt;Know when to stop. Some products can't be traced from public sources. "Unknown" beats a wrong answer every time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is textbook AI agent territory. Not a single LLM call. Not RAG. An &lt;em&gt;agent&lt;/em&gt; that decides what to do next based on what it's found so far.&lt;/p&gt;
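The steps above can be sketched as a loop over accumulated evidence. Everything here is illustrative: in the real system the "decide" step is the LLM itself, not a hand-written policy:

```typescript
// Illustrative agent loop policy. The real agent lets the model decide;
// this stub just encodes the shape of the decisions.
type Confidence = "verified" | "probable" | "low";

interface Evidence {
  source: string;
  country?: string;
  direct: boolean; // explicit "Made in X" for THIS product, vs. an inference
}

interface Action {
  kind: "search" | "readPage" | "submit";
  query?: string;
  answer?: { country: string | null; confidence: Confidence };
}

function decideNext(evidence: Evidence[], turn: number, maxTurns: number): Action {
  const direct = evidence.find((e) => e.direct && e.country);
  if (direct) {
    // Explicit, product-specific statement: submit with high confidence.
    return { kind: "submit", answer: { country: direct.country!, confidence: "verified" } };
  }
  if (turn >= maxTurns) {
    // Knowing when to stop: "unknown" beats a wrong answer.
    return { kind: "submit", answer: { country: null, confidence: "low" } };
  }
  // Otherwise keep gathering evidence: new keywords, another language...
  return { kind: "search", query: `attempt ${turn + 1}` };
}
```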

&lt;p&gt;In practice, this is what it looks like: you scan a product in a store, and within a few seconds you get the manufacturing country, a confidence level, the reasoning behind it, and links to the sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frein0e071pmxrm18fris.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frein0e071pmxrm18fris.jpg" alt="A scan in Mio: manufacturing country, confidence level, reasoning and sources." width="800" height="1734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture (high level)
&lt;/h2&gt;

&lt;p&gt;The system follows a simple priority chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Database first&lt;/strong&gt;: when a structured database has the origin with high confidence, return it instantly. No LLM needed. This handles ~15-20% of queries in milliseconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent for the rest&lt;/strong&gt;: an LLM agent with access to web search and page reading, tasked with finding and verifying the manufacturing country. It searches, reads pages, cross-references, and submits an answer with a confidence level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence as a first-class output&lt;/strong&gt;: every result comes with "verified" (explicit source), "probable" (indirect evidence), or "low" (couldn't find much). This distinction matters more than the country itself for user trust.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent can dynamically decide: search with different keywords, read a promising page, try a different language, or bail and report low confidence. That adaptive loop is the whole point.&lt;/p&gt;
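The priority chain can be sketched in a few lines. The two function signatures are hypothetical stand-ins for the real database lookup and the real agent:

```typescript
// Sketch of the priority chain: database hit short-circuits, agent handles the rest.
type Confidence = "verified" | "probable" | "low";

interface OriginResult {
  country: string | null;
  confidence: Confidence;
  source: string;
}

async function resolveOrigin(
  barcode: string,
  dbLookup: (b: string) => Promise<OriginResult | null>, // structured database
  runAgent: (b: string) => Promise<OriginResult>,        // LLM agent: search + read
): Promise<OriginResult> {
  // Database first: instant answer, no LLM cost, ~15-20% of queries.
  const hit = await dbLookup(barcode);
  if (hit && hit.confidence === "verified") return hit;
  // Agent for the other ~80%: search, read, cross-reference, submit.
  return runAgent(barcode);
}
```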

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkiz58kbcw42q6qgizch4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkiz58kbcw42q6qgizch4.png" alt="When the database has the answer, no AI needed. For the other 80%, an agent searches, reads, and cross-references public sources before submitting a result with a confidence level." width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's an important constraint though: this runs in real time. A user scans a product in a store and expects an answer in seconds, not minutes. And every web search, every page read costs money. So the system needs to be accurate, fast, &lt;em&gt;and&lt;/em&gt; cheap. A model that gets 5% more answers right but costs 5x more per scan and takes 30 seconds instead of 10 isn't viable for a consumer app. Finding the right balance between accuracy, cost, and latency turned out to be as hard as the accuracy problem itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five traps that will wreck your accuracy
&lt;/h2&gt;

&lt;p&gt;Building this system taught me things no tutorial or documentation covers. Here are the failure modes that cost us the most time:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The GS1 prefix trap
&lt;/h3&gt;

&lt;p&gt;The agent sees a barcode starting with 300 (France) and subconsciously anchors on France, even when the evidence points elsewhere. We had to explicitly break this: "The barcode prefix is where the brand is &lt;em&gt;registered&lt;/em&gt;. It is NOT evidence of manufacturing origin." Without this, the agent has a strong France bias. 5 out of 7 false confidence cases in our first benchmark were the agent saying "France" when it was wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The brand ≠ factory trap
&lt;/h3&gt;

&lt;p&gt;Moulinex is a French brand. It manufactures in China, Poland, and France depending on the product line. Our agent confidently said "Made in France" for products manufactured on a different continent, because the brand's Wikipedia page says "French company." "French brand" is not "French product." Obvious in hindsight. Not obvious to an LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The retailer badge trap
&lt;/h3&gt;

&lt;p&gt;This was our number one source of false confidence. Some retail websites show origin-related badges ("Made in France," "Produit local") as promotional elements across their entire site. These show up in search snippets &lt;em&gt;right next to&lt;/em&gt; the product listing. The agent can't distinguish a product-specific statement from a marketing banner without actually reading the full page.&lt;/p&gt;

&lt;p&gt;We had cases where the agent stated "verified: Made in France" based on a badge that applied to a completely different product line on the same retailer site. Brutal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x20qfpx2h0oy1xoyoui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x20qfpx2h0oy1xoyoui.png" alt="A real retailer page where " width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The "EU" trap
&lt;/h3&gt;

&lt;p&gt;Many products say "Made in EU." Technically correct, practically useless. 27 member states. We spent a week trying to handle this at the model level. The model completely ignored our instructions across every prompt version we tried. Sometimes the right answer is to accept the limitation.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Packaging ≠ manufacturing
&lt;/h3&gt;

&lt;p&gt;Sanitary registration codes (EMB codes) tell you where a product was &lt;em&gt;packaged&lt;/em&gt;, not where it was &lt;em&gt;manufactured&lt;/em&gt;. A product made in Spain can be packaged in France and carry a French code. The data &lt;em&gt;looks&lt;/em&gt; authoritative, which is exactly what makes it dangerous.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually matters (after 108 benchmark runs)
&lt;/h2&gt;

&lt;p&gt;We ran 108 benchmarks over three weeks. Seven models from four providers. Six major prompt versions with dozens of sub-variants. A golden dataset we hand-curated from 21 items to 57, adding harder cases as the easy ones stabilized. Every single run was measured against ground-truth labels, with the prompt version and git SHA recorded on each trace in &lt;a href="https://langfuse.com" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt; for full reproducibility.&lt;/p&gt;

&lt;p&gt;We went from 42% accuracy to 78%. Here's what crystallized:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False confidence is the metric that matters.&lt;/strong&gt; Not accuracy. A system that says "I don't know" when it doesn't know is infinitely more trustworthy than one that answers everything but is wrong 15% of the time. We call it "false confidence": the agent says "verified" and it's &lt;em&gt;wrong&lt;/em&gt;. That's the number we optimize against above all others.&lt;/p&gt;
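Computing the metric is trivial; the hard part is choosing to optimize for it. A sketch of false confidence exactly as defined above (type names are mine):

```typescript
// False confidence: fraction of items where the agent said "verified"
// AND the answer was wrong.
interface BenchmarkItem {
  predicted: string | null;
  truth: string;
  confidence: "verified" | "probable" | "low";
}

function falseConfidenceRate(items: BenchmarkItem[]): number {
  const falselyConfident = items.filter(
    (i) => i.confidence === "verified" && i.predicted !== i.truth,
  );
  return falselyConfident.length / items.length;
}
```

Note that a wrong answer delivered with "low" confidence does not count: the system was honest about its uncertainty, which is the behavior being rewarded.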

&lt;p&gt;&lt;strong&gt;The information quality hierarchy is steep.&lt;/strong&gt; A structured database field is gold. An explicit "Made in X" on the manufacturer's website is silver. A retailer listing with origin in the specs is bronze. A search snippet mentioning a country near a product name is &lt;em&gt;lead&lt;/em&gt;, heavy and potentially toxic. We learned this the hard way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimization is three-dimensional.&lt;/strong&gt; We iterated on prompts, tooling, and models, and the interactions between the three are what matter. Prompt rules that failed on a smaller model worked perfectly on a smarter one. Parallel tool execution only helped because the model was smart enough to batch calls and the prompt told it to. We doubled search results from 5 to 10 per query and accuracy dropped, not because "more is bad" but because that model couldn't handle the noise. The best results came from finding the right combination across all three axes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intellectual honesty is non-negotiable.&lt;/strong&gt; We're not auditing factories. We're not certifying supply chains. We're aggregating publicly available information, assigning a confidence level, and presenting it transparently. If a brand lies on its website, we'll relay that lie, and the confidence system will reflect how many independent sources confirmed it. Being clear about what the system can and cannot do is both the ethical choice and the one that builds the most trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  This pattern is everywhere
&lt;/h2&gt;

&lt;p&gt;The reason I'm writing this up is that this problem structure is way more common than people realize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The answer exists &lt;em&gt;somewhere&lt;/em&gt; in public sources&lt;/li&gt;
&lt;li&gt;No single source is reliable on its own&lt;/li&gt;
&lt;li&gt;The reasoning path depends on what you find at each step&lt;/li&gt;
&lt;li&gt;Confidence calibration is as important as the answer itself&lt;/li&gt;
&lt;li&gt;The problem looks trivially solvable until you actually try to automate it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the problems where AI agents genuinely earn their keep. Not because any individual step is hard (searching, reading a webpage, comparing two strings) but because orchestrating those steps requires judgment. When to search again, when to read the full page, when to accept the evidence, when to give up.&lt;/p&gt;

&lt;p&gt;108 benchmark runs, 7 models, 6 prompt versions, 3 weeks. The journey from "this kind of works" to "this is reliable enough to ship" was far more interesting, and far more counterintuitive, than I expected. Prompt rules that failed on one model worked on another. Changes I'd written off as failures turned into wins in a different context. The biggest gains came from places I didn't expect.&lt;/p&gt;

&lt;p&gt;That's what I'll cover in the rest of this series.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: Why we built the evaluation framework before writing a single line of prompt. And why "it seems better on a few examples" is the most dangerous sentence in AI engineering.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm building &lt;a href="https://getmio.app" rel="noopener noreferrer"&gt;Mio&lt;/a&gt;, an app that surfaces manufacturing origin from product barcodes. 108 runs, 7 models, a hand-curated golden dataset, and an LLM-as-judge system reviewing the agent's work. If you've built evaluation pipelines for AI agents or dealt with similar multi-source reasoning problems, I'd love to hear about your experience.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
