S M Tahosin

I Ran 1,000+ Real Tests on Gemini 3 Flash (Preview) vs Gemini 2.5 Pro — Flash Won

TL;DR — With preview access to Gemini 3.1 Pro and Gemini 3 Flash, I built an open benchmark (200 prompts × 5 models = 1,000+ calls) and the results invert the "newer Pro is always better" intuition. Gemini 3 Flash (preview) Pareto-dominates Gemini 3.1 Pro (preview) — higher quality, 2.4× lower latency, 3× cheaper. Meanwhile, Gemini 2.5 Flash Lite is only 3 points behind the best model overall, at 1/15th the cost. Everything is reproducible: repo + raw data.


Why this post exists

I'm in the preview cohort for Gemini 3.1 Pro and Gemini 3 Flash via the Google AI API. Between the Google Cloud NEXT 2026 Writing Challenge and the fact that no independent public benchmark covers the 3.x family yet, the timing was perfect to put real numbers behind the marketing claims — and to see if I'd actually been picking the right model in production.

What I ran:

  • 200 hand-curated prompts across 5 categories (40 each): reasoning, code, RAG, creative writing, multilingual translation.
  • 5 Gemini models, covering the (generation × tier) matrix:
| Generation | Pro | Flash | Lite |
| --- | --- | --- | --- |
| 3.1 | gemini-3.1-pro-preview | | |
| 3.0 | gemini-3-pro-preview (quota-limited, see below) | gemini-3-flash-preview | |
| 2.5 | gemini-2.5-pro | gemini-2.5-flash | gemini-2.5-flash-lite |
  • 1,077 graded calls on the live Gemini API, with per-call latency and token usage recorded.
  • Automated grading: exact-match for reasoning, sandboxed python unit tests for code, keyword recall + hallucination penalty for RAG, and gemini-2.5-flash as a rubric judge for creative + multilingual (with thinkingBudget: 0 for deterministic JSON).

Every prompt, every grade, and every raw response is in the repo under MIT. Total API cost at published pricing: $0.39.

⚠️ I originally planned 6 models. gemini-3-pro-preview shares a 250 requests/day preview-tier quota with gemini-3.1-pro-preview, which I burned through on the 3.1 pass. I kept the 87 partial gemini-3-pro-preview rows in results/raw.jsonl for transparency, but they're excluded from the charts to avoid an unfair non-uniform sample.


The one chart that matters

Quality vs latency — Pareto frontier

Two dots above and to the left of the others are strictly better on both axes:

  • Gemini 3 Flash (preview): 0.926 average quality, 2.5s median latency.
  • Gemini 2.5 Flash Lite: 0.90 quality, 1.4s latency.

Every other model in the test is dominated by one of these two. That is the headline finding.

Quality heatmap: model × category


Finding 1: Gemini 3 Flash (preview) ≥ Gemini 3.1 Pro (preview) on everything that matters

Nobody expects a preview Flash model to beat a preview Pro model. On our 200-prompt mix, it does — on almost every axis:

| Metric | Gemini 3.1 Pro (preview) | Gemini 3 Flash (preview) | Winner |
| --- | --- | --- | --- |
| Overall quality (0–1) | 0.923 | 0.926 | Flash 🏆 |
| Median latency | 5.9s | 2.5s | Flash 🏆 |
| Tokens/sec (effective) | 5 | 18 | Flash 🏆 |
| Cost per 1,000 tasks | $0.57 | $0.19 | Flash 🏆 |
| Success rate | 100.0% | 100.0% | tie |

By category:

| Category | 3.1 Pro Preview | 3 Flash Preview | Δ |
| --- | --- | --- | --- |
| Reasoning | 0.90 | 0.90 | tie (+0.00) |
| Code | 0.96 | 0.96 | tie (+0.00) |
| RAG | 0.95 | 0.97 | ▲ Flash wins (+0.01) |
| Creative | 0.84 | 0.84 | tie (+0.00) |
| Multilingual | 0.97 | 0.96 | tie (-0.01) |

Verdict: If you were planning to upgrade from 2.5 Pro to 3.1 Pro for the quality bump, stop — on this mix you get +0.003 quality from 3 Flash at 2.4× lower latency and 2.9× lower cost. The one place Pro could meaningfully win is longer-form creative writing, where its deeper reasoning occasionally produces stronger structure; 11/40 of its creative outputs were partially truncated at our 6,144-token cap, which unfairly depresses its score and is why the table above shows a tie. Net-net: for structured work, 3 Flash Preview is the new default.

Finding 2: Gemini 2.5 Flash Lite is the hidden production winner

The smallest model in the Gemini family, gemini-2.5-flash-lite, lands at 0.900 overall quality vs 0.928 for the best model on test (Gemini 2.5 Pro). The gap in raw quality is just 0.028 — about 3 percentage points, less than the within-model noise on many categories.

In return for those 3 percentage points:

  • 6.8× faster per call — p50 latency 1.4s vs 9.8s.
  • 15.7× cheaper per task at published pricing ($0.06 vs $0.87 per 1,000 tasks).
  • 80 tok/s effective throughput — comfortably inside real-time interactive latency budgets (~100 ms first-token on a short answer).

The only category where Flash Lite falls behind noticeably is reasoning (0.82 vs the 0.90 ceiling most other models hit). If your workload is mostly short-answer / extraction / routing, switching to Flash Lite is a free 10× cost reduction. For deeper reasoning traffic, route selectively.
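
To make "route selectively" concrete, here is a minimal routing sketch. It is not from the benchmark repo: the task-type labels and the escalation rule are my assumptions, while the model names and the quality/latency figures in the comments come from the results above.

# Hypothetical routing sketch: cheap task types go to Flash Lite,
# reasoning-heavy traffic escalates to 3 Flash Preview.
DEFAULT_MODEL = "gemini-2.5-flash-lite"     # 0.90 quality, 1.4s p50 in this benchmark
REASONING_MODEL = "gemini-3-flash-preview"  # 0.926 quality, 2.5s p50

def pick_model(task_type: str) -> str:
    """Route short-answer/extraction/routing traffic to Flash Lite,
    multi-step reasoning to 3 Flash Preview."""
    if task_type in {"reasoning", "math", "multi_step"}:
        return REASONING_MODEL
    return DEFAULT_MODEL

pick_model("extraction")   # -> gemini-2.5-flash-lite
pick_model("reasoning")    # -> gemini-3-flash-preview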

Cost per 1,000 tasks

Median latency

Finding 3: The "thinking token" tax silently destroys your output

This one cost me three hours of debugging. Starting with Gemini 2.5, both Flash and Pro reserve part of the maxOutputTokens budget for internal "thinking" tokens. The API doesn't tell you how many unless you inspect usageMetadata.thoughtsTokenCount. My judge came back empty on the first run:

{
  "finishReason": "MAX_TOKENS",
  "candidates": [{"content": {"role": "model"}}],
  "usageMetadata": {
    "candidatesTokenCount": 0,
    "thoughtsTokenCount": 197
  }
}

With maxOutputTokens: 200, the 197-token thinking pass ate the entire budget and left no room for the actual JSON answer. My grader silently got zeros.

The same bug bit me again later — on gemini-2.5-pro, 12 of 40 multilingual translations came back completely empty at maxOutputTokens: 1536. At first glance, 2.5 Pro looked like a disaster at translation (0.69 vs the others' 0.96+). The real story: its thinking pass was burning through 1,500+ tokens on structured translation tasks before producing a single visible character. After raising the budget to 3,072 and re-running the 17 affected prompts (cost: $0.51), its multilingual score jumped back to 0.96.

Two fixes, both legitimate:

# Option A — disable thinking for short structured outputs (fastest, deterministic)
"generationConfig": {
    "maxOutputTokens": 200,
    "responseMimeType": "application/json",
    "thinkingConfig": {"thinkingBudget": 0},
}

# Option B — give thinking room to breathe (keeps quality on hard prompts)
"generationConfig": {"maxOutputTokens": 4096}

I ended up sizing each category separately to avoid throwing away compute on cheap prompts while still fitting heavy thinking on hard ones:

CATEGORY_MAX_TOKENS = {
    "reasoning":    4096,
    "code":         6144,
    "rag":          2048,
    "creative":     6144,
    "multilingual": 3072,
}

If you're seeing "flaky" Gemini 2.5/3 responses in production, check finishReason on every call. Silent truncation is the #1 cause of intermittent quality drops that your own logs won't flag.
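
Here's what that guard can look like against the raw v1beta response. The field names match the JSON shown earlier; the helper itself and its logging policy are my own sketch, not the benchmark's code.

def check_truncation(response: dict, prompt_id: str) -> str | None:
    """Return the visible text, or None (with a log line) if thinking ate the budget."""
    cand = response["candidates"][0]
    usage = response.get("usageMetadata", {})
    if cand.get("finishReason") == "MAX_TOKENS":
        thinking = usage.get("thoughtsTokenCount", 0)
        visible = usage.get("candidatesTokenCount", 0)
        print(f"[{prompt_id}] truncated: {thinking} thinking tokens, {visible} visible "
              f"-- raise maxOutputTokens or set thinkingBudget: 0")
        if visible == 0:
            return None  # nothing usable came back
    parts = cand.get("content", {}).get("parts", [])
    return "".join(p.get("text", "") for p in parts) or None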

Finding 4: The Pro tier is over-served for 80% of workloads

Pro latency is paid upfront on every call, thinking or not. Here's tokens per second (output-only, so thinking time is counted against throughput):

Tokens per second

Aggregated across the two Pro tiers in the benchmark:

| Tier | Avg quality | Avg latency | Avg cost/1,000 |
| --- | --- | --- | --- |
| Pro (2.5 Pro + 3.1 Pro Preview) | 0.925 | 7.8s | $0.72 |
| Flash (2.5 Flash + 3 Flash Preview) | 0.920 | 2.9s | $0.22 |

Pro buys you +0.005 quality points for 2.7× the latency and 3.3× the cost. Unless your workload has a hard quality gap on a specific category (reasoning is the obvious candidate, though our 40-prompt slice didn't surface one), defaulting to Flash is the rational choice — especially when the newer Flash (gemini-3-flash-preview) outperforms the older Pro (gemini-2.5-pro).


Methodology

Prompts (200, hand-written, no GPT-generated content)

  • Reasoning (40): GSM8K-style arithmetic, ratio, geometry, probability, logic. Graded by exact-match of the final numeric or fractional answer after Answer:.
  • Code (40): HumanEval-style Python functions. Each prompt ships 4 hidden unit tests run in a subprocess sandbox with a 15-second timeout. Score = passed / total.
  • RAG (40): short factual passage → grounded question. Graded by keyword recall minus a hallucination penalty (−0.25 per forbidden distractor that appears in the output); a sketch of this grader follows the list.
  • Creative (40): flash fiction, haiku, taglines, incident post-mortems in a humorous register. gemini-2.5-flash with thinkingBudget: 0 scores each output 0–10 against a per-prompt rubric (twist, word count, coherence, imagery, etc.).
  • Multilingual (40): translation + short-answer across 15 target languages, eight of them in non-Latin scripts (Bengali, Hindi, Japanese, Russian, Chinese, Korean, Arabic, Urdu) plus Latin-script Spanish/French/German/Italian/Portuguese/Turkish/Vietnamese. Graded by the same judge, with a script-detection gate: a Bengali answer written in Latin letters scores 0.
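
As a concrete sketch of the RAG grader: this is my reconstruction from the description above. Only the 0.25-per-distractor penalty comes from the post; the lowercased matching and clipping to [0, 1] are assumptions.

# Sketch of the RAG grader described above (reconstruction, not the repo's code).
def grade_rag(output: str, required: list[str], forbidden: list[str]) -> float:
    text = output.lower()
    recall = sum(kw.lower() in text for kw in required) / max(len(required), 1)
    penalty = 0.25 * sum(kw.lower() in text for kw in forbidden)
    return max(0.0, min(1.0, recall - penalty))

# 2/2 required keywords present, 1 forbidden distractor mentioned -> 0.75
grade_rag("The outage lasted 42 minutes and hit region-a.",
          required=["42", "region-a"], forbidden=["region-b"])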

Grading fairness

  • Exact-match and unit-test grading are deterministic — re-running gives the same score to the byte.
  • Judge grading uses temperature=0, thinkingBudget=0, and each output is graded once (no best-of-N). The judge prompt never contains the producing model's name.
  • Cost numbers use published Google AI pricing (Pro = $1.25/$10 per 1M tok, Flash = $0.30/$2.50, Lite = $0.10/$0.40). Preview-tier pricing hasn't been announced, so I conservatively price the 3.x previews at their 2.5 equivalents; a small cost-calculator sketch follows this list.
  • Caveats: gemini-3.1-pro-preview had 11/40 creative prompts partially truncated at our 6,144-token creative cap due to deep thinking — its 0.84 creative score is a floor, not a ceiling. Since I'd burned the daily quota, I couldn't retry with a larger budget.
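
The cost calculator referenced above, using only the published prices quoted in the pricing bullet. The per-task token counts are placeholders you'd replace with your own averages.

PRICING = {  # ($ per 1M input tokens, $ per 1M output tokens), from the bullet above
    "pro":   (1.25, 10.00),
    "flash": (0.30, 2.50),
    "lite":  (0.10, 0.40),
}

def cost_per_1000_tasks(tier: str, avg_in_tok: int, avg_out_tok: int) -> float:
    in_price, out_price = PRICING[tier]
    per_task = avg_in_tok / 1e6 * in_price + avg_out_tok / 1e6 * out_price
    return round(per_task * 1000, 2)

# Illustrative averages of 300 input / 250 output tokens per task:
cost_per_1000_tasks("lite", 300, 250)   # 0.13
cost_per_1000_tasks("pro", 300, 250)    # 2.88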

Runtime

  • aiohttp with concurrency=8 against the v1beta Gemini API.
  • Full run completed in ~18 minutes end-to-end on a single laptop.
  • Crash-resume built in: the runner reads results/raw.jsonl and skips (model, prompt_id) pairs already completed.
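
The resume logic is simple enough to sketch. This is an illustrative version, not the repo's exact code; the row field names are taken from the (model, prompt_id) pairs described above.

import json
from pathlib import Path

# Illustrative crash-resume: collect (model, prompt_id) pairs already on disk
# so the runner can skip them on the next pass.
def completed_pairs(raw_path: str = "results/raw.jsonl") -> set[tuple[str, str]]:
    path = Path(raw_path)
    if not path.exists():
        return set()
    return {
        (row["model"], row["prompt_id"])
        for row in map(json.loads, path.read_text().splitlines())
    }

# In the run loop: if (model_id, prompt_id) in completed_pairs(): skip the call.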

Reproducibility

All 200 prompts, all 1,077 raw responses, the grader, and the six published charts are in one place:

https://github.com/x-tahosin/gemini-bench-2026 (MIT license)

git clone https://github.com/x-tahosin/gemini-bench-2026
cd gemini-bench-2026
pip install -r requirements.txt
export GEMINI_KEY="your_key"
python -m bench.runner              # ~20 min on concurrency=8
python -m bench.grader              # ~13 min, 480 judge calls
python -m bench.analyze             # charts + summary.json

Want to add OpenAI / Anthropic / Mistral? bench/config.py is a list of ModelSpec objects — add rows, edit runner.py's request builder to handle the new API shape, done.
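
For reference, here's a hypothetical sketch of what that extension could look like. The ModelSpec field names, the pricing values, and the non-Gemini row are assumptions for illustration, not the repo's actual definitions — check bench/config.py for the real ones.

from dataclasses import dataclass

@dataclass
class ModelSpec:           # hypothetical fields
    name: str              # display name used in charts
    model_id: str          # API model identifier
    provider: str          # which request builder in runner.py to use
    in_price: float        # $ per 1M input tokens
    out_price: float       # $ per 1M output tokens

MODELS = [
    ModelSpec("Gemini 3 Flash (preview)", "gemini-3-flash-preview", "gemini", 0.30, 2.50),
    # Added row for another provider (placeholder pricing):
    ModelSpec("GPT-4.1 mini", "gpt-4.1-mini", "openai", 0.40, 1.60),
]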


What I changed my mind about

  1. I used to default to gemini-2.5-pro. After this run, I default to gemini-3-flash-preview (or its 2.5-flash predecessor when preview tier isn't available) and only escalate to Pro when a benchmark actually shows a quality gap for the specific task. For the workloads in this test, Pro wasn't worth the 4× latency.
  2. I assumed the 3.x preview line was a strict upgrade. It's not — it trades latency and output-completeness for reasoning depth. On short-answer workloads (RAG, structured JSON, translation), the depth doesn't cash in.
  3. I assumed I was using the API correctly. The thinking-token budget behavior means my old maxOutputTokens=512 production code has quietly been eating 80% of its budget on thinking for months. Probably yours too. Check finishReason on every call.

Credits & follow-up

If this saved you API dollars or production debugging time, a ❤️ or a ⭐ on the repo makes the next run happen faster. Open to adding OpenAI + Anthropic to the same harness if there's interest — drop a comment with the model list you'd want to see.
