TL;DR — With preview access to Gemini 3.1 Pro and Gemini 3 Flash, I built an open benchmark (200 prompts × 5 models = 1,000+ calls) and the results invert the "newer Pro is always better" intuition. Gemini 3 Flash (preview) Pareto-dominates Gemini 3.1 Pro (preview) — higher quality, 2.4× lower latency, 3× cheaper. Meanwhile, Gemini 2.5 Flash Lite is only 3 points behind the best model overall, at 1/15th the cost. Everything is reproducible: repo + raw data.
Why this post exists
I'm in the preview cohort for Gemini 3.1 Pro and Gemini 3 Flash via the Google AI API. Between the Google Cloud NEXT 2026 Writing Challenge and the fact that no independent public benchmark covers the 3.x family yet, the timing was perfect to put real numbers behind the marketing claims — and to see if I'd actually been picking the right model in production.
What I ran:
- 200 hand-curated prompts across 5 categories (40 each): reasoning, code, RAG, creative writing, multilingual translation.
- 5 Gemini models, covering the (generation × tier) matrix:
| Generation | Pro | Flash | Lite |
|---|---|---|---|
| 3.1 | `gemini-3.1-pro-preview` | — | — |
| 3.0 | (quota-limited, see below) | `gemini-3-flash-preview` | — |
| 2.5 | `gemini-2.5-pro` | `gemini-2.5-flash` | `gemini-2.5-flash-lite` |
- 1,077 graded calls on the live Gemini API, with per-call latency and token usage recorded.
- Automated grading: exact-match for reasoning, sandboxed Python unit tests for code, keyword recall + hallucination penalty for RAG, and `gemini-2.5-flash` as a rubric judge for creative + multilingual (with `thinkingBudget: 0` for deterministic JSON).
Every prompt, every grade, and every raw response is in the repo under MIT. Total API cost at published pricing: $0.39.
⚠️ I originally planned 6 models. `gemini-3-pro-preview` shares a 250 requests/day preview-tier quota with `gemini-3.1-pro-preview`, which I burned through on the 3.1 pass. I kept the 87 partial `gemini-3-pro-preview` rows in `results/raw.jsonl` for transparency, but they're excluded from the charts to avoid an unfair non-uniform sample.
The one chart that matters
Two dots above and to the left of the others are strictly better on both axes:
- Gemini 3 Flash (preview) — 0.926 average quality, 2.5s median latency.
- Gemini 2.5 Flash Lite — 0.90 quality, 1.4s latency.
Every other model in the test is dominated by one of these two. That is the headline finding.
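The dominance claim is mechanical to check. Here is a minimal sketch using the quality/latency/cost figures from the Finding 1 table in this post; the helper function itself is my own illustration, not part of the benchmark harness:

```python
def dominates(a: dict, b: dict) -> bool:
    """True if model a is at least as good as b on every axis
    (higher quality, lower latency, lower cost) and strictly
    better on at least one of them."""
    at_least_as_good = (a["quality"] >= b["quality"]
                        and a["latency"] <= b["latency"]
                        and a["cost"] <= b["cost"])
    strictly_better = (a["quality"] > b["quality"]
                       or a["latency"] < b["latency"]
                       or a["cost"] < b["cost"])
    return at_least_as_good and strictly_better

# Figures from the Finding 1 table (quality 0-1, median latency in
# seconds, cost in USD per 1,000 tasks).
flash3 = {"quality": 0.926, "latency": 2.5, "cost": 0.19}
pro31  = {"quality": 0.923, "latency": 5.9, "cost": 0.57}

print(dominates(flash3, pro31))  # True: better on all three axes
print(dominates(pro31, flash3))  # False
```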
Finding 1: Gemini 3 Flash (preview) ≥ Gemini 3.1 Pro (preview) on everything that matters
Nobody expects a preview Flash model to beat a preview Pro model. On our 200-prompt mix, it does — on almost every axis:
| Metric | Gemini 3.1 Pro (preview) | Gemini 3 Flash (preview) | Winner |
|---|---|---|---|
| Overall quality (0–1) | 0.923 | 0.926 | Flash 🏆 |
| Median latency | 5.9s | 2.5s | Flash 🏆 |
| Tokens/sec (effective) | 5 | 18 | Flash 🏆 |
| Cost per 1,000 tasks | $0.57 | $0.19 | Flash 🏆 |
| Success rate | 100.0% | 100.0% | tie |
By category:
| Category | 3.1 Pro Preview | 3 Flash Preview | Δ |
|---|---|---|---|
| Reasoning | 0.90 | 0.90 | — tie (+0.00) |
| Code | 0.96 | 0.96 | — tie (+0.00) |
| RAG | 0.95 | 0.97 | ▲ Flash wins (+0.02) |
| Creative | 0.84 | 0.84 | — tie (+0.00) |
| Multilingual | 0.97 | 0.96 | — tie (-0.01) |
Verdict: If you were planning to upgrade from 2.5 Pro to 3.1 Pro for the quality bump, stop — on this mix you get +0.003 more quality from 3 Flash at 2.4× lower latency and 3× lower cost. The one place Pro can still edge ahead is longer-form creative writing, where its deeper reasoning occasionally produces stronger structure — though 11/40 of its creative outputs were partially truncated at our 6,144-token cap, which unfairly depresses its score. Net-net: for structured work, 3 Flash Preview is the new default.
Finding 2: Gemini 2.5 Flash Lite is the hidden production winner
The smallest model in the Gemini family, gemini-2.5-flash-lite, lands at 0.900 on overall quality vs 0.928 for the best model on test (Gemini 2.5 Pro). The gap in raw quality is just 0.028 points — less than the within-model noise on many categories.
In return for those 3 percentage points:
- 6.8× faster per call — p50 latency 1.4s vs 9.8s.
- 15.7× cheaper per task at published pricing ($0.06 vs $0.87 per 1,000 tasks).
- 80 tok/s effective throughput — comfortably inside real-time interactive latency budgets (~100 ms first-token on a short answer).
The only category where Flash Lite falls behind noticeably is reasoning (0.82 vs the 0.90 ceiling most other models hit). If your workload is mostly short-answer / extraction / routing, switching to Flash Lite is a free 10× cost reduction. For deeper reasoning traffic, route selectively.
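Selective routing can be as simple as a category lookup. The sketch below is a hypothetical policy based on the scores reported here — the category names and the escalation threshold are my own illustration, not anything shipped in the harness:

```python
# Hypothetical routing policy: Flash Lite for extraction/routing-style
# traffic, a stronger Flash model only for the reasoning-heavy slice
# where Flash Lite fell behind (0.82 vs ~0.90 in this benchmark).
DEFAULT_MODEL = "gemini-2.5-flash-lite"
ESCALATE_MODEL = "gemini-3-flash-preview"
REASONING_CATEGORIES = {"reasoning", "math", "multi_step"}

def pick_model(category: str) -> str:
    """Route a request to a model id based on its task category."""
    if category in REASONING_CATEGORIES:
        return ESCALATE_MODEL
    return DEFAULT_MODEL

print(pick_model("extraction"))  # gemini-2.5-flash-lite
print(pick_model("reasoning"))   # gemini-3-flash-preview
```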
Finding 3: The "thinking token" tax silently destroys your output
This one cost me three hours of debugging. Starting with Gemini 2.5, both Flash and Pro reserve part of the maxOutputTokens budget for internal "thinking" tokens. The API never tells you how many until you inspect usageMetadata.thoughtsTokenCount. My judge came back empty on the first run:
{
"finishReason": "MAX_TOKENS",
"candidates": [{"content": {"role": "model"}}],
"usageMetadata": {
"candidatesTokenCount": 0,
"thoughtsTokenCount": 197
}
}
With maxOutputTokens: 200, the 197-token thinking pass ate the entire budget and left no room for the actual JSON answer. My grader silently got zeros.
The same bug bit me again later — on gemini-2.5-pro, 12 of 40 multilingual translations came back completely empty at maxOutputTokens: 1536. At first glance, 2.5 Pro looked like it was a disaster at translation (0.69 vs the others' 0.96+). The real story: its thinking pass was burning through 1,500+ tokens on structured translation tasks before producing a single visible character. After raising the budget to 3,072 and re-running those 17 prompts (cost: $0.51), its multilingual score jumped back to 0.96.
Two fixes, both legitimate:
# Option A — disable thinking for short structured outputs (fastest, deterministic)
"generationConfig": {
"maxOutputTokens": 200,
"responseMimeType": "application/json",
"thinkingConfig": {"thinkingBudget": 0},
}
# Option B — give thinking room to breathe (keeps quality on hard prompts)
"generationConfig": {"maxOutputTokens": 4096}
I ended up sizing each category separately to avoid throwing away compute on cheap prompts while still fitting heavy thinking on hard ones:
CATEGORY_MAX_TOKENS = {
"reasoning": 4096,
"code": 6144,
"rag": 2048,
"creative": 6144,
"multilingual": 3072,
}
If you're seeing "flaky" Gemini 2.5/3 responses in production, check finishReason on every call. Silent truncation is the #1 cause of intermittent quality drops that your own logs won't flag.
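A minimal guard for this looks like the sketch below. It accepts both the candidate-level `finishReason` used by the v1beta API and the top-level placement shown in the simplified response above; the function name and message strings are my own:

```python
def check_truncation(response: dict):
    """Return a human-readable warning if a generateContent call hit
    MAX_TOKENS, else None. Checks finishReason on the first candidate
    and, as a fallback, at the top level of the response dict."""
    cand = (response.get("candidates") or [{}])[0]
    finish = cand.get("finishReason") or response.get("finishReason")
    if finish != "MAX_TOKENS":
        return None
    usage = response.get("usageMetadata", {})
    visible = usage.get("candidatesTokenCount", 0)
    thoughts = usage.get("thoughtsTokenCount", 0)
    if visible == 0:
        return f"empty output: thinking consumed the whole budget ({thoughts} tokens)"
    return f"output truncated after {visible} tokens ({thoughts} spent thinking)"

resp = {
    "finishReason": "MAX_TOKENS",
    "candidates": [{"content": {"role": "model"}}],
    "usageMetadata": {"candidatesTokenCount": 0, "thoughtsTokenCount": 197},
}
print(check_truncation(resp))
# → empty output: thinking consumed the whole budget (197 tokens)
```

Logging that warning on every call is what turns "flaky quality" into an actionable token-budget bug.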
Finding 4: The Pro tier is over-served for 80% of workloads
Pro latency is paid upfront on every call, thinking or not. Measured as effective throughput (output tokens only, so thinking time counts against it), 3.1 Pro Preview manages 5 tok/s against 18 for 3 Flash Preview and 80 for 2.5 Flash Lite.
Aggregated across the two Pro tiers in the benchmark:
| Tier | Avg quality | Avg latency | Avg cost/1,000 |
|---|---|---|---|
| Pro (2.5 Pro + 3.1 Pro Preview) | 0.925 | 7.8s | $0.72 |
| Flash (2.5 Flash + 3 Flash Preview) | 0.920 | 2.9s | $0.22 |
Pro buys you +0.005 quality points for 2.7× the latency and 3.3× the cost. Unless your workload has a hard quality gap on a specific category (reasoning is the obvious candidate, though our 40-prompt slice didn't surface one), defaulting to Flash is the rational choice — especially when the newer Flash (gemini-3-flash-preview) outperforms the older Pro (gemini-2.5-pro).
Methodology
Prompts (200, hand-written, no GPT-generated content)
- Reasoning (40): GSM8K-style arithmetic, ratio, geometry, probability, logic. Graded by exact-match of the final numeric or fractional answer after `Answer:`.
- Code (40): HumanEval-style Python functions. Each prompt ships 4 hidden unit tests run in a subprocess sandbox with a 15-second timeout. Score = passed / total.
- RAG (40): short factual passage → grounded question. Graded by keyword recall minus a hallucination penalty (−0.25 per forbidden distractor that appears in the output).
- Creative (40): flash fiction, haiku, taglines, incident post-mortems in a humorous register. `gemini-2.5-flash` with `thinkingBudget: 0` scores each output 0–10 against a per-prompt rubric (twist, word count, coherence, imagery, etc.).
- Multilingual (40): translation + short-answer across 14 target languages spanning 5 scripts (Bengali, Hindi, Japanese, Russian, Chinese, Korean, Arabic, Urdu, plus Latin-script Spanish/French/German/Italian/Portuguese/Turkish/Vietnamese). Graded by the same judge, with a script-detection gate: a Bengali answer written in Latin letters scores 0.
Grading fairness
- Exact-match and unit-test grading are deterministic — re-running gives the same score to the byte.
- Judge grading uses `temperature=0`, `thinkingBudget=0`, and each output is graded once (no best-of-N). The judge prompt never contains the producing model's name.
- Cost numbers use published Google AI pricing (Pro = $1.25/$10 per 1M tok, Flash = $0.30/$2.50, Lite = $0.10/$0.40). Preview-tier pricing hasn't been announced, so I conservatively price the 3.x previews at their 2.5 equivalents.
- Caveats: `gemini-3.1-pro-preview` had 11/40 creative prompts partially truncated at our 6,144-token creative cap due to deep thinking — its 0.84 creative score is a floor, not a ceiling. Since I'd burned the daily quota, I couldn't retry with a larger budget.
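For anyone checking the cost columns: per-call cost from the published per-1M-token prices is a two-line calculation. The token counts in the example are hypothetical; the prices are the ones quoted above:

```python
# Published Google AI prices, USD per 1M tokens: (input, output).
PRICES = {
    "pro":   (1.25, 10.00),
    "flash": (0.30, 2.50),
    "lite":  (0.10, 0.40),
}

def call_cost(tier: str, in_tokens: int, out_tokens: int) -> float:
    """USD cost of one call from token counts and per-1M-token prices."""
    in_price, out_price = PRICES[tier]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# A hypothetical 400-tokens-in / 300-tokens-out call on a Flash model:
print(round(call_cost("flash", 400, 300), 6))  # 0.00087
```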
Runtime
- `aiohttp` with `concurrency=8` against the v1beta Gemini API.
- Full run completed in ~18 minutes end-to-end on a single laptop.
- Crash-resume built in: the runner reads `results/raw.jsonl` and skips `(model, prompt_id)` pairs already completed.
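The resume logic amounts to building a set of completed pairs from the JSONL before dispatching any requests. A sketch, assuming one JSON object per line with `model` and `prompt_id` fields (the actual field names in `raw.jsonl` may differ):

```python
import json
from pathlib import Path

def completed_pairs(path: str = "results/raw.jsonl") -> set:
    """Collect (model, prompt_id) pairs already present in the results
    file, so the runner can skip them after a crash."""
    done = set()
    p = Path(path)
    if not p.exists():
        return done
    for line in p.read_text().splitlines():
        if line.strip():
            row = json.loads(line)
            done.add((row["model"], row["prompt_id"]))
    return done

# Dispatch only what's missing:
# todo = [(m, pid) for m in models for pid in prompts
#         if (m, pid) not in completed_pairs()]
```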
Reproducibility
All 200 prompts, all 1,077 raw responses, the grader, and the six published charts are in one place:
→ https://github.com/x-tahosin/gemini-bench-2026 (MIT license)
git clone https://github.com/x-tahosin/gemini-bench-2026
cd gemini-bench-2026
pip install -r requirements.txt
export GEMINI_KEY="your_key"
python -m bench.runner # ~20 min on concurrency=8
python -m bench.grader # ~13 min, 480 judge calls
python -m bench.analyze # charts + summary.json
Want to add OpenAI / Anthropic / Mistral? bench/config.py is a list of ModelSpec objects — add rows, edit runner.py's request builder to handle the new API shape, done.
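For a sense of what adding a row looks like, here is a plausible shape for such a spec. The field names are purely hypothetical — check `bench/config.py` for the real ones:

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    # Hypothetical fields, for illustration only.
    name: str          # model id sent to the API
    api: str           # which request builder in runner.py to use
    price_in: float    # USD per 1M input tokens
    price_out: float   # USD per 1M output tokens

MODELS = [
    ModelSpec("gemini-2.5-flash-lite", "gemini", 0.10, 0.40),
    # ModelSpec("<new-provider-model>", "openai", ..., ...),
]
print(MODELS[0].name)  # gemini-2.5-flash-lite
```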
What I changed my mind about
- I used to default to `gemini-2.5-pro`. After this run, I default to `gemini-3-flash-preview` (or its 2.5-flash predecessor when preview tier isn't available) and only escalate to Pro when a benchmark actually shows a quality gap for the specific task. For the workloads in this test, Pro wasn't worth the 4× latency.
- I assumed the 3.x preview line was a strict upgrade. It's not — it trades latency and output-completeness for reasoning depth. On short-answer workloads (RAG, structured JSON, translation), the depth doesn't cash in.
- I assumed I was using the API correctly. The thinking-token budget behavior means my old `maxOutputTokens=512` production code has quietly been eating 80% of its budget on thinking for months. Probably yours too. Check `finishReason` on every call.
Credits & follow-up
- Benchmark harness: github.com/x-tahosin/gemini-bench-2026 — MIT-licensed, fork-friendly.
- Published for the Google Cloud NEXT 2026 Writing Challenge.
- Tooling: Gemini API, `aiohttp`, `matplotlib`, `Pillow`, `requests`.
If this saved you API dollars or production debugging time, a ❤️ or a ⭐ on the repo makes the next run happen faster. Open to adding OpenAI + Anthropic to the same harness if there's interest — drop a comment with the model list you'd want to see.