Jangwook Kim

Posted on May 24 • Originally published at jangwook.net

Gemini API Model Selection Guide 2026 — Speed, Cost, and Quality Trade-offs Measured Directly from Flash-Lite to 3.5 Flash

#gemini #api #llm #benchmark

While setting up a sandbox experiment today, I found something unexpected. I queried the Gemini API model list and saw gemini-3.5-flash already deployed. I had no memory of a public announcement, so I thought I'd misread it.

I verified it — the model was actually callable. So I set aside the original work and measured all four currently available Gemini models under identical conditions. This article is that data.

Environment and Methodology

Let me be honest about the methodology upfront. This is not a rigorous benchmark.

# Test environment
Node.js v22.22.0
@google/generative-ai (latest)
Measured: 2026-05-24
Prompt: "List 5 practical use cases for AI APIs in modern web applications. One sentence each."
Input tokens: 19-22 (varies by tokenizer per model)
Runs: 2-run average

Short prompt, two-run average — statistically shallow. Network conditions, server load, and region will all affect results. But it's enough to calibrate "this model is roughly in this range."
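For reference, here is a minimal sketch of how TTFT and total latency can be measured from a streaming response. It is not the script behind the numbers in this article: `measureStream`, `startStream`, and `average` are hypothetical names, and "TTFT" here means time to the first streamed chunk.

```javascript
// Measures time-to-first-chunk (used as TTFT) and total latency for any
// async iterable of chunks. `startStream` is a hypothetical callback that
// starts the request and returns (or resolves to) the chunk stream.
async function measureStream(startStream, now = () => performance.now()) {
  const startedAt = now();
  let firstChunkAt = null;
  let chunkCount = 0;

  for await (const _chunk of await startStream()) {
    if (firstChunkAt === null) firstChunkAt = now(); // first chunk arrived
    chunkCount += 1;
  }

  const finishedAt = now();
  return {
    ttftMs: firstChunkAt === null ? null : firstChunkAt - startedAt,
    totalMs: finishedAt - startedAt,
    chunkCount,
  };
}

// Average one metric over several runs (this article averages two).
function average(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```

With the @google/generative-ai SDK, `startStream` could be something like `async () => (await model.generateContentStream(prompt)).stream`; any function that returns an async iterable works the same way.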

Models tested:

gemini-2.5-flash-lite — entry-level Flash model
gemini-2.5-flash — default model recommended by most current guides
gemini-2.5-pro — high-capability reasoning model
gemini-3.5-flash — the new model I discovered today

Speed Results: Flash-Lite Is Faster Than Expected

=== Gemini API Benchmark (2026-05-24 measured) ===

[gemini-2.5-flash-lite]
  Total: 2,447ms | TTFT: 1,981ms
  Input: 19 tok | Output: 159 tok
  Est. TPS: 65.0

[gemini-3.5-flash]
  Total: 5,783ms | TTFT: 5,103ms
  Input: 19 tok | Output: 186 tok
  Est. TPS: 32.2

[gemini-2.5-flash]
  Total: 6,334ms | TTFT: 5,849ms
  Input: 19 tok | Output: 170 tok
  Est. TPS: 26.8

[gemini-2.5-pro]
  Total: 11,931ms | TTFT: 11,140ms
  Input: 19 tok | Output: 159 tok
  Est. TPS: 13.3

Flash-Lite dominates at 65 TPS (output tokens divided by total time, TTFT included). It's 4.9x faster than Pro and 2.4x faster than 2.5 Flash. The TTFT gap is significant too: Flash-Lite at about 2.0 seconds versus Pro at 11.1 seconds is a palpable difference in any interactive use case.
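The Est. TPS figures are consistent with output tokens divided by total wall-clock time. A small sketch that recomputes them from the raw numbers above (`estimateTps` is a hypothetical helper name). Because total time includes TTFT, these figures reflect end-to-end latency more than pure generation speed.

```javascript
// Estimated tokens per second: output tokens over total wall-clock time.
// Total time includes TTFT, so this understates pure decode speed.
function estimateTps(outputTokens, totalMs) {
  return outputTokens / (totalMs / 1000);
}

// Figures copied from the benchmark output above.
const runs = {
  'gemini-2.5-flash-lite': { totalMs: 2447, outputTokens: 159 },
  'gemini-3.5-flash': { totalMs: 5783, outputTokens: 186 },
  'gemini-2.5-flash': { totalMs: 6334, outputTokens: 170 },
  'gemini-2.5-pro': { totalMs: 11931, outputTokens: 159 },
};

// Round to one decimal place, matching the benchmark output format.
const tps = Object.fromEntries(
  Object.entries(runs).map(([model, r]) => [
    model,
    Number(estimateTps(r.outputTokens, r.totalMs).toFixed(1)),
  ]),
);

console.log(tps);
// Speed ratios quoted in the text.
console.log((tps['gemini-2.5-flash-lite'] / tps['gemini-2.5-pro']).toFixed(1)); // 4.9
console.log((tps['gemini-2.5-flash-lite'] / tps['gemini-2.5-flash']).toFixed(1)); // 2.4
```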

3.5 Flash is interesting. It's slightly faster than 2.5 Flash and produced more output tokens (186 vs 170). Same prompt, longer response, which may hint at a quality improvement, though token count alone doesn't prove it. I also couldn't verify the official price, so cost comparisons stay rough.

Pro's 11-second latency most likely comes from thinking mode, which is on by default. Even on a short prompt, it runs internal reasoning steps. For simple tasks, that's wasted compute. For complex reasoning, it's the whole point.

Cost Comparison: Official Pricing (May 2026)

Prices verified from official documentation:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Context	Free tier
Gemini 2.5 Flash-Lite	$0.10	$0.40	1M	✓
Gemini 2.5 Flash	$0.30	$2.50	1M	✓
Gemini 3.5 Flash	~$0.50*	~$3.50*	1M	Partial
Gemini 2.5 Pro	$1.25	$10.00	1M+	Limited

Flash-Lite is 67% cheaper on input and 84% cheaper on output compared to Flash. It's also 2.4x faster. That combination raises the question: why use Flash at all for simple tasks?

Pro runs 12.5x Flash-Lite's input cost. As covered in the cross-provider LLM pricing comparison, the gap looks enormous but total cost depends on your actual input/output token ratio.

*Gemini 3.5 Flash pricing isn't officially published yet. I couldn't pull pricing from the API response, and community estimates suggest roughly 1.5x-2x the 2.5 Flash price. Treat the 3.5 Flash figures as rough estimates.

Monthly Cost Scenarios

Theory is one thing. "What does my service actually pay?" matters more. Three scenarios:

Scenario 1: Chatbot service (1M requests/month, 500 input / 200 output tokens)

Model	Monthly cost
Flash-Lite	$130
Flash	$650
3.5 Flash	~$950
Pro	$2,625

Twenty-to-one gap between Flash-Lite and Pro. Using Pro for a simple FAQ bot wastes $2,495 per month.

Scenario 2: Code review agent (50K requests/month, 8,000 input / 2,000 output tokens)

Model	Monthly cost
Flash-Lite	$80
Flash	$370
3.5 Flash	~$550
Pro	$1,500

Flash-Lite stays cheapest even with longer context. But code review is an accuracy-sensitive task — choosing by cost alone tends to backfire.

Scenario 3: RAG document analysis (10K requests/month, 50,000 input / 1,000 output tokens)

Model	Monthly cost
Flash-Lite	$54
Flash	$175
3.5 Flash	~$285
Pro	$725

Long-context use cases benefit most from caching. In today's test, an 840-token context request cost $0.001877 for Flash. If you're repeatedly sending the same system prompt or document, Context Caching can cut the cost of those repeated input tokens by 75%.
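The scenario figures can be reproduced with a few lines of arithmetic. This is a minimal sketch using the table prices above; `monthlyCost` and the scenario names are hypothetical, and Gemini 3.5 Flash is left out because its price is unconfirmed.

```javascript
// USD per 1M tokens, copied from the pricing table above.
// Gemini 3.5 Flash is omitted: its price is an unconfirmed estimate.
const PRICES = {
  'flash-lite': { input: 0.10, output: 0.40 },
  'flash': { input: 0.30, output: 2.50 },
  'pro': { input: 1.25, output: 10.00 },
};

// Monthly cost in USD for a fixed request volume and token profile.
function monthlyCost(model, { requests, inputTokens, outputTokens }) {
  const price = PRICES[model];
  const inputCost = ((requests * inputTokens) / 1e6) * price.input;
  const outputCost = ((requests * outputTokens) / 1e6) * price.output;
  return Math.round((inputCost + outputCost) * 100) / 100; // round to cents
}

const scenarios = {
  chatbot: { requests: 1_000_000, inputTokens: 500, outputTokens: 200 },
  codeReview: { requests: 50_000, inputTokens: 8_000, outputTokens: 2_000 },
  rag: { requests: 10_000, inputTokens: 50_000, outputTokens: 1_000 },
};

for (const [name, scenario] of Object.entries(scenarios)) {
  for (const model of Object.keys(PRICES)) {
    console.log(name, model, monthlyCost(model, scenario));
  }
}
```

Swapping in your own request volume and token counts is usually more informative than any of the three scenarios, since the input/output ratio drives the result.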

Which Model Should You Actually Use

Flash-Lite looks like the obvious winner on speed and cost alone. In practice, it depends.

Use Flash-Lite when:

You're classifying or tagging user inputs in a pipeline
You're generating short text: titles, summaries, keyword extraction
Response latency directly affects UX
You're processing at high QPS
Budget is tight and you can tolerate some quality loss

Honestly, a lot of chatbots and automation pipelines would run fine on Flash-Lite. Many teams using Flash as default probably wouldn't notice the quality difference on their specific tasks.

Use Flash when:

You need medium-complexity instruction following (email drafts, simple code generation)
Running multi-turn conversations with sustained context
Handling multimodal inputs (images, video)
You want a sensible default that covers most cases

Flash is the pragmatic default. It's about 2.4x slower than Flash-Lite and 3x more expensive on input (over 6x on output), but meaningfully more consistent on complex instructions. If I were starting a new project, I'd launch on Flash and shift specific pipelines to Flash-Lite when cost pressure emerged.

Use 3.5 Flash when:

Too early to recommend definitively, because the price isn't confirmed. Today's measurement showed it faster than 2.5 Flash with richer output, but I can't generalize from two runs. I'll revisit once the official docs appear.

Use Pro when:

Analyzing complex codebases or doing architecture review
Tasks involving math or scientific reasoning
RAG over long documents where subtlety matters
B2B use cases where accuracy loss costs more than the model premium

Using Pro for a general chatbot is wasteful. But when I look at real AI agent cost structures, the bigger cost failure usually isn't wrong model selection — it's wrong agent design. Using Flash-Lite where Pro is needed can trigger reprocessing or human review that costs more than the savings.

Three Cost Levers Beyond Model Selection

Model choice isn't the only knob.

1. Context Caching

If your architecture repeatedly sends the same system prompt or reference document, Context Caching is the highest-leverage optimization. Google's documentation states a 75% input cost reduction on cache hits.

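// Context Caching example (Gemini API, @google/generative-ai SDK)
import { GoogleGenerativeAI } from '@google/generative-ai';
import { GoogleAICacheManager } from '@google/generative-ai/server';

// Assumes the API key is in the GEMINI_API_KEY environment variable
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const cacheManager = new GoogleAICacheManager(process.env.GEMINI_API_KEY);

// Cache the reused document once (systemDocument is your own text); TTL is one hour
const cache = await cacheManager.create({
  model: 'gemini-2.5-flash',
  contents: [{ role: 'user', parts: [{ text: systemDocument }] }],
  ttlSeconds: 3600,
});

// Requests made through this model reuse the cached content
const model = genAI.getGenerativeModelFromCachedContent(cache);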
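To see what the discount means per request, here is a rough sketch. It assumes cached tokens bill at 25% of the normal input price (the 75% reduction cited above) and ignores any cache storage charges; `inputCostWithCache` and the 45,000-token cached document are hypothetical.

```javascript
// Per-request input cost in USD when part of the input is served from cache.
// Assumption: cached tokens cost (1 - cacheDiscount) of the normal input
// price. Cache storage fees are not modeled.
function inputCostWithCache({ inputTokens, cachedTokens, pricePerMillion, cacheDiscount = 0.75 }) {
  const freshTokens = inputTokens - cachedTokens;
  const cachedRate = pricePerMillion * (1 - cacheDiscount);
  return (freshTokens * pricePerMillion + cachedTokens * cachedRate) / 1e6;
}

// Hypothetical RAG request on 2.5 Flash ($0.30 per 1M input tokens):
// 50,000 input tokens, of which a 45,000-token document repeats every time.
const withoutCache = inputCostWithCache({
  inputTokens: 50_000,
  cachedTokens: 0,
  pricePerMillion: 0.30,
});
const withCache = inputCostWithCache({
  inputTokens: 50_000,
  cachedTokens: 45_000,
  pricePerMillion: 0.30,
});

console.log(withoutCache.toFixed(6)); // input cost per request, no cache
console.log(withCache.toFixed(6));    // input cost per request, 90% cached
console.log(((1 - withCache / withoutCache) * 100).toFixed(1) + '% saved on input');
```

The saving on the whole request is smaller than 75% because only the repeated portion is discounted and output tokens are billed as usual.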

2. Batch API

Non-real-time processing (bulk analysis, overnight jobs) gets 50% off with the Batch API. A $1,000/month workload becomes $500. Combining Flash-Lite with the Batch API can cut costs by well over 10x versus Pro alone in the right use case: in Scenario 1, Flash-Lite at half price is $65 against Pro's $2,625.

3. Tier mixing

Don't route all traffic through one model. Classification, routing, and summarization on Flash-Lite; core generation on Flash; complex reasoning on Pro. This architecture extracts the most value per dollar.
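A back-of-the-envelope sketch of what tier mixing does to the bill. The 70/25/5 traffic split is a made-up example, `blendedMonthlyCost` is a hypothetical helper, and every tier is assumed to see the same token profile, which real routing would not guarantee. `batchShare` applies the 50% Batch API discount to that fraction of traffic.

```javascript
// USD per 1M tokens from the pricing table (3.5 Flash omitted: unconfirmed).
const PRICES = {
  'gemini-2.5-flash-lite': { input: 0.10, output: 0.40 },
  'gemini-2.5-flash': { input: 0.30, output: 2.50 },
  'gemini-2.5-pro': { input: 1.25, output: 10.00 },
};

// Cost of a single request in USD.
function requestCost(model, inputTokens, outputTokens) {
  const price = PRICES[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}

// `mix` maps model name to its share of traffic (shares should sum to 1).
// `batchShare` is the fraction of traffic sent through the Batch API at 50% off.
function blendedMonthlyCost({ requests, inputTokens, outputTokens, mix, batchShare = 0 }) {
  let perRequest = 0;
  for (const [model, share] of Object.entries(mix)) {
    perRequest += share * requestCost(model, inputTokens, outputTokens);
  }
  return requests * perRequest * (1 - 0.5 * batchShare);
}

// Scenario 1 workload: 1M requests, 500 input / 200 output tokens.
const workload = { requests: 1_000_000, inputTokens: 500, outputTokens: 200 };

const allFlash = blendedMonthlyCost({ ...workload, mix: { 'gemini-2.5-flash': 1 } });
const tiered = blendedMonthlyCost({
  ...workload,
  mix: {
    'gemini-2.5-flash-lite': 0.7,
    'gemini-2.5-flash': 0.25,
    'gemini-2.5-pro': 0.05,
  },
});

console.log(allFlash.toFixed(2), tiered.toFixed(2));
```

Under these assumptions the tiered setup comes to roughly $385 a month against $650 for all-Flash, even with 5% of traffic on Pro.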

Response Style Observations: Same Prompt, Different Outputs

Beyond speed and cost, I also looked at the actual response text from each model. Same prompt ("List 5 practical use cases for AI APIs in modern web applications. One sentence each."), notably different styles.

Flash-Lite was the most literal. Numbered list, one sentence per item, nothing else. No preamble, no closing summary. For classification pipelines, this kind of terse compliance is actually useful — extra explanation adds noise when you're parsing responses into structured data.

Flash added slightly richer expression. Items came with concrete examples or context like "for example, in a customer-facing application." That extra texture works well for user-facing applications where the response reads as complete on its own.

Pro showed a distinct pattern. It opened with a brief framing sentence before the list, and each item's description was more analytical. It also slipped past the "one sentence" constraint on a couple of items. I want to be careful not to call this "better" — it's different. Pro seems to prioritize generating a genuinely useful answer over strictly following the letter of the instruction. Depending on the task, that's either a strength or a problem.

3.5 Flash produced the most output tokens (186) across all models. Each item's description was more detailed than Flash, with additional context woven in naturally. It felt like a balance between following the instruction and giving the reader more value.

What I take from this: when selecting a model, "how fast" and "how cheap" matter, but so does "how literally it follows instructions." For classification, extraction, and structured output generation, precise instruction following is critical. For conversational UX, a bit of autonomous elaboration can improve the experience.
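Literal instruction following can be checked mechanically. Below is a heuristic sketch for the "5 items, one sentence each" prompt; `checkListCompliance` is a hypothetical helper, it assumes a numbered list with one item per line, and its sentence splitting will miscount abbreviations such as "e.g.".

```javascript
// Heuristic compliance check for "N numbered items, one sentence each".
// Assumes each item sits on its own line and starts with "1." or "1)".
function checkListCompliance(text, expectedItems) {
  const lines = text.split('\n').map((line) => line.trim()).filter(Boolean);
  const itemPattern = /^\d+[.)]\s+(.*)$/;
  const items = [];
  const extraLines = []; // preamble, closing summary, or wrapped text

  for (const line of lines) {
    const match = line.match(itemPattern);
    if (match) items.push(match[1]);
    else extraLines.push(line);
  }

  // Split on whitespace that follows sentence-ending punctuation.
  const sentenceCounts = items.map(
    (item) => item.split(/(?<=[.!?])\s+/).filter(Boolean).length,
  );

  return {
    itemCountOk: items.length === expectedItems,
    oneSentenceEach: sentenceCounts.every((count) => count === 1),
    hasPreambleOrOutro: extraLines.length > 0,
  };
}

const terse = [
  '1. Chatbots answer support questions.',
  '2. Search ranks results by meaning.',
  '3. Moderation flags harmful content.',
  '4. Summaries condense long articles.',
  '5. Recommendations personalize product feeds.',
].join('\n');

console.log(checkListCompliance(terse, 5));
```

Running a check like this over a sample of responses per model turns "how literally does it follow instructions" into a rate you can compare.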

Migration Checklist: Moving from Flash to Flash-Lite

Changing a model is one line of config. Validating that it was the right call is harder. Here's what I check before shifting a pipeline to Flash-Lite.

Before switching:

Categorize your actual prompt types. Not all prompts have the same complexity. Classification ("is this A or B?"), summarization ("reduce to 3 sentences"), and extraction ("pull the dates from this text") are Flash-Lite territory. Multi-step instructions like "find the logic error, propose a fix, and write test cases" may need Flash or above.
Define quality criteria numerically. "Works well" isn't a criterion. "Accuracy below 95% triggers a reject" is. Without a measurable threshold, you can't interpret A/B results.
A/B test on a small traffic slice first. Route 5-10% of traffic to Flash-Lite before switching the whole pipeline. This gives you real quality and cost data without full user exposure.
Calculate error rate and retry costs. If Flash-Lite produces more errors or requires more regeneration, the unit price advantage narrows. Always compute "actual cost including retries," not just the token rate.
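Three of these checks are easy to express in code. A minimal sketch, with hypothetical helper names and made-up retry rates: a deterministic traffic slice for the A/B test, a numeric quality gate, and a retry-adjusted cost that assumes failures are independent and retried until they pass.

```javascript
// Deterministically assign a request to the candidate model for a given
// traffic share, so the same ID always lands in the same arm.
// Uses a 32-bit FNV-1a hash of the ID.
function inCandidateSlice(requestId, share) {
  let hash = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < requestId.length; i++) {
    hash ^= requestId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return hash % 10_000 < share * 10_000;
}

// Numeric quality gate: reject the candidate below the accuracy threshold.
function passesQualityGate(accuracy, minAccuracy = 0.95) {
  return accuracy >= minAccuracy;
}

// Expected cost when failed responses are retried until they pass.
// With independent failures, expected attempts = 1 / (1 - retryRate).
function costWithRetries(unitCost, retryRate) {
  return unitCost / (1 - retryRate);
}

// Scenario 1 monthly costs with made-up retry rates for illustration.
const flashLite = costWithRetries(130, 0.10);
const flash = costWithRetries(650, 0.02);

console.log(inCandidateSlice('user-123', 0.05)); // stable for a given ID
console.log(flashLite.toFixed(2), flash.toFixed(2));
```

With these illustrative rates Flash-Lite stays far cheaper, but the same function shows how quickly the gap narrows when a failed response also triggers human review.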

// Basic model routing pattern
function selectModel(taskType, complexityScore) {
  if (complexityScore < 0.3) {
    return 'gemini-2.5-flash-lite';  // classification, extraction
  } else if (complexityScore < 0.7) {
    return 'gemini-2.5-flash';       // general generation, summarization
  } else {
    return 'gemini-2.5-pro';         // complex reasoning, analysis
  }
}

How you measure complexityScore is the real engineering problem. Prompt length, instruction count, and multi-step detection combined into a heuristic is a practical starting point.
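One way such a heuristic could look, as a starting point only. The weights, the 400-word cap, and the verb list are placeholder assumptions to tune against your own traffic, not measured values.

```javascript
// Heuristic complexity score in [0, 1] combining prompt length,
// instruction count, and multi-step markers. All weights and thresholds
// are placeholder values.
function complexityScore(prompt) {
  // Length signal: word count, capped at 400 words.
  const words = prompt.trim().split(/\s+/).filter(Boolean).length;
  const lengthSignal = Math.min(words / 400, 1);

  // Instruction signal: rough count of task verbs, capped at 4.
  const verbPattern =
    /\b(find|fix|propose|write|explain|compare|analyze|refactor|design|summarize|classify|extract|list)\b/gi;
  const instructionCount = (prompt.match(verbPattern) || []).length;
  const instructionSignal = Math.min(instructionCount / 4, 1);

  // Multi-step signal: sequencing words or numbered steps.
  const multiStepPattern = /\b(then|after that|finally|step \d+)\b|^\s*\d+[.)]\s/im;
  const multiStepSignal = multiStepPattern.test(prompt) ? 1 : 0;

  return 0.3 * lengthSignal + 0.4 * instructionSignal + 0.3 * multiStepSignal;
}

const simple = 'Is this review positive or negative?';
const multiPart = 'Find the logic error, propose a fix, and write test cases.';

console.log(complexityScore(simple).toFixed(3));
console.log(complexityScore(multiPart).toFixed(3));
```

With these placeholder weights the classification prompt scores near zero and the multi-part code prompt lands in the middle band, which matches the Flash-Lite versus "Flash or above" split described in the checklist. A keyword heuristic like this is easy to fool, so log the scores next to quality results before trusting the thresholds.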

What I Didn't Get to Test Today

A few honest gaps:

First, I didn't compare quality. Speed and cost only. "More output tokens" doesn't mean "better answer." Flash-Lite's 65 TPS is impressive, but if its code-review quality is 80% of Flash on complex tasks, the calculus changes.

Second, I couldn't confirm the official Gemini 3.5 Flash price. Found the model in the API list, couldn't find it on the pricing page. Community estimates only.

Third, two runs isn't a benchmark. Google's server state, my network latency, prompt characteristics — all of these vary. Before making a production decision, run your actual workload yourself.

My Take

Flash-Lite is faster and cheaper than the "entry model" label implies. If you're defaulting to Flash for classification, routing, or short generation tasks, some of those pipelines could move to Flash-Lite without a noticeable quality drop. Worth a structured test.

3.5 Flash caught my attention. Faster than 2.5 Flash, richer output — but the absence of official documentation suggests it's still in some form of preview. I'll watch it.

Pro is expensive, but whether that's waste depends entirely on the task. For B2B use cases requiring complex reasoning, the 11-second TTFT and premium price are justified. For everything else, probably not.

For anyone new to the Gemini API: start with Flash, collect data, then optimize. Jumping straight to Flash-Lite without a quality baseline first tends to cause more debugging than the cost savings are worth.

DEV Community