A 600-call benchmark reveals which Gemini model actually delivers real-time performance—and exposes some surprising truths about Google's naming conventions.
The moment your voice AI pauses for 2 seconds, your user is already wondering if it crashed.
That's not hyperbole—it's human biology. Natural conversation operates on a ~200ms response expectation. Exceed 500ms and the experience feels sluggish. Cross 1 second and you've entered "awkward silence" territory. Hit 3 seconds? Your user is reaching for the "End Call" button.
This is why Time-to-First-Token (TTFT) is the single most important metric for voice AI applications. Not quality. Not cost. Latency.
I learned this the hard way while building voice agents. So I decided to answer a deceptively simple question: Which Gemini model should you actually use for real-time voice?
The answer surprised me.
The Problem: Too Many Models, Not Enough Data
Google's Gemini lineup is... complicated. You've got Flash, Flash-Lite, numbered versions, preview releases, and now "thinking" configurations that fundamentally change model behavior. The documentation tells you what each model can do, but not how fast it does it.
For voice applications, that gap is fatal.
I needed hard numbers. So I built a benchmark.
Methodology: 600 API Calls Don't Lie
I tested 6 Gemini models across 20 realistic scenarios with 5 iterations each—600 total API calls using the @google/genai SDK with streaming enabled.
Models tested:
- Gemini 2.0 Flash
- Gemini 2.0 Flash-Lite
- Gemini 2.5 Flash (default)
- Gemini 2.5 Flash (thinking: minimal)
- Gemini 2.5 Flash-Lite
- Gemini 3 Flash (Preview)
Scenarios spanned 7 categories:
- Short prompts (greetings, yes/no questions)
- Medium complexity (weather, recipes, recommendations)
- Long/complex (planning, technical questions)
- Context-dependent (follow-ups, clarifications)
- Ambiguous (vague requests, incomplete info)
- Multi-part (compound questions)
- Conversational (emotional support, casual chat)
Each scenario ran 5 times with a warmup iteration discarded. I added 500ms delays between requests to avoid rate limiting effects. I measured both TTFT (when the first token arrives) and total response time.
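For reference, here is a simplified sketch of the measurement loop, written against the same @google/genai streaming API I benchmarked. The helper names and the single-prompt signature are mine; the real harness loops over every model and scenario and aggregates the stats.

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Measure TTFT and total time for one prompt against one model.
async function measureOnce(model: string, prompt: string) {
  const start = Date.now();
  let ttft = 0;

  const stream = await ai.models.generateContentStream({
    model,
    contents: [{ role: "user", parts: [{ text: prompt }] }],
  });

  for await (const chunk of stream) {
    if (!ttft) ttft = Date.now() - start; // first token arrived
  }

  return { ttft, total: Date.now() - start };
}

// 1 warmup run (discarded) + 5 measured runs, 500ms apart to avoid rate-limit noise.
async function benchmark(model: string, prompt: string) {
  await measureOnce(model, prompt); // warmup
  const runs: { ttft: number; total: number }[] = [];
  for (let i = 0; i < 5; i++) {
    await new Promise((r) => setTimeout(r, 500));
    runs.push(await measureOnce(model, prompt));
  }
  return runs;
}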
The Results: Prepare to Rethink Everything
| Rank | Model | Avg TTFT | Avg Total Time |
|---|---|---|---|
| 🥇 1 | Gemini 2.5 Flash-Lite | 381ms | 674ms |
| 🥈 2 | Gemini 2.0 Flash | 454ms | 758ms |
| 🥉 3 | Gemini 2.5 Flash (thinking: minimal) | 503ms | 729ms |
| 4 | Gemini 2.0 Flash-Lite | 456ms | 868ms |
| 5 | Gemini 2.5 Flash (default) | 1879ms | 2065ms |
| 6 | Gemini 3 Flash (Preview) | 2900ms | 3160ms |
Read that again. The fastest model is 4.9× faster than its non-Lite sibling with default settings.
Five Things I Learned
1. "Lite" Doesn't Mean "Worse"—It Means "Faster"
Google's naming convention implies Lite models are stripped-down versions for cost savings. In reality, Gemini 2.5 Flash-Lite at 381ms is the fastest model I tested.
For voice applications where you need a response now, Lite isn't a compromise—it's the optimal choice. The quality difference for typical voice agent tasks (greetings, confirmations, short answers) is negligible. You're not asking it to write a dissertation; you're asking it to say "I found 3 Italian restaurants nearby."
2. The thinking: minimal Config is a Game-Changer
Here's a configuration most developers don't know exists.
Gemini 2.5 Flash with default settings clocks in at a painful 1879ms TTFT. That's nearly 2 seconds of silence before your user hears anything. Unacceptable for voice.
But add thinking: minimal to your config? 503ms. That's a 73% reduction from changing one parameter.
const response = await ai.models.generateContentStream({
  model: "gemini-2.5-flash",
  config: {
    thinkingConfig: {
      thinkingBudget: 0 // minimal thinking
    }
  },
  contents: [{ role: "user", parts: [{ text: prompt }] }]
});
The thinking feature is designed for complex reasoning tasks. For voice agents handling conversational queries, you almost never need it. Turn it off.
3. Gemini 3 Flash Preview is NOT Ready for Real-Time Voice
I tested it because developers always ask about the "latest and greatest."
At 2900ms average TTFT, Gemini 3 Flash Preview is approximately 10× slower than what you need for natural conversation. It might have capabilities that justify that latency for other use cases, but for voice? Hard pass.
Wait for the production release—or better yet, wait for the benchmarks.
4. Short Prompts are Consistently Fast (When You Pick the Right Model)
On my top 3 models, simple prompts like "What time is it?" or "Yes" consistently hit the 300-400ms range. That's approaching the human conversational threshold.
This matters because voice agents spend most of their time handling short exchanges: confirmations, acknowledgments, simple queries. If your model can nail those, occasional complex responses can afford slightly more latency.
5. Complexity Creates Variance
Long, multi-part prompts showed TTFT ranging from 600ms to 1000ms+ even on fast models. The standard deviation increased significantly.
Practical implication: If your voice agent handles complex queries, pad your expectations. Design your UX around occasional 1-second delays. Consider using filler phrases ("Let me think about that...") when you detect complex incoming queries.
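One way to wire that up is a cheap heuristic check before the model call. This is only a sketch: the word-count threshold and keyword list are arbitrary illustrations, and speak() and getVoiceResponse() are placeholders for your own TTS pipeline and the streaming helper from the Quick Start below.

// Thresholds and keywords here are illustrative, not derived from the benchmark.
function looksComplex(userSpeech: string): boolean {
  const wordCount = userSpeech.trim().split(/\s+/).length;
  const multiPart = /\b(and|then|also)\b/i.test(userSpeech);
  return wordCount > 25 || multiPart;
}

// speak() stands in for whatever TTS pipeline you use;
// getVoiceResponse() is the streaming helper defined in the Quick Start section.
async function respond(
  userSpeech: string,
  speak: (text: string) => void,
  getVoiceResponse: (text: string) => Promise<string>
) {
  if (looksComplex(userSpeech)) {
    speak("Let me think about that..."); // filler buys roughly a second of perceived responsiveness
  }
  return getVoiceResponse(userSpeech);
}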
Practical Recommendations
Based on my data, here's what I'd recommend:
For Production Voice Agents:
Use Gemini 2.5 Flash-Lite. It's the fastest, it's stable, and quality is more than sufficient for conversational AI. At 381ms average TTFT, you're within striking distance of human conversation cadence.
If You Need More Capability:
Use Gemini 2.5 Flash with thinking: minimal. You get the upgraded model's capabilities at 503ms, which sits right at the 500ms "feels responsive" threshold for most scenarios.
For Cost-Sensitive Applications:
Gemini 2.0 Flash-Lite offers great value at 456ms TTFT, though total response time runs higher (868ms).
For Complex Reasoning + Voice (Rare):
Consider a hybrid approach: use a fast model for initial acknowledgment, then stream the detailed response. "Great question! Here's what I found..." buys you time.
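Here is one way that hybrid could look. Treat it as a sketch, not a prescription: the acknowledgment prompt, the model pairing, and the speak() TTS placeholder are all assumptions on my part.

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// speak() stands in for your TTS pipeline.
async function hybridAnswer(userSpeech: string, speak: (text: string) => void) {
  // Kick off the detailed answer immediately (2.5 Flash, thinking disabled).
  const detailedPromise = ai.models.generateContentStream({
    model: "gemini-2.5-flash",
    config: { thinkingConfig: { thinkingBudget: 0 } },
    contents: [{ role: "user", parts: [{ text: userSpeech }] }],
  });

  // Meanwhile, a short acknowledgment from the fastest model fills the silence.
  const ack = await ai.models.generateContent({
    model: "gemini-2.5-flash-lite",
    contents: [{
      role: "user",
      parts: [{ text: `In one short sentence, acknowledge this request without answering it: "${userSpeech}"` }],
    }],
  });
  if (ack.text) speak(ack.text);

  // Then hand the real answer to TTS chunk by chunk as it streams in.
  for await (const chunk of await detailedPromise) {
    if (chunk.text) speak(chunk.text);
  }
}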
Quick Start: Fastest Voice Agent Setup
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function getVoiceResponse(userSpeech: string): Promise<string> {
  const startTime = Date.now();
  let ttft = 0;

  const response = await ai.models.generateContentStream({
    model: "gemini-2.5-flash-lite", // ← Fastest for voice
    contents: [{
      role: "user",
      parts: [{ text: userSpeech }]
    }]
  });

  let fullResponse = "";
  for await (const chunk of response) {
    const text = chunk.text; // .text is a getter on GenerateContentResponse
    if (text) {
      if (!ttft) {
        ttft = Date.now() - startTime; // First chunk = TTFT achieved, start speaking!
        console.log(`TTFT: ${ttft}ms`);
      }
      fullResponse += text;
    }
  }

  console.log(`Total: ${Date.now() - startTime}ms`);
  return fullResponse;
}
Limitations & Caveats
Let's be honest about what this benchmark doesn't tell you:
- Network conditions vary. I tested from a single location. Your production environment may differ.
- Load matters. Google's infrastructure handles variable load; my 500ms delays don't simulate peak usage.
- Quality wasn't measured. I focused purely on latency. For your use case, response quality might justify slower models.
- One point in time. Google updates models continuously. Benchmark again in 3 months.
Conclusion: Latency is a Feature
The best voice AI in the world fails if it takes 3 seconds to respond.
My benchmark shows that model selection alone can mean the difference between a 381ms response and a 2900ms response, a roughly 7.6× gap. That's the difference between "this feels natural" and "this feels broken."
The bottom line: For real-time voice agents in December 2025, use Gemini 2.5 Flash-Lite. It's not a compromise—it's the right tool for the job.
Stop guessing. Start measuring. Ship something your users won't hang up on.
Have you run your own Gemini benchmarks? I'd love to see your data. Drop a comment or reach out—the more datapoints, the better we all build.