
Jangwook Kim

Originally published at effloow.com

Gemini 3.1 Flash TTS: Production API Guide for Developers

Google released gemini-3.1-flash-tts-preview in April 2026 as part of its Gemini 3.1 wave. Unlike the earlier 2.5-series TTS models, this one ships with a 200-tag expressive control system, native multi-speaker dialogue support, and pricing that positions it as a credible option for production voice pipelines. Effloow Lab ran a series of live API calls against the model using the generativelanguage REST endpoint — the results below come from those verified runs.

Why Gemini 3.1 Flash TTS Matters for Developers

The TTS market in 2026 is crowded: ElevenLabs for premium voice cloning, OpenAI TTS for fast single-speaker synthesis, and a handful of regional providers. What makes Gemini 3.1 Flash TTS worth evaluating is the combination of three things none of the affordable alternatives offer together.

First, it's steerable through inline text tags — no separate API call to set emotion, no fine-tuning, no voice cloning workflow. You inject [excited] or [seriousness] directly in your text and the model shifts delivery. Second, multi-speaker dialogue is a first-class feature: assign named speakers to different voices in the same request, and the model handles turn-taking. Third, it supports batchGenerateContent, which means you can process bulk audio jobs without rewriting your pipeline architecture.

It's not trying to beat ElevenLabs on voice realism. The target is the middle tier: IVR systems, audiobooks, accessibility tooling, and voice-enabled agents where you need reliable quality, flexible control, and predictable costs at scale.

What the API Returns (and What It Doesn't)

Before writing any code, one correction to how the model is often described in the wild: the API does not return WAV files. The MIME type in the response is audio/l16; rate=24000; channels=1 — that is raw Linear PCM audio, 16-bit, 24kHz, mono. You have to wrap it in a WAV container yourself if your playback pipeline expects WAV.

Effloow Lab confirmed this in a direct API call. The response inlineData.mimeType was audio/l16; rate=24000; channels=1, not audio/wav. If you pass raw PCM to a WAV player without the header, you'll get noise or silence.

Python's built-in wave module handles this cleanly:

import base64, wave

raw_pcm = base64.b64decode(response_audio_b64)  # the base64 string from inlineData.data

with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)   # mono
    wf.setsampwidth(2)   # 16-bit = 2 bytes per sample
    wf.setframerate(24000)
    wf.writeframes(raw_pcm)

For streaming to a browser or an IVR system, you can serve the raw PCM directly — most real-time audio pipelines consume L16 natively.
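
If you want to audition the buffer without writing a file, raw L16 can be played as-is. Here's a minimal playback sketch, assuming the third-party pyaudio package (pip install pyaudio, not part of the google-genai SDK) and the raw_pcm bytes from the example above:

import pyaudio

pa = pyaudio.PyAudio()
stream = pa.open(
    format=pyaudio.paInt16,  # 16-bit signed PCM (L16)
    channels=1,              # mono
    rate=24000,              # 24 kHz
    output=True,
)
stream.write(raw_pcm)  # blocking playback of the raw buffer
stream.stop_stream()
stream.close()
pa.terminate()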

Getting Started: Minimal Working Example

The model ID is gemini-3.1-flash-tts-preview. You call it at the same generateContent endpoint used for all other Gemini models. The only difference is responseModalities: ["AUDIO"] in generationConfig.

REST (curl):

curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Hello from Gemini TTS."}]}],
    "generationConfig": {
      "responseModalities": ["AUDIO"],
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": {"voiceName": "Kore"}
        }
      }
    }
  }'

The audio bytes live at candidates[0].content.parts[0].inlineData.data — base64-encoded.
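
If you saved the curl output to a file, a few lines of Python extract and decode the payload. A minimal sketch, assuming the response was saved as response.json (a filename chosen for this example):

import base64, json

with open("response.json") as f:
    body = json.load(f)

audio_b64 = body["candidates"][0]["content"]["parts"][0]["inlineData"]["data"]
raw_pcm = base64.b64decode(audio_b64)  # raw L16 PCM, ready for the wave wrapper above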

Python SDK (google-genai):

from google import genai
from google.genai import types
import wave

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=[{"parts": [{"text": "Hello from Gemini TTS."}]}],
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"
                )
            )
        ),
    ),
)

# Unlike the REST response, inline_data.data is already raw bytes in the
# google-genai SDK (the SDK decodes the base64 for you)
raw_pcm = response.candidates[0].content.parts[0].inline_data.data

with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(raw_pcm)

Install the SDK with pip install google-genai. This is the newer unified SDK — not google-generativeai, which is being phased out.

Audio Tags: Expressive Control Without Fine-Tuning

The headline feature is an inline tag system that shapes how the model delivers speech. Tags go in square brackets directly in your text, and they apply from that point forward until another tag overrides them.

The confirmed tag categories are:

  • Emotions: [excited], [calm], [curious], [enthusiastic], [seriousness], [nervousness], [amusement], [tension], [frustration], [hope], [awe]
  • Pacing: [slow], [fast], [short pause], [long pause]
  • Vocal effects: [whispers], [laughs]
  • Tone: [positive], [neutral], [negative]

A few rules that matter in practice:

  1. Tags are English-only, even when your text is in another language
  2. Never place two tags adjacent to each other — separate them with text or punctuation
  3. Tags are applied where they appear; they don't apply retroactively

Here's a realistic IVR example Effloow Lab tested with the [seriousness] and [slow] tags:

text = (
    "[seriousness] Attention: unusual login detected on your account. "
    "[slow] Please verify your identity now."
)

The output ran 8.76 seconds (281 audio tokens) and the pacing shift at [slow] was audible. The model did not read the tag text aloud.

For audiobook-style narration:

text = (
    "[cautious] She stepped through the doorway, scanning the empty corridor. "
    "[tension] Something was wrong. "
    "[awe] Then she saw it — the crystal, glowing in the dark."
)

Google's documentation counts 200+ tags in the full vocabulary, including language-specific emotion tags. The tags above are the ones that proved consistently reliable across Effloow Lab's test runs.

Multi-Speaker Dialogue

Single requests can contain multiple speakers with different voices. You define the speaker-to-voice mapping in multiSpeakerVoiceConfig and label each turn in the text using SpeakerName: format.

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=[{
        "parts": [{
            "text": (
                "Alice: [curious] How does the multi-speaker mode work exactly?\n"
                "Bob: [enthusiastic] You assign a voice to each speaker in the config, "
                "then label each line with the speaker name."
            )
        }]
    }],
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Alice",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Aoede"
                            )
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Bob",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Charon"
                            )
                        ),
                    ),
                ]
            )
        ),
    ),
)

Note the field name: multi_speaker_voice_config (Python SDK) maps to multiSpeakerVoiceConfig in JSON, with the array at speakerVoiceConfigs. Some community examples show speakerVoiceConfig (singular), which does not work; use the plural form.
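
For REST callers, here's the same configuration expressed as a Python dict for the JSON request body (shape inferred from the SDK example above; note the plural speakerVoiceConfigs):

speech_config = {
    "multiSpeakerVoiceConfig": {
        "speakerVoiceConfigs": [
            {
                "speaker": "Alice",
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Aoede"}},
            },
            {
                "speaker": "Bob",
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Charon"}},
            },
        ]
    }
}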

Effloow Lab confirmed this with a two-speaker exchange. The output was 528,000 bytes (~11 seconds) with two clearly distinct voice types (Aoede is described as warm/bright; Charon is deeper). Audio tag expressions work inside multi-speaker mode.

Available Voices

The model ships with 30 prebuilt voices. The following five are the most referenced in documentation and have been confirmed available via the API:

| Voice | Character |
| --- | --- |
| Aoede | Bright, warm |
| Charon | Deep, measured |
| Fenrir | Authoritative |
| Kore | Neutral, clear |
| Puck | Expressive, energetic |

The full list of 30 voices is in Google AI Studio's voice picker. There's no API endpoint to enumerate them programmatically — you need to know the name in advance.

All 30 voices support the full language set (70+ languages), and audio tags work across languages even though the tags themselves must be in English.

Pricing and Token Math

| Tier | Input (text) | Output (audio) | 1 min of audio (est.) |
| --- | --- | --- | --- |
| Standard | $1.00 / 1M tokens | $20.00 / 1M tokens | ~$0.038 |
| Batch | $0.50 / 1M tokens | $10.00 / 1M tokens | ~$0.019 |

Google's documentation states 25 tokens per second of audio. Effloow Lab's measurements showed approximately 32 tokens per second across multiple test calls (281 tokens for 8.76s, 295 tokens for ~9.2s, 352 tokens for ~11s). Use 30–35 tokens/second as a planning estimate until the preview stabilizes.

At 32 tokens/second on Standard tier: one hour of synthesized audio costs roughly 32 × 3600 / 1,000,000 × $20 ≈ $2.30. For bulk audiobook production, the Batch tier halves that to around $1.15/hour.

The model's input limit is 8,192 tokens; output is 16,384 tokens. At ~32 tokens/second, the maximum single response duration is about 8.5 minutes, which covers most production use cases without chunking.
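
The arithmetic above is easy to wrap in a planning helper. A rough estimator, treating the measured ~32 tokens/second as an assumption rather than a billing guarantee:

TOKENS_PER_SECOND = 32  # Effloow Lab's measured average, not an official figure
PRICE_PER_M_AUDIO_TOKENS = {"standard": 20.00, "batch": 10.00}  # USD, from the table above

def estimate_audio_cost(seconds: float, tier: str = "standard") -> float:
    """Estimate output-token cost in USD for a given duration of audio."""
    tokens = seconds * TOKENS_PER_SECOND
    return tokens / 1_000_000 * PRICE_PER_M_AUDIO_TOKENS[tier]

print(estimate_audio_cost(3600))           # ~2.30 for one hour on Standard
print(estimate_audio_cost(3600, "batch"))  # ~1.15 for one hour on Batch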

Comparing to Gemini 2.5 Flash TTS

| Feature | Gemini 2.5 Flash TTS | Gemini 3.1 Flash TTS |
| --- | --- | --- |
| Audio tags | Not available | 200+ inline tags |
| Multi-speaker | Limited | Native, per-speaker voice assignment |
| Voices | Multiple | 30 prebuilt |
| Languages | Multiple | 70+ |
| Watermarking | Not documented | SynthID included |
| Batch API | Yes | Yes |
| Best for | Simple synthesis | Expressive, multi-character production |

If you're already on 2.5 Flash TTS and don't need audio tags or multi-speaker, migration is not urgent. The model IDs are separate, so you can run both in parallel. If you're starting a new project, gemini-3.1-flash-tts-preview is the obvious default — it's strictly more capable at comparable pricing.

Common Mistakes

Using the wrong SDK package. The google-generativeai package is deprecated. Use google-genai. The class names and method signatures differ.

Treating the response as WAV. As covered above, the response is raw L16 PCM. Wrap it with the wave module before playing.

Placing two tags in a row. [slow][seriousness] will not work as expected. Separate them with text or punctuation: [slow] Proceed carefully. [seriousness] This is a security alert.

Expecting real-time streaming from generateContent. The standard endpoint returns the full audio in one response. For low-latency streaming, you need server-sent events — not covered in this guide, but supported via Vertex AI's streaming endpoint.

Missing the responseModalities field. Without "responseModalities": ["AUDIO"] in generationConfig, the model returns text, not audio.

Practical Use Cases

The model is well-suited for three categories:

IVR and automated voice notifications. Fraud alerts, flight delay announcements, appointment reminders. Audio tags let you add urgency or calm to specific phrases without recording a new voice file.

Audiobook and podcast production. Multi-speaker support means you can turn a dialogue transcript into a voiced exchange without mixing separate API calls. Single requests handle character voices, narration, and scene transitions.

Voice-enabled agents and assistants. Agent responses synthesized on the fly, with emotion tags applied based on the content type — [calm] for instructions, [enthusiastic] for confirmations, [seriousness] for warnings.
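
One simple way to wire that up is a message-type-to-tag mapping applied before synthesis. A hypothetical helper (the mapping below is illustrative, not part of the API):

TAG_BY_KIND = {
    "instruction": "[calm]",
    "confirmation": "[enthusiastic]",
    "warning": "[seriousness]",
}

def tag_message(kind: str, text: str) -> str:
    # Fall back to the neutral tone tag for unknown message kinds
    return f"{TAG_BY_KIND.get(kind, '[neutral]')} {text}"

print(tag_message("warning", "Unusual login detected on your account."))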

FAQ

Q: Is Gemini 3.1 Flash TTS available on Vertex AI?

Yes. The model is available through both the Google AI Developer API (generativelanguage.googleapis.com) and Vertex AI. Pricing may differ slightly on Vertex depending on your contract tier.

Q: Can I clone my own voice with this model?

No. Gemini 3.1 Flash TTS uses prebuilt voices only. For voice cloning, you would need ElevenLabs, PlayHT, or similar services that support custom voice profiles.

Q: Does SynthID watermarking affect audio quality?

SynthID embeds an imperceptible watermark into generated audio. Google states it does not affect perceived audio quality. There's no API option to disable it.

Q: How do I handle languages other than English?

Write your content text in the target language. Keep audio tags in English. The model applies the tag semantics across languages. For example, [enthusiastic] Bienvenido al sistema. will synthesize Spanish with an enthusiastic delivery.

Q: What's the latency on a typical call?

[DATA NOT AVAILABLE] — Google has not published latency specs for this preview model. Community reports suggest single-sentence synthesis completes in 1–3 seconds, but this varies with server load during the preview period.
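
Until official numbers exist, it's easy to measure for yourself. A minimal probe, reusing the client and types from the single-speaker SDK example above:

import time

start = time.perf_counter()
response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=[{"parts": [{"text": "Latency test sentence."}]}],
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)
print(f"round trip: {time.perf_counter() - start:.2f}s")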

Key Takeaways

Gemini 3.1 Flash TTS is a capable, cost-efficient option for production voice pipelines that need expressive control without fine-tuning. The 200+ audio tag system and native multi-speaker support are genuine differentiators at this price point.

Three things to remember before shipping:

  1. The API returns raw L16 PCM — wrap it in a WAV header or pipe it directly to an audio stream
  2. Audio tags must be in English even for non-English text
  3. Plan for ~30–35 tokens/second when estimating audio output costs

The model is in preview as of this writing. The model ID (gemini-3.1-flash-tts-preview) will change when it reaches GA. Pin the version explicitly in your code and watch Google AI Studio's changelog for the stable release.

Bottom Line

Gemini 3.1 Flash TTS is worth testing if you need expressive, multi-speaker voice synthesis at scale. The inline audio tag system is the most practical emotion-control mechanism available in any API-accessible TTS model right now, and the batch pricing makes it viable for high-volume production workloads.
