DEV Community

Evan Lin for Google Developer Experts

Posted on • Originally published at evanlin.com on

Gemini 3.1: Native TTS for Easier, More Powerful Summary Reading


Background

In the previous hands-on post, we used Gemini 3.1 Flash Live for speech recognition, and through a workaround built on the Gemini 2.5 Live API we barely managed to get text-to-speech (TTS) working.

But in April 2026, Google officially released Gemini 3.1 Flash TTS. It is a native model designed specifically for audio output: it no longer requires a Live WebSocket and can emit high-quality audio directly through the standard generate_content flow.

Naturally, as a developer I wanted to switch to the more elegant, native solution right away. This article shares how I upgraded the LINE Bot's text-to-speech summary feature to Gemini 3.1 Native TTS, and the "async pitfall" I hit along the way.


Technical Upgrade: From Live API to Native TTS

The previous read-aloud feature was emulated with the Gemini 2.5 Live API. It worked, but it had several shortcomings:

  1. High complexity: You have to manage the WebSocket connection lifecycle yourself.
  2. Model limitations: You must use a specific native-audio model, which is primarily available in us-central1.
  3. Fixed output format: The sampling rate is usually fixed at 16 kHz.

The emergence of Gemini 3.1 Flash TTS changed all this:

  • Model name: gemini-3.1-flash-tts-preview.
  • Consistent interface: Uses the familiar generate_content_stream.
  • Dynamic parameters: The sampling rate can be detected automatically from the returned MIME type (usually raised to 24 kHz, for better sound quality).

Core Code Evolution (tools/tts_tool.py)

The new implementation is much more concise, with the key being the response_modalities=["audio"] setting:

import logging
import os
import re

from google import genai
from google.genai import types

logger = logging.getLogger(__name__)
GOOGLE_AI_API_KEY = os.environ["GOOGLE_AI_API_KEY"]


def parse_rate(mime_type: str, default: int = 24000) -> int:
    """Extract the sampling rate from a MIME type such as 'audio/L16;rate=24000'."""
    match = re.search(r"rate=(\d+)", mime_type)
    return int(match.group(1)) if match else default


async def text_to_speech(text: str) -> tuple[bytes, int]:
    client = genai.Client(api_key=GOOGLE_AI_API_KEY, http_options={"api_version": "v1beta"})

    contents = [
        types.Content(
            role="user",
            parts=[
                # Add localization instructions to make the tone more natural
                types.Part.from_text(text=f"Please use Traditional Chinese with Taiwanese vocabulary, and read the following summary in a friendly and natural tone. ## Transcript:\n{text}"),
            ],
        ),
    ]

    config = types.GenerateContentConfig(
        response_modalities=["audio"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Zephyr")
            )
        ),
    )

    pcm_chunks = []
    sample_rate = 24000  # Default; overwritten by the MIME type below

    try:
        # ⚠️ This is the big pit that almost made me stay up all night:
        # generate_content_stream must be awaited first (explained below)
        response_stream = await client.aio.models.generate_content_stream(
            model="gemini-3.1-flash-tts-preview",
            contents=contents,
            config=config,
        )
        async for chunk in response_stream:
            if chunk.parts:
                for part in chunk.parts:
                    if part.inline_data:
                        pcm_chunks.append(part.inline_data.data)
                        # Get the sampling rate dynamically from the MIME type (e.g. audio/L16;rate=24000)
                        if part.inline_data.mime_type:
                            sample_rate = parse_rate(part.inline_data.mime_type)
    except Exception as e:
        logger.error(f"TTS Error: {e}")
        raise

    pcm_bytes = b"".join(pcm_chunks)
    # 16-bit PCM = 2 bytes per sample, mono
    duration_ms = int(len(pcm_bytes) / (sample_rate * 2) * 1000)

    # The caller then converts the PCM to m4a via ffmpeg and sends it to LINE...
    return pcm_bytes, duration_ms

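The final step mentioned in the closing comment, converting raw PCM to m4a for LINE, can be sketched with ffmpeg via subprocess. This is a minimal sketch under my own assumptions, not the project's actual code; the helper names (ffmpeg_args, pcm_to_m4a) are illustrative:

```python
import subprocess
import tempfile
from pathlib import Path


def ffmpeg_args(sample_rate: int, out_path: str) -> list[str]:
    """Build the ffmpeg command for raw 16-bit mono PCM (s16le) -> AAC in an m4a container."""
    return [
        "ffmpeg", "-y",
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-ar", str(sample_rate),  # input sampling rate (e.g. 24000 from the MIME type)
        "-ac", "1",               # mono
        "-i", "pipe:0",           # read the PCM from stdin
        "-c:a", "aac", out_path,
    ]


def pcm_to_m4a(pcm_bytes: bytes, sample_rate: int) -> bytes:
    """Pipe the PCM into ffmpeg and return the resulting m4a bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = str(Path(tmp) / "speech.m4a")
        subprocess.run(
            ffmpeg_args(sample_rate, out_path),
            input=pcm_bytes, check=True, capture_output=True,
        )
        return Path(out_path).read_bytes()
```

LINE's audio message API also wants the duration in milliseconds, which is why text_to_speech returns it alongside the PCM.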

The Pitfall: The Missing await

During this upgrade I ran into a very subtle TypeError that kept appearing after deployment:

TypeError: 'async for' requires an object with __aiter__ method, got coroutine

❌ Incorrect Writing

Following the example, I intuitively assumed I could async for over the method call directly:

# This is wrong!
async for chunk in client.aio.models.generate_content_stream(...):
    pass


✅ Correct Solution

In the async variant of the Google GenAI Python SDK, generate_content_stream is itself an async function that returns an async iterator. You must first await it to obtain the iterator, and only then run async for over it.

# Correct approach: two steps
response_stream = await client.aio.models.generate_content_stream(...)
async for chunk in response_stream:
    pass


This detail doesn't come up in synchronous code or in some older SDKs, but when consuming the async stream of 3.1 Flash TTS, it is the difference between running and crashing.
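The behavior can be reproduced with plain asyncio, independent of the SDK. Here make_stream is a stand-in of my own for generate_content_stream: an async function that returns an async iterator rather than being one:

```python
import asyncio


async def make_stream():
    """Mimics generate_content_stream: an async function that RETURNS an async iterator."""
    async def gen():
        for chunk in ("Hel", "lo"):
            yield chunk
    return gen()


async def main() -> str:
    # Wrong: `async for chunk in make_stream()` raises
    # TypeError: 'async for' requires an object with __aiter__ method, got coroutine
    stream = await make_stream()   # step 1: await to get the iterator
    parts = []
    async for chunk in stream:     # step 2: iterate it
        parts.append(chunk)
    return "".join(parts)


print(asyncio.run(main()))  # Hello
```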


Localization Adjustment: Making the Bot Speak "Taiwanese"

Although the summary itself is already in Traditional Chinese, the TTS model sometimes reads it with a non-native accent or word choices. Prompt engineering solved this:

"Please use Taiwanese vocabulary in Traditional Chinese, and read it in a friendly and natural tone..."

After adding this instruction, Gemini's audio output is much closer to how Taiwanese listeners speak in intonation and phrasing, which makes the read-aloud summaries feel far more approachable.


Summary: Changes Brought by Native TTS

After migrating from Live API to Native TTS:

  1. More stable connection: No long-lived WebSocket to maintain.
  2. Improved sound quality: Native support for a 24 kHz sampling rate.
  3. Easier maintenance: Code size dropped by roughly 30%, and the logic is more direct.

This experience is also a reminder that even with a seemingly mature SDK, you should check return value types carefully when working with the async API.

If you also want your LINE Bot to speak, Gemini 3.1 Flash TTS is definitely the best choice at the moment.

The complete code has been updated to GitHub, see you next time!
