Background
Google released Gemini 3.1 Flash Live at the end of March 2026, focusing on "making audio AI more natural and reliable." The model is built for real-time two-way voice conversations, with low latency, interruptibility, and multi-language support.
I happened to have a LINE Bot project (linebot-helper-python) on hand, which already handles text, images, URLs, PDFs, and YouTube, but completely ignores voice messages:
User sends a voice message
Bot: (Silence)
This time, I'll add voice support and share a few pitfalls I encountered.
Design Decision: Flash Live or Standard Gemini API?
The first question: Gemini 3.1 Flash Live is designed for real-time streaming, but LINE's voice messages are pre-recorded m4a files, not real-time audio streams.
Using Flash Live to process pre-recorded files is like using a live streaming camera to take photos – technically feasible, but the wrong tool.
I decided to use the standard Gemini API – pass the audio bytes as inline data and get the transcribed text back in a single call. It's simpler and a better fit for this scenario.
Architecture Design
Integration Approach
This repo already has a complete Orchestrator architecture, which automatically routes to different Agents (Chat, Content, Location, Vision, GitHub) based on the message content. The goal for voice messages is clear:
Convert voice to text, and then treat it as a regular text message and pass it into the Orchestrator – allowing all existing features to automatically support voice input.
User says "Help me search for nearby gas stations" → transcribed into text → Orchestrator determines it's a location query → LocationAgent processes it. No need to implement separate logic for voice.
Complete Flow
User sends AudioMessage (m4a)
│
▼ handle_audio_message()
│
├─ ① LINE SDK downloads audio bytes
│ get_message_content(message_id) → iter_content()
│
├─ ② Gemini transcription
│ tools/audio_tool.py → transcribe_audio()
│ model: gemini-3.1-flash-lite-preview
│
├─ ③ Reply #1: "You said: {transcription}"
│ reply_message() (consumes reply token)
│
└─ ④ Reply #2: Orchestrator routing
handle_text_message_via_orchestrator(push_user_id=user_id)
↓
push_message() (reply token already used, use push instead)
Why two replies?
Splitting the reply in two lets the user see the transcription immediately, instead of waiting for the Orchestrator to finish before knowing whether the Bot understood them.
Core Code Explanation
Step 1: Audio Transcription Tool (tools/audio_tool.py)
import os

from google import genai
from google.genai import types

TRANSCRIPTION_MODEL = "gemini-3.1-flash-lite-preview"

async def transcribe_audio(audio_bytes: bytes, mime_type: str = "audio/mp4") -> str:
    """
    Transcribe audio bytes to text using Gemini.
    LINE voice messages are always m4a, so the MIME type is always audio/mp4.
    """
    client = genai.Client(
        vertexai=True,
        project=os.getenv("GOOGLE_CLOUD_PROJECT"),
        location=os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1"),
    )
    audio_part = types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)
    response = await client.aio.models.generate_content(
        model=TRANSCRIPTION_MODEL,
        contents=[
            types.Content(
                role="user",
                parts=[
                    audio_part,
                    types.Part(text="Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes."),
                ],
            )
        ],
    )
    return response.text or ""
Design principle: The function itself does not catch exceptions, allowing the upper-level handler to handle error responses uniformly.
Step 2: Handler Main Flow (main.py)
async def handle_audio_message(event: MessageEvent):
    """Handle audio (voice) messages — transcribe and route through Orchestrator."""
    user_id = event.source.user_id
    replied = False  # Track if the reply token has been used
    try:
        # Download audio
        message_content = await line_bot_api.get_message_content(event.message.id)
        audio_bytes = b""
        async for chunk in message_content.iter_content():
            audio_bytes += chunk

        # Transcription
        transcription = await transcribe_audio(audio_bytes)

        # Empty transcription (silent or too short)
        if not transcription.strip():
            await line_bot_api.reply_message(
                event.reply_token,
                [TextSendMessage(text="Unable to recognize voice content, please re-record.")]
            )
            return

        # Reply #1: Let the user confirm the transcription result (consumes reply token)
        await line_bot_api.reply_message(
            event.reply_token,
            [TextSendMessage(text=f"You said: {transcription.strip()}")]
        )
        replied = True

        # Reply #2: Send to Orchestrator, using push_message (token already used)
        await handle_text_message_via_orchestrator(
            event, user_id,
            text=transcription.strip(),
            push_user_id=user_id,
        )
    except Exception as e:
        logger.error(f"Error handling audio for {user_id}: {e}", exc_info=True)
        error_text = LineService.format_error_message(e, "processing voice message")
        error_msg = TextSendMessage(text=error_text)
        if replied:
            # reply token has been consumed, use push instead
            await line_bot_api.push_message(user_id, [error_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [error_msg])
Step 3: Enabling Orchestrator to Support External Text Input
The original handle_text_message_via_orchestrator directly reads event.message.text. AudioMessage doesn't have .text, so add two optional parameters:
async def handle_text_message_via_orchestrator(
    event: MessageEvent,
    user_id: str,
    text: str | None = None,          # ← External text input (voice transcription)
    push_user_id: str | None = None,  # ← Use push_message when set
):
    msg = text if text is not None else event.message.text.strip()
    try:
        result = await orchestrator.process_text(user_id=user_id, message=msg)
        response_text = format_orchestrator_response(result)
        reply_msg = TextSendMessage(text=response_text)
        if push_user_id:
            await line_bot_api.push_message(push_user_id, [reply_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [reply_msg])
    except Exception as e:
        error_msg = TextSendMessage(text=LineService.format_error_message(e, "processing your question"))
        if push_user_id:
            await line_bot_api.push_message(push_user_id, [error_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [error_msg])
Using text is not None (rather than text or ...) is intentional: if the transcription comes back as an empty string, the empty string should pass through (and be caught upstream by if not transcription.strip()), instead of silently falling back to event.message.text.
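The difference matters because Python treats the empty string as falsy. A tiny standalone sketch (pick_message is a made-up name mimicking the handler's choice, not code from the repo):

```python
def pick_message(text, fallback):
    """Mimics the handler's choice between external text and event text."""
    # Explicit None check: an empty transcription stays an empty string
    via_none_check = text if text is not None else fallback
    # Truthiness check: an empty transcription silently falls back
    via_or_check = text or fallback
    return via_none_check, via_or_check

# Empty transcription: "is not None" keeps "", while "or" wrongly falls back
print(pick_message("", "event text"))     # → ('', 'event text')
# Non-empty transcription: both behave identically
print(pick_message("hello", "event text"))  # → ('hello', 'hello')
```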
Pitfalls Encountered
❌ Pitfall 1: Part.from_text() does not accept positional arguments
The first TypeError encountered:
# ❌ Error (TypeError: Part.from_text() takes 1 positional argument but 2 were given)
types.Part.from_text(
    "Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes."
)

# ✅ Correct
types.Part(text="Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.")
In this SDK version, Part.from_text() accepts text only as a keyword argument; using the Part(text=...) constructor directly is the safer option.
❌ Pitfall 2: LINE reply token can only be used once
LINE's reply token is one-time use. Once reply_message() is called, the token is invalidated.
The voice flow in this project replies twice:
- Reply #1 (shows the transcription) → consumes the token
- Reply #2 (Orchestrator result) → the token is already invalid, so LINE returns a 400 error
The solution is to have the Orchestrator handler support push_message mode (via the push_user_id parameter), and Reply #2 changes to push_message.
Error handling should also be noted: if Orchestrator throws an exception after Reply #1 succeeds, the reply_message cannot be used in the except block, and it also needs to be changed to push_message. This is the purpose of the replied flag in the code.
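The replied-flag pattern generalizes to any "one reply token, multiple outgoing messages" flow. A minimal channel-agnostic sketch (deliver, send_reply, and send_push are stand-in names for the LINE SDK calls, not code from the repo):

```python
def deliver(messages, send_reply, send_push):
    """Send messages, using the one-shot reply channel exactly once,
    then falling back to push for everything after it.
    Returns which channel carried each message."""
    used = []
    replied = False  # mirrors the handler's flag
    for msg in messages:
        if not replied:
            send_reply(msg)   # consumes the one-time reply token
            replied = True
        else:
            send_push(msg)    # token gone; push is the only option
        used.append("reply" if len(used) == 0 else "push")
    return used

sent = deliver(
    ["You said: hi", "Orchestrator result"],
    send_reply=lambda m: None,
    send_push=lambda m: None,
)
print(sent)  # → ['reply', 'push']
```

The same flag covers the error path: if an exception fires after the first send, the except block knows it must push rather than reply.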
❌ Pitfall 3: Gemini Flash Live is not suitable for pre-recorded files
Not a real "pitfall", but worth clarifying:
Gemini 3.1 Flash Live is designed for real-time two-way streaming, which has the overhead of connection establishment and streaming protocols. LINE voice messages are complete pre-recorded m4a files, which can be processed once.
Calling client.aio.models.generate_content() directly with inline audio bytes is simpler, and the latency is perfectly acceptable. Save Flash Live for scenarios that genuinely need real-time conversation.
Effect Demonstration
Scenario 1: Voice Command Query
User sends: [Voice] Help me search for cafes near Taipei Main Station
Bot Reply #1: You said: Help me search for cafes near Taipei Main Station
Bot Reply #2: [LocationAgent replies with a list of nearby cafes]
Scenario 2: Voice Question
User sends: [Voice] What's the difference between Gemini and GPT-4
Bot Reply #1: You said: What's the difference between Gemini and GPT-4
Bot Reply #2: [ChatAgent with Google Search Grounding replies with comparison results]
Scenario 3: Voice Send URL
User sends: [Voice] Help me summarize this article https://example.com/article
Bot Reply #1: You said: Help me summarize this article https://example.com/article
Bot Reply #2: [ContentAgent fetches and summarizes the article]
The text transcribed from voice goes directly into the Orchestrator, and all existing URL detection and intent determination work as usual, with zero extra logic.
Traditional Text Input vs. Voice Input
| | Text Input | Voice Input |
|---|---|---|
| Input Format | TextMessage | AudioMessage (m4a) |
| Pre-processing | None | Gemini transcription |
| reply token | Direct use | Reply #1 consumes, Reply #2 changes to push |
| Orchestrator | Direct routing | Route after transcription |
| Supported Functions | All | All (no additional settings required) |
| Error Handling | reply_message | replied flag determines reply/push |
Analysis and Outlook
What I am most satisfied with in this integration is that I hardly need to change the Orchestrator itself. As long as the voice is converted to text at the input end, all the routing logic, Agent calls, and error handling are automatically inherited.
Gemini's multimodal audio understanding performs very stably in this scenario – Traditional Chinese, Taiwanese accents, and sentences mixed with English can basically be transcribed accurately.
Future directions for extension:
- Multi-language automatic detection: Tell Gemini to preserve the original language during transcription, Japanese voice → Japanese transcription, and then the Orchestrator decides whether to translate
- Group voice support: Currently limited to 1:1, voice messages in groups are temporarily ignored
- Long recording summary: Recordings exceeding a certain length go directly to ContentAgent for summarization, instead of being treated as commands
Extension: 🔊 Read Summary Aloud – Make the Bot Speak
Voice recognition allows the Bot to "understand" what the user is saying. After this is done, the next question naturally arises:
Can the Bot respond by speaking?
The Gemini Live API has a setting response_modalities: ["AUDIO"], which can directly output an audio PCM stream. I connected it to another scenario – reading summaries aloud.
Function Design
Each time the Bot summarizes a URL, YouTube, or PDF, a "🔊 Read Aloud" QuickReply button will appear below the message. When the user presses it, the Bot sends the summary text into Gemini Live TTS, converts the PCM audio to m4a, and then sends it back using AudioSendMessage.
URL summary complete
│
▼ [🔊 Read Aloud] QuickReply button
│
User presses the button → PostbackEvent
│
▼ handle_read_aloud_postback()
│
├─ ① Retrieve the summary text from summary_store (10 minutes TTL)
│
├─ ② Gemini Live API → PCM audio
│ model: gemini-live-2.5-flash-native-audio
│ response_modalities: ["AUDIO"]
│
├─ ③ ffmpeg transcoding: PCM → m4a
│ s16le, 16kHz, mono → AAC
│
└─ ④ AudioSendMessage sent to the user
original_content_url: /audio/{uuid}
duration: {ms}
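For reference, the quickReply attachment that triggers the PostbackEvent can be sketched as the raw JSON payload the LINE Messaging API expects. The read_aloud:{id} data format and build_read_aloud_quick_reply are hypothetical illustrations, not the repo's actual postback scheme:

```python
import json

def build_read_aloud_quick_reply(summary_id: str) -> dict:
    """Builds the quickReply object attached to a summary message.
    Field names follow the LINE Messaging API; summary_id is a
    hypothetical key into summary_store."""
    return {
        "items": [
            {
                "type": "action",
                "action": {
                    "type": "postback",
                    "label": "🔊 Read Aloud",
                    "data": f"read_aloud:{summary_id}",
                },
            }
        ]
    }

print(json.dumps(build_read_aloud_quick_reply("abc123"), ensure_ascii=False, indent=2))
```

When the user taps the button, LINE delivers the data string back in a PostbackEvent, which is how the handler knows which cached summary to speak.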
Core Code (tools/tts_tool.py)
import subprocess
import tempfile

from google import genai
from google.genai import types

LIVE_MODEL = "gemini-live-2.5-flash-native-audio"

async def text_to_speech(text: str) -> tuple[bytes, int]:
    client = genai.Client(vertexai=True, project=VERTEX_PROJECT, location="us-central1")
    config = {"response_modalities": ["AUDIO"]}

    async with client.aio.live.connect(model=LIVE_MODEL, config=config) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=text)]),
            turn_complete=True,
        )
        pcm_chunks = []
        async for message in session.receive():
            if message.server_content and message.server_content.model_turn:
                for part in message.server_content.model_turn.parts:
                    if part.inline_data and part.inline_data.data:
                        pcm_chunks.append(part.inline_data.data)
            if message.server_content and message.server_content.turn_complete:
                break

    pcm_bytes = b"".join(pcm_chunks)
    duration_ms = int(len(pcm_bytes) / 32000 * 1000)  # 16kHz × 16-bit mono

    # PCM → m4a (temp file mode, avoids the moov atom problem)
    with tempfile.NamedTemporaryFile(suffix=".pcm", delete=False) as f:
        f.write(pcm_bytes)
        pcm_path = f.name
    m4a_path = pcm_path.replace(".pcm", ".m4a")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "s16le", "-ar", "16000", "-ac", "1",
         "-i", pcm_path, "-c:a", "aac", m4a_path],
        check=True, capture_output=True,
    )
    with open(m4a_path, "rb") as f:
        return f.read(), duration_ms
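The duration_ms formula relies on the PCM byte rate: 16,000 samples/s × 2 bytes per 16-bit sample × 1 channel = 32,000 bytes per second. A quick standalone check (pcm_duration_ms is an extracted helper, not a function in the repo):

```python
BYTES_PER_SECOND = 16_000 * 2 * 1  # 16 kHz × 2 bytes (16-bit) × mono

def pcm_duration_ms(pcm_len: int) -> int:
    """Estimated playback duration of raw s16le PCM of the given byte length."""
    return int(pcm_len / BYTES_PER_SECOND * 1000)

# 96,000 bytes of PCM → 3 seconds of audio
print(pcm_duration_ms(96_000))  # → 3000
```

LINE's AudioSendMessage requires this duration in milliseconds, which is why it is computed here rather than probed from the m4a afterwards.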
Pitfalls of Read Aloud Function
❌ Pitfall 4: Completely Different Model Name
The first attempt at Gemini Live TTS was:
LIVE_MODEL = "gemini-3.1-flash-live-preview"
Extrapolating from the gemini-3.1-flash-lite-preview name used for voice recognition, the result was an immediate 1008 policy-violation disconnect:
Publisher Model `projects/line-vertex/locations/global/publishers/google/
models/gemini-3.1-flash-live-preview` was not found
Listing the available models on Vertex AI revealed that the model naming rules for Live/native audio are completely different:
# ✅ Correct
LIVE_MODEL = "gemini-live-2.5-flash-native-audio"
There is no Live version of Gemini 3.1 on Vertex AI. The Live/native audio feature is currently the 2.5 generation, and the naming format is gemini-live-{version}-{variant}-native-audio, which is completely separate from the general model gemini-{version}-flash-{variant}.
❌ Pitfall 5: GOOGLE_CLOUD_LOCATION=global Causes Live API to Disconnect
After changing to the correct model name, the error message was still the same:
Publisher Model `projects/line-vertex/locations/global/...` was not found
This time the model name was correct, but locations/global was suspicious – us-central1 had clearly been set.
Investigating the source code of the Google GenAI SDK revealed:
# _api_client.py
self.location = location or env_location
if not self.location and not self.api_key:
    self.location = 'global'  # ← here
location or env_location – if the passed-in location is an empty string, it will fall back to global.
The root cause of the problem is the environment variable of Cloud Run:
{ "name": "GOOGLE_CLOUD_LOCATION", "value": "global" }
GOOGLE_CLOUD_LOCATION was set to the "global" string. os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1") did not get "us-central1", but "global" – then the SDK obediently connected to the global endpoint, but gemini-live-2.5-flash-native-audio does not have BidiGenerateContent support in global.
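The getenv behavior is easy to verify in isolation — the default only applies when the variable is missing, not when it holds an unwanted value. (DEMO_CLOUD_LOCATION is a stand-in name so the demo doesn't touch the real variable.)

```python
import os

# Simulate the Cloud Run environment
os.environ["DEMO_CLOUD_LOCATION"] = "global"

# The default "us-central1" is ignored because the variable IS set
print(os.getenv("DEMO_CLOUD_LOCATION", "us-central1"))  # → global

# The default only kicks in when the variable is absent
del os.environ["DEMO_CLOUD_LOCATION"]
print(os.getenv("DEMO_CLOUD_LOCATION", "us-central1"))  # → us-central1
```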
| Endpoint | Standard API | Live API |
|---|---|---|
| global | ✅ Available | ❌ Model not here |
| us-central1 | ✅ Available | ✅ gemini-live-2.5-flash-native-audio |
Solution: Hardcode the location of the Live API, and don't read from the env var:
# ❌ Affected by GOOGLE_CLOUD_LOCATION=global
VERTEX_LOCATION = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
# ✅ Hardcoded, not affected by env var
VERTEX_LOCATION = "us-central1" # Live API needs a regional endpoint
Voice Recognition vs. Read Summary Aloud
The two functions use completely different Gemini APIs:
| | Voice Recognition | Read Summary Aloud |
|---|---|---|
| Direction | Audio → Text | Text → Audio |
| API | Standard generate_content | Live API BidiGenerateContent |
| Model | gemini-3.1-flash-lite-preview | gemini-live-2.5-flash-native-audio |
| Location | Follows env var | Hardcoded us-central1 |
| Output Format | text | PCM → ffmpeg → m4a |
| LINE Message Type | Input: AudioMessage | Output: AudioSendMessage |
Conclusion
The release of Gemini 3.1 Flash Live makes audio AI more worthy of serious consideration. This time, both voice recognition and read summary aloud were integrated into the LINE Bot:
- Voice Recognition: Standard Gemini API, pre-recorded m4a one-time transcription, connected to the existing Orchestrator
- Read Summary Aloud: Gemini Live TTS, summary text → PCM, ffmpeg → m4a, returned via AudioSendMessage
The most troublesome part was not the feature itself but finding the correct model name and tracking down the SDK's location fallback – neither is documented anywhere prominent, and the answers only came from listing the available models and reading the SDK source.
The full code is on GitHub, feel free to refer to it.
See you next time!