DEV Community

Evan Lin for Google Developer Experts

Posted on • Originally published at evanlin.com

Gemini 3.1: Real-World Voice Recognition with Flash Live: Making Your LINE Bot Understand You


Background

Google released Gemini 3.1 Flash Live at the end of March 2026, focusing on "making audio AI more natural and reliable." This model is designed specifically for real-time two-way voice conversation, with low latency, interruptibility, and multi-language support.

I happened to have a LINE Bot project (linebot-helper-python) on hand, which already handles text, images, URLs, PDFs, and YouTube, but completely ignores voice messages:

User sends a voice message
Bot: (Silence)

This time, I'll add voice support and share a few pitfalls I encountered.


Design Decision: Flash Live or Standard Gemini API?

The first question: Gemini 3.1 Flash Live is designed for real-time streaming, but LINE's voice messages are pre-recorded m4a files, not real-time audio streams.

Using Flash Live to process pre-recorded files is like using a live streaming camera to take photos – technically feasible, but the wrong tool.

I decided to use the standard Gemini API instead: pass the audio bytes as inline data and get the transcribed text back in one call. It's simpler and better suited to this scenario.



Architecture Design

Integration Approach

This repo already has a complete Orchestrator architecture, which automatically routes to different Agents (Chat, Content, Location, Vision, GitHub) based on the message content. The goal for voice messages is clear:

Convert voice to text, and then treat it as a regular text message and pass it into the Orchestrator – allowing all existing features to automatically support voice input.

User says "Help me search for nearby gas stations" → transcribed into text → Orchestrator determines it's a location query → LocationAgent processes it. No need to implement separate logic for voice.

Complete Flow

User sends AudioMessage (m4a)
    │
    ▼ handle_audio_message()
    │
    ├─ ① LINE SDK downloads audio bytes
    │ get_message_content(message_id) → iter_content()
    │
    ├─ ② Gemini transcription
    │ tools/audio_tool.py → transcribe_audio()
    │ model: gemini-3.1-flash-lite-preview
    │
    ├─ ③ Reply #1: "You said: {transcription}"
    │ reply_message() (consumes reply token)
    │
    └─ ④ Reply #2: Orchestrator routing
            handle_text_message_via_orchestrator(push_user_id=user_id)
            ↓
            push_message() (reply token already used, use push instead)

Why two replies?

The replies are divided into two parts to let the user see the transcription result immediately, without waiting for the Orchestrator to finish processing to know if the Bot understood what they said.


Core Code Explanation

Step 1: Audio Transcription Tool (tools/audio_tool.py)

import os

from google import genai
from google.genai import types

TRANSCRIPTION_MODEL = "gemini-3.1-flash-lite-preview"

async def transcribe_audio(audio_bytes: bytes, mime_type: str = "audio/mp4") -> str:
    """
    Transcribe audio bytes to text using Gemini.
    LINE voice messages are always m4a, MIME type is always audio/mp4.
    """
    client = genai.Client(
        vertexai=True,
        project=os.getenv("GOOGLE_CLOUD_PROJECT"),
        location=os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1"),
    )

    audio_part = types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)

    response = await client.aio.models.generate_content(
        model=TRANSCRIPTION_MODEL,
        contents=[
            types.Content(
                role="user",
                parts=[
                    audio_part,
                    types.Part(text="Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes."),
                ],
            )
        ],
    )

    return response.text or ""


Design principle: The function itself does not catch exceptions, allowing the upper-level handler to handle error responses uniformly.

Step 2: Handler Main Flow (main.py)

async def handle_audio_message(event: MessageEvent):
    """Handle audio (voice) messages — transcribe and route through Orchestrator."""
    user_id = event.source.user_id
    replied = False # Track if the reply token has been used
    try:
        # Download audio
        message_content = await line_bot_api.get_message_content(event.message.id)
        audio_bytes = b""
        async for chunk in message_content.iter_content():
            audio_bytes += chunk

        # Transcription
        transcription = await transcribe_audio(audio_bytes)

        # Empty transcription (silent or too short)
        if not transcription.strip():
            await line_bot_api.reply_message(
                event.reply_token,
                [TextSendMessage(text="Unable to recognize voice content, please re-record.")]
            )
            return

        # Reply #1: Let the user confirm the transcription result (consumes reply token)
        await line_bot_api.reply_message(
            event.reply_token,
            [TextSendMessage(text=f"You said: {transcription.strip()}")]
        )
        replied = True

        # Reply #2: Send to Orchestrator, using push_message (token already used)
        await handle_text_message_via_orchestrator(
            event, user_id,
            text=transcription.strip(),
            push_user_id=user_id,
        )

    except Exception as e:
        logger.error(f"Error handling audio for {user_id}: {e}", exc_info=True)
        error_text = LineService.format_error_message(e, "processing voice message")
        error_msg = TextSendMessage(text=error_text)
        if replied:
            # reply token has been consumed, use push instead
            await line_bot_api.push_message(user_id, [error_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [error_msg])


Step 3: Enabling Orchestrator to Support External Text Input

The original handle_text_message_via_orchestrator directly reads event.message.text. AudioMessage doesn't have .text, so add two optional parameters:

async def handle_text_message_via_orchestrator(
    event: MessageEvent,
    user_id: str,
    text: str | None = None,          # ← external text input (voice transcription)
    push_user_id: str | None = None,  # ← use push_message when set
):
    msg = text if text is not None else event.message.text.strip()
    try:
        result = await orchestrator.process_text(user_id=user_id, message=msg)
        response_text = format_orchestrator_response(result)
        reply_msg = TextSendMessage(text=response_text)

        if push_user_id:
            await line_bot_api.push_message(push_user_id, [reply_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [reply_msg])
    except Exception as e:
        error_msg = TextSendMessage(text=LineService.format_error_message(e, "processing your question"))
        if push_user_id:
            await line_bot_api.push_message(push_user_id, [error_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [error_msg])


Using text is not None (rather than text or ...) is intentional: if the voice transcription yields an empty string, the empty string should pass through (to be intercepted upstream by the if not transcription.strip() check) instead of silently falling back to event.message.text.
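A quick illustration of the difference, in plain Python with no SDK needed (the function names here are just for this demo): with `or`, an empty transcription is treated as falsy and silently replaced by the fallback; with `is not None`, only a true "no value" falls back.

```python
def pick_or(text, fallback):
    # `or` treats "" as falsy, so an empty transcription is lost
    return text or fallback

def pick_is_not_none(text, fallback):
    # only None falls back; an empty string passes through intact
    return text if text is not None else fallback

print(pick_or("", "event.message.text"))           # → event.message.text (wrong here)
print(pick_is_not_none("", "event.message.text"))  # → "" (empty string preserved)
print(pick_is_not_none(None, "event.message.text"))  # → event.message.text
```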


Pitfalls Encountered

❌ Pitfall 1: Part.from_text() does not accept positional arguments

The first TypeError encountered:

# ❌ Error (TypeError: Part.from_text() takes 1 positional argument but 2 were given)
types.Part.from_text(
    "Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes."
)

# ✅ Correct
types.Part(text="Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.")


In this version of the SDK, text in Part.from_text() is a keyword-only argument, so passing it positionally raises a TypeError. Using the Part(text=...) constructor directly is the safer option.
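The failure mode is just ordinary Python keyword-only parameters, and it can be reproduced without the SDK. The `from_text` below is a stand-in to demonstrate the mechanism, not the real implementation:

```python
def from_text(*, text: str) -> str:
    # `*` makes `text` keyword-only, like Part.from_text in this SDK version
    return text

try:
    from_text("hello")  # positional call → TypeError
except TypeError as e:
    print(f"TypeError: {e}")

print(from_text(text="hello"))  # keyword call works → hello
```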

❌ Pitfall 2: LINE reply token can only be used once

LINE's reply token is one-time use. Once reply_message() is called, the token is invalidated.

This project's voice flow issues two replies:

  1. Reply #1 (display transcription text) → consumes token
  2. Reply #2 (Orchestrator result) → token is invalid, will receive LINE 400 error

The solution is to have the Orchestrator handler support push_message mode (via the push_user_id parameter), and Reply #2 changes to push_message.

Error handling also needs care: if the Orchestrator throws an exception after Reply #1 succeeds, reply_message can no longer be used in the except block; it too must switch to push_message. That is what the replied flag in the code is for.

❌ Pitfall 3: Gemini Flash Live is not suitable for pre-recorded files

Not a real "pitfall", but worth clarifying:

Gemini 3.1 Flash Live is designed for real-time two-way streaming, which carries the overhead of connection establishment and streaming protocols. LINE voice messages are complete pre-recorded m4a files that can be processed in a single call.

Using client.aio.models.generate_content() directly with inline audio bytes is simpler, and the latency is perfectly acceptable. Save Flash Live for scenarios that truly need real-time conversation.


Effect Demonstration

Scenario 1: Voice Command Query

User sends: [Voice] Help me search for cafes near Taipei Main Station

Bot Reply #1: You said: Help me search for cafes near Taipei Main Station
Bot Reply #2: [LocationAgent replies with a list of nearby cafes]

Scenario 2: Voice Question

User sends: [Voice] What's the difference between Gemini and GPT-4

Bot Reply #1: You said: What's the difference between Gemini and GPT-4
Bot Reply #2: [ChatAgent with Google Search Grounding replies with comparison results]

Scenario 3: Voice Send URL

User sends: [Voice] Help me summarize this article https://example.com/article

Bot Reply #1: You said: Help me summarize this article https://example.com/article
Bot Reply #2: [ContentAgent fetches and summarizes the article]

The text transcribed from voice goes directly into the Orchestrator, and all existing URL detection and intent determination work as usual, with zero extra logic.


Traditional Text Input vs. Voice Input

| | Text Input | Voice Input |
|---|---|---|
| Input format | TextMessage | AudioMessage (m4a) |
| Pre-processing | None | Gemini transcription |
| Reply token | Used directly | Reply #1 consumes it; Reply #2 switches to push |
| Orchestrator | Direct routing | Routed after transcription |
| Supported features | All | All (no extra configuration) |
| Error handling | reply_message | replied flag decides reply vs. push |

Analysis and Outlook

The most satisfying part of this integration is that the Orchestrator itself barely needed to change. As long as voice is converted to text at the input end, all routing logic, Agent calls, and error handling are inherited automatically.

Gemini's multimodal audio understanding performs very stably in this scenario – Traditional Chinese, Taiwanese accents, and sentences mixed with English can basically be transcribed accurately.

Future directions for extension:

  • Multi-language automatic detection: Tell Gemini to preserve the original language during transcription, Japanese voice → Japanese transcription, and then the Orchestrator decides whether to translate
  • Group voice support: Currently limited to 1:1, voice messages in groups are temporarily ignored
  • Long recording summary: Recordings exceeding a certain length go directly to ContentAgent for summarization, instead of being treated as commands

Extension: 🔊 Read Summary Aloud – Make the Bot Speak


Voice recognition allows the Bot to "understand" what the user is saying. After this is done, the next question naturally arises:

Can the Bot respond by speaking?

The Gemini Live API has a setting response_modalities: ["AUDIO"], which can directly output an audio PCM stream. I connected it to another scenario – reading summaries aloud.

Function Design

Each time the Bot summarizes a URL, YouTube, or PDF, a "🔊 Read Aloud" QuickReply button will appear below the message. When the user presses it, the Bot sends the summary text into Gemini Live TTS, converts the PCM audio to m4a, and then sends it back using AudioSendMessage.

URL summary complete
    │
    ▼ [🔊 Read Aloud] QuickReply button
    │
User presses the button → PostbackEvent
    │
    ▼ handle_read_aloud_postback()
    │
    ├─ ① Retrieve the summary text from summary_store (10 minutes TTL)
    │
    ├─ ② Gemini Live API → PCM audio
    │ model: gemini-live-2.5-flash-native-audio
    │ response_modalities: ["AUDIO"]
    │
    ├─ ③ ffmpeg transcoding: PCM → m4a
    │ s16le, 16kHz, mono → AAC
    │
    └─ ④ AudioSendMessage sent to the user
            original_content_url: /audio/{uuid}
            duration: {ms}
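The "🔊 Read Aloud" button in the flow above is a standard LINE Messaging API quick reply with a postback action attached to the summary message. A minimal sketch of the JSON payload follows; the postback `data` format and the `summary_id` parameter are assumptions for illustration, not necessarily what the real project uses:

```python
import json

def summary_with_read_aloud(summary_text: str, summary_id: str) -> dict:
    """Build a LINE text message carrying a '🔊 Read Aloud' postback quick reply.

    The `data` encoding below is hypothetical; the real project may reference
    the stored summary differently.
    """
    return {
        "type": "text",
        "text": summary_text,
        "quickReply": {
            "items": [
                {
                    "type": "action",
                    "action": {
                        "type": "postback",
                        "label": "🔊 Read Aloud",
                        "data": f"action=read_aloud&summary_id={summary_id}",
                    },
                }
            ]
        },
    }

msg = summary_with_read_aloud("Article summary…", "abc123")
print(json.dumps(msg, ensure_ascii=False, indent=2))
```

When the user taps the button, LINE delivers a PostbackEvent whose postback data is exactly this string, which is what handle_read_aloud_postback() would parse.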

Core Code (tools/tts_tool.py)

import os
import subprocess
import tempfile

from google import genai
from google.genai import types

VERTEX_PROJECT = os.getenv("GOOGLE_CLOUD_PROJECT")
LIVE_MODEL = "gemini-live-2.5-flash-native-audio"

async def text_to_speech(text: str) -> tuple[bytes, int]:
    client = genai.Client(vertexai=True, project=VERTEX_PROJECT, location="us-central1")
    config = {"response_modalities": ["AUDIO"]}

    async with client.aio.live.connect(model=LIVE_MODEL, config=config) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=text)]),
            turn_complete=True,
        )
        pcm_chunks = []
        async for message in session.receive():
            if message.server_content and message.server_content.model_turn:
                for part in message.server_content.model_turn.parts:
                    if part.inline_data and part.inline_data.data:
                        pcm_chunks.append(part.inline_data.data)
            if message.server_content and message.server_content.turn_complete:
                break

    pcm_bytes = b"".join(pcm_chunks)
    duration_ms = int(len(pcm_bytes) / 32000 * 1000)  # 16 kHz × 2 bytes (16-bit) × 1 channel = 32000 bytes/s

    # PCM → m4a (temp file mode, avoid moov atom problem)
    with tempfile.NamedTemporaryFile(suffix=".pcm", delete=False) as f:
        f.write(pcm_bytes)
        pcm_path = f.name
    m4a_path = pcm_path.replace(".pcm", ".m4a")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "s16le", "-ar", "16000", "-ac", "1",
         "-i", pcm_path, "-c:a", "aac", m4a_path],
        check=True, capture_output=True,
    )
    with open(m4a_path, "rb") as f:
        return f.read(), duration_ms

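The duration arithmetic in the code above follows directly from the PCM format: s16le mono at 16 kHz means 16000 samples/s × 2 bytes/sample × 1 channel = 32000 bytes per second of audio. A standalone check of that math:

```python
# sample_rate × bytes_per_sample × channels for s16le, 16 kHz, mono
BYTES_PER_SECOND = 16_000 * 2 * 1  # = 32000

def pcm_duration_ms(pcm_bytes: bytes) -> int:
    # LINE's AudioSendMessage expects the clip duration in milliseconds
    return int(len(pcm_bytes) / BYTES_PER_SECOND * 1000)

one_second = b"\x00" * 32_000       # exactly 1 s of silence
print(pcm_duration_ms(one_second))      # → 1000
print(pcm_duration_ms(one_second * 3))  # → 3000
```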

Pitfalls of Read Aloud Function

❌ Pitfall 4: Completely Different Model Name

The first attempt at Gemini Live TTS was:

LIVE_MODEL = "gemini-3.1-flash-live-preview"


Following the naming pattern of the gemini-3.1-flash-lite-preview model used for voice recognition, the result was an immediate 1008 (policy violation) WebSocket close:

Publisher Model `projects/line-vertex/locations/global/publishers/google/
models/gemini-3.1-flash-live-preview` was not found

Listing the available models on Vertex AI revealed that the model naming rules for Live/native audio are completely different:

# ✅ Correct
LIVE_MODEL = "gemini-live-2.5-flash-native-audio"

There is no Live version of Gemini 3.1 on Vertex AI. The Live/native audio feature is currently the 2.5 generation, and the naming format is gemini-live-{version}-{variant}-native-audio, which is completely separate from the general model gemini-{version}-flash-{variant}.

❌ Pitfall 5: GOOGLE_CLOUD_LOCATION=global Causes Live API to Disconnect

After changing to the correct model name, the error message was still the same:

Publisher Model `projects/line-vertex/locations/global/...` was not found

This time the model name was correct, but locations/global was strange – we clearly set us-central1.

Investigating the source code of the Google GenAI SDK revealed:

# _api_client.py
self.location = location or env_location
if not self.location and not self.api_key:
    self.location = 'global' # ← here

Because of location or env_location, if the location passed in is an empty string (falsy), it falls all the way through to 'global'.

The root cause of the problem is the environment variable of Cloud Run:

{ "name": "GOOGLE_CLOUD_LOCATION", "value": "global" }

GOOGLE_CLOUD_LOCATION was set to the string "global", so os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1") returned "global", not "us-central1" – and the SDK obediently connected to the global endpoint, where gemini-live-2.5-flash-native-audio has no BidiGenerateContent support.
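The subtlety is that os.getenv(name, default) only applies the default when the variable is unset, never when it is set to an unwanted value, so "global" sails straight through. A minimal reproduction:

```python
import os

# Simulate the Cloud Run environment described above
os.environ["GOOGLE_CLOUD_LOCATION"] = "global"

location = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
print(location)  # → global: the default is never consulted when the var is set

# The default only kicks in when the variable is absent entirely
del os.environ["GOOGLE_CLOUD_LOCATION"]
print(os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1"))  # → us-central1
```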

| Endpoint | Standard API | Live API |
|---|---|---|
| global | ✅ Available | ❌ Model not here |
| us-central1 | ✅ Available | ✅ gemini-live-2.5-flash-native-audio |

Solution: Hardcode the location of the Live API, and don't read from the env var:

# ❌ Affected by GOOGLE_CLOUD_LOCATION=global
VERTEX_LOCATION = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")

# ✅ Hardcoded, not affected by env var
VERTEX_LOCATION = "us-central1" # Live API needs a regional endpoint

Voice Recognition vs. Read Summary Aloud

The two functions use completely different Gemini APIs:

| | Voice Recognition | Read Summary Aloud |
|---|---|---|
| Direction | Audio → Text | Text → Audio |
| API | Standard generate_content | Live API BidiGenerateContent |
| Model | gemini-3.1-flash-lite-preview | gemini-live-2.5-flash-native-audio |
| Location | Follows env var | Hardcoded us-central1 |
| Output format | Text | PCM → ffmpeg → m4a |
| LINE message type | Input: AudioMessage | Output: AudioSendMessage |

Conclusion

The release of Gemini 3.1 Flash Live makes audio AI more worthy of serious consideration. This time, both voice recognition and read summary aloud were integrated into the LINE Bot:

  • Voice Recognition: Standard Gemini API, pre-recorded m4a one-time transcription, connected to the existing Orchestrator
  • Read Summary Aloud: Gemini Live TTS, summary text to PCM, ffmpeg to m4a, AudioSendMessage returns

The most troublesome part was not the feature itself but finding the correct model name and tracking down the SDK's location logic – neither is documented prominently, and the answers only came from listing the available models and reading the SDK source code.

The full code is on GitHub, feel free to refer to it.

See you next time!
