Background
Google released Gemini 3.1 Flash Live at the end of March 2026, focusing on "making audio AI more natural and reliable." The model is built for real-time two-way voice conversations, with low latency, interruptibility, and multi-language support.
I happened to have a LINE Bot project (linebot-helper-python) on hand, which already handles text, images, URLs, PDFs, and YouTube, but completely ignores voice messages:
User sends a voice message
Bot: (Silence)
This time, I'll add voice support and share a few pitfalls I encountered.
Design Decision: Flash Live or Standard Gemini API?
The first question: Gemini 3.1 Flash Live is designed for real-time streaming, but LINE's voice messages are pre-recorded m4a files, not real-time audio streams.
Using Flash Live to process pre-recorded files is like using a live streaming camera to take photos – technically feasible, but the wrong tool.
I decided to use the standard Gemini API – pass the audio bytes as inline data and get the transcribed text back in a single call. It's simpler and a better fit for this scenario.
Architecture Design
Integration Approach
This repo already has a complete Orchestrator architecture, which automatically routes to different Agents (Chat, Content, Location, Vision, GitHub) based on the message content. The goal for voice messages is clear:
Convert voice to text, and then treat it as a regular text message and pass it into the Orchestrator – allowing all existing features to automatically support voice input.
User says "Help me search for nearby gas stations" → transcribed into text → Orchestrator determines it's a location query → LocationAgent processes it. No need to implement separate logic for voice.
Complete Flow
User sends AudioMessage (m4a)
│
▼ handle_audio_message()
│
├─ ① LINE SDK downloads audio bytes
│ get_message_content(message_id) → iter_content()
│
├─ ② Gemini transcription
│ tools/audio_tool.py → transcribe_audio()
│ model: gemini-3.1-flash-lite-preview
│
├─ ③ Reply #1: "You said: {transcription}"
│ reply_message() (consumes reply token)
│
└─ ④ Reply #2: Orchestrator routing
handle_text_message_via_orchestrator(push_user_id=user_id)
↓
push_message() (reply token already used, use push instead)
Why two replies?
Splitting the reply in two lets the user see the transcription immediately, instead of waiting for the Orchestrator to finish before knowing whether the Bot understood them.
Core Code Explanation
Step 1: Audio Transcription Tool (tools/audio_tool.py)
import os

from google import genai
from google.genai import types

TRANSCRIPTION_MODEL = "gemini-3.1-flash-lite-preview"

async def transcribe_audio(audio_bytes: bytes, mime_type: str = "audio/mp4") -> str:
    """
    Transcribe audio bytes to text using Gemini.
    LINE voice messages are always m4a, so the MIME type is always audio/mp4.
    """
    client = genai.Client(
        vertexai=True,
        project=os.getenv("GOOGLE_CLOUD_PROJECT"),
        location=os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1"),
    )
    audio_part = types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)
    response = await client.aio.models.generate_content(
        model=TRANSCRIPTION_MODEL,
        contents=[
            types.Content(
                role="user",
                parts=[
                    audio_part,
                    types.Part(text="Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes."),
                ],
            )
        ],
    )
    return response.text or ""
Design principle: The function itself does not catch exceptions, allowing the upper-level handler to handle error responses uniformly.
Step 2: Handler Main Flow (main.py)
async def handle_audio_message(event: MessageEvent):
    """Handle audio (voice) messages — transcribe and route through Orchestrator."""
    user_id = event.source.user_id
    replied = False  # Track if the reply token has been used
    try:
        # Download audio
        message_content = await line_bot_api.get_message_content(event.message.id)
        audio_bytes = b""
        async for chunk in message_content.iter_content():
            audio_bytes += chunk

        # Transcription
        transcription = await transcribe_audio(audio_bytes)

        # Empty transcription (silent or too short)
        if not transcription.strip():
            await line_bot_api.reply_message(
                event.reply_token,
                [TextSendMessage(text="Unable to recognize voice content, please re-record.")]
            )
            return

        # Reply #1: Let the user confirm the transcription result (consumes reply token)
        await line_bot_api.reply_message(
            event.reply_token,
            [TextSendMessage(text=f"You said: {transcription.strip()}")]
        )
        replied = True

        # Reply #2: Send to Orchestrator, using push_message (token already used)
        await handle_text_message_via_orchestrator(
            event, user_id,
            text=transcription.strip(),
            push_user_id=user_id,
        )
    except Exception as e:
        logger.error(f"Error handling audio for {user_id}: {e}", exc_info=True)
        error_text = LineService.format_error_message(e, "processing voice message")
        error_msg = TextSendMessage(text=error_text)
        if replied:
            # reply token has been consumed, use push instead
            await line_bot_api.push_message(user_id, [error_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [error_msg])
Step 3: Enabling Orchestrator to Support External Text Input
The original handle_text_message_via_orchestrator directly reads event.message.text. AudioMessage doesn't have .text, so add two optional parameters:
async def handle_text_message_via_orchestrator(
    event: MessageEvent,
    user_id: str,
    text: str | None = None,          # ← External text input (voice transcription)
    push_user_id: str | None = None,  # ← Use push_message when set
):
    msg = text if text is not None else event.message.text.strip()
    try:
        result = await orchestrator.process_text(user_id=user_id, message=msg)
        response_text = format_orchestrator_response(result)
        reply_msg = TextSendMessage(text=response_text)
        if push_user_id:
            await line_bot_api.push_message(push_user_id, [reply_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [reply_msg])
    except Exception as e:
        error_msg = TextSendMessage(text=LineService.format_error_message(e, "processing your question"))
        if push_user_id:
            await line_bot_api.push_message(push_user_id, [error_msg])
        else:
            await line_bot_api.reply_message(event.reply_token, [error_msg])
Using text is not None (rather than text or ...) is intentional: if the transcription comes back as an empty string, the empty string should pass through (and be caught upstream by if not transcription.strip()), instead of silently falling back to event.message.text.
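The difference matters because Python treats the empty string as falsy. A tiny standalone sketch (pick_message is a made-up name mimicking the handler's choice, not code from the repo):

```python
def pick_message(text, fallback):
    """Mimics the handler's choice between external text and event text."""
    # Explicit None check: an empty transcription stays an empty string
    via_none_check = text if text is not None else fallback
    # Truthiness check: an empty transcription silently falls back
    via_or_check = text or fallback
    return via_none_check, via_or_check

# Empty transcription: "is not None" keeps "", while "or" wrongly falls back
print(pick_message("", "event text"))     # → ('', 'event text')
# Non-empty transcription: both behave identically
print(pick_message("hello", "event text"))  # → ('hello', 'hello')
```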
Pitfalls Encountered
❌ Pitfall 1: Part.from_text() does not accept positional arguments
The first TypeError encountered:
# ❌ Error (TypeError: Part.from_text() takes 1 positional argument but 2 were given)
types.Part.from_text(
    "Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes."
)

# ✅ Correct
types.Part(text="Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.")
In this SDK version, Part.from_text() accepts text only as a keyword argument; using the Part(text=...) constructor directly is the safer option.
❌ Pitfall 2: LINE reply token can only be used once
LINE's reply token is one-time use. Once reply_message() is called, the token is invalidated.
The voice flow in this project replies twice:
- Reply #1 (shows the transcription) → consumes the token
- Reply #2 (Orchestrator result) → the token is already invalid, so LINE returns a 400 error
The solution is to have the Orchestrator handler support push_message mode (via the push_user_id parameter), and Reply #2 changes to push_message.
Error handling should also be noted: if Orchestrator throws an exception after Reply #1 succeeds, the reply_message cannot be used in the except block, and it also needs to be changed to push_message. This is the purpose of the replied flag in the code.
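The replied-flag pattern generalizes to any "one reply token, multiple outgoing messages" flow. A minimal channel-agnostic sketch (deliver, send_reply, and send_push are stand-in names for the LINE SDK calls, not code from the repo):

```python
def deliver(messages, send_reply, send_push):
    """Send messages, using the one-shot reply channel exactly once,
    then falling back to push for everything after it.
    Returns which channel carried each message."""
    used = []
    replied = False  # mirrors the handler's flag
    for msg in messages:
        if not replied:
            send_reply(msg)   # consumes the one-time reply token
            replied = True
        else:
            send_push(msg)    # token gone; push is the only option
        used.append("reply" if len(used) == 0 else "push")
    return used

sent = deliver(
    ["You said: hi", "Orchestrator result"],
    send_reply=lambda m: None,
    send_push=lambda m: None,
)
print(sent)  # → ['reply', 'push']
```

The same flag covers the error path: if an exception fires after the first send, the except block knows it must push rather than reply.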
❌ Pitfall 3: Gemini Flash Live is not suitable for pre-recorded files
Not a real "pitfall", but worth clarifying:
Gemini 3.1 Flash Live is designed for real-time two-way streaming, which has the overhead of connection establishment and streaming protocols. LINE voice messages are complete pre-recorded m4a files, which can be processed once.
Calling client.aio.models.generate_content() directly with inline audio bytes is simpler, and the latency is perfectly acceptable. Save Flash Live for scenarios that genuinely need real-time conversation.
Effect Demonstration
Scenario 1: Voice Command Query
User sends: [Voice] Help me search for cafes near Taipei Main Station
Bot Reply #1: You said: Help me search for cafes near Taipei Main Station
Bot Reply #2: [LocationAgent replies with a list of nearby cafes]
Scenario 2: Voice Question
User sends: [Voice] What's the difference between Gemini and GPT-4
Bot Reply #1: You said: What's the difference between Gemini and GPT-4
Bot Reply #2: [ChatAgent with Google Search Grounding replies with comparison results]
Scenario 3: Voice Send URL
User sends: [Voice] Help me summarize this article https://example.com/article
Bot Reply #1: You said: Help me summarize this article https://example.com/article
Bot Reply #2: [ContentAgent fetches and summarizes the article]
The text transcribed from voice goes directly into the Orchestrator, and all existing URL detection and intent determination work as usual, with zero extra logic.
Traditional Text Input vs. Voice Input
| | Text Input | Voice Input |
|---|---|---|
| Input Format | TextMessage | AudioMessage (m4a) |
| Pre-processing | None | Gemini transcription |
| reply token | Direct use | Reply #1 consumes, Reply #2 changes to push |
| Orchestrator | Direct routing | Route after transcription |
| Supported Functions | All | All (no additional settings required) |
| Error Handling | reply_message | replied flag determines reply/push |
Analysis and Outlook
What I am most satisfied with in this integration is that I hardly need to change the Orchestrator itself. As long as the voice is converted to text at the input end, all the routing logic, Agent calls, and error handling are automatically inherited.
Gemini's multimodal audio understanding performs very stably in this scenario – Traditional Chinese, Taiwanese accents, and sentences mixed with English can basically be transcribed accurately.
Future directions for extension:
- Multi-language automatic detection: Tell Gemini to preserve the original language during transcription, Japanese voice → Japanese transcription, and then the Orchestrator decides whether to translate
- Group voice support: Currently limited to 1:1, voice messages in groups are temporarily ignored
- Long recording summary: Recordings exceeding a certain length go directly to ContentAgent for summarization, instead of being treated as commands
Extension: 🔊 Read Summary Aloud – Make the Bot Speak
Voice recognition allows the Bot to "understand" what the user is saying. After this is done, the next question naturally arises:
Can the Bot respond by speaking?
The Gemini Live API has a setting response_modalities: ["AUDIO"], which can directly output an audio PCM stream. I connected it to another scenario – reading summaries aloud.
Function Design
Each time the Bot summarizes a URL, YouTube, or PDF, a "🔊 Read Aloud" QuickReply button will appear below the message. When the user presses it, the Bot sends the summary text into Gemini Live TTS, converts the PCM audio to m4a, and then sends it back using AudioSendMessage.
URL summary complete
│
▼ [🔊 Read Aloud] QuickReply button
│
User presses the button → PostbackEvent
│
▼ handle_read_aloud_postback()
│
├─ ① Retrieve the summary text from summary_store (10 minutes TTL)
│
├─ ② Gemini Live API → PCM audio
│ model: gemini-live-2.5-flash-native-audio
│ response_modalities: ["AUDIO"]
│
├─ ③ ffmpeg transcoding: PCM → m4a
│ s16le, 16kHz, mono → AAC
│
└─ ④ AudioSendMessage sent to the user
original_content_url: /audio/{uuid}
duration: {ms}
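For reference, the quickReply attachment that triggers the PostbackEvent can be sketched as the raw JSON payload the LINE Messaging API expects. The read_aloud:{id} data format and build_read_aloud_quick_reply are hypothetical illustrations, not the repo's actual postback scheme:

```python
import json

def build_read_aloud_quick_reply(summary_id: str) -> dict:
    """Builds the quickReply object attached to a summary message.
    Field names follow the LINE Messaging API; summary_id is a
    hypothetical key into summary_store."""
    return {
        "items": [
            {
                "type": "action",
                "action": {
                    "type": "postback",
                    "label": "🔊 Read Aloud",
                    "data": f"read_aloud:{summary_id}",
                },
            }
        ]
    }

print(json.dumps(build_read_aloud_quick_reply("abc123"), ensure_ascii=False, indent=2))
```

When the user taps the button, LINE delivers the data string back in a PostbackEvent, which is how the handler knows which cached summary to speak.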
Core Code (tools/tts_tool.py)
import subprocess
import tempfile

from google import genai
from google.genai import types

LIVE_MODEL = "gemini-live-2.5-flash-native-audio"

async def text_to_speech(text: str) -> tuple[bytes, int]:
    client = genai.Client(vertexai=True, project=VERTEX_PROJECT, location="us-central1")
    config = {"response_modalities": ["AUDIO"]}

    async with client.aio.live.connect(model=LIVE_MODEL, config=config) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=text)]),
            turn_complete=True,
        )
        pcm_chunks = []
        async for message in session.receive():
            if message.server_content and message.server_content.model_turn:
                for part in message.server_content.model_turn.parts:
                    if part.inline_data and part.inline_data.data:
                        pcm_chunks.append(part.inline_data.data)
            if message.server_content and message.server_content.turn_complete:
                break

    pcm_bytes = b"".join(pcm_chunks)
    duration_ms = int(len(pcm_bytes) / 32000 * 1000)  # 16kHz × 16-bit mono

    # PCM → m4a (temp file mode, avoids the moov atom problem)
    with tempfile.NamedTemporaryFile(suffix=".pcm", delete=False) as f:
        f.write(pcm_bytes)
        pcm_path = f.name
    m4a_path = pcm_path.replace(".pcm", ".m4a")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "s16le", "-ar", "16000", "-ac", "1",
         "-i", pcm_path, "-c:a", "aac", m4a_path],
        check=True, capture_output=True,
    )
    with open(m4a_path, "rb") as f:
        return f.read(), duration_ms
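The duration_ms formula relies on the PCM byte rate: 16,000 samples/s × 2 bytes per 16-bit sample × 1 channel = 32,000 bytes per second. A quick standalone check (pcm_duration_ms is an extracted helper, not a function in the repo):

```python
BYTES_PER_SECOND = 16_000 * 2 * 1  # 16 kHz × 2 bytes (16-bit) × mono

def pcm_duration_ms(pcm_len: int) -> int:
    """Estimated playback duration of raw s16le PCM of the given byte length."""
    return int(pcm_len / BYTES_PER_SECOND * 1000)

# 96,000 bytes of PCM → 3 seconds of audio
print(pcm_duration_ms(96_000))  # → 3000
```

LINE's AudioSendMessage requires this duration in milliseconds, which is why it is computed here rather than probed from the m4a afterwards.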
Pitfalls of Read Aloud Function
❌ Pitfall 4: Completely Different Model Name
The first attempt at Gemini Live TTS was:
LIVE_MODEL = "gemini-3.1-flash-live-preview"
Extrapolating from the gemini-3.1-flash-lite-preview name used for voice recognition, the result was an immediate 1008 policy-violation disconnect:
Publisher Model `projects/line-vertex/locations/global/publishers/google/
models/gemini-3.1-flash-live-preview` was not found
Listing the available models on Vertex AI revealed that the model naming rules for Live/native audio are completely different:
# ✅ Correct
LIVE_MODEL = "gemini-live-2.5-flash-native-audio"
There is no Live version of Gemini 3.1 on Vertex AI. The Live/native audio feature is currently the 2.5 generation, and the naming format is gemini-live-{version}-{variant}-native-audio, which is completely separate from the general model gemini-{version}-flash-{variant}.
❌ Pitfall 5: GOOGLE_CLOUD_LOCATION=global Causes Live API to Disconnect
After changing to the correct model name, the error message was still the same:
Publisher Model `projects/line-vertex/locations/global/...` was not found
This time the model name was correct, but locations/global was suspicious – us-central1 had clearly been set.
Investigating the source code of the Google GenAI SDK revealed:
# _api_client.py
self.location = location or env_location
if not self.location and not self.api_key:
    self.location = 'global'  # ← here
location or env_location – if the passed-in location is an empty string, it will fall back to global.
The root cause of the problem is the environment variable of Cloud Run:
{ "name": "GOOGLE_CLOUD_LOCATION", "value": "global" }
GOOGLE_CLOUD_LOCATION was set to the "global" string. os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1") did not get "us-central1", but "global" – then the SDK obediently connected to the global endpoint, but gemini-live-2.5-flash-native-audio does not have BidiGenerateContent support in global.
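The getenv behavior is easy to verify in isolation — the default only applies when the variable is missing, not when it holds an unwanted value. (DEMO_CLOUD_LOCATION is a stand-in name so the demo doesn't touch the real variable.)

```python
import os

# Simulate the Cloud Run environment
os.environ["DEMO_CLOUD_LOCATION"] = "global"

# The default "us-central1" is ignored because the variable IS set
print(os.getenv("DEMO_CLOUD_LOCATION", "us-central1"))  # → global

# The default only kicks in when the variable is absent
del os.environ["DEMO_CLOUD_LOCATION"]
print(os.getenv("DEMO_CLOUD_LOCATION", "us-central1"))  # → us-central1
```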
| Endpoint | Standard API | Live API |
|---|---|---|
| global | ✅ Available | ❌ Model not here |
| us-central1 | ✅ Available | ✅ gemini-live-2.5-flash-native-audio |
Solution: Hardcode the location of the Live API, and don't read from the env var:
# ❌ Affected by GOOGLE_CLOUD_LOCATION=global
VERTEX_LOCATION = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
# ✅ Hardcoded, not affected by env var
VERTEX_LOCATION = "us-central1" # Live API needs a regional endpoint
Voice Recognition vs. Read Summary Aloud
The two functions use completely different Gemini APIs:
| | Voice Recognition | Read Summary Aloud |
|---|---|---|
| Direction | Audio → Text | Text → Audio |
| API | Standard generate_content | Live API BidiGenerateContent |
| Model | gemini-3.1-flash-lite-preview | gemini-live-2.5-flash-native-audio |
| Location | Follows env var | Hardcoded us-central1 |
| Output Format | text | PCM → ffmpeg → m4a |
| LINE Message Type | Input: AudioMessage | Output: AudioSendMessage |
Conclusion
The release of Gemini 3.1 Flash Live makes audio AI more worthy of serious consideration. This time, both voice recognition and read summary aloud were integrated into the LINE Bot:
- Voice Recognition: Standard Gemini API, pre-recorded m4a one-time transcription, connected to the existing Orchestrator
- Read Summary Aloud: Gemini Live TTS, summary text → PCM, ffmpeg → m4a, returned via AudioSendMessage
The most troublesome part was not the feature itself but finding the correct model name and tracking down the SDK's location fallback – neither is documented anywhere prominent, and the answers only came from listing the available models and reading the SDK source.
The full code is on GitHub, feel free to refer to it.
See you next time!