# Adding Voice to Your AI Bot: Speech-to-Text and Text-to-Speech with Gemini 3.1

---
title: "Voice-Enable Your AI Bot with Gemini STT and TTS"
published: true
description: "A hands-on tutorial for adding speech-to-text and text-to-speech to any AI chatbot using the Gemini APIs: the standard API for transcription, the Live API for synthesis."
tags: api, architecture, cloud, typescript
canonical_url: https://blog.mvpfactory.co/voice-enable-your-ai-bot-with-gemini-stt-and-tts
---

## What We Are Building

In this workshop, I will walk you through adding voice input and voice output to an existing AI chatbot using two Gemini API surfaces. By the end, you will have a working pipeline that transcribes incoming voice messages to text (STT) and synthesizes spoken audio replies (TTS). You will also understand exactly which SDK, model, and endpoint to use for each step — because Google made this more confusing than it needs to be.

## Prerequisites

- A Google Cloud project with the Gemini API enabled
- Python 3.10+
- Two separate SDK packages installed: `google-generativeai` and `google-genai`
- `ffmpeg` available in your environment (check with `ffmpeg -version`)
- An existing text-based chatbot or orchestrator you want to voice-enable

## Step 1: Choose the Right Architecture

Let me show you a pattern I use in every project that adds voice. Teams often build a parallel voice pipeline alongside their text pipeline. That doubles your maintenance surface and guarantees the two drift apart.

The pattern that holds up in production is almost disappointingly simple:

```
Voice Message → STT (Transcribe) → Text Orchestrator → Response Text → TTS → Audio Reply
```


Convert voice to text at the entry point, and every downstream feature (RAG search, function calling, intent routing) automatically supports voice input. No additional code. Do one thing well at each layer.
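
Here is a minimal sketch of that handler, assuming hypothetical helper names: `transcribe_audio` (Step 2), `run_orchestrator` (your existing text bot), `synthesize_speech` (Step 3), and `download_voice_message` / `send_audio_reply` (your platform adapter):

```python
def handle_voice_message(message_id: str) -> None:
    # 1. Convert the incoming voice message to text (Step 2)
    audio_bytes = download_voice_message(message_id)
    transcript = transcribe_audio(audio_bytes)

    # 2. Reuse the existing text orchestrator unchanged (RAG, tools, routing)
    response_text = run_orchestrator(transcript)

    # 3. Synthesize the spoken reply and hand it to the platform adapter (Steps 3-4)
    audio_reply = synthesize_speech(response_text)
    send_audio_reply(message_id, audio_reply)
```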

## Step 2: Transcribe Voice with the Standard API

For pre-recorded audio (voice messages your bot receives), use the standard Gemini API via the `google-generativeai` SDK. You have a complete audio file, not a real-time stream, so the Live API would be overkill.

```python
# SDK: google-generativeai (pip install google-generativeai)
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-flash")

# Fetch the complete voice message from your platform adapter
audio_bytes = download_voice_message(message_id)

response = model.generate_content([
    "Transcribe this audio to text accurately.",
    {"mime_type": "audio/m4a", "data": audio_bytes},
])

transcript = response.text
```


In our tests, a 15-second voice message transcribed in ~2.1 seconds (median, n=50, `gemini-3.1-flash`, `us-central1`). Multi-language recognition handled mixed-language sentences within a single utterance without issues.

## Step 3: Synthesize Speech with the Live API

Here is the gotcha that will save you hours. Generating spoken audio requires the Gemini Live API and a **completely different SDK**: `google-genai` (not `google-generativeai`). The model name is different too.

```python
# SDK: google-genai (pip install google-genai)
import asyncio

from google import genai

client = genai.Client(
    vertexai=True,
    project="your-project",
    location="us-central1",  # MUST be regional, not global
)

async def synthesize_speech(response_text: str) -> bytes:
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash-native-audio"
    ) as session:
        await session.send_client_content(
            turns=[{"role": "user", "parts": [{"text": response_text}]}]
        )
        # Collect the raw PCM audio chunks streamed back by the session
        return await collect_audio_response(session)

pcm_data = asyncio.run(synthesize_speech(response_text))
```


The output is raw PCM audio. You will need `ffmpeg` to convert it to something usable like m4a before sending it back to users.
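
A minimal conversion sketch via `subprocess`, assuming the stream is 16-bit little-endian PCM at 24 kHz mono (the Live API's documented native-audio output format; verify against the model you actually deploy):

```python
import subprocess

def pcm_to_m4a(pcm_data: bytes, output_path: str = "reply.m4a") -> str:
    """Convert raw 16-bit PCM (24 kHz, mono) into an AAC-encoded .m4a file."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-f", "s16le",    # raw 16-bit little-endian samples
            "-ar", "24000",   # Live API output sample rate (assumed)
            "-ac", "1",       # mono
            "-i", "pipe:0",   # read the PCM bytes from stdin
            "-c:a", "aac",
            output_path,
        ],
        input=pcm_data,
        check=True,
    )
    return output_path
```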

## Step 4: Handle Platform Delivery

If you are integrating with messaging platforms, reply token constraints will bite you. LINE, for example, lets a reply token be used only once. If your bot sends a text reply and then follows up with audio, you need a push message API for the second response. The transcribe-first architecture pays off here — your orchestrator handles the logic, your platform adapter handles delivery quirks.
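
A sketch of that adapter logic, with a hypothetical `platform` client standing in for your messaging SDK (the real LINE SDK calls and message payloads will differ; the point is where the single-use reply token gets spent):

```python
def deliver_voice_reply(
    platform,           # hypothetical messaging-platform adapter
    reply_token: str,
    user_id: str,
    text: str,
    audio_url: str,
) -> None:
    # The reply token can be used only once: spend it on the immediate text reply...
    platform.reply(reply_token, {"type": "text", "text": text})
    # ...and deliver the synthesized audio as a separate push message.
    platform.push(user_id, {"type": "audio", "url": audio_url})
```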

## Gotchas and Common Mistakes

The docs do not mention most of these, but they will cost you real debugging time:

| Pitfall | What You Expect | What Actually Happens |
|---|---|---|
| SDK package | One unified SDK | Two packages: `google-generativeai` vs `google-genai` |
| Model naming | Consistent convention | `gemini-3.1-flash` (standard) vs `gemini-live-2.5-flash-native-audio` (Live) |
| Endpoint | Global endpoint works | Live API requires regional (`us-central1`). Global fails silently. |
| `Part.from_text()` | Positional arg works | Must use keyword argument syntax or it throws unexpected errors |
| Audio format | Ready-to-use output | Raw PCM. Requires `ffmpeg` conversion. |
| Pricing | Similar cost model | Live API charges per session-second; standard charges per token. Live ran 3-5x higher per interaction in our testing. |

The SDK split is the single most common integration mistake I have seen, and the error messages will not help you figure out what went wrong. Hard-code `us-central1` for Live API calls — this is the number one deployment failure in production.
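
For the `Part.from_text()` row specifically, the shape that works in the `google-genai` SDK looks like this (a quick sketch; double-check against the SDK version you have pinned):

```python
from google.genai import types

# Works: keyword argument
part = types.Part.from_text(text="Transcribe this audio to text accurately.")

# Trips the error described above in recent google-genai releases: positional argument
# part = types.Part.from_text("Transcribe this audio to text accurately.")
```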

## Before You Ship

Monitor your end-to-end voice round-trip time. Users expect sub-3-second responses for voice interactions, and latency adds up fast across transcription, orchestration, and synthesis.
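
A simple way to see where the time goes is to log each stage separately; a sketch reusing the handler names from Step 1:

```python
import logging
import time

log = logging.getLogger("voice")

def handle_voice_message_timed(message_id: str) -> None:
    t0 = time.monotonic()
    transcript = transcribe_audio(download_voice_message(message_id))
    t1 = time.monotonic()
    response_text = run_orchestrator(transcript)
    t2 = time.monotonic()
    send_audio_reply(message_id, synthesize_speech(response_text))
    t3 = time.monotonic()
    log.info(
        "voice round-trip: stt=%.2fs orchestrator=%.2fs tts+delivery=%.2fs total=%.2fs",
        t1 - t0, t2 - t1, t3 - t2, t3 - t0,
    )
```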

Make sure `ffmpeg` is available in your deployment environment — container images often omit it. Set up proper error handling for the Live API WebSocket connection. It will drop under load if you are not managing connections carefully.
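
A minimal retry wrapper around the Step 3 helper, assuming you narrow the exception type to whatever connection errors the `google-genai` SDK raises in your environment:

```python
import asyncio

async def synthesize_with_retry(text: str, attempts: int = 3) -> bytes:
    # Back off and reconnect when the Live API session drops mid-synthesis.
    for attempt in range(attempts):
        try:
            return await synthesize_speech(text)  # the Step 3 helper
        except Exception:  # replace with the SDK's connection/WebSocket errors
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)
```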

One more thing: take breaks during long WebSocket debugging sessions. I am serious — fatigue introduces subtle bugs that cost more time than the break would have. I keep [HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) running during these sessions; the break reminders and guided desk exercises are genuinely useful when you are deep in streaming audio debugging for hours.

## Wrapping Up

Here is the minimal setup to get this working: use `google-generativeai` for STT and `google-genai` with the Live API for TTS. Transcribe first, route through your existing text orchestrator, and build the voice layer as a thin adapter — not a parallel pipeline. Know which package you are importing and why. Your future self maintaining this system will thank you.

**Resources:**
- [Gemini API Documentation](https://ai.google.dev/docs)
- [google-generativeai PyPI](https://pypi.org/project/google-generativeai/)
- [google-genai PyPI](https://pypi.org/project/google-genai/)