Originally published on NextFuture
What's new this week
On April 15, 2026, Google shipped Gemini 3.1 Flash TTS as a public preview on AI Studio and Vertex AI. The model ID is gemini-3.1-flash-tts-preview, and it introduces 200+ inline audio tags (for example [whispers], [happy], [pause]), 30 prebuilt voices, native multi-speaker dialogue, and coverage across 70+ languages. The free tier is open for prototyping; paid usage is $1 per million text input tokens and $20 per million audio output tokens, roughly an order of magnitude cheaper than ElevenLabs for a comparable amount of generated audio. Output is 24 kHz mono PCM, returned inline as base64, so there is no webhook dance and no separate voice-studio account to manage.
Why it matters for builders
Web engineer. Previously, adding a "listen to this article" button to a Next.js blog meant wiring up ElevenLabs or patching tts-1-hd with custom SSML to fix prosody. With Flash TTS, a single generateContent call plus an inline [slow] or [excited] tag produces the same emotional pacing: no SSML build step, no separate voice studio. The audio comes back as base64 PCM that you decode, wrap in a WAV header, and hand to an <audio> element. The full call lives in a server action, so API keys stay on the server.
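Raw PCM won't play in the browser on its own; it needs a 44-byte WAV header first. A minimal sketch of that step, assuming the preview's 24 kHz mono 16-bit output (the pcmToWav helper and the browser wiring are ours, not part of the SDK):

```typescript
// Wrap raw 16-bit mono PCM in a minimal RIFF/WAV header so an
// <audio> element can play it. Sample rate defaults to the 24 kHz
// the preview returns.
function pcmToWav(pcm: Uint8Array, sampleRate = 24000): Uint8Array {
  const header = new ArrayBuffer(44);
  const v = new DataView(header);
  const writeTag = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
  };
  writeTag(0, "RIFF");
  v.setUint32(4, 36 + pcm.length, true); // total size minus 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  v.setUint32(16, 16, true);             // fmt chunk size
  v.setUint16(20, 1, true);              // format 1 = linear PCM
  v.setUint16(22, 1, true);              // mono
  v.setUint32(24, sampleRate, true);
  v.setUint32(28, sampleRate * 2, true); // byte rate: 16-bit mono
  v.setUint16(32, 2, true);              // block align
  v.setUint16(34, 16, true);             // bits per sample
  writeTag(36, "data");
  v.setUint32(40, pcm.length, true);
  const out = new Uint8Array(44 + pcm.length);
  out.set(new Uint8Array(header), 0);
  out.set(pcm, 44);
  return out;
}
```

In the client component, build a Blob and point the element at it: `audio.src = URL.createObjectURL(new Blob([pcmToWav(bytes)], { type: "audio/wav" }))`.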
AI engineer. Building a voice agent that reads CRM notes aloud used to need two inference passes (LLM then TTS) plus manual speaker labelling. Flash TTS accepts multi-speaker transcripts inline: your LLM emits Joe: ... Jane: ... and TTS returns one WAV with two distinct voices. Your agent graph loses a node, latency drops by a full network round-trip, and you skip maintaining a separate speaker-labelling prompt that drifts every model upgrade.
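To map each transcript speaker to a distinct voice, the request needs a speaker-to-voice config alongside the text. A sketch of building that config, assuming the preview mirrors the multiSpeakerVoiceConfig / speakerVoiceConfigs shape of earlier Gemini TTS previews (the field names are assumptions here, not confirmed for this model):

```typescript
// Build the speechConfig payload for a multi-speaker request from a
// simple { speakerName: voiceName } map. Passed as config.speechConfig
// next to responseModalities: ["AUDIO"], with the "Joe: ... Jane: ..."
// transcript as the text part.
function dialogueSpeechConfig(voices: Record<string, string>) {
  return {
    multiSpeakerVoiceConfig: {
      speakerVoiceConfigs: Object.entries(voices).map(
        ([speaker, voiceName]) => ({
          speaker,
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        })
      ),
    },
  };
}
```

The speaker labels must match the names your LLM emits in the transcript exactly, so it pays to pin them in the LLM's system prompt.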
Indie maker. A Duolingo-style pronunciation app priced at $4/month barely broke even on Azure TTS at roughly $0.05 per lesson. At $20 per million output audio tokens, a 30-second lesson now costs about $0.003 — gross margin holds above 90% on the same $4 tier. Voice is no longer the line item that kills your side project's unit economics, which means TTS-heavy features like audiobook summaries, podcast previews, or accessibility narration finally pencil out on a free-tier SaaS.
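The margin math is easy to parameterize. A sketch of the unit-cost arithmetic, assuming the roughly 5 audio tokens per second that the $0.003-per-30-seconds figure above implies (the token rate is an assumption, not a published constant):

```typescript
// Estimate audio output cost at $20 per million audio output tokens.
// tokensPerSecond is an assumption; calibrate it against real usage.
function estimateAudioCostUsd(seconds: number, tokensPerSecond = 5): number {
  const tokens = seconds * tokensPerSecond;
  return (tokens / 1_000_000) * 20;
}

// A 30-second lesson: 150 tokens, about $0.003.
// 1,000 such lessons a month: about $3, under a single $4 subscription.
```

Text input tokens add a second term, but at $1 per million they are noise next to the audio side for short lessons.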
Hands-on: try it in under 15 minutes
Grab a free API key from aistudio.google.com, store it as GEMINI_API_KEY, then install the SDK:
```shell
npm install @google/genai wav
```
Minimal Node/TypeScript call wrapped as a Next.js 16 server action:
"use server";
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });
export async function synthesize(text: string, voice = "Kore") {
const res = await ai.models.generateContent({
model: "gemini-3.1-flash-tts-preview",
contents: [{ parts: [{ text }] }],
config: {
responseModalities: ["AUDIO"],
speechConfig: {
voiceConfig: { prebuiltVoiceConfig: { voiceName: voice } },
},
},
});
const b64 = res.candidates![0].content!.parts![0].inlineData!.data!;
return Buffer.from(b64, "base64"); // 24kHz mono PCM
}
await synthesize(
"Say warmly: [slow] Welcome back, Alex. [happy] You crushed this week."
);
The inline [slow] and [happy] tags steer pacing and emotion mid-sentence — no separate prosody config. Tags must live inside square brackets, separated by text or punctuation: two adjacent tags will error. For a two-person podcast intro via cURL:
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents":[{"parts":[{"text":"TTS between Joe and Jane: Joe: [excited] The feed dropped. Jane: [amused] Took long enough."}]}],
"generationConfig":{"responseModalities":["AUDIO"]}
}' --output podcast.wav
The decoded audio is headerless PCM, so give ffmpeg the format explicitly if you transcode for smaller transfer: ffmpeg -f s16le -ar 24000 -ac 1 -i in.pcm -b:a 64k out.mp3. Preview rate limits follow Gemini Flash defaults (10 RPM on free, 1,000 RPM on paid), fine for prototyping. For production, queue synthesis in a BullMQ worker and cache finished clips in S3 or R2, keyed by a hash of text + voice + tagSet; a cache hit rate above 60% on a changelog feature is common. One caveat: preview model IDs have been renamed twice in the Gemini 3.1 family this quarter, so read the exact ID from an env var rather than hard-coding it.
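The cache key described above can be a single deterministic hash. A sketch using Node's built-in crypto (the ttsCacheKey name and the tag-extraction regex are ours):

```typescript
import { createHash } from "node:crypto";

// One deterministic cache key per (text, voice, tag set), so repeated
// changelog entries hit S3/R2 instead of re-synthesizing. Tags are the
// [lowercase] markers embedded in the text itself.
function ttsCacheKey(text: string, voice: string): string {
  const tagSet = [...new Set(text.match(/\[[a-z]+\]/g) ?? [])]
    .sort()
    .join(",");
  return createHash("sha256")
    .update(JSON.stringify({ text, voice, tagSet }))
    .digest("hex");
}
```

Use the hex digest as the object key (clips/<hash>.mp3); since the text already contains the tags, folding tagSet into the hash is belt-and-braces, but it keeps the key stable if you later strip tags before synthesis.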
How it compares to alternatives
| | Gemini 3.1 Flash TTS | OpenAI gpt-4o-mini-tts | ElevenLabs Flash v2.5 |
|---|---|---|---|
| Starts at | Free tier; $1/M text + $20/M audio tokens paid | $0.60 per 1M input chars, no free tier | $5/mo Starter (30k credits) |
| Best for | Multilingual, expressive multi-speaker narration | Low-latency voice replies inside GPT apps | Cloned brand voices, audiobook production |
| Key limit | Preview only: no SLA, model ID may change before GA | ~8 voices, fewer expressive tags | Per-character billing scales fast on free-tier SaaS |
| Integration | @google/genai SDK, Vertex AI, REST/cURL | OpenAI SDK, streaming WebSocket | REST API, WebSocket streaming, native SDK |
Try it this week
Pick one text-heavy screen in your product — an onboarding intro, a weekly changelog entry, a lesson summary — and wire Flash TTS behind a "Play" button. Ship it behind a feature flag so you can A/B the voice UX on 10% of sessions, then compare time-on-page and replay counts; if replay rate clears 15%, keep it and expand to every long-form page. For wider context on where the Gemini stack sits today, read our Gemma 4 review and the Q1 2026 Web+AI recap for the pricing shifts since January.