Originally published on NextFuture
What's new this week
On April 15, 2026, Google shipped Gemini 3.1 Flash TTS as a public preview on AI Studio and Vertex AI. The model ID is gemini-3.1-flash-tts-preview, and it introduces 200+ inline audio tags (for example [whispers], [happy], [pause]), 30 prebuilt voices, native multi-speaker dialogue, and coverage across 70+ languages. The free tier is open for prototyping; paid usage is $1 per million text input tokens and $20 per million audio output tokens, roughly an order of magnitude cheaper than ElevenLabs for a comparable amount of generated audio. Output is 24 kHz mono PCM, returned inline as base64, so there is no webhook dance and no separate voice-studio account to manage.
Why it matters for builders
Web engineer. Previously, adding a "listen to this article" button to a Next.js blog meant wiring up ElevenLabs or patching tts-1-hd with custom SSML to fix prosody. With Flash TTS, a single generateContent call plus an inline [slow] or [excited] tag produces the same emotional pacing: no SSML build step, no separate voice studio. The audio comes back as base64 PCM that you decode, wrap in a WAV header, and hand to an <audio> element. The full call lives in a server action, so API keys stay on the server.
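Raw PCM won't play in the browser on its own; it needs a 44-byte WAV header first. A minimal sketch of that step, assuming the preview's 24 kHz mono 16-bit output (the pcmToWav helper and the browser wiring are ours, not part of the SDK):

```typescript
// Wrap raw 16-bit mono PCM in a minimal RIFF/WAV header so an
// <audio> element can play it. Sample rate defaults to the 24 kHz
// the preview returns.
function pcmToWav(pcm: Uint8Array, sampleRate = 24000): Uint8Array {
  const header = new ArrayBuffer(44);
  const v = new DataView(header);
  const writeTag = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
  };
  writeTag(0, "RIFF");
  v.setUint32(4, 36 + pcm.length, true); // total size minus 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  v.setUint32(16, 16, true);             // fmt chunk size
  v.setUint16(20, 1, true);              // format 1 = linear PCM
  v.setUint16(22, 1, true);              // mono
  v.setUint32(24, sampleRate, true);
  v.setUint32(28, sampleRate * 2, true); // byte rate: 16-bit mono
  v.setUint16(32, 2, true);              // block align
  v.setUint16(34, 16, true);             // bits per sample
  writeTag(36, "data");
  v.setUint32(40, pcm.length, true);
  const out = new Uint8Array(44 + pcm.length);
  out.set(new Uint8Array(header), 0);
  out.set(pcm, 44);
  return out;
}
```

In the client component, build a Blob and point the element at it: `audio.src = URL.createObjectURL(new Blob([pcmToWav(bytes)], { type: "audio/wav" }))`.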
AI engineer. Building a voice agent that reads CRM notes aloud used to need two inference passes (LLM then TTS) plus manual speaker labelling. Flash TTS accepts multi-speaker transcripts inline: your LLM emits Joe: ... Jane: ... and TTS returns one WAV with two distinct voices. Your agent graph loses a node, latency drops by a full network round-trip, and you skip maintaining a separate speaker-labelling prompt that drifts every model upgrade.
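To map each transcript speaker to a distinct voice, the request needs a speaker-to-voice config alongside the text. A sketch of building that config, assuming the preview mirrors the multiSpeakerVoiceConfig / speakerVoiceConfigs shape of earlier Gemini TTS previews (the field names are assumptions here, not confirmed for this model):

```typescript
// Build the speechConfig payload for a multi-speaker request from a
// simple { speakerName: voiceName } map. Passed as config.speechConfig
// next to responseModalities: ["AUDIO"], with the "Joe: ... Jane: ..."
// transcript as the text part.
function dialogueSpeechConfig(voices: Record<string, string>) {
  return {
    multiSpeakerVoiceConfig: {
      speakerVoiceConfigs: Object.entries(voices).map(
        ([speaker, voiceName]) => ({
          speaker,
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        })
      ),
    },
  };
}
```

The speaker labels must match the names your LLM emits in the transcript exactly, so it pays to pin them in the LLM's system prompt.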
Indie maker. A Duolingo-style pronunciation app priced at $4/month barely broke even on Azure TTS at roughly $0.05 per lesson. At $20 per million output audio tokens, a 30-second lesson now costs about $0.003 — gross margin holds above 90% on the same $4 tier. Voice is no longer the line item that kills your side project's unit economics, which means TTS-heavy features like audiobook summaries, podcast previews, or accessibility narration finally pencil out on a free-tier SaaS.
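The margin math is easy to parameterize. A sketch of the unit-cost arithmetic, assuming the roughly 5 audio tokens per second that the $0.003-per-30-seconds figure above implies (the token rate is an assumption, not a published constant):

```typescript
// Estimate audio output cost at $20 per million audio output tokens.
// tokensPerSecond is an assumption; calibrate it against real usage.
function estimateAudioCostUsd(seconds: number, tokensPerSecond = 5): number {
  const tokens = seconds * tokensPerSecond;
  return (tokens / 1_000_000) * 20;
}

// A 30-second lesson: 150 tokens, about $0.003.
// 1,000 such lessons a month: about $3, under a single $4 subscription.
```

Text input tokens add a second term, but at $1 per million they are noise next to the audio side for short lessons.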
Hands-on: try it in under 15 minutes
Grab a free API key from aistudio.google.com, store it as GEMINI_API_KEY, then install the SDK:
```shell
npm install @google/genai wav
```
Minimal Node/TypeScript call wrapped as a Next.js 16 server action:
"use server";
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });
export async function synthesize(text: string, voice = "Kore") {
const res = await ai.models.generateContent({
model: "gemini-3.1-flash-tts-preview",
contents: [{ parts: [{ text }] }],
config: {
responseModalities: ["AUDIO"],
speechConfig: {
voiceConfig: { prebuiltVoiceConfig: { voiceName: voice } },
},
},
});
const b64 = res.candidates![0].content!.parts![0].inlineData!.data!;
return Buffer.from(b64, "base64"); // 24kHz mono PCM
}
await synthesize(
"Say warmly: [slow] Welcome back, Alex. [happy] You crushed this week."
);
The inline [slow] and [happy] tags steer pacing and emotion mid-sentence — no separate prosody config. Tags must live inside square brackets, separated by text or punctuation: two adjacent tags will error. For a two-person podcast intro via cURL:
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents":[{"parts":[{"text":"TTS between Joe and Jane: Joe: [excited] The feed dropped. Jane: [amused] Took long enough."}]}],
"generationConfig":{"responseModalities":["AUDIO"]}
}' --output podcast.wav
The decoded audio is headerless PCM, so give ffmpeg the format explicitly if you transcode for smaller transfer: ffmpeg -f s16le -ar 24000 -ac 1 -i in.pcm -b:a 64k out.mp3. Preview rate limits follow Gemini Flash defaults (10 RPM on free, 1,000 RPM on paid), fine for prototyping. For production, queue synthesis in a BullMQ worker and cache finished clips in S3 or R2, keyed by a hash of text + voice + tagSet; a cache hit rate above 60% on a changelog feature is common. One caveat: preview model IDs have been renamed twice in the Gemini 3.1 family this quarter, so read the exact ID from an env var rather than hard-coding it.
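The cache key described above can be a single deterministic hash. A sketch using Node's built-in crypto (the ttsCacheKey name and the tag-extraction regex are ours):

```typescript
import { createHash } from "node:crypto";

// One deterministic cache key per (text, voice, tag set), so repeated
// changelog entries hit S3/R2 instead of re-synthesizing. Tags are the
// [lowercase] markers embedded in the text itself.
function ttsCacheKey(text: string, voice: string): string {
  const tagSet = [...new Set(text.match(/\[[a-z]+\]/g) ?? [])]
    .sort()
    .join(",");
  return createHash("sha256")
    .update(JSON.stringify({ text, voice, tagSet }))
    .digest("hex");
}
```

Use the hex digest as the object key (clips/<hash>.mp3); since the text already contains the tags, folding tagSet into the hash is belt-and-braces, but it keeps the key stable if you later strip tags before synthesis.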
How it compares to alternatives
| | Gemini 3.1 Flash TTS | OpenAI gpt-4o-mini-tts | ElevenLabs Flash v2.5 |
|---|---|---|---|
| Starts at | Free tier; $1/M text + $20/M audio tokens paid | $0.60 per 1M input chars, no free tier | $5/mo Starter (30k credits) |
| Best for | Multilingual, expressive multi-speaker narration | Low-latency voice replies inside GPT apps | Cloned brand voices, audiobook production |
| Key limit | Preview only: no SLA, model ID may change before GA | ~8 voices, fewer expressive tags | Per-character billing scales fast on free-tier SaaS |
| Integration | @google/genai SDK, Vertex AI, REST/cURL | OpenAI SDK, streaming WebSocket | REST API, WebSocket streaming, native SDK |
Try it this week
Pick one text-heavy screen in your product — an onboarding intro, a weekly changelog entry, a lesson summary — and wire Flash TTS behind a "Play" button. Ship it behind a feature flag so you can A/B the voice UX on 10% of sessions, then compare time-on-page and replay counts; if replay rate clears 15%, keep it and expand to every long-form page. For wider context on where the Gemini stack sits today, read our Gemma 4 review and the Q1 2026 Web+AI recap for the pricing shifts since January.