Gemini 3.1 Flash text to speech (TTS) is a new model that you can direct to get the precise audio performance you want. In this blog post, I'll share some tips on how to guide the model with prompts, along with some examples of its strengths.
Out of the box, gemini-3.1-flash-tts-preview natively interprets a transcript and determines how your words should be delivered. Simple transcripts without any additional prompting sound natural. But 3.1 Flash TTS also comes with tools you can use to steer it.
You can give the model plenty of context, such as an audio profile – who is speaking, how they are speaking, what their voice sounds like, and so on. You can also describe the scene, where they are, what they are doing, the environment, and provide any extra "director's notes" to guide the performance. The model will use that information to generate speech that sounds right for that context.
You can now also use tags to control the delivery of specific parts of the transcript. Tags are inline modifiers like [whispers] or [laughs] that give you granular control over the delivery. You can use them to change the tone, pace, and emotional vibe of a line or section of the transcript. You can also use them to add interjections and a few other non-verbal sounds to the performance, like [cough], [sighs] or [gasp].
There are no limits to the tags you can use. You can be creative with what you put within those [] brackets, and the model will always do its best to understand and interpret them.
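Because tags are just bracketed text inside the transcript, ordinary string tools can inspect them. The following is a minimal sketch, not part of any Gemini SDK: `find_tags` and `strip_tags` are hypothetical helper names, and the regex assumes that tags never nest.

```python
import re

# Assumption: a tag is any run of text between one "[" and the next "]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(transcript):
    """Return the tags used in a transcript, in order of appearance."""
    return [tag.strip() for tag in TAG_PATTERN.findall(transcript)]


def strip_tags(transcript):
    """Return only the spoken words, with tags removed and spacing tidied."""
    without_tags = TAG_PATTERN.sub("", transcript)
    return re.sub(r"\s+", " ", without_tags).strip()


example = "[sighs] I suppose so. [whispers] Fine."
print(find_tags(example))
print(strip_tags(example))
```

A check like this can be handy for reviewing which tags a long script relies on, or for producing a clean, readable copy of the script.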
Simple transcripts and creative tags
To show the kind of variability you can get with tags alone, here is a set of examples that each say the same thing, with the same voice, but the delivery changes based on the tags I used. I picked the Algenib voice, a male, slightly gravelly voice.
Here's how it sounds with no tags:
Hey there, I'm a new text to speech model, and I can say things in many different ways. How can I help you today?
Let's start with a change of emphasis. Our speaker is excited, bored, or reluctant, and we can hear it:
[excitedly] Hey there, I'm a new text to speech model...
[bored] Hey there, I'm a new text to speech model...
[reluctantly] Hey there, I'm a new text to speech model...
We can also use tags to change the pace of the delivery, and combine them with emphasis too:
[very fast] Hey there, I'm a new text to speech model...
[very slowly] Hey there, I'm a new text to speech model...
[sarcastically, one painfully slow word at a time] Hey there, I'm a new text to speech model...
Tags also give precise control over sections, so we can whisper something, then shout something, or whatever combination you want:
[asmr] Hey there, I'm a new text to speech model, [deep and loud shouting] and I can say things in many different ways. [asmr] How can I help you today?
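If you build transcripts like this in code, one option is to keep each section and its tag as a pair and join them at the end. This is a small sketch using only string formatting; `tag_segments` is a hypothetical helper name, not part of any SDK.

```python
def tag_segments(segments):
    """Join (tag, text) pairs into one transcript.

    A tag of None (or an empty string) leaves that section untagged.
    """
    parts = []
    for tag, text in segments:
        parts.append(f"[{tag}] {text}" if tag else text)
    return " ".join(parts)


transcript = tag_segments([
    ("asmr", "Hey there, I'm a new text to speech model,"),
    ("deep and loud shouting", "and I can say things in many different ways."),
    ("asmr", "How can I help you today?"),
])
print(transcript)
```

Keeping the tag separate from the text makes it easy to swap the delivery of one section without retyping the words.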
You can really try all sorts of things:
[like a dog] Hey there, I'm a new text to speech model...
[like dracula] Hey there, I'm a new text to speech model...
[singing] Hey there, I'm a new text to speech model...
Some more tags you can try:
- [amazed]
- [crying]
- [curious]
- [gasp]
- [giggles]
- [mischievously]
- [panicked]
- [sarcastic]
- [serious]
- [sighs]
- [snorts]
- [tired]
- [trembling]
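To compare several of these tags quickly, you could generate one tagged variant of the same line per tag and send each to the model in turn. The sketch below only builds the strings; `tagged_variants` is a hypothetical helper, and the call to the model is left out.

```python
LINE = (
    "Hey there, I'm a new text to speech model, and I can say things "
    "in many different ways. How can I help you today?"
)

# A few of the style tags from the list above.
TAGS = ["amazed", "curious", "panicked", "sarcastic", "serious", "tired"]


def tagged_variants(line, tags):
    """Map each tag to the same line prefixed with that tag."""
    return {tag: f"[{tag}] {line}" for tag in tags}


variants = tagged_variants(LINE, TAGS)
for text in variants.values():
    print(text)
```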
Tags give us quick and easy control over the delivery of our transcript. We can also combine them with a context prompt to set the overall tone and vibe of the performance.
Context and performance
By providing nuanced instructions like a precise regional accent, specific features like breathiness, or pacing, you can use the model’s context awareness to generate dynamic, natural, and expressive audio performances. This avoids needing to use tags for every micro-edit.
It works best when the transcript and prompts align, so that "who is saying it" matches with "what is said" and "how it is being said."
Prompting structure
A good prompt includes a few key elements before the transcript:
- Audio profile
- Scene
- Director's notes
These sections are all optional, but they can help the model understand the context and performance you want. You can think of them as a system instruction for creating consistent-sounding outputs from different transcripts.
Audio profile
This is the persona for your voice. You can define a character identity, archetype, and any other characteristics like age or background.
Giving your character a name helps ground the model and tie the performance together. You can refer to the character by name when setting the scene and context. It's also helpful to define their identity, like whether they are a radio DJ, a podcaster, or a news reporter.
Scene
The scene sets the stage. Location, mood, and environmental details define the tone and vibe. You should describe what is happening around the character and how it affects them. The scene gives the model environmental context for the entire interaction and will guide the performance in a subtle and organic way. Examples include a conversation at a busy early-morning coffee shop, a DJ in their professional studio, or an announcement in a busy airport.
## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.
Director's notes
Director's notes are performance guidance for the model. The most common directions are style, pacing, and accent, but the model is not limited to these. Feel free to include custom instructions to cover any additional details important to your performance, and go into as much or as little detail as necessary.
### DIRECTOR'S NOTES
Style: Enthusiastic and Sassy GenZ beauty YouTuber
Accent: Southern California valley girl from Laguna Beach
Pacing: Speaks at an energetic pace, keeping up with the extremely fast, rapid delivery influencers use in short form videos.
Style
The style sets the tone of the generated speech. Include things like upbeat, energetic, relaxed, or bored to guide the performance. Be descriptive and provide as much detail as necessary. Saying "Infectious enthusiasm. The listener should feel like they are part of a massive, exciting community event." works much better than simply saying "energetic and enthusiastic".
You can even try terms that are popular in the voiceover industry, like "vocal smile." You can layer as many style characteristics as you want.
Style: Sassy GenZ beauty YouTuber, who mostly creates content for YouTube Shorts.
Accent
Describe the desired accent. The more specific you are, the better the results. For example, use "British English accent as heard in Croydon, England" rather than just "British Accent".
Accent: Jaz is a DJ from Brixton, London
Pacing
You can also specify the overall pacing and pace variation throughout the piece.
Pacing: The "Drift": The tempo is incredibly slow and liquid. Words bleed into each other. There is zero urgency.
Full prompt example
Here is an example of what a full prompt might look like:
# AUDIO PROFILE: Jaz R.
## "The Morning Hype"
## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.
### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").
Accent: Jaz is from Brixton, London
Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks with a "bouncing" cadence. High-speed delivery with fluid transitions—no dead air, no gaps.
### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.
#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you. [shouting] Turn this up! We’ve got the project roadmap landing in three, two... let's go!
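If you reuse the same persona across many transcripts, it may help to assemble this layout in code. The sketch below is one possible approach using a dataclass: `TTSPrompt` and its field names are hypothetical, the headings simply mirror the example above, and every section except the transcript is treated as optional. Sending the rendered text to the model is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class TTSPrompt:
    """Hypothetical container for the prompt sections described in this post."""
    transcript: str
    name: str = ""
    title: str = ""
    scene_title: str = ""
    scene: str = ""
    notes: dict = field(default_factory=dict)  # e.g. {"Style": "...", "Pace": "..."}
    sample_context: str = ""

    def render(self) -> str:
        lines = []
        if self.name:
            lines.append(f"# AUDIO PROFILE: {self.name}")
        if self.title:
            lines.append(f'## "{self.title}"')
        if self.scene:
            lines.append(f"## THE SCENE: {self.scene_title}".rstrip())
            lines.append(self.scene)
        if self.notes:
            lines.append("### DIRECTOR'S NOTES")
            lines.extend(f"{key}: {value}" for key, value in self.notes.items())
        if self.sample_context:
            lines.append("### SAMPLE CONTEXT")
            lines.append(self.sample_context)
        if not lines:
            # No context at all: a bare transcript is a valid prompt on its own.
            return self.transcript
        lines.append("#### TRANSCRIPT")
        lines.append(self.transcript)
        return "\n".join(lines)


prompt = TTSPrompt(
    name="Jaz R.",
    title="The Morning Hype",
    scene_title="The London Studio",
    scene="It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline.",
    notes={
        "Accent": "Jaz is from Brixton, London",
        "Pace": "Speaks at an energetic pace, keeping up with the fast music.",
    },
    transcript="[excitedly] Yes, massive vibes in the studio!",
)
print(prompt.render())
```

With the persona held in one object, only the `transcript` field needs to change between generations, which keeps the context identical from one output to the next.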
Ask Gemini for help
If you're struggling to find the words, Gemini works well as a co-director. Here's a good system instruction to generate context from a simple prompt:
You are a scriptwriter and audio director. I have a simple context but NO TRANSCRIPT.
TASK:
1. Write a creative, engaging script based on the given context.
2. Format the entire output as a structured TTS prompt. Follow the strict output format exactly.
You may include emotion and interjection tags in brackets within the script to direct the TTS model's performance. For example, you can write: "[amused] Oh, really?" or "[sigh] I suppose so". You can be creative with the tags you use, and the model will always do its best to understand and interpret them.
STRICT OUTPUT FORMAT:
# AUDIO PROFILE: [Invent a Name]
## "[Invent a Title]"
## THE SCENE: [Invent a Scene Title]
[Vivid description of the scene]
### DIRECTOR'S NOTES
Style: [Style instructions]
Pace: [Pace instructions]
Accent: [Accent instructions]
### SAMPLE CONTEXT
[Role/Persona description]
#### TRANSCRIPT
[Script]
----------------
INPUT CONTEXT:
...
CRITICAL RULE:
Ensure the divider "#### TRANSCRIPT" is used exactly as written before the spoken text.
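Since the system instruction insists on the exact "#### TRANSCRIPT" divider, you can check for it before passing the generated prompt on. The following is a minimal sketch that assumes the divider sits on a line of its own; `split_prompt` is a hypothetical helper name.

```python
DIVIDER = "#### TRANSCRIPT"


def split_prompt(generated):
    """Split a generated prompt into (context, transcript).

    Raises ValueError if the divider line is missing, repeated,
    or not followed by any spoken text.
    """
    lines = generated.splitlines()
    hits = [i for i, line in enumerate(lines) if line.strip() == DIVIDER]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one {DIVIDER!r} line, found {len(hits)}")
    index = hits[0]
    context = "\n".join(lines[:index]).strip()
    transcript = "\n".join(lines[index + 1:]).strip()
    if not transcript:
        raise ValueError("transcript is empty")
    return context, transcript


sample = (
    "# AUDIO PROFILE: Sam\n"
    "### DIRECTOR'S NOTES\n"
    "Style: calm\n"
    "#### TRANSCRIPT\n"
    "[sigh] I suppose so.\n"
)
context, transcript = split_prompt(sample)
print(transcript)
```

Failing loudly when the divider is wrong makes it easier to retry the generation than to debug an odd-sounding audio clip afterwards.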
Play around and find out
Try some of these examples for yourself on AI Studio.
Some tips to keep in mind:
- keep the script and the direction coherent
- don't overspecify: you don't need to describe everything, and giving the model space to fill in the gaps sometimes helps with naturalness