DEV Community

abdalrohman

Deep Dive: I Tested Google's New Gemini 2.5 Pro TTS with Emotion & SSML Tags

As developers, we're always on the lookout for powerful, accessible APIs, and high-quality text-to-speech (TTS) is no exception. For years, the trade-off has been between expensive, high-quality models with great control (like ElevenLabs) and free, more robotic-sounding alternatives.

I've been searching for a model that hits the sweet spot: high quality, granular control, and a generous free tier for experimentation. After testing a few options, I decided to put Google's new Gemini 2.5 Pro TTS model, available in Google AI Studio, through its paces.

The results were impressive, especially the model's ability to handle emotional cues and SSML tags directly in the prompt. Here's a breakdown of my tests so you can see how it performs.

How to Access It

  1. Go to Google AI Studio.
  2. Click the "Generate Media" tab.
  3. Select "Gemini speech generation".
  4. You'll see two models: Gemini 2.5 Pro Preview TTS and Gemini 2.5 Flash Preview TTS. My tests focus on the Pro model, which was noticeably better at interpreting nuanced tags.
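
If you'd rather script this than click through the UI, here's a minimal sketch of the same request through the Gemini API. Treat it as a set of assumptions, not something I verified in these tests: it assumes the `google-genai` Python SDK, the model ID `gemini-2.5-pro-preview-tts`, the prebuilt voice name `Kore`, and that the API returns raw 16-bit mono PCM at 24 kHz that has to be wrapped in a WAV header. `pcm_to_wav_bytes` and `synthesize` are names I made up for illustration.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, rate: int = 24000, channels: int = 1,
                     sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container so ordinary players can open them."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # bytes per sample: 2 means 16-bit audio
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def synthesize(prompt: str, model: str = "gemini-2.5-pro-preview-tts",
               voice: str = "Kore") -> bytes:
    """Request speech for `prompt` and return WAV bytes (needs network and an API key)."""
    # Imported lazily so the WAV helper above works without the SDK installed.
    from google import genai
    from google.genai import types

    # Assumption: the client picks up the API key from the environment.
    client = genai.Client()
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
                )
            ),
        ),
    )
    # Assumption: the first part of the first candidate carries the PCM audio.
    pcm = response.candidates[0].content.parts[0].inline_data.data
    return pcm_to_wav_bytes(pcm)
```

Calling `synthesize("[excited] We won the match!")` and writing the returned bytes to a `.wav` file should give you something playable, provided those assumptions hold for your SDK version.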

Test 1: Emotion & Tone Control with Simple Tags

The most impressive feature is using simple [tag] syntax to inject emotions, tones, and vocalizations. I used the list from the Fish-Speech repo as a benchmark.

The "Working?" column in each table records whether the model honored the tag.

Basic & Advanced Emotions

The model handled every single emotional cue I tested flawlessly.

Emotion | Working? | Prompt to Test
[angry] | Yes | [angry] This is completely unacceptable!
[excited] | Yes | [excited] We won the match! I can't believe it!
[sarcastic] | Yes | [sarcastic] Oh, wonderful. Another problem.
[scornful] | Yes | [scornful] You call that an achievement? How pathetic.
[empathetic] | Yes | [empathetic] I'm so sorry you have to go through this.

Tones, Vocalizations & Special Effects

This is where you can direct the performance. It works great for single-speaker sounds but can't generate environmental audio.

Command | Working? | Prompt to Test
[shouting] | Yes | [shouting] HEY! GET OUT OF THE WAY!
[whispering] | Yes | [whispering] Be very quiet, I don't want anyone to hear us.
[laughing] | Yes | That's the funniest thing I've heard all day! [laughing]
[sighing] | Yes | [sighing] Alright, I guess I'll do it myself.
[clears throat] | Yes | [clears throat] Ahem. Let's begin the presentation.
[speaking slowly] | Yes | [speaking slowly] You... must... understand... this.
[crowd laughing] | No (environmental audio) | The comedian told his best joke. [crowd laughing]
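
If you want to rerun these prompts in bulk, the bracket syntax is easy to generate. Here's a small sketch; `tagged` is a hypothetical helper of mine, not part of any Google SDK. It only builds the prompt string, with the cue placed before the text (as in most rows above) or after it (as in the [laughing] row).

```python
def tagged(tag: str, text: str, position: str = "before") -> str:
    """Attach a [tag] cue before or after the spoken text."""
    # Accept "angry" or "[angry]" and normalize to a single pair of brackets.
    cue = f"[{tag.strip().strip('[]')}]"
    if position == "before":
        return f"{cue} {text}"
    if position == "after":
        return f"{text} {cue}"
    raise ValueError("position must be 'before' or 'after'")


# A few rows from the tables above, expressed as (tag, text, position).
CASES = [
    ("angry", "This is completely unacceptable!", "before"),
    ("whispering", "Be very quiet, I don't want anyone to hear us.", "before"),
    ("laughing", "That's the funniest thing I've heard all day!", "after"),
]

prompts = [tagged(tag, text, position) for tag, text, position in CASES]
```

Each string in `prompts` can be pasted into AI Studio as-is, which makes it easy to compare the Pro and Flash models on an identical set of cues.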

Test 2: Deep Dive into SSML Support

For fine-grained, programmatic control, SSML is the standard. I was surprised by how much of the SSML spec Gemini already supports. This is huge for building dynamic applications.

Here's a summary of my SSML tests.

Category | SSML Tag & Feature | Working? | Example
Pauses | <break time=""> |  | Text <break time="2s"/> more text.
Speech Control | <prosody rate/pitch/volume> |  | <prosody rate="slow">Slowing down.</prosody>
Emphasis | <emphasis level=""> |  | A <emphasis level="strong">very</emphasis> important point.
Content Type | <say-as interpret-as="characters"> |  | Spell <say-as interpret-as="characters">AI</say-as>.
Content Type | <say-as interpret-as="date"> |  | The date is <say-as interpret-as="date" format="mdy">9-10-2025</say-as>.
Pronunciation | <sub> (Substitution) |  | The <sub alias="World Wide Web Consortium">W3C</sub>.
Pronunciation | <phoneme> (IPA) |  | I say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.
Audio Insertion | <audio> |  | <audio src="..."> is not supported.
Voice Change | <voice> / <lang> |  | Switch to <lang xml:lang="fr-FR">chat</lang>.
Content Type | <say-as interpret-as="currency"> |  | <say-as interpret-as="currency" language="en-US">$50</say-as>
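
Since SSML is meant for programmatic use, here's a rough sketch of building these snippets in code. It uses only the Python standard library; `element` is a hypothetical helper of mine. One assumption to flag: I'm escaping `&` and `<` in the spoken text as XML requires, but I haven't checked whether Gemini expects escaped or raw text inside tags, so test that before relying on it.

```python
from xml.sax.saxutils import escape, quoteattr


def element(name: str, text: str = "", **attrs: str) -> str:
    """Render one SSML element, escaping text and quoting attribute values.

    Underscores in keyword names become hyphens, so interpret_as="date"
    renders as interpret-as="date".
    """
    rendered = "".join(
        f" {key.replace('_', '-')}={quoteattr(value)}" for key, value in attrs.items()
    )
    if not text:
        return f"<{name}{rendered}/>"  # self-closing form, e.g. <break time="2s"/>
    return f"<{name}{rendered}>{escape(text)}</{name}>"


pause = element("break", time="2s")
slow = element("prosody", "Slowing down.", rate="slow")
date = element("say-as", "9-10-2025", interpret_as="date", format="mdy")

prompt = f"Text {pause} more text. {slow} The date is {date}."
```

The helper reproduces the `<break>`, `<prosody>`, and `<say-as>` examples from the table character for character, which keeps generated prompts consistent with the ones I tested by hand.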

Key Takeaways & Limitations for Devs

  • Hybrid Control is Powerful: You can mix and match [tag] style and SSML in the same prompt for the best of both worlds (e.g., [sad] <break time="1s"/> I guess it's over.).
  • Pro vs. Flash: The Pro model is significantly better at interpreting nuanced tags. Use Flash for speed and basic TTS, Pro for quality and control.
  • API vs. Studio: These tests were in the AI Studio UI. Behavior in the official API might differ slightly. Always test.
  • Error Handling: For long, complex prompts with many tags, the model can sometimes misfire and read a tag aloud instead of acting on it. The workaround that helped in my tests is to break the text into smaller chunks and generate each one separately.
  • No Voice Cloning (Yet): This is the biggest missing feature compared to paid competitors.
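
The chunking workaround from the error-handling point can be sketched like this. It's a naive splitter under my own assumptions: the 600-character default is an arbitrary guess rather than a documented limit, and `chunk_text` is a hypothetical name. It breaks only after sentence-ending punctuation, so a tag placed at the start of a sentence stays attached to that sentence.

```python
import re


def chunk_text(text: str, max_chars: int = 600) -> list[str]:
    """Group sentences into chunks of up to max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split on whitespace that follows ., ! or ? (the punctuation stays with its sentence).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


parts = chunk_text("[sad] I guess it's over. [excited] Wait, we won! Let's go.", max_chars=30)
```

Keep in mind that a cue like [sad] only appears in the chunk that contains it, so if you want the tone to carry across several chunks you'd have to repeat the tag at the start of each one.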

Final Thoughts

The Gemini 2.5 Pro TTS model is a seriously impressive tool for developers. The combination of easy-to-use emotional tags and robust SSML support, available for free in AI Studio, makes it one of the most versatile and high-quality options on the market today.

If they add voice cloning, it'll be a true category killer.

Have you tried it? Drop your findings or any cool use cases in the comments below!
