You paste a script into a text-to-speech tool, hit generate, and the result sounds… flat. The pacing is wrong, emphasis lands on the wrong syllable, and your carefully chosen words blur together into a robotic drone. You know the content is good. The delivery just isn't there yet.
That gap between "readable" and "listenable" is exactly what SSML closes. Speech Synthesis Markup Language is a W3C standard that lets you tell a TTS engine how to speak — not just what to say. Think of it as stage directions for a voice actor who happens to be software.
In this tutorial, you'll learn the four SSML tags that handle most real-world audio polishing: <break>, <emphasis>, <prosody>, and <phoneme>. Each section includes a plain-text "before" and an SSML-enhanced "after" so you can hear the difference immediately.
What SSML Actually Is (and Why You Should Care)
SSML stands for Speech Synthesis Markup Language. It's an XML-based markup standard maintained by the W3C, the same body behind HTML and CSS. Most major TTS engines support at least a subset of it, from cloud providers to standalone apps, though the exact tags and attribute values each one accepts can differ.
Without SSML, a TTS engine makes its best guess about pacing, pronunciation, and emphasis. Those guesses are surprisingly good for casual sentences. But the moment your script contains a product name, a dramatic pause, a foreign loan word, or a passage that needs emotional weight, guesswork falls apart.
SSML doesn't require programming skills. If you've ever written an HTML tag, you already know the syntax. You wrap the text you want to control in an opening and closing tag, add an attribute or two, and let the engine do the rest.
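One detail worth knowing before you start tagging: the W3C specification puts all SSML inside a single root <speak> element. Many providers accept a bare <speak>, while a fully conformant standalone document also declares a version and the SSML namespace on it. If you do want to script things, here is a minimal Python sketch (the helper name wrap_in_speak is made up for illustration and is not part of any TTS SDK) that wraps a fragment and confirms it is well-formed XML before you send it anywhere:

```python
import xml.etree.ElementTree as ET


def wrap_in_speak(fragment: str) -> str:
    """Wrap an SSML fragment in a <speak> root and check that it parses as XML.

    Hypothetical helper for illustration. Raises ET.ParseError when a tag is
    left unclosed or the markup is otherwise malformed.
    """
    document = f"<speak>{fragment}</speak>"
    ET.fromstring(document)  # parse only to validate; the result is discarded
    return document


ssml = wrap_in_speak('Welcome to the show. <break time="600ms"/> Let\'s dive in.')
print(ssml)
```

This only checks that the XML is balanced. It does not tell you whether a particular engine supports a given tag.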
For creators working in podcasting, audiobook narration, or document-to-audio workflows, SSML is the fastest way to go from "decent first draft" to "publish-ready." Let's start with the easiest tag.
Breaks: Controlling Silence
The <break> tag inserts a pause. That sounds trivial until you realize how much pacing matters. A half-second pause after a heading lets the listener's brain reset. A full second of silence before a key statistic creates anticipation. Without explicit breaks, TTS engines sometimes rush through transitions that a human narrator would breathe through.
Before (plain text)
Welcome to the show. Today we're talking about voice design. Let's dive in.
Most engines give each period only a brief default pause, so the sentences run together. "Show" and "Today" collide. "Let's dive in" arrives before the listener has processed the topic.
After (with SSML breaks)
Welcome to the show.
<break time="600ms"/>
Today we're talking about voice design.
<break time="400ms"/>
Let's dive in.
The time attribute accepts milliseconds (ms) or seconds (s). You can also use strength values like medium, strong, or x-strong if you prefer relative pauses over exact durations. Start with 400ms for natural breathing room and 800ms for section transitions. Adjust from there.
A good rule of thumb: anywhere you'd take a breath if you were reading the script aloud, drop a <break>. Anywhere you want the listener to sit with an idea, make the pause longer.
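If you have a long script, you can rough in those breathing pauses automatically and then tune the important ones by hand. The sketch below is one way to do it in Python. The function name add_sentence_breaks is hypothetical, and the sentence splitting is deliberately naive: it assumes a sentence ends at ".", "!", or "?" followed by whitespace, so abbreviations like "Dr." will get a pause too.

```python
import re
from xml.sax.saxutils import escape


def add_sentence_breaks(text: str, pause_ms: int = 400) -> str:
    """Insert a <break> between sentences (none after the final one)."""
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    # Naive split: after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    separator = f'\n<break time="{pause_ms}ms"/>\n'
    # escape() protects any &, <, or > in the script text itself.
    return separator.join(escape(sentence) for sentence in sentences)


script = "Welcome to the show. Today we're talking about voice design. Let's dive in."
print(add_sentence_breaks(script))
```

Treat the output as a first pass: uniform pauses everywhere sound mechanical, so lengthen the ones at section transitions and remove the ones you don't need.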
Emphasis: Guiding Attention
Emphasis is how you bold a word in audio. The <emphasis> tag tells the engine to stress a word or phrase, subtly shifting pitch and volume the way a human speaker naturally would.
Before
You need to back up your files every single day.
The engine reads every word at equal weight. The urgency of "every single day" disappears.
After
You need to back up your files <emphasis level="strong">every single day</emphasis>.
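The level attribute accepts strong, moderate, none, and reduced. Use moderate for conversational stress and strong for moments that need real weight: warnings, key takeaways, or emotional beats. The other two work in the opposite direction: reduced de-emphasizes a phrase, and none stops the engine from stressing a word it would otherwise emphasize.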
Over-emphasizing dilutes the effect. If everything is bold, nothing is bold. A useful guideline: limit strong emphasis to one or two phrases per paragraph. Let the rest breathe at moderate or with no tag at all.
Emphasis pairs beautifully with breaks. Place a short <break time="300ms"/> before an emphasized phrase and the listener's ear naturally locks onto the next word.
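Here is a small Python sketch of that pairing, assuming you are assembling SSML from plain-text lines. The emphasize helper is hypothetical. It wraps the first occurrence of a phrase in an <emphasis> tag, optionally puts a short break in front of it, and rejects any level outside the four the W3C specification defines (strong, moderate, none, reduced).

```python
from xml.sax.saxutils import escape

VALID_LEVELS = {"strong", "moderate", "none", "reduced"}


def emphasize(text: str, phrase: str, level: str = "moderate", lead_pause_ms: int = 0) -> str:
    """Wrap the first occurrence of `phrase` in <emphasis>, optionally after a <break>."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    before, found, after = text.partition(phrase)
    if not found:
        raise ValueError(f"phrase not found: {phrase!r}")
    pause = f'<break time="{lead_pause_ms}ms"/> ' if lead_pause_ms > 0 else ""
    return (
        f"{escape(before)}{pause}"
        f'<emphasis level="{level}">{escape(phrase)}</emphasis>{escape(after)}'
    )


line = "You need to back up your files every single day."
print(emphasize(line, "every single day", level="strong", lead_pause_ms=300))
```

Because the helper tags only one phrase per call, it nudges you toward the "one or two strong phrases per paragraph" habit rather than tagging everything.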
Prosody: Pitch, Rate, and Volume
If <break> and <emphasis> are scalpels, <prosody> is the full surgical kit. It lets you control three dimensions of the voice at once: pitch (how high or low), rate (how fast or slow), and volume (how loud or soft).
Before
Breaking news. The merger has been confirmed. Shares are up twelve percent.
Read in a flat monotone, this sounds like a grocery list instead of a newsflash.
After
<prosody rate="105%" pitch="+5%">Breaking news.</prosody>
<break time="500ms"/>
<prosody rate="95%" volume="loud">The merger has been confirmed.</prosody>
<break time="300ms"/>
Shares are up twelve percent.
Here the opening line is slightly faster and higher-pitched — mimicking the energy of a news anchor. The confirmation slows down and gets louder for gravity. The final detail returns to normal delivery, grounding the listener.
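You can set values as percentages (rate="80%"), relative changes (pitch="+2st" for semitones), or keywords (volume="soft"). Percentages tend to be the most portable across engines, but accepted units and ranges differ between providers, so check your engine's documentation before relying on a specific value.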
Start small. A 5–10% shift in rate or pitch is often enough. Large swings (say, rate="50%") sound unnatural. Think of prosody as seasoning: a pinch transforms the dish; a handful ruins it.
For podcasters building scripted shows with TTS, prosody adjustments are what separate a monotone draft from something listeners actually enjoy. Vary energy across sections, slow down for definitions, and speed up during transitions.
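If you generate scripts programmatically, building the markup with an XML library saves you from hand-balancing tags. The sketch below rebuilds the newsflash example with Python's standard xml.etree.ElementTree module. The prosody helper is a hypothetical convenience function, and it only checks attribute names, not whether the values are ones your engine accepts.

```python
import xml.etree.ElementTree as ET


def prosody(text: str, **attrs: str) -> ET.Element:
    """Build a <prosody> element; only pitch, rate, and volume are accepted."""
    unknown = set(attrs) - {"pitch", "rate", "volume"}
    if unknown:
        raise ValueError(f"unsupported prosody attributes: {sorted(unknown)}")
    element = ET.Element("prosody", attrs)
    element.text = text
    return element


speak = ET.Element("speak")
speak.append(prosody("Breaking news.", rate="105%", pitch="+5%"))
ET.SubElement(speak, "break", time="500ms")
speak.append(prosody("The merger has been confirmed.", rate="95%", volume="loud"))
ET.SubElement(speak, "break", time="300ms")
# Text that follows an element is stored on that element's `tail`.
speak[-1].tail = "Shares are up twelve percent."

print(ET.tostring(speak, encoding="unicode"))
```

The library escapes special characters and closes every tag for you, which matters more as scripts get longer.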
Phonemes: Nailing Tricky Pronunciations
Names, technical terms, loan words, brand names — TTS engines mispronounce these constantly. The <phoneme> tag lets you specify exact pronunciation using the International Phonetic Alphabet (IPA) or a provider-specific phonetic alphabet.
Before
The event is held in Yosemite every year.
Some engines pronounce "Yosemite" as "YOZ-mite" instead of "yoh-SEM-ih-tee."
After
The event is held in <phoneme alphabet="ipa" ph="joʊˈsɛmɪti">Yosemite</phoneme> every year.
The alphabet attribute tells the engine which phonetic system you're using. IPA (ipa) is the universal standard. The ph attribute contains the phonetic spelling.
You don't need to memorize IPA. Online IPA keyboards and dictionaries make it easy to look up any word. For common mispronunciations (brand names, city names, foreign phrases), a phoneme tag fixes the pronunciation wherever you apply it, so you can reuse the same tag each time the word appears.
A related tag worth knowing is <sub>, which substitutes display text with spoken text. It's lighter than <phoneme> when you just need an abbreviation expanded:
The file is 5 <sub alias="megabytes">MB</sub>.
Between <phoneme> and <sub>, you can correct virtually every pronunciation quirk a TTS engine throws at you.
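When the same tricky words keep coming up, a small pronunciation dictionary saves retyping the tags. This Python sketch is one possible approach, not a feature of any particular engine: the phoneme, sub, and apply_fixes helpers and the FIXES dictionary are all hypothetical names, and the code assumes your script is plain text with no SSML in it yet.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def phoneme(word: str, ipa: str) -> str:
    """Return a <phoneme> tag that pronounces `word` using the IPA string `ipa`."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


def sub(written: str, spoken: str) -> str:
    """Return a <sub> tag that reads `written` aloud as `spoken`."""
    return f"<sub alias={quoteattr(spoken)}>{escape(written)}</sub>"


# Hypothetical per-project dictionary: word -> replacement markup.
FIXES = {
    "Yosemite": phoneme("Yosemite", "joʊˈsɛmɪti"),
    "MB": sub("MB", "megabytes"),
}


def apply_fixes(text: str) -> str:
    """Replace whole-word matches of every dictionary key in a single pass."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, FIXES)) + r")\b")
    # One pass over the text, so inserted markup is never matched again.
    return pattern.sub(lambda match: FIXES[match.group(1)], escape(text))


print(apply_fixes("The event is held in Yosemite every year. The file is 5 MB."))
```

Matching is case-sensitive and whole-word only, so "MB" is replaced but "MBA" is left alone.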
Putting It All Together in EchoLive
You don't have to write raw XML in a text editor. EchoLive's visual SSML tools let you highlight a word, pick a tag from a toolbar, and adjust attributes with sliders — no angle brackets required. The studio editor shows a segment-based timeline, so you can apply different voices, pacing, and SSML to each section of your project independently.
Here's a workflow that takes about five minutes:
- Import your script. EchoLive's Smart Import handles txt, md, docx, pdf, HTML, and URLs. It auto-segments your content and suggests pacing.
- Preview the raw output. Listen for flat spots, mispronunciations, and rushed transitions.
- Add SSML. Use the visual editor to drop breaks at section boundaries, add emphasis to key phrases, tweak prosody for energy shifts, and fix any names with phoneme tags.
- Regenerate and compare. The before-and-after difference is usually dramatic.
EchoLive supports 650+ neural voices across three quality tiers. Experiment with different voices — some respond more dramatically to prosody shifts than others. You can try voices instantly in the Playground before committing to a full project.
Quick Reference Cheat Sheet
| Tag | What It Controls | Key Attributes | Example Value |
|---|---|---|---|
| <break> | Silence / pauses | time, strength | time="500ms" |
| <emphasis> | Stress on words | level | level="strong" |
| <prosody> | Pitch, rate, volume | pitch, rate, volume | rate="90%" |
| <phoneme> | Exact pronunciation | alphabet, ph | ph="joʊˈsɛmɪti" |
| <sub> | Text substitution | alias | alias="megabytes" |
Keep this table handy for your first few projects. After a dozen scripts, the tags will feel as natural as bold and italic in a word processor.
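Until the tags are second nature, typos in tag or attribute names are the most common slip. As a rough sketch, the cheat sheet can double as a tiny linter. The lint_ssml function and the ALLOWED table below are hypothetical and cover only the tags in this tutorial. Real engines accept more tags than these, so treat a warning as a prompt to double-check, not as proof of an error.

```python
import xml.etree.ElementTree as ET
from typing import List

# Tags and attributes from the cheat sheet above, plus the <speak> root.
ALLOWED = {
    "speak": set(),
    "break": {"time", "strength"},
    "emphasis": {"level"},
    "prosody": {"pitch", "rate", "volume"},
    "phoneme": {"alphabet", "ph"},
    "sub": {"alias"},
}


def lint_ssml(fragment: str) -> List[str]:
    """Return warnings for tags or attributes that are not in the cheat sheet."""
    root = ET.fromstring(f"<speak>{fragment}</speak>")
    warnings = []
    for element in root.iter():
        if element.tag not in ALLOWED:
            warnings.append(f"unknown tag <{element.tag}>")
            continue
        for name in element.attrib:
            if name not in ALLOWED[element.tag]:
                warnings.append(f"unknown attribute {name!r} on <{element.tag}>")
    return warnings


print(lint_ssml('Backups run <emphasis levl="strong">every day</emphasis>.'))
# prints ["unknown attribute 'levl' on <emphasis>"]
```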
Start Shaping Your Audio
SSML is the difference between audio that exists and audio that connects. Four tags — <break>, <emphasis>, <prosody>, and <phoneme> — give you control over pacing, stress, energy, and pronunciation. That's enough to transform a flat TTS draft into something that sounds intentional and polished.
The best way to learn is to experiment. Open a script you've already drafted, listen for the rough spots, and tag them. Within a few minutes, you'll hear the improvement. If you want a visual editor that handles the markup for you, EchoLive's studio lets you build nuanced audio segment by segment — no XML expertise required.
Originally published on EchoLive.