Inworld TTS Paralinguistic Tags Don't Work — Here's What Does


If you've worked with expressive TTS in the last year you've probably seen the pattern:

She paused. [sigh] "Fine, you can come in."

Inline paralinguistic tags. Half the model demos use them. So when we wired up Inworld TTS-1.5 Max for HoneyChat, a Telegram-native AI companion where voice messages are a first-class output, we sprinkled [laugh], [sigh], [breathe] through the prompts and shipped.

The audio sounded fine. Just… exactly the same as before. No laugh. No sigh. The tags were getting read out as silence at best, and as the literal text "sigh" at worst, depending on the voice.

We tested all the variants we could find. None of them moved the needle.

HoneyChat voice stack at a glance:

Engine: Inworld TTS-1.5 Max — $10 per 1M characters, currently #1 on the TTS Arena ELO board at 1259 ELO, 15 languages with native pronunciation: en, ru, ja, zh, ko, es, fr, de, it, pt, pl, hi, ar, he, nl.
Voice catalog: 312 designed voices (26 character archetypes × 12 languages), stored as voiceId strings in config/archetype_voice_ids.json. Generated via the Voice Design API and managed with core/voice_design.py.
Custom voices: Voice Clone Manager (core/voice_clone_manager.py) — persistent voiceId minted from a WAV/MP3 sample.
Cache: voice previews + test samples are lazy-loaded from Storj S3 via core/voice_cache.py.
Fallback: gTTS (Google) — free, no API key, used if Inworld returns 5xx or budget is exhausted.
What we removed to get here: Kokoro (CPU Docker, latency too high) and Chatterbox (GPU on Vast.ai, ops cost too high). Inworld replaced both for a flat per-char cost and dramatically better expressivity.
One API gotcha: gender enum is VOICE_GENDER_MALE/VOICE_GENDER_FEMALE, not "male"/"female" strings. Passing the strings 400s silently.
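
The gender gotcha is easy to guard against at the call site. A minimal sketch, assuming you want to accept loose strings internally and convert them before building a request: the two enum names are the ones quoted above, while `to_inworld_gender` and its lookup table are hypothetical helpers, not part of any Inworld SDK.

```python
# Enum names are taken from the post; the helper itself is a hypothetical sketch.
_GENDER_ENUM = {
    "male": "VOICE_GENDER_MALE",
    "female": "VOICE_GENDER_FEMALE",
}


def to_inworld_gender(value: str) -> str:
    """Map a loose gender string to the enum name the API expects."""
    cleaned = value.strip()
    # Already an enum name (in any letter case): pass it through normalized.
    if cleaned.upper() in _GENDER_ENUM.values():
        return cleaned.upper()
    try:
        return _GENDER_ENUM[cleaned.lower()]
    except KeyError:
        # Fail locally with a readable message instead of waiting for a 400.
        raise ValueError(f"unsupported gender value: {value!r}") from None
```

Failing early with a `ValueError` keeps the bad value out of the request entirely, which is easier to debug than an error coming back from the API.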

What actually doesn't work

Tried on the same sentence, same voice, side-by-side audio comparison:

| Pattern | What it did |
| --- | --- |
| `[laugh]` `[sigh]` | Silence in output |
| `(laughs)` `(sighs)` | Sometimes read literally |
| `*laughs*` `*sighs*` | Silence (asterisks get stripped) |
| `<laugh/>` `<sigh/>` | Silence (not valid SSML on Inworld) |
| `<emotion>laugh</emotion>` | Silence |

The Inworld API does not document support for any of these. We had assumed (because every other TTS post on the internet uses them) that they were a universal convention. They are not.

What Inworld does expose is temperature and speakingRate as request parameters, plus a small subset of SSML. The expressivity has to come from those plus how you shape the text itself.
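
A side-by-side comparison like the one in the table is easier to repeat if the test sentences are generated rather than typed by hand. The sketch below only builds the text variants; sending them to a TTS endpoint and listening to the results is left out. The style labels, `TAG_STYLES`, and `build_probe_variants` are names invented for this example.

```python
# Tag styles from the comparison table; the dictionary keys are made-up labels.
TAG_STYLES = {
    "brackets": "[{cue}]",
    "parens": "({cue}s)",
    "asterisks": "*{cue}s*",
    "self_closing_xml": "<{cue}/>",
    "emotion_element": "<emotion>{cue}</emotion>",
}


def build_probe_variants(template: str, cue: str) -> dict[str, str]:
    """Return one test sentence per tag style, plus an untagged baseline.

    `template` must contain the literal marker "{tag}" where the cue goes.
    """
    if "{tag}" not in template:
        raise ValueError('template needs a "{tag}" marker')
    # Baseline: drop the marker and collapse the doubled space it leaves behind.
    variants = {"baseline": " ".join(template.replace("{tag}", "").split())}
    for label, style in TAG_STYLES.items():
        variants[label] = template.replace("{tag}", style.format(cue=cue))
    return variants


if __name__ == "__main__":
    probes = build_probe_variants('She paused. {tag} "Fine, you can come in."', "sigh")
    for label, text in probes.items():
        print(f"{label:>18}: {text}")
```

Synthesizing every variant with the same voice and the same request parameters, then comparing each clip against the baseline, is what makes a silent tag distinguishable from a rendered one.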

What actually does work

After enough A/B-ing across 26 archetypes × 15 languages, four patterns reliably change the audio output.

1. Asterisks for emphasis

"You did *what?*"

The asterisks get stripped from the spoken text but the emphasised word lands with audible stress. Works in every voice we tried. The cheapest, highest-hit-rate marker.

2. Ellipsis for pause-with-mood

"Fine... you can come in."

Three dots produce a real pause with a tonal drop: the voice equivalent of a sigh, without trying to fake [sigh]. Five dots give a longer pause. The model interprets them as prosodic cues.

3. SSML `<break>` for hard pauses

<speak>
  She paused. <break time="0.4s"/> "Fine, you can come in."
</speak>

Inworld accepts a useful subset of SSML, and <break> is the one that matters most for expressive speech. 0.2s for a beat, 0.4s for a sigh-pause, 0.8s for a beat-before-a-line-delivery moment. Wrap the whole text in <speak> and the parser handles it.
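
Building that markup by hand gets error-prone once the text contains `&` or `<`. A minimal sketch using only the standard library: the three durations are the ones listed above, while the `PAUSES` labels and the `ssml_with_pauses` helper are hypothetical names for illustration.

```python
from xml.sax.saxutils import escape

# Durations from the post; the keys are labels invented for this sketch.
PAUSES = {"beat": "0.2s", "sigh": "0.4s", "dramatic": "0.8s"}


def ssml_with_pauses(segments: list[str], pause: str = "sigh") -> str:
    """Join text segments with a <break> between them, wrapped in <speak>."""
    duration = PAUSES[pause]
    # Escape &, < and > so stray characters cannot break the SSML document.
    escaped = [escape(segment.strip()) for segment in segments if segment.strip()]
    joiner = f' <break time="{duration}"/> '
    return f"<speak>{joiner.join(escaped)}</speak>"


if __name__ == "__main__":
    print(ssml_with_pauses(["She paused.", '"Fine, you can come in."']))
    # <speak>She paused. <break time="0.4s"/> "Fine, you can come in."</speak>
```

Escaping before inserting the `<break>` elements matters: doing it afterwards would escape the break tags themselves.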

4. Onomatopoeia for laughs, moans, breath

"Mmm... ha-ha, you're right."
"ahh... I needed that."

The model will render ha-ha, mmm, ahh, oh, nnn as the actual sound, because they're spellings of sounds rather than meta-tags. They sound far more natural than a synthesised [laugh] even when one exists.

For emotional/intimate scenes, rhythmic repeats (ah... ah... ah) carry actual prosody. We use this for breath patterns where another TTS would want a [breathe] marker.
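
If upstream prompts already emit bracket tags, one option is to translate them into spelled-out sounds instead of just deleting them. This is a sketch of that idea, not the preprocessor described in this post: the sound spellings come from the examples above, but which tag maps to which sound is a guess, and `TAG_TO_SOUND` and `tags_to_onomatopoeia` are hypothetical names.

```python
import re

# Hypothetical mapping: spellings come from the post, the pairing is a guess.
TAG_TO_SOUND = {
    "laugh": "ha-ha",
    "sigh": "ahh...",
    "breathe": "ah... ah... ah",
}

_BRACKET_TAG = re.compile(r"\[(\w+)\]")


def tags_to_onomatopoeia(text: str) -> str:
    """Replace known [tag] markers with a spelled-out sound; drop unknown ones."""
    replaced = _BRACKET_TAG.sub(
        lambda match: TAG_TO_SOUND.get(match.group(1).lower(), ""), text
    )
    # Collapse the doubled spaces left behind by removed tags.
    return re.sub(r" {2,}", " ", replaced).strip()
```

Unknown tags are dropped rather than passed through, so nothing bracketed reaches the synthesizer. Note that this also removes any legitimate single-word text in square brackets, which may or may not be acceptable for your content.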

The wrapper that ties it together

In core/voice.py we run every chunk through enrich_for_tts() (line ~772) before handing it to Inworld. Regex-based, language-aware, idempotent:

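def enrich_for_tts(text: str, lang: str = "en") -> tuple[str, dict]:
    """Return (preprocessed_text, request_params).
    Strips fake paralinguistic tags, adds SSML breaks where appropriate,
    and bumps temperature/speakingRate for high-emotion scenes."""
    # _STRIP_FAKE_TAGS, _ELLIPSIS_TO_BREAK and _detect_mood_params are
    # module-level helpers defined elsewhere in core/voice.py.
    # Drop tags the model ignores or reads aloud ([laugh], (sighs), ...).
    text = _STRIP_FAKE_TAGS.sub("", text)
    # Turn ellipses into explicit SSML pauses.
    text = _ELLIPSIS_TO_BREAK.sub(r'<break time="0.3s"/>', text)
    # Wrap only once, so a second pass leaves the text unchanged (idempotent).
    if "<break" in text and not text.lstrip().startswith("<speak>"):
        text = f"<speak>{text}</speak>"
    params = _detect_mood_params(text, lang)
    return text, params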

The mood detector looks for emotional cues (intensity words, repeated punctuation, onomatopoeia density) and bumps temperature and speakingRate for the more expressive scenes. Same model, same voice, much more dynamic output, all without any inline tag that the model would have ignored.
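
To make the cue-counting idea concrete, here is a rough sketch of what such a detector could look like. It is not the code in core/voice.py: the word list, the regexes, the threshold, and every numeric value are placeholders invented for illustration, and the valid ranges for `temperature` and `speakingRate` should be taken from the provider's documentation.

```python
import re

# Illustrative vocabulary and patterns; none of these values come from the post.
_INTENSITY_WORDS = {"never", "always", "please", "love", "hate"}
_ONOMATOPOEIA = re.compile(r"\b(?:ha-ha|mmm+|ahh+|oh+|nnn+)\b", re.IGNORECASE)
_REPEATED_PUNCT = re.compile(r"[!?]{2,}|\.{3,}")


def detect_mood_params(text: str, lang: str = "en") -> dict:
    """Count emotional cues in `text` and return request parameters.

    The intensity word list is English-only, so it is skipped for other
    languages; the punctuation and onomatopoeia cues apply to all of them.
    """
    score = len(_ONOMATOPOEIA.findall(text)) + len(_REPEATED_PUNCT.findall(text))
    if lang == "en":
        words = re.findall(r"[a-z']+", text.lower())
        score += sum(word in _INTENSITY_WORDS for word in words)
    # Baseline and bump sizes are placeholders; check the provider's ranges.
    if score >= 2:
        return {"temperature": 1.2, "speakingRate": 1.1}
    return {"temperature": 1.0, "speakingRate": 1.0}
```

This sketch scores the raw text. If ellipses have already been converted to `<break>` elements by the time the detector runs, the repeated-punctuation cue will no longer see them, so the order of the two steps affects the result.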

Lessons

Don't assume [laugh]/[sigh] is universal. It isn't. Check the provider's docs and probe.
Probe with side-by-side audio, not just visual diffs. A [sigh] that emits silence looks identical to one that emits a sigh in any log.
Use what the API actually exposes. For Inworld that's temperature, speakingRate, and a useful subset of SSML — not inline tags.
Onomatopoeia beats meta-tags for emotional sounds. "ahh..." is a thing the model can read; [sigh] is a meta-instruction it can't.
Strip the fake tags out of your prompt before sending. Otherwise they leak as text on some voices.
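
The last lesson can be sketched as a single verbose regex covering the tag styles from the comparison table. The pattern and the `strip_fake_tags` name are assumptions made for this example, and the tag vocabulary is limited to the three cues named in this post; a real prompt pipeline would likely need a longer list.

```python
import re

# Hypothetical pattern covering the tag styles that failed in the comparison.
# Asterisk-wrapped words are left alone because asterisks carry emphasis.
_FAKE_TAGS = re.compile(
    r"""
    \[(?:laugh|sigh|breathe)s?\]          # [laugh], [sigh], [breathe]
    | \((?:laugh|sigh|breathe)s?\)        # (laughs), (sighs)
    | <(?:laugh|sigh|breathe)\s*/>        # <laugh/>, <sigh/>
    | <emotion>[^<]*</emotion>            # <emotion>laugh</emotion>
    """,
    re.IGNORECASE | re.VERBOSE,
)


def strip_fake_tags(text: str) -> str:
    """Remove unsupported paralinguistic tags and tidy the leftover spacing."""
    without_tags = _FAKE_TAGS.sub("", text)
    return re.sub(r" {2,}", " ", without_tags).strip()
```

Running it on the opening example, `She paused. [sigh] "Fine, you can come in."`, yields `She paused. "Fine, you can come in."`, while `You did *what?*` passes through untouched.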

The audio quality jump from these four patterns is meaningful — users notice. The cost is a 30-line preprocessor and the courage to delete every [laugh] your team has been sprinkling for months.

This is from production work at HoneyChat — Telegram-native AI companion where voice messages are a first-class output. Canonical version: honeychat.bot/en/blog/inworld-tts-paralinguistic-tags-alternatives.

— HoneyChat Engineering

Sources

Inworld TTS — documentation — supported request parameters (temperature, speakingRate), SSML subset, voice design API.
W3C — Speech Synthesis Markup Language (SSML) 1.1 — full SSML spec; <break>, <speak>, prosody elements.
TTS Arena (Hugging Face) — community ELO ranking; Inworld TTS-1.5 Max top-position context.
gTTS — Python library — the free fallback we use when Inworld is unavailable.
HoneyChat engineering notes: LLM prompt caching measured · LLM refusal rescue chain.

Top comments (1)

Andreas Assad Kottner • Jun 1

Hey, I'm Andreas from the Inworld product team, thanks for the detailed writeup and for actually putting our tags through their paces with TTS-1.5 Max!

To add some context: the non-verbal tags you tested were exposed in TTS-1.5 as an experimental feature specifically to help us collect initial feedback from developers. Reliability was never guaranteed at that stage, and posts like yours are a big part of how we identify where the gaps are.

We've since invested heavily in improving tag reliability for our next model, Realtime TTS-2, which is currently available in research preview. Our internal metrics show meaningful improvements in how consistently these tags are rendered, and we expect reliability to keep improving as we move toward general availability.

We'd love for you to continue trying this out and giving us feedback! We love it!