I recently added text-to-speech (TTS) capabilities to OpenBench AI, my local-first Tauri desktop chat app. Rather than relying on a single TTS service, I built a dual-engine system that lets users choose between browser-native speech synthesis or a high-quality on-device neural model. Here's how I did it.
The Goal
Users should be able to click a speaker icon on any assistant message and hear it read aloud. They should have options: lightweight and built-in, or more natural-sounding but requiring a larger model download. All processing should happen locally, with no API calls and no privacy concerns.
Architecture: Dual Engines
I implemented two TTS engines with a simple interface:
Supertonic (ST-TTS) — On-device neural TTS. Produces natural, high-quality audio. Requires downloading a ~100MB ONNX model on first use.
Browser SpeechSynthesis — The Web Speech API. No downloads, instant playback, slightly less natural, voice availability varies by OS.
Both engines are exposed via a speaker icon on each message and configured in Settings > Speech.
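One way to picture that "simple interface" is a small type that both engines implement, plus a lookup keyed by the engine setting. This is only a sketch: `TtsEngine`, `pickEngine`, and `makeFakeEngine` are hypothetical names for illustration, not code from OpenBench AI.

```typescript
// Hypothetical shared interface; the engine ids match the settings values in this post.
type EngineId = "browser" | "stTts";

interface TtsEngine {
  id: EngineId;
  // Resolves once playback has been started (not necessarily finished).
  speak(text: string): Promise<void>;
  stop(): void;
}

// Picks the engine that matches the user's setting.
function pickEngine(
  engines: Record<EngineId, TtsEngine>,
  selected: EngineId,
): TtsEngine {
  return engines[selected];
}

// Stand-in engine for demonstration: records calls instead of producing audio.
function makeFakeEngine(id: EngineId, log: string[]): TtsEngine {
  return {
    id,
    speak: async (text) => {
      log.push(`${id}:speak:${text}`);
    },
    stop: () => {
      log.push(`${id}:stop`);
    },
  };
}

const engineLog: string[] = [];
const engines: Record<EngineId, TtsEngine> = {
  browser: makeFakeEngine("browser", engineLog),
  stTts: makeFakeEngine("stTts", engineLog),
};
```

With a shape like this, the speaker icon only ever calls `speak` and `stop`; which engine answers is decided by the setting.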
The Stack
Backend: tauri-plugin-supertonic
A Rust Tauri 2 plugin (tauri-plugin-supertonic@0.1) wrapping supertonic-core and ONNX Runtime.
Why a plugin? No custom Rust commands needed. The plugin handles everything at the Rust layer: model download, voice selection, and WAV synthesis. The frontend only speaks through the JavaScript API.
Frontend: tauri-plugin-supertonic-api
An npm package (tauri-plugin-supertonic-api@0.1) exposing a clean JavaScript interface:
loadModel(): Promise<void>
synthesize(text: string, language: string): Promise<string> // Returns base64 WAV
listVoices(): Promise<Voice[]>
selectVoice(voiceId: string): Promise<void>
State Management: Two Zustand Stores
settingsStore.ts — Holds TTS configuration:
type TtsSettings = {
  engine: "browser" | "stTts"
  browser: {
    voice: string
    speed: number // 0.5–2.0
    pitch: number // 0–2.0
  }
  stTts: {
    voice: string
    speed: number
  }
}
Settings persist to localStorage with version migration — if you ever add a new setting, users don't break.
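A migration step along these lines might look like the sketch below. It assumes the stored object may predate the `tts` key and fills in defaults; `DEFAULT_TTS`, `isRecord`, and `migrateSettings` are hypothetical names, and the default values are guesses rather than the app's real ones.

```typescript
type TtsSettings = {
  engine: "browser" | "stTts";
  browser: { voice: string; speed: number; pitch: number };
  stTts: { voice: string; speed: number };
};

// Assumed defaults; the real app's values may differ.
const DEFAULT_TTS: TtsSettings = {
  engine: "browser",
  browser: { voice: "", speed: 1, pitch: 1 },
  stTts: { voice: "", speed: 1 },
};

type PersistedSettings = { tts: TtsSettings } & Record<string, unknown>;

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

// Upgrades whatever an older version stored to the current shape.
// Unknown keys are kept as-is; missing TTS fields fall back to defaults.
// Stored values are trusted here; a stricter version would validate them.
function migrateSettings(persisted: unknown): PersistedSettings {
  const old = isRecord(persisted) ? persisted : {};
  const rawTts = old.tts;
  const oldTts: Partial<TtsSettings> = isRecord(rawTts)
    ? (rawTts as Partial<TtsSettings>)
    : {};
  return {
    ...old,
    tts: {
      engine: oldTts.engine ?? DEFAULT_TTS.engine,
      browser: { ...DEFAULT_TTS.browser, ...oldTts.browser },
      stTts: { ...DEFAULT_TTS.stTts, ...oldTts.stTts },
    },
  };
}
```

If the store uses zustand's `persist` middleware, a function like this can be passed as its `migrate` option alongside a bumped `version`.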
ttsStore.ts — Manages playback state:
type TtsState = {
  activeMessageId: string | null
  isLoading: boolean
  currentAudio: HTMLAudioElement | null
  play: (messageId: string, text: string) => Promise<void>
  stop: () => void
}
Per-message tracking means each message bubble can independently show play/stop/loading state.
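To make the per-message idea concrete, here is a minimal sketch of how a message bubble could derive its own button state from the shared store. `PlaybackSnapshot`, `SpeakerButtonState`, and `speakerButtonState` are hypothetical names, not the app's actual code.

```typescript
// Only the two fields of the TTS store that the button needs.
type PlaybackSnapshot = {
  activeMessageId: string | null;
  isLoading: boolean;
};

type SpeakerButtonState = "idle" | "loading" | "playing";

// Every bubble calls this with its own id. Only the bubble whose id matches
// activeMessageId shows loading or playing; all others stay idle.
function speakerButtonState(
  snapshot: PlaybackSnapshot,
  messageId: string,
): SpeakerButtonState {
  if (snapshot.activeMessageId !== messageId) return "idle";
  return snapshot.isLoading ? "loading" : "playing";
}
```

In a component this would typically sit inside a store selector, so a bubble re-renders only when its own derived state changes.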
How It Works End-to-End
1. User Clicks Speaker Icon
Calls ttsStore.play(messageId, text).
2. Text Cleaning
A cleanTextForSpeech() function strips markdown, HTML, code blocks, and math notation. You don't want the TTS engine trying to pronounce **bold** or $\LaTeX{}$.
// Strip markdown bold/italic
text = text.replace(/\*\*(.+?)\*\*/g, "$1")
// Strip code blocks and inline code
text = text.replace(/```[\s\S]*?```/g, "")
text = text.replace(/`(.+?)`/g, "$1")
// Strip HTML tags
text = text.replace(/<[^>]*>/g, "")
// etc.
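Pulled together, a fuller version of the cleaner might look like the sketch below. It is an approximation under assumptions: the exact rules in OpenBench AI are not shown in this post, and the link, list, heading, and math patterns here are my own additions. Order matters: fenced code is removed first so its contents are never treated as markdown.

```typescript
function cleanTextForSpeech(input: string): string {
  return input
    .replace(/`{3}[\s\S]*?`{3}/g, " ") // fenced code blocks: drop entirely
    .replace(/`([^`]+)`/g, "$1") // inline code: keep the text
    .replace(/\$\$[\s\S]*?\$\$/g, " ") // display math
    // Inline math. Known limitation: also eats text between two prices ("$5 and $10").
    .replace(/\$[^$\n]+\$/g, " ")
    // HTML tags. Known limitation: "a < b and c > d" is treated as a tag.
    .replace(/<[^>]*>/g, "")
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, "$1") // links and images: keep the label
    .replace(/^\s*[-*+]\s+/gm, "") // bullet markers at line start
    .replace(/^#{1,6}\s+/gm, "") // heading markers
    .replace(/(\*\*|__)(.+?)\1/g, "$2") // bold
    .replace(/\*([^*\n]+)\*/g, "$1") // italic (asterisk form only)
    .replace(/\s+/g, " ") // collapse whitespace left behind
    .trim();
}
```

Underscore italics are deliberately left alone so identifiers like `snake_case` survive.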
3. Engine Dispatch
Browser SpeechSynthesis
- Split text into sentences (simple regex split on ., !, or ?).
- For each sentence, create a SpeechSynthesisUtterance with the configured voice, rate, and pitch.
- Queue them with window.speechSynthesis.speak(); the browser queues them automatically.
- Stop via window.speechSynthesis.cancel().
Pros: instant, no downloads, works everywhere. Cons: less natural, voice quality varies by OS.
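The browser path above can be sketched as follows. To keep the example runnable outside a webview, the synthesizer and the utterance factory are passed in; `UtteranceLike`, `SynthLike`, `splitSentences`, and `speakSentences` are hypothetical names, and voice selection is left out for brevity.

```typescript
// Minimal structural stand-ins for SpeechSynthesisUtterance and window.speechSynthesis.
interface UtteranceLike {
  rate: number;
  pitch: number;
}
interface SynthLike<U> {
  speak(utterance: U): void;
  cancel(): void;
}

// Naive splitter: breaks after ., !, or ? when followed by whitespace.
// Abbreviations such as "Dr. Smith" will be split too.
function splitSentences(text: string): string[] {
  return text
    .replace(/([.!?])\s+/g, "$1\n")
    .split("\n")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Queues one utterance per sentence and returns how many were queued.
function speakSentences<U extends UtteranceLike>(
  text: string,
  opts: { rate: number; pitch: number },
  synth: SynthLike<U>,
  makeUtterance: (sentence: string) => U,
): number {
  synth.cancel(); // drop anything still queued from a previous message
  const sentences = splitSentences(text);
  for (const sentence of sentences) {
    const utterance = makeUtterance(sentence);
    utterance.rate = opts.rate;
    utterance.pitch = opts.pitch;
    synth.speak(utterance); // speak() appends to the synthesizer's queue
  }
  return sentences.length;
}
```

In the app, the last two arguments would be `window.speechSynthesis` and `(s) => new SpeechSynthesisUtterance(s)`.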
Supertonic (ST-TTS)
- Check if the model is loaded. If not, show a "Downloading TTS model (~100MB)" toast and call loadModel().
- Once ready, call synthesize(text, "en") from the supertonic API.
- Get back a WAV file as base64.
- Play via new Audio("data:audio/wav;base64,...").play().
Pros: high-quality, consistent across platforms, on-device. Cons: ~100MB download, slower synthesis.
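// Module-level flag: true once the supertonic model has been loaded
let modelLoaded = false

const play = async (messageId: string, text: string) => {
  set({ activeMessageId: messageId, isLoading: true })
  const engine = settingsStore.getState().settings.tts.engine
  const cleanText = cleanTextForSpeech(text)
  try {
    if (engine === "browser") {
      playBrowserTts(cleanText)
      // Browser playback starts immediately, so clear the loading state here too
      set({ isLoading: false })
    } else {
      // Ensure model is loaded (downloads on first use)
      if (!modelLoaded) {
        showToast("Downloading TTS model (~100MB)...")
        await window.supertonic.loadModel()
        modelLoaded = true
      }
      const wavBase64 = await window.supertonic.synthesize(cleanText, "en")
      const audio = new Audio(`data:audio/wav;base64,${wavBase64}`)
      // play() returns a promise; awaiting it routes playback errors to the catch below
      await audio.play()
      set({ currentAudio: audio, isLoading: false })
    }
  } catch (error) {
    showToast("TTS failed")
    // Reset so the message bubble does not stay stuck in a loading or playing state
    set({ activeMessageId: null, isLoading: false })
  }
}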
4. Stopping Playback
The stop function pauses the current HTMLAudioElement or cancels window.speechSynthesis.
Also, when the user switches conversations, auto-stop fires — no audio bleeding into the next chat.
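A stop routine along these lines could cover both engines at once. This is a sketch with hypothetical names (`PausableAudio`, `CancellableSynth`, `stopPlayback`, `onConversationChange`); the audio element and synthesizer are passed in so the logic can be exercised without a webview.

```typescript
// Structural stand-ins for HTMLAudioElement and window.speechSynthesis.
interface PausableAudio {
  pause(): void;
  currentTime: number;
}
interface CancellableSynth {
  cancel(): void;
}

type PlaybackState = {
  activeMessageId: string | null;
  isLoading: boolean;
  currentAudio: PausableAudio | null;
};

// Stops whichever engine is active and returns the reset state.
// Safe to call when nothing is playing.
function stopPlayback(
  state: PlaybackState,
  synth: CancellableSynth,
): PlaybackState {
  if (state.currentAudio) {
    state.currentAudio.pause();
    state.currentAudio.currentTime = 0; // rewind so a replay starts from the top
  }
  synth.cancel(); // clears queued utterances; harmless if the queue is empty
  return { activeMessageId: null, isLoading: false, currentAudio: null };
}

// Auto-stop: call whenever the selected conversation id may have changed.
function onConversationChange(
  prevId: string | null,
  nextId: string | null,
  stop: () => void,
): void {
  if (prevId !== nextId) stop();
}
```

Calling both `pause()` and `cancel()` unconditionally means the stop path never needs to know which engine started playback.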
Settings UI (SpeechTab.tsx)
The Settings modal has a Speech tab with two sections:
Engine Selector
○ Browser (lightweight, instant)
◉ On-device (natural, ~100MB download)
If Browser Selected
- Voice dropdown (sourced from speechSynthesis.getVoices())
- Speed slider (0.5–2.0)
- Pitch slider (0–2.0)
- Test Voice button
If ST-TTS Selected
- Load Model button (shows progress; disabled if model is loading or loaded)
- Voice style selector (sourced from listVoices())
- Speed slider (if supported)
- Test Voice button
Both sections show a live playback preview so users hear the difference before committing.
Key Design Decisions
No Custom Rust Commands
The plugin abstraction meant I could avoid writing custom Tauri command handlers. The plugin exposes everything through a clean npm package. If I ever need to add a new TTS feature (e.g., model switching, streaming), I extend the plugin and its npm API; the frontend changes only where it calls the new function.
Web-Standard Audio Playback
I didn't reach for a Tauri audio plugin. HTMLAudioElement handles the WAV file from supertonic perfectly. window.speechSynthesis is built-in. Both are reliable and require zero extra dependencies.
Per-Message Tracking
The activeMessageId in the TTS store lets each message bubble track its own playback state independently. One message can be playing while another shows a "Play" button. This feels more natural than a global play/stop.
Lazy Model Loading
The ~100MB supertonic model only downloads when the user first clicks play with ST-TTS selected, not on app startup. A toast tells them what's happening. Once loaded, subsequent plays skip the download and go straight to synthesis. This keeps the app lightweight until the feature is actually needed.
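One refinement worth considering is to memoize the load promise rather than a boolean flag, so two quick clicks during the first download do not trigger two downloads. The sketch below is an assumption about how that could be done, not the app's current code; `createLazyLoader` is a hypothetical helper.

```typescript
// Wraps a loader so the expensive load runs at most once, even if several
// play clicks arrive before the first download finishes.
function createLazyLoader(
  load: () => Promise<void>,
  onFirstLoad?: () => void,
): () => Promise<void> {
  let pending: Promise<void> | null = null;
  return () => {
    if (pending === null) {
      onFirstLoad?.(); // e.g. show the "Downloading TTS model" toast
      pending = load().catch((err) => {
        pending = null; // allow a retry after a failed download
        throw err;
      });
    }
    return pending;
  };
}
```

In the play handler, `await ensureModel()` would then replace the `modelLoaded` check, with `ensureModel` created once from the plugin's `loadModel` and a toast callback.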
What I'd Do Differently (or Next)
- Streaming synthesis — For long messages, synthesizing the entire text at once can feel slow. Streaming chunks as they're generated would improve perceived performance.
- Voice caching — Cache synthesized audio so replaying the same message is instant.
- Rate limiting — If ST-TTS synthesis is slow, queue requests or show a "Synthesizing..." state more explicitly.
- Cross-platform voice testing — Browser voice availability varies wildly (Windows, Mac, Linux). Testing coverage is key.
Wrapping Up
Dual-engine TTS isn't complicated, but it requires thoughtful state management and a clean architecture. By separating concerns (settings, playback, engine logic) and leaning on the plugin system, I ended up with a feature that's flexible, performant, and maintainable.
If you're building a Tauri app and want to add TTS, this pattern should translate directly. The key takeaway: abstract your engine behind a simple interface, lazy-load heavy resources, and let the user choose.
Feel free to check it out at: https://github.com/monolabsdev/openbench-ai