Theo Slater

Posted on May 25

How I Implemented Supertonic TTS into My Desktop App, OpenBench AI

#ai #programming #react #rust

I recently added text-to-speech (TTS) capabilities to OpenBench AI, my local-first Tauri desktop chat app. Rather than relying on a single TTS service, I built a dual-engine system that lets users choose between browser-native speech synthesis and a high-quality on-device neural model. Here's how I did it.

The Goal

Users should be able to click a speaker icon on any assistant message and hear it read aloud. They should have options: lightweight and built-in, or more natural-sounding but requiring a larger model download. All processing should happen locally, with no API calls and no privacy concerns.

Architecture: Dual Engines

I implemented two TTS engines with a simple interface:

Supertonic (ST-TTS) — On-device neural TTS. Produces natural, high-quality audio. Requires downloading a ~100MB ONNX model on first use.

Browser SpeechSynthesis — The Web Speech API. No downloads, instant playback, slightly less natural, voice availability varies by OS.

Both engines are exposed via a speaker icon on each message and configured in Settings > Speech.
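
To make the "simple interface" idea concrete, here is a minimal sketch of a shared engine contract. The names TtsEngine, pickEngine, and makeFakeEngine are hypothetical and only illustrate the shape; the actual OpenBench AI code may be organized differently.

```typescript
// Hypothetical shared contract; these names are illustrative, not taken from OpenBench AI.
type EngineId = "browser" | "stTts";

interface TtsEngine {
  id: EngineId;
  speak(text: string): Promise<void>;
  stop(): void;
}

// Look up the engine chosen in settings so callers never branch on engine type.
function pickEngine(engines: TtsEngine[], id: EngineId): TtsEngine {
  const engine = engines.find((e) => e.id === id);
  if (!engine) throw new Error(`No TTS engine registered for "${id}"`);
  return engine;
}

// Tiny in-memory engine used only to demonstrate dispatch.
function makeFakeEngine(id: EngineId, log: string[]): TtsEngine {
  return {
    id,
    speak: async (text) => {
      log.push(`${id}:speak:${text}`);
    },
    stop: () => {
      log.push(`${id}:stop`);
    },
  };
}
```

With a contract like this, the speaker icon only needs to call pickEngine(...).speak(text), and adding a third engine later would not touch the UI code.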

The Stack

Backend: tauri-plugin-supertonic

A Rust Tauri 2 plugin (tauri-plugin-supertonic@0.1) wrapping supertonic-core and ONNX Runtime.

Why a plugin? No custom Rust commands needed. The plugin handles everything at the Rust layer: model download, voice selection, and WAV synthesis. The frontend only speaks through the JavaScript API.

Frontend: tauri-plugin-supertonic-api

An npm package (tauri-plugin-supertonic-api@0.1) exposing a clean JavaScript interface:

loadModel(): Promise<void>
synthesize(text: string, language: string): Promise<string> // Returns base64 WAV
listVoices(): Promise<Voice[]>
selectVoice(voiceId: string): Promise<void>

State Management: Two Zustand Stores

settingsStore.ts — Holds TTS configuration:

type TtsSettings = {
  engine: "browser" | "stTts"
  browser: {
    voice: string
    speed: number // 0.5–2.0
    pitch: number // 0–2.0
  }
  stTts: {
    voice: string
    speed: number
  }
}

Settings persist to localStorage with version migration, so adding a new setting later doesn't break existing users' saved configuration.
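
As a rough sketch of what a version migration can look like: the function below merges whatever was saved over a set of defaults, so fields added in later versions always get a value. The defaults, the version numbers, and the "version 1 predates the on-device engine" rule are all assumptions for illustration. If the store uses Zustand's persist middleware, a function with this shape could be passed as its migrate option alongside version.

```typescript
type TtsSettings = {
  engine: "browser" | "stTts";
  browser: { voice: string; speed: number; pitch: number };
  stTts: { voice: string; speed: number };
};

// Hypothetical defaults; the real app may ship different values.
const DEFAULT_TTS: TtsSettings = {
  engine: "browser",
  browser: { voice: "", speed: 1, pitch: 1 },
  stTts: { voice: "", speed: 1 },
};

// Merge persisted values over the defaults, section by section,
// so a field that did not exist when the user last saved is filled in.
function migrateTts(persisted: unknown, fromVersion: number): TtsSettings {
  const old = (persisted ?? {}) as Partial<TtsSettings>;
  const merged: TtsSettings = {
    engine: old.engine ?? DEFAULT_TTS.engine,
    browser: { ...DEFAULT_TTS.browser, ...old.browser },
    stTts: { ...DEFAULT_TTS.stTts, ...old.stTts },
  };
  // Assumed rule for illustration: version 1 predates the on-device engine.
  if (fromVersion < 2) merged.engine = "browser";
  return merged;
}
```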

ttsStore.ts — Manages playback state:

type TtsState = {
  activeMessageId: string | null
  isLoading: boolean
  currentAudio: HTMLAudioElement | null
  play: (messageId: string, text: string) => Promise<void>
  stop: () => void
}

Per-message tracking means each message bubble can independently show play/stop/loading state.
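
One way a message bubble could derive its own icon from the store is a small pure selector like the one below. The names buttonStateFor and ButtonState are hypothetical; the point is only that each bubble compares its id against activeMessageId.

```typescript
type PlaybackView = { activeMessageId: string | null; isLoading: boolean };
type ButtonState = "idle" | "loading" | "playing";

// Each bubble passes its own message id; only the active one leaves "idle".
function buttonStateFor(state: PlaybackView, messageId: string): ButtonState {
  if (state.activeMessageId !== messageId) return "idle";
  return state.isLoading ? "loading" : "playing";
}
```

Because the store holds a single activeMessageId, at most one bubble is ever in the loading or playing state; every other bubble falls through to idle.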

How It Works End-to-End

1. User Clicks Speaker Icon

Calls ttsStore.play(messageId, text).

2. Text Cleaning

A cleanTextForSpeech() function strips markdown, HTML, code blocks, and math notation. You don't want the TTS engine trying to pronounce **bold** or $\LaTeX{}$.

// Strip markdown bold/italic
text = text.replace(/\*\*(.+?)\*\*/g, "$1")

// Strip code blocks and inline code
text = text.replace(/```[\s\S]*?```/g, "")
text = text.replace(/`(.+?)`/g, "$1")

// Strip HTML tags
text = text.replace(/<[^>]*>/g, "")

// etc.
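
Putting those pieces together, a complete cleaner might look roughly like this. It is a sketch, not the app's actual function: the math, link, and heading rules are assumptions about what "etc." covers, and the patterns are deliberately naive.

```typescript
// A sketch of a full cleaner; the rule set and ordering are assumptions.
function cleanTextForSpeech(input: string): string {
  let text = input;
  // Fenced code blocks go first so their contents are never read aloud.
  text = text.replace(/`{3}[\s\S]*?`{3}/g, " ");
  // Inline code: keep the inner text, drop the backticks.
  text = text.replace(/`([^`]+)`/g, "$1");
  // Display and inline math are dropped entirely.
  // Caveat: this also eats text between two plain dollar signs on one line.
  text = text.replace(/\$\$[\s\S]*?\$\$/g, " ");
  text = text.replace(/\$[^$\n]+\$/g, " ");
  // HTML tags.
  text = text.replace(/<[^>]*>/g, "");
  // Markdown links: keep the label, drop the URL.
  text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, "$1");
  // Bold first, then italic, so ** is not half-matched by the * rule.
  text = text.replace(/\*\*(.+?)\*\*/g, "$1");
  text = text.replace(/\*(.+?)\*/g, "$1");
  // Leading heading markers.
  text = text.replace(/^#{1,6}\s+/gm, "");
  // Collapse the whitespace left behind by the removals.
  return text.replace(/\s+/g, " ").trim();
}
```

Order matters here: code blocks and math are removed before the bold and italic rules run, so asterisks inside code never get treated as emphasis.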

3. Engine Dispatch

Browser SpeechSynthesis

Split text into sentences (simple regex split on . or ! or ?).
For each sentence, create a SpeechSynthesisUtterance with the configured voice, rate, and pitch.
Queue them with window.speechSynthesis.speak() — the browser queues them automatically.
Stop via window.speechSynthesis.cancel().

Pros: instant, no downloads, works everywhere.
Cons: less natural, voice quality varies by OS.
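
The browser path above can be sketched as follows. To keep the example runnable outside a browser, it uses small structural stand-ins (UtteranceLike, SynthLike) for SpeechSynthesisUtterance and window.speechSynthesis; those stand-ins and the function names are assumptions, and voice selection is left out for brevity.

```typescript
// Minimal stand-ins for the Web Speech API objects (assumed shapes for this sketch).
interface UtteranceLike { text: string; rate: number; pitch: number }
interface SynthLike { speak(u: UtteranceLike): void; cancel(): void }

// Naive splitter: also breaks on decimals ("3.14") and abbreviations ("e.g.").
function splitSentences(text: string): string[] {
  // Runs ending in ., ! or ?, plus any trailing text with no final punctuation.
  const parts = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Queue one utterance per sentence and return how many were queued.
function speakAll(
  text: string,
  synth: SynthLike,
  opts: { rate: number; pitch: number },
  makeUtterance: (sentence: string) => UtteranceLike
): number {
  synth.cancel(); // drop anything still queued from a previous message
  const sentences = splitSentences(text);
  for (const sentence of sentences) {
    const utterance = makeUtterance(sentence);
    utterance.rate = opts.rate;
    utterance.pitch = opts.pitch;
    synth.speak(utterance); // the engine plays queued utterances in order
  }
  return sentences.length;
}
```

In the app, the real window.speechSynthesis and a factory such as (s) => new SpeechSynthesisUtterance(s) would take the place of the stand-ins.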

Supertonic (ST-TTS)

Check if the model is loaded. If not, show a "Downloading TTS model (~100MB)" toast and call loadModel().
Once ready, call synthesize(text, "en") from the supertonic API.
Get back a WAV file as base64.
Play via new Audio("data:audio/wav;base64,...").play().

Pros: high-quality, consistent across platforms, on-device.
Cons: ~100MB download, slower synthesis.
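
Before handing the base64 string to an Audio element, a cheap sanity check can catch cases where something other than WAV data comes back. The helper below is hypothetical and assumes synthesize() resolves to raw base64 with no data-URL prefix, as described above.

```typescript
// Hypothetical helper; assumes the payload is raw base64 without a "data:" prefix.
function toWavDataUrl(base64: string): string {
  const trimmed = base64.trim();
  // WAV files begin with the ASCII bytes "RIFF", and any payload starting
  // with those bytes has "UklGR" as its first five base64 characters.
  if (!trimmed.startsWith("UklGR")) {
    throw new Error("Payload does not look like base64 WAV data");
  }
  return `data:audio/wav;base64,${trimmed}`;
}
```

The returned string can be passed directly to new Audio(...), which keeps the check and the URL construction in one place.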

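// Module-level flag: has the Supertonic model been loaded this session?
let modelLoaded = false

const play = async (messageId: string, text: string) => {
  set({ activeMessageId: messageId, isLoading: true })

  const engine = settingsStore.getState().settings.tts.engine
  const cleanText = cleanTextForSpeech(text)

  try {
    if (engine === "browser") {
      playBrowserTts(cleanText)
      // Browser speech starts right away, so clear the loading state
      set({ isLoading: false })
    } else {
      // Ensure model is loaded
      if (!modelLoaded) {
        showToast("Downloading TTS model (~100MB)...")
        await window.supertonic.loadModel()
        modelLoaded = true
      }

      const wavBase64 = await window.supertonic.synthesize(
        cleanText,
        "en"
      )
      const audio = new Audio(`data:audio/wav;base64,${wavBase64}`)
      // Return the button to idle once playback finishes
      audio.onended = () => set({ activeMessageId: null, currentAudio: null })
      set({ currentAudio: audio, isLoading: false })
      // Await play() so a rejected playback promise reaches the catch below
      await audio.play()
    }
  } catch (error) {
    showToast("TTS failed")
    // Reset everything so the message bubble does not stay stuck as active
    set({ activeMessageId: null, isLoading: false, currentAudio: null })
  }
}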

4. Stopping Playback

The stop function pauses the current HTMLAudioElement or cancels window.speechSynthesis.

Also, when the user switches conversations, auto-stop fires — no audio bleeding into the next chat.
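
A minimal sketch of that stop logic, with the audio element and speech synthesizer passed in as small stand-in interfaces so it runs anywhere. The names stopPlayback and onConversationChange are hypothetical, as is the choice to rewind the audio on stop.

```typescript
// Stand-ins for HTMLAudioElement and window.speechSynthesis (assumed shapes).
interface AudioLike { pause(): void; currentTime: number }
interface CancelableSynth { cancel(): void }

type PlaybackState = {
  activeMessageId: string | null;
  isLoading: boolean;
  currentAudio: AudioLike | null;
};

// Halts both engines, since either may be active, and returns the reset state.
function stopPlayback(state: PlaybackState, synth: CancelableSynth): PlaybackState {
  if (state.currentAudio) {
    state.currentAudio.pause();
    state.currentAudio.currentTime = 0; // rewind so a replay starts from the top
  }
  synth.cancel();
  return { activeMessageId: null, isLoading: false, currentAudio: null };
}

// Hypothetical hook point: call whenever the active conversation id changes.
function onConversationChange(
  prevId: string | null,
  nextId: string | null,
  stop: () => void
): void {
  if (prevId !== nextId) stop();
}
```

Calling cancel() on the synthesizer is harmless when nothing is queued, so the function does not need to know which engine was playing.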

Settings UI (SpeechTab.tsx)

The Settings modal has a Speech tab with two sections:

Engine Selector

○ Browser (lightweight, instant)
◉ On-device (natural, ~100MB download)

If Browser Selected

Voice dropdown (sourced from speechSynthesis.getVoices())
Speed slider (0.5–2.0)
Pitch slider (0–2.0)
Test Voice button

If ST-TTS Selected

Load Model button (shows progress; disabled if model is loading or loaded)
Voice style selector (sourced from listVoices())
Speed slider (if supported)
Test Voice button

Both sections show a live playback preview so users hear the difference before committing.
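
One detail worth handling in the browser voice dropdown: in some webviews, speechSynthesis.getVoices() returns an empty list until the voiceschanged event fires. The sketch below shows one way to cope with that, using stand-in interfaces so it runs outside a browser. The names watchVoices, VoiceLike, and VoiceSource are hypothetical.

```typescript
// Stand-ins for SpeechSynthesisVoice and window.speechSynthesis (assumed shapes).
interface VoiceLike { name: string; lang: string }
interface VoiceSource {
  getVoices(): VoiceLike[];
  addEventListener(type: "voiceschanged", listener: () => void): void;
  removeEventListener(type: "voiceschanged", listener: () => void): void;
}

// Reports voices as soon as any are available and again whenever the list changes.
// Returns an unsubscribe function, which fits a React effect cleanup.
function watchVoices(
  source: VoiceSource,
  onVoices: (voices: VoiceLike[]) => void
): () => void {
  const publish = () => {
    const voices = source.getVoices();
    if (voices.length > 0) onVoices(voices);
  };
  publish(); // some engines have the list ready immediately
  source.addEventListener("voiceschanged", publish);
  return () => source.removeEventListener("voiceschanged", publish);
}
```

In the Settings tab, the callback would update the dropdown options, and the returned function would run when the tab unmounts.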

Key Design Decisions

No Custom Rust Commands

The plugin abstraction meant I could avoid writing custom Tauri command handlers. The plugin exposes everything through a clean npm package. If I ever need to add a new TTS feature (e.g., model switching, streaming), I just extend the plugin — the frontend doesn't need to change.

Web-Standard Audio Playback

I didn't reach for a Tauri audio plugin. HTMLAudioElement handles the WAV file from supertonic perfectly. window.speechSynthesis is built-in. Both are reliable and require zero extra dependencies.

Per-Message Tracking

The activeMessageId in the TTS store lets each message bubble track its own playback state independently. One message can be playing while another shows a "Play" button. This feels more natural than a global play/stop.

Lazy Model Loading

The ~100MB supertonic model only downloads when the user first clicks play with ST-TTS selected, not on app startup. A toast tells them what's happening. Once the model is loaded, subsequent plays skip the download and go straight to synthesis. This keeps the app lightweight until the feature is actually needed.
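
A boolean flag like modelLoaded works for a single click, but two quick clicks before the download finishes could both trigger a load. One common alternative, sketched here under the assumption that loadModel() returns a promise, is to memoize the in-flight promise. The name createLazyLoader is hypothetical.

```typescript
// Wraps a loader so that concurrent callers share a single in-flight promise.
function createLazyLoader(load: () => Promise<void>): () => Promise<void> {
  let inFlight: Promise<void> | null = null;
  return () => {
    if (!inFlight) {
      inFlight = load().catch((err) => {
        inFlight = null; // allow a retry after a failed download
        throw err;
      });
    }
    return inFlight;
  };
}
```

Usage would look like const ensureModel = createLazyLoader(() => loadModel()), followed by await ensureModel() before each synthesize call. After the first success, the resolved promise is reused, so later calls return without starting another download.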

What I'd Do Differently (or Next)

Streaming synthesis — For long messages, synthesizing the entire text at once can feel slow. Streaming chunks as they're generated would improve perceived performance.
Voice caching — Cache synthesized audio so replaying the same message is instant.
Rate limiting — If ST-TTS synthesis is slow, queue requests or show a "Synthesizing..." state more explicitly.
Cross-platform voice testing — Browser voice availability varies wildly (Windows, Mac, Linux). Testing coverage is key.

Wrapping Up

Dual-engine TTS isn't complicated, but it requires thoughtful state management and a clean architecture. By separating concerns (settings, playback, engine logic) and leaning on the plugin system, I ended up with a feature that's flexible, performant, and maintainable.

If you're building a Tauri app and want to add TTS, this pattern should translate directly. The key takeaway: abstract your engine behind a simple interface, lazy-load heavy resources, and let the user choose.

Feel free to check it out at: https://github.com/monolabsdev/openbench-ai
