monkeymore studio

Building a Browser-Based Text-to-Speech System with Piper TTS

Introduction

In this article, we'll explore how to implement a pure frontend Text-to-Speech (TTS) system that runs entirely in the browser using Piper TTS. Unlike traditional TTS solutions that require server-side processing, this approach leverages WebAssembly and modern browser APIs to generate high-quality speech synthesis without ever sending your data to a remote server.

Why Browser-Based TTS?

The Privacy Imperative

Traditional TTS services typically work by sending your text to remote servers for processing. This creates several critical concerns:

  • Data Privacy: Sensitive information (medical records, legal documents, personal messages) must leave the user's device
  • Network Dependency: Requires stable internet connection
  • Latency: Network round-trips introduce delays
  • Cost: Server infrastructure and API calls incur ongoing expenses
  • Offline Limitations: No TTS capability without internet connectivity

By implementing TTS directly in the browser, we avoid most of these issues. Synthesis runs locally on the user's device, which gives us:

✓ No text sent to external servers for synthesis
✓ No network round-trip per request (generation speed depends on the device)
✓ Offline use once the library and voice model have been downloaded
✓ Privacy by design - your text never leaves your browser

The Technical Foundation

Our implementation relies on the @realtimex/piper-tts-web library, which packages the Piper TTS engine as a WebAssembly module. Piper is a fast, local neural text-to-speech system that produces natural-sounding speech.

Architecture Overview

The system consists of three main components:

  1. Voice Management: Configuration and selection of available voices
  2. Audio Generation: Converting text to audio using Piper WASM
  3. Playback Control: Managing audio playback, pausing, and downloading

Core Data Structures

Voice Configuration

The system supports multiple languages and voices through a well-defined interface:

interface PiperVoice {
  id: string;    // Unique identifier (e.g., "en_US-lessac-medium")
  name: string;  // Display name (e.g., "English (US) - Lessac")
  lang: string;  // Language code (e.g., "en-US")
}

Available Voices

// NOTE: check each ID against the voice catalog shipped with your library version
const PIPER_VOICES: PiperVoice[] = [
  { id: "en_US-lessac-medium", name: "English (US) - Lessac", lang: "en-US" },
  { id: "en_GB-alan-medium", name: "English (UK) - Alan", lang: "en-GB" },
  { id: "zh_CN-huayan-medium", name: "中文 - 华燕", lang: "zh-CN" },
  { id: "ja_JP-amy-medium", name: "日本語 - Amy", lang: "ja-JP" },
  { id: "ko_KR-amy-medium", name: "한국어 - Amy", lang: "ko-KR" },
  { id: "es_ES-davefx-medium", name: "Español - Dave", lang: "es-ES" },
  { id: "fr_FR-siwis-medium", name: "Français - Siwis", lang: "fr-FR" },
  { id: "de_DE-thorsten-medium", name: "Deutsch - Thorsten", lang: "de-DE" },
  { id: "pt_BR-edresson-low", name: "Português - Edresson", lang: "pt-BR" },
  { id: "ru_RU-denis-medium", name: "Русский - Denis", lang: "ru-RU" },
];
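
The handlers later in this article repeat the fallback `voice || 'en_US-lessac-medium'`. One way to centralize that is a small lookup helper that also guards against IDs missing from the list. This is only a sketch: `resolveVoiceId` and `DEFAULT_VOICE_ID` are hypothetical names introduced for illustration, not part of the library, and the voice list is shortened.

```typescript
interface PiperVoice {
  id: string;
  name: string;
  lang: string;
}

// Hypothetical constant: the same default the handlers in this article use.
const DEFAULT_VOICE_ID = "en_US-lessac-medium";

// Shortened copy of the voice list, for illustration only.
const SAMPLE_VOICES: PiperVoice[] = [
  { id: "en_US-lessac-medium", name: "English (US) - Lessac", lang: "en-US" },
  { id: "fr_FR-siwis-medium", name: "Français - Siwis", lang: "fr-FR" },
];

// Return the requested ID if it is in the list, otherwise the fallback.
function resolveVoiceId(
  requested: string,
  voices: PiperVoice[],
  fallback: string = DEFAULT_VOICE_ID
): string {
  return voices.some((v) => v.id === requested) ? requested : fallback;
}
```

With this in place, a call such as `generateAudio(text, resolveVoiceId(voice, PIPER_VOICES))` would cover both the empty selection and a stale ID.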

Implementation Deep Dive

Step 1: Dynamic Library Loading

To optimize initial page load, we dynamically import the Piper library only when needed:

const generateAudio = async (text: string, voiceId: string) => {
  // Dynamic import - only loads when user clicks generate
  const { TtsSession } = await import('@realtimex/piper-tts-web');

  // Create a session with the selected voice
  const session = await TtsSession.create({ voiceId: voiceId as any });

  // Generate audio blob from text
  const audioBlob = await session.predict(text);

  // Create a blob URL for the audio
  return URL.createObjectURL(audioBlob);
};

Why dynamic import?

  • Reduces initial bundle size
  • Voice models (10-50MB each) are downloaded on-demand
  • Faster first paint and Time-to-Interactive
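
Note that `generateAudio` creates a new session on every call. If session creation proves expensive in practice, one option is to cache the pending session per voice ID. The sketch below is generic over the session type so it does not depend on the library's typings; `createSessionCache` is a hypothetical helper, and whether the library already reuses sessions internally is not something this article establishes.

```typescript
// Cache one pending session per voice ID so repeated requests share the work.
function createSessionCache<S>(create: (voiceId: string) => Promise<S>) {
  const sessions = new Map<string, Promise<S>>();

  return (voiceId: string): Promise<S> => {
    let pending = sessions.get(voiceId);
    if (!pending) {
      pending = create(voiceId).catch((err: unknown) => {
        // Forget failed attempts so the next call can retry.
        sessions.delete(voiceId);
        throw err;
      });
      sessions.set(voiceId, pending);
    }
    return pending;
  };
}

// Possible wiring inside generateAudio (assumes the import shown in Step 1):
// const getSession = createSessionCache((id) => TtsSession.create({ voiceId: id as any }));
// const session = await getSession(voiceId);
```

Storing the promise rather than the resolved session means two clicks in quick succession share a single creation call instead of starting two.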

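Step 2: Audio Generation Flow

The generation path follows the function from Step 1: user text → dynamic import of the library → TtsSession.create() for the selected voice → session.predict(text) → audio Blob → blob URL → Audio element for playback, or an anchor element for download.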

Step 3: State Management

The component manages several key states:

const TTSClient = ({ lang }: { lang: string }) => {
  // Input state
  const [text, setText] = useState("");

  // Voice configuration
  const [voice, setVoice] = useState("");
  const [rate, setRate] = useState(1);    // Playback speed
  const [pitch, setPitch] = useState(1);  // Voice pitch

  // Generation state
  const [isGenerating, setIsGenerating] = useState(false);
  const [generatedAudioUrl, setGeneratedAudioUrl] = useState<string | null>(null);

  // Playback state
  const [isSpeaking, setIsSpeaking] = useState(false);
  const [isPaused, setIsPaused] = useState(false);
  const [audioDuration, setAudioDuration] = useState<number | null>(null);

  // Refs for audio control
  const audioRef = useRef<HTMLAudioElement | null>(null);
  const generatedAudioRef = useRef<HTMLAudioElement | null>(null);

  // ... implementation
};

Step 4: Handling Audio Playback

The system supports both preview and download workflows:

const handleSpeak = async () => {
  if (!text.trim()) return;

  // Toggle pause/play for existing audio
  if (isSpeaking && !isPaused && audioRef.current) {
    audioRef.current.pause();
    setIsPaused(true);
  } else if (isPaused && audioRef.current) {
    audioRef.current.play();
    setIsPaused(false);
  } else {
    // Generate new audio
    try {
      setIsSpeaking(true);
      setIsPaused(false);

      const audioUrl = await generateAudio(text, voice || 'en_US-lessac-medium');

      const audio = new Audio(audioUrl);
      audioRef.current = audio;

      // Apply playback rate (speed adjustment)
      audio.playbackRate = rate;

      // Handle playback completion
      audio.onended = () => {
        setIsSpeaking(false);
        setIsPaused(false);
      };

      await audio.play();
    } catch (error) {
      console.error('Speak error:', error);
      setIsSpeaking(false);
    }
  }
};
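
The three branches in `handleSpeak` form a small state machine over `isSpeaking`, `isPaused`, and whether an audio element exists. Pulling that decision into a pure function makes it easy to test without a browser. The following is a sketch under that reading; `PlaybackState`, `toPlaybackState`, and `nextSpeakAction` are hypothetical names, not part of the original component.

```typescript
type PlaybackState = "idle" | "playing" | "paused";
type SpeakAction = "pause" | "resume" | "generate";

// Collapse the two boolean flags into a single state value.
function toPlaybackState(isSpeaking: boolean, isPaused: boolean): PlaybackState {
  if (isPaused) return "paused";
  return isSpeaking ? "playing" : "idle";
}

// Mirror the branch order of handleSpeak: pause, then resume, then generate.
function nextSpeakAction(state: PlaybackState, hasAudio: boolean): SpeakAction {
  if (state === "playing" && hasAudio) return "pause";
  if (state === "paused" && hasAudio) return "resume";
  return "generate";
}
```

Like the original branches, this returns "generate" when the state is "playing" but no audio element exists yet, which is the case while generation is still running. Disabling the button during that window is one way to avoid a duplicate request.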

Step 5: Audio Download

For downloading the generated audio:

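const handleDownload = async () => {
  // Assumes `const [isDownloading, setIsDownloading] = useState(false)` is
  // declared with the other state; it is not shown in the Step 3 listing
  if (!text.trim() || isDownloading) return;

  setIsDownloading(true);
  try {
    // Reuse generated audio if available, otherwise create new
    const audioUrl = generatedAudioUrl || await generateAudio(text, voice || 'en_US-lessac-medium');

    // Create temporary anchor element for download
    const a = document.createElement('a');
    a.href = audioUrl;
    a.download = 'tts-audio.wav';  // Piper outputs WAV format
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
  } catch (error) {
    console.error('Download error:', error);
  } finally {
    // Always clear the flag so the button is usable again
    setIsDownloading(false);
  }
};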

Language-Aware Voice Filtering

The system filters voices based on the user's selected language:

const getFilteredVoices = () => {
  // Extract base language code (e.g., "en" from "en-US")
  const langCode = lang.split('-')[0];

  // Keep voices that share the base code (e.g., "en-US" and "en-GB" for "en")
  return availableVoices.filter(v => v.lang.split('-')[0] === langCode);
};

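This narrows the voice list to the user's language. Note that the filter returns only matching voices, so a language with no entry in the list yields an empty result; the UI needs a fallback, such as showing all voices, for that case.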
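
If the goal is to show the most relevant voices first while keeping the rest available, sorting is an alternative to filtering. The sketch below ranks an exact locale match first, then voices sharing the base language, then everything else. `rankVoices` is a hypothetical helper, not part of the original component.

```typescript
interface PiperVoice {
  id: string;
  name: string;
  lang: string;
}

// Order voices by relevance to `lang` without dropping any of them.
function rankVoices(voices: PiperVoice[], lang: string): PiperVoice[] {
  const target = lang.toLowerCase();
  const base = target.split("-")[0];

  const score = (voice: PiperVoice): number => {
    const voiceLang = voice.lang.toLowerCase();
    if (voiceLang === target) return 0;              // exact locale match
    if (voiceLang.split("-")[0] === base) return 1;  // same base language
    return 2;                                        // unrelated language
  };

  // Copy before sorting so the caller's array is left untouched.
  return [...voices].sort((a, b) => score(a) - score(b));
}
```

Because `Array.prototype.sort` is stable in current JavaScript engines, voices with the same score keep their original relative order.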

Error Handling and User Feedback

Robust error handling ensures a smooth user experience:

const handleGenerate = async () => {
  if (!text.trim() || isGenerating) return;

  try {
    setIsGenerating(true);
    setGeneratedAudioUrl(null);

    const audioUrl = await generateAudio(text, voice || 'en_US-lessac-medium');
    setGeneratedAudioUrl(audioUrl);

    // Get audio duration for display
    const audio = new Audio(audioUrl);
    audio.onloadedmetadata = () => {
      setAudioDuration(audio.duration);
    };

  } catch (error) {
    console.error('Generate error:', error);
    alert(t.ttsGenerateError || 'Failed to generate audio. Please try again.');
  } finally {
    setIsGenerating(false);
  }
};
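
The duration lookup above creates an extra Audio element just to read metadata. Since the download step notes that Piper outputs WAV, the duration could instead be estimated from the file header. The sketch below assumes a canonical 44-byte PCM header with the `data` chunk at offset 36; files with extra chunks or a placeholder data size would need a fuller parser. `wavDurationSeconds` is a hypothetical helper.

```typescript
// Estimate the duration of a canonical PCM WAV file from its 44-byte header.
function wavDurationSeconds(buffer: ArrayBuffer): number {
  if (buffer.byteLength < 44) {
    throw new Error("Too short to contain a canonical WAV header");
  }
  const view = new DataView(buffer);

  // Read a four-character chunk tag at the given byte offset.
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3)
    );

  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(36) !== "data") {
    throw new Error("Unexpected WAV layout");
  }

  const byteRate = view.getUint32(28, true);  // bytes of audio per second
  const dataBytes = view.getUint32(40, true); // size of the sample data
  if (byteRate === 0) {
    throw new Error("Invalid byte rate");
  }
  return dataBytes / byteRate;
}

// Possible use with the Blob from Step 1:
// const seconds = wavDurationSeconds(await audioBlob.arrayBuffer());
```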

Performance Considerations

Lazy Loading Voice Models

Voice models are large files (10-50MB each). They are downloaded on demand when a voice is first used rather than bundled with the page, so the initial load stays small.

Memory Management

After audio generation, we revoke blob URLs when they're no longer needed:

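// Keep the latest generated URL in a ref so the unmount cleanup does not
// read a stale value captured on the first render
const generatedUrlRef = useRef<string | null>(null);
useEffect(() => {
  generatedUrlRef.current = generatedAudioUrl;
}, [generatedAudioUrl]);

// Cleanup on component unmount
// (audioUrlRef is assumed to hold the preview blob URL; Step 3 does not declare it)
useEffect(() => {
  return () => {
    if (audioUrlRef.current) {
      URL.revokeObjectURL(audioUrlRef.current);
    }
    if (generatedUrlRef.current) {
      URL.revokeObjectURL(generatedUrlRef.current);
    }
  };
}, []);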
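
Unmount cleanup only covers the last URL. Each regeneration also creates a new blob URL, and the previous one stays alive until it is revoked. One way to handle that is a small holder that revokes the old URL whenever a new one replaces it. This is a sketch: `createUrlSlot` is a hypothetical helper, and the revoke function is injected so the logic can be exercised outside a browser.

```typescript
// Hold at most one blob URL and revoke the previous one when it is replaced.
function createUrlSlot(revoke: (url: string) => void) {
  let current: string | null = null;

  const set = (next: string | null): void => {
    // Only revoke when the stored URL is actually being replaced.
    if (current !== null && current !== next) {
      revoke(current);
    }
    current = next;
  };

  return {
    set,
    get: (): string | null => current,
    dispose: (): void => set(null),
  };
}

// In the component this could be wired as:
// const slot = createUrlSlot((url) => URL.revokeObjectURL(url));
// slot.set(await generateAudio(text, voiceId)); // revokes the previous URL
// useEffect(() => () => slot.dispose(), []);   // revokes the last one on unmount
```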

Browser Compatibility

The @realtimex/piper-tts-web library requires:

  • WebAssembly support: All modern browsers (Chrome 57+, Firefox 52+, Safari 11+, Edge 16+)
  • Web Audio API: Universal support in modern browsers
  • Blob URLs: Universal support
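
Rather than relying on version numbers, these requirements can be checked at runtime before the generate button is offered. The sketch below inspects a global-like object for each capability listed above, plus the `Audio` constructor used by the playback code. `detectSupport` and `SupportReport` are hypothetical names, and the environment is passed as a parameter so the check can be tested with stand-in objects.

```typescript
interface SupportReport {
  wasm: boolean;
  webAudio: boolean;
  audioElement: boolean;
  blobUrls: boolean;
}

// Inspect a global-like object for the capabilities the TTS flow relies on.
function detectSupport(
  env: Record<string, unknown> = globalThis as unknown as Record<string, unknown>
): SupportReport {
  const wasm = env["WebAssembly"];
  const url = env["URL"] as { createObjectURL?: unknown } | undefined;

  return {
    wasm: typeof wasm === "object" && wasm !== null,
    webAudio: typeof env["AudioContext"] === "function",
    audioElement: typeof env["Audio"] === "function",
    blobUrls: typeof url?.createObjectURL === "function",
  };
}
```

A component could call `detectSupport()` once on mount and show a fallback message if any field is false.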

Technical Stack

Component  | Technology               | Purpose
-----------|--------------------------|---------------------------
Framework  | React 19                 | UI components
Build Tool | Next.js 16               | SSR and static generation
Styling    | Tailwind CSS 4           | Utility-first styling
TTS Engine | @realtimex/piper-tts-web | WebAssembly TTS
Runtime    | ONNX Runtime Web         | Neural network inference
Icons      | Lucide React             | UI icons

Try It Yourself

Want to experience browser-based TTS in action? Visit our online demo:

👉 Try the TTS Tool

All processing happens directly in your browser - no data is sent to any server, ensuring complete privacy for your text content.

Conclusion

Building a browser-based TTS system with Piper demonstrates the power of modern web technologies. By leveraging WebAssembly and the Web Audio API, we can perform complex neural network inference entirely client-side, offering:

  • Privacy: Your data never leaves your device
  • Speed: No network round-trip per request
  • Offline capability: Works without internet once the voice model is downloaded
  • Cost efficiency: No server-side inference infrastructure needed

This architecture is ideal for applications handling sensitive data, requiring offline functionality, or simply prioritizing user privacy.

Ready to convert text to speech in your own applications? The @realtimex/piper-tts-web library makes it remarkably simple to add high-quality, privacy-preserving TTS to any web project.
