Introduction
In this article, we'll explore how to implement a pure frontend Text-to-Speech (TTS) system that runs entirely in the browser using Piper TTS. Unlike traditional TTS solutions that require server-side processing, this approach leverages WebAssembly and modern browser APIs to generate high-quality speech synthesis without ever sending your data to a remote server.
Why Browser-Based TTS?
The Privacy Imperative
Traditional TTS services typically work by sending your text to remote servers for processing. This creates several critical concerns:
- Data Privacy: Sensitive information (medical records, legal documents, personal messages) must leave the user's device
- Network Dependency: Requires a stable internet connection
- Latency: Network round-trips introduce delays
- Cost: Server infrastructure and API calls incur ongoing expenses
- Offline Limitations: No TTS capability without internet connectivity
By implementing TTS directly in the browser, we avoid these issues. Once the voice model has been downloaded, all synthesis happens locally on the user's device, which gives us:
✓ Zero transmission of your text to external servers
✓ No network round-trips during synthesis
✓ Offline capability once a voice model has been downloaded
✓ Complete privacy - your text never leaves your browser
The Technical Foundation
Our implementation relies on the @realtimex/piper-tts-web library, which packages the Piper TTS engine as a WebAssembly module. Piper is a fast, local neural text-to-speech system that produces natural-sounding speech.
Architecture Overview
The system consists of three main components:
- Voice Management: Configuration and selection of available voices
- Audio Generation: Converting text to audio using Piper WASM
- Playback Control: Managing audio playback, pausing, and downloading
Core Data Structures
Voice Configuration
The system supports multiple languages and voices through a well-defined interface:
interface PiperVoice {
  id: string;   // Unique identifier (e.g., "en_US-lessac-medium")
  name: string; // Display name (e.g., "English (US) - Lessac")
  lang: string; // Language code (e.g., "en-US")
}
Available Voices
const PIPER_VOICES: PiperVoice[] = [
  { id: "en_US-lessac-medium", name: "English (US) - Lessac", lang: "en-US" },
  { id: "en_GB-alan-medium", name: "English (UK) - Alan", lang: "en-GB" },
  { id: "zh_CN-huayan-medium", name: "中文 - 华燕", lang: "zh-CN" },
  { id: "ja_JP-amy-medium", name: "日本語 - Amy", lang: "ja-JP" },
  { id: "ko_KR-amy-medium", name: "한국어 - Amy", lang: "ko-KR" },
  { id: "es_ES-davefx-medium", name: "Español - Dave", lang: "es-ES" },
  { id: "fr_FR-siwis-medium", name: "Français - Siwis", lang: "fr-FR" },
  { id: "de_DE-thorsten-medium", name: "Deutsch - Thorsten", lang: "de-DE" },
  { id: "pt_BR-edresson-low", name: "Português - Edresson", lang: "pt-BR" },
  { id: "ru_RU-denis-medium", name: "Русский - Denis", lang: "ru-RU" },
];
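The handlers later in this article fall back to 'en_US-lessac-medium' whenever no voice is selected. As a minimal sketch, that fallback could live in one place. The names `DEFAULT_VOICE_ID` and `resolveVoice` are hypothetical and not part of the library, and the voice list is shortened to keep the example self-contained.

```typescript
interface PiperVoice {
  id: string;
  name: string;
  lang: string;
}

// Shortened copy of the voice list above, to keep the sketch self-contained.
const VOICES: PiperVoice[] = [
  { id: "en_US-lessac-medium", name: "English (US) - Lessac", lang: "en-US" },
  { id: "de_DE-thorsten-medium", name: "Deutsch - Thorsten", lang: "de-DE" },
];

// Hypothetical constant: the same fallback ID the handlers in this article use.
const DEFAULT_VOICE_ID = "en_US-lessac-medium";

// Hypothetical helper: return the voice for `id`, or the default voice when
// `id` is empty or not in the list.
function resolveVoice(voices: PiperVoice[], id: string): PiperVoice {
  const match = voices.find((v) => v.id === id);
  if (match) return match;
  const fallback = voices.find((v) => v.id === DEFAULT_VOICE_ID);
  if (!fallback) {
    throw new Error(`Default voice ${DEFAULT_VOICE_ID} is not in the list`);
  }
  return fallback;
}
```

With this in place, a call such as `resolveVoice(VOICES, voice).id` could replace the repeated `voice || 'en_US-lessac-medium'` expressions.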
Implementation Deep Dive
Step 1: Dynamic Library Loading
To optimize initial page load, we dynamically import the Piper library only when needed:
const generateAudio = async (text: string, voiceId: string) => {
  // Dynamic import - only loads when user clicks generate
  const { TtsSession } = await import('@realtimex/piper-tts-web');
  // Create a session with the selected voice
  const session = await TtsSession.create({ voiceId: voiceId as any });
  // Generate audio blob from text
  const audioBlob = await session.predict(text);
  // Create a blob URL for the audio
  return URL.createObjectURL(audioBlob);
};
Why dynamic import?
- Reduces initial bundle size
- Voice models (10-50MB each) are downloaded on-demand
- Faster first paint and Time-to-Interactive
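As written, `generateAudio` imports the library and creates a session on every call. If session creation turns out to be expensive, one option is to cache the pending result per voice ID. The sketch below is a generic helper under that assumption: `memoizeByKey` is a hypothetical name, and whether the library already reuses sessions internally is not covered here.

```typescript
// Hypothetical helper: cache the promise returned by `factory` per key, so
// repeated and concurrent calls for the same key share one result.
function memoizeByKey<T>(
  factory: (key: string) => Promise<T>
): (key: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (key) => {
    const cached = cache.get(key);
    if (cached) return cached;
    const created = factory(key).catch((error) => {
      // Drop failed attempts so a later call can retry.
      cache.delete(key);
      throw error;
    });
    cache.set(key, created);
    return created;
  };
}

// Possible wiring, assuming the TtsSession.create call shown above:
// const getSession = memoizeByKey(async (voiceId) => {
//   const { TtsSession } = await import('@realtimex/piper-tts-web');
//   return TtsSession.create({ voiceId: voiceId as any });
// });
```

Caching the promise rather than the resolved value means two quick clicks on the same voice trigger only one load.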
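Step 2: Audio Generation Flow
Each request follows the same path as the function above: the input text and voice ID go to a TtsSession, session.predict(text) returns the audio as a Blob, URL.createObjectURL turns that Blob into a URL, and the URL is handed to an audio element for playback or to a link for download.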
Step 3: State Management
The component manages several key states:
import { useState, useRef } from "react";

const TTSClient = ({ lang }: { lang: string }) => {
  // Input state
  const [text, setText] = useState("");
  // Voice configuration
  const [voice, setVoice] = useState("");
  const [rate, setRate] = useState(1); // Playback speed
  const [pitch, setPitch] = useState(1); // Voice pitch
  // Generation state
  const [isGenerating, setIsGenerating] = useState(false);
  const [generatedAudioUrl, setGeneratedAudioUrl] = useState<string | null>(null);
  // Playback state
  const [isSpeaking, setIsSpeaking] = useState(false);
  const [isPaused, setIsPaused] = useState(false);
  const [audioDuration, setAudioDuration] = useState<number | null>(null);
  // Refs for audio control
  const audioRef = useRef<HTMLAudioElement | null>(null);
  const generatedAudioRef = useRef<HTMLAudioElement | null>(null);
  // ... implementation
};
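Separate boolean flags such as `isGenerating`, `isSpeaking`, and `isPaused` can drift into combinations that make no sense, for example paused but not speaking. One alternative, sketched here with hypothetical names, is a single status value driven by a pure transition function that could be passed to React's `useReducer`.

```typescript
// Hypothetical alternative to separate boolean flags: one status value.
type PlaybackStatus = "idle" | "generating" | "playing" | "paused";

type PlaybackEvent =
  | { type: "GENERATE" } // user asked for new audio
  | { type: "READY" }    // audio generated and playback started
  | { type: "TOGGLE" }   // user pressed play/pause
  | { type: "ENDED" }    // playback finished
  | { type: "FAILED" };  // generation or playback error

// Pure transition function: events that do not apply to the current status
// leave it unchanged.
function playbackReducer(
  status: PlaybackStatus,
  event: PlaybackEvent
): PlaybackStatus {
  switch (event.type) {
    case "GENERATE":
      return status === "idle" ? "generating" : status;
    case "READY":
      return status === "generating" ? "playing" : status;
    case "TOGGLE":
      if (status === "playing") return "paused";
      if (status === "paused") return "playing";
      return status;
    case "ENDED":
    case "FAILED":
      return "idle";
  }
}
```

Because the function has no side effects, each transition can be unit-tested without rendering the component.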
Step 4: Handling Audio Playback
The system supports both preview and download workflows:
const handleSpeak = async () => {
  if (!text.trim()) return;
  // Toggle pause/play for existing audio
  if (isSpeaking && !isPaused && audioRef.current) {
    audioRef.current.pause();
    setIsPaused(true);
  } else if (isPaused && audioRef.current) {
    audioRef.current.play();
    setIsPaused(false);
  } else {
    // Generate new audio
    try {
      setIsSpeaking(true);
      setIsPaused(false);
      const audioUrl = await generateAudio(text, voice || 'en_US-lessac-medium');
      const audio = new Audio(audioUrl);
      audioRef.current = audio;
      // Apply playback rate (speed adjustment)
      audio.playbackRate = rate;
      // Handle playback completion
      audio.onended = () => {
        setIsSpeaking(false);
        setIsPaused(false);
        // The next click generates fresh audio, so this blob can be released
        URL.revokeObjectURL(audioUrl);
      };
      await audio.play();
    } catch (error) {
      console.error('Speak error:', error);
      setIsSpeaking(false);
    }
  }
};
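The handler above assigns `rate` to `audio.playbackRate` as is. If the value comes from a slider or a stored preference, a small guard can keep it inside the range the UI intends to offer. This is a sketch: `clampRate` is a hypothetical helper, and the bounds 0.5 and 2 are assumed UI limits, not limits imposed by the browser or the library.

```typescript
// Assumed UI limits for the speed control (not browser limits).
const MIN_RATE = 0.5;
const MAX_RATE = 2;

// Hypothetical helper: fall back to normal speed for NaN/Infinity,
// otherwise clamp into [MIN_RATE, MAX_RATE].
function clampRate(rate: number): number {
  if (!Number.isFinite(rate)) return 1;
  return Math.min(MAX_RATE, Math.max(MIN_RATE, rate));
}
```

In the handler, this would read `audio.playbackRate = clampRate(rate);`.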
Step 5: Audio Download
For downloading the generated audio:
const handleDownload = async () => {
  // isDownloading / setIsDownloading: an additional useState(false) pair,
  // not listed in Step 3
  if (!text.trim() || isDownloading) return;
  setIsDownloading(true);
  try {
    // Reuse generated audio if available, otherwise create new
    const audioUrl = generatedAudioUrl || await generateAudio(text, voice || 'en_US-lessac-medium');
    // Create temporary anchor element for download
    const a = document.createElement('a');
    a.href = audioUrl;
    a.download = 'tts-audio.wav'; // Piper outputs WAV format
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
  } catch (error) {
    console.error('Download error:', error);
  } finally {
    setIsDownloading(false);
  }
};
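Every download above is saved as 'tts-audio.wav'. If distinct names are preferred, the file name could be derived from the voice ID and the start of the text. The helper below is a hypothetical sketch. It only keeps ASCII letters and digits in the slug, so non-Latin text falls back to a generic name.

```typescript
// Hypothetical helper: build "<voiceId>-<slug>.wav" from the voice ID and text.
function buildFileName(
  voiceId: string,
  text: string,
  maxSlugLength = 30
): string {
  const slug = text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse everything else into dashes
    .replace(/^-+|-+$/g, "")     // trim leading/trailing dashes
    .slice(0, maxSlugLength)
    .replace(/-+$/g, "");        // re-trim if the cut landed on a dash
  return `${voiceId}-${slug || "audio"}.wav`;
}
```

The result can be assigned to `a.download` in place of the fixed string.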
Language-Aware Voice Filtering
The system intelligently filters voices based on the user's selected language:
const getFilteredVoices = () => {
  // Extract base language code (e.g., "en" from "en-US")
  const langCode = lang.split('-')[0];
  // availableVoices: the voice list in scope, e.g. PIPER_VOICES
  // Keep only voices whose language starts with the base code
  return availableVoices.filter(v => v.lang.startsWith(langCode));
};
This narrows the list to voices for the user's language, so a user on "en-GB" sees both the US and UK English voices but none of the others.
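If the goal is to show matching voices first while keeping every other voice reachable, sorting is one alternative to filtering. The sketch below uses a hypothetical `sortVoicesByLanguage` helper. It relies on `Array.prototype.sort` being stable, so the original order is preserved within the matching and non-matching groups.

```typescript
interface PiperVoice {
  id: string;
  name: string;
  lang: string;
}

// Hypothetical alternative: keep every voice, but move those that share the
// base language (e.g. "en" for "en-GB") to the front.
function sortVoicesByLanguage(
  voices: PiperVoice[],
  lang: string
): PiperVoice[] {
  const base = lang.split("-")[0];
  const rank = (v: PiperVoice) => (v.lang.split("-")[0] === base ? 0 : 1);
  // Copy first so the input array is not mutated.
  return [...voices].sort((a, b) => rank(a) - rank(b));
}
```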
Error Handling and User Feedback
Robust error handling ensures a smooth user experience:
const handleGenerate = async () => {
  if (!text.trim() || isGenerating) return;
  try {
    setIsGenerating(true);
    // Release the previous blob before replacing it
    if (generatedAudioUrl) URL.revokeObjectURL(generatedAudioUrl);
    setGeneratedAudioUrl(null);
    setAudioDuration(null);
    const audioUrl = await generateAudio(text, voice || 'en_US-lessac-medium');
    setGeneratedAudioUrl(audioUrl);
    // Get audio duration for display
    const audio = new Audio(audioUrl);
    audio.onloadedmetadata = () => {
      setAudioDuration(audio.duration);
    };
  } catch (error) {
    console.error('Generate error:', error);
    // t: the component's translation strings
    alert(t.ttsGenerateError || 'Failed to generate audio. Please try again.');
  } finally {
    setIsGenerating(false);
  }
};
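The handler stores `audio.duration` in seconds, and `audioDuration` stays `null` until the metadata has loaded. For display, a small formatter can cover both cases. `formatDuration` is a hypothetical helper, and the "--:--" placeholder is an arbitrary choice.

```typescript
// Hypothetical helper: render seconds as "m:ss", or a placeholder while the
// duration is unknown (null) or not a usable number.
function formatDuration(seconds: number | null): string {
  if (seconds === null || !Number.isFinite(seconds) || seconds < 0) {
    return "--:--";
  }
  const total = Math.round(seconds);
  const minutes = Math.floor(total / 60);
  const secs = total % 60;
  return `${minutes}:${String(secs).padStart(2, "0")}`;
}
```

For example, `formatDuration(75.4)` yields "1:15" and `formatDuration(null)` yields "--:--".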
Performance Considerations
Lazy Loading Voice Models
Voice models are large files (10-50MB each), so they are not bundled with the page. As noted in Step 1, a model is downloaded on demand the first time its voice is used.
Memory Management
After audio generation, we revoke blob URLs when they're no longer needed:
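// Keep the latest generated URL in a ref: with an empty dependency array, the
// cleanup below would otherwise only ever see the initial null value
const generatedUrlRef = useRef<string | null>(null);
useEffect(() => {
  generatedUrlRef.current = generatedAudioUrl;
}, [generatedAudioUrl]);
// Cleanup on component unmount
useEffect(() => {
  return () => {
    // audioUrlRef: assumed to be a useRef<string | null> holding the preview
    // URL (not declared in Step 3)
    if (audioUrlRef.current) {
      URL.revokeObjectURL(audioUrlRef.current);
    }
    if (generatedUrlRef.current) {
      URL.revokeObjectURL(generatedUrlRef.current);
    }
  };
}, []);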
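When several handlers create blob URLs, it can be easier to track them in one place than to remember each ref. The class below is a minimal sketch with hypothetical names. The revoke function is passed in so the logic can be exercised outside a browser.

```typescript
// Hypothetical helper: remember every blob URL handed out and revoke each
// one exactly once.
class BlobUrlRegistry {
  private readonly urls = new Set<string>();
  private readonly revoke: (url: string) => void;

  constructor(revoke: (url: string) => void) {
    this.revoke = revoke;
  }

  // Remember a URL and return it, so calls can be wrapped inline.
  track(url: string): string {
    this.urls.add(url);
    return url;
  }

  // Revoke one URL early, e.g. when newer audio replaces it.
  release(url: string): void {
    if (this.urls.delete(url)) this.revoke(url);
  }

  // Revoke everything still tracked, e.g. from a useEffect cleanup.
  releaseAll(): void {
    this.urls.forEach((url) => this.revoke(url));
    this.urls.clear();
  }
}

// In a browser:
// const registry = new BlobUrlRegistry((url) => URL.revokeObjectURL(url));
```

A component could hold one registry in a ref, wrap each `generateAudio` result in `registry.track(...)`, and call `registry.releaseAll()` on unmount.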
Browser Compatibility
The @realtimex/piper-tts-web library requires:
- WebAssembly support: All modern browsers (Chrome 57+, Firefox 52+, Safari 11+, Edge 16+)
- Web Audio API: Universal support in modern browsers
- Blob URLs: Universal support
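Before loading a voice model, a page can check for the features this article's code depends on and show a fallback message if one is missing. The sketch below is an assumption about which checks are useful, not the library's own detection logic. `EnvLike` and `detectSupport` are hypothetical names, and the environment is passed in so the function can be exercised with plain objects.

```typescript
// Minimal view of the globals this check reads.
interface EnvLike {
  WebAssembly?: { instantiate?: unknown };
  URL?: { createObjectURL?: unknown };
  Audio?: unknown;
}

interface SupportReport {
  wasm: boolean;         // WebAssembly.instantiate is callable
  blobUrls: boolean;     // URL.createObjectURL is callable
  audioElement: boolean; // the Audio constructor used for playback exists
}

// Hypothetical helper: report which of the required features are present.
function detectSupport(env: EnvLike): SupportReport {
  return {
    wasm: typeof env.WebAssembly?.instantiate === "function",
    blobUrls: typeof env.URL?.createObjectURL === "function",
    audioElement: typeof env.Audio === "function",
  };
}

// In a browser:
// const report = detectSupport(globalThis as unknown as EnvLike);
```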
Technical Stack
| Component | Technology | Purpose |
|---|---|---|
| Framework | React 19 | UI components |
| Build Tool | Next.js 16 | SSR and static generation |
| Styling | Tailwind CSS 4 | Utility-first styling |
| TTS Engine | @realtimex/piper-tts-web | WebAssembly TTS |
| Runtime | ONNX Runtime Web | Neural network inference |
| Icons | Lucide React | UI icons |
Try It Yourself
Want to experience browser-based TTS in action? Try our online demo.
All processing happens directly in your browser - no data is sent to any server, ensuring complete privacy for your text content.
Conclusion
Building a browser-based TTS system with Piper demonstrates the power of modern web technologies. By leveraging WebAssembly and the Web Audio API, we can perform complex neural network inference entirely client-side, offering:
- Privacy: Your data never leaves your device
- Speed: No network latency
- Offline capability: Works without internet
- Cost efficiency: No server infrastructure needed
This architecture is ideal for applications handling sensitive data, requiring offline functionality, or simply prioritizing user privacy.
Further Reading
- Piper TTS GitHub Repository
- ONNX Runtime Web Documentation
- Web Audio API MDN
- WebAssembly Documentation
Ready to convert text to speech in your own applications? The @realtimex/piper-tts-web library makes it remarkably simple to add high-quality, privacy-preserving TTS to any web project.


