Running AI models in the browser used to be a pipe dream. Neural networks required powerful GPUs, gigabytes of memory, and server-side processing. But what if I told you we're now running a complete text-to-speech AI model entirely in your browser, with no server communication whatsoever?
This is the technical story of how we built our Text to Speech tool using KittenTTS, ONNX Runtime, and WebAssembly—creating a privacy-first, unlimited AI voice synthesis system that runs completely client-side.
The Technical Challenge
Traditional text-to-speech systems rely on server-side processing for good reason:
- Model size: Neural TTS models can be hundreds of megabytes
- Computational complexity: Voice synthesis requires intensive matrix operations
- Memory usage: Audio generation consumes significant RAM
- Browser limitations: JavaScript wasn't designed for heavy numerical computing
Yet here we are, doing exactly that. Let's dive into how we solved each challenge.
Architecture Overview
Our browser-based TTS system consists of four main components:
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Text Input    │───>│   Text Cleaner   │───>│   Phonemizer    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Audio Output   │<───│  KittenTTS ONNX  │<───│ Token Converter │
└─────────────────┘    └──────────────────┘    └─────────────────┘
Each component runs entirely in the browser, with no external dependencies once loaded.
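In code terms, the flow maps onto a handful of calls. The sketch below glues together the helpers introduced later in this post (cleanTextForTTS, phonemize, TextCleaner, createWavBlob); runKittenTTS is a hypothetical wrapper around the ONNX session, and the 24000 sample rate is an assumption, not a documented value.
// Simplified sketch of the end-to-end pipeline; error handling omitted.
async function generate(rawText: string, session: InferenceSession): Promise<Blob> {
  // 1. Text Cleaner: strip emojis, quotes, and non-Latin characters
  const text = cleanTextForTTS(rawText);

  // 2. Phonemizer: convert text to IPA phonemes
  const phonemes = (await phonemize(text, 'en-us')).join('');

  // 3. Token Converter: map phonemes to the model's numeric vocabulary
  const tokens = new TextCleaner().call(phonemes);

  // 4. KittenTTS ONNX: run inference and post-process into audio samples
  const samples = await runKittenTTS(session, tokens);   // hypothetical wrapper

  // 5. Audio Output: package samples as a playable WAV blob
  return createWavBlob(samples, 24000);                  // assumed sample rate
}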
KittenTTS: The Neural Voice Engine
At the heart of our system is KittenTTS, a neural text-to-speech model that balances quality with efficiency. Unlike massive models like Tacotron 2 or FastSpeech, KittenTTS is designed to be lightweight while still producing natural-sounding speech.
Model Architecture
KittenTTS uses a transformer-based architecture with:
- Text encoder: Converts phonemes to hidden representations
- Style embeddings: Define voice characteristics (8 distinct voices)
- Decoder: Generates mel-spectrograms from text representations
- Vocoder: Converts spectrograms to raw audio waveforms
The entire pipeline from text to audio happens in a single ONNX model file, making it perfect for browser deployment.
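Because everything lives in one ONNX graph, you can see the full text-to-waveform interface just by inspecting the session. A quick sketch using ONNX Runtime Web's inputNames and outputNames properties (modelBuffer here stands for the model bytes loaded as shown in the next section):
// Inspect the single-file model: one graph covers text tokens to waveform
import { InferenceSession } from 'onnxruntime-web';

const session = await InferenceSession.create(new Uint8Array(modelBuffer));
console.log('inputs:', session.inputNames);    // e.g. input_ids, style, speed
console.log('outputs:', session.outputNames);  // the generated waveform tensor(s)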
ONNX Runtime Web: Bringing ML to Browsers
ONNX (Open Neural Network Exchange) Runtime is Microsoft's cross-platform ML inference engine. The Web version brings near-native performance to browsers through WebAssembly.
Why ONNX Runtime?
// Initialize ONNX Runtime with WebAssembly backend
const sessionOptions = {
  executionProviders: ['wasm'],
  graphOptimizationLevel: 'disabled',
  enableCpuMemArena: false,
  enableMemPattern: false,
  logSeverityLevel: 3
};
this.model = await InferenceSession.create(modelBuffer, sessionOptions);
ONNX Runtime Web offers several advantages:
- Performance: WebAssembly execution is 10-20x faster than pure JavaScript
- Memory efficiency: Optimized tensor operations minimize memory allocation
- Cross-platform: Works consistently across all modern browsers
- GPU acceleration: Can leverage WebGL when available (see the sketch below)
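Requesting the GPU-backed provider with a CPU fallback is just a matter of ordering the execution providers. This is a sketch; our production build sticks to the plain 'wasm' configuration shown above.
// Providers are tried in order: WebGL (GPU) first, WebAssembly (CPU) as fallback
const session = await InferenceSession.create(modelBuffer, {
  executionProviders: ['webgl', 'wasm']
});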
Loading the Model
One of our biggest challenges was loading a 25MB ONNX model efficiently in the browser:
async load(): Promise<void> {
  // Check IndexedDB cache first
  let modelBuffer = await this.loadModelFromCache();

  if (!modelBuffer) {
    // Load from embedded assets or fetch
    if (this.config.useEmbeddedAssets && hasEmbeddedAssets()) {
      modelBuffer = getEmbeddedModel();
    } else {
      const response = await fetch(this.config.modelPath);
      modelBuffer = await response.arrayBuffer();
    }

    // Cache for future sessions
    await this.saveModelToCache(modelBuffer);
  }

  this.model = await InferenceSession.create(new Uint8Array(modelBuffer));
}
This approach provides progressive loading: first-time users download the model, while returning users load instantly from IndexedDB.
WebAssembly: The Browser's Numerical Computing Engine
WebAssembly (WASM) is the secret sauce that makes browser-based AI possible. It provides near-native performance for computationally intensive operations.
WASM Configuration
// env is imported from 'onnxruntime-web'
configureWasmPaths(wasmPaths: Record<string, string>) {
  (env.wasm as any).wasmPaths = wasmPaths;

  // Optimize for browser environment
  env.wasm.numThreads = 1;  // Single-threaded for compatibility
  env.wasm.simd = true;     // Enable SIMD if available
  env.logLevel = 'warning';
}
Key WASM optimizations:
- Single-threaded execution: Avoids SharedArrayBuffer requirements
- SIMD support: Vectorized operations when the browser supports them (see the feature check below)
- Memory management: Efficient allocation for tensor operations
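As a rough illustration of how such capability checks can be done (not our exact code), here is a sketch using the third-party wasm-feature-detect package, which is an assumption and not part of the stack described here, together with the crossOriginIsolated flag:
// Sketch only: capability checks before configuring ONNX Runtime.
// Assumes the 'wasm-feature-detect' package is installed.
import { simd, threads } from 'wasm-feature-detect';
import { env } from 'onnxruntime-web';

async function configureWasmFeatures(): Promise<void> {
  // SIMD: enable vectorized kernels only when the browser supports them
  env.wasm.simd = await simd();

  // Threads need SharedArrayBuffer, which requires cross-origin isolation;
  // stay single-threaded unless the page is isolated and threads actually work
  const canUseThreads = typeof crossOriginIsolated !== 'undefined'
    && crossOriginIsolated
    && await threads();
  env.wasm.numThreads = canUseThreads ? navigator.hardwareConcurrency : 1;
}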
Performance Characteristics
Our benchmarks show impressive performance across different browsers:
| Browser | First Load | Cached Load | Generation (100 chars) |
| --- | --- | --- | --- |
| Chrome 120+ | 8-12 seconds | 2-3 seconds | 3-5 seconds |
| Firefox 119+ | 10-15 seconds | 3-4 seconds | 4-6 seconds |
| Safari 17+ | 12-18 seconds | 4-5 seconds | 5-7 seconds |
Text Processing Pipeline
Converting text to audio requires several preprocessing steps that we've optimized for browser execution.
Text Cleaning and Normalization
export function cleanTextForTTS(text: string): string {
  // Remove emojis using Unicode ranges
  const emojiRegex = /[\u{1F600}-\u{1F64F}]|[\u{1F300}-\u{1F5FF}]/gu;

  return text
    .replace(emojiRegex, '')
    .replace(/\b\/\b/g, ' slash ')     // "either/or" -> "either slash or"
    .replace(/[\/\\()¯]/g, '')
    .replace(/["“”]/g, '')
    .replace(/\s—/g, '.')
    .replace(/[^\u0000-\u024F]/g, '')  // Keep only Latin and Latin Extended characters
    .trim();
}
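For example, emojis and quotation marks are stripped before the text ever reaches the phonemizer:
// Quotes and emoji are removed; the result is plain Latin text
const cleaned = cleanTextForTTS('Hello "world"😀');
console.log(cleaned); // -> Hello world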
Phonemization
Converting text to phonemes is crucial for natural speech. We use the phonemizer.js library:
// Convert text to phonemes
const phonemesList = await phonemize(text, 'en-us');
const phonemes = phonemesList.join('');

// Convert phonemes to token IDs
const tokens = this.textCleaner.call(phonemes);

// Wrap the sequence with boundary tokens
tokens.unshift(0); // Add start token
tokens.push(0);    // Add end token
Token Encoding
Our TextCleaner converts phonemes to numerical tokens that the neural network understands:
export class TextCleaner {
  private wordIndexDictionary: Record<string, number>;

  constructor() {
    const _pad = "$";
    const _punctuation = ';:,.!?¡¿—…"«»“” ';
    const _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    const _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ";

    const symbols = [_pad, ...Array.from(_punctuation), ...Array.from(_letters), ...Array.from(_letters_ipa)];

    this.wordIndexDictionary = {};
    for (let i = 0; i < symbols.length; i++) {
      this.wordIndexDictionary[symbols[i]] = i;
    }
  }

  call(text: string): number[] {
    const indexes: number[] = [];
    for (const char of text) {
      if (this.wordIndexDictionary[char] !== undefined) {
        indexes.push(this.wordIndexDictionary[char]);
      }
    }
    return indexes;
  }
}
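Note that call() silently drops any character that isn't in the symbol table, which is exactly why the earlier cleaning step matters. A quick check:
// Characters outside the symbol table are skipped, not errored on
const cleaner = new TextCleaner();
const ids = cleaner.call('həˈloʊ');  // every symbol here is in the table
console.log(ids.length);             // 6 token IDs, one per character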
Neural Network Inference
The actual voice synthesis happens through ONNX model inference with carefully prepared inputs.
Model Input Preparation
private async prepareInputs(text: string, options: GenerateOptions) {
  const { voice = 'expr-voice-2-m', speed = 1.0 } = options;

  // Get phonemes and convert to tokens
  const phonemes = await phonemize(text, 'en-us');
  const tokens = this.textCleaner.call(phonemes.join(''));

  // Add start/end tokens
  tokens.unshift(0);
  tokens.push(0);

  const tokenIds = new BigInt64Array(tokens.map(id => BigInt(id)));
  const voiceEmbedding = this.voices[voice];

  return {
    'input_ids': new Tensor('int64', tokenIds, [1, tokenIds.length]),
    'style': new Tensor('float32', voiceEmbedding, [1, voiceEmbedding.length]),
    'speed': new Tensor('float32', new Float32Array([speed]), [1])
  };
}
Running Inference
async generateSingle(text: string, options: GenerateOptions) {
  const inputs = await this.prepareInputs(text, options);

  // Run neural network inference
  const results = await this.model!.run(inputs);

  // Extract audio tensor (usually the largest output)
  let audioTensor = null;
  for (const [name, tensor] of Object.entries(results)) {
    if (!audioTensor || tensor.size > audioTensor.size) {
      audioTensor = tensor;
    }
  }

  // Convert to Float32Array and post-process
  const audioData = new Float32Array(audioTensor.data);
  return this.postProcessAudio(audioData);
}
Audio Post-Processing
Raw neural network output requires several post-processing steps to create clean, playable audio.
Cleaning and Normalization
private postProcessAudio(audioData: Float32Array): Float32Array {
  // Clean NaN values
  for (let i = 0; i < audioData.length; i++) {
    if (isNaN(audioData[i])) {
      audioData[i] = 0;
    }
  }

  // Trim silence
  let startIdx = 0, endIdx = audioData.length - 1;
  const threshold = 0.001;

  for (let i = 0; i < audioData.length; i++) {
    if (Math.abs(audioData[i]) > threshold) {
      startIdx = i;
      break;
    }
  }

  for (let i = audioData.length - 1; i >= 0; i--) {
    if (Math.abs(audioData[i]) > threshold) {
      endIdx = i;
      break;
    }
  }

  const trimmedAudio = audioData.slice(startIdx, endIdx + 1);

  // Normalize volume
  let maxAmplitude = 0;
  for (const sample of trimmedAudio) {
    maxAmplitude = Math.max(maxAmplitude, Math.abs(sample));
  }

  if (maxAmplitude > 0) {
    const normalizationFactor = 0.8 / maxAmplitude;
    for (let i = 0; i < trimmedAudio.length; i++) {
      trimmedAudio[i] *= normalizationFactor;
    }
  }

  return trimmedAudio;
}
Memory Management and Optimization
Running large neural networks in browsers requires careful memory management.
Chunked Processing
For long texts, we automatically split content into manageable chunks:
export function chunkText(text: string): string[] {
  const MAX_CHUNK_LENGTH = 500;
  // Split on sentence boundaries (., !, ?) followed by whitespace and a capital letter, or end of text
  const sentences = text.split(/(?<=[.!?])(?=\s+[A-Z]|$)/);

  const chunks: string[] = [];
  let currentChunk = '';

  for (const sentence of sentences) {
    const potentialChunk = currentChunk + (currentChunk ? ' ' : '') + sentence.trim();
    if (potentialChunk.length > MAX_CHUNK_LENGTH) {
      if (currentChunk) chunks.push(currentChunk);
      currentChunk = sentence.trim();
    } else {
      currentChunk = potentialChunk;
    }
  }

  if (currentChunk) chunks.push(currentChunk);
  return chunks;
}
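To give a sense of how chunking ties into generation, here is a sketch that assumes the generateSingle() method shown earlier; the generateLong name and the concatenation helper are illustrative, not our exact API.
// Sketch: synthesize long text chunk by chunk, then join the audio
async generateLong(text: string, options: GenerateOptions): Promise<Float32Array> {
  const chunks = chunkText(text);
  const pieces: Float32Array[] = [];

  for (const chunk of chunks) {
    // Each chunk stays under the 500-character limit, keeping memory bounded
    pieces.push(await this.generateSingle(chunk, options));
  }

  // Concatenate the per-chunk waveforms into one buffer
  const totalLength = pieces.reduce((sum, p) => sum + p.length, 0);
  const combined = new Float32Array(totalLength);
  let offset = 0;
  for (const piece of pieces) {
    combined.set(piece, offset);
    offset += piece.length;
  }
  return combined;
}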
IndexedDB Caching
We implement a sophisticated caching system using IndexedDB:
class ModelCache {
  async set(key: string, data: ArrayBuffer): Promise<void> {
    const db = await this.init();
    return new Promise((resolve, reject) => {
      const transaction = db.transaction([this.storeName], 'readwrite');
      const store = transaction.objectStore(this.storeName);
      const request = store.put({
        key,
        data,
        timestamp: Date.now()
      });
      request.onsuccess = () => resolve();
      request.onerror = () => reject(request.error);
    });
  }

  async get(key: string): Promise<ArrayBuffer | null> {
    const db = await this.init();
    return new Promise((resolve) => {
      const transaction = db.transaction([this.storeName], 'readonly');
      const store = transaction.objectStore(this.storeName);
      const request = store.get(key);
      request.onsuccess = () => {
        const result = request.result;
        // Entries expire after 7 days
        if (result && Date.now() - result.timestamp < 7 * 24 * 60 * 60 * 1000) {
          resolve(result.data);
        } else {
          resolve(null);
        }
      };
      // Treat read errors as a cache miss rather than leaving the promise pending
      request.onerror = () => resolve(null);
    });
  }
}
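The class above relies on an init() helper and a storeName field that aren't shown. A minimal version might look like the following sketch; the database and store names are hypothetical, not our exact values.
// Sketch of the assumed plumbing behind ModelCache
private dbName = 'kitten-tts-cache';  // hypothetical database name
private storeName = 'models';         // hypothetical object store name

private init(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open(this.dbName, 1);
    // Create the object store on first open or version upgrade
    request.onupgradeneeded = () => {
      const db = request.result;
      if (!db.objectStoreNames.contains(this.storeName)) {
        db.createObjectStore(this.storeName, { keyPath: 'key' });
      }
    };
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}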
Audio Format Conversion
Converting Float32Array audio data to playable WAV format happens entirely in JavaScript:
export function createWavBlob(audioData: Float32Array, sampleRate: number): Blob {
  const buffer = new ArrayBuffer(44 + audioData.length * 2);
  const view = new DataView(buffer);

  // Write WAV header
  const writeString = (offset: number, string: string) => {
    for (let i = 0; i < string.length; i++) {
      view.setUint8(offset + i, string.charCodeAt(i));
    }
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + audioData.length * 2, true);
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);              // fmt chunk size
  view.setUint16(20, 1, true);               // PCM format
  view.setUint16(22, 1, true);               // Mono
  view.setUint32(24, sampleRate, true);      // Sample rate
  view.setUint32(28, sampleRate * 2, true);  // Byte rate (mono, 16-bit)
  view.setUint16(32, 2, true);               // Block align
  view.setUint16(34, 16, true);              // Bits per sample
  writeString(36, 'data');
  view.setUint32(40, audioData.length * 2, true);

  // Convert float to 16-bit PCM
  let offset = 44;
  for (let i = 0; i < audioData.length; i++) {
    const sample = Math.max(-1, Math.min(1, audioData[i]));
    view.setInt16(offset, sample < 0 ? sample * 0x8000 : sample * 0x7FFF, true);
    offset += 2;
  }

  return new Blob([buffer], { type: 'audio/wav' });
}
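Once the Blob exists, playing it back is ordinary browser code. A minimal sketch follows; the 24000 sample rate is an assumption about the model's output rather than a documented value.
// Sketch: turn post-processed samples into audible output in the page
async function playSamples(audioSamples: Float32Array): Promise<void> {
  const wavBlob = createWavBlob(audioSamples, 24000);  // assumed output sample rate
  const url = URL.createObjectURL(wavBlob);

  const player = new Audio(url);
  // Release the object URL once playback finishes to free memory
  player.onended = () => URL.revokeObjectURL(url);
  await player.play();
}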
Performance Optimizations
Progressive Loading
We've implemented several strategies to improve perceived performance:
- Embedded assets: Critical model files bundled with the application
- IndexedDB caching: Models persist across browser sessions
- Lazy initialization: ONNX Runtime loads only when needed (see the sketch after this list)
- Background processing: Model loading doesn't block the UI
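The lazy-initialization point deserves a concrete illustration. Here is a sketch of deferring the onnxruntime-web import until the first generation request; the surrounding structure and function names are assumptions.
// Sketch: defer loading onnxruntime-web until the user actually generates audio
let ortPromise: Promise<typeof import('onnxruntime-web')> | null = null;

function getOrt() {
  // The dynamic import runs once; later calls reuse the same promise
  ortPromise ??= import('onnxruntime-web');
  return ortPromise;
}

async function createSession(modelBuffer: Uint8Array) {
  const ort = await getOrt();
  return ort.InferenceSession.create(modelBuffer, { executionProviders: ['wasm'] });
}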
Browser Compatibility
Our implementation gracefully handles different browser capabilities:
// Try WebAssembly backend first, fall back to CPU
try {
  this.model = await InferenceSession.create(modelBuffer, {
    executionProviders: ['wasm'],
    graphOptimizationLevel: 'disabled'
  });
} catch (wasmError) {
  console.warn('WebAssembly failed, using CPU backend');
  this.model = await InferenceSession.create(modelBuffer, {
    executionProviders: ['cpu'],
    graphOptimizationLevel: 'basic'
  });
}
Security and Privacy Implications
Running AI models client-side has significant privacy advantages:
- Zero data transmission: Text never leaves the user's device
- No server logs: No record of what users synthesize
- Offline capability: Works without internet after initial load
- No API keys: No authentication or usage tracking
Cross-Origin Isolation Headers
Some WebAssembly features, notably SharedArrayBuffer-backed threading, require the page to be cross-origin isolated, which we configure at the hosting layer:
# netlify.toml
[[headers]]
  for = "/*"
  [headers.values]
    Cross-Origin-Embedder-Policy = "require-corp"
    Cross-Origin-Opener-Policy = "same-origin"
Challenges and Limitations
Current Limitations
- Initial load time: First-time users wait roughly 8-18 seconds for the model download, depending on the browser (see the benchmarks above)
- Memory usage: Neural networks consume 100-200MB RAM during inference
- Browser compatibility: Requires modern browsers with WebAssembly support
- Mobile performance: Slower generation on resource-constrained devices
Future Optimizations
Several improvements are on our roadmap:
- Model quantization: Reduce model size through 8-bit precision
- WebGL acceleration: Leverage GPU when available
- Streaming inference: Generate audio progressively for long texts
- Service Worker caching: More aggressive asset caching strategies
The Bigger Picture
Our KittenTTS implementation represents a broader trend toward edge AI computing. By running neural networks in browsers, we're enabling:
- Privacy-preserving AI: No data leaves the user's device
- Reduced infrastructure costs: No expensive GPU servers required
- Better user experience: No network latency or rate limits
- Democratized AI: Advanced capabilities accessible to everyone
Technical Takeaways
If you're building browser-based AI applications, here are key lessons from our implementation:
- ONNX Runtime Web is production-ready for neural network inference in browsers
- WebAssembly provides significant performance gains for numerical computing
- Careful memory management is crucial for large model deployment
- Progressive loading strategies improve perceived performance
- IndexedDB caching is essential for models larger than a few megabytes
Try It Yourself
Want to experience browser-based AI voice synthesis? Try our Text to Speech tool and see KittenTTS, ONNX Runtime, and WebAssembly working together in real-time.
For developers interested in implementing similar systems, our KittenTTS JavaScript package (@quickeditvideo/kittentts) provides a clean API for browser-based neural text-to-speech synthesis (coming shortly).
The future of AI is moving to the edge—and that includes the browser edge. Privacy-preserving, unlimited, client-side AI is not just possible, it's here today.