Running KittenTTS in the Browser: A Deep Dive into WASM and ONNX

Running AI models in the browser used to be a pipe dream. Neural networks required powerful GPUs, gigabytes of memory, and server-side processing. But what if I told you we're now running a complete text-to-speech AI model entirely in your browser, with no server communication whatsoever?

This is the technical story of how we built our Text to Speech tool using KittenTTS, ONNX Runtime, and WebAssembly—creating a privacy-first, unlimited AI voice synthesis system that runs completely client-side.

The Technical Challenge

Traditional text-to-speech systems rely on server-side processing for good reason:

  • Model size: Neural TTS models can be hundreds of megabytes
  • Computational complexity: Voice synthesis requires intensive matrix operations
  • Memory usage: Audio generation consumes significant RAM
  • Browser limitations: JavaScript wasn't designed for heavy numerical computing

Yet here we are, doing exactly that. Let's dive into how we solved each challenge.

Architecture Overview

Our browser-based TTS system consists of four main components:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Text Input    │───>│   Text Cleaner   │───>│   Phonemizer    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                          │
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Audio Output  │<───│ KittenTTS ONNX   │<───│ Token Converter │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Each component runs entirely in the browser, with no external dependencies once loaded.
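To make the flow concrete, here is a minimal sketch of how the four stages chain together. The helper names (cleanTextForTTS, generateSingle, createWavBlob, GenerateOptions) mirror the snippets shown later in this post; the synthesize wrapper itself and the 24000 Hz sample rate are illustrative assumptions, not our exact production code.

// Hedged sketch of the end-to-end flow; `tts` wraps the ONNX session
async function synthesize(rawText: string, options: GenerateOptions): Promise<Blob> {
  // 1. Text Cleaner: strip emojis and unsupported characters
  const cleaned = cleanTextForTTS(rawText);

  // 2-3. Phonemizer, Token Converter, and KittenTTS ONNX inference all
  //      happen inside generateSingle (covered below)
  const samples = await tts.generateSingle(cleaned, options);

  // 4. Audio Output: float PCM samples -> playable WAV blob
  //    (24000 Hz is an assumption; use the model's real output rate)
  return createWavBlob(samples, 24000);
}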

KittenTTS: The Neural Voice Engine

At the heart of our system is KittenTTS, a neural text-to-speech model that balances quality with efficiency. Unlike massive models like Tacotron 2 or FastSpeech, KittenTTS is designed to be lightweight while still producing natural-sounding speech.

Model Architecture

KittenTTS uses a transformer-based architecture with:

  • Text encoder: Converts phonemes to hidden representations
  • Style embeddings: Define voice characteristics (8 distinct voices)
  • Decoder: Generates mel-spectrograms from text representations
  • Vocoder: Converts spectrograms to raw audio waveforms

The entire pipeline from text to audio happens in a single ONNX model file, making it perfect for browser deployment.
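One nice consequence: you can inspect the graph's interface directly. The inputNames and outputNames properties are part of the real InferenceSession API; the three input names shown match the prepareInputs code later in this post, and modelBuffer is assumed to be the model bytes from the loading section below.

import { InferenceSession } from 'onnxruntime-web';

// modelBuffer is the model's bytes (see "Loading the Model" below)
const session = await InferenceSession.create(modelBuffer);

// Three inputs drive the whole pipeline (matches prepareInputs later on)
console.log(session.inputNames);  // ['input_ids', 'style', 'speed']
console.log(session.outputNames); // includes the raw waveform tensor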

ONNX Runtime Web: Bringing ML to Browsers

ONNX (Open Neural Network Exchange) Runtime is Microsoft's cross-platform ML inference engine. The Web version brings near-native performance to browsers through WebAssembly.

Why ONNX Runtime?

// InferenceSession is imported from 'onnxruntime-web'
// Initialize ONNX Runtime with the WebAssembly backend
const sessionOptions = {
  executionProviders: ['wasm'],
  graphOptimizationLevel: 'disabled',
  enableCpuMemArena: false,   // skip the pre-allocated CPU memory arena
  enableMemPattern: false,    // skip memory-pattern planning
  logSeverityLevel: 3         // log errors only
};

this.model = await InferenceSession.create(modelBuffer, sessionOptions);

ONNX Runtime Web offers several advantages:

  • Performance: WebAssembly execution is 10-20x faster than pure JavaScript
  • Memory efficiency: Optimized tensor operations minimize memory allocation
  • Cross-platform: Works consistently across all modern browsers
  • GPU acceleration: Can leverage WebGL when available

Loading the Model

One of our biggest challenges was loading a 25MB ONNX model efficiently in the browser:

async load(): Promise<void> {
  // Check IndexedDB cache first
  let modelBuffer = await this.loadModelFromCache();
  
  if (!modelBuffer) {
    // Load from embedded assets or fetch
    if (this.config.useEmbeddedAssets && hasEmbeddedAssets()) {
      modelBuffer = getEmbeddedModel();
    } else {
      const response = await fetch(this.config.modelPath);
      modelBuffer = await response.arrayBuffer();
    }
    
    // Cache for future sessions
    await this.saveModelToCache(modelBuffer);
  }
  
  this.model = await InferenceSession.create(new Uint8Array(modelBuffer));
}

This approach provides progressive loading: first-time users download the model, while returning users load instantly from IndexedDB.

WebAssembly: JavaScript's Numerical Computing Engine

WebAssembly (WASM) is the secret sauce that makes browser-based AI possible. It provides near-native performance for computationally intensive operations.

WASM Configuration

// env is imported from 'onnxruntime-web'
configureWasmPaths(wasmPaths: Record<string, string>) {
  (env.wasm as any).wasmPaths = wasmPaths;

  // Optimize for the browser environment
  env.wasm.numThreads = 1;  // single-threaded for compatibility
  env.wasm.simd = true;     // enable SIMD when the browser supports it
  env.logLevel = 'warning';
}

Key WASM optimizations:

  • Single-threaded execution: Avoids SharedArrayBuffer requirements (see the thread-count sketch below)
  • SIMD support: Vectorized operations when browser supports it
  • Memory management: Efficient allocation for tensor operations
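Whether multi-threading is even on the table depends on cross-origin isolation: browsers only expose SharedArrayBuffer on isolated pages (see the headers in the security section below). Here's a minimal sketch of picking a thread count at runtime; crossOriginIsolated and navigator.hardwareConcurrency are standard web APIs, while the "leave one core free" heuristic is our own assumption:

import { env } from 'onnxruntime-web';

// WASM threads need SharedArrayBuffer, which needs cross-origin isolation
if (typeof SharedArrayBuffer !== 'undefined' && self.crossOriginIsolated) {
  // Leave one core free for the UI thread (a heuristic, not gospel)
  env.wasm.numThreads = Math.max(1, (navigator.hardwareConcurrency || 2) - 1);
} else {
  env.wasm.numThreads = 1; // the single-threaded path we ship today
}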

Performance Characteristics

Our benchmarks show impressive performance across different browsers:

Browser        First Load      Cached Load    Generation (100 chars)
Chrome 120+    8-12 seconds    2-3 seconds    3-5 seconds
Firefox 119+   10-15 seconds   3-4 seconds    4-6 seconds
Safari 17+     12-18 seconds   4-5 seconds    5-7 seconds

Text Processing Pipeline

Converting text to audio requires several preprocessing steps that we've optimized for browser execution.

Text Cleaning and Normalization

export function cleanTextForTTS(text: string): string {
  // Remove emojis using Unicode ranges (emoticons and misc. symbols)
  const emojiRegex = /[\u{1F600}-\u{1F64F}]|[\u{1F300}-\u{1F5FF}]/gu;

  return text
    .replace(emojiRegex, '')
    .replace(/\b\/\b/g, ' slash ')      // read "a/b" as "a slash b"
    .replace(/[\/\\()¯]/g, '')          // strip leftover slashes, parentheses, macrons
    .replace(/["“”]/g, '')              // strip straight and curly quotes
    .replace(/\s—/g, '.')               // turn em-dash pauses into sentence breaks
    .replace(/[^\u0000-\u024F]/g, '')   // keep only Latin and Latin Extended characters
    .trim();
}
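For example, with the cleaner above:

cleanTextForTTS('Check the docs/readme 😀');
// → 'Check the docs slash readme'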

Phonemization

Converting text to phonemes is crucial for natural speech. We use the phonemizer.js library:

// Convert text to phonemes
const phonemesList = await phonemize(text, 'en-us');
const phonemes = phonemesList.join('');

// Convert to token IDs
const tokens = this.textCleaner.call(phonemes);
tokens.unshift(0); // Add start boundary token (the "$" pad symbol)
tokens.push(0);    // Add end boundary token

Token Encoding

Our TextCleaner converts phonemes to numerical tokens that the neural network understands:

export class TextCleaner {
  private wordIndexDictionary: Record<string, number>;

  constructor() {
    const _pad = "$";
    const _punctuation = ';:,.!?¡¿—…"«»“” ';
    const _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    const _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ";

    const symbols = [_pad, ...Array.from(_punctuation), ...Array.from(_letters), ...Array.from(_letters_ipa)];
    
    this.wordIndexDictionary = {};
    for (let i = 0; i < symbols.length; i++) {
      this.wordIndexDictionary[symbols[i]] = i;
    }
  }

  call(text: string): number[] {
    const indexes: number[] = [];
    for (const char of text) {
      if (this.wordIndexDictionary[char] !== undefined) {
        indexes.push(this.wordIndexDictionary[char]);
      }
    }
    return indexes;
  }
}
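Unknown characters are dropped silently rather than throwing, which keeps inference resilient to stray symbols from the phonemizer. A quick usage sketch:

const cleaner = new TextCleaner();
// IPA input straight from the phonemizer; each known symbol maps to
// its index in the symbol table, unknown ones are simply skipped
const tokenIds = cleaner.call('həlˈoʊ wˈɜːld');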

Neural Network Inference

The actual voice synthesis happens through ONNX model inference with carefully prepared inputs.

Model Input Preparation

private async prepareInputs(text: string, options: GenerateOptions) {
  const { voice = 'expr-voice-2-m', speed = 1.0 } = options;
  
  // Get phonemes and convert to tokens
  const phonemes = await phonemize(text, 'en-us');
  const tokens = this.textCleaner.call(phonemes.join(''));
  
  // Add start/end tokens
  tokens.unshift(0);
  tokens.push(0);
  
  const tokenIds = new BigInt64Array(tokens.map(id => BigInt(id)));
  const voiceEmbedding = this.voices[voice];
  
  return {
    'input_ids': new Tensor('int64', tokenIds, [1, tokenIds.length]),
    'style': new Tensor('float32', voiceEmbedding, [1, voiceEmbedding.length]),
    'speed': new Tensor('float32', new Float32Array([speed]), [1])
  };
}

Running Inference

async generateSingle(text: string, options: GenerateOptions) {
  const inputs = await this.prepareInputs(text, options);
  
  // Run neural network inference
  const results = await this.model!.run(inputs);
  
  // Extract audio tensor (usually the largest output)
  let audioTensor = null;
  for (const tensor of Object.values(results)) {
    if (!audioTensor || tensor.size > audioTensor.size) {
      audioTensor = tensor;
    }
  }
  
  // Convert to Float32Array and post-process
  const audioData = new Float32Array(audioTensor.data);
  return this.postProcessAudio(audioData);
}

Audio Post-Processing

Raw neural network output requires several post-processing steps to create clean, playable audio.

Cleaning and Normalization

private postProcessAudio(audioData: Float32Array): Float32Array {
  // Clean NaN values
  for (let i = 0; i < audioData.length; i++) {
    if (isNaN(audioData[i])) {
      audioData[i] = 0;
    }
  }
  
  // Trim silence
  let startIdx = 0, endIdx = audioData.length - 1;
  const threshold = 0.001;
  
  for (let i = 0; i < audioData.length; i++) {
    if (Math.abs(audioData[i]) > threshold) {
      startIdx = i;
      break;
    }
  }
  
  for (let i = audioData.length - 1; i >= 0; i--) {
    if (Math.abs(audioData[i]) > threshold) {
      endIdx = i;
      break;
    }
  }
  
  const trimmedAudio = audioData.slice(startIdx, endIdx + 1);
  
  // Normalize volume
  let maxAmplitude = 0;
  for (const sample of trimmedAudio) {
    maxAmplitude = Math.max(maxAmplitude, Math.abs(sample));
  }
  
  if (maxAmplitude > 0) {
    const normalizationFactor = 0.8 / maxAmplitude;
    for (let i = 0; i < trimmedAudio.length; i++) {
      trimmedAudio[i] *= normalizationFactor;
    }
  }
  
  return trimmedAudio;
}

Memory Management and Optimization

Running large neural networks in browsers requires careful memory management.

Chunked Processing

For long texts, we automatically split content into manageable chunks:

export function chunkText(text: string): string[] {
  const MAX_CHUNK_LENGTH = 500;
  // Split after sentence-ending punctuation, keeping the punctuation
  const sentences = text.split(/(?<=[.!?])(?=\s+[A-Z]|$)/);

  const chunks: string[] = [];
  let currentChunk = '';

  for (const sentence of sentences) {
    const trimmed = sentence.trim();
    if (!trimmed) continue; // skip the empty trailing split

    const potentialChunk = currentChunk + (currentChunk ? ' ' : '') + trimmed;

    if (potentialChunk.length > MAX_CHUNK_LENGTH) {
      if (currentChunk) chunks.push(currentChunk);
      currentChunk = trimmed;
    } else {
      currentChunk = potentialChunk;
    }
  }

  if (currentChunk) chunks.push(currentChunk);
  return chunks;
}
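Each chunk is then synthesized independently and the resulting buffers are stitched back together. A sketch of that loop, assuming the generateSingle method from earlier; the generateLong wrapper is illustrative:

async function generateLong(text: string, options: GenerateOptions): Promise<Float32Array> {
  const parts: Float32Array[] = [];
  let totalLength = 0;

  for (const chunk of chunkText(text)) {
    const audio = await tts.generateSingle(chunk, options);
    parts.push(audio);
    totalLength += audio.length;
  }

  // Stitch the per-chunk buffers into one contiguous waveform
  const combined = new Float32Array(totalLength);
  let offset = 0;
  for (const part of parts) {
    combined.set(part, offset);
    offset += part.length;
  }
  return combined;
}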

IndexedDB Caching

We implement a sophisticated caching system using IndexedDB:

class ModelCache {
  async set(key: string, data: ArrayBuffer): Promise<void> {
    const db = await this.init();
    return new Promise((resolve, reject) => {
      const transaction = db.transaction([this.storeName], 'readwrite');
      const store = transaction.objectStore(this.storeName);
      const request = store.put({
        key,
        data,
        timestamp: Date.now()
      });
      
      request.onsuccess = () => resolve();
      request.onerror = () => reject(request.error);
    });
  }
  
  async get(key: string): Promise<ArrayBuffer | null> {
    const db = await this.init();
    return new Promise((resolve) => {
      const transaction = db.transaction([this.storeName], 'readonly');
      const store = transaction.objectStore(this.storeName);
      const request = store.get(key);

      request.onsuccess = () => {
        const result = request.result;
        // Expire cached models after 7 days
        if (result && Date.now() - result.timestamp < 7 * 24 * 60 * 60 * 1000) {
          resolve(result.data);
        } else {
          resolve(null);
        }
      };
      // Treat read errors as a cache miss rather than leaving the promise hanging
      request.onerror = () => resolve(null);
    });
  }
}
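Both methods call this.init() to open the database. A minimal version using the plain IndexedDB API might look like this inside ModelCache; the database and store names are illustrative:

private dbName = 'kitten-tts-cache';  // illustrative name
private storeName = 'models';         // illustrative name

private init(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open(this.dbName, 1);

    // Create the object store on first open (or version bump)
    request.onupgradeneeded = () => {
      const db = request.result;
      if (!db.objectStoreNames.contains(this.storeName)) {
        db.createObjectStore(this.storeName, { keyPath: 'key' });
      }
    };

    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}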

Audio Format Conversion

Converting Float32Array audio data to playable WAV format happens entirely in JavaScript:

export function createWavBlob(audioData: Float32Array, sampleRate: number): Blob {
  const buffer = new ArrayBuffer(44 + audioData.length * 2);
  const view = new DataView(buffer);
  
  // Write WAV header
  const writeString = (offset: number, string: string) => {
    for (let i = 0; i < string.length; i++) {
      view.setUint8(offset + i, string.charCodeAt(i));
    }
  };
  
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + audioData.length * 2, true); // file size minus 8 bytes
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);     // sample rate
  view.setUint32(28, sampleRate * 2, true); // byte rate (sampleRate * blockAlign)
  view.setUint16(32, 2, true);              // block align (mono, 16-bit)
  view.setUint16(34, 16, true);             // bits per sample
  writeString(36, 'data');
  view.setUint32(40, audioData.length * 2, true); // data chunk size
  
  // Convert float to 16-bit PCM
  let offset = 44;
  for (let i = 0; i < audioData.length; i++) {
    const sample = Math.max(-1, Math.min(1, audioData[i]));
    view.setInt16(offset, sample < 0 ? sample * 0x8000 : sample * 0x7FFF, true);
    offset += 2;
  }
  
  return new Blob([buffer], { type: 'audio/wav' });
}
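Playing the result is then plain web-platform plumbing; the 24000 Hz sample rate here is our assumption about KittenTTS's output rate:

const wavBlob = createWavBlob(samples, 24000); // sample rate assumed
const url = URL.createObjectURL(wavBlob);

const player = new Audio(url);
await player.play();

// Release the blob URL once playback finishes
player.onended = () => URL.revokeObjectURL(url);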

Performance Optimizations

Progressive Loading

We've implemented several strategies to improve perceived performance:

  • Embedded assets: Critical model files bundled with the application
  • IndexedDB caching: Models persist across browser sessions
  • Lazy initialization: ONNX Runtime loads only when needed (sketched below)
  • Background processing: Model loading doesn't block the UI
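Lazy initialization, for instance, is just a dynamic import, so the ONNX Runtime bundle stays out of the initial page load. A minimal sketch:

// Cache the dynamic import so the runtime is fetched at most once
let ortPromise: Promise<typeof import('onnxruntime-web')> | null = null;

function getOrt() {
  ortPromise ??= import('onnxruntime-web');
  return ortPromise;
}

// Later, when the user actually hits "Generate":
const { InferenceSession } = await getOrt();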

Browser Compatibility

Our implementation gracefully handles different browser capabilities:

// Try WebAssembly backend first, fallback to CPU
try {
  this.model = await InferenceSession.create(modelBuffer, {
    executionProviders: ['wasm'],
    graphOptimizationLevel: 'disabled'
  });
} catch (wasmError) {
  console.warn('WebAssembly failed, using CPU backend');
  this.model = await InferenceSession.create(modelBuffer, {
    executionProviders: ['cpu'],
    graphOptimizationLevel: 'basic'
  });
}

Security and Privacy Implications

Running AI models client-side has significant privacy advantages:

  • Zero data transmission: Text never leaves the user's device
  • No server logs: No record of what users synthesize
  • Offline capability: Works without internet after initial load
  • No API keys: No authentication or usage tracking

Cross-Origin Isolation Headers

These are often lumped in with CSP, but they're actually cross-origin isolation (COOP/COEP) headers: they unlock SharedArrayBuffer, and with it multi-threaded WebAssembly, in modern browsers:

# netlify.toml
[[headers]]
  for = "/*"
  [headers.values]
    Cross-Origin-Embedder-Policy = "require-corp"
    Cross-Origin-Opener-Policy = "same-origin"

Challenges and Limitations

Current Limitations

  • Initial load time: First-time users wait 8-15 seconds for model download
  • Memory usage: Neural networks consume 100-200MB RAM during inference
  • Browser compatibility: Requires modern browsers with WebAssembly support
  • Mobile performance: Slower generation on resource-constrained devices

Future Optimizations

Several improvements are on our roadmap:

  • Model quantization: Reduce model size through 8-bit precision
  • WebGL acceleration: Leverage GPU when available
  • Streaming inference: Generate audio progressively for long texts
  • Service Worker caching: More aggressive asset caching strategies

The Bigger Picture

Our KittenTTS implementation represents a broader trend toward edge AI computing. By running neural networks in browsers, we're enabling:

  • Privacy-preserving AI: No data leaves the user's device
  • Reduced infrastructure costs: No expensive GPU servers required
  • Better user experience: No network latency or rate limits
  • Democratized AI: Advanced capabilities accessible to everyone

Technical Takeaways

If you're building browser-based AI applications, here are key lessons from our implementation:

  1. ONNX Runtime Web is production-ready for neural network inference in browsers
  2. WebAssembly provides significant performance gains for numerical computing
  3. Careful memory management is crucial for large model deployment
  4. Progressive loading strategies improve perceived performance
  5. IndexedDB caching is essential for models larger than a few megabytes

Try It Yourself

Want to experience browser-based AI voice synthesis? Try our Text to Speech tool and see KittenTTS, ONNX Runtime, and WebAssembly working together in real-time.

For developers interested in implementing similar systems, our KittenTTS JavaScript package (@quickeditvideo/kittentts) provides a clean API for browser-based neural text-to-speech synthesis. (Coming shortly)

The future of AI is moving to the edge—and that includes the browser edge. Privacy-preserving, unlimited, client-side AI is not just possible, it's here today.
