Running Text-to-Speech Fully in the Browser with PocketTTS

Most text-to-speech (TTS) tools still work the same way they did years ago:
send text to a server → wait → receive an audio file.

At QuickEditVideo, we took a very different approach.

Our TTS tool runs entirely in the browser — no servers, no uploads, no API keys — using a compact neural model powered by PocketTTS and jax-js.

In this post, I’ll break down:

  • What PocketTTS is and why it’s a great fit for browsers
  • How in-browser inference actually works
  • Why client-side AI is a big deal for privacy, cost, and UX
  • How modern JS + WASM tooling makes this possible today

👉 Live demo: https://quickeditvideo.com/tts/

What Is PocketTTS?

PocketTTS is a lightweight neural text-to-speech system designed for edge devices and CPU-only inference.

Its design goals are very different from large cloud TTS models:

  • Small model size (around 300MB)
  • Fast CPU inference
  • Deterministic output
  • Low memory pressure
  • Easy embedding

That makes it ideal for environments like:

  • Browsers
  • Desktop apps
  • Offline tools
  • Mobile devices

Instead of relying on massive autoregressive architectures, PocketTTS uses a streamlined acoustic model + vocoder pipeline that trades a bit of expressiveness for speed, portability, and predictability — a trade-off that makes sense for real-time tools.

PocketTTS Inference Pipeline (High Level)

Here’s what happens when you click “Generate” in the browser:

Text input
  ↓
Text normalization
  ↓
Phoneme / token encoding
  ↓
Acoustic model (phonemes → features)
  ↓
Vocoder (features → waveform)
  ↓
WAV / audio buffer
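
For illustration, here's the same flow as a TypeScript sketch. The stage functions are hypothetical placeholders, not the actual PocketTTS API:

// Illustrative shape of the pipeline. Each stage is passed in as a function so the
// sketch type-checks on its own; the real implementations live inside PocketTTS.
type Stage<I, O> = (input: I) => O;

function synthesize(
  text: string,
  normalize: Stage<string, string>,                 // text normalization
  tokenize: Stage<string, Uint32Array>,             // phoneme / token encoding
  acousticModel: Stage<Uint32Array, Float32Array>,  // tokens -> acoustic features
  vocoder: Stage<Float32Array, Float32Array>,       // features -> PCM waveform
): Float32Array {
  return vocoder(acousticModel(tokenize(normalize(text))));
}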

Every one of these steps runs locally — nothing leaves your device.

No servers.
No background jobs.
No queueing.

Why Run TTS in the Browser?

Most “privacy-friendly” AI tools still send your data somewhere.

With in-browser TTS:

  • Text never leaves your machine
  • Audio is generated locally
  • Nothing is logged, stored, or transmitted

This isn’t a policy — it’s a technical guarantee.

Also: zero latency, zero rate limits! The speed you get depends only on the user's device, not your backend capacity.

Last but not least: from a builder's perspective, this is huge.

Running TTS in the browser means:

  • No GPU servers
  • No inference queues
  • No autoscaling
  • No “free tier abuse” issues

Your infrastructure cost is zero. Yay!

How PocketTTS Runs in the Browser

Browsers today are far more capable than most people realize. Yes, I'm talking about WebAssembly and WebGPU! :)

Here’s the stack that makes this work.

🧠 WebAssembly (WASM)

PocketTTS is compiled to WebAssembly, allowing near-native performance inside the browser.

Why WASM matters:

  • Fast numeric computation

  • Predictable memory layout

  • SIMD support (where available)

  • No JS GC interference during inference

This makes CPU-based neural inference viable — even on mid-range laptops.
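
To make that concrete, here's a rough sketch of what driving a WASM inference kernel by hand looks like. The module path and the exports (alloc, synthesize) are hypothetical, not the real PocketTTS ABI:

// Hypothetical sketch: calling a compiled inference kernel from JavaScript.
// Export names and memory layout are illustrative only.
const { instance } = await WebAssembly.instantiateStreaming(fetch("/pocket-tts.wasm"));
const { memory, alloc, synthesize } = instance.exports as {
  memory: WebAssembly.Memory;
  alloc: (bytes: number) => number;
  synthesize: (tokensPtr: number, tokenCount: number, outPtr: number) => number;
};

// Copy token IDs into WASM linear memory, run inference, read PCM back out.
const tokens = new Uint32Array([12, 87, 301]); // toy token IDs
const tokensPtr = alloc(tokens.byteLength);
new Uint32Array(memory.buffer, tokensPtr, tokens.length).set(tokens);

const outPtr = alloc(24000 * 4); // room for ~1 s of float32 samples at 24 kHz
const samplesWritten = synthesize(tokensPtr, tokens.length, outPtr);
const pcm = new Float32Array(memory.buffer, outPtr, samplesWritten);

In our app we don't write that glue by hand; jax-js provides the array API on top, so the generation code reads more like NumPy. Here's a trimmed excerpt from the generation handler (refs like modelRef and tokenizerRef come from the surrounding React component):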

      // Prepare text for TTS
      const [preparedText, framesAfterEos] = prepareTextPrompt(textToGenerate);
      const tokens = tokenizerRef.current.encode(preparedText);

      // Load voice embedding
      const voiceUrl = POCKET_TTS_VOICES[voiceToUse].url;
      const audioPromptData = safetensors.parse(await cachedFetch(voiceUrl));
      const audioPrompt = audioPromptData.tensors.audio_prompt;

      const voiceEmbed = np
        .array(audioPrompt.data as Float32Array<ArrayBuffer>, {
          shape: audioPrompt.shape,
          dtype: np.float32,
        })
        .slice(0)
        .astype(np.float16);

      // Create text embeddings
      const tokensAr = np.array(tokens, { dtype: np.uint32 });
      let embeds = modelRef.current.flowLM.conditionerEmbed.ref.slice(tokensAr);
      embeds = np.concatenate([voiceEmbed, embeds]);

      // Create streaming player and generate audio
      const player = createStreamingPlayer();

      try {
        await playTTS(player, tree.ref(modelRef.current), embeds, {
          framesAfterEos,
          seed: null,
          temperature: 0.7,
          lsdDecodeSteps: 1,
        });

        // Get the generated audio as WAV blob
        const audioBlob = player.toWav();
        const audioUrl = URL.createObjectURL(audioBlob);

        // Update the audio item with the generated URL
        setGeneratedAudios(prev =>
          prev.map(audio =>
            audio.id === audioId
              ? { ...audio, audioUrl, isGenerating: false }
              : audio
          )
        );
        if (autoPlayGeneratedAudio) {
          setLastAutoPlayedAudioId(audioId);
        }

      } finally {
        await player.close();
      }

Under the hood, playTTS runs the frame-by-frame generation loop and streams audio as it is produced:

export async function playTTS(
  player: AudioPlayer,
  model: PocketTTS,
  embeds: np.Array,
  {
    framesAfterEos = 0,
    seed = null,
    lsdDecodeSteps = 1,
    temperature = 0.7,
    noiseClamp = null,
    playChunk = false,
  }: Partial<PlayTTSOptions> = {},
): Promise<void> {
  let sequence = model.flowLM.bosEmb.ref.reshape([1, -1]); // [1, 32]
  let audioPromise: Promise<void> = Promise.resolve();

  if (seed === null) seed = Math.floor(Math.random() * 2 ** 32);
  let key = random.key(seed);

  try {
    let flowLMState = createFlowLMState(model.flowLM);
    let mimiState = createMimiDecodeState(model.mimi);
    let eosStep: number | null = null;

    console.log("Starting TTS generation...");
    let lastTimestamp = performance.now();

    for (let step = 0; step < 1000; step++) {
      let stepKey: np.Array;
      [key, stepKey] = random.split(key);
      const {
        latent,
        isEos,
        state: newFlowLMState,
      } = runFlowLMStep(
        tree.ref(model.flowLM),
        flowLMState,
        stepKey,
        step === 0 ? sequence.ref : sequence.ref.slice([-1]),
        step === 0 ? embeds.ref : null,
        flowLMState.kvCacheLen, // same as offset
        lsdDecodeSteps,
        temperature,
        noiseClamp,
      );
      flowLMState = newFlowLMState;

      const isEosData = await isEos.data();
      if (isEosData[0] && eosStep === null) {
        console.log(`🛑 EOS at step ${step}!`);
        eosStep = step;
      }
      if (eosStep !== null && step >= eosStep + framesAfterEos) {
        console.log(
          `Generation ended at step ${step}, ${framesAfterEos} frames after EOS.`,
        );
        latent.dispose();
        break;
      }

      sequence = np.concatenate([sequence, latent]);

      const timestamp = performance.now();
      console.log(
        `Generated step ${step} in ${(timestamp - lastTimestamp).toFixed(1)} ms`,
      );
      lastTimestamp = timestamp;

      let mimiInput = sequence.ref.slice([-1]);
      mimiInput = mimiInput
        .mul(model.flowLM.embStd.ref)
        .add(model.flowLM.embMean.ref);

      const [audio, newMimiState] = runMimiDecode(
        tree.ref(model.mimi),
        mimiState,
        mimiInput,
        step,
      );
      mimiState = newMimiState;

      const audioPcm = (await np
        .clip(audio.slice(0), -1, 1)
        .astype(np.float32)
        .data()) as Float32Array;
      if (audioPcm.length !== 1920) {
        throw new Error(
          `expected 1920 audio samples, got ${audioPcm.length}`,
        );
      }

      player.recordChunk(audioPcm);
      if (playChunk) {
        const lastAudioPromise = audioPromise;
        audioPromise = (async () => {
          await lastAudioPromise;
          await player.playChunk(audioPcm);
        })();
      }
    }
  } finally {
    sequence.dispose();
    tree.dispose([model, embeds]);
    await audioPromise;
  }
}

Each decode step yields a short chunk of PCM (1920 samples, i.e. 80 ms at the codec's 24 kHz output rate), which is why playback can start while later frames are still being generated.

🧮 Typed Arrays + Audio Buffers

Instead of shuffling data through JSON or base64 blobs, the pipeline uses:

  • Float32Array / Int16Array for tensors

  • Direct audio buffer synthesis

  • Minimal copying between stages

That keeps memory overhead low and performance stable.
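
As a minimal sketch (assuming a 24 kHz sample rate; adjust it to whatever your model produces), playing a Float32Array of samples through the Web Audio API is just a copy into an AudioBuffer:

// Minimal sketch: play a Float32Array of PCM samples with the Web Audio API.
// No JSON or base64 round-trips; the samples are copied straight into an AudioBuffer.
function playPcm(pcm: Float32Array, sampleRate = 24000): void {
  const ctx = new AudioContext();
  const buffer = ctx.createBuffer(1, pcm.length, sampleRate); // 1 channel (mono)
  buffer.copyToChannel(pcm, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  source.start();
}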


📦 Model Caching (IndexedDB)

The model is downloaded once and cached locally using IndexedDB.

On repeat visits:

  • No re-download

  • Near-instant startup

  • Offline-friendly behavior

This is critical for UX when models are 300 MB in size.
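
The pattern itself is simple: check local storage first, and fall back to the network only on a miss. Here's a minimal sketch using the standard Cache API; the real cachedFetch in the app also reports download progress:

// Minimal sketch of "download once, serve locally afterwards" using the Cache API.
async function cachedFetchSketch(url: string): Promise<ArrayBuffer> {
  const cache = await caches.open("tts-model-v1");

  const cached = await cache.match(url);
  if (cached) return cached.arrayBuffer(); // repeat visit: no network request

  const response = await fetch(url);
  await cache.put(url, response.clone()); // clone, because put() consumes the body
  return response.arrayBuffer();
}

And here's the (trimmed) initialization hook from our React component, including the WebGPU check, download progress, and tokenizer loading: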

// Initialize WebGPU and load model
  useEffect(() => {
    let isMounted = true;

    async function initializeModel() {
      let weightsAreCached = false;

      try {
        setLoadingProgress('Initializing WebGPU...');
        const devices = await init();

        if (!devices.includes('webgpu')) {
          throw new Error('WebGPU is not supported on this device. Please use a browser with WebGPU support (Chrome 113+, Edge 113+).');
        }

        defaultDevice('webgpu');

        if (!isMounted) return;

        try {
          const info = await opfs.info(MODEL_WEIGHTS_URL);
          weightsAreCached = info !== null;
        } catch (cacheInfoError) {
          console.warn('Failed to read cached model metadata:', cacheInfoError);
        }

        if (!isMounted) return;

        if (weightsAreCached) {
          setLoadingProgress('Loading cached model weights...');
          setIsWeightsDownloadInProgress(false);
          setWeightsDownloadPercent(null);
        } else {
          setLoadingProgress('Downloading model weights (300MB)...');
          setIsWeightsDownloadInProgress(true);
          setWeightsDownloadPercent(0);
        }

        const weightsData = await cachedFetch(
          MODEL_WEIGHTS_URL,
          undefined,
          (progress) => {
            if (weightsAreCached || !isMounted) return;
            const total = progress.totalBytes ?? ESTIMATED_MODEL_WEIGHTS_SIZE;
            const percent = total > 0 ? Math.min(100, Math.round((progress.loadedBytes / total) * 100)) : 0;
            setWeightsDownloadPercent(percent);
          }
        );
        weightsRef.current = safetensors.parse(weightsData);

        if (!isMounted) return;

        if (!weightsAreCached) {
          setWeightsDownloadPercent(100);
          setIsWeightsDownloadInProgress(false);
        }

        setLoadingProgress('Loading model...');
        modelRef.current = fromSafetensors(weightsRef.current);

        if (!isMounted) return;

        // Load tokenizer
        setLoadingProgress('Loading tokenizer...');
        tokenizerRef.current = await tokenizers.loadSentencePiece(TOKENIZER_URL);

        if (!isMounted) return;

        setIsModelLoaded(true);
        setIsModelLoading(false);
        setLoadingProgress('');
        console.log('Pocket TTS model loaded successfully');

      } catch (err) {
        console.error('Failed to initialize Pocket TTS:', err);
        if (isMounted) {
          if (!weightsAreCached) {
            setIsWeightsDownloadInProgress(false);
            setWeightsDownloadPercent(null);
          }
          setError(err instanceof Error ? err.message : 'Failed to load TTS model');
          setIsModelLoading(false);
        }
      }
    }

    initializeModel();

    return () => {
      isMounted = false;
    };
  }, []);

🧪 Why This Approach Actually Works Now

A few years ago, this would’ve been unrealistic.

Today, we benefit from:

  • Faster JS engines

  • Mature WASM runtimes

  • Better browser audio APIs

  • Smarter lightweight models like PocketTTS

We’re not brute-forcing AI into the browser — we’re choosing models that fit the medium.

That mindset shift is the key.

🚀 Try It Yourself

If you’re curious what modern browser-based TTS feels like:

👉 https://quickeditvideo.com/tts/

  • No login

  • No limits

  • Fully client-side

  • Instant playback


🧠 Final Thought

AI doesn’t have to live on massive servers.

With the right models — like PocketTTS — and modern web runtimes, the browser becomes a serious inference platform.

This isn’t just about TTS.

It’s a glimpse of where web-native AI is heading.

If you’re building tools for creators, educators, or everyday users — it’s worth paying attention.
