Most text-to-speech (TTS) tools still work the same way they did years ago:
send text to a server → wait → receive an audio file.
At QuickEditVideo, we took a very different approach.
Our TTS tool runs entirely in the browser — no servers, no uploads, no API keys — using a compact neural model powered by PocketTTS and jax-js.
In this post, I’ll break down:
- What PocketTTS is and why it’s a great fit for browsers
- How in-browser inference actually works
- Why client-side AI is a big deal for privacy, cost, and UX
- How modern JS + WASM tooling makes this possible today
👉 Live demo: https://quickeditvideo.com/tts/
What Is PocketTTS?
PocketTTS is a lightweight neural text-to-speech system designed for edge devices and CPU-only inference.
Its design goals are very different from large cloud TTS models:
- Small model size (around 300MB)
- Fast CPU inference
- Deterministic output
- Low memory pressure
- Easy embedding
That makes it ideal for environments like:
- Browsers
- Desktop apps
- Offline tools
- Mobile devices
Instead of relying on massive autoregressive architectures, PocketTTS uses a streamlined acoustic model + vocoder pipeline that trades a bit of expressiveness for speed, portability, and predictability — a trade-off that makes sense for real-time tools.
PocketTTS Inference Pipeline (High Level)
Here’s what happens when you click “Generate” in the browser:
Text input
↓
Text normalization
↓
Phoneme / token encoding
↓
Acoustic model (phonemes → features)
↓
Vocoder (features → waveform)
↓
WAV / audio buffer
Every one of these steps runs locally — nothing leaves your device.
No servers.
No background jobs.
No queueing.
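To make that shape concrete, here's a minimal TypeScript sketch of the pipeline. The stage names and types are hypothetical placeholders for illustration, not the actual PocketTTS API:
// Illustrative only: hypothetical stage types, not the real PocketTTS API.
// Every stage is a plain local function call; nothing crosses a network boundary.
type Stage<I, O> = (input: I) => O;

interface TtsPipeline {
  normalize: Stage<string, string>;                // "Dr." -> "Doctor", "3" -> "three", ...
  tokenize: Stage<string, Uint32Array>;            // normalized text -> phoneme / token IDs
  acousticModel: Stage<Uint32Array, Float32Array>; // tokens -> acoustic features
  vocoder: Stage<Float32Array, Float32Array>;      // features -> PCM waveform in [-1, 1]
}

function synthesize(tts: TtsPipeline, text: string): Float32Array {
  return tts.vocoder(tts.acousticModel(tts.tokenize(tts.normalize(text))));
}
The important property is that each stage is an ordinary local function call, so there is no point in the chain where your text could leave the device.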
Why Run TTS in the Browser?
Most “privacy-friendly” AI tools still send your data somewhere.
With in-browser TTS:
- Text never leaves your machine
- Audio is generated locally
- Nothing is logged, stored, or transmitted
This isn’t a policy — it’s a technical guarantee.
Also: zero latency, zero rate limits. The speed you get depends only on the user’s device — not your backend capacity.
Last but not least, from a builder’s perspective, this is huge.
Running TTS in the browser means:
- no GPU servers
- no inference queues
- no autoscaling
- no “free tier abuse” issues
Your infrastructure cost is zero. Yay!
How PocketTTS Runs in the Browser
Browsers today are much more capable than most people realize. Yes, I'm talking about WebAssembly and WebGPU! :)
Here’s the stack that makes this work.
🧠 WebAssembly (WASM)
PocketTTS is compiled to WebAssembly, allowing near-native performance inside the browser.
Why WASM matters:
- Fast numeric computation
- Predictable memory layout
- SIMD support (where available)
- No JS GC interference during inference
This makes CPU-based neural inference viable — even on mid-range laptops.
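Where SIMD is available, the runtime can take a faster path. If you want to check support yourself, a common trick is to ask WebAssembly.validate to verify a tiny module that uses a SIMD instruction. This snippet is illustrative, not part of our codebase:
// Feature-detect WASM SIMD by validating a minimal module that uses a v128 instruction.
// If the engine rejects SIMD opcodes, this returns false and a scalar path can be used instead.
const WASM_SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0,                    // "\0asm" magic + version 1
  1, 5, 1, 96, 0, 1, 123,                          // type section: () -> v128
  3, 2, 1, 0,                                      // function section: one function of type 0
  10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,    // code: i32.const 0; i8x16.splat; i8x16.popcnt; end
]);

const hasWasmSimd: boolean = WebAssembly.validate(WASM_SIMD_PROBE);
console.log(`WASM SIMD supported: ${hasWasmSimd}`);
With that stack in place, here's what the generation path looks like in the tool: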
// Prepare text for TTS
const [preparedText, framesAfterEos] = prepareTextPrompt(textToGenerate);
const tokens = tokenizerRef.current.encode(preparedText);
// Load voice embedding
const voiceUrl = POCKET_TTS_VOICES[voiceToUse].url;
const audioPromptData = safetensors.parse(await cachedFetch(voiceUrl));
const audioPrompt = audioPromptData.tensors.audio_prompt;
const voiceEmbed = np
.array(audioPrompt.data as Float32Array<ArrayBuffer>, {
shape: audioPrompt.shape,
dtype: np.float32,
})
.slice(0)
.astype(np.float16);
// Create text embeddings
const tokensAr = np.array(tokens, { dtype: np.uint32 });
let embeds = modelRef.current.flowLM.conditionerEmbed.ref.slice(tokensAr);
embeds = np.concatenate([voiceEmbed, embeds]);
// Create streaming player and generate audio
const player = createStreamingPlayer();
try {
await playTTS(player, tree.ref(modelRef.current), embeds, {
framesAfterEos,
seed: null,
temperature: 0.7,
lsdDecodeSteps: 1,
});
// Get the generated audio as WAV blob
const audioBlob = player.toWav();
const audioUrl = URL.createObjectURL(audioBlob);
// Update the audio item with the generated URL
setGeneratedAudios(prev =>
prev.map(audio =>
audio.id === audioId
? { ...audio, audioUrl, isGenerating: false }
: audio
)
);
if (autoPlayGeneratedAudio) {
setLastAutoPlayedAudioId(audioId);
}
} finally {
await player.close();
}
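The playTTS helper called above is where the generation loop lives: each step runs the flow LM to produce one latent frame, decodes it to PCM with the Mimi decoder, and hands the chunk to the streaming player.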
export async function playTTS(
player: AudioPlayer,
model: PocketTTS,
embeds: np.Array,
{
framesAfterEos = 0,
seed = null,
lsdDecodeSteps = 1,
temperature = 0.7,
noiseClamp = null,
playChunk = false,
}: Partial<PlayTTSOptions> = {},
): Promise<void> {
let sequence = model.flowLM.bosEmb.ref.reshape([1, -1]); // [1, 32]
let audioPromise: Promise<void> = Promise.resolve();
if (seed === null) seed = Math.floor(Math.random() * 2 ** 32);
let key = random.key(seed);
try {
let flowLMState = createFlowLMState(model.flowLM);
let mimiState = createMimiDecodeState(model.mimi);
let eosStep: number | null = null;
console.log("Starting TTS generation...");
let lastTimestamp = performance.now();
for (let step = 0; step < 1000; step++) {
let stepKey: np.Array;
[key, stepKey] = random.split(key);
const {
latent,
isEos,
state: newFlowLMState,
} = runFlowLMStep(
tree.ref(model.flowLM),
flowLMState,
stepKey,
step === 0 ? sequence.ref : sequence.ref.slice([-1]),
step === 0 ? embeds.ref : null,
flowLMState.kvCacheLen, // same as offset
lsdDecodeSteps,
temperature,
noiseClamp,
);
flowLMState = newFlowLMState;
const isEosData = await isEos.data();
if (isEosData[0] && eosStep === null) {
console.log(`🛑 EOS at step ${step}!`);
eosStep = step;
}
if (eosStep !== null && step >= eosStep + framesAfterEos) {
console.log(
`Generation ended at step ${step}, ${framesAfterEos} frames after EOS.`,
);
latent.dispose();
break;
}
sequence = np.concatenate([sequence, latent]);
const timestamp = performance.now();
console.log(
`Generated step ${step} in ${(timestamp - lastTimestamp).toFixed(1)} ms`,
);
lastTimestamp = timestamp;
let mimiInput = sequence.ref.slice([-1]);
mimiInput = mimiInput
.mul(model.flowLM.embStd.ref)
.add(model.flowLM.embMean.ref);
const [audio, newMimiState] = runMimiDecode(
tree.ref(model.mimi),
mimiState,
mimiInput,
step,
);
mimiState = newMimiState;
const audioPcm = (await np
.clip(audio.slice(0), -1, 1)
.astype(np.float32)
.data()) as Float32Array;
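// Sanity check: each decode step should produce exactly one Mimi frame of
// 1920 PCM samples (80 ms of audio at the codec's 24 kHz output rate).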
if (audioPcm.length !== 1920) {
throw new Error(
`expected 1920 audio samples, got ${audioPcm.length}`,
);
}
player.recordChunk(audioPcm);
if (playChunk) {
const lastAudioPromise = audioPromise;
audioPromise = (async () => {
await lastAudioPromise;
await player.playChunk(audioPcm);
})();
}
}
} finally {
sequence.dispose();
tree.dispose([model, embeds]);
await audioPromise;
}
🧮 Typed Arrays + Audio Buffers
Instead of shuffling data through JSON or base64 blobs, the pipeline uses:
- Float32Array / Int16Array for tensors
- Direct audio buffer synthesis
- Minimal copying between stages
That keeps memory overhead low and performance stable.
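For example, once a chunk of PCM samples exists as a Float32Array, it can be wrapped into a Web Audio AudioBuffer with a single copy. This is a simplified sketch rather than our actual streaming player, and the 24000 Hz sample rate here is an assumption for illustration:
// Minimal sketch: play a Float32Array of PCM samples in [-1, 1] via Web Audio.
// One copy into the AudioBuffer, no JSON or base64 round-trips.
function playPcm(ctx: AudioContext, pcm: Float32Array, sampleRate = 24000): void {
  const buffer = ctx.createBuffer(1, pcm.length, sampleRate); // 1 channel (mono)
  buffer.copyToChannel(pcm, 0);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  source.start();
}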
📦 Model Caching (IndexedDB)
The model is downloaded once and cached locally using IndexedDB.
On repeat visits:
- No re-download
- Near-instant startup
- Offline-friendly behavior
This is critical for UX when models are 300 MB in size. Here's how the tool initializes everything: WebGPU, cached (or freshly downloaded) weights, and the tokenizer.
// Initialize WebGPU and load model
useEffect(() => {
let isMounted = true;
async function initializeModel() {
let weightsAreCached = false;
try {
setLoadingProgress('Initializing WebGPU...');
const devices = await init();
if (!devices.includes('webgpu')) {
throw new Error('WebGPU is not supported on this device. Please use a browser with WebGPU support (Chrome 113+, Edge 113+).');
}
defaultDevice('webgpu');
if (!isMounted) return;
try {
const info = await opfs.info(MODEL_WEIGHTS_URL);
weightsAreCached = info !== null;
} catch (cacheInfoError) {
console.warn('Failed to read cached model metadata:', cacheInfoError);
}
if (!isMounted) return;
if (weightsAreCached) {
setLoadingProgress('Loading cached model weights...');
setIsWeightsDownloadInProgress(false);
setWeightsDownloadPercent(null);
} else {
setLoadingProgress('Downloading model weights (300MB)...');
setIsWeightsDownloadInProgress(true);
setWeightsDownloadPercent(0);
}
const weightsData = await cachedFetch(
MODEL_WEIGHTS_URL,
undefined,
(progress) => {
if (weightsAreCached || !isMounted) return;
const total = progress.totalBytes ?? ESTIMATED_MODEL_WEIGHTS_SIZE;
const percent = total > 0 ? Math.min(100, Math.round((progress.loadedBytes / total) * 100)) : 0;
setWeightsDownloadPercent(percent);
}
);
weightsRef.current = safetensors.parse(weightsData);
if (!isMounted) return;
if (!weightsAreCached) {
setWeightsDownloadPercent(100);
setIsWeightsDownloadInProgress(false);
}
setLoadingProgress('Loading model...');
modelRef.current = fromSafetensors(weightsRef.current);
if (!isMounted) return;
// Load tokenizer
setLoadingProgress('Loading tokenizer...');
tokenizerRef.current = await tokenizers.loadSentencePiece(TOKENIZER_URL);
if (!isMounted) return;
setIsModelLoaded(true);
setIsModelLoading(false);
setLoadingProgress('');
console.log('Pocket TTS model loaded successfully');
} catch (err) {
console.error('Failed to initialize Pocket TTS:', err);
if (isMounted) {
if (!weightsAreCached) {
setIsWeightsDownloadInProgress(false);
setWeightsDownloadPercent(null);
}
setError(err instanceof Error ? err.message : 'Failed to load TTS model');
setIsModelLoading(false);
}
}
}
initializeModel();
return () => {
isMounted = false;
};
}, []);
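The cachedFetch helper used above isn't shown here, but conceptually it follows a simple download-once pattern. Below is a hypothetical, simplified sketch of that idea using a plain IndexedDB store; the real helper also reports download progress and its storage details differ:
// Hypothetical sketch of a "download once, cache locally, reuse on later visits" helper.
// Not the actual cachedFetch implementation; for illustration only.
const DB_NAME = 'model-cache';
const STORE = 'files';

function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(DB_NAME, 1);
    req.onupgradeneeded = () => req.result.createObjectStore(STORE);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function fetchWithLocalCache(url: string): Promise<ArrayBuffer> {
  const db = await openDb();
  // Try the local copy first.
  const cached = await new Promise<ArrayBuffer | undefined>((resolve, reject) => {
    const req = db.transaction(STORE, 'readonly').objectStore(STORE).get(url);
    req.onsuccess = () => resolve(req.result as ArrayBuffer | undefined);
    req.onerror = () => reject(req.error);
  });
  if (cached) return cached; // repeat visit: no re-download

  // First visit: download, then persist for next time.
  const data = await (await fetch(url)).arrayBuffer();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction(STORE, 'readwrite');
    tx.objectStore(STORE).put(data, url);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
  return data;
}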
🧪 Why This Approach Actually Works Now
A few years ago, this would’ve been unrealistic.
Today, we benefit from:
- Faster JS engines
- Mature WASM runtimes
- Better browser audio APIs
- Smarter lightweight models like PocketTTS
We’re not brute-forcing AI into the browser — we’re choosing models that fit the medium.
That mindset shift is the key.
🚀 Try It Yourself
If you’re curious what modern browser-based TTS feels like:
👉 https://quickeditvideo.com/tts/
- No login
- No limits
- Fully client-side
- Instant playback
🧠 Final Thought
AI doesn’t have to live on massive servers.
With the right models — like PocketTTS — and modern web runtimes, the browser becomes a serious inference platform.
This isn’t just about TTS.
It’s a glimpse of where web-native AI is heading.
If you’re building tools for creators, educators, or everyday users — it’s worth paying attention.