DEV Community

tronghieuit

I Built a 1.6M-Parameter Offline Text-to-Speech Engine for Node.js — Here's How

Hi everyone! I'm a developer passionate about making AI accessible on low-resource devices. I've been working on speech synthesis for a while, and I wanted to share a project I've been building: TinyTTS.

The idea started from a simple frustration — I needed text-to-speech in a Node.js app, but every option either required Python, called a cloud API, or shipped a massive model. I thought: what if TTS could be as easy as npm install and just work offline?

So I built one from scratch.

TL;DR

  • 1.6M parameters — smallest TTS model I know of that still sounds natural
  • ~3.4 MB ONNX model (auto-downloaded on first use)
  • 44.1 kHz output, ~53x real-time on a laptop CPU
  • Zero Python dependency — pure Node.js + ONNX Runtime
  • 100% G2P match with the Python version
```shell
npm install tiny-tts
```

```javascript
const TinyTTS = require('tiny-tts');
const tts = new TinyTTS();
await tts.speak('Hello world!', { output: 'hello.wav' });
```

The Problem

Most TTS solutions for Node.js fall into one of these categories:

| Approach | Downside |
| --- | --- |
| Cloud APIs (Google, AWS, Azure) | Requires internet, costs money, privacy concerns |
| Python wrappers (Coqui, Bark, etc.) | Needs Python installed, 100 MB–1 GB models |
| System TTS (say.js, espeak) | Robotic quality, platform-dependent |
| WebSocket to a Python server | Extra infra, latency, complexity |

I wanted something that's npm install and done. Run on a $5 VPS, a Raspberry Pi, or in a CI pipeline — no cloud, no Python, no hassle.


The Architecture

TinyTTS is an end-to-end VITS-based model compressed down to just 1.62 million parameters:

```
Text → G2P → Phoneme IDs → ONNX Model → 44.1 kHz WAV
```
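To make the phoneme-ID stage of this pipeline concrete, here's a toy sketch of that step. The phoneme inventory and ID table below are made up for illustration; the real model ships its own mapping.

```javascript
// Toy illustration of the "Phoneme IDs" stage: map phoneme symbols
// (output of G2P) to the integer IDs a TTS model consumes.
// This table is hypothetical, not the real tiny-tts inventory.
const PHONEME_IDS = { _: 0, h: 1, ə: 2, l: 3, oʊ: 4 };

function phonemesToIds(phonemes) {
  // Unknown symbols fall back to the padding ID (0) in this sketch.
  return phonemes.map((p) => PHONEME_IDS[p] ?? 0);
}

console.log(phonemesToIds(['h', 'ə', 'l', 'oʊ'])); // → [1, 2, 3, 4]
```

The resulting integer array is what gets fed to the ONNX session as the model's input tensor.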

How small is 1.6M params?

| Model | Parameters | Size |
| --- | --- | --- |
| TinyTTS | 1.6M | ~3.4 MB |
| Piper | ~63M | ~63 MB |
| Kokoro | 82M | ~330 MB |
| Coqui XTTS | 467M | ~1.8 GB |

Benchmark (CPU only, same machine)

| Engine | Synthesis Time | Audio Duration | RTFx |
| --- | --- | --- | --- |
| TinyTTS (ONNX) | 92 ms | 4.88 s | ~53x |
| Piper (ONNX) | 112 ms | 2.91 s | ~26x |
| Kokoro (ONNX) | 933 ms | 3.16 s | ~3x |
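For reference, the RTFx column is just audio duration divided by synthesis time (higher means faster than real time). A quick sanity check on the numbers above:

```javascript
// RTFx (real-time factor): seconds of audio produced per second of compute.
function rtfx(audioSeconds, synthSeconds) {
  return audioSeconds / synthSeconds;
}

// Numbers from the benchmark table (92 ms → 0.092 s, etc.)
console.log(rtfx(4.88, 0.092).toFixed(0) + 'x'); // "53x"
console.log(rtfx(2.91, 0.112).toFixed(0) + 'x'); // "26x"
```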

Usage

API

```javascript
const TinyTTS = require('tiny-tts');

const tts = new TinyTTS();

// Basic synthesis to a WAV file
await tts.speak('Hello world!', { output: 'hello.wav' });

// Adjust the speaking rate
await tts.speak('This is faster.', {
  output: 'fast.wav',
  speed: 1.5
});

// Release the ONNX session when done
await tts.dispose();
```
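If you need many clips (say, narrating a list of strings), one pattern is to reuse a single instance and synthesize sequentially instead of spinning up a new engine per call. Here's a self-contained sketch; `speak` below is a stand-in for `tts.speak` so the snippet runs on its own:

```javascript
// Stand-in for tts.speak: pretend we wrote a WAV and return its path.
// In real code, replace this with the tiny-tts instance's speak method.
async function speak(text, opts) {
  return opts.output;
}

async function synthesizeAll(lines) {
  const files = [];
  for (let i = 0; i < lines.length; i++) {
    // Sequential awaits keep memory flat and avoid contending for the CPU.
    files.push(await speak(lines[i], { output: `clip-${i}.wav` }));
  }
  return files;
}

synthesizeAll(['First line.', 'Second line.']).then(console.log);
// → ['clip-0.wav', 'clip-1.wav']
```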

CLI

```shell
npx tiny-tts "The weather is nice today." -o weather.wav
npx tiny-tts "Quick test" -o test.wav --speed 1.3
```

Python

Also available on PyPI with identical output:

```shell
pip install tiny-tts
```

```python
from tiny_tts import TinyTTS

tts = TinyTTS()
tts.speak("Hello world!", output_path="hello.wav")
```

What's Next

This is just the beginning. Here's what I'm working on:

  • Improve voice quality — better prosody, more natural intonation, reduce artifacts while keeping the model tiny
  • More voices — different speakers, genders, and speaking styles
  • Multi-language support — expanding beyond English to other languages

Links


If you've read this far — try it out and let me know what you think! I'm especially curious about edge use cases: IoT, CI/CD audio generation, accessibility tools, game dev, etc.
