WonderLab

Posted on Mar 9 • Edited on Jun 4

Open Source Project of the Day (Part 11): Supertonic - Lightning-Fast On-Device Multilingual TTS

#ai #opensource #tts #cpp

Introduction

"What if speech synthesis could run on your device at 1000+ characters per second, completely offline, in 5 languages?"

This is Part 11 of the "Open Source Project of the Day" series. Today we explore Supertonic (GitHub).

Traditional TTS systems either rely on cloud APIs, with the latency and privacy concerns that come with them, or run locally but slowly and with poor quality. Supertonic uses ONNX Runtime to deliver fast, high-quality, fully on-device speech synthesis: it reaches 1000+ characters/second on an M1 Mac, supports 5 languages, and has built-in text normalization, so raw text needs no preprocessing.

What You'll Learn

Supertonic's core architecture and technical characteristics
How to use Supertonic for TTS across various platforms
The advantages and implementation of ONNX Runtime
How built-in text normalization works
Streaming processing and real-time speech synthesis
Comparative analysis with other TTS systems
How to start building applications with Supertonic

Prerequisites

Basic understanding of TTS (Text-to-Speech)
Familiarity with at least one programming language (Python, JavaScript, Swift, Java, etc.)
Basic understanding of ONNX concepts (optional)
Basic knowledge of on-device AI (optional)

Project Background

Project Introduction

Supertonic is a lightning-fast, on-device, multilingual Text-to-Speech (TTS) system designed for ultimate performance and minimal computational overhead. Running on ONNX Runtime, it operates entirely on-device — no cloud, no API calls, no privacy concerns.

Core problems the project solves:

Cloud TTS has latency and privacy issues
Traditional on-device TTS is slow and low quality
Lack of multilingual support
Text normalization requires preprocessing
Different platforms need different implementations

Target user groups:

Mobile app developers needing on-device TTS
Desktop app developers needing offline speech synthesis
Developers with privacy requirements
Internationalized app developers needing multilingual TTS
Developers requiring extreme performance

Author/Team Introduction

Team: Supertone Inc.

Background: Technology company focused on voice technology and AI
Contributors: 4 contributors, including the core development team
Philosophy: Build a blazing-fast, high-quality, fully on-device TTS system

Project creation date: 2024 (based on GitHub activity, an actively maintained project)

Project Stats

⭐ GitHub Stars: 2.6k+ (rapidly and continuously growing)
🍴 Forks: 232+
📦 Version: v2.0.0 (latest version, released January 6, 2026)
📄 License: MIT (code), OpenRAIL-M (model)
🌐 Demo: Hugging Face Spaces
📚 Documentation: GitHub README includes complete usage guides
💬 Community: Active GitHub Issues

Project development history:

2024: Project created, released v1
2024-2025: Continuous optimization, added multilingual support
2026: Released v2 (v2.0.0, January 6), significant performance improvements
2026: Continuous iteration, growing community activity

Main Features

Core Purpose

Supertonic's core purpose is to provide a blazing-fast, high-quality, fully on-device TTS system, with main features including:

Blazing-fast speech synthesis: Reaches 1000+ characters/second on an M1 Mac
Multilingual support: Supports 5 languages: English, Chinese, Korean, Spanish, and Portuguese
Intelligent text normalization: Built-in text normalization requiring no preprocessing
Streaming processing: Supports streaming TTS for real-time speech synthesis
Fully offline: No cloud required, runs entirely on-device

Use Cases

Mobile applications
- Reading assistant apps
- Voice navigation apps
- Accessibility apps
Desktop applications
- E-book readers
- Document reading tools
- Voice assistants
Web applications
- Browser extensions
- Online speech synthesis services
- Voice chat applications
IoT devices
- Smart speakers
- Voice interaction devices
- Edge computing devices

Quick Start

Installation

Supertonic supports multiple programming languages and platforms:

Python:

# Install Python package
pip install supertonic

# Usage example
from supertonic import SupertonicTTS

tts = SupertonicTTS()
audio = tts.synthesize("Hello, world!")

JavaScript/Node.js:

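# Install npm package (shell)
npm install supertonic

// Usage example (Node.js)
const { SupertonicTTS } = require('supertonic');

// await is only valid inside an async function in CommonJS modules
(async () => {
    const tts = new SupertonicTTS();
    const audio = await tts.synthesize("Hello, world!");
})();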

Other platforms:

C++: Use the implementation in the cpp directory
Swift: Use the implementation in the swift directory
Java: Use the implementation in the java directory
C#: Use the implementation in the csharp directory
Go: Use the implementation in the go directory
Rust: Use the implementation in the rust directory
Flutter: Use the implementation in the flutter directory
Web: Use the implementation in the web directory

Simplest Usage Examples

Python example:

from supertonic import SupertonicTTS

# Initialize TTS engine
tts = SupertonicTTS()

# Synthesize speech
text = "Supertonic is a lightning-fast, on-device TTS system."
audio = tts.synthesize(text)

# Save audio file
with open("output.wav", "wb") as f:
    f.write(audio)

JavaScript example:

const { SupertonicTTS } = require('supertonic');

async function synthesize() {
    const tts = new SupertonicTTS();
    const audio = await tts.synthesize("Supertonic is lightning-fast!");
    // Process audio data
    console.log("Audio generated:", audio.length, "bytes");
}

synthesize();

Core Features

Blazing-fast performance: 1000+ characters/second on M1 Mac, far surpassing traditional TTS systems
Multilingual support: Supports 5 major international languages
Intelligent text normalization: Built-in text normalization handles numbers, dates, abbreviations, and complex expressions
Streaming processing: Supports streaming TTS for real-time speech synthesis
Fully offline: No cloud required, runs entirely on-device, protecting privacy
Cross-platform support: Supports Python, C++, Swift, JavaScript, Java, C#, Go, Rust, Flutter, Web, and more
ONNX Runtime: Based on ONNX Runtime for efficient inference
High-quality speech: Generates natural, clear speech
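
To make the "characters per second" figure concrete, here is a minimal sketch of how such a throughput number could be measured. It is not Supertonic's benchmark code: chars_per_second and fake_synthesize are hypothetical names, and fake_synthesize is a stand-in that only sleeps, so you would pass in your real TTS call instead.

```python
import time


def chars_per_second(synthesize, text, repeats=3):
    """Return the best observed throughput in input characters per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        synthesize(text)
        best = min(best, time.perf_counter() - start)
    return len(text) / best


def fake_synthesize(text):
    # Hypothetical stand-in for a real TTS call: sleeps 1 ms per 10 characters.
    time.sleep(len(text) / 10_000)
    return b""


if __name__ == "__main__":
    rate = chars_per_second(fake_synthesize, "Hello, world! " * 20)
    print(f"{rate:.0f} characters/second")
```

Taking the best of several runs reduces the influence of warm-up effects such as model loading, which is usually what you want when comparing engines.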

Project Advantages

Comparison	Supertonic	Cloud TTS	Traditional On-Device TTS
Speed	✅ 1000+ chars/sec	⚠️ Network-dependent	❌ Slow
Privacy	✅ Fully local	❌ Data uploaded	✅ Local
Latency	✅ Ultra-low	❌ Network latency	⚠️ Moderate
Multilingual	✅ 5 languages	✅ Supported	⚠️ Limited
Text normalization	✅ Built-in intelligent processing	⚠️ Preprocessing required	❌ Preprocessing required
Offline use	✅ Fully offline	❌ Requires network	✅ Offline
Cost	✅ Free and open source	❌ API fees	✅ Free

Why choose Supertonic?

Compared with cloud TTS and traditional on-device TTS, Supertonic combines high speed, full offline capability, built-in text normalization, and multilingual support, which makes it a strong choice for on-device TTS.

Detailed Project Analysis

Architecture Design

Supertonic uses ONNX Runtime as its inference engine for efficient on-device TTS.

Core Architecture

Supertonic TTS System
├── Text Normalization
│   ├── Number processing
│   ├── Date/time processing
│   ├── Abbreviation expansion
│   └── Multilingual support
├── Text-to-Latent
│   ├── Flow Matching model
│   ├── Length-Aware RoPE
│   └── Text-speech alignment
├── Latent-to-Speech
│   ├── Speech Autoencoder
│   ├── Streaming processing
│   └── Audio generation
└── ONNX Runtime (inference engine)
    ├── Model optimization
    ├── Hardware acceleration
    └── Cross-platform support

ONNX Runtime Advantages

ONNX Runtime provides the following advantages:

Cross-platform: Unified model format, supports multiple platforms
Hardware acceleration: Supports GPU, NPU, and other hardware acceleration
Model optimization: Automatically optimizes model inference performance
Easy deployment: Models can be deployed directly after export
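
As a rough illustration of the cross-platform and hardware-acceleration points, the sketch below picks ONNX Runtime execution providers in order of preference and falls back to the CPU. The pick_providers helper and the "model.onnx" path are assumptions made for this example, not part of Supertonic; the provider names and get_available_providers() come from ONNX Runtime itself.

```python
def pick_providers(available,
                   preferred=("CoreMLExecutionProvider",
                              "CUDAExecutionProvider",
                              "CPUExecutionProvider")):
    """Keep the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    # The CPU provider is the safe fallback when nothing else matches.
    return chosen or ["CPUExecutionProvider"]


if __name__ == "__main__":
    try:
        import onnxruntime as ort
        available = ort.get_available_providers()
    except ImportError:
        # onnxruntime is not installed; assume CPU only for the demo.
        available = ["CPUExecutionProvider"]
    print(pick_providers(available))
    # A session could then be created with a hypothetical model path:
    # session = ort.InferenceSession("model.onnx",
    #                                providers=pick_providers(available))
```

The same model file can then run on a Mac, a CUDA machine, or a plain CPU without code changes, which is the practical meaning of "unified model format".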

Text Normalization

Supertonic has built-in intelligent text normalization that handles:

Numbers: 123 → "one hundred twenty-three"
Dates: 2024-01-01 → "January first, twenty twenty-four"
Times: 2:30 → "two thirty"
Abbreviations: Dr. → "Doctor"
Units: 30kph → "thirty kilometers per hour"
Technical abbreviations: h → "hours"

Advantages:

No preprocessing required, directly handles raw text
Intelligently recognizes context for correct abbreviation expansion
Supports multiple languages, each with dedicated normalization rules
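
To show what this kind of normalization involves, here is a toy, rule-based sketch for English. It is not Supertonic's normalizer, which is built in and far more complete; the function names, the tiny abbreviation table, and the 0 to 999 number range are all assumptions chosen to keep the example short.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}  # illustrative subset


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Expand abbreviations, clock times, and small integers into words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Clock times such as 2:30 are read as "two thirty".
    text = re.sub(
        r"\b(\d{1,2}):(\d{2})\b",
        lambda m: number_to_words(int(m.group(1))) + " " + number_to_words(int(m.group(2))),
        text,
    )
    # Remaining standalone integers of up to three digits.
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)


if __name__ == "__main__":
    print(normalize("Dr. Lee has 123 books. Meet at 2:30."))
```

Even this small version shows why ordering matters: times have to be handled before plain numbers, otherwise "2:30" would be read as two unrelated integers.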

Streaming Processing

Supertonic supports streaming TTS for real-time speech synthesis:

Workflow:

Text chunking
Audio generation chunk by chunk
Real-time audio stream output
Low-latency response

Advantages:

Low latency, suitable for real-time applications
Low memory usage, suitable for mobile devices
Great user experience, fast response
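
The workflow above can be sketched with a generator that splits text at sentence boundaries and yields audio chunk by chunk. This is an illustration of the idea rather than Supertonic's streaming implementation: split_into_chunks, stream_tts, and fake_synthesize are hypothetical names, and fake_synthesize just returns silent bytes in place of a real engine call.

```python
import re


def split_into_chunks(text, max_chars=80):
    """Group sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunk = ""
    for sentence in sentences:
        if chunk and len(chunk) + 1 + len(sentence) > max_chars:
            yield chunk
            chunk = sentence
        else:
            chunk = f"{chunk} {sentence}".strip()
    if chunk:
        yield chunk


def stream_tts(text, synthesize, max_chars=80):
    """Yield audio for each chunk as soon as it has been synthesized."""
    for chunk in split_into_chunks(text, max_chars):
        yield synthesize(chunk)


def fake_synthesize(chunk):
    # Hypothetical stand-in for a TTS call: one zero byte per character.
    return bytes(len(chunk))


if __name__ == "__main__":
    text = "Streaming starts playback early. Each chunk is short. Memory stays low."
    for audio in stream_tts(text, fake_synthesize, max_chars=40):
        print(len(audio), "bytes ready for playback")
```

Because the generator yields after every chunk, playback of the first sentence can begin while later sentences are still being synthesized, and only one chunk of audio needs to be held in memory at a time.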

Multilingual Support

Supertonic supports 5 languages:

English, Chinese, Korean, Spanish, and Portuguese

Each language has dedicated:

Text normalization rules
Speech models
Pronunciation dictionaries

Performance Optimization

Supertonic achieves blazing-fast performance through multiple techniques:

Model Optimization

Model compression: Reduce model size, improve inference speed
Quantization: Use INT8 quantization to boost speed while maintaining quality
Operator fusion: Merge multiple operators to reduce computational overhead
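
For readers unfamiliar with INT8 quantization, the following sketch shows the basic arithmetic on a small list of weights. It illustrates the general technique (symmetric, per-tensor quantization), not the specific scheme used for Supertonic's models, and the function names are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    print("quantized:", q)
    print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is stored in one byte instead of four, and the reconstruction error per value is bounded by half the scale, which is why quality can stay close to the full-precision model.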

Hardware Acceleration

GPU acceleration: Leverage GPU parallel computing capabilities
NPU acceleration: Supports NPU hardware acceleration (e.g., Apple Neural Engine)
CPU optimization: SIMD optimization for CPUs

Inference Optimization

Batch processing: Process multiple requests in batches
Caching: Cache audio results for frequently used text
Preloading: Preload models into memory
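
Caching in particular is easy to add at the application level. The sketch below wraps a synthesis function with Python's functools.lru_cache so repeated phrases are synthesized only once. slow_synthesize is a hypothetical stand-in for a real TTS call, and the CALLS counter exists only to show that the cache works.

```python
from functools import lru_cache

CALLS = {"count": 0}


def slow_synthesize(text):
    # Hypothetical stand-in for an expensive TTS call.
    CALLS["count"] += 1
    return bytes(len(text))


@lru_cache(maxsize=256)
def cached_synthesize(text):
    """Return cached audio for text seen before, synthesizing it otherwise."""
    return slow_synthesize(text)


if __name__ == "__main__":
    for phrase in ["Welcome back!", "Welcome back!", "Goodbye."]:
        cached_synthesize(phrase)
    print("engine calls:", CALLS["count"])
    print(cached_synthesize.cache_info())
```

This pays off for fixed UI prompts and navigation phrases; for free-form text the hit rate will be low, so a bounded maxsize keeps memory use predictable.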

Application Cases

Multiple projects are built on Supertonic:

TLDRL: Chrome extension, free on-device TTS that can read any webpage aloud
Read Aloud: Open-source TTS browser extension supporting Chrome and Edge
PageEcho: iOS e-book reader app
VoiceChat: On-device voice-to-voice LLM chatbot in the browser
OmniAvatar: Generate talking avatar videos from photos and voice
CopiloTTS: Kotlin multiplatform TTS SDK
Voice Mixer: PyQt5 tool for mixing and modifying voice styles
Supertonic MNN: Lightweight library based on MNN (fp32/fp16/int8)
Transformers.js: Hugging Face's JS library with Supertonic support
Pinokio: One-click local cloud for Mac, Windows, and Linux

Technical Papers

Supertonic is based on three core papers:

SupertonicTTS: Main Architecture
- Introduces the overall architecture of SupertonicTTS
- Includes the speech autoencoder and Flow Matching-based text-to-latent module
- Efficient design choices
Length-Aware RoPE: Text-Speech Alignment
- Proposes Length-Aware Rotary Position Embedding (LARoPE)
- Improves text-speech alignment in cross-attention mechanisms
Self-Purifying Flow Matching: Training with Noisy Labels
- Describes the self-purification technique
- Robust training of Flow Matching models using noisy or unreliable labels
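
Since Flow Matching sits at the center of the text-to-latent module, a toy sketch of how sampling from such a model works may help. A trained network predicts a velocity field, and generation integrates that field from noise (t = 0) to data (t = 1). Everything below is an assumption for illustration: the Euler solver, the step count, and the hand-written toy velocity field that stands in for a trained network; it is not Supertonic's model or solver.

```python
def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy field standing in for a trained network.

    On the straight path x_t = (1 - t) * x0 + t * target, the value
    (target - x_t) / (1 - t) equals the constant velocity target - x0.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    noise = [0.3, -1.2, 0.8]    # starting point, playing the role of noise
    latent = [1.0, 0.0, -0.5]   # point the toy field flows toward
    print(euler_sample(make_toy_velocity(latent), noise, steps=4))
```

With straight paths like this one, a handful of integration steps is enough to reach the target, which hints at why flow-based generators can be fast at inference time.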

Project Resources

Official Resources

🌟 GitHub: https://github.com/supertone-inc/supertonic
🌐 Demo: Hugging Face Spaces

Who Should Use This

Supertonic is especially suitable for: Mobile app developers needing on-device TTS, desktop app developers needing offline speech synthesis, developers with privacy requirements, internationalized app developers needing multilingual TTS, developers requiring extreme performance, and developers needing real-time speech synthesis.

Not suitable for: Users who only need cloud TTS, scenarios that don't require multilingual support, extreme edge cases with strict model size constraints.
