Building a high-quality text-to-speech service that's completely free seemed impossible until I discovered Microsoft's Edge-TTS. Here's how I architected TTS-Free.Online using modern web technologies and why the technical decisions matter.
The Problem with Existing TTS Solutions
Most TTS APIs are expensive or have quality limitations:
- Google Cloud TTS: $4-16 per million characters
- Amazon Polly: $4 per million characters for standard voices (neural voices cost more)
- Azure Cognitive Services: $15 per million characters
- Free alternatives often have robotic voices
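To put those rates in perspective: at $16 per million characters, converting 250 million characters a month (roughly 500,000 generations of 500 characters each) would cost about $4,000 in synthesis fees alone.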
Discovering Edge-TTS
Edge-TTS is the engine behind Microsoft Edge's "Read Aloud" feature. Key advantages:
- Neural voices with natural prosody
- 40+ languages with regional variants
- SSML support for advanced control
- Completely free (though not officially documented for external use)
The challenge was making it accessible through a web interface.
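Before building anything web-facing, it helps to poke at the engine directly. The community edge-tts Python package (an assumption here; any Edge-TTS client will do) ships a CLI: `edge-tts --list-voices` prints every available voice, and `edge-tts --voice en-US-AriaNeural --text "Hello" --write-media hello.mp3` synthesizes a test clip, which makes it easy to audition voices before wiring them into a UI.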
Technical Architecture
Frontend: Next.js 14 with App Router
I chose Next.js for its full-stack capabilities and excellent developer experience:
// app/page.tsx - Main TTS interface
'use client'

import { useState } from 'react'
import { Voice } from '@/lib/voices'
import VoiceSelector from '@/components/VoiceSelector'

export default function TTSGenerator() {
  const [text, setText] = useState('')
  const [selectedVoice, setSelectedVoice] = useState<Voice>()
  const [isGenerating, setIsGenerating] = useState(false)
  const [audioUrl, setAudioUrl] = useState<string>()

  const handleGenerate = async () => {
    setIsGenerating(true)
    try {
      const response = await fetch('/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          text,
          voice: selectedVoice?.value,
          options: {
            rate: '0%',
            pitch: '0%'
          }
        })
      })
      if (response.ok) {
        const blob = await response.blob()
        // Release the previous object URL before replacing it
        if (audioUrl) URL.revokeObjectURL(audioUrl)
        setAudioUrl(URL.createObjectURL(blob))
      }
    } catch (error) {
      console.error('TTS request failed:', error)
    } finally {
      setIsGenerating(false)
    }
  }

  return (
    <div className="max-w-4xl mx-auto p-6">
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter your text here..."
        className="w-full h-40 p-4 border rounded-lg"
      />
      <VoiceSelector
        onVoiceSelect={setSelectedVoice}
        selectedVoice={selectedVoice}
      />
      <button
        onClick={handleGenerate}
        disabled={!text || !selectedVoice || isGenerating}
        className="bg-blue-500 text-white px-6 py-2 rounded-lg disabled:opacity-50"
      >
        {isGenerating ? 'Generating...' : 'Generate Speech'}
      </button>
      {audioUrl && (
        <audio controls className="w-full mt-4">
          <source src={audioUrl} type="audio/mpeg" />
        </audio>
      )}
    </div>
  )
}
Backend API: Edge Runtime with Streaming
The core TTS generation happens in a Next.js route handler running on Cloudflare Pages' edge runtime:
// app/api/generate/route.ts
import { NextRequest } from 'next/server'

export const runtime = 'edge'

interface TTSRequest {
  text: string
  voice: string
  options?: {
    rate?: string
    pitch?: string
    volume?: string
  }
}

export async function POST(request: NextRequest) {
  try {
    const { text, voice, options }: TTSRequest = await request.json()

    // Validate input
    if (!text || text.length > 10000) {
      return new Response('Invalid text length', { status: 400 })
    }
    if (!voice) {
      return new Response('Voice selection required', { status: 400 })
    }

    // Generate TTS using Edge-TTS
    const audioBuffer = await generateTTS(text, voice, options)

    return new Response(audioBuffer, {
      headers: {
        'Content-Type': 'audio/mpeg',
        'Content-Disposition': 'attachment; filename="speech.mp3"',
        'Cache-Control': 'public, max-age=3600'
      }
    })
  } catch (error) {
    console.error('TTS generation failed:', error)
    return new Response('Generation failed', { status: 500 })
  }
}

async function generateTTS(
  text: string,
  voice: string,
  options: TTSRequest['options'] = {}
): Promise<ArrayBuffer> {
  // Edge-TTS client (loaded lazily; no Node-only APIs in the edge runtime)
  const EdgeTTS = await import('edge-tts')
  const tts = new EdgeTTS.default()

  // Configure voice and output format
  await tts.setMetadata(voice, EdgeTTS.OUTPUT_FORMAT.AUDIO_24KHZ_48KBITRATE_MONO_MP3)

  // Apply SSML if options provided (text should be XML-escaped before
  // interpolation; see lib/ssml.ts below)
  let ssmlText = text
  if (options.rate || options.pitch || options.volume) {
    ssmlText = `<speak><prosody${
      options.rate ? ` rate="${options.rate}"` : ''
    }${
      options.pitch ? ` pitch="${options.pitch}"` : ''
    }${
      options.volume ? ` volume="${options.volume}"` : ''
    }>${text}</prosody></speak>`
  }

  const stream = tts.generateSpeech(ssmlText)

  // Collect the audio stream into a single buffer
  const chunks: Uint8Array[] = []
  for await (const chunk of stream) {
    chunks.push(chunk)
  }

  const totalLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0)
  const result = new Uint8Array(totalLength)
  let offset = 0
  for (const chunk of chunks) {
    result.set(chunk, offset)
    offset += chunk.length
  }

  return result.buffer
}
Voice Management System
Dynamic voice loading with language categorization:
// lib/voices.ts
export interface Voice {
  value: string
  label: string
  language: string
  gender: 'Male' | 'Female'
  locale: string
}

// Typed as Record<string, string[]> so it can be indexed by a plain string
export const VOICE_CATEGORIES: Record<string, string[]> = {
  English: ['en-US', 'en-GB', 'en-AU', 'en-CA', 'en-IN'],
  Spanish: ['es-ES', 'es-MX', 'es-AR', 'es-CO'],
  French: ['fr-FR', 'fr-CA'],
  German: ['de-DE', 'de-AT', 'de-CH'],
  Chinese: ['zh-CN', 'zh-HK', 'zh-TW'],
  Japanese: ['ja-JP'],
  Korean: ['ko-KR']
  // ... more languages
}

export async function getAvailableVoices(): Promise<Voice[]> {
  // In production, this would call Edge-TTS voice discovery;
  // for now, return a static list of known high-quality voices
  return [
    {
      value: 'en-US-AriaNeural',
      label: 'Aria (US English, Female)',
      language: 'English',
      gender: 'Female',
      locale: 'en-US'
    },
    {
      value: 'en-US-GuyNeural',
      label: 'Guy (US English, Male)',
      language: 'English',
      gender: 'Male',
      locale: 'en-US'
    }
    // ... more voices
  ]
}
// components/VoiceSelector.tsx
'use client'

import { useState, useEffect } from 'react'
import { Voice, VOICE_CATEGORIES, getAvailableVoices } from '@/lib/voices'

interface VoiceSelectorProps {
  onVoiceSelect: (voice: Voice) => void
  selectedVoice?: Voice
}

export default function VoiceSelector({ onVoiceSelect, selectedVoice }: VoiceSelectorProps) {
  const [voices, setVoices] = useState<Voice[]>([])
  const [selectedLanguage, setSelectedLanguage] = useState('English')

  useEffect(() => {
    getAvailableVoices().then(setVoices)
  }, [])

  const filteredVoices = voices.filter(voice =>
    VOICE_CATEGORIES[selectedLanguage]?.includes(voice.locale)
  )

  return (
    <div className="space-y-4">
      <div>
        <label className="block text-sm font-medium mb-2">Language</label>
        <select
          value={selectedLanguage}
          onChange={(e) => setSelectedLanguage(e.target.value)}
          className="w-full p-2 border rounded-lg"
        >
          {Object.keys(VOICE_CATEGORIES).map(lang => (
            <option key={lang} value={lang}>{lang}</option>
          ))}
        </select>
      </div>
      <div>
        <label className="block text-sm font-medium mb-2">Voice</label>
        <select
          value={selectedVoice?.value || ''}
          onChange={(e) => {
            const voice = voices.find(v => v.value === e.target.value)
            if (voice) onVoiceSelect(voice)
          }}
          className="w-full p-2 border rounded-lg"
        >
          <option value="">Select a voice...</option>
          {filteredVoices.map(voice => (
            <option key={voice.value} value={voice.value}>
              {voice.label}
            </option>
          ))}
        </select>
      </div>
    </div>
  )
}
Deployment: Cloudflare Pages
The entire application runs on Cloudflare's edge network:
// next.config.js
/** @type {import('next').NextConfig} */
const nextConfig = {
  // The edge runtime is declared per-route (export const runtime = 'edge');
  // Next.js 14 no longer accepts a global experimental runtime flag
  images: {
    unoptimized: true
  }
}

module.exports = nextConfig
// package.json scripts
{
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "pages:build": "npx @cloudflare/next-on-pages",
    "preview": "wrangler pages dev .vercel/output/static",
    "deploy": "wrangler pages deploy .vercel/output/static"
  }
}
Advanced Features Implementation
1. SSML Support for Voice Control
// lib/ssml.ts

// Escape characters that would otherwise break the SSML document
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

export function generateSSML(text: string, options: {
  rate?: string
  pitch?: string
  volume?: string
  emphasis?: 'strong' | 'moderate' | 'reduced'
  pauseAfter?: string
}): string {
  let ssml = escapeXml(text)

  // Wrap in prosody for voice modifications
  if (options.rate || options.pitch || options.volume) {
    const prosodyAttrs = [
      options.rate && `rate="${options.rate}"`,
      options.pitch && `pitch="${options.pitch}"`,
      options.volume && `volume="${options.volume}"`
    ].filter(Boolean).join(' ')
    ssml = `<prosody ${prosodyAttrs}>${ssml}</prosody>`
  }

  // Add emphasis
  if (options.emphasis) {
    ssml = `<emphasis level="${options.emphasis}">${ssml}</emphasis>`
  }

  // Add pause
  if (options.pauseAfter) {
    ssml += `<break time="${options.pauseAfter}"/>`
  }

  return `<speak>${ssml}</speak>`
}
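To make the nesting concrete, here is what a call with illustrative values produces:

// Example usage (values are illustrative)
const ssml = generateSSML('Welcome back.', {
  rate: '+10%',
  emphasis: 'strong',
  pauseAfter: '500ms'
})
// => <speak><emphasis level="strong"><prosody rate="+10%">Welcome back.</prosody></emphasis><break time="500ms"/></speak>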
2. Batch Processing for Long Texts
// lib/textProcessor.ts
export function chunkText(text: string, maxLength: number = 3000): string[] {
  if (text.length <= maxLength) return [text]

  const chunks: string[] = []
  // Keep each sentence's own terminal punctuation instead of discarding it
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [text]
  let currentChunk = ''

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length > maxLength && currentChunk) {
      chunks.push(currentChunk.trim())
      currentChunk = sentence
    } else {
      currentChunk += sentence
    }
  }

  if (currentChunk.trim()) {
    chunks.push(currentChunk.trim())
  }

  return chunks
}
// app/api/generate-long/route.ts
import { NextRequest } from 'next/server'
import { chunkText } from '@/lib/textProcessor'
// generateTTS is the helper shown above; it would need to live in a shared module

export const runtime = 'edge'

export async function POST(request: NextRequest) {
  const { text, voice, options } = await request.json()

  // Split long text into chunks
  const chunks = chunkText(text, 3000)
  const audioChunks: ArrayBuffer[] = []

  for (const chunk of chunks) {
    const audio = await generateTTS(chunk, voice, options)
    audioChunks.push(audio)
  }

  // Combine audio chunks (simplified - proper concatenation would re-mux the MP3 frames)
  const totalLength = audioChunks.reduce((sum, chunk) => sum + chunk.byteLength, 0)
  const combined = new Uint8Array(totalLength)
  let offset = 0
  for (const chunk of audioChunks) {
    combined.set(new Uint8Array(chunk), offset)
    offset += chunk.byteLength
  }

  return new Response(combined.buffer, {
    headers: {
      'Content-Type': 'audio/mpeg',
      'Content-Disposition': 'attachment; filename="long-speech.mp3"'
    }
  })
}
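Buffering every chunk before responding makes the user wait for the whole file. Since the edge runtime supports the web-standard ReadableStream, a variant of this route could push each chunk's audio to the client as soon as it is synthesized. A sketch, reusing chunkText and generateTTS from above:

// app/api/generate-long/route.ts (streaming variant - a sketch)
export async function POST(request: NextRequest) {
  const { text, voice, options } = await request.json()
  const chunks = chunkText(text, 3000)

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        // Enqueue each chunk's audio as soon as it is ready
        for (const chunk of chunks) {
          const audio = await generateTTS(chunk, voice, options)
          controller.enqueue(new Uint8Array(audio))
        }
        controller.close()
      } catch (err) {
        controller.error(err)
      }
    }
  })

  return new Response(stream, {
    headers: { 'Content-Type': 'audio/mpeg' }
  })
}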
3. Real-time Audio Controls
// components/AudioControls.tsx
'use client'

import { useState, useRef, useEffect } from 'react'

interface AudioControlsProps {
  audioUrl: string
}

export default function AudioControls({ audioUrl }: AudioControlsProps) {
  const audioRef = useRef<HTMLAudioElement>(null)
  const [isPlaying, setIsPlaying] = useState(false)
  const [currentTime, setCurrentTime] = useState(0)
  const [duration, setDuration] = useState(0)
  const [volume, setVolume] = useState(1)
  const [playbackRate, setPlaybackRate] = useState(1)

  useEffect(() => {
    const audio = audioRef.current
    if (!audio) return

    const updateTime = () => setCurrentTime(audio.currentTime)
    const updateDuration = () => setDuration(audio.duration)
    const handleEnd = () => setIsPlaying(false)

    audio.addEventListener('timeupdate', updateTime)
    audio.addEventListener('loadedmetadata', updateDuration)
    audio.addEventListener('ended', handleEnd)

    return () => {
      audio.removeEventListener('timeupdate', updateTime)
      audio.removeEventListener('loadedmetadata', updateDuration)
      audio.removeEventListener('ended', handleEnd)
    }
  }, [audioUrl])

  const togglePlay = () => {
    const audio = audioRef.current
    if (!audio) return
    if (isPlaying) {
      audio.pause()
    } else {
      audio.play()
    }
    setIsPlaying(!isPlaying)
  }

  const handleSeek = (e: React.ChangeEvent<HTMLInputElement>) => {
    const audio = audioRef.current
    if (!audio) return
    const newTime = parseFloat(e.target.value)
    audio.currentTime = newTime
    setCurrentTime(newTime)
  }

  const handleVolumeChange = (e: React.ChangeEvent<HTMLInputElement>) => {
    const newVolume = parseFloat(e.target.value)
    setVolume(newVolume)
    if (audioRef.current) {
      audioRef.current.volume = newVolume
    }
  }

  const handleRateChange = (e: React.ChangeEvent<HTMLSelectElement>) => {
    const newRate = parseFloat(e.target.value)
    setPlaybackRate(newRate)
    if (audioRef.current) {
      audioRef.current.playbackRate = newRate
    }
  }

  return (
    <div className="bg-gray-100 p-4 rounded-lg space-y-4">
      <audio ref={audioRef} src={audioUrl} preload="metadata" />
      <div className="flex items-center space-x-4">
        <button
          onClick={togglePlay}
          className="bg-blue-500 text-white p-2 rounded-full"
        >
          {isPlaying ? '⏸️' : '▶️'}
        </button>
        <div className="flex-1">
          <input
            type="range"
            min="0"
            max={duration || 0}
            value={currentTime}
            onChange={handleSeek}
            className="w-full"
          />
          <div className="flex justify-between text-sm text-gray-600">
            <span>{formatTime(currentTime)}</span>
            <span>{formatTime(duration)}</span>
          </div>
        </div>
      </div>
      <div className="flex items-center space-x-4">
        <label className="flex items-center space-x-2">
          <span>Volume:</span>
          <input
            type="range"
            min="0"
            max="1"
            step="0.1"
            value={volume}
            onChange={handleVolumeChange}
            className="w-20"
          />
        </label>
        <label className="flex items-center space-x-2">
          <span>Speed:</span>
          <select value={playbackRate} onChange={handleRateChange}>
            <option value="0.5">0.5x</option>
            <option value="0.75">0.75x</option>
            <option value="1">1x</option>
            <option value="1.25">1.25x</option>
            <option value="1.5">1.5x</option>
            <option value="2">2x</option>
          </select>
        </label>
      </div>
    </div>
  )
}

function formatTime(seconds: number): string {
  const mins = Math.floor(seconds / 60)
  const secs = Math.floor(seconds % 60)
  return `${mins}:${secs.toString().padStart(2, '0')}`
}
Content Strategy with MDX
The site includes comprehensive educational content using MDX:
// mdx-components.tsx
import type { MDXComponents } from 'mdx/types'

export function useMDXComponents(components: MDXComponents): MDXComponents {
  return {
    h1: ({ children }) => (
      <h1 className="text-4xl font-bold mb-6 text-gray-900">{children}</h1>
    ),
    h2: ({ children }) => (
      <h2 className="text-3xl font-semibold mb-4 mt-8 text-gray-800">{children}</h2>
    ),
    p: ({ children }) => (
      <p className="mb-4 leading-relaxed text-gray-700">{children}</p>
    ),
    code: ({ children }) => (
      <code className="bg-gray-100 px-2 py-1 rounded text-sm font-mono">{children}</code>
    ),
    ...components,
  }
}
Performance Optimizations
1. Edge Caching Strategy
// middleware.ts
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'

export function middleware(request: NextRequest) {
  const response = NextResponse.next()

  // Cache the voice list for 24 hours
  if (request.nextUrl.pathname.startsWith('/api/voices')) {
    response.headers.set('Cache-Control', 'public, max-age=86400')
  }

  // Mark generated audio as cacheable for 1 hour (note: shared caches
  // won't store POST responses on their own; see the sketch below)
  if (request.nextUrl.pathname.startsWith('/api/generate')) {
    response.headers.set('Cache-Control', 'public, max-age=3600')
  }

  return response
}
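Middleware alone only sets headers; shared caches generally refuse to store responses to POST requests, so caching generated audio at the edge needs an explicit step. A minimal sketch, assuming Cloudflare's runtime (where the caches.default Cache API is available); the withEdgeCache helper and the synthetic tts-cache.internal URL are illustrative, not part of the deployed code:

// lib/edgeCache.ts - look up / store generated audio in Cloudflare's cache
export async function withEdgeCache(
  body: { text: string; voice: string; options?: unknown },
  generate: () => Promise<Response>
): Promise<Response> {
  // Derive a stable key from the request parameters
  const digest = await crypto.subtle.digest(
    'SHA-256',
    new TextEncoder().encode(JSON.stringify(body))
  )
  const hash = [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, '0'))
    .join('')
  // Cache API keys must be GET requests, hence the synthetic URL
  const cacheKey = new Request(`https://tts-cache.internal/${hash}`)

  const cache = (caches as any).default as Cache
  const hit = await cache.match(cacheKey)
  if (hit) return hit

  const response = await generate()
  if (response.ok) {
    // The response must carry Cache-Control headers to be stored
    await cache.put(cacheKey, response.clone())
  }
  return response
}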
2. Client-Side Optimization
// hooks/useTTSCache.ts
import { useState, useCallback } from 'react'

interface CacheEntry {
  audioUrl: string
  timestamp: number
}

const CACHE_DURATION = 1000 * 60 * 30 // 30 minutes

export function useTTSCache() {
  const [cache, setCache] = useState<Map<string, CacheEntry>>(new Map())

  const getCacheKey = (text: string, voice: string, options: any) => {
    return `${text}-${voice}-${JSON.stringify(options)}`
  }

  const getCachedAudio = useCallback((text: string, voice: string, options: any) => {
    const key = getCacheKey(text, voice, options)
    const entry = cache.get(key)
    if (entry && Date.now() - entry.timestamp < CACHE_DURATION) {
      return entry.audioUrl
    }
    return null
  }, [cache])

  const setCachedAudio = useCallback((text: string, voice: string, options: any, audioUrl: string) => {
    const key = getCacheKey(text, voice, options)
    setCache(prev => new Map(prev).set(key, {
      audioUrl,
      timestamp: Date.now()
    }))
  }, [])

  return { getCachedAudio, setCachedAudio }
}
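Wired into the generator component, the hook short-circuits repeated requests for identical input. A sketch of how handleGenerate from earlier might use it:

// Inside TTSGenerator, with useTTSCache in scope
const { getCachedAudio, setCachedAudio } = useTTSCache()

const handleGenerate = async () => {
  if (!selectedVoice) return
  const options = { rate: '0%', pitch: '0%' }

  // Serve a recent identical request straight from memory
  const cached = getCachedAudio(text, selectedVoice.value, options)
  if (cached) {
    setAudioUrl(cached)
    return
  }

  setIsGenerating(true)
  try {
    const response = await fetch('/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice: selectedVoice.value, options })
    })
    if (response.ok) {
      const url = URL.createObjectURL(await response.blob())
      setCachedAudio(text, selectedVoice.value, options, url)
      setAudioUrl(url)
    }
  } finally {
    setIsGenerating(false)
  }
}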
Monitoring and Analytics
// lib/analytics.ts
export function trackTTSGeneration(voice: string, textLength: number, success: boolean) {
  // Analytics implementation
  if (typeof window !== 'undefined' && window.gtag) {
    window.gtag('event', 'tts_generation', {
      voice_used: voice,
      text_length_category: getTextLengthCategory(textLength),
      success: success
    })
  }
}

function getTextLengthCategory(length: number): string {
  if (length < 100) return 'short'
  if (length < 500) return 'medium'
  if (length < 2000) return 'long'
  return 'very_long'
}
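One detail worth noting: for the window.gtag reference to type-check, the project needs a global declaration along these lines:

// types/gtag.d.ts - make window.gtag visible to TypeScript
export {}

declare global {
  interface Window {
    gtag?: (command: string, action: string, params?: Record<string, unknown>) => void
  }
}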
Key Technical Learnings
- Edge Runtime Limitations: Not all Node.js APIs are available in Cloudflare's edge runtime
- Audio Streaming: Implementing proper audio streaming for large files requires careful buffer management
- Voice Quality: Different voices perform better with different content types
- Caching Strategy: Balancing cache duration with storage costs and user experience
- Error Handling: Graceful fallbacks are essential when Edge-TTS services are unavailable (see the retry sketch below)
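On the error-handling point, the pattern that proved useful is a bounded retry with exponential backoff before surfacing a failure. A sketch (attempt counts and delays are illustrative):

// Retry transient Edge-TTS failures before giving up
async function generateWithRetry(
  text: string,
  voice: string,
  options?: TTSRequest['options'],
  attempts = 3
): Promise<ArrayBuffer> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await generateTTS(text, voice, options)
    } catch (err) {
      lastError = err
      // Back off before the next attempt: 500ms, 1s, 2s...
      if (i < attempts - 1) {
        await new Promise(resolve => setTimeout(resolve, 500 * 2 ** i))
      }
    }
  }
  throw lastError
}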
Deployment Configuration
# .github/workflows/deploy.yml
name: Deploy to Cloudflare Pages

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build application
        run: npm run pages:build

      - name: Deploy to Cloudflare Pages
        uses: cloudflare/pages-action@v1
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          projectName: tts-free-online
          directory: .vercel/output/static
Results and Impact
After 6 months of operation:
- Zero infrastructure costs (Cloudflare Pages free tier)
- Global edge deployment with <100ms response times
- 50,000+ monthly active users
- 500,000+ audio generations
- 99.5% uptime
Future Technical Improvements
- WebAssembly Integration: Moving Edge-TTS processing to client-side WASM
- Real-time Streaming: Implementing Server-Sent Events for progressive audio generation
- Voice Cloning: Adding custom voice training capabilities
- API Access: Public API with rate limiting and authentication
The combination of Edge-TTS, Next.js 14, and Cloudflare Pages created a powerful, scalable, and cost-effective solution that democratizes access to high-quality text-to-speech technology.
Try the Implementation
The complete source code demonstrates how modern web technologies can create powerful, free alternatives to expensive commercial services. Visit TTS-Free.Online to experience the result, or check out the implementation patterns for your own projects.
Building useful, accessible technology doesn't require massive infrastructure investments—sometimes the best solutions come from creative combinations of existing tools.