Building a high-quality text-to-speech service that's completely free seemed impossible until I discovered Microsoft's Edge-TTS. Here's how I architected TTS-Free.Online using modern web technologies and why the technical decisions matter.
The Problem with Existing TTS Solutions
Most TTS APIs are expensive or have quality limitations:
- Google Cloud TTS: $4-16 per million characters
- Amazon Polly: $4 per million characters for standard voices (neural voices cost more)
- Azure Cognitive Services: $15 per million characters
- Free alternatives often have robotic voices
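To put those rates in perspective: at $16 per million characters, converting 250 million characters a month (roughly 500,000 generations of 500 characters each) would cost about $4,000 in synthesis fees alone.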
Discovering Edge-TTS
Edge-TTS is the engine behind Microsoft Edge's "Read Aloud" feature. Key advantages:
- Neural voices with natural prosody
- 40+ languages with regional variants
- SSML support for advanced control
- Completely free (though not officially documented for external use)
The challenge was making it accessible through a web interface.
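Before building anything web-facing, it helps to poke at the engine directly. The community edge-tts Python package (an assumption here; any Edge-TTS client will do) ships a CLI: `edge-tts --list-voices` prints every available voice, and `edge-tts --voice en-US-AriaNeural --text "Hello" --write-media hello.mp3` synthesizes a test clip, which makes it easy to audition voices before wiring them into a UI.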
Technical Architecture
Frontend: Next.js 14 with App Router
I chose Next.js for its full-stack capabilities and excellent developer experience:
// app/page.tsx - Main TTS interface
'use client'

import { useState } from 'react'
import { Voice } from '@/lib/voices'
import VoiceSelector from '@/components/VoiceSelector'

export default function TTSGenerator() {
  const [text, setText] = useState('')
  const [selectedVoice, setSelectedVoice] = useState<Voice>()
  const [isGenerating, setIsGenerating] = useState(false)
  const [audioUrl, setAudioUrl] = useState<string>()

  const handleGenerate = async () => {
    setIsGenerating(true)
    try {
      const response = await fetch('/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          text,
          voice: selectedVoice?.value,
          options: {
            rate: '0%',
            pitch: '0%'
          }
        })
      })
      if (response.ok) {
        const blob = await response.blob()
        // Release the previous object URL before replacing it
        if (audioUrl) URL.revokeObjectURL(audioUrl)
        setAudioUrl(URL.createObjectURL(blob))
      }
    } catch (error) {
      console.error('TTS request failed:', error)
    } finally {
      setIsGenerating(false)
    }
  }

  return (
    <div className="max-w-4xl mx-auto p-6">
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter your text here..."
        className="w-full h-40 p-4 border rounded-lg"
      />
      <VoiceSelector
        onVoiceSelect={setSelectedVoice}
        selectedVoice={selectedVoice}
      />
      <button
        onClick={handleGenerate}
        disabled={!text || !selectedVoice || isGenerating}
        className="bg-blue-500 text-white px-6 py-2 rounded-lg disabled:opacity-50"
      >
        {isGenerating ? 'Generating...' : 'Generate Speech'}
      </button>
      {audioUrl && (
        <audio controls className="w-full mt-4">
          <source src={audioUrl} type="audio/mpeg" />
        </audio>
      )}
    </div>
  )
}
Backend API: Edge Runtime with Streaming
The core TTS generation happens in a Next.js route handler running on Cloudflare Pages' edge runtime:
// app/api/generate/route.ts
import { NextRequest } from 'next/server'

export const runtime = 'edge'

interface TTSRequest {
  text: string
  voice: string
  options?: {
    rate?: string
    pitch?: string
    volume?: string
  }
}

export async function POST(request: NextRequest) {
  try {
    const { text, voice, options }: TTSRequest = await request.json()

    // Validate input
    if (!text || text.length > 10000) {
      return new Response('Invalid text length', { status: 400 })
    }
    if (!voice) {
      return new Response('Voice selection required', { status: 400 })
    }

    // Generate TTS using Edge-TTS
    const audioBuffer = await generateTTS(text, voice, options)

    return new Response(audioBuffer, {
      headers: {
        'Content-Type': 'audio/mpeg',
        'Content-Disposition': 'attachment; filename="speech.mp3"',
        'Cache-Control': 'public, max-age=3600'
      }
    })
  } catch (error) {
    console.error('TTS generation failed:', error)
    return new Response('Generation failed', { status: 500 })
  }
}

async function generateTTS(
  text: string,
  voice: string,
  options: TTSRequest['options'] = {}
): Promise<ArrayBuffer> {
  // Edge-TTS client (loaded lazily; no Node-only APIs in the edge runtime)
  const EdgeTTS = await import('edge-tts')
  const tts = new EdgeTTS.default()

  // Configure voice and output format
  await tts.setMetadata(voice, EdgeTTS.OUTPUT_FORMAT.AUDIO_24KHZ_48KBITRATE_MONO_MP3)

  // Apply SSML if options provided (text should be XML-escaped before
  // interpolation; see lib/ssml.ts below)
  let ssmlText = text
  if (options.rate || options.pitch || options.volume) {
    ssmlText = `<speak><prosody${
      options.rate ? ` rate="${options.rate}"` : ''
    }${
      options.pitch ? ` pitch="${options.pitch}"` : ''
    }${
      options.volume ? ` volume="${options.volume}"` : ''
    }>${text}</prosody></speak>`
  }

  const stream = tts.generateSpeech(ssmlText)

  // Collect the audio stream into a single buffer
  const chunks: Uint8Array[] = []
  for await (const chunk of stream) {
    chunks.push(chunk)
  }

  const totalLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0)
  const result = new Uint8Array(totalLength)
  let offset = 0
  for (const chunk of chunks) {
    result.set(chunk, offset)
    offset += chunk.length
  }

  return result.buffer
}
Voice Management System
Dynamic voice loading with language categorization:
// lib/voices.ts
export interface Voice {
  value: string
  label: string
  language: string
  gender: 'Male' | 'Female'
  locale: string
}

// Typed as Record<string, string[]> so it can be indexed by a plain string
export const VOICE_CATEGORIES: Record<string, string[]> = {
  English: ['en-US', 'en-GB', 'en-AU', 'en-CA', 'en-IN'],
  Spanish: ['es-ES', 'es-MX', 'es-AR', 'es-CO'],
  French: ['fr-FR', 'fr-CA'],
  German: ['de-DE', 'de-AT', 'de-CH'],
  Chinese: ['zh-CN', 'zh-HK', 'zh-TW'],
  Japanese: ['ja-JP'],
  Korean: ['ko-KR']
  // ... more languages
}

export async function getAvailableVoices(): Promise<Voice[]> {
  // In production, this would call Edge-TTS voice discovery;
  // for now, return a static list of known high-quality voices
  return [
    {
      value: 'en-US-AriaNeural',
      label: 'Aria (US English, Female)',
      language: 'English',
      gender: 'Female',
      locale: 'en-US'
    },
    {
      value: 'en-US-GuyNeural',
      label: 'Guy (US English, Male)',
      language: 'English',
      gender: 'Male',
      locale: 'en-US'
    }
    // ... more voices
  ]
}
// components/VoiceSelector.tsx
'use client'

import { useState, useEffect } from 'react'
import { Voice, VOICE_CATEGORIES, getAvailableVoices } from '@/lib/voices'

interface VoiceSelectorProps {
  onVoiceSelect: (voice: Voice) => void
  selectedVoice?: Voice
}

export default function VoiceSelector({ onVoiceSelect, selectedVoice }: VoiceSelectorProps) {
  const [voices, setVoices] = useState<Voice[]>([])
  const [selectedLanguage, setSelectedLanguage] = useState('English')

  useEffect(() => {
    getAvailableVoices().then(setVoices)
  }, [])

  const filteredVoices = voices.filter(voice =>
    VOICE_CATEGORIES[selectedLanguage]?.includes(voice.locale)
  )

  return (
    <div className="space-y-4">
      <div>
        <label className="block text-sm font-medium mb-2">Language</label>
        <select
          value={selectedLanguage}
          onChange={(e) => setSelectedLanguage(e.target.value)}
          className="w-full p-2 border rounded-lg"
        >
          {Object.keys(VOICE_CATEGORIES).map(lang => (
            <option key={lang} value={lang}>{lang}</option>
          ))}
        </select>
      </div>
      <div>
        <label className="block text-sm font-medium mb-2">Voice</label>
        <select
          value={selectedVoice?.value || ''}
          onChange={(e) => {
            const voice = voices.find(v => v.value === e.target.value)
            if (voice) onVoiceSelect(voice)
          }}
          className="w-full p-2 border rounded-lg"
        >
          <option value="">Select a voice...</option>
          {filteredVoices.map(voice => (
            <option key={voice.value} value={voice.value}>
              {voice.label}
            </option>
          ))}
        </select>
      </div>
    </div>
  )
}
Deployment: Cloudflare Pages
The entire application runs on Cloudflare's edge network:
// next.config.js
/** @type {import('next').NextConfig} */
const nextConfig = {
  // The edge runtime is declared per-route (export const runtime = 'edge');
  // Next.js 14 no longer accepts a global experimental runtime flag
  images: {
    unoptimized: true
  }
}

module.exports = nextConfig
// package.json scripts
{
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "pages:build": "npx @cloudflare/next-on-pages",
    "preview": "wrangler pages dev .vercel/output/static",
    "deploy": "wrangler pages deploy .vercel/output/static"
  }
}
Advanced Features Implementation
1. SSML Support for Voice Control
// lib/ssml.ts

// Escape characters that would otherwise break the SSML document
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

export function generateSSML(text: string, options: {
  rate?: string
  pitch?: string
  volume?: string
  emphasis?: 'strong' | 'moderate' | 'reduced'
  pauseAfter?: string
}): string {
  let ssml = escapeXml(text)

  // Wrap in prosody for voice modifications
  if (options.rate || options.pitch || options.volume) {
    const prosodyAttrs = [
      options.rate && `rate="${options.rate}"`,
      options.pitch && `pitch="${options.pitch}"`,
      options.volume && `volume="${options.volume}"`
    ].filter(Boolean).join(' ')
    ssml = `<prosody ${prosodyAttrs}>${ssml}</prosody>`
  }

  // Add emphasis
  if (options.emphasis) {
    ssml = `<emphasis level="${options.emphasis}">${ssml}</emphasis>`
  }

  // Add pause
  if (options.pauseAfter) {
    ssml += `<break time="${options.pauseAfter}"/>`
  }

  return `<speak>${ssml}</speak>`
}
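To make the nesting concrete, here is what a call with illustrative values produces:

// Example usage (values are illustrative)
const ssml = generateSSML('Welcome back.', {
  rate: '+10%',
  emphasis: 'strong',
  pauseAfter: '500ms'
})
// => <speak><emphasis level="strong"><prosody rate="+10%">Welcome back.</prosody></emphasis><break time="500ms"/></speak>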
2. Batch Processing for Long Texts
// lib/textProcessor.ts
export function chunkText(text: string, maxLength: number = 3000): string[] {
  if (text.length <= maxLength) return [text]

  const chunks: string[] = []
  // Keep each sentence's own terminal punctuation instead of discarding it
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [text]
  let currentChunk = ''

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length > maxLength && currentChunk) {
      chunks.push(currentChunk.trim())
      currentChunk = sentence
    } else {
      currentChunk += sentence
    }
  }

  if (currentChunk.trim()) {
    chunks.push(currentChunk.trim())
  }

  return chunks
}
// app/api/generate-long/route.ts
import { NextRequest } from 'next/server'
import { chunkText } from '@/lib/textProcessor'
// generateTTS is the helper shown above; it would need to live in a shared module

export const runtime = 'edge'

export async function POST(request: NextRequest) {
  const { text, voice, options } = await request.json()

  // Split long text into chunks
  const chunks = chunkText(text, 3000)
  const audioChunks: ArrayBuffer[] = []

  for (const chunk of chunks) {
    const audio = await generateTTS(chunk, voice, options)
    audioChunks.push(audio)
  }

  // Combine audio chunks (simplified - proper concatenation would re-mux the MP3 frames)
  const totalLength = audioChunks.reduce((sum, chunk) => sum + chunk.byteLength, 0)
  const combined = new Uint8Array(totalLength)
  let offset = 0
  for (const chunk of audioChunks) {
    combined.set(new Uint8Array(chunk), offset)
    offset += chunk.byteLength
  }

  return new Response(combined.buffer, {
    headers: {
      'Content-Type': 'audio/mpeg',
      'Content-Disposition': 'attachment; filename="long-speech.mp3"'
    }
  })
}
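Buffering every chunk before responding makes the user wait for the whole file. Since the edge runtime supports the web-standard ReadableStream, a variant of this route could push each chunk's audio to the client as soon as it is synthesized. A sketch, reusing chunkText and generateTTS from above:

// app/api/generate-long/route.ts (streaming variant - a sketch)
export async function POST(request: NextRequest) {
  const { text, voice, options } = await request.json()
  const chunks = chunkText(text, 3000)

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        // Enqueue each chunk's audio as soon as it is ready
        for (const chunk of chunks) {
          const audio = await generateTTS(chunk, voice, options)
          controller.enqueue(new Uint8Array(audio))
        }
        controller.close()
      } catch (err) {
        controller.error(err)
      }
    }
  })

  return new Response(stream, {
    headers: { 'Content-Type': 'audio/mpeg' }
  })
}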
3. Real-time Audio Controls
// components/AudioControls.tsx
'use client'

import { useState, useRef, useEffect } from 'react'

interface AudioControlsProps {
  audioUrl: string
}

export default function AudioControls({ audioUrl }: AudioControlsProps) {
  const audioRef = useRef<HTMLAudioElement>(null)
  const [isPlaying, setIsPlaying] = useState(false)
  const [currentTime, setCurrentTime] = useState(0)
  const [duration, setDuration] = useState(0)
  const [volume, setVolume] = useState(1)
  const [playbackRate, setPlaybackRate] = useState(1)

  useEffect(() => {
    const audio = audioRef.current
    if (!audio) return

    const updateTime = () => setCurrentTime(audio.currentTime)
    const updateDuration = () => setDuration(audio.duration)
    const handleEnd = () => setIsPlaying(false)

    audio.addEventListener('timeupdate', updateTime)
    audio.addEventListener('loadedmetadata', updateDuration)
    audio.addEventListener('ended', handleEnd)

    return () => {
      audio.removeEventListener('timeupdate', updateTime)
      audio.removeEventListener('loadedmetadata', updateDuration)
      audio.removeEventListener('ended', handleEnd)
    }
  }, [audioUrl])

  const togglePlay = () => {
    const audio = audioRef.current
    if (!audio) return
    if (isPlaying) {
      audio.pause()
    } else {
      audio.play()
    }
    setIsPlaying(!isPlaying)
  }

  const handleSeek = (e: React.ChangeEvent<HTMLInputElement>) => {
    const audio = audioRef.current
    if (!audio) return
    const newTime = parseFloat(e.target.value)
    audio.currentTime = newTime
    setCurrentTime(newTime)
  }

  const handleVolumeChange = (e: React.ChangeEvent<HTMLInputElement>) => {
    const newVolume = parseFloat(e.target.value)
    setVolume(newVolume)
    if (audioRef.current) {
      audioRef.current.volume = newVolume
    }
  }

  const handleRateChange = (e: React.ChangeEvent<HTMLSelectElement>) => {
    const newRate = parseFloat(e.target.value)
    setPlaybackRate(newRate)
    if (audioRef.current) {
      audioRef.current.playbackRate = newRate
    }
  }

  return (
    <div className="bg-gray-100 p-4 rounded-lg space-y-4">
      <audio ref={audioRef} src={audioUrl} preload="metadata" />
      <div className="flex items-center space-x-4">
        <button
          onClick={togglePlay}
          className="bg-blue-500 text-white p-2 rounded-full"
        >
          {isPlaying ? '⏸️' : '▶️'}
        </button>
        <div className="flex-1">
          <input
            type="range"
            min="0"
            max={duration || 0}
            value={currentTime}
            onChange={handleSeek}
            className="w-full"
          />
          <div className="flex justify-between text-sm text-gray-600">
            <span>{formatTime(currentTime)}</span>
            <span>{formatTime(duration)}</span>
          </div>
        </div>
      </div>
      <div className="flex items-center space-x-4">
        <label className="flex items-center space-x-2">
          <span>Volume:</span>
          <input
            type="range"
            min="0"
            max="1"
            step="0.1"
            value={volume}
            onChange={handleVolumeChange}
            className="w-20"
          />
        </label>
        <label className="flex items-center space-x-2">
          <span>Speed:</span>
          <select value={playbackRate} onChange={handleRateChange}>
            <option value="0.5">0.5x</option>
            <option value="0.75">0.75x</option>
            <option value="1">1x</option>
            <option value="1.25">1.25x</option>
            <option value="1.5">1.5x</option>
            <option value="2">2x</option>
          </select>
        </label>
      </div>
    </div>
  )
}

function formatTime(seconds: number): string {
  const mins = Math.floor(seconds / 60)
  const secs = Math.floor(seconds % 60)
  return `${mins}:${secs.toString().padStart(2, '0')}`
}
Content Strategy with MDX
The site includes comprehensive educational content using MDX:
// mdx-components.tsx
import type { MDXComponents } from 'mdx/types'

export function useMDXComponents(components: MDXComponents): MDXComponents {
  return {
    h1: ({ children }) => (
      <h1 className="text-4xl font-bold mb-6 text-gray-900">{children}</h1>
    ),
    h2: ({ children }) => (
      <h2 className="text-3xl font-semibold mb-4 mt-8 text-gray-800">{children}</h2>
    ),
    p: ({ children }) => (
      <p className="mb-4 leading-relaxed text-gray-700">{children}</p>
    ),
    code: ({ children }) => (
      <code className="bg-gray-100 px-2 py-1 rounded text-sm font-mono">{children}</code>
    ),
    ...components,
  }
}
Performance Optimizations
1. Edge Caching Strategy
// middleware.ts
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'

export function middleware(request: NextRequest) {
  const response = NextResponse.next()

  // Cache the voice list for 24 hours
  if (request.nextUrl.pathname.startsWith('/api/voices')) {
    response.headers.set('Cache-Control', 'public, max-age=86400')
  }

  // Mark generated audio as cacheable for 1 hour (note: shared caches
  // won't store POST responses on their own; see the sketch below)
  if (request.nextUrl.pathname.startsWith('/api/generate')) {
    response.headers.set('Cache-Control', 'public, max-age=3600')
  }

  return response
}
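Middleware alone only sets headers; shared caches generally refuse to store responses to POST requests, so caching generated audio at the edge needs an explicit step. A minimal sketch, assuming Cloudflare's runtime (where the caches.default Cache API is available); the withEdgeCache helper and the synthetic tts-cache.internal URL are illustrative, not part of the deployed code:

// lib/edgeCache.ts - look up / store generated audio in Cloudflare's cache
export async function withEdgeCache(
  body: { text: string; voice: string; options?: unknown },
  generate: () => Promise<Response>
): Promise<Response> {
  // Derive a stable key from the request parameters
  const digest = await crypto.subtle.digest(
    'SHA-256',
    new TextEncoder().encode(JSON.stringify(body))
  )
  const hash = [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, '0'))
    .join('')
  // Cache API keys must be GET requests, hence the synthetic URL
  const cacheKey = new Request(`https://tts-cache.internal/${hash}`)

  const cache = (caches as any).default as Cache
  const hit = await cache.match(cacheKey)
  if (hit) return hit

  const response = await generate()
  if (response.ok) {
    // The response must carry Cache-Control headers to be stored
    await cache.put(cacheKey, response.clone())
  }
  return response
}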
2. Client-Side Optimization
// hooks/useTTSCache.ts
import { useState, useCallback } from 'react'

interface CacheEntry {
  audioUrl: string
  timestamp: number
}

const CACHE_DURATION = 1000 * 60 * 30 // 30 minutes

export function useTTSCache() {
  const [cache, setCache] = useState<Map<string, CacheEntry>>(new Map())

  const getCacheKey = (text: string, voice: string, options: any) => {
    return `${text}-${voice}-${JSON.stringify(options)}`
  }

  const getCachedAudio = useCallback((text: string, voice: string, options: any) => {
    const key = getCacheKey(text, voice, options)
    const entry = cache.get(key)
    if (entry && Date.now() - entry.timestamp < CACHE_DURATION) {
      return entry.audioUrl
    }
    return null
  }, [cache])

  const setCachedAudio = useCallback((text: string, voice: string, options: any, audioUrl: string) => {
    const key = getCacheKey(text, voice, options)
    setCache(prev => new Map(prev).set(key, {
      audioUrl,
      timestamp: Date.now()
    }))
  }, [])

  return { getCachedAudio, setCachedAudio }
}
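Wired into the generator component, the hook short-circuits repeated requests for identical input. A sketch of how handleGenerate from earlier might use it:

// Inside TTSGenerator, with useTTSCache in scope
const { getCachedAudio, setCachedAudio } = useTTSCache()

const handleGenerate = async () => {
  if (!selectedVoice) return
  const options = { rate: '0%', pitch: '0%' }

  // Serve a recent identical request straight from memory
  const cached = getCachedAudio(text, selectedVoice.value, options)
  if (cached) {
    setAudioUrl(cached)
    return
  }

  setIsGenerating(true)
  try {
    const response = await fetch('/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice: selectedVoice.value, options })
    })
    if (response.ok) {
      const url = URL.createObjectURL(await response.blob())
      setCachedAudio(text, selectedVoice.value, options, url)
      setAudioUrl(url)
    }
  } finally {
    setIsGenerating(false)
  }
}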
Monitoring and Analytics
// lib/analytics.ts
export function trackTTSGeneration(voice: string, textLength: number, success: boolean) {
  // Analytics implementation
  if (typeof window !== 'undefined' && window.gtag) {
    window.gtag('event', 'tts_generation', {
      voice_used: voice,
      text_length_category: getTextLengthCategory(textLength),
      success: success
    })
  }
}

function getTextLengthCategory(length: number): string {
  if (length < 100) return 'short'
  if (length < 500) return 'medium'
  if (length < 2000) return 'long'
  return 'very_long'
}
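One detail worth noting: for the window.gtag reference to type-check, the project needs a global declaration along these lines:

// types/gtag.d.ts - make window.gtag visible to TypeScript
export {}

declare global {
  interface Window {
    gtag?: (command: string, action: string, params?: Record<string, unknown>) => void
  }
}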
Key Technical Learnings
- Edge Runtime Limitations: Not all Node.js APIs are available in Cloudflare's edge runtime
- Audio Streaming: Implementing proper audio streaming for large files requires careful buffer management
- Voice Quality: Different voices perform better with different content types
- Caching Strategy: Balancing cache duration with storage costs and user experience
- Error Handling: Graceful fallbacks are essential when Edge-TTS services are unavailable (see the retry sketch below)
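On the error-handling point, the pattern that proved useful is a bounded retry with exponential backoff before surfacing a failure. A sketch (attempt counts and delays are illustrative):

// Retry transient Edge-TTS failures before giving up
async function generateWithRetry(
  text: string,
  voice: string,
  options?: TTSRequest['options'],
  attempts = 3
): Promise<ArrayBuffer> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await generateTTS(text, voice, options)
    } catch (err) {
      lastError = err
      // Back off before the next attempt: 500ms, 1s, 2s...
      if (i < attempts - 1) {
        await new Promise(resolve => setTimeout(resolve, 500 * 2 ** i))
      }
    }
  }
  throw lastError
}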
Deployment Configuration
# .github/workflows/deploy.yml
name: Deploy to Cloudflare Pages

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build application
        run: npm run pages:build

      - name: Deploy to Cloudflare Pages
        uses: cloudflare/pages-action@v1
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          projectName: tts-free-online
          directory: .vercel/output/static
Results and Impact
After 6 months of operation:
- Zero infrastructure costs (Cloudflare Pages free tier)
- Global edge deployment with <100ms response times
- 50,000+ monthly active users
- 500,000+ audio generations
- 99.5% uptime
Future Technical Improvements
- WebAssembly Integration: Moving Edge-TTS processing to client-side WASM
- Real-time Streaming: Implementing Server-Sent Events for progressive audio generation
- Voice Cloning: Adding custom voice training capabilities
- API Access: Public API with rate limiting and authentication
The combination of Edge-TTS, Next.js 14, and Cloudflare Pages created a powerful, scalable, and cost-effective solution that democratizes access to high-quality text-to-speech technology.
Try the Implementation
The complete source code demonstrates how modern web technologies can create powerful, free alternatives to expensive commercial services. Visit TTS-Free.Online to experience the result, or check out the implementation patterns for your own projects.
Building useful, accessible technology doesn't require massive infrastructure investments—sometimes the best solutions come from creative combinations of existing tools.