I've been building PixelAPI — an AI media API that does image generation, video, background removal, and a bunch of other stuff. This week we shipped something I'm genuinely excited about: text-to-speech.
Not the robotic kind. Natural-sounding speech in 23 languages, with emotion tags and voice cloning. And it costs $0.015 per 30 seconds.
Why we built this
We kept hearing from users: "I use PixelAPI for images in my app, but then I need a separate TTS provider for voice." Having to juggle ElevenLabs or Google Cloud TTS alongside our API was a pain point.
So we figured — our GPU workers already handle image and video generation. TTS models are lightweight (2-3GB VRAM). Why not run them on the same infrastructure?
What it does
Two models, one endpoint:
Chatterbox-Turbo — English, 350M parameters, ~3x faster than realtime. The killer feature is paralinguistic emotion tags. You can literally write [laugh] or [sigh] in your text and the model renders actual laughter or sighing in the audio. No other API offers this at this price point.
Chatterbox-Multilingual — 23 languages including Hindi, Japanese, Korean, Arabic, and European languages. 500M parameters, near-realtime speed.
Both support zero-shot voice cloning — upload a 10-second WAV clip and the model clones that voice. No fine-tuning, no training, just works.
How to use it
```bash
curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello! [laugh] This is pretty cool, right?" \
  -F "model=chatterbox-turbo" \
  -F "language=en"
```
Returns a job ID. Poll /v1/tts/status/{id} and you get back a WAV URL.
Python example:
```python
import requests, time

resp = requests.post(
    "https://api.pixelapi.dev/v1/tts/generate",
    headers={"Authorization": "Bearer YOUR_KEY"},
    data={
        "text": "Welcome to our app. [laugh] We're glad you're here.",
        "model": "chatterbox-turbo",
        "language": "en",
    },
)
job = resp.json()

# Poll for completion
while True:
    status = requests.get(
        f"https://api.pixelapi.dev/v1/tts/status/{job['id']}",
        headers={"Authorization": "Bearer YOUR_KEY"},
    ).json()
    if status["status"] == "completed":
        print(f"Audio: {status['output_url']}")
        break
    time.sleep(3)
```
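If you call this from production code, you'll probably want the submit-and-poll dance wrapped in one function with a timeout. Here's a minimal sketch; the `failed` status value and `RuntimeError` handling are my assumptions, not something shown above — check the docs for the exact field names.

```python
import time
import requests

API = "https://api.pixelapi.dev/v1"

def tts_generate(text, api_key, model="chatterbox-turbo", language="en",
                 poll_interval=3, timeout=120):
    """Submit a TTS job and block until the audio URL is ready."""
    headers = {"Authorization": f"Bearer {api_key}"}
    job = requests.post(f"{API}/tts/generate", headers=headers,
                        data={"text": text, "model": model,
                              "language": language}).json()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = requests.get(f"{API}/tts/status/{job['id']}",
                              headers=headers).json()
        if status["status"] == "completed":
            return status["output_url"]
        if status["status"] == "failed":  # assumed status value
            raise RuntimeError(f"TTS job failed: {status}")
        time.sleep(poll_interval)
    raise TimeoutError("TTS job did not finish in time")

# Usage (needs a real key):
# url = tts_generate("Hello! [laugh] Nice.", "YOUR_KEY")
```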
For voice cloning, just add the reference audio:
```bash
curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_KEY" \
  -F "text=This is my cloned voice." \
  -F "model=chatterbox-multilingual" \
  -F "language=en" \
  -F "voice_ref=@my_voice_sample.wav"
```
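The same multipart upload from Python, using `requests`' `files=` parameter — curl's `-F name=@file` maps onto it directly. A sketch, assuming the `audio/wav` content type is acceptable:

```python
import requests

def clone_and_speak(text, wav_path, api_key,
                    model="chatterbox-multilingual", language="en"):
    """Submit a TTS job with a reference voice clip (multipart upload)."""
    with open(wav_path, "rb") as ref:
        resp = requests.post(
            "https://api.pixelapi.dev/v1/tts/generate",
            headers={"Authorization": f"Bearer {api_key}"},
            data={"text": text, "model": model, "language": language},
            files={"voice_ref": ("sample.wav", ref, "audio/wav")},
        )
    return resp.json()  # job dict; poll /v1/tts/status/{id} as above

# Usage (needs a real key and clip):
# job = clone_and_speak("This is my cloned voice.", "my_voice_sample.wav", "YOUR_KEY")
```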
Pricing comparison
| Provider | Price | Notes |
|---|---|---|
| ElevenLabs Flash | $0.05/1K chars (~$0.17/min) | Proprietary |
| ElevenLabs v2/v3 | $0.10/1K chars (~$0.33/min) | Proprietary |
| Resemble AI | $0.03/min | Proprietary |
| Google Cloud TTS | $4-16/1M chars | Standard/WaveNet |
| PixelAPI Turbo | $0.015/30s (~$0.03/min) | Open-source Chatterbox |
| PixelAPI Multilingual | $0.020/30s (~$0.04/min) | 23 languages |
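To budget from the table, here's a tiny cost estimator using the per-30s prices above. Whether partial 30-second blocks round up is my assumption — check the pricing docs for the actual billing granularity.

```python
import math

# Per-30-second prices from the pricing table above.
PRICE_PER_30S = {"chatterbox-turbo": 0.015, "chatterbox-multilingual": 0.020}

def estimate_cost(seconds, model="chatterbox-turbo", round_up=True):
    """Estimated USD cost for `seconds` of generated audio."""
    blocks = seconds / 30
    if round_up:  # assumed: partial blocks billed as full blocks
        blocks = math.ceil(blocks)
    return round(blocks * PRICE_PER_30S[model], 4)

# e.g. a one-minute Turbo clip:
# estimate_cost(60)  -> 0.03
```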
We run on our own NVIDIA RTX GPUs — no cloud markup, no cold starts, no per-second billing surprises.
Supported languages
English, Hindi, Japanese, Korean, Chinese, French, German, Spanish, Italian, Portuguese, Russian, Arabic, Dutch, Polish, Turkish, Swedish, Danish, Finnish, Greek, Hebrew, Malay, Norwegian, Swahili.
Emotion tags (Turbo model)
[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]
These actually render as audio effects in the speech. It's wild.
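A typo'd tag (say `[smirk]`) will just come out as spoken text or get ignored, so it's worth validating before spending credits. This helper is our own client-side check, not part of the API:

```python
import re

# The Turbo tag list from above.
SUPPORTED_TAGS = {"laugh", "chuckle", "gasp", "cough", "sigh",
                  "groan", "sniff", "shush", "clear throat"}

def unsupported_tags(text):
    """Return any [bracketed] tags in `text` the Turbo model won't render."""
    return [t for t in re.findall(r"\[([^\]]+)\]", text)
            if t not in SUPPORTED_TAGS]

# unsupported_tags("Hi [laugh] there")  -> []
# unsupported_tags("Hmm [smirk]")       -> ["smirk"]
```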
What's next
We're looking at adding streaming TTS (WebSocket-based, sub-200ms first-byte latency) and more voice customization options. If you have specific use cases, drop a comment or hit us up at support@pixelapi.dev.
Full API docs: pixelapi.dev/docs#text-to-speech
Sign up for free (100 credits included): pixelapi.dev