I've been building PixelAPI — an AI media API that does image generation, video, background removal, and a bunch of other stuff. This week we shipped something I'm genuinely excited about: text-to-speech.
Not the robotic kind. Natural-sounding speech in 23 languages, with emotion tags and voice cloning. And it costs $0.015 per 30 seconds.
Why we built this
We kept hearing from users: "I use PixelAPI for images in my app, but then I need a separate TTS provider for voice." Having to juggle ElevenLabs or Google Cloud TTS alongside our API was a pain point.
So we figured — our GPU workers already handle image and video generation. TTS models are lightweight (2-3GB VRAM). Why not run them on the same infrastructure?
What it does
Two models, one endpoint:
Chatterbox-Turbo — English, 350M parameters, ~3x faster than realtime. The killer feature is paralinguistic emotion tags. You can literally write [laugh] or [sigh] in your text and the model renders actual laughter or sighing in the audio. No other API offers this at this price point.
Chatterbox-Multilingual — 23 languages including Hindi, Japanese, Korean, Arabic, and European languages. 500M parameters, near-realtime speed.
Both support zero-shot voice cloning — upload a 10-second WAV clip and the model clones that voice. No fine-tuning, no training, just works.
How to use it
```bash
curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello! [laugh] This is pretty cool, right?" \
  -F "model=chatterbox-turbo" \
  -F "language=en"
```
Returns a job ID. Poll /v1/tts/status/{id} and you get back a WAV URL.
Python example:
```python
import requests, time

resp = requests.post(
    "https://api.pixelapi.dev/v1/tts/generate",
    headers={"Authorization": "Bearer YOUR_KEY"},
    data={
        "text": "Welcome to our app. [laugh] We're glad you're here.",
        "model": "chatterbox-turbo",
        "language": "en",
    },
)
job = resp.json()

# Poll for completion
while True:
    status = requests.get(
        f"https://api.pixelapi.dev/v1/tts/status/{job['id']}",
        headers={"Authorization": "Bearer YOUR_KEY"},
    ).json()
    if status["status"] == "completed":
        print(f"Audio: {status['output_url']}")
        break
    time.sleep(3)
```
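If you call this from production code, you'll probably want the submit-and-poll dance wrapped in one function with a timeout. Here's a minimal sketch; the `failed` status value and `RuntimeError` handling are my assumptions, not something shown above — check the docs for the exact field names.

```python
import time
import requests

API = "https://api.pixelapi.dev/v1"

def tts_generate(text, api_key, model="chatterbox-turbo", language="en",
                 poll_interval=3, timeout=120):
    """Submit a TTS job and block until the audio URL is ready."""
    headers = {"Authorization": f"Bearer {api_key}"}
    job = requests.post(f"{API}/tts/generate", headers=headers,
                        data={"text": text, "model": model,
                              "language": language}).json()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = requests.get(f"{API}/tts/status/{job['id']}",
                              headers=headers).json()
        if status["status"] == "completed":
            return status["output_url"]
        if status["status"] == "failed":  # assumed status value
            raise RuntimeError(f"TTS job failed: {status}")
        time.sleep(poll_interval)
    raise TimeoutError("TTS job did not finish in time")

# Usage (needs a real key):
# url = tts_generate("Hello! [laugh] Nice.", "YOUR_KEY")
```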
For voice cloning, just add the reference audio:
```bash
curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_KEY" \
  -F "text=This is my cloned voice." \
  -F "model=chatterbox-multilingual" \
  -F "language=en" \
  -F "voice_ref=@my_voice_sample.wav"
```
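The same multipart upload from Python, using `requests`' `files=` parameter — curl's `-F name=@file` maps onto it directly. A sketch, assuming the `audio/wav` content type is acceptable:

```python
import requests

def clone_and_speak(text, wav_path, api_key,
                    model="chatterbox-multilingual", language="en"):
    """Submit a TTS job with a reference voice clip (multipart upload)."""
    with open(wav_path, "rb") as ref:
        resp = requests.post(
            "https://api.pixelapi.dev/v1/tts/generate",
            headers={"Authorization": f"Bearer {api_key}"},
            data={"text": text, "model": model, "language": language},
            files={"voice_ref": ("sample.wav", ref, "audio/wav")},
        )
    return resp.json()  # job dict; poll /v1/tts/status/{id} as above

# Usage (needs a real key and clip):
# job = clone_and_speak("This is my cloned voice.", "my_voice_sample.wav", "YOUR_KEY")
```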
Pricing comparison
| Provider | Price | Notes |
|---|---|---|
| ElevenLabs Flash | $0.05/1K chars (~$0.17/min) | Proprietary |
| ElevenLabs v2/v3 | $0.10/1K chars (~$0.33/min) | Proprietary |
| Resemble AI | $0.03/min | Proprietary |
| Google Cloud TTS | $4-16/1M chars | Standard/WaveNet |
| PixelAPI Turbo | $0.015/30s (~$0.03/min) | Open-source Chatterbox |
| PixelAPI Multilingual | $0.020/30s (~$0.04/min) | 23 languages |
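To budget from the table, here's a tiny cost estimator using the per-30s prices above. Whether partial 30-second blocks round up is my assumption — check the pricing docs for the actual billing granularity.

```python
import math

# Per-30-second prices from the pricing table above.
PRICE_PER_30S = {"chatterbox-turbo": 0.015, "chatterbox-multilingual": 0.020}

def estimate_cost(seconds, model="chatterbox-turbo", round_up=True):
    """Estimated USD cost for `seconds` of generated audio."""
    blocks = seconds / 30
    if round_up:  # assumed: partial blocks billed as full blocks
        blocks = math.ceil(blocks)
    return round(blocks * PRICE_PER_30S[model], 4)

# e.g. a one-minute Turbo clip:
# estimate_cost(60)  -> 0.03
```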
We run on our own NVIDIA RTX GPUs — no cloud markup, no cold starts, no per-second billing surprises.
Supported languages
English, Hindi, Japanese, Korean, Chinese, French, German, Spanish, Italian, Portuguese, Russian, Arabic, Dutch, Polish, Turkish, Swedish, Danish, Finnish, Greek, Hebrew, Malay, Norwegian, Swahili.
Emotion tags (Turbo model)
[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]
These actually render as audio effects in the speech. It's wild.
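A typo'd tag (say `[smirk]`) will just come out as spoken text or get ignored, so it's worth validating before spending credits. This helper is our own client-side check, not part of the API:

```python
import re

# The Turbo tag list from above.
SUPPORTED_TAGS = {"laugh", "chuckle", "gasp", "cough", "sigh",
                  "groan", "sniff", "shush", "clear throat"}

def unsupported_tags(text):
    """Return any [bracketed] tags in `text` the Turbo model won't render."""
    return [t for t in re.findall(r"\[([^\]]+)\]", text)
            if t not in SUPPORTED_TAGS]

# unsupported_tags("Hi [laugh] there")  -> []
# unsupported_tags("Hmm [smirk]")       -> ["smirk"]
```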
What's next
We're looking at adding streaming TTS (WebSocket-based, sub-200ms first-byte latency) and more voice customization options. If you have specific use cases, drop a comment or hit us up at support@pixelapi.dev.
Full API docs: pixelapi.dev/docs#text-to-speech
Sign up for free (100 credits included): pixelapi.dev