The ElevenLabs quickstart is four lines and it works on the first run. That part is honest. What the quickstart does not tell you is that the two things most likely to break your integration in production are not in the code at all — they are how the API bills you, and how it hands you a voice that quietly stops existing.
I run ElevenLabs behind a content pipeline that generates voiceovers for short-form video. Not a toy script — a service that calls the API on a schedule, caches the audio, and ships it downstream. Here is what I wish someone had told me before I wired it in.
The happy path really is this short
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # a library or cloned voice
    model_id="eleven_multilingual_v2",
    text="Welcome to the show.",
    output_format="mp3_44100_128",
)
convert returns an iterator of bytes, so you write it out like any stream:
with open("out.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
That is the whole text-to-speech call. The same API key and the same credit pool drive this and the web app, which is the part I actually care about as a developer: what I prototype in the browser is what ships. No second account, no separate quota.
Gotcha 1: a voice_id is not a stable identifier
This one cost me a confusing afternoon. I hardcoded a voice_id I had picked from the Voice Library. A few weeks later the pipeline started throwing on that ID. The voice had been removed from the library by whoever published it, and the ID went with it.
The library is community-contributed. Voices come and go. If your code references a library voice by ID, you have a dependency on someone else's decision to keep sharing it.
Two fixes, depending on how much you care:
- Add the voice to your own collection from the dashboard (the "Add to My Voices" button on the voice), which drops it into your My Voices list with a stable ID so a library removal does not pull it out from under you.
- For anything load-bearing, clone or design the voice so the ID is yours. A voice you created lives in your own account, so nobody else's decision to stop sharing can remove it.
Either way: do not treat a raw library voice_id as a permanent key. Treat it like a CDN URL you do not control.
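One way to act on that is to resolve the voice at startup instead of trusting a single hardcoded ID. The sketch below is only an illustration of the idea, not something the SDK provides: `resolve_voice`, `NoVoiceAvailable`, and the `voice_exists` callable are hypothetical names, and `voice_exists` stands in for whatever lookup you wire to the API yourself. The voice IDs in the example are made up.

```python
from typing import Callable, Sequence


class NoVoiceAvailable(RuntimeError):
    """Raised when none of the configured voice IDs can be resolved."""


def resolve_voice(candidates: Sequence[str],
                  voice_exists: Callable[[str], bool]) -> str:
    """Return the first candidate voice ID that still exists.

    `voice_exists` is a hypothetical callable you supply; it should
    return True if the voice ID is still usable on your account.
    """
    for voice_id in candidates:
        if voice_exists(voice_id):
            return voice_id
    raise NoVoiceAvailable(f"none of these voices resolve: {list(candidates)}")


# Example with a stand-in lookup: the first ID has been removed.
available = {"my-cloned-voice-id"}
chosen = resolve_voice(
    ["removed-library-voice-id", "my-cloned-voice-id"],
    voice_exists=lambda vid: vid in available,
)
print(chosen)
```

Ordering the candidates from preferred to fallback means a removed library voice degrades to a voice you own rather than failing the whole run.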
Gotcha 2: streaming is a different method, and you want it for agents
convert waits for the full clip before it hands you bytes. Fine for batch voiceover. Wrong for anything interactive, where time-to-first-byte is the number that matters.
For a realtime voice agent, use the streaming endpoint and a low-latency model so audio starts playing while the rest is still generating:
stream = client.text_to_speech.stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # low-latency model for realtime
    text="Hold on, let me check that for you.",
    output_format="mp3_44100_128",
)
for chunk in stream:
    if isinstance(chunk, bytes):
        play(chunk)  # play() is a placeholder for your own audio sink
The mistake I see people make is benchmarking an agent with convert and concluding the latency is bad. It is not the model — it is that you asked for the whole file. Switch to stream, drop to the flash model, and the felt latency falls off a cliff.
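If you want to benchmark this fairly, measure time to first chunk separately from total time. The helper below is a rough sketch that works on any iterator of byte chunks, which is what both calls above hand back; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` only simulates an audio iterator so the example runs without an API key.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a chunk iterator; return (seconds to first chunk, total seconds, bytes)."""
    start = time.perf_counter()
    first = None
    total_bytes = 0
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    # If the iterator was empty, report total time as the time to first chunk.
    return (first if first is not None else total), total, total_bytes


def fake_stream(n_chunks: int = 3, delay: float = 0.01) -> Iterator[bytes]:
    """Stand-in for an SDK audio iterator: yields 1 KiB every `delay` seconds."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield b"\x00" * 1024


ttfb, total, size = measure_stream(fake_stream())
print(f"first chunk after {ttfb:.3f}s, done after {total:.3f}s, {size} bytes")
```

Pass the real iterator in place of `fake_stream()` and compare the first number across methods and models; that is the figure a listener actually feels.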
Gotcha 3: you are billed per character, including the takes you throw away
Here is the one that actually shapes your architecture. ElevenLabs meters in credits, and credits map roughly one-to-one to characters of text. About 1,000 characters is a minute of audio.
The trap is that every generation is billed, including regenerations. During development I would tweak a script, re-run, tweak, re-run — and each of those runs spent real credits even though I was throwing the audio away. On a tool where the headline plan advertises ~121 minutes a month, a chatty dev loop or a busy agent eats that faster than the minute count suggests, because the count assumes you nail every take once. You will not, and neither will your users.
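The arithmetic is worth writing down once. This is a back-of-the-envelope sketch using only the two rules of thumb above (roughly one credit per character, roughly 1,000 characters per minute); the function names and the workload numbers are invented for illustration, and real billing can differ by model and plan.

```python
def credits_needed(script_chars: int, scripts: int, takes_per_script: int) -> int:
    """Credits spent, assuming 1 credit per character and every take billed."""
    return script_chars * scripts * takes_per_script


def minutes_of_audio(credits: int, chars_per_minute: int = 1000) -> float:
    """Approximate minutes of audio that a number of credits generates."""
    return credits / chars_per_minute


# Hypothetical month: 40 scripts of 1,500 characters each.
one_take = credits_needed(1500, 40, takes_per_script=1)
three_takes = credits_needed(1500, 40, takes_per_script=3)

print(one_take, minutes_of_audio(one_take))        # 60000 60.0
print(three_takes, minutes_of_audio(three_takes))  # 180000 180.0
```

Same 60 minutes of finished audio in both cases, but three takes per script bills you for 180 minutes' worth of characters.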
This turns into two concrete engineering decisions:
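Cache aggressively, keyed on the inputs. If the (text, voice_id, model_id, settings) tuple is unchanged, the clip you already generated is still the clip you want, so do not pay for it twice. Regenerating is not guaranteed to give you byte-identical audio anyway, which is one more reason to keep the take you approved. A content hash of those inputs makes a clean cache key: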
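import hashlib
import json

def cache_key(text, voice_id, model_id, settings):
    # Assumes settings is a plain dict; sorted keys make equal dicts hash the same
    blob = json.dumps([text, voice_id, model_id, settings], sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()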
Look that up before you call the API. In my pipeline this is the difference between paying once per script and paying every time a downstream retry fires.
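For completeness, here is roughly what the lookup can look like as a disk cache. Treat it as a sketch under assumptions rather than my exact production code: `cached_audio` and `synthesize` are hypothetical names, `synthesize` is any zero-argument callable that returns an iterable of byte chunks (in practice, a wrapper around the API call), and `settings` is assumed to be a JSON-serializable dict.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def cached_audio(cache_dir: Path, text: str, voice_id: str, model_id: str,
                 settings: dict,
                 synthesize: Callable[[], Iterable[bytes]]) -> Path:
    """Return the path to cached audio, calling `synthesize` only on a miss."""
    payload = json.dumps([text, voice_id, model_id, settings], sort_keys=True)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.mp3"
    if path.exists():
        return path  # cache hit: no API call, no credits spent
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".part")
    with open(tmp, "wb") as f:
        for chunk in synthesize():
            f.write(chunk)
    tmp.replace(path)  # rename only after the write completed
    return path


# Stand-in for the real API call so the example runs offline.
calls = []

def fake_synthesize():
    calls.append(1)
    return [b"ID3", b"audio-bytes"]

with tempfile.TemporaryDirectory() as d:
    args = (Path(d), "Welcome to the show.", "voice-a", "model-a", {"stability": 0.5})
    first = cached_audio(*args, fake_synthesize)
    second = cached_audio(*args, fake_synthesize)  # served from disk
    data = first.read_bytes()

print(len(calls))
```

Writing to a `.part` file and renaming afterwards keeps a crashed or interrupted generation from being mistaken for a finished clip on the next run.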
Proof the text before it ever hits the API. Spell-check, normalize, lock the script as text first. Generating audio to discover a typo means paying to hear your own mistake read aloud, then paying again for the fix.
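Normalizing also helps the cache: two scripts that differ only by a stray space would otherwise hash differently and be billed twice. A minimal sketch of that step follows, assuming whitespace differences carry no meaning in your scripts; `normalize_script` is a hypothetical helper and the rules it applies are just a starting point.

```python
import re
import unicodedata


def normalize_script(text: str) -> str:
    """Canonicalize a script so trivial differences do not trigger a re-render."""
    text = unicodedata.normalize("NFC", text)  # one canonical form for accents
    text = text.replace("\u00a0", " ")         # non-breaking space -> space
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)       # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)     # at most one blank line in a row
    return text.strip()


raw = "  Welcome   to\u00a0the show. \n\n\n\nNext line.\t"
clean = normalize_script(raw)
print(repr(clean))
```

Run this before both the cache lookup and the API call so the text you hash is exactly the text you send.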
I went deeper on the credit economics — what the tiers actually buy, where the real per-minute ceiling lands once you account for regenerations, and whether it is worth it versus the cheaper options — in a full hands-on review of ElevenLabs where I ran my own scripts through every plan. The short version for builders: the API is excellent and the voices are a real class above the field, but price your expected traffic against the credit math, not the marketing minutes, before you commit a production workload to it.
So is the API worth building on?
For anything where the voice is the product — narration at scale, a voice agent, dubbing — yes, and it is not close. The SDK is clean, the Python and JavaScript clients are first-class rather than an afterthought, and the prototype-to-production continuity is genuinely rare in this space.
Just go in knowing the three things the quickstart leaves out: pin your voices, stream for latency, and budget credits like the line item they are. Get those right up front and the integration is boring in the best way.
If you have wired up a different TTS API and hit a worse or better version of the billing problem, I would like to hear it — drop it in the comments.