A walkthrough of building a voice AI backend — through three TTS providers, a chunking problem, Redis caching, distributed locks, and a thundering herd.
The Idea
I wanted to read long articles without staring at a screen. The concept was simple: paste an article, get back an MP3. Building it turned out to be an education in the real-world constraints of TTS APIs — character limits, latency, cost, and what happens when 50 users click Play on the same article at the same moment.
Here's the full journey, told through the architecture decisions that actually mattered.
Iteration 1 — Piper TTS: Free, Local, and Immediately Limiting
The first version ran Piper — an open-source, offline neural TTS engine. You spin up a process, feed it text, get back a WAV file. No API keys, no cost, no network round-trips.
What worked: It ran entirely on my machine. Zero latency on credentials. Perfect for prototyping.
What broke: Piper is a local binary. It has no concept of concurrency — one synthesis job at a time. Voice quality, while decent, was noticeably robotic on longer prose. And crucially, the model files are large. Deploying this to a server meant bundling hundreds of megabytes of model weights and a native binary per target platform.
The real killer was the character limit. Piper (like all neural TTS systems) struggles with very long inputs. A full 2000-word article would either fail silently or produce garbled audio near the end. That problem — long text — became the thread I'd keep pulling on through every subsequent iteration.
The exit criterion: I needed a hosted API with a predictable quality ceiling and a clear path to production.
Iteration 2 — ElevenLabs: Great Voice, Brutal Cost at Scale
ElevenLabs produces genuinely impressive voice output. The SDK is well-designed:
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=api_key, timeout=timeout_sec)
audio_iter = client.text_to_speech.convert(
    voice_id,
    text=text,
    model_id=model_id,
    output_format=output_format,
)
data = b"".join(audio_iter)
You stream back an iterator of bytes and concatenate. Clean, fast, and the voice quality is excellent.
What worked: Plug-and-play integration. The voices sound human. Developer experience is top-tier.
What broke: The free tier evaporates fast. A single medium-length article at 1500 characters per minute of audio burns through credits quickly. If you're building something for real users — or even testing seriously — the cost curve is steep.
There was also the same character-limit problem: ElevenLabs has a per-request text limit. A long article needs to be split before you even call the API. I'd deferred solving this from the Piper days, but here it became unavoidable.
The exit criterion: I needed either cheaper synthesis or a way to make the expensive calls worth their cost. That meant solving chunking first, then caching second.
The Architecture Pivot: Chunking the Article
Before switching providers, I had to solve the fundamental problem: a full article can be 10,000+ characters, and every TTS provider has a per-request limit (Amazon Polly's is 3,000 characters for standard voices, for example).
The solution is a text chunker that splits on sentence boundaries — never in the middle of a sentence — and targets a configurable chunk size:
import re

_SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")

def chunk_text(text: str, target_chars: int = 2500, max_chars: int = 4000) -> list[str]:
    text = re.sub(r"\s+", " ", text).strip()
    sentences = _SENTENCE_RE.split(text)
    chunks: list[str] = []
    buf: list[str] = []
    size = 0
    for sentence in sentences:
        add = len(sentence) + (1 if buf else 0)
        if size + add > max_chars and buf:
            # next sentence would blow the hard ceiling: flush what we have
            chunks.append(" ".join(buf))
            buf = [sentence]
            size = len(sentence)
        else:
            buf.append(sentence)
            size += add
        if size >= target_chars:
            # soft target reached: close the chunk at this sentence boundary
            chunks.append(" ".join(buf))
            buf = []
            size = 0
    if buf:
        chunks.append(" ".join(buf))
    return chunks
Two thresholds, not one:

- target_chars (default 2500): the soft target. Once a chunk reaches this, close it and start a new one.
- max_chars (default 4000): the hard ceiling. If the next sentence would push past this, flush first even if target_chars hasn't been reached.
This means every chunk is a coherent set of complete sentences, never a mid-sentence cut, and stays within the provider's hard limit — with one caveat: a single sentence longer than max_chars is emitted as its own oversized chunk, so the guarantee holds only as long as no individual sentence exceeds the ceiling.
Once you have chunks, you synthesize each one independently and stitch the resulting MP3 files together with ffmpeg's concat demuxer.
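That stitching step can be sketched with a small helper. The function names here are illustrative, not the project's actual code — the point is the shape of the concat demuxer invocation:

```python
import subprocess
from pathlib import Path

def build_concat_list(chunk_paths: list[Path]) -> str:
    # ffmpeg's concat demuxer reads a text file with one "file '...'" line per input.
    return "".join(f"file '{p.as_posix()}'\n" for p in chunk_paths)

def concat_chunks(chunk_paths: list[Path], out_path: Path) -> None:
    list_file = out_path.with_suffix(".txt")
    list_file.write_text(build_concat_list(chunk_paths))
    # -c copy avoids re-encoding: every chunk came from the same voice/engine
    # with identical codec settings, so ffmpeg can splice the MP3 frames as-is.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", str(list_file), "-c", "copy", str(out_path)],
        check=True,
    )
```

The `-safe 0` flag is needed when the list file contains absolute paths, which is the usual case when the chunks live in a per-request temp directory.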
Iteration 3 — Amazon Polly: The Right Economics
With chunking solved, I switched the synthesis backend to Amazon Polly. The economics are hard to beat:
- Standard voices: 5 million characters/month free for the first 12 months, then roughly $4 per million characters.
- Neural voices: 1 million characters/month free for the first 12 months, then pay-as-you-go.

For a personal reading assistant or a low-to-medium traffic app, the standard tier is effectively free during the trial year — and after it, an 18,000-character article costs well under a dime.
The full request flow at this point:
POST /tts/from-text { "text": "..." }
│
▼
chunk_text() → [chunk_0, chunk_1, chunk_2, ...]
│
▼ (for each chunk)
polly.synthesize_speech() → chunk_N.mp3
│
▼
ffmpeg concat → combined.mp3
│
▼
FileResponse (streamed back to client, temp dir cleaned up in background)
This works. A 3000-word article (~18,000 characters) splits into roughly 7 chunks. Each Polly call takes 0.5–2 seconds. Total latency: 4–12 seconds depending on network and chunk count.
The problem with this: every request re-synthesizes everything from scratch. The same New York Times article, requested by 100 different users, triggers 700 Polly calls. That's wasteful, slow, and eventually expensive.
Iteration 4 — Redis Cache: Stop Paying for the Same Sentence Twice
The insight: text is deterministic. The same input sentence will always produce the same audio bytes from the same voice/engine/region combination. This is a perfect caching problem.
The cache key encodes everything that affects the output. Including voice ID, engine, and region in the key means that if you switch from Joanna/standard to Matthew/neural, you automatically get cache misses — you never accidentally serve audio from the wrong voice.
The loop with Redis, before locking:
for i, chunk in enumerate(chunks):
    h = hash_chunk(chunk)
    key = _redis_chunk_key(chunk_hash=h)
    mp3_path = tmp / f"chunk_{i}.mp3"  # per-chunk output path in the request's temp dir
    cached = r.get(key)
    if cached:
        hits += 1
        mp3_path.write_bytes(cached)  # cache hit: instant
    else:
        misses += 1
        mp3_path = polly_synth_chunk_mp3(...)  # cache miss: call Polly
        r.set(key, mp3_path.read_bytes(), ex=cache_ttl_sec)
This is dramatically better than nothing. A second request for the same article is now pure cache reads — sub-100ms total instead of 4–12 seconds.
But there's a race condition hiding here.
Iteration 5 — The Thundering Herd Problem
Imagine a popular article gets published. Fifty users open it and click Play simultaneously. Here's what happens without locking:
- All 50 requests call r.get(key) — all get None (cold cache).
- All 50 requests call polly.synthesize_speech() for the exact same chunks.
- All 50 requests write the same bytes to Redis.
- You just made 350 Polly calls (50 users × 7 chunks) when 7 would have done the job — 49 out of every 50 were wasted.
It's expensive, it stresses the upstream API, and it can push you into Polly's throttling limits.
The fix is a distributed synthesis lock in Redis.
Iteration 6 — Redis Distributed Lock: One Synthesis Per Chunk
The pattern: before calling Polly, try to atomically acquire a lock on the synthesis of that chunk. Only one worker wins the lock. Everyone else waits for the winner to finish and populate the cache.
lock_key = f"{cache_key}:synth-lock"
lock_ttl = max(180, int(polly_timeout * 2) + 30) # generous TTL
got_lock = r.set(lock_key, b"1", nx=True, ex=lock_ttl)
nx=True means "only set if not exists" — this is atomic in Redis. Exactly one caller gets True; all others get None.
The full per-chunk decision tree:
┌─────────────────────────────────────────────┐
│ r.get(cache_key) │
│ ├── HIT → use cached bytes, continue │
│ └── MISS → try to acquire synth-lock │
│ ├── GOT LOCK │
│ │ → call Polly │
│ │ → write result to cache │
│ │ → release lock │
│ └── LOCK HELD BY ANOTHER │
│ → wait with exponential │
│ backoff for cache to │
│ populate │
│ ├── cache appeared → HIT │
│ └── timeout → try lock │
│ once more, or 503 │
└─────────────────────────────────────────────┘
The wait function uses exponential backoff with a cap:
import time

def _redis_wait_for_chunk(r, value_key, *, deadline_monotonic):
    backoff = 0.05
    max_backoff = 0.5
    while time.monotonic() < deadline_monotonic:
        data = r.get(value_key)
        if data:
            return data
        time.sleep(backoff)
        backoff = min(max_backoff, backoff * 1.25)
    return None
Start polling at 50ms, grow by 25% each iteration, cap at 500ms. This keeps Redis query volume low while still responding promptly when the synthesis finishes.
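To make the schedule concrete, here's a throwaway helper (not part of the project) that reproduces the interval sequence:

```python
def backoff_schedule(n: int, start: float = 0.05,
                     factor: float = 1.25, cap: float = 0.5) -> list[float]:
    # The first n sleep durations produced by the capped exponential backoff.
    out, delay = [], start
    for _ in range(n):
        out.append(delay)
        delay = min(cap, delay * factor)
    return out
```

The intervals run roughly 0.05, 0.0625, 0.078, 0.098… seconds, hitting the 0.5 s cap after about a dozen polls — so a waiter issues only a handful of Redis GETs per second even during a long synthesis.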
The full route handler handles all three outcomes:
for i, chunk in enumerate(chunks):
    h = hash_chunk(chunk)
    key = _redis_chunk_key(chunk_hash=h)
    mp3_path = tmp / f"chunk_{i}.mp3"  # per-chunk output path in the request's temp dir

    # 1. Cache hit - instant return
    cached = r.get(key)
    if cached:
        hits += 1
        mp3_path.write_bytes(cached)
        continue

    # 2. Try to become the synthesizer
    lock_key = _redis_synth_lock_key(chunk_cache_key=key)
    got_lock = r.set(lock_key, b"1", nx=True, ex=lock_ttl)
    if got_lock:
        misses += 1
        try:
            mp3_path = polly_synth_chunk_mp3(text=chunk, ...)
            b = mp3_path.read_bytes()
            if len(b) <= max_cached_bytes:
                r.set(key, b, ex=cache_ttl_sec)
        finally:
            r.delete(lock_key)  # always release, even on error
        continue

    # 3. Someone else holds the lock - wait for their result
    deadline = time.monotonic() + lock_ttl + wait_extra_sec
    waited = _redis_wait_for_chunk(r, key, deadline_monotonic=deadline)
    if waited:
        hits += 1
        mp3_path.write_bytes(waited)
        continue

    # 4. Wait timed out - try to acquire the lock one more time
    if r.set(lock_key, b"1", nx=True, ex=lock_ttl):
        # ... synthesize and cache (same as case 2)
        continue

    # 5. Still locked after the full wait - give up gracefully
    raise HTTPException(
        status_code=503,
        detail="TTS busy synthesizing this segment; retry shortly.",
    )
The finally: r.delete(lock_key) is the most important line. Whether Polly succeeds, errors, times out, or raises an exception, the lock is released. Without this, a failed synthesis leaves the lock held until TTL expiry, blocking all subsequent requests for that chunk for potentially minutes.
Handling Scale: The Full Picture
With caching and locking in place, the behavior under load becomes predictable.
Warm cache (article seen before):
All chunks are in Redis. Every request is N × r.get() + ffmpeg concat + FileResponse. Latency drops to under 300ms for most articles. No Polly calls at all.
Cold cache, 50 simultaneous users (thundering herd):
- 1 request per chunk wins the lock → calls Polly, writes to cache, releases the lock.
- 49 requests wait on _redis_wait_for_chunk → find the cached bytes as soon as the winner finishes.
- Total Polly calls: N chunks (7 for our example), not 50 × N = 350.
- You can verify this in the logs: chunk cache stats hits=49 misses=1 per chunk.
Memory guard:
if len(b) <= max_cached_bytes:
    r.set(key, b, ex=cache_ttl_sec)
Chunks larger than MAX_CACHED_CHUNK_BYTES (default 5MB) are synthesized but not cached. A pathologically long chunk from unusual input won't fill Redis memory.
The Final Architecture Diagram
Client
│ POST /tts/from-text { text: "..." }
▼
FastAPI (backend/main.py)
│
├── chunk_text() → [chunk_0 .. chunk_N]
│ (sentence-boundary splitting)
│
└── for each chunk:
│
├── SHA-256 hash → cache key
│
├── Redis GET
│ ├── HIT → write bytes to disk
│ └── MISS
│ ├── SET NX (acquire synth lock)
│ │ ├── GOT LOCK
│ │ │ → Amazon Polly synthesize_speech()
│ │ │ → write MP3 bytes to disk
│ │ │ → Redis SET (cache result, 30-day TTL)
│ │ │ → Redis DEL (release lock)
│ │ └── LOCK HELD
│ │ → exponential backoff poll
│ │ → cache appeared → write bytes to disk
│ │ → timeout → retry lock → 503
│
└── ffmpeg concat → combined.mp3
│
└── FileResponse (audio/mpeg)
background: shutil.rmtree(tmp)
What I'd Do Differently
Async synthesis. The current implementation is synchronous — the HTTP request blocks until all Polly calls return and ffmpeg finishes. For a public API, I'd move to a job queue (Celery, ARQ, or even a simple Redis list): accept the article, return a job ID immediately, poll or subscribe for the result. This eliminates timeout risk on slow connections.
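A sketch of that shape, with in-memory stand-ins (a real version would use Celery/ARQ workers and durable storage; every name here is hypothetical):

```python
import threading
import uuid

JOBS: dict[str, dict] = {}  # job_id -> {"status": ..., "result": ...}

def submit(text: str, synthesize) -> str:
    # Accept the article, hand back a job id immediately, do the work off-thread.
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "pending", "result": None}

    def worker():
        JOBS[job_id]["result"] = synthesize(text)
        JOBS[job_id]["status"] = "done"

    t = threading.Thread(target=worker, daemon=True)
    JOBS[job_id]["_thread"] = t  # kept so callers can join/await if needed
    t.start()
    return job_id

def poll(job_id: str) -> dict:
    return JOBS.get(job_id, {"status": "unknown"})
```

The client flow becomes: POST the text, receive the job id, then GET /jobs/{id} until the status flips to done — no long-lived HTTP request to time out.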
Streaming audio. Instead of waiting for all chunks before returning, you can stream chunk_0 to the client while chunk_1 is still synthesizing. This cuts perceived latency significantly for long articles.
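The generator shape for this is small — here synth is a stand-in for the per-chunk cache-or-Polly lookup described above:

```python
from typing import Callable, Iterable, Iterator

def stream_mp3(chunks: Iterable[str], synth: Callable[[str], bytes]) -> Iterator[bytes]:
    # Yield each chunk's audio the moment it is ready instead of buffering
    # the whole article. MP3 is frame-based, so back-to-back chunk streams
    # generally play as one continuous file.
    for chunk in chunks:
        yield synth(chunk)
```

With FastAPI this plugs straight into StreamingResponse(stream_mp3(chunks, synth), media_type="audio/mpeg"): the client hears chunk_0 while chunk_1 is still being synthesized.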
Persistent cache storage. Redis in-memory is fast but expensive per GB at scale. For audio bytes that are valid for months, consider offloading cached chunks to S3 or R2 (using Redis only for the lock and a pointer/URL, not the raw bytes).
Key Takeaways
- All TTS providers have character limits. Design your chunker before you pick a provider, not after.
- Text synthesis is deterministic. The same text from the same voice always produces the same bytes. Cache aggressively.
- Cache keys must include all synthesis parameters. Voice ID, engine, and region are part of the key — not just the text hash.
- The thundering herd is real. Without a distributed lock, a cold-cache spike causes N × concurrent_users upstream calls. Redis SET NX is the right primitive for this.
- Always release locks in finally blocks. A failed synthesis that doesn't release its lock blocks every subsequent request for that chunk until TTL expiry.