Cloud TTS Chirp3-HD with Caching: Fixing Voice Readout for Accessibility

#tts #cloudtts #caching #a11y

As a solo developer, I have to keep the product lean and accessible. A recent request highlighted a critical need: the ability to have AI chat responses read aloud. This wasn't just about adding a feature; it was about making the platform usable for someone with visual impairments, specifically a user's mother-in-law who struggles with reading text on screen. The initial thought was simple text-to-speech (TTS), but implementing it well, especially on a single small VM, presented several engineering challenges.

The Genesis: A Need for Voice

The request was clear: "When I ask a question via text, if I don't have time to check it, let me hear the answer via voice." This immediately told me it wasn't about real-time conversational voice, but rather a playback feature for existing text responses. This distinction is crucial for architecture and cost management.

Design Iterations: From Browser to Cloud

The first instinct might be to use the browser's built-in TTS capabilities. However, user feedback quickly shut that down: the native browser voices sounded too robotic and were actively disliked. This pointed towards a cloud-based neural TTS engine. The key requirements that emerged after a couple of back-and-forth discussions were:

Natural Voice: The voice needed to be human-like and pleasant to listen to.
Text Cleaning: Tables, code blocks, and other non-prose elements in the AI's response would create noise if read directly. These needed to be cleaned or handled gracefully.
Efficiency: Repeatedly generating the same audio for the same response was wasteful, both in terms of processing time and cost. A caching mechanism was essential.

Engine Selection: Benchmarking Latency and Quality

I evaluated three cloud TTS options:

Vertex AI Gemini TTS: Offered a natural voice but had higher latency, taking 7 to 19 seconds per generation.
Cloud TTS Chirp3-HD: This engine provided Gemini-level naturalness with significantly faster generation, around 2 seconds. It used the ko-KR-Chirp3-HD-Charon and ko-KR-Chirp3-HD-Kore voices, which were consistent with other natural voices used in the product.
Neural2: This was the fastest at around 0.5 seconds but sounded noticeably more like a traditional TTS engine.

Given the balance of speed, naturalness, and voice consistency, Cloud TTS Chirp3-HD was the clear winner. The implementation would involve using its REST API with a service account (SA) that had the cloud-platform scope for authentication, and storing generated audio in Google Cloud Storage (GCS).
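As a rough illustration of the REST call, the sketch below builds the JSON body for the Cloud Text-to-Speech v1 text:synthesize method and decodes the base64 audioContent field it returns. The helper names are hypothetical, and the HTTP call itself (a POST with an Authorization: Bearer token obtained from the service account) is left out, so treat this as a sketch of the request shape rather than the code running in production.

```python
import base64

# Public REST endpoint for Cloud Text-to-Speech (v1).
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(text: str, voice_name: str = "ko-KR-Chirp3-HD-Charon") -> dict:
    """Build the JSON body for a text:synthesize call (hypothetical helper)."""
    # The language code is the first two segments of the voice name, e.g. "ko-KR".
    language_code = "-".join(voice_name.split("-")[:2])
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def decode_audio(response_body: dict) -> bytes:
    """Decode the base64 audio carried in the response's audioContent field."""
    return base64.b64decode(response_body["audioContent"])
```

The returned MP3 bytes are what would be written to GCS; keeping the request builder separate from the network call makes it easy to test without spending quota.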

Implementation Details: Cleaning, Caching, and Delivery

The implementation involved several components:

Text Cleaning: A new function, clean_for_tts, was introduced. This function preprocesses the AI's response to remove or rephrase elements that don't translate well to speech. For instance, code blocks and tables would be replaced with a message like "Please refer to the screen for tables and code." Links, emphasis, and list markers were also stripped to leave only plain text.
Caching Logic: A cache was implemented using GCS. When a request for audio comes in, the cleaned text is hashed. If an MP3 file for that hash already exists under tts-cache/ in GCS, it's served immediately. This is a cache HIT: it incurs no extra TTS cost and doesn't count against the daily quota. If the file doesn't exist (a cache MISS), the Cloud TTS API is called to generate the audio. The generated audio is then saved to GCS, and the generation is logged in the usage_logs table with source='tts'.
Cost Management: To prevent abuse and manage costs on my small VM, a daily cap (TTS_DAILY_CAP) was set. This cap applies only to MISSes, not HITs. The cost for TTS generation is approximately $30 per 1 million characters.
API Endpoint: A new API endpoint, POST /chat/tts, was created. This endpoint is publicly accessible (as it's an accessibility feature) and returns the audio data as a base64 encoded string. It handles cache HITs and MISSes.
Nginx Configuration: To ensure the new /api/chat/tts endpoint was correctly routed and not intercepted by other services (like SSE), an exact match configuration was added to Nginx, drawing from lessons learned from previous video upload handling.
Frontend Integration: A "Listen to response" (🔊) button was added to each chat bubble. This button, visible by default for accessibility, triggers the API call. When the audio data arrives, the browser plays it through an element created with the new Audio() constructor. The user's selected voice preference is also respected.
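The cleaning and caching steps above can be sketched roughly as follows. This is not the actual clean_for_tts; it is a regex-based approximation. Several pieces are assumptions: the store dict stands in for the GCS bucket, synthesize is a placeholder for the Cloud TTS call, misses_today is assumed to come from the usage_logs count, and including the voice name in the hash is my guess at how different voice preferences stay separate.

```python
import hashlib
import re

NOTICE = "Please refer to the screen for tables and code."
_SKIP = "@@SKIP@@"  # temporary marker for removed tables and code blocks


def clean_for_tts(markdown: str) -> str:
    """Reduce a markdown chat response to plain, speakable text."""
    text = re.sub(r"```.*?```", f"\n{_SKIP}\n", markdown, flags=re.DOTALL)  # fenced code
    text = re.sub(r"(?m)^[ \t]*\|.*\|[ \t]*$", _SKIP, text)  # table rows
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)  # links/images: keep label
    text = re.sub(r"`([^`]*)`", r"\1", text)  # inline code: keep contents
    text = re.sub(r"(?m)^[ \t]*(?:[-*+]|\d+\.|#+)[ \t]+", "", text)  # list/heading markers
    text = re.sub(r"(\*\*|__|\*)(.+?)\1", r"\2", text)  # emphasis
    # Collapse any run of skipped blocks into a single spoken notice.
    text = re.sub(rf"(?:\s*{_SKIP}\s*)+", f"\n{NOTICE}\n", text)
    return re.sub(r"\s*\n\s*", "\n", text).strip()


def cache_key(cleaned_text: str, voice_name: str) -> str:
    """Object name under tts-cache/ derived from a SHA-256 of voice and text."""
    digest = hashlib.sha256(f"{voice_name}\n{cleaned_text}".encode("utf-8")).hexdigest()
    return f"tts-cache/{digest}.mp3"


def get_or_synthesize(text, voice_name, store, synthesize, misses_today, daily_cap):
    """Return (mp3_bytes, was_hit). Only a MISS is checked against the cap."""
    cleaned = clean_for_tts(text)
    key = cache_key(cleaned, voice_name)
    if key in store:
        return store[key], True  # HIT: no API call, nothing counted
    if misses_today >= daily_cap:
        raise RuntimeError("daily TTS generation cap reached")
    audio = synthesize(cleaned, voice_name)  # MISS: generate once, then cache
    store[key] = audio
    return audio, False
```

Hashing the cleaned text rather than the raw response means two replies that differ only in formatting share one cached MP3.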

A Hidden Bug: Date Encoding Woes

During testing, a critical bug surfaced related to the daily quota tracking. The system was failing to record usage for sessions that timed out or had no speech output, causing a potential leak in the daily cap. This was traced to how dates were handled in the database query for tracking daily generations. Specifically, the query bound a date string to a parameter written as $2::date. asyncpg expects a datetime.date object for a date parameter and does not parse strings, so it raised a DataError. The fix was to compute the date directly in SQL with (now() AT TIME ZONE 'Asia/Seoul')::date instead of passing a string.

Lesson learned: When using asyncpg for date comparisons, it's safer to perform date calculations within SQL or ensure you're passing proper date objects, rather than relying on string casting, especially across different time zones.
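A minimal sketch of the SQL-side approach is shown here. The created_at column (assumed to be timestamptz) and the function names are hypothetical, since the post only mentions the usage_logs table and its source column. The point is that no date parameter is bound at all, so there is nothing for asyncpg to reject.

```python
# "Today" in Seoul is computed by PostgreSQL, so no date parameter is bound.
# created_at is an assumed timestamptz column; only source is named in the post.
COUNT_TODAY_SQL = """
    SELECT count(*)
    FROM usage_logs
    WHERE source = $1
      AND (created_at AT TIME ZONE 'Asia/Seoul')::date
          = (now() AT TIME ZONE 'Asia/Seoul')::date
"""


async def misses_today(conn) -> int:
    """Count today's TTS generations; conn is an asyncpg connection."""
    return await conn.fetchval(COUNT_TODAY_SQL, "tts")


async def can_generate(conn, daily_cap: int) -> bool:
    """True while today's generation count is still below the cap."""
    return await misses_today(conn) < daily_cap
```

If a date does have to come from Python, passing a datetime.date object instead of a string avoids the same DataError.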

Honest Limitations

While this feature significantly enhances accessibility, it's important to be transparent about its limitations:

The voice used is a natural neural voice, not the device's default.
Tables and code blocks are skipped with a "Please refer to the screen" message.
Automatic playback is not enabled to prevent accidental costs and ensure user control.
While cache HITs are free and unlimited, generating new audio deducts from the daily quota.

Furthermore, a note was added that any replies generated by the user's request would require explicit approval before being sent, a standard procedure for user-facing replies.

...building aicoreutility.com in the open...

💬 This is part of Riel, a full AI product I'm building solo, in public (failures and all).
