
Roma

How We Built Voice Messages for AI Companions: Real Voice Audio, ElevenLabs, and Beyond

Adding voice messages to an AI companion is one of those features that sounds simple until you try to ship it. "Just use a TTS API and send the audio": sure, in theory. In practice, you are solving latency, character consistency, emotional expressiveness, and cost optimization all at once.

Here is how voice synthesis works in production for AI companions, based on what I have learned building and studying these systems.

The TTS landscape in 2026

The text-to-speech market has fragmented into two tiers.

Tier one: high-quality providers with natural-sounding output. Fish Audio, ElevenLabs, and PlayHT are the leaders here. The voices sound human. They handle emphasis, pacing, and emotional variation. They cost between $15 and $30 per million characters.

Tier two: cost-efficient providers with acceptable quality. Google Cloud TTS, Amazon Polly, and various open-source models (Bark, XTTS). Cheaper by an order of magnitude but with audible synthesis artifacts. Fine for IVR systems, not great for intimate AI companion voice messages.

For AI companions specifically, tier one is the only viable option. Users are emotionally engaged with the character: a robotic-sounding voice note destroys immersion instantly. The cost premium is worth the quality.

Choosing between Fish Audio and ElevenLabs

Fish Audio and ElevenLabs represent different trade-offs.

ElevenLabs has broader brand recognition and excellent English voice quality. Their emotional range is good out of the box, and the API is well-documented. The latency is acceptable (typically 1-3 seconds for a 10-second clip). The main drawback is cost at scale and occasional inconsistency with longer passages.

Fish Audio (specifically the S2-Pro model) has become increasingly competitive. The voice quality matches ElevenLabs for most use cases, with better handling of whispering and soft-spoken content, which matters for AI companions where emotional intimacy is a core use case. Latency is slightly lower in our testing. The API is newer but stable.

For a detailed guide on building voice messages for AI companions, including the user experience side, there is a comprehensive overview covering both the technical pipeline and the interaction design.

The production approach many platforms use is a primary/fallback pattern. One provider handles 95% of requests. The other catches failures. This gives you reliability without doubling your integration complexity.
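The primary/fallback pattern can be sketched in a few lines. This is a minimal illustration, not a real SDK integration: the provider callables and the `TTSError` exception are hypothetical stand-ins for whatever client wrappers you build around Fish Audio and ElevenLabs.

```python
class TTSError(Exception):
    """Raised by a provider wrapper when synthesis fails."""


def synthesize_with_fallback(text, primary, fallback):
    """Try the primary TTS provider; route to the fallback on failure.

    `primary` and `fallback` are callables taking text and returning
    audio bytes -- thin wrappers around the real provider APIs.
    """
    try:
        return primary(text)
    except TTSError:
        # One retry path only: if the fallback also fails, let it raise,
        # so the caller can degrade to a text-only message.
        return fallback(text)
```

In practice you would also record which provider served each request, so you notice when the "5% fallback" quietly becomes 40%.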

The emotional expression problem

Raw TTS sounds flat even with good providers. The AI writes "haha that is so cute" and the TTS reads it in a neutral tone. The emotional disconnect is immediately noticeable.

The solution is an emotion tag pipeline between the LLM and the TTS API. After the language model generates the response text, a secondary pass adds emotion markers: SSML tags, provider-specific annotations, or inline directions that tell the TTS engine how to deliver the text.

For Fish Audio's S2-Pro, bracketed tags like [laughs], [whispers], and [sighs] modify the delivery. The challenge is making these tags feel natural rather than mechanical. A separate, lightweight LLM pass evaluates the conversation context and inserts tags where they would naturally occur.
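A minimal sketch of that tagging pass, with a small rule table standing in for the lightweight LLM call (the rules, cues, and tag placement here are illustrative assumptions; tag syntax and support vary by provider):

```python
import re

# Textual cues mapped to bracketed delivery tags in the [laughs]/[sighs]
# style. A production system would use a small LLM pass with conversation
# context instead of regex rules.
TAG_RULES = [
    (re.compile(r"\b(haha|lol)\b", re.IGNORECASE), "[laughs] "),
    (re.compile(r"\*sigh\*", re.IGNORECASE), "[sighs] "),
    (re.compile(r"\(whispering\)", re.IGNORECASE), "[whispers] "),
]


def add_emotion_tags(text):
    """Prefix the text with delivery tags triggered by simple cues."""
    tags = [tag for pattern, tag in TAG_RULES if pattern.search(text)]
    return "".join(tags) + text
```

The rule-based version is only a placeholder: the whole point of using an LLM for this pass is that it can place tags mid-sentence where they would naturally occur, rather than stacking them at the front.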

This pipeline adds 200-500ms of latency. On a voice message that takes 2-3 seconds to generate, that is a significant percentage. The trade-off is worth it: an emotionally expressive voice note at 3 seconds feels more natural than a flat one at 2.5 seconds.

Per-character voice identity

Each AI companion character needs a unique voice that matches their personality. A playful, energetic character should not sound the same as a calm, introspective one.

Both Fish Audio and ElevenLabs support voice cloning and custom voice creation. The process typically involves providing reference audio samples (5-30 seconds for instant cloning, several minutes for fine-tuned cloning) that capture the target voice profile.

The critical detail is that the voice needs to carry the character's personality beyond just timbre. Speaking pace, breathing patterns, laugh characteristics, pause frequency -these all contribute to voice personality. Getting this right requires iterating on voice parameters per character, not just picking a base voice and applying it everywhere.

Store voice configuration per character in your database rather than hardcoding it. This lets you tune voice parameters without redeploying.
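One way to model that per-character configuration is a plain record that serializes to a database row. The field names here are illustrative assumptions, not a schema from any particular provider:

```python
from dataclasses import dataclass, asdict


@dataclass
class VoiceConfig:
    """Per-character voice settings, stored in the DB, not in code."""
    character_id: str
    provider: str            # e.g. "fish_audio" or "elevenlabs"
    voice_id: str            # provider-side cloned-voice identifier
    speaking_rate: float = 1.0
    stability: float = 0.5   # lower = more expressive, higher = more consistent


def to_row(cfg):
    """Serialize for storage; load back with VoiceConfig(**row)."""
    return asdict(cfg)
```

Because the config lives in the database, tuning a character's pacing or expressiveness is a row update, not a deploy.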

When to generate voice vs. text

Not every message should be a voice note. Just as humans choose between typing and voice messaging based on context, AI companions should make this decision intelligently.

Messages that benefit from voice: emotional content (warmth, comfort, excitement), casual greetings, reactions to things the user shared, anything where tone adds meaning beyond the words.

Messages that should stay text: information-heavy responses, lists or structured content, messages where the user is clearly in a context where audio is inappropriate (late at night, possibly at work).

Building this decision layer requires tracking conversation context: emotional tone, time of day in the user's timezone, the user's own messaging pattern (do they send voice notes frequently? are they a text-heavy communicator?), and the content of the message itself.
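A hedged sketch of that decision layer, using the signals above. The thresholds and the emotional-score input are illustrative assumptions; in production the score would come from your conversation-analysis pass:

```python
def should_send_voice(emotional_score, local_hour,
                      user_sends_voice, is_structured):
    """Decide voice vs. text for one outgoing message.

    emotional_score: 0.0-1.0 estimate of how much tone adds meaning
    local_hour:      hour of day in the user's timezone (0-23)
    user_sends_voice: whether this user sends voice notes themselves
    is_structured:   lists, info-heavy, or other structured content
    """
    if is_structured:
        return False  # structured content reads better as text
    if local_hour >= 23 or local_hour < 7:
        return False  # late night: default to silent delivery
    # Users who send voice notes themselves clear a lower bar.
    threshold = 0.4 if user_sends_voice else 0.7
    return emotional_score >= threshold
```

The useful property of keeping this as one pure function is that it is trivial to log, A/B test, and tune per cohort.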

Latency optimization

Users expect voice messages to arrive within seconds of the text version. The full pipeline (LLM generation, emotion tagging, TTS API call, audio encoding, delivery via messaging platform) can take 4-8 seconds without optimization.

Key optimizations: stream the LLM output and begin TTS generation before the full text is ready (streaming synthesis). Pre-cache common phrases and greetings. Use a CDN or local cache for TTS audio files that are reused. Run the emotion tagger in parallel with initial TTS warm-up.
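The core of streaming synthesis is chunking the LLM token stream into sentence-sized pieces and handing each one to TTS as soon as it completes, rather than waiting for the full reply. A simplified sketch (the token generator here stands in for a real streaming LLM response):

```python
def stream_sentences(tokens):
    """Yield complete sentences from a stream of text tokens.

    Each yielded sentence can be sent to TTS immediately, so audio for
    the first sentence is synthesizing while later tokens still arrive.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        while any(p in buffer for p in ".!?"):
            # Emit up to and including the first sentence terminator.
            idx = min(i for i in (buffer.find(p) for p in ".!?") if i != -1)
            yield buffer[: idx + 1].strip()
            buffer = buffer[idx + 1:]
    if buffer.strip():
        yield buffer.strip()  # trailing fragment without a terminator
```

This naive splitter mishandles abbreviations and decimals ("Dr. Kim", "3.5 seconds"), so a real pipeline would use a proper sentence segmenter, but the pattern is the same.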

The most impactful optimization is deciding early whether to generate voice at all. If the message will be text-only, you save the entire TTS pipeline. Make this decision before generating the response, not after.

Cost at scale

TTS costs add up quickly with engaged users. At $15/million characters and an average voice message of 50 characters, each voice note costs roughly $0.00075. A user who receives 10 voice messages per day costs about $0.225/month in TTS alone.

That sounds manageable until you multiply by thousands of active users. The cost optimization strategy is to use voice selectively (not every message), cache repeated phrases, and establish per-user daily voice quotas that align with tier pricing.

The voice feature is expensive compared to text-only. But the retention and engagement improvements it creates (users who receive voice messages engage 30-50% more frequently in our observation) justify the cost for most business models.
