Voice assistant with cloned voice & Mistral AI Voxtral

Here's what you get at the end: a browser app where you click a button, ask a question aloud, and hear the answer back in a cloned voice. Speech recognition, LLM response, and text-to-speech — all Mistral, all on the free plan.

This article walks through how the pipeline fits together, shows the code for the part most tutorials skip (the STT relay), and covers the cost and compliance angles that are worth knowing before you pick a stack.

How the pipeline fits together

Browser mic → [WebSocket] → Voxtral STT → Mistral LLM → Voxtral TTS → Browser audio

The browser never talks to Mistral directly. It relays audio over WebSocket to a FastAPI backend, which handles all three API calls. There are two reasons for this: you can't expose your API key in browser JavaScript, and Voxtral's realtime speech recognition requires a persistent connection that has to stay open for the full duration of the audio stream.
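
For orientation, the server side starts as a plain FastAPI app with the SDK client built once at startup, reading the key from an environment variable so it never reaches the browser. A minimal sketch (the variable names are mine):

import os

from fastapi import FastAPI
from mistralai import Mistral

app = FastAPI()

# The key stays server-side; the browser only ever talks to this app.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])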

The relay — the part most tutorials skip

Setting up the WebSocket relay is the piece that trips people up. Here's the core:

import asyncio
import json

from mistralai.models import (
    AudioFormat,
    TranscriptionStreamTextDelta,
    TranscriptionStreamDone,
)

async def do_stt(ws, client):
    connection = await client.audio.realtime.connect(
        model="voxtral-mini-transcribe-realtime-2602",
        audio_format=AudioFormat(encoding="pcm_s16le", sample_rate=16000),
        target_streaming_delay_ms=480,
    )

    async def receive_audio():
        # Forward raw PCM frames to Voxtral until the client stops (or hangs up).
        while True:
            data = await ws.receive()
            if data.get("type") == "websocket.disconnect":
                break
            if data.get("bytes") is not None:
                await connection.send_audio(data["bytes"])
            elif data.get("text") is not None:
                msg = json.loads(data["text"])
                if msg.get("type") == "stop_listening":
                    break
        await connection.flush_audio()
        await connection.end_audio()

    async def process_events():
        # Stream partials to the UI as they arrive; return the final transcript.
        async for event in connection:
            if isinstance(event, TranscriptionStreamTextDelta):
                await ws.send_json({"type": "partial", "text": event.text})
            elif isinstance(event, TranscriptionStreamDone):
                return event.text

    audio_task = asyncio.create_task(receive_audio())
    events_task = asyncio.create_task(process_events())
    await audio_task
    return await asyncio.wait_for(events_task, timeout=10.0)

The browser sends raw PCM audio bytes (16-bit, mono, 16kHz) over WebSocket. The server forwards them to Voxtral and listens for transcript events. TranscriptionStreamTextDelta gives you partial results you can stream back to the UI; TranscriptionStreamDone gives you the final transcript.

One important constraint: AudioFormat only takes encoding and sample_rate. Don't pass channels or bit_depth — the SDK will error.
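
For context, here's roughly how do_stt slots into a FastAPI WebSocket endpoint. The endpoint name, message shapes, and the ref_audio_b64 variable are my placeholders, and respond is the LLM + TTS helper covered in the next section:

import base64

from fastapi import WebSocket, WebSocketDisconnect

@app.websocket("/ws")
async def voice_turn(ws: WebSocket):
    await ws.accept()
    history = []
    try:
        while True:
            # One turn: transcribe, answer, speak.
            transcript = await do_stt(ws, client)
            answer, audio = await respond(transcript, history, ref_audio_b64)
            history += [
                {"role": "user", "content": transcript},
                {"role": "assistant", "content": answer},
            ]
            await ws.send_json({
                "type": "response",
                "text": answer,
                "audio": base64.b64encode(audio).decode("ascii"),
            })
    except WebSocketDisconnect:
        pass  # client hung up mid-turn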

LLM + TTS

Once you have the transcript, the rest is simpler:

import base64

# `client` is the module-level Mistral client; SYSTEM_PROMPT is your persona prompt.
async def respond(text, history, ref_audio_b64):
    # Keep only the last 20 history messages so the context stays small.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history[-20:]
    messages.append({"role": "user", "content": text})

    # These SDK calls are synchronous; fine for a demo, but they block the event loop.
    llm_response = client.chat.complete(
        model="mistral-small-latest",
        messages=messages,
        max_tokens=300,
    )
    answer = llm_response.choices[0].message.content

    tts_response = client.audio.speech.complete(
        model="voxtral-mini-tts-2603",
        input=answer,
        ref_audio=ref_audio_b64,
        response_format="mp3",
    )
    return answer, base64.b64decode(tts_response.audio_data)

The ref_audio parameter is what makes voice cloning work on the free plan. You pass a base64-encoded audio clip and the model adapts the voice inline — no persistent voice profile, no paid subscription. If you want the full explanation of how that works, part 1 of this series covers it.
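
Producing that clip is a couple of lines of standard library code. A sketch, assuming you've saved a short, clean reference recording as reference.wav:

import base64

# A few seconds of clean speech is enough for inline cloning.
with open("reference.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode("ascii")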

What it actually costs

Development is free. Mistral has a free tier with rate limits generous enough to build and test without paying anything.

When you move to paid usage, the numbers are:

  • STT: $0.003/minute
  • TTS: $16 per million characters
  • LLM (mistral-small): $0.10/1M input tokens, $0.30/1M output tokens

A typical turn — 10 seconds of speech, 100-word answer — works out to roughly $0.011. The TTS step dominates; the LLM cost is negligible.
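
That per-turn figure is easy to sanity-check; the 600-character estimate for a 100-word answer is my own rough assumption:

stt = (10 / 60) * 0.003          # 10 s of speech      -> $0.0005
tts = (600 / 1_000_000) * 16     # ~600 chars of reply -> $0.0096
llm = 0.0001                     # a few hundred tokens, effectively free
print(f"${stt + tts + llm:.4f} per turn")  # ~$0.0102, call it $0.011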

At scale, assuming 10 turns per user per month:

Users        Turns/month    Cost/month
10           100            ~$1
1,000        10,000         ~$110
100,000      1M             ~$11,000
1,000,000    10M            ~$110,000

"Zero cost" is accurate at dev scale. At real user scale it's real money — just predictable, per-call money with no surprises.

The comparison with ElevenLabs has two parts. At low volume, ElevenLabs' subscription model works against you: you're paying $5–22/month before you've cloned a single voice. At high volume, the per-character rate matters more — Mistral's TTS is roughly 73% cheaper per character than ElevenLabs'. And with Mistral, voice cloning is a parameter, not a plan upgrade: ref_audio works on the free tier, while ElevenLabs' instant voice cloning requires at least the Starter plan.

The sovereignty angle

This is an angle most AI tutorials don't mention, but it matters for a real slice of use cases.

Mistral is a French company. Their infrastructure is in Europe. All three API calls in this pipeline — speech recognition, LLM, and text-to-speech — stay within EU jurisdiction.

If you're building for European users, or in any regulated sector like healthcare, education, or legal, this is worth knowing before you choose a stack. GDPR requires personal data to be processed lawfully, and transfers of EU residents' data outside the EU need a valid legal basis, so you have to know where that data goes. US-based cloud providers are subject to the CLOUD Act, which means US authorities can compel disclosure of data on US-operated systems regardless of where the physical servers sit. That creates a real compliance gap that some organizations simply can't accept.

Running voice data through Mistral sidesteps that entirely. If you're building a voice assistant for a school, a GP's surgery, or anything touching personal data in Europe, this isn't a nice-to-have — it can be a hard requirement. Worth knowing before you've already built on ElevenLabs.

What's missing for production

The pipeline as described is a working app. A few things you'd add before running it in front of real users:

  • Persistent session storage (currently in-memory, resets on restart)
  • Per-user rate limiting
  • Retry logic for intermittent 503s from Voxtral — these do happen occasionally (see the sketch after this list)
  • Secure API key handling on the server side
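
On the retry point, here's the shape of what I mean: plain exponential backoff around the TTS call. The helper name and retry counts are mine, and in real code you'd catch the SDK's specific HTTP error rather than bare Exception:

import asyncio

async def tts_with_retry(text, ref_audio_b64, attempts=3):
    # Voxtral occasionally returns a 503; back off briefly and try again.
    for attempt in range(attempts):
        try:
            return client.audio.speech.complete(
                model="voxtral-mini-tts-2603",
                input=text,
                ref_audio=ref_audio_b64,
                response_format="mp3",
            )
        except Exception:  # narrow to the SDK's 5xx error type in practice
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1 s, then 2 s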

The core flow is solid. I've run it through several hundred turns, and the STT + LLM + TTS chain has been reliable.


If you want the full implementation, including the FastAPI backend, WebSocket frontend, state machine, and voice upload endpoint, I cover it in my Mistral AI: Voxtral TTS (text to speech), Vision & AI Agents course on Udemy. Everything in the course runs on the free plan.
