
Atlas Whoff


I Built a Voice Interface for My AI Agent in 2 Hours (Flask + Web Speech API + TTS)

I had a free Saturday afternoon and a clear goal: talk to my AI agent out loud and hear it talk back. Two hours later, Atlas had a voice.

Here's exactly how I built it — Flask backend, Web Speech API for input, Mistral's Voxtral TTS for output, and a canvas animation that makes the avatar's eyes glow in sync with the audio.

The Stack

  • Flask — tiny backend, two endpoints
  • Web Speech API — browser-native speech-to-text (Chrome only, push-to-talk)
  • Mistral Voxtral TTS — voxtral-mini-tts-2603, returns base64 MP3
  • macOS say command — fallback when Voxtral is unavailable
  • Web Audio API AnalyserNode — drives the canvas glow animation

Architecture in 30 Seconds

The flow is simple:

  1. User holds Space → Chrome's SpeechRecognition runs locally
  2. On final result, transcript POSTs to /api/chat
  3. Flask calls Mistral chat API (mistral-large-latest) → gets text response
  4. Flask calls Voxtral TTS → returns base64 MP3
  5. Browser decodes the MP3, plays it through an AnalyserNode
  6. Canvas reads frequency data every frame → drives radial gradients over the avatar's eyes

No WebSockets, no streaming. One request, one response. Simple wins.
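The glue for steps 2–4 is small. Here's a minimal sketch of what the /api/chat endpoint could look like — the call_mistral and generate_tts helpers are stand-in placeholders for the real Mistral chat call and the TTS function described in the next section:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def call_mistral(message: str) -> str:
    # Placeholder for the real mistral-large-latest chat call.
    return f"Echo: {message}"

def generate_tts(text: str):
    # Placeholder for the TTS step; returns base64 MP3 or None on failure.
    return None

@app.route("/api/chat", methods=["POST"])
def chat():
    # One request, one response: transcript in, text + base64 audio out.
    message = (request.get_json(silent=True) or {}).get("message", "")
    if not message:
        return jsonify({"error": "empty message"}), 400
    reply = call_mistral(message)
    return jsonify({"reply": reply, "audio_data": generate_tts(reply)})
```

The frontend only ever deals with that one JSON shape, which keeps the audio source swappable.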

The TTS Fallback Pattern

This was the most useful piece of code I wrote. Mistral TTS is genuinely good — natural rhythm, low latency. But APIs fail. The macOS say command is always there.

import base64
import os
import subprocess
import tempfile
from pathlib import Path

import requests


def generate_tts(text: str) -> str | None:
    """Return base64-encoded MP3. Tries Mistral TTS first, falls back to macOS say."""
    mistral_key = os.getenv("MISTRAL_API_KEY", "")

    if mistral_key:
        try:
            headers = {
                "Authorization": f"Bearer {mistral_key}",
                "Content-Type": "application/json"
            }
            payload = {
                "model": "voxtral-mini-tts-2603",
                "input": text,
                "response_format": "mp3",
                "voice_id": "casual_male"
            }
            resp = requests.post(
                "https://api.mistral.ai/v1/audio/speech",
                headers=headers,
                json=payload,
                timeout=30
            )
            resp.raise_for_status()
            data = resp.json()
            if "audio_data" in data:
                return data["audio_data"]  # already base64
        except Exception:
            pass  # fall through to macOS say

    # Fallback: macOS say → AIFF → MP3 via ffmpeg → base64
    with tempfile.NamedTemporaryFile(suffix=".aiff", delete=False) as aiff:
        aiff_path = aiff.name
    mp3_path = aiff_path.replace(".aiff", ".mp3")
    try:
        subprocess.run(
            ["say", "-v", "Reed (English (US))", "-r", "155", "-o", aiff_path, text],
            check=True
        )
        subprocess.run(
            ["ffmpeg", "-y", "-i", aiff_path, "-acodec", "libmp3lame", "-q:a", "2", mp3_path],
            capture_output=True, check=True
        )
        return base64.b64encode(Path(mp3_path).read_bytes()).decode()
    finally:
        # Clean up temp files even if say or ffmpeg fails
        Path(aiff_path).unlink(missing_ok=True)
        Path(mp3_path).unlink(missing_ok=True)

The key insight: always return base64. Whether the audio came from a neural TTS API or a 1990s speech synthesizer, the frontend doesn't care — it just decodes and plays.
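To make that contract concrete, here's the round trip in miniature (stand-in bytes, not a real MP3): the backend base64-encodes whatever bytes it produced, and the client decodes them back unchanged before playback.

```python
import base64

# Simulate any backend audio source producing raw MP3 bytes.
fake_mp3_bytes = b"ID3\x03\x00" + b"\x00" * 16  # stand-in bytes, not a real MP3

# Backend: always encode to base64 text before putting it in JSON.
audio_b64 = base64.b64encode(fake_mp3_bytes).decode("ascii")

# Frontend equivalent: decode back to the identical bytes for playback.
decoded = base64.b64decode(audio_b64)
assert decoded == fake_mp3_bytes
```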

The Avatar Animation Trick

This is the part I'm most happy with. When Atlas speaks, its eyes glow. The intensity tracks the actual audio frequency content in real time.

// Assumes: audioCtx (AudioContext), analyser (AnalyserNode with fftSize 256),
// ctx2d (2D canvas context), w/h (canvas size), buffer (decoded AudioBuffer).
const freqData = new Uint8Array(analyser.frequencyBinCount);
const EYES = [{ x: 0.415, y: 0.42 }, { x: 0.585, y: 0.42 }];

// Connect decoded audio through the AnalyserNode before playback
const source = audioCtx.createBufferSource();
source.buffer = buffer;
source.connect(analyser);
analyser.connect(audioCtx.destination);
source.start(0);

// Each animation frame: read frequency bins, drive canvas radial gradients
let animFrameId;
function frame() {
    analyser.getByteFrequencyData(freqData);

    // Voice-range amplitude: bins 0-17 ≈ 0-3kHz at 44.1kHz / 256 FFT
    let voiceSum = 0;
    for (let i = 0; i < 18; i++) voiceSum += freqData[i];
    const amplitude = voiceSum / (18 * 255);  // 0..1, available for other effects

    // Presence: bins 8-40, drives the eye glow
    let presenceSum = 0;
    for (let i = 8; i < 40; i++) presenceSum += freqData[i];
    const presence = presenceSum / (32 * 255);

    // Draw a radial gradient over each eye position
    const glowAlpha = 0.25 + presence * 0.75;
    const glowRadius = Math.min(w, h) * (0.04 + presence * 0.09);

    EYES.forEach(eye => {
        const ex = eye.x * w;
        const ey = eye.y * h;
        const grad = ctx2d.createRadialGradient(ex, ey, 0, ex, ey, glowRadius);
        grad.addColorStop(0, `rgba(0,229,204,${glowAlpha})`);
        grad.addColorStop(0.5, `rgba(0,229,204,${glowAlpha * 0.35})`);
        grad.addColorStop(1, 'rgba(0,229,204,0)');
        ctx2d.fillStyle = grad;
        ctx2d.fillRect(ex - glowRadius, ey - glowRadius, glowRadius * 2, glowRadius * 2);
    });

    animFrameId = requestAnimationFrame(frame);
}
frame();

The trick is globalCompositeOperation = 'screen' on the canvas, layered over the avatar image. The cyan glow blends additively — it looks like the eyes are lit from behind, not painted on top.

Eye positions are hardcoded as fractions of the canvas size ({ x: 0.415, y: 0.42 } and { x: 0.585, y: 0.42 }). Rough, but it works. The avatar is a geometric mask with glowing cyan eyes — the style forgives approximation.
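As a sanity check on the bin comment in the loop above: assuming the typical Web Audio defaults (44.1 kHz sample rate, fftSize of 256), each frequency bin covers sampleRate / fftSize Hz, so 18 bins reach roughly 3 kHz:

```python
sample_rate = 44_100          # Hz (typical Web Audio default; hardware-dependent)
fft_size = 256                # AnalyserNode.fftSize
bin_count = fft_size // 2     # frequencyBinCount: 128 bins spanning 0..Nyquist
bin_width = sample_rate / fft_size   # Hz per bin: (sample_rate / 2) / bin_count

voice_ceiling = 18 * bin_width       # upper edge of the "voice" bins 0-17
print(f"{bin_width:.1f} Hz per bin; voice bins top out near {voice_ceiling:.0f} Hz")
```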

What Took the Longest

Not the code. The voice.

Voxtral is genuinely impressive — it sounds like a real person, not a robot. But getting the character right took iteration. Too fast and it sounds frantic. Too slow and it sounds corporate. The macOS say fallback using the Reed voice at 155 WPM is surprisingly usable — just obviously synthetic.

The Web Speech API also has a quirk: Chrome kills the recognition object after each session. The fix is to recreate it fresh each time you start listening rather than reusing the same instance. One line of code, 20 minutes to figure out.

Honest Assessment

Voxtral TTS: great when it works. The latency is low, the quality is high, and voice cloning via reference audio is a one-parameter addition to the payload. I'd use it in production.

macOS say: ugly but reliable. Good for development. Do not ship it.

Web Speech API: Chrome-only. Needs internet (the STT itself calls Google's servers). Push-to-talk works well; continuous mode has edge cases. For a local dev tool, it's perfect.

The whole thing: it's a local tool, not a product. But watching Atlas answer questions with a glowing avatar and synthesized voice in under 2 hours of work made the point — building voice interfaces for AI agents is not hard anymore. The primitives are all there.

What's Next

The logical next step is streaming TTS. Right now there's a 1-2 second gap while the full response generates. Streaming audio chunks would close that gap and make it feel more conversational.
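On the Flask side, a chunked response is enough to start: a generator feeds a streaming Response, so the browser can begin playback before generation finishes. This is a sketch only; tts_chunks is a hypothetical stand-in for a streaming TTS client.

```python
from flask import Flask, Response

app = Flask(__name__)

def tts_chunks(text: str):
    # Hypothetical: yield MP3 chunks as a streaming TTS API returns them.
    for sentence in text.split(". "):
        yield sentence.encode()  # real code would yield audio bytes

@app.route("/api/speak")
def speak():
    # Chunked transfer: client can start decoding before the generator is done.
    return Response(tts_chunks("Hello. World"), mimetype="audio/mpeg")
```

The harder part is the client, which has to append chunks to a playing source (e.g. via Media Source Extensions) rather than decoding one complete MP3.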

The other thing I want to add: voice cloning. Voxtral supports reference audio — you pass a base64-encoded MP3 of a target voice and it clones the style. Atlas should sound like Atlas, not like a preset.


If you're building AI agents and want one with a voice interface already wired up, or if you want to outsource the automation work entirely — check out whoffagents.com. We build custom AI agents end to end.
