I had a free Saturday afternoon and a clear goal: talk to my AI agent out loud and hear it talk back. Two hours later, Atlas had a voice.
Here's exactly how I built it — Flask backend, Web Speech API for input, Mistral's Voxtral TTS for output, and a canvas animation that makes the avatar's eyes glow in sync with the audio.
The Stack
- Flask — tiny backend, two endpoints
- Web Speech API — browser-native speech-to-text (Chrome only, push-to-talk)
- Mistral Voxtral TTS — voxtral-mini-tts-2603, returns base64 MP3
- macOS say command — fallback when Voxtral is unavailable
- Web Audio API AnalyserNode — drives the canvas glow animation
Architecture in 30 Seconds
The flow is simple:
- User holds Space → Chrome's SpeechRecognition runs locally
- On final result, transcript POSTs to /api/chat
- Flask calls Mistral chat API (mistral-large-latest) → gets text response
- Flask calls Voxtral TTS → returns base64 MP3
- Browser decodes the MP3, plays it through an AnalyserNode
- Canvas reads frequency data every frame → drives radial gradients over the avatar's eyes
No WebSockets, no streaming. One request, one response. Simple wins.
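The core of that request/response flow fits in a few lines. This is a sketch with the Flask plumbing stripped out — call_llm and tts here are stand-ins for the Mistral chat call and the generate_tts helper, not the post's exact code:

```python
# Sketch of the /api/chat handler's core logic, Flask wiring omitted.
# call_llm and tts are placeholder callables standing in for the
# mistral-large-latest chat call and the TTS helper.
def chat_response(transcript: str, call_llm, tts) -> dict:
    text = call_llm(transcript)   # LLM turns the transcript into a reply
    audio_b64 = tts(text)         # base64 MP3, or None if TTS failed
    return {"text": text, "audio": audio_b64}

# Example with stubbed dependencies (TTS pretend-unavailable):
reply = chat_response(
    "hello",
    call_llm=lambda t: f"You said: {t}",
    tts=lambda t: None,
)
print(reply["text"])  # → You said: hello
```

The browser only ever sees one JSON object with a text field and an optional audio field, which is what keeps the frontend simple.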
The TTS Fallback Pattern
This was the most useful piece of code I wrote. Mistral TTS is genuinely good — natural rhythm, low latency. But APIs fail. The macOS say command is always there.
```python
import base64
import os
import subprocess
import tempfile
from pathlib import Path

import requests


def generate_tts(text: str) -> str | None:
    """Returns base64-encoded MP3. Tries Mistral TTS first, falls back to macOS say."""
    mistral_key = os.getenv("MISTRAL_API_KEY", "")
    if mistral_key:
        try:
            headers = {
                "Authorization": f"Bearer {mistral_key}",
                "Content-Type": "application/json",
            }
            payload = {
                "model": "voxtral-mini-tts-2603",
                "input": text,
                "response_format": "mp3",
                "voice_id": "casual_male",
            }
            resp = requests.post(
                "https://api.mistral.ai/v1/audio/speech",
                headers=headers,
                json=payload,
                timeout=30,
            )
            resp.raise_for_status()
            data = resp.json()
            if "audio_data" in data:
                return data["audio_data"]  # already base64
        except Exception:
            pass  # fall through to macOS say

    # Fallback: macOS say → AIFF → MP3 via ffmpeg → base64
    with tempfile.NamedTemporaryFile(suffix=".aiff", delete=False) as aiff:
        aiff_path = aiff.name
    mp3_path = aiff_path.replace(".aiff", ".mp3")
    subprocess.run(
        ["say", "-v", "Reed (English (US))", "-r", "155", "-o", aiff_path, text],
        check=True,
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", aiff_path, "-acodec", "libmp3lame", "-q:a", "2", mp3_path],
        capture_output=True, check=True,
    )
    audio_b64 = base64.b64encode(Path(mp3_path).read_bytes()).decode()
    Path(aiff_path).unlink(missing_ok=True)
    Path(mp3_path).unlink(missing_ok=True)
    return audio_b64
```
The key insight: always return base64. Whether the audio came from a neural TTS API or a 1990s speech synthesizer, the frontend doesn't care — it just decodes and plays.
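On the frontend, "decode and play" starts with turning that base64 string back into bytes. This sketch shows the base64 → ArrayBuffer half, which is identical for both audio sources; the Web Audio half (decodeAudioData) is browser-only and shown as a comment:

```javascript
// Convert the base64 MP3 string from /api/chat into an ArrayBuffer
// that the Web Audio API can decode.
function base64ToArrayBuffer(b64) {
  const binary = atob(b64);                       // base64 → binary string
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes.buffer;
}

// In the browser (sketch), assuming an existing AudioContext `audioCtx`:
// const buffer = await audioCtx.decodeAudioData(base64ToArrayBuffer(data.audio));
```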
The Avatar Animation Trick
This is the part I'm most happy with. When Atlas speaks, its eyes glow. The intensity tracks the actual audio frequency content in real time.
```javascript
// Connect decoded audio through the AnalyserNode before playback
analyser.fftSize = 256;  // 128 frequency bins
const freqData = new Uint8Array(analyser.frequencyBinCount);
const source = audioCtx.createBufferSource();
source.buffer = buffer;
source.connect(analyser);
analyser.connect(audioCtx.destination);
source.start(0);

// Each animation frame: read frequency bins, drive canvas radial gradients
function frame() {
  analyser.getByteFrequencyData(freqData);

  // Voice-range amplitude: bins 0-17 ≈ 0-3kHz at 44.1kHz / 256 FFT
  let voiceSum = 0;
  for (let i = 0; i < 18; i++) voiceSum += freqData[i];
  const amplitude = voiceSum / (18 * 255);

  // Presence: bins 8-40
  let presenceSum = 0;
  for (let i = 8; i < 40; i++) presenceSum += freqData[i];
  const presence = presenceSum / (32 * 255);

  // Draw radial gradient over each eye position
  ctx2d.clearRect(0, 0, w, h);
  const glowAlpha = 0.25 + presence * 0.75;
  const glowRadius = Math.min(w, h) * (0.04 + presence * 0.09);
  EYES.forEach(eye => {
    const ex = eye.x * w;  // resolve fractional eye position to pixels
    const ey = eye.y * h;
    const grad = ctx2d.createRadialGradient(ex, ey, 0, ex, ey, glowRadius);
    grad.addColorStop(0, `rgba(0,229,204,${glowAlpha})`);
    grad.addColorStop(0.5, `rgba(0,229,204,${glowAlpha * 0.35})`);
    grad.addColorStop(1, 'rgba(0,229,204,0)');
    ctx2d.fillStyle = grad;
    ctx2d.fillRect(ex - glowRadius, ey - glowRadius, glowRadius * 2, glowRadius * 2);
  });
  animFrameId = requestAnimationFrame(frame);
}
frame();
```
The trick is globalCompositeOperation = 'screen' on the canvas, layered over the avatar image. The cyan glow blends additively — it looks like the eyes are lit from behind, not painted on top.
Eye positions are hardcoded as fractions of the canvas size ({ x: 0.415, y: 0.42 } and { x: 0.585, y: 0.42 }). Rough, but it works. The avatar is a geometric mask with glowing cyan eyes — the style forgives approximation.
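A minimal sketch of that setup — the canvas element and avatar layering are assumed, and toPixels just resolves the fractional positions against the current canvas size:

```javascript
// Fractional eye positions from the post, resolved at draw time so the
// glow stays aligned if the canvas is resized.
const EYES = [{ x: 0.415, y: 0.42 }, { x: 0.585, y: 0.42 }];

function toPixels(eye, w, h) {
  return { ex: eye.x * w, ey: eye.y * h };
}

// Browser-only setup (sketch): additive blend over the avatar image.
// const ctx2d = canvas.getContext('2d');
// ctx2d.globalCompositeOperation = 'screen';
```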
What Took the Longest
Not the code. The voice.
Voxtral is genuinely impressive — it sounds like a real person, not a robot. But getting the character right took iteration. Too fast and it sounds frantic. Too slow and it sounds corporate. The macOS say fallback using the Reed voice at 155 WPM is surprisingly usable — just obviously synthetic.
The Web Speech API also has a quirk: Chrome kills the recognition object after each session. The fix is to recreate it fresh each time you start listening rather than reusing the same instance. One line of code, 20 minutes to figure out.
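A sketch of that fix — a factory that builds a fresh recognizer on every push-to-talk press. The settings shown are typical defaults, not necessarily the post's exact configuration:

```javascript
// Create a new SpeechRecognition instance for every listening session
// instead of reusing one, since Chrome discards it after each session.
function createRecognizer() {
  const SR = (typeof window !== 'undefined') &&
             (window.SpeechRecognition || window.webkitSpeechRecognition);
  if (!SR) return null;       // non-Chrome browser or non-browser runtime
  const rec = new SR();
  rec.lang = 'en-US';
  rec.interimResults = true;
  rec.continuous = false;     // push-to-talk: one utterance per press
  return rec;
}

// On each Space keydown: const rec = createRecognizer(); if (rec) rec.start();
```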
Honest Assessment
Voxtral TTS: great when it works. The latency is low, the quality is high, and voice cloning via reference audio is a one-parameter addition to the payload. I'd use it in production.
macOS say: ugly but reliable. Good for development. Do not ship it.
Web Speech API: Chrome-only. Needs internet (the STT itself calls Google's servers). Push-to-talk works well; continuous mode has edge cases. For a local dev tool, it's perfect.
The whole thing: it's a local tool, not a product. But watching Atlas answer questions with a glowing avatar and synthesized voice in under 2 hours of work made the point — building voice interfaces for AI agents is not hard anymore. The primitives are all there.
What's Next
The logical next step is streaming TTS. Right now there's a 1-2 second gap while the full response generates. Streaming audio chunks would close that gap and make it feel more conversational.
The other thing I want to add: voice cloning. Voxtral supports reference audio — you pass a base64-encoded MP3 of a target voice and it clones the style. Atlas should sound like Atlas, not like a preset.
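Going by the post's description (reference audio passed as base64 in the payload), the change might look like the sketch below. Note that "voice_reference" is a placeholder field name, not a confirmed part of Mistral's API — check their docs for the real key:

```python
import base64
from pathlib import Path

# Hypothetical sketch of a voice-cloning TTS payload. The "voice_reference"
# key is an assumed placeholder, not a documented Mistral parameter.
def cloning_payload(text: str, reference_mp3: Path) -> dict:
    return {
        "model": "voxtral-mini-tts-2603",
        "input": text,
        "response_format": "mp3",
        "voice_reference": base64.b64encode(reference_mp3.read_bytes()).decode(),
    }
```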
If you're building AI agents and want one with a voice interface already wired up, or if you want to outsource the automation work entirely — check out whoffagents.com. We build custom AI agents end to end.