By Xaden
Cloud voice APIs are convenient — until they're not. Latency adds up when every utterance round-trips to a datacenter. Privacy evaporates when your microphone stream leaves your machine. And monthly bills grow linearly with usage.
This guide documents a production-tested architecture for fully local voice AI on Apple Silicon: speech-to-text via Whisper.cpp with Metal GPU acceleration, inference via Ollama, and text-to-speech via Kokoro ONNX with a persistent HTTP server. Every component runs on-device. No API keys. No internet required. No per-token charges.
Target hardware: MacBook Pro M3 Pro (36GB unified memory). The architecture scales down to M1/8GB with smaller models.
Target latency budget:
- STT (Whisper): ~300-500ms
- LLM (Ollama): ~1000-2000ms
- TTS (Kokoro): ~200-500ms
- Audio I/O: ~100ms
- Total: < 3 seconds
Architecture Overview
┌─────────────────────────────────────────────┐
│              voice-chat-fast.sh             │
│          (orchestrator / main loop)         │
└─────────┬──────────┬──────────┬─────────────┘
          │          │          │
┌─────────▼───┐ ┌────▼────┐ ┌───▼─────────┐
│   ffmpeg    │ │ Ollama  │ │ Kokoro TTS  │
│  (record)   │ │  (LLM)  │ │ Server:8181 │
└─────┬───────┘ └────┬────┘ └──┬──────────┘
      │              │         │
┌─────▼───────┐      │  ┌──────▼───────┐
│ whisper-cli │      │  │ kokoro-onnx  │
│ (STT+Metal) │      │  │ (in-memory)  │
└─────────────┘      │  └──────────────┘
                     │
┌────────────────────▼────────────────────┐
│          Conversation History           │
│        (JSON, last N exchanges)         │
└─────────────────────────────────────────┘
Stage 1: Speech-to-Text with Whisper.cpp
Why Whisper.cpp over OpenAI's Python Whisper
Whisper.cpp is a C/C++ port that compiles natively for ARM64, uses Metal GPU acceleration out of the box, and loads models in compact GGML format.
brew install whisper-cpp
Model Selection
mkdir -p ~/.local/share/whisper-models
curl -L -o ~/.local/share/whisper-models/ggml-base.en.bin \
"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin"
Model tradeoffs:
- ggml-tiny.en (75MB) — Fastest, good for voice chat
- ggml-base.en (147MB) — Sweet spot for accuracy/speed
- ggml-small.en (466MB) — Noticeably better accuracy
- ggml-medium.en (1.5GB) — High accuracy, slower
- ggml-large-v3 (3.0GB) — Best, multilingual, heavy
For real-time voice conversation, tiny.en gets transcription under 300ms.
Metal GPU Acceleration
On Apple Silicon, Whisper.cpp automatically detects and uses Metal. Benchmark: 3 seconds of audio transcribed in ~500ms (including model load from cold).
Audio Format Requirements
Whisper expects 16kHz, mono, 16-bit PCM WAV:
ffmpeg -i input.mp3 -ar 16000 -ac 1 -acodec pcm_s16le output.wav
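Both steps are easy to wrap in a small Python helper. This is a sketch, assuming ffmpeg and whisper-cli are on PATH; the model path and the --no-prints flag (whisper-cli's option for suppressing everything except the transcript) are assumptions to verify against your install:

```python
import subprocess
from pathlib import Path

# Assumed model location from the download step above.
MODEL = Path.home() / ".local/share/whisper-models/ggml-base.en.bin"

def convert_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg command: any input -> 16kHz mono 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", src,
            "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le", dst]

def transcribe_cmd(wav: str, model: Path = MODEL) -> list[str]:
    """whisper-cli command; --no-prints keeps stdout to just the text."""
    return ["whisper-cli", "-m", str(model), "-f", wav, "--no-prints"]

def transcribe(src: str, tmp_wav: str = "turn.wav") -> str:
    subprocess.run(convert_cmd(src, tmp_wav), check=True, capture_output=True)
    out = subprocess.run(transcribe_cmd(tmp_wav), check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()
```

Keeping the commands as pure list-building functions makes them easy to log and test without touching the audio device.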
Stage 2: Voice Activity Detection (VAD)
Rather than adding a dedicated VAD library, we exploit ffmpeg's built-in silencedetect audio filter:
ffmpeg -y -f avfoundation -i ":0" \
-ar 16000 -ac 1 -acodec pcm_s16le \
-t 30 \
-af "silencedetect=noise=0.02:d=1.5" \
recording.wav 2>ffmpeg.log &
- noise=0.02 — amplitude threshold for "silence" (increase for noisy environments)
- d=1.5 — seconds of silence before triggering
The monitoring loop watches for silence_start and silence_end markers in ffmpeg's stderr.
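That monitoring loop is the one piece worth showing in full. A minimal sketch of the stderr parsing (the log line format matches what silencedetect actually emits; the stop heuristic is my own assumption):

```python
import re

def silence_events(log_text: str) -> list[tuple[str, float]]:
    """Extract (event, timestamp) pairs from ffmpeg silencedetect stderr.

    silencedetect prints lines like:
      [silencedetect @ 0x...] silence_start: 4.51
      [silencedetect @ 0x...] silence_end: 6.01 | silence_duration: 1.5
    """
    return [(m.group(1), float(m.group(2)))
            for m in re.finditer(r"silence_(start|end):\s*([0-9.]+)", log_text)]

def should_stop(log_text: str, min_speech: float = 0.5) -> bool:
    """Stop recording once speech has been heard and silence_start fires.

    A silence_start in the first min_speech seconds is treated as the
    initial quiet before the user speaks, not end-of-utterance.
    """
    return any(kind == "start" and t > min_speech
               for kind, t in silence_events(log_text))
```

In the shell orchestrator, the equivalent is a grep over ffmpeg.log inside a polling loop, followed by killing the background ffmpeg process.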
Stage 3: The Cold-Start Problem — Kokoro ONNX vs. PyTorch
The problem: 9 seconds of silence
Calling the Python CLI per utterance produces ~9 seconds of latency. Breakdown: ~6 seconds of overhead (Python startup, PyTorch import, model loading), only ~2 seconds of actual synthesis.
The fix: two orthogonal optimizations
Optimization 1: Replace PyTorch with ONNX Runtime
ONNX Runtime on ARM64 uses optimized NEON SIMD instructions. 4-10x faster for a model this size.
pip install kokoro-onnx sounddevice onnxruntime
Optimization 2: Persistent process — load once, serve forever
Even with ONNX, loading the model from disk takes ~1-2 seconds. Load it once at startup and keep it in memory as an HTTP server.
These two changes together: ~9 seconds → ~300ms. A 30x improvement.
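Numbers like these are worth reproducing on your own hardware. A tiny timing harness (the commented kokoro-onnx usage mirrors the API as used later in this guide, and assumes the model files are present):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - t0) * 1000.0

# Example (assumes kokoro-onnx is installed and model files exist):
# from kokoro_onnx import Kokoro
# kokoro, load_ms = timed(Kokoro, "kokoro-v0_19.onnx", "voices.bin")  # cold load
# _, synth_ms = timed(kokoro.create, "Hello there.", voice="am_puck", lang="en-us")
# print(f"load: {load_ms:.0f}ms, warm synthesis: {synth_ms:.0f}ms")
```

Run the synthesis line twice: the second call is the warm number that matters, since the persistent server only ever pays the load cost once.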
Stage 4: The Persistent TTS Server
A minimal HTTP server that loads Kokoro ONNX at startup and serves synthesis requests:
#!/usr/bin/env python3
"""tts-server.py — Persistent Kokoro ONNX TTS server."""
import json
from http.server import HTTPServer, BaseHTTPRequestHandler

import sounddevice as sd
from kokoro_onnx import Kokoro

MODEL_PATH = "kokoro-v0_19.onnx"
VOICES_PATH = "voices.bin"
HOST, PORT = "127.0.0.1", 8181

# Load the model once at startup; every request reuses it from memory.
kokoro = Kokoro(MODEL_PATH, VOICES_PATH)

class TTSHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(content_length))
        text = data.get("text", "")
        voice = data.get("voice", "am_puck")
        speed = float(data.get("speed", 1.0))
        if not text.strip():
            self.send_response(200); self.end_headers(); return
        samples, sr = kokoro.create(text, voice=voice, speed=speed, lang="en-us")
        sd.play(samples, samplerate=sr)
        sd.wait()  # block until playback finishes, so the client can sequence turns
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer((HOST, PORT), TTSHandler).serve_forever()
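The orchestrator talks to this server with a plain POST per utterance (in the shell script, a one-line curl). A hypothetical Python client sketch of the same call, split so the payload construction is testable on its own:

```python
import json
import urllib.request

TTS_URL = "http://127.0.0.1:8181"

def tts_payload(text: str, voice: str = "am_puck", speed: float = 1.0) -> bytes:
    """JSON body in the shape TTSHandler.do_POST expects."""
    return json.dumps({"text": text, "voice": voice, "speed": speed}).encode("utf-8")

def speak(text: str, **kwargs) -> int:
    """POST text to the TTS server; returns the HTTP status.

    Blocks until the server finishes playback, since the server only
    responds after sd.wait() returns.
    """
    req = urllib.request.Request(
        TTS_URL, data=tts_payload(text, **kwargs),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status
```

The blocking response doubles as flow control: the conversation loop can't start recording the next turn while the assistant is still talking over the microphone.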
Voice Selection
Kokoro ships with 26 voice style vectors:
American English: af_alloy, af_bella, af_jessica, am_adam, am_echo, am_puck...
British English: bf_alice, bf_emma, bm_daniel, bm_george...
Recommended: am_puck — sharp, expressive, good for conversational AI.
Stage 5: Text Sanitization for Speech
LLMs produce markdown. Markdown sounds terrible when spoken aloud.
import re

def sanitize_for_speech(text: str) -> str:
    text = re.sub(r'```.*?```', '', text, flags=re.DOTALL)        # fenced code blocks
    text = re.sub(r'`(.+?)`', r'\1', text)                        # inline code
    text = re.sub(r'\*\*(.+?)\*\*', r'\1', text)                  # bold
    text = re.sub(r'\*(.+?)\*', r'\1', text)                      # italics
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)    # headings
    text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE)  # list markers
    text = re.sub(r'https?://\S+', '', text)                      # bare URLs
    text = re.sub(r'\n{2,}', '. ', text)                          # paragraph breaks → pauses
    text = re.sub(r'\n', ' ', text)
    return text.strip()
The most effective sanitization is prevention — the system prompt instructs: "No markdown, no emojis, plain speech only."
Stage 6: The Full Conversation Loop
Record (ffmpeg+VAD) → Transcribe (whisper-cli) → LLM (Ollama)
→ Sanitize → TTS Server (Kokoro ONNX) → Speaker → Loop
Conversation history is maintained as a sliding window of the last N exchanges in JSON.
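A sketch of the history handling plus the LLM call, against Ollama's /api/chat endpoint with stream set to false (the window size of 8 exchanges and the exact trimming policy are assumptions; adjust both to taste):

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"
MAX_EXCHANGES = 8  # sliding window size (assumption)

def trim(history: list[dict], max_exchanges: int = MAX_EXCHANGES) -> list[dict]:
    """Keep only the last N exchanges (one user + one assistant message each)."""
    return history[-2 * max_exchanges:]

def chat(history: list[dict], user_text: str, model: str = "qwen3:8b") -> str:
    """One conversational turn; mutates history in place."""
    history.append({"role": "user", "content": user_text})
    body = json.dumps({"model": model,
                       "messages": trim(history),
                       "stream": False}).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```

A system message pinned at index 0 (the "no markdown, plain speech" instruction) should be exempted from trimming; the sketch above omits that for brevity.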
Performance Benchmarks (M3 Pro, 36GB)
- Whisper tiny.en (warm): ~300ms
- Ollama qwen3:8b (warm, short response): ~500ms-1s
- Kokoro ONNX server (warm): ~200-500ms
- Full loop (warm, 8B model): ~1.5-2.5s ✅
After the first exchange, every subsequent turn is sub-3-seconds.
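To see where a slow turn spends its time, the loop can record per-stage wall-clock timings. A small context-manager sketch (the stage names mirror the pipeline above; the global dict is a deliberate simplification for a single-threaded loop):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record wall-clock milliseconds for one pipeline stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - t0) * 1000.0

# Usage inside the conversation loop (stt/llm/tts helpers assumed):
# with stage("stt"): text = transcribe("turn.wav")
# with stage("llm"): reply = chat(history, text)
# with stage("tts"): speak(sanitize_for_speech(reply))
# print({k: f"{v:.0f}ms" for k, v in timings.items()},
#       f"total: {sum(timings.values()):.0f}ms")
```

Comparing the printed totals against the latency budget at the top of this guide makes regressions obvious, e.g. a model swap that silently doubles the LLM stage.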
Quick Start
brew install whisper-cpp ffmpeg python@3.12
curl -L -o ~/.local/share/whisper-models/ggml-tiny.en.bin \
"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin"
python3.12 -m venv kokoro-venv && source kokoro-venv/bin/activate
pip install kokoro-onnx sounddevice onnxruntime
ollama serve & ollama pull qwen3:8b
./voice-chat-fast.sh --vad
Total disk footprint: ~500MB (Whisper tiny + Kokoro ONNX + voices + Python venv).
By Xaden — Built and tested on macOS 26.4, Apple M3 Pro, March 2026. All components open source. No cloud dependencies.