Building a Local Voice AI Stack: Whisper + Ollama + Kokoro TTS on Apple Silicon

By Xaden


Cloud voice APIs are convenient — until they're not. Latency adds up when every utterance round-trips to a datacenter. Privacy evaporates when your microphone stream leaves your machine. And monthly bills grow linearly with usage.

This guide documents a production-tested architecture for fully local voice AI on Apple Silicon: speech-to-text via Whisper.cpp with Metal GPU acceleration, inference via Ollama, and text-to-speech via Kokoro ONNX with a persistent HTTP server. Every component runs on-device. No API keys. No internet required. No per-token charges.

Target hardware: MacBook Pro M3 Pro (36GB unified memory). The architecture scales down to M1/8GB with smaller models.

Target latency budget:

  • STT (Whisper): ~300-500ms
  • LLM (Ollama): ~1000-2000ms
  • TTS (Kokoro): ~200-500ms
  • Audio I/O: ~100ms
  • Total: < 3 seconds

Architecture Overview

                    ┌─────────────────────────────────────────────┐
                    │           voice-chat-fast.sh                │
                    │         (orchestrator / main loop)          │
                    └─────────┬──────────┬──────────┬────────────┘
                              │          │          │
                    ┌─────────▼───┐ ┌────▼────┐ ┌──▼──────────┐
                    │  ffmpeg     │ │ Ollama  │ │ Kokoro TTS  │
                    │  (record)   │ │ (LLM)   │ │ Server:8181 │
                    └─────┬───────┘ └────┬────┘ └──┬──────────┘
                          │              │          │
                    ┌─────▼───────┐      │     ┌───▼──────────┐
                    │ whisper-cli │      │     │ kokoro-onnx  │
                     │ (STT+Metal) │
                    └─────────────┘      │     └──────────────┘
                                         │
                    ┌────────────────────▼────────────────────┐
                    │         Conversation History            │
                    │       (JSON, last N exchanges)          │
                    └─────────────────────────────────────────┘

Stage 1: Speech-to-Text with Whisper.cpp

Why Whisper.cpp over OpenAI's Python Whisper

Whisper.cpp is a C/C++ port that compiles natively for ARM64, uses Metal GPU acceleration out of the box, and loads models in compact GGML format.

```shell
brew install whisper-cpp
```

Model Selection

```shell
mkdir -p ~/.local/share/whisper-models
curl -L -o ~/.local/share/whisper-models/ggml-base.en.bin \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin"
```

Model tradeoffs:

  • ggml-tiny.en (75MB) — Fastest, good for voice chat
  • ggml-base.en (147MB) — Sweet spot for accuracy/speed
  • ggml-small.en (466MB) — Noticeably better accuracy
  • ggml-medium.en (1.5GB) — High accuracy, slower
  • ggml-large-v3 (3.0GB) — Best, multilingual, heavy

For real-time voice conversation, tiny.en gets transcription under 300ms.

Metal GPU Acceleration

On Apple Silicon, Whisper.cpp automatically detects and uses Metal. Benchmark: 3 seconds of audio transcribed in ~500ms (including model load from cold).
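For reference, the transcription step can be wrapped from the orchestrator in a few lines of Python. This sketch assumes Homebrew's `whisper-cli` binary and its standard flags (`-m` for the model path, `-f` for the input file, `-nt` to suppress timestamps); adjust if your build names them differently:

```python
import subprocess

def whisper_cmd(wav_path: str, model_path: str) -> list[str]:
    """Build the whisper-cli invocation (flags per whisper.cpp's CLI:
    -m model, -f input file, -nt to omit timestamps)."""
    return ["whisper-cli", "-m", model_path, "-f", wav_path, "-nt"]

def transcribe(wav_path: str, model_path: str) -> str:
    """Run whisper-cli and return the transcript as plain text."""
    result = subprocess.run(whisper_cmd(wav_path, model_path),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```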

Audio Format Requirements

Whisper expects 16kHz, mono, 16-bit PCM WAV:

```shell
ffmpeg -i input.mp3 -ar 16000 -ac 1 -acodec pcm_s16le output.wav
```
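If you're unsure whether a recording already matches that format, the stdlib `wave` module can verify it without re-encoding. A small sketch (the demo file path is illustrative):

```python
import struct
import wave

def is_whisper_ready(path: str) -> bool:
    """True if the WAV is 16 kHz, mono, 16-bit PCM — what Whisper expects."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16-bit

# Demo: write 10ms of 16kHz mono silence, then verify it.
with wave.open("/tmp/check.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<h", 0) * 160)

ok = is_whisper_ready("/tmp/check.wav")
```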

Stage 2: Voice Activity Detection (VAD)

Rather than adding a dedicated VAD library, we exploit ffmpeg's built-in silencedetect audio filter:

```shell
ffmpeg -y -f avfoundation -i ":0" \
  -ar 16000 -ac 1 -acodec pcm_s16le \
  -t 30 \
  -af "silencedetect=noise=0.02:d=1.5" \
  recording.wav 2>ffmpeg.log &
```
  • noise=0.02 — amplitude threshold for "silence" (increase for noisy environments)
  • d=1.5 — seconds of silence before triggering

The monitoring loop watches for silence_start and silence_end markers in ffmpeg's stderr.
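The parsing side of that monitoring loop can be sketched in Python. The marker format below matches what silencedetect writes to stderr; `speech_ended` is a hypothetical helper name for the "stop recording" decision:

```python
import re

def silence_events(ffmpeg_log: str) -> list[tuple[str, float]]:
    """Extract (event, timestamp) pairs from ffmpeg's silencedetect stderr.
    Lines look like: '[silencedetect @ 0x...] silence_start: 3.2'."""
    events = []
    for kind, ts in re.findall(r"(silence_start|silence_end):\s*([\d.]+)", ffmpeg_log):
        events.append((kind, float(ts)))
    return events

def speech_ended(ffmpeg_log: str) -> bool:
    """The turn is over when the most recent event is a silence_start —
    i.e. the user stopped talking and the d=1.5s threshold fired."""
    events = silence_events(ffmpeg_log)
    return bool(events) and events[-1][0] == "silence_start"
```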


Stage 3: The Cold-Start Problem — Kokoro ONNX vs. PyTorch

The problem: 9 seconds of silence

Calling Kokoro's PyTorch-based Python CLI once per utterance produces ~9 seconds of latency. Breakdown: ~6 seconds of overhead (Python startup, PyTorch import, model loading), only ~2 seconds of actual synthesis.

The fix: two orthogonal optimizations

Optimization 1: Replace PyTorch with ONNX Runtime

ONNX Runtime on ARM64 uses optimized NEON SIMD instructions. 4-10x faster for a model this size.

```shell
pip install kokoro-onnx sounddevice onnxruntime
```

Optimization 2: Persistent process — load once, serve forever

Even with ONNX, loading the model from disk takes ~1-2 seconds. Load it once at startup and keep it in memory as an HTTP server.

These two changes together: ~9 seconds → ~300ms. A 30x improvement.


Stage 4: The Persistent TTS Server

A minimal HTTP server that loads Kokoro ONNX at startup and serves synthesis requests:

```python
#!/usr/bin/env python3
"""tts-server.py — Persistent Kokoro ONNX TTS server."""

import json
from http.server import HTTPServer, BaseHTTPRequestHandler

import sounddevice as sd
from kokoro_onnx import Kokoro

MODEL_PATH = "kokoro-v0_19.onnx"
VOICES_PATH = "voices.bin"
HOST, PORT = "127.0.0.1", 8181

# Load the model once at startup — this is the whole point of the server.
kokoro = Kokoro(MODEL_PATH, VOICES_PATH)

class TTSHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(content_length))
        text = data.get("text", "")
        voice = data.get("voice", "am_puck")
        speed = float(data.get("speed", 1.0))

        if not text.strip():
            self.send_response(200)
            self.end_headers()
            return

        # Synthesize and play synchronously; the 200 response doubles
        # as a "playback finished" signal for the caller.
        samples, sr = kokoro.create(text, voice=voice, speed=speed, lang="en-us")
        sd.play(samples, samplerate=sr)
        sd.wait()

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer((HOST, PORT), TTSHandler).serve_forever()
```
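Calling the server from the orchestrator is a single stdlib HTTP request. A minimal client matching the handler's JSON fields (the `speak` helper name is illustrative):

```python
import json
from urllib import request

def tts_payload(text: str, voice: str = "am_puck", speed: float = 1.0) -> bytes:
    """Encode the JSON body the server's do_POST expects."""
    return json.dumps({"text": text, "voice": voice, "speed": speed}).encode()

def speak(text: str, host: str = "127.0.0.1", port: int = 8181) -> int:
    """POST to the running TTS server; blocks until playback finishes,
    since the server only replies after sd.wait() returns."""
    req = request.Request(f"http://{host}:{port}/", data=tts_payload(text),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return resp.status
```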

Voice Selection

Kokoro ships with 26 voice style vectors:

```text
American English: af_alloy, af_bella, af_jessica, am_adam, am_echo, am_puck...
British English:  bf_alice, bf_emma, bm_daniel, bm_george...
```

Recommended: am_puck — sharp, expressive, good for conversational AI.


Stage 5: Text Sanitization for Speech

LLMs produce markdown. Markdown sounds terrible when spoken aloud.

````python
import re

def sanitize_for_speech(text: str) -> str:
    text = re.sub(r'```.*?```', '', text, flags=re.DOTALL)       # fenced code blocks
    text = re.sub(r'`(.+?)`', r'\1', text)                       # inline code
    text = re.sub(r'\*\*(.+?)\*\*', r'\1', text)                 # bold
    text = re.sub(r'\*(.+?)\*', r'\1', text)                     # italics
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)   # headings
    text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE) # list bullets
    text = re.sub(r'https?://\S+', '', text)                     # URLs
    text = re.sub(r'\n{2,}', '. ', text)                         # paragraph breaks → pauses
    text = re.sub(r'\n', ' ', text)
    return text.strip()
````

The most effective sanitization is prevention — the system prompt instructs: "No markdown, no emojis, plain speech only."
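That instruction can be baked directly into every Ollama request. A sketch of an `/api/chat` payload builder — the exact prompt wording and the `chat_payload` helper name are illustrative; `stream: false` requests a single JSON reply:

```python
import json

SYSTEM_PROMPT = "No markdown, no emojis, plain speech only. Keep answers short."

def chat_payload(history: list[dict], user_text: str, model: str = "qwen3:8b") -> bytes:
    """Build an Ollama /api/chat request body with the plain-speech system prompt."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += history
    messages.append({"role": "user", "content": user_text})
    return json.dumps({"model": model, "messages": messages, "stream": False}).encode()
```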


Stage 6: The Full Conversation Loop

```text
Record (ffmpeg+VAD) → Transcribe (whisper-cli) → LLM (Ollama)
    → Sanitize → TTS Server (Kokoro ONNX) → Speaker → Loop
```

Conversation history is maintained as a sliding window of the last N exchanges in JSON.
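The sliding window itself is only a few lines. A sketch, where `MAX_EXCHANGES` (the N above) and the helper names are illustrative choices:

```python
import json

MAX_EXCHANGES = 8  # N — keep the last 8 user/assistant pairs

def append_exchange(history: list[dict], user: str, assistant: str) -> list[dict]:
    """Append one exchange and trim to the last N pairs (2 messages per pair)."""
    history = history + [{"role": "user", "content": user},
                         {"role": "assistant", "content": assistant}]
    return history[-2 * MAX_EXCHANGES:]

def save_history(history: list[dict], path: str) -> None:
    """Persist the window as JSON between turns."""
    with open(path, "w") as f:
        json.dump(history, f)
```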


Performance Benchmarks (M3 Pro, 36GB)

  • Whisper tiny.en (warm): ~300ms
  • Ollama qwen3:8b (warm, short response): ~500ms-1s
  • Kokoro ONNX server (warm): ~200-500ms
  • Full loop (warm, 8B model): ~1.5-2.5s

After the first exchange, every subsequent turn completes in under three seconds.


Quick Start

```shell
brew install whisper-cpp ffmpeg python@3.12
mkdir -p ~/.local/share/whisper-models
curl -L -o ~/.local/share/whisper-models/ggml-tiny.en.bin \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin"
python3.12 -m venv kokoro-venv && source kokoro-venv/bin/activate
pip install kokoro-onnx sounddevice onnxruntime
ollama serve & ollama pull qwen3:8b
./voice-chat-fast.sh --vad
```

Total disk footprint: ~500MB (Whisper tiny + Kokoro ONNX + voices + Python venv).


By Xaden — Built and tested on macOS 26.4, Apple M3 Pro, March 2026. All components open source. No cloud dependencies.
