
Ben Racicot

Originally published at modelpiper.com

Adding Voice to Ollama on Mac: The 3-Model Chain

Ollama runs language models. It doesn't listen and it doesn't speak. Type a question in the terminal, read the answer on screen. That's the entire interaction model.

Voice changes what local AI feels like. Instead of typing and reading, you talk and listen. But getting there requires three separate AI models working together, and Ollama only handles one of them.

What voice chat requires

Three models running in sequence every time you speak:

Speech-to-text (STT). Your voice in, text transcription out. Needs a dedicated model - Whisper, Parakeet, or similar. Ollama doesn't include one.

Language model (LLM). The transcribed text goes to your chat model. This is what Ollama does well. Any model you've pulled works.

Text-to-speech (TTS). The model's text response gets converted to audio. Another dedicated model. Ollama doesn't include this either.

The hard part isn't running each model. It's the coordination. STT output needs to feed into the LLM prompt. The LLM response needs to stream into TTS as tokens arrive, not after the full response completes. Latency between stages compounds - if each handoff adds 500ms, the conversation feels broken.
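That middle handoff is the part that trips people up: the LLM response has to reach TTS sentence by sentence, not all at once. One common way to do it is to buffer streamed tokens and flush each completed sentence to the TTS engine as soon as it's ready. A minimal sketch of that buffering logic - the `speak` callback is a placeholder for whatever TTS call you actually use:

```python
from typing import Callable, Iterable

SENTENCE_ENDINGS = (".", "!", "?")

def stream_to_tts(tokens: Iterable[str], speak: Callable[[str], None]) -> None:
    """Buffer streamed LLM tokens and hand each completed sentence
    to the TTS callback immediately, instead of waiting for the
    full response to finish."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush on sentence boundaries so TTS can start speaking early.
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            speak(buffer.strip())
            buffer = ""
    if buffer.strip():  # flush any trailing fragment
        speak(buffer.strip())

# Example: tokens as they might arrive from a streaming chat API.
sentences = []
stream_to_tts(["Hel", "lo", ". ", "How ", "are ", "you", "?"], sentences.append)
# sentences is now ["Hello.", "How are you?"]
```

With this pattern, TTS starts as soon as the first sentence completes, so the audible delay is bounded by the first sentence rather than the whole answer.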

You could wire this together with Python scripts, a Whisper server, and a TTS tool. Some people do. It takes hours of setup and the result is fragile.
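For a sense of what that DIY wiring looks like, here is a rough sketch of one turn using the openai-whisper CLI for STT, `ollama run` for the LLM, and macOS's built-in `say` for TTS. The model names (`base`, `llama3.2`) and file paths are illustrative assumptions, and this blocking version waits for the complete LLM reply before speaking - exactly the latency problem the streaming handoff described above exists to avoid:

```python
import subprocess

def transcribe_cmd(audio_path: str) -> list[str]:
    # openai-whisper CLI: writes <basename>.txt to the output dir.
    return ["whisper", audio_path, "--model", "base",
            "--output_format", "txt", "--output_dir", "."]

def llm_cmd(prompt: str, model: str = "llama3.2") -> list[str]:
    # Ollama CLI: prints the model's full reply to stdout.
    return ["ollama", "run", model, prompt]

def tts_cmd(text: str) -> list[str]:
    # macOS built-in `say`: blocks until playback finishes.
    return ["say", text]

def voice_turn(audio_path: str) -> str:
    """One full turn: transcribe a recording, ask the LLM, speak the reply."""
    subprocess.run(transcribe_cmd(audio_path), check=True)
    txt_path = audio_path.rsplit(".", 1)[0].split("/")[-1] + ".txt"
    transcript = open(txt_path).read().strip()
    reply = subprocess.run(llm_cmd(transcript), capture_output=True,
                           text=True, check=True).stdout.strip()
    subprocess.run(tts_cmd(reply), check=True)
    return reply

# voice_turn("question.wav")  # requires whisper, ollama, and macOS `say`
```

Even this simplified version leaves out recording from the microphone, error handling, and streaming - which is why the hand-rolled approach tends to be fragile.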

The pre-wired approach

ToolPiper ships STT, LLM, and TTS as built-in backends, all hardware-accelerated on Apple Silicon. The tp-local-voice-chat pipeline template wires all three together.

STT: Parakeet v3, running on Apple's Neural Engine. Transcribes in real time.

LLM: ToolPiper's bundled llama.cpp engine or your existing Ollama instance. Your Ollama models appear in the pipeline's LLM block alongside built-in models.

TTS options:

  • PocketTTS - Neural Engine. Near-instant generation. Best for conversational pace.
  • Soprano - Metal GPU. Higher audio quality, slightly more latency.
  • Orpheus - Expressive model with emotional range. Best for content creation.

All three run entirely on-device. No audio leaves the machine.

Latency numbers

Measured on an M2 Max with 32GB RAM; the 3B column uses Qwen 3.5 3B (Q4):

| Stage | 3B Model | 7B Model | 13B Model |
| --- | --- | --- | --- |
| STT (Parakeet v3) | ~400ms | ~400ms | ~400ms |
| LLM time-to-first-token | ~300ms | ~600ms | ~1200ms |
| TTS first audio (PocketTTS) | ~350ms | ~350ms | ~350ms |
| Total round-trip | ~1.5s | ~2.5s | ~3.5s |
| Total RAM (STT + LLM + TTS) | ~3GB | ~5.5GB | ~9.5GB |
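The round-trip totals are larger than the sum of the three per-stage numbers because TTS can only start once the LLM has generated roughly a first sentence. A back-of-envelope budget makes the gap add up - note the ~15-token first sentence and ~30 tokens/s generation rate are illustrative assumptions, not figures from the table:

```python
def round_trip_ms(stt_ms: float, ttft_ms: float, tts_first_audio_ms: float,
                  first_sentence_tokens: int = 15,
                  tokens_per_sec: float = 30.0) -> float:
    """Estimate voice round-trip latency: STT, then LLM time-to-first-token,
    then generating enough tokens for a first sentence, then TTS startup."""
    first_sentence_ms = first_sentence_tokens / tokens_per_sec * 1000
    return stt_ms + ttft_ms + first_sentence_ms + tts_first_audio_ms

# 3B column: 400 + 300 + 500 + 350 = 1550ms, close to the ~1.5s measured.
estimate = round_trip_ms(400, 300, 350)
```

Under those assumptions the estimate lands near the measured ~1.5s; the slower time-to-first-token and generation speed of larger models account for most of the 7B and 13B totals.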

With a 3B model, the pause between your question and the spoken response is short enough to feel like the model is thinking. With a 13B model, the pause is noticeable - you start wondering if something broke before the first word arrives.

For comparison, ChatGPT's voice mode responds in under a second on optimized server hardware. Local voice chat on consumer hardware can't match that speed, but it runs entirely on-device with no internet connection.

Setup steps

  1. Install ToolPiper from the Mac App Store or modelpiper.com. A starter model downloads on first launch.
  2. If you have Ollama models, add Ollama as a provider - your models appear automatically.
  3. Open the pipeline templates, select tp-local-voice-chat.
  4. Choose your LLM and TTS voice. Click the microphone button. Talk.

The pipeline is three model blocks connected in sequence between microphone and speaker: mic → STT → LLM → TTS → speaker. It uses push-to-talk by default (more predictable than continuous listening).

Limitations

Latency is real. 1.5s round-trip with a 3B model is the floor. Larger models push it higher. Cloud voice assistants are faster.

Three models in memory. STT (~500MB) + LLM (2-5GB) + TTS (~300MB). On 8GB, stick with 3B chat models. On 16GB+, 7B is comfortable.

No interruption handling. If the model is speaking and you start talking, the current implementation doesn't stop TTS mid-sentence. You wait for it to finish or manually stop playback.

For brainstorming, dictation review, and Q&A while your hands are busy - local voice chat works. For rapid-fire dialogue where sub-second latency matters, cloud voice modes are still faster.

A full walkthrough with voice selection and pipeline customization is available at modelpiper.com.
