Kunal

Posted on Jul 5 • Originally published at kunalganglani.com

Local AI Voice Assistant Stack 2026: Whisper + Piper + Ollama Wired Together

#localai #voiceassistant #whisper #pipertts

Originally published at kunalganglani.com — read it there for inline code, hero image, and live links.

A local AI voice assistant is a fully offline speech pipeline where your voice never leaves your home network — microphone audio is transcribed locally, processed by a local LLM, and spoken back through a neural TTS engine, all without a single cloud API call.

Piper TTS got archived in October 2025. Ollama shipped MLX support with up to 90% faster inference on Apple Silicon. Home Assistant introduced Speech-to-Phrase as a faster STT alternative. Every tutorial written before this year is wrong about at least one of those things. This is the 2026-updated guide that covers what actually changed and what you should do about it.

Key Takeaways

The open-source local voice stack in 2026 has 5 components: Whisper (STT), Wyoming Protocol (glue), Home Assistant Assist (intent engine), Ollama (LLM brain), and Piper (TTS). All of it runs on your hardware with zero cloud dependency.
Piper TTS was archived on October 6, 2025 and is now read-only on GitHub, but it still works as a Home Assistant Wyoming add-on. For new projects, evaluate Kokoro TTS or Coqui XTTS instead.
Whisper takes roughly 8 seconds on a Raspberry Pi 4 but under 1 second on an Intel NUC. Speech-to-Phrase is faster for simple home-control commands on constrained hardware.
Ollama 0.31 (June 2026) brings multi-token prediction via MLX on Apple Silicon, hitting up to 90% faster inference. M-series Macs are the sweet spot for this stack right now.
For voice assistant latency, small models like llama3.2:3b, qwen3:4b, or gemma3:4b on Ollama give the best response times on consumer hardware with 16GB RAM.

A voice assistant that phones home on every command isn't smart — it's surveillance with a friendly wake word.

What Is the Local AI Voice Assistant Stack?

Five layers, each handled by a different open-source project. Audio flows through them in this order:

Whisper (or Speech-to-Phrase) — Speech-to-text. OpenAI's Whisper has 104,000+ GitHub stars and was trained on 680,000 hours of multilingual audio. It converts your spoken command into text. Speech-to-Phrase is Home Assistant's newer, constrained alternative that runs in under 1 second even on a Raspberry Pi 4.
Wyoming Protocol — The glue layer. A small JSON-over-TCP protocol from Home Assistant 2023.5 that standardises how STT, TTS, and wake-word services plug into the Assist pipeline. Think of it as USB-C for voice services.
Home Assistant Assist — The intent engine. Parses transcribed text into actions: "turn off the kitchen lights" becomes an entity command. For basic home control, this alone gets you surprisingly far.
Ollama — The LLM brain. When you want conversational responses, context-aware answers, or anything beyond predefined commands, Ollama runs a local LLM as a conversation agent. Models like llama3.2:3b handle voice queries in real time on consumer hardware.
Piper TTS — Text-to-speech. A fast neural TTS system that converts response text back into spoken audio. Archived in October 2025 but still functional.

The Whisper, Piper, and Wyoming Protocol integrations are each used by 8.9% of all active Home Assistant installations as of 2026.7, according to Home Assistant's integration statistics. That's a real installed base for a fully local voice stack.

For a broader look at self-hosting your smart home voice control, see my companion post on self-hosted voice assistants with Home Assistant.

How Does the Full Offline Pipeline Work?

The audio flow from mouth to speaker, with zero packets leaving your LAN:

Microphone → openWakeWord (wake-word detection) → Wyoming STT (Whisper or Speech-to-Phrase) → Home Assistant Assist (intent parsing) → Ollama conversation agent (LLM response) → Wyoming TTS (Piper) → Speaker

Say "Hey Jarvis, what's the weather forecast and turn off the porch lights." openWakeWord catches the wake phrase and activates the pipeline. The audio stream goes via Wyoming to Whisper, which transcribes it to text. Home Assistant Assist receives the transcription and splits the work. The entity command ("turn off the porch lights") gets handled directly through Assist's intent system. The conversational query ("what's the weather forecast") routes to the Ollama conversation agent, which generates a natural-language response using whatever model you've configured. The response text goes via Wyoming to Piper, which synthesises speech and plays it through your speaker.

The entire round trip stays on your network. No audio recordings sitting in someone else's cloud. No transcription logs feeding an ad model. No subscription fee.

As Paulus Schoutsen, founder of Home Assistant and the Open Home Foundation, put it when launching the voice pipeline: this is about building the "World's Most Private Voice Assistant." Three years later, the stack has matured enough to actually deliver on that.

[YOUTUBE:6nsiQXCgnYA|Home Assistance Voice & Ollama Setup Guide - The Ultimate Local LLM Solution!]

Speech-to-Text: Whisper vs. Speech-to-Phrase

You have two STT options in 2026. Pick wrong and you'll either be frustrated by latency or boxed in by what you can say.

OpenAI Whisper is the open-ended option. Trained on 680,000 hours of multilingual audio, 104,000+ GitHub stars, it'll attempt to transcribe anything you say. The trade-off is compute cost. According to Home Assistant's documentation, Whisper takes approximately 8 seconds to process a voice command on a Raspberry Pi 4. On an Intel NUC or equivalent x86 hardware, under 1 second.

Speech-to-Phrase is Home Assistant's newer close-ended model. It only recognises a predefined subset of voice commands — "turn on the lights," "set temperature to 22 degrees," that kind of thing. But it runs in under 1 second even on a Raspberry Pi 4 or Home Assistant Green. If all you need is home control without freeform queries, this is the practical choice for constrained hardware.

Feature	Whisper	Speech-to-Phrase
Transcription type	Open-ended (anything)	Close-ended (known commands)
Pi 4 latency	~8 seconds	< 1 second
NUC/x86 latency	< 1 second	< 1 second
Freeform queries	Yes	No
Shopping lists, timers	Yes	No
Language support	Broad multilingual	Growing (community-translated)
Best for	Powerful hardware + LLM pipeline	Raspberry Pi home control

For the full Ollama-powered conversational pipeline, you need Whisper. Speech-to-Phrase won't pass freeform text to an LLM because it doesn't generate freeform text. But if you're on a Pi 4 and just want fast light switches, Speech-to-Phrase is the right call.

Latency optimisation tip: Use faster-whisper (a CTranslate2 reimplementation) instead of the stock OpenAI Whisper for 2-4x speed improvement. Choose the tiny or base model for speed-critical setups, accepting slightly lower accuracy. The small model hits the sweet spot for most English-language voice assistants. Tune beam size down to 1 for single-command use cases.

From maintaining the benchmark data at kunalganglani.com/llm-benchmarks, I've learned that quantization quality cliffs are model-family-specific. The same principle applies to Whisper model size selection. A blanket "just use tiny" recommendation is wrong. Test with your accent and your actual environment noise.

Piper TTS in 2026: Archived but Not Dead

Most tutorials still don't mention this. Michael Hansen (synesthesiam), creator of Piper and the Rhasspy voice assistant project, archived Piper's GitHub repository on October 6, 2025. The repo has 11,200+ stars, 1,000+ forks, and 396 open issues that will never be fixed.

What this means in practice:

Piper still works. The Home Assistant Wyoming add-on functions fine. You can install it today, pick a voice model, and it will synthesise speech without issues.
No new features. No new voice models, no bug fixes, no security patches. The codebase is frozen.
No new language support. The community had translated Home Assistant voice commands into 45+ languages, but Piper's voice model library won't grow from here.
Building on archived software is technical debt from day one. That's just the reality.

When to still use Piper: You're running the Home Assistant add-on path and want the simplest possible setup. For English and major European languages, the existing voice models are good enough for home assistant responses.

When to look elsewhere: You need voice cloning, new languages, active development, or you're building a Docker-based pipeline outside Home Assistant OS.

Alternatives worth evaluating:

Kokoro TTS — Emerging open-source neural TTS with active development and a growing community. Lighter weight than some alternatives.
Coqui XTTS — Supports voice cloning, broader language coverage. Heavier compute requirements but significantly more capable. Coqui the company shut down, but the XTTS model lives on as open source.
OpenVoice — MIT-licensed, supports cross-lingual voice cloning. Worth a look if multilingual matters to you.

For the Home Assistant pipeline specifically, any TTS that implements the Wyoming protocol can drop in as a Piper replacement. The Wyoming abstraction layer means the rest of your pipeline doesn't care which TTS engine sits behind it.

If you're thinking about the AI security implications of running archived software — the risk is real but bounded. Piper runs locally, accepts text input, produces audio output. The attack surface is narrow compared to a networked LLM endpoint. But frozen dependencies are still frozen dependencies.

The LLM Brain: Connecting Ollama as a Conversation Agent

This is where the stack goes from "smart light switch" to actual assistant. The Ollama integration for Home Assistant adds a conversation agent powered by a local Ollama server. When you ask something conversational — "What should I cook for dinner given what's in my fridge?" — the query routes to a real language model instead of hitting a dead end at Assist's intent parser.

Setup is pretty simple. You need an Ollama server running on a machine accessible to your Home Assistant instance. Doesn't have to be the same machine. For performance, you'll often want Ollama on a beefier box while Home Assistant runs on a Pi or Green.

The configuration options that matter in the Ollama integration:

Model: Which Ollama model to use (e.g., llama3.2:3b, qwen3:4b). Models download automatically during setup.
Instructions: A system prompt template using Home Assistant's templating engine. This is where you define the assistant's personality.
Control Home Assistant: An experimental toggle that gives the LLM access to the Assist API, letting it control exposed entities. Powerful but be careful with this.
Context window size: Defaults to 8,192 tokens (4x Ollama's default of 2,048). For voice assistant use, 8K is more than enough — spoken queries are short.

The Ollama model library shows staggering adoption numbers: llama3.1 at 116.8 million pulls, deepseek-r1 at 89.1 million, llama3.2 at 75.2 million as of mid-2026, per Ollama's library page.

For prompt injection protection: keep the "Control Home Assistant" feature limited to entities you're actually comfortable with an LLM controlling. Don't expose your door locks to a conversation agent that accepts arbitrary voice input. That's basic AI agent security hygiene.

Choosing the Right Ollama Model for Voice Use Cases

Not every model works for voice. Speed is everything here. Nobody wants to wait 15 seconds for an answer to "what time is sunset today?"

For voice assistant pipelines, models in the 1B–7B parameter range hit the best latency-quality trade-off on consumer hardware. My recommendations by hardware tier:

16GB RAM machine (Mac Mini, NUC, mini-PC):

llama3.2:3b — Best latency. Fast enough and conversational enough for home assistant tasks. This is my default recommendation.
qwen3:4b — Slightly larger, better at structured responses. Good pick if you want the LLM to control Home Assistant entities via the experimental API.
gemma3:4b — Google's small model. Strong instruction following, 38.3 million pulls on Ollama.

32GB+ RAM or dedicated GPU:

qwen3:8b — Better reasoning, still fast enough for voice on decent hardware.
llama3.1:8b — The workhorse. Good quality, well-tested, from the most-pulled model family on Ollama.

Raspberry Pi 5 (8GB):

Don't run Ollama on a Pi 5 for voice. The latency will drive you crazy. Offload the LLM to a separate machine and keep the Pi for Home Assistant + Wyoming services.

Ollama 0.31, shipped in June 2026, brings multi-token prediction via MLX on Apple Silicon — up to 90% faster inference compared to previous versions, measured by the Aider polyglot benchmark. If you have an M-series Mac, this makes it the best-value Ollama server for voice use cases right now.

Building and operating this site's multi-agent publishing pipeline taught me that model-per-job-shape beats one-model-everywhere on both cost and quality. The same applies here: use a small, fast model for voice responses and save larger models for other work. Running a 70B model to answer "turn off the lights" is like renting a forklift to carry a grocery bag.

Wyoming Protocol: The Glue That Wires It All Together

Wyoming is what makes this stack modular instead of monolithic. Created by Michael Hansen (synesthesiam), it's a lightweight JSON-over-TCP protocol that lets voice services register with Home Assistant as pluggable components.

It supports 4 service types:

Speech-to-text (Whisper, Speech-to-Phrase)
Text-to-speech (Piper, or any TTS implementing the protocol)
Wake-word detection (openWakeWord)
Intent handling (via Assist pipeline routing)

The reason Wyoming matters: substitutability. Want to swap Piper for Kokoro TTS? Implement the Wyoming protocol and Home Assistant doesn't know the difference. Want to run Whisper on a GPU server in your closet while Home Assistant lives on a Pi in your living room? Wyoming handles it over TCP.

Same architectural principle behind function calling in LLM systems — a standardised interface that decouples the orchestrator from the service providers.

Hardware Tiers: Realistic Performance Expectations

Stop reading guides that don't tell you what hardware you actually need. The honest breakdown for running Whisper + Piper + Ollama concurrently:

Hardware	Whisper Latency	Ollama (3B model)	Can run full stack?	Estimated Cost
Raspberry Pi 4 (4GB)	~8 seconds	Too slow	STT/TTS only, offload LLM	$55
Raspberry Pi 5 (8GB)	~3-4 seconds	Marginal	Barely, with Speech-to-Phrase	$80
Intel NUC / Mini-PC (16GB)	< 1 second	~2-3 seconds	Yes	$300-500
Mac Mini M2 (16GB)	< 1 second	< 1.5 seconds	Yes, excellent	$500-600
Mac Mini M4 (24GB)	< 0.5 seconds	< 1 second	Best value option	$700
Custom PC with RTX 4060+	< 0.5 seconds	< 1 second	Yes, GPU-accelerated	$800+

Minimum viable setup for the full conversational pipeline: 16GB RAM and an x86 or ARM64 processor made in the last 5 years. Below that, offload the LLM to a separate machine.

For a deeper look at GPU requirements, check the local LLM hardware guide and the complete AI hardware guide on this site.

Apple Silicon deserves a specific callout. With Ollama 0.31's MLX multi-token prediction delivering up to 90% faster inference, an M2 or M4 Mac Mini is arguably the best single-box solution for this entire stack. Unified memory means you're not hitting discrete VRAM limits — and from running my own local LLM benchmarks across Apple Silicon hardware, I can tell you that unified memory changes the "VRAM is the bottleneck" intuition entirely. Big models load fine; throughput is the real constraint to watch. I've written more about Apple Silicon vs NVIDIA for local AI. For voice assistant workloads specifically, Apple wins on power efficiency and noise. Fanless operation matters when the device sits in your living room.

Wake Word Detection With openWakeWord

Without a wake word, your voice assistant requires a button press to activate. openWakeWord is the open-source solution that plugs into Wyoming for always-on listening.

openWakeWord runs a small neural network that continuously monitors audio for a trigger phrase. It supports custom wake words — you're not locked into "Hey Google" or "Alexa." Common choices: "Hey Jarvis," "Hey Mycroft," or any custom phrase you train.

The important design decision: openWakeWord runs on the satellite device (the ESP32 or Pi with the microphone), not your central server. Wake-word detection happens at the edge with minimal latency. Only activated audio streams get forwarded to Whisper.

For ESP32-based satellite devices using ESPHome, openWakeWord integrates directly. The Home Assistant community has built voice satellites using cheap ESP32-S3 boards that run openWakeWord locally with surprisingly good accuracy.

This is a far simpler architecture than building custom AI agents from scratch. Wyoming handles the complexity of routing audio between wake-word detection, STT, and TTS. You don't build that plumbing yourself.

Running the Stack Without Home Assistant OS

Not everyone runs Home Assistant OS. If you're on Home Assistant Container, Home Assistant Core, or you want this pipeline without Home Assistant at all, the Docker Compose path works.

The Wyoming services (Whisper, Piper, openWakeWord) are all available as standalone Docker containers. You can wire them together with Home Assistant Container or build your own orchestration.

The architecture for a Docker-based deployment:

Container 1: wyoming-whisper — runs the Whisper STT service, exposes a Wyoming TCP port (default 10300)
Container 2: wyoming-piper — runs Piper TTS, exposes Wyoming TCP port (default 10200)
Container 3: wyoming-openwakeword — runs wake-word detection, exposes Wyoming TCP port (default 10400)
Container 4: ollama — runs the LLM server, exposes HTTP API on port 11434
Container 5: homeassistant — the core instance, connects to all Wyoming services and Ollama via their TCP/HTTP ports

Key configuration: tell Home Assistant where each Wyoming service lives. In the Wyoming integration setup, you point to each container's hostname and port. For Ollama, add the integration and point the URL to http://ollama:11434.

If you want to skip Home Assistant entirely and build a pure Python pipeline, you'll need to implement the intent-parsing layer yourself. Projects like OpenJarvis v1.0 — which launched in May 2026 with built-in Ollama support — are emerging as alternatives for developers who want an agent framework without the smart-home baggage.

For local LLM serving outside the Home Assistant ecosystem, Ollama remains the easiest path. It handles model management, GGUF quantization formats, and API compatibility. See the Ollama vs llama.cpp comparison for the full trade-off analysis.

Privacy: What Stays Local vs. What Leaks

The privacy argument for this stack isn't hand-waving. Here's exactly what goes where:

Data Point	Cloud Assistant (Alexa/Google)	This Local Stack
Audio recordings	Stored on vendor servers	Never leaves your LAN
Transcription text	Processed and stored in cloud	Processed locally, discarded
Command history	Full log retained by vendor	Only in your HA instance
Device state data	Sent to vendor cloud	Stays on your network
Voice profiles	Stored for speaker recognition	Not applicable
Third-party sharing	Shared with skills/actions providers	Zero third parties
Internet requirement	Required for every command	Not required at all

The entire pipeline works with no internet connection. Once you've downloaded the Whisper model, Piper voice files, and your Ollama model, you can unplug your router and the voice assistant keeps working. That's not true of any commercial voice assistant on the market today.

Amazon's move toward paid Alexa subscriptions makes the case even stronger. You're not just avoiding surveillance. You're avoiding a recurring fee for a service that gets worse every year with more ads and partner integrations.

Troubleshooting Common Issues

These are the failure modes you'll actually hit when wiring this stack together, and how to fix them:

Wyoming port not reachable: The most common problem. Make sure Wyoming service containers are on the same Docker network as Home Assistant. Check that TCP ports (10200, 10300, 10400) aren't blocked by your host firewall. On Linux, ss -tlnp | grep 10300 confirms Whisper is listening.

Ollama context window overflow: Voice conversations accumulate context. If Ollama starts returning errors or truncated responses, the context window is full. The Home Assistant Ollama integration defaults to 8,192 tokens. Increase it if needed, but know that larger context windows eat more RAM.

Whisper GPU not detected: If Whisper is running on CPU despite having a GPU available, check that the Docker container has GPU passthrough enabled (--gpus all for NVIDIA, or proper ROCm setup for AMD). ROCm users need additional container configuration.

Piper producing garbled audio: Almost always a sample rate mismatch. Piper outputs 22050 Hz by default. If your audio pipeline expects 16000 Hz or 48000 Hz, you get distorted playback. Match the output sample rate to your speaker setup.

High Whisper latency on good hardware: Check that you're running faster-whisper, not stock Whisper. Verify the model size — accidentally loading large-v3 instead of small on a 16GB machine will crush performance. Monitor RAM usage. If the system is swapping, everything slows to a crawl.

Ollama model not responding to Home Assistant queries: Make sure the model actually downloaded. Run ollama list on the server to confirm. Also verify the Ollama server is listening on 0.0.0.0 rather than localhost if Home Assistant is on a different machine. This one catches people constantly.

What's Next for the Local Voice Stack

A few things are becoming clear about where this is heading:

Piper's archival leaves a TTS gap. Someone will fill it. Kokoro TTS and the community forks around Coqui XTTS are the leading candidates. Whichever project ships a clean Wyoming protocol implementation first will likely become the default. If you're looking at open-source AI projects worth contributing to, a Wyoming-compatible TTS wrapper is a high-impact opportunity right now.

Ollama is becoming the standard local LLM backend. OpenJarvis v1.0 choosing Ollama as its default in May 2026, combined with 116+ million pulls on its top model — that's not an experiment anymore. The Anthropic Messages API compatibility added in January 2026 means existing toolchains port over with minimal friction.

Apple Silicon is the quiet winner. Unified memory (no VRAM limits), MLX multi-token prediction (90% faster), fanless operation. M-series Macs are the ideal hardware for a living-room voice server. I expect this to become the recommended path over the Pi within a year.

Speech-to-Phrase will expand. Home Assistant's constrained STT model is limited in command vocabulary today, but it's exactly the right trade-off for 80% of home automation use cases. Expect the supported command set to grow significantly through 2026-2027, potentially making Whisper unnecessary for most users.

The commercial voice assistant market is fragmenting under subscription pressure and privacy backlash. This open-source stack isn't a hobby project anymore. With 8.9% of all Home Assistant installations already running the Wyoming voice pipeline, it's a legitimate alternative that works today. The question isn't whether local voice assistants will matter. It's whether you'll build yours before the next Alexa price hike.

Originally published on kunalganglani.com

Top comments (1)

elboKazQC • Jul 7

Solid breakdown. From building a local dictation tool on faster-whisper, your "just use tiny is wrong" point hits even harder for code-switching: I dictate Quebec French mixed with English dev jargon, and tiny/base fall apart the moment a word like "useState" lands inside a French sentence. Small was my floor, medium when the machine can take it. And for pure dictation instead of home control, push-to-talk drops the whole wake-word layer and the perceived latency with it. Have you tested Speech-to-Phrase with any non-English commands yet?