Building a Local Voice AI on Raspberry Pi 5: What Actually Works in 2026
Yes, Raspberry Pi 5 can run voice AI entirely offline in 2026. The 8GB model handles Whisper Tiny (speech-to-text), Piper TTS (text-to-speech), and a 1B–4B quantized LLM via llama.cpp with 8–25 seconds of end-to-end latency. No cloud, no API keys, no subscriptions required.
I wanted a voice assistant that didn't phone home. No API calls. No subscriptions. Nothing leaving my network.
So I built one on a Raspberry Pi 5.
Here's what I learned -- including the parts that don't show up in the tutorials.
Why Bother Going Local?
The obvious reason is privacy. But there's a less-discussed one: reliability. Cloud voice assistants go down. They get deprecated. Pricing changes. And when you're building a custom interface for a client, you want something that works in five years without a vendor making that decision for you.
For this build, the goal was simple: wake word detection, speech-to-text, LLM reasoning, text-to-speech. All on-device. Zero network dependency.
The Hardware Stack
Raspberry Pi 5 (8GB RAM) -- the 8GB model is not optional. You need it. The 4GB variant runs out of headroom fast once the LLM is loaded.
USB microphone -- I used a cheap omnidirectional mic. Quality matters less than you'd think at this stage; the STT model handles noise better than expected.
3.5mm speaker -- the Pi's onboard audio is fine for testing. For production, a small USB audio DAC gives cleaner output.
Optional: Raspberry Pi AI HAT+ 2 -- Hailo's accelerator (released January 2026) adds 40 TOPS of inference capability. It helps with vision workloads but makes less difference for text-only voice pipelines. Skip it unless you're running a camera alongside.
The Software Stack
This is where most tutorials diverge from reality. Here's what actually worked:
Wake word: OpenWakeWord (github.com/dscripka/openWakeWord). Runs on CPU, low latency, customizable. I trained a custom trigger word in about 20 minutes using their web tool.
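A detail the docs gloss over: openWakeWord gives you a per-frame score, not a clean yes/no, so you need debouncing to avoid false triggers. Here's a minimal sketch of that logic — the per-frame scores would come from the real model's `predict()` call, which is stubbed out here; the threshold and frame-count values are starting points, not gospel:

```python
from collections import deque

THRESHOLD = 0.5      # wake-word scores are 0-1; 0.5 is a common starting point
CONSECUTIVE = 3      # require N consecutive frames above threshold before triggering

def make_detector(threshold=THRESHOLD, consecutive=CONSECUTIVE):
    """Return a callable that takes one per-frame score and reports a trigger."""
    recent = deque(maxlen=consecutive)

    def feed(score):
        recent.append(score)
        # Trigger only once the window is full AND every frame in it clears the bar
        return len(recent) == consecutive and all(s >= threshold for s in recent)

    return feed
```

In the real loop you'd call `feed()` once per audio frame with the model's score for your custom trigger word, and start recording the utterance when it returns True.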
Speech-to-text: Whisper Tiny or Small via faster-whisper. Tiny processes in 2-3 seconds on Pi 5. Small is more accurate. For normal speech in a quiet room, Tiny is good enough.
LLM: Phi-3 Mini (3.8B params, Q4 quantized) via Ollama. This is the sweet spot for the Pi 5. Larger models are too slow. Smaller ones lose coherence. Phi-3 Mini at Q4 gives about 3-4 tokens per second.
Text-to-speech: Piper TTS. Fast, local, surprisingly natural. The en_US-lessac-medium voice is my default. Full sentence generation takes under a second.
What the Full Pipeline Looks Like
Microphone -> Wake Word Detection -> Record Utterance -> faster-whisper -> LLM -> Piper TTS -> Speaker
Total round-trip: 15-25 seconds on Pi 5 without acceleration. That's slow. But for home automation triggers, reminders, or local data queries, it's workable.
The key trick: stream the TTS output while the LLM is still generating. Don't wait for the full response. Start speaking the first sentence as soon as it's complete. This drops perceived latency to 8-12 seconds, which is much more tolerable.
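The trick above boils down to a generator that buffers LLM tokens and yields each sentence the moment it closes, so the first one can go to Piper while generation continues. A sketch — the punctuation-based sentence check is a deliberate simplification, not anything Piper or Ollama does for you:

```python
import re

SENTENCE_END = re.compile(r'([.!?])\s')   # naive: punctuation followed by whitespace

def sentences_from_tokens(token_stream):
    """Yield complete sentences as soon as they close, while tokens keep arriving."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()    # hand this sentence to TTS immediately
            buffer = buffer[end:]
    if buffer.strip():                    # flush whatever is left at end of stream
        yield buffer.strip()
```

Each yielded sentence gets synthesized and played while the LLM is still producing the rest of the answer, which is where the drop in perceived latency comes from.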
What I'd Do Differently
Don't use a Pi 5 for latency-sensitive applications. If you need sub-5-second responses, you need a GPU. An RTX 3070 running locally is 10x faster than a Pi 5. The Pi is the right tool for always-on, low-power, embedded use cases.
Plan your context window carefully. The Pi 5's memory constraints mean small models with limited context. Keep system prompts short. Don't expect it to maintain a long conversation history without compression.
Lower the temperature. At defaults, small models are verbose and sometimes incoherent. Drop it to 0.3-0.5 for voice use cases. You want predictable, concise output.
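With Ollama, temperature is set per-request via the `options` field of the generate API. A minimal payload builder — the model tag and the specific values are from this build and worth tuning, not defaults you must use:

```python
def build_request(prompt, model="phi3:mini", temperature=0.4, max_tokens=128):
    """Build a payload for Ollama's POST /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,                    # stream tokens so TTS can start early
        "options": {
            "temperature": temperature,    # 0.3-0.5 keeps small models concise
            "num_predict": max_tokens,     # cap response length for voice replies
        },
    }
```

You'd POST this to `http://localhost:11434/api/generate` and read the streamed JSON lines back. Capping `num_predict` matters almost as much as temperature for voice: nobody wants a spoken answer that runs for a minute.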
Where This Actually Makes Sense
As a general-purpose voice assistant competing with Alexa on responsiveness? No -- the latency gap is too wide right now.
But for narrow, specific purposes? It works well.
A voice interface for a local medical records system. A floor manager assistant at a factory without reliable internet. A privacy-first home hub that runs on $75 hardware with zero ongoing subscription cost.
That's the opportunity. Not a replacement for cloud assistants. A replacement for the class of problem where cloud assistants aren't an option.
Does Raspberry Pi 5 Support Two AI Models at Once?
Yes, with 8GB RAM. The minimum workable stack — Whisper Tiny + Piper TTS + Phi-3 Mini Q4 — uses roughly 3–4 GB of RAM simultaneously. You have headroom for a second small model, but not two large ones. Running Whisper Small and a 7B LLM at the same time will cause aggressive swapping to the SD card and kill response times.
The practical approach: keep one model hot in memory and load others on demand. Load the LLM at startup. Whisper is fast enough to load per-utterance if RAM is tight.
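The keep-one-hot pattern is just a lazy-loading cache. A sketch — `loader` would be whatever constructor you use (e.g. wrapping `faster_whisper.WhisperModel`, which is an assumption about your stack, not shown here):

```python
_cache = {}

def get_model(name, loader):
    """Load `name` via `loader()` on first use, then reuse the cached instance."""
    if name not in _cache:
        _cache[name] = loader()
    return _cache[name]

def release_model(name):
    """Drop a model so its RAM can be reclaimed once nothing else references it."""
    _cache.pop(name, None)
```

Load the LLM through this once at startup and leave it resident; if RAM gets tight, release Whisper after each transcription and pay the small reload cost per utterance instead of swapping.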
Raspberry Pi 5 vs Jetson Nano for Local Voice AI
The Jetson Nano (Orin series) has a dedicated GPU that gives faster LLM inference — but the 4GB Orin Nano starts at $249 vs $80 for a Pi 5 8GB, and setup is significantly more complex. For a pure voice pipeline (STT + small LLM + TTS), the Pi 5 is the better starting point.
The Jetson makes sense if you need sub-5-second latency as a hard requirement, are adding computer vision to the same device, or are deploying at scale where per-unit cost matters less than performance. For prototyping and most embedded deployments, start with the Pi 5.
Is 8GB RAM Required for Local Voice AI on Raspberry Pi 5?
For anything beyond the simplest setup, yes.
- 4GB Pi 5: Can run STT + TTS only. No LLM. Suitable for fixed voice commands, not open-ended conversation.
- 8GB Pi 5: Runs the full stack — wake word + STT + 1B–4B LLM + TTS — with acceptable latency.
- Pi 5 + AI HAT+ 2: Same RAM constraint. The Hailo accelerator helps with vision tasks but makes minimal difference for voice-only pipelines.
If you are on the fence between 4GB and 8GB: buy the 8GB. The $20 you save is not worth discovering the hard way that you needed the headroom.
Building something similar? Need a custom voice AI interface for a compliance-restricted environment or offline use case? I build these async -- no meetings, flat-rate pricing: https://bmdpat.com/start