Why Run a Voice Assistant on the Edge?
Running speech‑to‑text and intent detection locally gives you:
- Low latency – no round‑trip to the cloud.
- Privacy – audio never leaves the device.
- Offline reliability – your assistant works even when the internet is down.
In this tutorial we’ll stitch together OpenAI’s Whisper (small model) for transcription, a tiny TensorFlow Lite intent classifier, and a real‑time audio pipeline that lives entirely on a Raspberry Pi 4 (2 GB or more). By the end you’ll have a Python script that listens for commands like “turn on the lamp” and executes a local function instantly.
What You’ll Need
| Item | Reason |
|---|---|
| Raspberry Pi 4 (2 GB+) with Raspberry Pi OS (64‑bit) | Provides enough RAM for Whisper‑small |
| USB microphone | Captures audio |
| Python 3.10+ | Modern language features |
| ffmpeg | Required by Whisper |
| git, pip, virtualenv | Standard development tools |
| Optional: GPIO‑controlled relay | To demonstrate a real command |
Tip: If you’re using a Pi Zero, swap Whisper for a lighter model (e.g., tiny.en) or run only the intent recognizer.
1. Set Up the Development Environment
# Update OS and install system deps
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv ffmpeg libportaudio2
# Create a clean virtual environment
python3 -m venv venv
source venv/bin/activate
# Upgrade pip and install core libraries
pip install --upgrade pip
pip install numpy sounddevice tqdm
Install Whisper
Whisper ships as a Python package that downloads the model on first use.
pip install git+https://github.com/openai/whisper.git
Install TensorFlow Lite Runtime
The full TensorFlow package is heavyweight for a Pi. Use the lightweight runtime instead:
pip install tflite-runtime
2. Capture Audio in Real Time
We’ll use sounddevice to stream 16 kHz mono audio into a rolling in‑memory buffer. Whisper expects 16 kHz audio, so we set the sample rate accordingly.
import sounddevice as sd
import numpy as np
from collections import deque
SAMPLE_RATE = 16000
CHUNK_DURATION = 0.5 # seconds
CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_DURATION)
# A thread‑safe circular buffer
audio_buffer = deque(maxlen=int(5 * SAMPLE_RATE)) # keep last 5 seconds
def audio_callback(indata, frames, time, status):
    """Called by sounddevice for each audio chunk."""
    if status:
        print(f"Audio status: {status}")
    audio_buffer.extend(indata[:, 0])  # mono channel
stream = sd.InputStream(
    samplerate=SAMPLE_RATE,
    blocksize=CHUNK_SIZE,  # deliver audio in 0.5-second chunks
    channels=1,
    dtype='float32',
    callback=audio_callback,
)
stream.start()
print("🔊 Listening…")
The buffer continuously holds the most recent audio. We’ll pull a 2‑second slice every loop iteration and feed it to Whisper.
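For example, a tiny helper can snapshot the newest few seconds of the deque as a contiguous array whenever we need them (the name latest_audio is just a convenience for this tutorial, not part of sounddevice):
def latest_audio(seconds):
    """Return the most recent `seconds` of audio as a float32 NumPy array."""
    samples = int(seconds * SAMPLE_RATE)
    snapshot = np.array(audio_buffer, dtype=np.float32)  # copy the deque once
    return snapshot[-samples:]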
3. Run Whisper on the Edge
Whisper‑small (~244 M parameters) fits into the Pi’s RAM, but CPU‑only inference on a Pi 4 is noticeably slower than real time, so expect a delay of a few seconds per chunk. To keep latency down we’ll use greedy (non‑beam) decoding; if that’s still too slow, drop to the base or tiny model.
import whisper
import torch
# Load Whisper‑small on CPU
model = whisper.load_model("small", device="cpu")
def transcribe_chunk(chunk):
    """Accepts a NumPy array of shape (samples,) and returns text."""
    # Whisper expects float32 audio normalized to [-1, 1]; our stream already
    # delivers that, so we only convert to a tensor here.
    audio = torch.from_numpy(chunk).float()
    # Greedy decoding (no beam search) is the default and keeps latency down;
    # fp16=False avoids the half-precision warning on CPU.
    result = model.transcribe(audio, language="en", fp16=False)
    return result["text"].strip()
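As a quick sanity check before wiring everything together, you can let the stream from step 2 run for a few seconds while you speak, then transcribe whatever landed in the buffer (reusing the latest_audio helper sketched earlier):
import time
time.sleep(5)  # speak a short phrase while the buffer fills
print("Heard:", transcribe_chunk(latest_audio(3.0)))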
Reducing Compute (optional)
torch.jit.script doesn’t work on the full Whisper model out of the box (the transcription pipeline isn’t TorchScript‑compatible end to end), so skip scripting here. A more practical knob on the Pi is capping PyTorch’s thread count so Whisper doesn’t monopolize all four cores:
torch.set_num_threads(2)
4. Build a Tiny Intent Classifier
Instead of parsing full sentences, we’ll map short utterances to intents using a keyword‑spotting model. The architecture is two 1‑D convolutions followed by a dense layer – only about 2 k parameters.
import tensorflow as tf
def build_intent_model(num_classes=4):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16000, 1)),   # 1-second raw waveform
        tf.keras.layers.Rescaling(1.0 / 32768.0),  # Normalize int16-range samples to [-1, 1]
        tf.keras.layers.Conv1D(8, 13, strides=2, activation='relu'),
        tf.keras.layers.Conv1D(16, 13, strides=2, activation='relu'),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])
    return model
Training Data (quick example)
Create a tiny dataset with four commands: ["turn on the lamp", "turn off the lamp", "what time is it", "stop listening"]. Record a few seconds for each command, label them, and train for a handful of epochs.
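Here is a minimal sketch of how X_train / y_train could be assembled from 1‑second, 16 kHz, 16‑bit mono WAV recordings. The data/<label>/ folder layout and label names are assumptions for illustration, not something Whisper or TensorFlow requires:
import wave
from pathlib import Path

LABELS = ["turn_on", "turn_off", "what_time", "stop"]  # hypothetical folder names

def load_clip(path, length=16000):
    """Read a 16-bit mono WAV and pad/trim it to exactly `length` samples."""
    with wave.open(str(path), "rb") as wf:
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    pcm = pcm[:length]
    return np.pad(pcm, (0, length - len(pcm))).astype(np.float32)

X, y = [], []
for label_id, label in enumerate(LABELS):
    for wav in Path("data", label).glob("*.wav"):
        X.append(load_clip(wav))
        y.append(label_id)

X_train = np.stack(X)[..., np.newaxis]                     # (samples, 16000, 1)
y_train = tf.keras.utils.to_categorical(y, num_classes=4)  # one-hot labels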
# Assume `X_train` shape = (samples, 16000, 1), y_train one‑hot encoded
model = build_intent_model(num_classes=4)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=15, batch_size=8)
Convert to TensorFlow Lite and Quantize
Post‑training quantization shrinks the model to roughly 30 KB, and inference runs at more than 100 predictions per second on the Pi.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT] # post‑training quantization
tflite_model = converter.convert()
with open("intent_classifier.tflite", "wb") as f:
    f.write(tflite_model)
print("✅ Saved quantized TFLite model")
Load the TFLite Model
import tflite_runtime.interpreter as tflite
interpreter = tflite.Interpreter(model_path="intent_classifier.tflite")
interpreter.allocate_tensors()
input_idx = interpreter.get_input_details()[0]["index"]
output_idx = interpreter.get_output_details()[0]["index"]
def predict_intent(waveform):
    """waveform: np.ndarray of shape (16000,), float32 in [-1, 1] from the stream."""
    # Dynamic-range quantization keeps the input tensor float32, so stay in float
    # here; scale back up to the int16 range the Rescaling layer expects.
    input_data = (waveform * 32768.0).reshape(1, -1, 1).astype(np.float32)
    interpreter.set_tensor(input_idx, input_data)
    interpreter.invoke()
    probs = interpreter.get_tensor(output_idx)[0]
    intent_id = np.argmax(probs)
    return intent_id, probs[intent_id]
Map IDs to human‑readable intents:
INTENT_MAP = {
0: "TURN_ON",
1: "TURN_OFF",
2: "GET_TIME",
3: "STOP"
}
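Quick check, reusing the latest_audio helper from step 2 while the stream is running:
intent_id, conf = predict_intent(latest_audio(1.0))
print(f"Predicted {INTENT_MAP[int(intent_id)]} ({conf:.2f})")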
5. Glue It All Together
Now we combine the audio stream, Whisper transcription, and intent classifier into a single loop. We’ll use a 1‑second sliding window for intent detection (fast) and a 2‑second window for Whisper (more accurate).
import time
import datetime
def execute_intent(intent):
    if intent == "TURN_ON":
        print("💡 Turning lamp ON")
        # Example GPIO call:
        # import RPi.GPIO as GPIO
        # GPIO.output(LAMP_PIN, GPIO.HIGH)
    elif intent == "TURN_OFF":
        print("💡 Turning lamp OFF")
    elif intent == "GET_TIME":
        now = datetime.datetime.now().strftime("%H:%M")
        print(f"🕒 The time is {now}")
    elif intent == "STOP":
        print("👋 Stopping assistant")
        raise KeyboardInterrupt
try:
    while True:
        # ---- Intent detection (fast) ----
        if len(audio_buffer) >= SAMPLE_RATE:  # need at least 1 sec
            recent = np.array(list(audio_buffer)[-SAMPLE_RATE:])  # last 1 sec
            intent_id, confidence = predict_intent(recent)
            if confidence > 0.85:  # ignore low-confidence guesses
                intent = INTENT_MAP[intent_id]
                print(f"[Intent] {intent} ({confidence:.2f})")
                execute_intent(intent)
        # ---- Whisper transcription (slower; runs once ≥ 2 s of audio is buffered) ----
        if len(audio_buffer) >= 2 * SAMPLE_RATE:
            chunk = np.array(list(audio_buffer)[-2 * SAMPLE_RATE:])
            text = transcribe_chunk(chunk)
            if text:
                print(f"[Transcription] {text}")
        time.sleep(0.2)  # tiny pause to keep CPU happy
except KeyboardInterrupt:
    print("\n🛑 Assistant stopped")
finally:
    stream.stop()
    stream.close()
What’s happening?
- The audio callback continuously fills audio_buffer.
- Every loop iteration we grab a 1‑second slice, run the quantized intent model, and act on high‑confidence predictions (see the debounce note below for avoiding repeat triggers).
- Whenever at least 2 seconds of audio are buffered, we feed a larger slice to Whisper for a full transcription – useful for debugging or for commands that need more context.
- The script exits gracefully on “stop listening”.
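One caveat: the loop polls the intent model several times per second, so a single utterance can trigger the same intent repeatedly while it is still in the buffer. A minimal debounce sketch, assuming a 2‑second cooldown suits your commands (COOLDOWN and should_fire are names invented here):
# Hypothetical debounce: ignore a prediction if the same intent fired recently.
last_intent, last_fired = None, 0.0
COOLDOWN = 2.0  # seconds; tune to taste

def should_fire(intent):
    global last_intent, last_fired
    now = time.monotonic()
    if intent == last_intent and now - last_fired < COOLDOWN:
        return False
    last_intent, last_fired = intent, now
    return True

# In the main loop, replace execute_intent(intent) with:
# if should_fire(intent):
#     execute_intent(intent)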
6. Optimizing for Real‑World Use
| Area | Quick win |
|---|---|
| CPU usage | Set torch.set_num_threads(2) to limit Whisper’s thread count. |
| Power | Reduce idle draw by disabling peripherals you don’t need (e.g., HDMI output, activity LEDs) via raspi-config / config.txt. |
| Audio quality | Add a simple high‑pass filter (scipy.signal.butter) to remove rumble – see the sketch after this table. |
| Model size | Swap Whisper‑small for Whisper‑tiny if RAM is a bottleneck. |
| Hotword detection | Keep the intent model always on; only invoke Whisper after a hotword is detected (e.g., “hey pi”). |
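For the audio‑quality row, here is one way the high‑pass filter could look. It needs pip install scipy, and the 80 Hz cutoff and 4th‑order choice are assumptions to tune for your microphone:
from scipy.signal import butter, sosfilt

# 4th-order Butterworth high-pass at ~80 Hz removes low-frequency rumble
hp_sos = butter(4, 80, btype="highpass", fs=SAMPLE_RATE, output="sos")

def remove_rumble(waveform):
    """Apply the high-pass filter to a float32 waveform slice."""
    return sosfilt(hp_sos, waveform).astype(np.float32)

# Usage: recent = remove_rumble(recent) before calling predict_intent(recent)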
7. Deploying as a System Service
Running the script manually is fine for testing, but for a production‑grade assistant you’ll want it to start on boot.
sudo nano /etc/systemd/system/voice-assistant.service
Paste:
[Unit]
Description=Edge Voice Assistant
After=network.target
[Service]
WorkingDirectory=/home/pi/voice-assistant
ExecStart=/home/pi/voice-assistant/venv/bin/python3 assistant.py
Restart=on-failure
User=pi
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable voice-assistant.service
sudo systemctl start voice-assistant.service
Check logs with journalctl -u voice-assistant -f.
Key takeaways
- Whisper‑small runs on a Raspberry Pi 4 with CPU‑only inference; greedy (non‑beam) decoding keeps the per‑chunk delay tolerable, and the tiny model is faster still.
- A tiny 1‑D ConvNet, post‑training quantized to TensorFlow Lite, provides millisecond‑scale intent detection.
- Using a circular buffer with sounddevice lets you stream audio without dropping frames.
- Combining a fast intent classifier with occasional Whisper transcriptions yields both low latency and high accuracy.
- Packaging the script as a systemd service makes the assistant start automatically and stay resilient.