Have you ever wanted your own Jarvis? A voice assistant that listens, thinks, and speaks back - all running privately on your own hardware? Here's how to build one with Whisper.cpp, Ollama, and Kokoro TTS.
No cloud, no wake-word fees, no data leaving your machine.
Prerequisites
- Hardware: Any modern computer with a microphone
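- Software: Python 3.10+, Ollama installed, git and CMake (with a C/C++ compiler) to build Whisper.cpp, and FFmpeg for the ffplay command used for playback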
- Time: ~30 minutes setup
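Before installing anything, a quick check can confirm that the command-line tools used later in this guide are on your PATH. This is a minimal sketch; the tool list is inferred from the commands in this post, and the names `REQUIRED_TOOLS`, `missing_tools`, and `python_ok` are illustrative, not part of any library.

```python
import shutil
import sys

# Tools invoked later in this guide (assumed list, adjust to your setup).
REQUIRED_TOOLS = ["git", "cmake", "ollama", "ffplay"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 10)):
    """True if the interpreter meets the minimum major.minor version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print("Python 3.10+:", "ok" if python_ok() else "too old")
    print("Missing tools:", missing_tools(REQUIRED_TOOLS) or "none")
```

Anything listed as missing needs to be installed before the steps below will work.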
Installation
1. Install Ollama and Pull a Model
ollama pull qwen3:14b
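To confirm the pull worked and the Ollama server is reachable, you can query its /api/tags endpoint, which lists locally available models. The sketch below assumes the default port 11434 and uses only the standard library; `has_model` and `ollama_has_model` are hypothetical helper names.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # default local Ollama address (assumed)


def has_model(tags_payload, model):
    """Check a decoded /api/tags response for an exact model name."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    return model in names


def ollama_has_model(model, url=TAGS_URL, timeout=5):
    """Return True if the server answers and lists `model`, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Server not running, connection refused, or a non-JSON reply.
        return False
    return has_model(payload, model)
```

Calling `ollama_has_model("qwen3:14b")` should return True once the model is downloaded and the server is running.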
2. Install Whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build --config Release
bash models/download-ggml-model.sh medium
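It is worth checking that the build and the model download produced the two files the script below expects. This is a small sketch that assumes the Linux/macOS layout used in this post (build/bin/whisper-cli and models/ggml-medium.bin); other platforms may place the binary elsewhere. `missing_whisper_files` is an illustrative name.

```python
from pathlib import Path


def missing_whisper_files(root="whisper.cpp", model="medium"):
    """Return the expected Whisper.cpp files that are not present under `root`."""
    root = Path(root)
    expected = [
        root / "build" / "bin" / "whisper-cli",   # CLI binary from the CMake build
        root / "models" / f"ggml-{model}.bin",     # model fetched by the download script
    ]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_whisper_files()
    print("Whisper.cpp looks ready" if not missing else f"Missing: {missing}")
```

Run it from the directory that contains the whisper.cpp checkout.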
3. Install Kokoro TTS
pip install kokoro pyaudio requests
Wiring It All Together
Save this as voice_assistant.py:
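import subprocess
import tempfile
import wave

import numpy as np  # installed as a dependency of kokoro
import pyaudio
import requests
from kokoro import KPipeline

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3:14b"
WHISPER_BIN = "./whisper.cpp/build/bin/whisper-cli"
WHISPER_MODEL = "./whisper.cpp/models/ggml-medium.bin"
VOICE = "af_heart"       # assumed: one of Kokoro's stock American English voices
TTS_SAMPLE_RATE = 24000  # Kokoro produces 24 kHz audio

tts_pipeline = KPipeline(lang_code='a')  # 'a' = American English


def record_audio(duration=5, sample_rate=16000, chunk=1024):
    """Record `duration` seconds of 16-bit mono audio; return the WAV file path."""
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1,
                    rate=sample_rate, input=True,
                    frames_per_buffer=chunk)
    frames = [stream.read(chunk) for _ in range(int(sample_rate / chunk * duration))]
    stream.stop_stream()
    stream.close()
    p.terminate()
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
        # Closing the wave writer finalizes the WAV header.
        with wave.open(f, 'wb') as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(sample_rate)
            wf.writeframes(b''.join(frames))
    return f.name


def transcribe(audio_file):
    """Run whisper-cli on a WAV file and return the transcript."""
    # -nt (--no-timestamps) keeps "[00:00:00.000 --> ...]" prefixes out of stdout.
    result = subprocess.run([WHISPER_BIN, '-m', WHISPER_MODEL, '-f', audio_file, '-nt'],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def ask_llm(prompt):
    """Send the prompt to Ollama and return the full (non-streamed) reply."""
    r = requests.post(OLLAMA_URL,
                      json={"model": MODEL, "prompt": prompt, "stream": False},
                      timeout=120)
    r.raise_for_status()
    return r.json()["response"]


def speak(text):
    """Synthesize `text` with Kokoro and play each chunk with ffplay."""
    for result in tts_pipeline(text, voice=VOICE):
        if result.audio is None:
            continue
        # Kokoro yields float samples in [-1, 1]; convert to 16-bit PCM.
        samples = result.audio.detach().cpu().numpy()
        pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            with wave.open(f, 'wb') as wf:
                wf.setnchannels(1)
                wf.setsampwidth(2)
                wf.setframerate(TTS_SAMPLE_RATE)
                wf.writeframes(pcm.tobytes())
        subprocess.run(['ffplay', '-nodisp', '-autoexit', '-loglevel', 'quiet', f.name])


# Run it
if __name__ == "__main__":
    print("Listening...")
    audio_file = record_audio(5)
    text = transcribe(audio_file)
    print(f"You: {text}")
    response = ask_llm(text)
    print(f"AI: {response}")
    speak(response)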
Run it:
python voice_assistant.py
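Speak into your mic as soon as "Listening..." appears. The script records for 5 seconds, then transcribes your speech, sends it to the model, and plays the spoken reply.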
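Depending on the flags and versions in use, the text passed between stages may need a little cleanup: whisper-cli output can carry timestamp prefixes or tags such as [BLANK_AUDIO] for silence, and Qwen3 may include its reasoning inside <think>...</think> tags, which you probably don't want read aloud. The helpers below are a rough sketch of that cleanup; `clean_transcript` and `strip_think` are hypothetical names, and the patterns are assumptions about typical output rather than a documented format.

```python
import re

# "[00:00:00.000 --> 00:00:05.000]" prefix that whisper-cli may print per segment
TIMESTAMP = re.compile(r"^\[\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\]\s*")
# Upper-case bracketed tags such as [BLANK_AUDIO]
NON_SPEECH = re.compile(r"\[[A-Z_ ]+\]")
# Reasoning block some models emit before the answer
THINK = re.compile(r"<think>.*?</think>", re.DOTALL)


def clean_transcript(raw):
    """Drop timestamp prefixes and non-speech tags, join segments with spaces."""
    parts = []
    for line in raw.splitlines():
        line = TIMESTAMP.sub("", line.strip())
        line = NON_SPEECH.sub("", line).strip()
        if line:
            parts.append(line)
    return " ".join(parts)


def strip_think(reply):
    """Remove any <think>...</think> block from a model reply."""
    return THINK.sub("", reply).strip()
```

In voice_assistant.py these could wrap the existing calls, for example `text = clean_transcript(transcribe(audio_file))` and `speak(strip_think(response))`.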
Performance
- Whisper medium on CPU: transcribes in 2-4 seconds
- Qwen3 14B on RTX 3060: responds in 3-5 seconds
- Kokoro TTS on CPU: speaks in real-time (< 1 second latency)
- Total round-trip: ~10 seconds on modest hardware
For faster responses, use Whisper tiny or a smaller LLM like Llama 3.1 8B.
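The figures above will vary with your hardware, so it helps to measure each stage yourself before swapping models. Here is a minimal timing sketch using only the standard library; `timed`, `report`, and the stage labels are illustrative names, not part of the original script.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage, store=timings, clock=time.perf_counter):
    """Record how long the enclosed block takes under `stage`."""
    start = clock()
    try:
        yield
    finally:
        store[stage] = clock() - start


def report(store=timings):
    """Format per-stage timings plus a total as a small text table."""
    lines = [f"{stage:<12}{secs:6.2f}s" for stage, secs in store.items()]
    lines.append(f"{'total':<12}{sum(store.values()):6.2f}s")
    return "\n".join(lines)
```

In voice_assistant.py you could wrap each call, for example `with timed("whisper"): text = transcribe(audio_file)`, do the same for `ask_llm` and `speak`, and finish with `print(report())` to see where the round-trip time goes.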
Originally published on everylocalai.com