Build a Private Voice Assistant with Whisper, Ollama, and Kokoro TTS


Have you ever wanted your own Jarvis? A voice assistant that listens, thinks, and speaks back - all running privately on your own hardware? Here's how to build one with Whisper.cpp, Ollama, and Kokoro TTS.

No cloud, no wake-word fees, no data leaving your machine.

Prerequisites

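Hardware: Any modern computer with a microphone (a GPU helps with the 14B model; the figures under Performance use an RTX 3060)
Software: Python 3.10+, Ollama, git, CMake and a C/C++ compiler (to build Whisper.cpp), and FFmpeg (the script plays audio with ffplay)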
Time: ~30 minutes setup
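
Before starting, it can save time to confirm that the command-line tools this tutorial shells out to are on your PATH. The following is a minimal preflight sketch; `missing_tools` and `REQUIRED_TOOLS` are hypothetical names introduced here, and the tool names are assumptions you may need to adjust for your OS:

```python
import shutil

# Command-line tools the tutorial relies on (assumed names; adjust for your OS).
REQUIRED_TOOLS = ["git", "cmake", "ollama", "ffplay"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty result means every tool was found; anything listed needs installing before you continue.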

Installation

1. Install Ollama and Pull a Model

ollama pull qwen3:14b

2. Install Whisper.cpp

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build --config Release
bash models/download-ggml-model.sh medium

3. Install Kokoro TTS

pip install kokoro pyaudio requests

Wiring It All Together

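Save this as voice_assistant.py in the directory that contains the whisper.cpp checkout (the paths in the script are relative to it):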

import subprocess
import tempfile
import wave
import pyaudio
import requests
from kokoro import KPipeline

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3:14b"
WHISPER_BIN = "./whisper.cpp/build/bin/whisper-cli"
WHISPER_MODEL = "./whisper.cpp/models/ggml-medium.bin"
tts_pipeline = KPipeline(lang_code='a')

def record_audio(duration=5, sample_rate=16000):
    # Capture `duration` seconds of 16 kHz, 16-bit mono audio for whisper.cpp.
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1,
                    rate=sample_rate, input=True,
                    frames_per_buffer=1024)
    frames = [stream.read(1024) for _ in range(int(sample_rate / 1024 * duration))]
    stream.stop_stream(); stream.close(); p.terminate()
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
        # Close the wave writer before the temp file so header and data are flushed cleanly.
        with wave.open(f, 'wb') as wf:
            wf.setnchannels(1); wf.setsampwidth(2)
            wf.setframerate(sample_rate)
            wf.writeframes(b''.join(frames))
    return f.name

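def transcribe(audio_file):
    # -nt (--no-timestamps) keeps "[00:00:00.000 --> ...]" prefixes out of the transcript.
    result = subprocess.run([WHISPER_BIN, '-m', WHISPER_MODEL, '-f', audio_file, '-nt'],
                          capture_output=True, text=True, check=True)
    return result.stdout.strip()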

def ask_llm(prompt):
    r = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()  # surface errors such as a model that has not been pulled
    return r.json()["response"]

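def speak(text, voice='af_heart'):
    # Kokoro needs a voice name and yields one result per text chunk.
    # Assumes result.audio is a float tensor in [-1, 1] at 24 kHz, as in Kokoro's examples.
    for result in tts_pipeline(text, voice=voice):
        pcm = (result.audio.cpu().clamp(-1, 1) * 32767).short().numpy().tobytes()
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            with wave.open(f, 'wb') as wf:
                wf.setnchannels(1); wf.setsampwidth(2)
                wf.setframerate(24000)
                wf.writeframes(pcm)
        subprocess.run(['ffplay', '-nodisp', '-autoexit', '-loglevel', 'quiet', f.name])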

# Run it
print("Listening...")
audio_file = record_audio(5)
text = transcribe(audio_file)
print(f"You: {text}")
response = ask_llm(text)
print(f"AI: {response}")
speak(response)
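
One thing to watch for with Qwen3: it is a reasoning model, and depending on your Ollama version and settings, the reply may include a `<think>...</think>` block that you would not want read aloud. A minimal sketch of a filter, assuming that tag format; `strip_reasoning` is a hypothetical helper, not part of Ollama or Kokoro:

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>, as Qwen3 models emit it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_reasoning(text):
    """Remove <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()
```

In the script, this would wrap the model output before it is spoken, for example `speak(strip_reasoning(response))`.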

Run it:

python voice_assistant.py

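Recording starts as soon as "Listening..." appears and lasts 5 seconds, so start speaking right away. The script then prints the transcript, prints the model's reply, and reads it aloud.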
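
The script handles a single exchange and then exits. To keep talking, the same four steps can be put in a loop. The sketch below is one way to do that under a few assumptions: `run_conversation` is a hypothetical helper, the stop words are arbitrary, and the four stages are passed in as functions so the loop can be tried without a microphone:

```python
def run_conversation(listen, transcribe, ask, speak, max_turns=5):
    """Run up to max_turns listen/think/speak cycles; return (question, reply) pairs."""
    transcript = []
    for _ in range(max_turns):
        text = transcribe(listen()).strip()
        if not text:
            # Nothing was recognized, so skip the LLM call for this turn.
            continue
        if text.lower().strip(".!") in {"stop", "quit", "exit"}:
            break
        reply = ask(text)
        transcript.append((text, reply))
        speak(reply)
    return transcript
```

With the functions from voice_assistant.py this would be called as `run_conversation(record_audio, transcribe, ask_llm, speak)`. Because `ask_llm` sends only the latest prompt, the model sees each turn in isolation; carrying context across turns would need extra work.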

Performance

Whisper medium on CPU: transcribes in 2-4 seconds
Qwen3 14B on RTX 3060: responds in 3-5 seconds
Kokoro TTS on CPU: speaks in real-time (< 1 second latency)
Total round-trip: ~10 seconds on modest hardware
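
These numbers will vary with your hardware, so it is worth measuring each stage yourself. A small timing sketch using only the standard library; `timed` and `timings` are hypothetical names introduced for illustration:

```python
import time
from contextlib import contextmanager

# Maps a stage label to its most recent duration in seconds.
timings = {}

@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Example use inside voice_assistant.py (commented out because it needs the full script):
# with timed("transcribe"):
#     text = transcribe(audio_file)
# with timed("llm"):
#     response = ask_llm(text)
# print(timings)
```

Wrapping each call this way shows which stage dominates the round trip on your machine.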

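For faster responses, use a smaller Whisper model (download tiny with the same script and point WHISPER_MODEL at ggml-tiny.bin) or a smaller LLM such as Llama 3.1 8B (ollama pull llama3.1:8b, then update MODEL).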

Originally published on everylocalai.com
