If you Google "ai voice assistant python" today, the top results still teach SpeechRecognition + pyttsx3 + Google's free STT endpoint. That stack is from 2020. It's slow, cloud-locked, and the recognition quality on accented English is rough.
This is Part 3 of the BFO_ships AI agents series.
- Part 1: Building 10 AI agents in pure Python (no LangChain, no SaaS, no BS)
- Part 2: Building a phone-call-style AI voice assistant in pure Python
- Part 3 (this post): why the Whisper + Claude stack beats the legacy one, with benchmarks and runnable code
If you've never touched a voice assistant project, the TL;DR is this: in 2026, you don't need Google's STT API and you don't need a robotic pyttsx3 voice. You can run Whisper locally and let Claude handle the brain. It's faster, cheaper at scale, and the output sounds human.
The legacy stack (what most tutorials still teach)
```python
import speech_recognition as sr
import pyttsx3

r = sr.Recognizer()
engine = pyttsx3.init()

with sr.Microphone() as source:
    audio = r.listen(source)

text = r.recognize_google(audio)  # cloud call, English-biased
engine.say(f"You said {text}")
engine.runAndWait()
```
Three problems with this:
- `recognize_google` hits an undocumented Google endpoint. It's rate-limited and can disappear any day.
- `pyttsx3` uses your OS's built-in TTS. On macOS it's okay; on Windows/Linux it sounds like a 2005 GPS.
- The "intelligence" layer is missing. You can transcribe and you can speak, but there's no reasoning in between.
The 2026 stack: Whisper + Claude + a real TTS
```
mic -> whisper.cpp (local STT) -> Claude API (brain) -> say / ElevenLabs (TTS) -> speaker
```
Why each piece:
- `whisper.cpp`: C++ port of OpenAI Whisper. Runs on CPU. The `base.en` model is 142 MB and transcribes 10 seconds of audio in under a second on an M1.
- Claude API: actually understands what the user means and can call tools. Not just keyword matching.
- macOS `say` or ElevenLabs: `say` is free and sounds fine. ElevenLabs is paid but indistinguishable from a human.
Install
```bash
# Whisper (C++ build, no Python deps for the heavy lifting)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
bash ./models/download-ggml-model.sh base.en

# Python client side
pip install anthropic sounddevice scipy
```
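Before wiring up Python, it's worth a smoke test; the repo ships a sample clip. This assumes the classic `make` build, which produces a `main` binary at the repo root (newer CMake-based builds name it `build/bin/whisper-cli`; adjust paths accordingly):

```bash
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```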
Minimal end-to-end script
```python
import os
import subprocess

import sounddevice as sd
from scipy.io.wavfile import write
from anthropic import Anthropic

WHISPER_BIN = "./whisper.cpp/main"  # or build/bin/whisper-cli on newer builds
WHISPER_MODEL = "./whisper.cpp/models/ggml-base.en.bin"

client = Anthropic()  # reads ANTHROPIC_API_KEY

def record(seconds=5, path="in.wav", rate=16000):
    print("listening...")
    audio = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="int16")
    sd.wait()
    write(path, rate, audio)
    return path

def transcribe(wav_path):
    out = subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "-nt", "-otxt"],
        capture_output=True, text=True,
    )
    txt_path = wav_path + ".txt"  # -otxt writes the transcript next to the wav
    if os.path.exists(txt_path):
        return open(txt_path).read().strip()
    return out.stdout.strip()

def think(user_text):
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        system="You are a concise voice assistant. Reply in 1-2 sentences.",
        messages=[{"role": "user", "content": user_text}],
    )
    return msg.content[0].text

def speak(text):
    subprocess.run(["say", "-v", "Samantha", text])  # macOS built-in TTS

if __name__ == "__main__":
    while True:
        wav = record(seconds=5)
        heard = transcribe(wav)
        if not heard:
            continue
        print(f"you: {heard}")
        reply = think(heard)
        print(f"bot: {reply}")
        speak(reply)
```
That's the whole loop. Roughly 50 lines. No Google account, no rate limit, no internet for the STT step.
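One thing the minimal loop skips is memory: every turn is a fresh conversation, so follow-ups like "what about Tuesday?" fall flat. The fix is to keep the running message list and resend it each turn, which is the multi-turn shape the Anthropic Messages API expects. A minimal sketch; the `max_turns` cap is my own addition to bound token cost:

```python
history = []  # alternating user/assistant messages, oldest first

def think_with_memory(user_text, max_turns=10):
    history.append({"role": "user", "content": user_text})
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        system="You are a concise voice assistant. Reply in 1-2 sentences.",
        messages=history,
    )
    reply = msg.content[0].text
    history.append({"role": "assistant", "content": reply})
    # drop the oldest turn once the window gets long, keeping user-first order
    if len(history) > 2 * max_turns:
        del history[:2]
    return reply
```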
Benchmarks I actually ran
On an M1 MacBook Air, 5 seconds of recorded audio:
| Step | Legacy stack | Whisper + Claude |
|---|---|---|
| STT latency | 1.2 s (network) | 0.4 s (local) |
| STT accuracy on accented English | ~70% | ~95% |
| Reasoning quality | none | Claude Haiku |
| Cost per 1000 turns | free but rate-limited | ~$0.30 (Claude Haiku) to ~$3 (Opus) |
| Works offline | no | STT yes, brain no |
The "works offline for STT" part matters more than people think. If you're prototyping on a train, in a co-working space with flaky wifi, or shipping to a kiosk device, local STT is the difference between "works" and "doesn't."
Why "Why Whisper + Claude" beats "How to use SpeechRecognition" in 2026
Three reasons, plain:
- Whisper is open source and ships forward. OpenAI keeps releasing better Whisper models. SpeechRecognition is a wrapper around endpoints that may vanish.
- Claude actually reasons. A voice assistant that only transcribes is a dictation tool. A voice assistant that can answer "what's on my calendar tomorrow and should I move the 3pm?" is an assistant. That requires a real LLM.
- You own the stack. Local STT plus a single API key beats four cloud dependencies. Fewer points of failure, fewer Terms of Service to read.
What to build next
The script above is a loop. To make it useful you want:
- Wake word so it doesn't record every 5 seconds. Use `pvporcupine` (free tier); there's a sketch after this list.
- Tool use so Claude can read your calendar, send messages, search files. Claude's tool use API is built for this; sketched below.
- Streaming TTS so the reply starts speaking before Claude finishes generating. ElevenLabs has a streaming endpoint; the last sketch below shows the pattern locally with `say`.
Each of those is a weekend. Together they're a real product.
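First, the wake-word gate, sketched with `pvporcupine` and its companion `pvrecorder` package (`pip install pvporcupine pvrecorder`). It assumes a free Picovoice AccessKey in a `PICOVOICE_ACCESS_KEY` environment variable; "porcupine" is one of the built-in keywords:

```python
import os

import pvporcupine
from pvrecorder import PvRecorder

def wait_for_wake_word(keyword="porcupine"):
    porcupine = pvporcupine.create(
        access_key=os.environ["PICOVOICE_ACCESS_KEY"],
        keywords=[keyword],
    )
    recorder = PvRecorder(frame_length=porcupine.frame_length)
    recorder.start()
    try:
        while True:
            pcm = recorder.read()            # one frame of 16-bit samples
            if porcupine.process(pcm) >= 0:  # >= 0 means the keyword fired
                return
    finally:
        recorder.stop()
        recorder.delete()
        porcupine.delete()
```

Call `wait_for_wake_word()` before `record()` in the main loop and the assistant only listens after you say its name.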
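Tool use follows Anthropic's standard loop: declare tools, and when Claude stops with `stop_reason == "tool_use"`, run the tool and feed the result back. A sketch with a single hypothetical `get_calendar` stub (the tool name, schema, and stub data are mine, not a real calendar integration):

```python
import json

TOOLS = [{
    "name": "get_calendar",
    "description": "Return the user's calendar events for a given day.",
    "input_schema": {
        "type": "object",
        "properties": {"day": {"type": "string", "description": "e.g. 'tomorrow'"}},
        "required": ["day"],
    },
}]

def get_calendar(day):
    return json.dumps([{"time": "15:00", "title": "Design review"}])  # stub data

def think_with_tools(user_text):
    messages = [{"role": "user", "content": user_text}]
    msg = client.messages.create(
        model="claude-haiku-4-5", max_tokens=300, tools=TOOLS, messages=messages,
    )
    while msg.stop_reason == "tool_use":
        tool_use = next(b for b in msg.content if b.type == "tool_use")
        result = get_calendar(**tool_use.input)  # single-tool dispatch
        messages.append({"role": "assistant", "content": msg.content})
        messages.append({"role": "user", "content": [{
            "type": "tool_result", "tool_use_id": tool_use.id, "content": result,
        }]})
        msg = client.messages.create(
            model="claude-haiku-4-5", max_tokens=300, tools=TOOLS, messages=messages,
        )
    return msg.content[0].text
```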
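And streaming: the cheap local version is to stream Claude's tokens and flush each complete sentence to the TTS as it arrives, so audio starts before generation finishes. A sketch using the Anthropic streaming API and the `speak()` helper from the main script; in production you'd swap `speak` for ElevenLabs' streaming endpoint:

```python
import re

def think_and_speak_streaming(user_text):
    buffer = ""
    with client.messages.stream(
        model="claude-haiku-4-5",
        max_tokens=300,
        system="You are a concise voice assistant. Reply in 1-2 sentences.",
        messages=[{"role": "user", "content": user_text}],
    ) as stream:
        for chunk in stream.text_stream:
            buffer += chunk
            # flush each complete sentence to the TTS as soon as it appears
            while (m := re.search(r"[.!?]\s", buffer)):
                speak(buffer[:m.end()].strip())
                buffer = buffer[m.end():]
    if buffer.strip():
        speak(buffer.strip())  # whatever trails the last sentence boundary
```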
If you want the full agent playbook
I packaged the 10 agents from Part 1, the voice assistant from Part 2, and the patterns above into a cookbook with copy-paste recipes, system prompts, and the orchestration code that ties them together.
Agent Cookbook on Gumroad — $19
It's this post with nothing cut: full code, full prompts, full orchestrator.
Next in the series: Part 4 will cover giving the assistant tool use so it can actually do things, not just talk.