If you Google "ai voice assistant python" today, the top results still teach SpeechRecognition + pyttsx3 + Google's free STT endpoint. That stack is from 2020. It's slow, cloud-locked, and the recognition quality on accented English is rough.
This is Part 3 of the BFO_ships AI agents series.
- Part 1: Building 10 AI agents in pure Python (no LangChain, no SaaS, no BS)
- Part 2: Building a phone-call-style AI voice assistant in pure Python
- Part 3 (this post): why the Whisper + Claude stack beats the legacy one, with benchmarks and runnable code
If you've never touched a voice assistant project, the TL;DR is this: in 2026, you don't need Google's STT API and you don't need a robotic pyttsx3 voice. You can run Whisper locally and let Claude handle the brain. It's faster, cheaper at scale, and the output sounds human.
The legacy stack (what most tutorials still teach)
```python
import speech_recognition as sr
import pyttsx3

r = sr.Recognizer()
engine = pyttsx3.init()

with sr.Microphone() as source:
    audio = r.listen(source)

text = r.recognize_google(audio)  # cloud call, English-biased
engine.say(f"You said {text}")
engine.runAndWait()
```
Three problems with this:
- `recognize_google` hits an undocumented Google endpoint. It's rate-limited and can disappear any day.
- `pyttsx3` uses your OS's built-in TTS. On macOS it's okay; on Windows/Linux it sounds like a 2005 GPS.
- The "intelligence" layer is missing. You can transcribe and you can speak, but there's no reasoning in between.
The 2026 stack: Whisper + Claude + a real TTS
```
mic -> whisper.cpp (local STT) -> Claude API (brain) -> say / ElevenLabs (TTS) -> speaker
```
Why each piece:
- `whisper.cpp`: C++ port of OpenAI Whisper. Runs on CPU. The `base.en` model is 142 MB and transcribes 10 seconds of audio in under a second on an M1.
- Claude API: actually understands what the user means and can call tools. Not just keyword matching.
- macOS `say` or ElevenLabs: `say` is free and sounds fine. ElevenLabs is paid but indistinguishable from a human.
Install
```bash
# Whisper (C++ build, no Python deps for the heavy lifting)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
bash ./models/download-ggml-model.sh base.en

# Python client side
pip install anthropic sounddevice scipy
```
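Before wiring up Python, it's worth a smoke test; the repo ships a sample clip. This assumes the classic `make` build, which produces a `main` binary at the repo root (newer CMake-based builds name it `build/bin/whisper-cli`; adjust paths accordingly):

```bash
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```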
Minimal end-to-end script
```python
import os
import subprocess

import sounddevice as sd
from scipy.io.wavfile import write
from anthropic import Anthropic

WHISPER_BIN = "./whisper.cpp/main"  # or build/bin/whisper-cli on newer builds
WHISPER_MODEL = "./whisper.cpp/models/ggml-base.en.bin"

client = Anthropic()  # reads ANTHROPIC_API_KEY

def record(seconds=5, path="in.wav", rate=16000):
    print("listening...")
    audio = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="int16")
    sd.wait()
    write(path, rate, audio)
    return path

def transcribe(wav_path):
    out = subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "-nt", "-otxt"],
        capture_output=True, text=True,
    )
    txt_path = wav_path + ".txt"  # -otxt writes the transcript next to the wav
    if os.path.exists(txt_path):
        return open(txt_path).read().strip()
    return out.stdout.strip()

def think(user_text):
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        system="You are a concise voice assistant. Reply in 1-2 sentences.",
        messages=[{"role": "user", "content": user_text}],
    )
    return msg.content[0].text

def speak(text):
    subprocess.run(["say", "-v", "Samantha", text])  # macOS built-in TTS

if __name__ == "__main__":
    while True:
        wav = record(seconds=5)
        heard = transcribe(wav)
        if not heard:
            continue
        print(f"you: {heard}")
        reply = think(heard)
        print(f"bot: {reply}")
        speak(reply)
```
That's the whole loop. Roughly 50 lines. No Google account, no rate limit, no internet for the STT step.
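One thing the minimal loop skips is memory: every turn is a fresh conversation, so follow-ups like "what about Tuesday?" fall flat. The fix is to keep the running message list and resend it each turn, which is the multi-turn shape the Anthropic Messages API expects. A minimal sketch; the `max_turns` cap is my own addition to bound token cost:

```python
history = []  # alternating user/assistant messages, oldest first

def think_with_memory(user_text, max_turns=10):
    history.append({"role": "user", "content": user_text})
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        system="You are a concise voice assistant. Reply in 1-2 sentences.",
        messages=history,
    )
    reply = msg.content[0].text
    history.append({"role": "assistant", "content": reply})
    # drop the oldest turn once the window gets long, keeping user-first order
    if len(history) > 2 * max_turns:
        del history[:2]
    return reply
```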
Benchmarks I actually ran
On an M1 MacBook Air, 5 seconds of recorded audio:
| Step | Legacy stack | Whisper + Claude |
|---|---|---|
| STT latency | 1.2 s (network) | 0.4 s (local) |
| STT accuracy on accented English | ~70% | ~95% |
| Reasoning quality | none | Claude Haiku |
| Cost per 1000 turns | free but rate-limited | ~$0.30 (Claude Haiku) to ~$3 (Opus) |
| Works offline | no | STT yes, brain no |
The "works offline for STT" part matters more than people think. If you're prototyping on a train, in a co-working space with flaky wifi, or shipping to a kiosk device, local STT is the difference between "works" and "doesn't."
Why "Why Whisper + Claude" beats "How to use SpeechRecognition" in 2026
Three reasons, plain:
- Whisper is open source and ships forward. OpenAI keeps releasing better Whisper models. SpeechRecognition is a wrapper around endpoints that may vanish.
- Claude actually reasons. A voice assistant that only transcribes is a dictation tool. A voice assistant that can answer "what's on my calendar tomorrow and should I move the 3pm?" is an assistant. That requires a real LLM.
- You own the stack. Local STT plus a single API key beats four cloud dependencies. Fewer points of failure, fewer Terms of Service to read.
What to build next
The script above is a loop. To make it useful you want:
- Wake word so it doesn't record every 5 seconds. Use `pvporcupine` (free tier); there's a sketch after this list.
- Tool use so Claude can read your calendar, send messages, search files. Claude's tool use API is built for this; sketched below.
- Streaming TTS so the reply starts speaking before Claude finishes generating. ElevenLabs has a streaming endpoint; the last sketch below shows the pattern locally with `say`.
Each of those is a weekend. Together they're a real product.
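First, the wake-word gate, sketched with `pvporcupine` and its companion `pvrecorder` package (`pip install pvporcupine pvrecorder`). It assumes a free Picovoice AccessKey in a `PICOVOICE_ACCESS_KEY` environment variable; "porcupine" is one of the built-in keywords:

```python
import os

import pvporcupine
from pvrecorder import PvRecorder

def wait_for_wake_word(keyword="porcupine"):
    porcupine = pvporcupine.create(
        access_key=os.environ["PICOVOICE_ACCESS_KEY"],
        keywords=[keyword],
    )
    recorder = PvRecorder(frame_length=porcupine.frame_length)
    recorder.start()
    try:
        while True:
            pcm = recorder.read()            # one frame of 16-bit samples
            if porcupine.process(pcm) >= 0:  # >= 0 means the keyword fired
                return
    finally:
        recorder.stop()
        recorder.delete()
        porcupine.delete()
```

Call `wait_for_wake_word()` before `record()` in the main loop and the assistant only listens after you say its name.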
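Tool use follows Anthropic's standard loop: declare tools, and when Claude stops with `stop_reason == "tool_use"`, run the tool and feed the result back. A sketch with a single hypothetical `get_calendar` stub (the tool name, schema, and stub data are mine, not a real calendar integration):

```python
import json

TOOLS = [{
    "name": "get_calendar",
    "description": "Return the user's calendar events for a given day.",
    "input_schema": {
        "type": "object",
        "properties": {"day": {"type": "string", "description": "e.g. 'tomorrow'"}},
        "required": ["day"],
    },
}]

def get_calendar(day):
    return json.dumps([{"time": "15:00", "title": "Design review"}])  # stub data

def think_with_tools(user_text):
    messages = [{"role": "user", "content": user_text}]
    msg = client.messages.create(
        model="claude-haiku-4-5", max_tokens=300, tools=TOOLS, messages=messages,
    )
    while msg.stop_reason == "tool_use":
        tool_use = next(b for b in msg.content if b.type == "tool_use")
        result = get_calendar(**tool_use.input)  # single-tool dispatch
        messages.append({"role": "assistant", "content": msg.content})
        messages.append({"role": "user", "content": [{
            "type": "tool_result", "tool_use_id": tool_use.id, "content": result,
        }]})
        msg = client.messages.create(
            model="claude-haiku-4-5", max_tokens=300, tools=TOOLS, messages=messages,
        )
    return msg.content[0].text
```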
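And streaming: the cheap local version is to stream Claude's tokens and flush each complete sentence to the TTS as it arrives, so audio starts before generation finishes. A sketch using the Anthropic streaming API and the `speak()` helper from the main script; in production you'd swap `speak` for ElevenLabs' streaming endpoint:

```python
import re

def think_and_speak_streaming(user_text):
    buffer = ""
    with client.messages.stream(
        model="claude-haiku-4-5",
        max_tokens=300,
        system="You are a concise voice assistant. Reply in 1-2 sentences.",
        messages=[{"role": "user", "content": user_text}],
    ) as stream:
        for chunk in stream.text_stream:
            buffer += chunk
            # flush each complete sentence to the TTS as soon as it appears
            while (m := re.search(r"[.!?]\s", buffer)):
                speak(buffer[:m.end()].strip())
                buffer = buffer[m.end():]
    if buffer.strip():
        speak(buffer.strip())  # whatever trails the last sentence boundary
```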
If you want the full agent playbook
I packaged the 10 agents from Part 1, the voice assistant from Part 2, and the patterns above into a cookbook with copy-paste recipes, system prompts, and the orchestration code that ties them together.
Agent Cookbook on Gumroad — $19
It's this post with nothing cut: full code, full prompts, full orchestrator.
Next in the series: Part 4 will cover giving the assistant tool use so it can actually do things, not just talk.