I wanted an AI I could talk to like a phone call — press a button, speak Mandarin, hear it answer back in 2 seconds. No ChatGPT Voice paywall. No LangChain. No "hosted agent" SaaS that locks my conversation history behind someone else's dashboard.
So I built it in 150 lines of Python. Round-trip latency: 2.4 seconds. Monthly cost: ~$0.
This is part 2 of my pure Python AI series. Part 1 was 10 agents in stdlib. This one is the voice stack.
The pain
Every "AI voice app" on the market is broken in at least one of these ways:
- ChatGPT Voice — gated behind Plus, no API for the realtime voice mode that actually feels like a phone call, and English-first.
- ElevenLabs / PlayHT — billed per character, the free tier evaporates in a week, and you're locked into their TTS.
- Vapi / Retell / Bland — hosted agent platforms, $0.05–$0.15 per minute, your transcripts live on their servers.
- Most open-source voice repos — wired together with LangChain, 14 abstractions deep, breaks the moment a provider deprecates an endpoint.
I'm in Taiwan. I want Traditional Chinese in, Traditional Chinese out, low latency, no monthly bill, and code I can actually read on a Sunday afternoon.
So: stdlib plus the tiny edge-tts package, a whisper.cpp subprocess, and one HTTP endpoint.
The architecture
```
iPhone (Shortcut: record audio + POST)
          │  multipart/form-data: audio.wav + convo_id
          ▼
┌────────────────────────────────────────────┐
│ Mac server :8765 (http.server, stdlib)     │
│                                            │
│ 1. Whisper STT (whisper.cpp / ggml-base)   │
│         │ text                             │
│         ▼                                  │
│ 2. LLM router (NIM → Gemini → Claude)      │
│         │ reply text                       │
│         ▼                                  │
│ 3. Edge TTS (zh-TW-HsiaoChenNeural)        │
│         │ mp3 bytes                        │
└─────────┼──────────────────────────────────┘
          ▼
    audio/mpeg response → iPhone plays
```
Three local processes, one HTTP handler, one in-memory dict for conversation state. That's the whole thing.
The pipeline (3 snippets that matter)
1. STT via whisper.cpp subprocess
No openai-whisper Python wheel, no PyTorch, no 4 GB of CUDA junk. Just the C++ binary.
```python
import subprocess, tempfile, os

def transcribe(wav_bytes: bytes) -> str:
    # whisper-cli wants a file path, so dump the upload to a temp WAV
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(wav_bytes)
        path = f.name
    try:
        subprocess.run(
            [
                "whisper-cli",
                "-m", "models/ggml-base.bin",
                "-l", "zh",
                "-f", path,
                "-otxt", "-of", path,  # transcript lands in {path}.txt
                "--no-prints",
            ],
            capture_output=True, text=True, timeout=30,
            check=True,  # fail loudly instead of reading a missing .txt
        )
        with open(path + ".txt") as t:
            return t.read().strip()
    finally:
        for p in (path, path + ".txt"):
            if os.path.exists(p):
                os.unlink(p)
```
ggml-base.bin is 142 MB. On an M-series Mac it transcribes a 5-second clip in ~600ms. Good enough.
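If you want to sanity-check that latency on your own machine, a tiny timing harness around `transcribe` is enough. The clip path below is a placeholder; whisper-cli expects a 16-bit WAV:

```python
import time

wav = open("clips/test.wav", "rb").read()  # hypothetical ~5-second test clip

t0 = time.perf_counter()
text = transcribe(wav)
print(f"{text!r} in {time.perf_counter() - t0:.2f}s")
```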
2. LLM router with three-tier fallback
A single provider going down should not kill the call. Try the free fast one first, fall back to the free smart one, and keep the paid one as a last resort.
```python
import os, json, urllib.request

# (name, endpoint, model, api_key) in priority order: free/fast first,
# free/smart second, paid last. A missing key just fails over at call time.
PROVIDERS = [
    ("nim", "https://integrate.api.nvidia.com/v1/chat/completions",
     "meta/llama-4-maverick-17b-128e-instruct",
     os.environ.get("NIM_API_KEY", "")),
    ("gemini", "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions",
     "gemini-2.5-pro",
     os.environ.get("GEMINI_API_KEY", "")),
    ("claude", "https://api.anthropic.com/v1/messages",
     "claude-haiku-4-5",
     os.environ.get("ANTHROPIC_API_KEY", "")),
]

def llm(messages: list[dict]) -> str:
    for name, url, model, key in PROVIDERS:
        try:
            if name == "claude":
                # Anthropic's Messages API has a different shape; see below
                return _call_anthropic(url, key, model, messages)
            body = json.dumps({"model": model, "messages": messages,
                               "temperature": 0.6, "max_tokens": 400}).encode()
            req = urllib.request.Request(url, data=body, headers={
                "Authorization": f"Bearer {key}",
                "Content-Type": "application/json",
            })
            with urllib.request.urlopen(req, timeout=8) as r:
                data = json.loads(r.read())
            return data["choices"][0]["message"]["content"].strip()
        except Exception as e:
            print(f"[llm] {name} failed: {e}")
            continue
    return "(系統忙線中，請稍後再說一次)"  # "System busy, please try again in a moment"
```
NVIDIA NIM gives free inference on Llama 4 Maverick at decent rate limits. Gemini Pro free tier covers the rest. Claude is the "I'm willing to pay 3 cents to not embarrass myself" tier.
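The router calls `_call_anthropic`, which I didn't show inline. A minimal sketch of what it can look like: Anthropic's Messages API takes the system prompt as a top-level `system` field rather than a message, and authenticates with `x-api-key` plus an `anthropic-version` header. The 400-token cap mirrors the OpenAI-style call above.

```python
import json, urllib.request

def _call_anthropic(url: str, key: str, model: str, messages: list[dict]) -> str:
    # Anthropic rejects "system" entries inside messages, so hoist them out
    system = " ".join(m["content"] for m in messages if m["role"] == "system")
    turns = [m for m in messages if m["role"] != "system"]
    body = json.dumps({"model": model, "system": system,
                       "messages": turns, "max_tokens": 400}).encode()
    req = urllib.request.Request(url, data=body, headers={
        "x-api-key": key,
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req, timeout=8) as r:
        data = json.loads(r.read())
    return data["content"][0]["text"].strip()
```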
3. Conversation memory + the HTTP handler
The whole point of "phone-call-style" is multi-turn. Each iPhone request carries a convo_id; we keep history in a dict.
```python
import edge_tts, asyncio
from http.server import BaseHTTPRequestHandler, HTTPServer
import cgi, io  # note: cgi was removed in Python 3.13, so this needs <= 3.12

HISTORY: dict[str, list[dict]] = {}
SYSTEM = {"role": "system",
          # "You are a Traditional-Chinese-speaking voice assistant; keep answers
          #  short, conversational, natural, under 80 characters."
          "content": "你是一個說繁體中文的語音助理，回答簡短、口語、自然，不超過 80 字。"}

async def synth(text: str) -> bytes:
    # edge-tts streams MP3 frames, hence the audio/mpeg response below
    buf = io.BytesIO()
    comm = edge_tts.Communicate(text, voice="zh-TW-HsiaoChenNeural")
    async for chunk in comm.stream():
        if chunk["type"] == "audio":
            buf.write(chunk["data"])
    return buf.getvalue()

class Voice(BaseHTTPRequestHandler):
    def do_POST(self):
        form = cgi.FieldStorage(fp=self.rfile, headers=self.headers,
                                environ={"REQUEST_METHOD": "POST"})
        convo_id = form.getvalue("convo_id", "default")
        wav = form["audio"].file.read()
        user_text = transcribe(wav)
        hist = HISTORY.setdefault(convo_id, [SYSTEM])
        hist.append({"role": "user", "content": user_text})
        reply = llm(hist)
        hist.append({"role": "assistant", "content": reply})
        # cap at system prompt + last 10 turns (a plain hist[-21:] would
        # eventually slice the system message off the front)
        HISTORY[convo_id] = [hist[0]] + hist[1:][-20:]
        audio = asyncio.run(synth(reply))
        self.send_response(200)
        self.send_header("Content-Type", "audio/mpeg")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)

HTTPServer(("0.0.0.0", 8765), Voice).serve_forever()
```
iPhone Shortcut: record → POST https://your-tunnel/voice with audio and convo_id → play the response. Done. No app, no TestFlight.
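Before touching the Shortcut, you can exercise the endpoint from the Mac itself. A hand-rolled multipart POST keeps even the test client stdlib-only; clip.wav, the kitchen convo_id, and the localhost URL are placeholders:

```python
import urllib.request, uuid

def post_clip(path: str, convo_id: str = "kitchen") -> bytes:
    boundary = uuid.uuid4().hex
    with open(path, "rb") as f:
        wav = f.read()
    # minimal multipart/form-data body: one text field, one file field
    parts = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="convo_id"\r\n'
        f"\r\n{convo_id}\r\n"
        f'--{boundary}\r\nContent-Disposition: form-data; name="audio"; '
        f'filename="audio.wav"\r\nContent-Type: audio/wav\r\n\r\n'
    ).encode() + wav + f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(
        "http://localhost:8765/voice", data=parts,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"})
    with urllib.request.urlopen(req, timeout=60) as r:
        return r.read()

open("reply.mp3", "wb").write(post_clip("clip.wav"))
```

`afplay reply.mp3` closes the loop without a phone in the room.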
The cost reality
Measured over a week of personal use, ~30 conversations/day, 4–6 turns each:
| Component | Provider | Cost |
|---|---|---|
| STT | whisper.cpp local | $0 |
| LLM (95% of calls) | NVIDIA NIM Llama 4 Maverick | $0 (free tier) |
| LLM (4% fallback) | Gemini 2.5 Pro | $0 (free tier) |
| LLM (1% fallback) | Claude Haiku 4.5 | ~$0.04/week |
| TTS | Microsoft Edge edge-tts | $0 |
| Hosting | Mac mini at home + Cloudflare Tunnel | $0 |
| **Total** | | **~$0.16/month** |
Compare to Vapi at $0.10/minute: the same usage works out to roughly 15 minutes of talk a day, or ~$45/month. The home-server tax is electricity.
Why no LangChain
I wrote the whole thing in one sitting because every component is the dumbest possible version of itself: a subprocess, an HTTP POST, a dict, an async TTS stream. When NIM changes their model name next month, I edit one tuple. When Edge TTS gets blocked, I drop in Piper without rewriting a "Chain".
Boring is good. Boring code survives provider deprecations. The day a framework abstracts your requests.post into a RunnableLambda, you've signed up to debug someone else's metaphor at 2 AM.
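For scale, here's roughly what the Piper swap costs. A sketch under assumptions: the `--model` and `--output_file` flags follow Piper's README, the voice model path is a placeholder, and you'd flip the response Content-Type back to audio/wav:

```python
import subprocess, tempfile, os

def synth_piper(text: str) -> bytes:
    # Piper reads text on stdin and writes a WAV; fully offline, no Microsoft
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        out = f.name
    try:
        subprocess.run(
            ["piper", "--model", "models/zh-voice.onnx", "--output_file", out],
            input=text.encode(), timeout=30, check=True,
        )
        with open(out, "rb") as f:
            return f.read()
    finally:
        if os.path.exists(out):
            os.unlink(out)
```

It's a plain function rather than a coroutine, so the handler would call it directly instead of through `asyncio.run`.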
Try it yourself
Full source plus the iPhone Shortcut JSON, the Cloudflare Tunnel config, the Asterisk AGI bridge (so the same backend can answer real SIP phone calls), and 9 more pure-Python AI recipes — packaged as Agent Cookbook for $19, one-time, lifetime updates.
https://vampireheart3.gumroad.com/l/agent-cookbook
If you build something with it, leave a comment — I want to hear what your voice agent sounds like.
— BFO
Part 1: Building 10 AI agents in pure Python (no LangChain)
Part 2 (this one): Phone-call-style voice assistant
Part 3 (next): Giving the assistant tools — real function-calling without an SDK