I built an agent I could talk to while cooking.
"Hey, what's the status of the deployment?"
"Deployment completed 10 minutes ago. All 47 tests passed. The API is live."
No keyboard. No screen. Just voice.
Here's how to build one.
Why voice?
Typing requires:
- Free hands
- Eyes on screen
- Full attention
Voice works when:
- Your hands are busy (cooking, driving, building)
- You're away from your desk
- You want quick answers, not long sessions
- Typing isn't accessible to you
Different interface, different design.
The architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Speech    │     │    Agent    │     │    Text     │
│   to Text   │────▶│    Loop     │────▶│  to Speech  │
│    (STT)    │     │             │     │    (TTS)    │
└─────────────┘     └─────────────┘     └─────────────┘
       ▲                   │                   │
       │                   ▼                   ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Microphone  │     │    Tools    │     │   Speaker   │
└─────────────┘     └─────────────┘     └─────────────┘
Same agent loop you know. Just different I/O.
Basic implementation
import openai
import subprocess

client = openai.OpenAI()

def speech_to_text(audio_file):
    """Convert speech to text using Whisper"""
    with open(audio_file, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )
    return transcript.text

def text_to_speech(text, output_file="response.mp3"):
    """Convert text to speech"""
    response = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=text
    )
    response.stream_to_file(output_file)
    return output_file

def play_audio(file_path):
    """Play audio file"""
    subprocess.run(["afplay", file_path])  # macOS
    # subprocess.run(["aplay", file_path])  # Linux
    # subprocess.run(["start", file_path], shell=True)  # Windows
def agent_respond(user_text, tools):
    """Standard agent loop"""
    messages = [
        {"role": "system", "content": VOICE_SYSTEM_PROMPT},
        {"role": "user", "content": user_text}
    ]

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools
        )
        message = response.choices[0].message

        if not message.tool_calls:
            return message.content

        # Record the assistant turn, then execute its tool calls
        messages.append(message)
        for tool_call in message.tool_calls:
            result = execute_tool(tool_call)  # your tool dispatcher
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })
def voice_agent_loop():
    """Main voice loop"""
    while True:
        print("Listening...")
        audio_file = record_audio()  # Record until silence

        user_text = speech_to_text(audio_file)
        print(f"You: {user_text}")

        if "goodbye" in user_text.lower() or "exit" in user_text.lower():
            play_audio(text_to_speech("Goodbye!"))
            break

        response_text = agent_respond(user_text, tools)
        print(f"Agent: {response_text}")

        audio_response = text_to_speech(response_text)
        play_audio(audio_response)
The voice system prompt
Voice needs different instructions:
VOICE_SYSTEM_PROMPT = """
You are a voice-controlled coding assistant.
IMPORTANT - Your responses will be spoken aloud:
- Keep responses SHORT (1-3 sentences)
- Don't use markdown, code blocks, or formatting
- Don't use bullet points or numbered lists
- Speak naturally, like a conversation
- Don't spell out URLs or file paths character by character
- Round numbers ("about 50 files" not "47 files")
- Say "check mark" not "✓"
When reporting code or technical details:
- Summarize, don't read code verbatim
- "The function looks correct" not "def calculate_total..."
- "Line 47 has a syntax error" not the full line content
- For file contents, describe what you found
If the user needs to see details, say:
"I found the issue. Check your terminal for details."
Then output details to a file or clipboard.
"""
Compare:
Text agent:
"Here are the failing tests:
- test_login_valid: AssertionError on line 23
- test_logout: TimeoutError after 30s
- test_session: KeyError 'user_id'"
Voice agent:
"Three tests are failing. The login test has an assertion error,
logout is timing out, and there's a missing user ID in the session test.
Want me to look at any of them?"
Same information, different delivery.
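Even with these instructions, models occasionally slip markdown into a reply. A small sanitizer before TTS is cheap insurance. A minimal sketch (the patterns are illustrative, not exhaustive):

import re

def clean_for_speech(text):
    """Strip formatting the model may emit despite the prompt."""
    text = re.sub(r"```.*?```", " code omitted. ", text, flags=re.DOTALL)  # fenced blocks
    text = re.sub(r"[*_`#]+", "", text)  # emphasis, inline code, heading marks
    text = re.sub(r"^\s*-\s+", "", text, flags=re.MULTILINE)  # bullet markers
    return re.sub(r"[ \t]+", " ", text).strip()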
Handling latency
Voice users expect instant responses. But STT → Agent → TTS takes time.
Strategy 1: Acknowledgment sounds
def voice_agent_loop():
    while True:
        audio_file = record_audio()

        # Immediate acknowledgment (must not block)
        play_audio_async("sounds/thinking.mp3")  # Short "hmm" sound

        user_text = speech_to_text(audio_file)
        response_text = agent_respond(user_text, tools)

        # Cancel thinking sound if still playing
        stop_audio()

        audio_response = text_to_speech(response_text)
        play_audio(audio_response)
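play_audio_async and stop_audio aren't built-ins, and the subprocess.run version of play_audio above blocks until the sound finishes. A minimal sketch of a non-blocking pair using Popen (afplay is macOS; swap in aplay or ffplay elsewhere):

import subprocess

_current_playback = None

def play_audio_async(file_path):
    """Start playback and return immediately."""
    global _current_playback
    _current_playback = subprocess.Popen(["afplay", file_path])

def stop_audio():
    """Stop the current sound, if it's still playing."""
    global _current_playback
    if _current_playback and _current_playback.poll() is None:
        _current_playback.terminate()
    _current_playback = None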
Strategy 2: Streaming TTS
def stream_response(text):
    """Start speaking before the full response has been synthesized"""
    sentences = split_into_sentences(text)
    for sentence in sentences:
        audio = text_to_speech(sentence)
        play_audio(audio)  # first audio starts after one sentence, not the whole reply
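That version only improves time to first audio; each TTS request still waits for the previous playback to finish. To actually overlap synthesis and playback, generate in a background thread and feed a queue. A sketch, including a naive sentence splitter (both helpers are assumptions, not part of any SDK):

import queue
import re
import threading

def split_into_sentences(text):
    """Naive splitter; good enough for TTS chunking."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def stream_response_pipelined(text):
    """Synthesize the next sentence while the current one plays."""
    audio_queue = queue.Queue()

    def producer():
        for i, sentence in enumerate(split_into_sentences(text)):
            audio_queue.put(text_to_speech(sentence, f"chunk_{i}.mp3"))
        audio_queue.put(None)  # sentinel: nothing left to play

    threading.Thread(target=producer, daemon=True).start()
    while (audio_file := audio_queue.get()) is not None:
        play_audio(audio_file)  # blocks, but synthesis continues in the background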
Strategy 3: Chunked responses
VOICE_SYSTEM_PROMPT += """
For complex answers, give a quick summary first, then ask if they want details.
Good: "The build failed. Type error in the auth module. Want me to explain?"
Bad: "The build failed because on line 47 of auth.ts there's a type error where..."
"""
Quick answer first, details on request.
Tools for voice
Some tools work great for voice. Others don't.
Voice-friendly tools
# Good for voice - returns simple status
- name: check_status
description: Check if services are running
# Returns: "API is up, database is up, cache is down"
- name: run_tests
description: Run tests and report pass/fail count
# Returns: "23 passed, 2 failed"
- name: deploy_status
description: Check deployment status
# Returns: "Deployed 10 minutes ago, healthy"
- name: count_files
description: Count files matching pattern
# Returns: "47 Python files"
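As a concrete example, run_tests can collapse a test run into one speakable line. A sketch that parses pytest's summary line (assumes pytest; adjust the parsing for your runner):

import re
import subprocess

def run_tests():
    """Run pytest and return a short, speakable summary."""
    result = subprocess.run(["pytest", "-q", "--tb=no"], capture_output=True, text=True)
    lines = result.stdout.strip().splitlines()
    summary = lines[-1] if lines else ""  # e.g. "2 failed, 23 passed in 4.12s"
    passed = re.search(r"(\d+) passed", summary)
    failed = re.search(r"(\d+) failed", summary)
    return f"{passed.group(1) if passed else 0} passed, {failed.group(1) if failed else 0} failed"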
Voice-unfriendly tools (need adaptation)
# Bad for voice - returns too much text
- name: read_file
# Returns 500 lines of code - can't speak that
# Adapted version
- name: summarize_file
description: Describe what a file does
# Returns: "This is the main entry point. It sets up the server and routes."
# Bad for voice - returns structured data
- name: git_log
# Returns commit hashes, dates, messages
# Adapted version
- name: recent_changes
description: Summarize recent git activity
# Returns: "3 commits today. Last one fixed the login bug."
Voice-specific tools
- name: save_to_clipboard
description: Save detailed output to clipboard so user can paste later
# Voice: "Saved to your clipboard"
- name: open_file
description: Open a file in the default editor
# Voice: "Opened auth.ts in your editor"
- name: send_details
description: Send detailed output to terminal/file for later review
# Voice: "Details are in your terminal"
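save_to_clipboard is only a few lines on most systems. A sketch (pbcopy is macOS; xclip is one common Linux option):

import subprocess

def save_to_clipboard(text):
    """Copy detailed output to the clipboard, return a speakable confirmation."""
    subprocess.run(["pbcopy"], input=text.encode())  # macOS
    # subprocess.run(["xclip", "-selection", "clipboard"], input=text.encode())  # Linux
    return "Saved to your clipboard"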
Handling transcription errors
Speech-to-text isn't perfect. Design for errors.
Common mishears
import re

CORRECTIONS = {
    "cash": "cache",
    "get": "git",
    "jason": "JSON",
    "sequel": "SQL",
    "no js": "Node.js",
    "pie test": "pytest",
    "just": "Jest",
    "doctor": "Docker",
    "cube control": "kubectl",
}

def fix_transcription(text):
    for wrong, right in CORRECTIONS.items():
        # Match whole words only, so "get" doesn't rewrite "together"
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text
Confirmation for destructive actions
def agent_respond(user_text, tools):
    response = get_agent_response(user_text, tools)

    # Always confirm destructive actions by voice
    if needs_voice_confirmation(response):
        return f"I'm about to {response['action']}. Say 'yes' to confirm or 'no' to cancel."

    return response

def needs_voice_confirmation(response):
    destructive = ["delete", "remove", "drop", "reset", "push --force"]
    return any(d in str(response).lower() for d in destructive)
Mishearing "delete the log" as "delete the blog" would be bad.
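The missing piece is handling the spoken "yes" or "no" on the next turn. One way is to hold the pending action between utterances (a sketch; execute_action and get_agent_response are placeholders for your own dispatcher and agent call):

pending_action = None  # action waiting on a spoken yes/no

def handle_utterance(user_text, tools):
    """Route each utterance through the confirmation gate."""
    global pending_action
    if pending_action is not None:
        action, pending_action = pending_action, None
        if "yes" in user_text.lower():
            return execute_action(action)
        return "Okay, cancelled."

    response = get_agent_response(user_text, tools)
    if needs_voice_confirmation(response):
        pending_action = response
        return f"I'm about to {response['action']}. Say yes to confirm or no to cancel."
    return response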
Ask for clarification
VOICE_SYSTEM_PROMPT += """
If the request is ambiguous or you're unsure what was said, ask for clarification.
Good: "Did you say 'cache' or 'catch'?"
Good: "Delete what files? The logs or the tests?"
Bad: *assumes and deletes wrong thing*
"""
Wake word detection
For hands-free, always-on mode:
import pvporcupine
import pyaudio
import struct

def listen_for_wake_word():
    """Listen for the 'computer' wake word"""
    # Recent Porcupine versions also require access_key=... from the Picovoice console
    porcupine = pvporcupine.create(keywords=["computer"])
    pa = pyaudio.PyAudio()
    stream = pa.open(
        rate=porcupine.sample_rate,
        channels=1,
        format=pyaudio.paInt16,
        input=True,
        frames_per_buffer=porcupine.frame_length
    )

    print("Listening for wake word...")
    while True:
        pcm = stream.read(porcupine.frame_length)
        pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)
        if porcupine.process(pcm) >= 0:
            print("Wake word detected!")
            return True

def always_on_voice_agent():
    """Always listening, activates on wake word"""
    while True:
        listen_for_wake_word()
        play_audio("sounds/ready.mp3")

        audio = record_until_silence()
        response = process_voice_command(audio)
        play_audio(response)
        # Go back to listening for wake word
Now you can say "Hey computer, run the tests" from across the room.
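record_until_silence (the same idea behind record_audio's "record until silence" earlier) can be a simple energy threshold over microphone chunks. A sketch; the threshold needs tuning per microphone, and note that audioop was removed in Python 3.13 (the audioop-lts backport restores it):

import audioop
import pyaudio

def record_until_silence(threshold=500, max_silent_chunks=30):
    """Record mono 16 kHz audio until roughly two seconds of silence."""
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=1024)
    frames, silent = [], 0
    while silent < max_silent_chunks:  # 30 chunks of 1024 samples ≈ 2s at 16 kHz
        chunk = stream.read(1024)
        frames.append(chunk)
        rms = audioop.rms(chunk, 2)  # loudness of this chunk (2 = 16-bit samples)
        silent = silent + 1 if rms < threshold else 0
    stream.stop_stream()
    stream.close()
    p.terminate()
    return b"".join(frames)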
Conversation context
Voice conversations are short but need context:
class VoiceAgent:
    def __init__(self):
        self.context = []
        self.max_context = 5  # Keep last 5 exchanges

    def respond(self, user_text):
        self.context.append({"role": "user", "content": user_text})

        # Trim to the last max_context exchanges (user + assistant pairs)
        if len(self.context) > self.max_context * 2:
            self.context = self.context[-self.max_context * 2:]

        response = self.get_response(self.context)
        self.context.append({"role": "assistant", "content": response})
        return response
Allows follow-ups:
User: "Run the tests"
Agent: "2 tests failed. The login tests."
User: "What's wrong with them?" # References previous
Agent: "The login test expects a 200 but gets 401. Looks like auth is failing."
User: "Fix it" # Still in context
Agent: "Fixed. The API key was missing from the test environment."
Full working example
import io
import subprocess
import wave

import openai
import pyaudio

client = openai.OpenAI()
# Tools simplified for voice
tools = [
    {
        "type": "function",
        "function": {
            "name": "check_tests",
            "description": "Run tests and report results",
            "parameters": {}
        }
    },
    {
        "type": "function",
        "function": {
            "name": "check_status",
            "description": "Check if the app is running",
            "parameters": {}
        }
    },
    {
        "type": "function",
        "function": {
            "name": "recent_changes",
            "description": "What changed recently in git",
            "parameters": {}
        }
    }
]
SYSTEM = """
You are a voice assistant for developers.
Keep responses under 2 sentences.
Be conversational and concise.
"""
def record_audio(duration=5):
    """Record audio from microphone"""
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=1024)

    print("🎤 Recording...")
    frames = []
    for _ in range(0, int(16000 / 1024 * duration)):
        frames.append(stream.read(1024))

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Save to an in-memory WAV buffer
    buffer = io.BytesIO()
    wf = wave.open(buffer, 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b''.join(frames))
    wf.close()

    buffer.seek(0)
    buffer.name = "audio.wav"  # the API infers the format from the name
    return buffer
def main():
    print("Voice Agent Ready. Speak after the beep.")
    context = []

    while True:
        audio = record_audio()

        # STT
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio
        ).text
        print(f"You: {transcript}")

        if "goodbye" in transcript.lower():
            break

        # Agent (tool calls aren't executed here; see the note after this example)
        context.append({"role": "user", "content": transcript})
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": SYSTEM}] + context,
            tools=tools
        )
        reply = response.choices[0].message.content or "Done."
        context.append({"role": "assistant", "content": reply})
        print(f"Agent: {reply}")

        # TTS
        speech = client.audio.speech.create(
            model="tts-1",
            voice="nova",
            input=reply
        )
        speech.stream_to_file("response.mp3")
        subprocess.run(["afplay", "response.mp3"])  # macOS

if __name__ == "__main__":
    main()
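One simplification above: when the model decides to call a tool, the loop just says "Done." instead of running anything. Wiring in execution looks like this (a sketch; the handlers mirror the quick shell checks from earlier):

def execute_tool(name):
    """Dispatch a tool call to a quick command, return speakable text."""
    if name == "check_status":
        ok = subprocess.run(["curl", "-s", "localhost:3000/health"],
                            capture_output=True).returncode == 0
        return "App is running" if ok else "App is down"
    if name == "recent_changes":
        last = subprocess.run(["git", "log", "-1", "--format=%s"],
                              capture_output=True, text=True).stdout.strip()
        return f"Last commit: {last}"
    return "That tool isn't implemented yet"

def run_with_tools(context):
    """Loop until the model answers in plain text instead of a tool call."""
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": SYSTEM}] + context,
            tools=tools,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content
        context.append(message)  # the assistant turn that requested the tools
        for call in message.tool_calls:
            context.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": execute_tool(call.function.name),
            })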
Voice + Gantz
You can build voice-controlled tools with Gantz Run:
# gantz.yaml - Voice-friendly tools
tools:
  - name: status
    description: Quick status check. Returns one sentence.
    script:
      shell: |
        if curl -s localhost:3000/health > /dev/null; then
          echo "App is running"
        else
          echo "App is down"
        fi

  - name: tests
    description: Run tests. Returns pass/fail summary.
    script:
      shell: |
        result=$(npm test 2>&1)
        passed=$(echo "$result" | grep -c "✓" || true)
        failed=$(echo "$result" | grep -c "✗" || true)
        echo "$passed passed, $failed failed"

  - name: changes
    description: Recent git changes. Returns brief summary.
    script:
      shell: |
        count=$(git log --oneline --since="24 hours ago" | wc -l)
        last=$(git log -1 --format="%s")
        echo "$count commits today. Last one: $last"
Short outputs designed for voice.
Summary
Voice agents need different design:
| Aspect | Text Agent | Voice Agent |
|---|---|---|
| Response length | Any length | 1-3 sentences |
| Formatting | Markdown, code blocks | Plain speech |
| Numbers | Exact (47 files) | Rounded (about 50) |
| Tool output | Full details | Summaries |
| Confirmation | Type 'yes' | Say 'yes' |
| Errors | Show stack trace | "Something went wrong with auth" |
| Details | Inline | "Check your terminal" |
The same agent loop, different I/O, different UX.
Talk to your code.
Have you built a voice interface for development tools? What worked?