Ever wanted to just talk to your computer and have it actually do something useful: create files, write code, summarize text? That's exactly what I built for this project.
This article covers the architecture, the models I picked, the challenges I hit, and the lessons learned.
## Architecture
The system has two parts talking over HTTP:
- FastAPI backend – handles all AI inference and file operations
- Streamlit frontend – handles audio input and displays results
Every request goes through three stages:

```
Audio Input
    ↓
[STT] Groq Whisper-large-v3 → transcribed text
    ↓
[Intent] Groq Llama-3.1-8b → JSON task list
    ↓
[Execute] Local tools → create file / write code / summarize / chat
    ↓
Display result in UI
```
Keeping the backend and frontend separate means I can swap out the UI without touching any AI logic.
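The three-stage flow above can be sketched as a single handler that chains the stages. The function names here are hypothetical stand-ins, and the stage bodies are stubs where the real backend would call Groq and the local tools:

```python
# Hypothetical sketch of the STT -> intent -> execute pipeline.
# Stage functions are stubs standing in for the real Groq calls.

def transcribe(audio_bytes: bytes) -> str:
    # Stub for the Whisper speech-to-text call
    return "create a file named notes.txt"

def classify(text: str) -> dict:
    # Stub for the Llama intent-classification call
    return {"tasks": [{"intent": "create_file",
                       "parameters": {"filename": "notes.txt", "content": ""},
                       "confidence": 0.9}]}

def execute(task: dict) -> str:
    # Stub for the local tool dispatch
    return f"ran {task['intent']}"

def handle_request(audio_bytes: bytes) -> list[str]:
    text = transcribe(audio_bytes)
    plan = classify(text)
    return [execute(task) for task in plan["tasks"]]
```

Because each stage only consumes the previous stage's output, any one of them can be swapped (e.g. local Whisper instead of Groq) without touching the others.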
## Models I Used
| Stage | Model | Why |
|---|---|---|
| Speech-to-Text | whisper-large-v3 (Groq) | Best open STT model, fast via Groq |
| Intent Classification | llama-3.1-8b-instant | Small, fast, reliable at JSON output |
| Code Generation | llama-3.1-8b-instant | Fast enough for short scripts |
| Summarization | llama-3.1-8b-instant | Better quality, acceptable latency |
### ⚠️ Why Groq API instead of local Whisper?
The assignment recommended running Whisper locally via HuggingFace. However, whisper-large-v3 needs at least 6GB of GPU VRAM to run at a usable speed. On CPU it takes 30–60 seconds per clip – far too slow for an interactive UI.
Groq runs the exact same model on their hardware, returning results in ~700 ms. The model is identical; only the compute location differs.
## Intent Classification – The Tricky Part
Getting the LLM to output clean, parseable JSON every time was harder than expected. Language models naturally want to add explanations and wrap things in markdown. Both of those break `json.loads()`.
The fix was a very strict system prompt:
```python
SYSTEM_PROMPT = """
You are a strict JSON routing agent.
Return ONLY valid JSON. No explanation. No markdown. No extra text.

Available intents:
- create_file → { filename, content }
- write_code → { filename, language, description }
- summarize → { text, save_to }
- chat → { message }

Always return: { "tasks": [ ...task objects... ] }
Each task: { "intent", "parameters", "confidence" }

Multiple commands → multiple tasks in the list.
If unclear → default to "chat".
"""
```
Supporting a tasks array from day one is what enables compound commands – the model naturally puts two intents in the list when the user says "write a file and summarize it."
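Even with the strict prompt, the response still has to survive `json.loads()`. A minimal sketch of a defensive parser (the function name `parse_intent` is my own for illustration) that strips markdown fences and falls back to the `chat` intent on failure might look like this:

```python
import json

def parse_intent(raw: str) -> dict:
    """Strip markdown fences, then parse; fall back to chat on failure."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the closing fence
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit("```", 1)[0]
    try:
        data = json.loads(cleaned)
        if isinstance(data, dict) and "tasks" in data:
            return data
    except json.JSONDecodeError:
        pass
    # Catch-all: treat the whole utterance as a chat message
    return {"tasks": [{"intent": "chat",
                       "parameters": {"message": raw},
                       "confidence": 0.0}]}
```

This way a compound command yields a multi-entry `tasks` list, and garbage output degrades gracefully to chat instead of crashing the request.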
## File Safety
Since the system writes files based on user voice input, path traversal is a real concern. The fix: a sandboxing function that resolves the absolute path and rejects anything outside `output/`.
```python
import os

OUTPUT_DIR = os.path.abspath("output")

def _safe_path(filename: str) -> str | None:
    target = os.path.abspath(os.path.join(OUTPUT_DIR, filename))
    # Trailing os.sep is critical – without it, "output_evil/" would pass
    if not target.startswith(OUTPUT_DIR + os.sep):
        return None
    return target
```
All generated files go into output/. Nothing else is writable.
## Session Memory
The agent keeps two parallel histories:
- Action history – timestamped log shown in the UI sidebar
- Chat context – last 3 user/assistant pairs sent to the LLM on every classify call
This means the user can say "now do the same for the other file" and the model understands the reference. Without it, every request is completely stateless.
## Latency Benchmarks
Averaged across 20 runs with a 5–10 second audio clip:
| Stage | Model | Avg Latency |
|---|---|---|
| Speech-to-Text | whisper-large-v3 | ~720 ms |
| Intent Classification | llama-3.1-8b-instant | ~380 ms |
| Code Generation | llama-3.1-8b-instant | ~950 ms |
| Summarization | llama-3.1-8b-instant | ~1,350 ms |
Total end-to-end for a write_code request: ~2.0–2.5 seconds. Fast enough to feel responsive.
Interesting finding: intent classification is the fastest stage despite being the most "reasoning-heavy" step, because the strict JSON-only prompt forces the model to skip all its natural-language preamble. Constraining the output format is free speed.
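Per-stage numbers like these are easy to collect with a small timing wrapper. A minimal sketch (the `timed` helper is my own illustration, not code from the project) using a monotonic clock:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_ms) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Example: time an arbitrary call; in the app this would wrap each pipeline stage
result, ms = timed(sum, range(1000))
```

`time.perf_counter()` is preferable to `time.time()` here because it is monotonic and high-resolution, so short stage durations aren't distorted by clock adjustments.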
## Challenges I Faced
1. **JSON parse failures** – Even with a strict prompt, the model occasionally wraps output in markdown fences. Added a fallback that strips backticks before parsing, plus a catch-all that defaults to the `chat` intent on failure.
2. **Audio format handling** – Groq's Whisper API requires the correct file extension. Sending a `.wav` file named `audio` with no extension caused silent failures. Fix: always preserve the original filename and extension.
3. **Two-process state** – Streamlit and FastAPI are separate processes. If the backend restarts, all session history is lost. A future fix would be writing to SQLite on every `memory.add()` call.
4. **Browser mic compatibility** – The streamlit-audiorecorder component works great in Chrome but has issues in Firefox and Safari. Documented this in the README.
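The SQLite fix from challenge 3 could look roughly like this. It's a sketch only: the class name, schema, and `add()` signature are my assumptions, chosen to mirror the in-memory `memory.add()` calls so history survives a backend restart:

```python
import sqlite3

class PersistentMemory:
    """Sketch of the proposed fix: mirror every memory.add() into SQLite."""

    def __init__(self, path: str = ":memory:"):
        # Use a real file path (e.g. "output/history.db") to survive restarts;
        # ":memory:" is used here only to keep the demo self-contained.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS history "
            "(ts TEXT DEFAULT CURRENT_TIMESTAMP, role TEXT, content TEXT)"
        )

    def add(self, role: str, content: str) -> None:
        # Parameterized query: voice-derived text is untrusted input
        self.db.execute(
            "INSERT INTO history (role, content) VALUES (?, ?)",
            (role, content),
        )
        self.db.commit()

    def all(self) -> list[tuple[str, str]]:
        return self.db.execute("SELECT role, content FROM history").fetchall()
```

Committing on every `add()` trades a little write latency for durability, which is the right trade-off for a low-traffic interactive agent.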
## Bonus Features Built
- ✅ Compound commands – one audio clip triggers multiple tasks
- ✅ Human-in-the-loop – optional confirmation before executing file operations
- ✅ Session memory – rolling chat context + action history sidebar
- ✅ Latency benchmarking – toggle in settings to show model speeds
## Links

- GitHub: github.com/IshanNaikele/voice-agent

Built with FastAPI · Streamlit · Groq API · Whisper · Llama-3