How I built a fully functional voice-to-action AI agent using Groq, Streamlit, and a modular Python pipeline — including the messy parts nobody talks about.
## Introduction
What if you could just speak a command and have an AI agent write code, summarize text, and save files to your computer — all without typing a single line?
That's exactly what I set out to build: a Voice-Controlled Local AI Agent that listens to your voice (or typed commands), understands your intent, and executes real actions on your machine. Built with Python, Streamlit, and Groq's blazing-fast LLM inference, this project turned out to be one of the most educational builds I've done — full of unexpected challenges and satisfying breakthroughs.
In this article I'll walk you through:
- The full system architecture
- Why I chose Groq over OpenAI
- How I handled compound commands, memory, and graceful degradation
- The model benchmarking system I built (Llama 3.3 70B vs Llama 3.1 8B)
- Every major challenge I hit — and how I solved them
## System Architecture
The agent follows a clean three-stage pipeline:
```
🎙️ Audio Input
        ↓
[Stage 1] Speech-to-Text (STT)
        ↓
[Stage 2] Intent Classification (LLM)
        ↓
[Stage 3] Tool Execution
        ↓
📁 Output (files, code, summaries, chat)
```
Each stage is isolated into its own Python module, making the system easy to swap, extend, or debug independently.
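To make the separation concrete, here is a minimal sketch of how the orchestration chains the three stages. The function bodies are stand-ins (the real modules are described below); only the shape of the pipeline is the point:

```python
# Illustrative sketch of the three-stage pipeline. The real app.py wires
# these stages to Streamlit widgets; the bodies here are stand-ins.

def transcribe_audio(audio_path: str, config: dict) -> str:
    return "write a python retry function"           # stand-in for stt.py

def classify_intent(text: str, config: dict) -> dict:
    return {"intent": "write_code", "commands": []}  # stand-in for intent.py

def execute(parsed: dict, config: dict) -> dict:
    return {"success": True, "message": f"handled {parsed['intent']}"}  # tools.py

def run_pipeline(audio_path: str, config: dict) -> dict:
    text = transcribe_audio(audio_path, config)   # Stage 1: STT
    parsed = classify_intent(text, config)        # Stage 2: intent classification
    return execute(parsed, config)                # Stage 3: tool execution
```

Because each stage is just a function taking the previous stage's output, swapping a backend never touches the other two stages.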
### File Structure

```
voice_agent/
├── app.py            # Streamlit UI + pipeline orchestration
├── stt.py            # Speech-to-Text module (Groq / OpenAI / Whisper)
├── intent.py         # Intent classification via LLM
├── tools.py          # Tool execution engine
├── requirements.txt
└── output/           # All generated files land here
```
## Stage 1: Speech-to-Text (STT)
The STT module (stt.py) supports three backends, selectable from the sidebar:
| Backend | Model | Notes |
|---|---|---|
| Groq API | `whisper-large-v3` | Fast, free, cloud-based |
| OpenAI API | `whisper-1` | Accurate, paid |
| Whisper Local | `base` / `small` | Fully offline, slower |
For most users, Groq's Whisper Large V3 is the best choice — it's free, fast, and handles a wide variety of accents and audio quality. The local Whisper fallback is great for privacy-sensitive scenarios where you don't want audio leaving your machine.
```python
def transcribe_audio(audio_path: str, config: dict) -> str:
    backend = config.get("stt_backend", "Groq API")
    if backend == "Groq API":
        return _groq_stt(audio_path, config.get("api_key", ""))
    elif backend == "OpenAI API":
        return _openai_stt(audio_path, config.get("api_key", ""))
    else:
        return _whisper_local(audio_path)
```
## Stage 2: Intent Classification
This is where the intelligence lives. After transcription, the raw text is sent to an LLM with a carefully engineered system prompt that forces structured JSON output.
### The System Prompt Design

The key insight was to design the prompt around a `commands` array rather than a single intent. This unlocked compound-command support from day one:
```json
{
  "intent": "summarize_and_save",
  "detail": "Summarize text and save to file",
  "commands": [
    {
      "intent": "summarize_and_save",
      "description": "Summarize and write to summary.txt",
      "params": {
        "filename": "summary.txt",
        "text_to_summarize": "..."
      }
    }
  ]
}
```
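The schema is only useful if downstream code can rely on it, so it is worth normalizing whatever the model actually returns. A small sketch, assuming the schema shown here (the real intent.py may differ):

```python
def normalize_intent(parsed: dict) -> dict:
    """Guarantee the parsed result always carries a non-empty commands list.

    Sketch only: the contract is the schema shown above, with top-level
    intent/detail keys plus a commands array.
    """
    parsed.setdefault("intent", "general_chat")
    parsed.setdefault("detail", "")
    if not parsed.get("commands"):
        # Fall back to wrapping the top-level intent as a single command.
        parsed["commands"] = [{
            "intent": parsed["intent"],
            "description": parsed["detail"],
            "params": {},
        }]
    return parsed
```

With this in place, the execution loop can always iterate over `commands` without special-casing older or malformed replies.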
### Supported Intents

| Intent | Action |
|---|---|
| `create_file` | Create a new file or folder |
| `write_code` | Generate and save code to a file |
| `summarize` | Summarize text (output in chat) |
| `summarize_and_save` | Summarize and save result to a file |
| `general_chat` | Answer questions, explain concepts |
### LLM Backends

| Backend | Model | Speed | Cost |
|---|---|---|---|
| Groq API | `llama-3.3-70b-versatile` | ⚡ Very fast | Free tier |
| OpenAI API | `gpt-4o-mini` | Medium | Paid |
| Ollama Local | `llama3.2`, `mistral`, etc. | Slow | Free |
I strongly recommend Groq for this use case. The inference speed is genuinely remarkable — what takes GPT-4o-mini 3–5 seconds takes Groq under a second.
## Stage 3: Tool Execution
The tools.py module routes each classified intent to the correct handler function. Every handler follows the same contract — it receives params, the output directory, and the config, and returns a result dict:
```python
{
    "success": True,
    "message": "Code written to output/retry.py",
    "output": "# Generated code here..."
}
```
The write_code handler is the most interesting — it makes a second LLM call to actually generate the code, instructing the model to return only raw code with no markdown fences or explanation:
```python
def _generate_code(description: str, language: str, config: dict) -> str:
    prompt = f"""Generate clean, well-commented {language} code for:
{description}
Return ONLY the code. No markdown fences. No explanation outside the code."""
    return _llm_query(prompt, config)
```
## Bonus Features Implemented
### 1. Compound Commands
A single voice input can trigger multiple actions. For example:
"Summarize this article and save it to notes.txt"
The LLM detects two intents (summarize + create_file) and the agent executes them sequentially. The pipeline loops through the commands array and processes each one — with Human-in-the-Loop confirmation for file operations.
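Stripped to its essentials, that loop looks something like this sketch. The handlers and the confirmation gate are injected as callables so the sketch stays self-contained; the real app reads them from tools.py and the Streamlit confirmation dialog:

```python
# Intents that write to disk and therefore need HITL confirmation.
FILE_INTENTS = {"create_file", "write_code", "summarize_and_save"}

def run_commands(commands: list, execute, confirm) -> list:
    """Process a parsed commands array sequentially (sketch).

    `execute(cmd)` runs a single command; `confirm(cmd)` is the
    human-in-the-loop gate and returns True/False for file intents.
    """
    results = []
    for cmd in commands:
        if cmd["intent"] in FILE_INTENTS and not confirm(cmd):
            results.append({"success": False, "message": "Cancelled by user"})
            continue
        results.append(execute(cmd))
    return results
```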
### 2. Human-in-the-Loop (HITL)
Before executing any file-writing operation (create_file, write_code, summarize_and_save), the agent pauses and shows a confirmation dialog:
```
⚠️ Confirmation Required
Intent: write_code
Action: Generate Python retry function and save to retry.py
[Parameters preview]
✅ Confirm & Execute    ❌ Cancel
```
This is a critical safety feature. Without it, a misheard command could overwrite important files. Even cancelled actions are logged to the session history for full traceability.
### 3. Graceful Degradation
Real-world voice agents fail constantly — bad audio, network timeouts, quota exceeded, unknown intents. Instead of crashing or showing a raw Python traceback, every failure point is caught and routed to a user-friendly message:
```python
try:
    text = transcribe_audio(audio_path, config)
except Exception as e:
    err = str(e)
    if "api" in err.lower() or "key" in err.lower():
        friendly = "API key error — check your key in the sidebar."
    elif "timeout" in err.lower():
        friendly = "Network error — check your internet connection."
    elif "format" in err.lower():
        friendly = "Audio format unsupported — try WAV or MP3."
    else:
        friendly = f"STT failed: {err}"
```
Unknown intents also degrade gracefully — instead of crashing, they fall back to general_chat and the agent explains what it understood.
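A plain dict lookup with a default makes that fallback explicit. A sketch with stub handlers standing in for the real ones in tools.py:

```python
def _chat(params: dict) -> dict:       # stub handler for the sketch
    return {"success": True, "message": "chat reply"}

def _create(params: dict) -> dict:     # stub handler for the sketch
    return {"success": True, "message": "file created"}

HANDLERS = {"general_chat": _chat, "create_file": _create}

def route(intent: str, params: dict) -> dict:
    # Unknown intents degrade to general_chat instead of raising KeyError.
    handler = HANDLERS.get(intent, _chat)
    return handler(params)
```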
### 4. Session Memory
The agent maintains two types of memory throughout a session:
**Action History** — every command, intent, result, and timestamp is stored and displayed in the History tab with success/failure statistics and intent frequency chips.

**Chat Context** — a rolling window of the last 20 conversation turns is passed to the LLM on every request, enabling follow-up commands like:
"Now make it handle exceptions too"
The LLM remembers what was just generated and adds exception handling to it — without the user repeating themselves.
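A `collections.deque` with `maxlen` gives you the rolling window almost for free. A sketch (the real app keeps this in `st.session_state`):

```python
from collections import deque

class ChatContext:
    """Rolling window of the last N conversation turns (sketch)."""

    def __init__(self, max_turns: int = 20):
        # Old turns fall off automatically once the window is full.
        self.turns = deque(maxlen=max_turns)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def as_messages(self) -> list:
        # Sent to the LLM on every request so follow-ups have context.
        return list(self.turns)
```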
### 5. Model Benchmarking
The benchmarking tab lets you compare two Groq models head-to-head on any prompt:
- Model A: Llama 3.3 70B Versatile — large, powerful, deep reasoning
- Model B: Llama 3.1 8B Instant — small, ultra-fast, great for simple tasks
Both models run on the same Groq API key — no extra cost. The benchmark shows:
- ⏱ Latency in seconds
- 🔤 Total tokens used
- 🚀 Tokens per second
- 🏆 Speed winner highlighted in green
- 📈 Cumulative stats across multiple runs
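The per-run math is just a monotonic clock plus the token count from the API response. A sketch, assuming the model call returns the reply text and a total token count (on the real Groq response that should be `usage.total_tokens`):

```python
import time

def benchmark(call_model, prompt: str) -> dict:
    """Time one model call and derive tokens/sec (sketch).

    `call_model(prompt)` is assumed to return (text, total_tokens).
    """
    start = time.perf_counter()
    text, total_tokens = call_model(prompt)
    latency = time.perf_counter() - start
    return {
        "latency_s": round(latency, 3),
        "total_tokens": total_tokens,
        "tokens_per_s": round(total_tokens / latency, 1) if latency > 0 else 0.0,
        "text": text,
    }
```

Running this once per model on the same prompt gives you the latency and throughput numbers side by side; the winner is just a comparison of the two dicts.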
After running several benchmarks myself, here's what I found:
| Task Type | Winner |
|---|---|
| Complex code generation | Llama 3.3 70B (better quality) |
| Simple Q&A / chat | Llama 3.1 8B (3–4× faster) |
| Summarization | Roughly equal |
## Challenges & How I Solved Them
### Challenge 1: Groq API Key vs OpenAI Key — Two Fields, One Config
**Problem:** The app started with a single API key field, but Groq and OpenAI use different keys. Switching backends required manually changing the key every time.

**Solution:** Added two separate key fields in the sidebar. The active key is resolved automatically based on which backend is selected:
```python
if "Groq" in stt_backend or "Groq" in llm_backend:
    active_api_key = groq_key
elif "OpenAI" in stt_backend or "OpenAI" in llm_backend:
    active_api_key = openai_key
```
---

### Challenge 2: LLM Returns Markdown, Not Pure JSON

**Problem:** Even when instructed to return only JSON, LLMs often wrap their output in markdown fences like `` ```json ... ``` ``.

**Solution:** Built a robust parser that strips fences and extracts the JSON object using regex, with a safe fallback to `general_chat` if parsing still fails:
```python
import json
import re

def _parse_intent(raw: str) -> dict:
    # Strip markdown fences (``` or ```json), then pull out the JSON object.
    cleaned = re.sub(r"```(?:json)?", "", raw).strip().rstrip("`").strip()
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    # Safe fallback: treat the raw reply as general chat.
    return {"intent": "general_chat", "detail": raw[:500], "commands": [...]}
```
---
### Challenge 3: Decommissioned Models Break Silently at Runtime
**Problem:** During development, Groq deprecated `gemma2-9b-it` — a model I was using for benchmarking. The app crashed at runtime with a 400 error, not at startup.
**Solution:** Replaced the decommissioned model with `llama-3.1-8b-instant` (currently active on Groq). The broader lesson: **always wrap model API calls in try/except** and surface the error message directly to the user — never let a model error become a silent failure.
---
### Challenge 4: Streamlit State Management with HITL
**Problem:** The Human-in-the-Loop confirmation requires the app to pause mid-pipeline, show a confirmation UI, and then resume execution on the next Streamlit rerun. Managing this state without losing the action context was tricky.
**Solution:** Used `st.session_state.pending_action` to store the action dict between reruns. When a file intent is detected, the action is saved to state and `st.rerun()` is called. The confirmation UI reads from state and executes (or cancels) on the next button press.
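Stripped of the Streamlit specifics, the pattern is a two-phase state machine. In this sketch a plain dict stands in for `st.session_state`, with a rerun implied between the phases:

```python
# Sketch of the HITL pending-action pattern; a plain dict stands in for
# st.session_state, and a Streamlit rerun happens between the two phases.

def detect_file_intent(state: dict, action: dict) -> None:
    # Phase 1: park the action instead of executing it, then rerun.
    state["pending_action"] = action

def on_confirm(state: dict, execute) -> dict:
    # Phase 2 (next rerun): execute the parked action and clear it.
    action = state.pop("pending_action", None)
    if action is None:
        return {"success": False, "message": "Nothing pending"}
    return execute(action)

def on_cancel(state: dict) -> dict:
    # Cancelled actions are still returned so they can be logged to history.
    action = state.pop("pending_action", None)
    return {"success": False, "message": "Cancelled", "action": action}
```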
---
### Challenge 5: OpenAI Quota Errors in Benchmarking
**Problem:** A perfectly valid OpenAI API key can still fail with a `429 insufficient_quota` error if the account has no billing credits — this confused early testers who thought their key was broken.
**Solution:** The benchmarking tab wraps each model call in try/except and displays the error inline inside the result card rather than crashing the whole tab. This way, if one model fails, the other still shows its result.
---
## Tech Stack Summary
| Component | Technology |
|---|---|
| UI Framework | Streamlit |
| STT (Cloud) | Groq Whisper Large V3 |
| STT (Local) | OpenAI Whisper (local) |
| LLM (Primary) | Groq — Llama 3.3 70B Versatile |
| LLM (Benchmark) | Groq — Llama 3.1 8B Instant |
| Language | Python 3.10+ |
| File output | Local filesystem (`output/` directory) |
---
## What I'd Build Next
- **Text-to-Speech output** — have the agent speak its responses back using Groq's TTS or ElevenLabs
- **File reading intent** — "Read my notes.txt and summarize it"
- **Web search intent** — "Search for the latest Python 3.13 features and save a summary"
- **Scheduled commands** — "Run this every morning at 9am"
- **Multi-agent routing** — route different intent types to specialized sub-agents
---
## Conclusion
Building a voice-controlled AI agent from scratch taught me that the hard parts aren't the AI — it's the **plumbing**: state management, error handling, audio format compatibility, and API quirks. The core pipeline took a day to build. Making it robust, user-friendly, and production-grade took much longer.
If you're building something similar, my biggest advice is: **handle failures first, features second**. A voice agent that crashes on bad audio or an expired API key is worse than no agent at all.
The full source code is available and runs with a single command:
```bash
pip install -r requirements.txt
streamlit run app.py
```