Building a Voice-Controlled Local AI Agent: Architecture, Models & Lessons Learned
The Goal
I recently built a voice-controlled AI agent for the Mem0 Generative AI internship assignment. The system takes an audio command, converts it to text, classifies the user's intent, and executes the appropriate local action — all displayed in a web UI.
Here's what I built, how it works, and what I learned.
System Architecture
The pipeline has five stages:
Audio → STT → Intent Classification → Tool Execution → UI Display
Each stage sits behind its own interface, so any component can be swapped out without touching the rest of the pipeline.
| Stage | Implementation |
|---|---|
| Audio Input | Gradio mic / file upload |
| STT | Groq Whisper API |
| Intent | Groq `llama-3.3-70b-versatile` |
| Tools | Python stdlib |
| UI | Gradio 6.x |
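The staged design can be sketched as a single function that threads the audio through each step. Note this is illustrative: the function and parameter names below are my own shorthand, not the repo's actual API.

```python
from typing import Callable

def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],       # STT stage
    classify: Callable[[str], list[dict]],  # intent-classification stage
    execute: Callable[[dict], str],         # tool-execution stage
) -> list[str]:
    """Audio -> STT -> Intent -> Tools; returns results for the UI to display."""
    text = transcribe(audio_path)
    actions = classify(text)
    return [execute(action) for action in actions]
```

Because each stage is just a callable, swapping in a different STT backend or LLM means passing a different function; nothing downstream changes.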
Model Choices
Speech-to-Text: Groq Whisper API
I used Groq's hosted Whisper API for transcription. It is near-instant for short clips, requires no local GPU, and has a generous free tier — making it ideal for laptops without powerful hardware.
LLM: Groq llama-3.3-70b-versatile
For intent classification and all text generation (code, summarization, chat), I used Groq's llama-3.3-70b-versatile. Groq's LPU hardware makes inference extremely fast (~200 tokens/s), which feels nearly instant in practice.
For intent classification, I structured the LLM output as JSON:
```json
{
  "actions": [
    {
      "type": "write_code",
      "confidence": "high",
      "details": "Generate a Python retry function",
      "params": {
        "filename": "retry.py",
        "language": "python",
        "description": "a retry decorator with exponential backoff"
      }
    }
  ]
}
```
This made compound commands trivial to support: the `actions` array simply holds multiple items.
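Executing that array is then a simple dispatch loop over a handler registry. The handlers below are stubs of my own invention to show the shape, not the repo's real implementations:

```python
# Map each intent type to a handler; stub bodies stand in for the real tools.
HANDLERS = {
    "write_code": lambda params: f"wrote {params['filename']}",
    "summarize":  lambda params: "summary saved",
    "chat":       lambda params: params.get("description", ""),
}

def dispatch(intent: dict) -> list[str]:
    """Run every action in the intent's `actions` array, in order."""
    results = []
    for action in intent.get("actions", []):
        handler = HANDLERS.get(action["type"])
        if handler is None:
            results.append(f"unknown action: {action['type']}")
            continue
        results.append(handler(action.get("params", {})))
    return results
```

A compound command just produces two entries in `actions`, and the loop handles both without any special casing.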
Key Implementation Decisions
Safety Sandbox
All file operations are restricted to an output/ directory. I implemented path traversal protection:
```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def _safe_path(filename: str) -> Path:
    # Path(filename).name drops any directory components
    # ("../../etc/passwd" -> "passwd"), so traversal is stripped up front.
    safe = OUTPUT_DIR / Path(filename).name
    resolved = safe.resolve()
    # Defense in depth: verify the result stays under the sandbox root.
    # A plain startswith() string check can be fooled by sibling
    # directories like "output_evil", so compare resolved paths instead.
    if not resolved.is_relative_to(OUTPUT_DIR):
        raise ValueError("Path traversal attempt blocked")
    return resolved
```
Human-in-the-Loop
Before executing any file write, the UI asks for confirmation. The pending action is stored in `gr.State` and executed only when the user clicks confirm.
Session Memory
I maintain two separate state structures:
- Chat context — OpenAI-style message list passed to the LLM on each call
- Action log — displayed in the UI history panel so users can review what the agent has done
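A minimal sketch of those two structures (the exact field names are assumptions on my part, inferred from the description above):

```python
from datetime import datetime, timezone

# Chat context: OpenAI-style message list, passed to the LLM on every call.
chat_context: list[dict] = [
    {"role": "system", "content": "You are a voice-controlled agent."}
]

# Action log: shown in the UI history panel, never sent to the LLM.
action_log: list[dict] = []

def remember_turn(user_text: str, assistant_text: str) -> None:
    chat_context.append({"role": "user", "content": user_text})
    chat_context.append({"role": "assistant", "content": assistant_text})

def log_action(action_type: str, detail: str) -> None:
    action_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "type": action_type,
        "detail": detail,
    })
```

Keeping the two lists separate means the LLM's context window isn't cluttered with UI bookkeeping, and the history panel isn't polluted with raw conversation turns.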
Challenges
1. JSON parsing from LLMs
LLMs sometimes wrap JSON output in markdown fences or add preamble text. I wrote a robust parser that strips fences, attempts `json.loads`, falls back to regex extraction, and as a last resort defaults to a chat intent.
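That fallback chain can be sketched like this (a minimal version of the approach described above, not the repo's exact implementation):

```python
import json
import re

def parse_intent(raw: str) -> dict:
    """Parse LLM output into an intent dict, tolerating fences and preamble."""
    text = raw.strip()
    # 1. Strip ```json ... ``` fences if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    # 2. Try a direct parse.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 3. Regex fallback: grab the outermost {...} span and retry.
    braced = re.search(r"\{.*\}", text, re.DOTALL)
    if braced:
        try:
            return json.loads(braced.group(0))
        except json.JSONDecodeError:
            pass
    # 4. Last resort: treat the whole reply as a plain chat response.
    return {"actions": [{"type": "chat", "details": raw}]}
```

The key property is that the function never raises: every malformed reply degrades to a chat intent rather than crashing the pipeline.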
2. Compound command detection
Getting the LLM to reliably emit two action objects for commands like "summarize this and save to a file" required careful prompt engineering. Showing an explicit schema example in the system prompt dramatically improved reliability.
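To give a flavor of what "an explicit schema example" means, here is an illustrative system prompt in that spirit; the wording is mine, not the one shipped in the repo:

```python
SYSTEM_PROMPT = """\
You are an intent classifier for a voice agent.
Return ONLY a JSON object of the form:
{"actions": [{"type": "...", "confidence": "high|medium|low",
              "details": "...", "params": {}}]}
If the user asks for several things, emit one action object per request.
Example: "summarize this and save it to a file" ->
{"actions": [{"type": "summarize", ...}, {"type": "write_file", ...}]}
"""
```

Showing a concrete compound example (two objects in one array) was the part that most improved multi-action reliability for me.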
3. Gradio version compatibility
Gradio 6 removed several parameters that existed in older versions (`show_download_button`, `GoogleFont`). Updating the code to match the new API was a key debugging step.
What I'd Add Next
- Streaming LLM responses — show code as it's generated token by token
- Wake word detection — "Hey Agent" to trigger recording
- Plugin system — let users define custom tool handlers in YAML
- Persistent memory — save session history to disk across restarts
Source Code
GitHub: https://github.com/akshisharmaaa/Voice-Controlled-Local-AI-Agent
Thanks for reading! If you're building something similar or have questions about the architecture, drop a comment below.