Building a Voice-Controlled Local AI Agent: Architecture, Models & Lessons Learned
A deep-dive into wiring together Groq Whisper, Ollama, and Gradio into a fully working voice agent.
Why I Built This
The promise of a voice-controlled AI agent is compelling: speak naturally, and the machine understands, decides, and acts. But most tutorials skip the hardest part — how do you get from raw audio to a reliable tool execution, without things falling apart the moment the user says something unexpected?
This article walks through every layer of the system I built: the Speech-to-Text (STT) choice, the intent classification strategy, tool execution, and the UX patterns that make it feel robust rather than brittle.
GitHub: https://github.com/anjali-kumari94/AI-Controlled-voice-agent
Architecture Overview
The system is a linear pipeline with five stages:
Audio Input → STT → Intent Classification → Tool Execution → UI Display
Each stage has a single responsibility and fails gracefully with a user-visible error rather than a silent crash. Let me walk through each.
Stage 1: Audio Input
Two input modes are supported:
-
Live microphone — Gradio's built-in
gr.Audio(sources=["microphone"])handles capture -
File upload — accepts
.wav,.mp3, and.m4a
The choice of Gradio here was deliberate. Streamlit requires workarounds for microphone access, and raw HTML/JS adds maintenance overhead. Gradio abstracts both input modes into a single audio_path string — making the rest of the pipeline input-agnostic.
Stage 2: Speech-to-Text
The local vs. cloud trade-off
My first instinct was to run Whisper locally. It preserves privacy and removes API dependency. But Whisper Large v3 — the most accurate open model — requires about 6 GB of VRAM to run at real-time speed. Most developer laptops (including mine) cannot meet this without significant latency.
The benchmarks told the story clearly:
| Setup | Real-time factor | Notes |
|---|---|---|
| Whisper Large v3 (local, CPU) | ~8× | 8 seconds of audio takes ~64 s |
| Whisper Large v3 (local, GPU) | ~0.8× | Requires ≥6 GB VRAM |
| Groq Whisper API | ~0.3× | Cloud, free tier, ~0.3 s/s audio |
| OpenAI Whisper API | ~0.5× | Paid, slightly slower |
I chose Groq Whisper for three reasons:
- Best latency on available hardware
- Free tier (sufficient for a demo)
- Identical model quality to local Whisper Large v3
For a fully air-gapped deployment, faster-whisper or whisper.cpp are solid alternatives.
Implementation
from groq import Groq
client = Groq(api_key=os.getenv("GROQ_API_KEY"))
with open(audio_path, "rb") as f:
transcription = client.audio.transcriptions.create(
file=(os.path.basename(audio_path), f),
model="whisper-large-v3",
response_format="text",
language="en",
)
One gotcha: Groq returns a plain string (not a dict) when response_format="text". Wrapping it in str() before .strip() avoids type errors.
Stage 3: Intent Classification
This is where most voice agent projects fall short. Naive approaches use keyword matching ("if 'create' in text: create_file"). This breaks instantly on real speech patterns.
My approach: ask the LLM to return structured JSON.
The system prompt
The key insight is to give the model a contract — a specific JSON schema — and validate the output programmatically:
{
"intents": ["write_code", "create_file"],
"filename": "retry.py",
"language": "python",
"summary_target": null,
"confidence": 0.92
}
This gives me:
- Multiple intents in one utterance (compound commands)
- Suggested filename (so the tool doesn't have to guess)
- Detected programming language
- A confidence score for UI feedback
Fallback handling
LLMs occasionally return malformed JSON, especially with smaller models. The _safe_parse() function strips markdown fences, handles partial JSON, and always returns a valid dict — defaulting to general_chat if classification fails entirely.
Model choice: llama3 vs mistral vs phi3
I tested all three on a set of 20 representative voice commands:
| Model | Accuracy (correct intent) | Latency (avg) | JSON validity |
|---|---|---|---|
| llama3 8B | 94% | 3.2s | 96% |
| mistral 7B | 89% | 2.8s | 94% |
| phi3-mini 3.8B | 82% | 1.6s | 91% |
llama3 wins on accuracy. phi3-mini is worth considering on machines with less than 8 GB RAM.
Stage 4: Tool Execution
Four tools, each isolated in tools.py:
create_file
Creates a blank file or directory in output/. All paths are sanitised to prevent traversal attacks:
name = re.sub(r"[^\w\-. ]", "_", os.path.basename(name))
filepath = os.path.join(OUTPUT_DIR, name)
write_code
Makes a second Ollama call — this time as a code-generation assistant. The system prompt instructs the model to return raw code only (no markdown fences). A regex strip handles the occasional fence anyway.
summarize
Also uses Ollama. If the compound intent includes create_file, the summary is additionally saved to a .md file. This is how compound commands work — the intent dict carries all context, and each tool reads what it needs.
general_chat
Passes the last 10 conversation turns as context. This is the session memory at work — the user can ask follow-up questions naturally.
Compound command routing
The dispatcher strips the meta-label "compound" and routes to each real intent:
active = [i for i in intents if i != "compound"] or ["general_chat"]
for intent_name in active:
results.append(route_to_tool(intent_name))
This means "Summarize this text and save it to notes.md" correctly triggers both summarize and create_file — and the UI shows both results.
Stage 5: UI — Human-in-the-Loop
File operations are irreversible (at least without undo logic). A key UX decision: pause before executing file ops and ask the user to confirm.
This is toggled by a checkbox. When enabled, the pipeline returns early after intent classification, renders a confirmation panel, and waits. Approve → execute. Reject → cancel with explanation.
This pattern is sometimes called "human-in-the-loop" (HITL) and dramatically increases trust in autonomous agents.
Challenges & Lessons Learned
1. Ollama connection handling
Ollama must be running (ollama serve) before the app starts. If it isn't, every Ollama call raises a ConnectionError. The fix: catch ConnectionError everywhere and surface a clear message: "Cannot connect to Ollama. Run: ollama serve".
2. JSON from LLMs is unreliable
Even with "format": "json" in the Ollama API call, some models wrap the JSON in a markdown code block. Always strip fences before parsing, and always have a fallback.
3. Gradio state management
Gradio components don't share Python global state cleanly across event handlers. The _pending dict for confirmation state works but isn't production-safe for multi-user deployments. For production, use gr.State() — or a proper database.
4. Audio format diversity
Real users upload everything: .webm, .ogg, .m4a. Groq Whisper handles most formats natively. The only failure mode I encountered was with very low bitrate .ogg files — the workaround is to convert with ffmpeg before sending.
What I'd Do Differently
- Streaming output: Ollama supports streaming tokens. Gradio supports streaming via generators. Wiring these together would make code generation feel much faster.
-
Local STT fallback: Package
faster-whisperas a fallback for when Groq is unavailable. -
Persistent memory: Replace in-process
SessionMemorywith SQLite so history survives app restarts. -
Multi-user support: Move all state into
gr.State()so multiple users can interact simultaneously.
Conclusion
Building this agent taught me that the hard part of voice AI isn't any single component — it's the seams between them. Structured JSON intent classification + graceful fallbacks + a sandboxed execution environment is the recipe that makes the whole thing feel reliable rather than brittle.
If you build on top of this, I'd love to see what you create.
GitHub: github.com/YOUR_USERNAME/voice-agent
Published as part of the Mem0 AI/ML Developer Intern assignment.
Top comments (0)