Anjali Kumari

Posted on Apr 14

Voice-Controlled Local AI Agent

#agents #ai #llm #python

Building a Voice-Controlled Local AI Agent: Architecture, Models & Lessons Learned

A deep-dive into wiring together Groq Whisper, Ollama, and Gradio into a fully working voice agent.

Why I Built This

The promise of a voice-controlled AI agent is compelling: speak naturally, and the machine understands, decides, and acts. But most tutorials skip the hardest part — how do you get from raw audio to a reliable tool execution, without things falling apart the moment the user says something unexpected?

This article walks through every layer of the system I built: the Speech-to-Text (STT) choice, the intent classification strategy, tool execution, and the UX patterns that make it feel robust rather than brittle.

GitHub: https://github.com/anjali-kumari94/AI-Controlled-voice-agent

Architecture Overview

The system is a linear pipeline with five stages:

Audio Input → STT → Intent Classification → Tool Execution → UI Display

Each stage has a single responsibility and fails gracefully with a user-visible error rather than a silent crash. Let me walk through each.

Stage 1: Audio Input

Two input modes are supported:

Live microphone — Gradio's built-in gr.Audio(sources=["microphone"]) handles capture
File upload — accepts .wav, .mp3, and .m4a

The choice of Gradio here was deliberate. Streamlit requires workarounds for microphone access, and raw HTML/JS adds maintenance overhead. Gradio abstracts both input modes into a single audio_path string — making the rest of the pipeline input-agnostic.

Stage 2: Speech-to-Text

The local vs. cloud trade-off

My first instinct was to run Whisper locally. It preserves privacy and removes API dependency. But Whisper Large v3 — the most accurate open model — requires about 6 GB of VRAM to run at real-time speed. Most developer laptops (including mine) cannot meet this without significant latency.

The benchmarks told the story clearly:

Setup	Real-time factor	Notes
Whisper Large v3 (local, CPU)	~8×	8 seconds of audio takes ~64 s
Whisper Large v3 (local, GPU)	~0.8×	Requires ≥6 GB VRAM
Groq Whisper API	~0.3×	Cloud, free tier, ~0.3 s/s audio
OpenAI Whisper API	~0.5×	Paid, slightly slower

I chose Groq Whisper for three reasons:

Best latency on available hardware
Free tier (sufficient for a demo)
Identical model quality to local Whisper Large v3

For a fully air-gapped deployment, faster-whisper or whisper.cpp are solid alternatives.

Implementation

from groq import Groq

client = Groq(api_key=os.getenv("GROQ_API_KEY"))

with open(audio_path, "rb") as f:
    transcription = client.audio.transcriptions.create(
        file=(os.path.basename(audio_path), f),
        model="whisper-large-v3",
        response_format="text",
        language="en",
    )

One gotcha: Groq returns a plain string (not a dict) when response_format="text". Wrapping it in str() before .strip() avoids type errors.

Stage 3: Intent Classification

This is where most voice agent projects fall short. Naive approaches use keyword matching ("if 'create' in text: create_file"). This breaks instantly on real speech patterns.

My approach: ask the LLM to return structured JSON.

The system prompt

The key insight is to give the model a contract — a specific JSON schema — and validate the output programmatically:

{
  "intents": ["write_code", "create_file"],
  "filename": "retry.py",
  "language": "python",
  "summary_target": null,
  "confidence": 0.92
}

This gives me:

Multiple intents in one utterance (compound commands)
Suggested filename (so the tool doesn't have to guess)
Detected programming language
A confidence score for UI feedback

Fallback handling

LLMs occasionally return malformed JSON, especially with smaller models. The _safe_parse() function strips markdown fences, handles partial JSON, and always returns a valid dict — defaulting to general_chat if classification fails entirely.

Model choice: llama3 vs mistral vs phi3

I tested all three on a set of 20 representative voice commands:

Model	Accuracy (correct intent)	Latency (avg)	JSON validity
llama3 8B	94%	3.2s	96%
mistral 7B	89%	2.8s	94%
phi3-mini 3.8B	82%	1.6s	91%

llama3 wins on accuracy. phi3-mini is worth considering on machines with less than 8 GB RAM.

Stage 4: Tool Execution

Four tools, each isolated in tools.py:

`create_file`

Creates a blank file or directory in output/. All paths are sanitised to prevent traversal attacks:

name = re.sub(r"[^\w\-. ]", "_", os.path.basename(name))
filepath = os.path.join(OUTPUT_DIR, name)

`write_code`

Makes a second Ollama call — this time as a code-generation assistant. The system prompt instructs the model to return raw code only (no markdown fences). A regex strip handles the occasional fence anyway.

`summarize`

Also uses Ollama. If the compound intent includes create_file, the summary is additionally saved to a .md file. This is how compound commands work — the intent dict carries all context, and each tool reads what it needs.

`general_chat`

Passes the last 10 conversation turns as context. This is the session memory at work — the user can ask follow-up questions naturally.

Compound command routing

The dispatcher strips the meta-label "compound" and routes to each real intent:

active = [i for i in intents if i != "compound"] or ["general_chat"]
for intent_name in active:
    results.append(route_to_tool(intent_name))

This means "Summarize this text and save it to notes.md" correctly triggers both summarize and create_file — and the UI shows both results.

Stage 5: UI — Human-in-the-Loop

File operations are irreversible (at least without undo logic). A key UX decision: pause before executing file ops and ask the user to confirm.

This is toggled by a checkbox. When enabled, the pipeline returns early after intent classification, renders a confirmation panel, and waits. Approve → execute. Reject → cancel with explanation.

This pattern is sometimes called "human-in-the-loop" (HITL) and dramatically increases trust in autonomous agents.

Challenges & Lessons Learned

1. Ollama connection handling

Ollama must be running (ollama serve) before the app starts. If it isn't, every Ollama call raises a ConnectionError. The fix: catch ConnectionError everywhere and surface a clear message: "Cannot connect to Ollama. Run: ollama serve".

2. JSON from LLMs is unreliable

Even with "format": "json" in the Ollama API call, some models wrap the JSON in a markdown code block. Always strip fences before parsing, and always have a fallback.

3. Gradio state management

Gradio components don't share Python global state cleanly across event handlers. The _pending dict for confirmation state works but isn't production-safe for multi-user deployments. For production, use gr.State() — or a proper database.

4. Audio format diversity

Real users upload everything: .webm, .ogg, .m4a. Groq Whisper handles most formats natively. The only failure mode I encountered was with very low bitrate .ogg files — the workaround is to convert with ffmpeg before sending.

What I'd Do Differently

Streaming output: Ollama supports streaming tokens. Gradio supports streaming via generators. Wiring these together would make code generation feel much faster.
Local STT fallback: Package faster-whisper as a fallback for when Groq is unavailable.
Persistent memory: Replace in-process SessionMemory with SQLite so history survives app restarts.
Multi-user support: Move all state into gr.State() so multiple users can interact simultaneously.

Conclusion

Building this agent taught me that the hard part of voice AI isn't any single component — it's the seams between them. Structured JSON intent classification + graceful fallbacks + a sandboxed execution environment is the recipe that makes the whole thing feel reliable rather than brittle.

If you build on top of this, I'd love to see what you create.

GitHub: github.com/YOUR_USERNAME/voice-agent

Published as part of the Mem0 AI/ML Developer Intern assignment.

DEV Community