What I Built
A voice-controlled AI agent that: records your voice → transcribes it →
understands your intent → executes the right action → shows everything in a UI.
Built for the Mem0 AI Internship assignment.
Architecture
Voice Input → Groq Whisper (STT) → LLaMA 3.3 70B (Intent) → Tool Executor → Gradio UI
Model Choices
Speech-to-Text: Whisper-large-v3 via Groq
I initially planned to run Whisper locally using HuggingFace, but on a CPU-only
Windows machine this was extremely slow (30+ seconds per clip). Groq's API gives
sub-second transcription for free, making the user experience much better.
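A minimal sketch of the transcription step, assuming the `groq` Python SDK and a `GROQ_API_KEY` environment variable (the helper name is mine, not from the project):

```python
import os

def transcribe(audio_path: str) -> str:
    """Send an audio clip to Groq's hosted Whisper and return the text."""
    from groq import Groq  # pip install groq

    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    # Open in binary mode and let Groq detect the audio format.
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(os.path.basename(audio_path), f.read()),
            model="whisper-large-v3",
        )
    return result.text
```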
LLM: LLaMA 3.3 70B via Groq
Running a 70B model locally requires 40GB+ VRAM. Groq's free tier handles this
instantly. The model is prompted to return structured JSON for reliable intent parsing.
The Intent Classification Trick
The key insight was prompting the LLM to return ONLY JSON — no explanation,
no markdown. Combined with low temperature (0.1), this gives very consistent results.
It also supports compound commands: "Write a retry function AND save it"
correctly returns intents: ["write_code", "create_file"].
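The trick above can be sketched roughly as follows. The prompt wording and JSON schema are illustrative guesses, and `parse_intents`/`classify` are hypothetical helper names; `llama-3.3-70b-versatile` is Groq's model ID for LLaMA 3.3 70B:

```python
import json
from typing import List

# Illustrative system prompt: force a bare JSON object, nothing else.
SYSTEM_PROMPT = (
    "You are an intent classifier. Respond with ONLY a JSON object, "
    'no markdown, no explanation. Schema: {"intents": [...], "args": {...}}'
)

def parse_intents(raw: str) -> List[str]:
    """Extract the intent list from the model's JSON-only reply."""
    return json.loads(raw)["intents"]

def classify(text: str) -> List[str]:
    """Ask LLaMA 3.3 70B via Groq; temperature 0.1 keeps output consistent."""
    from groq import Groq  # pip install groq; reads GROQ_API_KEY from the env

    client = Groq()
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        temperature=0.1,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return parse_intents(resp.choices[0].message.content)

# A compound command maps to multiple intents, e.g.:
# classify("Write a retry function AND save it")
#   should yield ["write_code", "create_file"]
```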
Challenges
- JSON parsing — LLMs sometimes wrap JSON in markdown fences. Fixed with a regex
  that strips the fences before parsing: re.sub(r"```json\s*|```\s*", "", raw)
- Audio formats — Gradio passes temp file paths. Opening with "rb" and letting
  Groq handle format detection solved this.
- Safety — Used os.path.basename() to prevent path traversal attacks
  when creating files.
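The fence-stripping and path-safety fixes can be sketched together (helper names are mine; the sample strings are made up for illustration):

```python
import json
import os
import re

def strip_fences(raw: str) -> str:
    """Remove the ```json ... ``` markdown fences LLMs sometimes add."""
    return re.sub(r"```json\s*|```\s*", "", raw).strip()

def safe_filename(requested: str) -> str:
    """Drop any directory components so '../../etc/passwd' can't escape."""
    return os.path.basename(requested)

fenced = '```json\n{"intents": ["create_file"]}\n```'
print(json.loads(strip_fences(fenced)))   # {'intents': ['create_file']}
print(safe_filename("../../etc/passwd"))  # passwd
```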
What I'd Add With More Time
- Local model support via Ollama (offline mode)
- Wake word detection
- Web search tool integration