Alokik Gour

Building a Voice-Controlled AI Agent with Groq, Whisper, and Gradio

I recently built a voice-controlled AI agent that listens to your voice, understands what you want, and actually does it — creates files, writes code, summarizes text, or just chats. Here's how it works and what I learned.
What it does
You speak a command. The agent transcribes it, classifies your intent, and executes the right tool. All results are shown in a clean Gradio UI. Every file operation requires your confirmation before anything touches the file system.
Architecture
The pipeline has four stages:
Audio Input → Speech-to-Text → Intent Classification → Tool Execution
For STT I used Groq's hosted Whisper-large-v3. Running Whisper locally on CPU is around 10x slower than real-time, which makes the UI feel broken. Groq runs the same model at roughly 200x real-time speed for free, which was an easy decision.
For intent classification and tool responses I used Llama-3.3-70b-versatile via Groq. The model receives the transcript and returns structured JSON with three fields: the intent, the parameters needed to execute it, and a one-sentence explanation of its reasoning. This structured output approach makes the routing logic simple and reliable.
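Here's a minimal sketch of the parsing side of that step. The three field names and the hedging around valid intents match what's described above, but the prompt wording and function names are my own illustration, not the project's exact code:

```python
import json

# Illustrative system prompt; the real wording in the project may differ.
SYSTEM_PROMPT = """You are an intent classifier for a voice agent.
Return raw JSON only, with exactly these fields:
  "intent": one of create_file, write_code, summarize, general_chat
  "parameters": object with the arguments the tool needs
  "reasoning": one sentence explaining the choice"""

VALID_INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

def parse_intent(raw: str) -> dict:
    """Validate the model's structured JSON output before routing."""
    data = json.loads(raw)
    if data.get("intent") not in VALID_INTENTS:
        raise ValueError(f"unknown intent: {data.get('intent')!r}")
    data.setdefault("parameters", {})
    data.setdefault("reasoning", "")
    return data
```

The actual model call would pass SYSTEM_PROMPT plus the transcript to Groq's chat completions endpoint with model="llama-3.3-70b-versatile"; only the validation side is shown here because that's where routing reliability comes from.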
The four supported intents are create_file, write_code, summarize, and general_chat. Depending on which intent is detected, the tool layer writes a file, generates and saves code, summarizes text, or returns a chat response.
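A dispatch table keeps that routing logic simple. This is a sketch with stub tool bodies; the real tools write files and call the LLM, but the lookup pattern is the point:

```python
# Stub tools for illustration only; the real implementations do file I/O
# and LLM calls. Keys mirror the four intents from the classifier.
def create_file(parameters): return f"created {parameters['filename']}"
def write_code(parameters): return f"wrote code to {parameters['filename']}"
def summarize(parameters): return f"summary of {len(parameters['text'])} chars"
def general_chat(parameters): return parameters.get("reply", "hello")

TOOLS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def execute(intent: str, parameters: dict) -> str:
    """Look up the tool for the classified intent and run it."""
    tool = TOOLS.get(intent)
    if tool is None:
        raise ValueError(f"no tool registered for intent {intent!r}")
    return tool(parameters)
```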
Bonus features
I added two bonus features. The first is Human-in-the-Loop confirmation — before any file or code operation runs, the UI shows a confirmation prompt. The user has to explicitly click Confirm before anything is written to disk. The second is session memory — every interaction is logged and the last few turns are injected into the intent classification prompt as context, so the model can resolve references like "that file" or "the same language as before."
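The session-memory idea can be sketched with a bounded deque. The class and method names here are assumptions for illustration, not the project's actual API:

```python
from collections import deque

class SessionMemory:
    """Keep the last few turns and render them as prompt context."""

    def __init__(self, max_turns: int = 5):
        # Oldest turns fall off automatically once maxlen is reached.
        self.turns = deque(maxlen=max_turns)

    def log(self, user_text: str, agent_result: str) -> None:
        self.turns.append((user_text, agent_result))

    def as_context(self) -> str:
        """Format recent turns for injection into the intent prompt."""
        return "\n".join(f"User: {u}\nAgent: {a}" for u, a in self.turns)
```

Injecting `as_context()` ahead of the new transcript is what lets the classifier resolve references like "that file" against earlier turns.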
Challenges
The main challenge was that the Groq client initializes at module import time, before python-dotenv loads the .env file. This caused the API key to be missing even when the file existed. The fix was to call load_dotenv() at the top of each module that creates a Groq client, not just in the main app file.
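The ordering bug can be demonstrated with stdlib stand-ins (python-dotenv and the Groq constructor are replaced by plain functions here, since the failure mode is purely about when the key is read):

```python
import os

def load_env_file(path: str) -> None:
    """Stand-in for dotenv.load_dotenv(): parse KEY=VALUE lines."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

def make_client() -> str:
    """Stand-in for Groq(): fails if the key isn't in the environment yet."""
    key = os.environ.get("GROQ_API_KEY")
    if key is None:
        raise RuntimeError("GROQ_API_KEY missing: load .env first")
    return f"client({key[:4]}...)"
```

In the real project the fix is exactly this ordering: `load_dotenv()` runs at the top of every module, before any line that constructs `Groq(...)`.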
The second challenge was getting the LLM to return clean JSON consistently. Even with explicit instructions to return raw JSON only, the model occasionally wraps output in markdown code fences. A small stripping function at the parse step handles this gracefully.
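One plausible version of that stripping function, assuming the fences are either bare ``` or ```json:

```python
def strip_code_fences(raw: str) -> str:
    """Remove markdown code fences the model sometimes wraps JSON in."""
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop the closing fence if present, then the opening fence line
        # (which may carry a language tag like ```json).
        if lines[-1].strip() == "```":
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        text = "\n".join(lines).strip()
    return text
```

Running this before `json.loads` means the parser sees the same payload whether or not the model added fences.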
Models chosen
Whisper-large-v3 for STT because it has the best accuracy among open Whisper variants and Groq hosts it natively. Llama-3.3-70b-versatile for intent and generation because it follows structured output instructions reliably and is fast enough on Groq that the full pipeline feels snappy.
GitHub: github.com/Alokik-29/voice-ai-agent