How I Built a Voice-Controlled Local AI Agent
#python #ai #machinelearning #gradio
From microphone to file creation in under 3 seconds — using Whisper, LLaMA3, and Gradio
Introduction
What if you could just speak to your computer and have it create files, write code, or summarize content — all running locally on your machine? That is exactly what I set out to build for this assignment.
In this article I will walk through how I built a voice-controlled AI agent from scratch, the architecture decisions I made, the models I chose, and the challenges I faced along the way.
What the Agent Does
The agent takes voice input (or typed text), runs it through a full 4-stage pipeline, and returns a result:
- Stage 1 — Audio Input: Accept microphone recording or uploaded .wav/.mp3 file
- Stage 2 — Speech-to-Text: Transcribe the audio using OpenAI Whisper
- Stage 3 — Intent Classification: Send the transcription to an LLM to classify intent
- Stage 4 — Tool Execution: Run the right tool and save output to an output/ folder
Everything is displayed in a clean Gradio web UI showing each stage of the pipeline.
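Conceptually, the four stages boil down to a single dispatch function. Here is a minimal sketch with simplified stand-in stage bodies (the real modules call Whisper and an LLM, of course; all function names here are illustrative, not the project's actual code):

```python
def transcribe_audio(audio_path):
    # Stage 2 stand-in: the real module sends the file to Whisper
    return f"transcript of {audio_path}"

def classify_intent(text):
    # Stage 3 stand-in: the real module asks the LLM for structured JSON
    lowered = text.lower()
    if "summarize" in lowered:
        return {"intent": "summarize"}
    if "code" in lowered or "function" in lowered:
        return {"intent": "write_code"}
    if "file" in lowered or "folder" in lowered:
        return {"intent": "create_file"}
    return {"intent": "general_chat"}

def execute_tool(intent, text):
    # Stage 4 stand-in: dispatch on the classified intent
    return f"ran {intent['intent']} for: {text}"

def run_pipeline(audio_path=None, typed_text=None):
    # Stage 1: accept either typed text or a recorded audio file
    text = typed_text if typed_text else transcribe_audio(audio_path)
    intent = classify_intent(text)
    result = execute_tool(intent, text)
    return text, intent, result
```

The key property is that each stage is a plain function taking the previous stage's output, which is what makes the backends swappable later.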
Architecture Deep Dive
The project is split into 4 Python modules, each handling one stage of the pipeline.
stt.py — Speech to Text
I used Whisper Large v3 via the Groq API as my primary STT backend. Groq provides incredibly fast inference for free, which made it perfect for this project. The module also supports local Whisper (openai-whisper package) and OpenAI's Whisper-1 API as fallbacks, selectable via an environment variable.
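An env-variable switch for the three backends might look roughly like this. The `STT_BACKEND` values mirror the ones used in the setup steps later in the article, but the function itself is an illustrative sketch, not the project's exact code:

```python
import os

def pick_stt_backend():
    # Select the speech-to-text backend from an environment variable;
    # "groq" is assumed as the default since it is the primary backend.
    backend = os.environ.get("STT_BACKEND", "groq").lower()
    if backend == "groq":
        return "whisper-large-v3 via Groq API"
    if backend == "local":
        return "openai-whisper running on this machine"
    if backend == "openai":
        return "whisper-1 via OpenAI API"
    raise ValueError(f"Unknown STT_BACKEND: {backend}")
```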
intent.py — Intent Classification
This module sends the transcription to an LLM with a carefully crafted system prompt asking it to return structured JSON with four fields: intent, filename, language, and description. I chose a JSON-only output format to make parsing reliable and consistent.
The module supports Ollama (local), Groq (cloud), and OpenAI. There is also a rule-based fallback using keyword matching in case the LLM is unavailable.
The four supported intents are:
- create_file — create a new file or folder
- write_code — generate code and save it
- summarize — summarize provided text
- general_chat — conversational response
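The rule-based fallback can be as simple as keyword matching over these four intents. The keyword lists below are illustrative guesses, not the project's real ones:

```python
# Keyword lists per intent; first match wins, general_chat is the default.
INTENT_KEYWORDS = {
    "create_file": ["create a file", "new file", "make a folder", "create folder"],
    "write_code": ["write code", "function", "script", "program"],
    "summarize": ["summarize", "summary", "tl;dr"],
}

def fallback_intent(text):
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "general_chat"
```

It is far less flexible than the LLM, but it keeps the agent functional when the model is unreachable.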
tools.py — Tool Execution
Based on the classified intent, the tools module calls the appropriate function. For write_code, it prompts the LLM a second time with a code-generation system prompt, cleans the output (stripping markdown fences), and saves the result to the output/ directory.
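The fence-stripping step might look like this; the regex is one reasonable way to do it, not necessarily the project's exact approach:

```python
import re

def strip_code_fences(llm_output):
    # Remove a surrounding ```lang ... ``` fence if the LLM wrapped its
    # code in markdown; otherwise return the text unchanged.
    text = llm_output.strip()
    match = re.match(r"^```[\w+-]*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1) if match else text
```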
All file operations are sandboxed using os.path.basename() to prevent path traversal attacks — no files can ever be written outside the output/ folder.
app.py — Gradio UI
I built the frontend with Gradio 4, using a custom dark theme with CSS variables. The UI shows all four pipeline outputs simultaneously: transcribed text, detected intent, action taken, and final result. A session history panel shows the last 5 actions for context.
Models I Chose and Why
Whisper Large v3 via Groq (STT)
I chose Groq's hosted Whisper Large v3 for speech-to-text for three reasons:
- It is extremely fast — transcription takes under 1 second for short commands
- Groq provides a generous free tier with no credit card required
- Whisper Large v3 has excellent accuracy even with accented speech and background noise
For users who want full offline operation, the openai-whisper package is also supported as a drop-in replacement.
LLaMA 3 8B via Groq (Intent + Code Generation)
For the LLM I used LLaMA 3 8B through Groq's API. The 8B model is fast and capable enough for intent classification and short code generation tasks.
- For intent classification → temperature set to 0 for deterministic JSON output
- For code generation → temperature set to 0.3 for slightly more creative outputs
For users with a capable local machine, Ollama with llama3 or mistral is fully supported as a drop-in replacement that runs entirely offline.
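Putting the two temperatures together, here is a hedged sketch of a chat call against Groq's OpenAI-compatible endpoint. The URL and the `llama3-8b-8192` model id reflect Groq's public API at the time of writing; error handling is deliberately minimal:

```python
import json
import os
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def temperature_for(task):
    # 0 for deterministic intent JSON, 0.3 for slightly creative code
    return 0.0 if task == "intent" else 0.3

def ask_llm(system_prompt, user_text, task="intent"):
    payload = json.dumps({
        "model": "llama3-8b-8192",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature_for(task),
    }).encode()
    req = urllib.request.Request(
        GROQ_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same function serves both intent classification and code generation; only the system prompt and temperature change.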
Challenges I Faced
1. Getting Reliable JSON from the LLM
The biggest challenge was getting the LLM to consistently return valid JSON for intent classification. The solution was writing a strict system prompt that explicitly states:
"Respond ONLY with valid JSON, no markdown, no explanation"
I also added a fallback JSON parser that strips markdown fences and uses regex to extract JSON objects from messy responses.
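A sketch of that fallback parser: try strict `json.loads` first, then strip any markdown fences and regex out the first `{...}` block from the messy response:

```python
import json
import re

def parse_llm_json(raw):
    cleaned = raw.strip()
    # Remove ```json ... ``` fences if the model added them anyway
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Last resort: pull the first {...} span out of surrounding chatter
        match = re.search(r"\{.*\}", cleaned, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```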
2. File Safety
When the LLM suggests a filename, it sometimes includes relative paths like ../../etc/passwd. I solved this with os.path.basename() which strips any directory components, combined with a character sanitization step that removes unsafe characters.
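A minimal sketch of that sanitization step (the allowed character set and the `untitled.txt` fallback are illustrative choices):

```python
import os
import re

OUTPUT_DIR = "output"

def safe_output_path(suggested_name):
    name = os.path.basename(suggested_name)       # drops "../../" etc.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)  # keep a safe charset
    if not name or set(name) <= {"."}:
        name = "untitled.txt"                     # refuse empty/dot-only names
    return os.path.join(OUTPUT_DIR, name)
```

With this, even a hostile suggestion like `../../etc/passwd` collapses to `output/passwd`.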
3. Audio Format Compatibility
Gradio records audio in WebM format by default on some browsers, but Whisper works best with WAV or MP3. I handled this by detecting the file extension and setting the correct MIME type in the API request.
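The extension-to-MIME mapping can be a small lookup table; entries beyond `.wav`/`.mp3`/`.webm` are assumptions about formats worth covering:

```python
import os

AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".webm": "audio/webm",
    ".m4a": "audio/mp4",
    ".ogg": "audio/ogg",
}

def audio_mime_type(path):
    # Fall back to a generic binary type for unknown extensions
    ext = os.path.splitext(path)[1].lower()
    return AUDIO_MIME_TYPES.get(ext, "application/octet-stream")
```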
4. Hardware Constraints
Running Whisper Large locally requires a GPU and significant RAM. I addressed this by making every component swappable via environment variables:
- Low-end machine → use Groq API for both STT and LLM
- Powerful machine → run everything locally through Ollama + openai-whisper
Example Flow
User says: "Create a Python file with a retry function"
```
🎙️ Audio recorded
        ↓
📝 Transcribed: "Create a Python file with a retry function"
        ↓
🧠 Intent: write_code | language: python | filename: retry.py
        ↓
⚡ LLM generates Python retry function code
        ↓
💾 Saved to: output/retry.py
        ↓
✅ UI displays transcription, intent, action, and code preview
```
Project Structure
```
voice-ai-agent/
├── app.py             # Gradio UI
├── stt.py             # Speech-to-Text (Whisper)
├── intent.py          # Intent classification (LLaMA3)
├── tools.py           # Tool execution
├── requirements.txt   # Dependencies
└── output/            # Generated files (sandboxed)
```
How to Run It
```bash
# 1. Clone the repo
git clone https://github.com/AryanJaitely/voice-ai-agent.git
cd voice-ai-agent

# 2. Install dependencies
pip install gradio requests python-dotenv

# 3. Create .env file
echo "STT_BACKEND=groq" > .env
echo "LLM_BACKEND=groq" >> .env
echo "GROQ_API_KEY=your_key_here" >> .env

# 4. Run
python app.py
```
Get a free Groq API key at: https://console.groq.com
Conclusion
Building this voice agent taught me a lot about chaining AI models together reliably. The key insight is that each component (STT, intent, tools) should be independently swappable — this makes the system both resilient and flexible for different hardware constraints.
The full source code is available on GitHub:
👉 https://github.com/AryanJaitely/voice-ai-agent
Built with Python, Gradio, Groq, Whisper & LLaMA 3