Introduction
What if you could just speak to your computer and have it create files, write code, or summarize text automatically? That's exactly what I built: a voice-controlled local AI agent that accepts audio input, figures out what you want, and executes it.
Here's how I built it, what I used, and what I learned along the way.
The Architecture
The pipeline has five stages:
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI Display
Audio Input — The user speaks into a microphone or uploads an audio file
Speech-to-Text — The audio is transcribed to text using Whisper
Intent Detection — An LLM reads the text and classifies what the user wants
Tool Execution — The right action is triggered (create file, write code, summarize, or chat)
UI Display — Everything is shown in a clean web interface
Models I Chose
Speech-to-Text: Groq Whisper Large v3
I originally planned to run Whisper locally via HuggingFace. However, my machine couldn't run it efficiently enough for real-time use. I switched to the Groq API, which runs Whisper Large v3 in the cloud at incredible speed — transcription happens in under a second.
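The transcription call can be sketched like this; it assumes the `groq` Python package is installed and `GROQ_API_KEY` is set in the environment, and the helper name `transcribe` is mine:

```python
def transcribe(audio_path: str) -> str:
    """Send an audio file to Groq's hosted Whisper Large v3 and return the text."""
    from groq import Groq  # assumes `pip install groq` and GROQ_API_KEY in the env

    client = Groq()  # picks up the API key from the environment
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",
        )
    return result.text
```

Because the heavy lifting happens on Groq's side, the local machine only needs to read the file and make one HTTP call.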
LLM: LLaMA 3.3-70b via Groq
For intent classification and response generation, I used LLaMA 3.3-70b served through Groq. I chose this because:
It fits comfortably within Groq's generous free tier
It's extremely fast (Groq's hardware is purpose-built for LLM inference)
It follows structured JSON instructions reliably, which is critical for intent classification
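The classification step can be sketched as below. The prompt wording, intent names, and helper functions are illustrative, not the exact ones from my code; the messages are meant for a Groq chat completion against LLaMA 3.3-70b:

```python
import json

INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

SYSTEM_PROMPT = (
    "You are an intent classifier. Respond with ONLY a JSON object, no preamble, "
    'of the form {"intent": "<one of create_file, write_code, summarize, general_chat>"}.'
)

def build_messages(transcript: str) -> list:
    """Messages for a chat completion request (e.g. model llama-3.3-70b-versatile)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript},
    ]

def parse_intent(raw_reply: str) -> str:
    """Parse the model's JSON reply; fall back to general_chat if it's malformed."""
    try:
        intent = json.loads(raw_reply).get("intent", "")
    except json.JSONDecodeError:
        return "general_chat"
    return intent if intent in INTENTS else "general_chat"
```

Falling back to `general_chat` on bad output means a malformed reply degrades to a normal conversation instead of crashing the pipeline.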
UI: Gradio
I used Gradio to build the frontend. It lets you spin up a web UI with just a few lines of Python — perfect for a project like this.
Supported Intents
The agent can detect and handle four intents:
Create File — Creates a .txt file in the output/ folder
Write Code — Generates Python code and saves it as a .py file
Summarize — Summarizes the spoken content and saves it
General Chat — Has a normal conversation
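Routing from a detected intent to an action is just a dispatch table. The handler bodies below are stubs standing in for the real file-writing and LLM calls:

```python
def create_file(text):
    return f"create_file: {text}"  # real version writes a .txt into output/

def write_code(text):
    return f"write_code: {text}"  # real version generates and saves a .py file

def summarize(text):
    return f"summarize: {text}"  # real version saves a summary

def general_chat(text):
    return f"chat: {text}"  # real version returns a conversational reply

HANDLERS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def execute(intent: str, text: str) -> str:
    # Anything unrecognized falls back to a normal chat response.
    return HANDLERS.get(intent, general_chat)(text)
```

Adding a fifth intent then means writing one handler and adding one dictionary entry.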
Challenges I Faced
- Running Whisper locally: My biggest challenge was getting speech-to-text to work. Running Whisper locally via HuggingFace needed more RAM and GPU power than my machine had. Switching to the Groq API solved this instantly.
- Getting structured JSON from the LLM: For intent detection, I needed the LLM to return clean JSON every time. Early on, it would sometimes wrap the JSON in extra explanation text, breaking the parser. I fixed this by making the system prompt very strict: return only JSON, with no preamble.
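Alongside a strict prompt, a defensive parser helps. This sketch (the helper name is mine) strips any surrounding chatter before parsing:

```python
import json
import re

def extract_json(reply: str):
    """Pull the first {...} object out of a model reply, tolerating preamble text."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

With both defenses in place, an occasional "Sure! Here's the JSON: {...}" reply no longer breaks anything.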
- Windows file extension issues: A surprisingly tricky one. Windows hides file extensions by default, so saving agent.py from Notepad actually created agent.py.txt. Enabling "Show file extensions" in File Explorer fixed it.
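The same trap can bite any filename the agent handles, so normalizing extensions in code is cheap insurance. A small sketch (the function name is mine):

```python
def fix_extension(name: str, wanted: str = ".py") -> str:
    """Normalize a filename to the intended extension."""
    if name.endswith(wanted + ".txt"):  # the hidden-extension trap: agent.py.txt
        name = name[: -len(".txt")]
    if not name.endswith(wanted):
        name += wanted
    return name
```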
Example Flow
User says: "Create a Python file with a retry function"
Groq Whisper transcribes the audio to text
LLaMA detects intent: write_code
LLaMA generates the Python retry function
File is saved to output/code_20260415.py
The UI shows the transcription, intent, and the generated code
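The date-stamped output path in step 4 can be generated like this; the pattern is inferred from the example filename above:

```python
from datetime import datetime
from pathlib import Path

def output_path(prefix: str = "code", suffix: str = ".py") -> Path:
    """Build a date-stamped path like output/code_20260415.py."""
    stamp = datetime.now().strftime("%Y%m%d")
    return Path("output") / f"{prefix}_{stamp}{suffix}"
```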
What I'd Improve Next
Add support for compound commands ("summarize this and save it to a file")
Add a confirmation prompt before executing file operations
Support more intents like web search or sending emails
Build a persistent session memory so the agent remembers context
Conclusion
Building this agent taught me how powerful combining simple APIs can be. Groq's speed makes real-time voice interaction actually feel snappy, and Gradio makes deploying a UI embarrassingly easy.
The full code is available on GitHub: https://github.com/swetha-kk/voice-agent