DEV Community

SWETHA K K
How I Built a Voice Controlled AI Agent That Listens, Thinks, and Acts

Introduction
What if you could just speak to your computer and have it create files, write code, or summarize text, all automatically? That's exactly what I built: a voice-controlled local AI agent that accepts audio input, figures out what you want, and executes it.
Here's how I built it, what I used, and what I learned along the way.

The Architecture
The pipeline has 5 stages:
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI Display

Audio Input — The user speaks into a microphone or uploads an audio file
Speech-to-Text — The audio is transcribed to text using Whisper
Intent Detection — An LLM reads the text and classifies what the user wants
Tool Execution — The right action is triggered (create file, write code, summarize, or chat)
UI Display — Everything is shown in a clean web interface
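The five stages can be sketched as one small driver function. This is an illustrative sketch, not the project's actual code: the three stage callables (`transcribe`, `classify_intent`, `execute`) are hypothetical names, injected as parameters so each stage can be swapped or mocked independently.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    transcript: str
    intent: str
    output: str

def run_pipeline(audio_path, transcribe, classify_intent, execute):
    """Chain the stages: audio -> text -> intent -> action."""
    transcript = transcribe(audio_path)      # Speech-to-Text (Whisper)
    intent = classify_intent(transcript)     # Intent Detection (LLM)
    output = execute(intent, transcript)     # Tool Execution
    return AgentResult(transcript, intent, output)  # rendered by the UI
```

Keeping the stages as plain functions makes the whole agent testable without a microphone or an API key.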

Models I Chose
Speech-to-Text: Groq Whisper Large v3
I originally planned to run Whisper locally via HuggingFace. However, my machine couldn't run it efficiently enough for real-time use. I switched to the Groq API, which runs Whisper Large v3 in the cloud at incredible speed — transcription happens in under a second.
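The transcription call might look roughly like this, assuming Groq's OpenAI-style Python SDK. The `transcribe` helper and its injected `client` parameter are my own names, not the repo's; in real use `client` would be a `groq.Groq()` instance.

```python
def transcribe(client, audio_path, model="whisper-large-v3"):
    """Send an audio file to Groq's Whisper endpoint and return plain text.

    `client` is injected so the network call can be mocked in tests.
    """
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(audio_path, f.read()),
            model=model,
            response_format="text",  # ask for raw text instead of a JSON wrapper
        )
    return result
```

Because the client is a parameter, the same function works against the real API or a stub.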
LLM: LLaMA 3.3-70b via Groq
For intent classification and response generation, I used LLaMA 3.3-70b served through Groq. I chose this because:

It's free to use on Groq's generous free tier
It's extremely fast (Groq's hardware is purpose-built for LLM inference)
It follows structured JSON instructions reliably, which is critical for intent classification
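As an illustration, intent classification could be a single chat-completion call with a strict system prompt. This is a hedged sketch: the model string `llama-3.3-70b-versatile` is my assumption of Groq's serving name for LLaMA 3.3-70b, and the `client` is injected so the example is testable offline.

```python
import json

SYSTEM_PROMPT = (
    "You are an intent classifier. Respond with ONLY a JSON object, "
    'no preamble, of the form {"intent": '
    '"<create_file|write_code|summarize|general_chat>"}.'
)

def classify_intent(client, transcript, model="llama-3.3-70b-versatile"):
    """Ask the LLM to label the transcript with one of four intents."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        temperature=0,  # deterministic labels parse more reliably
    )
    return json.loads(resp.choices[0].message.content)["intent"]
```

Setting `temperature=0` is a small but useful trick: classification doesn't need creativity, and deterministic output keeps the JSON parser happy.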

UI: Gradio
I used Gradio to build the frontend. It lets you spin up a web UI with just a few lines of Python — perfect for a project like this.

Supported Intents
The agent can detect and handle four intents:

Create File — Creates a .txt file in the output/ folder
Write Code — Generates Python code and saves it as a .py file
Summarize — Summarizes the spoken content and saves it
General Chat — Has a normal conversation
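The four intents map naturally onto a dispatch table. The handler bodies and filenames below are placeholders of my own, not the repo's actual ones, but they show the shape of the idea:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def save(name, content):
    """Write content under output/ and return the path."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    path = OUTPUT_DIR / name
    path.write_text(content, encoding="utf-8")
    return str(path)

# Hypothetical handlers -- stand-ins for the real tool implementations.
HANDLERS = {
    "create_file":  lambda text: save("note.txt", text),
    "write_code":   lambda text: save("generated.py", text),
    "summarize":    lambda text: save("summary.txt", text),
    "general_chat": lambda text: text,  # no file, just a reply
}

def execute(intent, text):
    # Fall back to plain chat if the model returns an unknown label
    return HANDLERS.get(intent, HANDLERS["general_chat"])(text)
```

The fallback matters in practice: even with a strict prompt, an LLM occasionally invents a label, and degrading to chat beats crashing.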

Challenges I Faced

  1. Running Whisper locally: My biggest challenge was getting speech-to-text to work. Running Whisper locally via HuggingFace required significant RAM and GPU, which my machine didn't have. Switching to the Groq API solved this instantly.
  2. Getting structured JSON from the LLM: For intent detection, I needed the LLM to return clean JSON every time. Early on, it would sometimes wrap the JSON in extra explanation text, breaking the parser. I fixed this by making the system prompt very strict, telling it to return only JSON with no preamble.
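A strict prompt helps, but a defensive parser is a good second line of defense. Here's a sketch (my own helper, assuming flat, non-nested JSON objects like the intent label) that pulls the first `{...}` out of a reply even when the model adds chatter around it:

```python
import json
import re

def extract_json(reply):
    """Pull the first flat {...} object out of an LLM reply.

    Handles replies like 'Sure! Here you go: {"intent": "summarize"}'
    that would break a plain json.loads() call. The non-greedy match
    assumes the object has no nested braces.
    """
    match = re.search(r"\{.*?\}", reply, re.DOTALL)
    if not match:
        raise ValueError(f"no JSON object found in reply: {reply!r}")
    return json.loads(match.group(0))
```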
  3. Windows file extension issues: A surprisingly tricky one. Windows hides file extensions by default, so saving agent.py in Notepad actually created agent.py.txt. I had to enable "Show file extensions" in File Explorer to fix it.

Example Flow
User says: "Create a Python file with a retry function"

Groq Whisper transcribes the audio to text
LLaMA detects intent: write_code
LLaMA generates the Python retry function
File is saved to output/code_20260415.py
The UI shows the transcription, intent, and the generated code
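For this request, the saved file might contain something like the retry decorator below. This is an illustration of the kind of code the agent generates, not its literal output:

```python
import functools
import time

def retry(times=3, delay=1.0, exceptions=(Exception,)):
    """Retry a function up to `times` times, sleeping `delay` seconds between tries."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts, re-raise the last error
                    time.sleep(delay)
        return wrapper
    return decorator
```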

What I'd Improve Next

Add support for compound commands ("summarize this and save it to a file")
Add a confirmation prompt before executing file operations
Support more intents like web search or sending emails
Build a persistent session memory so the agent remembers context

Conclusion
Building this agent taught me how powerful combining simple APIs can be. Groq's speed makes real-time voice interaction actually feel snappy, and Gradio makes deploying a UI embarrassingly easy.
The full code is available on GitHub: https://github.com/swetha-kk/voice-agent
