Introduction
I recently built a voice-controlled AI agent that accepts audio input, detects the user's intent, and executes local actions automatically. In this article I'll walk through the architecture, the models I chose, and the challenges I faced while building it.
What the Agent Does
The agent supports four intents:
- Create File — creates a new file in a dedicated output folder
- Write Code — generates code using an LLM and saves it to a file
- Summarize — summarizes provided text in 3-4 lines
- General Chat — answers general questions conversationally
Architecture
The pipeline has five components:
1. stt.py — Speech to Text
Converts uploaded audio to text using Groq's Whisper large-v3 model.
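A minimal sketch of what this step can look like with the Groq Python SDK (the function name and structure here are my own, not necessarily what stt.py uses; the SDK is imported lazily so the snippet loads even without it installed):

```python
def transcribe(audio_path: str) -> str:
    """Send an audio file to Groq's hosted Whisper large-v3 and return the text."""
    from groq import Groq  # lazy import: requires the groq package

    client = Groq()  # reads GROQ_API_KEY from the environment
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",
        )
    return result.text
```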
2. intent.py — Intent Detection
Sends the transcribed text to Llama 3.3-70b with a structured prompt asking it to return a JSON object containing the intent, details, and suggested filename.
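The contract between the prompt and the parser is the important part: the model must return exactly those three fields. A sketch of that contract with a canned model response (prompt wording and field names are illustrative, not the exact ones in intent.py):

```python
import json

# Illustrative prompt: the model is told to answer with ONLY a JSON object.
INTENT_PROMPT = (
    "Classify the user's command into one of: create_file, write_code, "
    "summarize, general_chat. Respond with ONLY a JSON object of the form "
    '{"intent": ..., "details": ..., "filename": ...}.'
)

def parse_intent(llm_response: str) -> dict:
    """Parse the model's JSON reply and fail fast if a field is missing."""
    data = json.loads(llm_response)
    for key in ("intent", "details", "filename"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    return data

# Example with a canned response instead of a live API call:
parsed = parse_intent(
    '{"intent": "create_file", "details": "empty notes file", "filename": "notes.txt"}'
)
```

Validating the fields up front means a malformed reply surfaces as one clear error instead of a crash deeper in the pipeline.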
3. tools.py — Tool Execution
Based on the detected intent, executes the appropriate action. All file operations are restricted to an output/ folder for safety.
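One way to enforce that restriction is to resolve every requested filename against the output folder and reject anything that escapes it. This is a sketch under my own naming, not necessarily the exact check in tools.py:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_path(filename: str) -> Path:
    """Resolve filename inside OUTPUT_DIR, rejecting traversal like '../x'."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    candidate = (OUTPUT_DIR / filename).resolve()
    # After resolving, OUTPUT_DIR must still be an ancestor of the target.
    if OUTPUT_DIR.resolve() not in candidate.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}/")
    return candidate
```

Resolving first and then checking ancestry catches `../` tricks that a plain string prefix check would miss.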
4. memory.py — Session Memory
Maintains a list of all commands executed during the session, displayed in the UI with expand/collapse functionality.
5. app.py — Streamlit UI
Connects all components and displays the transcription, detected intent, confirmation prompt, result, and session history.
Models I Chose
Whisper large-v3 via Groq API for speech to text. I chose Groq over running Whisper locally because local inference on my machine is not fast enough for real-time use. Groq's inference is extremely fast, which keeps the pipeline responsive.
Llama 3.3-70b-versatile via Groq API for intent detection and tool execution. I chose this model because it follows structured JSON instructions reliably, which is critical for intent classification, and it generates high-quality code for the write_code intent.
Challenges I Faced
1. JSON parsing from LLM responses
Llama sometimes wraps its JSON response in Markdown code fences (```json ... ```). This caused json.loads() to crash. I fixed this by stripping the backticks before parsing.
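A minimal version of that stripping step (a sketch; the real helper may differ in details):

```python
import json

def strip_fences(text: str) -> str:
    """Remove a wrapping Markdown code fence, if present, before json.loads()."""
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else ""  # drop the opening fence line
        text = text.rsplit("```", 1)[0]                        # drop the closing fence
    return text.strip()

raw = '```json\n{"intent": "summarize"}\n```'
data = json.loads(strip_fences(raw))  # parses cleanly now
```

Plain JSON passes through untouched, so the helper is safe to apply to every response.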
2. Audio format conversion
The Groq Whisper API does not support .ogg format directly. I used pydub to automatically convert any uploaded audio format to .wav before sending it to Whisper.
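The conversion step can be as short as this sketch (function name is mine; pydub also needs ffmpeg available on PATH, so the import is deferred):

```python
def to_wav(input_path: str) -> str:
    """Convert any audio file pydub can read (ogg, mp3, m4a, ...) to .wav."""
    from pydub import AudioSegment  # lazy import: requires pydub + ffmpeg

    audio = AudioSegment.from_file(input_path)  # format inferred from the file
    wav_path = input_path.rsplit(".", 1)[0] + ".wav"
    audio.export(wav_path, format="wav")
    return wav_path
```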
3. Streamlit session state
When the user clicks Yes on the confirmation prompt, Streamlit reruns the entire script, which resets all local variables. I fixed this by storing the transcribed text, intent result, and output in st.session_state so they persist across reruns.
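The pattern, sketched with a plain dict standing in for st.session_state so it runs outside Streamlit (the key names are illustrative):

```python
# Stand-in for st.session_state: a dict that survives "reruns".
session_state = {}

def handle_audio(transcript: str, intent: dict) -> None:
    # Persist results so a rerun does not wipe them.
    session_state["transcript"] = transcript
    session_state["intent"] = intent

def on_rerun() -> dict:
    # On every rerun, read back whatever survived in session state
    # instead of recomputing from scratch.
    return {
        "transcript": session_state.get("transcript", ""),
        "intent": session_state.get("intent", {}),
    }

handle_audio("create a file called notes.txt", {"intent": "create_file"})
state = on_rerun()  # values persist across the simulated rerun
```

In the real app the same reads and writes go through `st.session_state`, which Streamlit keeps alive across script reruns.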
4. Python 3.13 compatibility
The audioop module was removed in Python 3.13, which broke pydub. I fixed this by installing the audioop-lts package which brings back the missing module.
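If you hit the same ModuleNotFoundError on Python 3.13, the workaround is a single install:

```shell
pip install audioop-lts
```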
Bonus Features
- Human-in-the-loop — confirmation prompt before any file operation
- Session Memory — full history of all commands in the UI
- Auto Format Conversion — handles ogg, mp3, m4a, wav automatically
- Graceful Degradation — markdown-fence stripping so LLM responses parse reliably
GitHub Repository
https://github.com/deeptiverma12/voice-local-agent
Conclusion
Building this agent taught me how to wire together STT, LLM intent detection, and local tool execution into a clean pipeline. The biggest lesson was handling Streamlit's rerun behaviour with session state. Overall, it was a very practical project that shows how voice can serve as a natural interface for AI agents.