Introduction
I built a voice-controlled AI agent that listens to you,
understands your intent, and executes actions on your local machine,
all through a clean web UI. In this article, I'll walk through the
architecture, the models I chose, and the challenges I faced building
it on Windows.
What It Does
You speak a command like "Create a Python file with a retry function"
and the agent:
- Transcribes your audio to text
- Detects your intent using a local LLM
- Executes the right action (generates code, creates files, summarizes text)
- Shows everything in a Streamlit UI
Architecture
Audio Input → Groq Whisper STT → Ollama LLM (Intent) → Tool Execution → Streamlit UI
Components:
- STT: Groq Whisper large-v3 API
- LLM: llama3.2 via Ollama (runs 100% locally)
- UI: Streamlit
- Tools: File creation, code generation, text summarization, general chat
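The pipeline above can be sketched as a small dispatcher: classify the transcript, then route it to the matching tool. The function and handler names here are my placeholders, not necessarily what the project uses.

```python
def handle_command(transcript: str, classify, tools: dict) -> str:
    """Route a transcribed command to the right tool.

    classify: callable mapping text -> intent label (an LLM call in the app)
    tools: mapping from intent label to handler function
    """
    intent = classify(transcript)
    handler = tools.get(intent, tools["GENERAL_CHAT"])  # fall back to chat
    return handler(transcript)

# Stub handlers to show the wiring; the real tools write files, call the
# LLM for code generation, etc.
tools = {
    "WRITE_CODE": lambda t: f"[code generated for: {t}]",
    "CREATE_FILE": lambda t: f"[file created for: {t}]",
    "SUMMARIZE": lambda t: f"[summary of: {t}]",
    "GENERAL_CHAT": lambda t: f"[chat reply to: {t}]",
}
```

Keeping the dispatch table as plain data makes adding a new intent a one-line change.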
Models I Chose
Speech-to-Text: Groq Whisper
I initially planned to run OpenAI Whisper locally via HuggingFace.
However, Whisper depends on ffmpeg, which caused PATH issues on Windows.
I switched to Groq's Whisper API, which is free, fast, and accepts
all common audio formats without any local setup.
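The transcription call can be wrapped in a small helper like the one below. This is a sketch assuming the Groq Python SDK's OpenAI-style audio endpoint; exact parameter names may differ across SDK versions, and the client is injected so the helper stays testable.

```python
import os

def transcribe(audio_path: str, client) -> str:
    """Send an audio file to Groq's Whisper endpoint and return the text.

    client: a Groq client instance, e.g. groq.Groq(api_key=...)
    """
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(os.path.basename(audio_path), f),
            model="whisper-large-v3",
        )
    return result.text
```

In the app this is called once per recording, with the returned text fed straight into intent classification.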
LLM: llama3.2 via Ollama
I chose Ollama for local LLM inference because it's easy to set up
on Windows and runs completely offline. llama3.2 provided a good
balance between speed and accuracy for intent classification.
Intent Classification
The LLM classifies user speech into four intents:
- WRITE_CODE — generates and saves code to output/
- CREATE_FILE — creates a new file in output/
- SUMMARIZE — summarizes provided text
- GENERAL_CHAT — general conversation
I used structured JSON prompting to get consistent output from the LLM:
SYSTEM_PROMPT = """Classify the intent into one of:
WRITE_CODE, CREATE_FILE, SUMMARIZE, GENERAL_CHAT
Respond in JSON format only."""
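Putting that prompt to work looks roughly like this. I'm assuming the model replies with a JSON object keyed `"intent"` (the key name is my assumption, not stated in the prompt), and the `ollama.chat` call follows the Ollama Python client; a defensive parser falls back to GENERAL_CHAT when the model returns malformed JSON.

```python
import json

# System prompt from the section above, reproduced so this snippet is
# self-contained.
SYSTEM_PROMPT = """Classify the intent into one of:
WRITE_CODE, CREATE_FILE, SUMMARIZE, GENERAL_CHAT
Respond in JSON format only."""

VALID_INTENTS = {"WRITE_CODE", "CREATE_FILE", "SUMMARIZE", "GENERAL_CHAT"}

def parse_intent(raw: str) -> str:
    """Pull the intent label out of the model's JSON reply, falling back
    to GENERAL_CHAT on malformed or unexpected output."""
    try:
        intent = json.loads(raw).get("intent", "")
    except (json.JSONDecodeError, AttributeError):
        return "GENERAL_CHAT"
    return intent if intent in VALID_INTENTS else "GENERAL_CHAT"

def classify(text: str) -> str:
    """Ask the local llama3.2 model to classify the transcript."""
    import ollama  # requires a running Ollama server
    reply = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return parse_intent(reply["message"]["content"])
```

The fallback matters in practice: small local models occasionally wrap the JSON in prose, and defaulting to chat is safer than crashing mid-conversation.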
Challenges I Faced
1. ffmpeg on Windows
Whisper requires ffmpeg, but adding it to PATH on Windows was
problematic because my OneDrive folder paths contain spaces. I solved
this by switching to Groq's API entirely.
2. Multiple Python versions
My machine had both Python 3.12 and 3.13 installed. Packages
installed under one version weren't visible to the other. I solved
this by always invoking the interpreter explicitly with py -3.12.
3. Streamlit state management
Button clicks in Streamlit trigger full page reruns, losing
previous results. I solved this using st.session_state to persist
transcription and intent results across reruns.
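The pattern looks like the sketch below. Since st.session_state behaves like a dict, the helper is shown against a plain dict; the key name "transcript" and the commented app wiring are my placeholders, not necessarily the project's.

```python
def remember(state, key, compute):
    """Compute a result once and keep it across reruns.

    state: st.session_state in the real app (any dict-like works)
    compute: zero-arg callable that produces the value on first run
    """
    if key not in state:
        state[key] = compute()
    return state[key]

# In the Streamlit app, every button click reruns the whole script:
#
#   if st.button("Transcribe"):
#       st.session_state["transcript"] = transcribe(audio_path, client)
#   if "transcript" in st.session_state:      # survives later reruns
#       st.write(st.session_state["transcript"])
```

The key idea is that session state outlives the script execution, so results rendered after one button click are still there when the next click triggers a rerun.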
4. API Key Security
GitHub Push Protection blocked my push because my Groq API key
was hardcoded. I fixed this by using python-dotenv with a .env
file and environment variables.
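The fix amounts to reading the key from the environment at startup and failing loudly if it's absent. A minimal sketch, with the dotenv bootstrap shown in comments since it needs a real .env file:

```python
import os

# In the real app: pip install python-dotenv, then at startup:
#   from dotenv import load_dotenv
#   load_dotenv()   # reads GROQ_API_KEY=... from a git-ignored .env file

def load_api_key(env=os.environ, name: str = "GROQ_API_KEY") -> str:
    """Fetch the key from the environment, raising if it's missing so a
    misconfigured setup fails at startup rather than mid-request."""
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} not set; add it to your .env file")
    return key
```

Adding .env to .gitignore is the other half of the fix, so the key never reaches the repository again.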
Safety
All file operations are restricted to an output/ folder to prevent
accidental system file overwrites.
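One way to enforce that restriction is to resolve every requested filename against the output/ directory and reject anything that escapes it. This is a sketch of the idea, not necessarily how the project implements it:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve a requested filename inside output/, rejecting escapes
    like '../secrets.txt' or absolute paths."""
    candidate = (OUTPUT_DIR / filename).resolve()
    if OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}")
    return candidate
```

Resolving before checking is what catches "../" tricks: the comparison runs on the final absolute path, not the raw string.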
Demo
Watch the full demo here: https://youtu.be/S2PejSQGpAA
GitHub
Full source code: https://github.com/ayisha-parli/voice-agent
Conclusion
Building a voice AI agent that keeps LLM inference fully local is
very achievable with modern tools like Ollama and Groq. The biggest
challenges were Windows-specific setup issues rather than AI problems.
The final system works reliably and is easy to extend with new
intents and tools.