Building a Voice-Controlled Local AI Agent: Architecture, Models & Lessons Learned
I built a voice agent that listens to your commands, understands your intent, and executes real actions on your machine — creating files, writing code, summarizing text, and more. Here's exactly how I built it.
What It Does
You speak (or upload audio). The system:
- Transcribes your audio using Whisper
- Classifies your intent using an LLM (Claude / GPT-4 / Ollama)
- Executes the action locally (creates files, generates code, summarizes text)
- Shows you the full pipeline result in a clean Streamlit UI
Example: Say "Create a Python file with a retry decorator" → the agent generates the code and saves it to output/retry_decorator.py automatically.
Architecture
```
Audio Input (mic or file)
        ↓
STT Module (Whisper) → transcript text
        ↓
Intent Module (LLM) → { intent, parameters }
        ↓
Executor → file / code / summarize / chat
        ↓
Session Memory → context for follow-up commands
        ↓
Streamlit UI → shows all 4 pipeline steps
```
Each layer is modular — swap the STT provider, LLM, or executor without touching anything else.
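To make the modularity concrete, here is a minimal sketch of how the four stages could be wired together as swappable callables. The function and parameter names are hypothetical, not the actual names in the repo's `src/` modules:

```python
def run_pipeline(audio_bytes, stt, classify_intent, execute, memory):
    """Wire the four stages together; each stage is an injected callable,
    so swapping the STT provider or LLM means passing a different function."""
    transcript = stt(audio_bytes)                 # Stage 1: speech-to-text
    intent = classify_intent(transcript, memory)  # Stage 2: intent JSON
    result = execute(intent)                      # Stage 3: tool execution
    memory.append({"transcript": transcript, "result": result})  # Stage 4
    return {"transcript": transcript, "intent": intent, "result": result}
```

Because each stage only sees the previous stage's output, replacing one layer never requires touching the others.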
Stage 1: Speech-to-Text
I implemented three STT options in src/stt.py:
- OpenAI Whisper API (default) — 1-2s latency, no GPU needed
- Groq Whisper API — even faster, generous free tier
- HuggingFace Whisper (local) — fully offline, uses `openai/whisper-base`
Why I defaulted to cloud Whisper instead of local
Running whisper-base on CPU takes 20–40 seconds per 10-second clip. whisper-large-v3 takes several minutes without a GPU. For a responsive demo, that latency is a dealbreaker.
The cloud API returns in ~1-2 seconds regardless of hardware. The local HuggingFace option is still available via a sidebar toggle for users with a GPU.
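A dispatcher along these lines can back the sidebar toggle. This is an illustrative sketch, not the actual `src/stt.py` code; the provider names and model IDs are assumptions, and each backend is imported lazily so missing packages only matter for the branch you actually use:

```python
def transcribe(audio_path: str, provider: str = "openai") -> str:
    """Route an audio file to one of three STT backends."""
    if provider == "openai":
        from openai import OpenAI  # lazy import: only needed for this branch
        with open(audio_path, "rb") as f:
            resp = OpenAI().audio.transcriptions.create(model="whisper-1", file=f)
        return resp.text
    if provider == "groq":
        from groq import Groq  # Groq's SDK mirrors the OpenAI client shape
        with open(audio_path, "rb") as f:
            resp = Groq().audio.transcriptions.create(model="whisper-large-v3", file=f)
        return resp.text
    if provider == "local":
        from transformers import pipeline  # runs openai/whisper-base offline
        asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
        return asr(audio_path)["text"]
    raise ValueError(f"unknown STT provider: {provider}")
```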
STT Benchmark
| Provider | Latency (10s clip) | Accuracy | Cost |
|---|---|---|---|
| OpenAI Whisper API | ~1-2s | Excellent | $0.006/min |
| Groq Whisper | ~0.5-1s | Excellent | Free tier |
| HuggingFace local (CPU) | ~30-40s | Good | Free |
| HuggingFace local (GPU) | ~2-3s | Good | Free |
Stage 2: Intent Classification
The intent module (src/intent.py) sends the transcript to an LLM with a structured system prompt and gets back a JSON object:
```json
{
  "intents": ["write_code"],
  "intent": "write_code",
  "parameters": {
    "language": "python",
    "filename": "retry_decorator.py",
    "description": "retry decorator function"
  },
  "confidence": 0.97
}
```
Supported Intents
| Intent | Example Command |
|---|---|
| `write_code` | "Create a Python retry function" |
| `create_file` | "Make a new file called notes.txt" |
| `summarize_text` | "Summarize this article: ..." |
| `general_chat` | "What is recursion?" |
| `create_folder` | "Create a folder called experiments" |
| `list_files` | "What files have been created?" |
LLM Benchmark
| Provider | Latency | Intent Accuracy | Cost |
|---|---|---|---|
| Claude (Anthropic) | ~1-2s | ~97% | Low |
| GPT-4o-mini | ~1-2s | ~95% | Very low |
| Ollama llama3 (local) | ~3-8s | ~88% | Free |
Claude gave the most reliable structured JSON output with fewest parsing errors, especially for compound commands.
Stage 3: Tool Execution
The executor (src/executor.py) routes each intent to its handler:
- `write_code` — prompts the LLM with a code-generation system prompt, strips markdown fences, saves to `output/`
- `create_file` — writes content directly to `output/`
- `summarize_text` — calls the LLM with a summarizer prompt, optionally saves the output
- `general_chat` — returns a conversational response
All file operations are sandboxed to the output/ folder. Filenames are sanitized to prevent path traversal.
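One way to implement that sandbox is to reduce every filename to its basename, whitelist the remaining characters, and verify the resolved path still lives under `output/`. This is a sketch of the idea, not the project's actual sanitizer; `safe_output_path` is a hypothetical name:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_output_path(filename: str) -> Path:
    # Keep only the basename: drops directory components like "../../etc"
    name = Path(filename).name
    # Whitelist a conservative character set for the remaining name
    name = "".join(c for c in name if c.isalnum() or c in "._-")
    if not name or name.startswith("."):
        raise ValueError(f"unsafe filename: {filename!r}")
    candidate = (OUTPUT_DIR / name).resolve()
    # Belt and braces: the resolved path must stay inside output/
    if OUTPUT_DIR.resolve() not in candidate.parents:
        raise ValueError(f"path escapes sandbox: {filename!r}")
    return candidate
```

A traversal attempt like `"../../etc/passwd"` is reduced to a plain `passwd` inside `output/` rather than escaping the sandbox.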
Bonus Features
Compound Commands
The intent classifier returns an array of intents. If multiple are detected, the executor chains them:
"Summarize this and save it to summary.txt"
→ [summarize_text, create_file]
→ Step 1: summarize → Step 2: save output to output/summary.txt
Human-in-the-Loop
A sidebar toggle (on by default) shows a confirmation prompt before any file write operation. Users must click "Confirm & Execute" before the agent touches the filesystem.
Session Memory
src/memory.py stores the last 10 commands and their outcomes. This context is passed to the intent classifier on every request, enabling follow-up commands like "now save that to a file".
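A bounded rolling window like this can be built on `collections.deque` with `maxlen`, which silently evicts the oldest entry once full. A sketch of the idea (the class and method names are assumptions, not the real `src/memory.py` API):

```python
from collections import deque

class SessionMemory:
    """Rolling window of the last N commands and their outcomes."""

    def __init__(self, max_items: int = 10):
        # deque with maxlen drops the oldest item automatically
        self._items = deque(maxlen=max_items)

    def add(self, command: str, outcome: str) -> None:
        self._items.append({"command": command, "outcome": outcome})

    def as_context(self) -> str:
        # Rendered into the intent classifier's prompt on every request
        return "\n".join(
            f"user: {i['command']} -> {i['outcome']}" for i in self._items
        )
```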
Graceful Degradation
Every stage is wrapped in try/except. STT failures, LLM API errors, and unrecognized intents all surface clearly in the UI with actionable error messages instead of crashing.
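One tidy way to apply that wrapping uniformly is a decorator that converts exceptions into a UI-friendly result dict. This is a sketch of the pattern, not the project's actual error-handling code:

```python
import functools

def graceful(stage_name: str):
    """Wrap a pipeline stage so a failure returns a labeled error dict
    the UI can display, instead of propagating and crashing the app."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return {"ok": True, "result": fn(*args, **kwargs)}
            except Exception as exc:
                return {"ok": False, "error": f"{stage_name} failed: {exc}"}
        return wrapper
    return decorator
```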
The UI
Built with Streamlit, styled with a custom dark industrial theme. The UI displays all four pipeline steps as distinct cards:
- Transcription — what was heard
- Intent — colored badge showing detected intent(s) + parameters
- Action — what the agent did
- Result — the output (code preview, summary, file confirmation)
A session history panel on the right tracks all commands in the current session.
Challenges
1. Structured JSON from LLMs is fragile
LLMs sometimes wrap JSON in markdown fences or add preamble text. I wrote a robust _parse_intent_json() function that strips fences, handles missing fields, and falls back to general_chat on parse failure.
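A tolerant parser along these lines handles the common failure modes: fenced output, preamble text, missing keys, and outright garbage. This is an illustrative reconstruction, not the repo's exact `_parse_intent_json()`:

```python
import json
import re

def parse_intent_json(raw: str) -> dict:
    """Tolerant parse: strip markdown fences and preamble, fill in
    missing fields, fall back to general_chat when nothing parses."""
    text = raw.strip()
    # Pull out the first fenced block if the model wrapped its answer
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1).strip()
    else:
        # Otherwise grab the first {...} span, ignoring any preamble
        brace = re.search(r"\{.*\}", text, re.DOTALL)
        text = brace.group(0) if brace else text
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"intent": "general_chat", "intents": ["general_chat"], "parameters": {}}
    # Fill defaults so downstream code never hits a missing key
    data.setdefault("intent", "general_chat")
    data.setdefault("intents", [data["intent"]])
    data.setdefault("parameters", {})
    return data
```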
2. Compound command detection
Getting the LLM to reliably return multiple intents required careful prompt engineering. The system prompt had to explicitly define the intents array format and give examples of compound commands.
3. Filename inference
When users don't specify a filename ("create a Python retry function"), the agent needs to infer a sensible one. I built _infer_filename() that strips stop words and constructs something like retry_function.py from the command description.
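The inference step can be as simple as tokenizing the description, dropping stop words (including the language name itself, which belongs in the extension rather than the stem), and joining what's left into snake_case. A sketch with an assumed stop-word list, not the actual `_infer_filename()` implementation:

```python
import re

STOP_WORDS = {"a", "an", "the", "create", "make", "write", "please", "with", "called"}
EXT = {"python": ".py", "javascript": ".js", "text": ".txt"}

def infer_filename(description: str, language: str = "python") -> str:
    """Build a sensible filename like retry_function.py from a command."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    # Drop filler words and the language name; cap the stem at 4 words
    skip = STOP_WORDS | {language.lower()}
    keep = [w for w in words if w not in skip][:4]
    stem = "_".join(keep) or "untitled"
    return stem + EXT.get(language.lower(), ".txt")
```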
4. Local Whisper performance
As noted above, local Whisper is impractically slow on CPU. The workaround (cloud API with local fallback) balances usability with the project's preference for local models.
Setup
```bash
git clone https://github.com/akashkhare315/Agent-ai-voice
cd Agent-ai-voice
pip install -r requirements.txt
cp .env.example .env   # add your API keys
streamlit run app.py
```
You need an `ANTHROPIC_API_KEY` (or `OPENAI_API_KEY`) for the LLM, plus an `OPENAI_API_KEY` for Whisper STT.
Conclusion
Building this agent taught me that the hardest part isn't any single model — it's the glue between them. Reliable JSON parsing, graceful error handling, and thoughtful UX (like human-in-the-loop confirmation) matter just as much as model accuracy.
The modular architecture means you can upgrade any layer independently: swap Whisper for a better local model when your hardware supports it, switch from Claude to a fully local Ollama model for offline use, or add new intents without touching the rest of the pipeline.