<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akash Khare</title>
    <description>The latest articles on DEV Community by Akash Khare (@akash_khare_ad5055a427660).</description>
    <link>https://dev.to/akash_khare_ad5055a427660</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871407%2F649c2f36-dbde-4a2b-beea-49bd47129466.png</url>
      <title>DEV Community: Akash Khare</title>
      <link>https://dev.to/akash_khare_ad5055a427660</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akash_khare_ad5055a427660"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent: Architecture, Models &amp; Lessons Learned</title>
      <dc:creator>Akash Khare</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:40:01 +0000</pubDate>
      <link>https://dev.to/akash_khare_ad5055a427660/building-a-voice-controlled-local-ai-agent-architecture-models-lessons-learned-2le0</link>
      <guid>https://dev.to/akash_khare_ad5055a427660/building-a-voice-controlled-local-ai-agent-architecture-models-lessons-learned-2le0</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent: Architecture, Models &amp;amp; Lessons Learned
&lt;/h1&gt;

&lt;p&gt;I built a voice agent that listens to your commands, understands your intent, and executes real actions on your machine — creating files, writing code, summarizing text, and more. Here's exactly how I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;You speak (or upload audio). The system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transcribes your audio using Whisper&lt;/li&gt;
&lt;li&gt;Classifies your intent using an LLM (Claude / GPT-4 / Ollama)&lt;/li&gt;
&lt;li&gt;Executes the action locally (creates files, generates code, summarizes text)&lt;/li&gt;
&lt;li&gt;Shows you the full pipeline result in a clean Streamlit UI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example: Say &lt;em&gt;"Create a Python file with a retry decorator"&lt;/em&gt; → the agent generates the code and saves it to &lt;code&gt;output/retry_decorator.py&lt;/code&gt; automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input (mic or file)
       ↓
STT Module (Whisper)  →  transcript text
       ↓
Intent Module (LLM)   →  { intent, parameters }
       ↓
Executor              →  file / code / summarize / chat
       ↓
Session Memory        →  context for follow-up commands
       ↓
Streamlit UI          →  shows all 4 pipeline steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is modular — swap the STT provider, LLM, or executor without touching anything else.&lt;/p&gt;
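&lt;p&gt;The wiring between the stages can be sketched in a few lines of Python. Note the stage functions below are hypothetical stubs standing in for the real modules, not the project's actual code:&lt;/p&gt;

```python
# Minimal sketch of the four-stage pipeline. Each stage is a plain
# function, so swapping a provider means swapping one callable.

def transcribe(audio_path):
    # Stage 1: STT (Whisper in the real project); stubbed here.
    return "create a python file with a retry decorator"

def classify(transcript, history):
    # Stage 2: intent classification via an LLM; stubbed here.
    return {"intent": "write_code",
            "parameters": {"filename": "retry_decorator.py"}}

def execute(intent_result):
    # Stage 3: route the intent to its handler.
    return f"executed {intent_result['intent']}"

def run_pipeline(audio_path, history):
    transcript = transcribe(audio_path)
    intent_result = classify(transcript, history)
    result = execute(intent_result)
    # Stage 4: session memory records context for follow-up commands.
    history.append({"transcript": transcript, "result": result})
    return transcript, intent_result, result
```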




&lt;h2&gt;
  
  
  Stage 1: Speech-to-Text
&lt;/h2&gt;

&lt;p&gt;I implemented three STT options in &lt;code&gt;src/stt.py&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Whisper API&lt;/strong&gt; (default) — 1-2s latency, no GPU needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groq Whisper API&lt;/strong&gt; — even faster, generous free tier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Whisper (local)&lt;/strong&gt; — fully offline, uses &lt;code&gt;openai/whisper-base&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why I defaulted to cloud Whisper instead of local
&lt;/h3&gt;

&lt;p&gt;Running &lt;code&gt;whisper-base&lt;/code&gt; on CPU takes 20–40 seconds per 10-second clip. &lt;code&gt;whisper-large-v3&lt;/code&gt; takes several minutes without a GPU. For a responsive demo, that latency is a dealbreaker.&lt;/p&gt;

&lt;p&gt;The cloud API returns in ~1-2 seconds regardless of hardware. The local HuggingFace option is still available via a sidebar toggle for users with a GPU.&lt;/p&gt;
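&lt;p&gt;One way to implement that toggle is ordered fallback: try each configured provider in preference order and move on when one fails. A sketch, with stand-in provider callables rather than the real &lt;code&gt;src/stt.py&lt;/code&gt; API:&lt;/p&gt;

```python
# Sketch of STT provider selection with graceful fallback.
# Both providers are simulated stubs for illustration.

def openai_whisper(audio):
    raise RuntimeError("no API key")          # simulate a cloud failure

def local_whisper(audio):
    return "hello world"                      # simulate a local result

PROVIDERS = [openai_whisper, local_whisper]   # preferred order first

def transcribe(audio):
    errors = []
    for provider in PROVIDERS:
        try:
            return provider(audio)
        except Exception as exc:
            errors.append(f"{provider.__name__}: {exc}")
    raise RuntimeError("all STT providers failed: " + "; ".join(errors))
```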

&lt;h3&gt;
  
  
  STT Benchmark
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Latency (10s clip)&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Whisper API&lt;/td&gt;
&lt;td&gt;~1-2s&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;$0.006/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Groq Whisper&lt;/td&gt;
&lt;td&gt;~0.5-1s&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HuggingFace local (CPU)&lt;/td&gt;
&lt;td&gt;~30-40s&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HuggingFace local (GPU)&lt;/td&gt;
&lt;td&gt;~2-3s&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Stage 2: Intent Classification
&lt;/h2&gt;

&lt;p&gt;The intent module (&lt;code&gt;src/intent.py&lt;/code&gt;) sends the transcript to an LLM with a structured system prompt and gets back a JSON object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry_decorator.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry decorator function"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Supported Intents
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;Example Command&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Create a Python retry function"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Make a new file called notes.txt"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;summarize_text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Summarize this article: ..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;general_chat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"What is recursion?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;create_folder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Create a folder called experiments"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_files&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"What files have been created?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
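&lt;p&gt;A table like this can feed directly into the classifier's system prompt, so adding an intent is a one-line change. The wording below is illustrative, not the project's actual prompt:&lt;/p&gt;

```python
# Sketch: bake the supported-intent list into the system prompt.

INTENTS = {
    "write_code": "generate and save source code",
    "create_file": "write content to a new file",
    "summarize_text": "summarize the provided text",
    "general_chat": "answer conversationally",
    "create_folder": "create a directory",
    "list_files": "list files created so far",
}

def build_system_prompt():
    lines = [f"- {name}: {desc}" for name, desc in INTENTS.items()]
    return (
        "Classify the user's command into one or more intents.\n"
        'Respond with JSON only: {"intents": [...], '
        '"parameters": {...}, "confidence": 0.0}\n'
        "Supported intents:\n" + "\n".join(lines)
    )
```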

&lt;h3&gt;
  
  
  LLM Benchmark
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Intent Accuracy&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude (Anthropic)&lt;/td&gt;
&lt;td&gt;~1-2s&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;~1-2s&lt;/td&gt;
&lt;td&gt;~95%&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama llama3 (local)&lt;/td&gt;
&lt;td&gt;~3-8s&lt;/td&gt;
&lt;td&gt;~88%&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude gave the most reliable structured JSON output with the fewest parsing errors, especially on compound commands.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3: Tool Execution
&lt;/h2&gt;

&lt;p&gt;The executor (&lt;code&gt;src/executor.py&lt;/code&gt;) routes each intent to its handler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;write_code&lt;/strong&gt; — prompts the LLM with a code-generation system prompt, strips markdown fences, saves to &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;create_file&lt;/strong&gt; — writes content directly to &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;summarize_text&lt;/strong&gt; — calls LLM with a summarizer prompt, optionally saves output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;general_chat&lt;/strong&gt; — returns a conversational response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file operations are sandboxed to the &lt;code&gt;output/&lt;/code&gt; folder. Filenames are sanitized to prevent path traversal.&lt;/p&gt;
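&lt;p&gt;A sketch of that sandboxing (the helper name &lt;code&gt;safe_path&lt;/code&gt; is hypothetical): keep only the basename, then double-check the resolved path still lives under &lt;code&gt;output/&lt;/code&gt;:&lt;/p&gt;

```python
import os

OUTPUT_DIR = "output"

def safe_path(filename):
    # Keep only the basename, dropping any directory components the
    # model (or a malicious transcript) may have injected.
    name = os.path.basename(filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        raise ValueError(f"invalid filename: {filename!r}")
    path = os.path.join(OUTPUT_DIR, name)
    # Belt and braces: resolve and confirm we stayed inside output/.
    root = os.path.realpath(OUTPUT_DIR) + os.sep
    if not os.path.realpath(path).startswith(root):
        raise ValueError(f"path escapes sandbox: {filename!r}")
    return path
```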




&lt;h2&gt;
  
  
  Bonus Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compound Commands
&lt;/h3&gt;

&lt;p&gt;The intent classifier returns an array of intents. If multiple are detected, the executor chains them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Summarize this and save it to summary.txt"
→ [summarize_text, create_file]
→ Step 1: summarize → Step 2: save output to output/summary.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
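&lt;p&gt;The chaining itself is a small loop that pipes each handler's output into the next step. A sketch with stubbed handler bodies (names illustrative):&lt;/p&gt;

```python
# Sketch of compound-command execution: run intents in order,
# passing each step's output forward as prior_output.

def summarize_text(params, prior_output=None):
    return "a short summary"                  # stubbed LLM call

def create_file(params, prior_output=None):
    content = params.get("content") or prior_output or ""
    return f"saved {len(content)} chars to output/{params['filename']}"

HANDLERS = {"summarize_text": summarize_text, "create_file": create_file}

def run_chain(intents, params):
    output = None
    results = []
    for intent in intents:
        output = HANDLERS[intent](params, prior_output=output)
        results.append((intent, output))
    return results
```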



&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;A sidebar toggle (on by default) shows a confirmation prompt before any file write operation. Users must click "Confirm &amp;amp; Execute" before the agent touches the filesystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session Memory
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;src/memory.py&lt;/code&gt; stores the last 10 commands and their outcomes. This context is passed to the intent classifier on every request, enabling follow-up commands like &lt;em&gt;"now save that to a file"&lt;/em&gt;.&lt;/p&gt;
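&lt;p&gt;A rolling buffer like &lt;code&gt;collections.deque&lt;/code&gt; with &lt;code&gt;maxlen&lt;/code&gt; is a natural fit for this. A sketch (field names illustrative, not the actual &lt;code&gt;src/memory.py&lt;/code&gt;):&lt;/p&gt;

```python
from collections import deque

class SessionMemory:
    # Keeps only the most recent max_items commands; older ones are
    # evicted automatically by the deque.
    def __init__(self, max_items=10):
        self.items = deque(maxlen=max_items)

    def add(self, command, outcome):
        self.items.append({"command": command, "outcome": outcome})

    def as_context(self):
        # Rendered into the intent classifier's prompt on each request.
        return "\n".join(
            f"{i['command']} -> {i['outcome']}" for i in self.items
        )
```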

&lt;h3&gt;
  
  
  Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;Every stage is wrapped in try/except. STT failures, LLM API errors, and unrecognized intents all surface in the UI as actionable error messages instead of crashing the app.&lt;/p&gt;




&lt;h2&gt;
  
  
  The UI
&lt;/h2&gt;

&lt;p&gt;Built with Streamlit, styled with a custom dark industrial theme. The UI displays all four pipeline steps as distinct cards:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transcription&lt;/strong&gt; — what was heard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent&lt;/strong&gt; — colored badge showing detected intent(s) + parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt; — what the agent did&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt; — the output (code preview, summary, file confirmation)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A session history panel on the right tracks all commands in the current session.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Structured JSON from LLMs is fragile&lt;/strong&gt;&lt;br&gt;
LLMs sometimes wrap JSON in markdown fences or add preamble text. I wrote a robust &lt;code&gt;_parse_intent_json()&lt;/code&gt; function that strips fences, handles missing fields, and falls back to &lt;code&gt;general_chat&lt;/code&gt; on parse failure.&lt;/p&gt;
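&lt;p&gt;A tolerant parser along those lines might look like this. It's a sketch of the idea, not the project's exact &lt;code&gt;_parse_intent_json()&lt;/code&gt;:&lt;/p&gt;

```python
import json
import re

def parse_intent_json(raw):
    # Strip markdown fences, ignore any preamble prose, and fall
    # back to general_chat if nothing parses.
    text = re.sub(r"`{3}(?:json)?", "", raw)
    match = re.search(r"\{.*\}", text, re.DOTALL)
    fallback = {"intents": ["general_chat"],
                "parameters": {}, "confidence": 0.0}
    if not match:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    # Tolerate missing fields rather than crashing downstream.
    data.setdefault("intents", ["general_chat"])
    data.setdefault("parameters", {})
    return data
```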

&lt;p&gt;&lt;strong&gt;2. Compound command detection&lt;/strong&gt;&lt;br&gt;
Getting the LLM to reliably return multiple intents required careful prompt engineering. The system prompt had to explicitly define the &lt;code&gt;intents&lt;/code&gt; array format and give examples of compound commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Filename inference&lt;/strong&gt;&lt;br&gt;
When users don't specify a filename ("create a Python retry function"), the agent needs to infer a sensible one. I built &lt;code&gt;_infer_filename()&lt;/code&gt; that strips stop words and constructs something like &lt;code&gt;retry_function.py&lt;/code&gt; from the command description.&lt;/p&gt;
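&lt;p&gt;One simple way to do that inference (a sketch, not the project's exact &lt;code&gt;_infer_filename()&lt;/code&gt;; the stop-word list here is illustrative):&lt;/p&gt;

```python
import re

# Words that describe the request rather than the artifact.
STOP_WORDS = {"a", "an", "the", "create", "make", "write", "please",
              "python", "file", "with", "new", "called", "for", "me"}

def infer_filename(description, extension=".py"):
    # Keep the meaningful words, join with underscores, cap the length.
    words = re.findall(r"[a-z0-9]+", description.lower())
    keep = [w for w in words if w not in STOP_WORDS]
    stem = "_".join(keep[:4]) or "untitled"
    return stem + extension
```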

&lt;p&gt;&lt;strong&gt;4. Local Whisper performance&lt;/strong&gt;&lt;br&gt;
As noted above, local Whisper is impractically slow on CPU. The workaround (cloud API with local fallback) balances usability with the project's preference for local models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/akashkhare315/Agent-ai-voice
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# add your API keys&lt;/span&gt;
streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; (or &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;) for intent classification and code generation, plus &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; for Whisper STT.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this agent taught me that the hardest part isn't any single model — it's the glue between them. Reliable JSON parsing, graceful error handling, and thoughtful UX (like human-in-the-loop confirmation) matter just as much as model accuracy.&lt;/p&gt;

&lt;p&gt;The modular architecture means you can upgrade any layer independently: swap Whisper for a better local model when your hardware supports it, switch from Claude to a fully local Ollama model for offline use, or add new intents without touching the rest of the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/akashkhare315/Agent-ai-voice" rel="noopener noreferrer"&gt;https://github.com/akashkhare315/Agent-ai-voice&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>javascript</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
