I Built a Voice-Controlled Local AI Agent That Actually Works — Here's Everything I Learned

Srishti Mishra — Sun, 12 Apr 2026 15:36:08 +0000

What I Built

A voice-controlled AI agent that accepts audio input, classifies the user's intent, executes local tools, and displays the full pipeline in a Streamlit UI.

Try it: GitHub | Demo Video

*Architecture
*
Audio (mic / file / text)
↓
[1] STT — Groq Whisper API
↓
[2] Intent Detection — LLaMA 3.3 70B (JSON output)
↓
create_file | write_code | summarize | general_chat | compound
↓
[3] Tool Execution → output/ folder
↓
[4] Streamlit UI

Models Used

Layer	Model	Why
STT	Groq Whisper (whisper-large-v3)	Local Whisper took 12–18s per clip on CPU; Groq does it in under 1s
LLM	LLaMA 3.3 70B via Groq	Reliable structured JSON output; local 3B models were inconsistent

Hardware note: Local openai/whisper-base via HuggingFace works but is too slow for real-time use without a GPU. Groq's free tier is fast enough to feel instant.

** Intent Classification**

The LLM returns strict JSON:

{
"intent": "write_code",
"filename": "retry.py",
"language": "python"
}

For compound commands like "write a retry function and send a leave letter", it returns a tasks array and the agent runs each sub-task sequentially. A keyword-based fallback handles API failures.

Supported Intents

Intent	Example
`create_file`	"Make a notes.txt"
`write_code`	"Write a Python retry decorator"
`summarize`	"Summarize this and save it"
`general_chat`	"What is a linked list?"
`compound`	"Write a C++ sort and a leave letter"

** Key Challenges**

1. Follow-up save commands — "Save that as a text file" has no content. The agent looks backward through chat history to find the last substantive assistant response and writes that to disk.

2. Compound commands — Split on \band\b or commas, run intent detection on each fragment, execute independently.

3. Sandboxing — All file writes are restricted to output/ via _safe_filename() which strips path traversal sequences.

Bonus Features Implemented

✅ Compound commands — multiple actions in one input
✅ Human-in-the-loop — confirmation toggle before any file operation
✅ Graceful degradation — keyword fallback if LLM fails; errors shown in UI
✅ Session memory — last 10 messages passed as context on every call

Setup

git clone https://github.com/Srishti-1806/.git/OA_Submission.git
cd OA_Submission
pip install -r requirements.txt
cp .env # add GROQ_API_KEY
streamlit run app.py

Free Groq API key at console.groq.com.

**
**

True token

streaming in the UI

Structured outputs / function calling instead of regex-cleaned JSON
TTS for voice responses to close the loop

DEV Community: Srishti Mishra

I Built a Voice-Controlled Local AI Agent That Actually Works — Here's Everything I Learned