Building a Voice-Controlled Local AI Agent with Compound Commands, Memory & Human-in-the-Loop
Published as part of the Mem0 AI MLOps & AI Infra Internship Assignment
Introduction
What if you could speak to your computer and have it write code, create files, or summarize text — with full transparency into every step of the pipeline?
That's exactly what I built: a voice-controlled AI agent that transcribes audio, classifies intent (including compound multi-intent commands), executes the right tools, maintains persistent session memory, and requires human confirmation before touching your filesystem.
In this article I'll walk through the architecture, model choices, the three bonus features I implemented, and the real bugs I hit along the way.
Architecture Overview
The system is a linear pipeline with four stages:
Audio Input
│
▼
Speech-to-Text (Groq Whisper API — whisper-large-v3)
│
▼
Intent Classification (Groq llama-3.3-70b — structured JSON)
│
▼
Tool Execution (write_code | create_file | summarize | general_chat)
│
├── output/ ← all files created here only
└── session_history.json ← persistent memory
│
▼
Gradio UI (5-step pipeline display + session memory panel)
Each stage is an isolated Python module (stt.py, intent.py, tools.py, memory.py) wired together by a thin orchestrator in app.py. This makes it easy to swap any component independently.
Stage 1: Speech-to-Text — Why Groq Whisper
The assignment recommended running HuggingFace Whisper locally. I tried this first — Whisper large-v3 requires ~4 GB VRAM and runs at about 0.1× realtime on a CPU-only machine. That's a 10-second wait for a 1-second voice clip.
My solution: Groq Whisper API.
Groq runs the same whisper-large-v3 model on their LPU hardware, delivering roughly 200× realtime speed. The API is free, requires one pip install, and the same API key is reused for the LLM too — no extra signup.
from pathlib import Path
from groq import Groq

client = Groq(api_key=GROQ_API_KEY)
with open(audio_path, "rb") as f:
    response = client.audio.transcriptions.create(
        file=(Path(audio_path).name, f.read()),
        model="whisper-large-v3",
        response_format="text",
    )
transcription = response  # returns a plain string when response_format="text"
For anyone wanting a fully local setup, I kept a commented-out _transcribe_local() function using faster-whisper with the base model — much lighter at ~150 MB and fast enough on CPU.
Lesson: Don't let hardware constraints kill your UX. Document the workaround clearly and provide a fallback.
Stage 2: Intent Classification with Compound Command Support
This is the most interesting part. Given a sentence like "Summarize this text and save it to summary.txt", the system needs to identify two intents — not just one.
I used Groq llama-3.3-70b and designed the system prompt to return a JSON array of intent objects instead of a single object. This unlocks compound command support:
[
  {
    "intent": "summarize",
    "text_to_summarize": "this text",
    "save_result_to": "summary.txt"
  },
  {
    "intent": "create_file",
    "filename": "summary.txt",
    "content_hint": "save the summary result"
  }
]
The tool executor then runs each step in order, passing the output of one step as input to the next — so the summary text is automatically injected into the file creation step without any extra user input.
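In practice the model's reply needs a little normalization before it can be executed: it sometimes wraps the JSON in markdown fences, or returns a single object instead of an array. The helper below is a minimal sketch of that parsing step (the name `parse_intents` is hypothetical, not the function used in `intent.py`):

```python
import json

def parse_intents(raw_reply: str) -> list[dict]:
    """Normalize an LLM reply into an ordered list of intent dicts.

    Hypothetical helper: handles markdown fences around the JSON and
    accepts either a single intent object or an array of them.
    """
    text = raw_reply.strip()
    if text.startswith("```"):
        # Strip the surrounding fence, plus an optional "json" language tag
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    parsed = json.loads(text)
    return parsed if isinstance(parsed, list) else [parsed]
```

With this in place, the executor can always iterate over a list, whether the user issued one command or a compound one.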
Four supported intents:
- write_code — generate code and save to file
- create_file — create a file with generated content
- summarize — summarize provided text
- general_chat — conversational fallback
Why Groq LLM instead of a local model?
I initially tried Ollama with llama3 locally. It worked but was slow (~8 seconds per classification on CPU). Groq's hosted llama-3.3-70b responds in under 1 second and is free — a clear win for this use case.
Stage 3: Tool Execution with Safety Constraint
All tools that write files use a hard safety rule — every file is written inside the output/ directory only:
def _safe_path(filename: str) -> Path:
    safe_name = Path(filename).name  # strips "../" and any directory components
    return OUTPUT_DIR / safe_name
For compound commands, the executor passes results between steps:
def execute_all_tools(details: list, original_command: str):
    last_result_text = None
    for detail in details:
        intent = detail.get("intent")
        if intent == "create_file" and last_result_text:
            # Inject previous step's output (e.g. summary) into the file
            action, result, err = tool_create_file(detail, original_command,
                                                   injected_content=last_result_text)
        else:
            action, result, err = execute_tool(intent, detail, original_command)
        last_result_text = extract_plain_text(result)
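The `extract_plain_text` helper isn't shown above. A minimal sketch, assuming each tool returns either a plain string or a dict with a "text" key (an assumption, since the real return shape isn't shown here), might look like:

```python
def extract_plain_text(result):
    # Hypothetical helper: the real tool return shape may differ.
    if result is None:
        return None
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        # Assumed convention: structured results carry their text under "text"
        return result.get("text")
    return str(result)
```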
Bonus 1: Human-in-the-Loop Confirmation
Before any file operation, the pipeline pauses and shows confirmation buttons in the UI. The file is never created until the user explicitly clicks "Yes, proceed".
This required careful state management in Gradio — I stored the pending intent in a global _pending dict and only executed it on confirmation:
# Stage 1: classify, detect file op, pause
if intent in ("write_code", "create_file"):
    _pending = {"transcription": transcription, "details": details}
    return ..., gr.update(visible=True)  # show confirm buttons

# Stage 2: user clicks confirm
def confirm_execution():
    details = _pending["details"]
    action, result, err = execute_all_tools(details, ...)
Bonus 2: Persistent Session Memory
Every action is logged to output/session_history.json on disk — not just in memory. This means the history panel survives page refreshes and server restarts.
def add_entry(intent, command, action, status):
    history = _load()  # read from disk
    history.append({
        "time": datetime.now().strftime("%H:%M:%S"),
        "date": datetime.now().strftime("%Y-%m-%d"),
        "intent": intent,
        "command": command[:80],
        "action": action,
        "status": status,
    })
    _save(history)  # write back to disk
The UI loads this file on startup so the memory panel is always populated, even after a restart. Cancelled operations are also logged with a 🚫 status.
Bonus 3: Compound Commands
As described in Stage 2, a single voice input can now trigger multiple sequential actions. The intent classifier returns an ordered list of steps, and the tool executor runs them in sequence, piping output between steps automatically.
Example:
"Summarize: AI is transforming every industry — and save it to summary.txt"
Result:
- Summarizes the text
- Saves the summary to output/summary.txt
- Logs both actions as one entry in session memory
- Displays both step results in the UI
Challenges I Faced
1. Groq returns a string, not an object for STT
When using response_format="text", the API returns a plain string — not an object with a .text attribute. This caused an AttributeError on first run. The fix handles both shapes:
text = response if isinstance(response, str) else response.text
2. Gradio 6 breaking changes
Upgrading to Gradio 6 broke two things: show_copy_button was removed from gr.Textbox, and css moved from gr.Blocks() to demo.launch(). Always check the changelog when upgrading UI frameworks.
3. Gemini free tier quota exhausted immediately
I initially used Google Gemini for the LLM. The free tier quota was exhausted after just a few test calls, returning errors that reported a quota limit of 0. Switching to Groq solved this — their free tier is genuinely generous, with thousands of requests per day.
4. Confirmation buttons disappearing too fast
In Gradio 6, toggling visible on individual buttons inside a Row behaved inconsistently. The fix was to toggle the entire gr.Row container's visibility instead of individual buttons.
5. Session history lost on refresh
Storing history in a Python global variable means it resets on every page refresh. Moving to a JSON file on disk solved this completely.
Project Structure
voice-ai-agent/
├── app.py # Gradio UI + pipeline orchestration
├── src/
│ ├── stt.py # Speech-to-Text (Groq Whisper)
│ ├── intent.py # Intent classification + compound command support
│ ├── tools.py # Tool execution with result chaining
│ └── memory.py # Persistent session memory (JSON on disk)
├── output/ # All generated files + session_history.json
├── requirements.txt
├── .env.example
└── README.md
Tech Stack Summary
| Component | Tool | Cost |
|---|---|---|
| STT | Groq Whisper (whisper-large-v3) | Free |
| LLM | Groq llama-3.3-70b-versatile | Free |
| UI | Gradio 6 | Free / Open Source |
| Memory | JSON file on disk | Free |
| Language | Python 3.12 | Free |
Total API cost: $0 — one free Groq key handles everything.
What I Would Add Next
- Graceful degradation — better error messages for unintelligible audio or unmapped intents
- Model benchmarking — compare Groq vs local Ollama on speed and accuracy
- More compound patterns — e.g. "Write a retry function and also write tests for it"
- File editing — "Open retry.py and add logging to it"
Conclusion
Building this agent taught me that the hardest part of an AI pipeline isn't any single model — it's the glue between components. Making structured JSON output reliable, handling Gradio version changes, managing state for multi-step confirmation flows, and persisting memory across sessions are all engineering challenges that tutorials skip over.
The compound command feature was the most satisfying to build — it turns a simple classifier into something that genuinely understands user intent at a higher level.
Full source code: GitHub
Demo video: Loom Demo
Built with: Python · Gradio · Groq API (Whisper + llama-3.3-70b)